raboof commented on code in PR #48870:
URL: https://github.com/apache/arrow/pull/48870#discussion_r2695128361


##########
docs/source/format/Security.rst:
##########
@@ -0,0 +1,150 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. _format_security:
+
+***********************
+Security Considerations
+***********************
+
+How to read this
+================
+
+Hereafter we try list potential security concerns when dealing with the various
+Arrow specifications. Some of these concerns will apply directly to users of
+Arrow through existing implementations. Others should only be relevant for the
+implementors of Arrow libraries: by this, we mean libraries that provide APIs
+abstracting away from the details of the Arrow formats and protocols.
+
+Columnar Format
+===============
+
+The Arrow :ref:`columnar format <_format_columnar>` involves direct access to the
+process' address space. As such, in-memory Arrow data should not be accessed
+without care.
+
+Invalid data
+------------
+
+Reading and interpreting Arrow data involves reading into several buffers,
+sometimes in non-trivial ways. This may for instance involve data-dependent
+indirect addressing: to read a value from a Binary array, you need to
+1) read its offsets in buffer #2, and 2) read the range of bytes delimited by
+these offsets in buffer #3. If the offsets are invalid (deliberately or not),
+then step 2) can access invalid memory (potentially crashing the process) or
+memory unrelated to Arrow (potentially allowing an attacker to exfiltrate
+confidential data).
+
+.. TODO:
+   For each layout, we should list the associated security risks and the recommended
+   steps to validate (perhaps in Columnar.rst)
+
+Advice for users
+''''''''''''''''
+
+If you receive Arrow in-memory data from an untrusted source, it is
+**extremely recommended** that you first validate the data for structural
+soundness before reading it. Many Arrow implementations provide APIs to do
+such validation.
+
+.. TODO: link to some validation APIs for the main implementations here?
+
+Advice for implementors
+'''''''''''''''''''''''
+
+It is **recommended** that you provide APIs to validate Arrow data, so that users
+can assert whether data coming from untrusted sources can be safely accessed.
+
+Uninitialized data
+------------------
+
+A less obvious pitfall is when some parts of an Arrow array are left uninitialized.
+For example, if a element of a primitive Arrow array is marked null through its
+validity bitmap, the corresponding value in the values buffer can be ignored for all
+purposes. It is therefore tempting, when creating an array with null values, to
+not initialize the corresponding value slots.
+
+However, this then introduces a serious security if the Arrow data is serialized

Review Comment:
   ```suggestion
   However, this then introduces a serious security risk if the Arrow data is serialized
   ```



##########
docs/source/format/Security.rst:
##########
@@ -0,0 +1,150 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. _format_security:
+
+***********************
+Security Considerations
+***********************
+
+How to read this
+================
+
+Hereafter we try list potential security concerns when dealing with the various
+Arrow specifications. Some of these concerns will apply directly to users of
+Arrow through existing implementations. Others should only be relevant for the
+implementors of Arrow libraries: by this, we mean libraries that provide APIs
+abstracting away from the details of the Arrow formats and protocols.
+
+Columnar Format
+===============
+
+The Arrow :ref:`columnar format <_format_columnar>` involves direct access to the
+process' address space. As such, in-memory Arrow data should not be accessed
+without care.
+
+Invalid data
+------------
+
+Reading and interpreting Arrow data involves reading into several buffers,
+sometimes in non-trivial ways. This may for instance involve data-dependent
+indirect addressing: to read a value from a Binary array, you need to
+1) read its offsets in buffer #2, and 2) read the range of bytes delimited by
+these offsets in buffer #3. If the offsets are invalid (deliberately or not),
+then step 2) can access invalid memory (potentially crashing the process) or
+memory unrelated to Arrow (potentially allowing an attacker to exfiltrate
+confidential data).
+
+.. TODO:
+   For each layout, we should list the associated security risks and the recommended
+   steps to validate (perhaps in Columnar.rst)
+
+Advice for users
+''''''''''''''''
+
+If you receive Arrow in-memory data from an untrusted source, it is
+**extremely recommended** that you first validate the data for structural
+soundness before reading it. Many Arrow implementations provide APIs to do
+such validation.
+
+.. TODO: link to some validation APIs for the main implementations here?
+
+Advice for implementors
+'''''''''''''''''''''''
+
+It is **recommended** that you provide APIs to validate Arrow data, so that users
+can assert whether data coming from untrusted sources can be safely accessed.
+
+Uninitialized data
+------------------
+
+A less obvious pitfall is when some parts of an Arrow array are left uninitialized.
+For example, if a element of a primitive Arrow array is marked null through its

Review Comment:
   ```suggestion
   For example, if an element of a primitive Arrow array is marked null through its
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.