[ 
https://issues.apache.org/jira/browse/ARROW-1589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16175753#comment-16175753
 ] 

Wes McKinney commented on ARROW-1589:
-------------------------------------

Could you clarify what kinds of malformed input you are talking about? I am not 
sure it is a requirement for the stream reader to be able to consistently 
return errors on random bytes input. 

In Arrow we need to distinguish between "can't fail" and "can fail" errors. The 
"can't fail" errors you detect in debug builds with DCHECK assertions. These 
are the kinds of errors that can only occur if the library developer (for 
example, an Arrow Java developer or an Arrow C++ developer) has implemented 
something incorrectly. Unit tests or integration tests must be written to 
exercise relevant code paths to catch these issues. I have found the debug 
assertions are especially helpful when refactoring, and they cost nothing in 
release builds.
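
A minimal sketch of that distinction, assuming C++. SketchStatus, 
SKETCH_DCHECK, GetColumn and TotalBufferSize are hypothetical names made up 
for illustration; they are not Arrow's actual Status class or DCHECK macro:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a status type (assumption, not Arrow's class).
struct SketchStatus {
  bool ok;
  std::string message;
  static SketchStatus OK() { return {true, ""}; }
  static SketchStatus Invalid(std::string msg) { return {false, std::move(msg)}; }
};

// Hypothetical debug-only assertion in the spirit of DCHECK:
// active in debug builds, compiled out when NDEBUG is defined.
#ifdef NDEBUG
#define SKETCH_DCHECK(cond) ((void)0)
#else
#define SKETCH_DCHECK(cond) assert(cond)
#endif

// "Can fail": the index comes from the caller, so a bad value is reported
// as an error in every build type.
SketchStatus GetColumn(const std::vector<int64_t>& columns, int64_t index,
                       int64_t* out) {
  if (index < 0 || static_cast<std::size_t>(index) >= columns.size()) {
    return SketchStatus::Invalid("column index out of range");
  }
  *out = columns[static_cast<std::size_t>(index)];
  return SketchStatus::OK();
}

// "Can't fail": the sizes are produced by library code, so a negative size
// can only mean a developer bug. It is asserted in debug builds only.
int64_t TotalBufferSize(const std::vector<int64_t>& sizes) {
  int64_t total = 0;
  for (int64_t size : sizes) {
    SKETCH_DCHECK(size >= 0);
    total += size;
  }
  return total;
}
```

The caller-facing function pays for its check in release builds; the internal 
one relies on tests that run against debug builds to trip the assertion.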

In the case of reading record batches from a stream, i.e. according to the 
encapsulated message format described in http://arrow.apache.org/docs/ipc.html, 
if you are able to read the indicated number of metadata bytes from the stream, 
then it is assumed to be a valid Flatbuffer, and the sender is assumed to have 
respected invariants that are detectable in an integration test -- we may do 
some sanity checks of invariants such as the number of buffers in a record 
batch. The same goes for the message body.
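
As a rough sketch of the "able to read the indicated number of metadata bytes" 
step, assuming the metadata is preceded by a 4-byte little-endian int32 length. 
FrameResult, ReadInt32LE and ReadMetadataFrame are hypothetical names, not 
part of the Arrow C++ API, and the Flatbuffer contents are deliberately not 
inspected:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical result type for this sketch (assumption, not an Arrow class).
struct FrameResult {
  bool ok;
  std::string error;
  const uint8_t* metadata;  // points into the input buffer when ok is true
  int32_t metadata_length;
};

// Decode a little-endian int32 from four bytes.
inline int32_t ReadInt32LE(const uint8_t* p) {
  uint32_t v = static_cast<uint32_t>(p[0]) |
               (static_cast<uint32_t>(p[1]) << 8) |
               (static_cast<uint32_t>(p[2]) << 16) |
               (static_cast<uint32_t>(p[3]) << 24);
  return static_cast<int32_t>(v);
}

// Check only the framing: is the length prefix present, non-negative, and
// covered by the bytes actually available?
FrameResult ReadMetadataFrame(const uint8_t* data, std::size_t size) {
  // Caller contract ("can't fail"): a null pointer implies an empty buffer.
  assert(data != nullptr || size == 0);

  if (size < 4) return {false, "missing length prefix", nullptr, 0};
  const int32_t length = ReadInt32LE(data);
  if (length < 0) return {false, "negative metadata length", nullptr, 0};
  if (static_cast<std::size_t>(length) > size - 4) {
    return {false, "truncated metadata", nullptr, 0};
  }
  // From here on the bytes are assumed to be a valid Flatbuffer.
  return {true, "", data + 4, length};
}
```

Anything past the framing check is trusted in this sketch, which mirrors the 
assumption described above rather than defending against arbitrary bytes.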

If a Flatbuffer is truly malformed in a way that cannot be detected with debug 
assertions, I am unsure whether we can protect ourselves from 
segfaults. The sender of a record batch stream must be assumed to be trusted 
(i.e. you have adequate integration tests against it to catch "can't fail" 
exceptions) to proceed with reading a stream at all.

> Fuzzing for certain input formats
> ---------------------------------
>
>                 Key: ARROW-1589
>                 URL: https://issues.apache.org/jira/browse/ARROW-1589
>             Project: Apache Arrow
>          Issue Type: Test
>            Reporter: Marco Neumann
>            Assignee: Marco Neumann
>
> The arrow lib should have fuzzing tests for certain input formats, e.g. for 
> reading record batches from streams. Ideally, malformed input must not crash 
> the system but must report a proper error. This could easily be implemented 
> e.g. w/ [libfuzzer|https://llvm.org/docs/LibFuzzer.html] in combination with 
> address sanitizer (that's already implemented by Arrow's build system).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
