[ 
https://issues.apache.org/jira/browse/PARQUET-172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14355894#comment-14355894
 ] 

Ryan Blue commented on PARQUET-172:
-----------------------------------

This was based on a [reported parquet-scrooge 
bug|https://github.com/laurencer/parquet-mr-bug/commit/d09126e03e2dc9f60eb5fd7b13b8166ab4d52ba0]
 and an incomplete analysis on my part. I looked only at the Schema converter, 
which doesn't show support for binary but doesn't need it. Support for binary 
already exists; 
[{{ParquetWriteProtocol}}|https://github.com/apache/incubator-parquet-mr/blob/master/parquet-thrift/src/main/java/parquet/thrift/ParquetWriteProtocol.java#L298]
 allows binary data to be written as either a String or binary, and the 
[{{ThriftRecordConverter}}|https://github.com/apache/incubator-parquet-mr/blob/master/parquet-thrift/src/main/java/parquet/thrift/ThriftRecordConverter.java#L413]
 generates correct protocol events that handle both.

It turns out that the original bug report wasn't correct, although the summary 
it prints does appear to show corrupt data:
{code}
Parquet Example
---------------

    ./sbt "run-main com.rouesnel.parquetmr.bug.ParquetExample"

This should print the following:

    Parquet
    =======

    Parquet File written to /some/random/location/test-324324-foo.parquet

    After encoding - binary field is equal to original binary field: false
    After encoding - binary field is equal to UTF8 encoded binary field: false


    Original
    -----
    binaryField:         -123, 20, 33
    stringField:         foo
    binaryAsStringField: -17, -65, -67, 20, 33



    Thrift Serialized
    -----
    binaryField:         3, 0, 0, 0, -123, 20, 33
    stringField:         foo
    binaryAsStringField: -17, -65, -67, 20, 33
{code}
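
The binaryAsStringField bytes in this output (-17, -65, -67, 20, 33) match what 
a UTF-8 decode and re-encode of the original bytes produces: -123 (0x85) is not 
a valid UTF-8 sequence on its own, so Java substitutes U+FFFD, which encodes as 
-17, -65, -67. The following is only a minimal sketch of that round trip; the 
class and method names are hypothetical and it is not the code from the linked 
project.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration class, not part of parquet-mr or the bug project.
class Utf8RoundTrip {
    // Decode arbitrary bytes as UTF-8, then encode the resulting String again.
    // Malformed input is replaced with U+FFFD by the String constructor.
    static byte[] roundTrip(byte[] raw) {
        String decoded = new String(raw, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] original = {-123, 20, 33};
        // prints [-17, -65, -67, 20, 33]
        System.out.println(Arrays.toString(roundTrip(original)));
    }
}
```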

Tests show that the underlying byte buffers are correct; the error comes from 
assuming that the ByteBuffer's backing array contains only the serialized 
bytes. When the read is restricted to the buffer's position and limit, the 
data is correct.
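
The difference between the backing array and the position/limit window can be 
sketched with plain java.nio. This is an illustration only: the class and 
helper names are hypothetical, and the array contents are copied from the 
"Thrift Serialized" output above rather than produced by Thrift or Parquet.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical illustration class, not part of parquet-mr or the bug project.
class ByteBufferWindow {
    // Copy only the bytes between position and limit. duplicate() gives an
    // independent position, so the caller's buffer is left untouched.
    static byte[] readableBytes(ByteBuffer buffer) {
        byte[] out = new byte[buffer.remaining()];
        buffer.duplicate().get(out);
        return out;
    }

    public static void main(String[] args) {
        // Assumed layout: four leading bytes, then the three payload bytes.
        byte[] backing = {3, 0, 0, 0, -123, 20, 33};
        ByteBuffer buffer = ByteBuffer.wrap(backing, 4, 3); // position=4, limit=7

        // array() exposes the whole backing array, including the extra bytes.
        System.out.println(Arrays.toString(buffer.array()));
        // prints [3, 0, 0, 0, -123, 20, 33]

        // Respecting position and limit yields only the payload.
        System.out.println(Arrays.toString(readableBytes(buffer)));
        // prints [-123, 20, 33]
    }
}
```

Comparing `buffer.array()` against the original bytes reports a mismatch even 
though the buffer's readable window holds exactly the original data.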

> Add support for non-String binary in parquet-thrift
> ---------------------------------------------------
>
>                 Key: PARQUET-172
>                 URL: https://issues.apache.org/jira/browse/PARQUET-172
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-mr
>    Affects Versions: 1.5.0
>            Reporter: Ryan Blue
>
> Thrift [considers binary a "special" 
> type|https://thrift.apache.org/docs/types] that isn't in the official spec 
> but is "to provide better interoperability with java". The parquet-thrift 
> side doesn't currently support binary because Thrift String fields are 
> converted to UTF8-annotated binary. The result is that binary fields get 
> mangled when stored in Parquet because Parquet assumes they are UTF8.
> I think some storage layer in Java Thrift must know about binary and pass the 
> unencoded bytes, but that Parquet hasn't implemented a similar hack. (The 
> [type 
> conversion|https://github.com/apache/incubator-parquet-mr/blob/master/parquet-thrift/src/main/java/parquet/thrift/ThriftSchemaConverter.java#L86]
>  code at least has no entry for binary.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
