kyle created ORC-1116:
-------------------------

             Summary: Csv-import tool exported field become empty
                 Key: ORC-1116
                 URL: https://issues.apache.org/jira/browse/ORC-1116
             Project: ORC
          Issue Type: Bug
          Components: tools
    Affects Versions: 1.7.3
            Reporter: kyle
         Attachments: CSVFileImport.dif

When exporting an ORC file with a schema like "struct<a:string,b:binary>", if 
the data in column "b" is very long (over 4MB), the process may crash with a 
segmentation fault, or the exported data in column "a" may become an empty 
string.

Here is my attempt to explain the cause from the code. It may not be entirely 
correct, so please bear with me.

Following the code in CSVFileImport.cc, when writing to an ORC file, all 
string-type columns share one data buffer inside the function 
fillStringValues(). When a single value is longer than the buffer, the buffer 
is resized. The resize() operation invalidates all references and iterators 
into buffer.data().

In this case, after field "a" has finished writing its data into the buffer, 
writing field "b" resizes the buffer and invalidates the previous 
buffer.data(), so field "a"'s stringBatch, which still points into the old 
buffer.data(), is no longer valid.
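
To illustrate the invalidation described above, here is a minimal sketch. It 
does not use the ORC API: std::vector<char> stands in for the shared data 
buffer, and the function name sharedBufferMovesOnGrowth is hypothetical. It 
only shows that the storage address saved for field "a" no longer matches the 
buffer once field "b" forces it to grow.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Returns true if growing the shared buffer moved its storage, which means
// any pointer saved before the growth is stale.
// std::vector<char> is an assumed stand-in for the tool's data buffer.
bool sharedBufferMovesOnGrowth(std::size_t initialSize,
                               std::size_t bigFieldSize) {
  std::vector<char> buffer(initialSize);

  // Field "a": copy a short value and remember where it was stored.
  const std::string a = "hello";
  assert(initialSize >= a.size());
  std::memcpy(buffer.data(), a.data(), a.size());
  const auto savedAddress = reinterpret_cast<std::uintptr_t>(buffer.data());

  // Field "b": needs more room than the buffer has, so the buffer grows.
  buffer.resize(a.size() + bigFieldSize);

  // Compare addresses as integers; the old pointer is never dereferenced.
  const auto currentAddress = reinterpret_cast<std::uintptr_t>(buffer.data());
  return savedAddress != currentAddress;
}
```

Any batch entry that kept the old address would now read freed memory, which 
would be consistent with both the empty string and the segmentation fault.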

A workaround could be to use a separate data buffer for each string-type 
column, although that requires allocating 4MB of memory for each (as in the 
attached file). Alternatively, all previous stringBatch entries could be 
re-pointed to the new data buffer's address.
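
One possible way to do the re-pointing is to record offsets rather than raw 
pointers while writing, and to turn them into addresses only after all fields 
of the row are in the buffer. The sketch below is an assumption-laden 
illustration, not the attached patch and not the ORC API: std::vector<char> 
stands in for the data buffer, and PendingValue, appendValue and readValue 
are hypothetical names.

```cpp
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical record of where a value lives inside the shared buffer.
struct PendingValue {
  std::size_t offset;  // start position within the buffer
  std::size_t length;  // number of bytes
};

// Appends value to the shared buffer, growing it when needed, and returns
// an offset-based record that stays valid even if the buffer is reallocated.
PendingValue appendValue(std::vector<char>& buffer, std::size_t& used,
                         const std::string& value) {
  if (used + value.size() > buffer.size()) {
    buffer.resize(used + value.size());
  }
  if (!value.empty()) {
    std::memcpy(buffer.data() + used, value.data(), value.size());
  }
  const PendingValue pending{used, value.size()};
  used += value.size();
  return pending;
}

// Resolves a record against the current buffer storage. Call this only
// after every field has been appended.
std::string readValue(const std::vector<char>& buffer,
                      const PendingValue& pending) {
  return std::string(buffer.data() + pending.offset, pending.length);
}
```

With this approach a single shared buffer is enough, so the extra 4MB per 
column is avoided, at the cost of a second pass that converts offsets to 
pointers.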



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
