Edward Seidl created ARROW-13965:
------------------------------------

             Summary: [C++] dynamic_casts in parquet TypedColumnWriterImpl 
impacting performance
                 Key: ARROW-13965
                 URL: https://issues.apache.org/jira/browse/ARROW-13965
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++
         Environment: arrow 6.0.0-SNAPSHOT on both RHEL8 (gcc 8.4.1) and macOS 
11.5.2 (clang 11.0.0)
            Reporter: Edward Seidl
         Attachments: arrow_downcast.patch

The methods WriteDictionaryPage(), CheckDictionarySizeLimit(), WriteValues(), 
and WriteValuesSpaced() in TypedColumnWriterImpl 
(cpp/src/parquet/column_writer.cc) perform dynamic_casts of the current_encoder_ 
object to either DictEncoder or ValueEncoderType pointers.  When calling 
WriteBatch() with a large number of values the cost is negligible, but when 
writing batches of 1 (as when using the stream API), these dynamic_casts can 
consume a great deal of CPU.  Using gperftools to profile code I wrote to do a 
log-structured merge of several Parquet files, I measured the dynamic_casts 
taking as much as 25% of execution time.

By modifying TypedColumnWriterImpl to save downcasted observer pointers of the 
appropriate types, I was able to cut my execution time from 32 to 24 seconds, 
validating the gperftools results.  I've attached a patch to show what I did.
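
For illustration only, the general idea of caching downcasted observer 
pointers can be sketched as below.  This is not the attached patch: Encoder, 
PlainEncoder, DictEncoder, Writer, and the cached_* members are hypothetical 
stand-ins rather than Arrow's actual classes.  The sketch assumes the encoder 
is replaced rarely, so the dynamic_cast runs once per replacement instead of 
once per written value.

```cpp
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-ins for an encoder hierarchy; not Arrow's real classes.
struct Encoder {
  virtual ~Encoder() = default;
};

struct PlainEncoder : Encoder {
  std::vector<int64_t> values;
  void Put(int64_t v) { values.push_back(v); }
};

struct DictEncoder : Encoder {
  std::vector<int64_t> indices;
  void Put(int64_t v) { indices.push_back(v); }
};

class Writer {
 public:
  explicit Writer(std::unique_ptr<Encoder> encoder) {
    SetEncoder(std::move(encoder));
  }

  // Replace the owned encoder and refresh the cached observer pointers.
  // The dynamic_casts happen here, once per encoder change.
  void SetEncoder(std::unique_ptr<Encoder> encoder) {
    current_encoder_ = std::move(encoder);
    cached_dict_ = dynamic_cast<DictEncoder*>(current_encoder_.get());
    cached_plain_ = dynamic_cast<PlainEncoder*>(current_encoder_.get());
  }

  // Hot path: only a null check, no dynamic_cast per value.
  void WriteValue(int64_t v) {
    if (cached_dict_ != nullptr) {
      cached_dict_->Put(v);
    } else if (cached_plain_ != nullptr) {
      cached_plain_->Put(v);
    }
  }

  const DictEncoder* dict_encoder() const { return cached_dict_; }
  const PlainEncoder* plain_encoder() const { return cached_plain_; }

 private:
  std::unique_ptr<Encoder> current_encoder_;  // owning pointer
  DictEncoder* cached_dict_ = nullptr;        // non-owning observer
  PlainEncoder* cached_plain_ = nullptr;      // non-owning observer
};
```

The cached pointers must be refreshed whenever the owning pointer changes 
(for example on a fallback from dictionary to plain encoding); otherwise they 
would dangle.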



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
