Thanks for the summary of QA for 1.7.0, Dongjoon =)

Wanted to follow up on a few things I've come across while testing several
different versions of ORC 1.7.0-SNAPSHOT with Apache Iceberg, in case others
have insight into the issues I've faced:

1) I wanted to mention that the NPE in the Iceberg tests for 1.7.0-SNAPSHOT
did not occur with one of the earlier builds of the snapshot. I am still
working to determine which build it was working with, but here's a link [1]
to the branch / PR where I am experiencing the NPE in a Flink ORC read/write
test suite when using a large mixed-type / nested schema, in case others want
to see the failed test suites (we use GitHub Actions, so it should be
familiar).

The NPE occurs when calling `close` and attempting to write the file
metadata. Specifically, it happens in
org.apache.orc.impl.PhysicalFsWriter#writeFileMetadata, when calling getPos
on the private field `rawWriter` of type FSDataOutputStream, so it's possible
there is a configuration issue when setting the filesystem for the writer
(e.g. in the Iceberg code). However, several other tests pass, about five in
total, all of which run this exact same test code with different schemas, so
I'm inclined to think there's some issue with where we read / write to the
output stream that only surfaces with a large number of serialized columns.
I am still investigating and will report as soon as I know more, but wanted
to bring it to everyone's attention =)
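
For reference, here is a minimal sketch of roughly how a writer like ours
gets created and closed (the path and schema here are illustrative, not the
actual Iceberg wiring); in our tests, the NPE surfaces on the final `close`:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.orc.OrcFile;
    import org.apache.orc.TypeDescription;
    import org.apache.orc.Writer;

    public class WriterCloseSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("/tmp/test.orc"); // illustrative path
        TypeDescription schema =
            TypeDescription.fromString("struct<x:int,y:string>");
        Writer writer = OrcFile.createWriter(path,
            OrcFile.writerOptions(conf)
                // Passing the filesystem explicitly; when omitted, ORC
                // derives it from the path and the Configuration.
                .fileSystem(path.getFileSystem(conf))
                .setSchema(schema));
        // close() writes the file metadata via
        // PhysicalFsWriter#writeFileMetadata, which calls rawWriter.getPos().
        writer.close();
      }
    }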

2) For the same test, which reads and writes 1000 records to a file using
Flink (though mostly via the Iceberg API) with a large / nested schema, we
have to change the writer's configuration to use the old default value of
`orc.dictionary.implementation`, i.e. `rbtree`, instead of the new default
value of `hash`. When using `hash`, there is an OOM coming from the
relatively recently added `StringHashTableDictionary`. I noticed there is a
comment about not needing large buckets, as very many collisions would mean
that the "table is too small or the function isn't good" [2]. This test is
specifically for exercising large and nested schemas (there are over 20
columns, with map and list types), but this data isn't necessarily far
outside the realm of real-world datasets.
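
For anyone hitting the same OOM, the workaround we use in the test is just
the configuration change described above:

    import org.apache.hadoop.conf.Configuration;

    public class DictionaryWorkaround {
      public static Configuration orcConf() {
        Configuration conf = new Configuration();
        // Revert to the pre-1.7.0 default dictionary implementation;
        // `hash` is the new default and triggers the OOM in
        // StringHashTableDictionary for this schema in our tests.
        conf.set("orc.dictionary.implementation", "rbtree");
        return conf;
      }
    }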

Are there any existing performance tests using the new hash-based dictionary
implementation, or any guidance we have for users on configuring their files
(stripes, etc.) to make effective use of the new hash implementation? I'm
wondering if users will have a difficult time transitioning due to having "a
function that isn't good or a table that's too small".
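
To make the concern concrete, here is a back-of-envelope sketch (the bucket
and key counts are made-up numbers, not what `StringHashTableDictionary`
actually uses):

    public class LoadFactorSketch {
      public static void main(String[] args) {
        int buckets = 4096;             // hypothetical fixed table size
        long distinctKeys = 1_000_000L; // e.g. a high-cardinality string column
        // With N distinct keys over B buckets, the average chain length is
        // N / B, so an undersized table means long per-bucket scans and
        // growing per-bucket backing arrays.
        double avgChain = (double) distinctKeys / buckets;
        System.out.printf("average entries per bucket: %.1f%n", avgChain);
      }
    }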

[1] GitHub PR with currently failing Iceberg Flink tests for ORC
1.7.0-SNAPSHOT: https://github.com/kbendick/iceberg/pull/42
[2] ORC StringHashTableDictionary class, with hash bucket instantiation and
comments about bucket size w.r.t. collisions:
https://github.com/apache/orc/blob/main/java/core/src/java/org/apache/orc/impl/StringHashTableDictionary.java#L81-L89

Thanks for summarizing the ongoing work, and please let me know if I can be
of any help outside of the current investigation into the Iceberg issues. I'd
love to possibly help set up JMH benchmark testing, as I know we have had
several PRs recently to optimize some code paths. It would be great to have a
way to measure their impact somewhat, while recognizing that JMH isn't
exactly perfect.
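
Roughly the shape I have in mind, in case it helps the discussion (class and
method names are placeholders; a real benchmark would call whichever
optimized path we want to measure, with representative data):

    import java.util.concurrent.TimeUnit;
    import org.openjdk.jmh.annotations.Benchmark;
    import org.openjdk.jmh.annotations.BenchmarkMode;
    import org.openjdk.jmh.annotations.Mode;
    import org.openjdk.jmh.annotations.OutputTimeUnit;
    import org.openjdk.jmh.annotations.Scope;
    import org.openjdk.jmh.annotations.State;

    @State(Scope.Benchmark)
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.MICROSECONDS)
    public class OptimizedPathBench {
      // Placeholder input; a real run would build data that resembles the
      // dictionary-heavy string columns from the tests above.
      private final String value = "some-representative-value";

      @Benchmark
      public int measureOptimizedPath() {
        // Placeholder body: invoke the code path under test and return a
        // result so JMH doesn't dead-code-eliminate the work.
        return value.hashCode();
      }
    }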

Best,
Kyle Bendickson


On 2021/07/20 20:33:14, Dongjoon Hyun <d...@gmail.com> wrote:
> Thank you all for your participation.
>
> Here is a brief summary of the recent ORC community activity for Apache ORC
> 1.7.0 QA.
>
> 1. After branching branch-1.7, I shared Apache Iceberg integration test
> failure.
> 2. Kyle Bendickson reported the same issue. In addition, when he reverted
> to rbtree, he met NPE in some cases. He is investigating it.
> 3. David Mollitor has been working on several nice code refactoring and
> perf improvement. Currently, his patches are landed on the `main` branch
> for 1.8.0, but I'm considering backporting them to branch-1.7 if needed.
> 4. William (Jeongseok) Hyun also found an orc-tools CNFE bug and fixed it.
> 5. Yukihiro Okada made a patch to fix his `head` command contribution and
> Panos/William reviewed and merged.
>
> In addition to the above, Apache ORC community is waiting for a new hive
> storage api release.
>
> Yesterday, Panos started the Hive Storage API 2.8.0-rc0 vote and Pavan
> Lanka, Owen and I voted.
> -
> https://lists.apache.org/thread.html/r28201af1d35c67c72d3e846c26c77c38d2a51437f0b1bd31ddcf12b8%40%3Cdev.hive.apache.org%3E
>
> Apache ORC 1.7.0 is coming! Please test it in your environments and give us
> your feedback.
>
> Best,
> Dongjoon.


