@Panos, Here is the link to the specific failure for NPE. https://github.com/kbendick/iceberg/runs/3156762084#step:6:7499
I will also try to take a look at this today to see what is causing this. Regards, Pavan > On Jul 26, 2021, at 8:30 AM, Panos Garefalakis <panga...@gmail.com> wrote: > > Thanks both for keeping us posted! > Seems like we are going to need another Storage-api release to fix the > dependency issue that was discovered recerntly. > @Kyle Regarding the HT dictionary regression -- seems to be quite critical. > Can you please point to the actual failing test? I could only find build > failures in the logs.. > Regarding the JMH benchmarks, I am a happy user myself -- > *ORCWriterBenchMark* was created for this very reason. > Maybe it would make sense to extend that for nested schemas as described > above? > > Cheers, > Panagiotis > > > On Mon, Jul 26, 2021 at 9:21 AM Kyle Bendickson > <kbendick...@apple.com.invalid> wrote: > >> Thanks for the summary of QA for 1.7.0 Dongjoon =) >> >> Wanted to follow up on a few things I’ve come across testing several >> different versions of ORC 1.7.0-SNAPSHOT with Apache Iceberg, in case >> others have insight into the issues I’ve faced: >> >> 1) I wanted to mention that the NPE in the Iceberg tests for >> 1.7.0-SNAPSHOT did not occur for one of the earlier versions of the >> snapshot 1.7.0-SNAPSHOT. I am still working to determine which version it >> was working with, but here’s a link < >> https://github.com/kbendick/iceberg/pull/42>[1] to the branch / PR where >> I am experiencing the NPE in a Flink ORC read/write test suite when using a >> large mixed type / nested schema if others want to see the failed test >> suites (we use GitHub Actions so it should be familiar). >> >> The NPE occurs when calling `close` and attempting to write the file >> metadata. This occurs when calling >> org.apache.orc.impl.PhysicalFsWriter#writeFileMetadata, specifically when >> calling getPos on the private field `rawWriter` of type FSDataOututStream, >> so it’s possible that there is some potential configuration issues when >> setting the filesystem for the writer (e.g. in the Iceberg code). However >> several other tests pass, about 5 in total all of which run this exact same >> test code using different schemas, so I’m inclined to think there’s some >> sort of issue with where we read / write to the output stream that is >> surfacing with a large number of serialized columns or something. I am >> still investigating and will report as soon as I know more but wanted to >> bring it to everyone’s attention =) >> >> 2) For the same test which reads and writes 1000 records to a file using >> Flink (though mostly via Iceberg API) with a large / nested schema, ,we >> have to change the writer’s configuration to use the old default value of >> `orc.dictionary.implementation`, i.e. `rbtree` instead of the new default >> value of `hash`. When using `hash`, there is an OOM coming from the >> relatively recently added `StringHashTableDictionary`. I noticed there is a >> comment about not needing large buckets, as very many collisions would mean >> that the "table is too small or the function isn’t good”. < >> https://github.com/apache/orc/blob/main/java/core/src/java/org/apache/orc/impl/StringHashTableDictionary.java#L84-L87>[2] >> This test is specifically for testing large and nested schemas (there’s >> over 20 columns with map and list types). But this data isn’t necessarily >> so far outside of the realm of realworld datasets. >> >> Are there any existing performance tests using the new hash based >> dictionary implementation or any guidance we have for users for configuring >> their files (strips, etc) to make effective use of the new hash >> implementation? I’m wondering if users will have a difficult time >> transitioning due to having “a function that isn’t good or a table that’s >> too small" >> >> [1] Github PR with currently failing Iceberg Flink tests for ORC >> 1.7.0-SNAPSHOT: https://github.com/kbendick/iceberg/pull/42 < >> https://github.com/kbendick/iceberg/pull/42> >> [2] ORC StringHashTableDictionary class, with hash bucket instantiation >> with comments about bucket size w.r.t collisions: >> https://github.com/apache/orc/blob/main/java/core/src/java/org/apache/orc/impl/StringHashTableDictionary.java#L81-L89 >> < >> https://github.com/apache/orc/blob/main/java/core/src/java/org/apache/orc/impl/StringHashTableDictionary.java#L81-L89 >>> >> >> Thanks for summarizing the on going work and please let me know if I can >> be of any help outside of the current investigation into the Iceberg >> issues. I’d love to possibly help set up JMH benchmark testing (which is >> admittedly not perfect), as I know we have had several PRs recently to >> optimize some code paths. Would be great to have a way to measure the >> impact somewhat (while recognizing that JMH isn’t exactly perfect). >> >> Best, >> Kyle Bendickson >> >> >> On 2021/07/20 20:33:14, Dongjoon Hyun <d...@gmail.com> wrote: >>> Thank you all for your participation.> >>> >>> Here is a brief summary of the recent ORC community activity for Apache >> ORC> >>> 1.7.0 QA.> >>> >>> 1. After branching branch-1.7, I shared Apache Iceberg integration test> >>> failure.> >>> 2. Kyle Bendickson reported the same issue. In addition, when he >> reverted> >>> to rbtree, he met NPE in some cases. He is investigating it.> >>> 3. David Mollitor has been working on several nice code refactoring and> >>> perf improvement. Currently, his patches are landed on the `main` >> branch> >>> for 1.8.0, but I'm considering backporting them to branch-1.7 if >> needed.> >>> 4. William (Jeongseok) Hyun also found an orc-tools CNFE bug and fixed >> it.> >>> 5. Yukihiro Okada made a patch to fix his `head` command contribution >> and> >>> Panos/William reviewed and merged.> >>> >>> In addition to the above, Apache ORC community is waiting for a new >> hive> >>> storage api release.> >>> >>> Yesterday, Panos started the Hive Storage API 2.8.0-rc0 vote and Pavan> >>> Lanka, Owen and I voted.> >>> -> >>> >> https://lists.apache.org/thread.html/r28201af1d35c67c72d3e846c26c77c38d2a51437f0b1bd31ddcf12b8%40%3Cdev.hive.apache.org%3E> >> >>> >>> Apache ORC 1.7.0 is coming! Please test it in your environments and give >> us> >>> your feedback.> >>> >>> Best,> >>> Dongjoon.> >>> >> >> >> >> Kyle Bendickson >> Software Engineer >> Apple >> ACS Data >> One Apple Park Way, >> Cupertino, CA 95014, USA >> kbendick...@apple.com >> >> This email and any attachments may be privileged and may contain >> confidential information intended only for the recipient(s) named above. >> Any other distribution, forwarding, copying or disclosure of this message >> is strictly prohibited. If you have received this email in error, please >> notify me immediately by telephone or return email, and delete this message >> from your system. >> >> >>