@Panos,

Here is the link to the specific failure for NPE.
https://github.com/kbendick/iceberg/runs/3156762084#step:6:7499

I will also try to take a look at this today to see what is causing this.

Regards,
Pavan

> On Jul 26, 2021, at 8:30 AM, Panos Garefalakis <panga...@gmail.com> wrote:
> 
> Thanks both for keeping us posted!
> Seems like we are going to need another Storage-api release to fix the
> dependency issue that was discovered recerntly.
> @Kyle Regarding the HT dictionary regression -- seems to be quite critical.
> Can you please point to the actual failing test? I could only find build
> failures in the logs..
> Regarding the JMH benchmarks, I am a happy user myself --
> *ORCWriterBenchMark* was created for this very reason.
> Maybe it would make sense to extend that for nested schemas as described
> above?
> 
> Cheers,
> Panagiotis
> 
> 
> On Mon, Jul 26, 2021 at 9:21 AM Kyle Bendickson
> <kbendick...@apple.com.invalid> wrote:
> 
>> Thanks for the summary of QA for 1.7.0 Dongjoon =)
>> 
>> Wanted to follow up on a few things I’ve come across testing several
>> different versions of ORC 1.7.0-SNAPSHOT with Apache Iceberg, in case
>> others have insight into the issues I’ve faced:
>> 
>> 1) I wanted to mention that the NPE in the Iceberg tests for
>> 1.7.0-SNAPSHOT did not occur for one of the earlier versions of the
>> snapshot 1.7.0-SNAPSHOT.  I am still working to determine which version it
>> was working with, but here’s a link <
>> https://github.com/kbendick/iceberg/pull/42>[1] to the branch / PR where
>> I am experiencing the NPE in a Flink ORC read/write test suite when using a
>> large mixed type / nested schema if others want to see the failed test
>> suites (we use GitHub Actions so it should be familiar).
>> 
>> The NPE occurs when calling `close` and attempting to write the file
>> metadata. This occurs when calling
>> org.apache.orc.impl.PhysicalFsWriter#writeFileMetadata, specifically when
>> calling getPos on the private field `rawWriter` of type FSDataOututStream,
>> so it’s possible that there is some potential configuration issues when
>> setting the filesystem for the writer (e.g. in the Iceberg code). However
>> several other tests pass, about 5 in total all of which run this exact same
>> test code using different schemas, so I’m inclined to think there’s some
>> sort of issue with where we read / write to the output stream that is
>> surfacing with a large number of serialized columns or something. I am
>> still investigating and will report as soon as I know more but wanted to
>> bring it to everyone’s attention =)
>> 
>> 2) For the same test which reads and writes 1000 records to a file using
>> Flink (though mostly via Iceberg API) with a large / nested schema, ,we
>> have to change the writer’s configuration to use the old default value of
>> `orc.dictionary.implementation`, i.e. `rbtree` instead of the new default
>> value of `hash`. When using `hash`, there is an OOM coming from the
>> relatively recently added `StringHashTableDictionary`. I noticed there is a
>> comment about not needing large buckets, as very many collisions would mean
>> that the "table is too small or the function isn’t good”. <
>> https://github.com/apache/orc/blob/main/java/core/src/java/org/apache/orc/impl/StringHashTableDictionary.java#L84-L87>[2]
>> This test is specifically for testing large and nested schemas (there’s
>> over 20 columns with map and list types). But this data isn’t necessarily
>> so far outside of the realm of realworld datasets.
>> 
>> Are there any existing performance tests using the new hash based
>> dictionary implementation or any guidance we have for users for configuring
>> their files (strips, etc) to make effective use of the new hash
>> implementation? I’m wondering if users will have a difficult time
>> transitioning due to having “a function that isn’t good or a table that’s
>> too small"
>> 
>> [1] Github PR with currently failing Iceberg Flink tests for ORC
>> 1.7.0-SNAPSHOT: https://github.com/kbendick/iceberg/pull/42 <
>> https://github.com/kbendick/iceberg/pull/42>
>> [2] ORC StringHashTableDictionary class, with hash bucket instantiation
>> with comments about bucket size w.r.t collisions:
>> https://github.com/apache/orc/blob/main/java/core/src/java/org/apache/orc/impl/StringHashTableDictionary.java#L81-L89
>> <
>> https://github.com/apache/orc/blob/main/java/core/src/java/org/apache/orc/impl/StringHashTableDictionary.java#L81-L89
>>> 
>> 
>> Thanks for summarizing the on going work and please let me know if I can
>> be of any help outside of the current investigation into the Iceberg
>> issues. I’d love to possibly help set up JMH benchmark testing (which is
>> admittedly not perfect), as I know we have had several PRs recently to
>> optimize some code paths. Would be great to have a way to measure the
>> impact somewhat (while recognizing that JMH isn’t exactly perfect).
>> 
>> Best,
>> Kyle Bendickson
>> 
>> 
>> On 2021/07/20 20:33:14, Dongjoon Hyun <d...@gmail.com> wrote:
>>> Thank you all for your participation.>
>>> 
>>> Here is a brief summary of the recent ORC community activity for Apache
>> ORC>
>>> 1.7.0 QA.>
>>> 
>>> 1. After branching branch-1.7, I shared Apache Iceberg integration test>
>>> failure.>
>>> 2. Kyle Bendickson reported the same issue. In addition, when he
>> reverted>
>>> to rbtree, he met NPE in some cases. He is investigating it.>
>>> 3. David Mollitor has been working on several nice code refactoring and>
>>> perf improvement. Currently, his patches are landed on the `main`
>> branch>
>>> for 1.8.0, but I'm considering backporting them to branch-1.7 if
>> needed.>
>>> 4. William (Jeongseok) Hyun also found an orc-tools CNFE bug and fixed
>> it.>
>>> 5. Yukihiro Okada made a patch to fix his `head` command contribution
>> and>
>>> Panos/William reviewed and merged.>
>>> 
>>> In addition to the above, Apache ORC community is waiting for a new
>> hive>
>>> storage api release.>
>>> 
>>> Yesterday, Panos started the Hive Storage API 2.8.0-rc0 vote and Pavan>
>>> Lanka, Owen and I voted.>
>>> ->
>>> 
>> https://lists.apache.org/thread.html/r28201af1d35c67c72d3e846c26c77c38d2a51437f0b1bd31ddcf12b8%40%3Cdev.hive.apache.org%3E>
>> 
>>> 
>>> Apache ORC 1.7.0 is coming! Please test it in your environments and give
>> us>
>>> your feedback.>
>>> 
>>> Best,>
>>> Dongjoon.>
>>> 
>> 
>> 
>> 
>> Kyle Bendickson
>> Software Engineer
>> Apple
>> ACS Data
>> One Apple Park Way,
>> Cupertino, CA 95014, USA
>> kbendick...@apple.com
>> 
>> This email and any attachments may be privileged and may contain
>> confidential information intended only for the recipient(s) named above.
>> Any other distribution, forwarding, copying or disclosure of this message
>> is strictly prohibited. If you have received this email in error, please
>> notify me immediately by telephone or return email, and delete this message
>> from your system.
>> 
>> 
>> 

Reply via email to