Hi @Pavan @Panos,

In case that link isn’t working, here’s the stack trace from the NPE during the 
specific test. I was trying to get my debugger to step through it, but I’m 
having issues with IntelliJ using the correct sources when I use the debugger. 
I might have to nuke all of my gradle dependencies and then start over.

Here’s a link to the specific code for the test that is throwing the NPE 
<https://github.com/kbendick/iceberg/blob/b58d9064b2839989d15b9e0ca0192eb077367c33/flink/src/test/java/org/apache/iceberg/flink/data/TestFlinkOrcReaderWriter.java#L81-L92>

And here’s a link to the actual test in the super class that is causing it. 
<https://github.com/kbendick/iceberg/blob/b58d9064b2839989d15b9e0ca0192eb077367c33/data/src/test/java/org/apache/iceberg/data/DataTest.java#L134-L171>

If you want to test it out yourself, you can clone my repo, check out the 
`upgrade-orc-to-170-test` branch, and then run the test.
 
$ git clone g...@github.com:kbendick/iceberg.git && cd iceberg && git checkout 
upgrade-orc-to-170-test && ./gradlew iceberg-flink:test

I will try to get my debugger working so I can actually further investigate. 
The test does fail in CLIl, but my debugger environment is having issues and 
using older sources.

Thanks, Kyle

Stack trace of failure:

org.apache.iceberg.flink.data.TestFlinkOrcReaderWriter > testMixedTypes FAILED
7550
 
<https://github.com/kbendick/iceberg/pull/42/checks?check_run_id=3095034630#step:6:7550>
    java.lang.NullPointerException
7551
 
<https://github.com/kbendick/iceberg/pull/42/checks?check_run_id=3095034630#step:6:7551>
        at 
org.apache.orc.impl.PhysicalFsWriter.writeFileMetadata(PhysicalFsWriter.java:416)
7552
 
<https://github.com/kbendick/iceberg/pull/42/checks?check_run_id=3095034630#step:6:7552>
        at org.apache.orc.impl.WriterImpl.writeMetadata(WriterImpl.java:575)
7553
 
<https://github.com/kbendick/iceberg/pull/42/checks?check_run_id=3095034630#step:6:7553>
        at org.apache.orc.impl.WriterImpl.writeFooter(WriterImpl.java:640)
7554
 
<https://github.com/kbendick/iceberg/pull/42/checks?check_run_id=3095034630#step:6:7554>
        at org.apache.orc.impl.WriterImpl.close(WriterImpl.java:732)
7555
 
<https://github.com/kbendick/iceberg/pull/42/checks?check_run_id=3095034630#step:6:7555>
        at 
org.apache.iceberg.orc.OrcFileAppender.close(OrcFileAppender.java:127)
7556
 
<https://github.com/kbendick/iceberg/pull/42/checks?check_run_id=3095034630#step:6:7556>
        at 
org.apache.iceberg.flink.data.TestFlinkOrcReaderWriter.writeAndValidate(TestFlinkOrcReaderWriter.java:61)




Kyle Bendickson
Software Engineer
Apple
ACS Data
One Apple Park Way,
Cupertino, CA 95014, USA
kbendick...@apple.com

This email and any attachments may be privileged and may contain confidential 
information intended only for the recipient(s) named above. Any other 
distribution, forwarding, copying or disclosure of this message is strictly 
prohibited. If you have received this email in error, please notify me 
immediately by telephone or return email, and delete this message from your 
system.


> On Jul 26, 2021, at 9:18 AM, Pavan Lanka <pla...@apple.com.INVALID> wrote:
> 
> @Panos,
> 
> Here is the link to the specific failure for NPE.
> https://github.com/kbendick/iceberg/runs/3156762084#step:6:7499
> 
> I will also try to take a look at this today to see what is causing this.
> 
> Regards,
> Pavan
> 
>> On Jul 26, 2021, at 8:30 AM, Panos Garefalakis <panga...@gmail.com> wrote:
>> 
>> Thanks both for keeping us posted!
>> Seems like we are going to need another Storage-api release to fix the
>> dependency issue that was discovered recerntly.
>> @Kyle Regarding the HT dictionary regression -- seems to be quite critical.
>> Can you please point to the actual failing test? I could only find build
>> failures in the logs..
>> Regarding the JMH benchmarks, I am a happy user myself --
>> *ORCWriterBenchMark* was created for this very reason.
>> Maybe it would make sense to extend that for nested schemas as described
>> above?
>> 
>> Cheers,
>> Panagiotis
>> 
>> 
>> On Mon, Jul 26, 2021 at 9:21 AM Kyle Bendickson
>> <kbendick...@apple.com.invalid> wrote:
>> 
>>> Thanks for the summary of QA for 1.7.0 Dongjoon =)
>>> 
>>> Wanted to follow up on a few things I’ve come across testing several
>>> different versions of ORC 1.7.0-SNAPSHOT with Apache Iceberg, in case
>>> others have insight into the issues I’ve faced:
>>> 
>>> 1) I wanted to mention that the NPE in the Iceberg tests for
>>> 1.7.0-SNAPSHOT did not occur for one of the earlier versions of the
>>> snapshot 1.7.0-SNAPSHOT.  I am still working to determine which version it
>>> was working with, but here’s a link <
>>> https://github.com/kbendick/iceberg/pull/42>[1] to the branch / PR where
>>> I am experiencing the NPE in a Flink ORC read/write test suite when using a
>>> large mixed type / nested schema if others want to see the failed test
>>> suites (we use GitHub Actions so it should be familiar).
>>> 
>>> The NPE occurs when calling `close` and attempting to write the file
>>> metadata. This occurs when calling
>>> org.apache.orc.impl.PhysicalFsWriter#writeFileMetadata, specifically when
>>> calling getPos on the private field `rawWriter` of type FSDataOututStream,
>>> so it’s possible that there is some potential configuration issues when
>>> setting the filesystem for the writer (e.g. in the Iceberg code). However
>>> several other tests pass, about 5 in total all of which run this exact same
>>> test code using different schemas, so I’m inclined to think there’s some
>>> sort of issue with where we read / write to the output stream that is
>>> surfacing with a large number of serialized columns or something. I am
>>> still investigating and will report as soon as I know more but wanted to
>>> bring it to everyone’s attention =)
>>> 
>>> 2) For the same test which reads and writes 1000 records to a file using
>>> Flink (though mostly via Iceberg API) with a large / nested schema, ,we
>>> have to change the writer’s configuration to use the old default value of
>>> `orc.dictionary.implementation`, i.e. `rbtree` instead of the new default
>>> value of `hash`. When using `hash`, there is an OOM coming from the
>>> relatively recently added `StringHashTableDictionary`. I noticed there is a
>>> comment about not needing large buckets, as very many collisions would mean
>>> that the "table is too small or the function isn’t good”. <
>>> https://github.com/apache/orc/blob/main/java/core/src/java/org/apache/orc/impl/StringHashTableDictionary.java#L84-L87>[2]
>>> This test is specifically for testing large and nested schemas (there’s
>>> over 20 columns with map and list types). But this data isn’t necessarily
>>> so far outside of the realm of realworld datasets.
>>> 
>>> Are there any existing performance tests using the new hash based
>>> dictionary implementation or any guidance we have for users for configuring
>>> their files (strips, etc) to make effective use of the new hash
>>> implementation? I’m wondering if users will have a difficult time
>>> transitioning due to having “a function that isn’t good or a table that’s
>>> too small"
>>> 
>>> [1] Github PR with currently failing Iceberg Flink tests for ORC
>>> 1.7.0-SNAPSHOT: https://github.com/kbendick/iceberg/pull/42 <
>>> https://github.com/kbendick/iceberg/pull/42>
>>> [2] ORC StringHashTableDictionary class, with hash bucket instantiation
>>> with comments about bucket size w.r.t collisions:
>>> https://github.com/apache/orc/blob/main/java/core/src/java/org/apache/orc/impl/StringHashTableDictionary.java#L81-L89
>>> <
>>> https://github.com/apache/orc/blob/main/java/core/src/java/org/apache/orc/impl/StringHashTableDictionary.java#L81-L89
>>>> 
>>> 
>>> Thanks for summarizing the on going work and please let me know if I can
>>> be of any help outside of the current investigation into the Iceberg
>>> issues. I’d love to possibly help set up JMH benchmark testing (which is
>>> admittedly not perfect), as I know we have had several PRs recently to
>>> optimize some code paths. Would be great to have a way to measure the
>>> impact somewhat (while recognizing that JMH isn’t exactly perfect).
>>> 
>>> Best,
>>> Kyle Bendickson
>>> 
>>> 
>>> On 2021/07/20 20:33:14, Dongjoon Hyun <d...@gmail.com> wrote:
>>>> Thank you all for your participation.>
>>>> 
>>>> Here is a brief summary of the recent ORC community activity for Apache
>>> ORC>
>>>> 1.7.0 QA.>
>>>> 
>>>> 1. After branching branch-1.7, I shared Apache Iceberg integration test>
>>>> failure.>
>>>> 2. Kyle Bendickson reported the same issue. In addition, when he
>>> reverted>
>>>> to rbtree, he met NPE in some cases. He is investigating it.>
>>>> 3. David Mollitor has been working on several nice code refactoring and>
>>>> perf improvement. Currently, his patches are landed on the `main`
>>> branch>
>>>> for 1.8.0, but I'm considering backporting them to branch-1.7 if
>>> needed.>
>>>> 4. William (Jeongseok) Hyun also found an orc-tools CNFE bug and fixed
>>> it.>
>>>> 5. Yukihiro Okada made a patch to fix his `head` command contribution
>>> and>
>>>> Panos/William reviewed and merged.>
>>>> 
>>>> In addition to the above, Apache ORC community is waiting for a new
>>> hive>
>>>> storage api release.>
>>>> 
>>>> Yesterday, Panos started the Hive Storage API 2.8.0-rc0 vote and Pavan>
>>>> Lanka, Owen and I voted.>
>>>> ->
>>>> 
>>> https://lists.apache.org/thread.html/r28201af1d35c67c72d3e846c26c77c38d2a51437f0b1bd31ddcf12b8%40%3Cdev.hive.apache.org%3E>
>>> 
>>>> 
>>>> Apache ORC 1.7.0 is coming! Please test it in your environments and give
>>> us>
>>>> your feedback.>
>>>> 
>>>> Best,>
>>>> Dongjoon.>
>>>> 
>>> 
>>> 
>>> 
>>> Kyle Bendickson
>>> Software Engineer
>>> Apple
>>> ACS Data
>>> One Apple Park Way,
>>> Cupertino, CA 95014, USA
>>> kbendick...@apple.com
>>> 
>>> This email and any attachments may be privileged and may contain
>>> confidential information intended only for the recipient(s) named above.
>>> Any other distribution, forwarding, copying or disclosure of this message
>>> is strictly prohibited. If you have received this email in error, please
>>> notify me immediately by telephone or return email, and delete this message
>>> from your system.
>>> 
>>> 
>>> 
> 

Reply via email to