[jira] [Commented] (HIVE-13345) LLAP: metadata cache takes too much space, esp. with bloom filters, due to Java/protobuf overhead
[ https://issues.apache.org/jira/browse/HIVE-13345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15229239#comment-15229239 ]

Owen O'Malley commented on HIVE-13345:
--------------------------------------

The current leaking of the OrcProto objects outside of the reader implementation is problematic and should be fixed.

For fast loading, we should create a ReaderImpl constructor that takes a serialized file tail. The C++ implementation uses:

    // The contents of the file tail that must be serialized.
    message FileTail {
      optional PostScript postscript = 1;
      optional Footer footer = 2;
      optional uint64 fileLength = 3;
      optional uint64 postscriptLength = 4;
    }

I assume you aren't proposing doing hand-rolled serialization, which would be very error prone.

If I'd seen flatbuffers before I started ORC, I would have been tempted to go that way. Now it would be too much pain for too little gain.

> LLAP: metadata cache takes too much space, esp. with bloom filters, due to Java/protobuf overhead
> -------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-13345
>                 URL: https://issues.apache.org/jira/browse/HIVE-13345
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>
> We cache java objects currently; these have high overhead, average stripe metadata takes 200-500Kb on real files, and with bloom filters blowing up more than x5 due to being stored as list of Long-s, up to 5Mb per stripe. That is undesirable.
> We should either create better objects for ORC (might be good in general) or store serialized metadata and deserialize when needed.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (HIVE-13345) LLAP: metadata cache takes too much space, esp. with bloom filters, due to Java/protobuf overhead
[ https://issues.apache.org/jira/browse/HIVE-13345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15214825#comment-15214825 ]

Sergey Shelukhin commented on HIVE-13345:
-----------------------------------------

I think the problem is/was that ORC readers were created with proto objects. Anyway, I'll take a look at how complex both approaches are at some point (this week?).
[jira] [Commented] (HIVE-13345) LLAP: metadata cache takes too much space, esp. with bloom filters, due to Java/protobuf overhead
[ https://issues.apache.org/jira/browse/HIVE-13345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15214816#comment-15214816 ]

Prasanth Jayachandran commented on HIVE-13345:
----------------------------------------------

IMO we should store the serialized representation of metadata. Deserialized representations of metadata (Proto objects) are supposed to be short-lived. We have POJOs for all protobuf equivalents (BloomFilter, ColumnStatistics, StripeInformation, etc.), and these POJOs are created from the Proto objects. If we are caching the deserialized representation, then we should cache the equivalent POJOs and not the proto objects.
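The "store serialized, deserialize when needed" option discussed above can be illustrated with a minimal sketch. This is not the LLAP metadata cache or any ORC API: the class name SerializedMetadataCache and all of its methods are hypothetical, and the decoder function stands in for whatever parser the metadata type already provides.

```java
import java.nio.ByteBuffer;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/**
 * Hypothetical cache (not an LLAP class) that retains only the serialized
 * bytes of each metadata entry and decodes them on every lookup.
 */
final class SerializedMetadataCache<K, V> {
    private final Map<K, byte[]> serialized = new ConcurrentHashMap<>();
    private final Function<ByteBuffer, V> decoder;

    /** The decoder is an assumption: any function that parses the bytes. */
    SerializedMetadataCache(Function<ByteBuffer, V> decoder) {
        this.decoder = decoder;
    }

    /** Stores a private copy of the serialized form. */
    void put(K key, byte[] bytes) {
        serialized.put(key, bytes.clone());
    }

    /** Decodes a fresh, short-lived object per call; returns null on a miss. */
    V get(K key) {
        byte[] bytes = serialized.get(key);
        if (bytes == null) {
            return null;
        }
        return decoder.apply(ByteBuffer.wrap(bytes).asReadOnlyBuffer());
    }

    /** Total payload bytes held by the cache (excludes map and array headers). */
    long retainedBytes() {
        long total = 0;
        for (byte[] b : serialized.values()) {
            total += b.length;
        }
        return total;
    }
}
```

The design choice here is that the cache pays a decode cost on each get() in exchange for holding one compact byte[] per entry instead of a graph of Java objects; whether that trade is worthwhile is exactly the question raised in this thread.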
[jira] [Commented] (HIVE-13345) LLAP: metadata cache takes too much space, esp. with bloom filters, due to Java/protobuf overhead
[ https://issues.apache.org/jira/browse/HIVE-13345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15214728#comment-15214728 ]

Sergey Shelukhin commented on HIVE-13345:
-----------------------------------------

[~gopalv] [~prasanth_j] [~owen.omalley] opinions on the best approach?

I am leaning towards changing ORC to use POJOs instead of OrcProto stuff, but as an alternative we can change the metadata cache in LLAP to store serialized metadata. The trade-off is the cost of deserializing every time in LLAP vs. the cost of copying fields and converting some things (e.g. OrcProto stores bloom filters as List<Long>, which, aside from being horrible on pure merits, offends my engineering sensibilities, so I might be biased here).
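To make the "converting some things" cost concrete, here is a rough sketch of copying a boxed list of bloom filter words into a primitive array. PackedBloomBits is a hypothetical name, not an ORC or Hive class, and the bit layout (bit i stored in word i / 64 at position i % 64) is an assumption made for illustration only.

```java
import java.util.List;

/**
 * Hypothetical POJO (not an ORC class) that holds bloom filter bits in a
 * long[] rather than a List of boxed Long values.
 */
final class PackedBloomBits {
    private final long[] words;

    private PackedBloomBits(long[] words) {
        this.words = words;
    }

    /** One-time conversion: copies each boxed value into a primitive slot. */
    static PackedBloomBits fromBoxed(List<Long> boxed) {
        long[] words = new long[boxed.size()];
        int i = 0;
        for (Long w : boxed) {
            words[i++] = w; // auto-unboxing; the boxed list can be dropped afterwards
        }
        return new PackedBloomBits(words);
    }

    /** Tests one bit under the assumed layout: word = index / 64, bit = index % 64. */
    boolean get(int bitIndex) {
        return (words[bitIndex >>> 6] & (1L << (bitIndex & 63))) != 0;
    }

    int wordCount() {
        return words.length;
    }
}
```

A long[] stores each word in 8 bytes of payload, whereas a list of Long values keeps a separate object plus a reference per word, which is the kind of per-element overhead the issue description complains about.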
[jira] [Commented] (HIVE-13345) LLAP: metadata cache takes too much space, esp. with bloom filters, due to Java/protobuf overhead
[ https://issues.apache.org/jira/browse/HIVE-13345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15209551#comment-15209551 ]

Sergey Shelukhin commented on HIVE-13345:
-----------------------------------------

[~gopalv] [~prasanth_j] fyi