[jira] [Commented] (HIVE-13345) LLAP: metadata cache takes too much space, esp. with bloom filters, due to Java/protobuf overhead

2016-04-06 Thread Owen O'Malley (JIRA)

[ https://issues.apache.org/jira/browse/HIVE-13345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15229239#comment-15229239 ]

Owen O'Malley commented on HIVE-13345:
--------------------------------------

The current leaking of the OrcProto objects outside of the reader 
implementation is problematic and should be fixed.

For fast loading, we should create a ReaderImpl constructor that takes a 
serialized file tail. The C++ implementation uses:

// The contents of the file tail that must be serialized.
message FileTail {
  optional PostScript postscript = 1;
  optional Footer footer = 2;
  optional uint64 fileLength = 3;
  optional uint64 postscriptLength = 4;
}
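
For illustration only, a minimal sketch of what the fast-load path could look 
like on the Java side, assuming the FileTail message above gets compiled into 
the generated OrcProto classes; the class, method, and import paths here are 
placeholders rather than the current ORC API:

import org.apache.hadoop.hive.ql.io.orc.OrcProto;  // assumed location; FileTail is hypothetical
import com.google.protobuf.InvalidProtocolBufferException;

// Sketch: rebuild the reader's tail metadata from bytes cached earlier,
// instead of re-reading and re-parsing PostScript + Footer from the file.
public final class FileTailFastLoad {
  public static OrcProto.FileTail parseCachedTail(byte[] serializedTail)
      throws InvalidProtocolBufferException {
    // Deserialize on demand; the resulting proto object can stay short-lived.
    return OrcProto.FileTail.parseFrom(serializedTail);
  }
}

A ReaderImpl constructor overload could then accept the parsed FileTail (or the 
raw bytes) directly, so the cached tail never has to be re-fetched or re-parsed 
from the file system.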

I assume you aren't proposing hand-rolled serialization, which would be very 
error-prone. If I'd seen FlatBuffers before I started ORC, I would have been 
tempted to go that way. Now it would be too much pain for too little gain.



> LLAP: metadata cache takes too much space, esp. with bloom filters, due to 
> Java/protobuf overhead
> --------------------------------------------------------------------------
>
> Key: HIVE-13345
> URL: https://issues.apache.org/jira/browse/HIVE-13345
> Project: Hive
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>
> We currently cache Java objects; these have high overhead. Average stripe 
> metadata takes 200-500 KB on real files, and with bloom filters it blows up 
> more than 5x due to their being stored as a list of Longs, up to 5 MB per 
> stripe. That is undesirable.
> We should either create better objects for ORC (which might be good in 
> general) or store serialized metadata and deserialize it when needed.
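
For a rough sense of where the bloom filter blowup comes from, a 
back-of-the-envelope estimate (illustrative only; the sizes are approximate and 
JVM-dependent, not measurements from this ticket -- boxing alone gives roughly 
a 3-4x difference, with the surrounding protobuf objects adding more):

// Approximate per-word cost of a bloom filter bitset stored as boxed Longs
// versus a primitive long[], on a typical 64-bit JVM with compressed oops.
public final class BloomOverheadEstimate {
  public static void main(String[] args) {
    long words = 100_000;                       // hypothetical bitset size in 64-bit words

    long primitiveBytes = words * 8;            // long[]: 8 bytes per word
    // Boxed list: ~4-byte reference per slot in the backing array
    // plus ~24 bytes per java.lang.Long instance (header + value + padding).
    long boxedBytes = words * (4 + 24);

    System.out.printf("long[] ~%d KB, List<Long> ~%d KB (~%.1fx)%n",
        primitiveBytes / 1024, boxedBytes / 1024,
        (double) boxedBytes / primitiveBytes);
  }
}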





[jira] [Commented] (HIVE-13345) LLAP: metadata cache takes too much space, esp. with bloom filters, due to Java/protobuf overhead

2016-03-28 Thread Sergey Shelukhin (JIRA)

[ https://issues.apache.org/jira/browse/HIVE-13345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15214825#comment-15214825 ]

Sergey Shelukhin commented on HIVE-13345:
-----------------------------------------

I think the problem is/was that ORC readers were created with proto objects. 
Anyway, I'll take a look at how complex both approaches are at some point (this 
week?).



[jira] [Commented] (HIVE-13345) LLAP: metadata cache takes too much space, esp. with bloom filters, due to Java/protobuf overhead

2016-03-28 Thread Prasanth Jayachandran (JIRA)

[ https://issues.apache.org/jira/browse/HIVE-13345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15214816#comment-15214816 ]

Prasanth Jayachandran commented on HIVE-13345:
----------------------------------------------

IMO we should store the serialized representation of the metadata. The 
deserialized representation (proto objects) is supposed to be short-lived. We 
already have POJO equivalents for all of the protobuf objects (BloomFilter, 
ColumnStatistics, StripeInformation, etc.), which are created from the proto 
objects. If we are caching the deserialized representation, then we should 
cache the equivalent POJOs and not the proto objects.
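
A minimal sketch of the serialized-cache direction, assuming a simple 
byte-array cache keyed by a stripe identifier (the key type, class names, and 
import paths here are illustrative and may differ from the actual LLAP cache 
interfaces and Hive/ORC version):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import com.google.protobuf.InvalidProtocolBufferException;
import org.apache.hadoop.hive.ql.io.orc.OrcProto;

// Sketch: keep only the compact serialized bytes in the cache and rebuild the
// short-lived proto object (or a POJO built from it) on each lookup.
public final class SerializedStripeMetadataCache {
  private final Map<String, byte[]> cache = new ConcurrentHashMap<>();

  public void put(String stripeKey, OrcProto.StripeFooter footer) {
    cache.put(stripeKey, footer.toByteArray());   // store serialized bytes only
  }

  public OrcProto.StripeFooter get(String stripeKey)
      throws InvalidProtocolBufferException {
    byte[] bytes = cache.get(stripeKey);
    return bytes == null ? null : OrcProto.StripeFooter.parseFrom(bytes);
  }
}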



[jira] [Commented] (HIVE-13345) LLAP: metadata cache takes too much space, esp. with bloom filters, due to Java/protobuf overhead

2016-03-28 Thread Sergey Shelukhin (JIRA)

[ https://issues.apache.org/jira/browse/HIVE-13345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15214728#comment-15214728 ]

Sergey Shelukhin commented on HIVE-13345:
-----------------------------------------

[~gopalv] [~prasanth_j] [~owen.omalley] Opinions on the best approach? I am 
leaning towards changing ORC to use POJOs instead of the OrcProto objects, but 
as an alternative we can change the metadata cache in LLAP to store serialized 
metadata. The tradeoff is the cost of deserializing every time in LLAP vs. the 
cost of copying fields/converting some things (e.g. OrcProto stores bloom 
filters as List<Long>, which, aside from being horrible on pure merits, offends 
my engineering sensibilities, so I might be biased here).
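
As a sketch of the POJO direction (illustrative only; the class and field names 
are hypothetical, not an existing ORC API), the bloom filter bitset would be 
copied once out of the proto representation into a primitive array:

// Sketch: a cache-friendly bloom filter holder that keeps the bitset as a
// primitive long[] instead of the protobuf List<Long>.
public final class BloomFilterPojo {
  private final long[] bitset;
  private final int numHashFunctions;

  public BloomFilterPojo(java.util.List<Long> protoBits, int numHashFunctions) {
    // One-time copy out of the proto representation; the proto object can then
    // be dropped instead of being cached.
    this.bitset = new long[protoBits.size()];
    for (int i = 0; i < bitset.length; i++) {
      this.bitset[i] = protoBits.get(i);
    }
    this.numHashFunctions = numHashFunctions;
  }

  public long[] getBitset() { return bitset; }

  public int getNumHashFunctions() { return numHashFunctions; }
}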




[jira] [Commented] (HIVE-13345) LLAP: metadata cache takes too much space, esp. with bloom filters, due to Java/protobuf overhead

2016-03-23 Thread Sergey Shelukhin (JIRA)

[ https://issues.apache.org/jira/browse/HIVE-13345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15209551#comment-15209551 ]

Sergey Shelukhin commented on HIVE-13345:
-----------------------------------------

[~gopalv] [~prasanth_j] fyi

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)