[ https://issues.apache.org/jira/browse/HIVE-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13578587#comment-13578587 ]

Phabricator commented on HIVE-3874:
-----------------------------------

kevinwilfong has commented on the revision "HIVE-3874 [jira] Create a new 
Optimized Row Columnar file format for Hive".

  A couple of minor style comments, per the style guide 
(https://cwiki.apache.org/confluence/display/Hive/HowToContribute#HowToContribute-CodingConvention):

  There are a number of places in the code where spaces are missing: around 
+ operators (e.g. line 58 in DynamicByteArray), between for and ( (e.g. line 63 
in DynamicByteArray), and before the : in a for-each loop (e.g. line 191 in 
OrcStruct).

  Mentioning these now as I don't want them to hold up a commit later.

INLINE COMMENTS
  ql/src/java/org/apache/hadoop/hive/ql/orc/OrcInputFormat.java:149-151 Is this 
loop necessary? result is a boolean array, so all of these entries will default 
to false anyway.
  ql/src/java/org/apache/hadoop/hive/ql/orc/OutStream.java:136-140 I'm a little 
confused by this: if compressed is null, why isn't overflow being initialized 
as well?
  ql/src/java/org/apache/hadoop/hive/ql/orc/OrcStruct.java:307 I saw issues 
with this, with TypeInfoUtils expecting an array instead of a list.
  ql/src/java/org/apache/hadoop/hive/ql/orc/WriterImpl.java:561-562 As far as I 
can tell, because the intermediate string data is stored in these structures, 
which do not write to a stream until writeStripe is called, the size of string 
columns is not accounted for at all when deciding whether or not to write out 
the stripe. (This could be fixed in a follow-up.)
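For reference, the OrcInputFormat comment above relies on the Java language guarantee that array elements are zero-initialized, so a freshly allocated boolean array is already all false. A minimal sketch (illustrative only, not Hive code):

```java
// Demonstrates the point behind the OrcInputFormat.java:149-151 comment:
// Java zero-initializes array elements, so a loop that assigns false to
// each entry of a new boolean[] is redundant.
public class BooleanDefaults {

  // Equivalent to allocating the "result" array without the flagged loop.
  static boolean[] makeResult(int size) {
    return new boolean[size]; // every element is already false
  }

  public static void main(String[] args) {
    boolean[] result = makeResult(4);
    for (boolean b : result) {
      assert !b : "array elements default to false";
    }
    System.out.println("all elements default to false");
  }
}
```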

REVISION DETAIL
  https://reviews.facebook.net/D8529

To: JIRA, omalley
Cc: kevinwilfong

> Create a new Optimized Row Columnar file format for Hive
> --------------------------------------------------------
>
>                 Key: HIVE-3874
>                 URL: https://issues.apache.org/jira/browse/HIVE-3874
>             Project: Hive
>          Issue Type: Improvement
>          Components: Serializers/Deserializers
>            Reporter: Owen O'Malley
>            Assignee: Owen O'Malley
>         Attachments: hive.3874.2.patch, HIVE-3874.D8529.1.patch, 
> OrcFileIntro.pptx, orc.tgz
>
>
> There are several limitations of the current RC File format that I'd like to 
> address by creating a new format:
> * each column value is stored as a binary blob, which means:
> ** the entire column value must be read, decompressed, and deserialized
> ** the file format can't use smarter type-specific compression
> ** push-down filters can't be evaluated
> * the start of each row group needs to be found by scanning
> * user metadata can only be added to the file when the file is created
> * the file doesn't store the number of rows per file or row group
> * there is no mechanism for seeking to a particular row number, which is 
> required for external indexes.
> * there is no mechanism for storing lightweight indexes within the file to 
> enable push-down filters to skip entire row groups.
> * the types of the rows aren't stored in the file
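The lightweight-index and push-down-filter points above can be illustrated with a per-row-group min/max statistic: a reader can skip any group whose range cannot satisfy the predicate. This is only a hedged sketch under assumed names; it is not the actual ORC index layout or API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration (names are invented, not the ORC design): each
// row group records the min and max of a column; a predicate like
// "value > threshold" can then prune whole groups without reading them.
public class RowGroupIndexSketch {

  static final class GroupStats {
    final long min;
    final long max;
    GroupStats(long min, long max) { this.min = min; this.max = max; }
  }

  // Returns the indices of row groups that might contain value > threshold;
  // groups whose max is <= threshold are skipped entirely.
  static List<Integer> groupsToRead(List<GroupStats> index, long threshold) {
    List<Integer> keep = new ArrayList<>();
    for (int i = 0; i < index.size(); i++) {
      if (index.get(i).max > threshold) {
        keep.add(i);
      }
    }
    return keep;
  }

  public static void main(String[] args) {
    List<GroupStats> index = new ArrayList<>();
    index.add(new GroupStats(0, 10));   // group 0: values in [0, 10]
    index.add(new GroupStats(20, 30));  // group 1: values in [20, 30]
    // Only group 1 can contain a value greater than 15.
    System.out.println(groupsToRead(index, 15));
  }
}
```

The same idea generalizes to other predicates (equality, ranges) as long as the per-group statistic bounds the column's values.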

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
