Re: ACID ORC file reader issue with uncompacted data

Alan Gates Thu, 14 May 2015 10:28:48 -0700

Ok, I think I understand now. I also get why OrcSplit.getPath returns just up to the partition keys and not the delta directories. In most cases there will be more than one delta directory, so which one would it pick?

It seems you already know the file type you are working on before you call this (since you're calling OrcSplit.getPath rather than FileSplit.getPath). The best way forward might be to make a utility method in Hive that takes the file type and the result of getPath and then returns you the partition keys. That way you're not left putting ORC-specific code in Cascading.
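A minimal sketch of what such a utility might look like, using only the JDK so it stands alone. Everything here is hypothetical and not part of Hive: the class name `PartitionKeyUtil`, the `FileType` enum, and the extra `tableRoot` parameter are assumptions made for illustration, as is the rule that an ACID leaf directory is recognised by a `base_` or `delta_` prefix.

```java
/** Hypothetical helper, not part of Hive: derives partition keys from a split path. */
final class PartitionKeyUtil {

    /** Hypothetical stand-in for "the file type" the caller already knows. */
    enum FileType { TEXT, ORC_ACID }

    private PartitionKeyUtil() {}

    /**
     * Returns the partition portion of splitPath relative to tableRoot,
     * e.g. "continent=Asia/country=India".
     */
    static String partitionKeys(FileType type, String tableRoot, String splitPath) {
        String root = stripTrailingSlash(tableRoot);
        String path = stripTrailingSlash(splitPath);
        if (!path.startsWith(root + "/")) {
            throw new IllegalArgumentException(splitPath + " is not under " + tableRoot);
        }
        String relative = path.substring(root.length() + 1);
        int lastSlash = relative.lastIndexOf('/');
        String leaf = relative.substring(lastSlash + 1);

        boolean dropLeaf;
        if (type == FileType.ORC_ACID) {
            // Assumption: only base_* and delta_* leaves are non-partition segments.
            // A delta-only split path ends at the partition directory, so nothing is dropped.
            dropLeaf = leaf.startsWith("base_") || leaf.startsWith("delta_");
        } else {
            // Other formats: the leaf is a data file such as part-000001.
            dropLeaf = true;
        }
        if (!dropLeaf) {
            return relative;
        }
        return lastSlash < 0 ? "" : relative.substring(0, lastSlash);
    }

    private static String stripTrailingSlash(String s) {
        return s.endsWith("/") ? s.substring(0, s.length() - 1) : s;
    }
}
```

With a table root of 'warehouse/test_table', all three path shapes discussed in this thread (a 'part' file, a base directory, and a bare partition directory) would resolve to 'continent=Asia/country=India' under these assumptions.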


Alan.

Elliot West <mailto:tea...@gmail.com>
May 1, 2015 at 3:04

Yes and no :-) We're initially using OrcFile.createReader to create a Reader so that we can obtain the schema (StructTypeInfo) from the file. I don't believe this is possible with OrcInputFormat.getReader(?):


    Reader orcReader = OrcFile.createReader(path,
    OrcFile.readerOptions(conf));

    ObjectInspector inspector = orcReader.getObjectInspector();
    StructTypeInfo typeInfo = (StructTypeInfo)
    TypeInfoUtils.getTypeInfoFromObjectInspector(inspector);

In the case of transactional datasets we've worked around this by generating the StructTypeInfo from schema data retrieved from the metastore, as we need to interact with the metastore anyway to correctly read the data. Even if OrcFile.createReader were to transparently read delta-only datasets, it wouldn't get us much further currently, as the delta files lack the correct column names and the Reader would thus return an unusable StructTypeInfo.
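To make the column-name problem concrete, here is a rough JDK-only sketch of resolving placeholder names against names taken from the metastore. It is not the workaround described above (which builds the StructTypeInfo directly from metastore data); the class `DeltaColumnNames` is hypothetical, and it assumes that a footer name of the form `_colN` refers to the N-th table column, counting from zero, in metastore order.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of Hive: maps _colN placeholders to declared names. */
final class DeltaColumnNames {

    private DeltaColumnNames() {}

    /** Resolves every footer name, leaving names that are not placeholders untouched. */
    static List<String> resolve(List<String> footerNames, List<String> metastoreNames) {
        List<String> resolved = new ArrayList<>(footerNames.size());
        for (String name : footerNames) {
            resolved.add(resolveOne(name, metastoreNames));
        }
        return resolved;
    }

    /** Assumption: "_colN" is a zero-based position into the metastore column list. */
    static String resolveOne(String footerName, List<String> metastoreNames) {
        if (!footerName.matches("_col\\d+")) {
            return footerName; // already a real name, e.g. read from a base file
        }
        int position = Integer.parseInt(footerName.substring("_col".length()));
        if (position >= metastoreNames.size()) {
            throw new IllegalArgumentException(
                footerName + " has no matching column in " + metastoreNames);
        }
        return metastoreNames.get(position);
    }
}
```

For the example table, footer names '_col0' and '_col1' would resolve to 'id' and 'message' when the metastore reports the columns in that order.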

The org.apache.hadoop.hive.ql.io.orc.OrcSplit.getPath() issue is currently our biggest pain point as it requires us to place Orc+Atomic specific code in what should be a general framework. To illustrate the problem further, somewhere in cascading there is some code that extracts partition keys from split paths. It extracts keys by chopping off the 'part' leaf and removing the preceding parent:


*Text etc:*

OrcSplit.getPath() returns: 'warehouse/test_table/continent=Asia/country=India/part-000001'

Partition keys derived as: 'continent=Asia/country=India' (CORRECT)

*Orc base+delta:*

OrcSplit.getPath() returns: 'warehouse/test_table/continent=Asia/country=India/base_0000006'

Partition keys derived as: 'continent=Asia/country=India' (CORRECT)

*Orc delta only etc:*

OrcSplit.getPath() returns: 'warehouse/test_table/continent=Asia/country=India'

Partition keys derived as: 'continent=Asia' (INCORRECT)
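The three cases above can be reproduced with a small sketch of the naive derivation. This is a reconstruction for illustration only, not the actual Cascading code: the class `NaivePartitionKeys` and the explicit `tableRoot` argument are assumptions. It drops the last path segment unconditionally and then strips the table root.

```java
/** Hypothetical reconstruction of a leaf-chopping derivation; not Cascading's code. */
final class NaivePartitionKeys {

    private NaivePartitionKeys() {}

    /** Drops the final path segment, then removes the table root prefix. */
    static String derive(String tableRoot, String splitPath) {
        int lastSlash = splitPath.lastIndexOf('/');
        // Whatever the last segment is, it is treated as a 'part'-style leaf.
        String parent = lastSlash < 0 ? "" : splitPath.substring(0, lastSlash);
        String prefix = tableRoot + "/";
        return parent.startsWith(prefix) ? parent.substring(prefix.length()) : "";
    }
}
```

Because the delta-only path ends at 'country=India', that segment is mistaken for the leaf and discarded, which is how the derivation ends up with only 'continent=Asia'.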

Cheers - Elliot.

On 30 April 2015 at 17:40, Alan Gates <alanfga...@gmail.com> wrote:


    Are you using OrcInputFormat.getReader to get a reader?  If so, it
    should take care of these anomalies for you and mask your need to
    worry about delta versus base files.

    Alan.

    Elliot West <mailto:tea...@gmail.com>
    April 29, 2015 at 9:40
    Hi,

    I'm implementing a tap to read Hive ORC ACID data into Cascading
    jobs and I've hit a couple of issues for a particular scenario.
    The case I have is when data has been written into a
    transactional table and a compaction has not yet occurred. This
    can be recreated like so:

        CREATE TABLE test_table ( id int, message string )
          PARTITIONED BY ( continent string, country string )
          CLUSTERED BY (id) INTO 1 BUCKETS
          STORED AS ORC
          TBLPROPERTIES ('transactional' = 'true');

        INSERT INTO TABLE test_table
        PARTITION (continent = 'Asia', country = 'India')
        VALUES (1, 'x'), (2, 'y'), (3, 'z');


    This results in a dataset that contains only a delta file:

        
        warehouse/test_table/continent=Asia/country=India/delta_0000060_0000060/bucket_00000


    I'm assuming that this scenario is valid - a user might insert
    new data into a table and want to read it back at a time prior to
    the first compaction. I can select the data back from this table
    in Hive with no problem. However, for a number of reasons I'm
    finding it rather tricky to do so programmatically. At this point
    I should mention that reading base files or base+deltas is
    trouble free. The issues I've encountered are as follows:

     1. org.apache.hadoop.hive.ql.io.orc.OrcFile.createReader(Path,
        ReaderOptions) fails if the directory specified by the path
        ('warehouse/test_table/continent=Asia/country=India' in this
        case) contains only a delta. Specifically it attempts to
        access 'delta_0000060_0000060' as if it were a file and
        therefore fails. It appears to function correctly if the
        directory also contains a base. We use this method to extract
        the typeInfo from the ORCFile and build a mapping between the
        user's declared fields.
     2. org.apache.hadoop.hive.ql.io.orc.OrcSplit.getPath() is
        seemingly inconsistent in that it returns the path of the
        base if present, otherwise the parent. This presents issues
        within cascading (and I assume other frameworks) that expect
        the paths returned by splits to be at the same depth and for
        them to contain some kind of 'part' file leaf. In my example
        the path returned is
        'warehouse/test_table/continent=Asia/country=India'; if I had
        also had a base I'd have seen
        'warehouse/test_table/continent=Asia/country=India/base_0000006'.
     3. The footers of the delta files do not contain the true field
        names of the table. In my example I see
        '_col0:int,_col1:string' where I'd expect
        'id:int,message:string'. A base file, if present, correctly
        declares the field names. We chose to access values by field
        name rather than position so that users of our reader do not
        need to declare the full schema to read partial data; however,
        this behaviour trips this up.

    I have (horrifically :) worked around issues 1 and 2 in my own
    code and have some ideas to circumvent 3 but I wanted to get a
    feeling as to whether I'm going against the tide and if my life
    might be easier if I approached this another way.

    Thanks - Elliot.


