Elliot West <mailto:tea...@gmail.com>
May 1, 2015 at 3:04
Yes and no :-) We're initially using OrcFile.createReader to create a
Reader so that we can obtain the schema (StructTypeInfo) from the
file. I don't believe this is possible with OrcInputFormat.getReader(?):
Reader orcReader = OrcFile.createReader(path, OrcFile.readerOptions(conf));
ObjectInspector inspector = orcReader.getObjectInspector();
StructTypeInfo typeInfo =
    (StructTypeInfo) TypeInfoUtils.getTypeInfoFromObjectInspector(inspector);
In the case of transactional datasets we've worked around this by
generating the StructTypeInfo from schema data retrieved from the meta
store, as we need to interact with the meta store anyway to correctly
read the data. Even if OrcFile.createReader were to transparently read
delta-only datasets, it wouldn't get us much further currently, as the
delta files lack the correct column names and the Reader would thus
return an unusable StructTypeInfo.
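As a rough illustration of the meta store workaround, the sketch below builds a Hive-style struct type string from column names and types. The class and method names are hypothetical, the column lists are assumed to have already been fetched from the meta store, and it is an assumption that the resulting string would then be handed to something like TypeInfoUtils.getTypeInfoFromTypeString; this is not our actual implementation.

```java
import java.util.Arrays;
import java.util.List;
import java.util.StringJoiner;

// Hypothetical helper: not part of Hive or Cascading.
final class StructTypeStrings {

    // Builds "struct<name:type,...>" from parallel lists of column names and types.
    static String structTypeString(List<String> names, List<String> types) {
        if (names.size() != types.size()) {
            throw new IllegalArgumentException("names and types must have the same length");
        }
        StringJoiner joiner = new StringJoiner(",", "struct<", ">");
        for (int i = 0; i < names.size(); i++) {
            joiner.add(names.get(i) + ":" + types.get(i));
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        // Column data as it might be returned by the meta store for test_table
        // (assumed here; partition columns are deliberately left out).
        List<String> names = Arrays.asList("id", "message");
        List<String> types = Arrays.asList("int", "string");
        System.out.println(structTypeString(names, types));
    }
}
```

The point of going via the meta store is that the names come from the table definition rather than from the file footer, so the result does not depend on whether a base file exists.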
The org.apache.hadoop.hive.ql.io.orc.OrcSplit.getPath() issue is
currently our biggest pain point as it requires us to place Orc+Atomic
specific code in what should be a general framework. To illustrate the
problem further, somewhere in cascading there is some code that
extracts partition keys from split paths. It extracts keys by chopping
off the 'part' leaf and removing the preceding parent:
*Text etc:*
OrcSplit.getPath() returns:
'warehouse/test_table/continent=Asia/country=India/part-000001'
Partition keys derived as: 'continent=Asia/country=India' (CORRECT)
*Orc base+delta:*
OrcSplit.getPath() returns:
'warehouse/test_table/continent=Asia/country=India/base_0000006'
Partition keys derived as: 'continent=Asia/country=India' (CORRECT)
*Orc delta only etc:*
OrcSplit.getPath() returns:
'warehouse/test_table/continent=Asia/country=India'
Partition keys derived as: 'continent=Asia' (INCORRECT)
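To make the three cases concrete, here is a minimal sketch of the "chop off the leaf" derivation next to one possible alternative that keeps only the 'key=value' path segments. Both methods and the class name are hypothetical and are not Cascading's actual code; the alternative assumes that leaf names such as 'part-000001' or 'base_0000006' never contain '='.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: not the code used in Cascading.
final class PartitionKeys {

    // Naive derivation: drop the last path element, then strip the table root.
    // Assumes splitPath starts with tableRoot and has at least two elements below it.
    static String fromLeafParent(String tableRoot, String splitPath) {
        String parent = splitPath.substring(0, splitPath.lastIndexOf('/'));
        return parent.substring(tableRoot.length() + 1);
    }

    // Alternative: keep every segment below the table root that looks like key=value.
    static String fromKeyValueSegments(String tableRoot, String splitPath) {
        String relative = splitPath.substring(tableRoot.length());
        List<String> keys = new ArrayList<>();
        for (String segment : relative.split("/")) {
            if (segment.contains("=")) {
                keys.add(segment);
            }
        }
        return String.join("/", keys);
    }

    public static void main(String[] args) {
        String root = "warehouse/test_table";
        String partition = root + "/continent=Asia/country=India";
        String[] splitPaths = {
            partition + "/part-000001",   // text etc.
            partition + "/base_0000006",  // ORC base+delta
            partition                     // ORC delta only
        };
        for (String path : splitPaths) {
            System.out.println(fromLeafParent(root, path)
                + " | " + fromKeyValueSegments(root, path));
        }
    }
}
```

The naive method loses 'country=India' for the delta-only path, whereas the segment-based one does not depend on the depth of the path returned by the split.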
Cheers - Elliot.
On 30 April 2015 at 17:40, Alan Gates <alanfga...@gmail.com
<mailto:alanfga...@gmail.com>> wrote:
Are you using OrcInputFormat.getReader to get a reader? If so, it
should take care of these anomalies for you and mask your need to
worry about delta versus base files.
Alan.
Elliot West <mailto:tea...@gmail.com>
April 29, 2015 at 9:40
Hi,
I'm implementing a tap to read Hive ORC ACID data into Cascading
jobs and I've hit a couple of issues for a particular scenario.
The case I have is when data has been written into a
transactional table and a compaction has not yet occurred. This
can be recreated like so:
CREATE TABLE test_table ( id int, message string )
PARTITIONED BY ( continent string, country string )
CLUSTERED BY (id) INTO 1 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');
INSERT INTO TABLE test_table
PARTITION (continent = 'Asia', country = 'India')
VALUES (1, 'x'), (2, 'y'), (3, 'z');
This results in a dataset that contains only a delta file:
warehouse/test_table/continent=Asia/country=India/delta_0000060_0000060/bucket_00000
I'm assuming that this scenario is valid - a user might insert
new data into a table and want to read it back at a time prior to
the first compaction. I can select the data back from this table
in Hive with no problem. However, for a number of reasons I'm
finding it rather tricky to do so programmatically. At this point
I should mention that reading base files or base+deltas is
trouble free. The issues I've encountered are as follows:
1. org.apache.hadoop.hive.ql.io.orc.OrcFile.createReader(Path,
ReaderOptions) fails if the directory specified by the path
('warehouse/test_table/continent=Asia/country=India' in this
case) contains only a delta. Specifically it attempts to
access 'delta_0000060_0000060' as if it were a file and
therefore fails. It appears to function correctly if the
directory also contains a base. We use this method to extract
the typeInfo from the ORCFile and build a mapping between the
user's declared fields.
2. org.apache.hadoop.hive.ql.io.orc.OrcSplit.getPath() is
seemingly inconsistent in that it returns the path of the
base if present, otherwise the parent. This presents issues
within cascading (and I assume other frameworks) that expect
the paths returned by splits to be at the same depth and for
them to contain some kind of 'part' file leaf. In my example
the path returned is
'warehouse/test_table/continent=Asia/country=India'; if I had
also had a base I'd have seen
'warehouse/test_table/continent=Asia/country=India/base_0000006'.
3. The footers of the delta files do not contain the true field
names of the table. In my example I see
'_col0:int,_col1:string' where I'd expect
'id:int,message:string'. A base file, if present, correctly
declares the field names. We chose to access values by field
name rather than position so that users of our reader do not
need to declare the full schema to read partial data; however,
this behaviour trips this up.
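For issue 3, one conceivable way to keep by-name access working is to translate the positional placeholder names back to the table's declared names. The sketch below is only an idea under stated assumptions: the class and method are hypothetical, the placeholders are assumed to follow the '_colN' pattern seen above with N being the column position, and the declared names are assumed to come from elsewhere (the meta store, for example).

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: not part of Hive or Cascading.
final class FooterNameMapping {

    private static final Pattern PLACEHOLDER = Pattern.compile("_col(\\d+)");

    // Maps each footer field name to a declared table column name.
    // Names that do not look like "_colN" are assumed to be correct already.
    static Map<String, String> toDeclaredNames(List<String> footerNames,
                                               List<String> declaredNames) {
        Map<String, String> mapping = new LinkedHashMap<>();
        for (String footerName : footerNames) {
            Matcher matcher = PLACEHOLDER.matcher(footerName);
            if (matcher.matches()) {
                int position = Integer.parseInt(matcher.group(1));
                if (position >= declaredNames.size()) {
                    throw new IllegalArgumentException(
                        "No declared column at position " + position);
                }
                mapping.put(footerName, declaredNames.get(position));
            } else {
                mapping.put(footerName, footerName);
            }
        }
        return mapping;
    }

    public static void main(String[] args) {
        // Footer names as seen in the delta file, declared names from the table definition.
        System.out.println(toDeclaredNames(
            Arrays.asList("_col0", "_col1"),
            Arrays.asList("id", "message")));
    }
}
```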
I have (horrifically :) worked around issues 1 and 2 in my own
code and have some ideas to circumvent 3 but I wanted to get a
feeling as to whether I'm going against the tide and if my life
might be easier if I approached this another way.
Thanks - Elliot.