[ https://issues.apache.org/jira/browse/HIVE-333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12698174#action_12698174 ]

Joydeep Sen Sarma edited comment on HIVE-333 at 4/12/09 12:00 AM:
------------------------------------------------------------------

This turned out to be far more complicated than I had thought. Here's the rundown:

- THRIFT-377 - I have attached the TFileTransport Java ports there. More on this later.

- HIVE-333 - contains a new contrib/thrift module that has:
  * lib/libthrift_asf.jar - this contains a Thrift jar created from Thrift trunk + THRIFT-377 (so it includes TFileTransport).
     I had to add a new libthrift to Hive because the current one uses the com.facebook namespace, which is not compatible with Thrift trunk. All of contrib/thrift uses the latest Thrift trunk version.

     Note that contrib/thrift/lib/libthrift_asf.jar is submitted as a separate attachment from the patch.

  * provides a trivial rewrite of the existing Thrift serde in Hive (the new one is called org.apache.hadoop.hive.serde.asfthrift.ThriftBytesWritabledeserializer) that uses the Thrift trunk library instead of the old one. This is required to read Thrift objects embedded inside BytesWritables in Hive.

  * contrib/thrift also has a TFileTransportInputFormat and TFileTransportRecordReader - these allow TFileTransport files to be processed as inputs to Hadoop map-reduce. The input format will split files so that the splits are aligned with TFileTransport chunk boundaries (a minimal wiring sketch follows this list).

  * it also has an example map-reduce program (TConverter/TMapper) that shows how to convert a TFileTransport into a SequenceFile with Thrift objects embedded inside BytesWritable objects. This example does not do any reduction, but you can extend it to hash/reduce on a specific key (which is what we do at Facebook). Output compression can also be controlled by command-line options (it extends Tool - more on usage later).

  * aside from libthrift_asf.jar, the rest is produced as a single jar file by contrib/thrift (see build/contrib-thrift/hive_contrib-thrift.jar - it should be produced by 'ant jar' or 'ant package').
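
To make the pieces above concrete, here is a minimal wiring sketch (not taken from the patch) using the old-style org.apache.hadoop.mapred API current at this time. The package of TFileTransportInputFormat and TMapper, and the key/value types they emit, are assumptions based on the class names above - check the patch for the real signatures:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;

// assumed package for the classes added by this patch
import org.apache.hadoop.hive.thrift.TFileTransportInputFormat;
import org.apache.hadoop.hive.thrift.TMapper;

public class TFileTransportJobSketch {
  public static void main(String[] args) throws Exception {
    JobConf job = new JobConf(TFileTransportJobSketch.class);
    job.setJobName("tfiletransport-to-sequencefile");

    // read TFileTransport files; splits are aligned to chunk boundaries
    job.setInputFormat(TFileTransportInputFormat.class);
    FileInputFormat.setInputPaths(job, new Path("/tmp/tfiletransportfile"));

    // TMapper wraps each raw thrift record in a BytesWritable (assumed)
    job.setMapperClass(TMapper.class);
    job.setNumReduceTasks(0); // the example does no reduction

    // write a SequenceFile with the thrift bytes embedded in BytesWritables
    job.setOutputFormat(SequenceFileOutputFormat.class);
    job.setOutputKeyClass(BytesWritable.class);
    job.setOutputValueClass(BytesWritable.class);
    FileOutputFormat.setOutputPath(job, new Path("/tmp/sequencefile"));

    JobClient.runJob(job);
  }
}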

i.e., the work done so far allows conversion of files in TFileTransport format into the SequenceFile + BytesWritable format (and also provides the serde to read these files), which is Hive friendly. Example run of TConverter:

hadoop jar -libjars contrib/thrift/lib/libthrift_asf.jar,build/ql/hive_exec.jar 
build/contrib-thrift/hive_contrib-thrift.jar 
org.apache.hadoop.hive.thrift.TConverter 
-Dthrift.filetransport.classname=org.apache.hadoop.thrift.TestClass -inputpath 
/tmp/tfiletransportfile -output /tmp/sequencefile

// More options (including those to get compressed SequenceFiles) can simply be
// added using more -Dkey=value options.
// You will need to add the jar file for TestClass in this example to the
// -libjars switch as well.
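
For example, the stock Hadoop output-compression keys of this era (standard Hadoop settings, not specific to this patch) can be appended the same way:

-Dmapred.output.compress=true
-Dmapred.output.compression.type=BLOCK
-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec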

Once the files are converted, it's trivial to create a Hive table with the right properties so that these files can be queried. A few points about Hive integration:
- I need to ask Prasad about the exact CLI statements to create these tables - I will post instructions once I have them.
- the jar files hive_contrib-thrift.jar and libthrift_asf.jar will need to be in the Hive execution environment. This can be arranged by copying them into auxlib/ under the Hive distribution directory (for example, see the command after this list). I haven't integrated this into ant yet.
- jar files for the classes that are serialized into the SequenceFile and need to be queried by Hive must be deposited into auxlib/ as well.
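
For example (a hedged illustration - $HIVE_HOME stands for the Hive distribution directory, and the jar paths mirror those mentioned above):

cp build/contrib-thrift/hive_contrib-thrift.jar contrib/thrift/lib/libthrift_asf.jar $HIVE_HOME/auxlib/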

Two more options exist:
- convert Thrift files into text using TConverter-type programs
- alternatively, we can arrange for Hive to query TFileTransport directly. It's not that hard (since the input format is now done), but it needs some more work, testing, and new code.

CAVEAT regarding THRIFT-377 - I am finding a few (1-5) spurious empty records at the beginning of each TFileTransport chunk when trying to read TFileTransport files produced in C++ land from Java land (and only when seeking to split boundaries). I just don't have the time to debug this anymore. The simple workaround is to disable splitting of TFileTransport files by setting mapred.min.split.size to an effectively infinite value (see the snippet below). If the files are not split, there's no problem.
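
For concreteness, a sketch of the workaround in driver code, assuming access to the JobConf (mapred.min.split.size is the stock Hadoop key; Long.MAX_VALUE stands in for "infinite"):

import org.apache.hadoop.mapred.JobConf;

JobConf job = new JobConf();
// a minimum split size larger than any input file means one split per file,
// i.e. no seeking to mid-file split boundaries
job.setLong("mapred.min.split.size", Long.MAX_VALUE);

Since TConverter extends Tool, the same can be done on the command line with -Dmapred.min.split.size=9223372036854775807.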

I am hoping you can take things from here. If we really need Hive to query TFileTransport directly, it's probably another couple of hours' worth of work, but I will wait for your input to see if this is required (it seems to me that SequenceFiles are a better long-term data container in Hadoop since they allow compression).

> Add TFileTransport deserializer
> -------------------------------
>
>                 Key: HIVE-333
>                 URL: https://issues.apache.org/jira/browse/HIVE-333
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Serializers/Deserializers
>         Environment: Linux
>            Reporter: Steve Corona
>            Assignee: Joydeep Sen Sarma
>         Attachments: hive-333.patch.1, libthrift_asf.jar
>
>
> I've been googling around all night and haven't really found what I am looking 
> for. Basically, I want to transfer some data from my web servers to Hive in 
> a format that's a little more verbose than plain CSV files. It seems like 
> JSON or Thrift would be perfect for this. I am planning on sending this 
> serialized JSON or Thrift data through Scribe and loading it into Hive. I 
> just can't figure out how to tell Hive that the input data is a bunch of 
> serialized Thrift records (all of the records are the "struct" type) in a 
> TFileTransport. Hopefully this makes sense...
> Reply from Joydeep Sen Sarma ([email protected])
> Unfortunately the open source code base does not have the loaders we run to 
> convert Thrift records in a TFileTransport into a SequenceFile that 
> Hadoop/Hive can work with. One option is that we add this to the Hive code 
> base (should be straightforward).
> No process required. Please file a jira - I will try to upload a patch this 
> weekend (it's just cut-and-paste for the most part). I would appreciate some 
> help in finessing it out (the internal code is hardwired to some 
> assumptions, etc.).
