You can still partition the data. You'll have to run queries to add
partitions to the table, otherwise the table won't see new partitions, but
you have to do that regardless of what type of table you use.
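For example (a minimal sketch; the table name, partition column `dt`, and
paths are assumptions, not from this thread):

```sql
-- Register a new partition with the metastore so Hive can see it.
-- This works the same way for external tables; adjust names/paths.
ALTER TABLE my_table ADD IF NOT EXISTS PARTITION (dt='2011-09-01')
LOCATION '/data/my_table/dt=2011-09-01';
```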

We have a big cluster, so I don't really see any change in performance; Hive
is relatively fast for this type of data.

In some cases GPB has advantages over plain text, so it depends...
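If you do go the Pig conversion route mentioned below, a rough sketch might
look like this (the loader class and proto class name are assumptions --
check the Elephant-Bird version you have):

```pig
-- Read protobuf records with an Elephant-Bird Pig loader (class name
-- is an assumption), then write tab-delimited text for Hive to load.
raw = LOAD '/data/gpb'
      USING com.twitter.elephantbird.pig.load.ProtobufPigLoader('com.example.MyMessage');
STORE raw INTO '/data/tsv' USING PigStorage('\t');
```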

On Fri, Sep 2, 2011 at 2:57 PM, Matias Silva <msi...@specificmedia.com> wrote:

> Hi Valentina, thanks for your response.  Do you think that, using external
> tables, I can still partition the data?  I do like
> the external table idea because it will save us from having to do an
> additional import of the data into Hive after loading
> into HDFS.  Plus it will save on space.
>
> How is the performance using GPB/Hive?
>
> Another thing I think we can do is use Pig and Elephant-Bird to read the
> GPB files, write them out in a tab-delimited, plain-text format,
> and import that data into Hive.  This would be a copy of the data, but it
> would be cleaner.
>
> Thanks,
> Matt
>
>
> On Sep 2, 2011, at 9:43 AM, valentina kroshilina wrote:
>
> > I use MR to generate tables using Elephant-Bird's OutputFormat. Hive
> > can read from EXTERNAL tables using the ProtobufHiveSerde and
> > ProtobufBlockInputFormat generated by Elephant-Bird. The create table
> > statement looks like the following:
> >
> > CREATE EXTERNAL TABLE IF NOT EXISTS TABLE_NAME
> > (
> > ...
> > )
> > ROW FORMAT SERDE 'elephantbird.proto.hive.serde.LzoXXXProtobufHiveSerde'
> > STORED AS
> >   INPUTFORMAT 'elephantbird.proto.mapred.input.DeprecatedLzoXXXProtobufBlockInputFormat'
> >   OUTPUTFORMAT 'org.apache.hadoop.mapred.SequenceFileOutputFormat'
> > LOCATION '/PATH';
> >
> > So the solution is to use external tables.
> >
> > Let me know if it helps.
> >
> > On Thu, Sep 1, 2011 at 8:45 PM, Matias Silva <msi...@specificmedia.com> wrote:
> >> Hi Everyone, is there any documentation regarding importing
> >> Google Protocol Buffer files into Hive?  I'm scouring the internet, and
> >> the closest thing I came
> >> across is http://search-hadoop.com/m/9zF4MEW5Od1/v=plain
> >> I saw something from Elephant-Bird where I can load the GPB file using
> >> Pig, store it in a plain-text format, and then load it
> >> into Hive.  It would be great if I could just load from GPB directly into
> >> Hive.
> >> Any pointers?
> >> Thanks for your time and knowledge,
> >> Matt
> >>
> >>
>
>
> Matias Silva   [Sr. Data Warehouse Developer]
> p 949.861.8888 x1420      f 949.861.8990
> specificmedia.com
>
>
>
>