RE: Querying JSON/Thrift data?

Joydeep Sen Sarma Sat, 07 Mar 2009 12:08:39 -0800

No process required. Please file a JIRA - I will try to upload a patch this
weekend (just cut'n'paste for the most part). I would appreciate some help in
polishing it (the internal code is hardwired to some assumptions, etc.).

-----Original Message-----
From: Stephen Corona [mailto:scor...@adknowledge.com] 
Sent: Saturday, March 07, 2009 7:53 AM
To: hive-user@hadoop.apache.org
Subject: RE: Querying JSON/Thrift data?

Thanks for the reply. Would it be possible to add the tfiletransport ->
sequencefile process to the Hive code base? If so, what kind of timeframe would
be associated with that (i.e., is there a lot of red tape to go through at
Facebook)?

I think that CSV maps and lists can be a possible short term solution. Can they 
support adding new keys to the map? Also, what is the behavior when a key 
doesn't exist in a particular map for a record? Null? Or does Hive throw an 
error?

I saw the JSON function, but I think that delimited maps/lists are a better
solution because we don't need nested maps/lists.

Thanks again!

Steve Corona

________________________________________
From: Joydeep Sen Sarma [jssa...@facebook.com]
Sent: Saturday, March 07, 2009 1:43 AM
To: hive-user@hadoop.apache.org
Subject: RE: Querying JSON/Thrift data?

Yes - it makes complete sense. This is what we do here for some data sets.

Unfortunately the open source code base does not have the loaders we run to 
convert thrift records in a tfiletransport into a sequencefile that hadoop/hive 
can work with. One option is that we add this to the Hive code base (should be
straightforward).

Hive does support maps and lists encoded in delimited text files (please take
a look at the DDL syntax). If that's good enough for you, that may be a better
option. However, this support does not extend to further nesting
(structs/lists/maps inside lists/maps).
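
For illustration, a minimal sketch of what such a delimited table could look
like, using the DDL clauses as described in current Hive documentation (the
release discussed in this thread may differ). The table name, column names,
and delimiter choices are assumptions made up for the example:

```sql
-- Hypothetical table; only the ROW FORMAT DELIMITED clauses are Hive syntax.
CREATE TABLE page_events (
  uid   INT,
  name  STRING,
  tags  ARRAY<STRING>,        -- a flat list, no nesting
  attrs MAP<STRING, STRING>   -- a flat map, no nesting
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY ','
  MAP KEYS TERMINATED BY ':'
STORED AS TEXTFILE;

-- A matching input line (fields separated by tabs):
-- 42<TAB>steve<TAB>php,scribe<TAB>browser:firefox,country:us

-- List elements are read by index, map values by key.
SELECT uid, tags[0], attrs['browser'], attrs['referrer']
FROM page_events;
```

Because map keys are data rather than schema, rows can carry different key
sets without any change to the table definition. In current Hive, looking up
a key that a row does not contain (attrs['referrer'] for the sample line
above) yields NULL rather than an error.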

The third option is to provide a JSON SerDe. We would like to do this but
haven't yet. There is a JSON function available in Hive that can take a
JSON-encoded column and evaluate expressions over it. Using this may be another
short-term workaround.
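
As a rough sketch of that workaround, assuming the function referred to is
Hive's get_json_object UDF and that each row holds one JSON document in a
string column (the table and column names here are hypothetical):

```sql
-- Hypothetical table: one JSON document per row, stored as plain text.
CREATE TABLE raw_events (payload STRING);

-- Example row: {"uid": 42, "name": "steve"}

-- get_json_object takes the JSON string and a '$.'-style path and
-- returns the extracted value as a string.
SELECT
  get_json_object(e.payload, '$.uid')  AS uid,
  get_json_object(e.payload, '$.name') AS name
FROM raw_events e;
```

The JSON is re-parsed on every call, so this is more of a stopgap than a
substitute for a proper SerDe.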

-----Original Message-----
From: Stephen Corona [mailto:scor...@adknowledge.com]
Sent: Friday, March 06, 2009 8:46 PM
To: hive-user@hadoop.apache.org
Subject: RE: Querying JSON/Thrift data?

The input format can be whatever it needs to be to get it loaded into Hive.

I've been googling around all night and haven't really found what I am looking
for. Basically, I want to transfer some data from my web servers to Hive in a
format that's a little more verbose than plain CSV files. It seems like JSON or
Thrift would be perfect for this. I am planning on sending this serialized JSON
or Thrift data through Scribe and loading it into Hive. I just can't figure
out how to tell Hive that the input data is a bunch of serialized Thrift
records (all of the records are the "struct" type) in a TFileTransport.
Hopefully this makes sense...

-Steve

________________________________________
From: Joydeep Sen Sarma [jssa...@facebook.com]
Sent: Friday, March 06, 2009 11:24 PM
To: hive-user@hadoop.apache.org
Subject: RE: Querying JSON/Thrift data?

Can you describe the format of the input file in a bit more detail?

Is it a set of serialized Thrift records of the same class type? The current
ThriftDeserializer expects serialized records to be embedded inside a
BytesWritable (we make sure of this during the loading process), which is
probably not the scenario for most people (we haven't gotten around to fixing
this yet).

-----Original Message-----
From: Stephen Corona [mailto:scor...@adknowledge.com]
Sent: Friday, March 06, 2009 8:05 PM
To: hive-user@hadoop.apache.org
Subject: RE: Querying JSON/Thrift data?

I took a look at this class and tried to give it a shot.. I'm not exactly sure 
what the create table syntax should look like. I tried this:

hive> create table testing ( uid int, name string )
    > row format serializer 'org.apache.hadoop.hive.serde2.ThriftDeserializer'
    > ;
FAILED: Parse Error: line 2:7 mismatched input 'table' expecting TEMPORARY in 
create function statement

Steve Corona
________________________________________
From: Prasad Chakka [pra...@facebook.com]
Sent: Friday, March 06, 2009 7:33 PM
To: hive-user@hadoop.apache.org
Subject: Re: Querying JSON/Thrift data?

Can you use ThriftDeserializer? Look at Complex class to see how it is used.
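
A sketch of how a Thrift-backed table over that Complex test class can be
declared, modeled on the form used in Hive's own test queries. The table name
is hypothetical, and the SERDE / WITH SERDEPROPERTIES spelling and the package
of the protocol class follow later Hive and Thrift releases, so the release
discussed in this thread may need different names:

```sql
-- No column list: the SerDe derives the columns from the Thrift class.
CREATE TABLE thrift_events
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.thrift.ThriftDeserializer'
WITH SERDEPROPERTIES (
  -- The generated Thrift class that each record deserializes into.
  'serialization.class'  = 'org.apache.hadoop.hive.serde2.thrift.test.Complex',
  -- The Thrift protocol the records were serialized with.
  'serialization.format' = 'org.apache.thrift.protocol.TBinaryProtocol'
)
STORED AS SEQUENCEFILE;
```

For your own data, 'serialization.class' would name your generated Thrift
struct class, which has to be on Hive's classpath.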

Prasad

________________________________
From: Stephen Corona <scor...@adknowledge.com>
Reply-To: <hive-user@hadoop.apache.org>
Date: Fri, 6 Mar 2009 16:02:02 -0800
To: <hive-user@hadoop.apache.org>
Subject: RE: Querying JSON/Thrift data?

________________________________________
From: Stephen Corona
Sent: Friday, March 06, 2009 6:16 PM
To: hive-user-subscr...@hadoop.apache.org
Subject: Querying JSON/Thrift data?

Hey guys,

I am currently loading data into Hive in a CSV delimited format. This works but 
turns out to be a huge pain when adding and removing columns (since they can 
only be added to the end of the table). Is there any way to load and query data 
that's in some sort of JSON/thrift format? That way the data is already 
associated with some column and not just in a seemingly arbitrary data format? 
I am pretty open on which format to use and how to load it into Hive. FWIW, our
data is generated in PHP and pushed to Scribe. Scribe aggregates the CSV files 
and we load them into Hive every night.

Thanks,

Steve
