No process required. Please file a JIRA - I will try to upload a patch this weekend (just cut'n'paste for the most part). I would appreciate some help in finessing it (the internal code is hardwired to some assumptions, etc.).
-----Original Message-----
From: Stephen Corona [mailto:scor...@adknowledge.com]
Sent: Saturday, March 07, 2009 7:53 AM
To: hive-user@hadoop.apache.org
Subject: RE: Querying JSON/Thrift data?

Thanks for the reply. Would it be possible to add the tfiletransport -> sequencefile process to the Hive code base? If so, what type of timeframe would be associated with that (i.e., is there a lot of red tape to go through at Facebook)?

I think that CSV maps and lists can be a possible short-term solution. Can they support adding new keys to the map? Also, what is the behavior when a key doesn't exist in a particular map for a record? Null? Or does Hive throw an error?

I saw the JSON function, but I think that delimited maps/lists are a better solution because we don't need nested maps/lists.

Thanks again!
Steve Corona

________________________________________
From: Joydeep Sen Sarma [jssa...@facebook.com]
Sent: Saturday, March 07, 2009 1:43 AM
To: hive-user@hadoop.apache.org
Subject: RE: Querying JSON/Thrift data?

Yes - it makes complete sense. This is what we do here for some data sets. Unfortunately, the open source code base does not have the loaders we run to convert thrift records in a tfiletransport into a sequencefile that Hadoop/Hive can work with. One option is that we add this to the Hive code base (should be straightforward).

Hive does support maps and lists encoded in delimited text files (please take a look at the DDL syntax). If that's good enough for you, that may be a better option. However, this support does not extend to any further nesting (structs/lists/maps inside lists/maps).

The third option is to provide a JSON SerDe. We would like to do this, but haven't yet. There is a JSON function available in Hive that can take a JSON-encoded column and evaluate expressions over it. Using this may be another short-term workaround.
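A minimal HiveQL sketch of the delimited maps/lists option and the JSON function workaround described above. The table names (`events`, `raw_events`), the column names, the map key `referrer`, and the delimiter characters are all hypothetical and chosen only for illustration; the exact clauses accepted depend on the Hive version, so check the DDL documentation for your release.

```sql
-- Hypothetical table: one row per event, fields separated by tabs.
CREATE TABLE events (
  uid   INT,
  name  STRING,
  tags  ARRAY<STRING>,        -- list column, e.g. "a,b,c"
  props MAP<STRING, STRING>   -- map column, e.g. "referrer:google,browser:ff"
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY ','
  MAP KEYS TERMINATED BY ':'
STORED AS TEXTFILE;

-- Map keys live in the data, not in the schema, so a new key needs no ALTER TABLE.
-- Looking up a key that a row does not contain is expected to yield NULL, not an error.
SELECT uid, tags[0], props['referrer'] FROM events;

-- JSON function workaround: json_col is a hypothetical STRING column
-- holding one JSON document per row.
SELECT get_json_object(json_col, '$.name') FROM raw_events;
```

Because the map keys travel with each record, this layout avoids the add-columns-only-at-the-end problem for fields that change often, at the cost of losing per-key types (every value here is a STRING).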
-----Original Message-----
From: Stephen Corona [mailto:scor...@adknowledge.com]
Sent: Friday, March 06, 2009 8:46 PM
To: hive-user@hadoop.apache.org
Subject: RE: Querying JSON/Thrift data?

The input format can be whatever it needs to be to get it loaded into Hive. I've been googling around all night and haven't really found what I am looking for.

Basically, I want to transfer some data from my web servers to Hive in a format that's a little more verbose than plain CSV files. It seems like JSON or thrift would be perfect for this. I am planning on sending this serialized JSON or thrift data through Scribe and loading it into Hive. I just can't figure out how to tell Hive that the input data is a bunch of serialized thrift records (all of the records are the "struct" type) in a TFileTransport. Hopefully this makes sense...

-Steve

________________________________________
From: Joydeep Sen Sarma [jssa...@facebook.com]
Sent: Friday, March 06, 2009 11:24 PM
To: hive-user@hadoop.apache.org
Subject: RE: Querying JSON/Thrift data?

Can you describe a bit more about the format of the input file? Is it a set of serialized thrift records of the same class type?

The current ThriftDeserializer expects serialized records to be embedded inside a BytesWritable (we make sure of this during the loading process), but that is probably not the scenario for most people (we haven't gotten around to fixing this yet).

-----Original Message-----
From: Stephen Corona [mailto:scor...@adknowledge.com]
Sent: Friday, March 06, 2009 8:05 PM
To: hive-user@hadoop.apache.org
Subject: RE: Querying JSON/Thrift data?

I took a look at this class and tried to give it a shot. I'm not exactly sure what the create table syntax should look like.
I tried this:

    hive> create table testing ( uid int, name string )
        > row format serializer 'org.apache.hadoop.hive.serde2.ThriftDeserializer'
        > ;
    FAILED: Parse Error: line 2:7 mismatched input 'table' expecting TEMPORARY in create function statement

Steve Corona

________________________________________
From: Prasad Chakka [pra...@facebook.com]
Sent: Friday, March 06, 2009 7:33 PM
To: hive-user@hadoop.apache.org
Subject: Re: Querying JSON/Thrift data?

Can you use ThriftDeserializer? Look at Complex class to see how it is used.

Prasad

________________________________
From: Stephen Corona <scor...@adknowledge.com>
Reply-To: <hive-user@hadoop.apache.org>
Date: Fri, 6 Mar 2009 16:02:02 -0800
To: <hive-user@hadoop.apache.org>
Subject: RE: Querying JSON/Thrift data?

________________________________________
From: Stephen Corona
Sent: Friday, March 06, 2009 6:16 PM
To: hive-user-subscr...@hadoop.apache.org
Subject: Querying JSON/Thrift data?

Hey guys,

I am currently loading data into Hive in a CSV delimited format. This works but turns out to be a huge pain when adding and removing columns (since they can only be added to the end of the table).

Is there any way to load and query data that's in some sort of JSON/thrift format? That way the data is already associated with some column and not just in a seemingly arbitrary data format? I am pretty open on which format to use and how to load it into Hive.

FWIW, Our data is generated in PHP and pushed to Scribe. Scribe aggregates the CSV files and we load them into Hive every night.

Thanks,
Steve