I think I might like this approach the best!

Thanks for all your advice....going off to experiment.

Bobby
On Oct 4, 2009, at 4:38 AM, Zheng Shao wrote:

Sub query is NOT going to create additional jobs. Hive is smart enough to compile the query into optimized plans, and a lot of obvious optimizations are already done.

Yes that's what I mean. The functionality (for UDF to return arbitrary class) is NOT there yet.


Actually there is an even simpler approach without the need of modifying the Hive framework:

Just create a global String to JSON conversion cache (using LRU algorithm).
The size of the cache can be set to something like 2.
This makes sure the conversion from String to JSON will only happen once for each row.

For example code of LRU algorithm, take a look at: 
http://www.source-code.biz/snippets/java/6.htm

Inside each UDF like TimeSpentQuerying(String), we just call "JSON myjson = myLRUCache.get(myString);".

How do you like this one?


heng

On Sat, Oct 3, 2009 at 6:44 PM, Bobby Rullo <[email protected]> wrote:

On Oct 3, 2009, at 1:28 AM, Zheng Shao wrote:

I got it. You mean TimeSpentQuerying, PageType, TotalRevenue and UserAgent are all UDFs that takes a JSON object and outputs a STRING.

Exactly!


Allowing such a new object type just for UDFs are simpler than supporting a new type in all parts of the system, because we don't need to care about serialization/deserialization.


Interesting. So you're saying it's ok to return whatever type from UDF.evaluate(), as long as it doesn't get serialized (i.e. as long as it's not the final output)?


If we go that route, the final query will look like this. Note that StringToJson takes a STRING and returns a JSON, and TimeSpentQuerying (etc) takes a JSON and returns either STRING or other types that are supported by Hive.

SELECT TimeSpentQuerying(txnjson), PageType(txnjson), TotalRevenue(txnjson) from (SELECT StringToJson(txn) as txnjson FROM txn_logs) tnxjson_logs
where UserAgent(txnjson) like '%Safari%';

Is this good enough?


That's pretty damn good. I like that all I have to do is create one more UDF which does the JSON parsing (and maybe JsonToString as well.)

The only thing that I'm concerned about is whether that sub-query is going to create an extra map job. Is the planner smart enough to stick stick all that work in one map?


Thanks!

bobby


Zheng

On Fri, Oct 2, 2009 at 11:50 PM, Bobby Rullo <[email protected]> wrote:
Zheng,

Thanks for your reply.

I'm not sure what this "get_json_objects" function would do...do you mean it would return a canned set of fields? That's not what I'm interested in. The UDFs that I have in mind dig deep into the JSON and fetch/filter/transform data. I don't want a few canned fields to play with, I want our users to be able to create AdHoc queries on the JSON using any and all of the data inside.

Here's a more realistic example. 'Txn' is a JSON field:

select TimeSpentQuerying(txn), PageType(txn), TotalRevenue(txn) from txn_logs where UserAgent(txn) like '%Safari%'

I have no problem writing the UDFs, I'm just trying to avoid parsing the JSON in each one of them - in other words, I want the input to the UDF's to not be a String (or Text) which I need to parse into JSON, but rather a JSON object.

Is that possible? Can I create new field types? And if so, where do I register them?

My fantasy is to be able to do this:

CREATE TABLE txn_logs (tid String, txn JSON);

Thanks,

Bobby



On Oct 2, 2009, at 10:54 PM, Zheng Shao wrote:

We have 2 example serdes, one for text data (regexserde), one for
binary data (thriftserde).

But the simplest solution for this is to add a udf get_json_objects
that returns a bunch of fields in an array. Then we can use sub query
to extract the array elements as individual elements, until we have
support for common expression extraction.


There are a few example udf in the ctrib package illustrating the way
to accept variable number of arguments.

Let us know how it goes.

Zheng



On 10/2/09, Bobby Rullo <[email protected]> wrote:
Hi there,

I want to create a new JSON Field/Column type. I know there exists
get_json_object(), but the things is I want to multiple JSON
operations in a single select statement and don't want to incur the
cost of parsing the JSON over and over again, because our json
structures are quite large.

Our schema is two fields - the first is a string of some sort used as
an id, and the second is a huge JSON structure. I want to be able
write sql like the following:

select Foo(json), Bar(json), Baz(json) where Quux(json) like '%hello%';

Foo(), Bar(), Baz() and Quux() are all UDF's that operate on json.
Right now what I'm doing is storing the JSON as a string and in each
of those UDF's I am de-serializing the json, and returning another
string.

Ideally I'd want to be able to return JSON objects from these UDF's so
I could compose my functions like "Foo(Bar(Baz(json))"

I'm thinking I need to create a SerDe but SerDe's seem to work at the
table level, not at the field level (in other words, I just need a
custom field, not a entire custom table type...or do I!?)

Also, if there's any docs on writing custom SerDe's please let me
know...I haven't been able to find much.

Thanks!

Bobby


--
Sent from Gmail for mobile | mobile.google.com

Yours,
Zheng




--
Yours,
Zheng




--
Yours,
Zheng

Reply via email to