I think I might like this approach the best!
Thanks for all your advice....going off to experiment.
Bobby
On Oct 4, 2009, at 4:38 AM, Zheng Shao wrote:
Sub query is NOT going to create additional jobs. Hive is smart
enough to compile the query into optimized plans, and a lot of
obvious optimizations are already done.
Yes that's what I mean. The functionality (for UDF to return
arbitrary class) is NOT there yet.
Actually there is an even simpler approach without the need of
modifying the Hive framework:
Just create a global String to JSON conversion cache (using LRU
algorithm).
The size of the cache can be set to something like 2.
This makes sure the conversion from String to JSON will only happen
once for each row.
For example code of LRU algorithm, take a look at:
http://www.source-code.biz/snippets/java/6.htm
Inside each UDF like TimeSpentQuerying(String), we just call "JSON
myjson = myLRUCache.get(myString);".
How do you like this one?
heng
On Sat, Oct 3, 2009 at 6:44 PM, Bobby Rullo <[email protected]> wrote:
On Oct 3, 2009, at 1:28 AM, Zheng Shao wrote:
I got it. You mean TimeSpentQuerying, PageType, TotalRevenue and
UserAgent are all UDFs that takes a JSON object and outputs a STRING.
Exactly!
Allowing such a new object type just for UDFs are simpler than
supporting a new type in all parts of the system, because we don't
need to care about serialization/deserialization.
Interesting. So you're saying it's ok to return whatever type from
UDF.evaluate(), as long as it doesn't get serialized (i.e. as long
as it's not the final output)?
If we go that route, the final query will look like this. Note that
StringToJson takes a STRING and returns a JSON, and
TimeSpentQuerying (etc) takes a JSON and returns either STRING or
other types that are supported by Hive.
SELECT TimeSpentQuerying(txnjson), PageType(txnjson),
TotalRevenue(txnjson) from (SELECT StringToJson(txn) as txnjson
FROM txn_logs) tnxjson_logs
where UserAgent(txnjson) like '%Safari%';
Is this good enough?
That's pretty damn good. I like that all I have to do is create one
more UDF which does the JSON parsing (and maybe JsonToString as well.)
The only thing that I'm concerned about is whether that sub-query is
going to create an extra map job. Is the planner smart enough to
stick stick all that work in one map?
Thanks!
bobby
Zheng
On Fri, Oct 2, 2009 at 11:50 PM, Bobby Rullo <[email protected]>
wrote:
Zheng,
Thanks for your reply.
I'm not sure what this "get_json_objects" function would do...do
you mean it would return a canned set of fields? That's not what
I'm interested in. The UDFs that I have in mind dig deep into the
JSON and fetch/filter/transform data. I don't want a few canned
fields to play with, I want our users to be able to create AdHoc
queries on the JSON using any and all of the data inside.
Here's a more realistic example. 'Txn' is a JSON field:
select TimeSpentQuerying(txn), PageType(txn), TotalRevenue(txn)
from txn_logs where UserAgent(txn) like '%Safari%'
I have no problem writing the UDFs, I'm just trying to avoid
parsing the JSON in each one of them - in other words, I want the
input to the UDF's to not be a String (or Text) which I need to
parse into JSON, but rather a JSON object.
Is that possible? Can I create new field types? And if so, where do
I register them?
My fantasy is to be able to do this:
CREATE TABLE txn_logs (tid String, txn JSON);
Thanks,
Bobby
On Oct 2, 2009, at 10:54 PM, Zheng Shao wrote:
We have 2 example serdes, one for text data (regexserde), one for
binary data (thriftserde).
But the simplest solution for this is to add a udf get_json_objects
that returns a bunch of fields in an array. Then we can use sub
query
to extract the array elements as individual elements, until we have
support for common expression extraction.
There are a few example udf in the ctrib package illustrating the
way
to accept variable number of arguments.
Let us know how it goes.
Zheng
On 10/2/09, Bobby Rullo <[email protected]> wrote:
Hi there,
I want to create a new JSON Field/Column type. I know there exists
get_json_object(), but the things is I want to multiple JSON
operations in a single select statement and don't want to incur the
cost of parsing the JSON over and over again, because our json
structures are quite large.
Our schema is two fields - the first is a string of some sort
used as
an id, and the second is a huge JSON structure. I want to be able
write sql like the following:
select Foo(json), Bar(json), Baz(json) where Quux(json) like
'%hello%';
Foo(), Bar(), Baz() and Quux() are all UDF's that operate on json.
Right now what I'm doing is storing the JSON as a string and in
each
of those UDF's I am de-serializing the json, and returning another
string.
Ideally I'd want to be able to return JSON objects from these
UDF's so
I could compose my functions like "Foo(Bar(Baz(json))"
I'm thinking I need to create a SerDe but SerDe's seem to work at
the
table level, not at the field level (in other words, I just need a
custom field, not a entire custom table type...or do I!?)
Also, if there's any docs on writing custom SerDe's please let me
know...I haven't been able to find much.
Thanks!
Bobby
--
Sent from Gmail for mobile | mobile.google.com
Yours,
Zheng
--
Yours,
Zheng
--
Yours,
Zheng