Sub query is NOT going to create additional jobs. Hive is smart enough to compile the query into optimized plans, and a lot of obvious optimizations are already done.
Yes that's what I mean. The functionality (for UDF to return arbitrary class) is NOT there yet. Actually there is an even simpler approach without the need of modifying the Hive framework: Just create a global String to JSON conversion cache (using LRU algorithm). The size of the cache can be set to something like 2. This makes sure the conversion from String to JSON will only happen once for each row. For example code of LRU algorithm, take a look at: http://www.source-code.biz/snippets/java/6.htm Inside each UDF like TimeSpentQuerying(String), we just call "JSON myjson = myLRUCache.get(myString);". How do you like this one? heng On Sat, Oct 3, 2009 at 6:44 PM, Bobby Rullo <[email protected]> wrote: > > On Oct 3, 2009, at 1:28 AM, Zheng Shao wrote: > > I got it. You mean TimeSpentQuerying, PageType, TotalRevenue and UserAgent > are all UDFs that takes a JSON object and outputs a STRING. > > Exactly! > > Allowing such a new object type just for UDFs are simpler than supporting a > new type in all parts of the system, because we don't need to care about > serialization/deserialization. > > > Interesting. So you're saying it's ok to return whatever type from > UDF.evaluate(), as long as it doesn't get serialized (i.e. as long as it's > not the final output)? > > > If we go that route, the final query will look like this. Note that > StringToJson takes a STRING and returns a JSON, and TimeSpentQuerying (etc) > takes a JSON and returns either STRING or other types that are supported by > Hive. > > SELECT TimeSpentQuerying(txnjson), PageType(txnjson), TotalRevenue(txnjson) > from (SELECT StringToJson(txn) as txnjson FROM txn_logs) tnxjson_logs > where UserAgent(txnjson) like '%Safari%'; > > Is this good enough? > > > That's pretty damn good. I like that all I have to do is create one more > UDF which does the JSON parsing (and maybe JsonToString as well.) > > The only thing that I'm concerned about is whether that sub-query is going > to create an extra map job. Is the planner smart enough to stick stick all > that work in one map? > > > Thanks! > > bobby > > > Zheng > > On Fri, Oct 2, 2009 at 11:50 PM, Bobby Rullo <[email protected]> wrote: > >> Zheng, >> Thanks for your reply. >> >> I'm not sure what this "get_json_objects" function would do...do you mean >> it would return a canned set of fields? That's not what I'm interested in. >> The UDFs that I have in mind dig deep into the JSON and >> fetch/filter/transform data. I don't want a few canned fields to play with, >> I want our users to be able to create AdHoc queries on the JSON using any >> and all of the data inside. >> >> Here's a more realistic example. 'Txn' is a JSON field: >> >> select TimeSpentQuerying(txn), PageType(txn), TotalRevenue(txn) from >> txn_logs where UserAgent(txn) like '%Safari%' >> >> I have no problem writing the UDFs, I'm just trying to avoid parsing the >> JSON in each one of them - in other words, I want the input to the UDF's to >> not be a String (or Text) which I need to parse into JSON, but rather a JSON >> object. >> >> Is that possible? Can I create new field types? And if so, where do I >> register them? >> >> My fantasy is to be able to do this: >> >> CREATE TABLE txn_logs (tid String, txn JSON); >> >> Thanks, >> >> Bobby >> >> >> >> On Oct 2, 2009, at 10:54 PM, Zheng Shao wrote: >> >> We have 2 example serdes, one for text data (regexserde), one for >> binary data (thriftserde). >> >> But the simplest solution for this is to add a udf get_json_objects >> that returns a bunch of fields in an array. Then we can use sub query >> to extract the array elements as individual elements, until we have >> support for common expression extraction. >> >> >> There are a few example udf in the ctrib package illustrating the way >> to accept variable number of arguments. >> >> Let us know how it goes. >> >> Zheng >> >> >> >> On 10/2/09, Bobby Rullo <[email protected]> wrote: >> >> Hi there, >> >> >> I want to create a new JSON Field/Column type. I know there exists >> >> get_json_object(), but the things is I want to multiple JSON >> >> operations in a single select statement and don't want to incur the >> >> cost of parsing the JSON over and over again, because our json >> >> structures are quite large. >> >> >> Our schema is two fields - the first is a string of some sort used as >> >> an id, and the second is a huge JSON structure. I want to be able >> >> write sql like the following: >> >> >> select Foo(json), Bar(json), Baz(json) where Quux(json) like '%hello%'; >> >> >> Foo(), Bar(), Baz() and Quux() are all UDF's that operate on json. >> >> Right now what I'm doing is storing the JSON as a string and in each >> >> of those UDF's I am de-serializing the json, and returning another >> >> string. >> >> >> Ideally I'd want to be able to return JSON objects from these UDF's so >> >> I could compose my functions like "Foo(Bar(Baz(json))" >> >> >> I'm thinking I need to create a SerDe but SerDe's seem to work at the >> >> table level, not at the field level (in other words, I just need a >> >> custom field, not a entire custom table type...or do I!?) >> >> >> Also, if there's any docs on writing custom SerDe's please let me >> >> know...I haven't been able to find much. >> >> >> Thanks! >> >> >> Bobby >> >> >> >> -- >> Sent from Gmail for mobile | mobile.google.com >> >> Yours, >> Zheng >> >> >> > > > -- > Yours, > Zheng > > > -- Yours, Zheng
