I need advice on the design of a loader function. I wrote a loader function, LTSVLoader, for PiggyBank to load LTSV files.
https://issues.apache.org/jira/browse/PIG-3215

LTSV is a variant of TSV. Each column in an LTSV file consists of "<label>:<value>". For example:

{data}
host:pc.example.com	req:GET /index.html	ua:Opera/9.80
host:user.example.net	req:GET /favicon.ico
ua:Mozilla/5.0	req:GET /news.html	host:workstation.example.org
{/data}

The function is intended to load an arbitrary set of columns specified by label, for example "host" and "ua". I think it would be good if I could use the function as below.

{code}
-- List 1
log = LOAD 'access.log' USING LTSVLoader() AS (host:chararray, ua:chararray);
DUMP log;
-- {pc.example.com,Opera/9.80}
-- {user.example.net,}
-- {workstation.example.org,Mozilla/5.0}
{/code}

However, I found it is impossible to implement the function that way because, as far as I know, a loader function has no way to get the input schema specified by the AS clause. Thus, the current implementation takes the input schema as a constructor argument instead of an AS clause.

{code}
-- List 2
log = LOAD 'access.log' USING LTSVLoader('host:chararray, ua:chararray');
...
{/code}

It works, but I think specifying a schema as a string is not the usual way in Pig, although JsonLoader in PiggyBank is implemented that way. So my question is: is there a better way to get an input schema specified in a script?
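To make the intended behavior concrete, here is a minimal sketch in plain Java of the projection that List 1 expects: split each record on tabs, split each column at its first colon, and pick out the requested labels regardless of their order in the line. This is not the code from PIG-3215 and it does not use any Pig API. The class name LtsvProjection and the methods parseLine and project are hypothetical names chosen for illustration, and the three records are split so that they match the DUMP output in List 1.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Pig or PiggyBank.
public class LtsvProjection {

    // Split one LTSV record into a label -> value map.
    public static Map<String, String> parseLine(String line) {
        Map<String, String> fields = new HashMap<>();
        for (String column : line.split("\t")) {
            // Only the first ':' separates the label from the value,
            // so values may themselves contain ':'.
            int sep = column.indexOf(':');
            if (sep <= 0) {
                continue; // skip columns that have no label
            }
            fields.put(column.substring(0, sep), column.substring(sep + 1));
        }
        return fields;
    }

    // Return the values for the requested labels, in the requested order.
    // A label that is absent from the record yields null.
    public static List<String> project(String line, List<String> labels) {
        Map<String, String> fields = parseLine(line);
        List<String> values = new ArrayList<>();
        for (String label : labels) {
            values.add(fields.get(label));
        }
        return values;
    }

    public static void main(String[] args) {
        List<String> labels = Arrays.asList("host", "ua");
        String[] records = {
            "host:pc.example.com\treq:GET /index.html\tua:Opera/9.80",
            "host:user.example.net\treq:GET /favicon.ico",
            "ua:Mozilla/5.0\treq:GET /news.html\thost:workstation.example.org"
        };
        for (String record : records) {
            System.out.println(project(record, labels));
        }
    }
}
```

The list of labels passed to project plays the role of the schema here. In the current implementation that list would come from the constructor argument shown in List 2. The open question is whether the loader can obtain it from the script some other way.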
