I need advice on the design of a loader function.

I wrote a loader function LTSVLoader for PiggyBank to load LTSV files.

https://issues.apache.org/jira/browse/PIG-3215

LTSV (Labeled Tab-Separated Values) is a variant of TSV. Each column in an LTSV file
consists of "<label>:<value>", and columns are separated by tabs.
For example:

{data}
host:pc.example.com req:GET /index.html ua:Opera/9.80
host:user.example.net req:GET /favicon.ico
ua:Mozilla/5.0 req:GET /news.html host:workstation.example.org
{/data}
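
To make the format concrete, here is a minimal sketch in plain Java of how one such line could be split into label/value pairs. The class name LtsvLineParser is hypothetical and is not part of PiggyBank or of the patch in PIG-3215. It assumes columns are tab-separated and that only the first colon in a column separates the label from the value.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper (not part of PiggyBank): splits one LTSV line into label/value pairs. */
final class LtsvLineParser {
    private LtsvLineParser() {}

    static Map<String, String> parse(String line) {
        // LinkedHashMap keeps the columns in the order they appear on the line.
        Map<String, String> fields = new LinkedHashMap<>();
        // Columns are separated by tabs; the -1 limit keeps trailing empty columns.
        for (String column : line.split("\t", -1)) {
            // Only the first colon separates label from value,
            // so a value such as "http://example.com" keeps its own colons.
            int colon = column.indexOf(':');
            if (colon <= 0) {
                continue; // skip columns with no colon or with an empty label
            }
            fields.put(column.substring(0, colon), column.substring(colon + 1));
        }
        return fields;
    }
}
```

With this sketch, parsing the second sample line would yield a map containing "host" and "req" but no "ua" entry, which is why a loader has to decide what to emit for labels that are missing from a line.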

The function is intended to load an arbitrary set of columns
specified by label, for example "host" and "ua".
Ideally, I would like to use the function as follows.

{code}
-- List 1
log = LOAD 'access.log' USING LTSVLoader() AS (host:chararray, ua:chararray);

DUMP log;
-- (pc.example.com,Opera/9.80)
-- (user.example.net,)
-- (workstation.example.org,Mozilla/5.0)
{/code}

However, I found that it is impossible to implement the function that way
because, as far as I know, a loader function has no way to obtain the schema
specified in the AS clause.

Thus, the current implementation takes the schema
as a constructor argument instead of from an AS clause.

{code}
-- List 2
log = LOAD 'access.log' USING LTSVLoader('host:chararray, ua:chararray');
...
{/code}
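
For illustration only, the constructor-argument approach can be sketched in plain Java without any Pig classes. The class LtsvProjection below is hypothetical and is not the code attached to PIG-3215. It assumes the schema string is a comma-separated list of "label:type" entries, ignores the types, and returns null for labels that a line does not contain.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: projects LTSV lines onto the labels named in a schema string. */
final class LtsvProjection {
    private final List<String> labels = new ArrayList<>();

    /** @param schema for example "host:chararray, ua:chararray"; types are ignored here. */
    LtsvProjection(String schema) {
        for (String field : schema.split(",")) {
            String trimmed = field.trim();
            if (trimmed.isEmpty()) {
                continue; // tolerate stray commas
            }
            int colon = trimmed.indexOf(':');
            // Keep only the label part; an entry without a type is taken as a bare label.
            labels.add(colon < 0 ? trimmed : trimmed.substring(0, colon).trim());
        }
    }

    /** Returns the values for the requested labels, in schema order; null where a label is missing. */
    List<String> project(String line) {
        Map<String, String> fields = new HashMap<>();
        for (String column : line.split("\t", -1)) {
            int colon = column.indexOf(':');
            if (colon > 0) {
                fields.put(column.substring(0, colon), column.substring(colon + 1));
            }
        }
        List<String> row = new ArrayList<>(labels.size());
        for (String label : labels) {
            row.add(fields.get(label)); // null when the line has no such label
        }
        return row;
    }
}
```

A real loader would parse the string with Pig's own schema parser and return Pig tuples rather than Java lists; the sketch only shows why the label list must be known before the first line is read.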

It works, but I think specifying a schema as a string
is not the usual way in Pig,
although JsonLoader in PiggyBank is implemented that way.

So my question is,
is there a better way to get an input schema specified in a script?
