MIYAKAWA Taku created PIG-3215:
----------------------------------

             Summary: [piggybank] Add LTSVLoader to load LTSV (Labeled 
Tab-separated Values) files
                 Key: PIG-3215
                 URL: https://issues.apache.org/jira/browse/PIG-3215
             Project: Pig
          Issue Type: New Feature
          Components: piggybank
            Reporter: MIYAKAWA Taku


LTSV, or Labeled Tab-separated Values format is now getting popular in Japan 
for log files, especially of web servers. The goal of this jira is to add 
LTSVLoader in PiggyBank to load LTSV files.

LTSV is based on TSV thus columns are separated by tab characters. Additionally 
each of columns includes a label and a value, separated by ":" character.

Read about LTSV on http://ltsv.org/.

h4. Example LTSV file (access.log)

Columns are separated by tab characters.

{noformat}
host:host1.example.org  req:GET /index.html     ua:Opera/9.80
host:host1.example.org  req:GET /favicon.ico    ua:Opera/9.80
host:pc.example.com     req:GET /news.html      ua:Mozilla/5.0
{noformat}

h4. Usage 1: Extract fields from each line

Users can specify an input schema and get columns as Pig fields.

This example loads the LTSV file shown in the previous section.

{code}
-- Parses the access log and count the number of lines
-- for each pair of the host column and the ua column.
access = LOAD 'access.log' USING 
org.apache.pig.piggybank.storage.LTSVLoader('host:chararray, ua:chararray');
grouped_access = GROUP access BY (host, ua);
count_for_host_ua = FOREACH grouped_access GENERATE group.host, group.ua, 
COUNT(access);
DUMP count_for_host_ua;
{code}

The below text will be printed out.

{noformat}
(host1.example.org,Opera/9.80,2)
(pc.example.com,Firefox/5.0,1)
{noformat}

h4. Usage 2: Extract a map from each line

Users can get a map for each LTSV line. The key of a map is a label of the LTSV 
column. The value of a map comes from characters after ":" in the LTSV column.

{code}
-- Parses the access log and projects the user agent field.
access = LOAD 'access.log' USING org.apache.pig.piggybank.storage.LTSVLoader() 
AS (m:map[]);
user_agent = FOREACH access GENERATE m#'ua' AS ua;
DUMP user_agent;
{code}

The below text will be printed out.

{noformat}
(Opera/9.80)
(Opera/9.80)
(Firefox/5.0)
{noformat}


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to