[ 
https://issues.apache.org/jira/browse/PIG-3215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-3215:
----------------------------

    Status: Open  (was: Patch Available)
    
> [piggybank] Add LTSVLoader to load LTSV (Labeled Tab-separated Values) files
> ----------------------------------------------------------------------------
>
>                 Key: PIG-3215
>                 URL: https://issues.apache.org/jira/browse/PIG-3215
>             Project: Pig
>          Issue Type: New Feature
>          Components: piggybank
>            Reporter: MIYAKAWA Taku
>            Assignee: MIYAKAWA Taku
>              Labels: piggybank
>         Attachments: LTSVLoader-6.html, LTSVLoader.html, PIG-3215-6.patch, 
> PIG-3215.patch
>
>
> LTSV (Labeled Tab-separated Values) is a format that is becoming popular in 
> Japan for log files, especially web server logs. The goal of this issue is to 
> add an LTSVLoader to PiggyBank for loading LTSV files.
> LTSV is based on TSV, so columns are separated by tab characters. 
> In addition, each column consists of a label and a value, separated by a ":" 
> character.
> Read more about LTSV at http://ltsv.org/.
> h4. Example LTSV file (access.log)
> Columns are separated by tab characters.
> {noformat}
> host:host1.example.org        req:GET /index.html     ua:Opera/9.80
> host:host1.example.org        req:GET /favicon.ico    ua:Opera/9.80
> host:pc.example.com   req:GET /news.html      ua:Mozilla/5.0
> {noformat}
> h4. Usage 1: Extract fields from each line
> Users can specify an input schema and get columns as Pig fields.
> This example loads the LTSV file shown in the previous section.
> {code}
> -- Parses the access log and counts the number of lines
> -- for each pair of the host column and the ua column.
> access = LOAD 'access.log' USING 
> org.apache.pig.piggybank.storage.LTSVLoader('host:chararray, ua:chararray');
> grouped_access = GROUP access BY (host, ua);
> count_for_host_ua = FOREACH grouped_access GENERATE group.host, group.ua, 
> COUNT(access);
> DUMP count_for_host_ua;
> {code}
> The following output is printed.
> {noformat}
> (host1.example.org,Opera/9.80,2)
> (pc.example.com,Mozilla/5.0,1)
> {noformat}
> h4. Usage 2: Extract a map from each line
> Users can get a map for each LTSV line. Each map key is the label of an LTSV 
> column, and the corresponding value is the characters that follow the ":" 
> separator in that column.
> {code}
> -- Parses the access log and projects the user agent field.
> access = LOAD 'access.log' USING 
> org.apache.pig.piggybank.storage.LTSVLoader() AS (m:map[]);
> user_agent = FOREACH access GENERATE m#'ua' AS ua;
> DUMP user_agent;
> {code}
> The following output is printed.
> {noformat}
> (Opera/9.80)
> (Opera/9.80)
> (Mozilla/5.0)
> {noformat}
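> The two usages can also be combined. The following is a minimal sketch, not 
> taken from the attached patch: it assumes the loader behaves as described 
> above, that piggybank.jar has already been registered, and that map values 
> can be cast to chararray. The relation names are illustrative only.
> {code}
> -- Loads each LTSV line as a map, then counts the lines for each host.
> -- Assumption: LTSVLoader() returns one map per line, as in Usage 2.
> access = LOAD 'access.log' USING 
> org.apache.pig.piggybank.storage.LTSVLoader() AS (m:map[]);
> -- Values in an untyped map are bytearrays, so cast before grouping.
> hosts = FOREACH access GENERATE (chararray) m#'host' AS host;
> grouped_hosts = GROUP hosts BY host;
> count_for_host = FOREACH grouped_hosts GENERATE group AS host, 
> COUNT(hosts) AS n;
> DUMP count_for_host;
> {code}
> The map form may be convenient when the set of labels varies from line to 
> line, since no schema has to be declared up front.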

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
