[ https://issues.apache.org/jira/browse/PIG-3215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alan Gates updated PIG-3215: ---------------------------- Status: Open (was: Patch Available) > [piggybank] Add LTSVLoader to load LTSV (Labeled Tab-separated Values) files > ---------------------------------------------------------------------------- > > Key: PIG-3215 > URL: https://issues.apache.org/jira/browse/PIG-3215 > Project: Pig > Issue Type: New Feature > Components: piggybank > Reporter: MIYAKAWA Taku > Assignee: MIYAKAWA Taku > Labels: piggybank > Attachments: LTSVLoader-6.html, LTSVLoader.html, PIG-3215-6.patch, > PIG-3215.patch > > > LTSV, or Labeled Tab-separated Values format is now getting popular in Japan > for log files, especially of web servers. The goal of this jira is to add > LTSVLoader in PiggyBank to load LTSV files. > LTSV is based on TSV thus columns are separated by tab characters. > Additionally each of columns includes a label and a value, separated by ":" > character. > Read about LTSV on http://ltsv.org/. > h4. Example LTSV file (access.log) > Columns are separated by tab characters. > {noformat} > host:host1.example.org req:GET /index.html ua:Opera/9.80 > host:host1.example.org req:GET /favicon.ico ua:Opera/9.80 > host:pc.example.com req:GET /news.html ua:Mozilla/5.0 > {noformat} > h4. Usage 1: Extract fields from each line > Users can specify an input schema and get columns as Pig fields. > This example loads the LTSV file shown in the previous section. > {code} > -- Parses the access log and count the number of lines > -- for each pair of the host column and the ua column. > access = LOAD 'access.log' USING > org.apache.pig.piggybank.storage.LTSVLoader('host:chararray, ua:chararray'); > grouped_access = GROUP access BY (host, ua); > count_for_host_ua = FOREACH grouped_access GENERATE group.host, group.ua, > COUNT(access); > DUMP count_for_host_ua; > {code} > The below text will be printed out. > {noformat} > (host1.example.org,Opera/9.80,2) > (pc.example.com,Firefox/5.0,1) > {noformat} > h4. Usage 2: Extract a map from each line > Users can get a map for each LTSV line. The key of a map is a label of the > LTSV column. The value of a map comes from characters after ":" in the LTSV > column. > {code} > -- Parses the access log and projects the user agent field. > access = LOAD 'access.log' USING > org.apache.pig.piggybank.storage.LTSVLoader() AS (m:map[]); > user_agent = FOREACH access GENERATE m#'ua' AS ua; > DUMP user_agent; > {code} > The below text will be printed out. > {noformat} > (Opera/9.80) > (Opera/9.80) > (Firefox/5.0) > {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira