[ 
https://issues.apache.org/jira/browse/HADOOP-17943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran reassigned HADOOP-17943:
---------------------------------------

    Assignee: Mehakmeet Singh

> Add s3a tool to convert S3 server logs to avro/csv files
> --------------------------------------------------------
>
>                 Key: HADOOP-17943
>                 URL: https://issues.apache.org/jira/browse/HADOOP-17943
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 3.3.2
>            Reporter: Steve Loughran
>            Assignee: Mehakmeet Singh
>            Priority: Major
>
> With S3A Auditing, we have code in hadoop-aws to parse s3 log entries, 
> including splitting up the referrer into its fields.
> But we don't have an easy way of using it. I've done some early work in Spark, 
> but that code doesn't work 
> ([https://github.com/hortonworks-spark/cloud-integration/blob/master/spark-cloud-integration/src/main/scala/com/cloudera/spark/cloud/s3/S3LogRecordParser.scala]),
>  and it doesn't do the audit splitting.
>  Also, given that the S3 audit logs can be small on a lightly loaded store, a 
> Spark job is not always justified.
> Proposed: we add
>  # a utility parser class to take a row and split it into a record
>  # which can be saved to avro through a schema we define
>  # or exported to CSV with/without headers (with headers: easy to understand; 
> without: files can be concatenated with cat)
>  # a mapper so this can be used in MR jobs (could even make it a committer 
> test ..)
>  # and a "hadoop s3guard/hadoop s3" entry point so you can do it on the CLI
> {code:java}
> hadoop s3 parselogs -format avro -out s3a://dest/path -recursive s3a://stevel-london/logs/bucket1/*
> {code}
> This would take all files under the path, load and parse them, and emit the output.
> Design issues:
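The per-row parser in step 1 could look something like the sketch below: a self-contained, hypothetical `S3LogLineParser` (not the existing hadoop-aws parser) that tokenizes one S3 server access log line into an ordered field map, treating quoted strings and bracketed timestamps as single fields. Class and field names are illustrative only.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: split one S3 server access log line into named fields. */
public class S3LogLineParser {

  // S3 access logs are space-separated, but "quoted strings" and
  // [bracketed timestamps] each count as a single field.
  private static final Pattern TOKEN =
      Pattern.compile("\"[^\"]*\"|\\[[^\\]]*\\]|\\S+");

  // Leading field names of the documented S3 access-log format;
  // later/unknown fields fall back to positional names.
  private static final String[] NAMES = {
      "owner", "bucket", "time", "remoteip", "requester", "requestid",
      "operation", "key", "requesturi", "status", "errorcode", "bytessent"
  };

  public static Map<String, String> parse(String line) {
    Map<String, String> record = new LinkedHashMap<>();
    Matcher m = TOKEN.matcher(line);
    int i = 0;
    while (m.find()) {
      String name = i < NAMES.length ? NAMES[i] : "field" + i;
      record.put(name, m.group());
      i++;
    }
    return record;
  }
}
```

A record in this shape can then be written out through whatever avro schema we settle on, or flattened to a CSV/TSV row.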
>  * Would you combine all files, or emit a new .avro or .csv file for each one?
>  * What's a good avro schema to cope with new context attributes?
>  * CSV nuances: tabs vs spaces as separators; use opencsv or implement the 
> escaping writer ourselves.
>  me: TSV, with a minimal escaping and quoting emitter of our own. We can use 
> opencsv in the test suite.
>  * Would you want an initial filter during processing, especially on status 
> codes?
>  me: no, though I could see the benefit for 503s. Best to let you load the 
> output into a notebook or spreadsheet and go from there.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
