[ https://issues.apache.org/jira/browse/HADOOP-17943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Steve Loughran reassigned HADOOP-17943:
---------------------------------------

    Assignee: Mehakmeet Singh

> Add s3a tool to convert S3 server logs to avro/csv files
> --------------------------------------------------------
>
>                 Key: HADOOP-17943
>                 URL: https://issues.apache.org/jira/browse/HADOOP-17943
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 3.3.2
>            Reporter: Steve Loughran
>            Assignee: Mehakmeet Singh
>            Priority: Major
>
> Add s3a tool to convert S3 server logs to avro/csv files
>
> With S3A auditing, we have code in hadoop-aws to parse S3 log entries,
> including splitting the referrer header up into its fields.
> But we don't have an easy way of using it. I've done some early work in
> Spark
> ([https://github.com/hortonworks-spark/cloud-integration/blob/master/spark-cloud-integration/src/main/scala/com/cloudera/spark/cloud/s3/S3LogRecordParser.scala]),
> but as well as that code not working, it doesn't do the audit splitting.
> And, given that the S3 audit logs can be small on a lightly loaded store, a
> Spark job is not always justified.
> Proposed: we add
> # a utility parser class to take a row and split it into a record
> # which can be saved to avro through a schema we define
> # or exported to CSV with/without headers (with: easy to understand;
> without: you can cat files)
> # a mapper so this can be used in MR jobs (could even make it a committer
> test ...)
> # a "hadoop s3guard"/"hadoop s3" entry point so you can do it on the CLI
> {code:java}
> hadoop s3 parselogs -format avro -out s3a://dest/path -recursive s3a://stevel-london/logs/bucket1/*
> {code}
> This would take all files under the path, then load, parse and emit the
> output.
> Design issues:
> * would you combine all files, or emit a new .avro or .csv file for each
> one?
> * what's a good avro schema to cope with new context attributes?
> * CSV nuances: tabs vs spaces; use opencsv, or implement the (escaping)
> writer ourselves?
> me: TSV, and do a minimal escaping and quoting emitter. We can use opencsv
> in the test suite.
> * would you want an initial filter during processing? especially for exit
> codes?
> me: no, though I could see the benefit for 503s. Best to let you load it
> into a notebook or spreadsheet and go from there.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org
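To make item 1 of the proposal concrete, here is a minimal sketch of what the row-splitting step could look like. The class and method names are hypothetical, not the existing hadoop-aws parser; it only illustrates the three token shapes in an S3 server access log row: plain fields, [bracketed] timestamps, and "quoted" strings such as the request URI and referrer.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical sketch: tokenize one S3 server access log row into its
 * fields. Not the hadoop-aws implementation; illustration only.
 */
public class S3LogRowParser {

  public static List<String> tokenize(String row) {
    List<String> fields = new ArrayList<>();
    int i = 0;
    final int n = row.length();
    while (i < n) {
      char c = row.charAt(i);
      if (c == ' ') {                       // skip field separators
        i++;
        continue;
      }
      int end;
      if (c == '"') {                       // quoted field, e.g. request URI or referrer
        end = row.indexOf('"', i + 1);
        fields.add(row.substring(i + 1, end));
        i = end + 1;
      } else if (c == '[') {                // bracketed field, i.e. the timestamp
        end = row.indexOf(']', i + 1);
        fields.add(row.substring(i + 1, end));
        i = end + 1;
      } else {                              // plain space-delimited field
        end = row.indexOf(' ', i);
        if (end < 0) {
          end = n;
        }
        fields.add(row.substring(i, end));
        i = end;
      }
    }
    return fields;
  }
}
```

A real parser would then map the token positions onto named record fields and split the referrer into its audit attributes; this sketch deliberately stops at tokenization.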
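For the "minimal escaping and quoting emitter" suggested for TSV output above, one assumed scheme (backslash-escaping tabs, newlines and backslashes, so each record stays on one line and fields never contain a raw tab) could be as small as this; the class name is hypothetical:

```java
import java.util.List;
import java.util.stream.Collectors;

/**
 * Hypothetical minimal TSV emitter: backslash-escape the characters that
 * would break the one-record-per-line, tab-separated layout.
 */
public class TsvEmitter {

  public static String escape(String field) {
    StringBuilder sb = new StringBuilder(field.length());
    for (int i = 0; i < field.length(); i++) {
      char c = field.charAt(i);
      switch (c) {
        case '\t': sb.append("\\t");  break;
        case '\n': sb.append("\\n");  break;
        case '\r': sb.append("\\r");  break;
        case '\\': sb.append("\\\\"); break;
        default:   sb.append(c);
      }
    }
    return sb.toString();
  }

  /** Join one record's fields into a single escaped TSV row. */
  public static String row(List<String> fields) {
    return fields.stream()
        .map(TsvEmitter::escape)
        .collect(Collectors.joining("\t"));
  }
}
```

With this scheme there is no quoting at all, which keeps the emitter trivial and the output cat-able; opencsv in the test suite can cross-check the round trip.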
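On the open question of an Avro schema that copes with new context attributes: one option is a record with the stable log fields typed explicitly, plus a map of strings for the audit attributes split out of the referrer, so new attributes need no schema change. The field names and namespace below are illustrative, not a committed schema:

```json
{
  "type": "record",
  "name": "S3LogEntry",
  "namespace": "org.apache.hadoop.fs.s3a.audit",
  "fields": [
    {"name": "timestamp",  "type": "string"},
    {"name": "remoteip",   "type": ["null", "string"], "default": null},
    {"name": "operation",  "type": ["null", "string"], "default": null},
    {"name": "key",        "type": ["null", "string"], "default": null},
    {"name": "httpstatus", "type": ["null", "int"],    "default": null},
    {"name": "referrer",   "type": ["null", "string"], "default": null},
    {"name": "context",
     "type": {"type": "map", "values": "string"},
     "doc": "audit attributes split from the referrer; unknown attributes land here without a schema change"}
  ]
}
```

The trade-off is that map values are untyped strings, so anything that should be queried numerically (e.g. a duration attribute) would eventually want promotion to a typed field.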