[ http://issues.apache.org/jira/browse/HADOOP-342?page=comments#action_12419061 ]

Doug Cutting commented on HADOOP-342:
-------------------------------------

I will look at your patch more closely soon.

I think it would be good, rather than copying the logs into DFS, to use HTTP to 
retrieve the map input.  Ideally, map tasks would be assigned to nodes where 
the log data is local.

This could be implemented as an InputFormat that is parameterized by date.  For 
example, one might specify something like:

job.setInputFormat(LogInputFormat.class);
job.set("log.input.start", "2006-07-13 12:00:00");
job.set("log.input.end", "2006-07-13 15:00:00");

The set of hosts can be determined automatically to be all hosts in the 
cluster.  One could also specify a job id, in which case the job's start and 
end time would be used, or a start job id and end job id.

We might implement parts of this by enhancing the web server run on each 
tasktracker, e.g., to directly support access to logs by date range.
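For illustration, here is a minimal sketch of the date-range handling such a 
LogInputFormat would need, using the property values from the example above. 
The class and method names are mine, not part of any existing Hadoop API or 
attached patch:

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

// Hypothetical helper for the proposed LogInputFormat: parse the suggested
// "log.input.start"/"log.input.end" values and test whether a log record's
// timestamp falls inside the half-open range [start, end).
public class LogDateRange {
    private static final String FORMAT = "yyyy-MM-dd HH:mm:ss";
    private final Date start;
    private final Date end;

    public LogDateRange(String startSpec, String endSpec) {
        this.start = parse(startSpec);
        this.end = parse(endSpec);
    }

    /** True if the given timestamp lies in [start, end). */
    public boolean contains(String timestamp) {
        Date t = parse(timestamp);
        return !t.before(start) && t.before(end);
    }

    private static Date parse(String s) {
        try {
            return new SimpleDateFormat(FORMAT).parse(s);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Bad timestamp: " + s, e);
        }
    }
}
```

An InputFormat built this way would read the two properties from the JobConf, 
construct the range once, and use contains() to select which log records (or 
whole per-hour log files) belong to each split.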

Does this make sense?

> Design/Implement a tool to support archival and analysis of logfiles.
> ---------------------------------------------------------------------
>
>          Key: HADOOP-342
>          URL: http://issues.apache.org/jira/browse/HADOOP-342
>      Project: Hadoop
>         Type: New Feature

>     Reporter: Arun C Murthy
>  Attachments: logalyzer.patch
>
> Requirements:
>   a) Create a tool to support archival of logfiles (from diverse sources) in 
> Hadoop's DFS.
>   b) The tool should also support analysis of the logfiles via grep/sort 
> primitives. The tool should allow fairly generic grep patterns and let 
> users sort the matching lines on columns of their choice.
>   E.g., from Hadoop logs: look for all log lines containing 'FATAL' and sort 
> them first on the timestamp (column x) and then on column y.
> Design/Implementation:
>   a) Log Archival
>     Archival of logs from diverse sources can be accomplished using the 
> *distcp* tool (HADOOP-341).
>   
>   b) Log analysis
>     The idea is to enable users of the tool to perform analysis of logs via 
> grep/sort primitives.
>     This can be accomplished via a relatively simple Map-Reduce job where 
> the map does the *grep* for the given pattern via RegexMapper and the 
> implicit *sort* (reduce) uses a custom Comparator which performs the 
> user-specified column comparison.
>     The sort/grep specs can be fairly powerful by letting the user of the 
> tool use java's in-built regex patterns (java.util.regex).
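The grep/sort design quoted above can be sketched outside Map-Reduce as plain 
Java: a java.util.regex filter standing in for RegexMapper, and a column 
Comparator standing in for the custom reduce-side comparator. Class and method 
names here are illustrative, not taken from the attached patch:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Pattern;

// Sketch of the grep/sort primitives from the issue description.
public class LogGrepSort {

    /** Keep only the lines that match the given regex (the "grep" step). */
    public static List<String> grep(List<String> lines, String regex) {
        Pattern p = Pattern.compile(regex);
        List<String> out = new ArrayList<String>();
        for (String line : lines) {
            if (p.matcher(line).find()) {
                out.add(line);
            }
        }
        return out;
    }

    /**
     * Comparator ordering lines on whitespace-separated column x, then
     * column y (0-based) -- the "sort" step's user-specified comparison.
     */
    public static Comparator<String> byColumns(final int x, final int y) {
        return new Comparator<String>() {
            public int compare(String a, String b) {
                String[] ca = a.split("\\s+");
                String[] cb = b.split("\\s+");
                int c = ca[x].compareTo(cb[x]);
                return c != 0 ? c : ca[y].compareTo(cb[y]);
            }
        };
    }
}
```

E.g., Collections.sort(grep(lines, "FATAL"), byColumns(0, 1)) would order the 
FATAL lines on date, then time -- the same comparison a reduce-side Comparator 
would perform in the Map-Reduce version.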

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
