[ 
https://issues.apache.org/jira/browse/HADOOP-2501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12559474#action_12559474
 ] 

Enis Soztutar commented on HADOOP-2501:
---------------------------------------

Initially, I intend to develop the following tools for sequence files:

- info : print information about the file, including its header 
information (key class, value class, compression, etc.)
- dump : dump the contents of the file to a text file by calling toString() 
on the keys and values
- head : print the first n lines of the dump
- get : get the value for the given key. Essentially, we will provide a 
method 
{code}
   Writable get(WritableComparable key) { ... }
{code}
which calls the equals() method on the key. However, to be useful as a 
command-line utility, we will also add a command-line option 
{noformat}
  bin/hadoop  seq -get <key>
{noformat}
which does the comparison on string values instead. 
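
As a rough illustration of the string-based lookup, here is a minimal sketch in plain Java. It does not use the Hadoop API: the records are modeled as an in-memory list of key/value pairs standing in for a sequence file, and the names SeqGetSketch and getByString are hypothetical. The real tool would read the records from the file and would likely differ in detail.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class SeqGetSketch {

    /**
     * Linear scan over the records: returns the value of the first record
     * whose key's toString() equals keyText, or null if no key matches.
     * The Iterable is an assumed stand-in for a sequence file reader.
     */
    public static <K, V> V getByString(Iterable<Map.Entry<K, V>> records, String keyText) {
        for (Map.Entry<K, V> record : records) {
            if (record.getKey().toString().equals(keyText)) {
                return record.getValue();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Hypothetical sample data in place of a real sequence file.
        List<Map.Entry<Integer, String>> records = new ArrayList<>();
        records.add(new SimpleEntry<>(1, "one"));
        records.add(new SimpleEntry<>(42, "forty-two"));

        // The key arrives as a string, as it would from the command line.
        System.out.println(getByString(records, "42"));
    }
}
```

Comparing on toString() means the command line never has to construct a key object of the right class, at the cost of a full scan and of depending on how each key class formats itself.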

- filter <filter> : filter the input, keeping only the entries that pass the 
filter (fix and use https://issues.apache.org/jira/browse/HADOOP-449)
- stats : give statistics about the file, such as number of records, average 
key length, average value length, longest key, shortest key, ... more?
- sort : sort the file
- merge : merge multiple files
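
To make the proposed stats output concrete, here is a minimal single-pass sketch in plain Java. It is only an assumption about how such statistics could be accumulated: it does not use the Hadoop API, it measures lengths on the string form of keys and values rather than on their serialized bytes, and the names SeqStatsSketch, Stats, and collect are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

public class SeqStatsSketch {

    /** Running totals, updated once per record. */
    public static final class Stats {
        public long records;
        public long totalKeyLength;
        public long totalValueLength;
        public String longestKey;
        public String shortestKey;

        public void add(String key, String value) {
            records++;
            totalKeyLength += key.length();
            totalValueLength += value.length();
            // On ties, the first key seen is kept.
            if (longestKey == null || key.length() > longestKey.length()) {
                longestKey = key;
            }
            if (shortestKey == null || key.length() < shortestKey.length()) {
                shortestKey = key;
            }
        }

        public double averageKeyLength() {
            return records == 0 ? 0.0 : (double) totalKeyLength / records;
        }

        public double averageValueLength() {
            return records == 0 ? 0.0 : (double) totalValueLength / records;
        }
    }

    /** Each element is a {key, value} pair standing in for one record. */
    public static Stats collect(List<String[]> records) {
        Stats stats = new Stats();
        for (String[] record : records) {
            stats.add(record[0], record[1]);
        }
        return stats;
    }

    public static void main(String[] args) {
        // Hypothetical sample data in place of a real sequence file.
        Stats stats = collect(Arrays.asList(
                new String[] {"a", "xx"},
                new String[] {"abc", "yyyy"}));
        System.out.println("records=" + stats.records
                + " avgKey=" + stats.averageKeyLength()
                + " avgValue=" + stats.averageValueLength()
                + " longestKey=" + stats.longestKey
                + " shortestKey=" + stats.shortestKey);
    }
}
```

Because each record only updates a few running totals, the same accumulator could be fed from a local reader or combined across tasks.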

In addition to these, we can discuss the following:
- tail : we can implement this, but will it be worth the effort?
- the same set of tools for map files : definitely worth doing, but I think we 
can leave this to another issue.
- implementations of the above tasks both with and without MapReduce : for 
some tasks MapReduce would simply be overkill. Should we implement both 
versions?

Finally, any comments and suggestions are welcome. 

> Implement utility-tools for working with SequenceFiles
> ------------------------------------------------------
>
>                 Key: HADOOP-2501
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2501
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: io
>            Reporter: Arun C Murthy
>            Assignee: Enis Soztutar
>
> It would be nice to implement a bunch of utilities to work with SequenceFiles:
>  * info (print-out header information such as key/value types, compression 
> type/codec etc.)
>  * cat
>  * head/tail
>  * merge multiple seq-files into one
>  * ...
> I'd imagine this would look like:
> {noformat}
> $ bin/hadoop seq -info /user/joe/blah.seq
> $ bin/hadoop seq -head -n 10 /user/joe/blah.seq
> {noformat}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
