[
https://issues.apache.org/jira/browse/HADOOP-2501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12559474#action_12559474
]
Enis Soztutar commented on HADOOP-2501:
---------------------------------------
Initially I intend to develop the following tools for sequence files:
- info : give information about the file, including its header
information (key class, value class, compression, etc.)
- dump : dump the contents of the file to a text file, by calling toString()
methods on the keys and values
- head : print the first n lines of the dump
- get : get the value of the given key. Essentially we will provide a
method
{code}
Writable get(WritableComparable key) { ... }
{code}
which calls the equals() method on the key. However, to be useful as a
command-line utility, we will also add a command-line option
{noformat}
bin/hadoop seq -get <key>
{noformat}
which compares keys by their string representations.
- filter <filter> : filter the input, keeping only the entries that pass the
filter. (fix and use https://issues.apache.org/jira/browse/HADOOP-449)
- stats : give statistics about the file, such as number of records, average
key length, average value length, longest key, shortest key, ... more?
- sort : sort the file
- merge : merge multiple files
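To make the get and stats items above more concrete, here is a minimal sketch
in plain Java. It is not the proposed implementation: records are modeled as
in-memory key/value pairs rather than being read from a SequenceFile, and the
names SeqToolsSketch, getByString, Stats and stats are hypothetical. It also
assumes that "length" means the length of the toString() form; the serialized
byte length would be another reasonable choice.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: records are in-memory key/value pairs.
// A real tool would read them from a SequenceFile instead.
class SeqToolsSketch {

    // Command-line style lookup: compare the string form of each key with
    // the key typed by the user and return the first matching value.
    static <K, V> Optional<V> getByString(Iterable<Map.Entry<K, V>> records,
                                          String wanted) {
        for (Map.Entry<K, V> record : records) {
            if (record.getKey().toString().equals(wanted)) {
                return Optional.of(record.getValue());
            }
        }
        return Optional.empty();
    }

    // Totals gathered in a single pass over the records.
    // Assumption: lengths are measured on the toString() form.
    static final class Stats {
        long numRecords;
        long totalKeyLength;
        long totalValueLength;
        String longestKey;   // null until the first record is seen
        String shortestKey;  // null until the first record is seen

        double averageKeyLength() {
            return numRecords == 0 ? 0.0 : (double) totalKeyLength / numRecords;
        }

        double averageValueLength() {
            return numRecords == 0 ? 0.0 : (double) totalValueLength / numRecords;
        }
    }

    static <K, V> Stats stats(Iterable<Map.Entry<K, V>> records) {
        Stats s = new Stats();
        for (Map.Entry<K, V> record : records) {
            String key = record.getKey().toString();
            String value = record.getValue().toString();
            s.numRecords++;
            s.totalKeyLength += key.length();
            s.totalValueLength += value.length();
            // On ties, the first key seen is kept.
            if (s.longestKey == null || key.length() > s.longestKey.length()) {
                s.longestKey = key;
            }
            if (s.shortestKey == null || key.length() < s.shortestKey.length()) {
                s.shortestKey = key;
            }
        }
        return s;
    }
}
```

Both methods make a single forward pass, so they would fit a reader that only
supports sequential access.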
In addition to these, we can discuss the following:
- tail : we can implement this, but will it be worth the effort?
- same set of tools for map files : certainly worth doing, but I think we can
leave this to another issue.
- provide implementations of the above tasks both with and without MapReduce.
For some tasks, MapReduce would simply be overkill. Should we implement both
versions?
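As a rough illustration of why tail may cost more than head, here is a minimal
sketch in plain Java. It assumes a forward-only record iterator and does not
use any Hadoop API; the class HeadTailSketch and its methods are hypothetical
names. A real implementation might be able to use the sync markers in the
SequenceFile format to start reading near the end of the file, which this
sketch does not attempt.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch: head and tail over a forward-only record iterator.
class HeadTailSketch {

    // head: stop as soon as n records have been read.
    static <T> List<T> head(Iterator<T> records, int n) {
        List<T> out = new ArrayList<>();
        while (out.size() < n && records.hasNext()) {
            out.add(records.next());
        }
        return out;
    }

    // tail: every record has to be visited; a bounded buffer keeps
    // only the most recent n of them.
    static <T> List<T> tail(Iterator<T> records, int n) {
        Deque<T> window = new ArrayDeque<>();
        while (records.hasNext()) {
            T record = records.next();
            if (n <= 0) {
                continue; // nothing to keep, but the input is still drained
            }
            if (window.size() == n) {
                window.removeFirst(); // drop the oldest buffered record
            }
            window.addLast(record);
        }
        return new ArrayList<>(window);
    }
}
```

Under this assumption head reads only n records, while tail reads the whole
input and holds at most n records in memory.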
Finally, any comments and suggestions are welcome.
> Implement utility-tools for working with SequenceFiles
> ------------------------------------------------------
>
> Key: HADOOP-2501
> URL: https://issues.apache.org/jira/browse/HADOOP-2501
> Project: Hadoop
> Issue Type: New Feature
> Components: io
> Reporter: Arun C Murthy
> Assignee: Enis Soztutar
>
> It would be nice to implement a bunch of utilities to work with SequenceFiles:
> * info (print-out header information such as key/value types, compression
> type/codec etc.)
> * cat
> * head/tail
> * merge multiple seq-files into one
> * ...
> I'd imagine this would look like:
> {noformat}
> $ bin/hadoop seq -info /user/joe/blah.seq
> $ bin/hadoop seq -head -n 10 /user/joe/blah.seq
> {noformat}