[ https://issues.apache.org/jira/browse/HADOOP-2501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12559474#action_12559474 ]
Enis Soztutar commented on HADOOP-2501:
---------------------------------------

Initially I intend to develop the following tools for sequence files:
- info : print information about the file, including its header fields (key class, value class, whether it is compressed, etc.)
- dump : dump the contents of the file to a text file by calling toString() on the keys and values
- head : print n lines from the dump
- get : get the value for the given key. Essentially we will provide a method
{code}
Writable get(WritableComparable key) { ... }
{code}
which calls equals() on the key. However, to be useful as a command-line utility, we will add a command-line option
{noformat}
bin/hadoop seq -get <key>
{noformat}
which does the comparison on string values.
- filter <filter> : filter the input, keeping only the entries that pass the filter (fix and use https://issues.apache.org/jira/browse/HADOOP-449)
- stats : print statistics about the file, such as number of records, average key length, average value length, longest key, shortest key, ... more?
- sort : sort the file
- merge : merge multiple files

In addition to these, we can discuss the following:
- tail : we can implement this, but will it be worth the effort?
- the same set of tools for map files : worth doing, but I think we can leave this to another issue.
- implementations of the above tasks both with and without MapReduce : for some tasks MapReduce will simply be overkill. Should we implement both versions?

And finally, any comments and suggestions are welcome.

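To make the proposed semantics of get, head and stats concrete, here is a minimal sketch in plain Java. It is not the implementation and does not use the Hadoop API. An in-memory list of key/value pairs stands in for the records of a sequence file, and the class and method names (SeqToolsSketch, getByString, head, stats) are hypothetical. Key length is measured on the toString() form, which is an assumption; a real tool might report serialized byte lengths instead.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a List of entries stands in for the records of a sequence file.
final class SeqToolsSketch {

    // get: return the value of the first record whose key's string form equals
    // keyText, or null if there is none. This mirrors the command-line variant,
    // which compares string values instead of calling equals() on the key object.
    static <K, V> V getByString(List<Map.Entry<K, V>> records, String keyText) {
        for (Map.Entry<K, V> record : records) {
            if (record.getKey().toString().equals(keyText)) {
                return record.getValue();
            }
        }
        return null;
    }

    // head: render at most n records as "key<TAB>value" lines, relying on
    // toString() of keys and values as the proposed dump would.
    static <K, V> List<String> head(List<Map.Entry<K, V>> records, int n) {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<K, V> record : records) {
            if (lines.size() >= n) {
                break;
            }
            lines.add(record.getKey() + "\t" + record.getValue());
        }
        return lines;
    }

    // Result holder for stats. Lengths refer to the string form of the keys (assumption).
    static final class Stats {
        final int numRecords;
        final double avgKeyLength;
        final String longestKey;
        final String shortestKey;

        Stats(int numRecords, double avgKeyLength, String longestKey, String shortestKey) {
            this.numRecords = numRecords;
            this.avgKeyLength = avgKeyLength;
            this.longestKey = longestKey;
            this.shortestKey = shortestKey;
        }
    }

    // stats: a single pass over the records collecting count, average key length,
    // and the longest and shortest keys (first one wins on ties).
    static <K, V> Stats stats(List<Map.Entry<K, V>> records) {
        int totalKeyLength = 0;
        String longest = null;
        String shortest = null;
        for (Map.Entry<K, V> record : records) {
            String key = record.getKey().toString();
            totalKeyLength += key.length();
            if (longest == null || key.length() > longest.length()) {
                longest = key;
            }
            if (shortest == null || key.length() < shortest.length()) {
                shortest = key;
            }
        }
        double avg = records.isEmpty() ? 0.0 : (double) totalKeyLength / records.size();
        return new Stats(records.size(), avg, longest, shortest);
    }

    public static void main(String[] args) {
        List<Map.Entry<Integer, String>> records = new ArrayList<>();
        records.add(new SimpleEntry<>(7, "seven"));
        records.add(new SimpleEntry<>(42, "forty-two"));
        records.add(new SimpleEntry<>(100, "hundred"));

        System.out.println(getByString(records, "42"));
        System.out.println(head(records, 2));
        Stats s = stats(records);
        System.out.println(s.numRecords + " records, average key length " + s.avgKeyLength);
    }
}
```

Each of these operations is a single sequential scan, which suggests why a non-MapReduce version may be enough for small files.
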
> Implement utility-tools for working with SequenceFiles
> ------------------------------------------------------
>
>                 Key: HADOOP-2501
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2501
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: io
>            Reporter: Arun C Murthy
>            Assignee: Enis Soztutar
>
> It would be nice to implement a bunch of utilities to work with SequenceFiles:
>   * info (print-out header information such as key/value types, compression type/codec etc.)
>   * cat
>   * head/tail
>   * merge multiple seq-files into one
>   * ...
> I'd imagine this would look like:
> {noformat}
> $ bin/hadoop seq -info /user/joe/blah.seq
> $ bin/hadoop seq -head -n 10 /user/joe/blah.seq
> {noformat}

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.