[ https://issues.apache.org/jira/browse/MAHOUT-590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Isabel Drost updated MAHOUT-590:
--------------------------------

    Attachment: MAHOUT-590.patch

Updated version.

> add TSV (Tab Separate Value) input file support to SequenceFilesFromDirectory
> -----------------------------------------------------------------------------
>
>                 Key: MAHOUT-590
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-590
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Utils
>    Affects Versions: 0.4
>         Environment: Mac OS X 10.6.6, java version "1.6.0_22"
>                      RHL Linux 2.6.18
>            Reporter: Shige Takeda
>            Assignee: Sean Owen
>            Priority: Minor
>         Attachments: 0001-added-TSV-input-file-support.patch,
>                      MAHOUT-590.patch, MAHOUT-590.patch
>
>
> I would like to add TSV (Tab Separated Value) input file type support to
> SequenceFilesFromDirectory.
> Here is my real use case:
> I have 36M input records, each of which consists of ID, CONTENT and
> various other attributes, and I wanted to convert them to sequence files for
> clustering records by the term vectors of CONTENT. However, my home directory
> has a quota limit of 50k files, so I cannot create 36M files there and was
> not able to convert the records to sequence files with the
> SequenceFilesFromDirectory utility. Meanwhile, the source data format is TSV,
> where each line of a file contains ID\tCONTENT\t..., as this is suitable for
> Pig and most Hadoop streaming programs to process as input and output.
> NOTE: CONTENT size is up to around 2k bytes. Hence I would prefer that
> SequenceFilesFromDirectory support TSV directly instead of taking two steps:
> TSV to text files, then text files to sequence files.
> I'm attaching the patch.
> Hope this makes sense to other folks.
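
To make the requested behaviour concrete, here is a minimal sketch of the parsing step the description implies: each TSV line of the form ID\tCONTENT\t... becomes one (key, value) record, instead of one file per record. This is not the code from the attached patch; the class and method names (`TsvRecords`, `parseLine`, `parseAll`) are hypothetical, and the choice to skip malformed lines is an assumption. In the real utility each pair would presumably be appended to a Hadoop sequence file rather than collected in a list.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Mahout: turns TSV lines into (ID, CONTENT) pairs.
final class TsvRecords {

    // Splits one line on tabs; column 0 is the ID, column 1 the CONTENT.
    // Returns null when the line has fewer than two columns or an empty ID
    // (an assumption: the issue does not say how bad lines should be handled).
    static Map.Entry<String, String> parseLine(String line) {
        // A limit of -1 keeps trailing empty columns instead of dropping them.
        String[] cols = line.split("\t", -1);
        if (cols.length < 2 || cols[0].isEmpty()) {
            return null;
        }
        return new SimpleImmutableEntry<>(cols[0], cols[1]);
    }

    // Parses many lines, silently skipping the ones parseLine rejects.
    static List<Map.Entry<String, String>> parseAll(Iterable<String> lines) {
        List<Map.Entry<String, String>> records = new ArrayList<>();
        for (String line : lines) {
            Map.Entry<String, String> record = parseLine(line);
            if (record != null) {
                records.add(record);
            }
        }
        return records;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "42\thello world\tother attributes",
                "line without a tab",
                "43\tsecond record");
        for (Map.Entry<String, String> record : parseAll(lines)) {
            System.out.println(record.getKey() + " -> " + record.getValue());
        }
    }
}
```

With this shape, a single input file can carry all 36M records, so the 50k-file quota mentioned above would no longer be a constraint.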