[jira] Updated: (MAHOUT-590) add TSV (Tab Separate Value) input file support to SequenceFilesFromDirectory

Isabel Drost (JIRA) Fri, 28 Jan 2011 03:00:12 -0800

     [ https://issues.apache.org/jira/browse/MAHOUT-590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Isabel Drost updated MAHOUT-590:
--------------------------------

    Attachment: MAHOUT-590.patch

Updated version.

> add TSV (Tab Separate Value) input file support to SequenceFilesFromDirectory
> -----------------------------------------------------------------------------
>
>                 Key: MAHOUT-590
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-590
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Utils
>    Affects Versions: 0.4
>         Environment: Mac OS X 10.6.6, java version "1.6.0_22"
> RHL Linux 2.6.18
>            Reporter: Shige Takeda
>            Assignee: Sean Owen
>            Priority: Minor
>         Attachments: 0001-added-TSV-input-file-support.patch, 
> MAHOUT-590.patch, MAHOUT-590.patch
>
>
> I would like to add TSV (Tab Separated Value) input file support to 
> SequenceFilesFromDirectory.
> Here is my real use case:
> I have 36M input records, each consisting of an ID, CONTENT, and various 
> other attributes, and I want to convert them to sequence files so that I can 
> cluster the records by the term vectors of CONTENT. The problem is that my 
> home directory has a quota of 50k files, so I cannot create 36M files there 
> and therefore cannot convert the records with the SequenceFilesFromDirectory 
> utility. Meanwhile, the source data is in TSV format, where each line of a 
> file contains ID\tCONTENT\t..., because that format is convenient input and 
> output for Pig and most Hadoop streaming programs. NOTE: CONTENT is up to 
> around 2k bytes. I would therefore prefer that SequenceFilesFromDirectory 
> support TSV directly, instead of taking two steps: TSV to text files, then 
> text files to sequence files.
> I'm attaching the patch.
> Hope this makes sense to other folks.
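
For readers without the patch at hand, the record layout described above (ID\tCONTENT\t... on each line) can be sketched as follows. This is only an illustration and not the code in MAHOUT-590.patch: the class name TsvRecords and the methods parseLine and parseAll are hypothetical, and the sketch assumes that the first field is the ID, the second is CONTENT, and any further fields are ignored. It uses plain Java collections instead of Hadoop types so that it stands alone.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Mahout: turns TSV lines into (ID, CONTENT) pairs.
final class TsvRecords {

    private TsvRecords() {
    }

    /**
     * Splits one TSV line into an (ID, CONTENT) pair.
     * Assumption: field 0 is the ID and field 1 is CONTENT; extra fields are ignored.
     * Returns null when the line has fewer than two fields or an empty ID.
     */
    public static Map.Entry<String, String> parseLine(String line) {
        // A limit of -1 keeps trailing empty fields instead of dropping them.
        String[] fields = line.split("\t", -1);
        if (fields.length < 2 || fields[0].isEmpty()) {
            return null;
        }
        return new SimpleEntry<String, String>(fields[0], fields[1]);
    }

    /** Parses every line and silently skips the malformed ones. */
    public static List<Map.Entry<String, String>> parseAll(Iterable<String> lines) {
        List<Map.Entry<String, String>> records = new ArrayList<Map.Entry<String, String>>();
        for (String line : lines) {
            Map.Entry<String, String> record = parseLine(line);
            if (record != null) {
                records.add(record);
            }
        }
        return records;
    }
}
```

In a converter along these lines, each pair would presumably become one key/value record in a sequence file, with the ID as the key and CONTENT as the value, so that 36M records need only as many files as there are TSV inputs rather than one file per record.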

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
