[jira] [Commented] (MAHOUT-1541) Create CLI Driver for Spark Cooccurrence Analysis

Pat Ferrel (JIRA) Sat, 17 May 2014 13:49:33 -0700

    [ https://issues.apache.org/jira/browse/MAHOUT-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14000861#comment-14000861 ]


Pat Ferrel commented on MAHOUT-1541:
------------------------------------

More progress: it reads and writes. A CLI option parser now takes almost all 
of the params needed, and some minimal validation and consistency checking has 
been started.

The structure of IndexedDataset and the 'Store' types is ready for some early 
review. The idea is that Stores create IndexedDataset(s) by reading text files 
and also write them back out. The design allows for non-text read/write, but 
nothing is implemented for this yet. There is also an example driver that may 
be split into generic and job-specific parts, but I haven't tackled this yet.
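
To make the Store idea concrete for reviewers, here is a rough sketch of the read/write round trip. It is an illustration in Python, not the code in the repository: the names BiMap, IndexedDataset, and TextDelimitedStore, the "rowID,colID" line format, and the use of a plain dict in place of a Spark DRM are all assumptions made for the example.

```python
import io
from dataclasses import dataclass, field


class BiMap:
    """Minimal two-way dictionary: external string ID <-> internal int index."""

    def __init__(self):
        self._to_index = {}
        self._to_id = []

    def index_of(self, external_id):
        # Assign the next free index the first time an ID is seen.
        if external_id not in self._to_index:
            self._to_index[external_id] = len(self._to_id)
            self._to_id.append(external_id)
        return self._to_index[external_id]

    def id_of(self, index):
        return self._to_id[index]

    def __len__(self):
        return len(self._to_id)


@dataclass
class IndexedDataset:
    # Sparse rows keyed by internal index: {row: {col: strength}}.
    # Assumption: a dict stands in for the Spark DRM mentioned in the issue.
    rows: dict = field(default_factory=dict)
    row_ids: BiMap = field(default_factory=BiMap)
    col_ids: BiMap = field(default_factory=BiMap)


class TextDelimitedStore:
    """Hypothetical store for 'rowID<delimiter>colID' text lines."""

    def __init__(self, delimiter=","):
        self.delimiter = delimiter

    def read(self, lines):
        dataset = IndexedDataset()
        for line in lines:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            row_id, col_id = line.split(self.delimiter)[:2]
            r = dataset.row_ids.index_of(row_id)
            c = dataset.col_ids.index_of(col_id)
            dataset.rows.setdefault(r, {})[c] = 1.0
        return dataset

    def write(self, dataset, out):
        # Reattach the external IDs on the way out.
        for r in sorted(dataset.rows):
            for c in sorted(dataset.rows[r]):
                row_id = dataset.row_ids.id_of(r)
                col_id = dataset.col_ids.id_of(c)
                out.write(f"{row_id}{self.delimiter}{col_id}\n")


store = TextDelimitedStore()
dataset = store.read(["u1,iphone", "u1,ipad", "u2,iphone"])
buffer = io.StringIO()
store.write(dataset, buffer)
```

Keeping the ID dictionaries next to the matrix is what lets the writer emit external IDs without a second lookup pass. A non-text store would only need to supply its own read and write.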

The read/write have not been tested on a cluster yet.

Feedback is appreciated, especially on the points noted in the GitHub description: 
https://github.com/pferrel/harness/wiki

> Create CLI Driver for Spark Cooccurrence Analysis
> -------------------------------------------------
>
>                 Key: MAHOUT-1541
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1541
>             Project: Mahout
>          Issue Type: Bug
>          Components: CLI
>            Reporter: Pat Ferrel
>            Assignee: Pat Ferrel
>
> Create a CLI driver to import data in a flexible manner, create an 
> IndexedDataset with BiMap ID translation dictionaries, call the Spark 
> CooccurrenceAnalysis with the appropriate params, then write output with 
> external IDs optionally reattached.
> Ultimately it should be able to read input as the legacy MapReduce (mr) code 
> does, but it will also support reading externally defined IDs and flexible 
> formats. Output will be in the legacy format or in text files of the user's 
> specification with reattached item IDs. 
> Support for legacy formats is an open question; users can always use the 
> legacy code if they want this. Internal to the IndexedDataset is a Spark DRM, 
> so pipelining can be accomplished without writing to an actual file, and the 
> legacy sequence file output may not be needed.
> Opinions?



--
This message was sent by Atlassian JIRA
(v6.2#6252)
