[jira] [Commented] (MAHOUT-1541) Create CLI Driver for Spark Cooccurrence Analysis

Pat Ferrel (JIRA) Fri, 06 Jun 2014 12:00:26 -0700

    [ 
https://issues.apache.org/jira/browse/MAHOUT-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14020232#comment-14020232
 ]


Pat Ferrel commented on MAHOUT-1541:
------------------------------------

The existing itemsimilarity outputs indicators, one per row. This is more 
compact but does not seem like the ideal way to get them, since the indicators 
will likely be stored in rows keyed by a specific item ID. Therefore I've 
implemented the Spark version to output what amounts to a sparse matrix with 
external IDs intact. If no formatting params are specified, a row will look 
like this:

user1<tab>user2:strength1,user3:strength2
user2<tab>user5:strength3,user6:strength4,...
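
As an illustration only, the default row layout above could be produced by 
something like the following Python sketch. This is not the driver's code; 
the function name, the delimiter parameter names, and the strength values 
are all assumptions made up for the example.

```python
def format_row(row_id, indicators,
               column_delim="\t", element_delim=",", strength_delim=":"):
    """Format one sparse-matrix row as: rowID<tab>id:strength,id:strength,...

    All parameter names are hypothetical, chosen only to mirror the layout
    shown in the comment.
    """
    elements = element_delim.join(
        f"{ext_id}{strength_delim}{strength}" for ext_id, strength in indicators
    )
    return f"{row_id}{column_delim}{elements}"


# Placeholder data: external IDs with invented strength values.
rows = {
    "user1": [("user2", 0.8), ("user3", 0.5)],
    "user2": [("user5", 0.9), ("user6", 0.4)],
}

lines = [format_row(row_id, indicators) for row_id, indicators in rows.items()]
print("\n".join(lines))
```

Keeping the three delimiters as separate parameters is what would let one 
formatter cover several text layouts without a dedicated writer for each.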

To make this directly ingestible by Solr or ElasticSearch we probably need to 
drop the strength values via a CLI option, and maybe support named output 
formats such as CSV and JSON. CSV can be done with the current formatting 
options once the strengths are dropped.
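
To make the idea concrete, here is a rough Python sketch of dropping the 
strengths from a row in the default layout and emitting it as CSV or JSON. 
It is a post-processing illustration, not the proposed CLI option; the 
function names and the JSON field names ("id", "indicators") are 
assumptions, and the strength values are invented.

```python
import json


def parse_row(line, column_delim="\t", element_delim=",", strength_delim=":"):
    """Parse 'rowID<tab>id:strength,id:strength' into (rowID, [(id, strength)])."""
    row_id, _, rest = line.partition(column_delim)
    indicators = []
    for element in filter(None, rest.split(element_delim)):
        # Split on the last delimiter so an ID containing ':' stays intact.
        ext_id, _, strength = element.rpartition(strength_delim)
        indicators.append((ext_id, float(strength)))
    return row_id, indicators


def to_csv(row_id, indicators):
    """Row ID followed by indicator IDs, with the strengths dropped."""
    return ",".join([row_id] + [ext_id for ext_id, _ in indicators])


def to_json(row_id, indicators):
    """Same content as a JSON document; the field names are hypothetical."""
    return json.dumps(
        {"id": row_id, "indicators": [ext_id for ext_id, _ in indicators]}
    )


row_id, indicators = parse_row("user1\tuser2:0.8,user3:0.5")
print(to_csv(row_id, indicators))
print(to_json(row_id, indicators))
```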

> Create CLI Driver for Spark Cooccurrence Analysis
> -------------------------------------------------
>
>                 Key: MAHOUT-1541
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1541
>             Project: Mahout
>          Issue Type: Bug
>          Components: CLI
>            Reporter: Pat Ferrel
>            Assignee: Pat Ferrel
>
> Create a CLI driver to import data in a flexible manner, create an 
> IndexedDataset with BiMap ID translation dictionaries, call the Spark 
> CooccurrenceAnalysis with the appropriate params, then write output with 
> external IDs optionally reattached.
> Ultimately it should be able to read input as the legacy MapReduce code 
> does, but will also support reading externally defined IDs and flexible 
> formats. Output will be in the legacy format or in text files of the 
> user's specification with reattached item IDs.
> Support for legacy formats is an open question; users can always use the 
> legacy code if they want this. Internal to the IndexedDataset is a Spark 
> DRM, so pipelining can be accomplished without writing to an actual file, 
> and the legacy sequence file output may not be needed.
> Opinions?
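
The ID translation described in the issue (external IDs mapped to internal 
indices on input and reattached on output) can be sketched in Python roughly 
as below. This only illustrates the bidirectional-dictionary idea; the class 
and method names are hypothetical and it is not the IndexedDataset 
implementation.

```python
class IdDictionary:
    """Bidirectional map between external string IDs and integer indices."""

    def __init__(self):
        self._to_index = {}  # external ID -> internal index
        self._to_id = []     # internal index -> external ID

    def index_of(self, external_id):
        """Return the index for an external ID, assigning the next one if new."""
        if external_id not in self._to_index:
            self._to_index[external_id] = len(self._to_id)
            self._to_id.append(external_id)
        return self._to_index[external_id]

    def id_of(self, index):
        """Return the external ID that was assigned the given index."""
        return self._to_id[index]


# Placeholder interactions as (row ID, column ID) pairs.
interactions = [("user1", "itemA"), ("user2", "itemB"), ("user1", "itemB")]

row_ids, column_ids = IdDictionary(), IdDictionary()

# Import: translate external IDs into matrix coordinates.
encoded = [(row_ids.index_of(r), column_ids.index_of(c)) for r, c in interactions]

# Export: reattach the external IDs to the coordinates.
decoded = [(row_ids.id_of(i), column_ids.id_of(j)) for i, j in encoded]

print(encoded)
print(decoded)
```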



--
This message was sent by Atlassian JIRA
(v6.2#6252)
