Pat Ferrel created MAHOUT-1541:
----------------------------------

             Summary: Create CLI Driver for Spark Cooccurrence Analysis
                 Key: MAHOUT-1541
                 URL: https://issues.apache.org/jira/browse/MAHOUT-1541
             Project: Mahout
          Issue Type: Bug
          Components: CLI
            Reporter: Pat Ferrel
            Assignee: Pat Ferrel


Create a CLI driver to import data in a flexible manner, create an 
IndexedDataset with BiMap ID translation dictionaries, call the Spark 
CooccurrenceAnalysis with the appropriate params, then write output with 
external IDs optionally reattached.
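
The flow described above can be sketched roughly as follows. This is a
plain Python illustration, not Mahout or Spark code: the BiMap class, the
function names, and the sample data are hypothetical, and simple pair
counting stands in for the actual Spark CooccurrenceAnalysis call.

```python
from collections import defaultdict
from itertools import combinations


class BiMap:
    """Hypothetical two-way dictionary: external string ID <-> internal int index."""

    def __init__(self):
        self._to_index = {}
        self._to_id = []

    def index_of(self, external_id):
        # Assign the next free integer the first time an ID is seen.
        if external_id not in self._to_index:
            self._to_index[external_id] = len(self._to_id)
            self._to_id.append(external_id)
        return self._to_index[external_id]

    def id_of(self, index):
        return self._to_id[index]


def build_indexed_dataset(pairs):
    """Turn (user_id, item_id) pairs into integer-keyed rows plus both dictionaries."""
    users, items = BiMap(), BiMap()
    rows = defaultdict(set)
    for user_id, item_id in pairs:
        rows[users.index_of(user_id)].add(items.index_of(item_id))
    return rows, users, items


def cooccurrence_counts(rows):
    """Plain pair counts; an assumed stand-in for the real cooccurrence analysis."""
    counts = defaultdict(int)
    for item_indices in rows.values():
        for a, b in combinations(sorted(item_indices), 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1
    return counts


def reattach_ids(counts, items):
    """Map internal item indices back to the external IDs for output."""
    return {(items.id_of(a), items.id_of(b)): n for (a, b), n in counts.items()}


# Made-up sample interactions, for illustration only.
pairs = [("u1", "iphone"), ("u1", "ipad"),
         ("u2", "iphone"), ("u2", "ipad"), ("u2", "nexus")]
rows, users, items = build_indexed_dataset(pairs)
result = reattach_ids(cooccurrence_counts(rows), items)
```

The point of the two dictionaries is that the analysis itself only ever
sees integer indices, while the external IDs are needed only at the import
and output boundaries.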

Ultimately it should be able to read the same input as the legacy MapReduce 
code, and will also support externally defined IDs and flexible input 
formats. Output will be either in the legacy format or in text files of the 
user's specification, with item IDs reattached.
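
As one possible reading of "flexible formats", the sketch below treats the
delimiter and column positions as user-supplied options on input, and the
delimiters as options on output. The function names, parameters, and
defaults are all assumptions made for illustration; none of them are
taken from an existing Mahout driver.

```python
def parse_line(line, delimiter=",", user_col=0, item_col=1):
    """Extract (user_id, item_id) from one text line with a configurable layout."""
    fields = [field.strip() for field in line.split(delimiter)]
    return fields[user_col], fields[item_col]


def format_row(item_id, similar, id_delim="\t", entry_delim=" ", value_delim=":"):
    """Render one output row: an item ID followed by its (other_item, score) entries."""
    entries = entry_delim.join(
        f"{other}{value_delim}{score}" for other, score in similar
    )
    return f"{item_id}{id_delim}{entries}"


# A tab-separated log line with the user in column 1 and the item in column 3.
pair = parse_line("2014-06-01\tu1\tpurchase\tiphone",
                  delimiter="\t", user_col=1, item_col=3)

# One output row using the default delimiters.
row = format_row("iphone", [("ipad", 2), ("nexus", 1)])
```

Keeping the layout in parameters rather than in the parser means the same
code path could serve both a legacy-style layout and a user-defined one.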

Whether to support the legacy formats is an open question; users can always 
use the legacy code if they want them. Internal to the IndexedDataset is a 
Spark DRM, so pipelining can be accomplished without writing to an actual 
file, and the legacy sequence file output may not be needed.

Opinions?



--
This message was sent by Atlassian JIRA
(v6.2#6252)
