Pat Ferrel created MAHOUT-1541:
----------------------------------
Summary: Create CLI Driver for Spark Cooccurrence Analysis
Key: MAHOUT-1541
URL: https://issues.apache.org/jira/browse/MAHOUT-1541
Project: Mahout
Issue Type: Bug
Components: CLI
Reporter: Pat Ferrel
Assignee: Pat Ferrel
Create a CLI driver to import data in a flexible manner, create an
IndexedDataset with BiMap ID translation dictionaries, call the Spark
CooccurrenceAnalysis with the appropriate params, then write output with
external IDs optionally reattached.
Ultimately it should be able to read input as the legacy mr does but will
support reading externally defined IDs and flexible formats. Output will be of
the legacy format or text files of the user's specification with reattached
Item IDs.
Support for legacy formats is a question, users can always use the legacy code
if they want this. Internal to the IndexedDataset is a Spark DRM so pipelining
can be accomplished without any writing to an actual file so the legacy
sequence file output may not be needed.
Opinions?
--
This message was sent by Atlassian JIRA
(v6.2#6252)