[ https://issues.apache.org/jira/browse/MAHOUT-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14052663#comment-14052663 ]
Pat Ferrel commented on MAHOUT-1541:
------------------------------------
There is still work to do on this so I'll leave the Jira open for a bit.
Two issues:
* Do we care about tuple output, like the Hadoop version produces? The Spark
version outputs a DRM with application-specific item and user IDs.
* We need an option to sort the indicators by strength and omit the strength
values from the output. This will allow the output to be indexed directly by a
search engine. One step from log files to indexable indicators; can't wait.
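To make the second point concrete, here is a minimal sketch in plain Java of
what "sort by strength, omit the values" could look like for one output row.
It uses no Spark or Mahout APIs; the class name IndicatorFormatter, the method
toIndexableLine, the sample IDs, and the tab/space delimiters are all
assumptions for illustration, not the driver's actual output format.

```java
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class IndicatorFormatter {

    /**
     * Formats one row as "itemId<TAB>ind1 ind2 ...", with indicator IDs
     * ordered from strongest to weakest and the strength values dropped.
     * The delimiters are hypothetical; a real driver would make them options.
     */
    public static String toIndexableLine(String itemId, Map<String, Double> indicators) {
        String sortedIds = indicators.entrySet().stream()
            // Sort entries by strength, highest first.
            .sorted(Map.Entry.<String, Double>comparingByValue(Comparator.reverseOrder()))
            // Keep only the indicator ID; the strength is omitted.
            .map(Map.Entry::getKey)
            .collect(Collectors.joining(" "));
        return itemId + "\t" + sortedIds;
    }

    public static void main(String[] args) {
        // Made-up sample row: item "iphone" with three indicator strengths.
        Map<String, Double> row = new LinkedHashMap<>();
        row.put("ipad", 2.5);
        row.put("nexus", 7.1);
        row.put("galaxy", 4.0);
        System.out.println(toIndexableLine("iphone", row));
    }
}
```

A search engine can index each such line as a document whose field holds the
ordered indicator IDs, so the rank order carries the information the strength
values would otherwise provide.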
Something on the horizon: the algorithm seems to need 5 GB of Spark executor
memory for the epinions data. This seems like a lot and may have to do with the
use of HashBiMaps of IDs that are broadcast to every node. This can be optimized
first by not calculating ID dictionaries for users, since they are not in the
output. When using legacy Mahout int IDs we don't need any dictionaries. I'm
also looking at the lifetime of the broadcast vals. The A'A dictionaries may be
kept alive while doing the B'A calc; still investigating this.
Quite a bit more difficult will be a truly scalable treatment of the
dictionaries.
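For readers unfamiliar with the ID dictionaries discussed above, the following
is a rough sketch in plain Java of a two-way mapping between external string
IDs and dense int indices. It is not the Guava HashBiMap the driver uses and
involves no Spark broadcast; the class IdDictionary and its methods are
hypothetical names. It only illustrates why each dictionary holds every
external ID in memory, and why skipping the user dictionary (or using int IDs
directly) saves that space.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class IdDictionary {
    // External ID -> dense index, and the reverse lookup by position.
    // Both structures hold every external ID, which is the memory cost.
    private final Map<String, Integer> toIndex = new HashMap<>();
    private final List<String> toExternal = new ArrayList<>();

    /** Returns the index for an external ID, assigning the next one if new. */
    public int indexOf(String externalId) {
        Integer idx = toIndex.get(externalId);
        if (idx == null) {
            idx = toExternal.size();
            toIndex.put(externalId, idx);
            toExternal.add(externalId);
        }
        return idx;
    }

    /** Reattaches the external ID for an index when writing output. */
    public String externalId(int index) {
        return toExternal.get(index);
    }

    public int size() {
        return toExternal.size();
    }

    public static void main(String[] args) {
        // Only items need a dictionary if user IDs never appear in the output.
        IdDictionary items = new IdDictionary();
        int a = items.indexOf("iphone");
        int b = items.indexOf("ipad");
        System.out.println(a + " " + b + " " + items.externalId(b));
    }
}
```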
> Create CLI Driver for Spark Cooccurrence Analysis
> -------------------------------------------------
>
> Key: MAHOUT-1541
> URL: https://issues.apache.org/jira/browse/MAHOUT-1541
> Project: Mahout
> Issue Type: New Feature
> Components: CLI
> Reporter: Pat Ferrel
> Assignee: Pat Ferrel
>
> Create a CLI driver to import data in a flexible manner, create an
> IndexedDataset with BiMap ID translation dictionaries, call the Spark
> CooccurrenceAnalysis with the appropriate params, then write output with
> external IDs optionally reattached.
> Ultimately it should be able to read input as the legacy mr does but will
> support reading externally defined IDs and flexible formats. Output will be
> of the legacy format or text files of the user's specification with
> reattached Item IDs.
> Support for legacy formats is an open question; users can always use the
> legacy code if they want this. Internal to the IndexedDataset is a Spark DRM,
> so pipelining can be accomplished without writing to an actual file, and the
> legacy sequence file output may not be needed.
> Opinions?