[ https://issues.apache.org/jira/browse/MAHOUT-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14043735#comment-14043735 ]

ASF GitHub Bot commented on MAHOUT-1541:
----------------------------------------

Github user pferrel commented on the pull request:

    https://github.com/apache/mahout/pull/22#issuecomment-47129279
  
    This is getting close to merge time. There are several test cases that 
exercise most of the code in the PR, and it builds with tests passing 
against the current master.
    
    Missing, but needed before merge:
    * a modified mahout script that runs this.
    
    Significant pieces are missing but planned for another PR:
    * support for two or more input streams for cross-cooccurrence. This 
version does the cross-cooccurrence calc by filtering one input tuple 
stream into two matrices (see the sketch after this list). That allows 
testing and may be a common use case, but it should not be the only option.
    * it uses HashBiMaps from Guava for _all_ ID management, even when the 
IDs are already Mahout ordinals. Also, all four ID indexes are created even 
though the external row/user IDs are never used in this case. An 
optimization would be to build only the dictionaries that are needed.
    * the HashBiMaps are created once and broadcast to the rest of the 
jobs. They are not based on RDDs, so we may want to rethink this in the 
future. I haven't thought much about it, so suggestions are welcome.
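
    To make the filtering and broadcast bullets concrete, here is a rough 
Scala sketch. This is not the PR's code: the TSV layout, the action names, 
and all identifiers are assumptions. One tuple stream is filtered into two 
interaction sets, and a Guava HashBiMap dictionary is built once on the 
driver and broadcast for reuse:

        import com.google.common.collect.HashBiMap
        import org.apache.spark.SparkContext

        def buildMatrices(sc: SparkContext, inputPath: String) = {
          // One input stream of tab-separated (user, action, item) tuples.
          val tuples = sc.textFile(inputPath).map { line =>
            val Array(user, action, item) = line.split("\t")
            (user, action, item)
          }

          // Filter the single stream into two interaction sets: the
          // primary action feeds the cooccurrence matrix, the secondary
          // feeds the cross-cooccurrence matrix.
          val primary   = tuples.filter(_._2 == "purchase").map(t => (t._1, t._3))
          val secondary = tuples.filter(_._2 == "view").map(t => (t._1, t._3))

          // External-ID -> Mahout-ordinal dictionary, built once on the
          // driver and broadcast so every job sees the same mapping.
          val itemIds = HashBiMap.create[String, Integer]()
          primary.map(_._2).distinct().collect().zipWithIndex.foreach {
            case (id, ordinal) => itemIds.put(id, ordinal)
          }
          val itemIdsBcast = sc.broadcast(itemIds)
          (primary, secondary, itemIdsBcast)
        }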


> Create CLI Driver for Spark Cooccurrence Analysis
> -------------------------------------------------
>
>                 Key: MAHOUT-1541
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1541
>             Project: Mahout
>          Issue Type: Bug
>          Components: CLI
>            Reporter: Pat Ferrel
>            Assignee: Pat Ferrel
>
> Create a CLI driver to import data in a flexible manner, create an 
> IndexedDataset with BiMap ID translation dictionaries, call the Spark 
> CooccurrenceAnalysis with the appropriate params, then write output with 
> external IDs optionally reattached.
> Ultimately it should be able to read input as the legacy MapReduce version 
> does, but it will also support externally defined IDs and flexible input 
> formats. Output will be in the legacy format or in user-specified text 
> files with item IDs reattached.
> Support for legacy formats is an open question; users can always use the 
> legacy code if they want it. Internal to the IndexedDataset is a Spark 
> DRM, so pipelining can be accomplished without writing to an actual file, 
> which means the legacy sequence-file output may not be needed.
> Opinions?
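
A rough Scala sketch of the output step the description mentions, namely 
reattaching external item IDs through the BiMap dictionaries. Every name, 
the (item, item, score) triple, and the output format are assumptions, not 
the actual API in the PR:

    import com.google.common.collect.HashBiMap
    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    // Translate indicator rows of Mahout ordinals back to external item
    // IDs via the inverse side of the BiMap, then write as text.
    def writeIndicators(sc: SparkContext,
                        indicators: RDD[(Int, Int, Double)],
                        itemIds: HashBiMap[String, Integer],
                        outputPath: String): Unit = {
      val inverse = sc.broadcast(itemIds.inverse()) // ordinal -> external ID
      indicators.map { case (rowItem, colItem, score) =>
        s"${inverse.value.get(rowItem)}\t${inverse.value.get(colItem)}\t$score"
      }.saveAsTextFile(outputPath)
    }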


