[ https://issues.apache.org/jira/browse/MAHOUT-1615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14133552#comment-14133552 ]
ASF GitHub Bot commented on MAHOUT-1615:
----------------------------------------

Github user dlyubimov commented on the pull request:

    https://github.com/apache/mahout/pull/52#issuecomment-55550711

    @andrewpalumbo I have to disagree with this patch. drmFromHDFS actually does exactly what it is supposed to do. Here is the test script (assuming you are running in local mode, you should see the mapper output directly in your console and can check that the keys there are distinct):

{code}
// hdfs write -- uncomment to test
// r.writeDRM("hdfs://localhost:11010/A")

val drmA = drmParallelizeEmpty(30, 40)

// Remap the int keys to strings of the form "key-<n>".
val drmB = drmA.mapBlock() {
  case (keys, block) => keys.map { key => s"key-${key}" } -> block
}

// In local mode we can see printouts to the console, so we
// can check whether the keys are actually there as strings.
drmB.mapBlock() {
  case (keys, block) =>
    keys.map(println)
    keys -> block
}.collect

// save
drmB.writeDRM("B-with-test-keys")

// load back
val drmC = drmFromHDFS("B-with-test-keys")

// check that the keys are still text there
drmC.mapBlock() {
  case (keys, block) =>
    keys.map(println)
    keys -> block
}.collect
{code}

> SparkEngine drmFromHDFS returning the same Key for all Key,Vec Pairs for Text-Keyed SequenceFiles
> --------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-1615
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1615
>             Project: Mahout
>          Issue Type: Bug
>            Reporter: Andrew Palumbo
>             Fix For: 1.0
>
>
> When reading seq2sparse output of the form <Text,VectorWritable> from HDFS in the spark-shell, SparkEngine's drmFromHDFS method creates RDDs with the same key for all pairs:
> {code}
> mahout> val drmTFIDF = drmFromHDFS( path = "/tmp/mahout-work-andy/20news-test-vectors/part-r-00000")
> {code}
> Has keys:
> {...}
> key: /talk.religion.misc/84570
> key: /talk.religion.misc/84570
> key: /talk.religion.misc/84570
> {...}
> for the entire set. This is the last key in the set.
> The problem can be traced to the first line of drmFromHDFS(...) in SparkEngine.scala:
> {code}
> val rdd = sc.sequenceFile(path, classOf[Writable], classOf[VectorWritable], minPartitions = parMin)
>   // Get rid of VectorWritable
>   .map(t => (t._1, t._2.get()))
> {code}
> which gives the same key for all t._1.
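For context on the reported symptom: Hadoop's SequenceFile record reader reuses a single Writable instance across records, so an RDD built with sc.sequenceFile ends up holding many references to one shared Text key unless each key is copied as it is read. The quoted Mahout code keeps the reused key reference (the values survive because VectorWritable rebuilds its wrapped Vector on each readFields call), which is why every pair reports the last key in the file. The sketch below illustrates the standard workaround of materializing the Text key to a String before the reference escapes; the helper name loadTextKeyedRdd is hypothetical, and this is not necessarily the fix that was committed for this issue.

{code}
// A minimal sketch (assumed names; not the committed Mahout patch) of
// loading a <Text, VectorWritable> sequence file so that each pair gets
// its own key. Hadoop reuses one Text instance for every record, so the
// key must be copied before the reference is retained.
import org.apache.hadoop.io.Text
import org.apache.mahout.math.{Vector, VectorWritable}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

def loadTextKeyedRdd(sc: SparkContext, path: String, parMin: Int): RDD[(String, Vector)] =
  sc.sequenceFile(path, classOf[Text], classOf[VectorWritable], minPartitions = parMin)
    // key.toString materializes a fresh String from the reused Text buffer;
    // vw.get() unwraps the Vector that readFields rebuilt for this record.
    .map { case (key, vw) => (key.toString, vw.get()) }
{code}

However it is expressed, the essential point is that the copy has to happen inside the same map over the raw Hadoop records, before anything caches, shuffles, or collects the pairs.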