[ 
https://issues.apache.org/jira/browse/MAHOUT-1615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14157573#comment-14157573
 ] 

ASF GitHub Bot commented on MAHOUT-1615:
----------------------------------------

Github user andrewpalumbo commented on the pull request:

    https://github.com/apache/mahout/pull/52#issuecomment-57739029
  
    I've been a bit bogged down here and haven't had much of a chance to 
look at this, but this last commit adds some minor edits for Hadoop 1.2.1 
support.  It builds and passes tests on my machine.
    
    I'm not sure if we wanted to go this way. I'd hoped to put up a quick 
branch with option (1) from above (the simplest option, but one that changes 
the method signature and allows for more user error), but I haven't had a 
chance and am going to be out of town early next week.  I can come back to 
this then and investigate something like:
     
    ```scala
    def drmFromHDFS[K: ClassTag](path: String, parMin: Int = 0)(implicit sc: DistributedContext): CheckpointedDrm[K]
    ```
    But thinking about it, this will still have a similar issue with:
    ```scala
    private[common] def w2bytes(w: Writable) = w.asInstanceOf[BytesWritable].getBytes
    ```
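
One possible reading of the concern with w2bytes is that the cast only
succeeds when the key really is a BytesWritable, so a Text key would fail at
runtime. The sketch below illustrates that reading and one way around it,
dispatching on the runtime type. KeyWritable, BytesKey, TextKey and KeyBytes
are hypothetical stand-ins invented for this example so that it runs without
Hadoop on the classpath; they are not Hadoop or Mahout types, and this is not
the fix adopted in the pull request.

```scala
// Hypothetical stand-ins for Hadoop Writable key types (not real Hadoop classes).
sealed trait KeyWritable
final case class BytesKey(bytes: Array[Byte]) extends KeyWritable
final case class TextKey(value: String) extends KeyWritable

object KeyBytes {
  // Same shape as the cast in w2bytes: throws ClassCastException
  // when w is not actually a BytesKey.
  def castToBytes(w: KeyWritable): Array[Byte] =
    w.asInstanceOf[BytesKey].bytes

  // Alternative sketch: choose the conversion from the runtime type of the key.
  def toBytes(w: KeyWritable): Array[Byte] = w match {
    case BytesKey(b) => b
    case TextKey(s)  => s.getBytes("UTF-8")
  }
}
```

With these stand-ins, KeyBytes.toBytes(TextKey("doc-1")) returns the UTF-8
bytes of the key, while KeyBytes.castToBytes(TextKey("doc-1")) throws.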



> SparkEngine drmFromHDFS returning the same Key for all Key,Vec Pairs for 
> Text-Keyed SequenceFiles
> -------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-1615
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1615
>             Project: Mahout
>          Issue Type: Bug
>            Reporter: Andrew Palumbo
>            Assignee: Andrew Palumbo
>             Fix For: 1.0
>
>
> When reading seq2sparse output of the form <Text,VectorWritable> from HDFS 
> in the spark-shell, SparkEngine's drmFromHDFS method creates RDDs with the 
> same Key for all Pairs:  
> {code}
> mahout> val drmTFIDF= drmFromHDFS( path = 
> "/tmp/mahout-work-andy/20news-test-vectors/part-r-00000")
> {code}
> Has keys:
> {...} 
>     key: /talk.religion.misc/84570
>     key: /talk.religion.misc/84570
>     key: /talk.religion.misc/84570
> {...}
> for the entire set.  This is the last Key in the set.
> The problem can be traced to the first line of drmFromHDFS(...) in 
> SparkEngine.scala: 
> {code}
>  val rdd = sc.sequenceFile(path, classOf[Writable], classOf[VectorWritable], 
> minPartitions = parMin)
>         // Get rid of VectorWritable
>         .map(t => (t._1, t._2.get()))
> {code}
> which gives the same key for all t._1.
>   
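
The issue text does not state the cause. One explanation consistent with the
symptom (every pair showing the last key) is object reuse: Spark's
documentation for sequenceFile notes that Hadoop's RecordReader reuses the
same Writable instance for each record, so holding references to the keys
leaves them all pointing at one object. The sketch below simulates that
behaviour without Spark or Hadoop. MutableKey and ReuseDemo are hypothetical
names invented for this illustration, and the three keys are made-up sample
values.

```scala
// Hypothetical stand-in for a mutable Hadoop key such as Text.
final class MutableKey(var value: String) {
  override def toString: String = value
}

object ReuseDemo {
  // Simulated reader: hands out the SAME key instance for every record,
  // overwriting its contents in place before each one is returned.
  def records(keys: Seq[String]): Iterator[(MutableKey, Int)] = {
    val shared = new MutableKey("")
    keys.iterator.zipWithIndex.map { case (k, i) =>
      shared.value = k
      (shared, i)
    }
  }

  val input: Seq[String] = Seq("/a/1", "/b/2", "/c/3")

  // References are kept first and read later, after the reader has moved on:
  // every pair now sees the contents of the last record.
  val aliased: List[String] = records(input).toList.map(_._1.toString)

  // The key is copied to an immutable String while its record is current,
  // so each pair keeps its own value.
  val copied: List[String] = records(input).map(_._1.toString).toList
}
```

Here ReuseDemo.aliased contains "/c/3" three times, while ReuseDemo.copied
preserves the three distinct keys. Under this explanation, the equivalent
change in drmFromHDFS would be to copy or convert the key inside the map
before anything retains a reference to it.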



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
