[ https://issues.apache.org/jira/browse/MAHOUT-1615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14149771#comment-14149771 ]
ASF GitHub Bot commented on MAHOUT-1615:
----------------------------------------

Github user dlyubimov commented on the pull request:

    https://github.com/apache/mahout/pull/52#issuecomment-57002012

    To be a bit more concrete: there is indeed a slight discrepancy between the write and read names, but semantically they are what they say they are, i.e. they persist a DRM to HDFS. To be even more concrete, I am probably for simply the package-level `drmDfsRead()` and method-level `dfsWrite()` names. The convention here is that all DRM-related package-level routines start with the `drm` prefix so we don't easily mix them up with other things in the global scope.

    Now, everything else, including reading/writing CSV formats, is an _export_ operation (as opposed to persistence). Consequently, proper names are perhaps along the lines of `drmImportCSV` and `exportCSV`, respectively. "Import" and "export" emphasize that the format is not native, loses a lot of coherency enforcement, and requires a lot of validation while parsing back.

> SparkEngine drmFromHDFS returning the same Key for all Key,Vec Pairs for Text-Keyed SequenceFiles
> -------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-1615
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1615
>             Project: Mahout
>          Issue Type: Bug
>            Reporter: Andrew Palumbo
>             Fix For: 1.0
>
>
> When reading seq2sparse output of the form <Text, VectorWritable> from HDFS in the spark-shell, SparkEngine's drmFromHDFS method creates RDDs with the same key for all pairs:
> {code}
> mahout> val drmTFIDF = drmFromHDFS( path = "/tmp/mahout-work-andy/20news-test-vectors/part-r-00000")
> {code}
> has keys:
> {...}
> key: /talk.religion.misc/84570
> key: /talk.religion.misc/84570
> key: /talk.religion.misc/84570
> {...}
> for the entire set. This is the last key in the set.
> The problem can be traced to the first line of drmFromHDFS(...) in SparkEngine.scala:
> {code}
> val rdd = sc.sequenceFile(path, classOf[Writable], classOf[VectorWritable], minPartitions = parMin)
>   // Get rid of VectorWritable
>   .map(t => (t._1, t._2.get()))
> {code}
> which yields the same key for every t._1.
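For readers tracing this outside the patch: the behavior quoted above is the standard Hadoop Writable-reuse pitfall. The SequenceFile record reader hands back the same Text (and VectorWritable) instances for every record, so any downstream step that buffers the tuples (cache, collect, sort) ends up with one shared, mutated key. Below is a minimal standalone sketch of the pitfall and one common way around it; this is not the actual fix from the pull request above, and the input path, app name, and defensive clone() are illustrative assumptions only:

{code}
import org.apache.hadoop.io.Text
import org.apache.mahout.math.VectorWritable
import org.apache.spark.{SparkConf, SparkContext}

object WritableReuseSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("writable-reuse").setMaster("local"))

    // Broken: the SequenceFile reader reuses the *same* Text and
    // VectorWritable objects for every record. Any step that buffers
    // these tuples (cache, collect, sort, ...) then sees one shared,
    // mutated key -- the last one read in the partition.
    val broken = sc
      .sequenceFile("/tmp/vectors/part-r-00000", classOf[Text], classOf[VectorWritable])
      .map(t => (t._1, t._2.get()))

    // Fixed: materialize immutable copies while each record is still live.
    // Text.toString yields a fresh String, and cloning the vector
    // defensively detaches the value from the reused wrapper as well.
    val fixed = sc
      .sequenceFile("/tmp/vectors/part-r-00000", classOf[Text], classOf[VectorWritable])
      .map { case (k, v) => (k.toString, v.get().clone()) }

    // Distinct keys survive, e.g. one per document from seq2sparse.
    fixed.keys.take(3).foreach(k => println(s"key: $k"))
    sc.stop()
  }
}
{code}

The essential point is that the copies (toString on the key, clone() on the vector) are made inside the same map() that first receives the reused objects, before anything downstream can buffer them.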