I already mentioned that I don't want any whiff of Hadoop stuff in math-scala, mostly because it is impossible to pin down the exact Hadoop API version a third party wants to use. With that approach there will always be applications claiming incompatibility with this or that in Hadoop.
It might make sense to create another module for "all things Hadoop" and make the engine-specific modules depend on it, but I am not sure the amount of current code really justifies it yet. mrlegacy is kind of that, but it is legacy. Maybe it makes sense to move some things from legacy into that "all things Hadoop" module (e.g. sequence file iterators and such), although none of that is used anywhere beyond mrlegacy right now.

On Thu, Oct 9, 2014 at 4:34 PM, ASF GitHub Bot (JIRA) <[email protected]> wrote:

>
> [ https://issues.apache.org/jira/browse/MAHOUT-1615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14165989#comment-14165989 ]
>
> ASF GitHub Bot commented on MAHOUT-1615:
> ----------------------------------------
>
> Github user andrewpalumbo commented on the pull request:
>
>     https://github.com/apache/mahout/pull/58#issuecomment-58594183
>
>     @dlyubimov , @pferrel - any objections to moving common/HDFSUtils into math-scala?
>
>
> > SparkEngine drmFromHDFS returning the same Key for all Key,Vec Pairs for Text-Keyed SequenceFiles
> > -------------------------------------------------------------------------------------------------
> >
> >                 Key: MAHOUT-1615
> >                 URL: https://issues.apache.org/jira/browse/MAHOUT-1615
> >             Project: Mahout
> >          Issue Type: Bug
> >            Reporter: Andrew Palumbo
> >            Assignee: Andrew Palumbo
> >             Fix For: 1.0
> >
> >
> > When reading in seq2sparse output from HDFS in the spark-shell of form <Text,VectorWriteable> SparkEngine's drmFromHDFS method is creating rdds with the same Key for all Pairs:
> > {code}
> > mahout> val drmTFIDF= drmFromHDFS( path = "/tmp/mahout-work-andy/20news-test-vectors/part-r-00000")
> > {code}
> > Has keys:
> > {...}
> >     key: /talk.religion.misc/84570
> >     key: /talk.religion.misc/84570
> >     key: /talk.religion.misc/84570
> > {...}
> > for the entire set. This is the last Key in the set.
> > The problem can be traced to the first line of drmFromHDFS(...) in SparkEngine.scala:
> > {code}
> >     val rdd = sc.sequenceFile(path, classOf[Writable], classOf[VectorWritable], minPartitions = parMin)
> >         // Get rid of VectorWritable
> >         .map(t => (t._1, t._2.get()))
> > {code}
> > which gives the same key for all t._1.
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.3.4#6332)
>
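
For what it's worth, the symptom in the quoted report (every pair carrying the last key of the set) is consistent with Hadoop's record reader reusing a single Writable instance for every record, which the Spark documentation for sequenceFile warns about. The report itself does not state the cause, so treat that as an assumption. Below is a minimal sketch of the effect in plain Scala, with no Hadoop or Spark dependency. MutableKey, reusingReader and the sample records are hypothetical stand-ins for illustration, not Mahout, Spark or Hadoop API.

```scala
// Hypothetical stand-in for a mutable Hadoop Writable key such as Text.
final class MutableKey(var value: String) {
  override def toString: String = value
}

object KeyReuseSketch {
  // Mimics a record reader that refills ONE shared key object per record
  // instead of allocating a new key each time (assumed behaviour).
  def reusingReader(records: Seq[(String, Int)]): Iterator[(MutableKey, Int)] = {
    val shared = new MutableKey("")
    records.iterator.map { case (k, v) =>
      shared.value = k // overwrite in place, no new object
      (shared, v)
    }
  }

  // Made-up sample data, not taken from the report.
  val records: Seq[(String, Int)] = Seq("/doc/1" -> 10, "/doc/2" -> 20, "/doc/3" -> 30)

  // Holding on to the key references: all pairs alias the same object,
  // so once the reader is exhausted every key reads as the last one.
  val aliased: Seq[String] =
    reusingReader(records).toList.map(_._1.toString)

  // Copying each key to an immutable String while its record is current.
  val copied: Seq[String] =
    reusingReader(records).map { case (k, v) => (k.toString, v) }.toList.map(_._1)

  def main(args: Array[String]): Unit = {
    println(aliased) // List(/doc/3, /doc/3, /doc/3)
    println(copied)  // List(/doc/1, /doc/2, /doc/3)
  }
}
```

If that assumption holds, the remedy in drmFromHDFS would presumably be to turn the key into an immutable value inside that first map, before anything holds on to the pairs. I have not checked this against the actual fix.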
