Re: [jira] [Commented] (MAHOUT-1615) SparkEngine drmFromHDFS returning the same Key for all Key,Vec Pairs for Text-Keyed SequenceFiles

Dmitriy Lyubimov Fri, 10 Oct 2014 08:35:16 -0700

I already mentioned that I don't want any whiff of Hadoop stuff in
math-scala, mostly because it is impossible to pinpoint the exact
Hadoop API version a third party wants to use. With that approach, there
will always be applications claiming incompatibility with this or that in
Hadoop.


It might make sense to create another module for "all things Hadoop" and
make the engine-specific modules depend on it, but I am not sure the amount
of current code really justifies it yet. mrlegacy is kind of that, but it is
legacy. Maybe it makes sense to move some things from legacy into that "all
things Hadoop" module (e.g. sequence file iterators and such), although
those are not used anywhere beyond mrlegacy right now.


On Thu, Oct 9, 2014 at 4:34 PM, ASF GitHub Bot (JIRA) <[email protected]>
wrote:

>
>     [
> https://issues.apache.org/jira/browse/MAHOUT-1615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14165989#comment-14165989
> ]
>
> ASF GitHub Bot commented on MAHOUT-1615:
> ----------------------------------------
>
> Github user andrewpalumbo commented on the pull request:
>
>     https://github.com/apache/mahout/pull/58#issuecomment-58594183
>
>     @dlyubimov , @pferrel - any objections to moving common/HDFSUtils into
> math-scala?
>
>
> > SparkEngine drmFromHDFS returning the same Key for all Key,Vec Pairs for
> Text-Keyed SequenceFiles
> >
> -------------------------------------------------------------------------------------------------
> >
> >                 Key: MAHOUT-1615
> >                 URL: https://issues.apache.org/jira/browse/MAHOUT-1615
> >             Project: Mahout
> >          Issue Type: Bug
> >            Reporter: Andrew Palumbo
> >            Assignee: Andrew Palumbo
> >             Fix For: 1.0
> >
> >
> > When reading seq2sparse output of the form <Text,VectorWritable> from
> HDFS in the spark-shell, SparkEngine's drmFromHDFS method creates RDDs
> with the same key for all pairs:
> > {code}
> > mahout> val drmTFIDF= drmFromHDFS( path =
> "/tmp/mahout-work-andy/20news-test-vectors/part-r-00000")
> > {code}
> > Has keys:
> > {...}
> >     key: /talk.religion.misc/84570
> >     key: /talk.religion.misc/84570
> >     key: /talk.religion.misc/84570
> > {...}
> > for the entire set.  This is the last Key in the set.
> > The problem can be traced to the first line of drmFromHDFS(...) in
> SparkEngine.scala:
> > {code}
> >  val rdd = sc.sequenceFile(path, classOf[Writable],
> classOf[VectorWritable], minPartitions = parMin)
> >         // Get rid of VectorWritable
> >         .map(t => (t._1, t._2.get()))
> > {code}
> > which gives the same key for all t._1.
> >
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.3.4#6332)
>
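
The symptom in the quoted issue (every pair showing the last key in the set) is consistent with Hadoop record readers reusing a single Writable instance for every record, which the Spark documentation for sequenceFile warns about. The thread itself does not state the cause, so treat the following as a minimal sketch under that assumption. It uses no Spark or Hadoop classes: MutableKey, readAll and ReuseDemo are hypothetical stand-ins that only imitate the reuse behaviour, and the keys are made up.

```scala
// Hypothetical stand-in for a mutable Hadoop Writable such as Text.
final class MutableKey(var value: String) {
  override def toString: String = value
}

object ReuseDemo {
  // Hypothetical reader: like a Hadoop record reader, it hands back the
  // SAME key object for every record and only overwrites its contents.
  def readAll(records: Seq[(String, Int)]): Iterator[(MutableKey, Int)] = {
    val shared = new MutableKey("")
    records.iterator.map { case (k, v) =>
      shared.value = k
      (shared, v)
    }
  }

  val records: Seq[(String, Int)] = Seq("/a/1" -> 1, "/b/2" -> 2, "/c/3" -> 3)

  // Keeping the key reference: all collected pairs alias one object,
  // so every key reads as the last one once the reader has finished.
  def aliased: Seq[String] =
    readAll(records).toVector.map(_._1.toString)

  // Copying the key's contents out before the reader advances
  // preserves the distinct keys.
  def copied: Seq[String] =
    readAll(records).map { case (k, v) => (k.toString, v) }.toVector.map(_._1)
}

// ReuseDemo.aliased == Seq("/c/3", "/c/3", "/c/3")
// ReuseDemo.copied  == Seq("/a/1", "/b/2", "/c/3")
```

If the same cause applies to drmFromHDFS, the analogous change would be to copy the key's contents inside the .map that already unwraps the VectorWritable, rather than passing t._1 through unchanged.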
