[ https://issues.apache.org/jira/browse/MAHOUT-1615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143927#comment-14143927 ]
ASF GitHub Bot commented on MAHOUT-1615:
----------------------------------------

GitHub user andrewpalumbo reopened a pull request:

    https://github.com/apache/mahout/pull/52

    MAHOUT-1615: drmFromHDFS returning the same Key for all Key,Vec Pairs for Text-Keyed SequenceFiles

    SparkContext.sequenceFile(...) will yield the same key per partition for Text-keyed sequence files if a new copy of the key is not created when mapping to an RDD. This patch checks for Text keys and creates a copy of each key where necessary.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/andrewpalumbo/mahout MAHOUT-1615

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/mahout/pull/52.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #52
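For context, the behavior described above is the usual Hadoop object-reuse caveat: the SequenceFile reader hands back the same Writable key instance for every record, so any key that outlives the current record has to be deep-copied inside the map. Below is a minimal Scala sketch of a key-copying helper along the lines the description outlines; the name copyKey and the set of handled key types are illustrative, not taken from the actual patch.

{code}
import org.apache.hadoop.io.{IntWritable, LongWritable, Text, Writable}

// Deep-copy the key: Hadoop's SequenceFile reader reuses a single Writable
// instance per split, so keys kept past the current record must be copied.
def copyKey(k: Writable): Writable = k match {
  case t: Text         => new Text(t)              // copies the backing bytes
  case i: IntWritable  => new IntWritable(i.get)
  case l: LongWritable => new LongWritable(l.get)
  case other           => other                    // assume other key types are handled elsewhere
}
{code}

Applied while building the RDD, e.g. .map { case (k, v) => (copyKey(k), v.get()) }, each pair then owns its key instead of sharing the reader's mutable instance.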
----

commit 6adb01ee53ce591962b97a4ed474c111635f7c47
Author: Andrew Palumbo <ap....@outlook.com>
Date:   2014-09-14T20:50:50Z

    Create copy of Key for Text Keys

commit a81c34000e0152a7ad7349afdba8a9e854380653
Author: Andrew Palumbo <ap....@outlook.com>
Date:   2014-09-14T21:11:45Z

    use y: Writable instead of y: VectorWritable

commit 2d431a40bd7a62a117487451b3e255c2e56c7d1a
Author: Andrew Palumbo <ap....@outlook.com>
Date:   2014-09-14T22:55:34Z

    scala not java

commit 5541d500ef8cfb53c9e18da1c760ea8c39dd5409
Author: Andrew Palumbo <ap....@outlook.com>
Date:   2014-09-17T19:57:35Z

    Read SequenceFile Header to get Key/Value Classes

commit 79468e7a83e7af1f9ab10689b46a900bd463aa38
Author: Andrew Palumbo <ap....@outlook.com>
Date:   2014-09-17T19:59:34Z

    Use our new method to get the ClassTag

commit ffb40161173df145041cf60d9b2878af3cab911b
Author: Andrew Palumbo <ap....@outlook.com>
Date:   2014-09-18T23:07:33Z

    Very clunky use of getKeyClass. This commit solves the initial problems but needs to be gutted. Hadoop configuration needs to be set up correctly. Spark I/O DFS tests fail because the reader doesn't know whether to read locally or from HDFS.

commit 56b5db1b2ee5cdef3700fb9199c75637d5e3b570
Author: Andrew Palumbo <ap....@outlook.com>
Date:   2014-09-19T19:10:00Z

    Use case matching and val2keyFunc to map rdd

commit 0edeffe30c64e78297335f06516d8cb3ff6b36a3
Author: Andrew Palumbo <ap....@outlook.com>
Date:   2014-09-21T21:58:29Z

    Cleanup/temporarily hardcode spark kryo buffer property

commit 0d64b0be427998e74cc6d57bf653573b235e0a31
Author: Dmitriy Lyubimov <dlyubi...@apache.org>
Date:   2014-09-22T17:13:15Z

    Adding key class tag extraction from checkpointed class; adding assert to `DRM DFS i/o (local)` test to fail if key class tag is incorrectly loaded. Temporarily commented out the h2o module since it doesn't compile.

commit c63468e10b9ae7e7b0b50728b0d2b71883f894b8
Author: Andrew Palumbo <ap....@outlook.com>
Date:   2014-09-22T17:40:20Z

    Merge branch 'MAHOUT-1615-drmFromHdfs' of http://github.com/dlyubimov/mahout into MAHOUT-1615

commit 36b40615619c12dcd7820e7628b3bbf86329bd76
Author: Andrew Palumbo <ap....@outlook.com>
Date:   2014-09-22T21:42:29Z

    Read a sequence FILE, not a directory, like the error says

commit da014759404db1398f8e334e2801b5159e816d56
Author: Andrew Palumbo <ap....@outlook.com>
Date:   2014-09-22T21:50:46Z

    cleanup

----

> SparkEngine drmFromHDFS returning the same Key for all Key,Vec Pairs for Text-Keyed SequenceFiles
> --------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-1615
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1615
>             Project: Mahout
>          Issue Type: Bug
>            Reporter: Andrew Palumbo
>             Fix For: 1.0
>
>
> When reading seq2sparse output of the form <Text,VectorWritable> from HDFS in the spark-shell, SparkEngine's drmFromHDFS method creates RDDs with the same key for all pairs:
> {code}
> mahout> val drmTFIDF = drmFromHDFS( path = "/tmp/mahout-work-andy/20news-test-vectors/part-r-00000")
> {code}
> Has keys:
> {...}
> key: /talk.religion.misc/84570
> key: /talk.religion.misc/84570
> key: /talk.religion.misc/84570
> {...}
> for the entire set. This is the last key in the set.
> The problem can be traced to the first line of drmFromHDFS(...) in SparkEngine.scala:
> {code}
> val rdd = sc.sequenceFile(path, classOf[Writable], classOf[VectorWritable], minPartitions = parMin)
>     // Get rid of VectorWritable
>     .map(t => (t._1, t._2.get()))
> {code}
> which gives the same key for all t._1.
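For illustration, here is a minimal sketch of that read path with the key copied before the VectorWritable wrapper is dropped. It assumes, as in the quoted snippet, that sc, path and parMin are in scope; it is a sketch of the idea described in the pull request, not the committed code.

{code}
import org.apache.hadoop.io.{Text, Writable}
import org.apache.mahout.math.VectorWritable

val rdd = sc.sequenceFile(path, classOf[Writable], classOf[VectorWritable],
    minPartitions = parMin)
  // Get rid of VectorWritable, but copy the Text key first so each pair keeps
  // its own key value rather than a reference to the reader's reused instance.
  .map { case (key, vw) =>
    val ownKey: Writable = key match {
      case t: Text => new Text(t)   // deep copy of the reused Text object
      case other   => other
    }
    (ownKey, vw.get())
  }
{code}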