[
https://issues.apache.org/jira/browse/MAHOUT-1615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143378#comment-14143378
]
ASF GitHub Bot commented on MAHOUT-1615:
----------------------------------------
Github user andrewpalumbo commented on the pull request:
https://github.com/apache/mahout/pull/52#issuecomment-56401176
Still a work in progress (and still in need of some cleanup). The latest
commits now solve the original key object reuse problem using method (2), i.e.
reading the key type in from the SequenceFile headers and then matching on it:
mahout> val drmTFIDF= drmFromHDFS( path =
"/tmp/mahout-work-andy/20news-test-vectors/part-r-00000")
14/09/22 11:20:33 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
drmTFIDF: org.apache.mahout.math.drm.CheckpointedDrm[_] =
org.apache.mahout.sparkbindings.drm.CheckpointedDrmSpark@adf7236
mahout> val rowLabels=drmTFIDF.getRowLabelBindings
rowLabels: java.util.Map[String,Integer] =
{/soc.religion.christian/21427=6141, /comp.graphics/38427=422,
/comp.sys.ibm.pc.hardware/60526=1281, /misc.forsale/76295=2495,
/soc.religion.christian/21332=6103, /sci.med/59045=5265,
/sci.electronics/54343=5096, /comp.sys.ibm.pc.hardware/60928=1404,
/rec.sport.hockey/54173=4205, /rec.motorcycles/104596=3282,
/rec.autos/103326=2968, /talk.politics.misc/179110=7333,
/comp.windows.x/66966=1944, /rec.autos/103707=3053, /comp.windows.x/67474=2146,
/rec.sport.baseball/105011=3850, /talk.religion.misc/83812=7424,
/comp.graphics/38707=522, /comp.graphics/38597=484,
/sci.electronics/54317=5083, /rec.motorcycles/104708=3322,
/rec.sport.hockey/53627=3994, /comp.sys.mac.hardware/51633=1601,
/sci.crypt/16088=4686, /sci.electronics/53714=4840, /rec.sport.ho...
mahout> rowLabels.size
res15: Int = 7598
Which is what I am expecting.
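For readers unfamiliar with the key object reuse problem, here is a minimal,
self-contained sketch of it. It does not use Hadoop or Spark: `MutableKey` and
`records` are hypothetical stand-ins for a reused Writable key and a record
reader, and the three keys are taken from the output above. It only illustrates
why holding on to the key object makes every pair show the last key, and why
copying the key out per record avoids that.

```scala
object KeyReuseSketch {
  // Hypothetical stand-in for a mutable Writable key (e.g. Text).
  final class MutableKey(var value: String) {
    override def toString: String = value
  }

  // Hypothetical reader: hands out the SAME key object for every record
  // and only overwrites its contents.
  def records(names: Seq[String]): Iterator[(MutableKey, Int)] = {
    val reused = new MutableKey("")
    names.iterator.zipWithIndex.map { case (n, i) =>
      reused.value = n
      (reused, i)
    }
  }

  val names: Seq[String] =
    Seq("/sci.med/59045", "/rec.autos/103326", "/talk.religion.misc/84570")

  // Keeping references to the reused object: all entries alias one key,
  // so after the read finishes they all show the last key.
  val aliased: List[String] = records(names).toList.map(_._1.toString)

  // Copying the key to an immutable value while each record is current.
  val copied: List[String] =
    records(names).map { case (k, i) => (k.toString, i) }.toList.map(_._1)
}
```

In this sketch `aliased` contains the last key three times, while `copied`
preserves the three distinct keys.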
Two problems that I am still having:
(1) It's not yet solving the problem of setting the DrmLike[_] ClassTag.
mahout> def getKeyClassTag[K: ClassTag](drm:DrmLike[K]) =
implicitly[ClassTag[K]]
mahout> getKeyClassTag(drmTFIDF)
res13: scala.reflect.ClassTag[_] = Object
I believe that this is just because I'm not setting it correctly due to my
limited Scala abilities.
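One possible direction, sketched below with plain Scala only, is to pick the
type parameter explicitly in each branch of the match on the key class name and
to store the resulting ClassTag in the object, so that it survives even when
the static type is existential. `Keyed` and `fromKeyClassName` are hypothetical
names, not Mahout's API, and the mapping of Text to String and IntWritable to
Int is an assumption made for illustration; this is not necessarily how the
pull request does or should do it.

```scala
import scala.reflect.{ClassTag, classTag}

object KeyTagSketch {
  // Hypothetical stand-in for a DRM-like container; NOT Mahout's DrmLike.
  // The ClassTag is captured at construction time and kept as a field,
  // so it is still available when the static type is only Keyed[_].
  final class Keyed[K](val keys: Seq[K])(implicit val keyTag: ClassTag[K])

  // Choose the key type from a class name, e.g. one read from a file header.
  // Each branch names K explicitly, so the compiler supplies the matching tag.
  // The Writable-to-Scala type mapping here is an assumption.
  def fromKeyClassName(name: String, raw: Seq[String]): Keyed[_] = name match {
    case "org.apache.hadoop.io.Text"        => new Keyed[String](raw)
    case "org.apache.hadoop.io.IntWritable" => new Keyed[Int](raw.map(_.toInt))
    case other =>
      throw new IllegalArgumentException(s"Unsupported key class: $other")
  }
}
```

With this shape, `KeyTagSketch.fromKeyClassName("org.apache.hadoop.io.Text",
Seq("a")).keyTag` equals `classTag[String]` rather than a tag for Object.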
(2) DRM DFS i/o (local) is failing. I believe that this may be a downside of
integrating HDFS I/O code into the spark module. I'm not positive that I'm
setting the configuration correctly inside of drmFromHDFS(...). I have no
problem reading in the files from within the spark-shell, but the spark `DRM DFS
i/o (local)` test is failing with:
DRM DFS i/o (local) *** FAILED ***
java.io.FileNotFoundException:
/home/andy/sandbox/mahout/spark/tmp/UploadedDRM (Is a directory)
I believe this may be because SequenceFile.readHeader(...) is trying to read
from HDFS while the test is writing locally. I will continue to look into this.
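A separate possibility worth checking: "(Is a directory)" is the message java.io
gives when a directory is opened as if it were a file, so the header read may be
receiving the output directory itself rather than one of the part files inside
it. The sketch below is only an illustration of resolving such a path to a
single file before reading a header. It uses java.io.File to stay
self-contained; `firstPartFile` is a hypothetical helper, and the "part-" name
prefix is an assumption based on the part-r-00000 file name above. Real code
would presumably go through Hadoop's FileSystem API instead.

```scala
import java.io.File

object PartFileSketch {
  // Hypothetical helper: given a path that is either a single file or a
  // directory of part files, return one regular file to read a header from.
  def firstPartFile(path: String): Option[File] = {
    val f = new File(path)
    if (f.isFile) Some(f)
    else if (f.isDirectory)
      // listFiles() can return null on I/O errors, hence the Option wrapper.
      Option(f.listFiles()).map(_.toSeq).getOrElse(Seq.empty[File])
        .filter(c => c.isFile && c.getName.startsWith("part-"))
        .sortBy(_.getName)
        .headOption
    else None // path does not exist
  }
}
```

Marker files such as _SUCCESS and hidden checksum files are skipped because
they do not start with "part-".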
> SparkEngine drmFromHDFS returning the same Key for all Key,Vec Pairs for
> Text-Keyed SequenceFiles
> -------------------------------------------------------------------------------------------------
>
> Key: MAHOUT-1615
> URL: https://issues.apache.org/jira/browse/MAHOUT-1615
> Project: Mahout
> Issue Type: Bug
> Reporter: Andrew Palumbo
> Fix For: 1.0
>
>
> When reading in seq2sparse output from HDFS in the spark-shell of form
> <Text,VectorWritable>, SparkEngine's drmFromHDFS method is creating RDDs
> with the same Key for all Pairs:
> {code}
> mahout> val drmTFIDF= drmFromHDFS( path =
> "/tmp/mahout-work-andy/20news-test-vectors/part-r-00000")
> {code}
> Has keys:
> {...}
> key: /talk.religion.misc/84570
> key: /talk.religion.misc/84570
> key: /talk.religion.misc/84570
> {...}
> for the entire set. This is the last Key in the set.
> The problem can be traced to the first line of drmFromHDFS(...) in
> SparkEngine.scala:
> {code}
> val rdd = sc.sequenceFile(path, classOf[Writable], classOf[VectorWritable],
> minPartitions = parMin)
> // Get rid of VectorWritable
> .map(t => (t._1, t._2.get()))
> {code}
> which gives the same key for all t._1.
>