[
https://issues.apache.org/jira/browse/MAHOUT-1615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143419#comment-14143419
]
ASF GitHub Bot commented on MAHOUT-1615:
----------------------------------------
Github user dlyubimov commented on the pull request:
https://github.com/apache/mahout/pull/52#issuecomment-56404290
On (1), it doesn't work because it takes the ClassTag from the method's context
bound, which is resolved at the call site, not from the actual evidence held by
the class.
In order for this to work, I suggest adding
def keyClassTag:ClassTag[K]
to the CheckpointedDrm trait and implementing it in the concrete checkpointed
implementations as simply `implicitly[ClassTag[K]]`. Unfortunately you
cannot implement it in a trait (like inside DrmLike or CheckpointedDrm)
because, as it stands, traits do not support access to the concrete class's
evidence (as our workaround demonstrates, it is theoretically possible to
support it through a virtual query to the implementation, but Scala is
not really there yet).
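A minimal sketch of that suggestion, for illustration only. The types below are heavily simplified, hypothetical stand-ins for Mahout's DrmLike and CheckpointedDrm (the real traits carry many more members), and CheckpointedDrmImpl and KeyClassTagSketch are made-up names:

```scala
import scala.reflect.ClassTag

// Hypothetical, simplified stand-ins for the Mahout traits.
trait DrmLike[K]

trait CheckpointedDrm[K] extends DrmLike[K] {
  // Abstract here: a Scala 2 trait cannot declare a context bound,
  // so it has no ClassTag[K] evidence of its own to return.
  def keyClassTag: ClassTag[K]
}

// The concrete class does have a context bound, so the evidence is
// captured when the instance is constructed.
class CheckpointedDrmImpl[K: ClassTag] extends CheckpointedDrm[K] {
  def keyClassTag: ClassTag[K] = implicitly[ClassTag[K]]
}

object KeyClassTagSketch {
  // The static type forgets K, as with the CheckpointedDrm[_] in the shell...
  val drm: CheckpointedDrm[_] = new CheckpointedDrmImpl[String]

  // ...but the tag stored in the instance still knows the key class.
  val recoveredKeyClass: Class[_] = drm.keyClassTag.runtimeClass
}
```

Here the key class comes from the instance itself rather than from whatever the compiler can see at the call site, so it survives the existential type.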
On (2), you need to ask it to load not the directory, but any partition file
inside that directory. Obviously you need to require that the source directory
contains at least one partition file with a header.
Also keep in mind that the SequenceFile API changed A LOT between Hadoop 1 and
2, and Spark works with both, but a naive (non-reflection) implementation can
only work with whatever is currently declared as the Mahout dependency. This is
why I am saying that implementing it with full cross-version Hadoop
compatibility the way Spark does is extremely hairy.
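A rough sketch of the "pick a partition file, then read its header" idea, under several assumptions. The helper name keyClassNameOf is hypothetical, and the "part-" prefix filter is a guess based on the part-r-00000 file name in this thread. It uses the `SequenceFile.Reader(fs, path, conf)` constructor, which I believe is present in both Hadoop 1 and Hadoop 2 (deprecated in the latter); this is the naive, non-reflection variant and makes no attempt at the cross-version handling mentioned above:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.SequenceFile

// Hypothetical helper: returns the key class name recorded in the header
// of the first partition file found under `dir`.
def keyClassNameOf(dir: String, conf: Configuration): String = {
  val dirPath = new Path(dir)
  // Resolve the file system from the path itself (local vs. HDFS).
  val fs = dirPath.getFileSystem(conf)

  // Assumption: partition files are named "part-*".
  val partFile = fs.listStatus(dirPath)
    .map(_.getPath)
    .filter(_.getName.startsWith("part-"))
    .sortBy(_.getName)
    .headOption
    .getOrElse(throw new IllegalArgumentException(
      s"No partition file with a header found under $dir"))

  // Opening the reader parses the header; no records are read.
  val reader = new SequenceFile.Reader(fs, partFile, conf)
  try reader.getKeyClassName finally reader.close()
}
```

The returned name (for example "org.apache.hadoop.io.Text") could then be matched on to choose the key type before the RDD is created.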
On Mon, Sep 22, 2014 at 9:36 AM, Andrew Palumbo <[email protected]>
wrote:
> Still a work in progress, (and still in need of some cleanup). The latest
> commits now solve the original key object reuse problem by method (2) -
> reading key type in from the SequenceFile Headers and then matching on it:
>
> mahout> val drmTFIDF= drmFromHDFS( path = "/tmp/mahout-work-andy/20news-test-vectors/part-r-00000")
> 14/09/22 11:20:33 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> drmTFIDF: org.apache.mahout.math.drm.CheckpointedDrm[_] = org.apache.mahout.sparkbindings.drm.CheckpointedDrmSpark@adf7236
> mahout> val rowLabels=drmTFIDF.getRowLabelBindings
> rowLabels: java.util.Map[String,Integer] =
> {/soc.religion.christian/21427=6141, /comp.graphics/38427=422,
> /comp.sys.ibm.pc.hardware/60526=1281, /misc.forsale/76295=2495,
> /soc.religion.christian/21332=6103, /sci.med/59045=5265,
> /sci.electronics/54343=5096, /comp.sys.ibm.pc.hardware/60928=1404,
> /rec.sport.hockey/54173=4205, /rec.motorcycles/104596=3282,
> /rec.autos/103326=2968, /talk.politics.misc/179110=7333,
> /comp.windows.x/66966=1944, /rec.autos/103707=3053, /comp.windows.x/67474=2146,
> /rec.sport.baseball/105011=3850, /talk.religion.misc/83812=7424,
> /comp.graphics/38707=522, /comp.graphics/38597=484,
> /sci.electronics/54317=5083, /rec.motorcycles/104708=3322,
> /rec.sport.hockey/53627=3994, /comp.sys.mac.hardware/51633=1601,
> /sci.crypt/16088=4686, /sci.electronics/53714=4840, /rec.sport.ho...
> mahout> rowLabels.size
> res15: Int = 7598
>
> Which is what I am expecting.
>
> Two problems that I am still having:
>
> (1) It's not yet solving the problem of setting the DrmLike[_] ClassTag.
>
> mahout> def getKeyClassTag[K: ClassTag](drm:DrmLike[K]) = implicitly[ClassTag[K]]
>
> mahout> getKeyClassTag(drmTFIDF)
> res13: scala.reflect.ClassTag[_] = Object
>
> I believe that this is just because I'm not setting it correctly due to my
> limited scala abilities.
>
> (2) DRM DFS i/o (local) is failing. I believe that this may be a downside to
> integrating HDFS I/O code into the spark module. I'm not positive I'm
> setting the configuration correctly inside of drmFromHDFS(...). I have no
> problem reading in the files from within the spark-shell, but the spark DRM
> DFS i/o (local) test is failing with:
>
> DRM DFS i/o (local) *** FAILED ***
> java.io.FileNotFoundException: /home/andy/sandbox/mahout/spark/tmp/UploadedDRM (Is a directory)
>
> I believe this may be because SequenceFile.readHeader(...) is trying to read
> from HDFS and the test is writing locally. I will continue to look into
> this.
>
> —
> Reply to this email directly or view it on GitHub
> <https://github.com/apache/mahout/pull/52#issuecomment-56401176>.
>
> SparkEngine drmFromHDFS returning the same Key for all Key,Vec Pairs for
> Text-Keyed SequenceFiles
> -------------------------------------------------------------------------------------------------
>
> Key: MAHOUT-1615
> URL: https://issues.apache.org/jira/browse/MAHOUT-1615
> Project: Mahout
> Issue Type: Bug
> Reporter: Andrew Palumbo
> Fix For: 1.0
>
>
> When reading in seq2sparse output from HDFS in the spark-shell of form
> <Text,VectorWritable> SparkEngine's drmFromHDFS method is creating rdds
> with the same Key for all Pairs:
> {code}
> mahout> val drmTFIDF= drmFromHDFS( path =
> "/tmp/mahout-work-andy/20news-test-vectors/part-r-00000")
> {code}
> Has keys:
> {...}
> key: /talk.religion.misc/84570
> key: /talk.religion.misc/84570
> key: /talk.religion.misc/84570
> {...}
> for the entire set. This is the last Key in the set.
> The problem can be traced to the first line of drmFromHDFS(...) in
> SparkEngine.scala:
> {code}
> val rdd = sc.sequenceFile(path, classOf[Writable], classOf[VectorWritable],
> minPartitions = parMin)
> // Get rid of VectorWritable
> .map(t => (t._1, t._2.get()))
> {code}
> which gives the same key for all t._1.
>