[ https://issues.apache.org/jira/browse/MAHOUT-1615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143432#comment-14143432 ]
ASF GitHub Bot commented on MAHOUT-1615:
----------------------------------------

Github user dlyubimov commented on the pull request:

https://github.com/apache/mahout/pull/52#issuecomment-56405511

To make the Scala side a little simpler: declaring a context bound is the same as declaring an implicit value inside a class, or an implicit parameter on a method. Examples of what it means:

def method[K: ContextType](...params) is equivalent to declaring def method[K](...params)(implicit evidence$1: ContextType[K])

Similarly, class Clazz[K: ContextType](...params...) is equivalent to writing class Clazz[K](...params...)(implicit val evidence$1: ContextType[K])

So what you did is equivalent to returning the parameter implicitly passed to the method, not extracting the implicit evidence from the implementation class you passed in.
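A minimal, self-contained sketch of that desugaring (editor's illustration; `keyTagBound`, `keyTagExplicit`, and `TaggedHolder` are made-up names, not Mahout API):

{code}
import scala.reflect.ClassTag

// Context-bound form: the compiler threads the evidence for you.
def keyTagBound[K: ClassTag](xs: Seq[K]): ClassTag[K] = implicitly[ClassTag[K]]

// Equivalent desugared form: the evidence is an ordinary implicit parameter.
def keyTagExplicit[K](xs: Seq[K])(implicit evidence: ClassTag[K]): ClassTag[K] = evidence

// The same equivalence for a class: the context bound becomes an implicit
// constructor parameter, so the evidence is captured by the concrete instance
// and can be exposed as a plain member.
class TaggedHolder[K: ClassTag](val xs: Seq[K]) {
  def keyClassTag: ClassTag[K] = implicitly[ClassTag[K]] // evidence captured at construction
}

object DesugarDemo extends App {
  val h = new TaggedHolder(Seq(1, 2, 3)) // ClassTag[Int] captured here
  println(keyTagBound(h.xs))  // Int: resolved from the static type at the call site
  println(h.keyClassTag)      // Int: resolved from the evidence stored in the instance
}
{code}

Once the static type has degraded to a wildcard (as with the DrmLike[_] value later in this thread), the method bound can only see the erased static type, while a member backed by the instance's captured evidence still reports the real key class; that is the motivation for the keyClassTag member suggested in the quoted mail below.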
On Mon, Sep 22, 2014 at 9:58 AM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:

> On (1), it doesn't work because it takes the ClassTag from the method bound, not from the actual evidence in the class.
>
> In order for this to work, I suggest adding
>
> def keyClassTag: ClassTag[K]
>
> to the CheckpointedDrm trait and implementing it in the concrete checkpointed implementations as simply `implicitly[ClassTag[K]]`. Unfortunately you cannot implement it in a trait (like inside DrmLike or CheckpointedDrm) because, as it stands, traits do not support access to the concrete class's evidence (as our workaround demonstrates, it is theoretically possible to support it through a virtual query to the implementation, but as it stands, Scala is not really there).
>
> On (2), you need to ask to load not the directory, but any partition file inside that directory. Obviously you need to require that the source directory contains at least one partition file with a header.
>
> Also keep in mind that the SequenceFile API changed A LOT between Hadoop 1 and 2, and Spark works with both, but a naive (non-reflection) implementation can only work with whatever is currently declared as the Mahout dependency. This is why I am saying that implementing it with full cross-version Hadoop compatibility the way Spark does is extremely hairy.
>
>
> On Mon, Sep 22, 2014 at 9:36 AM, Andrew Palumbo <notificati...@github.com> wrote:
>
>> Still a work in progress (and still in need of some cleanup). The latest commits now solve the original key object reuse problem by method (2): reading the key type in from the SequenceFile headers and then matching on it:
>>
>> mahout> val drmTFIDF= drmFromHDFS( path = "/tmp/mahout-work-andy/20news-test-vectors/part-r-00000")
>> 14/09/22 11:20:33 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>> drmTFIDF: org.apache.mahout.math.drm.CheckpointedDrm[_] = org.apache.mahout.sparkbindings.drm.CheckpointedDrmSpark@adf7236
>> mahout> val rowLabels=drmTFIDF.getRowLabelBindings
>> rowLabels: java.util.Map[String,Integer] = {/soc.religion.christian/21427=6141, /comp.graphics/38427=422, /comp.sys.ibm.pc.hardware/60526=1281, /misc.forsale/76295=2495, /soc.religion.christian/21332=6103, /sci.med/59045=5265, /sci.electronics/54343=5096, /comp.sys.ibm.pc.hardware/60928=1404, /rec.sport.hockey/54173=4205, /rec.motorcycles/104596=3282, /rec.autos/103326=2968, /talk.politics.misc/179110=7333, /comp.windows.x/66966=1944, /rec.autos/103707=3053, /comp.windows.x/67474=2146, /rec.sport.baseball/105011=3850, /talk.religion.misc/83812=7424, /comp.graphics/38707=522, /comp.graphics/38597=484, /sci.electronics/54317=5083, /rec.motorcycles/104708=3322, /rec.sport.hockey/53627=3994, /comp.sys.mac.hardware/51633=1601, /sci.crypt/16088=4686, /sci.electronics/53714=4840, /rec.sport.ho...
>> mahout> rowLabels.size
>> res15: Int = 7598
>>
>> Which is what I am expecting.
>>
>> Two problems that I am still having:
>>
>> (1) It's not yet solving the problem of setting the DrmLike[_] ClassTag:
>>
>> mahout> def getKeyClassTag[K: ClassTag](drm: DrmLike[K]) = implicitly[ClassTag[K]]
>>
>> mahout> getKeyClassTag(drmTFIDF)
>> res13: scala.reflect.ClassTag[_] = Object
>>
>> I believe that this is just because I'm not setting it correctly, due to my limited Scala abilities.
>>
>> (2) DRM DFS i/o (local) is failing. I believe that this may be a downside to integrating the HDFS I/O code into the spark module. I'm not positive I'm setting the configuration correctly inside of drmFromHDFS(...). I have no problem reading in the files from within the spark-shell, but the spark "DRM DFS i/o (local)" test is failing with:
>>
>> DRM DFS i/o (local) *** FAILED ***
>> java.io.FileNotFoundException: /home/andy/sandbox/mahout/spark/tmp/UploadedDRM (Is a directory)
>>
>> I believe this may be because SequenceFile.readHeader(...) is trying to read from HDFS while the test is writing locally. I will continue to look into this.
>>
>> —
>> Reply to this email directly or view it on GitHub
>> <https://github.com/apache/mahout/pull/52#issuecomment-56401176>.
>
>
> SparkEngine drmFromHDFS returning the same Key for all Key,Vec Pairs for Text-Keyed SequenceFiles
> -------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-1615
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1615
>             Project: Mahout
>          Issue Type: Bug
>            Reporter: Andrew Palumbo
>             Fix For: 1.0
>
>
> When reading seq2sparse output of the form <Text,VectorWritable> from HDFS in the spark-shell, SparkEngine's drmFromHDFS method is creating RDDs with the same key for all pairs:
> {code}
> mahout> val drmTFIDF= drmFromHDFS( path = "/tmp/mahout-work-andy/20news-test-vectors/part-r-00000")
> {code}
> Has keys:
> {...}
> key: /talk.religion.misc/84570
> key: /talk.religion.misc/84570
> key: /talk.religion.misc/84570
> {...}
> for the entire set. This is the last key in the set.
>
> The problem can be traced to the first line of drmFromHDFS(...) in SparkEngine.scala:
> {code}
> val rdd = sc.sequenceFile(path, classOf[Writable], classOf[VectorWritable], minPartitions = parMin)
>   // Get rid of VectorWritable
>   .map(t => (t._1, t._2.get()))
> {code}
> which gives the same key for all t._1.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
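For context, the usual remedy for the Writable-reuse behaviour described above is to copy the key out of the reused object inside the map. A hedged sketch, not the actual SparkEngine fix; the helper name and the Text-only keying are assumptions:

{code}
import org.apache.hadoop.io.Text
import org.apache.mahout.math.VectorWritable
import org.apache.spark.SparkContext

// Sketch only: Hadoop's SequenceFile reader reuses the same key/value
// instances for every record, so the key must be materialised into a
// fresh object before the RDD hands the tuple on.
def textKeyedVectors(sc: SparkContext, path: String) =
  sc.sequenceFile(path, classOf[Text], classOf[VectorWritable])
    // Text#toString produces a fresh String per record, so the keys no
    // longer alias the single reused Text instance; the value is
    // unwrapped from VectorWritable as before.
    .map { case (key, vec) => (key.toString, vec.get()) }
{code}

Andrew's approach in the thread generalises this by first reading the key class from a partition file's SequenceFile header and then matching on it, so key Writable types other than Text can be handled the same way.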