Re: TF-IDF, seq2sparse and DataFrame support
We should get a JIRA going for this and try to get this in for 0.10.1.

On 03/24/2015 04:32 PM, Gokhan Capan wrote:

Andrew, maybe make the class tag evident in the mapBlock calls? I.e.:

    val tfIdfMatrix = tfMatrix.mapBlock(..){
      ...idf transformation, etc...
    }(drmMetadata.keyClassTag.asInstanceOf[ClassTag[Any]])

Best, Gokhan

On Tue, Mar 17, 2015 at 6:06 PM, Andrew Palumbo wrote:

This (last commit on this branch) should be the beginning of a workaround for the problem of reading and returning a generic-Writable keyed DRM:

https://github.com/gcapan/mahout/commit/cd737cf1b7672d3d73fe206c7bad30aae3f37e14

However, the keyClassTag of the DrmLike returned by the mapBlock() calls, and finally by the method itself, is somehow converted to Object. I'm not exactly sure why this is happening. I think that the implicit evidence is being dropped in the mapBlock call on an [Object]-cast CheckpointedDrm. Maybe calling it outside the scope of this method (breaking the method down) would fix it.

    val tfMatrix = drmMetadata.keyClassTag match {
      case ct if ct == ClassTag.Int => {
        (drmWrap(rdd = tfVectors, ncol = numCols, cacheHint = CacheHint.NONE)
          (keyClassTag.asInstanceOf[ClassTag[Any]])).asInstanceOf[CheckpointedDrmSpark[Int]]
      }
      case ct if ct == ClassTag(classOf[String]) => {
        (drmWrap(rdd = tfVectors, ncol = numCols, cacheHint = CacheHint.NONE)
          (keyClassTag.asInstanceOf[ClassTag[Any]])).asInstanceOf[CheckpointedDrmSpark[String]]
      }
      case ct if ct == ClassTag.Long => {
        (drmWrap(rdd = tfVectors, ncol = numCols, cacheHint = CacheHint.NONE)
          (keyClassTag.asInstanceOf[ClassTag[Any]])).asInstanceOf[CheckpointedDrmSpark[Long]]
      }
      case _ => {
        (drmWrap(rdd = tfVectors, ncol = numCols, cacheHint = CacheHint.NONE)
          (keyClassTag.asInstanceOf[ClassTag[Any]])).asInstanceOf[CheckpointedDrmSpark[Int]]
      }
    }

    tfMatrix.checkpoint()

    // make sure that the classtag of the tf matrix matches the metadata keyClassTag
    assert(tfMatrix.keyClassTag == drmMetadata.keyClassTag) // <-- passes here with e.g. String keys

    val tfIdfMatrix = tfMatrix.mapBlock(..){
      ...idf transformation, etc...
    }

    assert(tfIdfMatrix.keyClassTag == drmMetadata.keyClassTag) // <-- fails here for all, with tfIdfMatrix.keyClassTag as Object

I'll keep looking at it a bit. If anybody has any ideas please let me know.

On 03/09/2015 02:12 PM, Gokhan Capan wrote:

So, here is a sketch of a Spark implementation of seq2sparse, returning a (matrix: DrmLike, dictionary: Map):

https://github.com/gcapan/mahout/tree/seq2sparse

Although it should be possible, I couldn't manage to make it process non-integer document ids. Any fix would be appreciated. There is a simple test attached, but I think there is more to do in terms of handling all parameters of the original seq2sparse implementation.

I put it directly into the SparkEngine --- not that I think this object is the most appropriate placeholder; it just seemed convenient to me.

Best, Gokhan

On Thu, Feb 5, 2015 at 3:48 AM, Pat Ferrel wrote:

IndexedDataset might suffice until real DataFrames come along.

On Feb 4, 2015, at 3:42 PM, Dmitriy Lyubimov wrote:

Dealing with dictionaries is inevitably DataFrame for seq2sparse. It is a byproduct of it, IIRC. A matrix is definitely not a structure to hold those.

On Wed, Feb 4, 2015 at 9:16 AM, Andrew Palumbo wrote:

On 02/04/2015 11:13 AM, Pat Ferrel wrote:

Andrew, not sure what you mean about storing strings. If you mean something like a DRM of tokens, that is a DataFrame with row=doc, column=token. A one-row DataFrame is a slightly heavyweight string/document.
A DataFrame with token counts would be perfect for input to TF-IDF, no? It would be a vector that maintains the tokens as ids for the counts, right?

Yes - DataFrames will be perfect for this. The problem that I was referring to was that we don't have a DSL data structure to do the initial distributed tokenizing of the documents [1] line:257, [2]. For this I believe we would need something like a distributed vector of Strings that could be broadcast to a mapBlock closure and then tokenized from there. Even there, mapBlock may not be perfect for this, but some of the new distributed functions that Gokhan is working on may be.

I agree seq2sparse type input is a strong feature. Text files into an all-documents DataFrame, basically. Collocation?

As far as collocations go, I believe the n-grams are computed and counted in the CollocDriver [3] (I might be wrong here... it's been a while since I looked at the code...). Either way, I don't think I ever looked too closely, and I was a bit fuzzy on this... These were just some thoughts that I had when briefly looking at porting seq2sparse to the DSL before. Obviously we don't have to follow this algorithm, but it's a nice starting point.

[1] https://github.com/a
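As a reference for Gokhan's explicit-class-tag suggestion above, a minimal sketch, assuming the DSL's mapBlock takes its result key tag as an implicit ClassTag parameter (roughly mapBlock[R: ClassTag]) that can be supplied explicitly; tfMatrix and drmMetadata are the values from the snippets above:

    import scala.reflect.ClassTag

    // Supply the key ClassTag recorded in the DRM metadata explicitly, so
    // the result key type is not silently widened to Any/Object by inference.
    val tag = drmMetadata.keyClassTag.asInstanceOf[ClassTag[Any]]
    val tfIdfMatrix = tfMatrix.mapBlock() { case (keys, block) =>
      // ... idf transformation on `block` goes here ...
      (keys, block)
    }(tag)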
Re: TF-IDF, seq2sparse and DataFrame support
Andrew, maybe make the class tag evident in the mapBlock calls? I.e.:

    val tfIdfMatrix = tfMatrix.mapBlock(..){
      ...idf transformation, etc...
    }(drmMetadata.keyClassTag.asInstanceOf[ClassTag[Any]])

Best, Gokhan

On Tue, Mar 17, 2015 at 6:06 PM, Andrew Palumbo wrote:

This (last commit on this branch) should be the beginning of a workaround for the problem of reading and returning a generic-Writable keyed DRM:

https://github.com/gcapan/mahout/commit/cd737cf1b7672d3d73fe206c7bad30aae3f37e14

However, the keyClassTag of the DrmLike returned by the mapBlock() calls, and finally by the method itself, is somehow converted to Object. I'm not exactly sure why this is happening. I think that the implicit evidence is being dropped in the mapBlock call on an [Object]-cast CheckpointedDrm. Maybe calling it outside the scope of this method (breaking the method down) would fix it.

    val tfMatrix = drmMetadata.keyClassTag match {
      case ct if ct == ClassTag.Int => {
        (drmWrap(rdd = tfVectors, ncol = numCols, cacheHint = CacheHint.NONE)
          (keyClassTag.asInstanceOf[ClassTag[Any]])).asInstanceOf[CheckpointedDrmSpark[Int]]
      }
      case ct if ct == ClassTag(classOf[String]) => {
        (drmWrap(rdd = tfVectors, ncol = numCols, cacheHint = CacheHint.NONE)
          (keyClassTag.asInstanceOf[ClassTag[Any]])).asInstanceOf[CheckpointedDrmSpark[String]]
      }
      case ct if ct == ClassTag.Long => {
        (drmWrap(rdd = tfVectors, ncol = numCols, cacheHint = CacheHint.NONE)
          (keyClassTag.asInstanceOf[ClassTag[Any]])).asInstanceOf[CheckpointedDrmSpark[Long]]
      }
      case _ => {
        (drmWrap(rdd = tfVectors, ncol = numCols, cacheHint = CacheHint.NONE)
          (keyClassTag.asInstanceOf[ClassTag[Any]])).asInstanceOf[CheckpointedDrmSpark[Int]]
      }
    }

    tfMatrix.checkpoint()

    // make sure that the classtag of the tf matrix matches the metadata keyClassTag
    assert(tfMatrix.keyClassTag == drmMetadata.keyClassTag) // <-- passes here with e.g. String keys

    val tfIdfMatrix = tfMatrix.mapBlock(..){
      ...idf transformation, etc...
    }

    assert(tfIdfMatrix.keyClassTag == drmMetadata.keyClassTag) // <-- fails here for all, with tfIdfMatrix.keyClassTag as Object

I'll keep looking at it a bit. If anybody has any ideas please let me know.

On 03/09/2015 02:12 PM, Gokhan Capan wrote:

So, here is a sketch of a Spark implementation of seq2sparse, returning a (matrix: DrmLike, dictionary: Map):

https://github.com/gcapan/mahout/tree/seq2sparse

Although it should be possible, I couldn't manage to make it process non-integer document ids. Any fix would be appreciated. There is a simple test attached, but I think there is more to do in terms of handling all parameters of the original seq2sparse implementation.

I put it directly into the SparkEngine --- not that I think this object is the most appropriate placeholder; it just seemed convenient to me.

Best, Gokhan

On Thu, Feb 5, 2015 at 3:48 AM, Pat Ferrel wrote:

IndexedDataset might suffice until real DataFrames come along.

On Feb 4, 2015, at 3:42 PM, Dmitriy Lyubimov wrote:

Dealing with dictionaries is inevitably DataFrame for seq2sparse. It is a byproduct of it, IIRC. A matrix is definitely not a structure to hold those.

On Wed, Feb 4, 2015 at 9:16 AM, Andrew Palumbo wrote:

On 02/04/2015 11:13 AM, Pat Ferrel wrote:

Andrew, not sure what you mean about storing strings. If you mean something like a DRM of tokens, that is a DataFrame with row=doc, column=token.
A one-row DataFrame is a slightly heavyweight string/document. A DataFrame with token counts would be perfect for input to TF-IDF, no? It would be a vector that maintains the tokens as ids for the counts, right?

Yes - DataFrames will be perfect for this. The problem that I was referring to was that we don't have a DSL data structure to do the initial distributed tokenizing of the documents [1] line:257, [2]. For this I believe we would need something like a distributed vector of Strings that could be broadcast to a mapBlock closure and then tokenized from there. Even there, mapBlock may not be perfect for this, but some of the new distributed functions that Gokhan is working on may be.

I agree seq2sparse type input is a strong feature. Text files into an all-documents DataFrame, basically. Collocation?

As far as collocations go, I believe the n-grams are computed and counted in the CollocDriver [3] (I might be wrong here... it's been a while since I looked at the code...). Either way, I don't think I ever looked too closely, and I was a bit fuzzy on
Re: TF-IDF, seq2sparse and DataFrame support
This (last commit on this branch) should be the beginning of a workaround for the problem of reading and returning a generic-Writable keyed DRM:

https://github.com/gcapan/mahout/commit/cd737cf1b7672d3d73fe206c7bad30aae3f37e14

However, the keyClassTag of the DrmLike returned by the mapBlock() calls, and finally by the method itself, is somehow converted to Object. I'm not exactly sure why this is happening. I think that the implicit evidence is being dropped in the mapBlock call on an [Object]-cast CheckpointedDrm. Maybe calling it outside the scope of this method (breaking the method down) would fix it.

    val tfMatrix = drmMetadata.keyClassTag match {
      case ct if ct == ClassTag.Int => {
        (drmWrap(rdd = tfVectors, ncol = numCols, cacheHint = CacheHint.NONE)
          (keyClassTag.asInstanceOf[ClassTag[Any]])).asInstanceOf[CheckpointedDrmSpark[Int]]
      }
      case ct if ct == ClassTag(classOf[String]) => {
        (drmWrap(rdd = tfVectors, ncol = numCols, cacheHint = CacheHint.NONE)
          (keyClassTag.asInstanceOf[ClassTag[Any]])).asInstanceOf[CheckpointedDrmSpark[String]]
      }
      case ct if ct == ClassTag.Long => {
        (drmWrap(rdd = tfVectors, ncol = numCols, cacheHint = CacheHint.NONE)
          (keyClassTag.asInstanceOf[ClassTag[Any]])).asInstanceOf[CheckpointedDrmSpark[Long]]
      }
      case _ => {
        (drmWrap(rdd = tfVectors, ncol = numCols, cacheHint = CacheHint.NONE)
          (keyClassTag.asInstanceOf[ClassTag[Any]])).asInstanceOf[CheckpointedDrmSpark[Int]]
      }
    }

    tfMatrix.checkpoint()

    // make sure that the classtag of the tf matrix matches the metadata keyClassTag
    assert(tfMatrix.keyClassTag == drmMetadata.keyClassTag) // <-- passes here with e.g. String keys

    val tfIdfMatrix = tfMatrix.mapBlock(..){
      ...idf transformation, etc...
    }

    assert(tfIdfMatrix.keyClassTag == drmMetadata.keyClassTag) // <-- fails here for all, with tfIdfMatrix.keyClassTag as Object

I'll keep looking at it a bit. If anybody has any ideas please let me know.

On 03/09/2015 02:12 PM, Gokhan Capan wrote:

So, here is a sketch of a Spark implementation of seq2sparse, returning a (matrix: DrmLike, dictionary: Map):

https://github.com/gcapan/mahout/tree/seq2sparse

Although it should be possible, I couldn't manage to make it process non-integer document ids. Any fix would be appreciated. There is a simple test attached, but I think there is more to do in terms of handling all parameters of the original seq2sparse implementation.

I put it directly into the SparkEngine --- not that I think this object is the most appropriate placeholder; it just seemed convenient to me.

Best, Gokhan

On Thu, Feb 5, 2015 at 3:48 AM, Pat Ferrel wrote:

IndexedDataset might suffice until real DataFrames come along.

On Feb 4, 2015, at 3:42 PM, Dmitriy Lyubimov wrote:

Dealing with dictionaries is inevitably DataFrame for seq2sparse. It is a byproduct of it, IIRC. A matrix is definitely not a structure to hold those.

On Wed, Feb 4, 2015 at 9:16 AM, Andrew Palumbo wrote:

On 02/04/2015 11:13 AM, Pat Ferrel wrote:

Andrew, not sure what you mean about storing strings. If you mean something like a DRM of tokens, that is a DataFrame with row=doc, column=token. A one-row DataFrame is a slightly heavyweight string/document. A DataFrame with token counts would be perfect for input to TF-IDF, no? It would be a vector that maintains the tokens as ids for the counts, right?

Yes - DataFrames will be perfect for this. The problem that I was referring to was that we don't have a DSL data structure to do the initial distributed tokenizing of the documents [1] line:257, [2].
For this I believe we would need something like a distributed vector of Strings that could be broadcast to a mapBlock closure and then tokenized from there. Even there, mapBlock may not be perfect for this, but some of the new distributed functions that Gokhan is working on may be.

I agree seq2sparse type input is a strong feature. Text files into an all-documents DataFrame, basically. Collocation?

As far as collocations go, I believe the n-grams are computed and counted in the CollocDriver [3] (I might be wrong here... it's been a while since I looked at the code...). Either way, I don't think I ever looked too closely, and I was a bit fuzzy on this... These were just some thoughts that I had when briefly looking at porting seq2sparse to the DSL before. Obviously we don't have to follow this algorithm, but it's a nice starting point.

[1] https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles.java
[2] https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/vectorizer/DocumentProcessor.java
[3] https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/vectorizer/collocations/llr/CollocDriver.java

On Feb 4, 2015, at 7:47 AM, Andrew Palumbo wrote: Just copied ov
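One way the per-type match arms in the snippet above might collapse into a single generic helper - strictly a sketch, assuming drmWrap and the DrmRdd/CheckpointedDrmSpark types exactly as they are used in that snippet:

    import scala.reflect.ClassTag

    // Hypothetical helper: wrap the RDD once under whatever key type the
    // metadata reports, instead of one match arm per concrete key class.
    // Statically K may be Any here, but the runtime keyClassTag travels
    // with it, which is what the assert after checkpoint() actually checks.
    def wrapWithKeyTag[K: ClassTag](rdd: DrmRdd[K], numCols: Int): CheckpointedDrmSpark[K] =
      drmWrap(rdd = rdd, ncol = numCols, cacheHint = CacheHint.NONE)
        .asInstanceOf[CheckpointedDrmSpark[K]]

    val tfMatrix = wrapWithKeyTag(tfVectors, numCols)(
      drmMetadata.keyClassTag.asInstanceOf[ClassTag[Any]])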
Re: TF-IDF, seq2sparse and DataFrame support
I think everyone agrees that getting this into a PR would be great. We need a modernized text pipeline, and this is an excellent starting point. We can discuss there.

On Mar 10, 2015, at 3:53 AM, Gokhan Capan wrote:

Some answers:

- Non-integer document ids: The implementation does not use operations defined for DrmLike[Int] only, so the row keys do not have to be Ints. I just couldn't manage to create the returned DrmLike with the correct key type. While wrapping into a DrmLike, I tried to pass the key class using HDFS utils the way they are used in drmDfsRead, but I somehow wasn't successful. So non-int document ids are not an actual issue here.

- Breaking the implementation out into smaller pieces: Let's just collect the requirements and adjust the implementation accordingly. I honestly didn't think very much about where the implementation fits in, architecturally, and what pieces are of public interest.

Best, Gokhan

On Tue, Mar 10, 2015 at 3:56 AM, Suneel Marthi wrote:

AP, How is ur impl different from Gokhan's?

On Mon, Mar 9, 2015 at 9:54 PM, Andrew Palumbo wrote:

BTW, I'm not sure o.a.m.nlp is the best package name for either; I was using it because o.a.m.vectorizer, which is probably a better name, had conflicts in mrlegacy.

On 03/09/2015 09:29 PM, Andrew Palumbo wrote:

I meant: would o.a.m.nlp in the spark module be a good place for Gokhan's seq2sparse implementation to live.

On 03/09/2015 09:07 PM, Pat Ferrel wrote:

Does o.a.m.nlp in the spark module seem like a good place for this to live? I think you meant math-scala? Actually we should rename math to core.

On Mar 9, 2015, at 3:15 PM, Andrew Palumbo wrote:

Cool - this is great! I think this is really important to have in.

+1 to a pull request for comments.

I have PR #75 (https://github.com/apache/mahout/pull/75) open - it has very simple TF and TFIDF classes based on Lucene's IDF calculation and MLlib's. I just got a bad flu and haven't had a chance to push it. It creates an o.a.m.nlp package in mahout-math. I will push that as soon as I can in case you want to use them.

Does o.a.m.nlp in the spark module seem like a good place for this to live?

Those classes may be of use to you - they're very simple and are intended for new document vectorization once the legacy deps are removed from the spark module. They also might make interoperability easier.

One thought, not having been able to look at this too closely yet:

//do we need do calculate df-vector?

1. We do need a document frequency map or vector to be able to calculate the IDF terms when vectorizing a new document outside of the original corpus.

On 03/09/2015 05:10 PM, Pat Ferrel wrote:

Ah, you are doing all the Lucene analyzer, ngrams and other tokenizing, nice.

On Mar 9, 2015, at 2:07 PM, Pat Ferrel wrote:

Ah, I found the right button in GitHub, no PR necessary.

On Mar 9, 2015, at 1:55 PM, Pat Ferrel wrote:

If you create a PR it's easier to see what was changed.

Wouldn't it be better to read in files from a directory, assigning doc-id = filename and term-ids = terms, or are there still Hadoop pipeline tools that are needed to create the sequence files? This sort of mimics the way Spark reads SchemaRDDs from JSON files.

BTW this can also be done with a new reader trait on the IndexedDataset. It will give you two bidirectional maps (BiMap) and a DrmLike[Int]. One BiMap gives any String <-> Int for rows, the other does the same for columns (text tokens).
This would be a few lines of code, since the string mapping and DRM creation is already written. The only thing to do would be to map the doc/row ids to filenames. This allows you to take the non-int doc ids out of the DRM and replace them with a map. Not based on a Spark dataframe yet, but probably will be.

On Mar 9, 2015, at 11:12 AM, Gokhan Capan wrote:

So, here is a sketch of a Spark implementation of seq2sparse, returning a (matrix: DrmLike, dictionary: Map):

https://github.com/gcapan/mahout/tree/seq2sparse

Although it should be possible, I couldn't manage to make it process non-integer document ids. Any fix would be appreciated. There is a simple test attached, but I think there is more to do in terms of handling all parameters of the original seq2sparse implementation.

I put it directly into the SparkEngine --- not that I think this object is the most appropriate placeholder; it just seemed convenient to me.

Best, Gokhan
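To make the BiMap idea concrete, a rough sketch with plain Scala maps standing in for the actual IndexedDataset/BiMap types (all names below are illustrative, not the real reader trait):

    // Hypothetical: assign each application doc id (e.g. a filename) a dense
    // Int row key, keeping both directions so rows can be translated back to
    // application-specific ids after the DRM computation.
    val docIds: Seq[String] = Seq("doc-a.txt", "doc-b.txt", "doc-c.txt")
    val docToRow: Map[String, Int] = docIds.zipWithIndex.toMap
    val rowToDoc: Map[Int, String] = docToRow.map(_.swap)

    // A DrmLike[Int] keyed by docToRow values can then go through the DSL,
    // and rowToDoc recovers the original ids for indexing or output.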
Re: TF-IDF, seq2sparse and DataFrame support
Some answers:

- Non-integer document ids: The implementation does not use operations defined for DrmLike[Int] only, so the row keys do not have to be Ints. I just couldn't manage to create the returned DrmLike with the correct key type. While wrapping into a DrmLike, I tried to pass the key class using HDFS utils the way they are used in drmDfsRead, but I somehow wasn't successful. So non-int document ids are not an actual issue here.

- Breaking the implementation out into smaller pieces: Let's just collect the requirements and adjust the implementation accordingly. I honestly didn't think very much about where the implementation fits in, architecturally, and what pieces are of public interest.

Best, Gokhan

On Tue, Mar 10, 2015 at 3:56 AM, Suneel Marthi wrote:

AP, How is ur impl different from Gokhan's?

On Mon, Mar 9, 2015 at 9:54 PM, Andrew Palumbo wrote:

BTW, I'm not sure o.a.m.nlp is the best package name for either; I was using it because o.a.m.vectorizer, which is probably a better name, had conflicts in mrlegacy.

On 03/09/2015 09:29 PM, Andrew Palumbo wrote:

I meant: would o.a.m.nlp in the spark module be a good place for Gokhan's seq2sparse implementation to live.

On 03/09/2015 09:07 PM, Pat Ferrel wrote:

Does o.a.m.nlp in the spark module seem like a good place for this to live? I think you meant math-scala? Actually we should rename math to core.

On Mar 9, 2015, at 3:15 PM, Andrew Palumbo wrote:

Cool - this is great! I think this is really important to have in.

+1 to a pull request for comments.

I have PR #75 (https://github.com/apache/mahout/pull/75) open - it has very simple TF and TFIDF classes based on Lucene's IDF calculation and MLlib's. I just got a bad flu and haven't had a chance to push it. It creates an o.a.m.nlp package in mahout-math. I will push that as soon as I can in case you want to use them.

Does o.a.m.nlp in the spark module seem like a good place for this to live?

Those classes may be of use to you - they're very simple and are intended for new document vectorization once the legacy deps are removed from the spark module. They also might make interoperability easier.

One thought, not having been able to look at this too closely yet:

//do we need do calculate df-vector?

1. We do need a document frequency map or vector to be able to calculate the IDF terms when vectorizing a new document outside of the original corpus.

On 03/09/2015 05:10 PM, Pat Ferrel wrote:

Ah, you are doing all the Lucene analyzer, ngrams and other tokenizing, nice.

On Mar 9, 2015, at 2:07 PM, Pat Ferrel wrote:

Ah, I found the right button in GitHub, no PR necessary.

On Mar 9, 2015, at 1:55 PM, Pat Ferrel wrote:

If you create a PR it's easier to see what was changed.

Wouldn't it be better to read in files from a directory, assigning doc-id = filename and term-ids = terms, or are there still Hadoop pipeline tools that are needed to create the sequence files? This sort of mimics the way Spark reads SchemaRDDs from JSON files.

BTW this can also be done with a new reader trait on the IndexedDataset. It will give you two bidirectional maps (BiMap) and a DrmLike[Int].
One BiMap gives any String <-> Int for rows, the other does the same for columns (text tokens). This would be a few lines of code, since the string mapping and DRM creation is already written. The only thing to do would be to map the doc/row ids to filenames. This allows you to take the non-int doc ids out of the DRM and replace them with a map. Not based on a Spark dataframe yet, but probably will be.

On Mar 9, 2015, at 11:12 AM, Gokhan Capan wrote:

So, here is a sketch of a Spark implementation of seq2sparse, returning a (matrix: DrmLike, dictionary: Map):

https://github.com/gcapan/mahout/tree/seq2sparse

Although it should be possible, I couldn't manage to make it process non-integer document ids. Any fix would be appreciated. There is a simple test attached, but I think there is more to do in terms of handling all parameters of the original seq2sparse implementation.

I put it directly into the SparkEngine --- not that I think this object is the most appropriate placeholder; it just seemed convenient to me.

Best, Gokhan

On Thu, Feb 5, 2015 at 3:48 AM, Pat Ferrel wrote:

IndexedDataset might suffice until real DataFrames come along.

On
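On the "pass the key-class using HDFS utils" point above: a hedged sketch of what peeking at a sequence file's key class looks like with the plain Hadoop API (Mahout's own hdfs utilities are not shown here, and the path is illustrative):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.hadoop.io.SequenceFile

    // Read the key class recorded in the sequence file header, so the DRM
    // can later be wrapped with a matching ClassTag (e.g. Text -> String keys).
    val conf = new Configuration()
    val path = new Path("tfVectors/part-00000") // illustrative
    val reader = new SequenceFile.Reader(FileSystem.get(conf), path, conf)
    val keyClass: Class[_] = try reader.getKeyClass finally reader.close()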
Re: TF-IDF, seq2sparse and DataFrame support
Sorry for any confusion... what I just pushed from #75 is not an implementation of seq2sparse at all - just a really simple implementation of the Lucene DefaultSimilarity wrapper classes used in the mrlegacy seq2sparse implementation to compute TF-IDF weights for a single term, given a dictionary, term frequency count, corpus size, and document frequency count:

https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/vectorizer/TFIDF.java
https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/vectorizer/Weight.java

I also added an MLlibTFIDF weight:

https://github.com/apache/mahout/blob/master/math-scala/src/main/scala/org/apache/mahout/nlp/tfidf/TFIDF.scala

for interoperability with MLlib's hashing TF-IDF, which uses a slightly different formula. The classes I pushed are really just to use for something simple like this:

    val tfidf: TFIDF = new TFIDF()
    val currentTfIdf = tfidf.calculate(termFreq, docFreq.toInt, docSize, totalDFSize.toInt)

I'm using them to vectorize a new document for Naive Bayes in a Mahout spark-shell script for MAHOUT-1536 (using a model that was trained with mrlegacy seq2sparse vectors):

https://github.com/andrewpalumbo/mahout/blob/MAHOUT-1536-scala/examples/bin/spark/ClassifyNewNBfull.scala

I was coincidentally going to push them over the weekend but didn't have a chance, and I thought he may have some use for them. Having looked at Gokhan's seq2sparse implementation a little more, I don't think that he really will have any use for them.

Regarding the package name, I was just suggesting that Gokhan could put his implementation in o.a.m.nlp if SparkEngine is not where it will go.

Just looking more closely at the actual TF-IDF calculation now: the mrlegacy TF-IDF weights are calculated by DefaultSimilarity as:

    sqrt(termFreq) * (log(numDocs / (docFreq + 1)) + 1.0)

If I'm reading it correctly, Gokhan's implementation is using:

    termFreq * log(numDocs / docFreq)   // where docFreq is always > 0

which is closer to the MLlib TF-IDF formula (without smoothing). This is kind of the reason I was thinking that it is good to have `TermWeight` classes - to keep different (correct) formulas apart.

Looking at my `MLlibTFIDF` code right now, I believe there may be a bug in it and also some incorrect documentation... I will go over it tomorrow.

On 03/09/2015 09:56 PM, Suneel Marthi wrote:

AP, How is ur impl different from Gokhan's?

On Mon, Mar 9, 2015 at 9:54 PM, Andrew Palumbo wrote:

BTW, I'm not sure o.a.m.nlp is the best package name for either; I was using it because o.a.m.vectorizer, which is probably a better name, had conflicts in mrlegacy.

On 03/09/2015 09:29 PM, Andrew Palumbo wrote:

I meant: would o.a.m.nlp in the spark module be a good place for Gokhan's seq2sparse implementation to live.

On 03/09/2015 09:07 PM, Pat Ferrel wrote:

Does o.a.m.nlp in the spark module seem like a good place for this to live? I think you meant math-scala? Actually we should rename math to core.

On Mar 9, 2015, at 3:15 PM, Andrew Palumbo wrote:

Cool - this is great! I think this is really important to have in.

+1 to a pull request for comments.

I have PR #75 (https://github.com/apache/mahout/pull/75) open - it has very simple TF and TFIDF classes based on Lucene's IDF calculation and MLlib's. I just got a bad flu and haven't had a chance to push it. It creates an o.a.m.nlp package in mahout-math. I will push that as soon as I can in case you want to use them.

Does o.a.m.nlp in the spark module seem like a good place for this to live?
Those classes may be of use to you - they're very simple and are intended for new document vectorization once the legacy deps are removed from the spark module. They also might make interoperability easier.

One thought, not having been able to look at this too closely yet:

//do we need do calculate df-vector?

1. We do need a document frequency map or vector to be able to calculate the IDF terms when vectorizing a new document outside of the original corpus.

On 03/09/2015 05:10 PM, Pat Ferrel wrote:

Ah, you are doing all the Lucene analyzer, ngrams and other tokenizing, nice.

On Mar 9, 2015, at 2:07 PM, Pat Ferrel wrote:

Ah, I found the right button in GitHub, no PR necessary.

On Mar 9, 2015, at 1:55 PM, Pat Ferrel wrote:

If you create a PR it's easier to see what was changed.

Wouldn't it be better to read in files from a directory, assigning doc-id = filename and term-ids = terms, or are there still Hadoop pipeline tools that are needed to create the sequence files? This sort of mimics the way Spark reads SchemaRDDs from JSON files.

BTW this can also be done with a new reader trait on the IndexedDataset. It will give you two bidirectional maps (BiMap) and a DrmLike[Int]. One BiMap gives any String <-> Int for rows, the other does the same for columns (text tokens). This would be a few lines of code, since the string mapping
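To put the two weight formulas from this message side by side, a small self-contained sketch (the method names are illustrative, not the PR's actual API):

    import scala.math.{log, sqrt}

    // Lucene DefaultSimilarity-style weight, as used by mrlegacy seq2sparse:
    def luceneTfIdf(termFreq: Double, docFreq: Int, numDocs: Int): Double =
      sqrt(termFreq) * (log(numDocs.toDouble / (docFreq + 1)) + 1.0)

    // MLlib-like weight without smoothing, closer to Gokhan's sketch
    // (which assumes docFreq is always > 0):
    def mllibLikeTfIdf(termFreq: Double, docFreq: Int, numDocs: Int): Double =
      termFreq * log(numDocs.toDouble / docFreq)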
Re: TF-IDF, seq2sparse and DataFrame support
AP, How is ur impl different from Gokhan's?

On Mon, Mar 9, 2015 at 9:54 PM, Andrew Palumbo wrote:

BTW, I'm not sure o.a.m.nlp is the best package name for either; I was using it because o.a.m.vectorizer, which is probably a better name, had conflicts in mrlegacy.

On 03/09/2015 09:29 PM, Andrew Palumbo wrote:

I meant: would o.a.m.nlp in the spark module be a good place for Gokhan's seq2sparse implementation to live.

On 03/09/2015 09:07 PM, Pat Ferrel wrote:

Does o.a.m.nlp in the spark module seem like a good place for this to live? I think you meant math-scala? Actually we should rename math to core.

On Mar 9, 2015, at 3:15 PM, Andrew Palumbo wrote:

Cool - this is great! I think this is really important to have in.

+1 to a pull request for comments.

I have PR #75 (https://github.com/apache/mahout/pull/75) open - it has very simple TF and TFIDF classes based on Lucene's IDF calculation and MLlib's. I just got a bad flu and haven't had a chance to push it. It creates an o.a.m.nlp package in mahout-math. I will push that as soon as I can in case you want to use them.

Does o.a.m.nlp in the spark module seem like a good place for this to live?

Those classes may be of use to you - they're very simple and are intended for new document vectorization once the legacy deps are removed from the spark module. They also might make interoperability easier.

One thought, not having been able to look at this too closely yet:

//do we need do calculate df-vector?

1. We do need a document frequency map or vector to be able to calculate the IDF terms when vectorizing a new document outside of the original corpus.

On 03/09/2015 05:10 PM, Pat Ferrel wrote:

Ah, you are doing all the Lucene analyzer, ngrams and other tokenizing, nice.

On Mar 9, 2015, at 2:07 PM, Pat Ferrel wrote:

Ah, I found the right button in GitHub, no PR necessary.

On Mar 9, 2015, at 1:55 PM, Pat Ferrel wrote:

If you create a PR it's easier to see what was changed.

Wouldn't it be better to read in files from a directory, assigning doc-id = filename and term-ids = terms, or are there still Hadoop pipeline tools that are needed to create the sequence files? This sort of mimics the way Spark reads SchemaRDDs from JSON files.

BTW this can also be done with a new reader trait on the IndexedDataset. It will give you two bidirectional maps (BiMap) and a DrmLike[Int]. One BiMap gives any String <-> Int for rows, the other does the same for columns (text tokens). This would be a few lines of code, since the string mapping and DRM creation is already written. The only thing to do would be to map the doc/row ids to filenames. This allows you to take the non-int doc ids out of the DRM and replace them with a map. Not based on a Spark dataframe yet, but probably will be.

On Mar 9, 2015, at 11:12 AM, Gokhan Capan wrote:

So, here is a sketch of a Spark implementation of seq2sparse, returning a (matrix: DrmLike, dictionary: Map):

https://github.com/gcapan/mahout/tree/seq2sparse

Although it should be possible, I couldn't manage to make it process non-integer document ids. Any fix would be appreciated. There is a simple test attached, but I think there is more to do in terms of handling all parameters of the original seq2sparse implementation.

I put it directly into the SparkEngine --- not that I think this object is the most appropriate placeholder; it just seemed convenient to me.
Best, Gokhan

On Thu, Feb 5, 2015 at 3:48 AM, Pat Ferrel wrote:

IndexedDataset might suffice until real DataFrames come along.

On Feb 4, 2015, at 3:42 PM, Dmitriy Lyubimov wrote:

Dealing with dictionaries is inevitably DataFrame for seq2sparse. It is a byproduct of it, IIRC. A matrix is definitely not a structure to hold those.

On Wed, Feb 4, 2015 at 9:16 AM, Andrew Palumbo wrote:

On 02/04/2015 11:13 AM, Pat Ferrel wrote:

Andrew, not sure what you mean about storing strings. If you mean something like a DRM of tokens, that is a DataFrame with row=doc, column=token. A one-row DataFrame is a slightly heavyweight string/document. A DataFrame with token counts would be perfect for input to TF-IDF, no? It would be a vector that maintains the tokens as ids for the counts, right?

Yes - DataFrames will be perfect for this. The problem that I was referring to was that we don't have a DSL data structure to do the initial distributed tokenizing of the documents [1] line:
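A toy illustration of the df-vector point quoted above - the document frequencies have to be kept around to compute IDF terms for a document that was not in the original corpus (plain Scala, illustrative names only):

    // Document frequencies collected once over the training corpus:
    val corpus: Seq[Seq[String]] =
      Seq(Seq("spark", "mahout"), Seq("mahout", "drm"), Seq("drm"))
    val docFreq: Map[String, Int] =
      corpus.flatMap(_.distinct).groupBy(identity).mapValues(_.size).toMap
    val numDocs = corpus.size

    // A new, unseen document can now be weighted with the same IDF terms:
    val newDoc = Seq("mahout", "drm", "drm")
    val tf = newDoc.groupBy(identity).mapValues(_.size)
    val weights = tf.collect { case (term, f) if docFreq.contains(term) =>
      term -> f * math.log(numDocs.toDouble / docFreq(term))
    }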
Re: TF-IDF, seq2sparse and DataFrame support
BTW, I'm not sure o.a.m.nlp is the best package name for either; I was using it because o.a.m.vectorizer, which is probably a better name, had conflicts in mrlegacy.

On 03/09/2015 09:29 PM, Andrew Palumbo wrote:

I meant: would o.a.m.nlp in the spark module be a good place for Gokhan's seq2sparse implementation to live.

On 03/09/2015 09:07 PM, Pat Ferrel wrote:

Does o.a.m.nlp in the spark module seem like a good place for this to live? I think you meant math-scala? Actually we should rename math to core.

On Mar 9, 2015, at 3:15 PM, Andrew Palumbo wrote:

Cool - this is great! I think this is really important to have in.

+1 to a pull request for comments.

I have PR #75 (https://github.com/apache/mahout/pull/75) open - it has very simple TF and TFIDF classes based on Lucene's IDF calculation and MLlib's. I just got a bad flu and haven't had a chance to push it. It creates an o.a.m.nlp package in mahout-math. I will push that as soon as I can in case you want to use them.

Does o.a.m.nlp in the spark module seem like a good place for this to live?

Those classes may be of use to you - they're very simple and are intended for new document vectorization once the legacy deps are removed from the spark module. They also might make interoperability easier.

One thought, not having been able to look at this too closely yet:

//do we need do calculate df-vector?

1. We do need a document frequency map or vector to be able to calculate the IDF terms when vectorizing a new document outside of the original corpus.

On 03/09/2015 05:10 PM, Pat Ferrel wrote:

Ah, you are doing all the Lucene analyzer, ngrams and other tokenizing, nice.

On Mar 9, 2015, at 2:07 PM, Pat Ferrel wrote:

Ah, I found the right button in GitHub, no PR necessary.

On Mar 9, 2015, at 1:55 PM, Pat Ferrel wrote:

If you create a PR it's easier to see what was changed.

Wouldn't it be better to read in files from a directory, assigning doc-id = filename and term-ids = terms, or are there still Hadoop pipeline tools that are needed to create the sequence files? This sort of mimics the way Spark reads SchemaRDDs from JSON files.

BTW this can also be done with a new reader trait on the IndexedDataset. It will give you two bidirectional maps (BiMap) and a DrmLike[Int]. One BiMap gives any String <-> Int for rows, the other does the same for columns (text tokens). This would be a few lines of code, since the string mapping and DRM creation is already written. The only thing to do would be to map the doc/row ids to filenames. This allows you to take the non-int doc ids out of the DRM and replace them with a map. Not based on a Spark dataframe yet, but probably will be.

On Mar 9, 2015, at 11:12 AM, Gokhan Capan wrote:

So, here is a sketch of a Spark implementation of seq2sparse, returning a (matrix: DrmLike, dictionary: Map):

https://github.com/gcapan/mahout/tree/seq2sparse

Although it should be possible, I couldn't manage to make it process non-integer document ids. Any fix would be appreciated. There is a simple test attached, but I think there is more to do in terms of handling all parameters of the original seq2sparse implementation.

I put it directly into the SparkEngine --- not that I think this object is the most appropriate placeholder; it just seemed convenient to me.

Best, Gokhan

On Thu, Feb 5, 2015 at 3:48 AM, Pat Ferrel wrote:

IndexedDataset might suffice until real DataFrames come along.

On Feb 4, 2015, at 3:42 PM, Dmitriy Lyubimov wrote:

Dealing with dictionaries is inevitably DataFrame for seq2sparse. It is a byproduct of it, IIRC.
A matrix is definitely not a structure to hold those.

On Wed, Feb 4, 2015 at 9:16 AM, Andrew Palumbo wrote:

On 02/04/2015 11:13 AM, Pat Ferrel wrote:

Andrew, not sure what you mean about storing strings. If you mean something like a DRM of tokens, that is a DataFrame with row=doc, column=token. A one-row DataFrame is a slightly heavyweight string/document. A DataFrame with token counts would be perfect for input to TF-IDF, no? It would be a vector that maintains the tokens as ids for the counts, right?

Yes - DataFrames will be perfect for this. The problem that I was referring to was that we don't have a DSL data structure to do the initial distributed tokenizing of the documents [1] line:257, [2]. For this I believe we would need something like a distributed vector of Strings that could be broadcast to a mapBlock closure and then tokenized from there. Even there, mapBlock may not be perfect for this, but some of the new distributed functions that Gokhan is working on may be.

I agree seq2sparse type input is a strong feature. Text files into an all-documents DataFrame, basically. Collocation?

As far as collocations go, I believe the n-grams are computed and counted in the CollocDriver [3] (I might be wrong here... it's been a while since I looked at the code...). Either way, I don't think I ever looked too closely, and I was a bit fuzzy
Re: TF-IDF, seq2sparse and DataFrame support
I meant: would o.a.m.nlp in the spark module be a good place for Gokhan's seq2sparse implementation to live.

On 03/09/2015 09:07 PM, Pat Ferrel wrote:

Does o.a.m.nlp in the spark module seem like a good place for this to live? I think you meant math-scala? Actually we should rename math to core.

On Mar 9, 2015, at 3:15 PM, Andrew Palumbo wrote:

Cool - this is great! I think this is really important to have in.

+1 to a pull request for comments.

I have PR #75 (https://github.com/apache/mahout/pull/75) open - it has very simple TF and TFIDF classes based on Lucene's IDF calculation and MLlib's. I just got a bad flu and haven't had a chance to push it. It creates an o.a.m.nlp package in mahout-math. I will push that as soon as I can in case you want to use them.

Does o.a.m.nlp in the spark module seem like a good place for this to live?

Those classes may be of use to you - they're very simple and are intended for new document vectorization once the legacy deps are removed from the spark module. They also might make interoperability easier.

One thought, not having been able to look at this too closely yet:

//do we need do calculate df-vector?

1. We do need a document frequency map or vector to be able to calculate the IDF terms when vectorizing a new document outside of the original corpus.

On 03/09/2015 05:10 PM, Pat Ferrel wrote:

Ah, you are doing all the Lucene analyzer, ngrams and other tokenizing, nice.

On Mar 9, 2015, at 2:07 PM, Pat Ferrel wrote:

Ah, I found the right button in GitHub, no PR necessary.

On Mar 9, 2015, at 1:55 PM, Pat Ferrel wrote:

If you create a PR it's easier to see what was changed.

Wouldn't it be better to read in files from a directory, assigning doc-id = filename and term-ids = terms, or are there still Hadoop pipeline tools that are needed to create the sequence files? This sort of mimics the way Spark reads SchemaRDDs from JSON files.

BTW this can also be done with a new reader trait on the IndexedDataset. It will give you two bidirectional maps (BiMap) and a DrmLike[Int]. One BiMap gives any String <-> Int for rows, the other does the same for columns (text tokens). This would be a few lines of code, since the string mapping and DRM creation is already written. The only thing to do would be to map the doc/row ids to filenames. This allows you to take the non-int doc ids out of the DRM and replace them with a map. Not based on a Spark dataframe yet, but probably will be.

On Mar 9, 2015, at 11:12 AM, Gokhan Capan wrote:

So, here is a sketch of a Spark implementation of seq2sparse, returning a (matrix: DrmLike, dictionary: Map):

https://github.com/gcapan/mahout/tree/seq2sparse

Although it should be possible, I couldn't manage to make it process non-integer document ids. Any fix would be appreciated. There is a simple test attached, but I think there is more to do in terms of handling all parameters of the original seq2sparse implementation.

I put it directly into the SparkEngine --- not that I think this object is the most appropriate placeholder; it just seemed convenient to me.

Best, Gokhan

On Thu, Feb 5, 2015 at 3:48 AM, Pat Ferrel wrote:

IndexedDataset might suffice until real DataFrames come along.

On Feb 4, 2015, at 3:42 PM, Dmitriy Lyubimov wrote:

Dealing with dictionaries is inevitably DataFrame for seq2sparse. It is a byproduct of it, IIRC. A matrix is definitely not a structure to hold those.

On Wed, Feb 4, 2015 at 9:16 AM, Andrew Palumbo wrote:

On 02/04/2015 11:13 AM, Pat Ferrel wrote:

Andrew, not sure what you mean about storing strings.
If you mean something like a DRM of tokens, that is a DataFrame with row=doc, column=token. A one-row DataFrame is a slightly heavyweight string/document. A DataFrame with token counts would be perfect for input to TF-IDF, no? It would be a vector that maintains the tokens as ids for the counts, right?

Yes - DataFrames will be perfect for this. The problem that I was referring to was that we don't have a DSL data structure to do the initial distributed tokenizing of the documents [1] line:257, [2]. For this I believe we would need something like a distributed vector of Strings that could be broadcast to a mapBlock closure and then tokenized from there. Even there, mapBlock may not be perfect for this, but some of the new distributed functions that Gokhan is working on may be.

I agree seq2sparse type input is a strong feature. Text files into an all-documents DataFrame, basically. Collocation?

As far as collocations go, I believe the n-grams are computed and counted in the CollocDriver [3] (I might be wrong here... it's been a while since I looked at the code...). Either way, I don't think I ever looked too closely, and I was a bit fuzzy on this... These were just some thoughts that I had when briefly looking at porting seq2sparse to the DSL before. Obviously we don't have to follow this algorithm, but it's a nice starting point.

[1] https://github.com/apache/mahout/blob/master/m
Re: TF-IDF, seq2sparse and DataFrame support
> Does o.a.m.nlp in the spark module seem like a good place for this to live?

I think you meant math-scala? Actually we should rename math to core.

On Mar 9, 2015, at 3:15 PM, Andrew Palumbo wrote:

Cool - this is great! I think this is really important to have in.

+1 to a pull request for comments.

I have PR #75 (https://github.com/apache/mahout/pull/75) open - it has very simple TF and TFIDF classes based on Lucene's IDF calculation and MLlib's. I just got a bad flu and haven't had a chance to push it. It creates an o.a.m.nlp package in mahout-math. I will push that as soon as I can in case you want to use them.

Does o.a.m.nlp in the spark module seem like a good place for this to live?

Those classes may be of use to you - they're very simple and are intended for new document vectorization once the legacy deps are removed from the spark module. They also might make interoperability easier.

One thought, not having been able to look at this too closely yet:

//do we need do calculate df-vector?

1. We do need a document frequency map or vector to be able to calculate the IDF terms when vectorizing a new document outside of the original corpus.

On 03/09/2015 05:10 PM, Pat Ferrel wrote:

Ah, you are doing all the Lucene analyzer, ngrams and other tokenizing, nice.

On Mar 9, 2015, at 2:07 PM, Pat Ferrel wrote:

Ah, I found the right button in GitHub, no PR necessary.

On Mar 9, 2015, at 1:55 PM, Pat Ferrel wrote:

If you create a PR it's easier to see what was changed.

Wouldn't it be better to read in files from a directory, assigning doc-id = filename and term-ids = terms, or are there still Hadoop pipeline tools that are needed to create the sequence files? This sort of mimics the way Spark reads SchemaRDDs from JSON files.

BTW this can also be done with a new reader trait on the IndexedDataset. It will give you two bidirectional maps (BiMap) and a DrmLike[Int]. One BiMap gives any String <-> Int for rows, the other does the same for columns (text tokens). This would be a few lines of code, since the string mapping and DRM creation is already written. The only thing to do would be to map the doc/row ids to filenames. This allows you to take the non-int doc ids out of the DRM and replace them with a map. Not based on a Spark dataframe yet, but probably will be.

On Mar 9, 2015, at 11:12 AM, Gokhan Capan wrote:

So, here is a sketch of a Spark implementation of seq2sparse, returning a (matrix: DrmLike, dictionary: Map):

https://github.com/gcapan/mahout/tree/seq2sparse

Although it should be possible, I couldn't manage to make it process non-integer document ids. Any fix would be appreciated. There is a simple test attached, but I think there is more to do in terms of handling all parameters of the original seq2sparse implementation.

I put it directly into the SparkEngine --- not that I think this object is the most appropriate placeholder; it just seemed convenient to me.

Best, Gokhan

On Thu, Feb 5, 2015 at 3:48 AM, Pat Ferrel wrote:

IndexedDataset might suffice until real DataFrames come along.

On Feb 4, 2015, at 3:42 PM, Dmitriy Lyubimov wrote:

Dealing with dictionaries is inevitably DataFrame for seq2sparse. It is a byproduct of it, IIRC. A matrix is definitely not a structure to hold those.

On Wed, Feb 4, 2015 at 9:16 AM, Andrew Palumbo wrote:

On 02/04/2015 11:13 AM, Pat Ferrel wrote:

Andrew, not sure what you mean about storing strings.
If you mean something like a DRM of tokens, that is a DataFrame with row=doc, column=token. A one-row DataFrame is a slightly heavyweight string/document. A DataFrame with token counts would be perfect for input to TF-IDF, no? It would be a vector that maintains the tokens as ids for the counts, right?

Yes - DataFrames will be perfect for this. The problem that I was referring to was that we don't have a DSL data structure to do the initial distributed tokenizing of the documents [1] line:257, [2]. For this I believe we would need something like a distributed vector of Strings that could be broadcast to a mapBlock closure and then tokenized from there. Even there, mapBlock may not be perfect for this, but some of the new distributed functions that Gokhan is working on may be.

I agree seq2sparse type input is a strong feature. Text files into an all-documents DataFrame, basically. Collocation?

As far as collocations go, I believe the n-grams are computed and counted in the CollocDriver [3] (I might be wrong here... it's been a while since I looked at the code...). Either way, I don't think I ever looked too closely, and I was a bit fuzzy on this... These were just some thoughts that I had when briefly looking at porting seq2sparse to the DSL before. Obviously we don't have to follow this algorithm
Re: TF-IDF, seq2sparse and DataFrame support
Cool - this is great! I think this is really important to have in.

+1 to a pull request for comments.

I have PR #75 (https://github.com/apache/mahout/pull/75) open - it has very simple TF and TFIDF classes based on Lucene's IDF calculation and MLlib's. I just got a bad flu and haven't had a chance to push it. It creates an o.a.m.nlp package in mahout-math. I will push that as soon as I can in case you want to use them.

Does o.a.m.nlp in the spark module seem like a good place for this to live?

Those classes may be of use to you - they're very simple and are intended for new document vectorization once the legacy deps are removed from the spark module. They also might make interoperability easier.

One thought, not having been able to look at this too closely yet:

//do we need do calculate df-vector?

1. We do need a document frequency map or vector to be able to calculate the IDF terms when vectorizing a new document outside of the original corpus.

On 03/09/2015 05:10 PM, Pat Ferrel wrote:

Ah, you are doing all the Lucene analyzer, ngrams and other tokenizing, nice.

On Mar 9, 2015, at 2:07 PM, Pat Ferrel wrote:

Ah, I found the right button in GitHub, no PR necessary.

On Mar 9, 2015, at 1:55 PM, Pat Ferrel wrote:

If you create a PR it's easier to see what was changed.

Wouldn't it be better to read in files from a directory, assigning doc-id = filename and term-ids = terms, or are there still Hadoop pipeline tools that are needed to create the sequence files? This sort of mimics the way Spark reads SchemaRDDs from JSON files.

BTW this can also be done with a new reader trait on the IndexedDataset. It will give you two bidirectional maps (BiMap) and a DrmLike[Int]. One BiMap gives any String <-> Int for rows, the other does the same for columns (text tokens). This would be a few lines of code, since the string mapping and DRM creation is already written. The only thing to do would be to map the doc/row ids to filenames. This allows you to take the non-int doc ids out of the DRM and replace them with a map. Not based on a Spark dataframe yet, but probably will be.

On Mar 9, 2015, at 11:12 AM, Gokhan Capan wrote:

So, here is a sketch of a Spark implementation of seq2sparse, returning a (matrix: DrmLike, dictionary: Map):

https://github.com/gcapan/mahout/tree/seq2sparse

Although it should be possible, I couldn't manage to make it process non-integer document ids. Any fix would be appreciated. There is a simple test attached, but I think there is more to do in terms of handling all parameters of the original seq2sparse implementation.

I put it directly into the SparkEngine --- not that I think this object is the most appropriate placeholder; it just seemed convenient to me.

Best, Gokhan

On Thu, Feb 5, 2015 at 3:48 AM, Pat Ferrel wrote:

IndexedDataset might suffice until real DataFrames come along.

On Feb 4, 2015, at 3:42 PM, Dmitriy Lyubimov wrote:

Dealing with dictionaries is inevitably DataFrame for seq2sparse. It is a byproduct of it, IIRC. A matrix is definitely not a structure to hold those.

On Wed, Feb 4, 2015 at 9:16 AM, Andrew Palumbo wrote:

On 02/04/2015 11:13 AM, Pat Ferrel wrote:

Andrew, not sure what you mean about storing strings. If you mean something like a DRM of tokens, that is a DataFrame with row=doc, column=token. A one-row DataFrame is a slightly heavyweight string/document. A DataFrame with token counts would be perfect for input to TF-IDF, no? It would be a vector that maintains the tokens as ids for the counts, right?

Yes - DataFrames will be perfect for this.
The problem that I was referring to was that we don't have a DSL data structure to do the initial distributed tokenizing of the documents [1] line:257, [2]. For this I believe we would need something like a distributed vector of Strings that could be broadcast to a mapBlock closure and then tokenized from there. Even there, mapBlock may not be perfect for this, but some of the new distributed functions that Gokhan is working on may be.

I agree seq2sparse type input is a strong feature. Text files into an all-documents DataFrame, basically. Collocation?

As far as collocations go, I believe the n-grams are computed and counted in the CollocDriver [3] (I might be wrong here... it's been a while since I looked at the code...). Either way, I don't think I ever looked too closely, and I was a bit fuzzy on this... These were just some thoughts that I had when briefly looking at porting seq2sparse to the DSL before. Obviously we don't have to follow this algorithm, but it's a nice starting point.

[1] https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles.java
[2] https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/vectorizer/DocumentProcessor.java
[3] https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/vectorizer/collocations/llr/CollocDriver.java
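As a rough picture of the missing first step described above - distributed tokenization before any DSL matrix exists - in plain Spark rather than the DSL (the trivial whitespace tokenizer stands in for a real Lucene analyzer, and all names are illustrative):

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    def tokenizeCorpus(sc: SparkContext, dir: String): RDD[(String, Seq[String])] = {
      // (path, rawText) pairs for every file under dir:
      val docs: RDD[(String, String)] = sc.wholeTextFiles(dir)
      docs.map { case (docId, text) =>
        docId -> text.toLowerCase.split("\\s+").toSeq
      }
    }

    // A token -> column-index dictionary can then be collected from the
    // tokenized RDD and broadcast, so each partition can build its sparse
    // term-count vectors for the DRM.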
Re: TF-IDF, seq2sparse and DataFrame support
There is a whole pipeline here and an interesting way of making parts accessible via nested function defs. Would it make sense to break them out into separate functions so the base function doesn't take so many params? Maybe one big helper and smaller but separate pipeline functions, so it would be easier to string together your own? For instance, I'd like part-of-speech or even NLP as a filter, and would never perform the TF-IDF or LLR in my recommender use cases since they are done in other places. I see they can be disabled.

This would be useful for a content-based recommender, but it needs a BiMap or the doc-ids preserved in the DRM rows, since they must be written to a search engine as application-specific ids, not Mahout ints. Input a matrix of doc-id, token; perform AA' with LLR filtering of the tokens (spark-rowsimilarity) and write this to a search engine _using application specific tokens and doc-ids_. The search engine does the TF-IDF. Then either get similar docs for any doc-id, or use the user's history of doc-ids read as a query on AA' to get personalized recs.

On Mar 9, 2015, at 2:10 PM, Pat Ferrel wrote:

Ah, you are doing all the Lucene analyzer, ngrams and other tokenizing, nice.

On Mar 9, 2015, at 2:07 PM, Pat Ferrel wrote:

Ah, I found the right button in GitHub, no PR necessary.

On Mar 9, 2015, at 1:55 PM, Pat Ferrel wrote:

If you create a PR it's easier to see what was changed.

Wouldn't it be better to read in files from a directory, assigning doc-id = filename and term-ids = terms, or are there still Hadoop pipeline tools that are needed to create the sequence files? This sort of mimics the way Spark reads SchemaRDDs from JSON files.

BTW this can also be done with a new reader trait on the IndexedDataset. It will give you two bidirectional maps (BiMap) and a DrmLike[Int]. One BiMap gives any String <-> Int for rows, the other does the same for columns (text tokens). This would be a few lines of code, since the string mapping and DRM creation is already written. The only thing to do would be to map the doc/row ids to filenames. This allows you to take the non-int doc ids out of the DRM and replace them with a map. Not based on a Spark dataframe yet, but probably will be.

On Mar 9, 2015, at 11:12 AM, Gokhan Capan wrote:

So, here is a sketch of a Spark implementation of seq2sparse, returning a (matrix: DrmLike, dictionary: Map):

https://github.com/gcapan/mahout/tree/seq2sparse

Although it should be possible, I couldn't manage to make it process non-integer document ids. Any fix would be appreciated. There is a simple test attached, but I think there is more to do in terms of handling all parameters of the original seq2sparse implementation.

I put it directly into the SparkEngine --- not that I think this object is the most appropriate placeholder; it just seemed convenient to me.

Best, Gokhan

On Thu, Feb 5, 2015 at 3:48 AM, Pat Ferrel wrote:

IndexedDataset might suffice until real DataFrames come along.

On Feb 4, 2015, at 3:42 PM, Dmitriy Lyubimov wrote:

Dealing with dictionaries is inevitably DataFrame for seq2sparse. It is a byproduct of it, IIRC. A matrix is definitely not a structure to hold those.

On Wed, Feb 4, 2015 at 9:16 AM, Andrew Palumbo wrote:

On 02/04/2015 11:13 AM, Pat Ferrel wrote:

Andrew, not sure what you mean about storing strings. If you mean something like a DRM of tokens, that is a DataFrame with row=doc, column=token. A one-row DataFrame is a slightly heavyweight string/document.
A DataFrame with token counts would be perfect for input to TF-IDF, no? It would be a vector that maintains the tokens as ids for the counts, right?

Yes - DataFrames will be perfect for this. The problem that I was referring to was that we don't have a DSL data structure to do the initial distributed tokenizing of the documents [1] line:257, [2]. For this I believe we would need something like a distributed vector of Strings that could be broadcast to a mapBlock closure and then tokenized from there. Even there, mapBlock may not be perfect for this, but some of the new distributed functions that Gokhan is working on may be.

I agree seq2sparse type input is a strong feature. Text files into an all-documents DataFrame, basically. Collocation?

As far as collocations go, I believe the n-grams are computed and counted in the CollocDriver [3] (I might be wrong here... it's been a while since I looked at the code...). Either way, I don't think I ever looked too closely, and I was a bit fuzzy on this... These were just some thoughts that I had when briefly looking at porting seq2sparse to the DSL before. Obviously we don't have to follow this algorithm, but it's a nice starting point.

[1] https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles.java
[2] https://githu
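What the "smaller but separate pipeline functions" from the message above might look like as composable stages - a sketch only, with hypothetical names and plain RDD types rather than whatever the final API becomes:

    import org.apache.spark.rdd.RDD

    // Each stage takes and returns plain RDDs, so a caller strings together
    // only the steps it needs; a recommender can stop after termCounts (no
    // tf-idf, no LLR) and keep the original doc ids for search-engine indexing.
    def tokenize(docs: RDD[(String, String)]): RDD[(String, Seq[String])] =
      docs.map { case (id, text) => id -> text.toLowerCase.split("\\s+").toSeq }

    def termCounts(tokens: RDD[(String, Seq[String])]): RDD[(String, Map[String, Int])] =
      tokens.map { case (id, ts) => id -> ts.groupBy(identity).mapValues(_.size).toMap }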
Re: TF-IDF, seq2sparse and DataFrame support
Ah, you are doing all the Lucene analyzer, n-gram, and other tokenizing work. Nice.
Re: TF-IDF, seq2sparse and DataFrame support
Ah, I found the right button in GitHub; no PR necessary.
Re: TF-IDF, seq2sparse and DataFrame support
If you create a PR it’s easier to see what was changed. Wouldn’t it be better to read in files from a directory, assigning doc-id = filename and term-ids = terms, or are there still Hadoop pipeline tools needed to create the sequence files? This sort of mimics the way Spark reads SchemaRDDs from JSON files. BTW this can also be done with a new reader trait on the IndexedDataset. It will give you two bidirectional maps (BiMaps) and a DrmLike[Int]. One BiMap gives any String <-> Int for rows; the other does the same for columns (text tokens). This would be a few lines of code since the string mapping and DRM creation are already written; the only thing left to do would be mapping the doc/row ids to filenames. This allows you to take the non-int doc ids out of the DRM and replace them with a map. Not based on a Spark DataFrame yet, but it probably will be.
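A rough sketch of the directory-reader idea, using only plain Spark calls: doc-id = filename falls out of wholeTextFiles, and the String <-> Int row mapping is kept beside the matrix, IndexedDataset-style (the final drmWrap into a DrmLike[Int] is the Mahout-specific step and is elided here):

import org.apache.spark.SparkContext

// Read a directory of text files; Spark hands back (path, content) pairs,
// which gives us doc-id = filename for free.
def readDocs(sc: SparkContext, dir: String): (Map[String, Int], Map[Int, String]) = {
  val fileNames = sc.wholeTextFiles(dir).map(_._1).collect()
  // Assign each filename a dense Int row id and keep both directions,
  // BiMap-style, so application-specific ids survive outside the DRM.
  val fileToRow = fileNames.zipWithIndex.toMap
  val rowToFile = fileToRow.map(_.swap)
  (fileToRow, rowToFile)
}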
Re: TF-IDF, seq2sparse and DataFrame support
So, here is a sketch of a Spark implementation of seq2sparse, returning a (matrix: DrmLike, dictionary: Map) pair: https://github.com/gcapan/mahout/tree/seq2sparse Although it should be possible, I couldn't manage to make it process non-integer document ids; any fix would be appreciated. There is a simple test attached, but I think there is more to do in terms of handling all the parameters of the original seq2sparse implementation. I put it directly into the SparkEngine object (not that I think it is the most appropriate placeholder; it just seemed convenient to me). Best, Gokhan
(my "told you so" moment of sorts > >>> > >>> What i am getting at, i'd suggest to make DRM and Spark's newly renamed > >>> DataFrame our two major structures. In particular, standardize on using > >>> DataFrame for things that may include non-numerical data and require > more > >>> grace about column naming and manipulation. Maybe relevant to TF-IDF > work > >>> when it deals with non-matrix content. > >>> > >> Sounds like a worthy effort to me. We'd be basically implementing an > API > >> at the math-scala level for SchemaRDD/Dataframe datastructures correct? > >> > >> On Tue, Feb 3, 2015 at 5:01 PM, Pat Ferrel > wrote: > >> > >>> Seems like seq2sparse would be really easy to replace since it takes > text > files to start with, then the whole pipeline could be kept in rdds. > The > dictionaries and counts could be either in-memory maps or rdds for use > with > joins? This would get rid of sequence files completely from the > pipeline. > Item similarity uses in-memory maps but the plan is to make it more > scalable using joins as an alternative with the same API allowing the > user > to trade-off footprint for speed. > > >>> I think you're right- should be relatively easy. I've been looking at > >> porting seq2sparse to the DSL for bit now and the stopper at the DSL > level > >> is that we don't have a distributed data structure for strings..Seems > like > >> getting a Dat
Re: TF-IDF, seq2sparse and DataFrame support
IndexedDataset might suffice until real DataFrames come along.
Re: TF-IDF, seq2sparse and DataFrame support
Dealing with dictionaries is inevitably a DataFrame concern for seq2sparse; the dictionary is a byproduct of it, IIRC. A matrix is definitely not the structure to hold those.
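To make that concrete, a small sketch of a dictionary carried as a DataFrame instead of being jammed into a matrix; the (term, termIndex, docFreq) column layout is just an assumption for the example (Spark 1.3-era API):

import org.apache.spark.sql.{DataFrame, SQLContext}

case class DictEntry(term: String, termIndex: Int, docFreq: Long)

// term -> (column index, document frequency), lifted into a named-column
// structure that can be joined, filtered, and saved alongside the DRM.
def dictionaryDF(sqlContext: SQLContext, dict: Map[String, (Int, Long)]): DataFrame = {
  import sqlContext.implicits._
  dict.toSeq.map { case (t, (i, df)) => DictEntry(t, i, df) }.toDF()
}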
Re: TF-IDF, seq2sparse and DataFrame support
I think I have a sketch of an implementation for creating a DRM from a sequence file of <Text, Text> pairs, a.k.a. seq2sparse, using Spark. Give me a couple of days and I will provide an initial implementation. Best, Gokhan
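For reference, a minimal sketch of the input side, assuming the old seq2sparse layout of SequenceFile<Text, Text> (doc id mapped to document text):

import org.apache.hadoop.io.Text
import org.apache.spark.SparkContext

def readCorpus(sc: SparkContext, path: String) =
  sc.sequenceFile(path, classOf[Text], classOf[Text])
    // Hadoop reuses Writable instances, so copy to String immediately
    .map { case (id, text) => (id.toString, text.toString) }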
Re: TF-IDF, seq2sparse and DataFrame support
On 02/04/2015 11:13 AM, Pat Ferrel wrote:

> Andrew, not sure what you mean about storing strings. If you mean something like a DRM of tokens, that is a DataFrame with row = doc, column = token. A one-row DataFrame is a slightly heavyweight string/document. A DataFrame with token counts would be perfect input for TF-IDF, no? It would be a vector that maintains the tokens as ids for the counts, right?

Yes, DataFrames will be perfect for this. The problem I was referring to was that we don't have a DSL data structure to do the initial distributed tokenizing of the documents [1] (line 257), [2]. For this I believe we would need something like a distributed vector of Strings that could be broadcast to a mapBlock closure and then tokenized from there. Even then, mapBlock may not be perfect for this, but some of the new distributed functions that Gokhan is working on may be.

> I agree seq2sparse-type input is a strong feature. Text files into an all-documents DataFrame, basically. Collocation?

As far as collocations go, I believe the n-grams are computed and counted in the CollocDriver [3] (I might be wrong here... it's been a while since I looked at the code). Either way, I don't think I ever looked too closely, and I was a bit fuzzy on this.

These were just some thoughts that I had when briefly looking at porting seq2sparse to the DSL before. Obviously we don't have to follow this algorithm, but it's a nice starting point.

[1] https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles.java
[2] https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/vectorizer/DocumentProcessor.java
[3] https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/vectorizer/collocations/llr/CollocDriver.java
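Until such a structure exists, one workaround is to keep the tokenizing and dictionary building on plain RDDs and only enter DRM territory once everything is numeric. A sketch under that assumption, with a naive driver-side dictionary (a join-based variant would be needed for vocabularies that don't fit in memory):

import org.apache.spark.rdd.RDD

// (docId, text) in; (docId, termIndex -> count) out, ready to become
// sparse vector rows of a DRM.
def tokenizeAndCount(docs: RDD[(String, String)]): RDD[(String, Map[Int, Int])] = {
  val tokenized = docs.map { case (id, text) =>
    (id, text.toLowerCase.split("\\W+").filter(_.nonEmpty).toSeq)
  }
  // dictionary: term -> column index, collected to the driver; it is
  // captured in the closure below (broadcast it for anything big)
  val dict = tokenized.flatMap(_._2).distinct().collect().zipWithIndex.toMap
  tokenized.map { case (id, toks) =>
    (id, toks.groupBy(dict).mapValues(_.size).toMap)
  }
}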
Re: TF-IDF, seq2sparse and DataFrame support
Andrew, not sure what you mean about storing strings. If you mean something like a DRM of tokens, that is a DataFrame with row = doc, column = token. A one-row DataFrame is a slightly heavyweight string/document. A DataFrame with token counts would be perfect input for TF-IDF, no? It would be a vector that maintains the tokens as ids for the counts, right?

I agree seq2sparse-type input is a strong feature. Text files into an all-documents DataFrame, basically. Collocation?
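For illustration, one plausible shape for that token-count DataFrame is long form, one row per (doc, token) pair; the names here are made up for the example (Spark 1.3-era API):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SQLContext}

case class TokenCount(docId: String, token: String, count: Int)

// Turn tokenized documents into the (docId, token, count) table that a
// TF-IDF stage could consume directly.
def tokenCountsDF(sqlContext: SQLContext, docs: RDD[(String, Seq[String])]): DataFrame = {
  import sqlContext.implicits._
  docs.flatMap { case (id, toks) =>
    toks.groupBy(identity).map { case (t, xs) => TokenCount(id, t, xs.size) }
  }.toDF()
}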
TF-IDF, seq2sparse and DataFrame support
Just copied over the relevant last few messages to keep the other thread on topic...

On 02/03/2015 08:22 PM, Dmitriy Lyubimov wrote:

> I'd suggest to consider this: remember all this talk about language-integrated Spark SQL being basically a dataframe-manipulation DSL? So now Spark devs are noticing this generality as well and are actually proposing to rename SchemaRDD into DataFrame and make it a mainstream data structure. (My "told you so" moment, of sorts.) What I am getting at: I'd suggest we make DRM and Spark's newly renamed DataFrame our two major structures. In particular, standardize on using DataFrame for things that may include non-numerical data and require more grace about column naming and manipulation. Maybe relevant to TF-IDF work when it deals with non-matrix content.

Sounds like a worthy effort to me. We'd basically be implementing an API at the math-scala level for SchemaRDD/DataFrame data structures, correct?

On Tue, Feb 3, 2015 at 5:01 PM, Pat Ferrel wrote:

> Seems like seq2sparse would be really easy to replace since it takes text files to start with; then the whole pipeline could be kept in RDDs. The dictionaries and counts could be either in-memory maps or RDDs for use with joins? This would get rid of sequence files completely from the pipeline. Item similarity uses in-memory maps, but the plan is to make it more scalable using joins as an alternative with the same API, allowing the user to trade off footprint for speed.

I think you're right; it should be relatively easy. I've been looking at porting seq2sparse to the DSL for a bit now, and the stopper at the DSL level is that we don't have a distributed data structure for strings. Seems like getting a DataFrame implemented as Dmitriy mentioned above would take care of this problem. The other issue I'm a little fuzzy on is the distributed collocation mapping; it's a part of the seq2sparse code that I've not spent too much time in. I think this would be a very worthy effort as well; I believe seq2sparse is a particularly strong Mahout feature. I'll start another thread since we're now way off topic from the refactoring proposal.

My use for TF-IDF is for row similarity: it would take a DRM (actually an IndexedDataset) and calculate row/doc similarities. It works now, but only using LLR. This is OK when thinking of the items as tags or metadata, but for text tokens something like cosine may be better. I'd imagine a downsampling phase that would precede TF-IDF, using LLR a lot like how CF preferences are downsampled. This would produce a sparsified all-docs DRM. Then (if the counts were saved) TF-IDF would re-weight the terms before row similarity uses cosine. This is not so good for search, but it should produce much better similarities than Solr's "moreLikeThis", and it does so for all pairs rather than one at a time. In any case it can be used to create a personalized content-based recommender or to augment a CF recommender with one more indicator type.

On Feb 3, 2015, at 3:37 PM, Andrew Palumbo wrote: On 02/03/2015 12:44 PM, Andrew Palumbo wrote: On 02/03/2015 12:22 PM, Pat Ferrel wrote:

> Some issues WRT lower-level Spark integration:
> 1) Interoperability with Spark data. TF-IDF is one example I actually looked at. There may be other things we can pick up from their committers, since they have an abundance.
> 2) Wider acceptance of the Mahout DSL. The DSL's power was illustrated to me when someone on the Spark list asked about matrix transpose and an MLlib committer's answer was something like "why would you want to do that?". Usually you don't actually execute the transpose, but they don't even support A'A, AA', or A'B, which are core to what I work on. At present you pretty much have to choose between MLlib or Mahout for sparse matrix stuff. Maybe a half-way measure is some implicit conversions (ugh, I know). If the DSL could interchange datasets with MLlib, people would be pointed to the DSL for a whole bunch of "why would you want to do that?" features. MLlib seems to be algorithms, not math.
> 3) Integration of streaming. DStreams support most of the RDD interface. Doing a batch recalc on a moving time window would nearly fall out of DStream-backed DRMs. This isn't the same as incremental updates on streaming, but it's a start.
> Last year we were looking at Hadoop MapReduce vs. the faster compute engines (Spark, H2O, Flink). So we jumped. Now the need is for streaming, and especially incrementally updated streaming. Seems like we need to address this.
> Andrew, regardless of the above, having TF-IDF would be super helpful; row similarity for content/text would benefit greatly.

I will put a PR up soon. Just to clarify, I'll be porting the (very simple) TF and TFIDF classes and the Weight interface over from mr-legacy to math-scala. They're available now in the spark-shell but won't be after this refactoring. These still require a dictionary and a frequency-count map to vectorize incoming text, so they're more for use with the
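Since the port hasn't landed yet, here is a rough sketch of what the Weight interface and the two weightings might look like on the math-scala side, together with the dictionary/frequency-map vectorization step just described. The method signature follows the mr-legacy Java classes; the Scala rendering and the vectorize helper are assumptions:

trait Weight {
  def calculate(tf: Int, df: Int, length: Int, numDocs: Int): Double
}

object TF extends Weight {
  // plain term frequency; df, length, and corpus size are ignored
  def calculate(tf: Int, df: Int, length: Int, numDocs: Int): Double = tf.toDouble
}

object TFIDF extends Weight {
  // Lucene-style damping: sqrt(tf) * (1 + ln(numDocs / (df + 1)))
  def calculate(tf: Int, df: Int, length: Int, numDocs: Int): Double =
    math.sqrt(tf.toDouble) * (1.0 + math.log(numDocs.toDouble / (df + 1.0)))
}

// Vectorizing a document needs the two maps mentioned above:
// dictionary: term -> column index; docFreqs: term -> document frequency
// (assumed to cover the same terms as the dictionary).
def vectorize(tokens: Seq[String], dictionary: Map[String, Int],
              docFreqs: Map[String, Int], numDocs: Int): Map[Int, Double] =
  tokens.groupBy(identity).collect {
    case (term, occurrences) if dictionary.contains(term) =>
      dictionary(term) -> TFIDF.calculate(occurrences.size, docFreqs(term),
        tokens.size, numDocs)
  }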