Some answers:

- Non-integer document ids: The implementation does not use operations defined only for DrmLike[Int], so the row keys do not have to be Ints. I just couldn't manage to create the returned DrmLike with the correct key type. While wrapping into a DrmLike, I tried to pass the key class using the HDFS utils the way drmDfsRead uses them, but I somehow wasn't successful. So non-integer document ids are not an actual issue here (see the sketch after this message).

- Breaking the implementation out into smaller pieces: Let's just collect the requirements and adjust the implementation accordingly. I honestly didn't think very much about where the implementation fits in, architecturally, or about which pieces are of public interest.

Best

Gokhan
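A minimal sketch of the wrapping step described above, assuming Mahout's Spark bindings expose drmWrap with a ClassTag-bound key type; wrapByKey and tfVectors are hypothetical names, not code from the branch:

    import org.apache.mahout.math.Vector
    import org.apache.mahout.math.drm.CheckpointedDrm
    import org.apache.mahout.sparkbindings._
    import org.apache.spark.rdd.RDD
    import scala.reflect.ClassTag

    // Keep the key type generic instead of fixing it to Int. As long as a
    // ClassTag for K is in scope, drmWrap should be able to produce a
    // CheckpointedDrm[K] keyed by, e.g., String document ids.
    def wrapByKey[K: ClassTag](tfVectors: RDD[(K, Vector)],
                               ncol: Int): CheckpointedDrm[K] =
      drmWrap(tfVectors, ncol = ncol)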
On Tue, Mar 10, 2015 at 3:56 AM, Suneel Marthi <suneel.mar...@gmail.com> wrote:

> AP, how is your impl different from Gokhan's?
>
> On Mon, Mar 9, 2015 at 9:54 PM, Andrew Palumbo <ap....@outlook.com> wrote:
>
>> BTW, I'm not sure o.a.m.nlp is the best package name for either; I was
>> using it because o.a.m.vectorizer, which is probably a better name, had
>> conflicts in mrlegacy.
>>
>> On 03/09/2015 09:29 PM, Andrew Palumbo wrote:
>>
>>> I meant: would o.a.m.nlp in the spark module be a good place for
>>> Gokhan's seq2sparse implementation to live?
>>>
>>> On 03/09/2015 09:07 PM, Pat Ferrel wrote:
>>>
>>>>> Does o.a.m.nlp in the spark module seem like a good place for this
>>>>> to live?
>>>>
>>>> I think you meant math-scala?
>>>>
>>>> Actually we should rename math to core.
>>>>
>>>> On Mar 9, 2015, at 3:15 PM, Andrew Palumbo <ap....@outlook.com> wrote:
>>>>
>>>> Cool - this is great! I think this is really important to have in.
>>>>
>>>> +1 to a pull request for comments.
>>>>
>>>> I have PR #75 (https://github.com/apache/mahout/pull/75) open - it has
>>>> very simple TF and TFIDF classes based on Lucene's IDF calculation and
>>>> MLlib's. I just got a bad flu and haven't had a chance to push it. It
>>>> creates an o.a.m.nlp package in mahout-math. I will push that as soon
>>>> as I can in case you want to use them.
>>>>
>>>> Does o.a.m.nlp in the spark module seem like a good place for this to
>>>> live?
>>>>
>>>> Those classes may be of use to you - they're very simple and are
>>>> intended for new-document vectorization once the legacy deps are
>>>> removed from the spark module. They also might make interoperability
>>>> easier.
>>>>
>>>> One thought, having not been able to look at this too closely yet:
>>>>
>>>>> // do we need to calculate df-vector?
>>>>
>>>> 1. We do need a document frequency map or vector to be able to
>>>> calculate the IDF terms when vectorizing a new document outside of the
>>>> original corpus.
>>>>
>>>> On 03/09/2015 05:10 PM, Pat Ferrel wrote:
>>>>
>>>>> Ah, you are doing all the Lucene analyzer, ngrams and other
>>>>> tokenizing. Nice.
>>>>>
>>>>> On Mar 9, 2015, at 2:07 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
>>>>>
>>>>> Ah, I found the right button in Github - no PR necessary.
>>>>>
>>>>> On Mar 9, 2015, at 1:55 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
>>>>>
>>>>> If you create a PR it's easier to see what was changed.
>>>>>
>>>>> Wouldn't it be better to read in files from a directory, assigning
>>>>> doc-id = filename and term-ids = terms, or are there still Hadoop
>>>>> pipeline tools that are needed to create the sequence files? This
>>>>> sort of mimics the way Spark reads SchemaRDDs from Json files.
>>>>>
>>>>> BTW this can also be done with a new reader trait on the
>>>>> IndexedDataset. It will give you two bidirectional maps (BiMap) and a
>>>>> DrmLike[Int]. One BiMap gives any String <-> Int for rows, the other
>>>>> does the same for columns (text tokens). This would be a few lines of
>>>>> code since the string mapping and DRM creation are already written;
>>>>> the only thing to do would be to map the doc/row ids to filenames.
>>>>> This allows you to take the non-int doc ids out of the DRM and
>>>>> replace them with a map (see the sketch below). Not based on a Spark
>>>>> dataframe yet, but it probably will be.
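A hypothetical sketch of the reader-trait result Pat describes, using Guava's BiMap type for illustration; TextIndexedDataset and its fields are made-up names, not an existing Mahout API:

    import com.google.common.collect.BiMap
    import org.apache.mahout.math.drm.DrmLike

    // Made-up container mirroring what the reader would hand back: the DRM
    // stays Int-keyed and the BiMaps carry the external String ids.
    case class TextIndexedDataset(
      matrix: DrmLike[Int],               // rows = docs, columns = tokens
      rowIDs: BiMap[String, Integer],     // e.g. filename <-> row index
      columnIDs: BiMap[String, Integer])  // token <-> column index

    // Recovering a row for a doc id, or a doc id for a row, is then just:
    //   ids.rowIDs.get("doc-123")
    //   ids.rowIDs.inverse.get(rowIndex)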
>>>>> On Mar 9, 2015, at 11:12 AM, Gokhan Capan <gkhn...@gmail.com> wrote:
>>>>>
>>>>> So, here is a sketch of a Spark implementation of seq2sparse,
>>>>> returning a (matrix: DrmLike, dictionary: Map):
>>>>>
>>>>> https://github.com/gcapan/mahout/tree/seq2sparse
>>>>>
>>>>> Although it should be possible, I couldn't manage to make it process
>>>>> non-integer document ids. Any fix would be appreciated. There is a
>>>>> simple test attached, but I think there is more to do in terms of
>>>>> handling all the parameters of the original seq2sparse implementation.
>>>>>
>>>>> I put it directly into the SparkEngine - not that I think this object
>>>>> is the most appropriate placeholder, it just seemed convenient to me.
>>>>>
>>>>> Best
>>>>>
>>>>> Gokhan
>>>>>
>>>>> On Thu, Feb 5, 2015 at 3:48 AM, Pat Ferrel <p...@occamsmachete.com> wrote:
>>>>>
>>>>>> IndexedDataset might suffice until real DataFrames come along.
>>>>>>
>>>>>> On Feb 4, 2015, at 3:42 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
>>>>>>
>>>>>> Dealing with dictionaries is inevitably a DataFrame matter for
>>>>>> seq2sparse. It is a byproduct of it, IIRC. A matrix is definitely
>>>>>> not a structure to hold those.
>>>>>>
>>>>>> On Wed, Feb 4, 2015 at 9:16 AM, Andrew Palumbo <ap....@outlook.com> wrote:
>>>>>>
>>>>>>> On 02/04/2015 11:13 AM, Pat Ferrel wrote:
>>>>>>>
>>>>>>>> Andrew, not sure what you mean about storing strings. If you mean
>>>>>>>> something like a DRM of tokens, that is a DataFrame with row = doc,
>>>>>>>> column = token. A one-row DataFrame is a slightly heavyweight
>>>>>>>> string/document. A DataFrame with token counts would be perfect
>>>>>>>> for input to TF-IDF, no? It would be a vector that maintains the
>>>>>>>> tokens as ids for the counts, right?
>>>>>>>
>>>>>>> Yes - dataframes will be perfect for this. The problem that I was
>>>>>>> referring to was that we don't have a DSL data structure to do the
>>>>>>> initial distributed tokenizing of the documents [1] line:257, [2]
>>>>>>> (see the sketch after this message). For this I believe we would
>>>>>>> need something like a distributed vector of Strings that could be
>>>>>>> broadcast to a mapBlock closure and then tokenized from there. Even
>>>>>>> there, MapBlock may not be perfect for this, but some of the new
>>>>>>> distributed functions that Gokhan is working on may be.
>>>>>>>
>>>>>>>> I agree seq2sparse type input is a strong feature. Text files into
>>>>>>>> an all-documents DataFrame basically. Colocation?
>>>>>>>
>>>>>>> As far as collocations, I believe the n-grams are computed and
>>>>>>> counted in the CollocDriver [3] (I might be wrong here... it's been
>>>>>>> a while since I looked at the code). Either way, I don't think I
>>>>>>> ever looked too closely, and I was a bit fuzzy on this...
>>>>>>>
>>>>>>> These were just some thoughts that I had when briefly looking at
>>>>>>> porting seq2sparse to the DSL before. Obviously we don't have to
>>>>>>> follow this algorithm but it's a nice starting point.
>>>>>>>
>>>>>>> [1] https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles.java
>>>>>>> [2] https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/vectorizer/DocumentProcessor.java
>>>>>>> [3] https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/vectorizer/collocations/llr/CollocDriver.java
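To make the gap concrete: a sketch of the initial tokenizing pass done with a plain Spark RDD, since no DSL structure covers it. The regex split stands in for the Lucene analyzer that seq2sparse actually uses, and doc-id = filename as Pat suggested:

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    // Tokenize a directory of text files outside the DSL: one record per
    // file, keyed by filename, value = lowercased word tokens.
    def tokenize(sc: SparkContext, dir: String): RDD[(String, Seq[String])] =
      sc.wholeTextFiles(dir).map { case (filename, text) =>
        filename -> text.toLowerCase.split("\\W+").filter(_.nonEmpty).toSeq
      }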
>>>>>>> On Feb 4, 2015, at 7:47 AM, Andrew Palumbo <ap....@outlook.com> wrote:
>>>>>>>
>>>>>>> Just copied over the relevant last few messages to keep the other
>>>>>>> thread on topic...
>>>>>>>
>>>>>>> On 02/03/2015 08:22 PM, Dmitriy Lyubimov wrote:
>>>>>>>
>>>>>>>> I'd suggest to consider this: remember all this talk about
>>>>>>>> language-integrated Spark SQL being basically a dataframe
>>>>>>>> manipulation DSL? So now Spark devs are noticing this generality
>>>>>>>> as well and are actually proposing to rename SchemaRDD into
>>>>>>>> DataFrame and make it a mainstream data structure. (My "told you
>>>>>>>> so" moment of sorts.)
>>>>>>>>
>>>>>>>> What I am getting at: I'd suggest to make DRM and Spark's newly
>>>>>>>> renamed DataFrame our two major structures. In particular,
>>>>>>>> standardize on using DataFrame for things that may include
>>>>>>>> non-numerical data and require more grace about column naming and
>>>>>>>> manipulation. Maybe relevant to the TF-IDF work when it deals with
>>>>>>>> non-matrix content.
>>>>>>>
>>>>>>> Sounds like a worthy effort to me. We'd basically be implementing
>>>>>>> an API at the math-scala level for SchemaRDD/Dataframe data
>>>>>>> structures, correct?
>>>>>>>
>>>>>>> On Tue, Feb 3, 2015 at 5:01 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
>>>>>>>
>>>>>>>> Seems like seq2sparse would be really easy to replace since it
>>>>>>>> takes text files to start with; then the whole pipeline could be
>>>>>>>> kept in RDDs. The dictionaries and counts could be either
>>>>>>>> in-memory maps or RDDs for use with joins (see the sketch below)?
>>>>>>>> This would get rid of sequence files completely from the pipeline.
>>>>>>>> Item similarity uses in-memory maps, but the plan is to make it
>>>>>>>> more scalable using joins as an alternative with the same API,
>>>>>>>> allowing the user to trade off footprint for speed.
>>>>>>>
>>>>>>> I think you're right - it should be relatively easy. I've been
>>>>>>> looking at porting seq2sparse to the DSL for a bit now, and the
>>>>>>> stopper at the DSL level is that we don't have a distributed data
>>>>>>> structure for strings. Seems like getting a DataFrame implemented
>>>>>>> as Dmitriy mentioned above would take care of this problem.
>>>>>>>
>>>>>>> The other issue I'm a little fuzzy on is the distributed
>>>>>>> collocation mapping - it's a part of the seq2sparse code that I've
>>>>>>> not spent too much time in.
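A sketch of the join alternative Pat mentions: keeping the document-frequency counts as an RDD and joining, rather than broadcasting an in-memory map. All names are hypothetical, and the Lucene-style IDF term is only a placeholder weighting:

    import org.apache.spark.rdd.RDD

    def weightByJoin(postings: RDD[(String, (Long, Int))], // term -> (docId, tf)
                     df: RDD[(String, Int)],               // term -> doc frequency
                     numDocs: Long): RDD[(Long, (String, Double))] =
      postings.join(df).map { case (term, ((docId, tf), dfCount)) =>
        // tf * idf, with a Lucene-style idf = 1 + ln(N / (df + 1))
        docId -> (term, tf * (1.0 + math.log(numDocs.toDouble / (dfCount + 1))))
      }

Trading the map for a join raises shuffle cost but removes the driver-memory ceiling, which is the footprint-for-speed trade-off described above.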
>>>>>>> I think that this would be a very worthy effort as well - I
>>>>>>> believe seq2sparse is a particularly strong Mahout feature.
>>>>>>>
>>>>>>> I'll start another thread since we're now way off topic from the
>>>>>>> refactoring proposal.
>>>>>>>
>>>>>>> My use for TF-IDF is for row similarity and would take a DRM
>>>>>>> (actually IndexedDataset) and calculate row/doc similarities. It
>>>>>>> works now but only using LLR. This is OK when thinking of the items
>>>>>>> as tags or metadata, but for text tokens something like cosine may
>>>>>>> be better.
>>>>>>>
>>>>>>> I'd imagine a downsampling phase that would precede TF-IDF, using
>>>>>>> LLR a lot like how CF preferences are downsampled. This would
>>>>>>> produce a sparsified all-docs DRM. Then (if the counts were saved)
>>>>>>> TF-IDF would re-weight the terms before row similarity uses cosine.
>>>>>>> This is not so good for search but should produce much better
>>>>>>> similarities than Solr's "moreLikeThis", and it does it for all
>>>>>>> pairs rather than one at a time.
>>>>>>>
>>>>>>> In any case it can be used to create a personalized content-based
>>>>>>> recommender or augment a CF recommender with one more indicator
>>>>>>> type.
>>>>>>>
>>>>>>> On Feb 3, 2015, at 3:37 PM, Andrew Palumbo <ap....@outlook.com> wrote:
>>>>>>>
>>>>>>> On 02/03/2015 12:44 PM, Andrew Palumbo wrote:
>>>>>>>
>>>>>>>> On 02/03/2015 12:22 PM, Pat Ferrel wrote:
>>>>>>>>
>>>>>>>>> Some issues WRT lower-level Spark integration:
>>>>>>>>>
>>>>>>>>> 1) Interoperability with Spark data. TF-IDF is one example I
>>>>>>>>> actually looked at. There may be other things we can pick up from
>>>>>>>>> their committers since they have an abundance.
>>>>>>>>>
>>>>>>>>> 2) Wider acceptance of the Mahout DSL. The DSL's power was
>>>>>>>>> illustrated to me when someone on the Spark list asked about
>>>>>>>>> matrix transpose and an MLlib committer's answer was something
>>>>>>>>> like "why would you want to do that?". Usually you don't actually
>>>>>>>>> execute the transpose, but they don't even support A'A, AA', or
>>>>>>>>> A'B, which are core to what I work on (see the sketch after this
>>>>>>>>> message). At present you pretty much have to choose between MLlib
>>>>>>>>> or Mahout for sparse matrix stuff. Maybe a half-way measure is
>>>>>>>>> some implicit conversions (ugh, I know). If the DSL could
>>>>>>>>> interchange datasets with MLlib, people would be pointed to the
>>>>>>>>> DSL for all of a bunch of "why would you want to do that?"
>>>>>>>>> features. MLlib seems to be algorithms, not math.
>>>>>>>>>
>>>>>>>>> 3) Integration of streaming. DStreams support most of the RDD
>>>>>>>>> interface. Doing a batch recalc on a moving time window would
>>>>>>>>> nearly fall out of DStream-backed DRMs. This isn't the same as
>>>>>>>>> incremental updates on streaming, but it's a start.
>>>>>>>>>
>>>>>>>>> Last year we were looking at Hadoop MapReduce vs. the faster
>>>>>>>>> compute engines: Spark, H2O, Flink. So we jumped. Now the need is
>>>>>>>>> for streaming, and especially incrementally updated streaming.
>>>>>>>>> Seems like we need to address this.
>>>>>>>>>
>>>>>>>>> Andrew, regardless of the above, having TF-IDF would be super
>>>>>>>>> helpful - row similarity for content/text would benefit greatly.
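For reference, the expressions Pat names are one-liners in the DSL. A minimal sketch, assuming the standard math-scala DRM imports; the optimizer is expected to rewrite these to fused physical operators rather than materializing the transpose:

    import org.apache.mahout.math.drm._
    import org.apache.mahout.math.drm.RLikeDrmOps._

    def gramians(drmA: DrmLike[Int], drmB: DrmLike[Int]) = {
      val ata = drmA.t %*% drmA  // A'A
      val aat = drmA %*% drmA.t  // AA'
      val atb = drmA.t %*% drmB  // A'B
      (ata, aat, atb)
    }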
>>>>>>>> I will put a PR up soon.
>>>>>>>>
>>>>>>>> Just to clarify, I'll be porting the (very simple) TF and TFIDF
>>>>>>>> classes and the Weight interface over from mr-legacy to
>>>>>>>> math-scala. They're available now in spark-shell but won't be
>>>>>>>> after this refactoring. These still require a dictionary and a
>>>>>>>> frequency-count map to vectorize incoming text - so they're more
>>>>>>>> for use with the old MR seq2sparse, and I don't think they can be
>>>>>>>> used with Spark's HashingTF and IDF. I'll put them up soon.
>>>>>>>> Hopefully they'll be of some use.
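A sketch of the usage pattern such classes imply: vectorizing an incoming document against the dictionary and document-frequency map saved from the original corpus. The helper and the Lucene-style IDF term are illustrative assumptions, not the ported Weight implementation:

    import org.apache.mahout.math.{RandomAccessSparseVector, Vector}

    def vectorize(tokens: Seq[String],
                  dictionary: Map[String, Int], // term -> column index
                  df: Map[String, Int],         // term -> document frequency
                  numDocs: Long): Vector = {
      val v = new RandomAccessSparseVector(dictionary.size)
      tokens.groupBy(identity).foreach { case (term, occurrences) =>
        for (col <- dictionary.get(term)) { // skip out-of-vocabulary terms
          val idf = 1.0 + math.log(numDocs.toDouble / (df.getOrElse(term, 0) + 1))
          v.setQuick(col, occurrences.size * idf)
        }
      }
      v
    }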