So, here is a sketch of a Spark implementation of seq2sparse, returning a (matrix:DrmLike, dictionary:Map):
https://github.com/gcapan/mahout/tree/seq2sparse Although it should be possible, I couldn't manage to make it process non-integer document ids. Any fix would be appreciated. There is a simple test attached, but I think there is more to do in terms of handling all parameters of the original seq2sparse implementation. I put it directly to the SparkEngine ---not that I think of this object is the most appropriate placeholder, it just seemed convenient to me. Best Gokhan On Thu, Feb 5, 2015 at 3:48 AM, Pat Ferrel <[email protected]> wrote: > IndexedDataset might suffice until real DataFrames come along. > > On Feb 4, 2015, at 3:42 PM, Dmitriy Lyubimov <[email protected]> wrote: > > Dealing with dictionaries is inevitably DataFrame for seq2sparse. It is a > byproduct of it IIRC. matrix definitely not a structure to hold those. > > On Wed, Feb 4, 2015 at 9:16 AM, Andrew Palumbo <[email protected]> wrote: > > > > > On 02/04/2015 11:13 AM, Pat Ferrel wrote: > > > >> Andrew, not sure what you mean about storing strings. If you mean > >> something like a DRM of tokens, that is a DataFrame with row=doc column > = > >> token. A one row DataFrame is a slightly heavy weight string/document. A > >> DataFrame with token counts would be perfect for input TF-IDF, no? It > would > >> be a vector that maintains the tokens as ids for the counts, right? > >> > > > > Yes- dataframes will be perfect for this. The problem that i was > > referring to was that we dont have a DSL Data Structure to to do the > > initial distributed tokenizing of the documents[1] line:257, [2] . For > this > > I believe we would need something like a Distributed vector of Strings > that > > could be broadcast to a mapBlock closure and then tokenized from there. > > Even there, MapBlock may not be perfect for this, but some of the new > > Distributed functions that Gockhan is working on may. > > > >> > >> I agree seq2sparse type input is a strong feature. Text files into an > >> all-documents DataFrame basically. Colocation? > >> > > as far as collocations i believe that the n-gram are computed and counted > > in the CollocDriver [3] (i might be wrong her...its been a while since i > > looked at the code...) either way, I dont think I ever looked too closely > > and i was a bit fuzzy on this... > > > > These were just some thoughts that I had when briefly looking at porting > > seq2sparse to the DSL before.. Obviously we don't have to follow this > > algorithm but its a nice starting point. > > > > [1] https://github.com/apache/mahout/blob/master/mrlegacy/ > > src/main/java/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles > > .java > > [2] https://github.com/apache/mahout/blob/master/mrlegacy/ > > src/main/java/org/apache/mahout/vectorizer/DocumentProcessor.java > > [3]https://github.com/apache/mahout/blob/master/mrlegacy/ > > src/main/java/org/apache/mahout/vectorizer/collocations/llr/CollocDriver. > > java > > > > > > > >> On Feb 4, 2015, at 7:47 AM, Andrew Palumbo <[email protected]> wrote: > >> > >> Just copied over the relevant last few messages to keep the other thread > >> on topic... > >> > >> > >> On 02/03/2015 08:22 PM, Dmitriy Lyubimov wrote: > >> > >>> I'd suggest to consider this: remember all this talk about > >>> language-integrated spark ql being basically dataframe manipulation > DSL? > >>> > >>> so now Spark devs are noticing this generality as well and are actually > >>> proposing to rename SchemaRDD into DataFrame and make it mainstream > data > >>> structure. (my "told you so" moment of sorts > >>> > >>> What i am getting at, i'd suggest to make DRM and Spark's newly renamed > >>> DataFrame our two major structures. In particular, standardize on using > >>> DataFrame for things that may include non-numerical data and require > more > >>> grace about column naming and manipulation. Maybe relevant to TF-IDF > work > >>> when it deals with non-matrix content. > >>> > >> Sounds like a worthy effort to me. We'd be basically implementing an > API > >> at the math-scala level for SchemaRDD/Dataframe datastructures correct? > >> > >> On Tue, Feb 3, 2015 at 5:01 PM, Pat Ferrel <[email protected]> > wrote: > >> > >>> Seems like seq2sparse would be really easy to replace since it takes > text > >>>> files to start with, then the whole pipeline could be kept in rdds. > The > >>>> dictionaries and counts could be either in-memory maps or rdds for use > >>>> with > >>>> joins? This would get rid of sequence files completely from the > >>>> pipeline. > >>>> Item similarity uses in-memory maps but the plan is to make it more > >>>> scalable using joins as an alternative with the same API allowing the > >>>> user > >>>> to trade-off footprint for speed. > >>>> > >>> I think you're right- should be relatively easy. I've been looking at > >> porting seq2sparse to the DSL for bit now and the stopper at the DSL > level > >> is that we don't have a distributed data structure for strings..Seems > like > >> getting a DataFrame implemented as Dmitriy mentioned above would take > care > >> of this problem. > >> > >> The other issue i'm a little fuzzy on is the distributed collocation > >> mapping- it's a part of the seq2sparse code that I've not spent too > much > >> time in. > >> > >> I think that this would be very worthy effort as well- I believe > >> seq2sparse is a particular strong mahout feature. > >> > >> I'll start another thread since we're now way off topic from the > >> refactoring proposal. > >> > >> My use for TF-IDF is for row similarity and would take a DRM (actually > >> IndexedDataset) and calculate row/doc similarities. It works now but > only > >> using LLR. This is OK when thinking of the items as tags or metadata but > >> for text tokens something like cosine may be better. > >> > >> I’d imagine a downsampling phase that would precede TF-IDF using LLR a > lot > >> like how CF preferences are downsampled. This would produce an > sparsified > >> all-docs DRM. Then (if the counts were saved) TF-IDF would re-weight the > >> terms before row similarity uses cosine. This is not so good for search > >> but > >> should produce much better similarities than Solr’s “moreLikeThis” and > >> does > >> it for all pairs rather than one at a time. > >> > >> In any case it can be used to do a create a personalized content-based > >> recommender or augment a CF recommender with one more indicator type. > >> > >> On Feb 3, 2015, at 3:37 PM, Andrew Palumbo <[email protected]> wrote: > >> > >> > >> On 02/03/2015 12:44 PM, Andrew Palumbo wrote: > >> > >>> On 02/03/2015 12:22 PM, Pat Ferrel wrote: > >>> > >>>> Some issues WRT lower level Spark integration: > >>>> 1) interoperability with Spark data. TF-IDF is one example I actually > >>>> > >>> looked at. There may be other things we can pick up from their > committers > >> since they have an abundance. > >> > >>> 2) wider acceptance of Mahout DSL. The DSL’s power was illustrated to > >>>> > >>> me when someone on the Spark list asked about matrix transpose and an > >> MLlib > >> committer’s answer was something like “why would you want to do that?”. > >> Usually you don’t actually execute the transpose but they don’t even > >> support A’A, AA’, or A’B, which are core to what I work on. At present > you > >> pretty much have to choose between MLlib or Mahout for sparse matrix > >> stuff. > >> Maybe a half-way measure is some implicit conversions (ugh, I know). If > >> the > >> DSL could interchange datasets with MLlib, people would be pointed to > the > >> DSL for all of a bunch of “why would you want to do that?” features. > MLlib > >> seems to be algorithms, not math. > >> > >>> 3) integration of Streaming. DStreams support most of the RDD > >>>> > >>> interface. Doing a batch recalc on a moving time window would nearly > fall > >> out of DStream backed DRMs. This isn’t the same as incremental updates > on > >> streaming but it’s a start. > >> > >>> Last year we were looking at Hadoop Mapreduce vs Spark, H2O, Flink > >>>> > >>> faster compute engines. So we jumped. Now the need is for streaming and > >> especially incrementally updated streaming. Seems like we need to > address > >> this. > >> > >>> Andrew, regardless of the above having TF-IDF would be super > >>>> > >>> helpful—row similarity for content/text would benefit greatly. > >> > >>> I will put a PR up soon. > >>> > >> Just to clarify, I'll be porting over the (very simple) TF, TFIDF > classes > >> and Weight interface over from mr-legacy to math-scala. They're > available > >> now in spark-shell but won't be after this refactoring. These still > >> require dictionary and a frequency count maps to vectorize incoming > text- > >> so they're more for use with the old MR seq2sparse and I don't think > they > >> can be used with Spark's HashingTF and IDF. I'll put them up soon. > >> Hopefully they'll be of some use. > >> > >> > > > >
