Andrew, not sure what you mean about storing strings. If you mean something like a DRM of tokens, that is a DataFrame with row = doc, column = token. A one-row DataFrame is a slightly heavyweight string/document. A DataFrame with token counts would be perfect input to TF-IDF, no? It would be a vector that maintains the tokens as ids for the counts, right?
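For concreteness, a rough sketch of the structure being described, using plain Spark RDDs; the names and types here are illustrative only, not an existing Mahout API:

import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

// tokenized corpus: (docId, tokens) -> per-document token counts
def termCounts(docs: RDD[(Long, Seq[String])]): RDD[(Long, Map[String, Int])] =
  docs.mapValues(tokens => tokens.groupBy(identity).map { case (t, ts) => t -> ts.size })

With a token -> columnId dictionary, each count map becomes one sparse row of the doc x term matrix, i.e. the vector that keeps the tokens as ids for the counts.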
I agree seq2sparse-type input is a strong feature. Text files into an all-documents DataFrame, basically. Collocation?

On Feb 4, 2015, at 7:47 AM, Andrew Palumbo <[email protected]> wrote:

Just copied over the relevant last few messages to keep the other thread on topic...

On 02/03/2015 08:22 PM, Dmitriy Lyubimov wrote:
> I'd suggest to consider this: remember all this talk about language-integrated Spark SQL being basically a dataframe manipulation DSL?
> So now Spark devs are noticing this generality as well and are actually proposing to rename SchemaRDD to DataFrame and make it a mainstream data structure. (My "told you so" moment of sorts.)
> What I am getting at: I'd suggest we make DRM and Spark's newly renamed DataFrame our two major structures. In particular, standardize on using DataFrame for things that may include non-numerical data and require more grace about column naming and manipulation. Maybe relevant to TF-IDF work when it deals with non-matrix content.

Sounds like a worthy effort to me. We'd basically be implementing an API at the math-scala level for the SchemaRDD/DataFrame data structures, correct?

On Tue, Feb 3, 2015 at 5:01 PM, Pat Ferrel <[email protected]> wrote:
>> Seems like seq2sparse would be really easy to replace since it takes text files to start with, then the whole pipeline could be kept in RDDs. The dictionaries and counts could be either in-memory maps or RDDs for use with joins? This would get rid of sequence files completely from the pipeline. Item similarity uses in-memory maps but the plan is to make it more scalable using joins as an alternative with the same API, allowing the user to trade off footprint for speed.

I think you're right; it should be relatively easy. I've been looking at porting seq2sparse to the DSL for a bit now, and the stopper at the DSL level is that we don't have a distributed data structure for strings. Seems like getting a DataFrame implemented, as Dmitriy mentioned above, would take care of this problem. The other issue I'm a little fuzzy on is the distributed collocation mapping; it's a part of the seq2sparse code that I've not spent much time in. I think this would be a very worthy effort as well; I believe seq2sparse is a particularly strong Mahout feature. I'll start another thread since we're now way off topic from the refactoring proposal.

My use for TF-IDF is for row similarity and would take a DRM (actually an IndexedDataset) and calculate row/doc similarities. It works now, but only using LLR. This is OK when thinking of the items as tags or metadata, but for text tokens something like cosine may be better. I’d imagine a downsampling phase that would precede TF-IDF, using LLR a lot like how CF preferences are downsampled. This would produce a sparsified all-docs DRM. Then (if the counts were saved) TF-IDF would re-weight the terms before row similarity uses cosine. This is not so good for search, but it should produce much better similarities than Solr’s “moreLikeThis” and does it for all pairs rather than one at a time. In any case it can be used to create a personalized content-based recommender or to augment a CF recommender with one more indicator type.
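A minimal sketch of that last idea, assuming the Mahout math-scala (Samsara) DSL is available: once the all-docs DRM has been TF-IDF re-weighted, normalizing each row to unit length makes A %*% A.t the all-pairs cosine similarity matrix. The function name allPairsCosine is made up for illustration, and the LLR downsampling step is left out.

// Sketch only: standard Samsara DSL imports assumed.
import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.drm.RLikeDrmOps._

// docs: all-docs DRM already re-weighted with TF-IDF (rows = docs, cols = terms)
def allPairsCosine(docs: DrmLike[Int]): DrmLike[Int] = {
  // scale every row to unit L2 norm, block by block
  val normalized = docs.mapBlock(ncol = docs.ncol) { case (keys, block) =>
    for (r <- 0 until block.nrow) {
      val norm = block(r, ::).norm(2)
      if (norm > 0) block(r, ::) := block(r, ::) / norm
    }
    keys -> block
  }
  // for unit-length rows, row-by-row dot products are cosine similarities
  normalized %*% normalized.t
}

The transpose here is logical only; the DSL's optimizer is meant to handle the A %*% A.t form without materializing A.t, which ties into the A'A / AA' / A'B point in the quoted message that follows.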
On Feb 3, 2015, at 3:37 PM, Andrew Palumbo <[email protected]> wrote:

On 02/03/2015 12:44 PM, Andrew Palumbo wrote:
> On 02/03/2015 12:22 PM, Pat Ferrel wrote:
>> Some issues WRT lower-level Spark integration:
>> 1) Interoperability with Spark data. TF-IDF is one example I actually looked at. There may be other things we can pick up from their committers since they have an abundance.
>> 2) Wider acceptance of the Mahout DSL. The DSL's power was illustrated to me when someone on the Spark list asked about matrix transpose and an MLlib committer's answer was something like "why would you want to do that?". Usually you don't actually execute the transpose, but they don't even support A'A, AA', or A'B, which are core to what I work on. At present you pretty much have to choose between MLlib and Mahout for sparse matrix stuff. Maybe a half-way measure is some implicit conversions (ugh, I know). If the DSL could interchange datasets with MLlib, people would be pointed to the DSL for a bunch of "why would you want to do that?" features. MLlib seems to be algorithms, not math.
>> 3) Integration of streaming. DStreams support most of the RDD interface. Doing a batch recalc on a moving time window would nearly fall out of DStream-backed DRMs. This isn't the same as incremental updates on streaming, but it's a start.
>> Last year we were looking at faster compute engines than Hadoop MapReduce: Spark, H2O, Flink. So we jumped. Now the need is for streaming, and especially incrementally updated streaming. Seems like we need to address this.
>> Andrew, regardless of the above, having TF-IDF would be super helpful; row similarity for content/text would benefit greatly.
> I will put a PR up soon.

Just to clarify, I'll be porting the (very simple) TF and TFIDF classes and the Weight interface over from mr-legacy to math-scala. They're available now in spark-shell but won't be after this refactoring. These still require a dictionary and a frequency-count map to vectorize incoming text, so they're more for use with the old MR seq2sparse, and I don't think they can be used with Spark's HashingTF and IDF. I'll put them up soon. Hopefully they'll be of some use.
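For readers who haven't looked at those classes, here is a minimal sketch of the general shape of such a term-weighting API; the trait and the exact formula below are illustrative assumptions, not the mr-legacy Weight/TF/TFIDF code being ported.

// Illustrative only: not the actual mr-legacy Weight/TF/TFIDF classes.
trait Weight {
  // tf: term count in the document, df: number of docs containing the term,
  // numDocs: total number of docs in the corpus
  def calculate(tf: Int, df: Int, numDocs: Int): Double
}

object TF extends Weight {
  // raw term frequency; ignores document frequency entirely
  def calculate(tf: Int, df: Int, numDocs: Int): Double = tf.toDouble
}

object TFIDF extends Weight {
  // classic tf * idf; the +1 terms guard against division by zero
  def calculate(tf: Int, df: Int, numDocs: Int): Double =
    tf * math.log((numDocs + 1.0) / (df + 1.0))
}

Vectorizing a document then amounts to looking up each token's column id in the dictionary and its df in the frequency-count map, and writing calculate(tf, df, numDocs) into that column, which is why the dictionary and count maps are still required.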
