AP, how is your impl different from Gokhan's?

On Mon, Mar 9, 2015 at 9:54 PM, Andrew Palumbo <[email protected]> wrote:
BTW, I'm not sure o.a.m.nlp is the best package name for either; I was using it because o.a.m.vectorizer, which is probably a better name, had conflicts in mrlegacy.

On 03/09/2015 09:29 PM, Andrew Palumbo wrote:

I meant: would o.a.m.nlp in the spark module be a good place for Gokhan's seq2sparse implementation to live?

On 03/09/2015 09:07 PM, Pat Ferrel wrote:

> Does o.a.m.nlp in the spark module seem like a good place for this to live?

I think you meant math-scala?

Actually we should rename math to core.

On Mar 9, 2015, at 3:15 PM, Andrew Palumbo <[email protected]> wrote:

Cool- this is great! I think this is really important to have in.

+1 to a pull request for comments.

I have PR #75 (https://github.com/apache/mahout/pull/75) open. It has very simple TF and TFIDF classes based on Lucene's IDF calculation and MLlib's. I just got a bad flu and haven't had a chance to push it. It creates an o.a.m.nlp package in mahout-math. I will push that as soon as I can in case you want to use them.

Does o.a.m.nlp in the spark module seem like a good place for this to live?

Those classes may be of use to you- they're very simple and are intended for new document vectorization once the legacy deps are removed from the spark module. They might also make interoperability easier.

One thought, having not been able to look at this too closely yet:

> // do we need to calculate df-vector?

1. We do need a document frequency map or vector to be able to calculate the IDF terms when vectorizing a new document outside of the original corpus.

On 03/09/2015 05:10 PM, Pat Ferrel wrote:

Ah, you are doing all the Lucene analyzer, ngrams and other tokenizing, nice.

On Mar 9, 2015, at 2:07 PM, Pat Ferrel <[email protected]> wrote:

Ah, I found the right button in Github, no PR necessary.

On Mar 9, 2015, at 1:55 PM, Pat Ferrel <[email protected]> wrote:

If you create a PR it's easier to see what was changed.

Wouldn't it be better to read in files from a directory, assigning doc-id = filename and term-ids = terms, or are there still Hadoop pipeline tools that are needed to create the sequence files? This sort of mimics the way Spark reads SchemaRDDs from Json files.

BTW, this can also be done with a new reader trait on the IndexedDataset. It will give you two bidirectional maps (BiMap) and a DrmLike[Int]. One BiMap gives any String <-> Int mapping for rows, the other does the same for columns (text tokens). This would be a few lines of code since the string mapping and DRM creation is already written; the only thing to do would be to map the doc/row ids to filenames. This allows you to take the non-int doc ids out of the DRM and replace them with a map. Not based on a Spark dataframe yet but probably will be.

On Mar 9, 2015, at 11:12 AM, Gokhan Capan <[email protected]> wrote:

So, here is a sketch of a Spark implementation of seq2sparse, returning a (matrix:DrmLike, dictionary:Map):

https://github.com/gcapan/mahout/tree/seq2sparse

Although it should be possible, I couldn't manage to make it process non-integer document ids. Any fix would be appreciated. There is a simple test attached, but I think there is more to do in terms of handling all parameters of the original seq2sparse implementation.

I put it directly into the SparkEngine - not that I think this object is the most appropriate placeholder, it just seemed convenient to me.

Best

Gokhan
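For readers who don't want to open the branch, the following is a minimal sketch of the general shape of such a pipeline: tokenize each document, build a term dictionary, and wrap per-document term-count vectors as a DrmLike[Int]. The function name seq2sparseSketch, the naive whitespace tokenizer, and the driver-side dictionary are assumptions for illustration, not the code in Gokhan's branch; the Int-keyed DrmRdd also shows why non-integer document ids are awkward here.

import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

import org.apache.mahout.math.{RandomAccessSparseVector, Vector}
import org.apache.mahout.math.drm.DrmLike
import org.apache.mahout.sparkbindings._

// Illustrative only: (docId, text) pairs in, (term-count DRM, term dictionary) out.
def seq2sparseSketch(docs: RDD[(Int, String)]): (DrmLike[Int], Map[String, Int]) = {

  // A naive whitespace tokenizer stands in for the Lucene analyzer / n-gram pipeline.
  val tokenized: RDD[(Int, Seq[String])] =
    docs.mapValues(_.toLowerCase.split("\\W+").filter(_.nonEmpty).toSeq)

  // Term dictionary: term -> column index, collected to the driver like the legacy dictionary chunks.
  val dictionary: Map[String, Int] =
    tokenized.flatMap(_._2).distinct().collect().zipWithIndex.toMap

  val numTerms = dictionary.size
  val bcastDict = docs.sparkContext.broadcast(dictionary)

  // One sparse term-count vector per document; the Int row key is why non-Int doc ids are awkward.
  val vectors: DrmRdd[Int] = tokenized.map { case (docId, terms) =>
    val v: Vector = new RandomAccessSparseVector(numTerms)
    terms.foreach { t =>
      val i = bcastDict.value(t)
      v.setQuick(i, v.getQuick(i) + 1.0)
    }
    (docId, v)
  }

  (drmWrap(vectors, ncol = numTerms), dictionary)
}

A Lucene analyzer with n-gram generation would replace the whitespace tokenizer, and for large vocabularies the dictionary would need to move off the driver, but the returned (DrmLike, dictionary) pair matches the shape described above.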
On Thu, Feb 5, 2015 at 3:48 AM, Pat Ferrel <[email protected]> wrote:

IndexedDataset might suffice until real DataFrames come along.

On Feb 4, 2015, at 3:42 PM, Dmitriy Lyubimov <[email protected]> wrote:

Dealing with dictionaries inevitably means a DataFrame for seq2sparse; the dictionary is a byproduct of it IIRC. A matrix is definitely not a structure to hold those.

On Wed, Feb 4, 2015 at 9:16 AM, Andrew Palumbo <[email protected]> wrote:

On 02/04/2015 11:13 AM, Pat Ferrel wrote:

> Andrew, not sure what you mean about storing strings. If you mean something like a DRM of tokens, that is a DataFrame with row = doc, column = token. A one-row DataFrame is a slightly heavyweight string/document. A DataFrame with token counts would be perfect as input to TF-IDF, no? It would be a vector that maintains the tokens as ids for the counts, right?

Yes- dataframes will be perfect for this. The problem that I was referring to was that we don't have a DSL data structure to do the initial distributed tokenizing of the documents [1] line:257, [2]. For this I believe we would need something like a distributed vector of Strings that could be broadcast to a mapBlock closure and then tokenized from there. Even there, mapBlock may not be perfect for this, but some of the new distributed functions that Gokhan is working on may be.

> I agree seq2sparse type input is a strong feature. Text files into an all-documents DataFrame basically. Colocation?

As far as collocations, I believe the n-grams are computed and counted in the CollocDriver [3] (I might be wrong here... it's been a while since I looked at the code...). Either way, I don't think I ever looked too closely and I was a bit fuzzy on this...

These were just some thoughts that I had when briefly looking at porting seq2sparse to the DSL before. Obviously we don't have to follow this algorithm, but it's a nice starting point.

[1] https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles.java
[2] https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/vectorizer/DocumentProcessor.java
[3] https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/vectorizer/collocations/llr/CollocDriver.java
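Until there is a DSL-level structure for distributed strings, the tokenizing step can live in plain Spark, and Pat's earlier doc-id = filename suggestion maps naturally onto wholeTextFiles. Below is a hedged sketch under those assumptions; it reuses the illustrative seq2sparseSketch from earlier in the thread, and a plain Map stands in for the BiMap of an IndexedDataset.

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

// Hypothetical helper: read a directory of text files, use the file name as the document id,
// and keep the String <-> Int row-id mapping outside of the Int-keyed DRM.
def readTextDir(sc: SparkContext, path: String) = {
  val files = sc.wholeTextFiles(path)                 // RDD[(fileName, fileContents)]

  // Assign each file name a stable Int row id; a plain Map stands in for a BiMap here.
  val rowIdMap: Map[String, Int] = files.keys.collect().sorted.zipWithIndex.toMap
  val bcastRowIds = sc.broadcast(rowIdMap)

  // Swap filenames for Int ids, then reuse the illustrative seq2sparse pipeline sketched above.
  val docs = files.map { case (name, text) => (bcastRowIds.value(name), text) }
  val (drm, termDictionary) = seq2sparseSketch(docs)

  // Callers keep rowIdMap (and its inverse) to translate DRM row keys back to filenames.
  (drm, rowIdMap, termDictionary)
}

The row-id map plays the role Pat describes: the non-Int doc ids stay out of the DRM and live in the map.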
On Feb 4, 2015, at 7:47 AM, Andrew Palumbo <[email protected]> wrote:

Just copied over the relevant last few messages to keep the other thread on topic...

On 02/03/2015 08:22 PM, Dmitriy Lyubimov wrote:

> I'd suggest to consider this: remember all this talk about language-integrated Spark QL being basically dataframe manipulation DSL? So now Spark devs are noticing this generality as well and are actually proposing to rename SchemaRDD into DataFrame and make it a mainstream data structure. (My "told you so" moment of sorts.)
>
> What I am getting at: I'd suggest we make DRM and Spark's newly renamed DataFrame our two major structures. In particular, standardize on using DataFrame for things that may include non-numerical data and require more grace about column naming and manipulation. Maybe relevant to the TF-IDF work when it deals with non-matrix content.

Sounds like a worthy effort to me. We'd basically be implementing an API at the math-scala level for SchemaRDD/DataFrame data structures, correct?

On Tue, Feb 3, 2015 at 5:01 PM, Pat Ferrel <[email protected]> wrote:

> Seems like seq2sparse would be really easy to replace since it takes text files to start with, then the whole pipeline could be kept in RDDs. The dictionaries and counts could be either in-memory maps or RDDs for use with joins? This would get rid of sequence files completely from the pipeline. Item similarity uses in-memory maps, but the plan is to make it more scalable using joins as an alternative with the same API, allowing the user to trade off footprint for speed.

I think you're right- it should be relatively easy. I've been looking at porting seq2sparse to the DSL for a bit now and the stopper at the DSL level is that we don't have a distributed data structure for strings. Seems like getting a DataFrame implemented, as Dmitriy mentioned above, would take care of this problem.

The other issue I'm a little fuzzy on is the distributed collocation mapping- it's a part of the seq2sparse code that I've not spent much time in.

I think that this would be a very worthy effort as well- I believe seq2sparse is a particularly strong Mahout feature.

I'll start another thread since we're now way off topic from the refactoring proposal.

> My use for TF-IDF is for row similarity and would take a DRM (actually an IndexedDataset) and calculate row/doc similarities. It works now but only using LLR. This is OK when thinking of the items as tags or metadata, but for text tokens something like cosine may be better.
>
> I'd imagine a downsampling phase that would precede TF-IDF, using LLR a lot like how CF preferences are downsampled. This would produce a sparsified all-docs DRM. Then (if the counts were saved) TF-IDF would re-weight the terms before row similarity uses cosine. This is not so good for search, but it should produce much better similarities than Solr's "moreLikeThis" and does it for all pairs rather than one at a time.
>
> In any case it can be used to create a personalized content-based recommender or augment a CF recommender with one more indicator type.
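A rough sketch of the re-weighting step described above, assuming the sparsified all-docs DRM holds raw term counts: compute a document-frequency vector, broadcast it, and re-weight entries in a mapBlock with a Lucene-style IDF. The helper name tfidfSketch and the exact weighting formula are illustrative assumptions; the LLR downsampling and cosine row-similarity steps are not shown.

import scala.math.log

import org.apache.mahout.math.Vector
import org.apache.mahout.math.drm._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.scalabindings.RLikeOps._

// A sketch: re-weight a term-count DRM with TF * IDF, where IDF uses per-term document frequencies.
def tfidfSketch(termCounts: DrmLike[Int])(implicit ctx: DistributedContext): DrmLike[Int] = {

  val numDocs = termCounts.nrow.toDouble

  // Document frequency per term: number of docs with a nonzero count in that column.
  val df: Vector = termCounts.mapBlock() { case (keys, block) =>
    block := ((r, c, v) => if (v > 0) 1.0 else 0.0)   // binarize counts
    keys -> block
  }.colSums()

  val dfBcast = drmBroadcast(df)

  // Re-weight each nonzero count: tf * (1 + log(numDocs / (df + 1))), Lucene's classic IDF form.
  termCounts.mapBlock() { case (keys, block) =>
    block := ((r, c, v) =>
      if (v > 0) v * (1.0 + log(numDocs / (dfBcast.value(c) + 1.0))) else 0.0)
    keys -> block
  }
}

Cosine row similarity would then run over the re-weighted DRM, giving the all-pairs behavior contrasted above with Solr's moreLikeThis.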
On Feb 3, 2015, at 3:37 PM, Andrew Palumbo <[email protected]> wrote:

On 02/03/2015 12:44 PM, Andrew Palumbo wrote:

On 02/03/2015 12:22 PM, Pat Ferrel wrote:

> Some issues WRT lower level Spark integration:
>
> 1) Interoperability with Spark data. TF-IDF is one example I actually looked at. There may be other things we can pick up from their committers since they have an abundance.
>
> 2) Wider acceptance of the Mahout DSL. The DSL's power was illustrated to me when someone on the Spark list asked about matrix transpose and an MLlib committer's answer was something like "why would you want to do that?". Usually you don't actually execute the transpose, but they don't even support A'A, AA', or A'B, which are core to what I work on. At present you pretty much have to choose between MLlib or Mahout for sparse matrix stuff. Maybe a half-way measure is some implicit conversions (ugh, I know). If the DSL could interchange datasets with MLlib, people would be pointed to the DSL for a whole bunch of "why would you want to do that?" features. MLlib seems to be algorithms, not math.
>
> 3) Integration of streaming. DStreams support most of the RDD interface. Doing a batch recalc on a moving time window would nearly fall out of DStream-backed DRMs. This isn't the same as incremental updates on streaming, but it's a start.
>
> Last year we were looking at Hadoop MapReduce vs. Spark, H2O, and Flink as faster compute engines. So we jumped. Now the need is for streaming, and especially incrementally updated streaming. Seems like we need to address this.
>
> Andrew, regardless of the above, having TF-IDF would be super helpful - row similarity for content/text would benefit greatly.

I will put a PR up soon.

Just to clarify, I'll be porting the (very simple) TF and TFIDF classes and the Weight interface over from mr-legacy to math-scala. They're available now in spark-shell but won't be after this refactoring. These still require a dictionary and a frequency count map to vectorize incoming text, so they're more for use with the old MR seq2sparse and I don't think they can be used with Spark's HashingTF and IDF. I'll put them up soon. Hopefully they'll be of some use.
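For reference, here is a sketch of what such TF/TFIDF classes and a Weight interface might look like in math-scala, loosely following the mr-legacy Weight signature and Lucene's classic formulas. The names, signatures, and formulas are illustrative assumptions, not necessarily what PR #75 contains; the vectorize helper shows why a dictionary and a document-frequency map are needed to vectorize text outside the original corpus.

import scala.math.{log, sqrt}

import org.apache.mahout.math.{RandomAccessSparseVector, Vector}

// Illustrative Weight interface, modeled loosely on the mr-legacy signature.
trait Weight {
  def calculate(tf: Int, df: Int, length: Int, numDocs: Int): Double
}

class TF extends Weight {
  // Sublinear term frequency, as in Lucene's classic similarity.
  def calculate(tf: Int, df: Int, length: Int, numDocs: Int): Double = sqrt(tf.toDouble)
}

class TFIDF extends Weight {
  // Lucene's classic IDF: 1 + log(numDocs / (df + 1)).
  def calculate(tf: Int, df: Int, length: Int, numDocs: Int): Double =
    sqrt(tf.toDouble) * (log(numDocs.toDouble / (df + 1.0)) + 1.0)
}

// Vectorizing a new document outside the original corpus: this is where the dictionary and
// document-frequency map mentioned above are needed.
def vectorize(tokens: Seq[String],
              dictionary: Map[String, Int],
              dfMap: Map[Int, Int],
              numDocs: Int,
              weight: Weight): Vector = {
  val v = new RandomAccessSparseVector(dictionary.size)
  val termFreqs = tokens.groupBy(identity).mapValues(_.size)
  for ((term, tf) <- termFreqs; idx <- dictionary.get(term)) {
    val df = dfMap.getOrElse(idx, 1)
    v.setQuick(idx, weight.calculate(tf, df, tokens.size, numDocs))
  }
  v
}

One likely reason these don't line up with Spark's HashingTF is that HashingTF hashes terms straight to indices rather than going through a dictionary.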
