I think everyone agrees that getting this into a PR would be great. We need a modernized text pipeline and this is an excellent starting point. We can discuss it there.
On Mar 10, 2015, at 3:53 AM, Gokhan Capan <[email protected]> wrote:

Some answers:

- Non-integer document ids: The implementation does not use operations defined only for DrmLike[Int], so the row keys do not have to be Ints. I just couldn't manage to create the returned DrmLike with the correct key type. While wrapping into a DrmLike I tried to pass the key class using the HDFS utils, the way drmDfsRead uses them, but I somehow wasn't successful. So non-int document ids are not an actual issue here.

- Breaking the implementation out into smaller pieces: Let's just collect the requirements and adjust the implementation accordingly. I honestly didn't think very much about where the implementation fits in, architecturally, or which pieces are of public interest.

Best
Gokhan

On Tue, Mar 10, 2015 at 3:56 AM, Suneel Marthi <[email protected]> wrote:

AP, how is your impl different from Gokhan's?

On Mon, Mar 9, 2015 at 9:54 PM, Andrew Palumbo <[email protected]> wrote:

BTW, I'm not sure o.a.m.nlp is the best package name for either; I was using it because o.a.m.vectorizer, which is probably a better name, had conflicts in mrlegacy.

On 03/09/2015 09:29 PM, Andrew Palumbo wrote:

I meant: would o.a.m.nlp in the spark module be a good place for Gokhan's seq2sparse implementation to live?

On 03/09/2015 09:07 PM, Pat Ferrel wrote:

> Does o.a.m.nlp in the spark module seem like a good place for this to live?

I think you meant math-scala? Actually we should rename math to core.

On Mar 9, 2015, at 3:15 PM, Andrew Palumbo <[email protected]> wrote:

Cool, this is great! I think this is really important to have in.

+1 to a pull request for comments.

I have pr#75 (https://github.com/apache/mahout/pull/75) open. It has very simple TF and TFIDF classes based on Lucene's and MLlib's IDF calculations. I just got a bad flu and haven't had a chance to push it. It creates an o.a.m.nlp package in mahout-math. I will push that as soon as I can in case you want to use them.

Does o.a.m.nlp in the spark module seem like a good place for this to live?

Those classes may be of use to you; they're very simple and are intended for new-document vectorization once the legacy deps are removed from the spark module. They also might make interoperability with MLlib easier.

One thought, having not been able to look at this too closely yet:

> // do we need to calculate df-vector?

1. We do need a document frequency map or vector to be able to calculate the IDF terms when vectorizing a new document outside of the original corpus.

On 03/09/2015 05:10 PM, Pat Ferrel wrote:

Ah, you are doing all the Lucene analyzer, ngrams and other tokenizing, nice.

On Mar 9, 2015, at 2:07 PM, Pat Ferrel <[email protected]> wrote:

Ah, I found the right button in Github, no PR necessary.

On Mar 9, 2015, at 1:55 PM, Pat Ferrel <[email protected]> wrote:

If you create a PR it's easier to see what was changed.

Wouldn't it be better to read in files from a directory, assigning doc-id = filename and term-ids = terms, or are there still Hadoop pipeline tools that are needed to create the sequence files? This sort of mimics the way Spark reads SchemaRDDs from Json files.
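A minimal sketch of that directory-reading idea, assuming Spark's wholeTextFiles and a naive regex split standing in for the Lucene analyzer:

import org.apache.spark.SparkContext

def readDocs(sc: SparkContext, dir: String) =
  // wholeTextFiles yields one (path, fileContents) pair per file
  sc.wholeTextFiles(dir).map { case (path, text) =>
    val docId = path.substring(path.lastIndexOf('/') + 1) // doc-id = filename
    val terms = text.toLowerCase.split("\\W+").filter(_.nonEmpty).toSeq // stand-in tokenizer
    (docId, terms)
  }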
BTW this can also be done with a new reader trait on the IndexedDataset. It will give you two bidirectional maps (BiMap) and a DrmLike[Int]. One BiMap gives any String <-> Int mapping for rows, the other does the same for columns (text tokens). This would be a few lines of code, since the string mapping and DRM creation are already written; the only thing to do would be to map the doc/row ids to filenames. This allows you to take the non-int doc ids out of the DRM and replace them with a map. It's not based on a Spark dataframe yet, but probably will be.

On Mar 9, 2015, at 11:12 AM, Gokhan Capan <[email protected]> wrote:

So, here is a sketch of a Spark implementation of seq2sparse, returning a (matrix: DrmLike, dictionary: Map):

https://github.com/gcapan/mahout/tree/seq2sparse

Although it should be possible, I couldn't manage to make it process non-integer document ids. Any fix would be appreciated. There is a simple test attached, but I think there is more to do in terms of handling all the parameters of the original seq2sparse implementation.

I put it directly into the SparkEngine. Not that I think this object is the most appropriate place for it; it just seemed convenient to me.

Best

Gokhan

On Thu, Feb 5, 2015 at 3:48 AM, Pat Ferrel <[email protected]> wrote:

IndexedDataset might suffice until real DataFrames come along.

On Feb 4, 2015, at 3:42 PM, Dmitriy Lyubimov <[email protected]> wrote:

Dealing with dictionaries inevitably means a DataFrame for seq2sparse; the dictionary is a byproduct of it, IIRC. A matrix is definitely not a structure to hold those.

On Wed, Feb 4, 2015 at 9:16 AM, Andrew Palumbo <[email protected]> wrote:

On 02/04/2015 11:13 AM, Pat Ferrel wrote:

> Andrew, not sure what you mean about storing strings. If you mean something like a DRM of tokens, that is a DataFrame with row = doc, column = token. A one-row DataFrame is a slightly heavyweight string/document. A DataFrame with token counts would be perfect input for TF-IDF, no? It would be a vector that maintains the tokens as ids for the counts, right?

Yes, dataframes will be perfect for this. The problem I was referring to is that we don't have a DSL data structure to do the initial distributed tokenizing of the documents ([1], line 257; [2]). For this I believe we would need something like a distributed vector of Strings that could be broadcast to a mapBlock closure and then tokenized from there. Even there, mapBlock may not be perfect for this, but some of the new distributed functions that Gokhan is working on may be.
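A rough sketch of that pre-DSL step, with the tokenizing done in engine-specific RDD code and only the resulting term-count vectors handed to the DSL, assuming the drmWrap helper from sparkbindings and a small driver-side dictionary map:

import org.apache.mahout.math.{RandomAccessSparseVector, Vector}
import org.apache.mahout.sparkbindings._
import org.apache.spark.rdd.RDD

def vectorize(docs: RDD[(Int, Seq[String])], dictionary: Map[String, Int]) = {
  val numTerms = dictionary.size
  val rows = docs.map { case (docId, terms) =>
    // raw term frequencies for now; TF-IDF re-weighting would come later
    val v: Vector = new RandomAccessSparseVector(numTerms)
    terms.foreach(t => dictionary.get(t).foreach(i => v.setQuick(i, v.getQuick(i) + 1)))
    (docId, v)
  }
  drmWrap(rows, ncol = numTerms) // a DrmLike[Int] the DSL can operate on
}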
> I agree seq2sparse type input is a strong feature. Text files into an all-documents DataFrame basically. Colocation?

As far as collocations, I believe the n-grams are computed and counted in the CollocDriver [3] (I might be wrong here... it's been a while since I looked at the code). Either way, I don't think I ever looked too closely, and I was a bit fuzzy on this.

These were just some thoughts that I had when briefly looking at porting seq2sparse to the DSL before. Obviously we don't have to follow this algorithm, but it's a nice starting point.

[1] https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles.java
[2] https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/vectorizer/DocumentProcessor.java
[3] https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/vectorizer/collocations/llr/CollocDriver.java

On Feb 4, 2015, at 7:47 AM, Andrew Palumbo <[email protected]> wrote:

Just copied over the relevant last few messages to keep the other thread on topic...

On 02/03/2015 08:22 PM, Dmitriy Lyubimov wrote:

> I'd suggest to consider this: remember all this talk about language-integrated Spark QL being basically a dataframe manipulation DSL? So now Spark devs are noticing this generality as well and are actually proposing to rename SchemaRDD into DataFrame and make it a mainstream data structure. (My "told you so" moment, of sorts.)
>
> What I am getting at: I'd suggest we make DRM and Spark's newly renamed DataFrame our two major structures. In particular, standardize on using DataFrame for things that may include non-numerical data and require more grace about column naming and manipulation. Maybe relevant to the TF-IDF work when it deals with non-matrix content.

Sounds like a worthy effort to me. We'd basically be implementing an API at the math-scala level for SchemaRDD/DataFrame data structures, correct?

On Tue, Feb 3, 2015 at 5:01 PM, Pat Ferrel <[email protected]> wrote:

> Seems like seq2sparse would be really easy to replace since it takes text files to start with, then the whole pipeline could be kept in RDDs. The dictionaries and counts could be either in-memory maps or RDDs for use with joins? This would get rid of sequence files completely from the pipeline. Item similarity uses in-memory maps, but the plan is to make it more scalable using joins as an alternative with the same API, allowing the user to trade off footprint for speed.

I think you're right, it should be relatively easy. I've been looking at porting seq2sparse to the DSL for a bit now, and the stopper at the DSL level is that we don't have a distributed data structure for strings. Seems like getting a DataFrame implemented as Dmitriy mentioned above would take care of this problem.
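On the dictionaries-and-counts point, both options fall out of a few lines of Spark; a hedged sketch, assuming docs is an RDD[(String, Seq[String])] of (doc-id, tokens) pairs as in the earlier sketch:

import org.apache.spark.SparkContext._ // pair-RDD implicits on Spark 1.x
import org.apache.spark.rdd.RDD

def buildDictionary(docs: RDD[(String, Seq[String])]) = {
  // term -> integer column id, assigned with zipWithIndex
  val dictionary: RDD[(String, Long)] = docs.flatMap(_._2).distinct().zipWithIndex()
  // term -> document frequency (how many docs contain the term), needed for IDF
  val df: RDD[(String, Long)] =
    docs.flatMap { case (_, terms) => terms.distinct.map((_, 1L)) }.reduceByKey(_ + _)
  // collect both into in-memory maps for small vocabularies,
  // or keep them as RDDs and join against term streams for the scalable path
  (dictionary, df)
}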
The other issue I'm a little fuzzy on is the distributed collocation mapping; it's a part of the seq2sparse code that I've not spent too much time in.

I think that this would be a very worthy effort as well. I believe seq2sparse is a particularly strong Mahout feature.

I'll start another thread since we're now way off topic from the refactoring proposal.

> My use for TF-IDF is for row similarity and would take a DRM (actually an IndexedDataset) and calculate row/doc similarities. It works now, but only using LLR. This is OK when thinking of the items as tags or metadata, but for text tokens something like cosine may be better.
>
> I'd imagine a downsampling phase that would precede TF-IDF, using LLR a lot like how CF preferences are downsampled. This would produce a sparsified all-docs DRM. Then (if the counts were saved) TF-IDF would re-weight the terms before row similarity uses cosine. This is not so good for search, but it should produce much better similarities than Solr's "moreLikeThis", and it does so for all pairs rather than one at a time.
>
> In any case it can be used to create a personalized content-based recommender, or to augment a CF recommender with one more indicator type.

On Feb 3, 2015, at 3:37 PM, Andrew Palumbo <[email protected]> wrote:

On 02/03/2015 12:44 PM, Andrew Palumbo wrote:

On 02/03/2015 12:22 PM, Pat Ferrel wrote:

> Some issues WRT lower-level Spark integration:
>
> 1) Interoperability with Spark data. TF-IDF is one example I actually looked at. There may be other things we can pick up from their committers, since they have an abundance.
>
> 2) Wider acceptance of the Mahout DSL. The DSL's power was illustrated to me when someone on the Spark list asked about matrix transpose and an MLlib committer's answer was something like "why would you want to do that?". Usually you don't actually execute the transpose, but they don't even support A'A, AA', or A'B, which are core to what I work on. At present you pretty much have to choose between MLlib and Mahout for sparse matrix stuff. Maybe a half-way measure is some implicit conversions (ugh, I know). If the DSL could interchange datasets with MLlib, people would be pointed to the DSL for a bunch of those "why would you want to do that?" features. MLlib seems to be algorithms, not math.
>
> 3) Integration of streaming. DStreams support most of the RDD interface. Doing a batch recalc on a moving time window would nearly fall out of DStream-backed DRMs. This isn't the same as incremental updates on streaming, but it's a start.
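For reference, the A'A and AA' expressions from point 2 are one-liners in the DSL; a minimal sketch, assuming an implicit DistributedContext (e.g. from the Mahout spark-shell):

import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.drm.RLikeDrmOps._

val A = drmParallelize(dense((1, 2), (3, 4), (5, 6)), numPartitions = 2)

// the optimizer fuses A.t %*% A into a single physical operator,
// so the transpose is never actually materialized
val AtA = (A.t %*% A).collect
val AAt = (A %*% A.t).collect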
> Last year we were looking at Hadoop MapReduce vs. Spark, H2O, and Flink as faster compute engines. So we jumped. Now the need is for streaming, and especially incrementally updated streaming. Seems like we need to address this.
>
> Andrew, regardless of the above, having TF-IDF would be super helpful; row similarity for content/text would benefit greatly.

I will put a PR up soon.

Just to clarify, I'll be porting the (very simple) TF and TFIDF classes and the Weight interface over from mr-legacy to math-scala. They're available now in spark-shell but won't be after this refactoring. These still require a dictionary and a frequency-count map to vectorize incoming text, so they're more for use with the old MR seq2sparse, and I don't think they can be used with Spark's HashingTF and IDF. I'll put them up soon. Hopefully they'll be of some use.
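A hedged sketch of why the dictionary and df map must be kept around, vectorizing a new document outside the original corpus; the IDF term here follows the classic Lucene formulation, log(numDocs / (df + 1)) + 1, though the PR's exact weighting may differ:

import org.apache.mahout.math.{RandomAccessSparseVector, Vector}

def tfidfVectorize(terms: Seq[String],
                   dictionary: Map[String, Int], // term -> column id
                   df: Map[String, Long],        // term -> document frequency
                   numDocs: Long): Vector = {
  val v = new RandomAccessSparseVector(dictionary.size)
  terms.groupBy(identity).mapValues(_.size).foreach { case (term, tf) =>
    for (idx <- dictionary.get(term)) { // terms unseen in the corpus are dropped
      val idf = math.log(numDocs.toDouble / (df.getOrElse(term, 0L) + 1)) + 1.0
      v.setQuick(idx, tf * idf)
    }
  }
  v
}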
