AP, how is your impl different from Gokhan's?

On Mon, Mar 9, 2015 at 9:54 PM, Andrew Palumbo <[email protected]> wrote:
BTW, I'm not sure o.a.m.nlp is the best package name for either; I was using it because o.a.m.vectorizer, which is probably a better name, had conflicts in mrlegacy.

On 03/09/2015 09:29 PM, Andrew Palumbo wrote:

I meant: would o.a.m.nlp in the spark module be a good place for Gokhan's seq2sparse implementation to live?

On 03/09/2015 09:07 PM, Pat Ferrel wrote:

> Does o.a.m.nlp in the spark module seem like a good place for this to live?

I think you meant math-scala?

Actually we should rename math to core.

On Mar 9, 2015, at 3:15 PM, Andrew Palumbo <[email protected]> wrote:

Cool- this is great! I think this is really important to have in.

+1 to a pull request for comments.

I have PR #75 (https://github.com/apache/mahout/pull/75) open. It has very simple TF and TFIDF classes based on Lucene's IDF calculation and MLlib's. I just got a bad flu and haven't had a chance to push it. It creates an o.a.m.nlp package in mahout-math. I will push that as soon as I can in case you want to use them.

Does o.a.m.nlp in the spark module seem like a good place for this to live?

Those classes may be of use to you- they're very simple and are intended for new document vectorization once the legacy deps are removed from the spark module. They might also make interoperability easier.

One thought, having not been able to look at this too closely yet:

> // do we need to calculate df-vector?

1. We do need a document frequency map or vector to be able to calculate the IDF terms when vectorizing a new document outside of the original corpus.

On 03/09/2015 05:10 PM, Pat Ferrel wrote:

Ah, you are doing all the Lucene analyzer, ngrams and other tokenizing, nice.

On Mar 9, 2015, at 2:07 PM, Pat Ferrel <[email protected]> wrote:

Ah, I found the right button in Github, no PR necessary.

On Mar 9, 2015, at 1:55 PM, Pat Ferrel <[email protected]> wrote:

If you create a PR it's easier to see what was changed.

Wouldn't it be better to read in files from a directory, assigning doc-id = filename and term-ids = terms, or are there still Hadoop pipeline tools that are needed to create the sequence files? This sort of mimics the way Spark reads SchemaRDDs from Json files.

BTW, this can also be done with a new reader trait on the IndexedDataset. It will give you two bidirectional maps (BiMap) and a DrmLike[Int]. One BiMap gives any String <-> Int mapping for rows, the other does the same for columns (text tokens). This would be a few lines of code since the string mapping and DRM creation is already written; the only thing to do would be to map the doc/row ids to filenames. This allows you to take the non-int doc ids out of the DRM and replace them with a map. Not based on a Spark dataframe yet but probably will be.

On Mar 9, 2015, at 11:12 AM, Gokhan Capan <[email protected]> wrote:

So, here is a sketch of a Spark implementation of seq2sparse, returning a (matrix:DrmLike, dictionary:Map):

https://github.com/gcapan/mahout/tree/seq2sparse

Although it should be possible, I couldn't manage to make it process non-integer document ids. Any fix would be appreciated. There is a simple test attached, but I think there is more to do in terms of handling all parameters of the original seq2sparse implementation.

I put it directly into the SparkEngine - not that I think this object is the most appropriate placeholder, it just seemed convenient to me.

Best

Gokhan
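For readers who don't want to open the branch, the following is a minimal sketch of the general shape of such a pipeline: tokenize each document, build a term dictionary, and wrap per-document term-count vectors as a DrmLike[Int]. The function name seq2sparseSketch, the naive whitespace tokenizer, and the driver-side dictionary are assumptions for illustration, not the code in Gokhan's branch; the Int-keyed DrmRdd also shows why non-integer document ids are awkward here.

import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

import org.apache.mahout.math.{RandomAccessSparseVector, Vector}
import org.apache.mahout.math.drm.DrmLike
import org.apache.mahout.sparkbindings._

// Illustrative only: (docId, text) pairs in, (term-count DRM, term dictionary) out.
def seq2sparseSketch(docs: RDD[(Int, String)]): (DrmLike[Int], Map[String, Int]) = {

  // A naive whitespace tokenizer stands in for the Lucene analyzer / n-gram pipeline.
  val tokenized: RDD[(Int, Seq[String])] =
    docs.mapValues(_.toLowerCase.split("\\W+").filter(_.nonEmpty).toSeq)

  // Term dictionary: term -> column index, collected to the driver like the legacy dictionary chunks.
  val dictionary: Map[String, Int] =
    tokenized.flatMap(_._2).distinct().collect().zipWithIndex.toMap

  val numTerms = dictionary.size
  val bcastDict = docs.sparkContext.broadcast(dictionary)

  // One sparse term-count vector per document; the Int row key is why non-Int doc ids are awkward.
  val vectors: DrmRdd[Int] = tokenized.map { case (docId, terms) =>
    val v: Vector = new RandomAccessSparseVector(numTerms)
    terms.foreach { t =>
      val i = bcastDict.value(t)
      v.setQuick(i, v.getQuick(i) + 1.0)
    }
    (docId, v)
  }

  (drmWrap(vectors, ncol = numTerms), dictionary)
}

A Lucene analyzer with n-gram generation would replace the whitespace tokenizer, and for large vocabularies the dictionary would need to move off the driver, but the returned (DrmLike, dictionary) pair matches the shape described above.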
On Thu, Feb 5, 2015 at 3:48 AM, Pat Ferrel <[email protected]> wrote:

IndexedDataset might suffice until real DataFrames come along.

On Feb 4, 2015, at 3:42 PM, Dmitriy Lyubimov <[email protected]> wrote:

Dealing with dictionaries inevitably means a DataFrame for seq2sparse; the dictionary is a byproduct of it IIRC. A matrix is definitely not a structure to hold those.

On Wed, Feb 4, 2015 at 9:16 AM, Andrew Palumbo <[email protected]> wrote:

On 02/04/2015 11:13 AM, Pat Ferrel wrote:

> Andrew, not sure what you mean about storing strings. If you mean something like a DRM of tokens, that is a DataFrame with row = doc, column = token. A one-row DataFrame is a slightly heavyweight string/document. A DataFrame with token counts would be perfect as input to TF-IDF, no? It would be a vector that maintains the tokens as ids for the counts, right?

Yes- dataframes will be perfect for this. The problem that I was referring to was that we don't have a DSL data structure to do the initial distributed tokenizing of the documents [1] line:257, [2]. For this I believe we would need something like a distributed vector of Strings that could be broadcast to a mapBlock closure and then tokenized from there. Even there, mapBlock may not be perfect for this, but some of the new distributed functions that Gokhan is working on may be.

> I agree seq2sparse type input is a strong feature. Text files into an all-documents DataFrame basically. Colocation?

As far as collocations, I believe the n-grams are computed and counted in the CollocDriver [3] (I might be wrong here... it's been a while since I looked at the code...). Either way, I don't think I ever looked too closely and I was a bit fuzzy on this...

These were just some thoughts that I had when briefly looking at porting seq2sparse to the DSL before. Obviously we don't have to follow this algorithm, but it's a nice starting point.

[1] https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles.java
[2] https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/vectorizer/DocumentProcessor.java
[3] https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/vectorizer/collocations/llr/CollocDriver.java
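Until there is a DSL-level structure for distributed strings, the tokenizing step can live in plain Spark, and Pat's earlier doc-id = filename suggestion maps naturally onto wholeTextFiles. Below is a hedged sketch under those assumptions; it reuses the illustrative seq2sparseSketch from earlier in the thread, and a plain Map stands in for the BiMap of an IndexedDataset.

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

// Hypothetical helper: read a directory of text files, use the file name as the document id,
// and keep the String <-> Int row-id mapping outside of the Int-keyed DRM.
def readTextDir(sc: SparkContext, path: String) = {
  val files = sc.wholeTextFiles(path)                 // RDD[(fileName, fileContents)]

  // Assign each file name a stable Int row id; a plain Map stands in for a BiMap here.
  val rowIdMap: Map[String, Int] = files.keys.collect().sorted.zipWithIndex.toMap
  val bcastRowIds = sc.broadcast(rowIdMap)

  // Swap filenames for Int ids, then reuse the illustrative seq2sparse pipeline sketched above.
  val docs = files.map { case (name, text) => (bcastRowIds.value(name), text) }
  val (drm, termDictionary) = seq2sparseSketch(docs)

  // Callers keep rowIdMap (and its inverse) to translate DRM row keys back to filenames.
  (drm, rowIdMap, termDictionary)
}

The row-id map plays the role Pat describes: the non-Int doc ids stay out of the DRM and live in the map.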
On Feb 4, 2015, at 7:47 AM, Andrew Palumbo <[email protected]> wrote:

Just copied over the relevant last few messages to keep the other thread on topic...

On 02/03/2015 08:22 PM, Dmitriy Lyubimov wrote:

> I'd suggest to consider this: remember all this talk about language-integrated Spark QL being basically dataframe manipulation DSL? So now Spark devs are noticing this generality as well and are actually proposing to rename SchemaRDD into DataFrame and make it a mainstream data structure. (My "told you so" moment of sorts.)
>
> What I am getting at: I'd suggest we make DRM and Spark's newly renamed DataFrame our two major structures. In particular, standardize on using DataFrame for things that may include non-numerical data and require more grace about column naming and manipulation. Maybe relevant to the TF-IDF work when it deals with non-matrix content.

Sounds like a worthy effort to me. We'd basically be implementing an API at the math-scala level for SchemaRDD/DataFrame data structures, correct?

On Tue, Feb 3, 2015 at 5:01 PM, Pat Ferrel <[email protected]> wrote:

> Seems like seq2sparse would be really easy to replace since it takes text files to start with, then the whole pipeline could be kept in RDDs. The dictionaries and counts could be either in-memory maps or RDDs for use with joins? This would get rid of sequence files completely from the pipeline. Item similarity uses in-memory maps, but the plan is to make it more scalable using joins as an alternative with the same API, allowing the user to trade off footprint for speed.

I think you're right- it should be relatively easy. I've been looking at porting seq2sparse to the DSL for a bit now and the stopper at the DSL level is that we don't have a distributed data structure for strings. Seems like getting a DataFrame implemented, as Dmitriy mentioned above, would take care of this problem.

The other issue I'm a little fuzzy on is the distributed collocation mapping- it's a part of the seq2sparse code that I've not spent much time in.

I think that this would be a very worthy effort as well- I believe seq2sparse is a particularly strong Mahout feature.

I'll start another thread since we're now way off topic from the refactoring proposal.

> My use for TF-IDF is for row similarity and would take a DRM (actually an IndexedDataset) and calculate row/doc similarities. It works now but only using LLR. This is OK when thinking of the items as tags or metadata, but for text tokens something like cosine may be better.
>
> I'd imagine a downsampling phase that would precede TF-IDF, using LLR a lot like how CF preferences are downsampled. This would produce a sparsified all-docs DRM. Then (if the counts were saved) TF-IDF would re-weight the terms before row similarity uses cosine. This is not so good for search, but it should produce much better similarities than Solr's "moreLikeThis" and does it for all pairs rather than one at a time.
>
> In any case it can be used to create a personalized content-based recommender or augment a CF recommender with one more indicator type.
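A rough sketch of the re-weighting step described above, assuming the sparsified all-docs DRM holds raw term counts: compute a document-frequency vector, broadcast it, and re-weight entries in a mapBlock with a Lucene-style IDF. The helper name tfidfSketch and the exact weighting formula are illustrative assumptions; the LLR downsampling and cosine row-similarity steps are not shown.

import scala.math.log

import org.apache.mahout.math.Vector
import org.apache.mahout.math.drm._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.scalabindings.RLikeOps._

// A sketch: re-weight a term-count DRM with TF * IDF, where IDF uses per-term document frequencies.
def tfidfSketch(termCounts: DrmLike[Int])(implicit ctx: DistributedContext): DrmLike[Int] = {

  val numDocs = termCounts.nrow.toDouble

  // Document frequency per term: number of docs with a nonzero count in that column.
  val df: Vector = termCounts.mapBlock() { case (keys, block) =>
    block := ((r, c, v) => if (v > 0) 1.0 else 0.0)   // binarize counts
    keys -> block
  }.colSums()

  val dfBcast = drmBroadcast(df)

  // Re-weight each nonzero count: tf * (1 + log(numDocs / (df + 1))), Lucene's classic IDF form.
  termCounts.mapBlock() { case (keys, block) =>
    block := ((r, c, v) =>
      if (v > 0) v * (1.0 + log(numDocs / (dfBcast.value(c) + 1.0))) else 0.0)
    keys -> block
  }
}

Cosine row similarity would then run over the re-weighted DRM, giving the all-pairs behavior contrasted above with Solr's moreLikeThis.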
On Feb 3, 2015, at 3:37 PM, Andrew Palumbo <[email protected]> wrote:

On 02/03/2015 12:44 PM, Andrew Palumbo wrote:

On 02/03/2015 12:22 PM, Pat Ferrel wrote:

> Some issues WRT lower level Spark integration:
>
> 1) Interoperability with Spark data. TF-IDF is one example I actually looked at. There may be other things we can pick up from their committers since they have an abundance.
>
> 2) Wider acceptance of the Mahout DSL. The DSL's power was illustrated to me when someone on the Spark list asked about matrix transpose and an MLlib committer's answer was something like "why would you want to do that?". Usually you don't actually execute the transpose, but they don't even support A'A, AA', or A'B, which are core to what I work on. At present you pretty much have to choose between MLlib or Mahout for sparse matrix stuff. Maybe a half-way measure is some implicit conversions (ugh, I know). If the DSL could interchange datasets with MLlib, people would be pointed to the DSL for a whole bunch of "why would you want to do that?" features. MLlib seems to be algorithms, not math.
>
> 3) Integration of streaming. DStreams support most of the RDD interface. Doing a batch recalc on a moving time window would nearly fall out of DStream-backed DRMs. This isn't the same as incremental updates on streaming, but it's a start.
>
> Last year we were looking at Hadoop MapReduce vs. Spark, H2O, and Flink as faster compute engines. So we jumped. Now the need is for streaming, and especially incrementally updated streaming. Seems like we need to address this.
>
> Andrew, regardless of the above, having TF-IDF would be super helpful - row similarity for content/text would benefit greatly.

I will put a PR up soon.

Just to clarify, I'll be porting the (very simple) TF and TFIDF classes and the Weight interface over from mr-legacy to math-scala. They're available now in spark-shell but won't be after this refactoring. These still require a dictionary and a frequency count map to vectorize incoming text, so they're more for use with the old MR seq2sparse and I don't think they can be used with Spark's HashingTF and IDF. I'll put them up soon. Hopefully they'll be of some use.
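For reference, here is a sketch of what such TF/TFIDF classes and a Weight interface might look like in math-scala, loosely following the mr-legacy Weight signature and Lucene's classic formulas. The names, signatures, and formulas are illustrative assumptions, not necessarily what PR #75 contains; the vectorize helper shows why a dictionary and a document-frequency map are needed to vectorize text outside the original corpus.

import scala.math.{log, sqrt}

import org.apache.mahout.math.{RandomAccessSparseVector, Vector}

// Illustrative Weight interface, modeled loosely on the mr-legacy signature.
trait Weight {
  def calculate(tf: Int, df: Int, length: Int, numDocs: Int): Double
}

class TF extends Weight {
  // Sublinear term frequency, as in Lucene's classic similarity.
  def calculate(tf: Int, df: Int, length: Int, numDocs: Int): Double = sqrt(tf.toDouble)
}

class TFIDF extends Weight {
  // Lucene's classic IDF: 1 + log(numDocs / (df + 1)).
  def calculate(tf: Int, df: Int, length: Int, numDocs: Int): Double =
    sqrt(tf.toDouble) * (log(numDocs.toDouble / (df + 1.0)) + 1.0)
}

// Vectorizing a new document outside the original corpus: this is where the dictionary and
// document-frequency map mentioned above are needed.
def vectorize(tokens: Seq[String],
              dictionary: Map[String, Int],
              dfMap: Map[Int, Int],
              numDocs: Int,
              weight: Weight): Vector = {
  val v = new RandomAccessSparseVector(dictionary.size)
  val termFreqs = tokens.groupBy(identity).mapValues(_.size)
  for ((term, tf) <- termFreqs; idx <- dictionary.get(term)) {
    val df = dfMap.getOrElse(idx, 1)
    v.setQuick(idx, weight.calculate(tf, df, tokens.size, numDocs))
  }
  v
}

One likely reason these don't line up with Spark's HashingTF is that HashingTF hashes terms straight to indices rather than going through a dictionary.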
