I think I have a sketch of an implementation for creating a DRM from a sequence file of <Int, Text>s, a.k.a. seq2sparse, using Spark.
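Roughly what I have in mind - only a rough, untested sketch (the seq2sparse
name is just a placeholder, the whitespace tokenizer stands in for a proper
Lucene analyzer, and the dictionary is collected to the driver, though it
could stay an RDD and be joined, as Pat suggests in the quoted thread below):

import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.mahout.math.{RandomAccessSparseVector, Vector}
import org.apache.mahout.sparkbindings._
import org.apache.spark.SparkContext

def seq2sparse(sc: SparkContext, path: String) = {
  // Read the <docId, text> pairs; copy out of the reused Writables.
  val docs = sc.sequenceFile(path, classOf[IntWritable], classOf[Text])
    .map { case (k, v) => (k.get, v.toString) }

  // Placeholder tokenizer - a real version would use a Lucene analyzer.
  val tokenized = docs.mapValues(_.toLowerCase.split("\\W+").filter(_.nonEmpty))

  // Dictionary: term -> column index, collected to the driver for now.
  val dictionary = tokenized.flatMap(_._2).distinct().zipWithIndex()
    .mapValues(_.toInt).collectAsMap()
  val bcastDict = sc.broadcast(dictionary)

  // Vectorize each document as raw term counts.
  val vectors = tokenized.map { case (docId, terms) =>
    val dict = bcastDict.value
    val v: Vector = new RandomAccessSparseVector(dict.size)
    terms.foreach(t => v.incrementQuick(dict(t), 1.0))
    docId -> v
  }

  // Wrap the RDD[(Int, Vector)] as a DRM with ncol = vocabulary size.
  drmWrap(vectors, ncol = dictionary.size)
}

The TF-IDF re-weighting would then be a separate pass over the resulting
DRM; see the P.S. at the bottom for a sketch of that step.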
Give me a couple of days and I will provide an initial implementation.

Best,
Gokhan

On Wed, Feb 4, 2015 at 7:16 PM, Andrew Palumbo <ap....@outlook.com> wrote:

> On 02/04/2015 11:13 AM, Pat Ferrel wrote:
>
>> Andrew, not sure what you mean about storing strings. If you mean
>> something like a DRM of tokens, that is a DataFrame with row = doc,
>> column = token. A one-row DataFrame is a slightly heavyweight
>> string/document. A DataFrame with token counts would be perfect as input
>> to TF-IDF, no? It would be a vector that maintains the tokens as ids for
>> the counts, right?
>
> Yes - DataFrames will be perfect for this. The problem I was referring to
> was that we don't have a DSL data structure to do the initial distributed
> tokenizing of the documents ([1] line:257, [2]). For this I believe we
> would need something like a distributed vector of Strings that could be
> broadcast to a mapBlock closure and then tokenized from there. Even then,
> mapBlock may not be perfect for this, but some of the new distributed
> functions that Gokhan is working on may be.
>
>> I agree seq2sparse-type input is a strong feature. Text files into an
>> all-documents DataFrame basically. Collocation?
>
> As far as collocations, I believe the n-grams are computed and counted in
> the CollocDriver [3] (I might be wrong here... it's been a while since I
> looked at the code). Either way, I don't think I ever looked too closely,
> and I was a bit fuzzy on this...
>
> These were just some thoughts that I had when briefly looking at porting
> seq2sparse to the DSL before. Obviously we don't have to follow this
> algorithm, but it's a nice starting point.
>
> [1] https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/vectorizer/SparseVectorsFromSequenceFiles.java
> [2] https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/vectorizer/DocumentProcessor.java
> [3] https://github.com/apache/mahout/blob/master/mrlegacy/src/main/java/org/apache/mahout/vectorizer/collocations/llr/CollocDriver.java
>
>> On Feb 4, 2015, at 7:47 AM, Andrew Palumbo <ap....@outlook.com> wrote:
>>
>> Just copied over the relevant last few messages to keep the other thread
>> on topic...
>>
>> On 02/03/2015 08:22 PM, Dmitriy Lyubimov wrote:
>>
>>> I'd suggest to consider this: remember all this talk about
>>> language-integrated Spark SQL being basically a dataframe manipulation
>>> DSL?
>>>
>>> So now Spark devs are noticing this generality as well and are actually
>>> proposing to rename SchemaRDD into DataFrame and make it a mainstream
>>> data structure. (My "told you so" moment of sorts.)
>>>
>>> What I am getting at: I'd suggest to make DRM and Spark's newly renamed
>>> DataFrame our two major structures. In particular, standardize on using
>>> DataFrame for things that may include non-numerical data and require
>>> more grace about column naming and manipulation. Maybe relevant to
>>> TF-IDF work when it deals with non-matrix content.
>>
>> Sounds like a worthy effort to me. We'd basically be implementing an API
>> at the math-scala level for the SchemaRDD/DataFrame data structures,
>> correct?
>>
>> On Tue, Feb 3, 2015 at 5:01 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
>>
>>> Seems like seq2sparse would be really easy to replace since it takes
>>> text files to start with, then the whole pipeline could be kept in
>>> RDDs. The dictionaries and counts could be either in-memory maps or
>>> RDDs for use with joins?
>>> This would get rid of sequence files completely from the pipeline.
>>> Item similarity uses in-memory maps, but the plan is to make it more
>>> scalable using joins as an alternative with the same API, allowing the
>>> user to trade off footprint for speed.
>>
>> I think you're right - it should be relatively easy. I've been looking
>> at porting seq2sparse to the DSL for a bit now, and the stopper at the
>> DSL level is that we don't have a distributed data structure for
>> strings. Seems like getting a DataFrame implemented, as Dmitriy
>> mentioned above, would take care of this problem.
>>
>> The other issue I'm a little fuzzy on is the distributed collocation
>> mapping - it's a part of the seq2sparse code that I've not spent too
>> much time in.
>>
>> I think that this would be a very worthy effort as well - I believe
>> seq2sparse is a particularly strong Mahout feature.
>>
>> I'll start another thread since we're now way off topic from the
>> refactoring proposal.
>>
>> My use for TF-IDF is for row similarity and would take a DRM (actually
>> an IndexedDataset) and calculate row/doc similarities. It works now, but
>> only using LLR. This is OK when thinking of the items as tags or
>> metadata, but for text tokens something like cosine may be better.
>>
>> I'd imagine a downsampling phase preceding TF-IDF that uses LLR a lot
>> like how CF preferences are downsampled. This would produce a sparsified
>> all-docs DRM. Then (if the counts were saved) TF-IDF would re-weight the
>> terms before row similarity uses cosine. This is not so good for search,
>> but it should produce much better similarities than Solr's
>> "moreLikeThis", and it does so for all pairs rather than one at a time.
>>
>> In any case it can be used to create a personalized content-based
>> recommender or to augment a CF recommender with one more indicator type.
>>
>> On Feb 3, 2015, at 3:37 PM, Andrew Palumbo <ap....@outlook.com> wrote:
>>
>> On 02/03/2015 12:44 PM, Andrew Palumbo wrote:
>>
>>> On 02/03/2015 12:22 PM, Pat Ferrel wrote:
>>>
>>>> Some issues WRT lower-level Spark integration:
>>>>
>>>> 1) Interoperability with Spark data. TF-IDF is one example I actually
>>>> looked at. There may be other things we can pick up from their
>>>> committers, since they have an abundance.
>>>>
>>>> 2) Wider acceptance of the Mahout DSL. The DSL's power was illustrated
>>>> to me when someone on the Spark list asked about matrix transpose and
>>>> an MLlib committer's answer was something like "why would you want to
>>>> do that?" Usually you don't actually execute the transpose, but they
>>>> don't even support A'A, AA', or A'B, which are core to what I work on.
>>>> At present you pretty much have to choose between MLlib and Mahout for
>>>> sparse matrix stuff. Maybe a half-way measure is some implicit
>>>> conversions (ugh, I know). If the DSL could interchange datasets with
>>>> MLlib, people would be pointed to the DSL for a whole bunch of "why
>>>> would you want to do that?" features. MLlib seems to be algorithms,
>>>> not math.
>>>>
>>>> 3) Integration of streaming. DStreams support most of the RDD
>>>> interface. Doing a batch recalc on a moving time window would nearly
>>>> fall out of DStream-backed DRMs. This isn't the same as incremental
>>>> updates on streaming, but it's a start.
>>>>
>>>> Last year we were looking at Hadoop MapReduce vs. Spark, H2O, and
>>>> Flink: faster compute engines. So we jumped. Now the need is for
>>>> streaming, and especially incrementally updated streaming. Seems like
>>>> we need to address this.
>>>>
>>>> Andrew, regardless of the above, having TF-IDF would be super helpful -
>>>> row similarity for content/text would benefit greatly.
>>>
>>> I will put a PR up soon.
>>
>> Just to clarify, I'll be porting the (very simple) TF and TFIDF classes
>> and the Weight interface over from mr-legacy to math-scala. They're
>> available now in the spark-shell but won't be after this refactoring.
>> They still require a dictionary and a frequency-count map to vectorize
>> incoming text, so they're more for use with the old MR seq2sparse, and I
>> don't think they can be used with Spark's HashingTF and IDF. I'll put
>> them up soon. Hopefully they'll be of some use.
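P.S. For the TF-IDF re-weighting step itself, I am picturing something
along these lines once the term-count DRM exists - again only a rough
sketch (the tfidf name is hypothetical, and the inlined weighting formula
only approximates what the legacy TFIDF/Weight classes compute), assuming
mapBlock and drmBroadcast from math-scala:

import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.drm.RLikeDrmOps._
import scala.collection.JavaConversions._

// Hypothetical helper: re-weight a docs x terms count DRM with TF-IDF.
// `df` holds per-term document frequencies, indexed by column;
// `numDocs` is the corpus size.
def tfidf(drmTf: DrmLike[Int], df: Vector, numDocs: Long)
         (implicit ctx: DistributedContext): DrmLike[Int] = {
  val bcastDf = drmBroadcast(df)
  drmTf.mapBlock() { case (keys, block) =>
    val dfv = bcastDf.value
    // sqrt(tf) * log(N / df): roughly the classic weighting; swap in
    // Weight.calculate() once Andrew's port of those classes lands.
    for (r <- 0 until block.nrow; e <- block(r, ::).nonZeroes)
      e.set(math.sqrt(e.get) * math.log(numDocs / dfv(e.index)))
    keys -> block
  }
}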