Andrew, not sure what you mean about storing strings. If you mean something like a DRM of tokens, that is a DataFrame with row = doc, column = token. A one-row DataFrame is a slightly heavyweight string/document. A DataFrame with token counts would be perfect input to TF-IDF, no? It would be a vector that maintains the tokens as ids for the counts, right?
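For concreteness, a rough sketch of the structure being described, using plain Spark RDDs; the names and types here are illustrative only, not an existing Mahout API:

import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

// tokenized corpus: (docId, tokens) -> per-document token counts
def termCounts(docs: RDD[(Long, Seq[String])]): RDD[(Long, Map[String, Int])] =
  docs.mapValues(tokens => tokens.groupBy(identity).map { case (t, ts) => t -> ts.size })

With a token -> columnId dictionary, each count map becomes one sparse row of the doc x term matrix, i.e. the vector that keeps the tokens as ids for the counts.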
I agree seq2sparse-type input is a strong feature. Text files into an all-documents DataFrame, basically. Collocation?

On Feb 4, 2015, at 7:47 AM, Andrew Palumbo <[email protected]> wrote:

Just copied over the relevant last few messages to keep the other thread on topic...

On 02/03/2015 08:22 PM, Dmitriy Lyubimov wrote:
> I'd suggest to consider this: remember all this talk about language-integrated Spark SQL being basically a dataframe manipulation DSL?
> So now Spark devs are noticing this generality as well and are actually proposing to rename SchemaRDD to DataFrame and make it a mainstream data structure. (My "told you so" moment of sorts.)
> What I am getting at: I'd suggest we make DRM and Spark's newly renamed DataFrame our two major structures. In particular, standardize on using DataFrame for things that may include non-numerical data and require more grace about column naming and manipulation. Maybe relevant to TF-IDF work when it deals with non-matrix content.

Sounds like a worthy effort to me. We'd basically be implementing an API at the math-scala level for the SchemaRDD/DataFrame data structures, correct?

On Tue, Feb 3, 2015 at 5:01 PM, Pat Ferrel <[email protected]> wrote:
>> Seems like seq2sparse would be really easy to replace since it takes text files to start with, then the whole pipeline could be kept in RDDs. The dictionaries and counts could be either in-memory maps or RDDs for use with joins? This would get rid of sequence files completely from the pipeline. Item similarity uses in-memory maps but the plan is to make it more scalable using joins as an alternative with the same API, allowing the user to trade off footprint for speed.

I think you're right; it should be relatively easy. I've been looking at porting seq2sparse to the DSL for a bit now, and the stopper at the DSL level is that we don't have a distributed data structure for strings. Seems like getting a DataFrame implemented, as Dmitriy mentioned above, would take care of this problem. The other issue I'm a little fuzzy on is the distributed collocation mapping; it's a part of the seq2sparse code that I've not spent much time in. I think this would be a very worthy effort as well; I believe seq2sparse is a particularly strong Mahout feature. I'll start another thread since we're now way off topic from the refactoring proposal.

My use for TF-IDF is for row similarity and would take a DRM (actually an IndexedDataset) and calculate row/doc similarities. It works now, but only using LLR. This is OK when thinking of the items as tags or metadata, but for text tokens something like cosine may be better. I’d imagine a downsampling phase that would precede TF-IDF, using LLR a lot like how CF preferences are downsampled. This would produce a sparsified all-docs DRM. Then (if the counts were saved) TF-IDF would re-weight the terms before row similarity uses cosine. This is not so good for search, but it should produce much better similarities than Solr’s “moreLikeThis” and does it for all pairs rather than one at a time. In any case it can be used to create a personalized content-based recommender or to augment a CF recommender with one more indicator type.
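A minimal sketch of that last idea, assuming the Mahout math-scala (Samsara) DSL is available: once the all-docs DRM has been TF-IDF re-weighted, normalizing each row to unit length makes A %*% A.t the all-pairs cosine similarity matrix. The function name allPairsCosine is made up for illustration, and the LLR downsampling step is left out.

// Sketch only: standard Samsara DSL imports assumed.
import org.apache.mahout.math._
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.drm.RLikeDrmOps._

// docs: all-docs DRM already re-weighted with TF-IDF (rows = docs, cols = terms)
def allPairsCosine(docs: DrmLike[Int]): DrmLike[Int] = {
  // scale every row to unit L2 norm, block by block
  val normalized = docs.mapBlock(ncol = docs.ncol) { case (keys, block) =>
    for (r <- 0 until block.nrow) {
      val norm = block(r, ::).norm(2)
      if (norm > 0) block(r, ::) := block(r, ::) / norm
    }
    keys -> block
  }
  // for unit-length rows, row-by-row dot products are cosine similarities
  normalized %*% normalized.t
}

The transpose here is logical only; the DSL's optimizer is meant to handle the A %*% A.t form without materializing A.t, which ties into the A'A / AA' / A'B point in the quoted message that follows.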
On Feb 3, 2015, at 3:37 PM, Andrew Palumbo <[email protected]> wrote:

On 02/03/2015 12:44 PM, Andrew Palumbo wrote:
> On 02/03/2015 12:22 PM, Pat Ferrel wrote:
>> Some issues WRT lower-level Spark integration:
>> 1) Interoperability with Spark data. TF-IDF is one example I actually looked at. There may be other things we can pick up from their committers since they have an abundance.
>> 2) Wider acceptance of the Mahout DSL. The DSL's power was illustrated to me when someone on the Spark list asked about matrix transpose and an MLlib committer's answer was something like "why would you want to do that?". Usually you don't actually execute the transpose, but they don't even support A'A, AA', or A'B, which are core to what I work on. At present you pretty much have to choose between MLlib and Mahout for sparse matrix stuff. Maybe a half-way measure is some implicit conversions (ugh, I know). If the DSL could interchange datasets with MLlib, people would be pointed to the DSL for a bunch of "why would you want to do that?" features. MLlib seems to be algorithms, not math.
>> 3) Integration of streaming. DStreams support most of the RDD interface. Doing a batch recalc on a moving time window would nearly fall out of DStream-backed DRMs. This isn't the same as incremental updates on streaming, but it's a start.
>> Last year we were looking at faster compute engines than Hadoop MapReduce: Spark, H2O, Flink. So we jumped. Now the need is for streaming, and especially incrementally updated streaming. Seems like we need to address this.
>> Andrew, regardless of the above, having TF-IDF would be super helpful; row similarity for content/text would benefit greatly.
> I will put a PR up soon.

Just to clarify, I'll be porting the (very simple) TF and TFIDF classes and the Weight interface over from mr-legacy to math-scala. They're available now in spark-shell but won't be after this refactoring. These still require a dictionary and a frequency-count map to vectorize incoming text, so they're more for use with the old MR seq2sparse, and I don't think they can be used with Spark's HashingTF and IDF. I'll put them up soon. Hopefully they'll be of some use.
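For readers who haven't looked at those classes, here is a minimal sketch of the general shape of such a term-weighting API; the trait and the exact formula below are illustrative assumptions, not the mr-legacy Weight/TF/TFIDF code being ported.

// Illustrative only: not the actual mr-legacy Weight/TF/TFIDF classes.
trait Weight {
  // tf: term count in the document, df: number of docs containing the term,
  // numDocs: total number of docs in the corpus
  def calculate(tf: Int, df: Int, numDocs: Int): Double
}

object TF extends Weight {
  // raw term frequency; ignores document frequency entirely
  def calculate(tf: Int, df: Int, numDocs: Int): Double = tf.toDouble
}

object TFIDF extends Weight {
  // classic tf * idf; the +1 terms guard against division by zero
  def calculate(tf: Int, df: Int, numDocs: Int): Double =
    tf * math.log((numDocs + 1.0) / (df + 1.0))
}

Vectorizing a document then amounts to looking up each token's column id in the dictionary and its df in the frequency-count map, and writing calculate(tf, df, numDocs) into that column, which is why the dictionary and count maps are still required.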
