On Sat, Nov 21, 2009 at 10:22 PM, Ted Dunning <[email protected]> wrote:

> Please keep in mind that the approach I am suggesting here is untried on
> your kind of data.  I have used it in text and transactional streams, but
> who knows if it will generalize.
>
> If we take your example of "a b a a c" and assume that "a a" and "a b"
> have been observed to be interesting phrases, we would reduce your original
> sequence to "ab aa c".  At that point, we assume that all interesting
> temporality is captured in the ab and aa.  Almost by definition, we don't
> need anything else, because in previous analysis other phrases seemed to
> occur without predictive value.  Thus, we can switch to bag-of-terms format
> at this point without loss of (important) information.  From there, we are
> only concerned with the sequence x phrase occurrence matrix and can proceed
> with SVD or random indexing or whatever else we want.
>
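
For concreteness, a minimal sketch of the reduction step Ted describes,
assuming the set of "interesting" bigrams has already been found by some
prior significance pass (plain Java; the class and method names here are
illustrative, not Mahout APIs):

  import java.util.*;

  public class PhraseBagger {
    // Collapse a token sequence onto known interesting bigrams, then drop
    // ordering and keep only phrase counts (bag-of-phrases).
    public static Map<String, Integer> toBagOfPhrases(List<String> tokens,
                                                      Set<String> bigrams) {
      Map<String, Integer> bag = new HashMap<>();
      for (int i = 0; i < tokens.size(); i++) {
        String unit;
        if (i + 1 < tokens.size()
            && bigrams.contains(tokens.get(i) + " " + tokens.get(i + 1))) {
          unit = tokens.get(i) + tokens.get(i + 1);  // e.g. "a b" -> "ab"
          i++;                                       // consume both tokens
        } else {
          unit = tokens.get(i);
        }
        bag.merge(unit, 1, Integer::sum);
      }
      return bag;
    }

    public static void main(String[] args) {
      List<String> seq = Arrays.asList("a", "b", "a", "a", "c");
      Set<String> phrases = new HashSet<>(Arrays.asList("a b", "a a"));
      System.out.println(toBagOfPhrases(seq, phrases));  // ab, aa, c: once each
    }
  }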

While I'll add the caveat that I also haven't done this for temporal data
(other than the case Ted is referring to, where text technically has some
temporal nature to it by virtue of its sequentiality), doing this kind of
thing with "significant ngrams" as Ted describes can let you keep arbitrarily
high-order correlations if you do it via a randomized SVD: instead of keeping
just the interesting bi-grams, keep *all* ngrams up to some fixed size (even
as large as 5, say), then do a random projection on your bag-of-ngrams to map
it down from the huge numUniqueSymbols^5-dimensional space to some reasonable
space that is still larger than you think is necessary (maybe 1k-10k), and
*then* do the SVD there.  (Technically the original space overflows
sizeof(long), so you are really wrapping around mod some big prime close to
2^64, but collisions will still be rare and will just act as noise.)
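
To make that hashing/projection step concrete, here is a rough sketch of the
hashing-trick variant (a cheap stand-in for an explicit random projection
matrix): every ngram of order <= 5 is hashed into a fixed-size vector with a
pseudo-random sign, so collisions just contribute noise.  The dimensions and
names are assumptions for illustration, not Mahout APIs:

  import java.util.*;

  public class HashedNgrams {
    static final int DIM = 4096;  // "reasonable space", 1k-10k as suggested
    static final int MAX_N = 5;   // keep all ngrams up to this order

    // One sequence of symbols -> one DIM-dimensional row; stack the rows
    // for all sequences and run the (randomized) SVD on that matrix.
    public static double[] project(List<String> symbols) {
      double[] v = new double[DIM];
      for (int n = 1; n <= MAX_N; n++) {
        for (int i = 0; i + n <= symbols.size(); i++) {
          String ngram = String.join(" ", symbols.subList(i, i + n));
          int h = ngram.hashCode();
          int bucket = Math.floorMod(h, DIM);
          double sign = ((h >>> 16) & 1) == 0 ? 1.0 : -1.0;  // pseudo-random sign
          v[bucket] += sign;  // an LLR/IDF weight could multiply this term
        }
      }
      return v;
    }
  }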

At that point, your similarities should be capturing a fair amount of the
higher-order time series effects.  I'm basically advocating Ted's (a), but
keeping even more information than you think is necessary, because why not:
if you're doing a randomized projection before the SVD, the only thing you
stand to lose is that you add too much noise.  That could be a problem, but
weighting by some combination of the log-likelihood of the subsequences and
an effective IDF should help (here I'm being pretty vague, I'll admit; this
weighting is probably pretty important if you keep as much information as
I'm advocating).
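
One plausible reading of that weighting (my own assumption, not a fixed
recipe): score each candidate subsequence with the G^2 log-likelihood ratio
of its 2x2 co-occurrence counts and damp ubiquitous phrases with an IDF-style
factor, e.g.:

  public class PhraseWeights {
    // k11: count(A followed by B); k12: count(A not followed by B);
    // k21: count(B not preceded by A); k22: everything else.
    public static double llr(long k11, long k12, long k21, long k22) {
      double n = k11 + k12 + k21 + k22;
      double r1 = k11 + k12, r2 = k21 + k22;
      double c1 = k11 + k21, c2 = k12 + k22;
      return 2.0 * (term(k11, r1 * c1 / n) + term(k12, r1 * c2 / n)
                  + term(k21, r2 * c1 / n) + term(k22, r2 * c2 / n));
    }

    private static double term(double k, double expected) {
      return k == 0 ? 0.0 : k * Math.log(k / expected);
    }

    // Combine significance of the phrase with how rare it is across sequences.
    public static double weight(long k11, long k12, long k21, long k22,
                                long numSequences, long sequencesWithPhrase) {
      double idf = Math.log((double) numSequences / (1.0 + sequencesWithPhrase));
      return llr(k11, k12, k21, k22) * Math.max(idf, 0.0);
    }
  }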

The machinery to do the above in parallel on "ridiculously big" data on
Hadoop should be coming in soon with some of the stuff I'm working on
contributing to Mahout.

  -jake
