Re: mahout examples

Ted Dunning Sun, 22 Nov 2009 10:59:43 -0800

This has connections back to the theoretical literature on dynamical systems
where the term "symbolic dynamics" is used to refer to the investigation of
the sequences of symbols that can be produced by various complex systems
depending on the quantization of the state space.  The important take-away
from that world is that essentially all of the important dynamical
properties of a system can be described by using the appropriate
quantization.  Of course, figuring out what the appropriate quantization is
can be a major problem, but for many systems, almost any non-trivial
quantization is useful.

The use of a long n-grams also has analogies in state-space embedding of
dynamical systems where you don't have access to the full state of a system
but instead use repeated measurements to substitute for the the full state.
Again, in many cases you can do as much with an embedded state-space as you
could with the (unobservable) full state.  A great example is the classic
dripping faucet problem (Shaw's paper isn't available, but this follow-up
is: http://metal.elte.hu/botond/pdf/CHA00059.pdf).  One of the original
papers on the subject of state-space reconstruction is available on-line at
http://cos.cumt.edu.cn/jpkc/dxwl/zl/zl1/Physical%20Review%20Classics/statistical/132.pdf

The point here is that n-gram techniques for time series are really not at
all far-fetched.

On Sun, Nov 22, 2009 at 9:07 AM, Jake Mannix <[email protected]> wrote:

> While I'll add the caveat also that I haven't done this for temporal data
> (other than the case which Ted is referring to, where text technically has
> some temporal
> nature to it by its sequentiality), doing this kind of thing with
> "significant
> ngrams" as Ted describes can allow you to arbitrarily keep higher order
> correlations if you
> do this in a randomized SVD: intead of just keeping the interesting
> bi-grams,
> keep *all* ngrams up to some fixed size (even as large as 5, say), then to
> random
> projection on your bag-of-ngrams to map it down from the huge
> numUniqueSymbols^{5} dimensional space (well technically this probably
> overflows sizeof(long), so you probably are wrapping around mod some big
> prime close
> to 2^64, but collisions will be still be rare and will just act as noise)
> down
> to some reasonable space still larger than you think is necessary (maybe
> 1k-10k),
> *then* do the SVD there.
>

-- 
Ted Dunning, CTO
DeepDyve

Re: mahout examples

Reply via email to