This has connections back to the theoretical literature on dynamical systems where the term "symbolic dynamics" is used to refer to the investigation of the sequences of symbols that can be produced by various complex systems depending on the quantization of the state space. The important take-away from that world is that essentially all of the important dynamical properties of a system can be described by using the appropriate quantization. Of course, figuring out what the appropriate quantization is can be a major problem, but for many systems, almost any non-trivial quantization is useful.
The use of a long n-grams also has analogies in state-space embedding of dynamical systems where you don't have access to the full state of a system but instead use repeated measurements to substitute for the the full state. Again, in many cases you can do as much with an embedded state-space as you could with the (unobservable) full state. A great example is the classic dripping faucet problem (Shaw's paper isn't available, but this follow-up is: http://metal.elte.hu/botond/pdf/CHA00059.pdf). One of the original papers on the subject of state-space reconstruction is available on-line at http://cos.cumt.edu.cn/jpkc/dxwl/zl/zl1/Physical%20Review%20Classics/statistical/132.pdf The point here is that n-gram techniques for time series are really not at all far-fetched. On Sun, Nov 22, 2009 at 9:07 AM, Jake Mannix <[email protected]> wrote: > While I'll add the caveat also that I haven't done this for temporal data > (other than the case which Ted is referring to, where text technically has > some temporal > nature to it by its sequentiality), doing this kind of thing with > "significant > ngrams" as Ted describes can allow you to arbitrarily keep higher order > correlations if you > do this in a randomized SVD: intead of just keeping the interesting > bi-grams, > keep *all* ngrams up to some fixed size (even as large as 5, say), then to > random > projection on your bag-of-ngrams to map it down from the huge > numUniqueSymbols^{5} dimensional space (well technically this probably > overflows sizeof(long), so you probably are wrapping around mod some big > prime close > to 2^64, but collisions will be still be rare and will just act as noise) > down > to some reasonable space still larger than you think is necessary (maybe > 1k-10k), > *then* do the SVD there. > -- Ted Dunning, CTO DeepDyve
