Please keep in mind that the approach I am suggesting here is untried on
your kind of data.  I have used it in text and transactional streams, but
who knows if it will generalize.

Take your example of "a b a a c" and assume that "a a" and "a b" have been
observed to be interesting phrases.  We would then reduce your original
sequence to "ab aa c".  At that point, we assume that all interesting
temporality is captured in the ab and aa.  Almost by definition, we don't
need anything else because in previous analysis, other phrases seemed to
occur without predictive value.  Thus, we can switch to bag-of-terms format
at this point without loss of (important) information.  From there, we are
only concerned with the sequence x phrase occurrence matrix and can proceed
with SVD or random indexing or whatever else we want.
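
To make that concrete, here is a rough Python sketch of the tiling-then-counting
step.  The phrase set and the greedy left-to-right, longest-match tiling are
just assumptions for illustration, not anything from an existing library:

    # Rough sketch: tile a symbol sequence with known phrases, then count.
    # Greedy left-to-right, longest-match tiling; PHRASES is hypothetical.
    from collections import Counter

    PHRASES = {("a", "b"), ("a", "a")}     # phrases found interesting earlier
    MAX_LEN = 2

    def tile(seq):
        # reduce a symbol sequence to a list of phrase/symbol tokens
        tokens, i = [], 0
        while i < len(seq):
            for n in range(MAX_LEN, 1, -1):        # try longest phrase first
                if tuple(seq[i:i + n]) in PHRASES:
                    tokens.append("".join(seq[i:i + n]))
                    i += n
                    break
            else:                                  # no phrase matched here
                tokens.append(seq[i])
                i += 1
        return tokens

    def occurrence_matrix(sequences, vocab):
        # sequence x phrase count matrix (bag-of-terms), ready for SVD etc.
        return [[Counter(tile(s))[t] for t in vocab] for s in sequences]

    print(tile(["a", "b", "a", "a", "c"]))         # -> ['ab', 'aa', 'c']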

There is a serious question in this approach about what to do with
overlapping phrases.  My two answers are to either (a) keep both or (b) pick
the tiling that gives us the most interesting sequence of tokens.  Approach
(b) is basically the same algorithm that is used to segment Chinese.   See
http://technology.chtsai.org/mmseg/ for a great example.
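
Here is a toy sketch of option (b).  It scores a candidate tiling by the summed
"interestingness" of its phrases, which is only a crude stand-in for the MMSEG
rules at that link; the SCORES table is made up:

    # Option (b), sketched: choose the tiling that maximizes total phrase
    # score.  SCORES is hypothetical; in practice use whatever made the
    # phrases "interesting" in the first place (e.g. a log-likelihood ratio).
    SCORES = {("a", "b"): 3.0, ("a", "a"): 2.0}    # single symbols score 0

    def best_tiling(seq):
        # dynamic program over split points; best[i] covers seq[:i]
        best = [(0.0, [])]
        for i in range(1, len(seq) + 1):
            candidates = []
            for n in (1, 2):                       # phrase lengths considered
                if n > i:
                    continue
                piece = tuple(seq[i - n:i])
                if n == 1 or piece in SCORES:
                    prev_score, prev_toks = best[i - n]
                    candidates.append((prev_score + SCORES.get(piece, 0.0),
                                       prev_toks + ["".join(piece)]))
            best.append(max(candidates))
        return best[-1]

    print(best_tiling(["a", "b", "a", "a", "c"]))  # (5.0, ['ab', 'aa', 'c'])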

Another class of techniques entirely, which deals a bit better with the
problem of (slightly) longer-range dependencies, is some variant on random
indexing.  Here, you start with a random vector for each symbol and then
move each symbol vector a skosh in the direction of the symbols it cooccurs
with.  If you use separate forward and backward looking vectors then you get
strong directionality and sequencing.  The vector of a sequence is then just
the sum (perhaps weighted by rarity) of the trained vectors for each term.
The size of the cooccurrence window is an important parameter here.
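
A toy sketch of that kind of random indexing follows.  The dimension, step
size, window, and rarity weighting are all made-up parameters you would want
to tune; nothing here comes from any particular library:

    # Toy random indexing sketch: each symbol gets a fixed random "index"
    # vector plus a trainable context vector that is nudged toward the index
    # vectors of nearby symbols.  Separate forward/backward halves keep the
    # directionality.
    import math
    import random
    from collections import Counter

    DIM, STEP, WINDOW = 64, 0.1, 2     # made-up sizes; tune for real data
    random.seed(0)

    def train(sequences, alphabet):
        index = {s: [random.gauss(0, 1) for _ in range(DIM)]
                 for s in alphabet}
        # context vector has two halves: [forward part | backward part]
        ctx = {s: [0.0] * (2 * DIM) for s in alphabet}
        counts = Counter()
        for seq in sequences:
            counts.update(seq)
            for i, sym in enumerate(seq):
                lo, hi = max(0, i - WINDOW), min(len(seq), i + WINDOW + 1)
                for j in range(lo, hi):
                    if j == i:
                        continue
                    half = 0 if j > i else DIM   # forward vs backward half
                    for d in range(DIM):
                        ctx[sym][half + d] += STEP * index[seq[j]][d]
        return ctx, counts

    def sequence_vector(seq, ctx, counts):
        # sum of trained symbol vectors, weighted by rarity
        vec = [0.0] * (2 * DIM)
        for sym in seq:
            w = 1.0 / math.log(counts[sym] + math.e)
            for d in range(2 * DIM):
                vec[d] += w * ctx[sym][d]
        return vec

    seqs = [list("abaac"), list("aabac")]
    ctx, counts = train(seqs, set("abc"))
    print(sequence_vector(list("abaac"), ctx, counts)[:4])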

Surprisingly, these techniques have some pretty deep connections at the
mathematical level.  The first approach plus a random basis projection or
SVD is very comparable to the second approach, except that the second
handles dependencies that skip over intervening symbols more directly.

In any case, you lose some sequence information.  This is, in fact, exactly
what gives you a nice metric over similar sequences.

On Sat, Nov 21, 2009 at 10:00 PM, prasenjit mukherjee <
[email protected]> wrote:

> On Sun, Nov 22, 2009 at 9:54 AM, Ted Dunning <[email protected]>
> wrote:
> > Expressing your symbolic sequences by tiling these phrases gives you much
> > of the temporality that you are interested in and lets you use algorithms
> > like k-means pretty much directly.
>
> Approach sounds interesting.  Can you explain a bit about how you intend
> to represent a sequence as a vector here?  Assume the sequence is
> "a b a a c".  I was thinking of the following 2 approaches:
>
> If I use symbols as my basis and the coefficients as time-slices, then
> I would lose the information about recurring symbols (symbol a in my
> example), e.g. the vector representation of "a b a a c" would be
> 1(a) + 2(b) + 5(c) (problem: how to incorporate 3a, 4a).
>
> On the other hand, if I use time-slices as my basis and some mapping of
> terms as the coefficients, then my simple Euclidean measure won't make
> any sense.  e.g. let a->1, b->2, c->3; then the vector representation of
> "a b a a c" is 1(t1) + 2(t2) + 1(t3) + 1(t4) + 3(t5).
>
> -Prasen
>
> >
> > If you don't have symbolic sequences, you have another problem, but you
> > might get similar results by doing vector quantization on your continuous
> > time-series expressed in terms of multi-scale localized spectral
> > detectors.
> > Some problems work well with those techniques, some definitely need more
> > interesting feature detectors.  The spectral processing and vector
> > quantization are fairly natural tasks for map-reduce, which is nice.  In
> > fact, vector quantization is commonly done with some variant on k-means.
> >
>



-- 
Ted Dunning, CTO
DeepDyve
