moin Rob,

On 2011-10-01 03:54, Rob Freeman wrote:
> Bryan,
>
> I did a quick search for your papers. Seems you're mostly looking
> for "canonical" dimensions for texts. Perhaps you did some
> categories for individual words earlier(?)
Yup; the only publication that came out of it was in 2005:

  http://www.ling.uni-potsdam.de/~moocow/pubs/jurish2005hybrid_color.pdf

... the whole sparse pdl thing happened after that; it was intended to
be part of my dissertation, but I wound up getting funding for
something completely different (the canonical-forms-for-historical-text
stuff)... defense is Thursday ;-)

> I'm characterizing words (in terms of contexts, or texts), so kind
> of the inverse of characterizing texts.

Characterizing words by co-occurrence distributions was what I was
after, too -- the Big Idea was to induce (something like) syntactic
categories from the observed distributions.

> In particular what I'm doing which is new is trying to find a
> principled way to syntactically combine these word level
> representations, as vectors, without abstracting them into
> categories first.

Sounds very cool. I think I heard Katrin Erk talk about something
similar at ACL 2010, but I've never tried anything like it myself.

> In detail it means I'm trying to build context vectors for
> unobserved pairs of words, by substituting contexts for observed
> pairs between their respective elements.

Sounds a bit like a brand of smoothing, although I'm not really sure
what "substituting contexts for observed pairs between their
respective elements" means here; unless maybe it's a sort of
transitivity, e.g. you have {ab,cd,be,de} and to compare "a" and "c"
you compose the adjacency relation to get {a(b)e,c(d)e}, in order to
get a shared attribute ("*e") for "a" and "c"... I guess that would
explain your interest in the cross-product too, but I'm just
guessing...

> It relies linguistically on the assumption that words with common
> contexts tend to behave the same, so they'll have pairs which have
> the same contexts etc.

Yes, very Firthian (actually very Leibnizian, but who's counting?).
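For what it's worth, here's the "transitivity" guess above made concrete (sketched in Python rather than Perl, purely for illustration; the relation and helper names are my own invention): composing the adjacency relation {ab, cd, be, de} with itself gives "a" and "c" the shared second-order context "*e".

```python
# Hypothetical sketch of the composed-adjacency idea; not Rob's actual code.
pairs = {("a", "b"), ("c", "d"), ("b", "e"), ("d", "e")}

def successors(rel, w):
    # direct right-contexts of w under the adjacency relation
    return {y for (x, y) in rel if x == w}

def second_order(rel, w):
    # compose the relation with itself: successors of successors
    return {z for y in successors(rel, w) for z in successors(rel, y)}

# "a" and "c" share no direct context, but share the composed context "e":
print(second_order(pairs, "a") & second_order(pairs, "c"))  # {'e'}
```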
> Anyway, this means I'm always searching for pairs of words which
> occur in sequence in a text, so I can perform my pair search as a
> parallel search indexed by word position in the text. It's linear on
> text size, which is still a big win over exponential on vector
> size.

... in this context, I've got a pdl-ified corpus representation
floating about somewhere if you're interested; basically it relies on
a perl hash to map word types to integer ids and stores the corpus as
a flat vector (with sentence boundaries). It's pretty easy to get
co-occurrence statistics from that (see e.g. PDL::Ngrams), and they
wrap nicely into PDL::CCS::Nd.

> It sounds very similar to your linear parallel search on operands'
> index values. Your operands' index values must have similar meaning.
> There's an assumption that one sorting order is appropriate for all
> searches.

"Operands" in the sense of my last mail just meant "PDL::CCS::Nd
sparse n-d piddles passed as arguments to some arithmetic operation".
In my application, the "index values" of these pdls corresponded to
word types; so given e.g. {the->0, cat->1, is->2, on->3, mat->4}, the
sentence "the cat is on the mat" is the vector [0 1 2 3 0 4], and the
sort order is just lexicographic order on index vectors (so "the cat"
< "the mat"). That means you've got to specify either a full
index-vector to search ("the cat") or some prefix of the dimensions
("the *"); suffix patterns like "* cat" don't fit the sort order.
There is some support for re-sorting on a different dimension (via
reorder() etc.) in order to search for e.g. "* cat" -- I usually just
copied the sparse pdl's indices, slice()d them to reorder (making
"* cat" searchable as "cat *"), and re-sorted (qsortvec()) to do this
sort of thing. The module tries to do that implicitly if and when
required, but if you need to do it often, you're better off caching
the inverted matrices.
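The representation above can be sketched in a few lines (Python here rather than Perl/PDL, just to show the idea; names are my own): a type->id hash, the corpus as a flat id vector, and bigram lookup via a lexicographically sorted pair list, so that a full index-vector or a prefix ("the *") is a binary search, while "* cat" would need a re-sorted copy.

```python
# Hypothetical sketch of the flat-corpus + sorted-bigram search; not the
# actual PDL::Ngrams / PDL::CCS::Nd code.
import bisect

sentence = "the cat is on the mat".split()
ids = {}                                          # word type -> integer id
corpus = [ids.setdefault(w, len(ids)) for w in sentence]
# corpus == [0, 1, 2, 3, 0, 4]

bigrams = sorted(zip(corpus, corpus[1:]))         # lexicographic sort order

def find_prefix(first_id):
    """All bigrams beginning with first_id: O(log n) to locate the range."""
    lo = bisect.bisect_left(bigrams, (first_id,))
    hi = bisect.bisect_left(bigrams, (first_id + 1,))
    return bigrams[lo:hi]

print(find_prefix(ids["the"]))   # "the *" -> [(0, 1), (0, 4)]
```

Searching "* cat" under this scheme means building a second copy of `bigrams` sorted on the other dimension, which is the caching-the-inverted-matrices point above.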
You could use the same strategy to make an occurrence-list in a flat
corpus id-vector look like a single huge sparse pdl, with as many
dimensions as your most frequent word type has tokens in the corpus,
if that's what you're after -- only the 0th dimension (word type)
would be "interesting", and you'd get the list of tokens (as indices
into the flat corpus vector) for any word type in O(log(Types)),
which is substantially faster than a linear search over all tokens in
the corpus. You can actually get this particular operation down to
O(1) if you use the ptr() method to get a Harwell-Boeing pointer for
the dimension of interest.

> I'm not sure if I did better on size than O(Nnz**2). I'm pretty
> sure that working with arrays was just too big for me. My limited
> programming knowledge didn't reach to a way to use them sparsely.

I think maybe you misunderstood. The PDL::CCS::Nd arrays are
basically just the ($indices,$values) pairs I mentioned, plus some
flags and administrative data. Both $indices and $values are
themselves pdls; so e.g.

  use PDL;
  use PDL::CCS::Nd ':vars';
  my $sp = toccs( pdl( [[0,0,1],[2,0,0]] ) );
  print "indices=", $sp->[$WHICH], " values=", $sp->[$VALS];

prints:

  indices=
  [
   [0 1]
   [2 0]
  ] values=[2 1 0]

(the last element of $VALS is the "missing" value, here 0).

> I had to go to Perl hashes to get something sparse. I thought you'd
> done the same. Am I now understanding rightly that your CCS's are
> arrays?

Yes, they're arrays of pdls. These arrays are bless()ed into a perl
class which tries to behave like a pdl, but really isn't one.

> If your CCS's are a sparse implementation with arrays that might be
> a win for me. Would they be a lot smaller than Perl hashes, do you
> think?

Definitely, since you eliminate the overhead needed for gazillions of
perl scalars. And faster, too; at least in every case I've tested.
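The ptr() trick above is just a CSR-style offset array; here's a minimal sketch of it (Python rather than PDL, hypothetical names): with token positions sorted by word type, an offset array `ptr` of length Types+1 hands you each type's occurrence list in O(1), versus O(log Types) for binary search on the sorted list itself.

```python
# Hypothetical sketch of a Harwell-Boeing / CSR pointer over a flat corpus
# id-vector; not the actual PDL::CCS::Nd ptr() implementation.
corpus = [0, 1, 2, 3, 0, 4]            # "the cat is on the mat" as type ids
ntypes = max(corpus) + 1

# token positions, stably sorted by word type (positions stay in order)
occ = sorted(range(len(corpus)), key=lambda i: corpus[i])

# ptr[t] = first slot in occ holding a token of type t (counting sort pass)
ptr = [0] * (ntypes + 1)
for t in corpus:
    ptr[t + 1] += 1
for t in range(ntypes):
    ptr[t + 1] += ptr[t]

def tokens_of(t):
    # O(1): slice bounds come straight from the pointer array
    return occ[ptr[t]:ptr[t + 1]]

print(tokens_of(0))   # type 0 ("the") occurs at corpus positions [0, 4]
```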
marmosets,
  Bryan

> -Rob
>
> On Fri, Sep 30, 2011 at 4:50 PM, Bryan Jurish
> <[email protected]> wrote:
>> moin Rob,
>>
>> On 2011-09-29 14:16, Rob Freeman wrote:
>>> Hi Bryan,
>>>
>>> I haven't seen this appear on the list yet. Are you sure you sent
>>> from the address you are subscribed with? I've been caught by
>>> that one before elsewhere.
>>
>> I had begun to suspect something similar myself... now testing with
>> the proper address; thanks for the heads-up!
>>
>>> Thanks for the feedback. And David, Chris, and Mark too.
>>>
>>> Bryan, nice to find someone working on a similar problem. I'll
>>> check out your work. At first glance your ($indices,$values)
>>> pairs might be similar to how I have addressed the problem up to
>>> now. Not within PDL, but as a regular Perl hash.
>>>
>>> Your logarithmic search time sounds interesting. I get linear
>>> time searching for word pairs. Frankly I was glad to get it down
>>> from exponential time in the size of the vectors.
>>
>> Amen.
>>
>>> It may not be the same search we are talking about. I'm searching
>>> for pairs between every element of a vector, and every element of
>>> another.
>>
>> Ah, that's a bit trickier. I've written a helper routine
>> PDL::CCS::Ops::ccs_binop_align_mia() to do basically just this,
>> which I use to implement the standard binary arithmetic operations.
>> The major gotcha is that "missing" (read 'zero') values in either
>> operand do not get aligned at all (which is why matrix
>> multiplication rsp. dot-product are such a PITA); hence the "_mia"
>> suffix (= 'missing-is-annihilator'). It's worst-case quadratic
>> time, but for very sparse pdls (including my co-occurrence
>> matrices), it's much closer to linear (sorted parallel search of
>> operands' index vectors).
>> It requires a bit of twiddling on the perl end to ensure that all
>> vectors get aligned (block-wise iteration) because PDL needs to
>> know the size of the output pdls before they've been computed,
>> which is generally A Bad Thing for sparse matrix operations.
>>
>>> However I do it though, my memory usage just explodes when I try
>>> a cross product. Too many intermediate products being summed,
>>> nesting two or three levels of substitutions over 10k+
>>> dimensions. I need to keep all intermediate products in RAM at
>>> once or it is way too slow. But with it all in RAM the memory
>>> usage blows out.
>>
>> Yup. Cross-product is nasty even with sparse pdls. Occasionally
>> I had to drill open the CCS::Nd objects (really just perl arrays)
>> and work directly on the underlying ($indices,$values) pairs --
>> cross-product is pretty simple to define by direct manipulation of
>> $indices, although memory usage is still O(Nnz**2), where Nnz is
>> the number of non-zero values.
>>
>> marmosets,
>>   Bryan

-- 
Bryan Jurish                      "There is *always* one more bug."
[email protected]         -Lubarsky's Law of Cybernetic Entomology

_______________________________________________
Perldl mailing list
[email protected]
http://mailman.jach.hawaii.edu/mailman/listinfo/perldl
