Have you looked at the full-text search that comes with Postgres 9.4 (http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/)? It's really powerful and fast.
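Whichever database you end up with, the n-gram extraction itself is small in Clojure. Here is a minimal sketch, assuming the nested-map shape from the quoted message below. The names song, line-words, ngrams, song-ngrams and contains-phrase? are hypothetical, and I'm assuming n-grams should not cross line boundaries:

```clojure
(require '[clojure.string :as str])

;; Sample data in the shape described in the quoted message:
;; line index -> {word index -> word}.
(def song
  {0 {0 "go" 1 "tell" 2 "aunt" 3 "rhodie"}
   1 {0 "the" 1 "old" 2 "grey" 3 "goose" 4 "is" 5 "dead"}})

(defn line-words
  "Returns the words of one line as a vector, ordered by word index."
  [line]
  (mapv val (sort-by key line)))

(defn ngrams
  "Returns every run of n consecutive words as a vector."
  [n words]
  (mapv vec (partition n 1 words)))

(defn song-ngrams
  "Returns the n-grams of every line, with lines taken in index order.
  N-grams never span two lines (an assumption, not a requirement)."
  [n song]
  (mapcat #(ngrams n (line-words %))
          (map val (sort-by key song))))

(defn contains-phrase?
  "True if the whitespace-separated phrase occurs within a single line."
  [song phrase]
  (let [target (str/split phrase #"\s+")]
    (boolean (some #{target} (song-ngrams (count target) song)))))
```

With this, (contains-phrase? song "aunt rhodie") answers the substring question for one song, and (frequencies (song-ngrams 1 song)) gives per-word counts that could be compared between artists. It says nothing about how to store the result; it is only meant to show that "previous" and "next" can be derived from the ordered words rather than stored on each token.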
On 03/06/2015 09:25 PM, Sam Raker wrote:
> I'm trying to create an n-gram[1] corpus out of song lyrics. I'm breaking individual songs into lines, which are then split into words, so you end up with something like
>
> {0 {0 "go" 1 "tell" 2 "aunt" 3 "rhodie"} 1 {0 "the" 1 "old" 2 "grey" 3 "goose" 4 "is" 5 "dead"}...}
>
> (Yes, maps with integer keys is kind of dumb; I thought about using vectors, but this is all going into MongoDB temporarily, and I'd rather just deal with maps instead of messing with Mongo's somewhat lacking array-handling stuff.)
>
> The idea, ultimately, is to build a front-end that would allow users to, e.g., search for all songs that contain the (sub)string "aunt rhodie", or see how many times The Rolling Stones use the word "woman" vs how many times the Beatles do, etc. The inspiration comes largely from projects like COCA[2].
>
> I'm wondering if any of you have opinions about which database to use (Mongo is most likely just a stopgap), and how best to architect it. I'm most familiar with MySQL and Mongo, but I'd rather not be limited by just those two if there's a better option out there. I'm thinking that I'll probably end up storing tokens over types--e.g., each word would be stored individually, as opposed to having an entry for, e.g., "the" that stores every instance of the word "the." I was also thinking that I'll probably have to end up storing each token's "previous" and "next", either as full references or just as strings. This seems potentially inefficient, however.
>
> (I could've just gone to StackOverflow with this, but figured I'm more likely to get a real answer here, because you all seem so smart and nice?)
>
> Thanks!
>
> [1] https://en.wikipedia.org/wiki/N-gram
> [2] http://corpus.byu.edu/coca/
> --
> You received this message because you are subscribed to the Google Groups "Clojure" group.
> To post to this group, send email to clojure@googlegroups.com
> Note that posts from new members are moderated - please be patient with your first post.
> To unsubscribe from this group, send email to clojure+unsubscr...@googlegroups.com
> For more options, visit this group at http://groups.google.com/group/clojure?hl=en