Have you looked at the full-text search (tsearch2, 
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/) that comes with 
Postgres 9.4? It's really powerful and fast.
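
Independent of the full-text route, here is a rough sketch of the 
one-row-per-token layout you describe. It is only an illustration: it uses 
SQLite from Python's standard library so it runs without a server, and the 
table names, column names and sample songs are all made up. The point is that 
if each token stores its line number and position, "previous" and "next" 
don't have to be stored at all; a self-join on pos + 1 finds adjacent words. 
The same schema should carry over to Postgres or MySQL, but I haven't measured 
how it performs on a real corpus.

```python
import sqlite3

# Hypothetical sample data: (artist, title, lines of lyrics).
SONGS = [
    ("Traditional", "Aunt Rhodie",
     ["go tell aunt rhodie", "the old grey goose is dead"]),
    ("Example Band", "Goose Song",
     ["the goose is old", "tell the old woman"]),
]


def build_db(songs):
    """Load songs into an in-memory database, one row per token."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE songs (id INTEGER PRIMARY KEY, artist TEXT, title TEXT)"
    )
    conn.execute(
        "CREATE TABLE tokens ("
        " song_id INTEGER, line_no INTEGER, pos INTEGER, word TEXT,"
        " PRIMARY KEY (song_id, line_no, pos))"
    )
    conn.execute("CREATE INDEX tokens_word ON tokens (word)")
    for song_id, (artist, title, lines) in enumerate(songs):
        conn.execute(
            "INSERT INTO songs VALUES (?, ?, ?)", (song_id, artist, title)
        )
        for line_no, line in enumerate(lines):
            for pos, word in enumerate(line.split()):
                conn.execute(
                    "INSERT INTO tokens VALUES (?, ?, ?, ?)",
                    (song_id, line_no, pos, word),
                )
    return conn


def songs_with_bigram(conn, first, second):
    """Titles of songs where `second` directly follows `first` on one line."""
    rows = conn.execute(
        """
        SELECT DISTINCT s.title
        FROM tokens a
        JOIN tokens b
          ON b.song_id = a.song_id
         AND b.line_no = a.line_no
         AND b.pos = a.pos + 1
        JOIN songs s ON s.id = a.song_id
        WHERE a.word = ? AND b.word = ?
        ORDER BY s.title
        """,
        (first, second),
    ).fetchall()
    return [title for (title,) in rows]


def word_count_by_artist(conn, word):
    """Map each artist to the number of times they use `word`."""
    rows = conn.execute(
        "SELECT s.artist, COUNT(*) FROM tokens t"
        " JOIN songs s ON s.id = t.song_id"
        " WHERE t.word = ? GROUP BY s.artist",
        (word,),
    ).fetchall()
    return dict(rows)


conn = build_db(SONGS)
print(songs_with_bigram(conn, "aunt", "rhodie"))
print(word_count_by_artist(conn, "goose"))
```

Note that in this sketch a phrase never matches across a line break, because 
the join requires both words to share a line_no. Longer phrases would need one 
extra join per word.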

On 03/06/2015 09:25 PM, Sam Raker wrote:
> I'm trying to create an n-gram[1] corpus out of song lyrics. I'm breaking 
> individual songs into lines, which are then split into words, so you end up 
> with something like
>
> {0 {0 "go" 1 "tell" 2 "aunt" 3 "rhodie"} 1 {0 "the" 1 "old" 2 "grey" 3 
> "goose" 4 "is" 5 "dead"}...}
>
> (Yes, maps with integer keys are kind of dumb; I thought about using vectors, 
> but this is all going into MongoDB temporarily, and I'd rather just deal with 
> maps instead of messing with Mongo's somewhat lacking array-handling stuff.)
>
> The idea, ultimately, is to build a front-end that would allow users to, 
> e.g., search for all songs that contain the (sub)string "aunt rhodie", or see 
> how many times The Rolling Stones use the word "woman" vs how many times the 
> Beatles do, etc. The inspiration comes largely from projects like COCA[2].
>
> I'm wondering if any of you have opinions about which database to use (Mongo 
> is most likely just a stopgap), and how best to architect it. I'm most 
> familiar with MySQL and Mongo, but I'd rather not be limited by just those 
> two if there's a better option out there. I'm thinking that I'll probably end 
> up storing tokens over types--e.g., each word would be stored individually, 
> as opposed to having an entry for, e.g., "the" that stores every instance of 
> the word "the." I was also thinking that I'll probably have to end up storing 
> each token's "previous" and "next", either as full references or just as 
> strings. This seems potentially inefficient, however.
>
> (I could've just gone to StackOverflow with this, but figured I'm more likely 
> to get a real answer here, because you all seem so smart and nice?)
>
> Thanks!
>
> [1] https://en.wikipedia.org/wiki/N-gram
> [2] http://corpus.byu.edu/coca/
> -- 
> You received this message because you are subscribed to the Google
> Groups "Clojure" group.
> To post to this group, send email to clojure@googlegroups.com
> Note that posts from new members are moderated - please be patient with your 
> first post.
> To unsubscribe from this group, send email to
> clojure+unsubscr...@googlegroups.com
> For more options, visit this group at
> http://groups.google.com/group/clojure?hl=en
> ---
> You received this message because you are subscribed to the Google Groups 
> "Clojure" group.
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to clojure+unsubscr...@googlegroups.com 
> <mailto:clojure+unsubscr...@googlegroups.com>.
> For more options, visit https://groups.google.com/d/optout.
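
For the nested line/word maps in the quoted message, the n-grams themselves 
can be pulled out with a sliding window over each line. This is a minimal 
sketch in Python rather than Clojure, and the function names are invented for 
illustration. It assumes the keys are integers as shown above; after a round 
trip through Mongo they may well come back as strings and would need 
converting before sorting.

```python
def ngrams(words, n):
    """Return every run of n consecutive words as a tuple."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def song_ngrams(song, n):
    """Collect n-grams line by line from {line_no: {pos: word}} maps.

    N-grams never span two lines, since each line is windowed separately.
    """
    result = []
    for line_no in sorted(song):
        line = song[line_no]
        # Rebuild the word order from the integer position keys.
        words = [line[pos] for pos in sorted(line)]
        result.extend(ngrams(words, n))
    return result


# The example structure from the quoted message, written as Python dicts.
song = {
    0: {0: "go", 1: "tell", 2: "aunt", 3: "rhodie"},
    1: {0: "the", 1: "old", 2: "grey", 3: "goose", 4: "is", 5: "dead"},
}

for gram in song_ngrams(song, 2):
    print(" ".join(gram))
```

A line shorter than n simply contributes no n-grams.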
