Hey Drew,

Let me know when you post a JIRA, and whether there's any help you might want.
On Fri, Jan 8, 2010 at 5:03 PM, Drew Farris <[email protected]> wrote:
> Jake, thanks for the review, running narrative and comments. The
> Analyzer in use should be up to the user, so there will be flexibility
> to mess around with lots of alternatives there, but it will be nice to
> provide reasonable defaults and include this sort of discussion in the
> wiki page for the algo. I'll finish up the rest of the code for it and
> post a patch to JIRA.
>
> Robin, I'll take a look at the dictionaryVectorizer and see how they
> can work together. I think something like SequenceFiles<documentId,
> Text or BytesWritable> makes sense as input for this job, and it's
> probably easier to work with than what I had to whip up to slurp in
> files whole.
>
> Does anyone know if there is a stream-based alternative to Text or
> BytesWritable?
>
> On Thu, Jan 7, 2010 at 11:46 PM, Jake Mannix <[email protected]> wrote:
> > Ok, I lied - I think what you described here is way *faster* than what I
> > was doing, because I wasn't starting with the original corpus; I had
> > something like Google's ngram terabyte data (a massive HDFS file with
> > just "ngram ngram-frequency" on each line), which meant I had to do
> > a multi-way join (which is where I needed to do a secondary sort by
> > value).
> >
> > Starting with the corpus itself (the case we're talking about) you have
> > some nice tricks in here:
> >
> > On Thu, Jan 7, 2010 at 6:46 PM, Drew Farris <[email protected]> wrote:
> >>
> >> The output of that map task is something like:
> >>
> >> k:(n-1)gram v:ngram
> >>
> >
> > This is great right here - it helps you kill two birds with one stone: the
> > join and the wordcount phases.
> >
> >> k:ngram,ngram-frequency v:(n-1)gram,(n-1)gram-frequency
> >>
> >> e.g.:
> >> k:the best,1 v:best,2
> >> k:best of,1 v:best,2
> >> k:best of,1 v:of,2
> >> k:of times,1 v:of,2
> >> k:the best,1 v:the,1
> >> k:of times,1 v:times,1
> >>
> >
> > Yeah, once you're here, you're home free. This should be really a rather
> > quick set of jobs, even on really big data, and even dealing with it as
> > text.
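[To make the flow above concrete, here is a rough plain-Java walkthrough of that
first pass over "the best of times" - no Hadoop or Lucene involved, and the class
and variable names are illustrative only, not from Drew's patch. It reproduces
the k/v pairs in the example above (output order may vary):]

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NGramPassSketch {

  public static void main(String[] args) {
    String[] tokens = {"the", "best", "of", "times"};
    int n = 2; // bigrams; the emit calls below assume n == 2

    // Map phase: for each ngram, emit k:(n-1)gram v:ngram
    // (for bigrams, the two unigram "subgrams" of each bigram).
    Map<String, List<String>> shuffled = new HashMap<String, List<String>>();
    for (int i = 0; i + n <= tokens.length; i++) {
      String ngram = tokens[i] + " " + tokens[i + 1];
      emit(shuffled, tokens[i], ngram);     // leading (n-1)gram
      emit(shuffled, tokens[i + 1], ngram); // trailing (n-1)gram
    }

    // Reduce phase: the value count for a key is that (n-1)gram's frequency,
    // and counting duplicate values gives each ngram's frequency -- the join
    // and the wordcount in one step.
    for (Map.Entry<String, List<String>> e : shuffled.entrySet()) {
      String subgram = e.getKey();
      int subgramFreq = e.getValue().size();
      Map<String, Integer> ngramFreqs = new HashMap<String, Integer>();
      for (String ngram : e.getValue()) {
        Integer c = ngramFreqs.get(ngram);
        ngramFreqs.put(ngram, c == null ? 1 : c + 1);
      }
      for (Map.Entry<String, Integer> ng : ngramFreqs.entrySet()) {
        // k:ngram,ngram-frequency  v:(n-1)gram,(n-1)gram-frequency
        System.out.println("k:" + ng.getKey() + "," + ng.getValue()
            + " v:" + subgram + "," + subgramFreq);
      }
    }
  }

  private static void emit(Map<String, List<String>> sink, String key, String value) {
    List<String> values = sink.get(key);
    if (values == null) {
      values = new ArrayList<String>();
      sink.put(key, values);
    }
    values.add(value);
  }
}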
> >> I'm also wondering about the best way to handle input. Line-by-line
> >> processing would miss ngrams spanning lines, but full-document
> >> processing with the StandardAnalyzer+ShingleFilter will form ngrams
> >> across sentence boundaries.
> >>
> >
> > These effects are just minor issues: you lose a little bit of signal on
> > line endings, and you pick up some noise catching ngrams across
> > sentence boundaries, but it's fractional compared to your whole set.
> > Don't try to be too fancy and cram tons of lines together. If your
> > data comes in different chunks than just one huge HDFS text file, you
> > could certainly chunk it into bigger pieces (10, 100, 1000 lines, maybe)
> > to reduce the newline error if necessary, but it's probably not needed.
> > The sentence-boundary part gets washed out in the LLR step anyway
> > (because those ngrams will almost always turn out to have a low LLR score).
> >
> > What I've found I've had to do sometimes is something with stop words.
> > If you don't use stop words at all, you end up getting a lot of relatively
> > high-LLR-scoring ngrams like "up into", "he would", and in general pairings
> > of a relatively rare unigram with a pronoun or preposition. Maybe there are
> > other ways of avoiding that, but I've found that you do need to take some
> > care with the stop words (but removing them altogether leads to some
> > weird-looking ngrams if you want to display them somewhere).
> >
> >> I'm interested in whether there's a more efficient way to structure
> >> the M/R passes. It feels a little funny to no-op a whole map cycle. It
> >> would almost be better if one could chain two reduces together.
> >>
> >
> > Beware premature optimization - try this on a nice big monster set on
> > a real cluster, and see how long it takes. I have a feeling you'll be
> > pleasantly surprised. But even before that - show us a patch; maybe
> > someone will have easy low-hanging-fruit optimization tricks.
> >
> > -jake
>
>

--
Zaki Rahaman
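[For reference, the LLR score discussed throughout the thread is Dunning's G^2
statistic over a 2x2 contingency table per ngram. Mahout ships its own
implementation, so treat the following as a minimal sketch of the arithmetic
rather than the code the patch would use; the class name and the counts in
main() are hypothetical:]

public class LlrSketch {

  public static void main(String[] args) {
    // Hypothetical contingency counts for a bigram "A B":
    // k11 = count(A B), k12 = count(A, not B),
    // k21 = count(not A, B), k22 = count(not A, not B).
    System.out.println(logLikelihoodRatio(110, 2442, 111, 97347));
  }

  static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
    double rowEntropy = entropy(k11 + k12, k21 + k22);
    double columnEntropy = entropy(k11 + k21, k12 + k22);
    double matrixEntropy = entropy(k11, k12, k21, k22);
    // Guard against tiny negative results from floating-point rounding.
    if (rowEntropy + columnEntropy < matrixEntropy) {
      return 0.0;
    }
    return 2.0 * (rowEntropy + columnEntropy - matrixEntropy);
  }

  // Unnormalized "x*log(x)" entropy as used in Dunning's formulation,
  // not Shannon entropy in bits.
  private static double entropy(long... counts) {
    long sum = 0;
    double xLogXSum = 0.0;
    for (long x : counts) {
      xLogXSum += xLogX(x);
      sum += x;
    }
    return xLogX(sum) - xLogXSum;
  }

  private static double xLogX(long x) {
    return x == 0 ? 0.0 : x * Math.log(x);
  }
}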
