ApacheCon US 2005 Call for Participation

2005-07-14 Thread Erik Hatcher
Passing the word on. Lucene-related sessions are very welcome! Erik In case you missed the news, ApacheCon US 2005 has been scheduled for 10-14 December 2005 in San Diego, California! See http://ApacheCon.Com/2005/US/ And the Call for Participation is open, so if you want to be a speake

SIPs and CAPs

2005-07-14 Thread Erik Hatcher
Has anyone developed code to extract SIPs (statistically improbable phrases) and CAPs (capitalized phrases) from a Lucene index, as Amazon does with its books, as shown here?

Re: SIPs and CAPs

2005-07-14 Thread mark harwood
I've done this by comparing term frequency in a subset (in Amazon's case a single book) and looking for a significant "uplift" in term popularity versus its popularity in the general corpus. Practically speaking, in the Amazon case you can treat each page in the example book as a Lucene document, crea
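
The "uplift" comparison described above can be sketched roughly as follows. This is only an illustration in plain Python, not the poster's code and not the Lucene API: the function name `term_uplift`, the use of document frequency (pages containing a term) as the popularity measure, the add-one smoothing, and the `min_subset_df` cutoff are all assumptions made for the example.

```python
from collections import Counter

def term_uplift(subset_docs, corpus_docs, min_subset_df=2):
    """Rank terms by how much more common they are in the subset than in the corpus.

    Each document is a list of tokens. Popularity is measured here as the
    fraction of documents that contain the term (an assumption of this sketch).
    """
    # Document frequency: number of documents containing the term at least once.
    subset_df = Counter(t for doc in subset_docs for t in set(doc))
    corpus_df = Counter(t for doc in corpus_docs for t in set(doc))
    n_subset, n_corpus = len(subset_docs), len(corpus_docs)

    scores = {}
    for term, df in subset_df.items():
        if df < min_subset_df:
            continue  # ignore terms too rare in the subset to be meaningful
        subset_rate = df / n_subset
        # Add-one smoothing (assumed) avoids dividing by zero for unseen terms.
        corpus_rate = (corpus_df[term] + 1) / (n_corpus + 1)
        scores[term] = subset_rate / corpus_rate
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical data: three "pages" of one book, plus four other corpus documents.
book_pages = [
    "the flux capacitor hums".split(),
    "charge the flux capacitor".split(),
    "the car starts".split(),
]
corpus = book_pages + [
    "the cat sat".split(),
    "the dog ran".split(),
    "the car stopped".split(),
    "the rain fell".split(),
]
ranked = term_uplift(book_pages, corpus)
```

In this toy data, "flux" and "capacitor" score above "the", because "the" is just as common in the wider corpus as it is in the book.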

Re: SIPs and CAPs

2005-07-14 Thread Erik Hatcher
On Jul 14, 2005, at 7:17 AM, mark harwood wrote: I've done this by comparing term frequency in a subset (in Amazon's case a single book) and looking for a significant "uplift" in term popularity vs that of the general corpus popularity. Practically speaking, in the amazon case you can treat each

Re: SIPs and CAPs

2005-07-14 Thread mark harwood
> Do you just do this with terms or do you also > extract phrases? The scheme involves these phases: 1) Identify top terms (using algo described) 2) Identify all term "runs" in original text. 3) Identify sensible phrases from large list of term runs 4) Provide shortlist of top scoring terms AND
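
Steps 2 and 3 of the scheme above might look something like the sketch below. It assumes the top terms from step 1 are already known, treats a "run" as a maximal sequence of consecutive top terms, and uses a simple repeat count as a stand-in for whatever "sensible phrase" test the poster actually used. The names `term_runs` and `candidate_phrases` and the thresholds are hypothetical.

```python
from collections import Counter

def term_runs(tokens, top_terms, min_len=2):
    """Yield maximal runs of consecutive tokens that are all in top_terms."""
    run = []
    for tok in tokens:
        if tok in top_terms:
            run.append(tok)
        else:
            if len(run) >= min_len:
                yield tuple(run)
            run = []
    # Emit a run that reaches the end of the text.
    if len(run) >= min_len:
        yield tuple(run)

def candidate_phrases(pages, top_terms, min_count=2):
    """Count runs across all pages and keep those seen at least min_count times."""
    counts = Counter(run for page in pages for run in term_runs(page, top_terms))
    return [(" ".join(run), n) for run, n in counts.most_common() if n >= min_count]

# Hypothetical pages and top terms.
pages = [
    "the flux capacitor hums".split(),
    "charge the flux capacitor now".split(),
    "a capacitor alone".split(),
]
phrases = candidate_phrases(pages, {"flux", "capacitor"})
```

Single top terms such as the lone "capacitor" on the third page are skipped here, since a run must contain at least `min_len` terms to count as a phrase candidate.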

RE: Best Practices for Distributing Lucene Indexing and Searching

2005-07-14 Thread Peter Gelderbloem
I am currently looking into building a similar system and came across this architecture: http://www.eecs.harvard.edu/~mdw/proj/seda/ I am just reading up on it now. Does anyone have experience building a lucene system based on this architecture? Any advice would be greatly appreciated. Peter Geld

Re: Best Practices for Distributing Lucene Indexing and Searching

2005-07-14 Thread Paul Smith
I had a crack at whipping up something along these lines during a one-day hackathon we held here at work, using ActiveMQ as the bus between the 'co-ordinator' (Queen bee) and the 'worker' bees. The index work was segmented as jobs on a work queue, and the workers feed the relatively small inde

Re: Best Practices for Distributing Lucene Indexing and Searching

2005-07-14 Thread Otis Gospodnetic
Interesting. I'm planning on doing something similar for some new Simpy features. Why are your worker bees sending whole indices to the Queen bee? Wouldn't it be easier to send in Documents and have the Queen index them in the same index? Maybe you need those individual, smaller indices to be s

Re: Best Practices for Distributing Lucene Indexing and Searching

2005-07-14 Thread Paul Smith
My punt was that having workers create sub-indexes (creating the documents and making a partial index) and shipping the partial index back to the queen to merge may be more efficient. It's probably not, I was just using the day as a chance to see if it looked promising, and get my hands dir
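
The worker/queen arrangement described above can be illustrated with a small in-memory sketch. It is not the poster's code and uses neither Lucene nor ActiveMQ: a thread pool stands in for the message bus, a dict of sets stands in for an index, and the names `build_partial_index` and `merge_partials` are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def build_partial_index(job):
    """Worker: turn one batch of (doc_id, text) pairs into a small inverted index."""
    partial = defaultdict(set)
    for doc_id, text in job:
        for term in text.lower().split():
            partial[term].add(doc_id)
    return partial

def merge_partials(partials):
    """Coordinator ("queen"): fold all partial indexes into one."""
    merged = defaultdict(set)
    for partial in partials:
        for term, doc_ids in partial.items():
            merged[term] |= doc_ids
    return merged

# Hypothetical jobs taken from a work queue; each job is a batch of documents.
jobs = [
    [(1, "Lucene indexes text"), (2, "Workers build indexes")],
    [(3, "The queen merges indexes")],
]
with ThreadPoolExecutor(max_workers=2) as pool:
    partials = list(pool.map(build_partial_index, jobs))
index = merge_partials(partials)
```

In this sketch the queen merges once, after all partial indexes have arrived, rather than once per arriving partial index.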

Re: Best Practices for Distributing Lucene Indexing and Searching

2005-07-14 Thread Erik Hatcher
Paul - it sounds an awful lot like my (perhaps incorrect) understanding of the MapReduce capability of Nutch that is under development. Perhaps the work that Doug and others have done there is applicable to your situation. Erik On Jul 14, 2005, at 7:38 PM, Paul Smith wrote: My punt was

Re: Best Practices for Distributing Lucene Indexing and Searching

2005-07-14 Thread Paul Smith
Cl, I should go have a look at that.. That begs another question though, where does Nutch stand in terms of the ASF? Did I read (or dream) that Nutch may be coming in under ASF? I guess I should get myself subscribed to the Nutch mailing lists. thanks Erik. Paul On 15/07/2005, at 11

Re: Best Practices for Distributing Lucene Indexing and Searching

2005-07-14 Thread Paul Smith
answering my own question: nutch.org -> lucene.apache.org/nutch/ Excellent! Paul On 15/07/2005, at 11:45 AM, Paul Smith wrote: Cl, I should go have a look at that.. That begs another question though, where does Nutch stand in terms of the ASF? Did I read (or dream) that Nutch may be c

Re: Best Practices for Distributing Lucene Indexing and Searching

2005-07-14 Thread Erik Hatcher
On Jul 14, 2005, at 9:45 PM, Paul Smith wrote: Cl, I should go have a look at that.. That begs another question though, where does Nutch stand in terms of the ASF? Did I read (or dream) that Nutch may be coming in under ASF? I guess I should get myself subscribed to the Nutch mailing

Re: Best Practices for Distributing Lucene Indexing and Searching

2005-07-14 Thread Otis Gospodnetic
The problem that I saw (from your email only) with the "ship the full little index to the Queen" approach is that, from what I understand, you eventually do addIndexes(Directory[]) in there, and as this optimizes things in the end, this means your whole index gets re-written to disk after each such

Re: Best Practices for Distributing Lucene Indexing and Searching

2005-07-14 Thread Paul Smith
On 15/07/2005, at 3:57 PM, Otis Gospodnetic wrote: The problem that I saw (from your email only) with the "ship the full little index to the Queen" approach is that, from what I understand, you eventually do addIndexes(Directory[]) in there, and as this optimizes things in the end, this means y