Re: Concurrency and multiple merge threads
Sounds like a nice machine!

It's frustrating that RAMFile even has any sync'd methods... Lucene is write once, so once a RAMFile is written we don't need any sync to read it.

Maybe on creating a RAMInputStream we could make a new ReadOnlyRAMFile, holding the same buffers without sync.

That said the ops inside the sync are tiny so it's strange if this really is the cause of the contention... It could just be a profiling ghost and something else is the real bottleneck...

Mike

On Feb 18, 2012, at 9:21 PM, Benson Margulies wrote:

> Using Lucene 3.5.0, on a 32-core machine, I have coded something shaped like:
>
> make a writer on a RAMDirectory.
>
> start:
>
> Create a near-real-time searcher from it.
>
> farm work out to multiple threads, each of which performs a search
> and retrieves some docs.
>
> When all are done, write some new docs.
>
> back to start.
>
> The returns of adding threads diminish faster than I would like.
> According to YourKit, a major contribution when I try 16 is conflict
> on the RAMFile monitor.
>
> The conflict shows five Lucene Merge Threads holding the monitor, plus
> my own threads. I'm not sure that I'm interpreting this correctly;
> perhaps there were five different occasions when a merge thread
> blocked my threads.
>
> In any case, I'm fairly stumped as to how my threads manage to
> materially block each other, since the synchronized methods used on
> the search side in RAMFile are pretty tiny.
>
> YourKit claims that the problem is in RAMFile.numBuffers, but I have
> not been able to catch this being called in a search.
>
> I did spot the following backtrace.
>
> In any case, I'd be grateful if anyone could tell me if this is a
> familiar story or one for which there's a solution.
>
> RAMFile.getBuffer(int) line: 75
> RAMInputStream.switchCurrentBuffer(boolean) line: 107
> RAMInputStream.seek(long) line: 144
> SegmentNorms.bytes() line: 163
> SegmentNorms.bytes() line: 143
> ReadOnlySegmentReader(SegmentReader).norms(String) line: 599
> TermQuery$TermWeight.scorer(IndexReader, boolean, boolean) line: 107
> BooleanQuery$BooleanWeight.scorer(IndexReader, boolean, boolean) line: 298
> IndexSearcher.search(Weight, Filter, Collector) line: 577
> IndexSearcher.search(Weight, Filter, int, Sort, boolean) line: 517
> IndexSearcher.search(Weight, Filter, int, Sort) line: 487
> IndexSearcher.search(Query, Filter, int, Sort) line: 400

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: How to separate one index into multiple?
I think you could do the following, taking a split into 3 indexes as an example.
Copy the index 3 times.
For copy 1:
for(int i=0;i
Re: How to separate one index into multiple?
You can delete by query, like -category:category1

On Sun, Feb 19, 2012 at 9:41 PM, Li Li wrote:

> I think you could do as follows. taking splitting it to 3 indexes for
> example.
> you can copy the index 3 times.
> for copy 1
> for(int i=0;i reader1.delete(i);
> }
> for copy
> for(int i=1;i reader2.delete(i);
> }
>
> and then optimize these 3 indexes
Hanging with fixed thread pool in the IndexSearcher multithread code
3.5.0: I passed a fixed size executor service with one thread, and then with two threads, to the IndexSearcher constructor.

It hung.

With three threads, it didn't work, but I got different results than when I don't pass in an executor service at all.

Is this expected? Should the javadoc say something? (I can make a patch).
Counting all the hits with parallel searching
If I have a lot of segments, and an executor service in my searcher, the following runs out of memory instantly, building giant heaps. Is there another way to express this? Should I file a JIRA that the parallel code should have some graceful behavior?

int longestMentionFreq = searcher.search(longestMentionQuery, filter, Integer.MAX_VALUE).totalHits + 1;
Re: Counting all the hits with parallel searching
On Sun, Feb 19, 2012 at 9:21 AM, Benson Margulies wrote:
> If I have a lot of segments, and an executor service in my searcher,
> the following runs out of memory instantly, building giant heaps. Is
> there another way to express this? Should I file a JIRA that the
> parallel code should have some graceful behavior?
>
> int longestMentionFreq = searcher.search(longestMentionQuery, filter,
> Integer.MAX_VALUE).totalHits + 1;
>

The _n_ you pass there is the actual number of results that you need to display to the user, in top-N order, so in most cases this should be something like 20.

This is because it builds a priority queue of size _n_ to return results in sorted order.

Don't pass huge numbers here: if you are not actually returning pages of results to the user, but just counting hits, then pass TotalHitCountCollector.

--
lucidimagination.com
RE: Counting all the hits with parallel searching
By passing Integer.MAX_VALUE you are asking Lucene to allocate a priority queue of that size for collecting results, and that allocation is what runs out of memory. With TopDocs, the idea is to fetch only a limited number of top-ranking documents to display as search results. The user is not interested in the 2 millionth result page, so pass a small number of top hits.

To simply count all hits, as you seem to be doing, there is a separate collector available: http://goo.gl/XsPVR

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de
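The difference between the two approaches in this reply can be sketched in plain Java. This is only an illustration under stated assumptions: `HitCountSketch`, `topN` and `countHits` are hypothetical names, not Lucene classes, and the scores are made up. Unlike the up-front allocation described above, `java.util.PriorityQueue` grows lazily, so the sketch shows only how much state each approach has to retain, not the allocation itself.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical stand-ins for illustration only; these are not Lucene classes.
class HitCountSketch {

    /** Keeps only the n best scores, so retained state grows with n. */
    static List<Float> topN(float[] scores, int n) {
        List<Float> best = new ArrayList<>();
        if (n <= 0) {
            return best;
        }
        // Min-heap: the weakest retained score sits at the head.
        PriorityQueue<Float> queue = new PriorityQueue<>();
        for (float score : scores) {
            if (queue.size() < n) {
                queue.add(score);
            } else if (score > queue.peek()) {
                queue.poll();      // evict the weakest retained score
                queue.add(score);
            }
        }
        best.addAll(queue);
        best.sort(Collections.reverseOrder()); // best score first
        return best;
    }

    /** Counts hits without retaining any of them: one int of state. */
    static int countHits(float[] scores) {
        int total = 0;
        for (float ignored : scores) {
            total++;
        }
        return total;
    }

    public static void main(String[] args) {
        float[] scores = {0.2f, 0.9f, 0.5f, 0.7f, 0.1f};
        System.out.println("top 2: " + topN(scores, 2));
        System.out.println("total: " + countHits(scores));
    }
}
```

In Lucene itself the counting side corresponds to the TotalHitCountCollector mentioned in this thread, passed to the search in place of a hit count.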
Re: Counting all the hits with parallel searching
thanks, that's what I needed.

On Feb 19, 2012, at 9:51 AM, Robert Muir wrote:

> the _n_ you pass there is the actual number of results that you need
> to display to the user, in top-N order.
> so in most cases this should be something like 20.
>
> This is because it builds a priority queue of size _n_ to return
> results in sorted order.
>
> Don't pass huge numbers here: if you are not actually returning pages
> of results to the user, but just counting hits, then pass
> TotalHitCountCollector.
>
> --
> lucidimagination.com
Re: Counting all the hits with parallel searching
On Sun, Feb 19, 2012 at 10:23 AM, Benson Margulies wrote:
> thanks, that's what I needed.
>

Thanks for bringing this up. I think it's a common issue, so I created https://issues.apache.org/jira/browse/LUCENE-3799 to hopefully improve the docs situation.

--
lucidimagination.com
Re: Hanging with fixed thread pool in the IndexSearcher multithread code
On Sun, Feb 19, 2012 at 9:08 AM, Benson Margulies wrote:
> 3.5.0: I passed a fixed size executor service with one thread, and
> then with two threads, to the IndexSearcher constructor.
>
> It hung.
>
> With three threads, it didn't work, but I got different results than
> when I don't pass in an executor service at all.
>
> Is this expected? Should the javadoc say something? (I can make a patch).
>

I'm not sure I understand the details here, but I don't like the sound of 'different results': is it possible you can work this down into a test case that can be attached to jira?

--
lucidimagination.com
Re: Hanging with fixed thread pool in the IndexSearcher multithread code
I should have been clearer; the hang I can make into a test case, but I wondered if it would just get closed as 'works as designed'. The result discrepancy needs some investigation; I should not have mentioned it yet.

On Feb 19, 2012, at 10:40 AM, Robert Muir wrote:

> I'm not sure I understand the details here, but I don't like the sound
> of 'different results': is it possible you can work this down into a
> test case that can be attached to jira?
Re: Hanging with fixed thread pool in the IndexSearcher multithread code
and there was a dumb typo.

1 thread: hang
2 threads: hang
3 or more: no hang
Implement a custom similarity
Hello,

I am really new to Lucene; last week, through this list, I was successful in finding a solution to my problem. I have a new question now: I am trying to implement a new similarity class that uses the Jaccard coefficient. I have been reading the javadocs and a lot of other web pages on the matter, but I still cannot understand how to do it.

So far I know that I have to subclass DefaultSimilarity and (if I am not wrong) override all the built-in methods to return the correct score. Since the Jaccard coefficient is the size of the intersection of the query and document term sets divided by the size of their union, I think I only need coord(q,d), and all the other measures in the default similarity can return 1 to the score computation.

My problem is that I cannot find out how to obtain the number of terms that each document has. Also, do you think this approach is correct? I would be grateful if you could give me advice or point me towards a tutorial on the matter, because two days of searching were fruitless in finding example code.

Thank you in advance.
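For reference, the Jaccard coefficient described in this question can be computed over plain term sets as below. This is a minimal sketch that deliberately stays outside Lucene's Similarity API: `JaccardSketch` and `jaccard` are hypothetical names, the example terms are made up, and returning 0 for two empty sets is an assumption of the sketch rather than part of the definition.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Illustration only: plain sets of terms, no Lucene Similarity involved.
class JaccardSketch {

    /** size(intersection) / size(union) of two term sets. */
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0; // assumption: treat the empty/empty case as no overlap
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        // size(union) = size(a) + size(b) - size(intersection)
        int unionSize = a.size() + b.size() - intersection.size();
        return (double) intersection.size() / unionSize;
    }

    public static void main(String[] args) {
        Set<String> query = new HashSet<>(Arrays.asList("lucene", "merge"));
        Set<String> doc = new HashSet<>(
                Arrays.asList("lucene", "merge", "thread", "pool"));
        System.out.println(jaccard(query, doc));
    }
}
```

The union size depends on the number of distinct terms in the document, which is exactly the quantity the question is trying to locate; the overlap count alone is not enough to compute the coefficient.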
Re: Hanging with fixed thread pool in the IndexSearcher multithread code
Conveniently, all the 'wrong-result' problems disappeared when I followed your advice about counting hits.
Re: Hanging with fixed thread pool in the IndexSearcher multithread code
See https://issues.apache.org/jira/browse/LUCENE-3803 for an example of the hang. I think this nets out to pilot error, but maybe Javadoc could protect the next person from making the same mistake.
RE: Hanging with fixed thread pool in the IndexSearcher multithread code
See my response. The problem is not in Lucene; it's a general problem with fixed thread pools that execute other callables from within a callable that is currently running in the same thread pool. The callables are simply waiting for each other.

Use a separate thread pool for Lucene (or whenever you execute new callables from within another running callable).

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -----Original Message-----
> From: Benson Margulies [mailto:bimargul...@gmail.com]
> Sent: Monday, February 20, 2012 1:47 AM
> To: java-user@lucene.apache.org
> Subject: Re: Hanging with fixed thread pool in the IndexSearcher multithread
> code
>
> See https://issues.apache.org/jira/browse/LUCENE-3803 for an example of the
> hang. I think this nets out to pilot error, but maybe Javadoc could protect
> the next person from making the same mistake.
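The situation described in this reply can be reproduced with nothing but java.util.concurrent. The sketch below is an illustration, not the test case attached to the JIRA issue: `NestedPoolSketch` and its methods are hypothetical names, and the 500 ms timeout is an arbitrary choice so that the demo reports the stall instead of hanging forever.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Illustration only: plain java.util.concurrent, no Lucene involved.
class NestedPoolSketch {

    /**
     * Runs an outer task on `outer` that submits an inner task to `inner`
     * and waits for it. Returns true if the outer task finished in time.
     */
    static boolean completes(ExecutorService outer, ExecutorService inner) {
        Callable<Integer> outerTask = () -> {
            Future<Integer> nested = inner.submit(() -> 42);
            return nested.get(); // blocks the thread running the outer task
        };
        Future<Integer> result = outer.submit(outerTask);
        try {
            result.get(500, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            return false; // the outer task is still waiting on the inner one
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    /** One-thread pool used for both levels: the inner task never starts. */
    static boolean sharedPoolCompletes() {
        ExecutorService shared = Executors.newFixedThreadPool(1);
        try {
            return completes(shared, shared);
        } finally {
            shared.shutdownNow(); // interrupts the stuck outer task
        }
    }

    /** Separate pools, as recommended above: the inner task can run. */
    static boolean separatePoolsComplete() {
        ExecutorService outer = Executors.newFixedThreadPool(1);
        ExecutorService inner = Executors.newFixedThreadPool(1);
        try {
            return completes(outer, inner);
        } finally {
            outer.shutdownNow();
            inner.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("shared pool completes: " + sharedPoolCompletes());
        System.out.println("separate pools complete: " + separatePoolsComplete());
    }
}
```

With the shared pool, the only worker thread is occupied by the outer task, so the inner task stays queued and the two wait on each other. With a larger shared pool the same stall appears once enough outer tasks are running to occupy every thread.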
Re: Hanging with fixed thread pool in the IndexSearcher multithread code
On Sun, Feb 19, 2012 at 8:07 PM, Uwe Schindler wrote:
> See my response. The problem is not in Lucene; its in general a problem of
> fixed thread pools that execute other callables from within a callable
> running at the moment in the same thread pool. Callables are simply waiting
> for each other.
>
> Use a separate thread pool for Lucene (or whenever you execute new callables
> from within another running callable)

Right. There's nothing like coding a test case to cast one's stupid errors into high relief. Sorry for all the noise.
Here a merge thread, there a merge thread ...
A long-running program of mine (which Uwe's read a model of) slowly keeps adding merge threads. I count 22 at the moment. Each one shows up, runs for a bit, and then goes to sleep for, seemingly ever. I don't do anything explicit to control merging behavior.

They name themselves "Lucene Merge Thread #xxx" where xxx is a non-contiguous but ever-growing number.
Re: Hanging with fixed thread pool in the IndexSearcher multithread code
On Mon, Feb 20, 2012 at 12:07 PM, Uwe Schindler wrote:
> See my response. The problem is not in Lucene; its in general a problem of
> fixed thread pools that execute other callables from within a callable
> running at the moment in the same thread pool. Callables are simply waiting
> for each other.

What we do to get around this issue is to have a utility class which you call to submit jobs to the executor, but instead of waiting after submitting them, it starts calling get() starting from the end of the list. So if there is no other thread available on the executor, the main thread ends up doing all the work and then returns like normal.

The problem with this solution is that it requires all code in the system to go through this utility to avoid the issue, and obviously Lucene is one of those things which isn't written to defend against this.

Java 7's solution seems to be ForkJoinPool but I gather there is no simple way to use that with Lucene...

TX
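The utility described in this message is not shown in the thread, so the following is only a guess at its shape. It assumes the jobs are wrapped in `FutureTask` and that the submitting thread calls `run()` on each task before `get()`, working from the end of the list; `FutureTask.run()` does nothing if a pool thread has already started or finished the task. `HelpingInvoker`, `invokeAll` and `nestedOnSingleThread` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.FutureTask;

// Hypothetical helper; the poster's actual utility is not shown in the thread.
class HelpingInvoker {

    /** Submits all jobs, then helps run them, starting from the end. */
    static <T> List<T> invokeAll(ExecutorService executor, List<Callable<T>> jobs) {
        List<FutureTask<T>> tasks = new ArrayList<>();
        List<T> results = new ArrayList<>();
        for (Callable<T> job : jobs) {
            FutureTask<T> task = new FutureTask<>(job);
            tasks.add(task);
            results.add(null);      // placeholder, filled in below
            executor.execute(task); // a pool thread may or may not pick it up
        }
        for (int i = tasks.size() - 1; i >= 0; i--) {
            FutureTask<T> task = tasks.get(i);
            task.run(); // no-op if a pool thread already started or finished it
            try {
                results.set(i, task.get());
            } catch (InterruptedException | ExecutionException e) {
                throw new RuntimeException(e);
            }
        }
        return results;
    }

    /** Nested use on a one-thread pool: completes because the caller helps. */
    static List<Integer> nestedOnSingleThread() {
        ExecutorService pool = Executors.newFixedThreadPool(1);
        try {
            List<Callable<List<Integer>>> outer = new ArrayList<>();
            outer.add(() -> {
                List<Callable<Integer>> inner = new ArrayList<>();
                inner.add(() -> 1);
                inner.add(() -> 2);
                return invokeAll(pool, inner); // same pool, no free thread
            });
            return invokeAll(pool, outer).get(0);
        } finally {
            pool.shutdown(); // leftover queued tasks run as no-ops
        }
    }

    public static void main(String[] args) {
        System.out.println(nestedOnSingleThread());
    }
}
```

As the message points out, this only protects code that goes through the helper; anything that submits to the pool and blocks on a plain `Future.get()` can still stall.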