Re: Concurrency and multiple merge threads

2012-02-19 Thread Mike McCandless
Sounds like a nice machine!

It's frustrating that RAMFile even has any sync'd methods... Lucene is write 
once, so once a RAMFile is written we don't need any sync to read it.  Maybe on 
creating a RAMInputStream we could make a new ReadOnlyRAMFile, holding the same 
buffers without sync.

That said, the ops inside the sync are tiny, so it's strange if this really is 
the cause of the contention... It could just be a profiling ghost, and something 
else is the real bottleneck...
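
A minimal sketch of the read-only idea above, in plain Java rather than Lucene's actual classes. WritableFile and ReadOnlyFile are hypothetical stand-ins for RAMFile and the proposed ReadOnlyRAMFile; the freeze() method is likewise an assumption made for illustration, not an existing Lucene call.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a RAMFile-like class: writers and readers share one monitor. */
final class WritableFile {
    private final List<byte[]> buffers = new ArrayList<>();

    synchronized void addBuffer(byte[] buffer) { buffers.add(buffer); }
    synchronized byte[] getBuffer(int index) { return buffers.get(index); }
    synchronized int numBuffers() { return buffers.size(); }

    /** Snapshots the buffer references once, after writing is finished. */
    synchronized ReadOnlyFile freeze() {
        return new ReadOnlyFile(buffers.toArray(new byte[0][]));
    }
}

/** Hypothetical read-only view: the same byte[] buffers, but no synchronized methods. */
final class ReadOnlyFile {
    private final byte[][] buffers; // final field, so the array is safely published to readers

    ReadOnlyFile(byte[][] buffers) { this.buffers = buffers; }

    byte[] getBuffer(int index) { return buffers[index]; }
    int numBuffers() { return buffers.length; }
}
```

Because the file is write-once, readers that only ever see the frozen view never touch the writer's monitor, and no buffer contents are copied.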

Mike

On Feb 18, 2012, at 9:21 PM, Benson Margulies  wrote:

> Using Lucene 3.5.0, on a 32-core machine, I have coded something shaped like:
> 
> make a writer on a RAMDirectory.
> 
> start:
> 
>  Create a near-real-time searcher from it.
> 
>  farm work out to multiple threads, each of which performs a search
> and retrieves some docs.
> 
>  When all are done, write some new docs.
> 
> back to start.
> 
> The returns of adding threads diminish faster than I would like.
> According to YourKit, a major contribution when I try 16 is conflict
> on the RAMFile monitor.
> 
> The conflict shows five Lucene Merge Threads holding the monitor, plus
> my own threads. I'm not sure that I'm interpreting this correctly;
> perhaps there were five different occasions when a merge thread
> blocked my threads.
> 
> In any case, I'm fairly stumped as to how my threads manage to
> materially block each other, since the synchronized methods used on
> the search side in RAMFile are pretty tiny.
> 
> YourKit claims that the problem is in RAMFile.numBuffers, but I have
> not been able to catch this being called in a search.
> 
> I did spot the following backtrace.
> 
> In any case, I'd be grateful if anyone could tell me if this is a
> familiar story or one for which there's a solution.
> 
> 
>RAMFile.getBuffer(int) line: 75
>RAMInputStream.switchCurrentBuffer(boolean) line: 107
>RAMInputStream.seek(long) line: 144
>SegmentNorms.bytes() line: 163
>SegmentNorms.bytes() line: 143
>ReadOnlySegmentReader(SegmentReader).norms(String) line: 599
>TermQuery$TermWeight.scorer(IndexReader, boolean, boolean) line: 107
>BooleanQuery$BooleanWeight.scorer(IndexReader, boolean, boolean) line: 298
>IndexSearcher.search(Weight, Filter, Collector) line: 577
>IndexSearcher.search(Weight, Filter, int, Sort, boolean) line: 517
>IndexSearcher.search(Weight, Filter, int, Sort) line: 487
>IndexSearcher.search(Query, Filter, int, Sort) line: 400
> 
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
> 




Re: How to separate one index into multiple?

2012-02-19 Thread Li Li
I think you could do it as follows, taking a split into 3 indexes as an
example.
You can copy the index 3 times.
For copy 1:
  for(int i=0;i

Re: How to separate one index into multiple?

2012-02-19 Thread Li Li
You can delete by query, e.g. -category:category1

On Sun, Feb 19, 2012 at 9:41 PM, Li Li  wrote:

> I think you could do as follows.  taking splitting it to 3 indexes for
> example.
> you can copy the index 3 times.
> for copy 1
>   for(int i=0;i   reader1.delete(i);
>   }
> for copy
>   for(int i=1;i   reader2.delete(i);
>  }
> 
>  and then optimize these 3 indexes
>
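
The loop bodies quoted above are incomplete, so the following is only one possible reading of the copy-and-delete idea, written in plain Java. It assumes documents are assigned to copies by doc ID modulo the number of parts; the class and method names are hypothetical, and the actual Lucene delete call is left as a comment because its exact form depends on the Lucene version.

```java
import java.util.ArrayList;
import java.util.List;

final class IndexSplitSketch {
    /**
     * Returns the doc IDs to delete from copy number `copy` (0-based) so that
     * the copy keeps only the IDs where id % parts == copy.
     */
    static List<Integer> docsToDelete(int maxDoc, int parts, int copy) {
        List<Integer> toDelete = new ArrayList<>();
        for (int id = 0; id < maxDoc; id++) {
            if (id % parts != copy) {
                // Assumption: this is where the reader's delete call for `id` would go.
                toDelete.add(id);
            }
        }
        return toDelete;
    }
}
```

After the deletes, each copy holds a disjoint third of the documents, and optimizing each copy reclaims the space of the deleted ones.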


Hanging with fixed thread pool in the IndexSearcher multithread code

2012-02-19 Thread Benson Margulies
3.5.0:  I passed a fixed size executor service with one thread, and
then with two threads, to the IndexSearcher constructor.

It hung.

With three threads, it didn't work, but I got different results than
when I don't pass in an executor service at all.

Is this expected? Should the javadoc say something? (I can make a patch).




Counting all the hits with parallel searching

2012-02-19 Thread Benson Margulies
If I have a lot of segments, and an executor service in my searcher,
the following runs out of memory instantly, building giant heaps. Is
there another way to express this? Should I file a JIRA that the
parallel code should have some graceful behavior?

int longestMentionFreq = searcher.search(longestMentionQuery, filter,
Integer.MAX_VALUE).totalHits + 1;




Re: Counting all the hits with parallel searching

2012-02-19 Thread Robert Muir
On Sun, Feb 19, 2012 at 9:21 AM, Benson Margulies  wrote:
> If I have a lot of segments, and an executor service in my searcher,
> the following runs out of memory instantly, building giant heaps. Is
> there another way to express this? Should I file a JIRA that the
> parallel code should have some graceful behavior?
>
> int longestMentionFreq = searcher.search(longestMentionQuery, filter,
> Integer.MAX_VALUE).totalHits + 1;
>

the _n_ you pass there is the actual number of results that you need
to display to the user, in top-N order.
so in most cases this should be something like 20.

This is because it builds a priority queue of size _n_ to return
results in sorted order.

Don't pass huge numbers here: if you are not actually returning pages
of results to the user, but just counting hits, then pass
TotalHitCountCollector.
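
A minimal sketch of why the two cases differ, in plain Java. HitSink, TopScores, and HitCounter are hypothetical, simplified stand-ins and not Lucene's Collector API. Note one difference from what is described in this thread: this queue grows lazily, whereas Lucene sizes its queue for _n_ up front, which is why Integer.MAX_VALUE fails immediately.

```java
import java.util.PriorityQueue;

/** Hypothetical, simplified collector callback; not Lucene's Collector API. */
interface HitSink {
    void collect(int docId, float score);
}

/** Keeps the n best scores in a min-heap, so memory grows with n. */
final class TopScores implements HitSink {
    private final int n;
    private final PriorityQueue<Float> queue = new PriorityQueue<>();

    TopScores(int n) { this.n = n; }

    @Override
    public void collect(int docId, float score) {
        queue.add(score);
        if (queue.size() > n) {
            queue.poll(); // evict the weakest hit kept so far
        }
    }

    int size() { return queue.size(); }

    /** Lowest score still kept; requires at least one collected hit. */
    float weakestKept() { return queue.peek(); }
}

/** Counting needs no queue at all: constant memory regardless of hit count. */
final class HitCounter implements HitSink {
    private int totalHits;

    @Override
    public void collect(int docId, float score) { totalHits++; }

    int totalHits() { return totalHits; }
}
```

In Lucene itself, the counting case corresponds to passing a TotalHitCountCollector to the search call instead of a result count, as suggested above.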

-- 
lucidimagination.com




RE: Counting all the hits with parallel searching

2012-02-19 Thread Uwe Schindler
By passing Integer.MAX_VALUE you are asking Lucene to allocate a priority 
queue of that size for collecting results, and this OOMs. With Lucene, if you are 
using TopDocs, the idea is to get only a limited number of top-ranking 
documents to display as search results. The user is not interested in the 
two-millionth results page, so pass a small number of top hits.

To simply count all hits like you seem to do, there is a separate collector 
available: http://goo.gl/XsPVR

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -----Original Message-----
> From: Benson Margulies [mailto:bimargul...@gmail.com]
> Sent: Sunday, February 19, 2012 3:22 PM
> To: java-user@lucene.apache.org
> Subject: Counting all the hits with parallel searching
> 
> If I have a lot of segments, and an executor service in my searcher, the
> following runs out of memory instantly, building giant heaps. Is there another
> way to express this? Should I file a JIRA that the parallel code should have
> some graceful behavior?
> 
> int longestMentionFreq = searcher.search(longestMentionQuery, filter,
> Integer.MAX_VALUE).totalHits + 1;
> 





Re: Counting all the hits with parallel searching

2012-02-19 Thread Benson Margulies
thanks, that's what I needed.

On Feb 19, 2012, at 9:51 AM, Robert Muir  wrote:

> On Sun, Feb 19, 2012 at 9:21 AM, Benson Margulies  
> wrote:
>> If I have a lot of segments, and an executor service in my searcher,
>> the following runs out of memory instantly, building giant heaps. Is
>> there another way to express this? Should I file a JIRA that the
>> parallel code should have some graceful behavior?
>>
>> int longestMentionFreq = searcher.search(longestMentionQuery, filter,
>> Integer.MAX_VALUE).totalHits + 1;
>>
>
> the _n_ you pass there is the actual number of results that you need
> to display to the user, in top-N order.
> so in most cases this should be something like 20.
>
> This is because it builds a priority queue of size _n_ to return
> results in sorted order.
>
> Don't pass huge numbers here: if you are not actually returning pages
> of results to the user, but just counting hits, then pass
> TotalHitCountCollector.
>
> --
> lucidimagination.com
>
>




Re: Counting all the hits with parallel searching

2012-02-19 Thread Robert Muir
On Sun, Feb 19, 2012 at 10:23 AM, Benson Margulies
 wrote:
> thanks, that's what I needed.
>

Thanks for bringing this up. I think it's a common issue, so I created
https://issues.apache.org/jira/browse/LUCENE-3799 to hopefully improve
the docs situation.

-- 
lucidimagination.com




Re: Hanging with fixed thread pool in the IndexSearcher multithread code

2012-02-19 Thread Robert Muir
On Sun, Feb 19, 2012 at 9:08 AM, Benson Margulies  wrote:
> 3.5.0:  I passed a fixed size executor service with one thread, and
> then with two threads, to the IndexSearcher constructor.
>
> It hung.
>
> With three threads, it didn't work, but I got different results than
> when I don't pass in an executor service at all.
>
> Is this expected? Should the javadoc say something? (I can make a patch).
>

I'm not sure I understand the details here, but I don't like the sound
of 'different results': is it possible you can work this down into a
test case that can be attached to jira?


-- 
lucidimagination.com




Re: Hanging with fixed thread pool in the IndexSearcher multithread code

2012-02-19 Thread Benson Margulies
I should have been clearer: the hang I can make into a test case, but I
wondered if it would just get closed as 'works as designed'. The
result discrepancy needs some investigation; I should not have
mentioned it yet.

On Feb 19, 2012, at 10:40 AM, Robert Muir  wrote:

> On Sun, Feb 19, 2012 at 9:08 AM, Benson Margulies  
> wrote:
>> 3.5.0:  I passed a fixed size executor service with one thread, and
>> then with two threads, to the IndexSearcher constructor.
>>
>> It hung.
>>
>> With three threads, it didn't work, but I got different results than
>> when I don't pass in an executor service at all.
>>
>> Is this expected? Should the javadoc say something? (I can make a patch).
>>
>
> I'm not sure I understand the details here, but I don't like the sound
> of 'different results': is it possible you can work this down into a
> test case that can be attached to jira?
>
>
> --
> lucidimagination.com
>
>




Re: Hanging with fixed thread pool in the IndexSearcher multithread code

2012-02-19 Thread Benson Margulies
and there was a dumb typo.

1 thread: hang
2 threads: hang
3 or more: no hang

On Feb 19, 2012, at 10:40 AM, Robert Muir  wrote:

> On Sun, Feb 19, 2012 at 9:08 AM, Benson Margulies  
> wrote:
>> 3.5.0:  I passed a fixed size executor service with one thread, and
>> then with two threads, to the IndexSearcher constructor.
>>
>> It hung.
>>
>> With three threads, it didn't work, but I got different results than
>> when I don't pass in an executor service at all.
>>
>> Is this expected? Should the javadoc say something? (I can make a patch).
>>
>
> I'm not sure I understand the details here, but I don't like the sound
> of 'different results': is it possible you can work this down into a
> test case that can be attached to jira?
>
>
> --
> lucidimagination.com
>
>




Implement a custom similarity

2012-02-19 Thread Damerian

Hello,
I am really new to Lucene. Last week, through this list, I was really 
successful in finding a solution to my problem.
I have a new question now: I am trying to implement a new similarity 
class that uses the Jaccard coefficient. I have been reading the 
javadocs and a lot of other webpages on the matter, but my problem is 
that I still cannot understand how to do it.
So far I know that I have to subclass DefaultSimilarity and (if I am 
not wrong) override all the built-in methods to return the correct 
score. Since the Jaccard coefficient is the size of the intersection of the 
query/document term sets divided by the size of their union, I think I only 
need coord(q,d), and all the other measures in the default similarity 
can return 1 to the score computation. My problem is that I cannot 
locate how to obtain the number of terms that each document has.
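
For reference, a minimal sketch of the coefficient itself in plain Java, independent of Lucene's Similarity API. The class and method names are hypothetical, and returning 0 for two empty sets is an assumption made here. It shows why the number of distinct terms in the document is needed: it enters through the size of the union.

```java
import java.util.HashSet;
import java.util.Set;

final class Jaccard {
    /** |A intersect B| / |A union B|; defined here as 0 when both sets are empty. */
    static double coefficient(Set<String> queryTerms, Set<String> docTerms) {
        Set<String> intersection = new HashSet<>(queryTerms);
        intersection.retainAll(docTerms);
        // |A union B| = |A| + |B| - |A intersect B|
        int unionSize = queryTerms.size() + docTerms.size() - intersection.size();
        return unionSize == 0 ? 0.0 : (double) intersection.size() / unionSize;
    }
}
```

For a query {lucene, index} and a document {lucene, search, index, merge}, the intersection has 2 terms and the union has 4, giving 0.5.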

Also, do you think this approach is correct?
I would be grateful if you could give me advice or point me towards a 
tutorial on the matter, because two days of searching were fruitless in 
finding example code.

Thank you in advance.




Re: Hanging with fixed thread pool in the IndexSearcher multithread code

2012-02-19 Thread Benson Margulies
Conveniently, all the 'wrong-result' problems disappeared when I
followed your advice about counting hits.

On Sun, Feb 19, 2012 at 10:39 AM, Robert Muir  wrote:
> On Sun, Feb 19, 2012 at 9:08 AM, Benson Margulies  
> wrote:
>> 3.5.0:  I passed a fixed size executor service with one thread, and
>> then with two threads, to the IndexSearcher constructor.
>>
>> It hung.
>>
>> With three threads, it didn't work, but I got different results than
>> when I don't pass in an executor service at all.
>>
>> Is this expected? Should the javadoc say something? (I can make a patch).
>>
>
> I'm not sure I understand the details here, but I don't like the sound
> of 'different results': is it possible you can work this down into a
> test case that can be attached to jira?
>
>
> --
> lucidimagination.com
>
>




Re: Hanging with fixed thread pool in the IndexSearcher multithread code

2012-02-19 Thread Benson Margulies
See https://issues.apache.org/jira/browse/LUCENE-3803 for an example
of the hang. I think this nets out to pilot error, but maybe Javadoc
could protect the next person from making the same mistake.




RE: Hanging with fixed thread pool in the IndexSearcher multithread code

2012-02-19 Thread Uwe Schindler
See my response. The problem is not in Lucene; it's a general problem of 
fixed thread pools that execute other callables from within a callable that is 
currently running in the same thread pool. The callables are simply waiting for 
each other.

Use a separate thread pool for Lucene (or whenever you execute new callables 
from within another running callable).
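
A minimal sketch of the effect described above, using only java.util.concurrent and no Lucene code. NestedSubmitDemo and its completes() method are hypothetical names; the timeout exists only so the demonstration reports the stall instead of hanging.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class NestedSubmitDemo {
    /**
     * Runs an outer task on `outer` that submits an inner task to `inner` and waits for it.
     * Returns true if the outer task finishes within the timeout, false if it is stuck.
     * Both pools are shut down before returning.
     */
    static boolean completes(ExecutorService outer, ExecutorService inner, long timeoutMillis) {
        Future<Integer> result = outer.submit(() -> {
            Future<Integer> nested = inner.submit(() -> 42);
            return nested.get(); // blocks a worker of `outer` until `inner` runs the task
        });
        try {
            result.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException stuck) {
            return false; // every worker is waiting, so nobody is left to run the inner task
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            outer.shutdownNow();
            inner.shutdownNow();
        }
    }
}
```

Passing the same single-thread pool as both arguments stalls, because the only worker is blocked inside the outer task. Passing two separate pools completes, which is the fix suggested above.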

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -----Original Message-----
> From: Benson Margulies [mailto:bimargul...@gmail.com]
> Sent: Monday, February 20, 2012 1:47 AM
> To: java-user@lucene.apache.org
> Subject: Re: Hanging with fixed thread pool in the IndexSearcher multithread
> code
> 
> See https://issues.apache.org/jira/browse/LUCENE-3803 for an example of the
> hang. I think this nets out to pilot error, but maybe Javadoc could protect
> the next person from making the same mistake.
> 





Re: Hanging with fixed thread pool in the IndexSearcher multithread code

2012-02-19 Thread Benson Margulies
On Sun, Feb 19, 2012 at 8:07 PM, Uwe Schindler  wrote:
> See my response. The problem is not in Lucene; its in general a problem of 
> fixed thread pools that execute other callables from within a callable 
> running at the moment in the same thread pool. Callables are simply waiting 
> for each other.
>
> Use a separate thread pool for Lucene (or whenever you execute new callables 
> from within another running callable)

Right. There's nothing like coding a test case to cast one's stupid
errors into high relief. Sorry for all the noise.


>
> Uwe
>
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
>> -----Original Message-----
>> From: Benson Margulies [mailto:bimargul...@gmail.com]
>> Sent: Monday, February 20, 2012 1:47 AM
>> To: java-user@lucene.apache.org
>> Subject: Re: Hanging with fixed thread pool in the IndexSearcher multithread
>> code
>>
>> See https://issues.apache.org/jira/browse/LUCENE-3803 for an example of the
>> hang. I think this nets out to pilot error, but maybe Javadoc could protect
>> the next person from making the same mistake.
>>
>
>
>




Here a merge thread, there a merge thread ...

2012-02-19 Thread Benson Margulies
A long-running program of mine (which Uwe's read a model of) slowly
keeps adding merge threads. I count 22 at the moment. Each one shows
up, runs for a bit, and then goes to sleep, seemingly forever. I
don't do anything explicit to control merging behavior.

They name themselves "Lucene Merge Thread #xxx" where xxx is a
non-contiguous but ever-growing number.




Re: Hanging with fixed thread pool in the IndexSearcher multithread code

2012-02-19 Thread Trejkaz
On Mon, Feb 20, 2012 at 12:07 PM, Uwe Schindler  wrote:
> See my response. The problem is not in Lucene; it's in general a problem of
> fixed thread pools that execute other callables from within a callable
> running at the moment in the same thread pool. Callables are simply waiting
> for each other.

What we do to get around this issue is to have a utility class which
you call to submit jobs to the executor, but instead of waiting after
submitting them, it starts calling get() starting from the end of the
list. So if there is no other thread available on the executor, the
main thread ends up doing all the work and then returns like normal.

The problem with this solution is that it requires all code in the
system to go through this utility to avoid the issue, and obviously
Lucene is one of those things which isn't written to defend against
this.
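
A minimal sketch of one way to get the behavior described above, assuming the utility wraps each job in a FutureTask. HelpingInvoker is a hypothetical name, and this is not the actual utility class. It relies on FutureTask.run() doing nothing if another thread has already started or finished the task.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executor;
import java.util.concurrent.FutureTask;

final class HelpingInvoker {
    /** Submits all jobs, then walks the list from the end, running any job no pool thread has started. */
    static <T> List<T> invokeAll(Executor executor, List<Callable<T>> jobs) {
        List<FutureTask<T>> tasks = new ArrayList<>();
        for (Callable<T> job : jobs) {
            FutureTask<T> task = new FutureTask<>(job);
            tasks.add(task);
            executor.execute(task);
        }
        List<T> results = new ArrayList<>();
        for (int i = 0; i < tasks.size(); i++) {
            results.add(null); // placeholders, filled in below in reverse order
        }
        try {
            for (int i = tasks.size() - 1; i >= 0; i--) {
                FutureTask<T> task = tasks.get(i);
                task.run(); // runs in the caller unless a pool thread already claimed the task
                results.set(i, task.get());
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        }
        return results;
    }
}
```

If the caller is itself a task on a full pool, it ends up running every job itself, so nothing waits on a thread that will never become free. Tasks still sitting in the pool's queue are skipped later because they have already completed.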

Java 7's solution seems to be ForkJoinPool but I gather there is no
simple way to use that with Lucene...

TX
