document with parent-child relationship

2011-04-29 Thread svonec
Hello,

I need advice on how to create a document that has a parent-child
relationship. Here is an example:

"low pressure" -> "engine"
               -> "wheel"
               -> 

"low pressure" string is the parent and "engine" and "wheel" are
children. I'd like to be able to search strings such as "low pressure
in engine" or just "low" or "engine" and the result should be an ID of
the parent. How do I create fields in the lucene document to express
this relationship?

Any advice appreciated.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: document with parent-child relationship

2011-04-29 Thread harsh srivastava
Hi,

You can create three fields for a document to index e.g.

Fields   => parent_id   parent_text    child_text
Contents => 1           low pressure   engine wheel, etc.
            2           Electronics    laptop pc ...
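
To make the three-field layout concrete, here is a minimal sketch in plain Java. It is deliberately not Lucene API: the class ParentChildIndex and its methods are hypothetical, and it only assumes that parent_text and child_text are both tokenized and searched together, while parent_id is what comes back.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-in for an index with the three fields from the table above.
class ParentChildIndex {
    // token -> ids of the parents whose parent_text or child_text contains it
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // One call corresponds to one row of the table: parent_id, parent_text, child_text.
    void add(int parentId, String parentText, String childText) {
        indexTokens(parentId, parentText);
        indexTokens(parentId, childText);
    }

    private void indexTokens(int parentId, String text) {
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(parentId);
            }
        }
    }

    // OR semantics: a parent matches if any query token hits either text field.
    Set<Integer> search(String query) {
        Set<Integer> hits = new TreeSet<>();
        for (String token : query.toLowerCase().split("\\W+")) {
            Set<Integer> ids = postings.get(token);
            if (ids != null) {
                hits.addAll(ids);
            }
        }
        return hits;
    }
}
```

With rows (1, "low pressure", "engine wheel") and (2, "Electronics", "laptop pc"), the queries "low pressure in engine", "low" and "engine" all return parent id 1. The word "in" simply matches nothing; in a real index a stop-word filter would probably drop it.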


Hope it helps.

Harsh


On Fri, Apr 29, 2011 at 12:59 PM,  wrote:

> Hello,
>
> I need an advice on how to create an document that has parent-child
> relationship. Here is an example:
>
> "low pressure" -> "engine"
>  -> "wheel"
>  -> 
>
> "low pressure" string is the parent and "engine" and "wheel" are
> children. I'd like to be able to search strings such as "low pressure
> in engine" or just "low" or "engine" and the result should be an ID of
> the parent. How do I create fields in the lucene document to express
> this relationship?
>
> Any advice appreciated.
>
>
>


Lucene 3.0.3 with debug information

2011-04-29 Thread Paul Taylor
Is there a debug build of Lucene 3.0.3 so I can profile it properly
to find which part of the search is taking the time?


Note: I've already profiled my application and determined that it is the
lucene/Search that is taking the time. I also made another attempt using
Luke but found it incredibly buggy and of little use.


thanks Paul




Re: Are Okapi BM25 scores normalized into 0 and 1 ?

2011-04-29 Thread Patrick Diviacco
Can anybody provide me with some information about this? Even a small clue would
help; I'm kind of stuck on this and the owner of the library does not answer emails.

Thanks


On 28 April 2011 13:49, Patrick Diviacco  wrote:

> Is Okapi BM25 (its implementation in Lucene:
> nlp.uned.es/~jperezi/Lucene-BM25) returning back normalized query scores
> (in between 0 and 1) ?
>
> According to Okapi formula the final score should be normalized. Could you
> give some information about that ?
>
> thanks
>
>
>


Re: SorterTemplate.quickSort causes StackOverflowError

2011-04-29 Thread Otis Gospodnetic
Hi,

OK, so it looks like it's not MemoryIndex and its Comparator that are funky.  
After switching from quickSort call in MemoryIndex to mergeSort, the problem 
persists:

'1205215856@qtp-684754483-7' Id=18, RUNNABLE on lock=, total cpu time=497060.ms user time=495210.ms
at org.apache.lucene.util.SorterTemplate.quickSort(SorterTemplate.java:105)
at org.apache.lucene.util.SorterTemplate.quickSort(SorterTemplate.java:104)
at org.apache.lucene.util.SorterTemplate.quickSort(SorterTemplate.java:104)
at org.apache.lucene.util.SorterTemplate.quickSort(SorterTemplate.java:104)

So something else is calling quickSort when it gets stuck.  Weirdly, when I get
a thread dump and get the above, I don't see the original caller.  Maybe because
the stack is already too deep and the printout is limited to N lines per call
stack?

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
> From: Uwe Schindler 
> To: java-user@lucene.apache.org
> Sent: Thu, April 28, 2011 5:54:44 PM
> Subject: RE: SorterTemplate.quickSort causes StackOverflowError
> 
> > Thanks for confirming, Javier! :)
> > 
> > Uwe, I assume you are  referring to this line 528 in MemoryIndex?
> > 
> >  528 if (size > 1) ArrayUtil.quickSort(entries,  termComparator);
> > 
> > And this funky Comparator from  MemoryIndex:
> > 
> > private static final Comparator termComparator = new Comparator() {
> >   @SuppressWarnings("unchecked")
> >   public int compare(Object o1, Object o2) {
> >     if (o1 instanceof Map.Entry) o1 = ((Map.Entry) o1).getKey();
> >     if (o2 instanceof Map.Entry) o2 = ((Map.Entry) o2).getKey();
> >     if (o1 == o2) return 0;
> >     return ((Comparable) o1).compareTo((Comparable) o2);
> >   }
> > };
> > 
> >  Will try, thanks!
> 
> Yeah, simply try with mergeSort in line 528. If that  helps, this comparator
> is buggy.
> 
> Uwe
> 
> 
> > - Original  Message 
> > > From: Uwe Schindler 
> > > To: java-user@lucene.apache.org
> >  > Sent: Thu, April 28, 2011 5:36:13 PM
> > > Subject: RE:  SorterTemplate.quickSort causes StackOverflowError
> > >
> > > Hi  Otis,
> > >
> > > Can you reproduce this somehow and send test code? I could look into
> > > it. I don't expect the error to be in the quicksort algorithm itself, as this
> > > one is also used in e.g. BytesRefHash / TermsHash; if there were a bug we
> > > would have seen it a long time ago.
> > >
> > > I have not seen this before, but I suspect a problem in this very
> > > strange comparator in MemoryIndex (which is very broken, if you look
> > > at its code - it can compare Strings with Map.Entry and so on,
> > > b), maybe the comparator is not stable? In this case, quicksort
> > > can easily loop endlessly and stack overflow. In Lucene 3.0 this used
> > > stock java sort (which is mergesort), maybe replace the
> > > ArrayUtil.quickSort by ArrayUtil.mergeSort() and see if the problem is
> > > still there?
> > >
> > > Uwe
> > >
> >  > -
> > > Uwe Schindler
> > > H.-H.-Meier-Allee 63,  D-28213  Bremen
> > > http://www.thetaphi.de
> > > eMail: u...@thetaphi.de
> > >
> >  >
> > > > -Original  Message-
> > > > From:  Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com]
> >  > >  Sent: Thursday, April 28, 2011 11:17 PM
> > > > To: java-user@lucene.apache.org
> >  > >  Subject: SorterTemplate.quickSort causes  StackOverflowError
> > > >
> > > >  Hi,
> > >  >
> > > > I'm looking at some code that uses MemoryIndex (Lucene 3.1) and
> > > > that's exhibiting a strange behaviour - it slows down over time.
> > > > The MemoryIndex contains 1 doc, of course, and executes a set of a
> > > > few thousand queries against it. The set of queries does not
> > > > change - the same set of queries gets executed on all incoming
> > > > documents.
> > > > This code runs very quickly in the beginning. But with time it gets
> > > > slower and slower and slower, and then I get this:
> > > >
> > > > 4/28/11 10:32:52 PM (S) SolrException.log : java.lang.StackOverflowError
> > > > at org.apache.lucene.util.SorterTemplate.quickSort(SorterTemplate.java:104)
> > > > at org.apache.lucene.util.SorterTemplate.quickSort(SorterTemplate.java:104)
> > > > at org.apache.lucene.util.SorterTemplate.quickSort(SorterTemplate.java:104)
> > > >
> > > > I haven't profiled this code yet (remote server, firewall in
> > > > between, can't use YourKit...), but does the above look familiar to
> > > > anyone?
> > > > I've looked at the code and obviously there is the recursive call
> > > > that's problematic here - it looks like the recursion just gets
> > > > deeper and deeper and "gets stuck", eventually getting too deep for
> > > > the JVM's taste.
> > > >
> > >

Re: Are Okapi BM25 scores normalized into 0 and 1 ?

2011-04-29 Thread Paul Libbrecht
Patrick, if the question is about the code snippet on the page you mention, 
which I copy below, I believe the answer is no, and the author is aware of it, 
since he adds a comment about scores not being normalized in the second example.

ScoreDocs and TopDocs do not return normalized scores.
Normalized scores tend to be rare in Lucene nowadays; I believe the earlier 
strategy was to divide by the max score when the latter was bigger than 1.

paul

IndexSearcher searcher = new IndexSearcher("IndexPath");

//Load average length
BM25Parameters.load(avgLengthPath);
BM25BooleanQuery query = new BM25BooleanQuery("This is my Query", 
"Search-Field",
new StandardAnalyzer());

TopDocs top = searcher.search(query, null, 10);
ScoreDoc[] docs = top.scoreDocs;

//Print results
for (int i = 0; i < top.scoreDocs.length; i++) {
  System.out.println(docs[i].doc + ":"+docs[i].score);
}
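
The "divide by max score when it is bigger than 1" strategy mentioned above can be sketched in plain Java as follows. This is only an illustration of that rule applied to an array of raw scores; the class and method names are made up here and are neither Lucene API nor part of the BM25 library.

```java
// Hypothetical helper: rescales raw scores into [0, 1] only when needed.
class ScoreNormalizer {
    // Returns a copy of the scores; if the largest score exceeds 1,
    // every score is divided by that maximum, otherwise they are left as-is.
    static float[] normalizeByMax(float[] scores) {
        float max = 0f;
        for (float s : scores) {
            max = Math.max(max, s);
        }
        float[] out = scores.clone();
        if (max > 1f) {
            for (int i = 0; i < out.length; i++) {
                out[i] = scores[i] / max;
            }
        }
        return out;
    }
}
```

Note that such a rescaling is relative to the best hit of one particular query, so normalized scores from different queries are still not comparable with each other.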


On 29 Apr 2011, at 13:20, Patrick Diviacco wrote:

> Can anybody provide me some information about it ? Even a small clue, I'm
> kinda stuck on this and the owner of the libraries do not answer emails.
> 
> Thanks
> 
> 
> On 28 April 2011 13:49, Patrick Diviacco  wrote:
> 
>> Is Okapi BM25 (its implementation in Lucene:
>> nlp.uned.es/~jperezi/Lucene-BM25) returning back normalized query scores
>> (in between 0 and 1) ?
>> 
>> According to Okapi formula the final score should be normalized. Could you
>> give some information about that ?
>> 
>> thanks
>> 
>> 
>> 





Re: SorterTemplate.quickSort causes StackOverflowError

2011-04-29 Thread jm
maybe http://youdebug.kenai.com/ could be useful. If you are lucky you could
get it to set a breakpoint when the recursive call has reached depth X.


Re: SorterTemplate.quickSort causes StackOverflowError

2011-04-29 Thread Dawid Weiss
Don't know if this helps, but debugging stuff like this I simply add a
(manually inserted or aspectj-injected) recursion count, add a breakpoint
inside an if checking for recursion count >> X and run the vm with an
attached socket debugger. This lets you run at (nearly) full speed and once
you hit the breakpoint, inspect the stack, variables, etc...
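
A manually inserted recursion count of this kind could look roughly like the sketch below. It uses a toy quicksort written for the example, not Lucene's SorterTemplate, and the threshold MAX_DEPTH (the "X" above) is an arbitrary assumption.

```java
// Toy quicksort with a recursion counter; not Lucene's SorterTemplate.
class DepthCountingQuickSort {
    // Hypothetical threshold "X"; pick something far above the expected depth.
    static final int MAX_DEPTH = 64;

    static void quickSort(int[] a, int lo, int hi, int depth) {
        if (depth > MAX_DEPTH) {
            // Set the breakpoint on the next line: the debugger stops with the
            // whole call stack and all local variables available for inspection.
            throw new IllegalStateException("recursion depth " + depth + " exceeds " + MAX_DEPTH);
        }
        if (lo >= hi) {
            return;
        }
        int p = partition(a, lo, hi);
        quickSort(a, lo, p - 1, depth + 1);
        quickSort(a, p + 1, hi, depth + 1);
    }

    // Lomuto partition with the last element as pivot (degenerates on sorted input).
    private static int partition(int[] a, int lo, int hi) {
        int pivot = a[hi];
        int i = lo;
        for (int j = lo; j < hi; j++) {
            if (a[j] < pivot) {
                swap(a, i, j);
                i++;
            }
        }
        swap(a, i, hi);
        return i;
    }

    private static void swap(int[] a, int i, int j) {
        int tmp = a[i];
        a[i] = a[j];
        a[j] = tmp;
    }
}
```

Called as quickSort(array, 0, array.length - 1, 0), a small shuffled array sorts normally, while an already sorted array of a few hundred elements drives this naive pivot choice past the threshold and trips the check.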

Dawid


ComplexPhraseQueryParser with multiple fields

2011-04-29 Thread Chris Salem
Hi,
I've just started using the ComplexPhraseQueryParser and it works great with 
one field but is there a way for it to work with multiple fields?  For example, 
right now the query:
job_title: "sales man*" AND NOT contact_name: "Chris Salem"
throws this exception 
Caused by: org.apache.lucene.queryParser.ParseException: Cannot have clause for 
field "job_title" nested in phrase for field "contact_name"
What is the best way to work around this?
Sincerely,
Chris Salem


Re: Lucene 3.0.3 with debug information

2011-04-29 Thread Simon Willnauer
Hey paul,

you can simply checkout the tag or download the sources right?
http://svn.apache.org/repos/asf/lucene/java/tags/lucene_3_0_3/
or http://ftp.download-by.net/apache//lucene/java/3.0.3/

simon

On Fri, Apr 29, 2011 at 1:09 PM, Paul Taylor  wrote:
> Is there a built debug version of lucene 3.0.3 so I can profile it properly
> to find what part of the search is taking the time.
>
> Note:Ive already profiled by application and determined that it is the
> lucene/Search that is taking the time, I also had another attempt using luke
> but find it incredibly buggy and of little use.
>
> thanks Paul
>
>
>




Re: Lucene 3.0.3 with debug information

2011-04-29 Thread Dawid Weiss
> lucene/Search that is taking the time, I also had another attempt using
> luke but find it incredibly buggy and of little use

Can you expand on this too? What kind of "incredible bugs" did you see?
Without feedback there is little progress, so bug reports count.

Dawid


RE: Lucene 3.0.3 with debug information

2011-04-29 Thread Steven A Rowe
Hi Paul,

What did you find about Luke that's buggy?  Bug reports are very useful; please 
contribute in this way.

The official Lucene 3.0.3 distribution jars were compiled using the -g cmdline 
argument to javac - by default, though, only line number and source file 
information is generated.  If you want local variable information too, you 
could download the source and make your own debug-enabled jar(s), right?:

0. Install Ant 1.7.1: 

1. svn checkout http://svn.apache.org/repos/asf/lucene/java/tags/lucene_3_0_3

2. Add 'debuglevel="lines,source,vars"' to the <javac> task invocation in the 
"compile" definition in common-build.xml, e.g.:

545: debuglevel="lines,source,vars"
...

3. run "ant clean jar" from the command line.  The Lucene core jar will be in 
the build/ directory.  (If you need one of the contrib jars, run "ant package" 
instead.)

Steve

> -Original Message-
> From: Paul Taylor [mailto:paul_t...@fastmail.fm]
> Sent: Friday, April 29, 2011 7:09 AM
> To: java-user@lucene.apache.org
> Subject: Lucene 3.0.3 with debug information
> 
> Is there a built debug version of lucene 3.0.3 so I can profile it
> properly to find what part of the search is taking the time.
> 
> Note:Ive already profiled by application and determined that it is the
> lucene/Search that is taking the time, I also had another attempt using
> luke but find it incredibly buggy and of little use.
> 
> thanks Paul
> 



Re: SorterTemplate.quickSort causes StackOverflowError

2011-04-29 Thread Otis Gospodnetic
Hi,

Yeah, that's what we were going to do, but instead we did:
* changed MemoryIndex to use ArrayUtil.mergeSort
* ran the app and did a thread dump, which showed SorterTemplate.quickSort in 
deep recursion again!
* looked for other places where this call is made - found it in 
MultiPhraseQuery$MultiPhraseWeight and changed that call from 
ArrayUtil.quickSort to ArrayUtil.mergeSort
* now we no longer see SorterTemplate.quickSort in deep recursion when we do a 
thread dump
* we now occasionally catch SorterTemplate.mergeSort in our thread dumps, but 
only a few levels deep, which looks healthy

I don't think we'll be able to reproduce this easily - this happens with 
MemoryIndex and a few thousand stored queries that are confidential customer 
data :(

I'll be back if after a while mergeSort starts behaving the same as quickSort.

Thanks!
Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/




RE: SorterTemplate.quickSort causes StackOverflowError

2011-04-29 Thread Uwe Schindler
Hi Otis,

Thanks for trying this out. From what I see, the problem is not in
MemoryIndex at all, so I suggest that you switch the mergeSort back to quickSort
(for MemoryIndex, see below). The problem seems to be the comparators
in those queries, which have no tie-breaker. MergeSort handles them
better because, unlike quickSort, mergeSort is stable.

I did some testing with random data and did not get a stack overflow at all
(with standard terms / integers). An integer sort showed that even 200
million integers a) sorted much faster with quickSort and b) did not stack
overflow (in reality, for good comparators, integers should need at most 31
levels of recursion, and only with 2^31 integers in an array!), so quickSort is
fine for strings and integers.
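
The depth figures above can be checked with a few lines of Java, under the idealized assumption that every partition step splits its range exactly in half, so the depth is about ceil(log2(n)). The helper below is hypothetical and only does that arithmetic; a real quicksort is rarely perfectly balanced.

```java
// Hypothetical helper: recursion depth of a perfectly balanced quicksort.
class QuickSortDepth {
    // Smallest d with 2^d >= n, i.e. ceil(log2(n)), for n >= 1.
    static int balancedDepth(long n) {
        return 64 - Long.numberOfLeadingZeros(n - 1);
    }
}
```

Under that assumption, balancedDepth(200_000_000L) gives 28 and balancedDepth(1L << 31) gives 31, which is nowhere near the thousands of frames it takes to overflow a default JVM stack.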

Mike McCandless did some tests in TermsHash/BytesRefHash (Lucene Core) that
showed that quickSort is 20% faster than mergeSort. The code is similar to
MemoryIndex, which is why I suggest not changing MemoryIndex at all. From
your description of the issue it's also unlikely that MemoryIndex is causing
this, because sorting is only done when building the index, not when queries
are running! So the bad guys are the PhraseQueries. We should fix them ASAP,
as this may affect other users, too.
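
To illustrate what a comparator with a tie-breaker means, here is a small self-contained sketch. The TermEntry class and its fields are invented for this example and are not the actual PhraseQuery internals; the point is only that the first comparator treats many distinct entries as equal, while the second always decides.

```java
import java.util.Comparator;

class TieBreakDemo {
    // Hypothetical entry type; the fields are assumptions for illustration only.
    static final class TermEntry {
        final String term;
        final int docFreq;
        final int position;

        TermEntry(String term, int docFreq, int position) {
            this.term = term;
            this.docFreq = docFreq;
            this.position = position;
        }
    }

    // Primary key only: all entries with the same docFreq compare as equal.
    static final Comparator<TermEntry> BY_DOC_FREQ =
            Comparator.comparingInt((TermEntry e) -> e.docFreq);

    // Same primary key, then position and term text as tie-breakers, so two
    // entries compare as equal only if all three values are identical.
    static final Comparator<TermEntry> WITH_TIE_BREAK =
            BY_DOC_FREQ.thenComparingInt(e -> e.position).thenComparing(e -> e.term);
}
```

With a tie-breaker the resulting order no longer depends on the input order, so it also stops mattering whether the sort algorithm itself is stable.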

More on https://issues.apache.org/jira/browse/LUCENE-3054,
Thanks Robert!

I will review later; I am very busy at the moment.
Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


> -Original Message-
> From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com]
> Sent: Friday, April 29, 2011 7:13 PM
> To: java-user@lucene.apache.org
> Subject: Re: SorterTemplate.quickSort causes StackOverflowError
> 
> Hi,
> 
> Yeah, that's what we were going to do, but instead we did:
> * changed MemoryIndex to use ArrayUtil.mergeSort
> * ran the up and did a thread dump that shows that
> SorterTemplate.quickSort in deep recursion again!
> * looked for other places where this call is made - found it in
> MultiPhraseQuery$MultiPhraseWeight and changed that call from
> ArrayUtil.quickSort to ArrayUtil.mergeSort
> * now we no longer see SorterTemplate.quickSort in deep recursion when
> we do a thread dump
> * we now occasionally catch SorterTemplate.mergeSort in our thread dumps,
> but only a few levels deep, which looks healthy
> 
> I don't think we'll be able to reproduce this easily - this happens with
> MemoryIndex and a few thousand stored queries that are confidential
> customer data :(
> 
> I'll be back if after a while mergeSort starts behaving the same as
quickSort.
> 
> Thanks!
> Otis
> 
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem
> search :: http://search-lucene.com/
> 
> 
> 
> - Original Message 
> > From: Dawid Weiss 
> > To: java-user@lucene.apache.org
> > Sent: Fri, April 29, 2011 7:51:39 AM
> > Subject: Re: SorterTemplate.quickSort causes StackOverflowError
> >
> > Don't know if this helps, but debugging stuff like this I simply add
> > a (manually inserted or aspectj-injected) recursion count, add a
> > breakpoint inside an if checking for recursion count >> X and run the
> > vm with an attached socket debugger. This lets you run at (nearly)
> > full speed  and once you hit the breakpoint, inspect the stack,
variables,
> etc...
> >
> > Dawid
> >
> > On Fri, Apr 29, 2011 at 1:40 PM, Otis Gospodnetic  <
> > otis_gospodne...@yahoo.com>  wrote:
> >
> > > Hi,
> > >
> > > > OK, so it looks like it's not MemoryIndex and its Comparator that
> > > > are funky.
> > > > After switching from the quickSort call in MemoryIndex to mergeSort,
> > > > the problem persists:
> > > >
> > > > '1205215856@qtp-684754483-7' Id=18, RUNNABLE on lock=, total cpu
> > > > time=497060.ms user time=495210.ms
> > > >  at org.apache.lucene.util.SorterTemplate.quickSort(SorterTemplate.java:105)
> > > >  at org.apache.lucene.util.SorterTemplate.quickSort(SorterTemplate.java:104)
> > > >  at org.apache.lucene.util.SorterTemplate.quickSort(SorterTemplate.java:104)
> > > >  at org.apache.lucene.util.SorterTemplate.quickSort(SorterTemplate.java:104)
> > > >
> > > > So something else is calling quickSort when it gets stuck.
> > > > Weirdly, when I get a thread dump and get the above, I don't see
> > > > the original caller.
> > > > Maybe because the stack is already too deep and the printout is
> > > > limited to N lines per call stack?
> > >
> > > Otis
> > > 
> > > Sematext :: http://sematext.com/ :: Solr -  Lucene - Nutch Lucene
> > > ecosystem search :: http://search-lucene.com/
> > >
> > >
> > >
> > > > - Original Message 
> > > > > From: Uwe Schindler 
> > > > > To: java-user@lucene.apache.org
> > > > > Sent: Thu, April 28, 2011 5:54:44 PM
> > > > > Subject: RE: SorterTemplate.quickSort causes StackOverflowError
> > > > >
> > > > > > Thanks for confirming, Javier! :)
> > > > > >
> > > > > > Uwe, I assume you are referring to this line 528 in MemoryIndex?
> > > > > >
> > > > >  528   

Re: Lucene 3.0.3 with debug information

2011-04-29 Thread Paul Taylor

On 29/04/2011 16:03, Steven A Rowe wrote:

> Hi Paul,
>
> What did you find about Luke that's buggy?  Bug reports are very useful; please
> contribute in this way.

Please see previous post, in summary mistake on my part.

> The official Lucene 3.0.3 distribution jars were compiled using the -g cmdline
> argument to javac - by default, though, only line number and source file
> information is generated.  If you want local variable information too, you
> could download the source and make your own debug-enabled jar(s), right?:

Hmm, maybe that is enough, I'm not sure. I'm profiling with
YourKit Profiler and it doesn't show anything within the lucene classes, so
I assumed this meant they didn't contain the necessary debugging info,
but I would have thought that -g is all I need.


thanks Paul
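
Editor's sketch, under stated assumptions: one rough way to see whether loaded classes carry line-number and source-file information is to look at their stack frames, since `StackTraceElement` reports a null file name or a negative line number when that information is absent. The class and method names below are hypothetical, and the check says nothing about local-variable tables.

```java
final class DebugInfoProbe {
    /** True when the frame carries both a source file name and a line number. */
    static boolean hasLineInfo(StackTraceElement frame) {
        return frame.getFileName() != null && frame.getLineNumber() >= 0;
    }

    /** Counts frames under packagePrefix that lack file or line information. */
    static int framesWithoutLineInfo(StackTraceElement[] frames, String packagePrefix) {
        int missing = 0;
        for (StackTraceElement frame : frames) {
            if (frame.getClassName().startsWith(packagePrefix) && !hasLineInfo(frame)) {
                missing++;
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // In real use, take the frames from an exception or thread dump that
        // actually passes through the library; here only main() is on the stack.
        String prefix = args.length > 0 ? args[0] : "org.apache.lucene.";
        StackTraceElement[] frames = new Throwable().getStackTrace();
        System.out.println("frames under " + prefix + " without line info: "
                + framesWithoutLineInfo(frames, prefix));
    }
}
```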




Re: Lucene 3.0.3 with debug information

2011-04-29 Thread Dawid Weiss
Instead of profiling, provide some more info about the following:

- what are the problematic (slow) queries -- are they generated from the
code, are they parsed from text? What are they? Certain query types are
slow(er) than other query types.

- what is the index built from? Natural language (text)? Something else?

If you describe the above folks may tell you right away why your queries are
slow -- people on this list continue to amaze me with the insight they have
even without looking at the code ;)

Dawid

On Fri, Apr 29, 2011 at 10:11 PM, Paul Taylor  wrote:

> On 29/04/2011 15:17, Dawid Weiss wrote:
>
> > > lucene/Search that is taking the time, I also had another attempt using
> > > luke but find it incredibly buggy and of little use
> >
> > Can you expand on this too? What kind of "incredible bugs" did you see?
> > Without feedback there is little progress, so bug reports count.
> >
> > Dawid
>
> Sorry, I'll withdraw that. I was getting all kinds of stacktraces and
> exceptions when I tried to do searches, but the problem was my fault. Because
> I wanted to use my own analyzer, I had a shell script that added it to the
> classpath when I ran luke. However, I had put it before the ant jar, and my
> jar built with maven also included lucene 3.0.3; because luke 1.0.1 is
> packaged with 3.0.0, this was confusing it, but I didn't realize this until I
> noticed one exception complained that a lucene method was missing.
>
> But having got it working, I cannot see anything to help me work out why the
> queries are taking too long. Is it useful for this, or just for refining your
> queries?
>
> Paul
>


Re: Lucene 3.0.3 with debug information

2011-04-29 Thread Paul Taylor

On 29/04/2011 21:14, Paul Taylor wrote:


> Hmm, maybe that is enough, I'm not sure. I'm profiling with
> YourKit Profiler and it doesn't show anything within the lucene classes,
> so I assumed this meant they didn't contain the necessary debugging
> info, but I would have thought that -g is all I need.
>
> thanks Paul

Aah, I was not using the filter in YourKit Profiler properly;
getting the info now.





RE: Lucene 3.0.3 with debug information

2011-04-29 Thread Steven A Rowe
Hi Paul,

On 4/29/2011 at 4:14 PM, Paul Taylor wrote:
> On 29/04/2011 16:03, Steven A Rowe wrote:
> > What did you find about Luke that's buggy?  Bug reports are very
> > useful; please contribute in this way.
>
> Please see previous post, in summary mistake on my part.

Okay... Which previous post?  I searched for posts by you to Lucene mailing 
lists, and found no mention of Luke other than the one complaining about bugs?

Steve



Lucene 3.0.3 with debug information

2011-04-29 Thread Dawid Weiss
This is the e-mail you're looking for, Steven (it wasn't forwarded to the
list, apparently).

Dawid

-- Forwarded message --
From: Paul Taylor 
Date: Fri, Apr 29, 2011 at 10:11 PM
Subject: Re: Lucene 3.0.3 with debug information
To: Dawid Weiss 


On 29/04/2011 15:17, Dawid Weiss wrote:

> > lucene/Search that is taking the time, I also had another attempt using
> > luke but find it incredibly buggy and of little use
>
> Can you expand on this too? What kind of "incredible bugs" did you see?
> Without feedback there is little progress, so bug reports count.
>
> Dawid

Sorry, I'll withdraw that. I was getting all kinds of stacktraces and
exceptions when I tried to do searches, but the problem was my fault. Because
I wanted to use my own analyzer, I had a shell script that added it to the
classpath when I ran luke. However, I had put it before the ant jar, and my
jar built with maven also included lucene 3.0.3; because luke 1.0.1 is
packaged with 3.0.0, this was confusing it, but I didn't realize this until I
noticed one exception complained that a lucene method was missing.

But having got it working, I cannot see anything to help me work out why the
queries are taking too long. Is it useful for this, or just for refining your
queries?

Paul
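
Editor's sketch: when two versions of a library end up on one classpath, as in the message above, it can help to ask the JVM which jar a given class was actually loaded from. The helper below uses only standard JDK calls (`Class.getProtectionDomain()` and `CodeSource.getLocation()`); the class name `ClassOrigin` and the marker string are assumptions made for this example.

```java
import java.security.CodeSource;

final class ClassOrigin {
    /**
     * Returns the location (jar or directory URL) the class was loaded from,
     * or a marker string when the JVM does not report one, as is the case
     * for JDK bootstrap classes.
     */
    static String locationOf(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        if (source == null || source.getLocation() == null) {
            return "(bootstrap or unknown)";
        }
        return source.getLocation().toString();
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // Run with the same classpath as the failing tool and pass a class
        // name, for example org.apache.lucene.index.IndexReader.
        String name = args.length > 0 ? args[0] : "java.lang.String";
        System.out.println(name + " loaded from " + locationOf(Class.forName(name)));
    }
}
```

If the printed location is not the jar you expected, an earlier classpath entry is shadowing it, which would be consistent with the "method missing" exception described above.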


Link to nightly build test reports on main Lucene site needs updating

2011-04-29 Thread Burton-West, Tom
Hello,

I went to look at the "Hudson nightly builds" and tried to follow the link from
the main Lucene page:
http://lucene.apache.org/java/docs/developer-resources.html#Nightly

The links to the Clover Test Coverage Reports point to
http://hudson.zones.apache.org/hudson/view/Lucene/job/Lucene-trunk/lastSuccessfulBuild/clover/
but apparently hudson.zones.apache.org is no longer being used. I think the
link should point to somewhere on
https://builds.apache.org/hudson/job/Lucene-trunk/.

Is this the right list to alert whoever maintains the main Lucene pages on
lucene.apache.org?

Tom




Re: Lucene 3.0.3 with debug information

2011-04-29 Thread Michael McCandless
On Fri, Apr 29, 2011 at 4:25 PM, Paul Taylor  wrote:

>> Hmm, maybe that is enough, I'm not sure. I'm profiling with YourKit Profiler
>> and it doesn't show anything within the lucene classes, so I assumed this
>> meant they didn't contain the necessary debugging info, but I would have
>> thought that -g is all I need.
>>
>> thanks Paul
>
> Aah, I was not using the filter in YourKit Profiler properly; getting
> the info now.

Right, YourKit filters out org.apache.* by default ;)  I find it amusing!

Mike




RE: Lucene 3.0.3 with debug information

2011-04-29 Thread Steven A Rowe
Thanks Dawid. – Steve

From: dawid.we...@gmail.com [mailto:dawid.we...@gmail.com] On Behalf Of Dawid Weiss
Sent: Friday, April 29, 2011 4:45 PM
To: java-user@lucene.apache.org
Cc: Steven A Rowe
Subject: Lucene 3.0.3 with debug information

This is the e-mail you're looking for, Steven (it wasn't forwarded to the list,
apparently).

Dawid

-- Forwarded message --
From: Paul Taylor <paul_t...@fastmail.fm>
Date: Fri, Apr 29, 2011 at 10:11 PM
Subject: Re: Lucene 3.0.3 with debug information
To: Dawid Weiss <dawid.we...@gmail.com>

On 29/04/2011 15:17, Dawid Weiss wrote:

> > lucene/Search that is taking the time, I also had another attempt using luke
> > but find it incredibly buggy and of little use
>
> Can you expand on this too? What kind of "incredible bugs" did you see? Without
> feedback there is little progress, so bug reports count.
>
> Dawid

Sorry, I'll withdraw that. I was getting all kinds of stacktraces and
exceptions when I tried to do searches, but the problem was my fault. Because I
wanted to use my own analyzer, I had a shell script that added it to the
classpath when I ran luke. However, I had put it before the ant jar, and my jar
built with maven also included lucene 3.0.3; because luke 1.0.1 is packaged
with 3.0.0, this was confusing it, but I didn't realize this until I noticed one
exception complained that a lucene method was missing.

But having got it working, I cannot see anything to help me work out why the
queries are taking too long. Is it useful for this, or just for refining your
queries?

Paul



[ANN] Luke 3.1.0 released

2011-04-29 Thread Andrzej Bialecki

Hi,

I'm happy to announce the release of Luke 3.1.0. This release is based 
on Lucene 3.1.0. Binaries and source code are available from the 
project's page at Google Code:


http://code.google.com/p/luke/

Changes in version 3.1.0 (released on 2011.04.30):

* Issue 35: Lucene 3.1 compatible luke version (oss.akk)
* Issue 36: XMLExporter generating invalid XML, when special characters 
are present in a TermVector field (Craig.Stires)
* Issue 17: Recent changes to DocReconstructor sometimes cause null ref 
(solrtrey)
* Issue 19: Custom directory implementation must be inherited from 
FSDirectory (mitja.lenic)
* Issue 21: luke tarball needs to extract to a "luke" directory 
(bevan.koopman, Photodeus)

* Issue 33: Term Positions increment incorrect (karolina.bernat)
* Issue 27: Cannot add or edit documents using StandardAnalyzer 
(dean.thrasher)


Thank you for contributing bug reports, patches and comments.

Happy Luke-ing!

--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com





Reusing Query instances

2011-04-29 Thread Otis Gospodnetic
Hi,

Is there any reason why one would *not* want to reuse Query instances?

I'm using MemoryIndex with a fixed set of queries and I'm executing them all on 
each new document that comes in.  Because each document needs to have many tens 
of thousands of queries executed against it, I thought I'd just run all queries 
through QueryParser once at the beginning, and then just reuse Query instances 
on each incoming document.  What I've noticed is that my fixed set of queries 
takes longer and longer to execute as time passes (more and more time is spent 
inside memoryIndex.search() somewhere).  The problem is not heap/memory - 
there is no crazy GCing and the heap is not full, but the CPU is 100% busy.

I should note that queries I'm dealing with are ugly and big, using lots of 
wildcards, but trailing and prefix ones (and this is Lucene 3.1, so no faster 
Wildcard impl).
I should also emphasize that at this point I only *suspect* that maaaybe the 
gradual slowdown I'm seeing has something to do with the fact that I'm reusing 
Query instances.

Is there any reason why one should not reuse Query instances?

Thanks,
Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/
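
Editor's sketch, not part of the original message: before blaming Query reuse, it may help to quantify the gradual slowdown described above. The class below keeps per-document timings and compares the mean of the last N samples with the mean of the first N. Everything here (the `LatencyDrift` name, the window size, and the `runAllQueries` placeholder standing in for the real MemoryIndex searches) is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class LatencyDrift {
    private final List<Long> samplesNanos = new ArrayList<>();

    /** Records the time spent running the whole query set on one document. */
    void record(long elapsedNanos) {
        samplesNanos.add(elapsedNanos);
    }

    /** Mean of the last `window` samples divided by the mean of the first `window`. */
    double driftRatio(int window) {
        int size = samplesNanos.size();
        if (window <= 0 || size < 2 * window) {
            throw new IllegalStateException("need at least " + (2 * window) + " samples");
        }
        return mean(size - window, size) / mean(0, window);
    }

    private double mean(int from, int to) {
        long sum = 0;
        for (int i = from; i < to; i++) {
            sum += samplesNanos.get(i);
        }
        return (double) sum / (to - from);
    }

    // Placeholder workload; in a real test this would run the stored queries
    // against a MemoryIndex built from the incoming document.
    static long runAllQueries(String document) {
        long checksum = 0;
        for (int i = 0; i < 10_000; i++) {
            checksum += (document.hashCode() ^ i) % 7;
        }
        return checksum;
    }

    public static void main(String[] args) {
        LatencyDrift drift = new LatencyDrift();
        long sink = 0;
        for (int doc = 0; doc < 1_000; doc++) {
            long start = System.nanoTime();
            sink += runAllQueries("document-" + doc);
            drift.record(System.nanoTime() - start);
        }
        System.out.println("last/first mean latency ratio: " + drift.driftRatio(100)
                + " (checksum " + sink + ")");
    }
}
```

Running the same measurement once with reused Query instances and once with queries re-parsed for every document would show whether the ratio differs between the two setups.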
