Re: Questions about the new query parser framework

2010-05-03 Thread Daniel Noll
On Mon, May 3, 2010 at 15:11, Adriano Crestani wrote:
> I actually never liked how QueryNode -> query string is done today, using QueryNode.toQueryString(...) method. A QueryNode shouldn't be responsible for converting itself back to the string format, because different SyntaxParser(s) may c

Re: Relevancy Practices

2010-05-03 Thread Ivan Provalov
Grant, we are currently working on a relevancy improvement project. We took IBM's paper from TREC 2007 and followed the approaches it describes to improve Lucene's relevance. It also gave us some idea of Lucene’s out-of-the-box precision performance (MAP). In addition to it we used som

Which way would you recommend for successive-words similarity and scoring?

2010-05-03 Thread Pablo
Hello, Lucene core doesn't seem to use relative word positioning (?) for scoring. For example, after indexing the phrase "a b c d e f g h i j k l m n o p q r s t u v w x y z", these queries give the same result (0.19308087):
 - 1 : phrase:'e f g'
 - 2 : phrase:'o k z'
I'm a bit familiar with lucen

Re: Relevancy Practices

2010-05-03 Thread Peter Keegan
We discovered very soon after going to production that Lucene's scores were often 'too precise'. For example, a page of 25 results may have several different score values, and all within 15% of each other, but to the end user all 25 results were equally relevant. Thus we wanted the secondary sort f

Re: Using IndexReader in the web environment

2010-05-03 Thread Erick Erickson
The quick answer is that the session is probably the wrong place to keep an IndexReader, since that's per-user. I'd define a new server/servlet that did my searching and have my webapps use that. Makes it really simple to re-use index readers. And reopening the IndexReader for each request will p

AW: Relevancy Practices

2010-05-03 Thread Uwe Goetzke
Regarding Part3: Data quality
For our search domain (catalog products) we very often face the problem that the search data is full of acronyms and abbreviations like:
cable,nym-j,pvc,3x2.5mm²
or
dvd-/cd-/usb-carradio,4x50W,divx,bl
We solved this by a combination of normalization for better data

Using IndexReader in the web environment

2010-05-03 Thread Vijay Veeraraghavan
Hi all, in a clustered environment I search the index from the web application. In the web application I am creating an IndexReader on each request. Is it expensive to do it like this? I read somewhere on the web that one should try to use the same reader as much as possible. Can I keep the initially created IndexR

Re: Indexing only newly created files

2010-05-03 Thread Vijay Veeraraghavan
Dear all, regarding the reply below: if I search the index for each document again, skip it if found and index it otherwise, is this not similar to indexing all the PDF documents once again? Is this not overhead? As I am not going to index the details of the PDF (so if an indexed PDF was recreated I n

Re: Indexing only newly created files

2010-05-03 Thread Vijay Veeraraghavan
Dear Mr. Simon, thanks for your reply; I found it very useful. I have another question: I create the index in a clustered environment (2 physical systems and 2 virtual). A shared system among the nodes is where this index will be created. The scheduler runs in another remote system which will create an

Re: Indexing only newly created files

2010-05-03 Thread Simon Willnauer
Hey there, you might have to implement some kind of unique identifier using an indexed Lucene field. When you are indexing you should fire a query with the uuid of your document (maybe the path to your PDF document) and check if the document is in the index already. You could also do a boolean qu

Indexing only newly created files

2010-05-03 Thread Vijay Veeraraghavan
Dear all, I am using Lucene 3.0 to index the PDF reports that I generate dynamically. I index the PDF file name (without extension), file path and its absolute path as fields. I search with the file name without extension; it retrieves a list, as usually 2 or more files are present with the same name