Re: Different ranking results

2010-07-22 Thread Ian Lea
They look the same to me too.

What does q.getClass().getName() say in each case? q.toString()?
searcher.explain(q, n)?

What version of lucene?
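
For reference, a minimal sketch of those diagnostic calls, assuming a Searcher
named searcher, the parsed Query q, and a hit docID n:

System.out.println(q.getClass().getName());
System.out.println(q.toString());
// Explanation describes how document n was scored for this query.
System.out.println(searcher.explain(q, n).toString());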


--
Ian.




On Wed, Jul 21, 2010 at 10:25 PM, Philippe  wrote:
> Hi,
>
> I just performed two queries which, in my opinion, should lead to the same
> document rankings. However, the rankings differ between these two queries.
> For better understanding I prepared minimal examples for both queries. In my
> understanding both queries perform the same task, namely searching for
> "lucene" in two different fields.
>
> Maybe someone can explain my misunderstanding?
>
>
> String[] fields = {"TITLE", "BOOK"};
> MultiFieldQueryParser parser = new MultiFieldQueryParser(Version.LUCENE_29,
> fields, new StandardAnalyzer(Version.LUCENE_29));
>
> 1.)
> Query q = parser.parse("lucene");
>
> 2.)
> Query q = parser.parse("TITLE:lucene OR BOOK:lucene");
>
> Regards,
>    philippe




Re: on-the-fly "filters" from docID lists

2010-07-22 Thread Michael McCandless
It sounds like you should implement a custom Filter?

Its getDocIdSet would consult your foreign key-value store and iterate
through the allowed docIDs, per segment.

Mike
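
A minimal sketch of such a Filter, assuming the Lucene 2.9/3.x Filter API and
a hypothetical indexed primary-key field "pk" that lets allowed keys from the
external store be mapped to docIDs:

import java.io.IOException;
import java.util.Set;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.search.DocIdSet;
import org.apache.lucene.search.Filter;
import org.apache.lucene.util.OpenBitSet;

public class AllowedDocsFilter extends Filter {

    // Stable keys the user may see, fetched from the key-value store.
    private final Set<String> allowedKeys;

    public AllowedDocsFilter(Set<String> allowedKeys) {
        this.allowedKeys = allowedKeys;
    }

    @Override
    public DocIdSet getDocIdSet(IndexReader reader) throws IOException {
        // OpenBitSet doubles as a DocIdSet; one bit per doc in this reader.
        OpenBitSet bits = new OpenBitSet(reader.maxDoc());
        TermDocs termDocs = reader.termDocs();
        try {
            for (String key : allowedKeys) {
                // Look up the docID(s) carrying this primary key.
                termDocs.seek(new Term("pk", key));
                while (termDocs.next()) {
                    bits.set(termDocs.doc());
                }
            }
        } finally {
            termDocs.close();
        }
        return bits;
    }
}

It would be used as searcher.search(query, new AllowedDocsFilter(keys), n);
wrapping it in a CachingWrapperFilter avoids rebuilding the bit set for every
search by the same user.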

On Wed, Jul 21, 2010 at 8:37 AM, Martin J  wrote:
> Hello, we are trying to implement a query type for Lucene (with eventual
> target being Solr) where the query string passed in needs to be "filtered"
> through a large list of document IDs per user. We can't store the user ID
> information in the lucene index per document so we were planning to pull the
> list of documents owned by user X from a key-value store at query time and
> then build some sort of filter in memory before doing the Lucene/Solr query.
> For example:
>
> content:"cars" user_id:X567
>
> would first pull the list of docIDs that user_id:X567 has "access" to from a
> keyvalue store and then we'd query the main index with content:"cars" but
> only allow the docIDs that came back to be part of the response. The list of
> docIDs can number in the hundreds of thousands.
>
> What should I be looking at to implement such a feature?
>
> Thank you
> Martin
>




Re: Different ranking results

2010-07-22 Thread Philippe

Hi Ian,

I'm using version 2.9.3 of Lucene.

q.getClass() and q.toString() are exactly the same in both cases:
org.apache.lucene.search.BooleanQuery
TITLE:672 BOOK:672

However, the results of searcher.explain(q,n) differ significantly. It
seems to me that "Query q = parser.parse("672");" searches only on the
BOOK field, whereas "Query q = parser.parse("TITLE:672 BOOK:672");"
searches on both fields. Do you have an explanation for this behaviour? I
have only observed this problem for this field...


I appended the result of both explain strings below:

"Query q = parser.parse("672");"
9.987344E10 = (MATCH) sum of:
  9.987344E10 = (MATCH) weight(BOOK:672 in 7078), product of:
    0.6583703 = queryWeight(BOOK:672), product of:
      6.085349 = idf(docFreq=109, maxDocs=17780)
      0.10818941 = queryNorm
    1.51697965E11 = (MATCH) fieldWeight(BOOK:672 in 7078), product of:
      3.3166249 = tf(termFreq(BOOK:672)=11)
      6.085349 = idf(docFreq=109, maxDocs=17780)
      7.5161928E9 = fieldNorm(field=BOOK, doc=7078)


"Query q = parser.parse("TITLE:672 BOOK:672");"
9.5225594E10 = (MATCH) sum of:
  9.5225594E10 = (MATCH) weight(BOOK:672 in 4979), product of:
    0.6583703 = queryWeight(BOOK:672), product of:
      6.085349 = idf(docFreq=109, maxDocs=17780)
      0.10818941 = queryNorm
    1.44638345E11 = (MATCH) fieldWeight(BOOK:672 in 4979), product of:
      3.1622777 = tf(termFreq(BOOK:672)=10)
      6.085349 = idf(docFreq=109, maxDocs=17780)
      7.5161928E9 = fieldNorm(field=BOOK, doc=4979)
  52.366344 = (MATCH) weight(TITLE:672 in 4979), product of:
    0.7526941 = queryWeight(TITLE:672), product of:
      6.957188 = idf(docFreq=45, maxDocs=17780)
      0.10818941 = queryNorm
    69.571884 = (MATCH) fieldWeight(TITLE:672 in 4979), product of:
      1.0 = tf(termFreq(TITLE:672)=1)
      6.957188 = idf(docFreq=45, maxDocs=17780)
      10.0 = fieldNorm(field=TITLE, doc=4979)

Cheers,
P.



Re: Different ranking results

2010-07-22 Thread Ian Lea
No, I don't have an explanation.  Perhaps a minimal self-contained
program or test case would help.


--
Ian.






Holding and changing index wide information

2010-07-22 Thread jan.kurella
Hi,

When using incremental updating via Solr, we want to know which update is in
the current index. Each update has a number.
How can we store/change/retrieve this number with the index? We want to store
it in the index so that it is replicated to any slaves as well.

So basically, can I store/change/retrieve a number index-wide in Lucene/Solr?

Jan


Re: Different ranking results

2010-07-22 Thread Philippe

Well,

that's difficult at the moment as I can also just reproduce this error 
for some few cases. But I will try to generate such an example..


Cheers,
Philippe

On 22.07.2010 12:34, Ian Lea wrote:
> No, I don't have an explanation.  Perhaps a minimal self-contained
> program or test case would help.
>
> --
> Ian.





Re: Holding and changing index wide information

2010-07-22 Thread Ian Lea
Just add/update a dedicated document in the index.

k=updatenumber
v=whatever.

Retrieve it with a search for k:updatenumber, update with
iw.updateDocument(whatever).


--
Ian.
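
A hedged sketch of that approach against the 2.9/3.x API, using the k/v field
names suggested above:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class UpdateNumber {

    // Store or replace the current update number as one dedicated document.
    public static void store(IndexWriter iw, long updateNumber) throws Exception {
        Document doc = new Document();
        doc.add(new Field("k", "updatenumber",
                          Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("v", Long.toString(updateNumber),
                          Field.Store.YES, Field.Index.NO));
        // Deletes any earlier k:updatenumber doc, then adds this one, so the
        // index always holds exactly one such document.
        iw.updateDocument(new Term("k", "updatenumber"), doc);
        iw.commit();
    }
}

Retrieval is then a TermQuery for k:updatenumber, reading the stored "v" field
from the single hit. Since the document lives in the index, it replicates to
slaves along with everything else.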






Re: Holding and changing index wide information

2010-07-22 Thread findbestopensource
Hi Jan,

I think you require a version number for each commit or update: say you
added 10 docs, then it is update 1; then you modified or added some more,
then it is update 2, and so on. If so, my advice would be to have fields
named field-type, version-number and version-date-time as part of the
index. You could set these fields like any other field, and retrieve the
record by filtering on the field-type value.

Regards
Aditya
www.findbestopensource.com







Question to the writer of MultiPassIndexSplitter

2010-07-22 Thread Yatir Ben Shlomo
Hi,
I heard work is being done on rewriting MultiPassIndexSplitter so it will be
single-pass and work quicker.
I was wondering if this is already done, or when it is due?
Thanks



Inserting data from multiple databases in same index

2010-07-22 Thread L Duperval
Hi,

We are creating an index containing data from two databases. What we are
trying to achieve is to make our search locate and return information no
matter where the data came from. (BTW, we are using Compass, if that matters.)

My problem is that I am not sure how to create such an index.

Do I index in two passes, one for each database, while adding the content of the
second SELECT to the first one? Or a different approach?

I'm pretty sure this is (pretty much) a FAQ but I didn't find what I was looking
for.

Thanks,

L



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Inserting data from multiple databases in same index

2010-07-22 Thread Chris Lu

You can either:
1) create one index for each database, and merge the results during search;
2) create the 2 indexes individually and merge them; or
3) merge records during the SQL select.

Approach 1) should be easy to scale linearly as your database grows.
You can even distribute the indexes onto several boxes and achieve
sharded search.
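
For option 2), a minimal sketch with the Lucene 2.9 API (paths are
placeholders; tagging each document with a stored "source" field at indexing
time would keep the origin recoverable at search time):

import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class MergeIndexes {
    public static void main(String[] args) throws Exception {
        // Two per-database indexes, built separately.
        Directory dirA = FSDirectory.open(new File("index-db-a"));
        Directory dirB = FSDirectory.open(new File("index-db-b"));
        Directory merged = FSDirectory.open(new File("index-merged"));

        IndexWriter writer = new IndexWriter(merged,
                new StandardAnalyzer(Version.LUCENE_29),
                true, IndexWriter.MaxFieldLength.UNLIMITED);
        // Copies the segments of both source indexes into the new index.
        writer.addIndexesNoOptimize(new Directory[] { dirA, dirB });
        writer.close();
    }
}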


--
Chris Lu
-
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes: 
http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
DBSight customer, a shopping comparison site, (anonymous per request) 
got 2.6 Million Euro funding!








Re: Scoring exact matches higher in a stemmed field

2010-07-22 Thread Shai Erera
> Ideally, that would be through a class or a function I can override or
> extend

How is that different than extending QP?

About the "song of songs" example -- the result you describe is already what
will happen. A document which contains just the word 'song' will score lower
than a document containing "song of songs". Also, what I'd do in such a case
is search for the phrase (in addition to the rest), 'cause documents
containing the word "songs" 100 times will score higher than the single
document that will contain "song of songs" once ...

If you just want a query "abc def" to rank higher if a document contains the
exact words, then I'd go w/ the QP extension approach, or do other
sophistication like searching for 'abc' '\"abc\"' etc. or something like
that. There are many tricks you can do on your end, w/o overriding much in
Lucene. Still, IMO extending QP is the easiest and gives you the control you
need.
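
A minimal sketch of that newTermQuery override, assuming the analyzer marks
unstemmed originals with a trailing $ (as in this thread) and an arbitrary
illustrative boost of 2.0f:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

public class ExactBoostQueryParser extends QueryParser {

    public ExactBoostQueryParser(Version v, String field, Analyzer a) {
        super(v, field, a);
    }

    @Override
    protected Query newTermQuery(Term term) {
        Query q = super.newTermQuery(term);
        // Terms carrying the $ marker are exact (unstemmed) forms; boost them.
        if (term.text().endsWith("$")) {
            q.setBoost(2.0f);
        }
        return q;
    }
}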

Shai

On Mon, Jul 19, 2010 at 9:24 PM, Itamar Syn-Hershko wrote:

> On 19/7/2010 5:50 PM, Shai Erera wrote:
>
>> If your analyzer outputs b and b$ in the same position, then the below
>> query
>> will already be what the QP output today If you want to incorporate
>> boosting, I can suggest that you extend QP, override newTermQuery for
>> example, and if the term is a stemmed term, then set the query's boost
>> (Query.setBoost) accordingly. Would that work for you?
>>
>>
> I want to avoid overriding the QP, and do this as a pluggable extension.
> What other options do I have other than what you've suggested?
>
> Ideally, that would be through a class or a function I can override or
> extend, so each term hit while searching will be examined. By checking its
> type and text (for suffix), that interface could double its weight (or
> boost). The similarity functions I mentioned could have provided this
> ability (see below). How can this be done without them?
>
>  You'll need to check whether you want to boost terms inside phrases, or
>> entire phrases, and then override more methods from QP. But that approach
>> will get you the native product of the engine, I think.
>>
> Just to make sure we are on the same page here, here's an example (assuming
> the default tf/idf implementation in Lucene).
>
> I want to make sure anyone searching for "song of songs" will find texts
> discussing the biblical book, and have them ranked the highest, instead of
> having short texts containing one word "song" score higher.
>
> So what I do is have my stemming analyzer save the string "song of songs"
> like this, where each parenthesis represents a token position: (song song$)
> (song songs$).
>
> The part I'm missing is how to score terms with suffixes higher. The best
> approach seems to be looking at the term read by IndexReader and boosting
> this finding somehow. The assumption is that if IndexReader has read the
> term songs$, it has been looked for, and therefore this is the exact word
> that has been queried for.
>
> Which is the best Lucene part to hijack for this mission?
>
>  Alternatively, you
>> can set a payload on the stemmed terms and incorporate that into
>> Similarity,
>> but that's more costly.
>>
>>
> I had mentioned Payloads - this will get me exactly what I want but, as you
> say, is quite costly when used for almost every term in the index. If I
> could replace the suffix with Payloads I would have done this (byte vs.
> byte), but I'm using the suffix for one other thing.
>
>  I don't follow -- what's been deprecated on Sim that you cannot use anymore?
>> All I see are 3 deprecated static methods which are related to norms ...
>>
>>
> In 2.3.2 there were these functions:
>
>public float idf(Term term, Searcher searcher)
>
>public float idf(Collection terms, Searcher searcher)
>
> These have been deprecated somewhere between that version and 2.9.2, and it
> seems like I could have used those for what I'm trying to do.
>
> Thanks,
>
> Itamar.


RE: on-the-fly "filters" from docID lists

2010-07-22 Thread Burton-West, Tom
Hi Mike and Martin,

We have a similar use-case. Is there a scalability/performance issue with the
getDocIdSet having to iterate through hundreds of thousands of docIDs?

Tom Burton-West
http://www.hathitrust.org/blogs/large-scale-search





Re: Scoring exact matches higher in a stemmed field

2010-07-22 Thread Itamar Syn-Hershko

On 22/7/2010 9:20 PM, Shai Erera wrote:
> How is that different than extending QP?
Mainly because the problem I'm having isn't there, and doing it from there
doesn't feel right, and definitely doesn't solve the issue. I want to
explore what other options there are before doing anything, and I started
this thread because I hit a dead end after seeing Similarity can no longer
be of help.



About the "song of songs" example -- the result you describe is already what
will happen. A document which contains just the word 'song' will score lower
than a document containing "song of songs".
Incorrect, and I have a sample app to show that (this is how I thought 
of this example for the first place).


Since while indexing the 2 words will be saved into the index as 1:(song
song$) 2:(song songs$), short documents with the one word "song" will score
higher than longer documents with "song of songs". This is a product of
Lucene's default tf/idf implementation, which cares about a field's length,
and at this stage I want to avoid replacing it (with BM25, for example).



> Also, what I'd do in such a case
> is search for the phrase (in addition to the rest), 'cause documents
> containing the word "songs" 100 times will score higher than the single
> document that will contain "song of songs" once ...
In one of my applications I am providing an "as typed" capability, which 
does exactly what you are suggesting (looking for the $-ed terms only), 
but I want my original analyzer (the one that also looks for non $-ed 
terms) to do better scoring. Without this the implementation is somewhat 
broken...



> If you just want a query "abc def" to rank higher if a document contains the
> exact words, then I'd go w/ the QP extension approach, or do other
> sophistication like searching for 'abc' '\"abc\"' etc. or something like
> that. There are many tricks you can do on your end, w/o overriding much in
> Lucene. Still, IMO extending QP is the easiest and gives you the control you
> need.
I am overriding stuff in Lucene either way. I also don't want an exact 
match of a phrase to rank higher; I want an original term (saved as-is 
with a $ marker) to score higher than a stemmed / lemmatized one 
(without the marker). Sorry if the thread's title is misleading.


I'd have used payloads if it wasn't costly. So my question is: where do 
I have control over boosting (or scoring), and also have access to the 
term's text?


Thanks,

Itamar.




Using lucene for substring matching

2010-07-22 Thread Geir Gullestad Pettersen
Hi,

I'm about to write an application that does very simple text analysis,
namely dictionary-based entity extraction. One alternative is to do
in-memory matching with substrings:

String text;        // could be any size, but normally newspaper length
List<String> matches = new ArrayList<String>();
for (String wordOrPhrase : dictionary) {
    // indexOf, not substring: returns the match position, or -1 if absent
    if (text.indexOf(wordOrPhrase) >= 0) {
        matches.add(wordOrPhrase);
    }
}

I am concerned the above code will be quite CPU intensive; it will also be
case sensitive and not leave any room for fuzzy matching.

I thought this task could also be solved by indexing every bit of text that
is to be analyzed, and then executing a query per dictionary entry:

(pseudo)

lucene.index(text)
List matches
for( String wordOrPhrase : dictionary ) {
   if( lucene.search( wordOrPhrase, text_id ) gives hit ) {
      matches.add( wordOrPhrase )
   }
}

I have not used lucene very much, so I don't know if it is a good idea or
not to use lucene for this task at all. Could anyone please share their
thoughs on this?

Thanks,
Geir
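
Lucene's contrib MemoryIndex targets exactly this one-document, many-queries
pattern; a hedged sketch (the field name and phrase quoting are assumptions,
and the lucene-memory contrib jar must be on the classpath):

import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.memory.MemoryIndex;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

public class DictionaryMatcher {

    public static List<String> match(String text, List<String> dictionary)
            throws Exception {
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_29);

        // Index the single document in RAM; analysis also folds case.
        MemoryIndex index = new MemoryIndex();
        index.addField("content", text, analyzer);

        QueryParser parser = new QueryParser(Version.LUCENE_29, "content", analyzer);
        List<String> matches = new ArrayList<String>();
        for (String wordOrPhrase : dictionary) {
            // Quote each entry so multi-word entries match as phrases.
            Query q = parser.parse("\"" + QueryParser.escape(wordOrPhrase) + "\"");
            if (index.search(q) > 0.0f) {   // search returns a score; > 0 is a hit
                matches.add(wordOrPhrase);
            }
        }
        return matches;
    }
}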


Re: on-the-fly "filters" from docID lists

2010-07-22 Thread Michael McCandless
Well, Lucene can apply such a filter rather quickly; but, your custom
code first has to build it... so it's really a question of whether
your custom code can build up / iterate the filter scalably.

Mike





How to get word importance in lucene index

2010-07-22 Thread Xaida

Hi all!

Hmmm, I need to find out how important a word is in the entire document
collection that is indexed in the Lucene index. I need to extract some
"representable words", let's say concepts that are common and representative
of the whole collection, or collection "keywords". I did the full-text
indexing, and the only field I am using is the text contents, because the
titles of the documents are mostly not representative (numbers, codes, etc.).

So, if I calculate tf-idf, it gives me the importance of a single term with
respect to a single document. But if that word is repeated across documents,
how can I calculate its total importance within the index?

All help appreciated!! Thank you!!!
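
One common approach, sketched under the assumption of a "contents" field:
iterate the index's term dictionary and rank terms by document frequency
(how many documents contain each term), optionally with stopword filtering:

import java.io.File;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.store.FSDirectory;

public class CollectionKeywords {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open(FSDirectory.open(new File("index")));
        TermEnum terms = reader.terms();
        while (terms.next()) {
            Term t = terms.term();
            if ("contents".equals(t.field())) {
                // docFreq = number of documents containing this term: a simple
                // collection-wide importance signal. Keep the top N, or weigh
                // it against total term frequency for a tf-idf-style score.
                System.out.println(t.text() + "\t" + terms.docFreq());
            }
        }
        terms.close();
        reader.close();
    }
}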





Databases

2010-07-22 Thread manjula wijewickrema
Hi,

Normally, when I am building my index directory for indexed documents, I
keep my indexed files simply in a directory called 'filesToIndex'. So in
this case, I do not use any standard database management system such as
MySQL or any other.

1) Will it be possible to use MySQL or any other for the purpose of managing
indexed documents in Lucene?

2) Is it necessary to follow such a methodology with Lucene?

3) If we do not use such a database management system, will there be any
disadvantages with a large number of indexed files?

Appreciate any reply from you.
Thanks,
Manjula.


Re: Databases

2010-07-22 Thread Glen Newton
LuSql is a tool specifically oriented to extracting from JDBC
accessible databases and indexing the contents.
You can find it here:
 http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/LuSql
User manual:
 http://cuvier.cisti.nrc.ca/~gnewton/lusql/v0.9/lusqlManual.pdf.html

A new version is coming out in the next month, but the existing one
should be fine for what you have described.
If you have any questions, just let me know.

Note that if you are interested in using Solr for your application,
the data import handler (DIH) is a very flexible way of doing what you
are describing, in a Solr context.
http://wiki.apache.org/solr/DataImportHandler

Thanks,
-Glen Newton
LuSql author
http://zzzoot.blogspot.com/







Reverse Lucene queries

2010-07-22 Thread skant
Hi all, I have an interesting problem... instead of going from a query
to a document collection, is it possible to come up with the best-fit
query for a given document collection (results)? "Best fit" being a
query which maximizes the hit scores of the resulting document
collection.

How should I approach this? All suggestions appreciated.

Thanks
Shashi




Re: on-the-fly "filters" from docID lists

2010-07-22 Thread Mark Harwood
Re scalability of filter construction - the database is likely to hold stable
primary keys, not Lucene doc ids, which are unstable in the face of updates.
You therefore need a quick way of converting stable database keys read from
the db into current Lucene doc ids to create the filter. That could involve a
lot of disk seeks unless you cache a pk->docid lookup in RAM. You should use
CachingWrapperFilter too, to cache the computed user permissions from one
search to the next.
This can get messy. If the access permissions are centred around roles/groups,
it is normally faster to tag docs with these group names and query them with
the list of roles the user holds.
If individual user-doc-level perms are required, you could also consider
dynamically looking up perms for just the top n results being shown, at the
risk of needing to repeat the query with a larger n if insufficient matches
pass the lookup.

Cheers 
Mark
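
A sketch of the role/group tagging Mark describes, assuming an indexed
"acl_group" field on each document:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.CachingWrapperFilter;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.QueryWrapperFilter;
import org.apache.lucene.search.TermQuery;

public class AclFilters {

    // Build a filter matching any of the user's roles; CachingWrapperFilter
    // keeps the computed bits from one search to the next.
    public static Filter roleFilter(Iterable<String> userRoles) {
        BooleanQuery roles = new BooleanQuery();
        for (String role : userRoles) {
            roles.add(new TermQuery(new Term("acl_group", role)),
                      BooleanClause.Occur.SHOULD);
        }
        return new CachingWrapperFilter(new QueryWrapperFilter(roles));
    }
}

The content query then runs as searcher.search(contentQuery,
roleFilter(userRoles), n). Note that for the cache to help, the same Filter
instance must be reused across searches.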


