Re: Does anyone have tips on managing cached filters?

2012-11-27 Thread Robert Muir
On Wed, Nov 28, 2012 at 12:27 AM, Trejkaz  wrote:

> On Wed, Nov 28, 2012 at 2:09 AM, Robert Muir  wrote:
> >
> > I don't understand how a filter could become invalid even though the
> reader
> > has not changed.
>
> I did state two ways in my last email, but just to re-iterate:
>
> (1): The filter reflects a query constructed from lines in a text
> file. If some other application modifies the text file, that filter is
> now invalid.
>
> (2): The filter reflects the results of an SQL query against a
> separate database. If someone inserts a new value into that table,
> then that filter is now invalid.
>
> Case 1 occurs for things like word lists. Case 2 occurs for things
> like tags. Neither of these would ever be possible to implement purely
> using Lucene, so it is a fact of life that they will become invalid
> for reasons other than the reader changing.
>
>
My point is really that Lucene (this is especially clear in 4.0) assumes
IndexReaders are immutable points in time. I don't think it makes sense for
us to provide filter caching or anything similar on any other basis, because
this is a key simplification to the design. If you depart from it, by scoring
or filtering from mutable data outside the inverted index, things are likely
to get complicated.
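
To illustrate why the immutable point-in-time assumption keeps caching simple, here is a minimal sketch of a cache keyed on a reader identity. It is plain Java, not Lucene's CachingWrapperFilter code: the class name PerReaderCache and the use of bare Object keys in place of reader (segment) keys are assumptions made for illustration. Because a key never changes meaning, an entry never needs explicit invalidation; it simply becomes collectable once the key is no longer referenced.

```java
import java.util.Map;
import java.util.WeakHashMap;
import java.util.function.Function;

/** Hypothetical per-reader cache; keys stand in for immutable reader (segment) identities. */
final class PerReaderCache<V> {
    // Weak keys: an entry becomes collectable once nothing else references its key.
    private final Map<Object, V> cache = new WeakHashMap<Object, V>();

    /** Returns the cached value for readerKey, computing it at most once per key. */
    synchronized V get(Object readerKey, Function<Object, V> compute) {
        V value = cache.get(readerKey);
        if (value == null) {
            value = compute.apply(readerKey);
            cache.put(readerKey, value);
        }
        return value;
    }
}
```

Note that nothing in this sketch can notice a change outside the key, which is exactly the limitation being discussed in this thread.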


Re: Does anyone have tips on managing cached filters?

2012-11-27 Thread Trejkaz
On Wed, Nov 28, 2012 at 2:09 AM, Robert Muir  wrote:
>
> I don't understand how a filter could become invalid even though the reader
> has not changed.

I did state two ways in my last email, but just to re-iterate:

(1): The filter reflects a query constructed from lines in a text
file. If some other application modifies the text file, that filter is
now invalid.

(2): The filter reflects the results of an SQL query against a
separate database. If someone inserts a new value into that table,
then that filter is now invalid.

Case 1 occurs for things like word lists. Case 2 occurs for things
like tags. Neither of these would ever be possible to implement purely
using Lucene, so it is a fact of life that they will become invalid
for reasons other than the reader changing.

TX

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



native, versioned XML-DBMS (that is full text search in versioned document collections)

2012-11-27 Thread Johannes.Lichtenberger

Hello,

As posted some time ago, I'm working on a native, versioned XML-DBMS [1].
I'd like to provide a full-text index, and I recently read about
customized Codecs which can be plugged in. Usually data (for instance
XML nodes) are stored on RecordPages. I'm still not sure whether it is
possible, and makes sense, to implement PostingsFormat and possibly Directory.


What I want to achieve is to be able to use my infrastructure for
transaction-safe versioning. That is, I need some kind of record for the
different types (I think fields, terms, documents and term positions),
with a simple record ID to retrieve the record from disk and a tag saying
which kind of record it is. Beyond that, all I need is a
serialization/deserialization mechanism for each record type. Probably I can
simply reuse the default serialization/deserialization routine. I'm also not
sure whether it would be worthwhile to provide a B+-tree implementation which
always clusters, for instance, the fields, then the terms, then the documents
and the term positions. I don't know what index structure Lucene uses by
default, but I think it must be something which performs well with any kind
of disk (reading/writing blocks of data).
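
To make the record idea concrete, here is a rough sketch of what I have in mind, in plain Java. Everything in it is hypothetical (the class IndexRecord, the Kind enum, and the byte layout); it uses no Lucene or sirix classes and only shows a record that carries its kind and record ID and can be written to and read back from bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical index record: a kind tag, a record ID, and an opaque body. */
final class IndexRecord {
    enum Kind { FIELD, TERM, DOCUMENT, POSITION }

    final Kind kind;
    final long recordId;
    final byte[] body;

    IndexRecord(Kind kind, long recordId, byte[] body) {
        this.kind = kind;
        this.recordId = recordId;
        this.body = body;
    }

    /** Layout: 1 byte kind ordinal, 8 bytes record ID, 4 bytes body length, body bytes. */
    byte[] serialize() {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeByte(kind.ordinal());
            out.writeLong(recordId);
            out.writeInt(body.length);
            out.write(body);
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads back a record written by serialize(). */
    static IndexRecord deserialize(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            Kind kind = Kind.values()[in.readUnsignedByte()];
            long recordId = in.readLong();
            byte[] body = new byte[in.readInt()];
            in.readFully(body);
            return new IndexRecord(kind, recordId, body);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The kind tag is stored first so that a page reader could decide how to interpret the body before reading it.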


Any hints and suggestions would be nice.

kind regards,
Johannes

[1] https://github.com/JohannesLichtenberger/sirix




Re: What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-11-27 Thread Michael McCandless
Flexible indexing is the ability to make your own codec, which
controls the reading and writing of all index parts (postings, stored
fields, term vectors, deleted docs, etc.).

So for example if you want to store some postings as a bit set instead
of the block format that's the default coming up in 4.1, that's easy
to do.

But what is less easy (as I described below) is changing what is
actually stored in the postings, eg adding a new per-position
attribute.

The original goal was to allow arbitrary attributes beyond the known
docs/freqs/positions/offsets that Lucene supports today, so that you
could easily make new application-dependent per-term, per-doc,
per-position things, pull them from the analyzer, save them to the
index, and access them from an IndexReader / query, but while some
APIs do expose this, it's not very well explored yet (eg, you'd have
to make a custom indexing chain to get the attributes "through"
IndexWriter down to your codec).  It would be great to make progress
making this easier, so ideas are very welcome :)
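
Until that is easier, the workaround suggested earlier in this thread is to pack extra per-position values into a payload. A minimal sketch of such packing follows, in plain Java with java.nio.ByteBuffer only. The class PostingPayload is hypothetical, and its three fields (score, segId, timeStamp) are borrowed from the TestPostingItem example in the quoted message; the Lucene side (writing the bytes from an analyzer, reading them back at search time) is deliberately left out.

```java
import java.nio.ByteBuffer;

/** Hypothetical fixed-width packing of per-position values into one payload byte array. */
final class PostingPayload {
    static final int SIZE = 4 + 4 + 8; // float + int + long

    final float score;
    final int segId;
    final long timeStamp;

    PostingPayload(float score, int segId, long timeStamp) {
        this.score = score;
        this.segId = segId;
        this.timeStamp = timeStamp;
    }

    /** Encodes the three values in a fixed order (big-endian, ByteBuffer's default). */
    byte[] encode() {
        return ByteBuffer.allocate(SIZE)
                .putFloat(score)
                .putInt(segId)
                .putLong(timeStamp)
                .array();
    }

    /** Decodes from a slice of a larger array, starting at offset. */
    static PostingPayload decode(byte[] bytes, int offset) {
        ByteBuffer buf = ByteBuffer.wrap(bytes, offset, SIZE);
        float score = buf.getFloat();
        int segId = buf.getInt();
        long timeStamp = buf.getLong();
        return new PostingPayload(score, segId, timeStamp);
    }
}
```

The offset parameter is there because payload bytes are often handed back as a slice of a shared buffer rather than as a standalone array.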

Mike McCandless

http://blog.mikemccandless.com

On Tue, Nov 27, 2012 at 3:37 PM, Wu, Stephen T., Ph.D.
 wrote:
> Following up on a previous question...
> What is "flexible indexing" in Lucene 4.0?  We assumed it was the ability to
> easily make new postings formats/codecs -- but a response below says that
> would be "tricky"?
>
> stephen
>
>
> On 11/27/12 11:48 AM, "David Causse"  wrote:
>
>> Hi,
>>
>> We use payloads but we can't use the whole lucene API.
>> For example we use it to do some relation query for example :
>>
>> @quote(@speaker(obama) @discourse(health))
>>
>> Search for all documents that contains a quote by Obama talking about
>> health.
>> We encode linguistic informations (standoff annotations) inside payloads
>> and use custom search API to query the index.
>> I didn't found a convenable way to attach my code to lucene
>> Query/Scorer/Weight API. Like SpanQuery you have to rewrite the whole
>> Query stack.
>> In short if you want to go with Payloads that do more than boosting a
>> term there's chances that you'll need to rewrite a big part of the query
>> stack.
>>
>>
>> On 27/11/2012 16:59, Wu, Stephen T., Ph.D. wrote:
>>> I think we're looking at doing something related.  I haven't explored the
>>> Enums or know how to make a postings codec... But what is "flexible
>>> indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?
>>>
>>> We're trying to incorporate attributes onto terms/spans in indexes.  We'd
>>> also like to try out some interesting ways to score things that go beyond
>>> just tokens.
>>>
>>> We were considering using Attributes instead of Payloads, because it seems
>>> like using Payloads ties you to a particular kind of scoring -- just a
>>> weight on a token.  Can Payloads be used for more general scoring functions?
>>> E.g., considering a span of text alongside multiple Payloads?
>>>
>>> Does it make sense to move outside of Payloads here?
>>>
>>> Thanks!
>>>
>>> stephen
>>>
>>>
>>>
>>>
>>> On 11/19/12 8:14 AM, "Michael McCandless"  wrote:
>>>
>>>> A new postings format would be tricky because you have new attributes
>>>> you want to index.
>>>>
>>>> The DocsAndPositionsEnum does have an attributes source, but this is
>>>> not well explored, and there are known problems (they can't be easily
>>>> merged in the composite reader case).
>>>>
>>>> So that's why I suggested packing your information into a payload ...
>>>>
>>>> Mike McCandless
>>>>
>>>> http://blog.mikemccandless.com
>>>>
>>>> On Sun, Nov 18, 2012 at 8:33 PM, wgggfiy  wrote:
>>>>> thx, mike.
>>>>> about the 3th question, "encode them all into the payload" is better than
>>>>> "a new postings format with the codec" ??
>>>>> I mean replace the orginal posting item (position, startOffset, endOffset,
>>>>> payload) with my own inverted item such as
>>>>> class TestPostingItem
>>>>> {
>>>>>  int termId;
>>>>>  long startOffset;
>>>>>  long endOffset;
>>>>>  float score;
>>>>>  int segId;
>>>>>  long timeStamp;
>>>>> }
>>>>> ?
>>>>>
>>>>> --
>>>>> View this message in context:
>>>>> http://lucene.472066.n3.nabble.com/what-is-the-offsets-and-payload-in-DocsAndPositionsEnum-for-tp4020933p4020968.html
>>>>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>>>>>

What is "flexible indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-11-27 Thread Wu, Stephen T., Ph.D.
Following up on a previous question...
What is "flexible indexing" in Lucene 4.0?  We assumed it was the ability to
easily make new postings formats/codecs -- but a response below says that
would be "tricky"?

stephen


On 11/27/12 11:48 AM, "David Causse"  wrote:

> Hi,
> 
> We use payloads but we can't use the whole lucene API.
> For example we use it to do some relation query for example :
> 
> @quote(@speaker(obama) @discourse(health))
> 
> Search for all documents that contains a quote by Obama talking about
> health.
> We encode linguistic informations (standoff annotations) inside payloads
> and use custom search API to query the index.
> I didn't found a convenable way to attach my code to lucene
> Query/Scorer/Weight API. Like SpanQuery you have to rewrite the whole
> Query stack.
> In short if you want to go with Payloads that do more than boosting a
> term there's chances that you'll need to rewrite a big part of the query
> stack.
> 
> 
> On 27/11/2012 16:59, Wu, Stephen T., Ph.D. wrote:
>> I think we're looking at doing something related.  I haven't explored the
>> Enums or know how to make a postings codec... But what is "flexible
>> indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?
>> 
>> We're trying to incorporate attributes onto terms/spans in indexes.  We'd
>> also like to try out some interesting ways to score things that go beyond
>> just tokens.
>> 
>> We were considering using Attributes instead of Payloads, because it seems
>> like using Payloads ties you to a particular kind of scoring -- just a
>> weight on a token.  Can Payloads be used for more general scoring functions?
>> E.g., considering a span of text alongside multiple Payloads?
>> 
>> Does it make sense to move outside of Payloads here?
>> 
>> Thanks!
>> 
>> stephen
>> 
>> 
>> 
>> 
>> On 11/19/12 8:14 AM, "Michael McCandless"  wrote:
>> 
>>> A new postings format would be tricky because you have new attributes
>>> you want to index.
>>> 
>>> The DocsAndPositionsEnum does have an attributes source, but this is
>>> not well explored, and there are known problems (they can't be easily
>>> merged in the composite reader case).
>>> 
>>> So that's why I suggested packing your information into a payload ...
>>> 
>>> Mike McCandless
>>> 
>>> http://blog.mikemccandless.com
>>> 
>>> On Sun, Nov 18, 2012 at 8:33 PM, wgggfiy  wrote:
>>>> thx, mike.
>>>> about the 3th question, "encode them all into the payload" is better than
>>>> "a new postings format with the codec" ??
>>>> I mean replace the orginal posting item (position, startOffset, endOffset,
>>>> payload) with my own inverted item such as
>>>> class TestPostingItem
>>>> {
>>>>  int termId;
>>>>  long startOffset;
>>>>  long endOffset;
>>>>  float score;
>>>>  int segId;
>>>>  long timeStamp;
>>>> }
>>>> ?
>>>>
>>>> --
>>>> View this message in context:
>>>> http://lucene.472066.n3.nabble.com/what-is-the-offsets-and-payload-in-DocsAndPositionsEnum-for-tp4020933p4020968.html
>>>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
 



Re: info on how lucene conducsts a search?

2012-11-27 Thread Ian Lea
As you can tell from the title, Lucene In Action is more about using
lucene than how it works internally, but yes, it is good and is worth
buying.  If you're worried about how up to date it is, keep a copy of
the release notes and migration guides for later versions to hand.


--
Ian.


On Tue, Nov 27, 2012 at 4:19 PM, geeky2  wrote:
> hello,
>
> thanks for the info.
>
> as you suggested - i did do a general search and found this slide
> presentation - which had some good general info.  i am not sure what the
> source of this preso, how qualified the author (although he/she seems very
> good) or how current the information is?
>
> http://www.slideshare.net/nitin_stephens/lucene-basics#btnNext
>
> i have been working with solr for over a year - but feel like i am missing
> the larger picture and want to know more.
>
> is the lucene in action book good and worth buying - it looks like it covers
> lucene 3.0 but may be 2 years old now.
>
> thx
> mark
>
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/info-on-how-lucene-conducsts-a-search-tp4022665p4022676.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>



Re: handling different scores related to queries

2012-11-27 Thread Jack Krupansky
Call the IndexSearcher#explain method to get the technical details on how any
query is scored. Call Explanation#toString to get the English description
for the scoring.


Or, using Solr, add the &debugQuery=true parameter to your query request and 
look at the "explain" section for scoring calculations.


Some of these complex queries are "constant score" for performance reasons.

-- Jack Krupansky

-Original Message- 
From: sri krishna

Sent: Tuesday, November 27, 2012 12:38 PM
To: java-user
Subject: handling different scores related to queries

For a search string such as hello*~, how is the score calculated?

The formula given at
http://lucene.apache.org/core/old_versioned_docs/versions/3_0_1/api/core/org/apache/lucene/search/Similarity.html
doesn't take edit distance (Levenshtein distance) or prefix-match
factors into account.

Does Lucene add up the scores obtained from each type of query included, i.e.
for the above query, actual score = default scoring + 1/(edit distance) + prefix
match score? If so, there is no normalization between the scores. Otherwise, what
approach does Lucene follow, from separating each query based on
identifiers such as ~ (edit distance) and * (prefix query) through to the
actual scoring?






Re: How does lucene handle the wildcard and fuzzy queries ?

2012-11-27 Thread Jack Krupansky
The proper answer to all of these questions is the same and very simple: If 
you want "internal" details, read the source code first. If you have 
specific questions then, fine, ask specific questions - but only after 
you've checked the code first.


Also, questions or issues related to "internals" aren't appropriate on 
"user" lists.


-- Jack Krupansky

-Original Message- 
From: sri krishna

Sent: Tuesday, November 27, 2012 12:36 PM
To: java-user@lucene.apache.org
Subject: How does lucene handle the wildcard and fuzzy queries ?

How does Lucene handle prefix (wildcard) queries and fuzzy queries
internally?

Lucene stores data in the form of an inverted index in segments, i.e. term->doc
IDs. How does it search for a word in the term list efficiently? And how does
it handle the advanced queries on the same inverted index?
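
For readers following the question, the term->doc ID layout can be pictured with a toy in-memory model. The sketch below is plain Java and is not how Lucene implements its term dictionary or its wildcard and fuzzy queries; the class ToyInvertedIndex and its methods are hypothetical. It only shows that, with a sorted term dictionary, an exact lookup is a single dictionary probe and a prefix lookup is a seek followed by a walk over a contiguous range of terms, with no scan over documents.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

/** Toy inverted index: a sorted term dictionary mapping each term to sorted doc IDs. */
final class ToyInvertedIndex {
    private final TreeMap<String, TreeSet<Integer>> postings =
            new TreeMap<String, TreeSet<Integer>>();

    /** Records that docId contains each of the given terms. */
    void add(int docId, String... terms) {
        for (String term : terms) {
            TreeSet<Integer> docs = postings.get(term);
            if (docs == null) {
                docs = new TreeSet<Integer>();
                postings.put(term, docs);
            }
            docs.add(docId);
        }
    }

    /** Exact term lookup: one probe of the sorted dictionary. */
    List<Integer> search(String term) {
        TreeSet<Integer> docs = postings.get(term);
        return docs == null ? new ArrayList<Integer>() : new ArrayList<Integer>(docs);
    }

    /** Prefix lookup: seek to the first term >= prefix, then walk while terms still match. */
    List<Integer> searchPrefix(String prefix) {
        TreeSet<Integer> result = new TreeSet<Integer>();
        for (Map.Entry<String, TreeSet<Integer>> e : postings.tailMap(prefix, true).entrySet()) {
            if (!e.getKey().startsWith(prefix)) {
                break; // sorted order: once a term stops matching, no later term matches
            }
            result.addAll(e.getValue());
        }
        return new ArrayList<Integer>(result);
    }
}
```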


Thanks 






Re: what is the offsets and payload in DocsAndPositionsEnum for ??

2012-11-27 Thread David Causse

Hi,

We use payloads, but we can't use the whole Lucene API.
For example, we use them for relation queries such as:

@quote(@speaker(obama) @discourse(health))

This searches for all documents that contain a quote by Obama talking about
health.
We encode linguistic information (standoff annotations) inside payloads
and use a custom search API to query the index.
I didn't find a convenient way to attach my code to the Lucene
Query/Scorer/Weight API. As with SpanQuery, you have to rewrite the whole
Query stack.
In short, if you want to use payloads for more than boosting a
term, chances are you'll need to rewrite a big part of the query
stack.



On 27/11/2012 16:59, Wu, Stephen T., Ph.D. wrote:

> I think we're looking at doing something related.  I haven't explored the
> Enums or know how to make a postings codec... But what is "flexible
> indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?
>
> We're trying to incorporate attributes onto terms/spans in indexes.  We'd
> also like to try out some interesting ways to score things that go beyond
> just tokens.
>
> We were considering using Attributes instead of Payloads, because it seems
> like using Payloads ties you to a particular kind of scoring -- just a
> weight on a token.  Can Payloads be used for more general scoring functions?
> E.g., considering a span of text alongside multiple Payloads?
>
> Does it make sense to move outside of Payloads here?
>
> Thanks!
>
> stephen
>
>
> On 11/19/12 8:14 AM, "Michael McCandless"  wrote:
>
>> A new postings format would be tricky because you have new attributes
>> you want to index.
>>
>> The DocsAndPositionsEnum does have an attributes source, but this is
>> not well explored, and there are known problems (they can't be easily
>> merged in the composite reader case).
>>
>> So that's why I suggested packing your information into a payload ...
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>> On Sun, Nov 18, 2012 at 8:33 PM, wgggfiy  wrote:
>>> thx, mike.
>>> about the 3th question, "encode them all into the payload" is better than
>>> "a new postings format with the codec" ??
>>> I mean replace the orginal posting item (position, startOffset, endOffset,
>>> payload) with my own inverted item such as
>>> class TestPostingItem
>>> {
>>>  int termId;
>>>  long startOffset;
>>>  long endOffset;
>>>  float score;
>>>  int segId;
>>>  long timeStamp;
>>> }
>>> ?
>>>
>>> --
>>> View this message in context:
>>> http://lucene.472066.n3.nabble.com/what-is-the-offsets-and-payload-in-DocsAndPositionsEnum-for-tp4020933p4020968.html
>>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.







--
David Causse
Spotter
http://www.spotter.com/





Re: info on how lucene conducsts a search?

2012-11-27 Thread Apostolis Xekoukoulotakis
http://wiki.apache.org/lucene-java/LucenePapers

Many people have come to this list asking the same question, including
myself.

Most answers are practical ones.

But Lucene has so many interesting ideas in it, which trigger everyone's
academic curiosity, regardless of the practical results.



2012/11/27 geeky2 

> Ian Lea wrote
> >
> > The question on cores might be better asked on the solr list, assuming
> > you are talking about Solr cores.  But I bet the answer will be a
> > variant on either "it depends" or, my favourite, "whatever works for
> > you".
>
> yes - i am referring to solr cores.
>
> i was hoping to find a more academic explanation to a few of my questions.
> for example - is a lucene search done as a "full table scan" and therefore
> linear in performance or O(n)??
>
> knowing things like this - would help me make better core/index design
> decisions (along with other factors - of course).
>
> thx
> mark
>
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/info-on-how-lucene-conducsts-a-search-tp4022665p4022683.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
>
>


-- 


Sincerely yours,

 Apostolis Xekoukoulotakis


Re: info on how lucene conducsts a search?

2012-11-27 Thread geeky2
Ian Lea wrote
> 
> The question on cores might be better asked on the solr list, assuming
> you are talking about Solr cores.  But I bet the answer will be a
> variant on either "it depends" or, my favourite, "whatever works for
> you".

yes - i am referring to solr cores.

i was hoping to find a more academic explanation to a few of my questions. 
for example - is a lucene search done as a "full table scan" and therefore
linear in performance or O(n)??

knowing things like this - would help me make better core/index design
decisions (along with other factors - of course).

thx
mark




--
View this message in context: 
http://lucene.472066.n3.nabble.com/info-on-how-lucene-conducsts-a-search-tp4022665p4022683.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.




Re: info on how lucene conducsts a search?

2012-11-27 Thread geeky2
hello,

thanks for the info.

as you suggested - i did do a general search and found this slide
presentation - which had some good general info.  i am not sure what the
source of this preso is, how qualified the author is (although he/she seems
very good), or how current the information is.

http://www.slideshare.net/nitin_stephens/lucene-basics#btnNext

i have been working with solr for over a year - but feel like i am missing
the larger picture and want to know more.

is the lucene in action book good and worth buying - it looks like it covers
lucene 3.0 but may be 2 years old now.

thx
mark




--
View this message in context: 
http://lucene.472066.n3.nabble.com/info-on-how-lucene-conducsts-a-search-tp4022665p4022676.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.




Re: info on how lucene conducsts a search?

2012-11-27 Thread Ian Lea
http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/package-summary.html#package_description
might help.  Or Google something like "how does lucene work".

The question on cores might be better asked on the solr list, assuming
you are talking about Solr cores.  But I bet the answer will be a
variant on either "it depends" or, my favourite, "whatever works for
you".


--
Ian.

On Tue, Nov 27, 2012 at 3:55 PM, geeky2  wrote:
> Hello all,
>
> can someone point me to info or docs on how a lucene search is conducted?
>
> i would like to have a better understanding of how this works in general -
> but also from a design perspective.
>
> for instance - a question that keeps coming up is, should we add content to
> a given core - or break it out in to another core - for performance reasons.
>
> thx
> mark
>
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/info-on-how-lucene-conducsts-a-search-tp4022665.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>



Re: what is the offsets and payload in DocsAndPositionsEnum for ??

2012-11-27 Thread Wu, Stephen T., Ph.D.
I think we're looking at doing something related.  I haven't explored the
Enums or know how to make a postings codec... But what is "flexible
indexing" in Lucene 4.0 if it's not the ability to make new postings codecs?

We're trying to incorporate attributes onto terms/spans in indexes.  We'd
also like to try out some interesting ways to score things that go beyond
just tokens. 

We were considering using Attributes instead of Payloads, because it seems
like using Payloads ties you to a particular kind of scoring -- just a
weight on a token.  Can Payloads be used for more general scoring functions?
E.g., considering a span of text alongside multiple Payloads?

Does it make sense to move outside of Payloads here?

Thanks!

stephen




On 11/19/12 8:14 AM, "Michael McCandless"  wrote:

> A new postings format would be tricky because you have new attributes
> you want to index.
> 
> The DocsAndPositionsEnum does have an attributes source, but this is
> not well explored, and there are known problems (they can't be easily
> merged in the composite reader case).
> 
> So that's why I suggested packing your information into a payload ...
> 
> Mike McCandless
> 
> http://blog.mikemccandless.com
> 
> On Sun, Nov 18, 2012 at 8:33 PM, wgggfiy  wrote:
>> thx, mike.
>> about the 3th question, "encode them all into the payload" is better than
>> "a new postings format with the codec" ??
>> I mean replace the orginal posting item (position, startOffset, endOffset,
>> payload) with my own inverted item such as
>> class TestPostingItem
>> {
>> int termId;
>> long startOffset;
>> long endOffset;
>> float score;
>> int segId;
>> long timeStamp;
>> }
>> ?
>> 
>> 
>> 
>> 
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/what-is-the-offsets-and-payload-in-DocsAnd
>> PositionsEnum-for-tp4020933p4020968.html
>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>> 
> 





info on how lucene conducsts a search?

2012-11-27 Thread geeky2
Hello all,

can someone point me to info or docs on how a lucene search is conducted?

i would like to have a better understanding of how this works in general -
but also from a design perspective.

for instance - a question that keeps coming up is, should we add content to
a given core - or break it out in to another core - for performance reasons.

thx
mark




--
View this message in context: 
http://lucene.472066.n3.nabble.com/info-on-how-lucene-conducsts-a-search-tp4022665.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.




Re: Does anyone have tips on managing cached filters?

2012-11-27 Thread Robert Muir
On Tue, Nov 27, 2012 at 6:17 AM, Trejkaz  wrote:

>
> Ah, yeah... I should have been clearer on what I meant there.
>
> If you want to make a filter which relies on data that isn't in the
> index, there is no mechanism for invalidation. One example of it is if
> you have a filter which essentially constructs a query based on the
> contents of a text file (like a word list.) Another example is with
> tagging, with the tags stored in an external database.
>

I don't understand how a filter could become invalid even though the reader
has not changed.

If this is the case in your design, then you have much bigger problems.


Re: Does anyone have tips on managing cached filters?

2012-11-27 Thread Trejkaz
On Tue, Nov 27, 2012 at 9:31 AM, Robert Muir  wrote:
> On Thu, Nov 22, 2012 at 11:10 PM, Trejkaz  wrote:
>
>>
>> As for actually doing the invalidation, CachingWrapperFilter itself
>> doesn't appear to have any mechanism for invalidation at all, so I
>> imagine I will be building a variation of it with additional methods
>> to invalidate parts of the cache.
>>
>>
> Actually it does, it uses a weakhashmap keyed on either the segment
> (core+deletes) or just the segment's core.

Ah, yeah... I should have been clearer on what I meant there.

If you want to make a filter which relies on data that isn't in the
index, there is no mechanism for invalidation. One example of it is if
you have a filter which essentially constructs a query based on the
contents of a text file (like a word list.) Another example is with
tagging, with the tags stored in an external database.

At the moment we use a separate level of filter cache which asks the
contained filter whether it's still OK to use (if the timestamp on the
file changes, it gets ejected from the cache.) I suspect the same
cache is useful anyway, as it also holds onto the filter instances so
that they don't get collected too soon (filters can come out of our
query parser, so the caller can't conveniently hold onto the instances
in all cases. Sometimes they do two similar queries which happen to
call the same filter, so caching the entire resulting query doesn't
help either.)
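
Roughly, the second-level cache works like the sketch below. This is a simplified, plain-Java illustration rather than our actual code: the class StampedCache and its method are hypothetical names, and the "stamp" is any long that changes when the external data changes (for a word list file, something like `() -> file.lastModified()` on a java.io.File).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.function.LongSupplier;

/** Hypothetical second-level cache whose entries are ejected when an external stamp changes. */
final class StampedCache<K, V> {
    private static final class Entry<V> {
        final long stamp;
        final V value;

        Entry(long stamp, V value) {
            this.stamp = stamp;
            this.value = value;
        }
    }

    private final Map<K, Entry<V>> entries = new HashMap<K, Entry<V>>();

    /**
     * Returns the cached value if the external stamp is unchanged since it was built;
     * otherwise rebuilds the value and remembers the new stamp.
     */
    synchronized V get(K key, LongSupplier currentStamp, Function<K, V> build) {
        long now = currentStamp.getAsLong();
        Entry<V> entry = entries.get(key);
        if (entry == null || entry.stamp != now) {
            entry = new Entry<V>(now, build.apply(key));
            entries.put(key, entry);
        }
        return entry.value;
    }
}
```

The map holds strong references on purpose, so the cached instances stay alive even when the caller does not keep them.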

An interesting, somewhat-related issue is that for some filters, we
can't keep the contents of the file itself in memory due to size
limits, so we have to read it on the fly. When there are multiple
segments, the file gets read multiple times. So it's a rare case where
computing the filter across all readers might actually come out faster
than computing it per-segment...
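
One way to picture the "compute once across all readers" idea is the sketch below: the file is read a single time into one index-wide bit set, and each segment then takes a re-based slice of it. This is plain Java using java.util.BitSet and is only an assumption about how it could be done; the class GlobalFilterBits is hypothetical, and docBase and maxDoc here simply mean a segment's starting document number and document count.

```java
import java.util.BitSet;

/** Hypothetical holder for a filter computed once over the whole index, then sliced per segment. */
final class GlobalFilterBits {
    private final BitSet global; // bit i set means top-level document i passes the filter

    GlobalFilterBits(BitSet global) {
        this.global = (BitSet) global.clone(); // defensive copy
    }

    /** Returns the bits for one segment, re-based so that bit 0 is the segment's first document. */
    BitSet forSegment(int docBase, int maxDoc) {
        return global.get(docBase, docBase + maxDoc);
    }
}
```

The trade-off is that the index-wide bit set is tied to one particular set of segments, so it has to be rebuilt whenever the reader changes.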

TX




Re: sort by field and score

2012-11-27 Thread Ian Lea
What are you getting for the scores?  If it's NaN I think you'll need
to use a TopFieldCollector.  See for example
http://www.gossamer-threads.com/lists/lucene/java-user/86309
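
Separately from the NaN question, it is worth remembering that in a multi-key sort the second key is only consulted when the first key compares equal. If the custom comparator on "id" rarely produces ties, the score key will have little visible effect on the order. The plain-Java sketch below illustrates that behaviour; the classes Hit and HitOrdering and the idRank field are hypothetical stand-ins and do not use Lucene's Sort or SortField.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

/** Hypothetical search hit: a primary sort key, a relevance score, and a name. */
final class Hit {
    final int idRank;
    final float score;
    final String name;

    Hit(int idRank, float score, String name) {
        this.idRank = idRank;
        this.score = score;
        this.name = name;
    }
}

final class HitOrdering {
    /** Primary key ascending; score descending is used only to break ties on the primary key. */
    static final Comparator<Hit> BY_RANK_THEN_SCORE = new Comparator<Hit>() {
        @Override
        public int compare(Hit a, Hit b) {
            int byRank = Integer.compare(a.idRank, b.idRank);
            if (byRank != 0) {
                return byRank;
            }
            return Float.compare(b.score, a.score); // higher score first
        }
    };

    /** Returns the hit names in sorted order without modifying the input list. */
    static List<String> sortedNames(List<Hit> hits) {
        List<Hit> copy = new ArrayList<Hit>(hits);
        Collections.sort(copy, BY_RANK_THEN_SCORE);
        List<String> names = new ArrayList<String>();
        for (Hit hit : copy) {
            names.add(hit.name);
        }
        return names;
    }
}
```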


--
Ian.


On Tue, Nov 27, 2012 at 3:51 AM, Andy Yu  wrote:
> Hi All,
>
>
> Now  I want to sort by a field and the relevance
> For example
>
> SortField sortField[] = {new SortField("id", new
> CustomComparatorSource(bitSet)),SortField.FIELD_SCORE};
> Sort sort = new Sort(sortField);
> TopDocs topDocs = indexSearcher.search(query, 10,sort);
>
> if (0 < topDocs.totalHits) {
> for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
>
> System.out.println(indexSearcher.doc(scoreDoc.doc).get("id"));
> System.out.println("score is " + scoreDoc.score);
>
>  System.out.println(indexSearcher.doc(scoreDoc.doc).get("name"));
> }
> }
>
> I found that the search result sort just by [new SortField("id", new
> CustomComparatorSource(bitSet))]
> [SortField.FIELD_SCORE] does not work at all
>
>
> PS: my lucene version is 3.6
>
> does anybody know the reason or how to solve it ?
>
>
> Thanks ,
> Andy
