[jira] Commented: (LUCENE-1333) Token implementation needs improvements

2008-08-08 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12621101#action_12621101
 ] 

Michael McCandless commented on LUCENE-1333:


OK, since you pulled it all together under this issue, I think we should commit 
this one instead of LUCENE-1350.  I'll review the [massive] patch -- thanks, DM!

> Token implementation needs improvements
> ---
>
> Key: LUCENE-1333
> URL: https://issues.apache.org/jira/browse/LUCENE-1333
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Analysis
>Affects Versions: 2.3.1
> Environment: All
>Reporter: DM Smith
>Priority: Minor
> Fix For: 2.4
>
> Attachments: LUCENE-1333-analysis.patch, LUCENE-1333-analyzers.patch, 
> LUCENE-1333-core.patch, LUCENE-1333-highlighter.patch, 
> LUCENE-1333-instantiated.patch, LUCENE-1333-lucli.patch, 
> LUCENE-1333-memory.patch, LUCENE-1333-miscellaneous.patch, 
> LUCENE-1333-queries.patch, LUCENE-1333-snowball.patch, 
> LUCENE-1333-wikipedia.patch, LUCENE-1333-wordnet.patch, 
> LUCENE-1333-xml-query-parser.patch, LUCENE-1333.patch, LUCENE-1333.patch, 
> LUCENE-1333.patch, LUCENE-1333a.txt
>
>
> This was discussed in the thread (not sure which place is best to reference 
> so here are two):
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200805.mbox/[EMAIL 
> PROTECTED]
> or to see it all at once:
> http://www.gossamer-threads.com/lists/lucene/java-dev/62851
> Issues:
> 1. JavaDoc is insufficient, leading one to read the code to figure out how to 
> use the class.
> 2. Deprecations are incomplete. The constructors that take String as an 
> argument and the methods that take and/or return String should *all* be 
> deprecated.
> 3. The allocation policy is too aggressive. With large tokens the resulting 
> buffer can be over-allocated. A less aggressive algorithm would be better. In 
> the thread, the Python example is good as it is computationally simple.
> 4. The parts of the code that currently use Token's deprecated methods can be 
> upgraded now rather than waiting for 3.0. As it stands, filter chains that 
> alternate between char[] and String are sub-optimal. Currently, the deprecated 
> API is used in core by the Query classes. The rest are in contrib, mostly in 
> analyzers.
> 5. Some internal optimizations can be done with regard to char[] allocation.
> 6. TokenStream has next() and next(Token); next() should be deprecated so 
> that reuse is maximized, and descendant classes should be rewritten to 
> override next(Token).
> 7. Tokens are often stored as a String in a Term. It would be good to add 
> constructors that take a Token. This would simplify the use of the two 
> together.
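
For illustration, a minimal sketch of the next(Token) reuse pattern referenced in 
item 6 above, written against the 2.3-era analysis API (the filter itself is 
hypothetical and only meant to show the override):

{code}
import java.io.IOException;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

// Hypothetical filter that lower-cases terms in place, reusing the Token
// supplied by the consumer instead of allocating a new Token per term.
public class ReusingLowerCaseFilter extends TokenFilter {

  public ReusingLowerCaseFilter(TokenStream input) {
    super(input);
  }

  // Override next(Token) (the reuse API) rather than next(); the caller
  // passes in a Token whose char[] buffer we modify in place.
  public Token next(Token result) throws IOException {
    Token token = input.next(result);
    if (token == null) {
      return null;
    }
    char[] buffer = token.termBuffer();
    int length = token.termLength();
    for (int i = 0; i < length; i++) {
      buffer[i] = Character.toLowerCase(buffer[i]);
    }
    return token;
  }
}
{code}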

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Re[4]: lucene scoring

2008-08-08 Thread J. Delgado
The only scores that I can think of that can measure "quality" across
different queries are invariant scores such as PageRank. That is, score
the document on its general information value and then use that as a filter
regardless of the query. This is very different from the problem of trying
to normalize the score of the same query over different shards (indexes) in a
federated query setting, which has been researched extensively.

The reason why two queries have different "scales" for scores is the
probabilistic nature of the algorithms, which view word occurrences as
independent random variables. Thus the occurrence of each word in a document
is treated as an independent event. Joint and conditional probabilities can
be estimated by looking at word co-occurrence, which could be used to compare
two specific results (i.e. how relevant is document X to both "baby kittens"
and "death metal", or if "baby kittens" is present in a doc, how likely is it
that "death metal" is present too), but to use the TF-IDF based score as an
absolute measure is like trying to compare pears with apples. Trying to
normalize it is an ill-defined task.

-- J.D.



2008/8/8 Александр Аристов <[EMAIL PROTECTED]>

> Relevance ranking is an option but we still won't be able to compare results.
> Let's say we have distributed searching - in this case the top 10 from one
> server are not the same as those from another. Even worse, we may get that
> the top-scoring document in the resulting set is worse than others.
>
> What if we disable normalization or make it constant - will the results be
> completely meaningless?
>
> And another approach: can we calculate the maximum possible top score, or
> maybe just an approximation of it? We would then be able to compare results
> against it.
>
> Alex
>
>
> -Original Message-
> From: Grant Ingersoll <[EMAIL PROTECTED]>
> To: java-dev@lucene.apache.org
> Date: Thu, 7 Aug 2008 15:54:41 -0400
> Subject: Re: Re[2]: lucene scoring
>
>
> On Aug 7, 2008, at 3:05 PM, Александр Аристов wrote:
>
> > I want to implement searching with the ability to set a so-called
> > confidence level below which I would treat documents as garbage. I
> > cannot define the level per query, as the level should be relevant
> > for all documents.
> >
> > With the current scoring implementation the level would mean nothing. I
> > don't believe that since that time (the thread is from 2005)
> > nothing has been done towards resolving the issue.
>
> That's because there is no resolution to be had, as far as I know, but
> I'm open to suggestions (patches are even better.)  What would it mean
> to say that a score of 0.5 for "baby kittens" is comparable to a score
> of 0.5 for "death metal"?  Like I said, I don't think that 0.5 for
> "baby kittens" is even comparable later if you added other documents
> that contain any of the query terms.
>
> >
> >
> > Do you think any workarounds like implementing more sophisticated
> > queries so that we have approximately the same normalization values?
>
> I just don't think you will be successful with this, and I don't
> believe it is a Lucene issue alone, but one that applies to all search
> engines, but I could be wrong.
>
> I get what you are trying to do, though; I've wanted to do it from
> time to time.   Another approach may be to look for significant
> differences between scores w/in a result set.   For example, if doc 1
> is 0.8, doc 2 is 0.79 and then doc 3 is 0.2, then maybe one could
> argue that doc 3 is garbage, but even that is somewhat of a stretch.
> Garbage truly is in the eye of the beholder.
>
> Another option is to do more relevance tuning to make sure your top 10
> are as good as possible so that your garbage is minimized.
>
> -Grant
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


[jira] Commented: (LUCENE-1219) support array/offset/ length setters for Field with binary data

2008-08-08 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12621099#action_12621099
 ] 

Michael McCandless commented on LUCENE-1219:


bq. realized I am missing actual length we read in LazyField

Duh, right.

Though: couldn't you just call document.getFieldable(name), and then call 
binaryValue(byte[] result) on that Fieldable, and then get the length from it 
(getBinaryLength()) too?  (Trying to minimize API changes).
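
For concreteness, the usage being suggested would look roughly like this (a 
sketch that assumes the binaryValue(byte[]) and getBinaryLength() methods 
proposed in this issue; the field name is illustrative):

{code}
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Fieldable;
import org.apache.lucene.index.IndexReader;

public class BinaryFieldReuseExample {

  // Sketch only: binaryValue(byte[]) and getBinaryLength() are the methods
  // being proposed in this issue, not existing released API.
  static int readBinaryField(IndexReader reader, int docId, String field,
                             byte[] scratch) throws Exception {
    Document doc = reader.document(docId);
    Fieldable f = doc.getFieldable(field);
    byte[] data = f.binaryValue(scratch);  // reuses scratch when it is large enough
    int length = f.getBinaryLength();      // number of valid bytes in data
    // ... consume data[0..length) here ...
    return length;
  }
}
{code}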

bq. Some sort of lower level callback interface to populate a Document might 
even eliminate the need for some of the FieldSelector stuff... or at least it 
would mostly be independent of the field reading code and users could create 
more advanced implementations.

This sounds interesting... but how would you re-use your own byte[] with this 
approach?

> support array/offset/ length setters for Field with binary data
> ---
>
> Key: LUCENE-1219
> URL: https://issues.apache.org/jira/browse/LUCENE-1219
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Eks Dev
>Assignee: Michael McCandless
>Priority: Minor
> Attachments: LUCENE-1219.extended.patch, LUCENE-1219.patch, 
> LUCENE-1219.patch, LUCENE-1219.patch, LUCENE-1219.patch, 
> LUCENE-1219.take2.patch, LUCENE-1219.take3.patch
>
>
> Currently the Field/Fieldable interface supports only compact, zero-based byte 
> arrays. This forces end users to create and copy the content of new objects 
> before passing them to Lucene, as such fields are often of variable size. 
> Depending on the use case, avoiding this can bring a far from negligible 
> performance improvement. 
> This approach extends the Fieldable interface with 3 new methods: 
> getOffset(), getLength(), and getBinaryValue() (the last only returns a 
> reference to the array)
>

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-753) Use NIO positional read to avoid synchronization in FSIndexInput

2008-08-08 Thread Matthew Mastracci (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12621091#action_12621091
 ] 

Matthew Mastracci commented on LUCENE-753:
--

bq. Is the index itself corrupt, ie, NIOFSDirectory did something bad when 
writing the index? Or, is it only in reading the index with NIOFSDirectory that 
you see this? IE, can you swap in FSDirectory on your existing index and the 
problem goes away?

I haven't seen any issues with writing the index under NIOFSDirectory.  The 
failures seem to happen only when reading.  When I switch to FSDirectory (or 
MMapDirectory), the same index that fails under NIOFSDirectory works flawlessly 
(indicating that the index is not corrupt).

The error with NIOFSDirectory is deterministic and repeatable (same error every 
time, same location, same query during warmup).

I couldn't reproduce this on a smaller index, unfortunately.


> Use NIO positional read to avoid synchronization in FSIndexInput
> 
>
> Key: LUCENE-753
> URL: https://issues.apache.org/jira/browse/LUCENE-753
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Store
>Reporter: Yonik Seeley
> Attachments: FileReadTest.java, FileReadTest.java, FileReadTest.java, 
> FileReadTest.java, FileReadTest.java, FileReadTest.java, FileReadTest.java, 
> FSDirectoryPool.patch, FSIndexInput.patch, FSIndexInput.patch, 
> lucene-753.patch, lucene-753.patch
>
>
> As suggested by Doug, we could use NIO pread to avoid synchronization on the 
> underlying file.
> This could mitigate any MT performance drop caused by reducing the number of 
> files in the index format.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-1333) Token implementation needs improvements

2008-08-08 Thread DM Smith (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DM Smith updated LUCENE-1333:
-

Attachment: LUCENE-1333.patch

This patch includes all the previous ones.

Note: It includes the functionality solving LUCENE-1350. If this patch is 
applied before LUCENE-1350, then that issue is resolved. If it is applied after, 
then this patch will need to be rebuilt.

I did not do the "reuse" API mentioned in LUCENE-1350.

> Token implementation needs improvements
> ---
>
> Key: LUCENE-1333
> URL: https://issues.apache.org/jira/browse/LUCENE-1333
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Analysis
>Affects Versions: 2.3.1
> Environment: All
>Reporter: DM Smith
>Priority: Minor
> Fix For: 2.4
>
> Attachments: LUCENE-1333-analysis.patch, LUCENE-1333-analyzers.patch, 
> LUCENE-1333-core.patch, LUCENE-1333-highlighter.patch, 
> LUCENE-1333-instantiated.patch, LUCENE-1333-lucli.patch, 
> LUCENE-1333-memory.patch, LUCENE-1333-miscellaneous.patch, 
> LUCENE-1333-queries.patch, LUCENE-1333-snowball.patch, 
> LUCENE-1333-wikipedia.patch, LUCENE-1333-wordnet.patch, 
> LUCENE-1333-xml-query-parser.patch, LUCENE-1333.patch, LUCENE-1333.patch, 
> LUCENE-1333.patch, LUCENE-1333a.txt
>
>
> This was discussed in the thread (not sure which place is best to reference 
> so here are two):
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200805.mbox/[EMAIL 
> PROTECTED]
> or to see it all at once:
> http://www.gossamer-threads.com/lists/lucene/java-dev/62851
> Issues:
> 1. JavaDoc is insufficient, leading one to read the code to figure out how to 
> use the class.
> 2. Deprecations are incomplete. The constructors that take String as an 
> argument and the methods that take and/or return String should *all* be 
> deprecated.
> 3. The allocation policy is too aggressive. With large tokens the resulting 
> buffer can be over-allocated. A less aggressive algorithm would be better. In 
> the thread, the Python example is good as it is computationally simple.
> 4. The parts of the code that currently use Token's deprecated methods can be 
> upgraded now rather than waiting for 3.0. As it stands, filter chains that 
> alternate between char[] and String are sub-optimal. Currently, the deprecated 
> API is used in core by the Query classes. The rest are in contrib, mostly in 
> analyzers.
> 5. Some internal optimizations can be done with regard to char[] allocation.
> 6. TokenStream has next() and next(Token); next() should be deprecated so 
> that reuse is maximized, and descendant classes should be rewritten to 
> override next(Token).
> 7. Tokens are often stored as a String in a Term. It would be good to add 
> constructors that take a Token. This would simplify the use of the two 
> together.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-1350) Filters which are "consumers" should not reset the payload or flags and should better reuse the token

2008-08-08 Thread DM Smith (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DM Smith updated LUCENE-1350:
-

Attachment: LUCENE-1350.patch

{quote}
Should we just absorb this issue into LUCENE-1333? DM, of your list
above (of filters that lose payload), are there any that are not fixed
in LUCENE-1333? I'm confused on the overlap and it's hard to work
with all the patches. Actually if in LUCENE-1333 you could
consolidate down to a single patch (big toplevel "svn diff"), that'd
be great
{quote}

LUCENE-1333 will have to include all of this. I have already created a patch 
for LUCENE-1350 and LUCENE-1333, which satisfies this requirement. If 
LUCENE-1350 goes first, then the patch for LUCENE-1333 will need to be 
re-built. If LUCENE-1333 goes first, then this one can be closed.

I really don't care which is done first. If both are going to be in the next 
release, then I think we should just do LUCENE-1333. But if, for some reason, we 
are going to do a release before 2.4 and only LUCENE-1350 is going into it, then 
that's fine with me.

As to the effort, I have already done the work. And I was happy to do it :)

> Filters which are "consumers" should not reset the payload or flags and 
> should better reuse the token
> -
>
> Key: LUCENE-1350
> URL: https://issues.apache.org/jira/browse/LUCENE-1350
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Analysis, contrib/*
>Reporter: Doron Cohen
>Assignee: Doron Cohen
> Fix For: 2.3.3
>
> Attachments: LUCENE-1350.patch, LUCENE-1350.patch
>
>
> Passing tokens with payloads through SnowballFilter results in tokens with no 
> payloads.
> A workaround for this is to apply stemming first and only then run whatever 
> logic creates the payload, but this is not always convenient.
> Other "consumer" filters have a similar problem.
> These filters can - and should - reuse the token, by implementing 
> next(Token), effectively also fixing the unwanted resetting.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1219) support array/offset/ length setters for Field with binary data

2008-08-08 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12621043#action_12621043
 ] 

Yonik Seeley commented on LUCENE-1219:
--

bq. Also ... it'd be nice to have a way to do this re-use in the non-lazy case. 
Ie somehow load a stored doc, but passing in your own Document result which 
would attempt to re-use the Field instances & byte[] for the binary fields.

Right.  This also gets back to the fact that the Document you retrieve should 
probably be different from the Document that you get by loading the stored 
fields.

Some sort of lower level callback interface to populate a Document might even 
eliminate the need for some of the FieldSelector stuff... or at least it would 
mostly be independent of the field reading code and users could create more 
advanced implementations.
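
Purely as a sketch of the shape such a callback might take (speculative; none 
of these names exist in Lucene today):

{code}
// Speculative sketch of a low-level stored-field callback, as an alternative
// to FieldSelector: the reader pushes each stored field to the consumer,
// which decides what to keep and how to allocate its own buffers.
public interface StoredFieldConsumer {

  // Called once per stored field of the document being loaded.
  // Returning false tells the reader to skip the field's value entirely.
  boolean needsField(String fieldName);

  // Text field value.
  void stringField(String fieldName, String value);

  // Binary field value; the byte[] may be a buffer owned by the consumer,
  // so only bytes [offset, offset + length) are valid.
  void binaryField(String fieldName, byte[] value, int offset, int length);
}
{code}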

> support array/offset/ length setters for Field with binary data
> ---
>
> Key: LUCENE-1219
> URL: https://issues.apache.org/jira/browse/LUCENE-1219
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Eks Dev
>Assignee: Michael McCandless
>Priority: Minor
> Attachments: LUCENE-1219.extended.patch, LUCENE-1219.patch, 
> LUCENE-1219.patch, LUCENE-1219.patch, LUCENE-1219.patch, 
> LUCENE-1219.take2.patch, LUCENE-1219.take3.patch
>
>
> Currently the Field/Fieldable interface supports only compact, zero-based byte 
> arrays. This forces end users to create and copy the content of new objects 
> before passing them to Lucene, as such fields are often of variable size. 
> Depending on the use case, avoiding this can bring a far from negligible 
> performance improvement. 
> This approach extends the Fieldable interface with 3 new methods: 
> getOffset(), getLength(), and getBinaryValue() (the last only returns a 
> reference to the array)
>

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Issue Comment Edited: (LUCENE-1350) Filters which are "consumers" should not reset the payload or flags and should better reuse the token

2008-08-08 Thread Doron Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12621041#action_12621041
 ] 

doronc edited comment on LUCENE-1350 at 8/8/08 1:18 PM:
-

Mike, thanks for clearing things up...

You're right - this is fixed by LUCENE-1333. 
If LUCENE-1333 gets committed soon there's no point in 
doing this here; that would just make more work for DM in reworking 1333.
The only motivation to do this is if there will be another
fix release 2.3.3.3, in which case it would make sense to
fix this issue, but not the deprecation of the non-reuse 
API done by 1333. Or do you agree with DM that since payloads
and flags are marked experimental they can remain broken 
(in regard to this issue) until 2.4? (Not perfect, but I can 
live with it.)

For the reuse method names, I like *reinit()*...


  was (Author: doronc):
Mike, thanks for clearing things...

You're right - this is fixed by LUCENE-1333. 
If LUCENE-1333 gets committed soon there's no point in 
doing this here, just making more work for DM in reworking 1333.
The only motivation to do this is if there will be another
fix release 2.3.3.3, in which case it would make sense to
fix this issue, but not the deprecation of the non-reuse 
API done by 1333. Or do you agree with DM that since payloads
and flags are marked experimental they can remain broken 
(in regard of this issue) until 2.4? (I not perfect, but I can 
live with it).

For the reuse methods names, I like *reinit()*...

  
> Filters which are "consumers" should not reset the payload or flags and 
> should better reuse the token
> -
>
> Key: LUCENE-1350
> URL: https://issues.apache.org/jira/browse/LUCENE-1350
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Analysis, contrib/*
>Reporter: Doron Cohen
>Assignee: Doron Cohen
> Fix For: 2.3.3
>
> Attachments: LUCENE-1350.patch
>
>
> Passing tokens with payloads through SnowballFilter results in tokens with no 
> payloads.
> A workaround for this is to apply stemming first and only then run whatever 
> logic creates the payload, but this is not always convenient.
> Other "consumer" filters have a similar problem.
> These filters can - and should - reuse the token, by implementing 
> next(Token), effectively also fixing the unwanted resetting.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1350) Filters which are "consumers" should not reset the payload or flags and should better reuse the token

2008-08-08 Thread Doron Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12621041#action_12621041
 ] 

Doron Cohen commented on LUCENE-1350:
-

Mike, thanks for clearing things up...

You're right - this is fixed by LUCENE-1333. 
If LUCENE-1333 gets committed soon there's no point in 
doing this here; that would just make more work for DM in reworking 1333.
The only motivation to do this is if there will be another
fix release 2.3.3.3, in which case it would make sense to
fix this issue, but not the deprecation of the non-reuse 
API done by 1333. Or do you agree with DM that since payloads
and flags are marked experimental they can remain broken 
(in regard to this issue) until 2.4? (Not perfect, but I can 
live with it.)

For the reuse method names, I like *reinit()*...


> Filters which are "consumers" should not reset the payload or flags and 
> should better reuse the token
> -
>
> Key: LUCENE-1350
> URL: https://issues.apache.org/jira/browse/LUCENE-1350
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Analysis, contrib/*
>Reporter: Doron Cohen
>Assignee: Doron Cohen
> Fix For: 2.3.3
>
> Attachments: LUCENE-1350.patch
>
>
> Passing tokens with payloads through SnowballFilter results in tokens with no 
> payloads.
> A workaround for this is to apply stemming first and only then run whatever 
> logic creates the payload, but this is not always convenient.
> Other "consumer" filters have a similar problem.
> These filters can - and should - reuse the token, by implementing 
> next(Token), effectively also fixing the unwanted resetting.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1219) support array/offset/ length setters for Field with binary data

2008-08-08 Thread Eks Dev (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12621036#action_12621036
 ] 

Eks Dev commented on LUCENE-1219:
-

bq. could we instead add this to Field:
byte[] binaryValue(byte[] result)

This is exactly where I started, but then I realized I am missing the actual 
length we read in LazyField; without it you would have to reallocate each time, 
except in the case where your buffer length equals toRead in LazyField... 
Simply, the question is: how could the caller of byte[] getBinaryValue(String 
name, byte[] result) know the length of the data in the returned byte[]?

Am I missing something obvious?

> support array/offset/ length setters for Field with binary data
> ---
>
> Key: LUCENE-1219
> URL: https://issues.apache.org/jira/browse/LUCENE-1219
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Eks Dev
>Assignee: Michael McCandless
>Priority: Minor
> Attachments: LUCENE-1219.extended.patch, LUCENE-1219.patch, 
> LUCENE-1219.patch, LUCENE-1219.patch, LUCENE-1219.patch, 
> LUCENE-1219.take2.patch, LUCENE-1219.take3.patch
>
>
> Currently the Field/Fieldable interface supports only compact, zero-based byte 
> arrays. This forces end users to create and copy the content of new objects 
> before passing them to Lucene, as such fields are often of variable size. 
> Depending on the use case, avoiding this can bring a far from negligible 
> performance improvement. 
> This approach extends the Fieldable interface with 3 new methods: 
> getOffset(), getLength(), and getBinaryValue() (the last only returns a 
> reference to the array)
>

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Re[2]: lucene scoring

2008-08-08 Thread Doron Cohen
The following suggestion is weaker than the requested functionality, but
maybe you'll find the concept useful for ignoring so-called "garbage" results.

Assume that the query is a simple OR query made of a few words.
By examining the frequencies of these words in the index
(their DFs), devise a synthetic document which is the worst
document you would be willing to accept as a useful result.
Alternatively, ignore DFs but create a few documents like this -
each perhaps containing one or a few of the query words (and likely
many other words). Now virtually add the synthetic document(s)
to the index. This can be done by creating a small in-memory index
and creating a MultiReader on top of the real index and
the dummy one. Now execute the query with a filter that
accepts only the synthetic documents. The score of the worst
acceptable document(s) can be used as a threshold when
running the query on the original index.

It is inefficient - it must be done for each query, would be hard
to implement for general queries, and I never tried it...
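
A rough sketch of the recipe against the 2.3-era APIs (untested, as noted; the
field name and the synthetic documents are illustrative, and resource cleanup
is omitted):

import java.util.BitSet;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.RAMDirectory;

public class SyntheticThreshold {

  // Returns the score of the "worst acceptable" synthetic document for this
  // query; real results scoring below it could be treated as garbage.
  public static float worstAcceptableScore(IndexReader realReader, Query query,
                                           String[] syntheticTexts) throws Exception {
    // Build a tiny in-memory index holding only the synthetic documents.
    RAMDirectory dir = new RAMDirectory();
    IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
    for (int i = 0; i < syntheticTexts.length; i++) {
      Document doc = new Document();
      doc.add(new Field("body", syntheticTexts[i], Field.Store.NO, Field.Index.TOKENIZED));
      writer.addDocument(doc);
    }
    writer.close();

    // Virtually append the synthetic docs to the real index: in the combined
    // reader their doc ids start at realReader.maxDoc().
    IndexReader synthReader = IndexReader.open(dir);
    IndexReader combined = new MultiReader(new IndexReader[] { realReader, synthReader });
    final int firstSynthetic = realReader.maxDoc();

    // Filter that accepts only the synthetic documents.
    Filter syntheticOnly = new Filter() {
      public BitSet bits(IndexReader reader) {
        BitSet bits = new BitSet(reader.maxDoc());
        bits.set(firstSynthetic, reader.maxDoc());
        return bits;
      }
    };

    IndexSearcher searcher = new IndexSearcher(combined);
    TopDocs top = searcher.search(query, syntheticOnly, syntheticTexts.length);
    if (top.scoreDocs.length == 0) {
      return 0.0f;   // no synthetic doc matched: no usable threshold
    }
    // The lowest-scoring synthetic hit is the threshold.
    return top.scoreDocs[top.scoreDocs.length - 1].score;
  }
}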

Doron

2008/8/8 Александр Аристов <[EMAIL PROTECTED]>

> Query independent means that the threshold should have the same relevance
> for all queries and discard found docs below it. The current scoring
> implementation doesn't guarantee that, say, two documents found by two
> queries which got the same score of 0.5 are of the same quality.
>
> I don't want to exclude docs from indexing, no. But I want to be sure
> that two docs with the same score in two different queries have the same
> quality (they contain the same set of found terms, length etc.)
>
> Alexander
>
> -Original Message-
> From: Andrzej Bialecki <[EMAIL PROTECTED]>
> To: java-dev@lucene.apache.org
> Date: Thu, 07 Aug 2008 22:44:46 +0200
> Subject: Re: lucene scoring
>
>
> Александр Аристов wrote:
> > I want to implement searching with the ability to set a so-called
> > confidence level below which I would treat documents as garbage. I cannot
> > define the level per query, as the level should be relevant for all
> > documents.
>
> Hmm .. I'm not sure if I understand it properly - if the level is
> query-independent, then it's a constant factor, which you can put in a
> field during the index creation - and then you could use a Filter or
> FunctionQuery to exclude documents with this factor below the threshold.
>
> --
> Best regards,
> Andrzej Bialecki <><
>  ___. ___ ___ ___ _ _   __
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


[jira] Commented: (LUCENE-1329) Remove synchronization in SegmentReader.isDeleted

2008-08-08 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12621024#action_12621024
 ] 

Yonik Seeley commented on LUCENE-1329:
--

bq. I didn't do this one yet ... it makes me a bit nervous because it means 
that people who just do IndexReader.open on an index with no deletions pay the 
RAM cost upfront of allocating the BitVector.

Right, which is why I said it was for a "deleting-reader" (which presumes the 
existence of read-only-readers).


> Remove synchronization in SegmentReader.isDeleted
> -
>
> Key: LUCENE-1329
> URL: https://issues.apache.org/jira/browse/LUCENE-1329
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.3.1
>Reporter: Jason Rutherglen
>Priority: Trivial
> Attachments: LUCENE-1329.patch, lucene-1329.patch
>
>
> Removes SegmentReader.isDeleted synchronization by using a volatile 
> deletedDocs variable on Java 1.5 platforms.  On Java 1.4 platforms 
> synchronization is limited to obtaining the deletedDocs reference.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-1329) Remove synchronization in SegmentReader.isDeleted

2008-08-08 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1329:
---

Attachment: LUCENE-1329.patch

I took a first cut at creating an explicit read only IndexReader
(attached), which is an alternative to the first patch here.

I added "boolean readOnly" to 3 of the IndexReader open methods, and
created ReadOnlySegmentReader and ReadOnlyMultiSegmentReader.  The
classes are trivial -- they subclass the parent and just override
acquireWriteLock (to throw an exception) and isDeleted.

reopen() also preserves readOnly-ness, and I fixed merging to open readOnly
IndexReaders.

bq. We could safely do this for a deleting-reader by pre-allocating the 
BitVector objects, thus eliminating the possibility of a thread seeing a 
partially constructed object.

I didn't do this one yet ... it makes me a bit nervous because it
means that people who just do IndexReader.open on an index with no
deletions pay the RAM cost upfront of allocating the BitVector.

Really, I think we want to default IndexReader.open to be
readOnly=true (which we can't do until 3.0) at which point doing the
above seems safer since you'd have to go out of your way to open a
non-readOnly IndexReader.
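
The core idea, reduced to hypothetical stand-in classes (illustration only; the
actual patch subclasses SegmentReader and MultiSegmentReader and overrides their
real methods):

{code}
// Illustration with hypothetical stand-ins for the SegmentReader internals,
// showing why a read-only reader can drop the isDeleted() synchronization.
abstract class BaseReader {
  protected BitVectorLike deletedDocs;            // hypothetical deletions bit set

  protected void acquireWriteLock() { /* obtain write.lock on the directory */ }

  public synchronized boolean isDeleted(int n) {  // synchronized in the mutable case
    return deletedDocs != null && deletedDocs.get(n);
  }
}

class ReadOnlyReader extends BaseReader {
  // A read-only reader can never delete documents...
  protected void acquireWriteLock() {
    throw new UnsupportedOperationException("This IndexReader is read-only");
  }

  // ...so deletedDocs never changes after construction and no lock is needed.
  public boolean isDeleted(int n) {
    return deletedDocs != null && deletedDocs.get(n);
  }
}

interface BitVectorLike {
  boolean get(int n);
}
{code}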


> Remove synchronization in SegmentReader.isDeleted
> -
>
> Key: LUCENE-1329
> URL: https://issues.apache.org/jira/browse/LUCENE-1329
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.3.1
>Reporter: Jason Rutherglen
>Priority: Trivial
> Attachments: LUCENE-1329.patch, lucene-1329.patch
>
>
> Removes SegmentReader.isDeleted synchronization by using a volatile 
> deletedDocs variable on Java 1.5 platforms.  On Java 1.4 platforms 
> synchronization is limited to obtaining the deletedDocs reference.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1350) Filters which are "consumers" should not reset the payload or flags and should better reuse the token

2008-08-08 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12620970#action_12620970
 ] 

Michael McCandless commented on LUCENE-1350:



It seems like there are three different things, here:

  # Many filters (eg SnowballFilter) incorrectly erase the Payload,
token Type and token flags, because they are basically doing
their own Token cloning.  This is pre-existing (before re-use API
was created).
  # Separately, these filters do not use the re-use API, which we are
wanting to migrate to anyway.
  # Adding new "reuse" methods on Token which are like clear() except
they also take args to replace the termBuffer, start/end offset,
etc, and they do not clear the payload/flags to their defaults.

Since in LUCENE-1333 we are aggressively moving all Lucene core &
contrib TokenStream & TokenFilters to use the re-use API (formally
deprecating the original non-reuse API), we may as well fix 1 & 2 at
once.

I think the reuse API proposal is reasonable: it mirrors the current
constructors on Token.  But, since we are migrating to reuse api, you
need the analog (of all these constructors) without making a new
Token.

But maybe change the name from "reuse" to "update", "set",
"reset", "reinit", or "change"?  But: I think this method should still
reset payload, position incr, etc, to defaults?  Ie calling this
method should get you the same result as creating a new Token(...)
passing in the termBuffer, start/end offset, etc, I think?
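
For example, the variant mirroring a (termBuffer, offsets) constructor might
look roughly like this inside Token; the name and the exact reset semantics are
precisely what is being discussed, so this is only a sketch:

{code}
// Sketch of a proposed reuse method on Token: like the corresponding
// constructor, but reinitializes this instance instead of allocating a new
// Token, and (per the above) resets payload, flags and position increment
// back to their defaults.
public Token reinit(char[] newTermBuffer, int newTermOffset, int newTermLength,
                    int newStartOffset, int newEndOffset) {
  clear();                                              // back to default payload/flags/posIncr
  setTermBuffer(newTermBuffer, newTermOffset, newTermLength);
  setStartOffset(newStartOffset);
  setEndOffset(newEndOffset);
  return this;                                          // allow chaining in filters
}
{code}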

Should we just absorb this issue into LUCENE-1333?  DM, of your list
above (of filters that lose payload), are there any that are not fixed
in LUCENE-1333?  I'm confused on the overlap and it's hard to work
with all the patches.  Actually if in LUCENE-1333 you could
consolidate down to a single patch (big toplevel "svn diff"), that'd
be great :)


> Filters which are "consumers" should not reset the payload or flags and 
> should better reuse the token
> -
>
> Key: LUCENE-1350
> URL: https://issues.apache.org/jira/browse/LUCENE-1350
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Analysis, contrib/*
>Reporter: Doron Cohen
>Assignee: Doron Cohen
> Fix For: 2.3.3
>
> Attachments: LUCENE-1350.patch
>
>
> Passing tokens with payloads through SnowballFilter results in tokens with no 
> payloads.
> A workaround for this is to apply stemming first and only then run whatever 
> logic creates the payload, but this is not always convenient.
> Other "consumer" filters have a similar problem.
> These filters can - and should - reuse the token, by implementing 
> next(Token), effectively also fixing the unwanted resetting.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1219) support array/offset/ length setters for Field with binary data

2008-08-08 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12620960#action_12620960
 ] 

Michael McCandless commented on LUCENE-1219:


Eks, could we instead add this to Field:

  byte[] binaryValue(byte[] result)

and then default the current binaryValue() to just call binaryValue(null)?

And similar in Document add:

  byte[] getBinaryValue(String name, byte[] result)

These would work the same way that TokenStream.next(Token result) works, ie, 
the method should try to use the result passed in, if it works, and return 
that; else it's free to allocate its own new byte[] and return it?

And then only LazyField's implementation of binaryValue(byte[] result) would 
use byte[] result if it's large enough?
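
Concretely, the LazyField side might look something like this (sketch only;
storedLength and readStoredBytes stand in for the field's internal bookkeeping
and low-level read):

{code}
// Sketch: reuse the caller's byte[] when it is large enough, otherwise
// allocate, mirroring how TokenStream.next(Token result) treats its argument.
public byte[] binaryValue(byte[] result) {
  int toRead = storedLength;               // hypothetical: size of the value on disk
  byte[] buffer = (result != null && result.length >= toRead)
      ? result                             // caller's buffer is big enough: reuse it
      : new byte[toRead];                  // otherwise allocate, as today
  readStoredBytes(buffer, 0, toRead);      // hypothetical low-level read of the stored value
  binaryLength = toRead;                   // so getBinaryLength() can report the valid length
  return buffer;
}
{code}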

Also ... it'd be nice to have a way to do this re-use in the non-lazy case.  Ie 
somehow load a stored doc, but passing in your own Document result which would 
attempt to re-use the Field instances & byte[] for the binary fields.  But we 
should open another issue to explore that...

> support array/offset/ length setters for Field with binary data
> ---
>
> Key: LUCENE-1219
> URL: https://issues.apache.org/jira/browse/LUCENE-1219
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Eks Dev
>Assignee: Michael McCandless
>Priority: Minor
> Attachments: LUCENE-1219.extended.patch, LUCENE-1219.patch, 
> LUCENE-1219.patch, LUCENE-1219.patch, LUCENE-1219.patch, 
> LUCENE-1219.take2.patch, LUCENE-1219.take3.patch
>
>
> Currently the Field/Fieldable interface supports only compact, zero-based byte 
> arrays. This forces end users to create and copy the content of new objects 
> before passing them to Lucene, as such fields are often of variable size. 
> Depending on the use case, avoiding this can bring a far from negligible 
> performance improvement. 
> This approach extends the Fieldable interface with 3 new methods: 
> getOffset(), getLength(), and getBinaryValue() (the last only returns a 
> reference to the array)
>

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-1219) support array/offset/ length setters for Field with binary data

2008-08-08 Thread Eks Dev (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eks Dev updated LUCENE-1219:


Attachment: LUCENE-1219.extended.patch

Mike, 
This new patch includes take3  and adds the following:

Fieldable  Document.getStoredBinaryField(String name, byte[] scratch);

where the scratch param represents a user byte buffer that will be used in case 
it is big enough; if not, a new one will simply be allocated like today. If 
scratch is used, you get the same object back through Fieldable.getByteValue()
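
If I read this right, the intended usage pattern is roughly the following
(sketch; getStoredBinaryField, getByteValue and getBinaryLength follow the names
in this issue's comments, everything else is illustrative):

{code}
// Sketch: reuse one scratch buffer across many documents instead of
// allocating a fresh byte[] for every stored binary field that is read.
static void scanBinaryField(IndexReader reader, FieldSelector selector) throws Exception {
  byte[] scratch = new byte[16 * 1024];
  for (int docId = 0; docId < reader.maxDoc(); docId++) {
    if (reader.isDeleted(docId)) continue;
    Document doc = reader.document(docId, selector);          // selector marks "blob" as lazy
    Fieldable f = doc.getStoredBinaryField("blob", scratch);  // uses scratch when big enough
    byte[] data = f.getByteValue();                           // same array as scratch if it was used
    int length = f.getBinaryLength();                         // number of valid bytes in data
    // ... consume data[0..length) ...
  }
}
{code}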


for this to work, I added one new method in Fieldable 
abstract Fieldable getBinaryField(byte[] scratch);

the only interesting implementation is in LazyField 

The reason for this is in my previous comment

this does not affect issues from take3 at all, but is dependent on it, as you 
need to know the length of the byte[] you read. take3 remains good to commit; I 
just did not know how to make one isolated patch with only these changes 
without too much work in a text editor 
 

> support array/offset/ length setters for Field with binary data
> ---
>
> Key: LUCENE-1219
> URL: https://issues.apache.org/jira/browse/LUCENE-1219
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Eks Dev
>Assignee: Michael McCandless
>Priority: Minor
> Attachments: LUCENE-1219.extended.patch, LUCENE-1219.patch, 
> LUCENE-1219.patch, LUCENE-1219.patch, LUCENE-1219.patch, 
> LUCENE-1219.take2.patch, LUCENE-1219.take3.patch
>
>
> Currently the Field/Fieldable interface supports only compact, zero-based byte 
> arrays. This forces end users to create and copy the content of new objects 
> before passing them to Lucene, as such fields are often of variable size. 
> Depending on the use case, avoiding this can bring a far from negligible 
> performance improvement. 
> This approach extends the Fieldable interface with 3 new methods: 
> getOffset(), getLength(), and getBinaryValue() (the last only returns a 
> reference to the array)
>

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re[4]: lucene scoring

2008-08-08 Thread Александр Аристов
Relevance ranking is an option but we still won't be able to compare results. 
Let's say we have distributed searching - in this case the top 10 from one 
server are not the same as those from another. Even worse, we may get that the 
top-scoring document in the resulting set is worse than others.

What if we disable normalization or make it constant - will the results be 
completely meaningless?

And another approach: can we calculate the maximum possible top score, or maybe 
just an approximation of it? We would then be able to compare results against 
it.

Alex


-Original Message-
From: Grant Ingersoll <[EMAIL PROTECTED]>
To: java-dev@lucene.apache.org
Date: Thu, 7 Aug 2008 15:54:41 -0400
Subject: Re: Re[2]: lucene scoring


On Aug 7, 2008, at 3:05 PM, Александр Аристов wrote:

> I want to implement searching with the ability to set a so-called  
> confidence level below which I would treat documents as garbage. I  
> cannot define the level per query, as the level should be relevant  
> for all documents.
>
> With the current scoring implementation the level would mean nothing. I  
> don't believe that since that time (the thread is from 2005)  
> nothing has been done towards resolving the issue.

That's because there is no resolution to be had, as far as I know, but  
I'm open to suggestions (patches are even better.)  What would it mean  
to say that a score of 0.5 for "baby kittens" is comparable to a score  
of 0.5 for "death metal"?  Like I said, I don't think that 0.5 for  
"baby kittens" is even comparable later if you added other documents  
that contain any of the query terms.

>
>
> Do you think any workarounds like implementing more sophisticated  
> queries so that we have approximately the same normalization values?

I just don't think you will be successful with this, and I don't  
believe it is a Lucene issue alone, but one that applies to all search  
engines, but I could be wrong.

I get what you are trying to do, though; I've wanted to do it from  
time to time.   Another approach may be to look for significant  
differences between scores w/in a result set.   For example, if doc 1  
is 0.8, doc 2 is 0.79 and then doc 3 is 0.2, then maybe one could  
argue that doc 3 is garbage, but even that is somewhat of a stretch.   
Garbage truly is in the eye of the beholder.
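
If someone wants to experiment with that, a trivial cut-off over the returned  
ScoreDocs might look like this (sketch only; the relative-drop threshold is  
arbitrary):

import org.apache.lucene.search.ScoreDoc;

// Keep hits until the score drops by more than maxRelativeDrop between two
// consecutive results; everything after the first big gap is treated as garbage.
public class ScoreGapCutoff {
  public static int keepCount(ScoreDoc[] hits, float maxRelativeDrop) {
    for (int i = 1; i < hits.length; i++) {
      float prev = hits[i - 1].score;
      if (prev <= 0.0f) {
        return i;                                   // degenerate scores: stop here
      }
      float drop = (prev - hits[i].score) / prev;   // relative drop from the previous hit
      if (drop > maxRelativeDrop) {
        return i;                                   // e.g. 0.8 -> 0.79 passes, 0.79 -> 0.2 stops
      }
    }
    return hits.length;
  }
}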

Another option is to do more relevance tuning to make sure your top 10  
are as good as possible so that your garbage is minimized.

-Grant
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re[2]: lucene scoring

2008-08-08 Thread Александр Аристов
Query independent means that the threshold should have the same relevance for 
all queries and discard found docs below it. The current scoring implementation 
doesn't guarantee that, say, two documents found by two queries which got the 
same score of 0.5 are of the same quality.   

I don't want to exclude docs from indexing, no. But I want to be sure that 
two docs with the same score in two different queries have the same quality 
(they contain the same set of found terms, length etc.)

Alexander

-Original Message-
From: Andrzej Bialecki <[EMAIL PROTECTED]>
To: java-dev@lucene.apache.org
Date: Thu, 07 Aug 2008 22:44:46 +0200
Subject: Re: lucene scoring


Александр Аристов wrote:
> I want to implement searching with the ability to set a so-called
> confidence level below which I would treat documents as garbage. I cannot
> define the level per query, as the level should be relevant for all
> documents.

Hmm .. I'm not sure if I understand it properly - if the level is 
query-independent, then it's a constant factor, which you can put in a 
field during the index creation - and then you could use a Filter or 
FunctionQuery to exclude documents with this factor below the threshold.
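
For instance (a rough sketch against the 2.3-era Filter API; the "quality" 
field and the threshold are illustrative):

import java.io.IOException;
import java.util.BitSet;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.FieldCache;
import org.apache.lucene.search.Filter;

// Accepts only documents whose query-independent "quality" value (indexed
// at creation time in a numeric field) is at or above the threshold.
public class MinQualityFilter extends Filter {
  private final String field;
  private final float threshold;

  public MinQualityFilter(String field, float threshold) {
    this.field = field;
    this.threshold = threshold;
  }

  public BitSet bits(IndexReader reader) throws IOException {
    float[] quality = FieldCache.DEFAULT.getFloats(reader, field);
    BitSet result = new BitSet(reader.maxDoc());
    for (int doc = 0; doc < quality.length; doc++) {
      if (quality[doc] >= threshold) {
        result.set(doc);
      }
    }
    return result;
  }
}

It would then be passed as the Filter argument to Searcher.search() next to 
the query itself.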

-- 
Best regards,
Andrzej Bialecki <><
  ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]