Re: [jira] Updated: (LUCENE-693) ConjunctionScorer - more tuneup

2007-11-20 Thread robert engels

Thanks!

On Nov 21, 2007, at 1:35 AM, Michael Busch wrote:


robert engels wrote:

> We are still using Lucene 1.9.1+, and I am wondering if there have been
> any improvements in searching on AND clauses when some of the terms are
> very infrequent...

multi-level skipping should help when an AND query has frequent and
infrequent terms. See LUCENE-866 for some performance numbers.

-Michael




Re: [jira] Updated: (LUCENE-693) ConjunctionScorer - more tuneup

2007-11-20 Thread Michael Busch
robert engels wrote:
> 
> We are still using Lucene 1.9.1+, and I am wondering if there have been
> any improvements in searching on AND clauses when some of the terms are
> very infrequent...
> 

multi-level skipping should help when an AND query has frequent and
infrequent terms. See LUCENE-866 for some performance numbers.

-Michael




Re: [jira] Updated: (LUCENE-693) ConjunctionScorer - more tuneup

2007-11-20 Thread robert engels
Sorry if this is somewhat off topic, but it seems at least marginally
related to this...

We are still using Lucene 1.9.1+, and I am wondering if there have
been any improvements in searching on AND clauses when some of the
terms are very infrequent...

This change seems appropriate.  Are there others associated with the
performance gains?

If we were going to back-port some of the later changes, can anyone
give some advice as to the biggest "bang for the buck"?  Hopefully
those not involving an index format change.

Thanks.
Robert

On Nov 21, 2007, at 1:16 AM, Yonik Seeley (JIRA) wrote:



 [ https://issues.apache.org/jira/browse/LUCENE-693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Yonik Seeley updated LUCENE-693:


Attachment: conjunction.patch

Whew... I'd forgotten about this issue.  I brushed up one of the
last versions I had lying around from a year ago (see latest
conjunction.patch), fixed up my synthetic tests a bit, and got some
decent results:

1% faster in top level term conjunctions (wheee)
49% faster in a conjunction of nested term conjunctions (no sort
per call to skipTo)
5% faster in a top level ConstantScoreQuery conjunction
144% faster in a conjunction of nested ConstantScoreQuery conjunctions

A sort is done the first time, and the scorers are ordered so that
the highest will skip first (the idea being that there may be a
little info in the first skip about which scorer is most sparse).

Michael Busch recently brought up a related idea... that one could
skip on low df terms first... but that would of course require some
terms in the conjunction.



ConjunctionScorer - more tuneup
-------------------------------

                 Key: LUCENE-693
                 URL: https://issues.apache.org/jira/browse/LUCENE-693
             Project: Lucene - Java
          Issue Type: Bug
          Components: Search
    Affects Versions: 2.1
         Environment: Windows Server 2003 x64, Java 1.6, pretty large index
            Reporter: Peter Keegan
         Attachments: conjunction.patch, conjunction.patch, conjunction.patch, conjunction.patch, conjunction.patch.nosort1



(See also: #LUCENE-443)
I did some profile testing with the new ConjunctionScorer in 2.1
and discovered a new bottleneck in ConjunctionScorer.sortScorers.
The java.util.Arrays.sort method is cloning the Scorers array on
every sort, which is quite expensive on large indexes because of
the size of the 'norms' array within, and isn't necessary.

Here is one possible solution:

  private void sortScorers() {
    // squeeze the array down for the sort
    //if (length != scorers.length) {
    //  Scorer[] temps = new Scorer[length];
    //  System.arraycopy(scorers, 0, temps, 0, length);
    //  scorers = temps;
    //}
    insertionSort(scorers, length);
    // note that this comparator is not consistent with equals!
    //Arrays.sort(scorers, new Comparator() { // sort the array
    //  public int compare(Object o1, Object o2) {
    //    return ((Scorer)o1).doc() - ((Scorer)o2).doc();
    //  }
    //});

    first = 0;
    last = length - 1;
  }

  private void insertionSort(Scorer[] scores, int len) {
    for (int i = 0; i < len; i++) {
      for (int j = i; j > 0 && scores[j-1].doc() > scores[j].doc(); j--) {
        swap(scores, j, j-1);
      }
    }
  }

  private void swap(Object[] x, int a, int b) {
    Object t = x[a];
    x[a] = x[b];
    x[b] = t;
  }

The squeezing of the array is no longer needed.
We also initialized the Scorers array to 8 (instead of 2) to avoid
having to grow the array for common queries, although this
probably has less performance impact.

This change added about 3% to query throughput in my testing.
Peter









[jira] Updated: (LUCENE-693) ConjunctionScorer - more tuneup

2007-11-20 Thread Yonik Seeley (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yonik Seeley updated LUCENE-693:


Attachment: conjunction.patch

Whew... I'd forgotten about this issue.  I brushed up one of the last versions 
I had lying around from a year ago (see latest conjunction.patch), fixed up my 
synthetic tests a bit, and got some decent results:

1% faster in top level term conjunctions (wheee)
49% faster in a conjunction of nested term conjunctions (no sort per call to 
skipTo)
5% faster in a top level ConstantScoreQuery conjunction
144% faster in a conjunction of nested ConstantScoreQuery conjunctions

A sort is done the first time, and the scorers are ordered so that the highest 
will skip first (the idea being that there may be a little info in the first 
skip about which scorer is most sparse).

Michael Busch recently brought up a related idea... that one could skip on low 
df terms first... but that would of course require some terms in the 
conjunction.
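For readers new to the scorer internals, the loop below is a minimal sketch of the skipTo-based "leapfrog" pattern that ConjunctionScorer performs once the scorers are ordered. It is not the attached patch, just the general pattern, with Scorer reduced to doc()/skipTo():

{code:java}
// Sketch only: advance lagging scorers to the current candidate doc
// until all of them agree on it. skipTo(target) positions a scorer at
// the first doc >= target and returns false when the stream runs out.
boolean doNext(Scorer[] scorers) throws IOException {
  int doc = scorers[0].doc();       // current candidate doc
  int i = 1;                        // next scorer to check, round-robin
  int matched = 1;                  // scorers known to sit on 'doc'
  while (matched < scorers.length) {
    Scorer s = scorers[i % scorers.length];
    if (s.doc() < doc) {
      if (!s.skipTo(doc)) {
        return false;               // one stream is exhausted: no more matches
      }
    }
    if (s.doc() == doc) {
      matched++;                    // still on the candidate
    } else {
      doc = s.doc();                // overshot: new candidate, start over
      matched = 1;
    }
    i++;
  }
  return true;                      // every scorer sits on 'doc'
}
{code}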

> ConjunctionScorer - more tuneup
> ---
>
> Key: LUCENE-693
> URL: https://issues.apache.org/jira/browse/LUCENE-693
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Search
>Affects Versions: 2.1
> Environment: Windows Server 2003 x64, Java 1.6, pretty large index
>Reporter: Peter Keegan
> Attachments: conjunction.patch, conjunction.patch, conjunction.patch, 
> conjunction.patch, conjunction.patch.nosort1
>
>
> (See also: #LUCENE-443)
> I did some profile testing with the new ConjunctionScorer in 2.1 and 
> discovered a new bottleneck in ConjunctionScorer.sortScorers. The 
> java.util.Arrays.sort method is cloning the Scorers array on every sort, 
> which is quite expensive on large indexes because of the size of the 'norms' 
> array within, and isn't necessary. 
> Here is one possible solution:
>   private void sortScorers() {
>     // squeeze the array down for the sort
>     //if (length != scorers.length) {
>     //  Scorer[] temps = new Scorer[length];
>     //  System.arraycopy(scorers, 0, temps, 0, length);
>     //  scorers = temps;
>     //}
>     insertionSort(scorers, length);
>     // note that this comparator is not consistent with equals!
>     //Arrays.sort(scorers, new Comparator() { // sort the array
>     //  public int compare(Object o1, Object o2) {
>     //    return ((Scorer)o1).doc() - ((Scorer)o2).doc();
>     //  }
>     //});
>     first = 0;
>     last = length - 1;
>   }
>   private void insertionSort(Scorer[] scores, int len) {
>     for (int i = 0; i < len; i++) {
>       for (int j = i; j > 0 && scores[j-1].doc() > scores[j].doc(); j--) {
>         swap(scores, j, j-1);
>       }
>     }
>   }
>   private void swap(Object[] x, int a, int b) {
>     Object t = x[a];
>     x[a] = x[b];
>     x[b] = t;
>   }
> The squeezing of the array is no longer needed. 
> We also initialized the Scorers array to 8 (instead of 2) to avoid having to 
> grow the array for common queries, although this probably has less 
> performance impact.
> This change added about 3% to query throughput in my testing.
> Peter




[jira] Commented: (LUCENE-1063) Token re-use API breaks back compatibility in certain TokenStream chains

2007-11-20 Thread Doron Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12544175
 ] 

Doron Cohen commented on LUCENE-1063:
-

{quote}
So I don't think we need to change anything.
{quote}
(y) sounds good to me. 

(i) looking closely at TokenStream, it is interesting that next() and next(Token) 
as written will loop forever: each default implementation delegates to the other. 
So if a subclass implements, say, next() by calling (super's) next(new Token()), 
it is an infinite loop. However, anything like this would be buggy anyhow, because 
no meaningful token is created this way. To summarize, there's no action item here. 
(I thought about modifying the javadoc NOTE to: ??subclasses must +create the next 
Token+ by overriding at least one of next() or next(Token)??, but I am not 
convinced it is any clearer.)
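
A minimal sketch of the two mutual defaults being described (simplified; the exact bodies in TokenStream.java are assumed, not quoted):

{code:java}
// Simplified sketch of the default implementations: each delegates to
// the other, so a subclass that overrides neither -- or implements one
// by calling super's other -- recurses forever.
public Token next() throws IOException {
  return next(new Token());   // delegate to the reuse API
}

public Token next(Token result) throws IOException {
  return next();              // delegate back to the non-reuse API
}
{code}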

> Token re-use API breaks back compatibility in certain TokenStream chains
> 
>
> Key: LUCENE-1063
> URL: https://issues.apache.org/jira/browse/LUCENE-1063
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Analysis
>Affects Versions: 2.3
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 2.3
>
> Attachments: LUCENE-1063.patch
>
>
> In scrutinizing the new Token re-use API during this thread:
>   http://www.gossamer-threads.com/lists/lucene/java-dev/54708
> I realized we now have a non-back-compatibility when mixing re-use and
> non-re-use TokenStreams.
> The new "reuse" next(Token) API actually allows two different aspects
> of re-use:
>   1) "Backwards re-use": the subsequent call to next(Token) is allowed
>  to change all aspects of the provided Token, meaning the caller
>  must do all persisting of Token that it needs before calling
>  next(Token) again.
>   2) "Forwards re-use": the caller is allowed to modify the returned
>  Token however it wants.  Eg the LowerCaseFilter is allowed to
>  downcase the characters in-place in the char[] termBuffer.
> The forwards re-use case can break backwards compatibility now.  EG:
> if a TokenStream X providing only the "non-reuse" next() API is
> followed by a TokenFilter Y using the "reuse" next(Token) API to pull
> the tokens, then the default implementation in TokenStream.java for
> next(Token) will kick in.
> That default implementation just returns the provided "private copy"
> Token returned by next().  But, because of 2) above, this is not
> legal: if the TokenFilter Y modifies the char[] termBuffer (say), that
> is actually modifying the cached copy being potentially stored by X.
> I think the opposite case is handled correctly.
> A simple way to fix this is to make a full copy of the Token in the
> next(Token) call in TokenStream, just like we do in the next() method
> in TokenStream.  The downside is this is a small performance hit.  However
> that hit only happens at the boundary between a non-reuse and a re-use
> tokenizer.
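
The fix proposed in the last quoted paragraph, as a minimal sketch (this is not the attached LUCENE-1063.patch; it only shows the idea):

{code:java}
// Sketch of "make a full copy of the Token in the next(Token) call":
// the provided 'result' is ignored and a clone of the stream's cached
// "private copy" is returned, so downstream filters may mutate it freely.
public Token next(Token result) throws IOException {
  Token t = next();                             // the stream's private copy
  return t == null ? null : (Token) t.clone();  // full copy: safe to modify
}
{code}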




[jira] Updated: (LUCENE-1058) New Analyzer for buffering tokens

2007-11-20 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated LUCENE-1058:


Attachment: LUCENE-1058.patch

Here's a patch that modifies the DocumentsWriter to not throw an 
IllegalArgumentException if no Reader is specified.  Thus, an Analyzer needs to 
be able to handle a null Reader (this still needs to be documented).  
Basically, the semantics are that the Analyzer is producing Tokens by some 
other means.  I should probably spell this out in a new Field constructor as 
well, but this should suffice for now, and I will revisit it after the break.

I also added a TestCollaboratingAnalyzer.  All tests pass.
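
A rough sketch of what "producing Tokens from some other means" can look like. The class is hypothetical (it is not the attached patch); it just shows an Analyzer that tolerates a null Reader because its tokens come from a buffer filled elsewhere:

{code:java}
import java.io.Reader;
import java.util.Iterator;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

// Hypothetical sketch: replay Tokens captured during an earlier
// field's analysis; the Reader is deliberately ignored.
public class BufferReplayAnalyzer extends Analyzer {
  private final List buffered;   // Tokens captured earlier

  public BufferReplayAnalyzer(List buffered) {
    this.buffered = buffered;
  }

  public TokenStream tokenStream(String fieldName, Reader reader) {
    // reader may be null here -- tokens come from the buffer instead
    return new TokenStream() {
      private final Iterator it = buffered.iterator();
      public Token next() {
        return it.hasNext() ? (Token) it.next() : null;
      }
    };
  }
}
{code}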

> New Analyzer for buffering tokens
> -
>
> Key: LUCENE-1058
> URL: https://issues.apache.org/jira/browse/LUCENE-1058
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Analysis
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
>Priority: Minor
> Attachments: LUCENE-1058.patch, LUCENE-1058.patch
>
>
> In some cases, it would be handy to have Analyzer/Tokenizer/TokenFilters that 
> could siphon off certain tokens and store them in a buffer to be used later 
> in the processing pipeline.
> For example, if you want to have two fields, one lowercased and one not, but 
> all the other analysis is the same, then you could save off the tokens to be 
> output for a different field.
> Patch to follow, but I am still not sure about a couple of things, mostly how 
> it plays with the new reuse API.
> See 
> http://www.gossamer-threads.com/lists/lucene/java-dev/54397?search_string=BufferingAnalyzer;#54397




[jira] Commented: (LUCENE-1058) New Analyzer for buffering tokens

2007-11-20 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12544145
 ] 

Grant Ingersoll commented on LUCENE-1058:
-

Some javadoc comments for the modifyToken method in BufferingTokenFilter should 
be sufficient, right?  Something to the effect that if this TokenFilter is not 
the last in the chain, it should make a full copy.

As for the CachedTokenizer and CachedAnalyzer, those should be implied, since 
the user is passing them in to begin with.

The other thing of interest is that calling Analyzer.tokenStream(String, 
Reader) is not needed.  In fact, this somewhat suggests having a new Fieldable 
property, akin to tokenStreamValue(), etc., that says don't even ask the 
Fieldable for a value.

Let me take a crack at what that means and post a patch.  It will mean some 
changes to invertField() in DocumentsWriter and possibly changing it to not 
require that one of tokenStreamValue(), readerValue(), or stringValue() be 
defined.  Not sure if that is a good idea or not.



> New Analyzer for buffering tokens
> -
>
> Key: LUCENE-1058
> URL: https://issues.apache.org/jira/browse/LUCENE-1058
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Analysis
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
>Priority: Minor
> Attachments: LUCENE-1058.patch
>
>
> In some cases, it would be handy to have Analyzer/Tokenizer/TokenFilters that 
> could siphon off certain tokens and store them in a buffer to be used later 
> in the processing pipeline.
> For example, if you want to have two fields, one lowercased and one not, but 
> all the other analysis is the same, then you could save off the tokens to be 
> output for a different field.
> Patch to follow, but I am still not sure about a couple of things, mostly how 
> it plays with the new reuse API.
> See 
> http://www.gossamer-threads.com/lists/lucene/java-dev/54397?search_string=BufferingAnalyzer;#54397




[jira] Commented: (LUCENE-1052) Add an "termInfosIndexDivisor" to IndexReader

2007-11-20 Thread Chuck Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12544136
 ] 

Chuck Williams commented on LUCENE-1052:


I can report that in our application having a formula is critical.  We have no 
control over the content our users index, nor in fact do they.  These are 
arbitrary documents.  We find a surprising number of them contain embedded 
encoded binary data.  When those are indexed, Lucene's memory consumption 
skyrockets, either bringing the whole app down with an OOM or slowing 
performance to a crawl due to excessive GCs reclaiming a tiny remaining 
working memory space.

Our users won't accept a solution like "wait until the problem occurs and then 
increment your termIndexDivisor".  They expect our app to manage this 
automatically.

I agree that making TermInfosReader, SegmentReader, etc. public classes is not 
a great solution.  The current patch does not do that.  It simply adds a 
configurable class that can be used to provide formula parameters as opposed to 
just value parameters.  At least for us, this special case is sufficiently 
important to outweigh any considerations of the complexity of an additional 
class.

A single configuration class could be used at the IndexReader level that 
provides for both static and dynamically-varying properties through getters, 
some of which take parameters.

Here is another possible solution.  My current thought is that the bound should 
always be a multiple of sqrt(numDocs).  E.g., see Heap's Law here:
http://nlp.stanford.edu/IR-book/html/htmledition/heaps-law-estimating-the-number-of-terms-1.html

I'm currently using this formula in my TermInfosConfigurer:

int bound = (int) (1 + TERM_BOUNDING_MULTIPLIER * Math.sqrt(1 + segmentNumDocs) / TERM_INDEX_INTERVAL);

This has Heap's law as its foundation.  I provide TERM_BOUNDING_MULTIPLIER as the 
config parameter, with 0 meaning "don't do this".  I also provide a 
TERM_INDEX_DIVISOR_OVERRIDE that overrides the dynamic bounding with a manually 
specified constant amount.

If that approach would be acceptable to Lucene in general, then we just need 
two static parameters.  However, I don't have enough experience with how well 
this formula works in our user base yet to know whether or not we'll tune it 
further.
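
As a quick sanity check of the scaling, here is the formula evaluated with assumed parameter values (the multiplier below is illustrative, not a production tuning; 128 is Lucene's default termIndexInterval):

{code:java}
// Illustrative values only.
static final int TERM_INDEX_INTERVAL = 128;
static final double TERM_BOUNDING_MULTIPLIER = 100.0;

static int termIndexBound(int segmentNumDocs) {
  return (int) (1 + TERM_BOUNDING_MULTIPLIER
      * Math.sqrt(1 + segmentNumDocs) / TERM_INDEX_INTERVAL);
}

// termIndexBound(10000)   ->  79
// termIndexBound(1000000) -> 782   (sqrt keeps growth sub-linear)
{code}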




> Add an "termInfosIndexDivisor" to IndexReader
> -
>
> Key: LUCENE-1052
> URL: https://issues.apache.org/jira/browse/LUCENE-1052
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.2
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.3
>
> Attachments: LUCENE-1052.patch, termInfosConfigurer.patch
>
>
> The termIndexInterval, set during indexing time, lets you trade off
> how much RAM is used by a reader to load the indexed terms vs the cost of
> seeking to the specific term you want to load.
> But the downside is you must set it at indexing time.
> This issue adds an indexDivisor to TermInfosReader so that on opening
> a reader you could further sub-sample the termIndexInterval to use
> less RAM.  EG a setting of 2 means every 2 * termIndexInterval'th term is
> loaded into RAM.
> This is particularly useful if your index has a great many terms (eg
> you accidentally indexed binary terms).
> Spinoff from this thread:
>   http://www.gossamer-threads.com/lists/lucene/java-dev/54371




[jira] Commented: (LUCENE-1052) Add an "termInfosIndexDivisor" to IndexReader

2007-11-20 Thread Doug Cutting (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12544125
 ] 

Doug Cutting commented on LUCENE-1052:
--

What class would we put TermInfosReader-specific setters & getters on, since 
that class is not public?  Do we make TermInfosReader public or leave it 
package-private?  My intuition is to leave it package-private for now, in order 
to retain freedom to re-structure w/o breaking applications, and because making 
it public would drag a lot of other stuff into the public.  We could consider 
making SegmentReader public, so that there's a public class that corresponds to 
the concrete index implementation, but that'd also drag more stuff public (like 
DirectoryIndexReader).

I'm also not yet convinced that it is critical to support arbitrary formulae 
for this feature.  Sure, it would be nice, but it has costs, like increasing 
public APIs that must be supported.  Folks have done fine without this feature 
for many years.  Adding a simple integer divisor is a sufficient initial step 
here.

So, even if we add a configuration system, I think the setter methods could 
still end up on IndexReader.  The difference is primarily whether the methods 
are:

public void setTermIndexInterval(int interval);
public void setTermIndexDivisor(int divisor);

or

public static void setTermIndexInterval(LuceneProps props, int interval);
public static void setTermIndexDivisor(LuceneProps props, int divisor);

With the latter just a façade that uses package-private stuff.  I think the 
latter style will be handy as we start adding parameters to, e.g., Query 
classes.  In those cases we'll probably want façade's too, since a Query setter 
will probably really tweak something for a private Scorer class.  In the case 
of indexes, however, we don't have a public, concrete class.

Another option is to make a public class whose purpose is just to hold such 
parameters, something like SegmentIndexParameters.  That'd be my first choice 
and was the direction I pointed in my initial proposal, but with considerably 
less explanation.
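
A minimal sketch of that last option, with LuceneProps and the key name purely hypothetical (neither is an existing API):

{code:java}
// Hypothetical sketch of a public parameters facade over
// package-private plumbing; only the facade is public.
public final class SegmentIndexParameters {
  private SegmentIndexParameters() {}

  public static void setTermIndexDivisor(LuceneProps props, int divisor) {
    props.setInt("segment.termIndexDivisor", divisor);  // assumed key/method
  }

  public static int getTermIndexDivisor(LuceneProps props) {
    return props.getInt("segment.termIndexDivisor", 1); // 1 = no sub-sampling
  }
}
{code}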

> Add an "termInfosIndexDivisor" to IndexReader
> -
>
> Key: LUCENE-1052
> URL: https://issues.apache.org/jira/browse/LUCENE-1052
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.2
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.3
>
> Attachments: LUCENE-1052.patch, termInfosConfigurer.patch
>
>
> The termIndexInterval, set during indexing time, lets you trade off
> how much RAM is used by a reader to load the indexed terms vs the cost of
> seeking to the specific term you want to load.
> But the downside is you must set it at indexing time.
> This issue adds an indexDivisor to TermInfosReader so that on opening
> a reader you could further sub-sample the termIndexInterval to use
> less RAM.  EG a setting of 2 means every 2 * termIndexInterval'th term is
> loaded into RAM.
> This is particularly useful if your index has a great many terms (eg
> you accidentally indexed binary terms).
> Spinoff from this thread:
>   http://www.gossamer-threads.com/lists/lucene/java-dev/54371




[jira] Commented: (LUCENE-1044) Behavior on hard power shutdown

2007-11-20 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12544115
 ] 

Michael McCandless commented on LUCENE-1044:


OK, I tested calling command-line "sync" after writing each segments
file.  It's in fact even slower than fsync on each file for these 3
cases:

Linux (2.6.22.1), reiserfs, 6 drive RAID5 array: 93% slower
    sync - 330.74
  nosync - 171.24

Linux (2.6.22.1), ext3, single drive: 60% slower
    sync - 242.02
  nosync - 150.91

Mac Pro (10.4 Tiger), 4 drive RAID0 array: 28% slower
    sync - 204.77
  nosync - 159.90

I'll look into the separate thread to sync/close files in the
background next...
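
For comparison, the per-file fsync case is just this in standard java.io (the file name below is made up):

{code:java}
import java.io.IOException;
import java.io.RandomAccessFile;

// The "fsync on each file" alternative, sketched: write, then force
// the bytes to stable storage before closing.
public class FsyncExample {
  public static void main(String[] args) throws IOException {
    RandomAccessFile f = new RandomAccessFile("segments.new", "rw");
    try {
      f.writeInt(42);       // some pending write
      f.getFD().sync();     // fsync: flush OS buffers to the disk
    } finally {
      f.close();
    }
  }
}
{code}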


> Behavior on hard power shutdown
> ---
>
> Key: LUCENE-1044
> URL: https://issues.apache.org/jira/browse/LUCENE-1044
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
> Environment: Windows Server 2003, Standard Edition, Sun Hotspot Java 
> 1.5
>Reporter: venkat rangan
>Assignee: Michael McCandless
> Fix For: 2.3
>
> Attachments: LUCENE-1044.patch, LUCENE-1044.take2.patch, 
> LUCENE-1044.take3.patch
>
>
> When indexing a large number of documents, upon a hard power failure (e.g. 
> pulling the power cord), the index seems to get corrupted. We start a Java 
> application as a Windows Service, and feed it documents. In some cases 
> (after an index size of 1.7GB, with 30-40 index segment .cfs files), the 
> following is observed.
> The 'segments' file contains only zeros. Its size is 265 bytes - all bytes 
> are zeros.
> The 'deleted' file also contains only zeros. Its size is 85 bytes - all bytes 
> are zeros.
> Before corruption, the segments file and deleted file appear to be correct. 
> After this corruption, the index is corrupted and lost.
> This is a problem observed in Lucene 1.4.3. We are not able to upgrade our 
> customer deployments to 1.9 or later version, but would be happy to back-port 
> a patch, if the patch is small enough and if this problem is already solved.




[jira] Updated: (LUCENE-1001) Add Payload retrieval to Spans

2007-11-20 Thread Paul Elschot (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Elschot updated LUCENE-1001:
-

Comment: was deleted

> Add Payload retrieval to Spans
> --
>
> Key: LUCENE-1001
> URL: https://issues.apache.org/jira/browse/LUCENE-1001
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
>Priority: Minor
>
> It will be nice to have access to payloads when doing SpanQuerys.
> See http://www.gossamer-threads.com/lists/lucene/java-dev/52270 and 
> http://www.gossamer-threads.com/lists/lucene/java-dev/51134
> Current API, added to Spans.java is below.  I will try to post a patch as 
> soon as I can figure out how to make it work for unordered spans (I believe I 
> have all the other cases working).
> {noformat}
>   /**
>    * Returns the payload data for the current span.
>    * This is invalid until {@link #next()} is called for
>    * the first time.
>    * This method must not be called more than once after each call
>    * of {@link #next()}. However, payloads are loaded lazily,
>    * so if the payload data for the current position is not needed,
>    * this method may not be called at all for performance reasons.
>    *
>    * WARNING: The status of the Payloads feature is experimental.
>    * The APIs introduced here might change in the future and will not be
>    * supported anymore in such a case.
>    *
>    * @return a List of byte arrays containing the data of this payload
>    * @throws IOException
>    */
>   // TODO: Remove warning after API has been finalized
>   List/*<byte[]>*/ getPayload() throws IOException;
>
>   /**
>    * Checks if a payload can be loaded at this position.
>    *
>    * Payloads can only be loaded once per call to
>    * {@link #next()}.
>    *
>    * WARNING: The status of the Payloads feature is experimental.
>    * The APIs introduced here might change in the future and will not be
>    * supported anymore in such a case.
>    *
>    * @return true if there is a payload available at this position that can
>    *         be loaded
>    */
>   // TODO: Remove warning after API has been finalized
>   public boolean isPayloadAvailable();
> {noformat}




[jira] Commented: (LUCENE-1001) Add Payload retrieval to Spans

2007-11-20 Thread Paul Elschot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12544108
 ] 

Paul Elschot commented on LUCENE-1001:
--

Grant,

You asked:

... how do I get access to the position payloads in the order that they occur 
in the PQ? 

The answer was already there:

... , it's easier than that: when they match, they all match, so you only need 
to keep the input Spans around in a List or whatever. Then use them all as a 
source for your payloads.

Regards,
Paul Elschot

P.S. I've had my break already...
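
In code, the suggestion might look like this sketch, using only the proposed Spans payload API from the issue description below (the helper name is made up):

{code:java}
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.lucene.search.spans.Spans;

// Sketch: when the enclosing span matches, all input Spans match too,
// so collect the payloads from each of them in turn.
static List collectPayloads(List subSpans /* of Spans */) throws IOException {
  List result = new ArrayList();
  for (Iterator it = subSpans.iterator(); it.hasNext();) {
    Spans spans = (Spans) it.next();
    if (spans.isPayloadAvailable()) {
      result.addAll(spans.getPayload());
    }
  }
  return result;
}
{code}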


> Add Payload retrieval to Spans
> --
>
> Key: LUCENE-1001
> URL: https://issues.apache.org/jira/browse/LUCENE-1001
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
>Priority: Minor
>
> It will be nice to have access to payloads when doing SpanQuerys.
> See http://www.gossamer-threads.com/lists/lucene/java-dev/52270 and 
> http://www.gossamer-threads.com/lists/lucene/java-dev/51134
> Current API, added to Spans.java is below.  I will try to post a patch as 
> soon as I can figure out how to make it work for unordered spans (I believe I 
> have all the other cases working).
> {noformat}
>   /**
>    * Returns the payload data for the current span.
>    * This is invalid until {@link #next()} is called for
>    * the first time.
>    * This method must not be called more than once after each call
>    * of {@link #next()}. However, payloads are loaded lazily,
>    * so if the payload data for the current position is not needed,
>    * this method may not be called at all for performance reasons.
>    *
>    * WARNING: The status of the Payloads feature is experimental.
>    * The APIs introduced here might change in the future and will not be
>    * supported anymore in such a case.
>    *
>    * @return a List of byte arrays containing the data of this payload
>    * @throws IOException
>    */
>   // TODO: Remove warning after API has been finalized
>   List/*<byte[]>*/ getPayload() throws IOException;
>
>   /**
>    * Checks if a payload can be loaded at this position.
>    *
>    * Payloads can only be loaded once per call to
>    * {@link #next()}.
>    *
>    * WARNING: The status of the Payloads feature is experimental.
>    * The APIs introduced here might change in the future and will not be
>    * supported anymore in such a case.
>    *
>    * @return true if there is a payload available at this position that can
>    *         be loaded
>    */
>   // TODO: Remove warning after API has been finalized
>   public boolean isPayloadAvailable();
> {noformat}




[jira] Commented: (LUCENE-1055) Remove GData from trunk

2007-11-20 Thread Paul Elschot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12544107
 ] 

Paul Elschot commented on LUCENE-1055:
--

Hoss, 
That must have been the cause. After removing the gdata-server directory 
manually everything is in order. Thanks.

> Remove GData from trunk 
> 
>
> Key: LUCENE-1055
> URL: https://issues.apache.org/jira/browse/LUCENE-1055
> Project: Lucene - Java
>  Issue Type: Task
>  Components: contrib/*
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 2.3
>
> Attachments: lucene-1055.patch
>
>
> GData doesn't seem to be maintained anymore. We're going to remove it before 
> we cut the 2.3 release unless there are negative votes.
> In case someone jumps in in the future and starts to maintain it, we can 
> re-add it to the trunk.
> If anyone is using GData and needs it to be in 2.3 please let us know soon!




[jira] Commented: (LUCENE-1001) Add Payload retrieval to Spans

2007-11-20 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12544105
 ] 

Grant Ingersoll commented on LUCENE-1001:
-

Sure, but how do I get access to the position payloads in the order that they 
occur in the PQ?  I have to go and pop them all off the PQ, or I need to maintain 
a separate PQ for the Payloads so that when I go to get a payload for a span, I 
can iterate over all the items by calling PQ.pop(), but then I have to rebuild 
it again if getPayload is called again, right?

I think I need to take a break and come back to this after some Turkey...  :-)

> Add Payload retrieval to Spans
> --
>
> Key: LUCENE-1001
> URL: https://issues.apache.org/jira/browse/LUCENE-1001
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
>Priority: Minor
>
> It will be nice to have access to payloads when doing SpanQuerys.
> See http://www.gossamer-threads.com/lists/lucene/java-dev/52270 and 
> http://www.gossamer-threads.com/lists/lucene/java-dev/51134
> Current API, added to Spans.java is below.  I will try to post a patch as 
> soon as I can figure out how to make it work for unordered spans (I believe I 
> have all the other cases working).
> {noformat}
>   /**
>    * Returns the payload data for the current span.
>    * This is invalid until {@link #next()} is called for
>    * the first time.
>    * This method must not be called more than once after each call
>    * of {@link #next()}. However, payloads are loaded lazily,
>    * so if the payload data for the current position is not needed,
>    * this method may not be called at all for performance reasons.
>    *
>    * WARNING: The status of the Payloads feature is experimental.
>    * The APIs introduced here might change in the future and will not be
>    * supported anymore in such a case.
>    *
>    * @return a List of byte arrays containing the data of this payload
>    * @throws IOException
>    */
>   // TODO: Remove warning after API has been finalized
>   List/*<byte[]>*/ getPayload() throws IOException;
>
>   /**
>    * Checks if a payload can be loaded at this position.
>    *
>    * Payloads can only be loaded once per call to
>    * {@link #next()}.
>    *
>    * WARNING: The status of the Payloads feature is experimental.
>    * The APIs introduced here might change in the future and will not be
>    * supported anymore in such a case.
>    *
>    * @return true if there is a payload available at this position that can
>    *         be loaded
>    */
>   // TODO: Remove warning after API has been finalized
>   public boolean isPayloadAvailable();
> {noformat}




[jira] Resolved: (LUCENE-1063) Token re-use API breaks back compatibility in certain TokenStream chains

2007-11-20 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-1063.


Resolution: Invalid

> Token re-use API breaks back compatibility in certain TokenStream chains
> 
>
> Key: LUCENE-1063
> URL: https://issues.apache.org/jira/browse/LUCENE-1063
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Analysis
>Affects Versions: 2.3
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 2.3
>
> Attachments: LUCENE-1063.patch
>
>
> In scrutinizing the new Token re-use API during this thread:
>   http://www.gossamer-threads.com/lists/lucene/java-dev/54708
> I realized we now have a non-back-compatibility when mixing re-use and
> non-re-use TokenStreams.
> The new "reuse" next(Token) API actually allows two different aspects
> of re-use:
>   1) "Backwards re-use": the subsequent call to next(Token) is allowed
>  to change all aspects of the provided Token, meaning the caller
>  must do all persisting of Token that it needs before calling
>  next(Token) again.
>   2) "Forwards re-use": the caller is allowed to modify the returned
>  Token however it wants.  Eg the LowerCaseFilter is allowed to
>  downcase the characters in-place in the char[] termBuffer.
> The forwards re-use case can break backwards compatibility now.  EG:
> if a TokenStream X providing only the "non-reuse" next() API is
> followed by a TokenFilter Y using the "reuse" next(Token) API to pull
> the tokens, then the default implementation in TokenStream.java for
> next(Token) will kick in.
> That default implementation just returns the provided "private copy"
> Token returned by next().  But, because of 2) above, this is not
> legal: if the TokenFilter Y modifies the char[] termBuffer (say), that
> is actually modifying the cached copy being potentially stored by X.
> I think the opposite case is handled correctly.
> A simple way to fix this is to make a full copy of the Token in the
> next(Token) call in TokenStream, just like we do in the next() method
> in TokenStream.  The downside is this is a small performance hit.  However
> that hit only happens at the boundary between a non-reuse and a re-use
> tokenizer.




[jira] Commented: (LUCENE-1063) Token re-use API breaks back compatibility in certain TokenStream chains

2007-11-20 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12544103
 ] 

Michael McCandless commented on LUCENE-1063:


OK it sounds like this was a false alarm on my part -- sorry!

The semantics of next() have always allowed the caller to arbitrarily
modify the returned token ("forward reuse").

So I don't think we need to change anything.

> Token re-use API breaks back compatibility in certain TokenStream chains
> 
>
> Key: LUCENE-1063
> URL: https://issues.apache.org/jira/browse/LUCENE-1063
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Analysis
>Affects Versions: 2.3
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 2.3
>
> Attachments: LUCENE-1063.patch
>
>
> In scrutinizing the new Token re-use API during this thread:
>   http://www.gossamer-threads.com/lists/lucene/java-dev/54708
> I realized we now have a non-back-compatibility when mixing re-use and
> non-re-use TokenStreams.
> The new "reuse" next(Token) API actually allows two different aspects
> of re-use:
>   1) "Backwards re-use": the subsequent call to next(Token) is allowed
>  to change all aspects of the provided Token, meaning the caller
>  must do all persisting of Token that it needs before calling
>  next(Token) again.
>   2) "Forwards re-use": the caller is allowed to modify the returned
>  Token however it wants.  Eg the LowerCaseFilter is allowed to
>  downcase the characters in-place in the char[] termBuffer.
> The forwards re-use case can break backwards compatibility now.  EG:
> if a TokenStream X providing only the "non-reuse" next() API is
> followed by a TokenFilter Y using the "reuse" next(Token) API to pull
> the tokens, then the default implementation in TokenStream.java for
> next(Token) will kick in.
> That default implementation just returns the provided "private copy"
> Token returned by next().  But, because of 2) above, this is not
> legal: if the TokenFilter Y modifies the char[] termBuffer (say), that
> is actually modifying the cached copy being potentially stored by X.
> I think the opposite case is handled correctly.
> A simple way to fix this is to make a full copy of the Token in the
> next(Token) call in TokenStream, just like we do in the next() method
> in TokenStream.  The downside is this is a small performance hit.  However
> that hit only happens at the boundary between a non-reuse and a re-use
> tokenizer.




[jira] Commented: (LUCENE-1063) Token re-use API breaks back compatibility in certain TokenStream chains

2007-11-20 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12544098
 ] 

Yonik Seeley commented on LUCENE-1063:
--

> CachingTokenFilter actually does this (caching references to the tokens).

It's a bug to depend on the fact that the tokens you return won't change.

If one is supposed to be able to use CachingTokenFilter anywhere in a filter 
chain and then be able to replay the tokens exactly as CachingTokenFilter first 
saw them (which is what I would guess the use to be), then it is a bug and 
didn't work properly before token reuse either.


> Token re-use API breaks back compatibility in certain TokenStream chains
> 
>
> Key: LUCENE-1063
> URL: https://issues.apache.org/jira/browse/LUCENE-1063
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Analysis
>Affects Versions: 2.3
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 2.3
>
> Attachments: LUCENE-1063.patch
>
>
> In scrutinizing the new Token re-use API during this thread:
>   http://www.gossamer-threads.com/lists/lucene/java-dev/54708
> I realized we now have a non-back-compatibility when mixing re-use and
> non-re-use TokenStreams.
> The new "reuse" next(Token) API actually allows two different aspects
> of re-use:
>   1) "Backwards re-use": the subsequent call to next(Token) is allowed
>  to change all aspects of the provided Token, meaning the caller
>  must do all persisting of Token that it needs before calling
>  next(Token) again.
>   2) "Forwards re-use": the caller is allowed to modify the returned
>  Token however it wants.  Eg the LowerCaseFilter is allowed to
>  downcase the characters in-place in the char[] termBuffer.
> The forwards re-use case can break backwards compatibility now.  EG:
> if a TokenStream X providing only the "non-reuse" next() API is
> followed by a TokenFilter Y using the "reuse" next(Token) API to pull
> the tokens, then the default implementation in TokenStream.java for
> next(Token) will kick in.
> That default implementation just returns the provided "private copy"
> Token returned by next().  But, because of 2) above, this is not
> legal: if the TokenFilter Y modifies the char[] termBuffer (say), that
> is actually modifying the cached copy being potentially stored by X.
> I think the opposite case is handled correctly.
> A simple way to fix this is to make a full copy of the Token in the
> next(Token) call in TokenStream, just like we do in the next() method
> in TokenStream.  The downside is this is a small performance hit.  However
> that hit only happens at the boundary between a non-reuse and a re-use
> tokenizer.




[jira] Commented: (LUCENE-1062) Improved Payloads API

2007-11-20 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12544095
 ] 

Michael Busch commented on LUCENE-1062:
---

We want to add the following methods to Payload:

{code:java}
public void setPayload(byte[] data);
public void setPayload(byte[] data, int offset, int length);
public byte[] getPayload();
public int getPayloadOffset();

public Object clone();
{code}

Also Payload should implement Cloneable.


Furthermore, we want to add a fieldName arg to Similarity.scorePayload().

I think we can also remove the "experimental" warnings from the Payload
APIs now?
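
Assuming the methods land with the signatures listed above, usage could look like this sketch:

{code:java}
// Usage sketch, assuming the proposed methods above are added.
byte[] data = new byte[] { 1, 2, 3, 4 };

Payload p = new Payload();
p.setPayload(data, 1, 2);               // view on bytes {2, 3}

byte[] bytes  = p.getPayload();         // backing array
int    offset = p.getPayloadOffset();   // 1

Payload copy = (Payload) p.clone();     // private copy, once Cloneable
{code}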


> Improved Payloads API
> -
>
> Key: LUCENE-1062
> URL: https://issues.apache.org/jira/browse/LUCENE-1062
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Analysis
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 2.3
>
>
> We want to make some optimizations to the Payloads API.
> See following thread for related discussions:
> http://www.gossamer-threads.com/lists/lucene/java-dev/54708




[jira] Commented: (LUCENE-1058) New Analyzer for buffering tokens

2007-11-20 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12544096
 ] 

Michael McCandless commented on LUCENE-1058:


I think the discussion in LUCENE-1063 is relevant to this issue: if you store 
(& re-use) Tokens you may need to return a copy of the Token from the next() 
method to ensure that any filters that alter the Token don't mess up your 
private copy.

> New Analyzer for buffering tokens
> -
>
> Key: LUCENE-1058
> URL: https://issues.apache.org/jira/browse/LUCENE-1058
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Analysis
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
>Priority: Minor
> Attachments: LUCENE-1058.patch
>
>
> In some cases, it would be handy to have Analyzer/Tokenizer/TokenFilters that 
> could siphon off certain tokens and store them in a buffer to be used later 
> in the processing pipeline.
> For example, if you want to have two fields, one lowercased and one not, but 
> all the other analysis is the same, then you could save off the tokens to be 
> output for a different field.
> Patch to follow, but I am still not sure about a couple of things, mostly how 
> it plays with the new reuse API.
> See 
> http://www.gossamer-threads.com/lists/lucene/java-dev/54397?search_string=BufferingAnalyzer;#54397




[jira] Commented: (LUCENE-1063) Token re-use API breaks back compatibility in certain TokenStream chains

2007-11-20 Thread Doron Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12544093
 ] 

Doron Cohen commented on LUCENE-1063:
-

{quote}
> TokenStreams that cache tokens without "protecting" their private copy when 
> next() is called?

That would be a bug in the filter (both in the past and now).
{quote}

I think it is okay to relax this to only protect in Tokenizers (where Tokens are 
created), and not worry about TokenFilters.

TokenFilters always take a TokenStream at construction and always call its 
next(Token), which eventually calls a Tokenizer.next(Token) -- which is 
protected -- and so the TokenFilter can rely on that protection. Right?

> Token re-use API breaks back compatibility in certain TokenStream chains
> 
>
> Key: LUCENE-1063
> URL: https://issues.apache.org/jira/browse/LUCENE-1063
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Analysis
>Affects Versions: 2.3
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 2.3
>
> Attachments: LUCENE-1063.patch
>
>
> In scrutinizing the new Token re-use API during this thread:
>   http://www.gossamer-threads.com/lists/lucene/java-dev/54708
> I realized we now have a non-back-compatibility when mixing re-use and
> non-re-use TokenStreams.
> The new "reuse" next(Token) API actually allows two different aspects
> of re-use:
>   1) "Backwards re-use": the subsequent call to next(Token) is allowed
>  to change all aspects of the provided Token, meaning the caller
>  must do all persisting of Token that it needs before calling
>  next(Token) again.
>   2) "Forwards re-use": the caller is allowed to modify the returned
>  Token however it wants.  Eg the LowerCaseFilter is allowed to
>  downcase the characters in-place in the char[] termBuffer.
> The forwards re-use case can break backwards compatibility now.  EG:
> if a TokenStream X providing only the "non-reuse" next() API is
> followed by a TokenFilter Y using the "reuse" next(Token) API to pull
> the tokens, then the default implementation in TokenStream.java for
> next(Token) will kick in.
> That default implementation just returns the provided "private copy"
> Token returned by next().  But, because of 2) above, this is not
> legal: if the TokenFilter Y modifies the char[] termBuffer (say), that
> is actually modifying the cached copy being potentially stored by X.
> I think the opposite case is handled correctly.
> A simple way to fix this is to make a full copy of the Token in the
> next(Token) call in TokenStream, just like we do in the next() method
> in TokenStream.  The downside is this is a small performance hit.  However
> that hit only happens at the boundary between a non-reuse and a re-use
> tokenizer.




[jira] Commented: (LUCENE-1063) Token re-use API breaks back compatibility in certain TokenStream chains

2007-11-20 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12544091
 ] 

Michael Busch commented on LUCENE-1063:
---

{quote}
I think it should put a cloned copy into the cache.
{quote}

Or, we could add a boolean to the constructor of CachingTokenFilter
that specifies whether or not to clone the Tokens. So if a
user knows that it is safe to simply cache the references,
they can disable the cloning for performance reasons.
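
Something like this hypothetical constructor, in a simplified sketch (the real CachingTokenFilter streams while caching on the first pass; here the cache is filled eagerly to keep the sketch short):

{code:java}
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

// Hypothetical sketch of the proposed flag, not the committed class.
public class CachingTokenFilter extends TokenFilter {
  private final boolean cloneTokens;
  private List cache;          // filled on the first call to next()

  public CachingTokenFilter(TokenStream input, boolean cloneTokens) {
    super(input);
    this.cloneTokens = cloneTokens;
  }

  public Token next() throws IOException {
    if (cache == null) {       // first pass: pull everything from input
      cache = new ArrayList();
      for (Token t = input.next(); t != null; t = input.next()) {
        cache.add(cloneTokens ? (Token) t.clone() : t);
      }
      iterator = cache.iterator();
    }
    return iterator.hasNext() ? (Token) iterator.next() : null;
  }

  private Iterator iterator;   // replay position over the cache
}
{code}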

> Token re-use API breaks back compatibility in certain TokenStream chains
> 
>
> Key: LUCENE-1063
> URL: https://issues.apache.org/jira/browse/LUCENE-1063
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Analysis
>Affects Versions: 2.3
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 2.3
>
> Attachments: LUCENE-1063.patch
>
>
> In scrutinizing the new Token re-use API during this thread:
>   http://www.gossamer-threads.com/lists/lucene/java-dev/54708
> I realized we now have a non-back-compatibility when mixing re-use and
> non-re-use TokenStreams.
> The new "reuse" next(Token) API actually allows two different aspects
> of re-use:
>   1) "Backwards re-use": the subsequent call to next(Token) is allowed
>  to change all aspects of the provided Token, meaning the caller
>  must do all persisting of Token that it needs before calling
>  next(Token) again.
>   2) "Forwards re-use": the caller is allowed to modify the returned
>  Token however it wants.  Eg the LowerCaseFilter is allowed to
>  downcase the characters in-place in the char[] termBuffer.
> The forwards re-use case can break backwards compatibility now.  EG:
> if a TokenStream X providing only the "non-reuse" next() API is
> followed by a TokenFilter Y using the "reuse" next(Token) API to pull
> the tokens, then the default implementation in TokenStream.java for
> next(Token) will kick in.
> That default implementation just returns the provided "private copy"
> Token returned by next().  But, because of 2) above, this is not
> legal: if the TokenFilter Y modifies the char[] termBuffer (say), that
> is actually modifying the cached copy being potentially stored by X.
> I think the opposite case is handled correctly.
> A simple way to fix this is to make a full copy of the Token in the
> next(Token) call in TokenStream, just like we do in the next() method
> in TokenStream.  The downside is this is a small performance hit.  However
> that hit only happens at the boundary between a non-reuse and a re-use
> tokenizer.




[jira] Commented: (LUCENE-1063) Token re-use API breaks back compatibility in certain TokenStream chains

2007-11-20 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12544088
 ] 

Michael Busch commented on LUCENE-1063:
---

{quote}
That would be a bug in the filter (both in the past and now).
{quote}

CachingTokenFilter actually does this (caching references to the tokens).
I think it should put a cloned copy into the cache.

Oh, and actually I just noticed that Payload doesn't implement Cloneable!
So Token.clone() doesn't create a copy of the Payload, which I think it 
should? I will fix this with LUCENE-1062.
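
The fix could be as small as this sketch (the 'payload' field name is assumed, and it presumes Payload becomes Cloneable per LUCENE-1062):

{code:java}
// Sketch of a deep-copying Token.clone().
public Object clone() {
  try {
    Token t = (Token) super.clone();           // shallow copy of fields
    if (payload != null) {
      t.setPayload((Payload) payload.clone()); // deep-copy the payload
    }
    return t;
  } catch (CloneNotSupportedException e) {
    throw new RuntimeException(e);             // cannot happen: Cloneable
  }
}
{code}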

> Token re-use API breaks back compatibility in certain TokenStream chains
> 
>
> Key: LUCENE-1063
> URL: https://issues.apache.org/jira/browse/LUCENE-1063
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Analysis
>Affects Versions: 2.3
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 2.3
>
> Attachments: LUCENE-1063.patch
>
>
> In scrutinizing the new Token re-use API during this thread:
>   http://www.gossamer-threads.com/lists/lucene/java-dev/54708
> I realized we now have a non-back-compatibility when mixing re-use and
> non-re-use TokenStreams.
> The new "reuse" next(Token) API actually allows two different aspects
> of re-use:
>   1) "Backwards re-use": the subsequent call to next(Token) is allowed
>  to change all aspects of the provided Token, meaning the caller
>  must do all persisting of Token that it needs before calling
>  next(Token) again.
>   2) "Forwards re-use": the caller is allowed to modify the returned
>  Token however it wants.  Eg the LowerCaseFilter is allowed to
>  downcase the characters in-place in the char[] termBuffer.
> The forwards re-use case can break backwards compatibility now.  EG:
> if a TokenStream X providing only the "non-reuse" next() API is
> followed by a TokenFilter Y using the "reuse" next(Token) API to pull
> the tokens, then the default implementation in TokenStream.java for
> next(Token) will kick in.
> That default implementation just returns the provided "private copy"
> Token returned by next().  But, because of 2) above, this is not
> legal: if the TokenFilter Y modifies the char[] termBuffer (say), that
> is actually modifying the cached copy being potentially stored by X.
> I think the opposite case is handled correctly.
> A simple way to fix this is to make a full copy of the Token in the
> next(Token) call in TokenStream, just like we do in the next() method
> in TokenStream.  The downside is this is a small performance hit.  However
> that hit only happens at the boundary between a non-reuse and a re-use
> tokenizer.




[jira] Commented: (LUCENE-1063) Token re-use API breaks back compatibility in certain TokenStream chains

2007-11-20 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12544077
 ] 

Yonik Seeley commented on LUCENE-1063:
--

In the past, the semantics were simple... Tokenizer generated tokens, and token 
filters modified them.  I don't think it was a bug that filters modify instead 
of create new tokens.  No one cached tokens and expected them to be unchanged 
because they could be modified by a downstream filter.

> TokenStreams that cache tokens without "protecting" their private copy when 
> next() is called?

That would be a bug in the filter (both in the past and now).


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1052) Add an "termInfosIndexDivisor" to IndexReader

2007-11-20 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12544076
 ] 

Michael McCandless commented on LUCENE-1052:



Maybe, instead, we should simply make it "easy" to subclass
TermInfosReader whenever a SegmentReader wants to instantiate it?

Ie, the formula is such an advanced use case that it seems appropriate
to subclass instead of trying to break it out into a special
interface/abstract class?

Of course, we need to know this class at SegmentReader construction
time, so I think to specify it we should in fact take Doug's suggested
approach using generic properties.

The challenge with Lucene (and Hadoop) is how you can reach deep down
into a complex IndexReader.open static method call to change various
details of the embedded *Readers while they are being constructed,
and after they are constructed... I agree it is messy now that we
must propagate the setTermInfosIndexInterval method up the *Reader
hierarchy when not all Readers would even use a TermInfosReader.

So ... maybe we 1) implement generic Lucene properties w/ static
classes/methods to set/get these properties, then 2) remove
set/getTermInfosIndexInterval from *Reader and make a generic property
for it instead, and 3) add another property that allows you to specify
the Class (or String name) of your TermInfosReader subclass
(and make it non-final)?
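
A rough sketch of what 1) and 3) together could look like;
LuceneProperties and the property key are made-up names purely for
illustration:

{code}
// Hypothetical generic-properties registry -- not an existing Lucene class.
public class LuceneProperties {
  private static final java.util.Properties props = new java.util.Properties();
  public static void set(String key, String value) { props.setProperty(key, value); }
  public static String get(String key, String dflt) { return props.getProperty(key, dflt); }
}

// Inside Lucene, at SegmentReader construction time:
//   String clazz = LuceneProperties.get("index.termInfosReader.class",
//                                       "org.apache.lucene.index.TermInfosReader");
//   ... instantiate it reflectively; this assumes 3)'s "make it non-final"
//   (and visible) change to TermInfosReader.
{code}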


> Add an "termInfosIndexDivisor" to IndexReader
> -
>
> Key: LUCENE-1052
> URL: https://issues.apache.org/jira/browse/LUCENE-1052
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.2
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.3
>
> Attachments: LUCENE-1052.patch, termInfosConfigurer.patch
>
>
> The termIndexInterval, set at indexing time, lets you trade off
> how much RAM is used by a reader to load the indexed terms vs. the cost of
> seeking to the specific term you want to load.
> But the downside is you must set it at indexing time.
> This issue adds an indexDivisor to TermInfosReader so that on opening
> a reader you could further sub-sample the termIndexInterval to use
> less RAM.  EG a setting of 2 means only every 2 * termIndexInterval'th
> term is loaded into RAM.
> This is particularly useful if your index has a great many terms (eg
> you accidentally indexed binary terms).
> Spinoff from this thread:
>   http://www.gossamer-threads.com/lists/lucene/java-dev/54371

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1063) Token re-use API breaks back compatibility in certain TokenStream chains

2007-11-20 Thread Doron Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12544075
 ] 

Doron Cohen commented on LUCENE-1063:
-

Oh, I was locked on the idea that calling next(null) means do-not-reuse, but I 
guess since we have the original next() this is not required.



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1063) Token re-use API breaks back compatibility in certain TokenStream chains

2007-11-20 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12544063
 ] 

Michael McCandless commented on LUCENE-1063:


{quote}
I checked next(Token res) implementations of CharTokenizer, KeywordTokenizer 
and StandardTokenizer and none of them checks res for null.
{quote}
I think you should not pass null into this method?  (Ie you should use
next() instead).  I can clarify this in the javadocs...


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1063) Token re-use API breaks back compatibility in certain TokenStream chains

2007-11-20 Thread Doron Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12544059
 ] 

Doron Cohen commented on LUCENE-1063:
-

{quote}
and even with old style Tokens w/o Token reuse, one could always change what 
string the token pointed at.
{quote}

...right... termText is now private but it used to be package protected. 

Patch looks good for the (default) TokenStream, 
though it is a shame there is no magic way to know whether the Token was 
changed and copying is really required.

But is this good enough also for non-default TokenStreams which implement 
next(Token)? 
mm.. I checked the next(Token res) implementations of CharTokenizer, 
KeywordTokenizer and StandardTokenizer and none of them checks res for null.
Am I missing something trivial?


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1052) Add an "termInfosIndexDivisor" to IndexReader

2007-11-20 Thread Chuck Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12544055
 ] 

Chuck Williams commented on LUCENE-1052:


I agree a general configuration system would be much better.  Doug, we use a 
similar method to what you described in our application.

TermInfosConfigurer is slightly different though since the desired config is a 
method that implements a formula, rather than just a value.  This could still 
be done more generally by allowing methods as well as properties or setters on 
a higher level configuration object.

I didn't want to take on the broader issue just for this feature.

Michael, I agree with both of your points.

I'd be happy to clean up this patch if you guys provide some guidance for what 
would make it acceptable to commit.
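
For reference, the "formula, not a value" shape would be roughly this (my
paraphrase of the idea, not verified against termInfosConfigurer.patch):

{code}
// The config is a method, not a value: the divisor can grow with segment
// size so the in-RAM term index stays bounded.
public interface TermInfosConfigurer {
  int getTermInfosIndexDivisor(String segmentName, long segmentNumTerms);
}

// Example formula: cap loaded index terms at ~100K per segment, assuming
// termIndexInterval=128 was used at indexing time.
class CapRamConfigurer implements TermInfosConfigurer {
  public int getTermInfosIndexDivisor(String segment, long numTerms) {
    long indexedTerms = numTerms / 128;
    return (int) Math.max(1, indexedTerms / 100000);
  }
}
{code}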



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1063) Token re-use API breaks back compatibility in certain TokenStream chains

2007-11-20 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12544054
 ] 

Michael McCandless commented on LUCENE-1063:


{quote}
Looking at the test, this would not have worked before token-reuse either; I 
don't yet see how we are breaking backward compatibility.
Callers of next() could change the Token, so caching your own copy that you 
already passed on to someone else was never valid.
{quote}

You're right: before token reuse a filter could change the String
termText (and other fields) and mess up a cached copy held by a
TokenStream earlier in the chain.

But, our core filters now use the reuse API (for better performance),
so if you are using a TokenStream that does caching followed by one of
these core filters we will now mess up the cached copy, right?

Oh, duh: I just checked 2.2 and in fact the LowerCaseFilter,
PorterStemFilter, ISOLatin1AccentFilter all directly alter termText
rather than making a new token.

So actually this issue is pre-existing!

And then I guess we are not breaking backwards compatibility by
further propagating it.

But I think this is still a bug?

Hmm, I guess the semantics of the next() API is and has been to allow
you to arbitrarily modify the token after you receive it ("forwards
reuse") but not re-use the token on the next call to next ("backwards
reuse").  If we take that approach then the bug is in those
TokenStreams that cache tokens without "protecting" their private copy
when next() is called?
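
To make the failure mode concrete, a sketch with a made-up caching stream
(LowerCaseFilter really does downcase the termBuffer in place in 2.3):

{code}
import java.io.IOException;
import org.apache.lucene.analysis.*;

// Implements only the old "non-reuse" next() and caches the Token it
// returns, relying on the old contract that callers won't modify it.
class CachingStream extends TokenStream {
  Token cached;
  private boolean done = false;
  public Token next() throws IOException {
    if (done) return null;
    done = true;
    cached = new Token("Fast", 0, 4);   // the "private copy"
    return cached;
  }
}

// Before the fix, in a chain like:
//   TokenStream ts = new LowerCaseFilter(new CachingStream());
//   ts.next(new Token());
// TokenStream's default next(Token) handed LowerCaseFilter the private
// copy, so the in-place downcasing silently turned `cached` into "fast".
{code}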



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1063) Token re-use API breaks back compatibility in certain TokenStream chains

2007-11-20 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12544048
 ] 

Yonik Seeley commented on LUCENE-1063:
--

> it is the addition of Token.termBuffer() that allowed this to happen 

But old filters won't use the termBuffer.
And even with old-style Tokens, w/o Token reuse, one could always change what 
String the token pointed at.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1063) Token re-use API breaks back compatibility in certain TokenStream chains

2007-11-20 Thread Doron Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12544045
 ] 

Doron Cohen commented on LUCENE-1063:
-

Yes, that's what I meant - it is the addition of Token.termBuffer() 
that allowed this to happen - in 2.2 (apart from payloads) only an 
immutable String could be obtained from the Token.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Apache logs and data

2007-11-20 Thread Karl Wettin


On 20 Nov 2007, at 20:28, Doug Cutting wrote:


karl wettin wrote:
On Nov 15, 2007 10:09 PM, Grant Ingersoll <[EMAIL PROTECTED]>  
wrote:

it is always good to have query logs

http://thepiratebay.org/tor/3783572


It doesn't look as though there's click data, so we can't use this  
for relevance experiments without manually creating judgments.


(LUCENE-626 extracts query goals from this data.)

I'll send my fellow countrymen a request for an update with a query log  
containing clicks, downloads, or whatever they are willing to give  
out. I'm sure they won't mind.



--
karl

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Payload API

2007-11-20 Thread Michael Busch
Grant Ingersoll wrote:
> Scratch my last comment.  I was thinking it only pertained to payloads.
> 
> In that light, I think we should modify the scorePayload method for the
> time being, then we can deprecate it when we go to per field sim.
> 
> -Grant
>

OK sounds good. Will make the change with LUCENE-1062.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1063) Token re-use API breaks back compatibility in certain TokenStream chains

2007-11-20 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12544034
 ] 

Yonik Seeley commented on LUCENE-1063:
--

Looking at the test, this would not have worked before token-reuse either; I 
don't yet see how we are breaking backward compatibility.
Callers of next() *could* change the Token, so caching your own copy that you 
already passed on to someone else was never valid.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1052) Add an "termInfosIndexDivisor" to IndexReader

2007-11-20 Thread Doug Cutting (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12544032
 ] 

Doug Cutting commented on LUCENE-1052:
--

I think we should be cautious about adding a new public interface or abstract 
class to support just this feature.  If we want to add a generic configuration 
API for Lucene, then I'd prefer something fully general, like what I proposed 
on the mailing list, not something specific to configuring TermInfosReader.  
Otherwise we'll keep adding new configuration interfaces and adding more 
parameters to IndexReader constructors each time we wish to make some obscure 
feature configurable.

http://www.gossamer-threads.com/lists/lucene/java-dev/54421#54421

In the model proposed there, adding a new configuration parameter involves just 
adding a new static method to the public class that implements a new 
configurable feature.
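
A sketch of that model as described (names hypothetical; see the linked
thread for the actual proposal):

{code}
// Each configurable feature exposes its knob as a static method on the
// public class implementing that feature -- no constructor plumbing.
public class TermIndexConfig {   // hypothetical public face of TermInfosReader
  private static int indexDivisor = 1;
  public static void setIndexDivisor(int d) { indexDivisor = d; }
  public static int getIndexDivisor() { return indexDivisor; }
}

// Callers configure before opening; the reader reads the value internally:
//   TermIndexConfig.setIndexDivisor(2);          // load every 2nd indexed term
//   IndexReader reader = IndexReader.open(dir);
{code}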



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Payload API

2007-11-20 Thread Grant Ingersoll

Scratch my last comment.  I was thinking it only pertained to payloads.

In that light, I think we should modify the scorePayload method for  
the time being, then we can deprecate it when we go to per field sim.


-Grant

On Nov 20, 2007, at 2:34 PM, Michael Busch wrote:


Yonik Seeley wrote:


Per field similarity would certainly be more efficient since it moves
the field->similarity lookup from the inner loop to the outer loop.



I agree. Then I'll leave the scorePayload() API as is for now. And I
don't think the per-field similarity should block 2.3, so let's work on
that after the release, ok?

-Michael

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



--
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Payload API

2007-11-20 Thread Grant Ingersoll
Well, we are making an awful lot of improvements for Payloads; I think  
we should try to get them in now and make 2.3 wait a bit more, since we  
all have more or less agreed that 2.9 (the next release after 2.3) is  
going to be a deprecation release before moving to 3.0.


-Grant

On Nov 20, 2007, at 2:34 PM, Michael Busch wrote:


Yonik Seeley wrote:


Per field similarity would certainly be more efficient since it moves
the field->similarity lookup from the inner loop to the outer loop.



I agree. Then I'll leave the scorePayload() API as is for now. And I
don't think the per-field similarity should block 2.3, so let's work on
that after the release, ok?

-Michael

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1001) Add Payload retrieval to Spans

2007-11-20 Thread Doug Cutting (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12544029
 ] 

Doug Cutting commented on LUCENE-1001:
--

> Would it be simpler to just use a SortedSet?

TreeMap is slower than a PriorityQueue for this.  With PriorityQueue, 
insertions and deletions do not allocate new objects.  And, if some items are 
much more frequent than others, using adjustTop() instead of inserting and 
deleting makes merges run much faster, since most updates are then considerably 
faster than log(n).
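
A sketch of the adjustTop() pattern on org.apache.lucene.util.PriorityQueue
(2.x API; the cursor type is illustrative):

{code}
import org.apache.lucene.util.PriorityQueue;

class Cursor {                          // one sorted input being merged
  int current; int[] data; int pos;
  boolean advance() {
    if (pos >= data.length) return false;
    current = data[pos++];
    return true;
  }
}

class CursorQueue extends PriorityQueue {
  CursorQueue(int maxSize) { initialize(maxSize); }
  protected boolean lessThan(Object a, Object b) {
    return ((Cursor) a).current < ((Cursor) b).current;
  }
}

class Merger {
  // Each Cursor is advance()d once and put() into pq before merging.
  // pop()+put() would be a delete plus an insert; mutating the top cursor
  // and calling adjustTop() allocates nothing, and when the same cursor
  // stays smallest the heap fix-up is far cheaper than a full log(n) pass.
  static void merge(CursorQueue pq) {
    while (pq.size() > 0) {
      Cursor c = (Cursor) pq.top();
      System.out.println(c.current);    // "emit" the smallest value
      if (c.advance()) pq.adjustTop();  // re-order just the top entry
      else pq.pop();                    // this cursor is exhausted
    }
  }
}
{code}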

> Add Payload retrieval to Spans
> --
>
> Key: LUCENE-1001
> URL: https://issues.apache.org/jira/browse/LUCENE-1001
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
>Priority: Minor
>
> It would be nice to have access to payloads when doing SpanQuerys.
> See http://www.gossamer-threads.com/lists/lucene/java-dev/52270 and 
> http://www.gossamer-threads.com/lists/lucene/java-dev/51134
> Current API, added to Spans.java is below.  I will try to post a patch as 
> soon as I can figure out how to make it work for unordered spans (I believe I 
> have all the other cases working).
> {noformat}
>   /**
>    * Returns the payload data for the current span.
>    * This is invalid until {@link #next()} is called for
>    * the first time.
>    * This method must not be called more than once after each call
>    * of {@link #next()}. However, payloads are loaded lazily,
>    * so if the payload data for the current position is not needed,
>    * this method may not be called at all for performance reasons.
>    *
>    * WARNING: The status of the Payloads feature is experimental.
>    * The APIs introduced here might change in the future and will not be
>    * supported anymore in such a case.
>    *
>    * @return a List of byte arrays containing the data of this payload
>    * @throws IOException
>    */
>   // TODO: Remove warning after API has been finalized
>   List/*<byte[]>*/ getPayload() throws IOException;
>
>   /**
>    * Checks if a payload can be loaded at this position.
>    *
>    * Payloads can only be loaded once per call to
>    * {@link #next()}.
>    *
>    * WARNING: The status of the Payloads feature is experimental.
>    * The APIs introduced here might change in the future and will not be
>    * supported anymore in such a case.
>    *
>    * @return true if there is a payload available at this position that can
>    * be loaded
>    */
>   // TODO: Remove warning after API has been finalized
>   public boolean isPayloadAvailable();
> {noformat}
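
A usage sketch of the proposed API, assuming a SpanQuery and an
IndexReader are already in scope (Java 1.4 style, matching the block
above):

{code}
void dumpPayloads(SpanQuery spanQuery, IndexReader reader) throws IOException {
  Spans spans = spanQuery.getSpans(reader);
  while (spans.next()) {
    if (spans.isPayloadAvailable()) {
      java.util.List payloads = spans.getPayload();   // a List of byte[]
      for (java.util.Iterator it = payloads.iterator(); it.hasNext();) {
        byte[] data = (byte[]) it.next();
        // consume the payload bytes for this span position
      }
    }
  }
}
{code}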

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-1063) Token re-use API breaks back compatibility in certain TokenStream chains

2007-11-20 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1063:
---

Attachment: LUCENE-1063.patch

Attached patch w/ unit test showing the issue, plus the fix.

The fix was actually simpler than I thought: we don't have to make a
new Token(); instead we just have to copy over the fields to the Token
that was passed in.  So the performance hit is less that I thought
it'd be (copy instead of new/GC).

I also strengthened the javadocs on the reuse & non-reuse APIs.

All tests pass.
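
In code form, the copy-over approach amounts to roughly this (a sketch,
with the field list per 2.3's Token; the attached patch is authoritative):

{code}
public Token next(Token result) throws IOException {
  Token t = next();                  // "non-reuse" API: returns a private copy
  if (t == null) return null;
  result.setTermBuffer(t.termBuffer(), 0, t.termLength());
  result.setStartOffset(t.startOffset());
  result.setEndOffset(t.endOffset());
  result.setType(t.type());
  result.setPositionIncrement(t.getPositionIncrement());
  result.setPayload(t.getPayload()); // by reference; cloning is LUCENE-1062
  return result;                     // the caller may now mutate result freely
}
{code}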



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Payload API

2007-11-20 Thread Michael Busch
Yonik Seeley wrote:
> 
> Per field similarity would certainly be more efficient since it moves
> the field->similarity lookup from the inner loop to the outer loop.
> 

I agree. Then I'll leave the scorePayload() API as is for now. And I
don't think the per-field similarity should block 2.3, so let's work on
that after the release, ok?

-Michael

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Apache logs and data

2007-11-20 Thread Doug Cutting

karl wettin wrote:

On Nov 15, 2007 10:09 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:

it is always good to have query logs


I realize that it is not that politically correct, but the TPB
collection is released to the public domain and contains 3.2 million
user queries with session id, timestamp, category, etc. to go with the
150,000+500,000 documents.


http://thepiratebay.org/tor/3783572


That's a good find!  They use Lucene too!

I don't see any legal issues with us writing code that parses these files. 
To be safest, I don't think we should republish the files, or even any 
of the queries, but I don't think we should need to.  Folks can download 
them to their own machines and use them for testing there.


It doesn't look as though there's click data, so we can't use this for 
relevance experiments without manually creating judgments.  But for 
performance benchmarking it could be useful.


Doug


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Payload API

2007-11-20 Thread Yonik Seeley
On Nov 20, 2007 2:17 PM, Michael Busch <[EMAIL PROTECTED]> wrote:
> Grant Ingersoll wrote:
> > +1 for adding the field name.
> >
>
> The question is whether we should add the field name to the
> Similarity#scorePayload() method or if we should support a per-field
> similarity in the future?

Per field similarity would certainly be more efficient since it moves
the field->similarity lookup from the inner loop to the outer loop.

-Yonik
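
A rough illustration of the loop shapes being compared (every name here
is illustrative, not a committed API):

  // per-field Similarity: the lookup happens once, in the outer loop
  Similarity sim = similarityFor(field);   // similarityFor(): hypothetical
  for (int i = 0; i < freq; i++) {         // inner loop: once per position
    tp.nextPosition();                     // tp: a TermPositions
    byte[] data = tp.getPayload(new byte[tp.getPayloadLength()], 0);
    score *= sim.scorePayload(data, 0, data.length);
  }

Passing the field name into scorePayload instead leaves any per-field
dispatch to happen inside that inner loop.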

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Payload API

2007-11-20 Thread Michael Busch
Grant Ingersoll wrote:
> +1 for adding the field name.
> 
> 

The question is whether we should add the field name to the
Similarity#scorePayload() method or if we should support a per-field
similarity in the future?

-Michael

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Token re-use API breaks back compatibility in certain TokenStream chains

2007-11-20 Thread Michael McCandless

OK, thanks.  I'll put mine in there too.

Mike

"Yonik Seeley" <[EMAIL PROTECTED]> wrote:
> On Nov 20, 2007 1:49 PM, Michael McCandless <[EMAIL PROTECTED]>
> wrote:
> >
> > Will do ...
> >
> > Mike
> >
> > "Yonik Seeley (JIRA)" <[EMAIL PROTECTED]> wrote:
> > > Could we make this a little more concrete by creating a simple test case
> > > that fails?
> 
> FWIW, I recently added mine to TestAnalyzers to check for proper
> payload copying.
> 
> -Yonik
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-1061) Adding a factory to QueryParser to instantiate query instances

2007-11-20 Thread John Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Wang updated LUCENE-1061:
--

Fix Version/s: 2.3
Lucene Fields: [New, Patch Available]  (was: [Patch Available, New])
Affects Version/s: 2.3

> Adding a factory to QueryParser to instantiate query instances
> --
>
> Key: LUCENE-1061
> URL: https://issues.apache.org/jira/browse/LUCENE-1061
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: QueryParser
>Affects Versions: 2.3
>Reporter: John Wang
> Fix For: 2.3
>
> Attachments: lucene_patch.txt
>
>
> With the new efforts around Payloads and scoring functions, it would be nice 
> to plug in custom query implementations while using the same QueryParser.
> Included is a patch with some refactoring of the QueryParser to take a factory 
> that produces query instances.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Token re-use API breaks back compatibility in certain TokenStream chains

2007-11-20 Thread Yonik Seeley
On Nov 20, 2007 1:49 PM, Michael McCandless <[EMAIL PROTECTED]> wrote:
>
> Will do ...
>
> Mike
>
> "Yonik Seeley (JIRA)" <[EMAIL PROTECTED]> wrote:
> > Could we make this a little more concrete by creating a simple test case
> > that fails?

FWIW, I recently added mine to TestAnalyzers to check for proper
payload copying.

-Yonik

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: [jira] Commented: (LUCENE-1063) Token re-use API breaks back compatibility in certain TokenStream chains

2007-11-20 Thread Michael McCandless

Will do ...

Mike

"Yonik Seeley (JIRA)" <[EMAIL PROTECTED]> wrote:
> 
> [
> 
> https://issues.apache.org/jira/browse/LUCENE-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12544005
> ] 
> 
> Yonik Seeley commented on LUCENE-1063:
> --
> 
> Could we make this a little more concrete by creating a simple test case
> that fails?
> 
> 
> 
> -- 
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
> 
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1063) Token re-use API breaks back compatibility in certain TokenStream chains

2007-11-20 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12544005
 ] 

Yonik Seeley commented on LUCENE-1063:
--

Could we make this a little more concrete by creating a simple test case that 
fails?



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1063) Token re-use API breaks back compatibility in certain TokenStream chains

2007-11-20 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12543991
 ] 

Michael McCandless commented on LUCENE-1063:


{quote}
{code}
// Filter F is calling TokenStream ts:
// In filter F:
public Token next(Token result) throws IOException {
  Token t = ts.next(result);
  t.setSomething();   // alters the returned Token in place
  return t;
}
{code}
Problem as described: ts expects the token it returns to not be altered because 
it somehow intends to rely on its content when servicing the following call to 
next([]). In other words, it assumes that callers to next([]) would only 
consume, but not alter, the returned token.
{quote}

And, ts only defined the "non-reuse" next(), thus it is the default
implementation in TokenStream.next(Token) that is actually invoked,
which in turn invokes ts.next() and directly returns the result.

{quote}
Seems that such an expectation by ts would be problematic no matter 
whether ts.next() or ts.next(Token) is used.

I mean, even if we removed next(Token) but kept Token.termBuffer(), that char 
array could be modified, and some TokenSteam implementation could still be 
broken because it assumes (following similar logic) that it can reuse its 
private copy of the char array... right?
{quote}

I don't think it's problematic for ts to expect this?  This is the
"contract" that you are supposed to follow for this API, spelled out
in the javadocs.

When you call "non-reuse" ts.next() you expect to get a private copy
that you can hold onto indefinitely and it will never be modified,
and, you accept that you must never modify this token yourself.

Whereas when you call "reuse" ts.next(Token) you accept that you must
fully consume the returned Token before you next call next(Token),
and, that you are free to alter this token.

I think that contract is well defined & consistent?

{quote}
TokenStream already does this, right? (or do you mean in the class TokenStream 
or in all implementations of TokenStream?)
{quote}
I'm talking about TokenStream's default implementation of next(Token).
It's not copying now, but it needs to in order to properly meet the
contract of this API (ie, allow caller to modify the returned token).
The default implementation of TokenStream.next() does already copy.
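
The reuse half of the contract, in code form (the 2.3 consumption idiom):

{code}
final Token reusable = new Token();
for (Token t = ts.next(reusable); t != null; t = ts.next(reusable)) {
  // "forwards re-use": we may freely modify t in place here
  String term = new String(t.termBuffer(), 0, t.termLength());
  // "backwards re-use": once ts.next(reusable) runs again, t is stale,
  // so anything needed later must be copied out first (like `term`)
}
{code}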



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1055) Remove GData from trunk

2007-11-20 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12543982
 ] 

Hoss Man commented on LUCENE-1055:
--

contrib/gdata-server is recorded as deleted (so an "svn status" will show that 
subversion doesn't know anything about it), but if you've ever built 
gdata-server, then it contains an "ext-libs" directory which was not managed by 
subversion, so "svn update" won't delete it automatically.

> Remove GData from trunk 
> 
>
> Key: LUCENE-1055
> URL: https://issues.apache.org/jira/browse/LUCENE-1055
> Project: Lucene - Java
>  Issue Type: Task
>  Components: contrib/*
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 2.3
>
> Attachments: lucene-1055.patch
>
>
> GData doesn't seem to be maintained anymore. We're going to remove it before 
> we cut the 2.3 release unless there are negative votes.
> In case someone jumps in in the future and starts to maintain it, we can 
> re-add it to the trunk.
> If anyone is using GData and needs it to be in 2.3 please let us know soon!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Payload API

2007-11-20 Thread Yonik Seeley
"Yonik Seeley" <[EMAIL PROTECTED]> wrote:
> > If we used a Payload object, it would save 8 bytes per Token for
> > fields not using payloads.

Of course with Token reuse, saving 8 bytes isn't important any more
either since it's only allocated once per field.

-Yonik

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1055) Remove GData from trunk

2007-11-20 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12543979
 ] 

Michael Busch commented on LUCENE-1055:
---

{quote}
After svn update, contrib/gdata-server is still in my working copy.
Is that intended, or is there still an svn delete to be done?
{quote}

Hmm, that's strange. I tried svn up on a different checkout folder and
contrib/gdata-server was successfully removed. Are you sure that you
don't have any local changes in that folder that prevent it from being 
removed? 

> Remove GData from trunk 
> 
>
> Key: LUCENE-1055
> URL: https://issues.apache.org/jira/browse/LUCENE-1055
> Project: Lucene - Java
>  Issue Type: Task
>  Components: contrib/*
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 2.3
>
> Attachments: lucene-1055.patch
>
>
> GData doesn't seem to be maintained anymore. We're going to remove it before 
> we cut the 2.3 release unless there are negative votes.
> In case someone jumps in in the future and starts to maintain it, we can 
> re-add it to the trunk.
> If anyone is using GData and needs it to be in 2.3, please let us know soon!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Payload API

2007-11-20 Thread Michael Busch
Michael McCandless wrote:
> "Yonik Seeley" <[EMAIL PROTECTED]> wrote:
>> On Nov 19, 2007 6:52 PM, Michael Busch <[EMAIL PROTECTED]> wrote:
>>> Yonik Seeley wrote:
>>>> So I think we all agree to do payloads by reference (do not make a
>>>> copy of byte[] like termBuffer does), and to allow payload reuse.
>>>>
>>>> So now we have 3 viable options still on the table, I think:
>>>> Token{ byte[] payload, int payloadLength, ...}
>>>> Token{ byte[] payload, int payloadOffset, int payloadLength,...}
>>>> Token{ Payload p, ... }

>>> I'm for option 2. I agree that it is worthwhile to allow filters to
>>> modify the payloads. And I'd like to optimize for the case where lots
>>> of tokens have payloads, and option 2 therefore seems the way to go.
>> Just to play devil's advocate, it seems like adding the byte[]
>> directly to Token gains less than we might have been thinking if we
>> have reuse in any case.  A TokenFilter could reuse the same Payload
>> object for each term in a Field, so the CPU allocation savings is
>> closer to a single Payload per field using payloads.
>>
>> If we used a Payload object, it would save 8 bytes per Token for
>> fields not using payloads.
>> Besides an initial allocation per field, the additional cost to using
>> a Payload field would be an additional dereference (but that should be
>> really minor).
> 
> These are excellent points.  I guess I would lean [back] towards
> keeping the separate Payload object and extending its API to allow
> re-use and modification of its byte[]?
> 

+1

-Michael

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Apache logs and data

2007-11-20 Thread Chris Hostetter

: I think the safest path is simply to not publish any queries, but rather to,
: e.g., permit committers to run experiments using them and publish the results
: of the experiments.  But no queries would be made available to the general
: public on a website.

that would eliminate the goal of having datasets (docs+queries+judgements) 
that anyone could download for testing whether a patch they want to 
propose alters the scores produced by Lucene (for better or for worse).



-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1063) Token re-use API breaks back compatibility in certain TokenStream chains

2007-11-20 Thread Doron Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12543939
 ] 

Doron Cohen commented on LUCENE-1063:
-

In "code words":
{code}
// Filter F is calling its input TokenStream ts:
public Token next(Token result) throws IOException {
  Token t = ts.next(result);
  if (t != null) {
    t.setSomething();  // stands for any mutation of the returned token
  }
  return t;
}
{code}

Problem as described: ts expects the token it returns to not be altered because 
it somehow intends to rely on its content when servicing the following call to 
next([]). In other words, it assumes that callers to next([]) would only 
consume, but not alter, the returned token. 

Seems that such an expectation by ts would be problematic no matter whether 
ts.next() or ts.next(Token) is used.

I mean, even if we removed next(Token) but kept Token.termBuffer(), that char 
array could be modified, and some TokenStream implementation could still be 
broken because it assumes (following similar logic) that it can reuse its 
private copy of the char array... right?

{quote}
A simple way to fix this is to make a full copy of the Token in the next(Token) 
call in TokenStream, just like we do in the next() method in TokenStream.  The 
downside is this is a small performance hit.  However that hit only happens at 
the boundary between a non-reuse and a re-use
tokenizer.
{quote}

TokenStream already does this, right? (or do you mean in the class TokenStream 
or in all implementations of TokenStream?)
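
For reference, a minimal sketch of the proposed full-copy fallback, assuming
the 2.3-era Token setters (setTermBuffer, setStartOffset, etc.); this is not
the committed change, just the "make a full copy in next(Token)" idea in code:

{code}
// Hedged sketch: TokenStream's default next(Token) deep-copies the
// "private" Token returned by the legacy next(), so callers may safely
// modify the result in place afterwards.
public Token next(Token result) throws IOException {
  Token t = next();               // legacy API: may return a cached Token
  if (t == null) return null;
  result.setTermBuffer(t.termBuffer(), 0, t.termLength()); // copies the chars
  result.setStartOffset(t.startOffset());
  result.setEndOffset(t.endOffset());
  result.setType(t.type());
  result.setPositionIncrement(t.getPositionIncrement());
  return result;
}
{code}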


> Token re-use API breaks back compatibility in certain TokenStream chains
> 
>
> Key: LUCENE-1063
> URL: https://issues.apache.org/jira/browse/LUCENE-1063
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Analysis
>Affects Versions: 2.3
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 2.3
>
>
> In scrutinizing the new Token re-use API during this thread:
>   http://www.gossamer-threads.com/lists/lucene/java-dev/54708
> I realized we now have a back-compatibility break when mixing re-use and
> non-re-use TokenStreams.
> The new "reuse" next(Token) API actually allows two different aspects
> of re-use:
>   1) "Backwards re-use": the subsequent call to next(Token) is allowed
>  to change all aspects of the provided Token, meaning the caller
>  must do all persisting of Token that it needs before calling
>  next(Token) again.
>   2) "Forwards re-use": the caller is allowed to modify the returned
>  Token however it wants.  Eg the LowerCaseFilter is allowed to
>  downcase the characters in-place in the char[] termBuffer.
> The forwards re-use case can break backwards compatibility now.  EG:
> if a TokenStream X providing only the "non-reuse" next() API is
> followed by a TokenFilter Y using the "reuse" next(Token) API to pull
> the tokens, then the default implementation in TokenStream.java for
> next(Token) will kick in.
> That default implementation just returns the provided "private copy"
> Token returned by next().  But, because of 2) above, this is not
> legal: if the TokenFilter Y modifies the char[] termBuffer (say), that
> is actually modifying the cached copy being potentially stored by X.
> I think the opposite case is handled correctly.
> A simple way to fix this is to make a full copy of the Token in the
> next(Token) call in TokenStream, just like we do in the next() method
> in TokenStream.  The downside is this is a small performance hit.  However
> that hit only happens at the boundary between a non-reuse and a re-use
> tokenizer.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1001) Add Payload retrieval to Spans

2007-11-20 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12543915
 ] 

Grant Ingersoll commented on LUCENE-1001:
-

{quote}
Off the top of my head: the priority queue is used to make sure that the Spans 
are processed by increasing doc numbers and increasing token positions; the 
first and the last Spans determine whether there is a match, and all other 
Spans (in the queue) are "in between".
{quote}
Would it be simpler to just use a SortedSet?  Then we could iterate w/o losing 
the sort, right?  Would this be faster since we wouldn't have to do the heap 
operations?
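
For concreteness, a SortedSet would need a total order such as the
(doc, start, end) ordering the queue maintains; a hypothetical comparator
over the org.apache.lucene.search.spans.Spans accessors might look like the
sketch below (modern generics syntax, for readability only):

{code}
import java.util.Comparator;
import org.apache.lucene.search.spans.Spans;

// Illustrative only: orders Spans by increasing doc, then start, then end.
class SpansOrder implements Comparator<Spans> {
  public int compare(Spans a, Spans b) {
    if (a.doc() != b.doc()) return a.doc() < b.doc() ? -1 : 1;
    if (a.start() != b.start()) return a.start() < b.start() ? -1 : 1;
    if (a.end() != b.end()) return a.end() < b.end() ? -1 : 1;
    return 0;  // a real SortedSet would need a tie-breaker to keep "equal" spans
  }
}
{code}

One wrinkle: advancing the least Spans changes its sort key, so it would have
to be removed and re-inserted (O(log n), much like the heap's adjust), which
may cancel the hoped-for savings.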

> Add Payload retrieval to Spans
> --
>
> Key: LUCENE-1001
> URL: https://issues.apache.org/jira/browse/LUCENE-1001
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
>Priority: Minor
>
> It will be nice to have access to payloads when doing SpanQuerys.
> See http://www.gossamer-threads.com/lists/lucene/java-dev/52270 and 
> http://www.gossamer-threads.com/lists/lucene/java-dev/51134
> Current API, added to Spans.java is below.  I will try to post a patch as 
> soon as I can figure out how to make it work for unordered spans (I believe I 
> have all the other cases working).
> {noformat}
>  /**
>* Returns the payload data for the current span.
>* This is invalid until {@link #next()} is called for
>* the first time.
>* This method must not be called more than once after each call
>* of {@link #next()}. However, payloads are loaded lazily,
>* so if the payload data for the current position is not needed,
>* this method may not be called at all for performance reasons.
>* 
>* 
>* WARNING: The status of the Payloads feature is experimental.
>* The APIs introduced here might change in the future and will not be
>* supported anymore in such a case.
>*
>* @return a List of byte arrays containing the data of this payload
>* @throws IOException
>*/
>   // TODO: Remove warning after API has been finalized
>   List/*<byte[]>*/ getPayload() throws IOException;
>   /**
>* Checks if a payload can be loaded at this position.
>* 
>* Payloads can only be loaded once per call to
>* {@link #next()}.
>* 
>* 
>* WARNING: The status of the Payloads feature is experimental.
>* The APIs introduced here might change in the future and will not be
>* supported anymore in such a case.
>*
>* @return true if there is a payload available at this position that can 
> be loaded
>*/
>   // TODO: Remove warning after API has been finalized
>   public boolean isPayloadAvailable();
> {noformat}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Created: (LUCENE-1063) Token re-use API breaks back compatibility in certain TokenStream chains

2007-11-20 Thread Michael McCandless (JIRA)
Token re-use API breaks back compatibility in certain TokenStream chains


 Key: LUCENE-1063
 URL: https://issues.apache.org/jira/browse/LUCENE-1063
 Project: Lucene - Java
  Issue Type: Bug
  Components: Analysis
Affects Versions: 2.3
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 2.3


In scrutinizing the new Token re-use API during this thread:

  http://www.gossamer-threads.com/lists/lucene/java-dev/54708

I realized we now have a back-compatibility break when mixing re-use and
non-re-use TokenStreams.

The new "reuse" next(Token) API actually allows two different aspects
of re-use:

  1) "Backwards re-use": the subsequent call to next(Token) is allowed
 to change all aspects of the provided Token, meaning the caller
 must do all persisting of Token that it needs before calling
 next(Token) again.

  2) "Forwards re-use": the caller is allowed to modify the returned
 Token however it wants.  Eg the LowerCaseFilter is allowed to
 downcase the characters in-place in the char[] termBuffer.

The forwards re-use case can break backwards compatibility now.  EG:
if a TokenStream X providing only the "non-reuse" next() API is
followed by a TokenFilter Y using the "reuse" next(Token) API to pull
the tokens, then the default implementation in TokenStream.java for
next(Token) will kick in.

That default implementation just returns the provided "private copy"
Token returned by next().  But, because of 2) above, this is not
legal: if the TokenFilter Y modifies the char[] termBuffer (say), that
is actually modifying the cached copy being potentially stored by X.

I think the opposite case is handled correctly.

A simple way to fix this is to make a full copy of the Token in the
next(Token) call in TokenStream, just like we do in the next() method
in TokenStream.  The downside is this is a small performance hit.  However
that hit only happens at the boundary between a non-reuse and a re-use
tokenizer.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1040) Can't quickly create StopFilter

2007-11-20 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12543903
 ] 

Yonik Seeley commented on LUCENE-1040:
--

Indeed... thanks for catching that!

> Can't quickly create StopFilter
> ---
>
> Key: LUCENE-1040
> URL: https://issues.apache.org/jira/browse/LUCENE-1040
> Project: Lucene - Java
>  Issue Type: Bug
>Reporter: Yonik Seeley
>Assignee: Yonik Seeley
> Attachments: CharArraySet.patch, CharArraySet.take2.patch
>
>
> Due to the use of CharArraySet by StopFilter, one can no longer efficiently 
> pre-create a Set for use by future StopFilter instances.
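
For context, here is the pattern the issue wants to make cheap again, sketched
with the era's StopFilter API (makeStopSet and the Set-taking constructor
exist; whether the passed set is reused or re-copied per instance is exactly
what the patch addresses):

{code}
import java.util.Set;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;

class StopWords {
  // build the stop set once, share it across all StopFilter instances
  static final Set STOP_SET =
      StopFilter.makeStopSet(new String[] { "a", "an", "the" });

  static TokenStream filter(TokenStream input) {
    return new StopFilter(input, STOP_SET);
  }
}
{code}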

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Apache logs and data

2007-11-20 Thread Grant Ingersoll
This may be worth asking legal-discuss about.  I am not sure if there  
is an issue or not.


-Grant


On Nov 20, 2007, at 4:54 AM, karl wettin wrote:


On Nov 15, 2007 10:09 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:

it is always good to have query logs


I realize that it is not that politically correct, but the TPB
collection is released to the public domain and contains 3.2 million
user queries with session id, timestamp, category, etc., to go with the
150,000+500,000 documents.


http://thepiratebay.org/tor/3783572


--
karl

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Payload API

2007-11-20 Thread Michael McCandless
"Yonik Seeley" <[EMAIL PROTECTED]> wrote:
> On Nov 19, 2007 6:52 PM, Michael Busch <[EMAIL PROTECTED]> wrote:
> > Yonik Seeley wrote:
> > >
> > > So I think we all agree to do payloads by reference (do not make a
> > > copy of byte[] like termBuffer does), and to allow payload reuse.
> > >
> > > So now we have 3 viable options still on the table, I think:
> > > Token{ byte[] payload, int payloadLength, ...}
> > > Token{ byte[] payload, int payloadOffset, int payloadLength,...}
> > > Token{ Payload p, ... }
> > >
> >
> > I'm for option 2. I agree that it is worthwhile to allow filters to
> > modify the payloads. And I'd like to optimize for the case where lots
> > of tokens have payloads, and option 2 therefore seems the way to go.
> 
> Just to play devil's advocate, it seems like adding the byte[]
> directly to Token gains less than we might have been thinking if we
> have reuse in any case.  A TokenFilter could reuse the same Payload
> object for each term in a Field, so the CPU allocation savings is
> closer to a single Payload per field using payloads.
> 
> If we used a Payload object, it would save 8 bytes per Token for
> fields not using payloads.
> Besides an initial allocation per field, the additional cost to using
> a Payload field would be an additional dereference (but that should be
> really minor).

These are excellent points.  I guess I would lean [back] towards
keeping the separate Payload object and extending its API to allow
re-use and modification of its byte[]?
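
To make the trade-off concrete, a hedged sketch of the "Token{ Payload p,
... }" option with reuse; the class shapes follow the discussion above, not
any committed patch, and setData is the kind of reuse hook being proposed:

{code}
// Illustrative classes only.
class Payload {
  byte[] data;
  int offset;
  int length;
  // proposed reuse hook: point at (do not copy) the caller's bytes
  void setData(byte[] data, int offset, int length) {
    this.data = data; this.offset = offset; this.length = length;
  }
}

class Token {
  // one reference (~8 bytes per Token on a 64-bit JVM) when no payload is used
  Payload payload;
  void setPayload(Payload p) { this.payload = p; }
  Payload getPayload() { return payload; }
}
{code}

A filter would then allocate a single Payload (and byte[]) per field and call
setData once per token, so the per-token allocation cost disappears while the
no-payload case pays only the null reference.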

I'm now even wondering whether the char[] termBuffer should be by
reference (again!), too?  This would save 1 copy for those
TokenStreams that could provide a reference to their own char[]
buffers (eg CharTokenizer).

Mike

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Reopened: (LUCENE-1052) Add an "termInfosIndexDivisor" to IndexReader

2007-11-20 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless reopened LUCENE-1052:



> Add an "termInfosIndexDivisor" to IndexReader
> -
>
> Key: LUCENE-1052
> URL: https://issues.apache.org/jira/browse/LUCENE-1052
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.2
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.3
>
> Attachments: LUCENE-1052.patch, termInfosConfigurer.patch
>
>
> The termIndexInterval, set at indexing time, lets you trade off
> how much RAM is used by a reader to load the indexed terms vs. the cost
> of seeking to the specific term you want to load.
> But the downside is you must set it at indexing time.
> This issue adds an indexDivisor to TermInfosReader so that on opening
> a reader you could further sub-sample the termIndexInterval to use
> less RAM.  EG a setting of 2 means only every (2 * termIndexInterval)'th
> indexed term is loaded into RAM.
> This is particularly useful if your index has a great many terms (eg
> you accidentally indexed binary terms).
> Spinoff from this thread:
>   http://www.gossamer-threads.com/lists/lucene/java-dev/54371
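
A hedged usage sketch (the setter name follows this issue's summary and the
API may still change): with the default termIndexInterval of 128 and a
divisor of 2, only every 256th term stays in RAM, roughly halving term-index
memory at the cost of a longer in-memory scan per seek.

{code}
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.Directory;

class DivisorExample {
  // sub-sample the term index at open time; must run before the term
  // index is first loaded (it is loaded lazily)
  static IndexReader openWithDivisor(Directory dir) throws java.io.IOException {
    IndexReader reader = IndexReader.open(dir);
    reader.setTermInfosIndexDivisor(2);  // keep every 2nd indexed term in RAM
    return reader;
  }
}
{code}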

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1052) Add an "termInfosIndexDivisor" to IndexReader

2007-11-20 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12543854
 ] 

Michael McCandless commented on LUCENE-1052:


Thanks Chuck for such a wonderfully thorough patch & unit tests, and
for adding the methods to ParallelReader, too (I had missed it the
first time around)!  The patch looks good.

Should we use an abstract base class instead of an interface for
TermInfosConfigurer so we can add additional methods in the future
without breaking back compatibility?

Also I think we should mark this API as advanced, somewhat
experimental and subject to change?
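
The back-compat point, in code terms (method names here are hypothetical,
not taken from the patch): adding a method to an interface breaks every
existing implementor, while an abstract base class can grow one with a
default body.

{code}
// Hedged sketch of why an abstract base class evolves more safely
// (pre-Java-8 interfaces, which is the era in question).
public abstract class TermInfosConfigurer {
  // hypothetical original method
  public abstract int getMaxTermsCached(String segmentName, long totalTerms);

  // hypothetical later addition -- existing subclasses keep compiling
  public int getIndexDivisor(String segmentName, long totalTerms) {
    return 1;
  }
}
{code}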


> Add an "termInfosIndexDivisor" to IndexReader
> -
>
> Key: LUCENE-1052
> URL: https://issues.apache.org/jira/browse/LUCENE-1052
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.2
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.3
>
> Attachments: LUCENE-1052.patch, termInfosConfigurer.patch
>
>
> The termIndexInterval, set at indexing time, lets you trade off
> how much RAM is used by a reader to load the indexed terms vs. the cost
> of seeking to the specific term you want to load.
> But the downside is you must set it at indexing time.
> This issue adds an indexDivisor to TermInfosReader so that on opening
> a reader you could further sub-sample the termIndexInterval to use
> less RAM.  EG a setting of 2 means only every (2 * termIndexInterval)'th
> indexed term is loaded into RAM.
> This is particularly useful if your index has a great many terms (eg
> you accidentally indexed binary terms).
> Spinoff from this thread:
>   http://www.gossamer-threads.com/lists/lucene/java-dev/54371

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1040) Can't quickly create StopFilter

2007-11-20 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12543847
 ] 

Michael McCandless commented on LUCENE-1040:


Yonik, I think you missed my proposed update to your original patch, here?

https://issues.apache.org/jira/browse/LUCENE-1040#action_12539319

EG, there are some problems with the changes to rehash (and I added a unit-test 
to expose them).

> Can't quickly create StopFilter
> ---
>
> Key: LUCENE-1040
> URL: https://issues.apache.org/jira/browse/LUCENE-1040
> Project: Lucene - Java
>  Issue Type: Bug
>Reporter: Yonik Seeley
>Assignee: Yonik Seeley
> Attachments: CharArraySet.patch, CharArraySet.take2.patch
>
>
> Due to the use of CharArraySet by StopFilter, one can no longer efficiently 
> pre-create a Set for use by future StopFilter instances.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Apache logs and data

2007-11-20 Thread karl wettin
On Nov 15, 2007 10:09 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
> it is always good to have query logs

I realize that it is not that politically correct, but the TPB
collection is released to the public domain and contains 3.2 million
user queries with session id, timestamp, category, etc., to go with the
150,000+500,000 documents.


http://thepiratebay.org/tor/3783572


-- 
karl

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1055) Remove GData from trunk

2007-11-20 Thread Paul Elschot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12543807
 ] 

Paul Elschot commented on LUCENE-1055:
--

After svn update, contrib/gdata-server is still in my working copy.
Is that intended, or is there still an svn delete to be done?


> Remove GData from trunk 
> 
>
> Key: LUCENE-1055
> URL: https://issues.apache.org/jira/browse/LUCENE-1055
> Project: Lucene - Java
>  Issue Type: Task
>  Components: contrib/*
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 2.3
>
> Attachments: lucene-1055.patch
>
>
> GData doesn't seem to be maintained anymore. We're going to remove it before 
> we cut the 2.3 release unless there are negative votes.
> In case someone jumps in in the future and starts to maintain it, we can 
> re-add it to the trunk.
> If anyone is using GData and needs it to be in 2.3, please let us know soon!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]