Re: [jira] Updated: (LUCENE-693) ConjunctionScorer - more tuneup
Thanks! On Nov 21, 2007, at 1:35 AM, Michael Busch wrote: robert engels wrote: We are still using Lucene 1.9.1+, and I am wondering if there have been any improvements in searching on AND clauses when some of the terms are very infrequent... multi-level skipping should help when an AND query has frequent and infrequent terms. See LUCENE-866 for some performance numbers. -Michael - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: [jira] Updated: (LUCENE-693) ConjunctionScorer - more tuneup
robert engels wrote: > > We are still using Lucene 1.9.1+, and I am wondering if there have been > any improvements in searching on AND clauses when some of the terms are > very infrequent... > multi-level skipping should help when an AND query has frequent and infrequent terms. See LUCENE-866 for some performance numbers. -Michael
Re: [jira] Updated: (LUCENE-693) ConjunctionScorer - more tuneup
Sorry if this is somewhat off topic, but it seems at least marginally related to this... We are still using Lucene 1.9.1+, and I am wondering if there have been any improvements in searching on AND clauses when some of the terms are very infrequent... This change seems appropriate. Are there others associated with the performance gains? If you were going to back-port some of the later changes, can anyone give some advice as to the biggest "bang for the buck"? Hopefully those not involving an index format change. Thanks. Robert

On Nov 21, 2007, at 1:16 AM, Yonik Seeley (JIRA) wrote:

[ https://issues.apache.org/jira/browse/LUCENE-693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yonik Seeley updated LUCENE-693: Attachment: conjunction.patch

Whew... I'd forgotten about this issue. I brushed up one of the last versions I had lying around from a year ago (see latest conjunction.patch), fixed up my synthetic tests a bit, and got some decent results:
1% faster in top level term conjunctions (wheee)
49% faster in a conjunction of nested term conjunctions (no sort per call to skipTo)
5% faster in a top level ConstantScoreQuery conjunction
144% faster in a conjunction of nested ConstantScoreQuery conjunctions
A sort is done the first time, and the scorers are ordered so that the highest will skip first (the idea being that there may be a little info in the first skip about which scorer is most sparse). Michael Busch recently brought up a related idea... that one could skip on low df terms first... but that would of course require some terms in the conjunction.
ConjunctionScorer - more tuneup --- Key: LUCENE-693 URL: https://issues.apache.org/jira/browse/LUCENE-693 Project: Lucene - Java Issue Type: Bug Components: Search Affects Versions: 2.1 Environment: Windows Server 2003 x64, Java 1.6, pretty large index Reporter: Peter Keegan Attachments: conjunction.patch, conjunction.patch, conjunction.patch, conjunction.patch, conjunction.patch.nosort1 (See also: #LUCENE-443) I did some profile testing with the new ConjunctionScorer in 2.1 and discovered a new bottleneck in ConjunctionScorer.sortScorers. The java.util.Arrays.sort method is cloning the Scorers array on every sort, which is quite expensive on large indexes because of the size of the 'norms' array within, and isn't necessary. Here is one possible solution:

private void sortScorers() {
  // squeeze the array down for the sort
  //if (length != scorers.length) {
  //  Scorer[] temps = new Scorer[length];
  //  System.arraycopy(scorers, 0, temps, 0, length);
  //  scorers = temps;
  //}
  insertionSort(scorers, length);
  // note that this comparator is not consistent with equals!
  //Arrays.sort(scorers, new Comparator() { // sort the array
  //  public int compare(Object o1, Object o2) {
  //    return ((Scorer)o1).doc() - ((Scorer)o2).doc();
  //  }
  //});
  first = 0;
  last = length - 1;
}

private void insertionSort(Scorer[] scores, int len) {
  for (int i = 0; i < len; i++) {
    for (int j = i; j > 0 && scores[j-1].doc() > scores[j].doc(); j--) {
      swap(scores, j, j-1);
    }
  }
}

private void swap(Object[] x, int a, int b) {
  Object t = x[a];
  x[a] = x[b];
  x[b] = t;
}

The squeezing of the array is no longer needed. We also initialized the Scorers array to 8 (instead of 2) to avoid having to grow the array for common queries, although this probably has less performance impact. This change added about 3% to query throughput in my testing. Peter -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (LUCENE-693) ConjunctionScorer - more tuneup
[ https://issues.apache.org/jira/browse/LUCENE-693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yonik Seeley updated LUCENE-693: Attachment: conjunction.patch Whew... I'd forgotten about this issue. I brushed up one of the last versions I had lying around from a year ago (see latest conjunction.patch), fixed up my synthetic tests a bit, and got some decent results: 1% faster in top level term conjunctions (wheee) 49% faster in a conjunction of nested term conjunctions (no sort per call to skipTo) 5% faster in a top level ConstantScoreQuery conjunction 144% faster in a conjunction of nested ConstantScoreQuery conjunctions A sort is done the first time, and the scorers are ordered so that the highest will skip first (the idea being that there may be a little info in the first skip about which scorer is most sparse). Michael Busch recently brought up a related idea... that one could skip on low df terms first... but that would of course require some terms in the conjunction. > ConjunctionScorer - more tuneup > --- > > Key: LUCENE-693 > URL: https://issues.apache.org/jira/browse/LUCENE-693 > Project: Lucene - Java > Issue Type: Bug > Components: Search >Affects Versions: 2.1 > Environment: Windows Server 2003 x64, Java 1.6, pretty large index >Reporter: Peter Keegan > Attachments: conjunction.patch, conjunction.patch, conjunction.patch, > conjunction.patch, conjunction.patch.nosort1 > > > (See also: #LUCENE-443) > I did some profile testing with the new ConjunctionScorer in 2.1 and > discovered a new bottleneck in ConjunctionScorer.sortScorers. The > java.util.Arrays.sort method is cloning the Scorers array on every sort, > which is quite expensive on large indexes because of the size of the 'norms' > array within, and isn't necessary.
> Here is one possible solution:
> private void sortScorers() {
>   // squeeze the array down for the sort
>   //if (length != scorers.length) {
>   //  Scorer[] temps = new Scorer[length];
>   //  System.arraycopy(scorers, 0, temps, 0, length);
>   //  scorers = temps;
>   //}
>   insertionSort(scorers, length);
>   // note that this comparator is not consistent with equals!
>   //Arrays.sort(scorers, new Comparator() { // sort the array
>   //  public int compare(Object o1, Object o2) {
>   //    return ((Scorer)o1).doc() - ((Scorer)o2).doc();
>   //  }
>   //});
>   first = 0;
>   last = length - 1;
> }
> private void insertionSort(Scorer[] scores, int len) {
>   for (int i = 0; i < len; i++) {
>     for (int j = i; j > 0 && scores[j-1].doc() > scores[j].doc(); j--) {
>       swap(scores, j, j-1);
>     }
>   }
> }
> private void swap(Object[] x, int a, int b) {
>   Object t = x[a];
>   x[a] = x[b];
>   x[b] = t;
> }
>
> The squeezing of the array is no longer needed.
> We also initialized the Scorers array to 8 (instead of 2) to avoid having to grow the array for common queries, although this probably has less performance impact.
> This change added about 3% to query throughput in my testing.
> Peter
[jira] Commented: (LUCENE-1063) Token re-use API breaks back compatibility in certain TokenStream chains
[ https://issues.apache.org/jira/browse/LUCENE-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12544175 ] Doron Cohen commented on LUCENE-1063: - {quote} So I don't think we need to change anything. {quote} (y) sounds good to me. (i) looking close at TokenStream it is interesting that next() and next(Token) as written will loop forever. So if a subclass just implements say next() by calling (super's) next(new Token()) it is an infinite loop. However anything like this would be buggy anyhow because no meaningful token is created this way. To summarize there's no action item here. (I thought about modifying the javadoc NOTE to: ??subclasses must +create the next Token+ by overriding at least one of next() or next(Token)??, but I am not convinced it is any clearer.) > Token re-use API breaks back compatibility in certain TokenStream chains > > > Key: LUCENE-1063 > URL: https://issues.apache.org/jira/browse/LUCENE-1063 > Project: Lucene - Java > Issue Type: Bug > Components: Analysis >Affects Versions: 2.3 >Reporter: Michael McCandless >Assignee: Michael McCandless > Fix For: 2.3 > > Attachments: LUCENE-1063.patch > > > In scrutinizing the new Token re-use API during this thread: > http://www.gossamer-threads.com/lists/lucene/java-dev/54708 > I realized we now have a non-back-compatibility when mixing re-use and > non-re-use TokenStreams. > The new "reuse" next(Token) API actually allows two different aspects > of re-use: > 1) "Backwards re-use": the subsequent call to next(Token) is allowed > to change all aspects of the provided Token, meaning the caller > must do all persisting of Token that it needs before calling > next(Token) again. > 2) "Forwards re-use": the caller is allowed to modify the returned > Token however it wants. Eg the LowerCaseFilter is allowed to > downcase the characters in-place in the char[] termBuffer. > The forwards re-use case can break backwards compatibility now. 
EG: > if a TokenStream X providing only the "non-reuse" next() API is > followed by a TokenFilter Y using the "reuse" next(Token) API to pull > the tokens, then the default implementation in TokenStream.java for > next(Token) will kick in. > That default implementation just returns the provided "private copy" > Token returned by next(). But, because of 2) above, this is not > legal: if the TokenFilter Y modifies the char[] termBuffer (say), that > is actually modifying the cached copy being potentially stored by X. > I think the opposite case is handled correctly. > A simple way to fix this is to make a full copy of the Token in the > next(Token) call in TokenStream, just like we do in the next() method > in TokenStream. The downside is this is a small performance hit. However > that hit only happens at the boundary between a non-reuse and a re-use > tokenizer.
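The fix described above (a full copy at the reuse boundary) can be sketched with a stripped-down, hypothetical Token class; this is not Lucene's real Token or TokenStream, just an illustration of why the clone matters:

```java
// Hedged sketch of the boundary fix: when a reuse-style next(Token) call
// falls back on a non-reuse next() implementation, return a full copy of the
// stream's cached Token so a downstream filter may mutate it in place.
// "Token" here is a hypothetical stand-in, not Lucene's class.
public class TokenCopySketch {
    static class Token {
        char[] termBuffer;
        Token(char[] buf) { this.termBuffer = buf; }
        // Deep copy: clones the char[] so in-place edits don't leak back.
        Token fullCopy() { return new Token(termBuffer.clone()); }
    }

    // The shim: instead of handing out the cached Token directly,
    // hand out a private copy, as the default next(Token) should.
    static Token nextWithReuse(Token cachedFromNext) {
        return cachedFromNext.fullCopy();
    }

    public static void main(String[] args) {
        Token cached = new Token("foo".toCharArray());
        Token out = nextWithReuse(cached);
        out.termBuffer[0] = 'F'; // e.g. a filter downcasing/upcasing in place
        // The stream's cached copy is untouched:
        System.out.println(new String(cached.termBuffer)); // foo
        System.out.println(new String(out.termBuffer));    // Foo
    }
}
```

Without the `fullCopy()`, the filter's in-place edit would corrupt the cached Token held by the upstream stream, which is exactly the back-compatibility break discussed here.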
[jira] Updated: (LUCENE-1058) New Analyzer for buffering tokens
[ https://issues.apache.org/jira/browse/LUCENE-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll updated LUCENE-1058: Attachment: LUCENE-1058.patch Here's a patch that modifies the DocumentsWriter to not throw an IllegalArgumentException if no Reader is specified. Thus, an Analyzer needs to be able to handle a null Reader (this still needs to be documented). Basically, the semantics of it are that the Analyzer is producing Tokens from some other means. I probably should spell this out in a new Field constructor as well, but this should suffice for now, and I will revisit it after the break. I also added in a TestCollaboratingAnalyzer. All tests pass. > New Analyzer for buffering tokens > - > > Key: LUCENE-1058 > URL: https://issues.apache.org/jira/browse/LUCENE-1058 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis >Reporter: Grant Ingersoll >Assignee: Grant Ingersoll >Priority: Minor > Attachments: LUCENE-1058.patch, LUCENE-1058.patch > > > In some cases, it would be handy to have Analyzer/Tokenizer/TokenFilters that > could siphon off certain tokens and store them in a buffer to be used later > in the processing pipeline. > For example, if you want to have two fields, one lowercased and one not, but > all the other analysis is the same, then you could save off the tokens to be > output for a different field. > Patch to follow, but I am still not sure about a couple of things, mostly how > it plays with the new reuse API. > See > http://www.gossamer-threads.com/lists/lucene/java-dev/54397?search_string=BufferingAnalyzer;#54397 -- This message is automatically generated by JIRA.
[jira] Commented: (LUCENE-1058) New Analyzer for buffering tokens
[ https://issues.apache.org/jira/browse/LUCENE-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12544145 ] Grant Ingersoll commented on LUCENE-1058: - Some javadoc comments for the modifyToken method in BufferingTokenFilter should be sufficient, right? Something to the effect that if this TokenFilter is not the last in the chain that it should make a full copy. As for the CachedTokenizer and CachedAnalyzer, those should be implied, since the user is passing them in to begin with. The other thing of interest, is that calling Analyzer.tokenStream(String, Reader) is not needed. In fact, this somewhat suggests having a new Fieldable property akin to tokenStreamValue(), etc. that says don't even ask the Fieldable for a value. Let me take a crack at what that means and post a patch. It will mean some changes to invertField() in DocumentsWriter and possibly changing it to not require that one of tokenStreamValue, readerValue() or stringValue() be defined. Not sure if that is a good idea or not. > New Analyzer for buffering tokens > - > > Key: LUCENE-1058 > URL: https://issues.apache.org/jira/browse/LUCENE-1058 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis >Reporter: Grant Ingersoll >Assignee: Grant Ingersoll >Priority: Minor > Attachments: LUCENE-1058.patch > > > In some cases, it would be handy to have Analyzer/Tokenizer/TokenFilters that > could siphon off certain tokens and store them in a buffer to be used later > in the processing pipeline. > For example, if you want to have two fields, one lowercased and one not, but > all the other analysis is the same, then you could save off the tokens to be > output for a different field. > Patch to follow, but I am still not sure about a couple of things, mostly how > it plays with the new reuse API. > See > http://www.gossamer-threads.com/lists/lucene/java-dev/54397?search_string=BufferingAnalyzer;#54397 -- This message is automatically generated by JIRA. 
[jira] Commented: (LUCENE-1052) Add an "termInfosIndexDivisor" to IndexReader
[ https://issues.apache.org/jira/browse/LUCENE-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12544136 ] Chuck Williams commented on LUCENE-1052: I can report that in our application having a formula is critical. We have no control over the content our users index, nor in fact do they. These are arbitrary documents. We find a surprising number of them contain embedded encoded binary data. When those are indexed, lucene's memory consumption skyrockets, either bringing the whole app down with an OOM or slowing performance to a crawl due to excessive GC's reclaiming a tiny remaining working memory space. Our users won't accept a solution like "wait until the problem occurs and then increment your termIndexDivisor". They expect our app to manage this automatically.

I agree that making TermInfosReader, SegmentReader, etc. public classes is not a great solution. The current patch does not do that. It simply adds a configurable class that can be used to provide formula parameters as opposed to just value parameters. At least for us, this special case is sufficiently important to outweigh any considerations of the complexity of an additional class. A single configuration class could be used at the IndexReader level that provides for both static and dynamically-varying properties through getters, some of which take parameters.

Here is another possible solution. My current thought is that the bound should always be a multiple of sqrt(numDocs). E.g., see Heaps' law here: http://nlp.stanford.edu/IR-book/html/htmledition/heaps-law-estimating-the-number-of-terms-1.html I'm currently using this formula in my TermInfosConfigurer:

int bound = (int) (1 + TERM_BOUNDING_MULTIPLIER * Math.sqrt(1 + segmentNumDocs) / TERM_INDEX_INTERVAL);

This has Heaps' law as its foundation. I provide TERM_BOUNDING_MULTIPLIER as the config parameter, with 0 meaning don't do this.
I also provide a TERM_INDEX_DIVISOR_OVERRIDE that overrides the dynamic bounding with a manually specified constant amount. If that approach would be acceptable to lucene in general, then we just need two static parameters. However, I don't have enough experience with how well this formula works in our user base yet to know whether or not we'll tune it further. > Add an "termInfosIndexDivisor" to IndexReader > - > > Key: LUCENE-1052 > URL: https://issues.apache.org/jira/browse/LUCENE-1052 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Affects Versions: 2.2 >Reporter: Michael McCandless >Assignee: Michael McCandless >Priority: Minor > Fix For: 2.3 > > Attachments: LUCENE-1052.patch, termInfosConfigurer.patch > > > The termIndexInterval, set during indexing time, lets you trade off > how much RAM is used by a reader to load the indexed terms vs cost of > seeking to the specific term you want to load. > But the downside is you must set it at indexing time. > This issue adds an indexDivisor to TermInfosReader so that on opening > a reader you could further sub-sample the termIndexInterval to use > less RAM. EG a setting of 2 means every 2 * termIndexInterval is > loaded into RAM. > This is particularly useful if your index has a great many terms (eg > you accidentally indexed binary terms). > Spinoff from this thread: > http://www.gossamer-threads.com/lists/lucene/java-dev/54371
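For concreteness, Chuck's Heaps'-law-motivated bound can be written as a tiny helper. The parameter names come from his comment; the concrete values in main are illustrative assumptions, not tuned recommendations:

```java
// Sketch of the dynamic term-index bound described above:
//   bound = 1 + multiplier * sqrt(1 + segmentNumDocs) / termIndexInterval
// Heaps' law says vocabulary size grows sublinearly (roughly a power ~0.5)
// with collection size, so sqrt(numDocs) caps in-memory index terms sensibly.
public class TermBoundSketch {
    static int termIndexBound(int segmentNumDocs, double multiplier, int termIndexInterval) {
        return (int) (1 + multiplier * Math.sqrt(1 + segmentNumDocs) / termIndexInterval);
    }

    public static void main(String[] args) {
        // multiplier 0 means "don't do this": the bound degenerates to 1
        System.out.println(termIndexBound(1000000, 0.0, 128));    // 1
        // illustrative values: 1M docs, multiplier 1000, interval 128
        System.out.println(termIndexBound(1000000, 1000.0, 128)); // 7813
    }
}
```

The key property is that the bound grows with sqrt of the segment's document count, so a segment full of accidental binary "terms" cannot blow up the in-RAM term index linearly.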
[jira] Commented: (LUCENE-1052) Add an "termInfosIndexDivisor" to IndexReader
[ https://issues.apache.org/jira/browse/LUCENE-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12544125 ] Doug Cutting commented on LUCENE-1052: -- What class would we put TermInfosReader-specific setters & getters on, since that class is not public? Do we make TermInfosReader public or leave it package-private? My intuition is to leave it package-private for now, in order to retain freedom to re-structure w/o breaking applications, and because making it public would drag a lot of other stuff into the public. We could consider making SegmentReader public, so that there's a public class that corresponds to the concrete index implementation, but that'd also drag more stuff public (like DirectoryIndexReader).

I'm also not yet convinced that it is critical to support arbitrary formulae for this feature. Sure, it would be nice, but it has costs, like increasing public APIs that must be supported. Folks have done fine without this feature for many years. Adding a simple integer divisor is a sufficient initial step here. So, even if we add a configuration system, I think the setter methods could still end up on IndexReader. The difference is primarily whether the methods are:

public void setTermIndexInterval(int interval);
public void setTermIndexDivisor(int divisor);

or

public static void setTermIndexInterval(LuceneProps props, int interval);
public static void setTermIndexDivisor(LuceneProps props, int divisor);

With the latter just a façade that uses package-private stuff. I think the latter style will be handy as we start adding parameters to, e.g., Query classes. In those cases we'll probably want façades too, since a Query setter will probably really tweak something for a private Scorer class. In the case of indexes, however, we don't have a public, concrete class. Another option is to make a public class whose purpose is just to hold only such parameters, something like SegmentIndexParameters.
That'd be my first choice and was the direction I pointed in my initial proposal, but with considerably less explanation. > Add an "termInfosIndexDivisor" to IndexReader > - > > Key: LUCENE-1052 > URL: https://issues.apache.org/jira/browse/LUCENE-1052 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Affects Versions: 2.2 >Reporter: Michael McCandless >Assignee: Michael McCandless >Priority: Minor > Fix For: 2.3 > > Attachments: LUCENE-1052.patch, termInfosConfigurer.patch > > > The termIndexInterval, set during indexing time, lets you trade off > how much RAM is used by a reader to load the indexed terms vs cost of > seeking to the specific term you want to load. > But the downside is you must set it at indexing time. > This issue adds an indexDivisor to TermInfosReader so that on opening > a reader you could further sub-sample the termIndexInterval to use > less RAM. EG a setting of 2 means every 2 * termIndexInterval is > loaded into RAM. > This is particularly useful if your index has a great many terms (eg > you accidentally indexed binary terms). > Spinoff from this thread: > http://www.gossamer-threads.com/lists/lucene/java-dev/54371
[jira] Commented: (LUCENE-1044) Behavior on hard power shutdown
[ https://issues.apache.org/jira/browse/LUCENE-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12544115 ] Michael McCandless commented on LUCENE-1044: OK, I tested calling command-line "sync", after writing each segments file. It's in fact even slower than fsync on each file for these 3 cases:

Linux (2.6.22.1), reiserfs, 6-drive RAID5 array: 93% slower (sync 330.74, nosync 171.24)
Linux (2.6.22.1), ext3, single drive: 60% slower (sync 242.02, nosync 150.91)
Mac Pro (10.4 Tiger), 4-drive RAID0 array: 28% slower (sync 204.77, nosync 159.90)

I'll look into the separate thread to sync/close files in the background next... > Behavior on hard power shutdown > --- > > Key: LUCENE-1044 > URL: https://issues.apache.org/jira/browse/LUCENE-1044 > Project: Lucene - Java > Issue Type: Bug > Components: Index > Environment: Windows Server 2003, Standard Edition, Sun Hotspot Java > 1.5 >Reporter: venkat rangan >Assignee: Michael McCandless > Fix For: 2.3 > > Attachments: LUCENE-1044.patch, LUCENE-1044.take2.patch, > LUCENE-1044.take3.patch > > > When indexing a large number of documents, upon a hard power failure (e.g. > pull the power cord), the index seems to get corrupted. We start a Java > application as a Windows Service, and feed it documents. In some cases > (after an index size of 1.7GB, with 30-40 index segment .cfs files), the > following is observed. > The 'segments' file contains only zeros. Its size is 265 bytes - all bytes > are zeros. > The 'deleted' file also contains only zeros. Its size is 85 bytes - all bytes > are zeros. > Before corruption, the segments file and deleted file appear to be correct. > After this corruption, the index is corrupted and lost. > This is a problem observed in Lucene 1.4.3. We are not able to upgrade our > customer deployments to 1.9 or later version, but would be happy to back-port > a patch, if the patch is small enough and if this problem is already solved.
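The per-file fsync being benchmarked above can be sketched in plain Java: flushing a file's bytes to stable storage before treating a commit as durable is what lets an index survive a hard power-off. This is an illustration of the system call being measured, not Lucene's actual commit logic:

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

// Sketch of per-file fsync: FileChannel.force(true) asks the OS to flush
// both the file's data and its metadata to disk before returning, which is
// the durability guarantee (and the I/O cost) the benchmark numbers compare.
public class FsyncSketch {
    static void writeAndSync(File f, byte[] data) throws IOException {
        RandomAccessFile raf = new RandomAccessFile(f, "rw");
        try {
            raf.write(data);
            raf.getChannel().force(true); // fsync: data + metadata to stable storage
        } finally {
            raf.close();
        }
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("segments_sketch", ".tmp");
        writeAndSync(f, new byte[] {1, 2, 3});
        System.out.println(f.length()); // 3
        f.delete();
    }
}
```

Note this syncs one file; the command-line "sync" tested above flushes every dirty buffer system-wide, which is why it can be even slower.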
[jira] Updated: (LUCENE-1001) Add Payload retrieval to Spans
[ https://issues.apache.org/jira/browse/LUCENE-1001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Elschot updated LUCENE-1001: - Comment: was deleted > Add Payload retrieval to Spans > -- > > Key: LUCENE-1001 > URL: https://issues.apache.org/jira/browse/LUCENE-1001 > Project: Lucene - Java > Issue Type: New Feature > Components: Search >Reporter: Grant Ingersoll >Assignee: Grant Ingersoll >Priority: Minor > > It will be nice to have access to payloads when doing SpanQuerys. > See http://www.gossamer-threads.com/lists/lucene/java-dev/52270 and > http://www.gossamer-threads.com/lists/lucene/java-dev/51134 > Current API, added to Spans.java is below. I will try to post a patch as > soon as I can figure out how to make it work for unordered spans (I believe I > have all the other cases working). > {noformat} > /** >* Returns the payload data for the current span. >* This is invalid until {@link #next()} is called for >* the first time. >* This method must not be called more than once after each call >* of {@link #next()}. However, payloads are loaded lazily, >* so if the payload data for the current position is not needed, >* this method may not be called at all for performance reasons. >* >* >* WARNING: The status of the Payloads feature is experimental. >* The APIs introduced here might change in the future and will not be >* supported anymore in such a case. >* >* @return a List of byte arrays containing the data of this payload >* @throws IOException >*/ > // TODO: Remove warning after API has been finalized > List/**/ getPayload() throws IOException; > /** >* Checks if a payload can be loaded at this position. >* >* Payloads can only be loaded once per call to >* {@link #next()}. >* >* >* WARNING: The status of the Payloads feature is experimental. >* The APIs introduced here might change in the future and will not be >* supported anymore in such a case.
>* >* @return true if there is a payload available at this position that can > be loaded >*/ > // TODO: Remove warning after API has been finalized > public boolean isPayloadAvailable(); > {noformat}
[jira] Commented: (LUCENE-1001) Add Payload retrieval to Spans
[ https://issues.apache.org/jira/browse/LUCENE-1001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12544108 ] Paul Elschot commented on LUCENE-1001: -- Grant, You asked: ... how do I get access to the position payloads in the order that they occur in the PQ? The answer was already there: ... , it's easier than that: when they match, they all match, so you only need to keep the input Spans around in a List or whatever. Then use them all as a source for your payloads. Regards, Paul Elschot P.S. I've had my break already... > Add Payload retrieval to Spans > -- > > Key: LUCENE-1001 > URL: https://issues.apache.org/jira/browse/LUCENE-1001 > Project: Lucene - Java > Issue Type: New Feature > Components: Search >Reporter: Grant Ingersoll >Assignee: Grant Ingersoll >Priority: Minor > > It will be nice to have access to payloads when doing SpanQuerys. > See http://www.gossamer-threads.com/lists/lucene/java-dev/52270 and > http://www.gossamer-threads.com/lists/lucene/java-dev/51134 > Current API, added to Spans.java is below. I will try to post a patch as > soon as I can figure out how to make it work for unordered spans (I believe I > have all the other cases working). > {noformat} > /** >* Returns the payload data for the current span. >* This is invalid until {@link #next()} is called for >* the first time. >* This method must not be called more than once after each call >* of {@link #next()}. However, payloads are loaded lazily, >* so if the payload data for the current position is not needed, >* this method may not be called at all for performance reasons. >* >* >* WARNING: The status of the Payloads feature is experimental. >* The APIs introduced here might change in the future and will not be >* supported anymore in such a case.
>* >* @return a List of byte arrays containing the data of this payload >* @throws IOException >*/ > // TODO: Remove warning after API has been finalized > List/**/ getPayload() throws IOException; > /** >* Checks if a payload can be loaded at this position. >* >* Payloads can only be loaded once per call to >* {@link #next()}. >* >* >* WARNING: The status of the Payloads feature is experimental. >* The APIs introduced here might change in the future and will not be >* supported anymore in such a case. >* >* @return true if there is a payload available at this position that can > be loaded >*/ > // TODO: Remove warning after API has been finalized > public boolean isPayloadAvailable(); > {noformat}
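Paul's suggestion (keep the matching sub-spans in a plain List and pull payloads from each, rather than popping and rebuilding a priority queue) might look like the following sketch; SpanLike is a hypothetical minimal stand-in for Spans, not the real interface:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch of the approach from the comment above: when nested spans all match
// at the current position, hold them in a List and concatenate their
// payloads. Nothing needs to be popped or rebuilt per getPayload() call.
public class SpanPayloadSketch {
    interface SpanLike {
        List<byte[]> getPayload();
    }

    static List<byte[]> collectPayloads(List<SpanLike> matchingSpans) {
        List<byte[]> all = new ArrayList<byte[]>();
        for (SpanLike s : matchingSpans) {
            all.addAll(s.getPayload()); // each matching sub-span contributes its payloads
        }
        return all;
    }

    public static void main(String[] args) {
        SpanLike a = () -> Arrays.asList("p1".getBytes());
        SpanLike b = () -> Arrays.asList("p2".getBytes(), "p3".getBytes());
        System.out.println(collectPayloads(Arrays.asList(a, b)).size()); // 3
    }
}
```

The point is Paul's observation that "when they match, they all match": the List is already the set of payload sources, so no queue traversal order has to be preserved or reconstructed.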
[jira] Commented: (LUCENE-1055) Remove GData from trunk
[ https://issues.apache.org/jira/browse/LUCENE-1055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12544107 ] Paul Elschot commented on LUCENE-1055: -- Hoss, That must have been the cause. After removing the gdata-server directory manually everything is in order. Thanks. > Remove GData from trunk > > > Key: LUCENE-1055 > URL: https://issues.apache.org/jira/browse/LUCENE-1055 > Project: Lucene - Java > Issue Type: Task > Components: contrib/* >Reporter: Michael Busch >Assignee: Michael Busch >Priority: Minor > Fix For: 2.3 > > Attachments: lucene-1055.patch > > > GData doesn't seem to be maintained anymore. We're going to remove it before > we cut the 2.3 release unless there are negative votes. > In case someone jumps in in the future and starts to maintain it, we can > re-add it to the trunk. > If anyone is using GData and needs it to be in 2.3 please let us know soon!
[jira] Commented: (LUCENE-1001) Add Payload retrieval to Spans
[ https://issues.apache.org/jira/browse/LUCENE-1001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12544105 ] Grant Ingersoll commented on LUCENE-1001: - Sure, but how do I get access to the position payloads in the order that they occur in the PQ? I have to go and pop them all off the PQ or I need to maintain a separate PQ for the Payloads so that when I go to get a payload for a span, I can iterate over all the items by calling PQ.pop() but then I have to rebuild it again if getPayload is called again, right? I think I need to take a break and come back to this after some Turkey... :-) > Add Payload retrieval to Spans > -- > > Key: LUCENE-1001 > URL: https://issues.apache.org/jira/browse/LUCENE-1001 > Project: Lucene - Java > Issue Type: New Feature > Components: Search >Reporter: Grant Ingersoll >Assignee: Grant Ingersoll >Priority: Minor > > It will be nice to have access to payloads when doing SpanQuerys. > See http://www.gossamer-threads.com/lists/lucene/java-dev/52270 and > http://www.gossamer-threads.com/lists/lucene/java-dev/51134 > Current API, added to Spans.java is below. I will try to post a patch as > soon as I can figure out how to make it work for unordered spans (I believe I > have all the other cases working). > {noformat} > /** >* Returns the payload data for the current span. >* This is invalid until {@link #next()} is called for >* the first time. >* This method must not be called more than once after each call >* of {@link #next()}. However, payloads are loaded lazily, >* so if the payload data for the current position is not needed, >* this method may not be called at all for performance reasons. >* >* >* WARNING: The status of the Payloads feature is experimental. >* The APIs introduced here might change in the future and will not be >* supported anymore in such a case.
>* >* @return a List of byte arrays containing the data of this payload >* @throws IOException >*/ > // TODO: Remove warning after API has been finalized > List/*<byte[]>*/ getPayload() throws IOException; > /** >* Checks if a payload can be loaded at this position. >* >* Payloads can only be loaded once per call to >* {@link #next()}. >* >* >* WARNING: The status of the Payloads feature is experimental. >* The APIs introduced here might change in the future and will not be >* supported anymore in such a case. >* >* @return true if there is a payload available at this position that can > be loaded >*/ > // TODO: Remove warning after API has been finalized > public boolean isPayloadAvailable(); > {noformat} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
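The priority-queue concern in Grant's comment above can be seen with plain java.util.PriorityQueue (Lucene has its own PriorityQueue class, but the behavior is the same): reading the entries in order drains the queue, so a second getPayload-style call would need a rebuild, whereas draining once into a List gives a structure that can be iterated repeatedly. A minimal illustration (PqDrain is an illustrative name, not Lucene code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

class PqDrain {
    // Drain the queue into a list in priority order. The list can be
    // iterated any number of times; the queue itself ends up empty.
    static List<Integer> drainOnce(PriorityQueue<Integer> pq) {
        List<Integer> ordered = new ArrayList<Integer>();
        while (!pq.isEmpty()) {
            ordered.add(pq.poll()); // poll() removes the head
        }
        return ordered;
    }
}
```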
[jira] Resolved: (LUCENE-1063) Token re-use API breaks back compatibility in certain TokenStream chains
[ https://issues.apache.org/jira/browse/LUCENE-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved LUCENE-1063. Resolution: Invalid > Token re-use API breaks back compatibility in certain TokenStream chains > > > Key: LUCENE-1063 > URL: https://issues.apache.org/jira/browse/LUCENE-1063 > Project: Lucene - Java > Issue Type: Bug > Components: Analysis >Affects Versions: 2.3 >Reporter: Michael McCandless >Assignee: Michael McCandless > Fix For: 2.3 > > Attachments: LUCENE-1063.patch > > > In scrutinizing the new Token re-use API during this thread: > http://www.gossamer-threads.com/lists/lucene/java-dev/54708 > I realized we now have a non-back-compatibility when mixing re-use and > non-re-use TokenStreams. > The new "reuse" next(Token) API actually allows two different aspects > of re-use: > 1) "Backwards re-use": the subsequent call to next(Token) is allowed > to change all aspects of the provided Token, meaning the caller > must do all persisting of Token that it needs before calling > next(Token) again. > 2) "Forwards re-use": the caller is allowed to modify the returned > Token however it wants. Eg the LowerCaseFilter is allowed to > downcase the characters in-place in the char[] termBuffer. > The forwards re-use case can break backwards compatibility now. EG: > if a TokenStream X providing only the "non-reuse" next() API is > followed by a TokenFilter Y using the "reuse" next(Token) API to pull > the tokens, then the default implementation in TokenStream.java for > next(Token) will kick in. > That default implementation just returns the provided "private copy" > Token returned by next(). But, because of 2) above, this is not > legal: if the TokenFilter Y modifies the char[] termBuffer (say), that > is actually modifying the cached copy being potentially stored by X. > I think the opposite case is handled correctly. 
> A simple way to fix this is to make a full copy of the Token in the > next(Token) call in TokenStream, just like we do in the next() method > in TokenStream. The downside is this is a small performance hit. However > that hit only happens at the boundary between a non-reuse and a re-use > tokenizer. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
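The "full copy at the boundary" fix quoted above can be sketched with simplified stand-in classes (these are not Lucene's actual Token/TokenStream; the names are illustrative): the old-style next() hands out the stream's private cached Token, so the default next(Token) bridge returns a full copy before a downstream filter mutates the char[] in place.

```java
// Simplified stand-in for Lucene's Token: just a term buffer.
class ReusedToken {
    char[] termBuffer;
    ReusedToken(char[] buf) { termBuffer = buf; }
    ReusedToken fullCopy() { return new ReusedToken(termBuffer.clone()); }
}

// A "non-reuse" stream that caches its private Token between calls.
class NonReuseStream {
    private final ReusedToken cached = new ReusedToken(new char[] {'F', 'O', 'O'});

    // Old-style API: returns the stream's private copy.
    ReusedToken next() { return cached; }

    // Default reuse bridge: copy so downstream in-place mutation
    // (e.g. lowercasing the termBuffer) cannot corrupt the cache.
    ReusedToken next(ReusedToken reusable) {
        ReusedToken t = next();
        return t == null ? null : t.fullCopy();
    }

    char[] cachedBuffer() { return cached.termBuffer; }
}
```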
[jira] Commented: (LUCENE-1063) Token re-use API breaks back compatibility in certain TokenStream chains
[ https://issues.apache.org/jira/browse/LUCENE-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12544103 ] Michael McCandless commented on LUCENE-1063: OK it sounds like this was a false alarm on my part -- sorry! The semantics of next() have always allowed the caller to arbitrarily modify the returned token ("forward reuse"). So I don't think we need to change anything. > Token re-use API breaks back compatibility in certain TokenStream chains > > > Key: LUCENE-1063 > URL: https://issues.apache.org/jira/browse/LUCENE-1063 > Project: Lucene - Java > Issue Type: Bug > Components: Analysis >Affects Versions: 2.3 >Reporter: Michael McCandless >Assignee: Michael McCandless > Fix For: 2.3 > > Attachments: LUCENE-1063.patch > > > In scrutinizing the new Token re-use API during this thread: > http://www.gossamer-threads.com/lists/lucene/java-dev/54708 > I realized we now have a non-back-compatibility when mixing re-use and > non-re-use TokenStreams. > The new "reuse" next(Token) API actually allows two different aspects > of re-use: > 1) "Backwards re-use": the subsequent call to next(Token) is allowed > to change all aspects of the provided Token, meaning the caller > must do all persisting of Token that it needs before calling > next(Token) again. > 2) "Forwards re-use": the caller is allowed to modify the returned > Token however it wants. Eg the LowerCaseFilter is allowed to > downcase the characters in-place in the char[] termBuffer. > The forwards re-use case can break backwards compatibility now. EG: > if a TokenStream X providing only the "non-reuse" next() API is > followed by a TokenFilter Y using the "reuse" next(Token) API to pull > the tokens, then the default implementation in TokenStream.java for > next(Token) will kick in. > That default implementation just returns the provided "private copy" > Token returned by next(). 
But, because of 2) above, this is not > legal: if the TokenFilter Y modifies the char[] termBuffer (say), that > is actually modifying the cached copy being potentially stored by X. > I think the opposite case is handled correctly. > A simple way to fix this is to make a full copy of the Token in the > next(Token) call in TokenStream, just like we do in the next() method > in TokenStream. The downside is this is a small performance hit. However > that hit only happens at the boundary between a non-reuse and a re-use > tokenizer. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1063) Token re-use API breaks back compatibility in certain TokenStream chains
[ https://issues.apache.org/jira/browse/LUCENE-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12544098 ] Yonik Seeley commented on LUCENE-1063: -- > CachingTokenFilter actually does this (caching references to the tokens). It's a bug to depend on the fact that the tokens you return won't change. If one is supposed to be able to use CachingTokenFilter anywhere in a filter chain and then be able to replay the tokens exactly as CachingTokenFilter first saw them (which is what I would guess the use to be), then it is a bug and didn't work properly before token reuse either. > Token re-use API breaks back compatibility in certain TokenStream chains > > > Key: LUCENE-1063 > URL: https://issues.apache.org/jira/browse/LUCENE-1063 > Project: Lucene - Java > Issue Type: Bug > Components: Analysis >Affects Versions: 2.3 >Reporter: Michael McCandless >Assignee: Michael McCandless > Fix For: 2.3 > > Attachments: LUCENE-1063.patch > > > In scrutinizing the new Token re-use API during this thread: > http://www.gossamer-threads.com/lists/lucene/java-dev/54708 > I realized we now have a non-back-compatibility when mixing re-use and > non-re-use TokenStreams. > The new "reuse" next(Token) API actually allows two different aspects > of re-use: > 1) "Backwards re-use": the subsequent call to next(Token) is allowed > to change all aspects of the provided Token, meaning the caller > must do all persisting of Token that it needs before calling > next(Token) again. > 2) "Forwards re-use": the caller is allowed to modify the returned > Token however it wants. Eg the LowerCaseFilter is allowed to > downcase the characters in-place in the char[] termBuffer. > The forwards re-use case can break backwards compatibility now. EG: > if a TokenStream X providing only the "non-reuse" next() API is > followed by a TokenFilter Y using the "reuse" next(Token) API to pull > the tokens, then the default implementation in TokenStream.java for > next(Token) will kick in. 
> That default implementation just returns the provided "private copy" > Token returned by next(). But, because of 2) above, this is not > legal: if the TokenFilter Y modifies the char[] termBuffer (say), that > is actually modifying the cached copy being potentially stored by X. > I think the opposite case is handled correctly. > A simple way to fix this is to make a full copy of the Token in the > next(Token) call in TokenStream, just like we do in the next() method > in TokenStream. The downside is this is a small performance hit. However > that hit only happens at the boundary between a non-reuse and a re-use > tokenizer. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1062) Improved Payloads API
[ https://issues.apache.org/jira/browse/LUCENE-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12544095 ] Michael Busch commented on LUCENE-1062: --- We want to add the following methods to Payload: {code:java} public void setPayload(byte[] data); public void setPayload(byte[] data, int offset, int length); public byte[] getPayload(); public int getPayloadOffset(); public Object clone(); {code} Also Payload should implement Cloneable. Furthermore, we want to add a fieldName arg to Similarity.scorePayload(). I think we can also remove the "experimental" warnings from the Payload APIs now? > Improved Payloads API > - > > Key: LUCENE-1062 > URL: https://issues.apache.org/jira/browse/LUCENE-1062 > Project: Lucene - Java > Issue Type: Bug > Components: Analysis >Reporter: Michael Busch >Assignee: Michael Busch >Priority: Minor > Fix For: 2.3 > > > We want to make some optimizations to the Payloads API. > See following thread for related discussions: > http://www.gossamer-threads.com/lists/lucene/java-dev/54708 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
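The method list in the comment above might look like the following sketch; the field layout and the deep copy of the byte[] in clone() are my assumptions, not Lucene's actual implementation.

```java
// Sketch of the proposed Payload additions (assumed internals).
class Payload implements Cloneable {
    private byte[] data;
    private int offset;
    private int length;

    public void setPayload(byte[] data) {
        setPayload(data, 0, data == null ? 0 : data.length);
    }

    public void setPayload(byte[] data, int offset, int length) {
        this.data = data;
        this.offset = offset;
        this.length = length;
    }

    public byte[] getPayload() { return data; }
    public int getPayloadOffset() { return offset; }

    public Object clone() {
        Payload copy = new Payload();
        // Deep-copy the byte[] so the clone is independent of the original;
        // this matters for Token.clone() as discussed in LUCENE-1063.
        copy.setPayload(data == null ? null : data.clone(), offset, length);
        return copy;
    }
}
```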
[jira] Commented: (LUCENE-1058) New Analyzer for buffering tokens
[ https://issues.apache.org/jira/browse/LUCENE-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12544096 ] Michael McCandless commented on LUCENE-1058: I think the discussion in LUCENE-1063 is relevant to this issue: if you store (& re-use) Tokens you may need to return a copy of the Token from the next() method to ensure that any filters that alter the Token don't mess up your private copy. > New Analyzer for buffering tokens > - > > Key: LUCENE-1058 > URL: https://issues.apache.org/jira/browse/LUCENE-1058 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis >Reporter: Grant Ingersoll >Assignee: Grant Ingersoll >Priority: Minor > Attachments: LUCENE-1058.patch > > > In some cases, it would be handy to have Analyzer/Tokenizer/TokenFilters that > could siphon off certain tokens and store them in a buffer to be used later > in the processing pipeline. > For example, if you want to have two fields, one lowercased and one not, but > all the other analysis is the same, then you could save off the tokens to be > output for a different field. > Patch to follow, but I am still not sure about a couple of things, mostly how > it plays with the new reuse API. > See > http://www.gossamer-threads.com/lists/lucene/java-dev/54397?search_string=BufferingAnalyzer;#54397 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1063) Token re-use API breaks back compatibility in certain TokenStream chains
[ https://issues.apache.org/jira/browse/LUCENE-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12544093 ] Doron Cohen commented on LUCENE-1063: - {quote} > TokenStreams that cache tokens without "protecting" their private copy when > next() is called? That would be a bug in the filter (both in the past and now). {quote} I think it is okay to relax this to only protect in Tokenizers (where Tokens are created), and not worry about TokenFilters. TokenFilters always take a TokenStream at construction and always call its next(Token), which eventually calls a Tokenizer.next(Token) -- which is protected -- and so the TokenFilter can rely on that protection. Right? > Token re-use API breaks back compatibility in certain TokenStream chains > > > Key: LUCENE-1063 > URL: https://issues.apache.org/jira/browse/LUCENE-1063 > Project: Lucene - Java > Issue Type: Bug > Components: Analysis >Affects Versions: 2.3 >Reporter: Michael McCandless >Assignee: Michael McCandless > Fix For: 2.3 > > Attachments: LUCENE-1063.patch > > > In scrutinizing the new Token re-use API during this thread: > http://www.gossamer-threads.com/lists/lucene/java-dev/54708 > I realized we now have a non-back-compatibility when mixing re-use and > non-re-use TokenStreams. > The new "reuse" next(Token) API actually allows two different aspects > of re-use: > 1) "Backwards re-use": the subsequent call to next(Token) is allowed > to change all aspects of the provided Token, meaning the caller > must do all persisting of Token that it needs before calling > next(Token) again. > 2) "Forwards re-use": the caller is allowed to modify the returned > Token however it wants. Eg the LowerCaseFilter is allowed to > downcase the characters in-place in the char[] termBuffer. > The forwards re-use case can break backwards compatibility now. 
EG: > if a TokenStream X providing only the "non-reuse" next() API is > followed by a TokenFilter Y using the "reuse" next(Token) API to pull > the tokens, then the default implementation in TokenStream.java for > next(Token) will kick in. > That default implementation just returns the provided "private copy" > Token returned by next(). But, because of 2) above, this is not > legal: if the TokenFilter Y modifies the char[] termBuffer (say), that > is actually modifying the cached copy being potentially stored by X. > I think the opposite case is handled correctly. > A simple way to fix this is to make a full copy of the Token in the > next(Token) call in TokenStream, just like we do in the next() method > in TokenStream. The downside is this is a small performance hit. However > that hit only happens at the boundary between a non-reuse and a re-use > tokenizer. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1063) Token re-use API breaks back compatibility in certain TokenStream chains
[ https://issues.apache.org/jira/browse/LUCENE-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12544091 ] Michael Busch commented on LUCENE-1063: --- {quote} I think it should put a cloned copy into the cache. {quote} Or, we could add a boolean to the ctr of CachingTokenFilter that specifies whether or not to clone the Tokens. So if a user knows that it is safe to simply cache the references they can disable the cloning for performance reasons. > Token re-use API breaks back compatibility in certain TokenStream chains > > > Key: LUCENE-1063 > URL: https://issues.apache.org/jira/browse/LUCENE-1063 > Project: Lucene - Java > Issue Type: Bug > Components: Analysis >Affects Versions: 2.3 >Reporter: Michael McCandless >Assignee: Michael McCandless > Fix For: 2.3 > > Attachments: LUCENE-1063.patch > > > In scrutinizing the new Token re-use API during this thread: > http://www.gossamer-threads.com/lists/lucene/java-dev/54708 > I realized we now have a non-back-compatibility when mixing re-use and > non-re-use TokenStreams. > The new "reuse" next(Token) API actually allows two different aspects > of re-use: > 1) "Backwards re-use": the subsequent call to next(Token) is allowed > to change all aspects of the provided Token, meaning the caller > must do all persisting of Token that it needs before calling > next(Token) again. > 2) "Forwards re-use": the caller is allowed to modify the returned > Token however it wants. Eg the LowerCaseFilter is allowed to > downcase the characters in-place in the char[] termBuffer. > The forwards re-use case can break backwards compatibility now. EG: > if a TokenStream X providing only the "non-reuse" next() API is > followed by a TokenFilter Y using the "reuse" next(Token) API to pull > the tokens, then the default implementation in TokenStream.java for > next(Token) will kick in. > That default implementation just returns the provided "private copy" > Token returned by next(). 
But, because of 2) above, this is not > legal: if the TokenFilter Y modifies the char[] termBuffer (say), that > is actually modifying the cached copy being potentially stored by X. > I think the opposite case is handled correctly. > A simple way to fix this is to make a full copy of the Token in the > next(Token) call in TokenStream, just like we do in the next() method > in TokenStream. The downside is this is a small performance hit. However > that hit only happens at the boundary between a non-reuse and a re-use > tokenizer. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
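The constructor-flag idea from the comment above could look roughly like this (simplified stand-ins, not Lucene's actual CachingTokenFilter/Token; names are illustrative): when cloneTokens is true the cache stores independent copies, and when false it stores shared references for speed, at the caller's risk.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for Lucene's Token: just a mutable term buffer.
class CachedToken {
    char[] termBuffer;
    CachedToken(char[] buf) { termBuffer = buf; }
    CachedToken copy() { return new CachedToken(termBuffer.clone()); }
}

// Caching filter sketch with the proposed clone-on-cache flag.
class CachingFilterSketch {
    private final boolean cloneTokens;
    private final List<CachedToken> cache = new ArrayList<CachedToken>();

    CachingFilterSketch(boolean cloneTokens) { this.cloneTokens = cloneTokens; }

    // Cache either a defensive copy or the shared reference.
    void cacheToken(CachedToken t) { cache.add(cloneTokens ? t.copy() : t); }

    CachedToken replay(int i) { return cache.get(i); }
}
```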
[jira] Commented: (LUCENE-1063) Token re-use API breaks back compatibility in certain TokenStream chains
[ https://issues.apache.org/jira/browse/LUCENE-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12544088 ] Michael Busch commented on LUCENE-1063: --- {quote} That would be a bug in the filter (both in the past and now). {quote} CachingTokenFilter actually does this (caching references to the tokens). I think it should put a cloned copy into the cache. Oh and actually I just noticed that Payload doesn't implement Cloneable! So Token.clone() doesn't create a copy of the Payload, which I think it should? I will fix this with Lucene-1062. > Token re-use API breaks back compatibility in certain TokenStream chains > > > Key: LUCENE-1063 > URL: https://issues.apache.org/jira/browse/LUCENE-1063 > Project: Lucene - Java > Issue Type: Bug > Components: Analysis >Affects Versions: 2.3 >Reporter: Michael McCandless >Assignee: Michael McCandless > Fix For: 2.3 > > Attachments: LUCENE-1063.patch > > > In scrutinizing the new Token re-use API during this thread: > http://www.gossamer-threads.com/lists/lucene/java-dev/54708 > I realized we now have a non-back-compatibility when mixing re-use and > non-re-use TokenStreams. > The new "reuse" next(Token) API actually allows two different aspects > of re-use: > 1) "Backwards re-use": the subsequent call to next(Token) is allowed > to change all aspects of the provided Token, meaning the caller > must do all persisting of Token that it needs before calling > next(Token) again. > 2) "Forwards re-use": the caller is allowed to modify the returned > Token however it wants. Eg the LowerCaseFilter is allowed to > downcase the characters in-place in the char[] termBuffer. > The forwards re-use case can break backwards compatibility now. EG: > if a TokenStream X providing only the "non-reuse" next() API is > followed by a TokenFilter Y using the "reuse" next(Token) API to pull > the tokens, then the default implementation in TokenStream.java for > next(Token) will kick in. 
> That default implementation just returns the provided "private copy" > Token returned by next(). But, because of 2) above, this is not > legal: if the TokenFilter Y modifies the char[] termBuffer (say), that > is actually modifying the cached copy being potentially stored by X. > I think the opposite case is handled correctly. > A simple way to fix this is to make a full copy of the Token in the > next(Token) call in TokenStream, just like we do in the next() method > in TokenStream. The downside is this is a small performance hit. However > that hit only happens at the boundary between a non-reuse and a re-use > tokenizer. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1063) Token re-use API breaks back compatibility in certain TokenStream chains
[ https://issues.apache.org/jira/browse/LUCENE-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12544077 ] Yonik Seeley commented on LUCENE-1063: -- In the past, the semantics were simple... Tokenizer generated tokens, and token filters modified them. I don't think it was a bug that filters modify instead of create new tokens. No one cached tokens and expected them to be unchanged because they could be modified by a downstream filter. > TokenStreams that cache tokens without "protecting" their private copy when > next() is called? That would be a bug in the filter (both in the past and now). > Token re-use API breaks back compatibility in certain TokenStream chains > > > Key: LUCENE-1063 > URL: https://issues.apache.org/jira/browse/LUCENE-1063 > Project: Lucene - Java > Issue Type: Bug > Components: Analysis >Affects Versions: 2.3 >Reporter: Michael McCandless >Assignee: Michael McCandless > Fix For: 2.3 > > Attachments: LUCENE-1063.patch > > > In scrutinizing the new Token re-use API during this thread: > http://www.gossamer-threads.com/lists/lucene/java-dev/54708 > I realized we now have a non-back-compatibility when mixing re-use and > non-re-use TokenStreams. > The new "reuse" next(Token) API actually allows two different aspects > of re-use: > 1) "Backwards re-use": the subsequent call to next(Token) is allowed > to change all aspects of the provided Token, meaning the caller > must do all persisting of Token that it needs before calling > next(Token) again. > 2) "Forwards re-use": the caller is allowed to modify the returned > Token however it wants. Eg the LowerCaseFilter is allowed to > downcase the characters in-place in the char[] termBuffer. > The forwards re-use case can break backwards compatibility now. 
EG: > if a TokenStream X providing only the "non-reuse" next() API is > followed by a TokenFilter Y using the "reuse" next(Token) API to pull > the tokens, then the default implementation in TokenStream.java for > next(Token) will kick in. > That default implementation just returns the provided "private copy" > Token returned by next(). But, because of 2) above, this is not > legal: if the TokenFilter Y modifies the char[] termBuffer (say), that > is actually modifying the cached copy being potentially stored by X. > I think the opposite case is handled correctly. > A simple way to fix this is to make a full copy of the Token in the > next(Token) call in TokenStream, just like we do in the next() method > in TokenStream. The downside is this is a small performance hit. However > that hit only happens at the boundary between a non-reuse and a re-use > tokenizer. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1052) Add an "termInfosIndexDivisor" to IndexReader
[ https://issues.apache.org/jira/browse/LUCENE-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12544076 ] Michael McCandless commented on LUCENE-1052: Maybe, instead, we should simply make it "easy" to subclass TermInfosReader whenever a SegmentReader wants to instantiate it? Ie, the formula is such an advanced use case that it seems appropriate to subclass instead of trying to break it out into a special interface/abstract class? Of course, we need to know this class at SegmentReader construction time, so I think to specify it we should in fact take Doug's suggested approach using generic properties. The challenge with Lucene (and Hadoop) is how can you reach deep down into a complex IndexReader.open static method call to change various details of the embedded *Readers while they are being constructed, and, after they are constructed... I agree it is messy now that we must propagate the setTermInfosIndexInterval method up the *Reader hierarchy when not all Readers would even use a TermInfosReader. So ... maybe we 1) implement generic Lucene properties w/ static classes/methods to set/get these properties, then 2) remove set/getTermInfosIndexInterval from *Reader and make a generic property for it instead, and 3) add another property that allows you to specify the Class (or String name) of your TermInfosReader subclass (and make it non-final)? > Add an "termInfosIndexDivisor" to IndexReader > - > > Key: LUCENE-1052 > URL: https://issues.apache.org/jira/browse/LUCENE-1052 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Affects Versions: 2.2 >Reporter: Michael McCandless >Assignee: Michael McCandless >Priority: Minor > Fix For: 2.3 > > Attachments: LUCENE-1052.patch, termInfosConfigurer.patch > > > The termIndexInterval, set during indexing time, lets you trade off > how much RAM is used by a reader to load the indexed terms vs cost of > seeking to the specific term you want to load. 
> But the downside is you must set it at indexing time. > This issue adds an indexDivisor to TermInfosReader so that on opening > a reader you could further sub-sample the termIndexInterval to use > less RAM. EG a setting of 2 means every 2 * termIndexInterval is > loaded into RAM. > This is particularly useful if your index has a great many terms (eg > you accidentally indexed binary terms). > Spinoff from this thread: > http://www.gossamer-threads.com/lists/lucene/java-dev/54371 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1063) Token re-use API breaks back compatibility in certain TokenStream chains
[ https://issues.apache.org/jira/browse/LUCENE-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12544075 ] Doron Cohen commented on LUCENE-1063: - Oh, I was locked on the idea that calling next(null) means do-not-reuse, but I guess since we have the original next() this is not required. > Token re-use API breaks back compatibility in certain TokenStream chains > > > Key: LUCENE-1063 > URL: https://issues.apache.org/jira/browse/LUCENE-1063 > Project: Lucene - Java > Issue Type: Bug > Components: Analysis >Affects Versions: 2.3 >Reporter: Michael McCandless >Assignee: Michael McCandless > Fix For: 2.3 > > Attachments: LUCENE-1063.patch > > > In scrutinizing the new Token re-use API during this thread: > http://www.gossamer-threads.com/lists/lucene/java-dev/54708 > I realized we now have a non-back-compatibility when mixing re-use and > non-re-use TokenStreams. > The new "reuse" next(Token) API actually allows two different aspects > of re-use: > 1) "Backwards re-use": the subsequent call to next(Token) is allowed > to change all aspects of the provided Token, meaning the caller > must do all persisting of Token that it needs before calling > next(Token) again. > 2) "Forwards re-use": the caller is allowed to modify the returned > Token however it wants. Eg the LowerCaseFilter is allowed to > downcase the characters in-place in the char[] termBuffer. > The forwards re-use case can break backwards compatibility now. EG: > if a TokenStream X providing only the "non-reuse" next() API is > followed by a TokenFilter Y using the "reuse" next(Token) API to pull > the tokens, then the default implementation in TokenStream.java for > next(Token) will kick in. > That default implementation just returns the provided "private copy" > Token returned by next(). But, because of 2) above, this is not > legal: if the TokenFilter Y modifies the char[] termBuffer (say), that > is actually modifying the cached copy being potentially stored by X. 
> I think the opposite case is handled correctly. > A simple way to fix this is to make a full copy of the Token in the > next(Token) call in TokenStream, just like we do in the next() method > in TokenStream. The downside is this is a small performance hit. However > that hit only happens at the boundary between a non-reuse and a re-use > tokenizer. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1063) Token re-use API breaks back compatibility in certain TokenStream chains
[ https://issues.apache.org/jira/browse/LUCENE-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12544063 ] Michael McCandless commented on LUCENE-1063: {quote} I checked next(Token res) implementations of CharTokenizer, KeywordTokenizer and StandardTokenizer and non of them checks res for null. {quote} I think you should not pass null into this method? (Ie you should use next() instead). I can clarify this in the javadocs... > Token re-use API breaks back compatibility in certain TokenStream chains > > > Key: LUCENE-1063 > URL: https://issues.apache.org/jira/browse/LUCENE-1063 > Project: Lucene - Java > Issue Type: Bug > Components: Analysis >Affects Versions: 2.3 >Reporter: Michael McCandless >Assignee: Michael McCandless > Fix For: 2.3 > > Attachments: LUCENE-1063.patch > > > In scrutinizing the new Token re-use API during this thread: > http://www.gossamer-threads.com/lists/lucene/java-dev/54708 > I realized we now have a non-back-compatibility when mixing re-use and > non-re-use TokenStreams. > The new "reuse" next(Token) API actually allows two different aspects > of re-use: > 1) "Backwards re-use": the subsequent call to next(Token) is allowed > to change all aspects of the provided Token, meaning the caller > must do all persisting of Token that it needs before calling > next(Token) again. > 2) "Forwards re-use": the caller is allowed to modify the returned > Token however it wants. Eg the LowerCaseFilter is allowed to > downcase the characters in-place in the char[] termBuffer. > The forwards re-use case can break backwards compatibility now. EG: > if a TokenStream X providing only the "non-reuse" next() API is > followed by a TokenFilter Y using the "reuse" next(Token) API to pull > the tokens, then the default implementation in TokenStream.java for > next(Token) will kick in. > That default implementation just returns the provided "private copy" > Token returned by next(). 
But, because of 2) above, this is not > legal: if the TokenFilter Y modifies the char[] termBuffer (say), that > is actually modifying the cached copy being potentially stored by X. > I think the opposite case is handled correctly. > A simple way to fix this is to make a full copy of the Token in the > next(Token) call in TokenStream, just like we do in the next() method > in TokenStream. The downside is this is a small performance hit. However > that hit only happens at the boundary between a non-reuse and a re-use > tokenizer. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1063) Token re-use API breaks back compatibility in certain TokenStream chains
[ https://issues.apache.org/jira/browse/LUCENE-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12544059 ] Doron Cohen commented on LUCENE-1063: - {quote} and even with old style Tokens w/o Token reuse, one could always change what string the token pointed at. {quote} ...right... termText is now private but it used to be package protected. Patch looks good for (default) TokenStream, though it is a shame there is no magic way to know if the Token was changed and copying is really required. But is this good enough also for non-default TokenStreams which implement next(Token)? mm.. I checked the next(Token res) implementations of CharTokenizer, KeywordTokenizer and StandardTokenizer and none of them checks res for null. Am I missing something trivial?
[jira] Commented: (LUCENE-1052) Add an "termInfosIndexDivisor" to IndexReader
[ https://issues.apache.org/jira/browse/LUCENE-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12544055 ] Chuck Williams commented on LUCENE-1052: I agree a general configuration system would be much better. Doug, we use a similar method to what you described in our application. TermInfosConfigurer is slightly different though since the desired config is a method that implements a formula, rather than just a value. This could still be done more generally by allowing methods as well as properties or setters on a higher level configuration object. I didn't want to take on the broader issue just for this feature. Michael, I agree with both of your points. I'd be happy to clean up this patch if you guys provide some guidance for what would make it acceptable to commit. > Add an "termInfosIndexDivisor" to IndexReader > - > > Key: LUCENE-1052 > URL: https://issues.apache.org/jira/browse/LUCENE-1052 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Affects Versions: 2.2 >Reporter: Michael McCandless >Assignee: Michael McCandless >Priority: Minor > Fix For: 2.3 > > Attachments: LUCENE-1052.patch, termInfosConfigurer.patch > > > The termIndexInterval, set during indexing time, lets you trade off > how much RAM is used by a reader to load the indexed terms vs cost of > seeking to the specific term you want to load. > But the downside is you must set it at indexing time. > This issue adds an indexDivisor to TermInfosReader so that on opening > a reader you could further sub-sample the termIndexInterval to use > less RAM. EG a setting of 2 means every 2 * termIndexInterval is > loaded into RAM. > This is particularly useful if your index has a great many terms (eg > you accidentally indexed binary terms). > Spinoff from this thread: > http://www.gossamer-threads.com/lists/lucene/java-dev/54371 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. 
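Chuck's point above, that the desired config here is a formula rather than a value, might be sketched as an interface whose method computes the divisor from segment properties. All names and the ~1M-term threshold below are made up for illustration; this is not the TermInfosConfigurer API from the attached patch.

```java
public class DivisorPolicyDemo {
    // Hypothetical "config as a method": the divisor is derived from the
    // number of indexed terms rather than being a fixed setting.
    interface TermIndexDivisorPolicy {
        int indexDivisor(long numIndexTerms);
    }

    public static void main(String[] args) {
        // Example policy: grow the divisor so that at most roughly one
        // million indexed terms are ever loaded into RAM.
        TermIndexDivisorPolicy policy = numTerms ->
                (int) Math.max(1L, numTerms / 1_000_000L);

        if (policy.indexDivisor(500_000L) != 1)
            throw new IllegalStateException("small index should use divisor 1");
        if (policy.indexDivisor(4_000_000L) != 4)
            throw new IllegalStateException("large index should sub-sample");
        System.out.println("ok");
    }
}
```

A reader opening a segment would call the policy once, which is why a plain value cannot express this: the right divisor depends on the segment being opened.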
[jira] Commented: (LUCENE-1063) Token re-use API breaks back compatibility in certain TokenStream chains
[ https://issues.apache.org/jira/browse/LUCENE-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12544054 ] Michael McCandless commented on LUCENE-1063: {quote} Looking at the test, this would not have worked before token-reuse either I don't yet see how we are breaking backward compatibility. Callers of next() could change the Token, so caching your own copy that you already passed on to someone else was never valid. {quote} You're right: before token reuse a filter could change the String termText (and other fields) and mess up a cached copy held by a TokenStream earlier in the chain. But, our core filters now use the reuse API (for better performance), so if you are using a TokenStream that does caching followed by one of these core filters we will now mess up the cached copy, right? Oh, duh: I just checked 2.2 and in fact the LowerCaseFilter, PorterStemFilter, ISOLatin1AccentFilter all directly alter termText rather than making a new token. So actually this issue is pre-existing! And then I guess we are not breaking backwards compatibility by further propagating it. But I think this is still a bug? Hmm, I guess the semantics of the next() API is and has been to allow you to arbitrarily modify the token after you receive it ("forwards reuse") but not re-use the token on the next call to next ("backwards reuse"). If we take that approach then the bug is in those TokenStreams that cache tokens without "protecting" their private copy when next() is called? 
[jira] Commented: (LUCENE-1063) Token re-use API breaks back compatibility in certain TokenStream chains
[ https://issues.apache.org/jira/browse/LUCENE-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12544048 ] Yonik Seeley commented on LUCENE-1063: -- > it is the addition of Token.termBuffer() that allowed this to happen But old filters won't use the termBuffer. And even with old style Tokens w/o Token reuse, one could always change what string the token pointed at.
[jira] Commented: (LUCENE-1063) Token re-use API breaks back compatibility in certain TokenStream chains
[ https://issues.apache.org/jira/browse/LUCENE-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12544045 ] Doron Cohen commented on LUCENE-1063: - Yes, that's what I meant - it is the addition of Token.termBuffer() that allowed this to happen - in 2.2 (apart from payloads) only an immutable String could be obtained from the Token.
Re: Apache logs and data
On 20 Nov 2007, at 20.28, Doug Cutting wrote: karl wettin wrote: On Nov 15, 2007 10:09 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote: it is always good to have query logs http://thepiratebay.org/tor/3783572 It doesn't look as though there's click data, so we can't use this for relevance experiments without manually creating judgments. (LUCENE-626 extracts query goals from this data.) I'll send my fellow countrymen a request for an update with query logs containing clicks, downloads, or whatever they are willing to give out. I'm sure they won't mind. -- karl
Re: Payload API
Grant Ingersoll wrote: > Scratch my last comment. I was thinking it only pertained to payloads. > > In that light, I think we should modify the scorePayload method for the > time being, then we can deprecate it when we go to per field sim. > > -Grant > OK sounds good. Will make the change with LUCENE-1062. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1063) Token re-use API breaks back compatibility in certain TokenStream chains
[ https://issues.apache.org/jira/browse/LUCENE-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12544034 ] Yonik Seeley commented on LUCENE-1063: -- Looking at the test, this would not have worked before token-reuse either. I don't yet see how we are breaking backward compatibility. Callers of next() *could* change the Token, so caching your own copy that you already passed on to someone else was never valid.
[jira] Commented: (LUCENE-1052) Add an "termInfosIndexDivisor" to IndexReader
[ https://issues.apache.org/jira/browse/LUCENE-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12544032 ] Doug Cutting commented on LUCENE-1052: -- I think we should be cautious about adding a new public interface or abstract class to support just this feature. If we want to add a generic configuration API for Lucene, then I'd prefer something fully general, like what I proposed on the mailing list, not something specific to configuring TermInfosReader. Otherwise we'll keep adding new configuration interfaces and adding more parameters to IndexReader constructors each time we wish to make some obscure feature configurable. http://www.gossamer-threads.com/lists/lucene/java-dev/54421#54421 In the model proposed there, adding a new configuration parameter involves just adding a new static method to the public class that implements a new configurable feature.
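The model Doug references, a static configuration method on the class that implements the feature, might look like the sketch below. The class name, method names, and default are hypothetical, chosen only to illustrate why no new IndexReader constructor parameter is needed.

```java
// Hypothetical per-feature static configuration: callers set the value
// on the feature's own class before opening a reader, so IndexReader's
// constructors stay unchanged as new knobs are added.
public class TermInfosReaderConfig {
    private static int indexDivisor = 1;  // hypothetical default: load every index term

    public static void setIndexDivisor(int divisor) {
        if (divisor < 1)
            throw new IllegalArgumentException("divisor must be >= 1, got " + divisor);
        indexDivisor = divisor;
    }

    public static int getIndexDivisor() {
        return indexDivisor;
    }

    public static void main(String[] args) {
        // Configure before opening a reader; the reader would consult
        // getIndexDivisor() when loading the term index.
        setIndexDivisor(2);
        if (getIndexDivisor() != 2) throw new IllegalStateException();
        System.out.println("indexDivisor=" + getIndexDivisor());
    }
}
```

The trade-off discussed in the thread applies: statics are simple and additive, but they are process-global, so two readers in one JVM cannot easily use different values.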
Re: Payload API
Scratch my last comment. I was thinking it only pertained to payloads. In that light, I think we should modify the scorePayload method for the time being, then we can deprecate it when we go to per field sim. -Grant On Nov 20, 2007, at 2:34 PM, Michael Busch wrote: Yonik Seeley wrote: Per field similarity would certainly be more efficient since it moves the field->similarity lookup from the inner loop to the outer loop. I agree. Then I'll leave the scorePayload() API as is for now. And I don't think the per-field similarity should block 2.3, so let's work on that after the release, ok? -Michael - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] -- Grant Ingersoll http://lucene.grantingersoll.com Lucene Helpful Hints: http://wiki.apache.org/lucene-java/BasicsOfPerformance http://wiki.apache.org/lucene-java/LuceneFAQ - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Payload API
Well, we are making an awful lot of improvements for Payloads, I think we should try to get it in now and make 2.3 wait a bit more, since we all have more or less agreed that 2.9 (next after 2.3) is going to be a deprecation release before moving to 3.0 -Grant On Nov 20, 2007, at 2:34 PM, Michael Busch wrote: Yonik Seeley wrote: Per field similarity would certainly be more efficient since it moves the field->similarity lookup from the inner loop to the outer loop. I agree. Then I'll leave the scorePayload() API as is for now. And I don't think the per-field similarity should block 2.3, so let's work on that after the release, ok? -Michael - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1001) Add Payload retrieval to Spans
[ https://issues.apache.org/jira/browse/LUCENE-1001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12544029 ] Doug Cutting commented on LUCENE-1001: -- > Would it be simpler to just use a SortedSet? TreeMap is slower than a PriorityQueue for this. With PriorityQueue, insertions and deletions do not allocate new objects. And, if some items are much more frequent than others, using adjustTop() instead of inserting and deleting makes merges run much faster, since most updates are then considerably faster than log(n). > Add Payload retrieval to Spans > -- > > Key: LUCENE-1001 > URL: https://issues.apache.org/jira/browse/LUCENE-1001 > Project: Lucene - Java > Issue Type: New Feature > Components: Search >Reporter: Grant Ingersoll >Assignee: Grant Ingersoll >Priority: Minor > > It will be nice to have access to payloads when doing SpanQuerys. > See http://www.gossamer-threads.com/lists/lucene/java-dev/52270 and > http://www.gossamer-threads.com/lists/lucene/java-dev/51134 > Current API, added to Spans.java is below. I will try to post a patch as > soon as I can figure out how to make it work for unordered spans (I believe I > have all the other cases working). > {noformat} > /** >* Returns the payload data for the current span. >* This is invalid until [EMAIL PROTECTED] #next()} is called for >* the first time. >* This method must not be called more than once after each call >* of [EMAIL PROTECTED] #next()}. However, payloads are loaded lazily, >* so if the payload data for the current position is not needed, >* this method may not be called at all for performance reasons. >* >* >* WARNING: The status of the Payloads feature is experimental. >* The APIs introduced here might change in the future and will not be >* supported anymore in such a case. 
>* >* @return a List of byte arrays containing the data of this payload >* @throws IOException >*/ > // TODO: Remove warning after API has been finalized > List/**/ getPayload() throws IOException; > /** >* Checks if a payload can be loaded at this position. >* >* Payloads can only be loaded once per call to >* [EMAIL PROTECTED] #next()}. >* >* >* WARNING: The status of the Payloads feature is experimental. >* The APIs introduced here might change in the future and will not be >* supported anymore in such a case. >* >* @return true if there is a payload available at this position that can > be loaded >*/ > // TODO: Remove warning after API has been finalized > public boolean isPayloadAvailable(); > {noformat} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
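Doug's adjustTop() point above can be shown with a tiny min-heap. This is an illustrative sketch, not Lucene's org.apache.lucene.util.PriorityQueue: when the element at the top merely advances to its next value, sifting it down in place is O(log n) worst case and often O(1), with no allocation, whereas a TreeMap (or a remove-then-insert on a library queue) pays full removal plus insertion and allocates nodes.

```java
// Minimal array-backed min-heap demonstrating the adjustTop() idea.
public class AdjustTopSketch {
    final int[] heap;
    int size;

    AdjustTopSketch(int capacity) { heap = new int[capacity]; }

    void insert(int v) {
        heap[size] = v;
        siftUp(size++);
    }

    int top() { return heap[0]; }

    // Restore heap order after the top element's value was changed in
    // place -- cheaper than removing and re-inserting it.
    void adjustTop() { siftDown(0); }

    private void siftUp(int i) {
        while (i > 0) {
            int p = (i - 1) / 2;
            if (heap[i] >= heap[p]) break;
            int t = heap[i]; heap[i] = heap[p]; heap[p] = t;
            i = p;
        }
    }

    private void siftDown(int i) {
        while (true) {
            int l = 2 * i + 1, r = l + 1, s = i;
            if (l < size && heap[l] < heap[s]) s = l;
            if (r < size && heap[r] < heap[s]) s = r;
            if (s == i) break;
            int t = heap[i]; heap[i] = heap[s]; heap[s] = t;
            i = s;
        }
    }

    public static void main(String[] args) {
        AdjustTopSketch q = new AdjustTopSketch(8);
        for (int v : new int[]{5, 3, 8}) q.insert(v);
        if (q.top() != 3) throw new IllegalStateException();
        q.heap[0] = 9;   // the top "iterator" advanced to its next value
        q.adjustTop();   // sift down in place instead of remove + insert
        if (q.top() != 5) throw new IllegalStateException();
        System.out.println("ok");
    }
}
```

In a merge, the top slot usually advances by a small step, so the sift often terminates after one or two comparisons; that is the "considerably faster than log(n)" behavior described above.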
[jira] Updated: (LUCENE-1063) Token re-use API breaks back compatibility in certain TokenStream chains
[ https://issues.apache.org/jira/browse/LUCENE-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-1063: --- Attachment: LUCENE-1063.patch Attached patch w/ unit test showing the issue, plus the fix. The fix was actually simpler than I thought: we don't have to make a new Token(); instead we just have to copy over the fields to the Token that was passed in. So the performance hit is less than I thought it'd be (copy instead of new/GC). I also strengthened the javadocs on the reuse & non-reuse APIs. All tests pass.
Re: Payload API
Yonik Seeley wrote: > > Per field similarity would certainly be more efficient since it moves > the field->similarity lookup from the inner loop to the outer loop. > I agree. Then I'll leave the scorePayload() API as is for now. And I don't think the per-field similarity should block 2.3, so let's work on that after the release, ok? -Michael - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Apache logs and data
karl wettin wrote: On Nov 15, 2007 10:09 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote: it is always good to have query logs I realize that it is not that politically correct, but the TPB collection is released to the public domain and contains 3.2 million user queries with session id, timestamp, category etc to go with the 150,000+500,000 documents. http://thepiratebay.org/tor/3783572 That's a good find! They use Lucene too! I don't see any legal issues to us writing code that parses these files. To be safest, I don't think we should republish the files, or even any of the queries, but I don't think we should need to. Folks can download them to their own machines and use them for testing there. It doesn't look as though there's click data, so we can't use this for relevance experiments without manually creating judgments. But for performance benchmarking it could be useful. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Payload API
On Nov 20, 2007 2:17 PM, Michael Busch <[EMAIL PROTECTED]> wrote: > Grant Ingersoll wrote: > > +1 for adding the field name. > > > > The question is whether we should add the field name to the > Similarity#scorePayload() method or if we should support a per-field > similarity in the future? Per field similarity would certainly be more efficient since it moves the field->similarity lookup from the inner loop to the outer loop. -Yonik - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Payload API
Grant Ingersoll wrote: > +1 for adding the field name. > > The question is whether we should add the field name to the Similarity#scorePayload() method or if we should support a per-field similarity in the future? -Michael - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Token re-use API breaks back compatibility in certain TokenStream chains
OK, thanks. I'll put mine in there too. Mike "Yonik Seeley" <[EMAIL PROTECTED]> wrote: > On Nov 20, 2007 1:49 PM, Michael McCandless <[EMAIL PROTECTED]> > wrote: > > > > Will do ... > > > > Mike > > > > "Yonik Seeley (JIRA)" <[EMAIL PROTECTED]> wrote: > > > Could we make this a little more concrete by creating a simple test case > > > that fails? > > FWIW, I recently added mine to TestAnalyzers to check for proper > payload copying. > > -Yonik > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Updated: (LUCENE-1061) Adding a factory to QueryParser to instantiate query instances
[ https://issues.apache.org/jira/browse/LUCENE-1061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Wang updated LUCENE-1061: -- Fix Version/s: 2.3 Lucene Fields: [New, Patch Available] (was: [Patch Available, New]) Affects Version/s: 2.3 > Adding a factory to QueryParser to instantiate query instances > -- > > Key: LUCENE-1061 > URL: https://issues.apache.org/jira/browse/LUCENE-1061 > Project: Lucene - Java > Issue Type: Improvement > Components: QueryParser >Affects Versions: 2.3 >Reporter: John Wang > Fix For: 2.3 > > Attachments: lucene_patch.txt > > > With the new efforts with Payload and scoring functions, it would be nice to > plug in custom query implementations while using the same QueryParser. > Included is a patch with some refactoring of the QueryParser to take a factory > that produces query instances. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: Token re-use API breaks back compatibility in certain TokenStream chains
On Nov 20, 2007 1:49 PM, Michael McCandless <[EMAIL PROTECTED]> wrote: > > Will do ... > > Mike > > "Yonik Seeley (JIRA)" <[EMAIL PROTECTED]> wrote: > > Could we make this a little more concrete by creating a simple test case > > that fails? FWIW, I recently added mine to TestAnalyzers to check for proper payload copying. -Yonik - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: [jira] Commented: (LUCENE-1063) Token re-use API breaks back compatibility in certain TokenStream chains
Will do ... Mike "Yonik Seeley (JIRA)" <[EMAIL PROTECTED]> wrote: > > [ > > https://issues.apache.org/jira/browse/LUCENE-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12544005 > ] > > Yonik Seeley commented on LUCENE-1063: > -- > > Could we make this a little more concrete by creating a simple test case > that fails? > > > > Token re-use API breaks back compatibility in certain TokenStream chains > > > > > > Key: LUCENE-1063 > > URL: https://issues.apache.org/jira/browse/LUCENE-1063 > > Project: Lucene - Java > > Issue Type: Bug > > Components: Analysis > >Affects Versions: 2.3 > >Reporter: Michael McCandless > >Assignee: Michael McCandless > > Fix For: 2.3 > > > > > > In scrutinizing the new Token re-use API during this thread: > > http://www.gossamer-threads.com/lists/lucene/java-dev/54708 > > I realized we now have a non-back-compatibility when mixing re-use and > > non-re-use TokenStreams. > > The new "reuse" next(Token) API actually allows two different aspects > > of re-use: > > 1) "Backwards re-use": the subsequent call to next(Token) is allowed > > to change all aspects of the provided Token, meaning the caller > > must do all persisting of Token that it needs before calling > > next(Token) again. > > 2) "Forwards re-use": the caller is allowed to modify the returned > > Token however it wants. Eg the LowerCaseFilter is allowed to > > downcase the characters in-place in the char[] termBuffer. > > The forwards re-use case can break backwards compatibility now. EG: > > if a TokenStream X providing only the "non-reuse" next() API is > > followed by a TokenFilter Y using the "reuse" next(Token) API to pull > > the tokens, then the default implementation in TokenStream.java for > > next(Token) will kick in. > > That default implementation just returns the provided "private copy" > > Token returned by next(). 
But, because of 2) above, this is not > > legal: if the TokenFilter Y modifies the char[] termBuffer (say), that > > is actually modifying the cached copy being potentially stored by X. > > I think the opposite case is handled correctly. > > A simple way to fix this is to make a full copy of the Token in the > > next(Token) call in TokenStream, just like we do in the next() method > > in TokenStream. The downside is this is a small performance hit. However > > that hit only happens at the boundary between a non-reuse and a re-use > > tokenizer. > > -- > This message is automatically generated by JIRA. > - > You can reply to this email to add a comment to the issue online. > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1063) Token re-use API breaks back compatibility in certain TokenStream chains
[ https://issues.apache.org/jira/browse/LUCENE-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12544005 ] Yonik Seeley commented on LUCENE-1063: -- Could we make this a little more concrete by creating a simple test case that fails?
[jira] Commented: (LUCENE-1063) Token re-use API breaks back compatibility in certain TokenStream chains
[ https://issues.apache.org/jira/browse/LUCENE-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12543991 ] Michael McCandless commented on LUCENE-1063: {quote} {code} // Filter F is calling TokenStream ts: F.next(Token result) { Token t = ts.next(result); t.setSomething(); return t; } {code} Problem as described: ts expects the token it returns to not be altered because it somehow intends to rely on its content when servicing the following call to next([]). In other words, it assumes that callers to next([]) would only consume, but not alter, the returned token. {quote} And, ts only defined "non-reuse" next(), thus it is the default implementation in TokenStream.next(Token) that is actually invoked, which in turn invokes ts.next() and directly returns the result. {quote} Seems that such an expectation by ts would be problematic no matter whether ts.next() or ts.next(Token) is used. I mean, even if we removed next(Token) but kept Token.termBuffer(), that char array could be modified, and some TokenStream implementation could still be broken because it assumes (following similar logic) that it can reuse its private copy of the char array... right? {quote} I don't think it's problematic for ts to expect this? This is the "contract" that you are supposed to follow for this API, spelled out in the javadocs. When you call "non-reuse" ts.next() you expect to get a private copy that you can hold onto indefinitely and it will never be modified, and, you accept that you must never modify this token yourself. Whereas when you call "reuse" ts.next(Token) you accept that you must fully consume the returned Token before you next call next(Token), and, that you are free to alter this token. I think that contract is well defined & consistent? {quote} TokenStream already does this, right? (or do you mean in the class TokenStream or in all implementations of TokenStream?) {quote} I'm talking about TokenStream's default implementation of next(Token).
It's not copying now, but it needs to in order to properly meet the contract of this API (ie, allow caller to modify the returned token). The default implementation of TokenStream.next() does already copy.
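The boundary fix described here can be sketched with stripped-down stand-ins (these are simplified illustrations, not the actual Lucene classes): the default next(Token) makes a full copy of whatever the old-style next() returns, so a downstream reuse-style filter may mutate the result without corrupting the producer's cached token.

```java
import java.util.Arrays;

// Simplified stand-ins for Lucene's Token/TokenStream, for illustration only.
class Token {
    char[] termBuffer;
    int termLength;
    Token(String term) { termBuffer = term.toCharArray(); termLength = termBuffer.length; }
    String term() { return new String(termBuffer, 0, termLength); }
}

abstract class TokenStream {
    // old-style "non-reuse" API: returns a token the producer may cache
    public abstract Token next();

    // default "reuse" API with the proposed fix: copy at the boundary so the
    // caller is free to modify the returned token (forwards re-use)
    public Token next(Token reusable) {
        Token t = next();
        if (t == null) return null;
        reusable.termBuffer = Arrays.copyOf(t.termBuffer, t.termLength);
        reusable.termLength = t.termLength;
        return reusable;
    }
}

// A producer that returns its cached token from next(), relying on the
// contract that callers of the non-reuse API never modify the result.
class CachingStream extends TokenStream {
    final Token cached = new Token("Hello");
    public Token next() { return cached; }
}

public class ReuseBoundary {
    public static void main(String[] args) {
        CachingStream ts = new CachingStream();
        Token t = ts.next(new Token("")); // goes through the copying default
        // a reuse-style filter lower-cases in place, as LowerCaseFilter does:
        for (int i = 0; i < t.termLength; i++)
            t.termBuffer[i] = Character.toLowerCase(t.termBuffer[i]);
        System.out.println(t.term() + " / " + ts.cached.term()); // hello / Hello
    }
}
```

Without the copy in next(Token), the last line would print "hello / hello": the producer's cached token would have been silently corrupted, which is exactly the back-compatibility break the issue describes.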
[jira] Commented: (LUCENE-1055) Remove GData from trunk
[ https://issues.apache.org/jira/browse/LUCENE-1055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12543982 ] Hoss Man commented on LUCENE-1055: -- contrib/gdata-server is recorded as deleted (so an "svn status" will show that subversion doesn't know anything about it) but if you've ever built gdata-server, then it contains an "ext-libs" directory which was not managed by subversion, so "svn update" won't delete it automatically. > Remove GData from trunk > > > Key: LUCENE-1055 > URL: https://issues.apache.org/jira/browse/LUCENE-1055 > Project: Lucene - Java > Issue Type: Task > Components: contrib/* >Reporter: Michael Busch >Assignee: Michael Busch >Priority: Minor > Fix For: 2.3 > > Attachments: lucene-1055.patch > > > GData doesn't seem to be maintained anymore. We're going to remove it before > we cut the 2.3 release unless there are negative votes. > In case someones jumps in in the future and starts to maintain it, we can > re-add it to the trunk. > If anyone is using GData and needs it to be in 2.3 please let us know soon! -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: Payload API
"Yonik Seeley" <[EMAIL PROTECTED]> wrote: > > If we used a Payload object, it would save 8 bytes per Token for > > fields not using payloads. Of course with Token reuse, saving 8 bytes isn't important any more either since it's only allocated once per field. -Yonik - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1055) Remove GData from trunk
[ https://issues.apache.org/jira/browse/LUCENE-1055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12543979 ] Michael Busch commented on LUCENE-1055: --- {quote} After svn update, contrib/gdata-server is still in my working copy. Is that intended, or is there still an svn delete to be done? {quote} Hmm that's strange. I tried svn up on a different checkout folder and contrib/gdata-server was successfully removed. Are you sure that you don't have any local changes in that folder that prevent it from being removed?
Re: Payload API
Michael McCandless wrote: > "Yonik Seeley" <[EMAIL PROTECTED]> wrote: >> On Nov 19, 2007 6:52 PM, Michael Busch <[EMAIL PROTECTED]> wrote: >>> Yonik Seeley wrote: So I think we all agree to do payloads by reference (do not make a copy of byte[] like termBuffer does), and to allow payload reuse. So now we still have 3 viable options on the table I think: Token{ byte[] payload, int payloadLength, ...} Token{ byte[] payload, int payloadOffset, int payloadLength,...} Token{ Payload p, ... } >>> I'm for option 2. I agree that it is worthwhile to allow filters to >>> modify the payloads. And I'd like to optimize for the case where lots >>> of tokens have payloads, and option 2 seems therefore the way to go. >> Just to play devil's advocate, it seems like adding the byte[] >> directly to Token gains less than we might have been thinking if we >> have reuse in any case. A TokenFilter could reuse the same Payload >> object for each term in a Field, so the CPU allocation savings is >> closer to a single Payload per field using payloads. >> >> If we used a Payload object, it would save 8 bytes per Token for >> fields not using payloads. >> Besides an initial allocation per field, the additional cost to using >> a Payload field would be an additional dereference (but that should be >> really minor). > > These are excellent points. I guess I would lean [back] towards > keeping the separate Payload object and extending its API to allow > re-use and modification of its byte[]? > +1 -Michael
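The option the thread is converging on — keep a separate Payload object, but hold its byte[] by reference and extend the API to allow reuse — could look something like this. This is a hypothetical sketch with illustrative names, not necessarily the API that shipped:

```java
// Hypothetical sketch of a reusable, by-reference Payload (names are
// illustrative; not a copy of Lucene's actual class).
class Payload {
    private byte[] data;
    private int offset;
    private int length;

    // By reference: no defensive copy, so a filter can reuse one Payload
    // (and even one byte[]) across all tokens of a field.
    public void setData(byte[] data, int offset, int length) {
        this.data = data;
        this.offset = offset;
        this.length = length;
    }
    public byte[] getData() { return data; }
    public int getOffset() { return offset; }
    public int getLength() { return length; }
}

public class PayloadReuse {
    public static void main(String[] args) {
        Payload p = new Payload();     // one allocation per field
        byte[] scratch = new byte[4];
        for (int token = 0; token < 3; token++) {
            scratch[0] = (byte) token; // overwrite in place per token
            p.setData(scratch, 0, 1);  // costs only a dereference, no copy
        }
        System.out.println(p.getData() == scratch); // same array: true
    }
}
```

This mirrors the cost argument above: tokens without payloads pay only one null reference slot, and a field using payloads pays roughly one Payload allocation plus a dereference per token.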
Re: Apache logs and data
: I think the safest path is simply to not publish any queries, but rather to, : e.g., permit committers to run experiments using them and publish the results : of the experiments. But no queries would be made available to the general : public on a website. That would eliminate the goal of having datasets (docs+queries+judgements) that anyone could download for testing whether a patch they want to propose alters the scores produced by Lucene (for better or for worse) -Hoss
[jira] Commented: (LUCENE-1063) Token re-use API breaks back compatibility in certain TokenStream chains
[ https://issues.apache.org/jira/browse/LUCENE-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12543939 ] Doron Cohen commented on LUCENE-1063: - In "code words": {code} // Filter F is calling TokenStream ts: F.next(Token result) { Token t = ts.next(result); t.setSomething(); return t; } {code} Problem as described: ts expects the token it returns to not be altered because it somehow intends to rely on its content when servicing the following call to next([]). In other words, it assumes that callers to next([]) would only consume, but not alter, the returned token. Seems that such an expectation by ts would be problematic no matter whether ts.next() or ts.next(Token) is used. I mean, even if we removed next(Token) but kept Token.termBuffer(), that char array could be modified, and some TokenStream implementation could still be broken because it assumes (following similar logic) that it can reuse its private copy of the char array... right? {quote} A simple way to fix this is to make a full copy of the Token in the next(Token) call in TokenStream, just like we do in the next() method in TokenStream. The downside is this is a small performance hit. However that hit only happens at the boundary between a non-reuse and a re-use tokenizer. {quote} TokenStream already does this, right? (or do you mean in the class TokenStream or in all implementations of TokenStream?)
[jira] Commented: (LUCENE-1001) Add Payload retrieval to Spans
[ https://issues.apache.org/jira/browse/LUCENE-1001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12543915 ] Grant Ingersoll commented on LUCENE-1001: - {quote} Off the top of my head: the priority queue is used to make sure that the Spans are processed by increasing doc numbers and increasing token positions; the first and the last Spans determine whether there is a match, and all other Spans (in the queue) are "in between". {quote} Would it be simpler to just use a SortedSet? Then we could iterate w/o losing the sort, right? Would this be faster since we wouldn't have to do the heap operations? > Add Payload retrieval to Spans > -- > > Key: LUCENE-1001 > URL: https://issues.apache.org/jira/browse/LUCENE-1001 > Project: Lucene - Java > Issue Type: New Feature > Components: Search >Reporter: Grant Ingersoll >Assignee: Grant Ingersoll >Priority: Minor > > It will be nice to have access to payloads when doing SpanQuerys. > See http://www.gossamer-threads.com/lists/lucene/java-dev/52270 and > http://www.gossamer-threads.com/lists/lucene/java-dev/51134 > Current API, added to Spans.java is below. I will try to post a patch as > soon as I can figure out how to make it work for unordered spans (I believe I > have all the other cases working). > {noformat} > /** >* Returns the payload data for the current span. >* This is invalid until [EMAIL PROTECTED] #next()} is called for >* the first time. >* This method must not be called more than once after each call >* of [EMAIL PROTECTED] #next()}. However, payloads are loaded lazily, >* so if the payload data for the current position is not needed, >* this method may not be called at all for performance reasons. >* >* >* WARNING: The status of the Payloads feature is experimental. >* The APIs introduced here might change in the future and will not be >* supported anymore in such a case. 
>* >* @return a List of byte arrays containing the data of this payload >* @throws IOException >*/ > // TODO: Remove warning after API has been finalized > List/**/ getPayload() throws IOException; > /** >* Checks if a payload can be loaded at this position. >* >* Payloads can only be loaded once per call to >* [EMAIL PROTECTED] #next()}. >* >* >* WARNING: The status of the Payloads feature is experimental. >* The APIs introduced here might change in the future and will not be >* supported anymore in such a case. >* >* @return true if there is a payload available at this position that can > be loaded >*/ > // TODO: Remove warning after API has been finalized > public boolean isPayloadAvailable(); > {noformat} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
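Pulling the two quoted methods together, a consumer loop under the proposed contract would look roughly like this. Spans is trimmed to the quoted methods and IOException is omitted to keep the sketch self-contained; DummySpans is an invented stand-in for a real span query:

```java
import java.util.Collections;
import java.util.List;

// Trimmed version of the Spans additions quoted above (real methods also
// throw IOException). DummySpans is a made-up stand-in for illustration.
interface Spans {
    boolean next();
    List/*<byte[]>*/ getPayload();
    boolean isPayloadAvailable();
}

class DummySpans implements Spans {
    private int pos = -1;
    private final byte[][] payloads = { {1, 2}, null, {3} };
    public boolean next() { return ++pos < payloads.length; }
    public boolean isPayloadAvailable() { return payloads[pos] != null; }
    public List getPayload() { return Collections.singletonList(payloads[pos]); }
}

public class SpanPayloadScan {
    static int countPayloads(Spans spans) {
        int n = 0;
        while (spans.next()) {
            if (spans.isPayloadAvailable()) { // payloads load lazily...
                spans.getPayload();           // ...and at most once per next()
                n++;
            }
        }
        return n;
    }
    public static void main(String[] args) {
        System.out.println(countPayloads(new DummySpans()));
    }
}
```

The key points of the contract are visible in the loop: getPayload() is only called after next(), at most once per position, and skipped entirely when the payload is not needed.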
[jira] Created: (LUCENE-1063) Token re-use API breaks back compatibility in certain TokenStream chains
Token re-use API breaks back compatibility in certain TokenStream chains Key: LUCENE-1063 URL: https://issues.apache.org/jira/browse/LUCENE-1063 Project: Lucene - Java Issue Type: Bug Components: Analysis Affects Versions: 2.3 Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 2.3 In scrutinizing the new Token re-use API during this thread: http://www.gossamer-threads.com/lists/lucene/java-dev/54708 I realized we now have a non-back-compatibility when mixing re-use and non-re-use TokenStreams. The new "reuse" next(Token) API actually allows two different aspects of re-use: 1) "Backwards re-use": the subsequent call to next(Token) is allowed to change all aspects of the provided Token, meaning the caller must do all persisting of Token that it needs before calling next(Token) again. 2) "Forwards re-use": the caller is allowed to modify the returned Token however it wants. Eg the LowerCaseFilter is allowed to downcase the characters in-place in the char[] termBuffer. The forwards re-use case can break backwards compatibility now. EG: if a TokenStream X providing only the "non-reuse" next() API is followed by a TokenFilter Y using the "reuse" next(Token) API to pull the tokens, then the default implementation in TokenStream.java for next(Token) will kick in. That default implementation just returns the provided "private copy" Token returned by next(). But, because of 2) above, this is not legal: if the TokenFilter Y modifies the char[] termBuffer (say), that is actually modifying the cached copy being potentially stored by X. I think the opposite case is handled correctly. A simple way to fix this is to make a full copy of the Token in the next(Token) call in TokenStream, just like we do in the next() method in TokenStream. The downside is this is a small performance hit. However that hit only happens at the boundary between a non-reuse and a re-use tokenizer. -- This message is automatically generated by JIRA. 
[jira] Commented: (LUCENE-1040) Can't quickly create StopFilter
[ https://issues.apache.org/jira/browse/LUCENE-1040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12543903 ] Yonik Seeley commented on LUCENE-1040: -- Indeed... thanks for catching that! > Can't quickly create StopFilter > --- > > Key: LUCENE-1040 > URL: https://issues.apache.org/jira/browse/LUCENE-1040 > Project: Lucene - Java > Issue Type: Bug >Reporter: Yonik Seeley >Assignee: Yonik Seeley > Attachments: CharArraySet.patch, CharArraySet.take2.patch > > > Due to the use of CharArraySet by StopFilter, one can no longer efficiently > pre-create a Set for use by future StopFilter instances. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Apache logs and data
This may be worth asking legal-discuss about. I am not sure if there is an issue or not. -Grant On Nov 20, 2007, at 4:54 AM, karl wettin wrote: On Nov 15, 2007 10:09 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote: it is always good to have query logs I realize that it is not that politically correct, but the TPB collection is released to the public domain and contains 3.2 million user queries with session id, timestamp, category etc to go with the 150,000+500,000 documents. http://thepiratebay.org/tor/3783572 -- karl - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Payload API
"Yonik Seeley" <[EMAIL PROTECTED]> wrote: > On Nov 19, 2007 6:52 PM, Michael Busch <[EMAIL PROTECTED]> wrote: > > Yonik Seeley wrote: > > > > > > So I think we all agree to do payloads by reference (do not make a > > > copy of byte[] like termBuffer does), and to allow payload reuse. > > > > > > So now we still have 3 viable options still on the table I think: > > > Token{ byte[] payload, int payloadLength, ...} > > > Token{ byte[] payload, int payloadOffset, int payloadLength,...} > > > Token{ Payload p, ... } > > > > > > > I'm for option 2. I agree that it is worthwhile to allow filters to > > modify the payloads. And I'd like to optimize for the case where lot's > > of tokens have payloads, and option 2 seems therefore the way to go. > > Just to play devil's advocate, it seems like adding the byte[] > directly to Token gains less than we might have been thinking if we > have reuse in any case. A TokenFilter could reuse the same Payload > object for each term in a Field, so the CPU allocation savings is > closer to a single Payload per field using payloads. > > If we used a Payload object, it would save 8 bytes per Token for > fields not using payloads. > Besides an initial allocation per field, the additional cost to using > a Payload field would be an additional dereference (but that should be > really minor). These are excellent points. I guess I would lean [back] towards keeping the separate Payload object and extending its API to allow re-use and modification of its byte[]? I'm now even wondering whether the char[] termBuffer should be by reference (again!), too? This would save 1 copy for those TokenStreams that could provide a reference to their own char[] buffers (eg CharTokenizer). Mike - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Reopened: (LUCENE-1052) Add an "termInfosIndexDivisor" to IndexReader
[ https://issues.apache.org/jira/browse/LUCENE-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless reopened LUCENE-1052: > Add an "termInfosIndexDivisor" to IndexReader > - > > Key: LUCENE-1052 > URL: https://issues.apache.org/jira/browse/LUCENE-1052 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Affects Versions: 2.2 >Reporter: Michael McCandless >Assignee: Michael McCandless >Priority: Minor > Fix For: 2.3 > > Attachments: LUCENE-1052.patch, termInfosConfigurer.patch > > > The termIndexInterval, set during indexing time, lets you trade off > how much RAM is used by a reader to load the indexed terms vs cost of > seeking to the specific term you want to load. > But the downside is you must set it at indexing time. > This issue adds an indexDivisor to TermInfosReader so that on opening > a reader you could further sub-sample the termIndexInterval to use > less RAM. EG a setting of 2 means every 2 * termIndexInterval is > loaded into RAM. > This is particularly useful if your index has a great many terms (eg > you accidentally indexed binary terms). > Spinoff from this thread: > http://www.gossamer-threads.com/lists/lucene/java-dev/54371 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (LUCENE-1052) Add an "termInfosIndexDivisor" to IndexReader
[ https://issues.apache.org/jira/browse/LUCENE-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12543854 ] Michael McCandless commented on LUCENE-1052: Thanks Chuck for such a wonderfully thorough patch & unit tests, and for adding the methods to ParallelReader, too (I had missed it the first time around)! The patch looks good. Should we use an abstract base class instead of interface for TermInfosConfigurer so we can add additional methods in the future without breaking back compatibility? Also I think we should mark this API as advanced, somewhat experimental and subject to change?
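As a back-of-envelope illustration of what the read-time divisor buys (the term counts below are assumed figures, not from the issue): the reader keeps one indexed term in RAM per termIndexInterval * indexDivisor terms, so a divisor of 4 cuts the in-memory term index to a quarter.

```java
public class IndexDivisorMath {
    // One in-RAM index entry per (termIndexInterval * indexDivisor) terms,
    // per the issue description ("a setting of 2 means every
    // 2 * termIndexInterval is loaded into RAM").
    static long indexTermsInRam(long totalTerms, int interval, int divisor) {
        return totalTerms / ((long) interval * divisor);
    }

    public static void main(String[] args) {
        long total = 100000000L; // assumed: 100M unique terms in the index
        System.out.println(indexTermsInRam(total, 128, 1)); // 781250
        System.out.println(indexTermsInRam(total, 128, 4)); // 195312
    }
}
```

128 is Lucene's default termIndexInterval; the point of the issue is that the divisor can be chosen when the reader is opened, without rewriting the index.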
[jira] Commented: (LUCENE-1040) Can't quickly create StopFilter
[ https://issues.apache.org/jira/browse/LUCENE-1040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12543847 ] Michael McCandless commented on LUCENE-1040: Yonik, I think you missed my proposed update to your original patch, here? https://issues.apache.org/jira/browse/LUCENE-1040#action_12539319 EG, there are some problems with the changes to rehash (and I added a unit-test to expose them).
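For context, the point of CharArraySet is to probe a token's termBuffer directly, with no String allocation per token; the bug here is only that such a set could not be pre-built once and shared across StopFilter instances. A minimal sketch of the underlying idea — an open-addressed set keyed on char[] ranges — might look like this (simplified, not Lucene's implementation):

```java
// Simplified illustration of the CharArraySet idea (not Lucene's class):
// build once up front, then probe with (buffer, offset, length) so a
// StopFilter never allocates a String per token.
class SimpleCharArraySet {
    private final char[][] slots;
    private final int mask;

    SimpleCharArraySet(int capacity) {
        int size = 16;
        while (size < capacity * 2) size <<= 1; // keep load factor <= 0.5
        slots = new char[size][];
        mask = size - 1;
    }

    void add(String word) { // sketch: no duplicate check, no resize
        char[] w = word.toCharArray();
        int i = hash(w, 0, w.length) & mask;
        while (slots[i] != null) i = (i + 1) & mask; // linear probing
        slots[i] = w;
    }

    boolean contains(char[] buf, int off, int len) {
        int i = hash(buf, off, len) & mask;
        while (slots[i] != null) {
            if (sameChars(slots[i], buf, off, len)) return true;
            i = (i + 1) & mask;
        }
        return false;
    }

    private static int hash(char[] buf, int off, int len) {
        int h = len;
        for (int i = off; i < off + len; i++) h = 31 * h + buf[i];
        return h;
    }

    private static boolean sameChars(char[] entry, char[] buf, int off, int len) {
        if (entry.length != len) return false;
        for (int i = 0; i < len; i++)
            if (entry[i] != buf[off + i]) return false;
        return true;
    }
}

public class StopSetDemo {
    public static void main(String[] args) {
        SimpleCharArraySet stops = new SimpleCharArraySet(3);
        stops.add("the");
        stops.add("of");
        char[] text = "theory".toCharArray();
        System.out.println(stops.contains(text, 0, 3)); // "the" -> true
        System.out.println(stops.contains(text, 0, 6)); // "theory" -> false
    }
}
```

A shared read-only instance of such a set is exactly what the issue wants future StopFilter instances to accept without rebuilding it each time.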
Re: Apache logs and data
On Nov 15, 2007 10:09 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote: > it is always good to have query logs I realize that it is not that politically correct, but the TPB collection is released to the public domain and contains 3.2 million user queries with session id, timestamp, category etc to go with the 150,000+500,000 documents. http://thepiratebay.org/tor/3783572 -- karl - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1055) Remove GData from trunk
[ https://issues.apache.org/jira/browse/LUCENE-1055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12543807 ] Paul Elschot commented on LUCENE-1055: -- After svn update, contrib/gdata-server is still in my working copy. Is that intended, or is there still an svn delete to be done?