[jira] Commented: (LUCENE-2091) Add BM25 Scoring to Lucene
[ https://issues.apache.org/jira/browse/LUCENE-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783558#action_12783558 ] Robert Muir commented on LUCENE-2091:

Yuval, BM25 has been working nicely for me too. On some collections it really helps, and I haven't yet found a case where it hurts (compared to Lucene's current scoring algorithm). Thanks in advance for working on this!

> Add BM25 Scoring to Lucene
> --
>
> Key: LUCENE-2091
> URL: https://issues.apache.org/jira/browse/LUCENE-2091
> Project: Lucene - Java
> Issue Type: New Feature
> Components: contrib/*
> Reporter: Yuval Feinstein
> Priority: Minor
> Fix For: 3.1
>
> Attachments: persianlucene.jpg
>
> Original Estimate: 48h
> Remaining Estimate: 48h
>
> http://nlp.uned.es/~jperezi/Lucene-BM25/ describes an implementation of
> Okapi-BM25 scoring in the Lucene framework, as an alternative to the
> standard Lucene scoring (which is a version of mixed boolean/TF-IDF).
> I have refactored this a bit, added unit tests and improved the runtime
> somewhat. I would like to contribute the code to Lucene under contrib.

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
- For additional commands, e-mail: java-dev-h...@lucene.apache.org
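For context, here is a minimal, self-contained sketch of the classic Okapi BM25 term weight that the linked implementation is based on. The class and method names and the k1/b defaults are illustrative only; this is not the API of the patch discussed in this issue.

```java
public class Bm25Sketch {
    static final double K1 = 1.2;   // term-frequency saturation
    static final double B = 0.75;   // document-length normalization

    // IDF component: ln((N - n + 0.5) / (n + 0.5)),
    // N = total docs in the collection, n = docs containing the term.
    static double idf(long numDocs, long docFreq) {
        return Math.log((numDocs - docFreq + 0.5) / (docFreq + 0.5));
    }

    // BM25 weight of one term in one document.
    static double weight(double tf, double docLen, double avgDocLen,
                         long numDocs, long docFreq) {
        double norm = K1 * (1 - B + B * docLen / avgDocLen);
        return idf(numDocs, docFreq) * (tf * (K1 + 1)) / (tf + norm);
    }

    public static void main(String[] args) {
        // A term occurring twice in an average-length doc, present in 10 of 1000 docs.
        System.out.println(weight(2, 100, 100, 1000, 10));
    }
}
```

Note the two levers discussed in BM25 tuning: k1 bounds how much repeated occurrences of a term can contribute (the tf component saturates), and b controls how strongly long documents are penalized.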
[jira] Updated: (LUCENE-2062) Bulgarian Analyzer
[ https://issues.apache.org/jira/browse/LUCENE-2062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-2062:

Attachment: LUCENE-2062.patch

Some improvements on the previous patch: mostly changing the test to work in a similar way to TestCzechStemmer, refining the stopword list, javadocs, etc. I think this one is ready. I'll commit in a few days if no one objects.

> Bulgarian Analyzer
> --
>
> Key: LUCENE-2062
> URL: https://issues.apache.org/jira/browse/LUCENE-2062
> Project: Lucene - Java
> Issue Type: New Feature
> Components: contrib/analyzers
> Reporter: Robert Muir
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2062.patch, LUCENE-2062.patch
>
> Someone asked about Bulgarian analysis on solr-user today:
> http://www.lucidimagination.com/search/document/e1e7a5636edb1db2/non_english_languages
> I was surprised we did not have anything.
> This analyzer implements the algorithm specified here:
> http://members.unine.ch/jacques.savoy/Papers/BUIR.pdf
> In the measurements there, this improves MAP by approximately 34%.
[jira] Commented: (LUCENE-2091) Add BM25 Scoring to Lucene
[ https://issues.apache.org/jira/browse/LUCENE-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783555#action_12783555 ] Yuval Feinstein commented on LUCENE-2091:

Otis and Robert, here is my (limited) experience with BM25: on a proprietary corpus (alas) I got a nice improvement, which was more pronounced in recall (hits that were previously not ranked near the top, and therefore remained unseen, now appear in the top results). I have worked on lowering the BM25 run time to a reasonable level, and I hope that once this gets into the hands of the Lucene community, BM25 performance will approach that of the current Lucene scoring. That is a tall order, as the latter has been refined over the last eight years or so. As for use cases: BM25 helps in mine, and I believe this may be true for others.
[jira] Issue Comment Edited: (LUCENE-2091) Add BM25 Scoring to Lucene
[ https://issues.apache.org/jira/browse/LUCENE-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783532#action_12783532 ] Robert Muir edited comment on LUCENE-2091 at 11/30/09 4:45 AM:

Otis, attached is a graph I produced from the Hamshahri corpus, comparing four different combinations:
* Lucene SimpleAnalyzer
* Lucene SimpleAnalyzer + BM25
* Lucene PersianAnalyzer
* Lucene PersianAnalyzer + BM25

The Hamshahri corpus contains a standardized encoding of Persian (i.e. the normalization filter is a no-op), so any analyzer gain is strictly due to "stopwords", although in Persian I wouldn't call some of these words. This was mostly to show that the analyzer is actually useful, i.e. the scoring system can't completely make up for a lack of support like this.

By the way, you can play around with the OpenRelevance SVN and duplicate my experiments on this same corpus yourself if you want. There's an Indonesian corpus there too. I've also tested Hindi with this impl.
[jira] Updated: (LUCENE-2091) Add BM25 Scoring to Lucene
[ https://issues.apache.org/jira/browse/LUCENE-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-2091:

Attachment: persianlucene.jpg
[jira] Issue Comment Edited: (LUCENE-2091) Add BM25 Scoring to Lucene
[ https://issues.apache.org/jira/browse/LUCENE-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783530#action_12783530 ] Otis Gospodnetic edited comment on LUCENE-2091 at 11/30/09 4:21 AM:

Has anyone compared this particular BM25 impl. to Lucene's current quasi-VSM approach in terms of:
* any of the relevance evaluation methods
* indexing performance
* search performance
* ...

Aha, I found something: http://markmail.org/message/c2r4v7zj7mduzs5d

Also, this issue is marked as contrib/*. Should this not go straight to core, so more people actually use it and provide feedback? Who knows, there is a chance (ha!) BM25 might turn out better than the current approach and become the default.
[jira] Updated: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-1458:

Attachment: LUCENE-1458_rotate.patch

FWIW, here is a patch to use the algorithm from the Unicode standard for comparing UTF-8 in UTF-16 sort order. They claim it is fast because there is no conditional branching... who knows.

> Further steps towards flexible indexing
> --
>
> Key: LUCENE-1458
> URL: https://issues.apache.org/jira/browse/LUCENE-1458
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Index
> Affects Versions: 2.9
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Priority: Minor
> Attachments: LUCENE-1458-back-compat.patch (x6), LUCENE-1458.patch (x13),
> LUCENE-1458.tar.bz2 (x7), LUCENE-1458_rotate.patch,
> LUCENE-1458_sortorder_bwcompat.patch, LUCENE-1458_termenum_bwcompat.patch,
> UnicodeTestCase.patch (x2)
>
> I attached a very rough checkpoint of my current patch, to get early
> feedback. All tests pass, though back-compat tests don't pass due to
> changes to package-private APIs plus certain bugs in tests that
> happened to work (e.g. calling TermPositions.nextPosition() too many times,
> which the new API asserts against).
> [Aside: I think, when we commit changes to package-private APIs such
> that back-compat tests don't pass, we could go back, make a branch on
> the back-compat tag, commit changes to the tests to use the new
> package-private APIs on that branch, then fix the nightly build to use the
> tip of that branch?]
>
> There's still plenty to do before this is committable! This is a
> rather large change:
> * Switches to a new, more efficient terms dict format. This still
> uses tii/tis files, but the tii only stores term & long offset
> (not a TermInfo). At seek points, tis encodes term & freq/prox
> offsets absolutely instead of with deltas. Also, tis/tii
> are structured by field, so we don't have to record the field number
> in every term.
> .
> On the first 1 M docs of Wikipedia, the tii file is 36% smaller (0.99 MB
> -> 0.64 MB) and the tis file is 9% smaller (75.5 MB -> 68.5 MB).
> .
> RAM usage when loading the terms dict index is significantly less,
> since we only load an array of offsets and an array of Strings (no
> more TermInfo array). It should be faster to init too.
> .
> This part is basically done.
> * Introduces a modular reader codec that strongly decouples the terms dict
> from the docs/positions readers. E.g. there is no more TermInfo used
> when reading the new format.
> .
> There's nice symmetry now between reading & writing in the codec
> chain -- the current docs/prox format is captured in:
> {code}
> FormatPostingsTermsDictWriter/Reader
> FormatPostingsDocsWriter/Reader (.frq file) and
> FormatPostingsPositionsWriter/Reader (.prx file).
> {code}
> This part is basically done.
> * Introduces a new "flex" API for iterating through the fields,
> terms, docs and positions:
> {code}
> FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
> {code}
> This replaces TermEnum/Docs/Positions. SegmentReader emulates the
> old API on top of the new API to keep back-compat.
>
> Next steps:
> * Plug in new codecs (pulsing, pfor) to exercise the modularity /
> fix any hidden assumptions.
> * Expose the new API out of IndexReader, deprecate the old API but emulate
> the old API on top of the new one, switch all core/contrib users to the
> new API.
> * Maybe switch to AttributeSources as the base class for TermsEnum,
> DocsEnum, PostingsEnum -- this would give readers API flexibility
> (not just index-file-format flexibility). E.g. if someone wanted
> to store payloads at the term-doc level instead of the
> term-doc-position level, you could just add a new attribute.
> * Test performance & iterate.
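For background on the sort-order problem the rotate patch above addresses: code point order (which equals UTF-8 binary order) and UTF-16 binary order disagree only for supplementary characters, because surrogate code units (U+D800..U+DFFF) sort below U+E000..U+FFFF in UTF-16, while the code points they encode sort above every BMP character. The classic fixup from the Unicode Standard is shown below for the UTF-16 side, as an illustration only; the attached patch handles the inverse direction (UTF-8 bytes compared in UTF-16 order) and does so branch-free.

```java
public class Utf16CodePointOrder {
    // Remap one UTF-16 code unit so plain int comparison yields code point order.
    static int fixUp(char c) {
        if (c >= 0xE000) return c - 0x800;   // E000..FFFF -> D800..F7FF
        if (c >= 0xD800) return c + 0x2000;  // surrogates -> F800..FFFF (now highest)
        return c;                            // 0000..D7FF unchanged
    }

    static int compareCodePointOrder(String a, String b) {
        int len = Math.min(a.length(), b.length());
        for (int i = 0; i < len; i++) {
            int d = fixUp(a.charAt(i)) - fixUp(b.charAt(i));
            if (d != 0) return d;
        }
        return a.length() - b.length();
    }

    public static void main(String[] args) {
        // U+FFFF sorts below U+10000 in code point order...
        System.out.println(compareCodePointOrder("\uFFFF", "\uD800\uDC00") < 0);
        // ...but above it in plain UTF-16 (String.compareTo) order.
        System.out.println("\uFFFF".compareTo("\uD800\uDC00") > 0);
    }
}
```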
Re: Socket and file locks
Hello,
I'm glad you appreciate it; I've added the wiki page here:
http://wiki.apache.org/lucene-java/AvailableLockFactories
I deliberately avoided copy-pasting the full javadocs of each implementation, as that would go out of date or be too specific to one version; I limited myself to a few words highlighting the differences, as a quick overview of what is available. Hope you like it; I'm open to suggestions.
Regards,
Sanne

2009/11/29 Michael McCandless :
> This looks great!
>
> Maybe it makes most sense to create a wiki page
> (http://wiki.apache.org/lucene-java) for interesting LockFactory
> implementations/tradeoffs, and add this there?
>
> Mike
>
> On Sat, Nov 28, 2009 at 9:26 AM, Sanne Grinovero wrote:
>> Hello,
>> Together with the Infinispan Directory we developed such a
>> LockFactory; I'd be more than happy if you wanted to add some pointers
>> to it in the Lucene documentation/readme.
>> This depends on Infinispan for multi-machine communication
>> (JGroups, indirectly), but it's not required to use an Infinispan
>> Directory; you could combine it with a Directory impl of your choice.
>> This was tested with the LockVerifyServer mentioned by Michael
>> McCandless, and also with some other tests inspired by it (in-VM for
>> lower-delay coordination and verification, while the LockFactory was
>> forced to use real network communication).
>>
>> While this is a technology preview and performance of the
>> Directory code is still unknown, I believe the LockFactory was the
>> most tested component.
>>
>> Free to download and inspect (LGPL):
>> http://anonsvn.jboss.org/repos/infinispan/trunk/lucene-directory/
>>
>> Regards,
>> Sanne
>>
>> 2009/11/27 Michael McCandless :
>>> I think a LockFactory for Lucene that implemented the ideas you &
>>> Marvin are discussing in LUCENE-1877, and/or the approach you
>>> implemented in the H2 DB, would be a useful addition to Lucene!
>>>
>>> For many apps, the simple LockFactory impls suffice, but for apps
>>> where multiple machines can become the writer, it gets hairy. Having
>>> an always-correct Lock impl for these apps would be great.
>>>
>>> Note that Lucene has some basic tools (in oal.store) for asserting
>>> that a LockFactory is correct (see LockVerifyServer), so it's a useful
>>> way to test that things are working from Lucene's standpoint.
>>>
>>> Mike
>>>
>>> On Fri, Nov 27, 2009 at 9:23 AM, Thomas Mueller wrote:
>>>> Hi,
>>>> I'm wondering if you are interested in automatically releasing the
>>>> write lock. See also my comments on
>>>> https://issues.apache.org/jira/browse/LUCENE-1877 - I thought it's a
>>>> problem worth solving, because it's also in the Lucene FAQ at
>>>> http://wiki.apache.org/lucene-java/LuceneFAQ#What_is_the_purpose_of_write.lock_file.2C_when_is_it_used.2C_and_by_which_classes.3F
>>>> Unfortunately there seems to be no solution that 'always works', but
>>>> delegating the task and responsibility to the application / to the
>>>> user is problematic as well. For example, a user of the H2 database
>>>> (which supports Lucene fulltext indexing) suggested automatically
>>>> removing the write.lock file whenever the file is there:
>>>> http://code.google.com/p/h2database/issues/detail?id=141 - sounds a
>>>> bit dangerous in my view. So, if you are interested in solving the
>>>> problem, then maybe I can help. If not, I will not bother you any
>>>> longer :-)
>>>> Regards,
>>>> Thomas
>>>>
>>>>> shouldn't active code like that live in the application layer?
>>>>
>>>> Why?
>>>>
>>>>> You can all but guarantee that polling will work at the app layer
>>>>
>>>> The application layer may also run with low priority. In operating
>>>> systems, it's usually the lower layers that have more 'rights'
>>>> (priority), not the higher levels (I'm not saying it should be like
>>>> that in Java). I just think the application layer should not have to
>>>> deal with write locks or with removing write locks.
>>>>> by the time the original process realizes that it doesn't hold the
>>>>> lock anymore, the damage could already have been done.
>>>>
>>>> Yes, and I'm not sure how best to avoid that (with any design).
>>>> Asking the application layer or the user whether the lock file can be
>>>> removed is probably more dangerous than trying our best in Lucene.
>>>> Standby / hibernate: the question is, if the process is currently not
>>>> running, does it still hold the lock? I think no, because the machine
>>>> might as well be turned off. And how do you detect whether the machine
>>>> is turned off versus in hibernate mode? I guess that's a problem for
>>>> all mechanisms (socket / file lock / background thread). When a
>>>> hibernated process wakes up again, it thinks it still owns the lock.
>>>> Even if the process checks before each write, it is unsafe:
>>>>
>>>> if (isStillLocked()) {
>>>>     write();
>>>> }
>>>>
>>>> The process could wake up after isStillLocked() but before write().
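The check-then-write race described above is a classic time-of-check/time-of-use problem, and one well-known mitigation is a "fencing token": the lock service hands out a strictly increasing token with every grant, and the storage layer rejects writes carrying a token older than one it has already seen. A writer that hibernates, loses its lock, and wakes up again then fails safely, because the next holder's newer token has fenced it off. This is a hypothetical sketch, not Lucene API; all names are made up for illustration.

```java
public class FencedStore {
    private long highestToken = 0;

    // Accept a write only if its token is at least the newest seen so far
    // (equal tokens allow the same holder to write more than once).
    public synchronized boolean write(long fencingToken, byte[] data) {
        if (fencingToken < highestToken) {
            return false; // stale holder: a newer lock owner already acted
        }
        highestToken = fencingToken;
        // ... perform the actual write here ...
        return true;
    }

    public static void main(String[] args) {
        FencedStore store = new FencedStore();
        System.out.println(store.write(1, new byte[0])); // true
        System.out.println(store.write(2, new byte[0])); // true: newer holder
        System.out.println(store.write(1, new byte[0])); // false: fenced off
    }
}
```

The key design point is that the safety check moves from the unreliable writer (which may be suspended at any instant) into the component that actually applies the write.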
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783499#action_12783499 ] Robert Muir commented on LUCENE-1458:

bq. It would not compare faster because in UTF-8 encoding, only 7 bits are used for encoding the chars

Yeah, you are right; I don't think it will be faster on average (I was just posing the question because I don't really know NRQ), and you will waste at least 4 bits by using only the first bit. I am just always trying to improve collation too; that's why I am bugging you. Hopefully soon we'll have byte[] and can do it properly, and speed up both.
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783496#action_12783496 ] Uwe Schindler commented on LUCENE-1458:

bq. because it compares from left to right, so even if the terms are 10x as long, if they differ 2x as quick its better?

It would not compare faster, because in UTF-8 encoding only 7 bits per byte are used for encoding the chars. The 8th bit is just a marker (simply spoken). Whether this marker is always 0 or always 1 makes no difference; in UTF-8 only 7 bits/byte carry data. And with UTF-8, in the 3rd byte even more bits are unused!

bq. I hear what you are saying about ASCII-only encoding, but if NRQ's model is always best, why do we have two separate "encode byte[] into char[]" models in lucene, one that NRQ is using, and one that collation is using!?

I do not know who made this IndexableBinaryStrings encoding, but it would not work for NRQ at all with current trunk (too complicated during indexing and decoding; for NRQ we also need to decode such char[] very fast to populate the FieldCache). But as discussed with Yonik (I do not know the issue number), the ASCII-only encoding should always perform better, though it needs more memory in trunk, as char[] is used during indexing -- I think that is why IndexableBinaryStrings was added. So the difference is not speed; it's memory consumption.
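To make the "ASCII-only" idea above concrete, here is a hedged sketch of packing a long into chars carrying 7 data bits each, so every char is a single byte in UTF-8 and plain lexicographic order equals numeric order. This mirrors the approach only; it is not Lucene's actual NumericUtils trie encoding, and the names are illustrative.

```java
public class SevenBitEncode {
    static String encodeLong(long v) {
        v ^= 0x8000000000000000L;      // flip the sign bit so negatives sort first
        char[] out = new char[10];     // ceil(64 / 7) = 10 chars
        for (int i = 9; i >= 0; i--) {
            out[i] = (char) (v & 0x7F);
            v >>>= 7;
        }
        return new String(out);
    }

    public static void main(String[] args) {
        // Lexicographic order of the encoded strings matches numeric order.
        System.out.println(encodeLong(-5).compareTo(encodeLong(3)) < 0);
    }
}
```

This is the memory trade-off Uwe mentions: each 7-bit char occupies a full 2-byte char in RAM during indexing, which is why an 8-bits-per-unit binary encoding would be denser.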
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783494#action_12783494 ] Michael McCandless commented on LUCENE-1458:

bq. The idea is to create an additional Attribute: BinaryTermAttribute that holds byte[]. If some tokenstream uses this attribute instead of TermAttribute, the indexer would choose to write the bytes directly to the index. NumericTokenStream could use this attribute and encode the numbers directly to byte[] with 8 bits/byte. The new AttributeSource API was created precisely for such customizations (not possible with Token).

This sounds like an interesting approach! We'd have to work out some details... e.g. you presumably can't mix char[] terms and byte[] terms in the same field.
> All tests pass, though back-compat tests don't pass due to changes to package-private APIs plus certain bugs in tests that happened to work (eg call TermPositions.nextPosition() too many times, which the new API asserts against).
> [Aside: I think, when we commit changes to package-private APIs such that back-compat tests don't pass, we could go back, make a branch on the back-compat tag, commit changes to the tests to use the new package-private APIs on that branch, then fix the nightly build to use the tip of that branch?]
> There's still plenty to do before this is committable! This is a rather large change:
> * Switches to a new, more efficient terms dict format. This still uses tii/tis files, but the tii only stores term & long offset (not a TermInfo). At seek points, tis encodes term & freq/prox offsets absolutely instead of with deltas. Also, tis/tii are structured by field, so we don't have to record the field number in every term.
> .
> On the first 1 M docs of Wikipedia, the tii file is 36% smaller (0.99 MB -> 0.64 MB) and the tis file is 9% smaller (75.5 MB -> 68.5 MB).
> .
> RAM usage when loading the terms dict index is significantly lower, since we only load an array of offsets and an array of String (no more TermInfo array). It should be faster to init too.
> .
> This part is basically done.
> * Introduces a modular reader codec that strongly decouples the terms dict from the docs/positions readers. EG there is no more TermInfo used when reading the new format.
> .
> There's nice symmetry now between reading & writing in the codec chain -- the current docs/prox format is captured in:
> {code}
> FormatPostingsTermsDictWriter/Reader
> FormatPostingsDocsWriter/Reader (.frq file) and
> FormatPostingsPositionsWriter/Reader (.prx file).
> {code}
> This part is basically done.
> * Introduces a new "flex" API for iterating through the fields, terms, docs and positions:
> {code}
> FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
> {code}
> This replaces TermEnum/Docs/Positions. SegmentReader emulates the old API on top of the new API to keep back-compat.
>
> Next steps:
> * Plug in new codecs (pulsing, pfor) to exercise the modularity / fix any hidden assumptions.
> * Expose the new API out of IndexReader, deprecate the old API but emulate it on top of the new one, and switch all core/contrib users to the new API.
> * Maybe switch to AttributeSource as the base class for TermsEnum, DocsEnum, PostingsEnum -- this would give readers API flexibility (not just index-file-format flexibility). EG if someone wanted to store a payload at the term-doc level instead of the term-doc-position level, you could just add a new attribute.
> * Test performance & iterate.
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
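The flex chain above can be pictured with a toy in-memory model. The classes below are illustrative stand-ins that only mirror the names and iteration shape of the proposed API (TermsEnum handing out a DocsEnum per term); they are not the patch's actual classes:

```java
import java.util.*;

// Illustrative stand-ins for the flex chain (not the real Lucene classes):
// each enumeration level hands out an enumerator for the level below it.
class DocsEnum {
    private final int[] docs; private int i = -1;
    DocsEnum(int[] docs) { this.docs = docs; }
    int nextDoc() { return ++i < docs.length ? docs[i] : -1; } // -1 = exhausted
}

class TermsEnum {
    private final Iterator<Map.Entry<String, int[]>> it;
    private Map.Entry<String, int[]> cur;
    TermsEnum(SortedMap<String, int[]> terms) { this.it = terms.entrySet().iterator(); }
    String next() { cur = it.hasNext() ? it.next() : null; return cur == null ? null : cur.getKey(); }
    DocsEnum docs() { return new DocsEnum(cur.getValue()); }
}

public class FlexSketch {
    static String render() {
        // term -> docIDs, standing in for an on-disk postings list
        SortedMap<String, int[]> terms = new TreeMap<>();
        terms.put("lucene", new int[]{1, 4, 7});
        terms.put("search", new int[]{2, 4});
        StringBuilder out = new StringBuilder();
        TermsEnum te = new TermsEnum(terms);
        for (String t = te.next(); t != null; t = te.next()) {
            out.append(t).append(" ->");
            DocsEnum de = te.docs();
            for (int d = de.nextDoc(); d != -1; d = de.nextDoc()) out.append(' ').append(d);
            out.append('\n');
        }
        return out.toString();
    }
    public static void main(String[] args) { System.out.print(render()); }
}
```

Because each level only exposes an enumerator for the next level down, a codec can swap out how one layer is stored (terms dict vs. postings) without the other layers noticing -- which is the decoupling the patch describes.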
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783493#action_12783493 ] Robert Muir commented on LUCENE-1458: -
bq. Why should they compare faster when encoded by IndexableBinaryStringTools?
Because the comparison runs from left to right, so even if the terms are 10x as long, if they differ twice as quickly that's a win? I hear what you are saying about ASCII-only encoding, but if NRQ's model is always best, why do we have two separate "encode byte[] into char[]" schemes in Lucene, one that NRQ uses and one that collation uses!?
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783492#action_12783492 ] Michael McCandless commented on LUCENE-1458:
bq. I changed the logic in the TermEnum in trunk and 3.0 (it no longer works recursively, see LUCENE-2087). We should change this here, too.
Mark has been periodically re-syncing changes down from trunk... we should probably just let this change come in through his process (else I think we cause more conflicts).
bq. The legacy NumericRangeTermEnum can be removed completely and the protected getEnum() should simply throw UOE. NRQ cannot be subclassed and nobody can call this method (maybe only classes in the same package, but that's not supported). So the enum with the nocommit mark can be removed.
Ahh excellent. Wanna commit that when you get a chance?
bq. Ideally NRQ would simply not use string terms at all and work directly on the byte[], which should then be ordered in binary order.
That'd be great!
bq. With "directly on byte[]" I meant that it could not use chars at all and directly encode the numbers into byte[] with the full 8 bits per byte. The resulting byte[] would never be UTF-8, but if the new TermRef API and the TokenStreams could handle this, it would be fine. Only the terms format would change.
Right, this is a change in analysis -> DocumentsWriter -- somehow we have to allow a Token to carry a byte[] that is directly indexed as the opaque term. At search time NRQ is all byte[] already (unlike other queries, which are new String()'ing for every term on the enum).
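The "full 8 bits per byte" encoding discussed above only works for NRQ if unsigned lexicographic byte order agrees with numeric order. For a long, flipping the sign bit before writing big-endian achieves that; this is a hedged sketch of the idea, not the encoding the patch actually uses:

```java
import java.util.Arrays;

public class LongBytes {
    // Encode a long as 8 big-endian bytes with the sign bit flipped, so that
    // unsigned lexicographic byte order agrees with numeric order
    // (negatives sort before positives).
    static byte[] encode(long v) {
        long x = v ^ Long.MIN_VALUE; // flip the sign bit
        byte[] b = new byte[8];
        for (int i = 7; i >= 0; i--) { b[i] = (byte) x; x >>>= 8; }
        return b;
    }

    // True if encoded byte order matches numeric order for each adjacent pair.
    static boolean ordered(long... vals) {
        for (int i = 1; i < vals.length; i++)
            if (Arrays.compareUnsigned(encode(vals[i - 1]), encode(vals[i])) >= 0)
                return false;
        return true;
    }

    public static void main(String[] args) {
        System.out.println(ordered(Long.MIN_VALUE, -42, -1, 0, 1, 42, Long.MAX_VALUE));
    }
}
```

With such an encoding the terms are fixed-width 8-byte keys that a byte[]-comparing TermsEnum can range-scan directly, with no char[] round trip at search time.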
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783491#action_12783491 ] Uwe Schindler commented on LUCENE-1458: ---
bq. Uwe you are right that the terms would be larger but they would have a more distinct alphabet (byte range) and might compare faster... I don't know which one is most important to NRQ really.
The new TermsEnum directly compares the byte[] arrays. Why should they compare faster when encoded by IndexableBinaryStringTools? Fewer bytes are faster to compare (it's essentially one CPU instruction per byte in an optimized native x86/x64 loop). It might be faster if we needed to decode to char[], but that's not the case (in the flex branch).
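The direct byte[] comparison Uwe describes comes down to a loop like the following (a minimal sketch of unsigned left-to-right comparison, not the actual flex-branch code; note the & 0xFF masking, needed because Java bytes are signed):

```java
public class ByteCompare {
    // Unsigned, left-to-right byte[] comparison: the first differing byte
    // decides, so fewer bytes (and earlier differences) mean fewer iterations.
    static int compare(byte[] a, byte[] b) {
        int len = Math.min(a.length, b.length);
        for (int i = 0; i < len; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF); // mask: Java bytes are signed
            if (diff != 0) return diff;
        }
        return a.length - b.length; // a proper prefix sorts first
    }

    public static void main(String[] args) {
        // 0x80 is negative as a signed Java byte; a signed comparison would
        // sort it before 0x7F, the unsigned comparison sorts it after.
        byte[] lo = {(byte) 0x7F, 0x01};
        byte[] hi = {(byte) 0x80, 0x00};
        System.out.println(compare(lo, hi) < 0);
    }
}
```

This is also why "less bytes are faster to compare" and "10x as long but differs 2x as quick" are both coherent positions: total length bounds the worst case, but the position of the first differing byte decides the actual cost.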
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783489#action_12783489 ] Robert Muir commented on LUCENE-1458: -
Uwe, you are right that the terms would be larger, but they would have a more distinct alphabet (byte range) and might compare faster... I don't know which one is most important to NRQ really.
Yeah, I agree that encoding directly to byte[] is the way to go though; this would be nice for collation too...
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783490#action_12783490 ] Uwe Schindler commented on LUCENE-1458: ---
As the codec is per field, we could also add an Attribute to TokenStream that holds the codec (the default is Standard). The indexer would just use the codec for the field from the TokenStream. NumericTokenStream would use a NumericCodec (just thinking...) - will go sleeping now.
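The per-field codec-selection idea could be pictured as a lookup keyed by an attribute the TokenStream advertises. Everything below is hypothetical (no such codec attribute or NumericCodec exists in the patch); it only illustrates the dispatch shape, with strings standing in for real codec output:

```java
import java.util.*;
import java.util.function.Function;

// Hypothetical sketch: the indexer picks a per-field term codec based on an
// attribute value the TokenStream declares. All names here are made up.
public class CodecPick {
    static final Map<String, Function<String, String>> CODECS = new HashMap<>();
    static {
        CODECS.put("Standard", t -> "utf8(" + t + ")");
        CODECS.put("Numeric",  t -> "bytes(" + Long.parseLong(t) + ")");
    }

    // Dispatch on the declared codec; fall back to Standard when the stream
    // declares nothing, matching the "default is Standard" idea.
    static String index(String codecAttr, String term) {
        return CODECS.getOrDefault(codecAttr, CODECS.get("Standard")).apply(term);
    }

    public static void main(String[] args) {
        System.out.println(index("Standard", "hello"));
        System.out.println(index("Numeric", "42"));
    }
}
```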
[jira] Issue Comment Edited: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783488#action_12783488 ] Uwe Schindler edited comment on LUCENE-1458 at 11/29/09 10:16 PM: --
bq. A partial solution for you which does work with tokenstreams: you could use IndexableBinaryString, which won't change between any unicode sort order... (it will not encode in any unicode range where there is a difference between UTF-8/UTF-32 and UTF-16). With this you could just compare bytes also, but you still would not have the "full 8 bits per byte".
This would not change anything, it would only make the format incompatible. With 7 bits/char, the currently UTF-8 coded index is the smallest possible one (even IndexableBinaryString would cost more bytes in the index: if you used 14 of the 16 bits/char, most chars would take 3 bytes in the index because of UTF-8, vs. 2 bytes with the current encoding. Only the char[]/String representation would take less space than currently. See the discussion with Yonik about this and why we have chosen 7 bits/char. Also en-/decoding is much faster).
For the TokenStreams: the idea is to create an additional Attribute, BinaryTermAttribute, that holds byte[]. If some tokenstream uses this attribute instead of TermAttribute, the indexer would choose to write the bytes directly to the index. NumericTokenStream could use this attribute and encode the numbers directly to byte[] with 8 bits/byte - the new AttributeSource API was created just because of such customizations (not possible with Token).
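The size argument can be checked with quick arithmetic: chars below 0x80 cost one byte in UTF-8, while chars in the 0x800-0xFFFF range cost three, so packing 7 bits per char beats packing 14. A small sketch of that arithmetic (illustrative only, not the IndexableBinaryStringTools code):

```java
public class EncodingSize {
    // UTF-8 cost of storing n payload bytes when packed bitsPerChar bits into
    // each char, assuming each produced char costs bytesPerChar in UTF-8.
    static int utf8Cost(int n, int bitsPerChar, int bytesPerChar) {
        int chars = (8 * n + bitsPerChar - 1) / bitsPerChar; // ceil(8n / bits)
        return chars * bytesPerChar;
    }

    public static void main(String[] args) {
        int n = 100; // payload bytes
        // 7 bits/char keeps every char below 0x80 -> 1 UTF-8 byte per char
        int sevenBit = utf8Cost(n, 7, 1);
        // 14 bits/char mostly lands in 0x800..0xFFFF -> 3 UTF-8 bytes per char
        int fourteenBit = utf8Cost(n, 14, 3);
        System.out.println(sevenBit + " vs " + fourteenBit);
    }
}
```

For 100 payload bytes the 7-bit packing costs 115 index bytes against 174 for 14-bit packing, which is the ~1.5x overhead Uwe is pointing at; only the in-memory char[]/String form would shrink with the wider packing.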
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783488#action_12783488 ] Uwe Schindler commented on LUCENE-1458: --- bq. A partial solution for you which does work with tokenstreams, you could use indexablebinarystring which won't change between any unicode sort order... (it will not encode in any unicode range where there is a difference between the UTF-8/UTF32 and UTF-16). With this you could just compare bytes also, but you still would not have the "full 8 bits per byte" This would not change anything, only would make the format incompatible. With 7bits/char the currently UTF-8 coded index is the smallest possible one (even IndexableBinaryString would cost more bytes in the index, because if you would use 14 of the 16 bits/char, most chars would take 3 bytes in index because of UTF-8 vs. 2 bytes with the current encoding. Only the char[]/String representation would take less space than currently. See the discussion with Yonik about this and why we have choosen 7 bits/char. Also en-/decoding is much faster). For the TokenStreams: The idea is to create an additional Attribute: BinaryTermAttribute that holds byte[]. If some tokenstream uses this attribute instead of TermAttribute, the indexer would choose to write the bytes directly to the index. NumericTokenStream could use this attribute and encode the numbers directly to byte[] with 8 bits/byte. 
> Further steps towards flexible indexing > --- > > Key: LUCENE-1458 > URL: https://issues.apache.org/jira/browse/LUCENE-1458 > Project: Lucene - Java > Issue Type: New Feature > Components: Index >Affects Versions: 2.9 >Reporter: Michael McCandless >Assignee: Michael McCandless >Priority: Minor > Attachments: LUCENE-1458-back-compat.patch, > LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, > LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, > LUCENE-1458-back-compat.patch, LUCENE-1458.patch, LUCENE-1458.patch, > LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, > LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, > LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.tar.bz2, > LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, > LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, > LUCENE-1458_sortorder_bwcompat.patch, LUCENE-1458_termenum_bwcompat.patch, > UnicodeTestCase.patch, UnicodeTestCase.patch > > > I attached a very rough checkpoint of my current patch, to get early > feedback. All tests pass, though back compat tests don't pass due to > changes to package-private APIs plus certain bugs in tests that > happened to work (eg call TermPositions.nextPosition() too many times, > which the new API asserts against). > [Aside: I think, when we commit changes to package-private APIs such > that back-compat tests don't pass, we could go back, make a branch on > the back-compat tag, commit changes to the tests to use the new > package private APIs on that branch, then fix nightly build to use the > tip of that branch?] > There's still plenty to do before this is committable! This is a > rather large change: > * Switches to a new more efficient terms dict format. This still > uses tii/tis files, but the tii only stores term & long offset > (not a TermInfo). At seek points, tis encodes term & freq/prox > offsets absolutely instead of with deltas. Also, tis/tii > are structured by field, so we don't have to record field number > in every term. > . > On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB > -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB). > . > RAM usage when loading terms dict index is significantly less > since we only load an array of offsets and an array of String (no > more TermInfo array). It should be faster to init too. > . > This part is basically done. > * Introduces modular reader codec that strongly decouples terms dict > from docs/positions readers. EG there is no more TermInfo used > when reading the new format. > . > There's nice symmetry now between reading & writing in the codec > chain -- the current docs/prox format is captured in: > {code} > FormatPostingsTermsDictWriter/Reader > FormatPostingsDocsWriter/Reader (.frq file) and > FormatPostingsPositionsWriter/Reader (.prx file). > {code} > This part is basically done. > * Introduces a new "flex" API for iterating through the fields, > terms, docs and positions: > {code} > FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum > {code} > This replaces TermEnum/Docs/Positions. SegmentReader emulates the > old API on top of the new API to keep back-compat. > > Next steps: > * Plug in new codecs (pulsing, pfor) to exercise the modularity / > fix any hidden assumptions. > * Expose new API out of IndexReader, deprecate old API but emulate > old API on top of new one, switch all core/contrib users to the > new API. > * Maybe switch to AttributeSources as the base class for TermsEnum, > DocsEnum, PostingsEnum -- this would give readers API flexibility > (not just index-file-format flexibility). EG if someone wanted > to store payload at the term-doc level instead of > term-doc-position level, you could just add a new attribute. > * Test performance & iterate.
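The FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum chain quoted above is, conceptually, a four-level nested enumeration: fields, then terms, then matching docs, then positions within each doc. A toy in-memory sketch of that shape (everything except the four quoted type names is invented; the real flex API was still in flux on this issue):

```java
import java.util.*;

public class FlexChainSketch {
    // field -> term -> (docID -> positions); a toy in-memory "index"
    static final Map<String, SortedMap<String, SortedMap<Integer, int[]>>> INDEX =
        new TreeMap<>();
    static {
        SortedMap<Integer, int[]> postings = new TreeMap<>();
        postings.put(0, new int[] {3, 17}); // term at positions 3 and 17 in doc 0
        postings.put(2, new int[] {5});
        SortedMap<String, SortedMap<Integer, int[]>> terms = new TreeMap<>();
        terms.put("lucene", postings);
        INDEX.put("body", terms);
    }

    public static List<String> walk() {
        List<String> out = new ArrayList<>();
        // FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum, as loops
        for (var field : INDEX.entrySet()) {                 // FieldProducer
            for (var term : field.getValue().entrySet()) {   // TermsEnum
                for (var doc : term.getValue().entrySet()) { // DocsEnum
                    for (int pos : doc.getValue()) {         // PostingsEnum
                        out.add(field.getKey() + "/" + term.getKey()
                                + "/doc" + doc.getKey() + "/pos" + pos);
                    }
                }
            }
        }
        return out;
    }

    public static void main(String[] args) { walk().forEach(System.out::println); }
}
```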
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783485#action_12783485 ] Robert Muir commented on LUCENE-1458: - bq. With directly on bytes[] I meant that it could not use chars at all and directly encode the numbers into byte[] with the full 8 bits per byte. The resulting byte[] would be never UTF-8, but if the new TermRef API would be able to handle this and also the TokenStreams, it would be fine. Only the terms format would change. Uwe, it looks like you can do this now (with the exception of tokenstreams). As a partial solution that does work with tokenstreams, you could use IndexableBinaryString, which won't change under any unicode sort order... (it will not encode into any unicode range where there is a difference between UTF-8/UTF-32 and UTF-16). With this you could just compare bytes as well, but you still would not have the "full 8 bits per byte".
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783482#action_12783482 ] Uwe Schindler commented on LUCENE-1458: --- Robert: I know; because of that I said it works with the UTF-8/UTF-16 comparator. It would *not* work with a reverse comparator such as Mike uses in the test. By "directly on byte[]" I meant that it could not use chars at all and would directly encode the numbers into byte[] with the full 8 bits per byte. The resulting byte[] would never be UTF-8, but if the new TermRef API were able to handle this, and also the TokenStreams, it would be fine. Only the terms format would change.
[jira] Issue Comment Edited: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783481#action_12783481 ] Robert Muir edited comment on LUCENE-1458 at 11/29/09 9:33 PM: --- bq. Ideally NRQ would simply not use string terms at all and work directly on the byte[], which should then be ordered in binary order. but isn't this what it does already with the TermsEnum api? the TermRef itself is just byte[], and NRQ precomputes all the TermRefs it needs up front; there is no unicode conversion there. edit: btw Uwe, the comparator is essentially just comparing bytes; the 0xee/0xef "shifting" should never take place with NRQ because those bytes will never be in a numeric field... was (Author: rcmuir): bq. Ideally NRQ would simply not use string terms at all and work directly on the byte[], which should then be ordered in binary order. but isn't this what it does already with the TermsEnum api? the TermRef itself is just byte[], and NRQ precomputes all the TermRef's it needs up front, there is no unicode conversion there.
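The sort-order question running through these comments comes down to this: unsigned comparison of UTF-8 bytes yields Unicode code point order, while Java's String.compareTo yields UTF-16 code unit order, and the two disagree once supplementary characters are involved. A hypothetical sketch of the byte[]-level comparison a TermRef-style API needs (ByteTermCompare is not a Lucene class):

```java
import java.nio.charset.StandardCharsets;

public class ByteTermCompare {
    // Unsigned lexicographic byte[] comparison.
    // (Java bytes are signed, hence the & 0xFF.)
    static int compare(byte[] a, byte[] b) {
        int len = Math.min(a.length, b.length);
        for (int i = 0; i < len; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) return diff;
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        // A BMP char vs. a supplementary char: byte order and UTF-16 order
        // disagree, because UTF-16 surrogates (0xD800..0xDFFF) sort below
        // the top of the BMP.
        String bmp = "\uFFFD";                                // U+FFFD
        String supp = new String(Character.toChars(0x10400)); // U+10400

        byte[] bmpUtf8 = bmp.getBytes(StandardCharsets.UTF_8);   // EF BF BD
        byte[] suppUtf8 = supp.getBytes(StandardCharsets.UTF_8); // F0 90 90 80

        System.out.println(compare(bmpUtf8, suppUtf8) < 0); // true: code point order
        System.out.println(bmp.compareTo(supp) < 0);        // false: UTF-16 order
    }
}
```

For a numeric field whose precomputed terms never contain the affected high bytes, the two orders coincide, which is the point Robert makes about NRQ.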
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783481#action_12783481 ] Robert Muir commented on LUCENE-1458: - bq. Ideally NRQ would simply not use string terms at all and work directly on the byte[], which should then be ordered in binary order. but isn't this what it does already with the TermsEnum api? the TermRef itself is just byte[], and NRQ precomputes all the TermRef's it needs up front, there is no unicode conversion there.
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783475#action_12783475 ] Uwe Schindler commented on LUCENE-1458: --- Hi Mike, I looked into your commit; looks good. You are right with your comment in NRQ: it will only work with UTF-8 or UTF-16. Ideally NRQ would simply not use string terms at all and would work directly on the byte[], which should then be ordered in binary order. Two things: - The legacy NumericRangeTermEnum can be removed completely and the protected getEnum() should simply throw UOE. NRQ cannot be subclassed and nobody can call this method (maybe only classes in the same package, but that's not supported). So the enum with the nocommit mark can be removed. - I changed the logic in the TermEnum in trunk and 3.0 (it no longer works recursively, see LUCENE-2087). We should change this here, too. This also makes the enum simpler (and it looks more like the Automaton one). In trunk and 3.0 the methods setEnum() and endEnum() both now throw UOE. I will look into these two changes tomorrow and change the code.
Uwe
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783471#action_12783471 ] Michael McCandless commented on LUCENE-1458: OK I finally worked out a solution for the UTF16 sort order problem (just committed). I added a TermRef.Comparator class, for comparing TermRefs, and I removed TermRef.compareTo, and fixed all low-level places in Lucene that rely on sort order of terms to use this new API instead. I changed the Terms/TermsEnum/TermsConsumer API, adding a getTermComparator(), ie, the codec now determines the sort order for terms in each field. For the core codecs (standard, pulsing, intblock) I default to UTF16 sort order, for back compat, but you could easily instantiate it yourself and use a different term sort. I changed TestExternalCodecs to test this new capability, by sorting 2 of its fields in reversed unicode code point order. While this means your codec is now completely free to define the term sort order per field, in general Lucene queries will not behave right if you do this, so it's obviously a very advanced use case. I also changed (yet again!) how DocumentsWriter encodes the terms bytes, to record the length (in bytes) of the term, up front, followed by the term bytes (vs the trailing 0xff that I had switched to). The length is a 1 or 2 byte vInt, ie if it's < 128 it's 1 byte, else 2 bytes. This approach means the TermRef.Collector doesn't have to deal with 0xff's (which was messy). I think this also means that, to the flex API, a term is actually opaque -- it's just a series of bytes. It need not be UTF8 bytes. However, all of analysis, and then how TermsHash builds up these byte[]s, and what queries do with these bytes, is clearly still very much Unicode/UTF8. But one could, in theory (I haven't tested this!) 
separately use the flex API to build up a segment whose terms are arbitrary byte[]'s, eg maybe you want to use 4 bytes to encode int values, and then interact with those terms at search time using the flex API.
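The length-prefixed term encoding Mike describes (a 1- or 2-byte vInt length up front, followed by the term bytes) can be sketched as below. The writeTerm/readTerm helpers are hypothetical illustrations, not the DocumentsWriter code; the vInt here follows the usual low-order-first, high-bit-is-continuation scheme, so lengths < 128 cost one byte and lengths up to 16383 cost two.

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

public class TermBytesDemo {
    // Write a term as <vInt length><bytes>: 7 payload bits per vInt byte,
    // low-order first, high bit set on all but the last byte.
    static void writeTerm(DataOutputStream out, byte[] term) throws IOException {
        int len = term.length;
        while ((len & ~0x7F) != 0) {
            out.writeByte((len & 0x7F) | 0x80);
            len >>>= 7;
        }
        out.writeByte(len);
        out.write(term);
    }

    static byte[] readTerm(DataInputStream in) throws IOException {
        int b = in.readUnsignedByte();
        int len = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = in.readUnsignedByte();
            len |= (b & 0x7F) << shift;
        }
        byte[] term = new byte[len];
        in.readFully(term);
        return term;
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        writeTerm(out, "hello".getBytes(StandardCharsets.UTF_8)); // 1-byte length
        writeTerm(out, new byte[200]);                            // 2-byte length

        DataInputStream in = new DataInputStream(
            new ByteArrayInputStream(buf.toByteArray()));
        System.out.println(new String(readTerm(in), StandardCharsets.UTF_8)); // hello
        System.out.println(readTerm(in).length); // 200
        System.out.println(buf.size());          // 208 = (1+5) + (2+200)
    }
}
```

Because the length comes first, a reader never has to scan for a sentinel byte such as 0xFF, which is why the term bytes can be fully opaque.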
[jira] Commented: (LUCENE-2037) Allow Junit4 tests in our environment.
[ https://issues.apache.org/jira/browse/LUCENE-2037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783442#action_12783442 ] Erick Erickson commented on LUCENE-2037: Darn it! I'll get the comments right sometime and not have to retype them after making an attachment. Anyway, this patch allows us to use Junit4 constructs as well as Junit3 constructs. It includes a sibling class to LuceneTestCase called LuceneTestCaseJ4 that provides the functionality we used to get from LuceneTestCase. When creating Junit4-style tests, preferentially import from org.junit rather than from junit.framework. junit-3.8.2.jar may (should?) be removed from the distro; all tests run just fine under junit-4.7.jar, which is attached to this issue. I wrote a little script that compares the results of running the tests, and we run exactly the same number of TestSuites and each runs exactly the same number of tests, so I'm pretty confident about this one. I may be wrong, but I'm not uncertain. Single data points aren't worth much, but on my Macbook Pro, running under Junit4 took about a minute longer than Junit3 (about 23 1/2 minutes), which could have been the result of my Time Machine running, for all I know. All the tests in test...search.function have been converted to use LuceneTestCaseJ4 as an exemplar. I've deprecated LuceneTestCase to prompt people. When you derive from LuceneTestCaseJ4, you *must* use the @Before, @After and @Test annotations to get the functionality you expect, as must *all* subclasses. So one gotcha people will surely run across is deriving from J4 and failing to add @Test. Converting all the tests was my way of working through the derivation issues. I don't particularly see the value in doing a massive conversion just for the heck of it, unless someone has a real urge. More along the lines of "I'm in this test anyway, let's upgrade it and add new ones". What about new tests?
Should we encourage new patches to use Junit4 rather than Junit3? If so, how? I've noticed the convention of putting underscores in front of some tests to keep them from running. The Junit4 convention is the @Ignore annotation, which will cause the @Ignored tests to be reported (something like 1300 successful, 0 failures, 23 ignored), which is a nice way to keep these from getting lost in the shuffle. When this gets applied, I can put up the patch for LocalizedTestCase and we can give that a whirl > Allow Junit4 tests in our environment. > -- > > Key: LUCENE-2037 > URL: https://issues.apache.org/jira/browse/LUCENE-2037 > Project: Lucene - Java > Issue Type: Improvement > Components: Other >Affects Versions: 3.1 > Environment: Development >Reporter: Erick Erickson >Assignee: Erick Erickson >Priority: Minor > Fix For: 3.1 > > Attachments: junit-4.7.jar, LUCENE-2037.patch, LUCENE-2037.patch > > Original Estimate: 8h > Remaining Estimate: 8h > > Now that we're dropping Java 1.4 compatibility for 3.0, we can incorporate > Junit4 in testing. Junit3 and junit4 tests can coexist, so no tests should > have to be rewritten. We should start this for the 3.1 release so we can get > a clean 3.0 out smoothly. > It's probably worthwhile to convert a small set of tests as an exemplar. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
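The annotation requirement described above can be sketched with a toy example (this is illustrative code, not the actual LuceneTestCaseJ4 or JUnit internals — the nested @interface is a stand-in for org.junit.Test): a JUnit4-style runner discovers test methods by annotation, not by the JUnit3 "testXxx" naming convention, which is exactly why forgetting @Test silently skips a test.

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

// Toy illustration of why @Test matters: a JUnit4-style runner discovers
// methods by annotation, not by the JUnit3 "testXxx" naming convention.
public class AnnotationDiscovery {
    @Retention(RetentionPolicy.RUNTIME)
    @interface Test {}  // stand-in for org.junit.Test

    static class MyTests {
        @Test public void testAnnotated() {}      // found: carries @Test
        public void testNamedButNotAnnotated() {} // silently skipped by an annotation-based runner
    }

    static List<String> discover(Class<?> clazz) {
        List<String> found = new ArrayList<>();
        for (Method m : clazz.getMethods()) {
            if (m.isAnnotationPresent(Test.class)) {
                found.add(m.getName());
            }
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(discover(MyTests.class)); // [testAnnotated]
    }
}
```

The same mechanism explains the @Ignore point: an annotation-aware runner can report ignored methods instead of losing them, which the underscore-prefix convention cannot do.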
[jira] Updated: (LUCENE-2037) Allow Junit4 tests in our environment.
[ https://issues.apache.org/jira/browse/LUCENE-2037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Erick Erickson updated LUCENE-2037: --- Attachment: LUCENE-2037.patch See JIRA comments
[jira] Commented: (LUCENE-2096) Investigate parallelizing Ant junit tests
[ https://issues.apache.org/jira/browse/LUCENE-2096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783436#action_12783436 ] Erick Erickson commented on LUCENE-2096: Parallelizing tests is proving trickier than I'd hoped. Part of the problem is my not-wonderful Ant skills... But what I've found so far with trying to use ForEach is that stuff gets in the way. In particular, the tag in the test-macro body, I'm pretty sure, defeats any parallelizing attempts by ForEach, and taking it out isn't straightforward. In some of my experiments, I got tests to fire off in parallel, but then started running into wonky errors so strange that now I can't remember them, some having to do with what looked like file contention for some temporary test files. Googling around, I think I remember posts by Jason Ruthgren trying to do something similar in SOLR (?). Jason: if I'm remembering right, did you find any joy? Then we'd have to rework how success and failure are handled, because there's contention for that file as well. Now I'm wondering if the "scary python script" gets us more bang for the buck. I wrote a Groovy script that is probably a near-cousin, for experiments, and I'm wondering what would happen if we wrote a special testcase-type target that did NOT depend upon compile-test or, really, much of anything else, and counted on the user to build the system first before using whatever script we came up with. We don't really lose functionality by recursively looking for Test*.java files, because that's what's done internally in the build files anyway; doing that outside or inside the Ant files doesn't seem like a loss. I'm putting this in the JIRA issue to preserve it for posterity. Meanwhile, I'll appeal to Ant gurus if they want to try whacking the Ant build files, and see what the script notion brings...
> Investigate parallelizing Ant junit tests > - > > Key: LUCENE-2096 > URL: https://issues.apache.org/jira/browse/LUCENE-2096 > Project: Lucene - Java > Issue Type: Improvement > Components: Build >Reporter: Erick Erickson >Assignee: Erick Erickson >Priority: Minor > > Ant Contrib has a "ForEach" construct that may speed up running all of the > Junit tests by parallelizing them with a configurable number of threads. I > envision this in several stages. First, see if ForEach works for us with > hard-coded lists, distribute this for testing then make the changes "for > real". I intend to hard-code the list for the first pass, ordered by the time > they take. This won't do for check-in, but will give us a fast > proof-of-concept. > This approach will be most useful for multi-core machines. > In particular, we need to see whether the parallel tasks are isolated enough > from each other to prevent mutual interference. > All this assumes the fragmentary reference I found is still available... -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0
[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783410#action_12783410 ] Robert Muir commented on LUCENE-2094: - bq. This is one thing I thought about too - I did not change it to keep the noise as low as possible in the patch but if we want to do it we can do it in this patch too. well, I think it will be noisy either way (updating all the analyzers, etc), but it will make things a lot more consistent and easier to maintain... if you do this then StopFilter takes a Version, so it can be modified / bugfixed in the future in other ways too, with less noise. I also think it will make it easier to write an analyzer, because even completely ignoring the unicode issue, with the current codebase:
{code}
streams.source = new StandardTokenizer(matchVersion, reader);
streams.result = new StandardFilter(streams.source);
streams.result = new LowerCaseFilter(matchVersion, streams.result);
streams.result = new StopFilter(matchVersion, streams.result, stoptable);
...
{code}
reads a lot easier to me than
{code}
streams.source = new StandardTokenizer(matchVersion, reader);
streams.result = new StandardFilter(streams.source);
streams.result = new LowerCaseFilter(matchVersion, streams.result);
streams.result = new StopFilter(StopFilter.getEnablePositionIncrementsVersionDefault(matchVersion), streams.result, stoptable);
...
{code}
> Prepare CharArraySet for Unicode 4.0 > > > Key: LUCENE-2094 > URL: https://issues.apache.org/jira/browse/LUCENE-2094 > Project: Lucene - Java > Issue Type: Bug > Components: Analysis >Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, > 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1 >Reporter: Simon Willnauer > Fix For: 3.1 > > Attachments: LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, > LUCENE-2094.txt > > > CharArraySet does lowercasing if created with the corresponding flag. This > means that String / char[] entries with Unicode 4 chars which are in the set cannot > be retrieved in "ignorecase" mode. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0
[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783409#action_12783409 ] Uwe Schindler commented on LUCENE-2094: --- +1 for pushing Version down to StopFilter (it is there already, but hidden in this getDefault() method!). Its presence was justified by the Lucene 2.9/3.0 migration. Now it should just take a matchVersion and no more setters inside StopFilter. The noise is the same, as all analyzers using StopFilter then need the version arg / need to be changed anyhow.
[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0
[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783406#action_12783406 ] Simon Willnauer commented on LUCENE-2094: - bq. I guess i think this getEnablePositionIncrementsVersionDefault should be deprecated along with the ctors that take this boolean argument, and it should all be driven off a single Version argument for simplicity This is one thing I thought about too - I did not change it, to keep the noise as low as possible in the patch, but if we want to do it we can do it in this patch too. The question of whether we want to drop bw. compat and simply update CharArraySet to Unicode 4.0 seems more important. But IMO if we push Version to StopFilter, we can also make CharArraySet use Version. Thoughts?
[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0
[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783402#action_12783402 ] Michael McCandless commented on LUCENE-2094: bq. I guess i think this getEnablePositionIncrementsVersionDefault should be deprecated along with the ctors that take this boolean argument, and it should all be driven off a single Version argument for simplicity OK, I agree, let's also push Version down into StopFilter (to get the posIncr setting).
[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0
[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783399#action_12783399 ] Robert Muir commented on LUCENE-2094: - Uwe, yeah, that is what I was thinking. I guess I think an alternate ctor that allows explicit control of this with a boolean is ok, but if you want the "defaults" it should just be with Version. This really doesn't have a lot to do with Simon's patch, but it becomes noticeable now.
[jira] Issue Comment Edited: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0
[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783396#action_12783396 ] Uwe Schindler edited comment on LUCENE-2094 at 11/29/09 12:56 PM: -- Mike didn't want to add matchVersion to StopFilter at this time, but when we change this, we should remove this static method, or deprecate it and not use it anymore in the code. Instead, use only matchVersion everywhere and eliminate the enablePosIncr setting altogether. was (Author: thetaphi): Mike didn't want to add matchVersion to StopFilter at this time, but when we change this, we should remove this static method or deprecate it and not use it anymore in the code. Instead use the matchVersion everywhere.
[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0
[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783396#action_12783396 ] Uwe Schindler commented on LUCENE-2094: --- Mike didn't want to add matchVersion to StopFilter at this time, but when we change this, we should remove this static method, or deprecate it and not use it anymore in the code. Instead use the matchVersion everywhere.
[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0
[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783395#action_12783395 ] Robert Muir commented on LUCENE-2094: - Hi Simon, One thing I noticed is that with this patch we get:
{code}
public StopFilter(Version matchVersion, boolean enablePositionIncrements, TokenStream input, Set stopWords, boolean ignoreCase)
{code}
I know this is really not related to what you are doing here, but I wonder if instead StopFilter should look like this:
{code}
public StopFilter(Version matchVersion, TokenStream input, Set stopWords, boolean ignoreCase)
{code}
and use matchVersion to determine enablePositionIncrements. I think it's already weird how you create a StopFilter: you have to pass Version to a static method, getEnablePositionIncrementsVersionDefault. I don't think the user should have to pass Version twice:
{code}
new StopFilter(Version.WHATEVER, StopFilter.getEnablePositionIncrementsVersionDefault(Version.WHATEVER), ...)
{code}
I guess I think this getEnablePositionIncrementsVersionDefault should be deprecated along with the ctors that take this boolean argument, and it should all be driven off a single Version argument for simplicity.
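The proposal can be sketched in a few lines (a minimal sketch with a simplified stand-in enum, not Lucene's actual Version class; the 2.9 cutover for position increments is an assumption drawn from the existing static helper's behavior): with a single matchVersion argument, the filter derives the enablePositionIncrements default itself instead of making every caller pass Version twice.

```java
// Simplified sketch of a Version-driven default. The enum and method names
// are stand-ins; Lucene's real Version and StopFilter differ.
public class VersionDrivenDefault {
    enum Version { LUCENE_24, LUCENE_29, LUCENE_30, LUCENE_31 }

    // The ctor would call this internally, so callers pass matchVersion once
    // and never touch the boolean or a static helper.
    static boolean enablePositionIncrements(Version matchVersion) {
        // Assumed cutover: position increments on by default from 2.9 onward.
        return matchVersion.compareTo(Version.LUCENE_29) >= 0;
    }

    public static void main(String[] args) {
        System.out.println(enablePositionIncrements(Version.LUCENE_24)); // false
        System.out.println(enablePositionIncrements(Version.LUCENE_31)); // true
    }
}
```

The design point is that each version-dependent behavior becomes an internal detail keyed off one argument, so future bugfixes can be gated the same way without new ctor overloads.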
[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0
[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783394#action_12783394 ] Simon Willnauer commented on LUCENE-2094: - bq. If the LowerCaseFilter is applied before the stopwords, there is no need for ignore-case checking. no doubt! :) But if you do not want your terms to be lowercased, yet do not care whether "The" has an uppercase "T", you want this behaviour. Either way we go, we need the version somehow to preserve bw. compat. We should rather think about breaking bw. compat for this particular language (Deseret), but we have no idea what happens with Unicode in the future. It's tough.
[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0
[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783393#action_12783393 ] Uwe Schindler commented on LUCENE-2094: --- bq. Either way, if the set is lowercased or not, the lowercasing is also applied to the values checked against the set. If the LowerCaseFilter is applied before the stopwords, there is no need for ignore-case checking.
[jira] Assigned: (LUCENE-2062) Bulgarian Analyzer
[ https://issues.apache.org/jira/browse/LUCENE-2062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir reassigned LUCENE-2062: --- Assignee: Robert Muir > Bulgarian Analyzer > -- > > Key: LUCENE-2062 > URL: https://issues.apache.org/jira/browse/LUCENE-2062 > Project: Lucene - Java > Issue Type: New Feature > Components: contrib/analyzers >Reporter: Robert Muir >Assignee: Robert Muir >Priority: Minor > Fix For: 3.1 > > Attachments: LUCENE-2062.patch > > > someone asked about bulgarian analysis on solr-user today... > http://www.lucidimagination.com/search/document/e1e7a5636edb1db2/non_english_languages > I was surprised we did not have anything. > This analyzer implements the algorithm specified here, > http://members.unine.ch/jacques.savoy/Papers/BUIR.pdf > In the measurements there, this improves MAP approx 34% -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0
[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783392#action_12783392 ] Simon Willnauer commented on LUCENE-2094: - bq. Why do you use Version.LUCENE_CURRENT for all predefined stop word sets (ok, they do not need a match version, because they are already lowercased). 1. They do not ignore case at all, so the version will not affect those sets. 2. They are private and we have full control over the sets. They are all lowercased (as you figured correctly) and none of them contains any supplementary character. 3. They are static and private, so passing any user-supplied version is not feasible. bq. In my opinion the whole stuff is only needed for CharArraySets which are not already lowercased. So is there any CharArraySet in Lucene with predefined stop words that is not lowercased? Either way, whether the set is lowercased or not, the lowercasing is also applied to the values checked against the set.
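The underlying Unicode 4.0 problem can be seen in a standalone sketch (illustrative code, not the CharArraySet implementation): lowercasing a supplementary character one UTF-16 char at a time leaves the surrogate pair untouched, so an ignore-case lookup misses, while per-code-point lowercasing maps it correctly. Deseret (e.g. U+10400, whose lowercase is U+10428) is a convenient test case because it is a cased script outside the BMP.

```java
public class SupplementaryLowercase {
    // Lowercase each UTF-16 char independently (the pre-Unicode-4 behavior).
    // Surrogate halves have no case mapping, so supplementary chars pass through unchanged.
    static String lowercasePerChar(String s) {
        char[] chars = s.toCharArray();
        for (int i = 0; i < chars.length; i++) {
            chars[i] = Character.toLowerCase(chars[i]);
        }
        return new String(chars);
    }

    // Lowercase by code point, which handles surrogate pairs correctly.
    static String lowercasePerCodePoint(String s) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            sb.appendCodePoint(Character.toLowerCase(cp));
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String upper = new String(Character.toChars(0x10400)); // DESERET CAPITAL LETTER LONG I
        String lower = new String(Character.toChars(0x10428)); // DESERET SMALL LETTER LONG I
        System.out.println(lowercasePerChar(upper).equals(lower));      // false: pair untouched
        System.out.println(lowercasePerCodePoint(upper).equals(lower)); // true
    }
}
```

This is why a lowercased key stored in an ignore-case set cannot match the uppercase form under per-char lowercasing, and why the fix has to be version-gated for back compat.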
[jira] Commented: (LUCENE-2067) Czech Stemmer
[ https://issues.apache.org/jira/browse/LUCENE-2067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783390#action_12783390 ] Robert Muir commented on LUCENE-2067: - bq. well at least I figured out there must be something wrong I appreciate the review... it is frustrating that you have to pay $ to view the paper right now. On the other hand, we are lucky when researchers are this open about their experiments... it saves a lot of work. > Czech Stemmer > - > > Key: LUCENE-2067 > URL: https://issues.apache.org/jira/browse/LUCENE-2067 > Project: Lucene - Java > Issue Type: New Feature > Components: contrib/analyzers >Reporter: Robert Muir >Assignee: Robert Muir >Priority: Minor > Fix For: 3.1 > > Attachments: LUCENE-2067.patch, LUCENE-2067.patch, LUCENE-2067.patch, > LUCENE-2067.patch > > > Currently, the CzechAnalyzer is merely stopwords, and there isn't a Czech > stemmer in Snowball. > This patch implements the light stemming algorithm described in: > http://portal.acm.org/citation.cfm?id=1598600 > In their measurements, it improves MAP approx 42% -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Resolved: (LUCENE-2067) Czech Stemmer
[ https://issues.apache.org/jira/browse/LUCENE-2067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir resolved LUCENE-2067. - Resolution: Fixed Committed revision 885216.
[jira] Commented: (LUCENE-2067) Czech Stemmer
[ https://issues.apache.org/jira/browse/LUCENE-2067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783389#action_12783389 ] Simon Willnauer commented on LUCENE-2067: - bq. make the stem filter final, and add explicit test for the mobile e rewrite looks good to me! Go ahead and commit. bq. Sorry for the confusion (pointing you at a slightly different algorithm)... well at least I figured out there must be something wrong :)
[jira] Resolved: (LUCENE-1844) Speed up junit tests
[ https://issues.apache.org/jira/browse/LUCENE-1844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved LUCENE-1844. Resolution: Fixed Fix Version/s: 3.1 Thanks Erick & Mark! Next step is to find some generic way to parallelize the tests... > Speed up junit tests > > > Key: LUCENE-1844 > URL: https://issues.apache.org/jira/browse/LUCENE-1844 > Project: Lucene - Java > Issue Type: Improvement >Reporter: Mark Miller >Assignee: Michael McCandless > Fix For: 3.1 > > Attachments: FastCnstScoreQTest.patch, hi_junit_test_runtimes.png, > LUCENE-1844-Junit3.patch, LUCENE-1844.patch, LUCENE-1844.patch, > LUCENE-1844.patch > > > As Lucene grows, so does the number of JUnit tests. This is obviously a good > thing, but it comes with longer and longer test times. Now that we also run > back compat tests in a standard test run, this problem is essentially doubled. > There are some ways this may get better, including running parallel tests. > You will need the hardware to fully take advantage, but it should be a nice > gain. There is already an issue for this, and Junit 4.6, 4.7 have the > beginnings of something we might be able to count on soon. 4.6 was buggy, and > 4.7 still doesn't come with nice ant integration. Parallel tests will come > though. > Beyond parallel testing, I think we also need to concentrate on keeping our > tests lean. We don't want to sacrifice coverage or quality, but I'm sure > there is plenty of fat to skim. > I've started making a list of some of the longer tests - I think with some > work we can make our tests much faster - and then with parallelization, I > think we could see some really great gains. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Socket and file locks
This looks great! Maybe it makes most sense to create a wiki page (http://wiki.apache.org/lucene-java) for interesting LockFactory implementations/tradeoffs, and add this there? Mike On Sat, Nov 28, 2009 at 9:26 AM, Sanne Grinovero wrote: > Hello, > Together with the Infinispan Directory we developed such a > LockFactory; I'd be more than happy if you wanted to add some pointers > to it in the Lucene documentation/readme. > This depends on Infinispan for multiple-machines communication > (JGroups, indirectly) but > it's not required to use an Infinispan Directory, you could combine it > with a Directory impl of your choice. > This was tested with the LockVerifyServer mentioned by Michael > McCandless and also > with some other tests inspired from it (in-VM for lower-delay > coordination and verification, while the LockFactory was forced to > use real network communication). > > While this is a technology preview and performance regarding the > Directory code is still unknown, I believe the LockFactory was the > most tested component. > > free to download and inspect (LGPL): > http://anonsvn.jboss.org/repos/infinispan/trunk/lucene-directory/ > > Regards, > Sanne > > 2009/11/27 Michael McCandless : >> I think a LockFactory for Lucene that implemented the ideas you & >> Marvin are discussing in LUCENE-1877, and/or the approach you >> implemented in the H2 DB, would be a useful addition to Lucene! >> >> For many apps, the simple LockFactory impls suffice, but for apps >> where multiple machines can become the writer, it gets hairy. Having >> an always-correct Lock impl for these apps would be great. >> >> Note that Lucene has some basic tools (in oal.store) for asserting >> that a LockFactory is correct (see LockVerifyServer), so it's a useful >> way to test that things are working from Lucene's standpoint. >> >> Mike >> >> On Fri, Nov 27, 2009 at 9:23 AM, Thomas Mueller >> wrote: >>> Hi, >>> >>> I'm wondering if you are interested in automatically releasing the >>> write lock.
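The property that LockVerifyServer asserts — that a LockFactory actually provides mutual exclusion — can be illustrated with a small in-VM sketch. This is plain Java with no Lucene dependency; `obtain`/`release` are hypothetical stand-ins for a Lock implementation, and the "verifier" simply tracks how many holders are ever inside the critical section at once:

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

public class LockInvariantCheck {
    // A toy lock: obtain() succeeds only via an atomic compare-and-set.
    static final AtomicBoolean lock = new AtomicBoolean(false);
    // Verifier state: current holder count, and the max ever observed.
    static final AtomicInteger holders = new AtomicInteger(0);
    static final AtomicInteger maxHolders = new AtomicInteger(0);

    static boolean obtain() { return lock.compareAndSet(false, true); }
    static void release() { lock.set(false); }

    public static void main(String[] args) throws InterruptedException {
        Thread[] ts = new Thread[8];
        for (int i = 0; i < ts.length; i++) {
            ts[i] = new Thread(() -> {
                for (int j = 0; j < 10_000; j++) {
                    if (obtain()) {
                        // Inside the critical section: record holder count.
                        int h = holders.incrementAndGet();
                        maxHolders.getAndAccumulate(h, Math::max);
                        holders.decrementAndGet();
                        release();
                    }
                }
            });
            ts[i].start();
        }
        for (Thread t : ts) t.join();
        // A correct lock never lets two holders overlap.
        System.out.println("max concurrent holders = " + maxHolders.get());
    }
}
```

A broken lock (e.g. replacing the compare-and-set with a plain read-then-write) would eventually report more than one concurrent holder, which is exactly the kind of violation the real verifier is there to catch across processes.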
See also my comments on >>> https://issues.apache.org/jira/browse/LUCENE-1877 - I thought it's a >>> problem worth solving, because it's also in the Lucene FAQ list at >>> http://wiki.apache.org/lucene-java/LuceneFAQ#What_is_the_purpose_of_write.lock_file.2C_when_is_it_used.2C_and_by_which_classes.3F >>> >>> Unfortunately there seems to be no solution that 'always works', but >>> delegating the task and responsibility to the application / to the >>> user is problematic as well. For example, a user of the H2 database >>> (which supports Lucene fulltext indexing) suggested automatically >>> removing the write.lock file whenever the file is there: >>> http://code.google.com/p/h2database/issues/detail?id=141 - sounds a >>> bit dangerous in my view. >>> >>> So, if you are interested in solving the problem, then maybe I can help. >>> If not, then I will not bother you any longer :-) >>> >>> Regards, >>> Thomas >>> >>> >>> > > shouldn't active code like that live in the application layer? > Why? You can all but guarantee that polling will work at the app layer >>> >>> The application layer may also run with low priority. In operating >>> systems, it's usually the lower layers that have more 'rights' >>> (priority), and not the higher levels (I'm not saying it should be >>> like that in Java). I just think the application layer should not have >>> to deal with write locks or removing write locks. >>> by the time the original process realizes that it doesn't hold the lock anymore, the damage could already have been done. >>> >>> Yes, I'm not sure how to best avoid that (with any design). Asking the >>> application layer or the user whether the lock file can be removed is >>> probably more dangerous than doing the best we can in Lucene. >>> >>> Standby / hibernate: the question is, if the machine's process is >>> currently not running, does the process still hold the lock? I think >>> no, because the machine might as well be turned off. 
How to detect >>> whether the machine is turned off versus in hibernate mode? I guess >>> that's a problem for all mechanisms (socket / file lock / background >>> thread). >>> >>> When a hibernated process wakes up again, it thinks it owns the lock. >>> Even if the process checks before each write, it is unsafe: >>> >>> if (isStillLocked()) { >>> write(); >>> } >>> >>> The process could wake up after isStillLocked() but before write(). >>> One protection is: the second process (the one that breaks the lock) >>> would need to work on a copy of the data instead of the original file >>> (it could delete / truncate the original file after creating a copy). >>> On Windows, renaming the file might work (not sure); on Linux you >>> probably need to copy the content to a new file. That way, the awoken >>> process can only destroy inactive data. >>> >>> The question is: do we need to solve this problem? How big is the >>> risk? Instead of solving this problem completely, you could detect it >>> after the fact witho
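The "work on a copy" protection described in that email can be sketched in plain Java. This is only an illustration of the idea, not Lucene code; `breakLockAndAdopt` is a hypothetical name, and a real implementation would also have to copy/truncate atomically with respect to the stale writer:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class CopyOnBreakLock {
    // Called by the second process once it decides the old lock is stale:
    // move the live data to a fresh file and truncate the original, so a
    // hibernated process that wakes up can only write to dead data.
    static Path breakLockAndAdopt(Path original) throws IOException {
        Path copy = original.resolveSibling(original.getFileName() + ".gen2");
        Files.copy(original, copy, StandardCopyOption.REPLACE_EXISTING);
        Files.write(original, new byte[0]); // truncate: original is now inactive
        return copy; // the new lock owner works on the copy from here on
    }

    public static void main(String[] args) throws IOException {
        Path data = Files.createTempFile("index", ".dat");
        Files.write(data, "live-data".getBytes());
        Path adopted = breakLockAndAdopt(data);
        // A stale writer waking up after isStillLocked() writes to the
        // original path...
        Files.write(data, "stale-write".getBytes());
        // ...but the adopted copy, which the new owner uses, is untouched.
        System.out.println(new String(Files.readAllBytes(adopted)));
    }
}
```

Running `main` prints `live-data`: the stale write lands only on the truncated original, matching the email's point that the awoken process can then only destroy inactive data.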
[jira] Updated: (LUCENE-2097) In NRT mode, and CFS enabled, IndexWriter incorrectly ties up disk space
[ https://issues.apache.org/jira/browse/LUCENE-2097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-2097: --- Attachment: LUCENE-2097.patch Attached patch with test case that shows the issue. Not yet sure what's the best way to fix it... probably we have to build the CFS before opening the reader we want to pool. > In NRT mode, and CFS enabled, IndexWriter incorrectly ties up disk space > > > Key: LUCENE-2097 > URL: https://issues.apache.org/jira/browse/LUCENE-2097 > Project: Lucene - Java > Issue Type: Bug > Components: Index >Affects Versions: 2.9, 2.9.1, 3.0 >Reporter: Michael McCandless >Assignee: Michael McCandless >Priority: Minor > Fix For: 3.1 > > Attachments: LUCENE-2097.patch > > > Spinoff of java-user thread titled "searching while optimize"... > If IndexWriter is in NRT mode (you've called getReader() at least > once), and CFS is enabled, then internally the writer pools readers. > However, after a merge completes, it opens the reader against the > non-CFS segment files, and pools that. It then builds the CFS file, > as well, thus tying up the storage for that segment twice. > Functionally the bug is harmless (it's only a disk space issue). > Also, when the segment is merged, the disk space is released again > (though the newly merged segment will also be double-tied-up). > Simple workaround is to use non-CFS mode, or, don't use getReader.
[jira] Created: (LUCENE-2097) In NRT mode, and CFS enabled, IndexWriter incorrectly ties up disk space
In NRT mode, and CFS enabled, IndexWriter incorrectly ties up disk space Key: LUCENE-2097 URL: https://issues.apache.org/jira/browse/LUCENE-2097 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: 3.0, 2.9.1, 2.9 Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Fix For: 3.1 Spinoff of java-user thread titled "searching while optimize"... If IndexWriter is in NRT mode (you've called getReader() at least once), and CFS is enabled, then internally the writer pools readers. However, after a merge completes, it opens the reader against the non-CFS segment files, and pools that. It then builds the CFS file, as well, thus tying up the storage for that segment twice. Functionally the bug is harmless (it's only a disk space issue). Also, when the segment is merged, the disk space is released again (though the newly merged segment will also be double-tied-up). Simple workaround is to use non-CFS mode, or, don't use getReader.
[jira] Commented: (LUCENE-2061) Create benchmark & approach for testing Lucene's near real-time performance
[ https://issues.apache.org/jira/browse/LUCENE-2061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783376#action_12783376 ] Michael McCandless commented on LUCENE-2061: bq. Can you post the queries file you've used? I only used TermQuery "1", sorting by score. I'd generally like to focus on worst case query latency rather than QPS of "easy" queries. Maybe we should switch to harder queries (phrase, boolean). Though one thing I haven't yet focused on testing (which your work on LUCENE-1785 would improve) is queries that hit the FieldCache -- we should test that as well. {quote} I haven't seen the same results in regards to the OS managing small files, and I suspect that users in general will choose a variety of parameters (i.e. 1 max buffered doc) that makes writing to disk inherently slow. Logically the OS should work as a write cache, however in practice, it seems a variety of users have reported otherwise. Maybe 100 docs works, however that feels like a fairly narrow guideline for users of NRT. {quote} Yeah we need to explore this (when OS doesn't do effective write-caching), in practice. {quote} The latest LUCENE-1313 is a step in a direction that doesn't change IW internals too much. {quote} I do like this simplification -- basically IW is internally managing how best to use RAM in NRT mode -- but I think we need to scrutinize (through benchmarking, here) whether this is really needed (ie, whether we can't simply rely on the OS to behave, with its IO cache). 
> Create benchmark & approach for testing Lucene's near real-time performance > --- > > Key: LUCENE-2061 > URL: https://issues.apache.org/jira/browse/LUCENE-2061 > Project: Lucene - Java > Issue Type: Task > Components: Index >Reporter: Michael McCandless >Assignee: Michael McCandless >Priority: Minor > Attachments: LUCENE-2061.patch, LUCENE-2061.patch, LUCENE-2061.patch > > > With the improvements to contrib/benchmark in LUCENE-2050, it's now > possible to create compelling algs to test indexing & searching > throughput against a periodically reopened near-real-time reader from > the IndexWriter. > Coming out of the discussions in LUCENE-1526, I think to properly > characterize NRT, we should measure net search throughput as a > function of both reopen rate (ie how often you get a new NRT reader > from the writer) and indexing rate. We should also separately measure > pure adds vs updates (deletes + adds); the latter is much more work > for Lucene. > This can help apps make capacity decisions... and can help us test > performance of pending improvements for NRT (eg LUCENE-1313, > LUCENE-2047).