[jira] Commented: (LUCENE-826) Language detector
[ https://issues.apache.org/jira/browse/LUCENE-826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12805027#action_12805027 ] Karl Wettin commented on LUCENE-826: Hi Ken, it's hard for me to compare. I'll rant a bit about my experience with language detection, though. I still haven't found one strategy that works well on any text: a user query, a sentence, a paragraph or a complete document. 1-5 grams using SVM or NB work pretty well for them all, but you really need to train with the same sort of data you want to classify. Even when training with a mix of text lengths, it tends to perform a lot worse than having one classifier for each data type. And you still probably want to twiddle the classifier knobs to make it work well with the data you are classifying and training with. In some cases I've used 1-10 grams and other times I've used 2-4 grams. Sometimes I've used SVM and other times a simple decision tree. To sum it up: to achieve good quality I've always had to build a classifier for the specific use case. Weka has a great test suite for figuring out what to use. Set it up, press play and return one week later to find out what to use. > Language detector > - > > Key: LUCENE-826 > URL: https://issues.apache.org/jira/browse/LUCENE-826 > Project: Lucene - Java > Issue Type: New Feature >Reporter: Karl Wettin >Assignee: Karl Wettin > Attachments: ld.tar.gz, ld.tar.gz > > > A formula 1A token/ngram-based language detector. Requires a paragraph of > text to avoid false positive classifications. > Depends on contrib/analyzers/ngrams for tokenization, Weka for classification > (logistic support vector models), feature selection and normalization of token > frequencies. Optionally Wikipedia and NekoHTML for training data harvesting. 
> Initialized like this: > {code} > LanguageRoot root = new LanguageRoot(new > File("documentClassifier/language root")); > root.addBranch("uralic"); > root.addBranch("fino-ugric", "uralic"); > root.addBranch("ugric", "uralic"); > root.addLanguage("fino-ugric", "fin", "finnish", "fi", "Suomi"); > root.addBranch("proto-indo european"); > root.addBranch("germanic", "proto-indo european"); > root.addBranch("northern germanic", "germanic"); > root.addLanguage("northern germanic", "dan", "danish", "da", "Danmark"); > root.addLanguage("northern germanic", "nor", "norwegian", "no", "Norge"); > root.addLanguage("northern germanic", "swe", "swedish", "sv", "Sverige"); > root.addBranch("west germanic", "germanic"); > root.addLanguage("west germanic", "eng", "english", "en", "UK"); > root.mkdirs(); > LanguageClassifier classifier = new LanguageClassifier(root); > if (!new File(root.getDataPath(), "trainingData.arff").exists()) { > classifier.compileTrainingData(); // from wikipedia > } > classifier.buildClassifier(); > {code} > The training set built from Wikipedia consists of the pages describing the home country of > each registered language, in the language to train. The above example passes this > test: > (testEquals is the same as assertEquals, just not required. Only one of them > fails, see comment.) 
> {code} > assertEquals("swe", classifier.classify(sweden_in_swedish).getISO()); > testEquals("swe", classifier.classify(norway_in_swedish).getISO()); > testEquals("swe", classifier.classify(denmark_in_swedish).getISO()); > testEquals("swe", classifier.classify(finland_in_swedish).getISO()); > testEquals("swe", classifier.classify(uk_in_swedish).getISO()); > testEquals("nor", classifier.classify(sweden_in_norwegian).getISO()); > assertEquals("nor", classifier.classify(norway_in_norwegian).getISO()); > testEquals("nor", classifier.classify(denmark_in_norwegian).getISO()); > testEquals("nor", classifier.classify(finland_in_norwegian).getISO()); > testEquals("nor", classifier.classify(uk_in_norwegian).getISO()); > testEquals("fin", classifier.classify(sweden_in_finnish).getISO()); > testEquals("fin", classifier.classify(norway_i
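The attached classifier itself isn't shown here, but the 1-5 gram idea discussed in the comment can be illustrated with a small, hypothetical sketch. It uses simple cosine similarity over character n-gram counts instead of the Weka SVM/NB classifiers the issue actually depends on; all names below are made up for illustration:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not the attached ld.tar.gz code: classify text by
// cosine similarity between character 1-3 gram count profiles.
public class NgramLanguageGuesser {
    private final Map<String, Map<String, Integer>> profiles = new HashMap<>();

    // Count character n-grams of length minN..maxN, padding with spaces.
    static Map<String, Integer> ngrams(String text, int minN, int maxN) {
        Map<String, Integer> counts = new HashMap<>();
        String s = " " + text.toLowerCase() + " ";
        for (int n = minN; n <= maxN; n++) {
            for (int i = 0; i + n <= s.length(); i++) {
                counts.merge(s.substring(i, i + n), 1, Integer::sum);
            }
        }
        return counts;
    }

    public void train(String language, String sample) {
        profiles.put(language, ngrams(sample, 1, 3));
    }

    // Return the language whose profile has the highest cosine similarity.
    public String classify(String text) {
        Map<String, Integer> q = ngrams(text, 1, 3);
        String best = null;
        double bestScore = -1;
        for (Map.Entry<String, Map<String, Integer>> e : profiles.entrySet()) {
            double dot = 0, qn = 0, pn = 0;
            for (int v : q.values()) qn += (double) v * v;
            for (int v : e.getValue().values()) pn += (double) v * v;
            for (Map.Entry<String, Integer> g : q.entrySet()) {
                Integer p = e.getValue().get(g.getKey());
                if (p != null) dot += (double) g.getValue() * p;
            }
            double score = dot / (Math.sqrt(qn) * Math.sqrt(pn));
            if (score > bestScore) { bestScore = score; best = e.getKey(); }
        }
        return best;
    }
}
```

As the comment stresses, a toy like this only works when trained on the same kind of text it is asked to classify.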
[jira] Commented: (LUCENE-626) Extended spell checker with phrase support and adaptive user session analysis.
[ https://issues.apache.org/jira/browse/LUCENE-626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12805021#action_12805021 ] Karl Wettin commented on LUCENE-626: Hi Mikkel, the test case data set is on an HDD hidden away in an attic 600 km away from me, but I've asked someone in the vicinity to fetch it for me. It might take a little while. Sorry! It's extremely cool, however, that you're working with this old beast! I'm super busy as always, but I promise to follow your progress in case there is something you wonder about. It's been a few years since I looked at the code though. > Extended spell checker with phrase support and adaptive user session analysis. > -- > > Key: LUCENE-626 > URL: https://issues.apache.org/jira/browse/LUCENE-626 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Reporter: Karl Wettin >Priority: Minor > Attachments: LUCENE-626_20071023.txt > > > Extensive javadocs available in the patch, but I also try to keep them compiled > here: > http://ginandtonique.org/~kalle/javadocs/didyoumean/org/apache/lucene/search/didyoumean/package-summary.html#package_description > A semi-retarded reinforcement learning thingy backed by algorithmic second > level suggestion schemes that learns from and adapts to user behavior as > queries change, suggestions are accepted or declined, etc. > Except for detecting spelling errors it considers context, > composition/decomposition and a few other things. > heroes of light and magik -> heroes of might and magic > vinci da code -> da vinci code > java docs -> javadocs > blacksabbath -> black sabbath > Depends on LUCENE-550 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: [jira] Commented: (LUCENE-2194) improve efficiency of snowballfilter
+1 On 7 Jan 2010 at 19:50, Robert Muir (JIRA) wrote: [ https://issues.apache.org/jira/browse/LUCENE-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12797736#action_12797736 ] Robert Muir commented on LUCENE-2194: - I tested this with some English text, the benchmark pkg, etc., and at most it seems to improve processing speed by 10%. But I think it's worth the trouble since it's an easy improvement. I'll commit in a few days if no one objects. improve efficiency of snowballfilter Key: LUCENE-2194 URL: https://issues.apache.org/jira/browse/LUCENE-2194 Project: Lucene - Java Issue Type: Improvement Components: contrib/analyzers Reporter: Robert Muir Assignee: Robert Muir Priority: Minor Fix For: 3.1 Attachments: LUCENE-2194.patch snowball stemming currently creates 2 new strings and 1 new StringBuilder for every word. All of this is unnecessary, so don't do it.
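The per-word allocation pattern the issue describes, versus reusing the existing term buffer, can be sketched as follows. The method names and the lowercase step are purely illustrative, not the actual SnowballFilter code:

```java
// Hypothetical illustration of the allocation point raised in LUCENE-2194.
public class BufferReuse {
    // Allocation-heavy style: a new String and StringBuilder per token.
    static char[] stemViaStrings(char[] term, int len) {
        String s = new String(term, 0, len);          // copy #1
        StringBuilder sb = new StringBuilder(s.toLowerCase()); // copy #2
        return sb.toString().toCharArray();           // copy #3
    }

    // Reuse style: rewrite the existing buffer in place, no garbage.
    static int stemInPlace(char[] term, int len) {
        for (int i = 0; i < len; i++) {
            term[i] = Character.toLowerCase(term[i]);
        }
        return len; // the new length; a real stemmer may shorten it
    }
}
```

Both styles produce the same characters; the second avoids three short-lived objects per token, which is where the ~10% gain would come from.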
[jira] Commented: (LUCENE-1515) Improved(?) Swedish snowball stemmer
[ https://issues.apache.org/jira/browse/LUCENE-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795968#action_12795968 ] Karl Wettin commented on LUCENE-1515: - I just posted this to the Snowball users list: The Swedish Snowball stemmer does a terrible job according to <http://web.jhu.edu/bin/q/b/p75-mcnamee.pdf>. It even claims that lfs5, i.e. substring(0,5), does a better job. (It also says that 5-grams crack the nut.) This didn't come as a surprise to me, as I've identified problems in the past and implemented my own augmentation that's been posted to this list before, now living at <http://issues.apache.org/jira/browse/LUCENE-1515>. Reading the paper made me take a closer look at what's wrong. define main_suffix as ( setlimit tomark p1 for ([substring]) among( 'a' 'arna' 'erna' 'heterna' 'orna' 'ad' 'e' 'ade' 'ande' 'arne' 'are' 'aste' 'en' 'anden' 'aren' 'heten' 'ern' 'ar' 'er' 'heter' 'or' 'as' 'arnas' 'ernas' 'ornas' 'es' 'ades' 'andes' 'ens' 'arens' 'hetens' 'erns' 'at' 'andet' 'het' 'ast' 'era' 'erar' 'erarna' 'erarnas' // augmentation starts here 'an' 'anen' 'anens' 'anare' 'aner' 'anerna' 'anernas' 'ans' 'ansen' 'ansens' 'anser' 'ansera' 'anserar' 'anserna' 'ansernas' 'iera' 'ierat' 'ierats' 'ierad' 'ierade' 'ierades' 'ikation' 'ikat' 'ikatet' 'ikatets' 'ikaten' 'ikatens' // augmentation ends here (delete) 's' (s_ending delete) In conjunction with ~200 exception rules these additions help. There are, however, quite a few problems with many of the old rules. E.g. 's' (s_ending delete) is a plural rule, but it has ~5300 exceptions where a word ending in s is nominative singular. The problem appears when such a word is written in a form other than the nominative case. 
kurs (course) kursen (the course) kursens (the [undefined noun] of the course) kurser (courses) kurserna (the courses) kursernas (the [undefined noun] of the courses) Kurs is stemmed to "kur" (which, by the way, will mismatch with kur as in remedy) while all the others are correctly stemmed to "kurs". All together there are, by my estimation, some 10 000 words that will create incompatible stems between the nominative singular and any other form. That is about 8% of the official language. One rather simple solution is to always use both unstemmed and stemmed words, e.g. as synonyms in an inverted index. But if only the stemmed output is used (from the official stemmer or my augmentation) I'd argue it's better to skip stemming altogether. A better solution would be to set up the stemmer to ignore the 10 000 exceptions. What would be the best way to implement this? I'd like the generated Java code to simply contain a HashSet noStemExceptions; that is checked first, or something like that. > Improved(?) Swedish snowball stemmer > > > Key: LUCENE-1515 > URL: https://issues.apache.org/jira/browse/LUCENE-1515 > Project: Lucene - Java > Issue Type: New Feature > Components: contrib/* >Affects Versions: 2.4 >Reporter: Karl Wettin > Attachments: LUCENE-1515.txt > > > Snowball stemmer for Swedish lacks support for '-an' and '-ans' related > suffix stripping, ending up with incompatible stems, for example "klocka", > "klockor", "klockornas", "klockAN", "klockANS". Complete list of new suffix > stripping rules: > {pre} > 'an' 'anen' 'anens' 'anare' 'aner' 'anerna' 'anernas' > 'ans' 'ansen' 'ansens' 'anser' 'ansera' 'anserar' 'anserna' > 'ansernas' > 'iera' > (delete) > {pre} > Th
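The "HashSet checked first" idea proposed above could look roughly like this. The wrapper class and the stand-in stem() are hypothetical, not the Snowball-generated Java code:

```java
import java.util.Set;

// Hypothetical sketch of the proposed noStemExceptions mechanism: words in
// the exception set are returned unstemmed; everything else is delegated.
public class ExceptionAwareStemmer {
    private final Set<String> noStemExceptions;

    public ExceptionAwareStemmer(Set<String> exceptions) {
        this.noStemExceptions = exceptions;
    }

    // Stand-in for the generated Snowball stem() method: here it only
    // strips a final 's', mimicking the 's' (s_ending delete) rule.
    private String snowballStem(String word) {
        if (word.endsWith("s")) return word.substring(0, word.length() - 1);
        return word;
    }

    public String stem(String word) {
        // Check the exception list first, as proposed in the comment,
        // so nominative-singular words like "kurs" keep their final s.
        if (noStemExceptions.contains(word)) return word;
        return snowballStem(word);
    }
}
```

With "kurs" in the set, all the forms above would share the stem "kurs" instead of the singular collapsing to "kur".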
[jira] Commented: (LUCENE-1515) Improved(?) Swedish snowball stemmer
[ https://issues.apache.org/jira/browse/LUCENE-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795967#action_12795967 ] Karl Wettin commented on LUCENE-1515: - I've added a few more rules. I'll have to add a few more tests etc before I post a new patch. {code} define main_suffix as ( setlimit tomark p1 for ([substring]) among( 'a' 'arna' 'erna' 'heterna' 'orna' 'ad' 'e' 'ade' 'ande' 'arne' 'are' 'aste' 'en' 'anden' 'aren' 'heten' 'ern' 'ar' 'er' 'heter' 'or' 'as' 'arnas' 'ernas' 'ornas' 'es' 'ades' 'andes' 'ens' 'arens' 'hetens' 'erns' 'at' 'andet' 'het' 'ast' 'era' 'erar' 'erarna' 'erarnas' // augmentation starts here 'an' 'anen' 'anens' 'anare' 'aner' 'anerna' 'anernas' 'ans' 'ansen' 'ansens' 'anser' 'ansera' 'anserar' 'anserna' 'ansernas' 'iera' 'ierat' 'ierats' 'ierad' 'ierade' 'ierades' 'ikation' 'ikat' 'ikatet' 'ikatets' 'ikaten' 'ikatens' // augmentation ends here (delete) 's' (s_ending delete) {code} > Improved(?) Swedish snowball stemmer > > > Key: LUCENE-1515 > URL: https://issues.apache.org/jira/browse/LUCENE-1515 > Project: Lucene - Java > Issue Type: New Feature > Components: contrib/* >Affects Versions: 2.4 >Reporter: Karl Wettin > Attachments: LUCENE-1515.txt > > > Snowball stemmer for Swedish lacks support for '-an' and '-ans' related > suffix stripping, ending up with non compatible stems for example "klocka", > "klockor", "klockornas", "klockAN", "klockANS". Complete list of new suffix > stripping rules: > {pre} > 'an' 'anen' 'anens' 'anare' 'aner' 'anerna' 'anernas' > 'ans' 'ansen' 'ansens' 'anser' 'ansera' 'anserar' 'anserna' > 'ansernas' > 'iera' > (delete) > {pre} > The problem is all the exceptions (e.g. svans|svan, finans|fin, nyans|ny) and > this is an attempt at solving that problem. The rules and exceptions are > based on the [SAOL|http://en.wikipedia.org/wiki/Svenska_Akademiens_Ordlista] > entries suffixed with 'an' and 'ans'. 
There are a few known problematic stemming > rules, but it seems to work quite a bit better than the current SwedishStemmer. > It would not be a bad idea to check all SAOL entries in order to verify > the integrity of the rules. > My Snowball syntax skills are rather limited, so I'm certain the code could be > optimized quite a bit. > *The code is released under BSD and not ASL*. I've been posting a bit in the > Snowball forum and privately to Martin Porter himself but never got any > response, so now I post it here instead in the hope of some momentum.
Re: LUCENE-1515
On 1 Jan 2010 at 14:28, Grant Ingersoll wrote: Please, no Swedish2 or any variant like that. How about something that lets users know what it is and why they should use it? In my view Swedish2 is a better name than MoreSupportForGenitiveCaseSuffixesThanSwedishStemmer. Such a name can turn out pretty far-fetched if someone adds more rules to it in the future. Perhaps AugmentedSwedishStemmer? karl
Re: LUCENE-1515
I'm actually not sure I understand the question. Afaik backwards compatibility with the current SwedishStemmer could only be achieved by stemming with both classes and making the differing outputs synonyms. I just did a bit of testing, and the problems I've identified in 1515 are also present in SwedishStemmer. Not that surprising, as 1515 is an augmentation of SwedishStemmer... Personally I would not mind deprecating SwedishStemmer (renaming it to OldSwedish or something) and later on replacing it with 1515, but that might mess with some people who don't read the README and just upgrade the jar while running on the same old index. On 31 Dec 2009 at 21:55, Simon Willnauer wrote: Is there any chance to get the best of both worlds? Could we merge both together and preserve bw compat with version? Introducing another stemmer doing almost the same thing as an already existing one is exactly what we try to prevent right now. I don't doubt that this issue is an improvement; I'm just thinking of a way to keep code duplication as low as possible. I haven't looked at the code yet, so if my questions are complete nonsense let me know. simon On Thu, Dec 31, 2009 at 6:05 PM, Karl Wettin wrote: On 31 Dec 2009 at 17:43, Simon Willnauer wrote: what is the essential difference between the existing and the LUCENE-1515 stemmer? 1515 handles genitive case suffixes better. An example: klocka (a clock) klockan (the clock) klockans (the [insert noun] of the clock) klockornas (the [insert noun] of the clocks) Using snowball SwedishStemmer: klocka -> klock klockan -> klock klockans -> klockans klockornas -> klockornas karl
Re: LUCENE-1515
On 31 Dec 2009 at 17:43, Simon Willnauer wrote: what is the essential difference between the existing and the LUCENE-1515 stemmer? 1515 handles genitive case suffixes better. An example: klocka (a clock) klockan (the clock) klockans (the [insert noun] of the clock) klockornas (the [insert noun] of the clocks) Using snowball SwedishStemmer: klocka -> klock klockan -> klock klockans -> klockans klockornas -> klockornas karl
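As a toy illustration of why the added '-an'/'-ans' rules fix the klockan/klockans cases: Snowball picks the longest matching suffix from the among(...) list, so with the augmented entries present all four forms collapse to the same stem. The sketch below is hypothetical, uses only a small excerpt of the rule list, and ignores Snowball's p1 region restriction:

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical longest-match suffix stripper, not the Snowball-generated
// code; SUFFIXES is a small excerpt of the augmented rule list.
public class SuffixStripper {
    private static final List<String> SUFFIXES = Arrays.asList(
        "ornas", "erna", "orna", "ans", "an", "ar", "or", "a", "s");

    public static String stem(String word) {
        return SUFFIXES.stream()
            // longest first, so 'ans' wins over 's' for "klockans"
            .sorted(Comparator.comparingInt(String::length).reversed())
            .filter(word::endsWith)
            .findFirst()
            .map(suf -> word.substring(0, word.length() - suf.length()))
            .orElse(word);
    }
}
```

Under this toy rule set, klocka, klockan, klockans and klockornas all stem to "klock", which is the compatibility the augmentation is after.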
LUCENE-1515
1515 is an alternative Swedish stemmer that handles a couple of things unsupported by the original stemmer. A few things are handled worse, but altogether I think it's a better algorithm. I've used it in two commercial applications. I'd like to commit it. Even though I've done my best to make them notice it, the snowball community never commented on it. Perhaps I should attempt once again before pushing it to Lucene. The code is, like the rest of the snowball contrib package, BSD. That shouldn't cause any problems, right? What should I call this stemmer? Swedish2? SwedishToo? Svenska? :) http://issues.apache.org/jira/browse/LUCENE-1515 karl
[jira] Commented: (LUCENE-2144) InstantiatedIndexReader does not handle #termDocs(null) correct (AllTermDocs)
[ https://issues.apache.org/jira/browse/LUCENE-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789248#action_12789248 ] Karl Wettin commented on LUCENE-2144: - I don't have any strong feelings about this line of code, but let me at least explain it. I like the idea that IIFoo behaves the same way as a SegmentFoo, even during incorrect/undocumented use of the API. There are no real use cases for this in the Lucene distribution; there are, however, effects people might rely on even though they are caused by invalid use of the API and not recommended. E.g. a skipTo to a target greater than the greatest document associated with that term will position the enum at the greatest document number for that term. Even though I wouldn't do something like this, others might. In this case, where an immediate #next() on IR#termDocs() is called, it might look silly to compare the behaviour of II and Segment since it's such blatantly erroneous use of the API, but even I have been known to come up with some rather strange solutions now and then when nobody else is looking. One alternative is that #next would throw an IllegalStateException or something instead of just accepting the call, but then there is of course the small extra cost of checking whether the enum has been positioned yet, and #next is a rather commonly used method. > InstantiatedIndexReader does not handle #termDocs(null) correct (AllTermDocs) > - > > Key: LUCENE-2144 > URL: https://issues.apache.org/jira/browse/LUCENE-2144 > Project: Lucene - Java > Issue Type: Bug > Components: contrib/* >Affects Versions: 2.9, 2.9.1, 3.0 >Reporter: Karl Wettin >Assignee: Michael McCandless >Priority: Critical > Attachments: LUCENE-2144-30.patch, LUCENE-2144.txt > > > This patch contains core changes so someone else needs to commit it. > Due to the incompatible #termDocs(null) behaviour at least MatchAllDocsQuery, > FieldCacheRangeFilter and ValueSourceQuery fail using II since 2.9. > AllTermDocs now has a superclass, AbstractAllTermDocs, that > InstantiatedAllTermDocs also extends. > Also: > * II-tests made less plausible to pass on future incompatible changes to > TermDocs and TermEnum > * IITermDocs#skipTo and #next mimic the behaviour of document positioning > from SegmentTermDocs#dito when returning false > * II now uses BitVector rather than sets for deleted documents
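The behaviour the patch centralises in a shared superclass can be sketched generically: an enumerator over every document id that skips deleted ones, so both reader implementations answer #termDocs(null) the same way. This is a hypothetical illustration, not Lucene's actual AbstractAllTermDocs:

```java
import java.util.BitSet;

// Hedged sketch of an "all documents" enumerator: iterate every doc id
// from 0 to maxDoc, skipping deleted ones. Sharing logic like this is
// what keeps two TermDocs implementations behaviourally identical.
public class AllDocsEnum {
    private final int maxDoc;
    private final BitSet deleted;
    private int doc = -1;

    public AllDocsEnum(int maxDoc, BitSet deleted) {
        this.maxDoc = maxDoc;
        this.deleted = deleted;
    }

    // Advance to the next non-deleted document.
    public boolean next() {
        return skipTo(doc + 1);
    }

    // Position at the first non-deleted document >= target.
    public boolean skipTo(int target) {
        doc = target;
        while (doc < maxDoc) {
            if (!deleted.get(doc)) return true;
            doc++;
        }
        return false;
    }

    public int doc() { return doc; }
}
```

Note how the final document position after a failed next()/skipTo is deterministic; that is exactly the kind of edge behaviour the comment argues both implementations should share.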
[jira] Commented: (LUCENE-2144) InstantiatedIndexReader does not handle #termDocs(null) correct (AllTermDocs)
[ https://issues.apache.org/jira/browse/LUCENE-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789021#action_12789021 ] Karl Wettin commented on LUCENE-2144: - Committed change to trunk. In 3.0 comment out ~line 227 in TestIndicesEquals // this is invalid use of the API, // but if the response differs then it's an indication that something might have changed. // in 2.9 and 3.0 the two TermDocs-implementations returned different values at this point. assertEquals("Descripency during invalid use of the TermDocs API, see comments in test code for details.", aprioriTermDocs.next(), testTermDocs.next());
[jira] Commented: (LUCENE-2144) InstantiatedIndexReader does not handle #termDocs(null) correct (AllTermDocs)
[ https://issues.apache.org/jira/browse/LUCENE-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788966#action_12788966 ] Karl Wettin commented on LUCENE-2144: - bq. at org.apache.lucene.store.instantiated.TestIndicesEquals.testTermDocsSomeMore(TestIndicesEquals.java:226) I have no idea. How do I merge back locally so I can debug it?
[jira] Commented: (LUCENE-2144) InstantiatedIndexReader does not handle #termDocs(null) correct (AllTermDocs)
[ https://issues.apache.org/jira/browse/LUCENE-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788950#action_12788950 ] Karl Wettin commented on LUCENE-2144: - bq. We should fix this on at least 3.0 as well right? Would be great if you had the bandwidth to fix that.
[jira] Updated: (LUCENE-2144) InstantiatedIndexReader does not handle #termDocs(null) correct (AllTermDocs)
[ https://issues.apache.org/jira/browse/LUCENE-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wettin updated LUCENE-2144: Attachment: LUCENE-2144.txt BUILD SUCCESSFUL Total time: 36 minutes 4 seconds
[jira] Created: (LUCENE-2144) InstantiatedIndexReader does not handle #termDocs(null) correct (AllTermDocs)
InstantiatedIndexReader does not handle #termDocs(null) correct (AllTermDocs) - Key: LUCENE-2144 URL: https://issues.apache.org/jira/browse/LUCENE-2144 Project: Lucene - Java Issue Type: Bug Components: contrib/* Affects Versions: 3.0, 2.9.1, 2.9 Reporter: Karl Wettin Priority: Critical This patch contains core changes so someone else needs to commit it. Due to the incompatible #termDocs(null) behaviour at least MatchAllDocsQuery, FieldCacheRangeFilter and ValueSourceQuery fail using II since 2.9. AllTermDocs now has a superclass, AbstractAllTermDocs, that InstantiatedAllTermDocs also extends. Also: * II-tests made less plausible to pass on future incompatible changes to TermDocs and TermEnum * IITermDocs#skipTo and #next mimic the behaviour of document positioning from SegmentTermDocs#dito when returning false * II now uses BitVector rather than sets for deleted documents
[jira] Closed: (LUCENE-774) TopDocs and TopFieldDocs does not implement equals and hashCode
[ https://issues.apache.org/jira/browse/LUCENE-774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wettin closed LUCENE-774. -- Resolution: Won't Fix > TopDocs and TopFieldDocs does not implement equals and hashCode > --- > > Key: LUCENE-774 > URL: https://issues.apache.org/jira/browse/LUCENE-774 > Project: Lucene - Java > Issue Type: Improvement >Affects Versions: 2.0.0 > Reporter: Karl Wettin >Priority: Trivial > Attachments: extendsObject.diff > >
[jira] Commented: (LUCENE-1370) Patch to make ShingleFilter output a unigram if no ngrams can be generated
[ https://issues.apache.org/jira/browse/LUCENE-1370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12774252#action_12774252 ] Karl Wettin commented on LUCENE-1370: - Oops, I seem to have assigned this to me and then forgotten about it. Sorry! I'll check it out this weekend! > Patch to make ShingleFilter output a unigram if no ngrams can be generated > -- > > Key: LUCENE-1370 > URL: https://issues.apache.org/jira/browse/LUCENE-1370 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/analyzers >Reporter: Chris Harris >Assignee: Karl Wettin > Attachments: LUCENE-1370.patch, LUCENE-1370.patch, LUCENE-1370.patch, > LUCENE-1370.patch, ShingleFilter.patch > > > Currently if ShingleFilter.outputUnigrams==false and the underlying token > stream is only one token long, then ShingleFilter.next() won't return any > tokens. This patch provides a new option, outputUnigramIfNoNgrams; if this > option is set and the underlying stream is only one token long, then > ShingleFilter will return that token, regardless of the setting of > outputUnigrams. > My use case here is speeding up phrase queries. The technique is as follows: > First, do index-time analysis using ShingleFilter (with > outputUnigrams==true), thereby expanding things as follows: > "please divide this sentence into shingles" -> > "please", "please divide" > "divide", "divide this" > "this", "this sentence" > "sentence", "sentence into" > "into", "into shingles" > "shingles" > Second, do query-time analysis using ShingleFilter (with > outputUnigrams==false and outputUnigramIfNoNgrams==true). If the user enters > a phrase query, it will get tokenized in the following manner: > "please divide this sentence into shingles" -> > "please divide" > "divide this" > "this sentence" > "sentence into" > "into shingles" > By doing phrase queries with bigrams like this, I can gain a very > considerable speedup. 
Without the outputUnigramIfNoNgrams option, a > single-word query would tokenize like this: > "please" -> >[no tokens] > But thanks to outputUnigramIfNoNgrams, single words will now tokenize like > this: > "please" -> > "please" > > The patch also adds a little to the pre-outputUnigramIfNoNgrams option tests. > > I'm not sure if the patch in this state is useful to anyone else, but I > thought I should throw it up here and try to find out.
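A rough, hypothetical sketch of the bigram behaviour described above (not the actual ShingleFilter, which works on a TokenStream rather than a List):

```java
import java.util.ArrayList;
import java.util.List;

// Hedged sketch of bigram shingling with the single-token fallback the
// patch adds: emit token bigrams; if none can be formed and the fallback
// flag is set, emit the lone unigram instead of nothing.
public class ShingleSketch {
    public static List<String> bigrams(List<String> tokens,
                                       boolean outputUnigramIfNoNgrams) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            out.add(tokens.get(i) + " " + tokens.get(i + 1));
        }
        if (out.isEmpty() && outputUnigramIfNoNgrams && !tokens.isEmpty()) {
            out.add(tokens.get(0)); // the single-token fallback
        }
        return out;
    }
}
```

With the fallback off, a one-token query yields no tokens at all, which is exactly the failure mode the patch is fixing.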
Re: [VOTE] Release Apache Lucene Java 2.9.1, take 3
+1 On 30 Oct 2009 at 00:27, Michael McCandless wrote: OK, let's try this again! I've built new release artifacts from svn rev 831145 (on the 2.9 branch), here: http://people.apache.org/~mikemccand/staging-area/rc3_lucene2.9.1/ Changes are here: http://people.apache.org/~mikemccand/staging-area/rc3_lucene2.9.1changes/ Please vote to officially release these artifacts as Apache Lucene Java 2.9.1. Mike - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: [Lucene-java Wiki] Update of "LuceneAtApacheConUs2009" by HossMan
On 20 Oct 2009 at 07:15, Apache Wiki wrote: + There will be a Lucene/Search !MeetUp on Tuesday night at 8PM. 'This event is open to anyone who wants to come, even if you are not registered for the conference'. That is a really nice thing, and completely new if I'm not mistaken. Perhaps even worth advertising as news on the front page. karl - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Closed: (LUCENE-1958) ShingleFilter creates shingles across two consecutive documents: bug or normal behaviour?
[ https://issues.apache.org/jira/browse/LUCENE-1958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wettin closed LUCENE-1958. --- Resolution: Won't Fix Not a problem in 2.9 > ShingleFilter creates shingles across two consecutive documents: bug or > normal behaviour? > > > Key: LUCENE-1958 > URL: https://issues.apache.org/jira/browse/LUCENE-1958 > Project: Lucene - Java > Issue Type: Bug > Components: contrib/analyzers >Affects Versions: 2.4.1 > Environment: Windows XP / jdk1.6.0_15 >Reporter: MRIT64 >Priority: Minor > > Hi > I add two consecutive documents that are indexed with some filters. The last > one is ShingleFilter. > ShingleFilter creates a shingle spanning the two documents, which makes no > sense in my context. > Is that a bug or is it ShingleFilter's normal behaviour? If it's normal > behaviour, is it possible to change it optionally? > Thanks > MR -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Closed: (LUCENE-1947) Snowball package contains BSD licensed code with ASL header
[ https://issues.apache.org/jira/browse/LUCENE-1947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wettin closed LUCENE-1947. --- Resolution: Fixed Committed in revision 823445 > Snowball package contains BSD licensed code with ASL header > --- > > Key: LUCENE-1947 > URL: https://issues.apache.org/jira/browse/LUCENE-1947 > Project: Lucene - Java > Issue Type: Task > Components: contrib/analyzers >Affects Versions: 2.9 > Reporter: Karl Wettin >Assignee: Karl Wettin > Fix For: 3.0 > > Attachments: LUCENE-1947.patch, LUCENE-1947.patch > > > All classes in org.tartarus.snowball (but not in org.tartarus.snowball.ext) > have for some reason been given an ASL header. These classes are licensed > under BSD. Thus the ASL header should be removed. I suppose this is a > mistake, possibly due to the ASL header automation tool. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Using payload during indexes with Lucene 2.9.0
Hi Mauro, this is the -dev list where we discuss the development of the API. Questions about how to use the API should be sent to the -users list. Please use the -users list for future questions on how to use the API or when responding to this mail. In answer to your question, the classes you are looking for are located in the contrib/analyzers package. http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/analyzers/common/src/java/org/apache/lucene/analysis/payloads/ http://repo2.maven.org/maven2/org/apache/lucene/lucene-analyzers/2.9.0/ karl On 8 Oct 2009 at 22:45, Mauro Dragoni wrote: Hi everyone, I'm new to this mailing list... :) Some days ago I downloaded the new version of Lucene, but I didn't find the classes that I used to index terms with payload (PayloadEncoder, DelimitedPayloadTokenFilter, etc.) So I would like to ask where I may find an example of using payloads with the new Lucene version. Thanks in advance to everyone. Mauro. -- Dott. Mauro Dragoni Ph.D. Università di Milano, Italy My Business Site: http://www.dragotechpro.com My Research Site: http://www.genalgo.com - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Output from a small Snowball benchmark
There have been a few small comments in Jira about the reflection in Snowball's Among class. There is very little to do about this unless one wants to redesign the stemmers so they include an inner class that handles the method callbacks. That's quite a bit of work and I don't even know how much CPU one would save by doing this. So I was thinking it might save some resources if one reused the stemmers instead of reinstantiating them, which I presume everybody does. I thought it would make most sense to simulate query-time stemming, so my benchmark contained 4 words, 2 of which are plural. Each test ran 1 000 000 times. The amount of CPU time used is barely noticeable relative to what other things cost: 0.0109ms/iteration when reinstantiating, 0.0067ms/iteration when reusing. The heap consumption was however rather different. At the end of reinstantiation it had consumed about 10x more than when reusing: ~20MB vs. ~2MB. I realize people don't usually run 1 000 000 queries in such a short time, but at least this is an indication that one could save some GC time here. Many a mickle makes a muckle... So I was thinking that perhaps it would make sense to have something like a singleton concurrent queue in the SnowballFilter and a new constructor that takes the snowball program implementation class as an argument. But this might also be way premature optimization. karl - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
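The reuse idea above could look roughly like the sketch below: a shared concurrent queue that hands out stemmer instances and takes them back after use. Stemmer here is a toy stand-in for a generated Snowball stemmer class, and the pool shape is an assumption for illustration, not an API that exists in Lucene:

```java
import java.util.Locale;
import java.util.concurrent.ConcurrentLinkedQueue;

// Sketch of pooling reusable stemmer instances instead of instantiating one
// per query. "Stemmer" is a toy stand-in for a generated Snowball stemmer.
public class StemmerPool {
    // Toy stemmer: strips a trailing "s" (real Snowball programs are far richer).
    static class Stemmer {
        String stem(String word) {
            String w = word.toLowerCase(Locale.ROOT);
            return w.endsWith("s") ? w.substring(0, w.length() - 1) : w;
        }
    }

    private final ConcurrentLinkedQueue<Stemmer> pool = new ConcurrentLinkedQueue<>();

    Stemmer borrow() {
        Stemmer s = pool.poll();
        return s != null ? s : new Stemmer(); // grow lazily under contention
    }

    void release(Stemmer s) {
        pool.offer(s); // return for reuse; avoids per-query allocation and GC churn
    }

    public static void main(String[] args) {
        StemmerPool pool = new StemmerPool();
        Stemmer s = pool.borrow();
        System.out.println(s.stem("shingles")); // prints "shingle"
        pool.release(s); // the next borrow() reuses this same instance
    }
}
```

A singleton pool per stemmer class, as suggested in the mail, would let a filter borrow on each use and release when done, trading one queue operation for one allocation plus its eventual GC cost.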
[jira] Updated: (LUCENE-1947) Snowball package contains BSD licensed code with ASL header
[ https://issues.apache.org/jira/browse/LUCENE-1947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wettin updated LUCENE-1947: Attachment: LUCENE-1947.patch * Added Snowball license header to static Snowball classes (SnowballProgram, Among and TestApp) * Refactored StringBuffer to StringBuilder in all classes * Added notes about the above in README and package overview. > Snowball package contains BSD licensed code with ASL header > --- > > Key: LUCENE-1947 > URL: https://issues.apache.org/jira/browse/LUCENE-1947 > Project: Lucene - Java > Issue Type: Task > Components: contrib/analyzers >Affects Versions: 2.9 > Reporter: Karl Wettin >Assignee: Karl Wettin > Fix For: 3.0 > > Attachments: LUCENE-1947.patch, LUCENE-1947.patch > > > All classes in org.tartarus.snowball (but not in org.tartarus.snowball.ext) > have for some reason been given an ASL header. These classes are licensed > under BSD. Thus the ASL header should be removed. I suppose this is a > mistake, possibly due to the ASL header automation tool. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1948) Deprecating InstantiatedIndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-1948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wettin updated LUCENE-1948: Attachment: LUCENE-1948.patch > Deprecating InstantiatedIndexWriter > --- > > Key: LUCENE-1948 > URL: https://issues.apache.org/jira/browse/LUCENE-1948 > Project: Lucene - Java > Issue Type: Task > Components: contrib/* >Affects Versions: 2.9 > Reporter: Karl Wettin >Assignee: Karl Wettin > Fix For: 3.0 > > Attachments: LUCENE-1948.patch > > > http://markmail.org/message/j6ip266fpzuaibf7 > I suppose that should have been suggested before 2.9 rather than > after... > There are at least three reasons why I want to do this: > The code is based on the behaviour of the Directory IndexWriter as of > 2.3 and I have not touched it since then. If there will be > changes in the future one will have to keep IIW in sync, something > that's easy to forget. > There is no locking, which will cause concurrent modification > exceptions when accessing the index via searcher/reader while > committing. > It uses the old token stream API, so it has to be upgraded in case it > should stay. > The java- and package level docs have since it was committed been > suggesting that one should consider using II as if it was immutable > due to the locklessness. My suggestion is that we make it immutable > for real. > Since II is meant for small corpora there is very little time lost by > using the constructor that builds the index from an IndexReader. I.e. > rather than using InstantiatedIndexWriter one would have to use a > Directory and an IndexWriter and then pass an IndexReader to a new > InstantiatedIndex. > Any objections? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Created: (LUCENE-1948) Deprecating InstantiatedIndexWriter
Deprecating InstantiatedIndexWriter --- Key: LUCENE-1948 URL: https://issues.apache.org/jira/browse/LUCENE-1948 Project: Lucene - Java Issue Type: Task Components: contrib/* Affects Versions: 2.9 Reporter: Karl Wettin Assignee: Karl Wettin Fix For: 3.0 http://markmail.org/message/j6ip266fpzuaibf7 I suppose that should have been suggested before 2.9 rather than after... There are at least three reasons why I want to do this: The code is based on the behaviour of the Directory IndexWriter as of 2.3 and I have not touched it since then. If there will be changes in the future one will have to keep IIW in sync, something that's easy to forget. There is no locking, which will cause concurrent modification exceptions when accessing the index via searcher/reader while committing. It uses the old token stream API, so it has to be upgraded in case it should stay. The java- and package level docs have since it was committed been suggesting that one should consider using II as if it was immutable due to the locklessness. My suggestion is that we make it immutable for real. Since II is meant for small corpora there is very little time lost by using the constructor that builds the index from an IndexReader. I.e. rather than using InstantiatedIndexWriter one would have to use a Directory and an IndexWriter and then pass an IndexReader to a new InstantiatedIndex. Any objections? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1947) Snowball package contains BSD licensed code with ASL header
[ https://issues.apache.org/jira/browse/LUCENE-1947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wettin updated LUCENE-1947: Attachment: LUCENE-1947.patch > Snowball package contains BSD licensed code with ASL header > --- > > Key: LUCENE-1947 > URL: https://issues.apache.org/jira/browse/LUCENE-1947 > Project: Lucene - Java > Issue Type: Task > Components: contrib/analyzers >Affects Versions: 2.9 > Reporter: Karl Wettin >Assignee: Karl Wettin > Fix For: 3.0 > > Attachments: LUCENE-1947.patch > > > All classes in org.tartarus.snowball (but not in org.tartarus.snowball.ext) > have for some reason been given an ASL header. These classes are licensed > under BSD. Thus the ASL header should be removed. I suppose this is a > mistake, possibly due to the ASL header automation tool. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Created: (LUCENE-1947) Snowball package contains BSD licensed code with ASL header
Snowball package contains BSD licensed code with ASL header --- Key: LUCENE-1947 URL: https://issues.apache.org/jira/browse/LUCENE-1947 Project: Lucene - Java Issue Type: Task Components: contrib/analyzers Affects Versions: 2.9 Reporter: Karl Wettin Assignee: Karl Wettin Fix For: 3.0 All classes in org.tartarus.snowball (but not in org.tartarus.snowball.ext) have for some reason been given an ASL header. These classes are licensed under BSD. Thus the ASL header should be removed. I suppose this is a mistake, possibly due to the ASL header automation tool. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Closed: (LUCENE-1939) IndexOutOfBoundsException at ShingleMatrixFilter's Iterator#hasNext method
[ https://issues.apache.org/jira/browse/LUCENE-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wettin closed LUCENE-1939. --- Resolution: Fixed Fix Version/s: 3.0 Committed in 821888. Thanks Patrick! (I'll consider the other stuff mentioned in the issue later this week, and if manageable, then as a new issue.) > IndexOutOfBoundsException at ShingleMatrixFilter's Iterator#hasNext method > -- > > Key: LUCENE-1939 > URL: https://issues.apache.org/jira/browse/LUCENE-1939 > Project: Lucene - Java > Issue Type: Bug > Components: contrib/analyzers >Affects Versions: 2.9 >Reporter: Patrick Jungermann >Assignee: Karl Wettin > Fix For: 3.0 > > Attachments: ShingleMatrixFilter_IndexOutOfBoundsException.patch > > > I tried to use the ShingleMatrixFilter within Solr. To test the functionality > etc., I first used the built-in field analysis view. The filter was configured > to be used only at query-time analysis with "_" as spacer character and a > min. and max. shingle size of 2. The generation of the shingles for query > strings with this filter seems to work at this view, but by turning on the > highlighting of indexed terms that will match the query terms, the exception > was thrown. Also, each time I tried to query the index the exception was > immediately thrown. > Stacktrace: > {code} > java.lang.IndexOutOfBoundsException: Index: 1, Size: 1 > at java.util.ArrayList.RangeCheck(Unknown Source) > at java.util.ArrayList.get(Unknown Source) > at > org.apache.lucene.analysis.shingle.ShingleMatrixFilter$Matrix$1.hasNext(ShingleMatrixFilter.java:729) > at > org.apache.lucene.analysis.shingle.ShingleMatrixFilter.next(ShingleMatrixFilter.java:380) > at org.apache.lucene.analysis.StopFilter.next(StopFilter.java:120) > at org.apache.lucene.analysis.TokenStream.next(TokenStream.java:47) > ... > {code} > Within the hasNext method, the {{s-1}}-th Column of the ArrayList > {{columns}} is requested, but there is no such entry within columns. 
> I created a patch that checks whether {{columns}} contains enough entries. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1257) Port to Java5
[ https://issues.apache.org/jira/browse/LUCENE-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12762131#action_12762131 ] Karl Wettin commented on LUCENE-1257: - bq. err... looks like perhaps it's only hit once though and then reused.. maybe not so nasty. My first time looking at this code, so I'm sure you can clear it up ... Mark, are you referring to the reflection in Among? Those are pretty tough to get rid of. I think we should replace the StringBuffers in the stemmers if nobody else minds. But I think we should do that in another issue. I also found ASL headers in some of the classes. I suppose they were added automatically at some point. These classes are all BSD. > Port to Java5 > - > > Key: LUCENE-1257 > URL: https://issues.apache.org/jira/browse/LUCENE-1257 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis, Examples, Index, Other, Query/Scoring, > QueryParser, Search, Store, Term Vectors >Affects Versions: 2.3.1 >Reporter: Cédric Champeau >Assignee: Uwe Schindler >Priority: Minor > Fix For: 3.0 > > Attachments: instantiated_fieldable.patch, java5.patch, > LUCENE-1257-Document.patch, LUCENE-1257-StringBuffer.patch, > LUCENE-1257-StringBuffer.patch, LUCENE-1257-StringBuffer.patch, > LUCENE-1257_messages.patch, lucene1257surround1.patch, > lucene1257surround1.patch, shinglematrixfilter_generified.patch > > > For my needs I've updated Lucene so that it uses Java 5 constructs. I know > Java 5 migration had been planned for 2.1 someday in the past, but don't know > when it is planned now. This patch against the trunk includes: > - most obvious generics usage (there are tons of usages of sets, ... Those > which are commonly used have been generified) > - PriorityQueue generification > - replacement of indexed for loops with for-each constructs > - removal of unnecessary unboxing > The code is in my opinion much more readable with those features (you > actually *know* what is stored in collections reading the code, without the > need to look up field definitions every time) and it simplifies many > algorithms. > Note that this patch also includes an interface for the Query class. This has > been done for my company's needs for building custom Query classes which add > some behaviour to the base Lucene queries. It prevents multiple unnecessary > casts. I know this introduction is not wanted by the team, but it really > makes our developments easier to maintain. If you don't want to use this, > replace all /Queriable/ calls with standard /Query/. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Deprecating InstantiatedIndexWriter
I suppose that should have been suggested before 2.9 rather than after... There are at least three reasons why I want to do this: The code is based on the behaviour of the Directory IndexWriter as of 2.3 and I have not touched it since then. If there will be changes in the future one will have to keep IIW in sync, something that's easy to forget. There is no locking, which will cause concurrent modification exceptions when accessing the index via searcher/reader while committing. It uses the old token stream API, so it has to be upgraded in case it should stay. The java- and package level docs have since it was committed been suggesting that one should consider using II as if it was immutable due to the locklessness. My suggestion is that we make it immutable for real. Since II is meant for small corpora there is very little time lost by using the constructor that builds the index from an IndexReader. I.e. rather than using InstantiatedIndexWriter one would have to use a Directory and an IndexWriter and then pass an IndexReader to a new InstantiatedIndex. Any objections? - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1939) IndexOutOfBoundsException at ShingleMatrixFilter's Iterator#hasNext method
[ https://issues.apache.org/jira/browse/LUCENE-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761924#action_12761924 ] Karl Wettin commented on LUCENE-1939: - The exception is thrown when ts#next (incrementToken) is called again after already having returned null (false) once. So this is a nice catch! But this means that RemoveDuplicatesTokenFilter in Solr calls incrementToken one extra time for some reason. Can you please post the complete stacktrace so I can take a look in there too? I suppose the expected behaviour would be that a token stream keeps returning false when incrementToken is called after it has already returned false, but the javadocs don't really say anything about this, nor is there a generic test case that ensures this for all filters. Thus this error might be present in other filters. I'll see if I can do something about that before committing. Thanks for the report Patrick! > IndexOutOfBoundsException at ShingleMatrixFilter's Iterator#hasNext method > -- > > Key: LUCENE-1939 > URL: https://issues.apache.org/jira/browse/LUCENE-1939 > Project: Lucene - Java > Issue Type: Bug > Components: contrib/analyzers >Affects Versions: 2.9 >Reporter: Patrick Jungermann >Assignee: Karl Wettin > Attachments: ShingleMatrixFilter_IndexOutOfBoundsException.patch > > > I tried to use the ShingleMatrixFilter within Solr. To test the functionality > etc., I first used the built-in field analysis view. The filter was configured > to be used only at query-time analysis with "_" as spacer character and a > min. and max. shingle size of 2. The generation of the shingles for query > strings with this filter seems to work at this view, but by turning on the > highlighting of indexed terms that will match the query terms, the exception > was thrown. Also, each time I tried to query the index the exception was > immediately thrown. 
> Stacktrace: > {code} > java.lang.IndexOutOfBoundsException: Index: 1, Size: 1 > at java.util.ArrayList.RangeCheck(Unknown Source) > at java.util.ArrayList.get(Unknown Source) > at > org.apache.lucene.analysis.shingle.ShingleMatrixFilter$Matrix$1.hasNext(ShingleMatrixFilter.java:729) > at > org.apache.lucene.analysis.shingle.ShingleMatrixFilter.next(ShingleMatrixFilter.java:380) > at org.apache.lucene.analysis.StopFilter.next(StopFilter.java:120) > at org.apache.lucene.analysis.TokenStream.next(TokenStream.java:47) > ... > {code} > Within the hasNext method, there is the {{s-1}}-th Column from the ArrayList > {{columns}} requested, but there isn't this entry within columns. > I created a patch that checks, if {{columns}} contains enough entries. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
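The contract discussed in the comment above (once incrementToken has returned false, further calls should keep returning false rather than fail) can be illustrated with a small self-contained sketch. This is an illustrative stand-in for the idea, not Lucene's TokenStream API:

```java
import java.util.Arrays;
import java.util.Iterator;

// Illustrative stand-in for a token stream: once exhausted, incrementToken()
// must keep returning false on every subsequent call instead of failing,
// even if a downstream consumer calls it one extra time.
public class ExhaustionSafeStream {
    private final Iterator<String> tokens;
    private String current;
    private boolean exhausted; // latched so calls after the end stay safe

    ExhaustionSafeStream(String... tokens) {
        this.tokens = Arrays.asList(tokens).iterator();
    }

    boolean incrementToken() {
        if (exhausted || !tokens.hasNext()) {
            exhausted = true; // never touch the underlying source again
            current = null;
            return false;
        }
        current = tokens.next();
        return true;
    }

    String current() { return current; }

    public static void main(String[] args) {
        ExhaustionSafeStream ts = new ExhaustionSafeStream("please");
        while (ts.incrementToken()) System.out.println(ts.current());
        // A consumer may call again after exhaustion, as apparently happened here:
        System.out.println(ts.incrementToken()); // prints "false", not an exception
        System.out.println(ts.incrementToken()); // still "false"
    }
}
```

A generic test asserting this latched-false behavior for every filter is exactly the kind of safety net the comment says was missing.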
[jira] Commented: (LUCENE-1257) Port to Java5
[ https://issues.apache.org/jira/browse/LUCENE-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761877#action_12761877 ] Karl Wettin commented on LUCENE-1257: - bq. Fix for InstantiatedIndex compile error caused by code committed in revision 821277 Committed in rev 821315 > Port to Java5 > - > > Key: LUCENE-1257 > URL: https://issues.apache.org/jira/browse/LUCENE-1257 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis, Examples, Index, Other, Query/Scoring, > QueryParser, Search, Store, Term Vectors >Affects Versions: 2.3.1 >Reporter: Cédric Champeau >Assignee: Uwe Schindler >Priority: Minor > Fix For: 3.0 > > Attachments: instantiated_fieldable.patch, java5.patch, > LUCENE-1257-Document.patch, LUCENE-1257-StringBuffer.patch, > LUCENE-1257-StringBuffer.patch, LUCENE-1257-StringBuffer.patch, > lucene1257surround1.patch, lucene1257surround1.patch, > shinglematrixfilter_generified.patch > > -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1257) Port to Java5
[ https://issues.apache.org/jira/browse/LUCENE-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wettin updated LUCENE-1257: Attachment: instantiated_fieldable.patch Fix for InstantiatedIndex compile error caused by code committed in revision 821277 List rather than List > Port to Java5 > - > > Key: LUCENE-1257 > URL: https://issues.apache.org/jira/browse/LUCENE-1257 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis, Examples, Index, Other, Query/Scoring, > QueryParser, Search, Store, Term Vectors >Affects Versions: 2.3.1 >Reporter: Cédric Champeau >Assignee: Uwe Schindler >Priority: Minor > Fix For: 3.0 > > Attachments: instantiated_fieldable.patch, java5.patch, > LUCENE-1257-Document.patch, LUCENE-1257-StringBuffer.patch, > LUCENE-1257-StringBuffer.patch, LUCENE-1257-StringBuffer.patch, > lucene1257surround1.patch, lucene1257surround1.patch, > shinglematrixfilter_generified.patch > > -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1257) Port to Java5
[ https://issues.apache.org/jira/browse/LUCENE-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761875#action_12761875 ] Karl Wettin commented on LUCENE-1257: - bq. how that? It asserted that a Document contained a List rather than List in ctor(IndexReader), which I actually think is true at that point using that code. > Port to Java5 > - > > Key: LUCENE-1257 > URL: https://issues.apache.org/jira/browse/LUCENE-1257 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis, Examples, Index, Other, Query/Scoring, > QueryParser, Search, Store, Term Vectors >Affects Versions: 2.3.1 >Reporter: Cédric Champeau >Assignee: Uwe Schindler >Priority: Minor > Fix For: 3.0 > > Attachments: java5.patch, LUCENE-1257-Document.patch, > LUCENE-1257-StringBuffer.patch, LUCENE-1257-StringBuffer.patch, > LUCENE-1257-StringBuffer.patch, lucene1257surround1.patch, > lucene1257surround1.patch, shinglematrixfilter_generified.patch > > -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1257) Port to Java5
[ https://issues.apache.org/jira/browse/LUCENE-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761874#action_12761874 ] Karl Wettin commented on LUCENE-1257:

bq. Generified ShingleMatrixFilter

Committed in rev 821311

> Port to Java5
> -
>
> Key: LUCENE-1257
> URL: https://issues.apache.org/jira/browse/LUCENE-1257
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Analysis, Examples, Index, Other, Query/Scoring, QueryParser, Search, Store, Term Vectors
> Affects Versions: 2.3.1
> Reporter: Cédric Champeau
> Assignee: Uwe Schindler
> Priority: Minor
> Fix For: 3.0
>
> Attachments: java5.patch, LUCENE-1257-Document.patch, LUCENE-1257-StringBuffer.patch, LUCENE-1257-StringBuffer.patch, LUCENE-1257-StringBuffer.patch, lucene1257surround1.patch, lucene1257surround1.patch, shinglematrixfilter_generified.patch
>
>
> For my needs I've updated Lucene so that it uses Java 5 constructs. I know Java 5 migration had been planned for 2.1 someday in the past, but I don't know when it is planned now. This patch against the trunk includes:
> - most obvious generics usage (there are tons of usages of sets, ... Those which are commonly used have been generified)
> - PriorityQueue generification
> - replacement of indexed for loops with for-each constructs
> - removal of unnecessary unboxing
> The code is in my opinion much more readable with those features (you actually *know* what is stored in collections reading the code, without the need to look up field definitions every time) and it simplifies many algorithms.
> Note that this patch also includes an interface for the Query class. This has been done for my company's needs, for building custom Query classes which add some behaviour to the base Lucene queries. It prevents multiple unnecessary casts. I know this introduction is not wanted by the team, but it really makes our developments easier to maintain. If you don't want to use this, replace all /Queriable/ calls with standard /Query/.

-- This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
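The Java 5 constructs the patch description lists can be illustrated side by side. A minimal before/after sketch — the method names and the frequency-summing task are hypothetical, not code from the patch:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class Java5Port {
    // Pre-Java-5 style: raw type, indexed for loop, explicit unboxing cast.
    static int sumRaw(List freqs) {
        int total = 0;
        for (int i = 0; i < freqs.size(); i++) {
            total += ((Integer) freqs.get(i)).intValue();
        }
        return total;
    }

    // Java 5 style: generics make the element type explicit, for-each
    // removes the index bookkeeping, autoboxing removes the cast.
    static int sumGenerified(List<Integer> freqs) {
        int total = 0;
        for (int freq : freqs) {
            total += freq;
        }
        return total;
    }

    public static void main(String[] args) {
        List<Integer> freqs = new ArrayList<Integer>(Arrays.asList(3, 4));
        System.out.println(sumGenerified(freqs)); // 7
    }
}
```

Both methods compute the same result; the generified version simply says so in its signature.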
[jira] Updated: (LUCENE-1257) Port to Java5
[ https://issues.apache.org/jira/browse/LUCENE-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wettin updated LUCENE-1257:

Attachment: shinglematrixfilter_generified.patch

Generified ShingleMatrixFilter
[jira] Commented: (LUCENE-1257) Port to Java5
[ https://issues.apache.org/jira/browse/LUCENE-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761870#action_12761870 ] Karl Wettin commented on LUCENE-1257:

bq. Generification of Document. It makes now clear what getFields() returns really. This was very bad documented. Now its a List.

This broke InstantiatedIndex in the trunk. Patch and commit are on the way.
[jira] Commented: (LUCENE-1939) IndexOutOfBoundsException at ShingleMatrixFilter's Iterator#hasNext method
[ https://issues.apache.org/jira/browse/LUCENE-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761868#action_12761868 ] Karl Wettin commented on LUCENE-1939:

Patrick, I can't manage to reproduce this error. Uwe is right though: you are getting this error using 2.4.1 or earlier, not by using 2.9.

bq. at org.apache.lucene.analysis.shingle.ShingleMatrixFilter$Matrix$1.hasNext(ShingleMatrixFilter.java:729)

Can you please try with 2.9? It would also be very helpful if you could list the applicable Solr configuration and some example data you are passing to the filter when the exception is thrown. Thanks in advance.

> IndexOutOfBoundsException at ShingleMatrixFilter's Iterator#hasNext method
> --
>
> Key: LUCENE-1939
> URL: https://issues.apache.org/jira/browse/LUCENE-1939
> Project: Lucene - Java
> Issue Type: Bug
> Components: contrib/analyzers
> Affects Versions: 2.9
> Reporter: Patrick Jungermann
> Assignee: Karl Wettin
> Attachments: ShingleMatrixFilter_IndexOutOfBoundsException.patch
>
>
> I tried to use the ShingleMatrixFilter within Solr. To test the functionality etc., I first used the built-in field analysis view. The filter was configured to be used only at query-time analysis, with "_" as the spacer character and a min. and max. shingle size of 2. The generation of shingles for query strings with this filter seemed to work in this view, but after turning on highlighting of indexed terms that match the query terms, the exception was thrown. Also, each time I tried to query the index the exception was immediately thrown.
> Stacktrace:
> {code}
> java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
> at java.util.ArrayList.RangeCheck(Unknown Source)
> at java.util.ArrayList.get(Unknown Source)
> at org.apache.lucene.analysis.shingle.ShingleMatrixFilter$Matrix$1.hasNext(ShingleMatrixFilter.java:729)
> at org.apache.lucene.analysis.shingle.ShingleMatrixFilter.next(ShingleMatrixFilter.java:380)
> at org.apache.lucene.analysis.StopFilter.next(StopFilter.java:120)
> at org.apache.lucene.analysis.TokenStream.next(TokenStream.java:47)
> ...
> {code}
> Within the hasNext method, the {{s-1}}-th column is requested from the ArrayList {{columns}}, but there is no such entry in columns.
> I created a patch that checks whether {{columns}} contains enough entries.
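A minimal sketch of the kind of guard the reporter's patch describes — verifying the list size before dereferencing the (s-1)-th column. The class and method names here are hypothetical stand-ins, not the actual ShingleMatrixFilter internals:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ColumnIterator {
    private final List<String> columns = new ArrayList<String>();

    public ColumnIterator(List<String> cols) {
        columns.addAll(cols);
    }

    // Without the guard, columns.get(s - 1) can be reached with
    // s - 1 >= columns.size(), throwing IndexOutOfBoundsException
    // exactly as in the reported stack trace.
    public boolean hasNextColumn(int s) {
        // The fix the patch describes: check that the list contains
        // enough entries before requesting the (s-1)-th column.
        if (s - 1 < 0 || s - 1 >= columns.size()) {
            return false;
        }
        return columns.get(s - 1) != null;
    }

    public static void main(String[] args) {
        ColumnIterator it = new ColumnIterator(Arrays.asList("a", "b"));
        System.out.println(it.hasNextColumn(3)); // false, rather than an exception
    }
}
```

The guard turns an out-of-range index into a clean "no next element" answer, which is the contract an Iterator's hasNext is supposed to honour.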
[jira] Commented: (LUCENE-1257) Port to Java5
[ https://issues.apache.org/jira/browse/LUCENE-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761862#action_12761862 ] Karl Wettin commented on LUCENE-1257:

bq. Wait ... do you mean you got rid of some of the reflection or did we lose your changes? I'm seeing some nasty slow reflection in there still ...

My changes were to the abstract Snowball stemmer class. I simply added an abstract method and got rid of the reflection in the Lucene filter. One could argue that we should update the Snowball compiler rather than the Java code it renders, but honestly I think we should just update the rendered code, report any improvements found to the Snowball mailing list, and keep track of them in the package readme.

bq. err... looks like perhaps its only hit once though and then reused.. maybe not so nasty. My first time looking at this code, so I'm sure you can clear it up ...

It could still be rather expensive per stem at query time. I vote for getting rid of it if we can. I'll take a look at it.
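The reflection removal Karl describes can be illustrated roughly like this. Everything here is a hypothetical stand-in for the generated Snowball sources — the class names, the toy stemming rule, and the `setAccessible` call (needed only because these sketch classes are package-private) are not the real code:

```java
import java.lang.reflect.Method;

// Sketch of the abstract base class with the added abstract stem()
// method: a direct, non-reflective entry point for the filter.
abstract class SnowballProgramSketch {
    protected final StringBuilder current = new StringBuilder();

    public void setCurrent(String value) {
        current.setLength(0);
        current.append(value);
    }

    public String getCurrent() {
        return current.toString();
    }

    public abstract boolean stem();
}

// Toy "stemmer" standing in for a generated language stemmer:
// strips a single trailing 's'. Not the real English algorithm.
class EnglishStemmerSketch extends SnowballProgramSketch {
    @Override
    public boolean stem() {
        int len = current.length();
        if (len > 1 && current.charAt(len - 1) == 's') {
            current.setLength(len - 1);
        }
        return true;
    }
}

public class StemDemo {
    // Old style: resolve and invoke stem() reflectively on each call.
    static String stemViaReflection(SnowballProgramSketch stemmer, String term) throws Exception {
        stemmer.setCurrent(term);
        Method m = stemmer.getClass().getMethod("stem");
        m.setAccessible(true); // needed here because the sketch classes are package-private
        m.invoke(stemmer);
        return stemmer.getCurrent();
    }

    // New style: a plain virtual call through the abstract method.
    static String stemDirect(SnowballProgramSketch stemmer, String term) {
        stemmer.setCurrent(term);
        stemmer.stem();
        return stemmer.getCurrent();
    }

    public static void main(String[] args) {
        System.out.println(stemDirect(new EnglishStemmerSketch(), "cats")); // cat
    }
}
```

The direct call removes per-stem reflection cost entirely, which matters when stemming runs once per term at query time.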
[jira] Commented: (LUCENE-1257) Port to Java5
[ https://issues.apache.org/jira/browse/LUCENE-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761755#action_12761755 ] Karl Wettin commented on LUCENE-1257:

bq. I vote to move to StringBuilder anyway if its in Contrib. Though probably not with Snowball, since we don't really write/maintain that code.

Actually I patched the Snowball stemmer code to get rid of the use of reflection, so what we use is an altered version of their code. I have tried for years to get Dr Porter to commit those changes, but it's still the same. Based on this I think we could just keep going with our own changes in there, as long as we keep a record of what we have done in case we want to merge with their trunk.
[jira] Commented: (LUCENE-1939) IndexOutOfBoundsException at ShingleMatrixFilter's Iterator#hasNext method
[ https://issues.apache.org/jira/browse/LUCENE-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761712#action_12761712 ] Karl Wettin commented on LUCENE-1939:

bq. I also think so, because the above stack dump seems to be from 2.4.1 (in 2.9 there should be incrementToken() instead of next() for all filters listed there).

Ah, I misunderstood your comment. The thing is that ShingleMatrixFilter was left using the old API because of its complexity. I told whoever gave it a shot that I'd look into upgrading it; I just haven't had time to do so yet. There will be a new generified and updated version of the filter any year now. At least before 3.0.
[jira] Commented: (LUCENE-1939) IndexOutOfBoundsException at ShingleMatrixFilter's Iterator#hasNext method
[ https://issues.apache.org/jira/browse/LUCENE-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761706#action_12761706 ] Karl Wettin commented on LUCENE-1939:

bq. Is this caused by the rewrite because of the new TokenStream API?

Nah, I think it's just a miss in the code that was never caught before. Not sure though, so I'll write a test or two this weekend.
[jira] Assigned: (LUCENE-1939) IndexOutOfBoundsException at ShingleMatrixFilter's Iterator#hasNext method
[ https://issues.apache.org/jira/browse/LUCENE-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wettin reassigned LUCENE-1939:
---
Assignee: Karl Wettin
[jira] Commented: (LUCENE-625) Query auto completer
[ https://issues.apache.org/jira/browse/LUCENE-625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12736923#action_12736923 ] Karl Wettin commented on LUCENE-625:

bq. Karl, did you ever proceed on this patch? I'm interested in adding autosuggest to Solr.

I used this patch for a few things a couple of years ago. If I recall everything right, I ended up using the bootstrapped apriori corpus of LUCENE-626 as training data the last time. That made the corpus rather small and speedy, yet still relevant for most users. But the major caveat is that this patch is a trie and is thus a "precise forward only" thing, so it might not fit all use cases. It might be easier to get things going using an index with ngrams of untokenized user queries (i.e. including whitespace) or subject-like fields. But I really prefer user queries, as using only the last n queries makes the suggester sensitive to trends. That does however require quite a bit of data to work well. A lot, as in hundreds of thousands of user queries, in my experience. Not sure if this was an answer to your question.. : )

> Query auto completer
>
> Key: LUCENE-625
> URL: https://issues.apache.org/jira/browse/LUCENE-625
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Search
> Reporter: Karl Wettin
> Priority: Minor
> Attachments: autocomplete_0.0.1.tar.gz, autocomplete_20060730.tar.gz
>
>
> A trie that helps users type in their query. Made for AJAX; works great with the Ruby on Rails common scripts <http://script.aculo.us/>. Similar to the Google Labs suggester.
> Trained by user queries. Optimizable. Uses an in-memory corpus. Serializable.
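The "precise forward only" behaviour of a query trie can be sketched as follows. This is a hedged illustration with hypothetical names, not the API of the attached patch: suggestions exist only for exact leading prefixes of trained queries, which is the caveat described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class QueryTrie {
    private final TreeMap<Character, QueryTrie> children = new TreeMap<Character, QueryTrie>();
    private boolean terminal;

    // Insert one user query, one character per trie node.
    public void train(String query) {
        QueryTrie node = this;
        for (char c : query.toCharArray()) {
            QueryTrie child = node.children.get(c);
            if (child == null) {
                child = new QueryTrie();
                node.children.put(c, child);
            }
            node = child;
        }
        node.terminal = true;
    }

    // "Precise forward only": walk the exact prefix; if any character
    // is missing there are no suggestions at all (no fuzzy matching).
    public List<String> suggest(String prefix) {
        List<String> out = new ArrayList<String>();
        QueryTrie node = this;
        for (char c : prefix.toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return out;
            }
        }
        collect(node, new StringBuilder(prefix), out);
        return out;
    }

    private static void collect(QueryTrie node, StringBuilder path, List<String> out) {
        if (node.terminal) {
            out.add(path.toString());
        }
        for (Map.Entry<Character, QueryTrie> e : node.children.entrySet()) {
            path.append(e.getKey());
            collect(e.getValue(), path, out);
            path.setLength(path.length() - 1);
        }
    }

    public static void main(String[] args) {
        QueryTrie trie = new QueryTrie();
        trie.train("lucene query");
        trie.train("lucene norms");
        System.out.println(trie.suggest("lucene "));
    }
}
```

A misspelled or reordered prefix returns nothing, which is exactly why the comment suggests an ngram index as a more forgiving alternative.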
[jira] Commented: (LUCENE-1260) Norm codec strategy in Similarity
[ https://issues.apache.org/jira/browse/LUCENE-1260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722575#action_12722575 ] Karl Wettin commented on LUCENE-1260:

Hi Johan, I didn't try it out yet but the patch looks nice and clean. +1 from me. Let's try to convince some of the old -1:ers. YONIK? See, it's not just me. ; )

I do however still think it would be nice with the serializable codec interface as in the previous patches, so that all applications can use the index as intended (Luke and whatnot): 256 bytes stored to a file, by default backed by a binary search or so, unless there is a registered codec that handles it algorithmically. I'll copy and paste that in as an alternative suggestion ASAP.

(I think the next move should be to allow for per-field variable norm resolution, but that is a whole new issue.)

> Norm codec strategy in Similarity
> -
>
> Key: LUCENE-1260
> URL: https://issues.apache.org/jira/browse/LUCENE-1260
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Search
> Affects Versions: 2.3.1
> Reporter: Karl Wettin
> Attachments: Lucene-1260.patch, LUCENE-1260.txt, LUCENE-1260.txt, LUCENE-1260.txt
>
>
> The static span and resolution of the 8 bit norms codec might not fit all applications.
> My use case requires that 100f-250f is discretized in 60 bags instead of the default.. 10?
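The 60-bag discretization of 100f-250f mentioned in the issue description could look roughly like this. This is a hedged sketch with a uniform lookup table and hypothetical names — it is not Lucene's actual Similarity norm codec, which uses a different 8-bit float encoding:

```java
public class NormCodecSketch {
    private static final float MIN = 100f, MAX = 250f;
    private static final int BAGS = 60;

    // Precomputed decode table: 60 evenly spaced values over [MIN, MAX].
    // A serialized table like this is what the comment above suggests
    // storing in the index so other tools can decode the norms.
    private static final float[] TABLE = new float[BAGS];
    static {
        for (int i = 0; i < BAGS; i++) {
            TABLE[i] = MIN + (MAX - MIN) * i / (BAGS - 1);
        }
    }

    // Encode: clamp into range, then round to the nearest bag index.
    // For a non-uniform table a binary search over TABLE would serve
    // the same purpose, as the comment hints.
    public static byte encode(float norm) {
        float clamped = Math.max(MIN, Math.min(MAX, norm));
        return (byte) Math.round((clamped - MIN) / (MAX - MIN) * (BAGS - 1));
    }

    // Decode: a table lookup, the cheap direction at search time.
    public static float decode(byte code) {
        return TABLE[code & 0xFF];
    }

    public static void main(String[] args) {
        System.out.println(decode(encode(175f)));
    }
}
```

Round-tripping any norm loses at most half a bag width (about 1.27f here), which is the resolution/span trade-off the issue is about.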
[jira] Resolved: (LUCENE-1578) InstantiatedIndex supports non-optimized IndexReaders
[ https://issues.apache.org/jira/browse/LUCENE-1578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wettin resolved LUCENE-1578.
-
Resolution: Fixed

Committed.

> InstantiatedIndex supports non-optimized IndexReaders
> -
>
> Key: LUCENE-1578
> URL: https://issues.apache.org/jira/browse/LUCENE-1578
> Project: Lucene - Java
> Issue Type: Improvement
> Components: contrib/*
> Affects Versions: 2.4.1
> Reporter: Jason Rutherglen
> Assignee: Karl Wettin
> Fix For: 2.9
>
> Attachments: LUCENE-1578.txt
>
> Original Estimate: 72h
> Remaining Estimate: 72h
>
> InstantiatedIndex does not currently support non-optimized IndexReaders.
Re: svn commit: r784481 - in /lucene/java/trunk/contrib: ./ instantiated/src/java/org/apache/lucene/store/instantiated/ instantiated/src/test/org/apache/lucene/store/instantiated/
oops, an error in the code. I'm on it.

On 13 Jun 2009, at 23.54, ka...@apache.org wrote:

Author: kalle
Date: Sat Jun 13 21:54:07 2009
New Revision: 784481
URL: http://svn.apache.org/viewvc?rev=784481&view=rev
Log: LUCENE-1578: Support for loading unoptimized readers to the constructor of InstantiatedIndex. (Karl Wettin)

Added:
lucene/java/trunk/contrib/instantiated/src/test/org/apache/lucene/store/instantiated/TestUnoptimizedReaderOnConstructor.java
Modified:
lucene/java/trunk/contrib/CHANGES.txt
lucene/java/trunk/contrib/instantiated/src/java/org/apache/lucene/store/instantiated/InstantiatedIndex.java
lucene/java/trunk/contrib/instantiated/src/test/org/apache/lucene/store/instantiated/TestIndicesEquals.java
lucene/java/trunk/contrib/instantiated/src/test/org/apache/lucene/store/instantiated/TestRealTime.java

Modified: lucene/java/trunk/contrib/CHANGES.txt
URL: http://svn.apache.org/viewvc/lucene/java/trunk/contrib/CHANGES.txt?rev=784481&r1=784480&r2=784481&view=diff
==============================================================================
--- lucene/java/trunk/contrib/CHANGES.txt (original)
+++ lucene/java/trunk/contrib/CHANGES.txt Sat Jun 13 21:54:07 2009
@@ -62,8 +62,11 @@
    (Xiaoping Gao via Mike McCandless)
-6. LUCENE-1676: Added DelimitedPayloadTokenFilter class for automatically adding payloads "in-stream" (Grant Ingersoll)
-
+ 6. LUCENE-1676: Added DelimitedPayloadTokenFilter class for automatically adding payloads "in-stream" (Grant Ingersoll)
+
+ 7. LUCENE-1578: Support for loading unoptimized readers to the
+    constructor of InstantiatedIndex. (Karl Wettin)
+
 Optimizations

 1. LUCENE-1643: Re-use the collation key (RawCollationKey) for

Modified: lucene/java/trunk/contrib/instantiated/src/java/org/apache/lucene/store/instantiated/InstantiatedIndex.java
URL: http://svn.apache.org/viewvc/lucene/java/trunk/contrib/instantiated/src/java/org/apache/lucene/store/instantiated/InstantiatedIndex.java?rev=784481&r1=784480&r2=784481&view=diff
==============================================================================
--- lucene/java/trunk/contrib/instantiated/src/java/org/apache/lucene/store/instantiated/InstantiatedIndex.java (original)
+++ lucene/java/trunk/contrib/instantiated/src/java/org/apache/lucene/store/instantiated/InstantiatedIndex.java Sat Jun 13 21:54:07 2009
@@ -110,7 +110,8 @@
   public InstantiatedIndex(IndexReader sourceIndexReader, Set fields) throws IOException {
     if (!sourceIndexReader.isOptimized()) {
-      throw new IOException("Source index is not optimized.");
+      System.out.println(("Source index is not optimized."));
+      //throw new IOException("Source index is not optimized.");
     }
@@ -170,11 +171,14 @@
     }
-    documentsByNumber = new InstantiatedDocument[sourceIndexReader.numDocs()];
+    documentsByNumber = new InstantiatedDocument[sourceIndexReader.maxDoc()];
+
     // create documents
-    for (int i = 0; i < sourceIndexReader.numDocs(); i++) {
-      if (!sourceIndexReader.isDeleted(i)) {
+    for (int i = 0; i < sourceIndexReader.maxDoc(); i++) {
+      if (sourceIndexReader.isDeleted(i)) {
+        deletedDocuments.add(i);
+      } else {
         InstantiatedDocument document = new InstantiatedDocument();
         // copy stored fields from source reader
         Document sourceDocument = sourceIndexReader.document(i);
@@ -259,6 +263,9 @@
     // load offsets to term-document informations
     for (InstantiatedDocument document : getDocumentsByNumber()) {
+      if (document == null) {
+        continue; // deleted
+      }
       for (Field field : (List) document.getDocument().getFields()) {
         if (field.isTermVectorStored() && field.isStoreOffsetWithTermVector()) {
           TermPositionVector termPositionVector = (TermPositionVector) sourceIndexReader.getTermFreqVector(document.getDocumentNumber(), field.name());

Modified: lucene/java/trunk/contrib/instantiated/src/test/org/apache/lucene/store/instantiated/TestIndicesEquals.java
URL: http://svn.apache.org/viewvc/lucene/java/trunk/contrib/instantiated/src/test/org/apache/lucene/store/instantiated/TestIndicesEquals.java?rev=784481&r1=784480&r2=784481&view=diff
==============================================================================
--- lucene/java/trunk/contrib/instantiated/src/test/org/apache/lucene/store/instantiated/TestIndicesEquals.java (original)
+++ lucene/java/trunk/contrib/instantiated/src/test/org/apache/lucene/store/instantiated/TestIndicesEquals.java Sat Jun 13 21:54:07 2009
@@ -40,6 +40,10 @@
 import org.apache.lucene.index.TermPositions;
 import org.apache.lucene.store.Directory;
 import org.apache.lucene.store.RAMDirectory;
+import org.apache.lucene.search.IndexSearcher;
+import org.apache.lucene.search.TermQuery;
+import org.apac
[jira] Issue Comment Edited: (LUCENE-1491) EdgeNGramTokenFilter stops on tokens smaller then minimum gram size.
[ https://issues.apache.org/jira/browse/LUCENE-1491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715712#action_12715712 ] Karl Wettin edited comment on LUCENE-1491 at 6/2/09 2:51 PM: - Although you have a valid point I'd like to argue this a bit. My arguments are probably considered silly by some. Perhaps it's just me that uses ngrams for something completely different than what everybody else does, but here we go: adding the feature as suggested by this patch is, in my view, fixing the symptoms of bad use of character ngrams.
BOL, EOL, whitespace and punctuation are all valid parts of character ngrams that can increase precision/recall quite a bit. EdgeNGrams could sort of be considered such data too. So what I'm saying here is that I consider your example a bad use of character ngrams: the whole sentence should have been grammed up. So in the case of 4-grams the output would end up as: "to b", "o be", " be ", "be o", and so on. Perhaps even "$to ", "to b", "o be", and so on. Supporting what I suggest will of course mean quite a bit more work: a whole new filter that also does input text normalization, such as removing double spaces and what not. That will probably not be implemented anytime soon. But adding the features in the patch to the filter means that this use is endorsed by the community, and I'm not sure that's a good idea. I thus think it would be better with some sort of secondary filter that does the exact same thing as the patch. Perhaps I should leave this issue alone and do some more work with LUCENE-1306 > EdgeNGramTokenFilter stops on tokens smaller then minimum gram size. > > > Key: LUCENE-1491 > URL: https://issues.apache.org/jira/browse/LUCENE-1491 > Project: Lucene - Java > Issue Type: Bug > Components: Analysis >Affects Versions: 2.4, 2.4.1, 2.9, 3.0 >Reporter: Todd Feak >Assignee: Otis Gospodnetic > Fix For: 2.9 > > Attachments: LUCENE-1491.patch > > > If a token is encountered in the stream that is shorter in length than the > min gram size, the filter will stop processing the token stream. > Working up a unit test now, but may be a few days before I can provide it. > Wanted to get it in the system. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
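Karl's whole-sentence gramming can be reproduced in a few lines of plain Java. This is a self-contained sketch; `charNGrams` is a hypothetical helper, not part of any actual Lucene filter:

```java
import java.util.ArrayList;
import java.util.List;

public class SentenceGrams {
    // Slide a window of size n over the whole sentence, keeping whitespace
    // (and, if present, a '$' boundary marker) as part of each gram.
    static List<String> charNGrams(String text, int n) {
        List<String> grams = new ArrayList<String>();
        for (int i = 0; i + n <= text.length(); i++) {
            grams.add(text.substring(i, i + n));
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(charNGrams("to be o", 4)); // [to b, o be,  be , be o]
        System.out.println(charNGrams("$to be", 4));  // [$to , to b, o be]
    }
}
```

Note that prepending the `$` sentinel is what produces the "$to " gram from the comment; the space inside " be " is a legitimate part of the gram, which is exactly the point being argued.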
[jira] Commented: (LUCENE-1491) EdgeNGramTokenFilter stops on tokens smaller then minimum gram size.
[ https://issues.apache.org/jira/browse/LUCENE-1491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715567#action_12715567 ] Karl Wettin commented on LUCENE-1491: - bq. Perhaps we need boolean keepSmaller somewhere, so we can explicitly control the behaviour? I'm not sure. Is there a use case for this or is it an XY-problem? > EdgeNGramTokenFilter stops on tokens smaller then minimum gram size. > > > Key: LUCENE-1491 > URL: https://issues.apache.org/jira/browse/LUCENE-1491 > Project: Lucene - Java > Issue Type: Bug > Components: Analysis >Affects Versions: 2.4, 2.4.1, 2.9, 3.0 >Reporter: Todd Feak >Assignee: Otis Gospodnetic > Fix For: 2.9 > > Attachments: LUCENE-1491.patch > > > If a token is encountered in the stream that is shorter in length than the > min gram size, the filter will stop processing the token stream. > Working up a unit test now, but may be a few days before I can provide it. > Wanted to get it in the system. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
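For reference, the behaviour the proposed `keepSmaller` flag would toggle can be sketched in plain Java. This is a hypothetical helper, not the actual EdgeNGramTokenFilter code:

```java
import java.util.ArrayList;
import java.util.List;

public class KeepSmallerSketch {
    // Front-edge n-grams with the proposed keepSmaller switch: tokens shorter
    // than minGram are either passed through unchanged or dropped.
    static List<String> frontEdgeNGrams(String token, int minGram, int maxGram, boolean keepSmaller) {
        List<String> out = new ArrayList<String>();
        if (token.length() < minGram) {
            if (keepSmaller) {
                out.add(token); // emit the short token as-is instead of swallowing it
            }
            return out;
        }
        for (int n = minGram; n <= Math.min(maxGram, token.length()); n++) {
            out.add(token.substring(0, n));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(frontEdgeNGrams("a", 2, 3, true));     // [a]
        System.out.println(frontEdgeNGrams("a", 2, 3, false));    // []
        System.out.println(frontEdgeNGrams("abcd", 2, 3, false)); // [ab, abc]
    }
}
```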
Re: HitCollector#collect(int,float,Collection)
So, I've been sleeping on this for a few weeks. Would it be possible to solve this with a decorator? Perhaps a top level decorator that also decorates all subqueries at rewrite-time and then keeps the instantiated scorers bound to the top level decorator, i.e. makes the decorated query non-reusable. Query realQuery = ... DecoratedQuery dq = new DecoratedQuery(realQuery); searcher.search(dq, ..); Map dq.getScoringQueries(); Not quite sure if this is terrible or elegant. karl On 7 Apr 2009, at 12:17, Michael McCandless wrote: On Tue, Apr 7, 2009 at 6:13 AM, Karl Wettin wrote: On 7 Apr 2009, at 10:23, Michael McCandless wrote: Do you mean tracking the "atomic queries" that caused a given hit to match (where "atomic query" is a query that actually uses TermDocs/Positions to check matching, vs other queries like BooleanQuery that "glomm together" sub-query matches)? EG for a boolean query w/ N clauses, which of those N clauses matched? This is exactly what I mean. I do however think it makes sense to get information about non-atomic queries, as it seems reasonable that knowing that the first clause (boolean query '+(a b)') in '+(a b) -(+c +d)' is matching is more interesting than only getting to know that one of the clauses of that boolean query is matching. Ahh OK I agree. So every query in the full tree should be able to state whether it matched the doc. A natural place to do this is the Scorer API, i.e. extend it with a "getMatchingAtomicQueries" or some such. Probably, for efficiency, each Query should be pre-assigned an int position, and then the matching is represented as a bit array, reused across matches. Your collector could then ask the scorer for these bits if it wanted. There should be no performance cost for collectors that don't use this functionality. I'll look into it. Thanks for the feedback.
karl - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1578) InstantiatedIndex supports non-optimized IndexReaders
[ https://issues.apache.org/jira/browse/LUCENE-1578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715494#action_12715494 ] Karl Wettin commented on LUCENE-1578: - Jason, did you get a chance to try this out? It seems to work fine for me and I plan to pop it in the trunk in a few days. I think I'll have to add a warning of some kind at runtime though, as it could slow down the index a bit if the reader is heavily fragmented. > InstantiatedIndex supports non-optimized IndexReaders > - > > Key: LUCENE-1578 > URL: https://issues.apache.org/jira/browse/LUCENE-1578 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/* >Affects Versions: 2.4.1 >Reporter: Jason Rutherglen >Assignee: Karl Wettin > Fix For: 2.9 > > Attachments: LUCENE-1578.txt > > Original Estimate: 72h > Remaining Estimate: 72h > > InstantiatedIndex does not currently support non-optimized IndexReaders. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Assigned: (LUCENE-1578) InstantiatedIndex supports non-optimized IndexReaders
[ https://issues.apache.org/jira/browse/LUCENE-1578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wettin reassigned LUCENE-1578: --- Assignee: Karl Wettin > InstantiatedIndex supports non-optimized IndexReaders > - > > Key: LUCENE-1578 > URL: https://issues.apache.org/jira/browse/LUCENE-1578 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/* >Affects Versions: 2.4.1 >Reporter: Jason Rutherglen >Assignee: Karl Wettin > Fix For: 2.9 > > Attachments: LUCENE-1578.txt > > Original Estimate: 72h > Remaining Estimate: 72h > > InstantiatedIndex does not currently support non-optimized IndexReaders. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1260) Norm codec strategy in Similarity
[ https://issues.apache.org/jira/browse/LUCENE-1260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715492#action_12715492 ] Karl Wettin commented on LUCENE-1260: - bq. Wouldn't the simplest solution be to refactor out the static methods, replace them with instance methods and remove the getNormDecoder method? This would enable a pluggable behavior without introducing a new Codec. Hi Johan, feel free to post a patch! > Norm codec strategy in Similarity > - > > Key: LUCENE-1260 > URL: https://issues.apache.org/jira/browse/LUCENE-1260 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Affects Versions: 2.3.1 >Reporter: Karl Wettin > Attachments: LUCENE-1260.txt, LUCENE-1260.txt, LUCENE-1260.txt > > > The static span and resolution of the 8 bit norms codec might not fit with > all applications. > My use case requires that 100f-250f is discretized in 60 bags instead of the > default.. 10? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
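The pluggable codec Johan suggests might look roughly like this. `NormCodec` and `LinearNormCodec` are hypothetical names, not Lucene's actual Similarity API; the 100f-250f span and 60 bags come from the issue description:

```java
public class NormCodecSketch {
    // Hypothetical pluggable codec: instance methods instead of the static
    // Similarity encode/decode, so an application can pick its own span.
    interface NormCodec {
        byte encode(float norm);
        float decode(byte b);
    }

    // Linear codec discretizing [min, max] into `bags` buckets,
    // e.g. 100f-250f in 60 bags as in the issue's use case.
    static class LinearNormCodec implements NormCodec {
        private final float min, max;
        private final int bags;
        LinearNormCodec(float min, float max, int bags) {
            this.min = min; this.max = max; this.bags = bags;
        }
        public byte encode(float norm) {
            float clamped = Math.max(min, Math.min(max, norm));
            return (byte) Math.round((clamped - min) / (max - min) * (bags - 1));
        }
        public float decode(byte b) {
            return min + (b & 0xFF) * (max - min) / (bags - 1);
        }
    }

    public static void main(String[] args) {
        NormCodec codec = new LinearNormCodec(100f, 250f, 60);
        System.out.println(codec.encode(100f)); // 0
        System.out.println(codec.encode(250f)); // 59
        System.out.println(codec.decode(codec.encode(175f))); // close to 175 (one bag is ~2.5 wide)
    }
}
```

With 60 bags the codec fits comfortably in the existing one byte per norm, which is presumably why the instance-method refactoring alone would be enough here.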
Re: InstantiatedIndex Memory required
Hi Ravichandra, this is a question better suited to the java-user mailing list. On this list we talk about the development of the Lucene API rather than how to use it. To answer your question, there is no simple formula that says how much RAM an InstantiatedIndex will consume given the FSDirectory or RAMDirectory size. Your index is however probably way too large for InstantiatedIndex to be considerably faster than RAMDirectory. There is a diagram in the Javadocs that shows the speed on a Reuters index as it grows in size: http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/store/instantiated/package-summary.html#package_description As mileage varies with term saturation you should still try benchmarking and see if there is anything to be gained. Try increasing Xmx to whatever you have; you can also take a look at -XX:+AggressiveHeap. karl On 12 May 2009, at 18:43, thiruvee wrote: Hi So far I am using RAMDirectory for my indexes. To meet the SLA of our project, I thought of using InstantiatedIndex. But when I used that, I am not able to get any output from it, and it's throwing an out of memory error. What is the ratio between index size and memory size when using InstantiatedIndex? Here are my index details: Index size : 200MB RAM Size : 1 GB If I try with a small test index of size 100KB, it's working. Please help me with this. Thanks Ravichandra -- View this message in context: http://www.nabble.com/InstantiatedIndex-Memory-required-tp23506231p23506231.html Sent from the Lucene - Java Developer mailing list archive at Nabble.com. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
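Since there is no formula, one rough empirical approach is to measure used heap before and after loading the index. Below is a minimal sketch; the load step is simulated with a plain byte array so the code is self-contained, where in real use it would be the InstantiatedIndex constructor. Heap measurements like this are approximate because GC timing adds noise:

```java
public class HeapDeltaSketch {
    // Rough empirical check of how much heap a data structure costs:
    // measure used heap before and after loading it.
    static long usedHeap() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    static long measureDelta() {
        System.gc(); // encourage a clean baseline (best effort only)
        long before = usedHeap();
        // Stand-in for e.g. new InstantiatedIndex(indexReader):
        byte[] simulatedIndex = new byte[16 * 1024 * 1024];
        long after = usedHeap();
        if (simulatedIndex.length == 0) return -1; // keep the reference live
        return after - before;
    }

    public static void main(String[] args) {
        System.out.println("approx bytes consumed: " + measureDelta());
    }
}
```

Running this twice with and without the real load step gives a ballpark figure for how far -Xmx needs to be raised.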
Re: HitCollector#collect(int,float,Collection)
On 7 Apr 2009, at 10:23, Michael McCandless wrote: Do you mean tracking the "atomic queries" that caused a given hit to match (where "atomic query" is a query that actually uses TermDocs/Positions to check matching, vs other queries like BooleanQuery that "glomm together" sub-query matches)? EG for a boolean query w/ N clauses, which of those N clauses matched? This is exactly what I mean. I do however think it makes sense to get information about non-atomic queries, as it seems reasonable that knowing that the first clause (boolean query '+(a b)') in '+(a b) -(+c +d)' is matching is more interesting than only getting to know that one of the clauses of that boolean query is matching. A natural place to do this is the Scorer API, i.e. extend it with a "getMatchingAtomicQueries" or some such. Probably, for efficiency, each Query should be pre-assigned an int position, and then the matching is represented as a bit array, reused across matches. Your collector could then ask the scorer for these bits if it wanted. There should be no performance cost for collectors that don't use this functionality. I'll look into it. Thanks for the feedback. karl - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
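The reused-bit-array idea from this thread could be sketched with java.util.BitSet as follows. All class and method names here are hypothetical stand-ins, not actual Lucene API:

```java
import java.util.BitSet;

public class MatchingQueriesSketch {
    // Every atomic query is pre-assigned a fixed int position up front; per
    // document, the scorer sets the bit for each atomic query that matched,
    // and a collector may (but need not) inspect the bits. The BitSet is
    // reused across matches, so collectors that ignore it pay nothing.
    static final int TERM_A = 0, TERM_B = 1, TERM_C = 2;

    private final BitSet matching = new BitSet(3); // reused for every doc

    void startDoc() { matching.clear(); }
    void recordMatch(int queryPosition) { matching.set(queryPosition); }
    BitSet getMatchingAtomicQueries() { return matching; }

    public static void main(String[] args) {
        MatchingQueriesSketch s = new MatchingQueriesSketch();
        s.startDoc();
        s.recordMatch(TERM_A);
        s.recordMatch(TERM_C);
        System.out.println(s.getMatchingAtomicQueries()); // {0, 2}
        s.startDoc(); // next doc: same BitSet instance, no reallocation
        System.out.println(s.getMatchingAtomicQueries()); // {}
    }
}
```

A collector that wants to keep the information past the current hit must copy the BitSet, since the scorer overwrites it on the next document.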
HitCollector#collect(int,float,Collection)
How crazy would it be to refactor HitCollector so it also accepts the matching queries? Let's ignore my use case (not sure it makes sense yet; it's related to finding a threshold between probably interesting and definitely not interesting results of huge OR-statements, but I really have to try it out before I can say if it's any good) and just focus on the speed impact. If I cleared and reused the Collection passed down to the HitCollector then it shouldn't really slow things down, right? And if I reused the collections in my TopDocsCollector as low scoring results were pushed down then it shouldn't have to be expensive there either. Or? karl - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
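The clear-and-reuse pattern asked about above would look roughly like this. The extended collect signature is hypothetical, not the real HitCollector API:

```java
import java.util.ArrayList;
import java.util.List;

public class ReusedCollectionSketch {
    // Hypothetical extended collect signature: the caller clears and refills
    // one List per hit instead of allocating a new collection every time.
    interface HitCollector {
        void collect(int doc, float score, List<String> matchingQueries);
    }

    static List<Integer> collectThreeDocs() {
        final List<Integer> collectedDocs = new ArrayList<Integer>();
        HitCollector collector = new HitCollector() {
            public void collect(int doc, float score, List<String> matchingQueries) {
                // a collector must copy matchingQueries if it wants to keep
                // the data past this call, since the caller will clear it
                collectedDocs.add(doc);
            }
        };
        List<String> reused = new ArrayList<String>(); // one instance for the whole search
        for (int doc = 0; doc < 3; doc++) {
            reused.clear();            // no new allocation per hit
            reused.add("query" + doc); // whichever queries matched this doc
            collector.collect(doc, 1.0f, reused);
        }
        return collectedDocs;
    }

    public static void main(String[] args) {
        System.out.println(collectThreeDocs()); // [0, 1, 2]
    }
}
```

The cost per hit is then a clear() plus a few adds, which supports the point that reuse should not slow collection down noticeably.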
[jira] Commented: (LUCENE-1039) Bayesian classifiers using Lucene as data store
[ https://issues.apache.org/jira/browse/LUCENE-1039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12693744#action_12693744 ] Karl Wettin commented on LUCENE-1039: - Vaijanath, can you please post a small test case that demonstrates the problem? > Bayesian classifiers using Lucene as data store > --- > > Key: LUCENE-1039 > URL: https://issues.apache.org/jira/browse/LUCENE-1039 > Project: Lucene - Java > Issue Type: New Feature > Reporter: Karl Wettin > Assignee: Karl Wettin >Priority: Minor > Attachments: LUCENE-1039.txt > > > Bayesian classifiers using Lucene as data store. Based on the Naive Bayes and > Fisher method algorithms as described by Toby Segaran in "Programming > Collective Intelligence", ISBN 978-0-596-52932-1. > Have fun. > Poor java docs, but the TestCase shows how to use it: > {code:java} > public class TestClassifier extends TestCase { > public void test() throws Exception { > InstanceFactory instanceFactory = new InstanceFactory() { > public Document factory(String text, String _class) { > Document doc = new Document(); > doc.add(new Field("class", _class, Field.Store.YES, > Field.Index.NO_NORMS)); > doc.add(new Field("text", text, Field.Store.YES, Field.Index.NO, > Field.TermVector.NO)); > doc.add(new Field("text/ngrams/start", text, Field.Store.NO, > Field.Index.TOKENIZED, Field.TermVector.YES)); > doc.add(new Field("text/ngrams/inner", text, Field.Store.NO, > Field.Index.TOKENIZED, Field.TermVector.YES)); > doc.add(new Field("text/ngrams/end", text, Field.Store.NO, > Field.Index.TOKENIZED, Field.TermVector.YES)); > return doc; > } > Analyzer analyzer = new Analyzer() { > private int minGram = 2; > private int maxGram = 3; > public TokenStream tokenStream(String fieldName, Reader reader) { > TokenStream ts = new StandardTokenizer(reader); > ts = new LowerCaseFilter(ts); > if (fieldName.endsWith("/ngrams/start")) { > ts = new EdgeNGramTokenFilter(ts, > EdgeNGramTokenFilter.Side.FRONT, minGram, maxGram); > 
} else if (fieldName.endsWith("/ngrams/inner")) { > ts = new NGramTokenFilter(ts, minGram, maxGram); > } else if (fieldName.endsWith("/ngrams/end")) { > ts = new EdgeNGramTokenFilter(ts, EdgeNGramTokenFilter.Side.BACK, > minGram, maxGram); > } > return ts; > } > }; > public Analyzer getAnalyzer() { > return analyzer; > } > }; > Directory dir = new RAMDirectory(); > new IndexWriter(dir, null, true).close(); > Instances instances = new Instances(dir, instanceFactory, "class"); > instances.addInstance("hello world", "en"); > instances.addInstance("hallå världen", "sv"); > instances.addInstance("this is london calling", "en"); > instances.addInstance("detta är london som ringer", "sv"); > instances.addInstance("john has a long mustache", "en"); > instances.addInstance("john har en lång mustache", "sv"); > instances.addInstance("all work and no play makes jack a dull boy", "en"); > instances.addInstance("att bara arbeta och aldrig leka gör jack en trist > gosse", "sv"); > instances.addInstance("shrimp sandwich", "en"); > instances.addInstance("räksmörgås", "sv"); > instances.addInstance("it's now or never", "en"); > instances.addInstance("det är nu eller aldrig", "sv"); > instances.addInstance("to tie up at a landing-stage", "en"); > instances.addInstance("att angöra en brygga", "sv"); > instances.addInstance("it's now time for the children's television > shows", "en"); > instances.addInstance("nu är det dags för barnprogram", "sv"); > instances.flush(); > testClassifier(instances, new NaiveBayesClassifier()); > testClassifier(instances, new FishersMethodClassifier()); > instances.close(); > } > private void testClassifier(Instances instances, BayesianClassifier > classifier) throws IOException { > assertEquals("sv",
[jira] Updated: (LUCENE-1578) InstantiatedIndex supports non-optimized IndexReaders
[ https://issues.apache.org/jira/browse/LUCENE-1578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wettin updated LUCENE-1578: Attachment: LUCENE-1578.txt Please test this patch using a couple of different unoptimized readers in the constructor. > InstantiatedIndex supports non-optimized IndexReaders > - > > Key: LUCENE-1578 > URL: https://issues.apache.org/jira/browse/LUCENE-1578 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/* >Affects Versions: 2.4.1 >Reporter: Jason Rutherglen > Fix For: 2.9 > > Attachments: LUCENE-1578.txt > > Original Estimate: 72h > Remaining Estimate: 72h > > InstantiatedIndex does not currently support non-optimized IndexReaders. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: InstantiatedIndex
On 28 Mar 2009, at 01:21, Jason Rutherglen wrote: I'm thinking InstantiatedIndex needs to implement either clone of all the index data or needs to be able to accept a non-optimized reader, or both. I forget what the obstacles are to implementing the non-optimized reader option? Do you think there are advantages or disadvantages when comparing the solutions? Hi Jason, I honestly don't remember the reason but it seems to have something to do with deletions. Realtime search will need to periodically merge InstantiatedIndexes. One option is to clone an existing index, then add a document to it, clone, and so on, freeze it and later merge it with other indexes. The other option that provides the same functionality is to pass the smaller readers into an InstantiatedIndex. How do you feel about something like this? public InstantiatedIndex merge(IndexReader[] readers) { Directory dir = new RAMDirectory(); IndexWriter w = new IndexWriter(dir); w.addIndexes(readers); w.commit(); w.optimize(); w.close(); IndexReader reader = IndexReader.open(dir); InstantiatedIndex ii = new InstantiatedIndex(reader); reader.close(); dir.close(); return ii; } karl - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1543) Field specified norms in MatchAllDocumentsScorer
[ https://issues.apache.org/jira/browse/LUCENE-1543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683423#action_12683423 ] Karl Wettin commented on LUCENE-1543: - bq. Karl, is there a reason why a function query can't be used in your situation? It seems like it should work? I'm sure it would. : ) I do however not understand why you think it is a more correct/nice/better/what not solution than to use this patch. This is how I reason: if the feature of norms scoring is available in all other low level queries, then it also makes sense to have it in the low level MatchAllDocumentsQuery > Field specified norms in MatchAllDocumentsScorer > - > > Key: LUCENE-1543 > URL: https://issues.apache.org/jira/browse/LUCENE-1543 > Project: Lucene - Java > Issue Type: Improvement > Components: Query/Scoring >Affects Versions: 2.4 >Reporter: Karl Wettin >Priority: Minor > Fix For: 2.9 > > Attachments: LUCENE-1543.txt > > > This patch allows for optionally setting a field to use for norms factoring > when scoring a MatchingAllDocumentsQuery. > From the test case: > {code:java} > .
> RAMDirectory dir = new RAMDirectory(); > IndexWriter iw = new IndexWriter(dir, new StandardAnalyzer(), true, > IndexWriter.MaxFieldLength.LIMITED); > iw.setMaxBufferedDocs(2); // force multi-segment > addDoc("one", iw, 1f); > addDoc("two", iw, 20f); > addDoc("three four", iw, 300f); > iw.close(); > IndexReader ir = IndexReader.open(dir); > IndexSearcher is = new IndexSearcher(ir); > ScoreDoc[] hits; > // assert with norms scoring turned off > hits = is.search(new MatchAllDocsQuery(), null, 1000).scoreDocs; > assertEquals(3, hits.length); > assertEquals("one", ir.document(hits[0].doc).get("key")); > assertEquals("two", ir.document(hits[1].doc).get("key")); > assertEquals("three four", ir.document(hits[2].doc).get("key")); > // assert with norms scoring turned on > MatchAllDocsQuery normsQuery = new MatchAllDocsQuery("key"); > assertEquals(3, hits.length); > //is.explain(normsQuery, hits[0].doc); > hits = is.search(normsQuery, null, 1000).scoreDocs; > assertEquals("three four", ir.document(hits[0].doc).get("key")); > assertEquals("two", ir.document(hits[1].doc).get("key")); > assertEquals("one", ir.document(hits[2].doc).get("key")); > {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1543) Field specified norms in MatchAllDocumentsScorer
[ https://issues.apache.org/jira/browse/LUCENE-1543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12675118#action_12675118 ] Karl Wettin commented on LUCENE-1543: - bq. Couldn't you just use a TermQuery? Or a BooleanQuery with a MatchAllDocsQuery and an optional TermQuery? Wouldn't that require a TermQuery that matches all documents? I.e. adding a term to a field in all documents? The following stuff doesn't really fit in this issue, but still. It's rather related to column stride payloads LUCENE-1231 . I've been considering adding a new "norms" field at document level for a couple of years now. 8 more bits at document level would allow for moving general document boosting out of the per-field norms blob and would increase the length normalization and per-field boost resolution quite a bit at a low cost. (I hope that is not yet another can of worms I get to open.) > Field specified norms in MatchAllDocumentsScorer > - > > Key: LUCENE-1543 > URL: https://issues.apache.org/jira/browse/LUCENE-1543 > Project: Lucene - Java > Issue Type: Improvement > Components: Query/Scoring >Affects Versions: 2.4 >Reporter: Karl Wettin >Priority: Minor > Fix For: 2.9 > > Attachments: LUCENE-1543.txt > > > This patch allows for optionally setting a field to use for norms factoring > when scoring a MatchingAllDocumentsQuery. > From the test case: > {code:java} > .
> RAMDirectory dir = new RAMDirectory(); > IndexWriter iw = new IndexWriter(dir, new StandardAnalyzer(), true, > IndexWriter.MaxFieldLength.LIMITED); > iw.setMaxBufferedDocs(2); // force multi-segment > addDoc("one", iw, 1f); > addDoc("two", iw, 20f); > addDoc("three four", iw, 300f); > iw.close(); > IndexReader ir = IndexReader.open(dir); > IndexSearcher is = new IndexSearcher(ir); > ScoreDoc[] hits; > // assert with norms scoring turned off > hits = is.search(new MatchAllDocsQuery(), null, 1000).scoreDocs; > assertEquals(3, hits.length); > assertEquals("one", ir.document(hits[0].doc).get("key")); > assertEquals("two", ir.document(hits[1].doc).get("key")); > assertEquals("three four", ir.document(hits[2].doc).get("key")); > // assert with norms scoring turned on > MatchAllDocsQuery normsQuery = new MatchAllDocsQuery("key"); > assertEquals(3, hits.length); > //is.explain(normsQuery, hits[0].doc); > hits = is.search(normsQuery, null, 1000).scoreDocs; > assertEquals("three four", ir.document(hits[0].doc).get("key")); > assertEquals("two", ir.document(hits[1].doc).get("key")); > assertEquals("one", ir.document(hits[2].doc).get("key")); > {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1543) Field specified norms in MatchAllDocumentsScorer
[ https://issues.apache.org/jira/browse/LUCENE-1543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wettin updated LUCENE-1543: Attachment: LUCENE-1543.txt > Field specified norms in MatchAllDocumentsScorer > - > > Key: LUCENE-1543 > URL: https://issues.apache.org/jira/browse/LUCENE-1543 > Project: Lucene - Java > Issue Type: Improvement > Components: Query/Scoring >Affects Versions: 2.4 > Reporter: Karl Wettin >Priority: Minor > Fix For: 2.9 > > Attachments: LUCENE-1543.txt > > > This patch allows for optionally setting a field to use for norms factoring > when scoring a MatchingAllDocumentsQuery. > From the test case: > {code:java} > . > RAMDirectory dir = new RAMDirectory(); > IndexWriter iw = new IndexWriter(dir, new StandardAnalyzer(), true, > IndexWriter.MaxFieldLength.LIMITED); > iw.setMaxBufferedDocs(2); // force multi-segment > addDoc("one", iw, 1f); > addDoc("two", iw, 20f); > addDoc("three four", iw, 300f); > iw.close(); > IndexReader ir = IndexReader.open(dir); > IndexSearcher is = new IndexSearcher(ir); > ScoreDoc[] hits; > // assert with norms scoring turned off > hits = is.search(new MatchAllDocsQuery(), null, 1000).scoreDocs; > assertEquals(3, hits.length); > assertEquals("one", ir.document(hits[0].doc).get("key")); > assertEquals("two", ir.document(hits[1].doc).get("key")); > assertEquals("three four", ir.document(hits[2].doc).get("key")); > // assert with norms scoring turned on > MatchAllDocsQuery normsQuery = new MatchAllDocsQuery("key"); > assertEquals(3, hits.length); > //is.explain(normsQuery, hits[0].doc); > hits = is.search(normsQuery, null, 1000).scoreDocs; > assertEquals("three four", ir.document(hits[0].doc).get("key")); > assertEquals("two", ir.document(hits[1].doc).get("key")); > assertEquals("one", ir.document(hits[2].doc).get("key")); > {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. 
- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Created: (LUCENE-1543) Field specified norms in MatchAllDocumentsScorer
Field specified norms in MatchAllDocumentsScorer - Key: LUCENE-1543 URL: https://issues.apache.org/jira/browse/LUCENE-1543 Project: Lucene - Java Issue Type: Improvement Components: Query/Scoring Affects Versions: 2.4 Reporter: Karl Wettin Priority: Minor Fix For: 2.9 Attachments: LUCENE-1543.txt This patch allows for optionally setting a field to use for norms factoring when scoring a MatchingAllDocumentsQuery. >From the test case: {code:java} . RAMDirectory dir = new RAMDirectory(); IndexWriter iw = new IndexWriter(dir, new StandardAnalyzer(), true, IndexWriter.MaxFieldLength.LIMITED); iw.setMaxBufferedDocs(2); // force multi-segment addDoc("one", iw, 1f); addDoc("two", iw, 20f); addDoc("three four", iw, 300f); iw.close(); IndexReader ir = IndexReader.open(dir); IndexSearcher is = new IndexSearcher(ir); ScoreDoc[] hits; // assert with norms scoring turned off hits = is.search(new MatchAllDocsQuery(), null, 1000).scoreDocs; assertEquals(3, hits.length); assertEquals("one", ir.document(hits[0].doc).get("key")); assertEquals("two", ir.document(hits[1].doc).get("key")); assertEquals("three four", ir.document(hits[2].doc).get("key")); // assert with norms scoring turned on MatchAllDocsQuery normsQuery = new MatchAllDocsQuery("key"); assertEquals(3, hits.length); //is.explain(normsQuery, hits[0].doc); hits = is.search(normsQuery, null, 1000).scoreDocs; assertEquals("three four", ir.document(hits[0].doc).get("key")); assertEquals("two", ir.document(hits[1].doc).get("key")); assertEquals("one", ir.document(hits[2].doc).get("key")); {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1537) InstantiatedIndexReader.clone
[ https://issues.apache.org/jira/browse/LUCENE-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12673610#action_12673610 ] Karl Wettin commented on LUCENE-1537: - I didn't try it out yet, but I have a few comments and questions on the patch: {code} Index: contrib/instantiated/src/java/org/apache/lucene/store/instantiated/InstantiatedIndexReader.java + + public Object clone() { +try { + doCommit(); + InstantiatedIndex clonedIndex = index.cloneWithDeletesNorms(); + return new InstantiatedIndexReader(clonedIndex); +} catch (IOException ioe) { + throw new RuntimeException("", ioe); +} + } Index: contrib/instantiated/src/java/org/apache/lucene/store/instantiated/InstantiatedIndex.java + + InstantiatedIndex cloneWithDeletesNorms() { +InstantiatedIndex clone = new InstantiatedIndex(); +clone.version = System.currentTimeMillis(); +clone.documentsByNumber = documentsByNumber; +clone.deletedDocuments = new HashSet(deletedDocuments); +clone.termsByFieldAndText = termsByFieldAndText; +clone.orderedTerms = orderedTerms; +clone.normsByFieldNameAndDocumentNumber = new HashMap(normsByFieldNameAndDocumentNumber); +clone.fieldSettings = fieldSettings; +return clone; + } {code} Perhaps we should move deleted documents to the reader? It might be a bit of work to hook it up with term enum et c, but it could be worth looking in to. I think it makes more sense to keep the same instance of InstantiatedIndex and only produce a cloned InstantiatedIndexReader. It is the reader#clone we call upon so cloning the store sounds like a future placeholder for unwanted bugs. 
I see there are some leftovers from your attempt to handle non-optimized readers: {code}
-documentsByNumber = new InstantiatedDocument[sourceIndexReader.numDocs()];
+documentsByNumber = new InstantiatedDocument[sourceIndexReader.maxDoc()];
 // create documents
 for (int i = 0; i < sourceIndexReader.numDocs(); i++) {
{code} I think if you switch to maxDoc it should also use maxDoc in the loop and skip any deleted document. {code}
-for (InstantiatedDocument document : getDocumentsByNumber()) {
+//for (InstantiatedDocument document : getDocumentsByNumber()) {
+for (InstantiatedDocument document : getDocumentsNotDeleted()) {
   for (Field field : (List) document.getDocument().getFields()) {
     if (field.isTermVectorStored() && field.isStoreOffsetWithTermVector()) {
       TermPositionVector termPositionVector = (TermPositionVector) sourceIndexReader.getTermFreqVector(document.getDocumentNumber(), field.name());
@@ -312,7 +325,15 @@
   public InstantiatedDocument[] getDocumentsByNumber() {
     return documentsByNumber;
   }
-
+
+  public List getDocumentsNotDeleted() {
+    List list = new ArrayList(documentsByNumber.length - deletedDocuments.size());
+    for (int x = 0; x < documentsByNumber.length; x++) {
+      if (!deletedDocuments.contains(x)) list.add(documentsByNumber[x]);
+    }
+    return list;
+  }
+
{code} As the source never contains any deleted documents this really doesn't do anything but consume a bit of resources, or does it? {code}
-int maxVal = getAssociatedDocuments()[max].getDocument().getDocumentNumber();
+InstantiatedTermDocumentInformation itdi = getAssociatedDocuments()[max];
+InstantiatedDocument id = itdi.getDocument();
+int maxVal = id.getDocumentNumber();
+//int maxVal = getAssociatedDocuments()[max].getDocument().getDocumentNumber();
{code} Is this refactor just for debugging purposes? I find it harder to read than the original one-liner. 
> InstantiatedIndexReader.clone > - > > Key: LUCENE-1537 > URL: https://issues.apache.org/jira/browse/LUCENE-1537 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/* >Affects Versions: 2.4 > Reporter: Jason Rutherglen >Assignee: Karl Wettin >Priority: Trivial > Fix For: 2.9 > > Attachments: LUCENE-1537.patch > > Original Estimate: 2h > Remaining Estimate: 2h > > This patch will implement IndexReader.clone for InstantiatedIndexReader. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
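The trade-off discussed above (share the immutable term dictionary between clones, defensively copy the mutable per-reader state such as deletions and norms) can be sketched in plain Java. SnapshotCloneDemo and its field names are hypothetical illustrations, not Lucene API:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class SnapshotCloneDemo {
    // Structures that are effectively immutable after commit can be
    // shared by reference between the original and its clones.
    final Map<String, String> termsByFieldAndText;

    // Per-reader mutable state (deletions, norms) is defensively copied,
    // so changes in the clone never leak into the original.
    final Set<Integer> deletedDocuments;

    SnapshotCloneDemo(Map<String, String> terms, Set<Integer> deleted) {
        this.termsByFieldAndText = terms;
        this.deletedDocuments = deleted;
    }

    SnapshotCloneDemo snapshot() {
        return new SnapshotCloneDemo(
                termsByFieldAndText,                     // shared reference
                new HashSet<Integer>(deletedDocuments)); // defensive copy
    }
}
```

This is the shape of the `cloneWithDeletesNorms()` patch: two shared references, two copied collections. The open question in the thread is only *where* that copied state should live (store vs. reader).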
[jira] Assigned: (LUCENE-1537) InstantiatedIndexReader.clone
[ https://issues.apache.org/jira/browse/LUCENE-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wettin reassigned LUCENE-1537: --- Assignee: Karl Wettin > InstantiatedIndexReader.clone > - > > Key: LUCENE-1537 > URL: https://issues.apache.org/jira/browse/LUCENE-1537 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/* >Affects Versions: 2.4 >Reporter: Jason Rutherglen >Assignee: Karl Wettin >Priority: Trivial > Fix For: 2.9 > > Attachments: LUCENE-1537.patch > > Original Estimate: 2h > Remaining Estimate: 2h > > This patch will implement IndexReader.clone for InstantiatedIndexReader. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Closed: (LUCENE-1531) contrib/xml-query-parser, BoostingTermQuery support
[ https://issues.apache.org/jira/browse/LUCENE-1531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wettin closed LUCENE-1531. --- Resolution: Fixed Lucene Fields: [New, Patch Available] (was: [Patch Available, New]) Committed revision 742411 > contrib/xml-query-parser, BoostingTermQuery support > --- > > Key: LUCENE-1531 > URL: https://issues.apache.org/jira/browse/LUCENE-1531 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/* >Affects Versions: 2.4 > Reporter: Karl Wettin >Assignee: Karl Wettin > Fix For: 2.9 > > Attachments: LUCENE-1531.txt, LUCENE-1531.txt > > > I'm not 100% on this patch. > BoostingTermQuery is a part of the spans family, but I generally use that > class as a replacement for TermQuery. Thus in the DTD I have stated that it > can be a part of the root queries as well as a part of a span. > However, SpanFooQueries xml elements are named rather than > , I have however chosen to call it . It > would be possible to set it up so it would be parsed as > when inside of a , but I just find that confusing. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Partial / starts with searching
Hi Jori, your question is better suited to the java-user list; on this list we discuss development of the API itself. To answer your question: ngrams might solve your problem, and tokenizers are available in contrib/analyzers. karl On 5 Feb 2009, at 10:19, d-fader wrote: Hi, I'm new to this list, so please don't be too harsh if I missed some rules or something. For about half a year I've been using Lucene and I think it's awesome, respect for all your efforts! Maybe the 'issue' I'm addressing now has been discussed thoroughly already; in that case I think I need some redirection to the sources of those discussions :) Anyway, here's the thing. For all I know it's impossible to search partial words with Lucene (except the asterisk method with e.g. the StandardAnalyzer -> ambul* to find ambulance). My problem with that method is that my index consists of quite a few terms. This means that if a user searches for 'ambu amster' (ambulance amsterdam), there will be so many terms to search that it's not doable. Now I started thinking about why it's impossible to search only a 'part' of a term or even only the 'start' of a term, and the only reason I could think of was that the index terms are stored tokenized (in that way you (of course) can't find partial terms, since the index actually doesn't contain the literal terms, but tokens instead). But Lucene can also store all terms untokenized, so in that case a partial search would be possible in my humble opinion, since all terms would be stored 'literally'. Maybe my thinking is wrong, I only have a black box view of Lucene, so I don't know much about the indexing algorithm and all, but I just want to know if this could be done or else why not :) You see, the users of my index want to know why they can't search parts of the words they enter and I still can't give them a really good answer, except the 'it would result in too many OR operators in the query' statement :) Thanks in advance! 
Jori - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
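For the archives: the ngram approach suggested above works by indexing extra prefix tokens at index time, so a user's partial input becomes an exact term match at query time instead of a wildcard expansion over the whole term dictionary. A minimal stdlib-only sketch of front edge n-grams (EdgeNGramDemo is a hypothetical name; the real tokenizer in contrib/analyzers is EdgeNGramTokenFilter):

```java
import java.util.ArrayList;
import java.util.List;

public class EdgeNGramDemo {
    // Produce front edge n-grams of a term, e.g. "ambulance" -> [am, amb, ambu]
    // for minGram=2, maxGram=4.
    public static List<String> frontNGrams(String term, int minGram, int maxGram) {
        List<String> grams = new ArrayList<String>();
        for (int n = minGram; n <= maxGram && n <= term.length(); n++) {
            grams.add(term.substring(0, n));
        }
        return grams;
    }

    public static void main(String[] args) {
        // Index-time: emit these grams as extra tokens in a dedicated field.
        // Query-time: a plain TermQuery for the user's prefix ("ambu") then
        // matches exactly, with no per-query term enumeration.
        System.out.println(frontNGrams("ambulance", 2, 4));
    }
}
```

The cost is a larger index (one extra token per gram); the gain is that prefix queries become constant-cost term lookups.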
[jira] Commented: (LUCENE-1531) contrib/xml-query-parser, BoostingTermQuery support
[ https://issues.apache.org/jira/browse/LUCENE-1531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12670240#action_12670240 ] Karl Wettin commented on LUCENE-1531: - Any objections to this patch? If not I'll pop it into the trunk a few days from now. > contrib/xml-query-parser, BoostingTermQuery support > --- > > Key: LUCENE-1531 > URL: https://issues.apache.org/jira/browse/LUCENE-1531 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/* >Affects Versions: 2.4 >Reporter: Karl Wettin >Assignee: Karl Wettin > Fix For: 2.9 > > Attachments: LUCENE-1531.txt, LUCENE-1531.txt > > > I'm not 100% on this patch. > BoostingTermQuery is a part of the spans family, but I generally use that > class as a replacement for TermQuery. Thus in the DTD I have stated that it > can be a part of the root queries as well as a part of a span. > However, SpanFooQueries xml elements are named rather than > , I have however chosen to call it . It > would be possible to set it up so it would be parsed as > when inside of a , but I just find that confusing. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1531) contrib/xml-query-parser, BoostingTermQuery support
[ https://issues.apache.org/jira/browse/LUCENE-1531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wettin updated LUCENE-1531: Attachment: LUCENE-1531.txt Previous patch was messed up from cloning SpanTerm.. > contrib/xml-query-parser, BoostingTermQuery support > --- > > Key: LUCENE-1531 > URL: https://issues.apache.org/jira/browse/LUCENE-1531 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/* >Affects Versions: 2.4 > Reporter: Karl Wettin >Assignee: Karl Wettin > Fix For: 2.9 > > Attachments: LUCENE-1531.txt, LUCENE-1531.txt > > > I'm not 100% on this patch. > BooleanTermQuery is a part of the spans family, but I generally use that > class as a replacement for TermQuery. Thus in the DTD I have stated that it > can be a part of the root queries as well as a part of a span. > However, SpanFooQueries xml elements are named rather than > , I have however chosen to call it . It > would be possible to set it up so it would be parsed as > when inside of a , but I just find that confusing. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1531) contrib/xml-query-parser, BoostingTermQuery support
[ https://issues.apache.org/jira/browse/LUCENE-1531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wettin updated LUCENE-1531: Attachment: LUCENE-1531.txt > contrib/xml-query-parser, BoostingTermQuery support > --- > > Key: LUCENE-1531 > URL: https://issues.apache.org/jira/browse/LUCENE-1531 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/* >Affects Versions: 2.4 > Reporter: Karl Wettin >Assignee: Karl Wettin > Fix For: 2.9 > > Attachments: LUCENE-1531.txt > > > I'm not 100% on this patch. > BoostingTermQuery is a part of the spans family, but I generally use that > class as a replacement for TermQuery. Thus in the DTD I have stated that it > can be a part of the root queries as well as a part of a span. > However, SpanFooQueries xml elements are named rather than > , I have however chosen to call it . It > would be possible to set it up so it would be parsed as > when inside of a , but I just find that confusing. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Created: (LUCENE-1531) contrib/xml-query-parser, BoostingTermQuery support
contrib/xml-query-parser, BoostingTermQuery support --- Key: LUCENE-1531 URL: https://issues.apache.org/jira/browse/LUCENE-1531 Project: Lucene - Java Issue Type: Improvement Components: contrib/* Affects Versions: 2.4 Reporter: Karl Wettin Assignee: Karl Wettin Fix For: 2.9 I'm not 100% on this patch. BoostingTermQuery is a part of the spans family, but I generally use that class as a replacement for TermQuery. Thus in the DTD I have stated that it can be a part of the root queries as well as a part of a span. However, SpanFooQueries xml elements are named rather than , I have however chosen to call it . It would be possible to set it up so it would be parsed as when inside of a , but I just find that confusing. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Filesystem based bitset
Thinking out loud: SSD is pretty close to RAM when it comes to seeking. Wouldn't that mean that a bitset stored on an SSD would be more or less as fast as a bitset in RAM? So how about storing all permutations of the filters one uses on SSD? Perhaps loading them to RAM in case they are frequently used? To me it sounds like a great idea. Not sure if one should focus on OpenBitSet or a fixed-size BitSet; I'd really need to do some real tests to tell. Still, I'm rather convinced the bang-for-the-buck ratio is quite a bit better using SSD than RAM, given that IO throughput (compare an index in RAM vs on SSD vs on HDD) isn't an issue. The only real issue I can think of is the lack of DocSetIterator#close(). karl - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
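The persistence half of the idea can be sketched with the JDK alone. This uses java.util.BitSet and java.nio rather than Lucene's OpenBitSet, so class and method names here are illustrative assumptions, not an actual Lucene API:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.BitSet;

public class DiskBitSetDemo {
    // Write a cached filter bitset to a file; the file could live on SSD.
    public static void save(BitSet bits, Path path) {
        try (FileChannel ch = FileChannel.open(path,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            ByteBuffer buf = ByteBuffer.wrap(bits.toByteArray());
            while (buf.hasRemaining()) {
                ch.write(buf);
            }
        } catch (IOException ioe) {
            throw new UncheckedIOException(ioe);
        }
    }

    // Load it back. A memory-mapped variant could test individual bits
    // without pulling the whole file onto the heap.
    public static BitSet load(Path path) {
        try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ)) {
            ByteBuffer buf = ByteBuffer.allocate((int) ch.size());
            while (buf.hasRemaining() && ch.read(buf) != -1) {
                // keep reading until the buffer is full or EOF
            }
            buf.flip();
            return BitSet.valueOf(buf);
        } catch (IOException ioe) {
            throw new UncheckedIOException(ioe);
        }
    }
}
```

A real implementation would likely memory-map the file and let the OS page cache do the RAM-promotion that the mail speculates about.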
[jira] Updated: (LUCENE-1515) Improved(?) Swedish snowball stemmer
[ https://issues.apache.org/jira/browse/LUCENE-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wettin updated LUCENE-1515: Attachment: LUCENE-1515.txt snowball code, generated java class and unit test. > Improved(?) Swedish snowball stemmer > > > Key: LUCENE-1515 > URL: https://issues.apache.org/jira/browse/LUCENE-1515 > Project: Lucene - Java > Issue Type: New Feature > Components: contrib/* >Affects Versions: 2.4 > Reporter: Karl Wettin > Attachments: LUCENE-1515.txt > > > The Snowball stemmer for Swedish lacks support for '-an' and '-ans' related > suffix stripping, ending up with incompatible stems, for example "klocka", > "klockor", "klockornas", "klockAN", "klockANS". Complete list of new suffix > stripping rules: > {pre} > 'an' 'anen' 'anens' 'anare' 'aner' 'anerna' 'anernas' > 'ans' 'ansen' 'ansens' 'anser' 'ansera' 'anserar' 'anserna' > 'ansernas' > 'iera' > (delete) > {pre} > The problem is all the exceptions (e.g. svans|svan, finans|fin, nyans|ny) and > this is an attempt at solving that problem. The rules and exceptions are > based on the [SAOL|http://en.wikipedia.org/wiki/Svenska_Akademiens_Ordlista] > entries suffixed with 'an' and 'ans'. There are a few known problematic stemming > rules, but it seems to work quite a bit better than the current SwedishStemmer. > It would not be a bad idea to check all SAOL entries in order to verify > the integrity of the rules. > My Snowball syntax skills are rather limited so I'm certain the code could be > optimized quite a bit. > *The code is released under BSD and not ASL*. I've been posting a bit in the > Snowball forum and privately to Martin Porter himself but never got any > response, so now I post it here instead in hope of some momentum. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1039) Bayesian classifiers using Lucene as data store
[ https://issues.apache.org/jira/browse/LUCENE-1039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662467#action_12662467 ] Karl Wettin commented on LUCENE-1039: - What do you people think, should I commit this to Lucene or Mahout? > Bayesian classifiers using Lucene as data store > --- > > Key: LUCENE-1039 > URL: https://issues.apache.org/jira/browse/LUCENE-1039 > Project: Lucene - Java > Issue Type: New Feature > Reporter: Karl Wettin > Assignee: Karl Wettin >Priority: Minor > Attachments: LUCENE-1039.txt > > > Bayesian classifiers using Lucene as data store. Based on the Naive Bayes and > Fisher method algorithms as described by Toby Segaran in "Programming > Collective Intelligence", ISBN 978-0-596-52932-1. > Have fun. > Poor java docs, but the TestCase shows how to use it: > {code:java} > public class TestClassifier extends TestCase { > public void test() throws Exception { > InstanceFactory instanceFactory = new InstanceFactory() { > public Document factory(String text, String _class) { > Document doc = new Document(); > doc.add(new Field("class", _class, Field.Store.YES, > Field.Index.NO_NORMS)); > doc.add(new Field("text", text, Field.Store.YES, Field.Index.NO, > Field.TermVector.NO)); > doc.add(new Field("text/ngrams/start", text, Field.Store.NO, > Field.Index.TOKENIZED, Field.TermVector.YES)); > doc.add(new Field("text/ngrams/inner", text, Field.Store.NO, > Field.Index.TOKENIZED, Field.TermVector.YES)); > doc.add(new Field("text/ngrams/end", text, Field.Store.NO, > Field.Index.TOKENIZED, Field.TermVector.YES)); > return doc; > } > Analyzer analyzer = new Analyzer() { > private int minGram = 2; > private int maxGram = 3; > public TokenStream tokenStream(String fieldName, Reader reader) { > TokenStream ts = new StandardTokenizer(reader); > ts = new LowerCaseFilter(ts); > if (fieldName.endsWith("/ngrams/start")) { > ts = new EdgeNGramTokenFilter(ts, > EdgeNGramTokenFilter.Side.FRONT, minGram, maxGram); > } else if 
(fieldName.endsWith("/ngrams/inner")) { > ts = new NGramTokenFilter(ts, minGram, maxGram); > } else if (fieldName.endsWith("/ngrams/end")) { > ts = new EdgeNGramTokenFilter(ts, EdgeNGramTokenFilter.Side.BACK, > minGram, maxGram); > } > return ts; > } > }; > public Analyzer getAnalyzer() { > return analyzer; > } > }; > Directory dir = new RAMDirectory(); > new IndexWriter(dir, null, true).close(); > Instances instances = new Instances(dir, instanceFactory, "class"); > instances.addInstance("hello world", "en"); > instances.addInstance("hallå världen", "sv"); > instances.addInstance("this is london calling", "en"); > instances.addInstance("detta är london som ringer", "sv"); > instances.addInstance("john has a long mustache", "en"); > instances.addInstance("john har en lång mustache", "sv"); > instances.addInstance("all work and no play makes jack a dull boy", "en"); > instances.addInstance("att bara arbeta och aldrig leka gör jack en trist > gosse", "sv"); > instances.addInstance("shrimp sandwich", "en"); > instances.addInstance("räksmörgås", "sv"); > instances.addInstance("it's now or never", "en"); > instances.addInstance("det är nu eller aldrig", "sv"); > instances.addInstance("to tie up at a landing-stage", "en"); > instances.addInstance("att angöra en brygga", "sv"); > instances.addInstance("it's now time for the children's television > shows", "en"); > instances.addInstance("nu är det dags för barnprogram", "sv"); > instances.flush(); > testClassifier(instances, new NaiveBayesClassifier()); > testClassifier(instances, new FishersMethodClassifier()); > instances.close(); > } > private void testClassifier(Instances instances, BayesianClassifier > classifier) throws IOException { > assertEquals("sv", classifie
[jira] Created: (LUCENE-1515) Improved(?) Swedish snowball stemmer
Improved(?) Swedish snowball stemmer Key: LUCENE-1515 URL: https://issues.apache.org/jira/browse/LUCENE-1515 Project: Lucene - Java Issue Type: New Feature Components: contrib/* Affects Versions: 2.4 Reporter: Karl Wettin The Snowball stemmer for Swedish lacks support for '-an' and '-ans' related suffix stripping, ending up with incompatible stems, for example "klocka", "klockor", "klockornas", "klockAN", "klockANS". Complete list of new suffix stripping rules: {pre} 'an' 'anen' 'anens' 'anare' 'aner' 'anerna' 'anernas' 'ans' 'ansen' 'ansens' 'anser' 'ansera' 'anserar' 'anserna' 'ansernas' 'iera' (delete) {pre} The problem is all the exceptions (e.g. svans|svan, finans|fin, nyans|ny) and this is an attempt at solving that problem. The rules and exceptions are based on the [SAOL|http://en.wikipedia.org/wiki/Svenska_Akademiens_Ordlista] entries suffixed with 'an' and 'ans'. There are a few known problematic stemming rules, but it seems to work quite a bit better than the current SwedishStemmer. It would not be a bad idea to check all SAOL entries in order to verify the integrity of the rules. My Snowball syntax skills are rather limited so I'm certain the code could be optimized quite a bit. *The code is released under BSD and not ASL*. I've been posting a bit in the Snowball forum and privately to Martin Porter himself but never got any response, so now I post it here instead in hope of some momentum. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
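A rough stdlib-only illustration of longest-match suffix stripping over the rules listed above. SwedishSuffixDemo is a hypothetical name, and it deliberately omits Snowball's R1-region check and the exception list (svans, finans, nyans, ...), so it is a sketch of the rule mechanics, not the proposed stemmer itself:

```java
import java.util.Arrays;
import java.util.Comparator;

public class SwedishSuffixDemo {
    // The suffixes proposed in LUCENE-1515; the longest match wins,
    // mirroring how Snowball chooses among competing 'among' entries.
    private static final String[] SUFFIXES = {
        "an", "anen", "anens", "anare", "aner", "anerna", "anernas",
        "ans", "ansen", "ansens", "anser", "ansera", "anserar", "anserna",
        "ansernas", "iera"
    };

    static {
        // Sort longest-first so the first endsWith hit is the longest match.
        Arrays.sort(SUFFIXES, new Comparator<String>() {
            public int compare(String a, String b) { return b.length() - a.length(); }
        });
    }

    // Strip the longest matching suffix, requiring at least a 3-letter stem
    // as a crude stand-in for Snowball's region constraints.
    public static String stem(String word) {
        for (String suffix : SUFFIXES) {
            if (word.endsWith(suffix) && word.length() > suffix.length() + 2) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }
}
```

With these rules "klockan" and "klockans" both reduce to "klock", which is exactly the stem compatibility the issue is after.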
[jira] Closed: (LUCENE-1514) ShingleMatrixFilter easily throws StackOverflow as the complexity of a matrix grows
[ https://issues.apache.org/jira/browse/LUCENE-1514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wettin closed LUCENE-1514. --- Resolution: Fixed Lucene Fields: [New, Patch Available] (was: [Patch Available, New]) Committed in revision 733064 > ShingleMatrixFilter easily throws StackOverflow as the complexity of a matrix > grows > -- > > Key: LUCENE-1514 > URL: https://issues.apache.org/jira/browse/LUCENE-1514 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/analyzers >Affects Versions: 2.4 >Reporter: Karl Wettin >Assignee: Karl Wettin > Fix For: 2.9 > > Attachments: LUCENE-1514.txt > > > ShingleMatrixFilter#next makes a recursive function invocation when the > current permutation iterator is exhausted or if the current state of the > permutation iterator already has produced an identical shingle. In a not too > complex matrix this will require a gigabyte sized stack per thread. > My solution is to avoid the recursive invocation by refactoring like this: > {code:java} > public Token next(final Token reusableToken) throws IOException { > assert reusableToken != null; > if (matrix == null) { > matrix = new Matrix(); > // fill matrix with maximumShingleSize columns > while (matrix.columns.size() < maximumShingleSize && readColumn()) { > // this loop looks ugly > } > } > // this loop exists in order to avoid recursive calls to the next method > // as the complexity of a large matrix > // then would require a multi gigabyte sized stack. > Token token; > do { > token = produceNextToken(reusableToken); > } while (token == request_next_token); > return token; > } > > private static final Token request_next_token = new Token(); > /** >* This method exists in order to avoid recursive calls to the method >* as the complexity of a fairly small matrix then easily would require >* a gigabyte sized stack per thread. 
>* >* @param reusableToken >* @return null if exhausted, instance request_next_token if one more call > is required for an answer, or instance parameter reusableToken. >* @throws IOException >*/ > private Token produceNextToken(final Token reusableToken) throws > IOException { > {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1514) ShingleMatrixFilter easily throws StackOverflow as the complexity of a matrix grows
[ https://issues.apache.org/jira/browse/LUCENE-1514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wettin updated LUCENE-1514: Attachment: LUCENE-1514.txt > ShingleMatrixFilter easily throws StackOverflow as the complexity of a matrix > grows > -- > > Key: LUCENE-1514 > URL: https://issues.apache.org/jira/browse/LUCENE-1514 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/analyzers >Affects Versions: 2.4 >Reporter: Karl Wettin >Assignee: Karl Wettin > Fix For: 2.9 > > Attachments: LUCENE-1514.txt > > > ShingleMatrixFilter#next makes a recursive function invocation when the > current permutation iterator is exhausted or if the current state of the > permutation iterator already has produced an identical shingle. In a not too > complex matrix this will require a gigabyte sized stack per thread. > My solution is to avoid the recursive invocation by refactoring like this: > {code:java} > public Token next(final Token reusableToken) throws IOException { > assert reusableToken != null; > if (matrix == null) { > matrix = new Matrix(); > // fill matrix with maximumShingleSize columns > while (matrix.columns.size() < maximumShingleSize && readColumn()) { > // this loop looks ugly > } > } > // this loop exists in order to avoid recursive calls to the next method > // as the complexity of a large matrix > // then would require a multi gigabyte sized stack. > Token token; > do { > token = produceNextToken(reusableToken); > } while (token == request_next_token); > return token; > } > > private static final Token request_next_token = new Token(); > /** >* This method exists in order to avoid recursive calls to the method >* as the complexity of a fairly small matrix then easily would require >* a gigabyte sized stack per thread. >* >* @param reusableToken >* @return null if exhausted, instance request_next_token if one more call > is required for an answer, or instance parameter reusableToken. 
>* @throws IOException >*/ > private Token produceNextToken(final Token reusableToken) throws > IOException { > {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Created: (LUCENE-1514) ShingleMatrixFilter easily throws StackOverflow as the complexity of a matrix grows
ShingleMatrixFilter easily throws StackOverflow as the complexity of a matrix grows -- Key: LUCENE-1514 URL: https://issues.apache.org/jira/browse/LUCENE-1514 Project: Lucene - Java Issue Type: Improvement Components: contrib/analyzers Affects Versions: 2.4 Reporter: Karl Wettin Assignee: Karl Wettin Fix For: 2.9 Attachments: LUCENE-1514.txt ShingleMatrixFilter#next makes a recursive function invocation when the current permutation iterator is exhausted or if the current state of the permutation iterator has already produced an identical shingle. In a not too complex matrix this will require a gigabyte sized stack per thread. My solution is to avoid the recursive invocation by refactoring like this: {code:java}
public Token next(final Token reusableToken) throws IOException {
  assert reusableToken != null;
  if (matrix == null) {
    matrix = new Matrix();
    // fill matrix with maximumShingleSize columns
    while (matrix.columns.size() < maximumShingleSize && readColumn()) {
      // this loop looks ugly
    }
  }
  // this loop exists in order to avoid recursive calls to the next method,
  // as the complexity of a large matrix
  // then would require a multi gigabyte sized stack.
  Token token;
  do {
    token = produceNextToken(reusableToken);
  } while (token == request_next_token);
  return token;
}

private static final Token request_next_token = new Token();

/**
 * This method exists in order to avoid recursive calls to the method,
 * as the complexity of a fairly small matrix then easily would require
 * a gigabyte sized stack per thread.
 *
 * @param reusableToken
 * @return null if exhausted, instance request_next_token if one more call is required for an answer, or instance parameter reusableToken.
 * @throws IOException
 */
private Token produceNextToken(final Token reusableToken) throws IOException {
{code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. 
- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
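The refactoring above is an instance of a general pattern: replace self-recursion with a driver loop that spins on a sentinel object compared by identity. A self-contained sketch (SentinelLoopDemo and its names are hypothetical, not Lucene code):

```java
public class SentinelLoopDemo {
    // Sentinel returned when the producer cannot yet supply a value.
    // new String(...) guarantees a distinct instance, so an identity
    // comparison can never collide with a real result.
    private static final String TRY_AGAIN = new String("TRY_AGAIN");

    private int attempts = 0;

    // Instead of calling itself recursively when it has nothing to emit,
    // the producer returns the sentinel and lets the caller loop. 1000
    // "retries" here would have been 1000 stack frames in recursive form.
    private String produce() {
        attempts++;
        return attempts < 1000 ? TRY_AGAIN : "token";
    }

    // Driver loop: identity comparison (==) against the sentinel, exactly
    // as ShingleMatrixFilter#next compares against request_next_token.
    public String next() {
        String result;
        do {
            result = produce();
        } while (result == TRY_AGAIN); // identity, not equals()
        return result;
    }
}
```

The stack depth stays constant no matter how many retries the producer needs, which is precisely what the patch buys ShingleMatrixFilter.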
[jira] Closed: (LUCENE-1510) InstantiatedIndexReader throws NullPointerException in norms() when used with a MultiReader
[ https://issues.apache.org/jira/browse/LUCENE-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wettin closed LUCENE-1510. --- Resolution: Fixed Fix Version/s: 2.9 > InstantiatedIndexReader throws NullPointerException in norms() when used with > a MultiReader > --- > > Key: LUCENE-1510 > URL: https://issues.apache.org/jira/browse/LUCENE-1510 > Project: Lucene - Java > Issue Type: Bug > Components: contrib/* >Affects Versions: 2.4 >Reporter: Robert Newson >Assignee: Karl Wettin > Fix For: 2.9 > > Attachments: TestWithMultiReader.java > > > When using InstantiatedIndexReader under a MultiReader where the other Reader > contains documents, a NullPointerException is thrown here; > public void norms(String field, byte[] bytes, int offset) throws IOException > { > byte[] norms = > getIndex().getNormsByFieldNameAndDocumentNumber().get(field); > System.arraycopy(norms, 0, bytes, offset, norms.length); > } > the 'norms' variable is null. Performing the copy only when norms is not null > does work, though I'm sure it's not the right fix. 
> java.lang.NullPointerException > at > org.apache.lucene.store.instantiated.InstantiatedIndexReader.norms(InstantiatedIndexReader.java:297) > at org.apache.lucene.index.MultiReader.norms(MultiReader.java:273) > at > org.apache.lucene.search.TermQuery$TermWeight.scorer(TermQuery.java:70) > at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:131) > at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:112) > at org.apache.lucene.search.Searcher.search(Searcher.java:136) > at org.apache.lucene.search.Searcher.search(Searcher.java:146) > at > org.apache.lucene.store.instantiated.TestWithMultiReader.test(TestWithMultiReader.java:41) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > at java.lang.reflect.Method.invoke(Method.java:597) > at junit.framework.TestCase.runTest(TestCase.java:164) > at junit.framework.TestCase.runBare(TestCase.java:130) > at junit.framework.TestResult$1.protect(TestResult.java:106) > at junit.framework.TestResult.runProtected(TestResult.java:124) > at junit.framework.TestResult.run(TestResult.java:109) > at junit.framework.TestCase.run(TestCase.java:120) > at junit.framework.TestSuite.runTest(TestSuite.java:230) > at junit.framework.TestSuite.run(TestSuite.java:225) > at > org.eclipse.jdt.internal.junit.runner.junit3.JUnit3TestReference.run(JUnit3TestReference.java:130) > at > org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38) > at > org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:460) > at > org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:673) > at > org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:386) > at > 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:196) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1510) InstantiatedIndexReader throws NullPointerException in norms() when used with a MultiReader
[ https://issues.apache.org/jira/browse/LUCENE-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661908#action_12661908 ] Karl Wettin commented on LUCENE-1510: - Thanks for the report Robert! I've committed a fix in revision 732661. Please check it out and let me know how it works for you. There were a few discrepancies between how the InstantiatedIndexReader handled null norms compared to a SegmentReader. I think these problems are fixed now. > InstantiatedIndexReader throws NullPointerException in norms() when used with > a MultiReader > --- > > Key: LUCENE-1510 > URL: https://issues.apache.org/jira/browse/LUCENE-1510 > Project: Lucene - Java > Issue Type: Bug > Components: contrib/* >Affects Versions: 2.4 >Reporter: Robert Newson >Assignee: Karl Wettin > Attachments: TestWithMultiReader.java > > > When using InstantiatedIndexReader under a MultiReader where the other Reader > contains documents, a NullPointerException is thrown here; > public void norms(String field, byte[] bytes, int offset) throws IOException > { > byte[] norms = > getIndex().getNormsByFieldNameAndDocumentNumber().get(field); > System.arraycopy(norms, 0, bytes, offset, norms.length); > } > the 'norms' variable is null. Performing the copy only when norms is not null > does work, though I'm sure it's not the right fix. 
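The workaround described in the report — performing the copy only when the field actually has norms — can be sketched in plain Java. This is a hypothetical stand-in class for illustration; the actual fix committed in revision 732661 may differ:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a null-guarded norms() in the style of the reported code.
// NormsGuard is a hypothetical stand-in for InstantiatedIndexReader.
public class NormsGuard {
    private final Map<String, byte[]> normsByField = new HashMap<String, byte[]>();

    public void setNorms(String field, byte[] norms) {
        normsByField.put(field, norms);
    }

    // Copies the field's norms into bytes at offset, tolerating fields
    // that have no norms -- the case that triggered the NPE under MultiReader.
    public void norms(String field, byte[] bytes, int offset) {
        byte[] norms = normsByField.get(field);
        if (norms == null) {
            return; // no norms for this field; leave bytes untouched
        }
        System.arraycopy(norms, 0, bytes, offset, norms.length);
    }

    public static void main(String[] args) {
        NormsGuard reader = new NormsGuard();
        byte[] out = new byte[4];
        reader.norms("missing", out, 0); // previously: NullPointerException
        reader.setNorms("body", new byte[]{1, 2});
        reader.norms("body", out, 2);
        System.out.println(out[2] + "," + out[3]); // prints "1,2"
    }
}
```

As the reporter notes, silently skipping the copy may not be the right fix; a SegmentReader instead fakes norms for fields that have none, which is the discrepancy the commit addressed.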
[jira] Assigned: (LUCENE-1510) InstantiatedIndexReader throws NullPointerException in norms() when used with a MultiReader
[ https://issues.apache.org/jira/browse/LUCENE-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wettin reassigned LUCENE-1510: --- Assignee: Karl Wettin > InstantiatedIndexReader throws NullPointerException in norms() when used with > a MultiReader > --- > > Key: LUCENE-1510 > URL: https://issues.apache.org/jira/browse/LUCENE-1510 > Project: Lucene - Java > Issue Type: Bug > Components: contrib/* >Affects Versions: 2.4 >Reporter: Robert Newson >Assignee: Karl Wettin > Attachments: TestWithMultiReader.java > > > When using InstantiatedIndexReader under a MultiReader where the other Reader > contains documents, a NullPointerException is thrown here; > public void norms(String field, byte[] bytes, int offset) throws IOException > { > byte[] norms = > getIndex().getNormsByFieldNameAndDocumentNumber().get(field); > System.arraycopy(norms, 0, bytes, offset, norms.length); > } > the 'norms' variable is null. Performing the copy only when norms is not null > does work, though I'm sure it's not the right fix. 
[jira] Commented: (LUCENE-1501) Phonetic filters
[ https://issues.apache.org/jira/browse/LUCENE-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12660196#action_12660196 ] Karl Wettin commented on LUCENE-1501: - bq. Ryan McKinley - 30/Dec/08 10:36 AM bq. FYI, solr includes phonetic filters also... perhaps we should consolidate? Ah, yes I think we should. I'll take a look at how they differ. > Phonetic filters > > > Key: LUCENE-1501 > URL: https://issues.apache.org/jira/browse/LUCENE-1501 > Project: Lucene - Java > Issue Type: New Feature > Components: contrib/* >Reporter: Karl Wettin >Assignee: Karl Wettin >Priority: Minor > Attachments: LUCENE-1501.txt > > > Metaphone, double metaphone, soundex and refined soundex filters using > commons codec API. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1501) Phonetic filters
[ https://issues.apache.org/jira/browse/LUCENE-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wettin updated LUCENE-1501: Attachment: LUCENE-1501.txt This is in need of a bit of documentation about the different algorithms. It could also use some tests with alternative languages. > Phonetic filters > > > Key: LUCENE-1501 > URL: https://issues.apache.org/jira/browse/LUCENE-1501 > Project: Lucene - Java > Issue Type: New Feature > Components: contrib/* > Reporter: Karl Wettin > Assignee: Karl Wettin >Priority: Minor > Attachments: LUCENE-1501.txt > > > Metaphone, double metaphone, soundex and refined soundex filters using > commons codec API. 
[jira] Created: (LUCENE-1501) Phonetic filters
Phonetic filters Key: LUCENE-1501 URL: https://issues.apache.org/jira/browse/LUCENE-1501 Project: Lucene - Java Issue Type: New Feature Components: contrib/* Reporter: Karl Wettin Assignee: Karl Wettin Priority: Minor Metaphone, double metaphone, soundex and refined soundex filters using commons codec API. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
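The filters in this issue delegate to the commons-codec encoders (org.apache.commons.codec.language.Metaphone, DoubleMetaphone, Soundex, RefinedSoundex) rather than implementing the algorithms themselves. For readers unfamiliar with what a phonetic encoder produces, here is a simplified plain-Java Soundex sketch — illustration only: it omits the full algorithm's H/W rule and assumes alphabetic input, so it is not the commons-codec class:

```java
// Simplified American Soundex: keep the first letter, map the remaining
// consonants to digit classes, drop vowels, collapse adjacent duplicate
// codes, and pad to four characters. Assumes an alphabetic, non-empty word.
public class SimpleSoundex {
    // Digit class for each letter a..z ('0' = vowel-like, not emitted).
    private static final String CODES = "01230120022455012623010202";

    public static String encode(String word) {
        String w = word.toUpperCase();
        StringBuilder sb = new StringBuilder();
        sb.append(w.charAt(0));
        char last = CODES.charAt(w.charAt(0) - 'A');
        for (int i = 1; i < w.length() && sb.length() < 4; i++) {
            char c = w.charAt(i);
            if (c < 'A' || c > 'Z') continue;
            char code = CODES.charAt(c - 'A');
            if (code != '0' && code != last) sb.append(code);
            last = code;
        }
        while (sb.length() < 4) sb.append('0');
        return sb.toString();
    }

    public static void main(String[] args) {
        // Similar-sounding names collapse to the same key.
        System.out.println(SimpleSoundex.encode("Robert") + " "
                + SimpleSoundex.encode("Rupert")); // prints "R163 R163"
    }
}
```

Indexing such codes alongside (or instead of) the original tokens is what lets a query for "Rupert" match a document containing "Robert".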
[jira] Closed: (LUCENE-1462) Instantiated/IndexWriter discrepanies
[ https://issues.apache.org/jira/browse/LUCENE-1462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wettin closed LUCENE-1462. --- Resolution: Fixed Committed in r726030 and r725837. > Instantiated/IndexWriter discrepanies > - > > Key: LUCENE-1462 > URL: https://issues.apache.org/jira/browse/LUCENE-1462 > Project: Lucene - Java > Issue Type: Bug > Components: contrib/* >Affects Versions: 2.4 > Reporter: Karl Wettin >Assignee: Karl Wettin >Priority: Critical > Fix For: 2.9 > > Attachments: LUCENE-1462.txt > > > * RAMDirectory seems to do a reset on tokenStreams the first time, this > permits to initialise some objects before starting streaming, > InstantiatedIndex does not. > * I can Serialize a RAMDirectory but I cannot on a InstantiatedIndex because > of : java.io.NotSerializableException: > org.apache.lucene.index.TermVectorOffsetInfo > http://www.nabble.com/InstatiatedIndex-questions-to20576722.html 
Re: SVN karma problem?
Everything worked great when I switched from svn.eu.apache.org to svn.apache.org. I suppose I should report that to someone. Infra? On 12 Dec 2008, at 00:13, Grant Ingersoll wrote: http://www.nabble.com/Committing-new-files-to-(write-through-proxy)-slave-repo-fails---400-Bad-Request-td20083914.html Any of that ring a bell? On Dec 11, 2008, at 5:49 PM, Karl Wettin wrote: I tried clean checkout, upgraded my SVN client and a bunch of other things. I could try to add and remove an alternative dummy file. On 11 Dec 2008, at 23:35, Grant Ingersoll wrote: Does an svn cleanup help? What about on a clean checkout? On Dec 11, 2008, at 5:13 PM, Karl Wettin wrote: I can't seem to commit new files in contrib, only update existing. Or am I misinterpreting the error? svn: Commit failed (details follow): svn: Server sent unexpected return value (400 Bad Request) in response to PROPFIND request for '/repos/asf/!svn/wrk/d81a2cce-e749-4cd0-a609-6e2a3763b81d/lucene/java/trunk/contrib/instantiated/src/test/org/apache/lucene/store/instantiated/TestSerialization.java' svn: Your commit message was left in a temporary file: svn: '/Users/kalle/projekt/apache/lucene/trunk/svn-commit.tmp' karl -- Grant Ingersoll Lucene Helpful Hints: http://wiki.apache.org/lucene-java/BasicsOfPerformance http://wiki.apache.org/lucene-java/LuceneFAQ
Re: SVN karma problem?
I tried clean checkout, upgraded my SVN client and a bunch of other things. I could try to add and remove an alternative dummy file. On 11 Dec 2008, at 23:35, Grant Ingersoll wrote: Does an svn cleanup help? What about on a clean checkout? On Dec 11, 2008, at 5:13 PM, Karl Wettin wrote: I can't seem to commit new files in contrib, only update existing. Or am I misinterpreting the error? svn: Commit failed (details follow): svn: Server sent unexpected return value (400 Bad Request) in response to PROPFIND request for '/repos/asf/!svn/wrk/d81a2cce-e749-4cd0-a609-6e2a3763b81d/lucene/java/trunk/contrib/instantiated/src/test/org/apache/lucene/store/instantiated/TestSerialization.java' svn: Your commit message was left in a temporary file: svn: '/Users/kalle/projekt/apache/lucene/trunk/svn-commit.tmp' karl
SVN karma problem?
I can't seem to commit new files in contrib, only update existing. Or am I misinterpreting the error? svn: Commit failed (details follow): svn: Server sent unexpected return value (400 Bad Request) in response to PROPFIND request for '/repos/asf/!svn/wrk/d81a2cce-e749-4cd0-a609-6e2a3763b81d/lucene/java/trunk/contrib/instantiated/src/test/org/apache/lucene/store/instantiated/TestSerialization.java' svn: Your commit message was left in a temporary file: svn: '/Users/kalle/projekt/apache/lucene/trunk/svn-commit.tmp' karl
[jira] Updated: (LUCENE-1462) Instantiated/IndexWriter discrepanies
[ https://issues.apache.org/jira/browse/LUCENE-1462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wettin updated LUCENE-1462: Fix Version/s: 2.9 > Instantiated/IndexWriter discrepanies > - > > Key: LUCENE-1462 > URL: https://issues.apache.org/jira/browse/LUCENE-1462 > Project: Lucene - Java > Issue Type: Bug > Components: contrib/* >Affects Versions: 2.4 > Reporter: Karl Wettin >Assignee: Karl Wettin >Priority: Critical > Fix For: 2.9 > > Attachments: LUCENE-1462.txt > > > * RAMDirectory seems to do a reset on tokenStreams the first time, this > permits to initialise some objects before starting streaming, > InstantiatedIndex does not. > * I can Serialize a RAMDirectory but I cannot on a InstantiatedIndex because > of : java.io.NotSerializableException: > org.apache.lucene.index.TermVectorOffsetInfo > http://www.nabble.com/InstatiatedIndex-questions-to20576722.html -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Updated: (LUCENE-1462) Instantiated/IndexWriter discrepanies
[ https://issues.apache.org/jira/browse/LUCENE-1462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wettin updated LUCENE-1462: Attachment: LUCENE-1462.txt * Made a few classes implement java.io.Serializable * TestCase that makes sure InstantiatedIndex can be passed to an ObjectOutputStream * Added a tokenStream.reset() in InstantiatedIndexWriter I need help to get this committed as it contains a minor change to TermVectorOffsetInfo (implements Serializable) that's outside of the contrib module. > Instantiated/IndexWriter discrepanies > - > > Key: LUCENE-1462 > URL: https://issues.apache.org/jira/browse/LUCENE-1462 > Project: Lucene - Java > Issue Type: Bug > Components: contrib/* >Affects Versions: 2.4 > Reporter: Karl Wettin >Assignee: Karl Wettin >Priority: Critical > Attachments: LUCENE-1462.txt > > > * RAMDirectory seems to do a reset on tokenStreams the first time, this > permits to initialise some objects before starting streaming, > InstantiatedIndex does not. > * I can Serialize a RAMDirectory but I cannot on a InstantiatedIndex because > of : java.io.NotSerializableException: > org.apache.lucene.index.TermVectorOffsetInfo > http://www.nabble.com/InstatiatedIndex-questions-to20576722.html 
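The serialization test described in the patch boils down to a byte-array round trip through ObjectOutputStream: an object graph serializes only if every class it references implements java.io.Serializable, which is why the NotSerializableException pointed at TermVectorOffsetInfo. A plain-Java sketch of that check, with a hypothetical OffsetInfo stand-in rather than the actual Lucene classes:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Demonstrates the round-trip check described in the patch. OffsetInfo is
// a hypothetical stand-in for TermVectorOffsetInfo; the one-line fix was
// adding "implements Serializable" to the referenced class.
public class RoundTrip {
    public static class OffsetInfo implements Serializable {
        public final int start, end;
        public OffsetInfo(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    // Serialize to a byte array and read the object graph back.
    @SuppressWarnings("unchecked")
    public static <T extends Serializable> T roundTrip(T obj) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        new ObjectOutputStream(bos).writeObject(obj);
        ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bos.toByteArray()));
        return (T) in.readObject();
    }

    public static void main(String[] args) throws Exception {
        OffsetInfo copy = roundTrip(new OffsetInfo(3, 9));
        System.out.println(copy.start + ".." + copy.end); // prints "3..9"
    }
}
```

Without the Serializable marker on OffsetInfo, writeObject would throw java.io.NotSerializableException — the same failure mode reported for InstantiatedIndex.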
InstantiatedIndexWriter
I was just about to get on with LUCENE-1462 when I noticed the new TokenStream API. (Yeah, I've been really busy with other stuff for a while now.) Rather than keeping InstantiatedIndexWriter in sync with IndexWriter, I'm considering suggesting that we simply delete InstantiatedIndexWriter. There is one major caveat that would go away if we removed InstantiatedIndexWriter: it lacks read/write locks at commit time. Also, the javadocs say "consider using II as an immutable store" all over the place. I'm a bit split here: I can see the use of being able to add a few documents to an existing II, but at the same time these indices are meant to be really small, so creating a new one from an IndexReader is really no big deal. This operation means a few seconds of overhead if one needs to append data to the II. I say that we should remove it from trunk. Less hassle. Or would that remove good functionality? I never use it; it was written in order to understand Lucene. But if people find it very useful then of course it should be kept in there. Removing it might be a problem for some people, though. For instance, I think Jason Rutherglen's realtime search uses this class. karl
[jira] Created: (LUCENE-1462) Instantiated/IndexWriter discrepanies
Instantiated/IndexWriter discrepanies - Key: LUCENE-1462 URL: https://issues.apache.org/jira/browse/LUCENE-1462 Project: Lucene - Java Issue Type: Bug Components: contrib/* Affects Versions: 2.4 Reporter: Karl Wettin Assignee: Karl Wettin Priority: Critical * RAMDirectory seems to do a reset on tokenStreams the first time, this permits to initialise some objects before starting streaming, InstantiatedIndex does not. * I can Serialize a RAMDirectory but I cannot on a InstantiatedIndex because of : java.io.NotSerializableException: org.apache.lucene.index.TermVectorOffsetInfo http://www.nabble.com/InstatiatedIndex-questions-to20576722.html -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Closed: (LUCENE-1423) InstantiatedTermEnum#skipTo(Term) throws ArrayIndexOutOfBoundsException on empty index
[ https://issues.apache.org/jira/browse/LUCENE-1423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wettin closed LUCENE-1423. --- Resolution: Fixed committed in rev 705893 > InstantiatedTermEnum#skipTo(Term) throws ArrayIndexOutOfBoundsException on > empty index > -- > > Key: LUCENE-1423 > URL: https://issues.apache.org/jira/browse/LUCENE-1423 > Project: Lucene - Java > Issue Type: Bug > Components: contrib/* >Affects Versions: 2.4 >Reporter: Karl Wettin >Assignee: Karl Wettin > Fix For: 2.9 > > > {code} > java.lang.ArrayIndexOutOfBoundsException: 0 > at > org.apache.lucene.store.instantiated.InstantiatedTermEnum.skipTo(InstantiatedTermEnum.java:105) > at > org.apache.lucene.store.instantiated.TestEmptyIndex.termEnumTest(TestEmptyIndex.java:73) > at > org.apache.lucene.store.instantiated.TestEmptyIndex.testTermEnum(TestEmptyIndex.java:54) > {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Created: (LUCENE-1423) InstantiatedTermEnum#skipTo(Term) throws ArrayIndexOutOfBoundsException on empty index
InstantiatedTermEnum#skipTo(Term) throws ArrayIndexOutOfBoundsException on empty index -- Key: LUCENE-1423 URL: https://issues.apache.org/jira/browse/LUCENE-1423 Project: Lucene - Java Issue Type: Bug Components: contrib/* Affects Versions: 2.4 Reporter: Karl Wettin Assignee: Karl Wettin Fix For: 2.9 {code} java.lang.ArrayIndexOutOfBoundsException: 0 at org.apache.lucene.store.instantiated.InstantiatedTermEnum.skipTo(InstantiatedTermEnum.java:105) at org.apache.lucene.store.instantiated.TestEmptyIndex.termEnumTest(TestEmptyIndex.java:73) at org.apache.lucene.store.instantiated.TestEmptyIndex.testTermEnum(TestEmptyIndex.java:54) {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Setting Fix Version in JIRA
I think it makes more sense to leave the fix version to committers, set when they assign themselves to the issue. I say this because of the hundreds of open and unreviewed issues that one would otherwise have to update in the tracker between each release. On 23 Sep 2008, at 21:33, Otis Gospodnetic wrote: Hi, When people add new issues to JIRA they most often don't set the "Fix Version" field. Would it not be better to have a default value for that field, so that new entries don't get forgotten when we filter by "Fix Version" looking for issues to fix for the next release? If every issue had "Fix Version" set we'd be able to schedule things better, give reporters and others more insight into when a particular item will be taken care of, etc. When we are ready for the release we'd just bump all unresolved issues to the next planned version (e.g. Solr 1.3.1 or 1.4 or Lucene 2.4 or 2.9) Thoughts? Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
[jira] Updated: (LUCENE-1380) Patch for ShingleFilter.enablePositions
[ https://issues.apache.org/jira/browse/LUCENE-1380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wettin updated LUCENE-1380: Lucene Fields: [New, Patch Available] (was: [Patch Available, New]) Assignee: (was: Karl Wettin) I'm unassigning myself from this issue: even though there are so many votes, I consider it a hack to add a change whose sole purpose is to change the behavior of a query parser, and I don't think such a thing should be committed. I think the focus should be on the query parser, and I understand that is a lot more work than modifying the shingle filter. If you really want to make this change in this layer, I suggest that you separate out this feature into a new filter that modifies the position increment. > Patch for ShingleFilter.enablePositions > --- > > Key: LUCENE-1380 > URL: https://issues.apache.org/jira/browse/LUCENE-1380 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/analyzers >Reporter: Mck SembWever >Priority: Trivial > Attachments: LUCENE-1380.patch, LUCENE-1380.patch > > > Make it possible for *all* words and shingles to be placed at the same > position, that is for _all_ shingles (and unigrams if included) to be treated > as synonyms of each other. > Today the shingles generated are synonyms only to the first term in the > shingle. > For example the query "abcd efgh ijkl" results in: >("abcd" "abcd efgh" "abcd efgh ijkl") ("efgh" "efgh ijkl") ("ijkl") > where "abcd efgh" and "abcd efgh ijkl" are synonyms of "abcd", and "efgh > ijkl" is a synonym of "efgh". > There exists no way today to alter which token a particular shingle is a > synonym for. > This patch takes the first step in making it possible to make all shingles > (and unigrams if included) synonyms of each other. 
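The grouping in the issue's "abcd efgh ijkl" example — each shingle sharing a position with (being a "synonym" of) its first term — can be made concrete with a small plain-Java sketch. This is an illustration of the token grouping only, not Lucene's ShingleFilter:

```java
import java.util.ArrayList;
import java.util.List;

// For each input token, emit the unigram plus every shingle that starts
// at that token, up to maxSize words. Each inner list corresponds to one
// position; its members would all carry positionIncrement 0 after the
// first, which is why they behave as synonyms of the first term.
public class Shingles {
    public static List<List<String>> shinglesByPosition(String[] tokens, int maxSize) {
        List<List<String>> out = new ArrayList<List<String>>();
        for (int i = 0; i < tokens.length; i++) {
            List<String> atPos = new ArrayList<String>();
            StringBuilder sb = new StringBuilder();
            for (int j = i; j < tokens.length && j - i < maxSize; j++) {
                if (j > i) sb.append(' ');
                sb.append(tokens[j]);
                atPos.add(sb.toString());
            }
            out.add(atPos);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(shinglesByPosition(
                new String[]{"abcd", "efgh", "ijkl"}, 3));
        // prints "[[abcd, abcd efgh, abcd efgh ijkl], [efgh, efgh ijkl], [ijkl]]"
    }
}
```

The patch's request is orthogonal to generation: it asks to flatten all these groups onto one position, which is a position-increment change rather than a shingling change — hence the suggestion above to do it in a separate filter.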
[jira] Commented: (LUCENE-1387) Add LocalLucene
[ https://issues.apache.org/jira/browse/LUCENE-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12633102#action_12633102 ] Karl Wettin commented on LUCENE-1387: - bq. I'm struggling to get two of the existing tests to pass... I don't think it is from my modifications since they don't pass on the original either. On my box the test fails with different results due to the writer not being committed in setUp, giving me 0 results. After adding a commit it fails with the results you are reporting here. Is it possible that you are getting one sort of result in the original due to the non-committed writer and another error in this version due to your changes to the distance measurement? All points in the list are rather close to each other, so very small changes to the algorithm might be the problem. I have a hard time tracing the code and I'm sort of hoping this might be the problem. > Add LocalLucene > --- > > Key: LUCENE-1387 > URL: https://issues.apache.org/jira/browse/LUCENE-1387 > Project: Lucene - Java > Issue Type: New Feature > Components: contrib/* >Reporter: Grant Ingersoll >Priority: Minor > Attachments: spatial.zip > > > Local Lucene (Geo-search) has been donated to the Lucene project, per > https://issues.apache.org/jira/browse/INCUBATOR-77. This issue is to handle > the Lucene portion of integration. > See http://lucene.markmail.org/message/orzro22sqdj3wows?q=LocalLucene 
Re: 2.4 release candidate 1
403 access denied :(

Index: package.html
===
--- package.html (revision 697120)
+++ package.html (arbetskopia)
@@ -56,6 +56,8 @@
 Mileage may vary depending on term saturation.
+
+
 Populated with a single document InstantiatedIndex is almost, but not quite, as fast as MemoryIndex.
Index: doc-files/HitCollectionBench.jpg

On 19 Sep 2008, at 16:42, Michael McCandless wrote: I agree it makes sense to get this into 2.4. Yes I'll roll an RC2 soon, with all the little fixes pending on 2.4: https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&mode=hide&sorter/order=DESC&sorter/field=priority&resolution=-1&pid=12310110&fixfor=12312681 I'm not certain but I would assume you have the karma to commit to contrib on the 2.4 branch. Try it out and see? Make sure you commit to trunk too. Mike Karl Wettin wrote: There is going to be an rc2, right? A couple of people have asked me questions about the performance of InstantiatedIndex (via private mail and on the freenode #lucene channel). They have tried to use it as a replacement for RAMDirectory with rather large corpora. There is a graph in the JIRA issue that clearly shows this is not always a good idea, and I think it would be a good thing to include this graph in the package javadocs. http://issues.apache.org/jira/secure/attachment/12353601/HitCollectionBench.jpg Is there still time to get that in there? As this will be the first release containing InstantiatedIndex I'd say it makes a lot of sense to pop it in. Do I have karma to modify the branch? Binary files and patches does not compute according to svn diff. karl On 18 Sep 2008, at 20:29, Michael McCandless wrote: Hi, I just created the first release candidate for 2.4, here: http://people.apache.org/~mikemccand/staging-area/lucene2.4rc1 Please download the release candidate, kick the tires and report back on any issues you encounter. 
The plan is to make only serious bug fixes or build/doc fixes to 2.4 for ~10 days, after which, if there are no blockers, I'll call a vote for the actual release. Happy testing, and thanks! Mike