[jira] Commented: (LUCENE-826) Language detector

2010-01-26 Thread Karl Wettin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12805027#action_12805027
 ] 

Karl Wettin commented on LUCENE-826:


Hi Ken,

it's hard for me to compare. I'll rant a bit about my experience with language 
detection though. 

I still haven't found one strategy that works well on any kind of text: a user 
query, a sentence, a paragraph or a complete document. 1-5 grams using SVM or NB 
work pretty well for them all, but you really need to train with the same sort 
of data you want to classify. Even when training with a mix of text lengths it 
tends to perform a lot worse than if you had one classifier for each data type. 
And you still probably want to twiddle with the classifier knobs to make it 
work well with the data you are classifying and training with.

In some cases I've used 1-10 grams and other times I've used 2-4 grams. 
Sometimes I've used SVM and other times I've used a simple decision tree.

To sum it up, to achieve good quality I've always had to build a classifier 
for that specific use case. Weka has a great test suite for figuring out what 
to use. Set it up, press play and return one week later to see the result.
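
As an illustration only (this is not code from the attached ld.tar.gz), the kind 
of character n-gram features an SVM or NB classifier is fed can be sketched like 
this:

{code}
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of character 1-5 gram extraction; the class name is made up.
public class NgramFeatures {

  public static Map<String, Integer> extract(String text, int minGram, int maxGram) {
    Map<String, Integer> counts = new HashMap<String, Integer>();
    for (int n = minGram; n <= maxGram; n++) {
      for (int i = 0; i + n <= text.length(); i++) {
        String gram = text.substring(i, i + n);
        Integer count = counts.get(gram);
        counts.put(gram, count == null ? 1 : count + 1);
      }
    }
    return counts;
  }

  public static void main(String[] args) {
    // prints gram frequencies such as {h=1, he=1, ej=1, ...}
    System.out.println(extract("hej varlden", 1, 5));
  }
}
{code}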

> Language detector
> -
>
> Key: LUCENE-826
> URL: https://issues.apache.org/jira/browse/LUCENE-826
> Project: Lucene - Java
>      Issue Type: New Feature
>Reporter: Karl Wettin
>Assignee: Karl Wettin
> Attachments: ld.tar.gz, ld.tar.gz
>
>
> A formula 1A token/ngram-based language detector. Requires a paragraph of 
> text to avoid false positive classifications. 
> Depends on contrib/analyzers/ngrams for tokenization, Weka for classification 
> (logistic support vector models), feature selection and normalization of token 
> frequencies.  Optionally Wikipedia and NekoHTML for training data harvesting.
> Initialized like this:
> {code}
> LanguageRoot root = new LanguageRoot(new File("documentClassifier/language root"));
> root.addBranch("uralic");
> root.addBranch("fino-ugric", "uralic");
> root.addBranch("ugric", "uralic");
> root.addLanguage("fino-ugric", "fin", "finnish", "fi", "Suomi");
> root.addBranch("proto-indo european");
> root.addBranch("germanic", "proto-indo european");
> root.addBranch("northern germanic", "germanic");
> root.addLanguage("northern germanic", "dan", "danish", "da", "Danmark");
> root.addLanguage("northern germanic", "nor", "norwegian", "no", "Norge");
> root.addLanguage("northern germanic", "swe", "swedish", "sv", "Sverige");
> root.addBranch("west germanic", "germanic");
> root.addLanguage("west germanic", "eng", "english", "en", "UK");
> root.mkdirs();
> LanguageClassifier classifier = new LanguageClassifier(root);
> if (!new File(root.getDataPath(), "trainingData.arff").exists()) {
>   classifier.compileTrainingData(); // from wikipedia
> }
> classifier.buildClassifier();
> {code}
> The training set built from Wikipedia consists of the pages describing the 
> home country of each registered language, in the language to train. The above 
> example passes this test:
> (testEquals is the same as assertEquals, just not required. Only one of them 
> fails, see comment.)
> {code}
> assertEquals("swe", classifier.classify(sweden_in_swedish).getISO());
> testEquals("swe", classifier.classify(norway_in_swedish).getISO());
> testEquals("swe", classifier.classify(denmark_in_swedish).getISO());
> testEquals("swe", classifier.classify(finland_in_swedish).getISO());
> testEquals("swe", classifier.classify(uk_in_swedish).getISO());
> testEquals("nor", classifier.classify(sweden_in_norwegian).getISO());
> assertEquals("nor", classifier.classify(norway_in_norwegian).getISO());
> testEquals("nor", classifier.classify(denmark_in_norwegian).getISO());
> testEquals("nor", classifier.classify(finland_in_norwegian).getISO());
> testEquals("nor", classifier.classify(uk_in_norwegian).getISO());
> testEquals("fin", classifier.classify(sweden_in_finnish).getISO());
> testEquals("fin", classifier.classify(norway_i

[jira] Commented: (LUCENE-626) Extended spell checker with phrase support and adaptive user session analysis.

2010-01-26 Thread Karl Wettin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12805021#action_12805021
 ] 

Karl Wettin commented on LUCENE-626:


Hej Mikkel,

the test case data set is on an HDD hidden away in an attic 600 km away from me, 
but I've asked someone in the vicinity to fetch it for me. It might take a 
little while. Sorry!

It is, however, extremely cool that you're working with this old beast! I'm 
super busy as always, but I promise to follow your progress in case there is 
something you wonder about. It's been a few years since I looked at the code 
though.

> Extended spell checker with phrase support and adaptive user session analysis.
> --
>
> Key: LUCENE-626
> URL: https://issues.apache.org/jira/browse/LUCENE-626
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Reporter: Karl Wettin
>Priority: Minor
> Attachments: LUCENE-626_20071023.txt
>
>
> Extensive javadocs available in patch, but I also try to keep it compiled 
> here: 
> http://ginandtonique.org/~kalle/javadocs/didyoumean/org/apache/lucene/search/didyoumean/package-summary.html#package_description
> A somewhat naive reinforcement learning thingy backed by algorithmic second 
> level suggestion schemes that learns from and adapts to user behavior as 
> queries change, suggestions are accepted or declined, etc.
> Besides detecting spelling errors it considers context, 
> composition/decomposition and a few other things.
> heroes of light and magik -> heroes of might and magic
> vinci da code -> da vinci code
> java docs -> javadocs
> blacksabbath -> black sabbath
> Depends on LUCENE-550

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: [jira] Commented: (LUCENE-2194) improve efficiency of snowballfilter

2010-01-07 Thread Karl Wettin

+1

7 jan 2010 kl. 19.50 skrev Robert Muir (JIRA):



   [ https://issues.apache.org/jira/browse/LUCENE-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12797736 
#action_12797736 ]


Robert Muir commented on LUCENE-2194:
-

i tested this with some English, benchmark pkg, etc and at most it  
seems to only improve processing speed by 10%. but I think it's worth  
the trouble since it's an easy improvement.


i'll commit in a few days if no one objects.


improve efficiency of snowballfilter


   Key: LUCENE-2194
   URL: https://issues.apache.org/jira/browse/LUCENE-2194
   Project: Lucene - Java
Issue Type: Improvement
Components: contrib/analyzers
  Reporter: Robert Muir
  Assignee: Robert Muir
  Priority: Minor
   Fix For: 3.1

   Attachments: LUCENE-2194.patch


snowball stemming currently creates 2 new strings and 1 new  
stringbuilder for every word.

all of this is unnecessary, so don't do it.
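
A sketch of how an allocation-free version could look, assuming SnowballProgram 
gets buffer-based accessors (the names below are guesses, not necessarily what 
the patch uses):

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.tartarus.snowball.SnowballProgram;

public final class BufferReusingSnowballFilter extends TokenFilter {
  private final SnowballProgram stemmer;
  private final TermAttribute termAtt = addAttribute(TermAttribute.class);

  public BufferReusingSnowballFilter(TokenStream input, SnowballProgram stemmer) {
    super(input);
    this.stemmer = stemmer;
  }

  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    // work directly on the term's char[] buffer: no new String per word
    stemmer.setCurrent(termAtt.termBuffer(), termAtt.termLength());
    stemmer.stem();
    // copy the stemmed chars back without building an intermediate String
    termAtt.setTermBuffer(stemmer.getCurrentBuffer(), 0,
        stemmer.getCurrentBufferLength());
    return true;
  }
}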


--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org




-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1515) Improved(?) Swedish snowball stemmer

2010-01-03 Thread Karl Wettin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795968#action_12795968
 ] 

Karl Wettin commented on LUCENE-1515:
-

I just posted this to the Snowball users list:

The Swedish Snowball stemmer does a terrible job according to 
<http://web.jhu.edu/bin/q/b/p75-mcnamee.pdf>. The paper even claims that lfs5, 
i.e. substring(0,5), does a better job. (It also says that 5-grams crack the 
nut.)

This didn't come as a surprise to me, as I've identified problems in the past 
and implemented my own augmentation that's been posted to this list before, now 
living at <http://issues.apache.org/jira/browse/LUCENE-1515>.

Reading the paper made me take a closer look at what's wrong.

define main_suffix as (
setlimit tomark p1 for ([substring])
among(
'a' 'arna' 'erna' 'heterna' 'orna' 'ad' 'e' 'ade' 'ande' 'arne'
'are' 'aste' 'en' 'anden' 'aren' 'heten' 'ern' 'ar' 'er' 'heter'
'or' 'as' 'arnas' 'ernas' 'ornas' 'es' 'ades' 'andes' 'ens' 'arens'
'hetens' 'erns' 'at' 'andet' 'het' 'ast'
'era' 'erar' 'erarna' 'erarnas' 
// augmentation starts here
'an' 'anen' 'anens' 'anare' 'aner' 'anerna' 'anernas'
'ans' 'ansen' 'ansens' 'anser' 'ansera'  'anserar' 'anserna' 
'ansernas'
'iera' 'ierat' 'ierats' 'ierad' 'ierade' 'ierades'
'ikation'
'ikat' 'ikatet' 'ikatets' 'ikaten' 'ikatens'
// augmentation ends here
(delete)

's'
(s_ending delete)
)
)



In conjunction with ~200 exception rules these additions help. There are 
however quite a few problems with many of the old rules.


E.g. 's' (s_ending delete) is a plural rule but has ~5300 exceptions where 
words ending in s are nominative case singular. The problem arises when such a 
word is written in a form other than the nominative case. 

kurs (course)
kursen (the course)
kursens (the [undefined noun] of the course)
kurser (courses)
kurserna (the courses)
kursernas (the [undefined noun] of the courses)

Kurs is stemmed to "kur" (which by the way will falsely match kur as in 
remedy) while all the others are correctly stemmed as "kurs".

Altogether there are, by my estimation, some 10 000 words that will produce 
incompatible stems between the nominative case singular and any other form. 
That is about 8% of the official language. 

One rather simple solution is to always index both unstemmed and stemmed words, 
e.g. as synonyms in an inverted index. But if only the stemmed output is used 
(from the official stemmer or my augmentation) I'd argue it's better to skip 
stemming altogether.
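
A minimal sketch of that dual-indexing idea (the filter name is made up, and it 
assumes Lucene's post-2.9 attribute API):

{code}
import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.tartarus.snowball.ext.SwedishStemmer;

// Emits each token unchanged, followed by its stem at the same position
// (position increment 0) whenever the stem differs from the source token.
public final class StemAsSynonymFilter extends TokenFilter {

  private final TermAttribute termAtt = addAttribute(TermAttribute.class);
  private final PositionIncrementAttribute posIncrAtt =
      addAttribute(PositionIncrementAttribute.class);
  private final SwedishStemmer stemmer = new SwedishStemmer();
  private State pendingState;
  private String pendingStem;

  public StemAsSynonymFilter(TokenStream input) {
    super(input);
  }

  public boolean incrementToken() throws IOException {
    if (pendingStem != null) {
      restoreState(pendingState);           // keep the source token's offsets
      termAtt.setTermBuffer(pendingStem);
      posIncrAtt.setPositionIncrement(0);   // same position as the source token
      pendingStem = null;
      return true;
    }
    if (!input.incrementToken()) {
      return false;
    }
    stemmer.setCurrent(termAtt.term());
    stemmer.stem();
    String stem = stemmer.getCurrent();
    if (!stem.equals(termAtt.term())) {     // only queue stems that differ
      pendingStem = stem;
      pendingState = captureState();
    }
    return true;
  }
}
{code}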

A better solution would be to set up the stemmer to ignore the 10 000 
exceptions. What would be the best way to implement this? I'd like the 
generated Java code to simply contain a HashSet<String> noStemExceptions that 
is checked first, or something like that.
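
A sketch of what I have in mind; the subclass name and the four sample entries 
are illustrative only, the real set would hold the ~10 000 exceptions:

{code}
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import org.tartarus.snowball.ext.SwedishStemmer;

// Exception words bypass the stemming rules entirely.
public class ExceptionAwareSwedishStemmer extends SwedishStemmer {

  private static final Set<String> NO_STEM = new HashSet<String>(
      Arrays.asList("kurs", "svans", "finans", "nyans"));

  public boolean stem() {
    if (NO_STEM.contains(getCurrent())) {
      return true;              // leave exception words untouched
    }
    return super.stem();
  }
}
{code}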


> Improved(?) Swedish snowball stemmer
> 
>
> Key: LUCENE-1515
> URL: https://issues.apache.org/jira/browse/LUCENE-1515
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/*
>Affects Versions: 2.4
>Reporter: Karl Wettin
> Attachments: LUCENE-1515.txt
>
>
> The Snowball stemmer for Swedish lacks support for '-an' and '-ans' related 
> suffix stripping, ending up with incompatible stems, for example "klocka", 
> "klockor", "klockornas", "klockAN", "klockANS".  Complete list of new suffix 
> stripping rules:
> {pre}
> 'an' 'anen' 'anens' 'anare' 'aner' 'anerna' 'anernas'
> 'ans' 'ansen' 'ansens' 'anser' 'ansera'  'anserar' 'anserna' 
> 'ansernas'
> 'iera'
> (delete)
> {pre}
> Th

[jira] Commented: (LUCENE-1515) Improved(?) Swedish snowball stemmer

2010-01-03 Thread Karl Wettin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795967#action_12795967
 ] 

Karl Wettin commented on LUCENE-1515:
-

I've added a few more rules. I'll have to add a few more tests etc before I 
post a new patch.

{code}

define main_suffix as (
setlimit tomark p1 for ([substring])
among(
'a' 'arna' 'erna' 'heterna' 'orna' 'ad' 'e' 'ade' 'ande' 'arne'
'are' 'aste' 'en' 'anden' 'aren' 'heten' 'ern' 'ar' 'er' 'heter'
'or' 'as' 'arnas' 'ernas' 'ornas' 'es' 'ades' 'andes' 'ens' 'arens'
'hetens' 'erns' 'at' 'andet' 'het' 'ast'
'era' 'erar' 'erarna' 'erarnas' 
// augmentation starts here
'an' 'anen' 'anens' 'anare' 'aner' 'anerna' 'anernas'
'ans' 'ansen' 'ansens' 'anser' 'ansera'  'anserar' 'anserna' 
'ansernas'
'iera' 'ierat' 'ierats' 'ierad' 'ierade' 'ierades'
'ikation'
'ikat' 'ikatet' 'ikatets' 'ikaten' 'ikatens'
// augmentation ends here
(delete)

's'
(s_ending delete)
)
)
{code}

> Improved(?) Swedish snowball stemmer
> 
>
> Key: LUCENE-1515
> URL: https://issues.apache.org/jira/browse/LUCENE-1515
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/*
>Affects Versions: 2.4
>Reporter: Karl Wettin
> Attachments: LUCENE-1515.txt
>
>
> The Snowball stemmer for Swedish lacks support for '-an' and '-ans' related 
> suffix stripping, ending up with incompatible stems, for example "klocka", 
> "klockor", "klockornas", "klockAN", "klockANS".  Complete list of new suffix 
> stripping rules:
> {pre}
> 'an' 'anen' 'anens' 'anare' 'aner' 'anerna' 'anernas'
> 'ans' 'ansen' 'ansens' 'anser' 'ansera'  'anserar' 'anserna' 
> 'ansernas'
> 'iera'
> (delete)
> {pre}
> The problem is all the exceptions (e.g. svans|svan, finans|fin, nyans|ny) and 
> this is an attempt at solving that problem. The rules and exceptions are 
> based on the [SAOL|http://en.wikipedia.org/wiki/Svenska_Akademiens_Ordlista] 
> entries suffixed with 'an' and 'ans'. There are a few known problematic 
> stemming rules, but it seems to work quite a bit better than the current 
> SwedishStemmer. It would not be a bad idea to check all SAOL entries in 
> order to verify the integrity of the rules.
> My Snowball syntax skills are rather limited so I'm certain the code could be 
> optimized quite a bit.
> *The code is released under BSD and not ASL*. I've been posting a bit in the 
> Snowball forum and privately to Martin Porter himself but never got any 
> response, so now I post it here instead in hope of some momentum.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: LUCENE-1515

2010-01-02 Thread Karl Wettin


1 jan 2010 kl. 14.28 skrev Grant Ingersoll:

Please, no Swedish2 or any variant like that.  How about something  
that lets users know what it is and why they should use it?


In my view Swedish2 is a better name than  
MoreSupportForGenitiveCaseSufficesThanSwedishStemmer. Such a name can  
turn out pretty far-fetched if someone adds more rules to it in the  
future.


Perhaps AugmentedSwedishStemmer?


 karl

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: LUCENE-1515

2010-01-02 Thread Karl Wettin
I'm actually not sure I understand the question. Afaik backwards  
compatibility with the current SwedishStemmer could only be achieved  
by stemming with both classes and making the differing outputs synonyms.


I just did a bit of testing and the problems I've identified in 1515  
are also present in SwedishStemmer. Not that surprising, as 1515 is an  
augmentation of SwedishStemmer...


Personally I would not mind deprecating SwedishStemmer (renaming it to  
OldSwedish or something) and later on replacing it with 1515, but that  
might mess things up for some people who don't read the README and just  
upgrade the jar while running on the same old index.



31 dec 2009 kl. 21.55 skrev Simon Willnauer:


Is there any chance to get the best of both worlds? Could we merge
both together and preserve bw compat with version?
Introducing another stemmer doing almost the same thing as an already
existing one is exactly what we try to prevent right now. I don't
doubt that this issue is an improvement; I'm just thinking of a way to keep
code duplication as low as possible.

I haven't looked at the code yet, so if my questions are complete
nonsense let me know.

simon

On Thu, Dec 31, 2009 at 6:05 PM, Karl Wettin   
wrote:


31 dec 2009 kl. 17.43 skrev Simon Willnauer:

what is the essential difference between the existing and  
LUCENE-1515

stemmer?


1515 handles genitive case suffixes better. An example:

klocka (a clock)
klockan (the clock)
klockans (the [insert noun] of the clock)
klockornas (the [insert noun] of the clocks)

Using snowball SwedishStemmer:

klocka -> klock
klockan -> klock
klockans  -> klockans
klockornas -> klockornas


karl



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org




-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org




-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: LUCENE-1515

2009-12-31 Thread Karl Wettin


31 dec 2009 kl. 17.43 skrev Simon Willnauer:

what is the essential difference between the existing and  
LUCENE-1515 stemmer?


1515 handles genitive case suffixes better. An example:

klocka (a clock)
klockan (the clock)
klockans (the [insert noun] of the clock)
klockornas (the [insert noun] of the clocks)

Using snowball SwedishStemmer:

klocka -> klock
klockan -> klock
klockans  -> klockans
klockornas -> klockornas


 karl



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



LUCENE-1515

2009-12-31 Thread Karl Wettin
1515 is an alternative Swedish stemmer that handles a couple of things  
unsupported by the original stemmer. A few things are handled worse,  
but altogether I think it's a better algorithm. I've used it in two  
commercial applications. I'd like to commit it. Even though I've done  
my best to make them notice it, the Snowball community never commented  
on it. Perhaps I should try once again before pushing it to Lucene.


The code is, like the rest of the snowball contrib package, BSD. That  
shouldn't cause any problems, right?


What should I call this stemmer? Swedish2? SwedishToo? Svenska? :)


http://issues.apache.org/jira/browse/LUCENE-1515


 karl

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2144) InstantiatedIndexReader does not handle #termDocs(null) correct (AllTermDocs)

2009-12-11 Thread Karl Wettin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789248#action_12789248
 ] 

Karl Wettin commented on LUCENE-2144:
-

I don't have any strong feelings about this line of code, but let me at least 
explain it.

I like the idea that IIFoo behaves the same way as SegmentFoo, even during 
incorrect/undocumented use of the API. 

There are no real use cases for this in the Lucene distribution; there are 
however effects people might rely on even though they are caused by invalid 
and not recommended use of the API. E.g. a skipTo to a target greater than the 
greatest document associated with that term will position the enum at the 
greatest document number for that term. Even though I wouldn't do something 
like this, others might. 

In this case, where #next() is called immediately on IR#termDocs(), it might 
look silly to compare the behaviour of II and Segment as it's such blatantly 
erroneous use of the API, but even I have been known to come up with some 
rather strange solutions now and then when nobody else is looking.

One alternative is that #next would throw an IllegalStateException or 
something instead of just accepting the call, but then there is of course the 
small extra cost associated with checking whether the enum has been seeked 
yet, and #next is a rather commonly used method.
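
To illustrate, such a check could look like this hypothetical decorator (not 
part of the patch):

{code}
import java.io.IOException;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermEnum;

// Makes the "call #next() before seeking" misuse fail fast instead of
// silently doing something implementation specific.
public class SeekCheckingTermDocs implements TermDocs {

  private final TermDocs delegate;
  private boolean seeked = false;

  public SeekCheckingTermDocs(TermDocs delegate) {
    this.delegate = delegate;
  }

  public void seek(Term term) throws IOException {
    seeked = true;
    delegate.seek(term);
  }

  public void seek(TermEnum termEnum) throws IOException {
    seeked = true;
    delegate.seek(termEnum);
  }

  public boolean next() throws IOException {
    if (!seeked) {
      throw new IllegalStateException("seek() must be called before next()");
    }
    return delegate.next();
  }

  public int doc() { return delegate.doc(); }
  public int freq() { return delegate.freq(); }

  public int read(int[] docs, int[] freqs) throws IOException {
    return delegate.read(docs, freqs);
  }

  public boolean skipTo(int target) throws IOException {
    return delegate.skipTo(target);
  }

  public void close() throws IOException {
    delegate.close();
  }
}
{code}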

> InstantiatedIndexReader does not handle #termDocs(null) correct (AllTermDocs)
> -
>
> Key: LUCENE-2144
> URL: https://issues.apache.org/jira/browse/LUCENE-2144
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/*
>Affects Versions: 2.9, 2.9.1, 3.0
>Reporter: Karl Wettin
>Assignee: Michael McCandless
>Priority: Critical
> Attachments: LUCENE-2144-30.patch, LUCENE-2144.txt
>
>
> This patch contains core changes so someone else needs to commit it.
> Due to the incompatible #termDocs(null) behaviour, at least MatchAllDocsQuery, 
> FieldCacheRangeFilter and ValueSourceQuery fail when using II since 2.9.
> AllTermDocs now has a superclass, AbstractAllTermDocs, which 
> InstantiatedAllTermDocs also extends.
> Also:
>  * II tests made less likely to pass on future incompatible changes to 
> TermDocs and TermEnum
>  * IITermDocs#skipTo and #next mimic the document positioning behaviour of 
> their SegmentTermDocs counterparts when returning false
>  * II now uses BitVector rather than sets for deleted documents

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2144) InstantiatedIndexReader does not handle #termDocs(null) correct (AllTermDocs)

2009-12-10 Thread Karl Wettin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789021#action_12789021
 ] 

Karl Wettin commented on LUCENE-2144:
-

Committed change to trunk.

In 3.0, comment out ~line 227 in TestIndicesEquals:

// This is invalid use of the API, but if the response differs then it's an
// indication that something might have changed. In 2.9 and 3.0 the two
// TermDocs implementations returned different values at this point.
assertEquals("Discrepancy during invalid use of the TermDocs API, see "
    + "comments in test code for details.",
    aprioriTermDocs.next(), testTermDocs.next());


> InstantiatedIndexReader does not handle #termDocs(null) correct (AllTermDocs)
> -
>
> Key: LUCENE-2144
> URL: https://issues.apache.org/jira/browse/LUCENE-2144
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/*
>Affects Versions: 2.9, 2.9.1, 3.0
>Reporter: Karl Wettin
>Assignee: Michael McCandless
>Priority: Critical
> Attachments: LUCENE-2144-30.patch, LUCENE-2144.txt
>
>
> This patch contains core changes so someone else needs to commit it.
> Due to the incompatible #termDocs(null) behaviour, at least MatchAllDocsQuery, 
> FieldCacheRangeFilter and ValueSourceQuery fail when using II since 2.9.
> AllTermDocs now has a superclass, AbstractAllTermDocs, which 
> InstantiatedAllTermDocs also extends.
> Also:
>  * II tests made less likely to pass on future incompatible changes to 
> TermDocs and TermEnum
>  * IITermDocs#skipTo and #next mimic the document positioning behaviour of 
> their SegmentTermDocs counterparts when returning false
>  * II now uses BitVector rather than sets for deleted documents

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2144) InstantiatedIndexReader does not handle #termDocs(null) correct (AllTermDocs)

2009-12-10 Thread Karl Wettin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788966#action_12788966
 ] 

Karl Wettin commented on LUCENE-2144:
-

bq. at 
org.apache.lucene.store.instantiated.TestIndicesEquals.testTermDocsSomeMore(TestIndicesEquals.java:226)

I have no idea. How do I merge back locally so I can debug it?




> InstantiatedIndexReader does not handle #termDocs(null) correct (AllTermDocs)
> -
>
> Key: LUCENE-2144
> URL: https://issues.apache.org/jira/browse/LUCENE-2144
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/*
>Affects Versions: 2.9, 2.9.1, 3.0
>Reporter: Karl Wettin
>Assignee: Michael McCandless
>Priority: Critical
> Attachments: LUCENE-2144.txt
>
>
> This patch contains core changes so someone else needs to commit it.
> Due to the incompatible #termDocs(null) behaviour, at least MatchAllDocsQuery, 
> FieldCacheRangeFilter and ValueSourceQuery fail when using II since 2.9.
> AllTermDocs now has a superclass, AbstractAllTermDocs, which 
> InstantiatedAllTermDocs also extends.
> Also:
>  * II tests made less likely to pass on future incompatible changes to 
> TermDocs and TermEnum
>  * IITermDocs#skipTo and #next mimic the document positioning behaviour of 
> their SegmentTermDocs counterparts when returning false
>  * II now uses BitVector rather than sets for deleted documents

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2144) InstantiatedIndexReader does not handle #termDocs(null) correct (AllTermDocs)

2009-12-10 Thread Karl Wettin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788950#action_12788950
 ] 

Karl Wettin commented on LUCENE-2144:
-

bq. We should fix this on at least 3.0 as well right?

Would be great if you had the bandwidth to fix that.

> InstantiatedIndexReader does not handle #termDocs(null) correct (AllTermDocs)
> -
>
> Key: LUCENE-2144
> URL: https://issues.apache.org/jira/browse/LUCENE-2144
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/*
>Affects Versions: 2.9, 2.9.1, 3.0
>Reporter: Karl Wettin
>Assignee: Michael McCandless
>Priority: Critical
> Attachments: LUCENE-2144.txt
>
>
> This patch contains core changes so someone else needs to commit it.
> Due to the incompatible #termDocs(null) behaviour, at least MatchAllDocsQuery, 
> FieldCacheRangeFilter and ValueSourceQuery fail when using II since 2.9.
> AllTermDocs now has a superclass, AbstractAllTermDocs, which 
> InstantiatedAllTermDocs also extends.
> Also:
>  * II tests made less likely to pass on future incompatible changes to 
> TermDocs and TermEnum
>  * IITermDocs#skipTo and #next mimic the document positioning behaviour of 
> their SegmentTermDocs counterparts when returning false
>  * II now uses BitVector rather than sets for deleted documents

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2144) InstantiatedIndexReader does not handle #termDocs(null) correct (AllTermDocs)

2009-12-10 Thread Karl Wettin (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wettin updated LUCENE-2144:


Attachment: LUCENE-2144.txt

BUILD SUCCESSFUL
Total time: 36 minutes 4 seconds


> InstantiatedIndexReader does not handle #termDocs(null) correct (AllTermDocs)
> -
>
> Key: LUCENE-2144
> URL: https://issues.apache.org/jira/browse/LUCENE-2144
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/*
>Affects Versions: 2.9, 2.9.1, 3.0
>    Reporter: Karl Wettin
>Priority: Critical
> Attachments: LUCENE-2144.txt
>
>
> This patch contains core changes so someone else needs to commit it.
> Due to the incompatible #termDocs(null) behaviour, at least MatchAllDocsQuery, 
> FieldCacheRangeFilter and ValueSourceQuery fail when using II since 2.9.
> AllTermDocs now has a superclass, AbstractAllTermDocs, which 
> InstantiatedAllTermDocs also extends.
> Also:
>  * II tests made less likely to pass on future incompatible changes to 
> TermDocs and TermEnum
>  * IITermDocs#skipTo and #next mimic the document positioning behaviour of 
> their SegmentTermDocs counterparts when returning false
>  * II now uses BitVector rather than sets for deleted documents

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Created: (LUCENE-2144) InstantiatedIndexReader does not handle #termDocs(null) correct (AllTermDocs)

2009-12-10 Thread Karl Wettin (JIRA)
InstantiatedIndexReader does not handle #termDocs(null) correct (AllTermDocs)
-

 Key: LUCENE-2144
 URL: https://issues.apache.org/jira/browse/LUCENE-2144
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/*
Affects Versions: 3.0, 2.9.1, 2.9
Reporter: Karl Wettin
Priority: Critical


This patch contains core changes so someone else needs to commit it.

Due to the incompatible #termDocs(null) behaviour, at least MatchAllDocsQuery, 
FieldCacheRangeFilter and ValueSourceQuery fail when using II since 2.9.

AllTermDocs now has a superclass, AbstractAllTermDocs, which 
InstantiatedAllTermDocs also extends.

Also:

 * II tests made less likely to pass on future incompatible changes to 
TermDocs and TermEnum
 * IITermDocs#skipTo and #next mimic the document positioning behaviour of 
their SegmentTermDocs counterparts when returning false
 * II now uses BitVector rather than sets for deleted documents


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Closed: (LUCENE-774) TopDocs and TopFieldDocs does not implement equals and hashCode

2009-12-10 Thread Karl Wettin (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wettin closed LUCENE-774.
--

Resolution: Won't Fix

> TopDocs and TopFieldDocs does not implement equals and hashCode
> ---
>
> Key: LUCENE-774
> URL: https://issues.apache.org/jira/browse/LUCENE-774
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>    Reporter: Karl Wettin
>Priority: Trivial
> Attachments: extendsObject.diff
>
>


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1370) Patch to make ShingleFilter output a unigram if no ngrams can be generated

2009-11-06 Thread Karl Wettin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12774252#action_12774252
 ] 

Karl Wettin commented on LUCENE-1370:
-

Oops, I seem to have assigned this to myself and then forgotten about it. Sorry!
I'll check it out this weekend!
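
For reference, a sketch of the query-time analyzer wiring the patch proposes; 
the name of the second setter is assumed from the description below and may 
differ in the committed code:

{code}
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.shingle.ShingleFilter;

// Query-time analyzer: bigrams only, except for single-token input.
public class BigramQueryAnalyzer extends Analyzer {

  public TokenStream tokenStream(String fieldName, Reader reader) {
    ShingleFilter filter = new ShingleFilter(new WhitespaceTokenizer(reader), 2);
    filter.setOutputUnigrams(false);
    // assumed setter for the patch's outputUnigramIfNoNgrams option
    filter.setOutputUnigramsIfNoShingles(true);
    return filter;
  }
}
{code}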

> Patch to make ShingleFilter output a unigram if no ngrams can be generated
> --
>
> Key: LUCENE-1370
> URL: https://issues.apache.org/jira/browse/LUCENE-1370
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/analyzers
>Reporter: Chris Harris
>Assignee: Karl Wettin
> Attachments: LUCENE-1370.patch, LUCENE-1370.patch, LUCENE-1370.patch, 
> LUCENE-1370.patch, ShingleFilter.patch
>
>
> Currently if ShingleFilter.outputUnigrams==false and the underlying token 
> stream is only one token long, then ShingleFilter.next() won't return any 
> tokens. This patch provides a new option, outputUnigramIfNoNgrams; if this 
> option is set and the underlying stream is only one token long, then 
> ShingleFilter will return that token, regardless of the setting of 
> outputUnigrams.
> My use case here is speeding up phrase queries. The technique is as follows:
> First, do index-time analysis using ShingleFilter (with 
> outputUnigrams==true), thereby expanding things as follows:
> "please divide this sentence into shingles" ->
>  "please", "please divide"
>  "divide", "divide this"
>  "this", "this sentence"
>  "sentence", "sentence into"
>  "into", "into shingles"
>  "shingles"
> Second, do query-time analysis using ShingleFilter (using 
> outputUnigrams==false and outputUnigramIfNoNgrams==true). If the user enters 
> a phrase query, it will get tokenized in the following manner:
> "please divide this sentence into shingles" ->
>  "please divide"
>  "divide this"
>  "this sentence"
>  "sentence into"
>  "into shingles"
> By doing phrase queries with bigrams like this, I can gain a very 
> considerable speedup. Without the outputUnigramIfNoNgrams option, then a 
> single word query would tokenize like this:
> "please" ->
>[no tokens]
> But thanks to outputUnigramIfNoNgrams, single words will now tokenize like 
> this:
> "please" ->
>   "please"
> 
> The patch also adds a little to the pre-outputUnigramIfNoNgrams option tests.
> 
> I'm not sure if the patch in this state is useful to anyone else, but I 
> thought I should throw it up here and try to find out.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: [VOTE] Release Apache Lucene Java 2.9.1, take 3

2009-11-01 Thread Karl Wettin

+1

30 okt 2009 kl. 00.27 skrev Michael McCandless:


OK, let's try this again!

I've built new release artifacts from svn rev 831145 (on the 2.9
branch), here:

 http://people.apache.org/~mikemccand/staging-area/rc3_lucene2.9.1/

Changes are here:

 http://people.apache.org/~mikemccand/staging-area/rc3_lucene2.9.1changes/

Please vote to officially release these artifacts as Apache Lucene
Java 2.9.1.

Mike

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org




-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: [Lucene-java Wiki] Update of "LuceneAtApacheConUs2009" by HossMan

2009-10-20 Thread Karl Wettin


20 okt 2009 kl. 07.15 skrev Apache Wiki:

+ There will be a Lucene/Search !MeetUp on Tuesday night at 8PM.   
'This event is open to anyone who wants to come, even if you are  
not registered for the conference'.


That is a really nice thing, and completely new if I'm not mistaken.  
Perhaps it's even worth advertising as news on the front page.



   karl

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Closed: (LUCENE-1958) ShingleFilter creates shingles across two consecutives documents : bug or normal behaviour ?

2009-10-10 Thread Karl Wettin (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wettin closed LUCENE-1958.
---

Resolution: Won't Fix

Not a problem in 2.9

> ShingleFilter creates shingles across two consecutives documents : bug or 
> normal behaviour ?
> 
>
> Key: LUCENE-1958
> URL: https://issues.apache.org/jira/browse/LUCENE-1958
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/analyzers
>Affects Versions: 2.4.1
> Environment: Windows XP / jdk1.6.0_15
>Reporter: MRIT64
>Priority: Minor
>
> Hi
> I add two consecutive documents that are indexed with some filters. The last 
> one is ShingleFilter.
> ShingleFilter creates a shingle spanning the two documents, which makes no 
> sense in my context.
> Is that a bug or is it ShingleFilter's normal behaviour? If it's normal 
> behaviour, is it possible to make it optional?
> Thanks
> MR

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Closed: (LUCENE-1947) Snowball package contains BSD licensed code with ASL header

2009-10-09 Thread Karl Wettin (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wettin closed LUCENE-1947.
---

Resolution: Fixed

Committed in revision 823445

> Snowball package contains BSD licensed code with ASL header
> ---
>
> Key: LUCENE-1947
> URL: https://issues.apache.org/jira/browse/LUCENE-1947
> Project: Lucene - Java
>  Issue Type: Task
>  Components: contrib/analyzers
>Affects Versions: 2.9
>    Reporter: Karl Wettin
>Assignee: Karl Wettin
> Fix For: 3.0
>
> Attachments: LUCENE-1947.patch, LUCENE-1947.patch
>
>
> All classes in org.tartarus.snowball (but not in org.tartarus.snowball.ext) 
> have for some reason been given an ASL header. These classes are licensed 
> under BSD, thus the ASL header should be removed. I suppose this is a 
> mistake, possibly due to the ASL header automation tool.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Using payload during indexes with Lucene 2.9.0

2009-10-08 Thread Karl Wettin

Hi Mauro,

this is the -dev list where we discuss the development of the API.  
Questions about how to use the API should be sent to the -users list.  
Please try to use the -users list for future questions on how to use the  
API, or when responding to this mail.


In answer to your question, the classes you are looking for are  
located in the contrib/analyzers package.

http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/analyzers/common/src/java/org/apache/lucene/analysis/payloads/
http://repo2.maven.org/maven2/org/apache/lucene/lucene-analyzers/2.9.0/

  karl


8 okt 2009 kl. 22.45 skrev Mauro Dragoni:


Hi to everyone,
I'm new in this mailing list... :)

Some days ago I downloaded the new version of Lucene, but I didn't
find the classes that I used to index terms with payloads
(PayloadEncoder, DelimitedPayloadTokenFilter, etc.).
So, I would like to ask where I may find an example of using payloads
with the new Lucene version.

Thanks in advance to everyone.
Mauro.

--
Dott. Mauro Dragoni
Ph.D. Università di Milano, Italy

My Business Site: http://www.dragotechpro.com
My Research Site: http://www.genalgo.com



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org




-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Output from a small Snowball benchmark

2009-10-08 Thread Karl Wettin
There have been a few small comments in the Jira about the reflection  
in Snowball's Among class. There is very little to do about this  
unless one wants to redesign the stemmers so they include an inner  
class that handles the method callbacks. That's quite a bit of work and  
I don't even know how much CPU one would save by doing this.


So I was thinking maybe it would save some resources if one reused  
the stemmers instead of reinstantiating them, which I presume  
everybody does.


I thought it would make most sense to simulate query time stemming, so  
my benchmark contained 4 words of which 2 are plural. Each test  
ran 1 000 000 times. The amount of CPU time used is barely noticeable  
relative to what other things cost: 0.0109 ms/iteration when  
reinstantiating, 0.0067 ms/iteration when reusing.


The heap consumption was however rather different. At the end of the  
reinstantiation run it had consumed about 10x more than when reusing:  
~20MB vs. ~2MB.



I realize people don't usually run 1 000 000 queries in such a short  
time, but at least this is an indication that one could save some GC  
time here. Many a mickle makes a muckle...


So I was thinking that perhaps it would make sense to have something  
like a singleton concurrent queue in the SnowballFilter and a new  
constructor that takes the snowball program implementation class as an  
argument.
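
Something along these lines (a sketch only, names made up):

import java.util.concurrent.ConcurrentLinkedQueue;
import org.tartarus.snowball.SnowballProgram;

// Reuses stemmer instances instead of instantiating one per token stream.
public final class StemmerPool {

  private final Class<? extends SnowballProgram> impl;
  private final ConcurrentLinkedQueue<SnowballProgram> pool =
      new ConcurrentLinkedQueue<SnowballProgram>();

  public StemmerPool(Class<? extends SnowballProgram> impl) {
    this.impl = impl;
  }

  public SnowballProgram borrow() throws Exception {
    SnowballProgram stemmer = pool.poll();
    return stemmer != null ? stemmer : impl.newInstance();
  }

  public void release(SnowballProgram stemmer) {
    pool.offer(stemmer);
  }
}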


But this might also be way premature optimization.


 karl

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1947) Snowball package contains BSD licensed code with ASL header

2009-10-07 Thread Karl Wettin (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wettin updated LUCENE-1947:


Attachment: LUCENE-1947.patch

* Added Snowball license header to static Snowball classes (SnowballProgram, 
Among and TestApp)
* Refactored StringBuffer to StringBuilder in all classes
* Added notes about the above in the README and package overview.

> Snowball package contains BSD licensed code with ASL header
> ---
>
> Key: LUCENE-1947
> URL: https://issues.apache.org/jira/browse/LUCENE-1947
> Project: Lucene - Java
>  Issue Type: Task
>  Components: contrib/analyzers
>Affects Versions: 2.9
>    Reporter: Karl Wettin
>Assignee: Karl Wettin
> Fix For: 3.0
>
> Attachments: LUCENE-1947.patch, LUCENE-1947.patch
>
>
> All classes in org.tartarus.snowball (but not in org.tartarus.snowball.ext) 
> have for some reason been given an ASL header. These classes are licensed 
> under BSD, thus the ASL header should be removed. I suppose this is a 
> mistake, possibly due to the ASL header automation tool.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1948) Deprecating InstantiatedIndexWriter

2009-10-05 Thread Karl Wettin (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wettin updated LUCENE-1948:


Attachment: LUCENE-1948.patch

> Deprecating InstantiatedIndexWriter
> ---
>
> Key: LUCENE-1948
> URL: https://issues.apache.org/jira/browse/LUCENE-1948
> Project: Lucene - Java
>  Issue Type: Task
>  Components: contrib/*
>Affects Versions: 2.9
>    Reporter: Karl Wettin
>Assignee: Karl Wettin
> Fix For: 3.0
>
> Attachments: LUCENE-1948.patch
>
>
> http://markmail.org/message/j6ip266fpzuaibf7
> I suppose that should have been suggested before 2.9 rather than  
> after...
> There are at least three reasons why I want to do this:
> The code is based on the behaviour of the Directory IndexWriter as of  
> 2.3 and I have not touched it since. If there are changes in the  
> future one will have to keep IIW in sync, something  
> that's easy to forget.
> There is no locking, which will cause concurrent modification  
> exceptions when accessing the index via searcher/reader while  
> committing.
> It uses the old token stream API, so it has to be upgraded in case it  
> should stay.
> The java- and package-level docs have, since it was committed, been  
> suggesting that one should consider using II as if it was immutable  
> due to the locklessness. My suggestion is that we make it immutable  
> for real.
> Since II is meant for small corpora there is very little time lost by  
> using the constructor that builds the index from an IndexReader. I.e.  
> rather than using InstantiatedIndexWriter one would have to use a  
> Directory and an IndexWriter and then pass an IndexReader to a new  
> InstantiatedIndex.
> Any objections?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Created: (LUCENE-1948) Deprecating InstantiatedIndexWriter

2009-10-05 Thread Karl Wettin (JIRA)
Deprecating InstantiatedIndexWriter
---

 Key: LUCENE-1948
 URL: https://issues.apache.org/jira/browse/LUCENE-1948
 Project: Lucene - Java
  Issue Type: Task
  Components: contrib/*
Affects Versions: 2.9
Reporter: Karl Wettin
Assignee: Karl Wettin
 Fix For: 3.0


http://markmail.org/message/j6ip266fpzuaibf7

I suppose that should have been suggested before 2.9 rather than  
after...

There are at least three reasons why I want to do this:

The code is based on the behaviour of the Directory IndexWriter as of  
2.3 and I have not touched it since. If there are changes in the  
future one will have to keep IIW in sync, something  
that's easy to forget.
There is no locking, which will cause concurrent modification  
exceptions when accessing the index via searcher/reader while  
committing.
It uses the old token stream API, so it has to be upgraded in case it  
should stay.

The java- and package-level docs have, since it was committed, been  
suggesting that one should consider using II as if it was immutable  
due to the locklessness. My suggestion is that we make it immutable  
for real.

Since II is meant for small corpora there is very little time lost by  
using the constructor that builds the index from an IndexReader. I.e.  
rather than using InstantiatedIndexWriter one would have to use a  
Directory and an IndexWriter and then pass an IndexReader to a new  
InstantiatedIndex.

Any objections?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1947) Snowball package contains BSD licensed code with ASL header

2009-10-05 Thread Karl Wettin (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wettin updated LUCENE-1947:


Attachment: LUCENE-1947.patch

> Snowball package contains BSD licensed code with ASL header
> ---
>
> Key: LUCENE-1947
> URL: https://issues.apache.org/jira/browse/LUCENE-1947
> Project: Lucene - Java
>  Issue Type: Task
>  Components: contrib/analyzers
>Affects Versions: 2.9
>    Reporter: Karl Wettin
>Assignee: Karl Wettin
> Fix For: 3.0
>
> Attachments: LUCENE-1947.patch
>
>
> All classes in org.tartarus.snowball (but not in org.tartarus.snowball.ext) 
> have for some reason been given an ASL header. These classes are licensed 
> under BSD, thus the ASL header should be removed. I suppose this is a 
> mistake, possibly due to the ASL header automation tool.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Created: (LUCENE-1947) Snowball package contains BSD licensed code with ASL header

2009-10-05 Thread Karl Wettin (JIRA)
Snowball package contains BSD licensed code with ASL header
---

 Key: LUCENE-1947
 URL: https://issues.apache.org/jira/browse/LUCENE-1947
 Project: Lucene - Java
  Issue Type: Task
  Components: contrib/analyzers
Affects Versions: 2.9
Reporter: Karl Wettin
Assignee: Karl Wettin
 Fix For: 3.0


All classes in org.tartarus.snowball (but not in org.tartarus.snowball.ext) have 
for some reason been given an ASL header. These classes are licensed under BSD, 
thus the ASL header should be removed. I suppose this is a mistake, possibly 
due to the ASL header automation tool.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Closed: (LUCENE-1939) IndexOutOfBoundsException at ShingleMatrixFilter's Iterator#hasNext method

2009-10-05 Thread Karl Wettin (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wettin closed LUCENE-1939.
---

   Resolution: Fixed
Fix Version/s: 3.0

Committed in 821888.

Thanks Patrick!

(I'll consider the other stuff mentioned in the issue later this week, and if 
manageable then as a new issue.)

> IndexOutOfBoundsException at ShingleMatrixFilter's Iterator#hasNext method
> --
>
> Key: LUCENE-1939
> URL: https://issues.apache.org/jira/browse/LUCENE-1939
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/analyzers
>Affects Versions: 2.9
>Reporter: Patrick Jungermann
>Assignee: Karl Wettin
> Fix For: 3.0
>
> Attachments: ShingleMatrixFilter_IndexOutOfBoundsException.patch
>
>
> I tried to use the ShingleMatrixFilter within Solr. To test the functionality 
> etc., I first used the built-in field analysis view. The filter was configured 
> to be used only at query-time analysis, with "_" as the spacer character and a 
> min. and max. shingle size of 2. The generation of shingles for query 
> strings with this filter seems to work in this view, but after turning on the 
> highlighting of indexed terms that match the query terms, the exception 
> was thrown. Also, each time I tried to query the index the exception was 
> immediately thrown.
> Stacktrace:
> {code}
> java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
>   at java.util.ArrayList.RangeCheck(Unknown Source)
>   at java.util.ArrayList.get(Unknown Source)
>   at 
> org.apache.lucene.analysis.shingle.ShingleMatrixFilter$Matrix$1.hasNext(ShingleMatrixFilter.java:729)
>   at 
> org.apache.lucene.analysis.shingle.ShingleMatrixFilter.next(ShingleMatrixFilter.java:380)
>   at org.apache.lucene.analysis.StopFilter.next(StopFilter.java:120)
>   at org.apache.lucene.analysis.TokenStream.next(TokenStream.java:47)
>   ...
> {code}
> Within the hasNext method, the {{s-1}}-th Column is requested from the 
> ArrayList {{columns}}, but that entry does not exist within columns.
> I created a patch that checks whether {{columns}} contains enough entries.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1257) Port to Java5

2009-10-05 Thread Karl Wettin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12762131#action_12762131
 ] 

Karl Wettin commented on LUCENE-1257:
-

bq. err... looks like perhaps its only hit once though and then reused.. maybe 
not so nasty. My first time looking at this code, so I'm sure you can clear it 
up ...

Mark, are you referring to the reflection in Among? Those are pretty tough to 
get rid of.

I think we should replace the StringBuffers in the stemmers if nobody else 
minds, but we should do that in another issue. I also found ASL headers in 
some of the classes; I suppose they have been added automatically at some 
point. These classes are all BSD.

> Port to Java5
> -
>
> Key: LUCENE-1257
> URL: https://issues.apache.org/jira/browse/LUCENE-1257
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Analysis, Examples, Index, Other, Query/Scoring, 
> QueryParser, Search, Store, Term Vectors
>Affects Versions: 2.3.1
>Reporter: Cédric Champeau
>Assignee: Uwe Schindler
>Priority: Minor
> Fix For: 3.0
>
> Attachments: instantiated_fieldable.patch, java5.patch, 
> LUCENE-1257-Document.patch, LUCENE-1257-StringBuffer.patch, 
> LUCENE-1257-StringBuffer.patch, LUCENE-1257-StringBuffer.patch, 
> LUCENE-1257_messages.patch, lucene1257surround1.patch, 
> lucene1257surround1.patch, shinglematrixfilter_generified.patch
>
>
> For my needs I've updated Lucene so that it uses Java 5 constructs. I know 
> Java 5 migration had been planned for 2.1 someday in the past, but I don't 
> know when it is planned now. This patch against the trunk includes:
> - most obvious generics usage (there are tons of usages of sets, ... Those 
> which are commonly used have been generified)
> - PriorityQueue generification
> - replacement of indexed for loops with for-each constructs
> - removal of unnecessary unboxing
> The code is in my opinion much more readable with those features (you 
> actually *know* what is stored in collections when reading the code, without 
> the need to look up field definitions every time) and it simplifies many 
> algorithms.
> Note that this patch also includes an interface for the Query class. This has 
> been done for my company's needs for building custom Query classes which add 
> some behaviour to the base Lucene queries. It prevents multiple unnecessary 
> casts. I know this introduction is not wanted by the team, but it really 
> makes our developments easier to maintain. If you don't want to use this, 
> replace all /Queriable/ calls with standard /Query/.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Deprecating InstantiatedIndexWriter

2009-10-03 Thread Karl Wettin
I suppose that should have been suggested before 2.9 rather than  
after...


There are at least three reasons why I want to do this:

The code is based on the behaviour of the Directory IndexWriter as of  
2.3 and I have not touched it since. If there are changes in the  
future one will have to keep IIW in sync, something  
that's easy to forget.
There is no locking, which will cause concurrent modification  
exceptions when accessing the index via searcher/reader while  
committing.
It uses the old token stream API, so it has to be upgraded in case it  
should stay.


The java- and package-level docs have, since it was committed, been  
suggesting that one should consider using II as if it was immutable  
due to the locklessness. My suggestion is that we make it immutable  
for real.


Since II is meant for small corpora there is very little time lost by  
using the constructor that builds the index from an IndexReader. I.e.  
rather than using InstantiatedIndexWriter one would have to use a  
Directory and an IndexWriter and then pass an IndexReader to a new  
InstantiatedIndex.
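
A sketch of that replacement path, using the 2.9 API (IOException handling 
omitted):

Directory dir = new RAMDirectory();
IndexWriter writer = new IndexWriter(dir,
    new StandardAnalyzer(Version.LUCENE_29),
    IndexWriter.MaxFieldLength.UNLIMITED);
// ... writer.addDocument(...) calls ...
writer.close();

IndexReader reader = IndexReader.open(dir);
InstantiatedIndex ii = new InstantiatedIndex(reader); // immutable snapshot
reader.close();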



Any objections?

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1939) IndexOutOfBoundsException at ShingleMatrixFilter's Iterator#hasNext method

2009-10-03 Thread Karl Wettin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761924#action_12761924
 ] 

Karl Wettin commented on LUCENE-1939:
-

The exception is thrown when ts#next (incrementToken) is called again after 
already having returned null (false) once. So this is a nice catch!

But this means that RemoveDuplicatesTokenFilter in Solr calls incrementToken 
one extra time for some reason. Can you please post the complete stacktrace so 
I can take a look in there too? 

I suppose the expected behaviour would be that a token stream keeps returning 
false when incrementToken is called after it has already returned false, but 
the javadocs don't really say anything about this, nor is there a generic test 
case that ensures this for all filters. Thus this error might be present in 
other filters. I'll see if I can do something about that before committing.
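
A sketch of what such a generic check could look like (hypothetical, not part 
of the attached patch):

{code}
TokenStream ts = analyzer.tokenStream("field", new StringReader("some text"));
while (ts.incrementToken()) {
  // drain the stream
}
// once exhausted, the stream must keep returning false
assertFalse(ts.incrementToken());
assertFalse(ts.incrementToken());
{code}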

Thanks for the report Patrick!

> IndexOutOfBoundsException at ShingleMatrixFilter's Iterator#hasNext method
> --
>
> Key: LUCENE-1939
> URL: https://issues.apache.org/jira/browse/LUCENE-1939
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/analyzers
>Affects Versions: 2.9
>Reporter: Patrick Jungermann
>Assignee: Karl Wettin
> Attachments: ShingleMatrixFilter_IndexOutOfBoundsException.patch
>
>
> I tried to use the ShingleMatrixFilter within Solr. To test the functionality 
> etc., I first used the built-in field analysis view. The filter was configured 
> to be used only at query-time analysis, with "_" as the spacer character and a 
> min. and max. shingle size of 2. The generation of shingles for query 
> strings with this filter seems to work in this view, but after turning on the 
> highlighting of indexed terms that match the query terms, the exception 
> was thrown. Also, each time I tried to query the index the exception was 
> immediately thrown.
> Stacktrace:
> {code}
> java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
>   at java.util.ArrayList.RangeCheck(Unknown Source)
>   at java.util.ArrayList.get(Unknown Source)
>   at 
> org.apache.lucene.analysis.shingle.ShingleMatrixFilter$Matrix$1.hasNext(ShingleMatrixFilter.java:729)
>   at 
> org.apache.lucene.analysis.shingle.ShingleMatrixFilter.next(ShingleMatrixFilter.java:380)
>   at org.apache.lucene.analysis.StopFilter.next(StopFilter.java:120)
>   at org.apache.lucene.analysis.TokenStream.next(TokenStream.java:47)
>   ...
> {code}
> Within the hasNext method, the {{s-1}}-th Column is requested from the 
> ArrayList {{columns}}, but there is no such entry within columns.
> I created a patch that checks whether {{columns}} contains enough entries.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1257) Port to Java5

2009-10-03 Thread Karl Wettin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761877#action_12761877
 ] 

Karl Wettin commented on LUCENE-1257:
-

bq. Fix for InstantiatedIndex compile error caused by code committed in 
revision 821277

Committed in rev 821315

> Port to Java5
> -
>
> Key: LUCENE-1257
> URL: https://issues.apache.org/jira/browse/LUCENE-1257
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Analysis, Examples, Index, Other, Query/Scoring, 
> QueryParser, Search, Store, Term Vectors
>Affects Versions: 2.3.1
>Reporter: Cédric Champeau
>Assignee: Uwe Schindler
>Priority: Minor
> Fix For: 3.0
>
> Attachments: instantiated_fieldable.patch, java5.patch, 
> LUCENE-1257-Document.patch, LUCENE-1257-StringBuffer.patch, 
> LUCENE-1257-StringBuffer.patch, LUCENE-1257-StringBuffer.patch, 
> lucene1257surround1.patch, lucene1257surround1.patch, 
> shinglematrixfilter_generified.patch
>
>
> For my needs I've updated Lucene so that it uses Java 5 constructs. I know 
> Java 5 migration had been planned for 2.1 someday in the past, but I don't know 
> when it is planned now. This patch against the trunk includes:
> - the most obvious generics usage (there are tons of usages of sets, ...; those 
> which are commonly used have been generified)
> - PriorityQueue generification
> - replacement of indexed for loops with for-each constructs
> - removal of unnecessary unboxing
> The code is in my opinion much more readable with those features (you 
> actually *know* what is stored in collections when reading the code, without 
> the need to look up field definitions every time) and it simplifies many 
> algorithms.
> Note that this patch also includes an interface for the Query class. This has 
> been done for my company's needs for building custom Query classes which add 
> some behaviour to the base Lucene queries. It prevents multiple unnecessary 
> casts. I know this introduction is not wanted by the team, but it really 
> makes our developments easier to maintain. If you don't want to use this, 
> replace all /Queriable/ calls with standard /Query/.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1257) Port to Java5

2009-10-03 Thread Karl Wettin (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wettin updated LUCENE-1257:


Attachment: instantiated_fieldable.patch

Fix for InstantiatedIndex compile error caused by code committed in revision 
821277: List<Fieldable> rather than List<Field>.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1257) Port to Java5

2009-10-03 Thread Karl Wettin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761875#action_12761875
 ] 

Karl Wettin commented on LUCENE-1257:
-

bq. how that?

It asserted that a Document contained a List<Field> rather than a List<Fieldable> 
in ctor(IndexReader), which I actually think is true at that point using that 
code.



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1257) Port to Java5

2009-10-03 Thread Karl Wettin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761874#action_12761874
 ] 

Karl Wettin commented on LUCENE-1257:
-

bq. Generified ShingleMatrixFilter

Committed in rev 821311


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1257) Port to Java5

2009-10-03 Thread Karl Wettin (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wettin updated LUCENE-1257:


Attachment: shinglematrixfilter_generified.patch

Generified ShingleMatrixFilter


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1257) Port to Java5

2009-10-03 Thread Karl Wettin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761870#action_12761870
 ] 

Karl Wettin commented on LUCENE-1257:
-

bq. Generification of Document. It now makes clear what getFields() really 
returns. This was very badly documented. Now it's a List<Fieldable>.

This broke InstantiatedIndex in the trunk. Patch and commit are on the way.
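
For reference, a sketch of what a call site looks like on the trunk after this 
change (assuming the generified getFields()):

{code}
List<Fieldable> fields = doc.getFields();
for (Fieldable f : fields) {
  // no cast needed anymore to get at the field values
  System.out.println(f.name() + ": " + f.stringValue());
}
{code}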


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1939) IndexOutOfBoundsException at ShingleMatrixFilter's Iterator#hasNext method

2009-10-03 Thread Karl Wettin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761868#action_12761868
 ] 

Karl Wettin commented on LUCENE-1939:
-

Patrick,

I can't manage to reproduce this error. Uwe is right though: you are getting 
this error using 2.4.1 or earlier, not 2.9.

bq. at 
org.apache.lucene.analysis.shingle.ShingleMatrixFilter$Matrix$1.hasNext(ShingleMatrixFilter.java:729)

Can you please try with 2.9? It would also be very helpful if you could list 
the applicable Solr configuration and some example data you are passing to the 
filter when the exception is thrown.

Thanks in advance.



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1257) Port to Java5

2009-10-03 Thread Karl Wettin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761862#action_12761862
 ] 

Karl Wettin commented on LUCENE-1257:
-

bq. Wait ... do you mean you got rid of some of the reflection or did we lose 
your changes? I'm seeing some nasty slow reflection in there still ...

My changes were to the abstract Snowball stemmer class. I simply added an 
abstract method and got rid of the reflection in the Lucene filter. 
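
Roughly, a from-memory sketch of the change (not the exact patch):

{code}
// before: the filter located and invoked the stem method reflectively
Method stemMethod = stemmer.getClass().getMethod("stem");
stemMethod.invoke(stemmer);

// after: the abstract base class declares the method...
public abstract class SnowballProgram {
  public abstract boolean stem();
}

// ...so the filter can call it directly, no reflection involved
stemmer.stem();
{code}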

One could argue that we should update the Snowball compiler rather than 
the Java code it renders. But honestly I think we should just update the 
rendered code, report any improvements found to the Snowball mailing list, and 
keep track of them in the package readme.

bq. err... looks like perhaps its only hit once though and then reused.. maybe 
not so nasty. My first time looking at this code, so I'm sure you can clear it 
up ...

It could still be rather expensive per stem at query time. I vote for getting 
rid of it if we can. I'll take a look at it.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1257) Port to Java5

2009-10-02 Thread Karl Wettin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761755#action_12761755
 ] 

Karl Wettin commented on LUCENE-1257:
-

bq. I vote to move to StringBuilder anyway if its in Contrib. Though probably 
not with Snowball, since we don't really write/maintain that code.

Actually I patched the Snowball stemmer code to get rid of the use of 
reflection. So what we use is an altered version of their code. I have tried 
for years to get Dr Porter to commit those changes, but it's still the same. 
Based on this I think we could just keep going with our own stuff in there, as 
long as we keep a record of what we have done in case we want to merge with 
their trunk. 


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1939) IndexOutOfBoundsException at ShingleMatrixFilter's Iterator#hasNext method

2009-10-02 Thread Karl Wettin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761712#action_12761712
 ] 

Karl Wettin commented on LUCENE-1939:
-

bq. I also think so, because the above stack dump seems to be from 2.4.1 (in 
2.9 there should be incrementToken() instead of next() for all filters listed 
there).

Ah, I misunderstood your comment. The thing is that ShingleMatrixFilter was 
left using the old API because of its complexity. I told whoever it was that 
gave it a shot that I'd look into upgrading it, I just haven't had time to do 
so yet. There will be a new generified and updated version of the filter any 
year now. At least before 3.0.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1939) IndexOutOfBoundsException at ShingleMatrixFilter's Iterator#hasNext method

2009-10-02 Thread Karl Wettin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761706#action_12761706
 ] 

Karl Wettin commented on LUCENE-1939:
-

bq. Is this caused by the rewrite because of the new TokenStream API?

Nah, I think it's just a miss in the code that was never caught before. Not 
sure though, so I'll write a test or two this weekend.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Assigned: (LUCENE-1939) IndexOutOfBoundsException at ShingleMatrixFilter's Iterator#hasNext method

2009-10-02 Thread Karl Wettin (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wettin reassigned LUCENE-1939:
---

Assignee: Karl Wettin


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-625) Query auto completer

2009-07-29 Thread Karl Wettin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12736923#action_12736923
 ] 

Karl Wettin commented on LUCENE-625:


bq. Karl, did you ever proceed on this patch? I'm interested in adding 
autosuggest to Solr.

I used this patch for a few things a couple of years ago. If I recall 
everything right I ended up using the bootstrapped apriori corpus of LUCENE-626 
as training data the last time. That made the corpus rather small and speedy, 
yet still relevant for most users.

But the major caveat is that this patch is a trie and is thus a "precise 
forward only" thing. So it might not fit all use cases. It might be easier to 
get things going using an index with ngrams of untokenized user queries (i.e. 
including whitespace) or subject-like fields, as sketched below. 
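
A rough sketch of that ngram approach (contrib analyzers, 2.9-era API; the 
field name and gram sizes are made up):

{code}
Analyzer suggestAnalyzer = new Analyzer() {
  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream ts = new KeywordTokenizer(reader); // whole query = one token
    ts = new LowerCaseFilter(ts);
    // front edge grams, so a typed prefix matches the stored queries
    return new EdgeNGramTokenFilter(ts, EdgeNGramTokenFilter.Side.FRONT, 1, 20);
  }
};
// index each logged user query as its own document in a "suggest" field,
// then fetch candidates with a TermQuery on what the user has typed so far
{code}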

But I really prefer user queries, as using only the last n queries will make it 
sensitive to trends. That will however require quite a bit of data to work 
well. A lot, as in hundreds of thousands of user queries, in my 
experience.

Not sure if this was an answer to your question.. : )

> Query auto completer
> 
>
> Key: LUCENE-625
> URL: https://issues.apache.org/jira/browse/LUCENE-625
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Reporter: Karl Wettin
>Priority: Minor
> Attachments: autocomplete_0.0.1.tar.gz, autocomplete_20060730.tar.gz
>
>
> A trie that helps users type in their query. Made for AJAX; works great 
> with Ruby on Rails common scripts <http://script.aculo.us/>. Similar to the 
> Google Labs suggester.
> Trained by user queries. Optimizable. Uses an in-memory corpus. Serializable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1260) Norm codec strategy in Similarity

2009-06-22 Thread Karl Wettin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722575#action_12722575
 ] 

Karl Wettin commented on LUCENE-1260:
-

Hi Johan,

didn't try it out yet but the patch looks nice and clean. +1 from me. Let's try 
to convince some of the old -1 voters. 

YONIK? See, it's not just me. ; )

I do however still think it would be nice with the serializable codec interface 
as in the previous patches, in order for all applications to use the index as 
intended (Luke and what not): 256 bytes stored to a file and by default backed 
by a binary search or so, unless there is a registered codec that handles it 
algorithmically. I'll copy and paste that in as an alternative suggestion ASAP.
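
To make the table idea concrete, a sketch (a hypothetical class, not any of 
the attached patches): evenly spaced bags over a configurable span, decoded by 
lookup and encoded by binary search.

{code}
public class TableNormCodec {
  private final float[] table = new float[256];

  public TableNormCodec(float min, float max) {
    for (int i = 0; i < 256; i++) {
      table[i] = min + (max - min) * i / 255f; // evenly spaced bags
    }
  }

  public float decode(byte b) {
    return table[b & 0xFF];
  }

  public byte encode(float norm) {
    int i = java.util.Arrays.binarySearch(table, norm);
    if (i < 0) i = Math.min(255, -i - 1); // nearest bag at or above
    return (byte) i;
  }
}
{code}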

(I think the next move should be to allow for per field variable norms 
resolution, but that is a whole new issue.)

> Norm codec strategy in Similarity
> -
>
> Key: LUCENE-1260
> URL: https://issues.apache.org/jira/browse/LUCENE-1260
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.3.1
>Reporter: Karl Wettin
> Attachments: Lucene-1260.patch, LUCENE-1260.txt, LUCENE-1260.txt, 
> LUCENE-1260.txt
>
>
> The static span and resolution of the 8-bit norms codec might not fit 
> all applications. 
> My use case requires that 100f-250f is discretized into 60 bags instead of 
> the default... 10?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Resolved: (LUCENE-1578) InstantiatedIndex supports non-optimized IndexReaders

2009-06-13 Thread Karl Wettin (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wettin resolved LUCENE-1578.
-

Resolution: Fixed

Committed.

> InstantiatedIndex supports non-optimized IndexReaders
> -
>
> Key: LUCENE-1578
> URL: https://issues.apache.org/jira/browse/LUCENE-1578
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/*
>Affects Versions: 2.4.1
>Reporter: Jason Rutherglen
>Assignee: Karl Wettin
> Fix For: 2.9
>
> Attachments: LUCENE-1578.txt
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> InstantiatedIndex does not currently support non-optimized IndexReaders.  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: svn commit: r784481 - in /lucene/java/trunk/contrib: ./ instantiated/src/java/org/apache/lucene/store/instantiated/ instantiated/src/test/org/apache/lucene/store/instantiated/

2009-06-13 Thread Karl Wettin

Oops, an error in the code. I'm on it.

On 13 Jun 2009, at 23:54, ka...@apache.org wrote:


Author: kalle
Date: Sat Jun 13 21:54:07 2009
New Revision: 784481

URL: http://svn.apache.org/viewvc?rev=784481&view=rev
Log:
LUCENE-1578: Support for loading unoptimized readers to the  
constructor of InstantiatedIndex. (Karl Wettin)



Added:
    lucene/java/trunk/contrib/instantiated/src/test/org/apache/lucene/store/instantiated/TestUnoptimizedReaderOnConstructor.java

Modified:
    lucene/java/trunk/contrib/CHANGES.txt
    lucene/java/trunk/contrib/instantiated/src/java/org/apache/lucene/store/instantiated/InstantiatedIndex.java
    lucene/java/trunk/contrib/instantiated/src/test/org/apache/lucene/store/instantiated/TestIndicesEquals.java
    lucene/java/trunk/contrib/instantiated/src/test/org/apache/lucene/store/instantiated/TestRealTime.java


Modified: lucene/java/trunk/contrib/CHANGES.txt
URL: 
http://svn.apache.org/viewvc/lucene/java/trunk/contrib/CHANGES.txt?rev=784481&r1=784480&r2=784481&view=diff
==============================================================================

--- lucene/java/trunk/contrib/CHANGES.txt (original)
+++ lucene/java/trunk/contrib/CHANGES.txt Sat Jun 13 21:54:07 2009
@@ -62,8 +62,11 @@
(Xiaoping Gao via Mike McCandless)


-6. LUCENE-1676: Added DelimitedPayloadTokenFilter class for automatically adding payloads "in-stream" (Grant Ingersoll)
-
+ 6. LUCENE-1676: Added DelimitedPayloadTokenFilter class for automatically adding payloads "in-stream" (Grant Ingersoll)
+
+ 7. LUCENE-1578: Support for loading unoptimized readers to the
+    constructor of InstantiatedIndex. (Karl Wettin)
+
Optimizations

  1. LUCENE-1643: Re-use the collation key (RawCollationKey) for

Modified: lucene/java/trunk/contrib/instantiated/src/java/org/apache/lucene/store/instantiated/InstantiatedIndex.java

URL: 
http://svn.apache.org/viewvc/lucene/java/trunk/contrib/instantiated/src/java/org/apache/lucene/store/instantiated/InstantiatedIndex.java?rev=784481&r1=784480&r2=784481&view=diff
==============================================================================
--- lucene/java/trunk/contrib/instantiated/src/java/org/apache/lucene/store/instantiated/InstantiatedIndex.java (original)
+++ lucene/java/trunk/contrib/instantiated/src/java/org/apache/lucene/store/instantiated/InstantiatedIndex.java Sat Jun 13 21:54:07 2009

@@ -110,7 +110,8 @@
   public InstantiatedIndex(IndexReader sourceIndexReader, Set<String> fields) throws IOException {


if (!sourceIndexReader.isOptimized()) {
-  throw new IOException("Source index is not optimized.");
+  System.out.println(("Source index is not optimized."));
+  //throw new IOException("Source index is not optimized.");
}


@@ -170,11 +171,14 @@
}


-    documentsByNumber = new InstantiatedDocument[sourceIndexReader.numDocs()];
+    documentsByNumber = new InstantiatedDocument[sourceIndexReader.maxDoc()];

+

// create documents
-for (int i = 0; i < sourceIndexReader.numDocs(); i++) {
-  if (!sourceIndexReader.isDeleted(i)) {
+for (int i = 0; i < sourceIndexReader.maxDoc(); i++) {
+  if (sourceIndexReader.isDeleted(i)) {
+deletedDocuments.add(i);
+  } else {
InstantiatedDocument document = new InstantiatedDocument();
// copy stored fields from source reader
Document sourceDocument = sourceIndexReader.document(i);
@@ -259,6 +263,9 @@

// load offsets to term-document informations
for (InstantiatedDocument document : getDocumentsByNumber()) {
+  if (document == null) {
+continue; // deleted
+  }
       for (Field field : (List<Field>) document.getDocument().getFields()) {
         if (field.isTermVectorStored() && field.isStoreOffsetWithTermVector()) {
           TermPositionVector termPositionVector = (TermPositionVector) sourceIndexReader.getTermFreqVector(document.getDocumentNumber(), field.name());


Modified: lucene/java/trunk/contrib/instantiated/src/test/org/apache/lucene/store/instantiated/TestIndicesEquals.java

URL: 
http://svn.apache.org/viewvc/lucene/java/trunk/contrib/instantiated/src/test/org/apache/lucene/store/instantiated/TestIndicesEquals.java?rev=784481&r1=784480&r2=784481&view=diff
==============================================================================
--- lucene/java/trunk/contrib/instantiated/src/test/org/apache/lucene/store/instantiated/TestIndicesEquals.java (original)
+++ lucene/java/trunk/contrib/instantiated/src/test/org/apache/lucene/store/instantiated/TestIndicesEquals.java Sat Jun 13 21:54:07 2009

@@ -40,6 +40,10 @@
import org.apache.lucene.index.TermPositions;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
+import org.apache.lucene.search.IndexSearcher;
+import org.apache.lucene.search.TermQuery;
+import org.apac

[jira] Issue Comment Edited: (LUCENE-1491) EdgeNGramTokenFilter stops on tokens smaller than minimum gram size.

2009-06-02 Thread Karl Wettin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715712#action_12715712
 ] 

Karl Wettin edited comment on LUCENE-1491 at 6/2/09 2:51 PM:
-

Although you have a valid point I'd like to argue this a bit. 

My arguments are probably considered silly by some. Perhaps it's just me that 
uses ngrams for something completely different from what everybody else does, 
but here we go: adding the feature as suggested by this patch is, in my view, 
fixing symptoms of bad use of character ngrams.

BOL, EOL, whitespace and punctuation are all valid parts of character ngrams 
that can increase precision/recall quite a bit. EdgeNGrams could sort of be 
considered such data too. So what I'm saying here is that I consider your 
example a bad use of character ngrams, and that the whole sentence should have 
been grammed up. So in the case of 4-grams the output would end up as: "to b", 
"o be", " be ", "be o", and so on. Perhaps even "$to ", "to b", "o be", and so 
on.
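
For instance (contrib analyzers; KeywordTokenizer emits the whole input as a 
single token, so the grams keep their whitespace):

{code}
TokenStream ts = new KeywordTokenizer(new StringReader("to be or not to be"));
ts = new NGramTokenFilter(ts, 4, 4); // "to b", "o be", " be ", "be o", ...
{code}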

Supporting what I suggest will of course mean quite a bit more work: a whole 
new filter that also does input text normalization, such as removing double 
spaces and what not. That will probably not be implemented anytime soon. But 
adding the features in the patch to the filter actually means that this use is 
endorsed by the community, and I'm not sure that's a good idea. I thus think it 
would be better to have some sort of secondary filter that did the exact same 
thing as the patch.

Perhaps I should leave this issue alone and do some more work with LUCENE-1306 

> EdgeNGramTokenFilter stops on tokens smaller than minimum gram size.
> 
>
> Key: LUCENE-1491
> URL: https://issues.apache.org/jira/browse/LUCENE-1491
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Analysis
>Affects Versions: 2.4, 2.4.1, 2.9, 3.0
>Reporter: Todd Feak
>Assignee: Otis Gospodnetic
> Fix For: 2.9
>
> Attachments: LUCENE-1491.patch
>
>
> If a token is encountered in the stream that is shorter in length than the 
> min gram size, the filter will stop processing the token stream.
> Working up a unit test now, but it may be a few days before I can provide it. 
> Wanted to get it into the system.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1491) EdgeNGramTokenFilter stops on tokens smaller than minimum gram size.

2009-06-02 Thread Karl Wettin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715712#action_12715712
 ] 

Karl Wettin commented on LUCENE-1491:
-

Although you have a valid point I'd like to argue this a bit. 

My arguments are probably considered silly by some. Perhaps it's just me that 
uses ngrams for something completely different from what everybody else does, 
but here we go: adding the feature as suggested by this patch is, in my view, 
fixing symptoms of bad use of character ngrams.

BOL, EOL, whitespace and punctuation are all valid parts of character ngrams 
that can increase precision/recall quite a bit. EdgeNGrams could sort of be 
considered such data too. So what I'm saying here is that I consider your 
example a bad use of character ngrams, and that the whole sentence should have 
been grammed up. So in the case of 4-grams the output would end up as: "to b", 
"o be", " be ", "be o", and so on. Perhaps even "$to ", "to b", "o be", and so 
on.

Supporting what I suggest will of course mean quite a bit more work: a whole 
new filter that also does input text normalization, such as removing double 
spaces and what not. That will probably not be implemented anytime soon. But 
adding the features in the patch to the filter actually means that this use is 
endorsed by the community, and I'm not sure that's a good idea. I thus think it 
would be better to have some sort of secondary filter that did the exact same 
thing as the patch.

Perhaps I should leave this issue alone and do some more work with LUCENE-1306 


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1491) EdgeNGramTokenFilter stops on tokens smaller than minimum gram size.

2009-06-02 Thread Karl Wettin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715567#action_12715567
 ] 

Karl Wettin commented on LUCENE-1491:
-

bq. Perhaps we need boolean keepSmaller somewhere, so we can explicitly control 
the behaviour?

I'm not sure. Is there a use case for this or is it an XY-problem?




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: HitCollector#collect(int,float,Collection)

2009-06-02 Thread Karl Wettin
So, I've been sleeping on this for a few weeks. Would it be possible  
to solve this with a decorator? Perhaps a top level decorator that  
also decorates all subqueries at rewrite time and then keeps the  
instantiated scorers bound to the top level decorator, i.e. makes the  
decorated query non-reusable.


Query realQuery = ...
DecoratedQuery dq = new DecoratedQuery(realQuery);
searcher.search(dq, ..);
Map<Query, Scorer> scoringQueries = dq.getScoringQueries();

Not quite sure if this is terrible or elegant.
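
A rough sketch of the surface this would give, with all names hypothetical  
(the interesting part, wrapping each subquery at rewrite time so its Scorer  
gets registered, is left out):

public class DecoratedQuery extends Query {
  private final Query decorated;
  private final Map<Query, Scorer> scoringQueries = new HashMap<Query, Scorer>();

  public DecoratedQuery(Query decorated) { this.decorated = decorated; }

  // filled in while searching, hence bound to one search and non-reusable
  public Map<Query, Scorer> getScoringQueries() { return scoringQueries; }

  public String toString(String field) { return "decorated(" + decorated.toString(field) + ")"; }
}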


karl

On 7 Apr 2009, at 12:17, Michael McCandless wrote:

On Tue, Apr 7, 2009 at 6:13 AM, Karl Wettin   
wrote:


7 apr 2009 kl. 10.23 skrev Michael McCandless:


Do you mean tracking the "atomic queries" that caused a given hit to
match (where "atomic query" is a query that actually uses
TermDocs/Positions to check matching, vs other queries like
BooleanQuery that "glomm together" sub-query matches)?

EG for a boolean query w/ N clauses, which of those N clauses  
matched?


This is exactly what I mean. I do however think it makes sense to get
information about non-atomic queries too, as it seems reasonable that
knowing that the first clause (boolean query '+(a b)') in '+(a b) -(+c +d)'
is matching is more interesting than only getting to know that one of the
clauses of that boolean query is matching.


Ahh OK I agree.  So every query in the full tree should be able to
state whether it matched the doc.


A natural place to do this is Scorer API, ie extend it with a
"getMatchingAtomicQueries" or some such.  Probably, for efficiency,
each Query should be pre-assigned an int position, and then the
matching is represented as a bit array, reused across matches.  Your
collector could then ask the scorer for these bits if it wanted.
There should be no performance cost for collectors that don't use  
this

functionality.


I'll look in to it.

Thanks for the feedback.


karl

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org




-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org




-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1578) InstantiatedIndex supports non-optimized IndexReaders

2009-06-02 Thread Karl Wettin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715494#action_12715494
 ] 

Karl Wettin commented on LUCENE-1578:
-

Jason,

did you get a chance to try this out? It seems to work fine for me and I plan 
to pop it into the trunk in a few days. I think I'll have to add a warning of 
some kind at runtime though, as it could slow down the index a bit if the 
reader is badly fragmented.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Assigned: (LUCENE-1578) InstantiatedIndex supports non-optimized IndexReaders

2009-06-02 Thread Karl Wettin (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wettin reassigned LUCENE-1578:
---

Assignee: Karl Wettin


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1260) Norm codec strategy in Similarity

2009-06-02 Thread Karl Wettin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715492#action_12715492
 ] 

Karl Wettin commented on LUCENE-1260:
-

bq. Wouldn't the simplest solution be to refactor out the static methods, 
replace them with instance methods and remove the getNormDecoder method? This 
would enable a pluggable behavior without introducing a new Codec.

Hi Johan,

feel free to post a patch!




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: InstantiatedIndex Memory required

2009-05-13 Thread Karl Wettin

Hi Ravichandra,

this is a question better suited to the java-user mailing list. On this  
list we talk about the development of the Lucene API rather than how  
to use it.


To answer your question, there is no simple formula that says how much  
RAM an InstantiatedIndex will consume given the size of an FSDirectory  
or RAMDirectory. Your index is, however, probably way too large for  
InstantiatedIndex to be considerably faster than RAMDirectory. There  
is a diagram in the Javadocs that shows the speed on a Reuters index  
as it grows in size:


http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/store/instantiated/package-summary.html#package_description

As mileage varies with term saturation you should still try benchmarking  
and see if there is anything to be gained. Try increasing -Xmx to  
whatever you have; you can also take a look at -XX:+AggressiveHeap.
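
If you want a quick way to compare, loading the existing index is all it  
takes; a minimal sketch, assuming the index lives on disk (the path is  
made up):

IndexReader source = IndexReader.open(FSDirectory.getDirectory("/path/to/index"));
InstantiatedIndex ii = new InstantiatedIndex(source);
source.close(); // the data has been copied into the InstantiatedIndex
IndexSearcher searcher = new IndexSearcher(ii.indexReaderFactory());
// run the same queries against this searcher and against the original
// directory and compare response times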



 karl

12 May 2009, at 18.43, thiruvee wrote:



Hi

So far I have been using RAMDirectory for my indexes. To meet the SLA of our
project, I thought of using InstantiatedIndex. But when I use it, I do not
get any output and it throws an out of memory error.

What is the ratio between index size and memory size when using
InstantiatedIndex?
Here are my index details:

Index size: 200 MB
RAM size: 1 GB

If I try with a small test index of 100 KB, it works.
Please help me with this.

Thanks
Ravichandra









Re: HitCollector#collect(int,float,Collection)

2009-04-07 Thread Karl Wettin


7 Apr 2009, at 10.23, Michael McCandless wrote:


Do you mean tracking the "atomic queries" that caused a given hit to
match (where "atomic query" is a query that actually uses
TermDocs/Positions to check matching, vs other queries like
BooleanQuery that "glomm together" sub-query matches)?

EG for a boolean query w/ N clauses, which of those N clauses matched?


This is exactly what I mean. I do however think it also makes sense to get  
information about non-atomic queries: it seems reasonable that knowing that  
the first clause (the boolean query '+(a b)') of '+(a b) -(+c +d)' is  
matching is more interesting than only getting to know that one of the  
clauses of that boolean query is matching.



A natural place to do this is Scorer API, ie extend it with a
"getMatchingAtomicQueries" or some such.  Probably, for efficiency,
each Query should be pre-assigned an int position, and then the
matching is represented as a bit array, reused across matches.  Your
collector could then ask the scorer for these bits if it wanted.
There should be no performance cost for collectors that don't use this
functionality.


I'll look into it.
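
From the collector's point of view I picture something like this;  
getMatchingAtomicQueries is a hypothetical method name, just to  
illustrate:

public class MatchingQueriesCollector extends HitCollector {
  private final Scorer scorer; // the collector needs a handle to the scorer
  public MatchingQueriesCollector(Scorer scorer) { this.scorer = scorer; }
  public void collect(int doc, float score) {
    // hypothetical: one bit per pre-assigned atomic query, reused across hits
    BitSet matching = scorer.getMatchingAtomicQueries();
    // inspect the bits here, e.g. to threshold huge OR-queries
  }
}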

Thanks for the feedback.


 karl




HitCollector#collect(int,float,Collection)

2009-04-06 Thread Karl Wettin
How crazy would it be to refactor HitCollector so it also accepts the  
matching queries?


Let's ignore my use case (not sure it makes sense yet; it's related to  
finding a threshold between probably interesting and definitely not  
interesting results of huge OR-statements, but I really have to try it  
out before I can say if it's any good) and just focus on the speed  
impact. If I cleared and reused the Collection passed down to the  
HitCollector then it shouldn't really slow things down, right? And if  
I reused the collections in my TopDocsCollector as low scoring results  
were pushed out then it shouldn't have to be expensive there either. Or?
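
To be concrete, the signature I have in mind is something like:

public abstract class HitCollector {
  /**
   * @param matchingQueries the queries that matched this document; the
   *        collection instance is owned and reused by the caller, so
   *        implementations must copy it if they need to keep it around
   */
  public abstract void collect(int doc, float score, Collection matchingQueries);
}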



karl




[jira] Commented: (LUCENE-1039) Bayesian classifiers using Lucene as data store

2009-03-30 Thread Karl Wettin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12693744#action_12693744
 ] 

Karl Wettin commented on LUCENE-1039:
-

Vaijanath,

can you please post a small test case that demonstrates the problem?

> Bayesian classifiers using Lucene as data store
> ---
>
> Key: LUCENE-1039
> URL: https://issues.apache.org/jira/browse/LUCENE-1039
> Project: Lucene - Java
>  Issue Type: New Feature
>    Reporter: Karl Wettin
>    Assignee: Karl Wettin
>Priority: Minor
> Attachments: LUCENE-1039.txt
>
>
> Bayesian classifiers using Lucene as data store. Based on the Naive Bayes and 
> Fisher method algorithms as described by Toby Segaran in "Programming 
> Collective Intelligence", ISBN 978-0-596-52932-1. 
> Have fun.
> Poor java docs, but the TestCase shows how to use it:
> {code:java}
> public class TestClassifier extends TestCase {
>   public void test() throws Exception {
> InstanceFactory instanceFactory = new InstanceFactory() {
>   public Document factory(String text, String _class) {
> Document doc = new Document();
> doc.add(new Field("class", _class, Field.Store.YES, 
> Field.Index.NO_NORMS));
> doc.add(new Field("text", text, Field.Store.YES, Field.Index.NO, 
> Field.TermVector.NO));
> doc.add(new Field("text/ngrams/start", text, Field.Store.NO, 
> Field.Index.TOKENIZED, Field.TermVector.YES));
> doc.add(new Field("text/ngrams/inner", text, Field.Store.NO, 
> Field.Index.TOKENIZED, Field.TermVector.YES));
> doc.add(new Field("text/ngrams/end", text, Field.Store.NO, 
> Field.Index.TOKENIZED, Field.TermVector.YES));
> return doc;
>   }
>   Analyzer analyzer = new Analyzer() {
> private int minGram = 2;
> private int maxGram = 3;
> public TokenStream tokenStream(String fieldName, Reader reader) {
>   TokenStream ts = new StandardTokenizer(reader);
>   ts = new LowerCaseFilter(ts);
>   if (fieldName.endsWith("/ngrams/start")) {
> ts = new EdgeNGramTokenFilter(ts, 
> EdgeNGramTokenFilter.Side.FRONT, minGram, maxGram);
>   } else if (fieldName.endsWith("/ngrams/inner")) {
> ts = new NGramTokenFilter(ts, minGram, maxGram);
>   } else if (fieldName.endsWith("/ngrams/end")) {
> ts = new EdgeNGramTokenFilter(ts, EdgeNGramTokenFilter.Side.BACK, 
> minGram, maxGram);
>   }
>   return ts;
> }
>   };
>   public Analyzer getAnalyzer() {
> return analyzer;
>   }
> };
> Directory dir = new RAMDirectory();
> new IndexWriter(dir, null, true).close();
> Instances instances = new Instances(dir, instanceFactory, "class");
> instances.addInstance("hello world", "en");
> instances.addInstance("hallå världen", "sv");
> instances.addInstance("this is london calling", "en");
> instances.addInstance("detta är london som ringer", "sv");
> instances.addInstance("john has a long mustache", "en");
> instances.addInstance("john har en lång mustache", "sv");
> instances.addInstance("all work and no play makes jack a dull boy", "en");
> instances.addInstance("att bara arbeta och aldrig leka gör jack en trist 
> gosse", "sv");
> instances.addInstance("shrimp sandwich", "en");
> instances.addInstance("räksmörgås", "sv");
> instances.addInstance("it's now or never", "en");
> instances.addInstance("det är nu eller aldrig", "sv");
> instances.addInstance("to tie up at a landing-stage", "en");
> instances.addInstance("att angöra en brygga", "sv");
> instances.addInstance("it's now time for the children's television 
> shows", "en");
> instances.addInstance("nu är det dags för barnprogram", "sv");
> instances.flush();
> testClassifier(instances, new NaiveBayesClassifier());
> testClassifier(instances, new FishersMethodClassifier());
> instances.close();
>   }
>   private void testClassifier(Instances instances, BayesianClassifier 
> classifier) throws IOException {
> assertEquals("sv",

[jira] Updated: (LUCENE-1578) InstantiatedIndex supports non-optimized IndexReaders

2009-03-30 Thread Karl Wettin (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wettin updated LUCENE-1578:


Attachment: LUCENE-1578.txt

Please test this patch using a couple of different unoptimized readers in the 
constructor.
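
For instance, a multi-segment (unoptimized) source can be produced like this (a sketch; the analyzer and field don't matter):

{code:java}
RAMDirectory dir = new RAMDirectory();
IndexWriter w = new IndexWriter(dir, new SimpleAnalyzer(), true, IndexWriter.MaxFieldLength.LIMITED);
w.setMaxBufferedDocs(2); // flush often so we end up with several segments
for (int i = 0; i < 10; i++) {
  Document doc = new Document();
  doc.add(new Field("f", "document " + i, Field.Store.YES, Field.Index.TOKENIZED));
  w.addDocument(doc);
}
w.close(); // note: no optimize()
InstantiatedIndex ii = new InstantiatedIndex(IndexReader.open(dir));
{code}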

> InstantiatedIndex supports non-optimized IndexReaders
> -
>
> Key: LUCENE-1578
> URL: https://issues.apache.org/jira/browse/LUCENE-1578
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/*
>Affects Versions: 2.4.1
>Reporter: Jason Rutherglen
> Fix For: 2.9
>
> Attachments: LUCENE-1578.txt
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> InstantiatedIndex does not currently support non-optimized IndexReaders.  




Re: InstantiatedIndex

2009-03-30 Thread Karl Wettin


28 Mar 2009, at 01.21, Jason Rutherglen wrote:


I'm thinking InstantiatedIndex needs to implement either clone of  
all the index data or needs to be able to accept a non-optimized  
reader, or both.  I forget what the obstacles are to implementing  
the non-optimized reader option?  Do you think there are advantages  
or disadvantages when comparing the solutions?


Hi Jason,

I honestly don't remember the reason but it seems to have something to  
do with deletions.





Realtime search will need to periodically merge  
InstantiatedIndex's.  One option is to clone an existing index, then  
add a document to it, clone, and so on, freeze it and later merge it  
with other indexes.  The other option that provides the same  
functionality is to pass the smaller readers into an  
InstantiatedIndex.


How do you feel about something like this?

public InstantiatedIndex merge(IndexReader[] readers) throws IOException {
  // merge all readers to a temporary, optimized RAMDirectory
  Directory dir = new RAMDirectory();
  IndexWriter w = new IndexWriter(dir, new SimpleAnalyzer(), true,
      IndexWriter.MaxFieldLength.UNLIMITED);
  w.addIndexes(readers);
  w.commit();
  w.optimize();
  w.close();
  // load the single optimized segment into an InstantiatedIndex
  IndexReader reader = IndexReader.open(dir);
  InstantiatedIndex ii = new InstantiatedIndex(reader);
  reader.close();
  dir.close();
  return ii;
}



 karl




[jira] Commented: (LUCENE-1543) Field specified norms in MatchAllDocumentsScorer

2009-03-19 Thread Karl Wettin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683423#action_12683423
 ] 

Karl Wettin commented on LUCENE-1543:
-

bq. Karl, is there a reason why a function query can't be used in your 
situation? It seems like it should work?

I'm sure it would. : ) 

I do however not understand why you think it is a more correct/nicer/better 
solution than using this patch. This is how I reason: if norms scoring is 
available in all the other low level queries, then it also makes sense to have 
it in the low level MatchAllDocsQuery.

> Field specified norms in MatchAllDocumentsScorer 
> -
>
> Key: LUCENE-1543
> URL: https://issues.apache.org/jira/browse/LUCENE-1543
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Query/Scoring
>Affects Versions: 2.4
>Reporter: Karl Wettin
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1543.txt
>
>
> This patch allows for optionally setting a field to use for norms factoring 
> when scoring a MatchAllDocsQuery.
> From the test case:
> {code:java}
> .
> RAMDirectory dir = new RAMDirectory();
> IndexWriter iw = new IndexWriter(dir, new StandardAnalyzer(), true, 
> IndexWriter.MaxFieldLength.LIMITED);
> iw.setMaxBufferedDocs(2);  // force multi-segment
> addDoc("one", iw, 1f);
> addDoc("two", iw, 20f);
> addDoc("three four", iw, 300f);
> iw.close();
> IndexReader ir = IndexReader.open(dir);
> IndexSearcher is = new IndexSearcher(ir);
> ScoreDoc[] hits;
> // assert with norms scoring turned off
> hits = is.search(new MatchAllDocsQuery(), null, 1000).scoreDocs;
> assertEquals(3, hits.length);
> assertEquals("one", ir.document(hits[0].doc).get("key"));
> assertEquals("two", ir.document(hits[1].doc).get("key"));
> assertEquals("three four", ir.document(hits[2].doc).get("key"));
> // assert with norms scoring turned on
> MatchAllDocsQuery normsQuery = new MatchAllDocsQuery("key");
> assertEquals(3, hits.length);
> //is.explain(normsQuery, hits[0].doc);
> hits = is.search(normsQuery, null, 1000).scoreDocs;
> assertEquals("three four", ir.document(hits[0].doc).get("key"));
> assertEquals("two", ir.document(hits[1].doc).get("key"));
> assertEquals("one", ir.document(hits[2].doc).get("key"));
> {code}




[jira] Commented: (LUCENE-1543) Field specified norms in MatchAllDocumentsScorer

2009-02-19 Thread Karl Wettin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12675118#action_12675118
 ] 

Karl Wettin commented on LUCENE-1543:
-

bq. Couldn't you just use a TermQuery? Or a BooleanQuery with a 
MatchAllDocsQuery and an optional TermQuery?

Wouldn't that require a TermQuery that matches all documents? I.e. adding a term 
to a field in all documents?

The following stuff doesn't really fit in this issue, but still. It's rather 
related to column stride payloads, LUCENE-1231. I've been considering adding a 
new "norms" field at document level for a couple of years now. 8 more bits at 
document level would allow general document boosting to move out of the 
per-field norms blob, and would increase the length normalization and per field 
boost resolution quite a bit at a low cost. 

(I hope that is not yet another can of worms I get to open.)


> Field specified norms in MatchAllDocumentsScorer 
> -
>
> Key: LUCENE-1543
> URL: https://issues.apache.org/jira/browse/LUCENE-1543
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Query/Scoring
>Affects Versions: 2.4
>Reporter: Karl Wettin
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1543.txt
>
>
> This patch allows for optionally setting a field to use for norms factoring 
> when scoring a MatchAllDocsQuery.
> From the test case:
> {code:java}
> .
> RAMDirectory dir = new RAMDirectory();
> IndexWriter iw = new IndexWriter(dir, new StandardAnalyzer(), true, 
> IndexWriter.MaxFieldLength.LIMITED);
> iw.setMaxBufferedDocs(2);  // force multi-segment
> addDoc("one", iw, 1f);
> addDoc("two", iw, 20f);
> addDoc("three four", iw, 300f);
> iw.close();
> IndexReader ir = IndexReader.open(dir);
> IndexSearcher is = new IndexSearcher(ir);
> ScoreDoc[] hits;
> // assert with norms scoring turned off
> hits = is.search(new MatchAllDocsQuery(), null, 1000).scoreDocs;
> assertEquals(3, hits.length);
> assertEquals("one", ir.document(hits[0].doc).get("key"));
> assertEquals("two", ir.document(hits[1].doc).get("key"));
> assertEquals("three four", ir.document(hits[2].doc).get("key"));
> // assert with norms scoring turned on
> MatchAllDocsQuery normsQuery = new MatchAllDocsQuery("key");
> assertEquals(3, hits.length);
> //is.explain(normsQuery, hits[0].doc);
> hits = is.search(normsQuery, null, 1000).scoreDocs;
> assertEquals("three four", ir.document(hits[0].doc).get("key"));
> assertEquals("two", ir.document(hits[1].doc).get("key"));
> assertEquals("one", ir.document(hits[2].doc).get("key"));
> {code}




[jira] Updated: (LUCENE-1543) Field specified norms in MatchAllDocumentsScorer

2009-02-19 Thread Karl Wettin (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wettin updated LUCENE-1543:


Attachment: LUCENE-1543.txt

> Field specified norms in MatchAllDocumentsScorer 
> -
>
> Key: LUCENE-1543
> URL: https://issues.apache.org/jira/browse/LUCENE-1543
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Query/Scoring
>Affects Versions: 2.4
>    Reporter: Karl Wettin
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1543.txt
>
>
> This patch allows for optionally setting a field to use for norms factoring 
> when scoring a MatchAllDocsQuery.
> From the test case:
> {code:java}
> .
> RAMDirectory dir = new RAMDirectory();
> IndexWriter iw = new IndexWriter(dir, new StandardAnalyzer(), true, 
> IndexWriter.MaxFieldLength.LIMITED);
> iw.setMaxBufferedDocs(2);  // force multi-segment
> addDoc("one", iw, 1f);
> addDoc("two", iw, 20f);
> addDoc("three four", iw, 300f);
> iw.close();
> IndexReader ir = IndexReader.open(dir);
> IndexSearcher is = new IndexSearcher(ir);
> ScoreDoc[] hits;
> // assert with norms scoring turned off
> hits = is.search(new MatchAllDocsQuery(), null, 1000).scoreDocs;
> assertEquals(3, hits.length);
> assertEquals("one", ir.document(hits[0].doc).get("key"));
> assertEquals("two", ir.document(hits[1].doc).get("key"));
> assertEquals("three four", ir.document(hits[2].doc).get("key"));
> // assert with norms scoring turned on
> MatchAllDocsQuery normsQuery = new MatchAllDocsQuery("key");
> assertEquals(3, hits.length);
> //is.explain(normsQuery, hits[0].doc);
> hits = is.search(normsQuery, null, 1000).scoreDocs;
> assertEquals("three four", ir.document(hits[0].doc).get("key"));
> assertEquals("two", ir.document(hits[1].doc).get("key"));
> assertEquals("one", ir.document(hits[2].doc).get("key"));
> {code}




[jira] Created: (LUCENE-1543) Field specified norms in MatchAllDocumentsScorer

2009-02-19 Thread Karl Wettin (JIRA)
Field specified norms in MatchAllDocumentsScorer 
-

 Key: LUCENE-1543
 URL: https://issues.apache.org/jira/browse/LUCENE-1543
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Query/Scoring
Affects Versions: 2.4
Reporter: Karl Wettin
Priority: Minor
 Fix For: 2.9
 Attachments: LUCENE-1543.txt

This patch allows for optionally setting a field to use for norms factoring 
when scoring a MatchAllDocsQuery.

From the test case:
{code:java}
.
RAMDirectory dir = new RAMDirectory();
IndexWriter iw = new IndexWriter(dir, new StandardAnalyzer(), true, 
IndexWriter.MaxFieldLength.LIMITED);
iw.setMaxBufferedDocs(2);  // force multi-segment
addDoc("one", iw, 1f);
addDoc("two", iw, 20f);
addDoc("three four", iw, 300f);
iw.close();

IndexReader ir = IndexReader.open(dir);
IndexSearcher is = new IndexSearcher(ir);
ScoreDoc[] hits;

// assert with norms scoring turned off

hits = is.search(new MatchAllDocsQuery(), null, 1000).scoreDocs;
assertEquals(3, hits.length);
assertEquals("one", ir.document(hits[0].doc).get("key"));
assertEquals("two", ir.document(hits[1].doc).get("key"));
assertEquals("three four", ir.document(hits[2].doc).get("key"));

// assert with norms scoring turned on

MatchAllDocsQuery normsQuery = new MatchAllDocsQuery("key");
assertEquals(3, hits.length);
//is.explain(normsQuery, hits[0].doc);
hits = is.search(normsQuery, null, 1000).scoreDocs;

assertEquals("three four", ir.document(hits[0].doc).get("key"));
assertEquals("two", ir.document(hits[1].doc).get("key"));
assertEquals("one", ir.document(hits[2].doc).get("key"));
{code}
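
For reference, addDoc is a small helper along these lines (a sketch, not necessarily the exact helper in the patch):

{code:java}
private static void addDoc(String key, IndexWriter iw, float boost) throws IOException {
  Document doc = new Document();
  doc.add(new Field("key", key, Field.Store.YES, Field.Index.TOKENIZED));
  doc.setBoost(boost); // ends up in the norms, which the normsQuery picks up
  iw.addDocument(doc);
}
{code}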




[jira] Commented: (LUCENE-1537) InstantiatedIndexReader.clone

2009-02-15 Thread Karl Wettin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12673610#action_12673610
 ] 

Karl Wettin commented on LUCENE-1537:
-

I haven't tried it out yet, but I have a few comments and questions on the patch:

{code}
Index: 
contrib/instantiated/src/java/org/apache/lucene/store/instantiated/InstantiatedIndexReader.java
+  
+  public Object clone() {
+    try {
+      doCommit();
+      InstantiatedIndex clonedIndex = index.cloneWithDeletesNorms();
+      return new InstantiatedIndexReader(clonedIndex);
+    } catch (IOException ioe) {
+      throw new RuntimeException("", ioe);
+    }
+  }

Index: 
contrib/instantiated/src/java/org/apache/lucene/store/instantiated/InstantiatedIndex.java
+
+  InstantiatedIndex cloneWithDeletesNorms() {
+    InstantiatedIndex clone = new InstantiatedIndex();
+    clone.version = System.currentTimeMillis();
+    clone.documentsByNumber = documentsByNumber;
+    clone.deletedDocuments = new HashSet(deletedDocuments);
+    clone.termsByFieldAndText = termsByFieldAndText;
+    clone.orderedTerms = orderedTerms;
+    clone.normsByFieldNameAndDocumentNumber = new HashMap(normsByFieldNameAndDocumentNumber);
+    clone.fieldSettings = fieldSettings;
+    return clone;
+  }
{code}

Perhaps we should move deleted documents to the reader? It might be a bit of 
work to hook it up with the term enum etc., but it could be worth looking into. I 
think it makes more sense to keep the same instance of InstantiatedIndex and 
only produce a cloned InstantiatedIndexReader. It is reader#clone we call 
upon, so cloning the store sounds like a future home for unwanted bugs.



I see there are some leftovers from your attempt to handle non-optimized 
readers:

{code}
-documentsByNumber = new InstantiatedDocument[sourceIndexReader.numDocs()];
+documentsByNumber = new InstantiatedDocument[sourceIndexReader.maxDoc()];
 
 // create documents
 for (int i = 0; i < sourceIndexReader.numDocs(); i++) {
{code}

I think if you switch to maxDoc you should also use maxDoc in the loop and skip 
any deleted documents. 
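
E.g. a sketch, using the existing IndexReader#isDeleted:

{code}
documentsByNumber = new InstantiatedDocument[sourceIndexReader.maxDoc()];

// create documents, skipping deleted ones
for (int i = 0; i < sourceIndexReader.maxDoc(); i++) {
  if (sourceIndexReader.isDeleted(i)) {
    continue;
  }
  // build the InstantiatedDocument for slot i as before
}
{code}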



{code}
-for (InstantiatedDocument document : getDocumentsByNumber()) {
+//for (InstantiatedDocument document : getDocumentsByNumber()) {
+for (InstantiatedDocument document : getDocumentsNotDeleted()) {
   for (Field field : (List) document.getDocument().getFields()) {
 if (field.isTermVectorStored() && field.isStoreOffsetWithTermVector()) 
{
   TermPositionVector termPositionVector = (TermPositionVector) 
sourceIndexReader.getTermFreqVector(document.getDocumentNumber(), field.name());
@@ -312,7 +325,15 @@
   public InstantiatedDocument[] getDocumentsByNumber() {
 return documentsByNumber;
   }
-
+  
+  public List getDocumentsNotDeleted() {
+    List list = new ArrayList(documentsByNumber.length - deletedDocuments.size());
+    for (int x = 0; x < documentsByNumber.length; x++) {
+      if (!deletedDocuments.contains(x)) list.add(documentsByNumber[x]);
+    }
+    return list;
+  } 
+  
{code}

As the source never contains any deleted documents this really doesn't do 
anything but consume a bit of resources, or?



{code}
-int maxVal = 
getAssociatedDocuments()[max].getDocument().getDocumentNumber();
+InstantiatedTermDocumentInformation itdi = getAssociatedDocuments()[max];
+InstantiatedDocument id = itdi.getDocument();
+int maxVal = id.getDocumentNumber();
+//int maxVal = 
getAssociatedDocuments()[max].getDocument().getDocumentNumber();
{code}

Is this refactor just for debugging purposes? I find it harder to read than the 
original one-liner.

> InstantiatedIndexReader.clone
> -
>
> Key: LUCENE-1537
> URL: https://issues.apache.org/jira/browse/LUCENE-1537
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/*
>Affects Versions: 2.4
>    Reporter: Jason Rutherglen
>Assignee: Karl Wettin
>Priority: Trivial
> Fix For: 2.9
>
> Attachments: LUCENE-1537.patch
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> This patch will implement IndexReader.clone for InstantiatedIndexReader.  




[jira] Assigned: (LUCENE-1537) InstantiatedIndexReader.clone

2009-02-15 Thread Karl Wettin (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wettin reassigned LUCENE-1537:
---

Assignee: Karl Wettin

> InstantiatedIndexReader.clone
> -
>
> Key: LUCENE-1537
> URL: https://issues.apache.org/jira/browse/LUCENE-1537
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/*
>Affects Versions: 2.4
>Reporter: Jason Rutherglen
>Assignee: Karl Wettin
>Priority: Trivial
> Fix For: 2.9
>
> Attachments: LUCENE-1537.patch
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> This patch will implement IndexReader.clone for InstantiatedIndexReader.  




[jira] Closed: (LUCENE-1531) contrib/xml-query-parser, BoostingTermQuery support

2009-02-09 Thread Karl Wettin (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wettin closed LUCENE-1531.
---

   Resolution: Fixed
Lucene Fields: [New, Patch Available]  (was: [Patch Available, New])

Committed revision 742411

> contrib/xml-query-parser, BoostingTermQuery support
> ---
>
> Key: LUCENE-1531
> URL: https://issues.apache.org/jira/browse/LUCENE-1531
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/*
>Affects Versions: 2.4
>    Reporter: Karl Wettin
>Assignee: Karl Wettin
> Fix For: 2.9
>
> Attachments: LUCENE-1531.txt, LUCENE-1531.txt
>
>
> I'm not 100% on this patch. 
> BoostingTermQuery is a part of the spans family, but I generally use that 
> class as a replacement for TermQuery.  Thus in the DTD I have stated that it 
> can be a part of the root queries as well as a part of a span. 
> However, SpanFooQueries xml elements are named <SpanFoo> rather than 
> <SpanFooQuery>; I have however chosen to call it <BoostingTermQuery>. It 
> would be possible to set it up so it would be parsed as <BoostingTerm> 
> when inside of a span, but I just find that confusing.




Re: Partial / starts with searching

2009-02-05 Thread Karl Wettin

Hi Jori,

your question is better suited to the java-users list; on this list we  
discuss the development of the API.


To answer your question, ngrams might solve your problem; tokenizers  
are available in contrib/analyzers.
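
For example, indexing the field through an edge n-gram filter turns a  
prefix search into an ordinary TermQuery; a sketch, the gram sizes are  
something you will want to tune:

Analyzer analyzer = new Analyzer() {
  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream ts = new StandardTokenizer(reader);
    ts = new LowerCaseFilter(ts);
    // "ambulance" is indexed as "a", "am", "amb", ... so the query
    // term "ambu" matches directly, with no term expansion at all
    return new EdgeNGramTokenFilter(ts, EdgeNGramTokenFilter.Side.FRONT, 1, 20);
  }
};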



karl

5 Feb 2009, at 10.19, d-fader wrote:


Hi,

I'm new to this list, so please don't be too harsh if I missed some  
rules or something. I have been using Lucene for about half a year now  
and I think it's awesome, respect for all your efforts!

Maybe the 'issue' I'm addressing now has been discussed thoroughly  
already; in that case I think I need some redirection to the sources  
of those discussions :) Anyway, here's the thing.
For all I know it's impossible to search partial words with Lucene  
(except the asterisk method with e.g. the StandardAnalyzer -> ambul*  
to find ambulance). My problem with that method is that my index  
consists of quite a few terms. This means that if a user searches  
for 'ambu amster' (ambulance amsterdam), there will be so many terms  
to search that it's not doable. Now I started thinking about why it's  
impossible to search only a 'part' of a term, or even only the 'start'  
of a term, and the only reason I could think of was that the index  
terms are stored tokenized (in that way you (of course) can't find  
partial terms, since the index actually doesn't contain the literal  
terms, but tokens instead). But Lucene can also store all terms  
untokenized, so in that case a partial search would be possible in my  
humble opinion, since all terms would be stored 'literally'.

Maybe my thinking is wrong; I only have a black box view of Lucene,  
so I don't know much about the indexing algorithm and all, but I just  
want to know if this could be done, or else why not :) You see, the  
users of my index want to know why they can't search parts of the  
words they enter and I still can't give them a really good answer,  
except the 'it would result in too many OR operators in the query'  
statement :)


Thanks in advance!

Jori




[jira] Commented: (LUCENE-1531) contrib/xml-query-parser, BoostingTermQuery support

2009-02-03 Thread Karl Wettin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12670240#action_12670240
 ] 

Karl Wettin commented on LUCENE-1531:
-

Any objections to this patch? If not I'll pop it into trunk a few days from 
now.

> contrib/xml-query-parser, BoostingTermQuery support
> ---
>
> Key: LUCENE-1531
> URL: https://issues.apache.org/jira/browse/LUCENE-1531
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/*
>Affects Versions: 2.4
>Reporter: Karl Wettin
>Assignee: Karl Wettin
> Fix For: 2.9
>
> Attachments: LUCENE-1531.txt, LUCENE-1531.txt
>
>
> I'm not 100% on this patch. 
> BoostingTermQuery is a part of the spans family, but I generally use that 
> class as a replacement for TermQuery.  Thus in the DTD I have stated that it 
> can be a part of the root queries as well as a part of a span. 
> However, SpanFooQueries xml elements are named <SpanFoo> rather than 
> <SpanFooQuery>; I have however chosen to call it <BoostingTermQuery>. It 
> would be possible to set it up so it would be parsed as <BoostingTerm> 
> when inside of a span, but I just find that confusing.




[jira] Updated: (LUCENE-1531) contrib/xml-query-parser, BoostingTermQuery support

2009-01-29 Thread Karl Wettin (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wettin updated LUCENE-1531:


Attachment: LUCENE-1531.txt

Previous patch was messed up from cloning SpanTerm..

> contrib/xml-query-parser, BoostingTermQuery support
> ---
>
> Key: LUCENE-1531
> URL: https://issues.apache.org/jira/browse/LUCENE-1531
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/*
>Affects Versions: 2.4
>    Reporter: Karl Wettin
>Assignee: Karl Wettin
> Fix For: 2.9
>
> Attachments: LUCENE-1531.txt, LUCENE-1531.txt
>
>
> I'm not 100% on this patch. 
> BoostingTermQuery is a part of the spans family, but I generally use that 
> class as a replacement for TermQuery.  Thus in the DTD I have stated that it 
> can be a part of the root queries as well as a part of a span. 
> However, SpanFooQueries xml elements are named <SpanFoo> rather than 
> <SpanFooQuery>; I have however chosen to call it <BoostingTermQuery>. It 
> would be possible to set it up so it would be parsed as <BoostingTerm> 
> when inside of a span, but I just find that confusing.




[jira] Updated: (LUCENE-1531) contrib/xml-query-parser, BoostingTermQuery support

2009-01-29 Thread Karl Wettin (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wettin updated LUCENE-1531:


Attachment: LUCENE-1531.txt

> contrib/xml-query-parser, BoostingTermQuery support
> ---
>
> Key: LUCENE-1531
> URL: https://issues.apache.org/jira/browse/LUCENE-1531
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/*
>Affects Versions: 2.4
>    Reporter: Karl Wettin
>Assignee: Karl Wettin
> Fix For: 2.9
>
> Attachments: LUCENE-1531.txt
>
>
> I'm not 100% on this patch. 
> BoostingTermQuery is a part of the spans family, but I generally use that 
> class as a replacement for TermQuery.  Thus in the DTD I have stated that it 
> can be a part of the root queries as well as a part of a span. 
> However, SpanFooQueries xml elements are named <SpanFoo> rather than 
> <SpanFooQuery>; I have however chosen to call it <BoostingTermQuery>. It 
> would be possible to set it up so it would be parsed as <BoostingTerm> 
> when inside of a span, but I just find that confusing.




[jira] Created: (LUCENE-1531) contrib/xml-query-parser, BoostingTermQuery support

2009-01-29 Thread Karl Wettin (JIRA)
contrib/xml-query-parser, BoostingTermQuery support
---

 Key: LUCENE-1531
 URL: https://issues.apache.org/jira/browse/LUCENE-1531
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/*
Affects Versions: 2.4
Reporter: Karl Wettin
Assignee: Karl Wettin
 Fix For: 2.9


I'm not 100% on this patch. 

BoostingTermQuery is a part of the spans family, but I generally use that class 
as a replacement for TermQuery.  Thus in the DTD I have stated that it can be a 
part of the root queries as well as a part of a span. 

However, SpanFooQueries xml elements are named <SpanFoo> rather than 
<SpanFooQuery>; I have however chosen to call it <BoostingTermQuery>. It 
would be possible to set it up so it would be parsed as <BoostingTerm> 
when inside of a span, but I just find that confusing.





Filesystem based bitset

2009-01-09 Thread Karl Wettin

Thinking out loud,

SSD is pretty close to RAM when it comes to seeking. Wouldn't that  
mean that a bitset stored on an SSD would be more or less as fast as a  
bitset in RAM? So how about storing all permutations of the filters one  
uses on SSD? Perhaps loading them into RAM in case they are frequently  
used? To me it sounds like a great idea.


Not sure if one should focus on OpenBitSet or a fixed size BitSet; I'd  
really need to do some real tests to tell. Still, I'm rather convinced  
the bang for the buck ratio is quite a bit better using SSD than RAM,  
given that IO throughput (compare an index in RAM vs on SSD vs on HDD)  
isn't an issue.


The only real issue I can think of is the lack of a  
DocIdSetIterator#close().
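
Something like this is what I have in mind; a rough sketch, no caching,  
no bounds checking and not thread safe:

public class FileBitSet {
  private final RandomAccessFile file;
  public FileBitSet(File f) throws IOException {
    file = new RandomAccessFile(f, "r");
  }
  public boolean get(int index) throws IOException {
    file.seek(index >> 3); // the byte containing bit number index
    return (file.read() & (1 << (index & 7))) != 0;
  }
  public void close() throws IOException { // the close the iterator API lacks
    file.close();
  }
}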




karl




[jira] Updated: (LUCENE-1515) Improved(?) Swedish snowball stemmer

2009-01-09 Thread Karl Wettin (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wettin updated LUCENE-1515:


Attachment: LUCENE-1515.txt

Snowball code, generated Java class and unit test.

> Improved(?) Swedish snowball stemmer
> 
>
> Key: LUCENE-1515
> URL: https://issues.apache.org/jira/browse/LUCENE-1515
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/*
>Affects Versions: 2.4
>    Reporter: Karl Wettin
> Attachments: LUCENE-1515.txt
>
>
> Snowball stemmer for Swedish lacks support for '-an' and '-ans' related 
> suffix stripping, ending up with incompatible stems, for example "klocka", 
> "klockor", "klockornas", "klockAN", "klockANS".  Complete list of new suffix 
> stripping rules:
> {pre}
> 'an' 'anen' 'anens' 'anare' 'aner' 'anerna' 'anernas'
> 'ans' 'ansen' 'ansens' 'anser' 'ansera'  'anserar' 'anserna' 
> 'ansernas'
> 'iera'
> (delete)
> {pre}
> The problem is all the exceptions (e.g. svans|svan, finans|fin, nyans|ny) and 
> this is an attempt at solving that problem. The rules and exceptions are 
> based on the [SAOL|http://en.wikipedia.org/wiki/Svenska_Akademiens_Ordlista] 
> entries suffixed with 'an' and 'ans'. There are a few known problematic 
> stemming rules but it seems to work quite a bit better than the current 
> SwedishStemmer. It would not be a bad idea to check all of the SAOL entries 
> in order to verify the integrity of the rules.
> My Snowball syntax skills are rather limited so I'm certain the code could be 
> optimized quite a bit.
> *The code is released under BSD and not ASL*. I've been posting a bit in the 
> Snowball forum and privately to Martin Porter himself but never got any 
> response, so now I post it here instead in the hope of gaining some momentum.




[jira] Commented: (LUCENE-1039) Bayesian classifiers using Lucene as data store

2009-01-09 Thread Karl Wettin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662467#action_12662467
 ] 

Karl Wettin commented on LUCENE-1039:
-

What do you people think, should I commit this to Lucene or Mahout?

> Bayesian classifiers using Lucene as data store
> ---
>
> Key: LUCENE-1039
> URL: https://issues.apache.org/jira/browse/LUCENE-1039
> Project: Lucene - Java
>  Issue Type: New Feature
>    Reporter: Karl Wettin
>    Assignee: Karl Wettin
>Priority: Minor
> Attachments: LUCENE-1039.txt
>
>
> Bayesian classifiers using Lucene as data store. Based on the Naive Bayes and 
> Fisher method algorithms as described by Toby Segaran in "Programming 
> Collective Intelligence", ISBN 978-0-596-52932-1. 
> Have fun.
> Poor java docs, but the TestCase shows how to use it:
> {code:java}
> public class TestClassifier extends TestCase {
>   public void test() throws Exception {
> InstanceFactory instanceFactory = new InstanceFactory() {
>   public Document factory(String text, String _class) {
> Document doc = new Document();
> doc.add(new Field("class", _class, Field.Store.YES, 
> Field.Index.NO_NORMS));
> doc.add(new Field("text", text, Field.Store.YES, Field.Index.NO, 
> Field.TermVector.NO));
> doc.add(new Field("text/ngrams/start", text, Field.Store.NO, 
> Field.Index.TOKENIZED, Field.TermVector.YES));
> doc.add(new Field("text/ngrams/inner", text, Field.Store.NO, 
> Field.Index.TOKENIZED, Field.TermVector.YES));
> doc.add(new Field("text/ngrams/end", text, Field.Store.NO, 
> Field.Index.TOKENIZED, Field.TermVector.YES));
> return doc;
>   }
>   Analyzer analyzer = new Analyzer() {
> private int minGram = 2;
> private int maxGram = 3;
> public TokenStream tokenStream(String fieldName, Reader reader) {
>   TokenStream ts = new StandardTokenizer(reader);
>   ts = new LowerCaseFilter(ts);
>   if (fieldName.endsWith("/ngrams/start")) {
> ts = new EdgeNGramTokenFilter(ts, 
> EdgeNGramTokenFilter.Side.FRONT, minGram, maxGram);
>   } else if (fieldName.endsWith("/ngrams/inner")) {
> ts = new NGramTokenFilter(ts, minGram, maxGram);
>   } else if (fieldName.endsWith("/ngrams/end")) {
> ts = new EdgeNGramTokenFilter(ts, EdgeNGramTokenFilter.Side.BACK, 
> minGram, maxGram);
>   }
>   return ts;
> }
>   };
>   public Analyzer getAnalyzer() {
> return analyzer;
>   }
> };
> Directory dir = new RAMDirectory();
> new IndexWriter(dir, null, true).close();
> Instances instances = new Instances(dir, instanceFactory, "class");
> instances.addInstance("hello world", "en");
> instances.addInstance("hallå världen", "sv");
> instances.addInstance("this is london calling", "en");
> instances.addInstance("detta är london som ringer", "sv");
> instances.addInstance("john has a long mustache", "en");
> instances.addInstance("john har en lång mustache", "sv");
> instances.addInstance("all work and no play makes jack a dull boy", "en");
> instances.addInstance("att bara arbeta och aldrig leka gör jack en trist 
> gosse", "sv");
> instances.addInstance("shrimp sandwich", "en");
> instances.addInstance("räksmörgås", "sv");
> instances.addInstance("it's now or never", "en");
> instances.addInstance("det är nu eller aldrig", "sv");
> instances.addInstance("to tie up at a landing-stage", "en");
> instances.addInstance("att angöra en brygga", "sv");
> instances.addInstance("it's now time for the children's television 
> shows", "en");
> instances.addInstance("nu är det dags för barnprogram", "sv");
> instances.flush();
> testClassifier(instances, new NaiveBayesClassifier());
> testClassifier(instances, new FishersMethodClassifier());
> instances.close();
>   }
>   private void testClassifier(Instances instances, BayesianClassifier 
> classifier) throws IOException {
> assertEquals("sv", classifie

[jira] Created: (LUCENE-1515) Improved(?) Swedish snowball stemmer

2009-01-09 Thread Karl Wettin (JIRA)
Improved(?) Swedish snowball stemmer


 Key: LUCENE-1515
 URL: https://issues.apache.org/jira/browse/LUCENE-1515
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Affects Versions: 2.4
Reporter: Karl Wettin


Snowball stemmer for Swedish lacks support for '-an' and '-ans' related suffix 
stripping, ending up with incompatible stems, for example "klocka", "klockor", 
"klockornas", "klockAN", "klockANS".  Complete list of new suffix stripping 
rules:

{pre}
'an' 'anen' 'anens' 'anare' 'aner' 'anerna' 'anernas'
'ans' 'ansen' 'ansens' 'anser' 'ansera'  'anserar' 'anserna' 
'ansernas'
'iera'
(delete)
{pre}

The problem is all the exceptions (e.g. svans|svan, finans|fin, nyans|ny) and 
this is an attempt at solving that problem. The rules and exceptions are based 
on the [SAOL|http://en.wikipedia.org/wiki/Svenska_Akademiens_Ordlista] entries 
suffixed with 'an' and 'ans'. There are a few known problematic stemming rules 
but it seems to work quite a bit better than the current SwedishStemmer. It 
would not be a bad idea to check all of the SAOL entries in order to verify the 
integrity of the rules.

My Snowball syntax skills are rather limited so I'm certain the code could be 
optimized quite a bit.

*The code is released under BSD and not ASL*. I've been posting a bit in the 
Snowball forum and privately to Martin Porter himself but never got any 
response, so now I post it here instead in the hope of gaining some momentum.
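
The generated class in the attachment is used like any other SnowballProgram; a sketch:

{code:java}
SwedishStemmer stemmer = new SwedishStemmer(); // generated from the snowball code
stemmer.setCurrent("klockans");
stemmer.stem();
String stem = stemmer.getCurrent(); // expected "klock", compatible with "klocka"/"klockor"
{code}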




[jira] Closed: (LUCENE-1514) ShingleMatrixFilter easily throws StackOverFlow as the complexity of a matrix grows

2009-01-09 Thread Karl Wettin (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wettin closed LUCENE-1514.
---

   Resolution: Fixed
Lucene Fields: [New, Patch Available]  (was: [Patch Available, New])

Committed in revision 733064

> ShingleMatrixFilter easily throws StackOverFlow as the complexity of a matrix 
> grows
> --
>
> Key: LUCENE-1514
> URL: https://issues.apache.org/jira/browse/LUCENE-1514
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/analyzers
>Affects Versions: 2.4
>Reporter: Karl Wettin
>Assignee: Karl Wettin
> Fix For: 2.9
>
> Attachments: LUCENE-1514.txt
>
>
> ShingleMatrixFilter#next makes a recursive function invocation when the 
> current permutation iterator is exhausted or if the current state of the 
> permutation iterator already has produced an identical shingle. Even in a not 
> too complex matrix this will require a gigabyte sized stack per thread.
> My solution is to avoid the recursive invocation by refactoring like this:
> {code:java}
> public Token next(final Token reusableToken) throws IOException {
> assert reusableToken != null;
> if (matrix == null) {
>   matrix = new Matrix();
>   // fill matrix with maximumShingleSize columns
>   while (matrix.columns.size() < maximumShingleSize && readColumn()) {
> // this loop looks ugly
>   }
> }
> // this loop exists in order to avoid recursive calls to the next method
> // as the complexity of a large matrix
> // then would require a multi gigabyte sized stack.
> Token token;
> do {
>   token = produceNextToken(reusableToken);
> } while (token == request_next_token);
> return token;
>   }
>   
>   private static final Token request_next_token = new Token();
>   /**
>* This method exists in order to avoid recursive calls to the method
>* as the complexity of a fairly small matrix then easily would require
>* a gigabyte sized stack per thread.
>*
>* @param reusableToken
>* @return null if exhausted, instance request_next_token if one more call 
> is required for an answer, or instance parameter reusableToken.
>* @throws IOException
>*/
>   private Token produceNextToken(final Token reusableToken) throws 
> IOException {
> {code}




[jira] Updated: (LUCENE-1514) ShingleMatrixFilter easily throws StackOverFlow as the complexity of a matrix grows

2009-01-09 Thread Karl Wettin (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wettin updated LUCENE-1514:


Attachment: LUCENE-1514.txt

> ShingleMatrixFilter easily throws StackOverFlow as the complexity of a matrix 
> grows
> --
>
> Key: LUCENE-1514
> URL: https://issues.apache.org/jira/browse/LUCENE-1514
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/analyzers
>Affects Versions: 2.4
>Reporter: Karl Wettin
>Assignee: Karl Wettin
> Fix For: 2.9
>
> Attachments: LUCENE-1514.txt
>
>
> ShingleMatrixFilter#next makes a recursive function invocation when the 
> current permutation iterator is exhausted or if the current state of the 
> permutation iterator already has produced an identical shingle. Even in a not 
> too complex matrix this will require a gigabyte sized stack per thread.
> My solution is to avoid the recursive invocation by refactoring like this:
> {code:java}
> public Token next(final Token reusableToken) throws IOException {
> assert reusableToken != null;
> if (matrix == null) {
>   matrix = new Matrix();
>   // fill matrix with maximumShingleSize columns
>   while (matrix.columns.size() < maximumShingleSize && readColumn()) {
> // this loop looks ugly
>   }
> }
> // this loop exists in order to avoid recursive calls to the next method
> // as the complexity of a large matrix
> // then would require a multi gigabyte sized stack.
> Token token;
> do {
>   token = produceNextToken(reusableToken);
> } while (token == request_next_token);
> return token;
>   }
>   
>   private static final Token request_next_token = new Token();
>   /**
>* This method exists in order to avoid recursive calls to the method
>* as the complexity of a fairly small matrix then easily would require
>* a gigabyte sized stack per thread.
>*
>* @param reusableToken
>* @return null if exhausted, instance request_next_token if one more call 
> is required for an answer, or instance parameter reusableToken.
>* @throws IOException
>*/
>   private Token produceNextToken(final Token reusableToken) throws 
> IOException {
> {code}




[jira] Created: (LUCENE-1514) ShingleMatrixFilter easily throws StackOverFlow as the complexity of a matrix grows

2009-01-09 Thread Karl Wettin (JIRA)
ShingleMatrixFilter easily throws StackOverFlow as the complexity of a matrix 
grows
--

 Key: LUCENE-1514
 URL: https://issues.apache.org/jira/browse/LUCENE-1514
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/analyzers
Affects Versions: 2.4
Reporter: Karl Wettin
Assignee: Karl Wettin
 Fix For: 2.9
 Attachments: LUCENE-1514.txt

ShingleMatrixFilter#next makes a recursive function invocation when the current 
permutation iterator is exhausted or if the current state of the permutation 
iterator already has produced an identical shingle. Even in a not too complex 
matrix this will require a gigabyte sized stack per thread.

My solution is to avoid the recursive invocation by refactoring like this:

{code:java}
public Token next(final Token reusableToken) throws IOException {
assert reusableToken != null;
if (matrix == null) {
  matrix = new Matrix();
  // fill matrix with maximumShingleSize columns
  while (matrix.columns.size() < maximumShingleSize && readColumn()) {
// this loop looks ugly
  }
}

// this loop exists in order to avoid recursive calls to the next method
// as the complexity of a large matrix
// then would require a multi gigabyte sized stack.
Token token;
do {
  token = produceNextToken(reusableToken);
} while (token == request_next_token);
return token;
  }

  
  private static final Token request_next_token = new Token();

  /**
   * This method exists in order to avoid recursive calls to the method
   * as the complexity of a fairly small matrix then easily would require
   * a gigabyte sized stack per thread.
   *
   * @param reusableToken
   * @return null if exhausted, instance request_next_token if one more call is 
required for an answer, or instance parameter reusableToken.
   * @throws IOException
   */
  private Token produceNextToken(final Token reusableToken) throws IOException {

{code}






[jira] Closed: (LUCENE-1510) InstantiatedIndexReader throws NullPointerException in norms() when used with a MultiReader

2009-01-08 Thread Karl Wettin (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wettin closed LUCENE-1510.
---

   Resolution: Fixed
Fix Version/s: 2.9

> InstantiatedIndexReader throws NullPointerException in norms() when used with 
> a MultiReader
> ---
>
> Key: LUCENE-1510
> URL: https://issues.apache.org/jira/browse/LUCENE-1510
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/*
>Affects Versions: 2.4
>Reporter: Robert Newson
>Assignee: Karl Wettin
> Fix For: 2.9
>
> Attachments: TestWithMultiReader.java
>
>
> When using InstantiatedIndexReader under a MultiReader where the other Reader 
> contains documents, a NullPointerException is thrown here;
>  public void norms(String field, byte[] bytes, int offset) throws IOException 
> {
> byte[] norms = 
> getIndex().getNormsByFieldNameAndDocumentNumber().get(field);
> System.arraycopy(norms, 0, bytes, offset, norms.length);
>   }
> the 'norms' variable is null. Performing the copy only when norms is not null 
> does work, though I'm sure it's not the right fix.
> java.lang.NullPointerException
>   at 
> org.apache.lucene.store.instantiated.InstantiatedIndexReader.norms(InstantiatedIndexReader.java:297)
>   at org.apache.lucene.index.MultiReader.norms(MultiReader.java:273)
>   at 
> org.apache.lucene.search.TermQuery$TermWeight.scorer(TermQuery.java:70)
>   at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:131)
>   at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:112)
>   at org.apache.lucene.search.Searcher.search(Searcher.java:136)
>   at org.apache.lucene.search.Searcher.search(Searcher.java:146)
>   at 
> org.apache.lucene.store.instantiated.TestWithMultiReader.test(TestWithMultiReader.java:41)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>   at java.lang.reflect.Method.invoke(Method.java:597)
>   at junit.framework.TestCase.runTest(TestCase.java:164)
>   at junit.framework.TestCase.runBare(TestCase.java:130)
>   at junit.framework.TestResult$1.protect(TestResult.java:106)
>   at junit.framework.TestResult.runProtected(TestResult.java:124)
>   at junit.framework.TestResult.run(TestResult.java:109)
>   at junit.framework.TestCase.run(TestCase.java:120)
>   at junit.framework.TestSuite.runTest(TestSuite.java:230)
>   at junit.framework.TestSuite.run(TestSuite.java:225)
>   at 
> org.eclipse.jdt.internal.junit.runner.junit3.JUnit3TestReference.run(JUnit3TestReference.java:130)
>   at 
> org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
>   at 
> org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:460)
>   at 
> org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:673)
>   at 
> org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:386)
>   at 
> org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:196)




[jira] Commented: (LUCENE-1510) InstantiatedIndexReader throws NullPointerException in norms() when used with a MultiReader

2009-01-08 Thread Karl Wettin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661908#action_12661908
 ] 

Karl Wettin commented on LUCENE-1510:
-

Thanks for the report Robert!

I've committed a fix in revision 732661. Please check it out and let me know 
how it works for you. There were a few discrepancies between how 
InstantiatedIndexReader handled null norms compared to SegmentReader. I think 
these problems are fixed now.
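
The general shape of the fix is to fall back to fake norms the way SegmentReader does; roughly like this (a sketch, not the exact committed diff):

{code:java}
public void norms(String field, byte[] bytes, int offset) throws IOException {
  byte[] norms = getIndex().getNormsByFieldNameAndDocumentNumber().get(field);
  if (norms == null) {
    // no norms for this field: fake 1.0f norms, as SegmentReader does
    Arrays.fill(bytes, offset, offset + maxDoc(), Similarity.encodeNorm(1.0f));
    return;
  }
  System.arraycopy(norms, 0, bytes, offset, norms.length);
}
{code}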

 

> InstantiatedIndexReader throws NullPointerException in norms() when used with 
> a MultiReader
> ---
>
> Key: LUCENE-1510
> URL: https://issues.apache.org/jira/browse/LUCENE-1510
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/*
>Affects Versions: 2.4
>Reporter: Robert Newson
>Assignee: Karl Wettin
> Attachments: TestWithMultiReader.java
>
>
> When using InstantiatedIndexReader under a MultiReader where the other Reader 
> contains documents, a NullPointerException is thrown here;
>  public void norms(String field, byte[] bytes, int offset) throws IOException 
> {
> byte[] norms = 
> getIndex().getNormsByFieldNameAndDocumentNumber().get(field);
> System.arraycopy(norms, 0, bytes, offset, norms.length);
>   }
> the 'norms' variable is null. Performing the copy only when norms is not null 
> does work, though I'm sure it's not the right fix.
> java.lang.NullPointerException
>   at 
> org.apache.lucene.store.instantiated.InstantiatedIndexReader.norms(InstantiatedIndexReader.java:297)
>   at org.apache.lucene.index.MultiReader.norms(MultiReader.java:273)
>   at 
> org.apache.lucene.search.TermQuery$TermWeight.scorer(TermQuery.java:70)
>   at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:131)
>   at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:112)
>   at org.apache.lucene.search.Searcher.search(Searcher.java:136)
>   at org.apache.lucene.search.Searcher.search(Searcher.java:146)
>   at 
> org.apache.lucene.store.instantiated.TestWithMultiReader.test(TestWithMultiReader.java:41)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>   at java.lang.reflect.Method.invoke(Method.java:597)
>   at junit.framework.TestCase.runTest(TestCase.java:164)
>   at junit.framework.TestCase.runBare(TestCase.java:130)
>   at junit.framework.TestResult$1.protect(TestResult.java:106)
>   at junit.framework.TestResult.runProtected(TestResult.java:124)
>   at junit.framework.TestResult.run(TestResult.java:109)
>   at junit.framework.TestCase.run(TestCase.java:120)
>   at junit.framework.TestSuite.runTest(TestSuite.java:230)
>   at junit.framework.TestSuite.run(TestSuite.java:225)
>   at 
> org.eclipse.jdt.internal.junit.runner.junit3.JUnit3TestReference.run(JUnit3TestReference.java:130)
>   at 
> org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
>   at 
> org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:460)
>   at 
> org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:673)
>   at 
> org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:386)
>   at 
> org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:196)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Assigned: (LUCENE-1510) InstantiatedIndexReader throws NullPointerException in norms() when used with a MultiReader

2009-01-03 Thread Karl Wettin (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wettin reassigned LUCENE-1510:
---

Assignee: Karl Wettin

> InstantiatedIndexReader throws NullPointerException in norms() when used with 
> a MultiReader
> ---
>
> Key: LUCENE-1510
> URL: https://issues.apache.org/jira/browse/LUCENE-1510
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/*
>Affects Versions: 2.4
>Reporter: Robert Newson
>Assignee: Karl Wettin
> Attachments: TestWithMultiReader.java
>
>
> When using InstantiatedIndexReader under a MultiReader where the other Reader 
> contains documents, a NullPointerException is thrown here:
> public void norms(String field, byte[] bytes, int offset) throws IOException {
>   byte[] norms = getIndex().getNormsByFieldNameAndDocumentNumber().get(field);
>   System.arraycopy(norms, 0, bytes, offset, norms.length);
> }
> The 'norms' variable is null. Performing the copy only when norms is not null 
> does work, though I'm sure it's not the right fix.
> java.lang.NullPointerException
>   at 
> org.apache.lucene.store.instantiated.InstantiatedIndexReader.norms(InstantiatedIndexReader.java:297)
>   at org.apache.lucene.index.MultiReader.norms(MultiReader.java:273)
>   at 
> org.apache.lucene.search.TermQuery$TermWeight.scorer(TermQuery.java:70)
>   at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:131)
>   at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:112)
>   at org.apache.lucene.search.Searcher.search(Searcher.java:136)
>   at org.apache.lucene.search.Searcher.search(Searcher.java:146)
>   at 
> org.apache.lucene.store.instantiated.TestWithMultiReader.test(TestWithMultiReader.java:41)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>   at java.lang.reflect.Method.invoke(Method.java:597)
>   at junit.framework.TestCase.runTest(TestCase.java:164)
>   at junit.framework.TestCase.runBare(TestCase.java:130)
>   at junit.framework.TestResult$1.protect(TestResult.java:106)
>   at junit.framework.TestResult.runProtected(TestResult.java:124)
>   at junit.framework.TestResult.run(TestResult.java:109)
>   at junit.framework.TestCase.run(TestCase.java:120)
>   at junit.framework.TestSuite.runTest(TestSuite.java:230)
>   at junit.framework.TestSuite.run(TestSuite.java:225)
>   at 
> org.eclipse.jdt.internal.junit.runner.junit3.JUnit3TestReference.run(JUnit3TestReference.java:130)
>   at 
> org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
>   at 
> org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:460)
>   at 
> org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:673)
>   at 
> org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:386)
>   at 
> org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:196)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1501) Phonetic filters

2009-01-01 Thread Karl Wettin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12660196#action_12660196
 ] 

Karl Wettin commented on LUCENE-1501:
-

bq. Ryan McKinley - 30/Dec/08 10:36 AM
bq. FYI, solr includes phonetic filters also... perhaps we should consolidate?

Ah, yes I think we should. I'll take a look at how they differ.

> Phonetic filters
> 
>
> Key: LUCENE-1501
> URL: https://issues.apache.org/jira/browse/LUCENE-1501
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/*
>Reporter: Karl Wettin
>Assignee: Karl Wettin
>Priority: Minor
> Attachments: LUCENE-1501.txt
>
>
> Metaphone, double metaphone, soundex and refined soundex filters using 
> commons codec API.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1501) Phonetic filters

2008-12-28 Thread Karl Wettin (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wettin updated LUCENE-1501:


Attachment: LUCENE-1501.txt

This is in need of a bit of documentation about the different algorithms. It 
could also use some tests with alternative languages.


> Phonetic filters
> 
>
> Key: LUCENE-1501
> URL: https://issues.apache.org/jira/browse/LUCENE-1501
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/*
>    Reporter: Karl Wettin
>    Assignee: Karl Wettin
>Priority: Minor
> Attachments: LUCENE-1501.txt
>
>
> Metaphone, double metaphone, soundex and refined soundex filters using 
> commons codec API.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Created: (LUCENE-1501) Phonetic filters

2008-12-28 Thread Karl Wettin (JIRA)
Phonetic filters


 Key: LUCENE-1501
 URL: https://issues.apache.org/jira/browse/LUCENE-1501
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Reporter: Karl Wettin
Assignee: Karl Wettin
Priority: Minor


Metaphone, double metaphone, soundex and refined soundex filters using commons 
codec API.
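
The general shape of these filters against the 2.4 TokenStream API and 
commons-codec is roughly the following (an illustrative sketch only, not the 
attached patch; the class name is made up):

{code}
import java.io.IOException;

import org.apache.commons.codec.language.Metaphone;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class MetaphoneFilter extends TokenFilter {

  private final Metaphone metaphone = new Metaphone();

  public MetaphoneFilter(TokenStream input) {
    super(input);
  }

  public Token next(Token reusableToken) throws IOException {
    Token token = input.next(reusableToken);
    if (token == null) {
      return null;
    }
    // replace the term text with its phonetic encoding
    token.setTermBuffer(metaphone.encode(token.term()));
    return token;
  }
}
{code}

The double metaphone, soundex and refined soundex variants look the same; only 
the codec instance differs.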

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Closed: (LUCENE-1462) Instantiated/IndexWriter discrepanies

2008-12-12 Thread Karl Wettin (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wettin closed LUCENE-1462.
---

Resolution: Fixed

Committed in r726030 and r725837.

> Instantiated/IndexWriter discrepanies
> -
>
> Key: LUCENE-1462
> URL: https://issues.apache.org/jira/browse/LUCENE-1462
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/*
>Affects Versions: 2.4
>    Reporter: Karl Wettin
>Assignee: Karl Wettin
>Priority: Critical
> Fix For: 2.9
>
> Attachments: LUCENE-1462.txt
>
>
>  * RAMDirectory seems to do a reset on tokenStreams the first time, this 
> permits to initialise some objects before starting streaming, 
> InstantiatedIndex does not.
>  * I can Serialize a RAMDirectory but I cannot on a InstantiatedIndex because 
> of : java.io.NotSerializableException: 
> org.apache.lucene.index.TermVectorOffsetInfo
> http://www.nabble.com/InstatiatedIndex-questions-to20576722.html

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: SVN karma problem?

2008-12-12 Thread Karl Wettin
Everything worked great when I switched from svn.eu.apache.org to  
svn.apache.org. I suppose I should report that to someone. Infra?


12 dec 2008 kl. 00.13 skrev Grant Ingersoll:


http://www.nabble.com/Committing-new-files-to-(write-through-proxy)-slave-repo-fails---400-Bad-Request-td20083914.html

Any of that ring a bell?

On Dec 11, 2008, at 5:49 PM, Karl Wettin wrote:

I tried clean checkout, upgraded my SVN client and a bunch of other  
things. I could try to add and remove an alternative dummy file.


11 dec 2008 kl. 23.35 skrev Grant Ingersoll:


Does an svn cleanup help?  What about on a clean checkout?

On Dec 11, 2008, at 5:13 PM, Karl Wettin wrote:

I can't seem to commit new files in contrib, only update  
existing. Or am I misinterpreting the error?


svn: Commit failed (details follow):
svn: Server sent unexpected return value (400 Bad Request) in  
response to PROPFIND request for '/repos/asf/!svn/wrk/d81a2cce- 
e749-4cd0-a609-6e2a3763b81d/lucene/java/trunk/contrib/ 
instantiated/src/test/org/apache/lucene/store/instantiated/ 
TestSerialization.java'

svn: Your commit message was left in a temporary file:
svn:'/Users/kalle/projekt/apache/lucene/trunk/svn-commit.tmp'



  karl

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org




-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org




-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



--
Grant Ingersoll

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ











-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org




-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: SVN karma problem?

2008-12-11 Thread Karl Wettin
I tried clean checkout, upgraded my SVN client and a bunch of other  
things. I could try to add and remove an alternative dummy file.


11 dec 2008 kl. 23.35 skrev Grant Ingersoll:


Does an svn cleanup help?  What about on a clean checkout?

On Dec 11, 2008, at 5:13 PM, Karl Wettin wrote:

I can't seem to commit new files in contrib, only update existing.  
Or am I misinterpreting the error?


svn: Commit failed (details follow):
svn: Server sent unexpected return value (400 Bad Request) in  
response to PROPFIND request for '/repos/asf/!svn/wrk/d81a2cce- 
e749-4cd0-a609-6e2a3763b81d/lucene/java/trunk/contrib/instantiated/ 
src/test/org/apache/lucene/store/instantiated/TestSerialization.java'

svn: Your commit message was left in a temporary file:
svn:'/Users/kalle/projekt/apache/lucene/trunk/svn-commit.tmp'



karl

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org




-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org




-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



SVN karma problem?

2008-12-11 Thread Karl Wettin
I can't seem to commit new files in contrib, only update existing. Or  
am I misinterpreting the error?


svn: Commit failed (details follow):
svn: Server sent unexpected return value (400 Bad Request) in response  
to PROPFIND request for '/repos/asf/!svn/wrk/d81a2cce-e749-4cd0- 
a609-6e2a3763b81d/lucene/java/trunk/contrib/instantiated/src/test/org/ 
apache/lucene/store/instantiated/TestSerialization.java'

svn: Your commit message was left in a temporary file:
svn:'/Users/kalle/projekt/apache/lucene/trunk/svn-commit.tmp'



  karl

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1462) Instantiated/IndexWriter discrepanies

2008-12-01 Thread Karl Wettin (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wettin updated LUCENE-1462:


Fix Version/s: 2.9

> Instantiated/IndexWriter discrepanies
> -
>
> Key: LUCENE-1462
> URL: https://issues.apache.org/jira/browse/LUCENE-1462
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/*
>Affects Versions: 2.4
>    Reporter: Karl Wettin
>Assignee: Karl Wettin
>Priority: Critical
> Fix For: 2.9
>
> Attachments: LUCENE-1462.txt
>
>
>  * RAMDirectory seems to do a reset on tokenStreams the first time, this 
> permits to initialise some objects before starting streaming, 
> InstantiatedIndex does not.
>  * I can Serialize a RAMDirectory but I cannot on a InstantiatedIndex because 
> of : java.io.NotSerializableException: 
> org.apache.lucene.index.TermVectorOffsetInfo
> http://www.nabble.com/InstatiatedIndex-questions-to20576722.html

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-1462) Instantiated/IndexWriter discrepanies

2008-11-26 Thread Karl Wettin (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wettin updated LUCENE-1462:


Attachment: LUCENE-1462.txt

 * Made a few classes implement java.io.Serializable
 * TestCase that makes sure an InstantiatedIndex can be passed to an 
ObjectOutputStream
 * Added a tokenStream.reset() in InstantiatedIndexWriter 

I need help to get this committed as it contains a minor change to 
TermVectorOffsetInfo (implements Serializable) that's outside of the contrib 
module.
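
The new test boils down to something like this (a minimal sketch of the idea, 
not the attached patch verbatim):

{code}
public void testSerialization() throws Exception {
  InstantiatedIndex index = new InstantiatedIndex();
  // the whole in-memory index graph must serialize without
  // java.io.NotSerializableException
  ObjectOutputStream oos = new ObjectOutputStream(new ByteArrayOutputStream());
  oos.writeObject(index);
  oos.close();
}
{code}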




> Instantiated/IndexWriter discrepanies
> -
>
> Key: LUCENE-1462
> URL: https://issues.apache.org/jira/browse/LUCENE-1462
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/*
>Affects Versions: 2.4
>    Reporter: Karl Wettin
>Assignee: Karl Wettin
>Priority: Critical
> Attachments: LUCENE-1462.txt
>
>
>  * RAMDirectory seems to do a reset on tokenStreams the first time, this 
> permits to initialise some objects before starting streaming, 
> InstantiatedIndex does not.
>  * I can Serialize a RAMDirectory but I cannot on a InstantiatedIndex because 
> of : java.io.NotSerializableException: 
> org.apache.lucene.index.TermVectorOffsetInfo
> http://www.nabble.com/InstatiatedIndex-questions-to20576722.html

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



InstantiatedIndexWriter

2008-11-26 Thread Karl Wettin
I was just about to get on with LUCENE-1462 when I noticed the new  
TokenStream API. (Yeah, I've been really busy with other stuff for a  
while now.)


Rather than keeping InstantiatedIndexWriter in sync with IndexWriter  
I'm considering suggesting that we simply delete  
InstantiatedIndexWriter.


There is one major caveat that would go away if we removed 
InstantiatedIndexWriter: it lacks read/write locks at commit time. 
Also, the javadocs say "consider using II as an immutable store" all 
over the place.


I'm a bit split here: I can see the use of being able to add a few 
documents to an existing II, but at the same time these indices are 
meant to be really small, so creating a new one from an IndexReader is 
really no big deal. This operation means a few seconds of overhead if 
one needs to append data to the II.
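
For what it's worth, rebuilding one is roughly a one-liner (a sketch, assuming 
the IndexReader constructor of InstantiatedIndex):

IndexReader source = IndexReader.open(directory);
// copies the entire source index into memory
InstantiatedIndex ii = new InstantiatedIndex(source);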



I say that we should remove it from trunk. Less hassle. Or would this 
remove good functionality? I never use it; it was written in order to 
understand Lucene. But if people find it very useful then of course 
it should be kept in there.


That might be a problem for some people. For instance, I think Jason 
Rutherglen's realtime search uses this class.



 karl

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Created: (LUCENE-1462) Instantiated/IndexWriter discrepanies

2008-11-19 Thread Karl Wettin (JIRA)
Instantiated/IndexWriter discrepanies
-

 Key: LUCENE-1462
 URL: https://issues.apache.org/jira/browse/LUCENE-1462
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/*
Affects Versions: 2.4
Reporter: Karl Wettin
Assignee: Karl Wettin
Priority: Critical


 * RAMDirectory seems to do a reset on tokenStreams the first time, this 
permits to initialise some objects before starting streaming, InstantiatedIndex 
does not.
 * I can Serialize a RAMDirectory but I cannot on a InstantiatedIndex because 
of : java.io.NotSerializableException: 
org.apache.lucene.index.TermVectorOffsetInfo

http://www.nabble.com/InstatiatedIndex-questions-to20576722.html



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Closed: (LUCENE-1423) InstantiatedTermEnum#skipTo(Term) throws ArrayIndexOutOfBoundsException on empty index

2008-10-18 Thread Karl Wettin (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wettin closed LUCENE-1423.
---

Resolution: Fixed

committed in rev 705893

> InstantiatedTermEnum#skipTo(Term) throws ArrayIndexOutOfBoundsException on 
> empty index
> --
>
> Key: LUCENE-1423
> URL: https://issues.apache.org/jira/browse/LUCENE-1423
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/*
>Affects Versions: 2.4
>Reporter: Karl Wettin
>Assignee: Karl Wettin
> Fix For: 2.9
>
>
> {code}
> java.lang.ArrayIndexOutOfBoundsException: 0
>   at 
> org.apache.lucene.store.instantiated.InstantiatedTermEnum.skipTo(InstantiatedTermEnum.java:105)
>   at 
> org.apache.lucene.store.instantiated.TestEmptyIndex.termEnumTest(TestEmptyIndex.java:73)
>   at 
> org.apache.lucene.store.instantiated.TestEmptyIndex.testTermEnum(TestEmptyIndex.java:54)
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Created: (LUCENE-1423) InstantiatedTermEnum#skipTo(Term) throws ArrayIndexOutOfBoundsException on empty index

2008-10-18 Thread Karl Wettin (JIRA)
InstantiatedTermEnum#skipTo(Term) throws ArrayIndexOutOfBoundsException on 
empty index
--

 Key: LUCENE-1423
 URL: https://issues.apache.org/jira/browse/LUCENE-1423
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/*
Affects Versions: 2.4
Reporter: Karl Wettin
Assignee: Karl Wettin
 Fix For: 2.9


{code}
java.lang.ArrayIndexOutOfBoundsException: 0
at 
org.apache.lucene.store.instantiated.InstantiatedTermEnum.skipTo(InstantiatedTermEnum.java:105)
at 
org.apache.lucene.store.instantiated.TestEmptyIndex.termEnumTest(TestEmptyIndex.java:73)
at 
org.apache.lucene.store.instantiated.TestEmptyIndex.testTermEnum(TestEmptyIndex.java:54)
{code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Setting Fix Version in JIRA

2008-09-23 Thread Karl Wettin
I think it makes more sense to leave the fix version to committers, set when 
they assign themselves to the issue. I say this because of the hundreds 
of open and unreviewed issues that one would otherwise have to update in 
the tracker between each release.


23 sep 2008 kl. 21.33 skrev Otis Gospodnetic:


Hi,

When people add new issues to JIRA they most often don't set the  
"Fix Version" field.  Would it not be better to have a default value  
for that field, so that new entries don't get forgotten when we  
filter by "Fix Version" looking for issues to fix for the next  
release?  If every issue had "Fix Version" set we'd be able to  
schedule things better, give reporters and others more insight into  
when a particular item will be taken care of, etc.  When we are  
ready for the release we'd just bump all unresolved issues to the  
next planned version (e.g. Solr 1.3.1 or 1.4 or Lucene 2.4 or 2.9)



Thoughts?

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-1380) Patch for ShingleFilter.enablePositions

2008-09-22 Thread Karl Wettin (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wettin updated LUCENE-1380:


Lucene Fields: [New, Patch Available]  (was: [Patch Available, New])
 Assignee: (was: Karl Wettin)

I'm unassigning myself from this issue. It has a lot of votes, but I consider 
it a hack to add a change whose sole purpose is to alter the behavior of a 
query parser, and I don't think such a thing should be committed. I think the 
focus should be on the query parser, and I understand that is a lot more work 
than modifying the shingle filter. If you really want to make this change in 
this layer, I suggest that you separate this feature out into a new filter 
that modifies the position increment.
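
What I have in mind is something like this hypothetical sketch (untested, name 
made up, against the 2.4 TokenStream API):

{code}
public class FlattenPositionsFilter extends TokenFilter {

  private boolean first = true;

  public FlattenPositionsFilter(TokenStream input) {
    super(input);
  }

  public Token next(Token reusableToken) throws IOException {
    Token token = input.next(reusableToken);
    if (token == null) {
      return null;
    }
    // increment 0 stacks a token on the previous position,
    // i.e. makes it a synonym of the token before it
    token.setPositionIncrement(first ? 1 : 0);
    first = false;
    return token;
  }
}
{code}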

> Patch for ShingleFilter.enablePositions
> ---
>
> Key: LUCENE-1380
> URL: https://issues.apache.org/jira/browse/LUCENE-1380
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/analyzers
>Reporter: Mck SembWever
>Priority: Trivial
> Attachments: LUCENE-1380.patch, LUCENE-1380.patch
>
>
> Make it possible for *all* words and shingles to be placed at the same 
> position, that is for _all_ shingles (and unigrams if included) to be treated 
> as synonyms of each other.
> Today the shingles generated are synonyms only to the first term in the 
> shingle.
> For example the query "abcd efgh ijkl" results in:
>("abcd" "abcd efgh" "abcd efgh ijkl") ("efgh" efgh ijkl") ("ijkl")
> where "abcd efgh" and "abcd efgh ijkl" are synonyms of "abcd", and "efgh 
> ijkl" is a synonym of "efgh".
> There exists no way today to alter which token a particular shingle is a 
> synonym for.
> This patch takes the first step in making it possible to make all shingles 
> (and unigrams if included) synonyms of each other.
> See http://comments.gmane.org/gmane.comp.jakarta.lucene.user/34746 for 
> mailing list thread.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1387) Add LocalLucene

2008-09-21 Thread Karl Wettin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12633102#action_12633102
 ] 

Karl Wettin commented on LUCENE-1387:
-

bq. I'm struggling to get two of the existing tests to pass... I don't think it 
is from my modifications since they don't pass on the original either.

On my box the test fails with different results due to the writer not being 
committed in setUp, giving me 0 results. After adding a commit it fails with 
the results you are reporting here.
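
Concretely, I mean adding something like this at the end of setUp 
(illustrative; the names are from memory):

{code}
writer.commit(); // without this the reader opened by the test sees 0 documents
{code}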

Is it possible that you are getting one sort of result in the original due to 
the non-committed writer, and another error in this version due to your changes 
to the distance measurement? All points in the list are rather close to each 
other, so very small changes to the algorithm might be the problem.

I have a hard time tracing the code and I'm sort of hoping this might be the 
problem.

> Add LocalLucene
> ---
>
> Key: LUCENE-1387
> URL: https://issues.apache.org/jira/browse/LUCENE-1387
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/*
>Reporter: Grant Ingersoll
>Priority: Minor
> Attachments: spatial.zip
>
>
> Local Lucene (Geo-search) has been donated to the Lucene project, per 
> https://issues.apache.org/jira/browse/INCUBATOR-77.  This issue is to handle 
> the Lucene portion of integration.
> See http://lucene.markmail.org/message/orzro22sqdj3wows?q=LocalLucene

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: 2.4 release candidate 1

2008-09-19 Thread Karl Wettin

403 access denied :(

Index: package.html
===
--- package.html(revision 697120)
+++ package.html(working copy)
@@ -56,6 +56,8 @@

 Mileage may vary depending on term saturation.

+
+
 
   Populated with a single document InstantiatedIndex is almost, but  
not quite, as fast as MemoryIndex.

 
Index: doc-files/HitCollectionBench.jpg



19 sep 2008 kl. 16.42 skrev Michael McCandless:



I agree it makes sense to get this into 2.4.

Yes I'll roll an RC2 soon, with all the little fixes pending on 2.4:

   
https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&mode=hide&sorter/order=DESC&sorter/field=priority&resolution=-1&pid=12310110&fixfor=12312681

I'm not certain but I would assume you have the karma to commit to  
contrib on the 2.4 branch.  Try it out and see?  Make sure you  
commit to trunk too.


Mike

Karl Wettin wrote:


There is going to be an rc2, right?

A couple of people have asked me questions about the performance of 
InstantiatedIndex (via private mail and on the freenode #lucene 
channel). They have tried to use it as a replacement for 
RAMDirectory on rather large corpora. There is a graph in the 
JIRA issue that clearly shows this is not always a good idea, and I 
think it would be a good thing to include this graph in the package 
javadocs.


http://issues.apache.org/jira/secure/attachment/12353601/HitCollectionBench.jpg

Is there still time to get that in there? As this will be the first  
release containing InstantiatedIndex I'd say it makes a lot of  
sense to pop it in.


Do I have karma to modify the branch? Binary files and patches do 
not compute according to svn diff.


 karl

18 sep 2008 kl. 20.29 skrev Michael McCandless:



Hi,

I just created the first release candidate for 2.4, here:

http://people.apache.org/~mikemccand/staging-area/lucene2.4rc1

Please download the release candidate, kick the tires and report  
back

on any issues you encounter.

The plan is to make only serious bug fixes or build/doc fixes, to
2.4 for ~10 days, after which if there are no blockers I'll call a
vote for the actual release.

Happy testing, and thanks!

Mike


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


