[jira] Commented: (LUCENE-826) Language detector
[ https://issues.apache.org/jira/browse/LUCENE-826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12805027#action_12805027 ] Karl Wettin commented on LUCENE-826: Hi Ken, it's hard for me to compare. I'll rant a bit about my experience with language detection, though. I still haven't found one strategy that works well on any text: a user query, a sentence, a paragraph or a complete document. 1-5 grams using SVM or NB work pretty well for them all, but you really need to train with the same sort of data you want to classify. Even when training with a mix of text lengths, it tends to perform a lot worse than having one classifier for each data type. And you still probably want to twiddle the classifier knobs to make it work well with the data you are classifying and training with. In some cases I've used 1-10 grams and other times I've used 2-4 grams. Sometimes I've used SVM and other times a simple decision tree. To sum it up: to achieve good quality I've always had to build a classifier for the specific use case. Weka has a great test suite for figuring out what to use. Set it up, press play and return one week later to find out what to use. > Language detector > - > > Key: LUCENE-826 > URL: https://issues.apache.org/jira/browse/LUCENE-826 > Project: Lucene - Java > Issue Type: New Feature >Reporter: Karl Wettin >Assignee: Karl Wettin > Attachments: ld.tar.gz, ld.tar.gz > > > A formula 1A token/ngram-based language detector. Requires a paragraph of > text to avoid false positive classifications. > Depends on contrib/analyzers/ngrams for tokenization, Weka for classification > (logistic support vector models), feature selection and normalization of token > frequencies. Optionally Wikipedia and NekoHTML for training data harvesting. 
> Initialized like this: > {code} > LanguageRoot root = new LanguageRoot(new > File("documentClassifier/language root")); > root.addBranch("uralic"); > root.addBranch("fino-ugric", "uralic"); > root.addBranch("ugric", "uralic"); > root.addLanguage("fino-ugric", "fin", "finnish", "fi", "Suomi"); > root.addBranch("proto-indo european"); > root.addBranch("germanic", "proto-indo european"); > root.addBranch("northern germanic", "germanic"); > root.addLanguage("northern germanic", "dan", "danish", "da", "Danmark"); > root.addLanguage("northern germanic", "nor", "norwegian", "no", "Norge"); > root.addLanguage("northern germanic", "swe", "swedish", "sv", "Sverige"); > root.addBranch("west germanic", "germanic"); > root.addLanguage("west germanic", "eng", "english", "en", "UK"); > root.mkdirs(); > LanguageClassifier classifier = new LanguageClassifier(root); > if (!new File(root.getDataPath(), "trainingData.arff").exists()) { > classifier.compileTrainingData(); // from wikipedia > } > classifier.buildClassifier(); > {code} > The training set built from Wikipedia consists of the pages describing the home country of > each registered language, in the language to train. The above example passes this > test: > (testEquals is the same as assertEquals, just not required. Only one of them > fails, see comment.) 
> {code} > assertEquals("swe", classifier.classify(sweden_in_swedish).getISO()); > testEquals("swe", classifier.classify(norway_in_swedish).getISO()); > testEquals("swe", classifier.classify(denmark_in_swedish).getISO()); > testEquals("swe", classifier.classify(finland_in_swedish).getISO()); > testEquals("swe", classifier.classify(uk_in_swedish).getISO()); > testEquals("nor", classifier.classify(sweden_in_norwegian).getISO()); > assertEquals("nor", classifier.classify(norway_in_norwegian).getISO()); > testEquals("nor", classifier.classify(denmark_in_norwegian).getISO()); > testEquals("nor", classifier.classify(finland_in_norwegian).getISO()); > testEquals("nor", classifier.classify(uk_in_norwegian).getISO()); > testEquals("fin", classifier.classify(sweden_in_finnish).getISO()); > testEquals("fin", classifier.classify(norway_i
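The attached classifier itself isn't shown here, but the 1-5 gram idea discussed in the comment can be illustrated with a small, hypothetical sketch. It uses simple cosine similarity over character n-gram counts instead of the Weka SVM/NB classifiers the issue actually depends on; all names below are made up for illustration:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not the attached ld.tar.gz code: classify text by
// cosine similarity between character 1-3 gram count profiles.
public class NgramLanguageGuesser {
    private final Map<String, Map<String, Integer>> profiles = new HashMap<>();

    // Count character n-grams of length minN..maxN, padding with spaces.
    static Map<String, Integer> ngrams(String text, int minN, int maxN) {
        Map<String, Integer> counts = new HashMap<>();
        String s = " " + text.toLowerCase() + " ";
        for (int n = minN; n <= maxN; n++) {
            for (int i = 0; i + n <= s.length(); i++) {
                counts.merge(s.substring(i, i + n), 1, Integer::sum);
            }
        }
        return counts;
    }

    public void train(String language, String sample) {
        profiles.put(language, ngrams(sample, 1, 3));
    }

    // Return the language whose profile has the highest cosine similarity.
    public String classify(String text) {
        Map<String, Integer> q = ngrams(text, 1, 3);
        String best = null;
        double bestScore = -1;
        for (Map.Entry<String, Map<String, Integer>> e : profiles.entrySet()) {
            double dot = 0, qn = 0, pn = 0;
            for (int v : q.values()) qn += (double) v * v;
            for (int v : e.getValue().values()) pn += (double) v * v;
            for (Map.Entry<String, Integer> g : q.entrySet()) {
                Integer p = e.getValue().get(g.getKey());
                if (p != null) dot += (double) g.getValue() * p;
            }
            double score = dot / (Math.sqrt(qn) * Math.sqrt(pn));
            if (score > bestScore) { bestScore = score; best = e.getKey(); }
        }
        return best;
    }
}
```

As the comment stresses, a toy like this only works when trained on the same kind of text it is asked to classify.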
[jira] Commented: (LUCENE-626) Extended spell checker with phrase support and adaptive user session analysis.
[ https://issues.apache.org/jira/browse/LUCENE-626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12805021#action_12805021 ] Karl Wettin commented on LUCENE-626: Hi Mikkel, the test case data set is on an HDD hidden away in an attic 600 km away from me, but I've asked someone in the vicinity to fetch it for me. It might take a little while. Sorry! It's extremely cool, however, that you're working with this old beast! I'm super busy as always, but I promise to follow your progress in case there is something you wonder about. It's been a few years since I looked at the code though. > Extended spell checker with phrase support and adaptive user session analysis. > -- > > Key: LUCENE-626 > URL: https://issues.apache.org/jira/browse/LUCENE-626 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Reporter: Karl Wettin >Priority: Minor > Attachments: LUCENE-626_20071023.txt > > > Extensive javadocs available in the patch, but I also try to keep them compiled > here: > http://ginandtonique.org/~kalle/javadocs/didyoumean/org/apache/lucene/search/didyoumean/package-summary.html#package_description > A semi-retarded reinforcement learning thingy backed by algorithmic second > level suggestion schemes that learns from and adapts to user behavior as > queries change, suggestions are accepted or declined, etc. > Except for detecting spelling errors it considers context, > composition/decomposition and a few other things. > heroes of light and magik -> heroes of might and magic > vinci da code -> da vinci code > java docs -> javadocs > blacksabbath -> black sabbath > Depends on LUCENE-550 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: [jira] Commented: (LUCENE-2194) improve efficiency of snowballfilter
+1 On 7 Jan 2010 at 19:50, Robert Muir (JIRA) wrote: [ https://issues.apache.org/jira/browse/LUCENE-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12797736#action_12797736 ] Robert Muir commented on LUCENE-2194: - I tested this with some English text, the benchmark pkg, etc., and at most it seems to improve processing speed by 10%. But I think it's worth the trouble since it's an easy improvement. I'll commit in a few days if no one objects. improve efficiency of snowballfilter Key: LUCENE-2194 URL: https://issues.apache.org/jira/browse/LUCENE-2194 Project: Lucene - Java Issue Type: Improvement Components: contrib/analyzers Reporter: Robert Muir Assignee: Robert Muir Priority: Minor Fix For: 3.1 Attachments: LUCENE-2194.patch snowball stemming currently creates 2 new strings and 1 new StringBuilder for every word. All of this is unnecessary, so don't do it.
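The per-word allocation pattern the issue describes, versus reusing the existing term buffer, can be sketched as follows. The method names and the lowercase step are purely illustrative, not the actual SnowballFilter code:

```java
// Hypothetical illustration of the allocation point raised in LUCENE-2194.
public class BufferReuse {
    // Allocation-heavy style: a new String and StringBuilder per token.
    static char[] stemViaStrings(char[] term, int len) {
        String s = new String(term, 0, len);          // copy #1
        StringBuilder sb = new StringBuilder(s.toLowerCase()); // copy #2
        return sb.toString().toCharArray();           // copy #3
    }

    // Reuse style: rewrite the existing buffer in place, no garbage.
    static int stemInPlace(char[] term, int len) {
        for (int i = 0; i < len; i++) {
            term[i] = Character.toLowerCase(term[i]);
        }
        return len; // the new length; a real stemmer may shorten it
    }
}
```

Both styles produce the same characters; the second avoids three short-lived objects per token, which is where the ~10% gain would come from.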
[jira] Commented: (LUCENE-1515) Improved(?) Swedish snowball stemmer
[ https://issues.apache.org/jira/browse/LUCENE-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795968#action_12795968 ] Karl Wettin commented on LUCENE-1515: - I just posted this to the Snowball users list: The Swedish Snowball stemmer does a terrible job according to <http://web.jhu.edu/bin/q/b/p75-mcnamee.pdf>. It even claims that lfs5, i.e. substring(0,5), does a better job. (It also says that 5-grams crack the nut.) This didn't come as a surprise to me, as I've identified problems in the past and implemented my own augmentation that's been posted to this list before, now living at <http://issues.apache.org/jira/browse/LUCENE-1515>. Reading the paper made me take a closer look at what's wrong. define main_suffix as ( setlimit tomark p1 for ([substring]) among( 'a' 'arna' 'erna' 'heterna' 'orna' 'ad' 'e' 'ade' 'ande' 'arne' 'are' 'aste' 'en' 'anden' 'aren' 'heten' 'ern' 'ar' 'er' 'heter' 'or' 'as' 'arnas' 'ernas' 'ornas' 'es' 'ades' 'andes' 'ens' 'arens' 'hetens' 'erns' 'at' 'andet' 'het' 'ast' 'era' 'erar' 'erarna' 'erarnas' // augmentation starts here 'an' 'anen' 'anens' 'anare' 'aner' 'anerna' 'anernas' 'ans' 'ansen' 'ansens' 'anser' 'ansera' 'anserar' 'anserna' 'ansernas' 'iera' 'ierat' 'ierats' 'ierad' 'ierade' 'ierades' 'ikation' 'ikat' 'ikatet' 'ikatets' 'ikaten' 'ikatens' // augmentation ends here (delete) 's' (s_ending delete) In conjunction with ~200 exception rules these additions help. There are, however, quite a few problems with many of the old rules. E.g. 's' (s_ending delete) is a plural rule, but it has ~5300 exceptions where a word ending in s is nominative singular. The problem appears when such a word is written in a form other than the nominative case. 
kurs (course) kursen (the course) kursens (the [undefined noun] of the course) kurser (courses) kurserna (the courses) kursernas (the [undefined noun] of the courses) Kurs is stemmed to "kur" (which, by the way, will mismatch with kur as in remedy) while all the others are correctly stemmed to "kurs". All together there are, by my estimation, some 10 000 words that will create incompatible stems between the nominative singular and any other form. That is about 8% of the official language. One rather simple solution is to always use both unstemmed and stemmed words, e.g. as synonyms in an inverted index. But if only the stemmed output is used (from the official stemmer or my augmentation) I'd argue it's better to skip stemming altogether. A better solution would be to set up the stemmer to ignore the 10 000 exceptions. What would be the best way to implement this? I'd like the generated Java code to simply contain a HashSet noStemExceptions; that is checked first, or something like that. > Improved(?) Swedish snowball stemmer > > > Key: LUCENE-1515 > URL: https://issues.apache.org/jira/browse/LUCENE-1515 > Project: Lucene - Java > Issue Type: New Feature > Components: contrib/* >Affects Versions: 2.4 >Reporter: Karl Wettin > Attachments: LUCENE-1515.txt > > > Snowball stemmer for Swedish lacks support for '-an' and '-ans' related > suffix stripping, ending up with incompatible stems, for example "klocka", > "klockor", "klockornas", "klockAN", "klockANS". Complete list of new suffix > stripping rules: > {pre} > 'an' 'anen' 'anens' 'anare' 'aner' 'anerna' 'anernas' > 'ans' 'ansen' 'ansens' 'anser' 'ansera' 'anserar' 'anserna' > 'ansernas' > 'iera' > (delete) > {pre} > Th
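The "HashSet checked first" idea proposed above could look roughly like this. The wrapper class and the stand-in stem() are hypothetical, not the Snowball-generated Java code:

```java
import java.util.Set;

// Hypothetical sketch of the proposed noStemExceptions mechanism: words in
// the exception set are returned unstemmed; everything else is delegated.
public class ExceptionAwareStemmer {
    private final Set<String> noStemExceptions;

    public ExceptionAwareStemmer(Set<String> exceptions) {
        this.noStemExceptions = exceptions;
    }

    // Stand-in for the generated Snowball stem() method: here it only
    // strips a final 's', mimicking the 's' (s_ending delete) rule.
    private String snowballStem(String word) {
        if (word.endsWith("s")) return word.substring(0, word.length() - 1);
        return word;
    }

    public String stem(String word) {
        // Check the exception list first, as proposed in the comment,
        // so nominative-singular words like "kurs" keep their final s.
        if (noStemExceptions.contains(word)) return word;
        return snowballStem(word);
    }
}
```

With "kurs" in the set, all the forms above would share the stem "kurs" instead of the singular collapsing to "kur".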
[jira] Commented: (LUCENE-1515) Improved(?) Swedish snowball stemmer
[ https://issues.apache.org/jira/browse/LUCENE-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795967#action_12795967 ] Karl Wettin commented on LUCENE-1515: - I've added a few more rules. I'll have to add a few more tests etc before I post a new patch. {code} define main_suffix as ( setlimit tomark p1 for ([substring]) among( 'a' 'arna' 'erna' 'heterna' 'orna' 'ad' 'e' 'ade' 'ande' 'arne' 'are' 'aste' 'en' 'anden' 'aren' 'heten' 'ern' 'ar' 'er' 'heter' 'or' 'as' 'arnas' 'ernas' 'ornas' 'es' 'ades' 'andes' 'ens' 'arens' 'hetens' 'erns' 'at' 'andet' 'het' 'ast' 'era' 'erar' 'erarna' 'erarnas' // augmentation starts here 'an' 'anen' 'anens' 'anare' 'aner' 'anerna' 'anernas' 'ans' 'ansen' 'ansens' 'anser' 'ansera' 'anserar' 'anserna' 'ansernas' 'iera' 'ierat' 'ierats' 'ierad' 'ierade' 'ierades' 'ikation' 'ikat' 'ikatet' 'ikatets' 'ikaten' 'ikatens' // augmentation ends here (delete) 's' (s_ending delete) {code} > Improved(?) Swedish snowball stemmer > > > Key: LUCENE-1515 > URL: https://issues.apache.org/jira/browse/LUCENE-1515 > Project: Lucene - Java > Issue Type: New Feature > Components: contrib/* >Affects Versions: 2.4 >Reporter: Karl Wettin > Attachments: LUCENE-1515.txt > > > Snowball stemmer for Swedish lacks support for '-an' and '-ans' related > suffix stripping, ending up with non compatible stems for example "klocka", > "klockor", "klockornas", "klockAN", "klockANS". Complete list of new suffix > stripping rules: > {pre} > 'an' 'anen' 'anens' 'anare' 'aner' 'anerna' 'anernas' > 'ans' 'ansen' 'ansens' 'anser' 'ansera' 'anserar' 'anserna' > 'ansernas' > 'iera' > (delete) > {pre} > The problem is all the exceptions (e.g. svans|svan, finans|fin, nyans|ny) and > this is an attempt at solving that problem. The rules and exceptions are > based on the [SAOL|http://en.wikipedia.org/wiki/Svenska_Akademiens_Ordlista] > entries suffixed with 'an' and 'ans'. 
There are a few known problematic stemming > rules, but it seems to work quite a bit better than the current SwedishStemmer. > It would not be a bad idea to check all SAOL entries in order to verify > the integrity of the rules. > My Snowball syntax skills are rather limited, so I'm certain the code could be > optimized quite a bit. > *The code is released under BSD and not ASL*. I've been posting a bit in the > Snowball forum and privately to Martin Porter himself but never got any > response, so now I post it here instead in the hope of some momentum.
Re: LUCENE-1515
On 1 Jan 2010 at 14:28, Grant Ingersoll wrote: Please, no Swedish2 or any variant like that. How about something that lets users know what it is and why they should use it? In my view Swedish2 is a better name than MoreSupportForGenitiveCaseSuffixesThanSwedishStemmer. Such a name can turn out pretty far-fetched if someone adds more rules to it in the future. Perhaps AugmentedSwedishStemmer? karl
Re: LUCENE-1515
I'm actually not sure I understand the question. Afaik backwards compatibility with the current SwedishStemmer could only be achieved by stemming with both classes and making the differing outputs synonyms. I just did a bit of testing, and the problems I've identified in 1515 are also present in SwedishStemmer. Not that surprising, as 1515 is an augmentation of SwedishStemmer... Personally I would not mind deprecating SwedishStemmer (renaming it to OldSwedish or something) and later on replacing it with 1515, but that might mess with some people who don't read the README and just upgrade the jar while running on the same old index. On 31 Dec 2009 at 21:55, Simon Willnauer wrote: Is there any chance to get the best of both worlds? Could we merge both together and preserve bw compat with version? Introducing another stemmer doing almost the same thing as an already existing one is exactly what we try to prevent right now. I don't doubt that this issue is an improvement; I'm just thinking of a way to keep code duplication as low as possible. I haven't looked at the code yet, so if my questions are complete nonsense let me know. simon On Thu, Dec 31, 2009 at 6:05 PM, Karl Wettin wrote: On 31 Dec 2009 at 17:43, Simon Willnauer wrote: what is the essential difference between the existing and the LUCENE-1515 stemmer? 1515 handles genitive case suffixes better. An example: klocka (a clock) klockan (the clock) klockans (the [insert noun] of the clock) klockornas (the [insert noun] of the clocks) Using snowball SwedishStemmer: klocka -> klock klockan -> klock klockans -> klockans klockornas -> klockornas karl
Re: LUCENE-1515
On 31 Dec 2009 at 17:43, Simon Willnauer wrote: what is the essential difference between the existing and the LUCENE-1515 stemmer? 1515 handles genitive case suffixes better. An example: klocka (a clock) klockan (the clock) klockans (the [insert noun] of the clock) klockornas (the [insert noun] of the clocks) Using snowball SwedishStemmer: klocka -> klock klockan -> klock klockans -> klockans klockornas -> klockornas karl
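As a toy illustration of why the added '-an'/'-ans' rules fix the klockan/klockans cases: Snowball picks the longest matching suffix from the among(...) list, so with the augmented entries present all four forms collapse to the same stem. The sketch below is hypothetical, uses only a small excerpt of the rule list, and ignores Snowball's p1 region restriction:

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical longest-match suffix stripper, not the Snowball-generated
// code; SUFFIXES is a small excerpt of the augmented rule list.
public class SuffixStripper {
    private static final List<String> SUFFIXES = Arrays.asList(
        "ornas", "erna", "orna", "ans", "an", "ar", "or", "a", "s");

    public static String stem(String word) {
        return SUFFIXES.stream()
            // longest first, so 'ans' wins over 's' for "klockans"
            .sorted(Comparator.comparingInt(String::length).reversed())
            .filter(word::endsWith)
            .findFirst()
            .map(suf -> word.substring(0, word.length() - suf.length()))
            .orElse(word);
    }
}
```

Under this toy rule set, klocka, klockan, klockans and klockornas all stem to "klock", which is the compatibility the augmentation is after.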
LUCENE-1515
1515 is an alternative Swedish stemmer that handles a couple of things unsupported by the original stemmer. A few things are handled worse, but altogether I think it's a better algorithm. I've used it in two commercial applications. I'd like to commit it. Even though I've done my best to make them notice it, the snowball community never commented on it. Perhaps I should attempt once again before pushing it to Lucene. The code is, like the rest of the snowball contrib package, BSD. That shouldn't cause any problems, right? What should I call this stemmer? Swedish2? SwedishToo? Svenska? :) http://issues.apache.org/jira/browse/LUCENE-1515 karl
[jira] Commented: (LUCENE-2144) InstantiatedIndexReader does not handle #termDocs(null) correct (AllTermDocs)
[ https://issues.apache.org/jira/browse/LUCENE-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789248#action_12789248 ] Karl Wettin commented on LUCENE-2144: - I don't have any strong feelings about this line of code, but let me at least explain it. I like the idea that IIFoo behaves the same way as a SegmentFoo, even during incorrect/undocumented use of the API. There are no real use cases for this in the Lucene distribution; there are, however, effects people might rely on even though they are caused by invalid use of the API and not recommended. E.g. a skipTo to a target greater than the greatest document associated with that term will position the enum at the greatest document number for that term. Even though I wouldn't do something like this, others might. In this case, where an immediate #next() on IR#termDocs() is called, it might look silly to compare the behaviour of II and Segment since it's such blatantly erroneous use of the API, but even I have been known to come up with some rather strange solutions now and then when nobody else is looking. One alternative is that #next would throw an IllegalStateException or something instead of just accepting the call, but then there is of course the small extra cost of checking whether the enum has been positioned yet, and #next is a rather commonly used method. > InstantiatedIndexReader does not handle #termDocs(null) correct (AllTermDocs) > - > > Key: LUCENE-2144 > URL: https://issues.apache.org/jira/browse/LUCENE-2144 > Project: Lucene - Java > Issue Type: Bug > Components: contrib/* >Affects Versions: 2.9, 2.9.1, 3.0 >Reporter: Karl Wettin >Assignee: Michael McCandless >Priority: Critical > Attachments: LUCENE-2144-30.patch, LUCENE-2144.txt > > > This patch contains core changes so someone else needs to commit it. > Due to the incompatible #termDocs(null) behaviour at least MatchAllDocsQuery, > FieldCacheRangeFilter and ValueSourceQuery fail using II since 2.9. > AllTermDocs now has a superclass, AbstractAllTermDocs, that > InstantiatedAllTermDocs also extends. > Also: > * II-tests made less plausible to pass on future incompatible changes to > TermDocs and TermEnum > * IITermDocs#skipTo and #next mimic the behaviour of document positioning > from SegmentTermDocs#dito when returning false > * II now uses BitVector rather than sets for deleted documents
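The behaviour the patch centralises in a shared superclass can be sketched generically: an enumerator over every document id that skips deleted ones, so both reader implementations answer #termDocs(null) the same way. This is a hypothetical illustration, not Lucene's actual AbstractAllTermDocs:

```java
import java.util.BitSet;

// Hedged sketch of an "all documents" enumerator: iterate every doc id
// from 0 to maxDoc, skipping deleted ones. Sharing logic like this is
// what keeps two TermDocs implementations behaviourally identical.
public class AllDocsEnum {
    private final int maxDoc;
    private final BitSet deleted;
    private int doc = -1;

    public AllDocsEnum(int maxDoc, BitSet deleted) {
        this.maxDoc = maxDoc;
        this.deleted = deleted;
    }

    // Advance to the next non-deleted document.
    public boolean next() {
        return skipTo(doc + 1);
    }

    // Position at the first non-deleted document >= target.
    public boolean skipTo(int target) {
        doc = target;
        while (doc < maxDoc) {
            if (!deleted.get(doc)) return true;
            doc++;
        }
        return false;
    }

    public int doc() { return doc; }
}
```

Note how the final document position after a failed next()/skipTo is deterministic; that is exactly the kind of edge behaviour the comment argues both implementations should share.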
[jira] Commented: (LUCENE-2144) InstantiatedIndexReader does not handle #termDocs(null) correct (AllTermDocs)
[ https://issues.apache.org/jira/browse/LUCENE-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789021#action_12789021 ] Karl Wettin commented on LUCENE-2144: - Committed change to trunk. In 3.0 comment out ~line 227 in TestIndicesEquals // this is invalid use of the API, // but if the response differs then it's an indication that something might have changed. // in 2.9 and 3.0 the two TermDocs-implementations returned different values at this point. assertEquals("Descripency during invalid use of the TermDocs API, see comments in test code for details.", aprioriTermDocs.next(), testTermDocs.next());
[jira] Commented: (LUCENE-2144) InstantiatedIndexReader does not handle #termDocs(null) correct (AllTermDocs)
[ https://issues.apache.org/jira/browse/LUCENE-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788966#action_12788966 ] Karl Wettin commented on LUCENE-2144: - bq. at org.apache.lucene.store.instantiated.TestIndicesEquals.testTermDocsSomeMore(TestIndicesEquals.java:226) I have no idea. How do I merge back locally so I can debug it?
[jira] Commented: (LUCENE-2144) InstantiatedIndexReader does not handle #termDocs(null) correct (AllTermDocs)
[ https://issues.apache.org/jira/browse/LUCENE-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788950#action_12788950 ] Karl Wettin commented on LUCENE-2144: - bq. We should fix this on at least 3.0 as well right? Would be great if you had the bandwidth to fix that.
[jira] Updated: (LUCENE-2144) InstantiatedIndexReader does not handle #termDocs(null) correct (AllTermDocs)
[ https://issues.apache.org/jira/browse/LUCENE-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wettin updated LUCENE-2144: Attachment: LUCENE-2144.txt BUILD SUCCESSFUL Total time: 36 minutes 4 seconds
[jira] Created: (LUCENE-2144) InstantiatedIndexReader does not handle #termDocs(null) correct (AllTermDocs)
InstantiatedIndexReader does not handle #termDocs(null) correct (AllTermDocs) - Key: LUCENE-2144 URL: https://issues.apache.org/jira/browse/LUCENE-2144 Project: Lucene - Java Issue Type: Bug Components: contrib/* Affects Versions: 3.0, 2.9.1, 2.9 Reporter: Karl Wettin Priority: Critical This patch contains core changes so someone else needs to commit it. Due to the incompatible #termDocs(null) behaviour at least MatchAllDocsQuery, FieldCacheRangeFilter and ValueSourceQuery fail using II since 2.9. AllTermDocs now has a superclass, AbstractAllTermDocs, that InstantiatedAllTermDocs also extends. Also: * II-tests made less plausible to pass on future incompatible changes to TermDocs and TermEnum * IITermDocs#skipTo and #next mimic the behaviour of document positioning from SegmentTermDocs#dito when returning false * II now uses BitVector rather than sets for deleted documents
[jira] Closed: (LUCENE-774) TopDocs and TopFieldDocs does not implement equals and hashCode
[ https://issues.apache.org/jira/browse/LUCENE-774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wettin closed LUCENE-774. -- Resolution: Won't Fix > TopDocs and TopFieldDocs does not implement equals and hashCode > --- > > Key: LUCENE-774 > URL: https://issues.apache.org/jira/browse/LUCENE-774 > Project: Lucene - Java > Issue Type: Improvement >Affects Versions: 2.0.0 > Reporter: Karl Wettin >Priority: Trivial > Attachments: extendsObject.diff > >
[jira] Commented: (LUCENE-1370) Patch to make ShingleFilter output a unigram if no ngrams can be generated
[ https://issues.apache.org/jira/browse/LUCENE-1370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12774252#action_12774252 ] Karl Wettin commented on LUCENE-1370: - Oops, I seem to have assigned this to me and then forgotten about it. Sorry! I'll check it out this weekend! > Patch to make ShingleFilter output a unigram if no ngrams can be generated > -- > > Key: LUCENE-1370 > URL: https://issues.apache.org/jira/browse/LUCENE-1370 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/analyzers >Reporter: Chris Harris >Assignee: Karl Wettin > Attachments: LUCENE-1370.patch, LUCENE-1370.patch, LUCENE-1370.patch, > LUCENE-1370.patch, ShingleFilter.patch > > > Currently if ShingleFilter.outputUnigrams==false and the underlying token > stream is only one token long, then ShingleFilter.next() won't return any > tokens. This patch provides a new option, outputUnigramIfNoNgrams; if this > option is set and the underlying stream is only one token long, then > ShingleFilter will return that token, regardless of the setting of > outputUnigrams. > My use case here is speeding up phrase queries. The technique is as follows: > First, do index-time analysis using ShingleFilter (with > outputUnigrams==true), thereby expanding things as follows: > "please divide this sentence into shingles" -> > "please", "please divide" > "divide", "divide this" > "this", "this sentence" > "sentence", "sentence into" > "into", "into shingles" > "shingles" > Second, do query-time analysis using ShingleFilter (with > outputUnigrams==false and outputUnigramIfNoNgrams==true). If the user enters > a phrase query, it will get tokenized in the following manner: > "please divide this sentence into shingles" -> > "please divide" > "divide this" > "this sentence" > "sentence into" > "into shingles" > By doing phrase queries with bigrams like this, I can gain a very > considerable speedup. 
Without the outputUnigramIfNoNgrams option, a > single-word query would tokenize like this: > "please" -> >[no tokens] > But thanks to outputUnigramIfNoNgrams, single words will now tokenize like > this: > "please" -> > "please" > > The patch also adds a little to the pre-outputUnigramIfNoNgrams option tests. > > I'm not sure if the patch in this state is useful to anyone else, but I > thought I should throw it up here and try to find out.
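A rough, hypothetical sketch of the bigram behaviour described above (not the actual ShingleFilter, which works on a TokenStream rather than a List):

```java
import java.util.ArrayList;
import java.util.List;

// Hedged sketch of bigram shingling with the single-token fallback the
// patch adds: emit token bigrams; if none can be formed and the fallback
// flag is set, emit the lone unigram instead of nothing.
public class ShingleSketch {
    public static List<String> bigrams(List<String> tokens,
                                       boolean outputUnigramIfNoNgrams) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            out.add(tokens.get(i) + " " + tokens.get(i + 1));
        }
        if (out.isEmpty() && outputUnigramIfNoNgrams && !tokens.isEmpty()) {
            out.add(tokens.get(0)); // the single-token fallback
        }
        return out;
    }
}
```

With the fallback off, a one-token query yields no tokens at all, which is exactly the failure mode the patch is fixing.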
Re: [VOTE] Release Apache Lucene Java 2.9.1, take 3
+1 On 30 Oct 2009 at 00:27, Michael McCandless wrote: OK, let's try this again! I've built new release artifacts from svn rev 831145 (on the 2.9 branch), here: http://people.apache.org/~mikemccand/staging-area/rc3_lucene2.9.1/ Changes are here: http://people.apache.org/~mikemccand/staging-area/rc3_lucene2.9.1changes/ Please vote to officially release these artifacts as Apache Lucene Java 2.9.1. Mike - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: [Lucene-java Wiki] Update of "LuceneAtApacheConUs2009" by HossMan
On 20 Oct 2009 at 07:15, Apache Wiki wrote: + There will be a Lucene/Search !MeetUp on Tuesday night at 8PM. 'This event is open to anyone who wants to come, even if you are not registered for the conference'. That is a really nice thing, and completely new if I'm not mistaken. Perhaps even worth advertising as news on the front page. karl - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Closed: (LUCENE-1958) ShingleFilter creates shingles across two consecutive documents: bug or normal behaviour?
[ https://issues.apache.org/jira/browse/LUCENE-1958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wettin closed LUCENE-1958. --- Resolution: Won't Fix Not a problem in 2.9 > ShingleFilter creates shingles across two consecutive documents: bug or > normal behaviour? > > > Key: LUCENE-1958 > URL: https://issues.apache.org/jira/browse/LUCENE-1958 > Project: Lucene - Java > Issue Type: Bug > Components: contrib/analyzers >Affects Versions: 2.4.1 > Environment: Windows XP / jdk1.6.0_15 >Reporter: MRIT64 >Priority: Minor > > Hi > I add two consecutive documents that are indexed with some filters. The last > one is ShingleFilter. > ShingleFilter creates a shingle spanning the two documents, which makes no > sense in my context. > Is that a bug or is it ShingleFilter's normal behaviour? If it's normal > behaviour, is it possible to change it optionally? > Thanks > MR -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Closed: (LUCENE-1947) Snowball package contains BSD licensed code with ASL header
[ https://issues.apache.org/jira/browse/LUCENE-1947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wettin closed LUCENE-1947. --- Resolution: Fixed Committed in revision 823445 > Snowball package contains BSD licensed code with ASL header > --- > > Key: LUCENE-1947 > URL: https://issues.apache.org/jira/browse/LUCENE-1947 > Project: Lucene - Java > Issue Type: Task > Components: contrib/analyzers >Affects Versions: 2.9 > Reporter: Karl Wettin >Assignee: Karl Wettin > Fix For: 3.0 > > Attachments: LUCENE-1947.patch, LUCENE-1947.patch > > > All classes in org.tartarus.snowball (but not in org.tartarus.snowball.ext) > have for some reason been given an ASL header. These classes are licensed > under BSD. Thus the ASL header should be removed. I suppose this is a > mistake, possibly due to the ASL header automation tool. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Using payload during indexes with Lucene 2.9.0
Hi Mauro, this is the -dev list where we discuss the development of the API. Questions about how to use the API should be sent to the -users list. Please use the -users list for future questions on how to use the API or when responding to this mail. In answer to your question, the classes you are looking for are located in the contrib/analyzers package. http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/analyzers/common/src/java/org/apache/lucene/analysis/payloads/ http://repo2.maven.org/maven2/org/apache/lucene/lucene-analyzers/2.9.0/ karl On 8 Oct 2009 at 22:45, Mauro Dragoni wrote: Hi everyone, I'm new to this mailing list... :) Some days ago I downloaded the new version of Lucene, but I didn't find the classes that I used to index terms with payload (PayloadEncoder, DelimitedPayloadTokenFilter, etc.) So I would like to ask where I may find an example of using payloads with the new Lucene version. Thanks in advance to everyone. Mauro. -- Dott. Mauro Dragoni Ph.D. Università di Milano, Italy My Business Site: http://www.dragotechpro.com My Research Site: http://www.genalgo.com - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Output from a small Snowball benchmark
There have been a few small comments in Jira about the reflection in Snowball's Among class. There is very little to do about this unless one wants to redesign the stemmers so they include an inner class that handles the method callbacks. That's quite a bit of work and I don't even know how much CPU one would save by doing this. So I was thinking it might save some resources if one reused the stemmers instead of reinstantiating them, which I presume everybody does. I thought it would make most sense to simulate query-time stemming, so my benchmark contained 4 words, 2 of which are plural. Each test ran 1 000 000 times. The amount of CPU time used is barely noticeable relative to what other things cost: 0.0109ms/iteration when reinstantiating, 0.0067ms/iteration when reusing. The heap consumption was however rather different. At the end of reinstantiation it had consumed about 10x more than when reusing: ~20MB vs. ~2MB. I realize people don't usually run 1 000 000 queries in such a short time, but at least this is an indication that one could save some GC time here. Many a mickle makes a muckle... So I was thinking that perhaps it would make sense to have something like a singleton concurrent queue in the SnowballFilter and a new constructor that takes the snowball program implementation class as an argument. But this might also be way premature optimization. karl - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
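The reuse idea above could look roughly like the sketch below: a shared concurrent queue that hands out stemmer instances and takes them back after use. Stemmer here is a toy stand-in for a generated Snowball stemmer class, and the pool shape is an assumption for illustration, not an API that exists in Lucene:

```java
import java.util.Locale;
import java.util.concurrent.ConcurrentLinkedQueue;

// Sketch of pooling reusable stemmer instances instead of instantiating one
// per query. "Stemmer" is a toy stand-in for a generated Snowball stemmer.
public class StemmerPool {
    // Toy stemmer: strips a trailing "s" (real Snowball programs are far richer).
    static class Stemmer {
        String stem(String word) {
            String w = word.toLowerCase(Locale.ROOT);
            return w.endsWith("s") ? w.substring(0, w.length() - 1) : w;
        }
    }

    private final ConcurrentLinkedQueue<Stemmer> pool = new ConcurrentLinkedQueue<>();

    Stemmer borrow() {
        Stemmer s = pool.poll();
        return s != null ? s : new Stemmer(); // grow lazily under contention
    }

    void release(Stemmer s) {
        pool.offer(s); // return for reuse; avoids per-query allocation and GC churn
    }

    public static void main(String[] args) {
        StemmerPool pool = new StemmerPool();
        Stemmer s = pool.borrow();
        System.out.println(s.stem("shingles")); // prints "shingle"
        pool.release(s); // the next borrow() reuses this same instance
    }
}
```

A singleton pool per stemmer class, as suggested in the mail, would let a filter borrow on each use and release when done, trading one queue operation for one allocation plus its eventual GC cost.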
[jira] Updated: (LUCENE-1947) Snowball package contains BSD licensed code with ASL header
[ https://issues.apache.org/jira/browse/LUCENE-1947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wettin updated LUCENE-1947: Attachment: LUCENE-1947.patch * Added Snowball license header to static Snowball classes (SnowballProgram, Among and TestApp) * Refactored StringBuffer to StringBuilder in all classes * Added notes about the above in README and package overview. > Snowball package contains BSD licensed code with ASL header > --- > > Key: LUCENE-1947 > URL: https://issues.apache.org/jira/browse/LUCENE-1947 > Project: Lucene - Java > Issue Type: Task > Components: contrib/analyzers >Affects Versions: 2.9 > Reporter: Karl Wettin >Assignee: Karl Wettin > Fix For: 3.0 > > Attachments: LUCENE-1947.patch, LUCENE-1947.patch > > > All classes in org.tartarus.snowball (but not in org.tartarus.snowball.ext) > have for some reason been given an ASL header. These classes are licensed > under BSD. Thus the ASL header should be removed. I suppose this is a > mistake, possibly due to the ASL header automation tool. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1948) Deprecating InstantiatedIndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-1948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wettin updated LUCENE-1948: Attachment: LUCENE-1948.patch > Deprecating InstantiatedIndexWriter > --- > > Key: LUCENE-1948 > URL: https://issues.apache.org/jira/browse/LUCENE-1948 > Project: Lucene - Java > Issue Type: Task > Components: contrib/* >Affects Versions: 2.9 > Reporter: Karl Wettin >Assignee: Karl Wettin > Fix For: 3.0 > > Attachments: LUCENE-1948.patch > > > http://markmail.org/message/j6ip266fpzuaibf7 > I suppose that should have been suggested before 2.9 rather than > after... > There are at least three reasons why I want to do this: > The code is based on the behaviour of the Directory IndexWriter as of > 2.3 and I have not touched it since then. If there will be > changes in the future one will have to keep IIW in sync, something > that's easy to forget. > There is no locking, which will cause concurrent modification > exceptions when accessing the index via searcher/reader while > committing. > It uses the old token stream API, so it has to be upgraded in case it > should stay. > The java- and package level docs have since it was committed been > suggesting that one should consider using II as if it was immutable > due to the locklessness. My suggestion is that we make it immutable > for real. > Since II is meant for small corpora there is very little time lost by > using the constructor that builds the index from an IndexReader. I.e. > rather than using InstantiatedIndexWriter one would have to use a > Directory and an IndexWriter and then pass an IndexReader to a new > InstantiatedIndex. > Any objections? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Created: (LUCENE-1948) Deprecating InstantiatedIndexWriter
Deprecating InstantiatedIndexWriter --- Key: LUCENE-1948 URL: https://issues.apache.org/jira/browse/LUCENE-1948 Project: Lucene - Java Issue Type: Task Components: contrib/* Affects Versions: 2.9 Reporter: Karl Wettin Assignee: Karl Wettin Fix For: 3.0 http://markmail.org/message/j6ip266fpzuaibf7 I suppose that should have been suggested before 2.9 rather than after... There are at least three reasons why I want to do this: The code is based on the behaviour of the Directory IndexWriter as of 2.3 and I have not touched it since then. If there will be changes in the future one will have to keep IIW in sync, something that's easy to forget. There is no locking, which will cause concurrent modification exceptions when accessing the index via searcher/reader while committing. It uses the old token stream API, so it has to be upgraded in case it should stay. The java- and package level docs have since it was committed been suggesting that one should consider using II as if it was immutable due to the locklessness. My suggestion is that we make it immutable for real. Since II is meant for small corpora there is very little time lost by using the constructor that builds the index from an IndexReader. I.e. rather than using InstantiatedIndexWriter one would have to use a Directory and an IndexWriter and then pass an IndexReader to a new InstantiatedIndex. Any objections? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1947) Snowball package contains BSD licensed code with ASL header
[ https://issues.apache.org/jira/browse/LUCENE-1947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wettin updated LUCENE-1947: Attachment: LUCENE-1947.patch > Snowball package contains BSD licensed code with ASL header > --- > > Key: LUCENE-1947 > URL: https://issues.apache.org/jira/browse/LUCENE-1947 > Project: Lucene - Java > Issue Type: Task > Components: contrib/analyzers >Affects Versions: 2.9 > Reporter: Karl Wettin >Assignee: Karl Wettin > Fix For: 3.0 > > Attachments: LUCENE-1947.patch > > > All classes in org.tartarus.snowball (but not in org.tartarus.snowball.ext) > have for some reason been given an ASL header. These classes are licensed > under BSD. Thus the ASL header should be removed. I suppose this is a > mistake, possibly due to the ASL header automation tool. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Created: (LUCENE-1947) Snowball package contains BSD licensed code with ASL header
Snowball package contains BSD licensed code with ASL header --- Key: LUCENE-1947 URL: https://issues.apache.org/jira/browse/LUCENE-1947 Project: Lucene - Java Issue Type: Task Components: contrib/analyzers Affects Versions: 2.9 Reporter: Karl Wettin Assignee: Karl Wettin Fix For: 3.0 All classes in org.tartarus.snowball (but not in org.tartarus.snowball.ext) have for some reason been given an ASL header. These classes are licensed under BSD. Thus the ASL header should be removed. I suppose this is a mistake, possibly due to the ASL header automation tool. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Closed: (LUCENE-1939) IndexOutOfBoundsException at ShingleMatrixFilter's Iterator#hasNext method
[ https://issues.apache.org/jira/browse/LUCENE-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wettin closed LUCENE-1939. --- Resolution: Fixed Fix Version/s: 3.0 Committed in 821888. Thanks Patrick! (I'll consider the other stuff mentioned in the issue later this week, and if manageable, then as a new issue.) > IndexOutOfBoundsException at ShingleMatrixFilter's Iterator#hasNext method > -- > > Key: LUCENE-1939 > URL: https://issues.apache.org/jira/browse/LUCENE-1939 > Project: Lucene - Java > Issue Type: Bug > Components: contrib/analyzers >Affects Versions: 2.9 >Reporter: Patrick Jungermann >Assignee: Karl Wettin > Fix For: 3.0 > > Attachments: ShingleMatrixFilter_IndexOutOfBoundsException.patch > > > I tried to use the ShingleMatrixFilter within Solr. To test the functionality > etc., I first used the built-in field analysis view. The filter was configured > to be used only at query-time analysis with "_" as spacer character and a > min. and max. shingle size of 2. The generation of the shingles for query > strings with this filter seems to work at this view, but by turning on the > highlighting of indexed terms that will match the query terms, the exception > was thrown. Also, each time I tried to query the index the exception was > immediately thrown. > Stacktrace: > {code} > java.lang.IndexOutOfBoundsException: Index: 1, Size: 1 > at java.util.ArrayList.RangeCheck(Unknown Source) > at java.util.ArrayList.get(Unknown Source) > at > org.apache.lucene.analysis.shingle.ShingleMatrixFilter$Matrix$1.hasNext(ShingleMatrixFilter.java:729) > at > org.apache.lucene.analysis.shingle.ShingleMatrixFilter.next(ShingleMatrixFilter.java:380) > at org.apache.lucene.analysis.StopFilter.next(StopFilter.java:120) > at org.apache.lucene.analysis.TokenStream.next(TokenStream.java:47) > ... > {code} > Within the hasNext method, the {{s-1}}-th Column of the ArrayList > {{columns}} is requested, but there is no such entry within columns. 
> I created a patch that checks whether {{columns}} contains enough entries. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1257) Port to Java5
[ https://issues.apache.org/jira/browse/LUCENE-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12762131#action_12762131 ] Karl Wettin commented on LUCENE-1257: - bq. err... looks like perhaps it's only hit once though and then reused.. maybe not so nasty. My first time looking at this code, so I'm sure you can clear it up ... Mark, are you referring to the reflection in Among? Those are pretty tough to get rid of. I think we should replace the StringBuffers in the stemmers if nobody else minds. But I think we should do that in another issue. I also found ASL headers in some of the classes. I suppose they were added automatically at some point. These classes are all BSD. > Port to Java5 > - > > Key: LUCENE-1257 > URL: https://issues.apache.org/jira/browse/LUCENE-1257 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis, Examples, Index, Other, Query/Scoring, > QueryParser, Search, Store, Term Vectors >Affects Versions: 2.3.1 >Reporter: Cédric Champeau >Assignee: Uwe Schindler >Priority: Minor > Fix For: 3.0 > > Attachments: instantiated_fieldable.patch, java5.patch, > LUCENE-1257-Document.patch, LUCENE-1257-StringBuffer.patch, > LUCENE-1257-StringBuffer.patch, LUCENE-1257-StringBuffer.patch, > LUCENE-1257_messages.patch, lucene1257surround1.patch, > lucene1257surround1.patch, shinglematrixfilter_generified.patch > > > For my needs I've updated Lucene so that it uses Java 5 constructs. I know > Java 5 migration had been planned for 2.1 someday in the past, but don't know > when it is planned now. This patch against the trunk includes: > - most obvious generics usage (there are tons of usages of sets, ... Those > which are commonly used have been generified) > - PriorityQueue generification > - replacement of indexed for loops with for-each constructs > - removal of unnecessary unboxing > The code is in my opinion much more readable with those features (you > actually *know* what is stored in collections reading the code, without the > need to look up field definitions every time) and it simplifies many > algorithms. > Note that this patch also includes an interface for the Query class. This has > been done for my company's needs for building custom Query classes which add > some behaviour to the base Lucene queries. It prevents multiple unnecessary > casts. I know this introduction is not wanted by the team, but it really > makes our developments easier to maintain. If you don't want to use this, > replace all /Queriable/ calls with standard /Query/. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Deprecating InstantiatedIndexWriter
I suppose that should have been suggested before 2.9 rather than after... There are at least three reasons why I want to do this: The code is based on the behaviour of the Directory IndexWriter as of 2.3 and I have not touched it since then. If there will be changes in the future one will have to keep IIW in sync, something that's easy to forget. There is no locking, which will cause concurrent modification exceptions when accessing the index via searcher/reader while committing. It uses the old token stream API, so it has to be upgraded in case it should stay. The java- and package level docs have since it was committed been suggesting that one should consider using II as if it was immutable due to the locklessness. My suggestion is that we make it immutable for real. Since II is meant for small corpora there is very little time lost by using the constructor that builds the index from an IndexReader. I.e. rather than using InstantiatedIndexWriter one would have to use a Directory and an IndexWriter and then pass an IndexReader to a new InstantiatedIndex. Any objections? - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1939) IndexOutOfBoundsException at ShingleMatrixFilter's Iterator#hasNext method
[ https://issues.apache.org/jira/browse/LUCENE-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761924#action_12761924 ] Karl Wettin commented on LUCENE-1939: - The exception is thrown when ts#next (incrementToken) is called again after already having returned null (false) once. So this is a nice catch! But this means that RemoveDuplicatesTokenFilter in Solr calls incrementToken one extra time for some reason. Can you please post the complete stacktrace so I can take a look in there too? I suppose the expected behaviour would be that a token stream keeps returning false when incrementToken is called after it has already returned false, but the javadocs don't really say anything about this, nor is there a generic test case that ensures this for all filters. Thus this error might be present in other filters. I'll see if I can do something about that before committing. Thanks for the report Patrick! > IndexOutOfBoundsException at ShingleMatrixFilter's Iterator#hasNext method > -- > > Key: LUCENE-1939 > URL: https://issues.apache.org/jira/browse/LUCENE-1939 > Project: Lucene - Java > Issue Type: Bug > Components: contrib/analyzers >Affects Versions: 2.9 >Reporter: Patrick Jungermann >Assignee: Karl Wettin > Attachments: ShingleMatrixFilter_IndexOutOfBoundsException.patch > > > I tried to use the ShingleMatrixFilter within Solr. To test the functionality > etc., I first used the built-in field analysis view. The filter was configured > to be used only at query-time analysis with "_" as spacer character and a > min. and max. shingle size of 2. The generation of the shingles for query > strings with this filter seems to work at this view, but by turning on the > highlighting of indexed terms that will match the query terms, the exception > was thrown. Also, each time I tried to query the index the exception was > immediately thrown. 
> Stacktrace: > {code} > java.lang.IndexOutOfBoundsException: Index: 1, Size: 1 > at java.util.ArrayList.RangeCheck(Unknown Source) > at java.util.ArrayList.get(Unknown Source) > at > org.apache.lucene.analysis.shingle.ShingleMatrixFilter$Matrix$1.hasNext(ShingleMatrixFilter.java:729) > at > org.apache.lucene.analysis.shingle.ShingleMatrixFilter.next(ShingleMatrixFilter.java:380) > at org.apache.lucene.analysis.StopFilter.next(StopFilter.java:120) > at org.apache.lucene.analysis.TokenStream.next(TokenStream.java:47) > ... > {code} > Within the hasNext method, there is the {{s-1}}-th Column from the ArrayList > {{columns}} requested, but there isn't this entry within columns. > I created a patch that checks, if {{columns}} contains enough entries. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
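The contract discussed in the comment above (once incrementToken has returned false, further calls should keep returning false rather than fail) can be illustrated with a small self-contained sketch. This is an illustrative stand-in for the idea, not Lucene's TokenStream API:

```java
import java.util.Arrays;
import java.util.Iterator;

// Illustrative stand-in for a token stream: once exhausted, incrementToken()
// must keep returning false on every subsequent call instead of failing,
// even if a downstream consumer calls it one extra time.
public class ExhaustionSafeStream {
    private final Iterator<String> tokens;
    private String current;
    private boolean exhausted; // latched so calls after the end stay safe

    ExhaustionSafeStream(String... tokens) {
        this.tokens = Arrays.asList(tokens).iterator();
    }

    boolean incrementToken() {
        if (exhausted || !tokens.hasNext()) {
            exhausted = true; // never touch the underlying source again
            current = null;
            return false;
        }
        current = tokens.next();
        return true;
    }

    String current() { return current; }

    public static void main(String[] args) {
        ExhaustionSafeStream ts = new ExhaustionSafeStream("please");
        while (ts.incrementToken()) System.out.println(ts.current());
        // A consumer may call again after exhaustion, as apparently happened here:
        System.out.println(ts.incrementToken()); // prints "false", not an exception
        System.out.println(ts.incrementToken()); // still "false"
    }
}
```

A generic test asserting this latched-false behavior for every filter is exactly the kind of safety net the comment says was missing.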
[jira] Commented: (LUCENE-1257) Port to Java5
[ https://issues.apache.org/jira/browse/LUCENE-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761877#action_12761877 ] Karl Wettin commented on LUCENE-1257: - bq. Fix for InstantiatedIndex compile error caused by code committed in revision 821277 Committed in rev 821315 > Port to Java5 > - > > Key: LUCENE-1257 > URL: https://issues.apache.org/jira/browse/LUCENE-1257 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis, Examples, Index, Other, Query/Scoring, > QueryParser, Search, Store, Term Vectors >Affects Versions: 2.3.1 >Reporter: Cédric Champeau >Assignee: Uwe Schindler >Priority: Minor > Fix For: 3.0 > > Attachments: instantiated_fieldable.patch, java5.patch, > LUCENE-1257-Document.patch, LUCENE-1257-StringBuffer.patch, > LUCENE-1257-StringBuffer.patch, LUCENE-1257-StringBuffer.patch, > lucene1257surround1.patch, lucene1257surround1.patch, > shinglematrixfilter_generified.patch > > -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1257) Port to Java5
[ https://issues.apache.org/jira/browse/LUCENE-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wettin updated LUCENE-1257: Attachment: instantiated_fieldable.patch Fix for InstantiatedIndex compile error caused by code committed in revision 821277 List rather than List > Port to Java5 > - > > Key: LUCENE-1257 > URL: https://issues.apache.org/jira/browse/LUCENE-1257 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis, Examples, Index, Other, Query/Scoring, > QueryParser, Search, Store, Term Vectors >Affects Versions: 2.3.1 >Reporter: Cédric Champeau >Assignee: Uwe Schindler >Priority: Minor > Fix For: 3.0 > > Attachments: instantiated_fieldable.patch, java5.patch, > LUCENE-1257-Document.patch, LUCENE-1257-StringBuffer.patch, > LUCENE-1257-StringBuffer.patch, LUCENE-1257-StringBuffer.patch, > lucene1257surround1.patch, lucene1257surround1.patch, > shinglematrixfilter_generified.patch > > -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1257) Port to Java5
[ https://issues.apache.org/jira/browse/LUCENE-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761875#action_12761875 ] Karl Wettin commented on LUCENE-1257: - bq. how that? It asserted that a Document contained a List rather than List in ctor(IndexReader), which I actually think is true at that point using that code. > Port to Java5 > - > > Key: LUCENE-1257 > URL: https://issues.apache.org/jira/browse/LUCENE-1257 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis, Examples, Index, Other, Query/Scoring, > QueryParser, Search, Store, Term Vectors >Affects Versions: 2.3.1 >Reporter: Cédric Champeau >Assignee: Uwe Schindler >Priority: Minor > Fix For: 3.0 > > Attachments: java5.patch, LUCENE-1257-Document.patch, > LUCENE-1257-StringBuffer.patch, LUCENE-1257-StringBuffer.patch, > LUCENE-1257-StringBuffer.patch, lucene1257surround1.patch, > lucene1257surround1.patch, shinglematrixfilter_generified.patch > > -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1257) Port to Java5
[ https://issues.apache.org/jira/browse/LUCENE-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761874#action_12761874 ] Karl Wettin commented on LUCENE-1257:

bq. Generified ShingleMatrixFilter

Committed in rev 821311

> Port to Java5
> -
>
> Key: LUCENE-1257
> URL: https://issues.apache.org/jira/browse/LUCENE-1257
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Analysis, Examples, Index, Other, Query/Scoring, QueryParser, Search, Store, Term Vectors
> Affects Versions: 2.3.1
> Reporter: Cédric Champeau
> Assignee: Uwe Schindler
> Priority: Minor
> Fix For: 3.0
>
> Attachments: java5.patch, LUCENE-1257-Document.patch, LUCENE-1257-StringBuffer.patch, LUCENE-1257-StringBuffer.patch, LUCENE-1257-StringBuffer.patch, lucene1257surround1.patch, lucene1257surround1.patch, shinglematrixfilter_generified.patch
>
>
> For my needs I've updated Lucene so that it uses Java 5 constructs. I know Java 5 migration had been planned for 2.1 someday in the past, but I don't know when it is planned now. This patch against the trunk includes:
> - most obvious generics usage (there are tons of usages of sets, ... Those which are commonly used have been generified)
> - PriorityQueue generification
> - replacement of indexed for loops with for-each constructs
> - removal of unnecessary unboxing
> The code is in my opinion much more readable with those features (you actually *know* what is stored in collections reading the code, without the need to look up field definitions every time) and it simplifies many algorithms.
> Note that this patch also includes an interface for the Query class. This has been done for my company's needs, for building custom Query classes which add some behaviour to the base Lucene queries. It prevents multiple unnecessary casts. I know this introduction is not wanted by the team, but it really makes our developments easier to maintain. If you don't want to use this, replace all /Queriable/ calls with standard /Query/.

-- This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
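The Java 5 constructs the patch description lists can be illustrated side by side. A minimal before/after sketch — the method names and the frequency-summing task are hypothetical, not code from the patch:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class Java5Port {
    // Pre-Java-5 style: raw type, indexed for loop, explicit unboxing cast.
    static int sumRaw(List freqs) {
        int total = 0;
        for (int i = 0; i < freqs.size(); i++) {
            total += ((Integer) freqs.get(i)).intValue();
        }
        return total;
    }

    // Java 5 style: generics make the element type explicit, for-each
    // removes the index bookkeeping, autoboxing removes the cast.
    static int sumGenerified(List<Integer> freqs) {
        int total = 0;
        for (int freq : freqs) {
            total += freq;
        }
        return total;
    }

    public static void main(String[] args) {
        List<Integer> freqs = new ArrayList<Integer>(Arrays.asList(3, 4));
        System.out.println(sumGenerified(freqs)); // 7
    }
}
```

Both methods compute the same result; the generified version simply says so in its signature.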
[jira] Updated: (LUCENE-1257) Port to Java5
[ https://issues.apache.org/jira/browse/LUCENE-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wettin updated LUCENE-1257:

Attachment: shinglematrixfilter_generified.patch

Generified ShingleMatrixFilter
[jira] Commented: (LUCENE-1257) Port to Java5
[ https://issues.apache.org/jira/browse/LUCENE-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761870#action_12761870 ] Karl Wettin commented on LUCENE-1257:

bq. Generification of Document. It makes now clear what getFields() returns really. This was very bad documented. Now its a List.

This broke InstantiatedIndex in the trunk. Patch and commit are on the way.
[jira] Commented: (LUCENE-1939) IndexOutOfBoundsException at ShingleMatrixFilter's Iterator#hasNext method
[ https://issues.apache.org/jira/browse/LUCENE-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761868#action_12761868 ] Karl Wettin commented on LUCENE-1939:

Patrick, I can't manage to reproduce this error. Uwe is right though: you are getting this error using 2.4.1 or earlier, not by using 2.9.

bq. at org.apache.lucene.analysis.shingle.ShingleMatrixFilter$Matrix$1.hasNext(ShingleMatrixFilter.java:729)

Can you please try with 2.9? It would also be very helpful if you could list the applicable Solr configuration and some example data you are passing to the filter when the exception is thrown. Thanks in advance.

> IndexOutOfBoundsException at ShingleMatrixFilter's Iterator#hasNext method
> --
>
> Key: LUCENE-1939
> URL: https://issues.apache.org/jira/browse/LUCENE-1939
> Project: Lucene - Java
> Issue Type: Bug
> Components: contrib/analyzers
> Affects Versions: 2.9
> Reporter: Patrick Jungermann
> Assignee: Karl Wettin
> Attachments: ShingleMatrixFilter_IndexOutOfBoundsException.patch
>
>
> I tried to use the ShingleMatrixFilter within Solr. To test the functionality etc., I first used the built-in field analysis view. The filter was configured to be used only at query-time analysis, with "_" as the spacer character and a min. and max. shingle size of 2. The generation of shingles for query strings with this filter seemed to work in this view, but after turning on highlighting of indexed terms that match the query terms, the exception was thrown. Also, each time I tried to query the index the exception was immediately thrown.
> Stacktrace:
> {code}
> java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
> at java.util.ArrayList.RangeCheck(Unknown Source)
> at java.util.ArrayList.get(Unknown Source)
> at org.apache.lucene.analysis.shingle.ShingleMatrixFilter$Matrix$1.hasNext(ShingleMatrixFilter.java:729)
> at org.apache.lucene.analysis.shingle.ShingleMatrixFilter.next(ShingleMatrixFilter.java:380)
> at org.apache.lucene.analysis.StopFilter.next(StopFilter.java:120)
> at org.apache.lucene.analysis.TokenStream.next(TokenStream.java:47)
> ...
> {code}
> Within the hasNext method, the {{s-1}}-th column is requested from the ArrayList {{columns}}, but there is no such entry in columns.
> I created a patch that checks whether {{columns}} contains enough entries.
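A minimal sketch of the kind of guard the reporter's patch describes — verifying the list size before dereferencing the (s-1)-th column. The class and method names here are hypothetical stand-ins, not the actual ShingleMatrixFilter internals:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ColumnIterator {
    private final List<String> columns = new ArrayList<String>();

    public ColumnIterator(List<String> cols) {
        columns.addAll(cols);
    }

    // Without the guard, columns.get(s - 1) can be reached with
    // s - 1 >= columns.size(), throwing IndexOutOfBoundsException
    // exactly as in the reported stack trace.
    public boolean hasNextColumn(int s) {
        // The fix the patch describes: check that the list contains
        // enough entries before requesting the (s-1)-th column.
        if (s - 1 < 0 || s - 1 >= columns.size()) {
            return false;
        }
        return columns.get(s - 1) != null;
    }

    public static void main(String[] args) {
        ColumnIterator it = new ColumnIterator(Arrays.asList("a", "b"));
        System.out.println(it.hasNextColumn(3)); // false, rather than an exception
    }
}
```

The guard turns an out-of-range index into a clean "no next element" answer, which is the contract an Iterator's hasNext is supposed to honour.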
[jira] Commented: (LUCENE-1257) Port to Java5
[ https://issues.apache.org/jira/browse/LUCENE-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761862#action_12761862 ] Karl Wettin commented on LUCENE-1257:

bq. Wait ... do you mean you got rid of some of the reflection or did we lose your changes? I'm seeing some nasty slow reflection in there still ...

My changes were to the abstract Snowball stemmer class. I simply added an abstract method and got rid of the reflection in the Lucene filter. One could argue that we should update the Snowball compiler rather than the Java code it renders, but honestly I think we should just update the rendered code, report any improvements found to the Snowball mailing list, and keep track of them in the package readme.

bq. err... looks like perhaps its only hit once though and then reused.. maybe not so nasty. My first time looking at this code, so I'm sure you can clear it up ...

It could still be rather expensive per stem at query time. I vote for getting rid of it if we can. I'll take a look at it.
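The reflection removal Karl describes can be illustrated roughly like this. Everything here is a hypothetical stand-in for the generated Snowball sources — the class names, the toy stemming rule, and the `setAccessible` call (needed only because these sketch classes are package-private) are not the real code:

```java
import java.lang.reflect.Method;

// Sketch of the abstract base class with the added abstract stem()
// method: a direct, non-reflective entry point for the filter.
abstract class SnowballProgramSketch {
    protected final StringBuilder current = new StringBuilder();

    public void setCurrent(String value) {
        current.setLength(0);
        current.append(value);
    }

    public String getCurrent() {
        return current.toString();
    }

    public abstract boolean stem();
}

// Toy "stemmer" standing in for a generated language stemmer:
// strips a single trailing 's'. Not the real English algorithm.
class EnglishStemmerSketch extends SnowballProgramSketch {
    @Override
    public boolean stem() {
        int len = current.length();
        if (len > 1 && current.charAt(len - 1) == 's') {
            current.setLength(len - 1);
        }
        return true;
    }
}

public class StemDemo {
    // Old style: resolve and invoke stem() reflectively on each call.
    static String stemViaReflection(SnowballProgramSketch stemmer, String term) throws Exception {
        stemmer.setCurrent(term);
        Method m = stemmer.getClass().getMethod("stem");
        m.setAccessible(true); // needed here because the sketch classes are package-private
        m.invoke(stemmer);
        return stemmer.getCurrent();
    }

    // New style: a plain virtual call through the abstract method.
    static String stemDirect(SnowballProgramSketch stemmer, String term) {
        stemmer.setCurrent(term);
        stemmer.stem();
        return stemmer.getCurrent();
    }

    public static void main(String[] args) {
        System.out.println(stemDirect(new EnglishStemmerSketch(), "cats")); // cat
    }
}
```

The direct call removes per-stem reflection cost entirely, which matters when stemming runs once per term at query time.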
[jira] Commented: (LUCENE-1257) Port to Java5
[ https://issues.apache.org/jira/browse/LUCENE-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761755#action_12761755 ] Karl Wettin commented on LUCENE-1257:

bq. I vote to move to StringBuilder anyway if its in Contrib. Though probably not with Snowball, since we don't really write/maintain that code.

Actually I patched the Snowball stemmer code to get rid of the use of reflection, so what we use is an altered version of their code. I have tried for years to get Dr Porter to commit those changes, but it's still the same. Based on this I think we could just keep going with our own changes in there, as long as we keep a record of what we have done in case we want to merge with their trunk.
[jira] Commented: (LUCENE-1939) IndexOutOfBoundsException at ShingleMatrixFilter's Iterator#hasNext method
[ https://issues.apache.org/jira/browse/LUCENE-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761712#action_12761712 ] Karl Wettin commented on LUCENE-1939:

bq. I also think so, because the above stack dump seems to be from 2.4.1 (in 2.9 there should be incrementToken() instead of next() for all filters listed there).

Ah, I misunderstood your comment. The thing is that ShingleMatrixFilter was left using the old API because of its complexity. I told whoever gave it a shot that I'd look into upgrading it; I just haven't had time to do so yet. There will be a new generified and updated version of the filter any year now. At least before 3.0.
[jira] Commented: (LUCENE-1939) IndexOutOfBoundsException at ShingleMatrixFilter's Iterator#hasNext method
[ https://issues.apache.org/jira/browse/LUCENE-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761706#action_12761706 ] Karl Wettin commented on LUCENE-1939:

bq. Is this caused by the rewrite because of the new TokenStream API?

Nah, I think it's just a miss in the code that was never caught before. Not sure though, so I'll write a test or two this weekend.
[jira] Assigned: (LUCENE-1939) IndexOutOfBoundsException at ShingleMatrixFilter's Iterator#hasNext method
[ https://issues.apache.org/jira/browse/LUCENE-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wettin reassigned LUCENE-1939:
---
Assignee: Karl Wettin
[jira] Commented: (LUCENE-625) Query auto completer
[ https://issues.apache.org/jira/browse/LUCENE-625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12736923#action_12736923 ] Karl Wettin commented on LUCENE-625:

bq. Karl, did you ever proceed on this patch? I'm interested in adding autosuggest to Solr.

I used this patch for a few things a couple of years ago. If I recall everything right, I ended up using the bootstrapped apriori corpus of LUCENE-626 as training data the last time. That made the corpus rather small and speedy, yet still relevant for most users. But the major caveat is that this patch is a trie and is thus a "precise forward only" thing, so it might not fit all use cases. It might be easier to get things going using an index with ngrams of untokenized user queries (i.e. including whitespace) or subject-like fields. But I really prefer user queries, as using only the last n queries makes the suggester sensitive to trends. That does however require quite a bit of data to work well. A lot, as in hundreds of thousands of user queries, in my experience. Not sure if this was an answer to your question.. : )

> Query auto completer
>
> Key: LUCENE-625
> URL: https://issues.apache.org/jira/browse/LUCENE-625
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Search
> Reporter: Karl Wettin
> Priority: Minor
> Attachments: autocomplete_0.0.1.tar.gz, autocomplete_20060730.tar.gz
>
>
> A trie that helps users type in their query. Made for AJAX; works great with the Ruby on Rails common scripts <http://script.aculo.us/>. Similar to the Google Labs suggester.
> Trained by user queries. Optimizable. Uses an in-memory corpus. Serializable.
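The "precise forward only" behaviour of a query trie can be sketched as follows. This is a hedged illustration with hypothetical names, not the API of the attached patch: suggestions exist only for exact leading prefixes of trained queries, which is the caveat described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class QueryTrie {
    private final TreeMap<Character, QueryTrie> children = new TreeMap<Character, QueryTrie>();
    private boolean terminal;

    // Insert one user query, one character per trie node.
    public void train(String query) {
        QueryTrie node = this;
        for (char c : query.toCharArray()) {
            QueryTrie child = node.children.get(c);
            if (child == null) {
                child = new QueryTrie();
                node.children.put(c, child);
            }
            node = child;
        }
        node.terminal = true;
    }

    // "Precise forward only": walk the exact prefix; if any character
    // is missing there are no suggestions at all (no fuzzy matching).
    public List<String> suggest(String prefix) {
        List<String> out = new ArrayList<String>();
        QueryTrie node = this;
        for (char c : prefix.toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return out;
            }
        }
        collect(node, new StringBuilder(prefix), out);
        return out;
    }

    private static void collect(QueryTrie node, StringBuilder path, List<String> out) {
        if (node.terminal) {
            out.add(path.toString());
        }
        for (Map.Entry<Character, QueryTrie> e : node.children.entrySet()) {
            path.append(e.getKey());
            collect(e.getValue(), path, out);
            path.setLength(path.length() - 1);
        }
    }

    public static void main(String[] args) {
        QueryTrie trie = new QueryTrie();
        trie.train("lucene query");
        trie.train("lucene norms");
        System.out.println(trie.suggest("lucene "));
    }
}
```

A misspelled or reordered prefix returns nothing, which is exactly why the comment suggests an ngram index as a more forgiving alternative.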
[jira] Commented: (LUCENE-1260) Norm codec strategy in Similarity
[ https://issues.apache.org/jira/browse/LUCENE-1260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722575#action_12722575 ] Karl Wettin commented on LUCENE-1260:

Hi Johan, I didn't try it out yet but the patch looks nice and clean. +1 from me. Let's try to convince some of the old -1:ers. YONIK? See, it's not just me. ; )

I do however still think it would be nice with the serializable codec interface as in the previous patches, so that all applications can use the index as intended (Luke and whatnot): 256 bytes stored to a file, by default backed by a binary search or so, unless there is a registered codec that handles it algorithmically. I'll copy and paste that in as an alternative suggestion ASAP.

(I think the next move should be to allow for per-field variable norm resolution, but that is a whole new issue.)

> Norm codec strategy in Similarity
> -
>
> Key: LUCENE-1260
> URL: https://issues.apache.org/jira/browse/LUCENE-1260
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Search
> Affects Versions: 2.3.1
> Reporter: Karl Wettin
> Attachments: Lucene-1260.patch, LUCENE-1260.txt, LUCENE-1260.txt, LUCENE-1260.txt
>
>
> The static span and resolution of the 8 bit norms codec might not fit all applications.
> My use case requires that 100f-250f is discretized in 60 bags instead of the default.. 10?
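The 60-bag discretization of 100f-250f mentioned in the issue description could look roughly like this. This is a hedged sketch with a uniform lookup table and hypothetical names — it is not Lucene's actual Similarity norm codec, which uses a different 8-bit float encoding:

```java
public class NormCodecSketch {
    private static final float MIN = 100f, MAX = 250f;
    private static final int BAGS = 60;

    // Precomputed decode table: 60 evenly spaced values over [MIN, MAX].
    // A serialized table like this is what the comment above suggests
    // storing in the index so other tools can decode the norms.
    private static final float[] TABLE = new float[BAGS];
    static {
        for (int i = 0; i < BAGS; i++) {
            TABLE[i] = MIN + (MAX - MIN) * i / (BAGS - 1);
        }
    }

    // Encode: clamp into range, then round to the nearest bag index.
    // For a non-uniform table a binary search over TABLE would serve
    // the same purpose, as the comment hints.
    public static byte encode(float norm) {
        float clamped = Math.max(MIN, Math.min(MAX, norm));
        return (byte) Math.round((clamped - MIN) / (MAX - MIN) * (BAGS - 1));
    }

    // Decode: a table lookup, the cheap direction at search time.
    public static float decode(byte code) {
        return TABLE[code & 0xFF];
    }

    public static void main(String[] args) {
        System.out.println(decode(encode(175f)));
    }
}
```

Round-tripping any norm loses at most half a bag width (about 1.27f here), which is the resolution/span trade-off the issue is about.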
[jira] Resolved: (LUCENE-1578) InstantiatedIndex supports non-optimized IndexReaders
[ https://issues.apache.org/jira/browse/LUCENE-1578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wettin resolved LUCENE-1578.
-
Resolution: Fixed

Committed.

> InstantiatedIndex supports non-optimized IndexReaders
> -
>
> Key: LUCENE-1578
> URL: https://issues.apache.org/jira/browse/LUCENE-1578
> Project: Lucene - Java
> Issue Type: Improvement
> Components: contrib/*
> Affects Versions: 2.4.1
> Reporter: Jason Rutherglen
> Assignee: Karl Wettin
> Fix For: 2.9
>
> Attachments: LUCENE-1578.txt
>
> Original Estimate: 72h
> Remaining Estimate: 72h
>
> InstantiatedIndex does not currently support non-optimized IndexReaders.
Re: svn commit: r784481 - in /lucene/java/trunk/contrib: ./ instantiated/src/java/org/apache/lucene/store/instantiated/ instantiated/src/test/org/apache/lucene/store/instantiated/
oops, an error in the code. I'm on it.

On 13 Jun 2009, at 23.54, ka...@apache.org wrote:

Author: kalle
Date: Sat Jun 13 21:54:07 2009
New Revision: 784481
URL: http://svn.apache.org/viewvc?rev=784481&view=rev
Log: LUCENE-1578: Support for loading unoptimized readers to the constructor of InstantiatedIndex. (Karl Wettin)

Added:
lucene/java/trunk/contrib/instantiated/src/test/org/apache/lucene/store/instantiated/TestUnoptimizedReaderOnConstructor.java
Modified:
lucene/java/trunk/contrib/CHANGES.txt
lucene/java/trunk/contrib/instantiated/src/java/org/apache/lucene/store/instantiated/InstantiatedIndex.java
lucene/java/trunk/contrib/instantiated/src/test/org/apache/lucene/store/instantiated/TestIndicesEquals.java
lucene/java/trunk/contrib/instantiated/src/test/org/apache/lucene/store/instantiated/TestRealTime.java

Modified: lucene/java/trunk/contrib/CHANGES.txt
URL: http://svn.apache.org/viewvc/lucene/java/trunk/contrib/CHANGES.txt?rev=784481&r1=784480&r2=784481&view=diff
==============================================================================
--- lucene/java/trunk/contrib/CHANGES.txt (original)
+++ lucene/java/trunk/contrib/CHANGES.txt Sat Jun 13 21:54:07 2009
@@ -62,8 +62,11 @@
    (Xiaoping Gao via Mike McCandless)
-6. LUCENE-1676: Added DelimitedPayloadTokenFilter class for automatically adding payloads "in-stream" (Grant Ingersoll)
-
+ 6. LUCENE-1676: Added DelimitedPayloadTokenFilter class for automatically adding payloads "in-stream" (Grant Ingersoll)
+
+ 7. LUCENE-1578: Support for loading unoptimized readers to the
+    constructor of InstantiatedIndex. (Karl Wettin)
+
 Optimizations

 1. LUCENE-1643: Re-use the collation key (RawCollationKey) for

Modified: lucene/java/trunk/contrib/instantiated/src/java/org/apache/lucene/store/instantiated/InstantiatedIndex.java
URL: http://svn.apache.org/viewvc/lucene/java/trunk/contrib/instantiated/src/java/org/apache/lucene/store/instantiated/InstantiatedIndex.java?rev=784481&r1=784480&r2=784481&view=diff
==============================================================================
--- lucene/java/trunk/contrib/instantiated/src/java/org/apache/lucene/store/instantiated/InstantiatedIndex.java (original)
+++ lucene/java/trunk/contrib/instantiated/src/java/org/apache/lucene/store/instantiated/InstantiatedIndex.java Sat Jun 13 21:54:07 2009
@@ -110,7 +110,8 @@
   public InstantiatedIndex(IndexReader sourceIndexReader, Set fields) throws IOException {
     if (!sourceIndexReader.isOptimized()) {
-      throw new IOException("Source index is not optimized.");
+      System.out.println(("Source index is not optimized."));
+      //throw new IOException("Source index is not optimized.");
     }
@@ -170,11 +171,14 @@
     }
-    documentsByNumber = new InstantiatedDocument[sourceIndexReader.numDocs()];
+    documentsByNumber = new InstantiatedDocument[sourceIndexReader.maxDoc()];
+
     // create documents
-    for (int i = 0; i < sourceIndexReader.numDocs(); i++) {
-      if (!sourceIndexReader.isDeleted(i)) {
+    for (int i = 0; i < sourceIndexReader.maxDoc(); i++) {
+      if (sourceIndexReader.isDeleted(i)) {
+        deletedDocuments.add(i);
+      } else {
         InstantiatedDocument document = new InstantiatedDocument();
         // copy stored fields from source reader
         Document sourceDocument = sourceIndexReader.document(i);
@@ -259,6 +263,9 @@
     // load offsets to term-document informations
     for (InstantiatedDocument document : getDocumentsByNumber()) {
+      if (document == null) {
+        continue; // deleted
+      }
       for (Field field : (List) document.getDocument().getFields()) {
         if (field.isTermVectorStored() && field.isStoreOffsetWithTermVector()) {
           TermPositionVector termPositionVector = (TermPositionVector) sourceIndexReader.getTermFreqVector(document.getDocumentNumber(), field.name());

Modified: lucene/java/trunk/contrib/instantiated/src/test/org/apache/lucene/store/instantiated/TestIndicesEquals.java
URL: http://svn.apache.org/viewvc/lucene/java/trunk/contrib/instantiated/src/test/org/apache/lucene/store/instantiated/TestIndicesEquals.java?rev=784481&r1=784480&r2=784481&view=diff
==============================================================================
--- lucene/java/trunk/contrib/instantiated/src/test/org/apache/lucene/store/instantiated/TestIndicesEquals.java (original)
+++ lucene/java/trunk/contrib/instantiated/src/test/org/apache/lucene/store/instantiated/TestIndicesEquals.java Sat Jun 13 21:54:07 2009
@@ -40,6 +40,10 @@
 import org.apache.lucene.index.TermPositions;
 import org.apache.lucene.store.Directory;
 import org.apache.lucene.store.RAMDirectory;
+import org.apache.lucene.search.IndexSearcher;
+import org.apache.lucene.search.TermQuery;
+import org.apac
[jira] Issue Comment Edited: (LUCENE-1491) EdgeNGramTokenFilter stops on tokens smaller then minimum gram size.
[ https://issues.apache.org/jira/browse/LUCENE-1491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715712#action_12715712 ] Karl Wettin edited comment on LUCENE-1491 at 6/2/09 2:51 PM: - Although you have a valid point I'd like to argue this a bit. My arguments are probably considered silly by some. Perhaps it's just me that uses ngrams for something completely different than what everybody else does, but here we go: adding the feature as suggested by this patch is, in my view, fixing the symptoms of bad use of character ngrams.
BOL, EOL, whitespace and punctuation are all valid parts of character ngrams that can increase precision/recall quite a bit. EdgeNGrams could sort of be considered such data too. So what I'm saying here is that I consider your example a bad use of character ngrams: the whole sentence should have been grammed up. So in the case of 4-grams the output would end up as: "to b", "o be", " be ", "be o", and so on. Perhaps even "$to ", "to b", "o be", and so on. Supporting what I suggest will of course mean quite a bit more work: a whole new filter that also does input text normalization, such as removing double spaces and what not. That will probably not be implemented anytime soon. But adding the features in the patch to the filter means that this use is endorsed by the community, and I'm not sure that's a good idea. I thus think it would be better with some sort of secondary filter that does the exact same thing as the patch. Perhaps I should leave this issue alone and do some more work with LUCENE-1306 > EdgeNGramTokenFilter stops on tokens smaller then minimum gram size. > > > Key: LUCENE-1491 > URL: https://issues.apache.org/jira/browse/LUCENE-1491 > Project: Lucene - Java > Issue Type: Bug > Components: Analysis >Affects Versions: 2.4, 2.4.1, 2.9, 3.0 >Reporter: Todd Feak >Assignee: Otis Gospodnetic > Fix For: 2.9 > > Attachments: LUCENE-1491.patch > > > If a token is encountered in the stream that is shorter in length than the > min gram size, the filter will stop processing the token stream. > Working up a unit test now, but may be a few days before I can provide it. > Wanted to get it in the system. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
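Karl's whole-sentence gramming can be reproduced in a few lines of plain Java. This is a self-contained sketch; `charNGrams` is a hypothetical helper, not part of any actual Lucene filter:

```java
import java.util.ArrayList;
import java.util.List;

public class SentenceGrams {
    // Slide a window of size n over the whole sentence, keeping whitespace
    // (and, if present, a '$' boundary marker) as part of each gram.
    static List<String> charNGrams(String text, int n) {
        List<String> grams = new ArrayList<String>();
        for (int i = 0; i + n <= text.length(); i++) {
            grams.add(text.substring(i, i + n));
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(charNGrams("to be o", 4)); // [to b, o be,  be , be o]
        System.out.println(charNGrams("$to be", 4));  // [$to , to b, o be]
    }
}
```

Note that prepending the `$` sentinel is what produces the "$to " gram from the comment; the space inside " be " is a legitimate part of the gram, which is exactly the point being argued.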
[jira] Commented: (LUCENE-1491) EdgeNGramTokenFilter stops on tokens smaller then minimum gram size.
[ https://issues.apache.org/jira/browse/LUCENE-1491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715567#action_12715567 ] Karl Wettin commented on LUCENE-1491: - bq. Perhaps we need boolean keepSmaller somewhere, so we can explicitly control the behaviour? I'm not sure. Is there a use case for this or is it an XY-problem? > EdgeNGramTokenFilter stops on tokens smaller then minimum gram size. > > > Key: LUCENE-1491 > URL: https://issues.apache.org/jira/browse/LUCENE-1491 > Project: Lucene - Java > Issue Type: Bug > Components: Analysis >Affects Versions: 2.4, 2.4.1, 2.9, 3.0 >Reporter: Todd Feak >Assignee: Otis Gospodnetic > Fix For: 2.9 > > Attachments: LUCENE-1491.patch > > > If a token is encountered in the stream that is shorter in length than the > min gram size, the filter will stop processing the token stream. > Working up a unit test now, but may be a few days before I can provide it. > Wanted to get it in the system. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
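For reference, the behaviour the proposed `keepSmaller` flag would toggle can be sketched in plain Java. This is a hypothetical helper, not the actual EdgeNGramTokenFilter code:

```java
import java.util.ArrayList;
import java.util.List;

public class KeepSmallerSketch {
    // Front-edge n-grams with the proposed keepSmaller switch: tokens shorter
    // than minGram are either passed through unchanged or dropped.
    static List<String> frontEdgeNGrams(String token, int minGram, int maxGram, boolean keepSmaller) {
        List<String> out = new ArrayList<String>();
        if (token.length() < minGram) {
            if (keepSmaller) {
                out.add(token); // emit the short token as-is instead of swallowing it
            }
            return out;
        }
        for (int n = minGram; n <= Math.min(maxGram, token.length()); n++) {
            out.add(token.substring(0, n));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(frontEdgeNGrams("a", 2, 3, true));     // [a]
        System.out.println(frontEdgeNGrams("a", 2, 3, false));    // []
        System.out.println(frontEdgeNGrams("abcd", 2, 3, false)); // [ab, abc]
    }
}
```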
Re: HitCollector#collect(int,float,Collection)
So, I've been sleeping on this for a few weeks. Would it be possible to solve this with a decorator? Perhaps a top level decorator that also decorates all subqueries at rewrite-time and then keeps the instantiated scorers bound to the top level decorator, i.e. makes the decorated query non-reusable. Query realQuery = ... DecoratedQuery dq = new DecoratedQuery(realQuery); searcher.search(dq, ..); Map dq.getScoringQueries(); Not quite sure if this is terrible or elegant. karl On 7 Apr 2009, at 12:17, Michael McCandless wrote: On Tue, Apr 7, 2009 at 6:13 AM, Karl Wettin wrote: On 7 Apr 2009, at 10:23, Michael McCandless wrote: Do you mean tracking the "atomic queries" that caused a given hit to match (where "atomic query" is a query that actually uses TermDocs/Positions to check matching, vs other queries like BooleanQuery that "glomm together" sub-query matches)? EG for a boolean query w/ N clauses, which of those N clauses matched? This is exactly what I mean. I do however think it makes sense to get information about non-atomic queries, as it seems reasonable that knowing that the first clause (boolean query '+(a b)') in '+(a b) -(+c +d)' is matching is more interesting than only getting to know that one of the clauses of that boolean query is matching. Ahh OK I agree. So every query in the full tree should be able to state whether it matched the doc. A natural place to do this is the Scorer API, i.e. extend it with a "getMatchingAtomicQueries" or some such. Probably, for efficiency, each Query should be pre-assigned an int position, and then the matching is represented as a bit array, reused across matches. Your collector could then ask the scorer for these bits if it wanted. There should be no performance cost for collectors that don't use this functionality. I'll look into it. Thanks for the feedback.
karl - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1578) InstantiatedIndex supports non-optimized IndexReaders
[ https://issues.apache.org/jira/browse/LUCENE-1578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715494#action_12715494 ] Karl Wettin commented on LUCENE-1578: - Jason, did you get a chance to try this out? It seems to work fine for me and I plan to pop it in the trunk in a few days. I think I'll have to add a warning of some kind at runtime though, as it could slow down the index a bit if the reader is heavily fragmented. > InstantiatedIndex supports non-optimized IndexReaders > - > > Key: LUCENE-1578 > URL: https://issues.apache.org/jira/browse/LUCENE-1578 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/* >Affects Versions: 2.4.1 >Reporter: Jason Rutherglen >Assignee: Karl Wettin > Fix For: 2.9 > > Attachments: LUCENE-1578.txt > > Original Estimate: 72h > Remaining Estimate: 72h > > InstantiatedIndex does not currently support non-optimized IndexReaders. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Assigned: (LUCENE-1578) InstantiatedIndex supports non-optimized IndexReaders
[ https://issues.apache.org/jira/browse/LUCENE-1578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wettin reassigned LUCENE-1578: --- Assignee: Karl Wettin > InstantiatedIndex supports non-optimized IndexReaders > - > > Key: LUCENE-1578 > URL: https://issues.apache.org/jira/browse/LUCENE-1578 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/* >Affects Versions: 2.4.1 >Reporter: Jason Rutherglen >Assignee: Karl Wettin > Fix For: 2.9 > > Attachments: LUCENE-1578.txt > > Original Estimate: 72h > Remaining Estimate: 72h > > InstantiatedIndex does not currently support non-optimized IndexReaders. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1260) Norm codec strategy in Similarity
[ https://issues.apache.org/jira/browse/LUCENE-1260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715492#action_12715492 ] Karl Wettin commented on LUCENE-1260: - bq. Wouldn't the simplest solution be to refactor out the static methods, replace them with instance methods and remove the getNormDecoder method? This would enable a pluggable behavior without introducing a new Codec. Hi Johan, feel free to post a patch! > Norm codec strategy in Similarity > - > > Key: LUCENE-1260 > URL: https://issues.apache.org/jira/browse/LUCENE-1260 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Affects Versions: 2.3.1 >Reporter: Karl Wettin > Attachments: LUCENE-1260.txt, LUCENE-1260.txt, LUCENE-1260.txt > > > The static span and resolution of the 8 bit norms codec might not fit with > all applications. > My use case requires that 100f-250f is discretized in 60 bags instead of the > default.. 10? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
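The pluggable codec Johan suggests might look roughly like this. `NormCodec` and `LinearNormCodec` are hypothetical names, not Lucene's actual Similarity API; the 100f-250f span and 60 bags come from the issue description:

```java
public class NormCodecSketch {
    // Hypothetical pluggable codec: instance methods instead of the static
    // Similarity encode/decode, so an application can pick its own span.
    interface NormCodec {
        byte encode(float norm);
        float decode(byte b);
    }

    // Linear codec discretizing [min, max] into `bags` buckets,
    // e.g. 100f-250f in 60 bags as in the issue's use case.
    static class LinearNormCodec implements NormCodec {
        private final float min, max;
        private final int bags;
        LinearNormCodec(float min, float max, int bags) {
            this.min = min; this.max = max; this.bags = bags;
        }
        public byte encode(float norm) {
            float clamped = Math.max(min, Math.min(max, norm));
            return (byte) Math.round((clamped - min) / (max - min) * (bags - 1));
        }
        public float decode(byte b) {
            return min + (b & 0xFF) * (max - min) / (bags - 1);
        }
    }

    public static void main(String[] args) {
        NormCodec codec = new LinearNormCodec(100f, 250f, 60);
        System.out.println(codec.encode(100f)); // 0
        System.out.println(codec.encode(250f)); // 59
        System.out.println(codec.decode(codec.encode(175f))); // close to 175 (one bag is ~2.5 wide)
    }
}
```

With 60 bags the codec fits comfortably in the existing one byte per norm, which is presumably why the instance-method refactoring alone would be enough here.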
Re: InstantiatedIndex Memory required
Hi Ravichandra, this is a question better suited to the java-user mailing list. On this list we talk about the development of the Lucene API rather than how to use it. To answer your question, there is no simple formula that says how much RAM an InstantiatedIndex will consume given the FSDirectory or RAMDirectory size. Your index is however probably way too large for InstantiatedIndex to be considerably faster than RAMDirectory. There is a diagram in the Javadocs that shows the speed on a Reuters index as it grows in size: http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/store/instantiated/package-summary.html#package_description As mileage varies with term saturation you should still try benchmarking and see if there is anything to be gained. Try increasing Xmx to whatever you have; you can also take a look at -XX:+AggressiveHeap. karl On 12 May 2009, at 18:43, thiruvee wrote: Hi So far I am using RAMDirectory for my indexes. To meet the SLA of our project, I thought of using InstantiatedIndex. But when I used that, I am not able to get any output from it, and it's throwing an out of memory error. What is the ratio between index size and memory size when using InstantiatedIndex? Here are my index details: Index size : 200MB RAM Size : 1 GB If I try with a small test index of size 100KB, it's working. Please help me with this. Thanks Ravichandra -- View this message in context: http://www.nabble.com/InstantiatedIndex-Memory-required-tp23506231p23506231.html Sent from the Lucene - Java Developer mailing list archive at Nabble.com. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
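Since there is no formula, one rough empirical approach is to measure used heap before and after loading the index. Below is a minimal sketch; the load step is simulated with a plain byte array so the code is self-contained, where in real use it would be the InstantiatedIndex constructor. Heap measurements like this are approximate because GC timing adds noise:

```java
public class HeapDeltaSketch {
    // Rough empirical check of how much heap a data structure costs:
    // measure used heap before and after loading it.
    static long usedHeap() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    static long measureDelta() {
        System.gc(); // encourage a clean baseline (best effort only)
        long before = usedHeap();
        // Stand-in for e.g. new InstantiatedIndex(indexReader):
        byte[] simulatedIndex = new byte[16 * 1024 * 1024];
        long after = usedHeap();
        if (simulatedIndex.length == 0) return -1; // keep the reference live
        return after - before;
    }

    public static void main(String[] args) {
        System.out.println("approx bytes consumed: " + measureDelta());
    }
}
```

Running this twice with and without the real load step gives a ballpark figure for how far -Xmx needs to be raised.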
Re: HitCollector#collect(int,float,Collection)
On 7 Apr 2009, at 10:23, Michael McCandless wrote: Do you mean tracking the "atomic queries" that caused a given hit to match (where "atomic query" is a query that actually uses TermDocs/Positions to check matching, vs other queries like BooleanQuery that "glomm together" sub-query matches)? EG for a boolean query w/ N clauses, which of those N clauses matched? This is exactly what I mean. I do however think it makes sense to get information about non-atomic queries, as it seems reasonable that knowing that the first clause (boolean query '+(a b)') in '+(a b) -(+c +d)' is matching is more interesting than only getting to know that one of the clauses of that boolean query is matching. A natural place to do this is the Scorer API, i.e. extend it with a "getMatchingAtomicQueries" or some such. Probably, for efficiency, each Query should be pre-assigned an int position, and then the matching is represented as a bit array, reused across matches. Your collector could then ask the scorer for these bits if it wanted. There should be no performance cost for collectors that don't use this functionality. I'll look into it. Thanks for the feedback. karl - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
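The reused-bit-array idea from this thread could be sketched with java.util.BitSet as follows. All class and method names here are hypothetical stand-ins, not actual Lucene API:

```java
import java.util.BitSet;

public class MatchingQueriesSketch {
    // Every atomic query is pre-assigned a fixed int position up front; per
    // document, the scorer sets the bit for each atomic query that matched,
    // and a collector may (but need not) inspect the bits. The BitSet is
    // reused across matches, so collectors that ignore it pay nothing.
    static final int TERM_A = 0, TERM_B = 1, TERM_C = 2;

    private final BitSet matching = new BitSet(3); // reused for every doc

    void startDoc() { matching.clear(); }
    void recordMatch(int queryPosition) { matching.set(queryPosition); }
    BitSet getMatchingAtomicQueries() { return matching; }

    public static void main(String[] args) {
        MatchingQueriesSketch s = new MatchingQueriesSketch();
        s.startDoc();
        s.recordMatch(TERM_A);
        s.recordMatch(TERM_C);
        System.out.println(s.getMatchingAtomicQueries()); // {0, 2}
        s.startDoc(); // next doc: same BitSet instance, no reallocation
        System.out.println(s.getMatchingAtomicQueries()); // {}
    }
}
```

A collector that wants to keep the information past the current hit must copy the BitSet, since the scorer overwrites it on the next document.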
HitCollector#collect(int,float,Collection)
How crazy would it be to refactor HitCollector so it also accepts the matching queries? Let's ignore my use case (not sure it makes sense yet; it's related to finding a threshold between probably interesting and definitely not interesting results of huge OR-statements, but I really have to try it out before I can say if it's any good) and just focus on the speed impact. If I cleared and reused the Collection passed down to the HitCollector then it shouldn't really slow things down, right? And if I reused the collections in my TopDocsCollector as low scoring results were pushed down then it shouldn't have to be expensive there either. Or? karl - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
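The clear-and-reuse pattern asked about above would look roughly like this. The extended collect signature is hypothetical, not the real HitCollector API:

```java
import java.util.ArrayList;
import java.util.List;

public class ReusedCollectionSketch {
    // Hypothetical extended collect signature: the caller clears and refills
    // one List per hit instead of allocating a new collection every time.
    interface HitCollector {
        void collect(int doc, float score, List<String> matchingQueries);
    }

    static List<Integer> collectThreeDocs() {
        final List<Integer> collectedDocs = new ArrayList<Integer>();
        HitCollector collector = new HitCollector() {
            public void collect(int doc, float score, List<String> matchingQueries) {
                // a collector must copy matchingQueries if it wants to keep
                // the data past this call, since the caller will clear it
                collectedDocs.add(doc);
            }
        };
        List<String> reused = new ArrayList<String>(); // one instance for the whole search
        for (int doc = 0; doc < 3; doc++) {
            reused.clear();            // no new allocation per hit
            reused.add("query" + doc); // whichever queries matched this doc
            collector.collect(doc, 1.0f, reused);
        }
        return collectedDocs;
    }

    public static void main(String[] args) {
        System.out.println(collectThreeDocs()); // [0, 1, 2]
    }
}
```

The cost per hit is then a clear() plus a few adds, which supports the point that reuse should not slow collection down noticeably.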
[jira] Commented: (LUCENE-1039) Bayesian classifiers using Lucene as data store
[ https://issues.apache.org/jira/browse/LUCENE-1039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12693744#action_12693744 ] Karl Wettin commented on LUCENE-1039: - Vaijanath, can you please post a small test case that demonstrates the problem? > Bayesian classifiers using Lucene as data store > --- > > Key: LUCENE-1039 > URL: https://issues.apache.org/jira/browse/LUCENE-1039 > Project: Lucene - Java > Issue Type: New Feature > Reporter: Karl Wettin > Assignee: Karl Wettin >Priority: Minor > Attachments: LUCENE-1039.txt > > > Bayesian classifiers using Lucene as data store. Based on the Naive Bayes and > Fisher method algorithms as described by Toby Segaran in "Programming > Collective Intelligence", ISBN 978-0-596-52932-1. > Have fun. > Poor java docs, but the TestCase shows how to use it: > {code:java} > public class TestClassifier extends TestCase { > public void test() throws Exception { > InstanceFactory instanceFactory = new InstanceFactory() { > public Document factory(String text, String _class) { > Document doc = new Document(); > doc.add(new Field("class", _class, Field.Store.YES, > Field.Index.NO_NORMS)); > doc.add(new Field("text", text, Field.Store.YES, Field.Index.NO, > Field.TermVector.NO)); > doc.add(new Field("text/ngrams/start", text, Field.Store.NO, > Field.Index.TOKENIZED, Field.TermVector.YES)); > doc.add(new Field("text/ngrams/inner", text, Field.Store.NO, > Field.Index.TOKENIZED, Field.TermVector.YES)); > doc.add(new Field("text/ngrams/end", text, Field.Store.NO, > Field.Index.TOKENIZED, Field.TermVector.YES)); > return doc; > } > Analyzer analyzer = new Analyzer() { > private int minGram = 2; > private int maxGram = 3; > public TokenStream tokenStream(String fieldName, Reader reader) { > TokenStream ts = new StandardTokenizer(reader); > ts = new LowerCaseFilter(ts); > if (fieldName.endsWith("/ngrams/start")) { > ts = new EdgeNGramTokenFilter(ts, > EdgeNGramTokenFilter.Side.FRONT, minGram, maxGram); > 
} else if (fieldName.endsWith("/ngrams/inner")) { > ts = new NGramTokenFilter(ts, minGram, maxGram); > } else if (fieldName.endsWith("/ngrams/end")) { > ts = new EdgeNGramTokenFilter(ts, EdgeNGramTokenFilter.Side.BACK, > minGram, maxGram); > } > return ts; > } > }; > public Analyzer getAnalyzer() { > return analyzer; > } > }; > Directory dir = new RAMDirectory(); > new IndexWriter(dir, null, true).close(); > Instances instances = new Instances(dir, instanceFactory, "class"); > instances.addInstance("hello world", "en"); > instances.addInstance("hallå världen", "sv"); > instances.addInstance("this is london calling", "en"); > instances.addInstance("detta är london som ringer", "sv"); > instances.addInstance("john has a long mustache", "en"); > instances.addInstance("john har en lång mustache", "sv"); > instances.addInstance("all work and no play makes jack a dull boy", "en"); > instances.addInstance("att bara arbeta och aldrig leka gör jack en trist > gosse", "sv"); > instances.addInstance("shrimp sandwich", "en"); > instances.addInstance("räksmörgås", "sv"); > instances.addInstance("it's now or never", "en"); > instances.addInstance("det är nu eller aldrig", "sv"); > instances.addInstance("to tie up at a landing-stage", "en"); > instances.addInstance("att angöra en brygga", "sv"); > instances.addInstance("it's now time for the children's television > shows", "en"); > instances.addInstance("nu är det dags för barnprogram", "sv"); > instances.flush(); > testClassifier(instances, new NaiveBayesClassifier()); > testClassifier(instances, new FishersMethodClassifier()); > instances.close(); > } > private void testClassifier(Instances instances, BayesianClassifier > classifier) throws IOException { > assertEquals("sv",
[jira] Updated: (LUCENE-1578) InstantiatedIndex supports non-optimized IndexReaders
[ https://issues.apache.org/jira/browse/LUCENE-1578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wettin updated LUCENE-1578: Attachment: LUCENE-1578.txt Please test this patch using a couple of different unoptimized readers in the constructor. > InstantiatedIndex supports non-optimized IndexReaders > - > > Key: LUCENE-1578 > URL: https://issues.apache.org/jira/browse/LUCENE-1578 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/* >Affects Versions: 2.4.1 >Reporter: Jason Rutherglen > Fix For: 2.9 > > Attachments: LUCENE-1578.txt > > Original Estimate: 72h > Remaining Estimate: 72h > > InstantiatedIndex does not currently support non-optimized IndexReaders. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: InstantiatedIndex
On 28 Mar 2009, at 01:21, Jason Rutherglen wrote: I'm thinking InstantiatedIndex needs to implement either clone of all the index data or needs to be able to accept a non-optimized reader, or both. I forget what the obstacles are to implementing the non-optimized reader option? Do you think there are advantages or disadvantages when comparing the solutions? Hi Jason, I honestly don't remember the reason but it seems to have something to do with deletions. Realtime search will need to periodically merge InstantiatedIndexes. One option is to clone an existing index, then add a document to it, clone, and so on, freeze it and later merge it with other indexes. The other option that provides the same functionality is to pass the smaller readers into an InstantiatedIndex. How do you feel about something like this? public InstantiatedIndex merge(IndexReader[] readers) { Directory dir = new RAMDirectory(); IndexWriter w = new IndexWriter(dir); w.addIndexes(readers); w.commit(); w.optimize(); w.close(); IndexReader reader = IndexReader.open(dir); InstantiatedIndex ii = new InstantiatedIndex(reader); reader.close(); dir.close(); return ii; } karl - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1543) Field specified norms in MatchAllDocumentsScorer
[ https://issues.apache.org/jira/browse/LUCENE-1543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683423#action_12683423 ] Karl Wettin commented on LUCENE-1543: - bq. Karl, is there a reason why a function query can't be used in your situation? It seems like it should work? I'm sure it would. : ) I do however not understand why you think it is a more correct/nice/better/what not solution than to use this patch. This is how I reason: if the feature of norms scoring is available in all other low level queries, then it also makes sense to have it in the low level MatchAllDocumentsQuery > Field specified norms in MatchAllDocumentsScorer > - > > Key: LUCENE-1543 > URL: https://issues.apache.org/jira/browse/LUCENE-1543 > Project: Lucene - Java > Issue Type: Improvement > Components: Query/Scoring >Affects Versions: 2.4 >Reporter: Karl Wettin >Priority: Minor > Fix For: 2.9 > > Attachments: LUCENE-1543.txt > > > This patch allows for optionally setting a field to use for norms factoring > when scoring a MatchingAllDocumentsQuery. > From the test case: > {code:java} > .
> RAMDirectory dir = new RAMDirectory(); > IndexWriter iw = new IndexWriter(dir, new StandardAnalyzer(), true, > IndexWriter.MaxFieldLength.LIMITED); > iw.setMaxBufferedDocs(2); // force multi-segment > addDoc("one", iw, 1f); > addDoc("two", iw, 20f); > addDoc("three four", iw, 300f); > iw.close(); > IndexReader ir = IndexReader.open(dir); > IndexSearcher is = new IndexSearcher(ir); > ScoreDoc[] hits; > // assert with norms scoring turned off > hits = is.search(new MatchAllDocsQuery(), null, 1000).scoreDocs; > assertEquals(3, hits.length); > assertEquals("one", ir.document(hits[0].doc).get("key")); > assertEquals("two", ir.document(hits[1].doc).get("key")); > assertEquals("three four", ir.document(hits[2].doc).get("key")); > // assert with norms scoring turned on > MatchAllDocsQuery normsQuery = new MatchAllDocsQuery("key"); > assertEquals(3, hits.length); > //is.explain(normsQuery, hits[0].doc); > hits = is.search(normsQuery, null, 1000).scoreDocs; > assertEquals("three four", ir.document(hits[0].doc).get("key")); > assertEquals("two", ir.document(hits[1].doc).get("key")); > assertEquals("one", ir.document(hits[2].doc).get("key")); > {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1543) Field specified norms in MatchAllDocumentsScorer
[ https://issues.apache.org/jira/browse/LUCENE-1543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12675118#action_12675118 ] Karl Wettin commented on LUCENE-1543: - bq. Couldn't you just use a TermQuery? Or a BooleanQuery with a MatchAllDocsQuery and an optional TermQuery? Wouldn't that require a TermQuery that matches all documents? I.e. adding a term to a field in all documents? The following stuff doesn't really fit in this issue, but still. It's rather related to column stride payloads LUCENE-1231 . I've been considering adding a new "norms" field at document level for a couple of years now. 8 more bits at document level would allow for moving general document boosting out of the per-field norms blob and would increase the length normalization and per-field boost resolution quite a bit at a low cost. (I hope that is not yet another can of worms I get to open.) > Field specified norms in MatchAllDocumentsScorer > - > > Key: LUCENE-1543 > URL: https://issues.apache.org/jira/browse/LUCENE-1543 > Project: Lucene - Java > Issue Type: Improvement > Components: Query/Scoring >Affects Versions: 2.4 >Reporter: Karl Wettin >Priority: Minor > Fix For: 2.9 > > Attachments: LUCENE-1543.txt > > > This patch allows for optionally setting a field to use for norms factoring > when scoring a MatchingAllDocumentsQuery. > From the test case: > {code:java} > .
> RAMDirectory dir = new RAMDirectory(); > IndexWriter iw = new IndexWriter(dir, new StandardAnalyzer(), true, > IndexWriter.MaxFieldLength.LIMITED); > iw.setMaxBufferedDocs(2); // force multi-segment > addDoc("one", iw, 1f); > addDoc("two", iw, 20f); > addDoc("three four", iw, 300f); > iw.close(); > IndexReader ir = IndexReader.open(dir); > IndexSearcher is = new IndexSearcher(ir); > ScoreDoc[] hits; > // assert with norms scoring turned off > hits = is.search(new MatchAllDocsQuery(), null, 1000).scoreDocs; > assertEquals(3, hits.length); > assertEquals("one", ir.document(hits[0].doc).get("key")); > assertEquals("two", ir.document(hits[1].doc).get("key")); > assertEquals("three four", ir.document(hits[2].doc).get("key")); > // assert with norms scoring turned on > MatchAllDocsQuery normsQuery = new MatchAllDocsQuery("key"); > assertEquals(3, hits.length); > //is.explain(normsQuery, hits[0].doc); > hits = is.search(normsQuery, null, 1000).scoreDocs; > assertEquals("three four", ir.document(hits[0].doc).get("key")); > assertEquals("two", ir.document(hits[1].doc).get("key")); > assertEquals("one", ir.document(hits[2].doc).get("key")); > {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1543) Field specified norms in MatchAllDocumentsScorer
[ https://issues.apache.org/jira/browse/LUCENE-1543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wettin updated LUCENE-1543: Attachment: LUCENE-1543.txt > Field specified norms in MatchAllDocumentsScorer > - > > Key: LUCENE-1543 > URL: https://issues.apache.org/jira/browse/LUCENE-1543 > Project: Lucene - Java > Issue Type: Improvement > Components: Query/Scoring >Affects Versions: 2.4 > Reporter: Karl Wettin >Priority: Minor > Fix For: 2.9 > > Attachments: LUCENE-1543.txt > > > This patch allows for optionally setting a field to use for norms factoring > when scoring a MatchingAllDocumentsQuery. > From the test case: > {code:java} > . > RAMDirectory dir = new RAMDirectory(); > IndexWriter iw = new IndexWriter(dir, new StandardAnalyzer(), true, > IndexWriter.MaxFieldLength.LIMITED); > iw.setMaxBufferedDocs(2); // force multi-segment > addDoc("one", iw, 1f); > addDoc("two", iw, 20f); > addDoc("three four", iw, 300f); > iw.close(); > IndexReader ir = IndexReader.open(dir); > IndexSearcher is = new IndexSearcher(ir); > ScoreDoc[] hits; > // assert with norms scoring turned off > hits = is.search(new MatchAllDocsQuery(), null, 1000).scoreDocs; > assertEquals(3, hits.length); > assertEquals("one", ir.document(hits[0].doc).get("key")); > assertEquals("two", ir.document(hits[1].doc).get("key")); > assertEquals("three four", ir.document(hits[2].doc).get("key")); > // assert with norms scoring turned on > MatchAllDocsQuery normsQuery = new MatchAllDocsQuery("key"); > assertEquals(3, hits.length); > //is.explain(normsQuery, hits[0].doc); > hits = is.search(normsQuery, null, 1000).scoreDocs; > assertEquals("three four", ir.document(hits[0].doc).get("key")); > assertEquals("two", ir.document(hits[1].doc).get("key")); > assertEquals("one", ir.document(hits[2].doc).get("key")); > {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. 
- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Created: (LUCENE-1543) Field specified norms in MatchAllDocumentsScorer
Field specified norms in MatchAllDocumentsScorer - Key: LUCENE-1543 URL: https://issues.apache.org/jira/browse/LUCENE-1543 Project: Lucene - Java Issue Type: Improvement Components: Query/Scoring Affects Versions: 2.4 Reporter: Karl Wettin Priority: Minor Fix For: 2.9 Attachments: LUCENE-1543.txt This patch allows for optionally setting a field to use for norms factoring when scoring a MatchingAllDocumentsQuery. >From the test case: {code:java} . RAMDirectory dir = new RAMDirectory(); IndexWriter iw = new IndexWriter(dir, new StandardAnalyzer(), true, IndexWriter.MaxFieldLength.LIMITED); iw.setMaxBufferedDocs(2); // force multi-segment addDoc("one", iw, 1f); addDoc("two", iw, 20f); addDoc("three four", iw, 300f); iw.close(); IndexReader ir = IndexReader.open(dir); IndexSearcher is = new IndexSearcher(ir); ScoreDoc[] hits; // assert with norms scoring turned off hits = is.search(new MatchAllDocsQuery(), null, 1000).scoreDocs; assertEquals(3, hits.length); assertEquals("one", ir.document(hits[0].doc).get("key")); assertEquals("two", ir.document(hits[1].doc).get("key")); assertEquals("three four", ir.document(hits[2].doc).get("key")); // assert with norms scoring turned on MatchAllDocsQuery normsQuery = new MatchAllDocsQuery("key"); assertEquals(3, hits.length); //is.explain(normsQuery, hits[0].doc); hits = is.search(normsQuery, null, 1000).scoreDocs; assertEquals("three four", ir.document(hits[0].doc).get("key")); assertEquals("two", ir.document(hits[1].doc).get("key")); assertEquals("one", ir.document(hits[2].doc).get("key")); {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1537) InstantiatedIndexReader.clone
[ https://issues.apache.org/jira/browse/LUCENE-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12673610#action_12673610 ] Karl Wettin commented on LUCENE-1537: - I didn't try it out yet, but I have a few comments and questions on the patch: {code} Index: contrib/instantiated/src/java/org/apache/lucene/store/instantiated/InstantiatedIndexReader.java + + public Object clone() { +try { + doCommit(); + InstantiatedIndex clonedIndex = index.cloneWithDeletesNorms(); + return new InstantiatedIndexReader(clonedIndex); +} catch (IOException ioe) { + throw new RuntimeException("", ioe); +} + } Index: contrib/instantiated/src/java/org/apache/lucene/store/instantiated/InstantiatedIndex.java + + InstantiatedIndex cloneWithDeletesNorms() { +InstantiatedIndex clone = new InstantiatedIndex(); +clone.version = System.currentTimeMillis(); +clone.documentsByNumber = documentsByNumber; +clone.deletedDocuments = new HashSet(deletedDocuments); +clone.termsByFieldAndText = termsByFieldAndText; +clone.orderedTerms = orderedTerms; +clone.normsByFieldNameAndDocumentNumber = new HashMap(normsByFieldNameAndDocumentNumber); +clone.fieldSettings = fieldSettings; +return clone; + } {code} Perhaps we should move deleted documents to the reader? It might be a bit of work to hook it up with term enum et c, but it could be worth looking in to. I think it makes more sense to keep the same instance of InstantiatedIndex and only produce a cloned InstantiatedIndexReader. It is the reader#clone we call upon so cloning the store sounds like a future placeholder for unwanted bugs. 
I see there are some leftovers from your attempt to handle non-optimized readers: {code}
-documentsByNumber = new InstantiatedDocument[sourceIndexReader.numDocs()];
+documentsByNumber = new InstantiatedDocument[sourceIndexReader.maxDoc()];
 // create documents
 for (int i = 0; i < sourceIndexReader.numDocs(); i++) {
{code} I think if you switch to maxDoc it should also use maxDoc in the loop and skip any deleted document. {code}
-for (InstantiatedDocument document : getDocumentsByNumber()) {
+//for (InstantiatedDocument document : getDocumentsByNumber()) {
+for (InstantiatedDocument document : getDocumentsNotDeleted()) {
   for (Field field : (List) document.getDocument().getFields()) {
     if (field.isTermVectorStored() && field.isStoreOffsetWithTermVector()) {
       TermPositionVector termPositionVector = (TermPositionVector) sourceIndexReader.getTermFreqVector(document.getDocumentNumber(), field.name());
@@ -312,7 +325,15 @@
   public InstantiatedDocument[] getDocumentsByNumber() {
     return documentsByNumber;
   }
-
+
+  public List getDocumentsNotDeleted() {
+    List list = new ArrayList(documentsByNumber.length - deletedDocuments.size());
+    for (int x = 0; x < documentsByNumber.length; x++) {
+      if (!deletedDocuments.contains(x)) list.add(documentsByNumber[x]);
+    }
+    return list;
+  }
+
{code} As the source never contains any deleted documents this really doesn't do anything but consume a bit of resources, or does it? {code}
-int maxVal = getAssociatedDocuments()[max].getDocument().getDocumentNumber();
+InstantiatedTermDocumentInformation itdi = getAssociatedDocuments()[max];
+InstantiatedDocument id = itdi.getDocument();
+int maxVal = id.getDocumentNumber();
+//int maxVal = getAssociatedDocuments()[max].getDocument().getDocumentNumber();
{code} Is this refactor just for debugging purposes? I find it harder to read than the original one-liner. 
> InstantiatedIndexReader.clone > - > > Key: LUCENE-1537 > URL: https://issues.apache.org/jira/browse/LUCENE-1537 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/* >Affects Versions: 2.4 > Reporter: Jason Rutherglen >Assignee: Karl Wettin >Priority: Trivial > Fix For: 2.9 > > Attachments: LUCENE-1537.patch > > Original Estimate: 2h > Remaining Estimate: 2h > > This patch will implement IndexReader.clone for InstantiatedIndexReader. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
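The trade-off discussed above (share the immutable term dictionary between clones, defensively copy the mutable per-reader state such as deletions and norms) can be sketched in plain Java. SnapshotCloneDemo and its field names are hypothetical illustrations, not Lucene API:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class SnapshotCloneDemo {
    // Structures that are effectively immutable after commit can be
    // shared by reference between the original and its clones.
    final Map<String, String> termsByFieldAndText;

    // Per-reader mutable state (deletions, norms) is defensively copied,
    // so changes in the clone never leak into the original.
    final Set<Integer> deletedDocuments;

    SnapshotCloneDemo(Map<String, String> terms, Set<Integer> deleted) {
        this.termsByFieldAndText = terms;
        this.deletedDocuments = deleted;
    }

    SnapshotCloneDemo snapshot() {
        return new SnapshotCloneDemo(
                termsByFieldAndText,                     // shared reference
                new HashSet<Integer>(deletedDocuments)); // defensive copy
    }
}
```

This is the shape of the `cloneWithDeletesNorms()` patch: two shared references, two copied collections. The open question in the thread is only *where* that copied state should live (store vs. reader).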
[jira] Assigned: (LUCENE-1537) InstantiatedIndexReader.clone
[ https://issues.apache.org/jira/browse/LUCENE-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wettin reassigned LUCENE-1537: --- Assignee: Karl Wettin > InstantiatedIndexReader.clone > - > > Key: LUCENE-1537 > URL: https://issues.apache.org/jira/browse/LUCENE-1537 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/* >Affects Versions: 2.4 >Reporter: Jason Rutherglen >Assignee: Karl Wettin >Priority: Trivial > Fix For: 2.9 > > Attachments: LUCENE-1537.patch > > Original Estimate: 2h > Remaining Estimate: 2h > > This patch will implement IndexReader.clone for InstantiatedIndexReader. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Closed: (LUCENE-1531) contrib/xml-query-parser, BoostingTermQuery support
[ https://issues.apache.org/jira/browse/LUCENE-1531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wettin closed LUCENE-1531. --- Resolution: Fixed Lucene Fields: [New, Patch Available] (was: [Patch Available, New]) Committed revision 742411 > contrib/xml-query-parser, BoostingTermQuery support > --- > > Key: LUCENE-1531 > URL: https://issues.apache.org/jira/browse/LUCENE-1531 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/* >Affects Versions: 2.4 > Reporter: Karl Wettin >Assignee: Karl Wettin > Fix For: 2.9 > > Attachments: LUCENE-1531.txt, LUCENE-1531.txt > > > I'm not 100% on this patch. > BoostingTermQuery is a part of the spans family, but I generally use that > class as a replacement for TermQuery. Thus in the DTD I have stated that it > can be a part of the root queries as well as a part of a span. > However, SpanFooQueries xml elements are named rather than > , I have however chosen to call it . It > would be possible to set it up so it would be parsed as > when inside of a , but I just find that confusing. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Partial / starts with searching
Hi Jori, your question is better suited to the java-user list; on this list we discuss development of the API itself. To answer your question: ngrams might solve your problem, and tokenizers are available in contrib/analyzers. karl On 5 Feb 2009, at 10:19, d-fader wrote: Hi, I'm new to this list, so please don't be too harsh if I missed some rules or something. For about half a year I've been using Lucene and I think it's awesome, respect for all your efforts! Maybe the 'issue' I'm addressing now has been discussed thoroughly already; in that case I think I need some redirection to the sources of those discussions :) Anyway, here's the thing. For all I know it's impossible to search partial words with Lucene (except the asterisk method with e.g. the StandardAnalyzer -> ambul* to find ambulance). My problem with that method is that my index consists of quite a few terms. This means that if a user searches for 'ambu amster' (ambulance amsterdam), there will be so many terms to search that it's not doable. Now I started thinking about why it's impossible to search only a 'part' of a term or even only the 'start' of a term, and the only reason I could think of was that the index terms are stored tokenized (in that way you (of course) can't find partial terms, since the index actually doesn't contain the literal terms, but tokens instead). But Lucene can also store all terms untokenized, so in that case a partial search would be possible in my humble opinion, since all terms would be stored 'literally'. Maybe my thinking is wrong, I only have a black box view of Lucene, so I don't know much about the indexing algorithm and all, but I just want to know if this could be done or else why not :) You see, the users of my index want to know why they can't search parts of the words they enter and I still can't give them a really good answer, except the 'it would result in too many OR operators in the query' statement :) Thanks in advance! 
Jori - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
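For the archives: the ngram approach suggested above works by indexing extra prefix tokens at index time, so a user's partial input becomes an exact term match at query time instead of a wildcard expansion over the whole term dictionary. A minimal stdlib-only sketch of front edge n-grams (EdgeNGramDemo is a hypothetical name; the real tokenizer in contrib/analyzers is EdgeNGramTokenFilter):

```java
import java.util.ArrayList;
import java.util.List;

public class EdgeNGramDemo {
    // Produce front edge n-grams of a term, e.g. "ambulance" -> [am, amb, ambu]
    // for minGram=2, maxGram=4.
    public static List<String> frontNGrams(String term, int minGram, int maxGram) {
        List<String> grams = new ArrayList<String>();
        for (int n = minGram; n <= maxGram && n <= term.length(); n++) {
            grams.add(term.substring(0, n));
        }
        return grams;
    }

    public static void main(String[] args) {
        // Index-time: emit these grams as extra tokens in a dedicated field.
        // Query-time: a plain TermQuery for the user's prefix ("ambu") then
        // matches exactly, with no per-query term enumeration.
        System.out.println(frontNGrams("ambulance", 2, 4));
    }
}
```

The cost is a larger index (one extra token per gram); the gain is that prefix queries become constant-cost term lookups.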
[jira] Commented: (LUCENE-1531) contrib/xml-query-parser, BoostingTermQuery support
[ https://issues.apache.org/jira/browse/LUCENE-1531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12670240#action_12670240 ] Karl Wettin commented on LUCENE-1531: - Any objections to this patch? If not I'll pop it into the trunk a few days from now. > contrib/xml-query-parser, BoostingTermQuery support > --- > > Key: LUCENE-1531 > URL: https://issues.apache.org/jira/browse/LUCENE-1531 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/* >Affects Versions: 2.4 >Reporter: Karl Wettin >Assignee: Karl Wettin > Fix For: 2.9 > > Attachments: LUCENE-1531.txt, LUCENE-1531.txt > > > I'm not 100% on this patch. > BoostingTermQuery is a part of the spans family, but I generally use that > class as a replacement for TermQuery. Thus in the DTD I have stated that it > can be a part of the root queries as well as a part of a span. > However, SpanFooQueries xml elements are named rather than > , I have however chosen to call it . It > would be possible to set it up so it would be parsed as > when inside of a , but I just find that confusing. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1531) contrib/xml-query-parser, BoostingTermQuery support
[ https://issues.apache.org/jira/browse/LUCENE-1531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wettin updated LUCENE-1531: Attachment: LUCENE-1531.txt Previous patch was messed up from cloning SpanTerm.. > contrib/xml-query-parser, BoostingTermQuery support > --- > > Key: LUCENE-1531 > URL: https://issues.apache.org/jira/browse/LUCENE-1531 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/* >Affects Versions: 2.4 > Reporter: Karl Wettin >Assignee: Karl Wettin > Fix For: 2.9 > > Attachments: LUCENE-1531.txt, LUCENE-1531.txt > > > I'm not 100% on this patch. > BooleanTermQuery is a part of the spans family, but I generally use that > class as a replacement for TermQuery. Thus in the DTD I have stated that it > can be a part of the root queries as well as a part of a span. > However, SpanFooQueries xml elements are named rather than > , I have however chosen to call it . It > would be possible to set it up so it would be parsed as > when inside of a , but I just find that confusing. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1531) contrib/xml-query-parser, BoostingTermQuery support
[ https://issues.apache.org/jira/browse/LUCENE-1531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wettin updated LUCENE-1531: Attachment: LUCENE-1531.txt > contrib/xml-query-parser, BoostingTermQuery support > --- > > Key: LUCENE-1531 > URL: https://issues.apache.org/jira/browse/LUCENE-1531 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/* >Affects Versions: 2.4 > Reporter: Karl Wettin >Assignee: Karl Wettin > Fix For: 2.9 > > Attachments: LUCENE-1531.txt > > > I'm not 100% on this patch. > BoostingTermQuery is a part of the spans family, but I generally use that > class as a replacement for TermQuery. Thus in the DTD I have stated that it > can be a part of the root queries as well as a part of a span. > However, SpanFooQueries xml elements are named rather than > , I have however chosen to call it . It > would be possible to set it up so it would be parsed as > when inside of a , but I just find that confusing. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Created: (LUCENE-1531) contrib/xml-query-parser, BoostingTermQuery support
contrib/xml-query-parser, BoostingTermQuery support --- Key: LUCENE-1531 URL: https://issues.apache.org/jira/browse/LUCENE-1531 Project: Lucene - Java Issue Type: Improvement Components: contrib/* Affects Versions: 2.4 Reporter: Karl Wettin Assignee: Karl Wettin Fix For: 2.9 I'm not 100% on this patch. BoostingTermQuery is a part of the spans family, but I generally use that class as a replacement for TermQuery. Thus in the DTD I have stated that it can be a part of the root queries as well as a part of a span. However, SpanFooQueries xml elements are named rather than , I have however chosen to call it . It would be possible to set it up so it would be parsed as when inside of a , but I just find that confusing. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Filesystem based bitset
Thinking out loud: SSD is pretty close to RAM when it comes to seeking. Wouldn't that mean that a bitset stored on an SSD would be more or less as fast as a bitset in RAM? So how about storing all permutations of the filters one uses on SSD? Perhaps loading them to RAM in case they are frequently used? To me it sounds like a great idea. Not sure if one should focus on OpenBitSet or a fixed-size BitSet; I'd really need to do some real tests to tell. Still, I'm rather convinced the bang-for-the-buck ratio is quite a bit better using SSD than RAM, given that IO throughput (compare an index in RAM vs on SSD vs on HDD) isn't an issue. The only real issue I can think of is the lack of DocSetIterator#close(). karl - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
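The persistence half of the idea can be sketched with the JDK alone. This uses java.util.BitSet and java.nio rather than Lucene's OpenBitSet, so class and method names here are illustrative assumptions, not an actual Lucene API:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.BitSet;

public class DiskBitSetDemo {
    // Write a cached filter bitset to a file; the file could live on SSD.
    public static void save(BitSet bits, Path path) {
        try (FileChannel ch = FileChannel.open(path,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            ByteBuffer buf = ByteBuffer.wrap(bits.toByteArray());
            while (buf.hasRemaining()) {
                ch.write(buf);
            }
        } catch (IOException ioe) {
            throw new UncheckedIOException(ioe);
        }
    }

    // Load it back. A memory-mapped variant could test individual bits
    // without pulling the whole file onto the heap.
    public static BitSet load(Path path) {
        try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ)) {
            ByteBuffer buf = ByteBuffer.allocate((int) ch.size());
            while (buf.hasRemaining() && ch.read(buf) != -1) {
                // keep reading until the buffer is full or EOF
            }
            buf.flip();
            return BitSet.valueOf(buf);
        } catch (IOException ioe) {
            throw new UncheckedIOException(ioe);
        }
    }
}
```

A real implementation would likely memory-map the file and let the OS page cache do the RAM-promotion that the mail speculates about.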
[jira] Updated: (LUCENE-1515) Improved(?) Swedish snowball stemmer
[ https://issues.apache.org/jira/browse/LUCENE-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wettin updated LUCENE-1515: Attachment: LUCENE-1515.txt snowball code, generated java class and unit test. > Improved(?) Swedish snowball stemmer > > > Key: LUCENE-1515 > URL: https://issues.apache.org/jira/browse/LUCENE-1515 > Project: Lucene - Java > Issue Type: New Feature > Components: contrib/* >Affects Versions: 2.4 > Reporter: Karl Wettin > Attachments: LUCENE-1515.txt > > > The Snowball stemmer for Swedish lacks support for '-an' and '-ans' related > suffix stripping, ending up with incompatible stems, for example "klocka", > "klockor", "klockornas", "klockAN", "klockANS". Complete list of new suffix > stripping rules: > {pre} > 'an' 'anen' 'anens' 'anare' 'aner' 'anerna' 'anernas' > 'ans' 'ansen' 'ansens' 'anser' 'ansera' 'anserar' 'anserna' > 'ansernas' > 'iera' > (delete) > {pre} > The problem is all the exceptions (e.g. svans|svan, finans|fin, nyans|ny) and > this is an attempt at solving that problem. The rules and exceptions are > based on the [SAOL|http://en.wikipedia.org/wiki/Svenska_Akademiens_Ordlista] > entries suffixed with 'an' and 'ans'. There are a few known problematic stemming > rules, but it seems to work quite a bit better than the current SwedishStemmer. > It would not be a bad idea to check all SAOL entries in order to verify > the integrity of the rules. > My Snowball syntax skills are rather limited so I'm certain the code could be > optimized quite a bit. > *The code is released under BSD and not ASL*. I've been posting a bit in the > Snowball forum and privately to Martin Porter himself but never got any > response, so now I post it here instead in hope of some momentum. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1039) Bayesian classifiers using Lucene as data store
[ https://issues.apache.org/jira/browse/LUCENE-1039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662467#action_12662467 ] Karl Wettin commented on LUCENE-1039: - What do you people think, should I commit this to Lucene or Mahout? > Bayesian classifiers using Lucene as data store > --- > > Key: LUCENE-1039 > URL: https://issues.apache.org/jira/browse/LUCENE-1039 > Project: Lucene - Java > Issue Type: New Feature > Reporter: Karl Wettin > Assignee: Karl Wettin >Priority: Minor > Attachments: LUCENE-1039.txt > > > Bayesian classifiers using Lucene as data store. Based on the Naive Bayes and > Fisher method algorithms as described by Toby Segaran in "Programming > Collective Intelligence", ISBN 978-0-596-52932-1. > Have fun. > Poor java docs, but the TestCase shows how to use it: > {code:java} > public class TestClassifier extends TestCase { > public void test() throws Exception { > InstanceFactory instanceFactory = new InstanceFactory() { > public Document factory(String text, String _class) { > Document doc = new Document(); > doc.add(new Field("class", _class, Field.Store.YES, > Field.Index.NO_NORMS)); > doc.add(new Field("text", text, Field.Store.YES, Field.Index.NO, > Field.TermVector.NO)); > doc.add(new Field("text/ngrams/start", text, Field.Store.NO, > Field.Index.TOKENIZED, Field.TermVector.YES)); > doc.add(new Field("text/ngrams/inner", text, Field.Store.NO, > Field.Index.TOKENIZED, Field.TermVector.YES)); > doc.add(new Field("text/ngrams/end", text, Field.Store.NO, > Field.Index.TOKENIZED, Field.TermVector.YES)); > return doc; > } > Analyzer analyzer = new Analyzer() { > private int minGram = 2; > private int maxGram = 3; > public TokenStream tokenStream(String fieldName, Reader reader) { > TokenStream ts = new StandardTokenizer(reader); > ts = new LowerCaseFilter(ts); > if (fieldName.endsWith("/ngrams/start")) { > ts = new EdgeNGramTokenFilter(ts, > EdgeNGramTokenFilter.Side.FRONT, minGram, maxGram); > } else if 
(fieldName.endsWith("/ngrams/inner")) { > ts = new NGramTokenFilter(ts, minGram, maxGram); > } else if (fieldName.endsWith("/ngrams/end")) { > ts = new EdgeNGramTokenFilter(ts, EdgeNGramTokenFilter.Side.BACK, > minGram, maxGram); > } > return ts; > } > }; > public Analyzer getAnalyzer() { > return analyzer; > } > }; > Directory dir = new RAMDirectory(); > new IndexWriter(dir, null, true).close(); > Instances instances = new Instances(dir, instanceFactory, "class"); > instances.addInstance("hello world", "en"); > instances.addInstance("hallå världen", "sv"); > instances.addInstance("this is london calling", "en"); > instances.addInstance("detta är london som ringer", "sv"); > instances.addInstance("john has a long mustache", "en"); > instances.addInstance("john har en lång mustache", "sv"); > instances.addInstance("all work and no play makes jack a dull boy", "en"); > instances.addInstance("att bara arbeta och aldrig leka gör jack en trist > gosse", "sv"); > instances.addInstance("shrimp sandwich", "en"); > instances.addInstance("räksmörgås", "sv"); > instances.addInstance("it's now or never", "en"); > instances.addInstance("det är nu eller aldrig", "sv"); > instances.addInstance("to tie up at a landing-stage", "en"); > instances.addInstance("att angöra en brygga", "sv"); > instances.addInstance("it's now time for the children's television > shows", "en"); > instances.addInstance("nu är det dags för barnprogram", "sv"); > instances.flush(); > testClassifier(instances, new NaiveBayesClassifier()); > testClassifier(instances, new FishersMethodClassifier()); > instances.close(); > } > private void testClassifier(Instances instances, BayesianClassifier > classifier) throws IOException { > assertEquals("sv", classifie
[jira] Created: (LUCENE-1515) Improved(?) Swedish snowball stemmer
Improved(?) Swedish snowball stemmer Key: LUCENE-1515 URL: https://issues.apache.org/jira/browse/LUCENE-1515 Project: Lucene - Java Issue Type: New Feature Components: contrib/* Affects Versions: 2.4 Reporter: Karl Wettin The Snowball stemmer for Swedish lacks support for '-an' and '-ans' related suffix stripping, ending up with incompatible stems, for example "klocka", "klockor", "klockornas", "klockAN", "klockANS". Complete list of new suffix stripping rules: {pre} 'an' 'anen' 'anens' 'anare' 'aner' 'anerna' 'anernas' 'ans' 'ansen' 'ansens' 'anser' 'ansera' 'anserar' 'anserna' 'ansernas' 'iera' (delete) {pre} The problem is all the exceptions (e.g. svans|svan, finans|fin, nyans|ny) and this is an attempt at solving that problem. The rules and exceptions are based on the [SAOL|http://en.wikipedia.org/wiki/Svenska_Akademiens_Ordlista] entries suffixed with 'an' and 'ans'. There are a few known problematic stemming rules, but it seems to work quite a bit better than the current SwedishStemmer. It would not be a bad idea to check all SAOL entries in order to verify the integrity of the rules. My Snowball syntax skills are rather limited so I'm certain the code could be optimized quite a bit. *The code is released under BSD and not ASL*. I've been posting a bit in the Snowball forum and privately to Martin Porter himself but never got any response, so now I post it here instead in hope of some momentum. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
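A rough stdlib-only illustration of longest-match suffix stripping over the rules listed above. SwedishSuffixDemo is a hypothetical name, and it deliberately omits Snowball's R1-region check and the exception list (svans, finans, nyans, ...), so it is a sketch of the rule mechanics, not the proposed stemmer itself:

```java
import java.util.Arrays;
import java.util.Comparator;

public class SwedishSuffixDemo {
    // The suffixes proposed in LUCENE-1515; the longest match wins,
    // mirroring how Snowball chooses among competing 'among' entries.
    private static final String[] SUFFIXES = {
        "an", "anen", "anens", "anare", "aner", "anerna", "anernas",
        "ans", "ansen", "ansens", "anser", "ansera", "anserar", "anserna",
        "ansernas", "iera"
    };

    static {
        // Sort longest-first so the first endsWith hit is the longest match.
        Arrays.sort(SUFFIXES, new Comparator<String>() {
            public int compare(String a, String b) { return b.length() - a.length(); }
        });
    }

    // Strip the longest matching suffix, requiring at least a 3-letter stem
    // as a crude stand-in for Snowball's region constraints.
    public static String stem(String word) {
        for (String suffix : SUFFIXES) {
            if (word.endsWith(suffix) && word.length() > suffix.length() + 2) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }
}
```

With these rules "klockan" and "klockans" both reduce to "klock", which is exactly the stem compatibility the issue is after.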
[jira] Closed: (LUCENE-1514) ShingleMatrixFilter easily throws StackOverflow as the complexity of a matrix grows
[ https://issues.apache.org/jira/browse/LUCENE-1514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wettin closed LUCENE-1514. --- Resolution: Fixed Lucene Fields: [New, Patch Available] (was: [Patch Available, New]) Committed in revision 733064 > ShingleMatrixFilter easily throws StackOverflow as the complexity of a matrix > grows > -- > > Key: LUCENE-1514 > URL: https://issues.apache.org/jira/browse/LUCENE-1514 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/analyzers >Affects Versions: 2.4 >Reporter: Karl Wettin >Assignee: Karl Wettin > Fix For: 2.9 > > Attachments: LUCENE-1514.txt > > > ShingleMatrixFilter#next makes a recursive function invocation when the > current permutation iterator is exhausted or if the current state of the > permutation iterator already has produced an identical shingle. In a not too > complex matrix this will require a gigabyte sized stack per thread. > My solution is to avoid the recursive invocation by refactoring like this: > {code:java} > public Token next(final Token reusableToken) throws IOException { > assert reusableToken != null; > if (matrix == null) { > matrix = new Matrix(); > // fill matrix with maximumShingleSize columns > while (matrix.columns.size() < maximumShingleSize && readColumn()) { > // this loop looks ugly > } > } > // this loop exists in order to avoid recursive calls to the next method > // as the complexity of a large matrix > // then would require a multi gigabyte sized stack. > Token token; > do { > token = produceNextToken(reusableToken); > } while (token == request_next_token); > return token; > } > > private static final Token request_next_token = new Token(); > /** >* This method exists in order to avoid recursive calls to the method >* as the complexity of a fairly small matrix then easily would require >* a gigabyte sized stack per thread. 
>* >* @param reusableToken >* @return null if exhausted, instance request_next_token if one more call > is required for an answer, or instance parameter reusableToken. >* @throws IOException >*/ > private Token produceNextToken(final Token reusableToken) throws > IOException { > {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1514) ShingleMatrixFilter easily throws StackOverflow as the complexity of a matrix grows
[ https://issues.apache.org/jira/browse/LUCENE-1514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wettin updated LUCENE-1514: Attachment: LUCENE-1514.txt > ShingleMatrixFilter easily throws StackOverflow as the complexity of a matrix > grows > -- > > Key: LUCENE-1514 > URL: https://issues.apache.org/jira/browse/LUCENE-1514 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/analyzers >Affects Versions: 2.4 >Reporter: Karl Wettin >Assignee: Karl Wettin > Fix For: 2.9 > > Attachments: LUCENE-1514.txt > > > ShingleMatrixFilter#next makes a recursive function invocation when the > current permutation iterator is exhausted or if the current state of the > permutation iterator already has produced an identical shingle. In a not too > complex matrix this will require a gigabyte sized stack per thread. > My solution is to avoid the recursive invocation by refactoring like this: > {code:java} > public Token next(final Token reusableToken) throws IOException { > assert reusableToken != null; > if (matrix == null) { > matrix = new Matrix(); > // fill matrix with maximumShingleSize columns > while (matrix.columns.size() < maximumShingleSize && readColumn()) { > // this loop looks ugly > } > } > // this loop exists in order to avoid recursive calls to the next method > // as the complexity of a large matrix > // then would require a multi gigabyte sized stack. > Token token; > do { > token = produceNextToken(reusableToken); > } while (token == request_next_token); > return token; > } > > private static final Token request_next_token = new Token(); > /** >* This method exists in order to avoid recursive calls to the method >* as the complexity of a fairly small matrix then easily would require >* a gigabyte sized stack per thread. >* >* @param reusableToken >* @return null if exhausted, instance request_next_token if one more call > is required for an answer, or instance parameter reusableToken. 
>* @throws IOException >*/ > private Token produceNextToken(final Token reusableToken) throws > IOException { > {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Created: (LUCENE-1514) ShingleMatrixFilter easily throws StackOverflow as the complexity of a matrix grows
ShingleMatrixFilter easily throws StackOverflow as the complexity of a matrix grows -- Key: LUCENE-1514 URL: https://issues.apache.org/jira/browse/LUCENE-1514 Project: Lucene - Java Issue Type: Improvement Components: contrib/analyzers Affects Versions: 2.4 Reporter: Karl Wettin Assignee: Karl Wettin Fix For: 2.9 Attachments: LUCENE-1514.txt ShingleMatrixFilter#next makes a recursive function invocation when the current permutation iterator is exhausted or if the current state of the permutation iterator has already produced an identical shingle. In a not too complex matrix this will require a gigabyte sized stack per thread. My solution is to avoid the recursive invocation by refactoring like this: {code:java}
public Token next(final Token reusableToken) throws IOException {
  assert reusableToken != null;
  if (matrix == null) {
    matrix = new Matrix();
    // fill matrix with maximumShingleSize columns
    while (matrix.columns.size() < maximumShingleSize && readColumn()) {
      // this loop looks ugly
    }
  }
  // this loop exists in order to avoid recursive calls to the next method,
  // as the complexity of a large matrix
  // then would require a multi gigabyte sized stack.
  Token token;
  do {
    token = produceNextToken(reusableToken);
  } while (token == request_next_token);
  return token;
}

private static final Token request_next_token = new Token();

/**
 * This method exists in order to avoid recursive calls to the method,
 * as the complexity of a fairly small matrix then easily would require
 * a gigabyte sized stack per thread.
 *
 * @param reusableToken
 * @return null if exhausted, instance request_next_token if one more call is required for an answer, or instance parameter reusableToken.
 * @throws IOException
 */
private Token produceNextToken(final Token reusableToken) throws IOException {
{code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. 
- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
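The refactoring above is an instance of a general pattern: replace self-recursion with a driver loop that spins on a sentinel object compared by identity. A self-contained sketch (SentinelLoopDemo and its names are hypothetical, not Lucene code):

```java
public class SentinelLoopDemo {
    // Sentinel returned when the producer cannot yet supply a value.
    // new String(...) guarantees a distinct instance, so an identity
    // comparison can never collide with a real result.
    private static final String TRY_AGAIN = new String("TRY_AGAIN");

    private int attempts = 0;

    // Instead of calling itself recursively when it has nothing to emit,
    // the producer returns the sentinel and lets the caller loop. 1000
    // "retries" here would have been 1000 stack frames in recursive form.
    private String produce() {
        attempts++;
        return attempts < 1000 ? TRY_AGAIN : "token";
    }

    // Driver loop: identity comparison (==) against the sentinel, exactly
    // as ShingleMatrixFilter#next compares against request_next_token.
    public String next() {
        String result;
        do {
            result = produce();
        } while (result == TRY_AGAIN); // identity, not equals()
        return result;
    }
}
```

The stack depth stays constant no matter how many retries the producer needs, which is precisely what the patch buys ShingleMatrixFilter.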
[jira] Closed: (LUCENE-1510) InstantiatedIndexReader throws NullPointerException in norms() when used with a MultiReader
[ https://issues.apache.org/jira/browse/LUCENE-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wettin closed LUCENE-1510. --- Resolution: Fixed Fix Version/s: 2.9 > InstantiatedIndexReader throws NullPointerException in norms() when used with > a MultiReader > --- > > Key: LUCENE-1510 > URL: https://issues.apache.org/jira/browse/LUCENE-1510 > Project: Lucene - Java > Issue Type: Bug > Components: contrib/* >Affects Versions: 2.4 >Reporter: Robert Newson >Assignee: Karl Wettin > Fix For: 2.9 > > Attachments: TestWithMultiReader.java > > > When using InstantiatedIndexReader under a MultiReader where the other Reader > contains documents, a NullPointerException is thrown here; > public void norms(String field, byte[] bytes, int offset) throws IOException > { > byte[] norms = > getIndex().getNormsByFieldNameAndDocumentNumber().get(field); > System.arraycopy(norms, 0, bytes, offset, norms.length); > } > the 'norms' variable is null. Performing the copy only when norms is not null > does work, though I'm sure it's not the right fix. 
> java.lang.NullPointerException > at > org.apache.lucene.store.instantiated.InstantiatedIndexReader.norms(InstantiatedIndexReader.java:297) > at org.apache.lucene.index.MultiReader.norms(MultiReader.java:273) > at > org.apache.lucene.search.TermQuery$TermWeight.scorer(TermQuery.java:70) > at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:131) > at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:112) > at org.apache.lucene.search.Searcher.search(Searcher.java:136) > at org.apache.lucene.search.Searcher.search(Searcher.java:146) > at > org.apache.lucene.store.instantiated.TestWithMultiReader.test(TestWithMultiReader.java:41) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > at java.lang.reflect.Method.invoke(Method.java:597) > at junit.framework.TestCase.runTest(TestCase.java:164) > at junit.framework.TestCase.runBare(TestCase.java:130) > at junit.framework.TestResult$1.protect(TestResult.java:106) > at junit.framework.TestResult.runProtected(TestResult.java:124) > at junit.framework.TestResult.run(TestResult.java:109) > at junit.framework.TestCase.run(TestCase.java:120) > at junit.framework.TestSuite.runTest(TestSuite.java:230) > at junit.framework.TestSuite.run(TestSuite.java:225) > at > org.eclipse.jdt.internal.junit.runner.junit3.JUnit3TestReference.run(JUnit3TestReference.java:130) > at > org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38) > at > org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:460) > at > org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:673) > at > org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:386) > at > 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:196) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1510) InstantiatedIndexReader throws NullPointerException in norms() when used with a MultiReader
[ https://issues.apache.org/jira/browse/LUCENE-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661908#action_12661908 ] Karl Wettin commented on LUCENE-1510: - Thanks for the report Robert! I've committed a fix in revision 732661. Please check it out and let me know how it works for you. There were a few discrepancies between how the InstantiatedIndexReader handled null norms compared to a SegmentReader. I think these problems are fixed now. > InstantiatedIndexReader throws NullPointerException in norms() when used with > a MultiReader > --- > > Key: LUCENE-1510 > URL: https://issues.apache.org/jira/browse/LUCENE-1510 > Project: Lucene - Java > Issue Type: Bug > Components: contrib/* >Affects Versions: 2.4 >Reporter: Robert Newson >Assignee: Karl Wettin > Attachments: TestWithMultiReader.java > > > When using InstantiatedIndexReader under a MultiReader where the other Reader > contains documents, a NullPointerException is thrown here; > public void norms(String field, byte[] bytes, int offset) throws IOException > { > byte[] norms = > getIndex().getNormsByFieldNameAndDocumentNumber().get(field); > System.arraycopy(norms, 0, bytes, offset, norms.length); > } > the 'norms' variable is null. Performing the copy only when norms is not null > does work, though I'm sure it's not the right fix. 
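The workaround described in the report — performing the copy only when the field actually has norms — can be sketched in plain Java. This is a hypothetical stand-in class for illustration; the actual fix committed in revision 732661 may differ:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a null-guarded norms() in the style of the reported code.
// NormsGuard is a hypothetical stand-in for InstantiatedIndexReader.
public class NormsGuard {
    private final Map<String, byte[]> normsByField = new HashMap<String, byte[]>();

    public void setNorms(String field, byte[] norms) {
        normsByField.put(field, norms);
    }

    // Copies the field's norms into bytes at offset, tolerating fields
    // that have no norms -- the case that triggered the NPE under MultiReader.
    public void norms(String field, byte[] bytes, int offset) {
        byte[] norms = normsByField.get(field);
        if (norms == null) {
            return; // no norms for this field; leave bytes untouched
        }
        System.arraycopy(norms, 0, bytes, offset, norms.length);
    }

    public static void main(String[] args) {
        NormsGuard reader = new NormsGuard();
        byte[] out = new byte[4];
        reader.norms("missing", out, 0); // previously: NullPointerException
        reader.setNorms("body", new byte[]{1, 2});
        reader.norms("body", out, 2);
        System.out.println(out[2] + "," + out[3]); // prints "1,2"
    }
}
```

As the reporter notes, silently skipping the copy may not be the right fix; a SegmentReader instead fakes norms for fields that have none, which is the discrepancy the commit addressed.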
[jira] Assigned: (LUCENE-1510) InstantiatedIndexReader throws NullPointerException in norms() when used with a MultiReader
[ https://issues.apache.org/jira/browse/LUCENE-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wettin reassigned LUCENE-1510: --- Assignee: Karl Wettin > InstantiatedIndexReader throws NullPointerException in norms() when used with > a MultiReader > --- > > Key: LUCENE-1510 > URL: https://issues.apache.org/jira/browse/LUCENE-1510 > Project: Lucene - Java > Issue Type: Bug > Components: contrib/* >Affects Versions: 2.4 >Reporter: Robert Newson >Assignee: Karl Wettin > Attachments: TestWithMultiReader.java > > > When using InstantiatedIndexReader under a MultiReader where the other Reader > contains documents, a NullPointerException is thrown here; > public void norms(String field, byte[] bytes, int offset) throws IOException > { > byte[] norms = > getIndex().getNormsByFieldNameAndDocumentNumber().get(field); > System.arraycopy(norms, 0, bytes, offset, norms.length); > } > the 'norms' variable is null. Performing the copy only when norms is not null > does work, though I'm sure it's not the right fix. 
[jira] Commented: (LUCENE-1501) Phonetic filters
[ https://issues.apache.org/jira/browse/LUCENE-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12660196#action_12660196 ] Karl Wettin commented on LUCENE-1501: - bq. Ryan McKinley - 30/Dec/08 10:36 AM bq. FYI, solr includes phonetic filters also... perhaps we should consolidate? Ah, yes I think we should. I'll take a look at how they differ. > Phonetic filters > > > Key: LUCENE-1501 > URL: https://issues.apache.org/jira/browse/LUCENE-1501 > Project: Lucene - Java > Issue Type: New Feature > Components: contrib/* >Reporter: Karl Wettin >Assignee: Karl Wettin >Priority: Minor > Attachments: LUCENE-1501.txt > > > Metaphone, double metaphone, soundex and refined soundex filters using > commons codec API. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1501) Phonetic filters
[ https://issues.apache.org/jira/browse/LUCENE-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wettin updated LUCENE-1501: Attachment: LUCENE-1501.txt This is in need of a bit of documentation about the different algorithms. It could also use some tests with alternative languages. > Phonetic filters > > > Key: LUCENE-1501 > URL: https://issues.apache.org/jira/browse/LUCENE-1501 > Project: Lucene - Java > Issue Type: New Feature > Components: contrib/* > Reporter: Karl Wettin > Assignee: Karl Wettin >Priority: Minor > Attachments: LUCENE-1501.txt > > > Metaphone, double metaphone, soundex and refined soundex filters using > commons codec API. 
[jira] Created: (LUCENE-1501) Phonetic filters
Phonetic filters Key: LUCENE-1501 URL: https://issues.apache.org/jira/browse/LUCENE-1501 Project: Lucene - Java Issue Type: New Feature Components: contrib/* Reporter: Karl Wettin Assignee: Karl Wettin Priority: Minor Metaphone, double metaphone, soundex and refined soundex filters using commons codec API. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
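The filters in this issue delegate to the commons-codec encoders (org.apache.commons.codec.language.Metaphone, DoubleMetaphone, Soundex, RefinedSoundex) rather than implementing the algorithms themselves. For readers unfamiliar with what a phonetic encoder produces, here is a simplified plain-Java Soundex sketch — illustration only: it omits the full algorithm's H/W rule and assumes alphabetic input, so it is not the commons-codec class:

```java
// Simplified American Soundex: keep the first letter, map the remaining
// consonants to digit classes, drop vowels, collapse adjacent duplicate
// codes, and pad to four characters. Assumes an alphabetic, non-empty word.
public class SimpleSoundex {
    // Digit class for each letter a..z ('0' = vowel-like, not emitted).
    private static final String CODES = "01230120022455012623010202";

    public static String encode(String word) {
        String w = word.toUpperCase();
        StringBuilder sb = new StringBuilder();
        sb.append(w.charAt(0));
        char last = CODES.charAt(w.charAt(0) - 'A');
        for (int i = 1; i < w.length() && sb.length() < 4; i++) {
            char c = w.charAt(i);
            if (c < 'A' || c > 'Z') continue;
            char code = CODES.charAt(c - 'A');
            if (code != '0' && code != last) sb.append(code);
            last = code;
        }
        while (sb.length() < 4) sb.append('0');
        return sb.toString();
    }

    public static void main(String[] args) {
        // Similar-sounding names collapse to the same key.
        System.out.println(SimpleSoundex.encode("Robert") + " "
                + SimpleSoundex.encode("Rupert")); // prints "R163 R163"
    }
}
```

Indexing such codes alongside (or instead of) the original tokens is what lets a query for "Rupert" match a document containing "Robert".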
[jira] Closed: (LUCENE-1462) Instantiated/IndexWriter discrepanies
[ https://issues.apache.org/jira/browse/LUCENE-1462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wettin closed LUCENE-1462. --- Resolution: Fixed Committed in r726030 and r725837. > Instantiated/IndexWriter discrepanies > - > > Key: LUCENE-1462 > URL: https://issues.apache.org/jira/browse/LUCENE-1462 > Project: Lucene - Java > Issue Type: Bug > Components: contrib/* >Affects Versions: 2.4 > Reporter: Karl Wettin >Assignee: Karl Wettin >Priority: Critical > Fix For: 2.9 > > Attachments: LUCENE-1462.txt > > > * RAMDirectory seems to do a reset on tokenStreams the first time, this > permits to initialise some objects before starting streaming, > InstantiatedIndex does not. > * I can Serialize a RAMDirectory but I cannot on a InstantiatedIndex because > of : java.io.NotSerializableException: > org.apache.lucene.index.TermVectorOffsetInfo > http://www.nabble.com/InstatiatedIndex-questions-to20576722.html 
Re: SVN karma problem?
Everything worked great when I switched from svn.eu.apache.org to svn.apache.org. I suppose I should report that to someone. Infra? On 12 Dec 2008, at 00:13, Grant Ingersoll wrote: http://www.nabble.com/Committing-new-files-to-(write-through-proxy)-slave-repo-fails---400-Bad-Request-td20083914.html Any of that ring a bell? On Dec 11, 2008, at 5:49 PM, Karl Wettin wrote: I tried clean checkout, upgraded my SVN client and a bunch of other things. I could try to add and remove an alternative dummy file. On 11 Dec 2008, at 23:35, Grant Ingersoll wrote: Does an svn cleanup help? What about on a clean checkout? On Dec 11, 2008, at 5:13 PM, Karl Wettin wrote: I can't seem to commit new files in contrib, only update existing. Or am I misinterpreting the error? svn: Commit failed (details follow): svn: Server sent unexpected return value (400 Bad Request) in response to PROPFIND request for '/repos/asf/!svn/wrk/d81a2cce-e749-4cd0-a609-6e2a3763b81d/lucene/java/trunk/contrib/instantiated/src/test/org/apache/lucene/store/instantiated/TestSerialization.java' svn: Your commit message was left in a temporary file: svn: '/Users/kalle/projekt/apache/lucene/trunk/svn-commit.tmp' karl -- Grant Ingersoll Lucene Helpful Hints: http://wiki.apache.org/lucene-java/BasicsOfPerformance http://wiki.apache.org/lucene-java/LuceneFAQ
Re: SVN karma problem?
I tried clean checkout, upgraded my SVN client and a bunch of other things. I could try to add and remove an alternative dummy file. On 11 Dec 2008, at 23:35, Grant Ingersoll wrote: Does an svn cleanup help? What about on a clean checkout? On Dec 11, 2008, at 5:13 PM, Karl Wettin wrote: I can't seem to commit new files in contrib, only update existing. Or am I misinterpreting the error? svn: Commit failed (details follow): svn: Server sent unexpected return value (400 Bad Request) in response to PROPFIND request for '/repos/asf/!svn/wrk/d81a2cce-e749-4cd0-a609-6e2a3763b81d/lucene/java/trunk/contrib/instantiated/src/test/org/apache/lucene/store/instantiated/TestSerialization.java' svn: Your commit message was left in a temporary file: svn: '/Users/kalle/projekt/apache/lucene/trunk/svn-commit.tmp' karl
SVN karma problem?
I can't seem to commit new files in contrib, only update existing. Or am I misinterpreting the error? svn: Commit failed (details follow): svn: Server sent unexpected return value (400 Bad Request) in response to PROPFIND request for '/repos/asf/!svn/wrk/d81a2cce-e749-4cd0-a609-6e2a3763b81d/lucene/java/trunk/contrib/instantiated/src/test/org/apache/lucene/store/instantiated/TestSerialization.java' svn: Your commit message was left in a temporary file: svn: '/Users/kalle/projekt/apache/lucene/trunk/svn-commit.tmp' karl
[jira] Updated: (LUCENE-1462) Instantiated/IndexWriter discrepanies
[ https://issues.apache.org/jira/browse/LUCENE-1462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wettin updated LUCENE-1462: Fix Version/s: 2.9 > Instantiated/IndexWriter discrepanies > - > > Key: LUCENE-1462 > URL: https://issues.apache.org/jira/browse/LUCENE-1462 > Project: Lucene - Java > Issue Type: Bug > Components: contrib/* >Affects Versions: 2.4 > Reporter: Karl Wettin >Assignee: Karl Wettin >Priority: Critical > Fix For: 2.9 > > Attachments: LUCENE-1462.txt > > > * RAMDirectory seems to do a reset on tokenStreams the first time, this > permits to initialise some objects before starting streaming, > InstantiatedIndex does not. > * I can Serialize a RAMDirectory but I cannot on a InstantiatedIndex because > of : java.io.NotSerializableException: > org.apache.lucene.index.TermVectorOffsetInfo > http://www.nabble.com/InstatiatedIndex-questions-to20576722.html -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Updated: (LUCENE-1462) Instantiated/IndexWriter discrepanies
[ https://issues.apache.org/jira/browse/LUCENE-1462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wettin updated LUCENE-1462: Attachment: LUCENE-1462.txt * Made a few classes implement java.io.Serializable * TestCase that makes sure InstantiatedIndex can be passed to an ObjectOutputStream * Added a tokenStream.reset() in InstantiatedIndexWriter I need help to get this committed as it contains a minor change to TermVectorOffsetInfo (implements Serializable) that's outside of the contrib module. > Instantiated/IndexWriter discrepanies > - > > Key: LUCENE-1462 > URL: https://issues.apache.org/jira/browse/LUCENE-1462 > Project: Lucene - Java > Issue Type: Bug > Components: contrib/* >Affects Versions: 2.4 > Reporter: Karl Wettin >Assignee: Karl Wettin >Priority: Critical > Attachments: LUCENE-1462.txt > > > * RAMDirectory seems to do a reset on tokenStreams the first time, this > permits to initialise some objects before starting streaming, > InstantiatedIndex does not. > * I can Serialize a RAMDirectory but I cannot on a InstantiatedIndex because > of : java.io.NotSerializableException: > org.apache.lucene.index.TermVectorOffsetInfo > http://www.nabble.com/InstatiatedIndex-questions-to20576722.html 
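The serialization test described in the patch boils down to a byte-array round trip through ObjectOutputStream: an object graph serializes only if every class it references implements java.io.Serializable, which is why the NotSerializableException pointed at TermVectorOffsetInfo. A plain-Java sketch of that check, with a hypothetical OffsetInfo stand-in rather than the actual Lucene classes:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Demonstrates the round-trip check described in the patch. OffsetInfo is
// a hypothetical stand-in for TermVectorOffsetInfo; the one-line fix was
// adding "implements Serializable" to the referenced class.
public class RoundTrip {
    public static class OffsetInfo implements Serializable {
        public final int start, end;
        public OffsetInfo(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    // Serialize to a byte array and read the object graph back.
    @SuppressWarnings("unchecked")
    public static <T extends Serializable> T roundTrip(T obj) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        new ObjectOutputStream(bos).writeObject(obj);
        ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bos.toByteArray()));
        return (T) in.readObject();
    }

    public static void main(String[] args) throws Exception {
        OffsetInfo copy = roundTrip(new OffsetInfo(3, 9));
        System.out.println(copy.start + ".." + copy.end); // prints "3..9"
    }
}
```

Without the Serializable marker on OffsetInfo, writeObject would throw java.io.NotSerializableException — the same failure mode reported for InstantiatedIndex.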
InstantiatedIndexWriter
I was just about to get on with LUCENE-1462 when I noticed the new TokenStream API. (Yeah, I've been really busy with other stuff for a while now.) Rather than keeping InstantiatedIndexWriter in sync with IndexWriter, I'm considering suggesting that we simply delete InstantiatedIndexWriter. There is one major caveat that would go away if we removed InstantiatedIndexWriter: it lacks read/write locks at commit time. Also, the javadocs say "consider using II as an immutable store" all over the place. I'm a bit split here: I can see the use of being able to add a few documents to an existing II, but at the same time these indices are meant to be really small, so creating a new one from an IndexReader is really no big deal. This operation means a few seconds of overhead if one needs to append data to the II. I say that we should remove it from trunk. Less hassle. Or would that remove good functionality? I never use it; it was written in order to understand Lucene. But if people find it very useful then of course it should be kept in there. Removing it might be a problem for some people, though. For instance, I think Jason Rutherglen's realtime search uses this class. karl
[jira] Created: (LUCENE-1462) Instantiated/IndexWriter discrepanies
Instantiated/IndexWriter discrepanies - Key: LUCENE-1462 URL: https://issues.apache.org/jira/browse/LUCENE-1462 Project: Lucene - Java Issue Type: Bug Components: contrib/* Affects Versions: 2.4 Reporter: Karl Wettin Assignee: Karl Wettin Priority: Critical * RAMDirectory seems to do a reset on tokenStreams the first time, this permits to initialise some objects before starting streaming, InstantiatedIndex does not. * I can Serialize a RAMDirectory but I cannot on a InstantiatedIndex because of : java.io.NotSerializableException: org.apache.lucene.index.TermVectorOffsetInfo http://www.nabble.com/InstatiatedIndex-questions-to20576722.html -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Closed: (LUCENE-1423) InstantiatedTermEnum#skipTo(Term) throws ArrayIndexOutOfBoundsException on empty index
[ https://issues.apache.org/jira/browse/LUCENE-1423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wettin closed LUCENE-1423. --- Resolution: Fixed committed in rev 705893 > InstantiatedTermEnum#skipTo(Term) throws ArrayIndexOutOfBoundsException on > empty index > -- > > Key: LUCENE-1423 > URL: https://issues.apache.org/jira/browse/LUCENE-1423 > Project: Lucene - Java > Issue Type: Bug > Components: contrib/* >Affects Versions: 2.4 >Reporter: Karl Wettin >Assignee: Karl Wettin > Fix For: 2.9 > > > {code} > java.lang.ArrayIndexOutOfBoundsException: 0 > at > org.apache.lucene.store.instantiated.InstantiatedTermEnum.skipTo(InstantiatedTermEnum.java:105) > at > org.apache.lucene.store.instantiated.TestEmptyIndex.termEnumTest(TestEmptyIndex.java:73) > at > org.apache.lucene.store.instantiated.TestEmptyIndex.testTermEnum(TestEmptyIndex.java:54) > {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Created: (LUCENE-1423) InstantiatedTermEnum#skipTo(Term) throws ArrayIndexOutOfBoundsException on empty index
InstantiatedTermEnum#skipTo(Term) throws ArrayIndexOutOfBoundsException on empty index -- Key: LUCENE-1423 URL: https://issues.apache.org/jira/browse/LUCENE-1423 Project: Lucene - Java Issue Type: Bug Components: contrib/* Affects Versions: 2.4 Reporter: Karl Wettin Assignee: Karl Wettin Fix For: 2.9 {code} java.lang.ArrayIndexOutOfBoundsException: 0 at org.apache.lucene.store.instantiated.InstantiatedTermEnum.skipTo(InstantiatedTermEnum.java:105) at org.apache.lucene.store.instantiated.TestEmptyIndex.termEnumTest(TestEmptyIndex.java:73) at org.apache.lucene.store.instantiated.TestEmptyIndex.testTermEnum(TestEmptyIndex.java:54) {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Setting Fix Version in JIRA
I think it makes more sense to leave the fix version to committers, set when they assign themselves to the issue. I say this because of the hundreds of open and unreviewed issues that one would otherwise have to update in the tracker between each release. On 23 Sep 2008, at 21:33, Otis Gospodnetic wrote: Hi, When people add new issues to JIRA they most often don't set the "Fix Version" field. Would it not be better to have a default value for that field, so that new entries don't get forgotten when we filter by "Fix Version" looking for issues to fix for the next release? If every issue had "Fix Version" set we'd be able to schedule things better, give reporters and others more insight into when a particular item will be taken care of, etc. When we are ready for the release we'd just bump all unresolved issues to the next planned version (e.g. Solr 1.3.1 or 1.4 or Lucene 2.4 or 2.9) Thoughts? Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
[jira] Updated: (LUCENE-1380) Patch for ShingleFilter.enablePositions
[ https://issues.apache.org/jira/browse/LUCENE-1380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wettin updated LUCENE-1380: Lucene Fields: [New, Patch Available] (was: [Patch Available, New]) Assignee: (was: Karl Wettin) I'm unassigning myself from this issue: even though there are so many votes, I consider it a hack to add a change whose sole purpose is to change the behavior of a query parser, and I don't think such a thing should be committed. I think the focus should be on the query parser, and I understand that is a lot more work than modifying the shingle filter. If you really want to make this change in this layer, I suggest that you separate out this feature into a new filter that modifies the position increment. > Patch for ShingleFilter.enablePositions > --- > > Key: LUCENE-1380 > URL: https://issues.apache.org/jira/browse/LUCENE-1380 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/analyzers >Reporter: Mck SembWever >Priority: Trivial > Attachments: LUCENE-1380.patch, LUCENE-1380.patch > > > Make it possible for *all* words and shingles to be placed at the same > position, that is for _all_ shingles (and unigrams if included) to be treated > as synonyms of each other. > Today the shingles generated are synonyms only to the first term in the > shingle. > For example the query "abcd efgh ijkl" results in: >("abcd" "abcd efgh" "abcd efgh ijkl") ("efgh" "efgh ijkl") ("ijkl") > where "abcd efgh" and "abcd efgh ijkl" are synonyms of "abcd", and "efgh > ijkl" is a synonym of "efgh". > There exists no way today to alter which token a particular shingle is a > synonym for. > This patch takes the first step in making it possible to make all shingles > (and unigrams if included) synonyms of each other. 
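The grouping in the issue's "abcd efgh ijkl" example — each shingle sharing a position with (being a "synonym" of) its first term — can be made concrete with a small plain-Java sketch. This is an illustration of the token grouping only, not Lucene's ShingleFilter:

```java
import java.util.ArrayList;
import java.util.List;

// For each input token, emit the unigram plus every shingle that starts
// at that token, up to maxSize words. Each inner list corresponds to one
// position; its members would all carry positionIncrement 0 after the
// first, which is why they behave as synonyms of the first term.
public class Shingles {
    public static List<List<String>> shinglesByPosition(String[] tokens, int maxSize) {
        List<List<String>> out = new ArrayList<List<String>>();
        for (int i = 0; i < tokens.length; i++) {
            List<String> atPos = new ArrayList<String>();
            StringBuilder sb = new StringBuilder();
            for (int j = i; j < tokens.length && j - i < maxSize; j++) {
                if (j > i) sb.append(' ');
                sb.append(tokens[j]);
                atPos.add(sb.toString());
            }
            out.add(atPos);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(shinglesByPosition(
                new String[]{"abcd", "efgh", "ijkl"}, 3));
        // prints "[[abcd, abcd efgh, abcd efgh ijkl], [efgh, efgh ijkl], [ijkl]]"
    }
}
```

The patch's request is orthogonal to generation: it asks to flatten all these groups onto one position, which is a position-increment change rather than a shingling change — hence the suggestion above to do it in a separate filter.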
[jira] Commented: (LUCENE-1387) Add LocalLucene
[ https://issues.apache.org/jira/browse/LUCENE-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12633102#action_12633102 ] Karl Wettin commented on LUCENE-1387: - bq. I'm struggling to get two of the existing tests to pass... I don't think it is from my modifications since they don't pass on the original either. On my box the test fails with different results due to the writer not being committed in setUp, giving me 0 results. After adding a commit it fails with the results you are reporting here. Is it possible that you are getting one sort of result in the original due to the non-committed writer and another error in this version due to your changes to the distance measurement? All points in the list are rather close to each other, so very small changes to the algorithm might be the problem. I have a hard time tracing the code and I'm sort of hoping this might be the problem. > Add LocalLucene > --- > > Key: LUCENE-1387 > URL: https://issues.apache.org/jira/browse/LUCENE-1387 > Project: Lucene - Java > Issue Type: New Feature > Components: contrib/* >Reporter: Grant Ingersoll >Priority: Minor > Attachments: spatial.zip > > > Local Lucene (Geo-search) has been donated to the Lucene project, per > https://issues.apache.org/jira/browse/INCUBATOR-77. This issue is to handle > the Lucene portion of integration. > See http://lucene.markmail.org/message/orzro22sqdj3wows?q=LocalLucene 
Re: 2.4 release candidate 1
403 access denied :(

Index: package.html
===
--- package.html (revision 697120)
+++ package.html (arbetskopia)
@@ -56,6 +56,8 @@
 Mileage may vary depending on term saturation.
+
+
 Populated with a single document InstantiatedIndex is almost, but not quite, as fast as MemoryIndex.
Index: doc-files/HitCollectionBench.jpg

On 19 Sep 2008, at 16:42, Michael McCandless wrote: I agree it makes sense to get this into 2.4. Yes I'll roll an RC2 soon, with all the little fixes pending on 2.4: https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&mode=hide&sorter/order=DESC&sorter/field=priority&resolution=-1&pid=12310110&fixfor=12312681 I'm not certain but I would assume you have the karma to commit to contrib on the 2.4 branch. Try it out and see? Make sure you commit to trunk too. Mike Karl Wettin wrote: There is going to be an rc2, right? A couple of people have asked me questions about the performance of InstantiatedIndex (via private mail and on the freenode #lucene channel). They have tried to use it as a replacement for RAMDirectory with rather large corpora. There is a graph in the JIRA issue that clearly shows this is not always a good idea, and I think it would be a good thing to include this graph in the package javadocs. http://issues.apache.org/jira/secure/attachment/12353601/HitCollectionBench.jpg Is there still time to get that in there? As this will be the first release containing InstantiatedIndex I'd say it makes a lot of sense to pop it in. Do I have karma to modify the branch? Binary files and patches does not compute according to svn diff. karl On 18 Sep 2008, at 20:29, Michael McCandless wrote: Hi, I just created the first release candidate for 2.4, here: http://people.apache.org/~mikemccand/staging-area/lucene2.4rc1 Please download the release candidate, kick the tires and report back on any issues you encounter. 
The plan is to make only serious bug fixes or build/doc fixes to 2.4 for ~10 days, after which, if there are no blockers, I'll call a vote for the actual release. Happy testing, and thanks! Mike