date:20090328

[jira] Updated: (LUCENE-1425) Add ConstantScore highlighting support to SpanScorer

2009-03-28 Thread Michael McCandless (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-1425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1425:
---

Fix Version/s: 2.9

 Add ConstantScore highlighting support to SpanScorer
 

 Key: LUCENE-1425
 URL: https://issues.apache.org/jira/browse/LUCENE-1425
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/highlighter
Reporter: Mark Miller
Assignee: Mark Miller
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1425.patch, LUCENE-1425.patch


 It's actually easy enough to support the family of constantscore queries with 
 the new SpanScorer. This will also remove the requirement that you rewrite 
 queries against the main index before highlighting (in fact, if you do, the 
 constantscore queries will not highlight).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org

[jira] Commented: (LUCENE-1577) Benchmark of different in RAM realtime techniques

2009-03-28 Thread Michael McCandless (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-1577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12693414#action_12693414
]

Michael McCandless commented on LUCENE-1577:

Are these tests measuring adding a single doc, then searching on it? What are
the numbers you measure in the results (eg 25882 for LuceneRealtimeWriter)?

I think we need a more realistic test for near real-time search, but I'm not
sure exactly what that is.

In LUCENE-1516 I've added a benchmark task to periodically open a new near
real-time reader from the writer, and then tested it while doing bulk indexing.
But that's not a typical test, I think (normally bulk indexing is done up
front, and only a trickle of updates to doc are then done for near real-time
search). Maybe we just need an updateDocument task, which randomly picks a doc
(identified by a primary-key docid field) and replaces it. Then, benchmark
already has the ability to rate-limit how frequently docs are updated.

Benchmark of different in RAM realtime techniques
-

Key: LUCENE-1577
URL: https://issues.apache.org/jira/browse/LUCENE-1577
Project: Lucene - Java
Issue Type: Improvement
Components: contrib/*
Affects Versions: 2.4.1
Reporter: Jason Rutherglen
Priority: Minor
Fix For: 2.9

Attachments: LUCENE-1577.patch

Original Estimate: 168h
Remaining Estimate: 168h

A place to post code that benchmarks the differences in the speed of indexing
and searching using different realtime techniques.


NIO.2

2009-03-28 Thread Michael Busch


NIO.2 sounds great.
Though, it will probably take a pretty long time before we can switch 
Lucene to Java 1.7 :(


We could write a (contrib) module that we don't ship together with the 
core that has a Directory implementation which uses NIO.2.


http://jcp.org/en/jsr/detail?id=203
http://ronsoft.net/files/WhatsNewNIO2.pdf


-Michael

[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)

2009-03-28 Thread Shai Erera (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12693443#action_12693443
 ] 

Shai Erera commented on LUCENE-1575:


{quote}
This turns deprecated HitCollector into a Collector? Seems like it
should be package private?
{quote}

Initially I wrote that, but then deleted it. I decided to make the decision as I 
create the patch. If this will be used only in IndexSearcher, then it should be 
a private static final class in IndexSearcher, otherwise a package private one. 
However, if it turns out we'd want to use it for now in other places too where 
we deprecate the HitCollector methods, then it will be public.
Anyway, it will be marked deprecated, and I have the intention to make it as 
'invisible' as possible.
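
As a rough illustration of the wrapper being discussed, here is a minimal sketch of an adapter from the deprecated HitCollector to the proposed Collector. The class shapes below are stand-ins written for this example: the names HitCollector, Collector, Scorer and HitCollectorWrapper come from the issue description, but the exact signatures (in particular setNextReader taking only a doc base) are assumptions, not the patch itself.

```java
// Hypothetical stand-ins for the classes named in LUCENE-1575; the real
// signatures were still under discussion and may differ.
abstract class HitCollector {                 // the deprecated API
    public abstract void collect(int doc, float score);
}

interface Scorer {
    float score();
}

abstract class Collector {                    // the proposed replacement
    public abstract void setScorer(Scorer scorer);
    public abstract void setNextReader(int docBase);  // assumed simplified signature
    public abstract void collect(int doc);    // doc id relative to the current reader
}

/** Adapts a deprecated HitCollector to the proposed Collector API. */
final class HitCollectorWrapper extends Collector {
    private final HitCollector delegate;
    private Scorer scorer;
    private int docBase;

    HitCollectorWrapper(HitCollector delegate) {
        this.delegate = delegate;
    }

    @Override
    public void setScorer(Scorer scorer) {
        this.scorer = scorer;
    }

    @Override
    public void setNextReader(int docBase) {
        this.docBase = docBase;
    }

    @Override
    public void collect(int doc) {
        // Old-style collectors expect an absolute doc id plus the score.
        delegate.collect(doc + docBase, scorer.score());
    }
}
```

With an adapter like this, the deprecated search methods could wrap whatever HitCollector they are given and hand it to the new code path, which is why it may be possible to keep the class private to IndexSearcher.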

{quote}
This is deprecated, so we shouldn't add topDocs(start, howMany)? I
think just switch it back to extending the deprecated TopDocCollector
(like it does in 2.4)?
{quote}

That's a good idea.

{quote}
Hmm, good point. I would love to stop screening for 0 score in the
core collectors (like Solr). Maybe we fix the core collectors to not
screen by zero score, but we add a new "only keep positive scores"
collector chain/wrapper class that does the filtering and then forwards
collection to another collector? This way there's a migration path if
somehow users are relying on this.
{quote}

I can do that. Create a FilterZeroScoresCollector which wraps a Collector and 
passes forward only documents with score > 0. BTW, how can a document get a 
zero score?
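
A minimal sketch of such a filtering wrapper, assuming the proposed Collector API in which collect() receives only a doc id and the score is pulled from a Scorer set beforehand. Collector, Scorer and DocIdRecorder here are simplified stand-ins written for this example, not Lucene's actual classes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified versions of the proposed API.
interface Scorer {
    float score();
}

abstract class Collector {
    public abstract void setScorer(Scorer scorer);
    public abstract void collect(int doc);
}

/** Forwards only documents whose score is greater than zero. */
final class FilterZeroScoresCollector extends Collector {
    private final Collector delegate;
    private Scorer scorer;

    FilterZeroScoresCollector(Collector delegate) {
        this.delegate = delegate;
    }

    @Override
    public void setScorer(Scorer scorer) {
        this.scorer = scorer;
        delegate.setScorer(scorer);   // the wrapped collector may need scores too
    }

    @Override
    public void collect(int doc) {
        if (scorer.score() > 0) {
            delegate.collect(doc);
        }
    }
}

/** Trivial collector used to observe what gets through the filter. */
final class DocIdRecorder extends Collector {
    final List<Integer> docs = new ArrayList<>();

    @Override
    public void setScorer(Scorer scorer) {
        // scores are not needed by this collector
    }

    @Override
    public void collect(int doc) {
        docs.add(doc);
    }
}
```

One thing to watch in a design like this: if the wrapped collector also calls scorer.score(), the score is computed twice per document, so some form of score caching might be worth considering.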

I thought to split patches to code and test since I believe the code patch can 
be ready sooner for review. The test patch will just fix test cases. If that 
matters so much, I can create a final patch in the end which contains all the 
changes for easier commit?

 Refactoring Lucene collectors (HitCollector and extensions)
 ---

 Key: LUCENE-1575
 URL: https://issues.apache.org/jira/browse/LUCENE-1575
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Shai Erera
 Fix For: 2.9


 This issue is a result of a recent discussion we've had on the mailing list. 
 You can read the thread 
 [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html].
 We have agreed to do the following refactoring:
 * Rename MultiReaderHitCollector to Collector, with the purpose that it will 
 be the base class for all Collector implementations.
 * Deprecate HitCollector in favor of the new Collector.
 * Introduce new methods in IndexSearcher that accept Collector, and deprecate 
 those that accept HitCollector.
 ** Create a final class HitCollectorWrapper, and use it in the deprecated 
 methods in IndexSearcher, wrapping the given HitCollector.
 ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, 
 when we remove HitCollector.
 ** It will remove any instanceof checks that currently exist in IndexSearcher 
 code.
 * Create a new (abstract) TopDocsCollector, which will:
 ** Leave collect and setNextReader unimplemented.
 ** Introduce protected members PriorityQueue and totalHits.
 ** Introduce a single protected constructor which accepts a PriorityQueue.
 ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. 
 These can be used as-are by extending classes, as well as be overridden.
 ** Introduce a new topDocs(start, howMany) method which will be used as a 
 convenience method when implementing a search application which allows paging 
 through search results. It will also attempt to improve the memory 
 allocation, by allocating a ScoreDoc[] of the requested size only.
 * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() 
 and getTotalHits() implementations as they are from TopDocsCollector. The 
 class will also be made final.
 * Change TopFieldCollector to extend TopDocsCollector, and make the class 
 final. Also implement topDocs(start, howMany).
 * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, 
 instead of TopScoreDocCollector. Implement topDocs(start, howMany)
 * Review other places where HitCollector is used, such as in Scorer, 
 deprecate those places and use Collector instead.
 Additionally, the following proposal was made w.r.t. decoupling score from 
 collect():
 * Change collect to accept only a doc Id (unbased).
 * Introduce a setScorer(Scorer) method.
 * If during collect the implementation needs the score, it can call 
 scorer.score().
 If we do this, then we need to review all places in the code where 
 collect(doc, score) is called, and assert whether Scorer can be passed. Also 
 this raises a few questions:
 * What if during collect() Scorer is null? (i.e., not set) - is it even 
 possible?
 * I noticed that many (if not all) of the

Re: NIO.2

2009-03-28 Thread Earwin Burrfoot

On Sat, Mar 28, 2009 at 16:44, Michael Busch busch...@gmail.com wrote:
> NIO.2 sounds great.
> Though, it will probably take a pretty long time before we can switch Lucene
> to Java 1.7 :(
>
> We could write a (contrib) module that we don't ship together with the core
> that has a Directory implementation which uses NIO.2.
>
> http://jcp.org/en/jsr/detail?id=203
> http://ronsoft.net/files/WhatsNewNIO2.pdf
>
>
> -Michael

I was excited for a second, until I noticed they somehow lost
Big*Buffers while merging into JDK7.
http://download.java.net/jdk7/docs/api/java/nio/channels/package-summary.html
- Ctrl+F, MappedBigByteBuffer, that's the only remnant of what was
supposed to be there.

-- 
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785


Re: NIO.2

2009-03-28 Thread Uwe Schindler

From the talk at ApacheCon I remember that NIO.2 will also be available for 
Java 6 as an add-on package (a little bit later).

Uwe

Sent from a Sony Ericsson mobile phone


 Original message 
From: Michael Busch busch...@gmail.com
Sent: 
To: java-dev@lucene.apache.org
Subject: NIO.2

NIO.2 sounds great.
Though, it will probably take a pretty long time before we can switch 
Lucene to Java 1.7 :(

We could write a (contrib) module that we don't ship together with the 
core that has a Directory implementation which uses NIO.2.

http://jcp.org/en/jsr/detail?id=203
http://ronsoft.net/files/WhatsNewNIO2.pdf


-Michael


[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)

2009-03-28 Thread Marvin Humphrey (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12693448#action_12693448
]

Marvin Humphrey commented on LUCENE-1575:
-

BTW, how can a document get a zero score?

Any number of ways, since Query and Scorer are extensible. How about a
RandomScoreQuery that uses floor(rand(1.9))? Or say that you have a bitset of
docs which should match and you use that to feed a scorer. What score should
you assign? Why not 0? Why not -1? Should it matter?

Refactoring Lucene collectors (HitCollector and extensions)
---

Key: LUCENE-1575
URL: https://issues.apache.org/jira/browse/LUCENE-1575
Project: Lucene - Java
Issue Type: Improvement
Components: Search
Reporter: Shai Erera
Fix For: 2.9

This issue is a result of a recent discussion we've had on the mailing list.
You can read the thread
[here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html].
We have agreed to do the following refactoring:
* Rename MultiReaderHitCollector to Collector, with the purpose that it will
be the base class for all Collector implementations.
* Deprecate HitCollector in favor of the new Collector.
* Introduce new methods in IndexSearcher that accept Collector, and deprecate
those that accept HitCollector.
** Create a final class HitCollectorWrapper, and use it in the deprecated
methods in IndexSearcher, wrapping the given HitCollector.
** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0,
when we remove HitCollector.
** It will remove any instanceof checks that currently exist in IndexSearcher
code.
* Create a new (abstract) TopDocsCollector, which will:
** Leave collect and setNextReader unimplemented.
** Introduce protected members PriorityQueue and totalHits.
** Introduce a single protected constructor which accepts a PriorityQueue.
** Implement topDocs() and getTotalHits() using the PQ and totalHits members.
These can be used as-are by extending classes, as well as be overridden.
** Introduce a new topDocs(start, howMany) method which will be used as a
convenience method when implementing a search application which allows paging
through search results. It will also attempt to improve the memory
allocation, by allocating a ScoreDoc[] of the requested size only.
* Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs()
and getTotalHits() implementations as they are from TopDocsCollector. The
class will also be made final.
* Change TopFieldCollector to extend TopDocsCollector, and make the class
final. Also implement topDocs(start, howMany).
* Change TopFieldDocCollector (deprecated) to extend TopDocsCollector,
instead of TopScoreDocCollector. Implement topDocs(start, howMany)
* Review other places where HitCollector is used, such as in Scorer,
deprecate those places and use Collector instead.
Additionally, the following proposal was made w.r.t. decoupling score from
collect():
* Change collect to accept only a doc Id (unbased).
* Introduce a setScorer(Scorer) method.
* If during collect the implementation needs the score, it can call
scorer.score().
If we do this, then we need to review all places in the code where
collect(doc, score) is called, and assert whether Scorer can be passed. Also
this raises a few questions:
* What if during collect() Scorer is null? (i.e., not set) - is it even
possible?
* I noticed that many (if not all) of the collect() implementations discard
the document if its score is not greater than 0. Doesn't it mean that score
is needed in collect() always?
Open issues:
* The name for Collector
* TopDocsCollector was mentioned on the thread as TopResultsCollector, but
that was when we thought to call Collector ResultsCollector. Since we decided
(so far) on Collector, I think TopDocsCollector makes sense, because of its
TopDocs output.
* Decoupling score from collect().
I will post a patch a bit later, as this is expected to be a very large
patch. I will split it into 2: (1) code patch (2) test cases (moving to use
Collector instead of HitCollector, as well as testing the new topDocs(start,
howMany) method).
There might be even a 3rd patch which handles the setScorer thing in
Collector (maybe even a different issue?)
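
To make the topDocs(start, howMany) idea from the description above more concrete, here is a small sketch of paging out of a score-ordered priority queue while allocating an array of only the requested size. It uses java.util.PriorityQueue rather than Lucene's own PriorityQueue, and ScoreDoc and TopDocsPager are simplified classes invented for this example, so treat it as an illustration of the approach rather than the patch.

```java
import java.util.PriorityQueue;

// Simplified stand-in for Lucene's ScoreDoc.
final class ScoreDoc {
    final int doc;
    final float score;

    ScoreDoc(int doc, float score) {
        this.doc = doc;
        this.score = score;
    }
}

final class TopDocsPager {
    // Min-heap: the lowest-scoring hit sits at the head of the queue.
    private final PriorityQueue<ScoreDoc> pq =
        new PriorityQueue<ScoreDoc>((a, b) -> Float.compare(a.score, b.score));

    void add(int doc, float score) {
        pq.add(new ScoreDoc(doc, score));
    }

    /**
     * Returns hits [start, start + howMany) in descending score order.
     * Only an array of the page size is allocated. The queue is drained,
     * so this sketch supports a single call per collected result set.
     */
    ScoreDoc[] topDocs(int start, int howMany) {
        int size = pq.size();
        if (start < 0 || start >= size || howMany <= 0) {
            return new ScoreDoc[0];
        }
        howMany = Math.min(size - start, howMany);
        ScoreDoc[] page = new ScoreDoc[howMany];
        // Discard the hits that rank below the requested page.
        for (int i = size - start - howMany; i > 0; i--) {
            pq.poll();
        }
        // The heap yields lowest first, so fill the page from the back.
        for (int i = howMany - 1; i >= 0; i--) {
            page[i] = pq.poll();
        }
        return page;
    }
}
```

The hits ranked above the page (the first `start` entries) simply stay in the queue and are never copied, which is where the memory saving over building the full result array comes from.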


[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)

2009-03-28 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12693462#action_12693462
 ] 

Michael McCandless commented on LUCENE-1575:


bq. I thought to split patches to code and test since I believe the code patch 
can be ready sooner for review. The test patch will just fix test cases. If 
that matters so much, I can create a final patch in the end which contains all 
the changes for easier commit?

OK that sounds great.  The back-compat tests will also assert nothing broke.

bq. Anyway, it will be marked deprecated, and I have the intention to make it 
as 'invisible' as possible.

OK.

bq. BTW, how can a document get a zero score?

I've wondered the same thing.  There was this thread recently:

   http://www.nabble.com/TopDocCollector-td22244245.html


[jira] Created: (LUCENE-1579) Cloned SegmentReaders fail to share FieldCache entries

2009-03-28 Thread Michael McCandless (JIRA)

Cloned SegmentReaders fail to share FieldCache entries
--

 Key: LUCENE-1579
 URL: https://issues.apache.org/jira/browse/LUCENE-1579
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.9
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 2.9


I just hit this on LUCENE-1516, which returns cloned readOnly
readers from IndexWriter.

The problem is, when cloning, we create a new [thin] cloned
SegmentReader for each segment.  FieldCache keys directly off this
object, so if you clone the reader and do a search that requires the
FieldCache (eg, sorting) then that first search is always very slow
because every single segment is reloading the FieldCache.
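
The effect described above can be reproduced with a toy cache. In this sketch, SegmentReader and FieldCacheSketch are hypothetical classes written for illustration (the real FieldCache is more involved); getCoreKey() is an assumed accessor showing one possible direction for a fix, namely keying the cache on something that all clones of a segment share.

```java
import java.util.IdentityHashMap;
import java.util.Map;

// Hypothetical reader: clones are distinct objects that share one core key.
final class SegmentReader {
    private final Object coreKey;

    SegmentReader() {
        this.coreKey = new Object();
    }

    private SegmentReader(Object coreKey) {
        this.coreKey = coreKey;
    }

    SegmentReader cloneReader() {
        return new SegmentReader(coreKey);   // thin clone, same underlying segment
    }

    Object getCoreKey() {                    // assumed accessor, not an existing API
        return coreKey;
    }
}

// Toy cache that counts how often the expensive load happens.
final class FieldCacheSketch {
    private final Map<Object, int[]> cache = new IdentityHashMap<>();
    int loads = 0;

    int[] get(Object key) {
        int[] values = cache.get(key);
        if (values == null) {
            loads++;                          // stands in for the slow un-inversion
            values = new int[] {1, 2, 3};
            cache.put(key, values);
        }
        return values;
    }
}
```

Keyed on the reader object itself, a lookup through a clone misses and reloads; keyed on the shared core key, the clone hits the entry its original already populated.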

This is of course a complete showstopper for LUCENE-1516.

With LUCENE-831 we'll switch to a new FieldCache API; we should ensure
this bug is not present there.  We should also fix the bug in the
current FieldCache API since for 2.9, users may hit this.



[jira] Commented: (LUCENE-1516) Integrate IndexReader with IndexWriter

2009-03-28 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-1516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12693465#action_12693465
 ] 

Michael McCandless commented on LUCENE-1516:


Disregard the search time in the above results... we have a sneaky bug
(LUCENE-1579) that is causing FieldCache to not be re-used for shared
segments in a reopened reader.  This makes the search time after
reopen far worse than it should be.


 Integrate IndexReader with IndexWriter 
 ---

 Key: LUCENE-1516
 URL: https://issues.apache.org/jira/browse/LUCENE-1516
 Project: Lucene - Java
  Issue Type: Improvement
Affects Versions: 2.4
Reporter: Jason Rutherglen
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, 
 LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, 
 LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, 
 LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, 
 LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, 
 LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, 
 LUCENE-1516.patch, magnetic.png, ssd.png

   Original Estimate: 672h
  Remaining Estimate: 672h

 The current problem is an IndexReader and IndexWriter cannot be open
 at the same time and perform updates as they both require a write
 lock to the index. While methods such as IW.deleteDocuments enable
 deleting from IW, methods such as IR.deleteDocument(int doc) and
 norms updating are not available from IW. This limits the
 capabilities of performing updates to the index dynamically or in
 realtime without closing the IW and opening an IR, deleting or
 updating norms, flushing, then opening the IW again, a process which
 can be detrimental to realtime updates. 
 This patch will expose an IndexWriter.getReader method that returns
 the currently flushed state of the index as a class that implements
 IndexReader. The new IR implementation will differ from existing IR
 implementations such as MultiSegmentReader in that flushing will
 synchronize updates with IW in part by sharing the write lock. All
 methods of IR will be usable including reopen and clone. 


[jira] Commented: (LUCENE-831) Complete overhaul of FieldCache API/Implementation

2009-03-28 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12693464#action_12693464
 ] 

Michael McCandless commented on LUCENE-831:
---

Let's make sure the new API fixes LUCENE-1579.

 Complete overhaul of FieldCache API/Implementation
 --

 Key: LUCENE-831
 URL: https://issues.apache.org/jira/browse/LUCENE-831
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Hoss Man
 Fix For: 3.0

 Attachments: ExtendedDocument.java, fieldcache-overhaul.032208.diff, 
 fieldcache-overhaul.diff, fieldcache-overhaul.diff, 
 LUCENE-831.03.28.2008.diff, LUCENE-831.03.30.2008.diff, 
 LUCENE-831.03.31.2008.diff, LUCENE-831.patch, LUCENE-831.patch, 
 LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch


 Motivation:
 1) Completely overhaul the API/implementation of FieldCache type things...
 a) eliminate global static map keyed on IndexReader (thus
 eliminating synch block between completely independent IndexReaders)
 b) allow more customization of cache management (ie: use 
 expiration/replacement strategies, disk backed caches, etc)
 c) allow people to define custom cache data logic (ie: custom
 parsers, complex datatypes, etc... anything tied to a reader)
 d) allow people to inspect what's in a cache (list of CacheKeys) for
 an IndexReader so a new IndexReader can be likewise warmed. 
 e) Lend support for smarter cache management if/when
 IndexReader.reopen is added (merging of cached data from subReaders).
 2) Provide backwards compatibility to support existing FieldCache API with
 the new implementation, so there is no redundant caching as client code
 migrates to new API.


Re: NIO.2

2009-03-28 Thread Michael McCandless

I think having async IO will be great, though I wonder how we would
change Lucene to take advantage of it.  It ought to gain us
concurrency (eg we can score last chunk while we have an io request
out to retrieve next chunk, of term docs / positions / etc.).
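
A minimal sketch of that overlap, using AsynchronousFileChannel from NIO.2 (java.nio.channels, Java 7 and later). PrefetchingSum is a name made up for this example, and summing bytes stands in for "scoring" a chunk; it only shows the double-buffering pattern, not how Lucene's postings readers would actually be restructured.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.AsynchronousFileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

final class PrefetchingSum {
    /** Sums all bytes of a file, processing one chunk while the next is being read. */
    static long sumBytes(Path file, int chunkSize) {
        try (AsynchronousFileChannel ch =
                 AsynchronousFileChannel.open(file, StandardOpenOption.READ)) {
            long total = 0;
            long position = 0;
            ByteBuffer current = ByteBuffer.allocate(chunkSize);
            ByteBuffer next = ByteBuffer.allocate(chunkSize);
            Future<Integer> pending = ch.read(current, position);
            while (true) {
                int n = pending.get();            // wait for the chunk in flight
                if (n <= 0) {
                    break;                        // -1 signals end of file
                }
                position += n;
                next.clear();
                pending = ch.read(next, position); // request the next chunk now...
                current.flip();
                while (current.hasRemaining()) {   // ...and process this one meanwhile
                    total += current.get() & 0xFF;
                }
                ByteBuffer tmp = current;          // swap buffers for the next round
                current = next;
                next = tmp;
            }
            return total;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        }
    }
}
```

Whether this gains anything in practice depends on how much CPU work there is per chunk relative to the IO latency, and on the OS read-ahead already doing a similar job for sequential access.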

Watch service sounds neat.  Maybe we could use that as a way for
readers to know when to reopen themselves.

Something Lucene would really benefit from is access to madvise, so
that we could tell the OS that the massive amounts of data we are
reading & writing for merging should not be cached.  But it doesn't
look like NIO2 has exposed this...

Another thing would be control over the priority of multiple
outstanding IO tasks.  EG I'd like to say that the IO in a merge
thread is lower priority than indexing thread, which is lower priority
than searching threads.  This is even further out (I don't think OSs
expose this control?).

Mike

On Sat, Mar 28, 2009 at 9:44 AM, Michael Busch busch...@gmail.com wrote:
> NIO.2 sounds great.
> Though, it will probably take a pretty long time before we can switch Lucene
> to Java 1.7 :(
>
> We could write a (contrib) module that we don't ship together with the core
> that has a Directory implementation which uses NIO.2.
>
> http://jcp.org/en/jsr/detail?id=203
> http://ronsoft.net/files/WhatsNewNIO2.pdf
>
>
> -Michael



[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)

2009-03-28 Thread Shai Erera (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12693469#action_12693469
]

Shai Erera commented on LUCENE-1575:

After I posted the question on how can a document get a 0 score, I realized
that it's possible due to extensions of Similarity for example. Thanks Marvin
for clearing that up. I guess though that the Lucene core classes will not
assign <= 0 score to a document?

Anyway, whether it's true or not, I think I agree with Mike saying we should
remove this screening from the core collectors. If my application extends
Lucene in a way that it can assign <= 0 scores to documents, and it has the
intention of screening those documents, it should use the new
FilterZeroScoresCollector (maybe call it OnlyPositiveScoresCollector?)

I don't think that assigning <= 0 score to a document necessarily means it
should be removed from the result set.

However, Mike (and others) - isn't there a back-compatibility issue with
changing the core collectors to not screen on <=0 score documents? I mean, what
if my application relies on that and extended Lucene in a way that it sometimes
assigns 0 scores to documents? Now when I switch to 2.9, those documents
won't be filtered. I will be able to use the new FilterZeroScoresCollector, but
that'll require me to change my app's code.

Maybe just do it for the new collectors (TopScoreDocCollector and
TopFieldCollector)? I need to change my app's code anyway if I want to use
them, so as long as we document this fact in their javadocs, we should be fine?


[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)

2009-03-28 Thread Michael McCandless (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12693470#action_12693470
]

Michael McCandless commented on LUCENE-1575:

bq. However, Mike (and others) - isn't there a back-compatibility issue with
changing the core collectors to not screen on <=0 score documents?

Hmm right there is, because the search methods will use the new collectors.

bq. I need to change my app's code anyway if I want to use them, so as long as
we document this fact in their javadocs, we should be fine?

Actually there's no change to your code required (the search methods should use
the new collectors). So we do have a back-compat difference.

We could make the change (turn off filtering), but put a setter on
IndexSearcher to have it insert the PositiveScoresOnlyCollector wrapper? I
think the vast majority of users are not relying on <= 0 scoring docs to be
filtered out.


Re: NIO.2

2009-03-28 Thread Earwin Burrfoot

> I think having async IO will be great, though I wonder how we would
> change Lucene to take advantage of it.  It ought to gain us
> concurrency (eg we can score last chunk while we have an io request
> out to retrieve next chunk, of term docs / positions / etc.).

A presentation given above references Big*Buffers, including
MappedBigByteBuffer, which differ from their not-so-Big counterparts
in using long sizes/offsets. That means (woo-hoo!) a way better
MMapDirectory.
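
For context on why long offsets matter: a single MappedByteBuffer is indexed with int, so one mapping cannot cover more than 2 GB. Without Big*Buffers, a common workaround is to map a file as several chunks and translate a long position into (chunk, offset). The sketch below shows that translation; ChunkedMmap is a name invented for this example and is not MMapDirectory's code.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Read-only view of a file addressed by long positions, backed by several mappings. */
final class ChunkedMmap {
    private final MappedByteBuffer[] chunks;
    private final int chunkSize;
    private final long length;

    ChunkedMmap(Path file, int chunkSize) {
        this.chunkSize = chunkSize;
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            length = ch.size();
            int n = (int) ((length + chunkSize - 1) / chunkSize);
            chunks = new MappedByteBuffer[n];
            for (int i = 0; i < n; i++) {
                long start = (long) i * chunkSize;
                long size = Math.min(chunkSize, length - start);
                // Mappings stay valid after the channel is closed.
                chunks[i] = ch.map(FileChannel.MapMode.READ_ONLY, start, size);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    long length() {
        return length;
    }

    byte get(long pos) {
        if (pos < 0 || pos >= length) {
            throw new IndexOutOfBoundsException("pos=" + pos);
        }
        // Split the long position into a chunk index and an int offset.
        return chunks[(int) (pos / chunkSize)].get((int) (pos % chunkSize));
    }
}
```

A buffer type with long sizes and offsets would make this bookkeeping, and the per-read division, unnecessary.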

Everything else there is totally irrelevant to lucene. Okay, maybe
atomic file moves, but that part of Directory is long time deprecated
:)

-- 
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785


[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)

2009-03-28 Thread Shai Erera (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12693474#action_12693474
]

Shai Erera commented on LUCENE-1575:

bq. We could make the change (turn off filtering), but put a setter on
IndexSearcher to have it insert the PositiveScoresOnlyCollector wrapper?

Then why do that at all? If I need to call searcher.setKeepOnlyPositiveScores,
then it means a change to my code. I could then just pass in the
PositiveScoresOnlyCollector to the search methods instead, right?

I guess you are referring to the methods which don't take a collector as a
parameter and instantiate a new TopScoreDocCollector internally? I tend to
think that if someone uses those, it is just because they are simple, and I
find it very hard to imagine that that someone relies on the filtering. So
perhaps we can get away with just documenting the change in behavior?

bq. I think the vast majority of users are not relying on <= 0 scoring docs to
be filtered out.

I tend to agree. This has been around for quite some time. I checked my custom
collectors, and they do the same check. I only now realize I just followed the
code practice I saw in Lucene's code, never giving much thought to whether
this can actually happen. I believe that if I had extended Lucene in a way
such that it returned <= 0 scores, I'd be aware of that and probably wouldn't
use the built-in collectors. I see no reason to filter <= 0 scored docs anyway,
and if I wanted that, I'd probably write my own filtering collector ...

I think that if we don't believe people rely on the <= 0 filtering, let's just
document it. I'd hate to add a setter method to IndexSearcher, and a unit test,
and check where else it should be added (i.e., in extending searcher classes)
and introduce a new API which we might need to deprecate some day ...
People who'll need that functionality can move to use the methods that accept a
Collector, and pass in the PositiveScoresOnlyCollector. That way we also keep
the 'fast and easy' search methods really simple, fast and easy.

Is that acceptable?

Refactoring Lucene collectors (HitCollector and extensions)
---

Key: LUCENE-1575
URL: https://issues.apache.org/jira/browse/LUCENE-1575
Project: Lucene - Java
Issue Type: Improvement
Components: Search
Reporter: Shai Erera
Fix For: 2.9

[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)

2009-03-28 Thread Michael McCandless (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12693478#action_12693478
]

Michael McCandless commented on LUCENE-1575:

bq. Then why do that at all? If I need to call
searcher.setKeepOnlyPositiveScores, then it means a change to my code. I could
then just pass in the PositiveScoresOnlyCollector to the search methods
instead, right?

OK, I agree. Let's add an entry to the top of CHANGES.txt that states this
[minor] break in back compatibility, as well as the code fragment showing how
to use that filter to get back to the pre-2.9 way?
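
As a rough illustration of the wrapper idea discussed here, the sketch below forwards only hits with a strictly positive score to a delegate. The ScoreCollector interface and the PositiveOnly class are hypothetical stand-ins written for this example; they are not Lucene's Collector API or the PositiveScoresOnlyCollector from the patch.

```java
import java.util.ArrayList;
import java.util.List;

public class PositiveScoresSketch {
    /** Hypothetical stand-in for a collector callback; not Lucene's interface. */
    public interface ScoreCollector {
        void collect(int doc, float score);
    }

    /** Wrapper that forwards only hits whose score is strictly positive. */
    public static final class PositiveOnly implements ScoreCollector {
        private final ScoreCollector delegate;

        public PositiveOnly(ScoreCollector delegate) {
            this.delegate = delegate;
        }

        @Override
        public void collect(int doc, float score) {
            if (score > 0.0f) {
                delegate.collect(doc, score);
            }
        }
    }

    /** Feeds four fixed hits through the wrapper and returns the surviving doc ids. */
    public static List<Integer> demo() {
        List<Integer> kept = new ArrayList<>();
        ScoreCollector wrapped = new PositiveOnly((doc, score) -> kept.add(doc));
        wrapped.collect(1, 0.5f);
        wrapped.collect(2, 0.0f);
        wrapped.collect(3, -1.0f);
        wrapped.collect(4, 2.0f);
        return kept;
    }
}
```

A caller who relies on the old filtering would wrap whatever collector they pass to the search method, while everyone else pays nothing for the check.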

Refactoring Lucene collectors (HitCollector and extensions)
---

Key: LUCENE-1575
URL: https://issues.apache.org/jira/browse/LUCENE-1575
Project: Lucene - Java
Issue Type: Improvement
Components: Search
Reporter: Shai Erera
Fix For: 2.9


[jira] Updated: (LUCENE-1579) Cloned SegmentReaders fail to share FieldCache entries

2009-03-28 Thread Michael McCandless (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-1579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1579:
---

Attachment: LUCENE-1579.patch

Attached patch.  I plan to commit in a day or two.

I added a new deprecated expert public method to IndexReader:
getFieldCacheWrapper().  Default impl is to return this, but
SegmentReader overrides that and returns a wrapper class that forwards
hashCode()/equals() to the underlying freqStream.
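
The idea of a cache key that delegates identity to a shared resource can be sketched as follows. All class and method names here are hypothetical and only loosely mirror the description above; the real patch may be structured differently.

```java
import java.util.HashMap;
import java.util.Map;

public class CacheKeySketch {
    /** Hypothetical reader: each clone is a new object sharing one core resource. */
    public static final class Reader {
        final Object sharedCore; // stands in for the shared underlying stream

        public Reader(Object sharedCore) {
            this.sharedCore = sharedCore;
        }

        public Reader cloneReader() {
            return new Reader(sharedCore);
        }

        /** Key to use in the cache instead of the reader itself. */
        public Object cacheKey() {
            return new Key(sharedCore);
        }
    }

    /** Wrapper whose equality is the identity of the shared core, not of the reader. */
    static final class Key {
        private final Object core;

        Key(Object core) {
            this.core = core;
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof Key && ((Key) o).core == core;
        }

        @Override
        public int hashCode() {
            return System.identityHashCode(core);
        }
    }

    /** Populates a cache via the original reader, then looks it up via a clone. */
    public static boolean cloneHitsCache() {
        Map<Object, String> cache = new HashMap<>();
        Reader original = new Reader(new Object());
        cache.put(original.cacheKey(), "field values");
        return cache.containsKey(original.cloneReader().cacheKey());
    }
}
```

Because the key compares the shared core by identity, a clone finds the entry loaded by the original reader instead of triggering a reload.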


 Cloned SegmentReaders fail to share FieldCache entries
 --

 Key: LUCENE-1579
 URL: https://issues.apache.org/jira/browse/LUCENE-1579
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.9
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 2.9

 Attachments: LUCENE-1579.patch


 I just hit this on LUCENE-1516, which returns cloned readOnly
 readers from IndexWriter.
 The problem is, when cloning, we create a new [thin] cloned
 SegmentReader for each segment.  FieldCache keys directly off this
 object, so if you clone the reader and do a search that requires the
 FieldCache (eg, sorting) then that first search is always very slow
 because every single segment is reloading the FieldCache.
 This is of course a complete showstopper for LUCENE-1516.
 With LUCENE-831 we'll switch to a new FieldCache API; we should ensure
 this bug is not present there.  We should also fix the bug in the
 current FieldCache API since for 2.9, users may hit this.


[jira] Resolved: (LUCENE-1573) IndexWriter does not do the right thing when a Thread is interrupt()'d

2009-03-28 Thread Michael McCandless (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-1573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-1573.


Resolution: Fixed

Thanks Jeremy.

 IndexWriter does not do the right thing when a Thread is interrupt()'d
 --

 Key: LUCENE-1573
 URL: https://issues.apache.org/jira/browse/LUCENE-1573
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.4, 2.4.1
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1573.patch


 Spinoff from here:
 
 http://www.nabble.com/Deadlock-with-concurrent-merges-and-IndexWriter--Lucene-2.4--to22714290.html
 When a Thread is interrupt()'d while inside Lucene, there is a risk currently 
 that it will cause a spinloop and starve BG merges from completing.
 Instead, when possible, we should allow interruption.  But unfortunately for 
 back-compat, we will need to wrap the exception in an unchecked version.  In 
 3.0 we can change that to simply throw InterruptedException.
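
A minimal sketch of wrapping the checked exception in an unchecked one, assuming a hypothetical UncheckedInterruptedException class (the actual patch may use a different exception type). Restoring the thread's interrupt status before rethrowing is a choice made for this sketch, so that callers further up can still observe the interruption.

```java
public class InterruptSketch {
    /** Hypothetical unchecked wrapper; the name is illustrative only. */
    public static final class UncheckedInterruptedException extends RuntimeException {
        public UncheckedInterruptedException(InterruptedException cause) {
            super(cause);
        }
    }

    /** Sleeps for the given time, rethrowing interruption as an unchecked exception. */
    public static void pause(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            throw new UncheckedInterruptedException(ie);
        }
    }

    /** Interrupts the current thread, then checks that pause() surfaces it. */
    public static boolean demo() {
        Thread.currentThread().interrupt();
        try {
            pause(1000);
            return false; // not reached: sleep throws when the flag is already set
        } catch (UncheckedInterruptedException e) {
            return Thread.interrupted(); // reads and clears the restored flag
        }
    }
}
```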


[jira] Resolved: (LUCENE-652) Compressed fields should be externalized (from Fields into Document)

2009-03-28 Thread Michael McCandless (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Michael McCandless resolved LUCENE-652.
---

Resolution: Fixed

Compressed fields should be externalized (from Fields into Document)
--

Key: LUCENE-652
URL: https://issues.apache.org/jira/browse/LUCENE-652
Project: Lucene - Java
Issue Type: Improvement
Components: Index
Affects Versions: 1.9, 2.0.0, 2.1
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
Fix For: 2.9

Attachments: LUCENE-652.patch, LUCENE-652.patch, LUCENE-652.patch

Right now, as of 2.0 release, Lucene supports compressed stored fields.
However, after discussion on java-dev, the suggestion arose, from Robert
Engels, that it would be better if this logic were moved into the Document
level. This way the indexing level just stores opaque binary fields, and
then Document handles compress/uncompressing as needed.
This approach would have prevented issues like LUCENE-629 because merging of
segments would never need to decompress.
See this thread for the recent discussion:
http://www.gossamer-threads.com/lists/lucene/java-dev/38836
When we do this we should also work on related issue LUCENE-648.
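
To make the proposal concrete, here is a sketch of compression handled entirely outside the index, so that the indexing level would only ever see an opaque byte array. It uses java.util.zip; the class and method names are assumptions for illustration and are not the API added by the patch.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class CompressionSketch {
    /** Compresses the given bytes; the result can be stored as a plain binary field. */
    public static byte[] compress(byte[] data) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(data);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Reverses compress(); throws IllegalArgumentException on malformed input. */
    public static byte[] decompress(byte[] data) {
        Inflater inflater = new Inflater();
        inflater.setInput(data);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or invalid data");
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException(e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }
}
```

With this split, merging segments copies the stored bytes verbatim and never needs to call either method.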


[jira] Created: (LUCENE-1580) ISOLatin1AccentFilter does not handle Turkish (UTF-8) chars correctly.

2009-03-28 Thread Digy (JIRA)

ISOLatin1AccentFilter does not handle Turkish (UTF-8) chars correctly.
--

 Key: LUCENE-1580
 URL: https://issues.apache.org/jira/browse/LUCENE-1580
 Project: Lucene - Java
  Issue Type: Bug
Reporter: Digy
Priority: Minor
 Attachments: ISOLatin1AccentFilter.patch

The mappings below are missing:

Ğ -- G
ğ -- g
İ -- I
ı -- i
Ş -- S
ş -- s

DIGY
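
The six mappings listed above could be expressed as a simple per-character fold, sketched below. This is a standalone illustration, not the code from the attached patch; the class name is hypothetical, and the characters are written as Unicode escapes so the source stays ASCII-only.

```java
public class TurkishFoldSketch {
    /** Maps the six Turkish letters from the report to ASCII; others pass through. */
    public static char fold(char c) {
        switch (c) {
            case '\u011E': return 'G'; // capital G with breve
            case '\u011F': return 'g'; // small g with breve
            case '\u0130': return 'I'; // capital I with dot above
            case '\u0131': return 'i'; // small dotless i
            case '\u015E': return 'S'; // capital S with cedilla
            case '\u015F': return 's'; // small s with cedilla
            default: return c;
        }
    }

    /** Applies fold(char) to every character of the input. */
    public static String fold(String s) {
        char[] chars = s.toCharArray();
        for (int i = 0; i < chars.length; i++) {
            chars[i] = fold(chars[i]);
        }
        return new String(chars);
    }
}
```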


[jira] Updated: (LUCENE-1580) ISOLatin1AccentFilter does not handle Turkish (UTF-8) chars correctly.

2009-03-28 Thread Digy (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-1580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Digy updated LUCENE-1580:
-

Attachment: ISOLatin1AccentFilter.patch

 ISOLatin1AccentFilter does not handle Turkish (UTF-8) chars correctly.
 --

 Key: LUCENE-1580
 URL: https://issues.apache.org/jira/browse/LUCENE-1580
 Project: Lucene - Java
  Issue Type: Bug
Reporter: Digy
Priority: Minor
 Attachments: ISOLatin1AccentFilter.patch




[jira] Resolved: (LUCENE-1580) ISOLatin1AccentFilter does not handle Turkish (UTF-8) chars correctly.

2009-03-28 Thread Andi Vajda (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-1580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andi Vajda resolved LUCENE-1580.


Resolution: Duplicate

See https://issues.apache.org/jira/browse/LUCENE-1390

 ISOLatin1AccentFilter does not handle Turkish (UTF-8) chars correctly.
 --

 Key: LUCENE-1580
 URL: https://issues.apache.org/jira/browse/LUCENE-1580
 Project: Lucene - Java
  Issue Type: Bug
Reporter: Digy
Priority: Minor
 Attachments: ISOLatin1AccentFilter.patch




[jira] Updated: (LUCENE-1579) Cloned SegmentReaders fail to share FieldCache entries

2009-03-28 Thread Michael McCandless (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-1579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Michael McCandless updated LUCENE-1579:
---

Attachment: LUCENE-1579.patch

New patch. The last one was causing entries in FieldCache to get booted too
soon.

Cloned SegmentReaders fail to share FieldCache entries
--

Key: LUCENE-1579
URL: https://issues.apache.org/jira/browse/LUCENE-1579
Project: Lucene - Java
Issue Type: Bug
Components: Index
Affects Versions: 2.9
Reporter: Michael McCandless
Assignee: Michael McCandless
Fix For: 2.9

Attachments: LUCENE-1579.patch, LUCENE-1579.patch



[jira] Updated: (LUCENE-1516) Integrate IndexReader with IndexWriter

2009-03-28 Thread Michael McCandless (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-1516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1516:
---

Attachment: LUCENE-1516.patch

New patch:  Fixed a few small issues... and made some changes to
contrib/benchmark to help in running more realistic near real-time
tests:

  * Fixed LineDocMaker to properly set docid primary key field.

  * Added UpdateDocTask that calls IndexWriter.updateDocument,
randomly picking a docid.

  * Added NearRealTimeReader task, that creates BG thread that every N
seconds opens a reader, runs a static search and prints results.

This patch also contains patch from LUCENE-1579.

So, using this you can 1) create a large index, 2) create an alg that
does doc updates at a fixed rate and then tests the near real-time
reader performance.


 Integrate IndexReader with IndexWriter 
 ---

 Key: LUCENE-1516
 URL: https://issues.apache.org/jira/browse/LUCENE-1516
 Project: Lucene - Java
  Issue Type: Improvement
Affects Versions: 2.4
Reporter: Jason Rutherglen
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, 
 LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, 
 LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, 
 LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, 
 LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, 
 LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, 
 LUCENE-1516.patch, LUCENE-1516.patch, magnetic.png, ssd.png

   Original Estimate: 672h
  Remaining Estimate: 672h

 The current problem is an IndexReader and IndexWriter cannot be open
 at the same time and perform updates as they both require a write
 lock to the index. While methods such as IW.deleteDocuments enables
 deleting from IW, methods such as IR.deleteDocument(int doc) and
 norms updating are not available from IW. This limits the
 capabilities of performing updates to the index dynamically or in
 realtime without closing the IW and opening an IR, deleting or
 updating norms, flushing, then opening the IW again, a process which
 can be detrimental to realtime updates. 
 This patch will expose an IndexWriter.getReader method that returns
 the currently flushed state of the index as a class that implements
 IndexReader. The new IR implementation will differ from existing IR
 implementations such as MultiSegmentReader in that flushing will
 synchronize updates with IW in part by sharing the write lock. All
 methods of IR will be usable including reopen and clone. 


Possible IndexInput optimization

2009-03-28 Thread Earwin Burrfoot

While drooling over MappedBigByteBuffer, which we'll (hopefully) see
in JDK7, I revisited my own Directory code and noticed a certain
peculiarity, shared by Lucene core classes:
Each and every IndexInput implementation only implements readByte()
and readBytes(), never trying to override readInt/VInt/Long/etc
methods.

Currently RAMDirectory uses a list of byte arrays as a backing store,
and I got some speedup when I switched to a custom version that knows each
file size beforehand and thus is able to allocate a single byte array
(deliberately accepting the 2 GB file size limitation) of exactly the needed
length. Nothing strange here: the readByte(s) methods are easily the most
frequently called ones in a Lucene app, and they were greatly simplified.
readByte became merely:
public byte readByte() throws IOException {
  // I dropped bounds checking, relying on the natural ArrayIndexOOBE;
  // we can't easily catch and recover from it anyway
  return buffer[position++];
}

But now, readInt is four readByte calls, readLong is two readInts (ten
calls in total), readString - god knows how many. Unless you use a
single type of Directory through the lifetime of your application,
these readByte calls are never inlined, and the JIT's invokevirtual
short-circuit optimization (it skips method lookup if it always finds
the same one during this exact invocation) cannot be applied either.

There are three cases when we can override readNNN methods and provide
implementations with zero or minimum method invocations -
RAMDirectory, MMapDirectory and BufferedIndexInput for
FSDirectory/CompoundFileReader. Anybody tried this?
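
For the single-array case, the overrides in question might look roughly like the sketch below. It is a standalone class rather than a real IndexInput subclass, and the name is made up; the byte order (big-endian ints, VInts with seven data bits per byte and the low-order group first) follows the Lucene file format as I understand it, so treat that as an assumption.

```java
public class ByteArrayInputSketch {
    private final byte[] buffer;
    private int position;

    public ByteArrayInputSketch(byte[] buffer) {
        this.buffer = buffer;
    }

    public byte readByte() {
        return buffer[position++];
    }

    /** Reads a big-endian 32-bit int straight from the array, with no readByte calls. */
    public int readInt() {
        int p = position;
        position = p + 4;
        return ((buffer[p] & 0xFF) << 24) | ((buffer[p + 1] & 0xFF) << 16)
             | ((buffer[p + 2] & 0xFF) << 8) | (buffer[p + 3] & 0xFF);
    }

    /** Reads a big-endian 64-bit long as two direct int reads. */
    public long readLong() {
        long high = readInt();
        long low = readInt() & 0xFFFFFFFFL;
        return (high << 32) | low;
    }

    /** Reads a variable-length int: high bit set means another byte follows. */
    public int readVInt() {
        byte b = buffer[position++];
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = buffer[position++];
            value |= (b & 0x7F) << shift;
        }
        return value;
    }
}
```

As in the readByte above, bounds checking is left to the array access itself.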


-- 
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785


[jira] Created: (LUCENE-1581) LowerCaseFilter should be able to be configured to use a specific locale.

2009-03-28 Thread Digy (JIRA)

LowerCaseFilter should be able to be configured to use a specific locale.
-

 Key: LUCENE-1581
 URL: https://issues.apache.org/jira/browse/LUCENE-1581
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Digy


//Since I am a .Net programmer, the sample code will be in C#, but I don't think
it will be a problem to understand.
//

Assume an input text like 'İ' and an analyzer like the one below
{code}
public class SomeAnalyzer : Analyzer
{
    public override TokenStream TokenStream(string fieldName,
        System.IO.TextReader reader)
    {
        TokenStream t = new SomeTokenizer(reader);
        t = new Lucene.Net.Analysis.ASCIIFoldingFilter(t);
        t = new LowerCaseFilter(t);
        return t;
    }
}
{code}


ASCIIFoldingFilter will return 'I', and after that LowerCaseFilter will return
'i' (if the locale is en-US)
or
'ı' (if the locale is tr-TR), which means this token would have to be input to
another instance of ASCIIFoldingFilter.

So, calling LowerCaseFilter before ASCIIFoldingFilter would be a solution, but
a better approach would be to add
a new constructor to LowerCaseFilter and force it to use a specific locale.
{code}
public sealed class LowerCaseFilter : TokenFilter
{
    /* +++ */ System.Globalization.CultureInfo CultureInfo =
        System.Globalization.CultureInfo.CurrentCulture;

    public LowerCaseFilter(TokenStream input) : base(input)
    {
    }

    /* +++ */ public LowerCaseFilter(TokenStream input,
        System.Globalization.CultureInfo CultureInfo) : base(input)
    /* +++ */ {
    /* +++ */     this.CultureInfo = CultureInfo;
    /* +++ */ }

    public override Token Next(Token result)
    {
        result = Input.Next(result);
        if (result != null)
        {
            char[] buffer = result.TermBuffer();
            int length = result.termLength;
            for (int i = 0; i < length; i++)
                /* +++ */ buffer[i] = System.Char.ToLower(buffer[i], CultureInfo);

            return result;
        }
        else
            return null;
    }
}
{code}

DIGY


[jira] Commented: (LUCENE-1581) LowerCaseFilter should be able to be configured to use a specific locale.

2009-03-28 Thread Shai Erera (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12693512#action_12693512
 ] 

Shai Erera commented on LUCENE-1581:


I guess we were telepathic or something, because I reviewed LowerCaseFilter 2
days ago for the same reason :)
The thing is, in Java Character.toLowerCase does not accept a Locale, just a
char, unlike String, whose toLowerCase and toUpperCase each have a variant that
accepts a Locale parameter.

I believe that Character.toLowerCase in Java works OK, since it's based on the
Unicode spec (at least its documentation says so) - however I have to admit I haven't tested
this character specifically.
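
The difference between the two Java APIs mentioned here can be shown with a small sketch. The helper below lowercases a term buffer through String.toLowerCase(Locale); it is an illustration with made-up names, not a proposed patch, and going through a String is an assumption of the sketch (it also means the result may differ in length from the input for some characters).

```java
import java.util.Locale;

public class LocaleLowerCaseSketch {
    public static final Locale TURKISH = Locale.forLanguageTag("tr-TR");

    /** Lowercases the first 'length' chars of the buffer using the given locale. */
    public static char[] lowerCase(char[] buffer, int length, Locale locale) {
        String lowered = new String(buffer, 0, length).toLowerCase(locale);
        return lowered.toCharArray();
    }

    /** Locale-free variant: what a per-char loop over Character.toLowerCase does. */
    public static char[] lowerCase(char[] buffer, int length) {
        char[] out = new char[length];
        for (int i = 0; i < length; i++) {
            out[i] = Character.toLowerCase(buffer[i]);
        }
        return out;
    }
}
```

With the Turkish locale an ASCII 'I' becomes the dotless 'ı' (U+0131), whereas the per-char variant always yields 'i', which is the behavior difference the issue is about.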

 LowerCaseFilter should be able to be configured to use a specific locale.
 -

 Key: LUCENE-1581
 URL: https://issues.apache.org/jira/browse/LUCENE-1581
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Digy



[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)

2009-03-28 Thread Shai Erera (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12693513#action_12693513
]

Shai Erera commented on LUCENE-1575:

Great !

Refactoring Lucene collectors (HitCollector and extensions)
---

Key: LUCENE-1575
URL: https://issues.apache.org/jira/browse/LUCENE-1575
Project: Lucene - Java
Issue Type: Improvement
Components: Search
Reporter: Shai Erera
Fix For: 2.9
