[jira] Commented: (LUCENE-2091) Add BM25 Scoring to Lucene
[ https://issues.apache.org/jira/browse/LUCENE-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783558#action_12783558 ] Robert Muir commented on LUCENE-2091:

Yuval, BM25 has been working nicely for me too. On some collections it really helps, and I haven't yet found a case where it hurts (compared to Lucene's current scoring algorithm). Thanks in advance for working on this!

> Add BM25 Scoring to Lucene
> --
>
> Key: LUCENE-2091
> URL: https://issues.apache.org/jira/browse/LUCENE-2091
> Project: Lucene - Java
> Issue Type: New Feature
> Components: contrib/*
> Reporter: Yuval Feinstein
> Priority: Minor
> Fix For: 3.1
>
> Attachments: persianlucene.jpg
>
> Original Estimate: 48h
> Remaining Estimate: 48h
>
> http://nlp.uned.es/~jperezi/Lucene-BM25/ describes an implementation of
> Okapi-BM25 scoring in the Lucene framework, as an alternative to the
> standard Lucene scoring (which is a version of mixed boolean/TF-IDF).
> I have refactored this a bit, added unit tests and improved the runtime
> somewhat. I would like to contribute the code to Lucene under contrib.

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
- For additional commands, e-mail: java-dev-h...@lucene.apache.org
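For context, here is a minimal, self-contained sketch of the classic Okapi BM25 term weight that the linked implementation is based on. The class and method names and the k1/b defaults are illustrative only; this is not the API of the patch discussed in this issue.

```java
public class Bm25Sketch {
    static final double K1 = 1.2;   // term-frequency saturation
    static final double B = 0.75;   // document-length normalization

    // IDF component: ln((N - n + 0.5) / (n + 0.5)),
    // N = total docs in the collection, n = docs containing the term.
    static double idf(long numDocs, long docFreq) {
        return Math.log((numDocs - docFreq + 0.5) / (docFreq + 0.5));
    }

    // BM25 weight of one term in one document.
    static double weight(double tf, double docLen, double avgDocLen,
                         long numDocs, long docFreq) {
        double norm = K1 * (1 - B + B * docLen / avgDocLen);
        return idf(numDocs, docFreq) * (tf * (K1 + 1)) / (tf + norm);
    }

    public static void main(String[] args) {
        // A term occurring twice in an average-length doc, present in 10 of 1000 docs.
        System.out.println(weight(2, 100, 100, 1000, 10));
    }
}
```

Note the two levers discussed in BM25 tuning: k1 bounds how much repeated occurrences of a term can contribute (the tf component saturates), and b controls how strongly long documents are penalized.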
[jira] Updated: (LUCENE-2062) Bulgarian Analyzer
[ https://issues.apache.org/jira/browse/LUCENE-2062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-2062:

Attachment: LUCENE-2062.patch

Some improvements on the previous patch: mostly changing the test to work in a similar way to TestCzechStemmer, refining the stopword list, javadocs, etc. I think this one is ready. I'll commit in a few days if no one objects.

> Bulgarian Analyzer
> --
>
> Key: LUCENE-2062
> URL: https://issues.apache.org/jira/browse/LUCENE-2062
> Project: Lucene - Java
> Issue Type: New Feature
> Components: contrib/analyzers
> Reporter: Robert Muir
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2062.patch, LUCENE-2062.patch
>
> Someone asked about Bulgarian analysis on solr-user today:
> http://www.lucidimagination.com/search/document/e1e7a5636edb1db2/non_english_languages
> I was surprised we did not have anything.
> This analyzer implements the algorithm specified here:
> http://members.unine.ch/jacques.savoy/Papers/BUIR.pdf
> In the measurements there, this improves MAP by approximately 34%.
[jira] Commented: (LUCENE-2091) Add BM25 Scoring to Lucene
[ https://issues.apache.org/jira/browse/LUCENE-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783555#action_12783555 ] Yuval Feinstein commented on LUCENE-2091:

Otis and Robert, here is my (limited) experience with BM25: on a proprietary corpus (alas) I got a nice improvement, which was more pronounced in recall (hits that were previously not ranked near the top, and therefore remained unseen, now appear in the top results). I have worked on lowering the BM25 run time to a reasonable level, and I hope that once this gets into the hands of the Lucene community, BM25 performance will approach that of the current Lucene scoring. That is a tall order, as the latter has been refined over the last eight years or so. As for use cases: BM25 helps in mine, and I believe this may be true for others.
[jira] Issue Comment Edited: (LUCENE-2091) Add BM25 Scoring to Lucene
[ https://issues.apache.org/jira/browse/LUCENE-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783532#action_12783532 ] Robert Muir edited comment on LUCENE-2091 at 11/30/09 4:45 AM:

Otis, attached is a graph I produced from the Hamshahri corpus, comparing four different combinations:
* Lucene SimpleAnalyzer
* Lucene SimpleAnalyzer + BM25
* Lucene PersianAnalyzer
* Lucene PersianAnalyzer + BM25

The Hamshahri corpus contains a standardized encoding of Persian (i.e. the normalization filter is a no-op), so any analyzer gain is strictly due to "stopwords", although in Persian I wouldn't call some of these words. This was mostly to show that the analyzer is actually useful, i.e. the scoring system can't completely make up for a lack of support like this.

By the way, you can play around with the OpenRelevance SVN and duplicate my experiments on this same corpus yourself if you want. There's an Indonesian corpus there too. I've also tested Hindi with this impl.
[jira] Updated: (LUCENE-2091) Add BM25 Scoring to Lucene
[ https://issues.apache.org/jira/browse/LUCENE-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-2091:

Attachment: persianlucene.jpg
[jira] Issue Comment Edited: (LUCENE-2091) Add BM25 Scoring to Lucene
[ https://issues.apache.org/jira/browse/LUCENE-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783530#action_12783530 ] Otis Gospodnetic edited comment on LUCENE-2091 at 11/30/09 4:21 AM:

Has anyone compared this particular BM25 impl. to Lucene's current quasi-VSM approach in terms of:
* any of the relevance evaluation methods
* indexing performance
* search performance
* ...

Aha, I found something: http://markmail.org/message/c2r4v7zj7mduzs5d

Also, this issue is marked as contrib/*. Should this not go straight to core, so more people actually use it and provide feedback? Who knows, there is a chance (ha!) BM25 might turn out better than the current approach and become the default.
[jira] Updated: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-1458:

Attachment: LUCENE-1458_rotate.patch

FWIW, here is a patch to use the algorithm from the Unicode standard for comparing UTF-8 in UTF-16 sort order. They claim it is fast because there is no conditional branching... who knows.

> Further steps towards flexible indexing
> --
>
> Key: LUCENE-1458
> URL: https://issues.apache.org/jira/browse/LUCENE-1458
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Index
> Affects Versions: 2.9
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Priority: Minor
> Attachments: LUCENE-1458-back-compat.patch (x6), LUCENE-1458.patch (x13),
> LUCENE-1458.tar.bz2 (x7), LUCENE-1458_rotate.patch,
> LUCENE-1458_sortorder_bwcompat.patch, LUCENE-1458_termenum_bwcompat.patch,
> UnicodeTestCase.patch (x2)
>
> I attached a very rough checkpoint of my current patch, to get early
> feedback. All tests pass, though back-compat tests don't pass due to
> changes to package-private APIs plus certain bugs in tests that
> happened to work (e.g. calling TermPositions.nextPosition() too many times,
> which the new API asserts against).
> [Aside: I think, when we commit changes to package-private APIs such
> that back-compat tests don't pass, we could go back, make a branch on
> the back-compat tag, commit changes to the tests to use the new
> package-private APIs on that branch, then fix the nightly build to use the
> tip of that branch?]
>
> There's still plenty to do before this is committable! This is a
> rather large change:
> * Switches to a new, more efficient terms dict format. This still
> uses tii/tis files, but the tii only stores term & long offset
> (not a TermInfo). At seek points, tis encodes term & freq/prox
> offsets absolutely instead of with deltas. Also, tis/tii
> are structured by field, so we don't have to record the field number
> in every term.
> .
> On the first 1 M docs of Wikipedia, the tii file is 36% smaller (0.99 MB
> -> 0.64 MB) and the tis file is 9% smaller (75.5 MB -> 68.5 MB).
> .
> RAM usage when loading the terms dict index is significantly less,
> since we only load an array of offsets and an array of Strings (no
> more TermInfo array). It should be faster to init too.
> .
> This part is basically done.
> * Introduces a modular reader codec that strongly decouples the terms dict
> from the docs/positions readers. E.g. there is no more TermInfo used
> when reading the new format.
> .
> There's nice symmetry now between reading & writing in the codec
> chain -- the current docs/prox format is captured in:
> {code}
> FormatPostingsTermsDictWriter/Reader
> FormatPostingsDocsWriter/Reader (.frq file) and
> FormatPostingsPositionsWriter/Reader (.prx file).
> {code}
> This part is basically done.
> * Introduces a new "flex" API for iterating through the fields,
> terms, docs and positions:
> {code}
> FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
> {code}
> This replaces TermEnum/Docs/Positions. SegmentReader emulates the
> old API on top of the new API to keep back-compat.
>
> Next steps:
> * Plug in new codecs (pulsing, pfor) to exercise the modularity /
> fix any hidden assumptions.
> * Expose the new API out of IndexReader, deprecate the old API but emulate
> the old API on top of the new one, switch all core/contrib users to the
> new API.
> * Maybe switch to AttributeSources as the base class for TermsEnum,
> DocsEnum, PostingsEnum -- this would give readers API flexibility
> (not just index-file-format flexibility). E.g. if someone wanted
> to store payloads at the term-doc level instead of the
> term-doc-position level, you could just add a new attribute.
> * Test performance & iterate.
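For background on the sort-order problem the rotate patch above addresses: code point order (which equals UTF-8 binary order) and UTF-16 binary order disagree only for supplementary characters, because surrogate code units (U+D800..U+DFFF) sort below U+E000..U+FFFF in UTF-16, while the code points they encode sort above every BMP character. The classic fixup from the Unicode Standard is shown below for the UTF-16 side, as an illustration only; the attached patch handles the inverse direction (UTF-8 bytes compared in UTF-16 order) and does so branch-free.

```java
public class Utf16CodePointOrder {
    // Remap one UTF-16 code unit so plain int comparison yields code point order.
    static int fixUp(char c) {
        if (c >= 0xE000) return c - 0x800;   // E000..FFFF -> D800..F7FF
        if (c >= 0xD800) return c + 0x2000;  // surrogates -> F800..FFFF (now highest)
        return c;                            // 0000..D7FF unchanged
    }

    static int compareCodePointOrder(String a, String b) {
        int len = Math.min(a.length(), b.length());
        for (int i = 0; i < len; i++) {
            int d = fixUp(a.charAt(i)) - fixUp(b.charAt(i));
            if (d != 0) return d;
        }
        return a.length() - b.length();
    }

    public static void main(String[] args) {
        // U+FFFF sorts below U+10000 in code point order...
        System.out.println(compareCodePointOrder("\uFFFF", "\uD800\uDC00") < 0);
        // ...but above it in plain UTF-16 (String.compareTo) order.
        System.out.println("\uFFFF".compareTo("\uD800\uDC00") > 0);
    }
}
```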
Re: Socket and file locks
Hello,
I'm glad you appreciate it; I've added the wiki page here:
http://wiki.apache.org/lucene-java/AvailableLockFactories
I deliberately avoided copy-pasting the full javadocs of each implementation, as that would go out of date or be too specific to one version; I limited myself to a few words highlighting the differences, as a quick overview of what is available. Hope you like it; I'm open to suggestions.
Regards,
Sanne

2009/11/29 Michael McCandless :
> This looks great!
>
> Maybe it makes most sense to create a wiki page
> (http://wiki.apache.org/lucene-java) for interesting LockFactory
> implementations/tradeoffs, and add this there?
>
> Mike
>
> On Sat, Nov 28, 2009 at 9:26 AM, Sanne Grinovero wrote:
>> Hello,
>> Together with the Infinispan Directory we developed such a
>> LockFactory; I'd be more than happy if you wanted to add some pointers
>> to it in the Lucene documentation/readme.
>> This depends on Infinispan for multi-machine communication
>> (JGroups, indirectly), but it's not required to use an Infinispan
>> Directory; you could combine it with a Directory impl of your choice.
>> This was tested with the LockVerifyServer mentioned by Michael
>> McCandless, and also with some other tests inspired by it (in-VM for
>> lower-delay coordination and verification, while the LockFactory was
>> forced to use real network communication).
>>
>> While this is a technology preview and performance of the
>> Directory code is still unknown, I believe the LockFactory was the
>> most tested component.
>>
>> Free to download and inspect (LGPL):
>> http://anonsvn.jboss.org/repos/infinispan/trunk/lucene-directory/
>>
>> Regards,
>> Sanne
>>
>> 2009/11/27 Michael McCandless :
>>> I think a LockFactory for Lucene that implemented the ideas you &
>>> Marvin are discussing in LUCENE-1877, and/or the approach you
>>> implemented in the H2 DB, would be a useful addition to Lucene!
>>>
>>> For many apps, the simple LockFactory impls suffice, but for apps
>>> where multiple machines can become the writer, it gets hairy. Having
>>> an always-correct Lock impl for these apps would be great.
>>>
>>> Note that Lucene has some basic tools (in oal.store) for asserting
>>> that a LockFactory is correct (see LockVerifyServer), so it's a useful
>>> way to test that things are working from Lucene's standpoint.
>>>
>>> Mike
>>>
>>> On Fri, Nov 27, 2009 at 9:23 AM, Thomas Mueller wrote:
>>>> Hi,
>>>> I'm wondering if you are interested in automatically releasing the
>>>> write lock. See also my comments on
>>>> https://issues.apache.org/jira/browse/LUCENE-1877 - I thought it's a
>>>> problem worth solving, because it's also in the Lucene FAQ at
>>>> http://wiki.apache.org/lucene-java/LuceneFAQ#What_is_the_purpose_of_write.lock_file.2C_when_is_it_used.2C_and_by_which_classes.3F
>>>> Unfortunately there seems to be no solution that 'always works', but
>>>> delegating the task and responsibility to the application / to the
>>>> user is problematic as well. For example, a user of the H2 database
>>>> (which supports Lucene fulltext indexing) suggested automatically
>>>> removing the write.lock file whenever the file is there:
>>>> http://code.google.com/p/h2database/issues/detail?id=141 - sounds a
>>>> bit dangerous in my view. So, if you are interested in solving the
>>>> problem, then maybe I can help. If not, I will not bother you any
>>>> longer :-)
>>>> Regards,
>>>> Thomas
>>>>
>>>>> shouldn't active code like that live in the application layer?
>>>>
>>>> Why?
>>>>
>>>>> You can all but guarantee that polling will work at the app layer
>>>>
>>>> The application layer may also run with low priority. In operating
>>>> systems, it's usually the lower layers that have more 'rights'
>>>> (priority), not the higher levels (I'm not saying it should be like
>>>> that in Java). I just think the application layer should not have to
>>>> deal with write locks or with removing write locks.
>>>>> by the time the original process realizes that it doesn't hold the
>>>>> lock anymore, the damage could already have been done.
>>>>
>>>> Yes, and I'm not sure how best to avoid that (with any design).
>>>> Asking the application layer or the user whether the lock file can be
>>>> removed is probably more dangerous than trying our best in Lucene.
>>>> Standby / hibernate: the question is, if the process is currently not
>>>> running, does it still hold the lock? I think no, because the machine
>>>> might as well be turned off. And how do you detect whether the machine
>>>> is turned off versus in hibernate mode? I guess that's a problem for
>>>> all mechanisms (socket / file lock / background thread). When a
>>>> hibernated process wakes up again, it thinks it still owns the lock.
>>>> Even if the process checks before each write, it is unsafe:
>>>>
>>>> if (isStillLocked()) {
>>>>     write();
>>>> }
>>>>
>>>> The process could wake up after isStillLocked() but before write().
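The check-then-write race described above is a classic time-of-check/time-of-use problem, and one well-known mitigation is a "fencing token": the lock service hands out a strictly increasing token with every grant, and the storage layer rejects writes carrying a token older than one it has already seen. A writer that hibernates, loses its lock, and wakes up again then fails safely, because the next holder's newer token has fenced it off. This is a hypothetical sketch, not Lucene API; all names are made up for illustration.

```java
public class FencedStore {
    private long highestToken = 0;

    // Accept a write only if its token is at least the newest seen so far
    // (equal tokens allow the same holder to write more than once).
    public synchronized boolean write(long fencingToken, byte[] data) {
        if (fencingToken < highestToken) {
            return false; // stale holder: a newer lock owner already acted
        }
        highestToken = fencingToken;
        // ... perform the actual write here ...
        return true;
    }

    public static void main(String[] args) {
        FencedStore store = new FencedStore();
        System.out.println(store.write(1, new byte[0])); // true
        System.out.println(store.write(2, new byte[0])); // true: newer holder
        System.out.println(store.write(1, new byte[0])); // false: fenced off
    }
}
```

The key design point is that the safety check moves from the unreliable writer (which may be suspended at any instant) into the component that actually applies the write.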
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783499#action_12783499 ] Robert Muir commented on LUCENE-1458:

bq. It would not compare faster because in UTF-8 encoding, only 7 bits are used for encoding the chars

Yeah, you are right; I don't think it will be faster on average (I was just posing the question because I don't really know NRQ), and you will waste at least 4 bits by using only the first bit. I am just always trying to improve collation too; that's why I am bugging you. Hopefully soon we'll have byte[] and can do it properly, and speed up both.
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783496#action_12783496 ] Uwe Schindler commented on LUCENE-1458:

bq. because it compares from left to right, so even if the terms are 10x as long, if they differ 2x as quick its better?

It would not compare faster, because in UTF-8 encoding only 7 bits per byte are used for encoding the chars. The 8th bit is just a marker (simply spoken). Whether this marker is always 0 or always 1 makes no difference; in UTF-8 only 7 bits/byte carry data. And with UTF-8, in the 3rd byte even more bits are unused!

bq. I hear what you are saying about ASCII-only encoding, but if NRQ's model is always best, why do we have two separate "encode byte[] into char[]" models in lucene, one that NRQ is using, and one that collation is using!?

I do not know who made this IndexableBinaryStrings encoding, but it would not work for NRQ at all with current trunk (too complicated during indexing and decoding; for NRQ we also need to decode such char[] very fast to populate the FieldCache). But as discussed with Yonik (I do not know the issue number), the ASCII-only encoding should always perform better, though it needs more memory in trunk, as char[] is used during indexing -- I think that is why IndexableBinaryStrings was added. So the difference is not speed; it's memory consumption.
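To make the "ASCII-only" idea above concrete, here is a hedged sketch of packing a long into chars carrying 7 data bits each, so every char is a single byte in UTF-8 and plain lexicographic order equals numeric order. This mirrors the approach only; it is not Lucene's actual NumericUtils trie encoding, and the names are illustrative.

```java
public class SevenBitEncode {
    static String encodeLong(long v) {
        v ^= 0x8000000000000000L;      // flip the sign bit so negatives sort first
        char[] out = new char[10];     // ceil(64 / 7) = 10 chars
        for (int i = 9; i >= 0; i--) {
            out[i] = (char) (v & 0x7F);
            v >>>= 7;
        }
        return new String(out);
    }

    public static void main(String[] args) {
        // Lexicographic order of the encoded strings matches numeric order.
        System.out.println(encodeLong(-5).compareTo(encodeLong(3)) < 0);
    }
}
```

This is the memory trade-off Uwe mentions: each 7-bit char occupies a full 2-byte char in RAM during indexing, which is why an 8-bits-per-unit binary encoding would be denser.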
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783494#action_12783494 ] Michael McCandless commented on LUCENE-1458:

bq. The idea is to create an additional Attribute: BinaryTermAttribute that holds byte[]. If some tokenstream uses this attribute instead of TermAttribute, the indexer would choose to write the bytes directly to the index. NumericTokenStream could use this attribute and encode the numbers directly to byte[] with 8 bits/byte. The new AttributeSource API was created precisely for such customizations (not possible with Token).

This sounds like an interesting approach! We'd have to work out some details... e.g. you presumably can't mix char[] terms and byte[] terms in the same field.
> All tests pass, though back-compat tests don't pass due to changes to package-private APIs plus certain bugs in tests that happened to work (eg call TermPositions.nextPosition() too many times, which the new API asserts against).
> [Aside: I think, when we commit changes to package-private APIs such that back-compat tests don't pass, we could go back, make a branch on the back-compat tag, commit changes to the tests to use the new package-private APIs on that branch, then fix the nightly build to use the tip of that branch?]
> There's still plenty to do before this is committable! This is a rather large change:
> * Switches to a new, more efficient terms dict format. This still uses tii/tis files, but the tii only stores term & long offset (not a TermInfo). At seek points, tis encodes term & freq/prox offsets absolutely instead of with deltas. Also, tis/tii are structured by field, so we don't have to record the field number in every term.
> .
> On the first 1 M docs of Wikipedia, the tii file is 36% smaller (0.99 MB -> 0.64 MB) and the tis file is 9% smaller (75.5 MB -> 68.5 MB).
> .
> RAM usage when loading the terms dict index is significantly lower, since we only load an array of offsets and an array of String (no more TermInfo array). It should be faster to init too.
> .
> This part is basically done.
> * Introduces a modular reader codec that strongly decouples the terms dict from the docs/positions readers. EG there is no more TermInfo used when reading the new format.
> .
> There's nice symmetry now between reading & writing in the codec chain -- the current docs/prox format is captured in:
> {code}
> FormatPostingsTermsDictWriter/Reader
> FormatPostingsDocsWriter/Reader (.frq file) and
> FormatPostingsPositionsWriter/Reader (.prx file).
> {code}
> This part is basically done.
> * Introduces a new "flex" API for iterating through the fields, terms, docs and positions:
> {code}
> FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
> {code}
> This replaces TermEnum/Docs/Positions. SegmentReader emulates the old API on top of the new API to keep back-compat.
>
> Next steps:
> * Plug in new codecs (pulsing, pfor) to exercise the modularity / fix any hidden assumptions.
> * Expose the new API out of IndexReader, deprecate the old API but emulate it on top of the new one, and switch all core/contrib users to the new API.
> * Maybe switch to AttributeSource as the base class for TermsEnum, DocsEnum, PostingsEnum -- this would give readers API flexibility (not just index-file-format flexibility). EG if someone wanted to store a payload at the term-doc level instead of the term-doc-position level, you could just add a new attribute.
> * Test performance & iterate.
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
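The flex chain above can be pictured with a toy in-memory model. The classes below are illustrative stand-ins that only mirror the names and iteration shape of the proposed API (TermsEnum handing out a DocsEnum per term); they are not the patch's actual classes:

```java
import java.util.*;

// Illustrative stand-ins for the flex chain (not the real Lucene classes):
// each enumeration level hands out an enumerator for the level below it.
class DocsEnum {
    private final int[] docs; private int i = -1;
    DocsEnum(int[] docs) { this.docs = docs; }
    int nextDoc() { return ++i < docs.length ? docs[i] : -1; } // -1 = exhausted
}

class TermsEnum {
    private final Iterator<Map.Entry<String, int[]>> it;
    private Map.Entry<String, int[]> cur;
    TermsEnum(SortedMap<String, int[]> terms) { this.it = terms.entrySet().iterator(); }
    String next() { cur = it.hasNext() ? it.next() : null; return cur == null ? null : cur.getKey(); }
    DocsEnum docs() { return new DocsEnum(cur.getValue()); }
}

public class FlexSketch {
    static String render() {
        // term -> docIDs, standing in for an on-disk postings list
        SortedMap<String, int[]> terms = new TreeMap<>();
        terms.put("lucene", new int[]{1, 4, 7});
        terms.put("search", new int[]{2, 4});
        StringBuilder out = new StringBuilder();
        TermsEnum te = new TermsEnum(terms);
        for (String t = te.next(); t != null; t = te.next()) {
            out.append(t).append(" ->");
            DocsEnum de = te.docs();
            for (int d = de.nextDoc(); d != -1; d = de.nextDoc()) out.append(' ').append(d);
            out.append('\n');
        }
        return out.toString();
    }
    public static void main(String[] args) { System.out.print(render()); }
}
```

Because each level only exposes an enumerator for the next level down, a codec can swap out how one layer is stored (terms dict vs. postings) without the other layers noticing -- which is the decoupling the patch describes.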
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783493#action_12783493 ] Robert Muir commented on LUCENE-1458: -
bq. Why should they compare faster when encoded by IndexableBinaryStringTools?
Because the comparison runs from left to right, so even if the terms are 10x as long, if they differ twice as quickly that's a win? I hear what you are saying about ASCII-only encoding, but if NRQ's model is always best, why do we have two separate "encode byte[] into char[]" schemes in Lucene, one that NRQ uses and one that collation uses!?
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783492#action_12783492 ] Michael McCandless commented on LUCENE-1458:
bq. I changed the logic in the TermEnum in trunk and 3.0 (it no longer works recursively, see LUCENE-2087). We should change this here, too.
Mark has been periodically re-syncing changes down from trunk... we should probably just let this change come in through his process (else I think we cause more conflicts).
bq. The legacy NumericRangeTermEnum can be removed completely and the protected getEnum() should simply throw UOE. NRQ cannot be subclassed and nobody can call this method (maybe only classes in the same package, but that's not supported). So the enum with the nocommit mark can be removed.
Ahh excellent. Wanna commit that when you get a chance?
bq. Ideally NRQ would simply not use string terms at all and work directly on the byte[], which should then be ordered in binary order.
That'd be great!
bq. With "directly on byte[]" I meant that it could not use chars at all and directly encode the numbers into byte[] with the full 8 bits per byte. The resulting byte[] would never be UTF-8, but if the new TermRef API and the TokenStreams could handle this, it would be fine. Only the terms format would change.
Right, this is a change in analysis -> DocumentsWriter -- somehow we have to allow a Token to carry a byte[] that is directly indexed as the opaque term. At search time NRQ is all byte[] already (unlike other queries, which are new String()'ing for every term on the enum).
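The "full 8 bits per byte" encoding discussed above only works for NRQ if unsigned lexicographic byte order agrees with numeric order. For a long, flipping the sign bit before writing big-endian achieves that; this is a hedged sketch of the idea, not the encoding the patch actually uses:

```java
import java.util.Arrays;

public class LongBytes {
    // Encode a long as 8 big-endian bytes with the sign bit flipped, so that
    // unsigned lexicographic byte order agrees with numeric order
    // (negatives sort before positives).
    static byte[] encode(long v) {
        long x = v ^ Long.MIN_VALUE; // flip the sign bit
        byte[] b = new byte[8];
        for (int i = 7; i >= 0; i--) { b[i] = (byte) x; x >>>= 8; }
        return b;
    }

    // True if encoded byte order matches numeric order for each adjacent pair.
    static boolean ordered(long... vals) {
        for (int i = 1; i < vals.length; i++)
            if (Arrays.compareUnsigned(encode(vals[i - 1]), encode(vals[i])) >= 0)
                return false;
        return true;
    }

    public static void main(String[] args) {
        System.out.println(ordered(Long.MIN_VALUE, -42, -1, 0, 1, 42, Long.MAX_VALUE));
    }
}
```

With such an encoding the terms are fixed-width 8-byte keys that a byte[]-comparing TermsEnum can range-scan directly, with no char[] round trip at search time.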
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783491#action_12783491 ] Uwe Schindler commented on LUCENE-1458: ---
bq. Uwe you are right that the terms would be larger but they would have a more distinct alphabet (byte range) and might compare faster... I don't know which one is most important to NRQ really.
The new TermsEnum directly compares the byte[] arrays. Why should they compare faster when encoded by IndexableBinaryStringTools? Fewer bytes are faster to compare (it's essentially one CPU instruction per byte in an optimized native x86/x64 loop). It might be faster if we needed to decode to char[], but that's not the case (in the flex branch).
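The direct byte[] comparison Uwe describes comes down to a loop like the following (a minimal sketch of unsigned left-to-right comparison, not the actual flex-branch code; note the & 0xFF masking, needed because Java bytes are signed):

```java
public class ByteCompare {
    // Unsigned, left-to-right byte[] comparison: the first differing byte
    // decides, so fewer bytes (and earlier differences) mean fewer iterations.
    static int compare(byte[] a, byte[] b) {
        int len = Math.min(a.length, b.length);
        for (int i = 0; i < len; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF); // mask: Java bytes are signed
            if (diff != 0) return diff;
        }
        return a.length - b.length; // a proper prefix sorts first
    }

    public static void main(String[] args) {
        // 0x80 is negative as a signed Java byte; a signed comparison would
        // sort it before 0x7F, the unsigned comparison sorts it after.
        byte[] lo = {(byte) 0x7F, 0x01};
        byte[] hi = {(byte) 0x80, 0x00};
        System.out.println(compare(lo, hi) < 0);
    }
}
```

This is also why "less bytes are faster to compare" and "10x as long but differs 2x as quick" are both coherent positions: total length bounds the worst case, but the position of the first differing byte decides the actual cost.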
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783489#action_12783489 ] Robert Muir commented on LUCENE-1458: -
Uwe, you are right that the terms would be larger, but they would have a more distinct alphabet (byte range) and might compare faster... I don't know which one is most important to NRQ really.
Yeah, I agree that encoding directly to byte[] is the way to go though; this would be nice for collation too...
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783490#action_12783490 ] Uwe Schindler commented on LUCENE-1458: ---
As the codec is per field, we could also add an Attribute to TokenStream that holds the codec (the default is Standard). The indexer would just use the codec for the field from the TokenStream. NumericTokenStream would use a NumericCodec (just thinking...) - will go sleeping now.
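The per-field codec-selection idea could be pictured as a lookup keyed by an attribute the TokenStream advertises. Everything below is hypothetical (no such codec attribute or NumericCodec exists in the patch); it only illustrates the dispatch shape, with strings standing in for real codec output:

```java
import java.util.*;
import java.util.function.Function;

// Hypothetical sketch: the indexer picks a per-field term codec based on an
// attribute value the TokenStream declares. All names here are made up.
public class CodecPick {
    static final Map<String, Function<String, String>> CODECS = new HashMap<>();
    static {
        CODECS.put("Standard", t -> "utf8(" + t + ")");
        CODECS.put("Numeric",  t -> "bytes(" + Long.parseLong(t) + ")");
    }

    // Dispatch on the declared codec; fall back to Standard when the stream
    // declares nothing, matching the "default is Standard" idea.
    static String index(String codecAttr, String term) {
        return CODECS.getOrDefault(codecAttr, CODECS.get("Standard")).apply(term);
    }

    public static void main(String[] args) {
        System.out.println(index("Standard", "hello"));
        System.out.println(index("Numeric", "42"));
    }
}
```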
[jira] Issue Comment Edited: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783488#action_12783488 ] Uwe Schindler edited comment on LUCENE-1458 at 11/29/09 10:16 PM: --
bq. A partial solution for you which does work with tokenstreams: you could use IndexableBinaryString, which won't change between any unicode sort order... (it will not encode in any unicode range where there is a difference between UTF-8/UTF-32 and UTF-16). With this you could just compare bytes also, but you still would not have the "full 8 bits per byte".
This would not change anything, it would only make the format incompatible. With 7 bits/char, the currently UTF-8 coded index is the smallest possible one (even IndexableBinaryString would cost more bytes in the index: if you used 14 of the 16 bits/char, most chars would take 3 bytes in the index because of UTF-8, vs. 2 bytes with the current encoding. Only the char[]/String representation would take less space than currently. See the discussion with Yonik about this and why we have chosen 7 bits/char. Also en-/decoding is much faster).
For the TokenStreams: the idea is to create an additional Attribute, BinaryTermAttribute, that holds byte[]. If some tokenstream uses this attribute instead of TermAttribute, the indexer would choose to write the bytes directly to the index. NumericTokenStream could use this attribute and encode the numbers directly to byte[] with 8 bits/byte - the new AttributeSource API was created just because of such customizations (not possible with Token).
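The size argument can be checked with quick arithmetic: chars below 0x80 cost one byte in UTF-8, while chars in the 0x800-0xFFFF range cost three, so packing 7 bits per char beats packing 14. A small sketch of that arithmetic (illustrative only, not the IndexableBinaryStringTools code):

```java
public class EncodingSize {
    // UTF-8 cost of storing n payload bytes when packed bitsPerChar bits into
    // each char, assuming each produced char costs bytesPerChar in UTF-8.
    static int utf8Cost(int n, int bitsPerChar, int bytesPerChar) {
        int chars = (8 * n + bitsPerChar - 1) / bitsPerChar; // ceil(8n / bits)
        return chars * bytesPerChar;
    }

    public static void main(String[] args) {
        int n = 100; // payload bytes
        // 7 bits/char keeps every char below 0x80 -> 1 UTF-8 byte per char
        int sevenBit = utf8Cost(n, 7, 1);
        // 14 bits/char mostly lands in 0x800..0xFFFF -> 3 UTF-8 bytes per char
        int fourteenBit = utf8Cost(n, 14, 3);
        System.out.println(sevenBit + " vs " + fourteenBit);
    }
}
```

For 100 payload bytes the 7-bit packing costs 115 index bytes against 174 for 14-bit packing, which is the ~1.5x overhead Uwe is pointing at; only the in-memory char[]/String form would shrink with the wider packing.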
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783488#action_12783488 ] Uwe Schindler commented on LUCENE-1458: --- bq. A partial solution for you which does work with tokenstreams, you could use indexablebinarystring which won't change between any unicode sort order... (it will not encode in any unicode range where there is a difference between the UTF-8/UTF32 and UTF-16). With this you could just compare bytes also, but you still would not have the "full 8 bits per byte" This would not change anything, only would make the format incompatible. With 7bits/char the currently UTF-8 coded index is the smallest possible one (even IndexableBinaryString would cost more bytes in the index, because if you would use 14 of the 16 bits/char, most chars would take 3 bytes in index because of UTF-8 vs. 2 bytes with the current encoding. Only the char[]/String representation would take less space than currently. See the discussion with Yonik about this and why we have choosen 7 bits/char. Also en-/decoding is much faster). For the TokenStreams: The idea is to create an additional Attribute: BinaryTermAttribute that holds byte[]. If some tokenstream uses this attribute instead of TermAttribute, the indexer would choose to write the bytes directly to the index. NumericTokenStream could use this attribute and encode the numbers directly to byte[] with 8 bits/byte. 
> Further steps towards flexible indexing > --- > > Key: LUCENE-1458 > URL: https://issues.apache.org/jira/browse/LUCENE-1458 > Project: Lucene - Java > Issue Type: New Feature > Components: Index >Affects Versions: 2.9 >Reporter: Michael McCandless >Assignee: Michael McCandless >Priority: Minor > Attachments: LUCENE-1458-back-compat.patch, > LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, > LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, > LUCENE-1458-back-compat.patch, LUCENE-1458.patch, LUCENE-1458.patch, > LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, > LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, > LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.tar.bz2, > LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, > LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, > LUCENE-1458_sortorder_bwcompat.patch, LUCENE-1458_termenum_bwcompat.patch, > UnicodeTestCase.patch, UnicodeTestCase.patch > > > I attached a very rough checkpoint of my current patch, to get early > feedback. All tests pass, though back compat tests don't pass due to > changes to package-private APIs plus certain bugs in tests that > happened to work (eg call TermPositions.nextPosition() too many times, > which the new API asserts against). > [Aside: I think, when we commit changes to package-private APIs such > that back-compat tests don't pass, we could go back, make a branch on > the back-compat tag, commit changes to the tests to use the new > package private APIs on that branch, then fix nightly build to use the > tip of that branch?] > There's still plenty to do before this is committable! This is a > rather large change: > * Switches to a new more efficient terms dict format. This still > uses tii/tis files, but the tii only stores term & long offset > (not a TermInfo). At seek points, tis encodes term & freq/prox > offsets absolutely instead of with deltas. Also, tis/tii > are structured by field, so we don't have to record field number > in every term. > . > On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB > -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB). > . > RAM usage when loading terms dict index is significantly less > since we only load an array of offsets and an array of String (no > more TermInfo array). It should be faster to init too. > . > This part is basically done. > * Introduces modular reader codec that strongly decouples terms dict > from docs/positions readers. EG there is no more TermInfo used > when reading the new format. > . > There's nice symmetry now between reading & writing in the codec > chain -- the current docs/prox format is captured in: > {code} > FormatPostingsTermsDictWriter/Reader > FormatPostingsDocsWriter/Reader (.frq file) and > FormatPostingsPositionsWriter/Reader (.prx file). > {code} > This part is basically done. > * Introduces a new "flex" API for iterating through the fields, > terms, docs and positions: > {code} > FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum > {code} > This replaces TermEnum/Docs/Positions. SegmentReader emulates the > old API on top of the new API to keep back-compat. > > Next steps: > * Plug in new codecs (pulsing, pfor) to exercise the modularity / > fix any hidden assumptions. > * Expose new API out of IndexReader, deprecate old API but emulate > old API on top of new one, switch all core/contrib users to the > new API. > * Maybe switch to AttributeSources as the base class for TermsEnum, > DocsEnum, PostingsEnum -- this would give readers API flexibility > (not just index-file-format flexibility). EG if someone wanted > to store payload at the term-doc level instead of > term-doc-position level, you could just add a new attribute. > * Test performance & iterate.
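The FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum chain quoted above is, conceptually, a four-level nested enumeration: fields, then terms, then matching docs, then positions within each doc. A toy in-memory sketch of that shape (everything except the four quoted type names is invented; the real flex API was still in flux on this issue):

```java
import java.util.*;

public class FlexChainSketch {
    // field -> term -> (docID -> positions); a toy in-memory "index"
    static final Map<String, SortedMap<String, SortedMap<Integer, int[]>>> INDEX =
        new TreeMap<>();
    static {
        SortedMap<Integer, int[]> postings = new TreeMap<>();
        postings.put(0, new int[] {3, 17}); // term at positions 3 and 17 in doc 0
        postings.put(2, new int[] {5});
        SortedMap<String, SortedMap<Integer, int[]>> terms = new TreeMap<>();
        terms.put("lucene", postings);
        INDEX.put("body", terms);
    }

    public static List<String> walk() {
        List<String> out = new ArrayList<>();
        // FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum, as loops
        for (var field : INDEX.entrySet()) {                 // FieldProducer
            for (var term : field.getValue().entrySet()) {   // TermsEnum
                for (var doc : term.getValue().entrySet()) { // DocsEnum
                    for (int pos : doc.getValue()) {         // PostingsEnum
                        out.add(field.getKey() + "/" + term.getKey()
                                + "/doc" + doc.getKey() + "/pos" + pos);
                    }
                }
            }
        }
        return out;
    }

    public static void main(String[] args) { walk().forEach(System.out::println); }
}
```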
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783485#action_12783485 ] Robert Muir commented on LUCENE-1458: - bq. With directly on bytes[] I meant that it could not use chars at all and directly encode the numbers into byte[] with the full 8 bits per byte. The resulting byte[] would be never UTF-8, but if the new TermRef API would be able to handle this and also the TokenStreams, it would be fine. Only the terms format would change. Uwe, it looks like you can do this now (with the exception of tokenstreams). As a partial solution that does work with tokenstreams, you could use IndexableBinaryString, which won't change under any unicode sort order... (it will not encode into any unicode range where there is a difference between UTF-8/UTF-32 and UTF-16). With this you could just compare bytes as well, but you still would not have the "full 8 bits per byte".
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783482#action_12783482 ] Uwe Schindler commented on LUCENE-1458: --- Robert: I know; because of that I said it works with the UTF-8/UTF-16 comparator. It would *not* work with a reverse comparator such as Mike uses in the test. By "directly on byte[]" I meant that it could not use chars at all and would directly encode the numbers into byte[] with the full 8 bits per byte. The resulting byte[] would never be UTF-8, but if the new TermRef API were able to handle this, and also the TokenStreams, it would be fine. Only the terms format would change.
[jira] Issue Comment Edited: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783481#action_12783481 ] Robert Muir edited comment on LUCENE-1458 at 11/29/09 9:33 PM: --- bq. Ideally NRQ would simply not use string terms at all and work directly on the byte[], which should then be ordered in binary order. but isn't this what it does already with the TermsEnum api? the TermRef itself is just byte[], and NRQ precomputes all the TermRefs it needs up front; there is no unicode conversion there. edit: btw Uwe, the comparator is essentially just comparing bytes; the 0xee/0xef "shifting" should never take place with NRQ because those bytes will never be in a numeric field... was (Author: rcmuir): bq. Ideally NRQ would simply not use string terms at all and work directly on the byte[], which should then be ordered in binary order. but isn't this what it does already with the TermsEnum api? the TermRef itself is just byte[], and NRQ precomputes all the TermRef's it needs up front, there is no unicode conversion there.
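The sort-order question running through these comments comes down to this: unsigned comparison of UTF-8 bytes yields Unicode code point order, while Java's String.compareTo yields UTF-16 code unit order, and the two disagree once supplementary characters are involved. A hypothetical sketch of the byte[]-level comparison a TermRef-style API needs (ByteTermCompare is not a Lucene class):

```java
import java.nio.charset.StandardCharsets;

public class ByteTermCompare {
    // Unsigned lexicographic byte[] comparison.
    // (Java bytes are signed, hence the & 0xFF.)
    static int compare(byte[] a, byte[] b) {
        int len = Math.min(a.length, b.length);
        for (int i = 0; i < len; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) return diff;
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        // A BMP char vs. a supplementary char: byte order and UTF-16 order
        // disagree, because UTF-16 surrogates (0xD800..0xDFFF) sort below
        // the top of the BMP.
        String bmp = "\uFFFD";                                // U+FFFD
        String supp = new String(Character.toChars(0x10400)); // U+10400

        byte[] bmpUtf8 = bmp.getBytes(StandardCharsets.UTF_8);   // EF BF BD
        byte[] suppUtf8 = supp.getBytes(StandardCharsets.UTF_8); // F0 90 90 80

        System.out.println(compare(bmpUtf8, suppUtf8) < 0); // true: code point order
        System.out.println(bmp.compareTo(supp) < 0);        // false: UTF-16 order
    }
}
```

For a numeric field whose precomputed terms never contain the affected high bytes, the two orders coincide, which is the point Robert makes about NRQ.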
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783481#action_12783481 ] Robert Muir commented on LUCENE-1458: - bq. Ideally NRQ would simply not use string terms at all and work directly on the byte[], which should then be ordered in binary order. but isn't this what it does already with the TermsEnum api? the TermRef itself is just byte[], and NRQ precomputes all the TermRef's it needs up front, there is no unicode conversion there.
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783475#action_12783475 ] Uwe Schindler commented on LUCENE-1458: --- Hi Mike, I looked into your commit; looks good. You are right with your comment in NRQ: it will only work with UTF-8 or UTF-16. Ideally NRQ would simply not use string terms at all and would work directly on the byte[], which should then be ordered in binary order. Two things: - The legacy NumericRangeTermEnum can be removed completely and the protected getEnum() should simply throw UOE. NRQ cannot be subclassed and nobody can call this method (maybe only classes in the same package, but that's not supported). So the enum with the nocommit mark can be removed. - I changed the logic in the TermEnum in trunk and 3.0 (it no longer works recursively, see LUCENE-2087). We should change this here, too. This also makes the enum simpler (and it looks more like the Automaton one). In trunk and 3.0 the methods setEnum() and endEnum() both now throw UOE. I will look into these two changes tomorrow and change the code.
Uwe
[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783471#action_12783471 ] Michael McCandless commented on LUCENE-1458: OK I finally worked out a solution for the UTF16 sort order problem (just committed). I added a TermRef.Comparator class, for comparing TermRefs, and I removed TermRef.compareTo, and fixed all low-level places in Lucene that rely on sort order of terms to use this new API instead. I changed the Terms/TermsEnum/TermsConsumer API, adding a getTermComparator(), ie, the codec now determines the sort order for terms in each field. For the core codecs (standard, pulsing, intblock) I default to UTF16 sort order, for back compat, but you could easily instantiate it yourself and use a different term sort. I changed TestExternalCodecs to test this new capability, by sorting 2 of its fields in reversed unicode code point order. While this means your codec is now completely free to define the term sort order per field, in general Lucene queries will not behave right if you do this, so it's obviously a very advanced use case. I also changed (yet again!) how DocumentsWriter encodes the terms bytes, to record the length (in bytes) of the term, up front, followed by the term bytes (vs the trailing 0xff that I had switched to). The length is a 1 or 2 byte vInt, ie if it's < 128 it's 1 byte, else 2 bytes. This approach means the TermRef.Collector doesn't have to deal with 0xff's (which was messy). I think this also means that, to the flex API, a term is actually opaque -- it's just a series of bytes. It need not be UTF8 bytes. However, all of analysis, and then how TermsHash builds up these byte[]s, and what queries do with these bytes, is clearly still very much Unicode/UTF8. But one could, in theory (I haven't tested this!) 
separately use the flex API to build up a segment whose terms are arbitrary byte[]'s, eg maybe you want to use 4 bytes to encode int values, and then interact with those terms at search time using the flex API.
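The length-prefixed term encoding Mike describes (a 1- or 2-byte vInt length up front, followed by the term bytes) can be sketched as below. The writeTerm/readTerm helpers are hypothetical illustrations, not the DocumentsWriter code; the vInt here follows the usual low-order-first, high-bit-is-continuation scheme, so lengths < 128 cost one byte and lengths up to 16383 cost two.

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

public class TermBytesDemo {
    // Write a term as <vInt length><bytes>: 7 payload bits per vInt byte,
    // low-order first, high bit set on all but the last byte.
    static void writeTerm(DataOutputStream out, byte[] term) throws IOException {
        int len = term.length;
        while ((len & ~0x7F) != 0) {
            out.writeByte((len & 0x7F) | 0x80);
            len >>>= 7;
        }
        out.writeByte(len);
        out.write(term);
    }

    static byte[] readTerm(DataInputStream in) throws IOException {
        int b = in.readUnsignedByte();
        int len = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = in.readUnsignedByte();
            len |= (b & 0x7F) << shift;
        }
        byte[] term = new byte[len];
        in.readFully(term);
        return term;
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        writeTerm(out, "hello".getBytes(StandardCharsets.UTF_8)); // 1-byte length
        writeTerm(out, new byte[200]);                            // 2-byte length

        DataInputStream in = new DataInputStream(
            new ByteArrayInputStream(buf.toByteArray()));
        System.out.println(new String(readTerm(in), StandardCharsets.UTF_8)); // hello
        System.out.println(readTerm(in).length); // 200
        System.out.println(buf.size());          // 208 = (1+5) + (2+200)
    }
}
```

Because the length comes first, a reader never has to scan for a sentinel byte such as 0xFF, which is why the term bytes can be fully opaque.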
[jira] Commented: (LUCENE-2037) Allow Junit4 tests in our environment.
[ https://issues.apache.org/jira/browse/LUCENE-2037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783442#action_12783442 ] Erick Erickson commented on LUCENE-2037: Darn it! I'll get the comments right sometime and not have to retype them after making an attachment. Anyway, this patch allows us to use Junit4 constructs as well as Junit3 constructs. It includes a sibling class to LuceneTestCase called LuceneTestCaseJ4 that provides the functionality we used to get from LuceneTestCase. When creating Junit4-style tests, preferentially import from org.junit rather than from junit.framework. junit-3.8.2.jar may (should?) be removed from the distro; all tests run just fine under junit-4.7.jar, which is attached to this issue. I wrote a little script that compares the results of running the tests, and we run exactly the same number of TestSuites and each runs exactly the same number of tests, so I'm pretty confident about this one. I may be wrong, but I'm not uncertain. Single data points aren't worth much, but on my Macbook Pro, running under Junit4 took about a minute longer than Junit3 (about 23 1/2 minutes), which could have been the result of my Time Machine running, for all I know. All the tests in test...search.function have been converted to use LuceneTestCaseJ4 as an exemplar. I've deprecated LuceneTestCase to prompt people. When you derive from LuceneTestCaseJ4, you *must* use the @Before, @After and @Test annotations to get the functionality you expect, as must *all* subclasses. So one gotcha people will surely run across is deriving from J4 and failing to add @Test. Converting all the tests was my way of working through the derivation issues. I don't particularly see the value in doing a massive conversion just for the heck of it, unless someone has a real urge. More along the lines of "I'm in this test anyway, let's upgrade it and add new ones". What about new tests?
Should we encourage new patches to use Junit4 rather than Junit3? If so, how? I've noticed the convention of putting underscores in front of some tests to keep them from running. The Junit4 convention is the @Ignore annotation, which will cause the @Ignored tests to be reported (something like 1300 successful, 0 failures, 23 ignored), which is a nice way to keep these from getting lost in the shuffle. When this gets applied, I can put up the patch for LocalizedTestCase and we can give that a whirl > Allow Junit4 tests in our environment. > -- > > Key: LUCENE-2037 > URL: https://issues.apache.org/jira/browse/LUCENE-2037 > Project: Lucene - Java > Issue Type: Improvement > Components: Other >Affects Versions: 3.1 > Environment: Development >Reporter: Erick Erickson >Assignee: Erick Erickson >Priority: Minor > Fix For: 3.1 > > Attachments: junit-4.7.jar, LUCENE-2037.patch, LUCENE-2037.patch > > Original Estimate: 8h > Remaining Estimate: 8h > > Now that we're dropping Java 1.4 compatibility for 3.0, we can incorporate > Junit4 in testing. Junit3 and junit4 tests can coexist, so no tests should > have to be rewritten. We should start this for the 3.1 release so we can get > a clean 3.0 out smoothly. > It's probably worthwhile to convert a small set of tests as an exemplar. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
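The annotation requirement described above can be sketched with a toy example (this is illustrative code, not the actual LuceneTestCaseJ4 or JUnit internals — the nested @interface is a stand-in for org.junit.Test): a JUnit4-style runner discovers test methods by annotation, not by the JUnit3 "testXxx" naming convention, which is exactly why forgetting @Test silently skips a test.

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

// Toy illustration of why @Test matters: a JUnit4-style runner discovers
// methods by annotation, not by the JUnit3 "testXxx" naming convention.
public class AnnotationDiscovery {
    @Retention(RetentionPolicy.RUNTIME)
    @interface Test {}  // stand-in for org.junit.Test

    static class MyTests {
        @Test public void testAnnotated() {}      // found: carries @Test
        public void testNamedButNotAnnotated() {} // silently skipped by an annotation-based runner
    }

    static List<String> discover(Class<?> clazz) {
        List<String> found = new ArrayList<>();
        for (Method m : clazz.getMethods()) {
            if (m.isAnnotationPresent(Test.class)) {
                found.add(m.getName());
            }
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(discover(MyTests.class)); // [testAnnotated]
    }
}
```

The same mechanism explains the @Ignore point: an annotation-aware runner can report ignored methods instead of losing them, which the underscore-prefix convention cannot do.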
[jira] Updated: (LUCENE-2037) Allow Junit4 tests in our environment.
[ https://issues.apache.org/jira/browse/LUCENE-2037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Erick Erickson updated LUCENE-2037: --- Attachment: LUCENE-2037.patch See JIRA comments
[jira] Commented: (LUCENE-2096) Investigate parallelizing Ant junit tests
[ https://issues.apache.org/jira/browse/LUCENE-2096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783436#action_12783436 ] Erick Erickson commented on LUCENE-2096: Parallelizing tests is proving trickier than I'd hoped. Part of the problem is my not-wonderful Ant skills... But what I've found so far with trying to use ForEach is that stuff gets in the way. In particular, the tag in the test-macro body, I'm pretty sure, defeats any parallelizing attempts by ForEach, and taking it out isn't straightforward. In some of my experiments, I got tests to fire off in parallel, but then started running into wonky errors so strange that now I can't remember them, some having to do with what looked like file contention for some temporary test files. Googling around, I think I remember posts by Jason Ruthgren trying to do something similar in SOLR (?). Jason: if I'm remembering right, did you find any joy? Then we'd have to rework how success and failure are handled, because there's contention for that file as well. Now I'm wondering if the "scary python script" gets us more bang for the buck. I wrote a Groovy script that is probably a near-cousin, for experiments, and I'm wondering what would happen if we wrote a special testcase-type target that did NOT depend upon compile-test or, really, much of anything else, and counted on the user to build the system first before using whatever script we came up with. We don't really lose functionality by recursively looking for Test*.java files, because that's what's done internally in the build files anyway; doing that outside or inside the Ant files doesn't seem like a loss. I'm putting this in the JIRA issue to preserve it for posterity. Meanwhile, I'll appeal to Ant gurus if they want to try whacking the Ant build files, and see what the script notion brings...
> Investigate parallelizing Ant junit tests > - > > Key: LUCENE-2096 > URL: https://issues.apache.org/jira/browse/LUCENE-2096 > Project: Lucene - Java > Issue Type: Improvement > Components: Build >Reporter: Erick Erickson >Assignee: Erick Erickson >Priority: Minor > > Ant Contrib has a "ForEach" construct that may speed up running all of the > Junit tests by parallelizing them with a configurable number of threads. I > envision this in several stages. First, see if ForEach works for us with > hard-coded lists, distribute this for testing then make the changes "for > real". I intend to hard-code the list for the first pass, ordered by the time > they take. This won't do for check-in, but will give us a fast > proof-of-concept. > This approach will be most useful for multi-core machines. > In particular, we need to see whether the parallel tasks are isolated enough > from each other to prevent mutual interference. > All this assumes the fragmentary reference I found is still available... -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0
[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783410#action_12783410 ] Robert Muir commented on LUCENE-2094: - bq. This is one thing I thought about too - I did not change it to keep the noise as low as possible in the patch but if we want to do it we can do it in this patch too. well, I think it will be noisy either way (updating all the analyzers, etc), but it will make things a lot more consistent and easier to maintain... if you do this then StopFilter takes a Version, so it can be modified / bugfixed in the future in other ways too, with less noise. I also think it will make it easier to write an analyzer, because even completely ignoring the unicode issue, with the current codebase:
{code}
streams.source = new StandardTokenizer(matchVersion, reader);
streams.result = new StandardFilter(streams.source);
streams.result = new LowerCaseFilter(matchVersion, streams.result);
streams.result = new StopFilter(matchVersion, streams.result, stoptable);
...
{code}
reads a lot easier to me than
{code}
streams.source = new StandardTokenizer(matchVersion, reader);
streams.result = new StandardFilter(streams.source);
streams.result = new LowerCaseFilter(matchVersion, streams.result);
streams.result = new StopFilter(StopFilter.getEnablePositionIncrementsVersionDefault(matchVersion), streams.result, stoptable);
...
{code}
> Prepare CharArraySet for Unicode 4.0 > > > Key: LUCENE-2094 > URL: https://issues.apache.org/jira/browse/LUCENE-2094 > Project: Lucene - Java > Issue Type: Bug > Components: Analysis >Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, > 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1 >Reporter: Simon Willnauer > Fix For: 3.1 > > Attachments: LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, > LUCENE-2094.txt > > > CharArraySet does lowercasing if created with the corresponding flag. This > means that String / char[] entries with Unicode 4 chars which are in the set cannot > be retrieved in "ignorecase" mode. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0
[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783409#action_12783409 ] Uwe Schindler commented on LUCENE-2094: --- +1 for pushing Version down to StopFilter (it is there already, but hidden in this getDefault() method!). Its presence was justified by the Lucene 2.9/3.0 migration. Now it should just take a matchVersion and no more setters inside StopFilter. The noise is the same, as all analyzers using StopFilter then need the version arg / need to be changed anyhow.
[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0
[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783406#action_12783406 ] Simon Willnauer commented on LUCENE-2094: - bq. I guess i think this getEnablePositionIncrementsVersionDefault should be deprecated along with the ctors that take this boolean argument, and it should all be driven off a single Version argument for simplicity This is one thing I thought about too - I did not change it, to keep the noise as low as possible in the patch, but if we want to do it we can do it in this patch too. The question of whether we want to drop bw. compat and simply update CharArraySet to Unicode 4.0 seems more important. But IMO if we push Version to StopFilter, we can also make CharArraySet use Version. Thoughts?
[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0
[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783402#action_12783402 ] Michael McCandless commented on LUCENE-2094: bq. I guess i think this getEnablePositionIncrementsVersionDefault should be deprecated along with the ctors that take this boolean argument, and it should all be driven off a single Version argument for simplicity OK, I agree, let's also push Version down into StopFilter (to get the posIncr setting).
[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0
[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783399#action_12783399 ] Robert Muir commented on LUCENE-2094: - Uwe, yeah, that is what I was thinking. I guess I think an alternate ctor that allows explicit control of this with a boolean is ok, but if you want the "defaults" it should just be with Version. This really doesn't have a lot to do with Simon's patch, but it becomes noticeable now.
[jira] Issue Comment Edited: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0
[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783396#action_12783396 ] Uwe Schindler edited comment on LUCENE-2094 at 11/29/09 12:56 PM: -- Mike didn't want to add matchVersion to StopFilter at this time, but when we change this, we should remove this static method, or deprecate it and not use it anymore in the code. Instead, use only matchVersion everywhere and eliminate the enablePosIncr setting altogether. was (Author: thetaphi): Mike didn't want to add matchVersion to StopFilter at this time, but when we change this, we should remove this static method or deprecate it and not use it anymore in the code. Instead use the matchVersion everywhere.
[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0
[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783396#action_12783396 ] Uwe Schindler commented on LUCENE-2094: --- Mike didn't want to add matchVersion to StopFilter at this time, but when we change this, we should remove this static method, or deprecate it and not use it anymore in the code. Instead use the matchVersion everywhere.
[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0
[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783395#action_12783395 ] Robert Muir commented on LUCENE-2094: - Hi Simon, One thing I noticed is that with this patch we get:
{code}
public StopFilter(Version matchVersion, boolean enablePositionIncrements, TokenStream input, Set stopWords, boolean ignoreCase)
{code}
I know this is really not related to what you are doing here, but I wonder if instead StopFilter should look like this:
{code}
public StopFilter(Version matchVersion, TokenStream input, Set stopWords, boolean ignoreCase)
{code}
and use matchVersion to determine enablePositionIncrements. I think it's already weird how you create a StopFilter: you have to pass Version to a static method, getEnablePositionIncrementsVersionDefault. I don't think the user should have to pass Version twice:
{code}
new StopFilter(Version.WHATEVER, StopFilter.getEnablePositionIncrementsVersionDefault(Version.WHATEVER), ...)
{code}
I guess I think this getEnablePositionIncrementsVersionDefault should be deprecated along with the ctors that take this boolean argument, and it should all be driven off a single Version argument for simplicity.
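The proposal can be sketched in a few lines (a minimal sketch with a simplified stand-in enum, not Lucene's actual Version class; the 2.9 cutover for position increments is an assumption drawn from the existing static helper's behavior): with a single matchVersion argument, the filter derives the enablePositionIncrements default itself instead of making every caller pass Version twice.

```java
// Simplified sketch of a Version-driven default. The enum and method names
// are stand-ins; Lucene's real Version and StopFilter differ.
public class VersionDrivenDefault {
    enum Version { LUCENE_24, LUCENE_29, LUCENE_30, LUCENE_31 }

    // The ctor would call this internally, so callers pass matchVersion once
    // and never touch the boolean or a static helper.
    static boolean enablePositionIncrements(Version matchVersion) {
        // Assumed cutover: position increments on by default from 2.9 onward.
        return matchVersion.compareTo(Version.LUCENE_29) >= 0;
    }

    public static void main(String[] args) {
        System.out.println(enablePositionIncrements(Version.LUCENE_24)); // false
        System.out.println(enablePositionIncrements(Version.LUCENE_31)); // true
    }
}
```

The design point is that each version-dependent behavior becomes an internal detail keyed off one argument, so future bugfixes can be gated the same way without new ctor overloads.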
[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0
[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783394#action_12783394 ] Simon Willnauer commented on LUCENE-2094: - bq. If the LowerCaseFilter is applied before the stopwords, there is no need for ignore-case checking. no doubt! :) But if you do not want your terms to be lowercased, yet do not care whether "The" has an uppercase "T", you want this behaviour. Either way we go, we need the version somehow to preserve bw. compat. We should rather think about breaking bw. compat for this particular language (Deseret), but we have no idea what happens with Unicode in the future. It's tough.
[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0
[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783393#action_12783393 ] Uwe Schindler commented on LUCENE-2094: --- bq. Either way, if the set is lowercased or not, the lowercasing is also applied to the values checked against the set. If the LowerCaseFilter is applied before the stopwords, there is no need for ignore-case checking.
[jira] Assigned: (LUCENE-2062) Bulgarian Analyzer
[ https://issues.apache.org/jira/browse/LUCENE-2062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir reassigned LUCENE-2062: --- Assignee: Robert Muir > Bulgarian Analyzer > -- > > Key: LUCENE-2062 > URL: https://issues.apache.org/jira/browse/LUCENE-2062 > Project: Lucene - Java > Issue Type: New Feature > Components: contrib/analyzers >Reporter: Robert Muir >Assignee: Robert Muir >Priority: Minor > Fix For: 3.1 > > Attachments: LUCENE-2062.patch > > > someone asked about bulgarian analysis on solr-user today... > http://www.lucidimagination.com/search/document/e1e7a5636edb1db2/non_english_languages > I was surprised we did not have anything. > This analyzer implements the algorithm specified here, > http://members.unine.ch/jacques.savoy/Papers/BUIR.pdf > In the measurements there, this improves MAP approx 34% -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0
[ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783392#action_12783392 ] Simon Willnauer commented on LUCENE-2094: - bq. Why do you use Version.LUCENE_CURRENT for all predefined stop word sets (ok, they do not need a match version, because they are already lowercased). 1. They do not ignore case at all, so the version will not affect those sets. 2. They are private and we have full control over the sets. They are all lowercased (as you figured correctly) and none of them contains any supplementary character. 3. They are static and private, so passing any user-supplied version is not feasible. bq. In my opinion the whole stuff is only needed for CharArraySets which are not already lowercased. So is there any CharArraySet in Lucene with predefined stop words that is not lowercased? Either way, whether the set is lowercased or not, the lowercasing is also applied to the values checked against the set.
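The underlying Unicode 4.0 problem can be seen in a standalone sketch (illustrative code, not the CharArraySet implementation): lowercasing a supplementary character one UTF-16 char at a time leaves the surrogate pair untouched, so an ignore-case lookup misses, while per-code-point lowercasing maps it correctly. Deseret (e.g. U+10400, whose lowercase is U+10428) is a convenient test case because it is a cased script outside the BMP.

```java
public class SupplementaryLowercase {
    // Lowercase each UTF-16 char independently (the pre-Unicode-4 behavior).
    // Surrogate halves have no case mapping, so supplementary chars pass through unchanged.
    static String lowercasePerChar(String s) {
        char[] chars = s.toCharArray();
        for (int i = 0; i < chars.length; i++) {
            chars[i] = Character.toLowerCase(chars[i]);
        }
        return new String(chars);
    }

    // Lowercase by code point, which handles surrogate pairs correctly.
    static String lowercasePerCodePoint(String s) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            sb.appendCodePoint(Character.toLowerCase(cp));
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String upper = new String(Character.toChars(0x10400)); // DESERET CAPITAL LETTER LONG I
        String lower = new String(Character.toChars(0x10428)); // DESERET SMALL LETTER LONG I
        System.out.println(lowercasePerChar(upper).equals(lower));      // false: pair untouched
        System.out.println(lowercasePerCodePoint(upper).equals(lower)); // true
    }
}
```

This is why a lowercased key stored in an ignore-case set cannot match the uppercase form under per-char lowercasing, and why the fix has to be version-gated for back compat.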
[jira] Commented: (LUCENE-2067) Czech Stemmer
[ https://issues.apache.org/jira/browse/LUCENE-2067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783390#action_12783390 ] Robert Muir commented on LUCENE-2067: - bq. well at least I figured out there must be something wrong I appreciate the review... it is frustrating that you have to pay $ to view the paper right now. On the other hand, we are lucky when researchers are this open about their experiments... it saves a lot of work. > Czech Stemmer > - > > Key: LUCENE-2067 > URL: https://issues.apache.org/jira/browse/LUCENE-2067 > Project: Lucene - Java > Issue Type: New Feature > Components: contrib/analyzers >Reporter: Robert Muir >Assignee: Robert Muir >Priority: Minor > Fix For: 3.1 > > Attachments: LUCENE-2067.patch, LUCENE-2067.patch, LUCENE-2067.patch, > LUCENE-2067.patch > > > Currently, the CzechAnalyzer is merely stopwords, and there isn't a Czech > stemmer in Snowball. > This patch implements the light stemming algorithm described in: > http://portal.acm.org/citation.cfm?id=1598600 > In their measurements, it improves MAP approx 42% -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Resolved: (LUCENE-2067) Czech Stemmer
[ https://issues.apache.org/jira/browse/LUCENE-2067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir resolved LUCENE-2067. - Resolution: Fixed Committed revision 885216.
[jira] Commented: (LUCENE-2067) Czech Stemmer
[ https://issues.apache.org/jira/browse/LUCENE-2067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783389#action_12783389 ] Simon Willnauer commented on LUCENE-2067: - bq. make the stem filter final, and add explicit test for the mobile e rewrite looks good to me! Go ahead and commit. bq. Sorry for the confusion (pointing you at a slightly different algorithm)... well at least I figured out there must be something wrong :)
[jira] Resolved: (LUCENE-1844) Speed up junit tests
[ https://issues.apache.org/jira/browse/LUCENE-1844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved LUCENE-1844. Resolution: Fixed Fix Version/s: 3.1 Thanks Erick & Mark! Next step is to find some generic way to parallelize the tests... > Speed up junit tests > > > Key: LUCENE-1844 > URL: https://issues.apache.org/jira/browse/LUCENE-1844 > Project: Lucene - Java > Issue Type: Improvement >Reporter: Mark Miller >Assignee: Michael McCandless > Fix For: 3.1 > > Attachments: FastCnstScoreQTest.patch, hi_junit_test_runtimes.png, > LUCENE-1844-Junit3.patch, LUCENE-1844.patch, LUCENE-1844.patch, > LUCENE-1844.patch > > > As Lucene grows, so does the number of JUnit tests. This is obviously a good > thing, but it comes with longer and longer test times. Now that we also run > back compat tests in a standard test run, this problem is essentially doubled. > There are some ways this may get better, including running parallel tests. > You will need the hardware to fully take advantage, but it should be a nice > gain. There is already an issue for this, and Junit 4.6, 4.7 have the > beginnings of something we might be able to count on soon. 4.6 was buggy, and > 4.7 still doesn't come with nice ant integration. Parallel tests will come > though. > Beyond parallel testing, I think we also need to concentrate on keeping our > tests lean. We don't want to sacrifice coverage or quality, but I'm sure > there is plenty of fat to skim. > I've started making a list of some of the longer tests - I think with some > work we can make our tests much faster - and then with parallelization, I > think we could see some really great gains. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Socket and file locks
This looks great! Maybe it makes most sense to create a wiki page (http://wiki.apache.org/lucene-java) for interesting LockFactory implementations/tradeoffs, and add this there? Mike On Sat, Nov 28, 2009 at 9:26 AM, Sanne Grinovero wrote: > Hello, > Together with the Infinispan Directory we developed such a > LockFactory; I'd be more than happy if you wanted to add some pointers > to it in the Lucene documentation/readme. > This depends on Infinispan for multiple-machines communication > (JGroups, indirectly) but > it's not required to use an Infinispan Directory, you could combine it > with a Directory impl of your choice. > This was tested with the LockVerifyServer mentioned by Michael > McCandless and also > with some other tests inspired from it (in-VM for lower-delay > coordination and verification, while the LockFactory was forced to > use real network communication). > > While this is a technology preview and performance regarding the > Directory code is still unknown, I believe the LockFactory was the > most tested component. > > free to download and inspect (LGPL): > http://anonsvn.jboss.org/repos/infinispan/trunk/lucene-directory/ > > Regards, > Sanne > > 2009/11/27 Michael McCandless : >> I think a LockFactory for Lucene that implemented the ideas you & >> Marvin are discussing in LUCENE-1877, and/or the approach you >> implemented in the H2 DB, would be a useful addition to Lucene! >> >> For many apps, the simple LockFactory impls suffice, but for apps >> where multiple machines can become the writer, it gets hairy. Having >> an always-correct Lock impl for these apps would be great. >> >> Note that Lucene has some basic tools (in oal.store) for asserting >> that a LockFactory is correct (see LockVerifyServer), so it's a useful >> way to test that things are working from Lucene's standpoint. >> >> Mike >> >> On Fri, Nov 27, 2009 at 9:23 AM, Thomas Mueller >> wrote: >>> Hi, >>> >>> I'm wondering if you are interested in automatically releasing the >>> write lock.
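The property that LockVerifyServer asserts — that a LockFactory actually provides mutual exclusion — can be illustrated with a small in-VM sketch. This is plain Java with no Lucene dependency; `obtain`/`release` are hypothetical stand-ins for a Lock implementation, and the "verifier" simply tracks how many holders are ever inside the critical section at once:

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

public class LockInvariantCheck {
    // A toy lock: obtain() succeeds only via an atomic compare-and-set.
    static final AtomicBoolean lock = new AtomicBoolean(false);
    // Verifier state: current holder count, and the max ever observed.
    static final AtomicInteger holders = new AtomicInteger(0);
    static final AtomicInteger maxHolders = new AtomicInteger(0);

    static boolean obtain() { return lock.compareAndSet(false, true); }
    static void release() { lock.set(false); }

    public static void main(String[] args) throws InterruptedException {
        Thread[] ts = new Thread[8];
        for (int i = 0; i < ts.length; i++) {
            ts[i] = new Thread(() -> {
                for (int j = 0; j < 10_000; j++) {
                    if (obtain()) {
                        // Inside the critical section: record holder count.
                        int h = holders.incrementAndGet();
                        maxHolders.getAndAccumulate(h, Math::max);
                        holders.decrementAndGet();
                        release();
                    }
                }
            });
            ts[i].start();
        }
        for (Thread t : ts) t.join();
        // A correct lock never lets two holders overlap.
        System.out.println("max concurrent holders = " + maxHolders.get());
    }
}
```

A broken lock (e.g. replacing the compare-and-set with a plain read-then-write) would eventually report more than one concurrent holder, which is exactly the kind of violation the real verifier is there to catch across processes.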
See also my comments on >>> https://issues.apache.org/jira/browse/LUCENE-1877 - I thought it's a >>> problem worth solving, because it's also in the Lucene FAQ list at >>> http://wiki.apache.org/lucene-java/LuceneFAQ#What_is_the_purpose_of_write.lock_file.2C_when_is_it_used.2C_and_by_which_classes.3F >>> >>> Unfortunately there seems to be no solution that 'always works', but >>> delegating the task and responsibility to the application / to the >>> user is problematic as well. For example, a user of the H2 database >>> (which supports Lucene fulltext indexing) suggested automatically >>> removing the write.lock file whenever the file is there: >>> http://code.google.com/p/h2database/issues/detail?id=141 - sounds a >>> bit dangerous in my view. >>> >>> So, if you are interested in solving the problem, then maybe I can help. >>> If not, then I will not bother you any longer :-) >>> >>> Regards, >>> Thomas >>> >>> >>> > > shouldn't active code like that live in the application layer? > Why? You can all but guarantee that polling will work at the app layer >>> >>> The application layer may also run with low priority. In operating >>> systems, it's usually the lower layers that have more 'rights' >>> (priority), and not the higher levels (I'm not saying it should be >>> like that in Java). I just think the application layer should not have >>> to deal with write locks or removing write locks. >>> by the time the original process realizes that it doesn't hold the lock anymore, the damage could already have been done. >>> >>> Yes, I'm not sure how to best avoid that (with any design). Asking the >>> application layer or the user whether the lock file can be removed is >>> probably more dangerous than doing the best we can in Lucene. >>> >>> Standby / hibernate: the question is, if the machine's process is >>> currently not running, does the process still hold the lock? I think >>> no, because the machine might as well be turned off. 
How to detect >>> whether the machine is turned off versus in hibernate mode? I guess >>> that's a problem for all mechanisms (socket / file lock / background >>> thread). >>> >>> When a hibernated process wakes up again, it thinks it owns the lock. >>> Even if the process checks before each write, it is unsafe: >>> >>> if (isStillLocked()) { >>> write(); >>> } >>> >>> The process could wake up after isStillLocked() but before write(). >>> One protection is: the second process (the one that breaks the lock) >>> would need to work on a copy of the data instead of the original file >>> (it could delete / truncate the original file after creating a copy). >>> On Windows, renaming the file might work (not sure); on Linux you >>> probably need to copy the content to a new file. That way, the awoken >>> process can only destroy inactive data. >>> >>> The question is: do we need to solve this problem? How big is the >>> risk? Instead of solving this problem completely, you could detect it >>> after the fact witho
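The "work on a copy" protection described in that email can be sketched in plain Java. This is only an illustration of the idea, not Lucene code; `breakLockAndAdopt` is a hypothetical name, and a real implementation would also have to copy/truncate atomically with respect to the stale writer:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class CopyOnBreakLock {
    // Called by the second process once it decides the old lock is stale:
    // move the live data to a fresh file and truncate the original, so a
    // hibernated process that wakes up can only write to dead data.
    static Path breakLockAndAdopt(Path original) throws IOException {
        Path copy = original.resolveSibling(original.getFileName() + ".gen2");
        Files.copy(original, copy, StandardCopyOption.REPLACE_EXISTING);
        Files.write(original, new byte[0]); // truncate: original is now inactive
        return copy; // the new lock owner works on the copy from here on
    }

    public static void main(String[] args) throws IOException {
        Path data = Files.createTempFile("index", ".dat");
        Files.write(data, "live-data".getBytes());
        Path adopted = breakLockAndAdopt(data);
        // A stale writer waking up after isStillLocked() writes to the
        // original path...
        Files.write(data, "stale-write".getBytes());
        // ...but the adopted copy, which the new owner uses, is untouched.
        System.out.println(new String(Files.readAllBytes(adopted)));
    }
}
```

Running `main` prints `live-data`: the stale write lands only on the truncated original, matching the email's point that the awoken process can then only destroy inactive data.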
[jira] Updated: (LUCENE-2097) In NRT mode, and CFS enabled, IndexWriter incorrectly ties up disk space
[ https://issues.apache.org/jira/browse/LUCENE-2097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-2097: --- Attachment: LUCENE-2097.patch Attached patch with test case that shows the issue. Not yet sure what's the best way to fix it... probably we have to build the CFS before opening the reader we want to pool. > In NRT mode, and CFS enabled, IndexWriter incorrectly ties up disk space > > > Key: LUCENE-2097 > URL: https://issues.apache.org/jira/browse/LUCENE-2097 > Project: Lucene - Java > Issue Type: Bug > Components: Index >Affects Versions: 2.9, 2.9.1, 3.0 >Reporter: Michael McCandless >Assignee: Michael McCandless >Priority: Minor > Fix For: 3.1 > > Attachments: LUCENE-2097.patch > > > Spinoff of java-user thread titled "searching while optimize"... > If IndexWriter is in NRT mode (you've called getReader() at least > once), and CFS is enabled, then internally the writer pools readers. > However, after a merge completes, it opens the reader against the > non-CFS segment files, and pools that. It then builds the CFS file, > as well, thus tying up the storage for that segment twice. > Functionally the bug is harmless (it's only a disk space issue). > Also, when the segment is merged, the disk space is released again > (though the newly merged segment will also be double-tied-up). > Simple workaround is to use non-CFS mode, or, don't use getReader.
[jira] Created: (LUCENE-2097) In NRT mode, and CFS enabled, IndexWriter incorrectly ties up disk space
In NRT mode, and CFS enabled, IndexWriter incorrectly ties up disk space Key: LUCENE-2097 URL: https://issues.apache.org/jira/browse/LUCENE-2097 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: 3.0, 2.9.1, 2.9 Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Fix For: 3.1 Spinoff of java-user thread titled "searching while optimize"... If IndexWriter is in NRT mode (you've called getReader() at least once), and CFS is enabled, then internally the writer pools readers. However, after a merge completes, it opens the reader against the non-CFS segment files, and pools that. It then builds the CFS file, as well, thus tying up the storage for that segment twice. Functionally the bug is harmless (it's only a disk space issue). Also, when the segment is merged, the disk space is released again (though the newly merged segment will also be double-tied-up). Simple workaround is to use non-CFS mode, or, don't use getReader.
[jira] Commented: (LUCENE-2061) Create benchmark & approach for testing Lucene's near real-time performance
[ https://issues.apache.org/jira/browse/LUCENE-2061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783376#action_12783376 ] Michael McCandless commented on LUCENE-2061: bq. Can you post the queries file you've used? I only used TermQuery "1", sorting by score. I'd generally like to focus on worst case query latency rather than QPS of "easy" queries. Maybe we should switch to harder queries (phrase, boolean). Though one thing I haven't yet focused on testing (which your work on LUCENE-1785 would improve) is queries that hit the FieldCache -- we should test that as well. {quote} I haven't seen the same results in regards to the OS managing small files, and I suspect that users in general will choose a variety of parameters (i.e. 1 max buffered doc) that makes writing to disk inherently slow. Logically the OS should work as a write cache, however in practice, it seems a variety of users have reported otherwise. Maybe 100 docs works, however that feels like a fairly narrow guideline for users of NRT. {quote} Yeah we need to explore this (when OS doesn't do effective write-caching), in practice. {quote} The latest LUCENE-1313 is a step in a direction that doesn't change IW internals too much. {quote} I do like this simplification -- basically IW is internally managing how best to use RAM in NRT mode -- but I think we need to scrutinize (through benchmarking, here) whether this is really needed (ie, whether we can't simply rely on the OS to behave, with its IO cache). 
> Create benchmark & approach for testing Lucene's near real-time performance > --- > > Key: LUCENE-2061 > URL: https://issues.apache.org/jira/browse/LUCENE-2061 > Project: Lucene - Java > Issue Type: Task > Components: Index >Reporter: Michael McCandless >Assignee: Michael McCandless >Priority: Minor > Attachments: LUCENE-2061.patch, LUCENE-2061.patch, LUCENE-2061.patch > > > With the improvements to contrib/benchmark in LUCENE-2050, it's now > possible to create compelling algs to test indexing & searching > throughput against a periodically reopened near-real-time reader from > the IndexWriter. > Coming out of the discussions in LUCENE-1526, I think to properly > characterize NRT, we should measure net search throughput as a > function of both reopen rate (ie how often you get a new NRT reader > from the writer) and indexing rate. We should also separately measure > pure adds vs updates (deletes + adds); the latter is much more work > for Lucene. > This can help apps make capacity decisions... and can help us test > performance of pending improvements for NRT (eg LUCENE-1313, > LUCENE-2047).