Re: Lucene 2.9 and deprecated IR.open() methods
On Sat, Oct 3, 2009 at 03:29, Uwe Schindler u...@thetaphi.de wrote:
>> It is also probably a good idea to move various settings methods from IW
>> to that builder and have IW immutable with regard to configuration. I'm
>> speaking of the likes of setWriteLockTimeout, setRAMBufferSizeMB,
>> setMergePolicy, setMergeScheduler, setSimilarity.
>>
>>     IndexWriter.Builder iwb = IndexWriter.builder().
>>         writeLockTimeout(0).
>>         RAMBufferSize(config.indexationBufferMB).
>>         maxBufferedDocs(...).
>>         similarity(...).
>>         analyzer(...);
>>     ... = iwb.build(dir1);
>>     ... = iwb.build(dir2);
>>
>> A happy user of google-collections API :-)
>
> These builders are really cool!

I feel myself caught in the act. There are still a couple of things bothering me.

1. Introducing a builder, we'll have a whole heap of deprecated constructors
   that will hang there for eternity. And then users will scream in
   frustration - "This class has 14(!) constructors and all of them are
   deprecated! How on earth am I supposed to create this thing?"
2. If someone creates IW with some reflectish, javabeanish tools - he's
   busted. Not that I'm feeling compassionate for such a person.

> I like Earwin's version more. A builder is very flexible, because you can
> concat all your properties (like StringBuilder works with its append method
> returning itself) and create the instance at the end.

Besides (arguably) cleaner syntax, the lack of which is (arguably) a curse of
many Java libraries, it also allows us to return a different concrete
implementation of IW without breaking back-compat, and also to choose this
concrete implementation based on the settings provided. If we feel like doing
it at some point.
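For readers following along, here is a minimal self-contained sketch of the
fluent-builder idea under discussion (all names are hypothetical, not actual
Lucene API): each setter returns this, and build() freezes the configuration
into an immutable writer, so one builder can produce several writers.

    // Hypothetical sketch of an immutable-after-build writer; not Lucene API.
    public final class Writer {
        private final long writeLockTimeout;
        private final double ramBufferMB;

        private Writer(Builder b) {
            this.writeLockTimeout = b.writeLockTimeout;
            this.ramBufferMB = b.ramBufferMB;
        }

        public static Builder builder() { return new Builder(); }

        public static final class Builder {
            private long writeLockTimeout = 1000;  // defaults live here
            private double ramBufferMB = 16.0;

            public Builder writeLockTimeout(long ms) { writeLockTimeout = ms; return this; }
            public Builder ramBufferMB(double mb) { ramBufferMB = mb; return this; }

            // Settings are frozen at this point; the builder stays reusable,
            // so several writers can be built from one configuration.
            public Writer build() { return new Writer(this); }
        }
    }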
Re: Lucene 2.9 and deprecated IR.open() methods
> Though what about required settings? EG IW's builder must have Directory,
> Analyzer. Would we pass these as up-front args to the initial builder?

I'd try to keep required settings to a minimum. The only one absolutely
required, imho, is a Directory, and it's best to specify it in the create()
method, so you can set all your IW parameters and then build several
instances - for different Directories, for example.

If you decide to add more required settings, we're back to square one - after
a couple of years we're looking at 14 builder() methods. Okay, there is a way.
Take a look at how Guice handles binding declarations in Modules - different
builder methods may return different interfaces implemented by 'this'.

    class IndexWriter {
      public static NoAnalyzerYetBuilder builder() {
        return new HiddenTrueBuilder();
      }

      interface NoAnalyzerYetBuilder {
        NoAnalyzerYetBuilder setRAMBuffer(...);
        NoAnalyzerYetBuilder setUseCompound(...);
        Builder setAnalyzer(Analyzer);
      }

      interface Builder extends NoAnalyzerYetBuilder {
        Builder setRAMBuffer(...);
        Builder setUseCompound(...);
        IndexWriter create(Directory);
      }

      private static class HiddenTrueBuilder implements Builder {
        ...
      }
    }

This approach looks nice from the client side, but is a mess to implement.

> And shouldn't we still specify the version up-front so we can improve
> defaults over time without breaking back-compat? (Else, how can we change
> defaults?) EG:
>
>     IndexWriter.builder(Version.29, dir, analyzer)
>         .setRAMBufferSizeMB(128)
>         .setUseCompoundFile(false)
>         ...
>         .create()
>
> ?

It's probably okay to specify the version upfront. But also, nothing bad
happens if we do it like:

    IndexWriter.builder().
        defaultsFor(Version.29).
        setRam...
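To make the Guice-style trick above concrete, a hypothetical usage sketch
(it assumes the staged-builder interfaces sketched in the message; dir and
analyzer are presumed in scope). The point is that create() only exists on
the interface returned by the required setter, so forgetting the analyzer
becomes a compile error rather than a runtime check:

    // Won't compile: NoAnalyzerYetBuilder has no create() method.
    // IndexWriter w = IndexWriter.builder().setRAMBuffer(64).create(dir);

    // Compiles: setAnalyzer() returns the full Builder, which has create().
    IndexWriter w = IndexWriter.builder()
        .setRAMBuffer(64)
        .setAnalyzer(analyzer)
        .create(dir);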
Re: Lucene 2.9 and deprecated IR.open() methods
> Call me old fashioned, but I like how the non-constructor params are set now.

And what happens when you index some docs, change these params, index more
docs, change params, commit? Let's throw in some threads? You either end up
writing really hairy state-control code, or just leave it broken, with a
"Don't change parameters after you start pumping docs through it!" plea
covering your back somewhere in the JavaDocs.

If nothing else, having stuff final keeps the JIT really happy.

> And for some reason I like a config object over a builder pattern for the
> required constructor params.

The builder pattern allows you to switch concrete implementations as you
please, taking parameters into account or not. Besides that there's no real
difference. I prefer a builder, but that's just me :)

> That's just me though.
>
> Michael McCandless wrote:
>> OK, I agree, using the builder approach looks compelling!
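A hypothetical sketch of the config-object alternative mentioned above
(invented names, not actual Lucene API): the settings object is mutable while
you assemble it, but the writer snapshots the values once in its constructor,
so later mutation cannot race with indexing threads - and the fields stay
final, which is the JIT-friendliness point made above.

    // Hypothetical config object; the writer reads it exactly once.
    final class WriterConfig {
        private double ramBufferMB = 16.0;

        WriterConfig setRAMBufferSizeMB(double mb) { ramBufferMB = mb; return this; }
        double getRAMBufferSizeMB() { return ramBufferMB; }
    }

    final class ConfiguredWriter {
        private final double ramBufferMB; // final: safe under concurrency

        ConfiguredWriter(WriterConfig cfg) {
            this.ramBufferMB = cfg.getRAMBufferSizeMB(); // snapshot at construction
        }
    }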
Re: Optimization and Corruption Issues
> 2.0 is pre Mike's fabulous indexing updates - which just for one means one
> thread doing the merging rather than multiple. I'm sure overall it's much
> slower.

If you're doing a full optimize, you're still using a single thread. Am I
wrong?
Re: Optimization and Corruption Issues
>> If you're doing a full optimize, you're still using a single thread. Am I
>> wrong?
>
> Depends on how many merges are required, and the merge scheduler. In this
> case (w/ 7000 segments, which is way too many, normally!), assuming
> ConcurrentMergeScheduler, multiple threads will be used since many merges
> will be pending. When it gets down to the last (enormous) merge, it's only
> one thread.

I'm speaking about full optimize. Is there any way to do it more efficiently
than running a single, last (enormous) merge? If you try to parallelize,
you're merging some documents several times (more work) and killing your
disks, as merges are mostly IO-bound.
Re: Query Parsing was Fwd: Lab - Esqueranto
We use antlr, though without its tree api - that part is a bit of overkill
for us. It directly builds a query in our intermediate format, which is
traversed for synonym/phrase detection and converted to a lucene query. The
library/language itself is pretty easy to learn, flexible, and has a nice
IDE.

On Fri, Sep 25, 2009 at 19:17, Peter Keegan peterlkee...@gmail.com wrote:
> We're using Antlr for our query parsing. What I like about it:
> - flexibility of separate lexer/parser and tree api
> - excellent IDE for building/testing the grammar
> However, the learning curve was quite long for me, although this was my
> first real encounter with parsers.
> Peter
>
> On Fri, Sep 25, 2009 at 9:58 AM, Grant Ingersoll gsing...@apache.org wrote:
>> Has anyone looked at/used Antlr for Query Parser capabilities? There was
>> some discussion over at Apache Labs that might bear discussing in light
>> of our new Query Parser contrib.
>>
>> Begin forwarded message:
>> From: Tim Williams william...@gmail.com
>> Date: August 17, 2009 8:09:04 PM EDT
>> To: l...@labs.apache.org
>> Subject: Re: Lab - Esqueranto
>> Reply-To: l...@labs.apache.org
>>
>> On Mon, Aug 17, 2009 at 7:00 PM, Grant Ingersoll gsing...@apache.org wrote:
>>> On Aug 2, 2009, at 1:43 PM, Tim Williams wrote:
>>>> Hi Martin, Sure, if it works like I envision it, Lucene would just be
>>>> *one* concrete tree grammar implementation - there could be others (ie
>>>> OracleText). I'm thinking it is broader than one implementation -
>>>> otherwise, I reckon it's Yet Another Lucene Query Parser (YALQP). For
>>>> more practical reasons, I'm not a Lucene committer and it'd be slow
>>>> going to play around with this through JIRA patches to their sandbox.
>>> FWIW, Lucene has recently added a new, more flexible Query Parser that
>>> allows for separation of the various pieces (syntax, intermediate
>>> representation, Lucene Query). You might want to check it out and see
>>> how that fits.
>> Thanks Grant, yeah I've looked at that and it seems really (overly?)
>> complex for what I'm trying to achieve. It seems to re-implement much of
>> the goodness that antlr provides for free. For example, with antlr I
>> already get a lexer/parser grammar separate from the tree grammar. So, to
>> plug in a new parser syntax is trivial - just implement a new
>> lexer/parser grammar that provides tree rewrites consistent with a lucene
>> tree grammar. Conversely, to implement a new concrete implementation,
>> just implement a new tree grammar for the existing lexer/parser grammar.
>> Of course, maybe I'll get down this road and realize how naive my path is
>> and just switch over. For now, just looking at a query parser that, by
>> itself, is approaching the size of the lucene core code base is
>> intimidating :)
>> Thanks for the pointer though, I'm subscribed over there and will keep an
>> eye out for progress on the new parser.
>> Thanks,
>> --tim
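For illustration, a rough sketch of the conversion step described at the top
of this message - an ANTLR-built intermediate tree walked into a Lucene
query. The Node/WordNode/AndNode types are invented placeholders; only the
Lucene query classes are real:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    // Hypothetical intermediate-format nodes produced by the ANTLR parser.
    interface Node { java.util.List<Node> children(); }
    interface WordNode extends Node { String text(); }
    interface AndNode extends Node {}

    class TreeToLucene {
        Query toLucene(Node node, String field) {
            if (node instanceof WordNode) {
                // leaf: a single term
                return new TermQuery(new Term(field, ((WordNode) node).text()));
            }
            // inner node: AND maps to MUST, anything else to SHOULD
            BooleanQuery bq = new BooleanQuery();
            BooleanClause.Occur occur = node instanceof AndNode
                ? BooleanClause.Occur.MUST : BooleanClause.Occur.SHOULD;
            for (Node child : node.children()) {
                bq.add(toLucene(child, field), occur);
            }
            return bq;
        }
    }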
Re: How to leverage the LogMergePolicy calibrateSizeByDeletes patch in Solr ?
On Tue, Sep 22, 2009 at 19:08, Yonik Seeley yo...@lucidimagination.com wrote:
> On Tue, Sep 22, 2009 at 10:48 AM, Michael McCandless
> luc...@mikemccandless.com wrote:
>> John, are you using IndexWriter.setMergedSegmentWarmer, so that a newly
>> merged segment is warmed before it's put into production (returned by
>> getReader)?
> I'm still not sure I see the reason for complicating the IndexWriter with
> warming... can't this be done just as efficiently (if not more
> efficiently) in user/application space?

+1
Re: who clears attributes?
On Tue, Aug 11, 2009 at 15:09, Yonik Seeley yo...@lucidimagination.com wrote:
> On Tue, Aug 11, 2009 at 6:50 AM, Robert Muir rcm...@gmail.com wrote:
>> On Tue, Aug 11, 2009 at 4:28 AM, Michael Busch busch...@gmail.com wrote:
>>> There was a performance test in Solr that apparently ran much slower
>>> after upgrading to the new Lucene jar. This test is testing a rather
>>> uncommon scenario: very very short documents.
>> Actually, it's more uncommon than that: it's very very short documents
>> without implementing reusableTokenStream(). This makes it basically a
>> benchmark of ctor cost... doesn't really benchmark the token api in my
>> opinion.
> You would be surprised... there are quite a few Solr users that have
> relatively short documents... or even if they are sizeable documents, they
> have up to hundreds of short metadata-type fields (generally a token or
> two). Reusing TokenStreams has become a must in Solr IMO since
> construction costs (hashmap lookups, etc) and GC costs (larger objects)
> have been growing. I'm focused on that now... Robert's taking a crack at
> fixing things up so users can actually create reusable analyzers out of
> our filters: https://issues.apache.org/jira/browse/LUCENE-1794

+1. We don't use Solr, but have quite a bunch of medium and short-sized
documents. Plus heaps of metadata fields.

I'm yet to read Uwe's example, but I feel I'm a bit misunderstood by some of
you. My gripe with the new API is not that it brings us troubles (those are
solved one way or another); it is that the switch and associated migration
costs bring zero benefits in the immediate and remote future. The only
person who has tried to disprove this claim is Uwe. Others either say "the
problems are solved, so it's okay to move to the new API", or "this will be
usable when flexindexing arrives". Sorry, the last phrase doesn't hold its
place - this API is orthogonal to flexindexing, or at least nobody has shown
the opposite.

So, what I'm arguing against is adding some code (and forcing users to
migrate) just because we can, with no other reasons.
Re: who clears attributes?
>> The only person that tried to disprove this claim is Uwe. Others either
>> say the problems are solved, so it's okay to move to the new API, or this
>> will be usable when flexindexing arrives.
> Others (not me) have spent a lot of time going over this before (more than
> once I think) - they prob are just sick of retyping. Lots of searchable
> archives out there though.

Okay, I'll dig into them. Sorry for being a bother.
[jira] Commented: (LUCENE-1799) Unicode compression
[ https://issues.apache.org/jira/browse/LUCENE-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12741868#action_12741868 ]

Earwin Burrfoot commented on LUCENE-1799:
-----------------------------------------

I think right now this can be implemented as a delegating Directory.

> Unicode compression
> -------------------
>
> Key: LUCENE-1799
> URL: https://issues.apache.org/jira/browse/LUCENE-1799
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Store
> Affects Versions: 2.4.1
> Reporter: DM Smith
> Priority: Minor
>
> In lucene-1793, there is the off-topic suggestion to provide compression
> of Unicode data. The motivation was a custom encoding in a Russian
> analyzer. The original supposition was that it provided a more compact
> index. This led to the comment that a different or compressed encoding
> would be a generally useful feature. BOCU-1 was suggested as a
> possibility. This is a patented algorithm by IBM with an implementation in
> ICU. If Lucene provides its own implementation, a freely available,
> royalty-free license would need to be obtained. SCSU is another Unicode
> compression algorithm that could be used. An advantage of these methods is
> that they work on the whole of Unicode. If that is not needed, an encoding
> such as iso8859-1 (or whatever covers the input) could be used.
Re: indexing_slowdown_with_latest_lucene_udpate
Or, we can just throw that detection out of the window, for a less smooth
back-compat experience, less hacky code and no slowdown.

On Mon, Aug 10, 2009 at 19:02, Uwe Schindler u...@thetaphi.de wrote:
> The question is if that would get better if the reflection calls are only
> done one time per class, using an IdentityHashMap<Class, Boolean>. The
> other reflection code in AttributeSource uses a static cache for such type
> of things (e.g. the Attribute -> AttributeImpl mappings in
> AttributeSource.DefaultAttributeFactory.getClassForInterface()). I could
> do some tests about that and supply a patch. I was thinking about that but
> threw it away (as it needs some synchronization on the cache Map, which
> may also overweigh).
>
> -----Original Message-----
> From: Mark Miller [mailto:markrmil...@gmail.com]
> Sent: Monday, August 10, 2009 4:48 PM
> To: java-dev@lucene.apache.org
> Subject: Re: indexing_slowdown_with_latest_lucene_udpate
>
> Robert Muir wrote:
>> This is real and not just for very short docs.
> Yes, you still pay the cost for longer docs, but it just becomes less
> important the longer the docs, as it plays a smaller role. Load a ton of
> one-term docs, and it might be 50-60% slower - add a bunch of articles,
> and it might be closer to 20%-15% (I don't know the numbers, but the
> longer I made the docs, the less % slowdown, obviously). Still a good hit,
> but a short-doc test magnifies the problem. It affects things no matter
> what, but when you don't do much tokenizing/normalizing, the cost of the
> reflection/tokenstream init dominates.
> - Mark
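A sketch of the per-class cache Uwe describes (hypothetical names; the real
LUCENE-1796 code differs): the reflection check runs once per class, and the
synchronized map is exactly the trade-off he mentions.

    import java.util.Collections;
    import java.util.IdentityHashMap;
    import java.util.Map;

    final class OverrideCache {
        // One entry per concrete class; identity semantics are enough (and
        // fast) because Class objects are canonical within a classloader.
        private static final Map<Class<?>, Boolean> CACHE =
            Collections.synchronizedMap(new IdentityHashMap<Class<?>, Boolean>());

        // Does clazz override the given no-arg method declared on base?
        static boolean overrides(Class<?> clazz, Class<?> base, String method) {
            Boolean cached = CACHE.get(clazz);
            if (cached != null) return cached.booleanValue();
            boolean result;
            try {
                // if the declaring class isn't the base class, it's overridden
                result = clazz.getMethod(method).getDeclaringClass() != base;
            } catch (NoSuchMethodException e) {
                result = false;
            }
            CACHE.put(clazz, Boolean.valueOf(result));
            return result;
        }
    }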
[jira] Commented: (LUCENE-1793) remove custom encoding support in Greek/Russian Analyzers
[ https://issues.apache.org/jira/browse/LUCENE-1793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12741372#action_12741372 ]

Earwin Burrfoot commented on LUCENE-1793:
-----------------------------------------

bq. I am guessing the rationale for the current code is to try to reduce
index size? (since these languages are double-byte encoded in Unicode).

The rationale was most probably to support existing non-unicode
systems/databases/files, whatever. My say is - anyone still holding onto
koi8, cp1251 and friends should silently do harakiri.

> remove custom encoding support in Greek/Russian Analyzers
> ----------------------------------------------------------
>
> Key: LUCENE-1793
> URL: https://issues.apache.org/jira/browse/LUCENE-1793
> Project: Lucene - Java
> Issue Type: Improvement
> Components: contrib/analyzers
> Reporter: Robert Muir
> Priority: Minor
> Attachments: LUCENE-1793.patch
>
> The Greek and Russian analyzers support custom encodings such as KOI-8;
> they define things like lowercasing and tokenization for these. I think
> that analyzers should support unicode and that conversion/handling of
> other charsets belongs somewhere else. I would like to deprecate/remove
> the support for these other encodings.
Re: who clears attributes?
I'll deviate from the topic somewhat. What are the exact benefits that the
new tokenstream API yields? Are we sure we want it released with 2.9? By now
I only see various elaborate problems, but haven't seen a single piece of
code becoming simpler.

On Mon, Aug 10, 2009 at 21:50, Uwe Schindler u...@thetaphi.de wrote:
> Yes. Is there a way to enforce this for all Tokenizers automatically? As
> incrementToken() will be abstract in 3.0, there cannot be a default impl.
> So all Tokenizers should call clearAttributes() as the first call in
> incrementToken().
>
> Then we still have the problem of the slow iterator creation (which was
> sped up a little bit by removing the unmodifiable wrapper). This can be
> solved by using an additional ArrayList in AttributeSource that gets all
> AttributeImpl instances, but this would bring an additional initialization
> cost on creating the Tokenizer chain.
>
> -----Original Message-----
> From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley
> Sent: Monday, August 10, 2009 7:42 PM
> To: java-dev@lucene.apache.org
> Subject: Re: who clears attributes?
>
> Thinking through this a little more, I don't see an alternative to the
> tokenizer clearing all attributes at the start of incrementToken().
> Consider a DefaultPayloadTokenFilter that only sets a payload if one isn't
> already set - it's clear that this filter can't clear the payload
> attribute, so it must be cleared by the head of the chain - the tokenizer.
> Right?
> -Yonik
> http://www.lucidimagination.com
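A sketch of the contract being discussed, against the 2.9-style attribute
API (the tokenizer itself is invented for illustration): clearAttributes()
is the very first call in incrementToken(), so stale values - e.g. a payload
a filter set on the previous token - cannot leak into the next one.

    import java.io.IOException;
    import java.io.Reader;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;

    public final class WholeInputTokenizer extends Tokenizer {
        private final TermAttribute termAtt = addAttribute(TermAttribute.class);
        private boolean done = false;

        public WholeInputTokenizer(Reader input) { super(input); }

        public boolean incrementToken() throws IOException {
            clearAttributes();           // contract: reset ALL attributes first
            if (done) return false;
            char[] buf = new char[256];
            int len = input.read(buf);   // emit the whole input as one token
            if (len <= 0) return false;
            termAtt.setTermBuffer(buf, 0, len);
            done = true;
            return true;
        }

        public void reset(Reader input) throws IOException {
            super.reset(input);          // makes the tokenizer reusable
            done = false;
        }
    }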
Re: who clears attributes?
On Mon, Aug 10, 2009 at 22:50, Grant Ingersoll gsing...@apache.org wrote:
> On Aug 10, 2009, at 2:00 PM, Earwin Burrfoot wrote:
>> I'll deviate from the topic somewhat. What are the exact benefits that
>> the new tokenstream API yields? Are we sure we want it released with 2.9?
>> By now I only see various elaborate problems, but haven't seen a single
>> piece of code becoming simpler.
> In theory, it sets up for more indexing/searching possibilities in 3.0,
> but in the meantime, it is proving to be quite problematic due to back
> compatibility restrictions.

I'm not quite sure which exact indexing/searching possibilities the new API
opens for us.

Some new ways of handling text? Okay, I'd like each token to have one more
number in addition to posIncr, so I can have my 'true multiword synonyms'.
Maybe, just maybe, there will be a pair of other extensions. Usecases here
are really scarce. Plus, if they're successful/useful, they will most
probably be included out of the box, so we don't need much flexibility here.

Something other than text? Numbers, with good rangequeries. Dates. Spatial
data. Your-type-here. For these, a flexible text-processing stream-oriented
API is totally useless.

> I have serious doubts about releasing this new API until these performance
> issues are resolved and better proven out from a usability standpoint. It
> simply is too much to swallow for most users, as
> Analyzers/TokenStreams/etc. are easily the most common place for people to
> inject their own capabilities, and there is no way we should be taking a
> 30% hit in performance for some theoretical speedup and new search
> capability 1 year from now.

I have a feeling that the best idea, before more damage is done, is to roll
back this new API, store the patch, and try rolling it out once again when
we have usecases/more code to justify it.
Re: 2.5 versus 2.9, was Re: who clears attributes?
On Tue, Aug 11, 2009 at 00:37, Michael Busch busch...@gmail.com wrote:
> On 8/10/09 1:30 PM, Grant Ingersoll wrote:
> I think your 2.5 proposal has drawbacks: if we release 2.5 now to test the
> new major features in the field, then do you want to stop adding new
> features to trunk until we release 2.9, to not have the same situation
> then again? How long should this testing in the field take? I don't know.
> How long does any release cycle last in Lucene? But we'll always have the
> same problem, no? We need to find a solution that allows us to keep adding
> features; dedicated deprecation releases are not good.

Parallel branches. The only way of simultaneously satisfying several
conflicting needs in software development.
Re: who clears attributes?
On Tue, Aug 11, 2009 at 00:54, Uwe Schindler u...@thetaphi.de wrote:
>> I have serious doubts about releasing this new API until these
>> performance issues are resolved and better proven out from a usability
>> standpoint.
> I think LUCENE-1796 has fixed the performance problems, which were caused
> by a missing reflection-cache needed for bw compatibility. I hope to
> commit soon! 2.9 may be a little bit slower when you mix old and new API
> and do not reuse Tokenizers (but Robert is already adding
> reusableTokenStream to all contrib analyzers). When the backwards layer is
> removed completely or setOnlyUseNewAPI is enabled, there is no speed
> impact at all.
>> The Analysis features of Lucene are the single most common place where
>> people enhance Lucene. Very few add queries, or muck with field caches,
>> but they do write their own Analyzers and TokenStreams, etc. Within that,
>> mixing old and new is likely the most common case for everyone who has
>> made their own customizations, so "a little bit slower" is something I'd
>> rather not live with just for the sake of some supposed goodness in a
>> year or two.
> But because of this flexibility, we added the backwards layer. The old
> style with setUseNewAPI was not flexible at all, and nobody would move his
> Tokenizers to the new API without that flexibility (maybe he uses external
> analyzer packages not yet updated). With "a little bit" I mean the cost of
> wrapping the old and new API is really minimal; it is just an if statement
> and a method call, hopefully optimized away by the JVM. In my tests the
> standard deviation between different test runs was much higher than the
> difference between mixing old/new API (on Win32), so it is not really
> certain that the cost comes from the delegation. The only case that is
> really slower (now a minimized cost of creation in TokenStream.init) is if
> you do not reuse TokenStreams: two LinkedHashMaps have to be created and
> set up. But this is not caused by the backwards layer.
> Uwe

Uwe, the problems I raised are still here - what is the benefit of moving to
this API right now? I see none. What is the future benefit of moving to this
API? It is very vague.

Someone said this API is generic, but there are different kinds of
genericity. Are we sure we abstracted the right thing? How will it be used?
Where are the examples? Right now it is an exercise in programming, which
forces us to do new and new exercises. Very exciting, very rewarding, but as
of now - pointless.
Re: pieces missing in reusable analyzers?
> I had thought that implementing reusable analyzers in solr was going to be
> cake... but either I'm missing something, or Lucene is missing something.
> Here's the way that one used to create custom analyzers:
>
>     class CustomAnalyzer extends Analyzer {
>       public TokenStream tokenStream(String fieldName, Reader reader) {
>         return new LowerCaseFilter(new NGramTokenFilter(new StandardTokenizer(reader)));
>       }
>     }
>
> Now let's try to make this reusable:
>
>     class CustomAnalyzer2 extends Analyzer {
>       public TokenStream tokenStream(String fieldName, Reader reader) {
>         return new LowerCaseFilter(new NGramTokenFilter(new StandardTokenizer(reader)));
>       }
>
>       @Override
>       public TokenStream reusableTokenStream(String fieldName, Reader reader) throws IOException {
>         TokenStream ts = getPreviousTokenStream();
>         if (ts == null) {
>           ts = tokenStream(fieldName, reader);
>           setPreviousTokenStream(ts);
>           return ts;
>         } else {
>           // uh... how do I reset a token stream?
>           return ts;
>         }
>       }
>     }
>
> See the missing piece? Seems like TokenStream needs a reset(Reader r)
> method or something?

I'm just keeping a reference to the Tokenizer, so I can reset it with a new
reader. Though this situation is awkward, TS definitely does not need a
reset(Reader).
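A sketch of the workaround described in the reply (the SavedStreams idiom;
NGramTokenFilter is left out here since, as noted below, it cannot reset its
state): keep a reference to the Tokenizer at the head of the chain, re-point
it at the new Reader, and reset() the chain so filters can clear their state.

    import java.io.IOException;
    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardTokenizer;

    class CustomAnalyzer3 extends Analyzer {
        private static final class SavedStreams {
            StandardTokenizer source;   // head of the chain, kept for reset
            TokenStream result;         // tail of the chain, handed to callers
        }

        public TokenStream tokenStream(String fieldName, Reader reader) {
            return new LowerCaseFilter(new StandardTokenizer(reader));
        }

        public TokenStream reusableTokenStream(String fieldName, Reader reader)
                throws IOException {
            SavedStreams streams = (SavedStreams) getPreviousTokenStream();
            if (streams == null) {
                streams = new SavedStreams();
                streams.source = new StandardTokenizer(reader);
                streams.result = new LowerCaseFilter(streams.source);
                setPreviousTokenStream(streams);
            } else {
                streams.source.reset(reader); // point the tokenizer at new input
                streams.result.reset();       // let filters clear their state
            }
            return streams.result;
        }
    }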
Re: who clears attributes?
> Well, I have real use cases for it, but all of it is still missing the
> biggest piece: search side support. It's the 900 lb. elephant in the room.
> The 500 lb. elephant is the fact that all these attributes, AIUI, require
> you to hook in your own indexing chain, etc. in order to even be indexed,
> which is all package-private stuff. It's not even clear to me what happens
> right now if you were to, say, have a TokenStream that had only one
> Attribute on it and none of the existing attributes (term buffer, length,
> position, etc.). Please correct me if I am wrong, I still don't have a
> deep understanding of it all. Even pseudocode would be good.

A custom indexing chain for abstract attributes sounds like one of
microsoft.com definitions - serious, determined, but vague. If you take the
current Token and start throwing away some of its fields, the resulting
index contents are obvious for some combinations and absurd for others. You
don't need this new API to handle the obvious ones.

> Oh, and now it seems the new QP is dependent on it all.

That's why I said earlier "before more damage is done".

> Michael has always been up front that this new API is in preparation for
> flexible indexing. It doesn't give us the goodness - he has laid out the
> reasons for moving before the goodness comes more than once, I think.

My problem is not waiting for 'goodness'. It is that I don't currently see
what goodness will come from this API even in the remote future. That's why
I am asking! :)

> Flexible indexing will lead to all kinds of little cool things - the likes
> of which have been discussed a lot in older emails. It will likely lead to
> things we cannot predict as well. Everything will be more flexible. It
> also could play a part in CSF, and work on allowing custom files to plug
> into merging. Plus everything else that's been mentioned (pfor, etc). I've
> been sold on the long term benefits. I don't think you need this API for
> them, but it's my understanding it helps solve part of the equation.

Yeah. I, too, would like to see all these little cool things, and I don't
think we need this API for them. Flexible indexing is going to handle
various different datatypes besides text, so I can only reiterate - it
cannot rely on a generic stream-based text-handling API for consuming data.

> A bunch of issues have come up. To my knowledge, they have been addressed
> with vigor every time. If someone is unhappy with how something has been
> addressed, and it needs to be addressed further, please speak up.
> Otherwise, I don't think the sky is falling - I think the new API is being
> shaken out.

An API is born dead without usecases. If a year later we get closer to the
flexindexing it is supposed to support, and then we understand we missed
some crucial thing - WHAM! - our back-compat policy kicks in and makes our
lives miserable once more.
Re: pieces missing in reusable analyzers?
>>> I'm just keeping a reference to the Tokenizer, so I can reset it with a
>>> new reader. Though this situation is awkward, TS definitely does not
>>> need a reset(Reader).
>> Then how do you notify the other filters that they should reset their
>> state? TokenStream.reset()? The javadoc specifies that it's actually used
>> for something else - but perhaps it can be reused for this purpose?

Yes, exactly. The TokenFilter override of reset() chains the call to the
input stream.

> I specifically used NGramTokenFilter in my example because it did use
> internal state (and it's a bug that it has no way to reset that state
> currently).

My filters are all my own, so they reset and chain properly.
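A sketch of the chaining just described (the filter itself is invented):
TokenFilter.reset() delegates to input.reset(), so a filter that overrides
it and calls super.reset() both clears its own state and propagates the
reset down the chain toward the tokenizer.

    import java.io.IOException;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;

    public abstract class StatefulFilter extends TokenFilter {
        private int pos; // hypothetical per-stream internal state

        protected StatefulFilter(TokenStream input) {
            super(input);
        }

        public void reset() throws IOException {
            super.reset(); // chains to input.reset()
            pos = 0;       // clear our own state for the next reader
        }
    }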
Re: ConcurrentMergeScheduler and MergePolicy question
On Sun, Aug 9, 2009 at 08:38, Jason Rutherglen jason.rutherg...@gmail.com wrote:
>> You don't have to copy. You can have one machine optimize your indexes
>> whilst the other serves user requests, then they switch roles, rinse,
>> repeat. This approach also works with sharding, and more than 2-way
>> mirroring.
> What does the un-optimized server do after the other server is optimized?
> The search requests go to the newly optimized server, however if we're
> mirroring, the 2nd server now needs the optimized index as well?

The second server now stops servicing requests and starts optimizing. You
can also keep them running together for some time, depending on how serious
you are about always running on an optimized index.
Re: ConcurrentMergeScheduler and MergePolicy question
> Perhaps the ideal search-system architecture that requires optimizing is
> to dedicate a server to it: copy the index to the optimize server, do the
> optimize, copy the index off (to a search server) and start again for the
> next optimize task. I wonder how/if this would work with Hadoop/HDFS, as
> copying 100GB around would presumably tie up the network? Also, I've found
> rsyncing large optimized indexes to be time consuming, and it wreaks havoc
> on the search server's IO subsystem. Usually this is unacceptable for the
> user, as the queries will suddenly degrade.

You don't have to copy. You can have one machine optimize your indexes
whilst the other serves user requests; then they switch roles, rinse,
repeat. This approach also works with sharding, and with more than 2-way
mirroring.
Re: Attributes, DocConsumer, Flexible Indexing, etc.
I always thought flexible indexing is not only for storing your app-specific
data next to terms/docs. It's something more along the lines of efficient
geo search, or the ability to try out various index encoding schemes without
patching lucene. In other words, it is something that can be a basis for an
easy/pluggable implementation of payload-type functionality, not vice-versa.

On Thu, Aug 6, 2009 at 01:55, Grant Ingersoll gsing...@apache.org wrote:
> On Aug 5, 2009, at 4:35 PM, Michael Busch wrote:
>> On 8/5/09 1:07 PM, Grant Ingersoll wrote:
>>> Hmmm, OK. Random, somewhat uneducated thought: Why not just define the
>>> codecs to create byte arrays? Then we can use the existing payload
>>> capability, much like I do with the DelimitedPayloadTokenFilter. We'd
>>> probably have to make sure this still worked with Similarity, but it
>>> seems like it could.
>
> Thinking on this some more, seems like this could work already with an
> AttributePayloadEncoder or something like an AttributeToPayloadTokenFilter
> (I know, horrible name). Then, on the Query side, the AttributeTermQuery
> is just a glorified BoostingTermQuery with some callback hooks for dealing
> with the Attribute (but maybe that isn't even needed); either that, or we
> just provide helper methods on the Similarity class so that people can
> easily decode the byte array into an Attribute. In fact, maybe all that
> needs to happen is the Attributes need to define encode/decode methods
> that (de)serialize a byte array. Seems like this approach would require
> very little in the way of changes to Lucene, but I admit it isn't fully
> baked in my mind just yet. It also has the nice benefit that all the work
> we did on Payloads isn't wasted. This is resonating more and more with me.
> What do you think?
>
>> Well, I think this would be a nice way of using the payloads better.
>> However, the idea behind flexible indexing is that you can customize the
>> on-disk encoding in a way that is as efficient as it can be for your
>> particular use case. E.g. for payloads we currently have to encode the
>> length. An application might not have to do that if it knows exactly what
>> is stored. Then there's only the Payload API that returns you a byte
>> array. It basically copies the contents of the IndexInput (usually a
>> BufferedIndexInput, which means an array copy from the byte buffer to the
>> payload byte array). If the application knows exactly what is stored, it
>> can read/decode it more efficiently.
>
> Yeah, but really, are you saving that much? 4 bytes per token? It's not
> like you are saving much in terms of seeks, since you are already there
> anyway. The only downside I see is a slightly larger index. Would be
> interesting to try it out and see.
>
>> The latter inefficiency we could solve by improving the payloads API: it
>> could return an IndexInput instead of the byte array, and the caller
>> could consume it more efficiently.
>
> This is also interesting, but again requires some changes. With what I'm
> proposing, I think it could be done very simply w/o any API changes; we
> just need to expose some of the IndexInput/Output helper classes a bit
> more to make it easier for people to encode/decode their stuff. Then, just
> documentation and some more Boosting*Query classes (Peter has already done
> BoostingNearQuery), and I think you have a pretty good flexible indexing
> AND searching capability, all in a back-compatible way using our existing
> code.
>
>> So I agree that we could use Attributes to make the payloads feature
>> better usable, but I don't think it will be a replacement for flexible
>> indexing.
>> Michael
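A sketch of the "AttributeToPayloadTokenFilter" idea floated above, against
the 2.9-era attribute API (the encoding here - a single byte holding the
term length - is a made-up example, not a proposed format):

    import java.io.IOException;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;
    import org.apache.lucene.index.Payload;

    public final class AttributeToPayloadTokenFilter extends TokenFilter {
        private final TermAttribute termAtt = addAttribute(TermAttribute.class);
        private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);

        public AttributeToPayloadTokenFilter(TokenStream input) {
            super(input);
        }

        public boolean incrementToken() throws IOException {
            if (!input.incrementToken()) return false;
            // serialize some per-token attribute state into a byte[];
            // here we just store the term length as a one-byte payload
            byte[] encoded = new byte[] { (byte) termAtt.termLength() };
            payloadAtt.setPayload(new Payload(encoded));
            return true;
        }
    }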
Re: IndexWriter.getReader usage
> The biggest win for NRT was switching to per-segment Collector, because
> that meant we could re-use FieldCache entries for all segments that hadn't
> changed.

In my opinion, this switch was enough to get as NRT-ey as you want. Fusing
IR/IW together makes Lucene a great deal more complicated and just a
milli-tad closer to RT.

> I'm curious as to how it obviates the need for a RAM dir? In my use case I
> use them to create indexes and perform searches. In the latter it avoids
> OS file indexing and virus scanner contention (40 min reduced to less than
> 2 min).

Isn't indexing your indexes (omg), checking them for viruses and striving
for performance ..err.. a little bit self-contradictory?
Re: Java caching of low-level index data?
> I'm curious if anyone has thought about (or even tried) caching the
> low-level index data in Java, rather than in the OS. For example, at the
> IndexInput level there could be an LRU cache of byte[] blocks, similar to
> how an RDBMS caches index pages. (Conveniently, BufferedIndexInput already
> reads in 1k chunks.) You would reverse the advice above and instead make
> your JVM heap as large as possible (or at least large enough to achieve a
> desired speed/space tradeoff).

I did something along these lines. It sucks. Having big Java heaps lands you
with insane GC times. Loading GB-sized files into a bunch of byte[1024]
blocks also wastes memory. The best bet by now is to rely on the mmap/file
cache.

> I think swappiness is exactly the configuration that tells Linux just how
> happily it should swap out application memory for IO cache vs other IO
> cache for new IO cache.

swappiness is roughly the percentage of free memory after which the OS
starts searching for pages suitable for paging out. If set to low values,
the OS wakes up only in near-OOM conditions. If set to high values, as soon
as the OS decides (according to some heuristics) that a page is eligible for
page-out, it goes to disk.
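For concreteness, the kind of cache being proposed is a few lines with
LinkedHashMap in access order - a sketch that ignores concurrency, which,
along with the GC pressure described above, is exactly where the pain comes
from:

    import java.util.LinkedHashMap;
    import java.util.Map;

    // LRU cache of fixed-size byte[] blocks, keyed by a (file, block) id.
    class BlockCache extends LinkedHashMap<Long, byte[]> {
        private final int maxBlocks;

        BlockCache(int maxBlocks) {
            super(16, 0.75f, true);   // accessOrder=true gives LRU iteration
            this.maxBlocks = maxBlocks;
        }

        protected boolean removeEldestEntry(Map.Entry<Long, byte[]> eldest) {
            return size() > maxBlocks; // evict the least-recently-used block
        }
    }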
[jira] Commented: (LUCENE-1748) getPayloadSpans on org.apache.lucene.search.spans.SpanQuery should be abstract
[ https://issues.apache.org/jira/browse/LUCENE-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12732938#action_12732938 ]

Earwin Burrfoot commented on LUCENE-1748:
-----------------------------------------

bq. We should drop PayloadSpans and just add getPayload to Spans. This
should be a compile time break.

+1

> getPayloadSpans on org.apache.lucene.search.spans.SpanQuery should be abstract
> -------------------------------------------------------------------------------
>
> Key: LUCENE-1748
> URL: https://issues.apache.org/jira/browse/LUCENE-1748
> Project: Lucene - Java
> Issue Type: Bug
> Components: Query/Scoring
> Affects Versions: 2.4, 2.4.1
> Environment: all
> Reporter: Hugh Cayless
> Fix For: 2.9, 3.0, 3.1
>
> I just spent a long time tracking down a bug resulting from upgrading to
> Lucene 2.4.1 on a project that implements some SpanQuerys of its own and
> was written against 2.3. Since the project's SpanQuerys didn't implement
> getPayloadSpans, the call to that method went to
> SpanQuery.getPayloadSpans, which returned null and caused a
> NullPointerException in the Lucene code, far away from the actual source
> of the problem. It would be much better for this kind of thing to show up
> at compile time, I think. Thanks!
[jira] Commented: (LUCENE-1748) getPayloadSpans on org.apache.lucene.search.spans.SpanQuery should be abstract
[ https://issues.apache.org/jira/browse/LUCENE-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12731939#action_12731939 ]

Earwin Burrfoot commented on LUCENE-1748:
-----------------------------------------

bq. Shouldn't it throw a runtime exception (unsupported operation?) or
something?

What is the difference between adding an abstract method and adding a method
that throws an exception, in regards to jar drop-in back compat? In both
cases, when you drop your new jar in you get an exception - except in the
latter case the exception is deferred.
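To spell the comparison out (hypothetical classes, not the actual SpanQuery
code): with a drop-in jar upgrade, both variants fail at runtime on
subclasses compiled against the old jar - only the error type and its timing
differ.

    public abstract class BaseQuery {
        // Variant 1: new abstract method. Old compiled subclasses throw
        // AbstractMethodError the moment it is invoked.
        public abstract Object getSpans();

        // Variant 2: concrete method that throws. Old subclasses inherit
        // it, so the failure is deferred until someone actually calls it.
        public Object getPayloadSpans() {
            throw new UnsupportedOperationException("not overridden by subclass");
        }
    }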
[jira] Commented: (LUCENE-1748) getPayloadSpans on org.apache.lucene.search.spans.SpanQuery should be abstract
[ https://issues.apache.org/jira/browse/LUCENE-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12731972#action_12731972 ]

Earwin Burrfoot commented on LUCENE-1748:
-----------------------------------------

I took a glance at the code; the whole getPayloadSpans deal is a heresy.
Each and every implementation looks like:

    public PayloadSpans getPayloadSpans(IndexReader reader) throws IOException {
      return (PayloadSpans) getSpans(reader);
    }

Moving it to the base SpanQuery is equally as broken as the current
solution, but yields much less strange copy-paste.

I also have a faint feeling that if you expose a method like

    ClassA method();

you can then upgrade it to

    SubclassOfClassA method();

without breaking drop-in compatibility, which renders the getPayloadSpans vs
getSpans alternative totally useless.
[jira] Issue Comment Edited: (LUCENE-1748) getPayloadSpans on org.apache.lucene.search.spans.SpanQuery should be abstract
[ https://issues.apache.org/jira/browse/LUCENE-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12731972#action_12731972 ]

Earwin Burrfoot edited comment on LUCENE-1748 at 7/16/09 7:54 AM:
------------------------------------------------------------------

I took a glance at the code; the whole getPayloadSpans deal is a heresy.
Each and every implementation looks like:

    public PayloadSpans getPayloadSpans(IndexReader reader) throws IOException {
      return (PayloadSpans) getSpans(reader);
    }

Moving it to the base SpanQuery is equally as broken as the current
solution, but yields much less strange copy-paste.

-I also have a faint feeling that if you expose a method like-
-ClassA method();-
-you can then upgrade it to-
-SubclassOfClassA method();-
-without breaking drop-in compatibility, which renders the getPayloadSpans
vs getSpans alternative totally useless-

Ok, I'm wrong.
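Why the struck-out idea above fails (a JVM fact, not Lucene-specific):
covariant overriding compiles fine, but changing the declared return type of
an already-published method changes its binary descriptor. A minimal
illustration with hypothetical classes:

    class ClassA {}
    class SubclassOfClassA extends ClassA {}

    // Library v1 ships:
    class Api { ClassA method() { return new ClassA(); } }

    // Library v2 narrows the return type. This is source-compatible and
    // compiles cleanly - but the method's binary descriptor changes from
    // "()LClassA;" to "()LSubclassOfClassA;", so callers compiled against
    // v1 fail with NoSuchMethodError until they are recompiled:
    // class Api { SubclassOfClassA method() { return new SubclassOfClassA(); } }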
[jira] Commented: (LUCENE-1743) MMapDirectory should only mmap large files, small files should be opened using SimpleFS/NIOFS
[ https://issues.apache.org/jira/browse/LUCENE-1743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12731632#action_12731632 ]

Earwin Burrfoot commented on LUCENE-1743:
-----------------------------------------

The initial motive for the issue seems wrong to me.

bq. For most operating systems, mapping a file into memory is more expensive
than reading or writing a few tens of kilobytes of data via the usual read
and write methods. From the standpoint of performance it is generally only
worth mapping relatively large files into memory.

It is probably right if you're doing a single read through the file. If
you're opening/mapping it and doing thousands of repeated reads, mmap is
superior, because after the initial mapping each read is just a memory
access vs a system call for file.read().

> MMapDirectory should only mmap large files, small files should be opened
> using SimpleFS/NIOFS
> -------------------------------------------------------------------------
>
> Key: LUCENE-1743
> URL: https://issues.apache.org/jira/browse/LUCENE-1743
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Store
> Affects Versions: 2.9
> Reporter: Uwe Schindler
> Assignee: Uwe Schindler
> Fix For: 3.1
>
> This is a followup to LUCENE-1741. Javadocs state (in FileChannel#map):
> "For most operating systems, mapping a file into memory is more expensive
> than reading or writing a few tens of kilobytes of data via the usual read
> and write methods. From the standpoint of performance it is generally only
> worth mapping relatively large files into memory."
> MMapDirectory should get a user-configurable size parameter that is a
> lower limit for mmapping files. All files with a size < limit should be
> opened using a conventional IndexInput from SimpleFS or NIO (another
> configuration option for the fallback?).
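A minimal JDK-only illustration of the two access paths being compared (this
is plain NIO, not Lucene's MMapDirectory code):

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    public class MapVsRead {
        public static void main(String[] args) throws IOException {
            RandomAccessFile raf = new RandomAccessFile(args[0], "r");
            FileChannel channel = raf.getChannel();

            // mmap path: one up-front syscall to establish the mapping...
            MappedByteBuffer map =
                channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            byte viaMap = map.get(0); // ...then reads are plain memory accesses

            // read path: every call pays a syscall
            raf.seek(0);
            int viaRead = raf.read();

            raf.close();
        }
    }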
[jira] Issue Comment Edited: (LUCENE-1743) MMapDirectory should only mmap large files, small files should be opened using SimpleFS/NIOFS
[ https://issues.apache.org/jira/browse/LUCENE-1743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12731632#action_12731632 ]

Earwin Burrfoot edited comment on LUCENE-1743 at 7/15/09 12:14 PM:
-------------------------------------------------------------------

The initial motive for the issue seems wrong to me.

bq. For most operating systems, mapping a file into memory is more expensive
than reading or writing a few tens of kilobytes of data via the usual read
and write methods. From the standpoint of performance it is generally only
worth mapping relatively large files into memory.

It is probably right if you're doing a single read through the file. If
you're opening/mapping it and doing thousands of repeated reads, mmap is
superior, because after the initial mapping each read is just a memory
access vs a system call for file.read().

Add: In case you're not doing repeated reads, and just read these small
files once from time to time, you can totally neglect the speed difference
between mmap and fopen. At least it doesn't warrant the increased
complexity.
[jira] Commented: (LUCENE-1743) MMapDirectory should only mmap large files, small files should be opened using SimpleFS/NIOFS
[ https://issues.apache.org/jira/browse/LUCENE-1743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12731639#action_12731639 ] Earwin Burrfoot commented on LUCENE-1743:

bq. My problem was more with all these small files like segments_ and segments.gen or *.del files. They are small and only used one time.

I can only reiterate my point. These files aren't opened at a rate of 10k files per second, so your win is going to be on the order of microseconds per reopen - at the cost of increased complexity.
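For concreteness, here is a minimal sketch of the mechanism the issue title asks for - choosing the input implementation by file size - assuming the 2.9-era store API. The class name, field names, and the idea of wrapping two concrete directories are illustrative, not taken from the actual LUCENE-1743 patch; a real implementation would extend Directory and delegate the remaining methods as well.

{code}
import java.io.File;
import java.io.IOException;
import org.apache.lucene.store.IndexInput;
import org.apache.lucene.store.MMapDirectory;
import org.apache.lucene.store.NIOFSDirectory;

// Hypothetical wrapper: mmap only files above a configurable threshold,
// fall back to a conventional directory for everything smaller.
public class SizeThresholdDirectory {
  private final MMapDirectory mmapDir;
  private final NIOFSDirectory fallbackDir;
  private final long minMMapSize; // lower limit for mmapping, in bytes

  public SizeThresholdDirectory(File path, long minMMapSize) throws IOException {
    this.mmapDir = new MMapDirectory(path);
    this.fallbackDir = new NIOFSDirectory(path);
    this.minMMapSize = minMMapSize;
  }

  public IndexInput openInput(String name) throws IOException {
    // mapping cost is only amortized for large, repeatedly-read files
    if (mmapDir.fileLength(name) >= minMMapSize) {
      return mmapDir.openInput(name);
    }
    return fallbackDir.openInput(name); // small files: plain read() path
  }
}
{code}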
Re: A Comparison of Open Source Search Engines
I'd say out of these libraries only Lucene and Sphinx are worth mentioning. There's also MG4J, which wasn't covered and has a nice algorithmic background. Does anybody know other interesting open-source search engines?

On Tue, Jul 7, 2009 at 00:39, John Wang john.w...@gmail.com wrote: Vik did a very nice job. One thing the experiment did not mention is that Lucene handles incremental updates, whereas many of the other competitors do not. So the indexing performance comparison is not really fair. -John

On Mon, Jul 6, 2009 at 8:06 AM, Sean Owen sro...@gmail.com wrote: http://zooie.wordpress.com/2009/07/06/a-comparison-of-open-source-search-engines-and-indexing-twitter/ I imagine many of you already saw this -- Lucene does pretty well in this shootout. The only area it tended to lag, it seems, is memory usage and speed in some cases.
[jira] Commented: (LUCENE-1488) issues with standardanalyzer on multilingual text
[ https://issues.apache.org/jira/browse/LUCENE-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12726571#action_12726571 ] Earwin Burrfoot commented on LUCENE-1488:

bq. There is no morphological processing or any other language-specific functionality in this patch...

I'm speaking of the stemming in ArabicAnalyzer. Why can't you use its stemming tokenfilter over all the ICU goodness from this patch? Everything else ArabicAnalyzer consists of might as well be deleted right after.

issues with standardanalyzer on multilingual text
Key: LUCENE-1488
URL: https://issues.apache.org/jira/browse/LUCENE-1488
Project: Lucene - Java
Issue Type: Wish
Components: contrib/analyzers
Reporter: Robert Muir
Priority: Minor
Attachments: ICUAnalyzer.patch, LUCENE-1488.patch, LUCENE-1488.txt, LUCENE-1488.txt

The standard analyzer in Lucene is not exactly unicode-friendly with regards to breaking text into words, especially with respect to non-alphabetic scripts, because it is unaware of the unicode word-bounds properties. I actually couldn't figure out how the Thai analyzer could possibly be working until I looked at the jflex rules and saw that the codepoint range for most of the Thai block was added to the alphanum specification. Defining the exact codepoint ranges like this for every language could help with the problem, but you'd basically be reimplementing the bounds properties already stated in the unicode standard. In general this kind of behavior is bad in Lucene even for latin: for instance, the analyzer will break words around accent marks in decomposed form. While most latin letter + accent combinations have composed forms in unicode, some do not (this is also an issue for ASCIIFoldingFilter, I suppose). I've got a partially tested standardanalyzer that uses an ICU rule-based BreakIterator instead of jflex. Using this method you can define word boundaries according to the unicode bounds properties. After getting it into some good shape I'd be happy to contribute it for contrib, but I wonder if there's a better solution so that out of the box Lucene will be more friendly to non-ASCII text. Unfortunately it seems jflex does not support use of these properties such as [\p{Word_Break = Extend}], so this is probably the major barrier. Thanks, Robert
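As a quick illustration of the ICU approach (a toy demo, not the attached ICUAnalyzer.patch): ICU4J's word BreakIterator already implements the Unicode word-break properties that the jflex grammar is missing, so scripts like Thai segment correctly without hand-listed codepoint ranges.

{code}
import com.ibm.icu.text.BreakIterator;
import com.ibm.icu.util.ULocale;

public class ICUWordSplitDemo {
  public static void main(String[] args) {
    // word-break rules from the Unicode standard, including
    // dictionary-based breaking for Thai
    BreakIterator words = BreakIterator.getWordInstance(ULocale.ROOT);
    String text = "\u0e17\u0e14\u0e2a\u0e2d\u0e1a test"; // Thai + latin
    words.setText(text);
    int start = words.first();
    for (int end = words.next(); end != BreakIterator.DONE;
         start = end, end = words.next()) {
      String token = text.substring(start, end).trim();
      if (token.length() > 0) { // skip whitespace-only segments
        System.out.println(token);
      }
    }
  }
}
{code}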
Re: Improving TimeLimitedCollector
Why don't you use Thread.interrupt() / .isInterrupted()?

On Sat, Jun 27, 2009 at 16:16, Shai Erera ser...@gmail.com wrote:

"A downside of breaking it out into static methods like this is that a thread cannot run >1 time-limited activity simultaneously but I guess that might be a reasonable restriction."

I'm not sure I understand that - how can a thread run >1 activity simultaneously anyway, and how does your impl in TimeLimitingIndexReader allow it to do so? You use the thread as a key to the map. Am I missing something?

Anyway, I think we can let go of the static methods and make them instance methods. I think, if I want to use time-limited activities, I should create a TimeLimitedThreadActivity instance and pass it around, to TimeLimitingIndexReader (and maybe in the future to a similar **IndexWriter) and any other custom code I have which I want to put a time limit on. A static class has the advantage of not needing to pass it around everywhere, and is accessible from everywhere, so that if we discover that limiting on IndexReader is not enough, and we want some of the scorers to check more frequently whether they should stop, we won't need to pass that instance all the way down to them. I don't mind keeping it static, but I also don't mind if it becomes an instance passed around, since currently it's only passed to IndexReader.

Are you going to open an issue for that? Seems like a nice addition to me. Do you think it should belong in core or contrib? If core, then if possible I'd like to see all timeout classes under one package, including TimeLimitingCollector (which until 2.9 we can safely move around as we want). I don't mind working on that w/ you, if you want. Shai

On Sat, Jun 27, 2009 at 2:24 PM, Mark Harwood markharw...@yahoo.co.uk wrote: Thanks for the feedback, Shai. So I guess you're suggesting breaking this out into a general utility class, e.g. something like:

class TimeLimitedThreadActivity {
  // called by client
  public static void startTimeLimitedActivity(long maxTimePermitted);
  public static void endTimeLimitedActivity();
  // called by resources (readers/writers) that need to be shared fairly by threads
  public static void checkActivityNotElapsed(); // throws some form of runtime exception
}

A downside of breaking it out into static methods like this is that a thread cannot run >1 time-limited activity simultaneously but I guess that might be a reasonable restriction.

"Aside, how about using a PQ for the threads' times, or a TreeMap? That will save looping over the collection to find the next candidate. Just an implementation detail though."

Yep, that was one of the rough edges - I just wanted to get raw timings first for all the "is timed out?" checks we're injecting into reader calls. Cheers Mark

On 27 Jun 2009, at 11:37, Shai Erera wrote: I like the overall approach. However it's very local to an IndexReader. I.e., if someone wanted to limit other operations (say indexing), or does not use an IndexReader (for a Scorer impl maybe), one cannot reuse it. What if we factor out the timeout logic to a Timeout class (I think it can be a static class, with the way you implemented it) and use it in TimeLimitingIndexReader? That class can offer a method check() which will do the internal logic (the 'if' check and throw exception). It will be similar to the current ensureOpen() followed by an operation. It might be considered more expensive since it won't check a boolean, but instead call a check() method, but it will be more reusable.
Also, ensureOpen today is also a method call, so I don't think Timeout.check() is that bad. We can even later create a TimeLimitingIndexWriter and document the Timeout class for other usage by external code. Aside, how about using a PQ for the threads' times, or a TreeMap? That will save looping over the collection to find the next candidate. Just an implementation detail though. Shai

On Sat, Jun 27, 2009 at 3:31 AM, Mark Harwood markharw...@yahoo.co.uk wrote: Going back to my post re TimeLimitedIndexReaders - here's an incomplete but functional prototype: http://www.inperspective.com/lucene/TimeLimitedIndexReader.java http://www.inperspective.com/lucene/TestTimeLimitedIndexReader.java The principle is that all reader accesses check a volatile variable indicating something may have timed out (no need to check thread locals etc.). If and only if a timeout has been noted are threadlocals checked to see which thread should throw a timeout exception. All time-limited use of the reader must be wrapped in try...finally calls to indicate the start and stop of a timed set of activities. A background thread maintains the next anticipated timeout deadline and simply waits until this is reached or the list of planned activities changes with new deadlines. Performance seems reasonable on my Wikipedia index: //some tests for heavy use of termenum/term
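To make the shape of the proposal concrete, here is a rough sketch of the static utility being discussed - a cheap volatile fast path plus per-thread deadlines. Mark's actual prototype (linked above) differs in details such as the background deadline thread, so treat this as an illustration only:

public final class TimeLimitedThreadActivity {
  private static final ThreadLocal<Long> DEADLINE = new ThreadLocal<Long>();
  // set by a background timer thread when any registered deadline passes
  static volatile boolean anyTimeoutPossible = false;

  public static void startTimeLimitedActivity(long maxTimeMillis) {
    DEADLINE.set(Long.valueOf(System.currentTimeMillis() + maxTimeMillis));
    // a real impl would also register this deadline with the timer thread
  }

  public static void endTimeLimitedActivity() {
    DEADLINE.remove();
  }

  public static void checkActivityNotElapsed() {
    if (!anyTimeoutPossible) {
      return; // the common, cheap path: a single volatile read
    }
    Long deadline = DEADLINE.get();
    if (deadline != null && System.currentTimeMillis() > deadline.longValue()) {
      throw new RuntimeException("Time limit exceeded for this activity");
    }
  }
}

Reader methods would call checkActivityNotElapsed() at the top, paying only a volatile read until some deadline actually passes.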
[jira] Commented: (LUCENE-1342) 64bit JVM crashes on Linux
[ https://issues.apache.org/jira/browse/LUCENE-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12724441#action_12724441 ] Earwin Burrfoot commented on LUCENE-1342:

bq. Sun can't ignore a HotSpot compiler bug, can they?

They are safely ignoring CMS collector bugs on 64bit archs.

64bit JVM crashes on Linux
Key: LUCENE-1342
URL: https://issues.apache.org/jira/browse/LUCENE-1342
Project: Lucene - Java
Issue Type: Bug
Affects Versions: 2.0.0
Environment: 2.6.18-53.el5 x86_64 GNU/Linux, Java(TM) SE Runtime Environment (build 1.6.0_04-b12)
Reporter: Kevin Richards
Attachments: hs_err_pid10565.log, hs_err_pid21301.log, hs_err_pid27882.log

Whilst running Lucene in our QA environment we received the following exception. This problem was also reported here: http://confluence.atlassian.com/display/KB/JSP-20240+-+POSSIBLE+64+bit+JDK+1.6+update+4+may+have+HotSpot+problems. Is this a JVM problem or a problem in Lucene?

# An unexpected error has been detected by Java Runtime Environment:
# SIGSEGV (0xb) at pc=0x2adb9e3f, pid=2275, tid=1085356352
# Java VM: Java HotSpot(TM) 64-Bit Server VM (10.0-b19 mixed mode linux-amd64)
# Problematic frame: V [libjvm.so+0x1fce3f]

Current thread (0x2aab0007f000): JavaThread CompilerThread0 daemon [_thread_in_vm, id=2301]
(register, stack, and native-frame dump omitted; see the attached hs_err_pid logs)
Re: Improving TimeLimitedCollector
Having scorers check timeouts while advancing will definitely increase the frequency of said timeouts.

On Wed, Jun 24, 2009 at 13:13, eks dev eks...@yahoo.co.uk wrote: Re: "I think such a parameter should not exist on individual search methods since it's more of a global setting (i.e., I want my searches to be limited to 5 seconds, always, not just for a particular query). Right?" I am not sure about this one; we had cases where one physical index served two logical indices with different requirements for clients. Having Timeout settable per Query is nice to have. At the end of the day, with such a timeout you support Quality/Time compromise settings: if you need all results, be ready to wait longer and set a longer timeout; if you need SOME results quickly then reduce this timeout. That should ideally be the user's decision.

From: Shai Erera ser...@gmail.com To: java-dev@lucene.apache.org Sent: Wednesday, 24 June, 2009 10:55:50 Subject: Re: Improving TimeLimitedCollector

But TimeLimitingCollector's logic is coded in its collect() method. The top scorer calls nextDoc() or advance() on all its sub-scorers, and only when a match is found does it call collect(). If we want the sub-scorers to check whether they should abort, we'd need to revamp (liked the word :)) TimeLimitingCollector, to be something like the CheckAbort that SegmentMerger uses. I.e., the top scorer will pass such an instance to its sub-scorers, which will call a TimeLimit.check() or something, and if the time limit has expired this call will throw a TimeExceededException (like TLC). We can enable this by adding another parameter to IndexSearcher: whether searches should be limited by time, and what the time limit is. It will then instantiate that object and pass it to its Scorer and so on. I think such a parameter should not exist on individual search methods since it's more of a global setting (i.e., I want my searches to be limited to 5 seconds, always, not just for a particular query). Right? Another option would be to add a setTimeout method on Query, which will use it when it constructs its Scorer. The shortcoming of this is that if I want to use someone else's Query which did not implement setTimeout, then I'll need to build a TimeOutQueryWrapper that will wrap a Query and implement the timeout logic, but that gets complicated. I think the Collector approach makes the most sense to me, since it's the only object I fully control in the search process. I cannot control Query implementations, and I cannot control the decisions made by IndexSearcher. But I can always wrap someone else's Collector with TLC and pass it to search(). Shai

On Wed, Jun 24, 2009 at 12:26 AM, Jason Rutherglen jason.rutherg...@gmail.com wrote: As we're revamping collectors, weights, and scorers, perhaps we can push time limiting into the individual subscorers? Currently on a boolean query, we're timing out the query at the top level, which doesn't work well if the subqueries exceed the time limit.
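A minimal sketch of the CheckAbort-style object Shai describes (the names TimeLimit and TimeExceededException are illustrative, not existing Lucene classes): the searcher would create one instance per query and pass it down, and sub-scorers would poll check() while advancing, so timeouts fire inside long advances rather than only at collect() time.

// Illustrative only: a per-query time budget that scorers can poll.
public class TimeLimit {
  private final long deadline; // absolute wall-clock deadline, millis

  public TimeLimit(long maxMillis) {
    this.deadline = System.currentTimeMillis() + maxMillis;
  }

  // called by sub-scorers at the top of nextDoc()/advance()
  public void check() {
    if (System.currentTimeMillis() > deadline) {
      throw new TimeExceededException();
    }
  }

  public static class TimeExceededException extends RuntimeException {
  }
}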
[jira] Commented: (LUCENE-1712) Set default precisionStep for NumericField and NumericRangeFilter
[ https://issues.apache.org/jira/browse/LUCENE-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722996#action_12722996 ] Earwin Burrfoot commented on LUCENE-1712:

Having half of your methods constantly fail with an exception depending on a constructor parameter - that just screams "Split me into two classes!"

Set default precisionStep for NumericField and NumericRangeFilter
Key: LUCENE-1712
URL: https://issues.apache.org/jira/browse/LUCENE-1712
Project: Lucene - Java
Issue Type: Improvement
Affects Versions: 2.9
Reporter: Michael McCandless
Priority: Minor
Fix For: 2.9

This is a spinoff from LUCENE-1701. A user using Numeric* should not need to understand what's under the hood in order to do their indexing & searching. They should be able to simply:

{code}
doc.add(new NumericField("price", 15.50));
{code}

And have a decent default precisionStep selected for them. Actually, if we add ctors to NumericField for each of the supported types (so the above code works), we can set the default per-type. I think we should do that? 4 for int and 6 for long were proposed as good defaults. The default need not be perfect, as advanced users can always optimize their precisionStep, and for users experiencing slow RangeQuery performance, NumericRangeQuery with any of the defaults we are discussing will be much faster.
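The smell Earwin points at can be shown in a few lines (all names hypothetical, not actual Lucene classes): one class whose methods fail at runtime depending on how it was constructed, versus two small classes where every method is always valid.

{code}
// One class, constructor-dependent failures: half the API throws at runtime.
class NumericValue {
  private final boolean isInt;
  private final long bits;

  NumericValue(int v)  { this.isInt = true;  this.bits = v; }
  NumericValue(long v) { this.isInt = false; this.bits = v; }

  int intValue() {
    if (!isInt) throw new IllegalStateException("constructed as long");
    return (int) bits;
  }

  long longValue() {
    if (isInt) throw new IllegalStateException("constructed as int");
    return bits;
  }
}

// Split into two classes: the compiler enforces what a runtime check used to.
class IntValue  { final int value;  IntValue(int v)   { this.value = v; } }
class LongValue { final long value; LongValue(long v) { this.value = v; } }
{code}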
[jira] Commented: (LUCENE-1715) DirectoryIndexReader finalize() holding TermInfosReader longer than necessary
[ https://issues.apache.org/jira/browse/LUCENE-1715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12723224#action_12723224 ] Earwin Burrfoot commented on LUCENE-1715:

I object to nulling references in an attempt to speed up GC. It's totally useless on any decent JVM implementation, and if someone uses an indecent JVM, I doubt he's concerned with his app's efficiency.

DirectoryIndexReader finalize() holding TermInfosReader longer than necessary
Key: LUCENE-1715
URL: https://issues.apache.org/jira/browse/LUCENE-1715
Project: Lucene - Java
Issue Type: Bug
Components: Index
Affects Versions: 2.4.1
Environment: Sun JDK 6 update 12 64-bit, Debian Lenny
Reporter: Brian Groose
Assignee: Michael McCandless
Fix For: 2.9

DirectoryIndexReader has a finalize method, which causes the JDK to keep a reference to the object until it can be finalized. SegmentReader and MultiSegmentReader are subclasses that contain references to, potentially, hundreds of megabytes of cached data in a TermInfosReader. Some options would be removing finalize() from DirectoryIndexReader (it releases a write lock at the moment) or possibly nulling out references in various close() and doClose() methods throughout the class hierarchy so that the finalizable object doesn't reference the Term arrays. Original mailing list message: http://mail-archives.apache.org/mod_mbox/lucene-java-user/200906.mbox/%3c7a5cb4a7bbce0c40b81c5145c326c31301a62...@numevp06.na.imtn.com%3e
[jira] Commented: (LUCENE-1715) DirectoryIndexReader finalize() holding TermInfosReader longer than necessary
[ https://issues.apache.org/jira/browse/LUCENE-1715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12723225#action_12723225 ] Earwin Burrfoot commented on LUCENE-1715:

And I support removing finalizers everywhere if their only point is to guard against a forgotten close().
[jira] Commented: (LUCENE-1715) DirectoryIndexReader finalize() holding TermInfosReader longer than necessary
[ https://issues.apache.org/jira/browse/LUCENE-1715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12723289#action_12723289 ] Earwin Burrfoot commented on LUCENE-1715:

There's in fact one case where nulling harms. I'm going to try making as much of IR as possible immutable and final - load everything upfront on creation/reopen (or don't load if the IR is created for, say, merging). Unlike nulling references, making frequently accessed fields final does have an impact under adequate JVMs. Well, nulling can be added now and removed when/if I finish my IR stuff.
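Distilling the issue into a few lines (a hypothetical class, not the actual SegmentReader code): an object with a finalize() method survives at least one extra GC cycle, and until the finalizer thread runs it keeps every field it references reachable. That is why nulling in close() helps in this specific case even though it is useless as a general GC "optimization".

{code}
// Hypothetical reduction of the DirectoryIndexReader situation.
class LeakyReader {
  // stands in for TermInfosReader's potentially huge cached arrays
  private byte[] hugeCache = new byte[256 * 1024 * 1024];

  protected void finalize() throws Throwable {
    try {
      close(); // guards against a forgotten close()...
    } finally {
      super.finalize();
    }
    // ...but its mere presence keeps this object - and hugeCache -
    // alive until the finalizer thread gets around to it.
  }

  public void close() {
    hugeCache = null; // the "nulling in close()" workaround from the issue
  }
}
{code}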
[jira] Commented: (LUCENE-1607) String.intern() faster alternative
[ https://issues.apache.org/jira/browse/LUCENE-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12723352#action_12723352 ] Earwin Burrfoot commented on LUCENE-1607:

Okay, let's have an extra class and the ability to switch impls. I liked that the static method could get inlined (at least its short path), but that's not necessary. Except I'd like the javadoc to demand that each impl be String.intern()-compatible. There's nothing bad in it, as in any decent impl a unique string will be String.intern()'ed at most once. And the case when you get an infinite flow of unique strings is degenerate anyway - you have to fix something, not deal with it. On the other hand, we can remove the "This should never be changed after other Lucene APIs have been used" clause.

- Rewrite 'for' as 'for (Entry e = first; e != null; e = e.next)' for clarity?
- 'Entry[] arr = cache;' - this can be skipped? 'cache' is already final and the optimizer loves finals. Plus further down the method you use both cache[slot] and arr[slot]. Or am I missing some voodoo?
- The if check around 'nextToLast = e' can also be removed?
- 'public String intern(char[] arr, int offset, int len)' - is this needed?

String.intern() faster alternative
Key: LUCENE-1607
URL: https://issues.apache.org/jira/browse/LUCENE-1607
Project: Lucene - Java
Issue Type: Improvement
Reporter: Earwin Burrfoot
Assignee: Yonik Seeley
Fix For: 2.9
Attachments: intern.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch

By using our own interned string pool on top of the default, String.intern() can be greatly optimized. On my setup (Java 6) this alternative runs ~15.8x faster for already interned strings, and ~2.2x faster for 'new String(interned)'. For Java 5 and 4 the speedup is lower, but still considerable.
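For readers without the patch handy, here is a condensed sketch of the design under review (see the LUCENE-1607 attachments for the real code, which also bounds chain length): a small fixed-size, deliberately unsynchronized cache in front of String.intern(). Lost updates from races are harmless, because every stored value is itself intern()'d, which is what makes the impl String.intern()-compatible.

{code}
// Condensed sketch, not the actual patch.
public class SimpleInterner {
  private static class Entry {
    final String str;
    final Entry next;
    Entry(String str, Entry next) { this.str = str; this.next = next; }
  }

  private final Entry[] cache;

  public SimpleInterner(int size) {
    // size must be a power of two so the mask below works
    cache = new Entry[size];
  }

  public String intern(String s) {
    int slot = s.hashCode() & (cache.length - 1);
    for (Entry e = cache[slot]; e != null; e = e.next) {
      if (e.str.equals(s)) return e.str; // fast path: already cached
    }
    String interned = s.intern(); // String.intern()-compatible by construction
    cache[slot] = new Entry(interned, cache[slot]); // racy by design
    return interned;
  }
}
{code}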
[jira] Commented: (LUCENE-1677) Remove GCJ IndexReader specializations
[ https://issues.apache.org/jira/browse/LUCENE-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12723355#action_12723355 ] Earwin Burrfoot commented on LUCENE-1677:

Mike, are we going to postpone the actual deletion of these classes until 3.0?

Remove GCJ IndexReader specializations
Key: LUCENE-1677
URL: https://issues.apache.org/jira/browse/LUCENE-1677
Project: Lucene - Java
Issue Type: Task
Reporter: Earwin Burrfoot
Assignee: Michael McCandless
Fix For: 2.9

These specializations are outdated, unsupported, and most probably pointless due to the speed of modern JVMs, and, I bet, nobody uses them (Mike, you said you were going to ask people on java-user - did anybody reply that they need it?). While giving nothing, they make the SegmentReader instantiation code look really ugly. If nobody objects, I'm going to post a patch that removes these from Lucene.
[jira] Commented: (LUCENE-1677) Remove GCJ IndexReader specializations
[ https://issues.apache.org/jira/browse/LUCENE-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12723378#action_12723378 ] Earwin Burrfoot commented on LUCENE-1677:

I thought we were doing everything right now, as it is broken already. And I have a half-written patch with the SR cleanup after GCJ removal :)
[jira] Commented: (LUCENE-1701) Add NumericField and NumericSortField, make plain text numeric parsers public in FieldCache, move trie parsers to FieldCache
[ https://issues.apache.org/jira/browse/LUCENE-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722769#action_12722769 ] Earwin Burrfoot commented on LUCENE-1701:

Using 4 for int, 6 for long. Dates-as-longs look a bit sad on 8.

Add NumericField and NumericSortField, make plain text numeric parsers public in FieldCache, move trie parsers to FieldCache
Key: LUCENE-1701
URL: https://issues.apache.org/jira/browse/LUCENE-1701
Project: Lucene - Java
Issue Type: New Feature
Components: Index, Search
Affects Versions: 2.9
Reporter: Uwe Schindler
Assignee: Uwe Schindler
Fix For: 2.9
Attachments: LUCENE-1701-test-tag-special.patch, LUCENE-1701.patch, LUCENE-1701.patch, LUCENE-1701.patch, LUCENE-1701.patch, LUCENE-1701.patch, LUCENE-1701.patch, NumericField.java

In discussions about LUCENE-1673, Mike & me wanted to add a new NumericField to o.a.l.document specifically for easy indexing. An alternative would be to add a NumericUtils.newXxxField() factory that creates a preconfigured Field instance with norms and tf off, optionally a stored text (LUCENE-1699), and the TokenStream already initialized. On the other hand, NumericUtils.newXxxSortField could be moved to NumericSortField. I and Yonik tend to use the factory for both; Mike tends to create the new classes. Also, the parsers for string-formatted numerics are not public in FieldCache. As the new SortField API (LUCENE-1478) makes it possible to support a parser in SortField instantiation, it would be good to have the static parsers in FieldCache publicly available. SortField would init its member variable to them (instead of NULL), making code a lot easier (FieldComparator has these ugly null checks when retrieving values from the cache). Moving the trie parsers as static instances into FieldCache as well would make the code cleaner, and we would be able to hide the hack StopFillCacheException by making it private to FieldCache (currently it's public because NumericUtils is in o.a.l.util).
[jira] Issue Comment Edited: (LUCENE-1701) Add NumericField and NumericSortField, make plain text numeric parsers public in FieldCache, move trie parsers to FieldCache
[ https://issues.apache.org/jira/browse/LUCENE-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722769#action_12722769 ] Earwin Burrfoot edited comment on LUCENE-1701 at 6/22/09 12:18 PM:

Using 4 for int, 6 for long. Dates-as-longs look a bit sad on 8. Though, if you want really fast dates, choosing hour/day/month/year as precision steps is vastly superior, plus it also clicks well with user-selected ranges. Still, I dumped this approach for uniformity and clarity.
[jira] Commented: (LUCENE-1701) Add NumericField and NumericSortField, make plain text numeric parsers public in FieldCache, move trie parsers to FieldCache
[ https://issues.apache.org/jira/browse/LUCENE-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722775#action_12722775 ] Earwin Burrfoot commented on LUCENE-1701:

bq. Design for today.

And spend two years deprecating and supporting today's designs after you get a better thing tomorrow. Back-compat Lucene-style and agile design aren't something that marries well.

bq. "donating something to Lucene means casting it in concrete." We can't let fear of back-compat prevent us from making progress.

My point was that strict back-compat prevents people from donating work which is not yet finalized. They either lose the comfortable volatility of private code, or have to maintain two versions of it - private and Lucene.

bq. "NRT seems to tread the same path, and I'm not sure it's going to win that much turnaround time after newly-introduced per-segment collection." I agree, per-segment collection was the bulk of the gains needed for NRT. This was a big change and a huge step forward in simple reopen turnaround.

I vote it for the most frustrating (in terms of adapting your custom code) and most useful change of 2.9 :)

bq. But, not having to write & read deletes to disk, not commit (fsync) from the writer in order to see those changes in the reader should also give us decent gains. fsync is surprisingly and intermittently costly.

I'm not sure this can't be achieved without messing with IR/W guts so much. The guys from LinkedIn that drive this feature (if I'm not mistaken) had a prior solution with separate indexes, one on disk, one in RAM. Per-segment collection adds superfast reopens and a MultiReader that is way greater than MultiSearcher - you can finally do adequately fast searches across separate indexes. Do we still need to add complexity for minor performance gains?

bq. And this integration lets us take it a step further with LUCENE-1313, where recently created segments can remain in RAM and be shared with the reader.

RAMDirectory? Some time ago I finished a first version of IR plugins, and enjoy pretty low reopen times (field/facet/filter cache warmups included). (Yes, I'm going to open an issue for plugins once they stabilize enough.)

bq. I'm confused: I thought that effort was to make SegmentReader's components fully pluggable? (Not to actually change what components SegmentReader is creating.) EG does this modularization alter the approach to NRT? I thought they were orthogonal.

Yes, they are orthogonal. This was yet another praise of per-segment collection, and an example of how this approach can be extended to your custom stuff (like a filter cache).
Re: Shouldn't IndexWriter.commit(Map) accept Properties instead?
"What other issues would we be taking on by using Java's serialization here...?"

It's insanely slow. Though, that doesn't apply to a once-per-commit call. The other point is: if you store Object, you can no longer mix Lucene and user data. With Map<String, String>, whatever the approach, you could reserve some key space for Lucene and let the user add his stuff on top.
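For illustration, with the Map<String, String> form of commit user data (the 2.9-era IndexWriter.commit(Map) call), the reserved-key-space idea looks like this. The prefixes below are made up for the example, not an actual convention:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.index.IndexWriter;

class CommitUserDataDemo {
  static void commitWithUserData(IndexWriter writer, String lastId) throws IOException {
    Map<String, String> userData = new HashMap<String, String>();
    // hypothetical reserved prefix for library-internal entries
    userData.put("lucene.example.internal", "42");
    // application data coexists in the same map under its own prefix
    userData.put("myapp.lastIndexedId", lastId);
    writer.commit(userData);
  }
}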
[jira] Commented: (LUCENE-1712) Set default precisionStep for NumericField and NumericRangeFilter
[ https://issues.apache.org/jira/browse/LUCENE-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722843#action_12722843 ] Earwin Burrfoot commented on LUCENE-1712:

Am I misunderstanding something, or does the problem still persist? Even if you use a common default, what is your base type - int or long? Are floats converted to ints, or to longs?
[jira] Commented: (LUCENE-1712) Set default precisionStep for NumericField and NumericRangeFilter
[ https://issues.apache.org/jira/browse/LUCENE-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722851#action_12722851 ] Earwin Burrfoot commented on LUCENE-1712:

Aha! And each time you invoke setFloatValue/setDoubleValue it switches the base type behind the scenes? Eeek.
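Concretely, what "base type" means here, using the real NumericUtils helpers from 2.9: floats are encoded as sortable ints and doubles as sortable longs, so setFloatValue vs. setDoubleValue really does flip the width of the underlying trie terms.

{code}
import org.apache.lucene.util.NumericUtils;

public class SortableEncodingDemo {
  public static void main(String[] args) {
    // float -> 32-bit sortable int encoding
    int asInt = NumericUtils.floatToSortableInt(15.50f);
    // double -> 64-bit sortable long encoding
    long asLong = NumericUtils.doubleToSortableLong(15.50);
    // different widths produce different indexed terms
    System.out.println(asInt + " vs " + asLong);
  }
}
{code}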
Re: 3MB lucene-analyzers.jar?
"But: I do not understand the problems with this JAR file. If somebody really wants to have smaller files, one could use some tools that do it automatically on class usage."

"I personally have a couple of use cases for that, as I have to work in very limited environments. Imagine embedded systems or mobile phones, where 500 kb is a lot. If you really need the analyzer you can include the additional jar."

Jar Jar Links - special tools for special tasks.
[jira] Commented: (LUCENE-1701) Add NumericField and NumericSortField, make plain text numeric parsers public in FieldCache, move trie parsers to FieldCache
[ https://issues.apache.org/jira/browse/LUCENE-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721787#action_12721787 ] Earwin Burrfoot commented on LUCENE-1701:

I vote for factories - escaping back-compat woes by exposing a minimum interface.
[jira] Commented: (LUCENE-1701) Add NumericField and NumericSortField, make plain text numeric parsers public in FieldCache, move trie parsers to FieldCache
[ https://issues.apache.org/jira/browse/LUCENE-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721830#action_12721830 ] Earwin Burrfoot commented on LUCENE-1701:

Mike, I very much agree with everything you said, except "factory is less consumable than constructor" and "add stuff to index to handle NumericField". Out of your three examples the second one is bad, no questions. But the first and last are absolutely equal in terms of consumability. Static factories are cool (they allow switching implementations and instantiation logic without changing the API) and are as easy to use (probably even easier with generics in Java 5) as constructors. If we add some generic storable flags for Lucene fields, this is cool (probably); NumericField can then capitalize on them, as well as users writing their own NNNFields. Tying the index format to some particular implementation of numerics is bad design. Why on earth can't my own split-field (vs. single-field as in current Lucene) trie-encoded number enjoy the same benefits as NumericField from Lucene core?

bq. By this same logic, should we remove NumericRangeFilter/Query and use static factories instead?

I do use factory methods for all my queries and filters, and it makes me feel warm and fuzzy! :) Under the hood some of them consult FieldInfo to instantiate custom-tailored query variants, so I just use range(CREATION_TIME, from, to) and don't think about whether the field is trie-encoded or raw. Simple things should be simple, okay. Complex things should be simple too, argh! :)
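A sketch of the factory style described above (the FieldDef descriptor and range() helper are hypothetical private-schema-layer code, not a Lucene API): callers write range(field, from, to) and the factory consults the schema to pick the right query class.

{code}
import org.apache.lucene.search.NumericRangeQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermRangeQuery;

class Queries {
  // hypothetical schema descriptor
  interface FieldDef {
    String name();
    boolean isTrieLong();
    int precisionStep();
  }

  // callers never care whether the field is trie-encoded or raw
  static Query range(FieldDef field, String from, String to) {
    if (field.isTrieLong()) {
      return NumericRangeQuery.newLongRange(field.name(), field.precisionStep(),
          Long.valueOf(from), Long.valueOf(to), true, true);
    }
    return new TermRangeQuery(field.name(), from, to, true, true);
  }
}
{code}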
[jira] Commented: (LUCENE-1701) Add NumericField and NumericSortField, make plain text numeric parsers public in FieldCache, move trie parsers to FieldCache
[ https://issues.apache.org/jira/browse/LUCENE-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722060#action_12722060 ] Earwin Burrfoot commented on LUCENE-1701:

bq. Someday maybe I'll convince you to donate this schema layer on top of Lucene

It's not generic enough to be of use for every user of Lucene, and it doesn't aim to be such. It also evolves, and donating something to Lucene means casting it in concrete. So that's not me being greedy or lazy (okay, maybe a little bit of the latter); it's simply not public-quality (as I understand it) code. I can share the design if anybody's interested, but everyone's coping with it themselves, it seems. Solr has its own schema approach, and it has its merits and downfalls compared to mine. That's what is nice: we're able to use the same library in differing ways, and it doesn't force its sense of 'best practices' on us.

bq. But I hope there are SOME named classes in there and not all static factory methods returning anonymous untyped impls.

SOME of them aren't static :-D

bq. We shouldn't weaken trie's integration to core just because others have private implementations.

You shouldn't integrate into core something that is not core functionality. Think microkernels. It's strange seeing you drive CSFs, custom indexing chains, and pluggability everywhere on one side, while trying to add some weird custom properties into the index that are tightly interwoven with only one of the possible numeric implementations on the other side.

bq. Design for today.

And spend two years deprecating and supporting today's designs after you get a better thing tomorrow. Back-compat Lucene-style and agile design aren't something that marries well.

bq. What's important is that we don't weaken those private implementations with trie's addition, and I don't think our approach here has done that.

You're weakening Lucene itself by introducing too much coupling between its components. The IndexReader/Writer pair is a good example of what I'm arguing against: a dusty closet of microfeatures that are tightly interwoven into a complex, hard-to-maintain mess with zillions of (possibly broken) control paths - remember the mutable deletes/norms + clone/reopen permutations? It could be avoided if IR/W were kept to the bare minimum (which most people are going to use), and more advanced features were built on top of it, not in the same place. NRT seems to tread the same path, and I'm not sure it's going to win that much turnaround time after newly-introduced per-segment collection. Some time ago I finished a first version of IR plugins, and enjoy pretty low reopen times (field/facet/filter cache warmups included). (Yes, I'm going to open an issue for plugins once they stabilize enough.)

{quote}
If we add some generic storable flags for Lucene fields, this is cool (probably), NumericField can then capitalize on it, as well as users writing their own NNNFields.
+1 Wanna make a patch?
{quote}

No, I'd like to continue the IR cleanup and play with a positionIncrement companion value that could enable true multiword synonyms. I know, I know, it's do-ocracy. But it's not an excuse for hacks.
Add NumericField and NumericSortField, make plain text numeric parsers public in FieldCache, move trie parsers to FieldCache
--
Key: LUCENE-1701
URL: https://issues.apache.org/jira/browse/LUCENE-1701
Project: Lucene - Java
Issue Type: New Feature
Components: Index, Search
Affects Versions: 2.9
Reporter: Uwe Schindler
Assignee: Uwe Schindler
Fix For: 2.9
Attachments: NumericField.java

In discussions about LUCENE-1673, Mike and I wanted to add a new NumericField to o.a.l.document, specifically for easy indexing. An alternative would be to add a NumericUtils.newXxxField() factory that creates a preconfigured Field instance with norms and tf off, optionally a stored text value (LUCENE-1699), and the TokenStream already initialized. On the other hand, NumericUtils.newXxxSortField could be moved to NumericSortField. Yonik and I tend to use the factory for both; Mike tends to create the new classes.

Also, the parsers for string-formatted numerics are not public in FieldCache. As the new SortField API (LUCENE-1478) makes it possible to supply a parser at SortField instantiation, it would be good to have the static parsers in FieldCache publicly available. SortField would init its member variable to them (instead of null), making the code a lot simpler (FieldComparator has these ugly null checks when retrieving values from the cache). Moving the trie parsers as static instances into FieldCache too would make the code cleaner, and we would be able to hide the StopFillCacheException hack by making it private to FieldCache (currently it's public because NumericUtils is in o.a.l.util).
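The factory alternative is easy to picture. A minimal sketch, assuming the 2.9-era names (NumericTokenStream, the omit* setters on Field); the factory class and method names here are hypothetical:

{code}
// Hypothetical NumericUtils-style factory: a Field preconfigured with
// norms and tf off, and the numeric TokenStream already attached.
import org.apache.lucene.analysis.NumericTokenStream;
import org.apache.lucene.document.Field;

public final class NumericFieldFactory {
  private NumericFieldFactory() {}

  public static Field newLongField(String name, long value, int precisionStep) {
    Field f = new Field(name, new NumericTokenStream(precisionStep).setLongValue(value));
    f.setOmitNorms(true);                // norms off, as the description proposes
    f.setOmitTermFreqAndPositions(true); // tf off
    return f;
  }
}
{code}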
[jira] Commented: (LUCENE-1630) Mating Collector and Scorer on doc Id orderness
[ https://issues.apache.org/jira/browse/LUCENE-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12720619#action_12720619 ] Earwin Burrfoot commented on LUCENE-1630:
-

I wasn't following the issue closely, so this question might be silly - how does out-of-order scoring/collection marry with filters? If I remember right, filter/scorer intersection relies on proper orderness.

Mating Collector and Scorer on doc Id orderness
---
Key: LUCENE-1630
URL: https://issues.apache.org/jira/browse/LUCENE-1630
Project: Lucene - Java
Issue Type: Improvement
Components: Search
Reporter: Shai Erera
Assignee: Michael McCandless
Fix For: 2.9
Attachments: LUCENE-1630.patch, LUCENE-1630.patch, LUCENE-1630.patch

This is a spin-off of LUCENE-1593. This issue proposes to expose appropriate API on Scorer and Collector such that one can create an optimized Collector based on a given Scorer's doc-id orderness and vice versa. Copied from LUCENE-1593, here is the list of changes:
# Deprecate Weight and create QueryWeight (abstract class) with a new scorer(reader, scoreDocsInOrder), replacing the current scorer(reader) method. QueryWeight implements Weight, while scorer(reader) calls scorer(reader, false /* out-of-order */) and scorer(reader, scoreDocsInOrder) is defined abstract.
#* Also add QueryWeightWrapper to wrap a given Weight implementation. This one will also be deprecated, as well as package-private.
#* Add to Query variants of createWeight and weight which return QueryWeight. For now, I prefer to add a default impl which wraps the Weight variant instead of overriding in all Query extensions, and in 3.0, when we remove the Weight variants, override in all extending classes.
# Add to Scorer isOutOfOrder with a default of false, and override in BS to true.
# Modify BooleanWeight to extend QueryWeight and implement the new scorer method to return BS2 or BS based on the number of required scorers and setAllowOutOfOrder.
# Add to Collector an abstract _acceptsDocsOutOfOrder_ which returns true/false.
#* Use it in the IndexSearcher.search methods that accept a Collector, in order to create the appropriate Scorer, using the new QueryWeight.
#* Provide a static create method to TFC and TSDC which accepts this as an argument and creates the proper instance.
#* Wherever we create a Collector (TSDC or TFC), always ask for an out-of-order Scorer and check isOutOfOrder() on the resulting Scorer, so that we can create the optimized Collector instance.
# Modify IndexSearcher to use all of the above logic.

The only class I'm worried about, and would like to verify with you, is Searchable. If we want to deprecate all the search methods on IndexSearcher, Searcher and Searchable which accept Weight and add new ones which accept QueryWeight, we must do the following:
* Deprecate Searchable in favor of Searcher.
* Add to Searcher the new QueryWeight variants. Here we have two choices: (1) break back-compat and add them as abstract (like we've done with the new Collector method) or (2) add them with a default impl that calls the Weight versions, documenting that these will become abstract in 3.0.
* Have Searcher extend UnicastRemoteObject and have RemoteSearchable extend Searcher. That's the part I'm a little bit worried about - Searchable implements java.rmi.Remote, which means there could be an implementation out there which implements Searchable and extends something different than UnicastRemoteObject, like Activeable.
I think there is a very small chance this has actually happened, but I would like to confirm with you guys first.
* Add a deprecated, package-private SearchableWrapper which extends Searcher and delegates all calls to the Searchable member.
* Deprecate all uses of Searchable and add Searcher instead, defaulting the old ones to use SearchableWrapper.
* Make all the necessary changes to IndexSearcher, MultiSearcher etc. regarding overriding these new methods.

One other optimization that was discussed in LUCENE-1593 is to expose a topScorer() API (on Weight) which returns a Scorer whose score(Collector) will be called, and additionally add a start() method to DISI. That will allow Scorers to initialize either on start() or score(Collector). This was proposed mainly because of BS and BS2, which check whether they are initialized in every call to next(), skipTo() and score(). Personally I prefer to see that in a separate issue, following this one (as it might add methods to QueryWeight).
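For context on the ordering question asked at the top of this message: scorer/filter intersection is usually a leapfrog over two forward-only iterators, which is exactly what breaks if one side emits docs out of order. A sketch against the 2.9 DocIdSetIterator API:

{code}
import java.io.IOException;
import org.apache.lucene.search.DocIdSetIterator;

public final class Leapfrog {
  // Counts docs in the intersection; relies on both iterators producing
  // strictly increasing docIds and supporting forward-only advance().
  public static int count(DocIdSetIterator a, DocIdSetIterator b) throws IOException {
    int hits = 0;
    int x = a.nextDoc(), y = b.nextDoc();
    while (x != DocIdSetIterator.NO_MORE_DOCS && y != DocIdSetIterator.NO_MORE_DOCS) {
      if (x == y) { hits++; x = a.nextDoc(); y = b.nextDoc(); }
      else if (x < y) x = a.advance(y);  // jump forward, never backward
      else y = b.advance(x);
    }
    return hits;
  }
}
{code}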
Re: madvise(ptr, len, MADV_SEQUENTIAL)
Except you don't know the size of the file to be written upfront. One probable solution is to map the output file in pages. As a complementary solution, you can map a huge area of the file and hope that little real memory is allocated by the OS unless you actually write all over that area. Dunno. The idea of using mmapped writes has stopped looking interesting to me.

On Tue, Jun 16, 2009 at 18:32, Uwe Schindler u...@thetaphi.de wrote:

But to use it, we should change MMapDirectory to also use the mapping when writing to files. I thought about it; it is very simple to implement (just copy the IndexInput and change all gets() to sets()).

-----Original Message-----
From: Michael McCandless [mailto:luc...@mikemccandless.com]
Sent: Tuesday, June 16, 2009 4:22 PM
To: java-dev@lucene.apache.org
Cc: Alan Bateman; nio-disc...@openjdk.java.net
Subject: Re: madvise(ptr, len, MADV_SEQUENTIAL)

Lucene could really make use of this method. When a segment merge takes place, we can read and write many GB of data, which without madvise would, on many OSs, effectively flush the IO cache (thus hurting our search performance). Mike

On Mon, Jun 15, 2009 at 6:01 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote:

Thanks Alan. I cross-posted this to the Lucene dev list, where we are discussing using madvise to minimize unnecessary IO cache usage when merging segments (where we really want the newly merged segments in the IO cache rather than the old segment files). How would the advise method work? Would there need to be a hint in the FileChannel.map method? -J

On Mon, Jun 15, 2009 at 12:36 AM, Alan Bateman alan.bate...@sun.com wrote:

Jason Rutherglen wrote: Is there going to be a way to do this in the new Java IO APIs? Good question, as it has come up a few times and is needed for some important use-cases. A while back I looked into adding a MappedByteBuffer#advise method to allow the application to provide hints on the expected usage, but didn't complete it. We should probably look at this again for jdk7. -Alan.
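For illustration, roughly what 'map the output file in pages' could look like (a sketch only - the class is hypothetical, the page size arbitrary, and truncating a still-mapped file is platform-sensitive, which hints at why the idea lost its appeal):

{code}
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public final class PagedMmapWriter {
  private static final int PAGE_SIZE = 1 << 24; // 16 MB pages, arbitrary
  private final FileChannel channel;
  private MappedByteBuffer page;
  private long pageStart = 0;

  public PagedMmapWriter(String path) throws IOException {
    channel = new RandomAccessFile(path, "rw").getChannel();
    page = channel.map(FileChannel.MapMode.READ_WRITE, pageStart, PAGE_SIZE);
  }

  public void writeByte(byte b) throws IOException {
    if (!page.hasRemaining()) { // grow the file one mapped page at a time
      pageStart += PAGE_SIZE;
      page = channel.map(FileChannel.MapMode.READ_WRITE, pageStart, PAGE_SIZE);
    }
    page.put(b);
  }

  public void close() throws IOException {
    page.force();
    // trim the unused tail; note this fails on platforms (e.g. Windows)
    // that refuse to truncate a file with live mappings
    channel.truncate(pageStart + page.position());
    channel.close();
  }
}
{code}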
Re: Proposal for changing the backwards-compatibility policy
Oh yes! Again! +1

One point is missing: what about incompatible behavioral changes that do not touch API or file format? Like posIncr=0 at the first token in a stream, or analyzer fixes, or something along these lines. Are we free to introduce them in a minor release without warning, or are we going to warn one release before the change, or do we provide old-behaviour switches that are deprecated since their birth, or do we keep said switches for a couple of major releases?

On Tue, Jun 16, 2009 at 14:37, Michael Busch busch...@gmail.com wrote:

Probably everyone is thinking right now "Oh no! Not again!". I admit I didn't fully read the incredibly long recent thread about backwards-compatibility, so maybe what I'm about to propose has been proposed already. In that case my apologies in advance. Rather than discussing our current backwards-compatibility policy again, I'd like to make a concrete proposal here for changing the policy after Lucene 3.0 is released. I'll call X.Y -> X+1.0 a 'major release', X.Y -> X.Y+1 a 'minor release' and X.Y.Z -> X.Y.Z+1 a 'bugfix release'. (We can use different names later; these are just for convenience here...)

1. The file format backwards-compatibility policy will remain unchanged; i.e. Lucene X.Y supports reading all indexes written with Lucene X-1.Y. That means Lucene 4.0 will not have to be able to read 2.x indexes.
2. Deprecated public and protected APIs can be removed if they have been released in at least one major or minor release. E.g. a 3.1 API can be released as deprecated in 3.2 and removed in 3.3 or 4.0 (if 4.0 comes after 3.2).
3. No public or protected APIs are changed in a bugfix release, except if a severe bug can't be fixed otherwise.
4. Each release will have release notes with a new section "Incompatible changes", which lists, as the name says, all changes that break backwards compatibility. The list should also have information about how to convert to the new API. I think the Eclipse releases have such a release notes section.

The big change here apparently is 2. Consider the current situation: we can release e.g. the new TokenStream API with 2.9 and then remove it a month later in 3.0, while still complying with our current backwards-compatibility policy. A transition period of one month is very short for such an important API. On the other hand, a transition period of presumably two years, until 4.0 is released, seems very long to stick with a deprecated API that clutters the APIs and docs. With the proposed change, we couldn't do that. Given our current release schedule, the transition period would be at least 6-9 months, which seems a very reasonable timeframe.

We should also not consider 2 a must. I.e. we don't *have* to deprecate after only one major or minor release. For a very popular API like the TokenStream API, we could send a mail to java-user asking whether people need more transition time, and be flexible. I think this policy is much more dynamic and flexible, but should still give our users enough confidence. It also removes the need to do things just for the sake of the current policy rather than because they make the most sense, like our somewhat goofy X.9 releases. :)

Just to make myself clear: I think we should definitely stick with our 2.9 and 3.0 plans and change the policy afterwards.

My +1 to all 4 points above.
-Michael
[jira] Commented: (LUCENE-1673) Move TrieRange to core
[ https://issues.apache.org/jira/browse/LUCENE-1673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12720231#action_12720231 ] Earwin Burrfoot commented on LUCENE-1673:
-

bq. This is that baking in a specific implementation into the index format that I don't like.

+many

bq. I do agree that retrieving a doc is already buggy, in that various things are lost from your index time doc (a well known issue at this point!)

How on earth is it buggy? You're working with an inverted index; you aren't supposed to get the original document back from it in the first place. It's like saying a hash function is buggy because it is not reversible. The less coupling various Lucene components have with each other, the better. If you'd like an end-to-end experience for numeric fields, build something schema-like and put it in contrib. If that's hard to build, Lucene core is to blame for not being extensible enough. From my experience, for that purpose it's okay as it is.

Move TrieRange to core
--
Key: LUCENE-1673
URL: https://issues.apache.org/jira/browse/LUCENE-1673
Project: Lucene - Java
Issue Type: New Feature
Components: Search
Affects Versions: 2.9
Reporter: Uwe Schindler
Assignee: Uwe Schindler
Fix For: 2.9
Attachments: LUCENE-1673.patch, LUCENE-1673.patch, LUCENE-1673.patch

TrieRange was iterated many times and seems stable now (LUCENE-1470, LUCENE-1582, LUCENE-1602). There is lots of user interest, Solr added it to its default FieldTypes (SOLR-940), and if possible I want to move it to core before the release of 2.9. Before this can be done, there are some things to think about:
# There are now classes called LongTrieRangeQuery, IntTrieRangeQuery; how should they be called in core? I would suggest leaving it as it is. On the other hand, if this remains our only numeric query implementation, we could call them LongRangeQuery, IntRangeQuery or NumericRangeQuery (see below; there are problems here). Same for the TokenStreams and Filters.
# Maybe the pairs of classes for indexing and searching should be merged into one class each: NumericTokenStream, NumericRangeQuery, NumericRangeFilter. The problem here: ctors must be able to take int, long, double, float as range parameters. For the end user, mixing these 4 types in one class is hard to handle. If somebody forgets to add an L to a long, it suddenly instantiates an int version of the range query, hitting no results, and so on. Same with the other types. Maybe accept java.lang.Number as the parameter (because it's nullable, for half-open bounds) and one enum for the type.
# Should TrieUtils move into o.a.l.util? Or o.a.l.document? Or...?
# Move the TokenStreams into o.a.l.analysis, ShiftAttribute into o.a.l.analysis.tokenattributes? Somewhere else?
# If we rename the classes, should Solr stay with Trie (because there are different impls)?
# Maybe add a subclass of AbstractField that automatically creates these TokenStreams and omits norms/tf by default, for easier addition to Document instances?
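Point 2's 'java.lang.Number plus a type enum' idea, sketched (hypothetical names, not the eventual API): null expresses a half-open bound, and the explicit type makes the forgotten-L overload mixup impossible.

{code}
public final class NumericRange {
  public enum Type { INT, LONG, FLOAT, DOUBLE }

  private final String field;
  private final Type type;
  private final Number min, max; // null = half-open bound

  public NumericRange(String field, Type type, Number min, Number max) {
    if (min == null && max == null)
      throw new IllegalArgumentException("at least one bound must be given");
    this.field = field;
    this.type = type;
    this.min = min;
    this.max = max;
  }
}

// usage: new NumericRange("price", NumericRange.Type.LONG, null, 100L)
{code}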
[jira] Commented: (LUCENE-1630) Mating Collector and Scorer on doc Id orderness
[ https://issues.apache.org/jira/browse/LUCENE-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719539#action_12719539 ] Earwin Burrfoot commented on LUCENE-1630:
-

I like the last option most. Creating a dummy scorer looks ugly to me, and looks like it will cause more problems of the same kind in the future.

Mating Collector and Scorer on doc Id orderness
---
Key: LUCENE-1630
URL: https://issues.apache.org/jira/browse/LUCENE-1630
Project: Lucene - Java
Issue Type: Improvement
Components: Search
Reporter: Shai Erera
Fix For: 2.9
[jira] Issue Comment Edited: (LUCENE-1630) Mating Collector and Scorer on doc Id orderness
[ https://issues.apache.org/jira/browse/LUCENE-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719539#action_12719539 ] Earwin Burrfoot edited comment on LUCENE-1630 at 6/15/09 5:36 AM:
--

I like the last option (move scoresOutOfOrder to Weight) most. Creating a dummy scorer looks ugly to me, and looks like it will cause more problems of the same kind in the future.

was (Author: earwin): I like the last option most. Creating dummy scorer looks ugly to me, and looks like it will cause more problems of the same kind in the future.

Mating Collector and Scorer on doc Id orderness
---
Key: LUCENE-1630
URL: https://issues.apache.org/jira/browse/LUCENE-1630
Project: Lucene - Java
Issue Type: Improvement
Components: Search
Reporter: Shai Erera
Fix For: 2.9
Re: Payloads and TrieRangeQuery
Just to throw something out: the new Token API is not very consumable in my experience. The old one was very intuitive, and it was very easy to follow the code. I've had to re-figure out what the heck was going on with the new one more than once now. Writing some example code with it is hard to follow or justify to a new user. What was the big improvement with it again? Advanced, expert custom indexing chains require less casting or something, right? I dunno - anyone else have any thoughts now that the new API has been in circulation for some time?

I have an advanced, expert custom indexing chain, and it's still not ported over to the new API. It's counterintuitive all right, with names not really saying what's going on (please, for an AttributeSource, whose Attribute is it? An Attribute is a quality of 'something', but that 'something' is amiss). But the biggest problem for me is that it capitalizes on the idea of a token stream even further, making filters whose output is several times the input, token-wise, or which need to inspect a number of tokens before emitting something, much harder to write. I most probably missed something and there IS a way not to trash your memory with non-reused LinkedHashMaps, but then again, there are no pointers.
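To make the 'output is several times the input' complaint concrete, here is about the smallest such filter under the new attribute API (a sketch against the 2.9-era classes; the filter itself is a made-up example): buffering even one token for re-emission means juggling captured state by hand.

{code}
import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.util.AttributeSource;

// Re-emits every token once more at the same position, as a synonym would be.
public final class DuplicatingFilter extends TokenFilter {
  private final PositionIncrementAttribute posIncrAtt = addAttribute(PositionIncrementAttribute.class);
  private AttributeSource.State pending; // captured token awaiting re-emission

  public DuplicatingFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (pending != null) {                // emit the buffered copy first
      restoreState(pending);
      pending = null;
      posIncrAtt.setPositionIncrement(0); // stack it on the same position
      return true;
    }
    if (!input.incrementToken()) return false;
    pending = captureState();             // remember this token for the copy
    return true;
  }
}
{code}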
[jira] Commented: (LUCENE-1488) issues with standardanalyzer on multilingual text
[ https://issues.apache.org/jira/browse/LUCENE-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719322#action_12719322 ] Earwin Burrfoot commented on LUCENE-1488:
-

bq. But this can't replace ArabicAnalyzer completely, because ArabicAnalyzer stems Arabic text in a language-specific way, which has a huge effect on retrieval quality for Arabic language text.

What about separating word-tokenizing from morphological processing?

issues with standardanalyzer on multilingual text
-
Key: LUCENE-1488
URL: https://issues.apache.org/jira/browse/LUCENE-1488
Project: Lucene - Java
Issue Type: Wish
Components: contrib/analyzers
Reporter: Robert Muir
Priority: Minor
Attachments: ICUAnalyzer.patch, LUCENE-1488.patch, LUCENE-1488.txt

The standard analyzer in Lucene is not exactly Unicode-friendly with regards to breaking text into words, especially with respect to non-alphabetic scripts. This is because it is unaware of Unicode bounds properties. I actually couldn't figure out how the Thai analyzer could possibly be working until I looked at the jflex rules and saw that the codepoint range for most of the Thai block was added to the alphanum specification. Defining the exact codepoint ranges like this for every language could help with the problem, but you'd basically be reimplementing the bounds properties already stated in the Unicode standard.

In general this kind of behavior is bad in Lucene even for Latin: for instance, the analyzer will break words around accent marks in decomposed form. While most Latin letter + accent combinations have composed forms in Unicode, some do not. (This is also an issue for ASCIIFoldingFilter, I suppose.) I've got a partially tested StandardAnalyzer that uses an ICU rule-based BreakIterator instead of jflex. Using this method you can define word boundaries according to the Unicode bounds properties. After getting it into some good shape I'd be happy to contribute it to contrib, but I wonder if there's a better solution so that out-of-the-box Lucene will be more friendly to non-ASCII text. Unfortunately it seems jflex does not support the use of these properties, such as [\p{Word_Break = Extend}], so this is probably the major barrier. Thanks, Robert
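For reference, the ICU approach described above, in miniature (my illustration, using ICU4J's stock word BreakIterator rather than custom rules):

{code}
import com.ibm.icu.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

public final class UnicodeWordSplitter {
  // Splits text on Unicode word-break bounds instead of hand-rolled
  // codepoint ranges; spans not starting with a letter/digit are dropped.
  public static List<String> words(String text) {
    BreakIterator bi = BreakIterator.getWordInstance();
    bi.setText(text);
    List<String> out = new ArrayList<String>();
    int start = bi.first();
    for (int end = bi.next(); end != BreakIterator.DONE; start = end, end = bi.next()) {
      String w = text.substring(start, end);
      if (w.length() > 0 && Character.isLetterOrDigit(w.codePointAt(0)))
        out.add(w);
    }
    return out;
  }
}
{code}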
[jira] Commented: (LUCENE-1453) When reopen returns a new IndexReader, both IndexReaders may now control the lifecycle of the underlying Directory which is managed by reference counting
[ https://issues.apache.org/jira/browse/LUCENE-1453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12718009#action_12718009 ] Earwin Burrfoot commented on LUCENE-1453:
-

bq. As the Filter is just a deprecated wrapper, that is removed in 3.0, I think reusing SegmentReader.Ref for that is ok.

Ok. Maybe you are right.

bq. Closeable is a Java 1.5 interface only, so this refactoring must wait until 3.0, but the idea is good!

We can introduce our own Closeable and replace it with the Java native one in 3.0 - thank gods the interface is simple :)

When reopen returns a new IndexReader, both IndexReaders may now control the lifecycle of the underlying Directory which is managed by reference counting
-
Key: LUCENE-1453
URL: https://issues.apache.org/jira/browse/LUCENE-1453
Project: Lucene - Java
Issue Type: Bug
Affects Versions: 2.4
Reporter: Mark Miller
Assignee: Michael McCandless
Priority: Minor
Fix For: 2.4.1, 2.9
Attachments: Failing-testcase-LUCENE-1453.patch, LUCENE-1453-with-FSDir-open.patch, LUCENE-1453.patch, LUCENE-1453.patch, LUCENE-1453.patch, LUCENE-1453.patch, LUCENE-1453.patch, LUCENE-1453.patch

Rough summary: FSDirectory tracks references to FSDirectory instances, and when IndexReader.reopen shares a Directory with a created IndexReader and closeDirectory is true, FSDirectory's ref management will see two decrements for one increment. You can end up getting an AlreadyClosed exception on the Directory while the IndexReader is open. I have a test I'll put up. A solution seems fairly straightforward (at least in what needs to be accomplished).
Re: Payloads and TrieRangeQuery
And this information about the trie structure and where payloads are should be stored in FieldInfos.

As is the case today, the info is encoded in the class you use (and its settings)... no need to add it to the index structure. In any case, it's a completely different issue and shouldn't be tied to TrieRange improvements.

The problem is, because the details of Trie* at index time affect what's in each segment, this information needs to be stored per segment. And then, when you merge segments indexed with different Trie* settings, you need to convert them to some common form.

Sounds like something too complex and with minimal returns.
[jira] Commented: (LUCENE-1607) String.intern() faster alternative
[ https://issues.apache.org/jira/browse/LUCENE-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12718198#action_12718198 ] Earwin Burrfoot commented on LUCENE-1607:
-

bq. but I was waiting for some kind of feedback if people in general thought it was the right approach. It introduces another static, and people tend to not like that.

I just somehow forgot about this issue. You're right about the static: it's not clear how and when to initialize it, plus you introduce some public classes we'll be unable to change/remove later. I still have a feeling we should expose a single static method - intern() - and hide the implementation away, possibly tuning it to be advantageous for thousands of fields and degrading to raw String.intern() level if there are more fields. I'm going to be away from AC power for three days starting now, so I won't be able to reply until then.

String.intern() faster alternative
--
Key: LUCENE-1607
URL: https://issues.apache.org/jira/browse/LUCENE-1607
Project: Lucene - Java
Issue Type: Improvement
Reporter: Earwin Burrfoot
Fix For: 2.9
Attachments: intern.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch

By using our own interned string pool on top of the default one, String.intern() can be greatly optimized. On my setup (Java 6) this alternative runs ~15.8x faster for already interned strings, and ~2.2x faster for 'new String(interned)'. For Java 5 and 4 the speedup is lower, but still considerable.
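The shape of that single-static-method idea, as I read it (a simplified sketch, not the attached patch; a ConcurrentHashMap stands in for whatever cache the implementation actually uses):

{code}
import java.util.concurrent.ConcurrentHashMap;

public final class SimpleStringInterner {
  private static final ConcurrentHashMap<String, String> pool =
      new ConcurrentHashMap<String, String>();

  private SimpleStringInterner() {}

  public static String intern(String s) {
    String cached = pool.get(s);
    if (cached != null) return cached;      // fast path: already pooled
    String canonical = s.intern();          // degrade to the JVM pool
    pool.putIfAbsent(canonical, canonical); // remember it for next time
    return canonical;
  }
}
{code}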
Re: Payloads and TrieRangeQuery
* Was the field even indexed w/ Trie, or indexed as simple text? It's useful to know this automatically at search time, so eg a RangeQuery can do the right thing by default. FieldInfos seems like the natural place to store this. It's basically Lucene's per-segment write-once schema. Eg we use this to record "did any token in this field have a Payload?", which is analogous.

This should really be in a schema of some kind (like in my project, for instance). Why do you do autodetection for tries, but recently removed it for FieldCache? Things should be consistent: either store all settings in the index (and die in the process), or don't store them there at all.

* We have a bug (or an important improvement) in how Trie encodes terms that we need to fix. This one is not easy to handle, since such a change could alter the term order, and merging segments then becomes problematic. Not sure how to handle that. Yonik, has Solr ever had to make a change to NumberUtils?

There are cases when reindexing is inevitable. What's so horrible about it anyway? Even if you have a humongous index, you can rebuild it in a matter of days, and you don't do this often.
[jira] Commented: (LUCENE-1453) When reopen returns a new IndexReader, both IndexReaders may now control the lifecycle of the underlying Directory which is managed by reference counting
[ https://issues.apache.org/jira/browse/LUCENE-1453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12717657#action_12717657 ] Earwin Burrfoot commented on LUCENE-1453:
-

Patch looks fine. I read the last one, LUCENE-1453-with-FSDir-open.patch.

When reopen returns a new IndexReader, both IndexReaders may now control the lifecycle of the underlying Directory which is managed by reference counting
-
Key: LUCENE-1453
URL: https://issues.apache.org/jira/browse/LUCENE-1453
Re: Some thoughts around the use of reader.isDeleted and hasDeletions
Actually: I think we should also change IndexReader.document to not check if it's deleted? (Renaming it to something like rawDocument(), storedDocument(), something, in the process, and deprecating the old one.)

Yup. After all, the most common use-case is to load a document after finding it in one way or another. Pretty hard to come up with the id of a deleted document.
[jira] Commented: (LUCENE-1453) When reopen returns a new IndexReader, both IndexReaders may now control the lifecycle of the underlying Directory which is managed by reference counting
[ https://issues.apache.org/jira/browse/LUCENE-1453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12717769#action_12717769 ] Earwin Burrfoot commented on LUCENE-1453:
-

bq. I think it should (be closed in a finally clause).

Then there's the next question of the same sort, though it probably belongs in a separate issue. If we close a DR and one of the SRs throws an exception - should we close the others (currently we don't)? What is the right way, in general, of handling IOExceptions on IR close? Can we retry the close? What does this exception mean?

When reopen returns a new IndexReader, both IndexReaders may now control the lifecycle of the underlying Directory which is managed by reference counting
-
Key: LUCENE-1453
URL: https://issues.apache.org/jira/browse/LUCENE-1453
[jira] Commented: (LUCENE-1678) Deprecate Analyzer.tokenStream
[ https://issues.apache.org/jira/browse/LUCENE-1678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12717823#action_12717823 ] Earwin Burrfoot commented on LUCENE-1678:
-

Second this. Though I've lost any hope for sane Lucene release/compat rules.

Deprecate Analyzer.tokenStream
--
Key: LUCENE-1678
URL: https://issues.apache.org/jira/browse/LUCENE-1678
Project: Lucene - Java
Issue Type: Bug
Components: Analysis
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
Fix For: 2.9

The addition of reusableTokenStream to the core analyzers unfortunately broke back-compat of external subclasses: http://www.nabble.com/Extending-StandardAnalyzer-considered-harmful-td23863822.html On upgrading, such subclasses would silently not be used anymore, since Lucene's indexing invokes reusableTokenStream. I think we should at least deprecate Analyzer.tokenStream today, so that users see deprecation warnings if their classes override this method. But going forward, when we want to change the API of core classes that are extended, I think we have to introduce entirely new classes to keep back compatibility.
[jira] Commented: (LUCENE-1678) Deprecate Analyzer.tokenStream
[ https://issues.apache.org/jira/browse/LUCENE-1678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12717862#action_12717862 ] Earwin Burrfoot commented on LUCENE-1678:
-

bq. If there are sane/smart ways to change our back compat policy, I think you have seen that no one would object.

It's not a matter of finding a smart way. It is a matter of a sacrifice that has to be made, and of readiness to take the blame for a decision that can be unpopular with someone. If you go zealously for back-compat, you sacrifice the readability/maintainability of your code, but free users from any troubles when they want to 'simply upgrade'. If you adopt a more relaxed policy, you sacrifice users' time, but in return you gain a cleaner codebase, and new stuff can be written and used faster. There's no way to ride two horses at once. Some people are comfortable with the current policies. A few cringe when they hear things like the above. Most theoretically want to relax the rules. Nobody's ready to give up something for it.

Okay, there's an escape hatch I (and someone else) mentioned on the list before: adopting a fixed release cycle with small intervals between releases (compared to what we have now). Fixed - as in, releases are made every N months, instead of when everyone feels they've finished and polished up all their pet projects and there's nothing else exciting to do. That way we can keep the current policy, but the deletion-through-deprecation approach will work, at last! This solution is half-assed - I can already see discussions like "That was a big change, let's keep the deprecations around longer, say, for a couple of releases.", and it doesn't solve the good-name-thrashing problem, as you have to go through two rounds of deprecation to change the semantics of something while keeping the name. But it is better than what we have now, a-a-and it is something that needs committer backing.

bq. Thats a great indication to me that the issue is not simple.

The issue is simple; the choice is not. And maintaining the status quo is free.

bq. Giving up is really not the answer though

It is the answer. I have no moral right to hammer my ideals into heads that did tremendously more for the project than I did. And maintaining a patch queue over Lucene trunk is not 'that' hard.

Deprecate Analyzer.tokenStream
--
Key: LUCENE-1678
URL: https://issues.apache.org/jira/browse/LUCENE-1678
[jira] Commented: (LUCENE-1453) When reopen returns a new IndexReader, both IndexReaders may now control the lifecycle of the underlying Directory which is managed by reference counting
[ https://issues.apache.org/jira/browse/LUCENE-1453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12717866#action_12717866 ] Earwin Burrfoot commented on LUCENE-1453:
-

Two suggestions:

Factor out a RefCount class and use it everywhere throughout Lucene. I see at least one identical to yours in SegmentReader. It would be easier to replace all these uses with AtomicInteger later.

Looking at the new unsightly loop in doClose(): what if we change all Lucene closeable classes to implement java.io.Closeable and create a static utility method(-s) that receives a bunch of Closeables (an array, an Iterable, a vararg in 1.5) and tries to close them all? The method should be null-safe (so you can skip != null checks) and will handle/rethrow exceptions. The most proper way to handle exceptions is probably this: rethrow the original exception if it is the only one (be it Runtime or IO); if there are more, gather all the exceptions and wrap them in a special IOException subclass that concatenates their messages and keeps them around, so they are inspectable at debug time, or so you can implement special treatment for that exception in your code. This method can be reused in a heap of places later; SR.doClose() comes first to mind. I can do the latter one in a separate patch, to close this issue faster.

When reopen returns a new IndexReader, both IndexReaders may now control the lifecycle of the underlying Directory which is managed by reference counting
-
Key: LUCENE-1453
URL: https://issues.apache.org/jira/browse/LUCENE-1453
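A sketch of the proposed utility (names are mine; the special exception subclass that keeps the original causes around is elided - this version only concatenates messages):

{code}
import java.io.Closeable;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public final class IOUtil {
  private IOUtil() {}

  // Closes every non-null Closeable; rethrows a sole failure as-is,
  // otherwise wraps all failures into one IOException.
  public static void closeAll(Closeable... closeables) throws IOException {
    List<Throwable> failures = new ArrayList<Throwable>();
    for (Closeable c : closeables) {
      if (c == null) continue;  // null-safe: callers skip != null checks
      try {
        c.close();
      } catch (Throwable t) {
        failures.add(t);        // keep closing the rest
      }
    }
    if (failures.isEmpty()) return;
    if (failures.size() == 1) {
      Throwable t = failures.get(0);
      if (t instanceof IOException) throw (IOException) t;
      if (t instanceof RuntimeException) throw (RuntimeException) t;
      throw new IOException(t.toString());
    }
    StringBuilder sb = new StringBuilder("multiple close failures:");
    for (Throwable t : failures) sb.append(' ').append(t);
    throw new IOException(sb.toString());
  }
}
{code}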
Re: Commented: (LUCENE-1678) Deprecate Analyzer.tokenStream
@Mark:

Okay, there's an escape hatch I (and someone else) mentioned on the list before. Adopting a fixed release cycle with small intervals between releases (compared to what we have now). Fixed - as in, releases are made every N months instead of when everyone feels they finished and polished up all their pet projects and there's nothing else exciting to do. That way we can keep the current policy, but the deletion-through-deprecation approach will work, at last!

That's a big change. I think it's a nice idea, but I don't know how practical it is. Most of us are basically volunteering time for this type of thing. Even still, with the pace of development lately (and you can be sure that the current pace is a *new* thing; Lucene did not always have this amount of activity), it might make sense.

You're missing the most important point. A fixed schedule means that the only reason not to do a release is the total absence of changes. No matter how many or how few changes are released each time, a fixed schedule gives you a predictable lifecycle for all your deprecation/back-compat needs.

But that idea needs a champion, and frankly I don't have the time right now (it wouldn't likely be in my realm anyway). And that's probably the deal with most others. They have work and/or other itches that are higher priority than championing a big change.

And here we get at one of the roots of the problem. The root that is going to stay.

bq. Giving up is really not the answer though

It is the answer. I have no moral right to hammer my ideals into heads that did tremendously more for the project than I did. And maintaining a patch queue over Lucene trunk is not 'that' hard.

It's not about hammering your ideals - that almost feels like what you are doing, but frankly, it doesn't help. If you even just keep prompting the issue as it dies away, you will likely keep progress going. There is a solution that everyone will accept. I promise you that. It's more work than it looks to find that solution and guide it to fruition though. It's fully possible, and I'm sure it will happen eventually. Would have bet even money that Mike had it a few weeks ago. No dice it looks though ;)

I consciously took a bit of an extremist stance in the hope of shifting the mean. Okay, I'll try ditching it in favour of gently bugging people, like Grant did in the comment that spawned this discussion. :)

@Yonik:

You go zealously for back-compat - you sacrifice readability/maintainability of your code but free users from any troubles when they want to 'simply upgrade'. You adopt a more relaxed policy - you sacrifice users' time, but in return you gain a cleaner codebase and new stuff can be written and used faster.

Not sure I agree with that - if changes become too easy you can get a thrashing effect... change just because someone thought it was a little better can lead to more chaos.

You're right. I'm not advocating anarchy. :) But currently we are afraid to break anything at all, and that is as far away from the juste milieu as the chaos you speak of. IMO, changes to interfaces should be clearly better than what existed before. Recent changes to DISI? Were they clearly for the better?
[jira] Commented: (LUCENE-1648) when you clone or reopen an IndexReader with pending changes, the new reader doesn't commit the changes
[ https://issues.apache.org/jira/browse/LUCENE-1648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12717089#action_12717089 ] Earwin Burrfoot commented on LUCENE-1648:
-

As LUCENE-1651 is now committed, this issue can be resolved.

when you clone or reopen an IndexReader with pending changes, the new reader doesn't commit the changes
---
Key: LUCENE-1648
URL: https://issues.apache.org/jira/browse/LUCENE-1648
Project: Lucene - Java
Issue Type: Bug
Components: Index
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
Fix For: 2.9
Attachments: LUCENE-1648-followup.patch, LUCENE-1648-followup.patch, LUCENE-1648.patch

While working on LUCENE-1647, I came across this issue... we are failing to carry over hasChanges, norms/deletionsDirty, etc., when cloning the new reader.
[jira] Commented: (LUCENE-1453) When reopen returns a new IndexReader, both IndexReaders may now control the lifecycle of the underlying Directory which is managed by reference counting
[ https://issues.apache.org/jira/browse/LUCENE-1453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12717107#action_12717107 ] Earwin Burrfoot commented on LUCENE-1453:
-

bq. There are two possibilities to fix this:

I vote for leaving them open. Yes, it breaches the contract, but the breach is controlled (and thus harmless), and we get rid of some weird code (= a possible point of failure) without introducing new code. There is a way to notice the change in DirectoryReader behaviour, but it is too unrealistic:

{code}
IndexReader r = IndexReader.open("/path/to/index");
...
Directory d = r.directory(); // you have to get the directory reference, as you're not the one who created it
...
r.close();
...
d.doSomething(); // and EXPECT this call to fail with an exception
{code}

When reopen returns a new IndexReader, both IndexReaders may now control the lifecycle of the underlying Directory which is managed by reference counting
-
Key: LUCENE-1453
URL: https://issues.apache.org/jira/browse/LUCENE-1453
[jira] Commented: (SOLR-706) Fast auto-complete suggestions
[ https://issues.apache.org/jira/browse/SOLR-706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12717108#action_12717108 ] Earwin Burrfoot commented on SOLR-706:
--

When I did autocompletion for my project, a simple java.util.TreeMap had superior memory characteristics and almost the same performance as tries. I think it's not worth inventing something elaborate for this task.

Fast auto-complete suggestions
--
Key: SOLR-706
URL: https://issues.apache.org/jira/browse/SOLR-706
Project: Solr
Issue Type: New Feature
Components: search
Reporter: Shalin Shekhar Mangar
Assignee: Shalin Shekhar Mangar
Fix For: 1.5

A lot of users have suggested that facet.prefix in Solr is not the most efficient way to implement an auto-complete suggestion feature. A fast in-memory trie-like structure has often been suggested instead. This issue aims to incorporate a faster/more efficient way to answer auto-complete queries in Solr. Refer to the following discussion on solr-dev: http://markmail.org/message/sjjojrnroo3msugj
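The TreeMap approach is tiny, which is much of its appeal. An illustrative sketch: tailMap(prefix) starts the walk at the first key >= prefix, and the loop stops as soon as a key leaves the prefix range.

{code}
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public final class TreeMapSuggester {
  private final TreeMap<String, Integer> terms = new TreeMap<String, Integer>();

  public void add(String term, int weight) {
    terms.put(term, weight);
  }

  public List<String> suggest(String prefix, int max) {
    List<String> out = new ArrayList<String>();
    for (Map.Entry<String, Integer> e : terms.tailMap(prefix).entrySet()) {
      if (!e.getKey().startsWith(prefix)) break; // left the prefix range
      out.add(e.getKey());
      if (out.size() >= max) break;
    }
    return out;
  }
}
{code}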
[jira] Commented: (SOLR-236) Field collapsing
[ https://issues.apache.org/jira/browse/SOLR-236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12717110#action_12717110 ] Earwin Burrfoot commented on SOLR-236:
--

I have implemented collapsing on a high-volume project of mine in a much less flexible, but more practical manner.

Part I. You have to guarantee that all documents having the same value of the collapse-field are dropped into the Lucene index as a sequential batch. That guarantees they get sequential docIds, and, with some more work, that they all end up in the same segment.

Part II. When doing collection you always get docIds in sequential order, and thus, thanks to Part I, you get the docs-to-be-collapsed already grouped by collapse-field, even before you drop the docs into the PriorityQueue to sort them.

Cons: You can only collapse on a single field predetermined at index creation time. If one document changes, you have to reindex all docs that have the same collapse-field value, so it's best if you have either low update/add rates, or few documents sharing the same collapse-field value.

Pros: The CPU and memory costs for collapsing, compared to a usual search, are very close to zero and do not depend on index size or total docs found. The same idea works with the new Lucene per-segment collection and in distributed mode (sharded index). Within a collapsed group you can sort hits however you want, and select the one that will represent the group for the usual sort/paging. The implementation is not brain-dead simple, but it nears it.

Field collapsing
Key: SOLR-236
URL: https://issues.apache.org/jira/browse/SOLR-236
Project: Solr
Issue Type: New Feature
Components: search
Affects Versions: 1.3
Reporter: Emmanuel Keller
Fix For: 1.5
Attachments: collapsing-patch-to-1.3.0-dieter.patch, collapsing-patch-to-1.3.0-ivan.patch, collapsing-patch-to-1.3.0-ivan_2.patch, collapsing-patch-to-1.3.0-ivan_3.patch, field-collapse-solr-236-2.patch, field-collapse-solr-236.patch, field-collapsing-extended-592129.patch, field_collapsing_1.1.0.patch, field_collapsing_1.3.patch, field_collapsing_dsteigerwald.diff, field_collapsing_dsteigerwald.diff, field_collapsing_dsteigerwald.diff, SOLR-236-FieldCollapsing.patch, SOLR-236-FieldCollapsing.patch, SOLR-236-FieldCollapsing.patch, solr-236.patch, SOLR-236_collapsing.patch, SOLR-236_collapsing.patch

This patch includes a new feature called "field collapsing": collapsing a group of results with a similar value for a given field to a single entry in the result set. Site collapsing is a special case of this, where all results for a given web site are collapsed into one or two entries in the result set, typically with an associated "more documents from this site" link. See also "duplicate detection": http://www.fastsearch.com/glossary.aspx?m=48amid=299

The implementation adds 3 new query parameters (SolrParams):
* collapse.field to choose the field used to group results
* collapse.type normal (default value) or adjacent
* collapse.max to select how many continuous results are allowed before collapsing

TODO (in progress): more documentation (on source code), test cases.

Two patches: field_collapsing.patch for the current development version, field_collapsing_1.1.0.patch for Solr 1.1.0. P.S.: Feedback and misspelling corrections are welcome ;-)
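A hedged sketch of Part II in 2.9 Collector terms (names and details are mine, not from any attached patch): because equal keys arrive on consecutive docIds, detecting a group border is a single comparison against the previous key.

{code}
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.FieldCache;
import org.apache.lucene.search.Scorer;

// Keeps the best-scoring doc of each run of equal collapse keys. Assumes
// same-key docs were indexed as one sequential batch within a segment.
public final class AdjacentCollapsingCollector extends Collector {
  private final String keyField;
  private String[] keys;   // per-segment FieldCache values
  private int docBase;
  private Scorer scorer;

  private String currentKey;
  private int bestDoc = -1;
  private float bestScore;

  public AdjacentCollapsingCollector(String keyField) {
    this.keyField = keyField;
  }

  @Override public void setScorer(Scorer scorer) { this.scorer = scorer; }

  @Override public void setNextReader(IndexReader reader, int docBase) throws IOException {
    this.keys = FieldCache.DEFAULT.getStrings(reader, keyField);
    this.docBase = docBase;
  }

  @Override public void collect(int doc) throws IOException {
    String key = keys[doc];
    if (currentKey != null && !currentKey.equals(key)) flushGroup();
    float score = scorer.score();
    if (bestDoc == -1 || score > bestScore) { bestDoc = docBase + doc; bestScore = score; }
    currentKey = key;
  }

  // Hand (bestDoc, bestScore) to the surrounding PriorityQueue here;
  // the caller must also flush once after the search finishes.
  private void flushGroup() {
    bestDoc = -1;
  }

  @Override public boolean acceptsDocsOutOfOrder() { return false; } // order is essential
}
{code}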
Re: IR static methods
Index/Commit/SegmentMetadata? Several classes, as you can reflect on various levels of the index. Slightly off-topic - SegmentInfo/SegmentsInfo should really be named Segment/Segments. That's exactly what these objects represent. You don't use names like PreparedStatementInfo or FileInfo or IntegerInfo :) On Fri, Jun 5, 2009 at 02:21, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: We have: $ ff \*Info\*java ./src/java/org/apache/lucene/index/FieldInfo.java ./src/java/org/apache/lucene/index/TermVectorOffsetInfo.java ./src/java/org/apache/lucene/index/SegmentInfo.java ./src/java/org/apache/lucene/index/TermInfosWriter.java ./src/java/org/apache/lucene/index/TermInfo.java ./src/java/org/apache/lucene/index/FieldInfos.java ./src/java/org/apache/lucene/index/SegmentMergeInfo.java ./src/java/org/apache/lucene/index/TermInfosReader.java ./src/java/org/apache/lucene/index/SegmentInfos.java How about IndexInfo? Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Earwin Burrfoot ear...@gmail.com To: java-dev@lucene.apache.org Sent: Wednesday, June 3, 2009 8:08:50 AM Subject: IR static methods I have a strong desire to remove all these static methods from IR - lastModified, getCurrentVersion, getCommitUserData, indexExists. But I haven't found a good place for them yet. Directory is a bad place - it shouldn't concern itself with details of what exactly is stored inside, it should think of 'how' it is stored. IndexReader is bad - it is too heavyweight to be created just for getting something simple once. We should probably create some new lightweight class that provides a kind of reflection for the index? Mod dates, versions, userdata, existence, sizes, deletions, whatever. Both per-index and per-segment. Essentially it is a wrapper over SegmentInfos that allows us to keep them hidden (and thus easily changeable), and provides users with a more concise and adequate interface. Any thoughts? -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785 - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785 - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1651) Make IndexReader.open() always return MSR to simplify (re-)opens.
[ https://issues.apache.org/jira/browse/LUCENE-1651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12715836#action_12715836 ] Earwin Burrfoot commented on LUCENE-1651: - Seems the patch didn't apply completely. Your line numbers are off, also directory/readOnly are now members of SegmentReader, no way they can't be seen:
{code}
class SegmentReader extends IndexReader implements Cloneable {
  protected Directory directory;
  protected boolean readOnly;

  private String segment;
  private SegmentInfo si;
  private int readBufferSize;
{code}
Here's corresponding part of the patch, I bet $Id$ is the reason.
{code}
-/**
- * @version $Id$
- */
-class SegmentReader extends DirectoryIndexReader {
+/** @version $Id$ */
+class SegmentReader extends IndexReader implements Cloneable {
+  protected Directory directory;
+  protected boolean readOnly;
+
{code}
Make IndexReader.open() always return MSR to simplify (re-)opens. - Key: LUCENE-1651 URL: https://issues.apache.org/jira/browse/LUCENE-1651 Project: Lucene - Java Issue Type: Task Affects Versions: 2.9 Reporter: Earwin Burrfoot Assignee: Michael McCandless Fix For: 2.9 Attachments: LUCENE-1651-tag.patch, LUCENE-1651.patch, LUCENE-1651.patch As per discussion in mailing list, I'm making DirectoryIndexReader.open() always return MSR, even for single-segment indexes. While theoretically valid in the past (if you make sure to keep your index constantly optimized) this feature is made practically obsolete by per-segment collection. The patch somewhat de-hairies (re-)open logic for MSR/SR. SR no longer needs an ability to pose as toplevel directory-owning IR. All related logic is moved from DIR to MSR. DIR becomes almost empty, and copying two or three remaining fields over to MSR/SR, I remove it. Lots of tests fail, as they rely on SR returned from IR.open(), I fix by introducing SR.getOnlySegmentReader static package-private method. Some previous bugs are uncovered, one is fixed in LUCENE-1645, another (partially fixed in LUCENE-1648) is fixed in this patch. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1651) Make IndexReader.open() always return MSR to simplify (re-)opens.
[ https://issues.apache.org/jira/browse/LUCENE-1651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12715908#action_12715908 ] Earwin Burrfoot commented on LUCENE-1651: - bq. Patch looks good Earwin, thanks! I believe the readers can be cleaned up further, but I'm short on time and don't want to delay it for another week or two, and then rebase it against updated trunk once again. Might as well do that under a separate issue. bq. I think we should now rename MultiSegmentReader to DirectoryIndexReader? Maybe DirectoryReader instead of DirectoryIndexReader? But all three are in fact okay with me, I really don't have any preference here. Make IndexReader.open() always return MSR to simplify (re-)opens. - Key: LUCENE-1651 URL: https://issues.apache.org/jira/browse/LUCENE-1651 Project: Lucene - Java Issue Type: Task Affects Versions: 2.9 Reporter: Earwin Burrfoot Assignee: Michael McCandless Fix For: 2.9 Attachments: LUCENE-1651-tag.patch, LUCENE-1651.patch, LUCENE-1651.patch As per discussion in mailing list, I'm making DirectoryIndexReader.open() always return MSR, even for single-segment indexes. While theoretically valid in the past (if you make sure to keep your index constantly optimized) this feature is made practically obsolete by per-segment collection. The patch somewhat de-hairies (re-)open logic for MSR/SR. SR no longer needs an ability to pose as toplevel directory-owning IR. All related logic is moved from DIR to MSR. DIR becomes almost empty, and copying two or three remaining fields over to MSR/SR, I remove it. Lots of tests fail, as they rely on SR returned from IR.open(), I fix by introducing SR.getOnlySegmentReader static package-private method. Some previous bugs are uncovered, one is fixed in LUCENE-1645, another (partially fixed in LUCENE-1648) is fixed in this patch. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
IR static methods
I have a strong desire to remove all these static methods from IR - lastModified, getCurrentVersion, getCommitUserData, indexExists. But I haven't found a good place for them yet. Directory is a bad place - it shouldn't concern itself with details of what exactly is stored inside, it should think of 'how' it is stored. IndexReader is bad - it is too heavyweight to be created just for getting something simple once. We should probably create some new lightweight class that provides a kind of reflection for the index? Mod dates, versions, userdata, existence, sizes, deletions, whatever. Both per-index and per-segment. Essentially it is a wrapper over SegmentInfos that allows us to keep them hidden (and thus easily changeable), and provides users with a more concise and adequate interface. Any thoughts? -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785 - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
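For the sake of discussion, such an "index reflection" class could look something like the interface below. It is purely hypothetical - nothing like it exists in Lucene - and would be backed by SegmentInfos without exposing them:
{code}
import java.util.List;
import java.util.Map;

public interface IndexMetadata {
    long lastModified();                  // would replace IndexReader.lastModified(dir)
    long version();                       // would replace IndexReader.getCurrentVersion(dir)
    Map<String, String> commitUserData(); // would replace IndexReader.getCommitUserData(dir)
    int numDocs();
    int numDeletedDocs();
    List<SegmentMetadata> segments();     // the per-segment view

    interface SegmentMetadata {
        String name();
        int docCount();
        int deletedDocCount();
        long sizeInBytes();
    }
}
{code}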
[jira] Commented: (LUCENE-1672) Deprecate all String/File ctors/opens in IndexReader/IndexWriter/IndexSearcher
[ https://issues.apache.org/jira/browse/LUCENE-1672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12715944#action_12715944 ] Earwin Burrfoot commented on LUCENE-1672: - bq. I will later try to solve this problem with the closeDir inside the different IndexReaders (but maybe Earwin has done it already in LUCENE-1651) My issue removes closeDir from SegmentReader, as it cannot 'own' a directory anymore. MSR-to-be-DirectoryReader still has this flag. Deprecate all String/File ctors/opens in IndexReader/IndexWriter/IndexSearcher -- Key: LUCENE-1672 URL: https://issues.apache.org/jira/browse/LUCENE-1672 Project: Lucene - Java Issue Type: Task Affects Versions: 2.9 Reporter: Uwe Schindler Fix For: 2.9 Attachments: LUCENE-1672.patch, LUCENE-1672.patch During investigation of LUCENE-1658, I found out that even LUCENE-1453 is not completely fixed. As 1658 deprecates all FSDirectory.getDirectory() static factories, we should not use them anymore. As the user is now free to choose the correct directory implementation using direct instantiation or using FSDir.open(), he should no longer use all the ctors/methods in IndexWriter/IndexReader/IndexSearcher & Co. that simply take path names as String or File and always instantiate the Directory themselves. LUCENE-1453 currently works for the cached directory implementations from FSDir.getDirectory, but not with uncached, non-refcounting FSDirs. Sometimes reopen() closes the directory (as far as I see, when a SegmentReader changes to a MultiSegmentReader and/or deletes apply). This is hard to track. In Lucene 3.0 we can then remove the whole bunch of closeDirectory parameters/fields in these classes and simply not care anymore about closing directories. To remove this closeDirectory parameter now (before 3.0) and also fix 1453 correctly, an additional idea would be to change these factories that take the File/String to return the IndexReader wrapped by a FilterIndexReader that keeps track of closing the underlying directory after close and reopen. This is simpler than passing this boolean between different DirectoryIndexReader instances. The small performance impact of wrapping with FilterIndexReader should not be so bad, because the method is deprecated and we can state that it is better to use the factory method with a Directory parameter. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1672) Deprecate all String/File ctors/opens in IndexReader/IndexWriter/IndexSearcher
[ https://issues.apache.org/jira/browse/LUCENE-1672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12715962#action_12715962 ] Earwin Burrfoot commented on LUCENE-1672: - bq. And DirectoryIR/MSR still have this Flag, but reopening a MSR always returns a MSR again (even if it only consists of one segment)? Exactly. Deprecate all String/File ctors/opens in IndexReader/IndexWriter/IndexSearcher -- Key: LUCENE-1672 URL: https://issues.apache.org/jira/browse/LUCENE-1672 Project: Lucene - Java Issue Type: Task Affects Versions: 2.9 Reporter: Uwe Schindler Fix For: 2.9 Attachments: LUCENE-1672.patch, LUCENE-1672.patch During investigation of LUCENE-1658, I found out that even LUCENE-1453 is not completely fixed. As 1658 deprecates all FSDirectory.getDirectory() static factories, we should not use them anymore. As the user is now free to choose the correct directory implementation using direct instantiation or using FSDir.open(), he should no longer use all the ctors/methods in IndexWriter/IndexReader/IndexSearcher & Co. that simply take path names as String or File and always instantiate the Directory themselves. LUCENE-1453 currently works for the cached directory implementations from FSDir.getDirectory, but not with uncached, non-refcounting FSDirs. Sometimes reopen() closes the directory (as far as I see, when a SegmentReader changes to a MultiSegmentReader and/or deletes apply). This is hard to track. In Lucene 3.0 we can then remove the whole bunch of closeDirectory parameters/fields in these classes and simply not care anymore about closing directories. To remove this closeDirectory parameter now (before 3.0) and also fix 1453 correctly, an additional idea would be to change these factories that take the File/String to return the IndexReader wrapped by a FilterIndexReader that keeps track of closing the underlying directory after close and reopen. This is simpler than passing this boolean between different DirectoryIndexReader instances. The small performance impact of wrapping with FilterIndexReader should not be so bad, because the method is deprecated and we can state that it is better to use the factory method with a Directory parameter. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1651) Make IndexReader.open() always return MSR to simplify (re-)opens.
[ https://issues.apache.org/jira/browse/LUCENE-1651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Earwin Burrfoot updated LUCENE-1651: Attachment: LUCENE-1651-tag.patch LUCENE-1651.patch Argh! The rename broke test-tag again :) in new and innovative ways. New patches attached. Make IndexReader.open() always return MSR to simplify (re-)opens. - Key: LUCENE-1651 URL: https://issues.apache.org/jira/browse/LUCENE-1651 Project: Lucene - Java Issue Type: Task Affects Versions: 2.9 Reporter: Earwin Burrfoot Assignee: Michael McCandless Fix For: 2.9 Attachments: LUCENE-1651-tag.patch, LUCENE-1651-tag.patch, LUCENE-1651.patch, LUCENE-1651.patch, LUCENE-1651.patch As per discussion in mailing list, I'm making DirectoryIndexReader.open() always return MSR, even for single-segment indexes. While theoretically valid in the past (if you make sure to keep your index constantly optimized) this feature is made practically obsolete by per-segment collection. The patch somewhat de-hairies (re-)open logic for MSR/SR. SR no longer needs an ability to pose as toplevel directory-owning IR. All related logic is moved from DIR to MSR. DIR becomes almost empty, and copying two or three remaining fields over to MSR/SR, I remove it. Lots of tests fail, as they rely on SR returned from IR.open(), I fix by introducing SR.getOnlySegmentReader static package-private method. Some previous bugs are uncovered, one is fixed in LUCENE-1645, another (partially fixed in LUCENE-1648) is fixed in this patch. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1651) Make IndexReader.open() always return MSR to simplify (re-)opens.
[ https://issues.apache.org/jira/browse/LUCENE-1651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Earwin Burrfoot updated LUCENE-1651: Attachment: LUCENE-1651.patch One more version, applies against current trunk without fuzzy hunk matching. Make IndexReader.open() always return MSR to simplify (re-)opens. - Key: LUCENE-1651 URL: https://issues.apache.org/jira/browse/LUCENE-1651 Project: Lucene - Java Issue Type: Task Affects Versions: 2.9 Reporter: Earwin Burrfoot Assignee: Michael McCandless Fix For: 2.9 Attachments: LUCENE-1651-tag.patch, LUCENE-1651-tag.patch, LUCENE-1651.patch, LUCENE-1651.patch, LUCENE-1651.patch, LUCENE-1651.patch As per discussion in mailing list, I'm making DirectoryIndexReader.open() always return MSR, even for single-segment indexes. While theoretically valid in the past (if you make sure to keep your index constantly optimized) this feature is made practically obsolete by per-segment collection. The patch somewhat de-hairies (re-)open logic for MSR/SR. SR no longer needs an ability to pose as toplevel directory-owning IR. All related logic is moved from DIR to MSR. DIR becomes almost empty, and copying two or three remaining fields over to MSR/SR, I remove it. Lots of tests fail, as they rely on SR returned from IR.open(), I fix by introducing SR.getOnlySegmentReader static package-private method. Some previous bugs are uncovered, one is fixed in LUCENE-1645, another (partially fixed in LUCENE-1648) is fixed in this patch. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Enhance StandardTokenizer to support words which will not be tokenized
Not sure you can easily marry a generated JFlex grammar and a runtime-provided list of protected words. I took the approach of creating tokens for punctuation inside my tokenizer and later gluing them to nearby text tokens, or dropping them from the stream with a tokenfilter. On Wed, Jun 3, 2009 at 20:10, Grant Ingersoll gsing...@apache.org wrote: You'd have to modify the JFlex grammar. I'd suggest adding in a generic protected words approach whereby you can pass in a list of protected words. This would be a nice patch/improvement. -Grant On Jun 3, 2009, at 4:07 AM, ami dudu wrote: Hi, I'm using a StandardTokenizer which does a great job for me, but I need to enhance it somehow to treat words like c++, c#, .net as-is and not tokenize them into c or net. I know that there are other tokenizers such as KeywordTokenizer and WhitespaceTokenizer but they do not include the StandardTokenizer logic. Any ideas on what is the best way to add this enhancement? Thanks, Amid -- View this message in context: http://www.nabble.com/Enhance-StandardTokenizer-to-support-words-which-will-not-be-tokenized-tp23849495p23849495.html Sent from the Lucene - Java Developer mailing list archive at Nabble.com. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org -- Grant Ingersoll http://www.lucidimagination.com/ Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene: http://www.lucidimagination.com/search - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785 - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
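The gluing approach is mechanical enough to sketch without tying it to any particular TokenStream API version, so the example below works over a plain token list instead. It only handles trailing punctuation (c++, c#); a leading-punctuation rule (.net) would be symmetric. All names are illustrative, not from an actual filter.
{code}
import java.util.ArrayList;
import java.util.List;

public class PunctuationGluer {
    public static class Token {
        public final String text;
        public final int start, end; // character offsets in the input
        public Token(String text, int start, int end) {
            this.text = text; this.start = start; this.end = end;
        }
        boolean isPunctuation() {
            return text.length() == 1 && !Character.isLetterOrDigit(text.charAt(0));
        }
    }

    // The tokenizer is assumed to emit punctuation as separate one-char
    // tokens with exact offsets, so adjacency can be checked precisely.
    public List<Token> glue(List<Token> in) {
        List<Token> out = new ArrayList<Token>();
        for (Token t : in) {
            Token prev = out.isEmpty() ? null : out.get(out.size() - 1);
            if (t.isPunctuation() && prev != null && prev.end == t.start) {
                // punctuation directly follows the previous token: "c" + "+" + "+"
                out.set(out.size() - 1, new Token(prev.text + t.text, prev.start, t.end));
            } else if (!t.isPunctuation()) {
                out.add(t); // ordinary word token
            }
            // lone punctuation with no adjacent word is dropped from the stream
        }
        return out;
    }
}
{code}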
[jira] Commented: (LUCENE-1630) Mating Collector and Scorer on doc Id orderness
[ https://issues.apache.org/jira/browse/LUCENE-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12715973#action_12715973 ] Earwin Burrfoot commented on LUCENE-1630: - Searcher is supposed to be a little cherry of user-friendliness atop a glass of Lucene murky internals, ain't it? I mean, even you had to have the ways of Query, Weight and Scorer explained to you - what would a Lucene neophyte do if we removed his beloved convenience methods? Mating Collector and Scorer on doc Id orderness --- Key: LUCENE-1630 URL: https://issues.apache.org/jira/browse/LUCENE-1630 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Shai Erera Fix For: 2.9 This is a spin-off of LUCENE-1593. This issue proposes to expose an appropriate API on Scorer and Collector such that one can create an optimized Collector based on a given Scorer's doc-id orderness and vice versa. Copied from LUCENE-1593, here is the list of changes: # Deprecate Weight and create QueryWeight (abstract class) with a new scorer(reader, scoreDocsInOrder), replacing the current scorer(reader) method. QueryWeight implements Weight, while scorer(reader) calls scorer(reader, false /* out-of-order */) and scorer(reader, scoreDocsInOrder) is defined abstract. #* Also add QueryWeightWrapper to wrap a given Weight implementation. This one will also be deprecated, as well as package-private. #* Add to Query variants of createWeight and weight which return QueryWeight. For now, I prefer to add a default impl which wraps the Weight variant instead of overriding in all Query extensions, and in 3.0 when we remove the Weight variants - override in all extending classes. # Add to Scorer isOutOfOrder with a default of false, and override in BS to true. # Modify BooleanWeight to extend QueryWeight and implement the new scorer method to return BS2 or BS based on the number of required scorers and setAllowOutOfOrder. # Add to Collector an abstract _acceptsDocsOutOfOrder_ which returns true/false. #* Use it in the IndexSearcher.search methods that accept a Collector, in order to create the appropriate Scorer, using the new QueryWeight. #* Provide a static create method to TFC and TSDC which accepts this as an argument and creates the proper instance. #* Wherever we create a Collector (TSDC or TFC), always ask for an out-of-order Scorer and check the resulting Scorer's isOutOfOrder(), so that we can create the optimized Collector instance. # Modify IndexSearcher to use all of the above logic. The only class I'm worried about, and would like to verify with you, is Searchable. If we want to deprecate all the search methods on IndexSearcher, Searcher and Searchable which accept Weight and add new ones which accept QueryWeight, we must do the following: * Deprecate Searchable in favor of Searcher. * Add to Searcher the new QueryWeight variants. Here we have two choices: (1) break back-compat and add them as abstract (like we've done with the new Collector method) or (2) add them with a default impl to call the Weight versions, documenting these will become abstract in 3.0. * Have Searcher extend UnicastRemoteObject and have RemoteSearchable extend Searcher. That's the part I'm a little bit worried about - Searchable implements java.rmi.Remote, which means there could be an implementation out there which implements Searchable and extends something different than UnicastRemoteObject, like Activeable. I think there is very small chance this has actually happened, but would like to confirm with you guys first. 
* Add a deprecated, package-private, SearchableWrapper which extends Searcher and delegates all calls to the Searchable member. * Deprecate all uses of Searchable and add Searcher instead, defaulting the old ones to use SearchableWrapper. * Make all the necessary changes to IndexSearcher, MultiSearcher etc. regarding overriding these new methods. One other optimization that was discussed in LUCENE-1593 is to expose a topScorer() API (on Weight) which returns a Scorer that its score(Collector) will be called, and additionally add a start() method to DISI. That will allow Scorers to initialize either on start() or score(Collector). This was proposed mainly because of BS and BS2 which check if they are initialized in every call to next(), skipTo() and score(). Personally I prefer to see that in a separate issue, following that one (as it might add methods to QueryWeight). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e
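The convenience being defended is roughly the difference below, sketched from memory against the 2.9-era API that this issue shapes - treat the exact signatures as approximate rather than authoritative:
{code}
// what a newcomer writes:
TopDocs hits = searcher.search(query, 10);

// roughly what removing the convenience layer would force on them:
Weight weight = query.weight(searcher);
TopScoreDocCollector collector =
    TopScoreDocCollector.create(10, !weight.scoresDocsOutOfOrder());
searcher.search(weight, null, collector);
TopDocs topDocs = collector.topDocs();
{code}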
[jira] Created: (LUCENE-1677) Remove GCJ IndexReader specializations
Remove GCJ IndexReader specializations -- Key: LUCENE-1677 URL: https://issues.apache.org/jira/browse/LUCENE-1677 Project: Lucene - Java Issue Type: Task Reporter: Earwin Burrfoot Fix For: 2.9 These specializations are outdated, unsupported, most probably pointless due to the speed of modern JVMs and, I bet, nobody uses them (Mike, you said you are going to ask people on java-user, anybody replied that they need it?). While giving nothing, they make SegmentReader instantiation code look real ugly. If nobody objects, I'm going to post a patch that removes these from Lucene. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1630) Mating Collector and Scorer on doc Id orderness
[ https://issues.apache.org/jira/browse/LUCENE-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12715509#action_12715509 ] Earwin Burrfoot commented on LUCENE-1630: - You can't, because Weights produced from the same Query are different for different indexes. You can probably modify the Query in place for a given index, produce some scorers, do scoring, then modify the Query for another index, produce scorers, etc. But now your Query is no longer thread-safe, and I can't reuse it from different threads. So, with all its strange looks, the trio of Q, W, S is still the best approach if you ask me. Mating Collector and Scorer on doc Id orderness --- Key: LUCENE-1630 URL: https://issues.apache.org/jira/browse/LUCENE-1630 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Shai Erera Fix For: 2.9 This is a spin-off of LUCENE-1593. This issue proposes to expose an appropriate API on Scorer and Collector such that one can create an optimized Collector based on a given Scorer's doc-id orderness and vice versa. Copied from LUCENE-1593, here is the list of changes: # Deprecate Weight and create QueryWeight (abstract class) with a new scorer(reader, scoreDocsInOrder), replacing the current scorer(reader) method. QueryWeight implements Weight, while scorer(reader) calls scorer(reader, false /* out-of-order */) and scorer(reader, scoreDocsInOrder) is defined abstract. #* Also add QueryWeightWrapper to wrap a given Weight implementation. This one will also be deprecated, as well as package-private. #* Add to Query variants of createWeight and weight which return QueryWeight. For now, I prefer to add a default impl which wraps the Weight variant instead of overriding in all Query extensions, and in 3.0 when we remove the Weight variants - override in all extending classes. # Add to Scorer isOutOfOrder with a default of false, and override in BS to true. # Modify BooleanWeight to extend QueryWeight and implement the new scorer method to return BS2 or BS based on the number of required scorers and setAllowOutOfOrder. # Add to Collector an abstract _acceptsDocsOutOfOrder_ which returns true/false. #* Use it in the IndexSearcher.search methods that accept a Collector, in order to create the appropriate Scorer, using the new QueryWeight. #* Provide a static create method to TFC and TSDC which accepts this as an argument and creates the proper instance. #* Wherever we create a Collector (TSDC or TFC), always ask for an out-of-order Scorer and check the resulting Scorer's isOutOfOrder(), so that we can create the optimized Collector instance. # Modify IndexSearcher to use all of the above logic. The only class I'm worried about, and would like to verify with you, is Searchable. If we want to deprecate all the search methods on IndexSearcher, Searcher and Searchable which accept Weight and add new ones which accept QueryWeight, we must do the following: * Deprecate Searchable in favor of Searcher. * Add to Searcher the new QueryWeight variants. Here we have two choices: (1) break back-compat and add them as abstract (like we've done with the new Collector method) or (2) add them with a default impl to call the Weight versions, documenting these will become abstract in 3.0. * Have Searcher extend UnicastRemoteObject and have RemoteSearchable extend Searcher. 
That's the part I'm a little bit worried about - Searchable implements java.rmi.Remote, which means there could be an implementation out there which implements Searchable and extends something different than UnicastRemoteObject, like Activeable. I think there is very small chance this has actually happened, but would like to confirm with you guys first. * Add a deprecated, package-private, SearchableWrapper which extends Searcher and delegates all calls to the Searchable member. * Deprecate all uses of Searchable and add Searcher instead, defaulting the old ones to use SearchableWrapper. * Make all the necessary changes to IndexSearcher, MultiSearcher etc. regarding overriding these new methods. One other optimization that was discussed in LUCENE-1593 is to expose a topScorer() API (on Weight) which returns a Scorer that its score(Collector) will be called, and additionally add a start() method to DISI. That will allow Scorers to initialize either on start() or score(Collector). This was proposed mainly because of BS and BS2 which check if they are initialized in every call to next(), skipTo() and score(). Personally I prefer to see that in a separate issue, following that one (as it might add methods to QueryWeight). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online
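A skeletal illustration of that layering - immutable Query, per-index Weight, per-reader Scorer - using made-up class names; this is just the thread-safety argument in miniature, not the real Lucene code:
{code}
// Query: immutable, so one instance is safely shared by any number of threads.
public class TermQuerySketch {
    private final String field, term;

    public TermQuerySketch(String field, String term) {
        this.field = field; this.term = term;
    }

    // All per-index state (here, idf) is captured in the Weight, so two
    // threads can weigh the same Query against different indexes at once.
    public WeightSketch createWeight(IndexStats index) {
        double idf = Math.log(1.0 + index.numDocs() / (double) (index.docFreq(field, term) + 1));
        return new WeightSketch((float) idf);
    }

    public static class WeightSketch {
        private final float idf;
        WeightSketch(float idf) { this.idf = idf; }
        // A real Weight hands out one Scorer per reader; the Scorer is the
        // only mutable, single-threaded piece of the trio.
        public float score(int termFreq) { return termFreq * idf * idf; }
    }

    public interface IndexStats {
        int numDocs();
        int docFreq(String field, String term);
    }
}
{code}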
[jira] Commented: (LUCENE-1651) Make IndexReader.open() always return MSR to simplify (re-)opens.
[ https://issues.apache.org/jira/browse/LUCENE-1651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12715672#action_12715672 ] Earwin Burrfoot commented on LUCENE-1651: - Hm.. okay, I've got back to work on this patch. To fix tests relying on getting SR from IR.open() on trunk I introduced a package-private utility method that extracts SR from MSR if it is the only one there. The tests in tags/XXX don't see this method, should I backport it somewhere there? Make IndexReader.open() always return MSR to simplify (re-)opens. - Key: LUCENE-1651 URL: https://issues.apache.org/jira/browse/LUCENE-1651 Project: Lucene - Java Issue Type: Task Affects Versions: 2.9 Reporter: Earwin Burrfoot Assignee: Michael McCandless Fix For: 2.9 Attachments: LUCENE-1651.patch As per discussion in mailing list, I'm making DirectoryIndexReader.open() always return MSR, even for single-segment indexes. While theoretically valid in the past (if you make sure to keep your index constantly optimized) this feature is made practically obsolete by per-segment collection. The patch somewhat de-hairies (re-)open logic for MSR/SR. SR no longer needs an ability to pose as toplevel directory-owning IR. All related logic is moved from DIR to MSR. DIR becomes almost empty, and copying two or three remaining fields over to MSR/SR, I remove it. Lots of tests fail, as they rely on SR returned from IR.open(), I fix by introducing SR.getOnlySegmentReader static package-private method. Some previous bugs are uncovered, one is fixed in LUCENE-1645, another (partially fixed in LUCENE-1648) is fixed in this patch. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1651) Make IndexReader.open() always return MSR to simplify (re-)opens.
[ https://issues.apache.org/jira/browse/LUCENE-1651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Earwin Burrfoot updated LUCENE-1651: Attachment: LUCENE-1651-tag.patch LUCENE-1651.patch Here are the patches for current lucene trunk and back compat tag. Make IndexReader.open() always return MSR to simplify (re-)opens. - Key: LUCENE-1651 URL: https://issues.apache.org/jira/browse/LUCENE-1651 Project: Lucene - Java Issue Type: Task Affects Versions: 2.9 Reporter: Earwin Burrfoot Assignee: Michael McCandless Fix For: 2.9 Attachments: LUCENE-1651-tag.patch, LUCENE-1651.patch, LUCENE-1651.patch As per discussion in mailing list, I'm making DirectoryIndexReader.open() always return MSR, even for single-segment indexes. While theoretically valid in the past (if you make sure to keep your index constantly optimized) this feature is made practically obsolete by per-segment collection. The patch somewhat de-hairies (re-)open logic for MSR/SR. SR no longer needs an ability to pose as toplevel directory-owning IR. All related logic is moved from DIR to MSR. DIR becomes almost empty, and copying two or three remaining fields over to MSR/SR, I remove it. Lots of tests fail, as they rely on SR returned from IR.open(), I fix by introducing SR.getOnlySegmentReader static package-private method. Some previous bugs are uncovered, one is fixed in LUCENE-1645, another (partially fixed in LUCENE-1648) is fixed in this patch. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1658) Absorb NIOFSDirectory into FSDirectory
[ https://issues.apache.org/jira/browse/LUCENE-1658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12715008#action_12715008 ] Earwin Burrfoot commented on LUCENE-1658: - I told you, Java mmap doesn't work on Windows. And please, don't use the unmap hack! If it doesn't work, it doesn't work. Let's use SimpleFSD for all Windows versions. Look, what are you going to do if you unmap a buffer and then access it by accident? Crash the JVM? Absorb NIOFSDirectory into FSDirectory -- Key: LUCENE-1658 URL: https://issues.apache.org/jira/browse/LUCENE-1658 Project: Lucene - Java Issue Type: Improvement Components: Store Reporter: Michael McCandless Assignee: Uwe Schindler Priority: Minor Fix For: 2.9 Attachments: LUCENE-1658-take2.patch, LUCENE-1658-take2.patch, LUCENE-1658-take3.patch, LUCENE-1658-take3.patch, LUCENE-1658-take3.patch, LUCENE-1658-take3.patch, LUCENE-1658.patch, LUCENE-1658.patch, LUCENE-1658.patch I think whether one uses java.io.* vs java.nio.* or eventually java.nio2.*, or some other means, is an under-the-hood implementation detail of FSDirectory and doesn't merit a whole separate class. I think FSDirectory should be the core class one uses when one's index is in the filesystem. So, I'd like to deprecate NIOFSDirectory, absorbing it into FSDirectory, and add a setting useNIO to FSDirectory. It should default to true for non-Windows OSs, because it gives far better concurrent performance on all platforms but Windows (due to known Sun JRE issue http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6265734). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Issue Comment Edited: (LUCENE-1658) Absorb NIOFSDirectory into FSDirectory
[ https://issues.apache.org/jira/browse/LUCENE-1658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12715016#action_12715016 ] Earwin Burrfoot edited comment on LUCENE-1658 at 6/1/09 1:14 AM: - bq. The buffer is nulled directly after unmapping. Really? Let me quote some code (MacOS, Java 1.6):
{code}
unsafe.freeMemory(address);
address = 0;
Bits.unreserveMemory(capacity);
{code}
Does windows version differ? What we see here is 'zeroing', not 'nulling'. When doing get/set, buffer never checks for address to have sense, so the next access will yield a GPF :) The guys from Sun explained the absence of unmap() in the original design - the only way of closing mapped buffer and not getting unpredictable behaviour is to introduce a synchronized isClosed check on each read/write operation, which kills the performance even if the sync method used is just a volatile variable. was (Author: earwin): Really? Let me quote some code (MacOS, Java 1.6):
{code}
unsafe.freeMemory(address);
address = 0;
Bits.unreserveMemory(capacity);
{code}
Does windows version differ? What we see here is 'zeroing', not 'nulling'. When doing get/set, buffer never checks for address to have sense, so the next access will yield a GPF :) The guys from Sun explained the absence of unmap() in the original design - the only way of closing mapped buffer and not getting unpredictable behaviour is to introduce a synchronized isClosed check on each read/write operation, which kills the performance even if the sync method used is just a volatile variable. Absorb NIOFSDirectory into FSDirectory -- Key: LUCENE-1658 URL: https://issues.apache.org/jira/browse/LUCENE-1658 Project: Lucene - Java Issue Type: Improvement Components: Store Reporter: Michael McCandless Assignee: Uwe Schindler Priority: Minor Fix For: 2.9 Attachments: LUCENE-1658-take2.patch, LUCENE-1658-take2.patch, LUCENE-1658-take3.patch, LUCENE-1658-take3.patch, LUCENE-1658-take3.patch, LUCENE-1658-take3.patch, LUCENE-1658-take3.patch, LUCENE-1658.patch, LUCENE-1658.patch, LUCENE-1658.patch I think whether one uses java.io.* vs java.nio.* or eventually java.nio2.*, or some other means, is an under-the-hood implementation detail of FSDirectory and doesn't merit a whole separate class. I think FSDirectory should be the core class one uses when one's index is in the filesystem. So, I'd like to deprecate NIOFSDirectory, absorbing it into FSDirectory, and add a setting useNIO to FSDirectory. It should default to true for non-Windows OSs, because it gives far better concurrent performance on all platforms but Windows (due to known Sun JRE issue http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6265734). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1658) Absorb NIOFSDirectory into FSDirectory
[ https://issues.apache.org/jira/browse/LUCENE-1658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12715016#action_12715016 ] Earwin Burrfoot commented on LUCENE-1658: - Really? Let me quote some code (MacOS, Java 1.6):
{code}
unsafe.freeMemory(address);
address = 0;
Bits.unreserveMemory(capacity);
{code}
Does windows version differ? What we see here is 'zeroing', not 'nulling'. When doing get/set, buffer never checks for address to have sense, so the next access will yield a GPF :) The guys from Sun explained the absence of unmap() in the original design - the only way of closing mapped buffer and not getting unpredictable behaviour is to introduce a synchronized isClosed check on each read/write operation, which kills the performance even if the sync method used is just a volatile variable. Absorb NIOFSDirectory into FSDirectory -- Key: LUCENE-1658 URL: https://issues.apache.org/jira/browse/LUCENE-1658 Project: Lucene - Java Issue Type: Improvement Components: Store Reporter: Michael McCandless Assignee: Uwe Schindler Priority: Minor Fix For: 2.9 Attachments: LUCENE-1658-take2.patch, LUCENE-1658-take2.patch, LUCENE-1658-take3.patch, LUCENE-1658-take3.patch, LUCENE-1658-take3.patch, LUCENE-1658-take3.patch, LUCENE-1658-take3.patch, LUCENE-1658.patch, LUCENE-1658.patch, LUCENE-1658.patch I think whether one uses java.io.* vs java.nio.* or eventually java.nio2.*, or some other means, is an under-the-hood implementation detail of FSDirectory and doesn't merit a whole separate class. I think FSDirectory should be the core class one uses when one's index is in the filesystem. So, I'd like to deprecate NIOFSDirectory, absorbing it into FSDirectory, and add a setting useNIO to FSDirectory. It should default to true for non-Windows OSs, because it gives far better concurrent performance on all platforms but Windows (due to known Sun JRE issue http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6265734). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
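For context, the unmap hack being debated is the reflection-based Cleaner invocation along these lines (Sun/Oracle JVMs of that era; shown only to make clear why an access after unmap faults):
{code}
import java.lang.reflect.Method;
import java.nio.MappedByteBuffer;

public final class UnmapHack {
    // Digs out the buffer's sun.misc.Cleaner via reflection and runs it,
    // munmap()ing the region immediately instead of waiting for GC. Any
    // thread touching the buffer afterwards dereferences freed address
    // space - hence the GPF / silent JVM death discussed below.
    public static void unmap(MappedByteBuffer buffer) throws Exception {
        Method cleanerMethod = buffer.getClass().getMethod("cleaner");
        cleanerMethod.setAccessible(true);
        Object cleaner = cleanerMethod.invoke(buffer);
        cleaner.getClass().getMethod("clean").invoke(cleaner);
    }
}
{code}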
[jira] Commented: (LUCENE-1658) Absorb NIOFSDirectory into FSDirectory
[ https://issues.apache.org/jira/browse/LUCENE-1658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12715018#action_12715018 ] Earwin Burrfoot commented on LUCENE-1658: - Ah! You were referring to your code. It's still not thread-safe. Someone could access the closed buffer before it sees the now-null reference to it. You also employ the hack on non-Windows machines, which work quite well without it. What for? Absorb NIOFSDirectory into FSDirectory -- Key: LUCENE-1658 URL: https://issues.apache.org/jira/browse/LUCENE-1658 Project: Lucene - Java Issue Type: Improvement Components: Store Reporter: Michael McCandless Assignee: Uwe Schindler Priority: Minor Fix For: 2.9 Attachments: LUCENE-1658-take2.patch, LUCENE-1658-take2.patch, LUCENE-1658-take3.patch, LUCENE-1658-take3.patch, LUCENE-1658-take3.patch, LUCENE-1658-take3.patch, LUCENE-1658-take3.patch, LUCENE-1658.patch, LUCENE-1658.patch, LUCENE-1658.patch I think whether one uses java.io.* vs java.nio.* or eventually java.nio2.*, or some other means, is an under-the-hood implementation detail of FSDirectory and doesn't merit a whole separate class. I think FSDirectory should be the core class one uses when one's index is in the filesystem. So, I'd like to deprecate NIOFSDirectory, absorbing it into FSDirectory, and add a setting useNIO to FSDirectory. It should default to true for non-Windows OSs, because it gives far better concurrent performance on all platforms but Windows (due to known Sun JRE issue http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6265734). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1658) Absorb NIOFSDirectory into FSDirectory
[ https://issues.apache.org/jira/browse/LUCENE-1658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12715026#action_12715026 ] Earwin Burrfoot commented on LUCENE-1658: - I tested on MacOS: "Invalid memory access of location 8b55a000 rip=0110c367" - here the JVM quietly dies: non-zero return code, all threads are killed, no diagnostic files created. Absorb NIOFSDirectory into FSDirectory -- Key: LUCENE-1658 URL: https://issues.apache.org/jira/browse/LUCENE-1658 Project: Lucene - Java Issue Type: Improvement Components: Store Reporter: Michael McCandless Assignee: Uwe Schindler Priority: Minor Fix For: 2.9 Attachments: LUCENE-1658-take2.patch, LUCENE-1658-take2.patch, LUCENE-1658-take3.patch, LUCENE-1658-take3.patch, LUCENE-1658-take3.patch, LUCENE-1658-take3.patch, LUCENE-1658-take3.patch, LUCENE-1658.patch, LUCENE-1658.patch, LUCENE-1658.patch I think whether one uses java.io.* vs java.nio.* or eventually java.nio2.*, or some other means, is an under-the-hood implementation detail of FSDirectory and doesn't merit a whole separate class. I think FSDirectory should be the core class one uses when one's index is in the filesystem. So, I'd like to deprecate NIOFSDirectory, absorbing it into FSDirectory, and add a setting useNIO to FSDirectory. It should default to true for non-Windows OSs, because it gives far better concurrent performance on all platforms but Windows (due to known Sun JRE issue http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6265734). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1658) Absorb NIOFSDirectory into FSDirectory
[ https://issues.apache.org/jira/browse/LUCENE-1658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12715027#action_12715027 ] Earwin Burrfoot commented on LUCENE-1658: - bq. It uses less virtual memory :) 64bit systems have an abundance of said valuable resource. Why taint them with dangerous hacks for the sake of zero returns? Absorb NIOFSDirectory into FSDirectory -- Key: LUCENE-1658 URL: https://issues.apache.org/jira/browse/LUCENE-1658 Project: Lucene - Java Issue Type: Improvement Components: Store Reporter: Michael McCandless Assignee: Uwe Schindler Priority: Minor Fix For: 2.9 Attachments: LUCENE-1658-take2.patch, LUCENE-1658-take2.patch, LUCENE-1658-take3.patch, LUCENE-1658-take3.patch, LUCENE-1658-take3.patch, LUCENE-1658-take3.patch, LUCENE-1658-take3.patch, LUCENE-1658.patch, LUCENE-1658.patch, LUCENE-1658.patch I think whether one uses java.io.* vs java.nio.* or eventually java.nio2.*, or some other means, is an under-the-hood implementation detail of FSDirectory and doesn't merit a whole separate class. I think FSDirectory should be the core class one uses when one's index is in the filesystem. So, I'd like to deprecate NIOFSDirectory, absorbing it into FSDirectory, and add a setting useNIO to FSDirectory. It should default to true for non-Windows OSs, because it gives far better concurrent performance on all platforms but Windows (due to known Sun JRE issue http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6265734). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1658) Absorb NIOFSDirectory into FSDirectory
[ https://issues.apache.org/jira/browse/LUCENE-1658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12715057#action_12715057 ] Earwin Burrfoot commented on LUCENE-1658: - bq. I'm a bit nervous about creating MMapDirectory automatically for any OS, not just Windows. It's almost okay for 64bit systems. bq. The hack also saves transient disk space, on all systems, right? That's a nice catch. Now I have some of the non-buggy-but-weird behaviour my app exhibits explained. bq. But they have a 64 bit buffer, so you could use it instead of many buffers. They don't. When the NIO2 project was merged into OpenJDK, they left some stuff unmerged, including 64bit buffers. Currently they aren't present in OpenJDK and Java7 preview builds, and not even a rough estimate is given on whether they are going to make it through, and when. bq. Maybe we should move this hack to contrib (a class that extends MMapDirectory by adding a close method) with a big warning! I support this. The hack has some merits if carefully applied, but is simply too dangerous to ship as the default. Absorb NIOFSDirectory into FSDirectory -- Key: LUCENE-1658 URL: https://issues.apache.org/jira/browse/LUCENE-1658 Project: Lucene - Java Issue Type: Improvement Components: Store Reporter: Michael McCandless Assignee: Uwe Schindler Priority: Minor Fix For: 2.9 Attachments: LUCENE-1658-take2.patch, LUCENE-1658-take2.patch, LUCENE-1658-take3.patch, LUCENE-1658-take3.patch, LUCENE-1658-take3.patch, LUCENE-1658-take3.patch, LUCENE-1658-take3.patch, LUCENE-1658.patch, LUCENE-1658.patch, LUCENE-1658.patch I think whether one uses java.io.* vs java.nio.* or eventually java.nio2.*, or some other means, is an under-the-hood implementation detail of FSDirectory and doesn't merit a whole separate class. I think FSDirectory should be the core class one uses when one's index is in the filesystem. So, I'd like to deprecate NIOFSDirectory, absorbing it into FSDirectory, and add a setting useNIO to FSDirectory. It should default to true for non-Windows OSs, because it gives far better concurrent performance on all platforms but Windows (due to known Sun JRE issue http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6265734). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Issue Comment Edited: (LUCENE-1658) Absorb NIOFSDirectory into FSDirectory
[ https://issues.apache.org/jira/browse/LUCENE-1658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12715063#action_12715063 ] Earwin Burrfoot edited comment on LUCENE-1658 at 6/1/09 4:16 AM: - bq. On a couple of projects I've worked in, they were very reluctant to having packages allocate memory outside the JVM, and that's my understanding of memory mapped buffers. mmap does not allocate memory. It allocates address space, and uses the same disk cache the system already has. For example, you can't cause an OOM in your (or another co-existing) app with mmaps (except eating up your own address space on 32bit systems). bq. But if you decide to include MMapDir in that auto-create logic, I hope there will be a way to instantiate a specific FSDir, in case we'll have problems with that logic. Public constructors for all D variants are a must, and for me they are the best that this patch has to offer :) was (Author: earwin): bq. On a couple of projects I've worked in, they were very reluctant to having packages allocate memory outside the JVM, and that's my understanding of memory mapped buffers. mmap does not allocate memory. It allocates address space, and uses the same disk cache the system already has. For example, you can't cause an OOM in your (or another co-existing) app with mmaps (except eating up your own address space on 32bit systems). Absorb NIOFSDirectory into FSDirectory -- Key: LUCENE-1658 URL: https://issues.apache.org/jira/browse/LUCENE-1658 Project: Lucene - Java Issue Type: Improvement Components: Store Reporter: Michael McCandless Assignee: Uwe Schindler Priority: Minor Fix For: 2.9 Attachments: LUCENE-1658-take2.patch, LUCENE-1658-take2.patch, LUCENE-1658-take3.patch, LUCENE-1658-take3.patch, LUCENE-1658-take3.patch, LUCENE-1658-take3.patch, LUCENE-1658-take3.patch, LUCENE-1658.patch, LUCENE-1658.patch, LUCENE-1658.patch I think whether one uses java.io.* vs java.nio.* or eventually java.nio2.*, or some other means, is an under-the-hood implementation detail of FSDirectory and doesn't merit a whole separate class. I think FSDirectory should be the core class one uses when one's index is in the filesystem. So, I'd like to deprecate NIOFSDirectory, absorbing it into FSDirectory, and add a setting useNIO to FSDirectory. It should default to true for non-Windows OSs, because it gives far better concurrent performance on all platforms but Windows (due to known Sun JRE issue http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6265734). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1658) Absorb NIOFSDirectory into FSDirectory
[ https://issues.apache.org/jira/browse/LUCENE-1658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12715063#action_12715063 ] Earwin Burrfoot commented on LUCENE-1658: - bq. On a couple of projects I've worked in, they were very reluctant to having packages allocate memory outside the JVM, and that's my understanding of memory mapped buffers. mmap does not allocate memory. It allocates address space, and uses the same disk cache the system already has. For example, you can't cause an OOM in your (or another co-existing) app with mmaps (except eating up your own address space on 32bit systems). Absorb NIOFSDirectory into FSDirectory -- Key: LUCENE-1658 URL: https://issues.apache.org/jira/browse/LUCENE-1658 Project: Lucene - Java Issue Type: Improvement Components: Store Reporter: Michael McCandless Assignee: Uwe Schindler Priority: Minor Fix For: 2.9 Attachments: LUCENE-1658-take2.patch, LUCENE-1658-take2.patch, LUCENE-1658-take3.patch, LUCENE-1658-take3.patch, LUCENE-1658-take3.patch, LUCENE-1658-take3.patch, LUCENE-1658-take3.patch, LUCENE-1658.patch, LUCENE-1658.patch, LUCENE-1658.patch I think whether one uses java.io.* vs java.nio.* or eventually java.nio2.*, or some other means, is an under-the-hood implementation detail of FSDirectory and doesn't merit a whole separate class. I think FSDirectory should be the core class one uses when one's index is in the filesystem. So, I'd like to deprecate NIOFSDirectory, absorbing it into FSDirectory, and add a setting useNIO to FSDirectory. It should default to true for non-Windows OSs, because it gives far better concurrent performance on all platforms but Windows (due to known Sun JRE issue http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6265734). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
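Assuming the public constructors this patch introduces, picking a variant explicitly would look like the sketch below. It matches what eventually shipped in 2.9, but is written from the discussion rather than from final code, so treat it as illustrative:
{code}
import java.io.File;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.MMapDirectory;
import org.apache.lucene.store.NIOFSDirectory;
import org.apache.lucene.store.SimpleFSDirectory;

public class DirectoryChoice {
    public static void main(String[] args) throws Exception {
        File path = new File(args[0]);
        Directory auto   = FSDirectory.open(path);      // heuristic default per OS
        Directory simple = new SimpleFSDirectory(path); // plain java.io, safe everywhere
        Directory nio    = new NIOFSDirectory(path);    // concurrent reads; slow on Windows
        Directory mmap   = new MMapDirectory(path);     // address space permitting (64-bit)
    }
}
{code}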