[jira] Updated: (LUCENE-1470) Add TrieRangeQuery to contrib
[ https://issues.apache.org/jira/browse/LUCENE-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-1470: -- Attachment: fixbuild-LUCENE-1470.patch Sorry for yet another patch: when looking into the test again, I noticed I had missed a test for the automatic encoding detection by string length (TrieUtils.trieCodedToXxxAuto()). The appended patch fixes the Hudson build and adds this test. > Add TrieRangeQuery to contrib > - > > Key: LUCENE-1470 > URL: https://issues.apache.org/jira/browse/LUCENE-1470 > Project: Lucene - Java > Issue Type: New Feature > Components: contrib/* >Affects Versions: 2.4 >Reporter: Uwe Schindler >Assignee: Michael McCandless > Fix For: 2.9 > > Attachments: fixbuild-LUCENE-1470.patch, fixbuild-LUCENE-1470.patch, > LUCENE-1470.patch, LUCENE-1470.patch, LUCENE-1470.patch, LUCENE-1470.patch, > LUCENE-1470.patch, LUCENE-1470.patch, LUCENE-1470.patch > > > According to the thread in java-dev > (http://www.gossamer-threads.com/lists/lucene/java-dev/67807 and > http://www.gossamer-threads.com/lists/lucene/java-dev/67839), I want to > include my fast numerical range query implementation in lucene > contrib-queries. > I implemented (based on RangeFilter) another approach for faster > RangeQueries, based on longs stored in the index in a special format. > The idea is to store the longs in the index at different precisions and to > partition the query range so that the outer boundaries are searched using > terms of the highest precision, while the center of the search range uses > lower precisions. The implementation stores the longs at 8 different > precisions (using a class called TrieUtils). It also supports Doubles, using > the IEEE 754 floating-point "double format" bit layout with some bit > mappings to make them binary sortable. The approach is used in rather big > indexes; query times are <<100 ms (!) even on low-performance desktop > computers, for very big ranges on indexes with 50 docs. > I called this RangeQuery variant and format "TrieRange" query because the > idea resembles the well-known trie structures (it is not identical to real > tries, but the algorithms are related). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
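The multi-precision idea in the description can be sketched in a few lines of Java. This is illustrative only -- the class name, the prefix scheme and the encoding itself are assumptions, not TrieUtils' actual format:

{code}
// Encode a long at 8 precisions; a range query then matches the
// range's edges with full-precision terms and covers the wide middle
// with a handful of coarse terms, so even huge ranges visit few terms.
public class TriePrefixSketch {

  // Drop 'shift' low-order bits; a prefix char derived from the shift
  // keeps terms of different precisions from colliding in one field.
  static String encode(long value, int shift) {
    return (char) ('a' + shift / 8) + Long.toHexString(value >>> shift);
  }

  public static void main(String[] args) {
    long value = 0x123456789ABCL; // hypothetical numeric field value
    for (int shift = 0; shift < 64; shift += 8) {
      System.out.println(encode(value, shift)); // one term per precision
    }
  }
}
{code}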
Re: [jira] Commented: (LUCENE-1473) Implement Externalizable in main top level searcher classes
good open source projects should be better than the commercial counterparts. I really like 2.4. The DocIDSet/Filter apis really allowed me to do some interesting stuff. I feel lucene has potential to be more than just a full text search library. -John On Wed, Dec 3, 2008 at 11:58 PM, Robert Muir <[EMAIL PROTECTED]> wrote: > no, i'm not doing any caching but as mentioned it did require some work to > become almost completely i/o bound due to the nature of my wacky queries, > for example removing O(n) behavior from fuzzy and regexp. > > probably the os cache is not helping much because the indexes are very large. > I'm very happy being i/o bound because now and especially in the future i > think it will be cheaper to speed up with additional ram and faster storage. > > still, even out of the box without any tricks, lucene performs *much* better than > the commercial alternatives i have fought with. lucene was evaluated a while > ago before 2.3 and this was not the case, but I re-evaluated around the 2.3 > release and it is now. > > > On Thu, Dec 4, 2008 at 2:45 AM, John Wang <[EMAIL PROTECTED]> wrote: > >> Thanks Robert, definitely interested! >> We, too, are looking into SSDs for performance. >> 2.4 allows you to extend QueryParser and create your own "leaf" >> queries. >> I am surprised you are mostly IO bound. Lucene does a good job caching. Do >> you do some sort of caching yourself? If your index is not changing often, >> there is a lot you can do without SSDs. >> >> -John >> >> >> On Wed, Dec 3, 2008 at 11:27 PM, Robert Muir <[EMAIL PROTECTED]> wrote: >> >>> yeah i am using read-only. >>> >>> i will admit to subclassing queryparser and having customized >>> query/scorer for several. all queries contain fuzzy queries so this was >>> necessary. >>> >>> "high" throughput i guess is a matter of opinion. in attempting to >>> profile high throughput, again customized query/scorer made it easy for me >>> to simplify some things, such as some math in termquery that doesn't make >>> sense (redundant) for my Similarity. everything is pretty much i/o bound now, >>> so if there is some throughput issue i will look into SSD for high volume >>> indexes. >>> >>> i posted on Use Cases on the wiki how I made fuzzy and regex fast if you >>> are curious. >>> >>> >>> On Thu, Dec 4, 2008 at 2:10 AM, John Wang <[EMAIL PROTECTED]> wrote: Thanks Robert for sharing. Good to hear it is working for what you need it to do. 3) Especially with ReadOnlyIndexReaders, you should not be blocked while indexing. Especially if you have multicore machines. 4) do you stay with sub-second responses with high thru-put? -John On Wed, Dec 3, 2008 at 11:03 PM, Robert Muir <[EMAIL PROTECTED]> wrote: > > > On Thu, Dec 4, 2008 at 1:24 AM, John Wang <[EMAIL PROTECTED]> wrote: > >> Nice! >> Some questions: >> >> 1) one index? >> > no, but two individual ones today were around 100M docs > >> 2) how big is your document? e.g. how many terms etc. >> > the last one built has over 4M terms > >> 3) are you serving (searching) the docs in realtime? >> > i don't understand this question, but searching is slower if i am > indexing on a disk that's also being searched. > >> >> 4) search speed? >> > usually subsecond (or close) after some warmup. while this might seem > slow, it's fast compared to the competition, trust me. > >> >> I'd love to learn more about your architecture. >> > i hate to say you would be disappointed, but there's nothing fancy. > probably why it works... 
> >> >> -John >> >> >> On Wed, Dec 3, 2008 at 10:13 PM, Robert Muir <[EMAIL PROTECTED]>wrote: >> >>> sorry gotta speak up on this. i indexed 300m docs today. I'm using an >>> out of box jar. >>> >>> yeah i have some special subclasses but if i thought any of this >>> stuff was general enough to be useful to others i'd submit it. I'm just >>> happy to have something scalable that i can customize to my >>> peculiarities. >>> >>> so i think i fit in your 10% and i'm not stressing on either >>> scalability or api. >>> >>> thanks, >>> robert >>> >>> >>> On Thu, Dec 4, 2008 at 12:36 AM, John Wang <[EMAIL PROTECTED]>wrote: >>> Grant: I am sorry that I disagree with some points: 1) "I think it's a sign that Lucene is pretty stable." - While lucene is a great project, especially with the 2.x releases, great improvements are made, but do we really have a clear picture on how lucene is being used and deployed? While lucene works great running as a vanilla search library, when pushed to limits, one needs to "hack" into lucene to make certain things work. If 90% of the user base use it to build small indexes and using the vanilla api...
[jira] Commented: (LUCENE-1470) Add TrieRangeQuery to contrib
[ https://issues.apache.org/jira/browse/LUCENE-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653257#action_12653257 ] Michael McCandless commented on LUCENE-1470: Hmm -- I would prefer that contrib tests subclass LiaTestCase. We must be missing a dependency in the ant build files. OK this seems to fix it: Index: contrib/contrib-build.xml === --- contrib/contrib-build.xml (revision 723145) +++ contrib/contrib-build.xml (working copy) @@ -61,7 +61,7 @@ - + I'll commit that, and the fix to the test case. Thanks Uwe!
[jira] Commented: (LUCENE-1470) Add TrieRangeQuery to contrib
[ https://issues.apache.org/jira/browse/LUCENE-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653258#action_12653258 ] Michael McCandless commented on LUCENE-1470: bq. Hmm - I would prefer that contrib tests subclass LiaTestCase Woops, I meant LuceneTestCase ;) Time sharing not working very well in my brain this morning...
[jira] Issue Comment Edited: (LUCENE-1470) Add TrieRangeQuery to contrib
[ https://issues.apache.org/jira/browse/LUCENE-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653257#action_12653257 ] mikemccand edited comment on LUCENE-1470 at 12/4/08 3:07 AM: - Hmm -- I would prefer that contrib tests subclass LiaTestCase. We must be missing a dependency in the ant build files. OK this seems to fix it: {code} Index: contrib/contrib-build.xml === --- contrib/contrib-build.xml (revision 723145) +++ contrib/contrib-build.xml (working copy) @@ -61,7 +61,7 @@ - + {code} I'll commit that, and the fix to the test case. Thanks Uwe! was (Author: mikemccand): Hmm -- I would prefer that contrib tests subclass LiaTestCase. We must be missing a dependency in the ant build files. OK this seems to fix it: Index: contrib/contrib-build.xml === --- contrib/contrib-build.xml (revision 723145) +++ contrib/contrib-build.xml (working copy) @@ -61,7 +61,7 @@ - + I'll commit that, and the fix to the test case. Thanks Uwe!
[jira] Resolved: (LUCENE-1470) Add TrieRangeQuery to contrib
[ https://issues.apache.org/jira/browse/LUCENE-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved LUCENE-1470. Resolution: Fixed Lucene Fields: [New, Patch Available] (was: [Patch Available, New]) Committed revision 723287.
Re: [jira] Commented: (LUCENE-1473) Implement Externalizable in main top level searcher classes
Robert Muir wrote: i posted on Use Cases on the wiki how I made fuzzy and regex fast if you are curious. It looks like this is the wiki page: http://wiki.apache.org/lucene-java/FastSSFuzzy?highlight=(fuzzy) The approach is similar to how contrib/spellchecker generates its candidates, in that you build a 2nd index from the primary index and use the 2nd index to generate candidates more quickly (not O(N)). It'd be nice to get your approach into contrib as well ;) Mike
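For readers following along, the FastSS idea behind that wiki page can be sketched with a plain HashMap standing in for the auxiliary index (hypothetical names; not the wiki's or Robert's code): index each term under all of its 1-deletion variants, then look the query's variants up instead of scanning every term.

{code}
import java.util.*;

class FastSSSketch {
  private final Map<String, Set<String>> deletionIndex = new HashMap<String, Set<String>>();

  // The term itself plus every string obtained by deleting one char.
  static List<String> variants(String term) {
    List<String> out = new ArrayList<String>();
    out.add(term);
    for (int i = 0; i < term.length(); i++)
      out.add(term.substring(0, i) + term.substring(i + 1));
    return out;
  }

  void index(String term) {
    for (String v : variants(term)) {
      Set<String> terms = deletionIndex.get(v);
      if (terms == null) deletionIndex.put(v, terms = new HashSet<String>());
      terms.add(term);
    }
  }

  // Candidates near edit distance 1; each still needs a final
  // edit-distance check, but only candidates are verified, not all N terms.
  Set<String> candidates(String query) {
    Set<String> out = new HashSet<String>();
    for (String v : variants(query))
      if (deletionIndex.containsKey(v)) out.addAll(deletionIndex.get(v));
    return out;
  }
}
{code}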
Re: [jira] Commented: (LUCENE-1473) Implement Externalizable in main top level searcher classes
John Wang wrote: Seems like being a committer can be rather lucrative. I think being an Apache committer on any project can be somewhat lucrative. Companies know that you probably work well with others if you're a committer, which can probably lead to improved career opportunities. Can't say too much about working well with others :) I may not be extracting as much money as I can, though - sounds like I could be taking bribes to commit code if I wanted to make more ;) My comment was on the statements of being volunteers and don't get paid, which is a little misleading. It depends. Sometimes, something you're doing with a customer might make its way into Lucene. That's not most of the work that goes on here though. Most of the work is looking at submitted patches in our free time, going over them, running the tests, and possibly committing them. I do that for the project because I like to, not for any money I'm getting (true enough I haven't been a core committer long, but I did the same as a contrib committer). When I'm sitting around at 11 at night or 7 in the morning, trying to get patches committed, I'd hate to be classified as a non-volunteer. It's just as easy to get the committer title and then fall off the face of the world. No one ensures you are helping anyone get anything done. I guess I need to learn to be a good boy not to piss off the committers anymore (or convince my company to pay to get some patches in) And hopefully someday I get to grow up and get to become a committer and make some $ too. You might consider it. I think you have been a bit rude, but watch and see...quality patches you submit will still get processed like any other. The people around here are friendly and mainly interested in the quality of Lucene. No one is trying to enforce some sort of "power elite" here. There is no blacklist. At the same time, lashing out isn't going to help get any issues passed (in fact, I've seen it sink more than one issue). I've certainly never been involved in Lucene for the money myself (and I don't have much of it, believe you me). - Mark -John
[jira] Updated: (LUCENE-689) NullPointerException thrown by equals method in SpanOrQuery
[ https://issues.apache.org/jira/browse/LUCENE-689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-689: -- Fix Version/s: 2.9 > NullPointerException thrown by equals method in SpanOrQuery > --- > > Key: LUCENE-689 > URL: https://issues.apache.org/jira/browse/LUCENE-689 > Project: Lucene - Java > Issue Type: Bug > Components: Search >Affects Versions: 2.1 > Environment: Java 1.5.0_09, RHEL 3 Linux, Tomcat 5.0.28 >Reporter: Michael Goddard >Assignee: Otis Gospodnetic >Priority: Minor > Fix For: 2.9 > > Attachments: LUCENE-689.txt > > > Part of our code utilizes the equals method in SpanOrQuery and, in certain > cases (details to follow, if necessary), a NullPointerException gets thrown > as a result of the String "field" being null. After applying the following > patch, the problem disappeared: > Index: src/java/org/apache/lucene/search/spans/SpanOrQuery.java > === > --- src/java/org/apache/lucene/search/spans/SpanOrQuery.java(revision > 465065) > +++ src/java/org/apache/lucene/search/spans/SpanOrQuery.java(working copy) > @@ -121,7 +121,8 @@ > final SpanOrQuery that = (SpanOrQuery) o; > if (!clauses.equals(that.clauses)) return false; > -if (!field.equals(that.field)) return false; > +if (field != null && !field.equals(that.field)) return false; > +if (field == null && that.field != null) return false; > return getBoost() == that.getBoost(); >}
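The patched lines are the standard null-safe comparison idiom. As a standalone sketch of the same check (on Java 7+, java.util.Objects.equals(a, b) is equivalent):

{code}
// Null-safe equality, equivalent to the two '+' lines in the patch:
static boolean nullSafeEquals(Object a, Object b) {
  return (a == null) ? (b == null) : a.equals(b);
}
{code}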
[jira] Commented: (LUCENE-1470) Add TrieRangeQuery to contrib
[ https://issues.apache.org/jira/browse/LUCENE-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653268#action_12653268 ] Michael McCandless commented on LUCENE-1470: bq. I think, this cannot work. The Cache is keyed by FieldCacheImpl.Entry containing the parser to use. Sigh, you are correct. How would you fix FieldCache? I guess the workaround is to also index the original value (unencoded by TrieUtils) as an additional field, for sorting.
[jira] Commented: (LUCENE-1470) Add TrieRangeQuery to contrib
[ https://issues.apache.org/jira/browse/LUCENE-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653270#action_12653270 ] Uwe Schindler commented on LUCENE-1470: --- Thanks, then I would also change TestTrieRangeQuery to use LuceneTestCase, just for completeness. bq. Sigh, you are correct. How would you fix FieldCache? I would fix FieldCache by giving SortField the possibility to supply a parser instance. So you create a SortField using a new constructor SortField(String field, int type, Object parser, boolean reverse). The parser is "Object" because the parsers have no common super-interface. The ideal solution would be to have SortField(String field, int type, FieldCache.Parser parser, boolean reverse), where FieldCache.Parser is a super-interface (just empty, more like a marker interface) of all the other parsers (like LongParser...). bq. I guess the workaround is to also index the original value (unencoded by TrieUtils) as an additional field, for sorting. The problem with the extra field is that it works well for longs or doubles (with some extra work), but Dates would still be kept as Strings, or you would use Date.getTime() as a long. That is not very elegant and needs more fields and terms. I prefer a clean solution.
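A minimal sketch of what Uwe proposes (names assumed; nothing here is committed code):

{code}
// Empty marker super-interface unifying the existing parsers...
interface Parser { }

// ...which each concrete parser interface would then extend:
interface LongParser extends Parser {
  long parseLong(String value);
}

// SortField could then carry a custom parser into the FieldCache:
class SortFieldSketch {
  final String field;
  final int type;
  final Parser parser;   // null means "use the default parser"
  final boolean reverse;

  SortFieldSketch(String field, int type, Parser parser, boolean reverse) {
    this.field = field;
    this.type = type;
    this.parser = parser;
    this.reverse = reverse;
  }
}
{code}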
[jira] Commented: (LUCENE-1390) add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter
[ https://issues.apache.org/jira/browse/LUCENE-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653272#action_12653272 ] Mark Miller commented on LUCENE-1390: - So my final thought on this is performance...is handling more characters much slower? Could that be a reason to keep the Latin1 filter as well? > add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter > > > Key: LUCENE-1390 > URL: https://issues.apache.org/jira/browse/LUCENE-1390 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis > Environment: any >Reporter: Andi Vajda >Assignee: Mark Miller >Priority: Minor > Fix For: 2.9 > > Attachments: ASCIIFoldingFilter.patch, ASCIIFoldingFilter.patch, > ASCIIFoldingFilter.patch > > > The ISOLatin1AccentFilter removes accents from accented characters in the > ISO Latin 1 character set. > It does what it does and there is no bug in it. > It would be nicer, though, if there were a more comprehensive version of this > code that included not just ISO-Latin-1 (ISO-8859-1) but the entire Latin-1 > Supplement and Latin Extended-A unicode blocks. > See: http://en.wikipedia.org/wiki/Latin-1_Supplement_unicode_block > See: http://en.wikipedia.org/wiki/Latin_Extended-A_unicode_block > That way, all languages using roman characters are covered. > A new class, ISOLatinAccentFilter, is attached. It is intended to supersede > ISOLatin1AccentFilter, which should get deprecated.
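The coverage question can be illustrated with the JDK alone. This is not the patch's table-driven approach, just an approximation via Unicode decomposition (Java 6+), and it shows why a mapping table has to handle more:

{code}
import java.text.Normalizer;

public class FoldSketch {
  // Decompose ("é" -> "e" + combining accent), then strip the marks.
  static String fold(String s) {
    return Normalizer.normalize(s, Normalizer.Form.NFD)
                     .replaceAll("\\p{M}+", "");
  }

  public static void main(String[] args) {
    System.out.println(fold("Ångström, œil, łódź"));
    // prints roughly "Angstrom, œil, łodz": ligatures like œ and
    // stroked letters like ł do not decompose, which is exactly the
    // extra ground a table-driven folding filter has to cover.
  }
}
{code}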
[jira] Commented: (LUCENE-1465) NearSpansOrdered.getPayload does not return the payload from the minimum match span
[ https://issues.apache.org/jira/browse/LUCENE-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653277#action_12653277 ] Mark Miller commented on LUCENE-1465: - What's involved in a backport - just commit it to the 2.4 branch and that's all? Looks like I have to look into terms indexed at the same position first - I'll try to get to that soon. - Mark > NearSpansOrdered.getPayload does not return the payload from the minimum > match span > --- > > Key: LUCENE-1465 > URL: https://issues.apache.org/jira/browse/LUCENE-1465 > Project: Lucene - Java > Issue Type: Bug > Components: Search >Affects Versions: 2.4 >Reporter: Mark Miller >Assignee: Mark Miller >Priority: Minor > Fix For: 2.4.1, 2.9 > > Attachments: LUCENE-1465.patch, LUCENE-1465.patch, LUCENE-1465.patch, > LUCENE-1465.patch, Test.java > >
[jira] Updated: (LUCENE-996) Parsing mixed inclusive/exclusive range queries
[ https://issues.apache.org/jira/browse/LUCENE-996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller updated LUCENE-996: --- Fix Version/s: (was: 2.9) 3.0 Because this requires changing a callback or two in the queryparser, it's probably easier to put it into 3.0 than 2.9. > Parsing mixed inclusive/exclusive range queries > --- > > Key: LUCENE-996 > URL: https://issues.apache.org/jira/browse/LUCENE-996 > Project: Lucene - Java > Issue Type: Improvement > Components: QueryParser >Affects Versions: 2.2 >Reporter: Andrew Schurman >Priority: Minor > Fix For: 3.0 > > Attachments: LUCENE-996.patch, LUCENE-996.patch, lucene-996.patch, > lucene-996.patch > > > The current query parser doesn't handle parsing a range query (i.e. > ConstantScoreRangeQuery) with mixed inclusive/exclusive bounds.
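For illustration, the kind of query the issue targets; the mixed-bracket syntax shown here is assumed, not confirmed by the patch:

{code}
// Hypothetical once mixed bounds parse: '['/']' inclusive, '{'/'}' exclusive.
QueryParser parser = new QueryParser("price", new WhitespaceAnalyzer());
Query q = parser.parse("price:[5 TO 10}"); // 5 <= price < 10
{code}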
[jira] Commented: (LUCENE-1286) LargeDocHighlighter - another span highlighter optimized for large documents
[ https://issues.apache.org/jira/browse/LUCENE-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653283#action_12653283 ] Mark Miller commented on LUCENE-1286: - Hey Koji, I actually have some ideas to come back to this with, but I won't have time to actually work on it for a while. bq. Can you elaborate this - "rebuild the document by running through the query terms by using their offsets"? Part of the problem with the Highlighter and large docs is that it runs through every token in the doc and scores that token, building the original highlighted doc as it goes. For a large doc, that can be a bit slow. What Ronnie's highlighter did was just look at the offsets of the query terms (hence the need for term vectors), which allows you to rebuild the original highlighted document in big quick chunks (stitching things together between query term offsets). I was attempting a similar thing here with phrase and span support, but I couldn't match the speed of what the current Span highlighter has - this is because the current Span Highlighter can highlight non position sensitive terms very fast. My method required getting non position sensitive terms from the MemoryIndex as well (via getSpans) and the cost ruined any benefit. I came up with a few things to try since then but haven't had the time to dedicate to it yet. It's hard to get around requiring term vectors (for the offsets), and I'd like to avoid that. At the same time, if you don't require term vectors, it's probably going to be pretty slow re-analyzing the documents anyway... > LargeDocHighlighter - another span highlighter optimized for large documents > > > Key: LUCENE-1286 > URL: https://issues.apache.org/jira/browse/LUCENE-1286 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/highlighter >Affects Versions: 2.4 >Reporter: Mark Miller >Priority: Minor > > The existing Highlighter API is rich and well designed, but the approach > taken is not very efficient for large documents. > I believe that this is because the current Highlighter rebuilds the document > by running through and scoring every token in the tokenstream. > With a break in the current API, an alternate approach can be taken: rebuild > the document by running through the query terms by using their offsets. The > benefit is clear - a large doc will have a large tokenstream, but a query > will likely be very small in comparison. > I expect this approach to be quite a bit faster for very large documents, > while still supporting Phrase and Span queries. > First rough patch to follow shortly.
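The chunk-stitching described above can be sketched in a few lines (assumed names; this is neither Ronnie's highlighter nor the attached patch):

{code}
class OffsetHighlighterSketch {
  // matches: sorted, non-overlapping [start, end) character offsets
  // of query-term hits, as recovered from term vectors.
  static String highlight(String doc, int[][] matches) {
    StringBuilder out = new StringBuilder(doc.length());
    int pos = 0;
    for (int[] m : matches) {
      out.append(doc, pos, m[0]); // copy the unhighlighted gap in one chunk
      out.append("<b>").append(doc, m[0], m[1]).append("</b>");
      pos = m[1];
    }
    return out.append(doc.substring(pos)).toString();
  }
}
{code}

The work is proportional to the number of query-term hits rather than the number of tokens in the document, which is the whole point for large docs.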
[jira] Commented: (LUCENE-1469) isValid should be invoked after analyze rather than before it so it can validate the output of analyze
[ https://issues.apache.org/jira/browse/LUCENE-1469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653285#action_12653285 ] Mark Miller commented on LUCENE-1469: - This makes sense to me. Care to submit a patch? > isValid should be invoked after analyze rather than before it so it can > validate the output of analyze > -- > > Key: LUCENE-1469 > URL: https://issues.apache.org/jira/browse/LUCENE-1469 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/* >Affects Versions: 2.4 >Reporter: Vincent Li >Priority: Minor > Original Estimate: 0.08h > Remaining Estimate: 0.08h > > The Synonym map has a protected method String analyze(String word) designed > for custom stemming. > However, before analyze is invoked on a word, boolean isValid(String str) is > used to validate the word - which causes the program to discard words that > may be usable by the custom analyze method. > I think that isValid should be invoked after analyze rather than before it so > it can validate the output of analyze and allow implementers to decide what is > valid for the overridden analyze method. (In fact, if you look at the code > snippet below, isValid should really go after the empty string check) > This is a two line change in org.apache.lucene.index.memory.SynonymMap > /* >* Part B: ignore phrases (with spaces and hyphens) and >* non-alphabetic words, and let user customize word (e.g. do some >* stemming) >*/ > if (!isValid(word)) continue; // ignore > word = analyze(word); > if (word == null || word.length() == 0) continue; // ignore
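The suggested reordering, as a sketch (the surrounding loop is as in SynonymMap; the actual patch may differ):

{code}
/*
 * Part B (reordered): let the custom analyze() see every word,
 * then validate its output instead of its input.
 */
word = analyze(word);
if (word == null || word.length() == 0) continue; // ignore
if (!isValid(word)) continue; // now validates analyze()'s output
{code}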
[jira] Commented: (LUCENE-1465) NearSpansOrdered.getPayload does not return the payload from the minimum match span
[ https://issues.apache.org/jira/browse/LUCENE-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653292#action_12653292 ] Michael McCandless commented on LUCENE-1465: bq. What's involved in a backport - just commit it to the 2.4 branch and that's all? Yup. "svn merge" works well as long as the code hasn't diverged much, eg running this in a 2.4 branch checkout: {code} svn merge -r(N-1):N https://svn.apache.org/repos/asf/lucene/java/trunk {code} where N was the revision committed to trunk.
[jira] Commented: (LUCENE-1470) Add TrieRangeQuery to contrib
[ https://issues.apache.org/jira/browse/LUCENE-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653295#action_12653295 ] Michael McCandless commented on LUCENE-1470: bq. Thanks, then I would also change TestTrieRangeQuery to use LuceneTestCase, just for completeness. OK done. bq. I would fix FieldCache by giving SortField the possibility to supply a parser instance. So you create a SortField using a new constructor SortField(String field, int type, Object parser, boolean reverse). The parser is "Object" because the parsers have no common super-interface. This seems OK for now? Can you open an issue? Retro-fitting a super-interface would break back-compat for (admittedly very advanced) existing Parser instances external to Lucene, right? bq. but Dates would still be kept as Strings, or you would use Date.getTime() as a long Yeah. But if we open the new issue (to allow external FieldCache parsers to be used when sorting) then one could parse to a long directly from a TrieUtils-encoded Date field, right?
Re: [jira] Commented: (LUCENE-1473) Implement Externalizable in main top level searcher classes
On Dec 4, 2008, at 12:36 AM, John Wang wrote: Grant: I am sorry that I disagree with some points: 1) "I think it's a sign that Lucene is pretty stable." - While lucene is a great project, especially with the 2.x releases, great improvements are made, but do we really have a clear picture on how lucene is being used and deployed? While lucene works great running as a vanilla search library, when pushed to limits, one needs to "hack" into lucene to make certain things work. If 90% of the user base use it to build small indexes and using the vanilla api, and the other 10% is really stressing both on the scalability and api side and are running into issues, would you still say: "running well for 90% of the users, therefore it is stable or extensible"? I think it is unfair to the project itself to be measured by the vanilla use-case. I have done a couple of large deployments, e.g. >30 million documents indexed and searched in realtime, and I really had to do some tweaking. Sorry, we should have written a perfect engine the first time out. I'll get on that. Question for you: how much of that tweaking have you contributed back? If you have such obvious wins, put them up as patches so we can all benefit, just like you've benefitted from our volunteering. As for 90%, I'd say it is more like > 95% and, gee, if I can write a general purpose open source search library that keeps 95% of a very, very, very large install base happy all while still improving it and maintaining backward compatibility, then color me stable. 2) "You want stuff committed, keep it up to date, make it manageable to review, document it, respond to questions/concerns with answers as best you can. " - To some degree I would hope it depends on what the issue is, e.g. enforcing such process on a one-line null check seems to be overkill. I agree with the process itself; what would make it better is some transparency on how patches/issues are evaluated to be committed. At least seen from the outside, it is purely being decided on by the committers, and since my understanding is that an open source project belongs to the public, the public user base should have some say. Here's your list of opened issues: https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&reporterSelect=specificuser&[EMAIL PROTECTED] Only 1 of which has more than 2 votes and which is assigned to Hoss. However, from what I can see, you've had all but 1, I repeat ONE, of your issues resolved. And, yes, what gets committed is decided on by the COMMITTERS with input from the community; who else can be responsible for committing? Hence the title. We can't please everyone, but I'll be damned if you're going to disparage the work of so many because you have sour grapes over some people (not all) disagreeing with you over how serialization should work in Lucene just b/c you think the problem is trivial when clearly others do not. Committers are picked by the project over a long period of time (feel free to nominate someone who you feel has merit, we've elected committers based on community nominations in the past) because they stick around and stay involved and respond on the list, etc. I'm starting to think your real issue here is that we haven't all agreed with you the minute you suggest something, but sorry, that is how open source works. 
3) which brings me to this point: "I personally, would love to work on Lucene all day every day as I have a lot of things I'd love to engage the community on, but the fact is I'm not paid to do that, so I give what I can when I can. I know most of the other committers are that way too." - Is this really true? Isn't a large part of the committer base also a part of the for-profit, consulting business, e.g. Lucid? Would groups/companies that pay for consulting services get their patches/requirements committed with higher priority? If so, that seems to me to be a conflict of interest. Yes, John, it is true. I would love to work on Lucene all day. If I won the lottery tomorrow, I'd probably still volunteer on Lucene. Let me ask you back, who pays you to work on Lucene? Was this patch submitted because you just happened to spot it while poring over the code at night on your own and out of the goodness of your heart? Or did you discover it at LinkedIn where you were specifically hired because of your Lucene skills and knowledge of the Lucene community? In other words, you're accusing me and others of getting paid for my expertise in Lucene, all the while you are getting paid for your expertise in Lucene. 4) "Lather, rinse, repeat. Next thing you know, you'll be on the receiving end as a committer." - While I agree that being a committer is a great honor and many committers are awesome, assuming everyone would want to be a committer is a little presumptuous.
[jira] Commented: (LUCENE-1473) Implement standard Serialization across Lucene versions
[ https://issues.apache.org/jira/browse/LUCENE-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653297#action_12653297 ] Michael McCandless commented on LUCENE-1473: bq. It seems best to remove Serialization from Lucene so that users are not confused and create a better solution. I don't think that's the case. If we choose to only support "live serialization" then we should add "implements Serializable" but spell out clearly in the javadocs that there is no guarantee of cross-version compatibility ("long term persistence") and in fact that often there are incompatibilities. I think "live serialization" is still a useful feature. > Implement standard Serialization across Lucene versions > --- > > Key: LUCENE-1473 > URL: https://issues.apache.org/jira/browse/LUCENE-1473 > Project: Lucene - Java > Issue Type: Bug > Components: Search >Affects Versions: 2.4 >Reporter: Jason Rutherglen >Priority: Minor > Attachments: LUCENE-1473.patch > > Original Estimate: 8h > Remaining Estimate: 8h > > To maintain serialization compatibility between Lucene versions, > serialVersionUID needs to be added to classes that implement > java.io.Serializable. java.io.Externalizable may be implemented in classes > for faster performance.
RE: Build failed in Hudson: Lucene-trunk #665
Why does the compilation of my testcase for TrieRangeQuery fail on Hudson, but work here? - UWE SCHINDLER Webserver/Middleware Development PANGAEA - Publishing Network for Geoscientific and Environmental Data MARUM - University of Bremen Room 2500, Leobener Str., D-28359 Bremen Tel.: +49 421 218 65595 Fax: +49 421 218 65505 http://www.pangaea.de/ E-mail: [EMAIL PROTECTED] > -Original Message- > From: Apache Hudson Server [mailto:[EMAIL PROTECTED] > Sent: Thursday, December 04, 2008 3:11 AM > To: java-dev@lucene.apache.org > Subject: Build failed in Hudson: Lucene-trunk #665 > > See http://hudson.zones.apache.org/hudson/job/Lucene-trunk/665/changes > > Changes: > > [mikemccand] LUCENE-1457: fix possible overflow bugs during binary search > > [mikemccand] LUCENE-1470: add TrieRangeQuery, a much more efficient > implementation of RangeQuery at the expense of added space consumed in the > index > > [markrmiller] LUCENE-1246: check for null sub queries so that > BooleanQuery.toString does not throw NullPointerException. > > -- > [...truncated 3201 lines...] > clover.setup: > > clover.info: > > clover: > > compile-core: > > common.compile-test: > [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Lucene- > trunk/ws/trunk/build/contrib/misc/classes/test > [javac] Compiling 7 source files to > http://hudson.zones.apache.org/hudson/job/Lucene- > trunk/ws/trunk/build/contrib/misc/classes/test > [javac] Note: Some input files use or override a deprecated API. > [javac] Note: Recompile with -Xlint:deprecation for details. > > build-artifacts-and-tests: > [echo] Building queries... > > javacc-uptodate-check: > > javacc-notice: > > jflex-uptodate-check: > > jflex-notice: > > common.init: > > build-lucene: > > init: > > clover.setup: > > clover.info: > > clover: > > compile-core: > [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Lucene- > trunk/ws/trunk/build/contrib/queries/classes/java > [javac] Compiling 12 source files to > http://hudson.zones.apache.org/hudson/job/Lucene- > trunk/ws/trunk/build/contrib/queries/classes/java > [javac] Note: Some input files use or override a deprecated API. > [javac] Note: Recompile with -Xlint:deprecation for details. > > jar-core: > [jar] Building jar: > http://hudson.zones.apache.org/hudson/job/Lucene- > trunk/ws/trunk/build/contrib/queries/lucene-queries-2.4-SNAPSHOT.jar > > jar: > > compile-test: > [echo] Building queries... 
> > javacc-uptodate-check: > > javacc-notice: > > jflex-uptodate-check: > > jflex-notice: > > common.init: > > build-lucene: > > init: > > clover.setup: > > clover.info: > > clover: > > compile-core: > > common.compile-test: > [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Lucene- > trunk/ws/trunk/build/contrib/queries/classes/test > [javac] Compiling 6 source files to > http://hudson.zones.apache.org/hudson/job/Lucene- > trunk/ws/trunk/build/contrib/queries/classes/test > [javac] http://hudson.zones.apache.org/hudson/job/Lucene- > trunk/ws/trunk/contrib/queries/src/test/org/apache/lucene/search/trie/Test > TrieUtils.java :23: cannot find symbol > [javac] symbol : class LuceneTestCase > [javac] location: package org.apache.lucene.util > [javac] import org.apache.lucene.util.LuceneTestCase; > [javac] ^ > [javac] http://hudson.zones.apache.org/hudson/job/Lucene- > trunk/ws/trunk/contrib/queries/src/test/org/apache/lucene/search/trie/Test > TrieUtils.java :25: cannot find symbol > [javac] symbol: class LuceneTestCase > [javac] public class TestTrieUtils extends LuceneTestCase { > [javac]^ > [javac] http://hudson.zones.apache.org/hudson/job/Lucene- > trunk/ws/trunk/contrib/queries/src/test/org/apache/lucene/search/trie/Test > TrieUtils.java :29: cannot find symbol > [javac] symbol : method > assertEquals(java.lang.String,java.lang.String) > [javac] location: class org.apache.lucene.search.trie.TestTrieUtils > [javac] assertEquals( > TrieUtils.VARIANT_8BIT.TRIE_CODED_NUMERIC_MIN, > "\u0100\u0100\u0100\u0100\u0100\u0100\u0100\u0100"); > [javac] ^ > [javac] http://hudson.zones.apache.org/hudson/job/Lucene- > trunk/ws/trunk/contrib/queries/src/test/org/apache/lucene/search/trie/Test > TrieUtils.java :30: cannot find symbol > [javac] symbol : method > assertEquals(java.lang.String,java.lang.String) > [javac] location: class org.apache.lucene.search.trie.TestTrieUtils > [javac] assertEquals( > TrieUtils.VARIANT_8BIT.TRIE_CODED_NUMERIC_MAX, > "\u01ff\u01ff\u01ff\u01ff\u01ff\u01ff\u01ff\u01ff"); > [javac] ^ > [javac] http://hudson.zones.apache.org/hudson/job/Lucene- > trunk/ws/trunk/contrib/queries/src/test/org/apache/lucene/search/trie/Test > TrieUtils.java :31: cannot find symbol > [j
[jira] Commented: (LUCENE-1470) Add TrieRangeQuery to contrib
[ https://issues.apache.org/jira/browse/LUCENE-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653298#action_12653298 ] Uwe Schindler commented on LUCENE-1470: --- Yes, I will open an issue! Maybe I will create a first patch after looking into the problem. bq. This seems OK for now? Can you open an issue? Retro-fitting a super-interface would break back-compat for (admittedly very advanced) existing Parser instances external to Lucene, right? I am not sure, but I think it's better to leave it as is for now. On the other hand, if we just have a "marker" super-interface, it should be backwards compatible, because the super-interface is new and existing code would only use the existing interfaces. No methods are added by the super-interface, so code would be source and binary compatible (as it only references the existing interfaces). I think we had this discussion some time in the past in another issue (Fieldable???), but that was another problem. bq. Yeah. But if we open the new issue (to allow external FieldCache parsers to be used when sorting) then one could parse to a long directly from a TrieUtils-encoded Date field, right? Correct. As soon as this works, I would simply add, as an "extra bonus", o.a.l.search.trie.TrieSortField, which automatically supplies a correct parser for easy usage. Date, Double and Long trie fields can always be sorted as longs without knowing the correct meaning (because the trie format was designed that way). Currently my code would just sort the trie-encoded fields using SortField.STRING, but this is resource-expensive (but I have no example currently running, as it was not needed for panFMP/PANGAEA and other projects).
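A sketch of how that TrieSortField could supply its parser (assumptions: the parser-carrying SortField constructor discussed above, and a decoder named after the TrieUtils.trieCodedToXxxAuto() methods mentioned earlier; nothing here is committed code):

{code}
// Hypothetical parser handing trie-decoded longs to the FieldCache:
FieldCache.LongParser trieLongParser = new FieldCache.LongParser() {
  public long parseLong(String trieCoded) {
    // Trie-coded longs, doubles and dates all order correctly when
    // compared as plain longs, so one decoder covers every trie field.
    return TrieUtils.trieCodedToLongAuto(trieCoded);
  }
};
// Usage, once the new constructor exists:
// new SortField("timestamp", SortField.LONG, trieLongParser, false);
{code}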
[jira] Commented: (LUCENE-1473) Implement standard Serialization across Lucene versions
[ https://issues.apache.org/jira/browse/LUCENE-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653321#action_12653321 ] Michael McCandless commented on LUCENE-1473: bq. For classes that no one submits an Externalizable patch for, the serialVersionUID needs to be added. The serialVersionUID approach would be too simplistic, because we can't simply bump it up whenever we make a change since that then breaks back compatibility. We would have to override write/readObject or write/readExternal, and serialVersionUID would not be used.
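For illustration, the kind of hand-versioned stream format this implies (hypothetical class, not a Lucene patch): freeze serialVersionUID once and carry an explicit version byte in the stream instead.

{code}
import java.io.*;

class VersionedQuerySketch implements Serializable {
  private static final long serialVersionUID = 1L; // frozen forever

  private transient float boost = 1.0f; // written by hand below

  private void writeObject(ObjectOutputStream out) throws IOException {
    out.defaultWriteObject();
    out.writeByte(1);       // explicit stream-format version
    out.writeFloat(boost);
  }

  private void readObject(ObjectInputStream in)
      throws IOException, ClassNotFoundException {
    in.defaultReadObject();
    byte version = in.readByte();
    if (version > 1) {
      throw new InvalidObjectException("stream format too new: " + version);
    }
    boost = in.readFloat();
  }
}
{code}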
Re: [jira] Commented: (LUCENE-1473) Implement Externalizable in main top level searcher classes
Mark and Grant: I do apologize if I came off seeming rude. I guess I let my frustration over the serialization issue get the better of me (along with frustration built up from some of the other issues, which I thought were trivial but which were made out not to be). I will improve my behavior in the future. There is a reason I have stopped submitting patches via Jira (one which I no longer dare to express). There is absolutely nothing wrong with getting paid for Lucene expertise. I was just commenting on your comment about "volunteering", but if you think I am wrong, then I am. I did have a concern about the focus of the project being biased by companies paying the committers, but obviously that is not my business. The issues/patches I am raising are trivial stuff, and that was precisely my point. I am not pushing for grandiose ideas; I am frustrated with some very brain-dead issues (I am not smart enough to provide any earth-shattering patches) that have been blown out of proportion in my mind. I will try to keep my mouth shut in the future. -John On Thu, Dec 4, 2008 at 5:24 AM, Grant Ingersoll <[EMAIL PROTECTED]> wrote: > > On Dec 4, 2008, at 12:36 AM, John Wang wrote: > > Grant: >> >>I am sorry that I disagree with some points: >> >> 1) "I think it's a sign that Lucene is pretty stable." - While lucene is a >> great project, especially with 2.x releases, great improvements are made, >> but do we really have a clear picture on how lucene is being used and >> deployed. While lucene works great running as a vanilla search library, when >> pushed to limits, one needs to "hack" into lucene to make certain things >> work. If 90% of the user base use it to build small indexes and using the >> vanilla api, and the other 10% is really stressing both on the scalability >> and api side and are running into issues, would you still say: "running well >> for 90% of the users, therefore it is stable or extensible"? I think it is >> unfair to the project itself to be measured by the vanilla use-case. I have >> done couple of large deployments, e.g. >30 million documents indexed and >> searched in realtime., and I really had to do some tweaking. >> > > Sorry, we should have written a perfect engine the first time out. I'll > get on that. Question for you: how much of that tweaking have you > contributed back? If you have such obvious wins, put them up as patches so > we can all benefit, just like you've benefitted from our volunteering. > > As for 90%, I'd say it is more like > 95% and, gee, if I can write a > general purpose open source search library that keeps 95% of a very, very, > very large install base happy all while still improving it and maintaining > backward compatibility, than color me stable. > > >> 2) "You want stuff committed, keep it up to date, make it manageable to >> review, document it, respond to questions/concerns with answers as best you >> can. " - To some degree I would hope it depends on what the issue is, e.g. >> enforcing such process on a one-line null check seems to be an overkill. I >> agree with the process itself, what would make it better is some >> transparency on how patches/issues are evaluated to be committed. At least >> seemed from the outside, it is purely being decided on by the committers, >> and since my understanding is that an open source project belongs to the >> public, the public user base should have some say. 
>> > > Here's your list of opened issues: > https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&reporterSelect=specificuser&[EMAIL > PROTECTED] Only 1 of which has more than 2 votes and which is assigned to > Hoss. > However, from what I can see, you've had all but 1, I repeat ONE, issue not > resolved. > > And, yes, what gets committed is decided on by the COMMITTERS with input > from the community; who else can be responsible for committing? Hence the > title. We can't please everyone, but I'll be damned if you're going to > disparage the work of so many because you have sour grapes over some people > (not all) disagreeing with you over how serialization should work in Lucene > just b/c you think the problem is trivial when clearly others do not. > > Committers are picked by the project over a long period of time (feel free > to nominate someone who you feel has merit, we've elected committers based > on community nominations in the past) because they stick around and stay > involved and respond on the list, etc. I'm starting to think your real > issue here is that we haven't all agreed with you the minute you suggest > something, but sorry, that is how open source works. > > > >> 3) which brings me to this point: "I personally, would love to work on >> Lucene all day every day as I have a lot of things I'd love to engage the >> community on, but the fact is I'm not paid to do that, so I give what I can >> when I can. I know most of the other committers are that way too." - Is >> this really true? Isn't a large part of the
[jira] Created: (LUCENE-1478) Missing possibility to supply custom FieldParser when sorting search results
Missing possibility to supply custom FieldParser when sorting search results Key: LUCENE-1478 URL: https://issues.apache.org/jira/browse/LUCENE-1478 Project: Lucene - Java Issue Type: Improvement Components: Search Affects Versions: 2.4 Reporter: Uwe Schindler When implementing the new TrieRangeQuery for contrib (LUCENE-1470), I was confronted by the problem that the special trie-encoded values (which are longs in a special encoding) cannot be sorted by Searcher.search() and SortField. The problem is: If you use SortField.LONG, you get NumberFormatExceptions. The trie encoded values may be sorted using SortField.String (as the encoding is in such a way, that they are sortable as Strings), but this is very memory ineffective. ExtendedFieldCache gives the possibility to specify a custom LongParser when retrieving the cached values. But you cannot use this during searching, because there is no possibility to supply this custom LongParser to the SortField. I propose a change in the sort classes: Include a pointer to the parser instance to be used in SortField (if not given use the default). My idea is to create a SortField using a new constructor {code}SortField(String field, int type, Object parser, boolean reverse){code} The parser is "object" bcause all parsers have no super-interface. The ideal solution would be to have: {code}SortField(String field, int type, FieldCache.Parser parser, boolean reverse){code} and FieldCache.Parser is a super-interface (just empty, more like a marker-interface) of all other parsers (like LongParser...). The sort implementation then must be changed to respect the given parser (if not NULL), else use the default FieldCache.get without parser. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
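As a rough sketch of the marker super-interface variant proposed above (heavily trimmed; the real FieldCache interface has many more members, and only one parser is shown):

{code}
// Sketch of the proposed API change, not committed code.
public interface FieldCache {

  /** Proposed empty marker super-interface for all parsers. */
  public interface Parser {}

  /** Existing parser, retrofitted to extend the marker. */
  public interface IntParser extends Parser {
    int parseInt(String value);
  }
}
{code}

Because Parser declares no methods, existing parser implementations outside Lucene would stay source and binary compatible; only code that wants the new SortField constructor needs to know about the marker.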
[jira] Updated: (LUCENE-1478) Missing possibility to supply custom FieldParser when sorting search results
[ https://issues.apache.org/jira/browse/LUCENE-1478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-1478: -- Description: When implementing the new TrieRangeQuery for contrib (LUCENE-1470), I was confronted by the problem that the special trie-encoded values (which are longs in a special encoding) cannot be sorted by Searcher.search() and SortField. The problem is: If you use SortField.LONG, you get NumberFormatExceptions. The trie encoded values may be sorted using SortField.String (as the encoding is in such a way, that they are sortable as Strings), but this is very memory ineffective. ExtendedFieldCache gives the possibility to specify a custom LongParser when retrieving the cached values. But you cannot use this during searching, because there is no possibility to supply this custom LongParser to the SortField. I propose a change in the sort classes: Include a pointer to the parser instance to be used in SortField (if not given use the default). My idea is to create a SortField using a new constructor {code}SortField(String field, int type, Object parser, boolean reverse){code} The parser is "object" because all current parsers have no super-interface. The ideal solution would be to have: {code}SortField(String field, int type, FieldCache.Parser parser, boolean reverse){code} and FieldCache.Parser is a super-interface (just empty, more like a marker-interface) of all other parsers (like LongParser...). The sort implementation then must be changed to respect the given parser (if not NULL), else use the default FieldCache.get without parser. was: When implementing the new TrieRangeQuery for contrib (LUCENE-1470), I was confronted by the problem that the special trie-encoded values (which are longs in a special encoding) cannot be sorted by Searcher.search() and SortField. The problem is: If you use SortField.LONG, you get NumberFormatExceptions. The trie encoded values may be sorted using SortField.String (as the encoding is in such a way, that they are sortable as Strings), but this is very memory ineffective. ExtendedFieldCache gives the possibility to specify a custom LongParser when retrieving the cached values. But you cannot use this during searching, because there is no possibility to supply this custom LongParser to the SortField. I propose a change in the sort classes: Include a pointer to the parser instance to be used in SortField (if not given use the default). My idea is to create a SortField using a new constructor {code}SortField(String field, int type, Object parser, boolean reverse){code} The parser is "object" bcause all parsers have no super-interface. The ideal solution would be to have: {code}SortField(String field, int type, FieldCache.Parser parser, boolean reverse){code} and FieldCache.Parser is a super-interface (just empty, more like a marker-interface) of all other parsers (like LongParser...). The sort implementation then must be changed to respect the given parser (if not NULL), else use the default FieldCache.get without parser. 
> Missing possibility to supply custom FieldParser when sorting search results > > > Key: LUCENE-1478 > URL: https://issues.apache.org/jira/browse/LUCENE-1478 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Affects Versions: 2.4 >Reporter: Uwe Schindler >
[jira] Commented: (LUCENE-1461) Cached filter for a single term field
[ https://issues.apache.org/jira/browse/LUCENE-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653361#action_12653361 ] Otis Gospodnetic commented on LUCENE-1461: -- Is this related to LUCENE-855? The same? Aha, I see Paul asked the reverse question in LUCENE-855 already... Tim? > Cached filter for a single term field > - > > Key: LUCENE-1461 > URL: https://issues.apache.org/jira/browse/LUCENE-1461 > Project: Lucene - Java > Issue Type: New Feature >Reporter: Tim Sturge >Assignee: Michael McCandless > Fix For: 2.9 > > Attachments: DisjointMultiFilter.java, FieldCacheRangeFilter.patch, > LUCENE-1461.patch, LUCENE-1461a.patch, LUCENE-1461b.patch, > LUCENE-1461c.patch, RangeMultiFilter.java, RangeMultiFilter.java, > TermMultiFilter.java, TestFieldCacheRangeFilter.patch > > > These classes implement inexpensive range filtering over a field containing a > single term. They do this by building an integer array of term numbers > (storing the term->number mapping in a TreeMap) and then implementing a fast > integer comparison based DocSetIdIterator. > This code is currently being used to do age range filtering, but could also > be used to do other date filtering or in any application where there need to > be multiple filters based on the same single term field. I have an untested > implementation of single term filtering and have considered but not yet > implemented term set filtering (useful for location based searches) as well. > The code here is fairly rough; it works but lacks javadocs and toString() and > hashCode() methods etc. I'm posting it here to discover if there is other > interest in this feature; I don't mind fixing it up but would hate to go to > the effort if it's not going to make it into Lucene. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1473) Implement standard Serialization across Lucene versions
[ https://issues.apache.org/jira/browse/LUCENE-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653378#action_12653378 ] John Wang commented on LUCENE-1473: --- Mike: Suppose you have class A implementing Serializable, with a defined suid, say 1. Let A2 be a newer version of class A with the suid unchanged (still 1), and let A2 have a new field. Imagine A is running in VM1 and A2 is running in VM2. Serialization between VM1 and VM2 of class A is OK; A just will not get the new field, which is fine since VM1 does not make use of it. You can argue that A2 will not get the needed field from a serialized A, but isn't that better than crashing? In either case, I think the behavior is better than it is currently. (Maybe that's why Eclipse and FindBugs both report the lack of a suid definition in Lucene code as a warning.) I agree that adding an Externalizable implementation is more work, but it would make the serialization story correct. -John > Implement standard Serialization across Lucene versions > --- > > Key: LUCENE-1473 > URL: https://issues.apache.org/jira/browse/LUCENE-1473 > Project: Lucene - Java > Issue Type: Bug > Components: Search >Affects Versions: 2.4 >Reporter: Jason Rutherglen >Priority: Minor > Attachments: LUCENE-1473.patch > > Original Estimate: 8h > Remaining Estimate: 8h > > -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
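As a hypothetical illustration of the scenario described here (class name and fields invented for the example), default Java serialization with a fixed serialVersionUID tolerates added fields in both directions:

{code}
import java.io.Serializable;

// Version of A deployed in VM1.
public class A implements Serializable {
  private static final long serialVersionUID = 1L;
  public int count;
}

// Newer version deployed in VM2: same suid, one extra field. With default
// serialization, a stream written by VM1 deserializes here with 'extra'
// left at its default (null), and a stream written by VM2 deserializes in
// VM1 with 'extra' silently dropped - the tolerant behavior argued for
// above. (Shown commented out, since both versions cannot coexist in one
// compilation unit.)
//
// public class A implements Serializable {
//   private static final long serialVersionUID = 1L;
//   public int count;
//   public String extra;
// }
{code}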
Re: [jira] Commented: (LUCENE-1473) Implement Externalizable in main top level searcher classes
John Wang wrote: I agree with the process itself, what would make it better is some transparency on how patches/issues are evaluated to be committed. To be clear: there is no forum for communication about patches except this list, and, by extension, Jira. The process of patch evaluation is completely transparent. At least seemed from the outside, it is purely being decided on by the committers, and since my understanding is that an open source project belongs to the public, the public user base should have some say. It is not a democracy, it is a meritocracy. http://www.apache.org/foundation/how-it-works.html#meritocracy I'll repeat: committers are added when they've both contributed a series of high-quality, easy-to-commit patches, and when they've demonstrated that they are easy to work with. That process has resulted in the current set of committers, and those committers determine which patches are committed and when. Those are the rules. However, committers cannot ram just any patch through. Committers are only added after they've demonstrated the ability to build consensus around their patches. And they must continue to build consensus around their patches even after they are committers. Patches that receive no endorsement from others are not committed, no matter who contributes them. A contribution is not more rapidly committed simply because the contributor is a committer. Rather, committers know how to elicit and respond to criticism and build consensus around a patch in order to get it committed rapidly. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1461) Cached filter for a single term field
[ https://issues.apache.org/jira/browse/LUCENE-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653408#action_12653408 ] Tim Sturge commented on LUCENE-1461: That's amazing. LUCENE-855 (the FieldCacheRangeFilter part) is pretty much identical in purpose and design, down to the name. The major implementation difference is that it overloaded BitSet, which was necessary prior to the addition of DocIdSetIterator. Thus my implementation looks significantly cleaner even though it is basically functionally identical. I think this shows that any decent idea will be repeatedly reinvented until it is widely enough known. I personally would have saved some time in both conceptualization and implementation had I been aware of this. I would very much like to credit Matt in CHANGES.txt for this as well; it seems like an accident of fate that I'm not using his implementation today. > Cached filter for a single term field > - > > Key: LUCENE-1461 > URL: https://issues.apache.org/jira/browse/LUCENE-1461 > Project: Lucene - Java > Issue Type: New Feature >Reporter: Tim Sturge >Assignee: Michael McCandless > Fix For: 2.9 > > Attachments: DisjointMultiFilter.java, FieldCacheRangeFilter.patch, > LUCENE-1461.patch, LUCENE-1461a.patch, LUCENE-1461b.patch, > LUCENE-1461c.patch, RangeMultiFilter.java, RangeMultiFilter.java, > TermMultiFilter.java, TestFieldCacheRangeFilter.patch > > -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1473) Implement standard Serialization across Lucene versions
[ https://issues.apache.org/jira/browse/LUCENE-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653413#action_12653413 ] Doug Cutting commented on LUCENE-1473: -- > Serialization between VM1 and VM2 of class A is OK; A just will not get > the new field, which is fine since VM1 does not make use of it. But VM1 might require an older field that the new field replaced, and VM1 may then crash in an unpredictable way. Not defining explicit suids is more conservative: you get a well-defined exception when things might not work. Defining suids but doing nothing else about compatibility is playing fast-and-loose: it might work in many cases, but it also might cause strange, hard-to-diagnose problems in others. If we want Lucene to work reliably across versions, then we need to commit to that goal as a project, define the limits of the compatibility, implement Externalizable, add tests, etc. Just adding suids doesn't achieve that, so far as I can see. > Implement standard Serialization across Lucene versions > --- > > Key: LUCENE-1473 > URL: https://issues.apache.org/jira/browse/LUCENE-1473 > Project: Lucene - Java > Issue Type: Bug > Components: Search >Affects Versions: 2.4 >Reporter: Jason Rutherglen >Priority: Minor > Attachments: LUCENE-1473.patch > > Original Estimate: 8h > Remaining Estimate: 8h > > -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries
[ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653414#action_12653414 ] Tim Sturge commented on LUCENE-855: --- Matt, Andy, Please take a look at LUCENE-1461. As far as I can tell it is identical in purpose and design to this patch. Matt, I would like to add you to the CHANGES.txt credits for LUCENE-1461. Are you OK with that? > MemoryCachedRangeFilter to boost performance of Range queries > - > > Key: LUCENE-855 > URL: https://issues.apache.org/jira/browse/LUCENE-855 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Affects Versions: 2.1 >Reporter: Andy Liu > Attachments: contrib-filters.tar.gz, FieldCacheRangeFilter.patch, > FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, > FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, > FieldCacheRangeFilter.patch, FieldCacheRangeFilter_Lucene_2.3.0.patch, > MemoryCachedRangeFilter.patch, MemoryCachedRangeFilter_1.4.patch, > TestRangeFilterPerformanceComparison.java, > TestRangeFilterPerformanceComparison.java > > > Currently RangeFilter uses TermEnum and TermDocs to find documents that fall > within the specified range. This requires iterating through every single > term in the index and can get rather slow for large document sets. > MemoryCachedRangeFilter reads all (docId, value) pairs of a given field, > sorts by value, and stores them in a SortedFieldCache. During bits(), binary > searches are used to find the start and end indices of the lower and upper > bound values. The BitSet is populated by all the docId values that fall in > between the start and end indices. > TestMemoryCachedRangeFilterPerformance creates a 100K RAMDirectory-backed > index with random date values within a 5 year range. Executing bits() 1000 > times on standard RangeQuery using random date intervals took 63904ms. Using > MemoryCachedRangeFilter, it took 876ms. Performance increase is less > dramatic when you have fewer unique terms in a field or fewer documents. > Currently MemoryCachedRangeFilter only works with numeric values (values are > stored in a long[] array) but it can be easily changed to support Strings. A > side "benefit" of the values being stored as longs is that there's no > longer the need to make the values lexicographically comparable, i.e. padding > numeric values with zeros. > The downside of using MemoryCachedRangeFilter is there's a fairly significant > memory requirement. So it's designed to be used in situations where range > filter performance is critical and memory consumption is not an issue. The > memory requirements are: (sizeof(int) + sizeof(long)) * numDocs. > MemoryCachedRangeFilter also requires a warmup step which can take a while to > run on large datasets (it took 40s to run on a 3M document corpus). Warmup > can be called explicitly or is automatically called the first time > MemoryCachedRangeFilter is applied using a given field. > So in summary, MemoryCachedRangeFilter can be useful when: > - Performance is critical > - Memory is not an issue > - Field contains many unique numeric values > - Index contains a large number of documents -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
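To make the mechanism concrete, here is a minimal, self-contained sketch of the technique the description outlines; this is illustrative code with invented names, not the attached patch:

{code}
import java.util.Arrays;
import java.util.BitSet;
import java.util.Comparator;

public class SortedValueRangeSketch {
  private final long[] values; // field values, ascending
  private final int[] docs;    // docs[i] is the doc holding values[i]

  // The warmup step: sort (docId, value) pairs by value.
  public SortedValueRangeSketch(final long[] valueByDoc) {
    final int n = valueByDoc.length;
    Integer[] order = new Integer[n];
    for (int i = 0; i < n; i++) order[i] = Integer.valueOf(i);
    Arrays.sort(order, new Comparator<Integer>() {
      public int compare(Integer a, Integer b) {
        long x = valueByDoc[a.intValue()], y = valueByDoc[b.intValue()];
        return x < y ? -1 : (x == y ? 0 : 1);
      }
    });
    values = new long[n];
    docs = new int[n];
    for (int i = 0; i < n; i++) {
      docs[i] = order[i].intValue();
      values[i] = valueByDoc[docs[i]];
    }
  }

  // bits(): binary search instead of a TermEnum scan.
  public BitSet bits(long lower, long upper) {
    BitSet result = new BitSet(docs.length);
    for (int i = firstAtLeast(lower);
         i < values.length && values[i] <= upper; i++) {
      result.set(docs[i]);
    }
    return result;
  }

  // First index whose value is >= target.
  private int firstAtLeast(long target) {
    int i = Arrays.binarySearch(values, target);
    if (i < 0) return -i - 1;
    while (i > 0 && values[i - 1] == target) i--; // leftmost duplicate
    return i;
  }
}
{code}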
[jira] Commented: (LUCENE-1473) Implement standard Serialization across Lucene versions
[ https://issues.apache.org/jira/browse/LUCENE-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653421#action_12653421 ] robert engels commented on LUCENE-1473: --- Even if you changed SUIDs based on version changes, there is the very real possibility that the new code CAN'T be instantiated in any meaningful way from the old data. Then what would you do? Even if you had all of the old classes and their dependencies available via dynamic classloading, it still won't work UNLESS every new feature is designed with backwards compatibility with previous versions - a burden that is just too great when required of all Lucene code. Given that, as has been discussed, there are other formats that can be used where isolated backwards persistence is desired (like XML based query descriptions). Even these won't work if the XML description references explicit classes - which is why designing such a format for a near limitless query structure (given user defined query classes) is probably impossible. So strive for a decent solution that covers most cases, and fails gracefully when it can't work. Using standard serialization (with proper transient fields) seems to fit this bill: in a stable API, most core classes should remain fairly constant, and those that are bound to change may take explicit steps in their serialization (if deemed necessary). > Implement standard Serialization across Lucene versions > --- > > Key: LUCENE-1473 > URL: https://issues.apache.org/jira/browse/LUCENE-1473 > Project: Lucene - Java > Issue Type: Bug > Components: Search >Affects Versions: 2.4 >Reporter: Jason Rutherglen >Priority: Minor > Attachments: LUCENE-1473.patch > > Original Estimate: 8h > Remaining Estimate: 8h > > -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: [jira] Commented: (LUCENE-1473) Implement Externalizable in main top level searcher classes
To put things in perspective, I believe Microsoft (who could potentially place a lot of resources towards Lucene) now uses Lucene through Powerset(?), and I don't think those folks are contributing back. I know of several other companies who do the same, and many potential contributions that are not submitted because people and their companies do not see the benefit of going through the hoops required to get patches committed. A relatively simple patch such as 1473 Serialization represents this well. For example, if a company is developing custom search algorithms, Lucene supports TF/IDF but not much else. Custom search algorithms require rewriting lots of Lucene code. Companies who write new search algorithms do not necessarily want to rewrite Lucene as well to make it pluggable for new scoring, as that is out of scope; they will simply branch the code. It does not help that the core APIs underneath IndexReader are protected and package-protected, which assumes a user who is not advanced. It is repeated on the mailing lists that new features will threaten the existing user base, which is based on opinion rather than fact. More advanced users are currently hindered by the conservatism of the project and so naturally have stopped trying to submit changes that alter the core non-public code. The rancor is from users who would benefit from a faster pace and the ability to be more creative inside the core Lucene system. As the internals change frequently and unannounced, the process of developing core patches is difficult and frustrating. Now that Lucene is stable and flexible indexing is being implemented, it would benefit the community to focus on the future. Who exactly is responsible for this? Which of the committers are building for the future? Which are doing bug fixes? What is the process of developing more advanced features in open source? Right now it seems to be one person, Michael McCandless, developing all of the new core code. This is great forward progress; however, it's unclear how others can get involved and not get stampeded by the constant changes that all happen via one brilliant person. I have asked people such as Michael Busch to collaborate on the column-stride fields and received no response. To me, a good example of volunteers is people who prepare food and donate their time at soup kitchens with no pay, and no hope of pay, related to feeding the hungry. -J On Wed, Dec 3, 2008 at 2:52 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote: > > On Dec 3, 2008, at 2:27 PM, Jason Rutherglen (JIRA) wrote: > > >> >> Hoss wrote: "sort of mythical "Lucene powerhouse" >> Lucene seems to run itself quite differently than other open source Java >> projects. Perhaps it would be good to spell out the reasons for the >> reluctance to move ahead with features that developers work on, that work, >> but do not go in. The developer contributions seem to be quite low right >> now, especially compared to neighbor projects such as Hadoop. Is this >> because fewer people are using Lucene? Or is it due to the reluctance to >> work with the developer community? Unfortunately the perception in the eyes >> of some people who work on search related projects it is the latter. >> > > > Or, could it be that Hadoop is relatively new and in vogue at the moment, > very malleable and buggy(?) 
and has a HUGE corporate sponsor who dedicates > lots of resources to it on a full time basis, whilst Lucene has been around > in the ASF for 7+ years (and 12+ years total) and has a really large install > base and thus must move more deliberately and basically has 1 person who > gets to work on it full time while the rest of us pretty much volunteer? > That's not an excuse, it's just the way it is. I personally, would love to > work on Lucene all day every day as I have a lot of things I'd love to > engage the community on, but the fact is I'm not paid to do that, so I give > what I can when I can. I know most of the other committers are that way > too. > > Thus, I don't think any one of us has a reluctance to move ahead with > features or bug fixes. Looking at CHANGES.txt, I see a lot of > contributors. Looking at java-dev and JIRA, I see lots of engagement with > the community. Is it near the historical high for traffic, no it's not, but > that isn't necessarily a bad thing. I think it's a sign that Lucene is > pretty stable. > > What we do have a reluctance for are patches that don't have tests (i.e. > this one), patches that massively change Lucene APIs in non-trivial ways or > break back compatibility or are not kept up to date. Are we perfect? Of > course not. I, personally, would love for there to be a way that helps us > process a larger volume of patches (note, I didn't say commit a larger > volume). Hadoop's automated patch tester would be a huge start in that, but > at the end of the day, Lucene still works the way all ASF projects do: via > meritocracy and volunteerism. You want stuff com
Re: [jira] Commented: (LUCENE-1473) Implement Externalizable in main top level searcher classes
Jason Rutherglen wrote: A relatively simple patch such as 1473 Serialization represents this well. LUCENE-1473 is an incomplete patch that proposes to commit the project to new back-compatibility requirements. Compatibility requirements should not be added lightly, but only deliberately, as they have a long-term impact on the ability of the project to evolve. Prior to this we've not heard from folks who require cross-version java serialization compatibility. Without more folks asserting this as a need it is hard to rationalize adding this. As the internals change frequently and unannounced, the process of developing core patches is difficult and frustrating. The process is entirely in public. You have as much announcement as anyone. Patches are weighed on their merits as they are contributed. It would benefit the community to focus on the future. Who exactly is responsible for this? Which of the committers are building for the future? Which are doing bug fixes? What is the process of developing more advanced features in open source? I've already explained the process several times. We cannot easily make a long-term plan when we do not have the power to assign folks. We can state long-term goals, like flexible indexing, but in the end, it won't get done until someone volunteers to write the code. So you're welcome to start a wish list on the wiki, and you're welcome to then start contributing patches that implement items on your wish list. If you propose something that folks think is extremely useful, but requires an incompatible change, then it could perhaps be done in a branch. But most of the existing community is interested in pushing forward incrementally, trying hard to keep most things back-compatible. If that's too frustrating for you, you can fork Lucene and build a new community. Right now it seems to be one person, Michael McCandless, developing all of the new core code. Mike does a lot of development, but he also commits a lot of patches written by others. This is great forward progress; however, it's unclear how others can get involved and not get stampeded by the constant changes that all happen via one brilliant person. You want Mike to do less? Others can and do get involved all the time. Look at http://tinyurl.com/5nl78n. The majority of the things Mike works on are instigated by others. I have asked people such as Michael Busch to collaborate on the column-stride fields and received no response. Did you pay Michael? No one here is compelled to work with anyone else. We work with others when we feel it is in our mutual self interest. Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries
[ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653450#action_12653450 ] Andy Liu commented on LUCENE-855: - Yes, it looks the same. Glad this will finally make it to the source! > MemoryCachedRangeFilter to boost performance of Range queries > - > > Key: LUCENE-855 > URL: https://issues.apache.org/jira/browse/LUCENE-855 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Affects Versions: 2.1 >Reporter: Andy Liu > Attachments: contrib-filters.tar.gz, FieldCacheRangeFilter.patch, > FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, > FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, > FieldCacheRangeFilter.patch, FieldCacheRangeFilter_Lucene_2.3.0.patch, > MemoryCachedRangeFilter.patch, MemoryCachedRangeFilter_1.4.patch, > TestRangeFilterPerformanceComparison.java, > TestRangeFilterPerformanceComparison.java > > -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
jira attachments ?
I am having a problem posting an attachment to Jira. Just spins, and spins... Everything else seems to work fine (comments, etc.). Anyone else experiencing this? Thanks. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: [jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries
I can't seem to post to Jira, so I am attaching here... I attached QueryFilter.java. In reading this patch, and other similar ones, the problem seems to be that if the index is modified, the cache is invalidated, causing a complete reload of the cache. Do I have this correct? The attached patch works really well in a highly interactive environment, as the cache is only invalidated at the segment level. The MyMultiReader is a subclass that allows access to the underlying SegmentReaders. The patch cannot be applied as-is, but I think the implementation works far better in many cases - it is also far less memory intensive. Scanning the bitset could also be optimized very easily using internal skip values. Maybe this is completely off-base, but the solution has worked very well for us. Maybe this is a completely different issue and a separate incident should be opened? Is there any interest in this? QueryFilter.java Description: Binary data On Dec 4, 2008, at 2:10 PM, Andy Liu (JIRA) wrote: [ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653450#action_12653450 ] Andy Liu commented on LUCENE-855: - Yes, it looks the same. Glad this will finally make it to the source!
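A hedged sketch of the per-segment idea described in this mail, not the attached QueryFilter.java: cache one BitSet per segment reader, keyed weakly, so that reopening the index only recomputes bits for segments that actually changed. It assumes the caller can reach the underlying SegmentReaders (the role played by MyMultiReader above):

{code}
import java.io.IOException;
import java.util.BitSet;
import java.util.Map;
import java.util.WeakHashMap;

import org.apache.lucene.index.IndexReader;

public abstract class PerSegmentCachingFilter {
  // Weak keys: entries vanish when a segment reader is closed and collected.
  private final Map<IndexReader, BitSet> cache =
      new WeakHashMap<IndexReader, BitSet>();

  /** The expensive part: compute filter bits for one segment. */
  protected abstract BitSet computeBits(IndexReader segmentReader)
      throws IOException;

  /** Cached bits for an unchanged segment; computed once otherwise. */
  public synchronized BitSet bits(IndexReader segmentReader)
      throws IOException {
    BitSet bits = cache.get(segmentReader);
    if (bits == null) {
      bits = computeBits(segmentReader);
      cache.put(segmentReader, bits);
    }
    return bits;
  }
}
{code}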
Re: [jira] Commented: (LUCENE-1473) Implement Externalizable in main top level searcher classes
Correction: Powerset apparently did not use Lucene. And apparently there are a few other companies who, while not open sourcing their code, use Lucene serialization regularly. > Did you pay Michael? No one here is compelled to work with anyone else. We work with others when we feel it is in our mutual self interest. Nice... I guess our government is the macrocosm.
[jira] Updated: (LUCENE-1478) Missing possibility to supply custom FieldParser when sorting search results
[ https://issues.apache.org/jira/browse/LUCENE-1478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-1478: -- Attachment: LUCENE-1478-no-superinterface.patch Attached is a patch that implements the first variant (without a super-interface for all FieldParsers). All current tests pass. A special test case for a custom field parser was not implemented. For testing, I modified one of my contrib TrieRangeQuery test cases locally to sort using a custom LongParser that decodes the encoded longs in the cache [parseLong(value) returns TrieUtils.trieCodedToLong(value)]. A good test case would be to store some dates in ISO format in a field and then sort them as longs after parsing with SimpleDateFormat. This would be another typical use case (sorting by date, but not using SortField.STRING, to minimize memory usage). If you like my patch, we could also discuss using a super-interface for all Parsers. The modifications are rather simple (only the SortField constructor and some casts would be affected, plus of course the super-interface in all declarations inside FieldCache and ExtendedFieldCache). > Missing possibility to supply custom FieldParser when sorting search results > > > Key: LUCENE-1478 > URL: https://issues.apache.org/jira/browse/LUCENE-1478 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Affects Versions: 2.4 >Reporter: Uwe Schindler > Attachments: LUCENE-1478-no-superinterface.patch > > -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
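A hedged sketch of that suggested test-case parser (the class name is invented; it assumes the parser-aware SortField constructor from the attached patch):

{code}
import java.text.ParseException;
import java.text.SimpleDateFormat;

import org.apache.lucene.search.ExtendedFieldCache;

// Turns ISO-formatted date strings into sortable longs (ms since epoch).
public class IsoDateLongParser implements ExtendedFieldCache.LongParser {
  public long parseLong(String value) {
    try {
      // A new SimpleDateFormat per call, since the class is not thread-safe.
      return new SimpleDateFormat("yyyy-MM-dd").parse(value).getTime();
    } catch (ParseException e) {
      throw new RuntimeException("unparsable date: " + value, e);
    }
  }
}
{code}

Usage under the proposed constructor would then be something like new SortField("date", SortField.LONG, new IsoDateLongParser(), false).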
RE: jira attachments ?
Hi Robert, two minutes ago I uploaded a patch... Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: [EMAIL PROTECTED] > From: robert engels [mailto:[EMAIL PROTECTED] > Sent: Thursday, December 04, 2008 9:37 PM > To: java-dev@lucene.apache.org > Subject: jira attachments ? > > I am having a problem posting an attachment to Jira. Just spins, and > spins... > > Everything else seems to work fine (comments, etc.). > > Anyone else experiencing this? > > Thanks. > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1390) add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter
[ https://issues.apache.org/jira/browse/LUCENE-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653500#action_12653500 ] Robert Muir commented on LUCENE-1390: - its a bit slower, but the difference is minor. i just ran some tests with some cpu-bound indexes that i build (these filters are right at the top of hprof.txt). i ran em a couple times and it looks like this... not very scientific but it gives an idea. ASCIIFolding filter index time (ms): 143365 ISOLatin1Accent filter index time (ms): 134649 > add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter > > > Key: LUCENE-1390 > URL: https://issues.apache.org/jira/browse/LUCENE-1390 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis > Environment: any >Reporter: Andi Vajda >Assignee: Mark Miller >Priority: Minor > Fix For: 2.9 > > Attachments: ASCIIFoldingFilter.patch, ASCIIFoldingFilter.patch, > ASCIIFoldingFilter.patch > > > The ISOLatin1AccentFilter is removing accents from accented characters in the > ISO Latin 1 character set. > It does what it does and there is no bug with it. > It would be nicer, though, if there was a more comprehensive version of this > code that included not just ISO-Latin-1 (ISO-8859-1) but the entire Latin 1 > and Latin Extended A unicode blocks. > See: http://en.wikipedia.org/wiki/Latin-1_Supplement_unicode_block > See: http://en.wikipedia.org/wiki/Latin_Extended-A_unicode_block > That way, all languages using roman characters are covered. > A new class, ISOLatinAccentFilter is attached. It is intended to supersede > ISOLatin1AccentFilter which should get deprecated. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: jira attachments ?
Dear God, I've been blocked! What will the Lucene community do! :) On Dec 4, 2008, at 3:27 PM, Uwe Schindler wrote: Hi Robert, two minutes ago I uploaded a patch... Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: [EMAIL PROTECTED] From: robert engels [mailto:[EMAIL PROTECTED] Sent: Thursday, December 04, 2008 9:37 PM To: java-dev@lucene.apache.org Subject: jira attachments ? I am having a problem posting an attachment to Jira. Just spins, and spins... Everything else seems to work fine (comments, etc.). Anyone else experiencing this? Thanks. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: [jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries
I keep looking at LUCENE-831, which is a new version of FieldCache that is compatible with IndexReader.reopen() and invalidates only reloaded segments. With each release of Lucene I am very unhappy, because it is still not in. The same problem as yours occurs if you have a one-million-document index that is updated by adding a few documents each half hour. If you sort by a field, whenever the index is reopened (even though really only a very small segment was added) the complete FieldCache is rebuilt, which is very bad :(. So I think the ultimate fix would be to hopefully apply LUCENE-831 soon and also use LUCENE-1461 as a RangeFilter cache. - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: [EMAIL PROTECTED] From: robert engels [mailto:[EMAIL PROTECTED] Sent: Thursday, December 04, 2008 9:39 PM To: java-dev@lucene.apache.org Subject: Re: [jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries I can't seem to post to Jira, so I am attaching here... I attached QueryFilter.java. In reading this patch, and other similar ones, the problem seems to be that if the index is modified, the cache is invalidated, causing a complete reload of the cache. Do I have this correct? The attached patch works really well in a highly interactive environment, as the cache is only invalidated at the segment level. The MyMultiReader is a subclass that allows access to the underlying SegmentReaders. The patch cannot be applied, but I think the implementation works far better in many cases - it is also far less memory intensive. Scanning the bitset could also be optimized very easily using internal skip values. Maybe this is completely off-base, but the solution has worked very well for us. Maybe this is a completely different issue and separate incident should be opened ? is there any interest in this? - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: [jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries
Lucene-831 is far more comprehensive. I also think that by exposing access to the sub-readers it can be far simpler (closer to what I have provided).

In the mean-time, you should be able to use the provided class with a few modifications. The "reload the entire cache" was a deal breaker for us, so I came up with the attached. Works very well.

On Dec 4, 2008, at 3:54 PM, Uwe Schindler wrote:

I am looking all the time at LUCENE-831, which is a new version of FieldCache that is compatible with IndexReader.reopen() and invalidates only reloaded segments. [...]
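The same segment-keyed pattern applies to filter bits as well as field values. A hypothetical sketch of the central idea of the attached QueryFilter (the attachment itself is not reproduced here, so this class is illustrative; it uses the BitSet-returning Filter.bits() API of that era, and callers are expected to pass each underlying SegmentReader rather than the top-level reader):

{code}
import java.io.IOException;
import java.util.BitSet;
import java.util.WeakHashMap;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Filter;

// Hypothetical segment-level filter cache: bits are cached per segment
// reader, so a reopen only recomputes bits for new/changed segments.
public class PerSegmentCachingFilter extends Filter {
  private final Filter filter;
  private final WeakHashMap cache = new WeakHashMap(); // segment reader -> BitSet

  public PerSegmentCachingFilter(Filter filter) {
    this.filter = filter;
  }

  public synchronized BitSet bits(IndexReader segment) throws IOException {
    BitSet bits = (BitSet) cache.get(segment);
    if (bits == null) {
      bits = filter.bits(segment);
      cache.put(segment, bits);
    }
    return bits;
  }
}
{code}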
Re: [jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries
The biggest benefit I see of using the field cache to do filter caching is that the same cache can be used for sorting, thereby improving the performance and memory usage.

The downside I see is that if you have a common filter that is built from many fields, you are going to use a lot more memory, as every field used needs to be cached. With my code you would only have a single "bitset" for the filter.

On Dec 4, 2008, at 4:00 PM, robert engels wrote:

Lucene-831 is far more comprehensive. [...]
Re: [jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries
On Thursday 04 December 2008 23:03:40, robert engels wrote:
> The biggest benefit I see of using the field cache to do filter
> caching is that the same cache can be used for sorting, thereby
> improving the performance and memory usage.

Would it be possible to build such Filter caching into CachingWrapperFilter instead of into QueryFilter? Both filter caching and the field value caching will need access to the underlying (segment?) readers.

> The downside I see is that if you have a common filter that is built
> from many fields, you are going to use a lot more memory, as every
> field used needs to be cached. With my code you would only have a
> single "bitset" for the filter.

But with many ranges that would mean many bitsets, and MemoryCachedRangeFilter only needs to cache the field values once for any number of ranges. It's a tradeoff.

Regards,
Paul Elschot

> On Dec 4, 2008, at 4:00 PM, robert engels wrote:
> > Lucene-831 is far more comprehensive. [...]
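A rough back-of-the-envelope on that tradeoff, using the memory figures from the LUCENE-855 description: with numDocs = 1,000,000, each cached range costs one bitset of numDocs/8 ≈ 122 KB, while MemoryCachedRangeFilter's value cache costs (sizeof(int) + sizeof(long)) * numDocs ≈ 11.4 MB once, shared by every range on that field. So the per-field value cache breaks even at roughly a hundred distinct cached ranges on the same field; below that, per-range bitsets are cheaper, and above it the shared values win.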
[jira] Commented: (LUCENE-1448) add getFinalOffset() to TokenStream
[ https://issues.apache.org/jira/browse/LUCENE-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653513#action_12653513 ] Michael Busch commented on LUCENE-1448:
---

{quote}
Another option is to "define" the API such that when incrementToken() returns false, then it has actually advanced to an "end-of-stream token". OffsetAttribute.getEndOffset() should return the final offset. Since we have not released the new API, we could simply make this change (and fix all instances in the core/contrib that use the new API accordingly). I think I like this option best.
{quote}

This adds some "cleaning up" responsibilities to all existing TokenFilters out there. So far it is very straightforward to change an existing TokenFilter to use the new API. You simply have to:
- add the attributes the filter needs in its constructor
- change next() to incrementToken(), and change return calls that returned null to false, others to true (or whatever input returns)
- don't access a Token, but the appropriate attributes, to set the data

(A sketch of a filter converted along these lines follows this message.)

But maybe there's a custom filter at the end of the chain that returns more tokens even after its input returned the last one. For example, a SynonymExpansionFilter might return a synonym for the last word it received from its input before it returns false. In this case it might overwrite the endOffset that another filter/stream already set to the final endOffset. It needs to cache that value and set it when it returns false. Also, all filters that currently use an offset now need to know to clean up before returning false.

I'm not saying this is necessarily bad. I also find this approach tempting, because it's simple. But it might be a common pitfall for bugs?

What I'd like to work on soon is an efficient way to buffer attributes (maybe add methods to Attribute that write into a byte buffer). Then attributes can implement which variables need to be serialized and which ones don't. In that case we could add a finalOffset to OffsetAttribute that does not get serialized/deserialized.

And possibly it might be worthwhile to have explicit states defined in a TokenStream that we can enforce with three methods: start(), increment(), end(). Then people would know that if they have to do something at the end of a stream, they have to do it in end().

> add getFinalOffset() to TokenStream
> ---
>
> Key: LUCENE-1448
> URL: https://issues.apache.org/jira/browse/LUCENE-1448
> Project: Lucene - Java
> Issue Type: Bug
> Components: Analysis
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch
>
> If you add multiple Fieldable instances for the same field name to a document, and you then index those fields with TermVectors storing offsets, it's very likely the offsets for all but the first field instance will be wrong.
> This is because IndexWriter under the hood adds a cumulative base to the offsets of each field instance, where that base is 1 + the endOffset of the last token it saw when analyzing that field.
> But this logic is overly simplistic. For example, if the WhitespaceAnalyzer is being used, and the text being analyzed ended in 3 whitespace characters, then that information is lost and the next field's offsets are then all 3 too small. Similarly, if a StopFilter appears in the chain, and the last N tokens were stop words, then the base will be 1 + the endOffset of the last non-stopword token.
> To fix this, I'd like to add a new getFinalOffset() to TokenStream. I'm thinking by default it returns -1, which means "I don't know so you figure it out", meaning we fall back to the faulty logic we have today.
> This has come up several times on the user's list.
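The sketch referenced above: a trivial TokenFilter converted from the old next() API, following the three steps Michael lists. This assumes the new attribute-based TokenStream API on trunk at the time (TermAttribute, addAttribute()); the filter itself is made up for illustration:

{code}
import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

// Illustrative filter on the new API: attributes are acquired in the
// constructor, and incrementToken() returns false where next() used to
// return null.
public class LowerCaseishFilter extends TokenFilter {
  private final TermAttribute termAtt;

  public LowerCaseishFilter(TokenStream input) {
    super(input);
    // step 1: add the attributes the filter needs in its constructor
    termAtt = (TermAttribute) addAttribute(TermAttribute.class);
  }

  public boolean incrementToken() throws IOException {
    // step 2: return what input returns instead of null/Token
    if (!input.incrementToken()) {
      return false;
    }
    // step 3: work on attributes instead of a Token instance
    final char[] buffer = termAtt.termBuffer();
    final int length = termAtt.termLength();
    for (int i = 0; i < length; i++) {
      buffer[i] = Character.toLowerCase(buffer[i]);
    }
    return true;
  }
}
{code}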
Re: [jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries
It would be cool to be able to explicitly list subreaders that were added/removed as a result of reopen(), or have some kind of notification mechanism. We have filter caches and custom field/sort caches here, and they are all reader-bound. Currently the warm-up delay is negated by reopening and warming up in the background before switching to the new reader/caches, but it still limits our minimum between-reopens delay.

On Fri, Dec 5, 2008 at 01:03, robert engels <[EMAIL PROTECTED]> wrote:
> The biggest benefit I see of using the field cache to do filter caching is
> that the same cache can be used for sorting, thereby improving the
> performance and memory usage. [...]

--
Kirill Zakharenko/Кирилл Захаренко ([EMAIL PROTECTED])
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785
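For illustration, the kind of diff such a notification could hand to reader-bound caches, assuming some accessor exposes the sub-readers before and after reopen() (no such public accessor existed at the time; ReopenDiff and its array arguments are hypothetical):

{code}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.lucene.index.IndexReader;

// Hypothetical reopen diff: caches keyed on removed segments can be
// dropped, and only the added segments need to be warmed.
public class ReopenDiff {
  public final List added = new ArrayList();
  public final List removed = new ArrayList();

  public ReopenDiff(IndexReader[] before, IndexReader[] after) {
    Set old = new HashSet(Arrays.asList(before));
    Set current = new HashSet(Arrays.asList(after));
    for (int i = 0; i < after.length; i++) {
      if (!old.contains(after[i])) added.add(after[i]);
    }
    for (int i = 0; i < before.length; i++) {
      if (!current.contains(before[i])) removed.add(before[i]);
    }
  }
}
{code}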
[jira] Commented: (LUCENE-1390) add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter
[ https://issues.apache.org/jira/browse/LUCENE-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653520#action_12653520 ] Robert Muir commented on LUCENE-1390:
-

Sorry, that wasn't a fair test case: a good chunk of those docs contain accents outside of Latin-1, so ASCIIFoldingFilter was doing more work.

I reran on some heavily accented (but Latin-1-only) data and the difference was negligible, 1% or so.

It appears ASCIIFoldingFilter only slows you down versus ISOLatin1AccentFilter in the case where it probably should: when you have accents outside of Latin-1 but are using the Latin-1 filter.
[jira] Updated: (LUCENE-1478) Missing possibility to supply custom FieldParser when sorting search results
[ https://issues.apache.org/jira/browse/LUCENE-1478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-1478:
--
Lucene Fields: [New, Patch Available] (was: [New])

> Missing possibility to supply custom FieldParser when sorting search results
> ----------------------------------------------------------------------------
>
> Key: LUCENE-1478
> URL: https://issues.apache.org/jira/browse/LUCENE-1478
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Search
> Affects Versions: 2.4
> Reporter: Uwe Schindler
> Attachments: LUCENE-1478-no-superinterface.patch
>
> When implementing the new TrieRangeQuery for contrib (LUCENE-1470), I was confronted with the problem that the special trie-encoded values (which are longs in a special encoding) cannot be sorted by Searcher.search() and SortField. The problem is: if you use SortField.LONG, you get NumberFormatExceptions. The trie-encoded values may be sorted using SortField.STRING (as the encoding is such that they are sortable as Strings), but this is very memory-inefficient.
> ExtendedFieldCache gives the possibility to specify a custom LongParser when retrieving the cached values. But you cannot use this during searching, because there is no possibility to supply this custom LongParser to the SortField.
> I propose a change in the sort classes: include a pointer to the parser instance to be used in SortField (if not given, use the default). My idea is to create a SortField using a new constructor
> {code}SortField(String field, int type, Object parser, boolean reverse){code}
> The parser is "Object" because all current parsers have no super-interface. The ideal solution would be to have:
> {code}SortField(String field, int type, FieldCache.Parser parser, boolean reverse){code}
> and FieldCache.Parser is a super-interface (just empty, more like a marker interface) of all other parsers (like LongParser...). The sort implementation then must be changed to respect the given parser (if not null), else use the default FieldCache.get without parser.
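A sketch of how the proposed constructor would be used to sort trie-encoded longs numerically. Everything here is hypothetical until the patch lands: the four-argument SortField constructor is the proposal itself, and trieCodedToLongAuto() stands in for TrieUtils' decoder in contrib:

{code}
import org.apache.lucene.search.ExtendedFieldCache;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.trie.TrieUtils; // contrib/queries (assumed package)

public class TrieSortExample {
  // Proposed usage: hand the custom LongParser to the SortField so the
  // field cache decodes trie-encoded terms instead of parsing them as
  // plain numeric strings (which throws NumberFormatException).
  public static Sort trieLongSort(String field) {
    ExtendedFieldCache.LongParser parser = new ExtendedFieldCache.LongParser() {
      public long parseLong(String value) {
        return TrieUtils.trieCodedToLongAuto(value);
      }
    };
    return new Sort(new SortField(field, SortField.LONG, parser, false));
  }
}
{code}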
[jira] Commented: (LUCENE-1390) add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter
[ https://issues.apache.org/jira/browse/LUCENE-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653539#action_12653539 ] Mark Miller commented on LUCENE-1390:
-

Thanks Robert. I plan to commit this in a few days with the deprecation of the latin1 filter for removal in 3.0.
[jira] Commented: (LUCENE-1470) Add TrieRangeQuery to contrib
[ https://issues.apache.org/jira/browse/LUCENE-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653544#action_12653544 ] Uwe Schindler commented on LUCENE-1470:
---

Hi Mike,

I opened issue LUCENE-1478 and attached a first patch.

About the current issue: I have seen that TrieRangeQuery is missing in /lucene/java/trunk/contrib/queries/README.txt. Can you add it there, or should I write a small patch? I think it should at least be mentioned there with a note on what it is for, but the JavaDocs are much more informative, and the corresponding paper / code credits are cited there.

Thank you very much for helping to get this into Lucene!
Re: [jira] Commented: (LUCENE-1473) Implement Externalizable in main top level searcher classes
On Dec 4, 2008, at 2:21 PM, Jason Rutherglen wrote:

> To put things in perspective, I believe Microsoft (who could potentially place a lot of resources towards Lucene) now uses Lucene through Powerset? and I don't think those folks are contributing back. I know of several other companies who do the same, and many potential contributions that are not submitted because people and their companies do not see the benefit of going through the hoops required to get patches committed. A relatively simple patch such as 1473 Serialization represents this well.

What do you suggest? We didn't force anyone to use Lucene. Heck, most of our users don't even ever participate on the mailing list.

We do provide a very clear, transparent path for making contributions and becoming a committer. I don't know what else we can do, but we're totally open to suggestions on how to improve it.

FWIW, just b/c you think 1473 is trivial doesn't make it so. You have a single use case and that's all you care about. The community has dozens, if not hundreds, of use cases, and your "trivial" patch may not be so trivial in that regard. How would you feel if we "broke" something that you have relied on for years in the name of us moving faster? I am willing to bet the large number of people here in Lucene appreciate our deliberations for the most part. As for my opinion on 1473, I personally think there are better ways of achieving what you are trying to do, as Robert and others have suggested, and I don't think it is worth it to maintain serialization across versions, as it is too large of a burden, IMO. But, heh, make an argument (preferably w/o the accusations) and convince me otherwise.

> For example if a company is developing custom search algorithms, Lucene supports TF/IDF but not much else. Custom search algorithms require rewriting lots of Lucene code. Companies who write new search algorithms do not necessarily want to rewrite Lucene as well to make it pluggable for new scoring, as it is out of scope; they will simply branch the code. It does not help that the core APIs underneath IndexReader are protected and package protected, which assumes a user that is not advanced. It is repeated on the mailing lists that new features will threaten the existing user base, which is based on opinion rather than fact. More advanced users are currently hindered by the conservatism of the project and so naturally have stopped trying to submit changes that alter the core non-public code.

So, you're mad at us for others not contributing back their forks? Even the ones we don't know about? Simply put, I'm sorry we can't please you. If you go read the archives, you will see plenty of times when even us committers have been frustrated from time to time by the process (just look at the JDK 1.5 debate, or the Interface/Abstract debate), but in the end, I feel Lucene is stronger for it. Community over code, it's the Apache Way. You are free to disagree. In fact, you have several options available to you to show that disagreement:

1. You can work to become a committer and change it from within. The bar really isn't that high: 3 to 4 non-trivial patches and a willingness to work with others in a mostly pleasant way.
2. You can make us aware of the patches and be persistent about seeing them through, and we'll try to get to them. Just look at CHANGES.txt and JIRA and you will see that this happens all the time and from a wide variety of contributors (including both you and John).
3. You can fork the code and go do your thing and build your own community, etc.

Personally, I hope you choose 1 or 2, as we're all stronger together than we are apart.

> The rancor is from users who would benefit from a faster pace and the ability to be more creative inside the core Lucene system. As the internals change frequently and unannounced, the process of developing core patches is difficult and frustrating.

I'm sorry that we can't work at a faster pace. Suggestions on how to deal with the number of patches we have and still maintain quality, and how to move forward w/o breaking old patches, are much appreciated. As for the internals changing, you have just hit the nail on the head as to why it is so important to maintain back-compat. I simply don't get the unannounced part. What isn't announced? Geez, I've been a committer for a few years now, and I have yet to see another open source project that is as public as Lucene, for better or worse. Look at the archives, we regularly even put our warts out for public consumption in an effort to improve ourselves.

Rather than continue hijacking this thread, why don't we either let it die and focus on serialization, or we go over to java-dev and you and John and the rest of us can create a concrete list of suggestions that we think could make Lucene better and we can all
[jira] Commented: (LUCENE-1473) Implement standard Serialization across Lucene versions
[ https://issues.apache.org/jira/browse/LUCENE-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653545#action_12653545 ] John Wang commented on LUCENE-1473:
---

The discussion here is whether it is better to have 100% of the time failing vs. 10% of the time failing (these are just meaningless numbers to express a point).

I do buy Doug's comment about getting into a weird state due to data serialization, but this is something Externalizable would solve.

This discussion has digressed to general Java serialization design, where it was originally scoped to only several Lucene classes. If it is documented that Lucene only supports serialization of classes from the same jar, is that really enough? Doesn't it also depend on the compiler, if someone were to build their own jar? Furthermore, in a distributed environment with lots of machines, it is always ideal to upgrade bit by bit. Is taking this functionality away by imposing this restriction a good trade-off versus just implementing Externalizable for a few classes, if Serializable is deemed to be dangerous (which I am not so sure about, given the Lucene classes we are talking about)?

> Implement standard Serialization across Lucene versions
> ---
>
> Key: LUCENE-1473
> URL: https://issues.apache.org/jira/browse/LUCENE-1473
> Project: Lucene - Java
> Issue Type: Bug
> Components: Search
> Affects Versions: 2.4
> Reporter: Jason Rutherglen
> Priority: Minor
> Attachments: LUCENE-1473.patch
>
> Original Estimate: 8h
> Remaining Estimate: 8h
>
> To maintain serialization compatibility between Lucene versions, serialVersionUID needs to be added to classes that implement java.io.Serializable. java.io.Externalizable may be implemented in classes for faster performance.
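For reference, pinning a serialVersionUID is a one-line change per class; it decouples stream compatibility from the compiler's view of the class, which is exactly the "depends on the compiler" concern above. A minimal sketch (ExampleQuery is a made-up stand-in, not a Lucene class):

{code}
import java.io.Serializable;

// Without an explicit serialVersionUID, the JVM derives one from the
// compiled class structure, so different compilers building identical
// source can still refuse to deserialize each other's streams.
public class ExampleQuery implements Serializable {
  private static final long serialVersionUID = 1L; // change only on incompatible changes

  private String field;
  private String text;
}
{code}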
[jira] Updated: (LUCENE-831) Complete overhaul of FieldCache API/Implementation
[ https://issues.apache.org/jira/browse/LUCENE-831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller updated LUCENE-831:
---
Attachment: LUCENE-831.patch

Updated to trunk. I've combined all of the dual (primitive array/ObjectArray) CacheKeys into one. Each cache key can support both modes or throw UnsupportedException or something.

I've also tried something a bit experimental to allow users to eventually use custom or alternate cache keys (payload or sparse arrays or something) that work with internal sorting. A cache implementation can now supply a ComparatorFactory (name will prob be tweaked) that handles creating comparators. You can subclass ComparatorFactory and add new or override currently supported CacheKeys. CustomComparators still need to be twiddled with some.

I've converted some of the sort tests to run with both primitive and object arrays as well.

- Mark

> Complete overhaul of FieldCache API/Implementation
> --
>
> Key: LUCENE-831
> URL: https://issues.apache.org/jira/browse/LUCENE-831
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Search
> Reporter: Hoss Man
> Fix For: 3.0
>
> Attachments: fieldcache-overhaul.032208.diff, fieldcache-overhaul.diff, fieldcache-overhaul.diff, LUCENE-831.03.28.2008.diff, LUCENE-831.03.30.2008.diff, LUCENE-831.03.31.2008.diff, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch
>
> Motivation:
> 1) Complete overhaul of the API/implementation of "FieldCache" type things...
> a) eliminate the global static map keyed on IndexReader (thus eliminating the synch block between completely independent IndexReaders)
> b) allow more customization of cache management (ie: use expiration/replacement strategies, disk backed caches, etc)
> c) allow people to define custom cache data logic (ie: custom parsers, complex datatypes, etc... anything tied to a reader)
> d) allow people to inspect what's in a cache (list of CacheKeys) for an IndexReader so a new IndexReader can be likewise warmed
> e) Lend support for smarter cache management if/when IndexReader.reopen is added (merging of cached data from subReaders)
> 2) Provide backwards compatibility to support the existing FieldCache API with the new implementation, so there is no redundant caching as client code migrates to the new API.
Re: jira attachments ?
Robert, which browser are you using?

Mike

robert engels wrote:
> Dear God, I've been blocked ! What will the Lucene community do ! :) [...]
[jira] Commented: (LUCENE-1473) Implement standard Serialization across Lucene versions
[ https://issues.apache.org/jira/browse/LUCENE-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653553#action_12653553 ] Doug Cutting commented on LUCENE-1473:
--

> This discussion has digressed to general Java serialization design, where it was originally scoped to only several Lucene classes.

Which classes? The existing patch applies to one class. Jason said, "If it looks ok, I will implement Externalizable in other classes." but never said which. It would be good to know how wide the impact of the proposed change would be.
Re: jira attachments ?
I am using Safari 3.2 (on OSX Tiger).

On Dec 4, 2008, at 5:38 PM, Michael McCandless wrote:
> Robert, which browser are you using? [...]
Re: jira attachments ?
Hmmm the only time I've seen this was also with Safari (though on an older version). It caused me to switch [back] to Firefox. Try Firefox?

Mike

robert engels wrote:
> I am using Safari 3.2 (on OSX Tiger). [...]
Re: [jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries
On Dec 4, 2008, at 4:10 PM, Paul Elschot wrote:
> Would it be possible to build such Filter caching into CachingWrapperFilter
> instead of into QueryFilter? Both filter caching and the field value caching
> will need access to the underlying (segment?) readers.

I don't see why not. The QueryFilter extends from that... We are just on a much older code base. Not really sure why this hierarchy exists though, as the only extenders are QueryFilter and CachingWrapperFilterHelper. I would prefer QueryFilter, and then extend that as CachingQueryFilter. I've always been taught that if you see the words Wrapper or Helper, there is probably a design problem, or at least a naming problem.

> But with many ranges that would mean many bitsets, and
> MemoryCachedRangeFilter only needs to cache the field values once for any
> number of ranges. It's a tradeoff.

That was my point. I don't see the field-based caching and the filter-based caching as solving quite the same problem. It is going to depend on the actual usage; that is why I would like to support both.

> Regards,
> Paul Elschot [...]
Re: jira attachments ?
Could be... I will try next time... Seems a strange (and serious) bug in Jira (I have no problems with other "add attachment" sites)...

On Dec 4, 2008, at 5:59 PM, Michael McCandless wrote:
> Hmmm the only time I've seen this was also with Safari (though on an older
> version). It caused me to switch [back] to Firefox. Try Firefox? [...]
[jira] Updated: (LUCENE-831) Complete overhaul of FieldCache API/Implementation
[ https://issues.apache.org/jira/browse/LUCENE-831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Miller updated LUCENE-831:
---
Attachment: LUCENE-831.patch

Couple of needed tweaks and a test for a custom ComparatorFactory.
Re: [jira] Commented: (LUCENE-1473) Implement Externalizable in main top level searcher classes
Hi Grant:

I agree, and I apologize for hijacking this thread. If Luceners feel our criticisms are invalid, then so be it. We should focus on this issue, being the serialization story in Lucene, not general Java serialization, so I don't see how it would benefit to move this to the java-dev list.

As far as Lucene serialization goes, incorporating comments from various people, this is what I gather are the choices (feel free to correct me):

1) Remove implementation and support of Serializable: we all agreed this is bad and breaks backward compatibility.

2) Do nothing to the code base, fix the documentation, and clarify that Lucene only supports serialization between components using the same release jar. This seems to be the suggested approach, where I have a couple of concerns:

a) Given the exact same code base, due to the nature of Java serialization, different builds of the jar via the IBM VM vs. the Sun VM vs. JRockit etc. cannot guarantee compatibility. Thus we are forcing users that care about serialization to use the release jar.

b) There is at least one place, as I have previously mentioned, e.g. ScoreDocComparator, where the contract returns a Comparable that, per the javadoc, must be Serializable. How should this be treated? This can be an application object; should we impose the same restriction there, since when merging/sorting happens across the wire a similar serialization problem would break inside MultiSearcher?

3) Clean up the serialization story, either adding SUIDs or implementing Externalizable for the classes within Lucene that implement Serializable: from what I am told, this is too much work for the committers.

I hope you guys at least agree with me that, the way it is currently, the serialization story is broken, whether in documentation or in code. I see the disagreement being about its severity, and whether it is a trivial fix, which I have learned it is not really my place to say.

Please do understand this is not a far-fetched, made-up use case; we are running into this in production, and we are developing in accordance with the Lucene documentation.

Thanks

-John

On Thu, Dec 4, 2008 at 3:23 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
> On Dec 4, 2008, at 2:21 PM, Jason Rutherglen wrote:
>> To put things in perspective, I believe Microsoft (who could potentially
>> place a lot of resources towards Lucene) now uses Lucene through Powerset?
>> and I don't think those folks are contributing back. [...]
[jira] Commented: (LUCENE-1473) Implement standard Serialization across Lucene versions
[ https://issues.apache.org/jira/browse/LUCENE-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653563#action_12653563 ] John Wang commented on LUCENE-1473:
---

For our problem, it is Query and all its derived and encapsulated classes. I guess the title of the bug is too generic.

As far as my comment about other Lucene classes, one can just go to the Lucene javadoc, click on "Tree", and look for Serializable. If you want me to, I can go and fetch the complete list, but here are some examples:

1) Document (Field etc.)
2) OpenBitSet, Filter ...
3) Sort, SortField
4) Term
5) TopDocs, Hits etc.

for the top-level API.
[jira] Commented: (LUCENE-1470) Add TrieRangeQuery to contrib
[ https://issues.apache.org/jira/browse/LUCENE-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653569#action_12653569 ] Michael McCandless commented on LUCENE-1470:
----

bq. Thank you very much for helping to get this into Lucene!

You're welcome! But, that was the easy part ;) Thank you for creating it & getting it into Lucene!

bq. About the current issue: I have seen that TrieRangeQuery is missing in /lucene/java/trunk/contrib/queries/README.txt.

I agree -- can you create a patch? Thanks.
[jira] Updated: (LUCENE-1473) Implement standard Serialization across Lucene versions
[ https://issues.apache.org/jira/browse/LUCENE-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Rutherglen updated LUCENE-1473:
-------------------------------------

    Attachment: LUCENE-1473.patch

LUCENE-1473.patch

serialVersionUID added to the relevant classes manually. Defaulted to 10 because the value does not matter, as long as it is different between versions. I thought of writing some code that goes through the Lucene JAR, does an instanceof check on the classes for Serializable, and then verifies that the serialVersionUID is 10.

Term implements Externalizable. SerializationUtils was adapted from Hadoop's WritableUtils for writing VLongs. The TestSerialization use case exercises term serialization, and also serializes an arbitrary query to a file and compares the results.

TODO:
- Implement Externalizable
- More unit tests? How does one write a unit test for multiple versions?

> Implement standard Serialization across Lucene versions
> --------------------------------------------------------
>
>          Key: LUCENE-1473
>          URL: https://issues.apache.org/jira/browse/LUCENE-1473
>      Project: Lucene - Java
>   Issue Type: Bug
>   Components: Search
> Affects Versions: 2.4
>     Reporter: Jason Rutherglen
>     Priority: Minor
>  Attachments: LUCENE-1473.patch, LUCENE-1473.patch
>
> Original Estimate: 8h
> Remaining Estimate: 8h
>
> To maintain serialization compatibility between Lucene versions,
> serialVersionUID needs to be added to classes that implement
> java.io.Serializable. java.io.Externalizable may be implemented in classes
> for faster performance.

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
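For readers unfamiliar with VLongs: variable-length encoding writes small values in one or two bytes instead of a fixed eight. The sketch below shows the generic 7-bits-per-byte varint technique only; Hadoop's WritableUtils actually uses a different, length-prefixed layout, and the patch's SerializationUtils code may differ from both.

{code:java}
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

public final class VLongSketch {

  // 7 payload bits per byte, least significant group first; the high bit
  // of each byte flags that another byte follows.
  public static void writeVLong(DataOutput out, long value) throws IOException {
    while ((value & ~0x7FL) != 0) {
      out.writeByte((int) ((value & 0x7F) | 0x80)); // more bytes follow
      value >>>= 7;
    }
    out.writeByte((int) value);                     // final byte, high bit clear
  }

  public static long readVLong(DataInput in) throws IOException {
    long value = 0;
    int shift = 0;
    byte b;
    do {
      b = in.readByte();
      value |= (long) (b & 0x7F) << shift;
      shift += 7;
    } while ((b & 0x80) != 0);
    return value;
  }

  public static void main(String[] args) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    writeVLong(new DataOutputStream(bytes), 300L);  // encodes to 2 bytes
    long back = readVLong(
        new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
    System.out.println(back);                       // prints 300
  }
}
{code}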
Hudson build is back to normal: Lucene-trunk #666
See http://hudson.zones.apache.org/hudson/job/Lucene-trunk/666/changes
[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries
[ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653635#action_12653635 ]

Matt Ericson commented on LUCENE-855:
-------------------------------------

Looks similar to what I wrote, but it uses more data structures. I like the one I built because it has direct access to the FieldCache and needs no other data structures; once you load the data into the FieldCache, you can run any other search on that field without rebuilding anything and just re-use the data.

I think all 3 are improvements on what's there, but I am prejudiced: I really like the one I wrote, and I think it will stack up faster than LUCENE-1461 if you run load tests on it.

Just my $0.02

Matt

> MemoryCachedRangeFilter to boost performance of Range queries
> --------------------------------------------------------------
>
>          Key: LUCENE-855
>          URL: https://issues.apache.org/jira/browse/LUCENE-855
>      Project: Lucene - Java
>   Issue Type: Improvement
>   Components: Search
> Affects Versions: 2.1
>     Reporter: Andy Liu
>  Attachments: contrib-filters.tar.gz, FieldCacheRangeFilter.patch,
>  FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch,
>  FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch,
>  FieldCacheRangeFilter.patch, FieldCacheRangeFilter_Lucene_2.3.0.patch,
>  MemoryCachedRangeFilter.patch, MemoryCachedRangeFilter_1.4.patch,
>  TestRangeFilterPerformanceComparison.java,
>  TestRangeFilterPerformanceComparison.java
>
> Currently RangeFilter uses TermEnum and TermDocs to find documents that fall
> within the specified range. This requires iterating through every single
> term in the index and can get rather slow for large document sets.
> MemoryCachedRangeFilter reads all <docId, value> pairs of a given field,
> sorts by value, and stores them in a SortedFieldCache. During bits(), binary
> searches are used to find the start and end indices of the lower and upper
> bound values. The BitSet is populated with all the docId values that fall
> between the start and end indices.
> TestMemoryCachedRangeFilterPerformance creates a 100K RAMDirectory-backed
> index with random date values within a 5-year range. Executing bits() 1000
> times on a standard RangeQuery using random date intervals took 63904 ms.
> Using MemoryCachedRangeFilter, it took 876 ms. The performance increase is
> less dramatic when a field has fewer unique terms or when the index contains
> fewer documents.
> Currently MemoryCachedRangeFilter only works with numeric values (values are
> stored in a long[] array), but it can easily be changed to support Strings.
> A side "benefit" of storing the values as longs is that there is no longer
> any need to make the values lexicographically comparable, i.e. by padding
> numeric values with zeros.
> The downside of using MemoryCachedRangeFilter is that there is a fairly
> significant memory requirement, so it is designed for situations where range
> filter performance is critical and memory consumption is not an issue. The
> memory requirements are: (sizeof(int) + sizeof(long)) * numDocs.
> MemoryCachedRangeFilter also requires a warmup step, which can take a while
> on large datasets (it took 40 s on a 3M-document corpus). Warmup can be
> called explicitly, or it is automatically called the first time
> MemoryCachedRangeFilter is applied to a given field.
> So in summary, MemoryCachedRangeFilter can be useful when:
> - Performance is critical
> - Memory is not an issue
> - The field contains many unique numeric values
> - The index contains a large number of documents

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
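The description above maps directly onto a small amount of code. Below is a minimal sketch of the technique as described; the class name, field names, and bits() signature are assumptions for illustration, not the patch's actual classes. Building the two parallel arrays is the warmup step the description mentions, and their footprint matches the stated (sizeof(int) + sizeof(long)) * numDocs.

{code:java}
import java.util.Arrays;
import java.util.BitSet;

public final class SortedFieldCacheSketch {
  private final long[] sortedValues; // field values, ascending
  private final int[] docIds;        // docIds[i] holds the doc for sortedValues[i]

  public SortedFieldCacheSketch(long[] sortedValues, int[] docIds) {
    this.sortedValues = sortedValues;
    this.docIds = docIds;
  }

  // Two binary searches bound the matching slice; the BitSet is then
  // populated with the docIds of every value in [lower, upper].
  public BitSet bits(int maxDoc, long lower, long upper) {
    BitSet result = new BitSet(maxDoc);
    int start = lowerBound(lower);   // first index with value >= lower
    int end = upperBound(upper);     // one past the last index with value <= upper
    for (int i = start; i < end; i++) {
      result.set(docIds[i]);
    }
    return result;
  }

  private int lowerBound(long key) {
    int idx = Arrays.binarySearch(sortedValues, key);
    if (idx < 0) return -idx - 1;    // not found: insertion point
    while (idx > 0 && sortedValues[idx - 1] == key) idx--; // first duplicate
    return idx;
  }

  private int upperBound(long key) {
    int idx = Arrays.binarySearch(sortedValues, key);
    if (idx < 0) return -idx - 1;
    while (idx < sortedValues.length - 1 && sortedValues[idx + 1] == key) idx++;
    return idx + 1;                  // one past the last duplicate
  }
}
{code}

The two O(log n) searches replace the per-term iteration of the stock RangeFilter, which is where the reported speedup comes from.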