Re: Java logging in Lucene

2008-12-08 Thread Earwin Burrfoot
The common problem with native logging, log4j, and slf4j (with the logback impl)
is that they are totally unsuitable for actually logging something.
They do a good job of checking whether logging can be avoided, but they use
almost-global locking if you really try to write a line to a file.
My research shows there are no ready-made Java logging frameworks that
can be used in a high-load production environment.
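The pattern being criticized can be sketched roughly as follows. This is a simplified illustration of the locking behaviour described (as in JUL's StreamHandler.publish or classic log4j appenders), not real framework code; the class and field names are invented for the example:

```java
import java.io.PrintWriter;
import java.io.StringWriter;

// Sketch: the "is enabled" check is cheap and lock-free, but the actual
// write funnels every thread through one shared lock.
public class LockedLogger {
    final Object lock = new Object();       // the almost-global lock
    final StringWriter out = new StringWriter();
    volatile boolean enabled = false;

    void log(String msg) {
        if (!enabled) return;               // fast path: logging avoided entirely
        synchronized (lock) {               // slow path: all threads serialize here
            new PrintWriter(out, true).println(msg);
        }
    }
}
```

Under load with logging enabled, every logging thread contends on `lock`, which is the unsuitability being claimed.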

On Sat, Dec 6, 2008 at 19:52, Shai Erera <[EMAIL PROTECTED]> wrote:
> On the performance side, I don't expect to see any different performance
> than what we have today, since checking if infoStream != null should be
> similar to logger.isLoggable (or the equivalent methods from SLF4J).
>
> I'll look at SLF4J, open an issue and work out a patch.
>
> On Sat, Dec 6, 2008 at 1:22 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
>>
>> On Dec 5, 2008, at 11:36 PM, Shai Erera wrote:
>>
>>>
>>> What do you have against JUL? I've used it and in my company (which is
>>> quite a large one btw) we've moved to JUL just because it's so easy to
>>> configure, comes already with the JDK and very intuitive. Perhaps it has
>>> some shortcomings which I'm not aware of, and I hope you can point me at
>>> them.
>>
>> See http://lucene.markmail.org/message/3t2qwbf7cc7wtx6h?q=Solr+logging (or
>> http://grantingersoll.com/2008/04/25/logging-frameworks-considered-harmful/
>> for my rant on it!)  Frankly, I could live a quite happy life if I never had
>> to think about logging frameworks again!
>>
>> As for JUL, the bottom line for me is (and perhaps I'm wrong):  It doesn't
>> play nice with others (show me a system today that uses open source projects
>> which doesn't have at least 2 diff. logging frameworks) and it usually
>> requires coding where other implementations don't.  My impression of JUL is
>> that the designers wanted Log4j, but somehow they felt they had to come up
>> with something "original", and in turn arrived at this thing that is the
>> lowest common denominator.  But, like I said, it's a religious debate, eh?
>> ;-)
>>
>> As for logging, you and Jason make good points.  I guess the first thing
>> to do would be to submit a patch that adds SLF4J instead of infoStream and
>> then we can test performance.  It's still amazing to me, however, that Lucene
>> has made it this long with only rudimentary logging, and only during
>> indexing.
>>
>> -
>> To unsubscribe, e-mail: [EMAIL PROTECTED]
>> For additional commands, e-mail: [EMAIL PROTECTED]
>>
>
>



-- 
Kirill Zakharenko/Кирилл Захаренко ([EMAIL PROTECTED])
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785


Re: [jira] Commented: (LUCENE-831) Complete overhaul of FieldCache API/Implementation

2008-12-08 Thread Michael McCandless


On thinking more about this... I think with a few small changes we
could achieve Sort by field without materializing a full array.  We
can decouple this change from LUCENE-831.

I think all that's needed is:

  * Expose sub-readers (LUCENE-1475) by adding IndexReader[]
IndexReader.getSubReaders.  Default impl could just return
length-1 array of itself.

  * Change IndexSearcher.sort that takes a Sort, to first call
IndexReader.getSubReaders, and then do the same logic that
MultiSearcher does, with improvements from LUCENE-1471 (run
separate search per-reader, then merge-sort the top hits from
each).
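A toy sketch of the per-reader search plus merge-sort flow described above. `Hit` and the per-segment result lists are simplified stand-ins, not real Lucene classes:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Each sub-reader produces its own top hits; we then merge them into a
// single global top-n, keeping the best sort values across all segments.
public class PerSegmentSortMerge {
    static class Hit {
        final int doc;
        final float sortValue;
        Hit(int doc, float sortValue) { this.doc = doc; this.sortValue = sortValue; }
    }

    // Merge per-segment top hits into a global top-n (higher sortValue = better).
    static List<Hit> mergeTopHits(List<List<Hit>> perSegment, int n) {
        // Min-heap by sortValue: the root is always the worst retained hit.
        PriorityQueue<Hit> pq = new PriorityQueue<>(
            Comparator.comparingDouble((Hit h) -> h.sortValue));
        for (List<Hit> segmentHits : perSegment) {
            for (Hit h : segmentHits) {
                pq.offer(h);
                if (pq.size() > n) pq.poll();   // drop the worst
            }
        }
        List<Hit> top = new ArrayList<>(pq);
        top.sort(Comparator.comparingDouble((Hit h) -> -h.sortValue)); // best first
        return top;
    }
}
```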

The results should be functionally identical to what we have today,
but, searching after doing a reopen() should be much faster since we'd
no longer re-build the global FieldCache array.

Does this make sense?  It's a small change for a big win, I think.
Does anyone want to take a crack at this patch?

Mike

Mark Miller wrote:


Michael McCandless wrote:


I'd like to decouple "upgraded to Object" vs "materialize full  
array", ie, so we can access native values w/o materializing the  
full array.  I also think "upgrade to Object" is dangerous to even  
offer since it's so costly.



I'm right with you. I didn't think the Object approach was really an
upgrade (beyond losing the merge, which is especially important for
StringIndex - it has no merge option at the moment), which is why I
left both options for now. So I definitely agree we need to move to
the iterator, drop Object, etc.


It's the doin' that ain't so easy. The iterator approach seems
somewhat straightforward (though it's complicated by needing to
provide a random-access object as well), but I'm still working
through how we control so many iterator types (I don't see how you
can use polymorphism yet).


- Mark




Re: JavaCC and Demo files

2008-12-08 Thread Grant Ingersoll


On Dec 8, 2008, at 6:15 AM, Michael McCandless wrote:



I don't see this.  When I "ls -l --time-style=full-iso" these files:

-rw-rw-rw- 1 mike users 20796 2008-11-26 05:25:28.0 -0500  
src/demo/org/apache/lucene/demo/html/HTMLParser.java
-rw-rw-rw- 1 mike users  9486 2008-11-26 05:25:28.0 -0500  
src/demo/org/apache/lucene/demo/html/HTMLParser.jj


so they seem to be modified at the same time.

If you run JavaCC do you see any resulting svn diffs on the .java  
file?


Yes, I did, but it was hard to differentiate whether the changes
were solely due to my using 4.1 instead of 3.x, which seems to be what
the files were generated with.





Re: JavaCC and Demo files

2008-12-08 Thread Michael McCandless


I don't see this.  When I "ls -l --time-style=full-iso" these files:

-rw-rw-rw- 1 mike users 20796 2008-11-26 05:25:28.0 -0500 src/ 
demo/org/apache/lucene/demo/html/HTMLParser.java
-rw-rw-rw- 1 mike users  9486 2008-11-26 05:25:28.0 -0500 src/ 
demo/org/apache/lucene/demo/html/HTMLParser.jj


so they seem to be modified at the same time.

If you run JavaCC do you see any resulting svn diffs on the .java file?

Mike

On Dec 6, 2008, at 8:25 PM, Grant Ingersoll wrote:


Anyone else seeing:
javacc-notice:
[echo]
[echo]   One or more of the JavaCC .jj files is newer than its corresponding
[echo]   .java file.  Run the "javacc" target to regenerate the artifacts.
[echo]


I think the demo files are out of date for the HTML parser, but  
don't recall if this is something we should just automatically  
update or not.


Thanks,
Grant




Re: [jira] Commented: (LUCENE-831) Complete overhaul of FieldCache API/Implementation

2008-12-08 Thread Mark Miller
What do we get from this though? A MultiSearcher (with the scoring
issues) that can properly do rewrite? Won't we have to take
MultiSearcher's scoring baggage into this as well?


Michael McCandless wrote:


On thinking more about this... I think with a few small changes we
could achieve Sort by field without materializing a full array.  We
can decouple this change from LUCENE-831.

I think all that's needed is:

  * Expose sub-readers (LUCENE-1475) by adding IndexReader[]
IndexReader.getSubReaders.  Default impl could just return
length-1 array of itself.

  * Change IndexSearcher.sort that takes a Sort, to first call
IndexReader.getSubReaders, and then do the same logic that
MultiSearcher does, with improvements from LUCENE-1471 (run
separate search per-reader, then merge-sort the top hits from
each).

The results should be functionally identical to what we have today,
but, searching after doing a reopen() should be much faster since we'd
no longer re-build the global FieldCache array.

Does this make sense?  It's a small change for a big win, I think.
Does anyone want to take a crack at this patch?

Mike

Mark Miller wrote:


Michael McCandless wrote:


I'd like to decouple "upgraded to Object" vs "materialize full 
array", ie, so we can access native values w/o materializing the 
full array.  I also think "upgrade to Object" is dangerous to even 
offer since it's so costly.



I'm right with you. I didn't think the Object approach was really an 
upgrade (beyond losing the merge, which is especially important for 
StringIndex - it has no merge option at the moment) which is why I 
left both options for now. So I def agree we need to move to 
iterator, drop object, etc.


Its the doin' that aint so easy. The iterator approach seems somewhat 
straightforward (though its complicated by needing to provide a 
random access object as well), but I'm still working through how we 
control so many iterator types (I dont see how you can use 
polymorphism yet ).


- Mark




Re: [jira] Commented: (LUCENE-831) Complete overhaul of FieldCache API/Implementation

2008-12-08 Thread Michael McCandless


Mark Miller wrote:

What do we get from this though? A MultiSearcher (with the  scoring  
issues) that can properly do rewrite? Won't we have to take  
MultiSearchers scoring baggage into this as well?


If this can work, what we'd get is far better reopen() performance
when you sort-by-field, with no change to the returned results
(rewrite, scores, sort order are identical).

Say you have 1MM doc index, and then you add 100 docs & commit.
Today, when you reopen() and then do a search, FieldCache recomputes
from scratch (iterating through all Terms in entire index) the global
arrays for the fields you're sorting on.  The cost is in proportion to
total index size.

With this change, only the new segment's terms will be iterated on, so
the cost is in proportion to what new segments appeared.

This is the same benefit we are seeking with LUCENE-831, for all uses
of FieldCache (not just sort-by-field), it's just that I think we can
achieve this speedup to sort-by-field without LUCENE-831.

I think there would be no change to the scoring: we would still create
a Weight based on the toplevel IndexReader, but then search each
sub-reader separately, using that Weight.

Though... that is unusual (to create a Weight with the parent
IndexSearcher and then use it in the sub-searchers) -- will something
break if we do that?  (This is new territory for me).

If something will break, I think we can still achieve this, but it
will be a more invasive change and probably will have to be re-coupled
to the new API we will introduce with LUCENE-831.  Marvin actually
referred to how to do this, here:

  https://issues.apache.org/jira/browse/LUCENE-1458?focusedCommentId=12650854#action_12650854


in the paragraph starting with "If our goal is minimal impact...".
Basically during collection, the FieldSortedHitQueue would have to
keep track of subReaderIndex/subReaderDocID (mapping, through
iteration, from the primary docID w/o doing a wasteful new binary
search for each) and enroll into different pqueues indexed by
subReaderIndex, then do the merge sort in the end.
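The docID remapping described above can be sketched like this. Because a collector sees global docIDs in increasing order, the current sub-reader index can be advanced monotonically instead of binary-searching the segment starts for every hit. This is an illustrative sketch; the class and method names are invented:

```java
// Map a primary (global) docID to (subReaderIndex, subReaderDocID) by
// walking forward through the segment start offsets.
public class DocIdMapper {
    private final int[] starts;   // starting global docID of each sub-reader
    private int idx = 0;          // current sub-reader, advanced monotonically

    public DocIdMapper(int[] starts) {
        this.starts = starts;
    }

    /** Sub-reader index for a docID; docIDs must arrive in increasing order. */
    public int subReaderIndex(int globalDoc) {
        while (idx + 1 < starts.length && globalDoc >= starts[idx + 1]) {
            idx++;   // advance, never rewind -- no per-hit binary search
        }
        return idx;
    }

    /** DocID local to the sub-reader that contains globalDoc. */
    public int subReaderDoc(int globalDoc) {
        return globalDoc - starts[subReaderIndex(globalDoc)];
    }
}
```

The FieldSortedHitQueue would then enroll each hit into the pqueue for its `subReaderIndex` and merge-sort the pqueues at the end.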

Mike



Michael McCandless wrote:


On thinking more about this... I think with a few small changes we
could achieve Sort by field without materializing a full array.  We
can decouple this change from LUCENE-831.

I think all that's needed is:

 * Expose sub-readers (LUCENE-1475) by adding IndexReader[]
   IndexReader.getSubReaders.  Default impl could just return
   length-1 array of itself.

 * Change IndexSearcher.sort that takes a Sort, to first call
   IndexReader.getSubReaders, and then do the same logic that
   MultiSearcher does, with improvements from LUCENE-1471 (run
   separate search per-reader, then merge-sort the top hits from
   each).

The results should be functionally identical to what we have today,
but, searching after doing a reopen() should be much faster since  
we'd

no longer re-build the global FieldCache array.

Does this make sense?  It's a small change for a big win, I think.
Does anyone want to take a crack at this patch?

Mike

Mark Miller wrote:


Michael McCandless wrote:


I'd like to decouple "upgraded to Object" vs "materialize full  
array", ie, so we can access native values w/o materializing the  
full array.  I also think "upgrade to Object" is dangerous to  
even offer since it's so costly.



I'm right with you. I didn't think the Object approach was really  
an upgrade (beyond losing the merge, which is especially important  
for StringIndex - it has no merge option at the moment) which is  
why I left both options for now. So I def agree we need to move to  
iterator, drop object, etc.


Its the doin' that aint so easy. The iterator approach seems  
somewhat straightforward (though its complicated by needing to  
provide a random access object as well), but I'm still working  
through how we control so many iterator types (I dont see how you  
can use polymorphism yet ).


- Mark




[jira] Commented: (LUCENE-1380) Patch for ShingleFilter.enablePositions (or PositionFilter)

2008-12-08 Thread Mck SembWever (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12654410#action_12654410
 ] 

Mck SembWever commented on LUCENE-1380:
---

Ping. Are there any committers willing to commit these changes?

> Patch for ShingleFilter.enablePositions (or PositionFilter)
> ---
>
> Key: LUCENE-1380
> URL: https://issues.apache.org/jira/browse/LUCENE-1380
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/analyzers
>Reporter: Mck SembWever
>Priority: Trivial
> Attachments: LUCENE-1380-PositionFilter.patch, 
> LUCENE-1380-PositionFilter.patch, LUCENE-1380-PositionFilter.patch, 
> LUCENE-1380.patch, LUCENE-1380.patch
>
>
> Make it possible for *all* words and shingles to be placed at the same 
> position, that is for _all_ shingles (and unigrams if included) to be treated 
> as synonyms of each other.
> Today the shingles generated are synonyms only to the first term in the 
> shingle.
> For example the query "abcd efgh ijkl" results in:
> ("abcd" "abcd efgh" "abcd efgh ijkl") ("efgh" "efgh ijkl") ("ijkl")
> where "abcd efgh" and "abcd efgh ijkl" are synonyms of "abcd", and "efgh
> ijkl" is a synonym of "efgh".
> There exists no way today to alter which token a particular shingle is a 
> synonym for.
> This patch takes the first step in making it possible to make all shingles 
> (and unigrams if included) synonyms of each other.
> See http://comments.gmane.org/gmane.comp.jakarta.lucene.user/34746 for 
> mailing list thread.
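The synonym behaviour the issue is after comes down to position increments: tokens with a position increment of 0 share a position with the previous token and therefore act as synonyms. The toy sketch below computes absolute positions from increments; it is not the actual ShingleFilter or PositionFilter code:

```java
// Tokens whose increment is 0 land on the same absolute position as the
// token before them -- i.e., they are treated as synonyms at that position.
public class Positions {
    /** Turn a list of position increments into absolute token positions. */
    static int[] absolutePositions(int[] increments) {
        int[] pos = new int[increments.length];
        int cur = -1;                  // first increment of 1 yields position 0
        for (int i = 0; i < increments.length; i++) {
            cur += increments[i];
            pos[i] = cur;
        }
        return pos;
    }
}
```

Setting every shingle's increment to 0 (after the first token) is what would place all shingles and unigrams at one position, making them all synonyms of each other.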

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





Re: [jira] Commented: (LUCENE-831) Complete overhaul of FieldCache API/Implementation

2008-12-08 Thread Mark Miller

Michael McCandless wrote:


Mark Miller wrote:

What do we get from this though? A MultiSearcher (with the  scoring 
issues) that can properly do rewrite? Won't we have to take 
MultiSearchers scoring baggage into this as well?


If this can work, what we'd get is far better reopen() performance
when you sort-by-field, with no change to the returned results
(rewrite, scores, sort order are identical).

Say you have 1MM doc index, and then you add 100 docs & commit.
Today, when you reopen() and then do a search, FieldCache recomputes
from scratch (iterating through all Terms in entire index) the global
arrays for the fields you're sorting on.  The cost is in proportion to
total index size.

With this change, only the new segment's terms will be iterated on, so
the cost is in proportion to what new segments appeared.

This is the same benefit we are seeking with LUCENE-831, for all uses
of FieldCache (not just sort-by-field), it's just that I think we can
achieve this speedup to sort-by-field without LUCENE-831.


Yup, I'm with you on all that. Except the "without LUCENE-831" part - we 
need some FieldCache meddling, right? The current FieldCache approach 
doesn't allow us to meddle much. Isn't it more like we want the 
LUCENE-831 API (or something similar), but we won't need the object array 
or merge stuff?




I think there would be no change to the scoring: we would still create
a Weight based on the toplevel IndexReader, but then search each
sub-reader separately, using that Weight.

Though... that is unusual (to create a Weight with the parent
IndexSearcher and then use it in the sub-searchers) -- will something
break if we do that?  (This is new territory for me).


Okay, right. That does change things. Would love to hear more opinions, 
but that certainly seems reasonable to me. You score each segment using 
tf/idf stats from all of the segments.
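The reason scores stay stable is that the Weight's idf is computed once from the whole index's statistics and then reused against every segment. A minimal sketch of that idea, using the classic Lucene DefaultSimilarity-style idf formula (the class name here is invented):

```java
// idf computed from the *global* docFreq and doc count; searching each
// sub-reader with the same precomputed value keeps scores identical to a
// single top-level search.
public class GlobalIdf {
    static double idf(int docFreq, int numDocs) {
        return Math.log((double) numDocs / (docFreq + 1)) + 1.0;
    }
}
```

If idf were instead computed per segment, a term rare in one small segment would be scored as globally rare there, which is exactly the MultiSearcher-style scoring skew being avoided.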




If something will break, I think we can still achieve this, but it
will be a more invasive change and probably will have to be re-coupled
to the new API we will introduce with LUCENE-831.  Marvin actually
referred to how to do this, here:

  
https://issues.apache.org/jira/browse/LUCENE-1458?focusedCommentId=12650854#action_12650854 



in the paragraph starting with "If our goal is minimal impact...".
Basically during collection, the FieldSortedHitQueue would have to
keep track of subReaderIndex/subReaderDocID (mapping, through
iteration, from the primary docID w/o doing a wasteful new binary
search for each) and enroll into different pqueues indexed by
subReaderIndex, then do the merge sort in the end.

Mike



Michael McCandless wrote:


On thinking more about this... I think with a few small changes we
could achieve Sort by field without materializing a full array.  We
can decouple this change from LUCENE-831.

I think all that's needed is:

 * Expose sub-readers (LUCENE-1475) by adding IndexReader[]
   IndexReader.getSubReaders.  Default impl could just return
   length-1 array of itself.

 * Change IndexSearcher.sort that takes a Sort, to first call
   IndexReader.getSubReaders, and then do the same logic that
   MultiSearcher does, with improvements from LUCENE-1471 (run
   separate search per-reader, then merge-sort the top hits from
   each).

The results should be functionally identical to what we have today,
but, searching after doing a reopen() should be much faster since we'd
no longer re-build the global FieldCache array.

Does this make sense?  It's a small change for a big win, I think.
Does anyone want to take a crack at this patch?

Mike

Mark Miller wrote:


Michael McCandless wrote:


I'd like to decouple "upgraded to Object" vs "materialize full 
array", ie, so we can access native values w/o materializing the 
full array.  I also think "upgrade to Object" is dangerous to even 
offer since it's so costly.



I'm right with you. I didn't think the Object approach was really 
an upgrade (beyond losing the merge, which is especially important 
for StringIndex - it has no merge option at the moment) which is 
why I left both options for now. So I def agree we need to move to 
iterator, drop object, etc.


Its the doin' that aint so easy. The iterator approach seems 
somewhat straightforward (though its complicated by needing to 
provide a random access object as well), but I'm still working 
through how we control so many iterator types (I dont see how you 
can use polymorphism yet ).


- Mark


[jira] Commented: (LUCENE-831) Complete overhaul of FieldCache API/Implementation

2008-12-08 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12654413#action_12654413
 ] 

Michael McCandless commented on LUCENE-831:
---


bq. It seems with this field cache approach and the recent 
FieldCacheRangeFilter on trunk, that Lucene has a robust and coherent answer to 
performing efficient sorting and range filtering for float, double, short, int 
and long values, perhaps it's time to enhance Document. That might cut down the 
size of the API, which in turn makes it easy to test and tune. Document could 
preclude tokenization for such fields, I suspect I'm not the only one to build 
a type-safe replacement to Document.

This is an interesting idea.  Say we create IntField, a subclass of
Field.  It could directly accept a single int value and not accept
tokenization options.  It could assert "not null", if the field wanted
that.  FieldInfo could store that it's an int and expose more strongly
typed APIs from IndexReader.document as well.  If in the future we
enable Term to be things-other-than-String, we could do the right
thing with typed fields.  Etc.
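A hypothetical sketch of that IntField idea (this class does not exist in Lucene; it only illustrates the type-safety argument):

```java
// A field that accepts exactly one int and exposes no tokenization
// options, so the type constraint is enforced at construction time.
public class IntField {
    private final String name;
    private final int value;

    public IntField(String name, int value) {
        if (name == null) {
            throw new IllegalArgumentException("name must not be null");
        }
        this.name = name;
        this.value = value;   // no analyzer/tokenization choices to misconfigure
    }

    public String name()    { return name; }
    public int intValue()   { return value; }
}
```

Because the constructor takes an `int` rather than a `String`, there is no way to hand it a value that needs tokenizing, and readers could expose a typed accessor instead of parsing stored strings.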


> Complete overhaul of FieldCache API/Implementation
> --
>
> Key: LUCENE-831
> URL: https://issues.apache.org/jira/browse/LUCENE-831
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Reporter: Hoss Man
> Fix For: 3.0
>
> Attachments: fieldcache-overhaul.032208.diff, 
> fieldcache-overhaul.diff, fieldcache-overhaul.diff, 
> LUCENE-831.03.28.2008.diff, LUCENE-831.03.30.2008.diff, 
> LUCENE-831.03.31.2008.diff, LUCENE-831.patch, LUCENE-831.patch, 
> LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch
>
>
> Motivation:
> 1) Complete overhaul the API/implementation of "FieldCache" type things...
> a) eliminate global static map keyed on IndexReader (thus
> eliminating synch block between completely independent IndexReaders)
> b) allow more customization of cache management (ie: use 
> expiration/replacement strategies, disk backed caches, etc)
> c) allow people to define custom cache data logic (ie: custom
> parsers, complex datatypes, etc... anything tied to a reader)
> d) allow people to inspect what's in a cache (list of CacheKeys) for
> an IndexReader so a new IndexReader can be likewise warmed. 
> e) Lend support for smarter cache management if/when
> IndexReader.reopen is added (merging of cached data from subReaders).
> 2) Provide backwards compatibility to support existing FieldCache API with
> the new implementation, so there is no redundant caching as client code
> migrates to new API.




Re: Java logging in Lucene

2008-12-08 Thread Shai Erera
{quote}
My research shows there are no ready-made java logging frameworks that can
be used in high-load production environment.
{quote}

I'm not sure I understand what you mean by that. We use Java logging in our
high-profile products, which support hundreds of transactions per second.
Logging is usually turned off, and is turned on only for debugging. We have
not seen any problems with Java logging at runtime (i.e., w/o logging, when
only logger.isLoggable calls are made) or at debug-time (when actual logging
happens). Of course, at debug-time performance is slower, but that's debug
time - you're after debugging, not performance.

Anyway, as far as SLF4J goes, I've written a patch using it, and replacing
infoStream. I'm about to open an issue and submit the patch, for everyone to
review. We can continue the discussion there.

Shai

On Mon, Dec 8, 2008 at 10:13 AM, Earwin Burrfoot <[EMAIL PROTECTED]> wrote:

> The common problem with native logging, log4j and slf4j (logback impl)
> is that they are totally unsuitable for actually logging something.
> They do good work checking if the logging can be avoided, but use
> almost-global locking if you really try to write this line to a file.
> My research shows there are no ready-made java logging frameworks that
> can be used in high-load production environment.
>
> On Sat, Dec 6, 2008 at 19:52, Shai Erera <[EMAIL PROTECTED]> wrote:
> > On the performance side, I don't expect to see any different performance
> > than what we have today, since checking if infoStream != null should be
> > similar to logger.isLoggable (or the equivalent methods from SLF4J).
> >
> > I'll look at SLF4J, open an issue and work out a patch.
> >
> > On Sat, Dec 6, 2008 at 1:22 PM, Grant Ingersoll <[EMAIL PROTECTED]>
> wrote:
> >>
> >> On Dec 5, 2008, at 11:36 PM, Shai Erera wrote:
> >>
> >>>
> >>> What do you have against JUL? I've used it and in my company (which is
> >>> quite a large one btw) we've moved to JUL just because it's so easy to
> >>> configure, comes already with the JDK and very intuitive. Perhaps it
> has
> >>> some shortcomings which I'm not aware of, and I hope you can point me
> at
> >>> them.
> >>
> >> See http://lucene.markmail.org/message/3t2qwbf7cc7wtx6h?q=Solr+logging (or
> >> http://grantingersoll.com/2008/04/25/logging-frameworks-considered-harmful/
> >> for
> >> my rant on it!)  Frankly, I could live a quite happy life if I never had
> to
> >> think about logging frameworks again!
> >>
> >> As for JUL, the bottom line for me is (and perhaps I'm wrong):  It
> doesn't
> >> play nice with others (show me a system today that uses open source
> projects
> >> which doesn't have at least 2 diff. logging frameworks) and it usually
> >> requires coding where other implementations don't.  My impression of JUL
> is
> >> that the designers wanted Log4j, but somehow they felt they had to come
> up
> >> with something "original", and in turn arrived at this thing that is the
> >> lowest common denominator.  But, like I said, it's a religious debate,
> eh?
> >> ;-)
> >>
> >> As for logging, you and Jason make good points.  I guess the first thing
> >> to do would be to submit a patch that adds SLF4J instead of infoStream
> and
> >> then we can test performance.  It still amazing, to me, however, that
> Lucene
> >> has made it this long with all but rudimentary logging and only during
> >> indexing.
> >>
> >>
> >
> >
>
>
>
> --
> Kirill Zakharenko/Кирилл Захаренко ([EMAIL PROTECTED])
> Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
> ICQ: 104465785
>


[jira] Commented: (LUCENE-831) Complete overhaul of FieldCache API/Implementation

2008-12-08 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12654417#action_12654417
 ] 

Uwe Schindler commented on LUCENE-831:
--

{quote}This is an interesting idea. Say we create IntField, a subclass of
Field. It could directly accept a single int value and not accept
tokenization options. It could assert "not null", if the field wanted
that. FieldInfo could store that it's an int and expose more stronly
typed APIs from IndexReader.document as well. If in the future we
enable Term to be things-other-than-String, we could do the right
thing with typed fields. Etc{quote}

Maybe this document could also manage the encoding of these fields to the index 
format. With that it would be possible to extend Document, to automatically use 
my trie-based encoding for storing the raw term values. On the other hand, 
RangeQuery would be aware of the field encoding and could switch dynamically to 
the correct search/sort algorithm. Great!






[jira] Commented: (LUCENE-831) Complete overhaul of FieldCache API/Implementation

2008-12-08 Thread Robert Newson (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12654418#action_12654418
 ] 

Robert Newson commented on LUCENE-831:
--


Yes, something like that. I made a Document class with an add method for each 
primitive type which allowed only the sensible choices for Store and Index. 
Field subclasses would achieve the same thing. A subclass per primitive type 
might be excessive, since they'd be 99% identical to each other. A NumericField 
that could hold a single short, int, long, float, double or Date might be enough 
(new NumericField(name, 99.99F, true), the final boolean toggling YES/NO for 
Store, since Index is always UNANALYZED_NO_NORMS).

Adding this to FieldInfo would change the on-disk format such that it remembers 
that a particular field is of a special type?  That way all the places that 
Lucene currently has a multiplicity of classes or constants (SortField.INT, 
etc) could be eliminated, replaced by first class support in Document/Field.

A remaining question would be whether the field name is sufficient for 
uniqueness; I suggest it becomes fieldname+type. This also implies changes to 
the Query and Filter hierarchy.

If it helps, I can post my Document class, which had helper methods for 
RangeFilter and TermQuery's for each type. It's not a complicated class, you 
can probably already picture it.
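The NumericField being described might look roughly like this. It is a hypothetical sketch of the constructor shape suggested above (one class for all numeric types, a single boolean toggling Store), not an existing Lucene class:

```java
// One field class holding a single numeric value. Index is implied
// (always the UNANALYZED_NO_NORMS equivalent), so only Store is a choice.
public class NumericField {
    private final String name;
    private final Number value;     // short, int, long, float, double, ...
    private final boolean stored;   // the Store YES/NO toggle

    public NumericField(String name, Number value, boolean stored) {
        this.name = name;
        this.value = value;
        this.stored = stored;
    }

    public String name()     { return name; }
    public Number value()    { return value; }
    public boolean isStored() { return stored; }
}
```

Usage matching the example in the comment: `new NumericField("price", 99.99F, true)`.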


> Complete overhaul of FieldCache API/Implementation
> --
>
> Key: LUCENE-831
> URL: https://issues.apache.org/jira/browse/LUCENE-831
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Reporter: Hoss Man
> Fix For: 3.0
>
> Attachments: fieldcache-overhaul.032208.diff, 
> fieldcache-overhaul.diff, fieldcache-overhaul.diff, 
> LUCENE-831.03.28.2008.diff, LUCENE-831.03.30.2008.diff, 
> LUCENE-831.03.31.2008.diff, LUCENE-831.patch, LUCENE-831.patch, 
> LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch
>
>
> Motivation:
> 1) Completely overhaul the API/implementation of "FieldCache" type things...
> a) eliminate global static map keyed on IndexReader (thus
> eliminating synch block between completely independent IndexReaders)
> b) allow more customization of cache management (ie: use 
> expiration/replacement strategies, disk backed caches, etc)
> c) allow people to define custom cache data logic (ie: custom
> parsers, complex datatypes, etc... anything tied to a reader)
> d) allow people to inspect what's in a cache (list of CacheKeys) for
> an IndexReader so a new IndexReader can be likewise warmed. 
> e) Lend support for smarter cache management if/when
> IndexReader.reopen is added (merging of cached data from subReaders).
> 2) Provide backwards compatibility to support existing FieldCache API with
> the new implementation, so there is no redundant caching as client code
> migrates to new API.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Issue Comment Edited: (LUCENE-831) Complete overhaul of FieldCache API/Implementation

2008-12-08 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12654417#action_12654417
 ] 

thetaphi edited comment on LUCENE-831 at 12/8/08 6:04 AM:
---

{quote}This is an interesting idea. Say we create IntField, a subclass of
Field. It could directly accept a single int value and not accept
tokenization options. It could assert "not null", if the field wanted
that. FieldInfo could store that it's an int and expose more strongly
typed APIs from IndexReader.document as well. If in the future we
enable Term to be things-other-than-String, we could do the right
thing with typed fields. Etc{quote}

Maybe this new Document class could also manage the encoding of these fields to 
the index format. With that it would be possible to extend Document to 
automatically use my trie-based encoding for storing the raw term values. On 
the other hand, RangeQuery would be aware of the field encoding (from field 
metadata) and could switch dynamically to the correct search/sort algorithm. 
Great!





[jira] Updated: (LUCENE-831) Complete overhaul of FieldCache API/Implementation

2008-12-08 Thread Robert Newson (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Newson updated LUCENE-831:
-

Attachment: ExtendedDocument.java


Type-safe Document-style object. Doesn't extend Document as it is final.






[jira] Created: (LUCENE-1482) Replace infoStream by a logging framework (SLF4J)

2008-12-08 Thread Shai Erera (JIRA)
Replace infoStream by a logging framework (SLF4J)
-

 Key: LUCENE-1482
 URL: https://issues.apache.org/jira/browse/LUCENE-1482
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Shai Erera
Priority: Minor
 Fix For: 2.4.1, 2.9


Lucene makes use of infoStream to output messages in its indexing code only. 
For debugging purposes, when the search application runs on the customer 
side, getting messages from other code flows, like search, query parsing, 
analysis, etc., can be extremely useful.
There are two main problems with infoStream today:
1. It is owned by IndexWriter, so if I want to add logging capabilities to 
other classes I need to either expose an API or propagate infoStream to all 
classes (see for example DocumentsWriter, which receives its infoStream 
instance from IndexWriter).
2. I can either turn debugging on or off, for the entire code.

Introducing a logging framework can allow each class to control its logging 
independently, and more importantly, allows the application to turn on logging 
for only specific areas in the code (i.e., org.apache.lucene.index.*).

I've investigated SLF4J (which stands for Simple Logging Facade for Java); it is, 
as its name states, a facade over different logging frameworks. As such, you 
can include the slf4j.jar in your application, and it detects at deploy time 
which actual logging framework you'd like to use. SLF4J comes with 
several adapters for Java logging, Log4j and others. If you know your 
application uses Java logging, simply drop slf4j.jar and slf4j-jdk14.jar in 
your classpath, and your logging statements will use Java logging underneath 
the covers.

This makes the logging code very simple. For a class A the logger will be 
instantiated like this:
public class A {
  private static final Logger logger = LoggerFactory.getLogger(A.class);
}

And will later be used like this:
public class A {
  private static final Logger logger = LoggerFactory.getLogger(A.class);

  public void foo() {
    if (logger.isDebugEnabled()) {
      logger.debug("message");
    }
  }
}

That's all!

Checking isDebugEnabled is very quick, at least using the JDK14 adapter 
(though I assume it's also fast over other logging frameworks).

The important thing is, every class controls its own logger. Not all classes 
have to output logging messages, and we can improve Lucene's logging gradually, 
w/o changing the API, by adding more logging messages to interesting classes.
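For example, with the JDK14 adapter, enabling Lucene's messages for only the index package could be a matter of a logging.properties along these lines (a sketch, not part of any patch):

```properties
# Default: only INFO and above anywhere
.level = INFO
handlers = java.util.logging.ConsoleHandler
# the handler must also pass FINE records through
java.util.logging.ConsoleHandler.level = ALL

# SLF4J's DEBUG maps to JUL's FINE; turn it on for indexing code only
org.apache.lucene.index.level = FINE
```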

I will submit a patch shortly.




[jira] Updated: (LUCENE-1482) Replace infoStream by a logging framework (SLF4J)

2008-12-08 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-1482:
---

Attachment: LUCENE-1482.patch

This patch covers:
- Using SLF4J in all the classes that used infoStream
- A test which uses the JDK14 adapter to make sure it works, as well as fixing 
a few tests which relied on certain messages
- Deprecation of setInfoStream(), getInfoStream() etc. in the several classes that 
exposed this API.

A few notes:
- In many customer environments I know of, the INFO level is usually turned 
on, and we were forbidden to output any message at the INFO level unless it 
really was INFO, WARNING or SEVERE. I therefore assumed Lucene's logging 
messages should be at the DEBUG level (one level below INFO).

- I wasn't sure what to do with the set/get infoStream methods, so I just 
deprecated them and made them no-ops (i.e., setInfoStream does nothing and 
getInfoStream always returns null).
Not sure how that aligns with Lucene's back-compat policy, but on the other 
hand I didn't think I should keep both infoStream and SLF4J logging in the code.

- Should I attach the slf4j jars separately?

- I didn't go as far as measuring performance because:
1. The code uses isDebugEnabled everywhere, which is, at least judging by the 
JDK14 adapter, very fast (it just checks a member on the actual logger instance) 
and is almost equivalent to an infoStream != null check.
2. It really depends on the adapter being used. I used JDK14, but perhaps 
some other adapter performs worse on these calls, although I expect them to 
execute quickly, if not be inlined by the compiler.




Re: Java logging in Lucene

2008-12-08 Thread Yonik Seeley
On Sat, Dec 6, 2008 at 11:52 AM, Shai Erera <[EMAIL PROTECTED]> wrote:
> On the performance side, I don't expect to see any different performance
> than what we have today, since checking if infoStream != null should be
> similar to logger.isLoggable (or the equivalent methods from SLF4J).

I'm leery of going down this logging road because people may add
logging statements in inappropriate places, believing that
isLoggable() costs about the same as infoStream != null.

They seem roughly equivalent because of the context in which they are
tested: coarse grained logging where the surrounding operations
eclipse the logging check.

isLoggable() involves volatile reads, which prevent optimizations and
instruction reordering across the read.  On current x86 platforms, no
memory barrier instructions are needed for a volatile read, but that's
not true of other architectures.
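For comparison, the two guard styles look like this with plain java.util.logging (a sketch; the class and logger names are made up):

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Sketch contrasting the two guard styles being discussed. A plain
// "infoStream != null" is an ordinary field read the JIT can freely hoist;
// isLoggable() consults the logger's effective level, and volatile reads
// inside a logging framework can limit such reordering/hoisting.
public class GuardDemo {
    private static final Logger logger = Logger.getLogger("org.apache.lucene.index");
    private java.io.PrintStream infoStream = null; // plain field: cheap null check

    public static void main(String[] args) {
        logger.setLevel(Level.WARNING);
        // coarse-grained check: FINE is disabled when the level is WARNING
        System.out.println(logger.isLoggable(Level.FINE)); // false

        logger.setLevel(Level.FINE);
        System.out.println(logger.isLoggable(Level.FINE)); // true
    }
}
```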

-Yonik




Re: [jira] Commented: (LUCENE-831) Complete overhaul of FieldCache API/Implementation

2008-12-08 Thread Michael McCandless


Mark Miller wrote:


Michael McCandless wrote:


Mark Miller wrote:

What do we get from this though? A MultiSearcher (with the   
scoring issues) that can properly do rewrite? Won't we have to  
take MultiSearchers scoring baggage into this as well?


If this can work, what we'd get is far better reopen() performance
when you sort-by-field, with no change to the returned results
(rewrite, scores, sort order are identical).

Say you have 1MM doc index, and then you add 100 docs & commit.
Today, when you reopen() and then do a search, FieldCache recomputes
from scratch (iterating through all Terms in entire index) the global
arrays for the fields you're sorting on.  The cost is in proportion to
total index size.

With this change, only the new segment's terms will be iterated on, so
the cost is in proportion to what new segments appeared.

This is the same benefit we are seeking with LUCENE-831, for all uses
of FieldCache (not just sort-by-field), it's just that I think we can
achieve this speedup to sort-by-field without LUCENE-831.


Yup, I'm with you on all that. Except the without LUCENE-831 part -  
we need some FieldCache meddling right? The current FieldCache  
approach doesn't allow us to meddle much. Isn't it more like, we  
want the LUCENE-831 API (or something similar), but we won't need  
the objectarray or merge stuff?


We wouldn't need any change to FieldCache, because we only ask  
FieldCache for int[] (eg) on the SegmentReader instances.  Because  
reopen() shares SegmentReader instances, only the new segments would  
have a cache miss in FieldCache.  I think?


Once we do LUCENE-831, minus objectarray and merging, this change  
would be basically the same, ie, accessing per-segment int values,  
just with a new API.  Ie, by doing this change first I don't think  
we're going to waste much in then cutting over in the future to  
LUCENE-831's API (vs waiting for LUCENE-831 api).


I think there would be no change to the scoring: we would still  
create

a Weight based on the toplevel IndexReader, but then search each
sub-reader separately, using that Weight.

Though... that is unusual (to create a Weight with the parent
IndexSearcher and then use it in the sub-searchers) -- will something
break if we do that?  (This is new territory for me).


Okay, right. That does change things. Would love to hear more  
opinions, but that certainly seems reasonable to me. You score each  
segment using tf/idf stats from all of the segments.


That's my expectation (hope).  So the results are identical but  
performance is much better.



If something will break, I think we can still achieve this, but it
will be a more invasive change and probably will have to be re-coupled
to the new API we will introduce with LUCENE-831.  Marvin actually
to the new API we will introduce with LUCENE-831.  Marvin actually
referred to how to do this, here:

 https://issues.apache.org/jira/browse/LUCENE-1458?focusedCommentId=12650854#action_12650854


in the paragraph starting with "If our goal is minimal impact...".
Basically during collection, the FieldSortedHitQueue would have to
keep track of subReaderIndex/subReaderDocID (mapping, through
iteration, from the primary docID w/o doing a wasteful new binary
search for each) and enroll into different pqueues indexed by
subReaderIndex, then do the merge sort in the end.
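That docID-to-sub-reader mapping can be sketched in plain Java (class name and segment sizes are made up): when docIDs arrive in increasing order, a cursor over the doc bases replaces the per-hit binary search:

```java
// Sketch: map ascending global docIDs to (subReaderIndex, subReaderDocID)
// by advancing a cursor over the sub-readers' doc bases, instead of
// binary-searching for every hit. Segment sizes here are invented.
public class DocMapper {
    public static void main(String[] args) {
        // starts[i] is the first global docID of sub-reader i; last entry is maxDoc
        int[] starts = {0, 100, 250, 300};
        int[] docIDs = {5, 99, 100, 260, 299}; // collected in increasing order

        int sub = 0;
        for (int docID : docIDs) {
            while (docID >= starts[sub + 1]) {
                sub++; // advance the cursor; amortized O(1) per hit
            }
            int subDocID = docID - starts[sub];
            System.out.println(docID + " -> reader " + sub + ", doc " + subDocID);
            // e.g. 100 -> reader 1, doc 0; 260 -> reader 2, doc 10
        }
    }
}
```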

Mike



Michael McCandless wrote:


On thinking more about this... I think with a few small changes we
could achieve Sort by field without materializing a full array.  We
can decouple this change from LUCENE-831.

I think all that's needed is:

* Expose sub-readers (LUCENE-1475) by adding IndexReader[]
  IndexReader.getSubReaders.  Default impl could just return
  length-1 array of itself.

* Change IndexSearcher.sort that takes a Sort, to first call
  IndexReader.getSubReaders, and then do the same logic that
  MultiSearcher does, with improvements from LUCENE-1471 (run
  separate search per-reader, then merge-sort the top hits from
  each).
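The per-reader search followed by a merge sort of the top hits can be sketched like this (plain Java; the class name and sort keys are made up, standing in for real per-segment results):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Sketch of the merge step: each sub-reader contributes its own top hits,
// already sorted by the sort key; a small heap over the current head of
// each list merges them into global order.
public class MergeTopHits {
    static final class Hit {
        final int globalDoc;
        final long sortKey;
        Hit(int globalDoc, long sortKey) { this.globalDoc = globalDoc; this.sortKey = sortKey; }
    }

    static List<Hit> mergeTop(List<List<Hit>> perReaderHits, int n) {
        PriorityQueue<List<Hit>> heads =
            new PriorityQueue<>(Comparator.comparingLong((List<Hit> l) -> l.get(0).sortKey));
        for (List<Hit> hits : perReaderHits) {
            if (!hits.isEmpty()) heads.add(new ArrayList<>(hits));
        }
        List<Hit> out = new ArrayList<>();
        while (out.size() < n && !heads.isEmpty()) {
            List<Hit> best = heads.poll();
            out.add(best.remove(0));
            if (!best.isEmpty()) heads.add(best); // re-enqueue with its new head
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<Hit>> perReader = List.of(
            List.of(new Hit(3, 10), new Hit(7, 40)),    // sub-reader 0
            List.of(new Hit(105, 5), new Hit(120, 30))  // sub-reader 1
        );
        for (Hit h : mergeTop(perReader, 3)) {
            System.out.println(h.globalDoc + ":" + h.sortKey); // 105:5, 3:10, 120:30
        }
    }
}
```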

The results should be functionally identical to what we have today,
but, searching after doing a reopen() should be much faster since we'd
no longer re-build the global FieldCache array.

Does this make sense?  It's a small change for a big win, I think.
Does anyone want to take a crack at this patch?

Mike

Mark Miller wrote:


Michael McCandless wrote:


I'd like to decouple "upgraded to Object" vs "materialize full  
array", ie, so we can access native values w/o materializing  
the full array.  I also think "upgrade to Object" is dangerous  
to even offer since it's so costly.



I'm right with you. I didn't think the Object approach was  
really an upgrade (beyond losing the merge, which is especially  
important for StringIndex - it has no merge option at the  
moment) which is why I left both options for now. So I def agree  
we need to move to iterator, drop object, etc.


Its the doin' that aint so easy. The iterator approach seems 
somewhat straightforward (though it's complicated by needing to 
provide a ran

[jira] Commented: (LUCENE-1481) Sort and SortField does not have equals() and hashCode()

2008-12-08 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12654434#action_12654434
 ] 

Michael McCandless commented on LUCENE-1481:


Patch looks good; I'll commit shortly.

> Sort and SortField does not have equals() and hashCode()
> 
>
> Key: LUCENE-1481
> URL: https://issues.apache.org/jira/browse/LUCENE-1481
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Query/Scoring
>Affects Versions: 2.4
>Reporter: Uwe Schindler
>Assignee: Michael McCandless
> Fix For: 2.9
>
> Attachments: LUCENE-1481.patch
>
>
> During development of my project panFMP I had the following issue:
> I have a cache for query results (like Solr has, too). This 
> cache also uses the Sort/SortField as part of the key into the cache. The problem is, 
> because Sort/SortField do not implement equals() and hashCode(), you cannot 
> store them as cache keys. As a workaround I currently use Sort.toString() as the 
> cache key, but this is not so nice.
> In conjunction with issue LUCENE-1478, I could fix this there in one patch 
> together with the other improvements.




[jira] Resolved: (LUCENE-1481) Sort and SortField does not have equals() and hashCode()

2008-12-08 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-1481.


   Resolution: Fixed
Fix Version/s: 2.9
Lucene Fields: [New, Patch Available]  (was: [Patch Available, New])

Committed revision 724379.

Thanks Uwe!




[jira] Updated: (LUCENE-1478) Missing possibility to supply custom FieldParser when sorting search results

2008-12-08 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-1478:
--

Attachment: LUCENE-1478.patch

As LUCENE-1481 is committed, here is the updated patch with SortField.hashCode() 
and equals().

> Missing possibility to supply custom FieldParser when sorting search results
> 
>
> Key: LUCENE-1478
> URL: https://issues.apache.org/jira/browse/LUCENE-1478
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.4
>Reporter: Uwe Schindler
>Assignee: Michael McCandless
> Attachments: LUCENE-1478-no-superinterface.patch, LUCENE-1478.patch, 
> LUCENE-1478.patch, LUCENE-1478.patch
>
>
> When implementing the new TrieRangeQuery for contrib (LUCENE-1470), I was 
> confronted by the problem that the special trie-encoded values (which are 
> longs in a special encoding) cannot be sorted by Searcher.search() and 
> SortField. The problem is: if you use SortField.LONG, you get 
> NumberFormatExceptions. The trie-encoded values may be sorted using 
> SortField.STRING (as the encoding is such that they are sortable as 
> Strings), but this is very memory-inefficient.
> ExtendedFieldCache gives the possibility to specify a custom LongParser when 
> retrieving the cached values. But you cannot use this during searching, 
> because there is no possibility to supply this custom LongParser to the 
> SortField.
> I propose a change in the sort classes:
> Include a pointer to the parser instance to be used in SortField (if not 
> given use the default). My idea is to create a SortField using a new 
> constructor
> {code}SortField(String field, int type, Object parser, boolean reverse){code}
> The parser is "object" because all current parsers have no super-interface. 
> The ideal solution would be to have:
> {code}SortField(String field, int type, FieldCache.Parser parser, boolean 
> reverse){code}
> and FieldCache.Parser is a super-interface (just empty, more like a 
> marker-interface) of all other parsers (like LongParser...). The sort 
> implementation must then be changed to respect the given parser (if not 
> null), and otherwise use the default FieldCache.get without a parser.




Re: Java logging in Lucene

2008-12-08 Thread Earwin Burrfoot
I referred to the case when you want normal production logs, like
access logs, or whatever.
Debugging with all common logging implementations is also broken,
because switching logging on/off dramatically changes the multithreading
picture.

On Mon, Dec 8, 2008 at 17:02, Shai Erera <[EMAIL PROTECTED]> wrote:
> {quote}
> My research shows there are no ready-made java logging frameworks that can
> be used in high-load production environment.
> {quote}
>
> I'm not sure I understand what you mean by that. We use Java logging in our
> high-profile products, which support 100s of tps. Logging is usually turned
> off, and is turned on only for debugging. We have not seen any
> problems with Java logging at runtime (i.e., w/o logging, when only
> logger.isLoggable calls are made) or at debug time (when actual logging
> happens). Of course, at debug time performance is slower, but that's debug
> time - you're not after performance, you're after debugging.
>
> Anyway, as far as SLF4J goes, I've written a patch using it, and replacing
> infoStream. I'm about to open an issue and submit the patch, for everyone to
> review. We can continue the discussion there.
>
> Shai
>
> On Mon, Dec 8, 2008 at 10:13 AM, Earwin Burrfoot <[EMAIL PROTECTED]> wrote:
>>
>> The common problem with native logging, log4j and slf4j (logback impl)
>> is that they are totally unsuitable for actually logging something.
>> They do good work checking if the logging can be avoided, but use
>> almost-global locking if you really try to write this line to a file.
>> My research shows there are no ready-made java logging frameworks that
>> can be used in high-load production environment.
>>
>> On Sat, Dec 6, 2008 at 19:52, Shai Erera <[EMAIL PROTECTED]> wrote:
>> > On the performance side, I don't expect to see any different performance
>> > than what we have today, since checking if infoStream != null should be
>> > similar to logger.isLoggable (or the equivalent methods from SLF4J).
>> >
>> > I'll look at SLF4J, open an issue and work out a patch.
>> >
>> > On Sat, Dec 6, 2008 at 1:22 PM, Grant Ingersoll <[EMAIL PROTECTED]>
>> > wrote:
>> >>
>> >> On Dec 5, 2008, at 11:36 PM, Shai Erera wrote:
>> >>
>> >>>
>> >>> What do you have against JUL? I've used it and in my company (which is
>> >>> quite a large one btw) we've moved to JUL just because it's so easy to
>> >>> configure, comes already with the JDK and very intuitive. Perhaps it
>> >>> has
>> >>> some shortcomings which I'm not aware of, and I hope you can point me
>> >>> at
>> >>> them.
>> >>
>> >> See http://lucene.markmail.org/message/3t2qwbf7cc7wtx6h?q=Solr+logging
>> >> (or
>> >>
>> >> http://grantingersoll.com/2008/04/25/logging-frameworks-considered-harmful/
>> >> for
>> >> my rant on it!)  Frankly, I could live a quite happy life if I never
>> >> had to
>> >> think about logging frameworks again!
>> >>
>> >> As for JUL, the bottom line for me is (and perhaps I'm wrong):  It
>> >> doesn't
>> >> play nice with others (show me a system today that uses open source
>> >> projects
>> >> which doesn't have at least 2 diff. logging frameworks) and it usually
>> >> requires coding where other implementations don't.  My impression of
>> >> JUL is
>> >> that the designers wanted Log4j, but somehow they felt they had to come
>> >> up
>> >> with something "original", and in turn arrived at this thing that is
>> >> the
>> >> lowest common denominator.  But, like I said, it's a religious debate,
>> >> eh?
>> >> ;-)
>> >>
>> >> As for logging, you and Jason make good points.  I guess the first
>> >> thing
>> >> to do would be to submit a patch that adds SLF4J instead of infoStream
>> >> and
>> >> then we can test performance.  It still amazing, to me, however, that
>> >> Lucene
>> >> has made it this long with all but rudimentary logging and only during
>> >> indexing.
>> >>
>> >> -
>> >> To unsubscribe, e-mail: [EMAIL PROTECTED]
>> >> For additional commands, e-mail: [EMAIL PROTECTED]
>> >>
>> >
>> >
>>
>>
>>
>> --
>> Kirill Zakharenko/Кирилл Захаренко ([EMAIL PROTECTED])
>> Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
>> ICQ: 104465785
>
>



-- 
Kirill Zakharenko/Кирилл Захаренко ([EMAIL PROTECTED])
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785


Re: [jira] Commented: (LUCENE-831) Complete overhaul of FieldCache API/Implementation

2008-12-08 Thread Mark Miller
I tried a quick poor man's version using a MultiSearcher and wrapping the 
sub readers as searchers. Other than some AUTO sort field detection 
problems, all tests do appear to pass. The new sort stuff for 
MultiSearcher may be a tiny bit off... sort tests fail, though only 
slightly, with that patch. Haven't looked further yet - just hacked 
it up real quick. Seems to work, but needs work.


- Mark


Michael McCandless wrote:


Mark Miller wrote:


Michael McCandless wrote:


Mark Miller wrote:

What do we get from this though? A MultiSearcher (with the  scoring 
issues) that can properly do rewrite? Won't we have to take 
MultiSearchers scoring baggage into this as well?


If this can work, what we'd get is far better reopen() performance
when you sort-by-field, with no change to the returned results
(rewrite, scores, sort order are identical).

Say you have 1MM doc index, and then you add 100 docs & commit.
Today, when you reopen() and then do a search, FieldCache recomputes
from scratch (iterating through all Terms in entire index) the global
arrays for the fields you're sorting on.  The cost is in proportion to
total index size.

With this change, only the new segment's terms will be iterated on, so
the cost is in proportion to what new segments appeared.

This is the same benefit we are seeking with LUCENE-831, for all uses
of FieldCache (not just sort-by-field), it's just that I think we can
achieve this speedup to sort-by-field without LUCENE-831.


Yup, I'm with you on all that. Except the "without LUCENE-831" part - 
we need some FieldCache meddling, right? The current FieldCache 
approach doesn't allow us to meddle much. Isn't it more like: we want 
the LUCENE-831 API (or something similar), but we won't need the 
object array or merge stuff?


We wouldn't need any change to FieldCache, because we only ask 
FieldCache for int[] (eg) on the SegmentReader instances.  Because 
reopen() shares SegmentReader instances, only the new segments would 
have a cache miss in FieldCache.  I think?


Once we do LUCENE-831, minus objectarray and merging, this change 
would be basically the same, ie, accessing per-segment int values, 
just with a new API.  Ie, by doing this change first I don't think 
we're going to waste much in then cutting over in the future to 
LUCENE-831's API (vs waiting for LUCENE-831 api).



I think there would be no change to the scoring: we would still create
a Weight based on the toplevel IndexReader, but then search each
sub-reader separately, using that Weight.

Though... that is unusual (to create a Weight with the parent
IndexSearcher and then use it in the sub-searchers) -- will something
break if we do that?  (This is new territory for me).


Okay, right. That does change things. Would love to hear more 
opinions, but that certainly seems reasonable to me. You score each 
segment using tf/idf stats from all of the segments.


That's my expectation (hope).  So the results are identical but 
performance is much better.



If something breaks, I think we can still achieve this, but it
will be a more invasive change and probably will have to be re-coupled
to the new API we will introduce with LUCENE-831.  Marvin actually
described how to do this, here:

 https://issues.apache.org/jira/browse/LUCENE-1458?focusedCommentId=12650854#action_12650854 



in the paragraph starting with "If our goal is minimal impact...".
Basically, during collection the FieldSortedHitQueue would have to
keep track of subReaderIndex/subReaderDocID (mapping, through
iteration, from the primary docID without doing a wasteful new binary
search for each) and enroll hits into different pqueues indexed by
subReaderIndex, then do the merge sort at the end.
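That subReaderIndex/subReaderDocID bookkeeping can be sketched without any Lucene types (all names here are invented): because hits are collected in increasing docID order, a cursor over the cumulative doc starts replaces a per-hit binary search.

```java
// Sketch: map non-decreasing global docIDs to (subReaderIndex, localDocID)
// by advancing a cursor over cumulative doc starts, instead of doing a
// binary search for every collected hit.
public class SubReaderCursor {
    private final int[] starts;   // starts[i] = first global docID of sub-reader i
    private int current = 0;      // sub-reader the cursor is currently in

    public SubReaderCursor(int[] maxDocs) {
        starts = new int[maxDocs.length + 1];
        for (int i = 0; i < maxDocs.length; i++) {
            starts[i + 1] = starts[i] + maxDocs[i];
        }
    }

    /** docIDs must arrive in non-decreasing order, as during collection. */
    public int[] toLocal(int globalDoc) {
        while (globalDoc >= starts[current + 1]) {
            current++;            // amortized O(1) across the whole collection
        }
        return new int[] { current, globalDoc - starts[current] };
    }
}
```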

Mike



Michael McCandless wrote:


On thinking more about this... I think with a few small changes we
could achieve Sort by field without materializing a full array.  We
can decouple this change from LUCENE-831.

I think all that's needed is:

* Expose sub-readers (LUCENE-1475) by adding IndexReader[]
  IndexReader.getSubReaders.  Default impl could just return
  length-1 array of itself.

* Change IndexSearcher.sort that takes a Sort, to first call
  IndexReader.getSubReaders, and then do the same logic that
  MultiSearcher does, with improvements from LUCENE-1471 (run
  separate search per-reader, then merge-sort the top hits from
  each).
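The search-per-reader-then-merge step above can be sketched with plain Java collections (no Lucene types; names are illustrative): each sub-reader contributes an already-sorted list of top hits, and a small priority queue merge-sorts them into the global top N.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative k-way merge: each sub-reader's search already returns its
// hits sorted (here: ascending sort key), so merging the lists costs
// O(total * log k) rather than re-sorting everything from scratch.
public class TopHitsMerger {
    public static List<Integer> mergeTopN(List<List<Integer>> sortedPerReader, int n) {
        // queue entries: {sortKey, listIndex, offsetWithinList}
        PriorityQueue<int[]> pq = new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
        for (int i = 0; i < sortedPerReader.size(); i++) {
            if (!sortedPerReader.get(i).isEmpty()) {
                pq.add(new int[] { sortedPerReader.get(i).get(0), i, 0 });
            }
        }
        List<Integer> merged = new ArrayList<>();
        while (merged.size() < n && !pq.isEmpty()) {
            int[] top = pq.poll();
            merged.add(top[0]);
            List<Integer> src = sortedPerReader.get(top[1]);
            if (top[2] + 1 < src.size()) {
                pq.add(new int[] { src.get(top[2] + 1), top[1], top[2] + 1 });
            }
        }
        return merged;
    }
}
```

With k sub-readers the queue never holds more than k entries, which is why this stays cheap even when the per-reader result lists are long.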

The results should be functionally identical to what we have today,
but searching after doing a reopen() should be much faster since we'd
no longer re-build the global FieldCache array.

Does this make sense?  It's a small change for a big win, I think.
Does anyone want to take a crack at this patch?

Mike

Mark Miller wrote:


Michael McCandless wrote:


I'd like to decouple "upgraded to Object" vs "materialize full 
array", ie, so we can access native values w/o materializing the 
full array.  I also think "upgrade to Object" is dangerous to 
even offer since it's so costly.



I'm right with you. I didn't think the 

Re: [jira] Commented: (LUCENE-831) Complete overhaul of FieldCache API/Implementation

2008-12-08 Thread Michael McCandless


Mark Miller wrote:

I tried a quick poor man's version using a MultiSearcher and wrapping  
the sub-readers as searchers. Other than some AUTO sort field  
detection problems, all tests appear to pass.


Excellent, that sounds like a tentatively positive result, though we  
do need to get to the bottom of those differences.


The new sort stuff for MultiSearcher may be a tiny bit off... sort  
tests fail, though only slightly, with that patch. Haven't looked  
further yet - just hacked it up quickly. Seems to work, but needs work.


Which new sort stuff are you referring to?  Is it LUCENE-1471?

Mike




Re: [jira] Commented: (LUCENE-831) Complete overhaul of FieldCache API/Implementation

2008-12-08 Thread Mark Miller

Michael McCandless wrote:


Mark Miller wrote:

I tried a quick poor man's version using a MultiSearcher and wrapping 
the sub-readers as searchers. Other than some AUTO sort field 
detection problems, all tests appear to pass.


Excellent, that sounds like a tentatively positive result, though we 
do need to get to the bottom of those differences.


I hope to dig in more tonight. The main issue is that using 
SortType.AUTO blows up because the MultiSearcher code expects it to 
have already been resolved to a sort type, but my hack kept that from 
happening, so it hits a switch statement for AUTO that throws an 
exception. That shouldn't be a problem. The other issue is a weird 
situation where it hits an INT switch statement when the sort type is 
Byte and throws a class cast exception. I didn't have time to look into 
anything, but I'm guessing it's AUTO-related as well. Both are probably 
non-issues once I get time to do more.




The new sort stuff for MultiSearcher may be a tiny bit off... sort 
tests fail, though only slightly, with that patch. Haven't looked 
further yet - just hacked it up quickly. Seems to work, but needs work.


Which new sort stuff are you referring to?  Is it LUCENE-1471?


Yes. The first thing I did was try to patch this in, but the sort tests 
failed. It would be the right order, but e.g. the two center docs would 
be reversed or something. No time to dig in, so I just switched to the 
trunk MultiSearcher and all tests passed except for the two with the 
above issues.


- Mark



Mike








RE: [jira] Commented: (LUCENE-831) Complete overhaul of FieldCache API/Implementation

2008-12-08 Thread Uwe Schindler
Hi Mark,

> I hope to dig in more tonight. The main issue is that using
> SortType.AUTO blows up because the MultiSearcher code expects it to
> have already been resolved to a sort type, but my hack kept that from
> happening, so it hits a switch statement for AUTO that throws an
> exception. That shouldn't be a problem. The other issue is a weird
> situation where it hits an INT switch statement when the sort type is
> Byte and throws a class cast exception. I didn't have time to look into
> anything, but I'm guessing it's AUTO-related as well. Both are probably
> non-issues once I get time to do more.

Haha, the same bug I encountered during my LUCENE-1478 development. The
bug is in the comparator: it returns INT instead of BYTE in
FieldSortedHitQueue. It is fixed in my patch for LUCENE-1478, which is
currently under discussion; maybe we should fix this before LUCENE-1478 in a
separate patch/issue.

But you can fix it simply; just correct it here in FieldSortedHitQueue.java
(this is a copy from my patch):

@@ -241,7 +257,7 @@
   }
 
   public int sortType() {
-    return SortField.INT;
+    return SortField.BYTE;
   }
 };
   }

Uwe





[jira] Commented: (LUCENE-831) Complete overhaul of FieldCache API/Implementation

2008-12-08 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12654488#action_12654488
 ] 

Jason Rutherglen commented on LUCENE-831:
-

M. McCandless:

"This is an interesting idea. Say we create IntField, a subclass of
Field. It could directly accept a single int value and not accept
tokenization options. It could assert "not null", if the field wanted
that. FieldInfo could store that it's an int and expose more stronly
typed APIs from IndexReader.document as well. If in the future we
enable Term to be things-other-than-String, we could do the right
thing with typed fields. Etc"

+1  For 3.0 this will be of great benefit in the effort to remove the 
excessive string creation that happens right now in Lucene.  Term should 
also be more generic, such that it can also accept primitive or 
user-defined types (and index format encodings).

> Complete overhaul of FieldCache API/Implementation
> --
>
> Key: LUCENE-831
> URL: https://issues.apache.org/jira/browse/LUCENE-831
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Reporter: Hoss Man
> Fix For: 3.0
>
> Attachments: ExtendedDocument.java, fieldcache-overhaul.032208.diff, 
> fieldcache-overhaul.diff, fieldcache-overhaul.diff, 
> LUCENE-831.03.28.2008.diff, LUCENE-831.03.30.2008.diff, 
> LUCENE-831.03.31.2008.diff, LUCENE-831.patch, LUCENE-831.patch, 
> LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch
>
>
> Motivation:
> 1) Complete overhaul the API/implementation of "FieldCache" type things...
> a) eliminate global static map keyed on IndexReader (thus
> eliminating synch block between completely independent IndexReaders)
> b) allow more customization of cache management (ie: use 
> expiration/replacement strategies, disk backed caches, etc)
> c) allow people to define custom cache data logic (ie: custom
> parsers, complex datatypes, etc... anything tied to a reader)
> d) allow people to inspect what's in a cache (list of CacheKeys) for
> an IndexReader so a new IndexReader can be likewise warmed. 
> e) Lend support for smarter cache management if/when
> IndexReader.reopen is added (merging of cached data from subReaders).
> 2) Provide backwards compatibility to support existing FieldCache API with
> the new implementation, so there is no redundant caching as client code
> migrates to new API.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





[jira] Updated: (LUCENE-1478) Missing possibility to supply custom FieldParser when sorting search results

2008-12-08 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1478:
---

Attachment: LUCENE-1478.patch

OK I made a few small changes to the patch: added CHANGES entry, touched up 
javadocs, and added null check for field in the new SortField ctors.  I think 
it's ready to commit!

Uwe can you look over my changes?  Thanks.

> Missing possibility to supply custom FieldParser when sorting search results
> 
>
> Key: LUCENE-1478
> URL: https://issues.apache.org/jira/browse/LUCENE-1478
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.4
>Reporter: Uwe Schindler
>Assignee: Michael McCandless
> Attachments: LUCENE-1478-no-superinterface.patch, LUCENE-1478.patch, 
> LUCENE-1478.patch, LUCENE-1478.patch, LUCENE-1478.patch
>
>
> When implementing the new TrieRangeQuery for contrib (LUCENE-1470), I was 
> confronted by the problem that the special trie-encoded values (which are 
> longs in a special encoding) cannot be sorted by Searcher.search() and 
> SortField. The problem is: if you use SortField.LONG, you get 
> NumberFormatExceptions. The trie-encoded values may be sorted using 
> SortField.STRING (as the encoding is such that they are sortable as 
> Strings), but this is very memory-inefficient.
> ExtendedFieldCache gives the possibility to specify a custom LongParser when 
> retrieving the cached values. But you cannot use this during searching, 
> because there is no possibility to supply this custom LongParser to the 
> SortField.
> I propose a change in the sort classes:
> Include a pointer to the parser instance to be used in SortField (if not 
> given use the default). My idea is to create a SortField using a new 
> constructor
> {code}SortField(String field, int type, Object parser, boolean reverse){code}
> The parser is "object" because all current parsers have no super-interface. 
> The ideal solution would be to have:
> {code}SortField(String field, int type, FieldCache.Parser parser, boolean 
> reverse){code}
> and FieldCache.Parser is a super-interface (just empty, more like a 
> marker-interface) of all other parsers (like LongParser...). The sort 
> implementation then must be changed to respect the given parser (if not 
> NULL), else use the default FieldCache.get without parser.




[jira] Commented: (LUCENE-1473) Implement standard Serialization across Lucene versions

2008-12-08 Thread Doug Cutting (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12654513#action_12654513
 ] 

Doug Cutting commented on LUCENE-1473:
--

Would it take any more lines of code to remove Serializable from the core 
classes and re-implement RemoteSearchable in a separate layer on top of the 
core APIs?  That layer could be a contrib module and could get all the 
externalizeable love it needs.  It could support a specific popular subset of 
query and filter classes, rather than arbitrary Query implementations.  It 
would be extensible, so that if folks wanted to support new kinds of queries, 
they easily could.  This other approach seems like a slippery slope, 
complicating already complex code with new concerns.  It would be better to 
encapsulate these concerns in a layer atop APIs whose back-compatibility we 
already make promises about, no?

> Implement standard Serialization across Lucene versions
> ---
>
> Key: LUCENE-1473
> URL: https://issues.apache.org/jira/browse/LUCENE-1473
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Search
>Affects Versions: 2.4
>Reporter: Jason Rutherglen
>Priority: Minor
> Attachments: custom-externalizable-reader.patch, LUCENE-1473.patch, 
> LUCENE-1473.patch, LUCENE-1473.patch, LUCENE-1473.patch
>
>   Original Estimate: 8h
>  Remaining Estimate: 8h
>
> To maintain serialization compatibility between Lucene versions, 
> serialVersionUID needs to be added to classes that implement 
> java.io.Serializable.  java.io.Externalizable may be implemented in classes 
> for faster performance.




[jira] Updated: (LUCENE-1475) Expose sub-IndexReaders from MultiReader or MultiSegmentReader

2008-12-08 Thread Jason Rutherglen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Rutherglen updated LUCENE-1475:
-

Attachment: LUCENE-1475.patch

LUCENE-1475.patch

- Added getSubReaders to IndexReader which by default returns null
- Added IndexReader.isMultiReader default false.  MultiSegmentReader and 
MultiReader return true

I took this approach rather than add an interface, as this seemed more in 
keeping with the kitchen-sink IndexReader API currently in use (meaning it's 
not my first choice, but because it's a small addition I don't care).

> Expose sub-IndexReaders from MultiReader or MultiSegmentReader
> --
>
> Key: LUCENE-1475
> URL: https://issues.apache.org/jira/browse/LUCENE-1475
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.4
>Reporter: Jason Rutherglen
>Priority: Minor
> Attachments: LUCENE-1475.patch
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> MultiReader and MultiSegmentReader are package protected and do not expose 
> the underlying sub-IndexReaders.  A way to expose the sub-readers is to have 
> an interface that an IndexReader may be cast to that exposes the underlying 
> readers.  
> This is for realtime indexing.




[jira] Assigned: (LUCENE-1471) Faster MultiSearcher.search merge docs

2008-12-08 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless reassigned LUCENE-1471:
--

Assignee: Michael McCandless

> Faster MultiSearcher.search merge docs 
> ---
>
> Key: LUCENE-1471
> URL: https://issues.apache.org/jira/browse/LUCENE-1471
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.4
>Reporter: Jason Rutherglen
>Assignee: Michael McCandless
>Priority: Minor
> Attachments: LUCENE-1471.patch, multisearcher.patch
>
>   Original Estimate: 8h
>  Remaining Estimate: 8h
>
> MultiSearcher.search places sorted search results from individual searchers 
> into a PriorityQueue.  This can be made to be more optimal by taking 
> advantage of the fact that the results returned are already sorted.  
> The proposed solution places the sub-searcher results iterator into a custom 
> PriorityQueue that produces the sorted ScoreDocs.




[jira] Commented: (LUCENE-1475) Expose sub-IndexReaders from MultiReader or MultiSegmentReader

2008-12-08 Thread robert engels (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12654518#action_12654518
 ] 

robert engels commented on LUCENE-1475:
---

I think the API is wrong.

The method should be

IndexReader[] getSubReaders()

and it should return an empty array, not null, when there are no sub-readers.

But a better API is probably

IndexReader[] getReaders()

which returns an array containing the reader itself if there are no 
sub-readers. The only drawback is that if we allow multiple levels, you then 
need to check that this != getReaders()[0].

Using null adds null checks in the common case.
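A minimal sketch of the two null-free alternatives (a hypothetical Reader class, not the real IndexReader API):

```java
// Hypothetical sketch of the two null-free alternatives discussed above.
public class Reader {
    private final Reader[] subs;

    public Reader(Reader... subs) { this.subs = subs; }

    /** Alternative 1: empty array when there are no sub-readers. */
    public Reader[] getSubReaders() { return subs; }

    /** Alternative 2: an array containing the reader itself at the leaves,
     *  so callers always iterate over real readers; with nested
     *  multi-readers a caller must check this != getReaders()[0]
     *  to avoid descending forever. */
    public Reader[] getReaders() {
        return subs.length == 0 ? new Reader[] { this } : subs;
    }
}
```

Either way, the caller's loop body needs no null check, which is the point being made.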


> Expose sub-IndexReaders from MultiReader or MultiSegmentReader
> --
>
> Key: LUCENE-1475
> URL: https://issues.apache.org/jira/browse/LUCENE-1475
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.4
>Reporter: Jason Rutherglen
>Priority: Minor
> Attachments: LUCENE-1475.patch
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>




[jira] Commented: (LUCENE-1471) Faster MultiSearcher.search merge docs

2008-12-08 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12654520#action_12654520
 ] 

Michael McCandless commented on LUCENE-1471:



I agree performance improvement is probably smallish since m & n are
usually small; still it'd be good to improve it, especially since
we're discussing cutting over sort-by-field searching in IndexSearcher
to the MultiSearcher approach, and, sometimes m & n may not be small.

There are two different patches here.  I think the approaches are
mostly the same (ie use 2nd pqueue to extract top N merged results),
but on quick inspection there are some differences:

  * The first one shares a common source for the big switch statement
(by extending FieldDocSortedHitQueue) on SortField.getType(), which
is great.

  * First one passes all tests; 2nd one fails at least 3 tests (all
due to the AUTO SortField -- what's the fix here?).

  * Code style is closer to Lucene's in the first one ({'s not on
separate lines, no _ leader in many variable names).

I'm sure there are other differences I'm missing.  Can you two work
together to merge the two patches into a single one?  Thanks.


> Faster MultiSearcher.search merge docs 
> ---
>
> Key: LUCENE-1471
> URL: https://issues.apache.org/jira/browse/LUCENE-1471
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.4
>Reporter: Jason Rutherglen
>Assignee: Michael McCandless
>Priority: Minor
> Attachments: LUCENE-1471.patch, multisearcher.patch
>
>   Original Estimate: 8h
>  Remaining Estimate: 8h
>




[jira] Commented: (LUCENE-1471) Faster MultiSearcher.search merge docs

2008-12-08 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12654529#action_12654529
 ] 

Jason Rutherglen commented on LUCENE-1471:
--

The patches seem to implement the same concept?  I'm using the 2nd one because 
FieldDocSortedHitQueue is not public (it should be) and some other class is 
final, which made using the 1st patch impossible.

If there is no performance difference, then the 1st patch is less code and 
re-uses Lucene more, so the 1st looks best.

Mike M:
"I agree performance improvement is probably smallish since m & n are
usually small; "
If results are in the hundreds then the performance matters.  With 
microprocessor core counts growing because we don't have nanotech processors 
yet, parallel-thread searching should be the norm for systems that care about 
response time.

> Faster MultiSearcher.search merge docs 
> ---
>
> Key: LUCENE-1471
> URL: https://issues.apache.org/jira/browse/LUCENE-1471
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.4
>Reporter: Jason Rutherglen
>Assignee: Michael McCandless
>Priority: Minor
> Attachments: LUCENE-1471.patch, multisearcher.patch
>
>   Original Estimate: 8h
>  Remaining Estimate: 8h
>




[jira] Commented: (LUCENE-1475) Expose sub-IndexReaders from MultiReader or MultiSegmentReader

2008-12-08 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12654532#action_12654532
 ] 

Michael McCandless commented on LUCENE-1475:


I agree: a better default impl is a length-1 array containing the reader 
itself.

What should be returned if a Multi*Reader has embedded Multi*Readers as 
sub-readers?  (Admittedly rare, but still... e.g. from the standpoint of 
LUCENE-831, we'd want them expanded & inlined (recursively) into the returned 
array, I think.)

Also, this API implicitly carries a promise: that the readers are logically 
sequentially concatenated to define the docID sequence.  Maybe we should name 
it getSequentialReaders or something less generic, to reflect this?  E.g. 
ParallelReader also contains an array of sub-readers, but one should never 
return that from getReaders().

> Expose sub-IndexReaders from MultiReader or MultiSegmentReader
> --
>
> Key: LUCENE-1475
> URL: https://issues.apache.org/jira/browse/LUCENE-1475
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.4
>Reporter: Jason Rutherglen
>Priority: Minor
> Attachments: LUCENE-1475.patch
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>




[jira] Commented: (LUCENE-1478) Missing possibility to supply custom FieldParser when sorting search results

2008-12-08 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12654537#action_12654537
 ] 

Uwe Schindler commented on LUCENE-1478:
---

Hi Mike,

Patch looks good; I checked each of your changes with TortoiseMerge. All tests 
pass, including the Trie ones.

The only comment: you added these javadocs to hashCode() and equals() in the 
patch for LUCENE-1481. Maybe you should mention the parser here, too.

  /** Returns a hash code value for this object.  If a
   *  [EMAIL PROTECTED] SortComparatorSource} was provided, it must
   *  properly implement hashCode. */

But on the other hand: if the parser and/or comparator are static singletons 
(as is done by the TrieUtils factories), they do not need to implement equals 
and hashCode. The default Object equals/hashCode is enough for singletons, and 
I think most parsers and comparators are singletons. A short note should be 
enough.
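The singleton point as a tiny, invented example (this is not the real TrieUtils code): when a parser is only ever exposed as one instance, the default identity-based equals/hashCode is already correct.

```java
// Invented example: a parser exposed only as a singleton. Any two cache
// lookups keyed on it compare equal via the default Object.equals
// (identity), so no custom equals/hashCode is required.
public class TrieLongParser {
    public static final TrieLongParser INSTANCE = new TrieLongParser();
    private TrieLongParser() {}

    public long parse(String value) {
        return Long.parseLong(value, 16);  // stand-in for the trie decoding
    }
}
```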

The additional null check is OK but in my opinion not needed, because 
field != null whenever the sort is not one of the special RELEVANCE/DOCORDER 
sorts. For consistency we could add the check to the other ctors, too.

> Missing possibility to supply custom FieldParser when sorting search results
> 
>
> Key: LUCENE-1478
> URL: https://issues.apache.org/jira/browse/LUCENE-1478
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.4
>Reporter: Uwe Schindler
>Assignee: Michael McCandless
> Attachments: LUCENE-1478-no-superinterface.patch, LUCENE-1478.patch, 
> LUCENE-1478.patch, LUCENE-1478.patch, LUCENE-1478.patch
>
>




[jira] Commented: (LUCENE-1471) Faster MultiSearcher.search merge docs

2008-12-08 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12654546#action_12654546
 ] 

Michael McCandless commented on LUCENE-1471:


bq. The patches seem to implement the same concept?

That's my impression.

bq. I'm using the 2nd one because FieldDocSortedHitQueue is not public (it 
should be) and some other class is final that made using the 1st patch 
impossible.

The first patch works fine, w/o making FieldDocSortedHitQueue public.

bq. If there is no performance difference then the 1st patch is less code and 
re-uses Lucene more so the 1st looks best.

OK I'll go forwards with the first patch.

{quote}
If results are in the hundreds then the performance matters. With more 
microprocessor cores
growing because we don't have nanotech processors yet, parallel thread 
searching should be the norm
for systems that care about response time.
{quote}

I would love to find a clean way to make Lucene's searching "naturally" 
concurrent, so that more cores would in fact greatly reduce worst-case 
latency.  Our inability to properly use concurrency on the search side to 
reduce a single query's latency (we can of course use concurrency to improve 
net throughput today) will soon be a big limitation.  ParallelMultiSearcher 
ought to work, but it requires you to partition manually, and it should pool 
threads or use an ExecutorService.  But I don't see how this applies to this 
issue...

> Faster MultiSearcher.search merge docs 
> ---
>
> Key: LUCENE-1471
> URL: https://issues.apache.org/jira/browse/LUCENE-1471
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.4
>Reporter: Jason Rutherglen
>Assignee: Michael McCandless
>Priority: Minor
> Attachments: LUCENE-1471.patch, multisearcher.patch
>
>   Original Estimate: 8h
>  Remaining Estimate: 8h
>




[jira] Commented: (LUCENE-1482) Replace infoStream by a logging framework (SLF4J)

2008-12-08 Thread Doug Cutting (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12654545#action_12654545
 ] 

Doug Cutting commented on LUCENE-1482:
--

> Should I attach the slf4j jars separately?

If we go with SLF4J, we'd want to include the -api jar in Lucene for sure, 
along with a single implementation.  My vote would be for the -nop 
implementation.  Then, folks who want logging can include the implementation 
they like.


> Replace infoSteram by a logging framework (SLF4J)
> -
>
> Key: LUCENE-1482
> URL: https://issues.apache.org/jira/browse/LUCENE-1482
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Shai Erera
>Priority: Minor
> Fix For: 2.4.1, 2.9
>
> Attachments: LUCENE-1482.patch
>
>
> Lucene makes use of infoStream to output messages in its indexing code only. 
> For debugging purposes, when the search application is run on the customer 
> side, getting messages from other code flows, like search, query parsing, 
> analysis, etc., can be extremely useful.
> There are two main problems with infoStream today:
> 1. It is owned by IndexWriter, so if I want to add logging capabilities to 
> other classes I need to either expose an API or propagate infoStream to all 
> classes (see for example DocumentsWriter, which receives its infoStream 
> instance from IndexWriter).
> 2. I can either turn debugging on or off, for the entire code.
> Introducing a logging framework can allow each class to control its logging 
> independently, and more importantly, allows the application to turn on 
> logging for only specific areas in the code (i.e., org.apache.lucene.index.*).
> I've investigated SLF4J (stands for Simple Logging Facade for Java) which is, 
> as its name states, a facade over different logging frameworks. As such, you 
> can include the slf4j.jar in your application, and it recognizes at deploy 
> time what is the actual logging framework you'd like to use. SLF4J comes with 
> several adapters for Java logging, Log4j and others. If you know your 
> application uses Java logging, simply drop slf4j.jar and slf4j-jdk14.jar in 
> your classpath, and your logging statements will use Java logging underneath 
> the covers.
> This makes the logging code very simple. For a class A the logger is 
> instantiated like this:
> {code}
> public class A {
>   private static final Logger logger = LoggerFactory.getLogger(A.class);
> }
> {code}
> And later used like this:
> {code}
> public class A {
>   private static final Logger logger = LoggerFactory.getLogger(A.class);
>   public void foo() {
>     if (logger.isDebugEnabled()) {
>       logger.debug("message");
>     }
>   }
> }
> {code}
> That's all!
> Checking for isDebugEnabled is very quick, at least using the JDK14 adapter 
> (but I assume it's fast also over other logging frameworks).
> The important thing is, every class controls its own logger. Not all classes 
> have to output logging messages, and we can improve Lucene's logging 
> gradually, w/o changing the API, by adding more logging messages to 
> interesting classes.
> I will submit a patch shortly.
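For comparison, here is the same guarded-logging pattern written directly against java.util.logging, which is what the slf4j-jdk14 adapter delegates to. This is a sketch for illustration, not code from the patch; `buildDetails()` is a hypothetical expensive call.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Per-class guarded logging with JUL, mirroring logger.isDebugEnabled()
// in the SLF4J example: the level check skips argument construction
// entirely when the level is disabled.
public class A {
    private static final Logger logger = Logger.getLogger(A.class.getName());

    public void foo() {
        if (logger.isLoggable(Level.FINE)) {
            // Only pay for string concatenation when FINE is actually on.
            logger.fine("message: " + buildDetails());
        }
    }

    private String buildDetails() { return "details"; }
}
```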

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Issue Comment Edited: (LUCENE-1478) Missing possibility to supply custom FieldParser when sorting search results

2008-12-08 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12654537#action_12654537
 ] 

thetaphi edited comment on LUCENE-1478 at 12/8/08 11:59 AM:
-

Hi Mike,

patch looks good; I checked each of your changes with TortoiseMerge. All tests 
pass, including the Trie ones.

The only comment: you added these javadocs to hashCode() and equals() in the 
patch for LUCENE-1481. Maybe you should mention the parser here, too.

{code}
  /** Returns a hash code value for this object.  If a
   *  [EMAIL PROTECTED] SortComparatorSource} was provided, it must
   *  properly implement hashCode. */
{code}
On the other hand, if the parser and/or comparator are static singletons (as 
with the TrieUtils factories), they do not need to implement equals and 
hashCode; the default Object implementations are enough for singletons. And I 
think most parsers and comparators are singletons. A short note should be 
enough.

The additional null check is OK but in my opinion not needed, because 
field != null unless the sort is one of the special RELEVANCE/DOCORDER sorts. 
For consistency we could add the check to the other ctors, too.
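A minimal sketch of the singleton point, with hypothetical names: when one shared parser instance is used everywhere, Object's identity-based equals/hashCode is already correct, so no overrides are needed.

```java
// Illustrative only: a parser exposed as a static singleton can rely on
// identity equality, since every SortField sharing it holds the same instance.
public class SingletonParserDemo {
    interface LongParser { long parseLong(String s); }

    // One shared instance, in the spirit of the TrieUtils factories.
    static final LongParser TRIE_LIKE_PARSER = s -> Long.parseLong(s, 16);

    static boolean sameParser(LongParser a, LongParser b) {
        return a == b; // identity is enough for a shared singleton
    }
}
```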

> Missing possibility to supply custom FieldParser when sorting search results
> 
>
> Key: LUCENE-1478
> URL: https://issues.apache.org/jira/browse/LUCENE-1478
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.4
>Reporter: Uwe Schindler
>Assignee: Michael McCandless
> Attachments: LUCENE-1478-no-superinterface.patch, LUCENE-1478.patch, 
> LUCENE-1478.patch, LUCENE-1478.patch, LUCENE-1478.patch
>
>
> When implementing the new TrieRangeQuery for contrib (LUCENE-1470), I was 
> confronted by the problem that the special trie-encoded values (which are 
> longs in a special encoding) cannot be sorted by Searcher.search() and 
> SortField. The problem is: If you use SortField.LONG, you get 
> NumberFormatExceptions. The trie-encoded values may be sorted using 
> SortField.STRING (as the encoding is such that they are sortable as 
> Strings), but this is very memory-inefficient.
> ExtendedFieldCache gives the possibility to specify a custom LongParser when 
> retrieving the cached values. But you cannot use this during searching, 
> because there is no possibility to supply this custom LongParser to the 
> SortField.
> I propose a change in the sort classes:
> Include a pointer to the parser instance to be used in SortField (if not 
> given use the default). My idea is to create a SortField using a new 
> constructor
> {code}SortField(String field, int type, Object parser, boolean reverse){code}
> The parser is "object" because all current parsers have no super-interface. 
> The ideal solution would be to have:
> {code}SortField(String field, int type, FieldCache.Parser parser, boolean 
> reverse){code}
> and FieldCache.Parser is a super-interface (just empty, more like a 
> marker-interface) of all other parsers (like LongParser...). The sort 
> implementation then must be changed to respect the given parser (if not 
> NULL), else use the default FieldCache.get without parser.
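The marker-interface idea can be sketched like this (hypothetical names, not the committed SortField API): the constructor is typed against the empty super-interface, and the sort implementation downcasts to the concrete parser when one is supplied.

```java
// Sketch of the proposed shape: an empty marker super-interface lets the
// constructor be typed, while each concrete parser keeps its own parse method.
public class SortFieldSketch {
    interface Parser { }                                   // empty marker
    interface LongParser extends Parser { long parseLong(String value); }

    final String field;
    final int type;
    final Parser parser;   // null means "use the default FieldCache parser"
    final boolean reverse;

    SortFieldSketch(String field, int type, Parser parser, boolean reverse) {
        this.field = field;
        this.type = type;
        this.parser = parser;
        this.reverse = reverse;
    }

    long parse(String value) {
        if (parser instanceof LongParser) {
            return ((LongParser) parser).parseLong(value);  // custom parser
        }
        return Long.parseLong(value);  // stand-in for the default path
    }
}
```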

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-1478) Missing possibility to supply custom FieldParser when sorting search results

2008-12-08 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1478:
---

Attachment: LUCENE-1478.patch

New patch attached:

bq. Maybe you should add the parser here, too.

OK done.

bq. The default Object equals/hashcode is enough for singletons.

OK I updated javadoc to add "unless a singleton is always used".

bq. The additional null check is OK but in my opinion not needed, because 
field!=null when not one of the special RELEVANCE/DOCORDER sorts. For 
consistency we may add the check to the other ctors, too.

Yeah I added that because the javadoc had previously said field could
be null for DOC/SCORE sort type.  OK I added that check for all ctors
(added initFieldType private utility method).

> Missing possibility to supply custom FieldParser when sorting search results
> 
>
> Key: LUCENE-1478
> URL: https://issues.apache.org/jira/browse/LUCENE-1478
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.4
>Reporter: Uwe Schindler
>Assignee: Michael McCandless
> Attachments: LUCENE-1478-no-superinterface.patch, LUCENE-1478.patch, 
> LUCENE-1478.patch, LUCENE-1478.patch, LUCENE-1478.patch, LUCENE-1478.patch
>

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1478) Missing possibility to supply custom FieldParser when sorting search results

2008-12-08 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12654569#action_12654569
 ] 

Uwe Schindler commented on LUCENE-1478:
---

Hi Mike,
all is OK. The extra check is better than before! I think it's ready for commit.
Thanks for the discussions!

> Missing possibility to supply custom FieldParser when sorting search results
> 
>
> Key: LUCENE-1478
> URL: https://issues.apache.org/jira/browse/LUCENE-1478
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.4
>Reporter: Uwe Schindler
>Assignee: Michael McCandless
> Attachments: LUCENE-1478-no-superinterface.patch, LUCENE-1478.patch, 
> LUCENE-1478.patch, LUCENE-1478.patch, LUCENE-1478.patch, LUCENE-1478.patch
>

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1476) BitVector implement DocIdSet

2008-12-08 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12654571#action_12654571
 ] 

Michael McCandless commented on LUCENE-1476:



I like this approach!!

It's also incremental in cost (cost of flush/commit is in proportion
to how many deletes were done), but you are storing the "packet" of
incremental deletes with the segment you just flushed and not against
the N segments that had deletes.  And you write only one file to hold 
all the tombstones, which is much less costly for commit() (file sync).

And it's great that we don't need a new merge policy to handle all the
delete files.

Though one possible downside is that, for a very large segment in a very 
large index, you will likely be merging (at search time) quite a few 
delete packets.  But with the cutover to 
deletes-accessed-only-by-iterator, this cost is probably not high 
until a large percentage of the segment's docs are deleted, at which point 
you should really expungeDeletes() or optimize() or optimize(int) 
anyway.

If only we could write code as quickly as we can dream...
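Merging several sorted tombstone packets at search time can be sketched as a k-way merge with de-duplication: a doc is deleted if any packet contains it. `TombstoneMerge` and the sorted int-array packet representation are illustrative assumptions, not Lucene's actual data structures.

```java
import java.util.*;

// Merge sorted "tombstone packets" (arrays of deleted doc IDs) into one
// sorted, de-duplicated deletions array, touching only packet heads.
public class TombstoneMerge {
    static int[] mergeDeletes(List<int[]> packets) {
        // Cursor = {packetIndex, positionInPacket}; ordered by current value.
        PriorityQueue<int[]> pq = new PriorityQueue<>(
            (a, b) -> Integer.compare(packets.get(a[0])[a[1]],
                                      packets.get(b[0])[b[1]]));
        for (int p = 0; p < packets.size(); p++) {
            if (packets.get(p).length > 0) pq.add(new int[]{p, 0});
        }
        List<Integer> out = new ArrayList<>();
        int last = -1;
        while (!pq.isEmpty()) {
            int[] c = pq.poll();
            int doc = packets.get(c[0])[c[1]];
            if (doc != last) {          // skip IDs deleted in several packets
                out.add(doc);
                last = doc;
            }
            if (++c[1] < packets.get(c[0]).length) pq.add(c);
        }
        return out.stream().mapToInt(Integer::intValue).toArray();
    }
}
```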


> BitVector implement DocIdSet
> 
>
> Key: LUCENE-1476
> URL: https://issues.apache.org/jira/browse/LUCENE-1476
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.4
>Reporter: Jason Rutherglen
>Priority: Trivial
> Attachments: LUCENE-1476.patch
>
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> BitVector can implement DocIdSet.  This is for making 
> SegmentReader.deletedDocs pluggable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Resolved: (LUCENE-1478) Missing possibility to supply custom FieldParser when sorting search results

2008-12-08 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-1478.


   Resolution: Fixed
Lucene Fields: [New, Patch Available]  (was: [Patch Available, New])

Committed revision 724484.

Thanks Uwe!

> Missing possibility to supply custom FieldParser when sorting search results
> 
>
> Key: LUCENE-1478
> URL: https://issues.apache.org/jira/browse/LUCENE-1478
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.4
>Reporter: Uwe Schindler
>Assignee: Michael McCandless
> Attachments: LUCENE-1478-no-superinterface.patch, LUCENE-1478.patch, 
> LUCENE-1478.patch, LUCENE-1478.patch, LUCENE-1478.patch, LUCENE-1478.patch
>

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-1478) Missing possibility to supply custom FieldParser when sorting search results

2008-12-08 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1478:
---

Lucene Fields: [New, Patch Available]  (was: [Patch Available, New])
Fix Version/s: 2.9

> Missing possibility to supply custom FieldParser when sorting search results
> 
>
> Key: LUCENE-1478
> URL: https://issues.apache.org/jira/browse/LUCENE-1478
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.4
>Reporter: Uwe Schindler
>Assignee: Michael McCandless
> Fix For: 2.9
>
> Attachments: LUCENE-1478-no-superinterface.patch, LUCENE-1478.patch, 
> LUCENE-1478.patch, LUCENE-1478.patch, LUCENE-1478.patch, LUCENE-1478.patch
>

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: [jira] Commented: (LUCENE-1473) Implement standard Serialization across Lucene versions

2008-12-08 Thread eks dev
That sounds much better. Trying to distribute Lucene itself (my reason why all 
this would be interesting) is just not going to work for far too many 
applications and would put a burden on API extensions.

My point is, I do not want to distribute the Lucene index; I need to distribute 
my application that is using Lucene. Think of it like having a distributed 
Luke: useful by itself, but not really useful for slightly more complex use 
cases. My Hit class is a specialized Lucene Hit object, my Query has totally 
different features and aggregates a Lucene Query... this is what I can control, 
what I need to send over the wire, and that is the place where I define my 
Version/API. If the Lucene API classes change but all existing features remain, 
I have no problem keeping my serialized objects compatible. So the versioning 
comes under my control; Lucene provides only features, a library.

Having a light, easily extensible layer on top of the core API would be just 
great. As far as I am concerned, Java Serialization is not my world; something 
light and extensible in the Etch/Thrift/Hadoop IPC/Protocol Buffers direction 
is much more thrilling. That is exactly the road Hadoop, Nutch, Katta and 
probably many others are taking. Having a common base that supports such cases 
is maybe a good idea: why not make RemoteSearchable use Hadoop IPC, or 
Etch/Thrift...

Maybe there are other reasons to support Java serialization, I do not know. 
Just painting one view on this idea.
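One shape such a light, extensible layer could take is a registry mapping a wire tag to an encoder/decoder pair, so applications register their own query classes explicitly instead of relying on java.io.Serializable. Everything here (`WireRegistry`, the `tag:payload` wire format) is a hypothetical sketch, not an API from any of the projects mentioned.

```java
import java.util.*;
import java.util.function.Function;

// Registry-based serialization layer: each supported class is registered
// with a tag plus encode/decode functions; unknown classes fail fast.
public class WireRegistry {
    private final Map<String, Function<String, Object>> decoders = new HashMap<>();
    private final Map<Class<?>, Function<Object, String>> encoders = new HashMap<>();
    private final Map<Class<?>, String> tags = new HashMap<>();

    <T> void register(String tag, Class<T> cls,
                      Function<T, String> enc, Function<String, T> dec) {
        tags.put(cls, tag);
        encoders.put(cls, o -> enc.apply(cls.cast(o)));
        decoders.put(tag, dec::apply);
    }

    String encode(Object o) {
        String tag = tags.get(o.getClass());
        if (tag == null) {
            throw new IllegalArgumentException("unregistered: " + o.getClass());
        }
        return tag + ":" + encoders.get(o.getClass()).apply(o);
    }

    Object decode(String wire) {
        int i = wire.indexOf(':');
        return decoders.get(wire.substring(0, i)).apply(wire.substring(i + 1));
    }
}
```

New query kinds are supported by registering a new tag; nothing in the core classes has to change.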




- Original Message 
> From: Doug Cutting (JIRA) <[EMAIL PROTECTED]>
> To: java-dev@lucene.apache.org
> Sent: Monday, 8 December, 2008 19:52:46
> Subject: [jira] Commented: (LUCENE-1473) Implement standard Serialization 
> across Lucene versions
> 
> 
> [ 
> https://issues.apache.org/jira/browse/LUCENE-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12654513#action_12654513
>  
> ] 
> 
> Doug Cutting commented on LUCENE-1473:
> --
> 
> Would it take any more lines of code to remove Serializeable from the core 
> classes and re-implement RemoteSearchable in a separate layer on top of the 
> core 
> APIs?  That layer could be a contrib module and could get all the 
> externalizeable love it needs.  It could support a specific popular subset of 
> query and filter classes, rather than arbitrary Query implementations.  It 
> would 
> be extensible, so that if folks wanted to support new kinds of queries, they 
> easily could.  This other approach seems like a slippery slope, complicating 
> already complex code with new concerns.  It would be better to encapsulate 
> these 
> concerns in a layer atop APIs whose back-compatibility we already make 
> promises 
> about, no?
> 
> > Implement standard Serialization across Lucene versions
> > ---
> >
> > Key: LUCENE-1473
> > URL: https://issues.apache.org/jira/browse/LUCENE-1473
> > Project: Lucene - Java
> >  Issue Type: Bug
> >  Components: Search
> >Affects Versions: 2.4
> >Reporter: Jason Rutherglen
> >Priority: Minor
> > Attachments: custom-externalizable-reader.patch, LUCENE-1473.patch, 
> LUCENE-1473.patch, LUCENE-1473.patch, LUCENE-1473.patch
> >
> >   Original Estimate: 8h
> >  Remaining Estimate: 8h
> >
> > To maintain serialization compatibility between Lucene versions, 
> serialVersionUID needs to be added to classes that implement 
> java.io.Serializable.  java.io.Externalizable may be implemented in classes 
> for 
> faster performance.
> 
> -- 
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
> 
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: [jira] Commented: (LUCENE-1473) Implement standard Serialization across Lucene versions

2008-12-08 Thread robert engels

I think an important piece to make this work is the query parser/syntax.

We already have a system similar to what eks dev describes.  We made 
changes to the query syntax to support our various query extensions.

The nice thing is that a persisted query is a simple string.  It 
also makes it very easy for external systems to submit queries.

We also have XML definitions for a "result set".

I think the only way to make this work, though, is probably a more 
detailed query syntax (similar to SQL), so that it can be easily 
extended with new clauses/functions without breaking existing code.

I would also suggest that any core query classes have a 
representation here.

I would also like to see a way for "proprietary" clauses to be 
supported (like calls in SQL).




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





[jira] Commented: (LUCENE-1462) Instantiated/IndexWriter discrepanies

2008-12-08 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12654589#action_12654589
 ] 

Grant Ingersoll commented on LUCENE-1462:
-

Karl,

I made TVOI serializable: Committed revision 724501.

> Instantiated/IndexWriter discrepanies
> -
>
> Key: LUCENE-1462
> URL: https://issues.apache.org/jira/browse/LUCENE-1462
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/*
>Affects Versions: 2.4
>Reporter: Karl Wettin
>Assignee: Karl Wettin
>Priority: Critical
> Fix For: 2.9
>
> Attachments: LUCENE-1462.txt
>
>
>  * RAMDirectory seems to do a reset on tokenStreams the first time; this 
> permits initialising some objects before streaming starts. 
> InstantiatedIndex does not.
>  * I can serialize a RAMDirectory but not an InstantiatedIndex, because of 
> java.io.NotSerializableException: 
> org.apache.lucene.index.TermVectorOffsetInfo
> http://www.nabble.com/InstatiatedIndex-questions-to20576722.html

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1476) BitVector implement DocIdSet

2008-12-08 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12654592#action_12654592
 ] 

Jason Rutherglen commented on LUCENE-1476:
--

Marvin: 
"I'm also bothered by the proliferation of small deletions files. Probably
you'd want automatic consolidation of files under 4k, but you still could end
up with a lot of files in a big index."

A transaction log might be better here if we want to get to roughly 
zero-millisecond realtime. 
On Windows, at least, rapidly creating and deleting files incurs 
significant IO overhead.
UNIX is probably faster, but I do not know.
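The transaction-log alternative can be sketched as a single append-only record stream that readers replay into a set of deleted doc IDs; one file takes many small appends instead of many small files. In-memory streams stand in for the on-disk file here, and `DeleteLog` is a hypothetical name.

```java
import java.io.*;
import java.util.*;

// Append-only deletes log: each delete is one fixed-size record appended to
// a single stream; replay() reconstructs the current deleted-docs set.
public class DeleteLog {
    private final ByteArrayOutputStream buf = new ByteArrayOutputStream();
    private final DataOutputStream out = new DataOutputStream(buf);

    void appendDelete(int docId) {
        try {
            out.writeInt(docId);  // one small append, no new file per flush
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    Set<Integer> replay() {
        try {
            DataInputStream in = new DataInputStream(
                new ByteArrayInputStream(buf.toByteArray()));
            Set<Integer> deleted = new HashSet<>();
            while (in.available() >= 4) {
                deleted.add(in.readInt());  // duplicates collapse in the set
            }
            return deleted;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```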



> BitVector implement DocIdSet
> 
>
> Key: LUCENE-1476
> URL: https://issues.apache.org/jira/browse/LUCENE-1476
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.4
>Reporter: Jason Rutherglen
>Priority: Trivial
> Attachments: LUCENE-1476.patch
>
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> BitVector can implement DocIdSet.  This is for making 
> SegmentReader.deletedDocs pluggable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-1471) Faster MultiSearcher.search merge docs

2008-12-08 Thread Jason Rutherglen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Rutherglen updated LUCENE-1471:
-

Comment: was deleted

> Faster MultiSearcher.search merge docs 
> ---
>
> Key: LUCENE-1471
> URL: https://issues.apache.org/jira/browse/LUCENE-1471
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.4
>Reporter: Jason Rutherglen
>Assignee: Michael McCandless
>Priority: Minor
> Attachments: LUCENE-1471.patch, multisearcher.patch
>
>   Original Estimate: 8h
>  Remaining Estimate: 8h
>
> MultiSearcher.search places sorted search results from individual searchers 
> into a PriorityQueue.  This can be made to be more optimal by taking 
> advantage of the fact that the results returned are already sorted.  
> The proposed solution places the sub-searcher results iterator into a custom 
> PriorityQueue that produces the sorted ScoreDocs.
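The k-way merge described above can be sketched standalone as follows. This is an illustrative sketch, not the attached patch: Hit, Cursor, and merge are made-up names standing in for ScoreDoc and the custom PriorityQueue. A heap holds one cursor per sub-searcher, keyed on each cursor's current head, so each hit is pushed and popped at most once instead of re-heaping every result.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch: k-way merge of per-searcher result lists that are
// already sorted by descending score, via a heap of per-list cursors.
public class SortedResultsMerge {
    // Minimal stand-in for ScoreDoc: (doc id, score).
    static final class Hit {
        final int doc; final float score;
        Hit(int doc, float score) { this.doc = doc; this.score = score; }
    }

    // Cursor over one searcher's sorted hits.
    static final class Cursor {
        final Hit[] hits; int pos;
        Cursor(Hit[] hits) { this.hits = hits; }
        Hit head() { return hits[pos]; }
        boolean exhausted() { return pos >= hits.length; }
    }

    // Returns the top n hits across all sub-searchers, in descending score order.
    static List<Hit> merge(List<Hit[]> perSearcher, int n) {
        PriorityQueue<Cursor> pq =
            new PriorityQueue<>((a, b) -> Float.compare(b.head().score, a.head().score));
        for (Hit[] hits : perSearcher) {
            if (hits.length > 0) pq.add(new Cursor(hits));
        }
        List<Hit> out = new ArrayList<>(n);
        while (out.size() < n && !pq.isEmpty()) {
            Cursor top = pq.poll();
            out.add(top.head());
            top.pos++;
            if (!top.exhausted()) pq.add(top); // re-insert with the next head
        }
        return out;
    }
}
```

The heap only ever holds one entry per sub-searcher, which is the advantage over inserting every individual hit into one big priority queue.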

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1471) Faster MultiSearcher.search merge docs

2008-12-08 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12654594#action_12654594
 ] 

Jason Rutherglen commented on LUCENE-1471:
--

Wouldn't it be better to remove BitVector and replace it with OpenBitSet? 
OBS is faster and already has a DocIdSetIterator; it only needs a way to 
write the bitset to disk in compressed form (d-gaps?). This would be a big 
win for almost *all* searches. We could also define an interface so that 
any bitset implementation could be used.

Such as:
{code}
public interface WriteableBitSet {
  public void write(IndexOutput output) throws IOException;
}
{code}
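For the d-gap idea mentioned above, a minimal standalone sketch follows. This is illustrative only: java.util.BitSet and ByteArrayOutputStream stand in for OpenBitSet and Lucene's IndexOutput, and DGapWriter is a made-up name. Only the distances between set bits are persisted, each as a variable-length int, rather than the raw bit array.

```java
import java.io.ByteArrayOutputStream;
import java.util.BitSet;

// Hypothetical sketch of d-gap encoding for a sparse bitset.
public class DGapWriter {
    // 7-bits-per-byte variable-length int, high bit marks continuation
    // (the same general scheme Lucene uses for VInts).
    static void writeVInt(ByteArrayOutputStream out, int i) {
        while ((i & ~0x7F) != 0) {
            out.write((i & 0x7F) | 0x80);
            i >>>= 7;
        }
        out.write(i);
    }

    // Emits the number of set bits, then the gap to each next set bit.
    static byte[] writeDGaps(BitSet bits) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVInt(out, bits.cardinality());
        int prev = -1;
        for (int i = bits.nextSetBit(0); i >= 0; i = bits.nextSetBit(i + 1)) {
            writeVInt(out, i - prev);  // gap, always >= 1
            prev = i;
        }
        return out.toByteArray();
    }
}
```

For a sparse deletions set this is tiny compared with the raw bit array, which is the point of d-gap encoding a deleted-docs file.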

> Faster MultiSearcher.search merge docs 
> ---
>
> Key: LUCENE-1471
> URL: https://issues.apache.org/jira/browse/LUCENE-1471
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.4
>Reporter: Jason Rutherglen
>Assignee: Michael McCandless
>Priority: Minor
> Attachments: LUCENE-1471.patch, multisearcher.patch
>
>   Original Estimate: 8h
>  Remaining Estimate: 8h
>
> MultiSearcher.search places sorted search results from individual searchers 
> into a PriorityQueue.  This can be made to be more optimal by taking 
> advantage of the fact that the results returned are already sorted.  
> The proposed solution places the sub-searcher results iterator into a custom 
> PriorityQueue that produces the sorted ScoreDocs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1476) BitVector implement DocIdSet

2008-12-08 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12654595#action_12654595
 ] 

Jason Rutherglen commented on LUCENE-1476:
--

Wouldn't it be better to remove BitVector and replace it with OpenBitSet? 
OBS is faster and already has a DocIdSetIterator; it only needs a way to 
write the bitset to disk in compressed form (d-gaps?). This would be a big 
win for almost *all* searches. We could also define an interface so that 
any bitset implementation could be used.

Such as:
{code}
public interface WriteableBitSet {
  public void write(IndexOutput output) throws IOException;
}
{code}

> BitVector implement DocIdSet
> 
>
> Key: LUCENE-1476
> URL: https://issues.apache.org/jira/browse/LUCENE-1476
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.4
>Reporter: Jason Rutherglen
>Priority: Trivial
> Attachments: LUCENE-1476.patch
>
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> BitVector can implement DocIdSet.  This is for making 
> SegmentReader.deletedDocs pluggable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: [jira] Commented: (LUCENE-1473) Implement standard Serialization across Lucene versions

2008-12-08 Thread Erik Hatcher
Well, there's the pretty sophisticated and extensible XML query parser  
in contrib.  I've still only scratched the surface of it, but it meets  
the specs you mentioned.


Erik


On Dec 8, 2008, at 4:51 PM, robert engels wrote:

I think an important piece to make this work is the query parser/ 
syntax.


We already have a system similar to what is outlined below.  We made  
changes to the query syntax to support our various query extensions.


The nice thing, is that persisting queries is a simple string.  It  
also makes it very easy for external system to submit queries.


We also have XML definitions for a "result set".

I think the only way to make this work though, is probably a more  
detailed query syntax (similar to SQL), so that it can be easily  
extended with new clauses/functions without breaking existing code.


I would also suggest that any core queries classes have a  
representation here.


I would also like to see a way for "proprietary" clauses to be  
supported (like calls in SQL).


On Dec 8, 2008, at 3:37 PM, eks dev wrote:

That sounds much better. Trying to distribute Lucene itself (my reason  
why all this would be interesting) is just not going to work for far  
too many applications and will put a burden on API extensions.


My point is, I do not want to distribute the Lucene index; I need to  
distribute my application that is using Lucene. Think of it like  
having a distributed Luke: useful by itself, but not really useful  
for slightly more complex use cases.
My Hit class is a specialized Lucene Hit object, my Query has totally  
different features and aggregates a Lucene Query... this is what I can  
control, what I need to send over the wire, and that is the place  
where I define my Version/API. If the Lucene API classes change and  
all existing features remain, I have no problem keeping my serialized  
objects compatible. So the versioning comes under my control; Lucene  
provides only features, a library.


Having a light, easily extensible layer on top of the core API would  
be just great. As far as I am concerned, Java serialization is not my  
world; something light and extensible in the etch/thrift/Hadoop  
IPC/Protocol Buffers direction is much more appealing. That is exactly  
the road Hadoop, Nutch, Katta and probably many others are taking, so  
having a common base that supports such cases is maybe a good idea.  
Why not make RemoteSearchable use Hadoop IPC, or etch/thrift...?


Maybe there are other reasons to support Java serialization; I do  
not know. Just painting one view of this idea.





- Original Message 

From: Doug Cutting (JIRA) <[EMAIL PROTECTED]>
To: java-dev@lucene.apache.org
Sent: Monday, 8 December, 2008 19:52:46
Subject: [jira] Commented: (LUCENE-1473) Implement standard  
Serialization across Lucene versions



   [
https://issues.apache.org/jira/browse/LUCENE-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12654513 
#action_12654513

]

Doug Cutting commented on LUCENE-1473:
--

Would it take any more lines of code to remove Serializeable from  
the core
classes and re-implement RemoteSearchable in a separate layer on  
top of the core

APIs?  That layer could be a contrib module and could get all the
externalizeable love it needs.  It could support a specific  
popular subset of
query and filter classes, rather than arbitrary Query  
implementations.  It would
be extensible, so that if folks wanted to support new kinds of  
queries, they
easily could.  This other approach seems like a slippery slope,  
complicating
already complex code with new concerns.  It would be better to  
encapsulate these
concerns in a layer atop APIs whose back-compatibility we already  
make promises

about, no?


Implement standard Serialization across Lucene versions
---

   Key: LUCENE-1473
   URL: https://issues.apache.org/jira/browse/LUCENE-1473
   Project: Lucene - Java
Issue Type: Bug
Components: Search
  Affects Versions: 2.4
  Reporter: Jason Rutherglen
  Priority: Minor
   Attachments: custom-externalizable-reader.patch,  
LUCENE-1473.patch,

LUCENE-1473.patch, LUCENE-1473.patch, LUCENE-1473.patch


 Original Estimate: 8h
Remaining Estimate: 8h

To maintain serialization compatibility between Lucene versions,

serialVersionUID needs to be added to classes that implement
java.io.Serializable.  java.io.Externalizable may be implemented  
in classes for

faster performance.
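As an illustration of the two mechanisms the description names, a standalone sketch follows. This is not Lucene code: StoredTerm is a made-up class. Pinning serialVersionUID keeps old serialized forms readable across releases, while implementing Externalizable hands control of the wire format to the class itself.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectInputStream;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;

// Hypothetical sketch: a serializable value class with a fixed
// serialVersionUID and a hand-written external form.
public class StoredTerm implements Externalizable {
    private static final long serialVersionUID = 1L; // fixed across releases

    private String field;
    private String text;

    public StoredTerm() {}  // public no-arg ctor required by Externalizable
    public StoredTerm(String field, String text) { this.field = field; this.text = text; }

    @Override public void writeExternal(ObjectOutput out) throws IOException {
        out.writeUTF(field);  // explicit wire format: two UTF strings
        out.writeUTF(text);
    }
    @Override public void readExternal(ObjectInput in) throws IOException {
        field = in.readUTF();
        text = in.readUTF();
    }
    public String field() { return field; }
    public String text() { return text; }

    // Round-trip through the standard serialization machinery.
    public static StoredTerm roundTrip(StoredTerm t) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream oos = new ObjectOutputStream(bytes)) { oos.writeObject(t); }
            try (ObjectInputStream ois =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (StoredTerm) ois.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new RuntimeException(e);
        }
    }
}
```

Because readExternal/writeExternal define the byte layout explicitly, the class can evolve its fields without breaking old streams, which is the compatibility property the issue is after.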

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]






-
To unsubscribe, e-m

Re: [jira] Commented: (LUCENE-1473) Implement standard Serialization across Lucene versions

2008-12-08 Thread robert engels
The problem with that is that in most cases you still need a "string"  
based syntax that "people" can enter...


I guess you can always have an "advanced search" page that builds and  
submits the XML query behind the scenes.




On Dec 8, 2008, at 4:40 PM, Erik Hatcher wrote:

Well, there's the pretty sophisticated and extensible XML query  
parser in contrib.  I've still only scratched the surface of it,  
but it meets the specs you mentioned.


Erik


On Dec 8, 2008, at 4:51 PM, robert engels wrote:

I think an important piece to make this work is the query parser/ 
syntax.


We already have a system similar to what is outlined below.  We  
made changes to the query syntax to support our various query  
extensions.


The nice thing, is that persisting queries is a simple string.  It  
also makes it very easy for external system to submit queries.


We also have XML definitions for a "result set".

I think the only way to make this work though, is probably a more  
detailed query syntax (similar to SQL), so that it can be easily  
extended with new clauses/functions without breaking existing code.


I would also suggest that any core queries classes have a  
representation here.


I would also like to see a way for "proprietary" clauses to be  
supported (like calls in SQL).


On Dec 8, 2008, at 3:37 PM, eks dev wrote:

That sounds much better. Trying to distribute lucene (my reason  
why all this would be interesting) itself is just not going to  
work for far too many applications and will put burden on API  
extensions.


My point is, I do not want to distribute Lucene Index, I need to  
distribute my application that is using Lucene. Think of it like  
having distributed Luke, usefull by itself, but not really  
usefull for slightly more complex use cases.
My Hit class is specialized Lucene Hit object, my Query has  
totally diferent features and agregates Lucene Query... this is  
what I can control, what I need to send over the wire and that is  
the place where I define what is my Version/API, if lucene API  
Classes change and all existing featurs remain, I have no  
problems in keeping my serialized objects compatible.  So the  
versioning becomes under my control, Lucene provides only  
features, library.


Having light layer, easily extensible,  on top of the core  API  
would be just great, as fas as I am concerned java Serialization  
is not my world, having something light and extensible in etch/ 
thrift/hadop IPC/ProtocolBuffers  direction is much more  
thrilling. That is exactly the road hadoop, nutch, katta and  
probably many others are taking, having comon base that supports  
such cases is maybe good idea, why not making RemoteSearchable  
using hadoop IPC, or etch/thrift ...


Maybe there are other reasons to suport java serialization, I do  
not know. Just painting one view on this idea





- Original Message 

From: Doug Cutting (JIRA) <[EMAIL PROTECTED]>
To: java-dev@lucene.apache.org
Sent: Monday, 8 December, 2008 19:52:46
Subject: [jira] Commented: (LUCENE-1473) Implement standard  
Serialization across Lucene versions



   [
https://issues.apache.org/jira/browse/LUCENE-1473? 
page=com.atlassian.jira.plugin.system.issuetabpanels:comment- 
tabpanel&focusedCommentId=12654513#action_12654513

]

Doug Cutting commented on LUCENE-1473:
--

Would it take any more lines of code to remove Serializeable  
from the core
classes and re-implement RemoteSearchable in a separate layer on  
top of the core

APIs?  That layer could be a contrib module and could get all the
externalizeable love it needs.  It could support a specific  
popular subset of
query and filter classes, rather than arbitrary Query  
implementations.  It would
be extensible, so that if folks wanted to support new kinds of  
queries, they
easily could.  This other approach seems like a slippery slope,  
complicating
already complex code with new concerns.  It would be better to  
encapsulate these
concerns in a layer atop APIs whose back-compatibility we  
already make promises

about, no?


Implement standard Serialization across Lucene versions
---

   Key: LUCENE-1473
   URL: https://issues.apache.org/jira/browse/ 
LUCENE-1473

   Project: Lucene - Java
Issue Type: Bug
Components: Search
  Affects Versions: 2.4
  Reporter: Jason Rutherglen
  Priority: Minor
   Attachments: custom-externalizable-reader.patch,  
LUCENE-1473.patch,

LUCENE-1473.patch, LUCENE-1473.patch, LUCENE-1473.patch


 Original Estimate: 8h
Remaining Estimate: 8h

To maintain serialization compatibility between Lucene versions,

serialVersionUID needs to be added to classes that implement
java.io.Serializable.  java.io.Externalizable may be implemented  
in classes for

faster performance.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add 

Re: [jira] Commented: (LUCENE-1473) Implement standard Serialization across Lucene versions

2008-12-08 Thread Earwin Burrfoot
Building your own parser with ANTLR is really easy. Using Ragel is
harder, but yields insane parsing performance.
Is there any reason to worry about library-bundled parsers if you're
making something more complex than a college project?

On Tue, Dec 9, 2008 at 01:49, robert engels <[EMAIL PROTECTED]> wrote:
> The problem with that is that in most cases you still need a "string" based
> syntax that "people" can enter...
>
> I guess you can always have an "advanced search" page that builds and
> submits the XML query behind the scenes.
>
>
>
> On Dec 8, 2008, at 4:40 PM, Erik Hatcher wrote:
>
>> Well, there's the pretty sophisticated and extensible XML query parser in
>> contrib.  I've still only scratched the surface of it, but it meets the
>> specs you mentioned.
>>
>>Erik
>>
>>
>> On Dec 8, 2008, at 4:51 PM, robert engels wrote:
>>
>>> I think an important piece to make this work is the query parser/syntax.
>>>
>>> We already have a system similar to what is outlined below.  We made
>>> changes to the query syntax to support our various query extensions.
>>>
>>> The nice thing, is that persisting queries is a simple string.  It also
>>> makes it very easy for external system to submit queries.
>>>
>>> We also have XML definitions for a "result set".
>>>
>>> I think the only way to make this work though, is probably a more
>>> detailed query syntax (similar to SQL), so that it can be easily extended
>>> with new clauses/functions without breaking existing code.
>>>
>>> I would also suggest that any core queries classes have a representation
>>> here.
>>>
>>> I would also like to see a way for "proprietary" clauses to be supported
>>> (like calls in SQL).
>>>
>>> On Dec 8, 2008, at 3:37 PM, eks dev wrote:
>>>
 That sounds much better. Trying to distribute lucene (my reason why all
 this would be interesting) itself is just not going to work for far too 
 many
 applications and will put burden on API extensions.

 My point is, I do not want to distribute Lucene Index, I need to
 distribute my application that is using Lucene. Think of it like having
 distributed Luke, usefull by itself, but not really usefull for slightly
 more complex use cases.
 My Hit class is specialized Lucene Hit object, my Query has totally
 diferent features and agregates Lucene Query... this is what I can control,
 what I need to send over the wire and that is the place where I define what
 is my Version/API, if lucene API Classes change and all existing featurs
 remain, I have no problems in keeping my serialized objects compatible.  So
 the versioning becomes under my control, Lucene provides only features,
 library.

 Having light layer, easily extensible,  on top of the core  API would be
 just great, as fas as I am concerned java Serialization is not my world,
 having something light and extensible in etch/thrift/hadop
 IPC/ProtocolBuffers  direction is much more thrilling. That is exactly the
 road hadoop, nutch, katta and probably many others are taking, having comon
 base that supports such cases is maybe good idea, why not making
 RemoteSearchable using hadoop IPC, or etch/thrift ...

 Maybe there are other reasons to suport java serialization, I do not
 know. Just painting one view on this idea




 - Original Message 
>
> From: Doug Cutting (JIRA) <[EMAIL PROTECTED]>
> To: java-dev@lucene.apache.org
> Sent: Monday, 8 December, 2008 19:52:46
> Subject: [jira] Commented: (LUCENE-1473) Implement standard
> Serialization across Lucene versions
>
>
>   [
>
> https://issues.apache.org/jira/browse/LUCENE-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12654513#action_12654513
> ]
>
> Doug Cutting commented on LUCENE-1473:
> --
>
> Would it take any more lines of code to remove Serializeable from the
> core
> classes and re-implement RemoteSearchable in a separate layer on top of
> the core
> APIs?  That layer could be a contrib module and could get all the
> externalizeable love it needs.  It could support a specific popular
> subset of
> query and filter classes, rather than arbitrary Query implementations.
>  It would
> be extensible, so that if folks wanted to support new kinds of queries,
> they
> easily could.  This other approach seems like a slippery slope,
> complicating
> already complex code with new concerns.  It would be better to
> encapsulate these
> concerns in a layer atop APIs whose back-compatibility we already make
> promises
> about, no?
>
>> Implement standard Serialization across Lucene versions
>> ---
>>
>>   Key: LUCENE-1473
>>   URL: https://issues.apac

[jira] Updated: (LUCENE-1478) Missing possibility to supply custom FieldParser when sorting search results

2008-12-08 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-1478:
--

Attachment: LUCENE-1478-cleanup.patch

Hi Mike,
sorry, after looking a second time at the new SortField ctors, I changed two 
cosmetic things:

- The ctor that takes a parser assigns this.type = type and later calls the 
init method with the member variable type, so the init method assigns the 
member back to itself. It is cleaner to just call the initFieldType() method 
in the matching instanceof clause.
- Moved initFieldType() behind all ctors.

But this is only cosmetic :)

> Missing possibility to supply custom FieldParser when sorting search results
> 
>
> Key: LUCENE-1478
> URL: https://issues.apache.org/jira/browse/LUCENE-1478
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.4
>Reporter: Uwe Schindler
>Assignee: Michael McCandless
> Fix For: 2.9
>
> Attachments: LUCENE-1478-cleanup.patch, 
> LUCENE-1478-no-superinterface.patch, LUCENE-1478.patch, LUCENE-1478.patch, 
> LUCENE-1478.patch, LUCENE-1478.patch, LUCENE-1478.patch
>
>
> When implementing the new TrieRangeQuery for contrib (LUCENE-1470), I was 
> confronted by the problem that the special trie-encoded values (which are 
> longs in a special encoding) cannot be sorted by Searcher.search() and 
> SortField. The problem is: If you use SortField.LONG, you get 
> NumberFormatExceptions. The trie encoded values may be sorted using 
> SortField.String (as the encoding is in such a way, that they are sortable as 
> Strings), but this is very memory ineffective.
> ExtendedFieldCache gives the possibility to specify a custom LongParser when 
> retrieving the cached values. But you cannot use this during searching, 
> because there is no possibility to supply this custom LongParser to the 
> SortField.
> I propose a change in the sort classes:
> Include a pointer to the parser instance to be used in SortField (if not 
> given use the default). My idea is to create a SortField using a new 
> constructor
> {code}SortField(String field, int type, Object parser, boolean reverse){code}
> The parser is "object" because all current parsers have no super-interface. 
> The ideal solution would be to have:
> {code}SortField(String field, int type, FieldCache.Parser parser, boolean 
> reverse){code}
> and FieldCache.Parser is a super-interface (just empty, more like a 
> marker-interface) of all other parsers (like LongParser...). The sort 
> implementation then must be changed to respect the given parser (if not 
> NULL), else use the default FieldCache.get without parser.
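The proposed marker-interface arrangement can be sketched standalone as follows. Parser, LongParser, and SortFieldLike mirror the names in the description but are illustrative stand-ins, not the actual Lucene API: an empty marker super-interface lets one typed constructor parameter accept any parser, and the sort path casts to the concrete parser type it needs.

```java
// Hypothetical sketch of the marker-interface proposal.
public class CustomParserSketch {
    interface Parser {}                         // empty marker super-interface
    interface LongParser extends Parser {
        long parseLong(String value);           // decode an indexed term to a long
    }

    // Stand-in for the proposed SortField ctor taking a parser.
    static final class SortFieldLike {
        final String field; final Parser parser; final boolean reverse;
        SortFieldLike(String field, Parser parser, boolean reverse) {
            this.field = field; this.parser = parser; this.reverse = reverse;
        }
    }

    // Toy stand-in for a custom encoding: terms stored as hex strings,
    // which sort correctly as strings; the parser recovers the long value.
    static final LongParser TOY_PARSER = value -> Long.parseLong(value, 16);

    // What the sort implementation would do when a parser is supplied.
    static long decodeFor(SortFieldLike sf, String raw) {
        return ((LongParser) sf.parser).parseLong(raw);
    }
}
```

The marker interface carries no methods; it exists only so the SortField constructor can be typed more strongly than Object while still accepting every parser variant.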

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: [jira] Commented: (LUCENE-1473) Implement standard Serialization across Lucene versions

2008-12-08 Thread robert engels
I only meant it from a persistence standpoint: if you need a full  
"human-enterable" query syntax anyway, why not just use that as the  
persistence format?


On Dec 8, 2008, at 4:53 PM, Earwin Burrfoot wrote:


Building your own parser with Antlr is really easy. Using Ragel is
harder, but yields insane parsing performance.
Is there any reason to worry about library-bundled parsers if you're
making something more complex then a college project?

On Tue, Dec 9, 2008 at 01:49, robert engels <[EMAIL PROTECTED]>  
wrote:
The problem with that is that in most cases you still need a  
"string" based

syntax that "people" can enter...

I guess you can always have an "advanced search" page that builds and
submits the XML query behind the scenes.



On Dec 8, 2008, at 4:40 PM, Erik Hatcher wrote:

Well, there's the pretty sophisticated and extensible XML query  
parser in
contrib.  I've still only scratched the surface of it, but it  
meets the

specs you mentioned.

   Erik


On Dec 8, 2008, at 4:51 PM, robert engels wrote:

I think an important piece to make this work is the query parser/ 
syntax.


We already have a system similar to what is outlined below.  We  
made
changes to the query syntax to support our various query  
extensions.


The nice thing, is that persisting queries is a simple string.   
It also

makes it very easy for external system to submit queries.

We also have XML definitions for a "result set".

I think the only way to make this work though, is probably a more
detailed query syntax (similar to SQL), so that it can be easily  
extended

with new clauses/functions without breaking existing code.

I would also suggest that any core queries classes have a  
representation

here.

I would also like to see a way for "proprietary" clauses to be  
supported

(like calls in SQL).

On Dec 8, 2008, at 3:37 PM, eks dev wrote:

That sounds much better. Trying to distribute lucene (my reason  
why all
this would be interesting) itself is just not going to work for  
far too many

applications and will put burden on API extensions.

My point is, I do not want to distribute Lucene Index, I need to
distribute my application that is using Lucene. Think of it  
like having
distributed Luke, usefull by itself, but not really usefull for  
slightly

more complex use cases.
My Hit class is specialized Lucene Hit object, my Query has  
totally
diferent features and agregates Lucene Query... this is what I  
can control,
what I need to send over the wire and that is the place where I  
define what
is my Version/API, if lucene API Classes change and all  
existing featurs
remain, I have no problems in keeping my serialized objects  
compatible.  So
the versioning becomes under my control, Lucene provides only  
features,

library.

Having light layer, easily extensible,  on top of the core  API  
would be
just great, as fas as I am concerned java Serialization is not  
my world,

having something light and extensible in etch/thrift/hadop
IPC/ProtocolBuffers  direction is much more thrilling. That is  
exactly the
road hadoop, nutch, katta and probably many others are taking,  
having comon

base that supports such cases is maybe good idea, why not making
RemoteSearchable using hadoop IPC, or etch/thrift ...

Maybe there are other reasons to suport java serialization, I  
do not

know. Just painting one view on this idea




- Original Message 


From: Doug Cutting (JIRA) <[EMAIL PROTECTED]>
To: java-dev@lucene.apache.org
Sent: Monday, 8 December, 2008 19:52:46
Subject: [jira] Commented: (LUCENE-1473) Implement standard
Serialization across Lucene versions


  [

https://issues.apache.org/jira/browse/LUCENE-1473? 
page=com.atlassian.jira.plugin.system.issuetabpanels:comment- 
tabpanel&focusedCommentId=12654513#action_12654513

]

Doug Cutting commented on LUCENE-1473:
--

Would it take any more lines of code to remove Serializeable  
from the

core
classes and re-implement RemoteSearchable in a separate layer  
on top of

the core
APIs?  That layer could be a contrib module and could get all the
externalizeable love it needs.  It could support a specific  
popular

subset of
query and filter classes, rather than arbitrary Query  
implementations.

 It would
be extensible, so that if folks wanted to support new kinds of  
queries,

they
easily could.  This other approach seems like a slippery slope,
complicating
already complex code with new concerns.  It would be better to
encapsulate these
concerns in a layer atop APIs whose back-compatibility we  
already make

promises
about, no?


Implement standard Serialization across Lucene versions
---

  Key: LUCENE-1473
  URL: https://issues.apache.org/jira/browse/ 
LUCENE-1473

  Project: Lucene - Java
   Issue Type: Bug
   Components: Search
 Affects Versions: 2.4
 Reporter: Jason Rutherglen
 Priority: Minor

Re: [jira] Commented: (LUCENE-831) Complete overhaul of FieldCache API/Implementation

2008-12-08 Thread Mark Miller

Mark Miller wrote:


Which new sort stuff are you referring to?  Is it LUCENE-1471?


Yes. The first thing I did was try to patch this in, but the sort tests 
failed. The order would be right except that, e.g., the two center docs 
would be reversed. No time to dig in, so I just switched to the trunk 
MultiSearcher, and all tests passed except for the two with the above 
issues.
Spoke too soon: it wasn't LUCENE-1471's fault, it was just hitting different 
aspects of an issue that's messed up in the old MultiSearcher as well.


I think I have everything working, except there is a problem with sorting 
by doc id (it won't reverse). Got the auto-detection working, though.


Thanks for the ref to that bug Uwe, was indeed the problem.

- Mark


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: [jira] Commented: (LUCENE-831) Complete overhaul of FieldCache API/Implementation

2008-12-08 Thread Uwe Schindler
Hi Mark,

> Thanks for the ref to that bug Uwe, was indeed the problem.

This is now committed: updates in FieldSortedHitQueue, new super-interface
for FieldCache.Parsers and SortField changes (see Mikes commit as I have no
committer status yet).

Uwe


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: [jira] Commented: (LUCENE-1473) Implement standard Serialization across Lucene versions

2008-12-08 Thread markharw00d


The problem with that is that in most cases you still need a "string" 
based syntax that "people" can enter...


The XML syntax includes a  tag for embedding user input of 
this type.


I guess you can always have an "advanced search" page that builds and 
submits the XML query behind the scenes.


Contrib now includes a worked demo web app showing how a very typical 
search form is converted into XML using XSL.
User input is a mixture of edit boxes for classic QueryParser syntax 
used on free-text fields, plus drop-downs, checkboxes, etc. that map 
to other non-free-text fields.


Cheers
Mark




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1478) Missing possibility to supply custom FieldParser when sorting search results

2008-12-08 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12654637#action_12654637
 ] 

Michael McCandless commented on LUCENE-1478:


No problem -- Committed revision 724552.  Thanks!

> Missing possibility to supply custom FieldParser when sorting search results
> 
>
> Key: LUCENE-1478
> URL: https://issues.apache.org/jira/browse/LUCENE-1478
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.4
>Reporter: Uwe Schindler
>Assignee: Michael McCandless
> Fix For: 2.9
>
> Attachments: LUCENE-1478-cleanup.patch, 
> LUCENE-1478-no-superinterface.patch, LUCENE-1478.patch, LUCENE-1478.patch, 
> LUCENE-1478.patch, LUCENE-1478.patch, LUCENE-1478.patch
>
>
> When implementing the new TrieRangeQuery for contrib (LUCENE-1470), I was 
> confronted by the problem that the special trie-encoded values (which are 
> longs in a special encoding) cannot be sorted by Searcher.search() and 
> SortField. The problem is: If you use SortField.LONG, you get 
> NumberFormatExceptions. The trie encoded values may be sorted using 
> SortField.String (as the encoding is in such a way, that they are sortable as 
> Strings), but this is very memory ineffective.
> ExtendedFieldCache gives the possibility to specify a custom LongParser when 
> retrieving the cached values. But you cannot use this during searching, 
> because there is no possibility to supply this custom LongParser to the 
> SortField.
> I propose a change in the sort classes:
> Include a pointer to the parser instance to be used in SortField (if not 
> given use the default). My idea is to create a SortField using a new 
> constructor
> {code}SortField(String field, int type, Object parser, boolean reverse){code}
> The parser is "object" because all current parsers have no super-interface. 
> The ideal solution would be to have:
> {code}SortField(String field, int type, FieldCache.Parser parser, boolean 
> reverse){code}
> and FieldCache.Parser is a super-interface (just empty, more like a 
> marker-interface) of all other parsers (like LongParser...). The sort 
> implementation then must be changed to respect the given parser (if not 
> NULL), else use the default FieldCache.get without parser.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: [jira] Commented: (LUCENE-831) Complete overhaul of FieldCache API/Implementation

2008-12-08 Thread Michael McCandless


Mark Miller wrote:


Mark Miller wrote:


Which new sort stuff are you referring to?  Is it LUCENE-1471?


Yes. The first thing I did was try to patch this in, but the sort  
tests failed. The order would be mostly right, but the two center  
docs would be reversed or something. No time to dig in, so I just  
switched to the trunk MultiSearcher and all tests passed except for  
the two with the above issues.
Spoke too soon. It wasn't LUCENE-1471's fault; it was just hitting  
different aspects of an issue that's broken with the old  
MultiSearcher as well.


OK.  If you're building on LUCENE-1471, make sure you start from the  
first patch.  It'd be good to factor that logic (2nd pqueue for  
merging) out so it can be reused b/w IndexSearcher & MultiSearcher.


I think I have everything working, except there is a problem with  
sort by doc id (it won't reverse). Got the auto detection working  
though.


Nice!!  Looking forward to the patch :)

Mike




[jira] Updated: (LUCENE-1435) CollationKeyFilter: convert tokens into CollationKeys encoded using IndexableBinaryStringTools

2008-12-08 Thread Steven Rowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Rowe updated LUCENE-1435:


Attachment: LUCENE-1435.patch

Removed accidentally included IndexableBinaryString and its test from the patch 
(see LUCENE-1434 for these).

> CollationKeyFilter: convert tokens into CollationKeys encoded using 
> IndexableBinaryStringTools
> --
>
> Key: LUCENE-1435
> URL: https://issues.apache.org/jira/browse/LUCENE-1435
> Project: Lucene - Java
>  Issue Type: New Feature
>Affects Versions: 2.4
>Reporter: Steven Rowe
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1435.patch, LUCENE-1435.patch, LUCENE-1435.patch
>
>
> Converts each token into its CollationKey using the provided collator, and 
> then encodes the CollationKey with IndexableBinaryStringTools, to allow it to 
> be stored as an index term.
> This will allow for efficient range searches and Sorts over fields that need 
> collation for proper ordering.
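The core trick here - a CollationKey's byte form sorts the same way the Collator sorts the original strings, while a raw String comparison may not - can be shown with plain java.text. (The re-encoding of the bytes via IndexableBinaryStringTools, which makes them storable as index terms, is omitted from this sketch.)

```java
import java.text.Collator;
import java.util.Locale;

// Each term is replaced by the byte form of its CollationKey, so that a
// plain binary comparison of indexed terms matches the collator's ordering.
public class CollationKeyDemo {
  public static byte[] collationBytes(Collator collator, String term) {
    return collator.getCollationKey(term).toByteArray();
  }

  // Compare two byte arrays lexicographically, treating bytes as unsigned.
  public static int compareBytes(byte[] a, byte[] b) {
    int n = Math.min(a.length, b.length);
    for (int i = 0; i < n; i++) {
      int d = (a[i] & 0xff) - (b[i] & 0xff);
      if (d != 0) return d;
    }
    return a.length - b.length;
  }
}
```

For a French collator, for example, "côte" sorts before "coter" (the accent is only a secondary difference), while a raw String comparison puts "côte" after "coter" because 'ô' has a larger code point than 'o' - which is exactly why range searches and sorts over collated fields need the key bytes, not the original terms.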






[jira] Created: (LUCENE-1483) Change IndexSearcher to use MultiSearcher semantics for sorted searches

2008-12-08 Thread Mark Miller (JIRA)
Change IndexSearcher to use MultiSearcher semantics for sorted searches
---

 Key: LUCENE-1483
 URL: https://issues.apache.org/jira/browse/LUCENE-1483
 Project: Lucene - Java
  Issue Type: Improvement
Affects Versions: 2.9
Reporter: Mark Miller
Priority: Minor


Here is a quick test patch. The FieldCache for sorting is built at the 
individual IndexReader level, so reloading the FieldCache on reopen can be 
much faster, as only changed segments need to be reloaded.
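The idea above can be sketched generically (this is not actual Lucene code; the cache shape and names are assumptions for illustration):

```java
import java.util.Map;
import java.util.WeakHashMap;

// Illustrative sketch of per-segment field caching: entries are keyed by the
// segment reader, so after a reopen() only new or changed segments pay the
// load cost; unchanged segments hit the cache. A WeakHashMap lets entries
// disappear once a segment reader is closed and garbage collected.
interface SegmentLoader<K, V> {
  V load(K segment);
}

class PerSegmentCache<K, V> {
  private final Map<K, V> cache = new WeakHashMap<K, V>();

  synchronized V get(K segment, SegmentLoader<K, V> loader) {
    V value = cache.get(segment);
    if (value == null) {
      value = loader.load(segment);  // the expensive part, e.g. un-inverting a field
      cache.put(segment, value);
    }
    return value;
  }
}
```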






Re: [jira] Commented: (LUCENE-831) Complete overhaul of FieldCache API/Implementation

2008-12-08 Thread Mark Miller

Michael McCandless wrote:


Mark Miller wrote:


Mark Miller wrote:


Which new sort stuff are you referring to?  Is it LUCENE-1471?


Yes. The first thing I did was try to patch this in, but the sort tests 
failed. The order would be mostly right, but the two center docs 
would be reversed or something. No time to dig in, so I just switched 
to the trunk MultiSearcher and all tests passed except for the two 
with the above issues.
Spoke too soon. It wasn't LUCENE-1471's fault; it was just hitting 
different aspects of an issue that's broken with the old 
MultiSearcher as well.


OK.  If you're building on LUCENE-1471, make sure you start from the 
first patch.  It'd be good to factor that logic (2nd pqueue for 
merging) out so it can be reused b/w IndexSearcher & MultiSearcher.
I actually worked with the second. I'll take a look at the first 
instead. I'm sticking with using the MultiSearcher for the first patch - 
the refactoring can be worked out later if it speeds things up.


Does returning by document id order even make sense with this though? 
Did it make sense with MultiSearcher? They are pseudo ids (mapped), so 
it almost seems I can't support that right...it would depend on the 
order of the readers.


- Mark






Re: [jira] Commented: (LUCENE-831) Complete overhaul of FieldCache API/Implementation

2008-12-08 Thread Mark Miller

Mark Miller wrote:

Mark Miller wrote:


Which new sort stuff are you referring to?  Is it LUCENE-1471?


Yes. The first thing I did was try to patch this in, but the sort tests 
failed. The order would be mostly right, but the two center docs 
would be reversed or something. No time to dig in, so I just switched 
to the trunk MultiSearcher and all tests passed except for the two 
with the above issues.

 Got the auto detection working though.
Bah, I didn't. It brought up an old bug I've seen before - if you use 
MultiSearcher and an index doesn't have the field, AUTO won't work. 
The advice I always got was don't use AUTO, but even Lucene uses it 
internally. I thought I had a workaround, but it didn't quite work. Not sure 
what to do about this one - I'll have to mull it and the ids issue over 
a bit, I suppose.






[jira] Commented: (LUCENE-1471) Faster MultiSearcher.search merge docs

2008-12-08 Thread Luke Nezda (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12654712#action_12654712
 ] 

Luke Nezda commented on LUCENE-1471:


I will prepare a similar derivative patch that covers MultiSearcher and 
ParallelMultiSearcher.

> Faster MultiSearcher.search merge docs 
> ---
>
> Key: LUCENE-1471
> URL: https://issues.apache.org/jira/browse/LUCENE-1471
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.4
>Reporter: Jason Rutherglen
>Assignee: Michael McCandless
>Priority: Minor
> Attachments: LUCENE-1471.patch, multisearcher.patch
>
>   Original Estimate: 8h
>  Remaining Estimate: 8h
>
> MultiSearcher.search places the sorted search results from individual 
> searchers into a PriorityQueue.  This can be made more efficient by taking 
> advantage of the fact that the results returned are already sorted.  
> The proposed solution places an iterator over each sub-searcher's results 
> into a custom PriorityQueue that produces the sorted ScoreDocs.
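The proposed merge is a classic k-way merge. A minimal sketch, with plain ascending ints standing in for ScoreDocs (real hits would be ordered by score or sort fields): the queue holds one cursor per sub-result, so producing all hits costs O(total * log(numSearchers)) instead of re-heaping every hit individually.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// k-way merge of already-sorted sub-results via a priority queue of cursors.
class SortedMerge {
  static int[] merge(int[][] sortedSubResults) {
    // queue entries: {value, subIndex, offsetWithinSub}, ordered by value
    PriorityQueue<int[]> pq = new PriorityQueue<int[]>(
        Math.max(1, sortedSubResults.length),
        new Comparator<int[]>() {
          public int compare(int[] a, int[] b) { return a[0] - b[0]; }
        });
    // seed the queue with the first hit of each non-empty sub-result
    for (int i = 0; i < sortedSubResults.length; i++) {
      if (sortedSubResults[i].length > 0) {
        pq.add(new int[] { sortedSubResults[i][0], i, 0 });
      }
    }
    List<Integer> out = new ArrayList<Integer>();
    while (!pq.isEmpty()) {
      int[] top = pq.poll();
      out.add(top[0]);
      int sub = top[1], next = top[2] + 1;
      if (next < sortedSubResults[sub].length) {
        // advance the cursor of the sub-result we just consumed from
        pq.add(new int[] { sortedSubResults[sub][next], sub, next });
      }
    }
    int[] result = new int[out.size()];
    for (int i = 0; i < result.length; i++) result[i] = out.get(i);
    return result;
  }
}
```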






[jira] Commented: (LUCENE-1471) Faster MultiSearcher.search merge docs

2008-12-08 Thread Luke Nezda (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12654720#action_12654720
 ] 

Luke Nezda commented on LUCENE-1471:


* Simplified MultiSearcherThread 
** Pulled out the result-merging functionality - it was serialized on hq 
anyway - and made it much more similar to the parent merge logic (actually so 
similar it felt a little dirty)
** Made it a non-static inner class to cut down on parameters, though after 
moving the merge logic this only saved the searchables[] ref
* Made the searchables[] and starts[] fields final - really, the parent's 
versions of these same fields should probably just be protected final
* Fixed some javadoc typos
* Patch created against 724620 supersedes the previous multisearcher.patch - 
all tests pass


> Faster MultiSearcher.search merge docs 
> ---
>
> Key: LUCENE-1471
> URL: https://issues.apache.org/jira/browse/LUCENE-1471
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.4
>Reporter: Jason Rutherglen
>Assignee: Michael McCandless
>Priority: Minor
> Attachments: LUCENE-1471.patch, multisearcher.patch
>
>   Original Estimate: 8h
>  Remaining Estimate: 8h
>
> MultiSearcher.search places the sorted search results from individual 
> searchers into a PriorityQueue.  This can be made more efficient by taking 
> advantage of the fact that the results returned are already sorted.  
> The proposed solution places an iterator over each sub-searcher's results 
> into a custom PriorityQueue that produces the sorted ScoreDocs.






[jira] Updated: (LUCENE-1471) Faster MultiSearcher.search merge docs

2008-12-08 Thread Luke Nezda (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luke Nezda updated LUCENE-1471:
---

Attachment: multisearcher.take2.patch

Patch covering MultiSearcher and ParallelMultiSearcher

> Faster MultiSearcher.search merge docs 
> ---
>
> Key: LUCENE-1471
> URL: https://issues.apache.org/jira/browse/LUCENE-1471
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.4
>Reporter: Jason Rutherglen
>Assignee: Michael McCandless
>Priority: Minor
> Attachments: LUCENE-1471.patch, multisearcher.patch, 
> multisearcher.take2.patch
>
>   Original Estimate: 8h
>  Remaining Estimate: 8h
>
> MultiSearcher.search places the sorted search results from individual 
> searchers into a PriorityQueue.  This can be made more efficient by taking 
> advantage of the fact that the results returned are already sorted.  
> The proposed solution places an iterator over each sub-searcher's results 
> into a custom PriorityQueue that produces the sorted ScoreDocs.


