[jira] Updated: (LUCENE-1470) Add TrieRangeQuery to contrib

2008-12-05 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-1470:
--

Attachment: LUCENE-1470-readme.patch

Here are the readme changes.

> Add TrieRangeQuery to contrib
> -
>
> Key: LUCENE-1470
> URL: https://issues.apache.org/jira/browse/LUCENE-1470
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/*
>Affects Versions: 2.4
>Reporter: Uwe Schindler
>Assignee: Michael McCandless
> Fix For: 2.9
>
> Attachments: fixbuild-LUCENE-1470.patch, fixbuild-LUCENE-1470.patch, 
> LUCENE-1470-readme.patch, LUCENE-1470.patch, LUCENE-1470.patch, 
> LUCENE-1470.patch, LUCENE-1470.patch, LUCENE-1470.patch, LUCENE-1470.patch, 
> LUCENE-1470.patch
>
>
> According to the thread in java-dev 
> (http://www.gossamer-threads.com/lists/lucene/java-dev/67807 and 
> http://www.gossamer-threads.com/lists/lucene/java-dev/67839), I want to 
> include my fast numerical range query implementation in lucene 
> contrib-queries.
> I implemented (based on RangeFilter) another approach for faster
> RangeQueries, based on longs stored in the index in a special format.
> The idea behind this is to store the longs at different precisions in the
> index and to partition the query range in such a way that the outer
> boundaries are searched using terms from the highest precision, while the
> center of the search range uses lower precisions. The implementation stores
> the longs at 8 different precisions (using a class called TrieUtils). It also
> has support for Doubles, using the IEEE 754 floating-point "double format"
> bit layout with some bit mappings to make them binary sortable. The approach
> is used in rather big indexes; query times are <<100 ms (!) even on
> low-performance desktop computers for very big ranges on indexes with 50 docs.
> I called this RangeQuery variant and format "TrieRangeQuery" because the idea
> resembles the well-known trie structures (it is not identical to real tries,
> but the algorithms are related).
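
The two encoding tricks described above can be sketched as follows (a minimal illustration in the spirit of TrieUtils, not its actual source; the 8-bit precision step and names are assumptions):

{code}
// Sketch only: map a double to a binary-sortable unsigned long, then
// derive reduced precisions by shifting low-order detail away
// (8-bit steps assumed here, giving 8 precision levels).
long bits = Double.doubleToLongBits(value);
long sortable = (bits < 0) ? ~bits                       // negatives: invert all bits
                           : bits ^ 0x8000000000000000L; // positives: flip sign bit
for (int shift = 0; shift < 64; shift += 8) {
  long prefix = sortable >>> shift;  // lower precision = shorter prefix
  // encode "prefix" plus its precision level as an indexed term
}
{code}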




[jira] Commented: (LUCENE-1470) Add TrieRangeQuery to contrib

2008-12-05 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653731#action_12653731
 ] 

Michael McCandless commented on LUCENE-1470:


Committed revision 723701.

Thanks Uwe!




[jira] Commented: (LUCENE-1473) Implement standard Serialization across Lucene versions

2008-12-05 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653735#action_12653735
 ] 

Michael McCandless commented on LUCENE-1473:


SerializationUtils is missing from the patch.

> Implement standard Serialization across Lucene versions
> ---
>
> Key: LUCENE-1473
> URL: https://issues.apache.org/jira/browse/LUCENE-1473
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Search
>Affects Versions: 2.4
>Reporter: Jason Rutherglen
>Priority: Minor
> Attachments: LUCENE-1473.patch, LUCENE-1473.patch
>
>   Original Estimate: 8h
>  Remaining Estimate: 8h
>
> To maintain serialization compatibility between Lucene versions, 
> serialVersionUID needs to be added to classes that implement 
> java.io.Serializable.  java.io.Externalizable may be implemented by classes 
> for better performance.
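
For illustration, the serialVersionUID part of the request might look like this (class name and value are placeholders, not contents of the attached patches):

{code}
// Sketch only: pinning serialVersionUID keeps the serialized form stable
// as the class evolves; the value shown is illustrative.
public class MyQuery extends org.apache.lucene.search.Query {
  private static final long serialVersionUID = 1L;
  public String toString(String field) { return "myquery"; }
}
{code}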




[jira] Assigned: (LUCENE-1478) Missing possibility to supply custom FieldParser when sorting search results

2008-12-05 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless reassigned LUCENE-1478:
--

Assignee: Michael McCandless

> Missing possibility to supply custom FieldParser when sorting search results
> 
>
> Key: LUCENE-1478
> URL: https://issues.apache.org/jira/browse/LUCENE-1478
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.4
>Reporter: Uwe Schindler
>Assignee: Michael McCandless
> Attachments: LUCENE-1478-no-superinterface.patch
>
>
> When implementing the new TrieRangeQuery for contrib (LUCENE-1470), I was 
> confronted with the problem that the special trie-encoded values (which are 
> longs in a special encoding) cannot be sorted by Searcher.search() and 
> SortField. The problem is: if you use SortField.LONG, you get 
> NumberFormatExceptions. The trie-encoded values may be sorted using 
> SortField.STRING (as the encoding is such that they are sortable as 
> Strings), but this is very memory-inefficient.
> ExtendedFieldCache makes it possible to specify a custom LongParser when 
> retrieving the cached values. But you cannot use this during searching, 
> because there is no way to supply this custom LongParser to the 
> SortField.
> I propose a change in the sort classes:
> Include a pointer to the parser instance to be used in SortField (if not 
> given, use the default). My idea is to create a SortField using a new 
> constructor
> {code}SortField(String field, int type, Object parser, boolean reverse){code}
> The parser is "Object" because all current parsers have no super-interface. 
> The ideal solution would be to have:
> {code}SortField(String field, int type, FieldCache.Parser parser, boolean 
> reverse){code}
> with FieldCache.Parser as a super-interface (just empty, more like a 
> marker interface) of all other parsers (like LongParser...). The sort 
> implementation then must be changed to respect the given parser (if not 
> null), and otherwise use the default FieldCache.get without a parser.
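
A hedged sketch of how the proposed constructor might be used with a trie-decoding parser (assuming TrieUtils from LUCENE-1470; the field name is made up):

{code}
// Sketch only: a custom parser decoding trie-coded terms back to longs,
// passed through the proposed SortField(field, type, parser, reverse).
ExtendedFieldCache.LongParser trieParser = new ExtendedFieldCache.LongParser() {
  public long parseLong(String value) {
    return TrieUtils.trieCodedToLong(value); // decode the special encoding
  }
};
Sort sort = new Sort(new SortField("timestamp", SortField.LONG, trieParser, false));
{code}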




[jira] Commented: (LUCENE-1478) Missing possibility to supply custom FieldParser when sorting search results

2008-12-05 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653741#action_12653741
 ] 

Michael McCandless commented on LUCENE-1478:


Patch looks good, thanks Uwe!  Back compat looks preserved; while some
APIs (FieldSortedHitQueue.getCachedComparator) were changed, they are
package private.

Back-compat tests ("ant test-tag") pass as well.

bq. For testing, I modified one of my contrib TrieRangeQuery test cases locally 
to sort using a custom LongParser that decoded the encoded longs in the cache 
[parseLong(value) returns TrieUtils.trieCodedToLong(value)].

It looks like this didn't make it into the patch -- could you add it?

Actually, adding a core test case would also be good.  It could be
something silly, eg that parses ints but negates them, and then assert
that this yields the same result as the default IntParser with
reverse=true (assuming no ties).
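
Such a test might look roughly like this (a sketch; hitsToIds is an imaginary helper, and the constructor is the one proposed in this issue):

{code}
// Sketch only: a parser that negates ints should order hits exactly like
// the default parser with reverse=true, assuming no ties.
FieldCache.IntParser negating = new FieldCache.IntParser() {
  public int parseInt(String value) {
    return -Integer.parseInt(value); // negate, so ascending order flips
  }
};
Sort negated  = new Sort(new SortField("num", SortField.INT, negating, false));
Sort reversed = new Sort(new SortField("num", SortField.INT, true));
assertEquals(hitsToIds(searcher.search(query, negated)),
             hitsToIds(searcher.search(query, reversed)));
{code}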

bq. If you like my patch, we could also discuss about using a super-interface 
for all Parsers. The modifications are rather simple (only the SortField 
constructor would be affected and some casts, and of course: the superinterface 
in all declarations inside FieldCache, ExtendedFieldCache)

I agree, I would like to at least get some minimal static typing into
the API (Object is not ideal), even if it's simply a "marker" interface.
If you're sure this can be done such that all changes to
FieldCache/ExtendedFieldCache remain back compatible, then let's do
it.  And I think I do now agree: this can be done w/o breaking back
compat.  The only affected public methods should be your new SortField
methods, which is fine (no public methods take "Object parser" as far
as I can tell).





[jira] Commented: (LUCENE-1476) BitVector implement DocIdSet

2008-12-05 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653752#action_12653752
 ] 

Michael McCandless commented on LUCENE-1476:


bq. But, SegmentReader needs random access to the bits (DocIdSet only provides 
an iterator)? 

Although IndexReader.isDeleted exposes a random-access API to deleted docs, I 
think it may be overkill.

Ie, in most (all?) uses of deleted docs throughout Lucene core/contrib, a 
simple iterator (DocIdSet) would in fact suffice.

EG in SegmentTermDocs iteration we are always checking deletedDocs by ascending 
docID.  It might be a performance gain (pure speculation) if we used an 
iterator API, because we could hold "nextDelDocID" and only advance that 
(skipTo) when the term's docID has moved past it.  It's just like an "AND NOT 
X" clause.

Similarly, norms, which also now expose a random-access API, should be fine 
with an iterator type API as well.

This may also imply better VM behavior, since we don't actually require 
norms/deletions to be fully memory resident.

This would be a biggish change, and it's not clear whether/when we should 
explore it, but I wanted to get the idea out there.

Marvin, in KS/Lucy are you using random-access or iterator to access 
deletedDocs & norms?

> BitVector implement DocIdSet
> 
>
> Key: LUCENE-1476
> URL: https://issues.apache.org/jira/browse/LUCENE-1476
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.4
>Reporter: Jason Rutherglen
>Priority: Trivial
> Attachments: LUCENE-1476.patch
>
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> BitVector can implement DocIdSet.  This is for making 
> SegmentReader.deletedDocs pluggable.




[jira] Commented: (LUCENE-831) Complete overhaul of FieldCache API/Implementation

2008-12-05 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653757#action_12653757
 ] 

Michael McCandless commented on LUCENE-831:
---

bq. change norm caching to use new caches (if not the same

I think we could go even further, and [eventually] change norms to use an 
iterator API, which'd also have the same benefit of not requiring costly 
materialization of a full byte[] array for every doc in the index (ie, reopen() 
cost would be in proportion to changed segments not total index size).

Likewise field cache / stored fields / column stride fields could eventually 
open up an iterator API as well.  This API would be useful if eg in a custom 
HitCollector you wanted to look at a field's value in order to do custom 
filtering/scoring.
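
None of this exists yet, but a sketch of what such an iterator API might feel like from a custom HitCollector (every name below is invented):

{code}
// Sketch only: a hypothetical per-field value iterator, consulted during
// collection for custom filtering/scoring. Not a real Lucene API.
final FieldValueIterator price = reader.fieldValues("price");
HitCollector collector = new HitCollector() {
  public void collect(int doc, float score) {
    if (price.skipTo(doc) && price.longValue() > 100L) {
      // keep or boost this hit based on the field's value
    }
  }
};
{code}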

> Complete overhaul of FieldCache API/Implementation
> --
>
> Key: LUCENE-831
> URL: https://issues.apache.org/jira/browse/LUCENE-831
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Reporter: Hoss Man
> Fix For: 3.0
>
> Attachments: fieldcache-overhaul.032208.diff, 
> fieldcache-overhaul.diff, fieldcache-overhaul.diff, 
> LUCENE-831.03.28.2008.diff, LUCENE-831.03.30.2008.diff, 
> LUCENE-831.03.31.2008.diff, LUCENE-831.patch, LUCENE-831.patch, 
> LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch
>
>
> Motivation:
> 1) Completely overhaul the API/implementation of "FieldCache" type things...
> a) eliminate the global static map keyed on IndexReader (thus
> eliminating the synch block between completely independent IndexReaders)
> b) allow more customization of cache management (ie: use 
> expiration/replacement strategies, disk backed caches, etc)
> c) allow people to define custom cache data logic (ie: custom
> parsers, complex datatypes, etc... anything tied to a reader)
> d) allow people to inspect what's in a cache (list of CacheKeys) for
> an IndexReader so a new IndexReader can be likewise warmed. 
> e) Lend support for smarter cache management if/when
> IndexReader.reopen is added (merging of cached data from subReaders).
> 2) Provide backwards compatibility to support the existing FieldCache API with
> the new implementation, so there is no redundant caching as client code
> migrates to the new API.




Unnecessary messages creation by LogMergePolicy

2008-12-05 Thread Shai Erera
Hi

As I looked at the code in LogMergePolicy (and its sub-classes), I came
across lines such as:

message("findMergesToExpungeDeletes: " + numSegments + " segments");

Those lines print to the info stream (eventually) if it's not null.

If one follows Java logging best practices, then any logging message should
look similar to this:
if (logger.isLoggable(Level)) {
   logger.log(Level, msg);
}

The reason is that when logging messages, one usually does not pay attention
to any performance issues, like String concatenation. Therefore, checking if
logging is enabled saves building the String just to discover later that
logging is not enabled.

I haven't checked other places in the code, because I'd like to get the
committers' opinion on this first. Imo, those strings are created
unnecessarily (the above message creates 5 strings) since most of the time
the info stream is null.

I can provide a patch to fix it by first checking if logging (or whatever
other name you'd like to give it) is enabled before attempting to output any
message. The LogMergePolicy classes are one example that I've run into, but
I'm sure there are other places in the code.
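
For LogMergePolicy the guarded form would simply be (a sketch using the
existing message() helper):

if (infoStream != null) {
   message("findMergesToExpungeDeletes: " + numSegments + " segments");
}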

I don't foresee any great performance improvements by this fix, except just
following best practices.

What do you think?

Shai


Re: Unnecessary messages creation by LogMergePolicy

2008-12-05 Thread Michael McCandless


I agree, it is a best practice and we should follow it.  Can you work  
out a patch & open an issue?  I assume this means "if (infoStream !=  
null)..." in this case.


Mike




Re: Unnecessary messages creation by LogMergePolicy

2008-12-05 Thread Shai Erera
I'll open an issue and work out a patch.

Basically it means infoStream != null, although in LogMergePolicy I might
add a specific method for that, because the messages are output only if the
IndexWriter member is not null and its infoStream is not null (this check is
done by IndexWriter).

Therefore I think I'll add a messagesEnabled() method to IndexWriter, which
returns true if the infoStream is not null, for use by other classes (rather
than the implicit iw.getInfoStream() != null). BTW, getInfoStream() is not
called by any class in Lucene, except one test class.

What do you think about adding this method, and its name?



[jira] Commented: (LUCENE-1476) BitVector implement DocIdSet

2008-12-05 Thread robert engels (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653793#action_12653793
 ] 

robert engels commented on LUCENE-1476:
---

I don't think you can change this...

In many cases, after you have read an index and retrieved document numbers, 
these are lazily returned to the client.

By the time some records need to be read, they may have already been deleted 
(at least this was the usage in old Lucene, where deletions happened in the 
reader).

I think a lot of code assumes this, and calls isDeleted() to ensure the 
document is still valid.




[jira] Created: (LUCENE-1479) TrecDocMaker skips over documents when "Date" is missing from documents

2008-12-05 Thread Shai Erera (JIRA)
TrecDocMaker skips over documents when "Date" is missing from documents
---

 Key: LUCENE-1479
 URL: https://issues.apache.org/jira/browse/LUCENE-1479
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/benchmark
Reporter: Shai Erera
 Fix For: 2.4.1


TrecDocMaker skips over TREC documents if they do not have a "Date" line. When 
such a document is encountered, the code may skip over several documents until 
the next tag being searched for is found.
The result is that, instead of reading ~25M documents from the GOV2 collection, 
the code reads only ~23M (I don't remember the exact numbers).

The fix adds a terminatingTag to read() such that the code looks for prefix, 
but only until terminatingTag is found. Appropriate changes were made in 
getNextDocData().
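
The read() change might look roughly like this (a sketch of the described contract, not the actual patch):

{code}
// Sketch only: scan for a line starting with prefix, but stop (returning
// null) if terminatingTag shows up first, so a document missing its
// "Date" line no longer swallows the documents that follow it.
private String read(BufferedReader reader, String prefix, String terminatingTag)
    throws IOException {
  String line;
  while ((line = reader.readLine()) != null) {
    if (line.startsWith(prefix)) return line;
    if (terminatingTag != null && line.startsWith(terminatingTag)) return null;
  }
  return null;
}
{code}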

Patch to follow




[jira] Updated: (LUCENE-1479) TrecDocMaker skips over documents when "Date" is missing from documents

2008-12-05 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-1479:
---

Attachment: LUCENE-1479.patch

Patch to fix the bug




[jira] Created: (LUCENE-1480) Wrap messages output with a check of InfoStream != null

2008-12-05 Thread Shai Erera (JIRA)
Wrap messages output with a check of InfoStream != null
---

 Key: LUCENE-1480
 URL: https://issues.apache.org/jira/browse/LUCENE-1480
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Shai Erera
Priority: Minor
 Fix For: 2.4.1


I've found several places in the code where messages are output w/o first 
checking if infoStream != null. The result is that, most of the time, 
unnecessary strings are created but never output (because infoStream is not 
set). We should follow Java's logging best practices, where a log message is 
always output in the following form:

if (logger.isLoggable(level)) {
   logger.log(level, msg);
}

Log messages are usually created w/o paying too much attention to performance 
(such as string concatenation using '+' instead of StringBuffer). Therefore, at 
runtime it is important to avoid creating those messages if they will 
eventually be discarded.

I will add a messagesEnabled() method to IndexWriter and then use it wherever a 
call to iw.message() is made.

Patch will follow




[jira] Updated: (LUCENE-1480) Wrap messages output with a check of InfoStream != null

2008-12-05 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-1480:
---

Attachment: LUCENE-1480.patch

Patch that introduces messagesEnabled() in IndexWriter and wraps all calls to 
IndexWriter.message() in a check of messagesEnabled(), or of 
infoStream != null in the case of IndexWriter's own calls to message().




Re: Unnecessary messages creation by LogMergePolicy

2008-12-05 Thread Michael McCandless


I like the method, but how about the name verbose(), ie:

  if (verbose())
...

Mike




[jira] Assigned: (LUCENE-1480) Wrap messages output with a check of InfoStream != null

2008-12-05 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless reassigned LUCENE-1480:
--

Assignee: Michael McCandless




[jira] Resolved: (LUCENE-1468) FSDirectory.list() is inconsistent

2008-12-05 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-1468.


   Resolution: Fixed
Fix Version/s: 2.9

Committed revision 723789.

Thanks Marcel!

> FSDirectory.list() is inconsistent
> --
>
> Key: LUCENE-1468
> URL: https://issues.apache.org/jira/browse/LUCENE-1468
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Store
>Affects Versions: 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.4
>Reporter: Marcel Reutegger
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.9
>
> Attachments: DirectoryTest.java, LUCENE-1468.patch
>
>
> LUCENE-638 added a check to the FSDirectory.list() method to only return 
> files that are Lucene related. I think this change made the FSDirectory 
> implementation inconsistent with all other methods in Directory. E.g. you can 
> create a file with an arbitrary name using FSDirectory, fileExists() will 
> report that it is there, deleteFile() will remove it, but the array returned 
> by list() will not contain the file.
> The actual issue that was reported in LUCENE-638 was about sub directories. 
> Those should clearly not be listed, but IMO it is not the responsibility of a 
> Directory implementation to decide what kind of files can be created or 
> listed. The Directory class is an abstraction of a directory and it shouldn't 
> do more than that.
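
The inconsistency is easy to demonstrate (a sketch against the 2.4 API; the file name is arbitrary):

{code}
// Sketch only: before the fix, a non-Lucene file is visible to
// fileExists() but filtered out of list().
Directory dir = FSDirectory.getDirectory("/tmp/testindex");
dir.createOutput("arbitrary.txt").close();
boolean exists = dir.fileExists("arbitrary.txt");                     // true
boolean listed = Arrays.asList(dir.list()).contains("arbitrary.txt"); // false
{code}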




Re: [jira] Commented: (LUCENE-1473) Implement Externalizable in main top level searcher classes

2008-12-05 Thread Doug Cutting

John Wang wrote:
> Thus we are forcing users who care about Serialization to use the
> release jar.

We already encourage folks to use a release jar if possible.  So this is 
not a big change.  Also, if folks choose to build their own jar, then 
they are expected to use that same jar everywhere, effectively making 
their own release.  That doesn't seem unreasonable to me.  Incrementally 
upgrading distributed systems has, at least in the past, been outside 
the scope of Lucene.

> 3) Clean up the serialization story, either add SUID or implement
> Externalizable for some classes within Lucene that implement Serializable:
>
> From what I am told, this is too much work for the committers.

Not that it's too much work today, but that it adds an ongoing burden, 
and we should take this on cautiously if at all.  If we want to go this 
way we'd need to:

- document precisely which classes we'll evolve back-compatibly;
- document the releases (major? minor?) that will be compatible; and
- provide a test suite that validates this.

As a side note, we should probably move the back-compatibility 
documentation from the wiki to the project website.  This would permit 
patches to it, among other things.

http://wiki.apache.org/lucene-java/BackwardsCompatibility

> I hope you guys at least agree with me that the way it is currently, the
> serialization story is broken, whether in documentation or in code.

Documenting an unstated assumption is a good thing to do, especially 
when not everyone seems to share the assumption, but "broken" seems a 
bit strong here.

> I see the disagreement being its severity, and whether it is a trivial
> fix, which I have learned it is not really my place to say.

I've outlined above what I think would be required.  If you think that's 
trivial, then please pursue it and show us how trivial it is.  The patch 
provided thus far is incomplete.

> Please do understand this is not a far-fetched, made-up use-case; we are
> running into this in production, and we are developing in accordance with
> lucene documentation.

You developed based on some very optimistic guesses about some unstated 
aspects.  In Java, implementing Serializable alone does not generally 
provide any cross-version guarantees.  Assuming that it did was risky.


Doug




[jira] Commented: (LUCENE-1473) Implement standard Serialization across Lucene versions

2008-12-05 Thread Doug Cutting (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653869#action_12653869
 ] 

Doug Cutting commented on LUCENE-1473:
--

> How to write a unit test for multiple versions?

We can save, in files, serialized instances of each query type from the oldest 
release we intend to support.  Then read each of these queries and check that 
it is equal to a current query that's meant to be equivalent (assuming all 
queries implement equals well).  Something similar would need to be done for 
each class that is meant to be transmitted cross-version.

This tests that older queries may be processed by newer code.  It does not test 
that newer queries can be processed by older code.  Documentation is a big part 
of this effort, and should be completed first.  What guarantees do we intend 
to provide?  Once we've documented these, then we can begin writing tests.  For 
example, we may only guarantee that older queries work with newer code, and 
that newer hits work with older code.  To test that we'd need to have an old 
jar around that we could test against.  This will be a trickier test to 
configure.
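
A sketch of the first half (the .ser file would be generated once with the oldest supported release; names are illustrative):

{code}
// Sketch only: deserialize a query saved by an old release and compare it
// to the equivalent query built by the current code (relies on equals()).
ObjectInputStream in = new ObjectInputStream(
    new FileInputStream("TermQuery-2.4.0.ser"));
Query old = (Query) in.readObject();
in.close();
assertEquals(new TermQuery(new Term("field", "value")), old);
{code}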





[jira] Updated: (LUCENE-1473) Implement standard Serialization across Lucene versions

2008-12-05 Thread Jason Rutherglen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Rutherglen updated LUCENE-1473:
-

Attachment: LUCENE-1473.patch

LUCENE-1473.patch

Added Externalizable to Document, Field, AbstractField (as compared to the 
previous patch).  SerializationUtils is included.

TODO:
- More Externalizable classes with test cases for each one








[jira] Commented: (LUCENE-1480) Wrap messages output with a check of InfoStream != null

2008-12-05 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653872#action_12653872
 ] 

Shai Erera commented on LUCENE-1480:


Hey Mike, I like the verbose() name. Would you like me to prepare a new patch, 
or can you apply the current patch and refactor the method name?
Note that a similar method was added to LogMergePolicy.




[jira] Commented: (LUCENE-1480) Wrap messages output with a check of InfoStream != null

2008-12-05 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653877#action_12653877
 ] 

Michael McCandless commented on LUCENE-1480:


Sorry, could you make a new patch?  Thanks.




Java logging in Lucene

2008-12-05 Thread Shai Erera
Hi

I was wondering why the Lucene code doesn't use Java logging instead of
the infoStream set in IndexWriter. Today, if I want to enable tracing of
Lucene code, the only thing I can do is set an infoStream, but then I get
many, many messages. Moreover, those messages seem to cover indexing code
only.

I hope to get some opinions on the use of Java logging instead of
infoStream, and hopefully to start adding logging messages in other places
in the code (like during search, query parsing, etc.).

I feel that this is an approach the community has to decide on before we
start adding messages to the code. Using Java logging can greatly benefit
tracing of indexing applications that use Lucene. If the vote is +1 for using
Java logging, we can start by deprecating infoStream (in 2.9, removing it in
3.0) and using logging instead.

What do you think?

Shai


[jira] Commented: (LUCENE-1476) BitVector implement DocIdSet

2008-12-05 Thread Marvin Humphrey (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653883#action_12653883
 ] 

Marvin Humphrey commented on LUCENE-1476:
-

> Marvin, in KS/Lucy are you using random-access or iterator to access 
> deletedDocs & norms?

Both. There's a DelEnum class which is used by NOTScorer and MatchAllScorer, 
but it's implemented using BitVectors which get the next deleted doc num by 
calling nextSetBit() internally. 

I happened to be coding up those classes this spring when there was the big 
brouhaha about IndexReader.isDeleted().  It seemed wrong to pay the method 
call overhead for IndexReader.isDeleted() on each iteration in 
NOTScorer.next() or MatchAllScorer.next(), when we could just store the next 
deletion:

{code}
i32_t
MatchAllScorer_next(MatchAllScorer* self)
{
    do {
        if (++self->doc_num > self->max_docs) {
            self->doc_num--;
            return 0;
        }
        if (self->doc_num > self->next_deletion) {
            self->next_deletion
                = DelEnum_Skip_To(self->del_enum, self->doc_num);
        }
    } while (self->doc_num == self->next_deletion);
    return self->doc_num;
}
{code}

(Note: Scorer.next() in KS returns the document number; doc nums start at 1, 
and 0 is the sentinel signaling iterator termination. I expect that Lucy will 
be the same.)

Perhaps we could get away without needing the random access, but that's because 
IndexReader.isDeleted() isn't exposed and because IndexReader.fetchDoc(int 
docNum) returns the doc even if it's deleted -- unlike Lucene which throws an 
exception. Also, you can't delete documents against an IndexReader, so Robert's 
objection doesn't apply to us.

I had always assumed we were going to have to expose isDeleted() eventually, 
but maybe we can get away with zapping it. Interesting!

I've actually been trying to figure out a new design for deletions because 
writing them out for big segments is our last big write bottleneck, now that 
we've theoretically solved the sort cache warming issue.  I figured we would 
continue to need bit-vector files because they're straightforward to mmap, but 
if we only need iterator access, we can use vbyte encoding instead... Hmm, we 
still face the problem of outsized write cost when a segment has a large number 
of deletions and you add one more...




Re: Java logging in Lucene

2008-12-05 Thread Grant Ingersoll
I think the main motivation has always been to have no dependencies in 
the core so as to keep it as fast and lightweight as possible.  Then, 
of course, there are always the usual religious wars around which 
logging framework to use, not to mention the nightmare that is trying 
to manage multiple logging frameworks across several projects that are 
being integrated.  Then, of course, there is the question of how 
useful any core Lucene logs would be to users writing search 
applications.  For the most part, my experience has been that I want 
logging to tell me when a document was added, when searches occur, 
etc., but I don't necessarily need to know things like the fact that 
Lucene is now entering the analysis phase of Document inversion.  And, 
for all these needs, I can just as well do that logging in the 
application and not in Lucene.


All that is not to say we couldn't add in logging, I'm just suggesting  
reasons I can think of for why it has not been added to date and why I  
am not sure it needs to be there going forward.  I believe various  
other people have contributed reasons in the past.  I seem to recall  
Doug spelling some out, but don't have the thread handy.


-Grant




[jira] Commented: (LUCENE-1476) BitVector implement DocIdSet

2008-12-05 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653889#action_12653889
 ] 

Michael McCandless commented on LUCENE-1476:



bq. It seemed wrong to pay the method call overhead for IndexReader.isDeleted() 
on each iter in NOTScorer.next() or MatchAllScorer.next(), when we could just 
store the next deletion:

Nice!  This is what I had in mind.

I think we could [almost] do this across the board for Lucene.
SegmentTermDocs would similarly store nextDeleted and apply the same
"AND NOT" logic.

bq. that's because IndexReader.isDeleted() isn't exposed and because 
IndexReader.fetchDoc(int docNum) returns the doc even if it's deleted

Hmm -- that is very nicely enabling.

bq. I've actually been trying to figure out a new design for deletions because 
writing them out for big segments is our last big write bottleneck

One approach would be to use a "segmented" model.  IE, if a few
deletions are added, write that to a new "deletes segment", ie a
single "normal segment" would then have multiple deletion files
associated with it.  These would have to be merged (iterator) when
used during searching, and, periodically coalesced.

bq. if we only need iterator access, we can use vbyte encoding instead

Right: if there are relatively few deletes against a segment, encoding
the "on bits" directly (or deltas) should be a decent win since
iteration is much faster.
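
A sketch of that encoding (plain delta + vbyte over the sorted deleted doc numbers; "out" stands for any byte sink such as an IndexOutput):

{code}
// Sketch only: sparse deletions stored as vbyte-encoded deltas instead of
// a full bit vector; small and iteration-friendly when deletes are few.
int last = 0;
for (int docID : sortedDeletedDocIDs) {
  int delta = docID - last;
  last = docID;
  while ((delta & ~0x7F) != 0) {
    out.writeByte((byte) ((delta & 0x7F) | 0x80)); // high bit: more bytes follow
    delta >>>= 7;
  }
  out.writeByte((byte) delta);                     // final byte, high bit clear
}
{code}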





[jira] Commented: (LUCENE-1476) BitVector implement DocIdSet

2008-12-05 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653891#action_12653891
 ] 

Michael McCandless commented on LUCENE-1476:



{quote}
In many cases after you have read an index, and retrieved document numbers, 
these are lazily returned to the client.

By the time some records are needed to be read, they may have already been 
deleted (at least this was the usage in old lucene, where deletions happened in 
the reader).

I think a lot of code assumes this, and calls the isDeleted() to ensure the 
document is still valid.
{quote}

But isn't that an uncommon use case?  It's dangerous to wait a long
time after getting a docID from a reader, before looking up the
document.  Most apps pull the doc right away, send it to the user, and
the docID isn't kept (I think?).

But still I agree: we can't eliminate random access to isDeleted
entirely.  We'd still have to offer it for such external cases.

I'm just saying the internal uses of isDeleted could all be switched
to iteration instead, and, we might get some performance gains from
it especially when the number of deletes on a segment is relatively low.





Re: Java logging in Lucene

2008-12-05 Thread Doug Cutting

The infoStream stuff goes back to 1997, before there was log4j or any
other Java logging framework.

There's never been a big push to add logging to Lucene.  It would add a 
dependency, and Lucene's jar has always been standalone, which is nice. 
Dependencies can conflict.  If Lucene requires one version of a 
dependency, then it may not work well with code that requires a different 
version of that dependency.


And it hasn't been clear which framework to adopt.  Log4j is the 
granddaddy, then there's Java logging and commons logging.  Today the 
preferred framework is probably SLF4J.  Good thing we didn't choose the 
wrong one years ago!


And how many log entries would folks really want to see per query or 
document indexed?  In production I don't think most folks want to see 
more than one entry per query or document indexed.  So finer-grained 
logging would be for debugging.  For that one can instead use a 
debugger.  Hence the traditional lack of demand for detailed logging in 
Lucene.


That's the history as I recall it.  The future is less clear.

Doug

Grant Ingersoll wrote:
I think the main motivation has always been to have no dependencies in 
the core so as to keep it as fast and lightweight as possible.  Then, of 
course, there is always the usual religious wars around which logging 
framework to use, not to mention the nightmare that is trying to manage 
multiple logging frameworks across several projects that are being 
integrated.  Then, of course, there is the question of how useful any 
core Lucene logs would be to users writing search applications.  For the 
most part, my experience has been that I want logging to tell me when a 
document was added, when searches occur, etc. but I don't necessarily 
need to know things like the fact that Lucene is now entering the 
analysis phase of Document inversion.  And, for all these needs, I can 
just as well do that logging in the application and not in Lucene.


All that is not to say we couldn't add in logging, I'm just suggesting 
reasons I can think of for why it has not been added to date and why I 
am not sure it needs to be there going forward.  I believe various other 
people have contributed reasons in the past.  I seem to recall Doug 
spelling some out, but don't have the thread handy.


-Grant

On Dec 5, 2008, at 1:17 PM, Shai Erera wrote:


Hi

I was wondering why the Lucene code doesn't use Java logging instead 
of the infoStream set in IndexWriter. Today, if I want to enable 
tracing of Lucene code, the only thing I can do is set an infoStream, 
but then I get many, many messages. Moreover, those messages seem to 
cover indexing code only.


I hope to get some opinions on the use of Java logging instead of 
infoStream, and hopefully to start adding logging messages in other 
places in the code (like during search, query parsing, etc.).


I feel that this is an approach the community has to decide on before 
we start adding messages to the code. Using Java logging can greatly 
benefit tracing of indexing applications that use Lucene. If the vote 
is +1 for using Java logging, we can start by deprecating infoStream 
(in 2.9, remove in 3.0) and use logging instead.


What do you think?

Shai




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Assigned: (LUCENE-1448) add getFinalOffset() to TokenStream

2008-12-05 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless reassigned LUCENE-1448:
--

Assignee: Michael Busch  (was: Michael McCandless)

> add getFinalOffset() to TokenStream
> ---
>
> Key: LUCENE-1448
> URL: https://issues.apache.org/jira/browse/LUCENE-1448
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Analysis
>Reporter: Michael McCandless
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch, 
> LUCENE-1448.patch
>
>
> If you add multiple Fieldable instances for the same field name to a 
> document, and you then index those fields with TermVectors storing offsets, 
> it's very likely the offsets for all but the first field instance will be 
> wrong.
> This is because IndexWriter under the hood adds a cumulative base to the 
> offsets of each field instance, where that base is 1 + the endOffset of the 
> last token it saw when analyzing that field.
> But this logic is overly simplistic.  For example, if the WhitespaceAnalyzer 
> is being used, and the text being analyzed ended in 3 whitespace characters, 
> then that information is lost and the next field's offsets are all 3 
> too small.  Similarly, if a StopFilter appears in the chain, and the last N 
> tokens were stop words, then the base will be 1 + the endOffset of the last 
> non-stopword token.
> To fix this, I'd like to add a new getFinalOffset() to TokenStream.  I'm 
> thinking by default it returns -1, which means "I don't know so you figure it 
> out", meaning we fall back to the faulty logic we have today.
> This has come up several times on the user's list.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1448) add getFinalOffset() to TokenStream

2008-12-05 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653893#action_12653893
 ] 

Michael McCandless commented on LUCENE-1448:


{quote}
What I'd like to work on soon is an efficient way to buffer attributes
(maybe add methods to attribute that write into a bytebuffer). Then
attributes can implement what variables need to be serialized and
which ones don't. In that case we could add a finalOffset to
OffsetAttribute that does not get serialized/deserialized.
{quote}

I like that (it'd make streams like CachingTokenFilter much more
efficient).  It'd also presumably lead to more efficiently serialized
token streams.

But: you'd still need a way in this model to serialize finalOffset, once,
at the end?

{quote}
And possibly it might be worthwhile to have explicit states defined in
a TokenStream that we can enforce with three methods: start(),
increment(), end(). Then people would know that if they have to do something
at the end of a stream, they have to do it in end().
{quote}

This also seems good.  So end() would be the obvious place to set
the OffsetAttribute.finalOffset, 
PositionIncrementAttribute.positionIncrementGap, etc.

OK I'm gonna assign this one to you, Michael ;)
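
A sketch of how that lifecycle might look; this API is only under
discussion in this thread, so everything below is hypothetical, not
shipped Lucene code:

abstract class ProposedTokenStream {
  int finalOffset = -1;           // -1 = unknown, fall back to today's logic

  void start() {}                 // called once before the first token
  abstract boolean increment();   // advance to the next token; false = done
  void end() {}                   // called once after the last token
}

class TrailingWhitespaceAwareStream extends ProposedTokenStream {
  private final String text;
  TrailingWhitespaceAwareStream(String text) { this.text = text; }

  boolean increment() { return false; }  // tokenization elided in this sketch

  void end() {
    // Trailing whitespace or trailing stop words no longer lose information:
    finalOffset = text.length();
  }
}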


> add getFinalOffset() to TokenStream
> ---
>
> Key: LUCENE-1448
> URL: https://issues.apache.org/jira/browse/LUCENE-1448
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Analysis
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch, 
> LUCENE-1448.patch
>
>
> If you add multiple Fieldable instances for the same field name to a 
> document, and you then index those fields with TermVectors storing offsets, 
> it's very likely the offsets for all but the first field instance will be 
> wrong.
> This is because IndexWriter under the hood adds a cumulative base to the 
> offsets of each field instance, where that base is 1 + the endOffset of the 
> last token it saw when analyzing that field.
> But this logic is overly simplistic.  For example, if the WhitespaceAnalyzer 
> is being used, and the text being analyzed ended in 3 whitespace characters, 
> then that information is lost and the next field's offsets are all 3 
> too small.  Similarly, if a StopFilter appears in the chain, and the last N 
> tokens were stop words, then the base will be 1 + the endOffset of the last 
> non-stopword token.
> To fix this, I'd like to add a new getFinalOffset() to TokenStream.  I'm 
> thinking by default it returns -1, which means "I don't know so you figure it 
> out", meaning we fall back to the faulty logic we have today.
> This has come up several times on the user's list.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Java logging in Lucene

2008-12-05 Thread Michael McCandless


I also feel that the primary usage of the internal messaging in Lucene  
today is debugging, and we don't need a logging framework for that.


Mike

Doug Cutting wrote:


The infoStream stuff goes back to 1997, before there was log4j or any
other Java logging framework.

There's never been a big push to add logging to Lucene.  It would  
add a dependency, and Lucene's jar has always been standalone, which  
is nice.  Dependencies can conflict.  If Lucene requires one version  
of a dependency, then it may not work well with code that require a  
different version of that dependency.


And it hasn't been clear which framework to adopt.  Log4j is the  
granddaddy, then there's Java logging and commons logging.  Today  
the preferred framework is probably SLF4J.  Good thing we didn't  
choose the wrong one years ago!


And how many log entries would folks really want to see per query or  
document indexed?  In production I don't think most folks want to  
see more than one entry per query or document indexed.  So finer- 
grained logging would be for debugging.  For that one can instead  
use a debugger.  Hence the traditional lack of demand for detailed  
logging in Lucene.


That's the history as I recall it.  The future is less clear.

Doug

Grant Ingersoll wrote:
I think the main motivation has always been to have no dependencies  
in the core so as to keep it as fast and lightweight as possible.   
Then, of course, there is always the usual religious wars around  
which logging framework to use, not to mention the nightmare that  
is trying to manage multiple logging frameworks across several  
projects that are being integrated.  Then, of course, there is the  
question of how useful any core Lucene logs would be to users  
writing search applications.  For the most part, my experience has  
been that I want logging to tell me when a document was added, when  
searches occur, etc. but I don't necessarily need to know things  
like the fact that Lucene is now entering the analysis phase of  
Document inversion.  And, for all these needs, I can just as well  
do that logging in the application and not in Lucene.
All that is not to say we couldn't add in logging, I'm just  
suggesting reasons I can think of for why it has not been added to  
date and why I am not sure it needs to be there going forward.  I  
believe various other people have contributed reasons in the past.   
I seem to recall Doug spelling some out, but don't have the thread  
handy.

-Grant
On Dec 5, 2008, at 1:17 PM, Shai Erera wrote:

Hi

I was wondering why the Lucene code doesn't use Java logging  
instead of the infoStream set in IndexWriter. Today, if I want to  
enable tracing of Lucene code, the only thing I can do is set an  
infoStream, but then I get many, many messages. Moreover, those  
messages seem to cover indexing code only.


I hope to get some opinions on the use of Java logging instead of  
infoStream, and hopefully to start adding logging messages in  
other places in the code (like during search, query parsing, etc.).


I feel that this is an approach the community has to decide on  
before we start adding messages to the code. Using Java logging  
can greatly benefit tracing of indexing applications that use  
Lucene. If the vote is +1 for using Java logging, we can start by  
deprecating infoStream (in 2.9, remove in 3.0) and use logging  
instead.


What do you think?

Shai

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

2008-12-05 Thread Otis Gospodnetic (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653900#action_12653900
 ] 

Otis Gospodnetic commented on LUCENE-855:
-

Hi Matt! :)

Tim, want to benchmark the two? (since you already benchmarked 1461, you should 
be able to plug in Matt's thing and see how it compares)


> MemoryCachedRangeFilter to boost performance of Range queries
> -
>
> Key: LUCENE-855
> URL: https://issues.apache.org/jira/browse/LUCENE-855
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.1
>Reporter: Andy Liu
> Attachments: contrib-filters.tar.gz, FieldCacheRangeFilter.patch, 
> FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, 
> FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, 
> FieldCacheRangeFilter.patch, FieldCacheRangeFilter_Lucene_2.3.0.patch, 
> MemoryCachedRangeFilter.patch, MemoryCachedRangeFilter_1.4.patch, 
> TestRangeFilterPerformanceComparison.java, 
> TestRangeFilterPerformanceComparison.java
>
>
> Currently RangeFilter uses TermEnum and TermDocs to find documents that fall 
> within the specified range.  This requires iterating through every single 
> term in the index and can get rather slow for large document sets.
> MemoryCachedRangeFilter reads all <docId, value> pairs of a given field, 
> sorts by value, and stores them in a SortedFieldCache.  During bits(), binary 
> searches are used to find the start and end indices of the lower and upper 
> bound values.  The BitSet is populated by all the docId values that fall in 
> between the start and end indices.
> TestMemoryCachedRangeFilterPerformance creates a 100K RAMDirectory-backed 
> index with random date values within a 5 year range.  Executing bits() 1000 
> times on standard RangeQuery using random date intervals took 63904ms.  Using 
> MemoryCachedRangeFilter, it took 876ms.  The performance increase is less 
> dramatic when a field has fewer unique terms or the index has fewer 
> documents.
> Currently MemoryCachedRangeFilter only works with numeric values (values are 
> stored in a long[] array) but it can be easily changed to support Strings.  A 
> side "benefit" of storing the values as longs is that there's no longer any 
> need to make the values lexicographically comparable, i.e. padding numeric 
> values with zeros.
> The downside of using MemoryCachedRangeFilter is there's a fairly significant 
> memory requirement.  So it's designed to be used in situations where range 
> filter performance is critical and memory consumption is not an issue.  The 
> memory requirements are: (sizeof(int) + sizeof(long)) * numDocs.  
> MemoryCachedRangeFilter also requires a warmup step which can take a while to 
> run in large datasets (it took 40s to run on a 3M document corpus).  Warmup 
> can be called explicitly or is automatically called the first time 
> MemoryCachedRangeFilter is applied using a given field.
> So in summary, MemoryCachedRangeFilter can be useful when:
> - Performance is critical
> - Memory is not an issue
> - The field contains many unique numeric values
> - The index contains a large number of documents
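
A minimal sketch of the cached binary-search idea described above
(hypothetical names, not the attached patch): values are sorted ascending,
and docIds[i] holds the document whose field value is values[i].

import java.util.Arrays;
import java.util.BitSet;

class SortedFieldCacheSketch {
  final long[] values;
  final int[] docIds;

  SortedFieldCacheSketch(long[] sortedValues, int[] alignedDocIds) {
    values = sortedValues;
    docIds = alignedDocIds;
  }

  BitSet bits(long lower, long upper) {
    int lo = Arrays.binarySearch(values, lower);
    if (lo < 0) lo = -lo - 1;                            // first value >= lower
    else while (lo > 0 && values[lo - 1] == lower) lo--; // leftmost duplicate
    int hi = Arrays.binarySearch(values, upper);
    if (hi < 0) hi = -hi - 2;                            // last value <= upper
    else while (hi + 1 < values.length && values[hi + 1] == upper) hi++;
    BitSet result = new BitSet();
    for (int i = lo; i <= hi; i++) {
      result.set(docIds[i]);   // O(matching docs), not O(all terms)
    }
    return result;
  }
}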

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-1480) Wrap messages output with a check of InfoStream != null

2008-12-05 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera updated LUCENE-1480:
---

Attachment: LUCENE-1480-2.patch

Renamed messagesEnabled to verbose. Thanks Mike!

> Wrap messages output with a check of InfoStream != null
> ---
>
> Key: LUCENE-1480
> URL: https://issues.apache.org/jira/browse/LUCENE-1480
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Shai Erera
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.4.1
>
> Attachments: LUCENE-1480-2.patch, LUCENE-1480.patch
>
>
> I've found several places in the code where messages are output w/o first 
> checking if infoStream != null. The result is that, most of the time, 
> unnecessary strings are created but never output (because infoStream is not 
> set). We should follow Java's logging best practices, where a log message is 
> always output in the following format:
> if (logger.isLoggable(level)) {
> logger.log(level, msg);
> }
> Log messages are usually created w/o paying too much attention to performance 
> (such as string concatenation using '+' instead of StringBuffer). Therefore, 
> at runtime it is important to avoid creating those messages, if they will be 
> discarded eventually.
> I will add a method to IndexWriter messagesEnabled() and then use it wherever 
> a call to iw.message() is made.
> Patch will follow
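
A minimal sketch of the guard this issue proposes (class and method names
are hypothetical): the message string is only concatenated when an
infoStream is actually set.

class MessageGuardSketch {
  java.io.PrintStream infoStream;            // null means verbose output off

  boolean verbose() { return infoStream != null; }

  void onMerge(String segment, int docCount) {
    if (verbose()) {                         // cheap check first
      infoStream.println("merge " + segment + ": docs=" + docCount);
    }
  }
}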

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: [jira] Commented: (LUCENE-1473) Implement Externalizable in main top level searcher classes

2008-12-05 Thread John Wang
Doug:
1)  "Incrementally upgrading distributed systems has, at least in the past,
been outside the scope of Lucene" - That's good to know. Is it also out of
the scope of the distributed lucene effort (if it is still happening)?

2) I used the word broken to describe what happened for our deployment. I
will try to use less harsh words when addressing lucene in the future.

3) " If you think that's trivial, then please pursue it and show us how
trivial it is." - My proposal is to add the suid to Serializable classes; if
you don't think that's trivial, many IDEs do that for you. I think your
main concern is that this is not the perfect solution to this problem, but
it does provide better behavior than what we have now, IMO. I understand we
not work. Given many of these classes are rather static, I don't share your
concern.

4) "You developed based on some very optimistic guesses about some unstated
aspects" - this was developed based on our understanding of Serializable,
without Lucene documentation discouraging us from doing so. We also interpreted
the fact that RemoteSearcher is part of the package as an example of a valid
use-case. The JOSS protocol is designed to handle versioning (although not
perfectly). We didn't think that was risky; obviously in hindsight it is. But
I do find it hard to believe this is something the author of these classes had
in mind when the Serializable interface was implemented.

This is getting into a philosophical discussion on Java Serialization, and
how it pertains to lucene. I don't see any resolution in the near future.
Moving forward, we'd be happy to provide patches given the agreed solution.
There is no reason to provide code patches if it is decided only
documentation needs to change. (From what you have outlined, I interpret it
as being only documentation changes.)

Also, if you find our addressing this issue to be a hassle, e.g. if addressing
serialization in lucene is an incorrect thing to do, feel free to let us
know and we can close the bug and terminate the thread.

Thanks

-John

On Fri, Dec 5, 2008 at 9:18 AM, Doug Cutting <[EMAIL PROTECTED]> wrote:

> John Wang wrote:
>
>> Thus we are enforcing users that care about Serialization to use the
>> release jar.
>>
>
> We already encourage folks to use a release jar if possible.  So this is
> not a big change.  Also, if folks choose to build their own jar, then they
> are expected to use that same jar everywhere, effectively making their own
> release.  That doesn't seem unreasonable to me.  Incrementally upgrading
> distributed systems has, at least in the past, been outside the scope of
> Lucene.
>
>  3) Clean up the serialization story, either add SUID or implement
>> Externalizable for some classes within Lucene that implements Serializable:
>>
>> From what I am told, this is too much work for the committers.
>>
>
> Not that it's too much work today, but that it adds an ongoing burden and
> we should take this on cautiously if at all.  If we want to go this way we'd
> need to:
>
> - document precisely which classes we'll evolve back-compatibly;
> - document the releases (major? minor?) that will be compatible; and
> - provide a test suite that validates this.
>
> As a side note, we should probably move the back-compatibility
> documentation from the wiki to the project website.  This would permit
> patches to it, among other things.
>
> http://wiki.apache.org/lucene-java/BackwardsCompatibility
>
>  I hope you guys at least agree with me with the way it is currently, the
>> serialization story is broken, whether in documentation or in code.
>>
>
> Documenting an unstated assumption is a good thing to do, especially when
> not everyone seems to share the assumption, but "broken" seems a bit strong
> here.
>
>  I see the disagreement being its severity, and whether it is a trivial
>> fix, which I have learned it is not really my place to say.
>>
>
> I've outlined above what I think would be required.  If you think that's
> trivial, then please pursue it and show us how trivial it is.  The patch
> provided thus far is incomplete.
>
>  Please do understand this is not a far-fetched, made-up use-case, we are
>> running into this in production, and we are developing in accordance to
>> lucene documentation.
>>
>
> You developed based on some very optimistic guesses about some unstated
> aspects.  In Java, implementing Serializable alone does not generally
> provide any cross-version guarantees.  Assuming that it did was risky.
>
> Doug
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


Re: Java logging in Lucene

2008-12-05 Thread Shai Erera
Have you ever tried to debug your search application after it was shipped to
a customer? When problems occur on the customer end, you cannot very easily
reproduce them, because customers don't like to give you access to their
systems, and they are not always willing to share the index with you, let
alone the documents that have been indexed.

Logging is very common in products just for that purpose. Of course I can
use debugging when something happens in my development environment. But
that's not the case after the product has shipped.

As for the logging framework, I'd think that Java logging creates no
dependencies for Lucene. java.util.logging exists at least since 1.4. So
it's already in the JDK. You might argue that some applications that embed a
search component over Lucene use a different logging system (such as Log4j),
but in that case I think it'd be fair to say that Java logging is what
Lucene uses.

You already do it today - you say that you use infoStream which prints
messages. Only the solution in Lucene today cannot be customized. I either
turn on *logging* for the entire Lucene package (or actually just the
indexing part) or not. I cannot, for example, turn on *logging* just for the
merge part.

The debugging on the customer side is mostly what I'm after. My experience
with another search library (proprietary) with exactly the same *logging*
capabilities as Lucene (you either turn logging on/off for everything),
although it contained messages from other parts of the search library as
well, shows that it's extremely difficult to debug what's going on during
search on the customer side. Sometimes, all the application can log is that
it adds a document with some attributes, but if you really want to
understand what's going on inside Lucene, it's impossible. One useful piece
of information might be the actual tokens that were added to the
index. There's no way the application can tell you that, w/o running the
Analyzer on the text. But then it needs to write code, which I think could
have been written in Lucene.
Another useful piece of information is the query that's actually being run. I
guess that printing the QueryParser Query output object might be enough, but
you never know.
Maybe you'd like to know what indexes participated in the search, in case of
a distributed indexing scenario.

And the list can only grow ...

Like I said in my first email - logging is a decision the community has to
make, w/o necessarily going over all the existing code and adding messages.
Those can be added over time, by the many people who'd like to get detailed
information from Lucene.

I hope my intentions are clearer now.

On Fri, Dec 5, 2008 at 9:06 PM, Michael McCandless <
[EMAIL PROTECTED]> wrote:

>
> I also feel that the primary usage of the internal messaging in Lucene
> today is debugging, and we don't need a logging framework for that.
>
> Mike
>
>
> Doug Cutting wrote:
>
>  The infoStream stuff goes back to 1997, before there was log4j or any
>> other Java logging framework.
>>
>> There's never been a big push to add logging to Lucene.  It would add a
>> dependency, and Lucene's jar has always been standalone, which is nice.
>>  Dependencies can conflict.  If Lucene requires one version of a dependency,
>> then it may not work well with code that require a different version of that
>> dependency.
>>
>> And it hasn't been clear which framework to adopt.  Log4j is the
>> granddaddy, then there's Java logging and commons logging.  Today the
>> preferred framework is probably SLF4J.  Good thing we didn't choose the
>> wrong one years ago!
>>
>> And how many log entries would folks really want to see per query or
>> document indexed?  In production I don't think most folks want to see more
>> than one entry per query or document indexed.  So finer-grained logging
>> would be for debugging.  For that one can instead use a debugger.  Hence the
>> traditional lack of demand for detailed logging in Lucene.
>>
>> That's the history as I recall it.  The future is less clear.
>>
>> Doug
>>
>> Grant Ingersoll wrote:
>>
>>> I think the main motivation has always been to have no dependencies in
>>> the core so as to keep it as fast and lightweight as possible.  Then, of
>>> course, there is always the usual religious wars around which logging
>>> framework to use, not to mention the nightmare that is trying to manage
>>> multiple logging frameworks across several projects that are being
>>> integrated.  Then, of course, there is the question of how useful any core
>>> Lucene logs would be to users writing search applications.  For the most
>>> part, my experience has been that I want logging to tell me when a document
>>> was added, when searches occur, etc. but I don't necessarily need to know
>>> things like the fact that Lucene is now entering the analysis phase of
>>> Document inversion.  And, for all these needs, I can just as well do that
>>> logging in the application and not in Lucene.
>>> All that is not to say we couldn't add in logging ...

[jira] Commented: (LUCENE-1476) BitVector implement DocIdSet

2008-12-05 Thread robert engels (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653912#action_12653912
 ] 

robert engels commented on LUCENE-1476:
---

but IndexReader.document(n) throws an exception if the document is deleted... 
so you still need random access

> BitVector implement DocIdSet
> 
>
> Key: LUCENE-1476
> URL: https://issues.apache.org/jira/browse/LUCENE-1476
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.4
>Reporter: Jason Rutherglen
>Priority: Trivial
> Attachments: LUCENE-1476.patch
>
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> BitVector can implement DocIdSet.  This is for making 
> SegmentReader.deletedDocs pluggable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1476) BitVector implement DocIdSet

2008-12-05 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653915#action_12653915
 ] 

Michael McCandless commented on LUCENE-1476:


bq. but IndexReader.document throws an exception if the document is deleted... 
so you still need random access 

Does it really need to throw an exception?  (Of course for back compat it does, 
but we could move away from that to a new method that doesn't check).

> BitVector implement DocIdSet
> 
>
> Key: LUCENE-1476
> URL: https://issues.apache.org/jira/browse/LUCENE-1476
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.4
>Reporter: Jason Rutherglen
>Priority: Trivial
> Attachments: LUCENE-1476.patch
>
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> BitVector can implement DocIdSet.  This is for making 
> SegmentReader.deletedDocs pluggable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: [jira] Commented: (LUCENE-1473) Implement Externalizable in main top level searcher classes

2008-12-05 Thread Michael McCandless


John Wang wrote:


My proposal is to add the suid to Serializable classes


That's too brittle.

If we do that, then what happens when we need to add a field to the
class (eg, in 2.9 we've replaced "inclusive" in RangeQuery with
"includeLower" and "includeUpper")?  The standard answer is you bump
the suid, but, then that breaks back compatibility.

Since we would still sometimes, unpredictably, break back
compatibility, no app could rely on it.  You can't have a "mostly
back compatible" promise.

So... we have to either 1) only support "live serialization" and
update the javadocs saying so, or 2) support full back compat of
serialized classes and spell out the actual policy, make thorough
tests for it, etc.

Mike
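
A sketch of why a pinned serialVersionUID alone is brittle (the class below
is hypothetical, used only to illustrate the point):

import java.io.Serializable;

class ExampleRangeQuery implements Serializable {
  private static final long serialVersionUID = 1L; // pinned across releases

  // v1 had: boolean inclusive;
  // v2 replaced it with the two fields below. A v1 stream read by v2 code
  // leaves both at their default (false) -- silently changing semantics.
  boolean includeLower;
  boolean includeUpper;
}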


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Java logging in Lucene

2008-12-05 Thread Shai Erera
BTW, one thing I forgot to add: with infoStream, it's very difficult to
extend the level of output if one wants, for example, to add logging
messages to the search part (or other parts). The reason is that one would
need to pass infoStream down through too many classes. Instead, with Java
logging, each class is responsible for its own logging (by obtaining a
Logger instance given the class name). You can later turn logging on/off per
package/class.
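
A sketch of that per-class java.util.logging pattern (the class name here is
hypothetical): each class owns a Logger keyed by its name, so logging can
later be enabled per class or per package.

import java.util.logging.Level;
import java.util.logging.Logger;

class SegmentMergerExample {
  private static final Logger LOG =
      Logger.getLogger(SegmentMergerExample.class.getName());

  void merge() {
    if (LOG.isLoggable(Level.FINE)) {   // guard avoids building the string
      LOG.fine("merging segments...");
    }
  }
}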

Perhaps instead of introducing Java logging then (if you're too against it),
we could introduce a static InfoStream class, with static message() and
isVerbose() methods. That way, all classes that wish to log any message can
use it, and it will be easier to add messages in the future from other
classes.
Even though it won't allow controlling which classes/packages output to
the log file, it will make Lucene logging easier to extend. Would that
make more sense?

I still would prefer to see Java logging embedded, but if that's
unacceptable to the community, then having the above solution is better than
what we have today.
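
A minimal sketch of the static InfoStream idea above, assuming a single
global switch (class and method names are hypothetical):

final class InfoStreamSketch {
  private static volatile java.io.PrintStream out; // null = logging disabled

  static void setStream(java.io.PrintStream stream) { out = stream; }
  static boolean isVerbose() { return out != null; }

  static void message(String msg) {
    java.io.PrintStream s = out;   // read the field once for thread safety
    if (s != null) s.println(msg);
  }

  private InfoStreamSketch() {}
}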

On Fri, Dec 5, 2008 at 9:38 PM, Shai Erera <[EMAIL PROTECTED]> wrote:

> Have you ever tried to debug your search application after it was shipped
> to a customer? When problems occur on the customer end, you cannot very
> easily reproduce them, because customers don't like to give you access to
> their systems, and they are not always willing to share the index with you,
> let alone the documents that have been indexed.
>
> Logging is very common in products just for that purpose. Of course I can
> use debugging when something happens in my development environment. But
> that's not the case after the product has shipped.
>
> As for the logging framework, I'd think that Java logging creates no
> dependencies for Lucene. java.util.logging exists at least since 1.4. So
> it's already in the JDK. You might argue that some applications that embed a
> search component over Lucene use a different logging system (such as Log4j),
> but in that case I think it'd be fair to say that Java logging is what
> Lucene uses.
>
> You already do it today - you say that you use infoStream which prints
> messages. Only the solution in Lucene today cannot be customized. I either
> turn on *logging* for the entire Lucene package (or actually just the
> indexing part) or not. I cannot, for example, turn on *logging* just for the
> merge part.
>
> The debugging on the customer side is mostly what I'm after. My experience
> with another search library (proprietary) with exactly the same *logging*
> capabilities as Lucene (you either turn logging on/off for everything),
> although it contained messages from other parts of the search library as
> well, shows that it's extremely difficult to debug what's going on during
> search on the customer side. Sometimes, all the application can log is that
> it adds a document with some attributes, but if you really want to
> understand what's going on inside Lucene, it's impossible. One useful piece
> of information might be the actual tokens that were added to the
> index. There's no way the application can tell you that, w/o running the
> Analyzer on the text. But then it needs to write code, which I think could
> have been written in Lucene.
> Another useful piece of information is the query that's actually being run.
> I guess that printing the QueryParser Query output object might be enough,
> but you never know.
> Maybe you'd like to know what indexes participated in the search, in case
> of a distributed indexing scenario.
>
> And the list can only grow ...
>
> Like I said in my first email - logging is a decision the community has to
> make, w/o necessarily going over all the existing code and adding messages.
> Those can be added over time, by many people who'd like to get detailed
> information from Lucene.
>
> I hope my intentions are clearer now.
>
>
> On Fri, Dec 5, 2008 at 9:06 PM, Michael McCandless <
> [EMAIL PROTECTED]> wrote:
>
>>
>> I also feel that the primary usage of the internal messaging in Lucene
>> today is debugging, and we don't need a logging framework for that.
>>
>> Mike
>>
>>
>> Doug Cutting wrote:
>>
>>  The infoStream stuff goes back to 1997, before there was log4j or any
>>> other Java logging framework.
>>>
>>> There's never been a big push to add logging to Lucene.  It would add a
>>> dependency, and Lucene's jar has always been standalone, which is nice.
>>>  Dependencies can conflict.  If Lucene requires one version of a dependency,
>>> then it may not work well with code that require a different version of that
>>> dependency.
>>>
>>> And it hasn't been clear which framework to adopt.  Log4j is the
>>> granddaddy, then there's Java logging and commons logging.  Today the
>>> preferred framework is probably SLF4J.  Good thing we didn't choose the
>>> wrong one years ago!
>>>
>>> And how many log entries would folks really want to see per query or
>>> document indexed?  In production I don't think most folks want to see ...

[jira] Commented: (LUCENE-1476) BitVector implement DocIdSet

2008-12-05 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653923#action_12653923
 ] 

Jason Rutherglen commented on LUCENE-1476:
--

It would be great if, instead of relying on Lucene to manage the deletedDocs 
file, the API were pluggable enough that a DocIdBitSet (a DocIdSet with 
random access) could be set in a SegmentReader, with the file access (reading 
and writing) managed from outside.  Of course this is difficult to do while 
keeping things backwards compatible; however, for 3.0 I would *really* 
like to be a part of efforts to create a completely generic and pluggable API 
that is cleanly separated from the underlying index format and files.  This 
would mean that the analyzing, querying, and scoring portions of Lucene could 
access an IndexReader-like pluggable class, with the underlying index files, 
and when and how they are written to disk, completely separated.  

One motivation for this patch is to allow custom queries access to the 
deletedDocs in a clean way (meaning not needing to be a part of the o.a.l.s. 
package).

I am wondering if it is good to try to get IndexReader.clone working again, or 
if there is some other better way related to this patch to externally manage 
the deletedDocs.  

Improving the performance of deletedDocs would help every query, so it's 
worth looking at.  
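
A sketch of the pluggable deletes idea above (both interfaces are
hypothetical): a set supporting both the random-access probe document(n)
needs and the sequential iteration queries could merge into their scans.

interface DocIdIteratorSketch {
  int next();                      // next deleted docID, or -1 when exhausted
}

interface RandomAccessDocIdSetSketch {
  boolean get(int docId);          // random-access probe, isDeleted-style
  DocIdIteratorSketch iterator();  // sequential access for merged scans
}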

> BitVector implement DocIdSet
> 
>
> Key: LUCENE-1476
> URL: https://issues.apache.org/jira/browse/LUCENE-1476
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.4
>Reporter: Jason Rutherglen
>Priority: Trivial
> Attachments: LUCENE-1476.patch
>
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> BitVector can implement DocIdSet.  This is for making 
> SegmentReader.deletedDocs pluggable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Java logging in Lucene

2008-12-05 Thread Doug Cutting

Shai Erera wrote:
Perhaps instead of introducing Java logging then (if you're too against 
it), we could introduce a static InfoStream class, with static 
message() and isVerbose() methods.


It's tempting to add our own logging API, as you suggest, but I fear 
that would re-invent what so many have re-invented before.



As for the logging framework, I'd think that Java logging creates no
dependencies for Lucene. java.util.logging exists at least since 1.4.
So it's already in the JDK.


Good point.  Java's built-in logging would not add a dependency, but it 
can still conflict.  In other projects with serious logging needs 
where I've tried using Java's built-in logging, we've always ended 
up switching to log4j.  So I worry that choosing Java's logging might 
not help those who need logging, and it would conflict with those who 
already use log4j.



You might argue that some applications
who embed a search component over Lucene use a different logging
system (such as Log4j), but in that case I think it'd be fair to say
that Java logging is what Lucene uses.


What do we tell folks who currently use both log4j and Lucene?  How 
would they benefit from this?


A meta-logger like SLF4J seems preferable, since it could integrate with 
whatever logging system folks already use.  Adding this would be an 
incompatible change, since folks would need to add new jars into their 
applications besides the lucene jar.  But that's perhaps not a huge 
burden.  What do others think?


Doug

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: [jira] Commented: (LUCENE-1473) Implement Externalizable in main top level searcher classes

2008-12-05 Thread John Wang
Mike:
   This has gone back and forth on this thread already. Again, I
agree it is not the perfect solution. I am comparing it to the current
behavior, and I don't think it is worse (only in my opinion).

   "live serialization" is not familiar to me. To understand it more,
can you point me to somewhere the J2EE spec defines it? AFAIK, the J2EE spec
does not make a distinction, and from what I gather from this thread, Lucene
does not fall into the special category on how Serializable is used. Of
course, it could just be my lack of understanding in the spec.

   We are happy to accept whatever you guys think on this issue. As it
stands currently, views are not consistent amongst different committers.

Thanks

-John

On Fri, Dec 5, 2008 at 12:07 PM, Michael McCandless <
[EMAIL PROTECTED]> wrote:

>
> John Wang wrote:
>
>  My proposal is to add the suid to Serializable classes
>>
>
> That's too brittle.
>
> If we do that, then what happens when we need to add a field to the
> class (eg, in 2.9 we've replaced "inclusive" in RangeQuery with
> "includeLower" and "includeUpper")?  The standard answer is you bump
> the suid, but, then that breaks back compatibility.
>
> Since we would still sometimes, unpredictably, break back
> compatibility, no app could rely on it.  You can't have a "mostly
> back compatible" promise.
>
> So... we have to either 1) only support "live serialization" and
> update the javadocs saying so, or 2) support full back compat of
> serialized classes and spell out the actual policy, make thorough
> tests for it, etc.
>
> Mike
>
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


Re: [jira] Commented: (LUCENE-1473) Implement Externalizable in main top level searcher classes

2008-12-05 Thread Doug Cutting

John Wang wrote:
Also, if you find our addressing this issue to be a hassle, e.g. if 
addressing serialization in lucene is an incorrect thing to do, feel 
free to let us know and we can close the bug and terminate the thread.


I don't know whether cross-version serialization belongs in Lucene.  We 
need to discuss it, to find out how many users might want it, how many 
developers might fear it, how reasonable their fears are, etc.


The discussion so far has not been an easy one.  There have been many 
claims made which have little to do with the technical issue.  As a 
project, we must reach consensus before we can do anything.  Polarized 
comments do not help build consensus.


Doug


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Java logging in Lucene

2008-12-05 Thread John Wang
I thought the main point is to avoid a jar dependency.
If we were to depend on a jar for logging, then why not log4j or
commons-logging?

-John


On Fri, Dec 5, 2008 at 1:00 PM, Doug Cutting <[EMAIL PROTECTED]> wrote:

> Shai Erera wrote:
>
>> Perhaps instead of introducing Java logging then (if you're too against
>> it), we could introduce a static InfoStream class, with static message()
>> and isVerbose() methods.
>>
>
> It's tempting to add our own logging API, as you suggest, but I fear that
> would re-invent what so many have re-invented before.
>
>  As for the logging framework, I'd think that Java logging creates no
>> dependencies for Lucene. java.util.logging exists at least since 1.4.
>> So it's already in the JDK.
>>
>
> Good point.  Java's built-in logging would not add a dependency, but it can
> still conflict.  In other projects with serious logging needs where I've
> tried using Java's built-in logging, we've always ended up switching to
> log4j.  So I worry that choosing Java's logging might not help those who
> need logging, and it would conflict with those who already use log4j.
>
>  You might argue that some applications
>> who embed a search component over Lucene use a different logging
>> system (such as Log4j), but in that case I think it'd be fair to say
>> that Java logging is what Lucene uses.
>>
>
> What do we tell folks who currently use both log4j and Lucene?  How would
> they benefit from this?
>
> A meta-logger like SLF4J seems preferable, since it could integrate with
> whatever logging system folks already use.  Adding this would be an
> incompatible change, since folks would need to add new jars into their
> applications besides the lucene jar.  But that's perhaps not a huge burden.
>  What do others think?
>
> Doug
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


Re: Java logging in Lucene

2008-12-05 Thread Doug Cutting

John Wang wrote:
If we were to depend on a jar for logging, then why not log4j or 
commons-logging?


Lucene is used by many applications.  Many of those applications already 
perform some kind of logging.  We'd like whatever Lucene adds to fit in 
with their existing logging framework, not conflict with it.  Thus the 
motivation to use a meta-logging framework like commons logging or slf4j. 
 And articles like the following point towards slf4j, not commons logging.


http://www.qos.ch/logging/thinkAgain.jsp

Doug




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: [jira] Commented: (LUCENE-1473) Implement Externalizable in main top level searcher classes

2008-12-05 Thread Doug Cutting

John Wang wrote:
   This has been gone back and forth on this thread already. Again, 
I agree it is not the perfect solution. I am comparing that to the 
current behavior, I don't think it is worse. (Only in my opinion).


So, if it's good enough for you, a user of java serialization, then 
perhaps those of us who don't use java serialization shouldn't complain. 
 I think we'd want to add to the documentation something to the effect 
that this is all that's been done, and that if the classes change 
substantially then all bets are off.  We do not want to imply that we're 
making any cross-version compatibility guarantees about serialization, 
rather just that folks who're willing to take their chances will not be 
impeded.  Could something like that work?


Doug

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: [jira] Commented: (LUCENE-1473) Implement Externalizable in main top level searcher classes

2008-12-05 Thread John Wang
Works for me.

Thanks

-John

On Fri, Dec 5, 2008 at 1:23 PM, Doug Cutting <[EMAIL PROTECTED]> wrote:

> John Wang wrote:
>
>>   This has been gone back and forth on this thread already. Again, I
>> agree it is not the perfect solution. I am comparing that to the current
>> behavior, I don't think it is worse. (Only in my opinion).
>>
>
> So, if it's good enough for you, a user of java serialization, then perhaps
> those of us who don't use java serialization shouldn't complain.  I think
> we'd want to add to the documentation something to the effect that this is
> all that's been done, and that if the classes change substantially then all
> bets are off.  We do not want to imply that we're making any cross-version
> compatibility guarantees about serialization, rather just that folks who're
> willing to take their chances will not be impeded.  Could something like
> that work?
>
> Doug
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


Re: [jira] Commented: (LUCENE-1473) Implement Externalizable in main top level searcher classes

2008-12-05 Thread Michael McCandless


OK works for me too.

John or Jason, can you update the patch on LUCENE-1473?  We no longer  
need to implement Externalizable (just add fixed SUIDs), but we do  
need to update the javadocs for all classes implementing Serializable  
to state that cross-version compatibility is not guaranteed.


Mike

John Wang wrote:


Works for me.

Thanks

-John

On Fri, Dec 5, 2008 at 1:23 PM, Doug Cutting <[EMAIL PROTECTED]>  
wrote:

John Wang wrote:
  This has been gone back and forth on this thread already.  
Again, I agree it is not the perfect solution. I am comparing that  
to the current behavior, I don't think it is worse. (Only in my  
opinion).


So, if it's good enough for you, a user of java serialization, then  
perhaps those of us who don't use java serialization shouldn't  
complain.  I think we'd want to add to the documentation something  
to the effect that this is all that's been done, and that if the  
classes change substantially then all bets are off.  We do not want  
to imply that we're making any cross-version compatibility  
guarantees about serialization, rather just that folks who're  
willing to take their chances will not be impeded.  Could something  
like that work?


Doug


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1476) BitVector implement DocIdSet

2008-12-05 Thread Marvin Humphrey (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653939#action_12653939
 ] 

Marvin Humphrey commented on LUCENE-1476:
-

> Does it really need to throw an exception?

Aside from back compat, I don't see why it would need to.  I think the only 
rationale is to serve as a backstop protecting against invalid reads.

> BitVector implement DocIdSet
> 
>
> Key: LUCENE-1476
> URL: https://issues.apache.org/jira/browse/LUCENE-1476
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.4
>Reporter: Jason Rutherglen
>Priority: Trivial
> Attachments: LUCENE-1476.patch
>
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> BitVector can implement DocIdSet.  This is for making 
> SegmentReader.deletedDocs pluggable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: [jira] Commented: (LUCENE-1473) Implement Externalizable in main top level searcher classes

2008-12-05 Thread Jason Rutherglen
I think it's best to implement Externalizable as long as someone is willing
to maintain it.  I commit to maintaining the Externalizable code.  The
programming overhead is no more than implementing the equals method in the
classes.  New classes outside the Lucene code base simply need to implement
Serializable to work.  External developers are not required to implement
Externalizable but may if they see fit.

This will ensure forward compatibility between serialized versions, make the
serialized objects smaller, and make serialization faster.

Apparently it matters enough for Hadoop to implement Writable in all
over-the-wire classes.
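
A sketch of the versioned Externalizable pattern being argued for (the
class below is hypothetical): an explicit format version gives newer
readers a place to branch when the wire format evolves.

import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectOutput;

class ExampleRangeFilter implements Externalizable {
  private static final int FORMAT_VERSION = 1;
  long lower, upper;

  public ExampleRangeFilter() {}   // public no-arg ctor required by Externalizable

  public void writeExternal(ObjectOutput out) throws IOException {
    out.writeInt(FORMAT_VERSION);
    out.writeLong(lower);
    out.writeLong(upper);
  }

  public void readExternal(ObjectInput in)
      throws IOException, ClassNotFoundException {
    int version = in.readInt();    // branch on version as the format grows
    lower = in.readLong();
    upper = in.readLong();
  }
}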

On Fri, Dec 5, 2008 at 1:47 PM, Michael McCandless <
[EMAIL PROTECTED]> wrote:

>
> OK works for me too.
>
> John or Jason, can you update the patch on LUCENE-1473?  We no longer need
> to implement Externalizable (just add fixed SUIDs), but we do need to update
> the javadocs for all classes implementing Serializable to state that
> cross-version compatibility is not guaranteed.
>
> Mike
>
>
> John Wang wrote:
>
>  Works for me.
>>
>> Thanks
>>
>> -John
>>
>> On Fri, Dec 5, 2008 at 1:23 PM, Doug Cutting <[EMAIL PROTECTED]> wrote:
>> John Wang wrote:
>>  This has been gone back and forth on this thread already. Again, I
>> agree it is not the perfect solution. I am comparing that to the current
>> behavior, I don't think it is worse. (Only in my opinion).
>>
>> So, if it's good enough for you, a user of java serialization, then
>> perhaps those of us who don't use java serialization shouldn't complain.  I
>> think we'd want to add to the documentation something to the effect that
>> this is all that's been done, and that if the classes change substantially
>> then all bets are off.  We do not want to imply that we're making any
>> cross-version compatibility guarantees about serialization, rather just that
>> folks who're willing to take their chances will not be impeded.  Could
>> something like that work?
>>
>> Doug
>>
>>
>> -
>> To unsubscribe, e-mail: [EMAIL PROTECTED]
>> For additional commands, e-mail: [EMAIL PROTECTED]
>>
>>
>>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


Re: Java logging in Lucene

2008-12-05 Thread Grant Ingersoll
For some additional context, go over to the Solr mail archive and  
search for "Logging, SLF4J" or see http://lucene.markmail.org/message/gxifhjzmn6hgloy7?q=Solr+logging+SLF4J


I personally don't like JUL and would be against using it.  I could,  
maybe, just maybe, be talked into SLF4J.


The other thing I worry about is that the logging will probably be  
carefully crafted at first, but then will grow and grow and end up in  
some tight loops, etc.


Ah, for the days of the C preprocessor, where we could easily deliver  
a version of Lucene w/ logging and without for this kind of  
debugging...  ;-)


-Grant

On Dec 5, 2008, at 4:19 PM, Doug Cutting wrote:


John Wang wrote:
If we were to depend on a jar for logging, then why not log4j or  
commons-logging?


Lucene is used by many applications.  Many of those applications  
already perform some kind of logging.  We'd like whatever Lucene  
adds to fit in with their existing logging framework, not conflict  
with it.  Thus the motivation to use a meta-logging framework like  
commons logging or slf4j.  And articles like the following point  
towards slf4j, not commons logging.


http://www.qos.ch/logging/thinkAgain.jsp

Doug




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Java logging in Lucene

2008-12-05 Thread Jason Rutherglen
As a developer who would like to eventually develop core code in Lucene (I
started but then things changed drastically and so will wait for flexible
indexing and other APIs?), a standard logging system would make development
easier by making debugging easier.  I rely heavily on log analysis in
developing and debugging search code.  A detailed view of what is happening
internally will speed development, and as Shai mentioned allow production
and pre-production systems to be monitored in new ways.

-J

On Fri, Dec 5, 2008 at 1:19 PM, Doug Cutting <[EMAIL PROTECTED]> wrote:

> John Wang wrote:
>
>> If we were to depend on a jar for logging, then why not log4j or
>> commons-logging?
>>
>
> Lucene is used by many applications.  Many of those applications already
> perform some kind of logging.  We'd like whatever Lucene adds to fit in with
> their existing logging framework, not conflict with it.  Thus the motivation
> to use a meta-logging framework like commons logging or slf4j.  And articles
> like the following point towards slf4j, not commons logging.
>
> http://www.qos.ch/logging/thinkAgain.jsp
>
>
> Doug
>
>
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


Re: [jira] Commented: (LUCENE-1473) Implement Externalizable in main top level searcher classes

2008-12-05 Thread Doug Cutting

Jason Rutherglen wrote:
I think it's best to implement Externalizable as long as someone is 
willing to maintain it.  I commit to maintaining the Externalizable 
code.


We need to agree to maintain things as a community, not as individuals. 
 We can't rely on any particular individual being around in the future.


This will ensure forward compatibility between serialized versions, make 
the serialized objects smaller, and make serialization faster. 


If we want to promise compatibility we need to scope it and test it.  We 
cannot in good faith promise that Query will be serially compatible 
forever, nor should we make any promises that we don't test.  So if you 
choose to continue promoting this route, please specify the scope of 
compatibility and your plans to add tests for it.


Apparently it matters enough for Hadoop to implement Writable in all 
over-the-wire classes.


I'm not sure what you're saying here.  As I've said before, Hadoop is 
moving away from Writable because it is too fragile as classes change. 
As a part of the preparations for Hadoop 1.0 we are agreeing on 
serialization back-compatibility requirements and what technology we 
will use to support these.  Hadoop is at its core a distributed system, 
while Lucene is not.  Even then, Hadoop will continue to require that 
one update all nodes in a cluster in a coordinated manner, so only 
end-user protocols need be cross-version compatible, not internal 
protocols.  I do not yet see a strong analogy here.


Doug

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1473) Implement standard Serialization across Lucene versions

2008-12-05 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653955#action_12653955
 ] 

Jason Rutherglen commented on LUCENE-1473:
--

Doug wrote: We can save, in files, serialized instances of each query type from 
the oldest release we intend to support. Then read each of these queries and 
check that it is equal to a current query that's meant to be equivalent (assuming 
all queries implement equals well). Something similar would need to be done for 
each class that is meant to be transmitted cross-version.

This tests that older queries may be processed by newer code. It does not test 
that newer queries can be processed by older code. Documentation is a big part 
of this effort and should be completed first. What guarantees do we intend to 
provide? Once we've documented these, then we can begin writing tests. For 
example, we may only guarantee that older queries work with newer code, and 
that newer hits work with older code. To test that we'd need to have an old jar 
around that we could test against. This will be a trickier test to configure.

--

Makes sense.  I guarantee 2.9 and above classes will be backward compatible 
with the previous classes.  I think that for 3.0 we'll start to create new 
replacement classes that will not conflict with the old ones.  I'd really 
like to redesign the query, similarity, and scoring code to work with flexible 
indexing and allow new algorithms.  This new code will not change the existing 
query, similarity, and scoring code, which will remain serialization 
compatible with 2.9.  The 2.9 query, similarity, and scoring classes should 
leverage the new query, similarity, and scoring code to stay backwards 
compatible.
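
To make the test plan concrete, a sketch of the kind of cross-version check Doug describes, assuming serialized query files produced by the oldest supported release are checked into the test tree the way legacy CFS index files are (the file path and expected query are illustrative, not actual test resources):

{code}
import java.io.FileInputStream;
import java.io.ObjectInputStream;

import junit.framework.TestCase;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class TestSerializationCompat extends TestCase {

  public void testTermQueryFromOldestRelease() throws Exception {
    // Stream written once by the oldest release we promise to support.
    ObjectInputStream in = new ObjectInputStream(
        new FileInputStream("src/test/serialized/termQuery.2.9.bin"));
    Query old = (Query) in.readObject();
    in.close();

    // Relies on the query class implementing equals() correctly.
    assertEquals(new TermQuery(new Term("field", "value")), old);
  }
}
{code}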


> Implement standard Serialization across Lucene versions
> ---
>
> Key: LUCENE-1473
> URL: https://issues.apache.org/jira/browse/LUCENE-1473
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Search
>Affects Versions: 2.4
>Reporter: Jason Rutherglen
>Priority: Minor
> Attachments: LUCENE-1473.patch, LUCENE-1473.patch, LUCENE-1473.patch
>
>   Original Estimate: 8h
>  Remaining Estimate: 8h
>
> To maintain serialization compatibility between Lucene versions, 
> serialVersionUID needs to be added to classes that implement 
> java.io.Serializable.  java.io.Externalizable may be implemented in classes 
> for faster performance.
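
For reference, a minimal sketch of the two mechanisms the description names -- a pinned serialVersionUID plus a hand-written Externalizable implementation (the class and its fields are hypothetical):

{code}
import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectOutput;

public class ExampleQuery implements Externalizable {
  // Pinning the UID keeps the stream identifier stable as the class evolves.
  private static final long serialVersionUID = 1L;

  private String field;
  private float boost = 1.0f;

  // Externalizable requires a public no-arg constructor.
  public ExampleQuery() {}

  public void writeExternal(ObjectOutput out) throws IOException {
    // Hand-written format: smaller and faster than default serialization.
    out.writeUTF(field);
    out.writeFloat(boost);
  }

  public void readExternal(ObjectInput in) throws IOException {
    field = in.readUTF();
    boost = in.readFloat();
  }
}
{code}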

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1476) BitVector implement DocIdSet

2008-12-05 Thread robert engels (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653954#action_12653954
 ] 

robert engels commented on LUCENE-1476:
---

That's my point: in complex multi-threaded software with multiple readers, etc., 
it is a good backstop against errors. :)

> BitVector implement DocIdSet
> 
>
> Key: LUCENE-1476
> URL: https://issues.apache.org/jira/browse/LUCENE-1476
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.4
>Reporter: Jason Rutherglen
>Priority: Trivial
> Attachments: LUCENE-1476.patch
>
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> BitVector can implement DocIdSet.  This is for making 
> SegmentReader.deletedDocs pluggable.
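
A rough sketch of the adapter the summary implies, written against the 2.4-era DocIdSet/DocIdSetIterator API and assuming BitVector's get()/size() methods are accessible (the wrapper class itself is hypothetical):

{code}
import java.io.IOException;

import org.apache.lucene.search.DocIdSet;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.BitVector;

class BitVectorDocIdSet extends DocIdSet {
  private final BitVector bits;

  BitVectorDocIdSet(BitVector bits) {
    this.bits = bits;
  }

  public DocIdSetIterator iterator() {
    return new DocIdSetIterator() {
      private int doc = -1;

      public int doc() {
        return doc;
      }

      public boolean next() throws IOException {
        // Advance to the next set bit, if any.
        for (doc++; doc < bits.size(); doc++) {
          if (bits.get(doc)) {
            return true;
          }
        }
        return false;
      }

      public boolean skipTo(int target) throws IOException {
        doc = target - 1;
        return next();
      }
    };
  }
}
{code}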

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1476) BitVector implement DocIdSet

2008-12-05 Thread Marvin Humphrey (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653959#action_12653959
 ] 

Marvin Humphrey commented on LUCENE-1476:
-

> It would be great if instead of relying on Lucene to manage the 
> deletedDocs file, the API would be pluggable

In LUCENE-1478, "IndexComponent" was proposed, with potential subclasses 
including PostingsComponent, LexiconComponent/TermDictComponent, 
TermVectorsComponent, and so on.  Since then, it has become apparent that 
SnapshotComponent and DeletionsComponent also belong at the top level.

In Lucy/KS, these would all be specified within a Schema: 

{code}
class MySchema extends Schema {
  DeletionsComponent deletionsComponent() {
    return new DocIdBitSetDeletionsComponent();
  }

  void initFields() {
    addField("title", "text");
    addField("content", "text");
  }

  Analyzer analyzer() {
    return new PolyAnalyzer("en");
  }
}
{code}

Mike, you were planning on managing IndexComponents via IndexReader and 
IndexWriter constructor args, but won't that get unwieldy if there are too many 
components?  A Schema class allows you to group them together.  You don't have 
to use it to manage fields the way KS does -- just leave that out.

> BitVector implement DocIdSet
> 
>
> Key: LUCENE-1476
> URL: https://issues.apache.org/jira/browse/LUCENE-1476
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.4
>Reporter: Jason Rutherglen
>Priority: Trivial
> Attachments: LUCENE-1476.patch
>
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> BitVector can implement DocIdSet.  This is for making 
> SegmentReader.deletedDocs pluggable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1473) Implement standard Serialization across Lucene versions

2008-12-05 Thread Doug Cutting (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653972#action_12653972
 ] 

Doug Cutting commented on LUCENE-1473:
--

> I guarantee 2.9 and above classes will be backward compatible with the 
> previous classes.

It sounds like you are personally guaranteeing that all serializable classes 
will be forever compatible.  That's not what we'd need.  We'd need a proposed 
policy for the project to consider in terms of major and minor releases, 
specifying forward and/or backward compatibility guarantees.  For example, we 
might say, "within a major release cycle, serialized queries from older 
releases will work with newer releases, however serialized queries from newer 
releases will not generally work with older releases, since we might add new 
kinds of queries in the course of a major release cycle".  Similarly detailed 
statements would need to be made for each Externalizable, no?

> Implement standard Serialization across Lucene versions
> ---
>
> Key: LUCENE-1473
> URL: https://issues.apache.org/jira/browse/LUCENE-1473
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Search
>Affects Versions: 2.4
>Reporter: Jason Rutherglen
>Priority: Minor
> Attachments: LUCENE-1473.patch, LUCENE-1473.patch, LUCENE-1473.patch
>
>   Original Estimate: 8h
>  Remaining Estimate: 8h
>
> To maintain serialization compatibility between Lucene versions, 
> serialVersionUID needs to be added to classes that implement 
> java.io.Serializable.  java.io.Externalizable may be implemented in classes 
> for faster performance.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-1473) Implement standard Serialization across Lucene versions

2008-12-05 Thread Jason Rutherglen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Rutherglen updated LUCENE-1473:
-

Attachment: LUCENE-1473.patch

LUCENE-1473.patch

Added some more Externalizables.  

o.a.l.util.Parameter is peculiar in that it implements readResolve to override 
the default serialization and return a local object, in order to emulate enums.  
I haven't yet figured out where it is used or what the best approach to 
externalizing it would be.
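
The readResolve idiom in question looks roughly like this -- a generic sketch of the typesafe-enum pattern, not the actual o.a.l.util.Parameter source:

{code}
import java.io.InvalidObjectException;
import java.io.ObjectStreamException;
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

class Param implements Serializable {
  private static final long serialVersionUID = 1L;
  private static final Map INSTANCES = new HashMap();

  private final String name;

  protected Param(String name) {
    this.name = name;
    INSTANCES.put(name, this);
  }

  // Deserialization calls this and discards the freshly-read copy, so '=='
  // comparisons against the locally created singletons keep working.
  protected Object readResolve() throws ObjectStreamException {
    Object instance = INSTANCES.get(name);
    if (instance == null) {
      throw new InvalidObjectException("unknown parameter: " + name);
    }
    return instance;
  }
}
{code}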

TODO:
- Same as before

Doug wrote: ""within a major release cycle, serialized queries from older 
releases will work with newer releases, however serialized queries from newer 
releases will not generally work with older releases, since we might add new 
kinds of queries in the course of a major release cycle". Similarly detailed 
statements would need to be made for each Externalizeable, no?"

Serialized objects in minor releases will work.  Serialized objects from older 
versions, starting with 2.9, will be compatible with newer versions.  Newer 
versions will be compatible with older versions on a class-by-class basis 
defined in the release notes.  It could look something like this:

Serialization notes:
BooleanQuery added a scoreMap variable that does not have a default value in 
3.0 and is now not backwards compatible with 2.9.  
PhraseQuery added an ultraWeight variable that defaults to true in 3.0 and is 
backwards compatible with 2.9.
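
A sketch of why a sensible default keeps such an added field backward compatible, using ObjectInputStream's GetField API (the class and field mirror the hypothetical PhraseQuery/ultraWeight note above):

{code}
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.Serializable;

class ExamplePhraseQuery implements Serializable {
  // Unchanged between releases, so old and new streams interoperate.
  private static final long serialVersionUID = 1L;

  // Field added in "3.0"; streams written by "2.9" simply lack it.
  private boolean ultraWeight = true;

  private void readObject(ObjectInputStream in)
      throws IOException, ClassNotFoundException {
    ObjectInputStream.GetField fields = in.readFields();
    // get() falls back to the supplied default when the stream predates the field.
    ultraWeight = fields.get("ultraWeight", true);
  }
}
{code}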

> Implement standard Serialization across Lucene versions
> ---
>
> Key: LUCENE-1473
> URL: https://issues.apache.org/jira/browse/LUCENE-1473
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Search
>Affects Versions: 2.4
>Reporter: Jason Rutherglen
>Priority: Minor
> Attachments: LUCENE-1473.patch, LUCENE-1473.patch, LUCENE-1473.patch, 
> LUCENE-1473.patch
>
>   Original Estimate: 8h
>  Remaining Estimate: 8h
>
> To maintain serialization compatibility between Lucene versions, 
> serialVersionUID needs to be added to classes that implement 
> java.io.Serializable.  java.io.Externalizable may be implemented in classes 
> for faster performance.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: [jira] Commented: (LUCENE-1473) Implement Externalizable in main top level searcher classes

2008-12-05 Thread Jason Rutherglen
The tests will be for backwards compatibility with previous versions of
Lucene, using the described process of checking serialized objects encoded by
previous versions into the test code base, similar to how CFS index files
are included in the test code tree.

There is an elegance to the RemoteSearcher type of code that allows one to
focus on queries and algorithms and ignore the fact that the search runs
over N machines.

Protocol buffers seem okay.  However, given the way that Lucene allows
customizations in things like SortComparatorSource, I do not see how protocol
buffers can be used with custom Java classes the way Java serialization
works.  If in the future Lucene allows greater customization, such as with
scorers, similarities and queries in Lucene 3.0, then marrying the data with
code in a grid environment using protocol buffers gets ugly.  Protocol
buffers are nice and can be added to a distributed Lucene environment, but
the cost of implementing them vs. serialization is much higher.

Uber distributed search may not be the most common use case for Lucene right
now, but as its capabilities improve, people will try to use Lucene in a
distributed grid environment.  One could conceivably execute arbitrarily
complex coordinated operations over the standard Lucene 3.0 APIs without
tearing down processes and other worries.  Oracle has PL/SQL, and Lucene
effectively uses Java for customized query operations the way Oracle uses
PL/SQL.  It would seem natural to at least support Java as a way to execute
customized queries.  The customized queries would be dynamically loaded Java
objects.

In the marketplace, Lucene seems to be a good place to do realtime
search-based data processing, at least compared to Sphinx and MG4J.

A little further into the future, with SSDs it should be possible to perform
in-place replacement of inverted index data using Lucene (at which point it
is similar to a database), and the ability to execute remote code may be very
useful.  Hopefully the APIs for 3.0 will have a goal of being open enough
for this.


On Fri, Dec 5, 2008 at 2:40 PM, Doug Cutting <[EMAIL PROTECTED]> wrote:

> Jason Rutherglen wrote:
>
>> I think it's best to implement Externalizable as long as someone is
>> willing to maintain it.  I commit to maintaining the Externalizable code.
>>
>
> We need to agree to maintain things as a community, not as individuals.  We
> can't rely on any particular individual being around in the future.
>
>  This will ensure forward compatibility between serialized versions, make
>> the serialized objects smaller, and make serialization faster.
>>
>
> If we want to promise compatibility we need to scope it and test it.  We
> cannot in good faith promise that Query will be serially compatible forever,
> nor should we make any promises that we don't test.  So if you choose to
> continue promoting this route, please specify the scope of compatibility and
> your plans to add tests for it.
>
>  Apparently it matters enough for Hadoop to implement Writable in all
>> over-the-wire classes.
>>
>
> I'm not sure what you're saying here.  As I've said before, Hadoop is
> moving away from Writable because it is too fragile as classes change. As a
> part of the preparations for Hadoop 1.0 we are agreeing on serialization
> back-compatibility requirements and what technology we will use to support
> these.  Hadoop is at its core a distributed system, while Lucene is not.
>  Even then, Hadoop will continue to require that one update all nodes in a
> cluster in a coordinated manner, so only end-user protocols need be
> cross-version compatible, not internal protocols.  I do not yet see a strong
> analogy here.
>
>
> Doug
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


Re: Java logging in Lucene

2008-12-05 Thread Shai Erera
Doug,

bq. It's tempting to add our own logging API, as you suggest, but I fear
that would re-invent what so many have re-invented before.

Haven't we already added our own logging API by introducing infoStream in
IndexWriter?  All I'm proposing (as an alternative to Java logging) is to
make it a service for all Lucene classes, even contrib.  I didn't propose
to add Java logging-like capabilities, like levels (even though I think
they're useful), but instead to take what IW has today (a message() method)
and make a static one for other classes.
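
A minimal sketch of what such a static InfoStream service might look like (the class name and API are hypothetical; it just generalizes IndexWriter's infoStream/message() pair):

{code}
import java.io.PrintStream;

public final class InfoStream {
  private static volatile PrintStream stream; // null means logging is off

  private InfoStream() {}

  public static void setInfoStream(PrintStream ps) {
    stream = ps;
  }

  public static void message(String component, String message) {
    PrintStream ps = stream; // read the field once for thread safety
    if (ps != null) {
      ps.println(component + ": " + message);
    }
  }
}
{code}

Any class could then call, say, InfoStream.message("IW", "flushing segment"), and an application would enable output with InfoStream.setInfoStream(System.err).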

bq. What do we tell folks who currently use both log4j and Lucene?  How
would they benefit from this?

I don't think it's such a big deal. To turn on Lucene logging, they have to
introduce some API (or UI) for users/administrators to configure. They then
probably set infoStream to the stream log4j uses.
By using Java logging, all we'll ask them is to configure the Java logging
system, which is pretty easy.
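
For comparison, a sketch of enabling JUL output programmatically (the logger namespace is hypothetical; the same effect is a couple of lines in a logging.properties file):

{code}
import java.util.logging.ConsoleHandler;
import java.util.logging.Level;
import java.util.logging.Logger;

public class EnableLuceneLogging {
  public static void main(String[] args) {
    // Turn on fine-grained output for everything under org.apache.lucene.
    Logger lucene = Logger.getLogger("org.apache.lucene");
    lucene.setLevel(Level.FINE);

    // ConsoleHandler defaults to INFO, so lower its level too.
    ConsoleHandler handler = new ConsoleHandler();
    handler.setLevel(Level.FINE);
    lucene.addHandler(handler);
  }
}
{code}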

About SLF4J, I'm not familiar with it so I cannot comment. The only thing I
can comment about is the additional jar people would have to add to their
applications. That's really not an issue imo because people already add many
jars to support Lucene. If one uses any contrib package, it's an additional
jar. If one wants to use Snowball, it's 2 jars (the snowball and the contrib
analyzer).
When you use Apache HttpClient, you have to add several jars, which is ok
...

Grant,

What do you have against JUL?  I've used it, and in my company (which is quite
a large one, btw) we've moved to JUL just because it's so easy to configure,
already comes with the JDK, and is very intuitive.  Perhaps it has some
shortcomings which I'm not aware of, and I hope you can point me at them.

The argument on whether to choose JUL, commons, log4j or slf4j can go on; I
don't mind participating in it, as I think it's interesting.  But the core
question is whether the community (and especially the committers) thinks that
we need more logging in Lucene beyond IW's infoStream.  If so, we can start
by introducing that InfoStream service class, which will only expose
today's functionality at first (i.e., only indexing code will use it), but will
allow other classes to log operations as well.

I personally would like to use more standard logging frameworks (and
preferably JUL), but what I want more is the ability to debug my product
after it has been shipped.  So even though it's not as great as standard
logging, the InfoStream service is still better than what Lucene has today.

My 2 cents.

Shai

On Sat, Dec 6, 2008 at 12:32 AM, Jason Rutherglen <
[EMAIL PROTECTED]> wrote:

> As a developer who would like to eventually develop core code in Lucene (I
> started but then things changed drastically and so will wait for flexible
> indexing and other APIs?), a standard logging system would make development
> easier by making debugging easier.  I rely heavily on log analysis in
> developing and debugging search code.  A detailed view of what is happening
> internally will speed development, and as Shai mentioned allow production
> and pre-production systems to be monitored in new ways.
>
> -J
>
>
> On Fri, Dec 5, 2008 at 1:19 PM, Doug Cutting <[EMAIL PROTECTED]> wrote:
>
>> John Wang wrote:
>>
>>> If we were to depend on a jar for logging, then why not log4j or
>>> commons-logging?
>>>
>>
>> Lucene is used by many applications.  Many of those applications already
>> perform some kind of logging.  We'd like whatever Lucene adds to fit in with
>> their existing logging framework, not conflict with it.  Thus the motivation
>> to use a meta-logging framework like commons logging or slf4j.  And articles
>> like the following point towards slf4j, not commons logging.
>>
>> http://www.qos.ch/logging/thinkAgain.jsp
>>
>>
>> Doug
>>
>>
>>
>>
>> -
>> To unsubscribe, e-mail: [EMAIL PROTECTED]
>> For additional commands, e-mail: [EMAIL PROTECTED]
>>
>>
>