date:20061207

Re: Shiny New Logos

2006-12-07 Thread Ronnie Kolehmainen



Doug Cutting wrote:
> But my preference would not be to redesign the Lucene logo, but rather
> just find a nice way to combine it with Duke, perhaps cleaning it up a
> bit.


+1

Out of these 5 (or 6) samples I find lucene2.png (bottom image) most 
attractive.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

[jira] Commented: (LUCENE-737) Provision of encryption/decryption services API to support Field.Store.Encrypted

2006-12-07 Thread Hoss Man (JIRA)

[ 
http://issues.apache.org/jira/browse/LUCENE-737?page=comments#action_12456687 ] 

Hoss Man commented on LUCENE-737:
-

for future ref: this issue was discussed at length in the original
email thread...

http://www.nabble.com/Attached-proposed-modifications-to-Lucene-2.0-to-support-Field.Store.Encrypted-tf2727614.html#a7607415


> Provision of encryption/decryption services API to support Field.Store.Encrypted
> --------------------------------------------------------------------------------
>
> Key: LUCENE-737
> URL: http://issues.apache.org/jira/browse/LUCENE-737
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Index, Search, Store
> Affects Versions: 2.0.0, 2.0.1
> Reporter: victor negrin
> Attachments: bcprov-jdk14-133.jar, BouncyCastle Licence & disclaimer, LuceneEncryptionMods.zip
>
> Attached are proposed modifications to Lucene 2.0 to support 
> Field.Store.Encrypted.
> The rationale behind this proposal is simple. Since Lucene can store data in
> the index, it effectively makes the data portable. It is conceivable that
> some of the data may be sensitive in nature, hence the option to encrypt it.
> Both the data and its index are encrypted in this implementation.
> This is only an initial implementation. It has the following
> restrictions, all of which can be resolved if required, albeit with some
> effort and more changes to Lucene:
> 1) binary and compressed fields cannot be encrypted as well (a plaintext once
> encrypted becomes binary).
> 2) Field.Store.Encrypted implies Field.Store.Yes
> This makes sense but it forces one to store the data in the same index where
> the tokens are stored. It may be preferable at times to have two indexes, one
> for tokens, the other for the data.
> 3) As implemented, it uses RC4 encryption from BouncyCastle. This is an open 
> source package, very simple to use which has the advantage of guaranteeing 
> that the length of the encrypted field is the same as the original plaintext. 
> As of Java 1.5 (5.0) Sun provides an RC4 equivalent in its Java Cryptography 
> Extension, but unfortunately not in Java 1.4.
> The BouncyCastle RC4 is not the only algorithm available, others not
> depending on third party code can be used, but it was just the simplest to
> implement for this first attempt.
> 4) The attachments are modifications in diff form based on an early (I think
> August or September '06) repository snapshot of Lucene 2.0 subsequently
> updated from the Lucene repository on 29/11/06. They may need some additional
> work to merge with the latest version in the Lucene repository. They also
> include a couple of JUnit test programs which explain, as well as test, the
> usage. You will need the BouncyCastle .jar (bcprov-jdk14-134.jar) to run
> them. I did not attach it to minimize the size of the attachments, but it
> can be downloaded free from:
>  http://www.bouncycastle.org/latest_releases.html
>  
> 5) Searching an encrypted field is restricted to single terms, no phrase or 
> boolean searches allowed yet, and the term has to be encrypted by the 
> application before searching it. (ref. attached JUnit test programs)
> To the extent that I have tested it, the code works as intended and does not 
> appear to introduce any regression problems, but more testing by others would 
> be desirable.
> I don't propose at this stage to do any further work on these API extensions
> unless there is some expression of interest and direction from the Lucene
> Developers team. I have an application ready to roll which uses the proposed 
> Lucene encryption API additions (please see 
> http://www.kbforge.com/index.html). The application is not yet available for 
> downloading simply because I am not sure if the Lucene licence allows me to 
> do so. I would appreciate your advice in this regard. My application is free 
> but its source code is not available (yet). I should add that encryption does 
> not have to be an integral part of Lucene, it can be just part of the end 
> application, but somehow it seems to me that Field.Store.Encrypted belongs in 
> the same category as compression and binary values.
> I would be happy to receive your feedback.
> victor negrin 
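
As a rough illustration of the length-preserving property mentioned in point 3 above, here is a minimal sketch that uses the JDK's own JCE stream cipher ("ARCFOUR", the RC4-compatible algorithm) rather than BouncyCastle. The class and method names are hypothetical and are not taken from the attached patch. RC4 appears only because the proposal uses it; it is generally considered weak today.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;

/** Hypothetical helper for illustration; not part of Lucene or the attached patch. */
final class StreamCipherFieldDemo {

    // "ARCFOUR" is the JCE standard name for the RC4-compatible stream cipher.
    private static final String ALGORITHM = "ARCFOUR";

    // Runs the cipher in the given mode; a stream cipher emits one byte per input byte.
    private static byte[] apply(int mode, byte[] key, byte[] input) {
        try {
            Cipher cipher = Cipher.getInstance(ALGORITHM);
            cipher.init(mode, new SecretKeySpec(key, ALGORITHM));
            return cipher.doFinal(input);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("stream cipher unavailable or key rejected", e);
        }
    }

    /** Encrypts a field value; the result has as many bytes as the UTF-8 plaintext. */
    static byte[] encrypt(byte[] key, String plaintext) {
        return apply(Cipher.ENCRYPT_MODE, key, plaintext.getBytes(StandardCharsets.UTF_8));
    }

    /** Decrypts a value produced by encrypt() with the same key. */
    static String decrypt(byte[] key, byte[] ciphertext) {
        return new String(apply(Cipher.DECRYPT_MODE, key, ciphertext), StandardCharsets.UTF_8);
    }
}
```

Because the ciphertext is arbitrary binary data, it would have to be stored as a binary value or re-encoded as text, which is the interaction with binary fields that restriction 1 refers to.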

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




Spliting the Lucene

2006-12-07 Thread howard chen


Hi,

A friend from Hadoop told me that someone on this list has code for splitting
a Lucene index. Can anyone point me to the right place?

Thanks.


[jira] Commented: (LUCENE-732) Support DateTools in QueryParser

2006-12-07 Thread Michael Busch (JIRA)

[ 
http://issues.apache.org/jira/browse/LUCENE-732?page=comments#action_12456662 ] 

Michael Busch commented on LUCENE-732:
--

You're right Hoss, the word "format" is used ambiguously in the javadoc. We 
could change it to

 * In {@link RangeQuery}s, QueryParser tries to detect date values, e.g. date:[6/1/2005 TO 6/4/2005]
 * produces a range query that searches for "date" fields between 2005-06-01 and 2005-06-04. Note
 * that the format of the accepted input depends on {@link #setLocale(Locale) the locale}.
 * By default a date is converted into a search term using the deprecated {@link DateField} for compatibility reasons.
 * To use the new {@link DateTools} to convert dates, a {@link DateTools.Resolution} has to be set.
 * The date resolution that shall be used for RangeQueries can be set using {@link #setDateResolution(DateTools.Resolution)}
 * or {@link #setDateResolution(String, DateTools.Resolution)}. The former sets the default date resolution for all fields, whereas the latter can
 * be used to set field specific date resolutions. Field specific date resolutions take, if set, precedence over the default date resolution.
 *
 * If you use neither {@link DateField} nor {@link DateTools} in your index, you can create your own
 * query parser that inherits QueryParser and overwrites {@link #getRangeQuery(String, String, String, boolean)} to
 * use a different method for date conversion.
 *

Sounds better? Do you want me to create another patch that includes this 
javadoc?

> Support DateTools in QueryParser
> --------------------------------
>
> Key: LUCENE-732
> URL: http://issues.apache.org/jira/browse/LUCENE-732
> Project: Lucene - Java
> Issue Type: Improvement
> Components: QueryParser
> Reporter: Michael Busch
> Assigned To: Michael Busch
> Priority: Minor
> Attachments: queryparser_datetools.patch, queryparser_datetools2.patch
>
> The QueryParser currently uses the deprecated class DateField to create 
> RangeQueries with date values. However, most users probably use DateTools to 
> store date values in their indexes, because this is the recommended way since 
> DateField has been deprecated. In that case RangeQueries with date values 
> produced by the QueryParser won't work with those indexes.
> This patch replaces the use of DateField in QueryParser by DateTools. Because 
> DateTools can produce date values with different resolutions, this patch adds 
> the following methods to QueryParser:
>   /**
>    * Sets the default date resolution used by RangeQueries for fields for which no
>    * specific date resolution has been set. Field specific resolutions can be set
>    * with {@link #setDateResolution(String, DateTools.Resolution)}.
>    *
>    * @param dateResolution the default date resolution to set
>    */
>   public void setDateResolution(DateTools.Resolution dateResolution);
>
>   /**
>    * Sets the date resolution used by RangeQueries for a specific field.
>    *
>    * @param fieldName field for which the date resolution is to be set
>    * @param dateResolution date resolution to set
>    */
>   public void setDateResolution(String fieldName, DateTools.Resolution dateResolution);
> (I also added the corresponding getter methods).
> Now the user can set a default date resolution used for all fields or, with 
> the second method, field specific date resolutions.
> The initial default resolution, which is used if the user does not set a 
> different resolution, is DateTools.Resolution.DAY. 
> Please let me know if you think we should use a different resolution as 
> default.
> I extended TestQueryParser to test this new feature.
> All unit tests pass.
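
The precedence rule described above (a field-specific resolution, if set, wins over the default) can be sketched without Lucene on the classpath. Everything below is a hypothetical stand-in: the Resolution enum, its date patterns and the toTerm helper are assumptions made for illustration, not the code in the patch.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical model of default vs. field-specific date resolutions. */
final class DateResolutionDemo {

    /** Assumed subset of resolutions; each one truncates the date to a coarser term. */
    enum Resolution {
        YEAR("yyyy"), MONTH("yyyyMM"), DAY("yyyyMMdd");

        final DateTimeFormatter formatter;

        Resolution(String pattern) {
            this.formatter = DateTimeFormatter.ofPattern(pattern);
        }
    }

    // The issue states that DAY is the initial default resolution.
    private Resolution defaultResolution = Resolution.DAY;
    private final Map<String, Resolution> fieldResolutions = new HashMap<>();

    /** Sets the default resolution used for fields without a specific setting. */
    void setDateResolution(Resolution resolution) {
        defaultResolution = resolution;
    }

    /** Sets the resolution for one field; it takes precedence over the default. */
    void setDateResolution(String fieldName, Resolution resolution) {
        fieldResolutions.put(fieldName, resolution);
    }

    /** Returns the field-specific resolution if one was set, else the default. */
    Resolution getDateResolution(String fieldName) {
        return fieldResolutions.getOrDefault(fieldName, defaultResolution);
    }

    /** Converts a date into the term text a range query on this field would use. */
    String toTerm(String fieldName, LocalDate date) {
        return getDateResolution(fieldName).formatter.format(date);
    }
}
```

With this model, a field indexed at month granularity and a field left at the default would produce differently truncated terms for the same input date, which is why the resolution used at query time has to match the one used at index time.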


Re: Shiny New Logos

2006-12-07 Thread Daniel John Debrunner


Doug Cutting wrote:
> Daniel John Debrunner wrote:
> > Can anyone use the Duke logo like this? I thought it was a trademark
> > of Sun Microsystems.
> >
> > http://logos.sun.com/logosite.jsp?Category=third&Logo=duke-button
>
> Sun has open-sourced Duke:
>
> https://duke.dev.java.net/


Very cool, thanks Doug.

Dan.



Re: Shiny New Logos

2006-12-07 Thread Doug Cutting


Daniel John Debrunner wrote:
> Can anyone use the Duke logo like this? I thought it was a trademark of
> Sun Microsystems.
>
> http://logos.sun.com/logosite.jsp?Category=third&Logo=duke-button


Sun has open-sourced Duke:

https://duke.dev.java.net/

Doug


Re: Shiny New Logos

2006-12-07 Thread Doug Cutting


Grant Ingersoll wrote:
> So, if people want to vote for one (just reply with the name of the one
> you like), I will be happy to incorporate the consensus into the
> website.


Of these, my favorite is #2 (lower image), since the typeface is most 
similar to the existing Lucene logo.


But my preference would not be to redesign the Lucene logo, but rather 
just find a nice way to combine it with Duke, perhaps cleaning it up a 
bit.  Someone once found a typeface that's really similar to the 
existing Lucene logo (which was hand-drawn).


http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200412.mbox/[EMAIL PROTECTED]

The font is ITC Magneto Bold Extended.

http://www.itcfonts.com/Fonts/detail.htm?pid=409331

And the old green could be grabbed from the old image.

Doug


[jira] Commented: (LUCENE-739) Performance improvement for SegmentMerger.mergeNorms()

2006-12-07 Thread Michael Busch (JIRA)

[ 
http://issues.apache.org/jira/browse/LUCENE-739?page=comments#action_12456654 ] 

Michael Busch commented on LUCENE-739:
--

Thanks Yonik! Well then, let's commit it? ;-)

> Performance improvement for SegmentMerger.mergeNorms()
> ------------------------------------------------------
>
> Key: LUCENE-739
> URL: http://issues.apache.org/jira/browse/LUCENE-739
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Reporter: Michael Busch
> Assigned To: Michael Busch
> Priority: Minor
> Attachments: mergeNorms.patch
>
> This patch makes two improvements to SegmentMerger.mergeNorms():
> 1) When the SegmentMerger merges the norm values it allocates a new byte 
> array to buffer the values for every field of every segment. The size of such 
> an array equals the size of the corresponding segment, so if large segments 
> are being merged, those arrays become very large, too.
> We can easily reduce the number of array allocations by reusing a single byte
> array to buffer the norm values; it only grows if a segment is larger than
> the previous one.
> 2) Before a norm value is written, we check whether the corresponding document
> is deleted. If not, the norm is written using IndexOutput.writeByte(byte).
> This patch introduces an optimized case for segments that do not have any 
> deleted docs. In this case the frequent call of IndexReader.isDeleted(int) 
> can be avoided and the more efficient method IndexOutput.writeBytes(byte[], 
> int) can be used.
> This patch only changes the method SegmentMerger.mergeNorms(). All unit tests 
> pass.


Re: Shiny New Logos

2006-12-07 Thread Daniel John Debrunner


Grant Ingersoll wrote:
> Marcello Prattico (http://www.astrochimp.com) has graciously agreed to
> create some Lucene logos that incorporate Duke into the logo so that we
> can distinguish Lucene Java from Lucene TLP (Top-Level Project).  The
> Lucene TLP logo (a.k.a the Green cursive Lucene that we all know and
> love) will appear in the upper right of the website (replacing the
> feather), and the new logo will be in the center where the current green
> logo is now located.
>
> There are 5 total images labeled lucene?.png that can be viewed at
> http://people.apache.org/~gsingers/images/   The tar file in the
> directory contains all the images in both psd and png formats, plus the
> Duke originals used.


Can anyone use the Duke logo like this? I thought it was a trademark of 
Sun Microsystems.


http://logos.sun.com/logosite.jsp?Category=third&Logo=duke-button

Dan.



Re: Shiny New Logos

2006-12-07 Thread Otis Gospodnetic

Nice, thanks to Marcello if he's reading.
I like 1.png and 2.png (bottom logo) and 3.png.  I would prefer the old green 
colour, so we don't lose that bit of branding and history.
The logo in 3.png makes me think of Ghostbusters. :)  It might be interesting 
to see that circle turned into a magnifying glass a la any other search icon, 
and maybe using the original cursive font.

I believe Doug has the EPS version of the logo.

Otis

----- Original Message -----
From: Grant Ingersoll <[EMAIL PROTECTED]>
To: java-dev@lucene.apache.org
Sent: Thursday, December 7, 2006 5:43:13 PM
Subject: Shiny New Logos


Marcello Prattico (http://www.astrochimp.com) has graciously agreed  
to create some Lucene logos that incorporate Duke into the logo so  
that we can distinguish Lucene Java from Lucene TLP (Top-Level  
Project).  The Lucene TLP logo (a.k.a the Green cursive Lucene that  
we all know and love) will appear in the upper right of the website  
(replacing the feather), and the new logo will be in the center where  
the current green logo is now located.

There are 5 total images labeled lucene?.png that can be viewed at  
http://people.apache.org/~gsingers/images/   The tar file in the  
directory contains all the images in both psd and png formats, plus  
the Duke originals used.

So, if people want to vote for one (just reply with the name of the
one you like), I will be happy to incorporate the consensus into the
website.  Of course, since this is Open Source, others are free to
submit their own versions or provide other suggestions.

Personally, I vote for http://people.apache.org/~gsingers/images/lucene3.png
but I also like http://people.apache.org/~gsingers/images/lucene1.png
as well.

-Grant

--
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org

Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ




Re: non-overlapping Span queries

2006-12-07 Thread Ruslan Sivak


Paul Elschot wrote:
> On Thursday 07 December 2006 22:57, Ruslan Sivak wrote:
> > I see back in Jul 2005 there was a thread about SpanNearQueries which
> > were overlapping.
> >
> > http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200507.mbox/[EMAIL PROTECTED]
> >
> > A fix was posted by Paul Elschot at that time.  Did this fix ever make
> > it into 2.0?  I'm having problems with SpanNearQueries matching the
> > same thing again, example
> >
> > I'm searching for ((Brooklyn) near (Brooklyn near NY slop 0) slop 10)
> > and it matches Brooklyn, NY  I want it to not match unless the phrase is
> > something like
> >
> > Brooklyn High which is in Brooklyn, NY
>
> That requires a minimum distance between the matches of the
> subqueries, and that is not yet implemented.
>
> The previous fix is in the trunk:
> http://issues.apache.org/jira/browse/LUCENE-569
>
> You can try the svn head, or a nightly build instead of 2.0:
> http://cvs.apache.org/dist/lucene/java/nightly/
> but a minimum span distance facility is not in there afaik.
>
> Regards,
> Paul Elschot

I tried the latest nightly build, but it still doesn't help.  BTW I was
hitting the same error before as the one mentioned in the original
thread, so perhaps the fix is not yet complete?  I was getting an error
when doing this kind of query:

(brooklyn) near (brooklyn) slop 10

Russ




Re: non-overlapping Span queries

2006-12-07 Thread Chris Hostetter


: > Brooklyn High which is in Brooklyn, NY
:
: That requires a minimum distance between the matches of the
: subqueries, and that is not yet implemented.

I was about to suggest that adding that seems like it would be fairly
easy, just add a new "int minDistance" to SpanNearQuery and then use it in
NearSpansOrdered.docSpansOrdered to ensure that "end1 + minDistance <=
start2" and in NearSpansUnordered.atMatch to test that "min.end() +
minDistance <= max.start()" ... but then it occurred to me that the whole
issue isn't that simple when you have a SpanNearQuery with more than two
clauses.

I'm not even sure what a three clause SpanNearQuery with a minDistance of N
would even mean .. is that the min distance between each clause, or
between the outermost?

Paul: you understand Span queries a lot better than i do: if you had a
two clause SpanNear would my suggestion make sense?

we could always add minDistance to SpanNearQuery, but make it private
and only settable from a new constructor that explicitly only takes in two
SpanQuery clauses (instead of an array).
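
For the two-clause case, the two inequalities quoted above can be tried out in isolation. This is only a sketch under stated assumptions: the Span class, the method names and the end-exclusive position convention are hypothetical stand-ins, not the actual NearSpansOrdered or NearSpansUnordered code, and the slop test is simplified to the gap between the two spans.

```java
/** Hypothetical two-clause model of the proposed minDistance check. */
final class MinDistanceDemo {

    /** A match covering token positions [start, end), with end exclusive (assumed). */
    static final class Span {
        final int start;
        final int end;

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /** Ordered case: "end1 + minDistance <= start2", plus a simplified slop test. */
    static boolean orderedMatch(Span first, Span second, int slop, int minDistance) {
        int gap = second.start - first.end;
        return first.end + minDistance <= second.start && gap <= slop;
    }

    /** Unordered case: "min.end() + minDistance <= max.start()" on the spans sorted by start. */
    static boolean unorderedMatch(Span a, Span b, int slop, int minDistance) {
        Span min = (a.start <= b.start) ? a : b;
        Span max = (min == a) ? b : a;
        return orderedMatch(min, max, slop, minDistance);
    }
}
```

Under these assumptions, for "Brooklyn High which is in Brooklyn, NY" the first "Brooklyn" is [0,1) and the phrase "Brooklyn NY" is [5,7), so the pair passes with minDistance 0; pairing the phrase with its own "Brooklyn" at [5,6) gives 6 + 0 <= 5, which fails, so the overlapping match is rejected.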


-Hoss



Re: non-overlapping Span queries

2006-12-07 Thread Paul Elschot

On Thursday 07 December 2006 22:57, Ruslan Sivak wrote:
> I see back in Jul 2005 there was a thread about SpanNearQueries which
> were overlapping.
>
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200507.mbox/[EMAIL PROTECTED]
>
> A fix was posted by Paul Elschot at that time.  Did this fix ever make
> it into 2.0?  I'm having problems with SpanNearQueries matching the
> same thing again, example
>
> I'm searching for ((Brooklyn) near (Brooklyn near NY slop 0) slop 10)
> and it matches Brooklyn, NY  I want it to not match unless the phrase is
> something like
>
> Brooklyn High which is in Brooklyn, NY
 
That requires a minimum distance between the matches of the
subqueries, and that is not yet implemented.

The previous fix is in the trunk:
http://issues.apache.org/jira/browse/LUCENE-569

You can try the svn head, or a nightly build instead of 2.0:
http://cvs.apache.org/dist/lucene/java/nightly/
but a minimum span distance facility is not in there afaik.

Regards,
Paul Elschot


Shiny New Logos

2006-12-07 Thread Grant Ingersoll



Marcello Prattico (http://www.astrochimp.com) has graciously agreed  
to create some Lucene logos that incorporate Duke into the logo so  
that we can distinguish Lucene Java from Lucene TLP (Top-Level  
Project).  The Lucene TLP logo (a.k.a the Green cursive Lucene that  
we all know and love) will appear in the upper right of the website  
(replacing the feather), and the new logo will be in the center where  
the current green logo is now located.


There are 5 total images labeled lucene?.png that can be viewed at  
http://people.apache.org/~gsingers/images/   The tar file in the  
directory contains all the images in both psd and png formats, plus  
the Duke originals used.


So, if people want to vote for one (just reply with the name of the
one you like), I will be happy to incorporate the consensus into the
website.  Of course, since this is Open Source, others are free to
submit their own versions or provide other suggestions.

Personally, I vote for http://people.apache.org/~gsingers/images/lucene3.png
but I also like http://people.apache.org/~gsingers/images/lucene1.png
as well.


-Grant

--
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org

Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ





non-overlapping Span queries

2006-12-07 Thread Ruslan Sivak

I see back in Jul 2005 there was a thread about SpanNearQueries which 
were overlapping.


http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200507.mbox/[EMAIL PROTECTED]

A fix was posted by Paul Elschot at that time.  Did this fix ever make
it into 2.0?  I'm having problems with SpanNearQueries matching the
same thing again. For example:

I'm searching for ((Brooklyn) near (Brooklyn near NY slop 0) slop 10)
and it matches "Brooklyn, NY". I want it to not match unless the phrase is
something like


Brooklyn High which is in Brooklyn, NY

Russ




[jira] Commented: (LUCENE-739) Performance improvement for SegmentMerger.mergeNorms()

2006-12-07 Thread Yonik Seeley (JIRA)

[ 
http://issues.apache.org/jira/browse/LUCENE-739?page=comments#action_12456505 ] 

Yonik Seeley commented on LUCENE-739:
-

+1, looks great Michael!

> Performance improvement for SegmentMerger.mergeNorms()
> ------------------------------------------------------
>
> Key: LUCENE-739
> URL: http://issues.apache.org/jira/browse/LUCENE-739
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Reporter: Michael Busch
> Assigned To: Michael Busch
> Priority: Minor
> Attachments: mergeNorms.patch
>
> This patch makes two improvements to SegmentMerger.mergeNorms():
> 1) When the SegmentMerger merges the norm values it allocates a new byte 
> array to buffer the values for every field of every segment. The size of such 
> an array equals the size of the corresponding segment, so if large segments 
> are being merged, those arrays become very large, too.
> We can easily reduce the number of array allocations by reusing a single byte
> array to buffer the norm values; it only grows if a segment is larger than
> the previous one.
> 2) Before a norm value is written, we check whether the corresponding document
> is deleted. If not, the norm is written using IndexOutput.writeByte(byte).
> This patch introduces an optimized case for segments that do not have any 
> deleted docs. In this case the frequent call of IndexReader.isDeleted(int) 
> can be avoided and the more efficient method IndexOutput.writeBytes(byte[], 
> int) can be used.
> This patch only changes the method SegmentMerger.mergeNorms(). All unit tests 
> pass.


[jira] Updated: (LUCENE-739) Performance improvement for SegmentMerger.mergeNorms()

2006-12-07 Thread Michael Busch (JIRA)

 [ http://issues.apache.org/jira/browse/LUCENE-739?page=all ]

Michael Busch updated LUCENE-739:
-

Attachment: mergeNorms.patch

> Performance improvement for SegmentMerger.mergeNorms()
> ------------------------------------------------------
>
> Key: LUCENE-739
> URL: http://issues.apache.org/jira/browse/LUCENE-739
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Reporter: Michael Busch
> Assigned To: Michael Busch
> Priority: Minor
> Attachments: mergeNorms.patch
>
> This patch makes two improvements to SegmentMerger.mergeNorms():
> 1) When the SegmentMerger merges the norm values it allocates a new byte 
> array to buffer the values for every field of every segment. The size of such 
> an array equals the size of the corresponding segment, so if large segments 
> are being merged, those arrays become very large, too.
> We can easily reduce the number of array allocations by reusing a single byte
> array to buffer the norm values; it only grows if a segment is larger than
> the previous one.
> 2) Before a norm value is written, we check whether the corresponding document
> is deleted. If not, the norm is written using IndexOutput.writeByte(byte).
> This patch introduces an optimized case for segments that do not have any 
> deleted docs. In this case the frequent call of IndexReader.isDeleted(int) 
> can be avoided and the more efficient method IndexOutput.writeBytes(byte[], 
> int) can be used.
> This patch only changes the method SegmentMerger.mergeNorms(). All unit tests 
> pass.


[jira] Created: (LUCENE-739) Performance improvement for SegmentMerger.mergeNorms()

2006-12-07 Thread Michael Busch (JIRA)

Performance improvement for SegmentMerger.mergeNorms()
------------------------------------------------------

Key: LUCENE-739
URL: http://issues.apache.org/jira/browse/LUCENE-739
Project: Lucene - Java
Issue Type: Improvement
Components: Index
Reporter: Michael Busch
Assigned To: Michael Busch
Priority: Minor
Attachments: mergeNorms.patch

This patch makes two improvements to SegmentMerger.mergeNorms():

1) When the SegmentMerger merges the norm values it allocates a new byte array 
to buffer the values for every field of every segment. The size of such an 
array equals the size of the corresponding segment, so if large segments are 
being merged, those arrays become very large, too.
We can easily reduce the number of array allocations by reusing a single byte
array to buffer the norm values; it only grows if a segment is larger than the
previous one.

2) Before a norm value is written, we check whether the corresponding document
is deleted. If not, the norm is written using IndexOutput.writeByte(byte).
This patch introduces an optimized case for segments that do not have any 
deleted docs. In this case the frequent call of IndexReader.isDeleted(int) can 
be avoided and the more efficient method IndexOutput.writeBytes(byte[], int) 
can be used.


This patch only changes the method SegmentMerger.mergeNorms(). All unit tests 
pass.
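
The two improvements described above can be sketched roughly as follows. This is not the attached mergeNorms.patch: the class is hypothetical, a plain byte array stands in for a segment's norms, a BitSet stands in for the deleted-docs information, and a ByteArrayOutputStream stands in for IndexOutput.

```java
import java.io.ByteArrayOutputStream;
import java.util.BitSet;

/** Hypothetical stand-in for the norm-merging loop; no Lucene classes involved. */
final class NormMergeDemo {

    // One shared buffer, reallocated only when a segment is larger than any seen so far.
    private byte[] buffer = new byte[0];

    /** Number of times the buffer had to be (re)allocated; exposed for illustration. */
    int allocations = 0;

    /**
     * Appends the norms of one segment to the output, skipping deleted documents.
     * A null or empty deleted set means the segment has no deletions.
     */
    void mergeSegment(byte[] segmentNorms, BitSet deleted, ByteArrayOutputStream out) {
        int maxDoc = segmentNorms.length;

        // Improvement 1: reuse the buffer and grow it only when needed.
        if (buffer.length < maxDoc) {
            buffer = new byte[maxDoc];
            allocations++;
        }
        // Stands in for reading the segment's norms into the shared buffer.
        System.arraycopy(segmentNorms, 0, buffer, 0, maxDoc);

        if (deleted == null || deleted.isEmpty()) {
            // Improvement 2: no deletions, so write the whole range in one bulk call.
            out.write(buffer, 0, maxDoc);
        } else {
            // Slow path: check every document and write surviving norms one byte at a time.
            for (int doc = 0; doc < maxDoc; doc++) {
                if (!deleted.get(doc)) {
                    out.write(buffer[doc]);
                }
            }
        }
    }
}
```

Merging segments of 3, 5 and 2 documents in that order allocates the buffer twice instead of three times; the saving grows with the number of fields and segments, since the original code allocated one array per field per segment.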


[jira] Updated: (LUCENE-738) read/write .del as d-gaps when the deleted bit vector is sufficiently sparse

2006-12-07 Thread Doron Cohen (JIRA)

 [ http://issues.apache.org/jira/browse/LUCENE-738?page=all ]

Doron Cohen updated LUCENE-738:
---

Attachment: FileFormatDoc.patch.txt

FileFormat document updated to reflect this format change.

> read/write .del as d-gaps when the deleted bit vector is sufficiently sparse
> ----------------------------------------------------------------------------
>
> Key: LUCENE-738
> URL: http://issues.apache.org/jira/browse/LUCENE-738
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Store
> Affects Versions: 2.1
> Reporter: Doron Cohen
> Assigned To: Doron Cohen
> Attachments: del.dgap.patch.txt, FileFormatDoc.patch.txt
>
> The .del file of a segment maintains info on deleted documents in that segment.
> The file exists only for segments that have deleted docs, so it does not exist
> for newly created segments (e.g. those resulting from a merge). Each time an
> index reader that deleted any document is closed, the .del file is rewritten. In
> fact, since the lock-less commits change, a new (generation of the) .del file is
> created on each such occasion.
> For small indexes there is no real problem with the current situation. But for
> very large indexes, each time such an index reader is closed, creating such a
> new bit vector seems like unnecessary overhead in cases where the bit vector
> is sparse (just a few docs were deleted). For instance, for an index with a
> segment of 1M docs, the sequence: {open reader; delete 1 doc from that
> segment; close reader;} would write a file of ~128KB. Repeat this sequence 8
> times: 8 new files with a total size of 1MB are written to disk.
> Whether this is a bottleneck or not depends on the application's delete
> pattern, but for the case where deleted docs are sparse, writing just the
> d-gaps would save space and time.
> I have this (simple) change to BitVector running and am currently trying some
> performance tests to convince myself of its worthiness.
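
To make the d-gap idea concrete, here is a simplified, bit-level sketch: instead of writing one bit per document, only the distances between consecutive deleted document numbers are stored. The class name and the int[] representation are assumptions for illustration; the actual on-disk layout in del.dgap.patch.txt may well differ (for example by working on bytes and using variable-length integers).

```java
import java.util.BitSet;

/** Hypothetical bit-level d-gap codec; not the BitVector change from the patch. */
final class DGapDemo {

    /** Encodes the set bits as gaps: the first entry is the first doc number itself. */
    static int[] encode(BitSet deleted) {
        int[] gaps = new int[deleted.cardinality()];
        int previous = 0;
        int i = 0;
        for (int doc = deleted.nextSetBit(0); doc >= 0; doc = deleted.nextSetBit(doc + 1)) {
            gaps[i++] = doc - previous;
            previous = doc;
        }
        return gaps;
    }

    /** Rebuilds the bit set by accumulating the gaps back into doc numbers. */
    static BitSet decode(int[] gaps) {
        BitSet deleted = new BitSet();
        int doc = 0;
        for (int gap : gaps) {
            doc += gap;
            deleted.set(doc);
        }
        return deleted;
    }
}
```

For a segment of 1M docs with three deletions, this representation holds three small integers rather than a bit vector of roughly a million bits, which is where the saving for sparse deletions comes from. For dense deletions the plain bit vector is smaller, hence the "sufficiently sparse" condition in the issue title.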

