Re: Continuous Integration for Lucene

2006-12-20 Thread Chris Hostetter

: Now Parabuild will re-build Lucene whenever new changes are committed
: to the repository and send a message to the dev list if new changes
: break the build. Here is the URL:
:
: http://parabuild.viewtier.com:8080/parabuild/index.htm?displaygroupid=5

Alex: thanks for setting this up.

Just so you know, there are currently two other automated builds of Lucene
that send email in the event of a build failure. One is the "official"
nightly build script...

  http://svn.apache.org/viewvc/lucene/java/nightly/
  http://www.nabble.com/Lucene-nightly-build-failure-tf2833950.html#a7911948

...the other is the "GUMP" system...

  http://vmgump.apache.org/gump/public/lucene-java/lucene-java/index.html
  
http://www.nabble.com/-GUMP%40vmgump-%3A-Project-lucene-java-%28in-module-lucene-java%29-failed-tf2817974.html#a7865339

One thing you might want to watch out for is that your system doesn't
seem to run the unit tests, which is an important part of verifying that a
build was "successful". (I notice this only because the trunk was actually
broken recently, yet the logs available on your system indicate that the
build succeeded on that day.)




-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-753) Use NIO positional read to avoid synchronization in FSIndexInput

2006-12-20 Thread Bogdan Ghidireac (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-753?page=comments#action_12459868 ] 

Bogdan Ghidireac commented on LUCENE-753:
-

You can find a NIO variation of IndexInput attached to this issue: 
http://issues.apache.org/jira/browse/LUCENE-519

I had good results on multiprocessor machines under heavy load.
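For reference, the core NIO technique under discussion here, positional reads via FileChannel.read(ByteBuffer, long), lets multiple threads read the same file at explicit offsets without synchronizing on a shared file pointer. A minimal self-contained sketch (illustrative only, not the attached Lucene code):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class PositionalReadSketch {
    public static void main(String[] args) throws Exception {
        Path tmp = Files.createTempFile("pread", ".bin");
        Files.write(tmp, "hello world".getBytes(StandardCharsets.US_ASCII));
        try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.READ)) {
            // Each thread reads at its own explicit offset; the channel's
            // position is never touched, so no synchronization is needed.
            Thread t1 = new Thread(() -> System.out.println(readAt(ch, 0, 5)));
            Thread t2 = new Thread(() -> System.out.println(readAt(ch, 6, 5)));
            t1.start(); t2.start(); t1.join(); t2.join();
        } finally {
            Files.deleteIfExists(tmp);
        }
    }

    static String readAt(FileChannel ch, long pos, int len) {
        ByteBuffer buf = ByteBuffer.allocate(len);
        try {
            while (buf.hasRemaining()) {
                // Positional read: thread-safe, does not move the channel position.
                if (ch.read(buf, pos + buf.position()) < 0) break;
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        return new String(buf.array(), 0, buf.position(), StandardCharsets.US_ASCII);
    }
}
```

This is the contrast with the plain FSIndexInput, which must synchronize a seek-then-read pair on the underlying file.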

Regards,
Bogdan

> Use NIO positional read to avoid synchronization in FSIndexInput
> 
>
> Key: LUCENE-753
> URL: http://issues.apache.org/jira/browse/LUCENE-753
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Store
>Reporter: Yonik Seeley
> Attachments: FSIndexInput.patch, FSIndexInput.patch
>
>
> As suggested by Doug, we could use NIO pread to avoid synchronization on the 
> underlying file.
> This could mitigate any MT performance drop caused by reducing the number of 
> files in the index format.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira






Payloads

2006-12-20 Thread Michael Busch

Hi all,

currently it is not possible to add generic payloads to a posting list.
However, this feature would be useful for various use cases. Some examples:

- XML search
 to index XML documents and allow structured search (e.g. XPath) it is
necessary to store the depth of each term

- part-of-speech
 payloads can be used to store the part of speech of a term occurrence
- term boost
 for terms that occur e.g. in bold font a payload containing a boost 
value can be stored

- ...

The payloads feature has been requested and discussed a couple of times,
e.g. in

- http://www.gossamer-threads.com/lists/lucene/java-dev/29465
- http://www.gossamer-threads.com/lists/lucene/java-dev/37409

In the latter thread I proposed a design a couple of months ago that
adds to Lucene the ability to store variable-length payloads inline
in the posting list of a term. However, that design had some drawbacks:
the already complex field API was extended, and the payload encoding was
not optimal in terms of disk space. Furthermore, overall Lucene runtime
performance suffered due to the growth of the .prx file. In the
meantime the patch LUCENE-687 (Lazy skipping on proximity file) was
committed, which reduces the number of reads and seeks on the .prx file
and so minimizes the performance degradation of a bigger .prx file. Also,
LUCENE-695 (Improve BufferedIndexInput.readBytes() performance) was
committed, which speeds up reading mid-size chunks of bytes; this is
beneficial for payloads that are bigger than just a few bytes.


Some weeks ago I started working on an improved design which I would 
like to propose now. The new design simplifies the API extensions (the 
Field API remains unchanged) and uses less disk space in most use cases. 
Now there are only two classes that get new methods:

- Token.setPayload()
 Use this method to add arbitrary metadata to a Token in the form of a 
byte[] array.


- TermPositions.getPayload()
 Use this method to retrieve the payload of a term occurrence.

The implementation is very flexible: the user does not have to enable
payloads explicitly for a field and can add payloads to all, some, or no
Tokens. Thanks to the improved encoding, these use cases are handled
efficiently in terms of disk space.
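To make the proposed API concrete, here is a hypothetical usage sketch. The Token.setPayload()/getPayload() names come from the proposal above, but the Payload and Token classes below are minimal stand-ins written only so the example is self-contained; they are not the actual patch:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Minimal stand-in for the proposed index.Payload: a byte[] slice.
class Payload {
    final byte[] data;
    final int offset, length;
    Payload(byte[] data, int offset, int length) {
        this.data = data; this.offset = offset; this.length = length;
    }
    byte[] toByteArray() { return Arrays.copyOfRange(data, offset, offset + length); }
}

// Minimal stand-in for a Token carrying the proposed payload methods.
class Token {
    final String term;
    private Payload payload;
    Token(String term) { this.term = term; }
    void setPayload(Payload p) { this.payload = p; } // proposed Token.setPayload()
    Payload getPayload() { return payload; }         // proposed accessor
}

public class PayloadSketch {
    public static void main(String[] args) {
        // One shared buffer holds all payloads of a document (as the proposal
        // suggests), avoiding a fresh byte[] allocation per token.
        byte[] buf = "NNVB".getBytes(StandardCharsets.US_ASCII);
        Token t1 = new Token("quick");
        t1.setPayload(new Payload(buf, 0, 2)); // part-of-speech tag "NN"
        Token t2 = new Token("jumps");
        t2.setPayload(new Payload(buf, 2, 2)); // part-of-speech tag "VB"
        System.out.println(t1.term + "=" + new String(t1.getPayload().toByteArray(), StandardCharsets.US_ASCII));
        System.out.println(t2.term + "=" + new String(t2.getPayload().toByteArray(), StandardCharsets.US_ASCII));
    }
}
```

In a real analyzer chain, a TokenFilter would attach such payloads during indexing, and TermPositions.getPayload() would read them back at search time.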


Another thing I would like to point out is that this feature is 
backwards compatible, meaning that the file format only changes if the 
user explicitly adds payloads to the index. If no payloads are used, all 
data structures remain unchanged.


I'm going to open a new JIRA issue soon containing the patch and details 
about implementation and file format changes.


One more comment: it is a rather big patch and this is the initial
version, so I'm sure there will be a lot of discussion. I would like to
encourage people who consider this feature useful to try it out and
give me feedback about possible improvements.


Best regards,
- Michael





Re: Payloads

2006-12-20 Thread Grant Ingersoll

Hi Michael,

Have a look at https://issues.apache.org/jira/browse/LUCENE-662

I am planning on starting on this soon (I know, I have been saying  
that for a while, but I really am.)  At any rate, another set of eyes  
would be good and I would be interested in hearing how your version  
compares/works with this patch from Nicolas.


-Grant



--
Grant Ingersoll
http://www.grantingersoll.com/






[jira] Commented: (LUCENE-753) Use NIO positional read to avoid synchronization in FSIndexInput

2006-12-20 Thread Yonik Seeley (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-753?page=comments#action_12459967 ] 

Yonik Seeley commented on LUCENE-753:
-

Thanks for the pointer, Bogdan; it's interesting that you use transferTo instead of 
read... is there any advantage to this?  You still need to create a new object 
every read(), but at least it looks like a smaller object.

It's also been pointed out to me that 
http://issues.apache.org/jira/browse/LUCENE-414 has some more NIO code.








Re: Payloads

2006-12-20 Thread Nicolas Lalevée
On Wednesday, December 20, 2006 at 15:31, Grant Ingersoll wrote:
> Hi Michael,
>
> Have a look at https://issues.apache.org/jira/browse/LUCENE-662
>
> I am planning on starting on this soon (I know, I have been saying
> that for a while, but I really am.)  At any rate, another set of eyes
> would be good and I would be interested in hearing how your version
> compares/works with this patch from Nicolas.

In fact the work I have done is more about the storing part of Lucene than
the indexing part. But I think that the mechanism of defining an
"IndexFormat" in Java, which I introduced in my patch, will be useful in
defining how the payload should be read and written.

About my patch: it needs to be synchronized with the current trunk. I will
update it soon; it just needs some cleanup.

Nicolas


-- 
Nicolas LALEVÉE
Solutions & Technologies
ANYWARE TECHNOLOGIES
Tel : +33 (0)5 61 00 52 90
Fax : +33 (0)5 61 00 51 46
http://www.anyware-tech.com




[jira] Commented: (LUCENE-753) Use NIO positional read to avoid synchronization in FSIndexInput

2006-12-20 Thread Bogdan Ghidireac (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-753?page=comments#action_12459971 ] 

Bogdan Ghidireac commented on LUCENE-753:
-

The Javadoc says that transferTo can be more efficient because the OS can 
transfer bytes directly from the filesystem cache to the target channel without 
actually copying them. 
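A minimal self-contained illustration of that FileChannel.transferTo behavior (illustrative only, not the attached patch): bytes move from the file to a target channel without passing through a user-space byte[] in application code, which is what enables the kernel-side fast path the Javadoc describes.

```java
import java.io.ByteArrayOutputStream;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class TransferToSketch {
    public static void main(String[] args) throws Exception {
        Path tmp = Files.createTempFile("xfer", ".bin");
        Files.write(tmp, "payload bytes".getBytes(StandardCharsets.US_ASCII));
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.READ);
             WritableByteChannel target = Channels.newChannel(sink)) {
            long pos = 0, remaining = ch.size();
            // transferTo may move fewer bytes than requested, so loop.
            while (remaining > 0) {
                long n = ch.transferTo(pos, remaining, target);
                pos += n;
                remaining -= n;
            }
        } finally {
            Files.deleteIfExists(tmp);
        }
        System.out.println(sink.toString("US-ASCII"));
    }
}
```

Note the zero-copy optimization mainly pays off for file-to-socket or file-to-file transfers; for an in-memory sink like the one above the JVM still copies, so this sketch only demonstrates the API shape.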








[jira] Commented: (LUCENE-754) FieldCache keeps hard references to readers, doesn't prevent multiple threads from creating same instance

2006-12-20 Thread Otis Gospodnetic (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-754?page=comments#action_12459987 ] 

Otis Gospodnetic commented on LUCENE-754:
-

Since I was the one who first whined about this leak, I'm just following up to 
report that this change indeed eliminated the leak.
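The general technique behind this kind of fix, caching per-reader data without holding a hard reference that pins the reader in memory, can be sketched with a WeakHashMap. This is a simplified stand-in to show the idea, not the actual FieldCache patch:

```java
import java.util.Map;
import java.util.WeakHashMap;

public class WeakCacheSketch {
    // Stand-in for an IndexReader used as a cache key.
    static class Reader {}

    public static void main(String[] args) throws Exception {
        // WeakHashMap holds its keys weakly: once no strong reference to a
        // key remains, the entry becomes eligible for removal after GC.
        Map<Reader, int[]> cache = new WeakHashMap<>();
        Reader reader = new Reader();
        cache.put(reader, new int[] {1, 2, 3});
        System.out.println("cached=" + cache.size());

        reader = null; // drop the only strong reference to the key
        // GC timing is nondeterministic, so poll until the entry is cleared.
        for (int i = 0; i < 50 && !cache.isEmpty(); i++) {
            System.gc();
            Thread.sleep(10);
        }
        System.out.println("after gc=" + cache.size());
    }
}
```

With a plain HashMap the cached arrays would keep every closed reader reachable forever, which is exactly the leak reported on this issue.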

> FieldCache keeps hard references to readers, doesn't prevent multiple threads 
> from creating same instance
> -
>
> Key: LUCENE-754
> URL: http://issues.apache.org/jira/browse/LUCENE-754
> Project: Lucene - Java
>  Issue Type: Bug
>Reporter: Yonik Seeley
> Assigned To: Yonik Seeley
> Attachments: FieldCache.patch
>
>








Re: Continuous Integration for Lucene

2006-12-20 Thread Doug Cutting

Chris Hostetter wrote:

One thing you might want to watch out for is that your system doesn't
seem to run the unit tests, which is an important part of verifying that a
build was "successful". (I notice this only because the trunk was actually
broken recently, yet the logs available on your system indicate that the
build succeeded on that day.)


It would also be nice if the downloadable result was the full tar.gz 
file, not simply the jar file.  The 'nightly' build target runs both 
unit tests and builds the distribution.  Ideally one would invoke it 
with a descriptive version name, e.g.:


ant -Dversion=trunk-${revision} nightly

where ${revision} is the subversion revision number.

Doug




Re: Payloads

2006-12-20 Thread Doug Cutting

Michael Busch wrote:
> Some weeks ago I started working on an improved design which I would
> like to propose now. The new design simplifies the API extensions (the
> Field API remains unchanged) and uses less disk space in most use cases.
> Now there are only two classes that get new methods:
> - Token.setPayload()
>  Use this method to add arbitrary metadata to a Token in the form of a
> byte[] array.
>
> - TermPositions.getPayload()
>  Use this method to retrieve the payload of a term occurrence.

Michael,

This sounds like very good work.  The back-compatibility of this 
approach is great.  But we should also consider this in the broader 
context of index-format flexibility.


Three general approaches have been proposed.  They are not exclusive.

1. Make the index format extensible by adding user-implementable reader 
and writer interfaces for postings.


2. Add a richer set of standard index formats, including things like 
compressed fields, no-positions, per-position weights, etc.


3. Provide hooks for including arbitrary binary data.

Your proposal is of type (3); LUCENE-662 is of type (1).  Approaches of type 
(2) are most friendly to non-Java implementations, since the semantics 
of each variation are well-defined.


I don't see a reason not to pursue all three, but in a coordinated 
manner.  In particular, we don't want to add a feature of type (3) that 
would make it harder to add type (1) APIs.  It would thus be best if we 
had a rough specification of type (1) and type (2).  A proposal of type 
(2) is at:


http://wiki.apache.org/jakarta-lucene/FlexibleIndexing

But I'm not sure that we yet have any proposed designs for an extensible 
posting API.  (Is anyone aware of one?)  This payload proposal can 
probably be easily incorporated into such a design, but I would have 
more confidence if we had one.  I guess I should attempt one!


Here's a very rough, sketchy, first draft of a type (1) proposal.

IndexWriter#setPostingFormat(PostingFormat)
IndexWriter#setDictionaryFormat(DictionaryFormat)

interface PostingFormat {
  PostingInverter getInverter(FieldInfo, Segment, Directory);
  PostingReader getReader(FieldInfo, Segment, Directory);
  PostingWriter getWriter(FieldInfo, Segment, Directory);
}

interface PostingPointer {} ???

interface DictionaryFormat {
  DictionaryWriter getWriter(FieldInfo, Segment, Directory);
  DictionaryReader getReader(FieldInfo, Segment, Directory);
}

IndexWriter#addDocument(Document doc)
  loop over doc.fields
call PostingFormat#getPostingInverter(FieldInfo, Segment, Directory)
  to create a PostingInverter
if field is analyzed
  call Analyzer#tokenStream() to get TokenStream
  loop over tokens
PostingInverter#collectToken(Token, Field);
else
  PostingInverter#collectToken(Field);

  call DictionaryFormat#getWriter(FieldInfo, Segment, Directory)
to create a DictionaryWriter
  Iterator terms = PostingInverter#getTerms();
  loop over terms
PostingPointer p = PostingInverter#getPointer();
PostingInverter#write(term);
DictionaryWriter#addTerm(term, p);

IndexMerger#mergePostings()
  call DictionaryFormat#getReader(FieldInfo, Segment, Directory)
to create a DictionaryReader
  loop over fields
call PostingFormat#getWriter(FieldInfo, Segment, Directory)
  to create a PostingWriter
loop over segments
  call PostingFormat#getReader(FieldInfo, Segment, Directory)
to create a PostingReader
  loop over dictionary.terms
PostingPointer p = PostingWriter#getPointer();
DictionaryWriter#addTerm(Term, p);
loop over docs
  int doc = PostingReader#readPostings();
  PostingWriter#writePostings(doc);

So the question is, does something like this conflict with your 
proposal?  Should Term and/or Token be extensible?  If so, what should 
their interfaces look like?


Doug




[jira] Created: (LUCENE-755) Payloads

2006-12-20 Thread Michael Busch (JIRA)
Payloads


 Key: LUCENE-755
 URL: http://issues.apache.org/jira/browse/LUCENE-755
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Reporter: Michael Busch
 Assigned To: Michael Busch


This patch adds the possibility to store arbitrary metadata (payloads) together 
with each position of a term in its posting lists. A while ago this was 
discussed on the dev mailing list, where I proposed an initial design. This 
patch has a much improved design with modifications that make this new feature 
easier to use and more efficient.

A payload is an array of bytes that can be stored inline in the ProxFile 
(.prx). Therefore this patch provides low-level APIs to simply store and 
retrieve byte arrays in the posting lists in an efficient way. 

API and Usage
--   
The new class index.Payload is basically just a wrapper around a byte[] array 
together with int variables for offset and length. So a user does not have to 
create a byte array for every payload, but can rather allocate one array for 
all payloads of a document and provide offset and length information. This 
reduces object allocations on the application side.

In order to store payloads in the posting lists one has to provide a 
TokenStream or TokenFilter that produces Tokens with payloads. I added the 
following two methods to the Token class:
  /** Sets this Token's payload. */
  public void setPayload(Payload payload);
  
  /** Returns this Token's payload. */
  public Payload getPayload();

In order to retrieve the data from the index the interface TermPositions now 
offers two new methods:
  /** Returns the payload length of the current term position.
   *  This is invalid until {@link #nextPosition()} is called for
   *  the first time.
   * 
   * @return length of the current payload in number of bytes
   */
  int getPayloadLength();
  
  /** Returns the payload data of the current term position.
   * This is invalid until {@link #nextPosition()} is called for
   * the first time.
   * This method must not be called more than once after each call
   * of {@link #nextPosition()}. However, payloads are loaded lazily,
   * so if the payload data for the current position is not needed,
   * this method may not be called at all for performance reasons.
   * 
   * @param data the array into which the data of this payload is to be
   * stored, if it is big enough; otherwise, a new byte[] array
   * is allocated for this purpose. 
   * @param offset the offset in the array into which the data of this payload
   *   is to be stored.
   * @return a byte[] array containing the data of this payload
   * @throws IOException
   */
  byte[] getPayload(byte[] data, int offset) throws IOException;

Furthermore, this patch introduces the new method IndexOutput.writeBytes(byte[] 
b, int offset, int length). So far there was only a writeBytes() method without 
an offset argument. 

Implementation details
--
- One field bit in FieldInfos is used to indicate whether payloads are enabled 
for a field. The user does not have to enable payloads for a field; this is done 
automatically:
   * The DocumentWriter enables payloads for a field if one or more Tokens 
carry payloads.
   * The SegmentMerger enables payloads for a field during a merge if payloads 
are enabled for that field in one or more segments.
- Backwards compatible: If payloads are not used, then the formats of the 
ProxFile and FreqFile don't change
- Payloads are stored inline in the posting list of a term in the ProxFile. A 
payload of a term occurrence is stored right after its PositionDelta.
- Same-length compression: If payloads are enabled for a field, then the 
PositionDelta is shifted one bit. The lowest bit is used to indicate whether 
the length of the following payload is stored explicitly. If not, i.e. the bit 
is false, then the payload has the same length as the payload of the previous 
term occurrence.
- In order to support skipping on the ProxFile the length of the payload at 
every skip point has to be known. Therefore the payload length is also stored 
in the skip list located in the FreqFile. Here the same-length compression is 
also used: The lowest bit of DocSkip is used to indicate if the payload length 
is stored for a SkipDatum or if the length is the same as in the last SkipDatum.
- Payloads are loaded lazily. When a user calls TermPositions.nextPosition() 
then only the position and the payload length is loaded from the ProxFile. If 
the user calls getPayload() then the payload is actually loaded. If 
getPayload() is not called before nextPosition() is called again, then the 
payload data is just skipped.
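The same-length compression trick described above can be sketched in isolation. This is a simplified model (plain ints in a list instead of Lucene's VInt-encoded ProxFile, and hypothetical helper names), showing only the bit manipulation:

```java
import java.util.ArrayList;
import java.util.List;

public class PayloadLengthCompressionSketch {
    // Encode one position plus its payload length. The position delta is
    // shifted left one bit; the low bit says whether an explicit length
    // record follows (true only when the length differs from the last one).
    static void encode(List<Integer> out, int positionDelta, int payloadLength, int lastLength) {
        boolean lengthChanged = payloadLength != lastLength;
        out.add((positionDelta << 1) | (lengthChanged ? 1 : 0));
        if (lengthChanged) out.add(payloadLength);
    }

    public static void main(String[] args) {
        List<Integer> stream = new ArrayList<>();
        int lastLength = -1; // nothing written yet, so the first length is always explicit
        int[][] postings = {{3, 4}, {2, 4}, {7, 4}, {1, 6}}; // {positionDelta, payloadLength}
        for (int[] p : postings) {
            encode(stream, p[0], p[1], lastLength);
            lastLength = p[1];
        }
        // Runs of equal payload lengths cost a single flag bit per position
        // instead of a full length value per position.
        System.out.println(stream);
    }
}
```

For the four positions above only two length records (4 and 6) are emitted; the two middle positions with an unchanged length of 4 store just their shifted deltas.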
  
Changes of file formats
--
- FieldInfos (.fnm)
The format of the .fnm file does not change. The only change is the use o

[jira] Updated: (LUCENE-755) Payloads

2006-12-20 Thread Michael Busch (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-755?page=all ]

Michael Busch updated LUCENE-755:
-

Attachment: payloads.patch

> Payloads
> 
>
> Key: LUCENE-755
> URL: http://issues.apache.org/jira/browse/LUCENE-755
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Reporter: Michael Busch
> Assigned To: Michael Busch
> Attachments: payloads.patch
>
>
> This patch adds the possibility to store arbitrary metadata (payloads) 
> together with each position of a term in its posting lists. A while ago this 
> was discussed on the dev mailing list, where I proposed an initial design. 
> This patch has a much improved design with modifications, that make this new 
> feature easier to use and more efficient.
> A payload is an array of bytes that can be stored inline in the ProxFile 
> (.prx). Therefore this patch provides low-level APIs to simply store and 
> retrieve byte arrays in the posting lists in an efficient way. 
> API and Usage
> --   
> The new class index.Payload is basically just a wrapper around a byte[] array 
> together with int variables for offset and length. So a user does not have to 
> create a byte array for every payload, but can rather allocate one array for 
> all payloads of a document and provide offset and length information. This 
> reduces object allocations on the application side.
> In order to store payloads in the posting lists one has to provide a 
> TokenStream or TokenFilter that produces Tokens with payloads. I added the 
> following two methods to the Token class:
>   /** Sets this Token's payload. */
>   public void setPayload(Payload payload);
>   
>   /** Returns this Token's payload. */
>   public Payload getPayload();
> In order to retrieve the data from the index the interface TermPositions now 
> offers two new methods:
>   /** Returns the payload length of the current term position.
>*  This is invalid until [EMAIL PROTECTED] #nextPosition()} is called for
>*  the first time.
>* 
>* @return length of the current payload in number of bytes
>*/
>   int getPayloadLength();
>   
>   /** Returns the payload data of the current term position.
>* This is invalid until [EMAIL PROTECTED] #nextPosition()} is called for
>* the first time.
>* This method must not be called more than once after each call
>* of [EMAIL PROTECTED] #nextPosition()}. However, payloads are loaded 
> lazily,
>* so if the payload data for the current position is not needed,
>* this method may not be called at all for performance reasons.
>* 
>* @param data the array into which the data of this payload is to be
>* stored, if it is big enough; otherwise, a new byte[] array
>* is allocated for this purpose. 
>* @param offset the offset in the array into which the data of this payload
>*   is to be stored.
>* @return a byte[] array containing the data of this payload
>* @throws IOException
>*/
>   byte[] getPayload(byte[] data, int offset) throws IOException;
> Furthermore, this patch indroduces the new method 
> IndexOutput.writeBytes(byte[] b, int offset, int length). So far there was 
> only a writeBytes()-method without an offset argument. 
> Implementation details
> --
> - One field bit in FieldInfos is used to indicate if payloads are enabled for 
> a field. The user does not have to enable payloads for a field, this is done 
> automatically:
>* The DocumentWriter enables payloads for a field, if one ore more Tokens 
> carry payloads.
>* The SegmentMerger enables payloads for a field during a merge, if 
> payloads are enabled for that field in one or more segments.
> - Backwards compatible: If payloads are not used, then the formats of the 
> ProxFile and FreqFile don't change
> - Payloads are stored inline in the posting list of a term in the ProxFile. A 
> payload of a term occurrence is stored right after its PositionDelta.
> - Same-length compression: If payloads are enabled for a field, then the 
> PositionDelta is shifted one bit. The lowest bit is used to indicate whether 
> the length of the following payload is stored explicitly. If not, i. e. the 
> bit is false, then the payload has the same length as the payload of the 
> previous term occurrence.
> - In order to support skipping on the ProxFile the length of the payload at 
> every skip point has to be known. Therefore the payload length is also stored 
> in the skip list located in the FreqFile. Here the same-length compression is 
> also used: The lowest bit of DocSkip is used to indicate if the payload 
> length is stored for a SkipDatum or if the length is the same as in the last 
> SkipDatum.
> - Payloads are loaded lazily. When a user calls TermPositions.nextPosition() 
> then only the position and th
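The same-length compression described above can be illustrated with plain ints standing in for Lucene's VInt encoding; this is a sketch of the scheme, not the patch's code:

```java
import java.util.ArrayList;
import java.util.List;

// Illustration of same-length compression: PositionDelta is shifted left one
// bit, and the low bit marks whether the payload length follows explicitly.
// When the bit is 0, the previous occurrence's payload length is reused.
public class SameLengthCompressionSketch {
    /** Encodes (positionDelta, payloadLength) pairs into a flat int stream. */
    static List<Integer> encode(int[] deltas, int[] lengths) {
        List<Integer> out = new ArrayList<Integer>();
        int lastLength = -1;                          // no previous occurrence yet
        for (int i = 0; i < deltas.length; i++) {
            boolean storeLength = lengths[i] != lastLength;
            out.add((deltas[i] << 1) | (storeLength ? 1 : 0));
            if (storeLength) out.add(lengths[i]);     // explicit length only when it changes
            lastLength = lengths[i];
        }
        return out;
    }

    /** Decodes back into {deltas, lengths}. */
    static int[][] decode(List<Integer> in, int count) {
        int[] deltas = new int[count], lengths = new int[count];
        int pos = 0, lastLength = -1;
        for (int i = 0; i < count; i++) {
            int v = in.get(pos++);
            deltas[i] = v >>> 1;
            if ((v & 1) != 0) lastLength = in.get(pos++);
            lengths[i] = lastLength;                  // reuse when the bit was 0
        }
        return new int[][] { deltas, lengths };
    }

    public static void main(String[] args) {
        int[] deltas = { 5, 3, 7 };
        int[] lengths = { 4, 4, 2 };                  // the middle length repeats
        List<Integer> enc = encode(deltas, lengths);
        // 3 shifted deltas + only 2 explicit lengths = 5 stored ints
        if (enc.size() != 5) throw new AssertionError();
        int[][] dec = decode(enc, 3);
        if (dec[0][1] != 3 || dec[1][1] != 4 || dec[1][2] != 2) throw new AssertionError();
        System.out.println("ok");
    }
}
```
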

Re: Payloads

2006-12-20 Thread Michael Busch

Nicolas Lalevée wrote:

Le Mercredi 20 Décembre 2006 15:31, Grant Ingersoll a écrit :
  

Hi Michael,

Have a look at https://issues.apache.org/jira/browse/LUCENE-662

I am planning on starting on this soon (I know, I have been saying
that for a while, but I really am.)  At any rate, another set of eyes
would be good and I would be interested in hearing how your version
compares/works with this patch from Nicolas.



In fact, the work I have done is more about the storing part of Lucene than the 
indexing part. But I think the mechanism I introduced in my patch for defining 
an "IndexFormat" in Java will be useful in defining how the payload should be 
read and written.


About my patch, it needs to be synchronized with the current trunk. I will 
update it soon. It just needs some cleanup.


Nicolas

  


That's right, Nicolas' patch makes the Lucene *store* more flexible, 
whereas my payloads patch extends the *index* data structures.


Nicolas, I'm aware of your patch but haven't looked at it completely 
yet. I think it would be a great thing if our patches worked 
together. And with Doug's suggestions (see his response) we would be on 
the right track toward the flexible indexing format! I would love to work 
together with you to achieve this goal. I will look at your patch more 
closely in the next few days.


- Michael



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Continuous Integration for Lucene

2006-12-20 Thread Alex Pimenov
Doug,

- Original Message - 
From: "Doug Cutting" <[EMAIL PROTECTED]>
To: 
Sent: Wednesday, December 20, 2006 9:14 AM
Subject: Re: Continuous Integration for Lucene
> Chris Hostetter wrote:
> > One thing you might want to watch out for, is that your system doesn't
> > seem to run the unit tests, which is an important part of verifying that a
> > build was "successful" (I notice this only because the trunk was actually
> > broken recently, yet the logs available on your system indicate that the
> > build succeeded on that day)

Yes, Parabuild runs tests now.

> It would also be nice if the downloadable result was the full tar.gz 
> file, not simply the jar file.  The 'nightly' build target runs both 
> unit tests and builds the distribution.  Ideally one would invoke it 
> with a descriptive version name, e.g.:
> 
> ant -Dversion=trunk-${revision} nightly
> 
> Where $revision is the subversion revision number.

Yes, this has been done as well. An integration build now produces 
the core build results and the tar.gz package after every check-in. 

We have also added a daily build that is scheduled to run at 1:00PM 
PST:

http://parabuild.viewtier.com:8080/parabuild/index.htm?displaygroupid=5

Alex Pimenov

> 
> Doug
> 



Re LUCENE-754

2006-12-20 Thread Otis Gospodnetic
Hi,
Am I reading this right?  It sounds like you are saying LUCENE-651 did *not* 
fix the original problem it was supposed to fix, and in addition it introduced 
a bug that LUCENE-754 fixed.

28. LUCENE-754: Fix a problem introduced by LUCENE-651, causing
IndexReaders to hang around forever, in addition to not
fixing the original FieldCache performance problem.
(Chris Hostetter, Yonik Seeley)

Otis







Re: Payloads

2006-12-20 Thread Michael Busch

Doug Cutting wrote:

Michael,

This sounds like very good work.  The back-compatibility of this 
approach is great.  But we should also consider this in the broader 
context of index-format flexibility.


Three general approaches have been proposed.  They are not exclusive.

1. Make the index format extensible by adding user-implementable 
reader and writer interfaces for postings.


2. Add a richer set of standard index formats, including things like 
compressed fields, no-positions, per-position weights, etc.


3. Provide hooks for including arbitrary binary data.

Your proposal is of type (3).  LUCENE-662 is a (1).  Approaches of 
type (2) are most friendly to non-Java implementations, since the 
semantics of each variation are well-defined.


I don't see a reason not to pursue all three, but in a coordinated 
manner.  In particular, we don't want to add a feature of type (3) 
that would make it harder to add type (1) APIs.  It would thus be best 
if we had a rough specification of type (1) and type (2).  A proposal 
of type (2) is at:


http://wiki.apache.org/jakarta-lucene/FlexibleIndexing

But I'm not sure that we yet have any proposed designs for an 
extensible posting API.  (Is anyone aware of one?)  This payload 
proposal can probably be easily incorporated into such a design, but I 
would have more confidence if we had one.  I guess I should attempt one!




Doug,

thanks for your detailed response. I'm aware that the long-term goal is 
the flexible index format and I see the payloads patch only as a part of 
it. The patch focuses on extending the index data structures and on a 
possible payload encoding. It doesn't yet focus on a flexible API; it 
only offers the two mentioned low-level methods to add and retrieve byte 
arrays.


I would love to work with you guys on the flexible index format and to 
combine my patch with your suggestions and the patch from Nicolas! I 
will look at your proposal and Nicolas' patch tomorrow (have to go now). 
I just attached my patch (LUCENE-755), so if you get a chance you could 
take a look at it.


Maybe it would make sense now to follow the suggestion you made earlier 
this year and start a new package to work on the new index format? On 
the other hand, if people would like to use payloads soon, I guess that 
thanks to the backwards compatibility it would be low risk to add this 
feature to the current index format until we can finish the 
flexible format?


- Michael





Re: Payloads

2006-12-20 Thread Doug Cutting

Michael Busch wrote:
the other hand, if people would like to use the payloads soon I guess 
due to the backwards compatibility it would be low risk to add it to the 
current index format to provide this feature until we can finish the 
flexible format?


A reason not to commit something like this now would be if it 
complicates the effort to make the format extensible.  Each index 
feature we add now will require back-compatibility in the future, and we 
should be hesitant to add features that might be difficult to support in 
the future.


For example, this modifies the Token API.  If, long-term, we think that 
Token should be extensible, then perhaps we should make it extensible 
now, and add this through a subclass of Token (perhaps a mixin interface 
that Tokens can implement).
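One hypothetical shape for that mixin idea, with all names invented for illustration (Lucene's Token is replaced by a stub):

```java
// Sketch of a "mixin interface" for payloads: Token itself stays unchanged,
// and subclasses opt into payload support by implementing an interface.
public class TokenMixinSketch {
    /** Stand-in for org.apache.lucene.analysis.Token. */
    static class Token {
        final String term;
        Token(String term) { this.term = term; }
    }

    /** The mixin: tokens that carry a payload implement this. */
    interface PayloadCarrier {
        byte[] getPayload();
    }

    /** A Token subclass opting into payloads via the mixin. */
    static class PayloadToken extends Token implements PayloadCarrier {
        private final byte[] payload;
        PayloadToken(String term, byte[] payload) {
            super(term);
            this.payload = payload;
        }
        public byte[] getPayload() { return payload; }
    }

    /** An indexer would store a payload only when the mixin is present. */
    static byte[] payloadFor(Token t) {
        return (t instanceof PayloadCarrier) ? ((PayloadCarrier) t).getPayload() : null;
    }

    public static void main(String[] args) {
        if (payloadFor(new Token("plain")) != null) throw new AssertionError();
        if (payloadFor(new PayloadToken("rich", new byte[]{7}))[0] != 7) throw new AssertionError();
        System.out.println("ok");
    }
}
```
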


I like the Payload feature, and think it should probably be added.  I 
just want to make sure that we've first thought a bit about its 
future-compatibility.


Doug




[jira] Resolved: (LUCENE-436) [PATCH] TermInfosReader, SegmentTermEnum Out Of Memory Exception

2006-12-20 Thread Otis Gospodnetic (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-436?page=all ]

Otis Gospodnetic resolved LUCENE-436.
-

Resolution: Fixed

Applied and committed the LUCENE-436.patch (is JIRA smart enough not to 
hyperlink this?) - all unit tests still pass.

> [PATCH] TermInfosReader, SegmentTermEnum Out Of Memory Exception
> 
>
> Key: LUCENE-436
> URL: http://issues.apache.org/jira/browse/LUCENE-436
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 1.4
> Environment: Solaris JVM 1.4.1
> Linux JVM 1.4.2/1.5.0
> Windows not tested
>Reporter: kieran
> Assigned To: Otis Gospodnetic
> Attachments: FixedThreadLocal.java, lucene-1.9.1.patch, 
> Lucene-436-TestCase.tar.gz, LUCENE-436.patch, TermInfosReader.java, 
> ThreadLocalTest.java
>
>
> We've been experiencing terrible memory problems on our production search 
> server, running lucene (1.4.3).
> Our live app regularly opens new indexes and, in doing so, releases old 
> IndexReaders for garbage collection.
> But...there appears to be a memory leak in 
> org.apache.lucene.index.TermInfosReader.java.
> Under certain conditions (possibly related to JVM version, although I've 
> personally observed it under both linux JVM 1.4.2_06, and 1.5.0_03, and SUNOS 
> JVM 1.4.1) the ThreadLocal member variable, "enumerators" doesn't get 
> garbage-collected when the TermInfosReader object is gc-ed.
> Looking at the code in TermInfosReader.java, there's no reason why it 
> _shouldn't_ be gc-ed, so I can only presume (and I've seen this suggested 
> elsewhere) that there could be a bug in the garbage collector of some JVMs.
> I've seen this problem briefly discussed; in particular at the following URL:
>   http://java2.5341.com/msg/85821.html
> The patch that Doug recommended, which is included in lucene-1.4.3 doesn't 
> work in our particular circumstances. Doug's patch only clears the 
> ThreadLocal variable for the thread running the finalizer (my knowledge of 
> java breaks down here - I'm not sure which thread actually runs the 
> finalizer). In our situation, the TermInfosReader is (potentially) used by 
> more than one thread, meaning that Doug's patch _doesn't_ allow the affected 
> JVMs to correctly collect garbage.
> So...I've devised a simple patch which, from my observations on linux JVMs 
> 1.4.2_06, and 1.5.0_03, fixes this problem.
> Kieran
> PS Thanks to daniel naber for pointing me to jira/lucene
> @@ -19,6 +19,7 @@
>  import java.io.IOException;
>  import org.apache.lucene.store.Directory;
> +import java.util.Hashtable;
>  /** This stores a monotonically increasing set of <Term, TermInfo> pairs in a
>   * Directory.  Pairs are accessed either by Term or by ordinal position the
> @@ -29,7 +30,7 @@
>private String segment;
>private FieldInfos fieldInfos;
> -  private ThreadLocal enumerators = new ThreadLocal();
> +  private final Hashtable enumeratorsByThread = new Hashtable();
>private SegmentTermEnum origEnum;
>private long size;
> @@ -60,10 +61,10 @@
>}
>private SegmentTermEnum getEnum() {
> -SegmentTermEnum termEnum = (SegmentTermEnum)enumerators.get();
> +SegmentTermEnum termEnum = 
> (SegmentTermEnum)enumeratorsByThread.get(Thread.currentThread());
>  if (termEnum == null) {
>termEnum = terms();
> -  enumerators.set(termEnum);
> +  enumeratorsByThread.put(Thread.currentThread(), termEnum);
>  }
>  return termEnum;
>}
> @@ -195,5 +196,15 @@
>public SegmentTermEnum terms(Term term) throws IOException {
>  get(term);
>  return (SegmentTermEnum)getEnum().clone();
> +  }
> +
> +  /* some jvms might have trouble gc-ing enumeratorsByThread */
> +  protected void finalize() throws Throwable {
> +try {
> +// make sure gc can clear up.
> +enumeratorsByThread.clear();
> +} finally {
> +super.finalize();
> +}
>}
>  }
> TermInfosReader.java, full source:
> ==
> package org.apache.lucene.index;
> /**
>  * Copyright 2004 The Apache Software Foundation
>  *
>  * Licensed under the Apache License, Version 2.0 (the "License");
>  * you may not use this file except in compliance with the License.
>  * You may obtain a copy of the License at
>  *
>  * http://www.apache.org/licenses/LICENSE-2.0
>  *
>  * Unless required by applicable law or agreed to in writing, software
>  * distributed under the License is distributed on an "AS IS" BASIS,
>  * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
>  * See the License for the specific language governing permissions and
>  * limitations under the License.
>  */
> import java.io.IOException;
> import org.apache.lucene.store.Directory;
> import java.util.Hashtable;
> /** This stores a monotonically increasing set of <Term, TermInfo> pairs in a
>  * Directo
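Setting Lucene aside, the pattern the patch substitutes for ThreadLocal can be distilled to a thread-keyed synchronized map with an explicit clear; this is a sketch of that idea, with SegmentTermEnum replaced by a plain Object:

```java
import java.util.Hashtable;

// The patch's pattern, distilled: a synchronized per-thread map instead of a
// ThreadLocal, plus an explicit clear() (the patch calls it from finalize())
// so the cached per-thread objects can always be released.
public class PerThreadCacheSketch {
    private final Hashtable cache = new Hashtable();  // Thread -> cached object

    /** Returns this thread's cached instance, creating it on first use. */
    Object getEnum() {
        Object e = cache.get(Thread.currentThread());
        if (e == null) {
            e = new Object();                         // stand-in for terms()
            cache.put(Thread.currentThread(), e);
        }
        return e;
    }

    /** Equivalent of the patch's finalize() body: drop all per-thread entries. */
    void close() { cache.clear(); }

    public static void main(String[] args) {
        PerThreadCacheSketch r = new PerThreadCacheSketch();
        Object first = r.getEnum();
        if (r.getEnum() != first) throw new AssertionError(); // same thread: cached
        r.close();
        if (r.getEnum() == first) throw new AssertionError(); // cleared: recreated
        System.out.println("ok");
    }
}
```

The raw (pre-generics) Hashtable matches the 1.4-era code in the patch; Hashtable's synchronized methods are what make concurrent access from multiple threads safe here.
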

[jira] Updated: (LUCENE-724) Oracle JVM implementation for Lucene DataStore also a preliminary implementation for an Oracle Domain index using Lucene

2006-12-20 Thread Marcelo F. Ochoa (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-724?page=all ]

Marcelo F. Ochoa updated LUCENE-724:


Attachment: ojvm-12-20-06.tar.gz

This new release of the OJVMDirectory Lucene Store includes a fully functional 
Oracle Domain Index with a queue for massive update/insert operations and many 
performance improvements.
See the db/readmeOJVM.html file for more details.


> Oracle JVM implementation for Lucene DataStore also a preliminary 
> implementation for an Oracle Domain index using Lucene
> 
>
> Key: LUCENE-724
> URL: http://issues.apache.org/jira/browse/LUCENE-724
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Store
>Affects Versions: 2.0.0
> Environment: Oracle 10g R2 with latest patchset, there is a txt file 
> into the lib directory with the required libraries to compile this extension, 
> which for legal issues I can't redistribute. All these libraries are include 
> into the Oracle home directory,
>Reporter: Marcelo F. Ochoa
>Priority: Minor
> Attachments: ojvm-11-28-06.tar.gz, ojvm-12-20-06.tar.gz, ojvm.tar.gz
>
>
> Here is a preliminary implementation of the Oracle JVM Directory data store, 
> which replaces file system storage with BLOB data storage.
> The reason to do this is:
>   - Using a traditional file system for storing the inverted index is not a 
> good option for some users.
>   - Using BLOBs for storing the inverted index while running Lucene outside the 
> Oracle database performs badly, because there are a lot of network 
> round trips and data marshalling.
>   - Indexing relational data stores such as tables with VARCHAR2, CLOB or 
> XMLType with Lucene running outside the database has the same problem as the 
> previous point.
>   - The JVM included inside the Oracle database can scale up to 10,000+ 
> concurrent threads without memory leaks or deadlocks, and all the operations on 
> tables are in the same memory space!!
>   With these points in mind, I uploaded the complete Lucene framework inside 
> the Oracle JVM and ran the complete JUnit test suite successfully, except 
> for some tests, such as the RMI test, which require special grants to open 
> ports inside the database.
>   The Lucene test cases run faster inside the Oracle database (11g) than on 
> the Sun JDK 1.5, because the classes are automatically JITed after a few 
> executions.
>   I have implemented an OJVMDirectory Lucene Store which replaces the file 
> system storage with BLOB-based storage; compared with a RAMDirectory 
> implementation it is a bit slower, but we get all the benefits of BLOB 
> storage (backup, concurrency control, and so on).
>  The OJVMDirectory is cloned from the source at
> http://issues.apache.org/jira/browse/LUCENE-150 (DBDirectory) but with some 
> changes to run faster inside the Oracle JVM.
>  At this moment, I am working in a full integration with the SQL Engine using 
> the Data Cartridge API, it means using Lucene as a new Oracle Domain Index.
>  With this extension we can create a Lucene Inverted index in a table using:
> create index it1 on t1(f2) indextype is LuceneIndex parameters('test');
>  assuming that the table t1 has a column f2 of type VARCHAR2, CLOB or 
> XMLType, after this, the query against the Lucene inverted index can be made 
> using a new Oracle operator:
> select * from t1 where contains(f2, 'Marcelo') = 1;
>  the important point here is that this query is integrated with the execution 
> plan of the Oracle database, so in this simple example the Oracle optimizer 
> sees that the column "f2" is indexed with the Lucene Domain index; then, using 
> the Data Cartridge API, Java code running inside the Oracle JVM is executed 
> to open the search, fetch all the ROWIDs that match "Marcelo", and get 
> the rows using the pointer;
> here the output:
> SELECT STATEMENT                 ALL_ROWS     3   1   115
>   TABLE ACCESS (BY INDEX ROWID)  LUCENE.T1    3   1   115
>     DOMAIN INDEX                 LUCENE.IT1
>  Another benefit of using the Data Cartridge API is that if the table T1 has 
> insert, update or delete operations on rows, a corresponding Java method will be 
> called to automatically update the Lucene index.
>   There is a simple HTML file with some explanation of the code.
>The install.sql script is not fully tested and must be launched inside the 
> Oracle database, not remotely.
>   Best regards, Marcelo.
> - For Oracle users the big question is: Why use Lucene instead of Oracle 
> Text, which is implemented in C?
>   I think the answer is simple: Lucene is open source and anybody 
> can extend it and add the functionality needed
> - For Lucene users which try to use Lucene as enterprise search engine, the 
> Oracle JVM pr

[jira] Resolved: (LUCENE-741) Field norm modifier (CLI tool)

2006-12-20 Thread Otis Gospodnetic (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-741?page=all ]

Otis Gospodnetic resolved LUCENE-741.
-

Resolution: Fixed

Committed.  I'll also remove the old version of this code (+ its unit test), 
the one that still lives in 
contrib/miscellaneous/src/java/org/apache/lucene/misc/ .


> Field norm modifier (CLI tool)
> --
>
> Key: LUCENE-741
> URL: http://issues.apache.org/jira/browse/LUCENE-741
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Reporter: Otis Gospodnetic
> Assigned To: Otis Gospodnetic
>Priority: Minor
> Attachments: LUCENE-741.patch
>
>
> I took Chris' LengthNormModifier (contrib/misc) and modified it slightly, to 
> allow us to set fake norms on existing fields, effectively making it 
> equivalent to Field.Index.NO_NORMS.
> This is related to LUCENE-448 (NO_NORMS patch) and LUCENE-496 
> (LengthNormModifier contrib from Chris).

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira






Lucene nightly build failure

2006-12-20 Thread java-dev

javacc-uptodate-check:

javacc-notice:
 [echo] 
 [echo]   One or more of the JavaCC .jj files is newer than its 
corresponding
 [echo]   .java file.  Run the "javacc" target to regenerate the 
artifacts.
 [echo] 

init:

clover.setup:

clover.info:
 [echo] 
 [echo]   Clover not found. Code coverage reports disabled.
 [echo] 

clover:

common.compile-core:
[mkdir] Created dir: /tmp/lucene-nightly/build/classes/java
[javac] Compiling 204 source files to /tmp/lucene-nightly/build/classes/java
[javac] 
/tmp/lucene-nightly/src/java/org/apache/lucene/index/TermInfosReader.java:67: 
cannot resolve symbol
[javac] symbol  : method remove ()
[javac] location: class java.lang.ThreadLocal
[javac] enumerators.remove();
[javac]^
[javac] Note: 
/tmp/lucene-nightly/src/java/org/apache/lucene/queryParser/QueryParser.java 
uses or overrides a deprecated API.
[javac] Note: Recompile with -deprecation for details.
[javac] 1 error

BUILD FAILED
/tmp/lucene-nightly/common-build.xml:135: The following error occurred while 
executing this line:
/tmp/lucene-nightly/common-build.xml:291: Compile failed; see the compiler 
error output for details.

Total time: 10 seconds
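For context, ThreadLocal.remove() only exists since J2SE 5.0, which is consistent with an older javac reporting "cannot resolve symbol" here; a 1.4-compatible sketch of the cleanup (assuming that is indeed the cause) would use set(null) instead:

```java
// ThreadLocal.remove() was added in J2SE 5.0; on 1.4 the closest equivalent
// is set(null), which releases the referenced value even though the map
// entry itself remains. A sketch, not the actual Lucene fix.
public class ThreadLocalCleanupSketch {
    private static final ThreadLocal enumerators = new ThreadLocal();

    static void open(Object e) { enumerators.set(e); }

    static Object current() { return enumerators.get(); }

    static void close() {
        // enumerators.remove();  // 5.0+ only: fails to compile on 1.4
        enumerators.set(null);    // 1.4-compatible: the value becomes collectable
    }

    public static void main(String[] args) {
        open("per-thread state");
        if (current() == null) throw new AssertionError();
        close();
        if (current() != null) throw new AssertionError();
        System.out.println("ok");
    }
}
```
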






Lucene nightly build failure

2006-12-20 Thread java-dev

javacc-uptodate-check:

javacc-notice:
 [echo] 
 [echo]   One or more of the JavaCC .jj files is newer than its 
corresponding
 [echo]   .java file.  Run the "javacc" target to regenerate the 
artifacts.
 [echo] 

init:

clover.setup:
[mkdir] Created dir: /tmp/lucene-nightly-gsi/build/test/clover/db
[clover-setup] Clover Version 1.3.2, built on November 01 2004
[clover-setup] loaded from: 
/export/home/gsingers/bin/apache-ant-1.6.5/lib/clover.jar
[clover-setup] Open Source License registered to the Apache Software 
Foundation. This license of Clover is provided to support projects of the 
Apache Software Foundation only.

[clover-setup] Clover is enabled with initstring 
'/tmp/lucene-nightly-gsi/build/test/clover/db/lucene_coverage.db'

clover.info:

clover:

common.compile-core:
[mkdir] Created dir: /tmp/lucene-nightly-gsi/build/classes/java
[javac] Compiling 204 source files to 
/tmp/lucene-nightly-gsi/build/classes/java
   [clover] Clover Version 1.3.2, built on November 01 2004
   [clover] loaded from: 
/export/home/gsingers/bin/apache-ant-1.6.5/lib/clover.jar
   [clover] Open Source License registered to the Apache Software Foundation. 
This license of Clover is provided to support projects of the Apache Software 
Foundation only.

   [clover] No coverage database 
'/tmp/lucene-nightly-gsi/build/test/clover/db/lucene_coverage.db' found. 
Creating a fresh one.
   [clover] Clover all over. Instrumented 204 files.
[javac] 
/var/tmp/clover6527.tmp/src6528.tmp/org/apache/lucene/index/TermInfosReader.java:67:
 cannot resolve symbol
[javac] symbol  : method remove ()
[javac] location: class java.lang.ThreadLocal
[javac] }__CLOVER_80_0.cloverRec.S[5312]++;enumerators.remove();
[javac]   ^
[javac] Note: 
/var/tmp/clover6527.tmp/src6528.tmp/org/apache/lucene/queryParser/QueryParser.java
 uses or overrides a deprecated API.
[javac] Note: Recompile with -deprecation for details.
[javac] 1 error

BUILD FAILED
/tmp/lucene-nightly-gsi/common-build.xml:135: The following error occurred 
while executing this line:
/tmp/lucene-nightly-gsi/common-build.xml:291: Compile failed; see the compiler 
error output for details.

Total time: 24 seconds






[jira] Updated: (LUCENE-589) Demo HTML parser doesn't work for international documents

2006-12-20 Thread Grant Ingersoll (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-589?page=all ]

Grant Ingersoll updated LUCENE-589:
---

 Issue Type: Improvement  (was: Bug)
Description: 
Javacc assumes ASCII so it won't work with, say, Japanese documents. Ideally it 
would read the charset from the HTML markup, but that can be tricky. For now 
assuming unicode would do the trick:

Add the following line marked with a + to HTMLParser.jj:

options {
  STATIC = false;
  OPTIMIZE_TOKEN_MANAGER = true;
  //DEBUG_LOOKAHEAD = true;
  //DEBUG_TOKEN_MANAGER = true;
+  UNICODE_INPUT = true;
}


  was:

Javacc assumes ASCII so it won't work with, say, Japanese documents. Ideally it 
would read the charset from the HTML markup, but that can be tricky. For now 
assuming unicode would do the trick:

Add the following line marked with a + to HTMLParser.jj:

options {
  STATIC = false;
  OPTIMIZE_TOKEN_MANAGER = true;
  //DEBUG_LOOKAHEAD = true;
  //DEBUG_TOKEN_MANAGER = true;
+  UNICODE_INPUT = true;
}


   Priority: Minor  (was: Major)

Decreased priority and marked as improvement, since it only affects the demo.  Also, I'm 
not sure we need to support other languages, as this code should not be used in 
production anyway. 
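As an aside, the "read the charset from the HTML markup" approach the issue calls tricky could be sketched naively like this; the helper below is hypothetical and not part of the demo parser:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Naive charset sniffing: scan the document head for a charset declaration
// before handing the stream to the parser. Real sniffing (BOMs, HTTP headers,
// partial reads) is considerably trickier, as the issue notes.
public class CharsetSniffSketch {
    private static final Pattern CHARSET =
        Pattern.compile("charset\\s*=\\s*[\"']?([A-Za-z0-9_.:-]+)",
                        Pattern.CASE_INSENSITIVE);

    /** Returns the declared charset, or the fallback if none is found. */
    static String sniff(String htmlHead, String fallback) {
        Matcher m = CHARSET.matcher(htmlHead);
        return m.find() ? m.group(1) : fallback;
    }

    public static void main(String[] args) {
        String head = "<meta http-equiv=\"Content-Type\" "
                    + "content=\"text/html; charset=Shift_JIS\">";
        if (!sniff(head, "UTF-8").equals("Shift_JIS")) throw new AssertionError();
        if (!sniff("<title>no meta</title>", "UTF-8").equals("UTF-8")) throw new AssertionError();
        System.out.println("ok");
    }
}
```
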

> Demo HTML parser doesn't work for international documents
> -
>
> Key: LUCENE-589
> URL: http://issues.apache.org/jira/browse/LUCENE-589
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Examples
>Affects Versions: 2.0.0
>Reporter: Curtis d'Entremont
>Priority: Minor
>
> Javacc assumes ASCII so it won't work with, say, japanese documents. Ideally 
> it would read the charset from the HTML markup, but that can by tricky. For 
> now assuming unicode would do the trick:
> Add the following line marked with a + to HTMLParser.jj:
> options {
>   STATIC = false;
>   OPTIMIZE_TOKEN_MANAGER = true;
>   //DEBUG_LOOKAHEAD = true;
>   //DEBUG_TOKEN_MANAGER = true;
> +  UNICODE_INPUT = true;
> }




[jira] Created: (LUCENE-756) Maintain norms in a single file .nrm

2006-12-20 Thread Doron Cohen (JIRA)
Maintain norms in a single file .nrm


 Key: LUCENE-756
 URL: http://issues.apache.org/jira/browse/LUCENE-756
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Doron Cohen
Priority: Minor


Non-compound indexes are ~10% faster at indexing and perform about 50% of the 
IO activity of compound indexes, but their file descriptor footprint is much 
higher. 

By maintaining all field norms in a single .nrm file, we can bound the number 
of files used by non-compound indexes, and possibly allow more applications to 
use this format.

More details on the motivation for this in: 
http://www.nabble.com/potential-indexing-perormance-improvement-for-compound-index---cut-IO---have-more-files-though-tf2826909.html
 (in particular 
http://www.nabble.com/Re%3A-potential-indexing-perormance-improvement-for-compound-index---cut-IO---have-more-files-though-p7910403.html).
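One way to picture a single .nrm file is a flat byte array addressed by field and document; the layout below is purely illustrative and not necessarily what the patch implements:

```java
// Hypothetical layout for a single .nrm file holding all field norms:
// one byte per document per normed field, addressed as field * maxDoc + doc.
// One file (one array here) replaces a separate .fN norms file per field.
public class SingleNormsFileSketch {
    private final byte[] norms;   // stands in for the .nrm file contents
    private final int maxDoc;

    SingleNormsFileSketch(int numFields, int maxDoc) {
        this.norms = new byte[numFields * maxDoc];
        this.maxDoc = maxDoc;
    }

    void setNorm(int field, int doc, byte norm) {
        norms[field * maxDoc + doc] = norm;
    }

    byte norm(int field, int doc) {
        return norms[field * maxDoc + doc];
    }

    public static void main(String[] args) {
        // 3 normed fields, 100 docs: one array instead of one file per field
        SingleNormsFileSketch n = new SingleNormsFileSketch(3, 100);
        n.setNorm(2, 42, (byte) 115);
        if (n.norm(2, 42) != 115) throw new AssertionError();
        if (n.norm(0, 42) != 0) throw new AssertionError();
        System.out.println("ok");
    }
}
```
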





[jira] Assigned: (LUCENE-756) Maintain norms in a single file .nrm

2006-12-20 Thread Doron Cohen (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-756?page=all ]

Doron Cohen reassigned LUCENE-756:
--

Assignee: Doron Cohen

> Maintain norms in a single file .nrm
> 
>
> Key: LUCENE-756
> URL: http://issues.apache.org/jira/browse/LUCENE-756
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Doron Cohen
> Assigned To: Doron Cohen
>Priority: Minor
>
> Non-compound indexes are ~10% faster at indexing and perform about 50% of the 
> IO activity of compound indexes, but their file descriptor footprint is much 
> higher. 
> By maintaining all field norms in a single .nrm file, we can bound the number 
> of files used by non-compound indexes, and possibly allow more applications 
> to use this format.
> More details on the motivation for this in: 
> http://www.nabble.com/potential-indexing-perormance-improvement-for-compound-index---cut-IO---have-more-files-though-tf2826909.html
>  (in particular 
> http://www.nabble.com/Re%3A-potential-indexing-perormance-improvement-for-compound-index---cut-IO---have-more-files-though-p7910403.html).




[jira] Updated: (LUCENE-756) Maintain norms in a single file .nrm

2006-12-20 Thread Doron Cohen (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-756?page=all ]

Doron Cohen updated LUCENE-756:
---

Attachment: nrm.patch.txt

Attached patch - nrm.patch.txt - modifies field norms maintenance to use a single 
.nrm file.

The modification is backwards compatible: existing indexes with a file per norm 
are still read; the first merge will create a single .nrm file.

All tests pass.

No performance degradation was observed as a result of this change, but my 
tests so far were not very extensive.


> Maintain norms in a single file .nrm
> 
>
> Key: LUCENE-756
> URL: http://issues.apache.org/jira/browse/LUCENE-756
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Doron Cohen
> Assigned To: Doron Cohen
>Priority: Minor
> Attachments: nrm.patch.txt
>
>
> Non-compound indexes are ~10% faster at indexing and perform about 50% of the 
> IO activity of compound indexes, but their file descriptor footprint is much 
> higher. 
> By maintaining all field norms in a single .nrm file, we can bound the number 
> of files used by non-compound indexes, and possibly allow more applications 
> to use this format.
> More details on the motivation for this in: 
> http://www.nabble.com/potential-indexing-perormance-improvement-for-compound-index---cut-IO---have-more-files-though-tf2826909.html
>  (in particular 
> http://www.nabble.com/Re%3A-potential-indexing-perormance-improvement-for-compound-index---cut-IO---have-more-files-though-p7910403.html).




[jira] Assigned: (LUCENE-493) Nightly build archives do not contain Java source code.

2006-12-20 Thread Grant Ingersoll (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-493?page=all ]

Grant Ingersoll reassigned LUCENE-493:
--

Assignee: Grant Ingersoll

> Nightly build archives do not contain Java source code.
> ---
>
> Key: LUCENE-493
> URL: http://issues.apache.org/jira/browse/LUCENE-493
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Website
>Reporter: James Pine
> Assigned To: Grant Ingersoll
>
> Under the Lucene News section of the Overview page, this item's link:
> 26 January 2006 - Nightly builds available
> http://cvs.apache.org/dist/lucene/java/nightly/
> goes to a directory with several 1.9M files, none of which have the src/java 
> tree in them.




[jira] Created: (LUCENE-757) Source packaging fails if ${dist.dir} does not exist

2006-12-20 Thread Grant Ingersoll (JIRA)
Source packaging fails if ${dist.dir} does not exist


 Key: LUCENE-757
 URL: http://issues.apache.org/jira/browse/LUCENE-757
 Project: Lucene - Java
  Issue Type: Bug
  Components: Other
Reporter: Grant Ingersoll
 Assigned To: Grant Ingersoll
Priority: Minor


package-tgz-src and package-zip-src fail if ${dist.dir} does not exist, since 
these two targets do not call the package target, which is responsible for 
making the dir.

I have a fix and will commit shortly.




[jira] Assigned: (LUCENE-654) GData-Server - Website sandbox part

2006-12-20 Thread Grant Ingersoll (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-654?page=all ]

Grant Ingersoll reassigned LUCENE-654:
--

Assignee: Grant Ingersoll

> GData-Server - Website sandbox part
> ---
>
> Key: LUCENE-654
> URL: http://issues.apache.org/jira/browse/LUCENE-654
> Project: Lucene - Java
>  Issue Type: Wish
>  Components: Website
>Reporter: Simon Willnauer
> Assigned To: Grant Ingersoll
> Attachments: sandbox.diff
>
>
> Added GData-Server to the sandbox part of the website -- xdocs/sandbox/
> Build of website is fine.




[jira] Updated: (LUCENE-756) Maintain norms in a single file .nrm

2006-12-20 Thread Doron Cohen (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-756?page=all ]

Doron Cohen updated LUCENE-756:
---

Lucene Fields: [Patch Available]  (was: [New])

> Maintain norms in a single file .nrm
> 
>
> Key: LUCENE-756
> URL: http://issues.apache.org/jira/browse/LUCENE-756
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Doron Cohen
> Assigned To: Doron Cohen
>Priority: Minor
> Attachments: nrm.patch.txt
>
>
> Non-compound indexes are ~10% faster at indexing and perform 50% of the IO 
> activity compared to compound indexes, but their file-descriptor footprint is 
> much higher. 
> By maintaining all field norms in a single .nrm file, we can bound the number 
> of files used by non-compound indexes, and possibly allow more applications 
> to use this format.
> More details on the motivation for this in: 
> http://www.nabble.com/potential-indexing-perormance-improvement-for-compound-index---cut-IO---have-more-files-though-tf2826909.html
>  (in particular 
> http://www.nabble.com/Re%3A-potential-indexing-perormance-improvement-for-compound-index---cut-IO---have-more-files-though-p7910403.html).




[jira] Commented: (LUCENE-493) Nightly build archives do not contain Java source code.

2006-12-20 Thread Grant Ingersoll (JIRA)
[ 
http://issues.apache.org/jira/browse/LUCENE-493?page=comments#action_12460123 ] 

Grant Ingersoll commented on LUCENE-493:


Scratch that comment on tar...

> Nightly build archives do not contain Java source code.
> ---
>
> Key: LUCENE-493
> URL: http://issues.apache.org/jira/browse/LUCENE-493
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Website
>Reporter: James Pine
> Assigned To: Grant Ingersoll
>
> Under the Lucene News section of the Overview page, this item's link:
> 26 January 2006 - Nightly builds available
> http://cvs.apache.org/dist/lucene/java/nightly/
> goes to a directory with several 1.9M files, none of which have the src/java 
> tree in them.




[jira] Resolved: (LUCENE-654) GData-Server - Website sandbox part

2006-12-20 Thread Grant Ingersoll (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-654?page=all ]

Grant Ingersoll resolved LUCENE-654.


Resolution: Fixed

Committed (with some minor updates to the text).  Should be sync'd on the 
website in 30 mins or so.

> GData-Server - Website sandbox part
> ---
>
> Key: LUCENE-654
> URL: http://issues.apache.org/jira/browse/LUCENE-654
> Project: Lucene - Java
>  Issue Type: Wish
>  Components: Website
>Reporter: Simon Willnauer
> Assigned To: Grant Ingersoll
> Attachments: sandbox.diff
>
>
> Added GData-Server to the sandbox part of the website -- xdocs/sandbox/
> Build of website is fine.




Re: [jira] Resolved: (LUCENE-654) GData-Server - Website sandbox part

2006-12-20 Thread Simon Willnauer

Thank you Grant. :)

On 12/20/06, Grant Ingersoll (JIRA) <[EMAIL PROTECTED]> wrote:

[ http://issues.apache.org/jira/browse/LUCENE-654?page=all ]

Grant Ingersoll resolved LUCENE-654.


   Resolution: Fixed

Committed (with some minor updates to the text).  Should be sync'd on the 
website in 30 mins or so.




[jira] Resolved: (LUCENE-757) Source packaging fails if ${dist.dir} does not exist

2006-12-20 Thread Grant Ingersoll (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-757?page=all ]

Grant Ingersoll resolved LUCENE-757.


Resolution: Fixed

Added an init-dist target and had package and the package-*-src targets call 
it, so the dist dir is always created.
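
The fix can be sketched as a build.xml fragment of this shape (the target 
bodies and archive name here are invented for illustration; only init-dist, 
package, package-*-src, and ${dist.dir} come from the report):

```xml
<!-- Shared setup target: guarantees ${dist.dir} exists before packaging. -->
<target name="init-dist">
  <mkdir dir="${dist.dir}"/>
</target>

<!-- The source-packaging targets now depend on init-dist directly,
     instead of relying on the package target to have created the dir. -->
<target name="package-tgz-src" depends="init-dist">
  <tar destfile="${dist.dir}/lucene-src.tar.gz" basedir="." includes="src/**"/>
</target>
```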

> Source packaging fails if ${dist.dir} does not exist
> 
>
> Key: LUCENE-757
> URL: http://issues.apache.org/jira/browse/LUCENE-757
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Other
>Reporter: Grant Ingersoll
> Assigned To: Grant Ingersoll
>Priority: Minor
>
> package-tgz-src and package-zip-src fail if ${dist.dir} does not exist, since 
> these two targets do not call the package target, which is responsible for 
> making the dir.
> I have a fix and will commit shortly.




[jira] Updated: (LUCENE-493) Nightly build archives do not contain Java source code.

2006-12-20 Thread Grant Ingersoll (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-493?page=all ]

Grant Ingersoll updated LUCENE-493:
---

Priority: Minor  (was: Major)

> Nightly build archives do not contain Java source code.
> ---
>
> Key: LUCENE-493
> URL: http://issues.apache.org/jira/browse/LUCENE-493
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Website
>Reporter: James Pine
> Assigned To: Grant Ingersoll
>Priority: Minor
>
> Under the Lucene News section of the Overview page, this item's link:
> 26 January 2006 - Nightly builds available
> http://cvs.apache.org/dist/lucene/java/nightly/
> goes to a directory with several 1.9M files, none of which have the src/java 
> tree in them.




New Issues

2006-12-20 Thread Grant Ingersoll
+1 for changing the "Create New Issue" screen in JIRA to have a 
default priority of Minor instead of Major.  Methinks a fair number 
of people don't pay attention to the priority, so a lower default 
would be good for those of us scanning issue lists trying to prioritize.


-Grant




[jira] Updated: (LUCENE-756) Maintain norms in a single file .nrm

2006-12-20 Thread Doron Cohen (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-756?page=all ]

Doron Cohen updated LUCENE-756:
---

Component/s: Index

> Maintain norms in a single file .nrm
> 
>
> Key: LUCENE-756
> URL: http://issues.apache.org/jira/browse/LUCENE-756
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Doron Cohen
> Assigned To: Doron Cohen
>Priority: Minor
> Attachments: nrm.patch.txt
>
>
> Non-compound indexes are ~10% faster at indexing and perform 50% of the IO 
> activity compared to compound indexes, but their file-descriptor footprint is 
> much higher. 
> By maintaining all field norms in a single .nrm file, we can bound the number 
> of files used by non-compound indexes, and possibly allow more applications 
> to use this format.
> More details on the motivation for this in: 
> http://www.nabble.com/potential-indexing-perormance-improvement-for-compound-index---cut-IO---have-more-files-though-tf2826909.html
>  (in particular 
> http://www.nabble.com/Re%3A-potential-indexing-perormance-improvement-for-compound-index---cut-IO---have-more-files-though-p7910403.html).




[jira] Updated: (LUCENE-756) Maintain norms in a single file .nrm

2006-12-20 Thread Doron Cohen (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-756?page=all ]

Doron Cohen updated LUCENE-756:
---

Attachment: (was: nrm.patch.txt)

> Maintain norms in a single file .nrm
> 
>
> Key: LUCENE-756
> URL: http://issues.apache.org/jira/browse/LUCENE-756
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Doron Cohen
> Assigned To: Doron Cohen
>Priority: Minor
>
> Non-compound indexes are ~10% faster at indexing and perform 50% of the IO 
> activity compared to compound indexes, but their file-descriptor footprint is 
> much higher. 
> By maintaining all field norms in a single .nrm file, we can bound the number 
> of files used by non-compound indexes, and possibly allow more applications 
> to use this format.
> More details on the motivation for this in: 
> http://www.nabble.com/potential-indexing-perormance-improvement-for-compound-index---cut-IO---have-more-files-though-tf2826909.html
>  (in particular 
> http://www.nabble.com/Re%3A-potential-indexing-perormance-improvement-for-compound-index---cut-IO---have-more-files-though-p7910403.html).




[jira] Updated: (LUCENE-756) Maintain norms in a single file .nrm

2006-12-20 Thread Doron Cohen (JIRA)
 [ http://issues.apache.org/jira/browse/LUCENE-756?page=all ]

Doron Cohen updated LUCENE-756:
---

Attachment: nrm.patch.txt

Replacing the patch file (the previous file was garbage - "svn stat" output 
instead of "svn diff").

A few words on how this patch works: 
- A .nrm file was added.
- addDocument (DocumentWriter) still writes each norm to a separate file - but 
that's in memory.
- At merge, all norms are written to a single file.
- CFS now also maintains all norms in a single file.
- The IndexWriter merge decision now considers hasSeparateNorms() not only for 
CFS but also for non-compound indexes.
- SegmentReader.openNorms() still creates ready-to-load Norm objects (which 
read the norms only when needed), but each Norm object is now assigned a 
normSeek value, which is nonzero if the norm file is .nrm.
- Existing indexes, created prior to this change, are managed the same way as 
segments resulting from addDocument.

Tests:
- I verified that the (contrib) tests for FieldNormModifier and 
LengthNormModifier also pass.

Remaining:
- I might add a test.
- More benchmarking?
- Update the fileFormats document.
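
The normSeek idea above can be sketched roughly as follows (names and values 
are invented for illustration; this is not Lucene's actual code): each field 
has one norm byte per document, all fields' norm arrays are concatenated into 
one file, and a per-field offset records where each field's norms begin.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class NrmSketch {

    // "Write" the single .nrm file: concatenate the per-field norm arrays,
    // recording each field's start offset in normSeek.
    static byte[] writeNrm(Map<String, byte[]> fieldNorms, int numDocs,
                           Map<String, Integer> normSeek) {
        byte[] nrm = new byte[numDocs * fieldNorms.size()];
        int pos = 0;
        for (Map.Entry<String, byte[]> e : fieldNorms.entrySet()) {
            normSeek.put(e.getKey(), pos);
            System.arraycopy(e.getValue(), 0, nrm, pos, numDocs);
            pos += numDocs;
        }
        return nrm;
    }

    // "Read": the norm for (field, doc) lives at normSeek[field] + doc.
    static byte norm(byte[] nrm, Map<String, Integer> normSeek,
                     String field, int doc) {
        return nrm[normSeek.get(field) + doc];
    }

    public static void main(String[] args) {
        Map<String, byte[]> fieldNorms = new LinkedHashMap<String, byte[]>();
        fieldNorms.put("title", new byte[]{10, 11, 12, 13});
        fieldNorms.put("body",  new byte[]{20, 21, 22, 23});

        Map<String, Integer> normSeek = new LinkedHashMap<String, Integer>();
        byte[] nrm = writeNrm(fieldNorms, 4, normSeek);

        // "body" norms start at offset 4, so doc 2 of "body" is at byte 6
        System.out.println(normSeek.get("body"));           // 4
        System.out.println(norm(nrm, normSeek, "body", 2)); // 22
    }
}
```

Because every field's norms live in the one file, only the offsets differ 
between fields, which is why the file count stays fixed no matter how many 
fields carry norms.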

> Maintain norms in a single file .nrm
> 
>
> Key: LUCENE-756
> URL: http://issues.apache.org/jira/browse/LUCENE-756
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Doron Cohen
> Assigned To: Doron Cohen
>Priority: Minor
> Attachments: nrm.patch.txt
>
>
> Non-compound indexes are ~10% faster at indexing and perform 50% of the IO 
> activity compared to compound indexes, but their file-descriptor footprint is 
> much higher. 
> By maintaining all field norms in a single .nrm file, we can bound the number 
> of files used by non-compound indexes, and possibly allow more applications 
> to use this format.
> More details on the motivation for this in: 
> http://www.nabble.com/potential-indexing-perormance-improvement-for-compound-index---cut-IO---have-more-files-though-tf2826909.html
>  (in particular 
> http://www.nabble.com/Re%3A-potential-indexing-perormance-improvement-for-compound-index---cut-IO---have-more-files-though-p7910403.html).




Re: New Issues

2006-12-20 Thread Chris Hostetter


+1

: +1 for changing the "Create New Issue" screen in JIRA to have a
: default priority of Minor instead of Major.  Me thinks a fair number


-Hoss





Re: Re LUCENE-754

2006-12-20 Thread Yonik Seeley

On 12/20/06, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:

Am I reading this right?  It sounds like you are saying LUCENE-651 did *not* 
fix the original problem it was supposed to fix, and in addition it introduced 
a bug that LUCENE-754 fixed.


Correct.  The placeholder was filed under "reader" but looked up under
"key", so multiple threads all asking for the same entry would never
find previous entries and all generate their own.

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server



28. LUCENE-754: Fix a problem introduced by LUCENE-651, causing
IndexReaders to hang around forever, in addition to not
fixing the original FieldCache performance problem.
(Chris Hostetter, Yonik Seeley)
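
The mismatch Yonik describes can be illustrated with a simplified, 
hypothetical sketch (this is not Lucene's actual FieldCache code; class and 
method names are invented): a placeholder stored under one key but looked up 
under another is never found, so every thread rebuilds the value, and the 
placeholder keyed on the reader keeps the reader reachable.

```java
import java.util.HashMap;
import java.util.Map;

public class CacheBugSketch {
    static final Map<Object, Object> cache = new HashMap<Object, Object>();

    static Object getBuggy(Object reader, Object key) {
        synchronized (cache) {
            Object cached = cache.get(key);   // looked up under 'key'...
            if (cached != null) {
                return cached;                // never reached in practice
            }
            cache.put(reader, "PLACEHOLDER"); // ...but filed under 'reader'
        }
        // every caller falls through here and regenerates the value; the
        // placeholder entry keyed on 'reader' is never cleaned up, pinning
        // the IndexReader in memory
        return "value-for-" + key;
    }

    public static void main(String[] args) {
        Object reader = new Object();
        getBuggy(reader, "field1");
        System.out.println(cache.containsKey("field1")); // false: lookups miss
        System.out.println(cache.containsKey(reader));   // true: stale entry
    }
}
```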





Re: potential indexing perormance improvement for compound index - cut IO - have more files though

2006-12-20 Thread Doron Cohen
Doron Cohen wrote:
> Doug Cutting wrote:
> > > Therefore, a "semi compound" segment file can be defined, that would 
> > > be made of 4 files (instead of 1):
> > > - File 0: .fdx .tis .tvx
> > > - File 1: .fdt .tii .tvd
> > > - File 2: .frq .tvf
> > > - File 3: .fnm .prx .fN
> >
> > I think this is a promising direction.  Perhaps instead of adding a
> > third index format, we can significantly improve the non-compound format
> > without too much effort.  For example, simply writing all the norms into
> > a single file could have a large impact on total file handles and would
> > be a rather simple change.  We could start with that, then see if there
> > are further incremental improvements to be had.
>
> We can start with that - at least it would set the number of segment files
> to a fixed number - 11 - whereas currently it depends on the number of 
> fields with norms.

Okay, started with this step - see issue 756
http://issues.apache.org/jira/browse/LUCENE-756

>
> One advantage of keeping a plain non-compound format is educational /
> debugging - it is often helpful to actually see the files being created on
> disk. (Although, just concatenating all norms to a single file is simple
> enough in that regard.)

