Re: Flexible indexing

2007-03-11 Thread Michael Busch

Hi Grant,

I certainly agree that it would be great if we could make some progress 
and commit the payloads patch soon. I think it is quite independent of 
FI. FI will introduce different posting formats (see Wiki: 
http://wiki.apache.org/lucene-java/FlexibleIndexing). Payloads will be 
part of some of those formats, but not all (i.e. per-position payloads 
only make sense if positions are stored).


The only concern some people had was about the API the patch introduces. 
It extends Token and TermPositions. Doug's argument was that if we 
introduce new APIs now but want to change them with FI, then it will be 
hard to support those APIs. I think that is a valid point, but at the 
same time it slows down progress to have to plan ahead in too many 
directions. That's why I'd vote for marking the new APIs as experimental 
so that people can try them out at their own risk.
If we can agree on that approach, I'll go ahead and submit an updated 
payloads patch in the next few days that applies cleanly to the current 
trunk and contains the additional warnings in the javadocs.



Regarding FI and 662, however, I really believe we should split it up 
and plan ahead (in the way I already mentioned), so that we have more 
isolated patches. It is really great that we have 662 already (Nicolas, 
thank you so much for your hard work, I hope you'll keep working with us 
on FI!!). We'll probably use some of that code, and it will definitely 
be helpful.


Michael

Grant Ingersoll wrote:

Hi Michael,

This is very good.  I know 662 is different, I just wasn't sure if 
Nicolas' patch was meant to be applied after 662, b/c I know we had 
discussed this before.


I do agree with you about planning this out, but I also know that 
patches seem to motivate people the best and provide a certain 
concreteness to it all.  I mostly started asking questions on these 
two issues b/c I wanted to spur some more discussion and see if we can 
get people motivated to move on it.


I was hoping that I would be able to apply each patch to two different 
checkouts so I could start seeing where the overlap is and how they 
could fit together (I also admit I was procrastinating on my ApacheCon 
talk...).  In the new, flexible world, the payloads implementation 
could be a separate implementation of the indexing or it could be part 
of the core/existing file format implementation.  Sometimes I just 
need to get my hands on the code to get a real feel for the best way 
to do it.


I agree about the XML storage for index information.  We do that in 
our in-house wrapper around Lucene, storing info about the language, 
analyzer used, etc.  We may also want a binary index-level storage 
capability.  I know most people usually just create a single document 
to store binary info about the index, but a dedicated binary storage 
might be good too.


Part of me says to apply the Payloads patch now, as it provides a lot 
of bang for the buck and I think the FI is going to take a lot longer 
to hash out.  However, I know that it may pin us in or force us to 
change things for FI.  Ultimately, I would love to see both these 
features for the next release, but that isn't a requirement.  Also, on 
FI, I would love to see two different implementations of whatever API 
we choose before releasing it, as I always find two implementations of 
an Interface really work out the API details.


-Grant



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-755) Payloads

2007-03-11 Thread Michael Busch (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Busch updated LUCENE-755:
-

Attachment: payloads.patch

I'm attaching the new patch with the following changes:
- applies cleanly on the current trunk
- fixed a bug in FSDirectory which affected payloads with length greater than 
1024 bytes and extended testcase TestPayloads to test this fix
- added the following warning comments to the new APIs:

  *  Warning: The status of the Payloads feature is experimental. The APIs
  *  introduced here might change in the future and will not be supported
  *  anymore in such a case. If you want to use this feature in a production
  *  environment you should wait for an official release.


Another comment about an API change: In BufferedIndexOutput I changed the 
method 
  protected abstract void flushBuffer(byte[] b, int len) throws IOException;
to
  protected abstract void flushBuffer(byte[] b, int offset, int len)
      throws IOException;

which means that subclasses of BufferedIndexOutput won't compile anymore. I 
made this change for performance reasons: If a payload is longer than 1024 
bytes (standard buffer size of BufferedIndexOutput) then it can be flushed 
efficiently to disk without having to perform array copies. 

Is this API change acceptable? Users who have custom subclasses of 
BufferedIndexOutput would have to update them to the new signature.
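
To illustrate why the offset argument helps, here is a minimal, self-contained sketch. BufferedOutputSketch and InMemoryOutput are hypothetical stand-ins written for this note, not Lucene's BufferedIndexOutput; only the flushBuffer(byte[], int, int) signature is taken from the patch. The idea is that a write larger than the internal buffer can be handed to the subclass directly from the caller's array, so no intermediate copy is needed.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;

/** Hypothetical stand-in for a buffered output; NOT Lucene's BufferedIndexOutput. */
abstract class BufferedOutputSketch {
  static final int BUFFER_SIZE = 1024;
  private final byte[] buffer = new byte[BUFFER_SIZE];
  private int bufferPosition = 0;

  /** Writes length bytes of b, starting at offset. */
  void writeBytes(byte[] b, int offset, int length) throws IOException {
    if (length >= BUFFER_SIZE) {
      // Large write: flush pending bytes, then pass the caller's array
      // straight to the subclass. Nothing is copied into the buffer.
      flush();
      flushBuffer(b, offset, length);
      return;
    }
    if (length > BUFFER_SIZE - bufferPosition) {
      flush(); // make room for the small write
    }
    System.arraycopy(b, offset, buffer, bufferPosition, length);
    bufferPosition += length;
  }

  /** Pushes any buffered bytes down to the subclass. */
  void flush() throws IOException {
    if (bufferPosition > 0) {
      flushBuffer(buffer, 0, bufferPosition);
      bufferPosition = 0;
    }
  }

  /** Signature with an explicit offset, as proposed in the patch. */
  protected abstract void flushBuffer(byte[] b, int offset, int len)
      throws IOException;
}

/** Example subclass that collects the flushed bytes in memory. */
class InMemoryOutput extends BufferedOutputSketch {
  final ByteArrayOutputStream sink = new ByteArrayOutputStream();
  int flushCalls = 0;

  @Override
  protected void flushBuffer(byte[] b, int offset, int len) throws IOException {
    flushCalls++;
    sink.write(b, offset, len);
  }
}
```

With the old two-argument signature, the large write would first have to be copied into an array that starts at index 0. An existing subclass only needs to add the offset parameter and pass it on to its underlying write call.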

> Payloads
> 
>
> Key: LUCENE-755
> URL: https://issues.apache.org/jira/browse/LUCENE-755
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Reporter: Michael Busch
> Assigned To: Michael Busch
> Attachments: payload.patch, payloads.patch, payloads.patch
>
>
> This patch adds the possibility to store arbitrary metadata (payloads) 
> together with each position of a term in its posting lists. A while ago this 
> was discussed on the dev mailing list, where I proposed an initial design. 
> This patch has a much improved design with modifications that make this new 
> feature easier to use and more efficient.
> A payload is an array of bytes that can be stored inline in the ProxFile 
> (.prx). Therefore this patch provides low-level APIs to simply store and 
> retrieve byte arrays in the posting lists in an efficient way. 
> API and Usage
> --   
> The new class index.Payload is basically just a wrapper around a byte[] array 
> together with int variables for offset and length. So a user does not have to 
> create a byte array for every payload, but can rather allocate one array for 
> all payloads of a document and provide offset and length information. This 
> reduces object allocations on the application side.
> In order to store payloads in the posting lists one has to provide a 
> TokenStream or TokenFilter that produces Tokens with payloads. I added the 
> following two methods to the Token class:
>   /** Sets this Token's payload. */
>   public void setPayload(Payload payload);
>   
>   /** Returns this Token's payload. */
>   public Payload getPayload();
> In order to retrieve the data from the index the interface TermPositions now 
> offers two new methods:
>   /** Returns the payload length of the current term position.
>    *  This is invalid until {@link #nextPosition()} is called for
>    *  the first time.
>    *
>    * @return length of the current payload in number of bytes
>    */
>   int getPayloadLength();
>   
>   /** Returns the payload data of the current term position.
>    * This is invalid until {@link #nextPosition()} is called for
>    * the first time.
>    * This method must not be called more than once after each call
>    * of {@link #nextPosition()}. However, payloads are loaded lazily,
>    * so if the payload data for the current position is not needed,
>    * this method need not be called at all, for performance reasons.
>    *
>    * @param data the array into which the data of this payload is to be
>    *             stored, if it is big enough; otherwise, a new byte[] array
>    *             is allocated for this purpose.
>    * @param offset the offset in the array into which the data of this payload
>    *               is to be stored.
>    * @return a byte[] array containing the data of this payload
>    * @throws IOException
>    */
>   byte[] getPayload(byte[] data, int offset) throws IOException;
> Furthermore, this patch introduces the new method 
> IndexOutput.writeBytes(byte[] b, int offset, int length). So far there was 
> only a writeBytes() method without an offset argument. 
> Implementation details
> --
> - One field bit in FieldInfos is used to indicate if payloads are enabled for 
> a field. The user does not have to enable payloads for a field, this is done 
> automatically:

Re: Flexible indexing

2007-03-11 Thread Grant Ingersoll


On Mar 11, 2007, at 5:41 PM, Michael Busch wrote:


> Hi Grant,
>
> I certainly agree that it would be great if we could make some
> progress and commit the payloads patch soon. I think it is quite
> independent of FI. FI will introduce different posting formats
> (see Wiki: http://wiki.apache.org/lucene-java/FlexibleIndexing).
> Payloads will be part of some of those formats, but not all (i.e.
> per-position payloads only make sense if positions are stored).




Yep, I agree.

> The only concern some people had was about the API the patch
> introduces. It extends Token and TermPositions. Doug's argument
> was that if we introduce new APIs now but want to change them with
> FI, then it will be hard to support those APIs. I think that is a
> valid point, but at the same time it slows down progress to have to
> plan ahead in too many directions. That's why I'd vote for marking
> the new APIs as experimental so that people can try them out at
> their own risk.
> If we can agree on that approach, I'll go ahead and submit an
> updated payloads patch in the next few days that applies cleanly to
> the current trunk and contains the additional warnings in the
> javadocs.




+1.



> Regarding FI and 662, however, I really believe we should split it
> up and plan ahead (in the way I already mentioned), so that we have
> more isolated patches. It is really great that we have 662 already
> (Nicolas, thank you so much for your hard work, I hope you'll keep
> working with us on FI!!). We'll probably use some of that code, and
> it will definitely be helpful.




+1  I think this makes a lot of sense.  We have been deliberating  
these changes for some time, so no reason to hurry.  I don't think  
they are urgent, yet they really will give us more flexibility and  
more capabilities for more people, so it will be a good thing to have.






--
Grant Ingersoll
http://www.grantingersoll.com/
http://lucene.grantingersoll.com
http://www.paperoftheweek.com/






Re: [jira] Updated: (LUCENE-755) Payloads

2007-03-11 Thread Grant Ingersoll
Cool.  I will try and take a look at it tomorrow.  Since we have the  
lazy SegTermPos thing in now, we should be able to integrate this  
into scoring via the Similarity and merge TermDocs and TermPositions  
like you suggested.


If I can get the Scoring piece in and people are fine w/ the  
flushBuffer change then hopefully we can get this in this week.  I  
will try to post a patch that includes your patch and the scoring  
integration by tomorrow or Tuesday if that is fine with you.


-Grant


Re: [jira] Updated: (LUCENE-755) Payloads

2007-03-11 Thread Michael Busch

Grant Ingersoll wrote:
> Cool.  I will try and take a look at it tomorrow.  Since we have the
> lazy SegTermPos thing in now, we should be able to integrate this into
> scoring via the Similarity and merge TermDocs and TermPositions like
> you suggested.
>
> If I can get the Scoring piece in and people are fine w/ the
> flushBuffer change then hopefully we can get this in this week.  I
> will try to post a patch that includes your patch and the scoring
> integration by tomorrow or Tuesday if that is fine with you.


I'm not completely sure how you want to integrate this into the Similarity 
class. Payloads can be used for more than scoring. Consider for example 
XML search: the payloads can be used here to store the element in which a 
term occurs. During search (e.g. an XPath query) the payloads would then 
be used to find hits, not for scoring.
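
As a rough illustration of the XML case, assume (hypothetically) that every element name is mapped to a one-byte id and that this id is stored as the payload of each position inside that element. Posting and ElementFilterSketch are names invented for this sketch; they are not part of the patch or of Lucene, and a real implementation would read the payloads through TermPositions instead of an in-memory list.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for one term position plus its payload bytes. */
final class Posting {
  final int position;
  final byte[] payload; // payload[0] holds the id of the enclosing element

  Posting(int position, byte[] payload) {
    this.position = position;
    this.payload = payload;
  }
}

final class ElementFilterSketch {
  /** Keeps only the positions whose payload names the wanted element. */
  static List<Integer> positionsInElement(List<Posting> postings, byte elementId) {
    List<Integer> hits = new ArrayList<>();
    for (Posting p : postings) {
      // The payload decides whether this position is a hit at all;
      // it does not contribute to the score.
      if (p.payload.length > 0 && p.payload[0] == elementId) {
        hits.add(p.position);
      }
    }
    return hits;
  }
}
```

Here the payload acts as a filter on positions, which is why tying payloads to Similarity alone would be too narrow.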


On the other hand, if you want to store e.g. per-position boosts in the 
payloads, you could use the norm en/decoding methods that are already in 
Similarity. You could use the following code in a TokenStream:

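 // encode the boost into a single byte, the same way norms are encoded
 byte[] payload = new byte[1];
 payload[0] = Similarity.encodeNorm(boost);
 // Token.setPayload() takes a Payload, so wrap the byte array
 // (this assumes Payload offers a byte[] constructor)
 token.setPayload(new Payload(payload));

and in a scorer you could get the boost then with:
 // getPayload() takes the target array and an offset into it
 payloadBuffer = termPositions.getPayload(payloadBuffer, 0);
 float boost = Similarity.decodeNorm(payloadBuffer[0]);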
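
The javadoc for getPayload(byte[], int) says the passed array is only used if it is big enough, so a caller should keep the returned reference. Below is a sketch of such a read loop. PositionsSketch is a hypothetical in-memory stand-in that mirrors the two proposed methods (it is not Lucene's TermPositions), and the one-byte boost encoding in BoostSketch is a simplified assumption for illustration, not Similarity.encodeNorm.

```java
/** Hypothetical in-memory stand-in mirroring the two proposed methods. */
final class PositionsSketch {
  private final byte[][] payloads;
  private int current = -1;

  PositionsSketch(byte[][] payloads) {
    this.payloads = payloads;
  }

  boolean nextPosition() {
    return ++current < payloads.length;
  }

  int getPayloadLength() {
    return payloads[current].length;
  }

  /** Copies into data if it is big enough, otherwise allocates a new array. */
  byte[] getPayload(byte[] data, int offset) {
    byte[] src = payloads[current];
    if (data == null || data.length - offset < src.length) {
      data = new byte[offset + src.length];
    }
    System.arraycopy(src, 0, data, offset, src.length);
    return data;
  }
}

final class BoostSketch {
  // Simplified encoding for illustration only: boosts from 0 to about
  // 15.9, in steps of 1/16, stored as one unsigned byte.
  static byte encode(float boost) {
    return (byte) Math.round(boost * 16f);
  }

  static float decode(byte b) {
    return (b & 0xFF) / 16f;
  }

  /** Sums the per-position boosts, reusing one buffer across positions. */
  static float sumBoosts(PositionsSketch positions) {
    byte[] buffer = new byte[1];
    float sum = 0f;
    while (positions.nextPosition()) {
      if (positions.getPayloadLength() == 0) {
        continue; // no payload stored here, so skip the load entirely
      }
      // Keep the returned reference: it may be a newly allocated array.
      buffer = positions.getPayload(buffer, 0);
      sum += decode(buffer[0]);
    }
    return sum;
  }
}
```

Checking getPayloadLength() first lets the loop skip positions without a payload, which fits the lazy loading described in the javadoc.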

But maybe you have something different in mind? Could you elaborate, please?




Re: Flexible indexing

2007-03-11 Thread Michael Busch

Grant Ingersoll wrote:


>> Regarding FI and 662, however, I really believe we should split it
>> up and plan ahead (in the way I already mentioned), so that we have
>> more isolated patches. It is really great that we have 662 already
>> (Nicolas, thank you so much for your hard work, I hope you'll keep
>> working with us on FI!!). We'll probably use some of that code, and
>> it will definitely be helpful.




> +1  I think this makes a lot of sense.  We have been deliberating
> these changes for some time, so no reason to hurry.  I don't think
> they are urgent, yet they really will give us more flexibility and
> more capabilities for more people, so it will be a good thing to have.




Right, we don't have to hurry. But it would still be cool to have some 
of the FI features in the next release, and once we start (now!) we 
should try to keep the momentum going!

