Re: Flexible indexing
Hi Grant,

I certainly agree that it would be great if we could make some progress and commit the payloads patch soon. I think it is quite independent from FI. FI will introduce different posting formats (see Wiki: http://wiki.apache.org/lucene-java/FlexibleIndexing). Payloads will be part of some of those formats, but not all (i.e. per-position payloads only make sense if positions are stored).

The only concern some people had was about the API the patch introduces. It extends Token and TermPositions. Doug's argument was that if we introduce new APIs now but want to change them with FI, then it will be hard to support those APIs. I think that is a valid point, but at the same time it slows down progress to have to plan ahead in too many directions. That's why I'd vote for marking the new APIs as experimental so that people can try them out at their own risk. If we could agree on that approach then I'd go ahead and submit an updated payloads patch in the next few days that applies cleanly on the current trunk and contains the additional warnings in the javadocs.

In regard of FI and 662 however I really believe we should split it up and plan ahead (in a way I mentioned already), so that we have more isolated patches. It is really great that we have 662 already (Nicolas, thank you so much for your hard work, I hope you'll keep working with us on FI!!). We'll probably use some of that code, and it will definitely be helpful.

Michael

Grant Ingersoll wrote:
> Hi Michael,
>
> This is very good. I know 662 is different, just wasn't sure if Nicolas' patch was meant to be applied after 662, b/c I know we had discussed this before.
>
> I do agree with you about planning this out, but I also know that patches seem to motivate people the best and provide a certain concreteness to it all. I mostly started asking questions on these two issues b/c I wanted to spur some more discussion and see if we can get people motivated to move on it. I was hoping that I would be able to apply each patch to two different checkouts so I could start seeing where the overlap is and how they could fit together (I also admit I was procrastinating on my ApacheCon talk...).
>
> In the new, flexible world, the payloads implementation could be a separate implementation of the indexing or it could be part of the core/existing file format implementation. Sometimes I just need to get my hands on the code to get a real feel for what I feel is the best way to do it.
>
> I agree about the XML storage for Index information. We do that in our in-house wrapper around Lucene, storing info about the language, analyzer used, etc. We may also want a binary index-level storage capability. I know most people just create a single document usually to store binary info about the index, but a binary storage might be good too.
>
> Part of me says to apply the Payloads patch now, as it provides a lot of bang for the buck and I think the FI is going to take a lot longer to hash out. However, I know that it may pin us in or force us to change things for FI. Ultimately, I would love to see both these features for the next release, but that isn't a requirement.
>
> Also, on FI, I would love to see two different implementations of whatever API we choose before releasing it, as I always find two implementations of an Interface really work out the API details.
>
> -Grant

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Updated: (LUCENE-755) Payloads
[ https://issues.apache.org/jira/browse/LUCENE-755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Busch updated LUCENE-755:
---------------------------------
    Attachment: payloads.patch

I'm attaching the new patch with the following changes:
- applies cleanly on the current trunk
- fixed a bug in FSDirectory which affected payloads with length greater than 1024 bytes and extended testcase TestPayloads to test this fix
- added the following warning comments to the new APIs:

   * Warning: The status of the Payloads feature is experimental. The APIs
   * introduced here might change in the future and will not be supported anymore
   * in such a case. If you want to use this feature in a production environment
   * you should wait for an official release.

Another comment about an API change: In BufferedIndexOutput I changed the method

    protected abstract void flushBuffer(byte[] b, int len) throws IOException;

to

    protected abstract void flushBuffer(byte[] b, int offset, int len) throws IOException;

which means that subclasses of BufferedIndexOutput won't compile anymore. I made this change for performance reasons: If a payload is longer than 1024 bytes (standard buffer size of BufferedIndexOutput) then it can be flushed efficiently to disk without having to perform array copies.

Is this API change acceptable? Users who have custom subclasses of BufferedIndexOutput would have to change their classes in order to work.

> Payloads
> --------
>
>                 Key: LUCENE-755
>                 URL: https://issues.apache.org/jira/browse/LUCENE-755
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>            Reporter: Michael Busch
>         Assigned To: Michael Busch
>         Attachments: payload.patch, payloads.patch, payloads.patch
>
>
> This patch adds the possibility to store arbitrary metadata (payloads) together with each position of a term in its posting lists. A while ago this was discussed on the dev mailing list, where I proposed an initial design.
> This patch has a much improved design with modifications that make this new feature easier to use and more efficient.
>
> A payload is an array of bytes that can be stored inline in the ProxFile (.prx). Therefore this patch provides low-level APIs to simply store and retrieve byte arrays in the posting lists in an efficient way.
>
> API and Usage
> -------------
> The new class index.Payload is basically just a wrapper around a byte[] array together with int variables for offset and length. So a user does not have to create a byte array for every payload, but can rather allocate one array for all payloads of a document and provide offset and length information. This reduces object allocations on the application side.
>
> In order to store payloads in the posting lists one has to provide a TokenStream or TokenFilter that produces Tokens with payloads. I added the following two methods to the Token class:
>
>   /** Sets this Token's payload. */
>   public void setPayload(Payload payload);
>
>   /** Returns this Token's payload. */
>   public Payload getPayload();
>
> In order to retrieve the data from the index the interface TermPositions now offers two new methods:
>
>   /** Returns the payload length of the current term position.
>    * This is invalid until {@link #nextPosition()} is called for
>    * the first time.
>    *
>    * @return length of the current payload in number of bytes
>    */
>   int getPayloadLength();
>
>   /** Returns the payload data of the current term position.
>    * This is invalid until {@link #nextPosition()} is called for
>    * the first time.
>    * This method must not be called more than once after each call
>    * of {@link #nextPosition()}. However, payloads are loaded lazily,
>    * so if the payload data for the current position is not needed,
>    * this method may not be called at all for performance reasons.
>    *
>    * @param data the array into which the data of this payload is to be
>    *             stored, if it is big enough; otherwise, a new byte[] array
>    *             is allocated for this purpose.
>    * @param offset the offset in the array into which the data of this payload
>    *               is to be stored.
>    * @return a byte[] array containing the data of this payload
>    * @throws IOException
>    */
>   byte[] getPayload(byte[] data, int offset) throws IOException;
>
> Furthermore, this patch introduces the new method IndexOutput.writeBytes(byte[] b, int offset, int length). So far there was only a writeBytes() method without an offset argument.
>
> Implementation details
> ----------------------
> - One field bit in FieldInfos is used to indicate if payloads are enabled for a field. The user does not have to enable payloads for a field, this is done automatically:
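To make the "one array for all payloads of a document" idea and the getPayload(byte[] data, int offset) contract from the description above more concrete, here is a minimal, self-contained sketch. It does not use Lucene: PayloadSlice, copyTo, pack and PayloadSliceDemo are hypothetical names invented for illustration, and the real index.Payload class in the patch may well look different.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical stand-in for a payload: a view (offset + length) onto a shared byte[]. */
final class PayloadSlice {
    final byte[] data;
    final int offset;
    final int length;

    PayloadSlice(byte[] data, int offset, int length) {
        if (offset < 0 || length < 0 || offset + length > data.length) {
            throw new IllegalArgumentException("slice out of bounds");
        }
        this.data = data;
        this.offset = offset;
        this.length = length;
    }

    /**
     * Follows the contract described for getPayload(byte[] data, int offset):
     * copy into the caller's array if it is big enough, otherwise allocate a new one.
     * (Assumption: bytes before targetOffset are not preserved when a new array is allocated.)
     */
    byte[] copyTo(byte[] target, int targetOffset) {
        if (target == null || target.length - targetOffset < length) {
            target = new byte[targetOffset + length];
        }
        System.arraycopy(data, offset, target, targetOffset, length);
        return target;
    }
}

final class PayloadSliceDemo {
    /** Packs all payloads of one document into a single array, one slice per payload. */
    static PayloadSlice[] pack(String[] values) {
        byte[][] encoded = new byte[values.length][];
        int total = 0;
        for (int i = 0; i < values.length; i++) {
            encoded[i] = values[i].getBytes(StandardCharsets.UTF_8);
            total += encoded[i].length;
        }
        byte[] shared = new byte[total];
        PayloadSlice[] slices = new PayloadSlice[values.length];
        int pos = 0;
        for (int i = 0; i < values.length; i++) {
            System.arraycopy(encoded[i], 0, shared, pos, encoded[i].length);
            slices[i] = new PayloadSlice(shared, pos, encoded[i].length);
            pos += encoded[i].length;
        }
        return slices;
    }

    public static void main(String[] args) {
        PayloadSlice[] slices = pack(new String[] {"title", "body", "footnote"});
        byte[] buffer = new byte[16]; // reused across positions by the reader
        for (PayloadSlice s : slices) {
            byte[] out = s.copyTo(buffer, 0);
            System.out.println(new String(out, 0, s.length, StandardCharsets.UTF_8)
                    + " reusedBuffer=" + (out == buffer));
        }
    }
}
```

The point of returning the array from copyTo is the same as in the javadoc above: the caller has to use the returned reference, because it may be a freshly allocated array when the supplied buffer was too small.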
Re: Flexible indexing
On Mar 11, 2007, at 5:41 PM, Michael Busch wrote:

> Hi Grant,
>
> I certainly agree that it would be great if we could make some progress and commit the payloads patch soon. I think it is quite independent from FI. FI will introduce different posting formats (see Wiki: http://wiki.apache.org/lucene-java/FlexibleIndexing). Payloads will be part of some of those formats, but not all (i.e. per-position payloads only make sense if positions are stored).

Yep, I agree.

> The only concern some people had was about the API the patch introduces. It extends Token and TermPositions. Doug's argument was that if we introduce new APIs now but want to change them with FI, then it will be hard to support those APIs. I think that is a valid point, but at the same time it slows down progress to have to plan ahead in too many directions. That's why I'd vote for marking the new APIs as experimental so that people can try them out at their own risk. If we could agree on that approach then I'd go ahead and submit an updated payloads patch in the next few days that applies cleanly on the current trunk and contains the additional warnings in the javadocs.

+1.

> In regard of FI and 662 however I really believe we should split it up and plan ahead (in a way I mentioned already), so that we have more isolated patches. It is really great that we have 662 already (Nicolas, thank you so much for your hard work, I hope you'll keep working with us on FI!!). We'll probably use some of that code, and it will definitely be helpful.

+1

I think this makes a lot of sense. We have been deliberating these changes for some time, so no reason to hurry. I don't think they are urgent, yet they really will give us more flexibility and more capabilities for more people, so it will be a good thing to have.

> Michael
>
> Grant Ingersoll wrote:
>> Hi Michael,
>>
>> This is very good. I know 662 is different, just wasn't sure if Nicolas' patch was meant to be applied after 662, b/c I know we had discussed this before.
--
Grant Ingersoll
http://www.grantingersoll.com/
http://lucene.grantingersoll.com
http://www.paperoftheweek.com/
Re: [jira] Updated: (LUCENE-755) Payloads
Cool. I will try and take a look at it tomorrow. Since we have the lazy SegTermPos thing in now, we should be able to integrate this into scoring via the Similarity and merge TermDocs and TermPositions like you suggested.

If I can get the Scoring piece in and people are fine w/ the flushBuffer change then hopefully we can get this in this week. I will try to post a patch that includes your patch and the scoring integration by tomorrow or Tuesday if that is fine with you.

-Grant

On Mar 11, 2007, at 8:35 PM, Michael Busch (JIRA) wrote:

> [ https://issues.apache.org/jira/browse/LUCENE-755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>
> Michael Busch updated LUCENE-755:
> ---------------------------------
>     Attachment: payloads.patch
>
> I'm attaching the new patch with the following changes:
> - applies cleanly on the current trunk
> - fixed a bug in FSDirectory which affected payloads with length greater than 1024 bytes and extended testcase TestPayloads to test this fix
> - added the following warning comments to the new APIs:
>
>    * Warning: The status of the Payloads feature is experimental. The APIs
>    * introduced here might change in the future and will not be supported anymore
>    * in such a case. If you want to use this feature in a production environment
>    * you should wait for an official release.
>
> Another comment about an API change: In BufferedIndexOutput I changed the method
>
>     protected abstract void flushBuffer(byte[] b, int len) throws IOException;
>
> to
>
>     protected abstract void flushBuffer(byte[] b, int offset, int len) throws IOException;
>
> which means that subclasses of BufferedIndexOutput won't compile anymore. I made this change for performance reasons: If a payload is longer than 1024 bytes (standard buffer size of BufferedIndexOutput) then it can be flushed efficiently to disk without having to perform array copies.
>
> Is this API change acceptable? Users who have custom subclasses of BufferedIndexOutput would have to change their classes in order to work.
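For anyone weighing the flushBuffer change quoted above, the following self-contained sketch may help show why the extra offset argument avoids array copies for large writes. It is not Lucene's BufferedIndexOutput: SketchBufferedOutput, InMemoryOutput and flushCalls are hypothetical names, and the sketch drops the checked IOException that the real method declares, only to keep the example short.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical, simplified buffered output; not Lucene's BufferedIndexOutput. */
abstract class SketchBufferedOutput {
    static final int BUFFER_SIZE = 1024;
    private final byte[] buffer = new byte[BUFFER_SIZE];
    private int position = 0;

    /** New-style hook: write len bytes of b, starting at offset, to the underlying sink. */
    protected abstract void flushBuffer(byte[] b, int offset, int len);

    void writeBytes(byte[] b, int offset, int length) {
        if (length > BUFFER_SIZE) {
            // Large write: drain pending bytes, then hand the caller's array
            // directly to the sink. No copy into the internal buffer is needed,
            // which is only possible because flushBuffer accepts an offset.
            flush();
            flushBuffer(b, offset, length);
            return;
        }
        if (length > BUFFER_SIZE - position) {
            flush(); // make room for the small write
        }
        System.arraycopy(b, offset, buffer, position, length);
        position += length;
    }

    void flush() {
        if (position > 0) {
            flushBuffer(buffer, 0, position);
            position = 0;
        }
    }
}

/** Subclass that collects everything in memory and counts flushBuffer calls. */
final class InMemoryOutput extends SketchBufferedOutput {
    final ByteArrayOutputStream sink = new ByteArrayOutputStream();
    int flushCalls = 0;

    @Override
    protected void flushBuffer(byte[] b, int offset, int len) {
        sink.write(b, offset, len);
        flushCalls++;
    }

    public static void main(String[] args) {
        InMemoryOutput out = new InMemoryOutput();
        out.writeBytes(new byte[10], 0, 10);       // buffered
        out.writeBytes(new byte[3000], 100, 2000); // bypasses the internal buffer
        out.flush();
        System.out.println("bytes=" + out.sink.size() + " flushCalls=" + out.flushCalls);
    }
}
```

With the old two-argument signature, a write that starts in the middle of the caller's array would first have to be copied into the internal buffer (or a temporary array) in chunks of at most 1024 bytes, which is the cost the change is meant to remove.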
Re: [jira] Updated: (LUCENE-755) Payloads
Grant Ingersoll wrote:
> Cool. I will try and take a look at it tomorrow. Since we have the lazy SegTermPos thing in now, we should be able to integrate this into scoring via the Similarity and merge TermDocs and TermPositions like you suggested. If I can get the Scoring piece in and people are fine w/ the flushBuffer change then hopefully we can get this in this week. I will try to post a patch that includes your patch and the scoring integration by tomorrow or Tuesday if that is fine with you.

I'm not completely sure how you want to integrate this into the Similarity class. Payloads are not only useful for scoring. Consider XML search, for example: the payloads can be used to store the element in which a term occurs. During search (e.g. an XPath query) the payloads would then be used to find hits, not for scoring.

On the other hand, if you want to store e.g. per-position boosts in the payloads, you could use the norm en/decoding methods that are already in Similarity. You could use the following code in a TokenStream:

    byte[] payload = new byte[1];
    payload[0] = Similarity.encodeNorm(boost);
    token.setPayload(new Payload(payload, 0, payload.length));

and in a scorer you could then get the boost with:

    payloadBuffer = termPositions.getPayload(payloadBuffer, 0);
    float boost = Similarity.decodeNorm(payloadBuffer[0]);

But maybe you have something different in mind? Could you elaborate, please?
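The XML search use case mentioned above could look roughly like the following sketch, where each position carries a one-byte payload naming the element it occurred in and the payload is used to select hits rather than to score them. This is plain Java without Lucene: ElementPayloadDemo, Posting, the TITLE/BODY ids and positionsInElement are all hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class ElementPayloadDemo {
    // Hypothetical element ids; one byte per term position.
    static final byte TITLE = 1;
    static final byte BODY = 2;

    /** One occurrence of a term: its position plus the payload stored with it. */
    static final class Posting {
        final int position;
        final byte[] payload;

        Posting(int position, byte element) {
            this.position = position;
            this.payload = new byte[] {element};
        }
    }

    /** Returns the positions whose payload says the term occurred inside the given element. */
    static List<Integer> positionsInElement(List<Posting> postings, byte element) {
        List<Integer> hits = new ArrayList<>();
        for (Posting p : postings) {
            if (p.payload.length == 1 && p.payload[0] == element) {
                hits.add(p.position);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Posting> postings = new ArrayList<>();
        postings.add(new Posting(0, TITLE));
        postings.add(new Posting(12, BODY));
        postings.add(new Posting(40, BODY));
        // A query restricted to the title element keeps only matching positions.
        System.out.println(positionsInElement(postings, TITLE));
    }
}
```

In a real index the loop would run over TermPositions and read the payload lazily after nextPosition(), so positions in other elements can be skipped without loading any payload bytes that are not needed.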
Re: Flexible indexing
Grant Ingersoll wrote:
>> In regard of FI and 662 however I really believe we should split it up and plan ahead (in a way I mentioned already), so that we have more isolated patches. It is really great that we have 662 already (Nicolas, thank you so much for your hard work, I hope you'll keep working with us on FI!!). We'll probably use some of that code, and it will definitely be helpful.
>
> +1
>
> I think this makes a lot of sense. We have been deliberating these changes for some time, so no reason to hurry. I don't think they are urgent, yet they really will give us more flexibility and more capabilities for more people, so it will be a good thing to have.

Right, we don't have to hurry. But still it would be cool to have some of the FI features in the next release and once we start (now!) we should try to keep the momentum going!