Re: Payloads, Tokenizers, and Filters. Oh My!

Grant Ingersoll Sat, 17 Nov 2007 04:42:19 -0800

Inline below

On Nov 16, 2007, at 6:03 PM, Tricia Williams wrote:

Hi All,
I'll explain what I'm working on, and then I'll ask my two questions.
I'm working on the issue https://issues.apache.org/jira/browse/SOLR-380 which is a feature request that allows one to index a "StructuredDocument", which is anything that can be represented by XML, in order to provide more context to hits in the result set. This allows us to do things like query the index for "Canada" and be able to not only say that the query matched a document titled "Some Nonsense" but also that the query term appeared on page 7 of chapter 1. We can then take this one step further and mark up/highlight the image of this page based on our OCR and position hit.
For example:
<book title='Some Nonsense'><chapter title='One'><page name='1'>Some text from page one of a book.</page><page name='7'>Some more text from page seven of a book. Oh and I'm from Canada.</page></chapter></book>
I accomplished this by creating a custom Tokenizer which strips the XML elements and stores them as a Payload at each of the Tokens created from the character data in the input. The payload is the string that describes the XPath at that location. So for <Canada> the payload is "/book[title='Some Nonsense']/chapter[title='One']/page[name='7']".
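To make the path-building step concrete, here is a minimal sketch of how such a payload string could be derived, assuming a StAX parser from the JDK and plain whitespace splitting. The names XPathPayloadSketch, PathToken, and tokenize are hypothetical and are not taken from the SOLR-380 patch; the real Tokenizer works against Lucene's Token API and also records a startPos, which this sketch leaves out.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class XPathPayloadSketch {

    /** Stand-in for a token: the term text plus the path string that would become its payload. */
    static final class PathToken {
        final String term;
        final String path;

        PathToken(String term, String path) {
            this.term = term;
            this.path = path;
        }
    }

    /** Strips the markup and tags every whitespace-separated term with the path of its enclosing elements. */
    static List<PathToken> tokenize(String xml) {
        List<PathToken> tokens = new ArrayList<>();
        Deque<String> openElements = new ArrayDeque<>(); // root first, innermost element last
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            // Ask for adjacent text chunks to be merged so a word is not split across events.
            factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    // Build one path step such as page[name='7'].
                    StringBuilder step = new StringBuilder(reader.getLocalName());
                    for (int i = 0; i < reader.getAttributeCount(); i++) {
                        step.append('[').append(reader.getAttributeLocalName(i))
                            .append("='").append(reader.getAttributeValue(i)).append("']");
                    }
                    openElements.addLast(step.toString());
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    openElements.removeLast();
                } else if (event == XMLStreamConstants.CHARACTERS) {
                    String path = "/" + String.join("/", openElements);
                    for (String term : reader.getText().trim().split("\\s+")) {
                        if (!term.isEmpty()) {
                            tokens.add(new PathToken(term, path));
                        }
                    }
                }
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
        return tokens;
    }

    public static void main(String[] args) {
        String xml = "<book title='Some Nonsense'><chapter title='One'>"
            + "<page name='7'>Oh and I'm from Canada.</page></chapter></book>";
        for (PathToken token : tokenize(xml)) {
            System.out.println(token.term + " -> " + token.path);
        }
    }
}
```

Punctuation stays attached to the terms here (the last term is "Canada." with its period), which is why a later filter in the chain would still have work to do.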
The other part of this work is the SolrHighlighter, which is less important to this list. I retrieve the TermPositions for the Query's Terms and use the TermPosition functionality to get back the payload for the hits and build output which shows hit positions categorized by the payload they are associated with.
QUESTION 1: Applying TokenFilters to my Tokenizer creates some strange (in my opinion) behavior. First of all the TermPositions change, and second the Payload is removed. Is this the expected behavior, or is this a bug? With the Payload being an "experimental feature" I can understand if this persistence just hasn't been implemented yet. But is it, or will it be?

Do you have other TokenFilters in your Analyzer? Are you reusing the same Token or creating a new one in your TokenFilters? If creating a new one, you will have to set the payload, as it won't be copied down. Perhaps we should add a constructor that takes a payload. On the other hand, I think we are going to remove the Payload object in favor of just using the byte array.
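A rough illustration of the difference, using a stand-in SimpleToken class rather than Lucene's own Token (all names in this sketch are hypothetical): a filter that builds a fresh token only keeps the payload if it passes it along explicitly.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class PayloadCopySketch {

    /** Stand-in for a token; NOT Lucene's Token class. The payload may be null. */
    static final class SimpleToken {
        final String term;
        final byte[] payload;

        SimpleToken(String term, byte[] payload) {
            this.term = term;
            this.payload = payload;
        }
    }

    /** Lowercases by creating new tokens and forgetting the payload, which reproduces the symptom. */
    static List<SimpleToken> lowerCaseDroppingPayload(List<SimpleToken> input) {
        List<SimpleToken> output = new ArrayList<>();
        for (SimpleToken token : input) {
            output.add(new SimpleToken(token.term.toLowerCase(Locale.ROOT), null));
        }
        return output;
    }

    /** Lowercases by creating new tokens and carrying the payload over explicitly. */
    static List<SimpleToken> lowerCaseKeepingPayload(List<SimpleToken> input) {
        List<SimpleToken> output = new ArrayList<>();
        for (SimpleToken token : input) {
            output.add(new SimpleToken(token.term.toLowerCase(Locale.ROOT), token.payload));
        }
        return output;
    }

    public static void main(String[] args) {
        List<SimpleToken> input = new ArrayList<>();
        input.add(new SimpleToken("Dog", new byte[] {42}));
        System.out.println(lowerCaseDroppingPayload(input).get(0).payload == null);
        System.out.println(lowerCaseKeepingPayload(input).get(0).payload != null);
    }
}
```

In a real TokenFilter the equivalent step would be setting the payload on the newly created Token before returning it, or reusing the incoming Token so nothing has to be copied.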

In the following example I will denote a token by {pos,<termtext>,<payload>}:
input: <class name='mammalia'>Dog, and Cat</class>

XmlPayloadTokenizer:
{1,<Dog,>,</class[name='mammalia'][startPos='0']>},{2,<and>,</class[name='mammalia'][startPos='0']>},{3,<Cat>,</class[name='mammalia'][startPos='0']>}
StopFilter:
{1,<Dog,>,</class[name='mammalia'][startPos='0']>},{2,<Cat>,</class[name='mammalia'][startPos='0']>}
WordDelimiterFilter:
{1,<Dog>,<>} {2,<Cat>,</class[name='mammalia'][startPos='0']>}
LowerCaseFilter:
{1,<dog>,<>} {2,<cat>,</class[name='mammalia'][startPos='0']>}
QUESTION 2: As I explained, I'm storing the String representing the XPath of the token as the Payload (well, the ByteArray of the String) of each token. Is there a more efficient way to do this? Is this exploiting Payload functionality, and will it turn around and bite me when I get to indexing hundreds of thousands of documents? Perhaps I shouldn't be relying on the Payload functionality before it is deemed not experimental?

I think this is reasonable. Michael Busch had a nice talk at ApacheCon on payloads that you can find at http://people.apache.org/~buschmi/apachecon/AdvancedIndexingLuceneAtlanta07.ppt


I guess you just want to be careful about how big your payloads get.
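One possible way to keep payloads small, offered only as a sketch and not something the current patch does: since many tokens share the same path, the path strings could be replaced by small integer ids and only the 4-byte id stored as the payload. The class PathDictionarySketch below is hypothetical, and it assumes the id-to-path table is persisted somewhere alongside the index so the ids can be resolved at search time.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class PathDictionarySketch {
    private final Map<String, Integer> idsByPath = new HashMap<>();
    private final List<String> pathsById = new ArrayList<>();

    /** Returns a 4-byte payload holding the id of the path, assigning a new id on first sight. */
    byte[] encode(String path) {
        Integer id = idsByPath.get(path);
        if (id == null) {
            id = pathsById.size();
            idsByPath.put(path, id);
            pathsById.add(path);
        }
        return ByteBuffer.allocate(4).putInt(id).array();
    }

    /** Resolves a payload produced by encode back to its path string. */
    String decode(byte[] payload) {
        return pathsById.get(ByteBuffer.wrap(payload).getInt());
    }

    public static void main(String[] args) {
        PathDictionarySketch dictionary = new PathDictionarySketch();
        String path = "/book[title='Some Nonsense']/chapter[title='One']/page[name='7']";
        byte[] payload = dictionary.encode(path);
        System.out.println("string payload bytes: " + path.getBytes(StandardCharsets.UTF_8).length);
        System.out.println("id payload bytes:     " + payload.length);
        System.out.println("decoded:              " + dictionary.decode(payload));
    }
}
```

The trade-off is an extra lookup structure to maintain; whether that is worth it depends on how long and how repetitive the paths are in practice.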

One of the original use cases for payloads was for doing XPath queries.

Also, the only thing experimental about Payloads is the actual signature of the methods, not the need for them. If anything, I think you will see an expansion of payload capability in the future. Also note that you will probably be interested in adding more Payload querying capability. And also note, I am in the process of adding the ability to get payloads from Spans, but I am not sure if this gets into 2.3 or not.


Cheers,
Grant
