Hi All,
I'll explain what I'm working on, and then I'll ask my two questions.
I'm working on https://issues.apache.org/jira/browse/SOLR-380, a feature
request to index a "Structured Document", meaning anything that can be
represented as XML, in order to provide more context for hits in the
result set. This allows us to do things like query the index for "Canada"
and say not only that the query matched a document titled "Some Nonsense",
but also that the query term appeared on page 7 of chapter 1. We can then
take this one step further and mark up/highlight the image of that page
based on our OCR data and the position of the hit.
For example:
<book title='Some Nonsense'>
  <chapter title='One'>
    <page name='1'>Some text from page one of a book.</page>
    <page name='7'>Some more text from page seven of a book. Oh and I'm
      from Canada.</page>
  </chapter>
</book>
I accomplished this by creating a custom Tokenizer which strips the
XML elements and stores them as a Payload on each of the Tokens created
from the character data in the input. The payload is the string that
describes the XPath at that location. So for the token <Canada> the
payload is
"/book[title='Some Nonsense']/chapter[title='One']/page[name='7']"
The other part of this work is the SolrHighlighter, which is less
important to this list. I retrieve the TermPositions for the Query's
Terms, use the TermPositions API to get back the payload at each hit,
and build output which shows hit positions grouped by the payload they
are associated with.
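The gist of that side, in case it helps (again just a sketch; the field
name "text" and the index path are made up):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermPositions;

public class PayloadDump {
  public static void main(String[] args) throws Exception {
    IndexReader reader = IndexReader.open("/path/to/index"); // hypothetical location
    TermPositions tp = reader.termPositions(new Term("text", "canada"));
    while (tp.next()) {                       // each document containing the term
      for (int i = 0; i < tp.freq(); i++) {   // each position within that document
        int position = tp.nextPosition();
        if (tp.isPayloadAvailable()) {
          // getPayload must be called between calls to nextPosition()
          byte[] bytes = tp.getPayload(new byte[tp.getPayloadLength()], 0);
          System.out.println("doc=" + tp.doc() + " pos=" + position
              + " xpath=" + new String(bytes));
        }
      }
    }
    tp.close();
    reader.close();
  }
}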
QUESTION 1: Applying TokenFilters to my Tokenizer produces some behavior
that is strange, in my opinion. First, the term positions change; second,
the Payload is removed. Is this the expected behavior, or is this a bug?
With Payloads being an "experimental feature" I can understand if this
propagation just hasn't been implemented yet. But is it implemented, or
will it be?
In the following example I will denote a token by
{pos, <term text>, <payload>}:

input: <class name='mammalia'>Dog, and Cat</class>

XmlPayloadTokenizer:
  {1, <Dog,>, </class[name='mammalia'][startPos='0']>}
  {2, <and>, </class[name='mammalia'][startPos='0']>}
  {3, <Cat>, </class[name='mammalia'][startPos='0']>}
StopFilter:
  {1, <Dog,>, </class[name='mammalia'][startPos='0']>}
  {2, <Cat>, </class[name='mammalia'][startPos='0']>}
WordDelimiterFilter:
  {1, <Dog>, <>}
  {2, <Cat>, </class[name='mammalia'][startPos='0']>}
LowerCaseFilter:
  {1, <dog>, <>}
  {2, <cat>, </class[name='mammalia'][startPos='0']>}
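For what it's worth, here is the behavior I expected from any filter that
replaces a token with a new Token instance: copy the position increment
and payload across explicitly. A sketch against the 2.x Token API (a
hypothetical filter, not a fix to the real WordDelimiterFilter):

import java.io.IOException;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

/** Hypothetical filter showing the token metadata I expected to survive. */
public class PayloadPreservingLowerCaseFilter extends TokenFilter {

  public PayloadPreservingLowerCaseFilter(TokenStream in) {
    super(in);
  }

  public Token next() throws IOException {
    Token t = input.next();
    if (t == null) return null;
    // Build the replacement token, then carry the metadata across explicitly.
    Token out = new Token(t.termText().toLowerCase(),
                          t.startOffset(), t.endOffset(), t.type());
    out.setPositionIncrement(t.getPositionIncrement()); // keeps positions stable
    out.setPayload(t.getPayload());                     // keeps the payload
    return out;
  }
}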
QUESTION 2: As I explained, I'm storing the string representing the
token's XPath (well, the byte array of that string) as the Payload of
each token. Is there a more efficient way to do this? Am I abusing the
Payload functionality, and will it turn around and bite me when I get to
indexing hundreds of thousands of documents? Perhaps I shouldn't be
relying on Payloads at all before they are deemed non-experimental?
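One alternative I've been toying with (hypothetical, not in my patch):
since every token on a page shares the same XPath, intern each distinct
path to a small integer and store 4 bytes per posting, persisting the
id-to-XPath table separately. A sketch:

import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.index.Payload;

/** Hypothetical helper: 4-byte payloads via an XPath dictionary. */
public class XPathDictionary {

  private final Map ids = new HashMap(); // xpath -> Integer; persisted separately

  public Payload payloadFor(String xpath) {
    Integer id = (Integer) ids.get(xpath);
    if (id == null) {
      id = new Integer(ids.size());      // assign the next id to a new path
      ids.put(xpath, id);
    }
    int v = id.intValue();
    byte[] bytes = new byte[] {          // big-endian int: 4 bytes per posting
        (byte) (v >>> 24), (byte) (v >>> 16), (byte) (v >>> 8), (byte) v };
    return new Payload(bytes);
  }
}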
I feel both of these questions relate to Lucene proper rather than
Solr, which is why I've posted here. If you think solr-user is a better
place for them, let me know.
Thanks for your input!
Tricia