[
https://issues.apache.org/jira/browse/SOLR-380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795386#action_12795386
]
Shairon Toledo commented on SOLR-380:
-------------------------------------
I have a project that involves words extracted by OCR. Each page has words, and
each word has its geometry so that a highlight can be blinked for the end user.
I've been trying to represent this document structure in XML:
{code:xml}
<document>
<page num="1">
<term top='111' bottom='222' right='333' left='444'>foo</term>
<term top='211' bottom='322' right='833' left='944'>bar</term>
<term top='311' bottom='422' right='733' left='144'>baz</term>
<term top='411' bottom='522' right='633' left='244'>qux</term>
</page>
<page num="2">
<term .... />
</page>
</document>
{code}
Using the field 'fulltext_st', I index it like this:
{code:xml}
<field name="fulltext_st">
<document >
<page top='111' bottom='222' right='333' left='444' word='foo'
num='1'>foo</page>
<page top='211' bottom='322' right='833' left='944' word='bar'
num='1'>bar</page>
<page top='311' bottom='422' right='733' left='144' word='baz'
num='1'>baz</page>
<page top='411' bottom='522' right='633' left='244' word='qux'
num='1'>qux</page>
</document>
</field>
{code}
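For illustration, the conversion from the nested page/term markup to the flat form stored in 'fulltext_st' could be sketched as below. This is only a minimal sketch using Python's standard library; the function name `flatten` and the shortened sample input are assumptions made for the example, not part of xmlpayload.

```python
import xml.etree.ElementTree as ET

# Shortened sample of the nested structure (two of the four terms).
SOURCE = """
<document>
  <page num="1">
    <term top="111" bottom="222" right="333" left="444">foo</term>
    <term top="211" bottom="322" right="833" left="944">bar</term>
  </page>
</document>
"""


def flatten(source_xml):
    """Turn nested <page>/<term> markup into one flat <page> element per word."""
    src = ET.fromstring(source_xml)
    out = ET.Element("document")
    for page in src.iter("page"):
        for term in page.iter("term"):
            attrs = dict(term.attrib)            # top, bottom, right, left
            attrs["word"] = term.text or ""      # repeat the word as an attribute
            attrs["num"] = page.get("num", "")   # carry the page number down
            flat = ET.SubElement(out, "page", attrs)
            flat.text = term.text
    return out


flat_doc = flatten(SOURCE)
print(ET.tostring(flat_doc, encoding="unicode"))
```

Each word becomes its own element, so the page number and the geometry travel with the term they describe.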
I can get all the terms in my search results along with their payloads.
But if I search using a phrase query, I get no results.
Examples:
*search?q=foo*
{code:xml}
<lst name="fulltext_st">
<int
name="/document/page[word='foo'][num='1'][top='111'][bottom='222'][right='333'][left='444']">1</int>
</lst>
{code}
*search?q=foo+bar*
{code:xml}
<lst name="fulltext_st">
<int
name="/document/page[word='foo'][num='1'][top='111'][bottom='222'][right='333'][left='444']">1</int>
<int
name="/document/page[word='bar'][num='1'][top='211'][bottom='322'][right='833'][left='944']">1</int>
</lst>
{code}
*/search?q="foo bar"*
{code:xml}
*nothing*
{code}
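To make the behaviour I expected concrete, here is a minimal sketch of phrase matching over term positions that returns the payload of every word in the phrase. It does not reflect how xmlpayload or Solr is implemented; the `postings` layout, the function name `phrase_hits`, and the positions 0 to 2 are assumptions chosen for the example.

```python
def phrase_hits(postings, phrase):
    """Return one list of payloads per occurrence of `phrase`.

    `postings` maps a term to {position: payload}. A phrase occurs where
    its words sit at consecutive positions.
    """
    words = phrase.split()
    if not words or any(w not in postings for w in words):
        return []
    hits = []
    for start in sorted(postings[words[0]]):
        # Every following word must be found exactly one position later.
        if all(start + i in postings[w] for i, w in enumerate(words)):
            hits.append([postings[w][start + i] for i, w in enumerate(words)])
    return hits


# Hypothetical positions for the first three terms of page 1.
postings = {
    "foo": {0: {"num": 1, "top": 111, "bottom": 222, "right": 333, "left": 444}},
    "bar": {1: {"num": 1, "top": 211, "bottom": 322, "right": 833, "left": 944}},
    "baz": {2: {"num": 1, "top": 311, "bottom": 422, "right": 733, "left": 144}},
}

for hit in phrase_hits(postings, "foo bar"):
    print(hit)
```

With this layout a query for "foo bar" would yield the geometry of both words, while "foo baz" would yield nothing because the two words are not adjacent.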
I was wondering if I could get your thoughts on whether xmlpayload supports this
sort of thing, or how easy it would be for me to update the code to support it.
Thank you in advance.
> There's no way to convert search results into page-level hits of a
> "structured document".
> -----------------------------------------------------------------------------------------
>
> Key: SOLR-380
> URL: https://issues.apache.org/jira/browse/SOLR-380
> Project: Solr
> Issue Type: New Feature
> Components: search
> Reporter: Tricia Williams
> Priority: Minor
> Fix For: 1.5
>
> Attachments: SOLR-380-XmlPayload.patch, SOLR-380-XmlPayload.patch,
> xmlpayload-example.zip, xmlpayload-src.jar, xmlpayload.jar
>
>
> "Paged-Text" FieldType for Solr
> A chance to dig into the guts of Solr. The problem: If we index a monograph
> in Solr, there's no way to convert search results into page-level hits. The
> solution: have a "paged-text" fieldtype which keeps track of page divisions
> as it indexes, and reports page-level hits in the search results.
> The input would contain page milestones: <page id="234"/>. As Solr processed
> the tokens (using its standard tokenizers and filters), it would concurrently
> build a structural map of the item, indicating which term position marked the
> beginning of which page: <page id="234" firstterm="14324"/>. This map would
> be stored in an unindexed field in some efficient format.
> At search time, Solr would retrieve term positions for all hits that are
> returned in the current request, and use the stored map to determine page ids
> for each term position. The results would imitate the results for
> highlighting, something like:
> <lst name="pages">
> <lst name="doc1">
> <int name="pageid">234</int>
> <int name="pageid">236</int>
> </lst>
> <lst name="doc2">
> <int name="pageid">19</int>
> </lst>
> </lst>
> <lst name="hitpos">
> <lst name="doc1">
> <lst name="234">
> <int name="pos">14325</int>
> </lst>
> </lst>
> ...
> </lst>
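The lookup described in the issue, from a hit's term position to a page id via the stored map, could be sketched roughly as follows. This is a minimal sketch, not the patch's implementation: only the entry for page 234 (first term 14324) comes from the description above, and the other map entries and the function names are hypothetical.

```python
from bisect import bisect_right

# Structural map as (first term position, page id), sorted by position.
# Only ("234", 14324) is taken from the issue text; the rest is made up.
PAGE_MAP = [(0, "233"), (14324, "234"), (15010, "235"), (15800, "236")]


def page_for_position(page_map, pos):
    """Return the id of the page whose first term is the last one <= pos."""
    starts = [first for first, _ in page_map]
    i = bisect_right(starts, pos) - 1
    if i < 0:
        raise ValueError("position %d precedes the first page" % pos)
    return page_map[i][1]


def pages_for_hits(page_map, positions):
    """Group hit positions by page id."""
    result = {}
    for pos in sorted(positions):
        result.setdefault(page_for_position(page_map, pos), []).append(pos)
    return result


print(pages_for_hits(PAGE_MAP, [14325, 15900]))
```

Because the map is sorted by first term position, a binary search per hit is enough to recover the page.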