[jira] Commented: (SOLR-380) There's no way to convert search results into page-level hits of a "structured document".

Shairon Toledo (JIRA) Wed, 30 Dec 2009 10:49:59 -0800

    [ 
https://issues.apache.org/jira/browse/SOLR-380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795386#action_12795386
 ]


Shairon Toledo commented on SOLR-380:
-------------------------------------

I have a project that involves words extracted by OCR. Each page has words, 
and each word has its geometry so that a highlight can be shown to the end user. 
I've been trying to represent this document structure in XML:


{code:xml}
<document>
   <page num="1">
    <term top='111' bottom='222' right='333' left='444'>foo</term> 
    <term top='211' bottom='322' right='833' left='944'>bar</term> 
    <term top='311' bottom='422' right='733' left='144'>baz</term> 
    <term top='411' bottom='522' right='633' left='244'>qux</term> 
   </page>
   <page num="2">
        <term .... />
   </page>
   
</document>

{code}

Using the field 'fulltext_st':

{code:xml}
<field name="fulltext_st">
        &lt;document&gt;
        &lt;page top='111' bottom='222' right='333' left='444' word='foo' num='1'&gt;foo&lt;/page&gt;
        &lt;page top='211' bottom='322' right='833' left='944' word='bar' num='1'&gt;bar&lt;/page&gt;
        &lt;page top='311' bottom='422' right='733' left='144' word='baz' num='1'&gt;baz&lt;/page&gt;
        &lt;page top='411' bottom='522' right='633' left='244' word='qux' num='1'&gt;qux&lt;/page&gt;
        &lt;/document&gt;
</field>
{code}
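
The field value above looks like a flattened, XML-escaped form of the structured document: each term becomes a page element that carries its geometry, the word, and the page number. As a minimal sketch of that conversion (the function names flatten and as_field are hypothetical, and the exact whitespace and quoting that xmlpayload expects are assumptions), it could be done like this in Python:

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Sample input in the structured form; only two terms are used for brevity.
STRUCTURED = """<document>
   <page num="1">
    <term top='111' bottom='222' right='333' left='444'>foo</term>
    <term top='211' bottom='322' right='833' left='944'>bar</term>
   </page>
</document>"""


def flatten(structured_xml):
    """Turn every <term> into a <page> element with geometry, word and num."""
    root = ET.fromstring(structured_xml)
    flat = ET.Element("document")
    for page in root.iter("page"):
        for term in page.iter("term"):
            attrs = dict(term.attrib)            # top, bottom, right, left
            attrs["word"] = term.text or ""      # the OCR word itself
            attrs["num"] = page.get("num", "")   # page number of the parent
            entry = ET.SubElement(flat, "page", attrs)
            entry.text = term.text
    return ET.tostring(flat, encoding="unicode")


def as_field(name, inner_xml):
    """Wrap the flattened XML, escaped, in a <field> element (assumed layout)."""
    return '<field name="%s">%s</field>' % (name, escape(inner_xml))


field = as_field("fulltext_st", flatten(STRUCTURED))
print(field)
```

Note that ElementTree writes attribute values with double quotes, so the output differs cosmetically from the hand-written field above.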

I can get all terms in my search results along with their payloads.
But if I search using a phrase query, I get no results at all.

Example:

*search?q=foo* 

{code:xml}
<lst name="fulltext_st">
        <int name="/document/page[word='foo'][num='1'][top='111'][bottom='222'][right='333'][left='444']">1</int>
</lst>
{code}

*search?q=foo+bar*

{code:xml}
<lst name="fulltext_st">
        <int name="/document/page[word='foo'][num='1'][top='111'][bottom='222'][right='333'][left='444']">1</int>
        <int name="/document/page[word='bar'][num='1'][top='211'][bottom='322'][right='833'][left='944']">1</int>
</lst>
{code}

*/search?q="foo bar"*
{code:xml}
*nothing*
{code}

I was wondering if I could get your thoughts on whether xmlpayload supports this 
sort of thing, or how easy it would be for me to update the code to provide a 
solution for it.

Thank you in advance.

> There's no way to convert search results into page-level hits of a 
> "structured document".
> -----------------------------------------------------------------------------------------
>
>                 Key: SOLR-380
>                 URL: https://issues.apache.org/jira/browse/SOLR-380
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Tricia Williams
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: SOLR-380-XmlPayload.patch, SOLR-380-XmlPayload.patch, 
> xmlpayload-example.zip, xmlpayload-src.jar, xmlpayload.jar
>
>
> "Paged-Text" FieldType for Solr
> A chance to dig into the guts of Solr. The problem: If we index a monograph 
> in Solr, there's no way to convert search results into page-level hits. The 
> solution: have a "paged-text" fieldtype which keeps track of page divisions 
> as it indexes, and reports page-level hits in the search results.
> The input would contain page milestones: <page id="234"/>. As Solr processed 
> the tokens (using its standard tokenizers and filters), it would concurrently 
> build a structural map of the item, indicating which term position marked the 
> beginning of which page: <page id="234" firstterm="14324"/>. This map would 
> be stored in an unindexed field in some efficient format.
> At search time, Solr would retrieve term positions for all hits that are 
> returned in the current request, and use the stored map to determine page ids 
> for each term position. The results would imitate the results for 
> highlighting, something like:
> <lst name="pages">
>   <lst name="doc1">
>     <int name="pageid">234</int>
>     <int name="pageid">236</int>
>   </lst>
>   <lst name="doc2">
>     <int name="pageid">19</int>
>   </lst>
> </lst>
> <lst name="hitpos">
>   <lst name="doc1">
>     <lst name="234">
>       <int name="pos">14325</int>
>     </lst>
>   </lst>
>   ...
> </lst>
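
The lookup described in the issue (a stored map from page id to first term position, consulted at search time for each hit position) can be illustrated with a binary search. This is only a sketch of the idea, not the patch's implementation: the names PAGE_MAP, page_for_position and pages_for_hits are hypothetical, and apart from page 234 starting at term 14324 the map entries are invented for the example.

```python
from bisect import bisect_right

# Hypothetical structural map: (first term position, page id), sorted by
# position. Only the entry for page 234 comes from the issue description.
PAGE_MAP = [(0, "1"), (14324, "234"), (15010, "235"), (15800, "236")]


def page_for_position(page_map, pos):
    """Return the id of the page whose first term is at or before pos."""
    starts = [first for first, _ in page_map]
    i = bisect_right(starts, pos) - 1
    if i < 0:
        raise ValueError("position %d precedes the first page" % pos)
    return page_map[i][1]


def pages_for_hits(page_map, positions):
    """Map hit positions to page ids, deduplicated in order of appearance."""
    pages = []
    for pos in positions:
        page_id = page_for_position(page_map, pos)
        if page_id not in pages:
            pages.append(page_id)
    return pages


print(pages_for_hits(PAGE_MAP, [14325, 15900, 14400]))
```

Keeping the map sorted by first term position is what makes each lookup logarithmic in the number of pages.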

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
