[jira] Commented: (SOLR-380) There's no way to convert search results into page-level hits of a "structured document".

Tricia Williams (JIRA) Wed, 17 Oct 2007 14:26:38 -0700

    [ 
https://issues.apache.org/jira/browse/SOLR-380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12535748
 ]


Tricia Williams commented on SOLR-380:
--------------------------------------

The discussion at 
http://www.nabble.com/Structured-Lucene-documents-tf4234661.html#a12048390 
gives one solution (more of a workaround, in my opinion), but I don't think 
it is practical.  The number of pages in the monographs we index varies 
greatly (10s to 1000s of pages).  So while specifying each page_* 
(page_1,page_2,page_3,...,page_N) as a field to highlight will work, I don't 
think it is the cleanest solution, because you have to infer page numbers from 
the highlighted samples.  Furthermore, to get the highlighted samples you need 
to know the values of the * in a dynamic field, which rather defeats the 
purpose of the dynamic field.  And if you wanted to use the position numbers 
themselves (for example, combining positions with OCR information to highlight 
the original image), they are not available in the results.

In answer to your question, Peter: one must enable highlighting and list all 
the page_* fields for highlighter snippets.  In the following example I have a 
dynamic field fulltext_*, a copyField into fulltext, and 
defaultSearchField=fulltext:
http://localhost:8080/solr/select?indent=on&version=2.2&q=employ&start=0&rows=10&fl=*%2Cscore&qt=standard&wt=standard&explainOther=&hl=on&hl.fl=fulltext_1%2Cfulltext_2%2Cfulltext_3%2Cfulltext_4%2Cfulltext_5%2Cfulltext_6%2Cfulltext_7%2Cfulltext_8%2Cfulltext_9
gives the normal results, with the following at the end:

<lst name="highlighting">
 <lst name="News.EFP.186500">
  <arr name="fulltext_1">
   <str>
     was <em>employed</em> on the G. T. R. as fireman met 
     his death in an accident on that road some yeara ago but three
   </str>
  </arr>
  <arr name="fulltext_4">
   <str>
     ^-f 6r-Ke.w-¥eaf!fl&apos;: Mr.-BradV whb is 
     <em>employed</em> in Windsor, was also at his borne for jSew Year
   </str>
  </arr>
  <arr name="fulltext_6">
   <str>
     <em>employed</em> at the Walkerville brewery op to a 
     short time ago,when illness ecessilater! his resignation. He
   </str>
  </arr>
  <arr name="fulltext_7">
   <str>
     . have entered intoan agreement to <em>employ</em> the 
     powerful tug Lntz to keep th&gt;e Detroit river between
   </str>
  </arr>
 </lst>
</lst>

You will notice that only the pages with hits on them appear in the highlight 
section.  From this point it would take a little work to parse the /[EMAIL 
PROTECTED] to get the * from fulltext_* for each document match.
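
As a sketch of that parsing step, something like the following Python would 
pull the page numbers out of the highlighting section.  The function name 
pages_with_hits and the trimmed sample string are my own illustration; the 
code assumes the field names follow the fulltext_<number> pattern used above.

```python
import xml.etree.ElementTree as ET


def pages_with_hits(response_xml, field_prefix="fulltext_"):
    """Return {document id: sorted page numbers} from a highlighting section."""
    root = ET.fromstring(response_xml)
    # Accept either a full response or the bare <lst name="highlighting">.
    if root.get("name") == "highlighting":
        highlighting = root
    else:
        highlighting = root.find("./lst[@name='highlighting']")
    if highlighting is None:
        return {}
    result = {}
    for doc in highlighting.findall("lst"):
        pages = []
        for arr in doc.findall("arr"):
            name = arr.get("name", "")
            suffix = name[len(field_prefix):]
            # Keep only fields of the form <prefix><number>.
            if name.startswith(field_prefix) and suffix.isdigit():
                pages.append(int(suffix))
        result[doc.get("name")] = sorted(pages)
    return result


# Trimmed-down version of the output shown above.
sample = """<lst name="highlighting">
 <lst name="News.EFP.186500">
  <arr name="fulltext_1"><str>was <em>employed</em> on the G. T. R.</str></arr>
  <arr name="fulltext_4"><str><em>employed</em> in Windsor</str></arr>
 </lst>
</lst>"""
print(pages_with_hits(sample))
```

This recovers the page numbers, but only by string-matching on field names, and 
it still says nothing about term positions within the page.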

I agree that the highlighter is a good model for what we want to do.  The 
difficulty I'm finding is the up-front part: we need to store the 
position-to-page mapping in one field while, at the same time, analyzing the 
full page text into another field for searching.
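
For reference, the search-time half of the idea is simple once the mapping 
exists.  Here is a minimal Python sketch of the lookup, assuming the map is 
available as a sorted list of (first term position, page id) pairs.  The 
14324/234 pair is from the issue description quoted below; the other entries 
and the name page_for_position are made up for illustration.

```python
from bisect import bisect_right


def page_for_position(page_starts, position):
    """Map a term position to a page id.

    page_starts: list of (first_term_position, page_id) sorted by position.
    Returns None if the position precedes the first page milestone.
    """
    starts = [first for first, _ in page_starts]
    # Index of the last page whose first term is at or before the position.
    index = bisect_right(starts, position) - 1
    if index < 0:
        return None
    return page_starts[index][1]


# Hypothetical map; only the (14324, "234") entry comes from the issue text.
page_map = [(0, "233"), (14324, "234"), (14800, "235")]
print(page_for_position(page_map, 14325))
```

The hard part is not this lookup but getting the map built during analysis and 
stored alongside the indexed text.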

I don't think defining a FieldType will allow us to do this.  As far as I can 
tell, a FieldType controls how the value of your field is output (write()) 
and how it is sorted, but not how Fields with your FieldType are indexed or 
queried.

Would someone more familiar with the innards of Solr recommend I pursue the 
SOLR-247 problem, or continue hunting for a solution in the manner that I've 
been pursuing in this issue?

> There's no way to convert search results into page-level hits of a 
> "structured document".
> -----------------------------------------------------------------------------------------
>
>                 Key: SOLR-380
>                 URL: https://issues.apache.org/jira/browse/SOLR-380
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Tricia Williams
>            Priority: Minor
>
> "Paged-Text" FieldType for Solr
> A chance to dig into the guts of Solr. The problem: If we index a monograph 
> in Solr, there's no way to convert search results into page-level hits. The 
> solution: have a "paged-text" fieldtype which keeps track of page divisions 
> as it indexes, and reports page-level hits in the search results.
> The input would contain page milestones: <page id="234"/>. As Solr processed 
> the tokens (using its standard tokenizers and filters), it would concurrently 
> build a structural map of the item, indicating which term position marked the 
> beginning of which page: <page id="234" firstterm="14324"/>. This map would 
> be stored in an unindexed field in some efficient format.
> At search time, Solr would retrieve term positions for all hits that are 
> returned in the current request, and use the stored map to determine page ids 
> for each term position. The results would imitate the results for 
> highlighting, something like:
> <lst name="pages">
>   <lst name="doc1">
>     <int name="pageid">234</int>
>     <int name="pageid">236</int>
>   </lst>
>   <lst name="doc2">
>     <int name="pageid">19</int>
>   </lst>
> </lst>
> <lst name="hitpos">
>   <lst name="doc1">
>     <lst name="234">
>       <int name="pos">14325</int>
>     </lst>
>   </lst>
>   ...
> </lst>

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
