[ 
https://issues.apache.org/jira/browse/SOLR-380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tricia Williams updated SOLR-380:
---------------------------------

    Description: 
"Paged-Text" FieldType for Solr

A chance to dig into the guts of Solr. The problem: If we index a monograph in 
Solr, there's no way to convert search results into page-level hits. The 
solution: have a "paged-text" fieldtype which keeps track of page divisions as 
it indexes, and reports page-level hits in the search results.

The input would contain page milestones: <page id="234"/>. As Solr processed 
the tokens (using its standard tokenizers and filters), it would concurrently 
build a structural map of the item, indicating which term position marked the 
beginning of which page: <page id="234" firstterm="14324"/>. This map would be 
stored in an unindexed field in some efficient format.
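
A minimal sketch of how such a structural map might be built, under stated assumptions. This is not Solr code. The class and method names (PageMapBuilder, PageStart, build, countTerms) are hypothetical, and a plain whitespace split stands in for Solr's tokenizers and filters. Positions are assumed to be zero-based.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration only; not part of Solr. */
public class PageMapBuilder {

    /** One map entry: a page id and the position of the first term on that page. */
    public static final class PageStart {
        public final String pageId;
        public final int firstTerm;

        PageStart(String pageId, int firstTerm) {
            this.pageId = pageId;
            this.firstTerm = firstTerm;
        }
    }

    // Matches page milestones of the form <page id="234"/>.
    private static final Pattern MILESTONE = Pattern.compile("<page id=\"([^\"]+)\"/>");

    /** Walks the input once, counting terms and recording where each page begins. */
    public static List<PageStart> build(String input) {
        List<PageStart> map = new ArrayList<>();
        int position = 0; // number of terms seen so far
        int cursor = 0;   // character offset just past the previous milestone
        Matcher m = MILESTONE.matcher(input);
        while (m.find()) {
            position += countTerms(input.substring(cursor, m.start()));
            map.add(new PageStart(m.group(1), position));
            cursor = m.end();
        }
        return map;
    }

    // Assumption: whitespace tokenization stands in for the real analysis chain.
    static int countTerms(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }
}
```

Calling PageMapBuilder.build on the field text returns one PageStart per milestone, in document order. That list is the information the unindexed field would need to store.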

At search time, Solr would retrieve term positions for all hits that are 
returned in the current request, and use the stored map to determine page ids 
for each term position. The results would imitate the results for highlighting, 
something like:

<lst name="pages">
        <lst name="doc1">
                <int name="pageid">234</int>
                <int name="pageid">236</int>
        </lst>
        <lst name="doc2">
                <int name="pageid">19</int>
        </lst>
</lst>
<lst name="hitpos">
        <lst name="doc1">
                <lst name="234">
                        <int name="pos">14325</int>
                </lst>
        </lst>
        ...
</lst>
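
The search-time lookup described above can be sketched as a floor search over the stored map: a term position belongs to the page with the greatest firstterm that does not exceed it. The sketch below assumes the map has already been loaded into memory. PageLookup and its methods are hypothetical names, not Solr APIs.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical illustration only; not part of Solr. */
public class PageLookup {

    // Key: position of the first term on a page. Value: that page's id.
    private final TreeMap<Integer, String> pageByFirstTerm = new TreeMap<>();

    public void addPage(String pageId, int firstTerm) {
        pageByFirstTerm.put(firstTerm, pageId);
    }

    /** Returns the id of the page containing termPosition, or null if it precedes every page. */
    public String pageFor(int termPosition) {
        Map.Entry<Integer, String> entry = pageByFirstTerm.floorEntry(termPosition);
        return entry == null ? null : entry.getValue();
    }

    /** Groups hit positions by page id, keeping pages in order of first appearance. */
    public Map<String, List<Integer>> groupByPage(int[] hitPositions) {
        Map<String, List<Integer>> hitsByPage = new LinkedHashMap<>();
        for (int pos : hitPositions) {
            String pageId = pageFor(pos);
            if (pageId != null) {
                hitsByPage.computeIfAbsent(pageId, k -> new ArrayList<>()).add(pos);
            }
        }
        return hitsByPage;
    }
}
```

With a page registered as addPage("234", 14324), pageFor(14325) resolves to "234", matching the hitpos example. The keys of groupByPage would supply the "pages" list for a document, and its values the per-page "pos" entries.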

        Summary: There's no way to convert search results into page-level hits 
of a "structured document".  (was: The problem: If we index a monograph in 
Solr, there's no way to convert search results into page-level hits. The 
solution: have a "paged-text" fieldtype which keeps track of page divisions as 
it indexes, and reports page-level hits in the search results.)

> There's no way to convert search results into page-level hits of a 
> "structured document".
> -----------------------------------------------------------------------------------------
>
>                 Key: SOLR-380
>                 URL: https://issues.apache.org/jira/browse/SOLR-380
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Tricia Williams
>            Priority: Minor

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
