[ https://issues.apache.org/jira/browse/SOLR-380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795386#action_12795386 ]
Shairon Toledo commented on SOLR-380:
-------------------------------------

I have a project that involves words extracted by OCR. Each page has words, and each word has its geometry, which is used to blink a highlight for the end user.

I've been trying to represent this document structure in XML:

{code:xml}
<document>
  <page num="1">
    <term top='111' bottom='222' right='333' left='444'>foo</term>
    <term top='211' bottom='322' right='833' left='944'>bar</term>
    <term top='311' bottom='422' right='733' left='144'>baz</term>
    <term top='411' bottom='522' right='633' left='244'>qux</term>
  </page>
  <page num="2">
    <term .... />
  </page>
</document>
{code}

Using the field 'fulltext_st',

{code:xml}
<field name="fulltext_st">
  <document>
    <page top='111' bottom='222' right='333' left='444' word='foo' num='1'>foo</page>
    <page top='211' bottom='322' right='833' left='944' word='bar' num='1'>bar</page>
    <page top='311' bottom='422' right='733' left='144' word='baz' num='1'>baz</page>
    <page top='411' bottom='522' right='633' left='244' word='qux' num='1'>qux</page>
  </document>
</field>
{code}

I can get all the terms in my search results together with their payloads. But if I search using a phrase query, I get no results at all.

Example:

*search?q=foo*
{code:xml}
<lst name="fulltext_st">
  <int name="/document/page[word='foo'][num='1'][top='111'][bottom='222'][right='333'][left='444']">1</int>
</lst>
{code}

*search?q=foo+bar*
{code:xml}
<lst name="fulltext_st">
  <int name="/document/page[word='foo'][num='1'][top='111'][bottom='222'][right='333'][left='444']">1</int>
  <int name="/document/page[word='baz'][num='1'][top='211'][bottom='322'][right='833'][left='944']">1</int>
</lst>
{code}

*/search?q="foo bar"*
{code:xml}
*nothing*
{code}

I was wondering whether xmlpayload supports this sort of thing, or how easily I could update the code to support it.

Thank you in advance.

> There's no way to convert search results into page-level hits of a
> "structured document".
> -----------------------------------------------------------------------------------------
>
>                 Key: SOLR-380
>                 URL: https://issues.apache.org/jira/browse/SOLR-380
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Tricia Williams
>            Priority: Minor
>             Fix For: 1.5
>
>         Attachments: SOLR-380-XmlPayload.patch, SOLR-380-XmlPayload.patch,
>                      xmlpayload-example.zip, xmlpayload-src.jar, xmlpayload.jar
>
>
> "Paged-Text" FieldType for Solr
> A chance to dig into the guts of Solr. The problem: If we index a monograph
> in Solr, there's no way to convert search results into page-level hits. The
> solution: have a "paged-text" fieldtype which keeps track of page divisions
> as it indexes, and reports page-level hits in the search results.
> The input would contain page milestones: <page id="234"/>. As Solr processed
> the tokens (using its standard tokenizers and filters), it would concurrently
> build a structural map of the item, indicating which term position marked the
> beginning of which page: <page id="234" firstterm="14324"/>. This map would
> be stored in an unindexed field in some efficient format.
> At search time, Solr would retrieve term positions for all hits that are
> returned in the current request, and use the stored map to determine page ids
> for each term position. The results would imitate the results for
> highlighting, something like:
> <lst name="pages">
>   <lst name="doc1">
>     <int name="pageid">234</int>
>     <int name="pageid">236</int>
>   </lst>
>   <lst name="doc2">
>     <int name="pageid">19</int>
>   </lst>
> </lst>
> <lst name="hitpos">
>   <lst name="doc1">
>     <lst name="234">
>       <int name="pos">14325</int>
>     </lst>
>   </lst>
>   ...
> </lst>

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
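
For readers following the quoted issue description: the lookup it proposes (a stored map from each page to the position of its first term, consulted at search time to turn hit positions into page ids) can be sketched roughly as below. This is a minimal Python illustration only, not Solr or xmlpayload code. The names `PageMap`, `page_for`, and `pages_for_hits` are hypothetical, and so are the first-term positions used for pages 235 and 236 in the usage note; only page 234 starting at term 14324 comes from the issue text.

```python
from bisect import bisect_right


class PageMap:
    """Hypothetical page map: resolves a term position to the page containing it."""

    def __init__(self, milestones):
        # milestones: iterable of (page_id, first_term_position) pairs,
        # mirroring <page id="234" firstterm="14324"/> in the issue text.
        ordered = sorted(milestones, key=lambda m: m[1])
        self._first_terms = [pos for _, pos in ordered]
        self._page_ids = [page_id for page_id, _ in ordered]

    def page_for(self, position):
        """Return the id of the page containing `position`.

        Returns None if the position comes before the first page milestone.
        """
        # Index of the last page whose first term is <= position.
        idx = bisect_right(self._first_terms, position) - 1
        return self._page_ids[idx] if idx >= 0 else None


def pages_for_hits(page_map, hit_positions):
    """Group hit term positions by page id (assumed shape of the 'hitpos' output)."""
    grouped = {}
    for pos in sorted(hit_positions):
        page_id = page_map.page_for(pos)
        if page_id is not None:
            grouped.setdefault(page_id, []).append(pos)
    return grouped
```

With a map built as `PageMap([("234", 14324), ("235", 14810), ("236", 15102)])`, calling `pages_for_hits` on the hit positions `[14325, 15200]` groups position 14325 under page "234" and position 15200 under page "236". A binary search over the sorted first-term positions keeps each lookup logarithmic in the number of pages, which seems a reasonable fit for a map stored once per document; the actual storage format is left open by the issue.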