[jira] Created: (SOLR-1819) Upgrade to Tika 0.7
Upgrade to Tika 0.7 --- Key: SOLR-1819 URL: https://issues.apache.org/jira/browse/SOLR-1819 Project: Solr Issue Type: Improvement Reporter: Tricia Williams Assignee: Grant Ingersoll Priority: Minor Fix For: 1.5 See title. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1235) disallow period (.) in entity names
[ https://issues.apache.org/jira/browse/SOLR-1235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742526#action_12742526 ] Tricia Williams commented on SOLR-1235: --- This commit causes the example-DIH to fail with DataImportHandlerException: Entity must have name '. The reason is that the entity on line 3 of trunk/example/example-DIH/solr/mail/conf/data-config.xml is missing the name attribute which causes the condition on line 177 of org.apache.solr.handler.dataimport.DataConfig to fail. The simple solution is to add a name attribute to the offending entity. The complex solution would be to change the DataConfig test so that null is accepted as a name, but the period is not. What do you think? Other info: I start the example-DIH webapp as described:
{code}
java -Dsolr.solr.home=./example-DIH/solr/ -jar start.jar
{code}
And the error appears:
{panel}
HTTP ERROR: 500
Severe errors in solr configuration. Check your log files for more detailed information on what may be wrong. If you want solr to continue after configuration errors, change: <abortOnConfigurationError>false</abortOnConfigurationError> in solr.xml
-
org.apache.solr.common.SolrException: FATAL: Could not create importer. DataImporter config invalid
    at org.apache.solr.handler.dataimport.DataImportHandler.inform(DataImportHandler.java:121)
    at org.apache.solr.core.SolrResourceLoader.inform(SolrResourceLoader.java:415)
    at org.apache.solr.core.SolrCore.init(SolrCore.java:574)
    at org.apache.solr.core.CoreContainer.create(CoreContainer.java:381)
    at org.apache.solr.core.CoreContainer.load(CoreContainer.java:241)
    at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:115)
    at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:83)
    at org.mortbay.jetty.servlet.FilterHolder.doStart(FilterHolder.java:99)
    at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
    at org.mortbay.jetty.servlet.ServletHandler.initialize(ServletHandler.java:594)
    at org.mortbay.jetty.servlet.Context.startContext(Context.java:139)
    at org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1218)
    at org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.java:500)
    at org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:448)
    at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
    at org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:147)
    at org.mortbay.jetty.handler.ContextHandlerCollection.doStart(ContextHandlerCollection.java:161)
    at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
    at org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:147)
    at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
    at org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java:117)
    at org.mortbay.jetty.Server.doStart(Server.java:210)
    at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
    at org.mortbay.xml.XmlConfiguration.main(XmlConfiguration.java:929)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.mortbay.start.Main.invokeMain(Main.java:183)
    at org.mortbay.start.Main.start(Main.java:497)
    at org.mortbay.start.Main.main(Main.java:115)
Caused by: org.apache.solr.handler.dataimport.DataImportHandlerException: Exception occurred while initializing context
    at org.apache.solr.handler.dataimport.DataImporter.loadDataConfig(DataImporter.java:182)
    at org.apache.solr.handler.dataimport.DataImporter.init(DataImporter.java:99)
    at org.apache.solr.handler.dataimport.DataImportHandler.inform(DataImportHandler.java:113)
    ... 30 more
Caused by: org.apache.solr.handler.dataimport.DataImportHandlerException: Entity must have name '
    at org.apache.solr.handler.dataimport.DataConfig$Entity.init(DataConfig.java:118)
    at org.apache.solr.handler.dataimport.DataConfig$Document.init(DataConfig.java:72)
    at org.apache.solr.handler.dataimport.DataConfig.readFromXml(DataConfig.java:240)
    at org.apache.solr.handler.dataimport.DataImporter.loadDataConfig(DataImporter.java:178)
    ... 32 more
-
org.apache.solr.handler.dataimport.DataImportHandlerException: Exception occurred while initializing context
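For discussion, the "complex solution" from the comment above might look roughly like the sketch below. It is not the actual DataConfig code: the class and method names (`EntityNameCheck`, `isAcceptable`, `validate`) are hypothetical, and it assumes that a missing (null) name should be allowed while any name containing a period is rejected.

```java
// Hypothetical sketch; NOT the real org.apache.solr.handler.dataimport.DataConfig check.
final class EntityNameCheck {
    private EntityNameCheck() {}

    /** Returns true when the name is absent (null) or contains no period. */
    static boolean isAcceptable(String name) {
        return name == null || name.indexOf('.') < 0;
    }

    /** Throws if the name contains a period; a missing name is allowed. */
    static void validate(String name) {
        if (!isAcceptable(name)) {
            throw new IllegalArgumentException(
                "Entity name must not contain a period (.): '" + name + "'");
        }
    }
}
```

With a check of this shape, the nameless entity in the mail example would load, while a name such as `a.b` would still be refused.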
[jira] Commented: (SOLR-380) There's no way to convert search results into page-level hits of a structured document.
[ https://issues.apache.org/jira/browse/SOLR-380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12665195#action_12665195 ] Tricia Williams commented on SOLR-380: -- Hi Laurent, Thanks for your interest in my Solr PayloadComponent plugin. I want to address all of the questions you pose in your comment, but won't have time until early February. I apologize for the inconvenience but my priorities lie elsewhere right now. Feel free to look at the code and play in the meantime. The code that's up there is basically proof of concept. I've been slowly working at improving the robustness of the code and improving performance so hopefully there will be an improved version before the end of March. I'm sure there would be many people who would appreciate a Wiki page for this topic. Why don't you go ahead and set that up? I'll be happy to add my two cents when I'm available. All the best, Tricia There's no way to convert search results into page-level hits of a structured document. - Key: SOLR-380 URL: https://issues.apache.org/jira/browse/SOLR-380 Project: Solr Issue Type: New Feature Components: search Reporter: Tricia Williams Priority: Minor Fix For: 1.4 Attachments: SOLR-380-XmlPayload.patch, SOLR-380-XmlPayload.patch, xmlpayload-example.zip, xmlpayload-src.jar, xmlpayload.jar Paged-Text FieldType for Solr A chance to dig into the guts of Solr. The problem: If we index a monograph in Solr, there's no way to convert search results into page-level hits. The solution: have a paged-text fieldtype which keeps track of page divisions as it indexes, and reports page-level hits in the search results. The input would contain page milestones: <page id="234"/>. As Solr processed the tokens (using its standard tokenizers and filters), it would concurrently build a structural map of the item, indicating which term position marked the beginning of which page: <page id="234" firstterm="14324"/>. This map would be stored in an unindexed field in some efficient format. At search time, Solr would retrieve term positions for all hits that are returned in the current request, and use the stored map to determine page ids for each term position. The results would imitate the results for highlighting, something like:
<lst name="pages">
  <lst name="doc1">
    <int name="pageid">234</int>
    <int name="pageid">236</int>
  </lst>
  <lst name="doc2">
    <int name="pageid">19</int>
  </lst>
</lst>
<lst name="hitpos">
  <lst name="doc1">
    <lst name="234">
      <int name="pos">14325</int>
    </lst>
  </lst>
  ...
</lst>
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
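The lookup step described in the issue (stored map of first term positions, then term position to page id at search time) can be illustrated with a small sketch. This is only an illustration under assumptions: `PageMap`, `addPage` and `pageFor` are hypothetical names, the map is held in memory rather than in an unindexed Solr field, and nothing here reflects the attached patch.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of the "structural map": first term position -> page id.
final class PageMap {
    private final TreeMap<Integer, String> firstTermToPage = new TreeMap<>();

    /** Records that the page with the given id starts at the given term position. */
    void addPage(String pageId, int firstTerm) {
        firstTermToPage.put(firstTerm, pageId);
    }

    /** Returns the id of the page containing the term position, or null if it precedes every page. */
    String pageFor(int termPosition) {
        // floorEntry finds the page whose first term is the greatest one <= termPosition.
        Map.Entry<Integer, String> entry = firstTermToPage.floorEntry(termPosition);
        return entry == null ? null : entry.getValue();
    }
}
```

Using the numbers from the description, a page 234 registered with first term 14324 would be returned for a hit at position 14325. A sorted map keeps each lookup logarithmic in the number of pages, which seems adequate for monographs of up to thousands of pages.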
[jira] Commented: (SOLR-854) Add 'run example' to build.xml
[ https://issues.apache.org/jira/browse/SOLR-854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12647476#action_12647476 ] Tricia Williams commented on SOLR-854: -- Running the example is something I frequently do so having it in the ant script is something I'd find useful. Add 'run example' to build.xml -- Key: SOLR-854 URL: https://issues.apache.org/jira/browse/SOLR-854 Project: Solr Issue Type: New Feature Reporter: Mark Miller Priority: Trivial Attachments: SOLR-854.patch Working in eclipse, I find it really convenient for debugging/testing to have a 'run-example' target in the build file. Anyone else? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-532) WordDelimiterFilter ignores payloads
[ https://issues.apache.org/jira/browse/SOLR-532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12641694#action_12641694 ] Tricia Williams commented on SOLR-532: -- Thanks Grant. That's much cleaner using the new clone method. It works for me after catching up with the new slf4j logging. Thanks too for committing it! WordDelimiterFilter ignores payloads Key: SOLR-532 URL: https://issues.apache.org/jira/browse/SOLR-532 Project: Solr Issue Type: Bug Reporter: Tricia Williams Assignee: Grant Ingersoll Priority: Minor Attachments: SOLR-532-WordDelimiterFilter.patch When a WordDelimiterFilter ingests a token stream and creates a new token (newTok) it appears to copy most of the old token attributes, except the payload. I believe this is a bug. My solution is for the WordDelimiterFilter to use the Token clone() method to create a carbon copy and then modify the appropriate attributes (offsets and term text). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (SOLR-380) There's no way to convert search results into page-level hits of a structured document.
[ https://issues.apache.org/jira/browse/SOLR-380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tricia Williams updated SOLR-380: - Attachment: (was: lucene-core-2.3-dev.jar) There's no way to convert search results into page-level hits of a structured document. - Key: SOLR-380 URL: https://issues.apache.org/jira/browse/SOLR-380 Project: Solr Issue Type: New Feature Components: search Reporter: Tricia Williams Priority: Minor Attachments: SOLR-380-XmlPayload.patch, SOLR-380-XmlPayload.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (SOLR-532) WordDelimiterFilter ignores payloads
WordDelimiterFilter ignores payloads Key: SOLR-532 URL: https://issues.apache.org/jira/browse/SOLR-532 Project: Solr Issue Type: Bug Reporter: Tricia Williams Priority: Minor When a WordDelimiterFilter ingests a token stream and creates a new token (newTok) it appears to copy most of the old token attributes, except the payload. I believe this is a bug. My solution is for the WordDelimiterFilter to use the Token clone() method to create a carbon copy and then modify the appropriate attributes (offsets and term text). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
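The clone-then-modify idea proposed above can be sketched without any Lucene dependency. `SimpleToken` below is a simplified stand-in, not the real `org.apache.lucene.analysis.Token`, and `subToken` is a hypothetical helper; the point is only that cloning first carries over every attribute, payload included, before the offsets and term text are adjusted.

```java
// Simplified stand-in for a token; NOT the real Lucene Token API.
final class SimpleToken implements Cloneable {
    String term;
    int startOffset;
    int endOffset;
    byte[] payload; // may be null

    SimpleToken(String term, int startOffset, int endOffset, byte[] payload) {
        this.term = term;
        this.startOffset = startOffset;
        this.endOffset = endOffset;
        this.payload = payload;
    }

    @Override
    public SimpleToken clone() {
        try {
            SimpleToken copy = (SimpleToken) super.clone();
            // Copy the payload array so the clone does not share it with the original.
            copy.payload = (payload == null) ? null : payload.clone();
            return copy;
        } catch (CloneNotSupportedException e) {
            throw new AssertionError(e); // cannot happen: the class is Cloneable
        }
    }

    /** Builds a sub-token for term[from, to): clone everything, then fix term text and offsets. */
    static SimpleToken subToken(SimpleToken original, int from, int to) {
        SimpleToken part = original.clone();
        part.term = original.term.substring(from, to);
        part.startOffset = original.startOffset + from;
        part.endOffset = original.startOffset + to;
        return part;
    }
}
```

Copying attributes one by one, by contrast, silently drops anything added to the token later, which appears to be how the payload was lost.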
[jira] Updated: (SOLR-532) WordDelimiterFilter ignores payloads
[ https://issues.apache.org/jira/browse/SOLR-532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tricia Williams updated SOLR-532: - Attachment: SOLR-532-WordDelimiterFilter.patch Quick fix. Does this need a unit test to go with it? WordDelimiterFilter ignores payloads Key: SOLR-532 URL: https://issues.apache.org/jira/browse/SOLR-532 Project: Solr Issue Type: Bug Reporter: Tricia Williams Priority: Minor Attachments: SOLR-532-WordDelimiterFilter.patch When a WordDelimiterFilter ingests a token stream and creates a new token (newTok) it appears to copy most of the old token attributes, except the payload. I believe this is a bug. My solution is for the WordDelimiterFilter to use the Token clone() method to create a carbon copy and then modify the appropriate attributes (offsets and term text). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (SOLR-522) analysis.jsp doesn't show payloads created/modified by tokenizers and tokenfilters
analysis.jsp doesn't show payloads created/modified by tokenizers and tokenfilters -- Key: SOLR-522 URL: https://issues.apache.org/jira/browse/SOLR-522 Project: Solr Issue Type: Improvement Components: web gui Reporter: Tricia Williams Priority: Trivial Add payload content to the verbose output of the analysis.jsp page for debugging purposes. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (SOLR-522) analysis.jsp doesn't show payloads created/modified by tokenizers and tokenfilters
[ https://issues.apache.org/jira/browse/SOLR-522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tricia Williams updated SOLR-522: - Attachment: SOLR-522-analysis.jsp.patch Added an if block to analysis.jsp which converts the Payload's byte stream directly to a String for display. This might not suit the use case of all payloads, so it may need to be revisited as those emerge. analysis.jsp doesn't show payloads created/modified by tokenizers and tokenfilters -- Key: SOLR-522 URL: https://issues.apache.org/jira/browse/SOLR-522 Project: Solr Issue Type: Improvement Components: web gui Reporter: Tricia Williams Priority: Trivial Attachments: SOLR-522-analysis.jsp.patch Original Estimate: 0.17h Remaining Estimate: 0.17h Add payload content to the verbose output of the analysis.jsp page for debugging purposes. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
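One possible way to handle payloads that are not text is sketched below. This is not the code in the attached patch (which, per the comment, converts the bytes directly to a String); `PayloadDisplay` and `toDisplayString` are hypothetical names, and the choice of strict UTF-8 decoding with a hex fallback is an assumption about what might be useful on a debugging page.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; not part of SOLR-522-analysis.jsp.patch.
final class PayloadDisplay {
    private PayloadDisplay() {}

    /** Shows the payload as text when it is valid UTF-8, otherwise as a hex string. */
    static String toDisplayString(byte[] payload) {
        if (payload == null) {
            return "";
        }
        try {
            // REPORT makes the decoder throw instead of silently substituting characters.
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(payload))
                    .toString();
        } catch (CharacterCodingException e) {
            StringBuilder hex = new StringBuilder("0x");
            for (byte b : payload) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        }
    }
}
```

Binary payloads (for example encoded numbers) would then show up as hex rather than as unreadable characters.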
[jira] Updated: (SOLR-386) Add confuguration to specify SolrHighlighter implementation
[ https://issues.apache.org/jira/browse/SOLR-386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tricia Williams updated SOLR-386: - Attachment: SOLR-386-SolrHighlighter.patch OK. So I think I fixed the whitespace problem. Thanks for explaining the problem with interfaces -- that makes sense now. I assume that EventListener and RequestHandler are interfaces because they've been thought long and hard about and are not going to change? My first try at the patch was just to include the public methods, which are the ones you (Mike Klaas) list: .initialize(Config) .isHighlightingEnabled(SolrParams) .doHighlighting(...) .getHighlightFields(...) I discovered that the unit tests call the formatters and fragmenters directly, so in the interface version I had made public get methods for these. Now that we're using an abstract class I am able to just include these as is -- so no changes to HighlighterTest this time. But speaking of unit tests... Anecdotally I know that the SolrCore changes allow the highlighter to be configured (my custom highlighter). I wrote HighlighterConfigTest as a unit test for this functionality. I decided to leave the default implementations of isHighlightingEnabled(SolrParams) and getHighlightFields(...) in the abstract class because both methods deal with reading parameters. I can't think of a use case for a highlighter that wouldn't use these methods or, at worst, ignore/override them. Is this a reasonable decision? I wasn't sure what to do with the logger, so I left it in the AbstractSolrHighlighter. This decision is based on the example of UpdateHandler. Hmm... I just realized that naming the abstraction of SolrHighlighter AbstractSolrHighlighter causes problems all over the code when you want your custom highlighter to plug in. I think the path of least resistance is to call the abstract class SolrHighlighter and the existing implementation DefaultSolrHighlighter. Thoughts? 
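The design question raised above (default parameter-reading methods in an abstract class, with highlighting itself left to subclasses) can be illustrated with a stand-alone sketch. None of these types are Solr's: `SketchHighlighter` and `EmTagHighlighter` are hypothetical, a plain `Map<String, String>` stands in for SolrParams, and treating `hl=on` or `hl=true` as "enabled" is an assumption made for the example.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Stand-in sketch; the real SolrHighlighter works with SolrParams, not a Map.
abstract class SketchHighlighter {
    /** Default: highlighting is on when the "hl" parameter is "true" or "on". */
    boolean isHighlightingEnabled(Map<String, String> params) {
        String hl = params.get("hl");
        return "true".equalsIgnoreCase(hl) || "on".equalsIgnoreCase(hl);
    }

    /** Default: fields come from the comma-separated "hl.fl" parameter. */
    List<String> getHighlightFields(Map<String, String> params) {
        String fl = params.get("hl.fl");
        if (fl == null || fl.trim().isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(fl.trim().split("\\s*,\\s*"));
    }

    /** Subclasses supply the actual highlighting. */
    abstract String doHighlighting(String text, String term);
}

// A trivial custom highlighter that inherits both parameter-reading defaults.
final class EmTagHighlighter extends SketchHighlighter {
    @Override
    String doHighlighting(String text, String term) {
        return text.replace(term, "<em>" + term + "</em>");
    }
}
```

A custom highlighter written this way only has to implement `doHighlighting`, and can still override either default if it reads its parameters differently.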
Add confuguration to specify SolrHighlighter implementation --- Key: SOLR-386 URL: https://issues.apache.org/jira/browse/SOLR-386 Project: Solr Issue Type: Improvement Components: highlighter Affects Versions: 1.3 Reporter: Eli Levine Assignee: Mike Klaas Attachments: SOLR-386-SolrHighlighter.patch, SOLR-386-SolrHighlighter.patch, SOLR-386-SolrHighlighter.patch, SOLR-386-SolrHighlighter.patch, SOLR-386-SolrHighlighter.patch It would be great if SolrCore allowed the highlighter class to be configurable. A good way would be to add a +class+ attribute to the highlighting element in solrconfig.xml, similar to how the RequestHandler instance is initialized in SolrCore. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (SOLR-386) Add confuguration to specify SolrHighlighter implementation
[ https://issues.apache.org/jira/browse/SOLR-386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tricia Williams updated SOLR-386: - Attachment: SOLR-386-SolrHighlighter.patch I'd really like some feedback on this patch. I've just updated the patch based on changes that have been made to SolrHighlighter.java since revision 594314. Eli, does this meet your needs? This is all I need in SOLR-380 to plug in a custom highlighter. I would really appreciate it if this could be committed by someone so that I can stop worrying about keeping up with revisions. It has been assigned to Mike Klaas, so his feedback in particular would be valuable to me. Thanks, Tricia Add confuguration to specify SolrHighlighter implementation --- Key: SOLR-386 URL: https://issues.apache.org/jira/browse/SOLR-386 Project: Solr Issue Type: Improvement Components: highlighter Affects Versions: 1.3 Reporter: Eli Levine Assignee: Mike Klaas Attachments: SOLR-386-SolrHighlighter.patch, SOLR-386-SolrHighlighter.patch, SOLR-386-SolrHighlighter.patch, SOLR-386-SolrHighlighter.patch It would be great if SolrCore allowed the highlighter class to be configurable. A good way would be to add a +class+ attribute to the highlighting element in solrconfig.xml, similar to how the RequestHandler instance is initialized in SolrCore. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (SOLR-380) There's no way to convert search results into page-level hits of a structured document.
[ https://issues.apache.org/jira/browse/SOLR-380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tricia Williams updated SOLR-380: - Attachment: SOLR-380-XmlPayload.patch Functionality is improved. Tests are more complete. I have included an example (much like the example included in Solr) which demonstrates the changes needed to solrconfig.xml and schema.xml, as well as some XML documents to start playing with. TODO: * Still have to track down what happens when filters are applied to the Tokenizer. * Implement error handling for bad XML input. There's no way to convert search results into page-level hits of a structured document. - Key: SOLR-380 URL: https://issues.apache.org/jira/browse/SOLR-380 Project: Solr Issue Type: New Feature Components: search Reporter: Tricia Williams Priority: Minor Attachments: lucene-core-2.3-dev.jar, SOLR-380-XmlPayload.patch, SOLR-380-XmlPayload.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (SOLR-380) There's no way to convert search results into page-level hits of a structured document.
[ https://issues.apache.org/jira/browse/SOLR-380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tricia Williams updated SOLR-380: - Attachment: lucene-core-2.3-dev.jar, SOLR-380-XmlPayload.patch This is a draft. Note that the Payload and Token classes in particular have changed since lucene-core-2.2.0.jar. Users of this patch will need to replace lucene-core-2.2.0.jar with lucene-core-2.3-dev.jar. I have created a test for XmlPayloadCharTokenizer but not attached it here because LuceneTestCase is not in SOLR's classpath in any form and it will break the build. The code works in theory and passes tests to that effect. However, in practice, when I deploy the war created from the dist ant target, several problems result from adding documents (which seems to work using a <![CDATA[...]]> to contain the structured document and post.jar):
* after adding an XmlPayload tokenized document, q=*:* causes a 500 error:
HTTP Status 500 - read past EOF
java.io.IOException: read past EOF
    at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:146)
    at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:38)
    at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:76)
    at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:153)
    at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:408)
    at org.apache.lucene.index.MultiSegmentReader.document(MultiSegmentReader.java:129)
    at org.apache.lucene.index.IndexReader.document(IndexReader.java:436)
    at ...
* use of the highlight fields produces the same error.
* queries that should match an XmlPayload tokenized document do not ( //[EMAIL PROTECTED]'0'])-- though queries matching un-XmlPayload tokenized documents continue to return the expected results.
* trying to view the index using Luke (Lucene Index Toolbox, v 0.7.1 (2007-06-20) ) returns: Unknown format version: -4
* Solr Statistics confirm that all the documents have been added.
I will continue to finish this functionality but any suggestions or other input are welcomed. You will see how the functionality is intended to be used in src/test/org/apache/solr/highlight/XmlPayloadTest.java There's no way to convert search results into page-level hits of a structured document. - Key: SOLR-380 URL: https://issues.apache.org/jira/browse/SOLR-380 Project: Solr Issue Type: New Feature Components: search Reporter: Tricia Williams Priority: Minor Attachments: lucene-core-2.3-dev.jar, SOLR-380-XmlPayload.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (SOLR-386) Add confuguration to specify SolrHighlighter implementation
[ https://issues.apache.org/jira/browse/SOLR-386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tricia Williams updated SOLR-386: - Attachment: SOLR-386-SolrHighlighter.patch Updated patch to work with recent changes made to SolrCore. Should apply against a clean trunk again. No further changes. Add confuguration to specify SolrHighlighter implementation --- Key: SOLR-386 URL: https://issues.apache.org/jira/browse/SOLR-386 Project: Solr Issue Type: Improvement Components: highlighter Affects Versions: 1.3 Reporter: Eli Levine Attachments: SOLR-386-SolrHighlighter.patch, SOLR-386-SolrHighlighter.patch It would be great if SolrCore allowed the highlighter class to be configurable. A good way would be to add a +class+ attribute to the highlighting element in solrconfig.xml, similar to how the RequestHandler instance is initialized in SolrCore. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (SOLR-386) Add confuguration to specify SolrHighlighter implementation
[ https://issues.apache.org/jira/browse/SOLR-386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tricia Williams updated SOLR-386: - Attachment: SOLR-386-SolrHighlighter.patch This patch allows highlighting to be plugged in. What I did:
* Made SolrHighlighter an interface
* The old SolrHighlighter became DefaultSolrHighlighter
* Instantiate the highlighter in SolrCore based on what is in the solrconfig.xml
So to roll your own:
* Implement SolrHighlighter (e.g. org.apache.solr.highlight.MySolrHighlighter)
* Find <highlighting> in solrconfig.xml and change it to <highlighting class="org.apache.solr.highlight.MySolrHighlighter">
This patch builds on changes made to trunk by SOLR-281. This patch also contains these changes (meaning you should apply this patch to the trunk). I get the feeling that this is probably not the right way to build a dependent patch, but I don't know any better. Let me know if I should change how I built this patch. Add confuguration to specify SolrHighlighter implementation --- Key: SOLR-386 URL: https://issues.apache.org/jira/browse/SOLR-386 Project: Solr Issue Type: Improvement Components: highlighter Affects Versions: 1.3 Reporter: Eli Levine Attachments: SOLR-386-SolrHighlighter.patch It would be great if SolrCore allowed the highlighter class to be configurable. A good way would be to add a +class+ attribute to the highlighting element in solrconfig.xml, similar to how the RequestHandler instance is initialized in SolrCore. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
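The step "instantiate the highlighter based on what is in the config" usually comes down to loading a class by name, with a default when no class is given. The sketch below shows that general pattern only; `PluggableHighlighter`, `DefaultPluggableHighlighter` and `HighlighterLoader` are hypothetical names, and this is not how SolrCore or the patch actually does it.

```java
// Hypothetical names throughout; a generic illustration, not SolrCore's code.
interface PluggableHighlighter {
    String highlight(String text, String term);
}

final class DefaultPluggableHighlighter implements PluggableHighlighter {
    @Override
    public String highlight(String text, String term) {
        return text.replace(term, "<em>" + term + "</em>");
    }
}

final class HighlighterLoader {
    private HighlighterLoader() {}

    /** Loads the configured class, or falls back to the default when none is configured. */
    static PluggableHighlighter load(String className) {
        if (className == null || className.isEmpty()) {
            return new DefaultPluggableHighlighter();
        }
        try {
            // asSubclass rejects classes that do not implement the interface.
            Class<? extends PluggableHighlighter> clazz =
                    Class.forName(className).asSubclass(PluggableHighlighter.class);
            return clazz.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException | ClassCastException e) {
            throw new IllegalArgumentException("Cannot instantiate highlighter: " + className, e);
        }
    }
}
```

Failing fast with a clear message when the configured class is missing or of the wrong type tends to be easier to debug than a later ClassCastException at query time.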
[jira] Commented: (SOLR-380) There's no way to convert search results into page-level hits of a structured document.
[ https://issues.apache.org/jira/browse/SOLR-380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12535748 ] Tricia Williams commented on SOLR-380: -- The discussion from http://www.nabble.com/Structured-Lucene-documents-tf4234661.html#a12048390 gives one solution (which is more of a workaround in my opinion), but I don't think it is practical. The number of pages of the monographs we index varies greatly (10s to 1000s of pages). So while specifying each page_* (page_1,page_2,page_3,...,page_N) as a field to highlight will work, I don't think it is the cleanest solution because you have to infer page numbers from the highlighted samples. Furthermore, in order to get the highlighted samples you need to know the values of the * in a dynamic field, which sort of defeats the purpose of the dynamic field. If you wanted to use the position numbers themselves (for example using positions and OCR information to create highlighting on an original image), they are not available in the results. In answer to your question Peter, one must enable highlighting and list all the page_* fields for highlighter snippets. In the following example I have a dynamic field fulltext_*, copyfield fulltext, and defaultSearchField=fulltext: http://localhost:8080/solr/select?indent=on&version=2.2&q=employ&start=0&rows=10&fl=*%2Cscore&qt=standard&wt=standard&explainOther=&hl=on&hl.fl=fulltext_1%2Cfulltext_2%2Cfulltext_3%2Cfulltext_4%2Cfulltext_5%2Cfulltext_6%2Cfulltext_7%2Cfulltext_8%2Cfulltext_9 gives the normal results, with the following at the end:
<lst name="highlighting">
 <lst name="News.EFP.186500">
  <arr name="fulltext_1">
   <str>
     was <em>employed</em> on the G. T. R. as fireman met his death in an accident on that road some yeara ago but three
   </str>
  </arr>
  <arr name="fulltext_4">
   <str>
     ^-f 6r-Ke.w-¥eaf!fl&apos;: Mr.-BradV whb is <em>employed</em> in Windsor, was also at his borne for jSew Year
   </str>
  </arr>
  <arr name="fulltext_6">
   <str>
     <em>employed</em> at the Walkerville brewery op to a short time ago,when illness ecessilater! his resignation. He
   </str>
  </arr>
  <arr name="fulltext_7">
   <str>
     . have entered intoan agreement to <em>employ</em> the powerful tug Lntz to keep th&gt;e Detroit river between
   </str>
  </arr>
 </lst>
</lst>
You will notice that only the pages with hits on them appear in the highlight section. From this point it would take a little work to parse the /[EMAIL PROTECTED] to get the * from fulltext_* for each document match. I agree that the highlighter is a good model of what we want to do. But the difficulty I'm finding is the upfront part, where we need to store the position-to-page mapping in a field while at the same time we need to analyze the full page text into another field for searching. I don't think defining a FieldType will allow us to do this. The FieldType looks like it is useful in controlling what the output of your defined field is (write()), and how it is sorted, but not how Fields with your FieldType will be indexed or queried. Would someone more familiar with the innards of Solr recommend I pursue the SOLR-247 problem, or continue hunting for a solution in the manner that I've been pursuing in this issue? There's no way to convert search results into page-level hits of a structured document. 
---------------------------------------------------------------------------------------
Key: SOLR-380
URL: https://issues.apache.org/jira/browse/SOLR-380
Project: Solr
Issue Type: New Feature
Components: search
Reporter: Tricia Williams
Priority: Minor

Paged-Text FieldType for Solr

A chance to dig into the guts of Solr.

The problem: If we index a monograph in Solr, there's no way to convert search results into page-level hits. The solution: have a paged-text fieldtype which keeps track of page divisions as it indexes, and reports page-level hits in the search results.

The input would contain page milestones: <page id="234"/>. As Solr processed the tokens (using its standard tokenizers and filters), it would concurrently build a structural map of the item, indicating which term position marked the beginning of which page: <page id="234" firstterm="14324"/>. This map would be stored in an unindexed field in some efficient format.

At search time, Solr would retrieve term positions for all hits that are returned in the current request, and use the stored map to determine page ids for each term position. The results would imitate the results for highlighting, something like:

<lst name="pages">
  <lst name="doc1">
    <int name="pageid">234</int>
[jira] Issue Comment Edited: (SOLR-380) There's no way to convert search results into page-level hits of a structured document.
[ https://issues.apache.org/jira/browse/SOLR-380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12535748 ]

pgwillia edited comment on SOLR-380 at 10/17/07 3:13 PM:

The discussion from http://www.nabble.com/Structured-Lucene-documents-tf4234661.html#a12048390 gives one solution (which is more of a workaround in my opinion), but I don't think it is practical. The number of pages in the monographs we index varies greatly (10s to 1000s of pages). So while specifying each page_* (page_1,page_2,page_3,...,page_N) as a field to highlight will work, I don't think it is the cleanest solution because you have to infer page numbers from the highlighted samples. Furthermore, in order to get the highlighted samples you need to know the values of the * in a dynamic field, which sort of defeats the purpose of the dynamic field. If you wanted to use the position numbers themselves (for example using positions and OCR information to create highlighting on an original image), they are not available in the results.

In answer to your question Peter, one must enable highlighting and list all the page_* fields for highlighter snippets. In the following example I have a dynamic field fulltext_*, copyfield fulltext, and defaultSearchField=fulltext:

http://tinyurl.com/3xdshk (essentially shows the parameters and their values for this example -- pay attention to the hl.fl parameter)

gives the normal results, with the following at the end:

<lst name="highlighting">
 <lst name="News.EFP.186500">
  <arr name="fulltext_1">
   <str>
     was <em>employed</em> on the G. T. R. as fireman met his death in an accident on that road some yeara ago but three
   </str>
  </arr>
  <arr name="fulltext_4">
   <str>
     ^-f 6r-Ke.w-¥eaf!fl&apos;: Mr.-BradV whb is <em>employed</em> in Windsor, was also at his borne for jSew Year
   </str>
  </arr>
  <arr name="fulltext_6">
   <str>
     <em>employed</em> at the Walkerville brewery op to a short time ago,when illness ecessilater! his resignation. He
   </str>
  </arr>
  <arr name="fulltext_7">
   <str>
     . have entered intoan agreement to <em>employ</em> the powerful tug Lntz to keep th&gt;e Detroit river between
   </str>
  </arr>
 </lst>
</lst>

You will notice that only the pages with hits on them appear in the highlight section. From this point it would take a little work to parse the /[EMAIL PROTECTED] to get the * from fulltext_* for each document match.

I agree that the highlighter is a good model of what we want to do. But the difficulty I'm finding is the upfront part, where we need to store the position-to-page mapping in a field while at the same time we need to analyze the full page text into another field for searching. I don't think defining a FieldType will allow us to do this. The FieldType looks like it is useful in controlling what the output of your defined field is (write()) and how it is sorted, but not how Fields with your FieldType will be indexed or queried.

Would someone more familiar with the innards of Solr recommend I pursue the SOLR-247 problem, or continue hunting for a solution in the manner that I've been pursuing in this issue?

was (Author: pgwillia):

The discussion from http://www.nabble.com/Structured-Lucene-documents-tf4234661.html#a12048390 gives one solution (which is more of a workaround in my opinion), but I don't think it is practical. The number of pages in the monographs we index varies greatly (10s to 1000s of pages). So while specifying each page_* (page_1,page_2,page_3,...,page_N) as a field to highlight will work, I don't think it is the cleanest solution because you have to infer page numbers from the highlighted samples. Furthermore, in order to get the highlighted samples you need to know the values of the * in a dynamic field, which sort of defeats the purpose of the dynamic field. If you wanted to use the position numbers themselves (for example using positions and OCR information to create highlighting on an original image), they are not available in the results.

In answer to your question Peter, one must enable highlighting and list all the page_* fields for highlighter snippets. In the following example I have a dynamic field fulltext_*, copyfield fulltext, and defaultSearchField=fulltext:

http://localhost:8080/solr/select?indent=on&version=2.2&q=employ&start=0&rows=10&fl=*%2Cscore&qt=standard&wt=standard&explainOther=&hl=on&hl.fl=fulltext_1%2Cfulltext_2%2Cfulltext_3%2Cfulltext_4%2Cfulltext_5%2Cfulltext_6%2Cfulltext_7%2Cfulltext_8%2Cfulltext_9

gives the normal results, with the following at the end:

<lst name="highlighting">
 <lst name="News.EFP.186500">
  <arr name="fulltext_1">
   <str>
     was
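
The parsing step mentioned in the comment (recovering the * from fulltext_* for each document match) could look roughly like the sketch below. It is only an illustration, not code from the issue: the function name pages_with_hits is made up, the sample response is a shortened, hand-cleaned version of the highlighting section quoted above, and it assumes the page number is always the numeric suffix of the field name.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample shaped like the highlighting section quoted above;
# the snippet text is shortened and cleaned up for illustration.
SAMPLE = """<response>
<lst name="highlighting">
 <lst name="News.EFP.186500">
  <arr name="fulltext_1"><str>was <em>employed</em> on the G. T. R.</str></arr>
  <arr name="fulltext_4"><str>is <em>employed</em> in Windsor</str></arr>
  <arr name="fulltext_6"><str><em>employed</em> at the Walkerville brewery</str></arr>
 </lst>
</lst>
</response>"""

# Assumption: page fields are named fulltext_<page number>.
FIELD = re.compile(r"^fulltext_(\d+)$")


def pages_with_hits(xml_text):
    """Return {document id: sorted page numbers} from a highlighting section."""
    root = ET.fromstring(xml_text)
    result = {}
    # Each child <lst> of the highlighting list is one matching document.
    for doc in root.findall("./lst[@name='highlighting']/lst"):
        pages = []
        for arr in doc.findall("arr"):
            match = FIELD.match(arr.get("name", ""))
            if match:
                pages.append(int(match.group(1)))
        result[doc.get("name")] = sorted(pages)
    return result


print(pages_with_hits(SAMPLE))
```

This only recovers which pages had hits. As the comment notes, the term positions themselves are not in the highlighting output, so this approach cannot support highlighting on an original image.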
[jira] Updated: (SOLR-380) There's no way to convert search results into page-level hits of a structured document.
[ https://issues.apache.org/jira/browse/SOLR-380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tricia Williams updated SOLR-380:
---------------------------------

Description:

Paged-Text FieldType for Solr

A chance to dig into the guts of Solr.

The problem: If we index a monograph in Solr, there's no way to convert search results into page-level hits. The solution: have a paged-text fieldtype which keeps track of page divisions as it indexes, and reports page-level hits in the search results.

The input would contain page milestones: <page id="234"/>. As Solr processed the tokens (using its standard tokenizers and filters), it would concurrently build a structural map of the item, indicating which term position marked the beginning of which page: <page id="234" firstterm="14324"/>. This map would be stored in an unindexed field in some efficient format.

At search time, Solr would retrieve term positions for all hits that are returned in the current request, and use the stored map to determine page ids for each term position. The results would imitate the results for highlighting, something like:

<lst name="pages">
  <lst name="doc1">
    <int name="pageid">234</int>
    <int name="pageid">236</int>
  </lst>
  <lst name="doc2">
    <int name="pageid">19</int>
  </lst>
</lst>
<lst name="hitpos">
  <lst name="doc1">
    <lst name="234">
      <int name="pos">14325</int>
    </lst>
  </lst>
  ...
</lst>

was:

Paged-Text FieldType for Solr

A chance to dig into the guts of Solr.

The problem: If we index a monograph in Solr, there's no way to convert search results into page-level hits. The solution: have a paged-text fieldtype which keeps track of page divisions as it indexes, and reports page-level hits in the search results.

The input would contain page milestones: <page id="234"/>. As Solr processed the tokens (using its standard tokenizers and filters), it would concurrently build a structural map of the item, indicating which term position marked the beginning of which page: <page id="234" firstterm="14324"/>. This map would be stored in an unindexed field in some efficient format.

At search time, Solr would retrieve term positions for all hits that are returned in the current request, and use the stored map to determine page ids for each term position. The results would imitate the results for highlighting, something like:

<lst name="pages">
  <lst name="doc1">
    <int name="pageid">234</int>
    <int name="pageid">236</int>
  </lst>
  <lst name="doc2">
    <int name="pageid">19</int>
  </lst>
</lst>
<lst name="hitpos">
  <lst name="doc1">
    <lst name="234">
      <int name="pos">14325</int>
    </lst>
  </lst>
  ...
</lst>

Summary: There's no way to convert search results into page-level hits of a structured document. (was: The problem: If we index a monograph in Solr, there's no way to convert search results into page-level hits. The solution: have a paged-text fieldtype which keeps track of page divisions as it indexes, and reports page-level hits in the search results.)

There's no way to convert search results into page-level hits of a structured document.
---------------------------------------------------------------------------------------
Key: SOLR-380
URL: https://issues.apache.org/jira/browse/SOLR-380
Project: Solr
Issue Type: New Feature
Components: search
Reporter: Tricia Williams
Priority: Minor

Paged-Text FieldType for Solr

A chance to dig into the guts of Solr.

The problem: If we index a monograph in Solr, there's no way to convert search results into page-level hits. The solution: have a paged-text fieldtype which keeps track of page divisions as it indexes, and reports page-level hits in the search results.

The input would contain page milestones: <page id="234"/>. As Solr processed the tokens (using its standard tokenizers and filters), it would concurrently build a structural map of the item, indicating which term position marked the beginning of which page: <page id="234" firstterm="14324"/>. This map would be stored in an unindexed field in some efficient format.

At search time, Solr would retrieve term positions for all hits that are returned in the current request, and use the stored map to determine page ids for each term position. The results would imitate the results for highlighting, something like:

<lst name="pages">
  <lst name="doc1">
    <int name="pageid">234</int>
    <int name="pageid">236</int>
  </lst>
  <lst name="doc2">
    <int name="pageid">19</int>
  </lst>
</lst>
<lst name="hitpos">
  <lst name="doc1">
    <lst name="234">
      <int name="pos">14325</int>
    </lst>
  </lst>
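
The search-time lookup described above (use the stored map to turn a term position into a page id) can be sketched in a few lines. This is an illustration only, not Solr code: the names PAGE_MAP, page_for_position and group_hits are hypothetical, and apart from the 234 / 14324 pair taken from the description, the map entries are invented. It assumes the map is kept as a list of (first term position, page id) pairs sorted by position.

```python
from bisect import bisect_right

# Hypothetical page map: (first term position, page id), sorted by position.
# Only the (14324, 234) entry comes from the description; the rest are made up.
PAGE_MAP = [(0, 1), (14324, 234), (14390, 235), (14460, 236)]
STARTS = [start for start, _ in PAGE_MAP]


def page_for_position(pos):
    """Return the id of the page whose first term is at or before pos."""
    i = bisect_right(STARTS, pos) - 1
    if i < 0:
        raise ValueError("position %d precedes the first page" % pos)
    return PAGE_MAP[i][1]


def group_hits(positions):
    """Group hit positions by page id, similar to the 'hitpos' list above."""
    pages = {}
    for pos in sorted(positions):
        pages.setdefault(page_for_position(pos), []).append(pos)
    return pages


print(group_hits([14325, 14470]))
```

A binary search keeps each lookup logarithmic in the number of pages, which matters for monographs running to thousands of pages. How the map would actually be serialized into an unindexed field is left open here, as it is in the description.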