[jira] Created: (SOLR-1819) Upgrade to Tika 0.7

2010-03-13 Thread Tricia Williams (JIRA)
Upgrade to Tika 0.7
---

 Key: SOLR-1819
 URL: https://issues.apache.org/jira/browse/SOLR-1819
 Project: Solr
  Issue Type: Improvement
Reporter: Tricia Williams
Assignee: Grant Ingersoll
Priority: Minor
 Fix For: 1.5


See title.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1235) disallow period (.) in entity names

2009-08-12 Thread Tricia Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742526#action_12742526
 ] 

Tricia Williams commented on SOLR-1235:
---

This commit causes the example-DIH to fail with DataImportHandlerException: 
Entity must have name '.  The reason is that the entity on line 3 of 
trunk/example/example-DIH/solr/mail/conf/data-config.xml is missing the name 
attribute, which causes the condition on line 177 of 
org.apache.solr.handler.dataimport.DataConfig to fail.

The simple solution is to add a name attribute to the offending entity.  The 
complex solution would be to change the DataConfig test so that a null name is 
accepted but a period is not.  What do you think?
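
As a rough illustration of the complex option, something along these lines (a 
sketch only -- not the actual DataConfig source, which would throw 
DataImportHandlerException rather than a plain RuntimeException):

{code}
// Illustrative sketch: accept a missing (null) entity name, but reject a name
// that contains a period.
static void checkEntityName(String name) {
  if (name != null && name.indexOf('.') != -1) {
    throw new RuntimeException(
        "Entity name must not contain a period: '" + name + "'");
  }
}
{code}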

Other info:

I start the example-DIH webapp as described:
{code}
java -Dsolr.solr.home=./example-DIH/solr/ -jar start.jar
{code}

And the error appears:
{panel}
HTTP ERROR: 500

Severe errors in solr configuration.

Check your log files for more detailed information on what may be wrong.

If you want solr to continue after configuration errors, change: 

 <abortOnConfigurationError>false</abortOnConfigurationError>

in solr.xml

-
org.apache.solr.common.SolrException: FATAL: Could not create importer. 
DataImporter config invalid
at 
org.apache.solr.handler.dataimport.DataImportHandler.inform(DataImportHandler.java:121)
at 
org.apache.solr.core.SolrResourceLoader.inform(SolrResourceLoader.java:415)
at org.apache.solr.core.SolrCore.init(SolrCore.java:574)
at org.apache.solr.core.CoreContainer.create(CoreContainer.java:381)
at org.apache.solr.core.CoreContainer.load(CoreContainer.java:241)
at 
org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:115)
at 
org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:83)
at org.mortbay.jetty.servlet.FilterHolder.doStart(FilterHolder.java:99)
at 
org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
at 
org.mortbay.jetty.servlet.ServletHandler.initialize(ServletHandler.java:594)
at org.mortbay.jetty.servlet.Context.startContext(Context.java:139)
at 
org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1218)
at 
org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.java:500)
at 
org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:448)
at 
org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
at 
org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:147)
at 
org.mortbay.jetty.handler.ContextHandlerCollection.doStart(ContextHandlerCollection.java:161)
at 
org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
at 
org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:147)
at 
org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
at 
org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java:117)
at org.mortbay.jetty.Server.doStart(Server.java:210)
at 
org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
at org.mortbay.xml.XmlConfiguration.main(XmlConfiguration.java:929)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.mortbay.start.Main.invokeMain(Main.java:183)
at org.mortbay.start.Main.start(Main.java:497)
at org.mortbay.start.Main.main(Main.java:115)
Caused by: org.apache.solr.handler.dataimport.DataImportHandlerException: 
Exception occurred while initializing context
at 
org.apache.solr.handler.dataimport.DataImporter.loadDataConfig(DataImporter.java:182)
at 
org.apache.solr.handler.dataimport.DataImporter.init(DataImporter.java:99)
at 
org.apache.solr.handler.dataimport.DataImportHandler.inform(DataImportHandler.java:113)
... 30 more
Caused by: org.apache.solr.handler.dataimport.DataImportHandlerException: 
Entity must have name '
at 
org.apache.solr.handler.dataimport.DataConfig$Entity.init(DataConfig.java:118)
at 
org.apache.solr.handler.dataimport.DataConfig$Document.init(DataConfig.java:72)
at 
org.apache.solr.handler.dataimport.DataConfig.readFromXml(DataConfig.java:240)
at 
org.apache.solr.handler.dataimport.DataImporter.loadDataConfig(DataImporter.java:178)
... 32 more
-
org.apache.solr.handler.dataimport.DataImportHandlerException: Exception 
occurred while initializing context

[jira] Commented: (SOLR-380) There's no way to convert search results into page-level hits of a structured document.

2009-01-19 Thread Tricia Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12665195#action_12665195
 ] 

Tricia Williams commented on SOLR-380:
--

Hi Laurent,

Thanks for your interest in my Solr PayloadComponent plugin.  I want to 
address all of the questions you pose in your comment, but won't have time 
until early February.  I apologize for the inconvenience, but my priorities lie 
elsewhere right now.  Feel free to look at the code and play in the meantime.  
The code that's up there is basically proof of concept.  I've been slowly 
working at improving the robustness of the code and improving performance, so 
hopefully there will be an improved version before the end of March.

I'm sure there would be many people who would appreciate a Wiki page for 
this topic.  Why don't you go ahead and set that up?  I'll be happy to add my 
two cents when I'm available.

All the best,
Tricia

 There's no way to convert search results into page-level hits of a 
 structured document.
 -

 Key: SOLR-380
 URL: https://issues.apache.org/jira/browse/SOLR-380
 Project: Solr
  Issue Type: New Feature
  Components: search
Reporter: Tricia Williams
Priority: Minor
 Fix For: 1.4

 Attachments: SOLR-380-XmlPayload.patch, SOLR-380-XmlPayload.patch, 
 xmlpayload-example.zip, xmlpayload-src.jar, xmlpayload.jar


 Paged-Text FieldType for Solr
 A chance to dig into the guts of Solr. The problem: If we index a monograph 
 in Solr, there's no way to convert search results into page-level hits. The 
 solution: have a paged-text fieldtype which keeps track of page divisions 
 as it indexes, and reports page-level hits in the search results.
 The input would contain page milestones: <page id="234"/>. As Solr processed 
 the tokens (using its standard tokenizers and filters), it would concurrently 
 build a structural map of the item, indicating which term position marked the 
 beginning of which page: <page id="234" firstterm="14324"/>. This map would 
 be stored in an unindexed field in some efficient format.
 At search time, Solr would retrieve term positions for all hits that are 
 returned in the current request, and use the stored map to determine page ids 
 for each term position. The results would imitate the results for 
 highlighting, something like:
 <lst name="pages">
   <lst name="doc1">
     <int name="pageid">234</int>
     <int name="pageid">236</int>
   </lst>
   <lst name="doc2">
     <int name="pageid">19</int>
   </lst>
 </lst>
 <lst name="hitpos">
   <lst name="doc1">
     <lst name="234">
       <int name="pos">14325</int>
     </lst>
   </lst>
   ...
 </lst>

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-854) Add 'run example' to build.xml

2008-11-13 Thread Tricia Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12647476#action_12647476
 ] 

Tricia Williams commented on SOLR-854:
--

Running the example is something I frequently do, so having it in the ant script 
is something I'd find useful.

 Add 'run example' to build.xml
 --

 Key: SOLR-854
 URL: https://issues.apache.org/jira/browse/SOLR-854
 Project: Solr
  Issue Type: New Feature
Reporter: Mark Miller
Priority: Trivial
 Attachments: SOLR-854.patch


 Working in eclipse, I find it really convenient for debugging/testing to have 
 a 'run-example' target in the build file. Anyone else?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-532) WordDelimiterFilter ignores payloads

2008-10-21 Thread Tricia Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12641694#action_12641694
 ] 

Tricia Williams commented on SOLR-532:
--

Thanks Grant.  That's much cleaner using the new clone method.  It works for me 
after catching up with the new slf4j logging.  Thanks too for committing it!

 WordDelimiterFilter ignores payloads
 

 Key: SOLR-532
 URL: https://issues.apache.org/jira/browse/SOLR-532
 Project: Solr
  Issue Type: Bug
Reporter: Tricia Williams
Assignee: Grant Ingersoll
Priority: Minor
 Attachments: SOLR-532-WordDelimiterFilter.patch


 When a WordDelimiterFilter ingests a token stream and creates a new token 
 (newTok) it appears to copy most of the old token attributes, except the 
 payload.  I believe this is a bug.  My solution is for the 
 WordDelimiterFilter to use the Token clone() method to create a carbon copy 
 and then modify the appropriate attributes (offsets and term text). 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-380) There's no way to convert search results into page-level hits of a structured document.

2008-04-23 Thread Tricia Williams (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tricia Williams updated SOLR-380:
-

Attachment: (was: lucene-core-2.3-dev.jar)

 There's no way to convert search results into page-level hits of a 
 structured document.
 -

 Key: SOLR-380
 URL: https://issues.apache.org/jira/browse/SOLR-380
 Project: Solr
  Issue Type: New Feature
  Components: search
Reporter: Tricia Williams
Priority: Minor
 Attachments: SOLR-380-XmlPayload.patch, SOLR-380-XmlPayload.patch


 Paged-Text FieldType for Solr
 A chance to dig into the guts of Solr. The problem: If we index a monograph 
 in Solr, there's no way to convert search results into page-level hits. The 
 solution: have a paged-text fieldtype which keeps track of page divisions 
 as it indexes, and reports page-level hits in the search results.
 The input would contain page milestones: <page id="234"/>. As Solr processed 
 the tokens (using its standard tokenizers and filters), it would concurrently 
 build a structural map of the item, indicating which term position marked the 
 beginning of which page: <page id="234" firstterm="14324"/>. This map would 
 be stored in an unindexed field in some efficient format.
 At search time, Solr would retrieve term positions for all hits that are 
 returned in the current request, and use the stored map to determine page ids 
 for each term position. The results would imitate the results for 
 highlighting, something like:
 <lst name="pages">
   <lst name="doc1">
     <int name="pageid">234</int>
     <int name="pageid">236</int>
   </lst>
   <lst name="doc2">
     <int name="pageid">19</int>
   </lst>
 </lst>
 <lst name="hitpos">
   <lst name="doc1">
     <lst name="234">
       <int name="pos">14325</int>
     </lst>
   </lst>
   ...
 </lst>

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (SOLR-532) WordDelimiterFilter ignores payloads

2008-04-03 Thread Tricia Williams (JIRA)
WordDelimiterFilter ignores payloads


 Key: SOLR-532
 URL: https://issues.apache.org/jira/browse/SOLR-532
 Project: Solr
  Issue Type: Bug
Reporter: Tricia Williams
Priority: Minor


When a WordDelimiterFilter ingests a token stream and creates a new token 
(newTok) it appears to copy most of the old token attributes, except the 
payload.  I believe this is a bug.  My solution is for the WordDelimiterFilter 
to use the Token clone() method to create a carbon copy and then modify the 
appropriate attributes (offsets and term text). 
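
A minimal sketch of the approach described above, assuming the old Lucene 2.x 
Token API (identifiers are illustrative; this is not the attached patch):

{code}
import org.apache.lucene.analysis.Token;

// Clone the original token so the payload (and type, flags, etc.) carries over,
// then adjust only what differs for the generated sub-word.
private Token newSubToken(Token original, String word, int wordStart, int wordEnd) {
  Token newTok = (Token) original.clone();
  newTok.setTermText(word);                                // term text of the split piece
  newTok.setStartOffset(original.startOffset() + wordStart);
  newTok.setEndOffset(original.startOffset() + wordEnd);
  return newTok;
}
{code}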

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-532) WordDelimiterFilter ignores payloads

2008-04-03 Thread Tricia Williams (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tricia Williams updated SOLR-532:
-

Attachment: SOLR-532-WordDelimiterFilter.patch

Quick fix.  Does this need a unit test to go with it?

 WordDelimiterFilter ignores payloads
 

 Key: SOLR-532
 URL: https://issues.apache.org/jira/browse/SOLR-532
 Project: Solr
  Issue Type: Bug
Reporter: Tricia Williams
Priority: Minor
 Attachments: SOLR-532-WordDelimiterFilter.patch


 When a WordDelimiterFilter ingests a token stream and creates a new token 
 (newTok) it appears to copy most of the old token attributes, except the 
 payload.  I believe this is a bug.  My solution is for the 
 WordDelimiterFilter to use the Token clone() method to create a carbon copy 
 and then modify the appropriate attributes (offsets and term text). 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (SOLR-522) analysis.jsp doesn't show payloads created/modified by tokenizers and tokenfilters

2008-03-31 Thread Tricia Williams (JIRA)
analysis.jsp doesn't show payloads created/modified by tokenizers and 
tokenfilters
--

 Key: SOLR-522
 URL: https://issues.apache.org/jira/browse/SOLR-522
 Project: Solr
  Issue Type: Improvement
  Components: web gui
Reporter: Tricia Williams
Priority: Trivial


Add payload content to the verbose output of the analysis.jsp page for debugging 
purposes.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-522) analysis.jsp doesn't show payloads created/modified by tokenizers and tokenfilters

2008-03-31 Thread Tricia Williams (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tricia Williams updated SOLR-522:
-

Attachment: SOLR-522-analysis.jsp.patch

Added an if block to analysis.jsp which converts the Payload's byte stream 
directly to a String for display.  This might not suit the use case of all 
payloads, so it may need to be revisited as those emerge.
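
Roughly the kind of conversion described, shown here as plain Java rather than 
the actual JSP scriptlet (names are illustrative):

{code}
import org.apache.lucene.index.Payload;

// Render the payload's raw bytes directly as a String for the verbose analysis
// output.  This suits simple text payloads; binary payload encodings would need
// their own rendering.
String payloadAsString(Payload payload) {
  if (payload == null) return null;
  return new String(payload.getData(), payload.getOffset(), payload.length());
}
{code}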

 analysis.jsp doesn't show payloads created/modified by tokenizers and 
 tokenfilters
 --

 Key: SOLR-522
 URL: https://issues.apache.org/jira/browse/SOLR-522
 Project: Solr
  Issue Type: Improvement
  Components: web gui
Reporter: Tricia Williams
Priority: Trivial
 Attachments: SOLR-522-analysis.jsp.patch

   Original Estimate: 0.17h
  Remaining Estimate: 0.17h

 Add payload content to the verbose output of the analysis.jsp page for 
 debugging purposes.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-386) Add configuration to specify SolrHighlighter implementation

2008-03-05 Thread Tricia Williams (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tricia Williams updated SOLR-386:
-

Attachment: SOLR-386-SolrHighlighter.patch

OK.  So I think I fixed the whitespace problem.

Thanks for explaining the problem with interfaces -- that makes sense now.  I 
assume that EventListener and RequestHandler are interfaces because they've 
been thought long and hard about and are not going to change?

My first try at the patch was just to include the public methods, which are the 
ones you (Mike Klaas) list:
 .initialize(Config)
 .isHighlightEnabled(SolrParams)
 .doHighlighting(...)
 .getHighlightFields(...) 

I discovered that the unit tests call the formatters and fragmenters directly, 
so in the interface version I had made public get methods for these.  Now that 
we're using an abstract class I am able to just include these as-is -- so no 
changes to HighlighterTest this time.  But speaking of unit tests... 
Anecdotally I know that the SolrCore changes allow the highlighter to be 
configured (I use them with my custom highlighter), and I wrote 
HighlighterConfigTest as a unit test for this functionality.

I decided to leave the default implementations of 
isHighlightingEnabled(SolrParams) and getHighlightFields(...) in the abstract 
class because both methods deal with reading parameters.  I can't think of a 
use case for a highlighter that wouldn't use these methods, or at worst would 
ignore/override them.  Is this a reasonable decision?

I wasn't sure what to do with the logger, so I left it in the 
AbstractSolrHighlighter.  This decision is based on the example of 
UpdateHandler. 

Hmm... I just realized that naming the abstraction of SolrHighlighter 
AbstractSolrHighlighter causes problems all over the code when you want your 
custom highlighter to plug in.  I think the path of least resistance is to call 
the abstract class SolrHighlighter and the existing implementation 
DefaultSolrHighlighter.
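
Roughly the shape being proposed, with imports omitted and signatures assumed 
from the method list above (a sketch, not the final committed API):

{code}
// Sketch only: the abstract base keeps the parameter-reading defaults, and the
// existing implementation becomes DefaultSolrHighlighter.  Parameter lists are
// simplified/assumed for illustration.
public abstract class SolrHighlighter {

  public abstract void initialize(Config config);

  // Default kept in the base class: it only reads a request parameter.
  public boolean isHighlightingEnabled(SolrParams params) {
    return params.getBool("hl", false);
  }

  // Default kept in the base class as well: resolves hl.fl (or the supplied
  // default fields) into the list of fields to highlight.
  public String[] getHighlightFields(Query query, SolrQueryRequest request,
                                     String[] defaultFields) {
    String fl = request.getParams().get("hl.fl");
    return (fl != null) ? fl.split("[,\\s]+") : defaultFields;
  }

  // The actual highlighting work is what concrete highlighters implement.
  public abstract NamedList doHighlighting(DocList docs, Query query,
                                           SolrQueryRequest req, String[] defaultFields)
      throws IOException;
}
{code}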

Thoughts?

 Add configuration to specify SolrHighlighter implementation
 ---

 Key: SOLR-386
 URL: https://issues.apache.org/jira/browse/SOLR-386
 Project: Solr
  Issue Type: Improvement
  Components: highlighter
Affects Versions: 1.3
Reporter: Eli Levine
Assignee: Mike Klaas
 Attachments: SOLR-386-SolrHighlighter.patch, 
 SOLR-386-SolrHighlighter.patch, SOLR-386-SolrHighlighter.patch, 
 SOLR-386-SolrHighlighter.patch, SOLR-386-SolrHighlighter.patch


 It would be great if SolrCore allowed the highlighter class to be 
 configurable.  A good way would be to add a +class+ attribute to the 
 <highlighting> element in solrconfig.xml, similar to how the RequestHandler 
 instance is initialized in SolrCore.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-386) Add configuration to specify SolrHighlighter implementation

2008-02-20 Thread Tricia Williams (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tricia Williams updated SOLR-386:
-

Attachment: SOLR-386-SolrHighlighter.patch

I'd really like some feedback on this patch.  I've just updated the patch based 
on changes that have been made to SolrHighlighter.java since revision 594314.

Eli, does this meet your needs?  This is all I need in SOLR-380 to plug in a 
custom highlighter.  I would really appreciate if this could be committed by 
someone so that I can stop worrying about keeping up with revisions.  It has 
been assigned to Mike Klaas, so his feedback in particular would be valuable to 
me.

Thanks,
Tricia

 Add configuration to specify SolrHighlighter implementation
 ---

 Key: SOLR-386
 URL: https://issues.apache.org/jira/browse/SOLR-386
 Project: Solr
  Issue Type: Improvement
  Components: highlighter
Affects Versions: 1.3
Reporter: Eli Levine
Assignee: Mike Klaas
 Attachments: SOLR-386-SolrHighlighter.patch, 
 SOLR-386-SolrHighlighter.patch, SOLR-386-SolrHighlighter.patch, 
 SOLR-386-SolrHighlighter.patch


 It would be great if SolrCore allowed the highlighter class to be 
 configurable.  A good way would be to add a +class+ attribute to the 
 <highlighting> element in solrconfig.xml, similar to how the RequestHandler 
 instance is initialized in SolrCore.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-380) There's no way to convert search results into page-level hits of a structured document.

2007-11-15 Thread Tricia Williams (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tricia Williams updated SOLR-380:
-

Attachment: SOLR-380-XmlPayload.patch

Functionality is improved.  Tests are more complete.  I have included an 
example (much like the example included in Solr) which demonstrates the changes 
needed to solrconfig.xml and schema.xml, as well as some XML documents to 
start playing with. 

TODO: 
 * Still have to track down what happens when filters are applied to the 
Tokenizer.
 * Implement error handling for bad xml input. 

 There's no way to convert search results into page-level hits of a 
 structured document.
 -

 Key: SOLR-380
 URL: https://issues.apache.org/jira/browse/SOLR-380
 Project: Solr
  Issue Type: New Feature
  Components: search
Reporter: Tricia Williams
Priority: Minor
 Attachments: lucene-core-2.3-dev.jar, SOLR-380-XmlPayload.patch, 
 SOLR-380-XmlPayload.patch


 Paged-Text FieldType for Solr
 A chance to dig into the guts of Solr. The problem: If we index a monograph 
 in Solr, there's no way to convert search results into page-level hits. The 
 solution: have a paged-text fieldtype which keeps track of page divisions 
 as it indexes, and reports page-level hits in the search results.
 The input would contain page milestones: <page id="234"/>. As Solr processed 
 the tokens (using its standard tokenizers and filters), it would concurrently 
 build a structural map of the item, indicating which term position marked the 
 beginning of which page: <page id="234" firstterm="14324"/>. This map would 
 be stored in an unindexed field in some efficient format.
 At search time, Solr would retrieve term positions for all hits that are 
 returned in the current request, and use the stored map to determine page ids 
 for each term position. The results would imitate the results for 
 highlighting, something like:
 <lst name="pages">
   <lst name="doc1">
     <int name="pageid">234</int>
     <int name="pageid">236</int>
   </lst>
   <lst name="doc2">
     <int name="pageid">19</int>
   </lst>
 </lst>
 <lst name="hitpos">
   <lst name="doc1">
     <lst name="234">
       <int name="pos">14325</int>
     </lst>
   </lst>
   ...
 </lst>

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-380) There's no way to convert search results into page-level hits of a structured document.

2007-11-11 Thread Tricia Williams (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tricia Williams updated SOLR-380:
-

Attachment: lucene-core-2.3-dev.jar
SOLR-380-XmlPayload.patch

This is a draft.  Note that the Payload and Token classes in particular have 
changed since lucene-core-2.2.0.jar.  Users of this patch will need to replace 
lucene-core-2.2.0.jar with lucene-core-2.3-dev.jar.  I have created a test for 
XmlPayloadCharTokenizer but have not attached it here because LuceneTestCase is 
not in Solr's classpath in any form and it would break the build.

 The code works in theory and passes tests to that effect.  However, in 
practice, when I deploy the war created from the dist ant target, several 
problems result from adding documents (which seems to work, using a 
<![CDATA[...]]> section to contain the structured document and post.jar):

 * after adding an XmlPayload-tokenized document, q=*:* causes a 500 error: HTTP 
Status 500 - read past EOF java.io.IOException: read past EOF at 
org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:146) 
at 
org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:38) 
at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:76) at 
org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:153) at 
org.apache.lucene.index.SegmentReader.document(SegmentReader.java:408) at 
org.apache.lucene.index.MultiSegmentReader.document(MultiSegmentReader.java:129)
 at org.apache.lucene.index.IndexReader.document(IndexReader.java:436) at ...
 * use of the highlight fields produces the same error.
 * queries that should match an XmlPayload-tokenized document do not ( //[EMAIL 
PROTECTED]'0']) -- though queries matching non-XmlPayload-tokenized documents 
continue to return the expected results.
 * trying to view the index using Luke (Lucene Index Toolbox, v 0.7.1 
(2007-06-20) ) returns: Unknown format version: -4
 * Solr Statistics confirm that all the documents have been added.


I will continue to finish this functionality, but any suggestions or other 
input are welcome.  You will see how the functionality is intended to be used 
in src/test/org/apache/solr/highlight/XmlPayloadTest.java
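
For readers trying to picture the intended mechanism, here is a purely 
illustrative payload-stamping filter using the old Lucene 2.x TokenStream API. 
It is not the attached XmlPayloadCharTokenizer, which parses the page 
milestones itself:

{code}
import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.index.Payload;

// Every emitted token is stamped with the id of the page it came from, so that
// page-level hits can be recovered from term positions/payloads at search time.
public class PageIdPayloadFilter extends TokenFilter {
  private String currentPageId = "0";

  public PageIdPayloadFilter(TokenStream input) {
    super(input);
  }

  public Token next() throws IOException {
    Token tok = input.next();
    if (tok == null) return null;
    // The real tokenizer updates currentPageId whenever it sees a
    // <page id="..."/> milestone in the input; that part is omitted here.
    tok.setPayload(new Payload(currentPageId.getBytes()));
    return tok;
  }
}
{code}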

 There's no way to convert search results into page-level hits of a 
 structured document.
 -

 Key: SOLR-380
 URL: https://issues.apache.org/jira/browse/SOLR-380
 Project: Solr
  Issue Type: New Feature
  Components: search
Reporter: Tricia Williams
Priority: Minor
 Attachments: lucene-core-2.3-dev.jar, SOLR-380-XmlPayload.patch


 Paged-Text FieldType for Solr
 A chance to dig into the guts of Solr. The problem: If we index a monograph 
 in Solr, there's no way to convert search results into page-level hits. The 
 solution: have a paged-text fieldtype which keeps track of page divisions 
 as it indexes, and reports page-level hits in the search results.
 The input would contain page milestones: <page id="234"/>. As Solr processed 
 the tokens (using its standard tokenizers and filters), it would concurrently 
 build a structural map of the item, indicating which term position marked the 
 beginning of which page: <page id="234" firstterm="14324"/>. This map would 
 be stored in an unindexed field in some efficient format.
 At search time, Solr would retrieve term positions for all hits that are 
 returned in the current request, and use the stored map to determine page ids 
 for each term position. The results would imitate the results for 
 highlighting, something like:
 <lst name="pages">
   <lst name="doc1">
     <int name="pageid">234</int>
     <int name="pageid">236</int>
   </lst>
   <lst name="doc2">
     <int name="pageid">19</int>
   </lst>
 </lst>
 <lst name="hitpos">
   <lst name="doc1">
     <lst name="234">
       <int name="pos">14325</int>
     </lst>
   </lst>
   ...
 </lst>

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-386) Add configuration to specify SolrHighlighter implementation

2007-11-10 Thread Tricia Williams (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tricia Williams updated SOLR-386:
-

Attachment: SOLR-386-SolrHighlighter.patch

Updated patch to work with recent changes made to SolrCore.  Should apply 
against a clean trunk again.  No further changes.

 Add configuration to specify SolrHighlighter implementation
 ---

 Key: SOLR-386
 URL: https://issues.apache.org/jira/browse/SOLR-386
 Project: Solr
  Issue Type: Improvement
  Components: highlighter
Affects Versions: 1.3
Reporter: Eli Levine
 Attachments: SOLR-386-SolrHighlighter.patch, 
 SOLR-386-SolrHighlighter.patch


 It would be great if SolrCore allowed the highlighter class to be 
 configurable.  A good way would be to add a +class+ attribute to the 
 <highlighting> element in solrconfig.xml, similar to how the RequestHandler 
 instance is initialized in SolrCore.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-386) Add configuration to specify SolrHighlighter implementation

2007-11-01 Thread Tricia Williams (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tricia Williams updated SOLR-386:
-

Attachment: SOLR-386-SolrHighlighter.patch

This patch allows highlighting to be plugged in.

What I did:
 * Made SolrHighlighter an interface
 * The old SolrHighlighter became DefaultSolrHighlighter
 * Instantiate the highlighter in SolrCore based on what is in the 
solrconfig.xml (see the sketch below)

So to roll your own
 * Implement SolrHighlighter (ie org.apache.solr.highlight.MySolrHighlighter)
 * find <highlighting> in solrconfig.xml and modify to <highlighting 
class="org.apache.solr.highlight.MySolrHighlighter">
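
Illustrative only (not the committed SolrCore change): the instantiation step 
amounts to reading the class attribute and creating the highlighter 
reflectively, falling back to DefaultSolrHighlighter.  The config-lookup call 
is assumed for the sketch.

{code}
// Sketch of the plug-in step; error handling and the exact Config API are
// simplified/assumed.
private SolrHighlighter createHighlighter(Config solrConfig) throws Exception {
  String className = solrConfig.get("highlighting/@class",
      "org.apache.solr.highlight.DefaultSolrHighlighter");
  return (SolrHighlighter) Class.forName(className).newInstance();
}
{code}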

This patch builds on changes made to trunk by SOLR-281.  This patch also 
contains these changes (meaning you should apply this patch to the trunk).  I 
get the feeling that this is probably not the right way to build a dependent 
patch, but I don't know any better.  Let me know if I should change how I built 
this patch.

 Add configuration to specify SolrHighlighter implementation
 ---

 Key: SOLR-386
 URL: https://issues.apache.org/jira/browse/SOLR-386
 Project: Solr
  Issue Type: Improvement
  Components: highlighter
Affects Versions: 1.3
Reporter: Eli Levine
 Attachments: SOLR-386-SolrHighlighter.patch


 It would be great if SolrCore allowed the highlighter class to be 
 configurable.  A good way would be to add a +class+ attribute to the 
 <highlighting> element in solrconfig.xml, similar to how the RequestHandler 
 instance is initialized in SolrCore.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-380) There's no way to convert search results into page-level hits of a structured document.

2007-10-17 Thread Tricia Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12535748
 ] 

Tricia Williams commented on SOLR-380:
--

The discussion from 
http://www.nabble.com/Structured-Lucene-documents-tf4234661.html#a12048390 
gives one solution (which is more of a workaround in my opinion), but I don't 
think it is practical.  The number of pages of the monographs we index varies 
greatly (10s to 1000s of pages).  So while specifying each page_* 
(page_1,page_2,page_3,...,page_N) as a field to highlight will work, I don't 
think it is the cleanest solution because you have to infer page numbers from 
the highlighted samples.  Furthermore, in order to get the highlighted samples 
you need to know the values of the * in a dynamic field which sort of defeats 
the purpose of the dynamic field.  If you wanted to use the position numbers 
themselves (for example using positions and OCR information to create 
highlighting on an original image), they are not available in the results.

In answer to your question Peter, one must enable highlighting and list all the 
page_* fields for highlighter snippets.  In the following example I have a 
dynamic field fulltext_*, copyfield fulltext, and defaultSearchField=fulltext:
http://localhost:8080/solr/select?indent=on&version=2.2&q=employ&start=0&rows=10&fl=*%2Cscore&qt=standard&wt=standard&explainOther=&hl=on&hl.fl=fulltext_1%2Cfulltext_2%2Cfulltext_3%2Cfulltext_4%2Cfulltext_5%2Cfulltext_6%2Cfulltext_7%2Cfulltext_8%2Cfulltext_9
gives the normal results, with the following at the end:

<lst name="highlighting">
 <lst name="News.EFP.186500">
  <arr name="fulltext_1">
   <str>
     was <em>employed</em> on the G. T. R. as fireman met his death in an 
    accident on that road some yeara ago but three
   </str>
  </arr>
  <arr name="fulltext_4">
   <str>
     ^-f 6r-Ke.w-¥eaf!fl&apos;: Mr.-BradV whb is <em>employed</em> in Windsor, 
    was also at his borne for jSew Year
   </str>
  </arr>
  <arr name="fulltext_6">
   <str>
     <em>employed</em> at the Walkerville brewery op to a short time ago,when 
    illness ecessilater! his resignation. He
   </str>
  </arr>
  <arr name="fulltext_7">
   <str>
     . have entered intoan agreement to <em>employ</em> the powerful tug Lntz 
    to keep th&gt;e Detroit river between
   </str>
  </arr>
 </lst>
</lst>

You will notice that only the pages with hits on them appear in the highlight 
section.  From this point it would take a little work to parse the /[EMAIL 
PROTECTED] to get the * from fulltext_* for each document match.

I agree that the highlighter is a good model of what we want to do.  But the 
difficulty I'm finding is the upfront part where we need to store the position 
to page mapping in a field while at the same time we need to analyze the full 
page text into another field for searching.  

I don't think defining a FieldType will allow us to do this.  The FieldType 
looks like it is useful in controlling what the output of your defined field is 
(write()), and how it is sorted, but not how Fields with your FieldType will be 
indexed or queried.

Would someone more familiar with the innards of Solr recommend I pursue the 
SOLR-247 problem, or continue hunting for a solution in the manner that I've 
been pursuing in this issue?

 There's no way to convert search results into page-level hits of a 
 structured document.
 -

 Key: SOLR-380
 URL: https://issues.apache.org/jira/browse/SOLR-380
 Project: Solr
  Issue Type: New Feature
  Components: search
Reporter: Tricia Williams
Priority: Minor

 Paged-Text FieldType for Solr
 A chance to dig into the guts of Solr. The problem: If we index a monograph 
 in Solr, there's no way to convert search results into page-level hits. The 
 solution: have a paged-text fieldtype which keeps track of page divisions 
 as it indexes, and reports page-level hits in the search results.
 The input would contain page milestones: <page id="234"/>. As Solr processed 
 the tokens (using its standard tokenizers and filters), it would concurrently 
 build a structural map of the item, indicating which term position marked the 
 beginning of which page: <page id="234" firstterm="14324"/>. This map would 
 be stored in an unindexed field in some efficient format.
 At search time, Solr would retrieve term positions for all hits that are 
 returned in the current request, and use the stored map to determine page ids 
 for each term position. The results would imitate the results for 
 highlighting, something like:
 <lst name="pages">
   <lst name="doc1">
     <int name="pageid">234</int>
 

[jira] Issue Comment Edited: (SOLR-380) There's no way to convert search results into page-level hits of a structured document.

2007-10-17 Thread Tricia Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12535748
 ] 

pgwillia edited comment on SOLR-380 at 10/17/07 3:13 PM:


The discussion from 
http://www.nabble.com/Structured-Lucene-documents-tf4234661.html#a12048390 
gives one solution (which is more of a workaround in my opinion), but I don't 
think it is practical.  The number of pages of the monographs we index varies 
greatly (10s to 1000s of pages).  So while specifying each page_* 
(page_1,page_2,page_3,...,page_N) as a field to highlight will work, I don't 
think it is the cleanest solution because you have to infer page numbers from 
the highlighted samples.  Furthermore, in order to get the highlighted samples 
you need to know the values of the * in a dynamic field which sort of defeats 
the purpose of the dynamic field.  If you wanted to use the position numbers 
themselves (for example using positions and OCR information to create 
highlighting on an original image), they are not available in the results.

In answer to your question Peter, one must enable highlighting and list all the 
page_* fields for highlighter snippets.  In the following example I have a 
dynamic field fulltext_*, copyfield fulltext, and defaultSearchField=fulltext:
http://tinyurl.com/3xdshk
(essentially shows the parameters and their values for this example -- pay 
attention to the hl.fl parameter)
gives the normal results, with the following at the end:

<lst name="highlighting">
 <lst name="News.EFP.186500">
  <arr name="fulltext_1">
   <str>
     was <em>employed</em> on the G. T. R. as fireman met his death in an 
    accident on that road some yeara ago but three
   </str>
  </arr>
  <arr name="fulltext_4">
   <str>
     ^-f 6r-Ke.w-¥eaf!fl&apos;: Mr.-BradV whb is <em>employed</em> in Windsor, 
    was also at his borne for jSew Year
   </str>
  </arr>
  <arr name="fulltext_6">
   <str>
     <em>employed</em> at the Walkerville brewery op to a short time ago,when 
    illness ecessilater! his resignation. He
   </str>
  </arr>
  <arr name="fulltext_7">
   <str>
     . have entered intoan agreement to <em>employ</em> the powerful tug Lntz 
    to keep th&gt;e Detroit river between
   </str>
  </arr>
 </lst>
</lst>

You will notice that only the pages with hits on them appear in the highlight 
section.  From this point it would take a little work to parse the /[EMAIL 
PROTECTED] to get the * from fulltext_* for each document match.

I agree that the highlighter is a good model of what we want to do.  But the 
difficulty I'm finding is the upfront part where we need to store the position 
to page mapping in a field while at the same time we need to analyze the full 
page text into another field for searching.  

I don't think defining a FieldType will allow us to do this.  The FieldType 
looks like it is useful in controlling what the output of your defined field is 
(write()), and how it is sorted, but not how Fields with your FieldType will be 
indexed or queried.

Would someone more familiar with the innards of Solr recommend I pursue the 
SOLR-247 problem, or continue hunting for a solution in the manner that I've 
been pursuing in this issue?

  was (Author: pgwillia):
The discussion from 
http://www.nabble.com/Structured-Lucene-documents-tf4234661.html#a12048390 
gives one solution (which is more of a workaround in my opinion), but I don't 
think it is practical.  The number of pages of the monographs we index varies 
greatly (10s to 1000s of pages).  So while specifying each page_* 
(page_1,page_2,page_3,...,page_N) as a field to highlight will work, I don't 
think it is the cleanest solution because you have to infer page numbers from 
the highlighted samples.  Furthermore, in order to get the highlighted samples 
you need to know the values of the * in a dynamic field which sort of defeats 
the purpose of the dynamic field.  If you wanted to use the position numbers 
themselves (for example using positions and OCR information to create 
highlighting on an original image), they are not available in the results.

In answer to your question Peter, one must enable highlighting and list all the 
page_* fields for highlighter snippets.  In the following example I have a 
dynamic field fulltext_*, copyfield fulltext, and defaultSearchField=fulltext:
http://localhost:8080/solr/select?indent=on&version=2.2&q=employ&start=0&rows=10&fl=*%2Cscore&qt=standard&wt=standard&explainOther=&hl=on&hl.fl=fulltext_1%2Cfulltext_2%2Cfulltext_3%2Cfulltext_4%2Cfulltext_5%2Cfulltext_6%2Cfulltext_7%2Cfulltext_8%2Cfulltext_9
gives the normal results, with the following at the end:

<lst name="highlighting">
 <lst name="News.EFP.186500">
  <arr name="fulltext_1">
   <str>
     was 

[jira] Updated: (SOLR-380) There's no way to convert search results into page-level hits of a structured document.

2007-10-15 Thread Tricia Williams (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tricia Williams updated SOLR-380:
-

Description: 
Paged-Text FieldType for Solr

A chance to dig into the guts of Solr. The problem: If we index a monograph in 
Solr, there's no way to convert search results into page-level hits. The 
solution: have a paged-text fieldtype which keeps track of page divisions as 
it indexes, and reports page-level hits in the search results.

The input would contain page milestones: <page id="234"/>. As Solr processed 
the tokens (using its standard tokenizers and filters), it would concurrently 
build a structural map of the item, indicating which term position marked the 
beginning of which page: <page id="234" firstterm="14324"/>. This map would be 
stored in an unindexed field in some efficient format.

At search time, Solr would retrieve term positions for all hits that are 
returned in the current request, and use the stored map to determine page ids 
for each term position. The results would imitate the results for highlighting, 
something like:

<lst name="pages">
<lst name="doc1">
<int name="pageid">234</int>
<int name="pageid">236</int>
</lst>
<lst name="doc2">
<int name="pageid">19</int>
</lst>
</lst>
<lst name="hitpos">
<lst name="doc1">
<lst name="234">
<int name="pos">14325</int>
</lst>
</lst>
...
</lst>

  was:
Paged-Text FieldType for Solr
 
 A chance to dig into the guts of Solr. The problem: If we index a
 monograph in Solr, there's no way to convert search results into
 page-level hits. The solution: have a paged-text fieldtype which keeps
 track of page divisions as it indexes, and reports page-level hits in the
 search results.
 
 The input would contain page milestones: <page id="234"/>. As Solr
 processed the tokens (using its standard tokenizers and filters), it would
 concurrently build a structural map of the item, indicating which term
 position marked the beginning of which page: <page id="234"
 firstterm="14324"/>. This map would be stored in an unindexed field in
 some efficient format.
 
 At search time, Solr would retrieve term positions for all hits that are
 returned in the current request, and use the stored map to determine page
 ids for each term position. The results would imitate the results for
 highlighting, something like:
 
 <lst name="pages">
 <lst name="doc1">
 <int name="pageid">234</int>
 <int name="pageid">236</int>
 </lst>
 <lst name="doc2">
 <int name="pageid">19</int>
 </lst>
 </lst>
 <lst name="hitpos">
 <lst name="doc1">
 <lst name="234">
 <int name="pos">14325</int>
 </lst>
 </lst>
 ...
 </lst>

Summary: There's no way to convert search results into page-level hits 
of a structured document.  (was: The problem: If we index a monograph in 
Solr, there's no way to convert search results into page-level hits. The 
solution: have a paged-text fieldtype which keeps track of page divisions as 
it indexes, and reports page-level hits in the search results.)

 There's no way to convert search results into page-level hits of a 
 structured document.
 -

 Key: SOLR-380
 URL: https://issues.apache.org/jira/browse/SOLR-380
 Project: Solr
  Issue Type: New Feature
  Components: search
Reporter: Tricia Williams
Priority: Minor

 Paged-Text FieldType for Solr
 A chance to dig into the guts of Solr. The problem: If we index a monograph 
 in Solr, there's no way to convert search results into page-level hits. The 
 solution: have a paged-text fieldtype which keeps track of page divisions 
 as it indexes, and reports page-level hits in the search results.
 The input would contain page milestones: <page id="234"/>. As Solr processed 
 the tokens (using its standard tokenizers and filters), it would concurrently 
 build a structural map of the item, indicating which term position marked the 
 beginning of which page: <page id="234" firstterm="14324"/>. This map would 
 be stored in an unindexed field in some efficient format.
 At search time, Solr would retrieve term positions for all hits that are 
 returned in the current request, and use the stored map to determine page ids 
 for each term position. The results would imitate the results for 
 highlighting, something like:
 <lst name="pages">
 <lst name="doc1">
 <int name="pageid">234</int>
 <int name="pageid">236</int>
 </lst>
 <lst name="doc2">
 <int name="pageid">19</int>
 </lst>
 </lst>
 <lst name="hitpos">
 <lst name="doc1">
 <lst name="234">
 <int name="pos">14325</int>
 </lst>
 </lst>