[jira] Updated: (SOLR-1871) Function Query "map" variant that allows "target" to be an arbitrary ValueSource

2010-04-16 Thread Chris Harris (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Harris updated SOLR-1871:
---

Attachment: SOLR-1871.patch

Revised patch, with two changes:
 * Now we modify the behavior of "map", rather than introducing a new "mapf" 
function.
 * Now not only the "target" but also the "defaultValue" can be a ValueSource.
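The generalized semantics can be sketched outside Solr (hedged: `MapSketch` and `IntToDoubleFunction` are stand-ins for the patch's ValueSource machinery, not its actual classes): target and default become per-document functions instead of float constants.

```java
import java.util.function.IntToDoubleFunction;

// Stand-in sketch of the generalized map(v, min, max, target, default):
// target and default are per-document functions (mirroring ValueSources)
// rather than float constants. Not the actual Solr/patch code.
public class MapSketch {
    static double map(double v, double min, double max,
                      IntToDoubleFunction target, IntToDoubleFunction dflt,
                      int doc) {
        if (v >= min && v <= max) {
            return target.applyAsDouble(doc);
        }
        // With no default given, map passes the original value through
        return dflt != null ? dflt.applyAsDouble(doc) : v;
    }

    public static void main(String[] args) {
        // map(mydatefield,0,0,ms(NOW)): a missing date (0) maps to ms(NOW)
        IntToDoubleFunction msNow = doc -> 1.2718e12; // stand-in for ms(NOW)
        System.out.println(map(0.0, 0.0, 0.0, msNow, null, 7)); // in range: target
        System.out.println(map(5.0, 0.0, 0.0, msNow, null, 7)); // out of range: value
    }
}
```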

> Function Query "map" variant that allows "target" to be an arbitrary 
> ValueSource
> 
>
> Key: SOLR-1871
> URL: https://issues.apache.org/jira/browse/SOLR-1871
> Project: Solr
>  Issue Type: Improvement
>  Components: search
>Affects Versions: 1.4
>Reporter: Chris Harris
> Attachments: SOLR-1871.patch, SOLR-1871.patch, SOLR-1871.patch
>
>
> Currently, as documented at http://wiki.apache.org/solr/FunctionQuery#map, 
> the "target" of a map must be a floating point constant. I propose that you 
> should have at least the option of doing a map where the target is an 
> arbitrary ValueSource.
> The particular use case that inspired this is that I want to be able to 
> control how missing date fields affect boosting. In particular, I want to 
> be able to use something like this in my function queries:
> {code}
> map(mydatefield,0,0,ms(NOW))
> {code}
> But this might have other uses.
> I'll attach an initial implementation.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (SOLR-1871) Function Query "map" variant that allows "target" to be an arbitrary ValueSource

2010-04-08 Thread Chris Harris (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855133#action_12855133
 ] 

Chris Harris commented on SOLR-1871:


bq. Yep, makes sense. Some of the older FunctionQueries predate the ability of 
the parser to generalize more (one had to know if one was parsing a constant or 
a function). It might make sense to generalize the map function rather than 
have two functions?

Yes, having a single map sounds less confusing from a user perspective.

The reason I started by forking map into map and mapf was that I was worried 
about performance; I assumed (apparently incorrectly?) that hard-coding certain 
things as float constants had been done as a performance optimization. Are you 
implying that the approach taken by this mapf code is probably fast enough to 
be the standard implementation of map?

And what if it were taken to the extreme of making all arguments for all 
function queries ValueSource compatible? (That is, what if all functions used 
fp.parseValueSource() for all their arguments, and parsed none of them with 
fp.parseFloat()?) Does _that_ start to sound inefficient?
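The performance worry can be made concrete with a small illustration (hedged: `ConstSource` and `DocValues` are stand-ins, not Solr's ConstValueSource or its DocValues API): a parseFloat-style constant is a primitive compared inline, while a parseValueSource-style constant arrives wrapped in an object, costing one extra virtual call per document.

```java
// Illustrative contrast between a hard-coded float constant and the same
// constant wrapped in a ValueSource-like object (one extra virtual call per
// document). Names are illustrative, not the actual Solr classes.
public class ConstWrapSketch {
    interface DocValues { float floatVal(int doc); }

    static final class ConstSource implements DocValues {
        private final float constant;
        ConstSource(float constant) { this.constant = constant; }
        public float floatVal(int doc) { return constant; } // same value, every doc
    }

    // parseFloat-style: the constant is a primitive, compared inline
    static float mapWithPrimitive(float v, float target) {
        return (v == 0f) ? target : v;
    }

    // parseValueSource-style: the constant arrives behind an interface
    static float mapWithSource(float v, DocValues target, int doc) {
        return (v == 0f) ? target.floatVal(doc) : v;
    }

    public static void main(String[] args) {
        DocValues five = new ConstSource(5f);
        System.out.println(mapWithPrimitive(0f, 5f));
        System.out.println(mapWithSource(0f, five, 0));
    }
}
```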


> Function Query "map" variant that allows "target" to be an arbitrary 
> ValueSource
> 
>
> Key: SOLR-1871
> URL: https://issues.apache.org/jira/browse/SOLR-1871
> Project: Solr
>  Issue Type: Improvement
>  Components: search
>Affects Versions: 1.4
>Reporter: Chris Harris
> Attachments: SOLR-1871.patch, SOLR-1871.patch
>
>
> Currently, as documented at http://wiki.apache.org/solr/FunctionQuery#map, 
> the "target" of a map must be a floating point constant. I propose that you 
> should have at least the option of doing a map where the target is an 
> arbitrary ValueSource.
> The particular use case that inspired this is that I want to be able to 
> control how missing date fields affect boosting. In particular, I want to 
> be able to use something like this in my function queries:
> {code}
> map(mydatefield,0,0,ms(NOW))
> {code}
> But this might have other uses.
> I'll attach an initial implementation.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-1871) Function Query "map" variant that allows "target" to be an arbitrary ValueSource

2010-04-08 Thread Chris Harris (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Harris updated SOLR-1871:
---

Attachment: SOLR-1871.patch

Clean up some sloppiness in RangeMapfFloatFunction, where target was still 
being treated as if it were a value type.

> Function Query "map" variant that allows "target" to be an arbitrary 
> ValueSource
> 
>
> Key: SOLR-1871
> URL: https://issues.apache.org/jira/browse/SOLR-1871
> Project: Solr
>  Issue Type: Improvement
>  Components: search
>Affects Versions: 1.4
>Reporter: Chris Harris
> Attachments: SOLR-1871.patch, SOLR-1871.patch
>
>
> Currently, as documented at http://wiki.apache.org/solr/FunctionQuery#map, 
> the "target" of a map must be a floating point constant. I propose that you 
> should have at least the option of doing a map where the target is an 
> arbitrary ValueSource.
> The particular use case that inspired this is that I want to be able to 
> control how missing date fields affect boosting. In particular, I want to 
> be able to use something like this in my function queries:
> {code}
> map(mydatefield,0,0,ms(NOW))
> {code}
> But this might have other uses.
> I'll attach an initial implementation.




[jira] Updated: (SOLR-1871) Function Query "map" variant that allows "target" to be an arbitrary ValueSource

2010-04-08 Thread Chris Harris (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Harris updated SOLR-1871:
---

Attachment: SOLR-1871.patch

A first stab at this. In this implementation, we copy and paste to create a 
wholly separate function, "mapf" ("f" because this map is more 
function-oriented).

mapf is not as efficient as normal map when the target is a constant. If the 
target is a constant, you now have something like a LongConstValueSource where 
you used to have just a float. I haven't tried to measure the performance 
difference.

RangeMapfFloatFunction.hashCode() may be messed up. It's based on 
RangeMapFloatFunction.hashCode(), but I threw this patch together without 
stopping to properly understand that method first.
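For what it's worth, a conventional hashCode for a function over (source, min, max, target, default) just folds the component hashes together; this is a hedged sketch with illustrative names, not the actual RangeMapfFloatFunction fields.

```java
// Hedged sketch of a conventional hashCode combination for a map-like
// function over (source, min, max, target, default). The component hashes
// would come from the underlying ValueSources; names are illustrative.
public class HashSketch {
    static int combine(int sourceHash, float min, float max,
                       int targetHash, int defaultHash) {
        int h = sourceHash;
        h = h * 31 + Float.floatToIntBits(min);   // floats via their bit pattern
        h = h * 31 + Float.floatToIntBits(max);
        h = h * 31 + targetHash;
        h = h * 31 + defaultHash;
        return h;
    }

    public static void main(String[] args) {
        // equal components must combine to equal hashes
        System.out.println(combine(1, 0f, 0f, 2, 3) == combine(1, 0f, 0f, 2, 3));
    }
}
```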

> Function Query "map" variant that allows "target" to be an arbitrary 
> ValueSource
> 
>
> Key: SOLR-1871
> URL: https://issues.apache.org/jira/browse/SOLR-1871
> Project: Solr
>  Issue Type: Improvement
>  Components: search
>Affects Versions: 1.4
>Reporter: Chris Harris
> Attachments: SOLR-1871.patch
>
>
> Currently, as documented at http://wiki.apache.org/solr/FunctionQuery#map, 
> the "target" of a map must be a floating point constant. I propose that you 
> should have at least the option of doing a map where the target is an 
> arbitrary ValueSource.
> The particular use case that inspired this is that I want to be able to 
> control how missing date fields affect boosting. In particular, I want to 
> be able to use something like this in my function queries:
> {code}
> map(mydatefield,0,0,ms(NOW))
> {code}
> But this might have other uses.
> I'll attach an initial implementation.




[jira] Created: (SOLR-1871) Function Query "map" variant that allows "target" to be an arbitrary ValueSource

2010-04-08 Thread Chris Harris (JIRA)
Function Query "map" variant that allows "target" to be an arbitrary ValueSource


 Key: SOLR-1871
 URL: https://issues.apache.org/jira/browse/SOLR-1871
 Project: Solr
  Issue Type: Improvement
  Components: search
Affects Versions: 1.4
Reporter: Chris Harris


Currently, as documented at http://wiki.apache.org/solr/FunctionQuery#map, the 
"target" of a map must be a floating point constant. I propose that you should 
have at least the option of doing a map where the target is an arbitrary 
ValueSource.

The particular use case that inspired this is that I want to be able to control 
how missing date fields affect boosting. In particular, I want to be able to 
use something like this in my function queries:

{code}
map(mydatefield,0,0,ms(NOW))
{code}

But this might have other uses.

I'll attach an initial implementation.




[jira] Updated: (SOLR-1856) In Solr Cell, literals should override Tika-parsed values

2010-03-30 Thread Chris Harris (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Harris updated SOLR-1856:
---

Description: 
I propose that ExtractingRequestHandler / SolrCell literals should take 
precedence over Tika-parsed metadata in all situations, including where 
multiValued="true". (Compare SOLR-1633?)

My personal motivation is that I have several fields (e.g. "title", "date") 
where my own metadata is much superior to what Tika offers, and I want to throw 
those Tika values away. (I actually wouldn't mind throwing away _all_ 
Tika-parsed values, but let's set that aside.) SOLR-1634 is one potential 
approach to this, but the fix here might be simpler.

I'll attach a patch shortly.

  was:
I propose that ExtractingRequestHandler / SolrCell literals should take 
precedence over Tika-parsed metadata in all situations, including where 
multiValued="false". (Compare SOLR-1633.)

My personal motivation is that I have several fields (e.g. "title", "date") 
where my own metadata is much superior to what Tika offers, and I want to throw 
those Tika values away. (I actually wouldn't mind throwing away _all_ 
Tika-parsed values, but let's set that aside.) SOLR-1634 is one potential 
approach to this, but the fix here might be simpler.

I'll attach a patch shortly.


> In Solr Cell, literals should override Tika-parsed values
> -
>
> Key: SOLR-1856
> URL: https://issues.apache.org/jira/browse/SOLR-1856
> Project: Solr
>  Issue Type: Improvement
>  Components: contrib - Solr Cell (Tika extraction)
>Affects Versions: 1.4
>Reporter: Chris Harris
> Attachments: SOLR-1856.patch
>
>
> I propose that ExtractingRequestHandler / SolrCell literals should take 
> precedence over Tika-parsed metadata in all situations, including where 
> multiValued="true". (Compare SOLR-1633?)
> My personal motivation is that I have several fields (e.g. "title", "date") 
> where my own metadata is much superior to what Tika offers, and I want to 
> throw those Tika values away. (I actually wouldn't mind throwing away _all_ 
> Tika-parsed values, but let's set that aside.) SOLR-1634 is one potential 
> approach to this, but the fix here might be simpler.
> I'll attach a patch shortly.




[jira] Commented: (SOLR-1633) Solr Cell should be smarter about literal and multiValued="false"

2010-03-30 Thread Chris Harris (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851667#action_12851667
 ] 

Chris Harris commented on SOLR-1633:


bq. It seems like a possible improvement here would be for SolrCell to ignore 
the value from Tika if it already has one that was explicitly provided (as 
opposed to the current behavior of letting the add fail because of multiple 
values in a single valued field).

I've implemented this, or at least something pretty similar, at SOLR-1856.

> Solr Cell should be smarter about literal and multiValued="false"
> -
>
> Key: SOLR-1633
> URL: https://issues.apache.org/jira/browse/SOLR-1633
> Project: Solr
>  Issue Type: Improvement
>  Components: contrib - Solr Cell (Tika extraction)
>Reporter: Hoss Man
>
> As noted on solr-user, SolrCell has less than ideal behavior when "foo" is a 
> single value field, and literal.foo=bar is specified in the request, but Tika 
> also produces a value for the "foo" field from the document.  It seems like a 
> possible improvement here would be for SolrCell to ignore the value from Tika 
> if it already has one that was explicitly provided (as opposed to the current 
> behavior of letting the add fail because of multiple values in a single 
> valued field).
> It seems pretty clear that in cases like this, the user's intention is to have 
> their one literal field used as the value.
> http://old.nabble.com/Re%3A-WELCOME-to-solr-user%40lucene.apache.org-to26650071.html#a26650071




[jira] Updated: (SOLR-1856) In Solr Cell, literals should override Tika-parsed values

2010-03-30 Thread Chris Harris (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Harris updated SOLR-1856:
---

Attachment: SOLR-1856.patch

Initial patch. Notes:
 * We allow literal values to override all other Tika/SolrCell stuff, including 
1) fields in the Tika metadata object, 2) the Tika content field, and 3) any 
"captured content" fields
 * Currently literalValuesOverrideOtherValues is always true. This could be 
made a config option, but my intuition so far is that it's not worth the 
complication.
 * Includes an initial unit test
 * Interestingly, all the old (and unmodified) unit tests still pass.
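The override rule in the first bullet reduces to a simple merge (a simplified sketch under assumed names, not the actual ExtractingRequestHandler code): when a literal was supplied for a field, every Tika-derived value for that field is dropped.

```java
import java.util.*;

// Simplified sketch of "literals override Tika values": when the request
// supplied literal.<field>, every Tika-derived value for that field is
// discarded. Names are illustrative, not the actual ExtractingRequestHandler.
public class LiteralOverrideSketch {
    static Map<String, List<String>> merge(Map<String, List<String>> literals,
                                           Map<String, List<String>> tikaValues) {
        Map<String, List<String>> doc = new LinkedHashMap<>();
        literals.forEach((f, vals) -> doc.put(f, new ArrayList<>(vals)));
        tikaValues.forEach((f, vals) -> {
            if (!literals.containsKey(f)) {   // a literal wins outright
                doc.put(f, new ArrayList<>(vals));
            }
        });
        return doc;
    }

    public static void main(String[] args) {
        Map<String, List<String>> literals = Map.of("title", List.of("My Title"));
        Map<String, List<String>> tika = new LinkedHashMap<>();
        tika.put("title", List.of("Tika Title"));   // overridden by the literal
        tika.put("author", List.of("Tika Author")); // kept: no literal given
        System.out.println(merge(literals, tika));
    }
}
```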


> In Solr Cell, literals should override Tika-parsed values
> -
>
> Key: SOLR-1856
> URL: https://issues.apache.org/jira/browse/SOLR-1856
> Project: Solr
>  Issue Type: Improvement
>  Components: contrib - Solr Cell (Tika extraction)
>Affects Versions: 1.4
>Reporter: Chris Harris
> Attachments: SOLR-1856.patch
>
>
> I propose that ExtractingRequestHandler / SolrCell literals should take 
> precedence over Tika-parsed metadata in all situations, including where 
> multiValued="false". (Compare SOLR-1633.)
> My personal motivation is that I have several fields (e.g. "title", "date") 
> where my own metadata is much superior to what Tika offers, and I want to 
> throw those Tika values away. (I actually wouldn't mind throwing away _all_ 
> Tika-parsed values, but let's set that aside.) SOLR-1634 is one potential 
> approach to this, but the fix here might be simpler.
> I'll attach a patch shortly.




[jira] Created: (SOLR-1856) In Solr Cell, literals should override Tika-parsed values

2010-03-30 Thread Chris Harris (JIRA)
In Solr Cell, literals should override Tika-parsed values
-

 Key: SOLR-1856
 URL: https://issues.apache.org/jira/browse/SOLR-1856
 Project: Solr
  Issue Type: Improvement
  Components: contrib - Solr Cell (Tika extraction)
Affects Versions: 1.4
Reporter: Chris Harris


I propose that ExtractingRequestHandler / SolrCell literals should take 
precedence over Tika-parsed metadata in all situations, including where 
multiValued="false". (Compare SOLR-1633.)

My personal motivation is that I have several fields (e.g. "title", "date") 
where my own metadata is much superior to what Tika offers, and I want to throw 
those Tika values away. (I actually wouldn't mind throwing away _all_ 
Tika-parsed values, but let's set that aside.) SOLR-1634 is one potential 
approach to this, but the fix here might be simpler.

I'll attach a patch shortly.




[jira] Commented: (SOLR-380) There's no way to convert search results into page-level hits of a "structured document".

2010-02-22 Thread Chris Harris (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12837035#action_12837035
 ] 

Chris Harris commented on SOLR-380:
---

This is an interesting patch. One current limitation seems to be that proximity 
search queries (PhraseQueries and SpanQueries) may result in false positives. 
For example, if I query

bq. "audit trail"~10

then I think I'd expect Solr to return only the page numbers where "audit" and 
"trail" are near one another. (Yes, what I just said leaves some wiggle room for 
implementation details.) The current code, in contrast, looks like it will 
report all the pages where "audit" and "trail" occur, regardless of their 
proximity to each other.

Has anyone thought about how to add proximity awareness?
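Setting proximity aside, the position-to-page lookup the issue describes comes down to a binary search over page-start term positions. This is a hedged standalone sketch with illustrative names, not the patch's code:

```java
import java.util.Arrays;

// Sketch of mapping a hit's term position to a page id using the stored map
// of page-start positions described in the issue. pageStarts[i] is the first
// term position on page i; names and layout are illustrative.
public class PageLookupSketch {
    static int pageOf(int[] pageStarts, int termPosition) {
        int idx = Arrays.binarySearch(pageStarts, termPosition);
        // exact hit: the position starts a page; otherwise binarySearch returns
        // -(insertionPoint) - 1, so the containing page is insertionPoint - 1
        return idx >= 0 ? idx : -idx - 2;
    }

    public static void main(String[] args) {
        int[] pageStarts = {0, 120, 305, 980};       // four pages, 0-based
        System.out.println(pageOf(pageStarts, 150)); // position 150 is on page 1
    }
}
```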

> There's no way to convert search results into page-level hits of a 
> "structured document".
> -
>
> Key: SOLR-380
> URL: https://issues.apache.org/jira/browse/SOLR-380
> Project: Solr
>  Issue Type: New Feature
>  Components: search
>Reporter: Tricia Williams
>Priority: Minor
> Fix For: 1.5
>
> Attachments: SOLR-380-XmlPayload.patch, SOLR-380-XmlPayload.patch, 
> xmlpayload-example.zip, xmlpayload-src.jar, xmlpayload.jar
>
>
> "Paged-Text" FieldType for Solr
> A chance to dig into the guts of Solr. The problem: If we index a monograph 
> in Solr, there's no way to convert search results into page-level hits. The 
> solution: have a "paged-text" fieldtype which keeps track of page divisions 
> as it indexes, and reports page-level hits in the search results.
> The input would contain page milestones: . As Solr processed 
> the tokens (using its standard tokenizers and filters), it would concurrently 
> build a structural map of the item, indicating which term position marked the 
> beginning of which page: . This map would 
> be stored in an unindexed field in some efficient format.
> At search time, Solr would retrieve term positions for all hits that are 
> returned in the current request, and use the stored map to determine page ids 
> for each term position. The results would imitate the results for 
> highlighting, something like:
> (The XML of the example response was stripped by the mail archive; it showed 
> per-document page-level hits, e.g. pages 234 and 236 for one document and 
> page 19 for another, along with a term position of 14325.)




[jira] Updated: (SOLR-1624) Highlighter bug with MultiValued field + TermPositions optimization

2009-12-04 Thread Chris Harris (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Harris updated SOLR-1624:
---

Attachment: SOLR-1624.patch

> Highlighter bug with MultiValued field + TermPositions optimization
> ---
>
> Key: SOLR-1624
> URL: https://issues.apache.org/jira/browse/SOLR-1624
> Project: Solr
>  Issue Type: Bug
>  Components: highlighter
>Affects Versions: 1.4
>Reporter: Chris Harris
> Attachments: SOLR-1624.patch
>
>
> When TermPositions are stored, then 
> DefaultSolrHighlighter.doHighlighting(DocList docs, Query query, 
> SolrQueryRequest req, String[] defaultFields) currently initializes tstream 
> only for the first value of a multi-valued field. (Subsequent times through 
> the loop reinitialization is preempted by tots being non-null.) This means 
> that the 2nd/3rd/etc. values are not considered for highlighting purposes, 
> resulting in missed highlights.
> I'm attaching a patch with a test case to demonstrate the problem 
> (testTermVecMultiValuedHighlight2), as well as a proposed fix. All 
> highlighter tests pass with this applied. The patch should apply cleanly 
> against the latest trunk.




[jira] Created: (SOLR-1624) Highlighter bug with MultiValued field + TermPositions optimization

2009-12-04 Thread Chris Harris (JIRA)
Highlighter bug with MultiValued field + TermPositions optimization
---

 Key: SOLR-1624
 URL: https://issues.apache.org/jira/browse/SOLR-1624
 Project: Solr
  Issue Type: Bug
  Components: highlighter
Affects Versions: 1.4
Reporter: Chris Harris


When TermPositions are stored, then 
DefaultSolrHighlighter.doHighlighting(DocList docs, Query query, 
SolrQueryRequest req, String[] defaultFields) currently initializes tstream 
only for the first value of a multi-valued field. (Subsequent times through the 
loop reinitialization is preempted by tots being non-null.) This means that the 
2nd/3rd/etc. values are not considered for highlighting purposes, resulting in 
missed highlights.

I'm attaching a patch with a test case to demonstrate the problem 
(testTermVecMultiValuedHighlight2), as well as a proposed fix. All highlighter 
tests pass with this applied. The patch should apply cleanly against the latest 
trunk.
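The shape of the bug can be shown in miniature (a minimal standalone sketch, not the actual DefaultSolrHighlighter code): per-value stream setup sits inside a guard that only passes on the first iteration, so later values never get a token stream.

```java
import java.util.*;

// Minimal sketch of the reported loop bug: initialization of the per-value
// token stream sits behind "tots == null", which only holds on the first
// iteration, so values 2..n are never prepared for highlighting.
public class HighlightLoopSketch {
    static List<String> valuesHighlightedBuggy(List<String> fieldValues) {
        List<String> highlighted = new ArrayList<>();
        Object tots = null;               // stands in for the cached term state
        for (String value : fieldValues) {
            if (tots == null) {           // BUG: guard also skips tstream setup
                tots = new Object();
                highlighted.add(value);   // only the first value gets a stream
            }
        }
        return highlighted;
    }

    static List<String> valuesHighlightedFixed(List<String> fieldValues) {
        List<String> highlighted = new ArrayList<>();
        Object tots = null;
        for (String value : fieldValues) {
            if (tots == null) tots = new Object(); // cache the state once...
            highlighted.add(value);                // ...but set up every value
        }
        return highlighted;
    }

    public static void main(String[] args) {
        List<String> vals = Arrays.asList("v1", "v2", "v3");
        System.out.println(valuesHighlightedBuggy(vals)); // only the first value
        System.out.println(valuesHighlightedFixed(vals)); // all three values
    }
}
```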




[jira] Created: (SOLR-1475) Java-based replication doesn't properly reserve its commit point during backups

2009-09-29 Thread Chris Harris (JIRA)
Java-based replication doesn't properly reserve its commit point during backups
---

 Key: SOLR-1475
 URL: https://issues.apache.org/jira/browse/SOLR-1475
 Project: Solr
  Issue Type: Bug
  Components: replication (java)
Affects Versions: 1.4
Reporter: Chris Harris


The issue title reflects Mark Miller's initial diagnosis of the problem.

Here are my symptoms:

This is regarding the backup feature of replication, as opposed to replication 
itself. 
Backups seem to work fine on toy indexes. When trying backups out on a copy of 
my production index (300GB-ish), though, I'm getting FileNotFoundExceptions. 
These cancel the backup, and delete the snapshot.mmdd* directory. It seems 
reproducible, in that every time I try to make a backup of my large index it 
will fail the same way.

This is Solr r815830. I'm not sure if this is something that would potentially 
be addressed by SOLR-1458? (That patch is from after r815830.)

For now I'm not using any event-based backup triggers; instead I'm manually 
hitting

http://master_host:port/solr/replication?command=backup

This successfully sets off a snapshot, as seen in a thread dump.  However, 
after a while the snapshot fails. I'll paste in a couple of stack traces below.

I haven't seen any other evidence that my index is corrupt; in particular, 
searching the index and Java-based replication seem to be working fine, and the 
Lucene CheckIndex tool did not report any problems with the index.



{code}
Sep 28, 2009 9:32:18 AM org.apache.solr.handler.SnapShooter createSnapshot
SEVERE: Exception while creating snapshot
java.io.FileNotFoundException: Source
'E:\tomcat\solrstuff\solr\filingcore\data\index\_y0w.fnm' does not
exist
   at org.apache.commons.io.FileUtils.copyFile(FileUtils.java:637)
   at 
org.apache.commons.io.FileUtils.copyFileToDirectory(FileUtils.java:587)
   at 
org.apache.solr.handler.SnapShooter.createSnapshot(SnapShooter.java:83)
   at org.apache.solr.handler.SnapShooter$1.run(SnapShooter.java:61)

Sep 28, 2009 10:39:43 AM org.apache.solr.handler.SnapShooter createSnapshot
SEVERE: Exception while creating snapshot
java.io.FileNotFoundException: Source
'E:\tomcat\solrstuff\solr\filingcore\data\index\segments_by' does not
exist
   at org.apache.commons.io.FileUtils.copyFile(FileUtils.java:637)
   at 
org.apache.commons.io.FileUtils.copyFileToDirectory(FileUtils.java:587)
   at 
org.apache.solr.handler.SnapShooter.createSnapshot(SnapShooter.java:83)
   at org.apache.solr.handler.SnapShooter$1.run(SnapShooter.java:61)


Sep 28, 2009 11:52:08 AM org.apache.solr.handler.SnapShooter createSnapshot
SEVERE: Exception while creating snapshot
java.io.FileNotFoundException: Source
'E:\tomcat\solrstuff\solr\filingcore\data\index\_yby.nrm' does not
exist
   at org.apache.commons.io.FileUtils.copyFile(FileUtils.java:637)
   at 
org.apache.commons.io.FileUtils.copyFileToDirectory(FileUtils.java:587)
   at 
org.apache.solr.handler.SnapShooter.createSnapshot(SnapShooter.java:83)
   at org.apache.solr.handler.SnapShooter$1.run(SnapShooter.java:61)
{code}
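The failure mode in the traces above can be modeled in miniature (an illustrative toy, not Solr's actual SnapShooter or deletion-policy code): the backup captures a file list from a commit point, a merge deletes files in the meantime, and the copy then hits a missing file. Reserving the commit point pins its files until the copy finishes.

```java
import java.util.*;

// Toy model of the race: the backup snapshots a commit point's file list,
// a concurrent merge deletes files, and the copy fails partway through.
// "Reserving" the commit point defers those deletions. Illustrative only.
public class SnapshotRaceSketch {
    static String copyAll(List<String> snapshotFiles, Set<String> liveFiles) {
        for (String f : snapshotFiles) {
            if (!liveFiles.contains(f)) {
                return "FileNotFoundException: " + f; // copy dies mid-backup
            }
        }
        return "ok";
    }

    public static void main(String[] args) {
        List<String> commitFiles = Arrays.asList("_y0w.fnm", "segments_by");

        Set<String> live = new HashSet<>(commitFiles);
        live.remove("_y0w.fnm");               // merge ran; file already deleted
        System.out.println(copyAll(commitFiles, live));

        Set<String> reserved = new HashSet<>(commitFiles); // commit reserved:
        System.out.println(copyAll(commitFiles, reserved)); // deletions deferred
    }
}
```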





[jira] Commented: (SOLR-284) Parsing Rich Document Types

2009-09-16 Thread Chris Harris (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12756259#action_12756259
 ] 

Chris Harris commented on SOLR-284:
---

Grant and company: I just noticed that the example solrconfig.xml at the head 
of SVN trunk still uses map, not fmap. (In particular, there's "map.content", 
"map.a", and "map.div".) I assume this should be fixed for the 1.4 release. 
Interestingly, this doesn't seem to make any unit tests fail.

> Parsing Rich Document Types
> ---
>
> Key: SOLR-284
> URL: https://issues.apache.org/jira/browse/SOLR-284
> Project: Solr
>  Issue Type: New Feature
>  Components: update
>Reporter: Eric Pugh
>Assignee: Grant Ingersoll
> Fix For: 1.4
>
> Attachments: libs.zip, rich.patch, rich.patch, rich.patch, 
> rich.patch, rich.patch, rich.patch, rich.patch, schema_update.patch, 
> SOLR-284-no-key-gen.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, 
> SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, 
> SOLR-284.patch, SOLR-284.patch, solr-word.pdf, source.zip, test-files.zip, 
> test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler 
> that supports streaming a PDF, Word, PowerPoint, or Excel document into 
> Solr.
> There is a wiki page with information here: 
> http://wiki.apache.org/solr/UpdateRichDocuments
>  




[jira] Commented: (SOLR-284) Parsing Rich Document Types

2009-09-16 Thread Chris Harris (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12756154#action_12756154
 ] 

Chris Harris commented on SOLR-284:
---

This caught me by surprise, so I'm noting it here in case it helps anyone else:

In SVN r815830 (September 16, 2009), Grant renamed the field name mapping 
argument "map" to "fmap". The reason was to make naming more consistent with 
the CSV handler. For more info on this see the following thread:

http://www.nabble.com/Fwd%3A-CSV-Update---Need-help-mapping-csv-field-to-schema%27s-ID-td25463942.html



> Parsing Rich Document Types
> ---
>
> Key: SOLR-284
> URL: https://issues.apache.org/jira/browse/SOLR-284
> Project: Solr
>  Issue Type: New Feature
>  Components: update
>Reporter: Eric Pugh
>Assignee: Grant Ingersoll
> Fix For: 1.4
>
> Attachments: libs.zip, rich.patch, rich.patch, rich.patch, 
> rich.patch, rich.patch, rich.patch, rich.patch, schema_update.patch, 
> SOLR-284-no-key-gen.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, 
> SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, 
> SOLR-284.patch, SOLR-284.patch, solr-word.pdf, source.zip, test-files.zip, 
> test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler 
> that supports streaming a PDF, Word, PowerPoint, or Excel document into 
> Solr.
> There is a wiki page with information here: 
> http://wiki.apache.org/solr/UpdateRichDocuments
>  




[jira] Commented: (SOLR-284) Parsing Rich Document Types

2009-07-01 Thread Chris Harris (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12726123#action_12726123
 ] 

Chris Harris commented on SOLR-284:
---

{quote}
bq. My only request is that, if you're changing how field mapping works and 
maybe removing ext.ignore.und.fl, you make sure it stays easy to say, "Tika, I 
don't care about any of your parsed metadata."

Map unknown fields to an ignored fieldtype.
uprefix=ignored_
{quote}

That seems fine.

Tangentially, I wonder how fast Tika's metadata extraction is, compared to its 
main body text extraction. If the latter doesn't dwarf the former, there might 
be value in adding a "Solr, don't even ask Tika to calculate metadata at all; 
just have it extract the body text" flag; this could potentially speed things 
up for people who don't need the metadata. Maybe it would make sense to 
benchmark things before adding such a flag, though. I also don't have a good 
sense of how many people will want to use the metadata feature vs. how many 
won't.


> Parsing Rich Document Types
> ---
>
> Key: SOLR-284
> URL: https://issues.apache.org/jira/browse/SOLR-284
> Project: Solr
>  Issue Type: New Feature
>  Components: update
>Reporter: Eric Pugh
>Assignee: Grant Ingersoll
> Fix For: 1.4
>
> Attachments: libs.zip, rich.patch, rich.patch, rich.patch, 
> rich.patch, rich.patch, rich.patch, rich.patch, SOLR-284-no-key-gen.patch, 
> SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, 
> SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, solr-word.pdf, source.zip, 
> test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler 
> that supports streaming a PDF, Word, PowerPoint, or Excel document into 
> Solr.
> There is a wiki page with information here: 
> http://wiki.apache.org/solr/UpdateRichDocuments
>  




[jira] Commented: (SOLR-284) Parsing Rich Document Types

2009-06-29 Thread Chris Harris (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12725237#action_12725237
 ] 

Chris Harris commented on SOLR-284:
---

bq. Apologies for not reviewing this sooner after it was committed - but this 
is the last/best chance to improve the interface before 1.4 is released (and 
this is very important new functionality).

My only request is that, if you're changing how field mapping works and maybe 
removing ext.ignore.und.fl, you make sure it stays easy to say, "Tika, I don't 
care about any of your parsed metadata. Please leave it out of my Solr index." 
In my current use case I already know all the metadata I want, and including 
the Tika-parsed fields would result in index bloat. (My temptation would be to 
make excluding Tika-parsed fields the default, though it sounds like other 
people have the opposite inclination.)


> Parsing Rich Document Types
> ---
>
> Key: SOLR-284
> URL: https://issues.apache.org/jira/browse/SOLR-284
> Project: Solr
>  Issue Type: New Feature
>  Components: update
>Reporter: Eric Pugh
>Assignee: Grant Ingersoll
> Fix For: 1.4
>
> Attachments: libs.zip, rich.patch, rich.patch, rich.patch, 
> rich.patch, rich.patch, rich.patch, rich.patch, SOLR-284-no-key-gen.patch, 
> SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, 
> SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, solr-word.pdf, source.zip, 
> test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler 
> that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into 
> Solr.
> There is a wiki page with information here: 
> http://wiki.apache.org/solr/UpdateRichDocuments
>  




[jira] Commented: (SOLR-1145) Patch to set IndexWriter.defaultInfoStream from solr.xml

2009-06-16 Thread Chris Harris (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720172#action_12720172
 ] 

Chris Harris commented on SOLR-1145:


bq. So I think it should be more like this - per core settings.

That works fine for me. I didn't actually think doing it per core would be that 
easy, which is part of why I started out with the process-wide approach.

Any ideas about if the TimeLoggingPrintStream is going to slow things down too 
much? I didn't do any kind of benchmarking with that.

> Patch to set IndexWriter.defaultInfoStream from solr.xml
> 
>
> Key: SOLR-1145
> URL: https://issues.apache.org/jira/browse/SOLR-1145
> Project: Solr
>  Issue Type: Improvement
>Reporter: Chris Harris
> Fix For: 1.4
>
> Attachments: SOLR-1145.patch, SOLR-1145.patch, SOLR-1145.patch
>
>
> Lucene IndexWriters use an infoStream to log detailed info about indexing 
> operations for debugging purposes. This patch is an extremely simple way to 
> allow logging this info to a file from within Solr: After applying the patch, 
> set the new "defaultInfoStreamFilePath" attribute of the solr element in 
> solr.xml to the path of the file where you'd like to save the logging 
> information.
> Note that, in a multi-core setup, all cores will end up logging to the same 
> infoStream log file. This may not be desired. (But it does justify putting 
> the setting in solr.xml rather than solrconfig.xml.)




[jira] Updated: (SOLR-1145) Patch to set IndexWriter.defaultInfoStream from solr.xml

2009-05-21 Thread Chris Harris (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Harris updated SOLR-1145:
---

Attachment: (was: SOLR-1145.patch)

> Patch to set IndexWriter.defaultInfoStream from solr.xml
> 
>
> Key: SOLR-1145
> URL: https://issues.apache.org/jira/browse/SOLR-1145
> Project: Solr
>  Issue Type: Improvement
>Reporter: Chris Harris
> Fix For: 1.4
>
> Attachments: SOLR-1145.patch, SOLR-1145.patch
>
>
> Lucene IndexWriters use an infoStream to log detailed info about indexing 
> operations for debugging purposes. This patch is an extremely simple way to 
> allow logging this info to a file from within Solr: After applying the patch, 
> set the new "defaultInfoStreamFilePath" attribute of the solr element in 
> solr.xml to the path of the file where you'd like to save the logging 
> information.
> Note that, in a multi-core setup, all cores will end up logging to the same 
> infoStream log file. This may not be desired. (But it does justify putting 
> the setting in solr.xml rather than solrconfig.xml.)




[jira] Updated: (SOLR-1145) Patch to set IndexWriter.defaultInfoStream from solr.xml

2009-05-21 Thread Chris Harris (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Harris updated SOLR-1145:
---

Attachment: SOLR-1145.patch

Oops. InfoStream logging should not be enabled by default in example/multicore.

> Patch to set IndexWriter.defaultInfoStream from solr.xml
> 
>
> Key: SOLR-1145
> URL: https://issues.apache.org/jira/browse/SOLR-1145
> Project: Solr
>  Issue Type: Improvement
>Reporter: Chris Harris
> Fix For: 1.4
>
> Attachments: SOLR-1145.patch, SOLR-1145.patch, SOLR-1145.patch
>
>
> Lucene IndexWriters use an infoStream to log detailed info about indexing 
> operations for debugging purposes. This patch is an extremely simple way to 
> allow logging this info to a file from within Solr: After applying the patch, 
> set the new "defaultInfoStreamFilePath" attribute of the solr element in 
> solr.xml to the path of the file where you'd like to save the logging 
> information.
> Note that, in a multi-core setup, all cores will end up logging to the same 
> infoStream log file. This may not be desired. (But it does justify putting 
> the setting in solr.xml rather than solrconfig.xml.)




[jira] Updated: (SOLR-1145) Patch to set IndexWriter.defaultInfoStream from solr.xml

2009-05-21 Thread Chris Harris (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Harris updated SOLR-1145:
---

Attachment: SOLR-1145.patch

I found it confusing not to have any timestamp data in the infostream log. This 
new version prints a timestamp along with each infoStream message. You can 
configure the timestamp format, though there's currently no way to disable 
timestamping.
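The idea described above can be sketched roughly as follows. This is a hypothetical illustration of the approach, not the actual patch code: a PrintStream wrapper that prefixes each logged line with a configurable timestamp, which Lucene's IndexWriter can accept anywhere a plain PrintStream is expected.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.text.SimpleDateFormat;
import java.util.Date;

// Illustrative sketch only; names and details are assumptions, not the patch.
class TimeLoggingPrintStream extends PrintStream {
    private final SimpleDateFormat fmt;

    TimeLoggingPrintStream(OutputStream out, String datePattern) {
        super(out, true); // autoflush, as befits a log stream
        this.fmt = new SimpleDateFormat(datePattern);
    }

    @Override
    public void println(String message) {
        // Prefix every infoStream message with the formatted current time.
        super.println(fmt.format(new Date()) + " " + message);
    }
}
```

Since IndexWriter only needs a PrintStream, a wrapper like this can be handed to it without modifying Lucene; the per-message formatting cost is the benchmarking question raised above.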

> Patch to set IndexWriter.defaultInfoStream from solr.xml
> 
>
> Key: SOLR-1145
> URL: https://issues.apache.org/jira/browse/SOLR-1145
> Project: Solr
>  Issue Type: Improvement
>Reporter: Chris Harris
> Fix For: 1.4
>
> Attachments: SOLR-1145.patch, SOLR-1145.patch
>
>
> Lucene IndexWriters use an infoStream to log detailed info about indexing 
> operations for debugging purposes. This patch is an extremely simple way to 
> allow logging this info to a file from within Solr: After applying the patch, 
> set the new "defaultInfoStreamFilePath" attribute of the solr element in 
> solr.xml to the path of the file where you'd like to save the logging 
> information.
> Note that, in a multi-core setup, all cores will end up logging to the same 
> infoStream log file. This may not be desired. (But it does justify putting 
> the setting in solr.xml rather than solrconfig.xml.)




[jira] Updated: (SOLR-1145) Patch to set IndexWriter.defaultInfoStream from solr.xml

2009-05-04 Thread Chris Harris (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Harris updated SOLR-1145:
---

Attachment: SOLR-1145.patch

> Patch to set IndexWriter.defaultInfoStream from solr.xml
> 
>
> Key: SOLR-1145
> URL: https://issues.apache.org/jira/browse/SOLR-1145
> Project: Solr
>  Issue Type: Improvement
>Reporter: Chris Harris
> Attachments: SOLR-1145.patch
>
>
> Lucene IndexWriters use an infoStream to log detailed info about indexing 
> operations for debugging purposes. This patch is an extremely simple way to 
> allow logging this info to a file from within Solr: After applying the patch, 
> set the new "defaultInfoStreamFilePath" attribute of the solr element in 
> solr.xml to the path of the file where you'd like to save the logging 
> information.
> Note that, in a multi-core setup, all cores will end up logging to the same 
> infoStream log file. This may not be desired. (But it does justify putting 
> the setting in solr.xml rather than solrconfig.xml.)




[jira] Created: (SOLR-1145) Patch to set IndexWriter.defaultInfoStream from solr.xml

2009-05-04 Thread Chris Harris (JIRA)
Patch to set IndexWriter.defaultInfoStream from solr.xml


 Key: SOLR-1145
 URL: https://issues.apache.org/jira/browse/SOLR-1145
 Project: Solr
  Issue Type: Improvement
Reporter: Chris Harris


Lucene IndexWriters use an infoStream to log detailed info about indexing 
operations for debugging purposes. This patch is an extremely simple way to 
allow logging this info to a file from within Solr: After applying the patch, 
set the new "defaultInfoStreamFilePath" attribute of the solr element in 
solr.xml to the path of the file where you'd like to save the logging 
information.

Note that, in a multi-core setup, all cores will end up logging to the same 
infoStream log file. This may not be desired. (But it does justify putting the 
setting in solr.xml rather than solrconfig.xml.)
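From the description, the solr.xml change presumably looks something like the sketch below. Only the defaultInfoStreamFilePath attribute comes from this patch; the surrounding elements are an illustrative multi-core solr.xml, not taken from the patch itself.

```xml
<!-- Sketch: only defaultInfoStreamFilePath is introduced by this patch -->
<solr persistent="true" defaultInfoStreamFilePath="logs/infostream.log">
  <cores adminPath="/admin/cores">
    <core name="core0" instanceDir="core0" />
    <core name="core1" instanceDir="core1" />
  </cores>
</solr>
```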




[jira] Commented: (SOLR-744) Patch to make ShingleFilter.outputUnigramIfNoNgrams (LUCENE-1370) available in Solr schema files

2009-03-13 Thread Chris Harris (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12681962#action_12681962
 ] 

Chris Harris commented on SOLR-744:
---

Tom,

The Lucene half of this patch pair adds unit tests to 
src/test/org/apache/lucene/analysis/shingle/ShingleFilterTest.java. Do those 
tests pass when you run them on your custom lucene build, after applying 
LUCENE-1370? (cd to the top-level of lucene and then run "ant test 
-Dtestcase=ShingleFilterTest".) I didn't add any tests for the Solr half of the 
patch pair, but I also don't know how you would test it in a productive manner.

> Patch to make ShingleFilter.outputUnigramIfNoNgrams (LUCENE-1370) available 
> in Solr schema files
> 
>
> Key: SOLR-744
> URL: https://issues.apache.org/jira/browse/SOLR-744
> Project: Solr
>  Issue Type: Improvement
>Reporter: Chris Harris
> Attachments: SOLR-744.patch
>
>
> See LUCENE-1370




[jira] Commented: (SOLR-284) Parsing Rich Document Types

2009-01-12 Thread Chris Harris (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12663150#action_12663150
 ] 

Chris Harris commented on SOLR-284:
---

bq. I could, however, see adding a flag to specify whether one wants "silent 
success" or not. I think the use case for content extraction is different than 
the normal XML message path. Often times, these files are quite large and the 
cost of sending them to the system is significant.

In my own use case of the handler, I imagine the fail-on-missing-key policy 
would be the more helpful policy. This is because I want to be in control of my 
own key, and if Solr fails as soon as I don't provide one, that's going to help 
me find the bug in my indexing code right away, whereas "silent success" will 
allow that bug to fester. I'm not sure there would be significant 
countervailing advantages to the other policy. It's true that transferring a 
large file when you're just going to get an error message wastes some time, but 
I feel like in debugging there's potential to waste a lot more time.

My first choice would be for fail-on-missing-key to be the default, followed by 
having an easy-to-set flag. In any case, though, it would be nice not to have 
to create a custom SolrContentHandler just to get this one sanity check.

> Parsing Rich Document Types
> ---
>
> Key: SOLR-284
> URL: https://issues.apache.org/jira/browse/SOLR-284
> Project: Solr
>  Issue Type: New Feature
>  Components: update
>Reporter: Eric Pugh
>Assignee: Grant Ingersoll
> Fix For: 1.4
>
> Attachments: libs.zip, rich.patch, rich.patch, rich.patch, 
> rich.patch, rich.patch, rich.patch, rich.patch, SOLR-284.patch, 
> SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, 
> SOLR-284.patch, SOLR-284.patch, solr-word.pdf, source.zip, test-files.zip, 
> test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler 
> that supports streaming a PDF, Word, PowerPoint, or Excel document into 
> Solr.
> There is a wiki page with information here: 
> http://wiki.apache.org/solr/UpdateRichDocuments
>  




[jira] Created: (SOLR-953) Small simplification for LuceneGapFragmenter.isNewFragment

2009-01-12 Thread Chris Harris (JIRA)
Small simplification for LuceneGapFragmenter.isNewFragment
--

 Key: SOLR-953
 URL: https://issues.apache.org/jira/browse/SOLR-953
 Project: Solr
  Issue Type: Improvement
  Components: highlighter
Affects Versions: 1.4
Reporter: Chris Harris
Priority: Minor
 Attachments: SOLR-953.patch

This little patch makes the code for LuceneGapFragmenter.isNewFragment(Token) 
slightly more intuitive.

The method currently features the line

{code}
fragOffsetAccum += token.endOffset() - fragOffsetAccum;
{code}

This can be simplified, though, to just

{code}
fragOffsetAccum = token.endOffset();
{code}

Maybe it's just me, but I find the latter expression's intent to be 
sufficiently clearer than the former to warrant committing such a change.

This patch makes this simplification. Also, if you do make this simplification, 
then it doesn't really make sense to think of fragOffsetAccum as an accumulator 
anymore, so in the patch we rename the variable to just fragOffset.

Tests from HighlighterTest.java pass with the patch applied.
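To see the equivalence concretely: x += e - x rewrites to x = x + (e - x) = e, and this holds even under int overflow, since two's-complement addition and subtraction are exact inverses. A tiny standalone check (not part of the patch):

```java
public class FragOffsetDemo {
    public static void main(String[] args) {
        int fragOffsetAccum = 37;   // some prior accumulated offset
        int endOffset = 120;        // stand-in for token.endOffset()

        fragOffsetAccum += endOffset - fragOffsetAccum;  // original form

        // The accumulator now simply equals the end offset.
        System.out.println(fragOffsetAccum == endOffset); // prints "true"
    }
}
```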





[jira] Updated: (SOLR-953) Small simplification for LuceneGapFragmenter.isNewFragment

2009-01-12 Thread Chris Harris (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Harris updated SOLR-953:
--

Attachment: SOLR-953.patch

> Small simplification for LuceneGapFragmenter.isNewFragment
> --
>
> Key: SOLR-953
> URL: https://issues.apache.org/jira/browse/SOLR-953
> Project: Solr
>  Issue Type: Improvement
>  Components: highlighter
>Affects Versions: 1.4
>Reporter: Chris Harris
>Priority: Minor
> Attachments: SOLR-953.patch
>
>
> This little patch makes the code for LuceneGapFragmenter.isNewFragment(Token) 
> slightly more intuitive.
> The method currently features the line
> {code}
> fragOffsetAccum += token.endOffset() - fragOffsetAccum;
> {code}
> This can be simplified, though, to just
> {code}
> fragOffsetAccum = token.endOffset();
> {code}
> Maybe it's just me, but I find the latter expression's intent to be 
> sufficiently clearer than the former to warrant committing such a change.
> This patch makes this simplification. Also, if you do make this 
> simplification, then it doesn't really make sense to think of fragOffsetAccum 
> as an accumulator anymore, so in the patch we rename the variable to just 
> fragOffset.
> Tests from HighlighterTest.java pass with the patch applied.




[jira] Issue Comment Edited: (SOLR-952) TokenOrderingFilter class is defined in more than one java file

2009-01-09 Thread Chris Harris (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662555#action_12662555
 ] 

ryguasu edited comment on SOLR-952 at 1/9/09 3:39 PM:
---

I took a brief look through the svn logs:

 * In r422248, the class was first added. It was added to 
org/apache/solr/util/SolrPluginUtils.java. The purpose of this commit was: 
"order tokens by startOffset when highlighting"
 * In r510338 it was moved to org/apache/solr/util/HighlightingUtils.java.
 * In r639490 org/apache/solr/highlight/DefaultSolrHighlighter.java was created 
and the second copy was added there; this revision did not, however, touch 
org/apache/solr/util/HighlightingUtils.java


  was (Author: ryguasu):
I took a brief look through the svn logs:

 * In r448, the class was first added. It was added to 
org/apache/solr/util/SolrPluginUtils.java. The purpose of this commit was: 
"order tokens by startOffset when highlighting"
 * In r510338 it was moved to org/apache/solr/util/HighlightingUtils.java.
 * In r639490 org/apache/solr/highlight/DefaultSolrHighlighter.java was created 
and the second copy was added there; this revision did not, however, touch 
org/apache/solr/util/HighlightingUtils.java

  
> TokenOrderingFilter class is defined in more than one java file
> ---
>
> Key: SOLR-952
> URL: https://issues.apache.org/jira/browse/SOLR-952
> Project: Solr
>  Issue Type: Bug
>  Components: highlighter
>Affects Versions: 1.4
>Reporter: Chris Harris
>Priority: Minor
>
> The class TokenOrderingFilter is defined, with identical text, both in 
> org/apache/solr/highlight/DefaultSolrHighlighter.java and 
> org/apache/solr/util/HighlightingUtils.java. I assume this is not good, from 
> a code maintenance perspective.
> Verified this in Solr trunk r733155.




[jira] Issue Comment Edited: (SOLR-952) TokenOrderingFilter class is defined in more than one java file

2009-01-09 Thread Chris Harris (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662555#action_12662555
 ] 

ryguasu edited comment on SOLR-952 at 1/9/09 3:37 PM:
---

I took a brief look through the svn logs:

 * In r448, the class was first added. It was added to 
org/apache/solr/util/SolrPluginUtils.java. The purpose of this commit was: 
"order tokens by startOffset when highlighting"
 * In r510338 it was moved to org/apache/solr/util/HighlightingUtils.java.
 * In r639490 org/apache/solr/highlight/DefaultSolrHighlighter.java was created 
and the second copy was added there; this revision did not, however, touch 
org/apache/solr/util/HighlightingUtils.java


  was (Author: ryguasu):
I took a brief look through the svn logs:

 * The class started (I think) in org/apache/solr/util/SolrPluginUtils.java
 * In r510338 it was moved to org/apache/solr/util/HighlightingUtils.java.
 * In r639490 org/apache/solr/highlight/DefaultSolrHighlighter.java was created 
and the second copy was added there; this revision did not, however, touch 
org/apache/solr/util/HighlightingUtils.java

  
> TokenOrderingFilter class is defined in more than one java file
> ---
>
> Key: SOLR-952
> URL: https://issues.apache.org/jira/browse/SOLR-952
> Project: Solr
>  Issue Type: Bug
>  Components: highlighter
>Affects Versions: 1.4
>Reporter: Chris Harris
>Priority: Minor
>
> The class TokenOrderingFilter is defined, with identical text, both in 
> org/apache/solr/highlight/DefaultSolrHighlighter.java and 
> org/apache/solr/util/HighlightingUtils.java. I assume this is not good, from 
> a code maintenance perspective.
> Verified this in Solr trunk r733155.




[jira] Commented: (SOLR-952) TokenOrderingFilter class is defined in more than one java file

2009-01-09 Thread Chris Harris (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662555#action_12662555
 ] 

Chris Harris commented on SOLR-952:
---

I took a brief look through the svn logs:

 * The class started (I think) in org/apache/solr/util/SolrPluginUtils.java
 * In r510338 it was moved to org/apache/solr/util/HighlightingUtils.java.
 * In r639490 org/apache/solr/highlight/DefaultSolrHighlighter.java was created 
and the second copy was added there; this revision did not, however, touch 
org/apache/solr/util/HighlightingUtils.java


> TokenOrderingFilter class is defined in more than one java file
> ---
>
> Key: SOLR-952
> URL: https://issues.apache.org/jira/browse/SOLR-952
> Project: Solr
>  Issue Type: Bug
>  Components: highlighter
>Affects Versions: 1.4
>Reporter: Chris Harris
>Priority: Minor
>
> The class TokenOrderingFilter is defined, with identical text, both in 
> org/apache/solr/highlight/DefaultSolrHighlighter.java and 
> org/apache/solr/util/HighlightingUtils.java. I assume this is not good, from 
> a code maintenance perspective.
> Verified this in Solr trunk r733155.




[jira] Created: (SOLR-952) TokenOrderingFilter class is defined in more than one java file

2009-01-09 Thread Chris Harris (JIRA)
TokenOrderingFilter class is defined in more than one java file
---

 Key: SOLR-952
 URL: https://issues.apache.org/jira/browse/SOLR-952
 Project: Solr
  Issue Type: Bug
  Components: highlighter
Affects Versions: 1.4
Reporter: Chris Harris
Priority: Minor


The class TokenOrderingFilter is defined, with identical text, both in 
org/apache/solr/highlight/DefaultSolrHighlighter.java and 
org/apache/solr/util/HighlightingUtils.java. I assume this is not good, from a 
code maintenance perspective.

Verified this in Solr trunk r733155.




[jira] Updated: (SOLR-896) Solr Query Parser Plugin for Mark Miller's Qsol Parser

2009-01-07 Thread Chris Harris (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Harris updated SOLR-896:
--

Attachment: SOLR-896.patch

Updated the common.classpath variable in contrib\qsol\build.xml to be compatible 
with trunk r732229.

> Solr Query Parser Plugin for Mark Miller's Qsol Parser
> --
>
> Key: SOLR-896
> URL: https://issues.apache.org/jira/browse/SOLR-896
> Project: Solr
>  Issue Type: New Feature
>  Components: search
>Reporter: Chris Harris
> Attachments: SOLR-896.patch, SOLR-896.patch
>
>
> An extremely basic plugin to get the Qsol query parser 
> (http://www.myhardshadow.com/qsol.php) working in Solr.




[jira] Issue Comment Edited: (SOLR-896) Solr Query Parser Plugin for Mark Miller's Qsol Parser

2008-12-04 Thread Chris Harris (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653535#action_12653535
 ] 

ryguasu edited comment on SOLR-896 at 12/4/08 3:06 PM:


I don't know if this first stab will be useful to anyone else or not, but it 
might be slightly easier to get started with than writing your own. Limitations 
include:

* No ability to configure qsol (even though qsol is highly configurable) -- 
you're stuck with the defaults
* This doesn't alter qsol itself at all, so you don't get support for certain 
Solr goodies, like function queries

Usage:

* This patch creates /contrib/qsol.
* Download qsol from the qsol home page and put the qsol jar into 
/contrib/qsol/lib
* cd /contrib/qsol
* Run ant (no args needed) to create the qsol Solr plugin 
(/contrib/qsol/build/apache-solr-qsol-1.4-dev.jar or some such)
* To deploy, copy both the qsol Solr plugin jar and qsol.jar to your solr lib 
directory. In the example jetty setup that comes with solr, that should be 
/example/solr/lib/. In a multicore setup, you can specify where the 
lib directory is in solr.xml.
* There are a few different ways to make qsol accessible from Solr now. One is 
to add  to your solrconfig.xml, and 
then to prepend "{!qsol}" to your query URLs, e.g. "...?q={!qsol}term1 | 
term2". See http://wiki.apache.org/solr/SolrPlugins for more info.


  was (Author: ryguasu):
I don't know if this first stab will be useful to anyone else or not, but 
it might be slightly easier to get started with than writing your own. 
Limitations include:

* No ability to configure qsol (even though qsol is highly configurable) -- 
you're stuck with the defaults
* This doesn't alter qsol itself at all, so you don't get support for certain 
Solr goodies, like function queries

Usage:

* This patch creates /contrib/qsol.
* Download qsol from the qsol home page and put the qsol jar into 
/contrib/qsol/lib
* cd /contrib/qsol
* Run ant (no args needed) to create the qsol Solr plugin 
(/contrib/qsol/build/apache-solr-qsol-1.4-dev.jar or some such)
* To deploy, copy both the qsol Solr plugin jar and qsol.jar to your solr lib 
directory. In the example jetty setup that comes with solr, that should be 
/example/solr/lib/. In a multicore setup, you can specify where the 
lib directory is in solr.xml.

  
> Solr Query Parser Plugin for Mark Miller's Qsol Parser
> --
>
> Key: SOLR-896
> URL: https://issues.apache.org/jira/browse/SOLR-896
> Project: Solr
>  Issue Type: New Feature
>  Components: search
>Reporter: Chris Harris
> Attachments: SOLR-896.patch
>
>
> An extremely basic plugin to get the Qsol query parser 
> (http://www.myhardshadow.com/qsol.php) working in Solr.




[jira] Updated: (SOLR-896) Solr Query Parser Plugin for Mark Miller's Qsol Parser

2008-12-04 Thread Chris Harris (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Harris updated SOLR-896:
--

Attachment: SOLR-896.patch

I don't know if this first stab will be useful to anyone else or not, but it 
might be slightly easier to get started with than writing your own. Limitations 
include:

* No ability to configure qsol (even though qsol is highly configurable) -- 
you're stuck with the defaults
* This doesn't alter qsol itself at all, so you don't get support for certain 
Solr goodies, like function queries

Usage:

* This patch creates /contrib/qsol.
* Download qsol from the qsol home page and put the qsol jar into 
/contrib/qsol/lib
* cd /contrib/qsol
* Run ant (no args needed) to create the qsol Solr plugin 
(/contrib/qsol/build/apache-solr-qsol-1.4-dev.jar or some such)
* To deploy, copy both the qsol Solr plugin jar and qsol.jar to your solr lib 
directory. In the example jetty setup that comes with solr, that should be 
/example/solr/lib/. In a multicore setup, you can specify where the 
lib directory is in solr.xml.


> Solr Query Parser Plugin for Mark Miller's Qsol Parser
> --
>
> Key: SOLR-896
> URL: https://issues.apache.org/jira/browse/SOLR-896
> Project: Solr
>  Issue Type: New Feature
>  Components: search
>Reporter: Chris Harris
> Attachments: SOLR-896.patch
>
>
> An extremely basic plugin to get the Qsol query parser 
> (http://www.myhardshadow.com/qsol.php) working in Solr.




[jira] Created: (SOLR-896) Solr Query Parser Plugin for Mark Miller's Qsol Parser

2008-12-04 Thread Chris Harris (JIRA)
Solr Query Parser Plugin for Mark Miller's Qsol Parser
--

 Key: SOLR-896
 URL: https://issues.apache.org/jira/browse/SOLR-896
 Project: Solr
  Issue Type: New Feature
  Components: search
Reporter: Chris Harris


An extremely basic plugin to get the Qsol query parser 
(http://www.myhardshadow.com/qsol.php) working in Solr.




[jira] Commented: (SOLR-284) Parsing Rich Document Types

2008-12-03 Thread Chris Harris (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12652993#action_12652993
 ] 

Chris Harris commented on SOLR-284:
---

Currently this patch deploys the Tika libs to /trunk/example/solr/lib. I'm 
curious where the Tika handler's lib/ directory is supposed to go in a 
multicore deployment. I created my own multicore setup more or less like this:

* ant example
* Copy /trunk/example to /trunk/solr-1
* Copy /trunk/solr-1/multicore/* to /trunk/solr-1/solr.

(Solr-1 means "copy of Solr I plan to run on port 1.")

This seems to be the easiest way to set things up so that I can cd to 
/trunk/solr-1 and run start.jar to get multicore Solr running.

Or rather, that *would* get multicore Solr running, except that Solr gets a 
can't-find-the-Tika-classes exception. So I guess /trunk/solr-1/solr/lib is 
not where the lib directory goes for multicore deployment.

So I tried putting Tika libs instead in /trunk/solr-1/solr/core0/lib, and 
that loaded fine. That doesn't seem like the right place for the directory, 
though; it seems like each core shouldn't have to have its own separate copy of 
the Tika libs.

So where *do* the Tika libs go?


> Parsing Rich Document Types
> ---
>
> Key: SOLR-284
> URL: https://issues.apache.org/jira/browse/SOLR-284
> Project: Solr
>  Issue Type: New Feature
>  Components: update
>Reporter: Eric Pugh
>Assignee: Grant Ingersoll
> Fix For: 1.4
>
> Attachments: libs.zip, rich.patch, rich.patch, rich.patch, 
> rich.patch, rich.patch, rich.patch, rich.patch, SOLR-284.patch, 
> SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, 
> SOLR-284.patch, SOLR-284.patch, solr-word.pdf, source.zip, test-files.zip, 
> test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler 
> that supports streaming a PDF, Word, PowerPoint, or Excel document into 
> Solr.
> There is a wiki page with information here: 
> http://wiki.apache.org/solr/UpdateRichDocuments
>  




[jira] Updated: (SOLR-284) Parsing Rich Document Types

2008-12-03 Thread Chris Harris (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Harris updated SOLR-284:
--

Attachment: SOLR-284.patch

As I mentioned before, tests for these 

  solr.client.solrj.embedded.JettyWebappTest
  solr.client.solrj.embedded.LargeVolumeJettyTest
  solr.client.solrj.embedded.SolrExampleJettyTest
  solr.client.solrj.embedded.TestSpellCheckResponse

were failing, with Solr throwing a ClassNotFoundException for one of the 
extracting document loader (i.e. Solr Cell) classes.

This revision fixes this by removing all references to this Tika handler from 
/trunk/example/conf/solrconfig.xml and /trunk/example/conf/schema.xml. Note 
that these references still exist (and are still used for testing) in 
/trunk/contrib/extraction/src/test/resources/solr/conf.

There are probably other ways to make these tests pass, perhaps involving 
changing the setUp() methods for the above mentioned tests' java files. (For 
example, maybe you could fiddle with the path parameter passed to the 
WebAppContext constructor in JettyWebappTest.java? I don't really know anything 
about this embedded stuff.) I like the current approach, though, because it 
avoids further changes to code that's logically independent of this handler.


> Parsing Rich Document Types
> ---
>
> Key: SOLR-284
> URL: https://issues.apache.org/jira/browse/SOLR-284
> Project: Solr
>  Issue Type: New Feature
>  Components: update
>Reporter: Eric Pugh
>Assignee: Grant Ingersoll
> Fix For: 1.4
>
> Attachments: libs.zip, rich.patch, rich.patch, rich.patch, 
> rich.patch, rich.patch, rich.patch, rich.patch, SOLR-284.patch, 
> SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, 
> SOLR-284.patch, SOLR-284.patch, solr-word.pdf, source.zip, test-files.zip, 
> test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler 
> that supports streaming a PDF, Word, PowerPoint, or Excel document into 
> Solr.
> There is a wiki page with information here: 
> http://wiki.apache.org/solr/UpdateRichDocuments
>  




[jira] Updated: (SOLR-284) Parsing Rich Document Types

2008-12-02 Thread Chris Harris (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Harris updated SOLR-284:
--

Attachment: SOLR-284.patch

Changes since my previous upload:

* sync CHANGES.txt with trunk
* test cases for adding plain text data
* you aren't forced to map a field if you use the resource.name parameter


> Parsing Rich Document Types
> ---
>
> Key: SOLR-284
> URL: https://issues.apache.org/jira/browse/SOLR-284
> Project: Solr
>  Issue Type: New Feature
>  Components: update
>Reporter: Eric Pugh
>Assignee: Grant Ingersoll
> Fix For: 1.4
>
> Attachments: libs.zip, rich.patch, rich.patch, rich.patch, 
> rich.patch, rich.patch, rich.patch, rich.patch, SOLR-284.patch, 
> SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, 
> SOLR-284.patch, solr-word.pdf, source.zip, test-files.zip, test-files.zip, 
> test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler 
> that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into 
> Solr.
> There is a wiki page with information here: 
> http://wiki.apache.org/solr/UpdateRichDocuments
>  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-284) Parsing Rich Document Types

2008-11-26 Thread Chris Harris (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Harris updated SOLR-284:
--

Attachment: SOLR-284.patch

This should be the last change for today.

This change adds a resource.name parameter that you can pass to the handler. 
(I'm guessing you'll probably typically pass a filename, though Tika does use 
the more general term "resource name".) If you provide it, Tika can take 
advantage of it when applying its heuristics to determine the MIME type.

Affected files:

 * ExtractingParams.java
 * ExtractingDocumentLoader.java


> Parsing Rich Document Types
> ---
>
> Key: SOLR-284
> URL: https://issues.apache.org/jira/browse/SOLR-284
> Project: Solr
>  Issue Type: New Feature
>  Components: update
>Reporter: Eric Pugh
>Assignee: Grant Ingersoll
> Fix For: 1.4
>
> Attachments: libs.zip, rich.patch, rich.patch, rich.patch, 
> rich.patch, rich.patch, rich.patch, rich.patch, SOLR-284.patch, 
> SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, 
> solr-word.pdf, source.zip, test-files.zip, test-files.zip, test.zip, 
> un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler 
> that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into 
> Solr.
> There is a wiki page with information here: 
> http://wiki.apache.org/solr/UpdateRichDocuments
>  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-538) CopyField maxLength property

2008-11-26 Thread Chris Harris (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Harris updated SOLR-538:
--

Attachment: SOLR-538.patch

Same as prev SOLR-538.patch (2008-11-21 02:21 PM), except with some unneeded 
carriage return characters removed. (This may be overly cautious, but I don't 
want those to cause problems for anyone.)

> CopyField maxLength property
> 
>
> Key: SOLR-538
> URL: https://issues.apache.org/jira/browse/SOLR-538
> Project: Solr
>  Issue Type: Improvement
>  Components: update
>Reporter: Nicolas Dessaigne
>Priority: Minor
> Attachments: CopyFieldMaxLength.patch, CopyFieldMaxLength.patch, 
> SOLR-538-for-1.3.patch, SOLR-538.patch, SOLR-538.patch, SOLR-538.patch, 
> SOLR-538.patch, SOLR-538.patch, SOLR-538.patch, SOLR-538.patch, SOLR-538.patch
>
>
> As discussed shortly on the mailing list (http://www.mail-archive.com/[EMAIL 
> PROTECTED]/msg09807.html), the objective of this task is to add a maxLength 
> property to the CopyField "command". This property simply limits the number 
> of characters that are copied.
> This is particularly useful to avoid very slow highlighting when the index 
> contains big documents.
> Example :
> 
> This approach has also the advantage of limiting the index size for large 
> documents (the original text field does not need to be stored and to have 
> term vectors). However, the index is bigger for small documents...

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-284) Parsing Rich Document Types

2008-11-26 Thread Chris Harris (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Harris updated SOLR-284:
--

Attachment: SOLR-284.patch

Small change to the 2008-11-26 09:18 AM SOLR-284.patch (my previous one), this 
time adding an "example" ant target to contrib/javascript/build.xml. (Without 
this, the top-level "ant example" was failing.)

> Parsing Rich Document Types
> ---
>
> Key: SOLR-284
> URL: https://issues.apache.org/jira/browse/SOLR-284
> Project: Solr
>  Issue Type: New Feature
>  Components: update
>Reporter: Eric Pugh
>Assignee: Grant Ingersoll
> Fix For: 1.4
>
> Attachments: libs.zip, rich.patch, rich.patch, rich.patch, 
> rich.patch, rich.patch, rich.patch, rich.patch, SOLR-284.patch, 
> SOLR-284.patch, SOLR-284.patch, SOLR-284.patch, solr-word.pdf, source.zip, 
> test-files.zip, test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler 
> that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into 
> Solr.
> There is a wiki page with information here: 
> http://wiki.apache.org/solr/UpdateRichDocuments
>  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-284) Parsing Rich Document Types

2008-11-26 Thread Chris Harris (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12651073#action_12651073
 ] 

Chris Harris commented on SOLR-284:
---

On r720403, I'm noticing that before I apply this patch the tests pass, whereas 
after I apply this patch the following tests fail:

solr.client.solrj.embedded.JettyWebappTest
solr.client.solrj.embedded.LargeVolumeJettyTest
solr.client.solrj.embedded.SolrExampleJettyTest
solr.client.solrj.response.TestSpellCheckResponse

In each case Solr outputs this exception: "On Solr startup: SEVERE: 
org.apache.solr.common.SolrException: Error loading class 
'org.apache.solr.handler.ExtractingRequestHandler'"

I'm not sure the best way to get the ExtractingRequestHandler into the 
classpath here.

Sort of related, I've noticed that ExtractingRequestHandler doesn't currently 
get built into the .war file when you run "ant example", in contrast to 
DataImportHandler, which *does* get put into the .war by means of this target 
in its build.xml (among other targets):

  






  

Should ExtractingRequestHandler's build.xml perhaps have an analogous "dist" 
target, along these lines:
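
A rough sketch of what such a "dist" target might look like (the property and 
path names here are assumptions, not the actual contents of either build.xml):

```xml
<!-- Hypothetical "dist" target for contrib/extraction/build.xml; property
     and path names are assumptions. The idea is simply to copy the contrib
     jar into the example webapp so it ends up inside the .war. -->
<target name="dist" depends="build">
  <copy file="${dest}/${fullnamever}.jar"
        todir="../../build/web/WEB-INF/lib"/>
</target>
```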

  



  


> Parsing Rich Document Types
> ---
>
> Key: SOLR-284
> URL: https://issues.apache.org/jira/browse/SOLR-284
> Project: Solr
>  Issue Type: New Feature
>  Components: update
>Reporter: Eric Pugh
>Assignee: Grant Ingersoll
> Fix For: 1.4
>
> Attachments: libs.zip, rich.patch, rich.patch, rich.patch, 
> rich.patch, rich.patch, rich.patch, rich.patch, SOLR-284.patch, 
> SOLR-284.patch, SOLR-284.patch, solr-word.pdf, source.zip, test-files.zip, 
> test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler 
> that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into 
> Solr.
> There is a wiki page with information here: 
> http://wiki.apache.org/solr/UpdateRichDocuments
>  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-284) Parsing Rich Document Types

2008-11-26 Thread Chris Harris (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Harris updated SOLR-284:
--

Attachment: SOLR-284.patch

The 2008-11-15 01:12 PM SOLR-284.patch wasn't applying cleanly to trunk r720403 
for me. (One of the hunks for 
client/java/solrj/src/org/apache/solr/client/solrj/util/ClientUtils.java 
wouldn't apply.) With this very small update, it does apply cleanly.

> Parsing Rich Document Types
> ---
>
> Key: SOLR-284
> URL: https://issues.apache.org/jira/browse/SOLR-284
> Project: Solr
>  Issue Type: New Feature
>  Components: update
>Reporter: Eric Pugh
>Assignee: Grant Ingersoll
> Fix For: 1.4
>
> Attachments: libs.zip, rich.patch, rich.patch, rich.patch, 
> rich.patch, rich.patch, rich.patch, rich.patch, SOLR-284.patch, 
> SOLR-284.patch, SOLR-284.patch, solr-word.pdf, source.zip, test-files.zip, 
> test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler 
> that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into 
> Solr.
> There is a wiki page with information here: 
> http://wiki.apache.org/solr/UpdateRichDocuments
>  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-284) Parsing Rich Document Types

2008-11-24 Thread Chris Harris (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12650363#action_12650363
 ] 

Chris Harris commented on SOLR-284:
---

The 2008-11-15 01:12 PM version of SOLR-284.patch contains modifications to 
client/java/solrj/src/org/apache/solr/client/solrj/util/ClientUtils.java 
related to date handling. That's not intentional, is it?

> Parsing Rich Document Types
> ---
>
> Key: SOLR-284
> URL: https://issues.apache.org/jira/browse/SOLR-284
> Project: Solr
>  Issue Type: New Feature
>  Components: update
>Reporter: Eric Pugh
>Assignee: Grant Ingersoll
> Fix For: 1.4
>
> Attachments: libs.zip, rich.patch, rich.patch, rich.patch, 
> rich.patch, rich.patch, rich.patch, rich.patch, SOLR-284.patch, 
> SOLR-284.patch, solr-word.pdf, source.zip, test-files.zip, test-files.zip, 
> test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler 
> that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into 
> Solr.
> There is a wiki page with information here: 
> http://wiki.apache.org/solr/UpdateRichDocuments
>  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-538) CopyField maxLength property

2008-11-21 Thread Chris Harris (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Harris updated SOLR-538:
--

Attachment: SOLR-538.patch

Small change to bring Lars' 2008-11-08 version of SOLR-538.patch in sync with 
trunk r719187.

> CopyField maxLength property
> 
>
> Key: SOLR-538
> URL: https://issues.apache.org/jira/browse/SOLR-538
> Project: Solr
>  Issue Type: Improvement
>  Components: update
>Reporter: Nicolas Dessaigne
>Priority: Minor
> Attachments: CopyFieldMaxLength.patch, CopyFieldMaxLength.patch, 
> SOLR-538-for-1.3.patch, SOLR-538.patch, SOLR-538.patch, SOLR-538.patch, 
> SOLR-538.patch, SOLR-538.patch, SOLR-538.patch, SOLR-538.patch
>
>
> As discussed shortly on the mailing list (http://www.mail-archive.com/[EMAIL 
> PROTECTED]/msg09807.html), the objective of this task is to add a maxLength 
> property to the CopyField "command". This property simply limits the number 
> of characters that are copied.
> This is particularly useful to avoid very slow highlighting when the index 
> contains big documents.
> Example :
> 
> This approach has also the advantage of limiting the index size for large 
> documents (the original text field does not need to be stored and to have 
> term vectors). However, the index is bigger for small documents...

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-284) Parsing Rich Document Types

2008-11-20 Thread Chris Harris (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12649551#action_12649551
 ] 

Chris Harris commented on SOLR-284:
---

A few comments on the ExtractingDocumentLoader:

* I think I like where this is going.

* Currently the default is ext.ignore.und.fl (IGNORE_UNDECLARED_FIELDS) == 
false, which means that if Tika returns a metadata field and you haven't made 
an explicit mapping from the Tika fieldname to your Solr fieldname, then Solr 
will throw an exception and your document add will fail. This doesn't sound 
very robust for a production environment, unless Tika will only ever use 
a finite list of metadata field names. (That doesn't sound plausible, though I 
admit I haven't looked into it.) Even in that case, I think I'd rather not have 
to set up a mapping for every possible field name in order to get started with 
this handler. Would true perhaps be a better default?

* ext.capture / CAPTURE_FIELDS: Do you have a use case in mind for this 
feature, Grant? The example in the patch is of routing text from  tags to 
one Solr field while routing text from other tags to a different Solr field. 
I'm kind of curious when this would be useful, especially keeping in mind that, 
in general, Tika source documents are not HTML, and so when  tags are 
generated they're as much artifacts of Tika as reflecting anything in the 
underlying document. (You could maybe ask a similar question about ext.idx.attr 
/ INDEX_ATTRIBUTES.)


> Parsing Rich Document Types
> ---
>
> Key: SOLR-284
> URL: https://issues.apache.org/jira/browse/SOLR-284
> Project: Solr
>  Issue Type: New Feature
>  Components: update
>Reporter: Eric Pugh
>Assignee: Grant Ingersoll
> Fix For: 1.4
>
> Attachments: libs.zip, rich.patch, rich.patch, rich.patch, 
> rich.patch, rich.patch, rich.patch, rich.patch, SOLR-284.patch, 
> SOLR-284.patch, source.zip, test-files.zip, test-files.zip, test.zip, 
> un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler 
> that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into 
> Solr.
> There is a wiki page with information here: 
> http://wiki.apache.org/solr/UpdateRichDocuments
>  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-284) Parsing Rich Document Types

2008-11-20 Thread Chris Harris (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12649542#action_12649542
 ] 

Chris Harris commented on SOLR-284:
---

Is the latest patch supposed to contain a file "solr-word.pdf"? I don't see 
one, and my "ant test" is failing along these lines:

org.apache.solr.common.SolrException: java.io.FileNotFoundException: 
solr-word.pdf (The system cannot find the file specified)
at 
org.apache.solr.handler.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:160)
at 
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1313)
at 
org.apache.solr.util.TestHarness.queryAndResponse(TestHarness.java:331)
at 
org.apache.solr.handler.ExtractingRequestHandlerTest.loadLocal(ExtractingRequestHandlerTest.java:97)
at 
org.apache.solr.handler.ExtractingRequestHandlerTest.testExtraction(ExtractingRequestHandlerTest.java:27)


> Parsing Rich Document Types
> ---
>
> Key: SOLR-284
> URL: https://issues.apache.org/jira/browse/SOLR-284
> Project: Solr
>  Issue Type: New Feature
>  Components: update
>Reporter: Eric Pugh
>Assignee: Grant Ingersoll
> Fix For: 1.4
>
> Attachments: libs.zip, rich.patch, rich.patch, rich.patch, 
> rich.patch, rich.patch, rich.patch, rich.patch, SOLR-284.patch, 
> SOLR-284.patch, source.zip, test-files.zip, test-files.zip, test.zip, 
> un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler 
> that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into 
> Solr.
> There is a wiki page with information here: 
> http://wiki.apache.org/solr/UpdateRichDocuments
>  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-284) Parsing Rich Document Types

2008-11-14 Thread Chris Harris (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12647632#action_12647632
 ] 

Chris Harris commented on SOLR-284:
---

Grant,

I don't really care if you take over the old wiki page's name or start a new 
one; maybe it depends on whether the updated handler is still going to have a 
similar name or be called something else. I do think, though, that it might be 
handy to have *some* wiki page (and maybe some JIRA issue) to maintain the 
older patch on a temporary basis.

Thanks,
Chris

> Parsing Rich Document Types
> ---
>
> Key: SOLR-284
> URL: https://issues.apache.org/jira/browse/SOLR-284
> Project: Solr
>  Issue Type: New Feature
>  Components: update
>Reporter: Eric Pugh
>Assignee: Grant Ingersoll
> Fix For: 1.4
>
> Attachments: libs.zip, rich.patch, rich.patch, rich.patch, 
> rich.patch, rich.patch, rich.patch, rich.patch, source.zip, test-files.zip, 
> test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler 
> that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into 
> Solr.
> There is a wiki page with information here: 
> http://wiki.apache.org/solr/UpdateRichDocuments
>  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-538) CopyField maxLength property

2008-11-07 Thread Chris Harris (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Harris updated SOLR-538:
--

Attachment: SOLR-538.patch

Syncing w/ trunk r712067

> CopyField maxLength property
> 
>
> Key: SOLR-538
> URL: https://issues.apache.org/jira/browse/SOLR-538
> Project: Solr
>  Issue Type: Improvement
>  Components: update
>Reporter: Nicolas Dessaigne
>Priority: Minor
> Attachments: CopyFieldMaxLength.patch, CopyFieldMaxLength.patch, 
> SOLR-538.patch, SOLR-538.patch, SOLR-538.patch, SOLR-538.patch, SOLR-538.patch
>
>
> As discussed shortly on the mailing list (http://www.mail-archive.com/[EMAIL 
> PROTECTED]/msg09807.html), the objective of this task is to add a maxLength 
> property to the CopyField "command". This property simply limits the number 
> of characters that are copied.
> This is particularly useful to avoid very slow highlighting when the index 
> contains big documents.
> Example :
> 
> This approach has also the advantage of limiting the index size for large 
> documents (the original text field does not need to be stored and to have 
> term vectors). However, the index is bigger for small documents...

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-805) DisMax queries are not being cached in QueryResultCache

2008-11-07 Thread Chris Harris (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12645876#action_12645876
 ] 

Chris Harris commented on SOLR-805:
---

I was trying to figure out exactly which revision of the Lucene trunk Koji's 
commit here was built against, since it doesn't seem to be stated here in JIRA 
or in the SVN log. If anyone else is curious, the answer is Lucene r707499 
-- at least according to the MANIFEST.MF file in lib/lucene-core-2.9-dev.jar.


> DisMax queries are not being cached in QueryResultCache
> ---
>
> Key: SOLR-805
> URL: https://issues.apache.org/jira/browse/SOLR-805
> Project: Solr
>  Issue Type: Bug
>  Components: search
>Affects Versions: 1.3
> Environment: Using Sun JDK 1.5 and Solr 1.3.0 release on Windows XP
>Reporter: Todd Feak
>Priority: Critical
> Fix For: 1.4
>
>
> I have a DisMax Search Handler set up in my solrconfig.xml to weight results 
> based on which field a hit was found in. Results seem to be coming back fine, 
> but the exact same query issued twice will *not* result in a cache hit.
> I have run far enough in the debugger to determine that the hashCode for the 
> BooleanQuery object is returning a different value each time for the same 
> query. This leads me to believe there is some random factor involved in its 
> calculation, such as a default Object hashCode() implementation somewhere in 
> the chain. Non DisMax queries seem to be caching just fine.
> Where I see this behavior exhibited is on line 47 of the QueryResultKey 
> constructor. I have not dug in far enough to determine exactly where the 
> hashCode is being incorrectly calculated. I will try and dig in further 
> tomorrow, but wanted to get some attention on the bug. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-195) Wildcard/prefix queries not highlighted

2008-09-08 Thread Chris Harris (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629248#action_12629248
 ] 

Chris Harris commented on SOLR-195:
---

I just rediscovered this bug for myself, and was about to re-report it, but 
then I found this JIRA issue. Even though it's a bit redundant, I'm going to 
paste my bug report here, since A) I think it's a good summary of the problem, 
B) it has a remark for when usePhraseHighlighter=true, and C) it includes a few 
test cases.



Highlighting with wildcards (whether * is in the middle of a term or at
the end) doesn't work right now for the standard request handler.
The high-level view of the problem is as follows:

1. Extracting terms is central to highlighting
2. Wildcard queries get parsed into ConstantScoreQuery objects
3. It's not currently possible to extract terms from
   ConstantScoreQuery objects



Wildcard queries get turned into ConstantScoreQuery objects. For non-prefix
wildcards (e.g. "l*g"), the query parser directly returns a
ConstantScoreQuery with filter = WildcardFilter. For prefix wildcards
(e.g. "lon*"), the query parser returns a ConstantScorePrefixQuery,
but it gets rewritten (by Query.rewrite(), which gets called in the
highlighting component) into a ConstantScoreQuery with
filter = PrefixFilter.

If usePhraseHighlighter=false, then a key part of highlighting is
Query.extractTerms(). However, ConstantScoreQuery.extractTerms()
is an empty method. The source itself notes that this may not
be good for highlighting: "OK to not add any terms when used for
MultiSearcher, but may not be OK for highlighting."

If usePhraseHighlighter=true, then a key part of highlighting is
WeightedSpanTermExtractor.extract(Query, Map). Now extract() has
a number of different instanceof clauses, each with knowledge about
how to extract terms from a particular kind of query. However, there
is no instanceof clause that matches ConstantScoreQuery.



Here are four variants on testDefaultFieldHighlight() that all fail, even
though I think they should pass. (The differences from
testDefaultFieldHighlight are the hl.usePhraseHighlighter param and the
use of wildcard in sumLRF.makeRequest.) When I run them, they each return
a document, as expected, but they don't find any highlight blocks.

{code}
  public void testDefaultFieldPrefixWildcardHighlight() {

// do summarization using re-analysis of the field
HashMap args = new HashMap();
args.put("hl", "true");
args.put("df", "t_text");
args.put("hl.fl", "");
args.put("hl.usePhraseHighlighter", "false");
TestHarness.LocalRequestFactory sumLRF = h.getRequestFactory(
  "standard", 0, 200, args);

assertU(adoc("t_text", "a long day's night", "id", "1"));
assertU(commit());
assertU(optimize());
assertQ("Basic summarization",
sumLRF.makeRequest("lon*"),
"//[EMAIL PROTECTED]'highlighting']/[EMAIL PROTECTED]'1']",
"//[EMAIL PROTECTED]'1']/[EMAIL PROTECTED]'t_text']/str"
);

  }

  public void testDefaultFieldPrefixWildcardHighlight2() {

// do summarization using re-analysis of the field
HashMap args = new HashMap();
args.put("hl", "true");
args.put("df", "t_text");
args.put("hl.fl", "");
args.put("hl.usePhraseHighlighter", "true");
TestHarness.LocalRequestFactory sumLRF = h.getRequestFactory(
  "standard", 0, 200, args);

assertU(adoc("t_text", "a long day's night", "id", "1"));
assertU(commit());
assertU(optimize());
assertQ("Basic summarization",
sumLRF.makeRequest("lon*"),
"//[EMAIL PROTECTED]'highlighting']/[EMAIL PROTECTED]'1']",
"//[EMAIL PROTECTED]'1']/[EMAIL PROTECTED]'t_text']/str"
);

  }

  public void testDefaultFieldNonPrefixWildcardHighlight() {

// do summarization using re-analysis of the field
HashMap args = new HashMap();
args.put("hl", "true");
args.put("df", "t_text");
args.put("hl.fl", "");
args.put("hl.usePhraseHighlighter", "false");
TestHarness.LocalRequestFactory sumLRF = h.getRequestFactory(
  "standard", 0, 200, args);

assertU(adoc("t_text", "a long day's night", "id", "1"));
assertU(commit());
assertU(optimize());
assertQ("Basic summarization",
sumLRF.makeRequest("l*g"),
"//[EMAIL PROTECTED]'highlighting']/[EMAIL PROTECTED]'1']",
"//[EMAIL PROTECTED]'1']/[EMAIL PROTECTED]'t_text']/str"
);

  }

  public void testDefaultFieldNonPrefixWildcardHighlight2() {

// do summarization using re-analysis of the field
HashMap args = new HashMap();
args.put("hl", "true");
args.put("df", "t_text");
args.put("hl.fl", "");
args.put("hl.usePhraseHighlighter", "true");
TestHarness.LocalRequestFactory sumLRF = h.getRequestFactory(
  "standard", 0, 200, args);

assertU(adoc("t_text

[jira] Updated: (SOLR-284) Parsing Rich Document Types

2008-09-05 Thread Chris Harris (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Harris updated SOLR-284:
--

Attachment: rich.patch

THIS IS A BREAKING CHANGE TO RICH.PATCH! CLIENT URLs NEED TO BE UPDATED!

All unit tests pass.

Changes:

* As suggested earlier, the "id" parameter is no longer treated as a special 
case; it is not required, and it does not need to be an int. If you *do* use a 
field called "id", you *must* now declare it in the fieldnames parameter, as 
you would any other field

* Do updates with UpdateRequestProcessor and SolrInputDocument, rather 
than UpdateHandler and DocumentBuilder. (The latter pair appear to be obsolete.)

* Previously, if you declared a field in the fieldnames parameter but did not 
specify a value for that field, you would get a 
NullPointerException. Now you can specify any nonnegative number of values for 
a declared field, including zero. (I've added a unit test for this.)

* In SolrPDFParser, properly close PDDocument when PDF parsing throws an 
exception

* Log the stream type in the solr log, rather than on the console

* Some not-very-thorough conversion of tabs to spaces

As an aside, I've noticed that I failed in my earlier efforts to incorporate 
Juri Kuehn's change to allow the id field to be non-integer. Sorry about that, 
Juri; that was not at all intentional.


> Parsing Rich Document Types
> ---
>
> Key: SOLR-284
> URL: https://issues.apache.org/jira/browse/SOLR-284
> Project: Solr
>  Issue Type: New Feature
>  Components: update
>Reporter: Eric Pugh
> Fix For: 1.4
>
> Attachments: libs.zip, rich.patch, rich.patch, rich.patch, 
> rich.patch, rich.patch, rich.patch, rich.patch, source.zip, test-files.zip, 
> test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler 
> that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into 
> Solr.
> There is a wiki page with information here: 
> http://wiki.apache.org/solr/UpdateRichDocuments
>  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-284) Parsing Rich Document Types

2008-09-03 Thread Chris Harris (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Harris updated SOLR-284:
--

Attachment: rich.patch

This update is just to make a tiny refactoring, bringing all the handler's 
parsing classes under 

src\java\org\apache\solr\handler\rich

and all the testing classes under 

src\test\org\apache\solr\handler\rich

All tests pass.

> Parsing Rich Document Types
> ---
>
> Key: SOLR-284
> URL: https://issues.apache.org/jira/browse/SOLR-284
> Project: Solr
>  Issue Type: New Feature
>  Components: update
>Reporter: Eric Pugh
> Fix For: 1.4
>
> Attachments: libs.zip, rich.patch, rich.patch, rich.patch, 
> rich.patch, rich.patch, rich.patch, source.zip, test-files.zip, 
> test-files.zip, test.zip, un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler 
> that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into 
> Solr.
> There is a wiki page with information here: 
> http://wiki.apache.org/solr/UpdateRichDocuments
>  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-538) CopyField maxLength property

2008-09-03 Thread Chris Harris (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628045#action_12628045
 ] 

Chris Harris commented on SOLR-538:
---

Thanks, Lars; that was fast. I think this patch is going to be handy.

I'm wondering what people thought about an alternative approach to keeping 
stored fields from being too large, which would require mucking around with 
Lucene. In particular, the idea would be to allow field definitions like this:
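
For illustration, such a declaration might look roughly like this 
(maxFieldLength and storeAnalyzed are hypothetical attribute names, not real 
Solr schema attributes):

```xml
<!-- Hypothetical schema.xml field declaration; maxFieldLength and
     storeAnalyzed are illustrative attribute names only. -->
<field name="text" type="text" indexed="true" stored="true"
       termVectors="true" maxFieldLength="2000" storeAnalyzed="true"/>
```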



Here we've made the normal Lucene maxFieldLength (i.e. # of tokens to analyze) 
configurable on a field-by-field basis. And in this declaration we've also made it 
so that what is stored is a function of what is analyzed. (Here if the first 
2,000 tokens correspond to the first, say, 8,000 characters, then those 8,000 
characters are what's going to be actually stored in the stored field.) This 
seems a little more natural than lopping off the text after a fixed number of 
characters.

If I could do the above, I'm thinking I would use that single field for both 
searching and highlighting. But if you wanted a separate field for highlighting 
(and were willing to have things run slower than with the current patch), then 
you could do this:
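
For illustration, the separate highlighting field might be declared roughly 
like this (again with hypothetical attribute names):

```xml
<!-- Hypothetical: a dedicated highlighting field fed by copyField, so the
     main field needs neither storage nor term vectors. Attribute names
     are illustrative only. -->
<field name="text" type="text" indexed="true" stored="false"/>
<field name="text_hl" type="text" indexed="true" stored="true"
       termVectors="true" maxFieldLength="2000" storeAnalyzed="true"/>
<copyField source="text" dest="text_hl"/>
```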






> CopyField maxLength property
> 
>
> Key: SOLR-538
> URL: https://issues.apache.org/jira/browse/SOLR-538
> Project: Solr
>  Issue Type: Improvement
>  Components: update
>Reporter: Nicolas Dessaigne
>Priority: Minor
> Attachments: CopyFieldMaxLength.patch, CopyFieldMaxLength.patch, 
> SOLR-538.patch, SOLR-538.patch, SOLR-538.patch
>
>
> As discussed shortly on the mailing list (http://www.mail-archive.com/[EMAIL 
> PROTECTED]/msg09807.html), the objective of this task is to add a maxLength 
> property to the CopyField "command". This property simply limits the number 
> of characters that are copied.
> This is particularly useful to avoid very slow highlighting when the index 
> contains big documents.
> Example :
> 
> This approach has also the advantage of limiting the index size for large 
> documents (the original text field does not need to be stored and to have 
> term vectors). However, the index is bigger for small documents...




[jira] Commented: (SOLR-284) Parsing Rich Document Types

2008-09-02 Thread Chris Harris (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12627882#action_12627882
 ] 

Chris Harris commented on SOLR-284:
---

A couple of Tika things:

I glanced at Tika yesterday, and it looks like switching this patch over to it 
wouldn't be too hard. (The only thing half-worthy of note is that 
org.apache.tika.parser.Parser.parse outputs XHTML [via a SAX interface], which 
we would probably then need to turn into plaintext.) I haven't yet looked into 
Eric's code to see if it does anything special that Tika doesn't do.
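The "turn the SAX-delivered XHTML into plaintext" step mentioned above could be as simple as a ContentHandler that accumulates character events. Here's a minimal sketch of that idea; it uses the JDK's own SAX parser on a literal string rather than Tika itself, and the class and method names are invented for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/**
 * Sketch of reducing a SAX event stream (such as the XHTML that
 * Tika's Parser emits) to plain text by collecting character events.
 * Illustrative only; not code from the patch or from Tika.
 */
public class TextExtractionSketch {
    static class TextOnlyHandler extends DefaultHandler {
        final StringBuilder text = new StringBuilder();
        @Override
        public void characters(char[] ch, int start, int length) {
            // Keep only the character data; element names are discarded.
            text.append(ch, start, length);
        }
    }

    public static String extractText(String xhtml) throws Exception {
        TextOnlyHandler handler = new TextOnlyHandler();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.text.toString();
    }

    public static void main(String[] args) throws Exception {
        // Prints the concatenated character data with the markup stripped.
        System.out.println(
            extractText("<html><body><p>Hello</p><p>world</p></body></html>"));
    }
}
```

With Tika in the picture, the same handler would be passed to the Tika parser instead of the JDK's SAX parser.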

I also noticed something else, though. Earlier comments say that Nutch uses 
Tika, but when I looked through Nutch trunk this seemed to only sort of be the 
case. In particular, Nutch definitely uses the stuff in the 
org.apache.tika.mime namespace, to do things like auto-detect content types, 
but it doesn't seem to use the stuff in org.apache.tika.parser to do the actual 
document parsing; instead, it uses its own separate 
org.apache.nutch.parse.Parser class (and subclasses thereof). For example, 
org.apache.nutch.parse.html.HtmlParser does not delegate to 
org.apache.tika.parser.html.HtmlParser but rather does its own direct 
manipulation of the tagsoup and/or nekohtml libraries. (Things are similar with 
the Nutch PDF parser.) Nor does there seem to be an alternative class along the 
lines of 
org.apache.nutch.parse.TikaBasedParserThatCanParseLotsOfDifferentContentTypesIncludingHtml.
 And the string "org.apache.tika.parser" doesn't seem to occur in the Nutch 
source.

I'm wondering if anyone knows why Nutch does not seem to make use of all of 
Tika's functionality. Are they planning to switch everything over to Tika 
eventually?


> Parsing Rich Document Types
> ---
>
> Key: SOLR-284
> URL: https://issues.apache.org/jira/browse/SOLR-284
> Project: Solr
>  Issue Type: New Feature
>  Components: update
>Reporter: Eric Pugh
> Fix For: 1.4
>
> Attachments: libs.zip, rich.patch, rich.patch, rich.patch, 
> rich.patch, rich.patch, source.zip, test-files.zip, test-files.zip, test.zip, 
> un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler 
> that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into 
> Solr.
> There is a wiki page with information here: 
> http://wiki.apache.org/solr/UpdateRichDocuments
>  




[jira] Commented: (SOLR-538) CopyField maxLength property

2008-09-01 Thread Chris Harris (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12627558#action_12627558
 ] 

Chris Harris commented on SOLR-538:
---

When I apply this patch, the last line of 
test-files/solr/conf/schema-copyfield-test.xml seems to get cut off. That is, 
rather than

[...]
 
 
{{blank line}}

{{EOF}}

I get

[...]
 
 
{{blank line}}
{{EOF}}

This makes the XML invalid, and makes "ant test" fail.

I thought I was just being inept with Windows/TortoiseSVN, but now I've had the 
same thing happen when applying the patch with the patch command on the OS X 
command line. This makes me suspicious that there might be something wrong with 
the patch -- though I can't find anything wrong by looking at it manually. Any 
thoughts?


> CopyField maxLength property
> 
>
> Key: SOLR-538
> URL: https://issues.apache.org/jira/browse/SOLR-538
> Project: Solr
>  Issue Type: Improvement
>  Components: update
>Reporter: Nicolas Dessaigne
>Priority: Minor
> Attachments: CopyFieldMaxLength.patch, CopyFieldMaxLength.patch, 
> SOLR-538.patch, SOLR-538.patch
>
>
> As discussed shortly on the mailing list (http://www.mail-archive.com/[EMAIL 
> PROTECTED]/msg09807.html), the objective of this task is to add a maxLength 
> property to the CopyField "command". This property simply limits the number 
> of characters that are copied.
> This is particularly useful to avoid very slow highlighting when the index 
> contains big documents.
> Example :
> 
> This approach has also the advantage of limiting the index size for large 
> documents (the original text field does not need to be stored and to have 
> term vectors). However, the index is bigger for small documents...




[jira] Updated: (SOLR-744) Patch to make ShingleFilter.outputUnigramIfNoNgrams (LUCENE-1370) available in Solr schema files

2008-08-31 Thread Chris Harris (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Harris updated SOLR-744:
--

Attachment: SOLR-744.patch

> Patch to make ShingleFilter.outputUnigramIfNoNgrams (LUCENE-1370) available 
> in Solr schema files
> 
>
> Key: SOLR-744
> URL: https://issues.apache.org/jira/browse/SOLR-744
> Project: Solr
>  Issue Type: Improvement
>Reporter: Chris Harris
> Attachments: SOLR-744.patch
>
>
> See LUCENE-1370




[jira] Created: (SOLR-744) Patch to make ShingleFilter.outputUnigramIfNoNgrams (LUCENE-1370) available in Solr schema files

2008-08-31 Thread Chris Harris (JIRA)
Patch to make ShingleFilter.outputUnigramIfNoNgrams (LUCENE-1370) available in 
Solr schema files


 Key: SOLR-744
 URL: https://issues.apache.org/jira/browse/SOLR-744
 Project: Solr
  Issue Type: Improvement
Reporter: Chris Harris
 Attachments: SOLR-744.patch

See LUCENE-1370




[jira] Commented: (SOLR-284) Parsing Rich Document Types

2008-08-31 Thread Chris Harris (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12627333#action_12627333
 ] 

Chris Harris commented on SOLR-284:
---

While we're on the subject of breaking changes, I'm now seeing some merit in 
replacing the fieldnames parameter with a field-specifying prefix.

Currently when you want to set a non-body field, you introduce the field name 
in the fieldnames parameter and then specify its value in another parameter, 
like so:

   /update/rich/...fieldnames=f1,f2,f3&f1=val1&f2=val2&f3=val3

The alternative would be to signal the fields f1, f2, and f3 by a field 
prefix, like so:

  /update/rich/...f.f1=val1&f.f2=val2&f.f3=val3

Because the f prefix says "this is a field", there's no need for the fieldnames 
parameter.
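The prefix convention could be implemented with a tiny bit of parameter filtering. The sketch below is a hypothetical helper, not code from the patch: any request parameter named f.&lt;field&gt; supplies a value for &lt;field&gt;, and everything else is left for the handler's own options.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Sketch of the proposed "f." prefix convention: parameters named
 * f.<field> become document field values, so no separate fieldnames
 * parameter is needed. Hypothetical helper, not code from the patch.
 */
public class FieldPrefixSketch {
    public static Map<String, String> fieldParams(Map<String, String> requestParams) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : requestParams.entrySet()) {
            if (e.getKey().startsWith("f.")) {
                // Strip the "f." prefix to recover the field name.
                fields.put(e.getKey().substring(2), e.getValue());
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        Map<String, String> req = new LinkedHashMap<>();
        req.put("commit", "true");     // handler option, untouched
        req.put("f.commit", "chris");  // a real document field named "commit"
        req.put("f.subject", "x");
        System.out.println(fieldParams(req)); // prints {commit=chris, subject=x}
    }
}
```

Note how the reserved-word problem goes away: commit stays a handler option while f.commit is just another field.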

This isn't an earth-shattering improvement, but there are three things I like 
about it:

1. The URLs are shorter

2. If you rename a field (e.g. rename f3 to g3), you can't accidentally 
half-update the URL in the client code, like this:

  /update/rich/...fieldnames=f1,f2,g3&f1=val1&f2=val2&f3=val3

3. Currently there are certain reserved words (e.g. "fieldnames", "commit") 
that you can't use, because they have special meaning to the handler. But with 
this change they become legitimate field names. For example, maybe I want each 
of my documents to have a "commit" field that describes who made the most 
recent relevant commit in a version control system.

  /update/rich/...commit=true&f.commit=chris

I can't think of any downsides right now, other than breaking people's code. (I 
do admit that is a downside.)

Any comments?


> Parsing Rich Document Types
> ---
>
> Key: SOLR-284
> URL: https://issues.apache.org/jira/browse/SOLR-284
> Project: Solr
>  Issue Type: New Feature
>  Components: update
>Reporter: Eric Pugh
> Fix For: 1.4
>
> Attachments: libs.zip, rich.patch, rich.patch, rich.patch, 
> rich.patch, rich.patch, source.zip, test-files.zip, test-files.zip, test.zip, 
> un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler 
> that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into 
> Solr.
> There is a wiki page with information here: 
> http://wiki.apache.org/solr/UpdateRichDocuments
>  




[jira] Updated: (SOLR-284) Parsing Rich Document Types

2008-08-29 Thread Chris Harris (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Harris updated SOLR-284:
--

Attachment: un-hardcode-id.diff

The patch, as currently stands, treats a field called "id" as a special case. 
First, it is a required field. Second, unlike any other field, you don't need 
to declare it in the fieldnames parameter. Finally, since the handler reads it 
with fieldSolrParams.getInt(), that field is required to be an int.

This special-case treatment seems a little too particular to me; not everyone 
wants to have a field called "id", and not everyone who does wants that field 
to be an int. So what I propose is to eliminate the special treatment of "id". 
See un-hardcode-id.diff for what this might mean in particular. (That file is 
not complete; to correctly make this change, I'd have to update the test cases.)

This is a breaking change, because if you *are* using an id field, you'll now 
have to specifically indicate that fact in the fieldnames parameter. Thus, 
instead of

http://localhost:8983/solr/update/rich?stream.file=myfile.doc&stream.type=doc&id=100&stream.fieldname=text&fieldnames=subject,author&subject=mysubject&author=eric

you'll have to put

http://localhost:8983/solr/update/rich?stream.file=myfile.doc&stream.type=doc&id=100&stream.fieldname=text&fieldnames=id,subject,author&subject=mysubject&author=eric

I think asking users of this patch to make this slight change in their client 
code is not an unreasonable burden, but I'm curious what Eric and others have 
to say.

> Parsing Rich Document Types
> ---
>
> Key: SOLR-284
> URL: https://issues.apache.org/jira/browse/SOLR-284
> Project: Solr
>  Issue Type: New Feature
>  Components: update
>Reporter: Eric Pugh
> Fix For: 1.4
>
> Attachments: libs.zip, rich.patch, rich.patch, rich.patch, 
> rich.patch, rich.patch, source.zip, test-files.zip, test-files.zip, test.zip, 
> un-hardcode-id.diff
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler 
> that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into 
> Solr.
> There is a wiki page with information here: 
> http://wiki.apache.org/solr/UpdateRichDocuments
>  




[jira] Updated: (SOLR-284) Parsing Rich Document Types

2008-08-12 Thread Chris Harris (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Harris updated SOLR-284:
--

Attachment: rich.patch

Trivial update to merge cleanly against r685275.

> Parsing Rich Document Types
> ---
>
> Key: SOLR-284
> URL: https://issues.apache.org/jira/browse/SOLR-284
> Project: Solr
>  Issue Type: New Feature
>  Components: update
>Reporter: Eric Pugh
> Attachments: libs.zip, rich.patch, rich.patch, rich.patch, 
> rich.patch, rich.patch, source.zip, test-files.zip, test-files.zip, test.zip
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler 
> that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into 
> Solr.
> There is a wiki page with information here: 
> http://wiki.apache.org/solr/UpdateRichDocuments
>  




[jira] Commented: (SOLR-284) Parsing Rich Document Types

2008-05-08 Thread Chris Harris (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12595372#action_12595372
 ] 

Chris Harris commented on SOLR-284:
---

I'm on the fence about whether this patch makes sense to include in Solr right 
now. One thing I'm wondering, though: can we assess at this point how likely it 
is that a Tika-based handler could offer the same public interface as the 
handler in this patch? That is, even if the 
underlying implementation were switched to Tika at some point, could we avoid 
changing the URL schema and such that Solr clients would use to interact with 
it?

If it's likely that the public interface could indeed remain the same for the 
first Tika-based handler release (or at least more or less the same), would 
this alleviate any of Grant's concerns?

Also, would putting this handler into a contrib directory rather than in the 
main code base, as has been mentioned on the mailing list, make committing it 
any less problematic?

> Parsing Rich Document Types
> ---
>
> Key: SOLR-284
> URL: https://issues.apache.org/jira/browse/SOLR-284
> Project: Solr
>  Issue Type: New Feature
>  Components: update
>Affects Versions: 1.3
>Reporter: Eric Pugh
> Fix For: 1.3
>
> Attachments: libs.zip, rich.patch, rich.patch, rich.patch, 
> rich.patch, source.zip, test-files.zip, test-files.zip, test.zip
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler 
> that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into 
> Solr.
> There is a wiki page with information here: 
> http://wiki.apache.org/solr/UpdateRichDocuments
>  




[jira] Updated: (SOLR-284) Parsing Rich Document Types

2008-05-07 Thread Chris Harris (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Harris updated SOLR-284:
--

Attachment: rich.patch

Attaching another patch revision. I've been totally asleep at the wheel today, 
and my previous one contained not only the feature described in this JIRA issue 
but also the Data Import RequestHandler patch (SOLR-469). Hopefully I've 
finally made a patch that's actually correct. I can at least promise that the 
unit tests pass when applied to r654253.

> Parsing Rich Document Types
> ---
>
> Key: SOLR-284
> URL: https://issues.apache.org/jira/browse/SOLR-284
> Project: Solr
>  Issue Type: New Feature
>  Components: update
>Affects Versions: 1.3
>Reporter: Eric Pugh
> Fix For: 1.3
>
> Attachments: libs.zip, rich.patch, rich.patch, rich.patch, 
> rich.patch, source.zip, test-files.zip, test-files.zip, test.zip
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler 
> that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into 
> Solr.
> There is a wiki page with information here: 
> http://wiki.apache.org/solr/UpdateRichDocuments
>  




[jira] Commented: (SOLR-284) Parsing Rich Document Types

2008-05-07 Thread Chris Harris (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12595007#action_12595007
 ] 

Chris Harris commented on SOLR-284:
---

I'm not sure this patch entirely reinvents the wheel, as it does most of the 
heavy lifting with preexisting components, namely PDFBox, POI, and Solr's own 
HTMLStripReader. It also has the advantage of already existing, whereas tying 
Solr to Tika or Aperture would take additional effort.

Tika and Aperture do look really nice, though. The most obvious advantage those 
projects have over this patch is that they can already extract text from more 
file formats, and that their developers will probably continue to add formats 
over time. Are you thinking of additional advantages on 
top of this, Grant? Do you have any cool ideas about how Tika/Aperture's 
metadata extraction facilities might be integrated into Solr? Is there a 
potentially interesting interface between Aperture's crawling facilities and 
Solr?

> Parsing Rich Document Types
> ---
>
> Key: SOLR-284
> URL: https://issues.apache.org/jira/browse/SOLR-284
> Project: Solr
>  Issue Type: New Feature
>  Components: update
>Affects Versions: 1.3
>Reporter: Eric Pugh
> Fix For: 1.3
>
> Attachments: libs.zip, rich.patch, rich.patch, rich.patch, 
> source.zip, test-files.zip, test-files.zip, test.zip
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler 
> that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into 
> Solr.
> There is a wiki page with information here: 
> http://wiki.apache.org/solr/UpdateRichDocuments
>  




[jira] Updated: (SOLR-284) Parsing Rich Document Types

2008-05-07 Thread Chris Harris (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Harris updated SOLR-284:
--

Attachment: test-files.zip

New version of test-files.zip. Contains new file, simple.txt, that is used by a 
new unit test for plaintext files.

> Parsing Rich Document Types
> ---
>
> Key: SOLR-284
> URL: https://issues.apache.org/jira/browse/SOLR-284
> Project: Solr
>  Issue Type: New Feature
>  Components: update
>Affects Versions: 1.3
>Reporter: Eric Pugh
> Fix For: 1.3
>
> Attachments: libs.zip, rich.patch, rich.patch, rich.patch, 
> source.zip, test-files.zip, test-files.zip, test.zip
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler 
> that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into 
> Solr.
> There is a wiki page with information here: 
> http://wiki.apache.org/solr/UpdateRichDocuments
>  




[jira] Updated: (SOLR-284) Parsing Rich Document Types

2008-05-07 Thread Chris Harris (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Harris updated SOLR-284:
--

Attachment: rich.patch

Here's a new version of rich.patch. My previous attempt didn't actually include 
all the necessary files! (Curses upon you, TortoiseSVN.) This one also includes 
preliminary support for plaintext and HTML files. (HTML support is done by 
running the input through the HTMLStripReader.)

> Parsing Rich Document Types
> ---
>
> Key: SOLR-284
> URL: https://issues.apache.org/jira/browse/SOLR-284
> Project: Solr
>  Issue Type: New Feature
>  Components: update
>Affects Versions: 1.3
>Reporter: Eric Pugh
> Fix For: 1.3
>
> Attachments: libs.zip, rich.patch, rich.patch, rich.patch, 
> source.zip, test-files.zip, test.zip
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler 
> that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into 
> Solr.
> There is a wiki page with information here: 
> http://wiki.apache.org/solr/UpdateRichDocuments
>  




[jira] Updated: (SOLR-284) Parsing Rich Document Types

2008-04-09 Thread Chris Harris (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Harris updated SOLR-284:
--

Attachment: rich.patch

Replacing rich.patch. The new one:

1) Rolls together into one handy package all of these:

  * the old rich.patch
  * the contents of source.zip and test.zip
  * Pompo's multivalued fields patch.

Note: It does *not* include the contents of libs.zip or test-files.zip. I'm not 
sure what the protocol is around those larger files.

Note: The old rich.patch included a change to Config.java that searched for an 
alternative config file in "src/test/test-files/solr/conf/". I've removed that 
change because I think it's debugging code that we don't want in an official 
patch. Let me know if I'm wrong, though.

2) Makes things work against the latest revision in trunk, r646483. (It had 
stopped working with the latest version.)

I haven't added any new test cases, but the old ones all pass.

I grant my modifications to ASF according to the Apache License. Someone might 
want to check that the underlying contributions have been appropriately 
licensed as well.

> Parsing Rich Document Types
> ---
>
> Key: SOLR-284
> URL: https://issues.apache.org/jira/browse/SOLR-284
> Project: Solr
>  Issue Type: New Feature
>  Components: update
>Affects Versions: 1.3
>Reporter: Eric Pugh
> Fix For: 1.3
>
> Attachments: libs.zip, rich.patch, rich.patch, source.zip, 
> test-files.zip, test.zip
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler 
> that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into 
> Solr.
> There is a wiki page with information here: 
> http://wiki.apache.org/solr/UpdateRichDocuments
>  




[jira] Commented: (SOLR-284) Parsing Rich Document Types

2008-03-25 Thread Chris Harris (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12582062#action_12582062
 ] 

Chris Harris commented on SOLR-284:
---

I'm thinking it would be handy if RichDocumentRequestHandler could support 
indexing text and HTML files, in addition to the fancier formats (pdf, doc, 
etc.). That way I could use RichDocumentRequestHandler for all my indexing 
needs (except commits and optimizes), rather than use it for some doc types 
but still have to use XmlUpdateRequestHandler for text and HTML docs. Would 
anyone else find this useful?

I skimmed the source, and adding support for text files looks trivial. (It's 
just a pass-through.) And if you had this, then I guess you'd have at least one 
version of HTML support for free; in particular, you could upload your HTML 
file to RichDocumentRequestHandler, telling the handler that the document is in 
plain text format, and then strip off the HTML tags later by using the 
HTMLStripStandardTokenizer in your schema.xml.

Alternatively, RichDocumentRequestHandler could provide its own explicit HTML 
to text conversion. There would probably be some advantages to this, but I'm 
not sure exactly what they would be. One, I guess, would be that you could use 
tokenizers that didn't make use of HTMLStripReader.

> Parsing Rich Document Types
> ---
>
> Key: SOLR-284
> URL: https://issues.apache.org/jira/browse/SOLR-284
> Project: Solr
>  Issue Type: New Feature
>  Components: update
>Affects Versions: 1.3
>Reporter: Eric Pugh
> Fix For: 1.3
>
> Attachments: libs.zip, rich.patch, source.zip, test-files.zip, 
> test.zip
>
>
> I have developed a RichDocumentRequestHandler based on the CSVRequestHandler 
> that supports streaming a PDF, Word, Powerpoint, Excel, or PDF document into 
> Solr.
> There is a wiki page with information here: 
> http://wiki.apache.org/solr/UpdateRichDocuments
>  




[jira] Updated: (SOLR-42) Highlighting problems with HTMLStripWhitespaceTokenizerFactory

2008-02-19 Thread Chris Harris (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-42?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Harris updated SOLR-42:
-

Attachment: HtmlStripReaderTestXmlProcessing.patch

Updating the test case to reflect the fact that offset info still gets screwed up 
in the XML processing instruction case even if there are no XML elements in the 
source XML.

> Highlighting problems with HTMLStripWhitespaceTokenizerFactory
> --
>
> Key: SOLR-42
> URL: https://issues.apache.org/jira/browse/SOLR-42
> Project: Solr
>  Issue Type: Bug
>  Components: highlighter
>Reporter: Andrew May
>Assignee: Grant Ingersoll
>Priority: Minor
> Attachments: htmlStripReaderTest.html, HTMLStripReaderTest.java, 
> HtmlStripReaderTestXmlProcessing.patch, 
> HtmlStripReaderTestXmlProcessing.patch, SOLR-42.patch, SOLR-42.patch, 
> SOLR-42.patch, SOLR-42.patch, TokenPrinter.java
>
>
> Indexing content that contains HTML markup, causes problems with highlighting 
> if the HTMLStripWhitespaceTokenizerFactory is used (to prevent the tag names 
> from being searchable).
> Example title field:
> 40Ar/39Ar laserprobe dating of mylonitic fabrics in a 
> polyorogenic terrane of NW Iberia
> Searching for title:fabrics with highlighting on, the highlighted version has 
> the  tags in the wrong place - 22 characters to the left of where they 
> should be (i.e. the sum of the lengths of the tags).
> Response from Yonik on the solr-user mailing-list:
> HTMLStripWhitespaceTokenizerFactory works in two phases...
> HTMLStripReader removes the HTML and passes the result to
> WhitespaceTokenizer... at that point, Tokens are generated, but the
> offsets will correspond to the text after HTML removal, not before.
> I did it this way so that HTMLStripReader  could go before any
> tokenizer (like StandardTokenizer).
> Can you open a JIRA bug for this?  The fix would be a special version
> of HTMLStripReader integrated with a WhitespaceTokenizer to keep
> offsets correct. 




[jira] Updated: (SOLR-42) Highlighting problems with HTMLStripWhitespaceTokenizerFactory

2008-02-14 Thread Chris Harris (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-42?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Harris updated SOLR-42:
-

Attachment: TokenPrinter.java
HtmlStripReaderTestXmlProcessing.patch

The committed HtmlStripReader doesn't seem to handle offsets correctly for XML 
processing instructions such as this:



I'm attaching two files:

HtmlStripReaderTestXmlProcessing.patch adds an HtmlStripReader test case to 
catch the problem. (The test currently fails.)

TokenPrinter.java can help make it a little clearer what the problem actually 
is. Here is the output if I run it against the analysis code in trunk. 
Note that the offsets are basically what one would expect, except in the XML 
processing instructions case, where the start position is off by one:

-
String to test: id
 Token info:
   token 'id'
 startOffset: 11
 char at startOffset, and next few: 'id
id
 Token info:
   token 'id'
 startOffset: 99
 char at startOffset, and next few: 'id one
 two
 Token info:
   token 'one'
 startOffset: 41
 char at startOffset, and next few: 'oneid
 Token info:
   token 'id'
 startOffset: 49
 char at startOffset, and next few: '>id

> Highlighting problems with HTMLStripWhitespaceTokenizerFactory
> --
>
> Key: SOLR-42
> URL: https://issues.apache.org/jira/browse/SOLR-42
> Project: Solr
>  Issue Type: Bug
>  Components: highlighter
>Reporter: Andrew May
>Assignee: Grant Ingersoll
>Priority: Minor
> Attachments: htmlStripReaderTest.html, HTMLStripReaderTest.java, 
> HtmlStripReaderTestXmlProcessing.patch, SOLR-42.patch, SOLR-42.patch, 
> SOLR-42.patch, SOLR-42.patch, TokenPrinter.java
>
>
> Indexing content that contains HTML markup, causes problems with highlighting 
> if the HTMLStripWhitespaceTokenizerFactory is used (to prevent the tag names 
> from being searchable).
> Example title field:
> 40Ar/39Ar laserprobe dating of mylonitic fabrics in a 
> polyorogenic terrane of NW Iberia
> Searching for title:fabrics with highlighting on, the highlighted version has 
> the  tags in the wrong place - 22 characters to the left of where they 
> should be (i.e. the sum of the lengths of the tags).
> Response from Yonik on the solr-user mailing-list:
> HTMLStripWhitespaceTokenizerFactory works in two phases...
> HTMLStripReader removes the HTML and passes the result to
> WhitespaceTokenizer... at that point, Tokens are generated, but the
> offsets will correspond to the text after HTML removal, not before.
> I did it this way so that HTMLStripReader  could go before any
> tokenizer (like StandardTokenizer).
> Can you open a JIRA bug for this?  The fix would be a special version
> of HTMLStripReader integrated with a WhitespaceTokenizer to keep
> offsets correct. 
