[jira] Commented: (SOLR-1709) Distributed Date Faceting

2010-01-08 Thread Peter Sturge (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12797957#action_12797957
 ] 

Peter Sturge commented on SOLR-1709:


I've heard of Tortoise; I'll give that a try, thanks.

On the time-zone/skew issue, perhaps a more efficient approach would be a 
'push' rather than 'pull' - i.e.:

Requesters would include an optional parameter that told remote shards what 
time to use as 'NOW', and which TZ to use for date faceting.
This would avoid having to translate loads of time strings at merge time.
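
To make the 'push' concrete, here is a hedged sketch of what the coordinator 
side might look like (the 'NOW' and 'TZ' parameter names are assumptions for 
illustration, not part of the attached patch):

    import java.util.TimeZone;
    import org.apache.solr.common.params.ModifiableSolrParams;
    import org.apache.solr.common.params.SolrParams;

    // Stamp a shared reference time and zone onto each outgoing shard request,
    // so every shard buckets its date facets against the same instant.
    static ModifiableSolrParams stampNow(SolrParams original) {
        ModifiableSolrParams p = new ModifiableSolrParams(original);
        p.set("NOW", String.valueOf(System.currentTimeMillis())); // assumed param name
        p.set("TZ", TimeZone.getDefault().getID());               // assumed param name
        return p;
    }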

Thanks,
Peter


 Distributed Date Faceting
 -

 Key: SOLR-1709
 URL: https://issues.apache.org/jira/browse/SOLR-1709
 Project: Solr
  Issue Type: Improvement
  Components: SearchComponents - other
Affects Versions: 1.4
Reporter: Peter Sturge
Priority: Minor

 This patch is for adding support for date facets when using distributed 
 searches.
 Date faceting across multiple machines exposes some time-based issues that 
 anyone interested in this behaviour should be aware of:
 Any time and/or time-zone differences are not accounted for in the patch 
 (i.e. merged date facets are at a time-of-day, not necessarily at a universal 
 'instant-in-time', unless all shards are time-synced to the exact same time).
 The implementation uses the first encountered shard's facet_dates as the 
 basis for subsequent shards' data to be merged in.
 This means that if subsequent shards' facet_dates are skewed in relation to 
 the first by 1 'gap', these 'earlier' or 'later' facets will not be merged 
 in.
 There are several reasons for this:
   * Performance: It's faster to check facet_date lists against a single map's 
 data, rather than against each other, particularly if there are many shards
   * If 'earlier' and/or 'later' facet_dates are added in, this will make the 
 time range larger than that which was requested
 (e.g. a request for one hour's worth of facets could bring back 2, 3 
 or more hours of data)
 This could be dealt with if timezone and skew information was added, and 
 the dates were normalized.
 One possibility for adding such support is to [optionally] add 'timezone' and 
 'now' parameters to the 'facet_dates' map. This would tell requesters what 
 time and TZ the remote server thinks it is, and so multiple shards' time data 
 can be normalized.
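 A minimal sketch of the merge rule above (simplified: real facet_dates 
 entries also carry gap/end metadata, and these names are illustrative):

     import java.util.LinkedHashMap;
     import java.util.Map;

     // The first shard's facet_dates map is the basis; later shards only add
     // into buckets that already exist, so skewed earlier/later buckets drop out.
     static void mergeFacetDates(LinkedHashMap<String,Integer> basis,
                                 Map<String,Integer> shard) {
         for (Map.Entry<String,Integer> e : shard.entrySet()) {
             Integer existing = basis.get(e.getKey());
             if (existing != null) {
                 basis.put(e.getKey(), existing + e.getValue()); // known bucket: sum counts
             }                                                   // unknown bucket: ignored
         }
     }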
 The patch affects 2 files in the Solr core:
   org.apache.solr.handler.component.FacetComponent.java
   org.apache.solr.handler.component.ResponseBuilder.java
 The main changes are in FacetComponent - ResponseBuilder is just to hold the 
 completed SimpleOrderedMap until the finishStage.
 One possible enhancement is to perhaps make this an optional parameter, but 
 really, if facet.date parameters are specified, it is assumed they are 
 desired.
 Comments & suggestions welcome.
 As a favour to ask, if anyone could take my 2 source files and create a PATCH 
 file from them, it would be greatly appreciated, as I'm having a bit of trouble 
 with svn (don't shoot me, but my environment is a Redmond-based OS company).




[jira] Updated: (SOLR-1709) Distributed Date Faceting

2010-01-08 Thread Peter Sturge (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Sturge updated SOLR-1709:
---

Attachment: ResponseBuilder.java
FacetComponent.java

Sorry, guys, can't get svn to create a patch file correctly on windows, so I'm 
attaching the source files here. With some time, which at the moment I don't 
have, I'm sure I could get svn working. Rather than anyone have to wait for me 
to get the patch file created, I thought it best to get the source uploaded, so 
people can start using it.
Thanks, Peter


 Distributed Date Faceting
 -

 Key: SOLR-1709
 URL: https://issues.apache.org/jira/browse/SOLR-1709
 Project: Solr
  Issue Type: Improvement
  Components: SearchComponents - other
Affects Versions: 1.4
Reporter: Peter Sturge
Priority: Minor
 Attachments: FacetComponent.java, ResponseBuilder.java


 This patch is for adding support for date facets when using distributed 
 searches.
 Date faceting across multiple machines exposes some time-based issues that 
 anyone interested in this behaviour should be aware of:
 Any time and/or time-zone differences are not accounted for in the patch 
 (i.e. merged date facets are at a time-of-day, not necessarily at a universal 
 'instant-in-time', unless all shards are time-synced to the exact same time).
 The implementation uses the first encountered shard's facet_dates as the 
 basis for subsequent shards' data to be merged in.
 This means that if subsequent shards' facet_dates are skewed in relation to 
 the first by 1 'gap', these 'earlier' or 'later' facets will not be merged 
 in.
 There are several reasons for this:
   * Performance: It's faster to check facet_date lists against a single map's 
 data, rather than against each other, particularly if there are many shards
   * If 'earlier' and/or 'later' facet_dates are added in, this will make the 
 time range larger than that which was requested
 (e.g. a request for one hour's worth of facets could bring back 2, 3 
 or more hours of data)
 This could be dealt with if timezone and skew information was added, and 
 the dates were normalized.
 One possibility for adding such support is to [optionally] add 'timezone' and 
 'now' parameters to the 'facet_dates' map. This would tell requesters what 
 time and TZ the remote server thinks it is, and so multiple shards' time data 
 can be normalized.
 The patch affects 2 files in the Solr core:
   org.apache.solr.handler.component.FacetComponent.java
   org.apache.solr.handler.component.ResponseBuilder.java
 The main changes are in FacetComponent - ResponseBuilder is just to hold the 
 completed SimpleOrderedMap until the finishStage.
 One possible enhancement is to perhaps make this an optional parameter, but 
 really, if facet.date parameters are specified, it is assumed they are 
 desired.
 Comments & suggestions welcome.
 As a favour to ask, if anyone could take my 2 source files and create a PATCH 
 file from them, it would be greatly appreciated, as I'm having a bit of trouble 
 with svn (don't shoot me, but my environment is a Redmond-based OS company).




[jira] Created: (SOLR-1710) convert worddelimiterfilter to new tokenstream API

2010-01-08 Thread Robert Muir (JIRA)
convert worddelimiterfilter to new tokenstream API
--

 Key: SOLR-1710
 URL: https://issues.apache.org/jira/browse/SOLR-1710
 Project: Solr
  Issue Type: Improvement
  Components: Schema and Analysis
Reporter: Robert Muir


This one was a doozy; attached is a patch to convert it to the new tokenstream 
API.

Some of the logic was split into WordDelimiterIterator (which exposes a 
BreakIterator-like API for iterating subwords); the filter is much more 
efficient now, with no cloning.

Before applying the patch, rename the existing WordDelimiterFilter to 
OriginalWordDelimiterFilter.
The patch includes a testcase (TestWordDelimiterBWComp) which generates random 
strings from various subword combinations.
For each random string, it compares output against the existing 
WordDelimiterFilter for all 512 combinations of boolean parameters.

NOTE: due to bugs found (SOLR-1706), this currently only tests 256 of these 
combinations. The bugs discovered in SOLR-1706 are fixed here.
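
To make the combinatorics concrete, a hedged sketch of how such a test can 
enumerate the 512 = 2^9 flag settings (the flag set is assumed from the era's 
WordDelimiterFilter options; the real test class is TestWordDelimiterBWComp):

    // Decode nine boolean flags from the bits of a loop counter.
    for (int i = 0; i < 512; i++) {
        boolean generateWordParts     = (i & 1)   != 0;
        boolean generateNumberParts   = (i & 2)   != 0;
        boolean catenateWords         = (i & 4)   != 0;
        boolean catenateNumbers       = (i & 8)   != 0;
        boolean catenateAll           = (i & 16)  != 0;
        boolean splitOnCaseChange     = (i & 32)  != 0;
        boolean preserveOriginal      = (i & 64)  != 0;
        boolean splitOnNumerics       = (i & 128) != 0;
        boolean stemEnglishPossessive = (i & 256) != 0;
        // build old and new filters with these flags over the same random
        // string and assert that their token output is identical
    }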





[jira] Updated: (SOLR-1710) convert worddelimiterfilter to new tokenstream API

2010-01-08 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated SOLR-1710:
--

Attachment: SOLR-1710.patch

 convert worddelimiterfilter to new tokenstream API
 --

 Key: SOLR-1710
 URL: https://issues.apache.org/jira/browse/SOLR-1710
 Project: Solr
  Issue Type: Improvement
  Components: Schema and Analysis
Reporter: Robert Muir
 Attachments: SOLR-1710.patch


 This one was a doozy; attached is a patch to convert it to the new 
 tokenstream API.
 Some of the logic was split into WordDelimiterIterator (which exposes a 
 BreakIterator-like API for iterating subwords); the filter is much more 
 efficient now, with no cloning.
 Before applying the patch, rename the existing WordDelimiterFilter to 
 OriginalWordDelimiterFilter.
 The patch includes a testcase (TestWordDelimiterBWComp) which generates 
 random strings from various subword combinations.
 For each random string, it compares output against the existing 
 WordDelimiterFilter for all 512 combinations of boolean parameters.
 NOTE: due to bugs found (SOLR-1706), this currently only tests 256 of these 
 combinations. The bugs discovered in SOLR-1706 are fixed here.




[jira] Created: (SOLR-1711) Race condition in org/apache/solr/client/solrj/impl/StreamingUpdateSolrServer.java

2010-01-08 Thread Attila Babo (JIRA)
Race condition in 
org/apache/solr/client/solrj/impl/StreamingUpdateSolrServer.java
--

 Key: SOLR-1711
 URL: https://issues.apache.org/jira/browse/SOLR-1711
 Project: Solr
  Issue Type: Bug
  Components: clients - java
Affects Versions: 1.4, 1.5
Reporter: Attila Babo
Priority: Critical
 Fix For: 1.5


While inserting a large pile of documents using StreamingUpdateSolrServer there 
is a race condition where all Runner instances stop processing while the 
blocking queue is full. With a high-performance client this can happen quite 
often, and there is no way to recover from it on the client side.

In StreamingUpdateSolrServer there is a BlockingQueue called queue to store 
UpdateRequests, and up to threadCount worker threads from 
StreamingUpdateSolrServer.Runner read that queue and push requests to a Solr 
instance. If at some point the BlockingQueue is empty, all workers stop 
processing it and push the collected content to Solr, which can be a 
time-consuming process; sometimes all worker threads are waiting for Solr. If 
at this moment the client fills the BlockingQueue to full, all worker threads 
will quit without processing any further, and the main thread will block 
forever.

There is a simple, well-tested patch to handle this situation.




[jira] Updated: (SOLR-1711) Race condition in org/apache/solr/client/solrj/impl/StreamingUpdateSolrServer.java

2010-01-08 Thread Attila Babo (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Babo updated SOLR-1711:
--

Attachment: StreamingUpdateSolrServer.patch

Patch 1, 2:
Inside the Runner.run method I've added a do-while loop to prevent the Runner 
from quitting while there are new requests; this handles the problem of new 
requests being added while the Runner is sending the previous batch.

Patch 3:
The validity check of the method variable is not strictly necessary; it is 
just a code cleanup.

Patch 4:
The last part of the patch moves the synchronized block outside of the 
conditional, to avoid a situation where runners changes while it is being 
evaluated.

To keep the patch small, all indentation has been removed.
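
For readers following along, a minimal sketch of the Patch 1/2 idea (class 
structure assumed and heavily simplified; not the actual patch):

    import java.util.concurrent.BlockingQueue;
    import org.apache.solr.client.solrj.request.UpdateRequest;

    // The Runner re-checks the queue before exiting, so requests that arrived
    // while the previous batch was being streamed are not stranded.
    class Runner implements Runnable {
        private final BlockingQueue<UpdateRequest> queue;
        Runner(BlockingQueue<UpdateRequest> queue) { this.queue = queue; }
        public void run() {
            do {
                sendBatch();            // drain the queue and stream to Solr
            } while (!queue.isEmpty()); // new work may have arrived meanwhile
        }
        void sendBatch() { /* open a connection and stream queued UpdateRequests */ }
    }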

 Race condition in 
 org/apache/solr/client/solrj/impl/StreamingUpdateSolrServer.java
 --

 Key: SOLR-1711
 URL: https://issues.apache.org/jira/browse/SOLR-1711
 Project: Solr
  Issue Type: Bug
  Components: clients - java
Affects Versions: 1.4, 1.5
Reporter: Attila Babo
Priority: Critical
 Fix For: 1.5

 Attachments: StreamingUpdateSolrServer.patch

   Original Estimate: 1h
  Remaining Estimate: 1h

 While inserting a large pile of documents using StreamingUpdateSolrServer 
 there is a race condition where all Runner instances stop processing while 
 the blocking queue is full. With a high-performance client this can happen 
 quite often, and there is no way to recover from it on the client side.
 In StreamingUpdateSolrServer there is a BlockingQueue called queue to store 
 UpdateRequests, and up to threadCount worker threads from 
 StreamingUpdateSolrServer.Runner read that queue and push requests to a Solr 
 instance. If at some point the BlockingQueue is empty, all workers stop 
 processing it and push the collected content to Solr, which can be a 
 time-consuming process; sometimes all worker threads are waiting for Solr. If 
 at this moment the client fills the BlockingQueue to full, all worker threads 
 will quit without processing any further, and the main thread will block 
 forever.
 There is a simple, well-tested patch to handle this situation.




[jira] Updated: (SOLR-1711) Race condition in org/apache/solr/client/solrj/impl/StreamingUpdateSolrServer.java

2010-01-08 Thread Attila Babo (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Babo updated SOLR-1711:
--

Description: 
While inserting a large pile of documents using StreamingUpdateSolrServer there 
is a race condition where all Runner instances stop processing while the 
blocking queue is full. With a high-performance client this can happen quite 
often, and there is no way to recover from it on the client side.

In StreamingUpdateSolrServer there is a BlockingQueue called queue to store 
UpdateRequests, and up to threadCount worker threads from 
StreamingUpdateSolrServer.Runner read that queue and push requests to a Solr 
instance. If at some point the BlockingQueue is empty, all workers stop 
processing it and push the collected content to Solr, which can be a 
time-consuming process; sometimes all worker threads are waiting for Solr. If 
at this moment the client fills the BlockingQueue to full, all worker threads 
will quit without processing any further, and the main thread will block 
forever.

There is a simple, well-tested patch attached to handle this situation.

  was:
While inserting a large pile of documents using StreamingUpdateSolrServer there 
is a race condition where all Runner instances stop processing while the 
blocking queue is full. With a high-performance client this can happen quite 
often, and there is no way to recover from it on the client side.

In StreamingUpdateSolrServer there is a BlockingQueue called queue to store 
UpdateRequests, and up to threadCount worker threads from 
StreamingUpdateSolrServer.Runner read that queue and push requests to a Solr 
instance. If at some point the BlockingQueue is empty, all workers stop 
processing it and push the collected content to Solr, which can be a 
time-consuming process; sometimes all worker threads are waiting for Solr. If 
at this moment the client fills the BlockingQueue to full, all worker threads 
will quit without processing any further, and the main thread will block 
forever.

There is a simple, well-tested patch to handle this situation.


 Race condition in 
 org/apache/solr/client/solrj/impl/StreamingUpdateSolrServer.java
 --

 Key: SOLR-1711
 URL: https://issues.apache.org/jira/browse/SOLR-1711
 Project: Solr
  Issue Type: Bug
  Components: clients - java
Affects Versions: 1.4, 1.5
Reporter: Attila Babo
Priority: Critical
 Fix For: 1.5

 Attachments: StreamingUpdateSolrServer.patch

   Original Estimate: 1h
  Remaining Estimate: 1h

 While inserting a large pile of documents using StreamingUpdateSolrServer 
 there is a race condition where all Runner instances stop processing while 
 the blocking queue is full. With a high-performance client this can happen 
 quite often, and there is no way to recover from it on the client side.
 In StreamingUpdateSolrServer there is a BlockingQueue called queue to store 
 UpdateRequests, and up to threadCount worker threads from 
 StreamingUpdateSolrServer.Runner read that queue and push requests to a Solr 
 instance. If at some point the BlockingQueue is empty, all workers stop 
 processing it and push the collected content to Solr, which can be a 
 time-consuming process; sometimes all worker threads are waiting for Solr. If 
 at this moment the client fills the BlockingQueue to full, all worker threads 
 will quit without processing any further, and the main thread will block 
 forever.
 There is a simple, well-tested patch attached to handle this situation.




idea to speed up indexing defaults

2010-01-08 Thread Robert Muir
Hello,

I have been running some tests with English and I noticed that Solr
uses the very slow Porter2 snowball stemmer by default.
In LUCENE-2194 I have proposed a patch to speed this up; of course it
will never be picked up by Solr due to the way snowball is
reimplemented here.
This would increase indexing speed for the default 'text' type, etc. by about 10%; not much.

But actually I would like to propose instead that the PorterStemFilter
(Porter 1) from Lucene core be defined as the default.
This is significantly faster (my indexing speed was about 2x as fast!)
than this Porter2 snowball stemmer.
I did some relevance tests on a test collection and it actually came
out on top as far as relevance, too.

I suppose the thing blocking the use of PorterStemFilter is the protWords
functionality, but in LUCENE-1515 I proposed adding this to all Lucene
stemmers, so maybe we could remove the snowball duplication and
possibly change the default stemmer to the faster PorterStemFilter in
Lucene core.
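
For reference, a hedged sketch of the two analysis chains being compared 
(Lucene 2.9-era constructors; not taken from the actual example schema):

    import java.io.Reader;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.PorterStemFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.WhitespaceTokenizer;
    import org.apache.lucene.analysis.snowball.SnowballFilter;

    // lucene-core Porter(1) stemmer: the faster option proposed here.
    static TokenStream porter1(Reader r) {
        return new PorterStemFilter(new LowerCaseFilter(new WhitespaceTokenizer(r)));
    }

    // Snowball "English" (Porter2) stemmer: what the example schema uses today.
    static TokenStream porter2(Reader r) {
        return new SnowballFilter(new LowerCaseFilter(new WhitespaceTokenizer(r)), "English");
    }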

So basically, I am asking: is there a specific reason this slower
Snowball(English) Porter2 filter is defined as a default?

If there isn't, I'd like to suggest we move in these directions,
although it will take some time and won't really work until Solr and
Lucene are synced up again.

Thanks in advance for any ideas.

-- 
Robert Muir
rcm...@gmail.com


[jira] Commented: (SOLR-64) strict hierarchical facets

2010-01-08 Thread Thibaut Lassalle (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-64?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798053#action_12798053
 ] 

Thibaut Lassalle commented on SOLR-64:
--

Hi

I did the same patch for the solr-1.4 release
http://dev.lutece.paris.fr/svn/lutece/contribs/atoswordline/trunk/config-SOLR/SOLR-64-byParis.patch


 strict hierarchical facets
 --

 Key: SOLR-64
 URL: https://issues.apache.org/jira/browse/SOLR-64
 Project: Solr
  Issue Type: New Feature
  Components: search
Reporter: Yonik Seeley
 Fix For: 1.5

 Attachments: SOLR-64.patch, SOLR-64.patch


 Strict Facet Hierarchies... each tag has at most one parent (a tree).




[jira] Updated: (SOLR-792) Tree Faceting Component

2010-01-08 Thread Thibaut Lassalle (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thibaut Lassalle updated SOLR-792:
--

Attachment: SOLR-792.patch

Update to apply cleanly against release 1.4 

 Tree Faceting Component
 ---

 Key: SOLR-792
 URL: https://issues.apache.org/jira/browse/SOLR-792
 Project: Solr
  Issue Type: New Feature
Reporter: Erik Hatcher
Assignee: Erik Hatcher
Priority: Minor
 Attachments: SOLR-792.patch, SOLR-792.patch, SOLR-792.patch, 
 SOLR-792.patch, SOLR-792.patch


 A component to do multi-level faceting.




[jira] Updated: (SOLR-64) strict hierarchical facets

2010-01-08 Thread Thibaut Lassalle (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-64?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thibaut Lassalle updated SOLR-64:
-

Attachment: SOLR-64.patch

Update to apply cleanly against release 1.4

 strict hierarchical facets
 --

 Key: SOLR-64
 URL: https://issues.apache.org/jira/browse/SOLR-64
 Project: Solr
  Issue Type: New Feature
  Components: search
Reporter: Yonik Seeley
 Fix For: 1.5

 Attachments: SOLR-64.patch, SOLR-64.patch, SOLR-64.patch


 Strict Facet Hierarchies... each tag has at most one parent (a tree).




[jira] Updated: (SOLR-64) strict hierarchical facets

2010-01-08 Thread Thibaut Lassalle (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-64?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thibaut Lassalle updated SOLR-64:
-

Comment: was deleted

(was: Hi

I did the same patch for the solr-1.4 release
http://dev.lutece.paris.fr/svn/lutece/contribs/atoswordline/trunk/config-SOLR/SOLR-64-byParis.patch
)

 strict hierarchical facets
 --

 Key: SOLR-64
 URL: https://issues.apache.org/jira/browse/SOLR-64
 Project: Solr
  Issue Type: New Feature
  Components: search
Reporter: Yonik Seeley
 Fix For: 1.5

 Attachments: SOLR-64.patch, SOLR-64.patch, SOLR-64.patch


 Strict Facet Hierarchies... each tag has at most one parent (a tree).




Re: idea to speed up indexing defaults

2010-01-08 Thread Grant Ingersoll

On Jan 8, 2010, at 11:17 AM, Robert Muir wrote:

 Hello,
 
 I have been running some tests with English and I noticed that Solr
 uses the very slow Porter2 snowball stemmer by default.
 In LUCENE-2194 I have proposed a patch to speed this up; of course it
 will never be picked up by Solr due to the way snowball is
 reimplemented here.
 This would increase indexing speed for the default 'text' type, etc. by about 10%; not much.
 
 But actually I would like to propose instead that the PorterStemFilter
 (Porter 1) from Lucene core be defined as the default.
 This is significantly faster (my indexing speed was about 2x as fast!)
 than this Porter2 snowball stemmer.
 I did some relevance tests on a test collection and it actually came
 out on top as far as relevance, too.
 
 I suppose the thing blocking the use of PorterStemFilter is the protWords
 functionality, but in LUCENE-1515 I proposed adding this to all Lucene
 stemmers, so maybe we could remove the snowball duplication and
 possibly change the default stemmer to the faster PorterStemFilter in
 Lucene core.
 
 So basically, I am asking: is there a specific reason this slower
 Snowball(English) Porter2 filter is defined as a default?

It's a bit odd, but Solr doesn't really have a default.  What it has is an 
example schema.  Unfortunately, everyone treats the example as the default, 
so...

Yes, it would make sense to speed up the default schema as much as possible.  
There are probably other token filters in there that could be removed, too.

It's very good that you are doing this, as I've been wondering lately if it 
doesn't make sense to seriously evaluate speeding up all the snowball stuff.


 
 If there isn't, i'd like to suggest we move in these directions,
 although it will take some time and not really work until solr and
 lucene are synced up again.

It shouldn't be that far off, right?  I think there is movement underway to put 
Solr on 3.x.

[jira] Created: (SOLR-1712) option to suppress facet constraints when count is == numFound

2010-01-08 Thread Hoss Man (JIRA)
option to suppress facet constraints when count is == numFound
-

 Key: SOLR-1712
 URL: https://issues.apache.org/jira/browse/SOLR-1712
 Project: Solr
  Issue Type: Improvement
Reporter: Hoss Man


It would be handy to have an easy option to suppress (on the server side) any 
facet constraint values whose count is the same as numFound (i.e. filtering on 
that constraint would not reduce the result size).

This should be a corollary to facet.mincount=1 and happen prior to facet.limit 
being applied.

http://old.nabble.com/Removing-facets-which-frequency-match-the-result-count-to27026359.html
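
A hedged sketch of the proposed server-side filtering (names illustrative; 
this would run after facet.mincount but before facet.limit):

    import java.util.Iterator;
    import java.util.Map;

    // Drop any constraint whose count equals numFound: filtering on such a
    // value cannot reduce the result set.
    static void suppressUniversalConstraints(Map<String,Integer> counts, long numFound) {
        for (Iterator<Map.Entry<String,Integer>> it = counts.entrySet().iterator();
             it.hasNext(); ) {
            if (it.next().getValue() == numFound) {
                it.remove();
            }
        }
    }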




[jira] Commented: (SOLR-1712) option to suppress facet constraints when count is == numFound

2010-01-08 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798147#action_12798147
 ] 

Yonik Seeley commented on SOLR-1712:


Also keep in mind that the docset used for faceting may not be the one used to 
return results (this is true for multi-select).
And yes, if you still want the top 10 constraints *after* eliminating those 
with count=facet.maxcount, it makes distributed search *much* harder (and 
probably makes future per-segment faceting harder too).

 option to suppress facet constraints when count is == numFound
 -

 Key: SOLR-1712
 URL: https://issues.apache.org/jira/browse/SOLR-1712
 Project: Solr
  Issue Type: Improvement
Reporter: Hoss Man

 It would be handy to have an easy option to suppress (on the server side) any 
 facet constraint values whose count is the same as numFound (i.e. filtering on 
 that constraint would not reduce the result size).
 This should be a corollary to facet.mincount=1 and happen prior to 
 facet.limit being applied.
 http://old.nabble.com/Removing-facets-which-frequency-match-the-result-count-to27026359.html




Re: [Solr Wiki] Update of PacktBook2009 by HossMan

2010-01-08 Thread Chris Hostetter

: + Available For Purchase...
: +* 
[[http://www.packtpub.com/solr-1-4-enterprise-search-server?utm_source=http%3A%2F%2Flucene.apache.org%2Fsolr%2F&utm_medium=spons&utm_content=pod&utm_campaign=mdb_000275|Directly
 from Packt]] (A portion of proceeds are donated to the ASF)

David / Eric: I copied that URL from the main site; I'm only guessing that 
the referral/tracking info works OK even when coming from wiki.apache.org. 
Would one of you mind checking with someone at Packt to see if we need 
a different one?


-Hoss



[jira] Commented: (SOLR-1709) Distributed Date Faceting

2010-01-08 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798163#action_12798163
 ] 

Hoss Man commented on SOLR-1709:


bq. Requesters would include an optional parameter that told remote shards what 
time to use as 'NOW', and which TZ to use for date faceting. This would avoid 
having to translate loads of time strings at merge time.

I was thinking the same thing ... as long as the coordinator evaluated any 
DateMath in the facet.date.start and facet.date.end params before executing the 
sub-requests to the shards, the ranges coming back from the individual shards 
should all be in sync.
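
A sketch of that coordinator-side evaluation (DateMathParser is Solr's own 
class; the surrounding flow and the NOW-stripping shortcut are assumptions):

    import java.text.ParseException;
    import java.util.Date;
    import java.util.Locale;
    import java.util.TimeZone;
    import org.apache.solr.util.DateMathParser;

    // Resolve a facet.date.start/end expression once at the coordinator, then
    // send the absolute instant to every shard.
    static Date resolveOnce(String expr, Date sharedNow) throws ParseException {
        DateMathParser dmp = new DateMathParser(TimeZone.getTimeZone("UTC"), Locale.US);
        dmp.setNow(sharedNow);                               // the single reference instant
        return dmp.parseMath(expr.replaceFirst("^NOW", "")); // "NOW/DAY" -> "/DAY"
    }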

 Distributed Date Faceting
 -

 Key: SOLR-1709
 URL: https://issues.apache.org/jira/browse/SOLR-1709
 Project: Solr
  Issue Type: Improvement
  Components: SearchComponents - other
Affects Versions: 1.4
Reporter: Peter Sturge
Priority: Minor
 Attachments: FacetComponent.java, ResponseBuilder.java


 This patch is for adding support for date facets when using distributed 
 searches.
 Date faceting across multiple machines exposes some time-based issues that 
 anyone interested in this behaviour should be aware of:
 Any time and/or time-zone differences are not accounted for in the patch 
 (i.e. merged date facets are at a time-of-day, not necessarily at a universal 
 'instant-in-time', unless all shards are time-synced to the exact same time).
 The implementation uses the first encountered shard's facet_dates as the 
 basis for subsequent shards' data to be merged in.
 This means that if subsequent shards' facet_dates are skewed in relation to 
 the first by 1 'gap', these 'earlier' or 'later' facets will not be merged 
 in.
 There are several reasons for this:
   * Performance: It's faster to check facet_date lists against a single map's 
 data, rather than against each other, particularly if there are many shards
   * If 'earlier' and/or 'later' facet_dates are added in, this will make the 
 time range larger than that which was requested
 (e.g. a request for one hour's worth of facets could bring back 2, 3 
 or more hours of data)
 This could be dealt with if timezone and skew information was added, and 
 the dates were normalized.
 One possibility for adding such support is to [optionally] add 'timezone' and 
 'now' parameters to the 'facet_dates' map. This would tell requesters what 
 time and TZ the remote server thinks it is, and so multiple shards' time data 
 can be normalized.
 The patch affects 2 files in the Solr core:
   org.apache.solr.handler.component.FacetComponent.java
   org.apache.solr.handler.component.ResponseBuilder.java
 The main changes are in FacetComponent - ResponseBuilder is just to hold the 
 completed SimpleOrderedMap until the finishStage.
 One possible enhancement is to perhaps make this an optional parameter, but 
 really, if facet.date parameters are specified, it is assumed they are 
 desired.
 Comments & suggestions welcome.
 As a favour to ask, if anyone could take my 2 source files and create a PATCH 
 file from them, it would be greatly appreciated, as I'm having a bit of trouble 
 with svn (don't shoot me, but my environment is a Redmond-based OS company).




[jira] Commented: (SOLR-1709) Distributed Date Faceting

2010-01-08 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798170#action_12798170
 ] 

Yonik Seeley commented on SOLR-1709:


I haven't checked the patch, but it seems like we should take a generic 
approach to NOW...
The first time NOW is used anywhere in the request (and is not passed in as a 
request argument), either a thread local or something in the request context 
should be set to the current time.  Subsequent references to NOW would yield 
the first value set.
This would allow NOW to be referenced more than once in the same request with 
consistent results.

Passing in NOW as a request parameter would simply set it explicitly... the 
question is, who (which solr component) should be responsible for that?
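
A hedged sketch of that generic approach (the helper name and the context key 
are assumptions, not an existing Solr API):

    import org.apache.solr.request.SolrQueryRequest;

    // The first use fixes NOW for the whole request; subsequent references,
    // and an explicit NOW request parameter, all resolve to the same value.
    static long getNow(SolrQueryRequest req) {
        Long now = (Long) req.getContext().get("NOW");
        if (now == null) {
            String passed = req.getParams().get("NOW");   // explicit override, if any
            now = (passed != null) ? Long.valueOf(passed) : System.currentTimeMillis();
            req.getContext().put("NOW", now);
        }
        return now;
    }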

 Distributed Date Faceting
 -

 Key: SOLR-1709
 URL: https://issues.apache.org/jira/browse/SOLR-1709
 Project: Solr
  Issue Type: Improvement
  Components: SearchComponents - other
Affects Versions: 1.4
Reporter: Peter Sturge
Priority: Minor
 Attachments: FacetComponent.java, ResponseBuilder.java


 This patch is for adding support for date facets when using distributed 
 searches.
 Date faceting across multiple machines exposes some time-based issues that 
 anyone interested in this behaviour should be aware of:
 Any time and/or time-zone differences are not accounted for in the patch 
 (i.e. merged date facets are at a time-of-day, not necessarily at a universal 
 'instant-in-time', unless all shards are time-synced to the exact same time).
 The implementation uses the first encountered shard's facet_dates as the 
 basis for subsequent shards' data to be merged in.
 This means that if subsequent shards' facet_dates are skewed in relation to 
 the first by 1 'gap', these 'earlier' or 'later' facets will not be merged 
 in.
 There are several reasons for this:
   * Performance: It's faster to check facet_date lists against a single map's 
 data, rather than against each other, particularly if there are many shards
   * If 'earlier' and/or 'later' facet_dates are added in, this will make the 
 time range larger than that which was requested
 (e.g. a request for one hour's worth of facets could bring back 2, 3 
 or more hours of data)
 This could be dealt with if timezone and skew information was added, and 
 the dates were normalized.
 One possibility for adding such support is to [optionally] add 'timezone' and 
 'now' parameters to the 'facet_dates' map. This would tell requesters what 
 time and TZ the remote server thinks it is, and so multiple shards' time data 
 can be normalized.
 The patch affects 2 files in the Solr core:
   org.apache.solr.handler.component.FacetComponent.java
   org.apache.solr.handler.component.ResponseBuilder.java
 The main changes are in FacetComponent - ResponseBuilder is just to hold the 
 completed SimpleOrderedMap until the finishStage.
 One possible enhancement is to perhaps make this an optional parameter, but 
 really, if facet.date parameters are specified, it is assumed they are 
 desired.
 Comments & suggestions welcome.
 As a favour to ask, if anyone could take my 2 source files and create a PATCH 
 file from them, it would be greatly appreciated, as I'm having a bit of trouble 
 with svn (don't shoot me, but my environment is a Redmond-based OS company).




[jira] Commented: (SOLR-1657) convert the rest of solr to use the new tokenstream API

2010-01-08 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798188#action_12798188
 ] 

Yonik Seeley commented on SOLR-1657:


What about preserving the attributes for just the first token?  That makes a 
lot of sense in many cases (say when WDF is just removing punctuation).
So if preserveOriginal==true, the first token would always be the original.  
This should also be the most performant since it's just a modification to the 
first token (offset and termText)?


 convert the rest of solr to use the new tokenstream API
 ---

 Key: SOLR-1657
 URL: https://issues.apache.org/jira/browse/SOLR-1657
 Project: Solr
  Issue Type: Task
Reporter: Robert Muir
 Attachments: SOLR-1657.patch, SOLR-1657.patch


 org.apache.solr.analysis:
 BufferedTokenStream
  - -CommonGramsFilter-
  - -CommonGramsQueryFilter-
  - -RemoveDuplicatesTokenFilter-
 -CapitalizationFilterFactory-
 -HyphenatedWordsFilter-
 -LengthFilter (deprecated, remove)-
 SynonymFilter
 SynonymFilterFactory
 WordDelimiterFilter
 org.apache.solr.handler:
 AnalysisRequestHandler
 AnalysisRequestHandlerBase
 org.apache.solr.handler.component:
 QueryElevationComponent
 SpellCheckComponent
 org.apache.solr.highlight:
 DefaultSolrHighlighter
 org.apache.solr.search:
 FieldQParserPlugin
 org.apache.solr.spelling:
 SpellingQueryConverter




RE: [jira] Commented: (SOLR-1709) Distributed Date Faceting

2010-01-08 Thread Peter S

The time skew/TZ is really the 'other half' of what the patch would/should 
ultimately be.

Since the current patch only deals with dist responses, it will be perfectly 
happy to receive facet_dates that have been generated in sync with the 
requester.

 

I'm not really familiar with the distributed sending part of the code, but I 
would suspect that whatever component is delegated the task of fanning out 
shard requests would be a good candidate for 'owning' the marking of 'NOW' and 
adding the appropriate parameters to send to the shards (might this be the very 
same FacetComponent in distributedProcess()?).

 

Then there's the task of the remote shard digesting the new parameters and 
adjusting its dates accordingly. Presumably this would be handled by 
SimpleFacets?

 

For facet.date.start/facet.date.end, I guess if these are/can only be relative 
times (is it allowed to set an explicit start/end time?), then the remote shard 
can simply interpret NOW as the passed-in NOW, rather than its own NOW. Are 
there any options for facet.date.start/end that don't involve NOW at all?
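
For illustration, here is the kind of request in question; start/end may be 
NOW-relative date math or explicit instants (the values are examples, not 
taken from the patch):

    import org.apache.solr.common.params.ModifiableSolrParams;

    static ModifiableSolrParams exampleDateFacet() {
        ModifiableSolrParams p = new ModifiableSolrParams();
        p.set("facet", "true");
        p.set("facet.date", "timestamp");
        p.set("facet.date.start", "NOW/HOUR-1HOUR");  // NOW-relative
        p.set("facet.date.end", "NOW/HOUR");
        p.set("facet.date.gap", "+5MINUTES");
        // an explicit start avoids NOW entirely, e.g.:
        // p.set("facet.date.start", "2010-01-08T00:00:00Z");
        return p;
    }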

 

Peter

 

 

 

 Date: Fri, 8 Jan 2010 20:35:54 +
 From: j...@apache.org
 To: solr-dev@lucene.apache.org
 Subject: [jira] Commented: (SOLR-1709) Distributed Date Faceting
 
 
 [ 
 https://issues.apache.org/jira/browse/SOLR-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798170#action_12798170
  ] 
 
 Yonik Seeley commented on SOLR-1709:
 
 
 I haven't checked the patch, but it seems like we should take a generic 
 approach to NOW...
 The first time NOW is used anywhere in the request (and is not passed in as a 
 request argument), either a thread local or something in the request context 
 should be set to the current time. Subsequent references to NOW would yield 
 the first value set.
 This would allow NOW to be referenced more than once in the same request with 
 consistent results.
 
 Passing in NOW as a request parameter would simply set it explicitly... the 
 question is, who (which solr component) should be responsible for that?
 
  Distributed Date Faceting
  -
 
  Key: SOLR-1709
  URL: https://issues.apache.org/jira/browse/SOLR-1709
  Project: Solr
  Issue Type: Improvement
  Components: SearchComponents - other
  Affects Versions: 1.4
  Reporter: Peter Sturge
  Priority: Minor
  Attachments: FacetComponent.java, ResponseBuilder.java
 
 
  This patch is for adding support for date facets when using distributed 
  searches.
  Date faceting across multiple machines exposes some time-based issues that 
  anyone interested in this behaviour should be aware of:
  Any time and/or time-zone differences are not accounted for in the patch 
  (i.e. merged date facets are at a time-of-day, not necessarily at a 
  universal 'instant-in-time', unless all shards are time-synced to the exact 
  same time).
  The implementation uses the first encountered shard's facet_dates as the 
  basis for subsequent shards' data to be merged in.
  This means that if subsequent shards' facet_dates are skewed in relation to 
  the first by 1 'gap', these 'earlier' or 'later' facets will not be merged 
  in.
  There are several reasons for this:
  * Performance: It's faster to check facet_date lists against a single map's 
  data, rather than against each other, particularly if there are many shards
  * If 'earlier' and/or 'later' facet_dates are added in, this will make the 
  time range larger than that which was requested
  (e.g. a request for one hour's worth of facets could bring back 2, 3 or 
  more hours of data)
  This could be dealt with if timezone and skew information was added, and 
  the dates were normalized.
  One possibility for adding such support is to [optionally] add 'timezone' 
  and 'now' parameters to the 'facet_dates' map. This would tell requesters 
  what time and TZ the remote server thinks it is, and so multiple shards' 
  time data can be normalized.
  The patch affects 2 files in the Solr core:
  org.apache.solr.handler.component.FacetComponent.java
  org.apache.solr.handler.component.ResponseBuilder.java
  The main changes are in FacetComponent - ResponseBuilder is just to hold 
  the completed SimpleOrderedMap until the finishStage.
  One possible enhancement is to perhaps make this an optional parameter, but 
  really, if facet.date parameters are specified, it is assumed they are 
  desired.
  Comments & suggestions welcome.
  As a favour to ask, if anyone could take my 2 source files and create a 
  PATCH file from them, it would be greatly appreciated, as I'm having a bit of 
  trouble with svn (don't shoot me, but my environment is a Redmond-based OS 
  company).
 
 

  

[jira] Commented: (SOLR-1657) convert the rest of solr to use the new tokenstream API

2010-01-08 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798199#action_12798199
 ] 

Robert Muir commented on SOLR-1657:
---

Yonik, I agree, this is almost what the current patch does (take a look if you 
want, SOLR-1710).

There is one difference I must change: the 'when WDF is just removing 
punctuation' case. The current patch does not preserve attributes for this case 
(you must use preserveOriginal=true).

But the odd thing about this will be: when 'WDF is just removing punctuation' 
&& preserveOriginal == true, obviously the attributes will only apply to the 
original... does this make sense? 

I will make the change to the SOLR-1710 patch.


 convert the rest of solr to use the new tokenstream API
 ---

 Key: SOLR-1657
 URL: https://issues.apache.org/jira/browse/SOLR-1657
 Project: Solr
  Issue Type: Task
Reporter: Robert Muir
 Attachments: SOLR-1657.patch, SOLR-1657.patch


 org.apache.solr.analysis:
 BufferedTokenStream
  - -CommonGramsFilter-
  - -CommonGramsQueryFilter-
  - -RemoveDuplicatesTokenFilter-
 -CapitalizationFilterFactory-
 -HyphenatedWordsFilter-
 -LengthFilter (deprecated, remove)-
 SynonymFilter
 SynonymFilterFactory
 WordDelimiterFilter
 org.apache.solr.handler:
 AnalysisRequestHandler
 AnalysisRequestHandlerBase
 org.apache.solr.handler.component:
 QueryElevationComponent
 SpellCheckComponent
 org.apache.solr.highlight:
 DefaultSolrHighlighter
 org.apache.solr.search:
 FieldQParserPlugin
 org.apache.solr.spelling:
 SpellingQueryConverter




[jira] Commented: (SOLR-1709) Distributed Date Faceting

2010-01-08 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798203#action_12798203
 ] 

Yonik Seeley commented on SOLR-1709:


Date formatting and parsing also tend to be surprisingly expensive.
So *if* we support passing NOW as a date string, it would be nice to also 
support standard milliseconds.  That can also be easier for clients to generate 
rather than trying to figure out how to get the correct date format.  Perhaps 
that should even be an addition to the standard datemath syntax.

 Distributed Date Faceting
 -

 Key: SOLR-1709
 URL: https://issues.apache.org/jira/browse/SOLR-1709
 Project: Solr
  Issue Type: Improvement
  Components: SearchComponents - other
Affects Versions: 1.4
Reporter: Peter Sturge
Priority: Minor
 Attachments: FacetComponent.java, ResponseBuilder.java


 This patch is for adding support for date facets when using distributed 
 searches.
 Date faceting across multiple machines exposes some time-based issues that 
 anyone interested in this behaviour should be aware of:
 Any time and/or time-zone differences are not accounted for in the patch 
 (i.e. merged date facets are at a time-of-day, not necessarily at a universal 
 'instant-in-time', unless all shards are time-synced to the exact same time).
 The implementation uses the first encountered shard's facet_dates as the 
 basis for subsequent shards' data to be merged in.
 This means that if subsequent shards' facet_dates are skewed in relation to 
 the first by 1 'gap', these 'earlier' or 'later' facets will not be merged 
 in.
 There are several reasons for this:
   * Performance: It's faster to check facet_date lists against a single map's 
 data, rather than against each other, particularly if there are many shards
   * If 'earlier' and/or 'later' facet_dates are added in, this will make the 
 time range larger than that which was requested
 (e.g. a request for one hour's worth of facets could bring back 2, 3 
 or more hours of data)
 This could be dealt with if timezone and skew information was added, and 
 the dates were normalized.
 One possibility for adding such support is to [optionally] add 'timezone' and 
 'now' parameters to the 'facet_dates' map. This would tell requesters what 
 time and TZ the remote server thinks it is, and so multiple shards' time data 
 can be normalized.
 The patch affects 2 files in the Solr core:
   org.apache.solr.handler.component.FacetComponent.java
   org.apache.solr.handler.component.ResponseBuilder.java
 The main changes are in FacetComponent - ResponseBuilder is just to hold the 
 completed SimpleOrderedMap until the finishStage.
 One possible enhancement is to perhaps make this an optional parameter, but 
 really, if facet.date parameters are specified, it is assumed they are 
 desired.
 Comments & suggestions welcome.
 As a favour to ask, if anyone could take my 2 source files and create a PATCH 
 file from them, it would be greatly appreciated, as I'm having a bit of trouble 
 with svn (don't shoot me, but my environment is a Redmond-based OS company).




[jira] Updated: (SOLR-1710) convert worddelimiterfilter to new tokenstream API

2010-01-08 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated SOLR-1710:
--

Attachment: SOLR-1710.patch

For the 'WDF is only modifying a single word with punctuation' case: don't 
clearAttributes() if it's the first token, even though it's modified... unless 
preserveOriginal is on (in that case the preserved original contained the 
attributes already, and we must clear).

This is a little confusing, since the behavior for custom attributes depends on 
the preserveOriginal value, but I think it makes sense.
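
A paraphrase of that rule as a toy filter, hedged (this is not the real WDF; 
'first' and 'preserveOriginal' stand in for its iteration state, and the 
attribute API is Lucene 2.9's):

    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;

    abstract class ClearRuleSketch extends TokenFilter {
        final TermAttribute termAtt = addAttribute(TermAttribute.class);
        boolean first, preserveOriginal;
        protected ClearRuleSketch(TokenStream in) { super(in); }
        void emit(String word) {
            if (first && !preserveOriginal) {
                termAtt.setTermBuffer(word); // keep the input token's custom attributes
            } else {
                clearAttributes();           // the preserved original already carried them
                termAtt.setTermBuffer(word);
            }
        }
    }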

 convert worddelimiterfilter to new tokenstream API
 --

 Key: SOLR-1710
 URL: https://issues.apache.org/jira/browse/SOLR-1710
 Project: Solr
  Issue Type: Improvement
  Components: Schema and Analysis
Reporter: Robert Muir
 Attachments: SOLR-1710.patch, SOLR-1710.patch


 This one was a doozy; attached is a patch to convert it to the new 
 tokenstream API.
 Some of the logic was split into WordDelimiterIterator (which exposes a 
 BreakIterator-like API for iterating subwords); the filter is much more 
 efficient now, with no cloning.
 Before applying the patch, rename the existing WordDelimiterFilter to 
 OriginalWordDelimiterFilter.
 The patch includes a testcase (TestWordDelimiterBWComp) which generates 
 random strings from various subword combinations.
 For each random string, it compares output against the existing 
 WordDelimiterFilter for all 512 combinations of boolean parameters.
 NOTE: due to bugs found (SOLR-1706), this currently only tests 256 of these 
 combinations. The bugs discovered in SOLR-1706 are fixed here.




[jira] Commented: (SOLR-1709) Distributed Date Faceting

2010-01-08 Thread Peter Sturge (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798233#action_12798233
 ] 

Peter Sturge commented on SOLR-1709:


Definitely true! -- messing about with Date strings isn't great for performance.

As the NOW parameter would be for internal request use only (i.e. not for the 
indexer, not for human consumption), could it not just be an epoch long? The 
adjustment math should then be nice and quick (no string/date 
parsing/formatting; at worst just one Date.getTimeInMillis() call if the time 
is stored locally as a string).
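
In code, the epoch-based adjustment would be about this small (helper name 
hypothetical):

    // Shift a shard-local instant into the coordinator's time frame using the
    // passed-in NOW; no date-string parsing or formatting involved.
    static long toCoordinatorFrame(long localEpochMs, long coordinatorNowMs) {
        long skew = coordinatorNowMs - System.currentTimeMillis();
        return localEpochMs + skew;
    }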

 Distributed Date Faceting
 -

 Key: SOLR-1709
 URL: https://issues.apache.org/jira/browse/SOLR-1709
 Project: Solr
  Issue Type: Improvement
  Components: SearchComponents - other
Affects Versions: 1.4
Reporter: Peter Sturge
Priority: Minor
 Attachments: FacetComponent.java, ResponseBuilder.java


 This patch is for adding support for date facets when using distributed 
 searches.
 Date faceting across multiple machines exposes some time-based issues that 
 anyone interested in this behaviour should be aware of:
 Any time and/or time-zone differences are not accounted for in the patch 
 (i.e. merged date facets are at a time-of-day, not necessarily at a universal 
 'instant-in-time', unless all shards are time-synced to the exact same time).
 The implementation uses the first encountered shard's facet_dates as the 
 basis for subsequent shards' data to be merged in.
 This means that if subsequent shards' facet_dates are skewed in relation to 
 the first by 1 'gap', these 'earlier' or 'later' facets will not be merged 
 in.
 There are several reasons for this:
   * Performance: It's faster to check facet_date lists against a single map's 
 data, rather than against each other, particularly if there are many shards
   * If 'earlier' and/or 'later' facet_dates are added in, this will make the 
 time range larger than that which was requested
 (e.g. a request for one hour's worth of facets could bring back 2, 3 
 or more hours of data)
 This could be dealt with if timezone and skew information was added, and 
 the dates were normalized.
 One possibility for adding such support is to [optionally] add 'timezone' and 
 'now' parameters to the 'facet_dates' map. This would tell requesters what 
 time and TZ the remote server thinks it is, and so multiple shards' time data 
 can be normalized.
 The patch affects 2 files in the Solr core:
   org.apache.solr.handler.component.FacetComponent.java
   org.apache.solr.handler.component.ResponseBuilder.java
 The main changes are in FacetComponent - ResponseBuilder is just to hold the 
 completed SimpleOrderedMap until the finishStage.
 One possible enhancement is to perhaps make this an optional parameter, but 
 really, if facet.date parameters are specified, it is assumed they are 
 desired.
 Comments & suggestions welcome.
 As a favour to ask, if anyone could take my 2 source files and create a PATCH 
 file from them, it would be greatly appreciated, as I'm having a bit of trouble 
 with svn (don't shoot me, but my environment is a Redmond-based OS company).




[jira] Commented: (SOLR-1710) convert worddelimiterfilter to new tokenstream API

2010-01-08 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798234#action_12798234
 ] 

Yonik Seeley commented on SOLR-1710:


bq. For each random string, it compares output against the existing 
WordDelimiterFilter for all 512 combinations of boolean parameters

Whew... nice thorough work.

 convert worddelimiterfilter to new tokenstream API
 --

 Key: SOLR-1710
 URL: https://issues.apache.org/jira/browse/SOLR-1710
 Project: Solr
  Issue Type: Improvement
  Components: Schema and Analysis
Reporter: Robert Muir
 Attachments: SOLR-1710.patch, SOLR-1710.patch


 This one was a doozy; attached is a patch to convert it to the new 
 tokenstream API.
 Some of the logic was split into WordDelimiterIterator (which exposes a 
 BreakIterator-like API for iterating subwords); the filter is much more 
 efficient now, with no cloning.
 Before applying the patch, rename the existing WordDelimiterFilter to 
 OriginalWordDelimiterFilter.
 The patch includes a testcase (TestWordDelimiterBWComp) which generates 
 random strings from various subword combinations.
 For each random string, it compares output against the existing 
 WordDelimiterFilter for all 512 combinations of boolean parameters.
 NOTE: due to bugs found (SOLR-1706), this currently only tests 256 of these 
 combinations. The bugs discovered in SOLR-1706 are fixed here.




[jira] Commented: (SOLR-1710) convert worddelimiterfilter to new tokenstream API

2010-01-08 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798239#action_12798239
 ] 

Robert Muir commented on SOLR-1710:
---

Yonik, thanks. Again I have a hesitation: the SOLR-1706 problem.

If I could fix this bug in the original code, I would be able to enable the 
problematic combinations in backwards testing:
* catenateNumbers != catenateWords
* generateWordParts != generateNumberParts

I was unable to figure this one out though, so excluding these from the test 
makes me a little nervous... what is there to do? 


 convert worddelimiterfilter to new tokenstream API
 --

 Key: SOLR-1710
 URL: https://issues.apache.org/jira/browse/SOLR-1710
 Project: Solr
  Issue Type: Improvement
  Components: Schema and Analysis
Reporter: Robert Muir
 Attachments: SOLR-1710.patch, SOLR-1710.patch


 This one was a doozy; attached is a patch to convert it to the new 
 tokenstream API.
 Some of the logic was split into WordDelimiterIterator (which exposes a 
 BreakIterator-like API for iterating subwords); the filter is much more 
 efficient now, with no cloning.
 Before applying the patch, rename the existing WordDelimiterFilter to 
 OriginalWordDelimiterFilter.
 The patch includes a testcase (TestWordDelimiterBWComp) which generates 
 random strings from various subword combinations.
 For each random string, it compares output against the existing 
 WordDelimiterFilter for all 512 combinations of boolean parameters.
 NOTE: due to bugs found (SOLR-1706), this currently only tests 256 of these 
 combinations. The bugs discovered in SOLR-1706 are fixed here.




[jira] Commented: (SOLR-1710) convert worddelimiterfilter to new tokenstream API

2010-01-08 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798241#action_12798241
 ] 

Robert Muir commented on SOLR-1710:
---

Chris, not really; if you see the description I say:
before applying the patch, rename the existing WordDelimiterFilter to 
OriginalWordDelimiterFilter

I guess this should say instead: make a copy of... I will fix.

Obviously OriginalWordDelimiterFilter should not be committed, nor this random 
test that compares results against it.

But for now it's convenient, while working the issue, to simply blast random 
strings against the old filter for testing.

 convert worddelimiterfilter to new tokenstream API
 --

 Key: SOLR-1710
 URL: https://issues.apache.org/jira/browse/SOLR-1710
 Project: Solr
  Issue Type: Improvement
  Components: Schema and Analysis
Reporter: Robert Muir
 Attachments: SOLR-1710.patch, SOLR-1710.patch


 This one was a doozy; attached is a patch to convert it to the new 
 tokenstream API.
 Some of the logic was split into WordDelimiterIterator (which exposes a 
 BreakIterator-like API for iterating subwords); the filter is much more 
 efficient now, with no cloning.
 Before applying the patch, rename the existing WordDelimiterFilter to 
 OriginalWordDelimiterFilter.
 The patch includes a testcase (TestWordDelimiterBWComp) which generates 
 random strings from various subword combinations.
 For each random string, it compares output against the existing 
 WordDelimiterFilter for all 512 combinations of boolean parameters.
 NOTE: due to bugs found (SOLR-1706), this currently only tests 256 of these 
 combinations. The bugs discovered in SOLR-1706 are fixed here.




[jira] Updated: (SOLR-1710) convert worddelimiterfilter to new tokenstream API

2010-01-08 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated SOLR-1710:
--

Description: 
This one was a doozy; attached is a patch to convert it to the new tokenstream 
API.

Some of the logic was split into WordDelimiterIterator (which exposes a 
BreakIterator-like API for iterating subwords); the filter is much more 
efficient now, with no cloning.

Before applying the patch, copy the existing WordDelimiterFilter to 
OriginalWordDelimiterFilter.
The patch includes a testcase (TestWordDelimiterBWComp) which generates random 
strings from various subword combinations.
For each random string, it compares output against the existing 
WordDelimiterFilter for all 512 combinations of boolean parameters.

NOTE: due to bugs found (SOLR-1706), this currently only tests 256 of these 
combinations. The bugs discovered in SOLR-1706 are fixed here.


  was:
This one was a doozy; attached is a patch to convert it to the new 
TokenStream API.

Some of the logic was split into WordDelimiterIterator (which exposes a 
BreakIterator-like API for iterating subwords).
The filter is much more efficient now: no cloning.

Before applying the patch, rename the existing WordDelimiterFilter to 
OriginalWordDelimiterFilter.
The patch includes a testcase (TestWordDelimiterBWComp) which generates random 
strings from various subword combinations.
For each random string, it compares output against the existing 
WordDelimiterFilter for all 512 combinations of boolean parameters.

NOTE: due to bugs found (SOLR-1706), this currently only tests 256 of these 
combinations. The bugs discovered in SOLR-1706 are fixed here.



 convert worddelimiterfilter to new tokenstream API
 --

 Key: SOLR-1710
 URL: https://issues.apache.org/jira/browse/SOLR-1710
 Project: Solr
  Issue Type: Improvement
  Components: Schema and Analysis
Reporter: Robert Muir
 Attachments: SOLR-1710.patch, SOLR-1710.patch


 This one was a doozy; attached is a patch to convert it to the new 
 TokenStream API.
 Some of the logic was split into WordDelimiterIterator (which exposes a 
 BreakIterator-like API for iterating subwords).
 The filter is much more efficient now: no cloning.
 Before applying the patch, copy the existing WordDelimiterFilter to 
 OriginalWordDelimiterFilter.
 The patch includes a testcase (TestWordDelimiterBWComp) which generates 
 random strings from various subword combinations.
 For each random string, it compares output against the existing 
 WordDelimiterFilter for all 512 combinations of boolean parameters.
 NOTE: due to bugs found (SOLR-1706), this currently only tests 256 of these 
 combinations. The bugs discovered in SOLR-1706 are fixed here.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1709) Distributed Date Faceting

2010-01-08 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798243#action_12798243
 ] 

Yonik Seeley commented on SOLR-1709:


Seems useful enough that setting NOW should be advertised (i.e. not just an 
internal call).  For example, it would be a convenient way to keep the rest of 
your request the same, but check how the current date affects your date 
boosting strategies.  NOW isn't just for date faceting, but for anything that 
uses date math.

As for the format, 20091231 is ambiguous if you want flexible dates... is it a 
date or milliseconds?
I first thought of a prefix (ms:123456789), but that makes it look like a field 
query.
It might be safest to make it unambiguous somehow... postfix with ms? 
123456789ms
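
As a rough sketch of the postfix disambiguation suggested above (the parseNow 
helper and its behaviour are assumptions for illustration, not the patch's or 
Solr's actual API):

{code}
// Rough sketch only -- parseNow is an assumption, not Solr API:
// accept "123456789ms" as epoch milliseconds, else an ISO-8601 instant.
static long parseNow(String val) {
  if (val.endsWith("ms")) {  // unambiguous milliseconds form
    return Long.parseLong(val.substring(0, val.length() - 2));
  }
  // otherwise treat it as an ISO-8601 date, e.g. "2009-12-31T00:00:00Z"
  return javax.xml.bind.DatatypeConverter.parseDateTime(val).getTimeInMillis();
}
{code}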


 Distributed Date Faceting
 -

 Key: SOLR-1709
 URL: https://issues.apache.org/jira/browse/SOLR-1709
 Project: Solr
  Issue Type: Improvement
  Components: SearchComponents - other
Affects Versions: 1.4
Reporter: Peter Sturge
Priority: Minor
 Attachments: FacetComponent.java, ResponseBuilder.java


 This patch is for adding support for date facets when using distributed 
 searches.
 Date faceting across multiple machines exposes some time-based issues that 
 anyone interested in this behaviour should be aware of:
 Any time and/or time-zone differences are not accounted for in the patch 
 (i.e. merged date facets are at a time-of-day, not necessarily at a universal 
 'instant-in-time', unless all shards are time-synced to the exact same time).
 The implementation uses the first encountered shard's facet_dates as the 
 basis for subsequent shards' data to be merged in.
 This means that if subsequent shards' facet_dates are skewed in relation to 
 the first by 1 'gap', these 'earlier' or 'later' facets will not be merged 
 in.
 There are several reasons for this:
   * Performance: It's faster to check facet_date lists against a single map's 
 data, rather than against each other, particularly if there are many shards
   * If 'earlier' and/or 'later' facet_dates are added in, this will make the 
 time range larger than that which was requested
 (e.g. a request for one hour's worth of facets could bring back 2, 3 
 or more hours of data)
 This could be dealt with if timezone and skew information was added, and 
 the dates were normalized.
 One possibility for adding such support is to [optionally] add 'timezone' and 
 'now' parameters to the 'facet_dates' map. This would tell requesters what 
 time and TZ the remote server thinks it is, and so multiple shards' time data 
 can be normalized.
 The patch affects 2 files in the Solr core:
   org.apache.solr.handler.component.FacetComponent.java
   org.apache.solr.handler.component.ResponseBuilder.java
 The main changes are in FacetComponent - ResponseBuilder is just to hold the 
 completed SimpleOrderedMap until the finishStage.
 One possible enhancement is to perhaps make this an optional parameter, but 
 really, if facet.date parameters are specified, it is assumed they are 
 desired.
 Comments & suggestions welcome.
 As a favour to ask, if anyone could take my 2 source files and create a PATCH 
 file from it, it would be greatly appreciated, as I'm having a bit of trouble 
 with svn (don't shoot me, but my environment is a Redmond-based os company).
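
For illustration, the first-shard-as-basis merge described in the description 
could look roughly like this (hedged sketch; the shardFacetDates/merged names 
are made up and this is not the patch's actual FacetComponent code):

{code}
// Hedged sketch, not the patch's actual FacetComponent code: merge one
// shard's facet_dates counts into the first-encountered shard's map;
// date keys outside the base map (skewed by a 'gap') are not merged in.
for (Map.Entry<String, Integer> e : shardFacetDates.entrySet()) {
  Integer base = merged.get(e.getKey());
  if (base != null) {
    merged.put(e.getKey(), base + e.getValue());
  }
}
{code}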

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1710) convert worddelimiterfilter to new tokenstream API

2010-01-08 Thread Chris Male (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798245#action_12798245
 ] 

Chris Male commented on SOLR-1710:
--

Ah right, sorry, I missed that description.

 convert worddelimiterfilter to new tokenstream API
 --

 Key: SOLR-1710
 URL: https://issues.apache.org/jira/browse/SOLR-1710
 Project: Solr
  Issue Type: Improvement
  Components: Schema and Analysis
Reporter: Robert Muir
 Attachments: SOLR-1710.patch, SOLR-1710.patch


 This one was a doozy; attached is a patch to convert it to the new 
 TokenStream API.
 Some of the logic was split into WordDelimiterIterator (which exposes a 
 BreakIterator-like API for iterating subwords).
 The filter is much more efficient now: no cloning.
 Before applying the patch, copy the existing WordDelimiterFilter to 
 OriginalWordDelimiterFilter.
 The patch includes a testcase (TestWordDelimiterBWComp) which generates 
 random strings from various subword combinations.
 For each random string, it compares output against the existing 
 WordDelimiterFilter for all 512 combinations of boolean parameters.
 NOTE: due to bugs found (SOLR-1706), this currently only tests 256 of these 
 combinations. The bugs discovered in SOLR-1706 are fixed here.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1710) convert worddelimiterfilter to new tokenstream API

2010-01-08 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798248#action_12798248
 ] 

Robert Muir commented on SOLR-1710:
---

Chris, no problem; I created this confusion. Once the patch is OK'ed, I can 
include some additional testcases that I had problems with.
I have all 7 revisions I made of this filter locally, so I can see which 
scenarios fail on each previous iteration; I think these are good tests.


 convert worddelimiterfilter to new tokenstream API
 --

 Key: SOLR-1710
 URL: https://issues.apache.org/jira/browse/SOLR-1710
 Project: Solr
  Issue Type: Improvement
  Components: Schema and Analysis
Reporter: Robert Muir
 Attachments: SOLR-1710.patch, SOLR-1710.patch


 This one was a doozy; attached is a patch to convert it to the new 
 TokenStream API.
 Some of the logic was split into WordDelimiterIterator (which exposes a 
 BreakIterator-like API for iterating subwords).
 The filter is much more efficient now: no cloning.
 Before applying the patch, copy the existing WordDelimiterFilter to 
 OriginalWordDelimiterFilter.
 The patch includes a testcase (TestWordDelimiterBWComp) which generates 
 random strings from various subword combinations.
 For each random string, it compares output against the existing 
 WordDelimiterFilter for all 512 combinations of boolean parameters.
 NOTE: due to bugs found (SOLR-1706), this currently only tests 256 of these 
 combinations. The bugs discovered in SOLR-1706 are fixed here.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1706) wrong tokens output from WordDelimiterFilter depending upon options

2010-01-08 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798251#action_12798251
 ] 

Yonik Seeley commented on SOLR-1706:


Yep, certainly bugs.  IMO, no need to worry about trying to match (even for 
compat) - these look like real configuration edge cases to me.

 wrong tokens output from WordDelimiterFilter depending upon options
 ---

 Key: SOLR-1706
 URL: https://issues.apache.org/jira/browse/SOLR-1706
 Project: Solr
  Issue Type: Bug
  Components: Schema and Analysis
Affects Versions: 1.4
Reporter: Robert Muir

 Below you can see that when I have requested only numeric concatenations 
 (not words) be output, some words are still sometimes output, ignoring the 
 options I have provided, and even then in a very inconsistent way.
 {code}
   assertWdf("Super-Duper-XL500-42-AutoCoder's", 0,0,0,1,0,0,0,0,1, null,
       new String[] { "42", "AutoCoder" },
       new int[] { 18, 21 },
       new int[] { 20, 30 },
       null,  // types: not asserted in these cases
       new int[] { 1, 1 });
   assertWdf("Super-Duper-XL500-42-AutoCoder's-56", 0,0,0,1,0,0,0,0,1, null,
       new String[] { "42", "AutoCoder", "56" },
       new int[] { 18, 21, 33 },
       new int[] { 20, 30, 35 },
       null,
       new int[] { 1, 1, 1 });
   assertWdf("Super-Duper-XL500-AB-AutoCoder's", 0,0,0,1,0,0,0,0,1, null,
       new String[] { },
       new int[] { },
       new int[] { },
       null,
       new int[] { });
   assertWdf("Super-Duper-XL500-42-AutoCoder's-BC", 0,0,0,1,0,0,0,0,1, null,
       new String[] { "42" },
       new int[] { 18 },
       new int[] { 20 },
       null,
       new int[] { 1 });
 {code}
 where assertWdf is 
 {code}
   void assertWdf(String text, int generateWordParts, int generateNumberParts,
   int catenateWords, int catenateNumbers, int catenateAll,
   int splitOnCaseChange, int preserveOriginal, int splitOnNumerics,
   int stemEnglishPossessive, CharArraySet protWords, String expected[],
   int startOffsets[], int endOffsets[], String types[], int posIncs[])
   throws IOException {
 TokenStream ts = new WhitespaceTokenizer(new StringReader(text));
 WordDelimiterFilter wdf = new WordDelimiterFilter(ts, generateWordParts,
 generateNumberParts, catenateWords, catenateNumbers, catenateAll,
 splitOnCaseChange, preserveOriginal, splitOnNumerics,
 stemEnglishPossessive, protWords);
 assertTokenStreamContents(wdf, expected, startOffsets, endOffsets, types,
 posIncs);
   }
 {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1657) convert the rest of solr to use the new tokenstream API

2010-01-08 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798255#action_12798255
 ] 

Robert Muir commented on SOLR-1657:
---

bq. Not sure... I guess it depends on the attribute and what it does.

Me neither! Well, there are 2 patches now on SOLR-1710, so if we don't want 
this we can just use the first one.
I thought about this one a lot and came to the conclusion that if you really 
care about your custom attributes making sense, you will use preserveOriginal; 
but I think both versions work well with that line of reasoning.
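
For readers following the conversion, a minimal sketch of consuming the 
attribute-based ("new") TokenStream API this thread is moving to (Lucene 
2.9-era names; the analyzer, field, and text here are placeholders):

{code}
// Minimal consumer of the attribute-based TokenStream API (Lucene 2.9-era
// names); 'analyzer' and the field/text are placeholders.
TokenStream stream = analyzer.tokenStream("body", new StringReader("some text"));
TermAttribute termAtt = stream.addAttribute(TermAttribute.class);
PositionIncrementAttribute posIncAtt =
    stream.addAttribute(PositionIncrementAttribute.class);
while (stream.incrementToken()) {
  System.out.println(termAtt.term() + " +" + posIncAtt.getPositionIncrement());
}
stream.end();
stream.close();
{code}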


 convert the rest of solr to use the new tokenstream API
 ---

 Key: SOLR-1657
 URL: https://issues.apache.org/jira/browse/SOLR-1657
 Project: Solr
  Issue Type: Task
Reporter: Robert Muir
 Attachments: SOLR-1657.patch, SOLR-1657.patch


 org.apache.solr.analysis:
 BufferedTokenStream
  - -CommonGramsFilter-
  - -CommonGramsQueryFilter-
  - -RemoveDuplicatesTokenFilter-
 -CapitalizationFilterFactory-
 -HyphenatedWordsFilter-
 -LengthFilter (deprecated, remove)-
 SynonymFilter
 SynonymFilterFactory
 WordDelimiterFilter
 org.apache.solr.handler:
 AnalysisRequestHandler
 AnalysisRequestHandlerBase
 org.apache.solr.handler.component:
 QueryElevationComponent
 SpellCheckComponent
 org.apache.solr.highlight:
 DefaultSolrHighlighter
 org.apache.solr.search:
 FieldQParserPlugin
 org.apache.solr.spelling:
 SpellingQueryConverter

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1710) convert worddelimiterfilter to new tokenstream API

2010-01-08 Thread Chris Male (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798260#action_12798260
 ] 

Chris Male commented on SOLR-1710:
--

I am working with this patch with the goal of simplifying its logic and 
increasing readability. It seems great thus far, though.

 convert worddelimiterfilter to new tokenstream API
 --

 Key: SOLR-1710
 URL: https://issues.apache.org/jira/browse/SOLR-1710
 Project: Solr
  Issue Type: Improvement
  Components: Schema and Analysis
Reporter: Robert Muir
 Attachments: SOLR-1710.patch, SOLR-1710.patch


 This one was a doozy; attached is a patch to convert it to the new 
 TokenStream API.
 Some of the logic was split into WordDelimiterIterator (which exposes a 
 BreakIterator-like API for iterating subwords).
 The filter is much more efficient now: no cloning.
 Before applying the patch, copy the existing WordDelimiterFilter to 
 OriginalWordDelimiterFilter.
 The patch includes a testcase (TestWordDelimiterBWComp) which generates 
 random strings from various subword combinations.
 For each random string, it compares output against the existing 
 WordDelimiterFilter for all 512 combinations of boolean parameters.
 NOTE: due to bugs found (SOLR-1706), this currently only tests 256 of these 
 combinations. The bugs discovered in SOLR-1706 are fixed here.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-1657) convert the rest of solr to use the new tokenstream API

2010-01-08 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated SOLR-1657:
--

Description: 
org.apache.solr.analysis:
BufferedTokenStream
 - -CommonGramsFilter-
 - -CommonGramsQueryFilter-
 - -RemoveDuplicatesTokenFilter-
-CapitalizationFilterFactory-
-HyphenatedWordsFilter-
-LengthFilter (deprecated, remove)-
SynonymFilter
SynonymFilterFactory
-WordDelimiterFilter-

org.apache.solr.handler:
AnalysisRequestHandler
AnalysisRequestHandlerBase

org.apache.solr.handler.component:
QueryElevationComponent
SpellCheckComponent

org.apache.solr.highlight:
DefaultSolrHighlighter

org.apache.solr.search:
FieldQParserPlugin

org.apache.solr.spelling:
SpellingQueryConverter


  was:
org.apache.solr.analysis:
BufferedTokenStream
 - -CommonGramsFilter-
 - -CommonGramsQueryFilter-
 - -RemoveDuplicatesTokenFilter-
-CapitalizationFilterFactory-
-HyphenatedWordsFilter-
-LengthFilter (deprecated, remove)-
SynonymFilter
SynonymFilterFactory
WordDelimiterFilter

org.apache.solr.handler:
AnalysisRequestHandler
AnalysisRequestHandlerBase

org.apache.solr.handler.component:
QueryElevationComponent
SpellCheckComponent

org.apache.solr.highlight:
DefaultSolrHighlighter

org.apache.solr.search:
FieldQParserPlugin

org.apache.solr.spelling:
SpellingQueryConverter



Striking through WDF since I think it's at least close.

 convert the rest of solr to use the new tokenstream API
 ---

 Key: SOLR-1657
 URL: https://issues.apache.org/jira/browse/SOLR-1657
 Project: Solr
  Issue Type: Task
Reporter: Robert Muir
 Attachments: SOLR-1657.patch, SOLR-1657.patch


 org.apache.solr.analysis:
 BufferedTokenStream
  - -CommonGramsFilter-
  - -CommonGramsQueryFilter-
  - -RemoveDuplicatesTokenFilter-
 -CapitalizationFilterFactory-
 -HyphenatedWordsFilter-
 -LengthFilter (deprecated, remove)-
 SynonymFilter
 SynonymFilterFactory
 -WordDelimiterFilter-
 org.apache.solr.handler:
 AnalysisRequestHandler
 AnalysisRequestHandlerBase
 org.apache.solr.handler.component:
 QueryElevationComponent
 SpellCheckComponent
 org.apache.solr.highlight:
 DefaultSolrHighlighter
 org.apache.solr.search:
 FieldQParserPlugin
 org.apache.solr.spelling:
 SpellingQueryConverter

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1710) convert worddelimiterfilter to new tokenstream API

2010-01-08 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798261#action_12798261
 ] 

Robert Muir commented on SOLR-1710:
---

Thanks in advance, Chris; I will help with testing and benchmarking anything 
you can do.
I think I may have taken it as far as I can go; my head almost exploded.


 convert worddelimiterfilter to new tokenstream API
 --

 Key: SOLR-1710
 URL: https://issues.apache.org/jira/browse/SOLR-1710
 Project: Solr
  Issue Type: Improvement
  Components: Schema and Analysis
Reporter: Robert Muir
 Attachments: SOLR-1710.patch, SOLR-1710.patch


 This one was a doozy; attached is a patch to convert it to the new 
 TokenStream API.
 Some of the logic was split into WordDelimiterIterator (which exposes a 
 BreakIterator-like API for iterating subwords).
 The filter is much more efficient now: no cloning.
 Before applying the patch, copy the existing WordDelimiterFilter to 
 OriginalWordDelimiterFilter.
 The patch includes a testcase (TestWordDelimiterBWComp) which generates 
 random strings from various subword combinations.
 For each random string, it compares output against the existing 
 WordDelimiterFilter for all 512 combinations of boolean parameters.
 NOTE: due to bugs found (SOLR-1706), this currently only tests 256 of these 
 combinations. The bugs discovered in SOLR-1706 are fixed here.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1704) Include google collections jar

2010-01-08 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798264#action_12798264
 ] 

Grant Ingersoll commented on SOLR-1704:
---

Is there an actual use for them or are we just doing it for some future benefit?

 Include google collections jar
 --

 Key: SOLR-1704
 URL: https://issues.apache.org/jira/browse/SOLR-1704
 Project: Solr
  Issue Type: Improvement
Reporter: Noble Paul
Assignee: Noble Paul
Priority: Minor
 Fix For: 1.5

 Attachments: google-collect-1.0.jar


 Clustering already ships the google collections jar. We can add it to the 
 core and all components can benefit from it.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1710) convert worddelimiterfilter to new tokenstream API

2010-01-08 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798268#action_12798268
 ] 

Robert Muir commented on SOLR-1710:
---

Chris, yeah, it's supposed to be similar to 
http://java.sun.com/j2se/1.4.2/docs/api/java/text/BreakIterator.html#next%28%29

I started by mimicking this API somewhat; I guess a future improvement would be 
if this truly were a real BreakIterator.
Then, say, you could create a RuleBasedBreakIterator or 
DictionaryBasedBreakIterator (which are fast compiled DFAs) and customize how 
words are delimited.
Currently, you can only do this by customizing the charTypeTable, which cannot 
take any context into account, so it's rather limited.

All of the above is really just theoretical and not anything we should worry 
about. For practical purposes I mimicked the BreakIterator API (but diverged 
somewhat), just because I am used to working with it and found it was one way 
to separate a lot of the logic.
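
For reference, the standard java.text.BreakIterator idiom linked above, which 
WordDelimiterIterator's next()-style API loosely mimics (plain JDK API; the 
sample text is arbitrary):

{code}
// Standard java.text.BreakIterator word iteration -- the JDK pattern that
// WordDelimiterIterator's next()-style API loosely mimics.
String text = "Super-Duper-XL500-42-AutoCoder's";
java.text.BreakIterator words = java.text.BreakIterator.getWordInstance();
words.setText(text);
int start = words.first();
for (int end = words.next(); end != java.text.BreakIterator.DONE;
     start = end, end = words.next()) {
  System.out.println("piece: " + text.substring(start, end));
}
{code}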


 convert worddelimiterfilter to new tokenstream API
 --

 Key: SOLR-1710
 URL: https://issues.apache.org/jira/browse/SOLR-1710
 Project: Solr
  Issue Type: Improvement
  Components: Schema and Analysis
Reporter: Robert Muir
 Attachments: SOLR-1710.patch, SOLR-1710.patch


 This one was a doozy, attached is a patch to convert it to the new 
 tokenstream API.
 Some of the logic was split into WordDelimiterIterator (exposes a 
 BreakIterator-like api for iterating subwords)
 the filter is much more efficient now, no cloning.
 before applying the patch, copy the existing WordDelimiterFilter to 
 OriginalWordDelimiterFilter
 the patch includes a testcase (TestWordDelimiterBWComp) which generates 
 random strings from various subword combinations.
 For each random string, it compares output against the existing 
 WordDelimiterFilter for all 512 combinations of boolean parameters.
 NOTE: due to bugs found (SOLR-1706), this currently only tests 256 of these 
 combinations. The bugs discovered in SOLR-1706 are fixed here.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1653) add PatternReplaceCharFilter

2010-01-08 Thread Koji Sekiguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798271#action_12798271
 ] 

Koji Sekiguchi commented on SOLR-1653:
--

Thanks, Paul! I've just committed revision 897357.

 add PatternReplaceCharFilter
 

 Key: SOLR-1653
 URL: https://issues.apache.org/jira/browse/SOLR-1653
 Project: Solr
  Issue Type: New Feature
  Components: Schema and Analysis
Affects Versions: 1.4
Reporter: Koji Sekiguchi
Assignee: Koji Sekiguchi
Priority: Minor
 Fix For: 1.5

 Attachments: SOLR-1653.patch, SOLR-1653.patch


 Add a new CharFilter that uses a regular expression for the target of replace 
 string in char stream.
 Usage:
 {code:title=schema.xml}
 <fieldType name="textCharNorm" class="solr.TextField"
            positionIncrementGap="100">
   <analyzer>
     <charFilter class="solr.PatternReplaceCharFilterFactory"
                 groupedPattern="([nN][oO]\.)\s*(\d+)"
                 replaceGroups="1,2" blockDelimiters=":;"/>
     <charFilter class="solr.MappingCharFilterFactory"
                 mapping="mapping-ISOLatin1Accent.txt"/>
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
   </analyzer>
 </fieldType>
 {code}
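
As a rough illustration of what the groupedPattern/replaceGroups pair above is 
meant to do (plain java.util.regex here, not the CharFilter's actual 
implementation; the input string is made up):

{code}
// Rough illustration with plain java.util.regex (not the CharFilter's code):
// keep groups 1 and 2 of the groupedPattern and drop the whitespace between
// them, so "No. 401" style references are normalized before tokenization.
java.util.regex.Pattern p =
    java.util.regex.Pattern.compile("([nN][oO]\\.)\\s*(\\d+)");
String in = "see No.  401 and no.7";  // made-up sample input
String out = p.matcher(in).replaceAll("$1$2");
// out -> "see No.401 and no.7"
{code}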

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (SOLR-1268) Incorporate Lucene's FastVectorHighlighter

2010-01-08 Thread Koji Sekiguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi resolved SOLR-1268.
--

Resolution: Fixed

Committed revision 897383.

 Incorporate Lucene's FastVectorHighlighter
 --

 Key: SOLR-1268
 URL: https://issues.apache.org/jira/browse/SOLR-1268
 Project: Solr
  Issue Type: New Feature
  Components: highlighter
Reporter: Koji Sekiguchi
Assignee: Koji Sekiguchi
Priority: Minor
 Fix For: 1.5

 Attachments: SOLR-1268.patch, SOLR-1268.patch




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1696) Deprecate old highlighting syntax and move configuration to HighlightComponent

2010-01-08 Thread Koji Sekiguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798312#action_12798312
 ] 

Koji Sekiguchi commented on SOLR-1696:
--

I've just committed SOLR-1268. Now I'm trying to contribute a patch for this to 
sync with trunk...

 Deprecate old highlighting syntax and move configuration to 
 HighlightComponent
 

 Key: SOLR-1696
 URL: https://issues.apache.org/jira/browse/SOLR-1696
 Project: Solr
  Issue Type: Improvement
  Components: highlighter
Reporter: Noble Paul
 Fix For: 1.5

 Attachments: SOLR-1696.patch


 There is no reason why we should have a custom syntax for highlighter 
 configuration.
 It can be treated like any other SearchComponent and all the configuration 
 can go in there.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.