[jira] Updated: (SOLR-342) Add support for Lucene's new Indexing and merge features (excluding Document/Field/Token reuse)

2008-02-07 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated SOLR-342:
-

Attachment: SOLR-342.patch

Update of patch to account for the fact that mergeFactor applies only to log-based 
merge policies.  I kept the mergeFactor tag, but added an instanceof check in 
the init method of SolrIndexWriter to see whether the mergeFactor is 
settable.
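
For reference, a minimal sketch of the guard described above, assuming the 
Lucene 2.3 MergePolicy API (getMergePolicy() / LogMergePolicy); the helper 
class name is made up, not from the patch:

{code}
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.LogMergePolicy;
import org.apache.lucene.index.MergePolicy;

// Hypothetical helper: only apply mergeFactor when the active merge policy
// is log-based, since other MergePolicy implementations don't define one.
class MergeFactorGuard {
  static void applyMergeFactor(IndexWriter writer, int mergeFactor) {
    MergePolicy mp = writer.getMergePolicy();
    if (mp instanceof LogMergePolicy) {
      ((LogMergePolicy) mp).setMergeFactor(mergeFactor);
    }
  }
}
{code}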

 Add support for Lucene's new Indexing and merge features (excluding 
 Document/Field/Token reuse)
 ---

 Key: SOLR-342
 URL: https://issues.apache.org/jira/browse/SOLR-342
 Project: Solr
  Issue Type: Improvement
  Components: update
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Attachments: copyLucene.sh, SOLR-342.patch, SOLR-342.patch, 
 SOLR-342.patch, SOLR-342.tar.gz


 LUCENE-843 adds support for new indexing capabilities using the 
 setRAMBufferSizeMB() method that should significantly speed up indexing for 
 many applications.  To use this, we will need the trunk version of Lucene (or 
 wait for the next official release of Lucene).
 A side effect of this is that Lucene's new, faster StandardTokenizer will also 
 be incorporated.  
 We also need to think about how we want to incorporate the new merge scheduling 
 functionality (the new default in Lucene is to do merges in a background thread).
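
 (Not part of the issue text: a minimal sketch of the two features above, 
 RAM-based flushing and background merges, using the Lucene 2.3 API; the 
 directory path is a placeholder.)

{code}
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.ConcurrentMergeScheduler;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;

public class Ram843Sketch {
  public static void main(String[] args) throws Exception {
    IndexWriter writer = new IndexWriter(
        FSDirectory.getDirectory("/tmp/index"), new StandardAnalyzer(), true);
    // Flush by RAM usage instead of by document count (LUCENE-843)...
    writer.setRAMBufferSizeMB(32.0);
    // ...and run merges in background threads (the new Lucene default).
    writer.setMergeScheduler(new ConcurrentMergeScheduler());
    writer.close();
  }
}
{code}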

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-330) Use new Lucene Token APIs (reuse and char[] buff)

2008-02-07 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12566610#action_12566610
 ] 

Grant Ingersoll commented on SOLR-330:
--

Note, this patch also includes SOLR-468

 Use new Lucene Token APIs (reuse and char[] buff)
 -

 Key: SOLR-330
 URL: https://issues.apache.org/jira/browse/SOLR-330
 Project: Solr
  Issue Type: Improvement
Reporter: Yonik Seeley
Assignee: Grant Ingersoll
Priority: Minor
 Attachments: SOLR-330.patch


 Lucene is getting new Token APIs for better performance:
 - token reuse
 - char[] offset + len instead of String
 This requires a new version of Lucene.
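
 (Not from the patch: a minimal sketch of the Lucene 2.3 reuse pattern being 
 referred to, where the caller passes a single Token in and reads the term 
 from a shared char[] buffer instead of allocating a String per term.)

{code}
import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

class TokenReuseSketch {
  // Consume a stream with one reusable Token rather than a new Token per term.
  static int countTokens(TokenStream ts) throws IOException {
    int n = 0;
    final Token reusable = new Token();
    for (Token t = ts.next(reusable); t != null; t = ts.next(reusable)) {
      char[] buf = t.termBuffer();   // shared char[] buffer
      int len = t.termLength();      // valid length within the buffer
      n++;
    }
    return n;
  }
}
{code}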

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-471) Distributed Solr Client

2008-02-07 Thread Nguyen Kien Trung (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1256#action_1256
 ] 

Nguyen Kien Trung commented on SOLR-471:


Thanks Yonik. Actually, I did have a glance at SOLR-303.
I'm working on a Java project that requires interaction with multiple 
customized Solr instances, and it turned out that the requirement was not 
met by the solution SOLR-303 offers, so I made this workaround with the 
thought that the patch may be helpful to others in the same situation 
as me. 

I'm quite new to Solr but very excited about the promising features Solr is 
working toward.

 Distributed Solr Client
 ---

 Key: SOLR-471
 URL: https://issues.apache.org/jira/browse/SOLR-471
 Project: Solr
  Issue Type: New Feature
  Components: clients - java
Affects Versions: 1.3
Reporter: Nguyen Kien Trung
Priority: Minor
 Attachments: distributedclient.patch


 Inspired by memcached Java clients.
 The ability to update/search/delete among many Solr instances.
 Client parameters:
 - List of Solr servers
 - Number of replicas
 Client functions:
 - Update: use consistent hashing to determine which documents are going to 
 be stored on which server. Get the list of servers (equal to the number of 
 replicas) and issue parallel UPDATEs (a toy sketch of the server selection 
 follows this list)
 - Search: search all servers in parallel, aggregate distinct results
 - Delete: delete in parallel on all servers
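
 (Not from the patch: a toy sketch of the consistent-hashing server selection 
 described above; the virtual-node count and hash function are arbitrary 
 choices for illustration.)

{code}
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

class SolrRing {
  private final SortedMap<Integer, String> ring = new TreeMap<Integer, String>();

  SolrRing(List<String> servers) {
    for (String s : servers) {
      // 100 virtual nodes per server smooth out the distribution.
      for (int i = 0; i < 100; i++) {
        ring.put((s + "#" + i).hashCode(), s);
      }
    }
  }

  // The first server clockwise from the document's hash point gets the doc;
  // the next (replicas - 1) distinct servers on the ring would hold copies.
  String serverFor(String docId) {
    SortedMap<Integer, String> tail = ring.tailMap(docId.hashCode());
    Integer key = tail.isEmpty() ? ring.firstKey() : tail.firstKey();
    return ring.get(key);
  }
}
{code}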

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-469) DB Import RequestHandler

2008-02-07 Thread Noble Paul (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12566599#action_12566599
 ] 

Noble Paul commented on SOLR-469:
-

We are planning to eliminate the schema creation step. That way we will not 
need to duplicate details which are already present in schema.xml, and we can 
simplify the data-config and eliminate the copyField as well. We must then 
introduce a verifier which ensures that the data-config is in sync with the 
schema.xml. 


 DB Import RequestHandler
 

 Key: SOLR-469
 URL: https://issues.apache.org/jira/browse/SOLR-469
 Project: Solr
  Issue Type: New Feature
  Components: update
Affects Versions: 1.3
Reporter: Noble Paul
Priority: Minor
 Fix For: 1.3

 Attachments: SOLR-469.patch


 We need a RequestHandler which can import data from a DB or other data sources 
 into the Solr index. Think of it as an advanced form of the SqlUpload plugin 
 (SOLR-103).
 The way it works is as follows (a hypothetical skeleton follows this list).
 * Provide a configuration file (xml) to the Handler which takes in the 
 necessary SQL queries and mappings to a solr schema
   - It also takes in a properties file for the data source 
 configuration
 * Given the configuration it can also generate the solr schema.xml
 * It is registered as a RequestHandler which can take two commands, 
 do-full-import and do-delta-import
   - do-full-import: dumps all the data from the database into the 
 index (based on the SQL query in the configuration)
   - do-delta-import: dumps all the data that has changed since the last 
 import (we assume a modified-timestamp column in the tables)
 * It provides an admin page
   - where we can schedule it to be run automatically at regular 
 intervals
   - It shows the status of the Handler (idle, full-import, 
 delta-import)
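
 (Not from the patch: a hypothetical skeleton of the command dispatch 
 described above, using Solr 1.3's RequestHandlerBase; the "command" parameter 
 name follows the issue text, and the class name is made up.)

{code}
import org.apache.solr.handler.RequestHandlerBase;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.request.SolrQueryResponse;

public class DbImportHandler extends RequestHandlerBase {
  public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp)
      throws Exception {
    String cmd = req.getParams().get("command", "status");
    if ("do-full-import".equals(cmd)) {
      // run the configured SQL query and index every row
    } else if ("do-delta-import".equals(cmd)) {
      // index only rows whose modified-timestamp is newer than the last run
    }
    rsp.add("status", cmd);  // idle / full-import / delta-import
  }

  public String getDescription() { return "DB import request handler"; }
  public String getSource()      { return "$Source$"; }
  public String getSourceId()    { return "$Id$"; }
  public String getVersion()     { return "$Revision$"; }
}
{code}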

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-342) Add support for Lucene's new Indexing and merge features (excluding Document/Field/Token reuse)

2008-02-07 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12566643#action_12566643
 ] 

Grant Ingersoll commented on SOLR-342:
--

I did some benchmarking of the autocommit functionality in Lucene (as opposed 
to in Solr, which is different).  Currently, in Lucene, autocommit is true by 
default, meaning that every time there is a flush, it is also committed.  Solr 
adds its own layer on top of this with its commit semantics.  There is a 
noticeable difference in memory used and speed in Lucene between 
autocommit = false and autocommit = true.  

Some rough numbers using the autocommit.alg in Lucene's benchmark contrib (from 
trunk):  
 Operation            round    ac   ram  runCnt  recsPerRun  rec/s  elapsedSec  avgUsedMem  avgTotalMem
 [java] MAddDocs_200000   0   true  2.00      1      200000  400.1      499.90  61,322,608   68,780,032
 [java] MAddDocs_200000   1  false  2.00      1      200000  499.9      400.08  49,373,632   75,018,240
 [java] MAddDocs_200000   2   true  2.00      1      200000  383.7      521.27  70,716,096   75,018,240
 [java] MAddDocs_200000   3  false  2.00      1      200000  552.7      361.89  68,069,464   75,018,240

The first row has autocommit = true, second is false, and then alternating.  
The key value is the rec/s, which is:
1. ac = true 400.1
2. ac = false 499.9
3. ac = true 383.7
4. ac = false 552.7

Notice also the difference in avgUsedMem.  Adding this functionality may be 
more important to Solr's performance than the flush-by-RAM capability.
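
For reference, a minimal sketch of the flag being benchmarked, assuming the 
Lucene 2.3 constructor that exposes autoCommit; the directory path is a 
placeholder:

{code}
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;

public class AutoCommitSketch {
  public static void main(String[] args) throws Exception {
    // autoCommit=false: segments are still flushed, but the commit is
    // deferred, which is where the speed/memory difference above comes from.
    IndexWriter writer = new IndexWriter(
        FSDirectory.getDirectory("/tmp/index"),
        false /* autoCommit */, new StandardAnalyzer(), true /* create */);
    writer.close();  // close() commits the pending changes
  }
}
{code}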

 Add support for Lucene's new Indexing and merge features (excluding 
 Document/Field/Token reuse)
 ---

 Key: SOLR-342
 URL: https://issues.apache.org/jira/browse/SOLR-342
 Project: Solr
  Issue Type: Improvement
  Components: update
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Attachments: copyLucene.sh, SOLR-342.patch, SOLR-342.patch, 
 SOLR-342.tar.gz


 LUCENE-843 adds support for new indexing capabilities using the 
 setRAMBufferSizeMB() method that should significantly speed up indexing for 
 many applications.  To fix this, we will need trunk version of Lucene (or 
 wait for the next official release of Lucene)
 Side effect of this is that Lucene's new, faster StandardTokenizer will also 
 be incorporated.  
 Also need to think about how we want to incorporate the new merge scheduling 
 functionality (new default in Lucene is to do merges in a background thread)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Issue Comment Edited: (SOLR-236) Field collapsing

2008-02-07 Thread Oleg Gnatovskiy (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12566864#action_12566864
 ] 

oleg_gnatovskiy edited comment on SOLR-236 at 2/7/08 4:15 PM:
--

Hello everyone. I am planning to implement chain collapsing in a high-traffic 
production environment, so I'd like to use a stable version of Solr. It doesn't 
seem like there is a chain collapse patch for Solr 1.2, so I tried the Solr 1.1 
patch. It seems to work fine at collapsing, but how do I get a count for the 
documents other than the one being displayed?

As a result I see:

<lst name="collapse_counts">
  <int name="Restaurant">2414</int>
  <int name="Bar/Club">9</int>
  <int name="Directory & Services">37</int>
</lst>

Does that mean that there are 2414 more Restaurants, 9 more Bars, and 37 more 
Directory & Services? If so, then that's great.

However, when I collapse on some integer fields I get an empty list for 
collapse_counts. Do counts only work for text fields?

Thanks in advance for any help you can provide!

  was (Author: oleg_gnatovskiy):
Hello everyone. I am planning to implement chain collapsing in a high-traffic 
production environment, so I'd like to use a stable version of Solr. It doesn't 
seem like there is a chain collapse patch for Solr 1.2, so I tried the Solr 1.1 
patch. It seems to work fine at collapsing, but how do I get a count for the 
documents other than the one being displayed?

As a result I see:
{code}
<lst name="collapse_counts">
  <int name="Restaurant">2414</int>
  <int name="Bar/Club">9</int>
  <int name="Directory & Services">37</int>
</lst>
{code}

Does that mean that there are 2414 more Restaurants, 9 more Bars, and 37 more 
Directory & Services? If so, then that's great.

However, when I collapse on some integer fields I get an empty list for 
collapse_counts. Do counts only work for text fields?

Thanks in advance for any help you can provide!
  
 Field collapsing
 

 Key: SOLR-236
 URL: https://issues.apache.org/jira/browse/SOLR-236
 Project: Solr
  Issue Type: New Feature
  Components: search
Affects Versions: 1.3
Reporter: Emmanuel Keller
 Attachments: field-collapsing-extended-592129.patch, 
 field_collapsing_1.1.0.patch, field_collapsing_1.3.patch, 
 field_collapsing_dsteigerwald.diff, field_collapsing_dsteigerwald.diff, 
 SOLR-236-FieldCollapsing.patch, SOLR-236-FieldCollapsing.patch, 
 SOLR-236-FieldCollapsing.patch


 This patch includes a new feature called field collapsing.
 It is used to collapse a group of results with a similar value for a given 
 field to a single entry in the result set. Site collapsing is a special case 
 of this, where all results for a given web site are collapsed into one or two 
 entries in the result set, typically with an associated "more documents from 
 this site" link. See also "Duplicate detection",
 http://www.fastsearch.com/glossary.aspx?m=48&amid=299
 The implementation adds 3 new query parameters (SolrParams); an example 
 request appears below:
 collapse.field to choose the field used to group results
 collapse.type normal (default value) or adjacent
 collapse.max to select how many continuous results are allowed before 
 collapsing
 TODO (in progress):
 - More documentation (on source code)
 - Test cases
 Two patches:
 - field_collapsing.patch for current development version
 - field_collapsing_1.1.0.patch for Solr-1.1.0
 P.S.: Feedback and misspelling corrections are welcome ;-)
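
 (Not from the patch: a hypothetical request showing the three parameters 
 together; the host, port, and field name are placeholders.)

{code}
http://localhost:8983/solr/select?q=*:*&collapse.field=site&collapse.type=normal&collapse.max=1
{code}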

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Issue Comment Edited: (SOLR-236) Field collapsing

2008-02-07 Thread Oleg Gnatovskiy (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12566864#action_12566864
 ] 

oleg_gnatovskiy edited comment on SOLR-236 at 2/7/08 4:18 PM:
--

Hello everyone. I am planning to implement chain collapsing in a high-traffic 
production environment, so I'd like to use a stable version of Solr. It doesn't 
seem like there is a chain collapse patch for Solr 1.2, so I tried the Solr 1.1 
patch. It seems to work fine at collapsing, but how do I get a count for the 
documents other than the one being displayed?

As a result I see:

<lst name="collapse_counts">
  <int name="Restaurant">2414</int>
  <int name="Bar/Club">9</int>
  <int name="Directory & Services">37</int>
</lst>

Does that mean that there are 2414 more Restaurants, 9 more Bars, and 37 more 
Directory & Services? If so, then that's great.

However, when I collapse on some fields I get an empty collapse_counts list. It 
could be that those fields have a large number of distinct values to collapse 
on. Is there a limit to the number of values that collapse_counts displays?

Thanks in advance for any help you can provide!

  was (Author: oleg_gnatovskiy):
Hello everyone. I am planning to implement chain collapsing in a high-traffic 
production environment, so I'd like to use a stable version of Solr. It doesn't 
seem like there is a chain collapse patch for Solr 1.2, so I tried the Solr 1.1 
patch. It seems to work fine at collapsing, but how do I get a count for the 
documents other than the one being displayed?

As a result I see:

<lst name="collapse_counts">
  <int name="Restaurant">2414</int>
  <int name="Bar/Club">9</int>
  <int name="Directory & Services">37</int>
</lst>

Does that mean that there are 2414 more Restaurants, 9 more Bars, and 37 more 
Directory & Services? If so, then that's great.

However, when I collapse on some integer fields I get an empty list for 
collapse_counts. Do counts only work for text fields?

Thanks in advance for any help you can provide!
  
 Field collapsing
 

 Key: SOLR-236
 URL: https://issues.apache.org/jira/browse/SOLR-236
 Project: Solr
  Issue Type: New Feature
  Components: search
Affects Versions: 1.3
Reporter: Emmanuel Keller
 Attachments: field-collapsing-extended-592129.patch, 
 field_collapsing_1.1.0.patch, field_collapsing_1.3.patch, 
 field_collapsing_dsteigerwald.diff, field_collapsing_dsteigerwald.diff, 
 SOLR-236-FieldCollapsing.patch, SOLR-236-FieldCollapsing.patch, 
 SOLR-236-FieldCollapsing.patch


 This patch includes a new feature called field collapsing.
 It is used to collapse a group of results with a similar value for a given 
 field to a single entry in the result set. Site collapsing is a special case 
 of this, where all results for a given web site are collapsed into one or two 
 entries in the result set, typically with an associated "more documents from 
 this site" link. See also "Duplicate detection",
 http://www.fastsearch.com/glossary.aspx?m=48&amid=299
 The implementation adds 3 new query parameters (SolrParams):
 collapse.field to choose the field used to group results
 collapse.type normal (default value) or adjacent
 collapse.max to select how many continuous results are allowed before 
 collapsing
 TODO (in progress):
 - More documentation (on source code)
 - Test cases
 Two patches:
 - field_collapsing.patch for current development version
 - field_collapsing_1.1.0.patch for Solr-1.1.0
 P.S.: Feedback and misspelling corrections are welcome ;-)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-127) Make Solr more friendly to external HTTP caches

2008-02-07 Thread Fuad Efendi (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12566865#action_12566865
 ] 

Fuad Efendi commented on SOLR-127:
--

This is an alternative to the initially proposed HTTP caching, and it is extremely 
easy to implement:

Simply add a request parameter such as http.header=If-Modified-Since: Tue, 05 Feb 2008 
03:50:00 GMT (better to use other names; do not use an http.header parameter, 
see below...)
Let SOLR respond via a standard XML message "Not Modified", and avoid using 
the 304 response code.

What do you think? We can even encapsulate MAX-AGE, EXPIRES, and other useful 
stuff (such as an additional UPDATE-FREQUENCY: 30 days) in the XML, and all of 
that can depend on internal Lucene statistics (and not on hard-coded values in 
SOLR-CONFIG).

We should not use HTTP protocol response codes such as 304/400/500 to 
describe SOLR's external API.

Sample: Apache HTTPD front end, Tomcat (Struts-based middleware), and SOLR 
(backend). With your initial proposal, different users will get different data. 
Why? Multithreading in Apache HTTPD. At the least, there are some possible 
fluctuations; the cache is not shared in some configurations, etc. Each thread may 
get its own copy of "last-modified", and different users will see different data. 
It won't work for most business cases.

Without HTTP:
- is it modified? 
- when is the next update of the BOOKS category?
- all caches around the world have the same timestamp for the BOOKS category
... ... ...

 Make Solr more friendly to external HTTP caches
 ---

 Key: SOLR-127
 URL: https://issues.apache.org/jira/browse/SOLR-127
 Project: Solr
  Issue Type: Wish
Reporter: Hoss Man
Assignee: Hoss Man
 Fix For: 1.3

 Attachments: CacheUnitTest.patch, CacheUnitTest.patch, 
 HTTPCaching.patch, HTTPCaching.patch, HTTPCaching.patch, HTTPCaching.patch, 
 HTTPCaching.patch, HTTPCaching.patch, HTTPCaching.patch, HTTPCaching.patch, 
 HTTPCaching.patch, HTTPCaching.patch, HTTPCaching.patch, HTTPCaching.patch, 
 HTTPCaching.patch, HTTPCaching.patch, HTTPCaching.patch, HTTPCaching.patch, 
 HTTPCaching.patch, HTTPCaching.patch, HTTPCaching.patch, HTTPCaching.patch, 
 HTTPCaching.patch, HTTPCaching.patch, HTTPCaching.patch


 an offhand comment I saw recently reminded me of something that really bugged 
 me about the search solution i used *before* Solr -- it didn't play nicely 
 with HTTP caches that might be sitting in front of it.
 at the moment, Solr doesn't put particularly useful info in the HTTP 
 response headers to aid in caching (i.e. Last-Modified), responds to all HEAD 
 requests with a 400, and doesn't do anything special with If-Modified-Since.
 At the very least, we can set a Last-Modified based on when the current 
 IndexReader was opened (if not the Date on the IndexReader) and use the same 
 info to determine how to respond to If-Modified-Since requests.
 (for the record, i think the reason this hasn't occurred to me in the 2+ years 
 i've been using Solr is because, with the internal caching, i've yet to need 
 to put a proxy cache in front of Solr)
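
 (Not from the issue: a minimal sketch of deriving such a Last-Modified value 
 from the index via Lucene's static IndexReader.lastModified(); the date 
 formatting is standard RFC 1123, and the index path is a placeholder.)

{code}
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.FSDirectory;

public class LastModifiedSketch {
  public static void main(String[] args) throws Exception {
    // When the on-disk index last changed, as milliseconds since the epoch.
    long lastMod = IndexReader.lastModified(FSDirectory.getDirectory("/tmp/index"));
    SimpleDateFormat rfc1123 =
        new SimpleDateFormat("EEE, dd MMM yyyy HH:mm:ss zzz", Locale.US);
    rfc1123.setTimeZone(TimeZone.getTimeZone("GMT"));
    // Value to send back as the Last-Modified response header.
    System.out.println("Last-Modified: " + rfc1123.format(new Date(lastMod)));
  }
}
{code}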

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-127) Make Solr more friendly to external HTTP caches

2008-02-07 Thread Fuad Efendi (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12566869#action_12566869
 ] 

Fuad Efendi commented on SOLR-127:
--

Of course, ETag etc. will synchronize caches; but anyway, why do we need such 
features of the HTTP specs?

HTTP caching is widely used to cache responses from HTTP servers; content 
(HTML, PDF, JPG, EXE) can be cached at a corporate proxy, and locally in Internet 
Explorer's internal cache. That is the main idea.

*Are SOLR XML responses roving the world and reaching the internal cache of Mozilla 
Firefox, or corporate caching proxies?*

-No. 

Clients of SOLR: middleware. Do they need to act as a caching proxy? Maybe. 
Just another use case: middleware publishes current time & weather together 
with the response from SOLR; middleware wants to cache responses from SOLR and 
not rely on requests coming from end users, because of frequent weather changes 
;) - it depends on the implementation of such middleware; for sure, it will try 
to cache SolrDocument objects instead of raw XML, and that kind of caching is 
not HTTP-related.





 Make Solr more friendly to external HTTP caches
 ---

 Key: SOLR-127
 URL: https://issues.apache.org/jira/browse/SOLR-127
 Project: Solr
  Issue Type: Wish
Reporter: Hoss Man
Assignee: Hoss Man
 Fix For: 1.3

 Attachments: CacheUnitTest.patch, CacheUnitTest.patch, 
 HTTPCaching.patch, HTTPCaching.patch, HTTPCaching.patch, HTTPCaching.patch, 
 HTTPCaching.patch, HTTPCaching.patch, HTTPCaching.patch, HTTPCaching.patch, 
 HTTPCaching.patch, HTTPCaching.patch, HTTPCaching.patch, HTTPCaching.patch, 
 HTTPCaching.patch, HTTPCaching.patch, HTTPCaching.patch, HTTPCaching.patch, 
 HTTPCaching.patch, HTTPCaching.patch, HTTPCaching.patch, HTTPCaching.patch, 
 HTTPCaching.patch, HTTPCaching.patch, HTTPCaching.patch


 an offhand comment I saw recently reminded me of something that really bugged 
 me about the search solution i used *before* Solr -- it didn't play nicely 
 with HTTP caches that might be sitting in front of it.
 at the moment, Solr doesn't put particularly useful info in the HTTP 
 response headers to aid in caching (i.e. Last-Modified), responds to all HEAD 
 requests with a 400, and doesn't do anything special with If-Modified-Since.
 At the very least, we can set a Last-Modified based on when the current 
 IndexReader was opened (if not the Date on the IndexReader) and use the same 
 info to determine how to respond to If-Modified-Since requests.
 (for the record, i think the reason this hasn't occurred to me in the 2+ years 
 i've been using Solr is because, with the internal caching, i've yet to need 
 to put a proxy cache in front of Solr)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



/example/solr/bin is empty in trunk

2008-02-07 Thread Fuad Efendi

Is that correct? I want to try distribution/replication in v.2.3