[jira] Commented: (SOLR-906) Buffered / Streaming SolrServer implementation

2009-01-04 Thread Noble Paul (JIRA)

[ https://issues.apache.org/jira/browse/SOLR-906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12660572#action_12660572 ]

Noble Paul commented on SOLR-906:
-

Please ignore the 40K-docs number; I just took it from your perf-test numbers. 
I thought you were writing the docs out as a list.

I am referring to the client code. The method in UpdateRequest:
{code}
public Collection<ContentStream> getContentStreams() throws IOException {
  return ClientUtils.toContentStreams( getXML(), ClientUtils.TEXT_XML );
}
{code}

This means that the getXML() method actually constructs a huge String containing 
the entire XML. That is not very good if we are writing out a very large number 
of docs.

I am suggesting that CommonsHttpSolrServer has scope for improvement. Instead of 
building that String in memory, we can just start streaming it to the server. 
The OutputStream can be passed on to UpdateRequest so that it can write the 
XML right into the stream. That way there is effectively streaming on both ends.

This is valid where users do bulk updates, not when they write one doc at a 
time.

The new method SolrServer#add(Iterator<SolrInputDocument> docs) can start 
writing the docs immediately, and the docs can be uploaded as and when they are 
produced. It is not related to this issue exactly, but the intent of this 
issue is to make uploads faster.
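Purely as an illustration of the idea (plain Java, not SolrJ code -- the Map-based doc shape and the writeXml name here are hypothetical), serializing each doc to the Writer as soon as the iterator produces it keeps memory use flat no matter how many docs are uploaded:

{code}
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

public class StreamingXmlSketch {
    // Write each doc to the stream as soon as the iterator yields it,
    // instead of accumulating the whole XML in one String first.
    static void writeXml(Iterator<Map<String, String>> docs, Writer out) throws IOException {
        out.write("<add>");
        while (docs.hasNext()) {
            out.write("<doc>");
            for (Map.Entry<String, String> field : docs.next().entrySet()) {
                out.write("<field name=\"" + field.getKey() + "\">" + field.getValue() + "</field>");
            }
            out.write("</doc>");
            out.flush(); // push this doc onto the wire before the next one is produced
        }
        out.write("</add>");
    }

    public static void main(String[] args) throws IOException {
        StringWriter sw = new StringWriter();
        writeXml(List.of(Map.of("id", "1"), Map.of("id", "2")).iterator(), sw);
        System.out.println(sw);
    }
}
{code}

With a real HTTP connection the Writer would wrap the request's OutputStream, so producing the docs and uploading them overlap.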


SOLR-865 is not very related to this issue. StreamingHttpSolrServer can use 
javabin format as well.

bq. with the StreamingHttpSolrServer, you can send documents one at a time and 
each document starts sending as soon as it can

One drawback of a StreamingHttpSolrServer is that it ends up opening multiple 
connections for uploading the documents.

Another enhancement: we can add one (or more) extra thread in the server to 
do the UpdateRequestProcessor.processAdd() call.

 Buffered / Streaming SolrServer implementation
 -

 Key: SOLR-906
 URL: https://issues.apache.org/jira/browse/SOLR-906
 Project: Solr
  Issue Type: New Feature
  Components: clients - java
Reporter: Ryan McKinley
Assignee: Shalin Shekhar Mangar
 Fix For: 1.4

 Attachments: SOLR-906-StreamingHttpSolrServer.patch, 
 SOLR-906-StreamingHttpSolrServer.patch, 
 SOLR-906-StreamingHttpSolrServer.patch, 
 SOLR-906-StreamingHttpSolrServer.patch, StreamingHttpSolrServer.java


 While indexing lots of documents, the CommonsHttpSolrServer add( 
 SolrInputDocument ) is less than optimal.  This makes a new request for each 
 document.
 With a StreamingHttpSolrServer, documents are buffered and then written to 
 a single open Http connection.
 For related discussion see:
 http://www.nabble.com/solr-performance-tt9055437.html#a20833680

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-906) Buffered / Streaming SolrServer implementation

2009-01-04 Thread Ryan McKinley (JIRA)

[ https://issues.apache.org/jira/browse/SOLR-906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12660607#action_12660607 ]

Ryan McKinley commented on SOLR-906:


Are you looking at the patch or just brainstorming how this could be 
implemented?

{panel}
I am referring to the client code. The method in UpdateRequest:

public Collection<ContentStream> getContentStreams() throws IOException {
  return ClientUtils.toContentStreams( getXML(), ClientUtils.TEXT_XML );
}

This means that the getXML() method actually constructs a huge String containing 
the entire XML. That is not very good if we are writing out a very large number 
of docs.
{panel}

This is not how the patch works...  for starters, it never calls 
getContentStreams() for UpdateRequest.  It opens a single connection and 
continually dumps the XML for each request.  Rather than call getXML(), the 
patch adds a function writeXml( Writer ) that writes directly to the open 
buffer.

{panel}
I am suggesting that CommonsHttpSolrServer has scope for improvement. Instead of 
building that String in memory, we can just start streaming it to the server. 
The OutputStream can be passed on to UpdateRequest so that it can write the XML 
right into the stream. That way there is effectively streaming on both ends.
{panel}
The CommonsHttpSolrServer is fine, but you are right that each UpdateRequest 
*may* want to write the content directly to the open stream.  The ContentStream 
interface gives us all that control.  One thing to note is that if you do not 
specify the length, the HttpCommons client will use chunked encoding.

But I think adding the StreamingUpdateSolrServer resolves that for everyone.  
Users have either option.

{panel}
 One drawback of a StreamingHttpSolrServer is that it ends up opening multiple 
connections for uploading the documents
{panel}
Nonsense -- that is exactly what this avoids.  It opens a single connection and 
writes everything to it.  You can configure how many threads you want emptying 
the queue; each one will open a connection.
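The queue-plus-consumer-threads model described above can be sketched in plain Java (the list stands in for an open connection, and all class names here are made up -- this is not the patch's actual code):

{code}
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.LinkedBlockingQueue;

public class QueueDrainSketch {
    // N consumer threads empty one shared queue; each thread would own a
    // single open connection and write every doc it takes to that connection.
    static List<String> drain(List<String> docs, int threads) throws InterruptedException {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(docs);
        List<String> sent = new CopyOnWriteArrayList<>();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                String doc;
                while ((doc = queue.poll()) != null) {
                    sent.add(doc); // stand-in for writing to this thread's connection
                }
            });
            workers[i].start();
        }
        for (Thread w : workers) {
            w.join();
        }
        return sent;
    }

    public static void main(String[] args) throws InterruptedException {
        // Two threads drain three docs; every doc is sent exactly once.
        System.out.println(drain(List.of("a", "b", "c"), 2).size());
    }
}
{code}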

{panel}
Another enhancement: we can add one (or more) extra thread in the server to 
do the UpdateRequestProcessor.processAdd() call.
{panel}
That opens a whole can of worms...  perhaps better discussed on java-dev.  For 
now I think sticking to the one-thread-per-request model is good.  If you want 
multiple threads running on the server, use multiple connections (it is even an 
argument in the StreamingHttpSolrServer).




[jira] Commented: (SOLR-906) Buffered / Streaming SolrServer implementation

2009-01-04 Thread Ryan McKinley (JIRA)

[ https://issues.apache.org/jira/browse/SOLR-906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12660610#action_12660610 ]

Ryan McKinley commented on SOLR-906:


I would like to go ahead and commit this patch soon.  Shalin - did the changes 
in the latest patch resolve the issues you referred to?




[jira] Commented: (SOLR-844) A SolrServer impl to front-end multiple urls

2009-01-04 Thread Lance Norskog (JIRA)

[ https://issues.apache.org/jira/browse/SOLR-844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12660656#action_12660656 ]

Lance Norskog commented on SOLR-844:


Concur with the dubious reaction. However, what you mention about multiple hops 
is a valid point.

The distributed searcher could have an option that returns the shard set. A 
Solr client library could run the distributed search/merge and return that to 
its calling app.  A similar list of active all-in-one servers could also be 
handed back to this mythical client library. 

Anyway, here's a use case for load balancing: we wanted to take a server out of 
the load balancer, rewarm its caches, then put it back in the balancer.

 A SolrServer impl to front-end multiple urls
 

 Key: SOLR-844
 URL: https://issues.apache.org/jira/browse/SOLR-844
 Project: Solr
  Issue Type: New Feature
  Components: clients - java
Affects Versions: 1.3
Reporter: Noble Paul
Assignee: Shalin Shekhar Mangar
 Fix For: 1.4

 Attachments: SOLR-844.patch, SOLR-844.patch


 Currently a {{CommonsHttpSolrServer}} can talk to only one server. This 
 demands that the user have a load balancer or do the round-robin on their 
 own. We must have a {{LBHttpSolrServer}} which automatically does load 
 balancing between multiple hosts. This can be backed by the 
 {{CommonsHttpSolrServer}}.
 This can have the following other features:
 * Automatic failover
 * Optionally take in a file/url containing the urls of servers so that the 
 server list can be automatically updated by periodically loading the config
 * Support for adding/removing servers during runtime
 * Pluggable load-balancing mechanisms (round-robin, weighted round-robin, 
 random, etc.)
 * Pluggable failover mechanisms

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-844) A SolrServer impl to front-end multiple urls

2009-01-04 Thread Noble Paul (JIRA)

[ https://issues.apache.org/jira/browse/SOLR-844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12660661#action_12660661 ]

Noble Paul commented on SOLR-844:
-

bq.The distributed searcher could have an option that returns the shard set. A 
Solr client library could run the distributed search/merge and return that to 
its calling app.

Do you mean to say that the client library must handle the distributed search? 
That may not be a good idea.

bq.Anyway, here's a use case for load balancing...
Is it a point against this or for this?
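For reference, the simplest of the pluggable strategies the issue description lists -- plain round-robin -- might look like this plain-Java sketch (hypothetical names, not the SOLR-844 patch):

{code}
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class RoundRobinSketch {
    private final List<String> urls;
    private final AtomicInteger next = new AtomicInteger();

    RoundRobinSketch(List<String> urls) {
        this.urls = urls;
    }

    // Rotate through the configured servers; a weighted or random strategy
    // would plug in here instead.
    String pick() {
        return urls.get(Math.floorMod(next.getAndIncrement(), urls.size()));
    }

    public static void main(String[] args) {
        RoundRobinSketch lb = new RoundRobinSketch(List.of("http://a:8983/solr", "http://b:8983/solr"));
        System.out.println(lb.pick()); // http://a:8983/solr
        System.out.println(lb.pick()); // http://b:8983/solr
    }
}
{code}

Failover would layer on top of this: skip a picked server that is marked dead and try the next one.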




[jira] Commented: (SOLR-906) Buffered / Streaming SolrServer implementation

2009-01-04 Thread Noble Paul (JIRA)

[ https://issues.apache.org/jira/browse/SOLR-906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12660662#action_12660662 ]

Noble Paul commented on SOLR-906:
-

Hi Ryan,
You got me wrong. I was trying to say how to make CommonsHttpSolrServer 
efficient by streaming docs as StreamingHttpSolrServer does, when I add docs in 
bulk using 
{code}
SolrServer.add(List<SolrInputDocument> docs)
{code}

Yes, StreamingHttpSolrServer uses only one connection per thread, and it closes 
the connection after waiting 250ms for a new document.
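That idle-timeout behaviour maps naturally onto BlockingQueue.poll with a timeout; a plain-Java stand-in (not the patch code):

{code}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class IdleCloseSketch {
    // Wait up to 250 ms for the next doc; a null return means the queue went
    // idle, which is the point where the open connection would be closed.
    static int sendUntilIdle(BlockingQueue<String> queue) throws InterruptedException {
        int sent = 0;
        while (queue.poll(250, TimeUnit.MILLISECONDS) != null) {
            sent++; // stand-in for writing the doc to the open connection
        }
        // connection.close() would happen here, after the idle timeout expires
        return sent;
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> q = new LinkedBlockingQueue<>();
        q.add("doc1");
        q.add("doc2");
        System.out.println(sendUntilIdle(q)); // 2
    }
}
{code}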




[jira] Commented: (SOLR-914) Presence of finalize() in the codebase

2009-01-04 Thread Lance Norskog (JIRA)

[ https://issues.apache.org/jira/browse/SOLR-914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12660668#action_12660668 ]

Lance Norskog commented on SOLR-914:


A note: it is good practice to use a finalize() method to check that a 
resource has already been released, and to log an error if it has not. 
Finalize() methods are all parked on one queue, and that queue can back up; 
this can eventually cause the app to lock up. That is why it is not good to do 
I/O actions (like closing a database connection) inside the finalize method.

If the method only checks an internal marker, it will not cause a backup.
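A minimal sketch of that pattern (a hypothetical Resource class, not one from the Solr codebase) -- the finalizer only inspects an internal flag and logs, doing no I/O:

{code}
public class LeakCheckSketch {
    static class Resource {
        private volatile boolean closed;

        void close() {
            closed = true;
        }

        boolean leaked() {
            return !closed;
        }

        // Only checks an internal marker and logs; no I/O here, so this
        // cannot back up the single finalizer queue.
        @Override
        protected void finalize() {
            if (leaked()) {
                System.err.println("Resource was never closed -- leak!");
            }
        }
    }

    public static void main(String[] args) {
        Resource r = new Resource();
        System.out.println(r.leaked()); // true: not yet closed
        r.close();
        System.out.println(r.leaked()); // false
    }
}
{code}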

 Presence of finalize() in the codebase 
 ---

 Key: SOLR-914
 URL: https://issues.apache.org/jira/browse/SOLR-914
 Project: Solr
  Issue Type: Improvement
  Components: clients - java
Affects Versions: 1.3
 Environment: Tomcat 6, JRE 6
Reporter: Kay Kay
Priority: Minor
 Fix For: 1.4

   Original Estimate: 480h
  Remaining Estimate: 480h

 There seem to be a number of classes that implement the finalize() method. 
 Given that it is perfectly OK for a Java VM never to call it, maybe there 
 should be some other way { try .. finally, when they are created } to 
 destroy them; the presence of a finalize() method (depending on the 
 implementation) might not serve what we want, and in some cases can end up 
 delaying the gc process, depending on the algorithms. 
 $ find . -name '*.java' | xargs grep finalize
 ./contrib/dataimporthandler/src/main/java/org/apache/solr/handler/dataimport/JdbcDataSource.java:
   protected void finalize() {
 ./src/java/org/apache/solr/update/SolrIndexWriter.java:  protected void 
 finalize() {
 ./src/java/org/apache/solr/core/CoreContainer.java:  protected void 
 finalize() {
 ./src/java/org/apache/solr/core/SolrCore.java:  protected void finalize() {
 ./src/common/org/apache/solr/common/util/ConcurrentLRUCache.java:  protected 
 void finalize() throws Throwable {
 Maybe we need to revisit these occurrences from a design perspective to see 
 if they are necessary / if there is an alternate way of managing guaranteed 
 destruction of resources. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.