[jira] Commented: (SOLR-906) Buffered / Streaming SolrServer implementation
[ https://issues.apache.org/jira/browse/SOLR-906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12660572#action_12660572 ] Noble Paul commented on SOLR-906:

Please ignore the number 40K docs. I just took it from your perf test numbers. I thought you were writing docs as a list.

I am referring to the client code. The method in UpdateRequest:
{code}
public Collection<ContentStream> getContentStreams() throws IOException {
  return ClientUtils.toContentStreams( getXML(), ClientUtils.TEXT_XML );
}
{code}
This means that the getXML() method actually constructs a huge String which is the entire XML. That is not good if we are writing out a very large number of docs.

I am suggesting that CommonsHttpSolrServer has scope for improvement. Instead of building that String in memory, we can just start streaming it to the server. The OutputStream can be passed on to UpdateRequest so that it can write the XML right into the stream, so there is streaming effectively on both ends. This is valid where users do bulk updates, not when they write one doc at a time. The new method SolrServer#add(Iterator<SolrInputDocument> docs) can start writing the docs immediately, and the docs can be uploaded as and when they are being produced.

It is not related to this issue exactly, but the intent of this issue is to make upload faster. SOLR-865 is not very related to this issue. StreamingHttpSolrServer can use the javabin format as well.

bq. with the StreamingHttpSolrServer, you can send documents one at a time and each document starts sending as soon as it can

One drawback of a StreamingHttpSolrServer is that it ends up opening multiple connections for uploading the documents.

Another enhancement: we can add one (or more) extra thread in the server to do the call to updateRequestProcessor.processAdd().
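The difference Noble describes, between building the entire XML as one String and writing each document to the stream as it is produced, can be sketched as below. This is a minimal, hypothetical illustration: the class and method names (StreamingAddSketch, writeXml, escape) are made up for the example and are not the patch's actual API.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch: instead of concatenating every document into one big
// XML String (as getXML() does), write each document to the Writer as it is
// produced, so memory stays flat for bulk updates.
public class StreamingAddSketch {

    // Escape only the characters XML requires; real code would use a library.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Writes <add><doc>...</doc>...</add> incrementally to the Writer.
    static void writeXml(Iterator<String> docBodies, Writer out) throws IOException {
        out.write("<add>");
        while (docBodies.hasNext()) {
            out.write("<doc>");
            out.write(escape(docBodies.next()));
            out.write("</doc>");
            // In the real client the Writer would wrap the HTTP request's
            // OutputStream, so each doc goes on the wire immediately.
        }
        out.write("</add>");
    }

    public static void main(String[] args) throws IOException {
        StringWriter w = new StringWriter();
        writeXml(List.of("first doc", "second & third").iterator(), w);
        System.out.println(w); // <add><doc>first doc</doc><doc>second &amp; third</doc></add>
    }
}
```

Because the Iterator is consumed lazily, the producer can keep generating documents while earlier ones are already being uploaded.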
Buffered / Streaming SolrServer implementation

Key: SOLR-906
URL: https://issues.apache.org/jira/browse/SOLR-906
Project: Solr
Issue Type: New Feature
Components: clients - java
Reporter: Ryan McKinley
Assignee: Shalin Shekhar Mangar
Fix For: 1.4
Attachments: SOLR-906-StreamingHttpSolrServer.patch, SOLR-906-StreamingHttpSolrServer.patch, SOLR-906-StreamingHttpSolrServer.patch, SOLR-906-StreamingHttpSolrServer.patch, StreamingHttpSolrServer.java

While indexing lots of documents, the CommonsHttpSolrServer add( SolrInputDocument ) is less than optimal: it makes a new request for each document. With a StreamingHttpSolrServer, documents are buffered and then written to a single open HTTP connection. For related discussion see: http://www.nabble.com/solr-performance-tt9055437.html#a20833680

--
This message is automatically generated by JIRA. You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-906) Buffered / Streaming SolrServer implementation
[ https://issues.apache.org/jira/browse/SOLR-906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12660607#action_12660607 ] Ryan McKinley commented on SOLR-906:

Are you looking at the patch, or just brainstorming how this could be implemented?

{panel} I am referring to the client code. The method in UpdateRequest: public Collection<ContentStream> getContentStreams() throws IOException { return ClientUtils.toContentStreams( getXML(), ClientUtils.TEXT_XML ); } This means that the getXML() method actually constructs a huge String which is the entire XML. That is not good if we are writing out a very large number of docs {panel}

This is not how the patch works... for starters, it never calls getContentStreams() for UpdateRequest. It opens a single connection and continually dumps the XML for each request. Rather than call getXML(), the patch adds a function writeXml( Writer ) that writes directly to the open buffer.

{panel} I am suggesting that CommonsHttpSolrServer has scope for improvement. Instead of building that String in memory, we can just start streaming it to the server. The OutputStream can be passed on to UpdateRequest so that it can write the XML right into the stream, so there is streaming effectively on both ends {panel}

The CommonsHttpSolrServer is fine, but you are right that each UpdateRequest *may* want to write the content directly to the open stream. The ContentStream interface gives us all that control. One thing to note is that if you do not specify the length, HttpCommons will use chunked encoding. But I think adding the StreamingUpdateSolrServer resolves that for everyone; users have either option.

{panel} One drawback of a StreamingHttpSolrServer is that it ends up opening multiple connections for uploading the documents {panel}

Nonsense -- that is exactly what this avoids. It opens a single connection and writes everything to it.
You can configure how many threads you want emptying the queue; each one will open a connection.

{panel} Another enhancement: we can add one (or more) extra thread in the server to do the call to updateRequestProcessor.processAdd(). {panel}

That opens a whole can of worms... perhaps better discussed on java-dev. For now I think sticking to one thread per request is a good model. If you want multiple threads running on the server, use multiple connections (it is even an argument in the StreamingHttpSolrServer).
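The queue-draining model described above, a shared queue of documents with a configurable number of runner threads, each holding its own connection and closing it after an idle wait, can be sketched roughly as follows. This is an assumption-laden illustration, not the patch's code: QueueRunnersSketch and its method names are invented, and adding a doc to a list stands in for writing it to an open HTTP stream.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the model discussed above: documents go into a shared
// BlockingQueue and a configurable number of runner threads drain it, each
// simulating one open connection.
public class QueueRunnersSketch {

    public static List<String> drain(List<String> docs, int threadCount)
            throws InterruptedException {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(docs);
        List<String> sent = Collections.synchronizedList(new ArrayList<>());
        ExecutorService runners = Executors.newFixedThreadPool(threadCount);
        for (int i = 0; i < threadCount; i++) {
            runners.submit(() -> {
                try {
                    String doc;
                    // Poll with a short idle timeout (the comments mention
                    // 250ms) and stop -- i.e. close the connection -- when no
                    // new document arrives in time.
                    while ((doc = queue.poll(250, TimeUnit.MILLISECONDS)) != null) {
                        sent.add(doc); // stand-in for writing to the HTTP stream
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        runners.shutdown();
        runners.awaitTermination(5, TimeUnit.SECONDS);
        return sent;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(drain(List.of("a", "b", "c", "d"), 2).size()); // 4
    }
}
```

With one runner thread you get exactly one connection; more runners trade extra connections for parallel upload.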
[jira] Commented: (SOLR-906) Buffered / Streaming SolrServer implementation
[ https://issues.apache.org/jira/browse/SOLR-906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12660610#action_12660610 ] Ryan McKinley commented on SOLR-906:

I would like to go ahead and commit this patch soon. Shalin - did the changes in the latest patch resolve the issues you referred to?
[jira] Commented: (SOLR-844) A SolrServer impl to front-end multiple urls
[ https://issues.apache.org/jira/browse/SOLR-844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12660656#action_12660656 ] Lance Norskog commented on SOLR-844:

Concur with the dubious reaction. However, what you mention about multiple hops is a valid point. The distributed searcher could have an option that returns the shard set. A Solr client library could run the distributed search/merge and return that to its calling app. A similar list of active all-in-one servers could also be handed back to this mythical client library.

Anyway, here's a use case for load balancing: we wanted to take a server out of the load balancer, rewarm its caches, then put it back in the balancer.

A SolrServer impl to front-end multiple urls

Key: SOLR-844
URL: https://issues.apache.org/jira/browse/SOLR-844
Project: Solr
Issue Type: New Feature
Components: clients - java
Affects Versions: 1.3
Reporter: Noble Paul
Assignee: Shalin Shekhar Mangar
Fix For: 1.4
Attachments: SOLR-844.patch, SOLR-844.patch

Currently a {{CommonsHttpSolrServer}} can talk to only one server. This demands that the user have a load balancer or do the round-robin on their own. We must have an {{LBHttpSolrServer}} which automatically does load balancing between multiple hosts. This can be backed by the {{CommonsHttpSolrServer}} and can have the following other features:
* Automatic failover
* Optionally take in a file/url containing the urls of servers, so that the server list can be automatically updated by periodically loading the config
* Support for adding/removing servers during runtime
* Pluggable load-balancing mechanism (round-robin, weighted round-robin, random, etc.)
* Pluggable failover mechanisms

--
This message is automatically generated by JIRA. You can reply to this email to add a comment to the issue online.
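The round-robin selection that an LBHttpSolrServer would need can be sketched in a few lines. This is a hypothetical illustration (the class name RoundRobinSketch and its next() method are invented for the example); failover and the other pluggable strategies listed in the issue are omitted.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of round-robin server selection: rotate through the
// configured URLs, one per request, wrapping around at the end of the list.
public class RoundRobinSketch {
    private final List<String> urls;
    private final AtomicInteger counter = new AtomicInteger();

    public RoundRobinSketch(List<String> urls) {
        this.urls = urls;
    }

    // Returns the next server URL in rotation; floorMod keeps the index
    // valid even if the counter eventually overflows.
    public String next() {
        return urls.get(Math.floorMod(counter.getAndIncrement(), urls.size()));
    }

    public static void main(String[] args) {
        RoundRobinSketch rr = new RoundRobinSketch(List.of("http://a/solr", "http://b/solr"));
        System.out.println(rr.next()); // http://a/solr
        System.out.println(rr.next()); // http://b/solr
        System.out.println(rr.next()); // http://a/solr
    }
}
```

A weighted or random strategy would only change next(), which is what makes the selection mechanism a natural pluggable point.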
[jira] Commented: (SOLR-844) A SolrServer impl to front-end multiple urls
[ https://issues.apache.org/jira/browse/SOLR-844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12660661#action_12660661 ] Noble Paul commented on SOLR-844:

bq. The distributed searcher could have an option that returns the shard set. A Solr client library could run the distributed search/merge and return that to its calling app.

Do you mean to say that the client library must handle the distributed search? That may not be a good idea.

bq. Anyway, here's a use case for load balancing...

Is it a point against this or for this?
[jira] Commented: (SOLR-906) Buffered / Streaming SolrServer implementation
[ https://issues.apache.org/jira/browse/SOLR-906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12660662#action_12660662 ] Noble Paul commented on SOLR-906:

Hi Ryan, you got me wrong. I was trying to say how to make CommonsHttpSolrServer efficient by streaming docs as StreamingHttpSolrServer does when I add docs in bulk using
{code}
SolrServer.add(List<SolrInputDocument> docs)
{code}
Yes, StreamingHttpSolrServer uses only one connection per thread, and it closes the connection after waiting 250ms for a new document.
[jira] Commented: (SOLR-914) Presence of finalize() in the codebase
[ https://issues.apache.org/jira/browse/SOLR-914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12660668#action_12660668 ] Lance Norskog commented on SOLR-914:

A note: it is a good practice to use finalize() methods to check that a resource has already been released; the method should log an error if the resource has not been released. Finalizable objects are all parked on one queue, and that queue can back up, which can eventually cause the app to lock up. This is why it is not good to do I/O actions (like closing a database connection) inside the finalize method. If the method only checks an internal marker, that will not cause a backup.

Presence of finalize() in the codebase

Key: SOLR-914
URL: https://issues.apache.org/jira/browse/SOLR-914
Project: Solr
Issue Type: Improvement
Components: clients - java
Affects Versions: 1.3
Environment: Tomcat 6, JRE 6
Reporter: Kay Kay
Priority: Minor
Fix For: 1.4
Original Estimate: 480h
Remaining Estimate: 480h

There seem to be a number of classes that implement a finalize() method. Given that a Java VM is perfectly free never to call it, there should be some other way (e.g. try ... finally around their creation) to destroy these resources; a finalize() method, depending on the implementation, might not do what we want and in some cases can end up delaying the gc process, depending on the algorithms.
{code}
$ find . -name *.java | xargs grep finalize
./contrib/dataimporthandler/src/main/java/org/apache/solr/handler/dataimport/JdbcDataSource.java: protected void finalize() {
./src/java/org/apache/solr/update/SolrIndexWriter.java: protected void finalize() {
./src/java/org/apache/solr/core/CoreContainer.java: protected void finalize() {
./src/java/org/apache/solr/core/SolrCore.java: protected void finalize() {
./src/common/org/apache/solr/common/util/ConcurrentLRUCache.java: protected void finalize() throws Throwable {
{code}
Maybe we need to revisit these occurrences from a design perspective to see if they are necessary / if there is an alternate way of managing guaranteed destruction of resources.

--
This message is automatically generated by JIRA. You can reply to this email to add a comment to the issue online.
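The practice Lance describes, a finalizer that only checks an internal marker and logs, never doing I/O, can be sketched as below. This is a generic illustration (the class name TrackedResource is invented), not code from any of the Solr classes listed above.

```java
import java.util.logging.Logger;

// Hypothetical sketch: close() releases the real resource; finalize() only
// inspects a flag and logs a leak, so the finalizer queue never blocks on I/O.
public class TrackedResource implements AutoCloseable {
    private static final Logger LOG = Logger.getLogger(TrackedResource.class.getName());
    private volatile boolean closed = false;

    @Override
    public void close() {
        // Release the real resource here (connection, file handle, ...).
        closed = true;
    }

    public boolean isClosed() {
        return closed;
    }

    @Override
    protected void finalize() throws Throwable {
        try {
            if (!closed) {
                // Cheap check and log only -- never close the resource here.
                LOG.severe("TrackedResource was never closed; possible leak");
            }
        } finally {
            super.finalize();
        }
    }
}
```

Callers are expected to use try-with-resources (or try ... finally) so that close() always runs; the finalizer is purely a leak detector of last resort.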