Re: SolrCloud and external file fields
Mikhail,

I haven't experimented further yet. I think that the previous experiment of issuing a commit to a specific core proved that all cores get the commit, so I don't think that this approach will work.

Thanks,
/Martin

On Tue, Nov 27, 2012 at 6:24 PM, Mikhail Khludnev mkhlud...@griddynamics.com wrote:

Martin, it's still not clear to me whether you solved the problem completely or partially: does reducing the number of cores free some resources for searching during the commit? Does committing the cores one by one prevent the freeze? Thanks

On Thu, Nov 22, 2012 at 4:31 PM, Martin Koch m...@issuu.com wrote:

Mikhail, to avoid freezes we deployed the patches that are now on the 4.1 trunk (bug 3985). But this wasn't good enough, because Solr would still take very long to restart when that was necessary. I don't see how we could throw more hardware at the problem without making it worse, really - the only solution here would be *fewer* shards, not more. IMO it would be ideal if the Lucene/Solr community could come up with a good way of updating fields in a document without reindexing. This could be by linking to some external data store, or in the Lucene/Solr internals. If it would make things easier, a good first step would be to have dynamically updatable numerical fields only. /Martin

On Wed, Nov 21, 2012 at 8:51 PM, Mikhail Khludnev mkhlud...@griddynamics.com wrote:

Martin, I don't think solrconfig.xml sheds any light on it. I've just found what I didn't get in your setup - how to explicitly assign a core to a collection. Now I understand most of the details after all! The ball is on your side; let us know whether you've managed to get your cores to commit one by one to avoid the freeze, or whether you could eliminate the pauses by allocating more hardware. Thanks in advance!

On Wed, Nov 21, 2012 at 3:56 PM, Martin Koch m...@issuu.com wrote: Mikhail, PSB.

On Wed, Nov 21, 2012 at 10:08 AM, Mikhail Khludnev mkhlud...@griddynamics.com wrote:

[Martin:] I wasn't aware until now that it is possible to send a commit to one core only. What we observed was the effect of curl localhost:8080/solr/update?commit=true, but perhaps we should experiment with solr/coreN/update?commit=true. A quick trial run seems to indicate that a commit to a single core causes commits on all cores.

[Mikhail:] You should see something like this in the log: ... SolrCmdDistributor Distrib commit to: ...

[Martin:] Yup, a commit towards a single core results in a commit on all cores. Perhaps I should clarify that we are using Solr as a black box; we do not touch the code at all - we only install the distribution WAR file and proceed from there.

[Mikhail:] I still don't understand how you deploy/launch Solr. How many jettys do you start? Do you use -DzkRun -DzkHost -DnumShards=2, or do you specify the shards= param for every request and distribute updates yourself? What collections do you create, and with which settings?

[Martin:] We let Solr do the sharding, using one collection with 16 Solr cores holding one shard each. We launch only one instance of jetty with the following arguments: -DnumShards=16 -DzkHost=zookeeperhost:port -Xmx10G -Xms10G -Xmn2G -server. Would you like to see the solrconfig.xml? /Martin

[Mikhail:] Also, from my POV such deployments should start from at least *16* 4-way vboxes; it's more expensive, but should be much better available during cpu-consuming operations.

[Martin:] Do you mean that you recommend 16 hosts with 4 cores each? Or 4 hosts with 16 cores? Or am I misunderstanding something :) ?

[Mikhail:] I prefer to start from 16 hosts with 4 cores each. Other details: if you use a single jetty for all of them, are you sure that jetty's threadpool doesn't limit requests? Is it large enough? You have 60G and set -Xmx=10G; are you sure that the total size of the cores' index directories is less than 45G?

[Martin:] The total index size is 230 GB, so it won't fit in RAM, but we're using an SSD disk to minimize disk access time. We have tried putting the EFF onto a RAM disk, but this didn't have a measurable effect. Thanks, /Martin
Re: SolrCloud and external file fields
Martin,

Right: as long as the node is registered in ZooKeeper, the DistributedUpdateProcessor will broadcast commits to all peers. To work around this you can introduce a dedicated UpdateProcessorChain without the DistributedUpdateProcessor and send the commit to that chain.
Re: SolrCloud and external file fields
Keep in mind that the distributed update processor will be auto-inserted into chains! You have to include a processor that disables it - see the FAQ: http://wiki.apache.org/solr/SolrCloud#FAQ

- Mark
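Mikhail's workaround combined with Mark's caveat can be sketched in solrconfig.xml. This is an illustrative fragment, not a tested configuration: the chain name nodistrib is made up here, and the processor used is the NoOpDistributingUpdateProcessorFactory that Mikhail names in his follow-up.

```xml
<!-- An update chain that never forwards to peers:
     NoOpDistributingUpdateProcessorFactory stands in for the
     DistributedUpdateProcessor that SolrCloud would otherwise
     auto-insert, while RunUpdateProcessorFactory still applies
     the update/commit to the local core. -->
<updateRequestProcessorChain name="nodistrib">
  <processor class="solr.NoOpDistributingUpdateProcessorFactory"/>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

A commit could then be aimed at a single core with something like curl "localhost:8080/solr/coreN/update?commit=true&update.chain=nodistrib", keeping it from being broadcast to the whole collection.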
Re: SolrCloud and external file fields
Mark,

Your comment is quite valuable. Let me mention the keyword here so it can be found later: NoOpDistributingUpdateProcessorFactory. Thanks!

On Wed, Nov 28, 2012 at 5:56 PM, Mark Miller markrmil...@gmail.com wrote: Keep in mind that the distrib update proc will be auto inserted into chains! You have to include a proc that disables it - see the FAQ: http://wiki.apache.org/solr/SolrCloud#FAQ
Re: SolrCloud and external file fields
Martin,

It's still not clear to me whether you solved the problem completely or partially: does reducing the number of cores free some resources for searching during the commit? Does committing the cores one by one prevent the freeze?

Thanks
On Tue, Nov 20, 2012 at 7:22 PM, Mikhail Khludnev mkhlud...@griddynamics.com wrote:

Martin, please find additional questions from me below. Simone, I'm sorry for hijacking your thread. The only thing I've heard about it at recent ApacheCon sessions is that ZooKeeper is supposed to replicate those files as configs under the Solr home, and I'm really looking forward to knowing how it works with huge files in production. Thank you, guys!
Re: SolrCloud and external file fields
Hi Gopal,

The post you linked is interesting; it takes a different approach from mine: it implements a codec for Lucene, so it works at a lower level than my solution, which sits at the Solr UpdateHandler level, before the document reaches Lucene. The Lucene-codec approach should offer a few advantages: the field is normally exposed in the document, and as such is carried along by SolrCloud while creating new replicas (which is the part I'm not yet sure my solution handles correctly). On the other hand, it limits some flexibility; I'm already planning at least atomic addition, to support popularity ranking. My post on lucene-dev has received no feedback so far. I'll keep working on it, but I'm still far from a submittable patch, and help from the dev community would be of great value.

Simone
Re: SolrCloud and external file fields
The short answer is no; the number was chosen in an attempt to get as many cores working in parallel to complete the search faster, but I realize that there is an overhead incurred by distributing the query and merging the results. We've now gone to 8 shards and will be monitoring performance.

/Martin

On Thu, Nov 22, 2012 at 3:53 PM, Yonik Seeley yo...@lucidworks.com wrote: 7M documents isn't that large. Is there a reason why you need so many shards (16 in your case) on a single box?
Re: SolrCloud and external file fields
2012/11/22 Martin Koch m...@issuu.com: IMO it would be ideal if the Lucene/Solr community could come up with a good way of updating fields in a document without reindexing. This could be by linking to some external data store, or in the Lucene/Solr internals. If it would make things easier, a good first step would be to have dynamically updatable numerical fields only.

Hi Martin,

I'm working on implementing exactly this, and I have a working prototype right now. I'm going to write to lucene-dev about the details and ask for advice there. I'll contribute the code, so anyone interested can follow up on dev.

Simone
Re: SolrCloud and external file fields
Posted; see it here: http://lucene.472066.n3.nabble.com/Possible-sharded-and-replicated-replacement-for-ExternalFileFields-in-SolrCloud-td4022108.html

Simone
Re: SolrCloud and external file fields
Hi,

I am also very much interested in this, since we use Solr 4 with NRT, where we update the index every second but most of the time the update touches only stored fields. If Solr/Lucene could provide an external datastore without re-indexing, even for stored fields only, it would be very beneficial for frequent-update use cases: cache invalidation would not happen on stored-field updates, and indexing performance would improve due to a smaller index size. Here is a link to similar work: http://www.flax.co.uk/blog/2012/06/22/updating-individual-fields-in-lucene-with-a-redis-backed-codec/
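The external-datastore idea being discussed in this thread can be illustrated with a small sketch. All names here are hypothetical, and a plain dict stands in for an external store such as Redis: a frequently updated numeric field (e.g. popularity) lives outside the immutable index, keyed by document ID, and is merged into the score at query time instead of triggering a reindex.

```python
# Sketch of the external-field idea: a mutable popularity score lives
# outside the (immutable) Lucene index and is merged in at ranking time.
# A dict stands in for an external store such as Redis.

class ExternalFieldStore:
    """Keeps frequently-updated numeric fields out of the index."""

    def __init__(self):
        self._popularity = {}  # doc_id -> float

    def update(self, doc_id, value):
        # O(1) update; no document reindexing, no searcher reopen.
        self._popularity[doc_id] = value

    def get(self, doc_id, default=0.0):
        return self._popularity.get(doc_id, default)


def rerank(hits, store, weight=0.1):
    """Merge the external popularity into the text score at query time."""
    rescored = [(doc_id, score + weight * store.get(doc_id))
                for doc_id, score in hits]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


store = ExternalFieldStore()
store.update("doc1", 5.0)   # popular document
store.update("doc2", 1.0)

# (doc_id, text-relevance score) pairs as they might come back from a shard
hits = [("doc2", 1.0), ("doc1", 0.9)]
print(rerank(hits, store))  # doc1 overtakes doc2 thanks to its popularity
```

This is the UpdateHandler-level flavor Simone describes; the Redis-backed-codec approach in the linked post does the same merge inside Lucene instead, so the field looks like a normal indexed field to SolrCloud.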
Re: SolrCloud and external file fields
Mikhail,

To avoid freezes we deployed the patches that are now on the 4.1 trunk (bug 3985). But this wasn't good enough, because Solr would still take very long to restart when that was necessary. I don't see how we could throw more hardware at the problem without making it worse, really - the only solution here would be *fewer* shards, not more. IMO it would be ideal if the Lucene/Solr community could come up with a good way of updating fields in a document without reindexing. This could be by linking to some external data store, or in the Lucene/Solr internals. If it would make things easier, a good first step would be to have dynamically updatable numerical fields only.

/Martin
Perhaps I should clarify that we are using SOLR as a black box; we do not touch the code at all - we only install the distribution WAR file and proceed from there. I still don't understand how you deploy/launch Solr. How many jettys you start whether you have -DzkRun -DzkHost -DnumShards=2 or you specifies shards= param for every request and distributes updates yourself? What collections do you create and with which settings? We let SOLR do the sharding using one collection with 16 SOLR cores holding one shard each. We launch only one instance of jetty with the folllowing arguments: -DnumShards=16 -DzkHost=zookeeperhost:port -Xmx10G -Xms10G -Xmn2G -server Would you like to see the solrconfig.xml? /Martin Also from my POV such deployments should start at least from *16* 4-way vboxes, it's more expensive, but should be much better available during cpu-consuming operations. Do you mean that you recommend 16 hosts with 4 cores each? Or 4 hosts with 16 cores? Or am I misunderstanding something :) ? I prefer to start from 16 hosts with 4 cores each. Other details, if you use single jetty for all of them, are you sure that jetty's threadpool doesn't limit requests? is it large enough? You have 60G and set -Xmx=10G. are you sure that total size of cores index directories is less than 45G? The total index size is 230 GB, so it won't fit in ram, but we're using an SSD disk to minimize disk access time. We have tried putting the EFF onto a ram disk, but this didn't have a measurable effect. Thanks, /Martin Thanks On Wed, Nov 21, 2012 at 2:07 AM, Martin Koch m...@issuu.com wrote: Mikhail PSB On Tue, Nov 20, 2012 at 7:22 PM, Mikhail Khludnev mkhlud...@griddynamics.com wrote: Martin, Please find additional question from me below. Simone, I'm sorry for hijacking your thread. The only what I've heard about it at recent ApacheCon sessions is that Zookeeper is supposed to replicate those files as configs under solr home. 
And I'm really looking forward to knowing how it works with huge files in production. Thank You, Guys!

On 20.11.2012 at 18:06, Martin Koch m...@issuu.com wrote: Hi Mikhail Please see answers below.

On Tue, Nov 20, 2012 at 12:28 PM, Mikhail Khludnev mkhlud...@griddynamics.com wrote: Martin, Thank you for telling your own war-story. It's really useful for the community. The first question might seem naive, but would you tell me what blocks searching during EFF reload, when it's triggered by
Re: SolrCloud and external file fields
On Tue, Nov 20, 2012 at 4:16 AM, Martin Koch m...@issuu.com wrote: around 7M documents in the index; each document has a 45-character ID. 7M documents isn't that large. Is there a reason why you need so many shards (16 in your case) on a single box? -Yonik http://lucidworks.com
Re: SolrCloud and external file fields
On Wed, Nov 21, 2012 at 7:08 AM, Mikhail Khludnev mkhlud...@griddynamics.com wrote: On Wed, Nov 21, 2012 at 2:07 AM, Martin Koch m...@issuu.com wrote: I'm not sure about the mmap directory or where that would be configured in solr - can you explain that? You can check it at Solr Admin/Statistics/core/searcher/stats/readerDir; it should be org.apache.lucene.store.MMapDirectory It says 'org.apache.lucene.store.NRTCachingDirectory:NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@' /Martin -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
Re: SolrCloud and external file fields
On Wed, Nov 21, 2012 at 11:53 AM, Martin Koch m...@issuu.com wrote: I wasn't aware until now that it is possible to send a commit to one core only. What we observed was the effect of curl localhost:8080/solr/update?commit=true but perhaps we should experiment with solr/coreN/update?commit=true. A quick trial run seems to indicate that a commit to a single core causes commits on all cores. You should see something like this in the log: ... SolrCmdDistributor Distrib commit to: ... Perhaps I should clarify that we are using SOLR as a black box; we do not touch the code at all - we only install the distribution WAR file and proceed from there. I still don't understand how you deploy/launch Solr. How many jettys do you start? Do you have -DzkRun -DzkHost -DnumShards=2, or do you specify the shards= param for every request and distribute updates yourself? What collections do you create and with which settings? Also from my POV such deployments should start at least from *16* 4-way vboxes; it's more expensive, but should have much better availability during cpu-consuming operations. Do you mean that you recommend 16 hosts with 4 cores each? Or 4 hosts with 16 cores? Or am I misunderstanding something :) ? I prefer to start from 16 hosts with 4 cores each. Other details: if you use a single jetty for all of them, are you sure that jetty's threadpool doesn't limit requests? Is it large enough? You have 60G and set -Xmx=10G. Are you sure that the total size of the cores' index directories is less than 45G? The total index size is 230 GB, so it won't fit in RAM, but we're using an SSD disk to minimize disk access time. We have tried putting the EFF onto a ram disk, but this didn't have a measurable effect. Thanks, /Martin Thanks

On Wed, Nov 21, 2012 at 2:07 AM, Martin Koch m...@issuu.com wrote: Mikhail PSB

On Tue, Nov 20, 2012 at 7:22 PM, Mikhail Khludnev mkhlud...@griddynamics.com wrote: Martin, Please find additional questions from me below. Simone, I'm sorry for hijacking your thread.
The only thing I've heard about it at recent ApacheCon sessions is that Zookeeper is supposed to replicate those files as configs under solr home. And I'm really looking forward to knowing how it works with huge files in production. Thank You, Guys!

On 20.11.2012 at 18:06, Martin Koch m...@issuu.com wrote: Hi Mikhail Please see answers below.

On Tue, Nov 20, 2012 at 12:28 PM, Mikhail Khludnev mkhlud...@griddynamics.com wrote: Martin, Thank you for telling your own war-story. It's really useful for the community. The first question might seem naive, but would you tell me what blocks searching during EFF reload, when it's triggered by handler or by listener? We continuously index new documents using CommitWithin to get regular commits. However, we observed that the EFFs were not re-read, so we had to do external commits (curl '.../solr/update?commit=true') to force reload. When this is done, solr blocks. I can't tell you exactly why it's doing that (it was related to SOLR-3985). Is there a chance to get a thread dump when they are blocked? Well I could try to recreate the situation. But the setup is fairly simple: Create a large EFF in a largeish index with many shards. Issue a commit, and then try to do a search. Solr will not respond to the search before the commit has completed, and this will take a long time. I don't really get the sentence about sequential commits and number of cores. Do I get it right that the file is replicated via Zookeeper? Doesn't it Again, this is observed behavior. When we issue a commit on a system with many solr cores using EFFs, the system blocks for a long time (15 minutes). We do NOT use zookeeper for anything. The EFF is a symlink from each core's index dir to the actual file, which is updated by an external process. Hold on, I asked about Zookeeper because the subject mentions SolrCloud. Do you use SolrCloud, SolrShards, or are these cores just replicas of the same index?
Ah - we use solr 4 out of the box, so I guess this is SolrCloud. I'm a bit unsure about the terminology here, but we've got a single index divided into 16 shards. Each shard is hosted in a solr core. Also, about the symlink - Don't you share that file via some NFS? No, we generate the EFF on the local solr host (there is only one physical host that holds all shards), so there is no need for NFS or copying files around. No need for Zookeeper either. How many cores do you run per box? This box is a 16-virtual core (8 hyperthreaded
Re: SolrCloud and external file fields
Mikhail, PSB

On Wed, Nov 21, 2012 at 10:08 AM, Mikhail Khludnev mkhlud...@griddynamics.com wrote:

On Wed, Nov 21, 2012 at 11:53 AM, Martin Koch m...@issuu.com wrote: I wasn't aware until now that it is possible to send a commit to one core only. What we observed was the effect of curl localhost:8080/solr/update?commit=true but perhaps we should experiment with solr/coreN/update?commit=true. A quick trial run seems to indicate that a commit to a single core causes commits on all cores. You should see something like this in the log: ... SolrCmdDistributor Distrib commit to: ... Yup, a commit towards a single core results in a commit on all cores. Perhaps I should clarify that we are using SOLR as a black box; we do not touch the code at all - we only install the distribution WAR file and proceed from there. I still don't understand how you deploy/launch Solr. How many jettys do you start? Do you have -DzkRun -DzkHost -DnumShards=2, or do you specify the shards= param for every request and distribute updates yourself? What collections do you create and with which settings? We let SOLR do the sharding using one collection with 16 SOLR cores holding one shard each. We launch only one instance of jetty with the following arguments: -DnumShards=16 -DzkHost=zookeeperhost:port -Xmx10G -Xms10G -Xmn2G -server Would you like to see the solrconfig.xml? /Martin Also from my POV such deployments should start at least from *16* 4-way vboxes; it's more expensive, but should have much better availability during cpu-consuming operations. Do you mean that you recommend 16 hosts with 4 cores each? Or 4 hosts with 16 cores? Or am I misunderstanding something :) ? I prefer to start from 16 hosts with 4 cores each. Other details: if you use a single jetty for all of them, are you sure that jetty's threadpool doesn't limit requests? Is it large enough? You have 60G and set -Xmx=10G. Are you sure that the total size of the cores' index directories is less than 45G?
The total index size is 230 GB, so it won't fit in RAM, but we're using an SSD disk to minimize disk access time. We have tried putting the EFF onto a ram disk, but this didn't have a measurable effect. Thanks, /Martin Thanks

On Wed, Nov 21, 2012 at 2:07 AM, Martin Koch m...@issuu.com wrote: Mikhail PSB

On Tue, Nov 20, 2012 at 7:22 PM, Mikhail Khludnev mkhlud...@griddynamics.com wrote: Martin, Please find additional questions from me below. Simone, I'm sorry for hijacking your thread. The only thing I've heard about it at recent ApacheCon sessions is that Zookeeper is supposed to replicate those files as configs under solr home. And I'm really looking forward to knowing how it works with huge files in production. Thank You, Guys!

On 20.11.2012 at 18:06, Martin Koch m...@issuu.com wrote: Hi Mikhail Please see answers below.

On Tue, Nov 20, 2012 at 12:28 PM, Mikhail Khludnev mkhlud...@griddynamics.com wrote: Martin, Thank you for telling your own war-story. It's really useful for the community. The first question might seem naive, but would you tell me what blocks searching during EFF reload, when it's triggered by handler or by listener? We continuously index new documents using CommitWithin to get regular commits. However, we observed that the EFFs were not re-read, so we had to do external commits (curl '.../solr/update?commit=true') to force reload. When this is done, solr blocks. I can't tell you exactly why it's doing that (it was related to SOLR-3985). Is there a chance to get a thread dump when they are blocked? Well I could try to recreate the situation. But the setup is fairly simple: Create a large EFF in a largeish index with many shards. Issue a commit, and then try to do a search. Solr will not respond to the search before the commit has completed, and this will take a long time. I don't really get the sentence about sequential commits and number of cores. Do I get it right that the file is replicated via Zookeeper?
Doesn't it Again, this is observed behavior. When we issue a commit on a system with many solr cores using EFFs, the system blocks for a long time (15 minutes). We do NOT use zookeeper for anything. The EFF is a symlink from each core's index dir to the actual file, which is updated by an external process. Hold on, I asked about Zookeeper because the subject mentions SolrCloud. Do you use SolrCloud, SolrShards, or are these cores just
Re: SolrCloud and external file fields
Hi Martin, thanks for sharing your experience with EFF and saving me a lot of time figuring it out myself; I was afraid of exactly this kind of problems. Mikhail, thanks for expanding the thread with even more useful information! Simone

2012/11/20 Martin Koch m...@issuu.com Solr 4.0 does support using EFFs, but it might not give you what you're hoping for. We tried using Solr Cloud, and have given up again. The EFF is placed in the parent of the index directory in each core; each core reads the entire EFF and picks out the IDs that it is responsible for. In the current 4.0.0 release of solr, solr blocks (doesn't answer queries) while re-reading the EFF. Even worse, it seems that the time to re-read the EFF is multiplied by the number of cores in use (i.e. the EFF is re-read by each core sequentially). The contents of the EFF become active after the first EXTERNAL commit (commitWithin does NOT work here) after the file has been updated. In our case, the EFF was quite large - around 450MB - and we use 16 shards, so when we triggered an external commit to force re-reading, the whole system would block for several (10-15) minutes. This won't work in a production environment. The reason for the size of the EFF is that we have around 7M documents in the index; each document has a 45-character ID. We got some help to try to fix the problem so that the re-read of the EFF proceeds in the background (see https://issues.apache.org/jira/browse/SOLR-3985 for a fix on the 4.1 branch). However, even though the re-read proceeds in the background, the time required to launch solr now takes at least as long as re-reading the EFFs. Again, this is not good enough for our needs. The next issue is that you cannot sort on EFF fields (though you can return them as values using fl=field(my_eff_field)). This is also fixed in the 4.1 branch: https://issues.apache.org/jira/browse/SOLR-4022. So: Even after these fixes, EFF performance is not that great.
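For readers unfamiliar with the mechanics discussed above: an external file field is a plain-text file of doc_id=value lines, conventionally named external_<fieldname> and placed in the core's data directory (the parent of the index directory, as Martin describes). A minimal sketch of generating such a file, sorted by document ID as recommended later in this thread (the file name, IDs, and values here are purely illustrative):

```python
# Sketch: write a Solr external file field (EFF) file of popularity scores.
# Format is one doc_id=value line per document; keeping the keys sorted
# (byte order, like UNIX sort) is reported in this thread to speed loading.

def write_eff(path, scores):
    """Write doc_id=value lines to `path`, sorted by doc_id."""
    with open(path, "w") as f:
        for doc_id in sorted(scores):
            f.write(f"{doc_id}={scores[doc_id]}\n")

# Illustrative data: a few 45-character-style document IDs would go here.
scores = {"doc-042": 3.5, "doc-007": 1.0, "doc-100": 2.25}
write_eff("external_popularity", scores)
```

An external process can regenerate this file at any time, but as the thread explains, stock Solr 4.0 only picks up the new contents after an explicit commit, and re-reads the whole file per core.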
Our solution is as follows: The actual value of the popularity measure (say, reads) that we want to report to the user is inserted into the search response post-query by our query front-end. This value will then be the authoritative value at the time of the query. The value of the popularity measure that we use for boosting in the ranking of the search results is only updated when the value has changed enough so that the impact on the boost will be significant (say, more than 2%). This does require frequent re-indexing of the documents that have significant changes in the number of reads, but at least we won't have to update a document if it moves from, say, 100 to 101 reads. /Martin Koch - ISSUU - senior systems architect.

On Mon, Nov 19, 2012 at 3:22 PM, Simone Gianni simo...@apache.org wrote: Hi all, I'm planning to move a quite big Solr index to SolrCloud. However, in this index, an external file field is used for popularity ranking. Does SolrCloud support external file fields? How does it cope with sharding and replication? Where should the external file be placed now that the index folder is not local but in the cloud? Otherwise, are there other best practices to deal with the use cases external file fields were used for, like popularity/ranking, in SolrCloud? Custom ValueSources going to something external? Thanks in advance, Simone
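Martin's two-part workaround - patch the authoritative popularity into the response post-query, and only reindex a document when the value has drifted enough to matter for boosting - can be sketched roughly like this (function and field names are illustrative, not ISSUU's actual code):

```python
# Sketch of the workaround described above, under assumed names.

def augment_results(solr_docs, live_reads):
    """Overwrite the (possibly stale) indexed read count in each Solr
    response doc with the authoritative live value, post-query."""
    for doc in solr_docs:
        doc["reads"] = live_reads.get(doc["id"], doc.get("reads", 0))
    return solr_docs

def needs_reindex(indexed_reads, current_reads, threshold=0.02):
    """True when the relative change (~2% in the thread) is large enough
    that the ranking boost would noticeably shift."""
    if indexed_reads == 0:
        return current_reads > 0
    return abs(current_reads - indexed_reads) / indexed_reads > threshold

docs = [{"id": "a", "reads": 100}, {"id": "b", "reads": 50}]
docs = augment_results(docs, {"a": 101})  # "a" moved 100 -> 101 reads
```

With this split, a document moving from 100 to 101 reads is reported accurately to users without triggering a reindex; only changes above the threshold cost an indexing round-trip.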
Re: SolrCloud and external file fields
Martin, I don't think solrconfig.xml sheds any light on it. I've just found what I didn't get in your setup - how to explicitly assign a core to a collection. Now I've realized most of the details after all! The ball is in your court; let us know whether you have managed to get your cores to commit one by one to avoid the freeze, or whether you could eliminate the pauses by allocating more hardware. Thanks in advance!

On Wed, Nov 21, 2012 at 3:56 PM, Martin Koch m...@issuu.com wrote: Mikhail, PSB

On Wed, Nov 21, 2012 at 10:08 AM, Mikhail Khludnev mkhlud...@griddynamics.com wrote:

On Wed, Nov 21, 2012 at 11:53 AM, Martin Koch m...@issuu.com wrote: I wasn't aware until now that it is possible to send a commit to one core only. What we observed was the effect of curl localhost:8080/solr/update?commit=true but perhaps we should experiment with solr/coreN/update?commit=true. A quick trial run seems to indicate that a commit to a single core causes commits on all cores. You should see something like this in the log: ... SolrCmdDistributor Distrib commit to: ... Yup, a commit towards a single core results in a commit on all cores. Perhaps I should clarify that we are using SOLR as a black box; we do not touch the code at all - we only install the distribution WAR file and proceed from there. I still don't understand how you deploy/launch Solr. How many jettys do you start? Do you have -DzkRun -DzkHost -DnumShards=2, or do you specify the shards= param for every request and distribute updates yourself? What collections do you create and with which settings? We let SOLR do the sharding using one collection with 16 SOLR cores holding one shard each. We launch only one instance of jetty with the following arguments: -DnumShards=16 -DzkHost=zookeeperhost:port -Xmx10G -Xms10G -Xmn2G -server Would you like to see the solrconfig.xml? /Martin Also from my POV such deployments should start at least from *16* 4-way vboxes; it's more expensive, but should have much better availability during cpu-consuming operations.
Do you mean that you recommend 16 hosts with 4 cores each? Or 4 hosts with 16 cores? Or am I misunderstanding something :) ? I prefer to start from 16 hosts with 4 cores each. Other details: if you use a single jetty for all of them, are you sure that jetty's threadpool doesn't limit requests? Is it large enough? You have 60G and set -Xmx=10G. Are you sure that the total size of the cores' index directories is less than 45G? The total index size is 230 GB, so it won't fit in RAM, but we're using an SSD disk to minimize disk access time. We have tried putting the EFF onto a ram disk, but this didn't have a measurable effect. Thanks, /Martin Thanks

On Wed, Nov 21, 2012 at 2:07 AM, Martin Koch m...@issuu.com wrote: Mikhail PSB

On Tue, Nov 20, 2012 at 7:22 PM, Mikhail Khludnev mkhlud...@griddynamics.com wrote: Martin, Please find additional questions from me below. Simone, I'm sorry for hijacking your thread. The only thing I've heard about it at recent ApacheCon sessions is that Zookeeper is supposed to replicate those files as configs under solr home. And I'm really looking forward to knowing how it works with huge files in production. Thank You, Guys!

On 20.11.2012 at 18:06, Martin Koch m...@issuu.com wrote: Hi Mikhail Please see answers below.

On Tue, Nov 20, 2012 at 12:28 PM, Mikhail Khludnev mkhlud...@griddynamics.com wrote: Martin, Thank you for telling your own war-story. It's really useful for the community. The first question might seem naive, but would you tell me what blocks searching during EFF reload, when it's triggered by handler or by listener? We continuously index new documents using CommitWithin to get regular commits. However, we observed that the EFFs were not re-read, so we had to do external commits (curl '.../solr/update?commit=true') to force reload. When this is done, solr blocks. I can't tell you exactly why it's doing that (it was related to SOLR-3985). Is there a chance to get a thread dump when they are blocked?
Well I could try to recreate the situation. But the setup is fairly simple: Create a large EFF in a largeish index with many shards. Issue a commit, and then try to do a search. Solr will not respond to the search before the commit has completed, and this will take a long time. I don't really get the sentence about sequential commits and
Re: SolrCloud and external file fields
Solr 4.0 does support using EFFs, but it might not give you what you're hoping for. We tried using Solr Cloud, and have given up again. The EFF is placed in the parent of the index directory in each core; each core reads the entire EFF and picks out the IDs that it is responsible for. In the current 4.0.0 release of solr, solr blocks (doesn't answer queries) while re-reading the EFF. Even worse, it seems that the time to re-read the EFF is multiplied by the number of cores in use (i.e. the EFF is re-read by each core sequentially). The contents of the EFF become active after the first EXTERNAL commit (commitWithin does NOT work here) after the file has been updated. In our case, the EFF was quite large - around 450MB - and we use 16 shards, so when we triggered an external commit to force re-reading, the whole system would block for several (10-15) minutes. This won't work in a production environment. The reason for the size of the EFF is that we have around 7M documents in the index; each document has a 45-character ID. We got some help to try to fix the problem so that the re-read of the EFF proceeds in the background (see https://issues.apache.org/jira/browse/SOLR-3985 for a fix on the 4.1 branch). However, even though the re-read proceeds in the background, the time required to launch solr now takes at least as long as re-reading the EFFs. Again, this is not good enough for our needs. The next issue is that you cannot sort on EFF fields (though you can return them as values using fl=field(my_eff_field)). This is also fixed in the 4.1 branch: https://issues.apache.org/jira/browse/SOLR-4022. So: Even after these fixes, EFF performance is not that great. Our solution is as follows: The actual value of the popularity measure (say, reads) that we want to report to the user is inserted into the search response post-query by our query front-end. This value will then be the authoritative value at the time of the query.
The value of the popularity measure that we use for boosting in the ranking of the search results is only updated when the value has changed enough so that the impact on the boost will be significant (say, more than 2%). This does require frequent re-indexing of the documents that have significant changes in the number of reads, but at least we won't have to update a document if it moves from, say, 100 to 101 reads. /Martin Koch - ISSUU - senior systems architect.

On Mon, Nov 19, 2012 at 3:22 PM, Simone Gianni simo...@apache.org wrote: Hi all, I'm planning to move a quite big Solr index to SolrCloud. However, in this index, an external file field is used for popularity ranking. Does SolrCloud support external file fields? How does it cope with sharding and replication? Where should the external file be placed now that the index folder is not local but in the cloud? Otherwise, are there other best practices to deal with the use cases external file fields were used for, like popularity/ranking, in SolrCloud? Custom ValueSources going to something external? Thanks in advance, Simone
Re: SolrCloud and external file fields
Martin, Thank you for telling your own war-story. It's really useful for the community. The first question might seem naive, but would you tell me what blocks searching during EFF reload, when it's triggered by handler or by listener? I don't really get the sentence about sequential commits and number of cores. Do I get it right that the file is replicated via Zookeeper? Doesn't it cause a scalability problem or a long time to reload? Will it help if we have, let's say, an ExternalDatabaseField which pulls values from JDBC? I.e. why can't all cores read these values simultaneously? Can you confirm that the IDs in the file are ordered by the index term order? AFAIK it can impact load time. Regarding your post-query solution, can you tell me: if a query found 10K docs, but I need to display only the first page with 100 rows, do I need to pull all 10K results to the frontend to order them by the rank? I'd really appreciate it if you could comment on the questions above. PS: It's time to pitch: how much can https://issues.apache.org/jira/browse/SOLR-4085 Commit-free ExternalFileField help you?

On Tue, Nov 20, 2012 at 1:16 PM, Martin Koch m...@issuu.com wrote: Solr 4.0 does support using EFFs, but it might not give you what you're hoping for. We tried using Solr Cloud, and have given up again. The EFF is placed in the parent of the index directory in each core; each core reads the entire EFF and picks out the IDs that it is responsible for. In the current 4.0.0 release of solr, solr blocks (doesn't answer queries) while re-reading the EFF. Even worse, it seems that the time to re-read the EFF is multiplied by the number of cores in use (i.e. the EFF is re-read by each core sequentially). The contents of the EFF become active after the first EXTERNAL commit (commitWithin does NOT work here) after the file has been updated.
In our case, the EFF was quite large - around 450MB - and we use 16 shards, so when we triggered an external commit to force re-reading, the whole system would block for several (10-15) minutes. This won't work in a production environment. The reason for the size of the EFF is that we have around 7M documents in the index; each document has a 45-character ID. We got some help to try to fix the problem so that the re-read of the EFF proceeds in the background (see https://issues.apache.org/jira/browse/SOLR-3985 for a fix on the 4.1 branch). However, even though the re-read proceeds in the background, the time required to launch solr now takes at least as long as re-reading the EFFs. Again, this is not good enough for our needs. The next issue is that you cannot sort on EFF fields (though you can return them as values using fl=field(my_eff_field)). This is also fixed in the 4.1 branch: https://issues.apache.org/jira/browse/SOLR-4022. So: Even after these fixes, EFF performance is not that great. Our solution is as follows: The actual value of the popularity measure (say, reads) that we want to report to the user is inserted into the search response post-query by our query front-end. This value will then be the authoritative value at the time of the query. The value of the popularity measure that we use for boosting in the ranking of the search results is only updated when the value has changed enough so that the impact on the boost will be significant (say, more than 2%). This does require frequent re-indexing of the documents that have significant changes in the number of reads, but at least we won't have to update a document if it moves from, say, 100 to 101 reads. /Martin Koch - ISSUU - senior systems architect.

On Mon, Nov 19, 2012 at 3:22 PM, Simone Gianni simo...@apache.org wrote: Hi all, I'm planning to move a quite big Solr index to SolrCloud. However, in this index, an external file field is used for popularity ranking.
Does SolrCloud support external file fields? How does it cope with sharding and replication? Where should the external file be placed now that the index folder is not local but in the cloud? Otherwise, are there other best practices to deal with the use cases external file fields were used for, like popularity/ranking, in SolrCloud? Custom ValueSources going to something external? Thanks in advance, Simone -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
Re: SolrCloud and external file fields
Hi Mikhail Please see answers below.

On Tue, Nov 20, 2012 at 12:28 PM, Mikhail Khludnev mkhlud...@griddynamics.com wrote: Martin, Thank you for telling your own war-story. It's really useful for the community. The first question might seem naive, but would you tell me what blocks searching during EFF reload, when it's triggered by handler or by listener? We continuously index new documents using CommitWithin to get regular commits. However, we observed that the EFFs were not re-read, so we had to do external commits (curl '.../solr/update?commit=true') to force reload. When this is done, solr blocks. I can't tell you exactly why it's doing that (it was related to SOLR-3985). I don't really get the sentence about sequential commits and number of cores. Do I get it right that the file is replicated via Zookeeper? Doesn't it Again, this is observed behavior. When we issue a commit on a system with many solr cores using EFFs, the system blocks for a long time (15 minutes). We do NOT use zookeeper for anything. The EFF is a symlink from each core's index dir to the actual file, which is updated by an external process. cause a scalability problem or take a long time to reload? Will it help if we have, let's say, an ExternalDatabaseField which pulls values from JDBC? I.e. I think the possibility of having some fields being retrieved from an external, dynamically updatable store would be really interesting. This could be JDBC, something in-memory like redis, or a NoSql product (e.g. Cassandra). Why can't all cores read these values simultaneously? Again, this is a solr implementation detail that I can't answer :) Can you confirm that the IDs in the file are ordered by the index term order? Yes, we sorted the files (standard UNIX sort). AFAIK it can impact load time. Yes, it does.
Regarding your post-query solution, can you tell me: if a query found 10K docs, but I need to display only the first page with 100 rows, do I need to pull all 10K results to the frontend to order them by the rank? In our architecture, the clients query an API that generates the SOLR query, retrieves the relevant additional fields that we need, and returns the relevant JSON to the front-end. In our use case, results are returned from SOLR by the 10's, not by the 1000's, so it is a manageable job. Even so, if solr returned thousands of results, it would be up to the implementation of the api to augment only the results that needed to be returned to the front-end. Even so, patching up a JSON structure with 10K results should be possible. I'd really appreciate it if you could comment on the questions above. PS: It's time to pitch: how much can https://issues.apache.org/jira/browse/SOLR-4085 Commit-free ExternalFileField help you? It looks very interesting :) Does it make it possible to avoid re-reading the EFF on every commit, and only re-read the values that have actually changed? /Martin

On Tue, Nov 20, 2012 at 1:16 PM, Martin Koch m...@issuu.com wrote: Solr 4.0 does support using EFFs, but it might not give you what you're hoping for. We tried using Solr Cloud, and have given up again. The EFF is placed in the parent of the index directory in each core; each core reads the entire EFF and picks out the IDs that it is responsible for. In the current 4.0.0 release of solr, solr blocks (doesn't answer queries) while re-reading the EFF. Even worse, it seems that the time to re-read the EFF is multiplied by the number of cores in use (i.e. the EFF is re-read by each core sequentially). The contents of the EFF become active after the first EXTERNAL commit (commitWithin does NOT work here) after the file has been updated.
In our case, the EFF was quite large - around 450MB - and we use 16 shards, so when we triggered an external commit to force re-reading, the whole system would block for several (10-15) minutes. This won't work in a production environment. The reason for the size of the EFF is that we have around 7M documents in the index; each document has a 45-character ID. We got some help to try to fix the problem so that the re-read of the EFF proceeds in the background (see https://issues.apache.org/jira/browse/SOLR-3985 for a fix on the 4.1 branch). However, even though the re-read proceeds in the background, the time required to launch solr now takes at least as long as re-reading the EFFs. Again, this is not good enough for our needs. The next issue is that you cannot sort on EFF fields (though you can return them as values using fl=field(my_eff_field)). This is also fixed in the 4.1 branch: https://issues.apache.org/jira/browse/SOLR-4022. So: Even after these fixes, EFF performance is not that great. Our solution is as follows: The actual value of the popularity measure (say, reads) that we want to report to the user is inserted into the search response post-query by our query
Re: SolrCloud and external file fields
Martin, Please find additional questions from me below. Simone, I'm sorry for hijacking your thread. The only thing I've heard about it at recent ApacheCon sessions is that Zookeeper is supposed to replicate those files as configs under solr home. And I'm really looking forward to knowing how it works with huge files in production. Thank You, Guys!

On 20.11.2012 at 18:06, Martin Koch m...@issuu.com wrote: Hi Mikhail Please see answers below.

On Tue, Nov 20, 2012 at 12:28 PM, Mikhail Khludnev mkhlud...@griddynamics.com wrote: Martin, Thank you for telling your own war-story. It's really useful for the community. The first question might seem naive, but would you tell me what blocks searching during EFF reload, when it's triggered by handler or by listener? We continuously index new documents using CommitWithin to get regular commits. However, we observed that the EFFs were not re-read, so we had to do external commits (curl '.../solr/update?commit=true') to force reload. When this is done, solr blocks. I can't tell you exactly why it's doing that (it was related to SOLR-3985). Is there a chance to get a thread dump when they are blocked? I don't really get the sentence about sequential commits and number of cores. Do I get it right that the file is replicated via Zookeeper? Doesn't it Again, this is observed behavior. When we issue a commit on a system with many solr cores using EFFs, the system blocks for a long time (15 minutes). We do NOT use zookeeper for anything. The EFF is a symlink from each core's index dir to the actual file, which is updated by an external process. Hold on, I asked about Zookeeper because the subject mentions SolrCloud. Do you use SolrCloud, SolrShards, or are these cores just replicas of the same index? Also, about the symlink - Don't you share that file via some NFS? How many cores do you run per box? Do the boxes have plenty of RAM to cache the filesystem beside the jvm heaps? I assume you use 64-bit linux and mmap directory.
Please confirm that. causes scalability problem or long time to reload? Will it help if we'll have, let's say ExternalDatabaseField which will pull values from jdbc. ie. I think the possibility of having some fields being retrieved from an external, dynamically updatable store would be really interesting. This could be JDBC, something in-memory like redis, or a NoSql product (e.g. Cassandra). Ok. Let's have it in mind as a possible direction. why all cores can't read these values simultaneously? Again, this is a solr implementation detail that I can't answer :) Can you confirm that IDs in the file is ordered by the index term order? Yes, we sorted the files (standard UNIX sort). AFAIK it can impact load time. Yes, it does Ok, I've got that you aware of it, and your IDs are just strings, not integers. Regarding your post-query solution can you tell me if query found 1 docs, but I need to display only first page with 100 rows, whether I need to pull all 10K results to frontend to order them by the rank? In our architecture, the clients query an API that generates the SOLR query, retrieves the relevant additional fields that we needs, and returns the relevant JSON to the front-end. In our use case, results are returned from SOLR by the 10's, not by the 1000's, so it is a manageable job. Even so, if solr returned thousands of results, it would be up to the implementation of the api to augment only the results that needed to be returned to the front-end. Even so, patching up a JSON structure with 1 results should be possible. You are right. I'm concerned anyway because retrieving whole result is expensive, and not always possible. I'm really appreciate if you comment on the questions above. PS: It's time to pitch, how much https://issues.apache.org/jira/browse/SOLR-4085 Commit-free ExternalFileField can help you? It looks very interesting :) Does it make it possible to avoid re-reading the EFF on every commit, and only re-read the values that have actually changed? 
You don't need a commit (in SOLR-4085) to reload the file content, but after a commit you need to read the whole file and scan all key terms and postings. That's because the EFF sits on top of the top-level searcher; it's a Solr-like way. In some future we might have a per-segment EFF; in that case adding a segment will still trigger a full file scan, but in the index only that new segment will be scanned. It should be faster. You know, straightforwardly sharing internal data structures between different index views/generations is not possible. If you are asking about applying delta changes to the external file, that's something we did ourselves http://goo.gl/P8GFq . This feature is much more doubtful and vague, although it might be the next contribution after SOLR-4085.
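The file sorting discussed in this exchange (doc IDs in index term order, produced with standard UNIX sort, to speed up EFF loading) can be sketched as a small shell session. The file name `external_rank` and the doc IDs are made up for illustration, and using byte-wise `LC_ALL=C` sort is an assumption about what approximates Lucene's term order for plain string IDs:

```shell
#!/bin/sh
# Sketch of preparing an external file field (EFF) data file.
# File name and doc IDs are hypothetical; the real file must be named
# external_<fieldname> and be reachable from the core's data dir
# (in Martin's setup, via a symlink per core).
set -e

workdir=$(mktemp -d)

# EFF format: one "docid=value" pair per line.
cat > "$workdir/external_rank.unsorted" <<'EOF'
doc-42=1.5
doc-7=0.9
doc-100=3.2
EOF

# Byte-wise sort (LC_ALL=C) approximates index term order for string IDs,
# which is what makes the load faster, as noted in the thread.
LC_ALL=C sort "$workdir/external_rank.unsorted" > "$workdir/external_rank"

cat "$workdir/external_rank"
rm -r "$workdir"
```

Because the IDs here are strings, '1' < '4' < '7' byte-wise, so `doc-100` sorts before `doc-42` and `doc-7`.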
Re: SolrCloud and external file fields
Mikhail, PSB (please see below).

On Tue, Nov 20, 2012 at 7:22 PM, Mikhail Khludnev mkhlud...@griddynamics.com wrote:

> Is there a chance to get a thread dump while they are blocked?

Well, I could try to recreate the situation. But the setup is fairly simple: create a large EFF in a largish index with many shards. Issue a commit, and then try to do a search. Solr will not respond to the search before the commit has completed, and this will take a long time.

> Hold on, I asked about Zookeeper because the subject mentions SolrCloud. Do you use SolrCloud, SolrShards, or are these cores just replicas of the same index?

Ah - we use Solr 4 out of the box, so I guess this is SolrCloud. I'm a bit unsure about the terminology here, but we've got a single index divided into 16 shards. Each shard is hosted in a Solr core.

> Also, about the symlink - don't you share that file via some NFS?

No, we generate the EFF on the local Solr host (there is only one physical host that holds all shards), so there is no need for NFS or copying files around. No need for Zookeeper either.

> How many cores do you run per box?

This box is a 16-virtual-core machine (8 hyperthreaded cores) with 60GB of RAM. We run 16 Solr cores on this box in Jetty.

> Do the boxes have plenty of RAM to cache the filesystem besides the JVM heaps?

Yes. We've allocated 10GB for Jetty, and left the rest for the OS.

> I assume you use 64-bit Linux and an mmap directory. Please confirm that.

We use 64-bit Linux. I'm not sure about the mmap directory or where that would be configured in Solr - can you explain that?

> Will it help if we have, let's say, an ExternalDatabaseField which will pull values from JDBC?

I think the possibility of having some fields retrieved from an external, dynamically updatable store would be really interesting. This could be JDBC, something in-memory like Redis, or a NoSQL product (e.g. Cassandra). Alternatively, an API that would allow updating a single field for a document might be an option.

> Ok, I've got that you're aware of it, and that your IDs are just strings, not integers.

Yes, IDs are strings.
Re: SolrCloud and external file fields
Martin,

This deployment seems a little bit confusing to me. You have a 16-way, fairly virtual box, and you send 16 requests for a really heavy operation at the same moment; it does not surprise me that you lose it for some period of time. At that time you should see more than 16 in the load-average metrics. I suggest sending the commit to those cores one by one, and accepting inconsistency and some sort of blinking as a trade-off for availability. In this case only a single virtual CPU will be fully consumed by the commit's _thread divergence action_ and the others will serve requests.

Also, from my POV such deployments should start from at least *16* 4-way vboxes; it's more expensive, but should be much better available during CPU-consuming operations.

Other details: if you use a single Jetty for all of them, are you sure that Jetty's threadpool doesn't limit requests? Is it large enough? You have 60G and set -Xmx=10G. Are you sure that the total size of the cores' index directories is less than 45G?

Thanks
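A minimal sketch of the one-core-at-a-time commit suggested here. The host, port, and core names (`core0`..`core15`) are hypothetical, and the script defaults to a dry run that only prints the URLs. (Note that elsewhere in the thread a commit sent to a single core was observed to be distributed to all cores, which would defeat this approach.)

```shell
#!/bin/sh
# Commit to each of 16 cores one by one instead of one collection-wide
# commit, so only one core is busy reloading at a time. Hypothetical
# host/port/core names; DRY_RUN=1 prints the URLs instead of calling curl.
DRY_RUN=1

for i in $(seq 0 15); do
  url="http://localhost:8080/solr/core$i/update?commit=true"
  if [ -n "$DRY_RUN" ]; then
    echo "$url"
  else
    curl -s "$url" > /dev/null
    sleep 5   # give the core time to finish its reload before the next one
  fi
done
```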
Re: SolrCloud and external file fields
On Wed, Nov 21, 2012 at 2:07 AM, Martin Koch m...@issuu.com wrote:

> I'm not sure about the mmap directory or where that would be configured in Solr - can you explain that?

You can check it in the Solr admin UI under Statistics/core/searcher/stats/readerDir: it should be org.apache.lucene.store.MMapDirectory.

--
Sincerely yours,
Mikhail Khludnev
Principal Engineer, Grid Dynamics
http://www.griddynamics.com
mkhlud...@griddynamics.com
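As a sketch of automating this check, one could fetch the searcher stats from a live Solr 4.x (e.g. `curl -s 'http://localhost:8080/solr/core0/admin/mbeans?stats=true'`) and grep for the directory class. The snippet below uses a hypothetical sample of the stats output, not real Solr output:

```shell
#!/bin/sh
# Check whether the index reader uses MMapDirectory by grepping the
# searcher stats. The XML below is a hypothetical sample of the
# readerDir stat, standing in for a real admin/mbeans response.
cat > /tmp/searcher_stats.xml <<'EOF'
<stat name="readerDir">org.apache.lucene.store.MMapDirectory@/var/solr/core0/data/index</stat>
EOF

if grep -q 'MMapDirectory' /tmp/searcher_stats.xml; then
  echo "mmap: yes"
else
  echo "mmap: no"
fi
rm /tmp/searcher_stats.xml
```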
Re: SolrCloud and external file fields
Mikhail

I appreciate your input, it's very useful :)

On Wed, Nov 21, 2012 at 6:30 AM, Mikhail Khludnev mkhlud...@griddynamics.com wrote:

> I suggest sending the commit to those cores one by one, and accepting inconsistency and some sort of blinking as a trade-off for availability. In this case only a single virtual CPU will be fully consumed by the commit's _thread divergence action_ and the others will serve requests.

I wasn't aware until now that it is possible to send a commit to one core only. What we observed was the effect of curl localhost:8080/solr/update?commit=true, but perhaps we should experiment with solr/coreN/update?commit=true. A quick trial run seems to indicate that a commit to a single core causes commits on all cores. Perhaps I should clarify that we are using SOLR as a black box; we do not touch the code at all - we only install the distribution WAR file and proceed from there.

> Also, from my POV such deployments should start from at least *16* 4-way vboxes; it's more expensive, but should be much better available during CPU-consuming operations.

Do you mean that you recommend 16 hosts with 4 cores each? Or 4 hosts with 16 cores? Or am I misunderstanding something :) ?

> Other details: if you use a single Jetty for all of them, are you sure that Jetty's threadpool doesn't limit requests? Is it large enough? You have 60G and set -Xmx=10G. Are you sure that the total size of the cores' index directories is less than 45G?

The total index size is 230 GB, so it won't fit in RAM, but we're using an SSD disk to minimize disk access time. We have tried putting the EFF onto a RAM disk, but this didn't have a measurable effect.

Thanks, /Martin
SolrCloud and external file fields
Hi all, I'm planning to move a quite big Solr index to SolrCloud. However, in this index an external file field is used for popularity ranking. Does SolrCloud support external file fields? How does it cope with sharding and replication? Where should the external file be placed now that the index folder is not local but in the cloud? Are there otherwise best practices for the use cases external file fields are used for, like popularity/ranking, in SolrCloud? Custom ValueSources backed by something external? Thanks in advance, Simone
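For context, the classic single-node external file field setup that the replies in this thread discuss looks roughly like this in schema.xml. This is a sketch with illustrative names, not a definitive configuration: the field and file names (`rank`, `external_rank`) are made up, the data file lives in the core's data dir, and its values only become visible after a commit triggers a searcher reload.

```xml
<!-- schema.xml sketch; "rankFile", "rank", and keyField "id" are
     illustrative. Values come from a file named external_rank
     (format: docid=value per line) in the core's data dir. -->
<fieldType name="rankFile" keyField="id" defVal="0"
           stored="false" indexed="false"
           class="solr.ExternalFileField" valType="pfloat"/>
<field name="rank" type="rankFile"/>
```

The field is then consumed through function queries (e.g. boosting or sorting by a function of `rank`) rather than by direct search; how this file is distributed across shards and replicas is exactly the open question of the thread.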