Re: SolrCloud and external file fields

2012-11-28 Thread Martin Koch
Mikhail

I haven't experimented further yet. I think that the previous experiment of
issuing a commit to a specific core proved that all cores get the commit,
so I don't think that this approach will work.
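
The two commit forms in question, and the log line that gives the broadcast
away, look roughly like this (host, port, and core name as in the earlier
examples in this thread):

  # commit via the common update endpoint - distributed to the whole collection:
  curl 'http://localhost:8080/solr/update?commit=true'
  # commit addressed to one core - still broadcast to all cores in SolrCloud:
  curl 'http://localhost:8080/solr/coreN/update?commit=true'
  # either way, the log shows the commit being distributed:
  #   ... SolrCmdDistributor  Distrib commit to: ...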

Thanks,
/Martin



Re: SolrCloud and external file fields

2012-11-28 Thread Mikhail Khludnev
Martin,
Right: as long as the node is in Zookeeper, DistributedUpdateProcessor will
broadcast commits to all peers. To work around this you can introduce a
dedicated UpdateProcessorChain without DistributedUpdateProcessor and send
commits to that chain.

Re: SolrCloud and external file fields

2012-11-28 Thread Mark Miller
Keep in mind that the distrib update proc will be auto-inserted into chains!
You have to include a proc that disables it - see the FAQ: 
http://wiki.apache.org/solr/SolrCloud#FAQ

- Mark


Re: SolrCloud and external file fields

2012-11-28 Thread Mikhail Khludnev
Mark,

Your comment is quite valuable. Let me mention the keyword here so it can be
found later: NoOpDistributingUpdateProcessorFactory.
Thanks!
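
For the record, a minimal sketch of such a chain in solrconfig.xml, following
the FAQ Mark linked (the chain name localCommit is only a placeholder):

  <!-- NoOpDistributingUpdateProcessorFactory keeps the distributed update
       processor from being auto-inserted, so updates and commits sent
       through this chain stay on the core that receives them. -->
  <updateRequestProcessorChain name="localCommit">
    <processor class="solr.NoOpDistributingUpdateProcessorFactory"/>
    <processor class="solr.LogUpdateProcessorFactory"/>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>

A commit could then be aimed at one core with something like:

  curl 'http://localhost:8080/solr/coreN/update?commit=true&update.chain=localCommit'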



Re: SolrCloud and external file fields

2012-11-27 Thread Mikhail Khludnev
Martin,

It's still not clear to me whether you solved the problem completely or
only partially:
Does reducing the number of cores free up some resources for searching
during a commit?
Does committing the cores one by one prevent the freeze?

Thanks



Re: SolrCloud and external file fields

2012-11-25 Thread Simone Gianni
Hi Gopal,
the post you linked is interesting; it takes a different approach than mine:
it implements a codec for Lucene, so it works at a lower level than my
solution, which works at the Solr UpdateHandler level, before the document
reaches Lucene.

The lucene-codec approach should offer a few advantages: the field is
normally exposed in the document, and as such carried by SolrCloud while
creating new replicas (which is the part I'm not yet sure my solution
handles correctly). On the other side, it limits some flexibility; I'm
already planning at least atomic additions to support popularity ranking.

My post on lucene-dev has received no feedback so far. I'll keep working on
it, but I'm still far from a submittable patch, and help from the dev
community would be of great value.

Simone





Re: SolrCloud and external file fields

2012-11-23 Thread Martin Koch
The short answer is no; the number was chosen in an attempt to get as many
cores working in parallel to complete the search faster, but I realize that
there is an overhead incurred by distributing the query and merging the results.
We've now gone to 8 shards and will be monitoring performance.

/Martin


On Thu, Nov 22, 2012 at 3:53 PM, Yonik Seeley yo...@lucidworks.com wrote:

 On Tue, Nov 20, 2012 at 4:16 AM, Martin Koch m...@issuu.com wrote:
  around 7M documents in the index; each document has a 45 character ID.

 7M documents isn't that large.  Is there a reason why you need so many
 shards (16 in your case) on a single box?

 -Yonik
 http://lucidworks.com



Re: SolrCloud and external file fields

2012-11-23 Thread Simone Gianni
2012/11/22 Martin Koch m...@issuu.com

 IMO it would be ideal if the lucene/solr community could come up with a
 good way of updating fields in a document without reindexing. This could be
 by linking to some external data store, or in the lucene/solr internals. If
 it would make things easier, a good first step would be to have dynamically
 updateable numerical fields only.


Hi Martin,
I'm working on implementing exactly this, and I have a working prototype
right now. I'm going to write on lucene dev about the details and ask for
advice there. I'll contribute the code, so anyone interested can follow up
on dev.

Simone


Re: SolrCloud and external file fields

2012-11-23 Thread Simone Gianni
Posted,
see it here
http://lucene.472066.n3.nabble.com/Possible-sharded-and-replicated-replacement-for-ExternalFileFields-in-SolrCloud-td4022108.html

Simone






Re: SolrCloud and external file fields

2012-11-23 Thread Gopal Patwa
Hi, I am also very much interested in this, since we use Solr 4 with NRT,
where we update the index every second, but most of the time the update only
touches stored fields.
If Solr/Lucene could provide an external datastore without re-indexing, even
for stored fields only, it would be very beneficial for frequent-update use
cases: cache invalidation would not happen for stored-field updates, and
indexing performance would improve due to the smaller index size.

Here is a link to similar work:

http://www.flax.co.uk/blog/2012/06/22/updating-individual-fields-in-lucene-with-a-redis-backed-codec/





Re: SolrCloud and external file fields

2012-11-22 Thread Martin Koch
Mikhail

To avoid freezes we deployed the patches that are now on the 4.1 trunk
(SOLR-3985). But this wasn't good enough, because SOLR would still take very
long to restart when that was necessary.

I don't see how we could throw more hardware at the problem without making
it worse, really - the only solution here would be *fewer* shards, not
more.

IMO it would be ideal if the lucene/solr community could come up with a
good way of updating fields in a document without reindexing. This could be
by linking to some external data store, or in the lucene/solr internals. If
it would make things easier, a good first step would be to have dynamically
updateable numerical fields only.

/Martin


Re: SolrCloud and external file fields

2012-11-22 Thread Yonik Seeley
On Tue, Nov 20, 2012 at 4:16 AM, Martin Koch m...@issuu.com wrote:
 around 7M documents in the index; each document has a 45 character ID.

7M documents isn't that large.  Is there a reason why you need so many
shards (16 in your case) on a single box?

-Yonik
http://lucidworks.com


Re: SolrCloud and external file fields

2012-11-21 Thread Martin Koch
On Wed, Nov 21, 2012 at 7:08 AM, Mikhail Khludnev 
mkhlud...@griddynamics.com wrote:

 On Wed, Nov 21, 2012 at 2:07 AM, Martin Koch m...@issuu.com wrote:

   I'm not sure about the mmap directory or where that
  would be configured in solr - can you explain that?
 

 You can check it at Solr Admin/Statistics/core/searcher/stats/readerDir
 should be org.apache.lucene.store.MMapDirectory

It says
'org.apache.lucene.store.NRTCachingDirectory:NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@'

/Martin




Re: SolrCloud and external file fields

2012-11-21 Thread Mikhail Khludnev
On Wed, Nov 21, 2012 at 11:53 AM, Martin Koch m...@issuu.com wrote:


 I wasn't aware until now that it is possible to send a commit to one core
 only. What we observed was the effect of curl
 localhost:8080/solr/update?commit=true but perhaps we should experiment
 with solr/coreN/update?commit=true. A quick trial run seems to indicate
 that a commit to a single core causes commits on all cores.

You should see something like this in the log:
... SolrCmdDistributor  Distrib commit to: ...



 Perhaps I should clarify that we are using SOLR as a black box; we do not
 touch the code at all - we only install the distribution WAR file and
 proceed from there.

I still don't understand how you deploy/launch Solr. How many jettys do you
start? Do you have -DzkRun -DzkHost -DnumShards=2, or do you specify the
shards= param for every request and distribute updates yourself? What
collections do you create, and with which settings?




  Also from my POV such deployments should start at least from *16* 4-way
  vboxes, it's more expensive, but should be much better available during
  cpu-consuming operations.
 

 Do you mean that you recommend 16 hosts with 4 cores each? Or 4 hosts with
 16 cores? Or am I misunderstanding something :) ?

I prefer to start from 16 hosts with 4 cores each.




  Other details, if you use single jetty for all of them, are you sure that
  jetty's threadpool doesn't limit requests? is it large enough?
  You have 60G and set -Xmx=10G. are you sure that total size of cores
 index
  directories is less than 45G?
 
  The total index size is 230 GB, so it won't fit in ram, but we're using
 an
 SSD disk to minimize disk access time. We have tried putting the EFF onto a
 ram disk, but this didn't have a measurable effect.

 Thanks,
 /Martin


  Thanks
 
 
  On Wed, Nov 21, 2012 at 2:07 AM, Martin Koch m...@issuu.com wrote:
 
   Mikhail
  
   PSB
  
   On Tue, Nov 20, 2012 at 7:22 PM, Mikhail Khludnev 
   mkhlud...@griddynamics.com wrote:
  
Martin,
   
Please find additional question from me below.
   
Simone,
   
I'm sorry for hijacking your thread. The only what I've heard about
 it
  at
recent ApacheCon sessions is that Zookeeper is supposed to replicate
   those
files as configs under solr home. And I'm really looking forward to
  know
how it works with huge files in production.
   
Thank You, Guys!
   
20.11.2012 18:06 пользователь Martin Koch m...@issuu.com написал:

 Hi Mikhail

 Please see answers below.

 On Tue, Nov 20, 2012 at 12:28 PM, Mikhail Khludnev 
 mkhlud...@griddynamics.com wrote:

  Martin,
 
  Thank you for telling your own war-story. It's really useful
 for
  community.
  The first question might seems not really conscious, but would
 you
   tell
me
  what blocks searching during EFF reload, when it's triggered by
   handler
or
  by listener?
 

 We continuously index new documents using CommitWithin to get
 regular
 commits. However, we observed that the EFFs were not re-read, so we
  had
to
 do external commits (curl '.../solr/update?commit=true') to force
   reload.
 When this is done, solr blocks. I can't tell you exactly why it's
  doing
 that (it was related to SOLR-3985).
   
Is there a chance to get a thread dump when they are blocked?
   
   
   Well I could try to recreate the situation. But the setup is fairly
  simple:
   Create a large EFF in a largeish index with many shards. Issue a
 commit,
   and then try to do a search. Solr will not respond to the search before
  the
   commit has completed, and this will take a long time.
  
  
   


  I don't really get the sentence about sequential commits and
 number
   of
  cores. Do I get right that file is replicated via Zookeeper?
  Doesn't
   it
 

 Again, this is observed behavior. When we issue a commit on a
 system
   with
a
 system with many solr cores using EFFs, the system blocks for a
 long
   time
 (15 minutes).  We do NOT use zookeeper for anything. The EFF is a
   symlink
 from each cores index dir to the actual file, which is updated by
 an
 external process.
   
Hold on, I asked about Zookeeper because the subj mentions SolrCloud.
   
Do you use SolrCloud, SolrShards, or these cores are just replicas of
  the
same index?
   
  
   Ah - we use solr 4 out of the box, so I guess this is SolrCloud. I'm a
  bit
   unsure about the terminology here, but we've got a single index divided
   into 16 shard. Each shard is hosted in a solr core.
  
  
Also, about simlink - Don't you share that file via some NFS?
   
No, we generate the EFF on the local solr host (there is only one
   physical
   host that holds all shards), so there is no need for NFS or copying
 files
   around. No need for Zookeeper either.
  
  
how many cores you run per box?
   
   This box is a 16-virtual core (8 hyperthreaded 

Re: SolrCloud and external file fields

2012-11-21 Thread Martin Koch
Mikhail,

PSB

On Wed, Nov 21, 2012 at 10:08 AM, Mikhail Khludnev 
mkhlud...@griddynamics.com wrote:

 On Wed, Nov 21, 2012 at 11:53 AM, Martin Koch m...@issuu.com wrote:

 
  I wasn't aware until now that it is possible to send a commit to one core
  only. What we observed was the effect of curl
  localhost:8080/solr/update?commit=true but perhaps we should experiment
  with solr/coreN/update?commit=true. A quick trial run seems to indicate
  that a commit to a single core causes commits on all cores.
 
 You should see something like this in the log:
 ... SolrCmdDistributor  Distrib commit to: ...

 Yup, a commit towards a single core results in a commit on all cores.


 
 
  Perhaps I should clarify that we are using SOLR as a black box; we do not
  touch the code at all - we only install the distribution WAR file and
  proceed from there.
 
 I still don't understand how you deploy/launch Solr. How many jettys do you
 start? Do you have -DzkRun -DzkHost -DnumShards=2, or do you specify the
 shards= param for every request and distribute updates yourself? What
 collections do you create, and with which settings?

 We let SOLR do the sharding using one collection with 16 SOLR cores
holding one shard each. We launch only one instance of jetty with the
following arguments (assembled into a single command line below):

-DnumShards=16
-DzkHost=zookeeperhost:port
-Xmx10G
-Xms10G
-Xmn2G
-server

Would you like to see the solrconfig.xml?

/Martin
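
Assembled into a single command line, assuming the stock start.jar from the
Solr 4 example distribution (the jar path and the ZooKeeper address are
placeholders), the launch would look roughly like:

  java -server -Xmx10G -Xms10G -Xmn2G \
       -DnumShards=16 \
       -DzkHost=zookeeperhost:port \
       -jar start.jar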


 
 

Re: SolrCloud and external file fields

2012-11-21 Thread Simone Gianni
Hi Martin,
thanks for sharing your experience with EFF and saving me a lot of time
figuring it out myself; I was afraid of exactly this kind of problem.

Mikhail, thanks for expanding the thread with even more useful information!

Simone





Re: SolrCloud and external file fields

2012-11-21 Thread Mikhail Khludnev
Martin,

I don't think solrconfig.xml would shed any light on it. I've just found what
I didn't get in your setup - how cores are explicitly assigned to a
collection. Now I've worked out most of the details after all!
The ball is on your side: let us know whether you manage to commit your cores
one by one to avoid the freeze, or whether you can eliminate the pauses by
allocating more hardware.
Thanks in advance!



Re: SolrCloud and external file fields

2012-11-20 Thread Martin Koch
Solr 4.0 does support using EFFs, but it might not give you what you're
hoping for.

We tried using Solr Cloud, and have given up again.

The EFF is placed in the parent of the index directory in each core; each
core reads the entire EFF and picks out the IDs that it is responsible for.
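
As background for readers, a minimal sketch of the kind of EFF setup under
discussion, with placeholder names (this mirrors the stock Solr 4 example
schema, so treat it as an illustration rather than our exact config):

  <!-- schema.xml: values for this field are read from an external file,
       keyed on the unique key field, instead of from the index. -->
  <fieldType name="externalPopularity" keyField="id" defVal="0"
             stored="false" indexed="false"
             class="solr.ExternalFileField" valType="pfloat"/>
  <field name="popularity" type="externalPopularity"/>

The data file itself, e.g. data/external_popularity, holds one id=value line
per document and is re-read on (external) commit:

  doc-id-001=42.5
  doc-id-002=17.0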

In the current 4.0.0 release of solr, solr blocks (doesn't answer queries)
while re-reading the EFF. Even worse, it seems that the time to re-read the
EFF is multiplied by the number of cores in use (i.e. the EFF is re-read by
each core sequentially). The contents of the EFF become active after the
first EXTERNAL commit (commitWithin does NOT work here) after the file has
been updated.

In our case, the EFF was quite large - around 450MB - and we use 16 shards,
so when we triggered an external commit to force re-reading, the whole
system would block for several (10-15) minutes. This won't work in a
production environment. The reason for the size of the EFF is that we have
around 7M documents in the index; each document has a 45 character ID.

We got some help to try to fix the problem so that the re-read of the EFF
proceeds in the background (see https://issues.apache.org/jira/browse/SOLR-3985
for a fix on the 4.1 branch). However, even though the re-read proceeds in the
background, the time required to launch solr now takes at least as long as
re-reading the EFFs. Again, this is not good enough for our needs.

The next issue is that you cannot sort on EFF fields (though you can return
them as values using fl=field(my_eff_field)). This is also fixed on the 4.1
branch: https://issues.apache.org/jira/browse/SOLR-4022.

So: Even after these fixes, EFF performance is not that great. Our solution
is as follows: The actual value of the popularity measure (say, reads) that
we want to report to the user is inserted into the search response
post-query by our query front-end. This value will then be the
authoritative value at the time of the query. The value of the popularity
measure that we use for boosting in the ranking of the search results is
only updated when the value has changed enough so that the impact on the
boost will be significant (say, more than 2%). This does require frequent
re-indexing of the documents that have significant changes in the number of
reads, but at least we won't have to update a document if it moves from,
say, 100 to 101 reads.

/Martin Koch - ISSUU - senior systems architect.

On Mon, Nov 19, 2012 at 3:22 PM, Simone Gianni simo...@apache.org wrote:

 Hi all,
 I'm planning to move a quite big Solr index to SolrCloud. However, in this
 index, an external file field is used for popularity ranking.

 Does SolrCloud supports external file fields? How does it cope with
 sharding and replication? Where should the external file be placed now that
 the index folder is not local but in the cloud?

 Are there otherwise other best practices to deal with the use cases
 external file fields were used for, like popularity/ranking, in SolrCloud?
 Custom ValueSources going to something external?

 Thanks in advance,
 Simone



Re: SolrCloud and exernal file fields

2012-11-20 Thread Mikhail Khludnev
Martin,

Thank you for telling your own war-story. It's really useful for the
community.
The first question might seem not really well-posed, but would you tell me
what blocks searching during EFF reload when it's triggered by a handler or
by a listener?
I don't really get the sentence about sequential commits and the number of
cores. Do I get it right that the file is replicated via Zookeeper? Doesn't
that cause a scalability problem or a long reload time? Would it help if we
had, let's say, an ExternalDatabaseField which pulls values from JDBC, i.e.
why can't all cores read these values simultaneously?
Can you confirm that the IDs in the file are ordered by index term order?
AFAIK it can impact load time.
Regarding your post-query solution: if a query found, say, 10K docs, but I
need to display only the first page with 100 rows, do I need to pull all
10K results to the frontend to order them by rank?

I'd really appreciate it if you commented on the questions above.
PS: It's time to pitch: how much can
https://issues.apache.org/jira/browse/SOLR-4085 (Commit-free
ExternalFileField) help you?



On Tue, Nov 20, 2012 at 1:16 PM, Martin Koch m...@issuu.com wrote:

 Solr 4.0 does support using EFFs, but it might not give you what you're
 hoping for.

 We tried using Solr Cloud, and have given up again.

 The EFF is placed in the parent of the index directory in each core; each
 core reads the entire EFF and picks out the IDs that it is responsible for.

 In the current 4.0.0 release of solr, solr blocks (doesn't answer queries)
 while re-reading the EFF. Even worse, it seems that the time to re-read the
 EFF is multiplied by the number of cores in use (i.e. the EFF is re-read by
 each core sequentially). The contents of the EFF become active after the
 first EXTERNAL commit (commitWithin does NOT work here) after the file has
 been updated.

 In our case, the EFF was quite large - around 450MB - and we use 16 shards,
 so when we triggered an external commit to force re-reading, the whole
 system would block for several (10-15) minutes. This won't work in a
 production environment. The reason for the size of the EFF is that we have
 around 7M documents in the index; each document has a 45 character ID.

 We got some help to try to fix the problem so that the re-read of the EFF
 proceeds in the background (see
 here https://issues.apache.org/jira/browse/SOLR-3985 for
 a fix on the 4.1 branch). However, even though the re-read proceeds in the
 background, the time required to launch solr now takes at least as long as
 re-reading the EFFs. Again, this is not good enough for our needs.

 The next issue is that you cannot sort on EFF fields (though you can return
 them as values using fl=field(my_eff_field)). This is also fixed in the 4.1
 branch here https://issues.apache.org/jira/browse/SOLR-4022.

 So: Even after these fixes, EFF performance is not that great. Our solution
 is as follows: The actual value of the popularity measure (say, reads) that
 we want to report to the user is inserted into the search response
 post-query by our query front-end. This value will then be the
 authoritative value at the time of the query. The value of the popularity
 measure that we use for boosting in the ranking of the search results is
 only updated when the value has changed enough so that the impact on the
 boost will be significant (say, more than 2%). This does require frequent
 re-indexing of the documents that have significant changes in the number of
 reads, but at least we won't have to update a document if it moves from,
 say, 100 to 101 reads.

 /Martin Koch - ISSUU - senior systems architect.

 On Mon, Nov 19, 2012 at 3:22 PM, Simone Gianni simo...@apache.org wrote:

  Hi all,
  I'm planning to move a quite big Solr index to SolrCloud. However, in
 this
  index, an external file field is used for popularity ranking.
 
  Does SolrCloud supports external file fields? How does it cope with
  sharding and replication? Where should the external file be placed now
 that
  the index folder is not local but in the cloud?
 
  Are there otherwise other best practices to deal with the use cases
  external file fields were used for, like popularity/ranking, in
 SolrCloud?
  Custom ValueSources going to something external?
 
  Thanks in advance,
  Simone
 




-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Re: SolrCloud and exernal file fields

2012-11-20 Thread Martin Koch
Hi Mikhail

Please see answers below.

On Tue, Nov 20, 2012 at 12:28 PM, Mikhail Khludnev 
mkhlud...@griddynamics.com wrote:

 Martin,

 Thank you for telling your own war-story. It's really useful for
 community.
 The first question might seems not really conscious, but would you tell me
 what blocks searching during EFF reload, when it's triggered by handler or
 by listener?


We continuously index new documents using CommitWithin to get regular
commits. However, we observed that the EFFs were not re-read, so we had to
do external commits (curl '.../solr/update?commit=true') to force reload.
When this is done, solr blocks. I can't tell you exactly why it's doing
that (it was related to SOLR-3985).


 I don't really get the sentence about sequential commits and number of
 cores. Do I get right that file is replicated via Zookeeper? Doesn't it


Again, this is observed behavior. When we issue a commit on a system with
many solr cores using EFFs, the system blocks for a long time (15 minutes).
We do NOT use zookeeper for anything. The EFF is a symlink from each core's
index dir to the actual file, which is updated by an external process.
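
The symlink setup is nothing fancy; roughly (paths illustrative):

    # one shared file, updated by the external process, linked into each core
    for core in /var/solr/core{0..15}; do
      ln -sfn /data/popularity/external_popularity $core/data/external_popularity
    done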


 causes scalability problem or long time to reload? Will it help if we'll
 have, let's say ExternalDatabaseField which will pull values from jdbc. ie.


I think the possibility of having some fields being retrieved from an
external, dynamically updatable store would be really interesting. This
could be JDBC, something in-memory like redis, or a NoSql product (e.g.
Cassandra).


 why all cores can't read these values simultaneously?


Again, this is a solr implementation detail that I can't answer :)


 Can you confirm that IDs in the file is ordered by the index term order?


Yes, we sorted the files (standard UNIX sort).
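
For the record, it's a plain byte-order sort on the key, something like:

    LC_ALL=C sort -t'=' -k1,1 popularity_raw.txt > external_popularity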


 AFAIK it can impact load time.

Yes, it does.


 Regarding your post-query solution can you tell me if query found 10K
 docs, but I need to display only first page with 100 rows, whether I need
 to pull all 10K results to frontend to order them by the rank?


In our architecture, the clients query an API that generates the SOLR
query, retrieves the relevant additional fields that we need, and returns
the relevant JSON to the front-end.

In our use case, results are returned from SOLR by the 10's, not by the
1000's, so it is a manageable job. Even so, if solr returned thousands of
results, it would be up to the implementation of the api to augment only
the results that needed to be returned to the front-end.

And even then, patching up a JSON structure with 10K results should be
possible.


 I'm really appreciate if you comment on the questions above.
 PS: It's time to pitch, how much
 https://issues.apache.org/jira/browse/SOLR-4085 Commit-free
 ExternalFileField can help you?


 It looks very interesting :) Does it make it possible to avoid re-reading
the EFF on every commit, and only re-read the values that have actually
changed?

/Martin



 On Tue, Nov 20, 2012 at 1:16 PM, Martin Koch m...@issuu.com wrote:

  Solr 4.0 does support using EFFs, but it might not give you what you're
  hoping fore.
 
  We tried using Solr Cloud, and have given up again.
 
  The EFF is placed in the parent of the index directory in each core; each
  core reads the entire EFF and picks out the IDs that it is responsible
 for.
 
  In the current 4.0.0 release of solr, solr blocks (doesn't answer
 queries)
  while re-reading the EFF. Even worse, it seems that the time to re-read
 the
  EFF is multiplied by the number of cores in use (i.e. the EFF is re-read
 by
  each core sequentially). The contents of the EFF become active after the
  first EXTERNAL commit (commitWithin does NOT work here) after the file
 has
  been updated.
 
  In our case, the EFF was quite large - around 450MB - and we use 16
 shards,
  so when we triggered an external commit to force re-reading, the whole
  system would block for several (10-15) minutes. This won't work in a
  production environment. The reason for the size of the EFF is that we
 have
  around 7M documents in the index; each document has a 45 character ID.
 
  We got some help to try to fix the problem so that the re-read of the EFF
  proceeds in the background (see
  here https://issues.apache.org/jira/browse/SOLR-3985 for
  a fix on the 4.1 branch). However, even though the re-read proceeds in
 the
  background, the time required to launch solr now takes at least as long
 as
  re-reading the EFFs. Again, this is not good enough for our needs.
 
  The next issue is that you cannot sort on EFF fields (though you can
 return
  them as values using fl=field(my_eff_field). This is also fixed in the
 4.1
  branch here https://issues.apache.org/jira/browse/SOLR-4022.
 
  So: Even after these fixes, EFF performance is not that great. Our
 solution
  is as follows: The actual value of the popularity measure (say, reads)
 that
  we want to report to the user is inserted into the search response
  post-query by our query 

Re: SolrCloud and exernal file fields

2012-11-20 Thread Mikhail Khludnev
Martin,

Please find additional question from me below.

Simone,

I'm sorry for hijacking your thread. The only thing I've heard about it at
recent ApacheCon sessions is that Zookeeper is supposed to replicate those
files as configs under the solr home. And I'm really looking forward to
knowing how it works with huge files in production.

Thank You, Guys!

On 20.11.2012 18:06, Martin Koch m...@issuu.com wrote:

 Hi Mikhail

 Please see answers below.

 On Tue, Nov 20, 2012 at 12:28 PM, Mikhail Khludnev 
 mkhlud...@griddynamics.com wrote:

  Martin,
 
  Thank you for telling your own war-story. It's really useful for
  community.
  The first question might seems not really conscious, but would you tell
me
  what blocks searching during EFF reload, when it's triggered by handler
or
  by listener?
 

 We continuously index new documents using CommitWithin to get regular
 commits. However, we observed that the EFFs were not re-read, so we had to
 do external commits (curl '.../solr/update?commit=true') to force reload.
 When this is done, solr blocks. I can't tell you exactly why it's doing
 that (it was related to SOLR-3985).

Is there a chance to get a thread dump when they are blocked?
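
Even something as simple as this while it hangs would help (assuming jetty
is started from start.jar):

    jstack -l $(pgrep -f start.jar) > solr-threads.txt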




  I don't really get the sentence about sequential commits and number of
  cores. Do I get right that file is replicated via Zookeeper? Doesn't it
 

 Again, this is observed behavior. When we issue a commit on a system with
a
 system with many solr cores using EFFs, the system blocks for a long time
 (15 minutes).  We do NOT use zookeeper for anything. The EFF is a symlink
 from each cores index dir to the actual file, which is updated by an
 external process.

Hold on, I asked about Zookeeper because the subject mentions SolrCloud.

Do you use SolrCloud, SolrShards, or are these cores just replicas of the
same index?
Also, about the symlink - don't you share that file via some NFS?

How many cores do you run per box?

Do the boxes have plenty of RAM to cache the filesystem besides the jvm
heaps?

I assume you use 64-bit linux and an mmap directory. Please confirm that.




  causes scalability problem or long time to reload? Will it help if we'll
  have, let's say ExternalDatabaseField which will pull values from jdbc.
ie.
 

 I think the possibility of having some fields being retrieved from an
 external, dynamically updatable store would be really interesting. This
 could be JDBC, something in-memory like redis, or a NoSql product (e.g.
 Cassandra).

Ok. Let's keep it in mind as a possible direction.



  why all cores can't read these values simultaneously?
 

 Again, this is a solr implementation detail that I can't answer :)


  Can you confirm that IDs in the file is ordered by the index term order?
 

 Yes, we sorted the files (standard UNIX sort).


  AFAIK it can impact load time.
 
 Yes, it does

Ok, I've got that you're aware of it, and your IDs are just strings, not
integers.




  Regarding your post-query solution can you tell me if query found 10K
  docs, but I need to display only first page with 100 rows, whether I
need
  to pull all 10K results to frontend to order them by the rank?
 
 
 In our architecture, the clients query an API that generates the SOLR
 query, retrieves the relevant additional fields that we needs, and returns
 the relevant JSON to the front-end.

 In our use case, results are returned from SOLR by the 10's, not by the
 1000's, so it is a manageable job. Even so, if solr returned thousands of
 results, it would be up to the implementation of the api to augment only
 the results that needed to be returned to the front-end.

 Even so, patching up a JSON structure with 10K results should be
 possible.

You are right. I'm concerned anyway because retrieving the whole result set
is expensive, and not always possible.




  I'm really appreciate if you comment on the questions above.
  PS: It's time to pitch, how much
  https://issues.apache.org/jira/browse/SOLR-4085 Commit-free
  ExternalFileField can help you?
 
 
  It looks very interesting :) Does it make it possible to avoid
re-reading
 the EFF on every commit, and only re-read the values that have actually
 changed?


You don't need a commit (in SOLR-4085) to reload the file content, but after
a commit you need to read the whole file and scan all key terms and
postings. That's because the EFF sits on top of the top-level searcher; it's
the Solr-like way. In some future we might have a per-segment EFF; in that
case adding a segment will still trigger a full file scan, but in the index
only that new segment will be scanned. It should be faster. You know,
straightforwardly sharing internal data structures between different index
views/generations is not possible.
If you are asking about applying delta changes to the external file, that's
something we did ourselves (http://goo.gl/P8GFq). This feature is much more
doubtful and vague, although it might be the next contribution after
SOLR-4085.


 /Martin


 
  On Tue, Nov 20, 2012 at 1:16 PM, Martin Koch m...@issuu.com wrote:
 
   Solr 4.0 does 

Re: SolrCloud and exernal file fields

2012-11-20 Thread Martin Koch
Mikhail

PSB

On Tue, Nov 20, 2012 at 7:22 PM, Mikhail Khludnev 
mkhlud...@griddynamics.com wrote:

 Martin,

 Please find additional question from me below.

 Simone,

 I'm sorry for hijacking your thread. The only what I've heard about it at
 recent ApacheCon sessions is that Zookeeper is supposed to replicate those
 files as configs under solr home. And I'm really looking forward to know
 how it works with huge files in production.

 Thank You, Guys!

 On 20.11.2012 18:06, Martin Koch m...@issuu.com wrote:
 
  Hi Mikhail
 
  Please see answers below.
 
  On Tue, Nov 20, 2012 at 12:28 PM, Mikhail Khludnev 
  mkhlud...@griddynamics.com wrote:
 
   Martin,
  
   Thank you for telling your own war-story. It's really useful for
   community.
   The first question might seems not really conscious, but would you tell
 me
   what blocks searching during EFF reload, when it's triggered by handler
 or
   by listener?
  
 
  We continuously index new documents using CommitWithin to get regular
  commits. However, we observed that the EFFs were not re-read, so we had
 to
  do external commits (curl '.../solr/update?commit=true') to force reload.
  When this is done, solr blocks. I can't tell you exactly why it's doing
  that (it was related to SOLR-3985).

 Is there a chance to get a thread dump when they are blocked?


Well I could try to recreate the situation. But the setup is fairly simple:
Create a large EFF in a largeish index with many shards. Issue a commit,
and then try to do a search. Solr will not respond to the search before the
commit has completed, and this will take a long time.



 
 
   I don't really get the sentence about sequential commits and number of
   cores. Do I get right that file is replicated via Zookeeper? Doesn't it
  
 
  Again, this is observed behavior. When we issue a commit on a system with
 a
  system with many solr cores using EFFs, the system blocks for a long time
  (15 minutes).  We do NOT use zookeeper for anything. The EFF is a symlink
  from each cores index dir to the actual file, which is updated by an
  external process.

 Hold on, I asked about Zookeeper because the subj mentions SolrCloud.

 Do you use SolrCloud, SolrShards, or these cores are just replicas of the
 same index?


Ah - we use solr 4 out of the box, so I guess this is SolrCloud. I'm a bit
unsure about the terminology here, but we've got a single index divided
into 16 shards. Each shard is hosted in a solr core.


 Also, about simlink - Don't you share that file via some NFS?

 No, we generate the EFF on the local solr host (there is only one physical
host that holds all shards), so there is no need for NFS or copying files
around. No need for Zookeeper either.


 how many cores you run per box?

This box has 16 virtual cores (8 hyperthreaded physical cores) with 60GB of RAM. We
run 16 solr cores on this box in Jetty.


 Do boxes has plenty of ram to cache filesystem beside of jvm heaps?

 Yes. We've allocated 10GB for jetty, and left the rest for the OS.


 I assume you use 64 bit linux and mmap directory. Please confirm that.


We use 64-bit linux. I'm not sure about the mmap directory or where that
would be configured in solr - can you explain that?


 
 
   causes scalability problem or long time to reload? Will it help if
 we'll
   have, let's say ExternalDatabaseField which will pull values from jdbc.
 ie.
  
 
  I think the possibility of having some fields being retrieved from an
  external, dynamically updatable store would be really interesting. This
  could be JDBC, something in-memory like redis, or a NoSql product (e.g.
  Cassandra).

 Ok. Let's have it in mind as a possible direction.


Alternatively, an API that would allow updating a single field for a
document might be an option.



 
 
   why all cores can't read these values simultaneously?
  
 
  Again, this is a solr implementation detail that I can't answer :)
 
 
   Can you confirm that IDs in the file is ordered by the index term
 order?
  
 
  Yes, we sorted the files (standard UNIX sort).
 
 
   AFAIK it can impact load time.
  
  Yes, it does

 Ok, I've got that you aware of it, and your IDs are just strings, not
 integers.


Yes, ids are strings.


 
 
    Regarding your post-query solution can you tell me if query found 10K
   docs, but I need to display only first page with 100 rows, whether I
 need
   to pull all 10K results to frontend to order them by the rank?
  
  
  In our architecture, the clients query an API that generates the SOLR
  query, retrieves the relevant additional fields that we needs, and
 returns
  the relevant JSON to the front-end.
 
  In our use case, results are returned from SOLR by the 10's, not by the
  1000's, so it is a manageable job. Even so, if solr returned thousands of
  results, it would be up to the implementation of the api to augment only
  the results that needed to be returned to the front-end.
 
   Even so, patching up a JSON structure with 10K results should be
  possible.

 

Re: SolrCloud and exernal file fields

2012-11-20 Thread Mikhail Khludnev
Martin,
This deployment seems a little bit confusing to me. You have a 16-way
virtual box and send 16 requests for a really heavy operation at the same
moment; it does not surprise me that you lose it for some period of time.
At that time you should see more than 16 in the load average metrics.
I suggest sending the commit to those cores one by one and accepting
inconsistency and some sort of blinking as a trade-off for availability. In
this case only a single virtual CPU will be fully consumed by the commit's
_thread divergence action_ and the others will keep serving requests.
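
Something like this is what I mean (core names illustrative; whether a
core-level commit stays local in your setup is worth verifying):

    # commit each core in turn, letting the others keep serving searches
    for i in $(seq 0 15); do
      curl "http://localhost:8080/solr/core$i/update?commit=true"
    done
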
Also, from my POV such deployments should start from at least *16* 4-way
vboxes; it's more expensive, but should be much better available during
cpu-consuming operations.
Other details: if you use a single jetty for all of them, are you sure that
jetty's threadpool doesn't limit requests? Is it large enough?
You have 60G and set -Xmx=10G. Are you sure that the total size of the
cores' index directories is less than 45G?

Thanks


On Wed, Nov 21, 2012 at 2:07 AM, Martin Koch m...@issuu.com wrote:

 Mikhail

 PSB

 On Tue, Nov 20, 2012 at 7:22 PM, Mikhail Khludnev 
 mkhlud...@griddynamics.com wrote:

  Martin,
 
  Please find additional question from me below.
 
  Simone,
 
  I'm sorry for hijacking your thread. The only what I've heard about it at
  recent ApacheCon sessions is that Zookeeper is supposed to replicate
 those
  files as configs under solr home. And I'm really looking forward to know
  how it works with huge files in production.
 
  Thank You, Guys!
 
  On 20.11.2012 18:06, Martin Koch m...@issuu.com wrote:
  
   Hi Mikhail
  
   Please see answers below.
  
   On Tue, Nov 20, 2012 at 12:28 PM, Mikhail Khludnev 
   mkhlud...@griddynamics.com wrote:
  
Martin,
   
Thank you for telling your own war-story. It's really useful for
community.
The first question might seems not really conscious, but would you
 tell
  me
what blocks searching during EFF reload, when it's triggered by
 handler
  or
by listener?
   
  
   We continuously index new documents using CommitWithin to get regular
   commits. However, we observed that the EFFs were not re-read, so we had
  to
   do external commits (curl '.../solr/update?commit=true') to force
 reload.
   When this is done, solr blocks. I can't tell you exactly why it's doing
   that (it was related to SOLR-3985).
 
  Is there a chance to get a thread dump when they are blocked?
 
 
 Well I could try to recreate the situation. But the setup is fairly simple:
 Create a large EFF in a largeish index with many shards. Issue a commit,
 and then try to do a search. Solr will not respond to the search before the
 commit has completed, and this will take a long time.


 
  
  
I don't really get the sentence about sequential commits and number
 of
cores. Do I get right that file is replicated via Zookeeper? Doesn't
 it
   
  
   Again, this is observed behavior. When we issue a commit on a system
 with
  a
   system with many solr cores using EFFs, the system blocks for a long
 time
   (15 minutes).  We do NOT use zookeeper for anything. The EFF is a
 symlink
   from each cores index dir to the actual file, which is updated by an
   external process.
 
  Hold on, I asked about Zookeeper because the subj mentions SolrCloud.
 
  Do you use SolrCloud, SolrShards, or these cores are just replicas of the
  same index?
 

 Ah - we use solr 4 out of the box, so I guess this is SolrCloud. I'm a bit
 unsure about the terminology here, but we've got a single index divided
 into 16 shard. Each shard is hosted in a solr core.


  Also, about simlink - Don't you share that file via some NFS?
 
  No, we generate the EFF on the local solr host (there is only one
 physical
 host that holds all shards), so there is no need for NFS or copying files
 around. No need for Zookeeper either.


  how many cores you run per box?
 
 This box is a 16-virtual core (8 hyperthreaded cores)  with 60GB of RAM. We
 run 16 solr cores on this box in Jetty.


  Do boxes has plenty of ram to cache filesystem beside of jvm heaps?
 
  Yes. We've allocated 10GB for jetty, and left the rest for the OS.


  I assume you use 64 bit linux and mmap directory. Please confirm that.
 
 
 We use 64-bit linux. I'm not sure about the mmap directory or where that
 would be configured in solr - can you explain that?

 
  
  
causes scalability problem or long time to reload? Will it help if
  we'll
have, let's say ExternalDatabaseField which will pull values from
 jdbc.
  ie.
   
  
   I think the possibility of having some fields being retrieved from an
   external, dynamically updatable store would be really interesting. This
   could be JDBC, something in-memory like redis, or a NoSql product (e.g.
   Cassandra).
 
  Ok. Let's have it in mind as a possible direction.
 

 Alternatively, an API that would allow updating a single field for a
 document might be an option.


 
  
  
why all cores can't read these values 

Re: SolrCloud and exernal file fields

2012-11-20 Thread Mikhail Khludnev
On Wed, Nov 21, 2012 at 2:07 AM, Martin Koch m...@issuu.com wrote:

  I'm not sure about the mmap directory or where that
 would be configured in solr - can you explain that?


You can check it at Solr Admin/Statistics/core/searcher/stats/readerDir;
it should be org.apache.lucene.store.MMapDirectory.
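
The same check without the UI, roughly (core name illustrative):

    curl 'http://localhost:8080/solr/core0/admin/mbeans?cat=CORE&stats=true' \
      | grep readerDir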

-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
mkhlud...@griddynamics.com


Re: SolrCloud and exernal file fields

2012-11-20 Thread Martin Koch
Mikhail

I appreciate your input, it's very useful :)

On Wed, Nov 21, 2012 at 6:30 AM, Mikhail Khludnev 
mkhlud...@griddynamics.com wrote:

 Martin,
 This deployment seems a little bit confusing to me. You have 16-way fairy
 virtual box, and send 16 request for really heavy operation at the same
 moment, it does not surprise me that you loosing it for some period of
 time. At that time you should have more than 16 in load average metrics.
 I suggest to send commit to those cores one-by-one and have inconsistency
 and some sort of blinking as a trade-off for availability. In this case
 only single virtual CPU will be fully consumed by the commit's _thread
 divergence action_ and others will serve requests.


I wasn't aware until now that it is possible to send a commit to one core
only. What we observed was the effect of curl
localhost:8080/solr/update?commit=true but perhaps we should experiment
with solr/coreN/update?commit=true. A quick trial run seems to indicate
that a commit to a single core causes commits on all cores.


Perhaps I should clarify that we are using SOLR as a black box; we do not
touch the code at all - we only install the distribution WAR file and
proceed from there.


 Also from my POV such deployments should start at least from *16* 4-way
 vboxes, it's more expensive, but should be much better available during
 cpu-consuming operations.


Do you mean that you recommend 16 hosts with 4 cores each? Or 4 hosts with
16 cores? Or am I misunderstanding something :) ?


 Other details, if you use single jetty for all of them, are you sure that
 jetty's threadpool doesn't limit requests? is it large enough?
 You have 60G and set -Xmx=10G. are you sure that total size of cores index
 directories is less than 45G?

 The total index size is 230 GB, so it won't fit in ram, but we're using an
SSD disk to minimize disk access time. We have tried putting the EFF onto a
ram disk, but this didn't have a measurable effect.
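
(The ram disk trial was just the obvious thing, roughly - paths
illustrative:

    mount -t tmpfs -o size=1g tmpfs /mnt/eff_ram
    cp /data/popularity/external_popularity /mnt/eff_ram/
    # then re-point the per-core symlinks at /mnt/eff_ram/external_popularity
)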

Thanks,
/Martin


 Thanks


 On Wed, Nov 21, 2012 at 2:07 AM, Martin Koch m...@issuu.com wrote:

  Mikhail
 
  PSB
 
  On Tue, Nov 20, 2012 at 7:22 PM, Mikhail Khludnev 
  mkhlud...@griddynamics.com wrote:
 
   Martin,
  
   Please find additional question from me below.
  
   Simone,
  
   I'm sorry for hijacking your thread. The only what I've heard about it
 at
   recent ApacheCon sessions is that Zookeeper is supposed to replicate
  those
   files as configs under solr home. And I'm really looking forward to
 know
   how it works with huge files in production.
  
   Thank You, Guys!
  
   On 20.11.2012 18:06, Martin Koch m...@issuu.com wrote:
   
Hi Mikhail
   
Please see answers below.
   
On Tue, Nov 20, 2012 at 12:28 PM, Mikhail Khludnev 
mkhlud...@griddynamics.com wrote:
   
 Martin,

 Thank you for telling your own war-story. It's really useful for
 community.
 The first question might seems not really conscious, but would you
  tell
   me
 what blocks searching during EFF reload, when it's triggered by
  handler
   or
 by listener?

   
We continuously index new documents using CommitWithin to get regular
commits. However, we observed that the EFFs were not re-read, so we
 had
   to
do external commits (curl '.../solr/update?commit=true') to force
  reload.
When this is done, solr blocks. I can't tell you exactly why it's
 doing
that (it was related to SOLR-3985).
  
   Is there a chance to get a thread dump when they are blocked?
  
  
  Well I could try to recreate the situation. But the setup is fairly
 simple:
  Create a large EFF in a largeish index with many shards. Issue a commit,
  and then try to do a search. Solr will not respond to the search before
 the
  commit has completed, and this will take a long time.
 
 
  
   
   
 I don't really get the sentence about sequential commits and number
  of
 cores. Do I get right that file is replicated via Zookeeper?
 Doesn't
  it

   
Again, this is observed behavior. When we issue a commit on a system
  with
   a
system with many solr cores using EFFs, the system blocks for a long
  time
(15 minutes).  We do NOT use zookeeper for anything. The EFF is a
  symlink
from each cores index dir to the actual file, which is updated by an
external process.
  
   Hold on, I asked about Zookeeper because the subj mentions SolrCloud.
  
   Do you use SolrCloud, SolrShards, or these cores are just replicas of
 the
   same index?
  
 
  Ah - we use solr 4 out of the box, so I guess this is SolrCloud. I'm a
 bit
  unsure about the terminology here, but we've got a single index divided
  into 16 shard. Each shard is hosted in a solr core.
 
 
   Also, about simlink - Don't you share that file via some NFS?
  
   No, we generate the EFF on the local solr host (there is only one
  physical
  host that holds all shards), so there is no need for NFS or copying files
  around. No need for Zookeeper either.
 
 
   how many cores 

SolrCloud and exernal file fields

2012-11-19 Thread Simone Gianni
Hi all,
I'm planning to move a quite big Solr index to SolrCloud. However, in this
index, an external file field is used for popularity ranking.

Does SolrCloud supports external file fields? How does it cope with
sharding and replication? Where should the external file be placed now that
the index folder is not local but in the cloud?

Are there otherwise other best practices to deal with the use cases
external file fields were used for, like popularity/ranking, in SolrCloud?
Custom ValueSources going to something external?

Thanks in advance,
Simone