Re: Zookeeper: Could not get shard_id for core

2013-03-04 Thread Martin Koch
Is it possible to run solr without zookeeper, but still using sharding, if
it's all running on one host? Would the shards have to be explicitly
included in the query urls?
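(Something like the classic distributed-search syntax, I suppose, where a
shards parameter lists every core explicitly - a made-up example of what I
mean, with illustrative core names and port:

curl 'http://localhost:8983/solr/core0/select?q=foo&shards=localhost:8983/solr/core0,localhost:8983/solr/core1'
)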

Thanks,
/Martin


On Fri, Mar 1, 2013 at 3:58 PM, Shawn Heisey s...@elyograg.org wrote:

 On 3/1/2013 7:34 AM, Martin Koch wrote:

 Most of the time things run just fine; however, we see this error every so
 often, and fix it as described.

 How do I run solr in non cloud mode? Could you point me to a description?


 The zookeeper options are required for cloud mode - zkHost to tell it
 about all your zookeeper nodes, zkRun to run an embedded zookeeper server.
  If you don't have these in solr.xml or your startup commandline, Solr 4.x
 will not be running in cloud mode, just like earlier versions.
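 As a quick illustration (ports and shard count are just examples, not taken
 from your setup), with the 4.x example server:

  # cloud mode, embedded zookeeper:
  java -DzkRun -DnumShards=2 -jar start.jar
  # cloud mode, external zookeeper ensemble:
  java -DzkHost=zk1:2181,zk2:2181,zk3:2181 -jar start.jar
  # plain (non-cloud) mode - neither zkRun nor zkHost:
  java -jar start.jar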

 Thanks,
 Shawn




Zookeeper: Could not get shard_id for core

2013-03-01 Thread Martin Koch
On a host that is running two separate solr (jetty) processes and a single
zookeeper process, we're often seeing solr complain that it can't find a
particular core. If we restart the solr process, when it comes back up, it
has lost all information about its cores:

Feb 28, 2013 10:26:47 PM org.apache.solr.core.SolrCore registerSearcher
INFO: [core0] Registered new searcher
Searcher@14df33aemain{StandardDirectoryReader(segments_aat:181977
_16pu(4.0.0.2):C263610
/78380 _vwv(4.0.0.2):C285538/130332 ... [snip]
Feb 28, 2013 10:26:47 PM org.apache.solr.common.cloud.ZkStateReader$2
process
INFO: A cluster state change has occurred - updating...
Feb 28, 2013 10:27:47 PM org.apache.solr.common.SolrException log
*SEVERE: null:org.apache.solr.common.SolrException: Could not get shard_id
for core: core0*
at
org.apache.solr.cloud.ZkController.doGetShardIdProcess(ZkController.java:995)
at
org.apache.solr.cloud.ZkController.preRegister(ZkController.java:1053)
at
org.apache.solr.core.CoreContainer.register(CoreContainer.java:662)
[snip]

SEVERE: null:org.apache.solr.common.SolrException: Could not get shard_id
for core: core1

etc for all the cores.

The solution has been so far to shut down solr and zookeeper, delete the
zookeeper configuration from disk, and then bring everything back up again.

Has anyone else seen this problem? I'd love to be able to do without the
hassle of having to run zookeeper, and the problems that are associated
with it. Is this possible?

Thanks,
/Martin Koch - Senior Systems Architect - Issuu.com


Re: Zookeeper: Could not get shard_id for core

2013-03-01 Thread Martin Koch
Most of the time things run just fine; however, we see this error every so
often, and fix it as described.

How do I run solr in non cloud mode? Could you point me to a description?

Thanks,
/Martin


On Fri, Mar 1, 2013 at 3:30 PM, Mark Miller markrmil...@gmail.com wrote:

 It sounds like you have some sort of configuration issue perhaps. When
 things are setup right, you should not be seeing anything like this.

 Whether or not you can do without ZooKeeper depends on what your
 requirements are and what you want to support. You can use SolrCloud mode
 and non SolrCloud mode - there are advantages and disadvantages to each.

 - Mark

 On Mar 1, 2013, at 9:03 AM, Martin Koch m...@issuu.com wrote:

  On a host that is running two separate solr (jetty) processes and a
 single
  zookeeper process, we're often seeing solr complain that it can't find a
  particular core. If we restart the solr process, when it comes back up,
 it
  has lost all information about its cores
 
  Feb 28, 2013 10:26:47 PM org.apache.solr.core.SolrCore registerSearcher
  INFO: [core0] Registered new searcher
  Searcher@14df33aemain{StandardDirectoryReader(segments_aat:181977
  _16pu(4.0.0.2):C263610
  /78380 _vwv(4.0.0.2):C285538/130332 ... [snip]
  Feb 28, 2013 10:26:47 PM org.apache.solr.common.cloud.ZkStateReader$2
  process
  INFO: A cluster state change has occurred - updating...
  Feb 28, 2013 10:27:47 PM org.apache.solr.common.SolrException log
  *SEVERE: null:org.apache.solr.common.SolrException: Could not get
 shard_id
  for core: core0*
 at
 
 org.apache.solr.cloud.ZkController.doGetShardIdProcess(ZkController.java:995)
 at
  org.apache.solr.cloud.ZkController.preRegister(ZkController.java:1053)
 at
  org.apache.solr.core.CoreContainer.register(CoreContainer.java:662)
 [snip]
 
  SEVERE: null:org.apache.solr.common.SolrException: Could not get shard_id
  for core: core1
 
  etc for all the cores.
 
  The solution has been so far to shut down solr and zookeeper, delete the
  zookeeper configuration from disk, and then bring everything back up
 again.
 
  Has anyone else seen this problem? I'd love to be able to do without the
  hassle of having to run zookeeper, and the problems that are associated
  with it. Is this possible?
 
  Thanks,
  /Martin Koch - Senior Systems Architect - Issuu.com




Re: Zookeeper: Could not get shard_id for core

2013-03-01 Thread Martin Koch
Thank you very much, Shawn. I had understood that Zookeeper was a mandatory
component for Solr 4, and it is immensely useful to know that it is
possible to do without.

/Martin Koch


On Fri, Mar 1, 2013 at 3:58 PM, Shawn Heisey s...@elyograg.org wrote:

 On 3/1/2013 7:34 AM, Martin Koch wrote:

 Most of the time things run just fine; however, we see this error every so
 often, and fix it as described.

 How do I run solr in non cloud mode? Could you point me to a description?


 The zookeeper options are required for cloud mode - zkHost to tell it
 about all your zookeeper nodes, zkRun to run an embedded zookeeper server.
  If you don't have these in solr.xml or your startup commandline, Solr 4.x
 will not be running in cloud mode, just like earlier versions.

 Thanks,
 Shawn




Blogpost about SOLR at Issuu

2013-02-13 Thread Martin Koch
Hi list

I have written a blog post about the use of SOLR for searching at Issuu
(http://www.issuu.com).

To give you a sense of the scale, Issuu indexes more than 9 million
documents and 200 million pages. In January Issuu had 4.3 billion pageviews
and over 125.8 million visits (60.1 million unique).

You can see the blog post here:
http://blog.issuu.com/post/41189476451/how-search-at-issuu-actually-works

Happy reading,
/Martin Koch - Senior Systems Architect - Issuu.


Re: SolrCloud and external file fields

2012-11-28 Thread Martin Koch
Mikhail

I haven't experimented further yet. I think that the previous experiment of
issuing a commit to a specific core proved that all cores get the commit,
so I don't think that this approach will work.

Thanks,
/Martin


On Tue, Nov 27, 2012 at 6:24 PM, Mikhail Khludnev 
mkhlud...@griddynamics.com wrote:

 Martin,

 It's still not clear to me whether you solve the problem completely or
 partially:
 Does reducing number of cores free some resources for searching during
 commit?
 Does the commiting one-by-one core prevents the freeze?

 Thanks


 On Thu, Nov 22, 2012 at 4:31 PM, Martin Koch m...@issuu.com wrote:

 Mikhail

 To avoid freezes we deployed the patches that are now on the 4.1 trunk
 (bug
 3985). But this wasn't good enough, because SOLR would still take very
 long
 to restart when that was necessary.

 I don't see how we could throw more hardware at the problem without making
 it worse, really - the only solution here would be *fewer* shards, not

 more.

 IMO it would be ideal if the lucene/solr community could come up with a
 good way of updating fields in a document without reindexing. This could
 be
 by linking to some external data store, or in the lucene/solr internals.
 If
 it would make things easier, a good first step would be to have
 dynamically
 updateable numerical fields only.

 /Martin

 On Wed, Nov 21, 2012 at 8:51 PM, Mikhail Khludnev 
 mkhlud...@griddynamics.com wrote:

  Martin,
 
  I don't think solrconfig.xml shed any light on. I've just found what I
  didn't get in your setup - the way of how to explicitly assigning core
 to
  collection. Now, I realized most of details after all!
  Ball is on your side, let us know whether you have managed your cores to
  commit one by one to avoid freeze, or could you eliminate pauses by
  allocating more hardware?
  Thanks in advance!
 
 
  On Wed, Nov 21, 2012 at 3:56 PM, Martin Koch m...@issuu.com wrote:
 
   Mikhail,
  
   PSB
  
   On Wed, Nov 21, 2012 at 10:08 AM, Mikhail Khludnev 
   mkhlud...@griddynamics.com wrote:
  
On Wed, Nov 21, 2012 at 11:53 AM, Martin Koch m...@issuu.com
 wrote:
   

 I wasn't aware until now that it is possible to send a commit to
 one
   core
 only. What we observed was the effect of curl
 localhost:8080/solr/update?commit=true but perhaps we should
  experiment
 with solr/coreN/update?commit=true. A quick trial run seems to
  indicate
 that a commit to a single core causes commits on all cores.

You should see something like this in the log:
... SolrCmdDistributor  Distrib commit to: ...
   
Yup, a commit towards a single core results in a commit on all
 cores.
  
  


 Perhaps I should clarify that we are using SOLR as a black box;
 we do
   not
 touch the code at all - we only install the distribution WAR file
 and
 proceed from there.

I still don't understand how you deploy/launch Solr. How many jettys
  you
start whether you have -DzkRun -DzkHost -DnumShards=2  or you
 specifies
shards= param for every request and distributes updates yourself?
 What
collections do you create and with which settings?
   
We let SOLR do the sharding using one collection with 16 SOLR cores
   holding one shard each. We launch only one instance of jetty with the
   following arguments:
  
   -DnumShards=16
   -DzkHost=zookeeperhost:port
   -Xmx10G
   -Xms10G
   -Xmn2G
   -server
  
   Would you like to see the solrconfig.xml?
  
   /Martin
  
  


  Also from my POV such deployments should start at least from
 *16*
   4-way
  vboxes, it's more expensive, but should be much better available
   during
  cpu-consuming operations.
 

 Do you mean that you recommend 16 hosts with 4 cores each? Or 4
 hosts
with
 16 cores? Or am I misunderstanding something :) ?

I prefer to start from 16 hosts with 4 cores each.
   
   


  Other details, if you use single jetty for all of them, are you
  sure
that
  jetty's threadpool doesn't limit requests? is it large enough?
  You have 60G and set -Xmx=10G. are you sure that total size of
  cores
 index
  directories is less than 45G?
 
  The total index size is 230 GB, so it won't fit in ram, but
 we're
   using
 an
 SSD disk to minimize disk access time. We have tried putting the
 EFF
onto a
 ram disk, but this didn't have a measurable effect.

 Thanks,
 /Martin


  Thanks
 
 
  On Wed, Nov 21, 2012 at 2:07 AM, Martin Koch m...@issuu.com
  wrote:
 
   Mikhail
  
   PSB
  
   On Tue, Nov 20, 2012 at 7:22 PM, Mikhail Khludnev 
   mkhlud...@griddynamics.com wrote:
  
Martin,
   
Please find additional question from me below.
   
Simone,
   
I'm sorry for hijacking your thread. The only what I've
 heard
   about
 it
  at
recent ApacheCon sessions is that Zookeeper

Re: How to post atomic updates using xml

2012-11-27 Thread Martin Koch
Are all your fields marked as stored in your schema? This is a
requirement for atomic updates.

/Martin Koch


On Mon, Nov 26, 2012 at 7:58 PM, Darniz rnizamud...@edmunds.com wrote:

 i tried using the same logic to update a specific field and to my surprise
 all my other fields were lost. i had a doc with almost 50 fields and i
 wanted to update only the gender field i issued the below command

 curl http://host:8080/solr/update?commit=true -H 'Content-type:text/xml'
 -d
 '<add><doc><field name="id">63481697</field><field name="authorGender"
 update="set">male</field></doc></add>'

 to me it looks like it replaced my entire document.

 Can you please let me know what went wrong



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/How-to-post-atomic-updates-using-xml-tp4007323p4022424.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: SolrCloud and external file fields

2012-11-23 Thread Martin Koch
The short answer is no; the number was chosen in an attempt to get as many
cores working in parallel to complete the search faster, but I realize that
there is an overhead incurred by distributing the query and merging the results.
We've now gone to 8 shards and will be monitoring performance.

/Martin


On Thu, Nov 22, 2012 at 3:53 PM, Yonik Seeley yo...@lucidworks.com wrote:

 On Tue, Nov 20, 2012 at 4:16 AM, Martin Koch m...@issuu.com wrote:
  around 7M documents in the index; each document has a 45 character ID.

 7M documents isn't that large.  Is there a reason why you need so many
 shards (16 in your case) on a single box?

 -Yonik
 http://lucidworks.com



Re: SolrCloud and external file fields

2012-11-22 Thread Martin Koch
Mikhail

To avoid freezes we deployed the patches that are now on the 4.1 trunk
(SOLR-3985). But this wasn't good enough, because SOLR would still take very long
to restart when that was necessary.

I don't see how we could throw more hardware at the problem without making
it worse, really - the only solution here would be *fewer* shards, not
more.

IMO it would be ideal if the lucene/solr community could come up with a
good way of updating fields in a document without reindexing. This could be
by linking to some external data store, or in the lucene/solr internals. If
it would make things easier, a good first step would be to have dynamically
updateable numerical fields only.

/Martin

On Wed, Nov 21, 2012 at 8:51 PM, Mikhail Khludnev 
mkhlud...@griddynamics.com wrote:

 Martin,

 I don't think solrconfig.xml shed any light on. I've just found what I
 didn't get in your setup - the way of how to explicitly assigning core to
 collection. Now, I realized most of details after all!
 Ball is on your side, let us know whether you have managed your cores to
 commit one by one to avoid freeze, or could you eliminate pauses by
 allocating more hardware?
 Thanks in advance!


 On Wed, Nov 21, 2012 at 3:56 PM, Martin Koch m...@issuu.com wrote:

  Mikhail,
 
  PSB
 
  On Wed, Nov 21, 2012 at 10:08 AM, Mikhail Khludnev 
  mkhlud...@griddynamics.com wrote:
 
   On Wed, Nov 21, 2012 at 11:53 AM, Martin Koch m...@issuu.com wrote:
  
   
I wasn't aware until now that it is possible to send a commit to one
  core
only. What we observed was the effect of curl
localhost:8080/solr/update?commit=true but perhaps we should
 experiment
with solr/coreN/update?commit=true. A quick trial run seems to
 indicate
that a commit to a single core causes commits on all cores.
   
   You should see something like this in the log:
   ... SolrCmdDistributor  Distrib commit to: ...
  
   Yup, a commit towards a single core results in a commit on all cores.
 
 
   
   
Perhaps I should clarify that we are using SOLR as a black box; we do
  not
touch the code at all - we only install the distribution WAR file and
proceed from there.
   
   I still don't understand how you deploy/launch Solr. How many jettys
 you
   start whether you have -DzkRun -DzkHost -DnumShards=2  or you specifies
   shards= param for every request and distributes updates yourself? What
   collections do you create and with which settings?
  
   We let SOLR do the sharding using one collection with 16 SOLR cores
  holding one shard each. We launch only one instance of jetty with the
  following arguments:
 
  -DnumShards=16
  -DzkHost=zookeeperhost:port
  -Xmx10G
  -Xms10G
  -Xmn2G
  -server
 
  Would you like to see the solrconfig.xml?
 
  /Martin
 
 
   
   
 Also from my POV such deployments should start at least from *16*
  4-way
 vboxes, it's more expensive, but should be much better available
  during
 cpu-consuming operations.

   
Do you mean that you recommend 16 hosts with 4 cores each? Or 4 hosts
   with
16 cores? Or am I misunderstanding something :) ?
   
   I prefer to start from 16 hosts with 4 cores each.
  
  
   
   
 Other details, if you use single jetty for all of them, are you
 sure
   that
 jetty's threadpool doesn't limit requests? is it large enough?
 You have 60G and set -Xmx=10G. are you sure that total size of
 cores
index
 directories is less than 45G?

 The total index size is 230 GB, so it won't fit in ram, but we're
  using
an
SSD disk to minimize disk access time. We have tried putting the EFF
   onto a
ram disk, but this didn't have a measurable effect.
   
Thanks,
/Martin
   
   
 Thanks


 On Wed, Nov 21, 2012 at 2:07 AM, Martin Koch m...@issuu.com
 wrote:

  Mikhail
 
  PSB
 
  On Tue, Nov 20, 2012 at 7:22 PM, Mikhail Khludnev 
  mkhlud...@griddynamics.com wrote:
 
   Martin,
  
   Please find additional question from me below.
  
   Simone,
  
   I'm sorry for hijacking your thread. The only what I've heard
  about
it
 at
   recent ApacheCon sessions is that Zookeeper is supposed to
   replicate
  those
   files as configs under solr home. And I'm really looking
 forward
  to
 know
   how it works with huge files in production.
  
   Thank You, Guys!
  
   20.11.2012 18:06 пользователь Martin Koch m...@issuu.com
   написал:
   
Hi Mikhail
   
Please see answers below.
   
On Tue, Nov 20, 2012 at 12:28 PM, Mikhail Khludnev 
mkhlud...@griddynamics.com wrote:
   
 Martin,

 Thank you for telling your own war-story. It's really
  useful
for
 community.
 The first question might seems not really conscious, but
  would
you
  tell
   me
 what blocks searching during EFF reload, when it's
 triggered

Re: SolrCloud and external file fields

2012-11-21 Thread Martin Koch
On Wed, Nov 21, 2012 at 7:08 AM, Mikhail Khludnev 
mkhlud...@griddynamics.com wrote:

 On Wed, Nov 21, 2012 at 2:07 AM, Martin Koch m...@issuu.com wrote:

   I'm not sure about the mmap directory or where that
  would be configured in solr - can you explain that?
 

 You can check it at Solr Admin/Statistics/core/searcher/stats/readerDir
 should be org.apache.lucene.store.MMapDirectory

 It says '
org.apache.lucene.store.NRTCachingDirectory:NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@
'

/Martin

--
 Sincerely yours
 Mikhail Khludnev
 Principal Engineer,
 Grid Dynamics

 http://www.griddynamics.com
 mkhlud...@griddynamics.com



Re: SolrCloud and external file fields

2012-11-21 Thread Martin Koch
Mikhail,

PSB

On Wed, Nov 21, 2012 at 10:08 AM, Mikhail Khludnev 
mkhlud...@griddynamics.com wrote:

 On Wed, Nov 21, 2012 at 11:53 AM, Martin Koch m...@issuu.com wrote:

 
  I wasn't aware until now that it is possible to send a commit to one core
  only. What we observed was the effect of curl
  localhost:8080/solr/update?commit=true but perhaps we should experiment
  with solr/coreN/update?commit=true. A quick trial run seems to indicate
  that a commit to a single core causes commits on all cores.
 
 You should see something like this in the log:
 ... SolrCmdDistributor  Distrib commit to: ...

 Yup, a commit towards a single core results in a commit on all cores.


 
 
  Perhaps I should clarify that we are using SOLR as a black box; we do not
  touch the code at all - we only install the distribution WAR file and
  proceed from there.
 
 I still don't understand how you deploy/launch Solr. How many jettys you
 start whether you have -DzkRun -DzkHost -DnumShards=2  or you specifies
 shards= param for every request and distributes updates yourself? What
 collections do you create and with which settings?

 We let SOLR do the sharding using one collection with 16 SOLR cores
holding one shard each. We launch only one instance of jetty with the
following arguments:

-DnumShards=16
-DzkHost=zookeeperhost:port
-Xmx10G
-Xms10G
-Xmn2G
-server

Would you like to see the solrconfig.xml?

/Martin


 
 
   Also from my POV such deployments should start at least from *16* 4-way
   vboxes, it's more expensive, but should be much better available during
   cpu-consuming operations.
  
 
  Do you mean that you recommend 16 hosts with 4 cores each? Or 4 hosts
 with
  16 cores? Or am I misunderstanding something :) ?
 
 I prefer to start from 16 hosts with 4 cores each.


 
 
   Other details, if you use single jetty for all of them, are you sure
 that
   jetty's threadpool doesn't limit requests? is it large enough?
   You have 60G and set -Xmx=10G. are you sure that total size of cores
  index
   directories is less than 45G?
  
   The total index size is 230 GB, so it won't fit in ram, but we're using
  an
  SSD disk to minimize disk access time. We have tried putting the EFF
 onto a
  ram disk, but this didn't have a measurable effect.
 
  Thanks,
  /Martin
 
 
   Thanks
  
  
   On Wed, Nov 21, 2012 at 2:07 AM, Martin Koch m...@issuu.com wrote:
  
Mikhail
   
PSB
   
On Tue, Nov 20, 2012 at 7:22 PM, Mikhail Khludnev 
mkhlud...@griddynamics.com wrote:
   
 Martin,

 Please find additional question from me below.

 Simone,

 I'm sorry for hijacking your thread. The only what I've heard about
  it
   at
 recent ApacheCon sessions is that Zookeeper is supposed to
 replicate
those
 files as configs under solr home. And I'm really looking forward to
   know
 how it works with huge files in production.

 Thank You, Guys!

 20.11.2012 18:06 пользователь Martin Koch m...@issuu.com
 написал:
 
  Hi Mikhail
 
  Please see answers below.
 
  On Tue, Nov 20, 2012 at 12:28 PM, Mikhail Khludnev 
  mkhlud...@griddynamics.com wrote:
 
   Martin,
  
   Thank you for telling your own war-story. It's really useful
  for
   community.
   The first question might seems not really conscious, but would
  you
tell
 me
   what blocks searching during EFF reload, when it's triggered by
handler
 or
   by listener?
  
 
  We continuously index new documents using CommitWithin to get
  regular
  commits. However, we observed that the EFFs were not re-read, so
 we
   had
 to
  do external commits (curl '.../solr/update?commit=true') to force
reload.
  When this is done, solr blocks. I can't tell you exactly why it's
   doing
  that (it was related to SOLR-3985).

 Is there a chance to get a thread dump when they are blocked?


Well I could try to recreate the situation. But the setup is fairly
   simple:
Create a large EFF in a largeish index with many shards. Issue a
  commit,
and then try to do a search. Solr will not respond to the search
 before
   the
commit has completed, and this will take a long time.
   
   

 
 
   I don't really get the sentence about sequential commits and
  number
of
   cores. Do I get right that file is replicated via Zookeeper?
   Doesn't
it
  
 
  Again, this is observed behavior. When we issue a commit on a
  system
with
 a
  system with many solr cores using EFFs, the system blocks for a
  long
time
  (15 minutes).  We do NOT use zookeeper for anything. The EFF is a
symlink
  from each cores index dir to the actual file, which is updated by
  an
  external process.

 Hold on, I asked about Zookeeper because the subj mentions
 SolrCloud.

 Do you use SolrCloud, SolrShards, or these cores are just

Re: SolrCloud and external file fields

2012-11-20 Thread Martin Koch
Solr 4.0 does support using EFFs, but it might not give you what you're
hoping for.

We tried using Solr Cloud, and have given up again.

The EFF is placed in the parent of the index directory in each core; each
core reads the entire EFF and picks out the IDs that it is responsible for.

In the current 4.0.0 release of solr, solr blocks (doesn't answer queries)
while re-reading the EFF. Even worse, it seems that the time to re-read the
EFF is multiplied by the number of cores in use (i.e. the EFF is re-read by
each core sequentially). The contents of the EFF become active after the
first EXTERNAL commit (commitWithin does NOT work here) after the file has
been updated.
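To make that concrete, here is roughly what the layout and the forced reload
look like on one of our cores (paths and the field/file name external_reads
are illustrative, not our real ones; Solr looks for a file named
external_<fieldname> next to the index directory):

  ls /var/solr/core0/data
  external_reads  index/
  head -2 /var/solr/core0/data/external_reads
  document-id-1=42
  document-id-2=17
  # commitWithin does not trigger a re-read; only an explicit commit does:
  curl 'http://localhost:8080/solr/update?commit=true'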

In our case, the EFF was quite large - around 450MB - and we use 16 shards,
so when we triggered an external commit to force re-reading, the whole
system would block for several (10-15) minutes. This won't work in a
production environment. The reason for the size of the EFF is that we have
around 7M documents in the index; each document has a 45 character ID.

We got some help to try to fix the problem so that the re-read of the EFF
proceeds in the background (see
https://issues.apache.org/jira/browse/SOLR-3985 for
a fix on the 4.1 branch). However, even though the re-read proceeds in the
background, the time required to launch solr now takes at least as long as
re-reading the EFFs. Again, this is not good enough for our needs.

The next issue is that you cannot sort on EFF fields (though you can return
them as values using fl=field(my_eff_field)). This is also fixed in the 4.1
branch, here: https://issues.apache.org/jira/browse/SOLR-4022.
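For example, returning the value (without sorting on it) works in 4.0 with
something like the following; the core URL and the alias are illustrative:

  curl 'http://localhost:8080/solr/select?q=*:*&fl=id,reads:field(my_eff_field)'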

So: Even after these fixes, EFF performance is not that great. Our solution
is as follows: The actual value of the popularity measure (say, reads) that
we want to report to the user is inserted into the search response
post-query by our query front-end. This value will then be the
authoritative value at the time of the query. The value of the popularity
measure that we use for boosting in the ranking of the search results is
only updated when the value has changed enough so that the impact on the
boost will be significant (say, more than 2%). This does require frequent
re-indexing of the documents that have significant changes in the number of
reads, but at least we won't have to update a document if it moves from,
say, 100 to 101 reads.

/Martin Koch - ISSUU - senior systems architect.

On Mon, Nov 19, 2012 at 3:22 PM, Simone Gianni simo...@apache.org wrote:

 Hi all,
 I'm planning to move a quite big Solr index to SolrCloud. However, in this
 index, an external file field is used for popularity ranking.

 Does SolrCloud supports external file fields? How does it cope with
 sharding and replication? Where should the external file be placed now that
 the index folder is not local but in the cloud?

 Are there otherwise other best practices to deal with the use cases
 external file fields were used for, like popularity/ranking, in SolrCloud?
 Custom ValueSources going to something external?

 Thanks in advance,
 Simone



Re: SolrCloud and external file fields

2012-11-20 Thread Martin Koch
Hi Mikhail

Please see answers below.

On Tue, Nov 20, 2012 at 12:28 PM, Mikhail Khludnev 
mkhlud...@griddynamics.com wrote:

 Martin,

 Thank you for telling your own war-story. It's really useful for
 community.
 The first question might seems not really conscious, but would you tell me
 what blocks searching during EFF reload, when it's triggered by handler or
 by listener?


We continuously index new documents using CommitWithin to get regular
commits. However, we observed that the EFFs were not re-read, so we had to
do external commits (curl '.../solr/update?commit=true') to force reload.
When this is done, solr blocks. I can't tell you exactly why it's doing
that (it was related to SOLR-3985).


 I don't really get the sentence about sequential commits and number of
 cores. Do I get right that file is replicated via Zookeeper? Doesn't it


Again, this is observed behavior. When we issue a commit on a system with
many solr cores using EFFs, the system blocks for a long time
(15 minutes).  We do NOT use zookeeper for anything. The EFF is a symlink
from each core's index dir to the actual file, which is updated by an
external process.


 causes scalability problem or long time to reload? Will it help if we'll
 have, let's say ExternalDatabaseField which will pull values from jdbc. ie.


I think the possibility of having some fields being retrieved from an
external, dynamically updatable store would be really interesting. This
could be JDBC, something in-memory like redis, or a NoSql product (e.g.
Cassandra).


 why all cores can't read these values simultaneously?


Again, this is a solr implementation detail that I can't answer :)


 Can you confirm that IDs in the file is ordered by the index term order?


Yes, we sorted the files (standard UNIX sort).


 AFAIK it can impact load time.

Yes, it does.


 Regarding your post-query solution can you tell me if query found 10K
 docs, but I need to display only first page with 100 rows, whether I need
 to pull all 10K results to frontend to order them by the rank?


In our architecture, the clients query an API that generates the SOLR
query, retrieves the relevant additional fields that we needs, and returns
the relevant JSON to the front-end.

In our use case, results are returned from SOLR by the 10's, not by the
1000's, so it is a manageable job. Even so, if solr returned thousands of
results, it would be up to the implementation of the api to augment only
the results that needed to be returned to the front-end.

Even so, patching up a JSON structure with 10K results should be
possible.


 I'm really appreciate if you comment on the questions above.
 PS: It's time to pitch, how much
 https://issues.apache.org/jira/browse/SOLR-4085 Commit-free
 ExternalFileField can help you?


 It looks very interesting :) Does it make it possible to avoid re-reading
the EFF on every commit, and only re-read the values that have actually
changed?

/Martin



 On Tue, Nov 20, 2012 at 1:16 PM, Martin Koch m...@issuu.com wrote:

  Solr 4.0 does support using EFFs, but it might not give you what you're
  hoping for.
 
  We tried using Solr Cloud, and have given up again.
 
  The EFF is placed in the parent of the index directory in each core; each
  core reads the entire EFF and picks out the IDs that it is responsible
 for.
 
  In the current 4.0.0 release of solr, solr blocks (doesn't answer
 queries)
  while re-reading the EFF. Even worse, it seems that the time to re-read
 the
  EFF is multiplied by the number of cores in use (i.e. the EFF is re-read
 by
  each core sequentially). The contents of the EFF become active after the
  first EXTERNAL commit (commitWithin does NOT work here) after the file
 has
  been updated.
 
  In our case, the EFF was quite large - around 450MB - and we use 16
 shards,
  so when we triggered an external commit to force re-reading, the whole
  system would block for several (10-15) minutes. This won't work in a
  production environment. The reason for the size of the EFF is that we
 have
  around 7M documents in the index; each document has a 45 character ID.
 
  We got some help to try to fix the problem so that the re-read of the EFF
  proceeds in the background (see
  https://issues.apache.org/jira/browse/SOLR-3985 for
  a fix on the 4.1 branch). However, even though the re-read proceeds in
 the
  background, the time required to launch solr now takes at least as long
 as
  re-reading the EFFs. Again, this is not good enough for our needs.
 
  The next issue is that you cannot sort on EFF fields (though you can
 return
  them as values using fl=field(my_eff_field). This is also fixed in the
 4.1
  branch here https://issues.apache.org/jira/browse/SOLR-4022.
 
  So: Even after these fixes, EFF performance is not that great. Our
 solution
  is as follows: The actual value of the popularity measure (say, reads)
 that
  we want to report to the user is inserted into the search response
  post-query by our query front

Re: SolrCloud and external file fields

2012-11-20 Thread Martin Koch
Mikhail

PSB

On Tue, Nov 20, 2012 at 7:22 PM, Mikhail Khludnev 
mkhlud...@griddynamics.com wrote:

 Martin,

 Please find additional question from me below.

 Simone,

 I'm sorry for hijacking your thread. The only what I've heard about it at
 recent ApacheCon sessions is that Zookeeper is supposed to replicate those
 files as configs under solr home. And I'm really looking forward to know
 how it works with huge files in production.

 Thank You, Guys!

 20.11.2012 18:06 пользователь Martin Koch m...@issuu.com написал:
 
  Hi Mikhail
 
  Please see answers below.
 
  On Tue, Nov 20, 2012 at 12:28 PM, Mikhail Khludnev 
  mkhlud...@griddynamics.com wrote:
 
   Martin,
  
   Thank you for telling your own war-story. It's really useful for
   community.
   The first question might seems not really conscious, but would you tell
 me
   what blocks searching during EFF reload, when it's triggered by handler
 or
   by listener?
  
 
  We continuously index new documents using CommitWithin to get regular
  commits. However, we observed that the EFFs were not re-read, so we had
 to
  do external commits (curl '.../solr/update?commit=true') to force reload.
  When this is done, solr blocks. I can't tell you exactly why it's doing
  that (it was related to SOLR-3985).

 Is there a chance to get a thread dump when they are blocked?


Well I could try to recreate the situation. But the setup is fairly simple:
Create a large EFF in a largeish index with many shards. Issue a commit,
and then try to do a search. Solr will not respond to the search before the
commit has completed, and this will take a long time.



 
 
   I don't really get the sentence about sequential commits and number of
   cores. Do I get right that file is replicated via Zookeeper? Doesn't it
  
 
  Again, this is observed behavior. When we issue a commit on a system with
 a
  system with many solr cores using EFFs, the system blocks for a long time
  (15 minutes).  We do NOT use zookeeper for anything. The EFF is a symlink
  from each cores index dir to the actual file, which is updated by an
  external process.

 Hold on, I asked about Zookeeper because the subj mentions SolrCloud.

 Do you use SolrCloud, SolrShards, or these cores are just replicas of the
 same index?


Ah - we use solr 4 out of the box, so I guess this is SolrCloud. I'm a bit
unsure about the terminology here, but we've got a single index divided
into 16 shards. Each shard is hosted in a solr core.


 Also, about symlink - Don't you share that file via some NFS?

 No, we generate the EFF on the local solr host (there is only one physical
host that holds all shards), so there is no need for NFS or copying files
around. No need for Zookeeper either.


 how many cores you run per box?

This box is a 16-virtual-core machine (8 hyperthreaded cores) with 60GB of RAM. We
run 16 solr cores on this box in Jetty.


 Do boxes has plenty of ram to cache filesystem beside of jvm heaps?

 Yes. We've allocated 10GB for jetty, and left the rest for the OS.


 I assume you use 64 bit linux and mmap directory. Please confirm that.


We use 64-bit linux. I'm not sure about the mmap directory or where that
would be configured in solr - can you explain that?


 
 
   causes scalability problem or long time to reload? Will it help if
 we'll
   have, let's say ExternalDatabaseField which will pull values from jdbc.
 ie.
  
 
  I think the possibility of having some fields being retrieved from an
  external, dynamically updatable store would be really interesting. This
  could be JDBC, something in-memory like redis, or a NoSql product (e.g.
  Cassandra).

 Ok. Let's have it in mind as a possible direction.


Alternatively, an API that would allow updating a single field for a
document might be an option.



 
 
   why all cores can't read these values simultaneously?
  
 
  Again, this is a solr implementation detail that I can't answer :)
 
 
   Can you confirm that IDs in the file is ordered by the index term
 order?
  
 
  Yes, we sorted the files (standard UNIX sort).
 
 
   AFAIK it can impact load time.
  
  Yes, it does

 Ok, I've got that you aware of it, and your IDs are just strings, not
 integers.


Yes, ids are strings.


 
 
   Regarding your post-query solution can you tell me if query found 1
   docs, but I need to display only first page with 100 rows, whether I
 need
   to pull all 10K results to frontend to order them by the rank?
  
  
  In our architecture, the clients query an API that generates the SOLR
  query, retrieves the relevant additional fields that we needs, and
 returns
  the relevant JSON to the front-end.
 
  In our use case, results are returned from SOLR by the 10's, not by the
  1000's, so it is a manageable job. Even so, if solr returned thousands of
  results, it would be up to the implementation of the api to augment only
  the results that needed to be returned to the front-end.
 
  Even so, patching up a JSON structure with 1 results should be
  possible

Re: SolrCloud and external file fields

2012-11-20 Thread Martin Koch
Mikhail

I appreciate your input, it's very useful :)

On Wed, Nov 21, 2012 at 6:30 AM, Mikhail Khludnev 
mkhlud...@griddynamics.com wrote:

 Martin,
 This deployment seems a little bit confusing to me. You have 16-way fairy
 virtual box, and send 16 request for really heavy operation at the same
 moment, it does not surprise me that you loosing it for some period of
 time. At that time you should have more than 16 in load average metrics.
 I suggest to send commit to those cores one-by-one and have inconsistency
 and some sort of blinking as a trade-off for availability. In this case
 only single virtual CPU will be fully consumed by the commit's _thread
 divergence action_ and others will serve requests.


I wasn't aware until now that it is possible to send a commit to one core
only. What we observed was the effect of curl
localhost:8080/solr/update?commit=true but perhaps we should experiment
with solr/coreN/update?commit=true. A quick trial run seems to indicate
that a commit to a single core causes commits on all cores.


Perhaps I should clarify that we are using SOLR as a black box; we do not
touch the code at all - we only install the distribution WAR file and
proceed from there.


 Also from my POV such deployments should start at least from *16* 4-way
 vboxes, it's more expensive, but should be much better available during
 cpu-consuming operations.


Do you mean that you recommend 16 hosts with 4 cores each? Or 4 hosts with
16 cores? Or am I misunderstanding something :) ?


 Other details, if you use single jetty for all of them, are you sure that
 jetty's threadpool doesn't limit requests? is it large enough?
 You have 60G and set -Xmx=10G. are you sure that total size of cores index
 directories is less than 45G?

 The total index size is 230 GB, so it won't fit in ram, but we're using an
SSD disk to minimize disk access time. We have tried putting the EFF onto a
ram disk, but this didn't have a measurable effect.

Thanks,
/Martin


 Thanks


 On Wed, Nov 21, 2012 at 2:07 AM, Martin Koch m...@issuu.com wrote:

  Mikhail
 
  PSB
 
  On Tue, Nov 20, 2012 at 7:22 PM, Mikhail Khludnev 
  mkhlud...@griddynamics.com wrote:
 
   Martin,
  
   Please find additional question from me below.
  
   Simone,
  
   I'm sorry for hijacking your thread. The only what I've heard about it
 at
   recent ApacheCon sessions is that Zookeeper is supposed to replicate
  those
   files as configs under solr home. And I'm really looking forward to
 know
   how it works with huge files in production.
  
   Thank You, Guys!
  
   20.11.2012 18:06 пользователь Martin Koch m...@issuu.com написал:
   
Hi Mikhail
   
Please see answers below.
   
On Tue, Nov 20, 2012 at 12:28 PM, Mikhail Khludnev 
mkhlud...@griddynamics.com wrote:
   
 Martin,

 Thank you for telling your own war-story. It's really useful for
 community.
 The first question might seems not really conscious, but would you
  tell
   me
 what blocks searching during EFF reload, when it's triggered by
  handler
   or
 by listener?

   
We continuously index new documents using CommitWithin to get regular
commits. However, we observed that the EFFs were not re-read, so we
 had
   to
do external commits (curl '.../solr/update?commit=true') to force
  reload.
When this is done, solr blocks. I can't tell you exactly why it's
 doing
that (it was related to SOLR-3985).
  
   Is there a chance to get a thread dump when they are blocked?
  
  
  Well I could try to recreate the situation. But the setup is fairly
 simple:
  Create a large EFF in a largeish index with many shards. Issue a commit,
  and then try to do a search. Solr will not respond to the search before
 the
  commit has completed, and this will take a long time.
 
 
  
   
   
 I don't really get the sentence about sequential commits and number
  of
 cores. Do I get right that file is replicated via Zookeeper?
 Doesn't
  it

   
Again, this is observed behavior. When we issue a commit on a system
  with
   a
system with many solr cores using EFFs, the system blocks for a long
  time
(15 minutes).  We do NOT use zookeeper for anything. The EFF is a
  symlink
from each cores index dir to the actual file, which is updated by an
external process.
  
   Hold on, I asked about Zookeeper because the subj mentions SolrCloud.
  
   Do you use SolrCloud, SolrShards, or these cores are just replicas of
 the
   same index?
  
 
  Ah - we use solr 4 out of the box, so I guess this is SolrCloud. I'm a
 bit
  unsure about the terminology here, but we've got a single index divided
  into 16 shard. Each shard is hosted in a solr core.
 
 
   Also, about simlink - Don't you share that file via some NFS?
  
   No, we generate the EFF on the local solr host (there is only one
  physical
  host that holds all shards), so there is no need for NFS or copying files
  around. No need for Zookeeper either.
 
 
   how many cores

Re: solr blocking on commit

2012-11-01 Thread Martin Koch
Are you using solr 4.0? We had some problems similar to this (not in a
master/slave setup, though), where the resolution was to disable the
transaction log, i.e. remove <updateLog> in the <updateHandler> section -
we don't need NRT get, so this isn't important to us.

Cheers,
/Martin Koch

On Thu, Nov 1, 2012 at 1:25 AM, dbabits dbab...@gmail.com wrote:

 I second the original poster- all selects are blocked during commits.
 I have Master replicating to Slave.
 Indexing happens to Master, few docs/about every 30 secs
 Selects are run against Slave.

 This is the pattern from the Slave log:

 Oct 30, 2012 12:33:23 AM org.apache.solr.core.SolrDeletionPolicy
 updateCommits
 INFO: newest commit = 1349195567630
 Oct 30, 2012 12:33:42 AM org.apache.solr.core.SolrCore execute
 INFO: [core3] webapp=/solr path=/select

 During the 19 seconds that you see between the 2 lines, the /select is
 blocked, until the commit is done.
 This has nothing to do with jvm, I'm monitoring the memory and GC stats
 with
 jConsole and log.
 I played with all settings imaginable: commitWithin, commit=true,
 useColdSearcher, autoWarming settings from 0 on-nothing helps.

 The environment is: 3.6.0, RHEL Linux 5.3.2, 64-bit, 96G RAM, 6 CPU cores,
 java 1.6.0_24, ~70 million docs.
 As soon as I suspend replication (command=disablepoll), everything becomes
 fast.
 As soon as I enable it - it pretty much becomes useless.
 Querying Master directly exibits the same problem of course.

 Thanks a lot for your help.



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/solr-blocking-on-commit-tp474874p4017416.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: is it possible to index

2012-10-24 Thread Martin Koch
In my experience, about as fast as you can push the new data :) Depending
on the size of your records, this should be a matter of seconds.

/Martin Koch

On Wed, Oct 24, 2012 at 9:01 PM, Marcelo Elias Del Valle mvall...@gmail.com
 wrote:

 Erick,

  Thanks for the help, it sure helps a lot to read that, as it gives me
 more confidence I am not crazy about what I am thinking.
  The only problem I see by de-normalizing data as you said is that if
 any relation between customer and vendor changes, I will have to update the
 index for all the vendors. I could have about 10 000 customers per vendor.
  Anyway, by what you're saying, it's more common than I was imagining,
 right? I wonder how long solr will take to reindex 10 000 records when this
 happens.

 Thanks,
 Marcelo Valle.

 2012/10/24 Erick Erickson erickerick...@gmail.com

  One, take off your RDBMS cap G...
 
  DB folks regularly reject the idea of de-normalizing data
  to make best use of Solr, but that's what I would explore
  first. Yes, this repeats the, in your case, vendor information
  perhaps many times, but try that first, even though that
  causes you to update multiple customers whenever a vendor
  changes. You haven't specified how many customers and vendors
  you're talking abou there, but unless the total number of documents
  (where each document is a customer+vendor combination)
  is multiple tens of millions, you probably will be fine.
 
  You can get a list of just customers by using grouping where you
  group on customer, although that may not be the most efficient. You
  could index a field, call it cust_filter that was set to true for the
  first
  customer/vendor you indexed and false (or just left out) for all the
  rest and q=blahblah&fq=cust_filter:true.
 
  Hope that helps
  Erick
 
  On Wed, Oct 24, 2012 at 12:01 PM, Marcelo Elias Del Valle
  mvall...@gmail.com wrote:
   Hello,
  
   I am new to Solr and I have a scenario where I want to use it, but
 I
   might be misunderstanding some concepts. I will explain what I want
 here,
   if someone has a solution for this, I would gladly accept the help.
   I have a core indexing customers. I have another core indexing
  vendors.
   Both are related to each other.
   Here is what I want to do in my application: I want to find all the
   customers that follow some criteria and them find the vendors related
 to
   them.
  
   My first option was to to have just vendor core and in for each
   document in vendor core I would have all the customers related to it.
   However, I would write the same customer several times to the index, as
   more than one vendor could be related to the same customer. Besides, I
   wonder how would I write a query to list just the different customers.
   Another problem is that I update customers in a different frequency I
   update vendors, but have vendor + customers in a single document would
  oblige
   me to do the full update.
  
   Does anyone have a good solution for this I am not being able to
  see? I
   might be missing some basic concept here...
  
   Thanks,
   --
   Marcelo Elias Del Valle
   http://mvalle.com - @mvallebr
 



 --
 Marcelo Elias Del Valle
 http://mvalle.com - @mvallebr



Reloading ExternalFileField blocks Solr

2012-10-08 Thread Martin Koch
Hi List

We're using Solr-4.0.0-Beta with a 7M document index running on a single
host with 16 shards. We'd like to use an ExternalFileField to hold a value
that changes often. However, we've discovered that the file is apparently
re-read by every shard/core on *every commit*; the index is unresponsive in
this period (around 20s on the host we're running on). This is unacceptable
for our needs. In the future, we'd like to add other values as
ExternalFileFields, and this will make the problem worse.

It would be better if the external file were instead read in in the
background, updating previously read relevant values for each shard as they
are read in.

I guess a change in the ExternalFileField code would be required to achieve
this, but I have no experience here, so suggestions are very welcome.

Thanks,
/Martin Koch - Issuu - Senior Systems Architect.


Re: Reloading ExternalFileField blocks Solr

2012-10-08 Thread Martin Koch
Sure: We're boosting search results based on user actions which could be
e.g. the number of times a particular document has been read. In future,
we'd also like to boost by e.g. impressions (the number of times a document
has been displayed) and other values.

/Martin

On Mon, Oct 8, 2012 at 7:02 PM, Mikhail Khludnev mkhlud...@griddynamics.com
 wrote:

 Martin,

 Can you tell me what's the content of that field, and how it should affect
 search result?

 On Mon, Oct 8, 2012 at 12:55 PM, Martin Koch m...@issuu.com wrote:

  Hi List
 
  We're using Solr-4.0.0-Beta with a 7M document index running on a single
  host with 16 shards. We'd like to use an ExternalFileField to hold a
 value
  that changes often. However, we've discovered that the file is apparently
  re-read by every shard/core on *every commit*; the index is unresponsive
 in
  this period (around 20s on the host we're running on). This is
 unacceptable
  for our needs. In the future, we'd like to add other values as
  ExternalFileFields, and this will make the problem worse.
 
  It would be better if the external file were instead read in in the
  background, updating previously read relevant values for each shard as
 they
  are read in.
 
  I guess a change in the ExternalFileField code would be required to
 achieve
  this, but I have no experience here, so suggestions are very welcome.
 
  Thanks,
  /Martin Koch - Issuu - Senior Systems Architect.
 



 --
 Sincerely yours
 Mikhail Khludnev
 Tech Lead
 Grid Dynamics

 http://www.griddynamics.com
  mkhlud...@griddynamics.com



Re: SolrCloud 4.0 ALPHA, replicas, large commit times

2012-08-27 Thread Martin Koch
(I'm working with Raghav on this): We've got several parallel workers that
add documents in batches of 16 through pysolr. With commitWithin set to 60
seconds, the commit causes solr to freeze; with commitWithin at only 5
seconds, everything seems to work fine. In both cases, throughput is
around 500 documents / second.
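(For reference, commitWithin here is the update-request parameter, in
milliseconds - a sketch with a made-up endpoint and document:

  curl 'http://localhost:8983/solr/update?commitWithin=60000' \
       -H 'Content-type:text/xml' \
       --data-binary '<add><doc><field name="id">doc-1</field></doc></add>'

i.e. Solr should commit the added documents within 60 seconds of receiving
them.)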

We can certainly give it a try with the Beta.

Thanks,
/Martin

On Mon, Aug 27, 2012 at 7:30 PM, Mark Miller markrmil...@gmail.com wrote:

 How are you adding the docs? In batch, streaming, a doc at a time?

 Any chance you can try with the Beta?

 On Mon, Aug 27, 2012 at 9:35 AM, Raghav Karol r...@issuu.com wrote:
  Hello *,
 
  We are using SolrClould 4.0 - Alpha and have a 4 machine setup.
 
  Machine 1 - 16 Solr cores - Shard 1 - 16
  Machine 2 - 16 Solr cores - Shard 17 - 32
  Machine 3 - 16 Solr cores - Replica 1 - 16
  Machine 4 - 16 Solr cores - Replica 17 - 32
 
  Index at 500 docs/sec and committing every 60 seconds, i.e., 30,000
 Documents causes Solr to freeze. There is nothing in the logs to indicate
 errors or replication activity - Solr just appeared to freeze.
 
  Increasing the commit frequency we observed that commits of at most
 2,500 docs worked fine.
 
  Are we using SolrCloud and replications incorrectly?
 
  --
  Raghav
 
 
 



 --
 - Mark

 http://www.lucidimagination.com



Re: SolrCloud 4.0 ALPHA, replicas, large commit times

2012-08-27 Thread Martin Koch
It actually is Beta that we're working with.

/Martin

On Mon, Aug 27, 2012 at 10:38 PM, Martin Koch m...@issuu.com wrote:

 (I'm working with Raghav on this): We've got several parallel workers that
 add documents in batches of 16 through pysolr, and using commitWithin at 60
 seconds when the commit causes solr to freeze; if the commit is only 5
 seconds, then everything seems to work fine. In both cases, throughput is
 around 500 documents / second.

 We can certainly give it a try with the Beta.

 Thanks,
 /Martin


 On Mon, Aug 27, 2012 at 7:30 PM, Mark Miller markrmil...@gmail.comwrote:

 How are you adding the docs? In batch, streaming, a doc at a time?

 Any chance you can try with the Beta?

 On Mon, Aug 27, 2012 at 9:35 AM, Raghav Karol r...@issuu.com wrote:
  Hello *,
 
  We are using SolrClould 4.0 - Alpha and have a 4 machine setup.
 
  Machine 1 - 16 Solr cores - Shard 1 - 16
  Machine 2 - 16 Solr cores - Shard 17 - 32
  Machine 3 - 16 Solr cores - Replica 1 - 16
  Machine 4 - 16 Solr cores - Replica 17 - 32
 
  Index at 500 docs/sec and committing every 60 seconds, i.e., 30,000
 Documents causes Solr to freeze. There is nothing in the logs to indicate
 errors or replication activity - Solr just appeared to freeze.
 
  Increasing the commit frequency we observed that commits of at most
 2,500 docs worked fine.
 
  Are we using SolrCloud and replications incorrectly?
 
  --
  Raghav
 
 
 



 --
 - Mark

 http://www.lucidimagination.com





Re: Solr advanced boosting

2012-03-29 Thread Martin Koch
We're doing something similar: We want to combine search relevancy with a
fitness value computed from several other data sources.

For this, we pre-compute the fitness value for each document and store it in a
flat file (lines of the format document_id=fitness_score) that we access from
Solr using an externalFileField
(http://lucene.apache.org/solr/api/org/apache/solr/schema/ExternalFileField.html).

This file can be updated at regular intervals, e.g. to reflect recent views
or up/downvotes. It is re-read by solr on every commit.

The fitness field can then be included as a boost field in a (e)dismax
query.
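A minimal sketch of the moving parts (the field name fitness, file contents
and URLs are examples, not our production values):

  # flat file read by the externalFileField, one line per document:
  #   doc-1=1.8
  #   doc-2=0.4
  # force solr to re-read the file:
  curl 'http://localhost:8983/solr/update?commit=true'
  # multiplicative boost in an edismax query:
  curl 'http://localhost:8983/solr/select?defType=edismax&q=shoes&boost=field(fitness)'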

/Martin

On Thu, Mar 29, 2012 at 9:56 AM, mads mads...@yahoo.dk wrote:

 Hello everyone!

 I am new to Solr and I have been doing a bit of reading about boosting
 search results. My search index consists of products with different
 attributes like a title, a description, a brand, a price, a discount
 percent
 and so on. I would like to do a fairly complex boosting, so that for
 example
 a hit on the brand name, a low price, a high discount percent is boosted
 compared to a hit in the title, higher prices etc. Basically I would like
 to
 make a more intelligent search with my self-defined boosting algorithm
 or definition. I hope it makes sense. My question is whether more experienced
 Solr people consider this possible, and how I can get started on this
 project? Is it possible to do a kind of a plugin, or?

 Regards Mads

 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Solr-advanced-boosting-tp3867025p3867025.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Practical Optimization

2012-03-24 Thread Martin Koch
Thanks for writing this up. These are good tips.

/Martin

On Fri, Mar 23, 2012 at 9:57 PM, dw5ight dw5i...@gmail.com wrote:

 Hey All-

 we run a car search engine (http://carsabi.com) with Solr and did some
 benchmarking recently after we switched from a hosted service to
 self-hosting. In brief, we went from 800ms complex range queries on a 1.5M
 document corpus to 43ms. The major shifts were switching from EC2 Large to
 EC2 CC8XL which got us down to 282ms (2.82x speed gain due to 2.75x CPU
 speed increase we think), and then down to 43ms when we sharded to 8 cores.
 We tried sharding to 12 and 16 but saw negligible gains after this point.

 Anyway, hope this might be useful to someone - we write up exact stats and
 a step by step sharding procedure on our tech blog
 (http://carsabi.com/car-news/2012/03/23/optimizing-solr-7x-your-search-speed/)
 if anyone's interested.

 best
 Dwight

 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Practical-Optimization-tp3852776p3852776.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Error 500 seek past EOF : SOLR bug?

2012-03-23 Thread Martin Koch
)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)
Caused by: java.io.FileNotFoundException: File does not exist
/mnt/solr.data.0/index.20120323132730/_16c_5s.del
at org.apache.solr.common.util.FileUtils.sync(FileUtils.java:64)
at
org.apache.solr.handler.SnapPuller$FileFetcher$1.run(SnapPuller.java:923)
at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at
java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
... 3 more

Thanks,
/Martin Koch


Re: Simple Slave Replication Question

2012-03-23 Thread Martin Koch
I guess this would depend on network bandwidth, but we move around
150G/hour when hooking up a new slave to the master.

/Martin

On Fri, Mar 23, 2012 at 12:33 PM, Ben McCarthy 
ben.mccar...@tradermedia.co.uk wrote:

 Hello,

 Im looking at the replication from a master to a number of slaves.  I have
 configured it and it appears to be working.  When updating 40K records on
 the master is it standard to always copy over the full index, currently 5gb
 in size.  If this is standard what do people do who have massive 200gb
 indexes, does it not take a while to bring the slaves in line with the master?

 Thanks
 Ben

 






Re: Commit without an update handler?

2012-01-05 Thread Martin Koch
Yes.

However, something must actually have been updated in the index before a
commit on the master causes the slave to update (this is what was confusing
me).

Since I'll be updating the index fairly often, this will not be a problem
for me.

If, however, the external file field is updated often, but the index proper
isn't, this could be a problem.

Thanks,
/Martin
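
For completeness, the commit itself is nothing fancier than this - a sketch
with pysolr, with a placeholder master URL; the point is that it goes to the
master, where the update handler is still enabled:

import pysolr

master = pysolr.Solr("http://master:8983/solr")  # placeholder master URL

# ... rewrite the external file field's data file on the master here ...

# an explicit commit opens a new searcher, which picks up the new file;
# as noted above, the slaves only follow along once the index itself has
# changed and replication has run
master.commit()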

On Thu, Jan 5, 2012 at 2:56 PM, Erick Erickson erickerick...@gmail.com wrote:

 Hmmm, does it work just to put this in the master's index, let
 replication do its tricks, and issue your commit on the master?

 Or am I missing something here?

 Best
 Erick

 On Tue, Jan 3, 2012 at 1:33 PM, Martin Koch m...@issuu.com wrote:
  Hi List
 
  I have a Solr cluster set up in a master/slave configuration where the
  master acts as an indexing node and the slaves serve user requests.
 
  To avoid accidental posts of new documents to the slaves, I have disabled
  the update handlers.
 
  However, I use an externalFileField. When the file is updated, I need to
  issue a commit to reload the new file. This requires an update handler. Is
  there an update handler that doesn't accept new documents, but will effect
  a commit?
 
  Thanks,
  /Martin



Commit without an update handler?

2012-01-03 Thread Martin Koch
Hi List

I have a Solr cluster set up in a master/slave configuration where the
master acts as an indexing node and the slaves serve user requests.

To avoid accidental posts of new documents to the slaves, I have disabled
the update handlers.

However, I use an externalFileField. When the file is updated, I need to
issue a commit to reload the new file. This requires an update handler. Is
there an update handler that doesn't accept new documents, but will effect
a commit?

Thanks,
/Martin


Re: Indexing problem

2011-12-28 Thread Martin Koch
Could it be a commit you're needing?

curl 'localhost:8983/solr/update?commit=true'

/Martin

On Wed, Dec 28, 2011 at 11:47 AM, mumairshamsi mumairsha...@gmail.com wrote:

 http://lucene.472066.n3.nabble.com/file/n3616191/02.xml 02.xml

 I am trying to index this file; to do so, I am using this command:

 java -jar post.jar *.xml

 The command runs fine, but when I search, no results are displayed.

 I think it is an encoding problem. Can anyone help?






Re: Default Search UI not working

2011-12-19 Thread Martin Koch
Have you looked here: http://wiki.apache.org/solr/VelocityResponseWriter ?

/Martin


On Mon, Dec 19, 2011 at 12:44 PM, remi tassing tassingr...@yahoo.com wrote:

 Hello guys,
 the default search UI doesn't work for me.
 http://localhost:8983/solr/browse gives me an HTTP 404 error.
 I'm using Solr-1.4. Any idea how to fix this?
 Remi


Re: Large RDBMS dataset

2011-12-14 Thread Martin Koch
Instead of handling it from within solr, I'd suggest writing an external
application (e.g. in python using pysolr) that wraps the (fast) SQL query
you like. Then retrieve a batch of documents, and write them to solr. For
extra speed, don't commit until you're done.
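
A minimal sketch of that approach, assuming pymssql as the DB-API driver and
with made-up connection details, field names and batch size:

import pymssql  # assumption: any DB-API driver works the same way
import pysolr

conn = pymssql.connect(server="dbhost", user="user", password="secret",
                       database="YooxProcessCluster1")
cursor = conn.cursor()
solr = pysolr.Solr("http://localhost:8983/solr")  # placeholder URL

# one flat query per view instead of three extra queries per record
cursor.execute("SELECT endeca_id, color_descr, size_descr FROM clust_w_fast_dump2_ByMarkets")

BATCH = 1000
while True:
    rows = cursor.fetchmany(BATCH)
    if not rows:
        break
    docs = [{"id": r[0], "color_descr": r[1], "size_descr": r[2]} for r in rows]
    solr.add(docs, commit=False)  # defer the commit

solr.commit()  # a single commit once everything has been sent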

/Martin

On Wed, Dec 14, 2011 at 11:18 AM, Finotti Simone tech...@yoox.com wrote:

 Hello,
 I have a very large dataset (> 1M records) on the RDBMS which I want my
 Solr application to pull data from.

 Problem is that the document fields which I have to index aren't in the
 same table, but I have to join records with two other tables. Well, in fact
 they are views, but I don't think that this makes any difference.

 That's the data import handler that I've actually written:

 <?xml version="1.0"?>
 <dataConfig>
   <dataSource type="JdbcDataSource"
               driver="net.sourceforge.jtds.jdbc.Driver"
               url="jdbc:jtds:sqlserver://YSQLDEV01BLQ/YooxProcessCluster1"
               instance="SVCSQLDEV" />
   <document name="Products">
     <entity name="fd" query="SELECT * FROM clust_w_fast_dump ORDER BY endeca_id;">
       <entity name="fd2" query="SELECT macrocolor_id, color_descr, gsize_descr, size_descr
               FROM clust_w_fast_dump2_ByMarkets WHERE endeca_id='${fd.Endeca_ID}' ORDER BY endeca_id;"/>
       <entity name="cpd" query="SELECT DepartmentCode, Ranking, DepartmentPriceRangeCode
               FROM clust_w_CatalogProductsDepartments_ByMarket WHERE endeca_id='${fd.Endeca_ID}' ORDER BY endeca_id;"/>
       <entity name="env" query="SELECT Environment FROM clust_w_Environment
               WHERE endeca_id='${fd.Endeca_ID}' ORDER BY endeca_id;"/>
     </entity>
   </document>
 </dataConfig>

 It works, but it takes 1 minute 38 seconds to parse 100 records: that's about
 1 rec/s! That means that digesting the whole dataset would take about a
 million seconds (= roughly 12 days).

 The problem is that for each record in fd, Solr issues three separate
 SELECTs against the other three tables. Of course, this is absolutely inefficient.

 Is there a way to have Solr loading every record in the four tables and
 join them when they are already loaded in memory?

 TIA



Re: Solr using very high I/O

2011-12-14 Thread Martin Koch
Do you commit often? If so, try committing less often :)

/Martin
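
If the script currently commits after every file, one pattern that usually
helps is batching the commits - a sketch only; docs_from_files() is a
hypothetical generator and the batch size is arbitrary:

import pysolr

solr = pysolr.Solr("http://localhost:8983/solr")  # placeholder URL
COMMIT_EVERY = 50  # arbitrary

for i, doc in enumerate(docs_from_files(), start=1):  # hypothetical feed
    solr.add([doc], commit=False)  # no commit per document
    if i % COMMIT_EVERY == 0:
        solr.commit()              # commit in batches instead

solr.commit()  # pick up the tail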

On Wed, Dec 7, 2011 at 12:16 PM, Adrian Fita adrian.f...@gmail.com wrote:

 Hi. I'm experiencing an issue where Solr is using huge amounts of I/O.
 Basically it uses the whole HDD continuously, leaving nothing for the
 other processes. Solr is called by a script which continuously indexes
 some files.

 The index is around 800MB and I can't understand why it would thrash
 the HDD so much.

 I could use some help on how to optimize Solr so it doesn't use so much
 I/O.

 Thank you.
 --
 Fita Adrian



Comparing apples & oranges?

2011-11-04 Thread Martin Koch
Hi List

I have a solr index where I want to include numerical fields in my ranking
function as well as keyword relevance. For example, each document has a
document view count, and I'd like to increase the relevancy of documents
that are read often, and penalize documents with a very low view count. I'm
aware that this could be achieved with a filter as well, but ignore that
for this question :) since this will be extended to other numerical fields.

The keyword scoring works just fine and I can include the view count as a
factor in the scoring, but I would like to somehow express that the view
count accounts for e.g. 25% of the total score. This could be achieved by
mapping the view count into some predetermined fixed range and then
performing suitable arithmetic to scale to the score of the query. The
score of the term query is normalized to queryNorm, so I'd like somehow to
express that the view count score should be normalized to the queryNorm.

If I look at the explain of how the score below is computed, the 17.4 is
the part of the score that comes from term relevancy. Searching for another
(set of) terms yields a different queryNorm, so I can't see how I can
a-priori pick a scaling function (I've used log for this example) and boost
factor that will give control of the final contribution of the view count
to the score.

19.14161 = (MATCH) sum of:
  17.403849 = (MATCH) max plus 0.1 times others of:
16.747877 = (MATCH) weight(document:water^4.0 in 1076362), product of:
  0.22298127 = queryWeight(document:water^4.0), product of:
4.0 = boost
2.939238 = idf(docFreq=527730, maxDocs=3669552)
0.018965907 = queryNorm
  75.108894 = (MATCH) fieldWeight(document:water in 1076362), product
of:
25.553865 = tf(termFreq(document:water)=653)
2.939238 = idf(docFreq=527730, maxDocs=3669552)
1.0 = fieldNorm(field=document, doc=1076362)
[snip]
  1.7377597 = (MATCH) FunctionQuery(log(map(int(views),0.0,0.0,1.0))),
product of:
1.8325089 = log(map(int(views)=68,min=0.0,max=0.0,target=1.0))
50.0 = boost
0.018965907 = queryNorm
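
One direction I've been wondering about is mapping the view count into a fixed
range with scale() and applying it as a multiplicative boost, so it scales the
term score instead of being added alongside it - a sketch with edismax and
pysolr (the field names come from the explain above; the 1.0-1.25 range and the
URL are arbitrary):

import pysolr

solr = pysolr.Solr("http://localhost:8983/solr")  # placeholder URL
results = solr.search("water", **{
    "defType": "edismax",
    "qf": "document",
    # scale() maps log(views) into [1.0, 1.25]; as an edismax "boost" it
    # multiplies the relevancy score, so views can add at most ~25%
    # regardless of queryNorm
    "boost": "scale(log(sum(views,1)),1.0,1.25)",
})
print(results.hits)

I'm not sure how expensive scale() is, though - as far as I understand it has
to find the min and max of the function over all documents.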

Thanks in advance for your help,
/Martin