Re: Zookeeper: Could not get shard_id for core
Is it possible to run solr without zookeeper, but still use sharding, if it's all running on one host? Would the shards have to be explicitly included in the query URLs? Thanks, /Martin

On Fri, Mar 1, 2013 at 3:58 PM, Shawn Heisey s...@elyograg.org wrote:

On 3/1/2013 7:34 AM, Martin Koch wrote: Most of the time things run just fine; however, we see this error every so often, and fix it as described. How do I run solr in non-cloud mode? Could you point me to a description?

The zookeeper options are required for cloud mode - zkHost to tell it about all your zookeeper nodes, zkRun to run an embedded zookeeper server. If you don't have these in solr.xml or your startup command line, Solr 4.x will not be running in cloud mode, just like earlier versions. Thanks, Shawn
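[For the archives, a minimal sketch of what that looks like; the port and core names below are invented. Starting jetty without -DzkRun/-DzkHost keeps Solr 4.x in plain multicore mode, and a query must then name the shards explicitly via the shards parameter:

    java -jar start.jar -Djetty.port=8983    # no zk options: non-cloud, plain multicore mode

    curl 'http://localhost:8983/solr/core0/select?q=water&shards=localhost:8983/solr/core0,localhost:8983/solr/core1'

With this setup you also distribute updates to the right core yourself.]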
Zookeeper: Could not get shard_id for core
On a host that is running two separate solr (jetty) processes and a single zookeeper process, we're often seeing solr complain that it can't find a particular core. If we restart the solr process, when it comes back up, it has lost all information about its cores:

Feb 28, 2013 10:26:47 PM org.apache.solr.core.SolrCore registerSearcher
INFO: [core0] Registered new searcher Searcher@14df33ae main{StandardDirectoryReader(segments_aat:181977 _16pu(4.0.0.2):C263610/78380 _vwv(4.0.0.2):C285538/130332 ... [snip]
Feb 28, 2013 10:26:47 PM org.apache.solr.common.cloud.ZkStateReader$2 process
INFO: A cluster state change has occurred - updating...
Feb 28, 2013 10:27:47 PM org.apache.solr.common.SolrException log
*SEVERE: null:org.apache.solr.common.SolrException: Could not get shard_id for core: core0*
at org.apache.solr.cloud.ZkController.doGetShardIdProcess(ZkController.java:995)
at org.apache.solr.cloud.ZkController.preRegister(ZkController.java:1053)
at org.apache.solr.core.CoreContainer.register(CoreContainer.java:662)
[snip]
SEVERE: null:org.apache.solr.common.SolrException: Could not get shard_id for core: core1

etc. for all the cores. The solution so far has been to shut down solr and zookeeper, delete the zookeeper configuration from disk, and then bring everything back up again. Has anyone else seen this problem? I'd love to be able to do without the hassle of having to run zookeeper, and the problems that are associated with it. Is this possible? Thanks, /Martin Koch - Senior Systems Architect - Issuu.com
Re: Zookeeper: Could not get shard_id for core
Most of the time things run just fine; however, we see this error every so often, and fix it as described. How do I run solr in non-cloud mode? Could you point me to a description? Thanks, /Martin

On Fri, Mar 1, 2013 at 3:30 PM, Mark Miller markrmil...@gmail.com wrote:

It sounds like you have some sort of configuration issue perhaps. When things are set up right, you should not be seeing anything like this. Whether or not you can do without ZooKeeper depends on what your requirements are and what you want to support. You can use SolrCloud mode and non-SolrCloud mode - there are advantages and disadvantages to each. - Mark

On Mar 1, 2013, at 9:03 AM, Martin Koch m...@issuu.com wrote:

On a host that is running two separate solr (jetty) processes and a single zookeeper process, we're often seeing solr complain that it can't find a particular core. If we restart the solr process, when it comes back up, it has lost all information about its cores:

Feb 28, 2013 10:26:47 PM org.apache.solr.core.SolrCore registerSearcher
INFO: [core0] Registered new searcher Searcher@14df33ae main{StandardDirectoryReader(segments_aat:181977 _16pu(4.0.0.2):C263610/78380 _vwv(4.0.0.2):C285538/130332 ... [snip]
Feb 28, 2013 10:26:47 PM org.apache.solr.common.cloud.ZkStateReader$2 process
INFO: A cluster state change has occurred - updating...
Feb 28, 2013 10:27:47 PM org.apache.solr.common.SolrException log
*SEVERE: null:org.apache.solr.common.SolrException: Could not get shard_id for core: core0*
at org.apache.solr.cloud.ZkController.doGetShardIdProcess(ZkController.java:995)
at org.apache.solr.cloud.ZkController.preRegister(ZkController.java:1053)
at org.apache.solr.core.CoreContainer.register(CoreContainer.java:662)
[snip]
SEVERE: null:org.apache.solr.common.SolrException: Could not get shard_id for core: core1

etc. for all the cores. The solution so far has been to shut down solr and zookeeper, delete the zookeeper configuration from disk, and then bring everything back up again. Has anyone else seen this problem? I'd love to be able to do without the hassle of having to run zookeeper, and the problems that are associated with it. Is this possible? Thanks, /Martin Koch - Senior Systems Architect - Issuu.com
Re: Zookeeper: Could not get shard_id for core
Thank you very much, Shawn. I had understood that Zookeeper was a mandatory component for Solr 4, and it is immensely useful to know that it is possible to do without. /Martin Koch

On Fri, Mar 1, 2013 at 3:58 PM, Shawn Heisey s...@elyograg.org wrote:

On 3/1/2013 7:34 AM, Martin Koch wrote: Most of the time things run just fine; however, we see this error every so often, and fix it as described. How do I run solr in non-cloud mode? Could you point me to a description?

The zookeeper options are required for cloud mode - zkHost to tell it about all your zookeeper nodes, zkRun to run an embedded zookeeper server. If you don't have these in solr.xml or your startup command line, Solr 4.x will not be running in cloud mode, just like earlier versions. Thanks, Shawn
Blogpost about SOLR at Issuu
Hi list, I have written a blog post about the use of SOLR for searching at Issuu (http://www.issuu.com). To give you a sense of the scale, Issuu indexes more than 9 million documents and 200 million pages. In January Issuu had 4.3 billion pageviews and over 125.8 million visits (60.1 million unique). You can see the blog post here: http://blog.issuu.com/post/41189476451/how-search-at-issuu-actually-works. Happy reading, /Martin Koch - Senior Systems Architect - Issuu
Re: SolrCloud and external file fields
Mikhail, I haven't experimented further yet. I think that the previous experiment of issuing a commit to a specific core proved that all cores get the commit, so I don't think that this approach will work. Thanks, /Martin

On Tue, Nov 27, 2012 at 6:24 PM, Mikhail Khludnev mkhlud...@griddynamics.com wrote:

Martin, it's still not clear to me whether you solved the problem completely or partially: Does reducing the number of cores free some resources for searching during commit? Does committing the cores one by one prevent the freeze? Thanks

On Thu, Nov 22, 2012 at 4:31 PM, Martin Koch m...@issuu.com wrote:

Mikhail, to avoid freezes we deployed the patches that are now on the 4.1 trunk (SOLR-3985). But this wasn't good enough, because SOLR would still take very long to restart when that was necessary. I don't see how we could throw more hardware at the problem without making it worse, really - the only solution here would be *fewer* shards, not more. IMO it would be ideal if the lucene/solr community could come up with a good way of updating fields in a document without reindexing. This could be by linking to some external data store, or in the lucene/solr internals. If it would make things easier, a good first step would be to have dynamically updateable numerical fields only. /Martin

On Wed, Nov 21, 2012 at 8:51 PM, Mikhail Khludnev mkhlud...@griddynamics.com wrote:

Martin, I don't think solrconfig.xml sheds any light on it. I've just found what I didn't get in your setup - the way cores are explicitly assigned to the collection. Now I've realized most of the details after all! The ball is on your side; let us know whether you have managed to commit your cores one by one to avoid the freeze, or could you eliminate the pauses by allocating more hardware? Thanks in advance!

On Wed, Nov 21, 2012 at 3:56 PM, Martin Koch m...@issuu.com wrote:

Mikhail, PSB.

On Wed, Nov 21, 2012 at 10:08 AM, Mikhail Khludnev mkhlud...@griddynamics.com wrote:

On Wed, Nov 21, 2012 at 11:53 AM, Martin Koch m...@issuu.com wrote:

I wasn't aware until now that it is possible to send a commit to one core only. What we observed was the effect of curl localhost:8080/solr/update?commit=true but perhaps we should experiment with solr/coreN/update?commit=true. A quick trial run seems to indicate that a commit to a single core causes commits on all cores.

You should see something like this in the log: ... SolrCmdDistributor Distrib commit to: ...

Yup, a commit towards a single core results in a commit on all cores. Perhaps I should clarify that we are using SOLR as a black box; we do not touch the code at all - we only install the distribution WAR file and proceed from there.

I still don't understand how you deploy/launch Solr. How many jettys do you start? Do you have -DzkRun -DzkHost -DnumShards=2, or do you specify the shards= param for every request and distribute updates yourself? What collections do you create and with which settings?

We let SOLR do the sharding using one collection with 16 SOLR cores holding one shard each. We launch only one instance of jetty with the following arguments: -DnumShards=16 -DzkHost=zookeeperhost:port -Xmx10G -Xms10G -Xmn2G -server. Would you like to see the solrconfig.xml? /Martin

Also from my POV such deployments should start at least from *16* 4-way vboxes; it's more expensive, but should be much better available during cpu-consuming operations.

Do you mean that you recommend 16 hosts with 4 cores each? Or 4 hosts with 16 cores? Or am I misunderstanding something :) ?

I prefer to start from 16 hosts with 4 cores each.

Other details: if you use a single jetty for all of them, are you sure that jetty's threadpool doesn't limit requests? Is it large enough? You have 60G and set -Xmx=10G. Are you sure that the total size of the cores' index directories is less than 45G?

The total index size is 230 GB, so it won't fit in RAM, but we're using an SSD disk to minimize disk access time. We have tried putting the EFF onto a RAM disk, but this didn't have a measurable effect. Thanks, /Martin

Thanks

On Wed, Nov 21, 2012 at 2:07 AM, Martin Koch m...@issuu.com wrote:

Mikhail, PSB.

On Tue, Nov 20, 2012 at 7:22 PM, Mikhail Khludnev mkhlud...@griddynamics.com wrote:

Martin, please find an additional question from me below. Simone, I'm sorry for hijacking your thread. The only thing I've heard about it at recent ApacheCon sessions is that Zookeeper
Re: How to post atomic updates using xml
Are all your fields marked as stored in your schema? This is a requirement for atomic updates. /Martin Koch

On Mon, Nov 26, 2012 at 7:58 PM, Darniz rnizamud...@edmunds.com wrote:

I tried using the same logic to update a specific field and to my surprise all my other fields were lost. I had a doc with almost 50 fields and I wanted to update only the gender field. I issued the below command:

curl http://host:8080/solr/update?commit=true -H 'Content-type:text/xml' -d '<add><doc><field name="id">63481697</field><field name="authorGender" update="set">male</field></doc></add>'

To me it looks like it replaced my entire document. Can you please let me know what went wrong?
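[For anyone hitting the same issue: atomic updates rebuild the document from its stored values, so every field that must survive the update has to be declared stored="true" (copyField destinations being the usual exception). A hedged sketch of what the schema.xml declarations might look like; the field names here are invented:

    <field name="id" type="string" indexed="true" stored="true" required="true"/>
    <field name="authorGender" type="string" indexed="true" stored="true"/>
    <field name="description" type="text_general" indexed="true" stored="true"/>

If any regular field is stored="false", its content is lost when another field is atomically updated, which matches the behavior described above.]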
Re: SolrCloud and external file fields
The short answer is no; the number was chosen in an attempt to get as many cores working in parallel to complete the search faster, but I realize that there is an overhead incurred by distributing the query and merging the results. We've now gone to 8 shards and will be monitoring performance. /Martin

On Thu, Nov 22, 2012 at 3:53 PM, Yonik Seeley yo...@lucidworks.com wrote:

On Tue, Nov 20, 2012 at 4:16 AM, Martin Koch m...@issuu.com wrote: around 7M documents in the index; each document has a 45 character ID.

7M documents isn't that large. Is there a reason why you need so many shards (16 in your case) on a single box? -Yonik http://lucidworks.com
Re: SolrCloud and external file fields
Mikhail, to avoid freezes we deployed the patches that are now on the 4.1 trunk (SOLR-3985). But this wasn't good enough, because SOLR would still take very long to restart when that was necessary. I don't see how we could throw more hardware at the problem without making it worse, really - the only solution here would be *fewer* shards, not more. IMO it would be ideal if the lucene/solr community could come up with a good way of updating fields in a document without reindexing. This could be by linking to some external data store, or in the lucene/solr internals. If it would make things easier, a good first step would be to have dynamically updateable numerical fields only. /Martin

On Wed, Nov 21, 2012 at 8:51 PM, Mikhail Khludnev mkhlud...@griddynamics.com wrote:

Martin, I don't think solrconfig.xml sheds any light on it. I've just found what I didn't get in your setup - the way cores are explicitly assigned to the collection. Now I've realized most of the details after all! The ball is on your side; let us know whether you have managed to commit your cores one by one to avoid the freeze, or could you eliminate the pauses by allocating more hardware? Thanks in advance!

On Wed, Nov 21, 2012 at 3:56 PM, Martin Koch m...@issuu.com wrote:

Mikhail, PSB.

On Wed, Nov 21, 2012 at 10:08 AM, Mikhail Khludnev mkhlud...@griddynamics.com wrote:

On Wed, Nov 21, 2012 at 11:53 AM, Martin Koch m...@issuu.com wrote:

I wasn't aware until now that it is possible to send a commit to one core only. What we observed was the effect of curl localhost:8080/solr/update?commit=true but perhaps we should experiment with solr/coreN/update?commit=true. A quick trial run seems to indicate that a commit to a single core causes commits on all cores.

You should see something like this in the log: ... SolrCmdDistributor Distrib commit to: ...

Yup, a commit towards a single core results in a commit on all cores. Perhaps I should clarify that we are using SOLR as a black box; we do not touch the code at all - we only install the distribution WAR file and proceed from there.

I still don't understand how you deploy/launch Solr. How many jettys do you start? Do you have -DzkRun -DzkHost -DnumShards=2, or do you specify the shards= param for every request and distribute updates yourself? What collections do you create and with which settings?

We let SOLR do the sharding using one collection with 16 SOLR cores holding one shard each. We launch only one instance of jetty with the following arguments: -DnumShards=16 -DzkHost=zookeeperhost:port -Xmx10G -Xms10G -Xmn2G -server. Would you like to see the solrconfig.xml? /Martin

Also from my POV such deployments should start at least from *16* 4-way vboxes; it's more expensive, but should be much better available during cpu-consuming operations.

Do you mean that you recommend 16 hosts with 4 cores each? Or 4 hosts with 16 cores? Or am I misunderstanding something :) ?

I prefer to start from 16 hosts with 4 cores each.

Other details: if you use a single jetty for all of them, are you sure that jetty's threadpool doesn't limit requests? Is it large enough? You have 60G and set -Xmx=10G. Are you sure that the total size of the cores' index directories is less than 45G?

The total index size is 230 GB, so it won't fit in RAM, but we're using an SSD disk to minimize disk access time. We have tried putting the EFF onto a RAM disk, but this didn't have a measurable effect.

Thanks, /Martin

Thanks

On Wed, Nov 21, 2012 at 2:07 AM, Martin Koch m...@issuu.com wrote:

Mikhail, PSB.

On Tue, Nov 20, 2012 at 7:22 PM, Mikhail Khludnev mkhlud...@griddynamics.com wrote:

Martin, please find an additional question from me below. Simone, I'm sorry for hijacking your thread. The only thing I've heard about it at recent ApacheCon sessions is that Zookeeper is supposed to replicate those files as configs under solr home. And I'm really looking forward to knowing how it works with huge files in production. Thank You, Guys!

On 20.11.2012 18:06, Martin Koch m...@issuu.com wrote:

Hi Mikhail, please see answers below.

On Tue, Nov 20, 2012 at 12:28 PM, Mikhail Khludnev mkhlud...@griddynamics.com wrote:

Martin, thank you for telling your own war-story. It's really useful for the community. The first question might seem not really conscious, but would you tell me what blocks searching during EFF reload, when it's triggered
Re: SolrCloud and external file fields
On Wed, Nov 21, 2012 at 7:08 AM, Mikhail Khludnev mkhlud...@griddynamics.com wrote:

On Wed, Nov 21, 2012 at 2:07 AM, Martin Koch m...@issuu.com wrote: I'm not sure about the mmap directory or where that would be configured in solr - can you explain that?

You can check it at Solr Admin/Statistics/core/searcher/stats/readerDir - it should be org.apache.lucene.store.MMapDirectory

It says 'org.apache.lucene.store.NRTCachingDirectory:NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@' /Martin

-- Sincerely yours, Mikhail Khludnev, Principal Engineer, Grid Dynamics, http://www.griddynamics.com, mkhlud...@griddynamics.com
Re: SolrCloud and external file fields
Mikhail, PSB.

On Wed, Nov 21, 2012 at 10:08 AM, Mikhail Khludnev mkhlud...@griddynamics.com wrote:

On Wed, Nov 21, 2012 at 11:53 AM, Martin Koch m...@issuu.com wrote:

I wasn't aware until now that it is possible to send a commit to one core only. What we observed was the effect of curl localhost:8080/solr/update?commit=true but perhaps we should experiment with solr/coreN/update?commit=true. A quick trial run seems to indicate that a commit to a single core causes commits on all cores.

You should see something like this in the log: ... SolrCmdDistributor Distrib commit to: ...

Yup, a commit towards a single core results in a commit on all cores. Perhaps I should clarify that we are using SOLR as a black box; we do not touch the code at all - we only install the distribution WAR file and proceed from there.

I still don't understand how you deploy/launch Solr. How many jettys do you start? Do you have -DzkRun -DzkHost -DnumShards=2, or do you specify the shards= param for every request and distribute updates yourself? What collections do you create and with which settings?

We let SOLR do the sharding using one collection with 16 SOLR cores holding one shard each. We launch only one instance of jetty with the following arguments: -DnumShards=16 -DzkHost=zookeeperhost:port -Xmx10G -Xms10G -Xmn2G -server. Would you like to see the solrconfig.xml? /Martin

Also from my POV such deployments should start at least from *16* 4-way vboxes; it's more expensive, but should be much better available during cpu-consuming operations.

Do you mean that you recommend 16 hosts with 4 cores each? Or 4 hosts with 16 cores? Or am I misunderstanding something :) ?

I prefer to start from 16 hosts with 4 cores each.

Other details: if you use a single jetty for all of them, are you sure that jetty's threadpool doesn't limit requests? Is it large enough? You have 60G and set -Xmx=10G. Are you sure that the total size of the cores' index directories is less than 45G?

The total index size is 230 GB, so it won't fit in RAM, but we're using an SSD disk to minimize disk access time. We have tried putting the EFF onto a RAM disk, but this didn't have a measurable effect. Thanks, /Martin

Thanks

On Wed, Nov 21, 2012 at 2:07 AM, Martin Koch m...@issuu.com wrote:

Mikhail, PSB.

On Tue, Nov 20, 2012 at 7:22 PM, Mikhail Khludnev mkhlud...@griddynamics.com wrote:

Martin, please find an additional question from me below. Simone, I'm sorry for hijacking your thread. The only thing I've heard about it at recent ApacheCon sessions is that Zookeeper is supposed to replicate those files as configs under solr home. And I'm really looking forward to knowing how it works with huge files in production. Thank You, Guys!

On 20.11.2012 18:06, Martin Koch m...@issuu.com wrote:

Hi Mikhail, please see answers below.

On Tue, Nov 20, 2012 at 12:28 PM, Mikhail Khludnev mkhlud...@griddynamics.com wrote:

Martin, thank you for telling your own war-story. It's really useful for the community. The first question might seem not really conscious, but would you tell me what blocks searching during EFF reload, when it's triggered by handler or by listener?

We continuously index new documents using CommitWithin to get regular commits. However, we observed that the EFFs were not re-read, so we had to do external commits (curl '.../solr/update?commit=true') to force reload. When this is done, solr blocks. I can't tell you exactly why it's doing that (it was related to SOLR-3985).

Is there a chance to get a thread dump when they are blocked?

Well, I could try to recreate the situation. But the setup is fairly simple: Create a large EFF in a largeish index with many shards. Issue a commit, and then try to do a search. Solr will not respond to the search before the commit has completed, and this will take a long time.

I don't really get the sentence about sequential commits and number of cores. Do I get it right that the file is replicated via Zookeeper? Doesn't it

Again, this is observed behavior. When we issue a commit on a system with many solr cores using EFFs, the system blocks for a long time (15 minutes). We do NOT use zookeeper for anything. The EFF is a symlink from each core's index dir to the actual file, which is updated by an external process.

Hold on, I asked about Zookeeper because the subj mentions SolrCloud. Do you use SolrCloud, SolrShards, or these cores are just
Re: SolrCloud and external file fields
Solr 4.0 does support using EFFs, but it might not give you what you're hoping for. We tried using Solr Cloud, and have given up again. The EFF is placed in the parent of the index directory in each core; each core reads the entire EFF and picks out the IDs that it is responsible for.

In the current 4.0.0 release of solr, solr blocks (doesn't answer queries) while re-reading the EFF. Even worse, it seems that the time to re-read the EFF is multiplied by the number of cores in use (i.e. the EFF is re-read by each core sequentially). The contents of the EFF become active after the first EXTERNAL commit (commitWithin does NOT work here) after the file has been updated. In our case, the EFF was quite large - around 450MB - and we use 16 shards, so when we triggered an external commit to force re-reading, the whole system would block for several (10-15) minutes. This won't work in a production environment. The reason for the size of the EFF is that we have around 7M documents in the index; each document has a 45 character ID.

We got some help to try to fix the problem so that the re-read of the EFF proceeds in the background (see here for a fix on the 4.1 branch: https://issues.apache.org/jira/browse/SOLR-3985). However, even though the re-read proceeds in the background, the time required to launch solr now takes at least as long as re-reading the EFFs. Again, this is not good enough for our needs.

The next issue is that you cannot sort on EFF fields (though you can return them as values using fl=field(my_eff_field)). This is also fixed in the 4.1 branch: https://issues.apache.org/jira/browse/SOLR-4022.

So: Even after these fixes, EFF performance is not that great. Our solution is as follows: The actual value of the popularity measure (say, reads) that we want to report to the user is inserted into the search response post-query by our query front-end. This value will then be the authoritative value at the time of the query. The value of the popularity measure that we use for boosting in the ranking of the search results is only updated when the value has changed enough so that the impact on the boost will be significant (say, more than 2%). This does require frequent re-indexing of the documents that have significant changes in the number of reads, but at least we won't have to update a document if it moves from, say, 100 to 101 reads. /Martin Koch - ISSUU - senior systems architect.

On Mon, Nov 19, 2012 at 3:22 PM, Simone Gianni simo...@apache.org wrote:

Hi all, I'm planning to move a quite big Solr index to SolrCloud. However, in this index, an external file field is used for popularity ranking. Does SolrCloud support external file fields? How does it cope with sharding and replication? Where should the external file be placed now that the index folder is not local but in the cloud? Are there otherwise other best practices to deal with the use cases external file fields were used for, like popularity/ranking, in SolrCloud? Custom ValueSources going to something external? Thanks in advance, Simone
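[For readers setting this up: the EFF is declared in schema.xml and the data file lives in the core's data directory (the parent of the index directory, as described above), named external_<fieldname>. A hedged sketch; the field names here are invented for illustration:

    <!-- schema.xml: one float value per document, keyed by the unique id field -->
    <fieldType name="externalPopularity" class="solr.ExternalFileField"
               keyField="id" defVal="0" valType="pfloat"/>
    <field name="reads" type="externalPopularity"/>

    <!-- <core>/data/external_reads: one "id=value" line per document -->
    doc-00000001=42.0
    doc-00000002=17.0

Sorting the file by id (standard UNIX sort, as mentioned later in this thread) helps load time.]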
Re: SolrCloud and external file fields
Hi Mikhail, please see answers below.

On Tue, Nov 20, 2012 at 12:28 PM, Mikhail Khludnev mkhlud...@griddynamics.com wrote:

Martin, thank you for telling your own war-story. It's really useful for the community. The first question might seem not really conscious, but would you tell me what blocks searching during EFF reload, when it's triggered by handler or by listener?

We continuously index new documents using CommitWithin to get regular commits. However, we observed that the EFFs were not re-read, so we had to do external commits (curl '.../solr/update?commit=true') to force reload. When this is done, solr blocks. I can't tell you exactly why it's doing that (it was related to SOLR-3985).

I don't really get the sentence about sequential commits and number of cores. Do I get it right that the file is replicated via Zookeeper? Doesn't it

Again, this is observed behavior. When we issue a commit on a system with many solr cores using EFFs, the system blocks for a long time (15 minutes). We do NOT use zookeeper for anything. The EFF is a symlink from each core's index dir to the actual file, which is updated by an external process.

cause scalability problems or a long time to reload? Will it help if we'll have, let's say, an ExternalDatabaseField which will pull values from jdbc, i.e.

I think the possibility of having some fields being retrieved from an external, dynamically updatable store would be really interesting. This could be JDBC, something in-memory like redis, or a NoSql product (e.g. Cassandra).

why can't all cores read these values simultaneously?

Again, this is a solr implementation detail that I can't answer :)

Can you confirm that the IDs in the file are ordered by the index term order?

Yes, we sorted the files (standard UNIX sort).

AFAIK it can impact load time.

Yes, it does.

Regarding your post-query solution, can you tell me: if the query found 10K docs, but I need to display only the first page with 100 rows, do I need to pull all 10K results to the frontend to order them by the rank?

In our architecture, the clients query an API that generates the SOLR query, retrieves the relevant additional fields that we need, and returns the relevant JSON to the front-end. In our use case, results are returned from SOLR by the 10's, not by the 1000's, so it is a manageable job. Even so, if solr returned thousands of results, it would be up to the implementation of the api to augment only the results that needed to be returned to the front-end. Even so, patching up a JSON structure with 10K results should be possible.

I'd really appreciate it if you comment on the questions above. PS: It's time to pitch: how much can https://issues.apache.org/jira/browse/SOLR-4085 (Commit-free ExternalFileField) help you?

It looks very interesting :) Does it make it possible to avoid re-reading the EFF on every commit, and only re-read the values that have actually changed? /Martin

On Tue, Nov 20, 2012 at 1:16 PM, Martin Koch m...@issuu.com wrote:

Solr 4.0 does support using EFFs, but it might not give you what you're hoping for. We tried using Solr Cloud, and have given up again. The EFF is placed in the parent of the index directory in each core; each core reads the entire EFF and picks out the IDs that it is responsible for. In the current 4.0.0 release of solr, solr blocks (doesn't answer queries) while re-reading the EFF. Even worse, it seems that the time to re-read the EFF is multiplied by the number of cores in use (i.e. the EFF is re-read by each core sequentially).

The contents of the EFF become active after the first EXTERNAL commit (commitWithin does NOT work here) after the file has been updated. In our case, the EFF was quite large - around 450MB - and we use 16 shards, so when we triggered an external commit to force re-reading, the whole system would block for several (10-15) minutes. This won't work in a production environment. The reason for the size of the EFF is that we have around 7M documents in the index; each document has a 45 character ID. We got some help to try to fix the problem so that the re-read of the EFF proceeds in the background (see here for a fix on the 4.1 branch: https://issues.apache.org/jira/browse/SOLR-3985). However, even though the re-read proceeds in the background, the time required to launch solr now takes at least as long as re-reading the EFFs. Again, this is not good enough for our needs. The next issue is that you cannot sort on EFF fields (though you can return them as values using fl=field(my_eff_field)). This is also fixed in the 4.1 branch: https://issues.apache.org/jira/browse/SOLR-4022. So: Even after these fixes, EFF performance is not that great. Our solution is as follows: The actual value of the popularity measure (say, reads) that we want to report to the user is inserted into the search response post-query by our query front
Re: SolrCloud and external file fields
Mikhail, PSB.

On Tue, Nov 20, 2012 at 7:22 PM, Mikhail Khludnev mkhlud...@griddynamics.com wrote:

Martin, please find an additional question from me below. Simone, I'm sorry for hijacking your thread. The only thing I've heard about it at recent ApacheCon sessions is that Zookeeper is supposed to replicate those files as configs under solr home. And I'm really looking forward to knowing how it works with huge files in production. Thank You, Guys!

On 20.11.2012 18:06, Martin Koch m...@issuu.com wrote:

Hi Mikhail, please see answers below.

On Tue, Nov 20, 2012 at 12:28 PM, Mikhail Khludnev mkhlud...@griddynamics.com wrote:

Martin, thank you for telling your own war-story. It's really useful for the community. The first question might seem not really conscious, but would you tell me what blocks searching during EFF reload, when it's triggered by handler or by listener?

We continuously index new documents using CommitWithin to get regular commits. However, we observed that the EFFs were not re-read, so we had to do external commits (curl '.../solr/update?commit=true') to force reload. When this is done, solr blocks. I can't tell you exactly why it's doing that (it was related to SOLR-3985).

Is there a chance to get a thread dump when they are blocked?

Well, I could try to recreate the situation. But the setup is fairly simple: Create a large EFF in a largeish index with many shards. Issue a commit, and then try to do a search. Solr will not respond to the search before the commit has completed, and this will take a long time.

I don't really get the sentence about sequential commits and number of cores. Do I get it right that the file is replicated via Zookeeper? Doesn't it

Again, this is observed behavior. When we issue a commit on a system with many solr cores using EFFs, the system blocks for a long time (15 minutes). We do NOT use zookeeper for anything. The EFF is a symlink from each core's index dir to the actual file, which is updated by an external process.

Hold on, I asked about Zookeeper because the subj mentions SolrCloud. Do you use SolrCloud, SolrShards, or are these cores just replicas of the same index?

Ah - we use solr 4 out of the box, so I guess this is SolrCloud. I'm a bit unsure about the terminology here, but we've got a single index divided into 16 shards. Each shard is hosted in a solr core.

Also, about the symlink - don't you share that file via some NFS?

No, we generate the EFF on the local solr host (there is only one physical host that holds all shards), so there is no need for NFS or copying files around. No need for Zookeeper either.

How many cores do you run per box?

This box is a 16-virtual-core (8 hyperthreaded cores) machine with 60GB of RAM. We run 16 solr cores on this box in Jetty.

Do the boxes have plenty of RAM to cache the filesystem beside the jvm heaps?

Yes. We've allocated 10GB for jetty, and left the rest for the OS.

I assume you use 64-bit linux and mmap directory. Please confirm that.

We use 64-bit linux. I'm not sure about the mmap directory or where that would be configured in solr - can you explain that?

causes scalability problems or a long time to reload? Will it help if we'll have, let's say, an ExternalDatabaseField which will pull values from jdbc, i.e.

I think the possibility of having some fields being retrieved from an external, dynamically updatable store would be really interesting. This could be JDBC, something in-memory like redis, or a NoSql product (e.g. Cassandra).

Ok. Let's have it in mind as a possible direction.

Alternatively, an API that would allow updating a single field for a document might be an option.

why can't all cores read these values simultaneously?

Again, this is a solr implementation detail that I can't answer :)

Can you confirm that the IDs in the file are ordered by the index term order?

Yes, we sorted the files (standard UNIX sort).

AFAIK it can impact load time.

Yes, it does.

Ok, I've got that you are aware of it, and your IDs are just strings, not integers.

Yes, ids are strings.

Regarding your post-query solution, can you tell me: if the query found 10K docs, but I need to display only the first page with 100 rows, do I need to pull all 10K results to the frontend to order them by the rank?

In our architecture, the clients query an API that generates the SOLR query, retrieves the relevant additional fields that we need, and returns the relevant JSON to the front-end. In our use case, results are returned from SOLR by the 10's, not by the 1000's, so it is a manageable job. Even so, if solr returned thousands of results, it would be up to the implementation of the api to augment only the results that needed to be returned to the front-end. Even so, patching up a JSON structure with 10K results should be possible
Re: SolrCloud and external file fields
Mikhail, I appreciate your input, it's very useful :)

On Wed, Nov 21, 2012 at 6:30 AM, Mikhail Khludnev mkhlud...@griddynamics.com wrote:

Martin, this deployment seems a little bit confusing to me. You have a 16-way fairly virtual box, and you send 16 requests for a really heavy operation at the same moment; it does not surprise me that you're losing it for some period of time. At that time you should have more than 16 in the load average metrics. I suggest sending the commit to those cores one-by-one and having inconsistency and some sort of blinking as a trade-off for availability. In this case only a single virtual CPU will be fully consumed by the commit's _thread divergence action_ and the others will serve requests.

I wasn't aware until now that it is possible to send a commit to one core only. What we observed was the effect of curl localhost:8080/solr/update?commit=true but perhaps we should experiment with solr/coreN/update?commit=true. A quick trial run seems to indicate that a commit to a single core causes commits on all cores. Perhaps I should clarify that we are using SOLR as a black box; we do not touch the code at all - we only install the distribution WAR file and proceed from there.

Also from my POV such deployments should start at least from *16* 4-way vboxes; it's more expensive, but should be much better available during cpu-consuming operations.

Do you mean that you recommend 16 hosts with 4 cores each? Or 4 hosts with 16 cores? Or am I misunderstanding something :) ?

Other details: if you use a single jetty for all of them, are you sure that jetty's threadpool doesn't limit requests? Is it large enough? You have 60G and set -Xmx=10G. Are you sure that the total size of the cores' index directories is less than 45G?

The total index size is 230 GB, so it won't fit in RAM, but we're using an SSD disk to minimize disk access time. We have tried putting the EFF onto a RAM disk, but this didn't have a measurable effect. Thanks, /Martin

Thanks

On Wed, Nov 21, 2012 at 2:07 AM, Martin Koch m...@issuu.com wrote:

Mikhail, PSB.

On Tue, Nov 20, 2012 at 7:22 PM, Mikhail Khludnev mkhlud...@griddynamics.com wrote:

Martin, please find an additional question from me below. Simone, I'm sorry for hijacking your thread. The only thing I've heard about it at recent ApacheCon sessions is that Zookeeper is supposed to replicate those files as configs under solr home. And I'm really looking forward to knowing how it works with huge files in production. Thank You, Guys!

On 20.11.2012 18:06, Martin Koch m...@issuu.com wrote:

Hi Mikhail, please see answers below.

On Tue, Nov 20, 2012 at 12:28 PM, Mikhail Khludnev mkhlud...@griddynamics.com wrote:

Martin, thank you for telling your own war-story. It's really useful for the community. The first question might seem not really conscious, but would you tell me what blocks searching during EFF reload, when it's triggered by handler or by listener?

We continuously index new documents using CommitWithin to get regular commits. However, we observed that the EFFs were not re-read, so we had to do external commits (curl '.../solr/update?commit=true') to force reload. When this is done, solr blocks. I can't tell you exactly why it's doing that (it was related to SOLR-3985).

Is there a chance to get a thread dump when they are blocked?

Well, I could try to recreate the situation. But the setup is fairly simple: Create a large EFF in a largeish index with many shards. Issue a commit, and then try to do a search. Solr will not respond to the search before the commit has completed, and this will take a long time.

I don't really get the sentence about sequential commits and number of cores. Do I get it right that the file is replicated via Zookeeper? Doesn't it

Again, this is observed behavior. When we issue a commit on a system with many solr cores using EFFs, the system blocks for a long time (15 minutes). We do NOT use zookeeper for anything. The EFF is a symlink from each core's index dir to the actual file, which is updated by an external process.

Hold on, I asked about Zookeeper because the subj mentions SolrCloud. Do you use SolrCloud, SolrShards, or are these cores just replicas of the same index?

Ah - we use solr 4 out of the box, so I guess this is SolrCloud. I'm a bit unsure about the terminology here, but we've got a single index divided into 16 shards. Each shard is hosted in a solr core.

Also, about the symlink - don't you share that file via some NFS?

No, we generate the EFF on the local solr host (there is only one physical host that holds all shards), so there is no need for NFS or copying files around. No need for Zookeeper either.

How many cores
Re: solr blocking on commit
Are you using solr 4.0? We had some problems similar to this (not in a master/slave setup, though), where the resolution was to disable the transaction log, i.e. remove updateLog in the updateHandler section - we don't need NRT get, so this isn't important to us. Cheers, /Martin Koch

On Thu, Nov 1, 2012 at 1:25 AM, dbabits dbab...@gmail.com wrote:

I second the original poster - all selects are blocked during commits. I have Master replicating to Slave. Indexing happens to Master, a few docs about every 30 secs. Selects are run against Slave. This is the pattern from the Slave log:

Oct 30, 2012 12:33:23 AM org.apache.solr.core.SolrDeletionPolicy updateCommits
INFO: newest commit = 1349195567630
Oct 30, 2012 12:33:42 AM org.apache.solr.core.SolrCore execute
INFO: [core3] webapp=/solr path=/select

During the 19 seconds that you see between the 2 lines, the /select is blocked, until the commit is done. This has nothing to do with the jvm; I'm monitoring the memory and GC stats with jConsole and the log. I played with all settings imaginable: commitWithin, commit=true, useColdSearcher, autoWarming settings from 0 on - nothing helps. The environment is: 3.6.0, RHEL Linux 5.3.2, 64-bit, 96G RAM, 6 CPU cores, java 1.6.0_24, ~70 million docs. As soon as I suspend replication (command=disablepoll), everything becomes fast. As soon as I enable it - it pretty much becomes useless. Querying Master directly exhibits the same problem of course. Thanks a lot for your help.
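[For reference, the section Martin refers to lives in solrconfig.xml; a sketch - commenting out updateLog disables the transaction log, and with it NRT /get support:

    <updateHandler class="solr.DirectUpdateHandler2">
      <!-- remove or comment this out to disable the transaction log -->
      <!--
      <updateLog>
        <str name="dir">${solr.ulog.dir:}</str>
      </updateLog>
      -->
    </updateHandler>
]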
Re: is it possible to index
In my experience, about as fast as you can push the new data :) Depending on the size of your records, this should be a matter of seconds. /Martin Koch

On Wed, Oct 24, 2012 at 9:01 PM, Marcelo Elias Del Valle mvall...@gmail.com wrote:

Erick, thanks for the help; it sure helps a lot to read that, as it gives me more confidence I am not crazy about what I am thinking. The only problem I see by de-normalizing data as you said is that if any relation between customer and vendor changes, I will have to update the index for all the vendors. I could have about 10 000 customers per vendor. Anyway, by what you're saying, it's more common than I was imagining, right? I wonder how long solr will take to reindex 10 000 records when this happens. Thanks, Marcelo Valle.

2012/10/24 Erick Erickson erickerick...@gmail.com

One, take off your RDBMS cap <G>... DB folks regularly reject the idea of de-normalizing data to make best use of Solr, but that's what I would explore first. Yes, this repeats the, in your case, vendor information perhaps many times, but try that first, even though that causes you to update multiple customers whenever a vendor changes. You haven't specified how many customers and vendors you're talking about here, but unless the total number of documents (where each document is a customer+vendor combination) is multiple tens of millions, you probably will be fine. You can get a list of just customers by using grouping, where you group on customer, although that may not be the most efficient. You could index a field, call it cust_filter, that was set to true for the first customer/vendor you indexed and false (or just left out) for all the rest, and q=blahblah&fq=cust_filter:true. Hope that helps. Erick

On Wed, Oct 24, 2012 at 12:01 PM, Marcelo Elias Del Valle mvall...@gmail.com wrote:

Hello, I am new to Solr and I have a scenario where I want to use it, but I might be misunderstanding some concepts. I will explain what I want here; if someone has a solution for this, I would gladly accept the help. I have a core indexing customers. I have another core indexing vendors. Both are related to each other. Here is what I want to do in my application: I want to find all the customers that follow some criteria and then find the vendors related to them. My first option was to have just the vendor core, and for each document in the vendor core I would have all the customers related to it. However, I would write the same customer several times to the index, as more than one vendor could be related to the same customer. Besides, I wonder how I would write a query to list just the distinct customers. Another problem is that I update customers at a different frequency than I update vendors, but having vendor + customers in a single document would oblige me to do the full update. Does anyone have a good solution for this that I am not being able to see? I might be missing some basic concept here... Thanks, -- Marcelo Elias Del Valle http://mvalle.com - @mvallebr
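[To make the grouping suggestion above concrete, a hedged sketch; the field name customer_id is invented for illustration. Result grouping collapses the customer+vendor documents back to one row per customer:

    curl 'http://localhost:8983/solr/select?q=*:*&group=true&group.field=customer_id&group.limit=1'
]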
Reloading ExternalFileField blocks Solr
Hi List. We're using Solr-4.0.0-Beta with a 7M document index running on a single host with 16 shards. We'd like to use an ExternalFileField to hold a value that changes often. However, we've discovered that the file is apparently re-read by every shard/core on *every commit*; the index is unresponsive during this period (around 20s on the host we're running on). This is unacceptable for our needs. In the future, we'd like to add other values as ExternalFileFields, and this will make the problem worse.

It would be better if the external file were instead read in the background, updating the previously read values for each shard as they are read in. I guess a change in the ExternalFileField code would be required to achieve this, but I have no experience here, so suggestions are very welcome. Thanks, /Martin Koch - Issuu - Senior Systems Architect.
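[For readers finding this thread later: the 4.1-era work discussed further down decoupled the reload from the commit by re-reading the file while a new searcher warms, so the old searcher keeps serving queries meanwhile. My understanding - hedged, check your Solr version - is that it is wired up as event listeners in the query section of solrconfig.xml:

    <listener event="newSearcher" class="org.apache.solr.schema.ExternalFileFieldReloader"/>
    <listener event="firstSearcher" class="org.apache.solr.schema.ExternalFileFieldReloader"/>
]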
Re: Reloading ExternalFileField blocks Solr
Sure: We're boosting search results based on user actions, which could be e.g. the number of times a particular document has been read. In future, we'd also like to boost by e.g. impressions (the number of times a document has been displayed) and other values. /Martin

On Mon, Oct 8, 2012 at 7:02 PM, Mikhail Khludnev mkhlud...@griddynamics.com wrote:

Martin, can you tell me what's the content of that field, and how it should affect the search result?

On Mon, Oct 8, 2012 at 12:55 PM, Martin Koch m...@issuu.com wrote:

Hi List. We're using Solr-4.0.0-Beta with a 7M document index running on a single host with 16 shards. We'd like to use an ExternalFileField to hold a value that changes often. However, we've discovered that the file is apparently re-read by every shard/core on *every commit*; the index is unresponsive during this period (around 20s on the host we're running on). This is unacceptable for our needs. In the future, we'd like to add other values as ExternalFileFields, and this will make the problem worse. It would be better if the external file were instead read in the background, updating the previously read values for each shard as they are read in. I guess a change in the ExternalFileField code would be required to achieve this, but I have no experience here, so suggestions are very welcome. Thanks, /Martin Koch - Issuu - Senior Systems Architect.

-- Sincerely yours, Mikhail Khludnev, Tech Lead, Grid Dynamics, http://www.griddynamics.com, mkhlud...@griddynamics.com
Re: SolrCloud 4.0 ALPHA, replicas, large commit times
(I'm working with Raghav on this): We've got several parallel workers that add documents in batches of 16 through pysolr. When using commitWithin at 60 seconds, the commit causes solr to freeze; if the commitWithin is only 5 seconds, then everything seems to work fine. In both cases, throughput is around 500 documents / second. We can certainly give it a try with the Beta. Thanks, /Martin

On Mon, Aug 27, 2012 at 7:30 PM, Mark Miller markrmil...@gmail.com wrote:

How are you adding the docs? In batch, streaming, a doc at a time? Any chance you can try with the Beta?

On Mon, Aug 27, 2012 at 9:35 AM, Raghav Karol r...@issuu.com wrote:

Hello *, We are using SolrCloud 4.0 - Alpha and have a 4 machine setup:
Machine 1 - 16 Solr cores - Shards 1 - 16
Machine 2 - 16 Solr cores - Shards 17 - 32
Machine 3 - 16 Solr cores - Replicas 1 - 16
Machine 4 - 16 Solr cores - Replicas 17 - 32

Indexing at 500 docs/sec and committing every 60 seconds, i.e., 30,000 documents, causes Solr to freeze. There is nothing in the logs to indicate errors or replication activity - Solr just appeared to freeze. Increasing the commit frequency, we observed that commits of at most 2,500 docs worked fine. Are we using SolrCloud and replication incorrectly? -- Raghav

-- - Mark http://www.lucidimagination.com
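[For the archives: commitWithin can be attached to the add request itself rather than set per batch in the client; a hedged sketch over plain HTTP with a made-up document, values in milliseconds:

    curl 'http://localhost:8983/solr/update' -H 'Content-Type: text/xml' \
      -d '<add commitWithin="60000"><doc><field name="id">doc-1</field></doc></add>'
]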
Re: SolrCloud 4.0 ALPHA, replicas, large commit times
It actually is the Beta that we're working with. /Martin

On Mon, Aug 27, 2012 at 10:38 PM, Martin Koch m...@issuu.com wrote:

(I'm working with Raghav on this): We've got several parallel workers that add documents in batches of 16 through pysolr. When using commitWithin at 60 seconds, the commit causes solr to freeze; if the commitWithin is only 5 seconds, then everything seems to work fine. In both cases, throughput is around 500 documents / second. We can certainly give it a try with the Beta. Thanks, /Martin

On Mon, Aug 27, 2012 at 7:30 PM, Mark Miller markrmil...@gmail.com wrote:

How are you adding the docs? In batch, streaming, a doc at a time? Any chance you can try with the Beta?

On Mon, Aug 27, 2012 at 9:35 AM, Raghav Karol r...@issuu.com wrote:

Hello *, We are using SolrCloud 4.0 - Alpha and have a 4 machine setup:
Machine 1 - 16 Solr cores - Shards 1 - 16
Machine 2 - 16 Solr cores - Shards 17 - 32
Machine 3 - 16 Solr cores - Replicas 1 - 16
Machine 4 - 16 Solr cores - Replicas 17 - 32

Indexing at 500 docs/sec and committing every 60 seconds, i.e., 30,000 documents, causes Solr to freeze. There is nothing in the logs to indicate errors or replication activity - Solr just appeared to freeze. Increasing the commit frequency, we observed that commits of at most 2,500 docs worked fine. Are we using SolrCloud and replication incorrectly? -- Raghav

-- - Mark http://www.lucidimagination.com
Re: Solr advanced boosting
We're doing something similar: We want to combine search relevancy with a fitness value computed from several other data sources. For this, we pre-compute the fitness value for each document and store it in a flat file (lines of the format document_id=fitness_score) that we access from Solr using an externalFileField (http://lucene.apache.org/solr/api/org/apache/solr/schema/ExternalFileField.html). This file can be updated at regular intervals, e.g. to reflect recent views or up/downvotes. It is re-read by solr on every commit. The fitness field can then be included as a boost field in an (e)dismax query. /Martin

On Thu, Mar 29, 2012 at 9:56 AM, mads mads...@yahoo.dk wrote:

Hello everyone! I am new to Solr and I have been doing a bit of reading about boosting search results. My search index consists of products with different attributes like a title, a description, a brand, a price, a discount percent and so on. I would like to do a fairly complex boosting, so that for example a hit on the brand name, a low price, or a high discount percent is boosted compared to a hit in the title, higher prices etc. Basically I would like to make a more intelligent search with my own self-defined boosting algorithm or definition. I hope it makes sense. My question is if more experienced Solr people consider this possible, and how I can get started on this project? Is it possible to do as a kind of a plugin, or? Regards, Mads
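[To make the hookup concrete, a hedged sketch; the field and query terms are invented. The edismax boost parameter takes a function query, and the external field's value is reachable via field():

    curl 'http://localhost:8983/solr/select?defType=edismax&q=running+shoes&boost=field(fitness_score)'

The boost is multiplicative, so keeping fitness_score values in a bounded range keeps its influence on the final score predictable.]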
Re: Practical Optimization
Thanks for writing this up. These are good tips. /Martin

On Fri, Mar 23, 2012 at 9:57 PM, dw5ight dw5i...@gmail.com wrote:

Hey All - we run a car search engine (http://carsabi.com) with Solr and did some benchmarking recently after we switched from a hosted service to self-hosting. In brief, we went from 800ms complex range queries on a 1.5M document corpus to 43ms. The major shifts were switching from EC2 Large to EC2 CC8XL, which got us down to 282ms (a 2.82x speed gain due to the 2.75x CPU speed increase, we think), and then down to 43ms when we sharded to 8 cores. We tried sharding to 12 and 16 but saw negligible gains after this point. Anyway, hope this might be useful to someone - we write up exact stats and a step-by-step sharding procedure on our tech blog (http://carsabi.com/car-news/2012/03/23/optimizing-solr-7x-your-search-speed/) if anyone's interested. best Dwight
Error 500 seek past EOF : SOLR bug?
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)
Caused by: java.io.FileNotFoundException: File does not exist /mnt/solr.data.0/index.20120323132730/_16c_5s.del
at org.apache.solr.common.util.FileUtils.sync(FileUtils.java:64)
at org.apache.solr.handler.SnapPuller$FileFetcher$1.run(SnapPuller.java:923)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
... 3 more

Thanks, /Martin Koch
Re: Simple Slave Replication Question
I guess this would depend on network bandwidth, but we move around 150G/hour when hooking up a new slave to the master. /Martin

On Fri, Mar 23, 2012 at 12:33 PM, Ben McCarthy ben.mccar...@tradermedia.co.uk wrote:

Hello, I'm looking at the replication from a master to a number of slaves. I have configured it and it appears to be working. When updating 40K records on the master, is it standard to always copy over the full index, currently 5gb in size? If this is standard, what do people do who have massive 200gb indexes? Does it not take a while to bring the slaves in line with the master? Thanks, Ben
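[A note for anyone wiring this up: replication normally copies only the segment files that are new since the last poll, so a full-index copy on every update usually means the whole index was rewritten on the master (e.g. an optimize or a very large merge). The slave-side polling setup lives in solrconfig.xml; a sketch with a made-up master host:

    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="slave">
        <str name="masterUrl">http://master-host:8983/solr/replication</str>
        <str name="pollInterval">00:05:00</str>
      </lst>
    </requestHandler>
]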
Re: Commit without an update handler?
Yes. However, something must actually have been updated in the index before a commit on the master causes the slave to update (this is what was confusing me). Since I'll be updating the index fairly often, this will not be a problem for me. If, however, the external file field is updated often but the index proper isn't, this could be a problem. Thanks, /Martin

On Thu, Jan 5, 2012 at 2:56 PM, Erick Erickson erickerick...@gmail.com wrote:

Hmmm, does it work just to put this in the master's index and let replication do its tricks, and issue your commit on the master? Or am I missing something here? Best, Erick

On Tue, Jan 3, 2012 at 1:33 PM, Martin Koch m...@issuu.com wrote:

Hi List, I have a Solr cluster set up in a master/slave configuration where the master acts as an indexing node and the slaves serve user requests. To avoid accidental posts of new documents to the slaves, I have disabled the update handlers. However, I use an externalFileField. When the file is updated, I need to issue a commit to reload the new file. This requires an update handler. Is there an update handler that doesn't accept new documents, but will effect a commit? Thanks, /Martin
Commit without an update handler?
Hi List I have a Solr cluster set up in a master/slave configuration where the master acts as an indexing node and the slaves serve user requests. To avoid accidental posts of new documents to the slaves, I have disabled the update handlers. However, I use an externalFileField. When the file is updated, I need to issue a commit to reload the new file. This requires an update handler. Is there an update handler that doesn't accept new documents, but will effect a commit? Thanks, /Martin
Re: Indexing problem
Could it be a commit you're needing? curl 'localhost:8983/solr/update?commit=true' /Martin

On Wed, Dec 28, 2011 at 11:47 AM, mumairshamsi mumairsha...@gmail.com wrote:

http://lucene.472066.n3.nabble.com/file/n3616191/02.xml 02.xml

I am trying to index this file. For this I am using this command: java -jar post.jar *.xml. The command runs fine, but when I search, no result is displayed. I think it is an encoding problem - can anyone help?
Re: Default Search UI not working
Have you looked here http://wiki.apache.org/solr/VelocityResponseWriter ? /Martin On Mon, Dec 19, 2011 at 12:44 PM, remi tassing tassingr...@yahoo.comwrote: Hello guys, the default search UI doesn't work for me. http://localhost:8983/solr/browse gives me an HTTP 404 error. I'm using Solr-1.4. Any idea how to fix this? Remi
Re: Large RDBMS dataset
Instead of handling it from within solr, I'd suggest writing an external application (e.g. in python using pysolr) that wraps the (fast) SQL query you like. Then retrieve a batch of documents, and write them to solr. For extra speed, don't commit until you're done. /Martin

On Wed, Dec 14, 2011 at 11:18 AM, Finotti Simone tech...@yoox.com wrote:

Hello, I have a very large dataset (> 1M records) on the RDBMS which I want my Solr application to pull data from. The problem is that the document fields which I have to index aren't in the same table; I have to join records with two other tables. Well, in fact they are views, but I don't think that this makes any difference. This is the data import handler that I've written:

<?xml version="1.0"?>
<dataConfig>
  <dataSource type="JdbcDataSource" driver="net.sourceforge.jtds.jdbc.Driver" url="jdbc:jtds:sqlserver://YSQLDEV01BLQ/YooxProcessCluster1;instance=SVCSQLDEV" />
  <document name="Products">
    <entity name="fd" query="SELECT * FROM clust_w_fast_dump ORDER BY endeca_id;">
      <entity name="fd2" query="SELECT macrocolor_id, color_descr, gsize_descr, size_descr FROM clust_w_fast_dump2_ByMarkets WHERE endeca_id='${fd.Endeca_ID}' ORDER BY endeca_id;"/>
      <entity name="cpd" query="SELECT DepartmentCode, Ranking, DepartmentPriceRangeCode FROM clust_w_CatalogProductsDepartments_ByMarket WHERE endeca_id='${fd.Endeca_ID}' ORDER BY endeca_id;"/>
      <entity name="env" query="SELECT Environment FROM clust_w_Environment WHERE endeca_id='${fd.Endeca_ID}' ORDER BY endeca_id;"/>
    </entity>
  </document>
</dataConfig>

It works, but it takes 1'38" to parse 100 records: that means about 1 rec/s, so digesting the whole dataset would take around a million seconds (= 12 days). The problem is that for each record in fd, Solr makes three distinct SELECTs on the other three tables. Of course, this is absolutely inefficient. Is there a way to have Solr load every record in the four tables and join them once they are already loaded in memory? TIA
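[Along the lines of the suggestion above, a rough sketch of such an external loader. Everything here is illustrative: the DSN, the Solr URL, the field mapping, and the single flat join (which replaces DIH's three per-record sub-selects) are assumptions, and the real views may need LEFT JOINs or de-duplication for multi-valued rows:

    import pysolr    # pip install pysolr
    import pyodbc    # any DB-API driver for SQL Server works; pyodbc is an assumption

    solr = pysolr.Solr('http://localhost:8983/solr')
    cur = pyodbc.connect('DSN=yoox').cursor()
    # One flat join instead of three correlated sub-selects per record
    cur.execute("""
        SELECT fd.Endeca_ID, fd2.color_descr, cpd.Ranking, env.Environment
        FROM clust_w_fast_dump fd
        JOIN clust_w_fast_dump2_ByMarkets fd2 ON fd2.endeca_id = fd.Endeca_ID
        JOIN clust_w_CatalogProductsDepartments_ByMarket cpd ON cpd.endeca_id = fd.Endeca_ID
        JOIN clust_w_Environment env ON env.endeca_id = fd.Endeca_ID
    """)
    batch = []
    for row in cur:
        batch.append({'id': row.Endeca_ID,
                      'color_descr': row.color_descr,
                      'ranking': row.Ranking,
                      'environment': row.Environment})
        if len(batch) >= 1000:
            solr.add(batch, commit=False)   # defer the commit for speed
            batch = []
    if batch:
        solr.add(batch, commit=False)
    solr.commit()                           # one commit at the very end
]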
Re: Solr using very high I/O
Do you commit often? If so, try committing less often :) /Martin

On Wed, Dec 7, 2011 at 12:16 PM, Adrian Fita adrian.f...@gmail.com wrote:

Hi. I experience an issue where Solr is using huge amounts of I/O. Basically it uses the whole HDD continuously, leaving nothing to the other processes. Solr is called by a script which continuously indexes some files. The index has around 800MB and I can't understand why it could thrash the HDD so much. I could use some help on how to optimize Solr so it doesn't use so much I/O. Thank you. -- Fita Adrian
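[If the indexing script can't easily be changed, commit frequency can also be capped on the Solr side with autoCommit in solrconfig.xml; a sketch with arbitrary thresholds:

    <updateHandler class="solr.DirectUpdateHandler2">
      <autoCommit>
        <maxDocs>10000</maxDocs>   <!-- commit after at most 10k buffered docs -->
        <maxTime>300000</maxTime>  <!-- ...or at most every 5 minutes (ms) -->
      </autoCommit>
    </updateHandler>
]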
Comparing apples & oranges?
Hi List, I have a solr index where I want to include numerical fields in my ranking function as well as keyword relevance. For example, each document has a document view count, and I'd like to increase the relevancy of documents that are read often, and penalize documents with a very low view count. I'm aware that this could be achieved with a filter as well, but ignore that for this question :) since this will be extended to other numerical fields.

The keyword scoring works just fine and I can include the view count as a factor in the scoring, but I would like to somehow express that the view count accounts for e.g. 25% of the total score. This could be achieved by mapping the view count into some predetermined fixed range and then performing suitable arithmetic to scale it to the score of the query. The score of the term query is normalized by queryNorm, so I'd like somehow to express that the view count score should be normalized to the queryNorm. If I look at the explain of how the score below is computed, the 17.4 is the part of the score that comes from term relevancy. Searching for another (set of) terms yields a different queryNorm, so I can't see how I can a priori pick a scaling function (I've used log for this example) and boost factor that will give control of the final contribution of the view count to the score.

19.14161 = (MATCH) sum of:
  17.403849 = (MATCH) max plus 0.1 times others of:
    16.747877 = (MATCH) weight(document:water^4.0 in 1076362), product of:
      0.22298127 = queryWeight(document:water^4.0), product of:
        4.0 = boost
        2.939238 = idf(docFreq=527730, maxDocs=3669552)
        0.018965907 = queryNorm
      75.108894 = (MATCH) fieldWeight(document:water in 1076362), product of:
        25.553865 = tf(termFreq(document:water)=653)
        2.939238 = idf(docFreq=527730, maxDocs=3669552)
        1.0 = fieldNorm(field=document, doc=1076362)
  [snip]
  1.7377597 = (MATCH) FunctionQuery(log(map(int(views),0.0,0.0,1.0))), product of:
    1.8325089 = log(map(int(views)=68,min=0.0,max=0.0,target=1.0))
    50.0 = boost
    0.018965907 = queryNorm

Thanks in advance for your help, /Martin
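[One possible direction for the fixed-range mapping described above - a sketch, not a full answer to the queryNorm question, and note that scale() computes the min/max over the whole index on each request, which has a cost. scale(log(...),0,1) squeezes the view-count term into [0,1] regardless of queryNorm, and a multiplicative boost then bounds its share of the final score at a chosen fraction, here roughly 25%:

    curl 'http://localhost:8983/solr/select?defType=edismax&q=water&boost=sum(1,product(0.25,scale(log(sum(int(views),1)),0,1)))'
]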