Re: Facet Double Counting
Still the same. Could the reason be that if there are duplicate logs/documents, the facet query counts them, but when I do the search query, Solr eliminates the duplicates?

On Sat, Jan 24, 2015 at 11:47 PM, Ahmet Arslan iori...@yahoo.com.invalid wrote:

Hi Harish, What happens when you purge deleted terms with 'solr/core/update?commit=true&expungeDeletes=true'? ahmet

On Sunday, January 25, 2015 1:59 AM, harish singh harish.sing...@gmail.com wrote:

Hi, I am noticing strange behavior with Solr facet searching. This is my facet query:

  params: {
    facet: true,
    sort: startTimeISO desc,
    debugQuery: true,
    facet.mincount: 1,
    facet.sort: count,
    start: 0,
    q: requestType:(*login* or *LOGIN*) AND (user:(blabla*)),
    facet.limit: 100,
    facet.field: loginUserName,
    wt: json,
    fq: startTimeISO:[2015-01-22T00:00:00.000Z TO 2015-01-23T00:00:00.000Z],
    rows: 0
  }

The result I am getting is:

  facet_counts: {
    facet_queries: { },
    facet_fields: {
      loginUserName: [
        harry,
        36,
        larry,
        10,
        Carey ]
    },
    facet_dates: { },
    facet_ranges: { }
  }

As you can see, the result shows a facet count of 36 for loginUserName=harry. So when I do a Solr search for the logs, I should get 36 logs. But I am getting 18. This is happening for all the searches now. For some reason, I see double counting. Either faceting is double counting or search is half-counting? This is my Solr search query:

  params: {
    sort: startTimeISO desc,
    debugQuery: true,
    start: 0,
    q: requestType:(*login* or *LOGIN*) AND (user:(blabla*)) AND (loginUserName:(harry)),
    wt: json,
    fq: startTimeISO:[2015-01-22T00:00:00.000Z TO 2015-01-23T00:00:00.000Z],
    rows: 200
  }

This query gives only 18 logs, but the Solr facet query gave 36. Is there something incorrect in either (or both) of my queries? I am trying to debug it, but I think I am missing something silly.
RE: Facet Double Counting
harish singh [harish.sing...@gmail.com] wrote: As you see, the result is showing a facet count of 36 for loginUserName=harry. So when I do a Solr search for logs, I should get 36 logs. But I am getting 18. This is happening for all the searches now.

If you have recently added or changed uniqueKey, and your index has multiple documents with the same key, that would explain the behaviour you describe. If so, I recommend you delete the index and rebuild it from scratch.
- Toke Eskildsen
Re: Facet Double Counting
Weird; optimize or expungeDeletes=true should do the trick. Can you try to optimize this time?

On Sunday, January 25, 2015 11:08 AM, harish singh harish.sing...@gmail.com wrote:

Still the same. Could the reason be that if there are duplicate logs/documents, the facet query counts them, but when I do the search query, Solr eliminates the duplicates?

On Sat, Jan 24, 2015 at 11:47 PM, Ahmet Arslan iori...@yahoo.com.invalid wrote:

Hi Harish, What happens when you purge deleted terms with 'solr/core/update?commit=true&expungeDeletes=true'? ahmet

On Sunday, January 25, 2015 1:59 AM, harish singh harish.sing...@gmail.com wrote:

Hi, I am noticing strange behavior with Solr facet searching. This is my facet query:

  params: {
    facet: true,
    sort: startTimeISO desc,
    debugQuery: true,
    facet.mincount: 1,
    facet.sort: count,
    start: 0,
    q: requestType:(*login* or *LOGIN*) AND (user:(blabla*)),
    facet.limit: 100,
    facet.field: loginUserName,
    wt: json,
    fq: startTimeISO:[2015-01-22T00:00:00.000Z TO 2015-01-23T00:00:00.000Z],
    rows: 0
  }

The result I am getting is:

  facet_counts: {
    facet_queries: { },
    facet_fields: {
      loginUserName: [
        harry,
        36,
        larry,
        10,
        Carey ]
    },
    facet_dates: { },
    facet_ranges: { }
  }

As you can see, the result shows a facet count of 36 for loginUserName=harry. So when I do a Solr search for the logs, I should get 36 logs. But I am getting 18. This is happening for all the searches now. For some reason, I see double counting. Either faceting is double counting or search is half-counting? This is my Solr search query:

  params: {
    sort: startTimeISO desc,
    debugQuery: true,
    start: 0,
    q: requestType:(*login* or *LOGIN*) AND (user:(blabla*)) AND (loginUserName:(harry)),
    wt: json,
    fq: startTimeISO:[2015-01-22T00:00:00.000Z TO 2015-01-23T00:00:00.000Z],
    rows: 200
  }

This query gives only 18 logs, but the Solr facet query gave 36. Is there something incorrect in either (or both) of my queries?
I am trying to debug it, but I think I am missing something silly.
Re: Facet Double Counting
Oh yes!! :) I tried faceting on the UUID field. All the UUIDs have count = 2, which probably explains why I am getting double counting in the facet result. So does this mean that when I do a facet query on facet.field=loginUserName, Solr does not look at the UUID? And is the unique field (UUID in this case) considered only for search queries?

On Sun, Jan 25, 2015 at 3:15 AM, Toke Eskildsen t...@statsbiblioteket.dk wrote:

harish singh [harish.sing...@gmail.com] wrote: As you see, the result is showing a facet count of 36 for loginUserName=harry. So when I do a Solr search for logs, I should get 36 logs. But I am getting 18. This is happening for all the searches now.

If you have recently added or changed uniqueKey, and your index has multiple documents with the same key, that would explain the behaviour you describe. If so, I recommend you delete the index and rebuild it from scratch.
- Toke Eskildsen
RE: Facet Double Counting
harish singh [harish.sing...@gmail.com] wrote: I tried the Faceting on the UUID field.

Nice debug trick. I'll remember that for next time.

So does this mean, when I do a facet query on facet.field=loginUserName, Solr does not look at the UUID?

Yes. For faceting, Solr only uses the internal docIDs and the facet field data.

And the unique field (UUID in this case) is considered only while Search Queries?

For a distributed setup, the documents are resolved from the shards using uniqueKey. I did not think this was the case for a non-distributed setup - for such a setup, I thought that the documents were resolved using internal docIDs. If your index is single-shard, then I was wrong.
- Toke Eskildsen
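The behavior discussed in this thread can be illustrated with a toy model (plain Python, not Solr code; the shard contents here are made-up examples): per-shard facet counts are summed with no reference to uniqueKey, while document results are resolved by uniqueKey, so duplicated documents inflate facet counts but collapse in search results.

```python
from collections import Counter

# Toy model: two shards that both ended up with copies of the same
# documents, e.g. after a uniqueKey change. Each doc is (uuid, loginUserName).
shard1 = [("u1", "harry"), ("u2", "harry")]
shard2 = [("u1", "harry"), ("u2", "harry")]  # duplicates of shard1

# Facet merge: per-shard term counts are simply summed -- uniqueKey is
# never consulted, so duplicates are counted twice.
facet = Counter()
for shard in (shard1, shard2):
    facet.update(name for _, name in shard)

# Document merge: results are resolved by uniqueKey, so duplicates collapse.
seen = {}
for shard in (shard1, shard2):
    for uuid, name in shard:
        seen.setdefault(uuid, name)

print(facet["harry"])  # 4: facet count is doubled
print(len(seen))       # 2: search returns each uuid only once
```

This matches the 36-vs-18 discrepancy reported above: every document existed twice, faceting counted both copies, and search returned each uuid once.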
Re: solr replication vs. rsync
On 1/24/2015 10:56 PM, Dan Davis wrote: When I polled the various projects already using Solr at my organization, I was greatly surprised that none of them were using Solr replication, because they had talked about replicating the data. But we are not Pinterest, and do not expect to be taking in changes one post at a time (at least the engineers don't - just wait until it's used for a CRUD app that wants full-text search on a description field!). Still, rsync can be very, very fast with the right options (-W for gigabit ethernet, and maybe -S for sparse files). I've clocked it at 48 MB/s over GigE previously. Does anyone have any numbers for how fast Solr replication goes, and what to do to tune it? I'm not enthusiastic about giving up recently tested cluster stability for a home-grown mess, but I am interested in any numbers that are out there.

Numbers are included on the Solr replication wiki page, in both graph and numeric form. Gathering these numbers must have been pretty easy -- before the HTTP replication made it into Solr, Solr used to contain an rsync-based implementation. http://wiki.apache.org/solr/SolrReplication#Performance_numbers

Other data on that wiki page discusses the replication config. There's not a lot to tune. I run a redundant non-SolrCloud index myself through a different method -- my indexing program indexes each index copy completely independently. There is no replication. This separation allows me to upgrade any component, or change any part of solrconfig or schema, on either copy of the index without affecting the other copy at all. With replication, if something is changed on the master or the slave, you might find that the slave no longer works, because it will be handling an index created by different software or a different config.

Thanks, Shawn
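As a rough sanity check on the throughput quoted above (48 MB/s over GigE), a full-index copy time is simple arithmetic; the 9 GB index size used here is an illustrative assumption borrowed from the replicas-in-recovery thread elsewhere in this digest:

```python
# Back-of-the-envelope: time to copy a full index at the quoted rsync rate.
index_gb = 9          # assumed index size (from another thread in this digest)
throughput_mb_s = 48  # measured rsync rate over GigE

seconds = index_gb * 1024 / throughput_mb_s
print(round(seconds))          # 192 seconds
print(round(seconds / 60, 1))  # 3.2 minutes
```

At that rate a full copy is a few minutes, which is why the interesting question is less raw speed and more how often a full (rather than incremental) transfer is triggered.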
Sorting on a computed value
I'll bet some super user has figured this out. How can I perform a sort on a single computed field? I have a QParserPlugin that collapses docs based on data from multiple fields, and I am summing the values from one numerical field 'X'. I was going to use a DocTransformer to inject that summed value into the search results as a new field, but I have now realized that I need to be able to sort on this summed field. Without retrieving all results (which could be 1M+) in my app and sorting manually, is there any way to sort on my computed field within Solr? (Using Solr 4.9.)
--
View this message in context: http://lucene.472066.n3.nabble.com/Sorting-on-a-computed-value-tp4181875.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: replicas goes in recovery mode right after update
Shawn directed you over here to the user list, but I see this note on SOLR-7030: "All our searchers have 12 GB of RAM available and have quad core Intel(R) Xeon(R) CPU X5570 @ 2.93GHz. There is only one java process running i.e jboss and solr in it. All 12 GB is available as heap for the java process..."

So you have 12G of physical memory and have allocated 12G to the Java process? This is an anti-pattern. If that's the case, your operating system is being starved for memory, probably hitting a state where it spends all of its time in stop-the-world garbage collection; eventually it doesn't respond to Zookeeper's ping, so Zookeeper thinks the node is down and puts it into recovery, where it spends a lot of time doing... essentially nothing.

About the hard and soft commits: I suspect these are entirely unrelated, but here's a blog on what they do. You should pick the configuration that supports your use case (i.e. how much latency can you stand between indexing and being able to search?): https://lucidworks.com/blog/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

Here's one very good reason you shouldn't starve your op system by allocating all the physical memory to the JVM: http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

But your biggest problem is that you have far too much of your physical memory allocated to the JVM. This will cause you endless problems; you just need more physical memory on those boxes. It's _possible_ you could get by with less memory for the JVM; counterintuitive as it seems, try 8G or maybe even 6G. At some point you'll hit OOM errors, but that'll give you a lower limit on what the JVM needs. Unless I've mis-interpreted what you've written, though, I doubt you'll get stable with that much memory allocated to the JVM.
Best, Erick

On Sun, Jan 25, 2015 at 1:02 PM, Vijay Sekhri sekhrivi...@gmail.com wrote:

We have a cluster of solr cloud servers with 10 shards and 4 replicas in each shard in our stress environment. In our prod environment we will have 10 shards and 15 replicas in each shard. Our current commit settings are as follows:

  <autoSoftCommit>
    <maxDocs>50</maxDocs>
    <maxTime>18</maxTime>
  </autoSoftCommit>
  <autoCommit>
    <maxDocs>200</maxDocs>
    <maxTime>18</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>

We indexed roughly 90 million docs. We have two different ways to index documents: a) Full indexing. It takes 4 hours to index 90 million docs, and the rate of docs coming to the searchers is around 6000 per second. b) Incremental indexing. It takes an hour to index the delta changes. Roughly there are 3 million changes, and the rate of docs coming to the searchers is 2500 per second.

We have two collections, search1 and search2. When we do full indexing, we do it in the search2 collection while search1 is serving live traffic. After it finishes we swap the collections using aliases, so that the search2 collection serves live traffic while search1 becomes available for the next full indexing run. When we do incremental indexing, we do it in the search1 collection, which is serving live traffic.

All our searchers have 12 GB of RAM available and have quad core Intel(R) Xeon(R) CPU X5570 @ 2.93GHz. There is only one java process running, i.e. jboss with solr in it. All 12 GB is available as heap for the java process. We have observed that the heap memory of the java process averages around 8-10 GB. All searchers have a final index size of 9 GB, so in total there are 9 x 10 (shards) = 90 GB worth of index files.

We have observed the following issue when we trigger indexing: in about 10 minutes after we trigger indexing on 14 parallel hosts, the replicas go into recovery mode. This happens to all the shards. In about 20 minutes, more and more replicas start going into recovery mode.
After about half an hour, all replicas except the leaders are in recovery mode. We cannot throttle the indexing load, as that would increase our overall indexing time. So to overcome this issue, we remove all the replicas before we trigger the indexing and then add them back after the indexing finishes. We observe the same behavior of replicas going into recovery when we do incremental indexing. We cannot remove replicas during our incremental indexing because it is also serving live traffic. We tried to throttle our indexing speed, however the cluster still goes into recovery. If we leave the cluster as it is, when the indexing finishes it eventually recovers after a while. As it is serving live traffic, we cannot have these replicas go into recovery mode, because, as our tests have shown, it degrades the search performance. We have tried different commit settings like the below: a) No auto soft commit, no auto hard commit, and a commit triggered at the end of indexing. b) No auto soft commit, yes auto hard commit, and a commit at the end of indexing. c) Yes auto
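For context on the memory advice in the message above, a quick back-of-the-envelope calculation using the figures from this thread shows why a 12 GB heap on a 12 GB box is a problem for Lucene's mmap-based index access:

```python
# Figures from the thread: 12 GB physical RAM per searcher, 12 GB heap,
# 9 GB index per searcher. Lucene relies on the OS page cache (via
# MMapDirectory) to keep index data hot, and the page cache can only
# use memory the JVM has not claimed.
physical_gb = 12
heap_gb = 12
index_gb = 9

free_for_page_cache = physical_gb - heap_gb
print(free_for_page_cache)             # 0 GB left for caching the index
print(index_gb - free_for_page_cache)  # 9 GB of index with nowhere to be cached
```

With nothing left for the page cache, every index read risks hitting disk, which is consistent with the stalls and recovery storms described above.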
Re: Unexplained leader initiated recovery after updates - SolrCmdDistributor no longer retries on RemoteSolrException
Hi Lindsey, Were you ever able to figure out the reason for this behavior? We are experiencing the same issue with solr cloud version 4.10: http://lucene.472066.n3.nabble.com/jira-Commented-SOLR-7030-replicas-goes-in-recovery-mode-right-after-update-td4181881.html https://issues.apache.org/jira/browse/SOLR-7030 We even tried removing the replicas to get around this issue. However, we cannot do that for the collection that is serving our live traffic. Any suggestions? Vijay
--
View this message in context: http://lucene.472066.n3.nabble.com/Re-Unexplained-leader-initiated-recovery-after-updates-SolrCmdDistributor-no-longer-retries-on-Remotn-tp4179309p4181882.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: solr replication vs. rsync
bq: I thought SolrCloud replicas were replication, and you imply parallel indexing

Absolutely! You couldn't get near-real-time indexing if you relied on replication a la 3x. And you also couldn't guarantee consistency. Say you have 1 shard, a leader and a follower (i.e. 2 replicas). Now you throw a doc to be indexed. The sequence is:

1. leader gets the doc
2. leader forwards the doc to the follower
3. leader and follower both add the doc to their local index (and tlog)
4. follower acks back to leader
5. leader acks back to client

So yes, the raw document is forwarded to all replicas before the leader responds to the client, the docs all get written to the tlogs, etc. That's the only way to guarantee that if the leader goes down, the follower can take over without losing documents.

Best, Erick

On Sun, Jan 25, 2015 at 6:15 PM, Dan Davis dansm...@gmail.com wrote:

@Erick, Problem space is not constant indexing. I thought SolrCloud replicas were replication, and you imply parallel indexing. Good to know.

On Sunday, January 25, 2015, Erick Erickson erickerick...@gmail.com wrote:

@Shawn: Cool table, thanks!

@Dan: Just to throw a different spin on it: if you migrate to SolrCloud, this question becomes moot, as the raw documents are sent to each of the replicas, so you very rarely have to copy the full index. It's a tradeoff between constant load (because you're sending the raw documents around whenever you index) and peak usage (when the index replicates). There are a bunch of other reasons to go to SolrCloud, but you know your problem space best.

FWIW, Erick

On Sun, Jan 25, 2015 at 9:26 AM, Shawn Heisey apa...@elyograg.org wrote:

On 1/24/2015 10:56 PM, Dan Davis wrote: When I polled the various projects already using Solr at my organization, I was greatly surprised that none of them were using Solr replication, because they had talked about replicating the data.
But we are not Pinterest, and do not expect to be taking in changes one post at a time (at least the engineers don't - just wait until it's used for a CRUD app that wants full-text search on a description field!). Still, rsync can be very, very fast with the right options (-W for gigabit ethernet, and maybe -S for sparse files). I've clocked it at 48 MB/s over GigE previously. Does anyone have any numbers for how fast Solr replication goes, and what to do to tune it? I'm not enthusiastic about giving up recently tested cluster stability for a home-grown mess, but I am interested in any numbers that are out there.

Numbers are included on the Solr replication wiki page, in both graph and numeric form. Gathering these numbers must have been pretty easy -- before the HTTP replication made it into Solr, Solr used to contain an rsync-based implementation. http://wiki.apache.org/solr/SolrReplication#Performance_numbers

Other data on that wiki page discusses the replication config. There's not a lot to tune. I run a redundant non-SolrCloud index myself through a different method -- my indexing program indexes each index copy completely independently. There is no replication. This separation allows me to upgrade any component, or change any part of solrconfig or schema, on either copy of the index without affecting the other copy at all. With replication, if something is changed on the master or the slave, you might find that the slave no longer works, because it will be handling an index created by different software or a different config.

Thanks, Shawn
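The update sequence Erick describes above can be sketched as a toy model (plain Python, nothing Solr-specific) to make the consistency guarantee concrete: the client's ack implies every follower already holds the doc.

```python
# Toy model of the SolrCloud update flow: the leader acks to the client
# only after it has indexed locally and every follower has acked.
def index_doc(doc, leader, followers):
    leader.append(doc)        # leader adds doc to its local index (and tlog)
    acks = []
    for follower in followers:
        follower.append(doc)  # leader forwards the raw doc to the follower
        acks.append(True)     # follower acks back to the leader
    return all(acks)          # leader acks back to the client

leader, follower = [], []
client_acked = index_doc("doc1", leader, [follower])
print(client_acked)        # True
print("doc1" in follower)  # True: follower holds the doc before the client ack
```

Since the follower indexes before the client is acked, a leader failure after the ack cannot lose the document, which is the guarantee the message above explains.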
Re: replicas goes in recovery mode right after update
Ah, OK. Whew! Because I was wondering how you were running at _all_ if all the memory was allocated to the JVM ;)

What is your Zookeeper timeout? The original default was 15 seconds, and this has caused problems like this. Here's the scenario: you send a bunch of docs at the server, and eventually you hit a stop-the-world GC that takes longer than the Zookeeper timeout. So ZK thinks the node is down and initiates recovery. Eventually, you hit this on all the replicas.

Sometimes I've seen situations where the answer is giving a bit more memory to the JVM, say 2-4G in your case. The theory here (and this is a shot in the dark) is that your peak JVM requirements are close to your 12G, so the garbage collector spends enormous amounts of time collecting a small bit of memory, runs for some fraction of a second, and does it again. Adding more memory to the JVMs allows the parallel collections to work without so many stop-the-world GC pauses.

So what I'd do is turn on GC logging (probably on the replicas) and look for very long GC pauses. Mark Miller put together a blog here: https://lucidworks.com/blog/garbage-collection-bootcamp-1-0/ See the "getting a view into garbage collection" section. The smoking gun here is if you see full GC pauses that are longer than the ZK timeout.

90M docs in 4 hours across 10 shards is only 625/sec or so per shard. I've seen sustained indexing rates significantly above this; YMMV of course, a lot depends on the size of the docs.

What version of Solr, BTW? And when you say you fire a bunch of indexers, I'm assuming these are SolrJ clients and use CloudSolrServer?

Best, Erick

On Sun, Jan 25, 2015 at 4:10 PM, Vijay Sekhri sekhrivi...@gmail.com wrote: Thank you for the reply, Erick. I am sorry, I had the wrong information posted. I posted our DEV env configuration by mistake.
After double-checking our stress and prod beta envs, where we found the original issue, I found all the searchers have around 50 GB of RAM available and two instances of the JVM running (on 2 different ports). Both instances have 12 GB allocated. The remaining 26 GB is available for the OS. The 1st instance on a host has the search1 collection (the live collection) and the 2nd instance on the same host has the search2 collection (for full indexing). There is plenty of room for OS-related tasks. Our issue is not in any way related to OS starving, as shown by our dashboards.

We have been through https://lucidworks.com/blog/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/ a lot of times, but we have two modes of operation: a) 1st collection (live traffic) - heavy searches and medium indexing; b) 2nd collection (not serving traffic) - very heavy indexing, no searches. When our indexing finishes we swap the alias for these collections. So essentially we need a configuration that can support both use cases together. We have tried a lot of different configuration options and none of them seems to work. My suspicion is that solr cloud is unable to keep up with the updates at the rate we are sending them while it is trying to stay consistent with all the replicas.

On Sun, Jan 25, 2015 at 5:30 PM, Erick Erickson erickerick...@gmail.com wrote:

Shawn directed you over here to the user list, but I see this note on SOLR-7030: "All our searchers have 12 GB of RAM available and have quad core Intel(R) Xeon(R) CPU X5570 @ 2.93GHz. There is only one java process running i.e jboss and solr in it. All 12 GB is available as heap for the java process..."

So you have 12G of physical memory and have allocated 12G to the Java process? This is an anti-pattern.
If that's the case, your operating system is being starved for memory, probably hitting a state where it spends all of its time in stop-the-world garbage collection; eventually it doesn't respond to Zookeeper's ping, so Zookeeper thinks the node is down and puts it into recovery, where it spends a lot of time doing... essentially nothing.

About the hard and soft commits: I suspect these are entirely unrelated, but here's a blog on what they do. You should pick the configuration that supports your use case (i.e. how much latency can you stand between indexing and being able to search?): https://lucidworks.com/blog/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

Here's one very good reason you shouldn't starve your op system by allocating all the physical memory to the JVM: http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

But your biggest problem is that you have far too much of your physical memory allocated to the JVM. This will cause you endless problems; you just need more physical memory on those boxes. It's _possible_ you could get by with less memory for the JVM; counterintuitive as it seems, try 8G or maybe even 6G. At some point you'll hit OOM errors, but that'll give you a lower limit on what the JVM needs. Unless I've mis-interpreted
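The per-shard rate Erick quotes in the message above works out as follows (simple arithmetic, shown only to make the figures easy to check):

```python
# Indexing-rate arithmetic from the thread: 90M docs in 4 hours, 10 shards.
docs = 90_000_000
hours = 4
shards = 10

total_rate = docs / (hours * 3600)  # docs/sec across the whole cluster
per_shard = total_rate / shards
print(round(total_rate))  # 6250 docs/sec overall
print(round(per_shard))   # 625 docs/sec per shard
```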
Re: Indexed epoch time in Solr
Perhaps you could use a DocTransformer to convert the unix time field into any representation you want? You'll need to write a custom DocTransformer, but this is not a complex task. Regards,

- Original Message -
From: Ahmed Adel ahmed.a...@badrit.com
To: solr-user@lucene.apache.org
Sent: Monday, January 26, 2015 12:35:54 AM
Subject: Indexed epoch time in Solr

Hi All, Is there a way to convert a unix time field that is already indexed to ISO-8601 format in the query response? If this is not possible at the query level, what is the best way to copy this field to a new Solr standard date field? Thanks,
-- *Ahmed Adel* http://s.wisestamp.com/links?url=http%3A%2F%2Fwww.linkedin.com%2Fin%2F
---
XII Anniversary of the founding of the University of Informatics Sciences. 12 years of history alongside Fidel. December 12, 2014.
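Outside Solr, the conversion being asked about is trivial; here is a sketch in Python (shown only to illustrate the epoch-to-ISO-8601 mapping, not a Solr API; a custom DocTransformer would do the equivalent in Java):

```python
from datetime import datetime, timezone

def epoch_to_iso8601(epoch_seconds):
    """Convert a unix timestamp (seconds) to the ISO-8601 UTC form
    Solr uses for date fields, e.g. 2015-01-26T00:00:00Z."""
    dt = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return dt.strftime("%Y-%m-%dT%H:%M:%SZ")

print(epoch_to_iso8601(0))           # 1970-01-01T00:00:00Z
print(epoch_to_iso8601(1422230400))  # 2015-01-26T00:00:00Z
```

For copying into a real Solr date field, the same conversion would have to happen at index time (e.g. in an update processor), since the string returned here is exactly the format Solr's date fields expect.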
Re: solr replication vs. rsync
@Shawn: Cool table, thanks!

@Dan: Just to throw a different spin on it: if you migrate to SolrCloud, this question becomes moot, as the raw documents are sent to each of the replicas, so you very rarely have to copy the full index. It's a tradeoff between constant load (because you're sending the raw documents around whenever you index) and peak usage (when the index replicates). There are a bunch of other reasons to go to SolrCloud, but you know your problem space best.

FWIW, Erick

On Sun, Jan 25, 2015 at 9:26 AM, Shawn Heisey apa...@elyograg.org wrote:

On 1/24/2015 10:56 PM, Dan Davis wrote: When I polled the various projects already using Solr at my organization, I was greatly surprised that none of them were using Solr replication, because they had talked about replicating the data. But we are not Pinterest, and do not expect to be taking in changes one post at a time (at least the engineers don't - just wait until it's used for a CRUD app that wants full-text search on a description field!). Still, rsync can be very, very fast with the right options (-W for gigabit ethernet, and maybe -S for sparse files). I've clocked it at 48 MB/s over GigE previously. Does anyone have any numbers for how fast Solr replication goes, and what to do to tune it? I'm not enthusiastic about giving up recently tested cluster stability for a home-grown mess, but I am interested in any numbers that are out there.

Numbers are included on the Solr replication wiki page, in both graph and numeric form. Gathering these numbers must have been pretty easy -- before the HTTP replication made it into Solr, Solr used to contain an rsync-based implementation. http://wiki.apache.org/solr/SolrReplication#Performance_numbers

Other data on that wiki page discusses the replication config. There's not a lot to tune. I run a redundant non-SolrCloud index myself through a different method -- my indexing program indexes each index copy completely independently. There is no replication.
This separation allows me to upgrade any component, or change any part of solrconfig or schema, on either copy of the index without affecting the other copy at all. With replication, if something is changed on the master or the slave, you might find that the slave no longer works, because it will be handling an index created by different software or a different config. Thanks, Shawn
Indexed epoch time in Solr
Hi All, Is there a way to convert a unix time field that is already indexed to ISO-8601 format in the query response? If this is not possible at the query level, what is the best way to copy this field to a new Solr standard date field? Thanks,
-- *Ahmed Adel* http://s.wisestamp.com/links?url=http%3A%2F%2Fwww.linkedin.com%2Fin%2F
Re: [MASSMAIL]Weighting of prominent text in HTML
Hi Dan: Agreed, this question is more Nutch-related than Solr ;) Nutch doesn't send any data into the /update/extract request handler; all the text and metadata extraction happens on the Nutch side rather than relying on the ExtractingRequestHandler provided by Solr. Underneath, Nutch uses Tika, the same technology as the ExtractingRequestHandler, so there shouldn't be any great difference.

By default Nutch doesn't boost anything, as it is Solr's job to boost the different content in the different fields, which is what happens when you do a query against Solr. Nutch calculates the LinkRank, which is a variation of the famous PageRank (or the OPIC score, which is another scoring algorithm implemented in Nutch, and which I believe is the default in Nutch 2.x).

What you can do is use the headings and map the heading tags into different fields, and then apply different boosts to each field. The general idea with Nutch is to break the web page into pieces and store each piece in a different field in Solr; then you can tweak your relevance function using the values you see fit. So you don't need to write any plugin to accomplish this (at least for the h1, h2, etc. example you provided; if you want to extract other parts of the webpage you'll need to write your own plugin to do so).

Nutch is highly customizable; you can write a plugin for almost any piece of logic, from parsers to indexers, passing through URL filters, scoring algorithms, protocols, and a long, long list. Usually the plugins are not so difficult to write, but the problem is knowing which extension point you need to use; this comes with experience and taking a good dive into the source code.

Hope this helps,

- Original Message -
From: Dan Davis dansm...@gmail.com
To: solr-user solr-user@lucene.apache.org
Sent: Monday, January 26, 2015 12:08:13 AM
Subject: [MASSMAIL]Weighting of prominent text in HTML

By examining solr.log, I can see that Nutch is using the /update request handler rather than /update/extract.
So, this may be a more appropriate question for the nutch mailing list. OTOH, y'all know the answer off the top of your head. Will Nutch boost text occurring in h1, h2, etc. more heavily than text in a normal paragraph? Can this weighting be tuned without writing a plugin? Is writing a plugin often needed because of the flexibility that is needed in practice? I wanted to call this post *Anatomy of a small scale search engine*, but lacked the nerve ;) Thanks, all and many, Dan Davis, Systems/Applications Architect, National Library of Medicine
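The per-field boosting described in the reply above can be expressed on the Solr side with an edismax `qf` parameter. A small sketch of the query parameters, where the field names (`h1`, `h2`, `content`) are hypothetical and assume Nutch has been configured to map heading tags into those fields:

```python
# Sketch of Solr query params that weight heading fields more heavily
# than body text. The field names are assumptions, not a standard schema.
params = {
    "defType": "edismax",
    "q": "search engine anatomy",
    "qf": "h1^5 h2^3 content^1",  # h1 matches score 5x, h2 3x vs body text
    "wt": "json",
}

# Rendered as a query string (illustrative only; a real client would
# URL-encode the values before sending them to /select).
query_string = "&".join(f"{k}={v}" for k, v in params.items())
print(query_string)
```

The boost numbers themselves are tuning knobs; the point is that the weighting lives in the Solr query, not in Nutch, so it can be changed without writing any plugin.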
Weighting of prominent text in HTML
By examining solr.log, I can see that Nutch is using the /update request handler rather than /update/extract. So, this may be a more appropriate question for the nutch mailing list. OTOH, y'all know the answer off the top of your head. Will Nutch boost text occurring in h1, h2, etc. more heavily than text in a normal paragraph? Can this weighting be tuned without writing a plugin? Is writing a plugin often needed because of the flexibility that is needed in practice? I wanted to call this post *Anatomy of a small scale search engine*, but lacked the nerve ;) Thanks, all and many, Dan Davis, Systems/Applications Architect, National Library of Medicine
Re: replicas goes in recovery mode right after update
Thank you for the reply, Erick. I am sorry, I had the wrong information posted. I posted our DEV env configuration by mistake.

After double-checking our stress and prod beta envs, where we found the original issue, I found all the searchers have around 50 GB of RAM available and two instances of the JVM running (on 2 different ports). Both instances have 12 GB allocated. The remaining 26 GB is available for the OS. The 1st instance on a host has the search1 collection (the live collection) and the 2nd instance on the same host has the search2 collection (for full indexing). There is plenty of room for OS-related tasks. Our issue is not in any way related to OS starving, as shown by our dashboards.

We have been through https://lucidworks.com/blog/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/ a lot of times, but we have two modes of operation: a) 1st collection (live traffic) - heavy searches and medium indexing; b) 2nd collection (not serving traffic) - very heavy indexing, no searches. When our indexing finishes we swap the alias for these collections. So essentially we need a configuration that can support both use cases together. We have tried a lot of different configuration options and none of them seems to work. My suspicion is that solr cloud is unable to keep up with the updates at the rate we are sending them while it is trying to stay consistent with all the replicas.

On Sun, Jan 25, 2015 at 5:30 PM, Erick Erickson erickerick...@gmail.com wrote:

Shawn directed you over here to the user list, but I see this note on SOLR-7030: "All our searchers have 12 GB of RAM available and have quad core Intel(R) Xeon(R) CPU X5570 @ 2.93GHz. There is only one java process running i.e jboss and solr in it. All 12 GB is available as heap for the java process..."

So you have 12G of physical memory and have allocated 12G to the Java process? This is an anti-pattern.
If that's the case, your operating system is being starved for memory, probably hitting a state where it spends all of its time in stop-the-world garbage collection; eventually it doesn't respond to Zookeeper's ping, so Zookeeper thinks the node is down and puts it into recovery, where it spends a lot of time doing... essentially nothing.

About the hard and soft commits: I suspect these are entirely unrelated, but here's a blog on what they do. You should pick the configuration that supports your use case (i.e. how much latency can you stand between indexing and being able to search?): https://lucidworks.com/blog/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

Here's one very good reason you shouldn't starve your op system by allocating all the physical memory to the JVM: http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

But your biggest problem is that you have far too much of your physical memory allocated to the JVM. This will cause you endless problems; you just need more physical memory on those boxes. It's _possible_ you could get by with less memory for the JVM; counterintuitive as it seems, try 8G or maybe even 6G. At some point you'll hit OOM errors, but that'll give you a lower limit on what the JVM needs. Unless I've mis-interpreted what you've written, though, I doubt you'll get stable with that much memory allocated to the JVM.

Best, Erick

On Sun, Jan 25, 2015 at 1:02 PM, Vijay Sekhri sekhrivi...@gmail.com wrote:

We have a cluster of solr cloud servers with 10 shards and 4 replicas in each shard in our stress environment. In our prod environment we will have 10 shards and 15 replicas in each shard. Our current commit settings are as follows:

  <autoSoftCommit>
    <maxDocs>50</maxDocs>
    <maxTime>18</maxTime>
  </autoSoftCommit>
  <autoCommit>
    <maxDocs>200</maxDocs>
    <maxTime>18</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>

We indexed roughly 90 million docs. We have two different ways to index documents: a) Full indexing.
It takes 4 hours to index 90 million docs, and the rate of docs coming to the searchers is around 6000 per second.
b) Incremental indexing. It takes an hour to index delta changes. Roughly there are 3 million changes, and the rate of docs coming to the searchers is 2500 per second.

We have two collections, search1 and search2. When we do full indexing, we do it in the search2 collection while search1 is serving live traffic. After it finishes, we swap the collections using aliases so that the search2 collection serves live traffic while search1 becomes available for the next full indexing run. When we do incremental indexing, we do it in the search1 collection, which is serving live traffic.

All our searchers have 12 GB of RAM available and have quad-core Intel(R) Xeon(R) CPU X5570 @ 2.93GHz. There is only one java process running, i.e. jboss with solr in it. All 12 GB is available as heap for the java process. We have observed ...
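Erick's point about heap versus page cache can be made concrete with a little arithmetic. This is a hedged back-of-envelope sketch using the numbers quoted in this thread (50 GB RAM per host, two 12 GB JVM heaps, a 9 GB index per searcher); the helper function and figures are illustrative, not from any Solr API.

```python
# Rough memory-budget check for a Solr host: how much RAM is left for the
# OS page cache once the JVM heaps are carved out, and whether the index
# can be fully cached (mmap'd) in what remains. Figures are illustrative.

def os_cache_headroom(total_gb, heap_gbs, index_gb):
    """Return (GB left for the OS page cache, whether the index fits in it)."""
    free = total_gb - sum(heap_gbs)
    return free, free >= index_gb

headroom, fits = os_cache_headroom(50, [12, 12], 9)
print(headroom, fits)  # 26 True -- 26 GB of cache headroom; the 9 GB index fits
```

By this accounting the updated 50 GB configuration is healthy, whereas the originally reported setup (12 GB physical, 12 GB heap) leaves zero headroom, which is exactly the anti-pattern Erick describes.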
Re: solr replication vs. rsync
@Erick, the problem space is not constant indexing. I thought SolrCloud replicas were replication, and you imply parallel indexing. Good to know.

On Sunday, January 25, 2015, Erick Erickson erickerick...@gmail.com wrote:
@Shawn: Cool table, thanks!
@Dan: Just to throw a different spin on it: if you migrate to SolrCloud, this question becomes moot, as the raw documents are sent to each of the replicas, so you very rarely have to copy the full index. It's kind of a tradeoff between constant load (because you're sending the raw documents around whenever you index) and peak usage when the index replicates. There are a bunch of other reasons to go to SolrCloud, but you know your problem space best.
FWIW,
Erick

On Sun, Jan 25, 2015 at 9:26 AM, Shawn Heisey apa...@elyograg.org wrote:
On 1/24/2015 10:56 PM, Dan Davis wrote:
When I polled the various projects already using Solr at my organization, I was greatly surprised that none of them were using Solr replication, even though they had talked about replicating the data. But we are not Pinterest, and do not expect to be taking in changes one post at a time (at least the engineers don't - just wait until it's used for a CRUD app that wants full-text search on a description field!). Still, rsync can be very, very fast with the right options (-W for gigabit ethernet, and maybe -S for sparse files). I've clocked it at 48 MB/s over GigE previously. Does anyone have any numbers for how fast Solr replication goes, and what to do to tune it? I'm not enthusiastic about giving up recently tested cluster stability for a home-grown mess, but I am interested in numbers that are out there.

Numbers are included on the Solr replication wiki page, both in graph and numeric form. Gathering these numbers must have been pretty easy -- before HTTP replication made it into Solr, Solr used to contain an rsync-based implementation.
http://wiki.apache.org/solr/SolrReplication#Performance_numbers

Other data on that wiki page discusses the replication config. There's not a lot to tune.

I run a redundant non-SolrCloud index myself through a different method -- my indexing program indexes each index copy completely independently. There is no replication. This separation allows me to upgrade any component, or change any part of solrconfig or the schema, on either copy of the index without affecting the other copy at all. With replication, if something is changed on the master or the slave, you might find that the slave no longer works, because it will be handling an index created by different software or a different config.

Thanks,
Shawn
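Dan's 48 MB/s rsync figure makes it easy to estimate what a full-index copy actually costs. A quick hedged sketch, borrowing the 9 GB per-searcher index size mentioned in the "replicas goes in recovery" thread (the numbers are illustrative, not measured on this cluster):

```python
# Back-of-envelope: wall-clock time to copy a full index over GigE at the
# rsync rate quoted in the thread (48 MB/s). Illustrative numbers only.

def copy_time_seconds(index_gb, rate_mb_per_s):
    """Seconds to transfer index_gb gigabytes at rate_mb_per_s megabytes/sec."""
    return index_gb * 1024 / rate_mb_per_s

t = copy_time_seconds(9, 48)
print(round(t))  # 192 -- i.e. a bit over 3 minutes per full copy
```

That peak-bandwidth burst every replication cycle is the tradeoff Erick describes against SolrCloud's steady per-document fan-out.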
Re: solr replication vs. rsync
Thanks!

On Sunday, January 25, 2015, Erick Erickson erickerick...@gmail.com wrote:
[quoted text snipped -- identical to the previous messages in this thread]
replicas goes in recovery mode right after update
We have a cluster of SolrCloud servers with 10 shards and 4 replicas in each shard in our stress environment. In our prod environment we will have 10 shards and 15 replicas in each shard. Our current commit settings are as follows:

  <autoSoftCommit>
    <maxDocs>50</maxDocs>
    <maxTime>18</maxTime>
  </autoSoftCommit>
  <autoCommit>
    <maxDocs>200</maxDocs>
    <maxTime>18</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>

We indexed roughly 90 million docs. We have two different ways to index documents:
a) Full indexing. It takes 4 hours to index 90 million docs, and the rate of docs coming to the searchers is around 6000 per second.
b) Incremental indexing. It takes an hour to index delta changes. Roughly there are 3 million changes, and the rate of docs coming to the searchers is 2500 per second.

We have two collections, search1 and search2. When we do full indexing, we do it in the search2 collection while search1 is serving live traffic. After it finishes, we swap the collections using aliases so that the search2 collection serves live traffic while search1 becomes available for the next full indexing run. When we do incremental indexing, we do it in the search1 collection, which is serving live traffic.

All our searchers have 12 GB of RAM available and have quad-core Intel(R) Xeon(R) CPU X5570 @ 2.93GHz. There is only one java process running, i.e. jboss with solr in it. All 12 GB is available as heap for the java process. We have observed that the heap memory of the java process averages around 8-10 GB. All searchers have a final index size of 9 GB, so in total there is 9 GB x 10 shards = 90 GB worth of index files.

We have observed the following issue when we trigger indexing: about 10 minutes after we trigger indexing on 14 parallel hosts, the replicas go into recovery mode. This happens to all the shards. In about 20 minutes, more and more replicas go into recovery mode. After about half an hour, all replicas except the leaders are in recovery mode.
We cannot throttle the indexing load, as that would increase our overall indexing time. So to work around this issue, we remove all the replicas before we trigger the indexing and then add them back after the indexing finishes. We observe the same behavior of replicas going into recovery when we do incremental indexing, and we cannot remove replicas during incremental indexing because that collection is also serving live traffic. We tried to throttle our indexing speed; however, the cluster still goes into recovery. If we leave the cluster as it is, it eventually recovers a while after the indexing finishes. But since it is serving live traffic, we cannot have these replicas go into recovery mode, because our tests have shown that it also degrades search performance.

We have tried different commit settings, like the below:
a) No auto soft commit, no auto hard commit, and a commit triggered at the end of indexing
b) No auto soft commit, auto hard commit, and a commit at the end of indexing
c) Auto soft commit, no auto hard commit
d) Auto soft commit, auto hard commit
e) Different frequency settings for the commits above. Please NOTE that we have tried a 15-minute soft commit setting with a 30-minute hard commit setting, the same time settings for both, and a 30-minute soft commit with a one-hour hard commit setting.

Unfortunately, all of the above yield the same behavior: the replicas still go into recovery. We have also increased the ZooKeeper timeout from 30 seconds to 5 minutes and the problem persists. Is there any setting that would fix this issue?

--
* Vijay Sekhri *
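One thing a reply might suggest is throttling on the client side rather than in Solr: send documents in fixed-size batches with a pause between batches, capping the sustained update rate the replicas must absorb. A minimal hedged sketch of that idea; `send_batch` is a hypothetical placeholder for whatever client call (e.g. a SolrJ or pysolr add) actually posts the documents, and the batch sizes are illustrative:

```python
import time

def batches(docs, size):
    """Split docs into consecutive batches of at most `size` documents."""
    return [docs[i:i + size] for i in range(0, len(docs), size)]

def throttled_index(docs, batch_size, max_batches_per_sec, send_batch):
    """Post docs in batches, sleeping between batches to cap the update rate.

    send_batch is a placeholder for the real indexing call (hypothetical).
    """
    pause = 1.0 / max_batches_per_sec
    for batch in batches(docs, batch_size):
        send_batch(batch)
        time.sleep(pause)

# Example of the batching alone: 10 docs in batches of 3
sizes = [len(b) for b in batches(list(range(10)), 3)]
print(sizes)  # [3, 3, 3, 1]
```

Whether this helps depends on why the replicas fall behind; if they are timing out on ZooKeeper pings during GC pauses, the memory sizing discussed earlier in the thread matters more than the send rate.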