Lucene test framework documentation?
Is there any good document about the Lucene test framework? I can only find API docs. Mimicking the unit tests I found in Lucene trunk, I tried to write a unit test that tests a TokenFilter I am writing. But it is failing with an error message like:

java.lang.AssertionError: close() called in wrong state: SETREADER
at __randomizedtesting.SeedInfo.seed([2899FF2F02A64CCB:47B7F94117CE7067]:0)
at org.apache.lucene.analysis.MockTokenizer.close(MockTokenizer.java:261)
at org.apache.lucene.analysis.TokenFilter.close(TokenFilter.java:58)

During a few rounds of trial and error, I got an error message saying that the test framework JAR has to be on the classpath before Lucene core. The stack trace above indicates that the test framework has its own analyzer implementation which makes a certain assumption, but it is not clear what that assumption is. This exception was thrown from one of these lines, I believe:

TokenStream ts = deuAna.tokenStream(text, new StringReader(testText));
TokenStreamToDot tstd = new TokenStreamToDot(testText, ts, new PrintWriter(System.out));
ts.close();

(I'm not too sure what TokenStreamToDot is about. I was hoping it would dump a token stream.)

Kuro
Re: UUIDUpdateProcessorFactory causes repeated documents when uploading csv files?
Problem solved - it's caused by a system outside of Solr. Thank you all for the prompt replies! :)

On Thu, Jan 8, 2015 at 12:40 PM, Chris Hostetter hossman_luc...@fucit.org wrote:

: Thank you for your reply Chris :) Solr is producing the correct result on
: its own. The problem is that I am calling a dataload class to call Solr,
: which worked for assigned ID and composite ID, but not for UUID. Is there a

Sorry -- still confused: are you confirming that you've tracked down the problem you are having to a system outside of Solr? That the problem (of duplicate documents) is introduced by your dataload class prior to sending the docs to Solr?

: place to delete my question on the mailing list?

Nope - once the emails have gone out, they've gone out -- just replying back and confirming the resolution to the problem you saw is good enough.

-Hoss
http://www.lucidworks.com/
Re: Lucene test framework documentation?
(semi-relevant aside) We do happen to ship this test framework with Solr distribution (in dist/test-framework). Why, I don't know! Regards, Alex. Sign up for my Solr resources newsletter at http://www.solr-start.com/ On 8 January 2015 at 23:23, Shawn Heisey apa...@elyograg.org wrote: On 1/8/2015 8:31 PM, TK Solr wrote: Is there any good document about Lucene Test Framework? I can only find API docs. Mimicking the unit test I've found in Lucene trunk, I tried to write a unit test that tests a TokenFilter I am writing. But it is failing with an error message like: java.lang.AssertionError: close() called in wrong state: SETREADER at __randomizedtesting.SeedInfo.seed([2899FF2F02A64CCB:47B7F94117CE7067]:0) at org.apache.lucene.analysis.MockTokenizer.close(MockTokenizer.java:261) at org.apache.lucene.analysis.TokenFilter.close(TokenFilter.java:58) During a few round of try and error, I got an error message that the Test Framework JAR has to be before Lucene Core. And the above stack trace indicates that the Test Framework has its own Analyzer implementation, and it has a certain assumption but it is not clear what the assumption is. This exception was thrown from one of these lines, I believe: TokenStream ts = deuAna.tokenStream(text, new StringReader(testText)); TokenStreamToDot tstd = new TokenStreamToDot(testText, ts, new PrintWriter(System.out)); ts.close(); (I'm not too sure what TokenStreamToDot is about. I was hoping it would dump a token stream.) This question is probably more appropriate for the dev list than the solr-user list, especially since it has more to do with Lucene than Solr. If the javadocs for the classes you want to use are not providing enough info, then you may be able to learn more by looking into the tests included in the Lucene source code that use the framework classes you'd like to try. Thanks, Shawn
Re: Lucene test framework documentation?
On 1/8/2015 8:31 PM, TK Solr wrote: Is there any good document about Lucene Test Framework? I can only find API docs. Mimicking the unit test I've found in Lucene trunk, I tried to write a unit test that tests a TokenFilter I am writing. But it is failing with an error message like: java.lang.AssertionError: close() called in wrong state: SETREADER at __randomizedtesting.SeedInfo.seed([2899FF2F02A64CCB:47B7F94117CE7067]:0) at org.apache.lucene.analysis.MockTokenizer.close(MockTokenizer.java:261) at org.apache.lucene.analysis.TokenFilter.close(TokenFilter.java:58) During a few round of try and error, I got an error message that the Test Framework JAR has to be before Lucene Core. And the above stack trace indicates that the Test Framework has its own Analyzer implementation, and it has a certain assumption but it is not clear what the assumption is. This exception was thrown from one of these lines, I believe: TokenStream ts = deuAna.tokenStream(text, new StringReader(testText)); TokenStreamToDot tstd = new TokenStreamToDot(testText, ts, new PrintWriter(System.out)); ts.close(); (I'm not too sure what TokenStreamToDot is about. I was hoping it would dump a token stream.) This question is probably more appropriate for the dev list than the solr-user list, especially since it has more to do with Lucene than Solr. If the javadocs for the classes you want to use are not providing enough info, then you may be able to learn more by looking into the tests included in the Lucene source code that use the framework classes you'd like to try. Thanks, Shawn
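For reference, the assumption that MockTokenizer enforces is the documented TokenStream consumer contract: a stream must be reset() before the first incrementToken(), fully consumed, then end() and close() called, in that order. Calling close() directly after tokenStream() (while the mock tokenizer is still in the SETREADER state) trips exactly this assertion. A hedged sketch of the expected consume order, reusing the deuAna analyzer and testText variable from the original mail (this is a fragment, not a complete test class):

```java
import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;

// deuAna is the Analyzer under test and testText the input string, both
// from the original mail; "text" is an assumed field name. MockTokenizer
// asserts this exact lifecycle: reset() -> incrementToken()* -> end() -> close().
try (TokenStream ts = deuAna.tokenStream("text", new StringReader(testText))) {
  ts.reset();                       // required before the first incrementToken()
  while (ts.incrementToken()) {
    // inspect or print token attributes here
  }
  ts.end();                         // must follow the last incrementToken()
}                                   // close() is legal only after end()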
GC tuning question - can improving GC pauses cause indexing to slow down?
Is it possible that tuning garbage collection to achieve much better pause characteristics might actually *decrease* index performance? Rebuilds that I did while still using a tuned CMS config would take between 5.5 and 6 hours, sometimes going slightly over 6 hours. A rebuild that I did recently with G1 took 6.82 hours. A rebuild that I did yesterday with further tuned G1 settings (which seemed to result in much smaller pauses than the previous G1 settings) took 8.97 hours, and that was on slightly faster hardware than the rebuild that took 6.82 hours. These rebuilds are done with DIH from MySQL. It seems completely counter-intuitive that settings which show better GC pause characteristics would result in indexing performance going down ... so can anyone shed light on this, tell me whether I'm out of my mind? Thanks, Shawn
Re: GC tuning question - can improving GC pauses cause indexing to slow down?
In the abstract, it sounds like you are seeing the difference between tuning for latency vs tuning for throughput. My hunch would be that you are seeing more (albeit individually quicker) GC events with your new settings during the rebuild.

I imagine that in most cases a Solr rebuild is relatively rare compared to the amount of time when lower latency requests are desired. If the rebuild times are problematic for you, use tunings specific to that workload during the times you need it and then switch back to your low latency settings after. If you are doing that, you can probably run with a bigger heap temporarily during the rebuild, as you aren't likely to be fielding queries and don't benefit from having a larger OS cache available.

Sent from my iPhone

On Jan 8, 2015, at 20:54, Shawn Heisey apa...@elyograg.org wrote:

Is it possible that tuning garbage collection to achieve much better pause characteristics might actually *decrease* index performance? Rebuilds that I did while still using a tuned CMS config would take between 5.5 and 6 hours, sometimes going slightly over 6 hours. A rebuild that I did recently with G1 took 6.82 hours. A rebuild that I did yesterday with further tuned G1 settings (which seemed to result in much smaller pauses than the previous G1 settings) took 8.97 hours, and that was on slightly faster hardware than the rebuild that took 6.82 hours. These rebuilds are done with DIH from MySQL. It seems completely counter-intuitive that settings which show better GC pause characteristics would result in indexing performance going down ... so can anyone shed light on this, tell me whether I'm out of my mind? Thanks, Shawn
Re: GC tuning question - can improving GC pauses cause indexing to slow down?
I would not be surprised at all. Optimizing for minimum pauses usually increases overhead that decreases overall throughput. This is a pretty common tradeoff. For maximum throughput, when you don’t care about pauses, the simplest non-concurrent GC is often the best. That might be the right choice for running big map-reduce jobs, for example. Low-pause GCs do lots of extra work in parallel. Some of that work is making guesses which get thrown away, or doing “just in case” analysis.

To quote Oracle: “When you evaluate or tune any garbage collection, there is always a latency versus throughput trade-off. The G1 GC is an incremental garbage collector with uniform pauses, but also more overhead on the application threads. The throughput goal for the G1 GC is 90 percent application time and 10 percent garbage collection time. When you compare this to Java HotSpot VM's throughput collector, the goal there is 99 percent application time and 1 percent garbage collection time.”

http://www.oracle.com/technetwork/articles/java/g1gc-1984535.html

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/

On Jan 8, 2015, at 8:53 PM, Shawn Heisey apa...@elyograg.org wrote:

Is it possible that tuning garbage collection to achieve much better pause characteristics might actually *decrease* index performance? Rebuilds that I did while still using a tuned CMS config would take between 5.5 and 6 hours, sometimes going slightly over 6 hours. A rebuild that I did recently with G1 took 6.82 hours. A rebuild that I did yesterday with further tuned G1 settings (which seemed to result in much smaller pauses than the previous G1 settings) took 8.97 hours, and that was on slightly faster hardware than the rebuild that took 6.82 hours. These rebuilds are done with DIH from MySQL. It seems completely counter-intuitive that settings which show better GC pause characteristics would result in indexing performance going down ...
so can anyone shed light on this, tell me whether I'm out of my mind? Thanks, Shawn
Re: GC tuning question - can improving GC pauses cause indexing to slow down?
On 1/8/2015 11:05 PM, Boogie Shafer wrote: In the abstract, it sounds like you are seeing the difference between tuning for latency vs tuning for throughput My hunch would be you are seeing more (albeit individually quicker) GC events with your new settings during the rebuild I imagine that in most cases a solr rebuild is relatively rare compared to the amount of times where a lower latency request is desired. If the rebuild times are problematic for you, use tunings specific to that workload during the times you need it and then switch back to your low latency settings after. If you are doing that you can probably run with a bigger heap temporarily during the rebuild as you aren't likely to be fielding queries and don't benefit from having a larger OS cache available Full rebuilds are indeed relatively rare. Avoiding long pauses and keeping query latency low are usually a lot more important than how quickly the index rebuilds. Quick rebuilds are nice, but not strictly necessary. We do incremental updates that start at the top of every minute, unless an update is already running. Exactly how long those updates take is of little importance, unless that time is easier to measure in minutes rather than seconds. If I ever find myself in a situation where completing a rebuild as fast as possible becomes extremely important, does anyone have suggestions for GC tuning options that will optimize for throughput? Thanks, Shawn
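On Shawn's last question: tuning for rebuild throughput generally means the opposite of the low-pause settings discussed in this thread -- the stop-the-world parallel ("throughput") collector with no pause-time goal. A hedged sketch for the Java 7-era JVMs being discussed here (heap sizes are placeholders, not recommendations; whether this actually beats CMS or G1 for a given DIH rebuild would need to be measured on the actual workload):

```
-Xms6g -Xmx6g              # fixed heap; sizes here are placeholders
-XX:+UseParallelGC         # parallel, stop-the-world young-generation collector
-XX:+UseParallelOldGC      # use the parallel collector for full GCs as well
```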
Re: How to return child documents with parent
Did you check [child] at https://cwiki.apache.org/confluence/display/solr/Transforming+Result+Documents ?

On Thu, Jan 8, 2015 at 5:53 PM, yliu y...@mathworks.com wrote:

Hi, What is the best way to return both parent document and child documents in one query? I used SolrJ to create a document, added a few child documents using the addChildDocuments() method, and indexed the parent document. All documents are indexed successfully (parent and children). When I tried to retrieve the parent document along with the child documents, I used expand=true&expand.field=_root_. I was able to get the parent back in the result section and the children in the expandedResults section. Is there some other type of query I should use so I can get the child documents back as children instead of as an expanded result?

Thanks, Y

-- View this message in context: http://lucene.472066.n3.nabble.com/How-to-return-child-documents-with-parent-tp4178081.html Sent from the Solr - User mailing list archive at Nabble.com.

-- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
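Concretely, the [child] transformer referenced above goes into the fl list, usually together with a block-join parent query. A hedged example (the content_type:parentDocument marker field and text_t field are assumptions -- substitute however parent documents are flagged in the actual schema):

```
q={!parent which="content_type:parentDocument"}text_t:some_child_term
fl=*,[child parentFilter=content_type:parentDocument limit=10]
```

With that, matching parents come back with their child documents nested under them, instead of in a separate expanded section.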
Re: Determining the Number of Solr Shards
Thanks guys for your inputs. I would be looking at around 100 TB of total index size with 5,100 million documents for a period of 30 days before we purge the indexes. I had estimated it slightly on the higher side of things, but that's where I feel we would be.

Thanks, Nishanth

On Wed, Jan 7, 2015 at 7:50 PM, Shawn Heisey apa...@elyograg.org wrote:

On 1/7/2015 7:14 PM, Nishanth S wrote: Thanks Shawn and Walter.Yes those are 12,000 writes/second.Reads for the moment would be in the 1000 reads/second. Guess finding out the right number of shards would be my starting point.

I don't think indexing 12000 docs per second would be too much for Solr to handle, as long as you architect the indexing application properly. You would likely need to have several indexing threads or processes that index in parallel. Solr is fully thread-safe and can handle several indexing requests at the same time. If the indexing application is single-threaded, indexing speed will not reach its full potential. Be aware that indexing at the same time as querying will reduce the number of queries per second that you can handle. In an environment where both reads and writes are heavy like you have described, more shards and/or more replicas might be required.

For the query side ... even 1000 queries per second is a fairly heavy query rate. You're likely to need at least a few replicas, possibly several, to handle that. The type and complexity of the queries you do will make a big difference as well. To handle that query level, I would still recommend only running one shard replica on each server. If you have three shards and three replicas, that means 9 Solr servers.

How many documents will you have in total? You said they are about 6KB each ... but depending on the fieldType definitions (and the analysis chain for TextField types), 6KB might be very large or fairly small. Do you have any idea how large the Solr index will be with all your documents?
Estimating that will require indexing a significant percentage of your documents with the actual schema and config that you will use in production. If I know how many documents you have, how large the full index will be, and can see an example of the more complex queries you will do, I can make *preliminary* guesses about the number of shards you might need. I do have to warn you that it will only be a guess. You'll have to experiment to see what works best. Thanks, Shawn
Re: Determining the Number of Solr Shards
My final advice would be my standard proof-of-concept implementation advice - test a configuration with 10% (or 5%) of the target data size and 10% (or 5%) of the estimated resource requirements (maybe 25% of the estimated RAM) and see how well it performs. Take the actual index size and multiply by 10 (or 20 for a 5% load) to get a closer estimate of total storage required. If a 10% load fails to perform well with 25% of the total estimated RAM, then you can be sure that you'll have problems with 10x the data and only 4x the RAM. Increase the RAM for that 10% load until you get acceptable performance for both indexing and a full range of queries, and then use 10x that RAM for the 100% load. That's the OS system memory for file caching, not the total system RAM.

-- Jack Krupansky

On Thu, Jan 8, 2015 at 4:55 PM, Nishanth S nishanth.2...@gmail.com wrote:

Thanks guys for your inputs I would be looking at around 100 Tb of total index size with 5100 million documents for a period of 30 days before we purge the indexes.I had estimated it slightly on the higher side of things but that's where I feel we would be. Thanks, Nishanth On Wed, Jan 7, 2015 at 7:50 PM, Shawn Heisey apa...@elyograg.org wrote: On 1/7/2015 7:14 PM, Nishanth S wrote: Thanks Shawn and Walter.Yes those are 12,000 writes/second.Reads for the moment would be in the 1000 reads/second. Guess finding out the right number of shards would be my starting point. I don't think indexing 12000 docs per second would be too much for Solr to handle, as long as you architect the indexing application properly. You would likely need to have several indexing threads or processes that index in parallel. Solr is fully thread-safe and can handle several indexing requests at the same time. If the indexing application is single-threaded, indexing speed will not reach its full potential. Be aware that indexing at the same time as querying will reduce the number of queries per second that you can handle.
In an environment where both reads and writes are heavy like you have described, more shards and/or more replicas might be required. For the query side ... even 1000 queries per second is a fairly heavy query rate. You're likely to need at least a few replicas, possibly several, to handle that. The type and complexity of the queries you do will make a big difference as well. To handle that query level, I would still recommend only running one shard replica on each server. If you have three shards and three replicas, that means 9 Solr servers. How many documents will you have in total? You said they are about 6KB each ... but depending on the fieldType definitions (and the analysis chain for TextField types), 6KB might be very large or fairly small. Do you have any idea how large the Solr index will be with all your documents? Estimating that will require indexing a significant percentage of your documents with the actual schema and config that you will use in production. If I know how many documents you have, how large the full index will be, and can see an example of the more complex queries you will do, I can make *preliminary* guesses about the number of shards you might need. I do have to warn you that it will only be a guess. You'll have to experiment to see what works best. Thanks, Shawn
Re: How large is your solr index?
On 01/07/2015 05:42 PM, Erick Erickson wrote: True, and you can do this if you take explicit control of the document routing, but... that's quite tricky. You forever after have to send any _updates_ to the same shard you did the first time, whereas SPLITSHARD will do the right thing. Hmm. That is a good point. I wonder if there's some kind of middle ground here? Something that lets me send an update (or new document) to an arbitrary node/shard but which is still routed according to my specific requirements? Maybe this can already be achieved by messing with the routing? snip there are some components that don't do the right thing in distributed mode, joins for instance. The list is actually quite small and is getting smaller all the time. That's fine. We have a lot of query (pre-)processing outside of Solr. It's no problem for us to send a couple of queries to a couple of shards and aggregate the result ourselves. It would, of course, be nice if everything worked in distributed mode, but at least for us it's not an issue. This is a side effect of our complex reporting requirements -- we do aggregation, filtering and other magic on data that is partially in Solr and partially elsewhere. Not true if the other shards have had any indexing activity. The commit is usually forwarded to all shards. If the individual index on a particular shard is unchanged then it should be a no-op though. I think a no-op commit no longer clears the caches either, so that's great. But the usage pattern here is its own bit of a trap. If all your indexing is going to a single shard, then also the entire indexing _load_ is happening on that shard. So the CPU utilization will be higher on that shard than the older ones. Since distributed requests need to get a response from every shard before returning to the client, the response time will be bounded by the response from the slowest shard and this may actually be slower. Probably only noticeable when the CPU is maxed anyway though. 
This is a very good point. But I don't think SPLITSHARD is the magical answer here. If you have N shards on N boxes, and they are all getting nearly full and you decide to split one and move half to a new box, you'll end up with N-2 nearly full boxes and 2 half-full boxes. What happens if the disks fill up further? Do I have to split each shard? That sounds pretty nightmarish! - Bram
Re: How large is your solr index?
On Wed, 2015-01-07 at 22:26 +0100, Joseph Obernberger wrote: Thank you Toke - yes - the data is indexed throughout the day. We are handling very few searches - probably 50 a day; this is an RD system. If your searches are in small bundles, you could pause the indexing flow while the searches are executed, for better performance. Our HDFS cache, I believe, is too small at 10GBytes per shard. That depends a lot on your corpus, your searches and underlying storage. But with our current level of information, it is a really good bet: Having 10GB cache per 130GB (270GB?) data is not a lot with spinning drives. Current parameters for running each shard are: JAVA_OPTS=-XX:MaxDirectMemorySize=10g -XX:+UseLargePages -XX:NewRatio=3 [...] -Xmx10752m One Solr/shard? You could probably win a bit by having one Solr/machine instead. Anyway, it's quite a high Xmx, but I presume you have measured the memory needs. I'd love to try SSDs, but don't have the budget at present to go that route. We find the price/performance for SSD + moderate RAM to be quite a better deal than spinning drives + a lot of RAM, even when buying enterprise hardware. For consumer SSDs (used in our large server) it is even cheaper to use SSDs. It all depends on use pattern of course, but your setup with non-concurrent searches seems like it would fit well. Note: I am sure that the RAM == index size would deliver very high performance. With enough RAM you can use tape to hold the index. Whether it is cost effective is another matter. I'd really like to get the HDFS option to work well as it reduces system complexity. That is very understandable. We examined the option of networked storage (Isilon) with underlying spindles, and it performed adequately for our needs up to 2-3TB of index data. Unfortunately the heavy random read load from Solr meant a noticeable degradation of other services using the networked storage. 
I am sure it could be solved with more centralized hardware, but in the end we found it cheaper and simpler to use local storage for search. This will of course differ across organizations and setups. - Toke Eskildsen
Re: Solr: IndexNotFoundException: no segments* file HdfsDirectoryFactory
Hi, did you solve this problem? I met the same problem when I set up Solr on HDFS. -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-IndexNotFoundException-no-segments-file-HdfsDirectoryFactory-tp4138737p4178034.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: leader split-brain at least once a day - need help
Hi Alan, thanks for the pointer, I'll look at our gc logs.

On 07.01.2015 at 15:46, Alan Woodward wrote:

I had a similar issue, which was caused by https://issues.apache.org/jira/browse/SOLR-6763. Are you getting long GC pauses or similar before the leader mismatches occur? Alan Woodward www.flax.co.uk

On 7 Jan 2015, at 10:01, Thomas Lamy wrote:

Hi there, we are running a 3 server cloud serving a dozen single-shard/replicate-everywhere collections. The 2 biggest collections are ~15M docs, and about 13GiB / 2.5GiB in size. Solr is 4.10.2, ZK 3.4.5, Tomcat 7.0.56, Oracle Java 1.7.0_72-b14. 10 of the 12 collections (the small ones) get filled by DIH full-import once a day starting at 1am. The second biggest collection is updated using DIH delta-import every 10 minutes; the biggest one gets bulk JSON updates with commits once every 5 minutes.

On a regular basis, we have a leader information mismatch:

org.apache.solr.update.processor.DistributedUpdateProcessor; Request says it is coming from leader, but we are the leader

or the opposite:

org.apache.solr.update.processor.DistributedUpdateProcessor; ClusterState says we are the leader, but locally we don't think so

One of these pops up once a day at around 8am, making either some cores go into "recovery failed" state, or all cores of at least one cloud node into state "gone". This started out of the blue about 2 weeks ago, without changes to software, data, or client behaviour.

Most of the time, we get things going again by restarting Solr on the current leader node, forcing a new election - can this be triggered while keeping Solr (and the caches) up? But sometimes this doesn't help; we had an incident last weekend where our admins didn't restart in time, creating millions of entries in /solr/overseer/queue, making ZK close the connection, and leader re-election fails. I had to flush ZK and re-upload the collection config to get Solr up again (just like in https://gist.github.com/isoboroff/424fcdf63fa760c1d1a7).
We have a much bigger cloud (7 servers, ~50GiB data in 8 collections, 1500 requests/s) up and running, which does not have these problems since upgrading to 4.10.2. Any hints on where to look for a solution?

Kind regards
Thomas

-- Thomas Lamy Cytainment AG & Co KG Nordkanalstrasse 52 20097 Hamburg Tel.: +49 (40) 23 706-747 Fax: +49 (40) 23 706-139 Sitz und Registergericht Hamburg HRA 98121 HRB 86068 Ust-ID: DE213009476
Re: Solr startup script in version 4.10.3
Versions 4.10.3 and beyond already use server rather than example, which still finds a reference in the script purely for back compat. A major release 5.0 is coming soon, perhaps the back compat can be removed for that. On 6 Jan 2015 09:30, Dominique Bejean dominique.bej...@eolya.fr wrote: Hi, In release 4.10.3, the following lines were removed from solr starting script (bin/solr) # TODO: see SOLR-3619, need to support server or example # depending on the version of Solr if [ -e $SOLR_TIP/server/start.jar ]; then DEFAULT_SERVER_DIR=$SOLR_TIP/server else DEFAULT_SERVER_DIR=$SOLR_TIP/example fi However, the usage message always say -d dir Specify the Solr server directory; defaults to server Either the usage have to be fixed or the removed lines put back to the script. Personally, I like the default to server directory. My installation process in order to have a clean empty solr instance is to copy examples into server and remove directories like example-DIH, example-shemaless, multicore and solr/collection1 Solr server (or node) can be started without the -d parameter. If this makes sense, a Jira issue could be open. Dominique http://www.eolya.fr/
Re: Solr startup script in version 4.10.3
Things have changed reasonably for the 5.0 release. In case of a standalone mode, it still defaults to the server directory. So you'd find your logs in server/logs. In case of solrcloud mode e.g. if you ran bin/solr -e cloud -noprompt this would default to stuff being copied into example directory (leaving server directory untouched) and everything would run from there. You will also have the option of just creating a new SOLR home and using that instead. See the following: https://cwiki.apache.org/confluence/display/solr/Getting+Started+with+SolrCloud The link above is for the upcoming Solr 5.0 and is still work in progress but should give you more information. Hope that helps. On Tue, Jan 6, 2015 at 1:29 AM, Dominique Bejean dominique.bej...@eolya.fr wrote: Hi, In release 4.10.3, the following lines were removed from solr starting script (bin/solr) # TODO: see SOLR-3619, need to support server or example # depending on the version of Solr if [ -e $SOLR_TIP/server/start.jar ]; then DEFAULT_SERVER_DIR=$SOLR_TIP/server else DEFAULT_SERVER_DIR=$SOLR_TIP/example fi However, the usage message always say -d dir Specify the Solr server directory; defaults to server Either the usage have to be fixed or the removed lines put back to the script. Personally, I like the default to server directory. My installation process in order to have a clean empty solr instance is to copy examples into server and remove directories like example-DIH, example-shemaless, multicore and solr/collection1 Solr server (or node) can be started without the -d parameter. If this makes sense, a Jira issue could be open. Dominique http://www.eolya.fr/ -- Anshum Gupta http://about.me/anshumgupta
Re: UUIDUpdateProcessorFactory causes repeated documents when uploading csv files?
Thank you for your reply Chris :) Solr is producing the correct result on its own. The problem is that I am calling a dataload class to call Solr, which worked for assigned ID and composite ID, but not for UUID. Is there a place to delete my question on the mailing list?

Thank you, Jia

On Wed, Jan 7, 2015 at 8:47 PM, Chris Hostetter hossman_luc...@fucit.org wrote:

: It's a single Solr Instance, and in my files, I used 'doc_key' everywhere,
: but I changed it to id in the email I sent out wanting to make it easier
: to read, sorry don't mean to confuse you :)

https://wiki.apache.org/solr/UsingMailingLists

- what version of solr?
- how exactly are you doing the update? curl? post.jar?
- what exactly is the HTTP response from your update?
- what does your log file show during the update?
- what exactly do all of your configs look like? (you said you made a mistake in your email by trying to make the data easier to read - that could easily be masking some other mistake in your actual configs)

I did my best to try and reproduce what you describe, but I had no luck -- here's exactly what I did...
hossman@frisbee:~/lucene/lucene-4.10.3_tag$ svn diff
Index: solr/example/solr/collection1/conf/solrconfig.xml
===================================================================
--- solr/example/solr/collection1/conf/solrconfig.xml (revision 1650199)
+++ solr/example/solr/collection1/conf/solrconfig.xml (working copy)
@@ -1076,7 +1076,17 @@
        <str name="update.chain">dedupe</str>
      </lst>
      -->
+    <lst name="defaults">
+      <str name="update.chain">autoGenId</str>
+    </lst>
   </requestHandler>
+  <updateRequestProcessorChain name="autoGenId">
+    <processor class="solr.UUIDUpdateProcessorFactory">
+      <str name="fieldName">id</str>
+    </processor>
+    <processor class="solr.LogUpdateProcessorFactory" />
+    <processor class="solr.RunUpdateProcessorFactory" />
+  </updateRequestProcessorChain>

   <!-- The following are implicitly added
   <requestHandler name="/update/json" class="solr.UpdateRequestHandler"

hossman@frisbee:~/lucene/lucene-4.10.3_tag$ curl -X POST 'http://localhost:8983/solr/collection1/update?commit=true' -H 'Content-Type: application/csv' --data-binary 'foo_s,bar_s
aaa,cat
bbb,dog
ccc,yak
'
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">350</int></lst>
</response>

hossman@frisbee:~/lucene/lucene-4.10.3_tag$ curl 'http://localhost:8983/solr/collection1/select?q=*:*&wt=json&indent=true'
{
  "responseHeader":{
    "status":0,
    "QTime":7,
    "params":{
      "indent":"true",
      "q":"*:*",
      "wt":"json"}},
  "response":{"numFound":3,"start":0,"docs":[
      {
        "foo_s":"aaa",
        "bar_s":"cat",
        "id":"025c69cd-6407-4c70-903b-dfde170d373b",
        "_version_":1489692576651935744},
      {
        "foo_s":"bbb",
        "bar_s":"dog",
        "id":"5c7b3d65-1274-4bad-a671-4d643531e2ae",
        "_version_":1489692576673955840},
      {
        "foo_s":"ccc",
        "bar_s":"yak",
        "id":"25a3893f-c538-4b47-aa79-1f4268d66c39",
        "_version_":1489692576673955841}]
  }}

-Hoss
http://www.lucidworks.com/
Solr with Tomcat - enabling SSL problem
Hi,

I am using Solr 4.10.2 with tomcat and embedded Zookeeper. I followed https://cwiki.apache.org/confluence/display/solr/Enabling+SSL#EnablingSSL-SolrCloud to enable SSL. I am currently doing the following:

1. Starting tomcat
2. Running: ../scripts/cloud-scripts/zkcli.sh -zkhost localhost:9983 -cmd put /clusterprops.json '{"urlScheme":"https"}'
3. Restarting tomcat
4. Accessing Solr from my client using org.apache.solr.client.solrj.impl.CloudSolrServer.

And this works. If I don't restart tomcat again after running zkcli.sh, I get the following error: "IOException occured when talking to server at: http://ip:port/solr/" (http, not https). Is it possible to do this without the second restart?

Thanks,
Tali
Re: How large is your solr index?
On 1/8/2015 9:39 AM, Joseph Obernberger wrote:
Yes - it would be 20GBytes of cache per 270GBytes of data.

That's not a lot of cache. One rule of thumb is that you should have at least 50% of the index size available as cache, with 100% being a lot better. The caching should happen on the Solr server itself so there isn't a network bottleneck. This is one of several reasons why local storage on regular filesystems is preferred for Solr.

We've tried lower Xmx but we get OOM errors during faceting of large datasets. Right now we're running two JVMs per physical box (2 shards per box), but we're going to be changing that to one JVM and one shard per box.

This wiki page has some info on what can cause high heap requirements and some general ideas for what you can do about it: http://wiki.apache.org/solr/SolrPerformanceProblems#Java_Heap

If you want to discuss your specific situation, we can use the list, direct email, or the #solr IRC channel. http://wiki.apache.org/solr/IRCChannels

Thanks,
Shawn
RE: Solr Cloud, 100 shards, shards progressively become slower
Extrapolating what Jack was saying on his reply ... with 100 shards and 4 replicas, you have 400 cores that are each about 2.8GB. That results in a total index size of just over a terabyte, with 140GB of index data on each of the eight servers. Assuming you have only one Solr instance per server, an ideal setup would have enough RAM for that 140GB of index plus the 16GB max heap, so 156GB of RAM. Because the ideal setup is rarely a strict requirement unless the query load is high, if you have 128 GB of RAM per server, then I would not be worried about performance. If you have less than that, then I would be worried. Have less than this :/ :( - with not much likelihood to upgrade anytime soon - just out of curiosity, if the performance is proportional to the RAM, why am I seeing such good query times for the initial shard queries? (they are all under 100ms). The behavior with the same shard listed multiple times is a little strange. That behavior could indicate problems with garbage collection pauses -- as Solr is building the memory structures necessary to compose the final response, it might fill up one of the heap generations to its current size limit and each subsequent allocation might require a significant garbage collection, stopping the world while it happens, but not freeing up any significant amount of memory in that particular heap generation. Have you tuned your garbage collection? If not, that is a likely suspect. If you run with the latest Oracle Java, you can use my settings and probably see good GC performance: https://wiki.apache.org/solr/ShawnHeisey Further down on the page is a good set of CMS parameters for earlier Java versions, if you can't run the latest. 
We will look into this thank you, if this can decrease the last few shards' qtime, then we should still see reasonable speeds (if not the fastest when it has to load from disk, but hopefully faster than the 50 seconds we have been seeing). The weird thing is, if I query each shard individually with distrib=false the query time never goes over 100ms (I concurrently hammer 1 shard like I did with my test in the previous email but not using shard=, and I never get a query over 100ms) ... which leads me to believe there is some bottleneck with the distrib=/shard= parameters.
Re: UUIDUpdateProcessorFactory causes repeated documents when uploading csv files?
: Thank you for your reply Chris :) Solr is producing the correct result on
: its own. The problem is that I am calling a dataload class to call Solr,
: which worked for assigned ID and composite ID, but not for UUID. Is there a

Sorry -- still confused: are you confirming that you've tracked down the problem you are having to a system outside of Solr? that the problem (of duplicate documents) is introduced by your dataload class prior to sending the docs to Solr?

: place to delete my question on the mailing list?

nope - once the emails have gone out, they've gone out -- just replying back and confirming the resolution to the problem you saw is good enough.

-Hoss
http://www.lucidworks.com/
RE: Solr Cloud, 100 shards, shards progressively become slower
Andrew Butkus [andrew.but...@c6-intelligence.com] wrote:
[Shawn/Jack: Ideal amount of RAM]
Have less than this :/ :( - with not much likelihood to upgrade anytime soon

The right amount of RAM is what satisfies your requirements and is tightly correlated to the speed of your underlying storage. We have yet to build a machine with anywhere near the same amount of RAM as index size, and do have requirements of hundreds of searches/second on two of them.

- just out of curiosity, if the performance is proportional to the RAM, why am I seeing such good query times for the initial shard queries? (they are all under 100ms).

That is the real mystery here and does not seem to be related to overall performance. Guessing wildly: Maybe the last reported time is the total time spent? As your test works when you specify the same shard over and over again, perhaps you could specify the same shard A 30 times, followed by shard B 1 time, and see if shard B reports a QTime of 100ms or 50,000ms?

- Toke Eskildsen
Re: Solr Cloud, 100 shards, shards progressively become slower
On 1/8/2015 8:57 AM, Andrew Butkus wrote: We have 4gb usage (because the shards are split by 100 each shard is approx. 2.8gb on disk), we have allocated 14gb min and 16gb max of ram to solr, so it has plenty to use (the ram in the dashboard never goes above about 8gb - so still plenty ). I've managed to reproduce the issue with shards= parameter, and think I have proven the disk cache issue to not be the problem I'm simply querying the same shard, on the same server, multiple times (so the shards index should always be in memory and never loaded from disk)? All but the last query are low ms ... Extrapolating what Jack was saying on his reply ... with 100 shards and 4 replicas, you have 400 cores that are each about 2.8GB. That results in a total index size of just over a terabyte, with 140GB of index data on each of the eight servers. Assuming you have only one Solr instance per server, an ideal setup would have enough RAM for that 140GB of index plus the 16GB max heap, so 156GB of RAM. Because the ideal setup is rarely a strict requirement unless the query load is high, if you have 128 GB of RAM per server, then I would not be worried about performance. If you have less than that, then I would be worried. The behavior with the same shard listed multiple times is a little strange. That behavior could indicate problems with garbage collection pauses -- as Solr is building the memory structures necessary to compose the final response, it might fill up one of the heap generations to its current size limit and each subsequent allocation might require a significant garbage collection, stopping the world while it happens, but not freeing up any significant amount of memory in that particular heap generation. Have you tuned your garbage collection? If not, that is a likely suspect. 
If you run with the latest Oracle Java, you can use my settings and probably see good GC performance: https://wiki.apache.org/solr/ShawnHeisey Further down on the page is a good set of CMS parameters for earlier Java versions, if you can't run the latest. Thanks, Shawn
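For readers who cannot reach Shawn's wiki page, a CMS-flavored option set in the spirit of what he describes might look like the following. These values are illustrative assumptions only, not his published settings; tune them against your own GC logs, and note that flag availability varies by Java version.

```shell
# Hypothetical Solr/Tomcat startup fragment -- example values, not a recommendation
JAVA_OPTS="$JAVA_OPTS \
  -Xms14g -Xmx16g \
  -XX:+UseConcMarkSweepGC \
  -XX:+UseParNewGC \
  -XX:CMSInitiatingOccupancyFraction=70 \
  -XX:+UseCMSInitiatingOccupancyOnly \
  -XX:+ParallelRefProcEnabled \
  -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:gc.log"
```

The GC log produced by the last line is what lets you confirm whether long stop-the-world pauses line up with the slow shard responses.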
Re: Solr on HDFS in a Hadoop cluster
Thanks a lot Otis,

While reading the SolrCloud documentation to understand how SolrCloud could run on HDFS, I got confused with leader, replica, non-replica shards, core, index, and collections. First it is specified that one cannot add shards, then that one can add replica-only shards, and then the last Shard Splitting paragraph states that something changed starting with Solr 4.3. But it doesn't state that splitting shards can end in a new non-replica shard, on a just-added node, thus increasing the amount of storage available to the index / collection. It states that the split action effectively makes two copies of the data as new shards instead, which tastes a lot like replica-style shards. So does it? Could there be some sort of tutorial describing how to add available storage capacity for an index / collection, thus adding a node / shard / core that one can send new documents to be indexed? (Of course, load-balancing would be triggered, so it looks like documents would be added to shards out of a set of nodes.)

Thanks,
Charles VALLEE
Centre de compétence Big data
EDF – DSP - CSP IT-O DATACENTER - Expertise en Energie Informatique (EEI)
32 avenue Pablo Picasso 92000 Nanterre
charles.val...@edf.fr
Tél. : + (0) 1 78 66 69 81
From: otis.gospodne...@gmail.com
To: solr-user@lucene.apache.org
Date: 06/01/2015 18:55
Subject: Re: Solr on HDFS in a Hadoop cluster

Oh, and https://issues.apache.org/jira/browse/SOLR-6743

Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr Elasticsearch Support * http://sematext.com/

On Tue, Jan 6, 2015 at 12:52 PM, Otis Gospodnetic otis.gospodne...@gmail.com wrote:

Hi Charles,

See http://search-lucene.com/?q=solr+hdfs and https://cwiki.apache.org/confluence/display/solr/Running+Solr+on+HDFS

Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr Elasticsearch Support * http://sematext.com/

On Tue, Jan 6, 2015 at 11:02 AM, Charles VALLEE charles.val...@edf.fr wrote:

I am considering using *Solr* to extend *Hortonworks Data Platform* capabilities to search.
- I found tutorials to index documents into a Solr instance from *HDFS*, but I guess this solution would require a Solr cluster distinct from the Hadoop cluster. Is it possible to have Solr integrated into the Hadoop cluster instead?
- *With the index stored in HDFS?*
- Where would the processing take place (could it be handed down to Hadoop)? Is there a way to guarantee a level of service (CPU, RAM) - to integrate with *Yarn*?
- What about *SolrCloud*: what does it bring regarding Hadoop based use-cases? Does it stand for a Solr-only cluster?
- Well, if that could lead to something working with a roles-based authorization-compliant *Banana*, it would be Christmas again!

Thanks a lot for any help!

Charles
Re: How large is your solr index?
bq: you'll end up with N-2 nearly full boxes and 2 half-full boxes.

True, you'd have to repeat the process N times. At that point, though, as Shawn mentions, it's often easier to just re-index the whole thing. Do note that one strategy is to create more shards than you need at the beginning. Say you determine that 10 shards will work fine, but you expect to grow your corpus by 2x. _Start_ with 20 shards (multiple shards can be hosted in the same JVM, no problem; see maxShardsPerNode in the collections API CREATE action). Then as your corpus grows you can move the shards to their own boxes. This just kicks the can down the road of course; if your corpus grows by 5x instead of 2x you're back to this discussion.

Best,
Erick

On Thu, Jan 8, 2015 at 7:08 AM, Shawn Heisey apa...@elyograg.org wrote: On 1/8/2015 4:37 AM, Bram Van Dam wrote: Hmm. That is a good point. I wonder if there's some kind of middle ground here? Something that lets me send an update (or new document) to an arbitrary node/shard but which is still routed according to my specific requirements? Maybe this can already be achieved by messing with the routing? snip That's fine. We have a lot of query (pre-)processing outside of Solr. It's no problem for us to send a couple of queries to a couple of shards and aggregate the result ourselves. It would, of course, be nice if everything worked in distributed mode, but at least for us it's not an issue. This is a side effect of our complex reporting requirements -- we do aggregation, filtering and other magic on data that is partially in Solr and partially elsewhere. SolrCloud, when you do fully automatic document routing, does handle everything for you. You can query any node and send updates to any node, and they will end up in the right place. There is currently a strong caveat: Indexing performance sucks when updates are initially sent to the wrong node.
The performance hit is far larger than we expected it to be, so there is an issue in Jira to try and make that better. No visible work has been done on the issue yet: https://issues.apache.org/jira/browse/SOLR-6717 The Java client (SolrJ, specifically CloudSolrServer) sends all updates to the correct nodes, because it can access the clusterstate and knows where updates need to go and where the shard leaders are.

This is a very good point. But I don't think SPLITSHARD is the magical answer here. If you have N shards on N boxes, and they are all getting nearly full and you decide to split one and move half to a new box, you'll end up with N-2 nearly full boxes and 2 half-full boxes. What happens if the disks fill up further? Do I have to split each shard? That sounds pretty nightmarish!

Planning ahead for growth is critical with SolrCloud, but there is something you can do if you discover that you need to radically re-shard: Create a whole new collection with the number of shards you want, likely using the original set of Solr servers plus some new ones. Rebuild the index into that collection. Delete the old collection, and create a collection alias pointing the original name at the new collection. The alias will work for both queries and updates.

Thanks,
Shawn
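The two escape hatches discussed in this thread (over-sharding up front, and re-sharding into a new collection behind an alias) both come down to a handful of Collections API calls. A hedged sketch, where the host, collection, and config names are all placeholders:

```
# Start wide: 20 shards packed onto few nodes, moved apart later as the corpus grows
curl 'http://localhost:8983/solr/admin/collections?action=CREATE&name=mycoll&numShards=20&replicationFactor=1&maxShardsPerNode=20&collection.configName=myconf'

# Radical re-shard later: build a wider collection, re-index into it, then swap via alias
curl 'http://localhost:8983/solr/admin/collections?action=CREATE&name=mycoll_v2&numShards=40&replicationFactor=1&maxShardsPerNode=40&collection.configName=myconf'
# ... re-index everything into mycoll_v2 ...
curl 'http://localhost:8983/solr/admin/collections?action=DELETE&name=mycoll'
curl 'http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=mycoll&collections=mycoll_v2'
```

Because the alias answers to the old collection name for both queries and updates, clients don't need to be reconfigured after the swap.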
Re: Solr Cloud, 100 shards, shards progressively become slower
On 1/8/2015 7:26 AM, Andrew Butkus wrote: Hi, we have 8 solr servers, split 4x4 across 2 data centers. We have a collection of around ½ billion documents, split over 100 shards, each is replicated 4 times on separate nodes (evenly distributed across both data centers). The problem we have is that when we use cursormark (and also when we don't use cursormark the pattern below is the same but just shorter in time) the time it takes to query each shard gets progressively longer when distrib=true , I have tried to query shards directly (with shards=) and select my own shards to query to see if it was a bandwidth bottleneck and the performance is normal / fine - when using pre-defined shards. Does anyone know why the shards become progressively slower when distrib=true? Or any suggestions on how I can fix, or how to debug the problem further? I have monitored the performance of CPU and it never goes above 10% on each server, so its not cpu, also the memory usage is about 4gb out of 16gb so its not a memory issue either. I have tried all shard shuffling strategies incase it was a bottleneck at a server being over used but as above, the cpu never goes above 10%, and when I use shards= there are never any querytime bottlenecks. The part about memory usage is not clear. That 4GB and 16GB could refer to the operating system view of memory, or the view of memory within the JVM. I'm curious about how much total RAM each machine has, how large the Java heap is, and what the total size of the indexes that live on each machine is. Even if they are individually very small, 500 million documents will result in a very large index, so I'm guessing that you don't have enough RAM on each server for your index size. 
What can happen with a highly sharded index that is too large for available RAM: Index data for the initial queries gets read from the OS disk cache, but as those queries run, the information required for the shards that come later in the distributed query gets pushed out of the disk cache, so Solr must actually read the disk to do those later queries. Disks are slow, so if the machine has to actually read from the disk, Solr will be slow. http://wiki.apache.org/solr/SolrPerformanceProblems#RAM Thanks, Shawn
RE: Solr Cloud, 100 shards, shards progressively become slower
Hi, we have 8 solr servers, split 4x4 across 2 data centers. We have a collection of around ½ billion documents, split over 100 shards, each replicated 4 times on separate nodes (evenly distributed across both data centers). The problem we have is that when we use cursormark (and also when we don't use cursormark the pattern below is the same, just shorter in time), the time it takes to query each shard gets progressively longer when distrib=true. I have tried to query shards directly (with shards=) and select my own shards to query to see if it was a bandwidth bottleneck, and the performance is normal / fine when using pre-defined shards. Does anyone know why the shards become progressively slower when distrib=true? Or any suggestions on how I can fix, or how to debug, the problem further? I have monitored the performance of CPU and it never goes above 10% on each server, so it's not CPU; also the memory usage is about 4gb out of 16gb, so it's not a memory issue either. I have tried all shard shuffling strategies in case it was a bottleneck at a server being overused, but as above, the CPU never goes above 10%, and when I use shards= there are never any querytime bottlenecks.
Around:

"http://2.2.213:8985/solr/Collection_shard16_replica1/|http://1.1.1.16:8985/solr/Collection_shard16_replica2/|http://1.1.1.17:8985/solr/Collection_shard16_replica3/|http://2.2.216:8985/solr/Collection_shard16_replica4/":{
  "numFound":242899, "maxScore":null, "shardAddress":"http://1.1.1.17:8985/solr/Collection_shard16_replica3", "time":134},

The timings get progressively worse (there is a pattern: the time it takes to run queries on shards increasingly gets worse after about the first 60 entries, even though the earlier ones took a few milliseconds). Here is my trace output:

{
  "responseHeader":{
    "status":0,
    "QTime":50093,
    "params":{
      "shard.shuffling.strategy":"query",
      "sort":"id ASC",
      "indent":"true",
      "q":"spec_country:\"United Kingdom\"",
      "shards.info":"true",
      "distrib":"true",
      "cursorMark":"*",
      "wt":"json",
      "rows":"0"}},
  "shards.info":{
    "http://1.1.1.17:8985/solr/Collection_shard78_replica3/|http://2.2.216:8985/solr/Collection_shard78_replica4/|http://2.2.213:8985/solr/Collection_shard78_replica1/|http://1.1.1.16:8985/solr/Collection_shard78_replica2/":{
      "numFound":243009, "maxScore":null, "shardAddress":"http://1.1.1.16:8985/solr/Collection_shard78_replica2", "time":24},
    "http://2.2.213:8985/solr/Collection_shard24_replica1/|http://1.1.1.16:8985/solr/Collection_shard24_replica2/|http://1.1.1.17:8985/solr/Collection_shard24_replica3/|http://2.2.216:8985/solr/Collection_shard24_replica4/":{
      "numFound":242309, "maxScore":null, "shardAddress":"http://1.1.1.16:8985/solr/Collection_shard24_replica2", "time":23},
    "http://1.1.1.17:8985/solr/Collection_shard70_replica3/|http://2.2.213:8985/solr/Collection_shard70_replica1/|http://1.1.1.16:8985/solr/Collection_shard70_replica2/|http://2.2.216:8985/solr/Collection_shard70_replica4/":{
      "numFound":242727, "maxScore":null, "shardAddress":"http://1.1.1.16:8985/solr/Collection_shard70_replica2", "time":23},
    "http://2.2.216:8985/solr/Collection_shard76_replica4/|http://1.1.1.17:8985/solr/Collection_shard76_replica3/|http://2.2.213:8985/solr/Collection_shard76_replica1/|http://1.1.1.16:8985/solr/Collection_shard76_replica2/":{
      "numFound":243324, "maxScore":null, "shardAddress":"http://2.2.216:8985/solr/Collection_shard76_replica4", "time":26},
    "http://2.2.214:8985/solr/Collection_shard29_replica1/|http://1.1.1.18:8985/solr/Collection_shard29_replica2/|http://1.1.1.15:8985/solr/Collection_shard29_replica3/|http://2.2.215:8985/solr/Collection_shard29_replica4/":{
      "numFound":242559, "maxScore":null, "shardAddress":"http://2.2.214:8985/solr/Collection_shard29_replica1", "time":25},
    "http://1.1.1.17:8985/solr/Collection_shard74_replica3/|http://2.2.213:8985/solr/Collection_shard74_replica1/|http://2.2.216:8985/solr/Collection_shard74_replica4/|http://1.1.1.16:8985/solr/Collection_shard74_replica2/":{
      "numFound":242419, "maxScore":null, "shardAddress":"http://2.2.216:8985/solr/Collection_shard74_replica4", "time":24},
    "http://1.1.1.18:8985/solr/Collection_shard33_replica2/|http://1.1.1.15:8985/solr/Collection_shard33_replica3/|http://2.2.214:8985/solr/Collection_shard33_replica1/|http://2.2.215:8985/solr/Collection_shard33_replica4/":{
      "numFound":242571, "maxScore":null, "shardAddress":"http://2.2.214:8985/solr/Collection_shard33_replica1", "time":25},
    "http://1.1.1.18:8985/solr/Collection_shard77_replica2/|http://2.2.215:8985/solr/Collection_shard77_replica4/|http://1.1.1.15:8985/solr/Collection_shard77_replica3/|http://2.2.214:8985/solr/Collection_shard77_replica1/":{
      "numFound":242901, "maxScore":null, "shardAddress":"http://2.2.215:8985/solr/Collection_shard77_replica4", "time":27},
How to return child documents with parent
Hi, What is the best way to return both the parent document and child documents in one query? I used SolrJ to create a document, added a few child documents using the addChildDocuments() method, and indexed the parent document. All documents are indexed successfully (parent and children). When I tried to retrieve the parent document along with the child documents, I used expand=true&expand.field=_root_. I was able to get the parent back in the result section and the children in the expandedResults section. Is there some other type of query I should use so I can get the child documents back as children instead of as an expanded result? Thanks, Y
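Since Solr 4.9 there is a [child] doc transformer that nests children under their parents in the response, instead of returning them in a separate expanded section. A sketch of the request parameters, assuming a hypothetical type_s field that distinguishes parents from children (the field name and filter values are made up for illustration):

```
# Block-join query: select parents whose children match, then attach the children
q={!parent which="type_s:parent"}type_s:child
fl=*,[child parentFilter=type_s:parent limit=10]
```

The parentFilter must match all and only the parent documents of the block, so it is usually a dedicated marker field set at indexing time.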
Solr Cloud, 100 shards, shards progressively become slower
Hi, we have 8 solr servers, split 4x4 across 2 data centers. We have a collection of around ½ billion documents, split over 100 shards, each replicated 4 times on separate nodes (evenly distributed across both data centers). The problem we have is that when we use cursormark (and also when we don't use cursormark the pattern below is the same, just shorter in time), the time it takes to query each shard gets progressively longer when distrib=true. I have tried to query shards directly (with shards=) and select my own shards to query to see if it was a bandwidth bottleneck, and the performance is normal / fine when using pre-defined shards. Does anyone know why the shards become progressively slower when distrib=true? Or any suggestions on how I can fix, or how to debug, the problem further? I have monitored the performance of CPU and it never goes above 10% on each server, so it's not CPU; also the memory usage is about 4gb out of 16gb, so it's not a memory issue either. I have tried all shard shuffling strategies in case it was a bottleneck at a server being overused, but as above, the CPU never goes above 10%, and when I use shards= there are never any querytime bottlenecks.
Around:

"http://2.2.213:8985/solr/Collection_shard16_replica1/|http://1.1.1.16:8985/solr/Collection_shard16_replica2/|http://1.1.1.17:8985/solr/Collection_shard16_replica3/|http://2.2.216:8985/solr/Collection_shard16_replica4/":{
  "numFound":242899, "maxScore":null, "shardAddress":"http://1.1.1.17:8985/solr/Collection_shard16_replica3", "time":134},

The timings get progressively worse (there is a pattern: the time it takes to run queries on shards increasingly gets worse after about the first 60 entries, even though the earlier ones took a few milliseconds). Here is my trace output:

{
  "responseHeader":{
    "status":0,
    "QTime":50093,
    "params":{
      "shard.shuffling.strategy":"query",
      "sort":"id ASC",
      "indent":"true",
      "q":"spec_country:\"United Kingdom\"",
      "shards.info":"true",
      "distrib":"true",
      "cursorMark":"*",
      "wt":"json",
      "rows":"0"}},
  "shards.info":{
    "http://1.1.1.17:8985/solr/Collection_shard78_replica3/|http://2.2.216:8985/solr/Collection_shard78_replica4/|http://2.2.213:8985/solr/Collection_shard78_replica1/|http://1.1.1.16:8985/solr/Collection_shard78_replica2/":{
      "numFound":243009, "maxScore":null, "shardAddress":"http://1.1.1.16:8985/solr/Collection_shard78_replica2", "time":24},
    "http://2.2.213:8985/solr/Collection_shard24_replica1/|http://1.1.1.16:8985/solr/Collection_shard24_replica2/|http://1.1.1.17:8985/solr/Collection_shard24_replica3/|http://2.2.216:8985/solr/Collection_shard24_replica4/":{
      "numFound":242309, "maxScore":null, "shardAddress":"http://1.1.1.16:8985/solr/Collection_shard24_replica2", "time":23},
    "http://1.1.1.17:8985/solr/Collection_shard70_replica3/|http://2.2.213:8985/solr/Collection_shard70_replica1/|http://1.1.1.16:8985/solr/Collection_shard70_replica2/|http://2.2.216:8985/solr/Collection_shard70_replica4/":{
      "numFound":242727, "maxScore":null, "shardAddress":"http://1.1.1.16:8985/solr/Collection_shard70_replica2", "time":23},
    "http://2.2.216:8985/solr/Collection_shard76_replica4/|http://1.1.1.17:8985/solr/Collection_shard76_replica3/|http://2.2.213:8985/solr/Collection_shard76_replica1/|http://1.1.1.16:8985/solr/Collection_shard76_replica2/":{
      "numFound":243324, "maxScore":null, "shardAddress":"http://2.2.216:8985/solr/Collection_shard76_replica4", "time":26},
Re: leader split-brain at least once a day - need help
It's worth noting that those messages alone don't necessarily signify a problem with the system (and it wouldn't be called split brain). The async nature of updates (and thread scheduling), along with stop-the-world GC pauses that can change leadership, cause these little windows of inconsistency that we detect and log.

-Yonik
http://heliosearch.org - native code faceting, facet functions, sub-facets, off-heap data

On Wed, Jan 7, 2015 at 5:01 AM, Thomas Lamy t.l...@cytainment.de wrote:

Hi there, we are running a 3 server cloud serving a dozen single-shard/replicate-everywhere collections. The 2 biggest collections are ~15M docs, and about 13GiB / 2.5GiB size. Solr is 4.10.2, ZK 3.4.5, Tomcat 7.0.56, Oracle Java 1.7.0_72-b14. 10 of the 12 collections (the small ones) get filled by DIH full-import once a day starting at 1am. The second biggest collection is updated using DIH delta-import every 10 minutes, the biggest one gets bulk json updates with commits once in 5 minutes. On a regular basis, we have a leader information mismatch: org.apache.solr.update.processor.DistributedUpdateProcessor; Request says it is coming from leader, but we are the leader or the opposite org.apache.solr.update.processor.DistributedUpdateProcessor; ClusterState says we are the leader, but locally we don't think so. One of these pops up once a day at around 8am, making either some cores go into "recovery failed" state, or all cores of at least one cloud node go into state "gone". This started out of the blue about 2 weeks ago, without changes to software, data, or client behaviour. Most of the time, we get things going again by restarting solr on the current leader node, forcing a new election - can this be triggered while keeping solr (and the caches) up? But sometimes this doesn't help; we had an incident last weekend where our admins didn't restart in time, creating millions of entries in /solr/overseer/queue, making zk close the connection, and leader re-elect fails.
I had to flush zk, and re-upload collection config to get solr up again (just like in https://gist.github.com/isoboroff/424fcdf63fa760c1d1a7). We have a much bigger cloud (7 servers, ~50GiB Data in 8 collections, 1500 requests/s) up and running, which does not have these problems since upgrading to 4.10.2. Any hints on where to look for a solution? Kind regards Thomas -- Thomas Lamy Cytainment AG Co KG Nordkanalstrasse 52 20097 Hamburg Tel.: +49 (40) 23 706-747 Fax: +49 (40) 23 706-139 Sitz und Registergericht Hamburg HRA 98121 HRB 86068 Ust-ID: DE213009476
Re: How large is your solr index?
On 1/8/2015 4:37 AM, Bram Van Dam wrote: Hmm. That is a good point. I wonder if there's some kind of middle ground here? Something that lets me send an update (or new document) to an arbitrary node/shard but which is still routed according to my specific requirements? Maybe this can already be achieved by messing with the routing? snip That's fine. We have a lot of query (pre-)processing outside of Solr. It's no problem for us to send a couple of queries to a couple of shards and aggregate the result ourselves. It would, of course, be nice if everything worked in distributed mode, but at least for us it's not an issue. This is a side effect of our complex reporting requirements -- we do aggregation, filtering and other magic on data that is partially in Solr and partially elsewhere.

SolrCloud, when you do fully automatic document routing, does handle everything for you. You can query any node and send updates to any node, and they will end up in the right place. There is currently a strong caveat: Indexing performance sucks when updates are initially sent to the wrong node. The performance hit is far larger than we expected it to be, so there is an issue in Jira to try and make that better. No visible work has been done on the issue yet: https://issues.apache.org/jira/browse/SOLR-6717 The Java client (SolrJ, specifically CloudSolrServer) sends all updates to the correct nodes, because it can access the clusterstate and knows where updates need to go and where the shard leaders are.

This is a very good point. But I don't think SPLITSHARD is the magical answer here. If you have N shards on N boxes, and they are all getting nearly full and you decide to split one and move half to a new box, you'll end up with N-2 nearly full boxes and 2 half-full boxes. What happens if the disks fill up further? Do I have to split each shard? That sounds pretty nightmarish!
Planning ahead for growth is critical with SolrCloud, but there is something you can do if you discover that you need to radically re-shard:

1. Create a whole new collection with the number of shards you want, likely using the original set of Solr servers plus some new ones.
2. Rebuild the index into that collection.
3. Delete the old collection, and create a collection alias pointing the original name at the new collection.

The alias will work for both queries and updates.

Thanks,
Shawn
Re: Solr with Tomcat - enabling SSL problem
On 1/8/2015 6:25 AM, Tali Finelt wrote:

> I am using Solr 4.10.2 with tomcat and embedded Zookeeper. I followed https://cwiki.apache.org/confluence/display/solr/Enabling+SSL#EnablingSSL-SolrCloud to enable SSL. I am currently doing the following:
>
> 1. Starting tomcat
> 2. Running: ../scripts/cloud-scripts/zkcli.sh -zkhost localhost:9983 -cmd put /clusterprops.json '{urlScheme:https}'
> 3. Restarting tomcat
> 4. Accessing Solr from my client using org.apache.solr.client.solrj.impl.CloudSolrServer.
>
> And this works. If I don't restart tomcat again after running zkcli.sh, I get the following error: IOException occured when talking to server at: http://ip:port/solr/ (http, not https). Is it possible to do this without the second restart?

Solr will only read parameters like the urlScheme at startup. Once it's running, that information is never accessed again, so in order to get it to change those parameters, a restart is required.

It might be possible to change the code so a re-read of these parameters takes place ... but writing code to make fundamental changes to program operation can be risky. Restarting the program is much safer.

Thanks,
Shawn
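For reference, the value the zkcli command stores at /clusterprops.json is a small JSON document. The quotes were dropped in the list rendering above; the actual content stored in ZooKeeper should be valid JSON along these lines (a sketch matching the command in the question):

```json
{"urlScheme": "https"}
```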
Re: Solr with Tomcat - enabling SSL problem
On 1/8/2015 8:50 AM, Tali Finelt wrote:

> Thanks for clarifying this. Is there a different way to set the embedded Zookeeper urlScheme parameter before ever starting tomcat? (some configuration file etc.) This way I won't need to start tomcat twice.

Most of the cloud options can be specified with system properties on the java commandline. I believe you would use this: -DurlScheme=https

I had thought maybe urlScheme could be specified in solr.xml, but I can't find any examples, so it might not be possible.

Thanks,
Shawn
Re: How large is your solr index?
On 1/8/2015 3:16 AM, Toke Eskildsen wrote:
> On Wed, 2015-01-07 at 22:26 +0100, Joseph Obernberger wrote:
> > Thank you Toke - yes - the data is indexed throughout the day. We are handling very few searches - probably 50 a day; this is an R&D system.
>
> If your searches are in small bundles, you could pause the indexing flow while the searches are executed, for better performance.
>
> > Our HDFS cache, I believe, is too small at 10GBytes per shard.
>
> That depends a lot on your corpus, your searches and underlying storage. But with our current level of information, it is a really good bet: Having 10GB cache per 130GB (270GB?) data is not a lot with spinning drives.

Yes - it would be 20GBytes of cache per 270GBytes of data. Current parameters for running each shard are: JAVA_OPTS=-XX:MaxDirectMemorySize=10g -XX:+UseLargePages -XX:NewRatio=3 [...] -Xmx10752m

> One Solr/shard? You could probably win a bit by having one Solr/machine instead. Anyway, it's quite a high Xmx, but I presume you have measured the memory needs.

We've tried lower Xmx but we get OOM errors during faceting of large datasets. Right now we're running two JVMs per physical box (2 shards per box), but we're going to be changing that to one JVM and one shard per box. I'd love to try SSDs, but don't have the budget at present to go that route.

> We find the price/performance for SSD + moderate RAM to be quite a better deal than spinning drives + a lot of RAM, even when buying enterprise hardware. For consumer SSDs (used in our large server) it is even cheaper to use SSDs. It all depends on use pattern of course, but your setup with non-concurrent searches seems like it would fit well. Note: I am sure that RAM == index size would deliver very high performance. With enough RAM you can use tape to hold the index. Whether it is cost effective is another matter.

Ha! Yes - our index is accessible via a 2400 baud modem, but we have lots of cache!
> > ;) I'd really like to get the HDFS option to work well as it reduces system complexity.
>
> That is very understandable. We examined the option of networked storage (Isilon) with underlying spindles, and it performed adequately for our needs up to 2-3TB of index data. Unfortunately the heavy random read load from Solr meant a noticeable degradation of other services using the networked storage. I am sure it could be solved with more centralized hardware, but in the end we found it cheaper and simpler to use local storage for search. This will of course differ across organizations and setups.
>
> - Toke Eskildsen

We're going to experiment with one shard per box and more RAM cache per shard and see where that gets us; we'll also be adding more shards. Thanks for the tips! Interesting that you mention Isilon, as we're planning to do an eval with their product this year where we'll be testing out their HDFS layer. It's a potential way to balance compute and storage, since you can add HDFS storage without adding compute.

-Joe
Re: Solr: IndexNotFoundException: no segments* file HdfsDirectoryFactory
I missed Norgorn's reply above. But in the past, and also as suggested above, I think the following lock type solved the problem for me, in your indexConfig in solrconfig.xml:

    <lockType>${solr.lock.type:hdfs}</lockType>
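The element markup was stripped by the list archive; reconstructed, the setting sits inside the indexConfig block of solrconfig.xml, roughly as follows (a sketch; using `hdfs` as the fallback when the solr.lock.type system property is unset, as in the reply above):

```xml
<indexConfig>
  <!-- Use the HDFS-aware lock factory; the solr.lock.type
       system property overrides the hdfs default if set. -->
  <lockType>${solr.lock.type:hdfs}</lockType>
</indexConfig>
```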
Re: Solr with Tomcat - enabling SSL problem
Hi Shawn,

Thanks for clarifying this. Is there a different way to set the embedded Zookeeper urlScheme parameter before ever starting tomcat? (some configuration file etc.) This way I won't need to start tomcat twice.

Thanks,
Tali

From: Shawn Heisey apa...@elyograg.org
To: solr-user@lucene.apache.org
Date: 08/01/2015 05:14 PM
Subject: Re: Solr with Tomcat - enabling SSL problem

[Shawn's earlier reply quoted in full; see above.]
RE: Solr Cloud, 100 shards, shards progressively become slower
Hi Shawn,

Thank you for your reply.

> The part about memory usage is not clear. That 4GB and 16GB could refer to the operating system view of memory, or the view of memory within the JVM. I'm curious about how much total RAM each machine has, how large the Java heap is, and what the total size of the indexes that live on each machine is. Even if they are individually very small, 500 million documents will result in a very large index, so I'm guessing that you don't have enough RAM on each server for your index size.
>
> What can happen with a highly sharded index that is too large for available RAM: Index data for the initial queries gets read from the OS disk cache, but as those queries run, the information required for the shards that come later in the distributed query gets pushed out of the disk cache, so Solr must actually read the disk to do those later queries. Disks are slow, so if the machine has to actually read from the disk, Solr will be slow.
>
> http://wiki.apache.org/solr/SolrPerformanceProblems#RAM

We have 4GB usage (because the index is split across 100 shards, each shard is approx. 2.8GB on disk), and we have allocated 14GB min and 16GB max of RAM to Solr, so it has plenty to use (the RAM in the dashboard never goes above about 8GB - so still plenty).

I've managed to reproduce the issue with the shards= parameter, and think I have proven that the disk cache is not the problem: I'm simply querying the same shard, on the same server, multiple times, so that shard's index should always be in memory and never loaded from disk. All but the last query are low ms ...
{
  "responseHeader": {
    "status": 0,
    "QTime": 50190,
    "params": {
      "sort": "id ASC",
      "shards": "1.1.1.18:8985/solr/Collection,1.1.1.18:8985/solr/Collection,...",
      "indent": "true",
      "q": "spec_country:\"United Kingdom\"",
      "shards.info": "true",
      "distrib": "false",
      "cursorMark": "*",
      "wt": "json",
      "rows": "10"
    }
  },
  "shards.info": {
    "1.1.1.18:8985/solr/Collection": {
      "numFound": 242731,
      "maxScore": null,
      "shardAddress": "http://1.1.1.18:8985/solr/Collection",
      "time": 24
    },
    ...
  }
}

[The shards parameter lists the same shard many times over, and shards.info repeats the same entry (numFound 242731, maxScore null) with times of 24, 35, 35, 52, 55, 59, 58, 75, 79, 78, 79 and 80 ms before the response is truncated.]
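A quick back-of-the-envelope check on the numbers above shows why the shards themselves look innocent: the per-shard times reported in shards.info are tiny compared to the overall QTime, so the missing time must be spent on the aggregating node (merging and re-sorting the many per-shard responses). A minimal sketch using the times visible in the truncated response:

```python
# Per-shard times (ms) reported in shards.info above, and the overall QTime.
shard_times_ms = [24, 35, 35, 52, 55, 59, 58, 75, 79, 78, 79, 80]
qtime_ms = 50190

# Even if the shards had been queried strictly one after another, their
# combined work accounts for well under a second of the ~50 s total.
total_shard_work_ms = sum(shard_times_ms)
unexplained_ms = qtime_ms - total_shard_work_ms

print(f"shard work: {total_shard_work_ms} ms, unexplained: {unexplained_ms} ms")
```

The gap of roughly 49 seconds is the time spent outside the shard queries, consistent with the aggregation step being the bottleneck rather than disk cache.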
Re: ignoring bad documents during index
I don't have specific answers to all of your questions, but you should probably look at SOLR-445, where a lot of this has already been discussed and multiple patches with different approaches have been started...

https://issues.apache.org/jira/browse/SOLR-445

: Date: Wed, 7 Jan 2015 12:38:47 -0700 (MST)
: From: SolrUser1543 osta...@gmail.com
: Reply-To: solr-user@lucene.apache.org
: To: solr-user@lucene.apache.org
: Subject: Re: ignoring bad documents during index
:
: I have implemented an update processor as described above.
:
: On a single solr instance it works fine.
:
: When I test it on solr cloud with several nodes and try to index a few documents, some of which are incorrect, each instance creates its own response, but it is not aggregated by the instance which got the request.
:
: I also tried to use QueryResponseWriter, but it was also not aggregated.
:
: The questions are:
: 1. how to make it be aggregated?
: 2. what kind of update processor should it be: UpdateRequestProcessor or DistributedUpdateRequestProcessor?

-Hoss
http://www.lucidworks.com/