(Newbie Help!) Seeking guidance regarding Solr's suggester and related components
Hi friends, I'm new to Solr. I've been working with it for the past 2-3 months, trying to really get my feet wet so that I can transition the search engine at my current job over to Solr (eww, Sphinx, haha). I've been running around the net getting my suggester working, and I'm stuck, so I need some help. Here is what I have so far (I'll explain after the links to the config files):

managed-schema.xml: http://pastebin.com/MiEWwESP
solrconfig.xml: http://pastebin.com/fq2yxbvp

I'm currently using Solr 6.2.1. My goal is to build a suggester that suggests search terms or phrases based on the index that is in memory. I've been playing around with the analyzers and tokenizers, and reading some fairly old books that cover Solr 4, and I came up with the tokenizer/analyzer chain you can see in the schema above. Please correct it if it's wrong. My index contains medical abstracts published by doctors, and the terms I really need to suggest are things like "brain cancer", "anti-inflammatory", and "hiv-1" (you can see where I'm going with this), so I need to preserve whitespace and handle hyphenated terms.

Once that was in place (now here comes the fun part), I build the suggester with:

http://localhost:8983/solr/AbstractSuggest/suggest/?spellcheck.build=true

and once it has been built I query:

http://localhost:8983/solr/AbstractSuggest/suggest/?spellcheck.q=suggest_field:%22anti-infl%22

That works great: I can see the collations, so clients searching these medical articles would see those terms in the drop-down search bar.

Now for PHP (Solarium is the API I use to talk to Solr). Since this is a website, I intend to make an AJAX call to PHP, but from there I cannot see the collation list. Solarium fails on hyphenated terms and also fails to build the collations list. For example, if I type "brain canc" (wanting "brain cancer"), it suggests "brain" and then "cancer", but nothing is shown under collations. If I send the same request to the URL directly (a localhost URL, which will change when we move to the production environment), I can see the collations. Screenshots:

brain canc (URL) -> https://gyazo.com/30a9d11e4b9b73b0768a12d342223dc3
brain canc (Solarium) -> https://gyazo.com/507b02e50d0e39d7daa96655dff83c76
PHP code -> https://gyazo.com/1d2b8c90013784d7cde5301769cd230c

So here is where I am: the goal is to have the PHP API produce the same results as the URL, so that when users type into the search bar they can see the collations. Can someone please help? I'm looking to the community as the savior to all my problems, and I want to learn Solr along the way so that if future problems pop up I can solve them myself. Thanks, and happy holidays! Kevin
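In case it helps anyone reproduce this outside of PHP, here is a rough SolrJ sketch of the same request I am making with the URL above. This is not the Solarium code I am actually using, and the core name, handler path, and field come from my config links, so treat them as assumptions:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.client.solrj.response.SpellCheckResponse;

    public class SuggestCheck {
        public static void main(String[] args) throws Exception {
            // Core and handler names taken from the URLs above.
            HttpSolrClient client = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/AbstractSuggest").build();

            SolrQuery q = new SolrQuery();
            q.setRequestHandler("/suggest");
            q.set("spellcheck.q", "suggest_field:\"anti-infl\"");  // partial term typed by the user
            q.set("spellcheck.collate", "true");                   // ask explicitly for collations

            QueryResponse rsp = client.query(q);
            SpellCheckResponse spell = rsp.getSpellCheckResponse();
            if (spell != null && spell.getCollatedResults() != null) {
                for (SpellCheckResponse.Collation c : spell.getCollatedResults()) {
                    System.out.println("collation: " + c.getCollationQueryString());
                }
            }
            client.close();
        }
    }

If this returns collations but Solarium does not, the difference is almost certainly in which parameters the PHP client actually sends to /suggest, which should be visible in the Solr request log.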
Re: Does sharding improve or degrade performance?
On 12/12/2016 1:14 PM, Piyush Kunal wrote: > We did the following change: > > 1. Previously we had 1 shard and 32 replicas for 1.2million documents of > size 5 GB. > 2. We changed it to 4 shards and 8 replicas for 1.2 million documents of > size 5GB How many machines and shards per machine were you running in both situations? For either setup, I would recommend at least 32 machines, where each one handles exactly one shard replica. For the latter setup, you may need even more machines, so there are more replicas. > We have a combined RPM of around 20k rpm for solr. Twenty thousand queries per minute is over 300 per second. This is a very high query rate, which is going to require many replicas. Your replica count has gone down significantly with the change you made. > But unfortunately we saw a degrade in performance with RTs going insanely > high when we moved to setup 2. With such a high query rate, I'm not really surprised that this caused the performance to go down, even if you actually do have 32 machines. Distributed queries are a two-phase process where the coordinating node sends individual queries to each shard to find out how to sort the sub-results into one final result, and then sends a second query to relevant shards to request the individual documents it needs for the result. The total number of individual queries goes up significantly. Before Solr was doing one query for one result. Now it is doing between five and nine queries for one result (the initial query from your client to the coordinating node, the first query to each of the four shards, and then a possible second query to each shard). If the number of search hits is more than zero, it will be at least six queries. This is why one shard is preferred for high query rates if you can fit the whole index into one shard. Five gigabytes is a pretty small Solr index. Sharding is most effective when the query rate is low, because Solr can take advantage of idle CPUs. It makes it possible to have a much larger index. A high query rate means that there are no idle CPUs. Thanks, Shawn
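To put rough numbers on that (back-of-the-envelope, using only the figures from this thread): 20,000 requests per minute is about 333 per second. With 1 shard and 32 replicas, every request is serviced entirely by one replica, so each replica sees roughly 333 / 32, or about 10 queries per second. With 4 shards and 8 replicas, each incoming request fans out into the five to nine internal requests described above, so the cluster is now handling on the order of 333 x 9, or roughly 3,000 internal requests per second, plus the extra network hops and the merge step on the coordinating node. Each sub-request does search only a quarter of the index, but with an index this small that saving is unlikely to pay for the added coordination cost.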
Re: Does sharding improve or degrade performance?
Sharding adds inevitable overhead. In particular, each request, rather than being serviced on a single replica, has to send out a first request to each shard, get the IDs and sort criteria back, and then send out a second request to get the actual docs. Especially if you're asking for a lot of rows, this can get very expensive. And each shard is now served by only 1/4 as many replicas (8 instead of 32). In the first setup, an incoming request was completely serviced on 1 node; now you're requiring 4 nodes to participate. Sharding is always a second choice and always has some overhead. As long as your QTimes are acceptable, you should stick with a single shard. Best, Erick On Mon, Dec 12, 2016 at 12:14 PM, Piyush Kunal wrote: > All our shards and replicas reside on different machines with 16GB RAM and > 4 cores. > > On Tue, Dec 13, 2016 at 1:44 AM, Piyush Kunal > wrote: > >> We did the following change: >> >> 1. Previously we had 1 shard and 32 replicas for 1.2million documents of >> size 5 GB. >> 2. We changed it to 4 shards and 8 replicas for 1.2 million documents of >> size 5GB >> >> We have a combined RPM of around 20k rpm for solr. >> >> But unfortunately we saw a degrade in performance with RTs going insanely >> high when we moved to setup 2. >> >> What could be probable reasons and how it can be fixed? >>
Re: How to check optimized or disk free status via solrj for a particular collection?
bq: We are indexing with autocommit at 30 minutes OK, check the size of your tlogs. What this means is that all the updates accumulate for 30 minutes in a single tlog. That tlog will be closed when autocommit happens and a new one opened for the next 30 minutes. The first tlog won't be purged until the second one is closed. All this is detailed in the link I provided. If the tlogs are significant in size this may be the entire problem. Best, Erick On Mon, Dec 12, 2016 at 12:46 PM, Susheel Kumar wrote: > One option: > > First you may purge all documents before full-reindex that you don't need > to run optimize unless you need the data to serve queries same time. > > i think you are running into out of space because your 43 million may be > consuming 30% of total disk space and when you re-index the total disk > space usage goes to 60%. Now if you run optimize, it may require double > another 60% disk space making to 120% which causes out of disk space. > > The other option is to increase disk space if you want to run optimize at > the end. > > > On Mon, Dec 12, 2016 at 3:36 PM, Michael Joyner wrote: > >> We are having an issue with running out of space when trying to do a full >> re-index. >> >> We are indexing with autocommit at 30 minutes. >> >> We have it set to only optimize at the end of an indexing cycle. >> >> >> >> On 12/12/2016 02:43 PM, Erick Erickson wrote: >> >>> First off, optimize is actually rarely necessary. I wouldn't bother >>> unless you have measurements to prove that it's desirable. >>> >>> I would _certainly_ not call optimize every 10M docs. If you must call >>> it at all call it exactly once when indexing is complete. But see >>> above. >>> >>> As far as the commit, I'd just set the autocommit settings in >>> solrconfig.xml to something "reasonable" and forget it. I usually use >>> time rather than doc count as it's a little more predictable. I often >>> use 60 seconds, but it can be longer. The longer it is, the bigger >>> your tlog will grow and if Solr shuts down forcefully the longer >>> replaying may take. Here's the whole writeup on this topic: >>> >>> https://lucidworks.com/blog/2013/08/23/understanding-transac >>> tion-logs-softcommit-and-commit-in-sorlcloud/ >>> >>> Running out of space during indexing with about 30% utilization is >>> very odd. My guess is that you're trying to take too much control. >>> Having multiple optimizations going on at once would be a very good >>> way to run out of disk space. >>> >>> And I'm assuming one replica's index per disk or you're reporting >>> aggregate index size per disk when you sah 30%. Having three replicas >>> on the same disk each consuming 30% is A Bad Thing. >>> >>> Best, >>> Erick >>> >>> On Mon, Dec 12, 2016 at 8:36 AM, Michael Joyner >>> wrote: >>> Halp! I need to reindex over 43 millions documents, when optimized the collection is currently < 30% of disk space, we tried it over this weekend and it ran out of space during the reindexing. I'm thinking for the best solution for what we are trying to do is to call commit/optimize every 10,000,000 documents or so and then wait for the optimize to complete. How to check optimized status via solrj for a particular collection? Also, is there is a way to check free space per shard by collection? -Mike >>
Re: OOMs in Solr
bq: ...so I wonder if reducing the heap is going to help or it won’t matter that much... Well, if you're hitting OOM errors then you have no _choice_ but to reduce the heap. Or increase the memory. And you don't have much physical memory to grow into. Longer term, reducing the JVM size (assuming you can w/o hitting OOM errors) is always to the good. The more heap, the more GC you have, the longer stop-the-world GC pauses will take, etc. The OS's memory management is vastly more efficient (because it's simpler) than Java's GC is. Note, however, that this is "more art than science". I've seen situations where the JVM requires very close to the max heap size at some point. From there I've seen situations where the GC kicks in and recovers just enough memory to continue for a few milliseconds and then go right back into a GC cycle. So you need some overhead. Or are you talking about SSDs for the OS to use for swapping? Assuming we're talking about query response time here, SSDs will be much faster if you're swapping. But you _really_ want to strive to _not_ swap. SSD access is faster than spinning disk for sure, but still vastly slower than RAM access. I applaud you for changing one thing at a time, BTW. You probably want to use GCViewer or similar on the GC logs (turn them on first!) for Solr for a quick take on how GC is performing when you test. And the one other thing I'd do: Mine your Solr (or servlet container) logs for the real queries over one of these periods. Then use something like jmeter (or roll your own) test program to fire them at your test instance to evaluate the effects of your changes. Best, Erick On Mon, Dec 12, 2016 at 1:03 PM, Alfonso Muñoz-Pomer Fuentes wrote: > According to the post you linked to, it strongly advises to buy SSDs. I got > in touch with the systems department in my organization and it turns out > that our VM storage is SSD-backed, so I wonder if reducing the heap is going > to help or it won’t matter that much. Of course, there’s nothing like trying > and check out the results. I’ll do that in due time, though. At the moment > I’ve reduced the filter cache and will change all parameters one at a time > to see what affects performance the most. > > Thanks again for the feedback. > > On 12/12/2016 19:36, Erick Erickson wrote: >> >> The biggest bang for the buck is _probably_ docValues for the fields >> you facet on. If that's the culprit, you can also reduce your JVM heap >> considerably, as Toke says, leaving this little memory for the OS is >> bad. Here's the writeup on why: >> http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html >> >> Roughly what's happening is that all the values you facet on have to >> be read into memory somewhere. docvalues puts almost all of that into >> the OS memory rather than JVM heap. It's much faster to load, reduces >> JVM GC pressure, OOMs, and allows the pages to be swapped out. >> >> However, this is somewhat pushing the problem around. Moving the >> memory consumption to the OS memory space will have a huge impact on >> your OOM errors but the cost will be that you'll probably start >> swapping pages out of the OS memory, which will impact search speed. >> Slower searches are preferable to OOMs, certainly. That said you'll >> probably need more physical memory at some point, or go to SolrCloud >> or >> >> Best, >> Erick >> >> On Mon, Dec 12, 2016 at 10:57 AM, Susheel Kumar >> wrote: >>> >>> Double check if your queries are not running into deep pagination >>> (q=*:*...&start=).
This is something i recently >>> experienced >>> and was the only cause of OOM. You may have the gc logs when OOM >>> happened >>> and drawing it on GC Viewer may give insight how gradual your heap got >>> filled and run into OOM. >>> >>> Thanks, >>> Susheel >>> >>> On Mon, Dec 12, 2016 at 10:32 AM, Alfonso Muñoz-Pomer Fuentes < >>> amu...@ebi.ac.uk> wrote: Thanks again. I’m learning more about Solr in this thread than in my previous months reading about it! Moving to Solr Cloud is a possibility we’ve discussed and I guess it will eventually happen, as the index will grow no matter what. I’ve already lowered filterCache from 512 to 64 and I’m looking forward to seeing what happens in the next few days. Our filter cache hit ratio was 0.99, so I would expect this to go down but if we can have a more efficiente memory usage I think e.g. an extra second for each search is still acceptable. Regarding the startup scripts we’re using the ones included with Solr. As for the use of filters we’re always using the same four filters, IIRC. In any case we’ll review the code to ensure that that’s the case. I’m aware of the need to reindex when the schema changes, but thanks for the reminder. We’ll add docValues because I think that’ll make a significant difference in our case. We’ll also try to leave space for the disk cache as we’re using spinning disk storage.
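To make the "mine your logs and replay them" suggestion concrete, here is a minimal roll-your-own replay sketch in SolrJ: it reads one query string per line from a file extracted from the Solr request logs and fires them at a test instance, timing each one. The file name, core URL, and use of the 6.x HttpSolrClient.Builder are assumptions; adapt them to your setup (on SolrJ 5.x, new HttpSolrClient(url) does the same job).

    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.List;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;

    public class QueryReplay {
        public static void main(String[] args) throws Exception {
            // One q= value per line, mined from the Solr request logs.
            List<String> queries = Files.readAllLines(Paths.get("mined-queries.txt"));
            HttpSolrClient client = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/yourcore").build();

            for (String q : queries) {
                long start = System.nanoTime();
                client.query(new SolrQuery(q));
                long ms = (System.nanoTime() - start) / 1_000_000;
                System.out.println(ms + " ms  " + q);
            }
            client.close();
        }
    }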
Re: OOMs in Solr
According to the post you linked to, it strongly advises to buy SSDs. I got in touch with the systems department in my organization and it turns out that our VM storage is SSD-backed, so I wonder if reducing the heap is going to help or it won’t matter that much. Of course, there’s nothing like trying and check out the results. I’ll do that in due time, though. At the moment I’ve reduced the filter cache and will change all parameters one at a time to see what affects performance the most. Thanks again for the feedback. On 12/12/2016 19:36, Erick Erickson wrote: The biggest bang for the buck is _probably_ docValues for the fields you facet on. If that's the culprit, you can also reduce your JVM heap considerably, as Toke says, leaving this little memory for the OS is bad. Here's the writeup on why: http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html Roughly what's happening is that all the values you facet on have to be read into memory somewhere. docvalues puts almost all of that into the OS memory rather than JVM heap. It's much faster to load, reduces JVM GC pressure, OOMs, and allows the pages to be swapped out. However, this is somewhat pushing the problem around. Moving the memory consumption to the OS memory space will have a huge impact on your OOM errors but the cost will be that you'll probably start swapping pages out of the OS memory, which will impact search speed. Slower searches are preferable to OOMs, certainly. That said you'll probably need more physical memory at some point, or go to SolrCloud or Best, Erick On Mon, Dec 12, 2016 at 10:57 AM, Susheel Kumar wrote: Double check if your queries are not running into deep pagination (q=*:*...&start=). This is something i recently experienced and was the only cause of OOM. You may have the gc logs when OOM happened and drawing it on GC Viewer may give insight how gradual your heap got filled and run into OOM. Thanks, Susheel On Mon, Dec 12, 2016 at 10:32 AM, Alfonso Muñoz-Pomer Fuentes < amu...@ebi.ac.uk> wrote: Thanks again. I’m learning more about Solr in this thread than in my previous months reading about it! Moving to Solr Cloud is a possibility we’ve discussed and I guess it will eventually happen, as the index will grow no matter what. I’ve already lowered filterCache from 512 to 64 and I’m looking forward to seeing what happens in the next few days. Our filter cache hit ratio was 0.99, so I would expect this to go down but if we can have a more efficiente memory usage I think e.g. an extra second for each search is still acceptable. Regarding the startup scripts we’re using the ones included with Solr. As for the use of filters we’re always using the same four filters, IIRC. In any case we’ll review the code to ensure that that’s the case. I’m aware of the need to reindex when the schema changes, but thanks for the reminder. We’ll add docValues because I think that’ll make a significant difference in our case. We’ll also try to leave space for the disk cache as we’re using spinning disk storage. Thanks again to everybody for the useful and insightful replies. Alfonso On 12/12/2016 14:12, Shawn Heisey wrote: On 12/12/2016 3:13 AM, Alfonso Muñoz-Pomer Fuentes wrote: I’m writing because in our web application we’re using Solr 5.1.0 and currently we’re hosting it on a VM with 32 GB of RAM (of which 30 are dedicated to Solr and nothing else is running there). 
We have four cores, that are this size: - 25.56 GB, Num Docs = 57,860,845 - 12.09 GB, Num Docs = 173,491,631 (The other two cores are about 10 MB, 20k docs) An OOM indicates that a Java application is requesting more memory than it has been told it can use. There are only two remedies for OOM errors: Increase the heap, or make the program use less memory. In this email, I have concentrated on ways to reduce the memory requirements. These index sizes and document counts are relatively small to Solr -- as long as you have enough memory and are smart about how it's used. Solr 5.1.0 comes with GC tuning built into the startup scripts, using some well-tested CMS settings. If you are using those startup scripts, then the parallel collector will NOT be default. No matter what collector is in use, it cannot fix OOM problems. It may change when and how frequently they occur, but it can't do anything about them. We aren’t indexing on this machine, and we’re getting OOM relatively quickly (after about 14 hours of regular use). Right now we have a Cron job that restarts Solr every 12 hours, so it’s not pretty. We use faceting quite heavily and mostly as a document storage server (we want full data sets instead of the n most relevant results). Like Toke, I suspect two things: a very large filterCache, and the heavy facet usage, maybe both. Enabling docValues on the fields you're using for faceting and reindexing will make the latter more memory efficient, and likely faster. Reducing the filterCache size would help the forme
Re: How to check optimized or disk free status via solrj for a particular collection?
One option: purge all documents before the full re-index, so that you don't need to run optimize at all, unless you need the existing data to keep serving queries at the same time. I think you are running out of space because your 43 million documents may be consuming 30% of total disk space, and when you re-index, total disk usage goes to 60%. If you then run optimize, it may temporarily require double that (another 60% of disk, bringing you to 120%), which causes the out-of-disk condition. The other option is to increase disk space if you want to run optimize at the end. On Mon, Dec 12, 2016 at 3:36 PM, Michael Joyner wrote: > We are having an issue with running out of space when trying to do a full > re-index. > > We are indexing with autocommit at 30 minutes. > > We have it set to only optimize at the end of an indexing cycle. > > > > On 12/12/2016 02:43 PM, Erick Erickson wrote: > >> First off, optimize is actually rarely necessary. I wouldn't bother >> unless you have measurements to prove that it's desirable. >> >> I would _certainly_ not call optimize every 10M docs. If you must call >> it at all call it exactly once when indexing is complete. But see >> above. >> >> As far as the commit, I'd just set the autocommit settings in >> solrconfig.xml to something "reasonable" and forget it. I usually use >> time rather than doc count as it's a little more predictable. I often >> use 60 seconds, but it can be longer. The longer it is, the bigger >> your tlog will grow and if Solr shuts down forcefully the longer >> replaying may take. Here's the whole writeup on this topic: >> >> https://lucidworks.com/blog/2013/08/23/understanding-transac >> tion-logs-softcommit-and-commit-in-sorlcloud/ >> >> Running out of space during indexing with about 30% utilization is >> very odd. My guess is that you're trying to take too much control. >> Having multiple optimizations going on at once would be a very good >> way to run out of disk space. >> >> And I'm assuming one replica's index per disk or you're reporting >> aggregate index size per disk when you sah 30%. Having three replicas >> on the same disk each consuming 30% is A Bad Thing. >> >> Best, >> Erick >> >> On Mon, Dec 12, 2016 at 8:36 AM, Michael Joyner >> wrote: >>> Halp! >>> >>> I need to reindex over 43 millions documents, when optimized the >>> collection >>> is currently < 30% of disk space, we tried it over this weekend and it >>> ran >>> out of space during the reindexing. >>> >>> I'm thinking for the best solution for what we are trying to do is to >>> call >>> commit/optimize every 10,000,000 documents or so and then wait for the >>> optimize to complete. >>> >>> How to check optimized status via solrj for a particular collection? >>> >>> Also, is there is a way to check free space per shard by collection? >>> >>> -Mike >>> >>> >
Re: How to check optimized or disk free status via solrj for a particular collection?
We are having an issue with running out of space when trying to do a full re-index. We are indexing with autocommit at 30 minutes. We have it set to only optimize at the end of an indexing cycle. On 12/12/2016 02:43 PM, Erick Erickson wrote: First off, optimize is actually rarely necessary. I wouldn't bother unless you have measurements to prove that it's desirable. I would _certainly_ not call optimize every 10M docs. If you must call it at all call it exactly once when indexing is complete. But see above. As far as the commit, I'd just set the autocommit settings in solrconfig.xml to something "reasonable" and forget it. I usually use time rather than doc count as it's a little more predictable. I often use 60 seconds, but it can be longer. The longer it is, the bigger your tlog will grow and if Solr shuts down forcefully the longer replaying may take. Here's the whole writeup on this topic: https://lucidworks.com/blog/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/ Running out of space during indexing with about 30% utilization is very odd. My guess is that you're trying to take too much control. Having multiple optimizations going on at once would be a very good way to run out of disk space. And I'm assuming one replica's index per disk or you're reporting aggregate index size per disk when you sah 30%. Having three replicas on the same disk each consuming 30% is A Bad Thing. Best, Erick On Mon, Dec 12, 2016 at 8:36 AM, Michael Joyner wrote: Halp! I need to reindex over 43 millions documents, when optimized the collection is currently < 30% of disk space, we tried it over this weekend and it ran out of space during the reindexing. I'm thinking for the best solution for what we are trying to do is to call commit/optimize every 10,000,000 documents or so and then wait for the optimize to complete. How to check optimized status via solrj for a particular collection? Also, is there is a way to check free space per shard by collection? -Mike
Re: Does sharding improve or degrade performance?
All our shards and replicas reside on different machines with 16GB RAM and 4 cores. On Tue, Dec 13, 2016 at 1:44 AM, Piyush Kunal wrote: > We did the following change: > > 1. Previously we had 1 shard and 32 replicas for 1.2million documents of > size 5 GB. > 2. We changed it to 4 shards and 8 replicas for 1.2 million documents of > size 5GB > > We have a combined RPM of around 20k rpm for solr. > > But unfortunately we saw a degrade in performance with RTs going insanely > high when we moved to setup 2. > > What could be probable reasons and how it can be fixed? >
Does sharding improve or degrade performance?
We made the following change:

1. Previously we had 1 shard and 32 replicas for 1.2 million documents of size 5 GB.
2. We changed it to 4 shards and 8 replicas for the same 1.2 million documents of size 5 GB.

We have a combined query rate of around 20k RPM (requests per minute) for Solr. But unfortunately we saw a degradation in performance, with response times (RTs) going insanely high, when we moved to setup 2. What could the probable reasons be, and how can it be fixed?
Re: How to check optimized or disk free status via solrj for a particular collection?
How much difference is there between the two parameters below on your Solr stats screen? For example, in our case we have very frequent updates, which over time results in Max Doc being roughly twice Num Docs, and in that case I have seen that optimization helps query performance. Unless you have a huge difference, optimization may not be necessary. Num Docs: 39,183,404  Max Doc: 78,056,265 Thanks, Susheel On Mon, Dec 12, 2016 at 2:43 PM, Erick Erickson wrote: > First off, optimize is actually rarely necessary. I wouldn't bother > unless you have measurements to prove that it's desirable. > > I would _certainly_ not call optimize every 10M docs. If you must call > it at all call it exactly once when indexing is complete. But see > above. > > As far as the commit, I'd just set the autocommit settings in > solrconfig.xml to something "reasonable" and forget it. I usually use > time rather than doc count as it's a little more predictable. I often > use 60 seconds, but it can be longer. The longer it is, the bigger > your tlog will grow and if Solr shuts down forcefully the longer > replaying may take. Here's the whole writeup on this topic: > > https://lucidworks.com/blog/2013/08/23/understanding- > transaction-logs-softcommit-and-commit-in-sorlcloud/ > > Running out of space during indexing with about 30% utilization is > very odd. My guess is that you're trying to take too much control. > Having multiple optimizations going on at once would be a very good > way to run out of disk space. > > And I'm assuming one replica's index per disk or you're reporting > aggregate index size per disk when you sah 30%. Having three replicas > on the same disk each consuming 30% is A Bad Thing. > > Best, > Erick > > On Mon, Dec 12, 2016 at 8:36 AM, Michael Joyner > wrote: > > Halp! > > > > I need to reindex over 43 millions documents, when optimized the > collection > > is currently < 30% of disk space, we tried it over this weekend and it > ran > > out of space during the reindexing. > > > > I'm thinking for the best solution for what we are trying to do is to > call > > commit/optimize every 10,000,000 documents or so and then wait for the > > optimize to complete. > > > > How to check optimized status via solrj for a particular collection? > > > > Also, is there is a way to check free space per shard by collection? > > > > -Mike > > >
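To pull those same numbers out of SolrJ instead of reading the stats screen (which also covers the original question about checking status per core), something like the sketch below should work against each core. The core name is a placeholder, and the key names (numDocs, maxDoc, segmentCount, sizeInBytes) are what the CoreAdmin STATUS response contains in recent Solr versions, so double-check them against your own version's output:

    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.request.CoreAdminRequest;
    import org.apache.solr.client.solrj.response.CoreAdminResponse;
    import org.apache.solr.common.util.NamedList;

    public class IndexStatus {
        public static void main(String[] args) throws Exception {
            String coreName = "collection1_shard1_replica1";  // assumption: your core name
            HttpSolrClient client = new HttpSolrClient.Builder(
                "http://localhost:8983/solr").build();

            CoreAdminResponse status = CoreAdminRequest.getStatus(coreName, client);
            NamedList<Object> index =
                (NamedList<Object>) status.getCoreStatus(coreName).get("index");

            System.out.println("numDocs      = " + index.get("numDocs"));
            System.out.println("maxDoc       = " + index.get("maxDoc"));
            System.out.println("segmentCount = " + index.get("segmentCount"));
            System.out.println("sizeInBytes  = " + index.get("sizeInBytes"));
            client.close();
        }
    }

A large gap between maxDoc and numDocs is exactly the deleted-document overhead described above. Free disk space per shard isn't part of this response, so that piece usually has to come from the machine itself (df, monitoring, and so on).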
Re: regex-urlfilter help
sorry my mistake.. sent to wrong list. - Original Message - From: "Shawn Heisey" To: solr-user@lucene.apache.org Sent: Monday, December 12, 2016 2:36:26 PM Subject: Re: regex-urlfilter help On 12/12/2016 12:19 PM, KRIS MUSSHORN wrote: > I'm using nutch 1.12 and Solr 5.4.1. > > Crawling a website and indexing into nutch. > > AFAIK the regex-urlfilter.txt file will cause content to not be crawled.. > > what if I have > https:///inside/default.cfm as my seed url... > I want the links on this page to be crawled and indexed but I do not want > this page to be indexed into SOLR. > How would I set this up? > > I'm thnking that the regex.urlfilter.txt file is NOT the right place. These sound like questions about how to configure Nutch. This is a Solr mailing list. Nutch is a completely separate Apache product with its own mailing list. Although there may be people here who do use Nutch, it's not the purpose of this list. Please use support resources for Nutch. http://nutch.apache.org/mailing_lists.html I'm reasonably certain that this cannot be controlled by Solr's configuration. Solr will index anything that is sent to it, so the choice of what to send or not send in this situation will be decided by Nutch. Thanks, Shawn
error diagnosis help.
I've scoured my Nutch and Solr config files and I can't find any cause. Suggestions?

Monday, December 12, 2016 2:37:13 PM  ERROR  null  RequestHandlerBase  org.apache.solr.common.SolrException: Unexpected character '&' (code 38) in epilog; expected '<'
org.apache.solr.common.SolrException: Unexpected character '&' (code 38) in epilog; expected '<' at [row,col {unknown-source}]: [1,36]
    at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:180)
    at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:95)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:70)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:156)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:2073)
    at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:658)
    at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:457)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:223)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:181)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
    at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
    at org.eclipse.jetty.server.Server.handle(Server.java:499)
    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:310)
    at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
    at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
    at java.lang.Thread.run(Thread.java:745)
Setting Shard Count at Initial Startup of SolrCloud
Hi, I have an external Zookeeper, and I don't want to use the SolrCloud example setup just for testing. I upload configs to Zookeeper:

server/scripts/cloud-scripts/zkcli.sh -zkhost localhost:2181 -cmd upconfig -confdir server/solr/my_collection/conf -confname my_collection

Then I start the servers:

Server 1: bin/solr start -cloud -d server -p 8983 -z localhost:2181
Server 2: bin/solr start -cloud -d server -p 8984 -z localhost:2181

As usual, the shard count ends up being 1 with this approach, but I want 2 shards. I know that I can create a collection with a given shard count via bin/solr create; however, I would then have to delete the existing collection before I can create the shards. Is there any way to set the number of shards, maximum shards per node, etc. at the initial start of Solr? Kind Regards, Furkan KAMACI
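For reference, what I am effectively after is the numShards / maxShardsPerNode parameters of the Collections API CREATE command; a rough sketch of the same call from SolrJ (the static factory methods below are from SolrJ 6.x, and the collection/config names are the ones I uploaded above, so adjust as needed):

    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.request.CollectionAdminRequest;

    public class CreateCollection {
        public static void main(String[] args) throws Exception {
            CloudSolrClient client = new CloudSolrClient.Builder()
                .withZkHost("localhost:2181").build();

            // 2 shards, 1 replica each, using the "my_collection" configset in ZK.
            CollectionAdminRequest.createCollection("my_collection", "my_collection", 2, 1)
                .setMaxShardsPerNode(1)
                .process(client);

            client.close();
        }
    }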
Re: How to check optimized or disk free status via solrj for a particular collection?
First off, optimize is actually rarely necessary. I wouldn't bother unless you have measurements to prove that it's desirable. I would _certainly_ not call optimize every 10M docs. If you must call it at all, call it exactly once when indexing is complete. But see above. As far as the commit, I'd just set the autocommit settings in solrconfig.xml to something "reasonable" and forget it. I usually use time rather than doc count as it's a little more predictable. I often use 60 seconds, but it can be longer. The longer it is, the bigger your tlog will grow, and if Solr shuts down forcefully, the longer replaying may take. Here's the whole writeup on this topic: https://lucidworks.com/blog/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/ Running out of space during indexing with about 30% utilization is very odd. My guess is that you're trying to take too much control. Having multiple optimizations going on at once would be a very good way to run out of disk space. And I'm assuming one replica's index per disk or you're reporting aggregate index size per disk when you say 30%. Having three replicas on the same disk each consuming 30% is A Bad Thing. Best, Erick On Mon, Dec 12, 2016 at 8:36 AM, Michael Joyner wrote: > Halp! > > I need to reindex over 43 millions documents, when optimized the collection > is currently < 30% of disk space, we tried it over this weekend and it ran > out of space during the reindexing. > > I'm thinking for the best solution for what we are trying to do is to call > commit/optimize every 10,000,000 documents or so and then wait for the > optimize to complete. > > How to check optimized status via solrj for a particular collection? > > Also, is there is a way to check free space per shard by collection? > > -Mike >
Re: OOMs in Solr
The biggest bang for the buck is _probably_ docValues for the fields you facet on. If that's the culprit, you can also reduce your JVM heap considerably, as Toke says, leaving this little memory for the OS is bad. Here's the writeup on why: http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html Roughly what's happening is that all the values you facet on have to be read into memory somewhere. docvalues puts almost all of that into the OS memory rather than JVM heap. It's much faster to load, reduces JVM GC pressure, OOMs, and allows the pages to be swapped out. However, this is somewhat pushing the problem around. Moving the memory consumption to the OS memory space will have a huge impact on your OOM errors but the cost will be that you'll probably start swapping pages out of the OS memory, which will impact search speed. Slower searches are preferable to OOMs, certainly. That said you'll probably need more physical memory at some point, or go to SolrCloud or Best, Erick On Mon, Dec 12, 2016 at 10:57 AM, Susheel Kumar wrote: > Double check if your queries are not running into deep pagination > (q=*:*...&start=). This is something i recently experienced > and was the only cause of OOM. You may have the gc logs when OOM happened > and drawing it on GC Viewer may give insight how gradual your heap got > filled and run into OOM. > > Thanks, > Susheel > > On Mon, Dec 12, 2016 at 10:32 AM, Alfonso Muñoz-Pomer Fuentes < > amu...@ebi.ac.uk> wrote: > >> Thanks again. >> >> I’m learning more about Solr in this thread than in my previous months >> reading about it! >> >> Moving to Solr Cloud is a possibility we’ve discussed and I guess it will >> eventually happen, as the index will grow no matter what. >> >> I’ve already lowered filterCache from 512 to 64 and I’m looking forward to >> seeing what happens in the next few days. Our filter cache hit ratio was >> 0.99, so I would expect this to go down but if we can have a more >> efficiente memory usage I think e.g. an extra second for each search is >> still acceptable. >> >> Regarding the startup scripts we’re using the ones included with Solr. >> >> As for the use of filters we’re always using the same four filters, IIRC. >> In any case we’ll review the code to ensure that that’s the case. >> >> I’m aware of the need to reindex when the schema changes, but thanks for >> the reminder. We’ll add docValues because I think that’ll make a >> significant difference in our case. We’ll also try to leave space for the >> disk cache as we’re using spinning disk storage. >> >> Thanks again to everybody for the useful and insightful replies. >> >> Alfonso >> >> >> On 12/12/2016 14:12, Shawn Heisey wrote: >> >>> On 12/12/2016 3:13 AM, Alfonso Muñoz-Pomer Fuentes wrote: >>> I’m writing because in our web application we’re using Solr 5.1.0 and currently we’re hosting it on a VM with 32 GB of RAM (of which 30 are dedicated to Solr and nothing else is running there). We have four cores, that are this size: - 25.56 GB, Num Docs = 57,860,845 - 12.09 GB, Num Docs = 173,491,631 (The other two cores are about 10 MB, 20k docs) >>> >>> An OOM indicates that a Java application is requesting more memory than >>> it has been told it can use. There are only two remedies for OOM errors: >>> Increase the heap, or make the program use less memory. In this email, >>> I have concentrated on ways to reduce the memory requirements. 
>>> >>> These index sizes and document counts are relatively small to Solr -- as >>> long as you have enough memory and are smart about how it's used. >>> >>> Solr 5.1.0 comes with GC tuning built into the startup scripts, using >>> some well-tested CMS settings. If you are using those startup scripts, >>> then the parallel collector will NOT be default. No matter what >>> collector is in use, it cannot fix OOM problems. It may change when and >>> how frequently they occur, but it can't do anything about them. >>> >>> We aren’t indexing on this machine, and we’re getting OOM relatively quickly (after about 14 hours of regular use). Right now we have a Cron job that restarts Solr every 12 hours, so it’s not pretty. We use faceting quite heavily and mostly as a document storage server (we want full data sets instead of the n most relevant results). >>> >>> Like Toke, I suspect two things: a very large filterCache, and the heavy >>> facet usage, maybe both. Enabling docValues on the fields you're using >>> for faceting and reindexing will make the latter more memory efficient, >>> and likely faster. Reducing the filterCache size would help the >>> former. Note that if you have a completely static index, then it is >>> more likely that you will fill up the filterCache over time. >>> >>> I don’t know if what we’re experiencing is usual given the index size and memory constraint of the VM, or something looks like it’s wildly misconfigured. What do you think? Any us
Re: regex-urlfilter help
On 12/12/2016 12:19 PM, KRIS MUSSHORN wrote: > I'm using nutch 1.12 and Solr 5.4.1. > > Crawling a website and indexing into nutch. > > AFAIK the regex-urlfilter.txt file will cause content to not be crawled.. > > what if I have > https:///inside/default.cfm as my seed url... > I want the links on this page to be crawled and indexed but I do not want > this page to be indexed into SOLR. > How would I set this up? > > I'm thnking that the regex.urlfilter.txt file is NOT the right place. These sound like questions about how to configure Nutch. This is a Solr mailing list. Nutch is a completely separate Apache product with its own mailing list. Although there may be people here who do use Nutch, it's not the purpose of this list. Please use support resources for Nutch. http://nutch.apache.org/mailing_lists.html I'm reasonably certain that this cannot be controlled by Solr's configuration. Solr will index anything that is sent to it, so the choice of what to send or not send in this situation will be decided by Nutch. Thanks, Shawn
RE: Unicode Character Problem
> I don't see any weird character when I manual copy it to any text editor. That's a good diagnostic step, but there's a chance that Adobe (or your viewer) got it right, and Tika or PDFBox isn't getting it right. If you run tika-app on the file [0], do you get the same problem? See our stub on common text extraction challenges with PDFs [1] and how to run PDFBox's ExtractText against your file [2]. [0] java -jar tika-app.jar -i <input_directory> -o <output_directory> [1] https://wiki.apache.org/tika/PDFParser%20%28Apache%20PDFBox%29 [2] https://wiki.apache.org/tika/Troubleshooting%20Tika#PDF_Text_Problems -Original Message- From: Furkan KAMACI [mailto:furkankam...@gmail.com] Sent: Monday, December 12, 2016 10:55 AM To: solr-user@lucene.apache.org; Ahmet Arslan Subject: Re: Unicode Character Problem Hi Ahmet, I don't see any weird character when I manual copy it to any text editor. On Sat, Dec 10, 2016 at 6:19 PM, Ahmet Arslan wrote: > Hi Furkan, > > I am pretty sure this is a pdf extraction thing. > Turkish characters caused us trouble in the past during extracting > text from pdf files. > You can confirm by performing manual copy-paste from original pdf file. > > Ahmet > > > On Friday, December 9, 2016 8:44 PM, Furkan KAMACI > > wrote: > Hi, > > I'm trying to index Turkish characters. These are what I see at my > index (I see both of them at different places of my content): > > aç klama > açıklama > > These are same words but indexed different (same weird character at > first one). I see that there is not a weird character when I check the > original PDF file. > > What do you think about it. Is it related to Solr or Tika? > > PS: I use text_general for analyser of content field. > > Kind Regards, > Furkan KAMACI >
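If it is easier to test from code than from the command line, the same check can be done with the Tika facade directly. A minimal sketch, assuming tika-app (or tika-core plus tika-parsers) is on the classpath; the file path is obviously a placeholder:

    import java.io.File;
    import org.apache.tika.Tika;

    public class ExtractPdfText {
        public static void main(String[] args) throws Exception {
            Tika tika = new Tika();
            // Prints whatever Tika/PDFBox extracts, so you can see whether the
            // broken Turkish characters appear before Solr ever gets involved.
            String text = tika.parseToString(new File("/path/to/problem.pdf"));
            System.out.println(text);
        }
    }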
regex-urlfilter help
I'm using Nutch 1.12 and Solr 5.4.1, crawling a website with Nutch and indexing into Solr. AFAIK the regex-urlfilter.txt file will cause content to not be crawled. What if I have https:///inside/default.cfm as my seed URL? I want the links on this page to be crawled and indexed, but I do not want this page itself to be indexed into Solr. How would I set this up? I'm thinking that the regex-urlfilter.txt file is NOT the right place. TIA, Kris
Re: Copying Tokens
Multilingual is - hard - fun. What you are trying to do is probably not super-doable, as copyField copies the original text representation. You don't want to copy tokens anyway, as your query-time analysis chains are different too. I would recommend looking at the books first. Mine talks about languages (for an older Solr version) and happens to use English and Russian :-) You can read it for free at: * https://www.packtpub.com/mapt/book/Big%20Data%20&%20Business%20Intelligence/9781782164845 (Free sample is the whole book :-) ) * multilingual setup is the last section/chapter * Source code is at: https://github.com/arafalov/solr-indexing-book There is also a large chapter in "Solr in Action" (chapter 14) that has 3 different strategies, including one that multiplexes languages using a custom field type. There might be others, but I can't remember off the top of my head. But it is a problem books tend to cover, because it is known to be thorny. Regards, Alex. http://www.solr-start.com/ - Resources for Solr users, new and experienced On 12 December 2016 at 11:00, Furkan KAMACI wrote: > Hi, > > I'm testing language identification. I've enabled it solrconfig.xml. Here > is my dynamic fields at schema: > > > > > So, after indexing, I see that fields are generated: > > content_en > content_ru > > I copy my fields into a text field: > > > > > Here is my text field: > > multiValued="true"/> > > I want to let users only search on only *text* field. However, when I copy > that fields into *text *field, they are indexed according to text_general. > > How can I copy *tokens* to *text *field? > > Kind Regards, > Furkan KAMACI
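As one concrete illustration of the query-time side (this is just one of the possible strategies, not necessarily the one the books recommend for your case): instead of copying tokens into a single field, you can search across the per-language fields directly, for example with edismax. A rough SolrJ sketch, reusing the field names from your mail; the core URL is a placeholder:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class MultiLanguageSearch {
        public static void main(String[] args) throws Exception {
            HttpSolrClient client = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/yourcollection").build();

            SolrQuery q = new SolrQuery("user query here");
            q.set("defType", "edismax");
            // Each field keeps its own language-specific analysis chain,
            // so no copyField (and no token copying) is needed.
            q.set("qf", "content_en content_ru");

            QueryResponse rsp = client.query(q);
            System.out.println("hits: " + rsp.getResults().getNumFound());
            client.close();
        }
    }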
Re: OOMs in Solr
Double check if your queries are not running into deep pagination (q=*:*...&start=). This is something i recently experienced and was the only cause of OOM. You may have the gc logs when OOM happened and drawing it on GC Viewer may give insight how gradual your heap got filled and run into OOM. Thanks, Susheel On Mon, Dec 12, 2016 at 10:32 AM, Alfonso Muñoz-Pomer Fuentes < amu...@ebi.ac.uk> wrote: > Thanks again. > > I’m learning more about Solr in this thread than in my previous months > reading about it! > > Moving to Solr Cloud is a possibility we’ve discussed and I guess it will > eventually happen, as the index will grow no matter what. > > I’ve already lowered filterCache from 512 to 64 and I’m looking forward to > seeing what happens in the next few days. Our filter cache hit ratio was > 0.99, so I would expect this to go down but if we can have a more > efficiente memory usage I think e.g. an extra second for each search is > still acceptable. > > Regarding the startup scripts we’re using the ones included with Solr. > > As for the use of filters we’re always using the same four filters, IIRC. > In any case we’ll review the code to ensure that that’s the case. > > I’m aware of the need to reindex when the schema changes, but thanks for > the reminder. We’ll add docValues because I think that’ll make a > significant difference in our case. We’ll also try to leave space for the > disk cache as we’re using spinning disk storage. > > Thanks again to everybody for the useful and insightful replies. > > Alfonso > > > On 12/12/2016 14:12, Shawn Heisey wrote: > >> On 12/12/2016 3:13 AM, Alfonso Muñoz-Pomer Fuentes wrote: >> >>> I’m writing because in our web application we’re using Solr 5.1.0 and >>> currently we’re hosting it on a VM with 32 GB of RAM (of which 30 are >>> dedicated to Solr and nothing else is running there). We have four >>> cores, that are this size: >>> - 25.56 GB, Num Docs = 57,860,845 >>> - 12.09 GB, Num Docs = 173,491,631 >>> >>> (The other two cores are about 10 MB, 20k docs) >>> >> >> An OOM indicates that a Java application is requesting more memory than >> it has been told it can use. There are only two remedies for OOM errors: >> Increase the heap, or make the program use less memory. In this email, >> I have concentrated on ways to reduce the memory requirements. >> >> These index sizes and document counts are relatively small to Solr -- as >> long as you have enough memory and are smart about how it's used. >> >> Solr 5.1.0 comes with GC tuning built into the startup scripts, using >> some well-tested CMS settings. If you are using those startup scripts, >> then the parallel collector will NOT be default. No matter what >> collector is in use, it cannot fix OOM problems. It may change when and >> how frequently they occur, but it can't do anything about them. >> >> We aren’t indexing on this machine, and we’re getting OOM relatively >>> quickly (after about 14 hours of regular use). Right now we have a >>> Cron job that restarts Solr every 12 hours, so it’s not pretty. We use >>> faceting quite heavily and mostly as a document storage server (we >>> want full data sets instead of the n most relevant results). >>> >> >> Like Toke, I suspect two things: a very large filterCache, and the heavy >> facet usage, maybe both. Enabling docValues on the fields you're using >> for faceting and reindexing will make the latter more memory efficient, >> and likely faster. Reducing the filterCache size would help the >> former. 
Note that if you have a completely static index, then it is >> more likely that you will fill up the filterCache over time. >> >> I don’t know if what we’re experiencing is usual given the index size >>> and memory constraint of the VM, or something looks like it’s wildly >>> misconfigured. What do you think? Any useful pointers for some tuning >>> we could do to improve the service? Would upgrading to Solr 6 make sense? >>> >> >> As I already mentioned, the first thing I'd check is the size of the >> filterCache. Reduce it, possibly so it's VERY small. Do everything you >> can to assure that you are re-using filters, not sending many unique >> filters. One of the most common things that leads to low filter re-use >> is using the bare NOW keyword in date filters and queries. Use NOW/HOUR >> or NOW/DAY instead -- NOW changes once a millisecond, so it is typically >> unique for every query. FilterCache entries are huge, as you were told >> in another reply. >> >> Unless you use docValues, or utilize the facet.method parameter VERY >> carefully, each field you facet on will tie up a large section of memory >> containing the value for that field in EVERY document in the index. >> With the document counts you've got, this is a LOT of memory. >> >> It is strongly recommended to have docValues enabled on every field >> you're using for faceting. If you change the schema in this manner, a >> full reindex will be required before you can use that field again.
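On the deep-pagination point at the top of this mail: if the application really does need full result sets, the usual way to avoid huge start= offsets is cursorMark paging (available since Solr 4.7). A minimal SolrJ sketch; the collection URL, rows value, and sort field are placeholders, and the sort must include the uniqueKey as a tiebreaker:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.params.CursorMarkParams;

    public class CursorPaging {
        public static void main(String[] args) throws Exception {
            HttpSolrClient client = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/yourcollection").build();

            SolrQuery q = new SolrQuery("*:*");
            q.setRows(1000);
            q.setSort("id", SolrQuery.ORDER.asc);  // cursor requires a uniqueKey tiebreaker

            String cursor = CursorMarkParams.CURSOR_MARK_START;
            while (true) {
                q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursor);
                QueryResponse rsp = client.query(q);
                // ... process rsp.getResults() here ...
                String next = rsp.getNextCursorMark();
                if (next.equals(cursor)) {
                    break;  // cursor did not advance, so there are no more results
                }
                cursor = next;
            }
            client.close();
        }
    }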
How to check optimized or disk free status via solrj for a particular collection?
Halp! I need to reindex over 43 million documents. When optimized, the collection currently takes < 30% of disk space, but we tried the re-index over this weekend and it ran out of space partway through. I'm thinking the best solution for what we are trying to do is to call commit/optimize every 10,000,000 documents or so and then wait for the optimize to complete. How can I check optimize status via SolrJ for a particular collection? Also, is there a way to check free space per shard by collection? -Mike
Re: Distribution Packages
We use the jdeb Maven plugin to build Debian packages; we use it for Solr as well. On Dec 12, 2016 9:03 AM, "Adjamilton Junior" wrote: > Hi folks, > > I am new here and I wonder to know why there's no Solr 6.x packages for > ubuntu/debian? > > Thank you. > > Adjamilton Junior >
Map Highlight Field into Another Field
Hi, One can use a wildcard in highlight fields, e.g. content_*, so that content_de and content_en both match it. However, the response will then include the fields separately: "highlighting":{ "my query":{ "content_de": "content_en": ... Is it possible to map the matched fields onto a single predefined field, e.g. content_* => content, so that the client can handle one generic name in the response for such cases? If not, I can implement such a feature. Kind Regards, Furkan KAMACI
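In the meantime the obvious workaround is to merge them on the client side; roughly what I have in mind, as a SolrJ sketch (assuming the highlighting request itself is already configured as above):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class HighlightMerge {
        // Collapses content_de, content_en, ... into one "content" snippet list per doc.
        static List<String> mergedContentSnippets(QueryResponse rsp, String docId) {
            List<String> merged = new ArrayList<>();
            Map<String, List<String>> perField = rsp.getHighlighting().get(docId);
            if (perField != null) {
                for (Map.Entry<String, List<String>> e : perField.entrySet()) {
                    if (e.getKey().startsWith("content_")) {
                        merged.addAll(e.getValue());
                    }
                }
            }
            return merged;
        }
    }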
Copying Tokens
Hi, I'm testing language identification. I've enabled it in solrconfig.xml. Here are my dynamic fields in the schema: So, after indexing, I see that these fields are generated: content_en content_ru I copy those fields into a text field: Here is my text field: I want to let users search on only the *text* field. However, when I copy those fields into the *text* field, they are indexed according to text_general. How can I copy *tokens* into the *text* field? Kind Regards, Furkan KAMACI
Re: Unicode Character Problem
Hi Ahmet, I don't see any weird character when I manually copy it into a text editor. On Sat, Dec 10, 2016 at 6:19 PM, Ahmet Arslan wrote: > Hi Furkan, > > I am pretty sure this is a pdf extraction thing. > Turkish characters caused us trouble in the past during extracting text > from pdf files. > You can confirm by performing manual copy-paste from original pdf file. > > Ahmet > > > On Friday, December 9, 2016 8:44 PM, Furkan KAMACI > wrote: > Hi, > > I'm trying to index Turkish characters. These are what I see at my index (I > see both of them at different places of my content): > > aç �klama > açıklama > > These are same words but indexed different (same weird character at first > one). I see that there is not a weird character when I check the original > PDF file. > > What do you think about it. Is it related to Solr or Tika? > > PS: I use text_general for analyser of content field. > > Kind Regards, > Furkan KAMACI >
Re: OOMs in Solr
Thanks again. I’m learning more about Solr in this thread than in my previous months reading about it! Moving to Solr Cloud is a possibility we’ve discussed and I guess it will eventually happen, as the index will grow no matter what. I’ve already lowered filterCache from 512 to 64 and I’m looking forward to seeing what happens in the next few days. Our filter cache hit ratio was 0.99, so I would expect this to go down but if we can have a more efficiente memory usage I think e.g. an extra second for each search is still acceptable. Regarding the startup scripts we’re using the ones included with Solr. As for the use of filters we’re always using the same four filters, IIRC. In any case we’ll review the code to ensure that that’s the case. I’m aware of the need to reindex when the schema changes, but thanks for the reminder. We’ll add docValues because I think that’ll make a significant difference in our case. We’ll also try to leave space for the disk cache as we’re using spinning disk storage. Thanks again to everybody for the useful and insightful replies. Alfonso On 12/12/2016 14:12, Shawn Heisey wrote: On 12/12/2016 3:13 AM, Alfonso Muñoz-Pomer Fuentes wrote: I’m writing because in our web application we’re using Solr 5.1.0 and currently we’re hosting it on a VM with 32 GB of RAM (of which 30 are dedicated to Solr and nothing else is running there). We have four cores, that are this size: - 25.56 GB, Num Docs = 57,860,845 - 12.09 GB, Num Docs = 173,491,631 (The other two cores are about 10 MB, 20k docs) An OOM indicates that a Java application is requesting more memory than it has been told it can use. There are only two remedies for OOM errors: Increase the heap, or make the program use less memory. In this email, I have concentrated on ways to reduce the memory requirements. These index sizes and document counts are relatively small to Solr -- as long as you have enough memory and are smart about how it's used. Solr 5.1.0 comes with GC tuning built into the startup scripts, using some well-tested CMS settings. If you are using those startup scripts, then the parallel collector will NOT be default. No matter what collector is in use, it cannot fix OOM problems. It may change when and how frequently they occur, but it can't do anything about them. We aren’t indexing on this machine, and we’re getting OOM relatively quickly (after about 14 hours of regular use). Right now we have a Cron job that restarts Solr every 12 hours, so it’s not pretty. We use faceting quite heavily and mostly as a document storage server (we want full data sets instead of the n most relevant results). Like Toke, I suspect two things: a very large filterCache, and the heavy facet usage, maybe both. Enabling docValues on the fields you're using for faceting and reindexing will make the latter more memory efficient, and likely faster. Reducing the filterCache size would help the former. Note that if you have a completely static index, then it is more likely that you will fill up the filterCache over time. I don’t know if what we’re experiencing is usual given the index size and memory constraint of the VM, or something looks like it’s wildly misconfigured. What do you think? Any useful pointers for some tuning we could do to improve the service? Would upgrading to Solr 6 make sense? As I already mentioned, the first thing I'd check is the size of the filterCache. Reduce it, possibly so it's VERY small. Do everything you can to assure that you are re-using filters, not sending many unique filters. 
One of the most common things that leads to low filter re-use is using the bare NOW keyword in date filters and queries. Use NOW/HOUR or NOW/DAY instead -- NOW changes once a millisecond, so it is typically unique for every query. FilterCache entries are huge, as you were told in another reply. Unless you use docValues, or utilize the facet.method parameter VERY carefully, each field you facet on will tie up a large section of memory containing the value for that field in EVERY document in the index. With the document counts you've got, this is a LOT of memory. It is strongly recommended to have docValues enabled on every field you're using for faceting. If you change the schema in this manner, a full reindex will be required before you can use that field again. There is another problem lurking here that Toke already touched on: Leaving only 2GB of RAM for the OS to handle disk caching will result in terrible performance. What you've been told by me and and in other replies is discussed here: https://wiki.apache.org/solr/SolrPerformanceProblems Thanks, Shawn -- Alfonso Muñoz-Pomer Fuentes Software Engineer @ Expression Atlas Team European Bioinformatics Institute (EMBL-EBI) European Molecular Biology Laboratory Tel:+ 44 (0) 1223 49 2633 Skype: amunozpomer
Re: empty result set for a sort query
Ah, 2-phase distributed search is the most likely answer (and currently classified as more of a limitation than a bug)... Phase 1 collects the top N ids from each shard (and merges them to find the global top N) Phase 2 retrieves the stored fields for the global top N If any of the ids have been deleted between Phase 1 and Phase 2, then you can get fewer than N docs back. -Yonik On Mon, Dec 12, 2016 at 4:26 AM, moscovig wrote: > I am not sure that it's related, > but with local tests we got to a scenario where we > Add doc that somehow has *empty key* and then, when querying with sort over > creationTime with rows=1, we get empty result set. > When specifying the recent doc shard with shards=shard2 we do have results. > > I don't think we have empty keys in our production schema but maybe it can > give a clue. > > Thanks > Gilad
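A quick way to check for this kind of per-shard discrepancy (shards.info and distrib are standard request parameters, but the host, core name and sort field below are only illustrative, not taken from Gilad's setup): add shards.info=true to the normal request to see each shard's numFound, or point the same query at a single replica with distrib=false, e.g.

http://localhost:8983/solr/collection1_shard2_replica1/select?q=*:*&sort=creationTimestamp+desc&rows=1&distrib=false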
Re: Distribution Packages
On 12/12/2016 7:03 AM, Adjamilton Junior wrote: > I am new here and I am wondering why there are no Solr 6.x packages > for Ubuntu/Debian? There are no official Solr packages for ANY operating system. We have binary releases that include an installation script for UNIX-like operating systems with typical open source utilities, but there are no RPM or DEB packages. It takes considerable developer time and effort to maintain such packages. The idea has been discussed, but nobody has volunteered to do it, and Apache Infra has not been approached about the resources required for hosting the repositories. Thanks, Shawn
Re: OOMs in Solr
On 12/12/2016 3:13 AM, Alfonso Muñoz-Pomer Fuentes wrote: > I’m writing because in our web application we’re using Solr 5.1.0 and > currently we’re hosting it on a VM with 32 GB of RAM (of which 30 are > dedicated to Solr and nothing else is running there). We have four > cores, that are this size: > - 25.56 GB, Num Docs = 57,860,845 > - 12.09 GB, Num Docs = 173,491,631 > > (The other two cores are about 10 MB, 20k docs) An OOM indicates that a Java application is requesting more memory than it has been told it can use. There are only two remedies for OOM errors: Increase the heap, or make the program use less memory. In this email, I have concentrated on ways to reduce the memory requirements. These index sizes and document counts are relatively small to Solr -- as long as you have enough memory and are smart about how it's used. Solr 5.1.0 comes with GC tuning built into the startup scripts, using some well-tested CMS settings. If you are using those startup scripts, then the parallel collector will NOT be the default. No matter what collector is in use, it cannot fix OOM problems. It may change when and how frequently they occur, but it can't do anything about them. > We aren’t indexing on this machine, and we’re getting OOM relatively > quickly (after about 14 hours of regular use). Right now we have a > Cron job that restarts Solr every 12 hours, so it’s not pretty. We use > faceting quite heavily and mostly as a document storage server (we > want full data sets instead of the n most relevant results). Like Toke, I suspect two things: a very large filterCache, and the heavy facet usage, maybe both. Enabling docValues on the fields you're using for faceting and reindexing will make the latter more memory efficient, and likely faster. Reducing the filterCache size would help the former. Note that if you have a completely static index, then it is more likely that you will fill up the filterCache over time. > I don’t know if what we’re experiencing is usual given the index size > and memory constraint of the VM, or something looks like it’s wildly > misconfigured. What do you think? Any useful pointers for some tuning > we could do to improve the service? Would upgrading to Solr 6 make sense? As I already mentioned, the first thing I'd check is the size of the filterCache. Reduce it, possibly so it's VERY small. Do everything you can to assure that you are re-using filters, not sending many unique filters. One of the most common things that leads to low filter re-use is using the bare NOW keyword in date filters and queries. Use NOW/HOUR or NOW/DAY instead -- NOW changes once a millisecond, so it is typically unique for every query. FilterCache entries are huge, as you were told in another reply. Unless you use docValues, or utilize the facet.method parameter VERY carefully, each field you facet on will tie up a large section of memory containing the value for that field in EVERY document in the index. With the document counts you've got, this is a LOT of memory. It is strongly recommended to have docValues enabled on every field you're using for faceting. If you change the schema in this manner, a full reindex will be required before you can use that field again. There is another problem lurking here that Toke already touched on: Leaving only 2GB of RAM for the OS to handle disk caching will result in terrible performance. What you've been told by me and in other replies is discussed here: https://wiki.apache.org/solr/SolrPerformanceProblems Thanks, Shawn
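To make those two suggestions concrete (the field names here are invented for illustration, not taken from Alfonso's schema), a facet field with docValues enabled looks like

<field name="organism" type="string" indexed="true" stored="true" docValues="true"/>

and a date filter rounded so the filterCache entry can be re-used looks like

fq=release_date:[NOW/DAY-7DAYS TO NOW/DAY]

whereas fq=release_date:[NOW-7DAYS TO NOW] changes every millisecond, so practically every query creates a brand-new cache entry.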
RE: OOMs in Solr
You can also try the following: 1. Reduce the stack size of threads using the -Xss flag. 2. Try to use sharding instead of a single large instance (if possible). 3. Reduce cache sizes in solrconfig.xml. Regards, Prateek Jain -Original Message- From: Alfonso Muñoz-Pomer Fuentes [mailto:amu...@ebi.ac.uk] Sent: 12 December 2016 01:31 PM To: solr-user@lucene.apache.org Subject: Re: OOMs in Solr I wasn’t aware of docValues and filterCache policies. We’ll try to fine-tune it and see if it helps. Thanks so much for the info! On 12/12/2016 12:13, Toke Eskildsen wrote: > On Mon, 2016-12-12 at 10:13 +0000, Alfonso Muñoz-Pomer Fuentes wrote: >> I’m writing because in our web application we’re using Solr 5.1.0 and >> currently we’re hosting it on a VM with 32 GB of RAM (of which 30 are >> dedicated to Solr and nothing else is running there). > > This leaves very little memory for disk cache. I hope your underlying > storage is local SSDs and not spinning drives over the network. > >> We have four cores, that are this size: >> - 25.56 GB, Num Docs = 57,860,845 >> - 12.09 GB, Num Docs = 173,491,631 > > Smallish in bytes, largish in document count. > >> We aren’t indexing on this machine, and we’re getting OOM relatively >> quickly (after about 14 hours of regular use). > > The usual suspect for OOMs after some time is the filterCache. Worst-case entries in that one take up 1 bit/document, which means 7MB and > 22MB respectively for the two collections above. If your filterCache > is set to 1000 for those, this means (7MB+22MB)*1000 ~= all your heap. > > >> Right now we have a Cron job that restarts Solr every 12 hours, so >> it’s not pretty. We use faceting quite heavily > > Hopefully on docValued fields? > >> and mostly as a document storage server (we want full data sets >> instead of the n most relevant results). > > Hopefully with deep paging, as opposed to rows=173491631? > >> I don’t know if what we’re experiencing is usual given the index size >> and memory constraint of the VM, or something looks like it’s wildly >> misconfigured. > > I would have guessed that your heap was quite large enough for a > static index, but that is just ... guesswork. > > Would upgrading to Solr 6 make sense? > > It would not help in itself, but if you also switched to using > streaming for your assumedly large exports, it would lower memory > requirements. > > - Toke Eskildsen, State and University Library, Denmark > >> -- Alfonso Muñoz-Pomer Fuentes Software Engineer @ Expression Atlas Team European Bioinformatics Institute (EMBL-EBI) European Molecular Biology Laboratory Tel:+ 44 (0) 1223 49 2633 Skype: amunozpomer
Distribution Packages
Hi folks, I am new here and I am wondering why there are no Solr 6.x packages for Ubuntu/Debian? Thank you. Adjamilton Junior
Re: Antw: Re: Solr 6.2.1 :: Collection Aliasing
On 12/12/2016 3:56 AM, Rainer Gnan wrote: > Do the query this way: > http://hostname.de:8983/solr/live/select?indent=on&q=*:* > > I have no idea whether the behavior you are seeing is correct or wrong, > but if you send the traffic directly to the alias it should work correctly. > > It might turn out that this is a bug, but I believe the above workaround > should take care of the issue in your environment. It's standard SolrCloud usage. You use the name of a collection in the URL path after /solr, where normally (non-cloud) you would use a core name. All aliases, even those with multiple collections, will work for queries, and single-collection aliases would have predictable results for updates. I do not have any documentation to point you at, although it's possible that this IS mentioned in the docs. Thanks, Shawn
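For the production/test swap described in the original question in this thread, the alias itself can be re-pointed with the Collections API; CREATEALIAS overwrites an existing alias, so a pair of calls along these lines (host and collection names taken from the thread, otherwise a sketch) exchanges the roles of the two collections:

http://hostname.de:8983/solr/admin/collections?action=CREATEALIAS&name=live&collections=Collection_2
http://hostname.de:8983/solr/admin/collections?action=CREATEALIAS&name=test&collections=Collection_1

Queries sent to http://hostname.de:8983/solr/live/select?q=*:* then always hit whichever collection currently carries the "live" alias, using that collection's own config.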
Re: OOMs in Solr
I wasn’t aware of docValues and filterCache policies. We’ll try to fine-tune it and see if it helps. Thanks so much for the info! On 12/12/2016 12:13, Toke Eskildsen wrote: On Mon, 2016-12-12 at 10:13 +0000, Alfonso Muñoz-Pomer Fuentes wrote: I’m writing because in our web application we’re using Solr 5.1.0 and currently we’re hosting it on a VM with 32 GB of RAM (of which 30 are dedicated to Solr and nothing else is running there). This leaves very little memory for disk cache. I hope your underlying storage is local SSDs and not spinning drives over the network. We have four cores, that are this size: - 25.56 GB, Num Docs = 57,860,845 - 12.09 GB, Num Docs = 173,491,631 Smallish in bytes, largish in document count. We aren’t indexing on this machine, and we’re getting OOM relatively quickly (after about 14 hours of regular use). The usual suspect for OOMs after some time is the filterCache. Worst-case entries in that one take up 1 bit/document, which means 7MB and 22MB respectively for the two collections above. If your filterCache is set to 1000 for those, this means (7MB+22MB)*1000 ~= all your heap. Right now we have a Cron job that restarts Solr every 12 hours, so it’s not pretty. We use faceting quite heavily Hopefully on docValued fields? and mostly as a document storage server (we want full data sets instead of the n most relevant results). Hopefully with deep paging, as opposed to rows=173491631? I don’t know if what we’re experiencing is usual given the index size and memory constraint of the VM, or something looks like it’s wildly misconfigured. I would have guessed that your heap was quite large enough for a static index, but that is just ... guesswork. Would upgrading to Solr 6 make sense? It would not help in itself, but if you also switched to using streaming for your assumedly large exports, it would lower memory requirements. - Toke Eskildsen, State and University Library, Denmark -- Alfonso Muñoz-Pomer Fuentes Software Engineer @ Expression Atlas Team European Bioinformatics Institute (EMBL-EBI) European Molecular Biology Laboratory Tel:+ 44 (0) 1223 49 2633 Skype: amunozpomer
Re: OOMs in Solr
Thanks for the reply. Here’s some more info... Disk space: 39 GB / 148 GB (used / available); Deployment model: Single instance; JVM version: 1.7.0_04; Number of queries: avgRequestsPerSecond: 0.5478469104833896; GC algorithm: None specified, so I guess it defaults to the parallel GC. On 12/12/2016 10:22, Prateek Jain J wrote: Please provide some information like: disk space available; deployment model of Solr (SolrCloud or single instance); JVM version; no. of queries and type of queries, etc.; GC algorithm used, etc. Regards, Prateek Jain -Original Message- From: Alfonso Muñoz-Pomer Fuentes [mailto:amu...@ebi.ac.uk] Sent: 12 December 2016 10:14 AM To: solr-user@lucene.apache.org Subject: OOMs in Solr Hi Solr users, I’m writing because in our web application we’re using Solr 5.1.0 and currently we’re hosting it on a VM with 32 GB of RAM (of which 30 are dedicated to Solr and nothing else is running there). We have four cores, that are this size: - 25.56 GB, Num Docs = 57,860,845 - 12.09 GB, Num Docs = 173,491,631 (The other two cores are about 10 MB, 20k docs) We aren’t indexing on this machine, and we’re getting OOM relatively quickly (after about 14 hours of regular use). Right now we have a Cron job that restarts Solr every 12 hours, so it’s not pretty. We use faceting quite heavily and mostly as a document storage server (we want full data sets instead of the n most relevant results). I don’t know if what we’re experiencing is usual given the index size and memory constraint of the VM, or something looks like it’s wildly misconfigured. What do you think? Any useful pointers for some tuning we could do to improve the service? Would upgrading to Solr 6 make sense? Thanks a lot in advance. -- Alfonso Muñoz-Pomer Fuentes Software Engineer @ Expression Atlas Team European Bioinformatics Institute (EMBL-EBI) European Molecular Biology Laboratory Tel:+ 44 (0) 1223 49 2633 Skype: amunozpomer -- Alfonso Muñoz-Pomer Fuentes Software Engineer @ Expression Atlas Team European Bioinformatics Institute (EMBL-EBI) European Molecular Biology Laboratory Tel:+ 44 (0) 1223 49 2633 Skype: amunozpomer
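For anyone wondering where the heap and GC choices are made when the bin/solr scripts are used: they live in solr.in.sh. The variable names below are the ones used by the 5.x scripts (GC_TUNE holds the collector flags), but the values are only an illustration of a 30 GB heap with GC logging enabled, so check your own copy of the file:

SOLR_JAVA_MEM="-Xms30g -Xmx30g"
GC_LOG_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps"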
Traverse over response docs in SearchComponent impl.
Hello - I need to traverse over the list of response docs in a SearchComponent, get all values for a specific field, and then conditionally add a new field. The request handler is configured as follows: dostuff I can see that Solr calls the component's process() method, but from within that method, rb.getResponseDocs() is always null. No matter what I try, I do not seem to be able to get a hold of that list of response docs. I don't like to use a DocTransformer because I first need to check all fields in the response. Any idea on how to process the SolrDocumentList correctly? I am clearly missing something. Thanks, Markus
Re: OOMs in Solr
On Mon, 2016-12-12 at 10:13 +0000, Alfonso Muñoz-Pomer Fuentes wrote: > I’m writing because in our web application we’re using Solr 5.1.0 > and currently we’re hosting it on a VM with 32 GB of RAM (of which 30 > are dedicated to Solr and nothing else is running there). This leaves very little memory for disk cache. I hope your underlying storage is local SSDs and not spinning drives over the network. > We have four cores, that are this size: > - 25.56 GB, Num Docs = 57,860,845 > - 12.09 GB, Num Docs = 173,491,631 Smallish in bytes, largish in document count. > We aren’t indexing on this machine, and we’re getting OOM relatively > quickly (after about 14 hours of regular use). The usual suspect for OOMs after some time is the filterCache. Worst-case entries in that one take up 1 bit/document, which means 7MB and 22MB respectively for the two collections above. If your filterCache is set to 1000 for those, this means (7MB+22MB)*1000 ~= all your heap. > Right now we have a Cron job that restarts Solr every 12 hours, so > it’s not pretty. We use faceting quite heavily Hopefully on docValued fields? > and mostly as a document storage server (we want full data sets > instead of the n most relevant results). Hopefully with deep paging, as opposed to rows=173491631? > I don’t know if what we’re experiencing is usual given the index size > and memory constraint of the VM, or something looks like it’s wildly > misconfigured. I would have guessed that your heap was quite large enough for a static index, but that is just ... guesswork. > Would upgrading to Solr 6 make sense? It would not help in itself, but if you also switched to using streaming for your assumedly large exports, it would lower memory requirements. - Toke Eskildsen, State and University Library, Denmark
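Spelling out the arithmetic behind Toke's estimate: a worst-case filterCache entry is one bit per document, so 57,860,845 / 8 ≈ 7.2 MB per entry for the first core and 173,491,631 / 8 ≈ 21.7 MB for the second; at 1000 entries per core that is roughly (7 MB + 22 MB) x 1000 ≈ 29 GB, essentially the entire 30 GB heap.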
Antw: Re: Solr 6.2.1 :: Collection Aliasing
Hi Shawn, your workaround works and is exactly what I was looking for. Did you find this solution via trial and error or can you point me to the appropriate section in the APRGuide? Thanks a lot! Rainer Rainer Gnan Bayerische Staatsbibliothek BibliotheksVerbund Bayern Verbundnahe Dienste 80539 München Tel.: +49(0)89/28638-2445 Fax: +49(0)89/28638-2665 E-Mail: rainer.g...@bsb-muenchen.de >>> Shawn Heisey 12.12.2016 11:43 >>> On 12/12/2016 3:32 AM, Rainer Gnan wrote: > Hi, > > actually I am trying to use Collection Aliasing in a SolrCloud-environment. > > My set up is as follows: > > 1. Collection_1 (alias "live") linked with config_1 > 2. Collection_2 (alias "test") linked with config_2 > 3. Collection_1 is different to Collection _2 > 4. config_1 is different to config_2 > > Case 1: Using > http://hostname.de:8983/solr/Collection_2/select?indent=on&q=*:*&wt=xml&collection=test > > leads to the same results as > http://hostname.de:8983/solr/Collection_2/select?indent=on&q=*:*&wt=xml > which is correct. > > Case 2: Using > http://hostname.de:8983/solr/Collection_1/select?indent=on&q=*:*&wt=xml&collection=live > > leads to the same result as > http://hostname.de:8983/solr/Collection_1/select?indent=on&q=*:*&wt=xml > which is correct, too. > > BUT > > Case 3: Using > http://hostname.de:8983/solr/Collection_2/select?indent=on&q=*:*&wt=xml&collection=live > > leads NOT to the same result as > http://hostname.de:8983/solr/Collection_1/select?indent=on&q=*:*&wt=xml&collection=live > > or > http://hostname.de:8983/solr/Collection_1/select?indent=on&q=*:*&wt=xml Do the query this way: http://hostname.de:8983/solr/live/select?indent=on&q=*:* I have no idea whether the behavior you are seeing is correct or wrong, but if you send the traffic directly to the alias it should work correctly. It might turn out that this is a bug, but I believe the above workaround should take care of the issue in your environment. Thanks, Shawn
Re: Solr 6.2.1 :: Collection Aliasing
On 12/12/2016 3:32 AM, Rainer Gnan wrote: > Hi, > > actually I am trying to use Collection Aliasing in a SolrCloud-environment. > > My set up is as follows: > > 1. Collection_1 (alias "live") linked with config_1 > 2. Collection_2 (alias "test") linked with config_2 > 3. Collection_1 is different to Collection _2 > 4. config_1 is different to config_2 > > Case 1: Using > http://hostname.de:8983/solr/Collection_2/select?indent=on&q=*:*&wt=xml&collection=test > leads to the same results as > http://hostname.de:8983/solr/Collection_2/select?indent=on&q=*:*&wt=xml > which is correct. > > Case 2: Using > http://hostname.de:8983/solr/Collection_1/select?indent=on&q=*:*&wt=xml&collection=live > leads to the same result as > http://hostname.de:8983/solr/Collection_1/select?indent=on&q=*:*&wt=xml > which is correct, too. > > BUT > > Case 3: Using > http://hostname.de:8983/solr/Collection_2/select?indent=on&q=*:*&wt=xml&collection=live > leads NOT to the same result as > http://hostname.de:8983/solr/Collection_1/select?indent=on&q=*:*&wt=xml&collection=live > or > http://hostname.de:8983/solr/Collection_1/select?indent=on&q=*:*&wt=xml Do the query this way: http://hostname.de:8983/solr/live/select?indent=on&q=*:* I have no idea whether the behavior you are seeing is correct or wrong, but if you send the traffic directly to the alias it should work correctly. It might turn out that this is a bug, but I believe the above workaround should take care of the issue in your environment. Thanks, Shawn
Re: Data Import Handler - maximum?
On 12/11/2016 8:00 PM, Brian Narsi wrote: > We are using Solr 5.1.0 and DIH to build the index. > > We are using DIH with clean=true and commit=true and optimize=true. > Currently retrieving about 10.5 million records in about an hour. > > I would like to learn from other members' experiences how long DIH can > run with no issues. What is the maximum number of records that anyone has > pulled using DIH? > > Are there any limitations on the maximum number of records that can/should > be pulled using DIH? What is the longest that DIH can run? There are no hard limits other than the Lucene limit of a little over two billion docs per individual index. With sharding, Solr is able to easily overcome this limit on an entire index. I have one index where each shard was over 50 million docs. Each shard has fewer docs now, because I changed it so there are more shards and more machines. For some reason the rebuild time (using DIH) got really really long -- nearly 48 hours -- while building every shard in parallel. Still haven't figured out why the build time increased dramatically. One problem you might run into with DIH from a database has to do with merging. With default merge scheduler settings, eventually (typically when there are millions of rows being imported) you'll run into a pause in indexing that will take so long that the database connection will close, causing the import to fail after the pause finishes. I even opened a Lucene issue to get the default value for maxMergeCount changed. This issue went nowhere: https://issues.apache.org/jira/browse/LUCENE-5705 Here's a thread from this mailing list discussing the problem and the configuration solution: http://lucene.472066.n3.nabble.com/What-does-quot-too-many-merges-stalling-quot-in-indexwriter-log-mean-td4077380.html Thanks, Shawn
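The configuration fix discussed in that linked thread amounts to raising maxMergeCount on the merge scheduler in solrconfig.xml. The element syntax is standard; the values below are only the ones commonly suggested for spinning disks, so treat them as a starting point rather than a recommendation for this particular setup:

<indexConfig>
  <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
    <int name="maxMergeCount">6</int>
    <int name="maxThreadCount">1</int>
  </mergeScheduler>
</indexConfig>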
Solr 6.2.1 :: Collection Aliasing
Hi, actually I am trying to use Collection Aliasing in a SolrCloud environment. My setup is as follows: 1. Collection_1 (alias "live") linked with config_1 2. Collection_2 (alias "test") linked with config_2 3. Collection_1 is different to Collection_2 4. config_1 is different to config_2 Case 1: Using http://hostname.de:8983/solr/Collection_2/select?indent=on&q=*:*&wt=xml&collection=test leads to the same results as http://hostname.de:8983/solr/Collection_2/select?indent=on&q=*:*&wt=xml which is correct. Case 2: Using http://hostname.de:8983/solr/Collection_1/select?indent=on&q=*:*&wt=xml&collection=live leads to the same result as http://hostname.de:8983/solr/Collection_1/select?indent=on&q=*:*&wt=xml which is correct, too. BUT Case 3: Using http://hostname.de:8983/solr/Collection_2/select?indent=on&q=*:*&wt=xml&collection=live leads NOT to the same result as http://hostname.de:8983/solr/Collection_1/select?indent=on&q=*:*&wt=xml&collection=live or http://hostname.de:8983/solr/Collection_1/select?indent=on&q=*:*&wt=xml It seems that using alias "live" in case 3 forces Solr to search in Collection_1 (which is desired) but it uses config_2 of Collection_2 (which is not desired). MY AIM IS: Running one collection as the production environment and the other as the test environment within a single SolrCloud. After setting up a new index (new schema, new solrconfig.xml) on the test collection I want to assign the test collection the alias "live" and the live collection the alias "test". How can I force Solr to search in Collection_X with config_X? I hope that my description makes clear what my problem is. If not, don't hesitate to ask, I appreciate any help. Rainer Rainer Gnan Bayerische Staatsbibliothek BibliotheksVerbund Bayern Verbundnahe Dienste 80539 München Tel.: +49(0)89/28638-2445 Fax: +49(0)89/28638-2665 E-Mail: rainer.g...@bsb-muenchen.de
RE: OOMs in Solr
Please provide some information like: disk space available; deployment model of Solr (SolrCloud or single instance); JVM version; no. of queries and type of queries, etc.; GC algorithm used, etc. Regards, Prateek Jain -Original Message- From: Alfonso Muñoz-Pomer Fuentes [mailto:amu...@ebi.ac.uk] Sent: 12 December 2016 10:14 AM To: solr-user@lucene.apache.org Subject: OOMs in Solr Hi Solr users, I’m writing because in our web application we’re using Solr 5.1.0 and currently we’re hosting it on a VM with 32 GB of RAM (of which 30 are dedicated to Solr and nothing else is running there). We have four cores, that are this size: - 25.56 GB, Num Docs = 57,860,845 - 12.09 GB, Num Docs = 173,491,631 (The other two cores are about 10 MB, 20k docs) We aren’t indexing on this machine, and we’re getting OOM relatively quickly (after about 14 hours of regular use). Right now we have a Cron job that restarts Solr every 12 hours, so it’s not pretty. We use faceting quite heavily and mostly as a document storage server (we want full data sets instead of the n most relevant results). I don’t know if what we’re experiencing is usual given the index size and memory constraint of the VM, or something looks like it’s wildly misconfigured. What do you think? Any useful pointers for some tuning we could do to improve the service? Would upgrading to Solr 6 make sense? Thanks a lot in advance. -- Alfonso Muñoz-Pomer Fuentes Software Engineer @ Expression Atlas Team European Bioinformatics Institute (EMBL-EBI) European Molecular Biology Laboratory Tel:+ 44 (0) 1223 49 2633 Skype: amunozpomer
OOMs in Solr
Hi Solr users, I’m writing because in our web application we’re using Solr 5.1.0 and currently we’re hosting it on a VM with 32 GB of RAM (of which 30 are dedicated to Solr and nothing else is running there). We have four cores, that are this size: - 25.56 GB, Num Docs = 57,860,845 - 12.09 GB, Num Docs = 173,491,631 (The other two cores are about 10 MB, 20k docs) We aren’t indexing on this machine, and we’re getting OOM relatively quickly (after about 14 hours of regular use). Right now we have a Cron job that restarts Solr every 12 hours, so it’s not pretty. We use faceting quite heavily and mostly as a document storage server (we want full data sets instead of the n most relevant results). I don’t know if what we’re experiencing is usual given the index size and memory constraint of the VM, or something looks like it’s wildly misconfigured. What do you think? Any useful pointers for some tuning we could do to improve the service? Would upgrading to Solr 6 make sense? Thanks a lot in advance. -- Alfonso Muñoz-Pomer Fuentes Software Engineer @ Expression Atlas Team European Bioinformatics Institute (EMBL-EBI) European Molecular Biology Laboratory Tel:+ 44 (0) 1223 49 2633 Skype: amunozpomer
Re: empty result set for a sort query
I am not sure that it's related, but with local tests we got to a scenario where we Add doc that somehow has *empty key* and then, when querying with sort over creationTime with rows=1, we get empty result set. When specifying the recent doc shard with shards=shard2 we do have results. I don't think we have empty keys in our production schema but maybe it can give a clue. Thanks Gilad -- View this message in context: http://lucene.472066.n3.nabble.com/empty-result-set-for-a-sort-query-tp4309256p4309315.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: empty result set for a sort query
Hi Thanks for the reply. We are using select?q=*:*&sort=creationTimestamp+desc&rows=1 So as you said we should have got results. Another piece of information is that we commit within 300ms when inserting the "sanity" doc. And again, we delete by query. We don't have any custom plugin/query processor. -- View this message in context: http://lucene.472066.n3.nabble.com/empty-result-set-for-a-sort-query-tp4309256p4309304.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Data Import Handler - maximum?
On 12.12.2016 at 04:00, Brian Narsi wrote: > We are using Solr 5.1.0 and DIH to build the index. > > We are using DIH with clean=true and commit=true and optimize=true. > Currently retrieving about 10.5 million records in about an hour. > > I would like to learn from other members' experiences how long DIH can > run with no issues. What is the maximum number of records that anyone has > pulled using DIH? AFAIK, DIH will run until the maximum number of documents per index is reached. Our longest run took about 3.5 days for a single DIH and over 100 million docs. The runtime depends pretty much on the complexity of the analysis during loading. Currently we are using concurrent DIH with 12 processes, which takes 15 hours for the same amount. Optimizing afterwards takes 9.5 hours. SolrJ with 12 threads does the same indexing in 7.5 hours, plus optimizing. For huge amounts of data you should consider using SolrJ. > > Are there any limitations on the maximum number of records that can/should > be pulled using DIH? What is the longest that DIH can run? > > Thanks a bunch! >
RE: Problem with Cross Data Center Replication
Hi Erick, thanks for the hint. Indeed, i just forgot to paste the section into the email. It was configured just the same way as you wrote. Do you have any idea what else could be the cause for the error? Best regard, Gero -Original Message- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Wednesday, November 23, 2016 5:00 PM To: solr-user Subject: Re: Problem with Cross Data Center Replication Your _source_ (i.e. cdcr_testa) doesn't have the CDCR update log configured. This section isn't in solrconfig for cdcr_testa: ${solr.ulog.dir:} The update log is the transfer mechanism between the source and target clusters, so it needs to be configured in both. Best, Erick. P.S. kudos for including enough info to diagnose (assuming I'm right)! On Wed, Nov 23, 2016 at 4:40 AM, WILLMES Gero (SAFRAN IDENTITY AND SECURITY) wrote: > Hi Solr users, > > i try to configure Cross Data Center Replication according to > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=62687 > 462 > > I set up two independent solr clouds. I created the collection "cdcr_testa" > on the source cloud and the collection "backup_collection" on the target > cloud. > > I adapted the configurations, according to the documentation. > > solrconfig.xml of the collection "cdcr_testa" in the source cluster: > > > > 127.0.0.2:2181 > cdcr_testa > backup_collection > > > > 8 > 1000 > 128 > > > > 1000 > > > > > solrconfig.xml of the collection "backup_collection" in the target cluster: > > > > disabled > > > > > > cdcr-processor-chain > > > > > > > > > > > > ${solr.ulog.dir:} > > > > > > When I now reload the collection "cdcr_testa", I allways get the > following Solr Exception > > 2016-11-23 12:05:35.604 ERROR (qtp1134712904-8045) [c:cdcr_testa s:shard1 > r:core_node1 x:cdcr_testa_shard1_replica1] o.a.s.s.HttpSolrCall > null:org.apache.solr.common.SolrException: Error handling 'reload' action > at > org.apache.solr.handler.admin.CoreAdminOperation$3.call(CoreAdminOperation.java:150) > at > org.apache.solr.handler.admin.CoreAdminHandler$CallInfo.call(CoreAdminHandler.java:367) > at > org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:158) > at > org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:156) > at > org.apache.solr.servlet.HttpSolrCall.handleAdminRequest(HttpSolrCall.java:663) > at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:445) > at > org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:257) > at > org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:208) > at > org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1668) > at > org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:581) > at > org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143) > at > org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548) > at > org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226) > at > org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1160) > at > org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:511) > at > org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185) > at > org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1092) > at > org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) > at > 
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213) > at > org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119) > at > org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) > at org.eclipse.jetty.server.Server.handle(Server.java:518) > at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:308) > at > org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:244) > at > org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273) > at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95) > at > org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93) > at > org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceAndRun(ExecuteProduceConsume.java:246) > at > org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:156) > at > org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:654) > at > org.eclipse.jetty.util.thread.Queue
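Since the mail archive stripped the XML tags out of the configs quoted above, here is roughly what the source-cluster pieces under discussion look like, reassembled from the values that survived and the CDCR page Gero links; treat it as a sketch of the documented layout, not his actual file:

<requestHandler name="/cdcr" class="solr.CdcrRequestHandler">
  <lst name="replica">
    <str name="zkHost">127.0.0.2:2181</str>
    <str name="source">cdcr_testa</str>
    <str name="target">backup_collection</str>
  </lst>
  <lst name="replicator">
    <str name="threadPoolSize">8</str>
    <str name="schedule">1000</str>
    <str name="batchSize">128</str>
  </lst>
  <lst name="updateLogSynchronizer">
    <str name="schedule">1000</str>
  </lst>
</requestHandler>

<updateHandler class="solr.DirectUpdateHandler2">
  <updateLog class="solr.CdcrUpdateLog">
    <str name="dir">${solr.ulog.dir:}</str>
  </updateLog>
</updateHandler>

The second block is the CdcrUpdateLog section Erick is asking about; per his reply it has to be present in the solrconfig.xml of both the source and the target collection.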