Re: Using Solr Spatial in conjunction with HBASE/Hadoop
Have you looked at Oracle NoSQL Database (http://www.oracle.com/us/products/database/nosql/overview/index.html), a scalable key-value store? Can Solr be integrated with it?

Thanks and warm regards,
ashok joshi
oracle
Re: Language Identification at index time
It sounds like you want an update request processor: http://wiki.apache.org/solr/UpdateRequestProcessor

But it also sounds like you should probably be normalizing the encoding before sending the data to Solr.

-- Jack Krupansky

-----Original Message-----
From: Yewint Ko
Sent: Sunday, January 20, 2013 10:36 AM
To: solr-user@lucene.apache.org
Subject: Language Identification at index time

Hi all,

I am very new to Solr and Nutch. Currently I have a requirement to develop a small search engine for local movie websites. Because non-standard encoding systems are currently in use on many of our local websites, it has become necessary for us to develop an encoding identifier and converter for web crawling, indexing and query processing. The idea is that we will identify the encoding used on the website, convert it (if necessary) and store the index in standard Unicode.

We have developed our own identifier and converter (a Solr SearchComponent) that can be used at query time to identify the encoding of the user query and convert it to match the index. The problem I am having is that I don't know how to intercept the request at indexing time for identifying and converting purposes. Is there something like a filter chain that can access the text before passing it to the tokenizer, so that we can access the text and detect which encoding it is?

Thanks,
yewint
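To make the update request processor suggestion concrete: documents can be intercepted and rewritten in processAdd() before they reach the analysis chain. A minimal sketch, where the "content" field name and the convertToUnicode() helper are hypothetical stand-ins for your own detector/converter:

  import java.io.IOException;
  import org.apache.solr.common.SolrInputDocument;
  import org.apache.solr.request.SolrQueryRequest;
  import org.apache.solr.response.SolrQueryResponse;
  import org.apache.solr.update.AddUpdateCommand;
  import org.apache.solr.update.processor.UpdateRequestProcessor;
  import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

  public class EncodingNormalizerProcessorFactory extends UpdateRequestProcessorFactory {
    @Override
    public UpdateRequestProcessor getInstance(SolrQueryRequest req, SolrQueryResponse rsp,
        UpdateRequestProcessor next) {
      return new UpdateRequestProcessor(next) {
        @Override
        public void processAdd(AddUpdateCommand cmd) throws IOException {
          SolrInputDocument doc = cmd.getSolrInputDocument();
          // "content" is a placeholder: run your detector/converter on whichever fields need it
          Object value = doc.getFieldValue("content");
          if (value instanceof String) {
            doc.setField("content", convertToUnicode((String) value));
          }
          super.processAdd(cmd); // pass the normalized document down the chain
        }
      };
    }

    // hypothetical hook for your existing identifier/converter logic
    private String convertToUnicode(String raw) {
      return raw;
    }
  }

The factory is then registered in an updateRequestProcessorChain in solrconfig.xml and referenced from your update handler, so every incoming document passes through it before tokenization.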
Re: Missing documents with ConcurrentUpdateSolrServer (vs. HttpSolrServer) ?
If this was in SolrCloud mode, there was a bug in 4.0 when submitting batches of documents at once. Can't find it right now, but thought I'd mention it just in case. Submitting the docs one-at-a-time doesn't have the same problem. May not be applicable, and entirely orthogonal to the discussion about swallowing errors.

Erick

On Tue, Jan 15, 2013 at 4:10 PM, Mark Bennett mbenn...@ideaeng.com wrote:

First off, just reporting this: I wound up with approximately 58% fewer documents after submitting via ConcurrentUpdateSolrServer. I went back and changed the code to use HttpSolrServer and got 100% of them.

This was a long-running test, approximately 12 hours with gigabytes of data, so it's not conveniently shared / reproducible, but I at least wanted to email around, in part to get it on the record, and second to see if anybody else has seen this? I didn't see anything in JIRA.

I realize that concurrent update is asynchronous and I'm giving up the ability to monitor things, but since it works using the old server, there's nothing glaringly wrong at least.

Here's a few more details:
* Approx 2M docs, submitted 1,000 at a time.
* Solr 4.0.0 on Windows Server 2008
* Solr server JVM configured with 4 GB of RAM
* Submitting client JVM (SolrJ) configured with 10 GB of RAM
* I didn't see any OOM (Out Of Memory) errors on the asynchronous / ConcurrentUpdateSolrServer run. However, I didn't capture the entire log. Usually with OOM it's just before the run crashes, and the end of the log on the screen looked fine.
* I also didn't think there were OOM issues on the Solr server side, for the same reason.
* When submitting the same data synchronously (via HttpSolrServer) it didn't have any problems.

Questions:

The async client certainly finished faster, and since the underlying Solr server presumably didn't do the real work any faster, presumably a backlog built up somewhere. Agreed? I'm guessing this backlog had something to do with the failure. Or are there other areas to think about?

Which process would get backlogged, the SolrJ client or the Solr server? I'd guess the server? And if async submits are accumulated in the Solr server, is there some mechanism to queue them onto disk, or does it try to hold them all in RAM? And *if* the backlog caused an OOM condition, wouldn't that JVM have mostly crashed (if not completely)?

Any guesses on the most likely failure point, and where to look?

Thanks,
Mark

--
Mark Bennett / New Idea Engineering, Inc. / mbenn...@ideaeng.com
Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513
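On the swallowing-errors point: ConcurrentUpdateSolrServer's handleError() callback only logs by default, so failed batches can disappear silently. A sketch of making failures visible (the URL, queue size and thread count are illustrative):

  import java.util.concurrent.atomic.AtomicInteger;
  import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;

  final AtomicInteger errors = new AtomicInteger();
  ConcurrentUpdateSolrServer server =
      new ConcurrentUpdateSolrServer("http://localhost:8983/solr/collection1", 1000, 4) {
        @Override
        public void handleError(Throwable ex) {
          // default behavior just logs; count (or persist) failures so they are visible
          errors.incrementAndGet();
          ex.printStackTrace();
        }
      };
  // ... add documents as usual, then drain the internal queue before checking:
  server.blockUntilFinished();
  System.out.println("failed batches: " + errors.get());

Comparing the error count against the number of submitted batches would at least tell you whether the missing 58% corresponds to rejected requests rather than something lost in transit.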
Re: Solr load balancer
Hmmm, the first thing I'd look at is why you are having long GC pauses. Here's a great place to start:
http://www.lucidimagination.com/blog/2011/03/27/garbage-collection-bootcamp-1-0/
and:
http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

I've wondered about a similar approach, but by firing off the same query to multiple nodes in your cluster, you'll be effectively doubling (at least) the load on your system. Leading to more memory issues perhaps, in a non-virtuous cycle.

FWIW,
Erick

On Fri, Jan 18, 2013 at 5:41 AM, Phil Hoy p...@brightsolid.com wrote:

Hi,

I would like to experiment with some custom load balancers to help with query latency in the face of long GC pauses and the odd time-consuming query that we need to be able to support. At the moment setting the socket timeout via the HttpShardHandlerFactory does help, but of course it can only be set to a length of time as long as the most time-consuming query we are likely to receive.

For example, perhaps a load balancer that sends multiple queries concurrently to all/some replicas and only keeps the first response might be effective. Or maybe a load balancer which takes account of the frequency of timeouts would be able to recognize zombies more effectively.

To use alternative load balancer implementations cleanly and without having to hack Solr directly, I would need to make the existing LBHttpSolrServer and HttpShardHandlerFactory more amenable to extension; I could then override the default load balancer using Solr's plugin mechanism. So my question is: if I made a patch to make the load balancer more pluggable, is this something that would be acceptable, and if so, what do I do next?

Phil
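Phil's first idea (send to several replicas, keep the first answer) can be prototyped client-side without touching Solr's load balancer at all, e.g. with ExecutorService.invokeAny(), which returns the result of the first task to complete successfully and cancels the rest. A rough sketch; replica URLs are illustrative, and in real code you would reuse the HttpSolrServer instances rather than build them per request:

  import java.util.ArrayList;
  import java.util.List;
  import java.util.concurrent.Callable;
  import java.util.concurrent.ExecutorService;
  import java.util.concurrent.Executors;
  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.impl.HttpSolrServer;
  import org.apache.solr.client.solrj.response.QueryResponse;

  public class FirstResponseWins {
    public static void main(String[] args) throws Exception {
      final SolrQuery query = new SolrQuery("*:*");
      List<Callable<QueryResponse>> tasks = new ArrayList<Callable<QueryResponse>>();
      for (final String url : new String[] {
          "http://replica1:8983/solr/collection1",
          "http://replica2:8983/solr/collection1" }) {
        tasks.add(new Callable<QueryResponse>() {
          public QueryResponse call() throws Exception {
            return new HttpSolrServer(url).query(query); // same query to each replica
          }
        });
      }
      ExecutorService pool = Executors.newFixedThreadPool(tasks.size());
      try {
        // first successful response wins; the slower (or GC-stalled) replica is cancelled
        QueryResponse rsp = pool.invokeAny(tasks);
        System.out.println("numFound=" + rsp.getResults().getNumFound());
      } finally {
        pool.shutdownNow();
      }
    }
  }

Erick's caveat applies in full: this trades roughly doubled query load for lower tail latency, so it only helps if the cluster has headroom.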
Re: Solr cache considerations
About your question about the document cache: Typically the document cache has a pretty low hit-ratio. I've rarely, if ever, seen it get hit very often. And remember that this cache is only hit when assembling the response for a few documents (your page size). Bottom line: I wouldn't worry about this cache much. It's quite useful for processing a particular query faster, but not really intended for cross-query use.

Really, I think you're getting the cart before the horse here. Run it up the flagpole and try it. Rely on the OS to do its job (http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html). Find a bottleneck _then_ tune. Premature optimization and all that... Several tens of millions of docs isn't that large unless the text fields are enormous.

Best,
Erick

On Sat, Jan 19, 2013 at 2:32 PM, Isaac Hebsh isaac.he...@gmail.com wrote:

Ok. Thank you everyone for your helpful answers. I understand that fieldValueCache is not used for resolving queries. Is there any cache that can help this basic scenario (a lot of different queries, on a small set of fields)? Does Lucene's FieldCache help (implicitly)? How can I use RAM to reduce I/O in this type of query?

On Fri, Jan 18, 2013 at 4:09 PM, Tomás Fernández Löbbe tomasflo...@gmail.com wrote:

No, the fieldValueCache is not used for resolving queries. Only for multi-token faceting, and apparently for the stats component too. The document cache maintains in memory the stored content of the fields you are retrieving or highlighting on. It'll hit if the same document matches the query multiple times and the same fields are requested, but as Erick said, it is important for cases when multiple components in the same request need to access the same data.

I think soft committing every 10 minutes is totally fine, but you should hard commit more often if you are going to be using the transaction log. openSearcher=false will essentially tell Solr not to open a new searcher after the (hard) commit, so you won't see the newly indexed data and caches won't be flushed. openSearcher=false makes sense when you are using hard commits together with soft commits; as the soft commit is dealing with opening/closing searchers, you don't need hard commits to do it.

Tomás

On Fri, Jan 18, 2013 at 2:20 AM, Isaac Hebsh isaac.he...@gmail.com wrote:

Unfortunately, it seems (http://lucene.472066.n3.nabble.com/Nrt-and-caching-td3993612.html) that these caches are not per-segment. In this case, I want to (soft) commit less frequently. Am I right?

Tomás, as the fieldValueCache is very similar to Lucene's FieldCache, I guess it has a big contribution to standard (not only faceted) query time. The SolrWiki claims that it is primarily used by faceting. What does that say about complex textual queries?

documentCache: Erick, after query processing is finished, don't some documents stay in the documentCache? Can't I use it to accelerate queries that should retrieve stored fields of documents? In this case, a big documentCache can hold more documents.

About commit frequency:
HardCommit: openSearcher=false seems like a nice solution. Where can I read about this? (I found nothing but one unexplained sentence in the SolrWiki.)
SoftCommit: In my case, the required index freshness is 10 minutes. The plan to soft commit every 10 minutes is similar to storing all of the documents in a queue (outside Solr), and indexing a bulk every 10 minutes.

Thanks.

On Fri, Jan 18, 2013 at 2:15 AM, Tomás Fernández Löbbe tomasflo...@gmail.com wrote:

I think fieldValueCache is not per segment, only fieldCache is. However, unless I'm missing something, this cache is only used for faceting on multivalued fields.

On Thu, Jan 17, 2013 at 8:58 PM, Erick Erickson erickerick...@gmail.com wrote:

filterCache: This is bounded by 1M * (maxDoc) / 8 * (num filters in cache). Notice the /8. This reflects the fact that the filters are represented by a bitset on the _internal_ Lucene ID. UniqueId has no bearing here whatsoever. This is, in a nutshell, why warming is required: the internal Lucene IDs may change. Note also that it's maxDoc; the internal arrays have holes for deleted documents. Note this is an _upper_ bound; if there are only a few docs that match, the size will be (num of matching docs) * sizeof(int).

fieldValueCache: I don't think so, although I'm a bit fuzzy on this. It depends on whether these are per-segment caches or not. Any per-segment cache is still valid.

Think of documentCache as intended to hold the stored fields while various components operate on it, thus avoiding repeatedly fetching the data from disk. It's _usually_ not too big a worry.

About hard commits once a day: that's _extremely_ long. Think instead of committing more frequently with openSearcher=false. If nothing else, your transaction log will grow
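To put Erick's filterCache bound in concrete terms (numbers purely illustrative): each cached filter stored as a bitset costs maxDoc/8 bytes. With maxDoc = 30,000,000 that is about 3.75 MB per entry, so a filterCache sized at 512 entries can reach roughly 1.9 GB in the worst case; only sparse filters get the cheaper (num matching docs) * sizeof(int) representation.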
RE: Solr 4.0 - timeAllowed in distributed search
(This is based on my knowledge of 3.6 - not sure if this has changed in 4.0.)

You are using rows=30000, which requires retrieving 30000 documents from disk. In a non-distributed search, the QTime will not include the time it takes to retrieve these documents, but in a distributed search, it will. For a *:* query, the document retrieval will almost always be the slowest part of the query. I'd suggest measuring how long it takes for the response to be returned, or use rows=0.

The timeAllowed feature is very misleading. It only applies to a small portion of the query (which in my experience is usually not the part of the query that is actually slow). Do not depend on timeAllowed doing anything useful :)

-Michael

-----Original Message-----
From: Lyuba Romanchuk [mailto:lyuba.romanc...@gmail.com]
Sent: Sunday, January 20, 2013 6:36 AM
To: solr-user@lucene.apache.org
Subject: Solr 4.0 - timeAllowed in distributed search

Hi,

I try to use timeAllowed both in a distributed search with one shard and directly against the same shard. I send the same query with timeAllowed=500:
- directly to the shard: QTime ~= 600 ms
- through distributed search to the same shard: QTime ~= 7 sec

I have two questions:
- It seems that the timeAllowed parameter doesn't work for distributed search, does it?
- What may be the reason that the query to the shard through distributed search takes much more time than the query to the shard directly (the same difference remains without the timeAllowed parameter in the query)?

Test results:

Ask one shard through distributed search:

http://localhost:8983/solr/shard_2013-01-07/select?q=*:*&rows=30000&shards=127.0.0.1%3A8983%2Fsolr%2Fshard_2013-01-07&timeAllowed=500&partialResults=true&shards.info=true&debugQuery=true

<response>
<lst name="responseHeader">
  <bool name="partialResults">true</bool>
  <int name="status">0</int>
  <int name="QTime">7307</int>
  <lst name="params">
    <str name="q">*:*</str>
    <str name="shards">127.0.0.1:8983/solr/shard_2013-01-07</str>
    <str name="partialResults">true</str>
    <str name="debugQuery">true</str>
    <str name="shards.info">true</str>
    <str name="rows">30000</str>
    <str name="timeAllowed">500</str>
  </lst>
</lst>
<lst name="shards.info">
  <lst name="127.0.0.1:8983/solr/shard_2013-01-07">
    <long name="numFound">29574223</long>
    <float name="maxScore">1.0</float>
    <long name="time">646</long>
  </lst>
</lst>
<result name="response" numFound="29574223" start="0" maxScore="1.0">
  ... 30,000 docs ...
</result>
<lst name="debug">
  <str name="rawquerystring">*:*</str>
  <str name="querystring">*:*</str>
  <str name="parsedquery">MatchAllDocsQuery(*:*)</str>
  <str name="parsedquery_toString">*:*</str>
  <str name="QParser">LuceneQParser</str>
  <lst name="timing">
    <double name="time">6141.0</double>
    <lst name="prepare">
      <double name="time">0.0</double>
      <lst name="org.apache.solr.handler.component.QueryComponent"><double name="time">0.0</double></lst>
      <lst name="org.apache.solr.handler.component.FacetComponent"><double name="time">0.0</double></lst>
      <lst name="org.apache.solr.handler.component.MoreLikeThisComponent"><double name="time">0.0</double></lst>
      <lst name="org.apache.solr.handler.component.HighlightComponent"><double name="time">0.0</double></lst>
      <lst name="org.apache.solr.handler.component.StatsComponent"><double name="time">0.0</double></lst>
      <lst name="org.apache.solr.handler.component.DebugComponent"><double name="time">0.0</double></lst>
    </lst>
    <lst name="process">
      <double name="time">6141.0</double>
      <lst name="org.apache.solr.handler.component.QueryComponent"><double name="time">6022.0</double></lst>
      <lst name="org.apache.solr.handler.component.FacetComponent"><double name="time">0.0</double></lst>
      <lst name="org.apache.solr.handler.component.MoreLikeThisComponent"><double name="time">0.0</double></lst>
      <lst name="org.apache.solr.handler.component.HighlightComponent"><double name="time">0.0</double></lst>
      <lst name="org.apache.solr.handler.component.StatsComponent"><double name="time">0.0</double></lst>
      <lst name="org.apache.solr.handler.component.DebugComponent"><double name="time">119.0</double></lst>
    </lst>
  </lst>
</lst>
</response>

Ask the same shard directly:

http://localhost:8983/solr/shard_2013-01-07/select?q=*:*&rows=30000&timeAllowed=500&partialResults=true&shards.info=true&debugQuery=true

<lst name="responseHeader">
  <bool name="partialResults">true</bool>
  <int name="status">0</int>
  <int name="QTime">617</int>
  <lst name="params">
    <str name="q">*:*</str>
    <str name="partialResults">true</str>
    <str name="debugQuery">true</str>
    <str name="shards.info">true</str>
    <str name="rows">30000</str>
    <str name="timeAllowed">500</str>
  </lst>
</lst>
<result name="response" numFound="28687243" start="0">
  ... 30,000 docs ...
</result>
<lst name="debug">
  <str name="rawquerystring">*:*</str>
  <str name="querystring">*:*</str>
  <str name="parsedquery">MatchAllDocsQuery(*:*)</str>
  <str name="parsedquery_toString">*:*</str>
  <str name="QParser">LuceneQParser</str>
  <lst name="timing">
    <double name="time">617.0</double>
    <lst name="prepare">
      <double name="time">0.0</double>
      <lst name="org.apache.solr.handler.component.QueryComponent"><double name="time">0.0</double></lst>
      <lst name="org.apache.solr.handler.component.FacetComponent"><double name="time">0.0</double></lst>
      <lst name="org.apache.solr.handler.component.MoreLikeThisComponent"><double name="time">0.0</double></lst>
      <lst name="org.apache.solr.handler.component.HighlightComponent"><double name="time">0.0</double></lst>
      ...
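Following Michael's suggestion, the cheapest check is to repeat the distributed query with rows=0:

http://localhost:8983/solr/shard_2013-01-07/select?q=*:*&rows=0&shards=127.0.0.1%3A8983%2Fsolr%2Fshard_2013-01-07&timeAllowed=500&shards.info=true

If QTime then drops close to the direct-query time, the 7-second gap is spent fetching and serializing the 30,000 stored documents on the aggregating node, not in the search itself.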
Re: Long ParNew GC pauses - even when young generation is small
On 1/18/2013 10:07 PM, Shawn Heisey wrote:

On my dev 4.1 server with Java 7u11, I am using the G1 collector with a max pause target of 1500ms. I was thinking that this collector was producing long pauses too, but after reviewing the gc log with a closer eye, I see that there are lines that specifically say "pause" ... and all of THOSE lines are below half a second, except one that took 1.4 seconds.

Does that mean that it's actually meeting the target, or are the other lines that show quite long time values indicative of a problem? If only the lines that explicitly say "pause" are the ones I need to worry about, then it looks like G1 is the clear winner.

Here's a paste showing a command and its output. I included "remark" in the grep because I saw a presentation saying that remark in G1 is stop-the-world:
http://pastie.org/private/vygpvtjzicsl8uztg3drw

None of the matching log lines get close to my 5 second pain point. If I check the entire unfiltered log for lines that exceed 3 seconds, I do find a few, but only one of them says "pause" and it's far enough below the 5 second level that it probably would not cause a problem:
http://pastie.org/private/wcessvbrditextxmoapksq

Here's the perl script used in the two outputs above:
http://pastie.org/private/itu9hbgiwugdjtmy3yg8g

The log was gathered during a full-import of six large shards, over 12 million docs each. The import took 7 hours. I had the patches for LUCENE-4599 (Compressed TermVectors) applied to Solr 4.1 at the time.

What I'd like to know is whether a 'concurrent-mark-end' line indicates stop-the-world or not. I suspect that it is done while the application is working. If this is right, then I think I have found the right GC settings:

-XX:+UseG1GC -XX:MaxGCPauseMillis=1500 -XX:GCPauseIntervalMillis=4000

My production servers have more total memory, more CPU cores, and much faster I/O than the dev server where I have been running these tests, but they both use the same 8GB Java heap.

One last question: Should I be worried about using the G1 collector on Oracle Java 6u38, which was released at the same time as 7u11? This *might* be a good opportunity to upgrade to Java 7 in production, actually. I have two completely independent index chains; I could upgrade the secondary.

If anyone has any suggestions for my GC parsing perl script, or knows about a much more functional replacement, let me know.

Thanks,
Shawn
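For reference, a GC log like the ones parsed above is produced with the standard HotSpot logging flags; a full launch line might look something like this (heap size and paths are illustrative, not a recommendation):

java -Xmx8g -XX:+UseG1GC -XX:MaxGCPauseMillis=1500 -XX:GCPauseIntervalMillis=4000 -verbose:gc -Xloggc:gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps -jar start.jar

Adding -XX:+PrintGCApplicationStoppedTime (suggested later in this thread) makes the log include explicit "Total time for which application threads were stopped" lines, which are much easier to grep for than the per-collector pause formats.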
Re: Long ParNew GC pauses - even when young generation is small
On 1/20/2013 11:33 AM, Shawn Heisey wrote:

On my dev 4.1 server with Java 7u11, I am using the G1 collector with a max pause target of 1500ms. [snip]

Here's the full gc log for anyone that feels compelled to fully investigate:
http://dl.dropbox.com/u/97770508/gc.log

Thanks,
Shawn
Re: Have the SolrCloud collection REST endpoints move or changed for 4.1?
So the ticket I created wasn't related; there is a working patch for that now, but my original issue remains: I get a 404 when trying to post updates to a URL that worked fine in Solr 4.0.

On Sat, Jan 19, 2013 at 5:56 PM, Brett Hoerner br...@bretthoerner.com wrote:

I'm actually wondering if this other issue I've been having is a problem: https://issues.apache.org/jira/browse/SOLR-4321

The fact that some nodes don't get pieces of a collection could explain the 404. That said, even when a node has parts of a collection it reports 404 sometimes. What's odd is that I can use curl to post a JSON document to the same URL and it will return 200. When I log every request I make from my indexer process (using SolrJ) it's about 50/50 between 404 and 200...

On Sat, Jan 19, 2013 at 5:22 PM, Brett Hoerner br...@bretthoerner.com wrote:

I was using Solr 4.0 but ran into a few problems using SolrCloud. I'm trying out 4.1 RC1 right now, but the update URL I used to use is returning HTTP 404.

For example, I would post my document updates to:
http://localhost:8983/solr/collection1

But that is 404ing now (collection1 exists according to the admin UI, all shards are green and happy, and data dirs exist on the nodes). I also tried the following:
http://localhost:8983/solr/collection1/update

And also received a 404 there. A specific example from the Java client:

22:38:12.474 [pool-7-thread-14] ERROR com.massrel.faassolr.SolrBackend - Error while flushing to Solr.
org.apache.solr.common.SolrException: Server at http://backfill-2d.i.massrel.com:8983/solr/15724/update returned non ok status:404, message:Not Found
  at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:372) ~[solr-solrj-4.0.0.jar:4.0.0 1394950 - rmuir - 2012-10-06 03:05:44]
  at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181) ~[solr-solrj-4.0.0.jar:4.0.0 1394950 - rmuir - 2012-10-06 03:05:44]
  at org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(LBHttpSolrServer.java:438) ~[solr-solrj-4.0.0.jar:4.0.0 1394950 - rmuir - 2012-10-06 03:05:44]
  at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117) ~[solr-solrj-4.0.0.jar:4.0.0 1394950 - rmuir - 2012-10-06 03:05:44]

But I can hit that URL with a GET:

$ curl http://backfill-1d.i.massrel.com:8983/solr/15724/update
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">400</int><int name="QTime">2</int></lst><lst name="error"><str name="msg">missing content stream</str><int name="code">400</int></lst>
</response>

Thoughts? Thanks.
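For comparison, a minimal update POST of the kind described above looks like this (host, core name and document are placeholders; in Solr 4.x the /update handler dispatches on Content-Type):

curl 'http://localhost:8983/solr/collection1/update?commit=true' -H 'Content-Type: application/json' -d '[{"id":"doc1"}]'

A bodyless GET against the same URL reaches the handler but has nothing to parse, which is why it answers 400 "missing content stream"; that makes it a handy probe for distinguishing "handler exists" (400) from "core or collection not found" (404).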
Re: Long ParNew GC pauses - even when young generation is small
On 1/18/2013 10:07 PM, Shawn Heisey wrote:

I may try the G1 collector with Java 6 in production, since I am on the newest Oracle version.

I am giving this a try on my secondary server set. An encouraging note: the -XX:+UnlockExperimentalVMOptions option is no longer required to use the G1 collector, at least on version 6u38.

Thanks,
Shawn
Re: Solr cache considerations
Wow Erick, the MMap article is a very fundamental one. Totally changed my view. It should be mentioned in SolrPerformanceFactors on the SolrWiki... I'm sorry I did not know it before. Thank you a lot. I promise to share my results when my cart starts to fly :)

On Sun, Jan 20, 2013 at 6:08 PM, Erick Erickson erickerick...@gmail.com wrote:

About your question about the document cache: Typically the document cache has a pretty low hit-ratio. I've rarely, if ever, seen it get hit very often. [snip]
Re: Solr cache considerations
I routinely see hit rates over 75% on the document cache. Perhaps yours is too small. Mine is set at 10240 entries.

wunder

On Jan 20, 2013, at 8:08 AM, Erick Erickson wrote:

About your question about the document cache: Typically the document cache has a pretty low hit-ratio. [snip]
Re: Solr 4.0 - timeAllowed in distributed search
If you are going to request 30,000 rows, you can give up on getting good performance. It is not going to happen. Even without all the disk accesses, think about how much is sent over the network, then parsed by the client. The client cannot even start working with the data until it is all received and parsed.

wunder

On Jan 20, 2013, at 8:49 AM, Michael Ryan wrote:

(This is based on my knowledge of 3.6 - not sure if this has changed in 4.0.) You are using rows=30000, which requires retrieving 30000 documents from disk. [snip]
Re: Have the SolrCloud collection REST endpoints move or changed for 4.1?
Sorry, I take it back. It looks like fixing https://issues.apache.org/jira/browse/SOLR-4321 fixed my issue after all.

On Sun, Jan 20, 2013 at 2:21 PM, Brett Hoerner br...@bretthoerner.com wrote:

So the ticket I created wasn't related; there is a working patch for that now, but my original issue remains: I get a 404 when trying to post updates to a URL that worked fine in Solr 4.0. [snip]
RE: Long ParNew GC pauses - even when young generation is small
Hi Shawn,

Although our heap spaces are much smaller than yours (256M for 2x 2.5GB cores per node), we saw decreased throughput and higher latency with G1 on Java 6. You can also expect higher CPU consumption. You can check it very well with VisualVM attached. Looking forward to your results.

Markus

-----Original message-----
From: Shawn Heisey s...@elyograg.org
Sent: Sun 20-Jan-2013 21:48
To: solr-user@lucene.apache.org
Subject: Re: Long ParNew GC pauses - even when young generation is small

On 1/18/2013 10:07 PM, Shawn Heisey wrote:

I may try the G1 collector with Java 6 in production, since I am on the newest Oracle version.

I am giving this a try on my secondary server set. An encouraging note: the -XX:+UnlockExperimentalVMOptions option is no longer required to use the G1 collector, at least on version 6u38.

Thanks,
Shawn
Re: Long ParNew GC pauses - even when young generation is small
On 1/20/2013 2:13 PM, Markus Jelsma wrote:

Although our heap spaces are much smaller than yours (256M for 2x 2.5GB cores per node), we saw decreased throughput and higher latency with G1 on Java 6. You can also expect higher CPU consumption. You can check it very well with VisualVM attached. Looking forward to your results.

I don't have any really good test tools developed for testing throughput and latency. I have some less-than-ideal tools for other purposes that I might be able to adapt. Throughput is not a major issue for us - query volume is quite low. I would be mildly surprised by 5 queries per second. I don't have much of an idea of queries per second over the short term - the numbers available in 3.5 are limited.

As for latency, early indications from an old SOLR-1972 patch suggest that the QTime values might be a little higher. The primary server stats (using CMS/ParNew) are over 1 million queries, and the secondary server stats (using G1) so far are only about 5000 queries. The QTime values are steadily dropping as the number of queries goes up.

Here's a status page that gathers all the stats. Chain A is using CMS/ParNew and is no longer receiving queries. All the queries are now going to chain B, which is using G1.
http://dl.dropbox.com/u/97770508/g1-vs-cms-stats.png

The server CPU utilization graph doesn't have enough information yet to make any determination, but what little data is visible suggests that CPU may be higher. The secondary servers also have slightly slower CPUs than the primary servers - I was forced to make concessions on later purchases to keep the cost down.

Thanks,
Shawn
Re: build CMIS compatible Solr
I think this might be the one you are talking about: https://github.com/sourcesense/solr-cmis

But I think Alfresco already has search functionality, similar to Solr. Then why do you want to use it to index docs out of Alfresco?

On Fri, Jan 18, 2013 at 8:00 PM, Upayavira u...@odoko.co.uk wrote:

A colleague of mine when I was working for Sourcesense made a CMIS plugin for Solr. It was one-way, and we used it to index stuff out of Alfresco into Solr. I can't search for it now; let me know if you can't find it.

Upayavira

On Fri, Jan 18, 2013, at 05:35 AM, Nicholas Li wrote:

I want to make something like Alfresco, but without that many features. And I'd like to utilise the searching ability of Solr.

On Fri, Jan 18, 2013 at 4:11 PM, Gora Mohanty g...@mimirtech.com wrote:

On 18 January 2013 10:36, Nicholas Li nicholas...@yarris.com wrote:

hi, I am new to Solr and I would like to use Solr as my document server, plus search engine. But Solr is not CMIS compatible (while it should not be, as it is not built as a pure document management server). In that sense, I would build another layer on top of Solr so that the exposed interface would be CMIS compatible. [...]

May I ask why? Solr is designed to be a search engine, which is a very different beast from a document repository. In the open-source world, Alfresco (http://www.alfresco.com/) already exists, can index into Solr, and supports CMIS-based access.

Regards,
Gora
Re: Long ParNew GC pauses - even when young generation is small
Unfortunately, G1 on Java 6 was a bust. Several times GC pauses made my load balancer think the server was down, just like with CMS/ParNew. Either there's something about my production query patterns that doesn't get along with any of the garbage collection methods, or I need to upgrade to Java 7.

I have tried lowering my max heap before. That results in OOM problems when I do a full-import with DIH.

On 1/20/2013 2:13 PM, Markus Jelsma wrote:

Although our heap spaces are much smaller than yours (256M for 2x 2.5GB cores per node), we saw decreased throughput and higher latency with G1 on Java 6. [snip]
Re: Long ParNew GC pauses - even when young generation is small
> I don't see any info on your website about pricing, so I can't make any decisions about whether it would be right for me. Can you give me long-term pricing information?

As is the case with much of enterprise software (including getting a supported version of Oracle HotSpot), this is a sales-person conversation that we'd be happy to have. You can ask for someone to contact you about this right on the site, or if you want, you can contact me at gil at azulsystems dot com and I'll make sure we get you the information you need.

> Chances are that once I inform management of the cost, it'd never fly.

You may be surprised. You seem to assume that Zing is expensive for some reason, while it's probably on par with or cheaper than other supported JVMs for this sort of thing. It's certainly flown with management for others running into the exact same problems with both Solr and Lucene. Saved them both time and money in the process of forever removing GC headaches.
Re: Long ParNew GC pauses - even when young generation is small
If you believe the logs, using -XX:+PrintGCApplicationStoppedTime is probably the easiest way to avoid having to parse pause times from various formats. But remember, GC logs can [often unintentionally] lie (I've seen them under-report by multi-second gaps).

If you want to actually measure your JVM pauses (GC or others), you can use something like jHiccup (http://www.azulsystems.com/jHiccup). It is a free (as in beer) and public domain (CC0) tool that will show you any blip/glitch/hiccup that your JVM experiences while running your application, and report it in both time-based and detailed percentile form. What jHiccup shows you is a best-case response time for your application as it runs (the response time the application would have shown if it completed everything in zero time).

It's near-trivial to add jHiccup to your environment (as either a java agent or wrapper script). It would be interesting to see the percentile histograms (jHiccup's .hgrm text output) for your environment.
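For what it's worth, wiring jHiccup in as an agent looks roughly like this (the jar path is illustrative; check the jHiccup documentation for the exact agent options on your version):

java -javaagent:/path/to/jHiccup.jar -Xmx8g -jar start.jar

It then records hiccup logs alongside the process, from which the .hgrm percentile histograms mentioned above can be produced.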
Re: Tokenized keywords
Can you please elaborate a bit more on what you are trying to achieve? Tokenizers work on indexed fields and don't affect how the values will be displayed. The response value comes from the stored field.

If you want to see how your query is being tokenized, you can do it using the analysis interface, or enable debugQuery to see how your query is being formed.

On Mon, Jan 21, 2013 at 11:06 AM, Romita Saha romita.s...@sg.panasonic.com wrote:

Hi,

I use some tokenizers to tokenize the query. I want to see the tokenized query words displayed in the response. Could you kindly help me do that?

Thanks and regards,
Romita
Re: Tokenized keywords
What I am trying to achieve is as follows. I query "Search for all the Laptops" and my tokenized keywords are "search laptop" (I apply a stopword filter to filter out words like "for", "all", "the", and I also use a lowercase filter). I want to display these tokenized keywords using debugQuery.

Thanks and regards,
Romita

From: Dikchant Sahi contacts...@gmail.com
To: solr-user@lucene.apache.org
Date: 01/21/2013 02:26 PM
Subject: Re: Tokenized keywords

Can you please elaborate a bit more on what you are trying to achieve? Tokenizers work on indexed fields and don't affect how the values will be displayed. [snip]
Data import handler starts bulging the memory after completing 1 million
http://lucene.472066.n3.nabble.com/file/n4034949/ScreenShot034.jpg

You may refer to this screenshot to get an understanding of the resource consumption. I am trying to index a total of 13 million documents from MySQL to Solr. The first 1 million documents completed very smoothly in the first 2 minutes; later it started bulging the RAM, and the memory never gets released in between. I have tried all the known tricks and tactics, and am still failing to rectify this issue.

I am using Solr 4.0, with DIH importing from a MySQL 5.5 DB. Any help will be much appreciated; I am trying to find any loophole in my schema and config files.
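One frequently suggested culprit for exactly this pattern: by default the MySQL JDBC driver buffers the entire result set in the client JVM's heap. DIH's JdbcDataSource can ask the driver to stream rows instead by setting batchSize="-1" (which it maps to fetchSize=Integer.MIN_VALUE on the MySQL driver), for example (driver/url values are placeholders):

<dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/mydb" batchSize="-1" readOnly="true" user="..." password="..."/>

If it is the importing JVM, rather than MySQL, that is holding the memory, this is the first thing to try.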
Re: Tokenized keywords
Romita,

That's exactly what is shown in the debugQuery output. If you can't find it there, paste the output here and let's try to find it together. Also pay attention to the explainOther debug parameter and the analysis page in the admin UI.

On 21.01.2013 10:50, Romita Saha romita.s...@sg.panasonic.com wrote:

What I am trying to achieve is as follows. I query "Search for all the Laptops" and my tokenized keywords are "search laptop" (I apply a stopword filter to filter out words like "for", "all", "the", and I also use a lowercase filter). I want to display these tokenized keywords using debugQuery. [snip]
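To make that concrete: a request along these lines (core and field names are placeholders)

http://localhost:8983/solr/collection1/select?q=Search%20for%20all%20the%20Laptops&debugQuery=true

returns a debug section whose parsedquery entry contains the analyzed terms, e.g. something like text:search text:laptop for the chain described above. Those are the tokenized keywords after the stopword and lowercase filters (and whatever stemmer reduces "laptops" to "laptop") have run.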