RE: out of memory during indexing due to large incoming queue
Solrconfig.xml - http://apaste.info/dsbv
Schema.xml - http://apaste.info/67PI

This solrconfig.xml file has optimization enabled. I had another file, which I can't locate at the moment, in which I defined a custom merge scheduler in order to disable optimization. When I say 1000 segments, I mean that's the number I saw in the Solr UI. I assume there were many more files than that.

Thanks,
Yoni

-----Original Message-----
From: Shawn Heisey [mailto:s...@elyograg.org]
Sent: Sunday, June 02, 2013 22:53
To: solr-user@lucene.apache.org
Subject: Re: out of memory during indexing due to large incoming queue

On 6/2/2013 12:25 PM, Yoni Amir wrote:

Hi Shawn and Shreejay, thanks for the response. Here is some more information:

1) The machine is a virtual machine on an ESX server. It has 4 CPUs and 8GB of RAM; I don't remember which CPU, but something modern enough. It is running Java 7 without any special parameters, with 4GB allocated for Java (-Xmx).
2) After successful indexing, I have 2.5 million documents and a 117GB index. This is the size after it was optimized.
3) I plan to upgrade to 4.3, I just didn't have time. 4.0 beta is what was available at the time we had a release deadline.
4) The setup is master-slave replication, not SolrCloud. The server that I am discussing is the indexing server; in these tests there were actually no slaves involved, and virtually zero searches were performed.
5) Attached is my configuration. I tried to disable the warm-up and opening of searchers; it didn't change anything. The commits are done by Solr, using autocommit. The client sends the updates without a commit command.
6) I want to disable optimization, but when I disabled it, the OOME occurred even faster. The number of segments reached around a thousand within an hour or so. I don't know if that's normal or not, but at that point, if I restarted Solr, it immediately took about 1GB of heap space just on start-up, instead of the usual 50MB or so.

If I commit less frequently, don't I increase the risk of losing data, e.g., if the power goes down? If I disable optimization, is it necessary to avoid such a large number of segments? Is it possible?

Last part first: losing data is much less of a risk with Solr 4.x, if you have enabled the updateLog.

We'll need some more info; see the end of the message for specifics. Right off the bat, I can tell you that with an index that's 117GB, you're going to need a LOT of RAM. Each of my 4.2.1 servers has 42GB of index and about 37 million documents between all the index shards. The web application never uses facets, which tend to use a lot of memory. My index is a lot smaller than yours, and I need a 6GB heap, seeing OOM errors if it's only 4GB. You probably need at least an 8GB heap, and possibly larger.

Beyond the amount of memory that Solr itself uses, for good performance you will also need a large amount of memory for OS disk caching. Unless the server is using SSD, you need to allocate at least 64GB of real memory to the virtual machine. If you've got your index on SSD, 32GB might be enough. I've got 64GB total on my servers.

http://wiki.apache.org/solr/SolrPerformanceProblems

When you say that there are over 1000 segments, are you seeing 1000 files, or are there literally 1000 segments, giving you between 12000 and 15000 files? Even if your mergeFactor were higher than the default 10, that just shouldn't happen.

Can you share your solrconfig.xml and schema.xml? Use a paste website like http://apaste.info and share the URLs.
Thanks,
Shawn
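(For reference, the two safety nets Shawn mentions, the updateLog and a timed hard commit, look roughly like this in a 4.x solrconfig.xml; the 60-second interval below is an illustrative placeholder, not a recommendation:)

    <updateHandler class="solr.DirectUpdateHandler2">
      <!-- transaction log: lets Solr replay uncommitted updates after a crash -->
      <updateLog>
        <str name="dir">${solr.ulog.dir:}</str>
      </updateLog>
      <!-- timed hard commit; openSearcher=false keeps the commit cheap
           because no new searcher (and no cache warming) is triggered -->
      <autoCommit>
        <maxTime>60000</maxTime>
        <openSearcher>false</openSearcher>
      </autoCommit>
    </updateHandler>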
Solr + Groovy
Hi, I have some query building and result processing code, which is currently running as normal Solr client outside of Solr. I think it would make a lot of sense to move parts of this code into a custom SearchHandler or SearchComponent. Because I'm not a big fan of the Java language, I would like to use Groovy. Searching the web I got the impression that Solr + alternative JVM languages is not a very common topic. So before starting my project, I would like to know: Is there a well known good reason not to use Groovy (or Clojure, Scala, ...) for implementing custom Solr code? kind regards, Achim
how are you handling killer queries?
How are you handling killer queries with Solr? While Solr/Lucene (currently 4.2.1) tries to do its best, I sometimes see stupid queries in my logs, recognizable by their extremely long query times. Example:

q=???+and+??+and+???+and++and+???+and+??

I even get hits for this (hits=34091309 status=0 QTime=88667). But the Jetty log says:

WARN:oejs.Response:Committed before 500 {msg=Datenübergabe unterbrochen (broken pipe),trace=org.eclipse.jetty.io.EofException... org.eclipse.jetty.http.HttpGenerator.flushBuffer(HttpGenerator.java:838)|?... 35 more|,code=500}
WARN:oejs.ServletHandler:/solr/base/select java.lang.IllegalStateException: Committed at org.eclipse.jetty.server.Response.resetBuffer(Response.java:1136)

Because I get hits and a QTime, the search was successful, right? But Jetty/HTTP has already closed the connection and Solr doesn't know about it? How are you handling killer queries, just ignoring them? Or is there something to tune (Jetty timeout config) or filter (query filtering)? Would be pleased to hear your comments.

Bernd
Re: Estimating the required volume to
Thanks for your answer. Can you please elaborate on "mssql text searching is pretty primitive compared to Solr" (a link or anything)? Thanks.

On Sun, Jun 2, 2013 at 4:54 PM, Erick Erickson erickerick...@gmail.com wrote:

1) Maybe, maybe not. mssql text searching is pretty primitive compared to Solr, just as Solr's db-like operations are primitive compared to mssql. They address different use-cases. So, you can store the docs in Solr and not touch your SQL db at all to return the docs. You can store just the IDs in Solr and retrieve your docs from the SQL store. You can store just enough data in Solr to display the results page, and when the user tries to drill down you can go to your SQL database for assembling the full document. You can... It all depends on your use case, data size, all that rot.

Very often, something like the DB is considered the system-of-record and it's indexed to Solr (see DIH or SolrJ) periodically. There is no underlying connection between your SQL store and Solr. You control when data is fetched from SQL and put into Solr. You control what the search experience is. Etc.

2) Not really :(. See: http://searchhub.org/dev/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

Best,
Erick

On Sat, Jun 1, 2013 at 1:07 PM, Mysurf Mail stammail...@gmail.com wrote:

Hi, I am just starting to learn about Solr. I want to test it in my environment, working with MS SQL Server. I have followed the tutorial and imported some rows into Solr. Now I have a few noob questions regarding the benefits of implementing Solr in a SQL environment.

1. As I understand it, when I send a query request over HTTP, I receive a result with IDs from the Solr system and then I query the full object row from the DB. Is that right? Is there a comparison with MS SQL full-text search, which retrieves the full object in the same select? Is there a comparison that relates to db/server clusters and multiple machines?
2. Is there a technique that will assist me in estimating the volume size I will need for the indexed data (obviously, based on the indexed data properties)?
Re: Removing a single value from a multiValue field
On Thu, May 30, 2013 at 5:01 PM, Jack Krupansky j...@basetechnology.com wrote:

You gave an XML example, so I assumed you were working with XML!

Right, I did give the output as XML. I find XML to be a great document markup language, but a terrible command format! Mostly due to the (mis-)use of attributes.

In JSON...

[{"id": "doc-id", "tags": {"add": ["a", "b"]}}]

and

[{"id": "doc-id", "tags": {"set": null}}]

Thank you! That is much more intuitive and less ambiguous than the XML, would you not agree?

BTW, this kind of stuff is covered in the book, separate chapters for XML and JSON, each with dozens of examples like this.

I have not posted on the book postings, but I will definitely order one. My vote is for spiral bound, though I know that the perfect-bound will look more professional on a bookshelf. I don't even care what the book costs, within reason. Any resource that compiles in a single package the wonderful methods that you and other contributors mention here and in other places online will pay for itself in short order. Apache Solr is an amazing product, but it is often obtuse and unintuitive. Other times one does not even know what Solr is capable of, as was the case in this thread, where I was parsing entire documents just to change the value of a multiValued field.

Thank you very much!

--
Dotan Cohen
http://gibberish.co.il
http://what-is-what.com
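(For reference, the JSON atomic-update syntax above can be posted straight to a 4.x update handler with curl; the URL, id, and field names here are only placeholders:)

    curl 'http://localhost:8983/solr/update?commit=true' \
      -H 'Content-type:application/json' \
      -d '[{"id": "doc-id", "tags": {"add": ["a", "b"]}}]'

Note that atomic updates need the uniqueKey field and the other fields to be stored, since Solr rebuilds the rest of the document from its stored values.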
/non/existent/dir/yields/warning
Hi, I am constantly getting this error in my Solr log:

Can't find (or read) directory to add to classloader: /non/existent/dir/yields/warning (resolved as: E:\Projects\apache_solr\solr-4.3.0\example\solr\genesis_experimental\non\existent\dir\yields\warning).

Anyone got any idea on how to solve this?

--
Regards,
Raheel Hasan
Re: /non/existent/dir/yields/warning
Hello!

You should remove that entry from your solrconfig.xml file. It is something like this:

    <lib dir="/non/existent/dir/yields/warning" />

--
Regards,
Rafał Kuć
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - ElasticSearch

Hi, I am constantly getting this error in my Solr log: Can't find (or read) directory to add to classloader: /non/existent/dir/yields/warning (resolved as: E:\Projects\apache_solr\solr-4.3.0\example\solr\genesis_experimental\non\existent\dir\yields\warning). Anyone got any idea on how to solve this?
Re: /non/existent/dir/yields/warning
ok thanks :) But why was it there anyway? I mean, it says in the comments: "If a 'dir' option (with or without a regex) is used and nothing is found that matches, a warning will be logged." So it looks like a kind of exception handling or logging for libs not found... so shouldn't this folder actually exist?

On Mon, Jun 3, 2013 at 2:06 PM, Rafał Kuć r@solr.pl wrote:

Hello! You should remove that entry from your solrconfig.xml file. It is something like this: <lib dir="/non/existent/dir/yields/warning" />

--
Regards,
Rafał Kuć
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - ElasticSearch

Hi, I am constantly getting this error in my Solr log: Can't find (or read) directory to add to classloader: /non/existent/dir/yields/warning (resolved as: E:\Projects\apache_solr\solr-4.3.0\example\solr\genesis_experimental\non\existent\dir\yields\warning). Anyone got any idea on how to solve this?

--
Regards,
Raheel Hasan
HostPort attribute of core tag in solr.xml
Hi, I am not quite sure what the hostPort attribute in the core tag of solr.xml means. Can someone please let me know? Thanks, Prathik
Constant score for more like this reference document
I call the mlt handler using a query which searches for a certain document (?q=id:some_document_id). The reference document is included in the result and the score is also returned. I found out that the score is fixed, independent of the document: for each document id I get the same score. The score varies between cores, but is fixed per core. I'm aware of all the warnings about scores not being absolute values and that you cannot compare them. But I wonder why the value is fixed per core. Is it just a random value, or is it possible to explain how it's calculated? I'm just digging into the code to get a better understanding of the inner workings, but I'm not yet deep enough. Feel free to point me to the relevant code snippets!

kind regards,
Achim
Re: /non/existent/dir/yields/warning
Hello!

That's a good question. I suppose it's there to show users how to set up a custom path to libraries.

--
Regards,
Rafał Kuć
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - ElasticSearch

ok thanks :) But why was it there anyway? I mean, it says in the comments: "If a 'dir' option (with or without a regex) is used and nothing is found that matches, a warning will be logged." So it looks like a kind of exception handling or logging for libs not found... so shouldn't this folder actually exist?

On Mon, Jun 3, 2013 at 2:06 PM, Rafał Kuć r@solr.pl wrote:

Hello! You should remove that entry from your solrconfig.xml file. It is something like this: <lib dir="/non/existent/dir/yields/warning" />

--
Regards,
Rafał Kuć
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - ElasticSearch

Hi, I am constantly getting this error in my Solr log: Can't find (or read) directory to add to classloader: /non/existent/dir/yields/warning (resolved as: E:\Projects\apache_solr\solr-4.3.0\example\solr\genesis_experimental\non\existent\dir\yields\warning). Anyone got any idea on how to solve this?
Re: How can a Tokenizer be CoreAware?
Benson, I think the idea is that Tokenizers are created as needed (from the TokenizerFactory), while those other objects are singular (one created for each corresponding stanza in solrconfig.xml). So Tokenizers should be short-lived; they'll be cleaned up after each use, and the assumption is you wouldn't need to do any cleanup yourself; rather, just let the garbage collector do its work -- assuming these are per-document resources. But if you have longer-lived resources, maybe you could manage them in the TokenizerFactory, which will be a singleton? Or in an UpdateRequestProcessorFactory, like you suggested.

-Mike

On 5/29/13 7:36 AM, Benson Margulies wrote:

I am currently testing some things with Solr 4.0.0. I tried to make a tokenizer CoreAware, and was rewarded with:

Caused by: org.apache.solr.common.SolrException: Invalid 'Aware' object: com.basistech.rlp.solr.RLPTokenizerFactory@19336006 -- org.apache.solr.util.plugin.SolrCoreAware must be an instance of:
[org.apache.solr.request.SolrRequestHandler]
[org.apache.solr.response.QueryResponseWriter]
[org.apache.solr.handler.component.SearchComponent]
[org.apache.solr.update.processor.UpdateRequestProcessorFactory]
[org.apache.solr.handler.component.ShardHandlerFactory]

I need this to allow cleanup of some cached items in the tokenizer. Questions:
1: will a newer version allow me to do this directly?
2: is there some other approach that anyone would recommend? I could, for example, make a fake object in the list above to act as a singleton with a static accessor, but that seems pretty ugly.
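(A rough Java sketch of the factory-owned-cache idea Mike describes, written against the Lucene/Solr 4.0 TokenizerFactory API, where create takes a Reader; the signature changed in later releases. MyResource and loadExpensiveResource are hypothetical stand-ins, not real Solr classes:)

    import java.io.Reader;
    import java.util.Map;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.core.WhitespaceTokenizer;
    import org.apache.lucene.analysis.util.TokenizerFactory;
    import org.apache.lucene.util.Version;

    public class CachingTokenizerFactory extends TokenizerFactory {
      // long-lived state lives here: one factory instance per field type
      private volatile MyResource cached;               // hypothetical resource type

      @Override
      public void init(Map<String, String> args) {
        super.init(args);
        // loaded once when the schema is initialized, not per document
        cached = loadExpensiveResource(args.get("resourcePath")); // hypothetical loader
      }

      @Override
      public Tokenizer create(Reader input) {
        // Tokenizers stay short-lived and merely borrow 'cached'; a real
        // implementation would return its own Tokenizer subclass built around it
        return new WhitespaceTokenizer(Version.LUCENE_40, input);
      }
    }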
Re: Solr + Groovy
On 6/3/13 3:07 AM, Achim Domma wrote: Hi, I have some query building and result processing code, which is currently running as normal Solr client outside of Solr. I think it would make a lot of sense to move parts of this code into a custom SearchHandler or SearchComponent. Because I'm not a big fan of the Java language, I would like to use Groovy. Searching the web I got the impression that Solr + alternative JVM languages is not a very common topic. So before starting my project, I would like to know: Is there a well known good reason not to use Groovy (or Clojure, Scala, ...) for implementing custom Solr code? kind regards, Achim Check out Paul Nelson's work, presented at Lucene Revolution 2013: http://www.lucenerevolution.org/sites/default/files/Advanced%20Query%20Parsing%20Techniques.pdf He reported success using Groovy embedded in Solr to generate queries -Mike
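(Whatever the JVM language, the surface area involved is small. Here is the bare Solr 4.x SearchComponent contract in Java for reference; a Groovy or Scala class on the classpath would override the same two methods. The class and response-section names are placeholders:)

    import java.io.IOException;
    import org.apache.solr.handler.component.ResponseBuilder;
    import org.apache.solr.handler.component.SearchComponent;

    public class MyComponent extends SearchComponent {
      @Override
      public void prepare(ResponseBuilder rb) throws IOException {
        // runs before the query executes: inspect or rewrite rb.req.getParams() here
      }

      @Override
      public void process(ResponseBuilder rb) throws IOException {
        // runs after the query executes: post-process results, add response sections
        rb.rsp.add("myComponent", "ok");
      }

      @Override
      public String getDescription() { return "custom query building / result processing"; }

      @Override
      public String getSource() { return ""; }
    }

It would then be registered in solrconfig.xml with <searchComponent name="mine" class="com.example.MyComponent"/> and listed in a handler's components.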
Re: Reindexing strategy
On Fri, May 31, 2013 at 3:57 AM, Michael Sokolov msoko...@safaribooksonline.com wrote:

On UNIX platforms, take a look at vmstat for basic I/O measurement, and iostat for more detailed stats. One coarse measurement is the number of blocked/waiting processes - usually this is due to I/O contention - and you will want to look at the paging and swapping numbers - you don't want any swapping at all. But the best single number to look at is overall disk activity, which is the I/O percentage utilized number Shaun was mentioning. -Mike

Great, thanks! I've got some terms to google. For those who follow in my footsteps, on Ubuntu the package 'sysstat' needs to be installed to use iostat. Here are my reference stats before starting to experiment, both for my own use later to compare, and also if anybody sees anything amiss here then I would love to know about it. If there is any fine manual that is particularly urgent that I should read, please do mention it. Thanks!

--
Dotan Cohen
http://gibberish.co.il
http://what-is-what.com
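(For anyone following along, the numbers Mike mentions come straight out of invocations like these; the 5-second interval is arbitrary:)

    vmstat 5       # 'b' column = blocked/waiting processes; si/so = swap in/out, ideally 0
    iostat -x 5    # extended per-device stats; '%util' is the disk-utilization figure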
SpatialRecursivePrefixTreeFieldType Spatial Searching
Hi, I'm seeing really slow query times, 7-25 seconds, when I run a simple filter query that uses my SpatialRecursivePrefixTreeFieldType field.

My index is about 30k documents. Prior to adding the spatial field, the on-disk size was about 100MB, so it's a really tiny index. Once I add the spatial field (which is multi-valued), the index size jumps up to 2GB. (Is this normal?) Only about 10k documents will have any spatial data. Typically, they will have at most 10 shapes each, but the majority are all one of two rectangles.

This is my fieldType definition:

    <fieldType name="date_availability" class="solr.SpatialRecursivePrefixTreeFieldType"
               geo="false" worldBounds="0 0 3650 1"
               distErrPct="0" maxDistErr="1" units="degrees" />

And the field:

    <field name="availability_spatial" type="date_availability" indexed="true" stored="false" multiValued="true" />

I am using the field to represent approximately 10 years after January 1st, 2013, where each day is along the X-axis. Because the availability starts and ends at 2pm and 10am, I was using a decimal place when creating my shape to show that detail. (Is this approach wrong?) So a typical rectangle when indexed would be (minX minY maxX maxY):

    Rectangle 100.6 0 120.4 1

Is it wrong that my Y and X values are not of the same scale? Since I don't care about the Y axis at all, I just set it to be of height 1 always.

I'm running Solr 4.3, with a small JVM of 768M (can be increased). And I have 2GB RAM. (Again, can be increased.)

Thanks
ContributorsGroup
Hi, Could you please add EmrahKara to ContributorsGroup in the Solr wiki?

--
Emrah Kara
Developer at CNT (http://www.cntbilisim.com.tr/)
Email / Gtalk: em...@cntbilisim.com.tr
Skype: rockipsiz
TEL: +90 232 3481851  GSM: +90 533 3634362  FAX: +90 232 3481861
283/14 Sk No 4 Ender Apt. D:4 Mansuroglu Mah. Bayrakli IZMIR TURKEY
www.tamindir.com
Re: ContributorsGroup
Done, looking forward to your contributions!

Erick

On Mon, Jun 3, 2013 at 7:22 AM, Emrah Kara em...@cntbilisim.com.tr wrote:

Hi, Could you please add EmrahKara to ContributorsGroup in the Solr wiki?
Re: SpatialRecursivePrefixTreeFieldType Spatial Searching
Also, here is a sample query and the debugQuery output:

    fq={!cost=200}*:* -availability_spatial:Intersects(182.6 0 199.4 1)

In case the formatting is bad, here is a raw paste of the debugQuery: http://pastie.org/pastes/872/text?key=ksjyboect4imrha0rck8sa

    <?xml version="1.0" encoding="UTF-8"?>
    <response>
      <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">8171</int>
        <lst name="params">
          <str name="debugQuery">true</str>
          <str name="indent">true</str>
          <str name="q">*:*</str>
          <str name="_">1370259235923</str>
          <str name="wt">xml</str>
          <str name="fq">{!cost=200}*:* -availability_spatial:Intersects(182.6 0 199.4 1)</str>
          <str name="rows">0</str>
        </lst>
      </lst>
      <result name="response" numFound="16137" start="0"/>
      <lst name="debug">
        <str name="rawquerystring">*:*</str>
        <str name="querystring">*:*</str>
        <str name="parsedquery">MatchAllDocsQuery(*:*)</str>
        <str name="parsedquery_toString">*:*</str>
        <lst name="explain"/>
        <str name="QParser">LuceneQParser</str>
        <arr name="filter_queries">
          <str>{!cost=200}*:* -availability_spatial:Intersects(182.6 0 199.4 1)</str>
        </arr>
        <arr name="parsed_filter_queries">
          <str>+MatchAllDocsQuery(*:*) -ConstantScore(org.apache.lucene.spatial.prefix.IntersectsPrefixTreeFilter@42ce603b)</str>
        </arr>
        <lst name="timing">
          <double name="time">8171.0</double>
          <lst name="prepare">
            <double name="time">1.0</double>
            <lst name="query"><double name="time">0.0</double></lst>
            <lst name="facet"><double name="time">0.0</double></lst>
            <lst name="mlt"><double name="time">1.0</double></lst>
            <lst name="highlight"><double name="time">0.0</double></lst>
            <lst name="stats"><double name="time">0.0</double></lst>
            <lst name="debug"><double name="time">0.0</double></lst>
          </lst>
          <lst name="process">
            <double name="time">8170.0</double>
            <lst name="query"><double name="time">8170.0</double></lst>
            <lst name="facet"><double name="time">0.0</double></lst>
            <lst name="mlt"><double name="time">0.0</double></lst>
            <lst name="highlight"><double name="time">0.0</double></lst>
            <lst name="stats"><double name="time">0.0</double></lst>
            <lst name="debug"><double name="time">0.0</double></lst>
          </lst>
        </lst>
      </lst>
    </response>

On Mon, Jun 3, 2013 at 12:27 PM, Chris Atkinson chrisa...@gmail.com wrote:

Hi, I'm seeing really slow query times, 7-25 seconds, when I run a simple filter query that uses my SpatialRecursivePrefixTreeFieldType field. My index is about 30k documents. Prior to adding the spatial field, the on-disk size was about 100MB, so it's a really tiny index. Once I add the spatial field (which is multi-valued), the index size jumps up to 2GB. (Is this normal?) Only about 10k documents will have any spatial data. Typically, they will have at most 10 shapes each, but the majority are all one of two rectangles. This is my fieldType definition:

    <fieldType name="date_availability" class="solr.SpatialRecursivePrefixTreeFieldType"
               geo="false" worldBounds="0 0 3650 1"
               distErrPct="0" maxDistErr="1" units="degrees" />

And the field:

    <field name="availability_spatial" type="date_availability" indexed="true" stored="false" multiValued="true" />

I am using the field to represent approximately 10 years after January 1st, 2013, where each day is along the X-axis. Because the availability starts and ends at 2pm and 10am, I was using a decimal place when creating my shape to show that detail. (Is this approach wrong?) So a typical rectangle when indexed would be (minX minY maxX maxY): Rectangle 100.6 0 120.4 1. Is it wrong that my Y and X values are not of the same scale? Since I don't care about the Y axis at all, I just set it to be of height 1 always. I'm running Solr 4.3, with a small JVM of 768M (can be increased). And I have 2GB RAM. (Again, can be increased.) Thanks
Re: Estimating the required volume to
Here's a link to various transformations you can do while indexing and searching in Solr: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

Consider:
- stemming
- ngrams
- WordDelimiterFilterFactory
- ASCIIFoldingFilterFactory
- phrase queries
- boosting
- synonyms
- blah blah blah

You can't do a lot of these transformations, at least not easily, in SQL. OTOH, you can't do 5-way joins in Solr. Different problems, different tools.

All that said, there's no good reason to use Solr if your use-case is satisfied by simple keyword searches that have no transformations; mysql etc. work just fine in those cases. It's all about selecting the right tool for the use-case.

FWIW,
Erick

On Mon, Jun 3, 2013 at 4:44 AM, Mysurf Mail stammail...@gmail.com wrote:

Thanks for your answer. Can you please elaborate on "mssql text searching is pretty primitive compared to Solr" (a link or anything)? Thanks.

On Sun, Jun 2, 2013 at 4:54 PM, Erick Erickson erickerick...@gmail.com wrote:

1) Maybe, maybe not. mssql text searching is pretty primitive compared to Solr, just as Solr's db-like operations are primitive compared to mssql. They address different use-cases. So, you can store the docs in Solr and not touch your SQL db at all to return the docs. You can store just the IDs in Solr and retrieve your docs from the SQL store. You can store just enough data in Solr to display the results page, and when the user tries to drill down you can go to your SQL database for assembling the full document. You can... It all depends on your use case, data size, all that rot. Very often, something like the DB is considered the system-of-record and it's indexed to Solr (see DIH or SolrJ) periodically. There is no underlying connection between your SQL store and Solr. You control when data is fetched from SQL and put into Solr. You control what the search experience is. Etc.

2) Not really :(. See: http://searchhub.org/dev/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

Best,
Erick

On Sat, Jun 1, 2013 at 1:07 PM, Mysurf Mail stammail...@gmail.com wrote:

Hi, I am just starting to learn about Solr. I want to test it in my environment, working with MS SQL Server. I have followed the tutorial and imported some rows into Solr. Now I have a few noob questions regarding the benefits of implementing Solr in a SQL environment.

1. As I understand it, when I send a query request over HTTP, I receive a result with IDs from the Solr system and then I query the full object row from the DB. Is that right? Is there a comparison with MS SQL full-text search, which retrieves the full object in the same select? Is there a comparison that relates to db/server clusters and multiple machines?
2. Is there a technique that will assist me in estimating the volume size I will need for the indexed data (obviously, based on the indexed data properties)?
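(A rough SolrJ 4.x sketch of the middle option Erick lists, keeping only IDs in Solr and fetching full rows from SQL Server; the core name, query, field names, table, and JDBC URL are all placeholders:)

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrDocument;

    public class SearchThenFetch {
      public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
        SolrQuery q = new SolrQuery("body:court");
        q.setFields("id");   // Solr returns only the key...
        q.setRows(20);

        Connection db = DriverManager.getConnection(
            "jdbc:sqlserver://dbhost;databaseName=docs", "user", "password");
        PreparedStatement ps = db.prepareStatement("SELECT * FROM documents WHERE id = ?");

        for (SolrDocument doc : solr.query(q).getResults()) {
          ps.setString(1, (String) doc.getFieldValue("id"));
          ResultSet row = ps.executeQuery();   // ...and the system-of-record supplies the document
          // render row here
          row.close();
        }
        ps.close();
        db.close();
        solr.shutdown();
      }
    }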
Re: /non/existent/dir/yields/warning
Hi, but the path looks like it shows how to set up a "non existent lib" warning... :D

On Mon, Jun 3, 2013 at 2:56 PM, Rafał Kuć r@solr.pl wrote:

Hello! That's a good question. I suppose it's there to show users how to set up a custom path to libraries.

--
Regards,
Rafał Kuć
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - ElasticSearch

ok thanks :) But why was it there anyway? I mean, it says in the comments: "If a 'dir' option (with or without a regex) is used and nothing is found that matches, a warning will be logged." So it looks like a kind of exception handling or logging for libs not found... so shouldn't this folder actually exist?

On Mon, Jun 3, 2013 at 2:06 PM, Rafał Kuć r@solr.pl wrote:

Hello! You should remove that entry from your solrconfig.xml file. It is something like this: <lib dir="/non/existent/dir/yields/warning" />

--
Regards,
Rafał Kuć
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - ElasticSearch

Hi, I am constantly getting this error in my Solr log: Can't find (or read) directory to add to classloader: /non/existent/dir/yields/warning (resolved as: E:\Projects\apache_solr\solr-4.3.0\example\solr\genesis_experimental\non\existent\dir\yields\warning). Anyone got any idea on how to solve this?

--
Regards,
Raheel Hasan
Re: FieldCache insanity with field used as facet and group
I'm reproducing the problem with the 4.2.1 example with 2 shards.

1) Started up the Solr shards, indexed the example data, and confirmed empty fieldCaches:

    [sanniere@funlevel-dx example]$ java -Dbootstrap_confdir=./solr/collection1/conf -Dcollection.configName=myconf -DzkRun -DnumShards=2 -jar start.jar
    [sanniere@funlevel-dx example2]$ java -Djetty.port=7574 -DzkHost=localhost:9983 -jar start.jar

2) Used both grouping and faceting on the popularity field, then checked the fieldCache insanity count:

    [sanniere@funlevel-dx example]$ curl -sS "http://localhost:8983/solr/select?q=*:*&group=true&group.field=popularity" > /dev/null
    [sanniere@funlevel-dx example]$ curl -sS "http://localhost:8983/solr/select?q=*:*&facet=true&facet.field=popularity" > /dev/null
    [sanniere@funlevel-dx example]$ curl -sS "http://localhost:8983/solr/admin/mbeans?stats=true&key=fieldCache&wt=json&indent=true" | grep -E "entries_count|insanity_count"

    entries_count: 10,
    insanity_count: 2,
    insanity#0: VALUEMISMATCH: Multiple distinct value objects for SegmentCoreReader(owner=_g(4.2.1):C1)+popularity\n\t'SegmentCoreReader(owner=_g(4.2.1):C1)'='popularity',class org.apache.lucene.index.SortedDocValues,0.5=org.apache.lucene.search.FieldCacheImpl$SortedDocValuesImpl#12129794\n\t'SegmentCoreReader(owner=_g(4.2.1):C1)'='popularity',int,null=org.apache.lucene.search.FieldCacheImpl$IntsFromArray#12298774\n\t'SegmentCoreReader(owner=_g(4.2.1):C1)'='popularity',int,org.apache.lucene.search.FieldCache.NUMERIC_UTILS_INT_PARSER=org.apache.lucene.search.FieldCacheImpl$IntsFromArray#12298774\n,
    insanity#1: VALUEMISMATCH: Multiple distinct value objects for SegmentCoreReader(owner=_f(4.2.1):C9)+popularity\n\t'SegmentCoreReader(owner=_f(4.2.1):C9)'='popularity',int,org.apache.lucene.search.FieldCache.NUMERIC_UTILS_INT_PARSER=org.apache.lucene.search.FieldCacheImpl$IntsFromArray#16648315\n\t'SegmentCoreReader(owner=_f(4.2.1):C9)'='popularity',int,null=org.apache.lucene.search.FieldCacheImpl$IntsFromArray#16648315\n\t'SegmentCoreReader(owner=_f(4.2.1):C9)'='popularity',class org.apache.lucene.index.SortedDocValues,0.5=org.apache.lucene.search.FieldCacheImpl$SortedDocValuesImpl#1130715\n}}}, HIGHLIGHTING,{}, OTHER,{}]}

I've updated https://issues.apache.org/jira/browse/SOLR-4866

Elodie

On 28.05.2013 10:22, Elodie Sannier wrote:

I've created https://issues.apache.org/jira/browse/SOLR-4866

Elodie

On 07.05.2013 18:19, Chris Hostetter wrote:

: I am using the Lucene FieldCache with SolrCloud and I have insane instances
: with messages like:

FWIW: I'm the one that named the result of these sanity checks "FieldCacheInsanity" and I have regretted it ever since -- a better label would have been "inconsistency".

: VALUEMISMATCH: Multiple distinct value objects for
: SegmentCoreReader(owner=_11i(4.2.1):C4493997/853637)+merchantid
: 'SegmentCoreReader(owner=_11i(4.2.1):C4493997/853637)'='merchantid',class org.apache.lucene.index.SortedDocValues,0.5=org.apache.lucene.search.FieldCacheImpl$SortedDocValuesImpl#557711353
: 'SegmentCoreReader(owner=_11i(4.2.1):C4493997/853637)'='merchantid',int,null=org.apache.lucene.search.FieldCacheImpl$IntsFromArray#1105988713
: 'SegmentCoreReader(owner=_11i(4.2.1):C4493997/853637)'='merchantid',int,org.apache.lucene.search.FieldCache.NUMERIC_UTILS_INT_PARSER=org.apache.lucene.search.FieldCacheImpl$IntsFromArray#1105988713
:
: All insane instances are for a field merchantid of type int used as facet
: and group field.
Interesting: it appears that the grouping code and the facet code are not being consistent in how they are building the field cache, so you are getting two objects in the cache for each segment.

I haven't checked if this happens much with the example configs, but if you could: please file a bug with the details of which Solr version you are using, along with the schema fieldType and field declarations for your merchantid field, along with the mbean stats output showing the field cache insanity after executing two queries like...

    /select?q=*:*&facet=true&facet.field=merchantid
    /select?q=*:*&group=true&group.field=merchantid

(that way we can rule out your custom SearchComponent as having a bug in it)

: This insanity can have performance impact ?
: How can I fix it ?

The impact is just that more RAM is being used than is strictly necessary. Unless there is something unusual in your fieldType declaration, I don't think there is an easy fix you can apply -- we need to fix the underlying code.

-Hoss

--
Elodie Sannier, Software engineer, Kelkoo
elodie.sann...@kelkoo.fr
Multitable import - uniqueKey
Hi, I am importing multiple tables (by join) into Solr using DIH. All is set, except for one confusion: what to do with the uniqueKey in the schema? When I had only 1 table, it was fine. Now how do I put 2 uniqueKeys (each from a different table)? For example:

    <uniqueKey>table1_id</uniqueKey>
    <uniqueKey>table2_id</uniqueKey>

Will this work?

--
Regards,
Raheel Hasan
Re: Estimating the required volume to
Hi, Thanks for your answer. I want to refer to your message, because I am trying to choose the right tool.

1. Regarding stemming: I am running in ms-sql

    SELECT * FROM sys.dm_fts_parser ('FORMSOF(INFLECTIONAL,provide)', 1033, 0, 0)

and I receive:

    group_id  phrase_id  occurrence  special_term  display_term  expansion_type  source_term
    1         0          1           Exact Match   provided      2               provide
    1         0          1           Exact Match   provides      2               provide
    1         0          1           Exact Match   providing     2               provide
    1         0          1           Exact Match   provide       0               provide

Isn't that stemming?

2. Regarding synonyms: SQL Server has a full thesaurus feature (http://msdn.microsoft.com/en-us/library/ms142491.aspx). Doesn't that mean synonyms?

On Mon, Jun 3, 2013 at 2:43 PM, Erick Erickson erickerick...@gmail.com wrote:

Here's a link to various transformations you can do while indexing and searching in Solr: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters Consider stemming, ngrams, WordDelimiterFilterFactory, ASCIIFoldingFilterFactory, phrase queries, boosting, synonyms, blah blah blah. You can't do a lot of these transformations, at least not easily, in SQL. OTOH, you can't do 5-way joins in Solr. Different problems, different tools. All that said, there's no good reason to use Solr if your use-case is satisfied by simple keyword searches that have no transformations; mysql etc. work just fine in those cases. It's all about selecting the right tool for the use-case. FWIW, Erick

On Mon, Jun 3, 2013 at 4:44 AM, Mysurf Mail stammail...@gmail.com wrote:

Thanks for your answer. Can you please elaborate on "mssql text searching is pretty primitive compared to Solr" (a link or anything)? Thanks.

On Sun, Jun 2, 2013 at 4:54 PM, Erick Erickson erickerick...@gmail.com wrote:

1) Maybe, maybe not. mssql text searching is pretty primitive compared to Solr, just as Solr's db-like operations are primitive compared to mssql. They address different use-cases. So, you can store the docs in Solr and not touch your SQL db at all to return the docs. You can store just the IDs in Solr and retrieve your docs from the SQL store. You can store just enough data in Solr to display the results page, and when the user tries to drill down you can go to your SQL database for assembling the full document. You can... It all depends on your use case, data size, all that rot. Very often, something like the DB is considered the system-of-record and it's indexed to Solr (see DIH or SolrJ) periodically. There is no underlying connection between your SQL store and Solr. You control when data is fetched from SQL and put into Solr. You control what the search experience is. Etc.

2) Not really :(. See: http://searchhub.org/dev/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

Best,
Erick

On Sat, Jun 1, 2013 at 1:07 PM, Mysurf Mail stammail...@gmail.com wrote:

Hi, I am just starting to learn about Solr. I want to test it in my environment, working with MS SQL Server. I have followed the tutorial and imported some rows into Solr. Now I have a few noob questions regarding the benefits of implementing Solr in a SQL environment.

1. As I understand it, when I send a query request over HTTP, I receive a result with IDs from the Solr system and then I query the full object row from the DB. Is that right? Is there a comparison with MS SQL full-text search, which retrieves the full object in the same select? Is there a comparison that relates to db/server clusters and multiple machines?
2. Is there a technique that will assist me in estimating the volume size I will need for the indexed data (obviously, based on the indexed data properties)?
Re: how are you handling killer queries?
On 6/3/2013 2:39 AM, Bernd Fehling wrote: How are you handling killer queries with solr? While solr/lucene (currently 4.2.1) is trying to do its best I see sometimes stupid queries in my logs, located with extremly long query time. Example: q=???+and+??+and+???+and++and+???+and+?? I even get hits for this (hits=34091309 status=0 QTime=88667). But the jetty log says: WARN:oejs.Response:Committed before 500 {msg=Datenübergabe unterbrochen (broken pipe),trace=org.eclipse.jetty.io.EofException... org.eclipse.jetty.http.HttpGenerator.flushBuffer(HttpGenerator.java:838)|?... 35 more|,code=500} WARN:oejs.ServletHandler:/solr/base/select java.lang.IllegalStateException: Committed at org.eclipse.jetty.server.Response.resetBuffer(Response.java:1136) Because I get hits and qtime the search is successful, right? But jetty/http has already closed the connection and solr doesn't know about this? How are you handling killer queries, just ignoring? Or something to tune (jetty config about timeout) or filter (query filtering)? As you might know, EofException happens when one end (usually the client) closes the TCP connection before the response is delivered. This is usually caused by explicitly setting timeouts, or by using a load balancer in front of Solr, because these will normally limit how long the response can take. The timeout involved is probably 60 seconds in this case, and the query took nearly 90 seconds. It doesn't cause any *direct* problems for Solr, though the nasty exception that gets logged every time is annoying. A query like that does use a lot of resources, so if the server doesn't have a lot of spare capacity, it can cause problems for everyone else. Assuming that this isn't happening due to bugs in your application, the only way to really handle this problem is to first locate the problem user and educate them. If the problem continues and it's a viable option, you might need to ban that user from your system. Thanks, Shawn
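(On Bernd's "something to tune" question: in the Jetty that ships with the Solr 4.x example, the relevant knob is the connector's maxIdleTime in etc/jetty.xml. The exact connector class and default value vary by release, so the snippet below is indicative only:)

    <Call name="addConnector">
      <Arg>
        <New class="org.eclipse.jetty.server.bio.SocketConnector">
          <Set name="port"><SystemProperty name="jetty.port" default="8983"/></Set>
          <!-- milliseconds an idle connection may sit before Jetty drops it -->
          <Set name="maxIdleTime">120000</Set>
        </New>
      </Arg>
    </Call>

Raising it only stops Jetty from closing the socket early; the expensive query itself keeps running, so query-side limits are still worth pairing with it.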
Re: HostPort attribute of core tag in solr.xml
On 6/3/2013 3:16 AM, Prathik Puthran wrote: I am not very sure what the hostPort attribute in core tag of solr.xml mean. Can someone please let me know? This only has meaning if you are using SolrCloud. This is how each Solr server in the cloud informs the cloud what port it is using. http://wiki.apache.org/solr/SolrCloud#SolrCloud_Instance_Params Thanks, Shawn
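(For reference, hostPort actually sits on the enclosing cores element rather than on an individual core, as in a stock 4.x solr.xml; by default it picks up the jetty.port system property:)

    <solr>
      <cores adminPath="/cores" host="${host:}" hostPort="${jetty.port:}" hostContext="${hostContext:}">
        <core name="collection1" instanceDir="collection1"/>
      </cores>
    </solr>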
Re: /non/existent/dir/yields/warning
On 6/3/2013 5:58 AM, Raheel Hasan wrote: but the path looks like it shows how to setup non existent lib warning... :D The reason for its existence is encoded in its name. A nonexistent path results in a warning. It's a way to illustrate to a novice what happens when you have a non-fatal misconfiguration. The message is a warning and doesn't prevent Solr startup. Thanks, Shawn
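(For contrast, the same mechanism pointed at directories that do exist is how the stock solrconfig.xml loads contrib jars; dir is resolved relative to the core's instanceDir:)

    <lib dir="../../contrib/extraction/lib" regex=".*\.jar" />
    <lib dir="../../dist/" regex="solr-cell-\d.*\.jar" />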
Can mm (min-match) be specified by field in dismax or edismax?
I would like to have the min-match set differently for different fields in my dismax handler. Is this possible?
Re: how are you handling killer queries?
Hi Shawn, well, the user is the world and the servers have enough capacity, so it's nothing really to worry about. OK, I could raise the timeout from the standard 60 to 90, 120 or even 180 seconds. I just wanted to know how other Solr developers handle this.

The technical question: where is the difference between hitting the stop button in the browser while a search is running, and the timeout of the HTTP connection in my container (in my case Jetty)? I guess the stop button in the browser will inform all parts involved, whereas the timeout just leaves an open end somewhere in the container (broken pipe)? And the container has no way to simulate a browser stop button in case of a timeout, to get a sane termination?

Bernd

On 03.06.2013 16:20, Shawn Heisey wrote:

On 6/3/2013 2:39 AM, Bernd Fehling wrote: How are you handling killer queries with Solr? While Solr/Lucene (currently 4.2.1) tries to do its best, I sometimes see stupid queries in my logs, recognizable by their extremely long query times. Example: q=???+and+??+and+???+and++and+???+and+?? I even get hits for this (hits=34091309 status=0 QTime=88667). But the Jetty log says: WARN:oejs.Response:Committed before 500 {msg=Datenübergabe unterbrochen (broken pipe),trace=org.eclipse.jetty.io.EofException... org.eclipse.jetty.http.HttpGenerator.flushBuffer(HttpGenerator.java:838)|?... 35 more|,code=500} WARN:oejs.ServletHandler:/solr/base/select java.lang.IllegalStateException: Committed at org.eclipse.jetty.server.Response.resetBuffer(Response.java:1136) Because I get hits and a QTime, the search was successful, right? But Jetty/HTTP has already closed the connection and Solr doesn't know about it? How are you handling killer queries, just ignoring them? Or is there something to tune (Jetty timeout config) or filter (query filtering)?

As you might know, EofException happens when one end (usually the client) closes the TCP connection before the response is delivered. This is usually caused by explicitly setting timeouts, or by using a load balancer in front of Solr, because these will normally limit how long the response can take. The timeout involved is probably 60 seconds in this case, and the query took nearly 90 seconds. It doesn't cause any *direct* problems for Solr, though the nasty exception that gets logged every time is annoying. A query like that does use a lot of resources, so if the server doesn't have a lot of spare capacity, it can cause problems for everyone else. Assuming that this isn't happening due to bugs in your application, the only way to really handle this problem is to first locate the problem user and educate them. If the problem continues and it's a viable option, you might need to ban that user from your system.

Thanks,
Shawn
Re: Multitable import - uniqueKey
Hi, Thanks for the replies. Actually, I had only a small confusion: from table_1 I got key_1; using this I join into table_2. But table_2 also gave another key, key_2, which is needed for joining with table_3. So for table_1 and table_2 it's obviously just fine... but what will happen when table_3 is also added? Will the 3 tables stay intact in terms of relationships?

Thanks.

On Mon, Jun 3, 2013 at 7:33 PM, Jack Krupansky j...@basetechnology.com wrote:

If the respective table IDs are not globally unique, then you (the developer) will have to supplement the raw ID with a prefix or suffix or other form of global ID (e.g., UUID) to assure that they are unique. You could just add the SQL table name as a prefix or suffix. The bottom line: What do you WANT the Solr key field to look like? I mean, YOU are the data architect, right? What requirements do you have? When your Solr application users receive the key values in the responses to queries, what expectations do you expect to set for them?

-- Jack Krupansky

-----Original Message-----
From: Raheel Hasan
Sent: Monday, June 03, 2013 9:12 AM
To: solr-user@lucene.apache.org
Subject: Multitable import - uniqueKey

Hi, I am importing multiple tables (by join) into Solr using DIH. All is set, except for one confusion: what to do with the uniqueKey in the schema? When I had only 1 table, it was fine. Now how do I put 2 uniqueKeys (each from a different table)? For example:

    <uniqueKey>table1_id</uniqueKey>
    <uniqueKey>table2_id</uniqueKey>

Will this work?

--
Regards,
Raheel Hasan
Re: /non/existent/dir/yields/warning
ok fantastic... now I will comment it out to be sure. Thanks a lot.

Regards,
Raheel

On Mon, Jun 3, 2013 at 7:27 PM, Shawn Heisey s...@elyograg.org wrote:

On 6/3/2013 5:58 AM, Raheel Hasan wrote: but the path looks like it shows how to set up a "non existent lib" warning... :D

The reason for its existence is encoded in its name. A nonexistent path results in a warning. It's a way to illustrate to a novice what happens when you have a non-fatal misconfiguration. The message is a warning and doesn't prevent Solr startup.

Thanks,
Shawn

--
Regards,
Raheel Hasan
Re: how are you handling killer queries?
On 6/3/2013 8:43 AM, Bernd Fehling wrote: Hi Shawn, well, the user is the world and the servers have enough capacity. So its nothing really to worry about. OK, could raise timeout from standard 60 to 90, 120 or even 180 seconds. Just wanted to know how other solr developer handle this. The technical question, where is the difference between hitting the stop button from the browser while a search is running and the timeout of http connection in my container (in my case jetty)? I guess the stop button from the browser will inform all parts involved whereas the timeout just leaves an open end somewhere in the container (broken pipe)? And the container has no way to simulate a browser stop button in case of a timeout to get a sane termination? The result is probably the same, no matter how the connection gets closed. I've seen it mostly from my load balancer, and most often with the layer 7 check that uses my ping handler. It has a timeout of 5 seconds, and occasionally (usually due to garbage collection pauses) the query will take longer than 5 seconds. The load balancer closes the connection with a TCP reset, which is a perfectly valid (and very fast) way to close a TCP connection. The exception isn't coming from unclean closes, it's coming from ANY close. I think that Solr shouldn't log a full stacktrace when this happens, but I'm not sure whether Solr has any control over it, because the exception comes from Jetty. Thanks, Shawn
Re: Multitable import - uniqueKey
Same answer. Whether it is 2, 3, 10 or 1000 tables, you, the data architect, must decide how to uniquely identify Solr documents. In general, when joining n tables, combine the n keys into one composite key. Either do it on the SQL query side, or with a Solr update request processor.

-- Jack Krupansky

-----Original Message-----
From: Raheel Hasan
Sent: Monday, June 03, 2013 10:44 AM
To: solr-user@lucene.apache.org
Subject: Re: Multitable import - uniqueKey

Hi, Thanks for the replies. Actually, I had only a small confusion: from table_1 I got key_1; using this I join into table_2. But table_2 also gave another key, key_2, which is needed for joining with table_3. So for table_1 and table_2 it's obviously just fine... but what will happen when table_3 is also added? Will the 3 tables stay intact in terms of relationships?

Thanks.

On Mon, Jun 3, 2013 at 7:33 PM, Jack Krupansky j...@basetechnology.com wrote:

If the respective table IDs are not globally unique, then you (the developer) will have to supplement the raw ID with a prefix or suffix or other form of global ID (e.g., UUID) to assure that they are unique. You could just add the SQL table name as a prefix or suffix. The bottom line: What do you WANT the Solr key field to look like? I mean, YOU are the data architect, right? What requirements do you have? When your Solr application users receive the key values in the responses to queries, what expectations do you expect to set for them?

-- Jack Krupansky

-----Original Message-----
From: Raheel Hasan
Sent: Monday, June 03, 2013 9:12 AM
To: solr-user@lucene.apache.org
Subject: Multitable import - uniqueKey

Hi, I am importing multiple tables (by join) into Solr using DIH. All is set, except for one confusion: what to do with the uniqueKey in the schema? When I had only 1 table, it was fine. Now how do I put 2 uniqueKeys (each from a different table)? For example:

    <uniqueKey>table1_id</uniqueKey>
    <uniqueKey>table2_id</uniqueKey>

Will this work?

--
Regards,
Raheel Hasan
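(If the combining is done on the DIH side, one option is the stock TemplateTransformer; everything below except the transformer itself, the entity name, columns, and SQL, is a placeholder:)

    <entity name="t1" transformer="TemplateTransformer"
            query="SELECT t1.table1_id, t2.table2_id, t2.title
                   FROM table1 t1 JOIN table2 t2 ON t2.t1_id = t1.table1_id">
      <!-- composite, globally unique key built from both table keys -->
      <field column="solr_id" template="t1-${t1.table1_id}-t2-${t1.table2_id}"/>
    </entity>

with <uniqueKey>solr_id</uniqueKey> declared once in schema.xml.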
RE: Spell Checker (DirectSolrSpellChecker) correct settings
My first guess is that no documents match the query "provincial court". Because you have spellcheck.maxCollationTries set to a non-zero value, it will not return these as collations unless the correction would return hits. You can test my theory by removing spellcheck.maxCollationTries from the request and seeing if it returns "provincial court" as expected. If this isn't it, then give us the full query request and also the full spellcheck response for your failing case.

James Dyer
Ingram Content Group
(615) 213-4311

-----Original Message-----
From: Raheel Hasan [mailto:raheelhasan@gmail.com]
Sent: Friday, May 31, 2013 9:38 AM
To: solr-user@lucene.apache.org
Subject: Spell Checker (DirectSolrSpellChecker) correct settings

Hi guys, I am new to Solr. Here is the thing I have: when I search "Courtt", I get a correct suggestion saying:

    "spellcheck": {
      "suggestions": [
        "courtt", {
          "numFound": 1,
          "startOffset": 0,
          "endOffset": 6,
          "suggestion": ["court"]
        },
        "collation", [
          "collationQuery", "court",
          "hits", 53,
          "misspellingsAndCorrections", ["courtt", "court"]
        ]
      ]
    },

But when I try "Provincial Courtt", it gives me no suggestions; instead it searches for "Provincial" only. Here are the spell check settings in solrconfig.xml:

    <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
      <str name="queryAnalyzerFieldType">text_en_splitting</str>
      <!-- a spellchecker built from a field of the main index -->
      <lst name="spellchecker">
        <str name="name">default</str>
        <str name="classname">solr.DirectSolrSpellChecker</str>
        <str name="field">text</str>
        <!-- minimum accuracy needed to be considered a valid spellcheck suggestion -->
        <float name="accuracy">0.5</float>
        <!-- Require terms to occur in 1% of documents in order to be included in the dictionary -->
        <float name="thresholdTokenFrequency">.01</float>
        <!-- the spellcheck distance measure used, the default is the internal levenshtein -->
        <!--<str name="distanceMeasure">internal</str>-->
        <!-- the maximum #edits we consider when enumerating terms: can be 1 or 2 -->
        <int name="maxEdits">1</int>
        <!-- the minimum number of characters the terms should share -->
        <int name="minPrefix">3</int>
        <!-- maximum number of possible matches to review before returning results -->
        <int name="maxInspections">3</int>
        <!-- minimum length of a query term to be considered for correction -->
        <int name="minQueryLength">4</int>
        <!-- maximum threshold of documents a query term can appear in to be considered for correction -->
        <float name="maxQueryFrequency">0.01</float>
      </lst>
      <!-- a spellchecker that can break or combine words. See "/spell" handler below for usage -->
      <lst name="spellchecker">
        <str name="name">wordbreak</str>
        <str name="classname">solr.WordBreakSolrSpellChecker</str>
        <str name="field">text</str>
        <str name="combineWords">true</str>
        <str name="breakWords">true</str>
        <int name="maxChanges">5</int>
      </lst>
    </searchComponent>

Here is the requestHandler:

    <requestHandler name="/select" class="solr.SearchHandler">
      <lst name="defaults">
        <str name="echoParams">explicit</str>
        <int name="rows">20</int>
        <str name="df">text</str>
        <!-- Spell checking defaults -->
        <str name="spellcheck">on</str>
        <str name="spellcheck.count">5</str>
        <str name="spellcheck.onlyMorePopular">true</str>
        <str name="spellcheck.maxResultsForSuggest">5</str>
        <str name="spellcheck.alternativeTermCount">2</str>
        <str name="spellcheck.extendedResults">false</str>
        <str name="spellcheck.collate">true</str>
        <str name="spellcheck.maxCollations">3</str>
        <str name="spellcheck.maxCollationTries">3</str>
        <str name="spellcheck.collateExtendedResults">true</str>
      </lst>
      <!-- append spellchecking to our list of components -->
      <arr name="last-components">
        <str>spellcheck</str>
      </arr>
    </requestHandler>

--
Regards,
Raheel Hasan
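(A quick way to run James's test without touching solrconfig.xml is to override the handler default on the request itself, since per-request spellcheck parameters take precedence over the defaults; setting maxCollationTries to 0 disables the verification step:)

    http://localhost:8983/solr/select?q=provincial+courtt&spellcheck=true&spellcheck.collate=true&spellcheck.maxCollationTries=0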
Re: how are you handling killer queries?
There are two radically distinct use cases: 1. Consumers on the open Internet. They do stupid things. Give them a very constrained search experience, enforced with query preprocessing. Maybe give them only dismax queries. 2. Professional power users. They typically have credentials for using the application, so if they are detected as performing long or stupid queries, log the details and administratively take action, such as denying them access (or billing them for excessive resource usage.) -- Jack Krupansky -Original Message- From: Bernd Fehling Sent: Monday, June 03, 2013 4:39 AM To: solr-user@lucene.apache.org Subject: how are you handling killer queries? How are you handling killer queries with solr? While solr/lucene (currently 4.2.1) is trying to do its best I see sometimes stupid queries in my logs, located with extremly long query time. Example: q=???+and+??+and+???+and++and+???+and+?? I even get hits for this (hits=34091309 status=0 QTime=88667). But the jetty log says: WARN:oejs.Response:Committed before 500 {msg=Datenübergabe unterbrochen (broken pipe),trace=org.eclipse.jetty.io.EofException... org.eclipse.jetty.http.HttpGenerator.flushBuffer(HttpGenerator.java:838)|?... 35 more|,code=500} WARN:oejs.ServletHandler:/solr/base/select java.lang.IllegalStateException: Committed at org.eclipse.jetty.server.Response.resetBuffer(Response.java:1136) Because I get hits and qtime the search is successful, right? But jetty/http has already closed the connection and solr doesn't know about this? How are you handling killer queries, just ignoring? Or something to tune (jetty config about timeout) or filter (query filtering)? Would be pleased to hear your comments. Bernd
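(One more mitigation worth knowing for the consumer case is Solr's timeAllowed parameter, which asks Lucene to stop collecting hits after the given number of milliseconds and return partial results, flagged with partialResults=true in the response header. It only bounds part of the request, so treat it as damage control rather than a hard guarantee; the 5-second value is illustrative:)

    /solr/base/select?q=...&timeAllowed=5000

or baked into the handler:

    <lst name="invariants">
      <int name="timeAllowed">5000</int>
    </lst>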
Re: Multitable import - uniqueKey
ok. But do we need it? That's what I am confused about. Shouldn't 1 key from table_1 pull all the data in the relationship, as the rows were inserted?

On Mon, Jun 3, 2013 at 7:53 PM, Jack Krupansky j...@basetechnology.com wrote:

Same answer. Whether it is 2, 3, 10 or 1000 tables, you, the data architect, must decide how to uniquely identify Solr documents. In general, when joining n tables, combine the n keys into one composite key. Either do it on the SQL query side, or with a Solr update request processor.

-- Jack Krupansky

-----Original Message-----
From: Raheel Hasan
Sent: Monday, June 03, 2013 10:44 AM
To: solr-user@lucene.apache.org
Subject: Re: Multitable import - uniqueKey

Hi, Thanks for the replies. Actually, I had only a small confusion: from table_1 I got key_1; using this I join into table_2. But table_2 also gave another key, key_2, which is needed for joining with table_3. So for table_1 and table_2 it's obviously just fine... but what will happen when table_3 is also added? Will the 3 tables stay intact in terms of relationships?

Thanks.

On Mon, Jun 3, 2013 at 7:33 PM, Jack Krupansky j...@basetechnology.com wrote:

If the respective table IDs are not globally unique, then you (the developer) will have to supplement the raw ID with a prefix or suffix or other form of global ID (e.g., UUID) to assure that they are unique. You could just add the SQL table name as a prefix or suffix. The bottom line: What do you WANT the Solr key field to look like? I mean, YOU are the data architect, right? What requirements do you have? When your Solr application users receive the key values in the responses to queries, what expectations do you expect to set for them?

-- Jack Krupansky

-----Original Message-----
From: Raheel Hasan
Sent: Monday, June 03, 2013 9:12 AM
To: solr-user@lucene.apache.org
Subject: Multitable import - uniqueKey

Hi, I am importing multiple tables (by join) into Solr using DIH. All is set, except for one confusion: what to do with the uniqueKey in the schema? When I had only 1 table, it was fine. Now how do I put 2 uniqueKeys (each from a different table)? For example:

    <uniqueKey>table1_id</uniqueKey>
    <uniqueKey>table2_id</uniqueKey>

Will this work?

--
Regards,
Raheel Hasan
Re: Can mm (min-match) be specified by field in dismax or edismax?
No, but you can with the LucidWorks Search query parser: f1:(cat dog fox bat fish cow)~50% f2:(cat dog fox bat fish zebra)~2 See: http://docs.lucidworks.com/display/lweug/Minimum+Match+for+Simple+Queries -- Jack Krupansky -Original Message- From: Eric Wilson Sent: Monday, June 03, 2013 10:30 AM To: solr-user@lucene.apache.org Subject: Can mm (min-match) be specified by field in dismax or edismax? I would like to have the min-match set differently for different fields in my dismax handler. Is this possible?
Re: Can mm (min-match) be specified by field in dismax or edismax?
Well, there is a hack(ish) way to do it:

    _query_:"{!type=edismax qf='someField' v='$q' mm=100%}"

This is clearly not a solrconfig.xml setting, but part of your query string using LocalParams behavior. This is going to get really messy if you have plenty of fields you'd like to search, where you'd need a similar construct for each. I cannot attest to performance at scale with such a construct... but it shows a way you can go about this if you feel compelled enough to do so.

Jason

On Jun 3, 2013, at 8:08 AM, Jack Krupansky j...@basetechnology.com wrote:

No, but you can with the LucidWorks Search query parser: f1:(cat dog fox bat fish cow)~50% f2:(cat dog fox bat fish zebra)~2 See: http://docs.lucidworks.com/display/lweug/Minimum+Match+for+Simple+Queries

-- Jack Krupansky

-----Original Message-----
From: Eric Wilson
Sent: Monday, June 03, 2013 10:30 AM
To: solr-user@lucene.apache.org
Subject: Can mm (min-match) be specified by field in dismax or edismax?

I would like to have the min-match set differently for different fields in my dismax handler. Is this possible?
Re: updating docs in solr cloud hangs
Hi,

My cluster hangs again when running an update process; the HTTP POST request was aborted because of a timeout error. After the hang, I couldn't do any more updates without restarting the cluster. I could see this error in the node's log after killing it. It is as if Solr waits for the update response forever, and no more operations can be handled until this one finishes.

[qtp301150411-1248] ERROR org.apache.solr.core.SolrCore - org.apache.solr.common.SolrException: interrupted waiting for shard update response
    at org.apache.solr.update.SolrCmdDistributor.checkResponses(SolrCmdDistributor.java:429)
    at org.apache.solr.update.SolrCmdDistributor.finish(SolrCmdDistributor.java:99)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.doFinish(DistributedUpdateProcessor.java:447)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:1140)
    at org.apache.solr.update.processor.LogUpdateProcessor.finish(LogUpdateProcessorFactory.java:179)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:83)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1816)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:656)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1307)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:560)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1072)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:382)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1006)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
    at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
    at org.eclipse.jetty.server.Server.handle(Server.java:365)
    at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:485)
    at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
    at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:937)
    at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:998)
    at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:856)
    at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
    at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
    at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
    at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.InterruptedException: sleep interrupted
    at java.lang.Thread.sleep(Native Method)
    at org.apache.solr.update.SolrCmdDistributor.checkResponses(SolrCmdDistributor.java:408)
    ... 35 more

--
Yago Riveiro
Sent with Sparrow

On Monday, June 3, 2013 at 2:18 AM, Erick Erickson wrote:

Did you take a stack trace of your _server_ and see if the fragment I posted is the place a bunch of threads are stuck? If so, then it's what I mentioned, and the patch I pointed to should fix it up (when it's ready)... The fact that it hangs more frequently with replication > 1 is consistent with the JIRA.

Shawn: Thanks, you beat me to the punch for clarifying replication!

Best,
Erick

On Sun, Jun 2, 2013 at 12:41 PM, Yago Riveiro yago.rive...@gmail.com wrote:

Shawn: replicationFactor higher than one, yes.

--
Yago Riveiro
Sent with Sparrow

On Sunday, June 2, 2013 at 4:07 PM, Shawn Heisey wrote:

On 6/2/2013 8:28 AM, Yago Riveiro wrote: Erick: In my case, when the server hangs, no exception is thrown; the logs on both servers stop registering the update INFO messages. If I shut down one node, immediately the log of the alive node registers some update INFO messages that appears was stuck
RE: Spell Checker (DirectSolrSpellChecker) correct settings
For each of the 4 cases listed below, can you give your query request string (q=..., fq=..., qt=..., etc.) and also the spellchecker output? James Dyer Ingram Content Group (615) 213-4311

-Original Message- From: Raheel Hasan [mailto:raheelhasan@gmail.com] Sent: Monday, June 03, 2013 10:22 AM To: solr-user@lucene.apache.org Subject: Re: Spell Checker (DirectSolrSpellChecker) correct settings

Hi, thanks a lot for the reply. Actually, "Provincial Courtt" is mentioned in many documents (sorry about the typo earlier). Secondly, I tried your idea, but it was not much help. The issue is very specific:
1) When I search for "Provinciaal Courtt" => it only suggests <str name="courtt">court</str> and not "Provincial".
2) Search for "Provincial Courtt" => returns results for the keyword "Provincial" and no suggestion for "court".
3) Search for "Provinciaal Court" => no suggestion; instead it searches for "court" and returns results.
4) Search for "Provinciall Courtt" => correct suggestions.

On Mon, Jun 3, 2013 at 7:55 PM, Dyer, James james.d...@ingramcontent.com wrote: My first guess is that no documents match the corrected query "provincial court". Because you have spellcheck.maxCollationTries set to a non-zero value, it will not return these as collations unless the correction would return hits. You can test my theory by removing spellcheck.maxCollationTries from the request and seeing if it returns "provincial court" as expected. If this isn't it, then give us the full query request and also the full spellcheck response for your failing case. James Dyer Ingram Content Group (615) 213-4311

-Original Message- From: Raheel Hasan [mailto:raheelhasan@gmail.com] Sent: Friday, May 31, 2013 9:38 AM To: solr-user@lucene.apache.org Subject: Spell Checker (DirectSolrSpellChecker) correct settings

Hi guys, I am new to Solr. Here is the thing: when I search "Courtt", I get a correct suggestion saying:

    spellcheck: {
      suggestions: [
        "courtt", {
          numFound: 1,
          startOffset: 0,
          endOffset: 6,
          suggestion: ["court"]
        },
        "collation", [
          "collationQuery", "court",
          "hits", 53,
          "misspellingsAndCorrections", ["courtt", "court"]
        ]
      ]
    }

But when I try "Provincial Courtt", it gives me no suggestions; instead it searches for "Provincial" only. Here are the spellcheck settings in *solrconfig.xml*:

    <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
      <str name="queryAnalyzerFieldType">text_en_splitting</str>
      <!-- a spellchecker built from a field of the main index -->
      <lst name="spellchecker">
        <str name="name">default</str>
        <str name="classname">solr.DirectSolrSpellChecker</str>
        <str name="field">text</str>
        <!-- minimum accuracy needed to be considered a valid spellcheck suggestion -->
        <float name="accuracy">0.5</float>
        <!-- require terms to occur in 1% of documents in order to be included in the dictionary -->
        <float name="thresholdTokenFrequency">.01</float>
        <!-- the spellcheck distance measure used; the default is the internal levenshtein -->
        <!-- <str name="distanceMeasure">internal</str> -->
        <!-- the maximum #edits we consider when enumerating terms: can be 1 or 2 -->
        <int name="maxEdits">1</int>
        <!-- the minimum number of characters the terms should share -->
        <int name="minPrefix">3</int>
        <!-- maximum number of possible matches to review before returning results -->
        <int name="maxInspections">3</int>
        <!-- minimum length of a query term to be considered for correction -->
        <int name="minQueryLength">4</int>
        <!-- maximum threshold of documents a query term can appear in to be considered for correction -->
        <float name="maxQueryFrequency">0.01</float>
      </lst>
      <!-- a spellchecker that can break or combine words. See the /spell handler below for usage -->
      <lst name="spellchecker">
        <str name="name">wordbreak</str>
        <str name="classname">solr.WordBreakSolrSpellChecker</str>
        <str name="field">text</str>
        <str name="combineWords">true</str>
        <str name="breakWords">true</str>
        <int name="maxChanges">5</int>
      </lst>
    </searchComponent>

Here is the *requestHandler*:

    <requestHandler name="/select" class="solr.SearchHandler">
      <lst name="defaults">
        <str name="echoParams">explicit</str>
        <int name="rows">20</int>
        <str name="df">text</str>
        <!-- Spell checking defaults -->
        <str name="spellcheck">on</str>
        <str name="spellcheck.count">5</str>
        <str name="spellcheck.onlyMorePopular">true</str>
        <str name="spellcheck.maxResultsForSuggest">5</str>
        <str name="spellcheck.alternativeTermCount">2</str>
        <str name="spellcheck.extendedResults">false</str>
        <str name="spellcheck.collate">true</str>
        <str
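A request of the kind James is asking for might look like the sketch below; the host, port, and handler path are assumptions, and echoParams=all is simply a convenient way to capture the effective parameters alongside the spellchecker output:

    http://localhost:8983/solr/select?q=Provincial%20Courtt&spellcheck=true&spellcheck.collate=true&echoParams=all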
Re: how are you handling killer queries?
I think you should take a look at the TimeLimitingCollector (it is used also inside SolrIndexSearcher). My understanding is that it will stop your server from consuming unnecessary resources. --roman On Mon, Jun 3, 2013 at 4:39 AM, Bernd Fehling bernd.fehl...@uni-bielefeld.de wrote: How are you handling killer queries with Solr? While Solr/Lucene (currently 4.2.1) is trying to do its best, I sometimes see stupid queries in my logs, identifiable by their extremely long query times. Example: q=???+and+??+and+???+and++and+???+and+?? I even get hits for this (hits=34091309 status=0 QTime=88667). But the Jetty log says: WARN:oejs.Response:Committed before 500 {msg=Datenübergabe unterbrochen (broken pipe),trace=org.eclipse.jetty.io.EofException... org.eclipse.jetty.http.HttpGenerator.flushBuffer(HttpGenerator.java:838)|?... 35 more|,code=500} WARN:oejs.ServletHandler:/solr/base/select java.lang.IllegalStateException: Committed at org.eclipse.jetty.server.Response.resetBuffer(Response.java:1136) Because I get hits and a QTime, the search was successful, right? But Jetty/HTTP has already closed the connection and Solr doesn't know about it? How are you handling killer queries: just ignoring them? Or is there something to tune (Jetty timeout config) or filter (query filtering)? Would be pleased to hear your comments. Bernd
Solr: separating index and storage
Consider the following use case. Certain words are extracted from a document and indexed. The exact sentence containing each word cannot be stored alongside the extracted word because of the volume at which the documents grow. How can the index and, let's call them, doc servers be separated? An option is to store the sentences in MongoDB or an RDBMS. But there seems to be a schema-level design issue: assuming 'word' is a multivalued field, how do we associate with it a reference to the corresponding entry in the doc server? Maybe create (word_1, ref_1) tuples. Is there any other built-in feature? Any related project which separates index and doc servers? Thanks, Sourajit
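One minimal way to sketch the tuple idea in schema.xml (the field names and the word|ref encoding here are illustrative assumptions, not anything from the original post) is to index the words for matching while storing only an opaque pointer into the external doc server:

    <!-- sketch: words are indexed for search but no sentence text is stored in Solr -->
    <field name="word" type="string" indexed="true" stored="false" multiValued="true"/>
    <!-- one "word|ref" value per extracted word, e.g. "court|mongo:5192a1f3" -->
    <field name="word_ref" type="string" indexed="false" stored="true" multiValued="true"/>

Encoding the word and the reference into a single stored value is one way around the fact that Solr does not preserve pairwise association between two parallel multivalued fields; at query time the application splits the stored value and fetches the sentence from MongoDB or the RDBMS.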
Re: Solr query performance tool
You can use this tool to analyze the logs: https://github.com/dfdeshom/solr-loganalyzer We use SolrMeter for performance/stress testing: https://code.google.com/p/solrmeter/ -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-query-performance-tool-tp4066900p4067869.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: how are you handling killer queries?
There is the timeAllowed parameter: http://wiki.apache.org/solr/CommonQueryParameters#timeAllowed -- Jack Krupansky -Original Message- From: Roman Chyla Sent: Monday, June 03, 2013 11:53 AM To: solr-user@lucene.apache.org Subject: Re: how are you handling killer queries? I think you should take a look at the TimeLimitingCollector (it is used also inside SolrIndexSearcher). My understanding is that it will stop your server from consuming unnecessary resources. --roman On Mon, Jun 3, 2013 at 4:39 AM, Bernd Fehling bernd.fehl...@uni-bielefeld.de wrote: How are you handling killer queries with Solr? While Solr/Lucene (currently 4.2.1) is trying to do its best, I sometimes see stupid queries in my logs, identifiable by their extremely long query times. Example: q=???+and+??+and+???+and++and+???+and+?? I even get hits for this (hits=34091309 status=0 QTime=88667). But the Jetty log says: WARN:oejs.Response:Committed before 500 {msg=Datenübergabe unterbrochen (broken pipe),trace=org.eclipse.jetty.io.EofException... org.eclipse.jetty.http.HttpGenerator.flushBuffer(HttpGenerator.java:838)|?... 35 more|,code=500} WARN:oejs.ServletHandler:/solr/base/select java.lang.IllegalStateException: Committed at org.eclipse.jetty.server.Response.resetBuffer(Response.java:1136) Because I get hits and a QTime, the search was successful, right? But Jetty/HTTP has already closed the connection and Solr doesn't know about it? How are you handling killer queries: just ignoring them? Or is there something to tune (Jetty timeout config) or filter (query filtering)? Would be pleased to hear your comments. Bernd
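A rough sketch of how the parameter is used from SolrJ; the endpoint URL and query are placeholders, and the Solr 4.x client API is assumed:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class TimeAllowedExample {
      public static void main(String[] args) throws Exception {
        // hypothetical endpoint; point this at your own core
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/base");
        SolrQuery q = new SolrQuery("some expensive query");
        q.setTimeAllowed(2000); // milliseconds; Solr stops collecting hits once exceeded
        QueryResponse rsp = server.query(q);
        // when the limit was hit, the response header carries partialResults=true
        System.out.println(rsp.getResponseHeader());
        server.shutdown();
      }
    }

Note that timeAllowed only bounds the time spent collecting documents, so a query can still spend time in other phases; it mitigates killer queries rather than eliminating them.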
Saravanan Chinnadurai/Actionimages is out of the office.
I will be out of the office starting 03/06/2013 and will not return until 04/06/2013. Please email to itsta...@actionimages.com for any urgent issues. Action Images is a division of Reuters Limited and your data will therefore be protected in accordance with the Reuters Group Privacy / Data Protection notice which is available in the privacy footer at www.reuters.com Registered in England No. 145516 VAT REG: 397000555
Solr 4.2.1 higher memory footprint vs Solr 3.5
Hi, Using the same schema for both Solr 3.5 and Solr 4.2.1 and posting the same data to both servers, the memory requirements seem to have gone up sharply during request handling. Requests come in at around 200 QPS. Document sizes are very large, but that did not seem to be a problem with 3.5 (lots of multivalued fields with large array lengths). Could you help me understand what change in Solr 4.2.1 would account for this higher memory requirement? Also, in a different test, I ran a query to get a list of all unique IDs via a single query under no load, and I see it complete in 500 ms; however, the time it takes to ship the data back to the client seems to be very large. Any idea what could be causing this behavior? Would appreciate any help. Regards, -- Sandeep -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-4-2-1-higher-memory-footprint-vs-Solr-3-5-tp4067879.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Can mm (min-match) be specified by field in dismax or edismax?
Also, just to be clear: mm/minMatch is not an option for a field but for a full BooleanQuery. I mean, you can't have two different mm values within the same BooleanQuery, except with nested BooleanQueries, where each BQ has its own mm. -- Jack Krupansky -Original Message- From: Jason Hellman Sent: Monday, June 03, 2013 11:40 AM To: solr-user@lucene.apache.org Subject: Re: Can mm (min-match) be specified by field in dismax or edismax? Well, there is a hack(ish) way to do it: _query_:"{!type=edismax qf='someField' v='$q' mm=100%}" This is clearly not a solrconfig.xml setting, but part of your query string using LocalParams behavior. This is going to get really messy if you have many fields you'd like to search, since you'd need a similar construct for each. I cannot attest to performance at scale with such a construct… but this shows a way you can go about it if you feel compelled enough to do so. Jason On Jun 3, 2013, at 8:08 AM, Jack Krupansky j...@basetechnology.com wrote: No, but you can with the LucidWorks Search query parser: f1:(cat dog fox bat fish cow)~50% f2:(cat dog fox bat fish zebra)~2 See: http://docs.lucidworks.com/display/lweug/Minimum+Match+for+Simple+Queries -- Jack Krupansky -Original Message- From: Eric Wilson Sent: Monday, June 03, 2013 10:30 AM To: solr-user@lucene.apache.org Subject: Can mm (min-match) be specified by field in dismax or edismax? I would like to have the min-match set differently for different fields in my dismax handler. Is this possible?
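Spelled out for two fields, the hack might look like the sketch below; the field names title/body and the extra userQuery parameter are invented for illustration, and routing the user's terms through a separate parameter avoids the self-reference that v='$q' would create if this whole construct were itself the q parameter (spaces would be URL-encoded in a real request):

    q=_query_:"{!edismax qf='title' mm='100%' v=$userQuery}" OR _query_:"{!edismax qf='body' mm='2' v=$userQuery}"
    &userQuery=cat dog fox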
Re: Disable all caches in solr
You can also check out this link. http://lucene.472066.n3.nabble.com/Is-there-a-way-to-remove-caches-in-SOLR-td4061216.html#a4061219 -- View this message in context: http://lucene.472066.n3.nabble.com/Disable-all-caches-in-solr-tp4066517p4067870.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr + Groovy
Looks interesting, but it's just for the UpdateHandler, right? Does a similar handler for searching already exist? Achim On 03.06.2013 at 17:22, Jack Krupansky wrote: Check out the support for external scripting of update request processors: http://lucene.apache.org/solr/4_3_0/solr-core/org/apache/solr/update/processor/StatelessScriptUpdateProcessorFactory.html Are there any of your requirements that that doesn't address? -- Jack Krupansky -Original Message- From: Achim Domma Sent: Monday, June 03, 2013 3:07 AM To: solr-user@lucene.apache.org Subject: Solr + Groovy Hi, I have some query building and result processing code which is currently running as a normal Solr client outside of Solr. I think it would make a lot of sense to move parts of this code into a custom SearchHandler or SearchComponent. Because I'm not a big fan of the Java language, I would like to use Groovy. Searching the web, I got the impression that Solr + alternative JVM languages is not a very common topic. So before starting my project, I would like to know: is there a well-known good reason not to use Groovy (or Clojure, Scala, ...) for implementing custom Solr code? kind regards, Achim
Re: Solr + Groovy
Sorry about that. Unfortunately, scripting is only on the update side. But I imagine that a lot of the logic could be repurposed for the query side. -- Jack Krupansky -Original Message- From: Achim Domma Sent: Monday, June 03, 2013 2:31 PM To: solr-user@lucene.apache.org Subject: Re: Solr + Groovy Looks interesting, but it's just for the UpdateHandler, right? Does a similar handler for searching already exist? Achim On 03.06.2013 at 17:22, Jack Krupansky wrote: Check out the support for external scripting of update request processors: http://lucene.apache.org/solr/4_3_0/solr-core/org/apache/solr/update/processor/StatelessScriptUpdateProcessorFactory.html Are there any of your requirements that that doesn't address? -- Jack Krupansky -Original Message- From: Achim Domma Sent: Monday, June 03, 2013 3:07 AM To: solr-user@lucene.apache.org Subject: Solr + Groovy Hi, I have some query building and result processing code which is currently running as a normal Solr client outside of Solr. I think it would make a lot of sense to move parts of this code into a custom SearchHandler or SearchComponent. Because I'm not a big fan of the Java language, I would like to use Groovy. Searching the web, I got the impression that Solr + alternative JVM languages is not a very common topic. So before starting my project, I would like to know: is there a well-known good reason not to use Groovy (or Clojure, Scala, ...) for implementing custom Solr code? kind regards, Achim
Re: Solr + Groovy
Yeah, it's currently just for the update side of things. But this issue is open https://issues.apache.org/jira/browse/SOLR-3669 and assigned to me, for one of these days; I set it on my 5.0 radar. Certainly, anyone who wants to make this happen sooner than I maybe-possibly-hopefully will one day delve into it: go for it! Erik p.s. [infomercial] We do have update-side scripting (JavaScript) and business rules (via Drools) capabilities in our LucidWorks Search platform, http://www.lucidworks.com/products/lucidworks-search with the update-side scripting running in the connector framework by design, rather than on the Solr side of things, to allow it to scale in a separate tier. On Jun 3, 2013, at 14:31, Achim Domma wrote: Looks interesting, but it's just for the UpdateHandler, right? Does a similar handler for searching already exist? Achim On 03.06.2013 at 17:22, Jack Krupansky wrote: Check out the support for external scripting of update request processors: http://lucene.apache.org/solr/4_3_0/solr-core/org/apache/solr/update/processor/StatelessScriptUpdateProcessorFactory.html Are there any of your requirements that that doesn't address? -- Jack Krupansky -Original Message- From: Achim Domma Sent: Monday, June 03, 2013 3:07 AM To: solr-user@lucene.apache.org Subject: Solr + Groovy Hi, I have some query building and result processing code which is currently running as a normal Solr client outside of Solr. I think it would make a lot of sense to move parts of this code into a custom SearchHandler or SearchComponent. Because I'm not a big fan of the Java language, I would like to use Groovy. Searching the web, I got the impression that Solr + alternative JVM languages is not a very common topic. So before starting my project, I would like to know: is there a well-known good reason not to use Groovy (or Clojure, Scala, ...) for implementing custom Solr code? kind regards, Achim
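For concreteness, the update-side scripting Jack points to is wired into solrconfig.xml roughly as below; the chain name and script file name are illustrative assumptions. The stock examples use JavaScript, but the factory goes through JSR-223, so any engine on the classpath (Groovy included) should work, with the script expected to define functions such as processAdd(cmd):

    <updateRequestProcessorChain name="script">
      <processor class="solr.StatelessScriptUpdateProcessorFactory">
        <str name="script">update-script.js</str>
      </processor>
      <processor class="solr.LogUpdateProcessorFactory"/>
      <processor class="solr.RunUpdateProcessorFactory"/>
    </updateRequestProcessorChain>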
Re: Dynamic Indexing using DB and DIH
On 6/3/2013 12:35 PM, PeriS wrote: I noticed the delta-import is creating a new indexed entry on top of the existing one... is that normal? Not sure what you are asking here, so I'll give an answer to the question I think you're asking: If you have a uniqueKey defined in your schema, then new documents with matching values in the uniqueKey field will replace the existing documents. Solr will delete the old one before inserting the new one. Thanks, Shawn
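For reference, the wiring Shawn describes looks like this in schema.xml (the field name "id" is the common convention, not necessarily the actual key in PeriS's schema):

    <field name="id" type="string" indexed="true" stored="true" required="true"/>
    <uniqueKey>id</uniqueKey>

With this in place, re-adding a document whose id already exists replaces the old version. If duplicates show up instead, a likely culprit is that the imported key values are not byte-identical, for example differing case or stray whitespace.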
Re: Dynamic Indexing using DB and DIH
Shawn, You got the point; I do have the unique key defined, but for some reason, when I run the delta-import, a new entry is created for the same record with a new unique key. It's almost as if it doesn't detect the existing record. On Jun 3, 2013, at 3:51 PM, Shawn Heisey s...@elyograg.org wrote: On 6/3/2013 12:35 PM, PeriS wrote: I noticed the delta-import is creating a new indexed entry on top of the existing one... is that normal? Not sure what you are asking here, so I'll give an answer to the question I think you're asking: If you have a uniqueKey defined in your schema, then new documents with matching values in the uniqueKey field will replace the existing documents. Solr will delete the old one before inserting the new one. Thanks, Shawn
Re: Custom Response Handler
Hi Erik, In my case I have to calculate a custom value, depending on the retrieved candidates, for each document, so my choice will be a DocTransformer. Let's say I need to include a Java class which does the computation; how do I tie that to the DocTransformer? The Solr wiki (http://wiki.apache.org/solr/DocTransformers) talks about custom transformers but does not include an example. Please help. Regards, Vibhor Jaiswal -- View this message in context: http://lucene.472066.n3.nabble.com/Custom-Response-Handler-tp4067558p4067923.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Custom Response Handler
You can refer to this post on using DocTransformers: http://java.dzone.com/news/solr-40-doctransformers-first -- View this message in context: http://lucene.472066.n3.nabble.com/Custom-Response-Handler-tp4067558p4067926.html Sent from the Solr - User mailing list archive at Nabble.com.
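Since the wiki lacks an example, here is a minimal sketch against the Solr 4.x transformer API; the class name, the registered transformer name, and the "title" field are invented for illustration, and the computed value is deliberately trivial:

    import java.io.IOException;
    import org.apache.solr.common.SolrDocument;
    import org.apache.solr.common.params.SolrParams;
    import org.apache.solr.request.SolrQueryRequest;
    import org.apache.solr.response.transform.DocTransformer;
    import org.apache.solr.response.transform.TransformerFactory;

    // Register in solrconfig.xml:  <transformer name="myvalue" class="com.example.MyValueTransformerFactory"/>
    // Request it per query with:   fl=*,[myvalue]
    public class MyValueTransformerFactory extends TransformerFactory {
      @Override
      public DocTransformer create(final String field, final SolrParams params, SolrQueryRequest req) {
        return new DocTransformer() {
          @Override
          public String getName() { return field; }

          @Override
          public void transform(SolrDocument doc, int docid) throws IOException {
            // compute a custom per-document value from fields already retrieved for the response
            Object title = doc.getFieldValue("title"); // hypothetical stored field
            doc.setField(field, title == null ? 0 : title.toString().length());
          }
        };
      }
    }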
Inconsistent Full import document index counts.
Hello All, I've been working on a 2-shard SolrCloud instance with several million documents, and the import process has recently begun to miss documents as they are added to the underlying Postgres database. There are no glaring failures in the log files (all SEVERE- and WARNING-level errors in the log are from malformed queries). To ensure that it is not an issue with my delta-import query, I've tried running full imports, to no avail. Strangely, when I modify my data-import query to only search for a specific id that was missed in the full import, all of the relevant documents are indexed. Any ideas for possible causes of missed documents in long-running full imports? Thanks, Chris Donaher
RE: Solr query performance tool
You have to be careful looking at QTimes: they do not include garbage collection. I've run into issues where QTime was short (because it was), but the query happened to come in during a long garbage collection where everything was paused. So you can get into situations where, once the 15-second GC is done, everything performs as expected! I'd make sure to have an external querying tool; you can monitor GC times as well via JMX. From: bbarani [bbar...@gmail.com] Sent: Monday, June 03, 2013 8:58 AM To: solr-user@lucene.apache.org Subject: Re: Solr query performance tool You can use this tool to analyze the logs: https://github.com/dfdeshom/solr-loganalyzer We use SolrMeter for performance/stress testing: https://code.google.com/p/solrmeter/ -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-query-performance-tool-tp4066900p4067869.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr query performance tool
On 6/3/2013 3:33 PM, Greg Harris wrote: You have to be careful looking at QTimes: they do not include garbage collection. I've run into issues where QTime was short (because it was), but the query happened to come in during a long garbage collection where everything was paused. So you can get into situations where, once the 15-second GC is done, everything performs as expected! I'd make sure to have an external querying tool; you can monitor GC times as well via JMX. The QTime value in the response is calculated using System.currentTimeMillis(), so it should include GC time, unless the GC happens to hit just after QTime is calculated but before the final response with all the results is sent. If you are requesting a lot of documents, or you have very large documents where most/all of the fields are stored, having long GCs hit during that particular moment might actually be a common occurrence. Thanks, Shawn
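A small sketch of the kind of external measurement both posts are getting at; the URL and query are placeholders, and the Solr 4.x SolrJ API is assumed. It compares Solr's reported QTime against client-side wall-clock time, where a large, persistent gap points at GC pauses or response shipping rather than the search itself:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class QTimeVsWallClock {
      public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
        SolrQuery q = new SolrQuery("*:*");
        long start = System.currentTimeMillis();
        QueryResponse rsp = server.query(q);
        long wall = System.currentTimeMillis() - start;
        // QTime is measured inside Solr; wall also includes network, serialization, and client-side delays
        System.out.println("QTime=" + rsp.getQTime() + "ms, wall clock=" + wall + "ms");
        server.shutdown();
      }
    }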
SolrCloud Load Balancer weight
Hey guys, I have recently looked into an issue with my SolrCloud related to very high load when performing a full-import with DIH. While some work could be done to improve my queries, etc. in DIH, this led me to a new feature idea for Solr: weighted internal load balancing. Basically, I can think of two use cases where a weight on load balancing could help: 1) My situation from above - I'm doing a huge import and want SolrCloud to direct fewer queries to the node handling the DIH full-import, say weight 10/100 (10%) instead of 100/100. 2) Mixed hardware - although I wouldn't recommend doing this, some people may have mixed hardware, some of it capable of handling more or less traffic. These weights wouldn't be expected to be exact, just a best effort to generally influence load on nodes inside the cluster. They of course would only matter on reads (/get, /select, etc.). A full-blown approach would have weight awareness in the ZooKeeper-aware client implementation and in inter-node replica requests. Should I JIRA this? Thoughts? Tim
Re: SpatialRecursivePrefixTreeFieldType Spatial Searching
Hi Chris: Have you read http://wiki.apache.org/solr/SpatialForTimeDurations ? You're modeling your data sub-optimally. Full-precision rectangles (distErrPct=0) don't scale well, and you're seeing that. You should represent your durations as points, which will take up a fraction of the space (see above). Furthermore, because your detail gets into one digit to the right of the decimal point, your maxDistErr should definitely be smaller than 1 -- use something like 0.5 (given you have two levels of precision below a full day), but to be safer (more certain it's not a problem) use 0.3 -- a little less. Please report back how that goes. ~ David On 6/3/13 7:27 AM, Chris Atkinson chrisa...@gmail.com wrote: Hi, I'm seeing really slow query times, 7-25 seconds, when I run a simple filter query that uses my SpatialRecursivePrefixTreeFieldType field. My index is about 30k documents. Before adding the spatial field, the on-disk size was about 100 MB, so it's a really tiny index. Once I add the spatial field (which is multi-valued), the index size jumps up to 2 GB. (Is this normal?) Only about 10k documents will have any spatial data. Typically they will have at most 10 shapes each, but the majority are all one of two rectangles. This is my fieldType definition:

    <fieldType name="date_availability" class="solr.SpatialRecursivePrefixTreeFieldType"
               geo="false" worldBounds="0 0 3650 1" distErrPct="0" maxDistErr="1" units="degrees"/>

And the field:

    <field name="availability_spatial" type="date_availability" indexed="true" stored="false" multiValued="true"/>

I am using the field to represent approximately 10 years after January 1st, 2013, where each day is along the X axis. Because the availability starts and ends at 2pm and 10am, I used a decimal place when creating my shapes to capture that detail. (Is this approach wrong?) So a typical rectangle when indexed would be (minX minY maxX maxY): Rectangle 100.6 0 120.4 1 Is it wrong that my Y and X values are not of the same scale? Since I don't care about the Y axis at all, I just set it to be of height 1 always. I'm running Solr 4.3 with a small JVM of 768M (can be increased), and I have 2GB RAM (again, can be increased). Thanks
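Following the wiki's duration-as-point model, the reworked definition might look like the sketch below; the square worldBounds is an assumption extrapolated from Chris's 10-year window, since with points the Y axis carries the end day rather than a dummy height:

    <!-- sketch: each duration becomes the point (x = start day, y = end day) -->
    <fieldType name="date_availability" class="solr.SpatialRecursivePrefixTreeFieldType"
               geo="false" worldBounds="0 0 3650 3650"
               distErrPct="0" maxDistErr="0.3" units="degrees"/>

A duration [s, e] is then indexed as the point "s e", and "overlaps the query range [qs, qe]" becomes a rectangle intersection, e.g. fq=availability_spatial:"Intersects(0 qs qe 3650)", since overlap is exactly start <= qe and end >= qs.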
Leader election deadlock after restarting leader in 4.2.1
SOLR 4.2.1, tomcat 6.0.35, CentOS 6.2 (2.6.32-220.4.1.el6.x86_64 #1 SMP), java 6u27 64 bit 6 nodes, 2 shards, 3 replicas each. Names changed to r1s2 (replica1 - shard 2), r2s2, and r3s2 for each replica in shard 2. What we see: * Under production load, we restart a leader (r1s2), and observe in the cloud admin that the old leader is in state Down and no new leader is ever elected. * The system will stay like this until we stop the old leader (or cause a ZK timeout...see below). *Please note:* the leader is killed, then kill -9'd 5 seconds later, before restarting. We have since changed this. Digging into the logs on the old leader (r1s2 = replica1-shard 2): * The old leader restarted at 5:23:29 PM, but appears to be stuck in SolrDispatchFilter.init() -- (See recovery at bottom). * It doesn't want to become leader, possibly due to the unclean shutdown. May 28, 2013 5:24:42 PM org.apache.solr.update.PeerSync handleVersions INFO: PeerSync: core=browse url=http://r1s2:8080/solr Our versions are too old. ourHighThreshold=1436325665147191297 otherLowThreshold=1436325775374548992 * It then tries to recover, but cannot, because there is no leader. May 28, 2013 5:24:43 PM org.apache.solr.common.SolrException log SEVERE: Error while trying to recover. core=browse:org.apache.solr.common.SolrException: No registered leader was found, collection:browse slice:shard2 * Meanwhile, it appears that blocking in init(), prevents the http-8080 handler from starting (See recovery at bottom). Digging into the other replicas (r2s2): * For some reason, the old leader (r1s2) remains in the list of replicas that r2s2 attempts to sync to. May 28, 2013 5:23:42 PM org.apache.solr.update.PeerSync sync INFO: PeerSync: core=browse url=http://r2s2:8080/solr START replicas=[http://r1s2:8080/solr/browse/, http://r3s2:8080/solr/browse/] nUpdates=100 * This apparently fails (30 second timeout), possibly due to http-8080 handler not being started on r1s2. May 28, 2013 5:24:12 PM org.apache.solr.update.PeerSync handleResponse WARNING: PeerSync: core=browse url=http://r2s2:8080/solr exception talking to http://r1s2:8080/solr/browse/, failed org.apache.solr.client.solrj.SolrServerException: Timeout occured while waiting response from server at: http://r1s2:8080/solr/browse *At this point, the cluster will remain indefinitely without a leader, if nothing else changes.* But in this particular instance, we took some stack and heap dumps from r1s2, which paused java long enough to cause a *zookeeper timeout on the old leader (r1s2)*: May 28, 2013 5:33:26 PM org.apache.zookeeper.ClientCnxn$SendThread run INFO: Client session timed out, have not heard from server in 38226ms for sessionid 0x23d28e0f584005d, closing socket connection and attempting reconnect Then, one of the replicas (r3s2) finally stopped trying to sync to r1s2 and succeeded in becoming leader: May 28, 2013 5:33:34 PM org.apache.solr.update.PeerSync sync INFO: PeerSync: core=browse url=http://r3s2:8080/solr START replicas=[http://r2s2:8080/solr/browse/] nUpdates=100 May 28, 2013 5:33:34 PM org.apache.solr.update.PeerSync handleVersions INFO: PeerSync: core=browse url=http://r3s2:8080/solr Received 100 versions from r2s2:8080/solr/browse/ May 28, 2013 5:33:34 PM org.apache.solr.update.PeerSync handleVersions INFO: PeerSync: core=browse url=http://r3s2:8080/solr Our versions are newer. ourLowThreshold=1436325775374548992 otherHigh=1436325775805513730 May 28, 2013 5:33:34 PM org.apache.solr.update.PeerSync sync INFO: PeerSync: core=browse url=http://r3s2:8080/solr DONE. 
sync succeeded Now that we have a leader, r1s2 can succeed in recovery and finish SolrDispatchFilter.init(), apparently allowing the http-8080 handler to start (r1s2). May 28, 2013 5:34:49 PM org.apache.solr.cloud.RecoveryStrategy replay INFO: No replay needed. core=browse May 28, 2013 5:34:49 PM org.apache.solr.cloud.RecoveryStrategy doRecovery INFO: Replication Recovery was successful - registering as Active. core=browse May 28, 2013 5:34:49 PM org.apache.solr.cloud.ZkController publish INFO: publishing core=browse state=active May 28, 2013 5:34:49 PM org.apache.solr.cloud.ZkController publish INFO: numShards not found on descriptor - reading it from system property May 28, 2013 5:34:49 PM org.apache.solr.cloud.RecoveryStrategy doRecovery INFO: Finished recovery process. core=browse May 28, 2013 5:34:49 PM org.apache.solr.cloud.RecoveryStrategy run INFO: Starting recovery process. core=browse recoveringAfterStartup=false May 28, 2013 5:34:49 PM org.apache.solr.common.cloud.ZkStateReader updateClusterState INFO: Updating cloud state from ZooKeeper... May 28, 2013 5:34:49 PM org.apache.solr.servlet.SolrDispatchFilter init INFO: user.dir=/ May 28, 2013 5:34:49 PM org.apache.solr.servlet.SolrDispatchFilter init *INFO: SolrDispatchFilter.init() done* May 28, 2013 5:34:49 PM org.apache.solr.cloud.ZkController publish INFO: publishing core=browse state=recovering May 28, 2013 5:34:49 PM
Re: SolrCloud Load Balancer weight
On Jun 3, 2013, at 3:33 PM, Tim Vaillancourt t...@elementspace.com wrote: Should I JIRA this? Thoughts? Yeah - it's always been in the back of my mind - it's come up a few times - eventually we would like nodes to report some stats to zk to influence load balancing. - mark
How to Get Cluster State By Solrj?
I want to get the cluster state of my SolrCloud via SolrJ (I know the admin page shows it, but I want to customize it in my application). Firstly, the wiki says: CloudSolrServer server = new CloudSolrServer("localhost:9983"); Why does CloudSolrServer take only one ZooKeeper host:port as an argument? I have a quorum of ZooKeepers, and some of them may be down even while the quorum still works. Secondly, how can I get the current state of the cluster properly?
Re: How to Get Cluster State By Solrj?
It actually accepts a comma-separated list of ZK host addresses (your quorum), in the same format ZK describes in its docs. To get the cluster state, get the ZkStateReader from the CloudSolrServer and then its getClusterState or something. - Mark On Jun 3, 2013, at 5:30 PM, Furkan KAMACI furkankam...@gmail.com wrote: I want to get the cluster state of my SolrCloud via SolrJ (I know the admin page shows it, but I want to customize it in my application). Firstly, the wiki says: CloudSolrServer server = new CloudSolrServer("localhost:9983"); Why does CloudSolrServer take only one ZooKeeper host:port as an argument? I have a quorum of ZooKeepers, and some of them may be down even while the quorum still works. Secondly, how can I get the current state of the cluster properly?
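Fleshing out Mark's answer as a sketch against the Solr 4.x SolrJ API; the ZooKeeper hosts and the collection name "collection1" are placeholder assumptions:

    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.common.cloud.ClusterState;
    import org.apache.solr.common.cloud.Replica;
    import org.apache.solr.common.cloud.Slice;
    import org.apache.solr.common.cloud.ZkStateReader;

    public class ClusterStateExample {
      public static void main(String[] args) throws Exception {
        // pass the full quorum, comma-separated; losing one ZK node then still works
        CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
        server.connect(); // forces the ZK connection so the state reader is available
        ZkStateReader reader = server.getZkStateReader();
        ClusterState state = reader.getClusterState();
        System.out.println("Live nodes: " + state.getLiveNodes());
        for (Slice slice : state.getSlices("collection1")) {   // assumed collection name
          for (Replica replica : slice.getReplicas()) {
            System.out.println(slice.getName() + " -> " + replica.getName()
                + " (" + replica.getStr(ZkStateReader.STATE_PROP) + ")");
          }
        }
        server.shutdown();
      }
    }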
Re: Leader election deadlock after restarting leader in 4.2.1
Thanks - I can try and look into this perhaps next week. You might copy the details into a JIRA issue to prevent it from getting lost though... - Mark On Jun 3, 2013, at 4:46 PM, John Guerrero jguerr...@tagged.com wrote: SOLR 4.2.1, tomcat 6.0.35, CentOS 6.2 (2.6.32-220.4.1.el6.x86_64 #1 SMP), java 6u27 64 bit 6 nodes, 2 shards, 3 replicas each. Names changed to r1s2 (replica1 - shard 2), r2s2, and r3s2 for each replica in shard 2. What we see: * Under production load, we restart a leader (r1s2), and observe in the cloud admin that the old leader is in state Down and no new leader is ever elected. * The system will stay like this until we stop the old leader (or cause a ZK timeout...see below). *Please note:* the leader is killed, then kill -9'd 5 seconds later, before restarting. We have since changed this. Digging into the logs on the old leader (r1s2 = replica1-shard 2): * The old leader restarted at 5:23:29 PM, but appears to be stuck in SolrDispatchFilter.init() -- (See recovery at bottom). * It doesn't want to become leader, possibly due to the unclean shutdown. May 28, 2013 5:24:42 PM org.apache.solr.update.PeerSync handleVersions INFO: PeerSync: core=browse url=http://r1s2:8080/solr Our versions are too old. ourHighThreshold=1436325665147191297 otherLowThreshold=1436325775374548992 * It then tries to recover, but cannot, because there is no leader. May 28, 2013 5:24:43 PM org.apache.solr.common.SolrException log SEVERE: Error while trying to recover. core=browse:org.apache.solr.common.SolrException: No registered leader was found, collection:browse slice:shard2 * Meanwhile, it appears that blocking in init(), prevents the http-8080 handler from starting (See recovery at bottom). Digging into the other replicas (r2s2): * For some reason, the old leader (r1s2) remains in the list of replicas that r2s2 attempts to sync to. May 28, 2013 5:23:42 PM org.apache.solr.update.PeerSync sync INFO: PeerSync: core=browse url=http://r2s2:8080/solr START replicas=[http://r1s2:8080/solr/browse/, http://r3s2:8080/solr/browse/] nUpdates=100 * This apparently fails (30 second timeout), possibly due to http-8080 handler not being started on r1s2. 
May 28, 2013 5:24:12 PM org.apache.solr.update.PeerSync handleResponse WARNING: PeerSync: core=browse url=http://r2s2:8080/solr exception talking to http://r1s2:8080/solr/browse/, failed org.apache.solr.client.solrj.SolrServerException: Timeout occured while waiting response from server at: http://r1s2:8080/solr/browse *At this point, the cluster will remain indefinitely without a leader, if nothing else changes.* But in this particular instance, we took some stack and heap dumps from r1s2, which paused java long enough to cause a *zookeeper timeout on the old leader (r1s2)*: May 28, 2013 5:33:26 PM org.apache.zookeeper.ClientCnxn$SendThread run INFO: Client session timed out, have not heard from server in 38226ms for sessionid 0x23d28e0f584005d, closing socket connection and attempting reconnect Then, one of the replicas (r3s2) finally stopped trying to sync to r1s2 and succeeded in becoming leader: May 28, 2013 5:33:34 PM org.apache.solr.update.PeerSync sync INFO: PeerSync: core=browse url=http://r3s2:8080/solr START replicas=[http://r2s2:8080/solr/browse/] nUpdates=100 May 28, 2013 5:33:34 PM org.apache.solr.update.PeerSync handleVersions INFO: PeerSync: core=browse url=http://r3s2:8080/solr Received 100 versions from r2s2:8080/solr/browse/ May 28, 2013 5:33:34 PM org.apache.solr.update.PeerSync handleVersions INFO: PeerSync: core=browse url=http://r3s2:8080/solr Our versions are newer. ourLowThreshold=1436325775374548992 otherHigh=1436325775805513730 May 28, 2013 5:33:34 PM org.apache.solr.update.PeerSync sync INFO: PeerSync: core=browse url=http://r3s2:8080/solr DONE. sync succeeded Now that we have a leader, r1s2 can succeed in recovery and finish SolrDispatchFilter.init(), apparently allowing the http-8080 handler to start (r1s2). May 28, 2013 5:34:49 PM org.apache.solr.cloud.RecoveryStrategy replay INFO: No replay needed. core=browse May 28, 2013 5:34:49 PM org.apache.solr.cloud.RecoveryStrategy doRecovery INFO: Replication Recovery was successful - registering as Active. core=browse May 28, 2013 5:34:49 PM org.apache.solr.cloud.ZkController publish INFO: publishing core=browse state=active May 28, 2013 5:34:49 PM org.apache.solr.cloud.ZkController publish INFO: numShards not found on descriptor - reading it from system property May 28, 2013 5:34:49 PM org.apache.solr.cloud.RecoveryStrategy doRecovery INFO: Finished recovery process. core=browse May 28, 2013 5:34:49 PM org.apache.solr.cloud.RecoveryStrategy run INFO: Starting recovery process. core=browse recoveringAfterStartup=false May 28, 2013 5:34:49 PM org.apache.solr.common.cloud.ZkStateReader updateClusterState INFO: Updating cloud state from ZooKeeper... May 28, 2013 5:34:49 PM