Re: Regarding improving the performance of Solr
@Shawn: Correct, I am trying to reduce the index size. I am working on reindexing Solr with some of the fields set as indexed but not stored. @Jean: I tried with different caches. It did not show much improvement. On Fri, Sep 6, 2013 at 3:17 PM, Shawn Heisey s...@elyograg.org wrote: On 9/6/2013 2:54 AM, prabu palanisamy wrote: I am currently using Solr 3.5.0, and have indexed a Wikipedia dump (50 GB) with Java 1.6. I am searching Solr with text (which is actually Twitter tweets). Currently it takes an average of 210 milliseconds per post, of which 200 milliseconds are consumed by the Solr server (QTime). I used the JConsole monitoring tool. If the size of all your Solr indexes on disk is in the 50GB range of your wikipedia dump, then for ideal performance, you'll want to have 50GB of free memory so the OS can cache your index. You might be able to get by with 25-30GB of free memory, depending on your index composition. Note that this is memory over and above what you allocate to the Solr JVM, and memory used by other processes on the machine. If you do have other services on the same machine, note that those programs might ALSO require OS disk cache RAM. http://wiki.apache.org/solr/SolrPerformanceProblems#OS_Disk_Cache Thanks, Shawn
No or limited use of FieldCache
Hi, We have a SolrCloud setup handling huge amounts of data. When we do group, facet or sort searches, Solr will use its FieldCache and add data to it for every single document we have. For us it is not realistic that this will ever fit in memory, and we get OOM exceptions. Is there some way of disabling the FieldCache (taking the performance penalty, of course) or making it behave in a nicer way where it only uses up to e.g. 80% of the memory available to the JVM? Or other suggestions? Regards, Per Steffensen
Re: Some highlighted snippets aren't being returned
Thank you, Aloke and Bryan! I'll give this a try and I'll report back on what happens! - Eric On Sep 9, 2013, at 2:32 AM, Aloke Ghoshal alghos...@gmail.com wrote: Hi Eric, As Bryan suggests, you should look at appropriately setting up the fragSize maxAnalyzedChars for long documents. One issue I find with your search request is that in trying to highlight across three separate fields, you have added each of them as a separate request param: hl.fl=contentshl.fl=titlehl.fl=original_url The way to do it would be (http://wiki.apache.org/solr/HighlightingParameters#hl.fl) to pass them as values to one comma (or space) separated field: hl.fl=contents,title,original_url Regards, Aloke On 9/9/13, Bryan Loofbourrow bloofbour...@knowledgemosaic.com wrote: Eric, Your example document is quite long. Are you setting hl.maxAnalyzedChars? If you don't, the highlighter you appear to be using will not look past the first 51,200 characters of the document for snippet candidates. http://wiki.apache.org/solr/HighlightingParameters#hl.maxAnalyzedChars -- Bryan -Original Message- From: Eric O'Hanlon [mailto:elo2...@columbia.edu] Sent: Sunday, September 08, 2013 2:01 PM To: solr-user@lucene.apache.org Subject: Re: Some highlighted snippets aren't being returned Hi again Everyone, I didn't get any replies to this, so I thought I'd re-send in case anyone missed it and has any thoughts. Thanks, Eric On Aug 7, 2013, at 1:51 PM, Eric O'Hanlon elo2...@columbia.edu wrote: Hi Everyone, I'm facing an issue in which my solr query is returning highlighted snippets for some, but not all results. For reference, I'm searching through an index that contains web crawls of human-rights-related websites. I'm running solr as a webapp under Tomcat and I've included the query's solr params from the Tomcat log: ... webapp=/solr-4.2 path=/select params={facet=truesort=score+descgroup.limit=10spellcheck.q=Unanganf.m imetype_code.facet.limit=7hl.simple.pre=codeq.alt=*:*f.organization_t ype__facet.facet.limit=6f.language__facet.facet.limit=6hl=truef.date_of _capture_.facet.limit=6group.field=original_urlhl.simple.post=/code facet.field=domainfacet.field=date_of_capture_facet.field=mimetype _codefacet.field=geographic_focus__facetfacet.field=organization_based_i n__facetfacet.field=organization_type__facetfacet.field=language__facet facet.field=creator_name__facethl.fragsize=600f.creator_name__facet.face t.limit=6facet.mincount=1qf=text^1hl.fl=contentshl.fl=titlehl.fl=orig inal_urlwt=rubyf.geographic_focus__facet.facet.limit=6defType=edismaxr ows=10f.domain.facet.limit=6q=Unanganf.organization_based_in__facet.fac et.limit=6q.op=ANDgroup=truehl.usePhraseHighlighter=true} hits=8 status=0 QTime=108 ... For the query above (which can be simplified to say: find all documents that contain the word unangan and return facets, highlights, etc.), I get five search results. Only three of these are returning highlighted snippets. 
Here's the highlighting portion of the solr response (note: printed in ruby notation because I'm receiving this response in a Rails app): highlighting= {20100602195444/http://www.kontras.org/uu_ri_ham/UU%20Nomor%2023%20Tahun% 202002%20tentang%20Perlindungan%20Anak.pdf= {}, 20100902203939/http://www.kontras.org/uu_ri_ham/UU%20Nomor%2023%20Tahun%2 02002%20tentang%20Perlindungan%20Anak.pdf= {}, 20111202233029/http://www.kontras.org/uu_ri_ham/UU%20Nomor%2023%20Tahun%2 02002%20tentang%20Perlindungan%20Anak.pdf= {}, 20100618201646/http://www.komnasham.go.id/portal/files/39-99.pdf= {contents= [...actual snippet is returned here...]}, 20100902235358/http://www.komnasham.go.id/portal/files/39-99.pdf= {contents= [...actual snippet is returned here...]}, 20110302213056/http://www.komnasham.go.id/publikasi/doc_download/2- uu-no-39-tahun-1999= {contents= [...actual snippet is returned here...]}, 20110302213102/http://www.komnasham.go.id/publikasi/doc_view/2-uu-no- 39-tahun-1999?tmpl=componentformat=raw= {contents= [...actual snippet is returned here...]}, 20120303113654/http://www.iwgia.org/iwgia_files_publications_files/0028_U timut_heritage.pdf= {}} I have eight (as opposed to five) results above because I'm also doing a grouped query, grouping by a field called original_url, and this leads to five grouped results. I've confirmed that my highlight-lacking results DO contain the word unangan, as expected, and this term is appearing in a text field that's indexed and stored, and being searched for all text searches. For example, one of the search results is for a crawl of this document: http://www.iwgia.org/iwgia_files_publications_files/0028_Utimut_heritage.p df And if you view that document on the web, you'll see that it does contain unangan. Has anyone seen this before? And does
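A minimal sketch of the two changes suggested above, with the core name and analysis window size as assumptions (hl.fl takes one comma- or space-separated list, and hl.maxAnalyzedChars raises the 51,200-character default that Bryan mentions):

http://localhost:8983/solr/collection1/select?q=Unangan&defType=edismax&qf=text
    &hl=true
    &hl.fl=contents,title,original_url
    &hl.fragsize=600
    &hl.maxAnalyzedChars=500000

(Line breaks added for readability; the facet and grouping parameters from the original request are unchanged.)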
Re: How to facet data from a multivalued field?
oh got it.. Thanks a lot... On Tue, Sep 10, 2013 at 10:10 PM, Erick Erickson erickerick...@gmail.comwrote: You can't facet on fields where indexed=false. When you look at output docs, you're seeing _stored_ not indexed data. Set indexed=true and re-index... Best, Erick On Tue, Sep 10, 2013 at 5:51 AM, Rah1x raheel_itst...@yahoo.com wrote: Hi buddy, I am having this problem that I cant even reach to what you did at first step.. all I get is: lst name=series / This is the schema: field name=series type=string indexed=false stored=true required=false omitTermFreqAndPositions=true multiValued=true / Note: the data is correctly placed in the field as the query results shows. However, the facet is not working. Could you please share the schema of what you did to achieve it? Thanks a lot. -- View this message in context: http://lucene.472066.n3.nabble.com/How-to-facet-data-from-a-multivalued-field-tp3897853p4089045.html Sent from the Solr - User mailing list archive at Nabble.com. -- Regards, Raheel Hasan
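For reference, the field definition from the original post with the one change Erick describes (a sketch; everything other than indexed=true is as posted):

<field name="series" type="string" indexed="true" stored="true" required="false" omitTermFreqAndPositions="true" multiValued="true"/>

After re-indexing, facet.field=series should return per-value counts for the multivalued field.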
Re: charfilter doesn't do anything
perfect, i tried it before but always at the tail of the expression with no effect. thanks a lot. a last question, do you know how to keep the html comments from being filtered before the transformer has done its work? On 10. Sep 2013, at 3:17 PM, Jack Krupansky wrote: Okay, I can repro the problem. Yes, in appears that the pattern replace char filter does not default to multiline mode for pattern matching, so body on one line and /body on another line cannot be matched. Now, whether that is by design or a bug or an option for enhancement is a matter for some committer to comment on. But, the good news is that you can in fact set multiline mode in your pattern my starting it with (?s), which means that dot accepts line break characters as well. So, here are my revised field types: fieldType name=text_html_body class=solr.TextField positionIncrementGap=100 analyzer charFilter class=solr.PatternReplaceCharFilterFactory pattern=(?s)^.*lt;bodygt;(.*)lt;/bodygt;.*$ replacement=$1 / tokenizer class=solr.StandardTokenizerFactory/ filter class=solr.LowerCaseFilterFactory/ /analyzer /fieldType fieldType name=text_html_body_strip class=solr.TextField positionIncrementGap=100 analyzer charFilter class=solr.PatternReplaceCharFilterFactory pattern=(?s)^.*lt;bodygt;(.*)lt;/bodygt;.*$ replacement=$1 / charFilter class=solr.HTMLStripCharFilterFactory / tokenizer class=solr.StandardTokenizerFactory/ filter class=solr.LowerCaseFilterFactory/ /analyzer /fieldType The first type accepts everything within body, including nested HTML formatting, while the latter strips nested HTML formatting as well. The tokenizer will in fact strip out white space, but that happens after all character filters have completed. -- Jack Krupansky -Original Message- From: Andreas Owen Sent: Tuesday, September 10, 2013 7:07 AM To: solr-user@lucene.apache.org Subject: Re: charfilter doesn't do anything ok i am getting there now but if there are newlines involved the regex stops as soon as it reaches a \r\n even if i try [\t\r\n.]* in the regex. I have to get rid of the newlines. why isn't whitespaceTokenizerFactory the right element for this? On 10. Sep 2013, at 1:21 AM, Jack Krupansky wrote: Use XML then. Although you will need to escape the XML special characters as I did in the pattern. The point is simply: Quickly and simply try to find the simple test scenario that illustrates the problem. -- Jack Krupansky -Original Message- From: Andreas Owen Sent: Monday, September 09, 2013 7:05 PM To: solr-user@lucene.apache.org Subject: Re: charfilter doesn't do anything i tried but that isn't working either, it want a data-stream, i'll have to check how to post json instead of xml On 10. Sep 2013, at 12:52 AM, Jack Krupansky wrote: Did you at least try the pattern I gave you? The point of the curl was the data, not how you send the data. You can just use the standard Solr simple post tool. -- Jack Krupansky -Original Message- From: Andreas Owen Sent: Monday, September 09, 2013 6:40 PM To: solr-user@lucene.apache.org Subject: Re: charfilter doesn't do anything i've downloaded curl and tried it in the comman prompt and power shell on my win 2008r2 server, thats why i used my dataimporter with a single line html file and copy/pastet the lines into schema.xml On 9. Sep 2013, at 11:20 PM, Jack Krupansky wrote: Did you in fact try my suggested example? If not, please do so. 
-- Jack Krupansky -Original Message- From: Andreas Owen Sent: Monday, September 09, 2013 4:42 PM To: solr-user@lucene.apache.org Subject: Re: charfilter doesn't do anything i index html pages with a lot of lines and not just a string with the body-tag. it doesn't work with proper html files, even though i took all the new lines out. html-file: htmlnav-contentbody nur das will ich sehen/bodyfooter-content/html solr update debug output: text_html: [html\r\n\r\nmeta name=\Content-Encoding\ content=\ISO-8859-1\\r\nmeta name=\Content-Type\ content=\text/html; charset=ISO-8859-1\\r\ntitle/title\r\n\r\nbodynav-content nur das will ich sehenfooter-content/body/html] On 8. Sep 2013, at 3:28 PM, Jack Krupansky wrote: I tried this and it seems to work when added to the standard Solr example in 4.4: field name=body type=text_html_body indexed=true stored=true / fieldType name=text_html_body class=solr.TextField positionIncrementGap=100 analyzer charFilter class=solr.PatternReplaceCharFilterFactory pattern=^.*lt;bodygt;(.*)lt;/bodygt;.*$ replacement=$1 / tokenizer class=solr.StandardTokenizerFactory/ filter class=solr.LowerCaseFilterFactory/ /analyzer /fieldType That char filter retains only text between body and /body. Is that what you wanted? Indexing this data: curl 'localhost:8983/solr/update?commit=true' -H
Dynamic analyzer settings change
Let's take the following type definition and schema (borrowed from Rafal Kuc's Solr 4 cookbook): fieldType name=text class=solr.TextField positionIncrementGap=100 analyzer tokenizer class=solr.WhitespaceTokenizerFactory/ filter class=solr.LowerCaseFilterFactory/ filter class=solr.SnowballPorterFilterFactory language=English/ /analyzer /fieldType and schema: field name=id type=string indexed=true stored=true required=true / field name=title type=text indexed=true stored=true / The above analyzer will apply the English SnowballPorterFilter. But would it be possible to change the language to French during indexing for some documents? If not, what would be the best solution for having the same analyzer but with different languages, with the language determined at index time? Thanks! -- View this message in context: http://lucene.472066.n3.nabble.com/Dynamic-analizer-settings-change-tp4089274.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: No or limited use of FieldCache
On 9/11/13 3:11 AM, Per Steffensen wrote: Hi We have a SolrCloud setup handling huge amounts of data. When we do group, facet or sort searches Solr will use its FieldCache, and add data in it for every single document we have. For us it is not realistic that this will ever fit in memory and we get OOM exceptions. Are there some way of disabling the FieldCache (taking the performance penalty of course) or make it behave in a nicer way where it only uses up to e.g. 80% of the memory available to the JVM? Or other suggestions? Regards, Per Steffensen I think you might want to look into using DocValues fields, which are column-stride fields stored as compressed arrays - one value per document -- for the fields on which you are sorting and faceting. My understanding (which is limited) is that these avoid the use of the field cache, and I believe you have the option to control whether they are held in memory or on disk. I hope someone who knows more will elaborate... -Mike
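A sketch of what Michael describes in schema.xml terms, with placeholder field names; docValues=true (Solr 4.2+) builds the column-stride structure at index time for the fields used in sorting, faceting and grouping, and requires a re-index:

<field name="category" type="string" indexed="true" stored="true" docValues="true"/>
<field name="created"  type="tdate"  indexed="true" stored="true" docValues="true"/>

Whether the values are held in memory or on disk is controlled by the docValuesFormat chosen for the field type (e.g. docValuesFormat="Disk" on the fieldType), though the exact options depend on the Solr version.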
Re: Solr doesn't return an answer when searching numbers
Mail guy. You've been around long enough to know to try adding debug=query to your URL and looking at the results; what does that show? Best Erick On Tue, Sep 10, 2013 at 9:25 AM, Mysurf Mail stammail...@gmail.com wrote: I am querying using http://...:8983/solr/vault/select?q=design test&fl=PackageName I get 3 results: - design test - design test 2013 - design test for jobs Now when I query using q=test for jobs - I get only design test for jobs. But when I query using q=2013 http://...:8983/solr/vault/select?q=2013&fl=PackageName I get no result. Why doesn't it return an answer when I query with numbers? In schema.xml: field name=PackageName type=text_en indexed=true stored=true required=true/
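To make Erick's suggestion concrete, something along these lines (host and core as in the original post) shows how the query is actually parsed:

http://...:8983/solr/vault/select?q=2013&fl=PackageName&debug=query

The parsedquery entry in the debug section of the response shows what the text_en analysis chain turns 2013 into, which usually points at the culprit.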
Re: SolrCloud 4.x hangs under high update volume
If you use CloudSolrServer, you need to apply SOLR-4816 or use a recent copy of the 4x branch. By recent, I mean like today, it looks like Mark applied this early this morning. But several reports indicate that this will solve your problem. I would expect that increasing the number of shards would make the problem worse, not better. There's also SOLR-5232... Best Erick On Tue, Sep 10, 2013 at 5:20 PM, Tim Vaillancourt t...@elementspace.comwrote: Hey guys, Based on my understanding of the problem we are encountering, I feel we've been able to reduce the likelihood of this issue by making the following changes to our app's usage of SolrCloud: 1) We increased our document batch size to 200 from 10 - our app batches updates to reduce HTTP requests/overhead. The theory is increasing the batch size reduces the likelihood of this issue happening. 2) We reduced to 1 application node sending updates to SolrCloud - we write Solr updates to Redis, and have previously had 4 application nodes pushing the updates to Solr (popping off the Redis queue). Reducing the number of nodes pushing to Solr reduces the concurrency on SolrCloud. 3) Less threads pushing to SolrCloud - due to the increase in batch size, we were able to go down to 5 update threads on the update-pushing-app (from 10 threads). To be clear the above only reduces the likelihood of the issue happening, and DOES NOT actually resolve the issue at hand. If we happen to encounter issues with the above 3 changes, the next steps (I could use some advice on) are: 1) Increase the number of shards (2x) - the theory here is this reduces the locking on shards because there are more shards. Am I onto something here, or will this not help at all? 2) Use CloudSolrServer - currently we have a plain-old least-connection HTTP VIP. If we go direct to what we need to update, this will reduce concurrency in SolrCloud a bit. Thoughts? Thanks all! Cheers, Tim On 6 September 2013 14:47, Tim Vaillancourt t...@elementspace.com wrote: Enjoy your trip, Mark! Thanks again for the help! Tim On 6 September 2013 14:18, Mark Miller markrmil...@gmail.com wrote: Okay, thanks, useful info. Getting on a plane, but ill look more at this soon. That 10k thread spike is good to know - that's no good and could easily be part of the problem. We want to keep that from happening. Mark Sent from my iPhone On Sep 6, 2013, at 2:05 PM, Tim Vaillancourt t...@elementspace.com wrote: Hey Mark, The farthest we've made it at the same batch size/volume was 12 hours without this patch, but that isn't consistent. Sometimes we would only get to 6 hours or less. During the crash I can see an amazing spike in threads to 10k which is essentially our ulimit for the JVM, but I strangely see no OutOfMemory: cannot open native thread errors that always follow this. Weird! We also notice a spike in CPU around the crash. The instability caused some shard recovery/replication though, so that CPU may be a symptom of the replication, or is possibly the root cause. The CPU spikes from about 20-30% utilization (system + user) to 60% fairly sharply, so the CPU, while spiking isn't quite pinned (very beefy Dell R720s - 16 core Xeons, whole index is in 128GB RAM, 6xRAID10 15k). More on resources: our disk I/O seemed to spike about 2x during the crash (about 1300kbps written to 3500kbps), but this may have been the replication, or ERROR logging (we generally log nothing due to WARN-severity unless something breaks). 
Lastly, I found this stack trace occurring frequently, and have no idea what it is (may be useful or not): java.lang.IllegalStateException : at org.eclipse.jetty.server.Response.resetBuffer(Response.java:964) at org.eclipse.jetty.server.Response.sendError(Response.java:325) at org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:692) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:380) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1423) at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:450) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:138) at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:564) at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:213) at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1083) at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:379) at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:175) at
Re: Regarding improving the performance of Solr
Be a little careful when extrapolating from disk to memory. Any fields where you've set stored=true will put data in segment files with extensions .fdt and .fdx, see These are the compressed verbatim copy of the data for stored fields and have very little impact on memory required for searching. I've seen indexes where 75% of the data is stored and indexes where 5% of the data is stored. Summary of File Extensions here: http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/codecs/lucene40/package-summary.html Best, Erick On Wed, Sep 11, 2013 at 2:57 AM, prabu palanisamy pr...@serendio.comwrote: @Shawn: Correctly I am trying to reduce the index size. I am working on reindex the solr with some of the features as indexed and not stored @Jean: I tried with different caches. It did not show much improvement. On Fri, Sep 6, 2013 at 3:17 PM, Shawn Heisey s...@elyograg.org wrote: On 9/6/2013 2:54 AM, prabu palanisamy wrote: I am currently using solr -3.5.0, indexed wikipedia dump (50 gb) with java 1.6. I am searching the solr with text (which is actually twitter tweets) . Currently it takes average time of 210 millisecond for each post, out of which 200 millisecond is consumed by solr server (QTime). I used the jconsole monitor tool. If the size of all your Solr indexes on disk is in the 50GB range of your wikipedia dump, then for ideal performance, you'll want to have 50GB of free memory so the OS can cache your index. You might be able to get by with 25-30GB of free memory, depending on your index composition. Note that this is memory over and above what you allocate to the Solr JVM, and memory used by other processes on the machine. If you do have other services on the same machine, note that those programs might ALSO require OS disk cache RAM. http://wiki.apache.org/solr/SolrPerformanceProblems#OS_Disk_Cache Thanks, Shawn
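As an illustration of the change prabu describes (a sketch with made-up field names), keeping a field searchable while dropping its verbatim copy from the .fdt/.fdx stored-field files is a matter of stored=false, followed by a full re-index:

<field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>
<copyField source="title" dest="text"/>

copyField targets in particular rarely need to be stored, since the original source fields can be stored and returned instead.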
Re: Dynamic analyzer settings change
I wouldn't :). Here's the problem. Say you do this successfully at index time. How do you then search reasonably? There's often not near enough information to know what the search language is, there's little or no context. If the number of languages is limited, people often index into separate language-specific fields, say title_fr and title_en and use edismax to automatically distribute queries against all the fields. Others index families of languages in separate fields using things like the folding filters for Western languages, another field for, say, CJK languages and another for Middle Eastern languages etc. FWIW, Erick On Wed, Sep 11, 2013 at 6:55 AM, maephisto my_sky...@yahoo.com wrote: Let's take the following type definition and schema (borrowed from Rafal Kuc's Solr 4 cookbook) : fieldType name=text class=solr.TextField positionIncrementGap=100 analyzer tokenizer class=solr.WhitespaceTokenizerFactory/ filter class=solr.LowerCaseFilterFactory/ filter class=solr.SnowballPorterFilterFactory language=English/ /analyzer /fieldType and schema: field name=id type=string indexed=true stored=true required=true / field name=title type=text indexed=true stored=true / The above analizer will apply SnowballPorterFilter english language filter. But would it be possible to change the language to french during indexing for some documents. is this possible? If not, what would be the best solution for having the same analizer but with different languages, which languange being determined at index time ? Thanks! -- View this message in context: http://lucene.472066.n3.nabble.com/Dynamic-analizer-settings-change-tp4089274.html Sent from the Solr - User mailing list archive at Nabble.com.
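A sketch of the separate-fields approach, with hypothetical field names (text_en exists in the stock example schema; text_fr and text_de would need equivalent fieldTypes with French and German stemmers):

<field name="title_en" type="text_en" indexed="true" stored="true"/>
<field name="title_fr" type="text_fr" indexed="true" stored="true"/>
<field name="title_de" type="text_de" indexed="true" stored="true"/>

...&defType=edismax&qf=title_en+title_fr+title_de

If the application already knows the user's language, qf can name just the one field, which is what maephisto's follow-up later in the thread gets at.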
Re: No or limited use of FieldCache
I don't know any more than Michael, but I'd _love_ some reports from the field. There are some restriction on DocValues though, I believe one of them is that they don't really work on analyzed data FWIW, Erick On Wed, Sep 11, 2013 at 7:00 AM, Michael Sokolov msoko...@safaribooksonline.com wrote: On 9/11/13 3:11 AM, Per Steffensen wrote: Hi We have a SolrCloud setup handling huge amounts of data. When we do group, facet or sort searches Solr will use its FieldCache, and add data in it for every single document we have. For us it is not realistic that this will ever fit in memory and we get OOM exceptions. Are there some way of disabling the FieldCache (taking the performance penalty of course) or make it behave in a nicer way where it only uses up to e.g. 80% of the memory available to the JVM? Or other suggestions? Regards, Per Steffensen I think you might want to look into using DocValues fields, which are column-stride fields stored as compressed arrays - one value per document -- for the fields on which you are sorting and faceting. My understanding (which is limited) is that these avoid the use of the field cache, and I believe you have the option to control whether they are held in memory or on disk. I hope someone who knows more will elaborate... -Mike
Re: Stemming and protwords configuration
Did you try putting them _all_ in protwords.txt? i.e. frais, fraise, fraises? Don't forget to re-index. An alternative is to index in a second field that doesn't have the stemmer and when you want exact matches, search against that field. Best Erick On Mon, Sep 9, 2013 at 10:29 AM, csicard@orange.com wrote: Hi, We have a Solr server using stemming: filter class=solr.SnowballPorterFilterFactory language=French protected=protwords.txt / I would like to query the French words frais and fraise separately. I put the word fraise in protwords.txt file. - When I query the word fraise, no document indexed with the word frais are found. - When I query the word frais, I've got documents indexed with the word fraise. Is there a way to do not match fraises documents in the second situation ? I hope this is clear. Thanks for your reply. Christophe _ Ce message et ses pieces jointes peuvent contenir des informations confidentielles ou privilegiees et ne doivent donc pas etre diffuses, exploites ou copies sans autorisation. Si vous avez recu ce message par erreur, veuillez le signaler a l'expediteur et le detruire ainsi que les pieces jointes. Les messages electroniques etant susceptibles d'alteration, Orange decline toute responsabilite si ce message a ete altere, deforme ou falsifie. Merci. This message and its attachments may contain confidential or privileged information that may be protected by law; they should not be distributed, used or copied without authorisation. If you have received this email in error, please notify the sender and delete this message and its attachments. As emails may be altered, Orange is not liable for messages that have been modified, changed or falsified. Thank you.
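For the record, protwords.txt is one protected term per line, so Erick's first suggestion would look something like this (whether all three forms belong there depends on the matching you want):

# protwords.txt - terms the SnowballPorterFilterFactory leaves unstemmed
frais
fraise
fraises

The second suggestion is a parallel unstemmed field: copyField the French text into a field whose type omits the stemmer (field and type names here are made up) and query that field when exact matches are required:

<copyField source="text_fr" dest="text_fr_exact"/>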
solrj-httpclient-slow
Hi everyone, When I track my Solr client timing, I find one problem: sometimes the whole execution time is very long, but when I look at the details, the Solr server itself executes quickly; the main cost is inside HttpClient (making a connection, sending the request, receiving the response, and so on). I am not familiar with the HttpClient internals. Has anyone met the same problem? Although I updated to a newer SolrJ version, the problem remains. By the way, my SolrJ version is 4.2 and Solr is 3.*. Thanks a lot -- View this message in context: http://lucene.472066.n3.nabble.com/solrj-httpclient-slow-tp4089287.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Dynamic analyzer settings change
Thanks, Erik! I might have missed mentioning something relevant. When querying Solr, I wouldn't actually need to query all fields, but only the one corresponding to the language picked by the user on the website. If he's using DE, then the search should only apply to the text_de field. What if I need to work with 50 different languages? Then I would get a schema with 50 types and 50 fields (text_en, text_fr, text_de, ...): won't this affect the performance ? bigger documents - slower queries. -- View this message in context: http://lucene.472066.n3.nabble.com/Dynamic-analizer-settings-change-tp4089274p4089288.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Profiling Solr Lucene for query
Dmitry - currently we don't have such a front end, this sounds like a good idea creating it. And yes, we do query all 36 shards every query. Mikhail - I do think 1 minute is enough data, as during this exact minute I had a single query running (that took a qtime of 1 minute). I wanted to isolate these hard queries. I repeated this profiling few times. I think I will take the termInterval from 128 to 32 and check the results. I'm currently using NRTCachingDirectoryFactory On Mon, Sep 9, 2013 at 11:29 PM, Dmitry Kan solrexp...@gmail.com wrote: Hi Manuel, The frontend solr instance is the one that does not have its own index and is doing merging of the results. Is this the case? If yes, are all 36 shards always queried? Dmitry On Mon, Sep 9, 2013 at 10:11 PM, Manuel Le Normand manuel.lenorm...@gmail.com wrote: Hi Dmitry, I have solr 4.3 and every query is distributed and merged back for ranking purpose. What do you mean by frontend solr? On Mon, Sep 9, 2013 at 2:12 PM, Dmitry Kan solrexp...@gmail.com wrote: are you querying your shards via a frontend solr? We have noticed, that querying becomes much faster if results merging can be avoided. Dmitry On Sun, Sep 8, 2013 at 6:56 PM, Manuel Le Normand manuel.lenorm...@gmail.com wrote: Hello all Looking on the 10% slowest queries, I get very bad performances (~60 sec per query). These queries have lots of conditions on my main field (more than a hundred), including phrase queries and rows=1000. I do return only id's though. I can quite firmly say that this bad performance is due to slow storage issue (that are beyond my control for now). Despite this I want to improve my performances. As tought in school, I started profiling these queries and the data of ~1 minute profile is located here: http://picpaste.com/pics/IMG_20130908_132441-ZyrfXeTY.1378637843.jpg Main observation: most of the time I do wait for readVInt, who's stacktrace (2 out of 2 thread dumps) is: catalina-exec-3870 - Thread t@6615 java.lang.Thread.State: RUNNABLE at org.apadhe.lucene.store.DataInput.readVInt(DataInput.java:108) at org.apaChe.lucene.codeosAockTreeIermsReade$FieldReader$SegmentTermsEnumFrame.loadBlock(BlockTreeTermsReader.java: 2357) at ora.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.seekExact(BlockTreeTermsReader.java:1745) at org.apadhe.lucene.index.TermContext.build(TermContext.java:95) at org.apache.lucene.search.PhraseQuery$PhraseWeight.init(PhraseQuery.java:221) at org.apache.lucene.search.PhraseQuery.createWeight(PhraseQuery.java:326) at org.apache.lucene.search.BooleanQuery$BooleanWeight.init(BooleanQuery.java:183) at org.apache.lucene.search.BooleanQuery.createWeight(BooleanQuery.java:384) at org.apache.lucene.searth.BooleanQuery$BooleanWeight.init(BooleanQuery.java:183) at oro.apache.lucene.search.BooleanQuery.createWeight(BooleanQuery.java:384) at org.apache.lucene.searth.BooleanQuery$BooleanWeight.init(BooleanQuery.java:183) at org.apache.lucene.search.BooleanQuery.createWeight(BooleanQuery.java:384) at org.apache.lucene.search.IndexSearcher.createNormalizedWeight(IndexSearcher.java:675) at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:297) So I do actually wait for IO as expected, but I might be too many time page faulting while looking for the TermBlocks (tim file), ie locating the term. As I reindex now, would it be useful lowering down the termInterval (default to 128)? As the FST (tip files) are that small (few 10-100 MB) so there are no memory contentions, could I lower down this param to 8 for example? 
The benefit from lowering down the term interval would be to obligate the FST to get on memory (JVM - thanks to the NRTCachingDirectory) as I do not control the term dictionary file (OS caching, loads an average of 6% of it). General configs: solr 4.3 36 shards, each has few million docs These 36 servers (each server has 2 replicas) are running virtual, 16GB memory each (4GB for JVM, 12GB remain for the OS caching), consuming 260GB of disk mounted for the index files.
RE: Dynamic analyzer settings change
-Original message- From:maephisto my_sky...@yahoo.com Sent: Wednesday 11th September 2013 14:34 To: solr-user@lucene.apache.org Subject: Re: Dynamic analizer settings change Thanks, Erik! I might have missed mentioning something relevant. When querying Solr, I wouldn't actually need to query all fields, but only the one corresponding to the language picked by the user on the website. If he's using DE, then the search should only apply to the text_de field. What if I need to work with 50 different languages? Then I would get a schema with 50 types and 50 fields (text_en, text_fr, text_de, ...): won't this affect the performance ? bigger documents - slower queries. Yes, that will affect performance greatly! The problem is not searching 50 languages but when using (e)dismax, the problem is creating the entire query. You will see good performance in the `process` part of a search but poor performance in the `prepare` part of the search when debugging. -- View this message in context: http://lucene.472066.n3.nabble.com/Dynamic-analizer-settings-change-tp4089274p4089288.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: No or limited use of FieldCache
The reason I mention sort is that we in my project, half a year ago, have dealt with the FieldCache-OOM-problem when doing sort-requests. We basically just reject sort-requests unless they hit below X documents - in case they do we just find them without sorting and sort them ourselves afterwards. Currently our problem is, that we have to do a group/distinct (in SQL-language) query and we have found that we can do what we want to do using group (http://wiki.apache.org/solr/FieldCollapsing) or facet - either will work for us. Problem is that they both use FieldCache and we know that using FieldCache will lead to OOM-execptions with the amount of data each of our Solr-nodes administrate. This time we have really no option of just limit usage as we did with sort. Therefore we need a group/distinct-functionality that works even on huge data-amounts (and a algorithm using FieldCache will not) I believe setting facet.method=enum will actually make facet not use the FieldCache. Is that true? Is it a bad idea? I do not know much about DocValues, but I do not believe that you will avoid FieldCache by using DocValues? Please elaborate, or point to documentation where I will be able to read that I am wrong. Thanks! Regards, Per Steffensen On 9/11/13 1:38 PM, Erick Erickson wrote: I don't know any more than Michael, but I'd _love_ some reports from the field. There are some restriction on DocValues though, I believe one of them is that they don't really work on analyzed data FWIW, Erick
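For what it's worth, facet.method=enum enumerates the terms in the field and intersects a filter per term, using the filterCache rather than the FieldCache, so memory use is bounded by the filterCache size. A hedged experiment (category is a placeholder field name) would be:

facet=true&facet.field=category&facet.method=enum&facet.enum.cache.minDf=100&facet.mincount=1

facet.enum.cache.minDf skips the filterCache for rare terms; whether enum is fast enough on a high-cardinality field is something to measure, since it is usually slower there than the FieldCache-based method.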
Re: Dynamic analyzer settings change
Yes, supporting multiple languages will be a performance hit, but maybe it won't be so bad since all but one of these language-specific fields will be empty for each document and Lucene text search should handle empty field values just fine. If you can't accept that performance hit, don't support multiple languages! It is completely your choice. There are index-time update processors that can do language detection and then automatically direct the text to the proper text_xx field. See: https://cwiki.apache.org/confluence/display/solr/Detecting+Languages+During+Indexing Although my e-book has a lot better examples, especially for the field redirection aspect. -- Jack Krupansky -Original Message- From: maephisto Sent: Wednesday, September 11, 2013 8:33 AM To: solr-user@lucene.apache.org Subject: Re: Dynamic analizer settings change Thanks, Erik! I might have missed mentioning something relevant. When querying Solr, I wouldn't actually need to query all fields, but only the one corresponding to the language picked by the user on the website. If he's using DE, then the search should only apply to the text_de field. What if I need to work with 50 different languages? Then I would get a schema with 50 types and 50 fields (text_en, text_fr, text_de, ...): won't this affect the performance ? bigger documents - slower queries. -- View this message in context: http://lucene.472066.n3.nabble.com/Dynamic-analizer-settings-change-tp4089274p4089288.html Sent from the Solr - User mailing list archive at Nabble.com.
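A sketch of such a chain based on the page Jack links, with made-up field names and a small language whitelist; the processor detects the language from the listed fields, records it, and maps the text into a per-language field such as text_en or text_fr:

<updateRequestProcessorChain name="langid">
  <processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
    <str name="langid.fl">title,text</str>
    <str name="langid.langField">language</str>
    <bool name="langid.map">true</bool>
    <str name="langid.map.fl">text</str>
    <str name="langid.whitelist">en,fr,de</str>
    <str name="langid.fallback">en</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

The chain is selected per update request with update.chain=langid, or wired into the update handler's defaults.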
charset encoding
I'm using Solr 4.3.1 with Tika to index HTML pages. The HTML files are ISO-8859-1 (ANSI) encoded, and the content-encoding meta tag says so as well. The server HTTP header says it's UTF-8, and the Firefox web developer tools agree. When I index a page with special chars like ä, ö, ü, Solr outputs completely foreign characters, not the usual mangled chars with 1/4 or the flag in them, so it seems it's not simply the normal UTF-8/ISO-8859-1 discrepancy. Has anyone got an idea what's wrong?
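Not an answer, but one way to narrow it down (assuming the pages are posted to the extracting handler; adjust if the data import handler is in use): declare the charset explicitly on the request and check whether the umlauts survive, which at least tells you whether the mangling happens before or inside Solr. The URL, document id and file name here are placeholders:

curl "http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true" \
  -H "Content-Type: text/html; charset=ISO-8859-1" \
  --data-binary @page.html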
Re: No or limited use of FieldCache
On 09/11/2013 08:40 AM, Per Steffensen wrote: The reason I mention sort is that we in my project, half a year ago, have dealt with the FieldCache-OOM-problem when doing sort-requests. We basically just reject sort-requests unless they hit below X documents - in case they do we just find them without sorting and sort them ourselves afterwards. Currently our problem is, that we have to do a group/distinct (in SQL-language) query and we have found that we can do what we want to do using group (http://wiki.apache.org/solr/FieldCollapsing) or facet - either will work for us. Problem is that they both use FieldCache and we know that using FieldCache will lead to OOM-execptions with the amount of data each of our Solr-nodes administrate. This time we have really no option of just limit usage as we did with sort. Therefore we need a group/distinct-functionality that works even on huge data-amounts (and a algorithm using FieldCache will not) I believe setting facet.method=enum will actually make facet not use the FieldCache. Is that true? Is it a bad idea? I do not know much about DocValues, but I do not believe that you will avoid FieldCache by using DocValues? Please elaborate, or point to documentation where I will be able to read that I am wrong. Thanks! There is Simon Willnauer's presentation http://www.slideshare.net/lucenerevolution/willnauer-simon-doc-values-column-stride-fields-in-lucene and this blog post http://blog.trifork.com/2011/10/27/introducing-lucene-index-doc-values/ and this one that shows some performance comparisons: http://searchhub.org/2013/04/02/fun-with-docvalues-in-solr-4-2/
Re: Facet values for spatial field
Hi Eric (and others), thanx for the the explanation. This helps. For the usecase: I am cataloging findings of field expeditions. The collectors usualy store a single location for the field trip, so the numer of locations is limited. Regards Chris Von: Erick Erickson [erickerick...@gmail.com] Gesendet: Dienstag, 10. September 2013 19:14 Bis: solr-user@lucene.apache.org Betreff: Re: Facet values for spacial field You might be able to facet by query, but faceting by location fields doesn't make a huge amount of sense, you'll have lots of facets on individual lat/lon points. What is the use-case you are trying to support here? Best, Erick On Tue, Sep 10, 2013 at 8:43 AM, Christian Köhler - ZFMK c.koeh...@zfmk.dewrote: Hi, I use the new SpatialRecursivePrefixTreeFiel**dType field to store geo coordinates (e.g. 14.021666,51.5433353 ). I can retrieve the coordinates just find so I am sure they are indexed correctly. However when I try to create facets from this field, solr returns something which looks like a hash of the coordinates: Schema: ?xml version=1.0 encoding=UTF-8 ? schema name=example version=1.5 types ... fieldType name=location class=solr.**SpatialRecursivePrefixTreeFiel**dType units=degrees / ... field name=geo_locality type=location indexed=true stored=true / /schema Result: http://localhost/solr/browse?**facet=truefacet.field=geo_**localityhttp://localhost/solr/browse?facet=truefacet.field=geo_locality- ... lst name=facet_fields lst name=geo_locality int name=7zz660/int int name=t4m70cmvej9290/int int name=t4187pnmky3214/int int name=t441z6vwv3j179/int int name=t4328x4s6dj165/int int name=t1c639yyxdr143/int ... /lst /lst Filtering by this hashes fails: http://localhost/solr/browse?**q=fq=geo_localityhttp://localhost/solr/browse?q=fq=geo_locality :**t4m70cmvej9 java.lang.**IllegalArgumentException: missing parens: t4m70cmvej9 How do I get the results of a single location using faceting? Any thoughts? Regards Chris -- Christian Köhler Zoologisches Forschungsmuseum Alexander Koenig Leibniz-Institut für Biodiversität der Tiere Adenauerallee 160, 53113 Bonn, Germany www.zfmk.de Stiftung des öffentlichen Rechts Direktor: Prof. J. Wolfgang Wägele Sitz: Bonn -- Zoologisches Forschungsmuseum Alexander Koenig - Leibniz-Institut für Biodiversität der Tiere - Adenauerallee 160, 53113 Bonn, Germany www.zfmk.de Stiftung des öffentlichen Rechts; Direktor: Prof. J. Wolfgang Wägele Sitz: Bonn -- Zoologisches Forschungsmuseum Alexander Koenig - Leibniz-Institut für Biodiversität der Tiere - Adenauerallee 160, 53113 Bonn, Germany www.zfmk.de Stiftung des öffentlichen Rechts; Direktor: Prof. J. Wolfgang Wägele Sitz: Bonn
Re: Dynamic analyzer settings change
Thanks Jack! Indeed, very nice examples in your book. Inspired from there, here's a crazy idea: would it be possible to build a custom processor chain that would detect the language and use it to apply filters, like the aforementioned SnowballPorterFilter. That would leave at the end a document having as fields: text(with filtered content) and language(the one determined by the processor). And at search time, always append the language=user selected language. Does this make sense? If so, would it affect the performance at index time? Thanks! -- View this message in context: http://lucene.472066.n3.nabble.com/Dynamic-analizer-settings-change-tp4089274p4089305.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: solrj-httpclient-slow
First, I would be wary of mixing the solrj version with a different solr version. They are pretty compatible but what are you expecting to gain for the risk? Regardless, though, that shouldn't be your problem. You'll have to give us a lot more detail about what you're trying to do, what you mean by slow (300ms? 300 secnds?) and what you expect. Best Erick On Wed, Sep 11, 2013 at 7:44 AM, xiaoqi belivexia...@gmail.com wrote: hi,everyone when i track my solr client timing cost , i find one problem : some time the whole execute time is very long ,when i go to detail ,i find the solr server execute short time , then the main costs inside httpclient (make a connection ,send request or recived response ,blablabla. i am not familar httpclient inside code . does anyone met the same problem ? although , i update solrj 's new version ,the problem still. by the way : my solrj version is : 4.2 ,solr is 3.* Thanks a lot -- View this message in context: http://lucene.472066.n3.nabble.com/solrj-httpclient-slow-tp4089287.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Facet values for spatial field
It seems like the right thing to do here is store something more intelligible than an encoded lat/lon pair and facet on that instead. lat/lon, even bare are not all that useful without some effort anywa... FWIW, Erick On Wed, Sep 11, 2013 at 9:24 AM, Köhler Christian c.koeh...@zfmk.de wrote: Hi Eric (and others), thanx for the the explanation. This helps. For the usecase: I am cataloging findings of field expeditions. The collectors usualy store a single location for the field trip, so the numer of locations is limited. Regards Chris Von: Erick Erickson [erickerick...@gmail.com] Gesendet: Dienstag, 10. September 2013 19:14 Bis: solr-user@lucene.apache.org Betreff: Re: Facet values for spacial field You might be able to facet by query, but faceting by location fields doesn't make a huge amount of sense, you'll have lots of facets on individual lat/lon points. What is the use-case you are trying to support here? Best, Erick On Tue, Sep 10, 2013 at 8:43 AM, Christian Köhler - ZFMK c.koeh...@zfmk.dewrote: Hi, I use the new SpatialRecursivePrefixTreeFiel**dType field to store geo coordinates (e.g. 14.021666,51.5433353 ). I can retrieve the coordinates just find so I am sure they are indexed correctly. However when I try to create facets from this field, solr returns something which looks like a hash of the coordinates: Schema: ?xml version=1.0 encoding=UTF-8 ? schema name=example version=1.5 types ... fieldType name=location class=solr.**SpatialRecursivePrefixTreeFiel**dType units=degrees / ... field name=geo_locality type=location indexed=true stored=true / /schema Result: http://localhost/solr/browse?**facet=truefacet.field=geo_**locality http://localhost/solr/browse?facet=truefacet.field=geo_locality- ... lst name=facet_fields lst name=geo_locality int name=7zz660/int int name=t4m70cmvej9290/int int name=t4187pnmky3214/int int name=t441z6vwv3j179/int int name=t4328x4s6dj165/int int name=t1c639yyxdr143/int ... /lst /lst Filtering by this hashes fails: http://localhost/solr/browse?**q=fq=geo_locality http://localhost/solr/browse?q=fq=geo_locality :**t4m70cmvej9 java.lang.**IllegalArgumentException: missing parens: t4m70cmvej9 How do I get the results of a single location using faceting? Any thoughts? Regards Chris -- Christian Köhler Zoologisches Forschungsmuseum Alexander Koenig Leibniz-Institut für Biodiversität der Tiere Adenauerallee 160, 53113 Bonn, Germany www.zfmk.de Stiftung des öffentlichen Rechts Direktor: Prof. J. Wolfgang Wägele Sitz: Bonn -- Zoologisches Forschungsmuseum Alexander Koenig - Leibniz-Institut für Biodiversität der Tiere - Adenauerallee 160, 53113 Bonn, Germany www.zfmk.de Stiftung des öffentlichen Rechts; Direktor: Prof. J. Wolfgang Wägele Sitz: Bonn -- Zoologisches Forschungsmuseum Alexander Koenig - Leibniz-Institut für Biodiversität der Tiere - Adenauerallee 160, 53113 Bonn, Germany www.zfmk.de Stiftung des öffentlichen Rechts; Direktor: Prof. J. Wolfgang Wägele Sitz: Bonn
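A sketch of Erick's suggestion with a made-up field name: keep the spatial field for geo filtering, and add a plain string field carrying the human-readable locality, which is what gets faceted:

<field name="geo_locality"  type="location" indexed="true" stored="true"/>
<field name="locality_name" type="string"   indexed="true" stored="true"/>

...facet=true&facet.field=locality_name

The encoded tokens seen in the original facet output are the prefix-tree cells the location field type indexes internally, which is why they are not directly usable as facet values or filter terms.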
Re: Dynamic analyzer settings change
You're still in danger of overly-broad hits. When you try stemming differently into the _same_ underlying field you get things that make sense in one language but are totally bogus in another language matching the query. As far as lots and lots of fields is concerned, if you want to restrict your searches to only one language you have a couple of choices here Consider a different core per language. Solr easily handles many cores/server. Now you have no 'wasted' space, it just happens that the stemmer for the core uses the DE-specific stemmers. Which you can extend to German de-compounding etc. Alternatively, you can form your queries with some care. There's nothing that requires, say, edismax to be specified in solrconfig.xml. Anything you would put in the defaults section of the config you can override on the command line. So, for instance, if you knew you were querying in French, you could form something like (going from memory) defType=edismaxqf=title_fr,text_fr or qf=title_de,text_de and so completely avoid cross-languge searching. Or you could simply include a field that has the language and tack on an fq clause like fq=de. But you haven't told us how big your problem is. I wouldn't worry at all about efficiency at this stage if you have, say, 10M documents, I'd just try the simplest thing first and measure. 500M documents is probably another story. FWIW Erick On Wed, Sep 11, 2013 at 9:50 AM, maephisto my_sky...@yahoo.com wrote: Thanks Jack! Indeed, very nice examples in your book. Inspired from there, here's a crazy idea: would it be possible to build a custom processor chain that would detect the language and use it to apply filters, like the aforementioned SnowballPorterFilter. That would leave at the end a document having as fields: text(with filtered content) and language(the one determined by the processor). And at search time, always append the language=user selected language. Does this make sense? If so, would it affect the performance at index time? Thanks! -- View this message in context: http://lucene.472066.n3.nabble.com/Dynamic-analizer-settings-change-tp4089274p4089305.html Sent from the Solr - User mailing list archive at Nabble.com.
Higher Memory Usage with solr 4.4
Hi, We are using Solr 4.4 on Linux with 64-bit OpenJDK. We started Solr with a 40GB heap, but we noticed that QTime is much higher than on a similar Solr 3.5 setup. Both the 3.5 and 4.4 configurations and schemas are constructed similarly. Also, during triage we found physical memory utilization at 95%. Is there any configuration we might be missing? Looking forward to your reply. Thanks. Kuchekar, Nilesh
Re: Error with Solr 4.4.0, Glassfish, and CentOS 6.2
On 9/10/2013 9:18 PM, vhoangvu wrote: Yesterday, I just install latest version of Solr 4.4.0 on Glassfish and CentOS 6.2 and got an error when try to access the administration page. I have checked this version on Mac OS one month ago, it works well. So, please help me clarify what problem. snip [#|2013-09-10T18:31:36.896+|INFO|oracle-glassfish3.1.2|javax.enterprise.system.std.com.sun.enterprise.server.logging|_ThreadID=1;_ThreadName=Thread-2;|2907 [main] ERROR org.apache.solr.core.SolrCore ? null:org.apache.solr.common.SolrException: Error instantiating shardHandlerFactory class [HttpShardHandlerFactory]: Failure initializing default system SSL context This is a container problem. It can't initialize SSL. The most common reason is that the java keystore has a password and it hasn't been provided. If that's the problem, here's one solution: http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201303.mbox/%3c1364232676233-4051159.p...@n3.nabble.com%3E Another solution, especially if you aren't going to be hosting SSL in Java containers at all on that machine, is to get rid of the keystore entirely. If that doesn't do it, you'll need to get help from a Glassfish support avenue. Thanks, Shawn
Re: Higher Memory Usage with solr 4.4
There are some defaults (sorry, don't have them listed) that are somewhat different. If you took your 3.5 and just used it for 4.x, it's probably worth going back over it and start with the 4.x example and add in any customizations you did for 3.5... But in general, the memory usage for 4.x should be much smaller than for 3.5, there were some _major_ improvements in that area. So I'm guessing you've moved over some innocent-seeming config.. FWIW, Erick On Wed, Sep 11, 2013 at 10:54 AM, Kuchekar kuchekar.nil...@gmail.comwrote: Hi, We are using solr 4.4 on Linux with OpenJDK 64-Bit. We started the Solr with 40GB but we noticed that the QTime is way high compared to similar on 3.5 solr. Both the 3.5 and 4.4 solr's configurations and schema are similarly constructed. Also during the triage we found the physical memory to be utilized at 95..%. Is there any configuration we might be missing. Looking forward for your reply. Thanks. Kuchekar, Nilesh
Re: synonyms not working
Attach debug=query to your URL and inspect the parsed query, you should be seeing the substitutions if you're configured correctly. Multi-word synonyms at query time have the getting through the query parser problem. Best Erick On Wed, Sep 11, 2013 at 11:04 AM, cheops m.schm...@mediaskill.de wrote: Hi, I'm using solr4.4 and try to use different synonyms based on different fieldtypes: fieldType name=text_general class=solr.TextField positionIncrementGap=100 analyzer type=index tokenizer class=solr.StandardTokenizerFactory/ filter class=solr.StopFilterFactory ignoreCase=true words=stopwords.txt / filter class=solr.LowerCaseFilterFactory/ /analyzer analyzer type=query tokenizer class=solr.StandardTokenizerFactory/ filter class=solr.StopFilterFactory ignoreCase=true words=stopwords.txt / filter class=solr.SynonymFilterFactory synonyms=synonyms.txt ignoreCase=true expand=true/ filter class=solr.LowerCaseFilterFactory/ /analyzer /fieldType ...I have the same fieldtype for english (name=text_general_en and synonyms=synonyms_en.txt). The first fieldtype works fine, my synonyms are processed and the result is as expected. But the en-version doesn't seem to work. I'm able to find the original english words but the synonyms are not processed. ps: yes, i know using synonyms at query time is not a good idea :-) ... but can't change it here Any help would be appreciated! Thank you. Best regards Marcus -- View this message in context: http://lucene.472066.n3.nabble.com/synonyms-not-working-tp4089318.html Sent from the Solr - User mailing list archive at Nabble.com.
synonyms not working
Hi, I'm using solr4.4 and try to use different synonyms based on different fieldtypes: fieldType name=text_general class=solr.TextField positionIncrementGap=100 analyzer type=index tokenizer class=solr.StandardTokenizerFactory/ filter class=solr.StopFilterFactory ignoreCase=true words=stopwords.txt / filter class=solr.LowerCaseFilterFactory/ /analyzer analyzer type=query tokenizer class=solr.StandardTokenizerFactory/ filter class=solr.StopFilterFactory ignoreCase=true words=stopwords.txt / filter class=solr.SynonymFilterFactory synonyms=synonyms.txt ignoreCase=true expand=true/ filter class=solr.LowerCaseFilterFactory/ /analyzer /fieldType ...I have the same fieldtype for english (name=text_general_en and synonyms=synonyms_en.txt). The first fieldtype works fine, my synonyms are processed and the result is as expected. But the en-version doesn't seem to work. I'm able to find the original english words but the synonyms are not processed. ps: yes, i know using synonyms at query time is not a good idea :-) ... but can't change it here Any help would be appreciated! Thank you. Best regards Marcus -- View this message in context: http://lucene.472066.n3.nabble.com/synonyms-not-working-tp4089318.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: synonyms not working
Thanks for your help. I was able to solve the problem in the meantime! I had used analyzer type=query_en ... which is wrong; it must be analyzer type=query -- View this message in context: http://lucene.472066.n3.nabble.com/synonyms-not-working-tp4089318p4089345.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Higher Memory Usage with solr 4.4
On 9/11/2013 8:54 AM, Kuchekar wrote: We are using solr 4.4 on Linux with OpenJDK 64-Bit. We started the Solr with 40GB but we noticed that the QTime is way high compared to similar on 3.5 solr. Both the 3.5 and 4.4 solr's configurations and schema are similarly constructed. Also during the triage we found the physical memory to be utilized at 95..%. A 40GB heap is *huge*. Unless you are dealing with millions of super-large documents or many many millions of smaller documents, there should be no need for a heap that large. Additionally, if you are allocating most of your system memory to Java, then you will have little or no RAM available for OS disk caching, which will cause major performance issues. For most indexes, memory usage should be less after an upgrade, but there are exceptions. I see that you had an earlier question about stored field compression, and that you talked about exporting data from your 3.5 install to index into 4.4, in which you had stored every field, including copyFields. If you have a lot of stored data, memory usage for decompression can become a problem. It's usually a lot better to store minimal information, just enough to display a result grid/list, and some ID information so that when someone clicks on an individual result, you can retrieve the entire record from another data source, like a database or a filesystem. Here's a more exhaustive list of potential performance and memory problems with Solr: http://wiki.apache.org/solr/SolrPerformanceProblems OpenJDK may be problematic, especially if it's version 6. With Java 7, OpenJDK is actually the reference implementation, so if you are using OpenJDK 7, I would be less concerned. With either version, Oracle Java tends to produce better results. Thanks, Shawn
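As a hypothetical illustration of Shawn's point: on a 64GB machine it is usually better to start with a modest heap and grow it only if GC behavior or OOMs demand it, leaving the rest of the RAM to the OS page cache, e.g.:

java -Xms8g -Xmx8g -jar start.jar

The 8GB figure is an assumption for the sake of the example; the right number depends on the index size, stored-field usage and query mix.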
Distributing lucene segments across multiple disks.
Hi, I know that SolrCloud allows you to have multiple shards on different machines (or a single machine), but it requires a ZooKeeper installation for things like leader election and leader availability. While SolrCloud may be the ideal solution for my use case eventually, I'd like to know if there's a way I can point my Solr instance at Lucene segments distributed across different disks attached to the same machine. Thanks! -Deepak
Re: Distributing lucene segments across multiple disks.
I think you'll find it hard to distribute different segments between disks, as they are typically stored in the same directory. However, instantiating separate cores on different disks should be straight-forward enough, and would give you a performance benefit. I've certainly heard of that done at Amazon, with a separate EBS volume per core giving some performance improvement. Upayavira On Wed, Sep 11, 2013, at 07:35 PM, Deepak Konidena wrote: Hi, I know that SolrCloud allows you to have multiple shards on different machines (or a single machine). But it requires a zookeeper installation for doing things like leader election, leader availability, etc While SolrCloud may be the ideal solution for my usecase eventually, I'd like to know if there's a way I can point my Solr instance to read lucene segments distributed across different disks attached to the same machine. Thanks! -Deepak
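A sketch of that layout using the legacy (pre core-discovery) solr.xml format, with made-up paths; each core's dataDir points at a different physical disk, and no ZooKeeper is involved:

<solr persistent="true">
  <cores adminPath="/admin/cores">
    <core name="shard1" instanceDir="shard1" dataDir="/mnt/disk1/solr/shard1/data"/>
    <core name="shard2" instanceDir="shard2" dataDir="/mnt/disk2/solr/shard2/data"/>
  </cores>
</solr>

If the data is actually split across the cores rather than duplicated, a query can cover both with old-style distributed search, e.g. &shards=localhost:8983/solr/shard1,localhost:8983/solr/shard2.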
Re: Distributing lucene segments across multiple disks.
@Greg - Are you suggesting RAID as a replacement for Solr or making Solr work with RAID? Could you elaborate more on the latter, if that's what you meant? We make use of solr's advanced text processing features which would be hard to replicate just using RAID. -Deepak On Wed, Sep 11, 2013 at 12:11 PM, Greg Walters gwalt...@sherpaanalytics.com wrote: Why not use some form of RAID for your index store? You'd get the performance benefit of multiple disks without the complexity of managing them via solr. Thanks, Greg -Original Message- From: Deepak Konidena [mailto:deepakk...@gmail.com] Sent: Wednesday, September 11, 2013 2:07 PM To: solr-user@lucene.apache.org Subject: Re: Distributing lucene segments across multiple disks. Are you suggesting a multi-core setup, where all the cores share the same schema, and the cores lie on different disks? Basically, I'd like to know if I can distribute shards/segments on a single machine (with multiple disks) without the use of zookeeper. -Deepak On Wed, Sep 11, 2013 at 11:55 AM, Upayavira u...@odoko.co.uk wrote: I think you'll find it hard to distribute different segments between disks, as they are typically stored in the same directory. However, instantiating separate cores on different disks should be straight-forward enough, and would give you a performance benefit. I've certainly heard of that done at Amazon, with a separate EBS volume per core giving some performance improvement. Upayavira On Wed, Sep 11, 2013, at 07:35 PM, Deepak Konidena wrote: Hi, I know that SolrCloud allows you to have multiple shards on different machines (or a single machine). But it requires a zookeeper installation for doing things like leader election, leader availability, etc While SolrCloud may be the ideal solution for my usecase eventually, I'd like to know if there's a way I can point my Solr instance to read lucene segments distributed across different disks attached to the same machine. Thanks! -Deepak
Re: Distributing lucene segments across multiple disks.
Are you suggesting a multi-core setup, where all the cores share the same schema, and the cores lie on different disks? Basically, I'd like to know if I can distribute shards/segments on a single machine (with multiple disks) without the use of zookeeper. -Deepak On Wed, Sep 11, 2013 at 11:55 AM, Upayavira u...@odoko.co.uk wrote: I think you'll find it hard to distribute different segments between disks, as they are typically stored in the same directory. However, instantiating separate cores on different disks should be straight-forward enough, and would give you a performance benefit. I've certainly heard of that done at Amazon, with a separate EBS volume per core giving some performance improvement. Upayavira On Wed, Sep 11, 2013, at 07:35 PM, Deepak Konidena wrote: Hi, I know that SolrCloud allows you to have multiple shards on different machines (or a single machine). But it requires a zookeeper installation for doing things like leader election, leader availability, etc While SolrCloud may be the ideal solution for my usecase eventually, I'd like to know if there's a way I can point my Solr instance to read lucene segments distributed across different disks attached to the same machine. Thanks! -Deepak
RE: Distributing lucene segments across multiple disks.
Why not use some form of RAID for your index store? You'd get the performance benefit of multiple disks without the complexity of managing them via solr. Thanks, Greg -Original Message- From: Deepak Konidena [mailto:deepakk...@gmail.com] Sent: Wednesday, September 11, 2013 2:07 PM To: solr-user@lucene.apache.org Subject: Re: Distributing lucene segments across multiple disks. Are you suggesting a multi-core setup, where all the cores share the same schema, and the cores lie on different disks? Basically, I'd like to know if I can distribute shards/segments on a single machine (with multiple disks) without the use of zookeeper. -Deepak On Wed, Sep 11, 2013 at 11:55 AM, Upayavira u...@odoko.co.uk wrote: I think you'll find it hard to distribute different segments between disks, as they are typically stored in the same directory. However, instantiating separate cores on different disks should be straight-forward enough, and would give you a performance benefit. I've certainly heard of that done at Amazon, with a separate EBS volume per core giving some performance improvement. Upayavira On Wed, Sep 11, 2013, at 07:35 PM, Deepak Konidena wrote: Hi, I know that SolrCloud allows you to have multiple shards on different machines (or a single machine). But it requires a zookeeper installation for doing things like leader election, leader availability, etc While SolrCloud may be the ideal solution for my usecase eventually, I'd like to know if there's a way I can point my Solr instance to read lucene segments distributed across different disks attached to the same machine. Thanks! -Deepak
RE: Distributing lucene segments across multiple disks.
Deepak, Sorry for not being more verbose in my previous suggestion. As I take your question, you'd like to spread your index files across multiple disks (for performance or space reasons I assume). If you used even a basic md-raid setup you could then format the raid device and thus your entire set of disks with your favorite filesystem, mount it in one directory in the directory tree then configure it as the data directory that solr uses. This setup would accomplish your goal of having the lucene indexes spread across multiple disks without the complexity of using multiple solr cores/collections. Thanks, Greg -Original Message- From: Deepak Konidena [mailto:deepakk...@gmail.com] Sent: Wednesday, September 11, 2013 2:26 PM To: solr-user@lucene.apache.org Subject: Re: Distributing lucene segments across multiple disks. @Greg - Are you suggesting RAID as a replacement for Solr or making Solr work with RAID? Could you elaborate more on the latter, if that's you meant? We make use of solr's advanced text processing features which would be hard to replicate just using RAID. -Deepak On Wed, Sep 11, 2013 at 12:11 PM, Greg Walters gwalt...@sherpaanalytics.com wrote: Why not use some form of RAID for your index store? You'd get the performance benefit of multiple disks without the complexity of managing them via solr. Thanks, Greg -Original Message- From: Deepak Konidena [mailto:deepakk...@gmail.com] Sent: Wednesday, September 11, 2013 2:07 PM To: solr-user@lucene.apache.org Subject: Re: Distributing lucene segments across multiple disks. Are you suggesting a multi-core setup, where all the cores share the same schema, and the cores lie on different disks? Basically, I'd like to know if I can distribute shards/segments on a single machine (with multiple disks) without the use of zookeeper. -Deepak On Wed, Sep 11, 2013 at 11:55 AM, Upayavira u...@odoko.co.uk wrote: I think you'll find it hard to distribute different segments between disks, as they are typically stored in the same directory. However, instantiating separate cores on different disks should be straight-forward enough, and would give you a performance benefit. I've certainly heard of that done at Amazon, with a separate EBS volume per core giving some performance improvement. Upayavira On Wed, Sep 11, 2013, at 07:35 PM, Deepak Konidena wrote: Hi, I know that SolrCloud allows you to have multiple shards on different machines (or a single machine). But it requires a zookeeper installation for doing things like leader election, leader availability, etc While SolrCloud may be the ideal solution for my usecase eventually, I'd like to know if there's a way I can point my Solr instance to read lucene segments distributed across different disks attached to the same machine. Thanks! -Deepak
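A rough sketch of that approach on Linux, assuming four spare disks named /dev/sdb through /dev/sde and a /var/solr-data mount point (the RAID level is also an assumption; Shawn recommends RAID10 later in this thread):
# build the array, put a filesystem on it, and mount it
mdadm --create /dev/md0 --level=10 --raid-devices=4 /dev/sdb /dev/sdc /dev/sdd /dev/sde
mkfs.ext4 /dev/md0
mount /dev/md0 /var/solr-data
# then point the core's dataDir (in solrconfig.xml or solr.xml) at /var/solr-data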
Do I need to delete my index?
Hello, I'm in the process of creating my index using a series of SolrClient::request commands in PHP. I ran into a problem when some of the fields that I had as text_general fieldType contained &load= in a URL, triggering an error because the HTML entity &load wasn't recognized. I realized that I should have made my URL fields of type string instead, so that they would be taken as is (they're not being indexed, just stored), so I removed all docs from my index, updated schema.xml, and restarted Solr, but I'm still getting the same error. Do I need to delete the index itself and then restart to get this to work? Am I correct that changing those fields to string type should fix the issue? Thanks, Brian
Re: Distributing lucene segments across multiple disks.
On 9/11/2013 1:07 PM, Deepak Konidena wrote: Are you suggesting a multi-core setup, where all the cores share the same schema, and the cores lie on different disks? Basically, I'd like to know if I can distribute shards/segments on a single machine (with multiple disks) without the use of zookeeper. Sure, you can do it all manually. At that point you would not be using SolrCloud at all, because the way to enable SolrCloud is to tell Solr where zookeeper lives. Without SolrCloud, there is no cluster automation at all. There is no collection paradigm, you just have cores. You have to send updates to the correct core; they will not be redirected for you. Similarly, queries will not be load balanced automatically. For Java clients, the CloudSolrServer object can work seamlessly when servers go down. If you're not using SolrCloud, you can't use CloudSolrServer. You would be in charge of creating the shards parameter yourself. The way that I do this on my index is that I have a broker core that has no index of its own, but its solrconfig.xml has the shards and shards.qt parameters in all the request handler definitions. You can also include the parameter with the query. You would also have to handle redundancy yourself, either with replication or with independently updated indexes. I use the latter method, because it offers a lot more flexibility than replication. As mentioned in another reply, setting up RAID with a lot of disks may be better than trying to split your index up on different filesystems that each reside on different disks. I would recommend RAID10 for Solr, and it works best if it's hardware RAID and the controller has battery-backed (or NVRAM) cache. Thanks, Shawn
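For illustration, a non-SolrCloud distributed query along the lines Shawn describes can also be issued ad hoc; the host and core names below are invented:
# ask one core to fan the search out to two shards and merge the results;
# the same shards value can instead be set as a default on the broker core's /select handler
curl 'http://broker-host:8983/solr/broker/select?q=test&shards=host1:8983/solr/core1,host2:8983/solr/core2'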
Re: Error while importing HBase data to Solr using the DataImportHandler
Hi, Can you provide me an example of data-config.xml? Because with my HBase configuration, I am getting Full Import failed:java.lang.RuntimeException: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.NoSuchMethodError: org.apache.hadoop.net.NetUtils.getInputStream(Ljava/net/Socket;)Ljava/io/InputStream; AND Exception while processing: item document : SolrInputDocument[]:org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to execute SCANNER: [tableName=Item, startRow=null, stopRow=null, columns=[{Item|r}, {Item|m}, {Item|u}]] Processing Document # 1
My data-config.xml:
<dataConfig>
  <dataSource type="HbaseDataSource" name="HBase" host="127.0.0.1" port="2181" />
  <document name="Item">
    <entity name="item" pk="ROW_KEY" dataSource="HBase" processor="HbaseEntityProcessor" tableName="Item" onError="abort" columns="Item|r, Item|m, Item|u" query="scan 'Item', {COLUMNS => ['r','m', 'u']}" deltaImportQuery="" deltaQuery="">
      <field column="ROW_KEY" name="id" />
      <field column="r" name="r" />
      <field column="m" name="m" />
      <field column="u" name="u" />
    </entity>
  </document>
</dataConfig>
Please respond ASAP. Thanks in advance!!
Re: Do I need to delete my index?
In addition, if I do need to delete my index, how do I go about that? I've been looking through the documentation and can't find anything specific. I know where the index is, I'm just not sure which files to delete. Hello, I'm in the process of creating my index using a series of SolrClient::request commands in PHP. I ran into a problem when some of the fields that I had as text_general fieldType contained load= in a URL, triggering an error because the HTML entity load wasn't recognized. I realized that I should have made my URL fields of type string instead, so that they would be taken as is (they're not being indexed, just stored), so I removed all docs from my index, updated schema.xml, and restarted Solr, but I'm still getting the same error. Do I need to delete the index itself and then restart to get this to work? Am I correct that changing those fields to string type should fix the issue? Thanks, Brian
Re: Distributing lucene segments across multiple disks.
I guess at this point in the discussion, I should probably give some more background on why I am doing what I am doing. Having a single Solr shard (multiple segments) on the same disk is posing severe performance problems under load, in that calls to Solr cause a lot of connection timeouts. When we looked at the ganglia stats for the Solr box, we saw that while memory, cpu and network usage were quite normal, the i/o wait spiked. We are unsure of what caused the i/o wait and why there were no spikes in the cpu/memory usage. Since the Solr box is a beefy box (multi-core setup, huge ram, SSD), we'd like to distribute the segments to multiple locations (disks) and see whether this improves performance under load. @Greg - Thanks for clarifying that. I just learnt that I can't set them up using RAID as some of them are SSDs and some others are SATA (spinning disks). @Shawn Heisey - Could you elaborate more about the broker core and delegating the requests to other cores? -Deepak On Wed, Sep 11, 2013 at 1:10 PM, Shawn Heisey s...@elyograg.org wrote: On 9/11/2013 1:07 PM, Deepak Konidena wrote: Are you suggesting a multi-core setup, where all the cores share the same schema, and the cores lie on different disks? Basically, I'd like to know if I can distribute shards/segments on a single machine (with multiple disks) without the use of zookeeper. Sure, you can do it all manually. At that point you would not be using SolrCloud at all, because the way to enable SolrCloud is to tell Solr where zookeeper lives. Without SolrCloud, there is no cluster automation at all. There is no collection paradigm, you just have cores. You have to send updates to the correct core; they will not be redirected for you. Similarly, queries will not be load balanced automatically. For Java clients, the CloudSolrServer object can work seamlessly when servers go down. If you're not using SolrCloud, you can't use CloudSolrServer. You would be in charge of creating the shards parameter yourself. The way that I do this on my index is that I have a broker core that has no index of its own, but its solrconfig.xml has the shards and shards.qt parameters in all the request handler definitions. You can also include the parameter with the query. You would also have to handle redundancy yourself, either with replication or with independently updated indexes. I use the latter method, because it offers a lot more flexibility than replication. As mentioned in another reply, setting up RAID with a lot of disks may be better than trying to split your index up on different filesystems that each reside on different disks. I would recommend RAID10 for Solr, and it works best if it's hardware RAID and the controller has battery-backed (or NVRAM) cache. Thanks, Shawn
Re: Do I need to delete my index?
On 9/11/2013 2:17 PM, Brian Robinson wrote: I'm in the process of creating my index using a series of SolrClient::request commands in PHP. I ran into a problem when some of the fields that I had as text_general fieldType contained &load= in a URL, triggering an error because the HTML entity &load wasn't recognized. I realized that I should have made my URL fields of type string instead, so that they would be taken as is (they're not being indexed, just stored), so I removed all docs from my index, updated schema.xml, and restarted Solr, but I'm still getting the same error. Do I need to delete the index itself and then restart to get this to work? Am I correct that changing those fields to string type should fix the issue? Changing the field type is not going to affect this issue. Because you are not indexing the field, the choice of string or text_general is not really going to matter, but string will probably be more efficient. What is happening here is an XML issue with the update request itself. The PHP client is sending an XML update request to Solr, and the request includes the URL text as-is in the XML request. It is not properly XML encoded. For an XML update request, that snippet of your text would need to be encoded as &amp;load= to work properly. XML has a much smaller list of valid entities than HTML, but &load is not a valid entity in either XML or HTML. I was going to call this a bug in the PHP client library, but then I got a look at what SolrClient::request actually does: http://php.net/manual/en/solrclient.request.php It expects you to create the XML yourself, which means you have to do all the encoding of characters which have special meaning to XML. If you have no desire to figure out proper XML encoding, you should probably be using SolrClient::addDocument or SolrClient::addDocuments instead. Thanks, Shawn
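As a sketch of what a correctly escaped hand-built update looks like (the update path, field names and URL are assumptions, not Brian's actual schema):
# the & inside the stored URL must be written as &amp; in the XML payload
curl 'http://localhost:8983/solr/update?commit=true' -H 'Content-Type: text/xml' --data-binary '<add><doc><field name="id">doc1</field><field name="url">http://example.com/page?foo=1&amp;load=2</field></doc></add>'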
RE: Distributing lucene segments across multiple disks.
Deepak, It might be a bit outside what you're willing to consider but you can make a raid out of your spinning disks then use your SSD(s) as a dm-cache device to accelerate reads and writes to the raid device. If you're putting lucene indexes on a mixed bag of disks and ssd's without any type of control for what goes where you'd want to use the ssd to accelerate the spinning disks anyway. Check out http://lwn.net/Articles/540996/ for more information on the dm-cache device. Thanks, Greg -Original Message- From: Deepak Konidena [mailto:deepakk...@gmail.com] Sent: Wednesday, September 11, 2013 3:57 PM To: solr-user@lucene.apache.org Subject: Re: Distributing lucene segments across multiple disks. I guess at this point in the discussion, I should probably give some more background on why I am doing what I am doing. Having a single Solr shard (multiple segments) on the same disk is posing severe performance problems under load,in that, calls to Solr cause a lot of connection timeouts. When we looked at the ganglia stats for the Solr box, we saw that while memory, cpu and network usage were quite normal, the i/o wait spiked. We are unsure on what caused the i/o wait and why there were no spikes in the cpu/memory usage. Since the Solr box is a beefy box (multi-core setup, huge ram, SSD), we'd like to distribute the segments to multiple locations (disks) and see whether this improves performance under load. @Greg - Thanks for clarifying that. I just learnt that I can't set them up using RAID as some of them are SSDs and some others are SATA (spinning disks). @Shawn Heisey - Could you elaborate more about the broker core and delegating the requests to other cores? -Deepak On Wed, Sep 11, 2013 at 1:10 PM, Shawn Heisey s...@elyograg.org wrote: On 9/11/2013 1:07 PM, Deepak Konidena wrote: Are you suggesting a multi-core setup, where all the cores share the same schema, and the cores lie on different disks? Basically, I'd like to know if I can distribute shards/segments on a single machine (with multiple disks) without the use of zookeeper. Sure, you can do it all manually. At that point you would not be using SolrCloud at all, because the way to enable SolrCloud is to tell Solr where zookeeper lives. Without SolrCloud, there is no cluster automation at all. There is no collection paradigm, you just have cores. You have to send updates to the correct core; they not be redirected for you. Similarly, queries will not be load balanced automatically. For Java clients, the CloudSolrServer object can work seamlessly when servers go down. If you're not using SolrCloud, you can't use CloudSolrServer. You would be in charge of creating the shards parameter yourself. The way that I do this on my index is that I have a broker core that has no index of its own, but its solrconfig.xml has the shards and shards.qt parameters in all the request handler definitions. You can also include the parameter with the query. You would also have to handle redundancy yourself, either with replication or with independently updated indexes. I use the latter method, because it offers a lot more flexibility than replication. As mentioned in another reply, setting up RAID with a lot of disks may be better than trying to split your index up on different filesystems that each reside on different disks. I would recommend RAID10 for Solr, and it works best if it's hardware RAID and the controller has battery-backed (or NVRAM) cache. Thanks, Shawn
ReplicationFactor for solrcloud
Hi - I am trying to set up 3 shards and 3 replicas for my solrcloud deployment with 3 servers, specifying replicationFactor=3 and numShards=3 when starting the first node. I see each of the servers allocated 1 shard each. However, I do not see 3 replicas allocated on each node. I specifically need to have 3 replicas across 3 servers with 3 shards. Can anyone think of a reason not to have this configuration? -- Regards, -Aditya Sakhuja
Re: Distributing lucene segments across multiple disks.
On 9/11/2013 2:57 PM, Deepak Konidena wrote: I guess at this point in the discussion, I should probably give some more background on why I am doing what I am doing. Having a single Solr shard (multiple segments) on the same disk is posing severe performance problems under load, in that calls to Solr cause a lot of connection timeouts. When we looked at the ganglia stats for the Solr box, we saw that while memory, cpu and network usage were quite normal, the i/o wait spiked. We are unsure of what caused the i/o wait and why there were no spikes in the cpu/memory usage. Since the Solr box is a beefy box (multi-core setup, huge ram, SSD), we'd like to distribute the segments to multiple locations (disks) and see whether this improves performance under load. @Greg - Thanks for clarifying that. I just learnt that I can't set them up using RAID as some of them are SSDs and some others are SATA (spinning disks). @Shawn Heisey - Could you elaborate more about the broker core and delegating the requests to other cores? On the broker core - I have a core on my servers that has no index of its own. In the /select handler (and others) I have placed a shards parameter, and many of them also have a shards.qt parameter. The shards parameter is how a non-cloud distributed search is done. http://wiki.apache.org/solr/DistributedSearch Addressing your first paragraph: You say that you have lots of RAM ... but is there a lot of unallocated RAM that the OS can use for caching, or is it mostly allocated to processes, such as the java heap for Solr? Depending on exactly how your indexes are composed, you need up to 100% of the total index size available as unallocated RAM. With SSD, the requirement is less, but cannot be ignored. I personally wouldn't go below about 25-50% even with SSD, and I'd plan on 50-100% for regular disks. There is some evidence to suggest that you only need unallocated RAM equal to 10% of your index size for caching with SSD, but that is only likely to work if you have a lot of stored (as opposed to indexed) data. If most of your index is unstored, then more would be required. Thanks, Shawn
Re: Do I need to delete my index?
On 9/11/2013 3:17 PM, Brian Robinson wrote: In addition, if I do need to delete my index, how do I go about that? I've been looking through the documentation and can't find anything specific. I know where the index is, I'm just not sure which files to delete. Generally you'll find it in a path that ends with data/index ... but if you have messed with dataDir, it might just end in /index instead. Here is an example of index directory contents, from a system that *is* changing dataDir: ncindex@bigindy5 /index/solr4/data/s2_0 $ echo `ls -1 index` _m5o.fdt _m5o.fdx _m5o.fnm _m5o_Lucene41_0.doc _m5o_Lucene41_0.pos _m5o_Lucene41_0.tim _m5o_Lucene41_0.tip _m5o_Lucene45_0.dvd _m5o_Lucene45_0.dvm _m5o.nvd _m5o.nvm _m5o.si _m5o.tvd _m5o.tvx _m5o_z.del _m5p.fdt _m5p.fdx _m5p.fnm _m5p_Lucene41_0.doc _m5p_Lucene41_0.pos _m5p_Lucene41_0.tim _m5p_Lucene41_0.tip _m5p_Lucene45_0.dvd _m5p_Lucene45_0.dvm _m5p.nvd _m5p.nvm _m5p.si _m5p.tvd _m5p.tvx _m5v.fdt _m5v.fdx _m5v.fnm _m5v_Lucene41_0.doc _m5v_Lucene41_0.pos _m5v_Lucene41_0.tim _m5v_Lucene41_0.tip _m5v_Lucene45_0.dvd _m5v_Lucene45_0.dvm _m5v.nvd _m5v.nvm _m5v.si _m5v.tvd _m5v.tvx _m5w.fdt _m5w.fdx _m5w.fnm _m5w_Lucene41_0.doc _m5w_Lucene41_0.pos _m5w_Lucene41_0.tim _m5w_Lucene41_0.tip _m5w_Lucene45_0.dvd _m5w_Lucene45_0.dvm _m5w.nvd _m5w.nvm _m5w.si _m5w.tvd _m5w.tvx _m5x.fdt _m5x.fdx _m5x.fnm _m5x_Lucene41_0.doc _m5x_Lucene41_0.pos _m5x_Lucene41_0.tim _m5x_Lucene41_0.tip _m5x_Lucene45_0.dvd _m5x_Lucene45_0.dvm _m5x.nvd _m5x.nvm _m5x.si _m5x.tvd _m5x.tvx _m5y.fdt _m5y.fdx _m5y.fnm _m5y_Lucene41_0.doc _m5y_Lucene41_0.pos _m5y_Lucene41_0.tim _m5y_Lucene41_0.tip _m5y_Lucene45_0.dvd _m5y_Lucene45_0.dvm _m5y.nvd _m5y.nvm _m5y.si _m5y.tvd _m5y.tvx _m5z.fdt _m5z.fdx _m5z.fnm _m5z_Lucene41_0.doc _m5z_Lucene41_0.pos _m5z_Lucene41_0.tim _m5z_Lucene41_0.tip _m5z_Lucene45_0.dvd _m5z_Lucene45_0.dvm _m5z.nvd _m5z.nvm _m5z.si _m5z.tvd _m5z.tvx segments_554 segments.gen write.lock Thanks, Shawn
Re: Distributing lucene segments across multiple disks.
@Greg - Thanks for the suggestion. Will pass it along to my folks. @Shawn - That's the link I was looking for 'non-SolrCloud approach to distributed search'. Thanks for passing that along. Will give it a try. As far as RAM usage goes, I believe we set the heap size to about 40% of the RAM and less than 10% is available for OS caching (since replica takes another 40%). Why does unallocated RAM help? How does it impact performance under load? -Deepak On Wed, Sep 11, 2013 at 2:50 PM, Shawn Heisey s...@elyograg.org wrote: On 9/11/2013 2:57 PM, Deepak Konidena wrote: I guess at this point in the discussion, I should probably give some more background on why I am doing what I am doing. Having a single Solr shard (multiple segments) on the same disk is posing severe performance problems under load, in that calls to Solr cause a lot of connection timeouts. When we looked at the ganglia stats for the Solr box, we saw that while memory, cpu and network usage were quite normal, the i/o wait spiked. We are unsure of what caused the i/o wait and why there were no spikes in the cpu/memory usage. Since the Solr box is a beefy box (multi-core setup, huge ram, SSD), we'd like to distribute the segments to multiple locations (disks) and see whether this improves performance under load. @Greg - Thanks for clarifying that. I just learnt that I can't set them up using RAID as some of them are SSDs and some others are SATA (spinning disks). @Shawn Heisey - Could you elaborate more about the broker core and delegating the requests to other cores? On the broker core - I have a core on my servers that has no index of its own. In the /select handler (and others) I have placed a shards parameter, and many of them also have a shards.qt parameter. The shards parameter is how a non-cloud distributed search is done. http://wiki.apache.org/solr/DistributedSearch Addressing your first paragraph: You say that you have lots of RAM ... but is there a lot of unallocated RAM that the OS can use for caching, or is it mostly allocated to processes, such as the java heap for Solr? Depending on exactly how your indexes are composed, you need up to 100% of the total index size available as unallocated RAM. With SSD, the requirement is less, but cannot be ignored. I personally wouldn't go below about 25-50% even with SSD, and I'd plan on 50-100% for regular disks. There is some evidence to suggest that you only need unallocated RAM equal to 10% of your index size for caching with SSD, but that is only likely to work if you have a lot of stored (as opposed to indexed) data. If most of your index is unstored, then more would be required. Thanks, Shawn
Re: Do I need to delete my index?
Thanks Shawn. I had actually tried changing &load= to &amp;load=, but still got the error. It sounds like addDocuments is worth a try, though. On 9/11/2013 4:37 PM, Shawn Heisey wrote: On 9/11/2013 2:17 PM, Brian Robinson wrote: I'm in the process of creating my index using a series of SolrClient::request commands in PHP. I ran into a problem when some of the fields that I had as text_general fieldType contained &load= in a URL, triggering an error because the HTML entity &load wasn't recognized. I realized that I should have made my URL fields of type string instead, so that they would be taken as is (they're not being indexed, just stored), so I removed all docs from my index, updated schema.xml, and restarted Solr, but I'm still getting the same error. Do I need to delete the index itself and then restart to get this to work? Am I correct that changing those fields to string type should fix the issue? Changing the field type is not going to affect this issue. Because you are not indexing the field, the choice of string or text_general is not really going to matter, but string will probably be more efficient. What is happening here is an XML issue with the update request itself. The PHP client is sending an XML update request to Solr, and the request includes the URL text as-is in the XML request. It is not properly XML encoded. For an XML update request, that snippet of your text would need to be encoded as &amp;load= to work properly. XML has a much smaller list of valid entities than HTML, but &load is not a valid entity in either XML or HTML. I was going to call this a bug in the PHP client library, but then I got a look at what SolrClient::request actually does: http://php.net/manual/en/solrclient.request.php It expects you to create the XML yourself, which means you have to do all the encoding of characters which have special meaning to XML. If you have no desire to figure out proper XML encoding, you should probably be using SolrClient::addDocument or SolrClient::addDocuments instead. Thanks, Shawn
Re: Distributing lucene segments across multiple disks.
On 9/11/2013 4:16 PM, Deepak Konidena wrote: As far as RAM usage goes, I believe we set the heap size to about 40% of the RAM and less than 10% is available for OS caching ( since replica takes another 40%). Why does unallocated RAM help? How does it impact performance under load? Because once the data is in the OS disk cache, reading it becomes instantaneous, it doesn't need to go out to the disk. Disks are glacial compared to RAM. Even SSD has a far slower response time. Any recent operating system does this automatically, including the one from Redmond that we all love to hate. http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html Thanks, Shawn
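On Linux, one quick way to see how much memory is actually left over for that cache (a sketch; the exact column layout varies by distribution):
# the "cached" figure is RAM the OS is using to cache files such as index segments
free -m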
Re: Do I need to delete my index?
Typically I'll just delete the entire data dir recursively after shutting down Solr, the default location is solr_home/solr/collectionblah/data On Wed, Sep 11, 2013 at 6:01 PM, Brian Robinson br...@socialsurgemedia.comwrote: Thanks Shawn. I had actually tried changing load= to amp;load=, but still got the error. It sounds like addDocuments is worth a try, though. On 9/11/2013 4:37 PM, Shawn Heisey wrote: On 9/11/2013 2:17 PM, Brian Robinson wrote: I'm in the process of creating my index using a series of SolrClient::request commands in PHP. I ran into a problem when some of the fields that I had as text_general fieldType contained load= in a URL, triggering an error because the HTML entity load wasn't recognized. I realized that I should have made my URL fields of type string instead, so that they would be taken as is (they're not being indexed, just stored), so I removed all docs from my index, updated schema.xml, and restarted Solr, but I'm still getting the same error. Do I need to delete the index itself and then restart to get this to work? Am I correct that changing those fields to string type should fix the issue? Changing the field type is not going to affect this issue. Because you are not indexing the field, the choice of string or text_general is not really going to matter, but string will probably be more efficient. What is happening here is an XML issue with the update request itself. The PHP client is sending an XML update request to Solr, and the request includes the URL text as-is in the XML request. It is not properly XML encoded. For an XML update request, that snippet of your text would need to be encoded as amp;load= to work properly. XML has a much smaller list of valid entities than HTML, but load is not a valid entity in either XML or HTML. I was going to call this a bug in the PHP client library, but then I got a look at what SolrClient::request actually does: http://php.net/manual/en/**solrclient.request.phphttp://php.net/manual/en/solrclient.request.php It expects you to create the XML yourself, which means you have to do all the encoding of characters which have special meaning to XML. If you have no desire to figure out proper XML encoding, you should probably be using SolrClient::addDocument or SolrClient::addDocuments instead. Thanks, Shawn
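A sketch of that, assuming Solr has been stopped first and a default single-core layout (adjust the path to your solr home and core name):
# remove the whole data directory; Solr recreates an empty index on restart
rm -rf /path/to/solr_home/solr/collection1/data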
Re: Do I need to delete my index?
Thanks Erick On 9/11/2013 6:46 PM, Erick Erickson wrote: Typically I'll just delete the entire data dir recursively after shutting down Solr, the default location is solr_home/solr/collectionblah/data On Wed, Sep 11, 2013 at 6:01 PM, Brian Robinson br...@socialsurgemedia.comwrote: Thanks Shawn. I had actually tried changing load= to amp;load=, but still got the error. It sounds like addDocuments is worth a try, though. On 9/11/2013 4:37 PM, Shawn Heisey wrote: On 9/11/2013 2:17 PM, Brian Robinson wrote: I'm in the process of creating my index using a series of SolrClient::request commands in PHP. I ran into a problem when some of the fields that I had as text_general fieldType contained load= in a URL, triggering an error because the HTML entity load wasn't recognized. I realized that I should have made my URL fields of type string instead, so that they would be taken as is (they're not being indexed, just stored), so I removed all docs from my index, updated schema.xml, and restarted Solr, but I'm still getting the same error. Do I need to delete the index itself and then restart to get this to work? Am I correct that changing those fields to string type should fix the issue? Changing the field type is not going to affect this issue. Because you are not indexing the field, the choice of string or text_general is not really going to matter, but string will probably be more efficient. What is happening here is an XML issue with the update request itself. The PHP client is sending an XML update request to Solr, and the request includes the URL text as-is in the XML request. It is not properly XML encoded. For an XML update request, that snippet of your text would need to be encoded as amp;load= to work properly. XML has a much smaller list of valid entities than HTML, but load is not a valid entity in either XML or HTML. I was going to call this a bug in the PHP client library, but then I got a look at what SolrClient::request actually does: http://php.net/manual/en/**solrclient.request.phphttp://php.net/manual/en/solrclient.request.php It expects you to create the XML yourself, which means you have to do all the encoding of characters which have special meaning to XML. If you have no desire to figure out proper XML encoding, you should probably be using SolrClient::addDocument or SolrClient::addDocuments instead. Thanks, Shawn
Grouping by field substring?
Hi all, Assuming I want to use the first N characters of a specific field for grouping results, is such a thing possible out-of-the-box? If not, then what would the next best option be? E.g. a custom function query? Thanks, -- Ken -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions training Hadoop, Cascading, Cassandra Solr
Re: Grouping by field substring?
Do a copyField to another field, with a limit of 8 characters, and then use that other field. -- Jack Krupansky -Original Message- From: Ken Krugler Sent: Wednesday, September 11, 2013 8:24 PM To: solr-user@lucene.apache.org Subject: Grouping by field substring? Hi all, Assuming I want to use the first N characters of a specific field for grouping results, is such a thing possible out-of-the-box? If not, then what would the next best option be? E.g. a custom function query? Thanks, -- Ken -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions training Hadoop, Cascading, Cassandra Solr
Re: solr performance against oracle
Setting aside the excellent responses that have already been made in this thread, there are fundamental discrepancies in what you are comparing in your respective timing tests. First off: a micro benchmark like this is virtually useless -- unless you really plan on only ever executing a single query in a single run of a java application that then terminates, trying to time a single query is silly -- you should do lots and lots of iterations using a large set of sample inputs. Second: what you are timing is vastly different between the two cases. In your Solr timing, no communication happens over the wire to the solr server until the call to server.query() inside your time stamps -- if you were doing multiple requests using the same SolrServer object, the HTTP connection would get re-used, but as things stand your timing includes all of the network overhead of connecting to the server, sending the request, and reading the response. In your oracle method however, the timestamps you record are only around the call to executeQuery(), rs.next(), and rs.getString() ... you are ignoring the timing necessary for the getConnection() and prepareStatement() methods, which may be significant as they both involve over the wire communication with the remote server (And it's not like these are one-time, execute-and-forget methods ... in a real long lived application you'd need to manage your connections, re-open if they get closed, recreate the prepared statement if your connection has to be re-opened, etc...) Your comparison is definitely apples and oranges. Lastly, as others have mentioned: 150-200ms to request a single document by uniqueKey from an index containing 800K docs seems ridiculously slow, and suggests that something is poorly configured about your solr instance (another apples to oranges comparison: you've got an ad-hoc solr installation set up on your laptop and you're benchmarking it against a remote oracle server running on dedicated remote hardware that has probably been heavily tuned/optimized for queries). You haven't provided us any details however about how your index is set up, or how you have configured solr, or what JVM options you are using to run solr, or what physical resources are available to your solr process (disk, jvm heap ram, os file system cache ram) so there isn't much we can offer in the way of advice on how to speed things up. FWIW: On my laptop, using Solr 4.4 w/ the example configs and built in jetty (ie: java -jar start.jar) I got a 3.4 GB max heap, and a 1.5 GB default heap, with plenty of physical ram left over for the os file system cache of an index I created containing 1,000,000 documents with 6 small fields containing small amounts of random terms. I then used curl to execute ~4150 requests for documents by id (using simple search, not the /get RTG handler) and return the results using JSON. This completed in under 4.5 seconds, or ~1.0ms/request.
Using the more verbose XML response format (after restarting solr to ensure nothing was in the query result caches) only took 0.3 seconds longer on the total time (~1.1ms/request)
$ time curl -sS 'http://localhost:8983/solr/collection1/select?q=id%3A[1-100:241]&wt=json&indent=true' > /dev/null
real 0m4.471s
user 0m0.412s
sys 0m0.116s
$ time curl -sS 'http://localhost:8983/solr/collection1/select?q=id%3A[1-100:241]&wt=xml&indent=true' > /dev/null
real 0m4.868s
user 0m0.376s
sys 0m0.136s
$ java -version
java version "1.7.0_25"
OpenJDK Runtime Environment (IcedTea 2.3.10) (7u25-2.3.10-1ubuntu0.12.04.2)
OpenJDK 64-Bit Server VM (build 23.7-b01, mixed mode)
$ uname -a
Linux frisbee 3.2.0-52-generic #78-Ubuntu SMP Fri Jul 26 16:21:44 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
-Hoss
Re: No or limited use of FieldCache
Per, check zee Wiki, there is a page describing docvalues. We used them successfully in a solr for analytics scenario. Otis Solr ElasticSearch Support http://sematext.com/ On Sep 11, 2013 9:15 AM, Michael Sokolov msoko...@safaribooksonline.com wrote: On 09/11/2013 08:40 AM, Per Steffensen wrote: The reason I mention sort is that we in my project, half a year ago, have dealt with the FieldCache-OOM-problem when doing sort-requests. We basically just reject sort-requests unless they hit below X documents - in case they do we just find them without sorting and sort them ourselves afterwards. Currently our problem is, that we have to do a group/distinct (in SQL-language) query and we have found that we can do what we want to do using group (http://wiki.apache.org/solr/FieldCollapsing) or facet - either will work for us. Problem is that they both use FieldCache and we know that using FieldCache will lead to OOM-exceptions with the amount of data each of our Solr-nodes administrate. This time we have really no option of just limiting usage as we did with sort. Therefore we need a group/distinct-functionality that works even on huge data-amounts (and an algorithm using FieldCache will not) I believe setting facet.method=enum will actually make facet not use the FieldCache. Is that true? Is it a bad idea? I do not know much about DocValues, but I do not believe that you will avoid FieldCache by using DocValues? Please elaborate, or point to documentation where I will be able to read that I am wrong. Thanks! There is Simon Willnauer's presentation http://www.slideshare.net/lucenerevolution/willnauer-simon-doc-values-column-stride-fields-in-lucene and this blog post http://blog.trifork.com/2011/10/27/introducing-lucene-index-doc-values/ and this one that shows some performance comparisons: http://searchhub.org/2013/04/02/fun-with-docvalues-in-solr-4-2/
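To try the facet.method=enum idea mentioned above, the parameter just goes on the request; host, collection and field names here are assumptions:
# term-enumeration faceting works off the filterCache instead of the FieldCache
curl 'http://localhost:8983/solr/collection1/select?q=*:*&rows=0&facet=true&facet.field=category&facet.method=enum'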
Re: charset encoding
Using tomcat by any chance? The ML archive has the solution. May be on Wiki, too. Otis Solr ElasticSearch Support http://sematext.com/ On Sep 11, 2013 8:56 AM, Andreas Owen a...@conx.ch wrote: i'm using solr 4.3.1 with tika to index html-pages. the html files are iso-8859-1 (ansi) encoded and the meta tag content-encoding as well. the server-http-header says it's utf8 and firefox-webdeveloper agrees. when i index a page with special chars like ä,ö,ü solr outputs completely foreign signs, not the normal wrong chars with 1/4 or the Flag in it. so it seems that it's not simply the normal utf8/iso-8859-1 discrepancy. has anyone got an idea what's wrong?
Re: Distributing lucene segments across multiple disks.
Very helpful link. Thanks for sharing that. -Deepak On Wed, Sep 11, 2013 at 4:34 PM, Shawn Heisey s...@elyograg.org wrote: On 9/11/2013 4:16 PM, Deepak Konidena wrote: As far as RAM usage goes, I believe we set the heap size to about 40% of the RAM and less than 10% is available for OS caching (since replica takes another 40%). Why does unallocated RAM help? How does it impact performance under load? Because once the data is in the OS disk cache, reading it becomes instantaneous, it doesn't need to go out to the disk. Disks are glacial compared to RAM. Even SSD has a far slower response time. Any recent operating system does this automatically, including the one from Redmond that we all love to hate. http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html Thanks, Shawn
Re: Can we used CloudSolrServer for searching data
Thanks for your reply. I am using SolrCloud with a ZooKeeper setup, and using CloudSolrServer for both indexing and searching. As per my understanding, CloudSolrServer by default uses LBHttpSolrServer: CloudSolrServer connects to ZooKeeper and passes all the running server nodes to LBHttpSolrServer. Thanks once again to you guys for your reply.
Wrapper for SOLR for Compression
I asked this before... But can we add a parameter for SOLR to expose the compression modes in solrconfig.xml? https://issues.apache.org/jira/browse/LUCENE-4226 It mentions that we can set the compression mode: FAST, HIGH_COMPRESSION, FAST_DECOMPRESSION. -- Bill Bell billnb...@gmail.com cell 720-256-8076
number of replicas in Cloud
Hi, I want to set up solrcloud with 2 shards and 1 replica for each shard:
MyCollection
shard1, shard2
shard1-replica, shard2-replica
In this case, I would give numShards=2. For replicationFactor, should I give replicationFactor=1 or replicationFactor=2? Please suggest. Thanks, Prasi
Re: number of replicas in Cloud
Prasi, a replicationFactor of 2 is what you want. However, as of the current releases, this is not persisted. On Thu, Sep 12, 2013 at 11:17 AM, Prasi S prasi1...@gmail.com wrote: Hi, I want to setup solrcloud with 2 shards and 1 replica for each shard. MyCollection shard1 , shard2 shard1-replica , shard2-replica In this case, i would numShards=2. For replicationFactor , should give replicationFactor=1 or replicationFActor=2 ? Pls suggest me. thanks, Prasi -- Anshum Gupta http://www.anshumgupta.net
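For reference, a layout like the one Prasi describes can be created through the Collections API roughly as follows (host, config name and the maxShardsPerNode value are assumptions; maxShardsPerNode only matters when there are fewer nodes than total replicas):
# 2 shards x 2 copies each = 4 cores spread across the available nodes
curl 'http://localhost:8983/solr/admin/collections?action=CREATE&name=MyCollection&numShards=2&replicationFactor=2&maxShardsPerNode=2&collection.configName=myconf'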