Re: Query by range of price
Hi Raymond, I keep trying to encode the '&', but when I look at the Solr log it shows me '%26'. I'm using urlencode and it didn't work. What should I do? I'm using SolrPHPClient. Please suggest. Thank you very much, Rachun
Memory Usage on Windows OS while indexing
Facts:
- OS: Windows Server 2008, 4 CPUs, 8 GB RAM
- Tomcat service version 7.0 (64-bit), only running Solr
- Optional JVM parameters set: xmx = 3072, xms = 1024
- Solr version 4.5.0, one core instance (both for querying and indexing)

Schema config:
- minGramSize=2, maxGramSize=20
- most of the fields are stored=true (required)

Solr config:
- ramBufferSizeMB: 100
- maxIndexingThreads: 8
- directoryFactory: MMapDirectory
- autocommit: maxdocs 1, maxtime 15000, opensearcher false
- caches (defaults): filterCache initialSize: 512, size: 512, autowarm: 0; queryResultCache initialSize: 512, size: 512, autowarm: 0; documentCache initialSize: 512, size: 512, autowarm: 0

Problem description: We're using a .Net service (based on Solr.Net) for updating and inserting documents on a single Solr core instance. The size of the documents sent to Solr varies from 1 KB up to 8 MB; we're sending the documents in batches, using one or multiple threads. The current size of the Solr index is about 15 GB. The indexing service runs around 4 to 5 hours per day to complete all inserts and updates to Solr. While the indexing process is running, the Tomcat process memory usage keeps growing, up to 7 GB of RAM (per the Process Explorer monitoring tool), and does not shrink, even after 24 hours. After a restart of Tomcat, or a "Reload Core" in the Solr admin, the memory drops back to 1 to 2 GB of RAM. When using a tool like VisualVM to monitor the Tomcat process, the memory usage of Tomcat seems OK; memory consumption is in the range of the defined JVM startup params (see image). So it seems that filesystem buffers are consuming all the leftover memory, and don't release it, even after quite some time? Is there a way to handle this behaviour, so that not all memory is consumed? Are there other alternatives? Best practices? http://lucene.472066.n3.nabble.com/file/n4112262/Capture.png Thanks in advance
Re: Query by range of price
That's exactly what I would expect from url-encoding '&'. So, the thing that you're doing works as it should, but you're probably doing something that you should not do (in this case, urlencode). I have not used SolrPHPClient myself, but from the example at http://code.google.com/p/solr-php-client/wiki/FAQ#How_Can_I_Use_Additional_Parameters_%28like_fq,_facet,_etc%29 it appears that you should not do any urlencoding yourself, at all. Further, if you're using data that is already urlencoded, you should urldecode it before handing it over to SolrPHPClient. On Mon, Jan 20, 2014 at 10:34 AM, rachun rachun.c...@gmail.com wrote: snip
Re: Query by range of price
Followup: I *think* something like this should work: $results = $solr->search($query, $start, $rows, array('sort' => 'price_min asc,update_date desc', 'facet.query' => 'price_min:[* TO 1300]')); On Mon, Jan 20, 2014 at 11:05 AM, Raymond Wiker rwi...@gmail.com wrote: snip
LSH in Solr/Lucene
Hi folks, have any of you successfully implemented LSH (MinHash) in Solr? If so, could you share some details of how you went about it? I know LSH is available in Mahout, but I was hoping someone has a Solr or Lucene implementation. Thanks
Re: Memory Usage on Windows OS while indexing
The fact that you see memory consumption that high could be a consequence of heap memory only being released after a full GC. With the VisualVM tool you can try to force a full GC and see if the memory is released. /yago — /Yago Riveiro On Mon, Jan 20, 2014 at 10:03 AM, onetwothree joydivis...@telenet.be wrote: snip
Re: Memory Usage on Windows OS while indexing
Another thing: Solr makes heavy use of the OS cache to cache the index and gain performance. This can be another reason why the Solr process shows a high allocated-memory value. /yago — /Yago Riveiro On Mon, Jan 20, 2014 at 10:03 AM, onetwothree joydivis...@telenet.be wrote: snip
Multi Lingual Analyzer
Hi, I have a query on multi-lingual analysers. Which of the two approaches below is best? 1. To develop a translator that translates any language to English and then use the standard English analyzer to analyse (using the translator both at index time and at search time)? 2. To develop a language-specific analyzer and use it by creating a specific field only for that language? We have client data coming in different languages: Kannada and Telugu, and others later. This data is basically text written by the customer in that language. The requirement is to develop analyzers specific to these languages. Thanks - David
Re: Search Suggestion Filtering
Hi guys, following this thread I have some questions: 1) regarding LUCENE-5350, what is the 'context' quoted there? Is the context a filter query? 2) regarding https://issues.apache.org/jira/browse/SOLR-5378, do we have the final documentation available? Cheers 2014/1/16 Hamish Campbell hamish.campb...@koordinates.com Thank you Jorge. We looked at phrase suggestions from previous user queries, but they're not so useful in our case. However, I have a follow-up question about similar functionality that I'll post shortly. The list might like to know that I've come up with a quick and exceedingly dirty ~~hack~~ solution that works for our limited case. You have been warned! Note that we're using django-haystack to actually interact with Solr: 1. Set nonFuzzyPrefix of the Suggester to 4. 2. At index time, the haystack index will build suggestion terms by extracting the relevant terms and prefixing them with a 4 (alpha) character reference for the target instance. 3. At search time, the user's query is split, terms are prefixed and concatenated. The new query is sent to Solr and the results are cleaned of references before being returned to the front end. I'm not proud of it, but it works. =D On Fri, Jan 17, 2014 at 3:13 AM, Jorge Luis Betancourt González jlbetanco...@uci.cu wrote: In a custom application we have, we use a separate core (under Solr 3.6.1) to store the queries used by the users and then provide the autocomplete feature. In our case we need to filter some phrases that we don't want to be suggested to the users. I built a custom UpdateRequestProcessor to implement this logic, so we define these blocking patterns in some external source of information (DB, files, etc.). For the suggestions per se we use as a base the https://github.com/cominvent/autocomplete configuration, described at www.cominvent.com/2012/01/25/super-flexible-autocomplete-with-solr/, which is pretty usable as it comes. I found (personally) this approach way more flexible than the original suggester component, but it involves storing the user's queries in a separate core. Greetings, - Original Message - From: Hamish Campbell hamish.campb...@koordinates.com To: solr-user@lucene.apache.org Sent: Wednesday, January 15, 2014 9:10:16 PM Subject: Re: Search Suggestion Filtering Thanks Tomás, I'll take a look. Still interested to hear from anyone about using queries to populate the list - I'm willing to give up a bit of performance for the flexibility it would provide. On Thu, Jan 16, 2014 at 1:06 PM, Tomás Fernández Löbbe tomasflo...@gmail.com wrote: I think your use case is the one described in LUCENE-5350; maybe you want to take a look at the patch and comments there. Tomás On Wed, Jan 15, 2014 at 12:58 PM, Hamish Campbell hamish.campb...@koordinates.com wrote: Hi all, I'm looking into options for filtering the search suggestions dictionary. Using Solr 4.6.0, the Suggester component and fst.FuzzyLookupFactory with a field-based dictionary, we're indexing records for a multi-tenanted SaaS platform. SearchHandler records are always filtered by the particular client warehouse (e.g. by domain); however, we need a way to apply a similar filter to the spell check dictionary to prevent leaking terms between clients. In other words: when client A searches for a document title they should not receive spelling suggestions for client B's document titles. This has been asked a couple of times, on the mailing list and on StackOverflow. Some of the suggested approaches: 1.
Use dynamic fields to create dictionaries per warehouse (mentioned here: http://lucene.472066.n3.nabble.com/Filtering-down-terms-in-suggest-tt4069627.html ). That might be a reasonable option for us (we already considered a similar approach), but at what point does this stop scaling efficiently? How many dynamic fields are too many? 2. Run a query to populate the suggestion list (also mentioned in that thread). If I understand this correctly, this would give us a lot of flexibility and power: for example, to give a more nuanced result set using the user's permissions to expose private documents in their spelling suggestions. I expect this would be a slow query, but our total document count is currently relatively small (on the order of 10^3 objects) and I imagine you could create a specific word index with the appropriate fields to keep this in check. Is this a feasible approach, and if so, how do you build a dynamic suggestion list? 3. Other options: It seems like this is a common problem, and we could throw some resources at building an extension to provide some limited suggestion dictionary filtering. Is anyone already doing something similar, or
Re: Query by range of price
Thank you very much, Mr. Raymond. You just saved my world ;) It worked, and *sort by conditions* too, but facet.query=price_min:[* TO 1300] is not working yet; I will try to google for the right solution. Million thanks _/|\_ Rachun.
Re: Error when creating collection in Solr 4.6
Hi, I had the same problem. In my case the error was a copy/paste typo in my solr.xml:

  <str name="genericCoreNodeNames">${genericCoreNodeNames:true}</str>

Ouch! With the type 'bool' instead of 'str' it works definitely better. ;-) Uwe On 28.11.2013 08:53, lansing wrote: Thank you for your replies, I am using the new-style discovery. It worked after adding this setting:

  <bool name="genericCoreNodeNames">${genericCoreNodeNames:true}</bool>
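For readers hitting the same error, a minimal sketch of where this element lives in a new-style (core discovery) solr.xml; the surrounding entries are illustrative defaults, not taken from this thread:

  <solr>
    <solrcloud>
      <str name="host">${host:}</str>
      <int name="hostPort">${jetty.port:8983}</int>
      <!-- must be declared as bool, not str -->
      <bool name="genericCoreNodeNames">${genericCoreNodeNames:true}</bool>
    </solrcloud>
  </solr>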
Re: Memory Usage on Windows OS while indexing
On Mon, 2014-01-20 at 11:02 +0100, onetwothree wrote: Optional JVM parameters set xmx = 3072, xms = 1024 directoryFactory: MMapDirectory [...] So it seems that filesystem buffers are consuming all the leftover memory??, and don't release memory, even after a quite amount of time? As long as the memory is indeed leftover, that is the optimal strategy. Maybe Uwe's explanation of MMapDirectory will help: http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html Regards, Toke Eskildsen, State and University Library, Denmark
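For reference, the directory implementation is selected in solrconfig.xml; a minimal sketch of forcing MMapDirectory, as in the poster's setup (the system-property fallback pattern is the stock convention, not copied from this thread):

  <directoryFactory name="DirectoryFactory"
                    class="${solr.directoryFactory:solr.MMapDirectoryFactory}"/>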
RE: Indexing URLs from websites
Well, it is hard to get a specific anchor because there is usually more than one. The content of the anchors field should be correct. What would you expect if there are multiple anchors? -Original message- From: Teague James teag...@insystechinc.com Sent: Friday 17th January 2014 18:13 To: solr-user@lucene.apache.org Subject: RE: Indexing URLs from websites Progress! I changed the value of that property in nutch-default.xml and I am getting the anchor field now. However, the stuff going in there is a bit random and doesn't seem to correlate to the pages I'm crawling. The primary objective is that when there is something on the page that is a link to a file ...href=/blah/somefile.pdf Get the PDF!... (using ... to prevent actual code in the email) I want to capture that URL and the anchor text Get the PDF! into field(s). Am I going in the right direction on this? Thank you so much for sticking with me on this - I really appreciate your help! -Original Message- From: Markus Jelsma [mailto:markus.jel...@openindex.io] Sent: Friday, January 17, 2014 6:46 AM To: solr-user@lucene.apache.org Subject: RE: Indexing URLs from websites -Original message- From: Teague James teag...@insystechinc.com Sent: Thursday 16th January 2014 20:23 To: solr-user@lucene.apache.org Subject: RE: Indexing URLs from websites Okay. I had used that previously and I just tried it again. The following generated no errors: bin/nutch solrindex http://localhost/solr/ crawl/crawldb -linkdb crawl/linkdb -dir crawl/segments/ Solr is still not getting an anchor field and the outlinks are not appearing in the index anywhere else. To be sure I deleted the crawl directory and did a fresh crawl using: bin/nutch crawl urls -dir crawl -depth 3 -topN 50 Then bin/nutch solrindex http://localhost/solr/ crawl/crawldb -linkdb crawl/linkdb -dir crawl/segments/ No errors, but no anchor fields or outlinks. One thing in the response from the crawl that I found interesting was a line that said: LinkDb: internal links will be ignored. Good catch! That is likely the problem. What does that mean?

  <property>
    <name>db.ignore.internal.links</name>
    <value>true</value>
    <description>If true, when adding new links to a page, links from the same host are ignored. This is an effective way to limit the size of the link database, keeping only the highest quality links.</description>
  </property>

So change the property, rebuild the linkdb and try reindexing once again :) -Original Message- From: Markus Jelsma [mailto:markus.jel...@openindex.io] Sent: Thursday, January 16, 2014 11:08 AM To: solr-user@lucene.apache.org Subject: RE: Indexing URLs from websites Usage: SolrIndexer <solr url> <crawldb> [-linkdb <linkdb>] [-params k1=v1&k2=v2...] (<segment> ... | -dir <segments>) [-noCommit] [-deleteGone] [-deleteRobotsNoIndex] [-deleteSkippedByIndexingFilter] [-filter] [-normalize] You must point to the linkdb via the -linkdb parameter. -Original message- From: Teague James teag...@insystechinc.com Sent: Thursday 16th January 2014 16:57 To: solr-user@lucene.apache.org Subject: RE: Indexing URLs from websites Okay.
I changed my solrindex to this: bin/nutch solrindex http://localhost/solr/ crawl/crawldb crawl/linkdb crawl/segments/20140115143147 I got the same errors: Indexer: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/.../crawl/linkdb/crawl_fetch Input path does not exist: file:/.../crawl/linkdb/crawl_parse Input path does not exist: file:/.../crawl/linkdb/parse_data Input path does not exist: file:/.../crawl/linkdb/parse_text Along with a Java stacktrace Those linkdb folders are not being created. -Original Message- From: Markus Jelsma [mailto:markus.jel...@openindex.io] Sent: Thursday, January 16, 2014 10:44 AM To: solr-user@lucene.apache.org Subject: RE: Indexing URLs from websites Hi - you cannot use wildcards for segments. You need to give one segment or a -dir segments_dir. Check the usage of your indexer command. -Original message- From:Teague James teag...@insystechinc.com Sent: Thursday 16th January 2014 16:43 To: solr-user@lucene.apache.org Subject: RE: Indexing URLs from websites Hello Markus, I do get a linkdb folder in the crawl folder that gets created - but it is created at the time that I execute the command automatically by Nutch. I just tried to use solrindex against yesterday's cawl and did not get any errors, but did not get the anchor field or any of the outlinks. I used this command: bin/nutch solrindex http://localhost/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/* I then tried: bin/nutch solrindex
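For anyone following along: the property change Markus suggests above would typically go in nutch-site.xml, which overrides nutch-default.xml. A minimal sketch:

  <property>
    <name>db.ignore.internal.links</name>
    <value>false</value>
  </property>

After changing it, the linkdb needs to be rebuilt before reindexing, as noted in the thread.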
Re: Changing existing index to use block-join
Quoting Mikhail Khludnev mkhlud...@griddynamics.com: On Sat, Jan 18, 2014 at 11:25 PM, d...@geschan.de wrote: So, my question now: can I change my existing index by just adding an is_parent and a _root_ field and saving the journal id there like I did with j-id, or do I have to reindex all my documents? Absolutely, to use block-join you need to index nested documents as blocks, as described at http://blog.griddynamics.com/2013/09/solr-block-join-support.html e.g. https://gist.github.com/mkhludnev/6406734#file-t-shirts-xml Thank you for the clarification. But there is no way to add new children without indexing the parent document and all existing children again? So, in the example on github, if I want to add new sizes and colors to an existing T-Shirt, I have to reindex the already existing T-Shirt and all its variations again? I understand that the blocks are created at index time, so I can't change an existing index to build blocks just by adding the _root_ field, but I don't get why it's not possible to add new children. Or did I misinterpret your statement? Thanks, -Gesh
Re: Multi Lingual Analyzer
It Depends (tm). Approach (2) will give you better, more specific search results. (1) is simpler to implement and might be good enough... On Mon, Jan 20, 2014 at 5:21 AM, David Philip davidphilipshe...@gmail.com wrote: snip
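To make approach (2) concrete, here is a minimal schema.xml sketch of a per-language field. This assumes the ICU analysis module from Solr's analysis-extras is on the classpath; the field and type names are made up for the example, and since (to my knowledge) Lucene ships no dedicated Kannada or Telugu analyzer, ICU tokenization plus folding is only a reasonable starting point, not the definitive setup:

  <fieldType name="text_kn" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.ICUTokenizerFactory"/>
      <filter class="solr.ICUFoldingFilterFactory"/>
    </analyzer>
  </fieldType>
  <field name="title_kn" type="text_kn" indexed="true" stored="true"/>

At query time you would then search the language-specific field (title_kn for Kannada, a parallel title_te for Telugu, and so on).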
[OT] Use Cases for Taming Text, 2nd ed.
Hi Solr Users, Drew Farris, Tom Morton and I are currently working on the 2nd Edition of Taming Text (http://www.manning.com/ingersoll for first ed.) and are soliciting interested parties who would be willing to contribute to a chapter on practical use cases (i.e. you have something in production and are willing to write about it) for search with Solr, NLP using OpenNLP or Stanford NLP and machine learning using Mahout, OpenNLP or MALLET -- ideally you are using combinations of 2 or more of these to solve your problems. We are especially interested in large scale use cases in eCommerce, Advertising, social media analytics, fraud, etc. The writing process is fairly straightforward. A section roughly equates to somewhere between 3 - 10 pages, including diagrams/pictures. After writing, there will be some feedback from editors and us, but otherwise the process is fairly simple. In order to participate, you must have permission from your company to write on the topic. You would not need to divulge any proprietary information, but we would want enough information for our readers to gain a high-level understanding of your use case. In exchange for your participation, you will have your name and company published on that section of the book as well as in the acknowledgments section. If you have a copy of Lucene in Action or Mahout In Action, it would be similar to the use case sections in those books. If you are interested, please respond privately to me using my gsing...@apache.org email address with this subject line. Thanks, Grant, Drew, Tom
Getting all search words relevant for the document to be found
Hi! I need a little help from you. We have complex documents stored in a database, and on the page we show them from the database. We index them but do not store them in Solr, so we can't use the Solr Highlighter. But we would still like to highlight the search words found in the document. What approach would you suggest? Our approach and idea is hidden in this basic question: is it possible to get the list of all search words with which a specific document was found (with all the language varieties of each word)? Let me explain what I mean with a simplified example. We index the sentence: The big cloud is very dark. The user puts these words in the search box: clouds dark rain. Can I get from Solr the fact that that particular document was found because of the words cloud and dark? Then we can highlight them in the content. Of course we can highlight the exact words the user put in the search field, but that's not enough. We would also like to highlight all the language varieties that the document was found on. Thanks! Best regards, Tomaz
Re: Changing existing index to use block-join
On Mon, Jan 20, 2014 at 6:11 PM, d...@geschan.de wrote: snip Thank you for the clarification. But there is no way to add new children without indexing the parent document and all existing children again? Yes. There is no way to add children incrementally. You need to nuke the whole block and add it with all necessary children. So, in the example on github, if I want to add new sizes and colors to an existing T-Shirt, I have to reindex the already existing T-Shirt and all its variations again? Completely reindex t-shirts with all SKUs. I understand that the blocks are created at index time, so I can't change an existing index to build blocks just by adding the _root_ field, but I don't get why it's not possible to add new children. Or did I misinterpret your statement? Block join relies on internal Lucene docnums, which are defined by the order in which documents have been indexed. This might help: http://blog.mikemccandless.com/2012/01/searching-relational-content-with.html http://www.slideshare.net/MarkHarwood/proposal-for-nested-document-support-in-lucene Thanks, -Gesh -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
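For illustration, re-adding a complete block with the XML update format (supported since Solr 4.5) looks roughly like this; the field names are made up for the example, and the parent plus all of its children must be sent together:

  <add>
    <doc>
      <field name="id">tshirt-1</field>
      <field name="is_parent">true</field>
      <!-- child documents are nested inside the parent doc -->
      <doc>
        <field name="id">tshirt-1-red-s</field>
        <field name="color">red</field>
        <field name="size">S</field>
      </doc>
      <doc>
        <field name="id">tshirt-1-blue-m</field>
        <field name="color">blue</field>
        <field name="size">M</field>
      </doc>
    </doc>
  </add>

Adding a new size later means sending this whole block again with the extra child included.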
Re: Memory Usage on Windows OS while indexing
On 1/20/2014 3:02 AM, onetwothree wrote: OS Windows server 2008 4 Cpu 8 GB Ram snip http://lucene.472066.n3.nabble.com/file/n4112262/Capture.png

That picture seems to be a very low-res copy of your screenshot; I can't really make it out. I can tell you that it's completely normal for the OS disk cache (the filesystem buffers you mention) to take up all leftover memory. If an application requests some of that memory, the OS will instantly give it up.

First, I'm going to explain something about memory reporting and Solr that I've noticed; then I will give you some news you probably won't like. The numbers reported by VisualVM are a true picture of Java heap memory usage; the actual memory usage for Solr will be just a little bit more than those numbers. In the newest versions of Solr, there seems to be a side effect of the Java MMAP implementation that results in incorrect memory usage reporting at the operating system level. Here's top output on one of my Solr servers running CentOS, sorted by memory usage. The process at the top of the list is Solr. https://www.dropbox.com/s/y1nus7lpzlb1mp9/solr-memory-usage-2014-01-20%2010.28.28.png

Some quick numbers for you: the machine has 64GB of RAM. Solr shows a virtual memory size of 59.2GB. My indexes take up 51293336 KB of disk space, and Solr has a 6GB heap, so 59.2GB is not out of line for the virtual memory size. Now for where things get weird: there is 48GB of RAM taken up by the "cached" value, which is the OS disk cache. The screenshot also shows that Solr is using 22GB of resident RAM. If you add the 48GB in the OS disk cache and the 22GB of resident RAM for Solr, you get 70GB ... which is more memory than the machine even HAS, so we know something's off. The 'shared' memory for Solr is 15GB, which, when you subtract it from the 22GB, gives you 7GB; that is much more realistic for a 6GB heap, and also makes it fit within the total system RAM.

The news that you probably won't like: I'm assuming that the whole reason you looked into memory usage was because you're having performance problems. With 8GB of RAM and 3GB given to Solr, you have a little bit less than 5GB of RAM for the OS disk cache. With that much RAM, most people can effectively cache an index up to about 10GB before performance problems show up. Your index is 15GB. You need more total system RAM.
If Solr isn't crashing, you can probably leave the heap at 3GB with no problem. http://wiki.apache.org/solr/SolrPerformanceProblems Thanks, Shawn
Facet count mismatch.
Hello! I've installed a classical two-shard Solr 4.5 topology, without SolrCloud, balanced with an HA proxy. I've got a copyField like this:

  <field name="tagValues" type="string" indexed="true" stored="true" multiValued="false"/>

Copied from this one:

  <field name="tags" type="searchableTextTokenized" indexed="true" stored="true" multiValued="false"/>

  <!-- Fieldtype used in fields available to test searching -->
  <fieldType name="searchableTextTokenized" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.PatternTokenizerFactory" pattern="[\s\t\n\?\!\¿\¡:,;@\\.,\\(\\)\\{\\}\\/\\-]+"/>
      <filter class="solr.ASCIIFoldingFilterFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.ReversedWildcardFilterFactory"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
  </fieldType>

When faceting on the tagValues field I get a total count of 3:

  facet_counts: {
    facet_queries: { },
    facet_fields: { tagsValues: [ sucks, 3 ] },
    facet_dates: { },
    facet_ranges: { }
  }

But when searching like this on tagValues, the total number of documents is not three, but two:

  params: {
    facet: true,
    shards: solr1.test:8081/comments/data,solr2.test:8080/comments/data,
    facet.mincount: 1,
    facet.sort: count,
    q: tagsValues:sucks,
    facet.limit: -1,
    facet.field: tagsValues,
    wt: json
  }

Any idea of what's happening here? I'm confused, :-/ Regards, -- - Luis Cappa
Solr Cloud Bulk Indexing Questions
We are testing our shiny new Solr Cloud architecture, but we are experiencing some issues when doing bulk indexing. We have 5 Solr Cloud machines running and 3 indexing machines (separate from the cloud servers). The indexing machines pull ids off a queue, then index and ship each document over via a CloudSolrServer. It appears that the indexers are too fast, because the load (particularly disk IO) on the Solr Cloud machines spikes through the roof, making the entire cluster unusable. It's kind of odd, because the total index size is not even large, i.e., 10GB. Are there any optimizations/enhancements I could try to help alleviate these problems? I should note that for the above collection we only have 1 shard that is replicated across all machines, so all machines have the full index. Would we benefit from switching to a ConcurrentUpdateSolrServer where all updates get sent to 1 machine and 1 machine only? We could then remove this machine from the cluster that handles user requests. Thanks for any input.
Re: Facet count mismatch.
Hi Luis, Do you have deletions? What happens when you expungeDeletes? http://wiki.apache.org/solr/UpdateXmlMessages#Optional_attributes_for_.22commit.22 Ahmet On Monday, January 20, 2014 10:08 PM, Luis Cappa Banda luisca...@gmail.com wrote: snip
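For reference, expungeDeletes is an optional attribute on the commit update message (see the wiki link above); a minimal sketch of what you would POST to each core's /update handler:

  <commit expungeDeletes="true"/>

This merges away segments containing deleted documents without requiring a full optimize.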
Re: Solr Cloud Bulk Indexing Questions
Questions: How often do you commit your updates? What is your indexing rate in docs/second? In a SolrCloud setup, you should be using a CloudSolrServer. If the server is having trouble keeping up with updates, switching to CUSS probably wouldn't help. So I suspect there's something not optimal about your setup that's the culprit. Best, Erick On Mon, Jan 20, 2014 at 4:00 PM, Software Dev static.void@gmail.com wrote: snip
Re: Solr Cloud Bulk Indexing Questions
We have a soft commit every 5 seconds and a hard commit every 30. As far as docs/second, I would guess around 200/sec, which doesn't seem that high. On Mon, Jan 20, 2014 at 2:26 PM, Erick Erickson erickerick...@gmail.com wrote: snip
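For context, that commit policy corresponds roughly to the following in solrconfig.xml (a sketch matching the intervals described, not necessarily the poster's exact config):

  <autoCommit>
    <maxTime>30000</maxTime> <!-- hard commit every 30s -->
    <openSearcher>false</openSearcher>
  </autoCommit>
  <autoSoftCommit>
    <maxTime>5000</maxTime> <!-- soft commit every 5s -->
  </autoSoftCommit>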
Re: Solr Cloud Bulk Indexing Questions
We also noticed that disk IO shoots up to 100% on 1 of the nodes. Do all updates get sent to one machine or something? On Mon, Jan 20, 2014 at 2:42 PM, Software Dev static.void@gmail.com wrote: snip
Re: Solr Cloud Bulk Indexing Questions
What version are you running? - Mark On Jan 20, 2014, at 5:43 PM, Software Dev static.void@gmail.com wrote: snip
Re: Solr Cloud Bulk Indexing Questions
4.6.0 On Mon, Jan 20, 2014 at 2:47 PM, Mark Miller markrmil...@gmail.com wrote: snip
Re: Multi Lingual Analyzer
MT (machine translation) is not nearly good enough to allow approach 1 to work. On Mon, Jan 20, 2014 at 9:25 AM, Erick Erickson erickerick...@gmail.com wrote: snip
Optimizing index on Slave
All, I know that normally the index should be optimized on the master and then replicated to the slaves, but we have an issue with network bandwidth. We optimize indexes weekly (the total size is around 1.5TB). We have a few slaves set up on the local network, so replicating the whole index is not a big issue. However, we also have one slave in another city (on a backup network), which of course gets replicated over the internet, which is quite slow and expensive. We want to avoid copying the complete index every week after optimization, and were wondering if it's possible to optimize independently on the slave so that there is no delta between master and slave? We tried to do this, but the slave still replicated from the master. -- Regards, Salman Akram
Re: Memory Usage on Windows OS while indexing
Thanks for the reply, dropbox image added.