Indexing gets significantly slower after every batch commit
hi guys, I'm crawling a file system folder and indexing 10 million docs, and I am adding them in batches of 5000, committing every 50 000 docs. The problem I am facing is that after each commit, the documents per sec that are indexed gets less and less. If I do not commit at all, I can index those docs very quickly, and then I commit once at the end, but once i start indexing docs _after_ that (for example new files get added to the folder), indexing is also slowing down a lot. Is it normal that the SOLR indexing speed depends on the number of documents that are _already_ indexed? I think it shouldn't matter if i start from scratch or I index a document in a core that already has a couple of million docs. Looks like SOLR is either doing something in a linear fashion, or there is some magic config parameter that I am not aware of. I've read all perf docs, and I've tried changing mergeFactor, autowarmCounts, and the buffer sizes - to no avail. I am using SOLR 5.1 Thanks ! Angel
Re: Indexing gets significantly slower after every batch commit
On 5/21/2015 2:07 AM, Angel Todorov wrote: I'm crawling a file system folder and indexing 10 million docs, and I am adding them in batches of 5000, committing every 50 000 docs. The problem I am facing is that after each commit, the documents per sec that are indexed gets less and less. If I do not commit at all, I can index those docs very quickly, and then I commit once at the end, but once i start indexing docs _after_ that (for example new files get added to the folder), indexing is also slowing down a lot. Is it normal that the SOLR indexing speed depends on the number of documents that are _already_ indexed? I think it shouldn't matter if i start from scratch or I index a document in a core that already has a couple of million docs. Looks like SOLR is either doing something in a linear fashion, or there is some magic config parameter that I am not aware of. I've read all perf docs, and I've tried changing mergeFactor, autowarmCounts, and the buffer sizes - to no avail. I am using SOLR 5.1 Have you changed the heap size? If you use the bin/solr script to start it and don't change the heap size with the -m option or another method, Solr 5.1 runs with a default size of 512MB, which is *very* small. I bet you are running into problems with frequent and then ultimately constant garbage collection, as Java attempts to free up enough memory to allow the program to continue running. If that is what is happening, then eventually you will see an OutOfMemoryError exception. The solution is to increase the heap size. I would probably start with at least 4G for 10 million docs. Thanks, Shawn
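For reference, the heap change Shawn describes would look roughly like this with the stock bin/solr script (port and size are just examples):

    bin/solr stop -p 8983
    # -m sets both the minimum and maximum heap; the 5.1 default is 512m
    bin/solr start -m 4g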
Search for numbers
Hi, I try to search numbers with a certain deviation. My parser is ExtendedDisMax. A possible search expression could be 'twist drill 1.23 mm'. It will not match any documents, because the document contains the keywords 'twist drill', '1.2' and 'mm'. In order to reach my goal, I've indexed all numbers as points with the solr.SpatialRecursivePrefixTreeFieldType, for example '1.2' as <field name="feature_nr">1.2 0.0</field>. A search with 'drill mm' and a filter query 'fq={!geofilt pt=0,1.23 sfield=feature_nr d=5}' delivers the expected results. Now I have two problems:

1. How can I get ExtendedDisMax to 'replace' the value 1.2 with the '{!geofilt}' function? My first attempts were:
- Build a field type in schema.xml and replace the field content with a regular expression '... replacement="_query_:&quot;{!geofilt pt=0,$1 sfield=feature_nr d=5}&quot;"'. The idea was to use a nested query. But edismax searches 'feature_nr:_query_:{!geofilt pt=0,$1 sfield=feature_nr d=5}'. No documents are found.
- Program a new parser that analyzes the query terms, finds all numbers and does the geospatial stuff. I added this parser in the 'appends' section of the 'requestHandler' definition. But I can get this parser only to filter my results, not to extend them.

2. I want to calculate the distance (d) of the '{!geofilt}' function relative to the value, for example 1%. Could there be a simple solution?

Thank you in advance. Holger
Re: Need help with Nested docs situation
This scenario is a perfect fit to play with Solr Joins [1] . As you observed, you would prefer to go with a query time join. THis kind of join can be done inter-collection . You can have you deal collection and product collection . Every product will have one field dealId to match all the parent deals. When you add,remove,update a new deal, you have to update in the product index all the related products. Then you can query over the products and get related parent deals in the response. Can you give me a little bit more details about your expected use case ? Example of queries and a better explanation of the product previews ? Cheers [1] https://www.youtube.com/watch?v=-OiIlIijWH0feature=youtu.be , http://blog.griddynamics.com/2013/09/solr-block-join-support.html 2015-05-20 18:56 GMT+01:00 Mikhail Khludnev mkhlud...@griddynamics.com: data scale and request rate can judge between block, plain joins and field collapsing. On Thu, Apr 30, 2015 at 1:07 PM, roySolr royrutten1...@gmail.com wrote: Hello, I have a situation and i'm a little bit stuck on the way how to fix it. For example the following data structure: *Deal* All Coca Cola 20% off *Products* Coca Cola light Coca Cola Zero 1L Coca Cola Zero 20CL Coca Cola 1L When somebody search to Cola discount i want the result of the deal with related products. Solution #1: I could index it with nested docs(solr 4.9). But the problem is when a product has some changes(let's say Zero gets a new name Extra Light) i have to re-index every deal with these products. Solution #2: I could make 2 collections, one with deals and one with products. A Product will get a parentid(dealid). Then i have to do 2 queries to get the information? When i have a resultpage with 10 deals i want to preview the first 2 products. That means a lot of queries but it's doesn't have the update problem from solution #1. Does anyone have a good solution for this? Thanks, any help is appreciated. Roy -- View this message in context: http://lucene.472066.n3.nabble.com/Need-help-with-Nested-docs-situation-tp4203190.html Sent from the Solr - User mailing list archive at Nabble.com. -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com -- -- Benedetti Alessandro Visiting card : http://about.me/alessandro_benedetti Tyger, tyger burning bright In the forests of the night, What immortal hand or eye Could frame thy fearful symmetry? William Blake - Songs of Experience -1794 England
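A rough sketch of the query-time join Alessandro describes, assuming a 'deals' collection and a 'products' core where each product carries a dealId field (all names are illustrative). Note that a join with fromIndex requires the 'from' core to live on the same node as the collection being queried:

    # run against the deals collection: find products matching "cola",
    # follow their dealId values, and return the matching parent deals
    q={!join fromIndex=products from=dealId to=id}name:cola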
Re: Solr suggester
right. File-based suggestions should be much faster to build, but it's certainly the case with large indexes that you have to build it periodically so they won't be completely up to date. However, this stuff is way cool. AnalyzingInfixSuggester, for instance, suggests entire fields rather than isolated words, returning the original case, punctuation etc. The index-based spellcheck/suggest just reads terms from the indexed fields which takes no time to build but suffers from reading _indexed_ terms, i.e. terms that have gone through the analysis process that may have been stemmed, lowercased, all that. On Thu, May 21, 2015 at 9:03 AM, jon kerling jonkerl...@yahoo.com.invalid wrote: Hi Erick, I have read your blog and it is really helpful.I'm thinking about upgrading to Solr 5.1 but it won't solve all my problems with this issue, as you said each build will have to read all docs, and analyze it's fields. The only advantage is that I can skip default suggest.build on start up. Thank you for your reply. Jon Kerling. On Thursday, May 21, 2015 6:38 PM, Erick Erickson erickerick...@gmail.com wrote: Frankly, the suggester is rather broken in Solr 4.x with large indexes. Building the suggester index (or FST) requires that _all_ the docs get read, the stored fields analyzed and added to the suggester. Unfortunately, this happens _every_ time you start Solr and can take many minutes whether or not you have buildOnStartup set to false, see: https://issues.apache.org/jira/browse/SOLR-6845. See: http://lucidworks.com/blog/solr-suggester/ See inline. On Thu, May 21, 2015 at 6:12 AM, jon kerling jonkerl...@yahoo.com.invalid wrote: Hi, I'm using solr 4.10 and I'm trying to add autosuggest ability to my application. I'm currently using this kind of configuration: searchComponent name=suggest class=solr.SuggestComponent lst name=suggester str name=namemySuggester/str str name=lookupImplFuzzyLookupFactory/str str name=storeDirsuggester_fuzzy_dir/str str name=dictionaryImplDocumentDictionaryFactory/str str name=fieldfield2/str str name=weightFieldweightField/str str name=suggestAnalyzerFieldTypetext_general/str /lst /searchComponent requestHandler name=/suggest class=solr.SearchHandler startup=lazy lst name=defaults str name=suggesttrue/str str name=suggest.count10/str str name=suggest.dictionarymySuggester/str /lst arr name=components strsuggest/str /arr /requestHandler I wanted to know how the suggester Index/file is being rebuilt. Is it suppose to have all the terms of the desired field in the suggester? Yes. if not, is it related to this kind of lookup implementation? if I'll use other lookup implementation which suggest also infix terms of fields, doesn't it has to hold all terms of the field? Yes. When i call suggest.build, does it build from scratch the suggester Index/file, or is it just doing something like sort of delta indexing suggestions? Builds from scratch Thank You, Jon
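For reference, a minimal AnalyzingInfixSuggester definition along the lines Erick mentions might look like this (field and suggester names are placeholders, not taken from the thread):

    <searchComponent name="suggest" class="solr.SuggestComponent">
      <lst name="suggester">
        <str name="name">infixSuggester</str>
        <str name="lookupImpl">AnalyzingInfixLookupFactory</str>
        <str name="dictionaryImpl">DocumentDictionaryFactory</str>
        <str name="field">title</str>
        <str name="weightField">popularity</str>
        <str name="suggestAnalyzerFieldType">text_general</str>
        <str name="buildOnStartup">false</str>
      </lst>
    </searchComponent>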
Re: Reindex of document leaves old fields behind
I'm relying on an autocommit of 60 secs. I just ran the same test via my SolrJ client and result was the same, SolrCloud query always returns correct number of fields. Is there a way to find out which shard and replica a particular document lives on?
Re: Reindex of document leaves old fields behind
My guess is that you're not committing from your SolrJ program. That's automatic when you post. Best, Erick On Thu, May 21, 2015 at 10:13 AM, tuxedomoon dancolem...@yahoo.com wrote: OK it is composite I've just used post.sh to index a test doc with 3 fields to leader 1 of my SolrCloud. I then reindexed it with 1 field removed and the query on it shows 2 fields. I repeated this a few times and always get the correct field count from Solr. I'm now wondering if SolrJ is somehow involved in performing an atomic update rather than replacement. I will try the above test via SolrJ. -- View this message in context: http://lucene.472066.n3.nabble.com/Reindex-of-document-leaves-old-fields-behind-tp4206710p4206886.html Sent from the Solr - User mailing list archive at Nabble.com.
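A minimal sketch of the explicit commit Erick is referring to, using the same SolrJ classes that appear later in this thread:

    SolrServer solrServer = new HttpSolrServer(solrUrl);
    solrServer.add(solrDoc);
    // without an explicit commit (or a configured autoCommit that opens a new
    // searcher), the replacement document is not yet visible to queries
    solrServer.commit();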
Solr suggester
Hi, I'm using solr 4.10 and I'm trying to add autosuggest ability to my application. I'm currently using this kind of configuration:

    <searchComponent name="suggest" class="solr.SuggestComponent">
      <lst name="suggester">
        <str name="name">mySuggester</str>
        <str name="lookupImpl">FuzzyLookupFactory</str>
        <str name="storeDir">suggester_fuzzy_dir</str>
        <str name="dictionaryImpl">DocumentDictionaryFactory</str>
        <str name="field">field2</str>
        <str name="weightField">weightField</str>
        <str name="suggestAnalyzerFieldType">text_general</str>
      </lst>
    </searchComponent>

    <requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy">
      <lst name="defaults">
        <str name="suggest">true</str>
        <str name="suggest.count">10</str>
        <str name="suggest.dictionary">mySuggester</str>
      </lst>
      <arr name="components">
        <str>suggest</str>
      </arr>
    </requestHandler>

I wanted to know how the suggester index/file is being rebuilt. Is it supposed to have all the terms of the desired field in the suggester? If not, is it related to this kind of lookup implementation? If I use another lookup implementation which also suggests infix terms of fields, doesn't it have to hold all terms of the field? When I call suggest.build, does it build the suggester index/file from scratch, or is it just doing something like a delta indexing of suggestions? Thank You, Jon
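With a configuration like the one above, the rebuild and lookup requests would look roughly as follows (core name and query text are placeholders):

    # rebuild the suggester dictionary from the documents in the index
    curl "http://localhost:8983/solr/mycore/suggest?suggest=true&suggest.dictionary=mySuggester&suggest.build=true"
    # fetch up to 10 suggestions for a prefix
    curl "http://localhost:8983/solr/mycore/suggest?suggest=true&suggest.dictionary=mySuggester&suggest.q=ban&suggest.count=10"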
Logic on Term Frequency Calculation : Bug or Functionality
Hi, I am puzzled on the Term Frequency behaviour of the DefaultSimilarity implementation. I have suppressed the IDF by setting it to 1, so TF-IDF would in turn reflect the same value as the Term Frequency. Below are the inferences; the red-coloured rows were expected to give a hit count (Term Frequency) of 2 but gave one. *Is it a bug or is it how the behaviour is?*

Search Query: AAA BBB
Parsed Query: PhraseQuery(Contents:"aaa bbb"~5000)

    Document | Content     | Slop | TF | Slop | TF | Slop | TF
    1        | AAA BBB     |  -   | 1  |  0   | 1  |  2   | 1
    2        | BBB AAA     |  -   | 1  |  0   | -  |  2   | 1
    3        | AAA AAA BBB |  -   | 1  |  0   | 1  |  2   | 1
    4        | AAA BBB AAA |  -   | 2  |  0   | 1  |  2   | 2
    5        | BBB AAA AAA |  -   | 1  |  0   | -  |  2   | 1
    6        | AAA BBB BBB |  -   | 1  |  0   | 1  |  2   | 1
    7        | BBB AAA BBB |  -   | 1  |  0   | 1  |  2   | 1
    8        | BBB BBB AAA |  -   | 1  |  0   | -  |  2   | 1

*Am I missing something?!* Cheers *Ariya*
Re: [solr 5.1] Looking for full text + collation search field
Thanks for the advice. I have tried the field type and it seems to do what it is supposed to in combination with a lower case filter. However, that raises another slight problem: German umlauts are supposed to be treated slightly different for the purpose of searching than for sorting. For sorting a normal ICUCollationField with standard rules should suffice*, for the purpose of searching I cannot just replace an ü with a u, ü is supposed to equal ue, or, in terms of RuleBasedCollators, there is a secondary difference. The rules for the collator include: ue , ü ae , ä oe , ö ss , ß (again, that applies to searching *only*, for the sorting the rule a , ä would apply, which is implied in the default rules.) I can of course program a filter that does these rudimentary replacements myself, at best after the lower case filter but before the ASCIIFoldingFilter, I am just wondering if there isn't some way to use collations keys for full text search. * even though Latin script and specifically German is my primary concern, I want some rudimentary support for all European languages, including ones that use Cyrillic and Greek script, special symbols in Icelandic that are not strictly Latin and ligatures like Æ, which collation keys could easily provide. Ahmet Arslan iori...@yahoo.com.INVALID schrieb am 22:10 Mittwoch, 20.Mai 2015: Hi Bjorn, solr.ICUCollationField is useful for *sorting*, and you cannot sort on tokenized fields. Your example looks like diacritics insensitive search. Please see : ASCIIFoldingFilterFactory Ahmet On Wednesday, May 20, 2015 2:53 PM, Björn Keil deeph...@web.de wrote: Hello, might anyone suggest a field type with which I may do both a full text search (i.e. there is an analyzer including a tokenizer) and apply a collation? An example for what I want to do: There is a field composer for which I passed the value Dvořák, Antonín. I want the following queries to match: composer:(antonín dvořák) composer:dvorak composer:dvorak, antonin the latter case is possible using a solr.ICUCollationField, but that type does not support an Analyzer and consequently no tokenizer, thus, it is not helpful. Unlike former versions of solr there do not seem to be CollationKeyFilters which you may hang into the analyzer of a solr.TextField... so I am a bit at a loss how I get *both* a tokenizer and a collation at the same time. Thanks for help, Björn
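One way to get the 'ü = ue' behaviour at search time without collation keys is a mapping char filter in front of the tokenizer; a sketch, with invented field type and file names. Since a char filter runs before tokenization (and therefore before lower-casing), the mapping file would also need the upper-case variants:

    <fieldType name="text_de_search" class="solr.TextField">
      <analyzer>
        <!-- mapping-german.txt: lines such as  "ü" => "ue", "ä" => "ae", "ö" => "oe", "ß" => "ss" -->
        <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-german.txt"/>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ASCIIFoldingFilterFactory"/>
      </analyzer>
    </fieldType>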
Index optimize runs in background.
Hi, I am using Solr-5.1.0. I have an indexer class which invokes cloudSolrClient.optimize(true, true, 1). My indexer exits after the invocation of optimize and the optimization keeps on running in the background. Kindly let me know if it is per design and how can I make my indexer to wait until the optimization is over. Is there a configuration/parameter I need to set for the same. Please note that the same indexer with cloudSolrServer.optimize(true, true, 1) on Solr-4.10 used to wait till the optimize was over before exiting. Thanks, Modassar
Re: solr 5.x on glassfish/tomcat instead of jetty
Hi TK, Can you share the thread you found on this WAR topic? Thanks, Steve On Wed, May 20, 2015 at 8:58 PM, TK Solr tksol...@sonic.net wrote: Never mind. I found that thread. Sorry for the noise. On 5/20/15, 5:56 PM, TK Solr wrote: On 5/20/15, 8:21 AM, Shawn Heisey wrote: As of right now, there is still a .war file. Look in the server/webapps directory for the .war, server/lib/ext for logging jars, and server/resources for the logging configuration. Consult your container's documentation to learn where to place these things. At some point in the future, such deployments will no longer be possible, While we are still at this subject, I have been aware there has been an anti-WAR movement in the tech but I don't quite understand where this movement is coming from. Can someone point me to some website summarizing why WARs are bad? Thanks.
Is it possible to do term Search for the filtered result set
Hi all, Is it possible to do a term search for a filtered result set? We can do a term search over all documents. Can we do the term search only for a specified filtered result set? Let's say we have:

Doc1 -- type: A, tags: T1 T2
Doc2 -- type: A, tags: T1 T3
Doc3 -- type: B, tags: T1 T4 T5

Can we do a term search for tags only in type:A documents, so that it gives the results as:

T1 - 02
T2 - 01
T3 - 01

Is this possible? If so, can you please share documentation on this. Thanks Danesh
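A sketch of one standard way to get counts like these — field faceting restricted by a filter query, since facet counts only consider documents matching q and fq (collection name is a placeholder):

    curl "http://localhost:8983/solr/collection1/select?q=*:*&fq=type:A&rows=0&facet=true&facet.field=tags"
    # facet_fields would then list T1=2, T2=1, T3=1 for the example above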
Price Range Faceting Based on Date Constraints
Hi, I have a unique requirement to facet on product prices based on date constraints, for which I have been thinking about a solution for a couple of days now, but to no avail. The details are as follows:
1. Each product can have multiple prices; each price has a start-date and an end-date.
2. At search time, we need to facet on price ranges ($0 - $5, $5 - $20, $20 - $50...)
3. When faceting, a date is first determined. It can be either the current system date or a future date (call it date X)
4. For each product, the price to be used for faceting has to meet the following condition: start-date < date X and date X < end-date, in other words, date X has to fall within start-date and end-date.
5. My Solr version: 3.5

Hopefully I explained the requirement clearly. I have tried a single multivalued price field where each price value has the start date and end date appended. I also tried one field per price with the field name containing both the start date and end date. Neither approach seems to work. Can someone please shed some light as to how the index should be designed and what the facet query should look like? Thanks in advance for your help!
Re: SolrCloud Leader Election
This shouldn't happen, but if it does, there's no good way currently for Solr to automatically fix it. There are a couple of issues being worked on to do that currently. But till then, your best bet is to restart the node which you expect to be the leader (you can look at ZK to see who is at the head of the queue it maintains). If you can't figure that out, safest is to just stop/start all nodes in sequence, and if that doesn't work, stop all nodes and start them back one after the other. On 21 May 2015 00:24, Ryan Steele ryan.ste...@pgi.com wrote: My SolrCloud cluster isn't reassigning the collections leaders from downed cores--the downed cores are still listed as the leaders. The cluster has been in the state for a few hours and the logs continue to report No registered leader was found after waiting for 4000ms. Is there a way to force it to reassign the leader? I'm running SolrCloud 5.0. I have 7 Solr nodes, 3 Zookeeper nodes, and 3739 collections. Thanks, Ryan
Re: Confused about whether Real-time Gets must be sent to leader?
On Thu, May 21, 2015 at 3:15 PM, Timothy Potter thelabd...@gmail.com wrote: I'm seeing that RTG requests get routed to any active replica of the shard hosting the doc requested by /get ... I was thinking only the leader should handle that request since there's a brief window of time where the latest update may not be on the replica (albeit usually very brief) and the latest update is definitely on the leader. There are different levels of consistency. You are guaranteed that after an update completes, a RTG will retrieve that version of the update (or later). The fact that a replica gets the update after the leader is not material to this guarantee since the update has not yet completed. What can happen is that if you are doing multiple RTG requests, you can see a later version of a document, then see a previous version (because you're hitting different shards). This will only be an issue in certain types of use-cases. Optimistic concurrency, for example, will *not* be bothered by this phenomenon. In the past, we've talked about an option to route search requests to the leader. But really, any type of server affinity would work to ensure a monotonic view of a document's history. Off the top of my head, I'm not really sure what types of apps require it, but I'd be interested in hearing about them. -Yonik
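As a concrete illustration of the optimistic-concurrency case Yonik mentions (IDs, fields and version values are made up):

    # real-time get of the current version of a document
    curl "http://localhost:8983/solr/collection1/get?id=doc1&fl=id,_version_"
    # resend the update carrying that _version_; if another client has updated
    # the document in the meantime, Solr rejects the update with a 409 conflict
    curl -H "Content-Type: application/json" \
      "http://localhost:8983/solr/collection1/update?commit=true" \
      -d '[{"id":"doc1","price":10,"_version_":123456789}]'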
Re: Reindex of document leaves old fields behind
A few further clues to this unresolved problem:
1. I found one of my 5 zookeeper instances was down
2. I tried another reindex of a bad document but no change on the SOLR side
3. I deleted and reindexed the same doc, that worked (obviously, but at this point I don't know what to expect)
Re: Price Range Faceting Based on Date Constraints
Another more modern option, very related to this, is to use DateRangeField in 5.0. You have full 64 bit precision. More info is in the Solr Ref Guide. If Alessandro sticks with RPT, then the best reference to give is this: http://wiki.apache.org/solr/SpatialForTimeDurations ~ David https://www.linkedin.com/in/davidwsmiley

On May 21, 2015, at 11:49 AM, Holger Rieß holger.ri...@werkzeug-eylert.de wrote:

Give geospatial search a chance. Use the 'SpatialRecursivePrefixTreeFieldType' field type, set 'geo' to false. The date is located on the X-axis, prices on the Y-axis. For every price you get a horizontal line between start and end date. Index a rectangle with height 0.001 (1 cent) and width 'end date - start date'. Find all prices that are valid on a given day or in a given date range with the 'geofilt' function. The field type could look like (not tested):

    <fieldType name="price_date_range" class="solr.SpatialRecursivePrefixTreeFieldType"
               geo="false" distErrPct="0.025" maxDistErr="0.09" units="degrees"
               worldBounds="1 0 366 1" />

Faceting possibly can be done with a facet query for every one of your price ranges. For example day 20, price range 0-5$, rectangle: <field name="pdr">20.0 0.0 21.0 5.0</field>. Regards Holger
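If moving to Solr 5.x is an option, the DateRangeField side of this might look roughly like the sketch below. Note this only covers filtering prices by their validity window; it does not by itself solve the price-bucket faceting, which is what the RPT rectangle trick above addresses. Field names and dates are invented for illustration:

    <!-- schema: a date range type and a field holding each price's validity window -->
    <fieldType name="dateRange" class="solr.DateRangeField"/>
    <field name="price_valid" type="dateRange" multiValued="true"/>

    <!-- index time: the window during which a price applies -->
    <field name="price_valid">[2015-06-01 TO 2015-08-31]</field>

    <!-- query time: keep documents whose window contains date X -->
    fq={!field f=price_valid op=Contains}2015-07-15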
Re: Price Range Faceting Based on Date Constraints
Thanks Holger and Alessandro, SpatialRecursivePrefixTreeFieldType is a new concept to me, and I need some time to dig into it and see how it can help solve my problem. Alex Wang Technical Architect Crossview, Inc. C: (647) 409-3066 aw...@crossview.com

On Thu, May 21, 2015 at 11:50 AM, Holger Rieß [via Lucene] ml-node+s472066n4206868...@n3.nabble.com wrote: Give geospatial search a chance. Use the 'SpatialRecursivePrefixTreeFieldType' field type, set 'geo' to false. The date is located on the X-axis, prices on the Y axis. For every price you get a horizontal line between start and end date. Index a rectangle with height 0.001( 1 cent) and width 'end date - start date'. Find all prices that are valid on a given day or in a given date range with the 'geofilt' function. The field type could look like (not tested): fieldType name=price_date_range class=solr.SpatialRecursivePrefixTreeFieldType geo=false distErrPct=0.025 maxDistErr=0.09 units=degrees worldBounds=1 0 366 1 / Faceting possibly can be done with a facet query for every of your price ranges. For example day 20, price range 0-5$, rectangle: field name=pdr20.0 0.0 21.0 5.0/field. Regards Holger
Re: Is it possible to search for the empty string?
: Subject: Re: Is it possible to search for the empty string?
:
: Not out of the box.
:
: Fields are parsed into tokens and queries search on tokens. An empty
: string has no tokens for that field and a missing field has no tokens
: for that field.

that's a misleading oversimplification of what *normally* happens. it is absolutely possible to have documents with fields whose indexed terms consist of the empty string, and to search for those empty strings -- the most trivial way being with a simple StrField -- but using TextField with some creative analyzers it's also very possible..

    $ curl 'http://localhost:8983/solr/techproducts/select?q=*:*&facet=true&facet.field=foo_s&wt=json&indent=true&omitHeader=true'
    {
      "response":{"numFound":3,"start":0,"docs":[
          {"id":"foo_blank", "foo_s":"", "_version_":1501816569733316608},
          {"id":"foo_non_blank", "foo_s":"bar", "_version_":1501816583564034048},
          {"id":"foo_missing", "_version_":1501816591383265280}]
      },
      "facet_counts":{
        "facet_queries":{},
        "facet_fields":{
          "foo_s":[
            "",1,
            "bar",1]},
        "facet_dates":{},
        "facet_ranges":{},
        "facet_intervals":{},
        "facet_heatmaps":{}}}

    $ curl 'http://localhost:8983/solr/techproducts/select?q=foo_s:""&wt=json&indent=true&omitHeader=true'
    {
      "response":{"numFound":1,"start":0,"docs":[
          {"id":"foo_blank", "foo_s":"", "_version_":1501816569733316608}]
      }}

    $ curl 'http://localhost:8983/solr/techproducts/select?q=foo_s:*&wt=json&indent=true&omitHeader=true'
    {
      "response":{"numFound":2,"start":0,"docs":[
          {"id":"foo_blank", "foo_s":"", "_version_":1501816569733316608},
          {"id":"foo_non_blank", "foo_s":"bar", "_version_":1501816583564034048}]
      }}

    $ curl 'http://localhost:8983/solr/techproducts/select?q=-foo_s:*&wt=json&indent=true&omitHeader=true'
    {
      "response":{"numFound":1,"start":0,"docs":[
          {"id":"foo_missing", "_version_":1501816591383265280}]
      }}

-Hoss http://www.lucidworks.com/
optimal shard assignment with low shard key cardinality using compositeId to enable shard splitting
Hi, I'd like some feedback on how I'd like to solve the following sharding problem. I have a collection that will eventually become big. Average document size is 1.5 KB, and every year 30 million documents will be indexed. Data come from different document producers (a person, owner of his documents) and queries are almost always performed by a document producer who can only query his own documents, so sharding by document producer seems a good choice.

There are 3 types of document producer:
- type A: cardinality 105 (there are 105 producers of this type), producing 17M docs/year (the aggregated production of all type A producers)
- type B: cardinality ~10k, producing 4M docs/year
- type C: cardinality ~10M, producing 9M docs/year

I'm thinking about using compositeId (solrDocId = producerId!docId) to send all docs of the same producer to the same shards. When a shard becomes too large I can use shard splitting.

Problem: documents from type A producers could be oddly distributed among shards, because hashing doesn't work well on small numbers (105), see the appendix. As a solution I could do this when a new type A producer (producerA1) arrives:
1) client app: generate a producer code
2) client app: simulate murmur hashing and shard assignment
3) client app: check that the shard assignment is optimal (the producer code is assigned to the shard with the fewest type A producers), otherwise go to 1) and try another code

When I add documents or perform searches for producerA1, I use its producer code respectively in the compositeId or in the route parameter. What do you think?

--- Appendix: murmurhash shard assignment simulation ---

    import mmh3
    hashes = [mmh3.hash(str(i)) >> 16 for i in xrange(105)]
    num_shards = 16
    shards = [0] * num_shards
    for hash in hashes:
        idx = hash % num_shards
        shards[idx] += 1
    print shards
    print sum(shards)

Result: [4, 10, 6, 7, 8, 6, 7, 8, 11, 1, 8, 5, 6, 5, 5, 8]

So with 16 shards and 105 shard keys I can have shards with 1 key and shards with 11 keys.
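For reference, the compositeId scheme described above would be used roughly like this (IDs and field names invented):

    # index a document for producer A1; the part before '!' determines the shard
    curl -H "Content-Type: application/json" "http://localhost:8983/solr/collection1/update" \
      -d '[{"id":"producerA1!doc123","producer":"producerA1"}]'
    # restrict a query to the shard(s) holding that producer's documents
    curl "http://localhost:8983/solr/collection1/select?q=producer:producerA1&_route_=producerA1!"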
Re: Price Range Faceting Based on Date Constraints
Hi Alex, Thanks for the link to the presentation. I am going through the slides and trying to figure out the time-sensitive search it talks about and how it relates to the problem I am facing. It looks like it tries to solve the problem of sku availability based on date, while in my case, all skus are available, but the prices are time-sensitive, and faceting logic needs to pick the right price for each sku when counting.
Re: Price Range Faceting Based on Date Constraints
Did you look at Gilt's presentation from a while ago: http://www.slideshare.net/trenaman/personalized-search-on-the-largest-flash-sale-site-in-america Slides 33 on might be most relevant. Regards, Alex. Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: http://www.solr-start.com/ On 21 May 2015 at 22:58, alexw aw...@crossview.com wrote: Hi, I have an unique requirement to facet on product prices based on date constraints, for which I have been thinking for a solution for a couple of days now, but to no avail. The details are as follows: 1. Each product can have multiple prices, each price has a start-date and an end-date. 2. At search time, we need to facet on price ranges ($0 - $5, $5-$20, $20-$50...) 3. When faceting, a date is first determined. It can be either the current system date or a future date (call it date X) 4. For each product, the price to be used for faceting has to meet the following condition: start-date date X, and date X end-date, in other words, date X has to fall within start-date and end-date. 5. My Solr version: 3.5 Hopefully I explained the requirement clearly. I have tried single price field with multivalue and each price value has startdate and enddate appended. I also tried one field per price with the field name containing both startdate and enddate. Neither approach seems to work. Can someone please shed some light as to how the index should be designed and what the facet query should look like? Thanks in advance for your help! -- View this message in context: http://lucene.472066.n3.nabble.com/Price-Range-Faceting-Based-on-Date-Constraints-tp4206817.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Reindex of document leaves old fields behind
"If it is implicit then you may have indexed the new document to a different shard, which means that it is now in your index more than once, and which one gets returned may not be predictable."

If a document with uniqueKey 1234 is assigned to a shard by SolrCloud implicit routing, won't a reindex of 1234 be assigned to the same shard? If not, you'd have dups all over the cluster.
Clarification on Collections API for 5.x
Hi, In the guide for moving from Solr 4.x to 5.x, it states the following: "Solr 5.0 only supports creating and removing SolrCloud collections through the Collections API (https://cwiki.apache.org/confluence/display/solr/Collections+API), unlike previous versions. While not using the collections API may still work in 5.0, it is unsupported, not recommended, and the behavior will change in a 5.x release." Currently, we launch several solr nodes with identical cores defined using the new Core Discovery process. These nodes are also connected to a zookeeper ensemble. Part of the core definition is to set the configSet to use. This configSet is uploaded to zookeeper separately. This effectively creates a Collection. Is this method no longer supported in 5.x? Thanks! Jim Musil
Re: Reindex of document leaves old fields behind
"let's see the code."

Simplified code and some comments:
1. solrUrl points at leader 1 of 3 leaders, each with a replica
2. createSolrDoc takes a full Mongo doc and returns a valid SolrInputDocument
3. I have done dumps of the returned solrDoc and verified it does not have the unwanted fields

    SolrServer solrServer = new HttpSolrServer(solrUrl);
    SolrInputDocument solrDoc = solrDocFactory.createSolrDoc(mongoDoc, dbName);
    UpdateResponse uresponse = solrServer.add(solrDoc);

When I issue a query on some of the unique ids in question, SolrCloud is returning only 1 document per uniqueKey.

"Did you push your schema up to Zookeeper and reload (or restart) your collection before re-indexing things?"
No. The config was pushed up to Zookeeper only once, a few months ago. The documents in question were updated in Mongo and given an updated create_date. Based on this new create_date my SolrJ client detects and reindexes them.

"are you sure the documents are actually getting indexed and that the update is succeeding?"
Yes, I see a new value in the timestamp field each time I reindex.
Re: Price Range Faceting Based on Date Constraints
Thanks Alessandro. I am implementing this in the Hybris framework. It is not easy to create nested documents during indexing using the Hybris Solr indexer. So I am trying to avoid additional documents and cores if at all possible.
Re: Price Range Faceting Based on Date Constraints
Hi Alex, this is not a simple problem. In your domain we can consider a Product as a document and the list of Price nested Documents. Ideally we would model the Product as the father and the prices as children. Each Price will be defined by : - *start_date * - *end_date * - *price * - *productId* We can define 2 collections this way and play with Joins and faceting. Take a look here : http://lucene.472066.n3.nabble.com/How-do-I-get-faceting-to-work-with-Solr-JOINs-td4147785.html#a4148838 If redundancy of data is not a problem for you, you can proceed with a simple approach where you add redundant documents. Each document will have the start_date,end_date and price as single value fields. In the redundant scenario, the approach to follow is quite easy : - always filtering by date the docs and then proceed faceting . Cheers 2015-05-21 13:58 GMT+01:00 alexw aw...@crossview.com: Hi, I have an unique requirement to facet on product prices based on date constraints, for which I have been thinking for a solution for a couple of days now, but to no avail. The details are as follows: 1. Each product can have multiple prices, each price has a start-date and an end-date. 2. At search time, we need to facet on price ranges ($0 - $5, $5-$20, $20-$50...) 3. When faceting, a date is first determined. It can be either the current system date or a future date (call it date X) 4. For each product, the price to be used for faceting has to meet the following condition: start-date date X, and date X end-date, in other words, date X has to fall within start-date and end-date. 5. My Solr version: 3.5 Hopefully I explained the requirement clearly. I have tried single price field with multivalue and each price value has startdate and enddate appended. I also tried one field per price with the field name containing both startdate and enddate. Neither approach seems to work. Can someone please shed some light as to how the index should be designed and what the facet query should look like? Thanks in advance for your help! -- View this message in context: http://lucene.472066.n3.nabble.com/Price-Range-Faceting-Based-on-Date-Constraints-tp4206817.html Sent from the Solr - User mailing list archive at Nabble.com. -- -- Benedetti Alessandro Visiting card : http://about.me/alessandro_benedetti Tyger, tyger burning bright In the forests of the night, What immortal hand or eye Could frame thy fearful symmetry? William Blake - Songs of Experience -1794 England
Re: Logic on Term Frequency Calculation : Bug or Functionality
Hi Ariya, DefaultSimilarity does not use raw term frequency, but instead it uses square root of raw term frequency. If you want to observe raw term frequency information in explain section, I suggest you to play with org.apache.lucene.search.similarities.SimilarityBase and its sub-classes. ahmet On Thursday, May 21, 2015 3:59 PM, ariya bala ariya...@gmail.com wrote: Hi, I am puzzled on the Term Frequency Behaviour of the DefaultSimilarity implementation I have suppressed the IDF by setting to 1. TF-IDF would inturn reflect the same value as in Term Frequency Below are the inferences: Red coloured are expected to give a hit count(Term Frequency) of 2 but was one. *Is it bug or is it how the behaviour is?* Search Query: AAA BBB Parsed Query: PhraseQuery(Contents:\aaa bbb\~5000) DocumentContentSlopTFslop0TFslop2TF1AAA BBB-101212BBB AAA-10-213AAA AAA BBB- 101214AAA BBB AAA-201225BBB AAA AAA-10-216AAA BBB BBB-101217BBB AAA BBB-1012 18BBB BBB AAA-10-21 *Am I missing something?!* Cheers *Ariya *
Re: Indexing gets significantly slower after every batch commit
bq: Which is logical as index growth and time needed to put something to it is log(n) Not really. Solr indexes to segments, each segment is a fully consistent mini index. When a segment gets flushed to disk, a new one is started. Of course there'll be a _little bit_ of added overyead, but it shouldn't be all that noticeable. Furthermore, they're append only. In the past, when I've indexed the Wiki example, my indexing speed actually goes faster. So on the surface this sounds very strange to me. Are you seeing anything at all in the Solr logs that's supsicious? Best, Erick On Thu, May 21, 2015 at 12:22 PM, Sergey Shvets ser...@bintime.com wrote: Hi Angel We also noticed that kind of performance degrade in our workloads. Which is logical as index growth and time needed to put something to it is log(n) четверг, 21 мая 2015 г. пользователь Angel Todorov написал: hi Shawn, Thanks a bunch for your feedback. I've played with the heap size, but I don't see any improvement. Even if i index, say , a million docs, and the throughput is about 300 docs per sec, and then I shut down solr completely - after I start indexing again, the throughput is dropping below 300. I should probably experiment with sharding those documents to multiple SOLR cores - that should help, I guess. I am talking about something like this: https://cwiki.apache.org/confluence/display/solr/Shards+and+Indexing+Data+in+SolrCloud Thanks, Angel On Thu, May 21, 2015 at 11:36 AM, Shawn Heisey apa...@elyograg.org javascript:; wrote: On 5/21/2015 2:07 AM, Angel Todorov wrote: I'm crawling a file system folder and indexing 10 million docs, and I am adding them in batches of 5000, committing every 50 000 docs. The problem I am facing is that after each commit, the documents per sec that are indexed gets less and less. If I do not commit at all, I can index those docs very quickly, and then I commit once at the end, but once i start indexing docs _after_ that (for example new files get added to the folder), indexing is also slowing down a lot. Is it normal that the SOLR indexing speed depends on the number of documents that are _already_ indexed? I think it shouldn't matter if i start from scratch or I index a document in a core that already has a couple of million docs. Looks like SOLR is either doing something in a linear fashion, or there is some magic config parameter that I am not aware of. I've read all perf docs, and I've tried changing mergeFactor, autowarmCounts, and the buffer sizes - to no avail. I am using SOLR 5.1 Have you changed the heap size? If you use the bin/solr script to start it and don't change the heap size with the -m option or another method, Solr 5.1 runs with a default size of 512MB, which is *very* small. I bet you are running into problems with frequent and then ultimately constant garbage collection, as Java attempts to free up enough memory to allow the program to continue running. If that is what is happening, then eventually you will see an OutOfMemoryError exception. The solution is to increase the heap size. I would probably start with at least 4G for 10 million docs. Thanks, Shawn
Re: Reindex of document leaves old fields behind
I'm posting the fields from one of my problem document, based on this comment I found from Shawn on Grokbase. If you are trying to use a Map object as the value of a field, that is probably why it is interpreting your add request as an atomic update. If this is the case, and you're doing it because you have a multivalued field, you can use a List object rather than a Map. This is just a solrDoc.toString() with linebreaks where commas were. Maybe some of these are being seen as map fields by SOLR. = SolrInputDocument[ mynamespaces_s_mv=[drama], changedates_s_mv=[Tue May 19 17:21:26 EDT 2015, Thu Dec 30 19:00:00 EST ], networks_t_mv=[{ abcitem-id : 288578fd-6596-47bc-af95-80daecd1f24a , abccontentType : Standard:SocialHandle , SocialNetwork : { $uuid : 73553c4c-4919-4ba9-b16c-fb340f3e4c31} , Handle : in my imaginationseries}], links_s_mv=[ { $uuid : 4d8eb47c-ce2d-4e7f-a567-d8d6692fed4e} , { $uuid : 9fd75c26-35f2-4f48-b55a-6e82089cc3ba} , { $uuid : 150e43ed-9ebe-41b4-86cc-bdf4885a50fe} , { $uuid : e20b0040-561f-4c34-9dd3-df85250b5a5b} , { $uuid : 0cff75d0-4f32-46c9-9092-60eec2dc847a} , { $uuid : 73553c4c-4919-4ba9-b16c-fb340f3e4c31}], ratings_t_mv=[{ abcitem-id : 56058649-579a-4160-9439-e59448eb3dff , abccontentType : Standard:TVPG , Rating : { $uuid : 150e43ed-9ebe-41b4-86cc-bdf4885a50fe}}], title_ci_t=in my imagination, urlkey_s=in-my imagination, title_cs_t=In My Imagination, dp2_1_s_mv=[ { _id : { $uuid : 4d8eb47c-ce2d-4e7f-a567-d8d6692fed4e} , _rules : [ { _startDate : { $date : 2015-03-23T14:58:00.000Z} , _endDate : { $date : -12-31T00:00:00.000Z} , _r : { $uuid : 47b6b31d-d690-437a-9bab-6eeb7be3c8a4} , _p : { $uuid : d478874f-8fc7-4b3d-97f3-f7e63222d633} , _o : { $uuid : 983b6ae9-7882-4af8-bb2f-cff342be99b3} , _a : null }]}], seriestype_s=e20b0040-561f-4c34-9dd3-df85250b5a5b, shortid_s=x5jqqf, i shorttitle_t=In My Imagination, uuid_s=90a1fbbf-ddf8-47a7-9f00-55f05e7dc297, status_s=DEFAULT, updatedby_s=maceirar, description_t=sometext, review_s_mv=[{ abcpublished : { $date : 2015-05-19T21:21:30.930Z} , abcpublishedBy : jelly , abctargetEnvironment : entertainment-staging , abcrequestId : { $uuid : 56769138-4a03-4ed6-8b29-8030d0941b08} , abcsourceEnvironment : fishing , abcstate : true}, { abcpublished : { $date : 2015-05-19T21:21:31.731Z} , abcpublishedBy : jelly , abctargetEnvironment : myshow-live , abcrequestId : { $uuid : 56769138-4a03-4ed6-8b29-8030d0941b08} , abcsourceEnvironment : myshow-staging , abcstate : true}], sorttitle_t=In My Imagination, images_s_mv=[ { $uuid : 9fd75c26-35f2-4f48-b55a-6e82089cc3ba} , { $uuid : 0cff75d0-4f32-46c9-9092-60eec2dc847a}], title_ci_s=in my imagination, firmuuids_s_mv=[ { $uuid : 4d8eb47c-ce2d-4e7f-a567-d8d6692fed4e}], id=mongo-v2.abcnservices.com-fishing-90a1fbbf-ddf8-47a7-9f00-55f05e7dc297, timestamp=Thu May 21 17:29:58 EDT 2015 ] -- View this message in context: http://lucene.472066.n3.nabble.com/Reindex-of-document-leaves-old-fields-behind-tp4206710p4206963.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Price Range Faceting Based on Date Constraints
Thanks David. Unfortunately we are on Solr 3.5, so I am not sure whether RPT is available. If not, is there a way to patch 3.5 to make it work?
Re: Price Range Faceting Based on Date Constraints
Indeed: https://github.com/dsmiley/SOLR-2155 On Thu, May 21, 2015 at 8:59 PM alexw aw...@crossview.com wrote: Thanks David. Unfortunately we are on Solr 3.5, so I am not sure whether RPT is available. If not, is there a way to patch 3.5 to make it work? -- View this message in context: http://lucene.472066.n3.nabble.com/Price-Range-Faceting-Based-on-Date-Constraints-tp4206817p4207003.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: SolrCloud with local configs
On 5/21/2015 7:24 PM, Steven Bower wrote: Is it possible to run in cloud mode with zookeeper managing collections/state/etc.. but to read all config files (solrconfig, schema, etc..) from local disk? Obviously this implies that you'd have to keep them in sync.. My thought here is of running Solr in a docker container, but instead of having to manage schema changes/etc via zk I can just build the config into the container.. and then just produce a new docker image with a solr version and the new config and just do rolling restarts of the containers.. As far as I am aware, this is not possible. As I think about it, I'm not convinced that it's a good idea. If you're going to be using zookeeper for ANY purpose, the config should be centralized in zookeeper. The ZK chroot (or new ZK ensemble, if you choose to go that route) will be dedicated to that specific cluster. It won't be shared with any other cluster. Any automation you've got that fires up a new cluster can simply upload the cluster-specific config into the new ZK chroot as it builds the container(s) for the cluster. Teardown automation can delete the chroot. The idea is probably worth an issue in jira. I won't veto the implementation, but as I said above, I'm not yet convinced that it's a good idea -- ZK is already in use for the clusterstate, using it for the config completely eliminates the need for config synchronization. Do you have a larger compelling argument? Thanks, Shawn
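A sketch of the config-upload step mentioned above, using the zkcli script that ships with Solr 5.x (hosts, paths and names are placeholders):

    server/scripts/cloud-scripts/zkcli.sh -zkhost zk1:2181/mycluster \
      -cmd upconfig -confdir /path/to/configset/conf -confname myconf
    # then create the collection against that config via the Collections API
    curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=2&replicationFactor=2&collection.configName=myconf"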
Re: Index Sizes
On 1/7/2014 7:48 AM, Steven Bower wrote: I was looking at the code for getIndexSize() on the ReplicationHandler to get at the size of the index on disk. From what I can tell, because this does directory.listAll() to get all the files in the directory, the size on disk includes not only what is searchable at the moment but potentially also files that are being created by background merges/etc.. I am wondering if there is an API that would give me the size of the currently searchable index files (doubt this exists, but maybe).. If not what is the most appropriate way to get a list of the segments/files that are currently in use by the active searcher such that I could then ask the directory implementation for the size of all those files? For a more complete picture of what I'm trying to accomplish, I am looking at building a quota/monitoring component that will trigger when index size on disk gets above a certain size. I don't want to trigger if index is doing a merge and ephemerally uses disk for that process. If anyone has any suggestions/recommendations here too I'd be interested.. Dredging up a VERY old thread here. As I was replying to your most recent query, I was looking through my email archive for your previous messages and this one caught my eye, especially because it never got a reply. It must have escaped my notice last year. This is a very good idea. I imagine that the active searcher object directly or indirectly knows exactly which files are in use for that searcher, so I think it should be relatively easy for it to retrieve a list, and the index size code should be able to return both the active index size as well as the total directory size. I've been putting a little bit of work in to get the index size code moved out of the replication handler so that it is available even if replication is completely disabled, but my free time has been limited. I don't recall the issue number(s) for that work. Thanks, Shawn
Re: Index optimize runs in background.
Hi An insight on the question will be really helpful. Thanks, Modassar On Thu, May 21, 2015 at 5:51 PM, Modassar Ather modather1...@gmail.com wrote: Hi, I am using Solr-5.1.0. I have an indexer class which invokes cloudSolrClient.optimize(true, true, 1). My indexer exits after the invocation of optimize and the optimization keeps on running in the background. Kindly let me know if it is per design and how can I make my indexer to wait until the optimization is over. Is there a configuration/parameter I need to set for the same. Please note that the same indexer with cloudSolrServer.optimize(true, true, 1) on Solr-4.10 used to wait till the optimize was over before exiting. Thanks, Modassar
SolrCloud with local configs
Is it possible to run in cloud mode with zookeeper managing collections/state/etc.. but to read all config files (solrconfig, schema, etc..) from local disk? Obviously this implies that you'd have to keep them in sync.. My thought here is of running Solr in a docker container, but instead of having to manage schema changes/etc via zk I can just build the config into the container.. and then just produce a new docker image with a solr version and the new config and just do rolling restarts of the containers.. Thanks, Steve
Re: Search for numbers
Hi Holger, It’s not apparent to me why you are using the spatial field to index a number. Why not simply a “tfloat” or whatever numeric field? Then you could use {!frange} with a function to get the difference and filter it to be in the range you want. RE query parsing (problem #1): you should write a custom query parser… perhaps by forking ExtendedDisMaxQParser to meet your needs. But I think you’ll have something cleaner / more maintainable if you write one from scratch while looking at that QParser for tips/inspiration; not porting the features you don’t want. RE problem #2: I’m a little unclear on what you want to do, but it’s likely you can express it with {!frange} on a number field (not spatial) with the right functions. If you can’t), you could write either a custom function (AKA ValueSource) or if needed a frange like thing for your custom needs. ~ David http://www.linkedin.com/in/davidwsmiley On Thu, May 21, 2015 at 3:22 AM Holger Rieß holger.ri...@werkzeug-eylert.de wrote: Hi, I try to search numbers with a certain deviation. My parser is ExtendedDisMax. A possible search expression could be 'twist drill 1.23 mm'. It will not match any documents, because the document contains the keywords 'twist drill', '1.2' and 'mm'. In order to reach my goal, I've indexed all numbers as points with the solr.SpatialRecursivePrefixTreeFieldType. For example '1.2' as field name=feature_nr1.2 0.0/field. A search with 'drill mm' and a filter query 'fq={!geofilt pt=0,1.23 sfield=feature_nr d=5}' delivers the expected results. Now I have two problems: 1. How can I get ExtendedDisMax, to 'replace' the value 1.2 with the '{!geofilt}' function? My first attemts were - Build a field type in schema.xml and replace the field content with a regular expression '... replacement=_query_:quot;{!geofilt pt=0,$1 sfield=feature_nr d=5}quot;'. The idea was to use a nested query. But edismax searches 'feature_nr:_query_:{!geofilt pt=0,$1 sfield=feature_nr d=5}'. No documents are found. - Program a new parser that analyzes the query terms, finds all numbers and does the geospatial stuff. Added this parser in the 'appends' section of the 'requestHandler' definition. But I can get this parser only to filter my results, not to extend them. 2. I want to calculate the distance (d) of the '{!geofilt}' function relative to the value, for example 1%. Could there be a simple solution? Thank you in advance. Holger
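A sketch of the {!frange} idea for "number ± deviation" on a plain numeric field (reusing the feature_nr name from the thread; the 1% window would be computed client-side and is just illustrative):

    # keep documents whose feature_nr lies within 0.0123 of 1.23
    fq={!frange l=0 u=0.0123}abs(sub(feature_nr,1.23))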
Applying gzip compression in Solr 5.1
Hi, I'm trying to apply gzip compression in Solr 5.1. I understand that running Solr on Tomcat is no longer supported from Solr 5.0, so I've tried to implement it in Solr. I've downloaded jetty-servlets-9.3.0.RC0.jar and placed it in my webapp\WEB-INF folder, and have added the following in webapp\WEB-INF\web.xml:

    <filter>
      <filter-name>GzipFilter</filter-name>
      <filter-class>org.eclipse.jetty.servlets.GzipFilter</filter-class>
      <init-param>
        <param-name>methods</param-name>
        <param-value>GET,POST</param-value>
        <param-name>mimeTypes</param-name>
        <param-value>text/html;charset=UTF-8,text/plain,text/xml,text/json,text/javascript,text/css,text/plain;charset=UTF-8,application/xhtml+xml,application/javascript,image/svg+xml,application/json,application/xml; charset=UTF-8</param-value>
      </init-param>
    </filter>
    <filter-mapping>
      <filter-name>GzipFilter</filter-name>
      <url-pattern>/*</url-pattern>
    </filter-mapping>

However, when I start Solr and check the browser, there's no gzip compression. Is there anything which I configured wrongly or might have missed out? I'm also running zookeeper-3.4.6. Regards, Edwin
Re: Java upgrade for solr in master-slave configuration
Hi, Anybody tried upgrading master first prior to slave Java upgrade. Please suggest. On Tue, May 19, 2015 at 6:50 PM, Shawn Heisey apa...@elyograg.org wrote: On 5/19/2015 12:21 AM, Kamal Kishore Aggarwal wrote: I am currently working with Java-1.7, Solr-4.8.1 with tomcat 7. The solr configuration has slave master architecture. I am looking forward to upgrade Java from 1.7 to 1.8 version in order to take advantage of memory optimization done in latest version. So, I am confused if I should upgrade java first on master server and then on slave server or the other way round. What should be the ideal steps, so that existing solr index and other things should not get corrupted . Please suggest. I am not aware of any changes in index format resulting from changing your Java version. It should not matter which machines you upgrade first. Thanks, Shawn
Confused about whether Real-time Gets must be sent to leader?
I'm seeing that RTG requests get routed to any active replica of the shard hosting the doc requested by /get ... I was thinking only the leader should handle that request since there's a brief window of time where the latest update may not be on the replica (albeit usually very brief) and the latest update is definitely on the leader. Am I overthinking this since we've always maintained that Solr is eventually consistent or ??? Cheers, Tim
Re: solr uima and opennlp
Hi Andreaa, 2015-05-21 18:12 GMT+02:00 hossmaa andreea.hossm...@gmail.com: Hi everyone I'm trying to plug in a new UIMA annotator into solr. What is necessary for this? Is is enough to build a Jar similarly to the ones from the uima-addons package? yes, exactly. Actually you just need a jar containing the Annotator class (and dependencies) that you reference from within the UIMAUpdateRequestProcessor. More specifically, are the uima-addona Jars identical to the ones found in solr's contrib folder? they are the 2.3.1 versions of those jars. Regards, Tommaso Thanks! Andreea -- View this message in context: http://lucene.472066.n3.nabble.com/solr-uima-and-opennlp-tp4206873.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Indexing gets significantly slower after every batch commit
Hi Angel, We also noticed that kind of performance degradation in our workloads, which is logical: as the index grows, the time needed to add something to it grows as log(n). On Thursday, May 21, 2015, Angel Todorov wrote: hi Shawn, Thanks a bunch for your feedback. I've played with the heap size, but I don't see any improvement. Even if I index, say, a million docs, and the throughput is about 300 docs per sec, and then I shut down Solr completely - after I start indexing again, the throughput drops below 300. I should probably experiment with sharding those documents to multiple SOLR cores - that should help, I guess. I am talking about something like this: https://cwiki.apache.org/confluence/display/solr/Shards+and+Indexing+Data+in+SolrCloud Thanks, Angel On Thu, May 21, 2015 at 11:36 AM, Shawn Heisey apa...@elyograg.org wrote: On 5/21/2015 2:07 AM, Angel Todorov wrote: I'm crawling a file system folder and indexing 10 million docs, and I am adding them in batches of 5000, committing every 50 000 docs. The problem I am facing is that after each commit, the documents per sec that are indexed gets less and less. If I do not commit at all, I can index those docs very quickly, and then I commit once at the end, but once i start indexing docs _after_ that (for example new files get added to the folder), indexing is also slowing down a lot. Is it normal that the SOLR indexing speed depends on the number of documents that are _already_ indexed? I think it shouldn't matter if i start from scratch or I index a document in a core that already has a couple of million docs. Looks like SOLR is either doing something in a linear fashion, or there is some magic config parameter that I am not aware of. I've read all perf docs, and I've tried changing mergeFactor, autowarmCounts, and the buffer sizes - to no avail. I am using SOLR 5.1 Have you changed the heap size? If you use the bin/solr script to start it and don't change the heap size with the -m option or another method, Solr 5.1 runs with a default size of 512MB, which is *very* small. I bet you are running into problems with frequent and then ultimately constant garbage collection, as Java attempts to free up enough memory to allow the program to continue running. If that is what is happening, then eventually you will see an OutOfMemoryError exception. The solution is to increase the heap size. I would probably start with at least 4G for 10 million docs. Thanks, Shawn
Re: Solr suggester
Hi Erick, I have read your blog and it is really helpful. I'm thinking about upgrading to Solr 5.1, but it won't solve all my problems with this issue; as you said, each build will have to read all docs and analyze their fields. The only advantage is that I can skip the default suggest.build on startup. Thank you for your reply. Jon Kerling. On Thursday, May 21, 2015 6:38 PM, Erick Erickson erickerick...@gmail.com wrote: Frankly, the suggester is rather broken in Solr 4.x with large indexes. Building the suggester index (or FST) requires that _all_ the docs get read, the stored fields analyzed and added to the suggester. Unfortunately, this happens _every_ time you start Solr and can take many minutes whether or not you have buildOnStartup set to false, see: https://issues.apache.org/jira/browse/SOLR-6845. See: http://lucidworks.com/blog/solr-suggester/ See inline. On Thu, May 21, 2015 at 6:12 AM, jon kerling jonkerl...@yahoo.com.invalid wrote: Hi, I'm using Solr 4.10 and I'm trying to add an autosuggest ability to my application. I'm currently using this kind of configuration:

    <searchComponent name="suggest" class="solr.SuggestComponent">
      <lst name="suggester">
        <str name="name">mySuggester</str>
        <str name="lookupImpl">FuzzyLookupFactory</str>
        <str name="storeDir">suggester_fuzzy_dir</str>
        <str name="dictionaryImpl">DocumentDictionaryFactory</str>
        <str name="field">field2</str>
        <str name="weightField">weightField</str>
        <str name="suggestAnalyzerFieldType">text_general</str>
      </lst>
    </searchComponent>

    <requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy">
      <lst name="defaults">
        <str name="suggest">true</str>
        <str name="suggest.count">10</str>
        <str name="suggest.dictionary">mySuggester</str>
      </lst>
      <arr name="components">
        <str>suggest</str>
      </arr>
    </requestHandler>

I wanted to know how the suggester index/file is being rebuilt. Is it supposed to have all the terms of the desired field in the suggester? Yes. If not, is it related to this kind of lookup implementation? If I use another lookup implementation which also suggests infix terms of fields, doesn't it have to hold all terms of the field? Yes. When I call suggest.build, does it build the suggester index/file from scratch, or is it doing something like delta indexing of suggestions? Builds from scratch Thank You, Jon
Re: Reindex of document leaves old fields behind
I'm doing all my indexing to leader 1 and have not specified any router configuration, but there is an equal distribution of 240M docs across 5 shards. I think I've been stating that I have 3 shards in these posts; I have 5, sorry. How do I know what kind of routing I am using? -- View this message in context: http://lucene.472066.n3.nabble.com/Reindex-of-document-leaves-old-fields-behind-tp4206710p4206869.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: optimal shard assignment with low shard key cardinality using compositeId to enable shard splitting
I question your base assumption: bq: So shard by document producer seems a good choice Because what this _also_ does is force all of the work for a query onto one node, and all indexing for a particular producer ditto. And it will cause you to manually monitor your shards to see if some of them grow out of proportion to others. I think it would be much less hassle to just let Solr distribute the docs as it may based on the uniqueKey and forget about it. Unless you want, say, to do joins etc. There will, of course, be some overhead that you pay here, but unless you can measure it and it's a pain I wouldn't add the complexity you're talking about, especially at the volumes you're talking about. Best, Erick On Thu, May 21, 2015 at 3:20 AM, Matteo Grolla matteo.gro...@gmail.com wrote: Hi, I'd like some feedback on how I'd like to solve the following sharding problem. I have a collection that will eventually become big: average document size is 1.5kb, and every year 30 million documents will be indexed. Data come from different document producers (a person, owner of his documents) and queries are almost always performed by a document producer, who can only query his own documents. So sharding by document producer seems a good choice. There are 3 types of doc producer:
- type A: cardinality 105 (there are 105 producers of this type), producing 17M docs/year (the aggregated production of all type A producers)
- type B: cardinality ~10k, producing 4M docs/year
- type C: cardinality ~10M, producing 9M docs/year
I'm thinking about using compositeId (solrDocId = producerId!docId) to send all docs of the same producer to the same shards. When a shard becomes too large I can use shard splitting. Problems: documents from type A producers could be oddly distributed among shards, because hashing doesn't work well on small numbers (105), see Appendix. As a solution I could do this when a new type A producer (producerA1) arrives:
1) client app: generate a producer code
2) client app: simulate murmur hashing and shard assignment
3) client app: check that the shard assignment is optimal (the producer code is assigned to the shard with the fewest type A producers), otherwise go to 1) and try with another code
When I add documents or perform searches for producerA1, I use its producer code, respectively in the compositeId or in the route parameter. What do you think?
---Appendix: murmurhash shard assignment simulation---

    import mmh3

    # simulate the shard-key part of the hash for 105 producer codes
    hashes = [mmh3.hash(str(i)) >> 16 for i in xrange(105)]
    num_shards = 16
    shards = [0] * num_shards
    for hash in hashes:
        idx = hash % num_shards
        shards[idx] += 1
    print shards
    print sum(shards)

result: [4, 10, 6, 7, 8, 6, 7, 8, 11, 1, 8, 5, 6, 5, 5, 8]
so with 16 shards and 105 shard keys I can have shards with 1 key and shards with 11 keys
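For concreteness, a small sketch of what the compositeId scheme described above looks like in practice (producerA1, doc123 and the producer field name are placeholders):

    Document id sent at index time:   producerA1!doc123
    (everything before the '!' is hashed to pick the shard)

    Querying only that producer's data:
    q=*:*&fq=producer:producerA1&_route_=producerA1!

The _route_ parameter tells Solr to send the query only to the shard(s) that the key producerA1 hashes to, instead of fanning the request out to every shard.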
Re: Reindex of document leaves old fields behind
On 5/21/2015 9:02 AM, tuxedomoon wrote: l If it is implicit then you may have indexed the new document to a different shard, which means that it is now in your index more than once, and which one gets returned may not be predictable. If a document with uniqueKey 1234 is assigned to a shard by SolrCloud, implicit routing won't a reindex of 1234 be assigned to the same shard? If not you'd have dups all over the cluster. The implicit router basically means manual routing. Whatever shard actually receives the request will be the one that indexes it. If you want documents automatically routed according to their hash, you need the compositeId router. Thanks, Shawn
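As a point of reference, the router is fixed when a collection is created; a rough sketch of the two Collections API variants (collection and shard names here are placeholders):

    Hash routing (the default):
    /admin/collections?action=CREATE&name=mycoll&numShards=5&router.name=compositeId

    Manual ("implicit") routing:
    /admin/collections?action=CREATE&name=mycoll&router.name=implicit&shards=shard1,shard2,shard3,shard4,shard5

With implicit routing a document stays on whichever shard received it (or on the shard named in a _route_ parameter), which is exactly how a re-sent document can end up duplicated on another shard.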
Re: Is it possible to do term Search for the filtered result set
Have you tried fq=type:A Best, Erick On Thu, May 21, 2015 at 5:49 AM, Danesh Kuruppu dknkuru...@gmail.com wrote: Hi all, Is it possible to do a term search for a filtered result set? We can do a term search across all documents; can we do the term search only for a specified filtered result set? Let's say we have:
Doc1 -- type: A, tags: T1 T2
Doc2 -- type: A, tags: T1 T3
Doc3 -- type: B, tags: T1 T4 T5
Can we do a term search for tags only in type:A documents, so that it gives the results as
T1 - 02
T2 - 01
T3 - 01
Is this possible? If so, can you please share documentation on this. Thanks Danesh
Re: Solr suggester
Frankly, the suggester is rather broken in Solr 4.x with large indexes. Building the suggester index (or FST) requires that _all_ the docs get read, the stored fields analyzed and added to the suggester. Unfortunately, this happens _every_ time you start Solr and can take many minutes whether or not you have buildOnStartup set to false, see: https://issues.apache.org/jira/browse/SOLR-6845. See: http://lucidworks.com/blog/solr-suggester/ See inline. On Thu, May 21, 2015 at 6:12 AM, jon kerling jonkerl...@yahoo.com.invalid wrote: Hi, I'm using Solr 4.10 and I'm trying to add an autosuggest ability to my application. I'm currently using this kind of configuration:

    <searchComponent name="suggest" class="solr.SuggestComponent">
      <lst name="suggester">
        <str name="name">mySuggester</str>
        <str name="lookupImpl">FuzzyLookupFactory</str>
        <str name="storeDir">suggester_fuzzy_dir</str>
        <str name="dictionaryImpl">DocumentDictionaryFactory</str>
        <str name="field">field2</str>
        <str name="weightField">weightField</str>
        <str name="suggestAnalyzerFieldType">text_general</str>
      </lst>
    </searchComponent>

    <requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy">
      <lst name="defaults">
        <str name="suggest">true</str>
        <str name="suggest.count">10</str>
        <str name="suggest.dictionary">mySuggester</str>
      </lst>
      <arr name="components">
        <str>suggest</str>
      </arr>
    </requestHandler>

I wanted to know how the suggester index/file is being rebuilt. Is it supposed to have all the terms of the desired field in the suggester? Yes. If not, is it related to this kind of lookup implementation? If I use another lookup implementation which also suggests infix terms of fields, doesn't it have to hold all terms of the field? Yes. When I call suggest.build, does it build the suggester index/file from scratch, or is it doing something like delta indexing of suggestions? Builds from scratch Thank You, Jon
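For context, the suggest.build that Jon asks about is normally issued as a request parameter against the handler configured above (the core name mycore is a placeholder):

    http://localhost:8983/solr/mycore/suggest?suggest=true&suggest.dictionary=mySuggester&suggest.build=true

With the configuration shown, every such call rebuilds the dictionary from scratch by re-reading field2 from all documents, which is why builds get slower as the index grows.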
Re: Is it possible to do term Search for the filtered result set
and then facet on the tags field: facet=on&facet.field=tags Upayavira On Thu, May 21, 2015, at 04:34 PM, Erick Erickson wrote: Have you tried fq=type:A Best, Erick On Thu, May 21, 2015 at 5:49 AM, Danesh Kuruppu dknkuru...@gmail.com wrote: Hi all, Is it possible to do a term search for a filtered result set? We can do a term search across all documents; can we do the term search only for a specified filtered result set? Let's say we have:
Doc1 -- type: A, tags: T1 T2
Doc2 -- type: A, tags: T1 T3
Doc3 -- type: B, tags: T1 T4 T5
Can we do a term search for tags only in type:A documents, so that it gives the results as
T1 - 02
T2 - 01
T3 - 01
Is this possible? If so, can you please share documentation on this. Thanks Danesh
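Putting Erick's and Upayavira's suggestions together, a minimal sketch of the full request (the field names type and tags are taken from Danesh's example):

    q=*:*&fq=type:A&facet=on&facet.field=tags&rows=0

The facet counts in the response are computed only over the documents matching the fq, which yields exactly the per-tag counts asked for above.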
AW: Price Range Faceting Based on Date Constraints
Give geospatial search a chance. Use the 'SpatialRecursivePrefixTreeFieldType' field type and set 'geo' to false. The date is located on the X axis, prices on the Y axis. For every price you get a horizontal line between start and end date. Index a rectangle with height 0.001 (1 cent) and width 'end date - start date'. Find all prices that are valid on a given day or in a given date range with the 'geofilt' function. The field type could look like this (not tested):

    <fieldType name="price_date_range" class="solr.SpatialRecursivePrefixTreeFieldType"
               geo="false" distErrPct="0.025" maxDistErr="0.09" units="degrees"
               worldBounds="1 0 366 1" />

Faceting can possibly be done with a facet query for each of your price ranges. For example, day 20, price range 0-5$, rectangle:

    <field name="pdr">20.0 0.0 21.0 5.0</field>

Regards Holger
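As a rough illustration of that last suggestion (the field name pdr and the exact rectangle syntax are my assumptions, not something from Holger's tested setup): with an RPT field a rectangle can be expressed as a WKT-style ENVELOPE, so one facet query per price range might look something like

    facet=on&facet.query={!field f=pdr}Intersects(ENVELOPE(20.0, 21.0, 5.0, 0.0))

where ENVELOPE takes minX, maxX, maxY, minY - here day 20 to 21 on the X axis and price 0 to 5 on the Y axis. Repeating the facet.query for each price range gives one count per range.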
solr uima and opennlp
Hi everyone, I'm trying to plug a new UIMA annotator into Solr. What is necessary for this? Is it enough to build a jar similar to the ones from the uima-addons package? More specifically, are the uima-addons jars identical to the ones found in Solr's contrib folder? Thanks! Andreea -- View this message in context: http://lucene.472066.n3.nabble.com/solr-uima-and-opennlp-tp4206873.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Indexing gets significantly slower after every batch commit
hi Shawn, Thanks a bunch for your feedback. I've played with the heap size, but I don't see any improvement. Even if I index, say, a million docs, and the throughput is about 300 docs per sec, and then I shut down Solr completely - after I start indexing again, the throughput drops below 300. I should probably experiment with sharding those documents to multiple SOLR cores - that should help, I guess. I am talking about something like this: https://cwiki.apache.org/confluence/display/solr/Shards+and+Indexing+Data+in+SolrCloud Thanks, Angel On Thu, May 21, 2015 at 11:36 AM, Shawn Heisey apa...@elyograg.org wrote: On 5/21/2015 2:07 AM, Angel Todorov wrote: I'm crawling a file system folder and indexing 10 million docs, and I am adding them in batches of 5000, committing every 50 000 docs. The problem I am facing is that after each commit, the documents per sec that are indexed gets less and less. If I do not commit at all, I can index those docs very quickly, and then I commit once at the end, but once i start indexing docs _after_ that (for example new files get added to the folder), indexing is also slowing down a lot. Is it normal that the SOLR indexing speed depends on the number of documents that are _already_ indexed? I think it shouldn't matter if i start from scratch or I index a document in a core that already has a couple of million docs. Looks like SOLR is either doing something in a linear fashion, or there is some magic config parameter that I am not aware of. I've read all perf docs, and I've tried changing mergeFactor, autowarmCounts, and the buffer sizes - to no avail. I am using SOLR 5.1 Have you changed the heap size? If you use the bin/solr script to start it and don't change the heap size with the -m option or another method, Solr 5.1 runs with a default size of 512MB, which is *very* small. I bet you are running into problems with frequent and then ultimately constant garbage collection, as Java attempts to free up enough memory to allow the program to continue running. If that is what is happening, then eventually you will see an OutOfMemoryError exception. The solution is to increase the heap size. I would probably start with at least 4G for 10 million docs. Thanks, Shawn
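As a side note, the heap experiment Shawn suggests is a one-line change when Solr is started with the bundled script (4g matches his suggested starting point for 10 million docs; adjust for your own data):

    bin/solr start -m 4g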
Re: Reindex of document leaves old fields behind
On 5/21/2015 9:54 AM, tuxedomoon wrote: I'm doing all my index to leader 1 and have not specified any router configuration. But there is an equal distribution of 240M docs across 5 shards. I think I've been stating I have 3 shards in these posts, I have 5, sorry. How do I know what kind of routing I am using? If all your indexing is going to the same place and the docs are distributed evenly, then quite possibly your router is compositeId. To see for sure, go to the admin UI, click on Cloud, then Tree. Click the little arrow next to collections, then click on the collection name. In the far right pane, there will be a small snippet of JSON below the other attributes, defining the configName and router. Thanks, Shawn
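For what it's worth, the JSON snippet Shawn mentions looks roughly like this for a hash-routed collection (a sketch only; the exact attributes vary a little between versions):

    {
      "configName": "myconf",
      "router": {"name": "compositeId"}
    }

If the router name shows up as "implicit", documents are not hashed to shards at all.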
Re: Price Range Faceting Based on Date Constraints
Just thinking a little bit more on it, I should investigate the SpatialRecursivePrefixTreeFieldType further. Is each value of that field a Point? Actually each of our values must be the rectangle, because the time frame and the price together form a single value (not only the duration of the price, 'end date - start date'). Could you give an example of the indexing as well? Cheers 2015-05-21 17:28 GMT+01:00 Alessandro Benedetti benedetti.ale...@gmail.com : The geo-spatial idea is brilliant! What do you think about translating the date into ms? Alex, you should try that approach, it can work! Cheers 2015-05-21 16:49 GMT+01:00 Holger Rieß holger.ri...@werkzeug-eylert.de: Give geospatial search a chance. Use the 'SpatialRecursivePrefixTreeFieldType' field type and set 'geo' to false. The date is located on the X axis, prices on the Y axis. For every price you get a horizontal line between start and end date. Index a rectangle with height 0.001 (1 cent) and width 'end date - start date'. Find all prices that are valid on a given day or in a given date range with the 'geofilt' function. The field type could look like this (not tested):

    <fieldType name="price_date_range" class="solr.SpatialRecursivePrefixTreeFieldType"
               geo="false" distErrPct="0.025" maxDistErr="0.09" units="degrees"
               worldBounds="1 0 366 1" />

Faceting can possibly be done with a facet query for each of your price ranges. For example, day 20, price range 0-5$, rectangle:

    <field name="pdr">20.0 0.0 21.0 5.0</field>

Regards Holger -- -- Benedetti Alessandro Visiting card : http://about.me/alessandro_benedetti Tyger, tyger burning bright In the forests of the night, What immortal hand or eye Could frame thy fearful symmetry? William Blake - Songs of Experience -1794 England -- -- Benedetti Alessandro Visiting card : http://about.me/alessandro_benedetti Tyger, tyger burning bright In the forests of the night, What immortal hand or eye Could frame thy fearful symmetry? William Blake - Songs of Experience -1794 England
Re: Price Range Faceting Based on Date Constraints
The geo-spatial idea is brilliant! What do you think about translating the date into ms? Alex, you should try that approach, it can work! Cheers 2015-05-21 16:49 GMT+01:00 Holger Rieß holger.ri...@werkzeug-eylert.de: Give geospatial search a chance. Use the 'SpatialRecursivePrefixTreeFieldType' field type and set 'geo' to false. The date is located on the X axis, prices on the Y axis. For every price you get a horizontal line between start and end date. Index a rectangle with height 0.001 (1 cent) and width 'end date - start date'. Find all prices that are valid on a given day or in a given date range with the 'geofilt' function. The field type could look like this (not tested):

    <fieldType name="price_date_range" class="solr.SpatialRecursivePrefixTreeFieldType"
               geo="false" distErrPct="0.025" maxDistErr="0.09" units="degrees"
               worldBounds="1 0 366 1" />

Faceting can possibly be done with a facet query for each of your price ranges. For example, day 20, price range 0-5$, rectangle:

    <field name="pdr">20.0 0.0 21.0 5.0</field>

Regards Holger -- -- Benedetti Alessandro Visiting card : http://about.me/alessandro_benedetti Tyger, tyger burning bright In the forests of the night, What immortal hand or eye Could frame thy fearful symmetry? William Blake - Songs of Experience -1794 England
Re: Reindex of document leaves old fields behind
OK, it is compositeId. I've just used post.sh to index a test doc with 3 fields to leader 1 of my SolrCloud. I then reindexed it with 1 field removed, and the query on it shows 2 fields. I repeated this a few times and always get the correct field count from Solr. I'm now wondering if SolrJ is somehow involved, performing an atomic update rather than a replacement. I will try the above test via SolrJ. -- View this message in context: http://lucene.472066.n3.nabble.com/Reindex-of-document-leaves-old-fields-behind-tp4206710p4206886.html Sent from the Solr - User mailing list archive at Nabble.com.
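For what it's worth, a minimal SolrJ sketch of the difference being suspected here (the URL, collection name, id and field names are placeholders, not the actual indexing code from this thread):

    import java.util.Collections;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class ReindexVsAtomic {
        public static void main(String[] args) throws Exception {
            HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/mycollection");

            // Full replacement: the new document completely overwrites the old one,
            // so any field not present here disappears from the stored document.
            SolrInputDocument full = new SolrInputDocument();
            full.addField("id", "1234");
            full.addField("title", "new title");
            client.add(full);

            // Atomic update: only the listed field is changed; all other fields
            // of doc 1234 are carried over from the existing document.
            SolrInputDocument partial = new SolrInputDocument();
            partial.addField("id", "1234");
            partial.addField("title", Collections.singletonMap("set", "new title"));
            client.add(partial);

            client.commit();
            client.close();
        }
    }

If the indexing code builds documents the second way (field values wrapped in a "set" or "add" map), old fields would indeed be left in place, which matches the symptom described in this thread.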