Re: SolrCloud result correctness compared with single core
Pretty helpful, thanks Erick!

2015-01-24 9:48 GMT+08:00 Erick Erickson erickerick...@gmail.com:

You might, but probably not enough to notice. At 50G, the tf/idf stats will _probably_ be close enough that you won't be able to tell. That said, distributed tf/idf has recently been implemented, but you need to ask for it; see SOLR-1632. This is Solr 5.0, though. I've rarely seen it matter except in fairly specialized situations.

Consider a single core. Deleted documents still count towards some of the tf/idf stats, so your scoring could theoretically change after, say, an optimize. So the bottom line is that yes, the scoring may change, but IMO not any more radically than was possible with single cores, and I wouldn't worry about it unless I had evidence that it was biting me.

Best, Erick

On Fri, Jan 23, 2015 at 2:52 PM, Yandong Yao yydz...@gmail.com wrote:

Hi Guys, As the main scoring mechanism is based on tf/idf, will the same query return different results when run against SolrCloud versus a single core with the same data set, since idf only counts df inside one core? E.g., assume I have 100GB of data: A) index the data using a single core; B) index the data using SolrCloud with two cores (each holding a 50GB index). If I then run the same query, like 'apple', will I get different results for A and B? Regards, Yandong
SolrCloud result correctness compared with single core
Hi Guys, As the main scoring mechanism is based on tf/idf, will the same query return different results when run against SolrCloud versus a single core with the same data set, since idf only counts df inside one core? E.g., assume I have 100GB of data: A) index the data using a single core; B) index the data using SolrCloud with two cores (each holding a 50GB index). If I then run the same query, like 'apple', will I get different results for A and B? Regards, Yandong
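Erick's point about per-shard df can be sketched numerically. A minimal model, assuming the classic Lucene-style idf formula and made-up document counts (the helper and the numbers are illustrative only, not Solr API calls):

```python
import math

# Illustrative only: classic Lucene-style idf; not a Solr API call.
def idf(num_docs, doc_freq):
    return 1.0 + math.log(num_docs / (doc_freq + 1))

# Single core: 100 docs, 'apple' occurs in 10 of them.
single = idf(100, 10)

# Two shards with 50 docs each, but 'apple' is skewed: df=9 on shard A, df=1 on shard B.
shard_a = idf(50, 9)
shard_b = idf(50, 1)

# Without distributed idf (SOLR-1632), each shard scores with its local df,
# so the same term gets a different idf depending on where a document landed.
assert shard_a != shard_b
```

With a random (hash-based) document distribution the per-shard df values stay close to proportional, which is why the difference is usually too small to notice at this scale.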
Re: Index optimize takes more than 40 minutes for 18M documents
Thanks Walter for the info, we will disable optimize then and do more testing. Regards, Yandong

2013/2/22 Walter Underwood wun...@wunderwood.org:

That seems fairly fast. We index about 3 million documents in about half that time. We are probably limited by the time it takes to get the data from MySQL. Don't optimize. Solr automatically merges index segments as needed. Optimize forces a full merge. You'll probably never notice the difference, either in disk space or speed. It might make sense to force merge (optimize) if you reindex everything once per day and have no updates in between. But even then it may be a waste of time. You need lots of free disk space for merging, whether a forced merge or automatic. Free space equal to the size of the index is usually enough, but the worst case can need double the size of the index. wunder

On Feb 21, 2013, at 9:20 AM, Yandong Yao wrote:

Hi Guys, I am using Solr 4.1 and have indexed 18M documents using the solrj ConcurrentUpdateSolrServer (each document contains 5 fields, and the average length is less than 1K).

1) It takes 70 minutes to index those documents without optimize on my Mac 10.8. Is that slow, fast, or typical?

2) It takes about 40 minutes to optimize those documents. Following is the top output, and there are lots of FAULTS. What does this mean?

Processes: 118 total, 2 running, 8 stuck, 108 sleeping, 719 threads 00:56:52 Load Avg: 1.48, 1.56, 1.73 CPU usage: 6.63% user, 6.40% sys, 86.95% idle SharedLibs: 31M resident, 0B data, 6712K linkedit. MemRegions: 34734 total, 5801M resident, 39M private, 638M shared. PhysMem: 982M wired, 3600M active, 3567M inactive, 8150M used, 38M free. VM: 254G vsize, 1285M framework vsize, 1469887(368) pageins, 1095550(0) pageouts. Networks: packets: 14842595/9661M in, 14777685/9395M out. Disks: 820048/43G read, 523814/53G written.
PID COMMAND %CPU TIME #TH #WQ #POR #MRE RPRVT RSHRD RSIZE VPRVT VSIZE PGRP PPID STATE UID FAULTS COW MSGSENT MSGRECV SYSBSD SYSMACH 4585 java 11.7 02:52:01 32 1483 342 3866M+ 6724K 3856M+ 4246M 6908M 4580 4580 sleepin 501 1490340+ 402 3000781+ 231785+ 15044055+ 10033109+

3) If I don't run optimize, what is the impact? A bigger index on disk, or slower query performance?

Following is my index config in solrconfig.xml:

<ramBufferSizeMB>100</ramBufferSizeMB>
<mergeFactor>10</mergeFactor>
<autoCommit>
  <maxDocs>100000</maxDocs><!-- 100K docs -->
  <maxTime>300000</maxTime><!-- 5 minutes -->
  <openSearcher>false</openSearcher>
</autoCommit>

Thanks very much in advance! Regards, Yandong
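Walter's point that "Solr automatically merges index segments as needed" can be illustrated with a toy mergeFactor simulation. This is a deliberate simplification, not Lucene's actual merge policy: whenever mergeFactor same-level segments accumulate, they merge into one segment at the next level, so the segment count stays bounded without an explicit optimize.

```python
# Toy model of mergeFactor-style merging (illustrative, not Lucene's policy).
MERGE_FACTOR = 10

def add_segments(levels, n_flushes):
    """levels[i] = number of segments at level i; each flush adds one level-0 segment."""
    for _ in range(n_flushes):
        levels[0] = levels.get(0, 0) + 1
        level = 0
        # Cascade merges upward whenever a level fills up.
        while levels.get(level, 0) >= MERGE_FACTOR:
            levels[level] -= MERGE_FACTOR
            levels[level + 1] = levels.get(level + 1, 0) + 1
            level += 1
    return levels

segs = add_segments({}, 1000)  # 1000 flushes
# 1000 flushes collapse automatically into a single level-3 segment.
assert segs == {0: 0, 1: 0, 2: 0, 3: 1}
```

The temporary disk-space requirement Walter mentions comes from the fact that, during a merge, both the source segments and the merged result exist on disk at the same time.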
Re: How to run many MoreLikeThis request efficiently?
Any comments on this? Thanks very much in advance!

2013/1/9 Yandong Yao yydz...@gmail.com:

Hi Solr Gurus, I have two sets of documents in one SolrCore; each set has about 1M documents, with different document types, say 'type1' and 'type2'. Many documents in the first set are very similar to 1 or 2 documents in the second set. What I want is: for each document in set 2, return the most similar document in set 1, using either 'MoreLikeThisHandler' or 'MoreLikeThisComponent'. Currently I use the following code to get the result, but it sends far too many requests to the Solr server serially. Is there any way to improve this besides using multi-threading? Thanks very much!

for each document in set 2 whose type is 'type2'
    run a MoreLikeThis request against the Solr server and get the most similar document
end

Regards, Yandong
Re: How to run many MoreLikeThis request efficiently?
Hi Otis, Really appreciate your help on this! Will go with multi-threading first, and then provide a custom component if performance is not good enough. Regards, Yandong

2013/1/10 Otis Gospodnetic otis.gospodne...@gmail.com:

Patience, young Yandong :) Multi-threading *in your application* is the way to go. Alternatively, one could write a custom SearchComponent that is called once and inside of which the whole work is done after just one call to it. This component could then write the output somewhere, like in a new index, since making a blocking call to it may time out. Otis Solr & ElasticSearch Support http://sematext.com/

On Jan 9, 2013 6:07 PM, Yandong Yao yydz...@gmail.com wrote:

Any comments on this? Thanks very much in advance!

2013/1/9 Yandong Yao yydz...@gmail.com:

Hi Solr Gurus, I have two sets of documents in one SolrCore; each set has about 1M documents, with different document types, say 'type1' and 'type2'. Many documents in the first set are very similar to 1 or 2 documents in the second set. What I want is: for each document in set 2, return the most similar document in set 1, using either 'MoreLikeThisHandler' or 'MoreLikeThisComponent'. Currently I use the following code to get the result, but it sends far too many requests to the Solr server serially. Is there any way to improve this besides using multi-threading? Thanks very much!

for each document in set 2 whose type is 'type2'
    run a MoreLikeThis request against the Solr server and get the most similar document
end

Regards, Yandong
How to run many MoreLikeThis request efficiently?
Hi Solr Gurus, I have two sets of documents in one SolrCore; each set has about 1M documents, with different document types, say 'type1' and 'type2'. Many documents in the first set are very similar to 1 or 2 documents in the second set. What I want is: for each document in set 2, return the most similar document in set 1, using either 'MoreLikeThisHandler' or 'MoreLikeThisComponent'. Currently I use the following code to get the result, but it sends far too many requests to the Solr server serially. Is there any way to improve this besides using multi-threading? Thanks very much!

for each document in set 2 whose type is 'type2'
    run a MoreLikeThis request against the Solr server and get the most similar document
end

Regards, Yandong
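For what it's worth, the multi-threading suggestion can be sketched like this. Here `most_similar_type1` is a stub standing in for a real MoreLikeThis HTTP request (e.g. against /mlt with an fq on the type field); the names are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

# Stub: a real version would issue an HTTP MoreLikeThis request such as
# /mlt?q=id:<doc>&mlt.fl=content&fq=type:type1 and return the top hit.
def most_similar_type1(doc_id):
    return f"type1-match-for-{doc_id}"

def batch_mlt(doc_ids, workers=8):
    # Issue the per-document MoreLikeThis requests concurrently instead of serially.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(doc_ids, pool.map(most_similar_type1, doc_ids)))

results = batch_mlt(["d1", "d2", "d3"])
assert results["d2"] == "type1-match-for-d2"
```

With 1M type2 documents, a pool of a few dozen workers keeps the Solr server busy without the per-request round-trip latency dominating, though the total number of requests stays the same (which is why a server-side component can still win).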
Re: mergeindex: what happens if there is deletion during index merging
Hi Shalin, Thanks very much for your detailed explanation! Regards, Yandong

2012/8/21 Shalin Shekhar Mangar shalinman...@gmail.com:

On Tue, Aug 21, 2012 at 8:47 AM, Yandong Yao yydz...@gmail.com wrote:

Hi guys, From http://wiki.apache.org/solr/MergingSolrIndexes, it says 'Using srcCore, care is taken to ensure that the merged index is not corrupted even if writes are happening in parallel on the source index'. What does this mean? If there is a deletion request during merging, will the deletion be processed correctly after merging finishes?

Solr keeps an instance of the IndexReader for each srcCore, which is a static snapshot of the index at the time of the merge request. This static snapshot is merged to the target core. Therefore any insert/delete request made to the srcCores after the merge request will not affect the merged index.

1) E.g.: I have an existing core 'core0', and I want to merge cores 'core1' and 'core2' into 'core0', so I use http://localhost:8983/solr/admin/cores?action=mergeindexes&core=core0&srcCore=core1&srcCore=core2. If, while the merge happens, core0, core1, and core2 receive deletion requests for some old documents, will the final 'core0' contain all content from 'core1' and 'core2', with all documents matching the deletion criteria deleted?

The final core0 will not have documents deleted by requests made on core0. However, documents deleted on core1 and core2 will still be in core0 if the merge started before those requests were made.

2) And if core0, core1, and core2 are processing deletion requests when a core merge request comes in, what will happen? Will the merge request block until deletion finishes on all cores?

I believe core0 will continue to process deletion requests concurrently with the merge. As for core1 and core2, since a merge reserves their IndexReader, the answer depends on when a commit happens on core1 and core2. If, for example, 2 deletions were made on core1, then a commit was issued (or autoCommit happened), and then the merge was triggered, the final core0 will not have those documents, but it may still have docs deleted after the commit.

Thanks very much in advance! Regards, Yandong

-- Regards, Shalin Shekhar Mangar.
mergeindex: what happens if there is deletion during index merging
Hi guys, From http://wiki.apache.org/solr/MergingSolrIndexes, it says 'Using srcCore, care is taken to ensure that the merged index is not corrupted even if writes are happening in parallel on the source index'. What does this mean? If there is a deletion request during merging, will the deletion be processed correctly after merging finishes?

1) E.g.: I have an existing core 'core0', and I want to merge cores 'core1' and 'core2' into 'core0', so I use http://localhost:8983/solr/admin/cores?action=mergeindexes&core=core0&srcCore=core1&srcCore=core2. If, while the merge happens, core0, core1, and core2 receive deletion requests for some old documents, will the final 'core0' contain all content from 'core1' and 'core2', with all documents matching the deletion criteria deleted?

2) And if core0, core1, and core2 are processing deletion requests when a core merge request comes in, what will happen? Will the merge request block until deletion finishes on all cores?

Thanks very much in advance! Regards, Yandong
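Shalin's snapshot explanation can be modeled with plain sets. This is a toy model, not Solr code: the "snapshot" stands in for the IndexReader reserved at merge time, so a delete issued on a source core after the merge starts does not affect the merged target.

```python
# Toy model: mergeindexes works from a static snapshot (the reserved
# IndexReader) of each source core, so deletes issued on the source cores
# *after* the merge starts do not affect the merged target core.
core1 = {"a", "b"}
core2 = {"c", "d"}

snapshot1, snapshot2 = set(core1), set(core2)   # readers reserved at merge start

core1.discard("a")              # delete arrives on core1 after the merge began...
core0 = snapshot1 | snapshot2   # ...but the merge reads only the snapshots

assert "a" in core0             # still present in the merged target
assert core1 == {"b"}           # gone from core1 itself
```

Deletes issued on the *target* core (core0) are a different story, since core0 keeps processing updates concurrently with the merge, as described above.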
Count is inconsistent between facet and stats
Hi Guys, Steps to reproduce:

1) Download apache-solr-4.0.0-ALPHA
2) cd example; java -jar start.jar
3) cd exampledocs; ./post.sh *.xml
4) Use StatsComponent to get the stats info for field 'popularity' faceted by 'cat'. The 'count' for 'electronics' is 3:

http://localhost:8983/solr/collection1/select?q=cat:electronics&wt=json&rows=0&stats=true&stats.field=popularity&stats.facet=cat

{
  "stats_fields": {
    "popularity": {
      "min": 0, "max": 10, "count": 14, "missing": 0, "sum": 75,
      "sumOfSquares": 503, "mean": 5.357142857142857, "stddev": 2.7902892835178013,
      "facets": {
        "cat": {
          "music": {"min": 10, "max": 10, "count": 1, "missing": 0, "sum": 10, "sumOfSquares": 100, "mean": 10, "stddev": 0},
          "monitor": {"min": 6, "max": 6, "count": 2, "missing": 0, "sum": 12, "sumOfSquares": 72, "mean": 6, "stddev": 0},
          "hard drive": {"min": 6, "max": 6, "count": 2, "missing": 0, "sum": 12, "sumOfSquares": 72, "mean": 6, "stddev": 0},
          "scanner": {"min": 6, "max": 6, "count": 1, "missing": 0, "sum": 6, "sumOfSquares": 36, "mean": 6, "stddev": 0},
          "memory": {"min": 0, "max": 7, "count": 3, "missing": 0, "sum": 12, "sumOfSquares": 74, "mean": 4, "stddev": 3.605551275463989},
          "graphics card": {"min": 7, "max": 7, "count": 2, "missing": 0, "sum": 14, "sumOfSquares": 98, "mean": 7, "stddev": 0},
          "electronics": {"min": 1, "max": 7, "count": 3, "missing": 0, "sum": 9, "sumOfSquares": 51, "mean": 3, "stddev": 3.4641016151377544}
        }
      }
    }
  }
}

5) Facet on 'cat', and the count is 14:

http://localhost:8983/solr/collection1/select?q=cat:electronics&wt=json&rows=0&facet=true&facet.field=cat

{
  "cat": [
    "electronics", 14, "memory", 3, "connector", 2, "graphics card", 2,
    "hard drive", 2, "monitor", 2, "camera", 1, "copier", 1,
    "multifunction printer", 1, "music", 1, "printer", 1, "scanner", 1,
    "currency", 0, "search", 0, "software", 0
  ]
}

So StatsComponent reports a count of 3 for the 'electronics' cat, while FacetComponent reports 14. Is this a bug?
Following is the field definition for 'cat':

<field name="cat" type="string" indexed="true" stored="true" multiValued="true"/>

Thanks, Yandong
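For reference, facet.field counts every document that contains a value, even when the field is multiValued, so 14 is the expected number for 'electronics' here; as far as I know, stats.facet has known limitations with multiValued facet fields, which would explain the 3. A toy model of the facet side (made-up docs, not the example data):

```python
# Faceting counts documents per value; with a multiValued field, every doc
# that *contains* the value is counted once. Toy docs, not the example index.
docs = [
    {"cat": ["electronics", "memory"], "popularity": 5},
    {"cat": ["electronics", "monitor"], "popularity": 6},
    {"cat": ["electronics"], "popularity": 1},
]

matches = [d for d in docs if "electronics" in d["cat"]]
facet_count = sum(1 for d in docs if "electronics" in d["cat"])

assert len(matches) == 3   # q=cat:electronics matches all three docs
assert facet_count == 3    # facet.field=cat counts all three for 'electronics'
```

By this counting, the stats.facet figure should equal the facet.field figure for the same value, so the 3-vs-14 mismatch points at the stats side.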
Re: SolrCloud: how to index documents into a specific core and how to search against that core?
Hi Mark, Darren, Thanks very much for your help; will try a collection for each customer then. Regards, Yandong

2012/5/22 Mark Miller markrmil...@gmail.com:

I think the key is this: you want to think of a SolrCore on a single-node Solr installation as a collection on a multi-node SolrCloud installation. So if you would use multiple SolrCores with a standard Solr setup, you should be using multiple collections in SolrCloud. If you were going to try to do everything in one SolrCore, that would be like putting everything in one collection in SolrCloud. I don't think it generally makes sense to try and work at the SolrCore level when working with SolrCloud. This will be made more clear once we add a simple collections API. So I think your choice should be similar to using a single node: do you want to put everything in one 'collection' and use a filter to separate customers (with all its caveats and limitations), or do you want to use a collection per customer? You can always start up more clusters if you reach any limits.

On May 22, 2012, at 10:08 AM, Darren Govoni wrote:

I'm curious what the SolrCloud experts say, but my suggestion is to try not to over-engineer the search architecture on SolrCloud. For example, what is the benefit of managing which cores are indexed and searched? Having to know those details, in my mind, works against the automation in SolrCloud, but maybe there's a good reason you want to do it this way.

--- Original Message ---
On 5/22/2012 07:35 AM Yandong Yao wrote:

Hi Darren, Thanks very much for your reply. The reason I want to control core indexing/searching is that I want to use one core to store one customer's data (all customers share the same config): e.g., customer 1 uses coreForCustomer1 and customer 2 uses coreForCustomer2. Is there any better way than using a different core for each customer? Another way may be to use a different collection for each customer, though I am not sure how many collections SolrCloud can support. Which way is better in terms of flexibility/scalability? (Suppose there are tens of thousands of customers.) Regards, Yandong

2012/5/22 Darren Govoni dar...@ontrenet.com:

Why do you want to control what gets indexed into a core and then have to know which core to search? That's the kind of knowing that SolrCloud solves. In SolrCloud, it handles the distribution of documents across shards and retrieves them regardless of which node is searched from. That is the point of cloud: you don't know the details of where exactly documents are being managed (i.e., they are cloudy). It can change and re-balance from time to time. SolrCloud performs the distributed search for you; therefore when you try to search a node/core with no documents, all the results from the cloud are retrieved regardless. This is considered A Good Thing. It requires a change in thinking about indexing and searching.

On Tue, 2012-05-22 at 08:43 +0800, Yandong Yao wrote:

Hi Guys, I use the following commands to start SolrCloud according to the SolrCloud wiki:

yydzero:example bjcoe$ java -Dbootstrap_confdir=./solr/conf -Dcollection.configName=myconf -DzkRun -DnumShards=2 -jar start.jar
yydzero:example2 bjcoe$ java -Djetty.port=7574 -DzkHost=localhost:9983 -jar start.jar

Then I created several cores using the CoreAdmin API (http://localhost:8983/solr/admin/cores?action=CREATE&name=coreName&collection=collection1), and clusterstate.json shows the following topology:

collection1:
-- shard1:
   -- collection1
   -- CoreForCustomer1
   -- CoreForCustomer3
   -- CoreForCustomer5
-- shard2:
   -- collection1
   -- CoreForCustomer2
   -- CoreForCustomer4

1) Index:

Using the following command to index the mem.xml file in the exampledocs directory:

yydzero:exampledocs bjcoe$ java -Durl=http://localhost:8983/solr/coreForCustomer3/update -jar post.jar mem.xml
SimplePostTool: version 1.4
SimplePostTool: POSTing files to http://localhost:8983/solr/coreForCustomer3/update..
SimplePostTool: POSTing file mem.xml
SimplePostTool: COMMITting Solr index changes.

And now the Solr Admin UI shows that 'coreForCustomer1', 'coreForCustomer3', and 'coreForCustomer5' have 3 documents each (mem.xml has 3 documents) and the other 2 cores have 0 documents.

*Question 1:* Is this expected behavior? How do I index documents into a specific core?

*Question 2:* If SolrCloud doesn't support this yet, how could I extend it to support this feature (indexing a document into a particular core)? Where should i
Re: SolrCloud: how to index documents into a specific core and how to search against that core?
Hi Darren, Thanks very much for your reply. The reason I want to control core indexing/searching is that I want to use one core to store one customer's data (all customers share the same config): e.g., customer 1 uses coreForCustomer1 and customer 2 uses coreForCustomer2. Is there any better way than using a different core for each customer? Another way may be to use a different collection for each customer, though I am not sure how many collections SolrCloud can support. Which way is better in terms of flexibility/scalability? (Suppose there are tens of thousands of customers.) Regards, Yandong

2012/5/22 Darren Govoni dar...@ontrenet.com:

Why do you want to control what gets indexed into a core and then have to know which core to search? That's the kind of knowing that SolrCloud solves. In SolrCloud, it handles the distribution of documents across shards and retrieves them regardless of which node is searched from. That is the point of cloud: you don't know the details of where exactly documents are being managed (i.e., they are cloudy). It can change and re-balance from time to time. SolrCloud performs the distributed search for you; therefore when you try to search a node/core with no documents, all the results from the cloud are retrieved regardless. This is considered A Good Thing. It requires a change in thinking about indexing and searching.

On Tue, 2012-05-22 at 08:43 +0800, Yandong Yao wrote:

Hi Guys, I use the following commands to start SolrCloud according to the SolrCloud wiki:

yydzero:example bjcoe$ java -Dbootstrap_confdir=./solr/conf -Dcollection.configName=myconf -DzkRun -DnumShards=2 -jar start.jar
yydzero:example2 bjcoe$ java -Djetty.port=7574 -DzkHost=localhost:9983 -jar start.jar

Then I created several cores using the CoreAdmin API (http://localhost:8983/solr/admin/cores?action=CREATE&name=coreName&collection=collection1), and clusterstate.json shows the following topology:

collection1:
-- shard1:
   -- collection1
   -- CoreForCustomer1
   -- CoreForCustomer3
   -- CoreForCustomer5
-- shard2:
   -- collection1
   -- CoreForCustomer2
   -- CoreForCustomer4

1) Index:

Using the following command to index the mem.xml file in the exampledocs directory:

yydzero:exampledocs bjcoe$ java -Durl=http://localhost:8983/solr/coreForCustomer3/update -jar post.jar mem.xml
SimplePostTool: version 1.4
SimplePostTool: POSTing files to http://localhost:8983/solr/coreForCustomer3/update..
SimplePostTool: POSTing file mem.xml
SimplePostTool: COMMITting Solr index changes.

And now the Solr Admin UI shows that 'coreForCustomer1', 'coreForCustomer3', and 'coreForCustomer5' have 3 documents each (mem.xml has 3 documents) and the other 2 cores have 0 documents.

*Question 1:* Is this expected behavior? How do I index documents into a specific core?

*Question 2:* If SolrCloud doesn't support this yet, how could I extend it to support this feature (indexing a document into a particular core)? Where should I start, the hashing algorithm?

*Question 3:* Why are the documents also indexed into 'coreForCustomer1' and 'coreForCustomer5'? The default replication factor for documents is 1, right?

Then I tried to index some documents to 'coreForCustomer2':

$ java -Durl=http://localhost:8983/solr/coreForCustomer2/update -jar post.jar ipod_video.xml

But 'coreForCustomer2' still has 0 documents, and the documents in ipod_video.xml were indexed into the cores for customers 1/3/5.

*Question 4:* Why does this happen?

2) Search:

I use http://localhost:8983/solr/coreForCustomer2/select?q=*%3A*&wt=xml to search against 'CoreForCustomer2', but it returns all documents in the whole collection even though this core has no documents at all. Then I use http://localhost:8983/solr/coreForCustomer2/select?q=*%3A*&wt=xml&shards=localhost:8983/solr/coreForCustomer2, and it returns 0 documents.

*Question 5:* So if I want to search against a particular core, I need to use the 'shards' parameter with the SolrCore name as the parameter value, right?

Thanks very much in advance! Regards, Yandong
SolrCloud: how to index documents into a specific core and how to search against that core?
Hi Guys, I use the following commands to start SolrCloud according to the SolrCloud wiki:

yydzero:example bjcoe$ java -Dbootstrap_confdir=./solr/conf -Dcollection.configName=myconf -DzkRun -DnumShards=2 -jar start.jar
yydzero:example2 bjcoe$ java -Djetty.port=7574 -DzkHost=localhost:9983 -jar start.jar

Then I created several cores using the CoreAdmin API (http://localhost:8983/solr/admin/cores?action=CREATE&name=coreName&collection=collection1), and clusterstate.json shows the following topology:

collection1:
-- shard1:
   -- collection1
   -- CoreForCustomer1
   -- CoreForCustomer3
   -- CoreForCustomer5
-- shard2:
   -- collection1
   -- CoreForCustomer2
   -- CoreForCustomer4

1) Index:

Using the following command to index the mem.xml file in the exampledocs directory:

yydzero:exampledocs bjcoe$ java -Durl=http://localhost:8983/solr/coreForCustomer3/update -jar post.jar mem.xml
SimplePostTool: version 1.4
SimplePostTool: POSTing files to http://localhost:8983/solr/coreForCustomer3/update..
SimplePostTool: POSTing file mem.xml
SimplePostTool: COMMITting Solr index changes.

And now the Solr Admin UI shows that 'coreForCustomer1', 'coreForCustomer3', and 'coreForCustomer5' have 3 documents each (mem.xml has 3 documents) and the other 2 cores have 0 documents.

*Question 1:* Is this expected behavior? How do I index documents into a specific core?

*Question 2:* If SolrCloud doesn't support this yet, how could I extend it to support this feature (indexing a document into a particular core)? Where should I start, the hashing algorithm?

*Question 3:* Why are the documents also indexed into 'coreForCustomer1' and 'coreForCustomer5'? The default replication factor for documents is 1, right?

Then I tried to index some documents to 'coreForCustomer2':

$ java -Durl=http://localhost:8983/solr/coreForCustomer2/update -jar post.jar ipod_video.xml

But 'coreForCustomer2' still has 0 documents, and the documents in ipod_video.xml were indexed into the cores for customers 1/3/5.

*Question 4:* Why does this happen?

2) Search:

I use http://localhost:8983/solr/coreForCustomer2/select?q=*%3A*&wt=xml to search against 'CoreForCustomer2', but it returns all documents in the whole collection even though this core has no documents at all. Then I use http://localhost:8983/solr/coreForCustomer2/select?q=*%3A*&wt=xml&shards=localhost:8983/solr/coreForCustomer2, and it returns 0 documents.

*Question 5:* So if I want to search against a particular core, I need to use the 'shards' parameter with the SolrCore name as the parameter value, right?

Thanks very much in advance! Regards, Yandong
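The routing behavior behind questions 1-4 can be sketched: SolrCloud assigns a document to a shard by hashing its unique key, regardless of which core URL received the update. This toy model uses md5 instead of SolrCloud's actual hash function, so it is illustrative only:

```python
import hashlib

NUM_SHARDS = 2

def shard_for(doc_id):
    # Simplified stand-in for SolrCloud's hash partitioning: the shard is
    # chosen from a hash of the unique key, not from the core the update
    # was sent to. (SolrCloud uses its own hash, not md5.)
    h = int(hashlib.md5(doc_id.encode()).hexdigest(), 16)
    return h % NUM_SHARDS

# Whichever core URL receives the update, the document is forwarded to the
# shard its id hashes to, which is why posts sent to coreForCustomer2 could
# end up on the shard hosting the cores for customers 1/3/5.
sid = shard_for("EN7800GTX/2DHTV/256M")
assert sid in (0, 1)
assert sid == shard_for("EN7800GTX/2DHTV/256M")  # deterministic routing
```

Since routing is keyed on the document id and not the target core, "index into this specific core" is not expressible this way, which is what motivates the collection-per-customer advice above.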
Re: Faster Solr Indexing
I have similar issues using DIH, and org.apache.solr.update.DirectUpdateHandler2.addDoc(AddUpdateCommand) consumes most of the time when indexing 10K rows (each row is about 70K):

- DIH nextRow takes about 10 seconds in total
- If the index uses a whitespace tokenizer and lowercase filter, then addDoc() takes about 80 seconds
- If the index uses a whitespace tokenizer, lowercase filter, and WDF, then addDoc() takes about 112 seconds
- If the index uses a whitespace tokenizer, lowercase filter, WDF, and Porter stemmer, then addDoc() takes about 145 seconds

We have more than a million rows in total, and I am wondering whether I am doing something wrong or whether there is any way to improve the performance of addDoc(). Thanks very much in advance! Following is the configuration:

1) JVM: -Xms256M -Xmx1048M -XX:MaxPermSize=512m
2) Solr version 3.5
3) solrconfig.xml (almost copied from Solr's example/solr directory):

<indexDefaults>
  <useCompoundFile>false</useCompoundFile>
  <mergeFactor>10</mergeFactor>
  <!-- Sets the amount of RAM that may be used by Lucene indexing for buffering added documents and deletions before they are flushed to the Directory. -->
  <ramBufferSizeMB>64</ramBufferSizeMB>
  <!-- If both ramBufferSizeMB and maxBufferedDocs is set, then Lucene will flush based on whichever limit is hit first. -->
  <!-- <maxBufferedDocs>1000</maxBufferedDocs> -->
  <maxFieldLength>2147483647</maxFieldLength>
  <writeLockTimeout>1000</writeLockTimeout>
  <commitLockTimeout>1</commitLockTimeout>
  <lockType>native</lockType>
</indexDefaults>

2012/3/11 Peyman Faratin pey...@robustlinks.com:

Hi, I am trying to index 12MM docs faster than is currently happening in Solr (using solrj). We have identified Solr's add method as the bottleneck (and not commit, which is tuned ok through mergeFactor and maxRamBufferSize and JVM RAM). Adding 1000 docs is taking approximately 25 seconds. We are making sure we add and commit in batches. And we've tried both CommonsHttpSolrServer and EmbeddedSolrServer (assuming removing HTTP overhead would speed things up with embedding), but the difference is marginal. The docs being indexed are on average 20 fields long, mostly indexed but none stored. The major size contributors are two fields: content, and shingledContent (populated using a copyField of content). The length of the content field is (likely) Gaussian distributed (a few large docs of 50-80K tokens, but the majority around 2K tokens). We use shingledContent to support phrase queries and content for unigram queries (following the advice of the Solr Enterprise Search Server book, p. 305, section "The Solution: Shingling"). Clearly the size of the docs is a contributor to the slow adds (confirmed by removing these 2 fields, which halves the indexing time). We've tried compressed=true also, but that is not working. Any guidance on how to support our application logic (without having to change the schema too much) and speed up indexing (from the current 212 days for 12MM docs) would be much appreciated. Thank you, Peyman
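One thing worth double-checking in this situation is that the adds really go out in large batches. A sketch of the batching pattern with a stubbed send() (in a real SolrJ client each send would be a single server.add(collection) call per batch; the helper names are illustrative):

```python
# Batch the documents so each request carries many docs instead of one.
def batches(docs, size=1000):
    for i in range(0, len(docs), size):
        yield docs[i:i + size]

sent = []
def send(batch):
    # Stub: a real version would POST the whole batch in one request,
    # e.g. SolrJ's server.add(Collection<SolrInputDocument>).
    sent.append(len(batch))

docs = [{"id": str(i)} for i in range(2500)]
for batch in batches(docs, 1000):
    send(batch)

assert sent == [1000, 1000, 500]   # 3 requests instead of 2500
```

Batching reduces per-request overhead, but as the analysis-chain timings above show, a large fraction of the cost is in analysis itself, so trimming heavy filters (shingling, WDF, stemming) on big fields is where the bigger wins usually are.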
How to use nested query in fq?
Hi Guys, I am using Solr 3.5, and would like to use an fq like getField(getDoc(uuid:workspace_${workspaceId}), isPublic):true, where:

- workspace_${workspaceId}: workspaceId is an indexed field.
- getDoc(uuid:concat(workspace_, workspaceId)): returns the document whose uuid is workspace_${workspaceId}.
- getField(getDoc(uuid:workspace_${workspaceId}), isPublic): returns the matched document's isPublic field.

The use case is that I have workspace objects, and a workspace contains many sub-objects, such as work files, comments, datasets, and so on. A workspace has an 'isPublic' field. If this field is true, then all registered users can access the workspace and all its sub-objects. Otherwise, only workspace members can access the workspace and its sub-objects. So I want to use fq to determine whether the document in question belongs to a public workspace. Is this possible? If not, how do I implement a similar feature? Implement a ValueSourcePlugin? Any guidance or example on this? Or is there a better solution? It is possible to add an 'isPublic' field to all sub-objects, but that makes index updates more complex, so I am trying to find a better solution. Thanks very much in advance! Regards, Yandong
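One workaround, assuming a two-pass query is acceptable: fetch the workspace document first, then build the fq for the sub-object query from its isPublic value. A sketch where search(), the field names, and the index contents are all hypothetical stubs, not Solr 3.5 API:

```python
# Stub: a real version would query Solr for the workspace doc.
def search(q):
    index = {"uuid:workspace_42": {"uuid": "workspace_42", "isPublic": True}}
    return index.get(q)

def build_fq(workspace_id, user_workspace_ids):
    # Pass 1: look up the workspace's isPublic flag.
    ws = search(f"uuid:workspace_{workspace_id}")
    if ws and ws["isPublic"]:
        # Public workspace: any registered user may see its sub-objects.
        return f"workspaceId:{workspace_id}"
    # Private workspace: restrict to the workspaces the user is a member of.
    allowed = " OR ".join(f"workspaceId:{w}" for w in user_workspace_ids)
    return f"({allowed})"

assert build_fq(42, [7]) == "workspaceId:42"
assert build_fq(99, [7]) == "(workspaceId:7)"
```

The extra round trip is cheap (one doc lookup per request), and it avoids both a custom ValueSourcePlugin and denormalizing isPublic onto every sub-object.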
Re: Need help for solr searching case-insensitive item
Sounds like a WordDelimiterFilter config issue; please refer to http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory. Also, it will help if you could provide: 1) the tokenizer/filter config in your schema file, and 2) the analysis.jsp output from the admin page.

2010/10/26 wu liu wul...@mail.usask.ca:

Hi all, I just noticed a weird thing happening with my Solr search results. If I do a search for ecommons, it does not return results for eCommons; conversely, if I do a search for eCommons, I only get the matches for eCommons, but not ecommons. I cannot figure out why. Please help me. Thanks very much in advance.
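For reference, matching both 'ecommons' and 'eCommons' usually comes down to having LowerCaseFilterFactory in both the index and query analyzers, so that both sides normalize to the same terms. A minimal sketch of such a fieldType (the name is illustrative; your actual schema will differ):

```xml
<!-- Minimal sketch: lower-case at both index and query time so
     "eCommons" and "ecommons" normalize to the same term. -->
<fieldType name="text_lc" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

If a WordDelimiterFilter is also in the chain, its catenate/split options must agree between the index and query analyzers, which is what the admin analysis page will reveal.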
A question on WordDelimiterFilterFactory
Hi Guys,

I encountered a problem when enabling WordDelimiterFilterFactory for both index and query (the relevant part of schema.xml is pasted at the bottom of this email).

*1. Steps to reproduce:*
1.1 The indexed sample document contains only one sentence: "This is a TechNote."
1.2 Query: q=TechNote
1.3 Result: no matches are returned, even though the sentence clearly contains the word 'TechNote'.

*2. Output with debugQuery enabled*
Turning on debugQuery
(http://localhost:7111/solr/test/select?indent=on&version=2.2&q=TechNote&fq=&start=0&rows=0&fl=*%2Cscore&qt=standard&wt=standard&debugQuery=on&explainOther=id%3A001&hl.fl=)
gives the following:

<str name="rawquerystring">TechNote</str>
<str name="querystring">TechNote</str>
<str name="parsedquery">PhraseQuery(all:"tech note")</str>
<str name="parsedquery_toString">all:"tech note"</str>
<lst name="explain"/>
<str name="otherQuery">id:001</str>
<lst name="explainOther">
  <str name="001">
0.0 = fieldWeight(all:"tech note" in 0), product of:
  0.0 = tf(phraseFreq=0.0)
  0.61370564 = idf(all: tech=1 note=1)
  0.25 = fieldNorm(field=all, doc=0)
  </str>
</lst>

So the raw query string is converted to the phrase query "tech note", whose phrase frequency is 0, hence no matches.

*3. Result from the admin/analysis.jsp page*
According to analysis.jsp, the query 'TechNote' does match the indexed document (the matching tokens were highlighted in red on the page). The token streams, condensed per filter:

Index Analyzer:
- WhitespaceTokenizerFactory: This / is / a / TechNote.
- SynonymFilterFactory {synonyms=synonyms.txt, expand=true, ignoreCase=true}: This / is / a / TechNote.
- WordDelimiterFilterFactory {splitOnCaseChange=1, generateWordParts=1, generateNumberParts=1, catenateWords=1, catenateNumbers=1, catenateAll=0}: This / is / a / Tech / Note, plus the catenated token TechNote (offsets 10,18) stacked on the last position
- LowerCaseFilterFactory: this / is / a / tech / note, plus technote
- SnowballPorterFilterFactory {protected=protwords.txt, language=English}: this / is / a / tech / note, plus technot ('tech' and 'note' were highlighted as matches)

Query Analyzer:
- WhitespaceTokenizerFactory: TechNote
- SynonymFilterFactory {synonyms=synonyms.txt, expand=true, ignoreCase=true}: TechNote
- WordDelimiterFilterFactory {splitOnCaseChange=1, generateWordParts=1, generateNumberParts=1, catenateWords=0, catenateNumbers=0, catenateAll=0}: Tech / Note
- LowerCaseFilterFactory: tech / note
- SnowballPorterFilterFactory {protected=protwords.txt, language=English}: tech / note

*4. My questions are:*
4.1 Why do debugQuery and analysis.jsp give different results?
4.2 My understanding is that at index time 'TechNote' is expanded to both 'technote' and the parts 'tech' / 'note' according to my schema.xml, and that at query time 'TechNote' becomes the phrase 'tech note', so it SHOULD match. Am I right?
4.3 Why is the phrase frequency of 'tech note' 0 in the debugQuery output (0.0 = tf(phraseFreq=0.0))?

Any suggestions/comments are welcome!

*5. fieldType definition in schema.xml*

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter
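The index/query asymmetry in the schema above (catenateWords=1 at index time, catenateWords=0 at query time) can be sketched in a few lines of Python. This is a simplified simulation of WordDelimiterFilterFactory's splitOnCaseChange/catenateWords behavior, not the real filter:

```python
import re

def word_delimiter(token, catenate_words=False):
    """Rough sketch of WordDelimiterFilter with splitOnCaseChange=1:
    split on lower->upper case changes; with catenate_words=True
    (catenateWords=1) also emit the concatenation as an extra token."""
    parts = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", token)
    tokens = list(parts)
    if catenate_words and len(parts) > 1:
        tokens.append("".join(parts))
    return tokens

# Index analyzer (catenateWords=1): TechNote -> Tech, Note, TechNote
print(word_delimiter("TechNote", catenate_words=True))
# Query analyzer (catenateWords=0): TechNote -> Tech, Note
print(word_delimiter("TechNote", catenate_words=False))
```

Since both sides produce 'tech' / 'note' in order, the phrase query built at query time should still match the indexed parts; the failure came from the position handling bug fixed in 1.4.1 (see the reply below this message).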
Re: A question on WordDelimiterFilterFactory
Hi Robert,

I am using Solr 1.4; will try with 1.4.1 tomorrow. Thanks very much!

Regards,
Yandong Yao

2010/9/14 Robert Muir rcm...@gmail.com:
Did you index with Solr 1.4 (or are you using Solr 1.4)? At a quick glance, it looks like it might be this: https://issues.apache.org/jira/browse/SOLR-1852 , which was fixed in 1.4.1.

On Tue, Sep 14, 2010 at 5:40 AM, yandong yao yydz...@gmail.com wrote:
(quoted original message trimmed)
Re: A question on WordDelimiterFilterFactory
After upgrading to 1.4.1, it is fixed. Thanks very much for your help!

Regards,
Yandong Yao

2010/9/14 yandong yao yydz...@gmail.com:
Hi Robert, I am using Solr 1.4; will try with 1.4.1 tomorrow. Thanks very much!

2010/9/14 Robert Muir rcm...@gmail.com:
Did you index with Solr 1.4 (or are you using Solr 1.4)? At a quick glance, it looks like it might be this: https://issues.apache.org/jira/browse/SOLR-1852 , which was fixed in 1.4.1.

(quoted original message trimmed)
Re: how to support implicit trailing wildcards
Hi Jan,

It seems q=mount OR mount* gives a different sort order than q=mount for documents containing 'mount'. I changed it to q=mount^100 OR (mount?* -mount)^1.0 and it tests well. Thanks very much!

2010/8/10 Jan Høydahl / Cominvent jan@cominvent.com:
Hi,

You don't need to duplicate the content into two fields to achieve this. Try this: q=mount OR mount*
The exact match will always get a higher score than the wildcard match, because wildcard matches use constant score. Making this work for multi-term queries is a bit trickier, but something along these lines: q=(mount OR mount*) AND (everest OR everest*)

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Training in Europe - www.solrtraining.com

On 10. aug. 2010, at 09.38, Geert-Jan Brits wrote:
You could satisfy this by making 2 fields: 1. exactmatch 2. wildcardmatch. Use copyField in your schema to copy 1 -> 2, then query q=exactmatch:mount+wildcardmatch:mount*&q.op=OR. This would score exact matches above (solely) wildcard matches.

Geert-Jan

2010/8/10 yandong yao yydz...@gmail.com:
(earlier quoted messages trimmed)
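For reference, the per-term rewrite yandong settled on (q=mount^100 OR (mount?* -mount)^1.0) and Jan's multi-term form can be generated client-side. A hedged sketch; the function names and the boost values are illustrative, taken from the examples in this thread:

```python
def expand_term(term):
    """Expand a bare term into 'boosted exact match OR wildcard' form,
    mirroring q=mount^100 OR (mount?* -mount)^1.0 from this thread.
    '?*' requires at least one extra character, and '-term' excludes
    the exact term from the wildcard clause, so exact hits are scored
    only by the boosted left-hand side."""
    return f"{term}^100 OR ({term}?* -{term})^1.0"

def expand_query(terms):
    # Multi-term variant, along the lines of Jan's
    # q=(mount OR mount*) AND (everest OR everest*)
    return " AND ".join(f"({expand_term(t)})" for t in terms)

print(expand_term("mount"))
# mount^100 OR (mount?* -mount)^1.0
print(expand_query(["mount", "everest"]))
```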
how to support implicit trailing wildcards
Hi everyone, How can Solr support an implicit trailing wildcard '*'? E.g., in Google, searching 'umoun' matches 'umount', and searching 'mounta' matches 'mountain'. From my point of view there are several ways, each with disadvantages: 1) Using EdgeNGramFilterFactory, so 'umount' is indexed as 'u', 'um', 'umo', 'umou', 'umoun', 'umount'. The disadvantages are: a) the index size increases dramatically; b) it matches terms with no real relationship, e.g. 'mount' will also match 'mountain'. 2) Two-pass searching: first search the term dictionary through TermsComponent with the given keyword, then search again with the first matching term. E.g., when the user enters 'umoun', TermsComponent matches 'umount', which is then used to search. The disadvantages are: a) the query string must be parsed to recognize meta keywords such as 'AND', 'OR', '+', '-' (more complex since I am using the PHP client); b) the returned hit count is not for the original search string, which affects other components such as an auto-suggest component based on user search history and hit counts. 3) Write a custom SearchComponent, but I have no idea where/how to start. Is there any other way in Solr to do this? Any feedback/suggestions are welcome! Thanks very much in advance!
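The index blow-up in option 1 is easy to see from what EdgeNGramFilterFactory emits per token. A sketch, assuming a minimum gram size of 1; the grams below match the 'umount' example in the question:

```python
def edge_ngrams(token, min_gram=1, max_gram=None):
    """Front-anchored n-grams, as an edge n-gram filter produces them:
    every prefix of the token from min_gram up to max_gram characters."""
    max_gram = max_gram or len(token)
    return [token[:i] for i in range(min_gram, min(max_gram, len(token)) + 1)]

print(edge_ngrams("umount"))
# ['u', 'um', 'umo', 'umou', 'umoun', 'umount']
```

Every indexed term becomes up to len(term) index terms, which is where the dramatic size increase comes from; raising min_gram trims the noisiest short grams.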
Re: how to support implicit trailing wildcards
Hi Bastian,

Sorry for not making it clear: I also want exact matches to score higher than wildcard matches. That is, when searching 'mount', documents containing 'mount' should score higher than documents containing 'mountain', while 'mount*' seems to treat 'mount' and 'mountain' the same. Besides, I also want the query processed by the analyzer, while according to http://wiki.apache.org/lucene-java/LuceneFAQ#Are_Wildcard.2C_Prefix.2C_and_Fuzzy_queries_case_sensitive.3F , "Wildcard, Prefix, and Fuzzy queries are not passed through the Analyzer". The rationale is that if I search 'mounted', I also want documents with 'mount' to match. So it seems the built-in wildcard search cannot satisfy my requirements, if I understand correctly.

Thanks very much!

2010/8/9 Bastian Spitzer bspit...@magix.net:
Wildcard search is already built in; just use: ?q=umoun* ?q=mounta*

-----Original Message-----
From: yandong yao [mailto:yydz...@gmail.com]
Sent: Monday, 9 August 2010 15:57
To: solr-user@lucene.apache.org
Subject: how to support implicit trailing wildcards

(quoted original message trimmed)
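The two-pass idea from this thread (resolve the user's prefix against the term dictionary first, then search with the resolved full term) can be simulated without Solr. In practice pass one would be a TermsComponent request and pass two an ordinary query; this toy version only illustrates the control flow under that assumption:

```python
def two_pass_search(prefix, term_dict, docs):
    """Sketch of the proposed two-pass search.
    Pass 1: prefix lookup against the term dictionary,
    standing in for a TermsComponent request."""
    candidates = sorted(t for t in term_dict if t.startswith(prefix))
    if not candidates:
        return []
    term = candidates[0]  # "first matched term", per the proposal
    # Pass 2: ordinary search with the resolved full term.
    return [d for d in docs if term in d.lower().split()]

docs = ["How to umount a filesystem", "Mount Everest facts"]
print(two_pass_search("umoun", {"umount", "mountain"}, docs))
# ['How to umount a filesystem']
```

Note the drawback called out in the thread is visible here: the hit count reflects the resolved term 'umount', not the user's original input 'umoun'.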