Re: Are there any restrictions on what kind of, or how many, fields you can use in a Pivot Query? I get ClassCastException when I use some of my string fields, and don't when I use some other string fields
Hello Ravish, Erick,

I'm facing the same issue with solr-trunk (as of r1071282).

Field configuration: positionIncrementGap="100"
Schema configuration: in my test index, I have documents with sparse values; some documents may or may not have a value for f1, f2 and/or f3. The number of indexed documents is around 25.

I'm facing the issue at query time, depending on my query and the temperature of the index. Parameters having an effect on reproducibility:
- number of levels of the decision tree: the deeper the tree, the faster the exception arises;
- the facet.limit parameter: the higher the limit, the faster the exception arises.

Examples, with all docs, facet-pivoting on all the fields that matter, varying facet.limit:

q=*:* pivot=f1,f2,f3 facet.limit=1 : OK
q=*:* pivot=f1,f2,f3 facet.limit=2 : OK
...
q=*:* pivot=f1,f2,f3 facet.limit=8 : OK
q=*:* pivot=f1,f2,f3 facet.limit=9 : NOT OK (OK on the 3rd retry)
q=*:* pivot=f1,f2,f3 facet.limit=10 : NOT OK (OK on the 6th retry)
q=*:* pivot=f1,f2,f3 facet.limit=11 : NOT OK
...

It really looks like a cache issue. After some retries, I can finally obtain my results instead of an HTTP 500. Once I obtain my results, I can ask for more if I wait a little. That's very odd.

So before I continue, here is my query configuration, in substance: maxBooleanClauses of 1024; filter, query-result and document caches all with autowarmCount="0"; lazy field loading enabled; queryResultWindowSize of 20; queryResultMaxDocsCached of 200; the stock "solr rocks" static firstSearcher warming query from solrconfig.xml; useColdSearcher false; maxWarmingSearchers 2. That's very much like the default configuration. I guess that the default cache configuration is not perfectly suitable for facet pivoting, so any hint on how to tweak it right is welcome.
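For reference, the requests sketched above would look something like this in full HTTP form; host, port and handler path are assumptions, and note that the actual parameter name is facet.pivot (the "pivot=f1,f2,f3" notation above is shorthand):

```text
http://localhost:8983/solr/select?q=*:*&rows=0
    &facet=true&facet.pivot=f1,f2,f3&facet.limit=9
```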
Kind regards, -- Tanguy

On 02/15/2011 06:05 PM, Erick Erickson wrote:
To get meaningful help, you have to post a minimum of:
1> the relevant schema definitions for the field that makes it blow up, including the <field> and <fieldType> tags;
2> the query you used, with some indication of the field that makes it blow up;
3> what version you're using;
4> any changes you've made to the standard configurations;
5> whether you've recently installed a new version.
It might help if you reviewed: http://wiki.apache.org/solr/UsingMailingLists
Best, Erick

On Tue, Feb 15, 2011 at 11:27 AM, Ravish Bhagdev wrote:
Looks like it's a bug? Is it not?
Ravish

On Tue, Feb 15, 2011 at 4:03 PM, Ravish Bhagdev wrote:
When I include some of the fields in my search query:

SEVERE: java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to [Lorg.apache.solr.common.util.ConcurrentLRUCache$CacheEntry;
    at org.apache.solr.common.util.ConcurrentLRUCache$PQueue.myInsertWithOverflow(ConcurrentLRUCache.java:377)
    at org.apache.solr.common.util.ConcurrentLRUCache.markAndSweep(ConcurrentLRUCache.java:329)
    at org.apache.solr.common.util.ConcurrentLRUCache.put(ConcurrentLRUCache.java:144)
    at org.apache.solr.search.FastLRUCache.put(FastLRUCache.java:131)
    at org.apache.solr.search.SolrIndexSearcher.getDocSet(SolrIndexSearcher.java:904)
    at org.apache.solr.handler.component.PivotFacetHelper.doPivots(PivotFacetHelper.java:121)
    at org.apache.solr.handler.component.PivotFacetHelper.doPivots(PivotFacetHelper.java:126)
    at org.apache.solr.handler.component.PivotFacetHelper.process(PivotFacetHelper.java:85)
    at org.apache.solr.handler.component.FacetComponent.process(FacetComponent.java:84)
    at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:231)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1298)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:340)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:240)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
    at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:857)
    at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588)
    at org.
Re: Selecting (and sorting!) by the min/max value from multiple fields
Hello,

Have you tried reading http://wiki.apache.org/solr/FunctionQuery#Sort_By_Function ? From that page I would try something like:

http://host:port/solr/select?q=sony&sort=min(min(priceCash,priceCreditCard),priceCoupon)+asc&rows=10&indent=on&debugQuery=on

Is that of any help?
-- Tanguy

On 04/20/2011 09:41 AM, jmaslac wrote:
Hello,

The short question is this: is there a way for a search to return a field that is not defined in the schema but is the minimal/maximal value of several (int/float) fields in a SolrDocument? (And what would that search look like?)

Longer explanation: I have products and each of them can have several prices (price for cash, price for credit cards, coupon price and so on) - not every product has all the price options. (Don't ask why - that's the use case :) )

Is there a way to ask "give me the products containing, for example, 'sony', and in the results return the minimal price of all possible prices (for each product) and SORT the results by that (minimal) price"?

I know I can calculate the minimal price at import/index time and store it in one separate field, but the idea is that users will have checkboxes with which they could say: I'm only interested in products that have priceCreditCard and priceCoupon, show me the smaller of those two and sort by that value. My idea is something like this:

?q=sony&minPrice:min(priceCash,priceCreditCard,priceCoupon...)

(the field minPrice is not defined in the schema but should be returned in the results). For searching this actually isn't a problem, as I can easily programmatically compare the prices and present them to the user. The problem is sorting - I could also do that programmatically, but that would mean I'd have to pull out all the results the query returned (which can be quite big, of course) and then sort them, so that's an option I would naturally like to avoid.

I don't know if I'm asking too much of Solr :) but I can see the usefulness of something like this in examples other than mine.
Hope the question is clear, and if I'm going about things completely the wrong way, please advise in the right direction. (If there is a similar question asked somewhere else, please redirect me - I didn't find it.) Help much appreciated! Josip -- View this message in context: http://lucene.472066.n3.nabble.com/Selecting-and-sorting-by-the-min-max-value-from-multiple-fields-tp2841944p2841944.html Sent from the Solr - User mailing list archive at Nabble.com. -- Tanguy
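One caveat with the sort-by-function suggestion above: function queries read numeric values from the field cache, so a document with no value in one of the price fields contributes 0, which would then always win the min(). Assuming 0 is never a legitimate price, map(field,min,max,target) can remap those zeros to a large sentinel before taking the minimum. A sketch, with the hostname and the sentinel value (999999) as assumptions:

```text
http://host:port/solr/select?q=sony&rows=10
    &sort=min(map(priceCash,0,0,999999),
          min(map(priceCreditCard,0,0,999999),
              map(priceCoupon,0,0,999999)))+asc
```

Products that match the user's checked price options could then additionally be restricted with filter queries on those fields.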
Bulk indexing, UpdateProcessor overwriteDupes and poor IO performances
Dear list,

I'm posting here after some unsuccessful investigations. In my setup I push documents to Solr using the StreamingUpdateSolrServer. I'm sending a comfortable initial amount of documents (~250M) and wished to perform overwriting of duplicated documents at index time, during the update, taking advantage of the UpdateProcessorChain.

At the beginning of the indexing stage, everything is quite fast; documents arrive at a rate of about 1000 doc/s. The only extra processing during the import is the computation of a couple of hashes that are used to uniquely identify documents given their content, using both stock (MD5Signature) and custom (derived from Lookup3Signature) update processors. I send a commit command to the server every 500k documents sent.

During a first period, the server is CPU bound. After a short while (~10 minutes), the rate at which documents are received starts to fall dramatically, the server becoming IO bound. I first thought of a normal speed decrease during the commit, while my push client waits for the flush to occur. That would have been a normal slowdown. What caught my attention was that, unexpectedly, the server was performing a lot of small reads, way more than the number of writes, which seem to be larger. The combination of the many small reads with the constant amount of bigger writes seems to create a lot of IO contention on my commodity SATA drive, and the ETA of my built index started to increase scarily =D

I then restarted the JVM with JMX enabled so I could investigate a little bit more. I then realized that the UpdateHandler was performing many reads while processing the update request.

Are there any known limitations around the UpdateProcessorChain when overwriteDupes is set to true? I turned it off, which of course breaks the intent of my built index, but for comparison purposes it's good. That did the trick: indexing is fast again, even with the periodic commits.
I therefore have two questions, an interesting first one and a boring second one:

1/ What's the workflow of the UpdateProcessorChain when one or more processors have overwriting of duplicates turned on? What happens under the hood? I tried to answer that myself by looking at DirectUpdateHandler2, and my understanding stopped at the following:
- the document is added to the Lucene IndexWriter;
- the duplicates are deleted from the Lucene IndexWriter.
The dark magic I couldn't understand seems to occur around the idTerm and updateTerm things, in the addDoc method. The deletions seem to be buffered somewhere; I just didn't get it :-) I might be wrong since I didn't read the code more than that, but the point might be in how Solr handles deletions, which is something still unclear to me.

Anyway, a lot of reads seem to occur for that precise task, and it tends to produce a lot of IO, killing indexing performance when overwriteDupes is on. I don't even understand why so many read operations occur at this stage, since my process had a comfortable amount of RAM (with Xms=Xmx=8GB), of which only 4.5GB is used so far. Any help, recommendation or idea is welcome :-)

2/ In case there isn't a simple fix for this, I'll have to live with duplicates in my index. I don't mind, since Solr offers a great grouping feature, which I already use in some other applications. The only thing I don't know yet is: if I rely on grouping at search time, in combination with the Stats component (which is the intent of that index), and limit the results to 1 document per group, will the computed statistics take those duplicates into account or not? Shortly, how well does the Stats component behave when combined with hits collapsing?

I had first implemented my solution using overwriteDupes because it would have reduced both the target size of my index and the complexity of the queries used to obtain statistics on the search results, at the same time.

Thank you very much in advance. -- Tanguy
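For context, the index-time deduplication setup described above follows the pattern from the Solr Deduplication wiki page; the chain name, field list and signature class below are illustrative, not the exact configuration used in this message:

```xml
<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <!-- field that receives the computed hash -->
    <str name="signatureField">signature</str>
    <!-- delete previously indexed documents with the same signature -->
    <bool name="overwriteDupes">true</bool>
    <!-- source fields hashed to identify a document by its content -->
    <str name="fields">name,features,cat</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```

It is the overwriteDupes=true flag that triggers the delete-by-query work discussed in this thread.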
Re: how can i index data in different documents
Hi Romi,

A simple way to do so is to define in your schema.xml the union of all the columns you need, plus a "type" field to distinguish your entities. E.g., in your DB:

table1:
- col1 : varchar
- col2 : int
- col3 : float

table2:
- col1 : int
- col2 : varchar
- col3 : int
- col4 : varchar

in Solr's schema:

<field name="table1_col1" type="text" />
<field name="table1_col2" type="int" />
<field name="table1_col3" type="float" />
<field name="table2_col1" type="int" />
<field name="table2_col2" type="text" />
<field name="table2_col3" type="int" />
<field name="table2_col4" type="string" />
<field name="type" type="string" required="true" multiValued="false" />

Ensure that when you add your documents, their "type" value is effectively set to either "table1" or "table2". That's one possibility amongst others.

-- Tanguy

On 05/26/11 14:57, Romi wrote:
Hi, I was not getting a reply for this post, so here I am reposting it, please reply. In my database I have two types of entity: customer and product. I want to index customer-related information in one document and product-related information in another document. Is this possible via Solr, and if so, how can I achieve it? Thanks & Regards, Romi. - Romi -- View this message in context: http://lucene.472066.n3.nabble.com/how-can-i-index-data-in-different-documents-tp2988621p2988621.html Sent from the Solr - User mailing list archive at Nabble.com. -- Tanguy
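With a discriminator field like the one above, restricting a search to one entity type becomes a simple filter query; the host, port and query below are assumptions:

```text
http://localhost:8983/solr/select?q=some+query&fq=type:table1
```

Using fq rather than adding the clause to q keeps the type restriction out of scoring and lets Solr cache it in the filter cache.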
Re: Bulk indexing, UpdateProcessor overwriteDupes and poor IO performances
Hello,

Sorry for re-posting this, but it seems my message got lost in the mailing list's message stream without catching anyone's attention... =D

Shortly: has anyone already experienced dramatic indexing slowdowns during large bulk imports with overwriteDupes turned on and a fairly high duplicates rate (around 4-8x)? It seems to produce a lot of deletions, which in turn appear to make the merging of segments pretty slow, by fairly increasing the number of little read operations occurring simultaneously with the regular large write operations of the merge. Added to the poor IO performance of a commodity SATA drive, indexing takes ages.

I temporarily bypassed that limitation by disabling the overwriting of duplicates, but that changes the way I query the index, requiring me to turn on field collapsing at search time.

Is this a known limitation? Does anyone have a few hints on how to optimize the handling of index-time deduplication? More details on my setup and the state of my understanding are in my previous message hereafter.

Thank you very much in advance. Regards, Tanguy

On 05/25/11 15:35, Tanguy Moal wrote:
Dear list, I'm posting here after some unsuccessful investigations. In my setup I push documents to Solr using the StreamingUpdateSolrServer. I'm sending a comfortable initial amount of documents (~250M) and wished to perform overwriting of duplicated documents at index time, during the update, taking advantage of the UpdateProcessorChain. At the beginning of the indexing stage, everything is quite fast; documents arrive at a rate of about 1000 doc/s. The only extra processing during the import is the computation of a couple of hashes that are used to uniquely identify documents given their content, using both stock (MD5Signature) and custom (derived from Lookup3Signature) update processors. I send a commit command to the server every 500k documents sent. During a first period, the server is CPU bound.
After a short while (~10 minutes), the rate at which documents are received starts to fall dramatically, the server becoming IO bound. I first thought of a normal speed decrease during the commit, while my push client waits for the flush to occur. That would have been a normal slowdown. What caught my attention was that, unexpectedly, the server was performing a lot of small reads, way more than the number of writes, which seem to be larger. The combination of the many small reads with the constant amount of bigger writes seems to create a lot of IO contention on my commodity SATA drive, and the ETA of my built index started to increase scarily =D

I then restarted the JVM with JMX enabled so I could investigate a little bit more. I then realized that the UpdateHandler was performing many reads while processing the update request. Are there any known limitations around the UpdateProcessorChain when overwriteDupes is set to true? I turned it off, which of course breaks the intent of my built index, but for comparison purposes it's good. That did the trick: indexing is fast again, even with the periodic commits.

I therefore have two questions, an interesting first one and a boring second one:

1/ What's the workflow of the UpdateProcessorChain when one or more processors have overwriting of duplicates turned on? What happens under the hood? I tried to answer that myself by looking at DirectUpdateHandler2, and my understanding stopped at the following:
- the document is added to the Lucene IndexWriter;
- the duplicates are deleted from the Lucene IndexWriter.
The dark magic I couldn't understand seems to occur around the idTerm and updateTerm things, in the addDoc method. The deletions seem to be buffered somewhere; I just didn't get it :-) I might be wrong since I didn't read the code more than that, but the point might be in how Solr handles deletions, which is something still unclear to me.
Anyway, a lot of reads seem to occur for that precise task, and it tends to produce a lot of IO, killing indexing performance when overwriteDupes is on. I don't even understand why so many read operations occur at this stage, since my process had a comfortable amount of RAM (with Xms=Xmx=8GB), of which only 4.5GB is used so far. Any help, recommendation or idea is welcome :-)

2/ In case there isn't a simple fix for this, I'll have to live with duplicates in my index. I don't mind, since Solr offers a great grouping feature, which I already use in some other applications. The only thing I don't know yet is: if I rely on grouping at search time, in combination with the Stats component (which is the intent of that index), and limit the results to 1 document per group, will the computed statistics take those duplicates into account or not? Shortly, how well does the Stats component behave when combined with
Re: Bulk indexing, UpdateProcessor overwriteDupes and poor IO performances
Lee,

Thank you very much for your answer. Using the signature field as the uniqueKey is effectively what I was doing, so the overwriteDupes=true parameter in my solrconfig was somehow redundant, although I wasn't aware of it! =D In practice it works perfectly, and that's the nice part.

By the way, I wonder what happens when we enter the following code snippet when the id field is the same as the signature field, from DirectUpdateHandler2.addDoc(AddUpdateCommand):

    if (del) { // ensure id remains unique
        BooleanQuery bq = new BooleanQuery();
        bq.add(new BooleanClause(new TermQuery(updateTerm), Occur.MUST_NOT));
        bq.add(new BooleanClause(new TermQuery(idTerm), Occur.MUST));
        writer.deleteDocuments(bq);
    }

Maybe all my problems started from here... When I have some time, I'll try to reproduce using a different uniqueKey field with overwriteDupes turned back on, to see if the problem came from the signature field being the same as the uniqueKey field *and* having overwriteDupes on. If so, maybe a simple configuration check should be performed to avoid the issue. Otherwise it means that having overwriteDupes turned on simply doesn't scale, and that should be added to the wiki's Deduplication page, IMHO.

Thank you again. Regards, -- Tanguy

On 31/05/2011 14:58, lee carroll wrote:
Tanguy, you might have tried this already, but can you set overwriteDupes to false and set the signature key to be the id? That way Solr will manage updates. From the wiki: http://wiki.apache.org/solr/Deduplication
HTH, Lee

On 30 May 2011 08:32, Tanguy Moal wrote:
Hello, sorry for re-posting this, but it seems my message got lost in the mailing list's message stream without catching anyone's attention... =D Shortly: has anyone already experienced dramatic indexing slowdowns during large bulk imports with overwriteDupes turned on and a fairly high duplicates rate (around 4-8x)?
It seems to produce a lot of deletions, which in turn appear to make the merging of segments pretty slow, by fairly increasing the number of little read operations occurring simultaneously with the regular large write operations of the merge. Added to the poor IO performance of a commodity SATA drive, indexing takes ages. I temporarily bypassed that limitation by disabling the overwriting of duplicates, but that changes the way I query the index, requiring me to turn on field collapsing at search time. Is this a known limitation? Does anyone have a few hints on how to optimize the handling of index-time deduplication? More details on my setup and the state of my understanding are in my previous message hereafter. Thank you very much in advance. Regards, Tanguy

On 05/25/11 15:35, Tanguy Moal wrote:
Dear list, I'm posting here after some unsuccessful investigations. In my setup I push documents to Solr using the StreamingUpdateSolrServer. I'm sending a comfortable initial amount of documents (~250M) and wished to perform overwriting of duplicated documents at index time, during the update, taking advantage of the UpdateProcessorChain. At the beginning of the indexing stage, everything is quite fast; documents arrive at a rate of about 1000 doc/s. The only extra processing during the import is the computation of a couple of hashes that are used to uniquely identify documents given their content, using both stock (MD5Signature) and custom (derived from Lookup3Signature) update processors. I send a commit command to the server every 500k documents sent. During a first period, the server is CPU bound. After a short while (~10 minutes), the rate at which documents are received starts to fall dramatically, the server becoming IO bound. I first thought of a normal speed decrease during the commit, while my push client waits for the flush to occur. That would have been a normal slowdown.
What caught my attention was that, unexpectedly, the server was performing a lot of small reads, way more than the number of writes, which seem to be larger. The combination of the many small reads with the constant amount of bigger writes seems to create a lot of IO contention on my commodity SATA drive, and the ETA of my built index started to increase scarily =D I then restarted the JVM with JMX enabled so I could investigate a little bit more. I then realized that the UpdateHandler was performing many reads while processing the update request. Are there any known limitations around the UpdateProcessorChain when overwriteDupes is set to true? I turned it off, which of course breaks the intent of my built index, but for comparison purposes it's good. That did the trick: indexing is fast again, even with the periodic commits. I therefore have two questions, an interesting first one and a boring second one:
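Lee's suggestion quoted above - overwriteDupes=false with the signature written into the uniqueKey field itself - would look roughly like this in solrconfig.xml; the field names and signature implementation are illustrative:

```xml
<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <!-- write the signature into the uniqueKey field itself, so a
         duplicate becomes a plain overwrite-by-id instead of a
         delete-by-query, avoiding the extra IO discussed here -->
    <str name="signatureField">id</str>
    <bool name="overwriteDupes">false</bool>
    <str name="fields">content</str>
    <str name="signatureClass">solr.processor.MD5Signature</str>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```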
"Virtual field", Statistics
Dear solr-user folks,

I would like to use the stats module to perform very basic statistics (mean, min and max), which is actually working just fine. Nevertheless, I found a little limitation that bothers me a tiny bit: how to perform the exact same statistics, but on the result of a function query rather than on a field.

Example schema:
- string : id
- float : width
- float : height
- float : depth
- string : color
- float : price

What I'd like to do is something like:

select?q=price:[45.5 TO 99.99]&stats=on&stats.facet=color&stats.field={volume=product(product(width, height), depth)}

I would expect to obtain the usual stats response (min, max, mean and so on, faceted by color), computed over the virtual "volume" field.

Of course computing the volume can be performed before indexing the data, but defining virtual fields on the fly given an arbitrary function is powerful, and I am confident that many others would appreciate it. Especially for BI needs and so on... :-D

Is there a way to do this easily that I haven't been able to find, or is it actually impossible?

Thank you very much in advance for your help.

-- Tanguy
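Until stats over arbitrary functions is supported, the interim approach mentioned above (precomputing the product at indexing time) leads to a standard stats request; the "volume" field name is a hypothetical addition to the schema, and host/port are assumptions:

```text
http://localhost:8983/solr/select?q=price:[45.5 TO 99.99]
    &stats=on&stats.field=volume&stats.facet=color
```

The trade-off is that each derived quantity must be decided before indexing, which is exactly the limitation the message describes.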
Re: "Virtual field", Statistics
Hello Lance, thank you for your reply. I created the following JIRA issue, as suggested: https://issues.apache.org/jira/browse/SOLR-2171. Can you tell me how new issues are handled by the development team, and whether there's a way I could help/contribute?

-- Tanguy

2010/10/16 Lance Norskog:
> Please add a JIRA issue requesting this. A bunch of things are not
> supported for functions: returning as a field value, for example.
>
> On Thu, Oct 14, 2010 at 8:31 AM, Tanguy Moal wrote:
>> Dear solr-user folks,
>>
>> I would like to use the stats module to perform very basic statistics
>> (mean, min and max), which is actually working just fine.
>>
>> Nevertheless I found a little limitation that bothers me a tiny bit:
>> how to perform the exact same statistics, but on the result of a
>> function query rather than a field.
>>
>> Example schema:
>> - string : id
>> - float : width
>> - float : height
>> - float : depth
>> - string : color
>> - float : price
>>
>> What I'd like to do is something like:
>> select?q=price:[45.5 TO 99.99]&stats=on&stats.facet=color&stats.field={volume=product(product(width, height), depth)}
>> I would expect to obtain the usual stats response (min, max, mean and
>> so on, faceted by color), computed over the virtual "volume" field.
>>
>> Of course computing the volume can be performed before indexing data,
>> but defining virtual fields on the fly given an arbitrary function is
>> powerful, and I am confident that many others would appreciate it.
>> Especially for BI needs and so on... :-D
>> Is there a way to do it easily that I would have not been able to
>> find, or is it actually impossible?
>>
>> Thank you very much in advance for your help.
>>
>> -- Tanguy
>
> --
> Lance Norskog
> goks...@gmail.com
Re: [Wildcard query] Weird behaviour
Thank you very much, Robert, for replying so fast and accurately. I effectively have another idea in mind to provide similar suggestions less expensively; I was balancing between the work-around and the report-an-issue options. I don't regret it, since you came up with a possible fix. I'll give it a try as soon as possible and let the list know.

Regards, Tanguy

2010/12/3 Robert Muir:
> Actually, I took a look at the code again, at the queries you mentioned:
> "I send queries to that field in the form (*term1*term2*)"
>
> I think the patch will not fix your problem... The only way I know you
> can fix this would be to upgrade to lucene/solr trunk, where wildcard
> comparison is linear in the length of the string.
>
> In all other versions, it has a much worse runtime, and that's what you
> are experiencing.
>
> Separately, even better than this would be to see if you can index
> your content in a way that avoids these expensive queries. But this is
> just a suggestion; what you are doing should still work fine.
>
> On Fri, Dec 3, 2010 at 6:56 AM, Robert Muir wrote:
>> On Fri, Dec 3, 2010 at 6:28 AM, Tanguy Moal wrote:
>>> However suddenly CPU usage simply doubles, and sometimes eventually
>>> starts using all 16 cores of the server, whereas the number of handled
>>> requests is pretty stable, and even starts decreasing because of
>>> degraded user experience due to dramatic response times.
>>>
>>
>> Hi Tanguy: this was fixed here:
>> https://issues.apache.org/jira/browse/LUCENE-2620.
>>
>> You can apply the patch file there
>> (https://issues.apache.org/jira/secure/attachment/12452947/LUCENE-2620_3x.patch)
>> and recompile your own lucene 2.9.x, or you can replace the lucene jar
>> file in your solr war with the newly released lucene-2.9.4 core jar...
>> which I think is due to be released later today!
>>
>> Thanks for spending the time to report the problem... let us know if the
>> patch/lucene 2.9.4 doesn't fix it!
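Regarding Robert's suggestion to index the content in a way that avoids (*term1*term2*)-style queries: one common approach is to index character n-grams, so infix matches become plain term matches at the cost of a larger index. A sketch of such a field type; the gram sizes are assumptions to tune for the data:

```xml
<fieldType name="text_ngram" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- every 3..15 character substring becomes its own indexed term -->
    <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

A query for "term1" then hits the stored grams directly, without any wildcard expansion.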
Re: Autosuggest terms which GOOGLE uses?
Kind of: their suggestions are based on users' queries, with some filtering. You can have a little read here: http://www.google.com/support/websearch/bin/answer.py?hl=en&answer=106230 - they perform "little" filtering to remove offending content such as "hate speech, violence and pornography" (quoting the page).

You can also have a look at this slideshow: http://www.slideshare.net/sturlese/use-ofsolrattrovitclassifiedads-marcsturlese . You'll see how they built their suggest service using a dedicated Solr instance.

Hope this helps ;-)
-- Tanguy

2010/12/8 Anurag:
>
> How does Google select the autosuggest terms? Does Google use "Users'
> Queries" from log files to suggest only those terms?
>
> -
> Kumar Anurag
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Autosuggest-terms-which-GOOGLE-uses-tp2039078p2039078.html
> Sent from the Solr - User mailing list archive at Nabble.com.
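For the "dedicated Solr instance" approach mentioned above, prefix suggestions are typically served from a field indexed with edge n-grams, so that every query prefix is itself an indexed term. A sketch of such a field type; the gram sizes are assumptions:

```xml
<fieldType name="suggest" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- "solr" is indexed as s, so, sol, solr -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Logged user queries would be indexed into such a field (after filtering), and each keystroke issues a plain term query against it.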
Re: Google like search
Hi Satya, I think what you're looking for is called "highlighting", in the sense of highlighting the query terms in their matching context. You could start by googling "solr highlight"; surely the first results will make sense. Solr's wiki results are usually a good entry point: http://wiki.apache.org/solr/HighlightingParameters . Maybe I misunderstood your question, but I hope that'll help... Regards, Tanguy 2010/12/14 satya swaroop: > Hi All, > Can we get results like Google's, with some data about the > search... I was able to get the data that is the first 300 characters of a > file, but it is not helpful for me. Can I get the data that contains the > first found key in that file? > > Regards, > Satya >
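A minimal highlighting request could look like the following; the field name "content" and the snippet parameters are assumptions to adapt to the actual schema:

```text
http://localhost:8080/solr/select?q=java
    &hl=on&hl.fl=content&hl.snippets=3&hl.fragsize=100
```

This returns, for each matching document, up to three ~100-character fragments of the content field with the matched terms wrapped in <em> tags, instead of the fixed first-300-characters excerpt.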
Re: Google like search
Satya,

In fact the highlighter will select the relevant part of the whole text and return it with the matched terms highlighted. If you do so for a whole book, you will face the issue spotted by Dave (too long a text). To address that issue, you have the possibility to split your book into chapters, and index each chapter as a separate document. You would then be interested in adding a field to identify each book uniquely (using the ISBN number, for example) and turning on grouping (or collapsing) on that field... (see this very good blog post: http://blog.jteam.nl/2009/10/20/result-grouping-field-collapsing-with-solr/ )

Moreover, you might be interested in the following JIRA issue: https://issues.apache.org/jira/browse/SOLR-2272 . Using this patch, you could for example ensure that if a given chapter-document is selected by the query, then another (or several) document(s) (maybe a parent "book" document, or all the other chapters) get selected along the way (by doing a self-join on the ISBN number). Here again, grouping afterwards would return one group of documents per book.

Good luck!
-- Tanguy

2010/12/14 Dave Searle:
> Highlighting is exactly what you need, although if you highlight the whole
> book, this could slow down your queries. Index/store the first 5000-1
> characters and see how you get on
>
> -Original Message-
> From: satya swaroop [mailto:satya.yada...@gmail.com]
> Sent: 14 December 2010 10:08
> To: solr-user@lucene.apache.org
> Subject: Re: Google like search
>
> Hi Tanguy,
> I am not asking for highlighting.. I think it can be
> explained with an example. Here I illustrate it:
>
> When I post a query like this:
>
> http://localhost:8080/solr/select?q=Java&version=2.2&start=0&rows=10&indent=on
>
> I get a result as follows (abridged):
>
> Java%20debugging.pdf
> 122
> Table of Contents
> If you're viewing this document online, you can click any of the topics
> below to link directly to that section.
> 1. Tutorial tips 2
> 2. Introducing debugging 4
> 3. Overview of the basics 6
> 4. Lessons in client-side debugging 11
> 5. Lessons in server-side debugging 15
> 6. Multithread debugging 18
> 7. Jikes overview 20
>
> Here the str field contains the first 300 characters of the file, as I kept a
> field to copy only 300 characters in schema.xml...
> But I don't want the content like this. Is there any way to make an output as
> follows:
>
> Java is one of the best languages, java is easy to learn...
>
> where this content is at the start of the chapter where the first occurrence
> of the word "java" appears in the file...
>
> Regards,
> Satya
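Grouping the chapter-documents back into books, as suggested above, would look roughly like this; the isbn field name is an assumption about the schema, and this requires a Solr with field collapsing support:

```text
http://localhost:8080/solr/select?q=java
    &group=true&group.field=isbn&group.limit=1
    &hl=on&hl.fl=content
```

Each group then represents one book, carrying the best-matching chapter with its highlighted snippet.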
Re: Google like search
To do so, you have several possibilities; I don't know if there is a best one. It depends pretty much on the format of the input file(s), your affinities with a given programming language, the libraries you might need, and the time you're ready to spend on this task. Consider having a look at SolrJ (http://wiki.apache.org/solr/Solrj) or at the DataImportHandler (http://wiki.apache.org/solr/DataImportHandler).

Cheers,
-- Tanguy

2010/12/14 satya swaroop :
> Hi Tanguy,
>     Thanks for your reply. Sorry to ask this type of question, but
> how can we index each chapter of a file as a separate document? As far as I know,
> we just give the path of a file to Solr to index it... Can you provide me any
> sources for this... I mean any blogs or wikis...
>
> Regards,
> satya
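The thread suggests SolrJ or the DataImportHandler; as a rough language-agnostic sketch in Python, splitting a book into chapter documents boils down to building one document per chapter and posting the batch to the update handler. The field names (`id`, `isbn`, `chapter`, `text`) and the endpoint are assumptions for illustration; they must match your schema.xml:

```python
import json

def chapters_to_docs(isbn, chapters):
    """Turn a list of chapter texts into one Solr document per chapter.

    Field names (id, isbn, chapter, text) are illustrative; they must
    match whatever is declared in schema.xml.
    """
    return [
        {"id": "%s-%d" % (isbn, i), "isbn": isbn, "chapter": i, "text": body}
        for i, body in enumerate(chapters, start=1)
    ]

docs = chapters_to_docs("978-0000000000", ["Intro ...", "Debugging basics ..."])
# JSON body for a POST to an update endpoint such as /solr/update/json (assumed):
payload = json.dumps(docs)
print(len(docs), docs[0]["id"])
```

Grouping on the `isbn` field at query time then reassembles the chapters into one group per book.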
Re: PHPSolrClient
Hi Dennis,

This is not particular to the client you use (solr-php-client) for sending documents: think of an update as an overwrite. This means that if you update a particular document, the previously indexed version is lost. Therefore, when updating a document, make sure that all the fields to be indexed and retrieved are present in the update. For an update to occur, only the uniqueKey field (as specified in your schema.xml) has to be the same as in the document you want to update. In short, an update is like an add (and is performed the same way), except that the added document was previously indexed; it simply gets replaced by the update.

Hope that helps,
-- Tanguy

2010/12/16 Dennis Gearon :
> First of all, it's a very nice piece of work.
>
> I am just getting my feet wet with Solr in general, so I'm not even sure how a
> document is NORMALLY deleted.
>
> The library PHPDocs say 'add', 'get', 'delete', but does anyone know about
> 'update'?
> (obviously one can read-delete-modify-create)
>
> Dennis Gearon
>
> Signature Warning
> It is always a good idea to learn from your own mistakes. It is usually a better
> idea to learn from others' mistakes, so you do not have to make them yourself.
> from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'
>
> EARTH has a Right To Life,
> otherwise we all die.
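The overwrite semantics described above can be illustrated with a toy model (this is not actual Solr code, just a sketch of the behaviour): the index is keyed by the uniqueKey, and an "update" is an add that replaces the previous document wholesale, so fields absent from the update are lost.

```python
# Toy model of Solr's update-as-overwrite semantics (not actual Solr code).
index = {}

def add(doc, unique_key="id"):
    # An add with an existing uniqueKey replaces the prior version entirely.
    index[doc[unique_key]] = doc

add({"id": "42", "title": "First version", "author": "Dennis"})
add({"id": "42", "title": "Second version"})  # note: no "author" field

print(index["42"])  # the author field is gone
```

This is why every update must resend all stored/indexed fields, not just the changed ones.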
Re: Tuning StatsComponent
Hello,

You could try taking advantage of Solr's faceting feature: provided that the amount is stored in the amount field and the currency in the currency field, try the following request:

http://host:port/solr/select?q=YOUR_QUERY&stats=on&stats.field=amount&f.amount.stats.facet=currency&rows=0

You'll get a top-level stats node with meaningless numbers (because of the mixed currencies), but below it you'll have one stats node per currency.

Alternatively, you could index the different currencies in separate fields (e.g. amount_usd, amount_eur, ...) and send your queries this way:

http://host:port/solr/select?q=amount_usd:*+OR+amount_eur:*[+OR+amount_...:*]&stats=on&stats.field=amount_usd&stats.field=amount_eur[&stats.field=amount_...]&rows=0

That way, in one query you'll get everything you want, except that you can't trust the "missing" count for each sum computed. Maybe your query isn't a "select all" one, in which case you should get results even faster.

Hope that helps a little...
-- Tanguy

2011/1/10 stockii
> Hello.
>
> I'm using the StatsComponent to get the sum of amounts, but Solr's
> StatsComponent is very slow on a huge index of 30 million documents. How can
> I tune the StatsComponent?
>
> The problem is that I have 5 currencies and I need to send a new request for
> each currency. That sometimes makes the Solr search very slow. =(
>
> any ideas?
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Tuning-StatsComponent-tp2225809p2225809.html
> Sent from the Solr - User mailing list archive at Nabble.com.
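Spelled out parameter by parameter, the first request suggested above is (amount and currency being the field names assumed in the message):

```
q=YOUR_QUERY
stats=on
stats.field=amount              # compute stats (sum, min, max, ...) on amount
f.amount.stats.facet=currency   # one stats block per currency value
rows=0                          # no documents needed, just the stats
```

The per-field `f.<field>.stats.facet` syntax is what produces the per-currency breakdown in a single request.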
Re: best way for sum of fields
Hi,

If you only need to sum over "displayed" results, go with post-processing of the hits; that's fast and easy. If you sum over the whole data set (i.e. your sum is not query dependent), have it computed at indexing time, depending on your indexing workflow.

Otherwise (sum over the whole result set, query dependent but independent of the displayed results), you should give sharding a try... You generally want that when your index is too large to be searched quickly (see http://wiki.apache.org/solr/DistributedSearch); here the sum operation is part of a search query.

Basically what you need is:
- on the master host: n master instances (each being a shard);
- on the slave host: n slave instances (each being a replica of its master-side counterpart).

Only the slave instances need a comfortable amount of RAM in order to serve queries rapidly. Slave instances can be deployed over several hosts if the total amount of RAM required is high.

Your main effort here might be in finding the value of n. You have 45M documents in a single shard, and that may be the cause of your issue, especially for queries returning a high number of results. You may need to split it into more shards to achieve your goal. This should enable you to reduce the time needed to perform the sum operation at search time (but it adds complexity at indexing time: you need to define a way to send documents to shard #1, #2, ..., or #n).

If you keep getting more and more documents over time, maybe you'll want a fixed maximum shard size (say 5M docs, if performing the sum on 5M docs is fast enough) and simply add shards as required when more documents are to be indexed/searched. This also addresses the importing issue, because you'll simply need to change the target shard every 5M documents. The last shard is always the smallest.
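The fixed-size shard routing described above ("change the target shard every 5M documents") can be sketched like this; numeric, ever-increasing document ids and the shard naming are assumptions:

```python
# Route documents to fixed-size shards of 5M docs each, matching the
# "change the target shard every 5M documents" scheme described above.
SHARD_SIZE = 5_000_000

def shard_for(doc_id):
    # Ids 0..4,999,999 land on shard1, the next 5M on shard2, and so on;
    # the last shard is always the one still filling up.
    return "shard%d" % (doc_id // SHARD_SIZE + 1)

print(shard_for(0))           # shard1
print(shard_for(4_999_999))   # shard1
print(shard_for(5_000_000))   # shard2
print(shard_for(45_000_000))  # shard10
```

Any deterministic function of the uniqueKey works; the fixed-size scheme just has the nice property that old shards never change once full.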
Such sharding can involve a little overhead at search time: make sure you don't allow retrieval of "far" documents (start=k, where k is high -- see http://wiki.apache.org/solr/DistributedSearch#Distributed_Searching_Limitations). When using the stats component, have the start and rows parameters set to 0 if you don't need the documents themselves.

After that, if you face high search load, you can still duplicate the slave host to match your load requirements and load-balance your search traffic over the slaves as required.

Hope this helps,
Tanguy

On 07/11/2011 09:49, stockii wrote:
Sorry, I need the sum of the values of the found documents, e.g. the total amount of one day. Each doc in the index has its own amount. I tried something with the StatsComponent, but with 48 million docs in the index it's too slow.
-
---
System: one server, 12 GB RAM, 2 Solr instances, 8 cores, 1 core with 45 million documents, other cores < 200,000
- Solr1 for search requests - commit every minute - 5GB Xmx
- Solr2 for update requests - delta every minute - 4GB Xmx
--
View this message in context:
http://lucene.472066.n3.nabble.com/best-way-for-sum-of-fields-tp3477517p3486406.html
Sent from the Solr - User mailing list archive at Nabble.com.
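A distributed stats request over such shards could look like the following sketch; host names and field names are placeholders:

```
http://slavehost:8983/solr/select?q=YOUR_QUERY
    &shards=shard1host:8983/solr,shard2host:8983/solr,shard3host:8983/solr
    &stats=on&stats.field=amount
    &start=0&rows=0
```

`shards` is the standard distributed-search parameter; `start=0&rows=0` keeps the response down to the stats, avoiding the far-document retrieval overhead mentioned above.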
Re: best way for sum of fields
Hi again,

Since you have a custom high-availability solution over your Solr instances, I can't help much, I guess... :-)

I usually rely on master/slave replication to separate the index-build and index-search processes. Resource consumption at build time and at search time are not necessarily the same, and therefore the hardware can be dimensioned for each as required. I like to have the service-related processes isolated and easy to deploy wherever needed, just in case things go wrong; hardware failures occur. Build services, on the other hand, don't have the same availability constraints and can be off for a while without issue (unless near-real-time indexing comes into play, but that's another matter).

In a slave configuration, the index doesn't need to commit. It simply replicates its data from its associated master whenever the master changes and performs a reopen of the searcher. "Change" events can be triggered at commit, startup and/or optimize time (see http://wiki.apache.org/solr/SolrReplication , although you seemed not to be interested in this feature :) ).

Having search and build on the same host is not necessarily bad; it simply depends on the available resources and the build vs. service load requirements. For example, with a big core such as the one you have, segment merging can occur from time to time, an operation that is IO bound (i.e. its duration depends on disk performance). Under high IO load a server can become less responsive, and at that point having the service separated from the build can come in handy.

As you see, I can't tell you what makes sense and what doesn't. It all depends on what you're doing, at which frequency, etc. :-)

Regards,
Tanguy

On 07/11/2011 12:12, stockii wrote:
hi, thanks for the big reply ;) I had the idea of several small 5M shards too, and I think that's the next step I have to take, because our biggest index grows by an average of 50K documents each day.
But does it make sense to keep searcher AND updater cores on one big server? I don't want to use replication, because it is not possible with our own high-availability solution. My system is split into searcher and updater cores, each with its own index. Some search requests go over all of these 8 cores with distributed search.
-
---
System: one server, 12 GB RAM, 2 Solr instances, 8 cores, 1 core with 45 million documents, other cores < 200,000
- Solr1 for search requests - commit every minute - 5GB Xmx
- Solr2 for update requests - delta every minute - 4GB Xmx
--
View this message in context:
http://lucene.472066.n3.nabble.com/best-way-for-sum-of-fields-tp3477517p3486652.html
Sent from the Solr - User mailing list archive at Nabble.com.
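For reference, the master/slave replication discussed above is configured through the ReplicationHandler in solrconfig.xml; a minimal sketch (host, poll interval and confFiles list are placeholders to adapt):

```xml
<!-- master side: publish the index (and config files) after each commit -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="replicateAfter">startup</str>
    <str name="confFiles">schema.xml,stopwords.txt</str>
  </lst>
</requestHandler>

<!-- slave side: poll the master and reopen the searcher on changes -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://masterhost:8983/solr/replication</str>
    <str name="pollInterval">00:00:60</str>
  </lst>
</requestHandler>
```

`replicateAfter` accepts commit, startup and optimize, matching the "change" events mentioned in the reply.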
Core reload vs servlet container restart
Dear list,

I've experienced a weird (unexpected?) behaviour concerning core reload on a master instance.

My setup: master/slave on separate hosts. On the master, I update the schema.xml file, adding a dynamic field of type random sort field. I reload the master using the core admin. The new field is *not* taken into account. I restart the servlet container (Jetty in my case). The new field is now taken into account, and I can perform random sort operations.

On the slave side, no problem: at startup the schema.xml was replicated, the core was reloaded, and I was able to perform random sorts as well.

Now the question is: what was wrong with the core reload on the master? The output it gave me was something like: "sort param field can't be found : ${fieldName}". At that point, the admin/schema view did show the freshly added dynamic field. I had to restart Jetty (not a big issue here, but just to be sure).

Thanks!
-- Tanguy
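The core reload mentioned above goes through the CoreAdmin handler, e.g. (host and core name are placeholders):

```
http://masterhost:8983/solr/admin/cores?action=RELOAD&core=core0
```

`action=RELOAD` is the standard CoreAdmin action; it is expected to reread schema.xml and solrconfig.xml, which is why the behaviour described above looks like a bug.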