Re: improving score of result set
Interesting, but not exactly what I want. If I group items then I will get a small number of docs. I don't want this; I need all of them. Best Regards Alexander Aristov

On 29 October 2012 12:05, yunfei wu yunfei...@gmail.com wrote: Besides changing the scoring algorithm, what about Field Collapsing - http://wiki.apache.org/solr/FieldCollapsing - to collapse the results from the same website URL? Yunfei

On Mon, Oct 29, 2012 at 12:43 AM, Alexander Aristov alexander.aris...@gmail.com wrote: Hi everybody, I have a question about scoring calculation algorithms and approaches. Let's say I have 10 documents. 8 of them come from one web site (I have a field in the schema with the URL) and the other 2 from two other web sites, so in this example I have 3 web sites. For some queries those 8 documents have better term matching and they appear at the top of the results: the 8 docs from one source come first and the other two come last. I want to artificially improve the score of those 2 docs and move them up. They don't necessarily have to go first; if they land in the middle of the result set, that would be perfect. One idea is to reduce the score of docs that come from a single site, so that if the result set contains too many docs from one source the score of each of those docs is reduced proportionally. The important thing is that I don't want to reduce doc scores permanently, only at query time. Maybe function queries can help me? How can I do this, or are there other ideas? Best Regards Alexander Aristov
Re: improving score of result set
I think I get it the right way. Referring back to my example, I will get 3 groups: a large group with 8 documents in it and two other groups with one document each. If I limit a group to 5 docs then the 1st group will have only 5 docs and the other two will still contain one doc, and the order (based on score) won't be different. Each document in the first group will have a higher score, won't it? Or is the document score in each group calculated relatively, so that top docs have similar scores? So this approach just limits the number of similar documents. Instead I want to keep all documents in the results but shuffle them appropriately. Best Regards Alexander Aristov

On 29 October 2012 15:55, Erick Erickson erickerick...@gmail.com wrote: I don't think you're reading the grouping right. When you use grouping, you get the top N groups, and within each group you get the top M scoring documents. So you can actually get _more_ documents back than in the non-grouping case, and your app can then intelligently intersperse them however you want. Best Erick
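A minimal sketch of what Erick's grouping suggestion could look like in SolrJ (the site field name, group sizes, and server URL are assumptions, not from the thread):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class GroupBySite {
        public static void main(String[] args) throws Exception {
            SolrQuery query = new SolrQuery("felt tree");
            query.set("group", "true");          // enable result grouping
            query.set("group.field", "site");    // one group per source site (assumed field)
            query.set("group.limit", "5");       // top 5 scoring docs within each group
            query.set("group.ngroups", "true");  // also report the total number of groups

            HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");
            QueryResponse rsp = server.query(query);
            // The app can then interleave docs from rsp.getGroupResponse()
            // however it wants, e.g. round-robin across sites.
        }
    }

Each group keeps its documents' normal scores, so the app sees both the per-site ranking and the full candidate set.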
Re: improving score of result set
Perhaps this is an XY problem. First of all, I don't have a site which I want to boost; all docs are equal. Secondly, I will explain what I have. I have 100 docs indexed. I do a query which returns 10 found docs, 8 of them from one site and 2 from other sites. I don't like the order. Technically the scores are good: I understand why these 8 docs go first - because they have better matching. But I don't like it. I want articles from smaller collections to somehow compete with the other docs. For other queries the situation can change and another site can produce more results; in that case I would lower that site. I've had a deep think and believe I can try grouping.

More insight into my problem: these 8 docs have similar text which matches the query, and that's why they all get a similar and relatively high score. For example, the docs have this text: 1. Red apple felt from tree 2. Blue apple felt from tree 3. Green apple felt from tree ... 8. Orange pineapple felt from tree 9. A boy felt suddenly ill. A tree was green. 10. Two pieces felt apart and never collapse. Family tree was rich. I query "felt tree". Docs 1-8 are from one site. I would like to make the scores of docs 9 and 10 higher. Grouping can help, but maybe there are other solutions. Alexander

On 29.10.2012 at 22:11, Chris Hostetter hossman_luc...@fucit.org wrote: You've mentioned that you want to improve the scores of these documents, but you haven't really given any specifics about when/how/why you want to improve the score in general -- i.e.: in this example you have a total of 10 docs, but how do you distinguish the 2 special docs from the 8 other docs? Is it because they are the only two docs with some specific field value, or is it just because they are in the smaller of two sets of documents if you partition on some field? If you added 100 more docs that were all in the same set as those two, would you want the other 8 documents to start getting boosted?

Let's assume that what you are trying to ask is: "I want to artificially boost the scores of documents when the 'site' field contains 'cnn.com'". A simple way to do that is just to add an optional clause to your query that matches on site:cnn.com so the scores of those documents will be increased, but make the main part of your query required... q=+(your main query) site:cnn.com Or if you use the dismax or edismax parsers there are special params (bq and/or boost) that help make this easy to split out... https://wiki.apache.org/solr/SolrRelevancyFAQ#How_can_I_increase_the_score_for_specific_documents

FWIW: this smells like an XY problem ... more details about your actual situation and end goal would be helpful... https://people.apache.org/~hossman/#xyproblem XY Problem: Your question appears to be an XY Problem ... that is: you are dealing with X, you are assuming Y will help you, and you are asking about Y without giving more details about the X so that we can understand the full issue. Perhaps the best solution doesn't involve Y at all? See Also: http://www.perlmonks.org/index.pl?node_id=542341 -Hoss
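Hoss's two variants, sketched with SolrJ (site:cnn.com is his illustrative example; the qf fields and boost factor are assumptions):

    import org.apache.solr.client.solrj.SolrQuery;

    public class BoostSketch {
        public static void main(String[] args) {
            // Option 1: required main query plus an optional scoring clause.
            SolrQuery q1 = new SolrQuery("+(felt tree) site:cnn.com");

            // Option 2: edismax with a separate additive boost query.
            SolrQuery q2 = new SolrQuery("felt tree");
            q2.set("defType", "edismax");
            q2.set("qf", "title body");      // assumed query fields
            q2.set("bq", "site:cnn.com^2");  // docs matching this score higher
        }
    }

In both cases the extra clause only raises the scores of matching docs; it never filters anything out.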
Re: improving score of result set
You absolutely follow my problem. I want to put the Obama doc from espn at the top just because it is an exceptional and probably interesting occurrence, and its score is low only because the content is long or there are no matches in the title.

On 29.10.2012 at 23:18, Chris Hostetter hossman_luc...@fucit.org wrote: You haven't really explained things enough for us to help you... : First of all I don't have a site which I want to boost. All docs are equal. [...] *Why* don't you like that order? What is it that makes you think that order is bad? You say you want the articles from the smaller collection to compete with the other docs -- but they already have. Unless part of your query included a clause that is biased in favor of one collection, all of those documents got a fair score for the query you passed in. It might help if you gave us a specific, concrete example of some *real* queries and the *real* documents they return, and why you don't think those scores are fair. Because if I'm following your reasoning, and thinking about a situation where I might have an index full of webpages, some from cnn.com and some from espn.com, then a query for Obama might match lots of pages from cnn.com with high scores, and there might be *one* match on espn.com with an extremely low score, because Obama is mentioned one time in some quote or something in a *very* long page ... in what situation would it make any sense to bias the score of that one espn.com document to make it score higher than other documents from cnn.com that legitimately score better because they mention Obama in the title, or many times in the body of the page? -Hoss
Re: query syntax to find ??? chars
It doesn't work, and I don't know why. :( Best Regards Alexander Aristov

On 11 July 2012 23:54, Yury Kats yuryk...@yahoo.com wrote: On 7/11/2012 2:55 PM, Alexander Aristov wrote: content:?? doesn't work :) I would try escaping them: content:\?\?\?\?\?\?
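If escaping by hand is fiddly, SolrJ ships a helper; a minimal sketch (the content field name is from the thread):

    import org.apache.solr.client.solrj.util.ClientUtils;

    public class EscapeQuery {
        public static void main(String[] args) {
            // Escapes ?, *, +, : and other query-syntax characters so they
            // are matched literally instead of acting as wildcards.
            String escaped = ClientUtils.escapeQueryChars("??????");
            String query = "content:" + escaped;  // content:\?\?\?\?\?\?
            System.out.println(query);
        }
    }

Note that if the field's analyzer (e.g. one based on StandardTokenizer) strips punctuation at index time, the ? characters are simply not in the index, and even a correctly escaped query will find nothing; that could be why the escaped form still doesn't work here.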
Re: null pointer error with solr deduplication
Hi, I might be wrong, but it's your responsibility to ensure unique doc IDs across shards. Read this page: http://wiki.apache.org/solr/DistributedSearch#Distributed_Searching_Limitations - particularly:

- Documents must have a unique key and the unique key must be stored (stored="true" in schema.xml)
- *The unique key field must be unique across all shards.* If docs with duplicate unique keys are encountered, Solr will make an attempt to return valid results, but the behavior may be non-deterministic.

So Solr behaves as it should :) _unexpectedly_. But I agree in the sense that there should be no error, especially not an NPE. Best Regards Alexander Aristov

On 21 April 2012 03:42, Peter Markey sudoma...@gmail.com wrote: Hello, I have been trying out deduplication in Solr by following: http://wiki.apache.org/solr/Deduplication. I have defined a signature field to hold the values of the signature created based on a few other fields in a document, and the idea seems to work like a charm in a single Solr instance. But when I have multiple cores and try to do a distributed search (http://localhost:8080/solr/core0/select?q=*&shards=localhost:8080/solr/dedupe,localhost:8080/solr/dedupe2&facet=true&facet.field=doc_id) I get the error pasted below. While normal search (with just q) works fine, the facet/stats queries seem to be the culprit. The doc_id contains duplicate ids since I'm testing the same set of documents indexed in both cores (dedupe, dedupe2). Any insights would be highly appreciated. Thanks

20-Apr-2012 11:39:35 PM org.apache.solr.common.SolrException log
SEVERE: java.lang.NullPointerException
    at org.apache.solr.handler.component.QueryComponent.mergeIds(QueryComponent.java:887)
    at org.apache.solr.handler.component.QueryComponent.handleRegularResponses(QueryComponent.java:633)
    at org.apache.solr.handler.component.QueryComponent.handleResponses(QueryComponent.java:612)
    at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:307)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1540)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:435)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:256)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:224)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:169)
    at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:472)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:168)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:98)
    at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:927)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:407)
    at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:987)
    at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:579)
    at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:307)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)
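One common way to satisfy the unique-across-all-shards rule is to namespace each document ID with the core it is indexed into; a sketch (the core names are from the thread, the separator and helper method are hypothetical):

    import org.apache.solr.common.SolrInputDocument;

    public class ShardSafeIds {
        // Prefix the local ID with the core name so that "12345" indexed
        // into dedupe and dedupe2 becomes two distinct unique keys.
        static SolrInputDocument makeDoc(String coreName, String localId) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", coreName + "!" + localId);  // e.g. "dedupe2!12345"
            return doc;
        }
    }

For the test described above, though, the duplicate values across dedupe and dedupe2 are intentional, so the fix is to make the uniqueKey field distinct even when the signature field is the same.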
Re: Solr is not extracting the CDATA part of xml
Hi, this is not the Solr format. You must re-format your XML into Solr XML; you may find examples on the Solr wiki or in the Solr examples dir. Best Regards Alexander Aristov

On 13 April 2012 23:13, srini softtec...@gmail.com wrote: Erick, thanks for your reply. When you say Solr does not index arbitrary XML documents: below is the way my XML document looks, sitting in Oracle. Could you suggest the best way of indexing it? Which method should I follow? Should I use XPathEntityProcessor?

<?xml version="1.0" encoding="UTF-8" ?>
<message xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="someurl" xmlns:csp="someurl.xsd" xsi:schemaLocation="somelocation jar:" id="002" message-type="create">
<content>
<dsp:row>
<dsp:channel>100</dsp:channel>
<dsp:role>115</dsp:role>
</dsp:row>
</body></content></message>

Thanks in advance

Erick Erickson wrote: Solr does not index arbitrary XML content. There is an XML form of a Solr document that can be sent to Solr, but it is a specific form of XML. An example of the XML you're trying to index and what you mean by "not working" would be helpful. Best Erick

On Fri, Apr 13, 2012 at 11:50 AM, srini <softtech88@...> wrote: Not sure why the CDATA part did not get interpreted. This is how the XML content looks (I added quotes just to present the exact XML content): <body></body>
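The Solr XML Alexander refers to wraps plain fields in <add><doc><field name="..."> elements rather than arbitrary tags. One way to produce it without hand-writing XML is SolrJ; a sketch (the channel/role field names are assumptions mapped from the dsp: elements, and would need matching schema.xml entries):

    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class IndexMessage {
        public static void main(String[] args) throws Exception {
            HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "002");       // from the message id attribute
            doc.addField("channel", "100");  // assumed field for dsp:channel
            doc.addField("role", "115");     // assumed field for dsp:role
            server.add(doc);
            server.commit();
        }
    }

XPathEntityProcessor in DIH is the other route: it walks the source XML with XPath expressions and maps matches onto schema fields, which suits documents already sitting in Oracle.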
Re: default operation for a field
Ok, got it, thanks. Best Regards Alexander Aristov

On 2 April 2012 16:37, Erick Erickson erickerick...@gmail.com wrote: You can't set the default operator for a single field. This implies you're using edismax? If that's the case, your app layer can massage the query to something like: term1 term2 term3 field_x:(term1 AND term2 AND term3). In which case field_x probably should not be in your qf parameter. Best Erick

On Mon, Apr 2, 2012 at 2:05 AM, Alexander Aristov alexander.aris...@gmail.com wrote: Hi, just curious whether it's possible to set the default operator for a single field rather than for the whole application. I have a field and I want it to always use the AND operator. Is that feasible? Users don't enter any operators for this field, only one term or several separated by spaces. But if the default operator is set to OR then the field doesn't work as I expect; I need only AND. Maybe another solution is possible? Best Regards Alexander Aristov
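A sketch of the app-layer rewrite Erick describes (field_x is his placeholder name):

    public class AndClause {
        // Joins whitespace-separated user terms into an all-AND clause
        // for a single field, leaving the rest of the query untouched.
        static String andClause(String field, String userInput) {
            String[] terms = userInput.trim().split("\\s+");
            return field + ":(" + String.join(" AND ", terms) + ")";
        }

        public static void main(String[] args) {
            // -> field_x:(term1 AND term2 AND term3)
            System.out.println(andClause("field_x", "term1 term2 term3"));
        }
    }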
Re: SolrCloud war?
!!! OFF TOPIC, sorry. I can't resist writing this: the subject is very intriguing because of the two meanings of the word "war". :) Best Regards Alexander Aristov

On 4 February 2012 01:50, Mark Miller markrmil...@gmail.com wrote: On Feb 3, 2012, at 1:04 PM, Darren Govoni wrote: I deployed each war app into the /solr context. I presume it's needed for remote URL addressing. Yup - but you can override this by setting the hostContext in solr.xml. It defaults to solr as that fits the example jetty distribution. - Mark Miller lucidimagination.com
Re: solr keep old docs
I have never developed for Solr yet and don't know much about the internals, but today I tried one approach with a searcher: in my update processor I get a searcher and search for the ID. It works, but I need to load test it. Will index traversal be faster (less resource-consuming) than search? Best Regards Alexander Aristov

On 29 December 2011 17:03, Erick Erickson erickerick...@gmail.com wrote: Hmmm, we're not communicating <G>... The update processor wouldn't search in the classic sense. It would just use lower-level index traversal to determine if the doc (identified by your unique key) was already in the index and skip indexing that document if it was. No real *searching* involved (see TermDocs.seek for one approach). The price would be that you are transmitting the document over to the Solr instance and then throwing it away. Best Erick

On Thu, Dec 29, 2011 at 12:52 AM, Mikhail Khludnev mkhlud...@griddynamics.com wrote: Alexander, I have two ideas for implementing fast dedupe externally, assuming your PKs don't fit in a java.util.*Map: your crawler can use an in-process RDBMS (Derby, H2) to track dupes; or, if your crawler is stateless - it doesn't track which PKs have already been crawled - you can retrieve them from Solr via http://wiki.apache.org/solr/TermsComponent . That's blazingly fast, but there might be a problem with removed documents (I'm not sure), and it can also lead to an OOMException (if you have too many PKs). Let me know if you need a workaround for one of these problems. If you choose internal dedupe (UpdateProcessor), please let me know if querying one-by-one turns out to be too slow for you and you need to do it page-by-page. I've done some such paging, and will do something similar soon, so I'm interested in it. Regards
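A sketch of Mikhail's TermsComponent idea - pulling the full set of indexed IDs so the crawler can skip them before sending anything (the /terms handler path and id field name are assumptions):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.TermsResponse;

    public class FetchIndexedIds {
        public static void main(String[] args) throws Exception {
            SolrQuery q = new SolrQuery();
            q.setRequestHandler("/terms");  // handler with TermsComponent wired in
            q.set("terms", "true");
            q.set("terms.fl", "id");        // enumerate the unique-key field
            q.set("terms.limit", "-1");     // no cap - beware memory on huge indexes

            HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");
            TermsResponse terms = server.query(q).getTermsResponse();
            // terms.getTerms("id") lists every indexed ID; anything in that
            // list can be dropped by the crawler before indexing.
        }
    }

As Mikhail warns, the term dictionary can still contain terms from deleted documents until segments merge, and the full ID list may not fit in memory.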
Re: solr keep old docs
Well, the first results are ready. I have implemented a custom update processor following your suggestion, using a low-level index reader and TermDocs. I launched scripts which add about 10,000 docs; indexing took about 1 minute including the commit, which is quite good for me. I don't have larger datasets, so I won't be able to check under heavier conditions. If someone is interested I can send over the jar file with my update processor. As I said, I am ready to contribute it to Solr but will get back to it in the New Year, after 10 Jan. Thanks everybody. Best Regards Alexander Aristov

On 29 December 2011 18:12, Erick Erickson erickerick...@gmail.com wrote: I'd guess it would be much faster, assuming that the search savings wouldn't be swamped by the additional transmission time over the wire and parsing the request (although SolrJ uses a binary format, so parsing the request probably isn't all that expensive). You could even do a hybrid approach: pack up all of the IDs you are about to update, send them to your special *request* handler and have your request handler respond with the documents that are already in the index... Hmmm, scratch all that. Start with just stringing together a long set of uniqueKeys and searching for them. Something like q=id:(1 2 47 09873)&fl=id The response should be a minimal set of data (just the IDs). Then you can remove each document ID returned from your next update. No custom Solr components required. Solr defaults to a maxBooleanClause count of 1024, so your packets should have fewer IDs than this, or you should bump that config setting. This should pretty much do what I was thinking of doing with custom code, without having to write anything. Best Erick
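A sketch of Erick's batch check (the id unique key is assumed; keep each batch under maxBooleanClauses):

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrDocument;

    public class BatchIdCheck {
        public static void main(String[] args) throws Exception {
            HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");
            List<String> batch = Arrays.asList("1", "2", "47", "09873");

            SolrQuery q = new SolrQuery("id:(" + String.join(" ", batch) + ")");
            q.setFields("id");        // fl=id: return only the key
            q.setRows(batch.size());

            Set<String> existing = new HashSet<String>();
            for (SolrDocument d : server.query(q).getResults()) {
                existing.add((String) d.getFieldValue("id"));
            }
            // Send only the docs whose IDs are not in 'existing'.
        }
    }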
Re: solr keep old docs
The problem with dedupe (SignatureUpdateProcessor) is that it REPLACES old docs. I have tried it already. Best Regards Alexander Aristov

On 28 December 2011 13:04, Lance Norskog goks...@gmail.com wrote: The SignatureUpdateProcessor is for exactly this problem: http://www.lucidimagination.com/search/link?url=http://wiki.apache.org/solr/Deduplication
Re: solr keep old docs
Thanks Erick, that gives me direction. I will write a new plugin and will get back to the dev forum with results, and then we will decide next steps. Best Regards Alexander Aristov

On 28 December 2011 18:08, Erick Erickson erickerick...@gmail.com wrote: Well, the short answer is that nobody else has 1> had a similar requirement AND 2> not found a suitable workaround AND 3> implemented the change and contributed it back. So, if you'd like to volunteer <G>. Seriously: if you think this would be valuable and are willing to work on it, hop on over to the dev list and discuss it, open a JIRA and make it work. I'd start by opening a discussion on the dev list before opening a JIRA, just to get a sense of where the snags would be in changing the Solr code, but that's optional. That said, writing your own update request handler that detects this case isn't very difficult: extend UpdateRequestProcessorFactory/UpdateRequestProcessor and use it as a plugin. Best Erick
Re: solr keep old docs
Unfortunately I have a lot of duplicates, and since searching might suffer I will try implementing an update processor. But your idea is interesting and I will consider it, thanks. Best Regards Alexander Aristov

On 28 December 2011 19:12, Tanguy Moal tanguy.m...@gmail.com wrote: Hello Alexander, I don't know much about your requirements in terms of size and performance, but I've had a similar use case and found a pretty simple workaround. If your duplicate rate is not too high, you can have the SignatureProcessor generate a fingerprint of each document (you already did that). Simply turn off overwriting of duplicates; you can then rely on Solr's grouping / field collapsing to group your search results by fingerprint. You'll then have one document group per real document. You can use group.sort to sort each group by indexing date ascending, and group.limit=1 to keep only the oldest one. You can even use group.format=simple to serve results as if no collapsing occurred, and use group.ngroups (/!\ could be expensive /!\) to get the real number of deduplicated documents. Of course the index will be larger; as I said, I made no assumptions regarding your operating requirements. And search can be a bit slower, depending on the average rate of duplicated documents. But you've got your issue addressed by configuration tuning only... Depending on your project's sizing, it could be time-saving. The advantage is that you keep the precious information of what content is duplicated from where :-) Hope this helps, -- Tanguy
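The query-side half of Tanguy's recipe as a sketch (the signature and index_date field names are assumptions; the params are the ones he lists):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class DedupeByGrouping {
        public static void main(String[] args) throws Exception {
            SolrQuery q = new SolrQuery("your query");
            q.set("group", "true");
            q.set("group.field", "signature");      // fingerprint field (assumed name)
            q.set("group.sort", "index_date asc");  // oldest duplicate wins (assumed field)
            q.set("group.limit", "1");              // keep one doc per fingerprint
            q.set("group.format", "simple");        // flat list, as if nothing collapsed
            q.set("group.ngroups", "true");         // real deduped count (can be costly)

            HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");
            QueryResponse rsp = server.query(q);    // read via rsp.getGroupResponse()
        }
    }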
Re: solr keep old docs
Yes, I have been warned that querying the index each time before adding a doc might be resource-consuming; I will check it. As for the overwrite parameter, I think the name is not the best then. People outside the business, like me, misread it and assume what I wrote: overwrite should mean what it says. But I understand what it does in fact, and so my way is to write a custom update processor plugin. Best Regards Alexander Aristov

On 28 December 2011 22:16, Chris Hostetter hossman_luc...@fucit.org wrote: : That said, writing your own update request handler : that detected this case isn't very difficult, : extend UpdateRequestProcessorFactory/UpdateRequestProcessor : and use it as a plugin. I can't find the thread at the moment, but the general issue that has caused people headaches with this type of approach in the past is that the performance of doing a query on every update (to see if the doc is already in the index) can slow things down quite a bit -- in your use case it may not be a significant bottleneck, but that's the general issue that has come up in the past. If you look at systems (like Nutch) that do large-scale crawling, they treat the crawl phase as independent from the indexing phase precisely for reasons like this -- so the crawler can dedup the documents (by unique URL) and eliminate duplication before ever adding them to the index. : I wonder why simple the overwrite parameter doesn't work here. ... : 2. overwrite=false and uniqueID exists then newer doc must be skipped : since old exists. That is not what overwrite=false does (or was ever designed to do). overwrite=false is a way to tell Solr that you are already certain the documents being added do not exist in the index, so Solr can save time by not attempting to overwrite an existing document. It is intended for situations where you are bulk-loading documents, i.e. doing an initial build of an index from a system of record (a single pass over a database that uses the same unique key) or importing documents from a new system of record with a completely different id space. -Hoss
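For completeness, this is how overwrite=false is passed from SolrJ - only safe when, as Hoss says, you already know the docs are not in the index (the server URL and document are placeholders):

    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.request.UpdateRequest;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.solr.common.params.UpdateParams;

    public class BulkLoad {
        public static void main(String[] args) throws Exception {
            HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");
            UpdateRequest req = new UpdateRequest();
            req.setParam(UpdateParams.OVERWRITE, "false");  // skip the overwrite lookup
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-1");
            req.add(doc);
            req.process(server);
            server.commit();
        }
    }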
Re: solr keep old docs
Hi, I am not using a database. All the needed data is in the Solr index; that's why I want to skip excessive checks. I will check DIH, but I'm not sure it helps. I am fluent in Java and it's not a problem for me to write a class or so, but I want to check first whether there are any ways (workarounds) to make it work without coding, just by playing around with configuration and params. I don't want to move away from the default Solr implementation. Best Regards Alexander Aristov

On 27 December 2011 09:33, Mikhail Khludnev mkhlud...@griddynamics.com wrote: On Tue, Dec 27, 2011 at 12:26 AM, Alexander Aristov alexander.aris...@gmail.com wrote: Hi people, I urgently need your help! I have Solr 3.3 configured and running. I do incremental indexing 4 times a day using bulk updates. Some documents are identical to some extent and I wish to skip them, not index them. But here is the problem: I could not find a way to tell Solr to ignore new duplicate docs and keep the old indexed docs. I don't care that a doc is new; just determine by ID that such a document is already in the index, and that's it. I use SolrJ for indexing. I have tried setting overwrite=false and the dedupe approach, but nothing helped me: either a newer doc overwrites the old one or I get a duplicate. I think it's a very simple and basic feature and it must exist. What did I do wrong, or not do?

I guess that's because the mainstream approach is delta-import, where you have updated timestamps in your DB and the last-import timestamp stored somewhere. You can check how it works in DIH.

I tried Google but couldn't find a solution there, although many people have encountered this problem.

It can definitely be done by overriding o.a.s.update.DirectUpdateHandler2.addDoc(AddUpdateCommand), but I suggest starting by implementing your own http://wiki.apache.org/solr/UpdateRequestProcessor - search for the PK and bypass the chain call if it's found. Then, if you meet performance issues querying your PKs one by one (but only after that), you can batch your searches; there are a couple of optimization techniques for huge disjunction queries like PK:(2 OR 4 OR 5 OR 6).

I've started considering that I must query the index to check whether a doc to be added is already there, and not add it to the array, but I have so many docs that I'm afraid it's not a good solution. Best Regards Alexander Aristov -- Sincerely yours, Mikhail Khludnev, Lucid Certified Apache Lucene/Solr Developer, Grid Dynamics
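A sketch of the UpdateRequestProcessor Mikhail suggests, which skips an add when the unique key is already indexed. API details shift between Solr versions; this follows the 3.x style, assumes the unique key field is named id, and only sees committed documents:

    import java.io.IOException;
    import org.apache.lucene.index.Term;
    import org.apache.solr.request.SolrQueryRequest;
    import org.apache.solr.response.SolrQueryResponse;
    import org.apache.solr.update.AddUpdateCommand;
    import org.apache.solr.update.processor.UpdateRequestProcessor;
    import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

    public class SkipExistingFactory extends UpdateRequestProcessorFactory {
        @Override
        public UpdateRequestProcessor getInstance(final SolrQueryRequest req,
                SolrQueryResponse rsp, UpdateRequestProcessor next) {
            return new UpdateRequestProcessor(next) {
                @Override
                public void processAdd(AddUpdateCommand cmd) throws IOException {
                    String id = (String) cmd.solrDoc.getFieldValue("id");
                    // getFirstMatch returns -1 when no committed doc has this key.
                    if (req.getSearcher().getFirstMatch(new Term("id", id)) == -1) {
                        super.processAdd(cmd);  // not found: pass down the chain
                    }
                    // found: silently drop the add, keeping the old document
                }
            };
        }
    }

The factory is then registered in an updateRequestProcessorChain in solrconfig.xml and named on the update handler.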
Re: solr keep old docs
I get docs from external sources and the only place I keep them is the Solr index. I have no database or other means to track indexed docs (my personal opinion is that it might be a huge headache). Some docs might change slightly in their original sources, but I don't need those changes; in fact I need the original data only. So I have no other way but to either check whether a document is already in the index before I put it into the SolrJ array (read: query Solr), or develop my own update chain processor, implement the ID check there, and skip such docs.

Maybe this is the wrong place to argue, and probably it's been discussed before, but I wonder why the simple overwrite parameter doesn't work here. In my opinion it suits perfectly; in combination with a unique ID it can cover all possible variants. Cases:
1. overwrite=true and uniqueID exists: the newer doc should overwrite the old one.
2. overwrite=false and uniqueID exists: the newer doc must be skipped, since the old one exists.
3. uniqueID doesn't exist: the newer doc just gets added, regardless of whether an old one exists.
Best Regards Alexander Aristov

On 27 December 2011 22:53, Erick Erickson erickerick...@gmail.com wrote: Mikhail is right. As far as I know, the assumption built into Solr is that duplicate IDs (when uniqueKey is defined) should trigger the old document to be replaced. What is your system-of-record? By that I mean: what does your SolrJ program do to send data to Solr? Is there any way you could just *not* send documents that are already in the Solr index, based on, for instance, a timestamp associated with your system-of-record and the last time you did an incremental index? Best Erick
solr keep old docs
Hi people, I urgently need your help! I have Solr 3.3 configured and running. I do incremental indexing 4 times a day using bulk updates. Some documents are identical to some extent and I wish to skip them, not index them. But here is the problem: I could not find a way to tell Solr to ignore new duplicate docs and keep the old indexed docs. I don't care that a doc is new; just determine by ID that such a document is already in the index, and that's it. I use SolrJ for indexing. I have tried setting overwrite=false and the dedupe approach, but nothing helped me: either a newer doc overwrites the old one or I get a duplicate. I think it's a very simple and basic feature and it must exist. What did I do wrong, or not do? I tried Google but couldn't find a solution there, although many people have encountered this problem. I've started considering that I must query the index to check whether a doc to be added is already there, and not add it to the array, but I have so many docs that I'm afraid it's not a good solution. Best Regards Alexander Aristov
solr ignore duplicate documents
People, I am asking for your help with Solr. When a document is sent to Solr and such a document already exists in its index (by its ID), the new doc replaces the old one. But I don't want documents replaced automatically; just ignore the new one and proceed to the next. How can I configure Solr to do so? Of course I can query Solr to check whether it already has the document, but that's bad for me since I do bulk updates, and it would complicate the process and increase the number of requests. So are there any ways to configure Solr to ignore duplicates? Just ignore them - I don't need any specific responses or actions. Best Regards Alexander Aristov
Re: solr upgrade question
I didn't get any responses, but I tried Luke 1.0.1 and it did the magic: I ran an optimization and after that Solr came up. Best Regards Alexander Aristov

On 30 March 2011 15:47, Alexander Aristov alexander.aris...@gmail.com wrote: People, is there a way to upgrade an existing index from Solr 1.4 to Solr 4 (trunk)? When I configured Solr 4 and launched it, it complained about an incorrect Lucene file version (3 instead of the old 2). Are there any procedures to convert the index? Best Regards Alexander Aristov
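What the Luke trick effectively does is rewrite every segment with a Lucene 3.x writer, producing files the trunk code can open. The same can be scripted directly against the index directory; a sketch assuming a Lucene 3.1-3.4 jar and a placeholder index path:

    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class RewriteIndex {
        public static void main(String[] args) throws Exception {
            FSDirectory dir = FSDirectory.open(new File("/path/to/solr/data/index"));
            IndexWriterConfig cfg = new IndexWriterConfig(
                    Version.LUCENE_31, new StandardAnalyzer(Version.LUCENE_31));
            IndexWriter w = new IndexWriter(dir, cfg);
            w.optimize();  // merges all segments, rewriting them in 3.x format
            w.close();
        }
    }

Take a backup first; once rewritten, the index can no longer be opened by Solr 1.4.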
solr upgrade question
People, is there a way to upgrade an existing index from Solr 1.4 to Solr 4 (trunk)? When I configured Solr 4 and launched it, it complained about an incorrect Lucene file version (3 instead of the old 2). Are there any procedures to convert the index? Best Regards Alexander Aristov