Custom Handler support in Solr-ruby
Hi, I found the solr-ruby gem (http://wiki.apache.org/solr/solr-ruby) really inflexible when it comes to specifying a handler. The Solr::Request::Select class hard-codes the handler as 'select', and all the other request classes inherit from it. Since the methods in Solr::Connection each use one of the Solr::Request classes, I don't see a direct way to use a custom handler (which I have written for MoreLikeThis). My current approach is to build the query URL myself, fetch it with curl, parse the response, and return it. Even if I were to extend the classes, I'd end up writing a new Solr::Request::CustomSelect, identical to Solr::Request::Select except that it lets the user supply a handler (defaulting to 'select'), and then deriving separate classes for DisMax and the rest from it. Isn't that too much overhead? Or am I missing something? Also, where can I file bugs against solr-ruby? *Pranav Prakash* temet nosce Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com | Google http://www.google.com/profiles/pranny
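The curl workaround described above can be sketched in a few lines. This is not the solr-ruby API, just a hedged illustration of building the query URL for a custom handler; the handler path ("mlt") and parameters are assumptions for the example, so substitute whatever path your handler is registered under in solrconfig.xml.

```python
from urllib.parse import urlencode

def custom_handler_url(base, handler, params):
    # Build "<base>/<handler>?<query string>" with proper URL escaping.
    # The handler name is whatever your solrconfig.xml registers, e.g. "mlt".
    return "%s/%s?%s" % (base.rstrip("/"), handler.lstrip("/"), urlencode(params))

url = custom_handler_url(
    "http://localhost:8983/solr",
    "mlt",  # hypothetical custom MoreLikeThis handler path
    {"q": "id:1234", "mlt.fl": "title,body", "wt": "ruby"},
)
print(url)
```

Fetching `url` with curl or Net::HTTP and parsing the `wt=ruby` response then matches the workaround described in the message.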
Include synonyms in solr
Hi, I am using Solr for my searches. I found a synonyms.txt file in which you can manually include synonyms for the words you want. But I suppose it would be very hard to include synonyms manually for each word, as my application has a lot of data. I want to know whether there is any way to generate this synonyms.txt file automatically, covering all dictionary words. - Thanks & Regards Romi -- View this message in context: http://lucene.472066.n3.nabble.com/Include-synonys-in-solr-tp3116836p3116836.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Include synonyms in solr
On Tue, Jun 28, 2011 at 12:54 PM, Romi romijain3...@gmail.com wrote: Hi, i am using solr for my searches. in this i found a synonyms.text file in which you can include synonyms manually for the words u want. Please see http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory No offence, but a simple Google search, or a search of the Wiki would have turned this up. Please try such simpler avenues before dashing off a message to the list. Regards, Gora
Re: Include synonyms in solr
Am 28.06.2011 09:24, schrieb Romi: But as i suppose it would be very hard to include synonyms manually for each word as my application has large data. I want to know is there any way that this synonym.text file generate automatically referring to all dictionary words I don't get the point here. Why should you want to add all dictionary words to the synonyms? To what shall they translate? Just having all words in synonyms.txt doesn't make much sense. If you're asking about some kind of translation into another language: In that case, you'd rather translate the text at index time and put it into another field which you query as well. In my last project, we had multi-valued fields like meta_description and misspelled, where you could add arbitrary synonyms for each document - maybe that's what you're asking for? -Kuli
Re: Analyzer creates PhraseQuery
You could add this filter after the NGram filter to prevent the phrase query creation: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.PositionFilterFactory Ludovic. - Jouve, France.
Find results with or without whitespace
I'm looking for a way to index/search terms that may or may not contain spaces. An example explains it best:
- Looking for healthcare, I want to find both healthcare and health care.
- Looking for health care, I want to find both health care and healthcare.
My other constraints are:
- I will index rather long strings (extracted from Office documents).
- I want to avoid synonym lists (as they may be incomplete).
- I want to avoid special-case logic (i.e. query rewriting with as many ORs as the search-term combinations require).
- I don't want to rely on an uppercase/lowercase tokenizer (as users are... creative).
I have already tried many tokenizer/filter combinations without success, and did not find any answer to this problem.
Re: multiple spatial values
Yonik Seeley-2-2 wrote: On Sat, Jun 25, 2011 at 5:56 AM, marthinal <jm.rodriguez.ve...@gmail.com> wrote: sfield, pt and d can all be specified directly in the spatial functions/filters too, and that will override the global params. Unfortunately one must currently use Lucene query syntax to do an OR; it just makes it look a bit messier. q=_query_:{!geofilt} _query_:{!geofilt sfield=location_2} -Yonik http://www.lucidimagination.com @Yonik it seems to work like this; I tried hundreds of other possibilities without success: q={!geofilt sfield=location_1 pt=36.62,-6.23 d=50}&fq={!geofilt sfield=location_2 pt=40.51,-5.91 d=500} Ah, right. I had thought you wanted docs that matched either geofilt (hence OR), not docs that only matched both. -Yonik http://www.lucidimagination.com Yes Yonik, what I do now is q={!geofilt sfield=location_1 pt=36.62,-6.23 d=50}&fq=_query_:{!geofilt sfield=location_2 pt=40.51,-5.91 d=500} other_filter:value ... I write the query here because maybe it helps someone who needs to do something like this.
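The combined query above has to be URL-escaped when sent over HTTP. A hedged sketch of assembling it (field names and coordinates are taken from the thread; wrapping the nested `_query_` clause in quotes is an assumption about the Lucene nested-query syntax, so verify it against your Solr version):

```python
from urllib.parse import urlencode

# Two geofilt clauses against two different location fields, as in the thread:
# q filters on location_1, fq on location_2.
params = {
    "q": "{!geofilt sfield=location_1 pt=36.62,-6.23 d=50}",
    "fq": '_query_:"{!geofilt sfield=location_2 pt=40.51,-5.91 d=500}"',
}
query_string = urlencode(params)  # escapes {, }, !, quotes, commas safely
print(query_string)
```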
Index Version and Epoch Time?
Hi, I am not sure what the index version value is. It looks like an epoch time, but in my case it points to one month back. However, I can see documents which were added last week in the index. Even after I did a commit, the index version did not change. Isn't it supposed to change on every commit? If not, is there a way to look up the last index time? Also, this page http://wiki.apache.org/solr/SolrReplication#Replication_Dashboard shows a Replication Dashboard. How is this dashboard invoked? Is there a URL which needs to be called? *Pranav Prakash*
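If the index version is indeed epoch milliseconds (an assumption worth verifying against your own commit times), converting it to a readable date is straightforward:

```python
from datetime import datetime, timezone

def version_to_datetime(index_version: int) -> datetime:
    # Assumption: the replication indexversion is an epoch timestamp in
    # milliseconds. Divide by 1000 to get seconds, interpret as UTC.
    return datetime.fromtimestamp(index_version / 1000, tz=timezone.utc)

# A value around 1309249930000 decodes to late June 2011.
print(version_to_datetime(1309249930000))
```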
Re: Include synonyms in solr
Please see http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory No offence, but a simple Google search, or a search of the Wiki, would have turned this up. Please try such simpler avenues before dashing off a message to the list. Gora, I have already read the document and have also included synonyms in my search results :) My question is: when I use this filter, <filter class="solr.SynonymFilterFactory" synonyms="syn.txt" ignoreCase="true" expand="false"/>, I need to enter the synonyms manually in synonyms.txt, which is really tough if you have many words needing synonyms. I wanted to ask whether there is any other option so that I need not enter synonyms manually. I hope you got my point :) - Thanks & Regards Romi
Re: Include synonyms in solr
I don't want to add all dictionary words to my synonyms.txt, but I do want to include synonyms for the words that occur in my data. As you can imagine, if I have, say, 1000 words, it would be very tough to enter synonyms for those 1000 words in synonyms.txt manually. I just want to know how I can solve this puzzle so that I need not enter the synonyms by hand. For example, for GB I am entering gigabyte; for ring I am entering the synonyms band, circle. - Thanks & Regards Romi
Re: Include synonyms in solr
Well, you need to find word lists and/or a thesaurus. This is one place to start: http://wordlist.sourceforge.net/ I used the US/UK English word list for the synonyms of an index I have, because it contains both US and UK English terms; the list lacks some medical terms, though, so we just added them. Cheers, François On Jun 28, 2011, at 6:55 AM, Romi wrote: i wanted to ask is there any other option so that i need not to enter synonyms manually..
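Once a word list or thesaurus is loaded into a mapping, emitting the synonyms.txt format is mechanical. A hedged sketch; the thesaurus dict here is a placeholder for whatever source you actually load (e.g. a list from wordlist.sourceforge.net):

```python
def synonyms_lines(thesaurus):
    # SynonymFilterFactory accepts comma-separated equivalence groups,
    # one group per line, e.g. "gb,gigabyte".
    lines = []
    for word, syns in sorted(thesaurus.items()):
        lines.append(",".join([word] + list(syns)))
    return lines

# Placeholder data; in practice this dict comes from your word list.
thesaurus = {"gb": ["gigabyte"], "ring": ["band", "circle"]}
for line in synonyms_lines(thesaurus):
    print(line)
# gb,gigabyte
# ring,band,circle
```

Writing the returned lines to synonyms.txt (one per line) gives a file the filter can consume; whether `expand` should be true or false depends on whether you want the groups treated as equivalent or mapped one way.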
Re: Find results with or without whitespace
I had the same problem: http://lucene.472066.n3.nabble.com/Results-with-and-without-whitespace-soccer-club-and-soccerclub-td2934742.html#a2964942
Re: Removing duplicate documents from search results
I also have the problem of duplicate docs. I am indexing news articles; every news article has a source URL. If two news articles have the same URL, only one needs to be indexed, i.e. removal of duplicates at index time. On 23 June 2011 21:24, simon mtnes...@gmail.com wrote: Have you checked out the deduplication process that's available at indexing time? It includes a fuzzy hash algorithm. http://wiki.apache.org/solr/Deduplication -Simon On Thu, Jun 23, 2011 at 5:55 AM, Pranav Prakash pra...@gmail.com wrote: This approach would definitely work if the two documents are *exactly* the same. But it is very fragile: even if one extra space has been added, the whole hash will change. What I am really looking for is some percentage similarity between documents, so I can remove those which are more than 95% similar. *Pranav Prakash* On Thu, Jun 23, 2011 at 15:16, Omri Cohen o...@yotpo.com wrote: What you need to do is calculate some hash (using any message digest algorithm you want: md5, sha-1 and so on), then do some reading on Solr's field collapse capabilities. Should not be too complicated.. *Omri Cohen* Co-founder @ yotpo.com | o...@yotpo.com | +972-50-7235198 | +972-3-6036295 My profiles: [image: LinkedIn] http://www.linkedin.com/in/omric [image: Twitter] http://www.twitter.com/omricohe [image: WordPress] http://omricohen.me Please consider your environmental responsibility. Before printing this e-mail message, ask yourself whether you really need a hard copy. IMPORTANT: The contents of this email and any attachments are confidential. They are intended for the named recipient(s) only. If you have received this email by mistake, please notify the sender immediately and do not disclose the contents to anyone or make copies thereof. -- Forwarded message -- From: Pranav Prakash pra...@gmail.com Date: Thu, Jun 23, 2011 at 12:26 PM Subject: Removing duplicate documents from search results To: solr-user@lucene.apache.org How can I remove very similar documents from search results? My scenario is that there are documents in the index which are almost identical (people submitting the same stuff multiple times, sometimes different people submitting the same stuff). When a search is performed for a keyword, the same document quite frequently comes up multiple times in the top N results. I want to remove those duplicate (or probable duplicate) documents, very much like what Google does when it says "In order to show you the most relevant results, duplicates have been removed." How can I achieve this functionality using Solr? Does Solr have anything built in, or a plugin, which could help me with it? *Pranav Prakash* -- Thanks and Regards Mohammad Shariq
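The ~95% similarity idea can be prototyped outside Solr. A sketch using Python's difflib; note this is not what Solr's Deduplication does internally (the wiki page describes a fuzzy-hash signature), and the pairwise comparison is O(n^2), so it will not scale to a large index:

```python
from difflib import SequenceMatcher

def near_duplicates(docs, threshold=0.95):
    """Return (i, j) index pairs of docs whose similarity ratio meets threshold.

    Pairwise comparison over all docs -- fine as a prototype, far too slow
    for a full index; a scalable version needs a signature/hash scheme like
    the one on the Solr Deduplication wiki page.
    """
    pairs = []
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if SequenceMatcher(None, docs[i], docs[j]).ratio() >= threshold:
                pairs.append((i, j))
    return pairs

docs = [
    "Solr is an open source search platform built on Lucene.",
    "Solr is an open source search platform built on Lucene!",  # near-duplicate
    "Completely different text about synonyms and tokenizers.",
]
print(near_duplicates(docs))  # → [(0, 1)]
```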
Re: Removing duplicate documents from search results
Create a hash from the URL and use that as the unique key; md5 or sha1 would probably be good enough. Cheers, François On Jun 28, 2011, at 7:29 AM, Mohammad Shariq wrote: I also have the problem of duplicate docs. I am indexing news articles, Every news article will have the source URL, If two news-article has the same URL, only one need to index, removal of duplicate at index time.
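A minimal sketch of the URL-hash signature suggested above; the light normalization step is an assumption (adjust it to how your source URLs actually vary):

```python
import hashlib

def url_key(url: str) -> str:
    # Normalize lightly so trivially different forms of the same URL collide;
    # lowercasing the whole URL is a simplification (paths can be
    # case-sensitive), so tune this to your data.
    normalized = url.strip().lower().rstrip("/")
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

print(url_key("http://example.com/story/123"))
print(url_key("http://example.com/story/123/") == url_key("HTTP://example.com/story/123"))  # → True
```

The 32-character hex digest can then serve as a deduplication signature field even when the uniqueKey itself is a UUID.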
Re: multiple spatial values
Will it be possible to do spatial searches on multi-valued spatial fields soon? I have a latlon (point) field that is multi-valued and don't know how to search against it such that the lats and lons match up correctly, since they are split apart. E.g. I have a document with 10 point/latlon values in the same field. On 06/28/2011 05:15 AM, marthinal wrote: Yes Yonik what i do now is q={!geofilt sfield=location_1 pt=36.62,-6.23 d=50}&fq=_query_:{!geofilt sfield=location_2 pt=40.51,-5.91 d=500} other_filter:value ...
Re: Removing duplicate documents from search results
I am making the hash from the URL, but I can't use this as the uniqueKey because I am using a UUID as the uniqueKey. Since I am using Solr as the index engine only, with Riak (key-value storage) as the storage engine, I don't want to overwrite on duplicates; I just need to discard them. 2011/6/28 François Schiettecatte fschietteca...@gmail.com: Create a hash from the url and use that as the unique key, md5 or sha1 would probably be good enough. Cheers François -- Thanks and Regards Mohammad Shariq
Re: Default schema - 'keywords' not multivalued
On 06/27/2011 11:23 AM, lee carroll wrote: Hi Tod, A list of keywords would be fine in a non-multivalued field: keywords: xxx yyy sss aaa. A multivalued field would allow you to repeat the field when indexing: keywords: xxx keywords: yyy keywords: sss etc. Thanks Lee. The problem is that I'm manually pushing a document (via stream.url) and its metadata from a database with the Solr /update/extract REST service, over HTTP GET, using Perl. I'm streaming over the document content (presumably via Tika), and it's gathering the document's metadata, which includes the keywords metadata field. Since I'm also passing that field from the DB to the REST call as a list (as you suggested), there is a collision, because the keywords field is single-valued. I can change this behavior using a copyField. What I wanted to know is whether there was a specific reason the default schema defined a field like keywords as single-valued, so I could make sure I wasn't missing something before I changed things. While I'm at it, I'd REALLY like to know how to use DIH to index the metadata from the database while simultaneously streaming over the document content and indexing it. I've never quite figured it out, but I have to believe it is possible. - Tod
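The stream.url push with DB metadata attached can be sketched as URL construction; `literal.<field>` is the ExtractingRequestHandler convention for supplying field values alongside the extracted content, and the base URL and field names below are placeholders for your setup:

```python
from urllib.parse import urlencode

# Sketch: an /update/extract request that streams a remote document and
# attaches database metadata via literal.* parameters.
base = "http://localhost:8983/solr/update/extract"  # placeholder host/core
params = [
    ("stream.url", "http://docs.example.com/report.pdf"),  # placeholder doc
    ("literal.id", "doc-42"),
    # Repeat literal.<field> once per value for a multiValued field:
    ("literal.keywords_ss", "finance"),
    ("literal.keywords_ss", "quarterly"),
    ("commit", "true"),
]
url = base + "?" + urlencode(params)
print(url)
```

Passing a list of (key, value) pairs to urlencode is what lets the same `literal.keywords_ss` parameter repeat, which is the multivalued-field shape the thread is after.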
Re: Find results with or without whitespace
Thank you for your answer. I agree, I can manage predictable values through synonyms. However, most data in this index are company and product names, sometimes with rather strange syntax (a mix of upper/lower case, misplaced dashes or spaces). One purpose of using Solr was to help find potential duplicates before data insertion. On the other hand, I could write a custom tokenizer/filter and a custom query builder that would test many combinations, but I have the feeling that is an inefficient approach. That is: Indexing: chelsea soccer club => chelsea, soccer, club, chelseasoccer, soccerclub, chelseasoccerclub. Searching: chelsea soccerclub => chelsea AND (soccerclub OR chelseasoccerclub). While search expressions are generally short, indexing will be a nightmare...
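The token stream described above can be sketched in a few lines. In Solr itself something similar can reportedly be produced by ShingleFilterFactory with an empty tokenSeparator, but this standalone version just illustrates the idea:

```python
def with_concatenations(text, max_shingle=3):
    # Emit each token plus concatenations of up to max_shingle adjacent
    # tokens, so "health care" also yields "healthcare".
    tokens = text.lower().split()
    out = list(tokens)
    for size in range(2, max_shingle + 1):
        for i in range(len(tokens) - size + 1):
            out.append("".join(tokens[i:i + size]))
    return out

print(with_concatenations("chelsea soccer club"))
# ['chelsea', 'soccer', 'club', 'chelseasoccer', 'soccerclub', 'chelseasoccerclub']
```

Applying this at index time (and a milder variant at query time) gives the OR-free matching the message wants, at the cost of the index blow-up it predicts.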
Re: Removing duplicate documents from search results
Maybe there is a way to get Solr to reject documents that already exist in the index, but I doubt it; maybe someone else can chime in here. You could do a search for each document prior to indexing it to see if it is already in the index, but that is probably non-optimal. It is probably easiest to check whether the document exists in your Riak repository: if not, add it and index it; drop it if it already exists. François On Jun 28, 2011, at 8:24 AM, Mohammad Shariq wrote: I am making the Hash from URL, but I can't use this as UniqueKey because I am using UUID as UniqueKey, Since I am using SOLR as index engine Only and using Riak(key-value storage) as storage engine, I dont want to do the overwrite on duplicate. I just need to discard the duplicates. -- Thanks and Regards Mohammad Shariq
Re: Removing duplicate documents from search results
Re: Removing duplicate documents from search results
I found the deduplication thing really useful. Although I have not yet started to work on it, as there are some other low-hanging fruits I have to capture first. Will share my thoughts soon. *Pranav Prakash* temet nosce Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com | Google http://www.google.com/profiles/pranny

2011/6/28 François Schiettecatte fschietteca...@gmail.com
Maybe there is a way to get Solr to reject documents that already exist in the index, but I doubt it; maybe someone else can chime in here. You could do a search for each document prior to indexing it to see if it is already in the index, but that is probably non-optimal. It may be easiest to check whether the document exists in your Riak repository: if not, add it and index it; drop it if it already exists. François

On Jun 28, 2011, at 8:24 AM, Mohammad Shariq wrote:
I am making the hash from the URL, but I can't use this as the uniqueKey because I am using a UUID as the uniqueKey. Since I am using Solr as the index engine only and Riak (key-value storage) as the storage engine, I don't want to overwrite on duplicate. I just need to discard the duplicates.

2011/6/28 François Schiettecatte fschietteca...@gmail.com
Create a hash from the url and use that as the unique key; md5 or sha1 would probably be good enough. Cheers François

On Jun 28, 2011, at 7:29 AM, Mohammad Shariq wrote:
I also have the problem of duplicate docs. I am indexing news articles. Every news article has a source URL; if two news articles have the same URL, only one needs to be indexed, i.e. removal of duplicates at index time.

On 23 June 2011 21:24, simon mtnes...@gmail.com wrote:
Have you checked out the deduplication process that's available at indexing time? This includes a fuzzy hash algorithm. http://wiki.apache.org/solr/Deduplication -Simon

On Thu, Jun 23, 2011 at 5:55 AM, Pranav Prakash pra...@gmail.com wrote:
This approach would definitely work if the two documents are *exactly* the same. But this is very fragile: even if one extra space has been added, the whole hash would change. What I am really looking for is some percentage similarity between documents, so as to remove those documents which are more than 95% similar. *Pranav Prakash* temet nosce Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com | Google http://www.google.com/profiles/pranny

On Thu, Jun 23, 2011 at 15:16, Omri Cohen o...@yotpo.com wrote:
What you need to do is calculate some hash (using any message digest algorithm you want: md5, sha-1 and so on), then do some reading on Solr's field collapse capabilities. Should not be too complicated. *Omri Cohen* Co-founder @ yotpo.com | o...@yotpo.com | +972-50-7235198 | +972-3-6036295

-- Forwarded message --
From: Pranav Prakash pra...@gmail.com
Date: Thu, Jun 23, 2011 at 12:26 PM
Subject: Removing duplicate documents from search results
To: solr-user@lucene.apache.org
How can I remove very similar documents from search results? My scenario is that there are documents in the index which are almost similar (people submitting the same stuff multiple times, sometimes different people submitting the same stuff). Now when a search is performed for a keyword, in the top N results the same document quite frequently comes up multiple times. I want to remove those duplicate (or possibly duplicate) documents, very similar to what Google does when it says "In order to show you the most relevant results, duplicates have been removed." How can I achieve this functionality using Solr? Does Solr have a built-in feature or plugin which could help me with it? *Pranav Prakash* temet nosce Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com | Google http://www.google.com/profiles/pranny
-- Thanks and Regards Mohammad Shariq
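Several replies in this thread suggest hashing the article URL (md5 or sha1) to get a dedup key. A rough sketch of that idea in Python (the normalization rules here are assumptions; real URL canonicalization needs more care):

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit

def url_dedup_key(url: str) -> str:
    """Normalize a URL and return its MD5 hex digest, for use as a dedup key."""
    parts = urlsplit(url.strip())
    # Lowercase scheme/host, drop a trailing slash and the fragment.
    normalized = urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                             parts.path.rstrip("/") or "/", parts.query, ""))
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

# The same article URL, with cosmetic differences, maps to one key:
a = url_dedup_key("http://Example.com/news/story-123/")
b = url_dedup_key("http://example.com/news/story-123")
print(a == b)  # True
```

As Pranav points out, an exact hash only catches byte-identical duplicates; for "95% similar" documents a fuzzy signature (like the one in Solr's Deduplication) is needed.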
Re: Analyzer creates PhraseQuery
(11/06/28 16:40), lboutros wrote: You could add this filter after the NGram filter to prevent phrase query creation: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.PositionFilterFactory Ludovic. There is also an option on the field type to avoid producing phrase queries, autoGeneratePhraseQueries="false". koji -- http://www.rondhuit.com/en/
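Putting the two suggestions together, a field type along these lines (an illustrative schema.xml sketch; the type name and gram sizes are assumptions) keeps the query parser from turning the NGram tokens into a phrase query:

```xml
<fieldType name="text_ngram" class="solr.TextField"
           positionIncrementGap="100"
           autoGeneratePhraseQueries="false">
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="15"/>
    <!-- collapse all token positions so no phrase query is generated -->
    <filter class="solr.PositionFilterFactory"/>
  </analyzer>
</fieldType>
```

Either fix alone is sufficient; the PositionFilter approach also works on releases that predate the autoGeneratePhraseQueries attribute.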
Re: Removing duplicate documents from search results
Indeed, take a look at this: http://wiki.apache.org/solr/Deduplication I have not used it but it looks like it will do the trick. François

On Jun 28, 2011, at 8:44 AM, Pranav Prakash wrote:
I found the deduplication thing really useful. Although I have not yet started to work on it, as there are some other low hanging fruits I've to capture. Will share my thoughts soon.
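For reference, the Deduplication wiki page configures an update processor chain along these lines (a solrconfig.xml sketch; the chain name and field list are illustrative). TextProfileSignature is the fuzzy-hash variant Simon mentions; Lookup3Signature hashes exactly:

```xml
<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature</str>
    <bool name="overwriteDupes">true</bool>
    <!-- fields used to compute the near-duplicate signature -->
    <str name="fields">title,summary</str>
    <!-- fuzzy signature; use Lookup3Signature for exact matching -->
    <str name="signatureClass">org.apache.solr.update.processor.TextProfileSignature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```

With overwriteDupes=true an incoming duplicate replaces the stored one, so the index ends up holding a single copy per signature; there is no documented mode that silently rejects the incoming duplicate while keeping the original.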
Re: Removing duplicate documents from search results
Hey François, thanks for your suggestion. I followed the same link (http://wiki.apache.org/solr/Deduplication); the solutions there are either to make the hash the uniqueKey or to overwrite on duplicate. I don't need either: I need *discard on duplicate*.

I have not used it but it looks like it will do the trick. François
Re: Include synonys in solr
Thanks François Schiettecatte, the information you provided is very helpful. I need to know one more thing: I downloaded one of the given dictionaries, but it contains many files. Do I need to add all of these files' data into synonyms.txt? - Thanks Regards Romi -- View this message in context: http://lucene.472066.n3.nabble.com/Include-synonyms-in-solr-tp3116836p3117733.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Removing duplicate documents from search results
Mohammad, just in case you meant it, I would like to discourage you from trying to deduplicate *the search result*. There are many things that go wrong if you do that; we had it in one version of the ActiveMath search environment (which uses Lucene):
- paging is inappropriate
- the total count is wrong unless you go through all the results
- performance can go really bad if you try to go through all the results
- performance does go bad for some search results if you try to fill the page (you need to fetch until you find enough)
- you have to go through all the search results again and again when delivering the next pages
So, as others have suggested, please be sure to deduplicate somehow at indexing time. paul

Le 28 juin 2011 à 14:24, Mohammad Shariq a écrit :
I am making the Hash from URL, but I can't use this as UniqueKey because I am using UUID as UniqueKey. Since I am using Solr as index engine only and Riak (key-value storage) as storage engine, I don't want to overwrite on duplicate. I just need to discard the duplicates.
-- Thanks and Regards Mohammad Shariq
Re: Removing duplicate documents from search results
Yeah, I read the overview, which suggests that duplicates can be prevented from entering the index, and I scanned the rest; it does not look like you can actually drop the incoming document entirely. Maybe I am missing something here. François

On Jun 28, 2011, at 9:14 AM, Mohammad Shariq wrote:
Hey François, thanks for your suggestion. I followed the same link (http://wiki.apache.org/solr/Deduplication); they have the solution: either make the Hash the uniqueKey OR overwrite on duplicate. I don't need either; I need Discard on Duplicate.
Re: Include synonys in solr
Well no, you need to see which files (if any) will suit your needs; they are not all synonym files. I only needed the UK/US English file, and I needed to process it into a format suitable for the synonyms file. There may well be other word lists on the net suitable for your needs. I would not recommend the use of synonyms unless you have a specific need for them. I needed them because we have documents which mix UK/US English, and we need to be able to search on medical terms, e.g. hemoglobin/haemoglobin, and get the same results. Cheers François

On Jun 28, 2011, at 9:21 AM, Romi wrote:
Thanks François Schiettecatte, the information you provided is very helpful. I need to know one more thing: I downloaded one of the given dictionaries, but it contains many files. Do I need to add all of these files' data into synonyms.txt? - Thanks Regards Romi -- View this message in context: http://lucene.472066.n3.nabble.com/Include-synonyms-in-solr-tp3116836p3117733.html Sent from the Solr - User mailing list archive at Nabble.com.
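For reference, the synonyms file that solr.SynonymFilterFactory reads is plain text, one rule per line (a minimal illustration; the word pairs are examples, including the UK/US medical case from this thread):

```
# comma-separated groups are treated as equivalent terms
hemoglobin, haemoglobin
color, colour
# "=>" maps left-hand terms to the right-hand replacement(s)
teh => the
```

So the downloaded word lists would need to be converted into one of these two line formats before being dropped into synonyms.txt.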
Re: multiple spatial values
It is precisely this limitation which triggered me to develop a grid indexing approach using Geohashes: https://issues.apache.org/jira/browse/SOLR-2155 This patch requires a Solr trunk release. If you have a small number of distinct points in total, and you only need filtering, then the geohash field in Solr 3.1 may be fast enough for you. ~ David Smiley

On Jun 28, 2011, at 7:53 AM, Darren Govoni wrote:
Will it be possible to do spatial searches on multi-valued spatial fields soon? I have a latlon field (point) that is multi-valued and I don't know how to search against it such that the lats and lons match correctly, since they are split apart. E.g. I have a document with 10 point/latlon values for the same field.

On 06/28/2011 05:15 AM, marthinal wrote:
Yonik Seeley wrote: On Sat, Jun 25, 2011 at 5:56 AM, marthinal <jm.rodriguez.ve...@gmail.com> wrote: sfield, pt and d can all be specified directly in the spatial functions/filters too, and that will override the global params. Unfortunately one must currently use Lucene query syntax to do an OR. It just makes it look a bit messier. q=_query_:{!geofilt} _query_:{!geofilt sfield=location_2} -Yonik http://www.lucidimagination.com

@Yonik it seems to work like this; I tried hundreds of other possibilities without success: q={!geofilt sfield=location_1 pt=36.62,-6.23 d=50}&fq={!geofilt sfield=location_2 pt=40.51,-5.91 d=500}

Ah, right. I had thought you wanted docs that matched either geofilt (hence OR), not docs that only matched both. -Yonik http://www.lucidimagination.com

Yes Yonik, what I do now is q={!geofilt sfield=location_1 pt=36.62,-6.23 d=50}&fq=_query_:{!geofilt sfield=location_2 pt=40.51,-5.91 d=500} other_filter:value .. I write the query here because maybe it *helps* someone that needs to do something like this ... -- View this message in context: http://lucene.472066.n3.nabble.com/multiple-spatial-values-tp1555668p3117145.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Index Version and Epoch Time?
On Tue, Jun 28, 2011 at 4:18 PM, Pranav Prakash pra...@gmail.com wrote: I am not sure what the index number value is. It looks like an epoch time, but in my case it points to one month back. However, I can see documents which were added last week in the index. The index version shown on the dashboard is the time at which the most recent index segment was created. I'm not sure why it has a value older than a month if a commit has happened after that time. Even after I did a commit, the index number did not change. Isn't it supposed to change on every commit? If not, is there a way to look into the last index time? Yeah, it changes after every commit which added/deleted a document. Also, this page http://wiki.apache.org/solr/SolrReplication#Replication_Dashboard shows a Replication Dashboard. How is this dashboard invoked? Is there any URL which needs to be called? If you have configured replication correctly, the admin dashboard should show a Replication link right next to the Schema Browser link. The path should be /admin/replication/index.jsp -- Regards, Shalin Shekhar Mangar.
Using FieldCache in SolrIndexSearcher - crazy idea?
I am a user of Solr 3.2 and I make use of the distributed search capabilities of Solr using a fairly simple architecture of a coordinator + some shards. Correct me if I am wrong: In a standard distributed search with QueryComponent, the first query sent to the shards asks for fl=myUniqueKey or fl=myUniqueKey,score. When the response is being generated to send back to the coordinator, SolrIndexSearcher.doc(int i, Set<String> fields) is called for each document. As I understand it, this will read each document from the index _on disk_ and retrieve the myUniqueKey field value for each document. My idea is to have a FieldCache for the myUniqueKey field in SolrIndexSearcher (or somewhere else?) that would be used in cases where the only field that needs to be retrieved is myUniqueKey. Is this something that would improve performance? In our actual setup, we are using an extended version of QueryComponent that queries for a couple of other fields besides myUniqueKey in the initial query to the shards, and it asks for a lot of rows when doing so, many more than what the user ends up getting back when they see the results. (The reasons for this are complicated and aren't related much to this question.) We already maintain FieldCaches for the fields that we are asking for, but for other purposes. Would it make sense to utilize these FieldCaches in SolrIndexSearcher? Is this something that anyone else has done before? -Michael
Records disappearing
Hi all, I'm having some weird behavior with my dataimport script. Because of memory issues, I've taken to doing my delta imports as a full-import with clean=false. My dataimport config file is set up like:

<entity name="findDelta" rootEntity="false"
        query="SELECT id FROM mytable
               WHERE date_added &gt; '${dataimporter.last_index_time}'
                  OR last_updated &gt; '${dataimporter.last_index_time}'">
  <entity name="mytable" pk="id"
          query="SELECT * FROM mytable WHERE id = '${findDelta.id}'"
          deletedPkQuery="SELECT id FROM my_delete_table"
          deltaImportQuery="SELECT id FROM mytable WHERE id='${dataimporter.delta.id}'"
          deltaQuery="SELECT id FROM mytable
                      WHERE date_added &gt; '${dataimporter.last_index_time}'
                         OR last_updated &gt; '${dataimporter.last_index_time}'">
    <field column="id" name="id" />
    <field column="title" name="title" />
    <field column="name" name="name" />
    <field column="summary" name="summary" />
  </entity>
</entity>

I've found that one record (possibly more that I haven't noticed) keeps disappearing from the index. I will do a full-import with clean=false, search, and the record will be there. I'll search again a few hours later and it's there. But then all of a sudden it's gone. I don't know what is triggering that one record's disappearance, but it is quite annoying. Any ideas what's going on? Thanks, Brian Lamb
Re: Default schema - 'keywords' not multivalued
: I'm streaming over the document content (presumably via tika) and it's
: gathering the document's metadata, which includes the keywords metadata field.
: Since I'm also passing that field from the DB to the REST call as a list (as
: you suggested) there is a collision, because the keywords field is single
: valued.
:
: I can change this behavior using a copy field. What I wanted to know is if
: there was a specific reason the default schema defined a field like keywords
: as single valued, so I could make sure I wasn't missing something before I
: changed things.

That file is just an example; you're absolutely free to change it to meet your use case. I'm not very familiar with Tika, but based on the comment in the example config...

<!-- Common metadata fields, named specifically to match up with SolrCell metadata when parsing rich documents such as Word, PDF. Some fields are multiValued only because Tika currently may return multiple values for them. -->

...I suspect it was intentional that that field is *not* multiValued (I guess Tika always returns a single delimited value?), but if you have multiple discrete values you want to send for your DB-backed data there is no downside to changing that.

: While I'm at it, I'd REALLY like to know how to use DIH to index the metadata
: from the database while simultaneously streaming over the document content and
: indexing it. I've never quite figured it out yet but I have to believe it is
: a possibility.

There's a TikaEntityProcessor that can be used to have Tika crunch the data that comes from an entity and extract out specific fields, and it can be used in combination with a JdbcDataSource and a BinFileDataSource, so that a field in your db data specifies the name of a file on disk to use as the Tika entity -- but I've personally never tried it. Here's a simple example someone posted last year that they got working... http://lucene.472066.n3.nabble.com/TikaEntityProcessor-not-working-td856965.html -Hoss
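The DIH combination Hoss describes can be sketched like this (an illustrative data-config.xml; the table, column, and path names are assumptions, not from the original thread):

```xml
<dataConfig>
  <dataSource name="db" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/docs" />
  <dataSource name="bin" type="BinFileDataSource" />
  <document>
    <!-- metadata comes from the database row -->
    <entity name="meta" dataSource="db"
            query="SELECT id, title, file_path FROM documents">
      <field column="id" name="id" />
      <field column="title" name="title" />
      <!-- Tika parses the file named by the DB row and extracts its text -->
      <entity name="body" processor="TikaEntityProcessor"
              dataSource="bin" url="${meta.file_path}" format="text">
        <field column="text" name="content" />
      </entity>
    </entity>
  </document>
</dataConfig>
```

The key idea is the nested Tika entity reading its `url` from the parent JDBC entity's row, so each document gets both its DB metadata and its extracted body in one import.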
Does Smart Chinese filter work for Traditional Chinese?
Hi, According to the doc: http://wiki.apache.org/solr/LanguageAnalysis#Chinese.2C_Japanese.2C_Korean solr.SmartChineseWordTokenFilterFactory is for Simplified Chinese. Does it work for Traditional Chinese too? If not, is there anything equivalent for Traditional Chinese? Thanks.
Re: Analyzer creates PhraseQuery
Thanks guys. Both the PositionFilterFactory and the autoGeneratePhraseQueries=false solutions solved the issue. -- View this message in context: http://lucene.472066.n3.nabble.com/Analyzer-creates-PhraseQuery-tp3116288p3118471.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Index Version and Epoch Time?
Hi, I am facing multiple issues with Solr and I am not sure what happens in each case. I am quite naive in Solr and there are some scenarios I'd like to discuss with you. We have a huge volume of documents to be indexed, somewhere about 5 million. We have a full indexer script, which essentially picks up all the documents from the database and updates them into Solr, and an incremental script, which adds new documents to Solr. Relevant areas of my config file go like:

<unlockOnStartup>false</unlockOnStartup>

<deletionPolicy class="solr.SolrDeletionPolicy">
  <!-- Keep only optimized commit points -->
  <str name="keepOptimizedOnly">false</str>
  <!-- The maximum number of commit points to be kept -->
  <str name="maxCommitsToKeep">1</str>
</deletionPolicy>

<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxDocs>10</maxDocs>
  </autoCommit>
</updateHandler>

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="enable">${enable.master:false}</str>
    <str name="replicateAfter">startup</str>
    <str name="replicateAfter">commit</str>
  </lst>
  <lst name="slave">
    <str name="enable">${enable.slave:false}</str>
    <str name="masterUrl">http://hostname:port/solr/core0/replication</str>
  </lst>
</requestHandler>

Sometimes the full indexer script breaks while adding documents to Solr. The script adds the documents and then commits the operation, so when the script breaks we have a huge lot of data which has been updated but not committed. Next, the incremental index script executes, figures out all the new entries, and adds them to Solr. It works successfully and commits the operation. - Will the commit by the incremental indexer script also commit the previously uncommitted changes made by the full indexer script before it broke?
Sometimes during execution, Solr's avg response time (avg resp time for the last 10 requests, read from the log file) goes as high as 9000ms (which I am still unclear why; any ideas how to start hunting for the problem?), so the watchdog process restarts Solr (because it causes a pile-up of requests queued at the application server, which causes the app server to crash). On my local environment, I performed the same experiment by adding docs to Solr, killing the process and restarting it. I found that the uncommitted changes were applied and searchable, even though the updates were never committed. Could you explain to me how this is happening, or is there a configuration that can be adjusted for this? Also, what would the index state be if, after restarting Solr, a commit is applied or not applied? I'd be happy to provide any other information that might be needed. *Pranav Prakash* temet nosce Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com | Google http://www.google.com/profiles/pranny On Tue, Jun 28, 2011 at 20:55, Shalin Shekhar Mangar shalinman...@gmail.com wrote: On Tue, Jun 28, 2011 at 4:18 PM, Pranav Prakash pra...@gmail.com wrote: I am not sure what the index number value is. It looks like an epoch time, but in my case it points to one month back. However, I can see documents which were added last week in the index. The index version shown on the dashboard is the time at which the most recent index segment was created. I'm not sure why it has a value older than a month if a commit has happened after that time. Even after I did a commit, the index number did not change. Isn't it supposed to change on every commit? If not, is there a way to look into the last index time? Yeah, it changes after every commit which added/deleted a document. Also, this page http://wiki.apache.org/solr/SolrReplication#Replication_Dashboard shows a Replication Dashboard. How is this dashboard invoked? Is there any URL which needs to be called?
If you have configured replication correctly, the admin dashboard should show a Replication link right next to the Schema Browser link. The path should be /admin/replication/index.jsp -- Regards, Shalin Shekhar Mangar.
Re: Custom Query Processing
You should modify the SolrCore for this, if I'm not mistaken. Would extending LuceneQParserPlugin (solr 1.4) be an option for you? On Tue, Jun 28, 2011 at 12:25 AM, Jamie Johnson jej2...@gmail.com wrote: I have a need to take an incoming solr query and apply some additional constraints to it on the Solr end. Our previous implementation used a QueryWrapperFilter along with some custom code to build a new Filter from the query provided. How can we plug this filter into Solr? -- Regards, Dmitry Kan
Re: Unique document count from index?
Can you use facet search? facet=true&facet.field=order_no&fq=order_no:(1234 OR 5678 OR ...)&fq=artist:"Pink Floyd" On Mon, Jun 27, 2011 at 6:44 PM, Olson, Ron rol...@lbpc.com wrote: Hi all- I have a problem that I'm not sure how it can be (if it can be) solved in Solr. I am using Solr 3.2 with patch 2524 installed to provide grouping. I need to return the count of unique records that match a particular query. For an example of what I'm talking about, imagine I have an index of music CD orders, created from a SQL database using the DataImportHandler. It's possible that the person ordered multiple records by the same artist (e.g. order #1234 contains Pink Floyd Wish You Were, Pink Floyd Meddle, Pink Floyd Obscured by Clouds). One of the indexed and stored fields in the document is Artist. If I do a search for Pink Floyd, using the order above, I'd get three documents, all with the same order number, one for each of the Pink Floyd records. What I'd like to find out is how many unique orders have Pink Floyd across the entire index. The index has millions of documents. I have been trying to see if the result grouping functionality provided by patch 2524 will help, but while it does collapse the query above into one document, the matches field is still the same as without the grouping (which I guess makes sense insofar as it is still reporting the number of documents it found for the query). I have also thought a subquery in my DataImportHandler might work, though I'm not sure how I'd structure it. Thanks for any guidance on how to solve this problem; I know Solr isn't meant to be a data-mining tool and I'm guessing I'm skating perilously close to using it for that purpose, but anything I can do to take load off the actual database is considered a Good Thing by all concerned. Ron -- Regards, Dmitry Kan
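As a sketch of Dmitry's suggestion, the facet query can be assembled with ordinary URL encoding. The host, core path, and field names below are assumptions taken from the thread; the number of unique orders would then be the number of facet buckets returned for order_no.

```python
from urllib.parse import urlencode

# Build a faceted query that groups matches by order number.
# Host, core path, and field names are assumptions for illustration.
params = [
    ("q", 'artist:"Pink Floyd"'),
    ("rows", "0"),              # only facet counts are needed, not docs
    ("facet", "true"),
    ("facet.field", "order_no"),
    ("facet.mincount", "1"),    # skip orders with zero matches
]
query_string = urlencode(params)
url = "http://localhost:8983/solr/select?" + query_string
print(url)
```

Counting the entries under facet_fields/order_no in the response then gives the unique-order count without pulling back any documents.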
Re: Index Version and Epoch Time?
On 6/28/2011 1:38 PM, Pranav Prakash wrote: - Will the commit by incremental indexer script also commit the previously uncommitted changes made by full indexer script before it broke? Yes, as long as the Solr instance hasn't crashed. Anything added but not yet committed sticks around and will be committed on the next 'commit'. There are no 'transactions' for adding docs in Solr; even if multiple processes are adding, if any one of them issues a 'commit' they'll all be committed. Sometimes, during execution, Solr's avg response time (avg resp time for last 10 requests, read from log file) goes as high as 9000ms (which I am still unclear why, any ideas how to start hunting for the problem?), It could be a Java garbage collection issue. I have found it useful to start the JVM with Solr in it using some parameters to tune garbage collection. I use these JVM options: -server -XX:+AggressiveOpts -d64 -XX:+UseConcMarkSweepGC -XX:+UseCompressedOops You've still got to make sure Solr has enough memory for what you're doing with it, which with your 5 million doc index might be more than you expect. On the other hand, giving a JVM too _much_ heap can cause slowdowns too, although I think -XX:+UseConcMarkSweepGC should ameliorate that to some extent. Possibly more likely, it could instead be Solr readying the new indexes. Do you issue commits in the middle of 'execution', and could the slowdown happen right after a commit? When a commit is issued to Solr, Solr's got to switch in new indexes with the newly added documents, and 'warm' those indexes in various ways, which can be a CPU (as well as RAM) intensive thing. (For these purposes a replication from master counts as a commit (because it is), and an optimize can count too (because it's close enough).) 
This can be especially a problem if you issue multiple commits very close together -- Solr's still working away at readying the index from the first commit when the second comes in, and now Solr's trying to ready two indexes at once (one of which will never be used because it's already outdated). Or even more than two, if you issue a bunch of commits in rapid succession. I found that the uncommitted changes were applied and searchable. However, the updates were uncommitted. There is in general no way that uncommitted adds could be searchable, so that's probably not what is happening. What is probably happening instead is that a commit _is_ happening. One way a commit can happen even if you aren't manually issuing one is via the various auto-commit settings in solrconfig.xml. Commit any pending adds after X documents, or after T seconds: both can be configured. If they are configured, that could be causing commits to happen when you don't realize it, which could also trigger the slowdown due to a commit mentioned in the previous paragraph. Jonathan
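The auto-commit settings Jonathan refers to are set on the update handler in solrconfig.xml; a minimal sketch (the thresholds here are placeholder values, not recommendations):

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <!-- commit after this many pending documents... -->
    <maxDocs>10000</maxDocs>
    <!-- ...or after this many milliseconds, whichever comes first -->
    <maxTime>60000</maxTime>
  </autoCommit>
</updateHandler>
```

Note that the config Pranav posted earlier in the thread already contains an autoCommit block with maxDocs set to 10, which would make Solr commit automatically every 10 added documents.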
moving to multicore without changing existing index
hi I'm looking at setting up multi-core indices but also have an existing index. Can I run this index alongside a new index set up as cores? On a dev machine I've experimented with simply adding solr.xml in solr home and listing the new cores in the cores element, but this breaks the existing index. Container is tomcat and attempted set up was:

solrHome
  conf (existing running index)
  core1 (new core directory)
  solr.xml (cores element has one entry for core1)

Is this a valid approach? thanks lee
Re: moving to multicore without changing existing index
Nope. But you can move your existing index into a core in a multi-core setup. A multi-core setup is a multi-core setup, though: there's no way to have an index accessible at a non-core URL in a multi-core setup. On 6/28/2011 2:53 PM, lee carroll wrote: hi I'm looking at setting up multi-core indices but also have an existing index. Can I run this index alongside a new index set up as cores? On a dev machine I've experimented with simply adding solr.xml in solr home and listing the new cores in the cores element, but this breaks the existing index. Container is tomcat and attempted set up was: solrHome / conf (existing running index) / core1 (new core directory) / solr.xml (cores element has one entry for core1) Is this a valid approach? thanks lee
Dynamic Fields vs. Multicore
Hi All, I was searching around for documentation of the performance differences of having a sharded, single schema, dynamic field set up vs. a multi-core, static multi-schema setup (which I currently have), but I have not had much luck finding what I am looking for. I understand commits and optimizes will be more intensive in a single core since there is more data (though I would offset by sharding heavily), but I am particularly curious about the search performance implications. I am interested in moving to the dynamic field setup in order to implement a better global search, but I want to make sure I understood the drawbacks of hitting those datasets individually and globally after they are merged (NOTE: I would have a global field signifying the dataset type, which could then be added to the filter query in order to create the subset for individual dataset queries). Some background about the data: it is extremely variable. Some documents contain only 2 or 3 sentences, and some are 20 page extracted PDFs. There would probably only be about 100-150 unique fields. Any input is greatly appreciated! Thanks, Briggs Thompson
Solr - search queries not returning results
Hello everyone, I believe I am missing something very elementary. The following query returns zero hits: http://localhost:8983/solr/core0/select/?q=testabc However, using solritas, it finds many results: http://localhost:8983/solr/core0/itas?q=testabc Do you have any idea what the issue may be? Thanks in advance!
overwrite if not already in index?
Quick question, is there a way with Solr to conditionally update a document on unique id? Meaning: the default add behavior if the id is not already in the index, and *not* touching the index if it is already there. Deletes are not important (no sync issues). I am asking because I noticed that with deduplication turned on, index files get modified even if I update the same documents again (same signatures). I am facing a very high dupes rate (40-50%), and the setup is going to be master-slave with a high commit rate (the requirement is to reduce propagation latency for updates). Unnecessary index modifications are going to waste effort shipping the same information again and again. If there is no standard way, what would be the fastest way to check if a Term exists in the index from an UpdateRequestProcessor? I intend to extend SignatureUpdateProcessor to prevent a document from propagating down the chain if this happens. Would that be a way to deal with it? I repeat, there are no deletes to make headaches with synchronization. Thanks, eks
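For reference, deduplication like eks describes is wired up as an update processor chain in solrconfig.xml; a sketch along the lines of the Solr wiki's deduplication example (the field list and signature class are illustrative):

```xml
<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature</str>
    <!-- with overwriteDupes=true, existing docs carrying the same
         signature are deleted before the add goes through -->
    <bool name="overwriteDupes">true</bool>
    <str name="fields">name,features,cat</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```

That delete-then-add behavior under overwriteDupes may account for the index files changing even when the same documents (same signatures) are re-sent.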
Re: Solr - search queries not returning results
Hi Walter, probably solritas is using dismax with a set of fields in the qf parameter, while with your first query you are just querying the default field. On Tue, Jun 28, 2011 at 5:07 PM, Walter Closenfleight walter.p.closenflei...@gmail.com wrote: Hello everyone, I believe I am missing something very elementary. The following query returns zero hits: http://localhost:8983/solr/core0/select/?q=testabc However, using solritas, it finds many results: http://localhost:8983/solr/core0/itas?q=testabc Do you have any idea what the issue may be? Thanks in advance!
edismax - Handling collocations mapped to a single token . . ?
We are trying to get edismax to handle collocations mapped to a single token. To do so we need to manipulate the chunks (as Hoss referred to them in http://www.lucidimagination.com/blog/2010/05/23/whats-a-dismax/) generated by the dismax parser. We have numerous collocations (terms of speech which do not directly relate to the constituent words that make up the saying). For example, at index time real estate is mapped to real_estate to avoid it colliding with searches for estate or real value. So we need the chunks to reflect this mapping of multi-word phrases to a single token that is done during indexing (via the synonym filter). In an ideal world, we would just list the queryAnalyzerFieldType that should be used in pre-processing the query string before it is divided into chunks (similar to what is done with the SpellChecker Component). But our impression thus far is that we are off the reservation and will need to hack away at org.apache.solr.search.ExtendedDismaxQParser.splitIntoClauses(String, boolean). Is it correct that the only pre-processing by dismax is on stopwords? Is it correct to be able to limit customization to splitIntoClauses(String, boolean) to handle this? Regards, Christopher
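One client-side stopgap (not the parser-level hook Christopher is asking for) is to apply the same collocation mapping to the raw query string before it ever reaches dismax, so the chunks already contain the fused tokens. A sketch with a toy synonym table:

```python
import re

# Toy collocation table; in practice this would mirror the index-time
# synonym file (e.g. "real estate => real_estate").
COLLOCATIONS = {
    "real estate": "real_estate",
    "ice cream": "ice_cream",
}

def map_collocations(query: str) -> str:
    """Replace each multi-word collocation with its single-token form."""
    # Longest phrases first so overlapping entries don't clobber each other.
    for phrase in sorted(COLLOCATIONS, key=len, reverse=True):
        pattern = re.compile(re.escape(phrase), re.IGNORECASE)
        query = pattern.sub(COLLOCATIONS[phrase], query)
    return query

print(map_collocations("cheap real estate listings"))
```

This keeps ExtendedDismaxQParser untouched, at the cost of duplicating the synonym table outside the Solr analysis chain.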
Re: moving to multicore without changing existing index
But a multi-core setup is a multi-core setup, there's no way to have an index accessible at a non-core URL in a multi-core setup. Isn't there? What about the defaultCoreName parameter? From the wiki: The name of a core that will be used for requests that don't specify a core. If you have one core and want to use the features specified on this page, then this provides a way to keep your URLs the same. You will need to set up the directory structure for that core, something like:

solrHome
  originalCore (new core directory)
    conf (existing running index)
  core1 (new core directory)
    conf (new configuration)
  solr.xml (declare both cores, and set originalCore as defaultCoreName)

Haven't tried it, but I think it should work. See http://wiki.apache.org/solr/CoreAdmin#solr On Tue, Jun 28, 2011 at 3:57 PM, Jonathan Rochkind rochk...@jhu.edu wrote: Nope. But you can move your existing index into a core in a multi-core setup. But a multi-core setup is a multi-core setup, there's no way to have an index accessible at a non-core URL in a multi-core setup. On 6/28/2011 2:53 PM, lee carroll wrote: hi I'm looking at setting up multi-core indices but also have an existing index. Can I run this index alongside a new index set up as cores? On a dev machine I've experimented with simply adding solr.xml in solr home and listing the new cores in the cores element, but this breaks the existing index. Container is tomcat and attempted set up was: solrHome / conf (existing running index) / core1 (new core directory) / solr.xml (cores element has one entry for core1) Is this a valid approach? thanks lee
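A solr.xml matching that layout might look like the following - a sketch in the legacy multicore format of that era; whether defaultCoreName is honored depends on the Solr version:

```xml
<solr persistent="true">
  <cores adminPath="/admin/cores" defaultCoreName="originalCore">
    <!-- requests without a core name fall through to originalCore,
         so existing non-core URLs keep working -->
    <core name="originalCore" instanceDir="originalCore" />
    <core name="core1" instanceDir="core1" />
  </cores>
</solr>
```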
How to Create a weighted function (dismax or otherwise)
I am trying to create a feature that allows search results to be ranked by this formula: sum(weight1 * text relevance score, weight2 * price). weight1 and weight2 are numeric values that can be changed to influence the search results. I am sending the following query params to the Solr instance for searching: q=red defType=dismax qf=name^10+price^2 My understanding is that when using dismax, Solr/Lucene looks for the search text in all the fields specified in the qf param. Currently my search results are similar to those I get when qf does not include price. I think this is because price is a numeric field and there is no text match. Is it possible to rank search results based on this formula - sum(weight1 * text relevance score, weight2 * price)? Thanks! -- View this message in context: http://lucene.472066.n3.nabble.com/How-to-Create-a-weighted-function-dismax-or-otherwise-tp3119977p3119977.html Sent from the Solr - User mailing list archive at Nabble.com.
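qf boosts only weight the text-relevance side of the formula. One common way to fold price into the score (a sketch, assuming dismax's additive bf boost-function parameter; field names and weights are illustrative) is:

```python
from urllib.parse import urlencode

# weight1 is expressed through the qf field boost; weight2 through bf,
# which adds the value of a function query to the relevance score.
params = [
    ("q", "red"),
    ("defType", "dismax"),
    ("qf", "name^10"),            # text relevance, weighted
    ("bf", "product(price,2)"),   # adds weight2 * price to the score
]
qs = urlencode(params)
print(qs)
```

Since dismax relevance scores are not normalized, this is sum(weight1 * relevance, weight2 * price) only approximately; the constants usually need empirical tuning against real queries.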
Fuzzy Query Param
According to the docs on Lucene query syntax: Starting with Lucene 1.9, an additional (optional) parameter can specify the required similarity. The value is between 0 and 1; with a value closer to 1, only terms with a higher similarity will be matched. I was messing around with this and started doing queries with values greater than 1, and it seemed to be doing something. However, I haven't been able to find any documentation on this. What happens when specifying a fuzzy query with a value > 1? tiger~2 animal~3 -- View this message in context: http://lucene.472066.n3.nabble.com/Fuzzy-Query-Param-tp3120235p3120235.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Using RAMDirectoryFactory in Master/Slave setup
Using RAMDirectory really does not help performance. Java garbage collection has to work around all of the memory taken by the segments. It works out that Solr works better (for most indexes) without using the RAMDirectory. On Sun, Jun 26, 2011 at 2:07 PM, nipunb ni...@walmartlabs.com wrote: PS: Sorry if this is a repost, I was unable to see my message in the mailing list - this may have been due to my outgoing email being different from the one I used to subscribe to the list with. Overview - Trying to evaluate if keeping the index in memory using RAMDirectoryFactory can help query performance. I am trying to perform the indexing on the master using solr.StandardDirectoryFactory and make those indexes accessible to the slave using solr.RAMDirectoryFactory. Details: We have set up Solr in a master/slave environment. The index is built on the master and then replicated to slaves, which are used to serve the queries. The replication is done using the built-in Java replication in Solr. On the master, in the indexDefaults of solrconfig.xml we have: <directoryFactory name="DirectoryFactory" class="solr.StandardDirectoryFactory"/> On the slave, I tried to use the following in the indexDefaults: <directoryFactory name="DirectoryFactory" class="solr.RAMDirectoryFactory"/> My slave shows no data for any queries. In solrconfig.xml it is mentioned that replication doesn't work when using RAMDirectoryFactory; however, this (https://issues.apache.org/jira/browse/SOLR-1379) mentions that you can use it to have the index on disk and then load it into memory. To test the sanity of my set-up, I changed solrconfig.xml on the slave to the following and replicated: <directoryFactory name="DirectoryFactory" class="solr.StandardDirectoryFactory"/> I was able to see the results. Shouldn't RAMDirectoryFactory be used for reading the index from disk into memory? Any help/pointers in the right direction would be appreciated. Thanks! 
-- View this message in context: http://lucene.472066.n3.nabble.com/Using-RAMDirectoryFactory-in-Master-Slave-setup-tp3111792p3111792.html Sent from the Solr - User mailing list archive at Nabble.com. -- Lance Norskog goks...@gmail.com