Re: Detect term occurrences
Hi Francisco, >> I have many drug products leaflets, each corresponding to 1 product. On the other hand we have a medical dictionary with about 10^5 terms. I want to detect all the occurrences of those terms for any leaflet document. Take a look at SolrTextTagger for this use case. https://github.com/OpenSextant/SolrTextTagger 10^5 entries are not that large, I am using it for much larger dictionaries at the moment with very good results. It's a project built (at least originally) by David Smiley, who is also quite active in this group. -sujit On Fri, Sep 11, 2015 at 7:29 AM, Alexandre Rafalovitch wrote: > Assuming the medical dictionary is constant, I would do a copyField of > text into a separate field and have that separate field use: > > http://www.solr-start.com/javadoc/solr-lucene/org/apache/lucene/analysis/miscellaneous/KeepWordFilterFactory.html > with words coming from the dictionary (normalized). > > That way that new field will ONLY have your dictionary terms from the > text. Then you can do facet against that field or anything else. Or > even search and just be a lot more efficient. > > The main issue would be a gigantic filter, which may mean speed and/or > memory issues. Solr has some ways to deal with such large set matches > by compiling them into a state machine (used for auto-complete), but I > don't know if that's exposed for your purpose. > > But could make a fun custom filter to build. > > Regards, > Alex. > > Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: > http://www.solr-start.com/ > > > On 10 September 2015 at 22:21, Francisco Andrés Fernández > wrote: > > Yes. > > I have many drug products leaflets, each corresponding to 1 product. On > the > > other hand we have a medical dictionary with about 10^5 terms. > > I want to detect all the occurrences of those terms for any leaflet > > document. > > Could you give me a clue about what is the best way to perform it? > > Perhaps, the best way is (as Walter suggests) to do all the queries every > > time, as needed. > > Regards, > > > > Francisco > > > > On Thu., Sep. 10, 2015 at 11:14 a.m., Alexandre Rafalovitch < > > arafa...@gmail.com> wrote: > > > >> Can you tell us a bit more about the business case? Not the current > >> technical one. Because it is entirely possible Solr can solve the > >> higher level problem out of the box without you doing manual term > >> comparisons. In which case, your problem scope is not quite right. > >> > >> Regards, > >> Alex. > >> > >> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: > >> http://www.solr-start.com/ > >> > >> > >> On 10 September 2015 at 09:58, Francisco Andrés Fernández > >> wrote: > >> > Hi all, I'm new to Solr. > >> > I want to detect all occurrences of terms existing in a thesaurus in 1 > >> or > >> > more documents. > >> > What's the best strategy to do it? > >> > Doing a query for each term doesn't seem to be the best way. > >> > Many thanks, > >> > > >> > Francisco > >> >
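As a rough illustration of the copyField + KeepWordFilterFactory idea above: once the dictionary-only field is populated, the matched terms can be read back with a plain facet call. A minimal SolrJ sketch, assuming a Solr 5.x client and made-up names for the core (leaflets) and the copyField target (dict_terms); the facet counts are the number of leaflets each dictionary term appears in:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.FacetField;
import org.apache.solr.client.solrj.response.QueryResponse;

public class DictionaryTermCounts {
  public static void main(String[] args) throws Exception {
    HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/leaflets");
    // dict_terms only ever holds tokens kept by KeepWordFilterFactory, so faceting
    // on it lists exactly the dictionary terms that occur in the corpus
    SolrQuery query = new SolrQuery("*:*");
    query.setRows(0);
    query.setFacet(true);
    query.addFacetField("dict_terms");
    query.setFacetMinCount(1);
    QueryResponse resp = client.query(query);
    for (FacetField.Count c : resp.getFacetField("dict_terms").getValues()) {
      System.out.println(c.getName() + " -> " + c.getCount());
    }
    client.close();
  }
}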
Re: Solr query which return only those docs whose all tokens are from given list
Hi Naresh, Couldn't you just model this as an OR query, since your requirement is at least one (but can be more than one), i.e.: tags:T1 tags:T2 tags:T3 -sujit On Mon, May 11, 2015 at 4:14 AM, Naresh Yadav nyadav@gmail.com wrote: Hi all, Also asked this here: http://stackoverflow.com/questions/30166116 For example I have Solr docs in which a tags field is indexed: Doc1 - tags:T1 T2 Doc2 - tags:T1 T3 Doc3 - tags:T1 T4 Doc4 - tags:T1 T2 T3 Query1: get all docs with tags:T1 AND tags:T3 - this works and will give Doc2 and Doc4. Query2: get all docs whose tags must all be from this list [T1, T2, T3]. Expected is: Doc1, Doc2, Doc4. How do I model Query2 in Solr? Please help me with this.
Re: Proximity Search
Hi Vijay, I haven't tried this myself, but perhaps you could build each phrase as a SpanNearQuery (a PhraseQuery is not a SpanQuery, so it can't be nested directly) and then connect the two up with an outer SpanNearQuery? Something like this (using your original example).

List<SpanQuery> words1 = new ArrayList<SpanQuery>();
for (String word : "this is phrase 1".split(" ")) {
  words1.add(new SpanTermQuery(new Term("my_field", word)));
}
SpanQuery p1 = new SpanNearQuery(words1.toArray(new SpanQuery[0]), 0, true);

List<SpanQuery> words2 = new ArrayList<SpanQuery>();
for (String word : "this is the second phrase".split(" ")) {
  words2.add(new SpanTermQuery(new Term("my_field", word)));
}
SpanQuery p2 = new SpanNearQuery(words2.toArray(new SpanQuery[0]), 0, true);

SpanQuery q = new SpanNearQuery(new SpanQuery[] {p1, p2}, 4, true);

-sujit On Thu, Apr 30, 2015 at 10:04 AM, Vijaya Narayana Reddy Bhoomi Reddy vijaya.bhoomire...@whishworks.com wrote: Thanks Rajani. I could get proximity search to work for individual words. However, I still could not make it work for two phrases, each containing more than a word. Also, results seem to be unexpected for proximity queries with wildcards. Thanks Regards Vijay On 30 April 2015 at 15:19, Rajani Maski rajani.ma...@lucidworks.com wrote: Hi Vijaya, I just quickly tried proximity search with the example set shipped with Solr 5 and it looked like it was working for me. Perhaps what you could do is debug the query by enabling debugQuery=true. Here are the steps that I tried. (Assuming you are on Solr 5, though this term proximity functionality should work for 4.x versions too.) 1. Go to the Solr 5.0 downloaded folder and navigate to bin. Rajanis-MacBook-Pro:solr-5.0.0 rajanishivarajmaski$ bin/solr -e techproducts 2. Execute the below query. The field name has the value "Test" with some GB18030 encoded characters and you search for name:"Test GB18030"~10 http://localhost:8983/solr/techproducts/select?q=name:"Test GB18030"~10&wt=json&indent=true Image: http://postimg.org/image/bjkbufsph/ On Thu, Apr 30, 2015 at 7:14 PM, Vijaya Narayana Reddy Bhoomi Reddy vijaya.bhoomire...@whishworks.com wrote: I just tried a simple proximity search like "word1 word2"~3 and it is not working. Just wondering whether I have to make any configuration changes to solrconfig.xml to make proximity search work? Thanks Vijay On 30 April 2015 at 14:32, Vijaya Narayana Reddy Bhoomi Reddy vijaya.bhoomire...@whishworks.com wrote: Hi, I have created my index with the default configurations. Now I am trying to use proximity search. However, I am a bit unsure about the results and where it's going wrong. For example, I want to find two phrases "this is phrase one" and another phrase "this is the second phrase" with not more than a proximity distance of 4 words in between them. The query syntax I am using is ("this is phrase one") ("this is the second phrase")~4 However, the results I am getting are similar to an OR operation. Can anyone please let me know whether the syntax is correct? Also, please let me know how to implement proximity search using the SolrJ Query API? Thanks Regards Vijay
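On the SolrJ question at the end of this thread: the proximity syntax is just part of the query string, so nothing special is needed on the client side. A minimal sketch using the Solr 5.x SolrJ client and the techproducts example from above:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class ProximityQueryExample {
  public static void main(String[] args) throws Exception {
    HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/techproducts");
    // the proximity syntax is just part of the q string; ~10 = within 10 positions
    SolrQuery query = new SolrQuery("name:\"Test GB18030\"~10");
    query.set("debugQuery", "true");   // handy for checking how the query was parsed
    QueryResponse resp = client.query(query);
    System.out.println(resp.getResults().getNumFound() + " matches");
    client.close();
  }
}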
Re: Enrich search results with external data
Hi Ha, Yes, I think if you want to facet on the external field, the custom component seems to be the best option IMO. -sujit On Fri, Apr 17, 2015 at 3:02 PM, ha.p...@arvatosystems.com wrote: Hi Sujit, Many thanks for your blog post, responding to my question, and suggesting the alternative option ☺ I think I prefer your approach because we can supply our own Comparator. The reason is that we need to meet some strict requirements: we can only call the external system once to retrieve extra fields (price, inventory, etc.) for probably a subset of the search result. Therefore we need to be able to sort and facet on a list of items, some of which may not have external fields. I think using the Comparator would help with the sorting, but let me know if you have different ideas. Do you have a suggestion on how we should deal with the facet requirement? I am thinking about adding another Facet Component that will be executed after the standard FacetComponent. Let me know if you think we should consider other options. Thanks, -Ha -Original Message- From: sujitatgt...@gmail.com [mailto:sujitatgt...@gmail.com] On Behalf Of Sujit Pal Sent: Saturday, April 11, 2015 10:23 AM To: solr-user@lucene.apache.org; Ahmet Arslan Subject: Re: Enrich search results with external data Hi Ha, I am the author of the blog post you mention. To your question, I don't know if the code will work without change (since the Lucene/Solr API has evolved so much over the last few years), but a more preferred way, using Function Queries, may be found in the slides for Timothy Potter's talk here: http://www.slideshare.net/thelabdude/boosting-documents-in-solr-lucene-revolution-2011 Here he speaks of external fields stored in a database and accessed using a custom component (rather than from a flat file as in ExternalFileField), and of using function queries to influence the ranking based on the external field. However, per this document on function queries, you can use the output of a function query to sort as well, by passing the function to the sort parameter. https://wiki.apache.org/solr/FunctionQuery#Sort_By_Function Hope this helps, Sujit On Fri, Apr 10, 2015 at 10:38 PM, Ahmet Arslan iori...@yahoo.com.invalid wrote: Hi, Why don't you include/add/index those additional fields, at least the one used in sorting? Also, you may find https://stanbol.apache.org/docs/trunk/components/enhancer/ relevant. Ahmet On Saturday, April 11, 2015 1:04 AM, ha.p...@arvatosystems.com ha.p...@arvatosystems.com wrote: This ticket seems to address the problem I have https://issues.apache.org/jira/browse/SOLR-1566 and as the result of that ticket, DocTransformer was added in Solr 4.0. I wrote a simple DocTransformer and found that the transformer is executed AFTER pagination. In our application, we need the external fields added before sorting/pagination. I've looked around for an option to change the execution order but haven't had any luck. Does anyone know the solution? The ticket also states it is not possible for components to add fields to outgoing documents which are not in the stored fields of the document. Does anyone know if this is still true? Thanks, -Ha -Original Message- From: Pham, Ha Sent: Thursday, April 09, 2015 11:41 PM To: solr-user@lucene.apache.org Subject: Enrich search results with external data Hi everyone, We have a requirement to append external data (e.g. price/inventory of a product, retrieved from an ERP via web services) to query results and support sorting and pagination based on those external fields.
For example if Solr returns 100 records and the page size the user selects is 20, the sorting on the external fields should still be over all 100 records. This limits us from enriching search results outside of Solr. I guess this is a common problem, so hopefully someone could share their experience. I am considering using a PostFilter and enriching documents in the collect() method as below:

@Override
public void collect(int docId) throws IOException {
  DoubleField price = new DoubleField("PRICE", 1.23, Field.Store.YES);
  Document currentDoc = context.reader().document(docId);
  currentDoc.add(price);
}

but the result documents don't have "PRICE" fields. Did I miss anything here? I also did some research and it seems the approach mentioned here http://sujitpal.blogspot.com/2011/05/custom-sorting-in-solr-using-external.html is close to what we need to achieve, but since that post is 4 years old, I don't know if there's a better approach for our problem (we are using Solr 5.0)? Thanks, -Ha
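To make the Sort_By_Function pointer above concrete: if the external value can be exposed to Solr as something usable in a function query (for example an ExternalFileField that is periodically refreshed from the ERP, or the custom component from the slides), the sort can then be pushed into Solr and pagination works over the full result set. A hedged SolrJ sketch; the core name products and field name price_ext are made up:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class SortByExternalPrice {
  public static void main(String[] args) throws Exception {
    HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/products");
    SolrQuery query = new SolrQuery("category:books");
    // sort by a function over the externally maintained field, then paginate as usual
    query.addSort("field(price_ext)", SolrQuery.ORDER.desc);
    query.setStart(0);
    query.setRows(20);
    QueryResponse resp = client.query(query);
    resp.getResults().forEach(doc -> System.out.println(doc.getFieldValue("id")));
    client.close();
  }
}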
Re: Enrich search results with external data
Hi Ha, I am the author of the blog post you mention. To your question, I don't know if the code will work without change (since the Lucene/Solr API has evolved so much over the last few years), but a more preferred way, using Function Queries, may be found in the slides for Timothy Potter's talk here: http://www.slideshare.net/thelabdude/boosting-documents-in-solr-lucene-revolution-2011 Here he speaks of external fields stored in a database and accessed using a custom component (rather than from a flat file as in ExternalFileField), and of using function queries to influence the ranking based on the external field. However, per this document on function queries, you can use the output of a function query to sort as well, by passing the function to the sort parameter. https://wiki.apache.org/solr/FunctionQuery#Sort_By_Function Hope this helps, Sujit On Fri, Apr 10, 2015 at 10:38 PM, Ahmet Arslan iori...@yahoo.com.invalid wrote: Hi, Why don't you include/add/index those additional fields, at least the one used in sorting? Also, you may find https://stanbol.apache.org/docs/trunk/components/enhancer/ relevant. Ahmet On Saturday, April 11, 2015 1:04 AM, ha.p...@arvatosystems.com ha.p...@arvatosystems.com wrote: This ticket seems to address the problem I have https://issues.apache.org/jira/browse/SOLR-1566 and as the result of that ticket, DocTransformer was added in Solr 4.0. I wrote a simple DocTransformer and found that the transformer is executed AFTER pagination. In our application, we need the external fields added before sorting/pagination. I've looked around for an option to change the execution order but haven't had any luck. Does anyone know the solution? The ticket also states it is not possible for components to add fields to outgoing documents which are not in the stored fields of the document. Does anyone know if this is still true? Thanks, -Ha -Original Message- From: Pham, Ha Sent: Thursday, April 09, 2015 11:41 PM To: solr-user@lucene.apache.org Subject: Enrich search results with external data Hi everyone, We have a requirement to append external data (e.g. price/inventory of a product, retrieved from an ERP via web services) to query results and support sorting and pagination based on those external fields. For example if Solr returns 100 records and the page size the user selects is 20, the sorting on the external fields should still be over all 100 records. This limits us from enriching search results outside of Solr. I guess this is a common problem, so hopefully someone could share their experience. I am considering using a PostFilter and enriching documents in the collect() method as below:

@Override
public void collect(int docId) throws IOException {
  DoubleField price = new DoubleField("PRICE", 1.23, Field.Store.YES);
  Document currentDoc = context.reader().document(docId);
  currentDoc.add(price);
}

but the result documents don't have "PRICE" fields. Did I miss anything here? I also did some research and it seems the approach mentioned here http://sujitpal.blogspot.com/2011/05/custom-sorting-in-solr-using-external.html is close to what we need to achieve, but since that post is 4 years old, I don't know if there's a better approach for our problem (we are using Solr 5.0)? Thanks, -Ha
Re: Get the new terms of fields since last update
Hi Ludovic, A bit late to the party, sorry, but here is a bit of a riff off Eric's idea. Why not store the previous terms in a Bloom filter, and once you get the terms from this week, check to see which of them are not in the set - those are your new terms. Once you find the new terms, add them to the Bloom filter as well. Bloom filters are space efficient; by increasing the false positive rate you can make the filter consume even less space (more keys hash to the same element), and that is fine here since you are only concerned with finding whether something is definitely not in the set. -sujit On Fri, Dec 5, 2014 at 7:21 AM, lboutros boutr...@gmail.com wrote: The Apache Solr community is sooo great! Interesting problem with 3 interesting answers in less than 2 hours! Thank you all, really. Erik, I'm already saving the billion terms each week. It's hard to diff 1 billion terms. I'm already rebuilding the whole dictionaries each week in a custom distributed terms query handler. I'm saving the result in MongoDB in order to scroll through it quickly with the term position in the dictionary. It takes 3-4 hours each week. Now I would like to update the result in order to do it faster. Alex, I will check, this seems to be a good idea. Is it possible to filter terms with payloads in index readers? I did not see anything like that in my first investigation. I suppose it would take some additional disk space. Michael, this is the easiest way to do it. You are right. But I'm not sure that indexing twice and updating the dictionaries would be faster than the current process. But it is worth it to do some math ;) Ludovic. - Jouve France.
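A rough sketch of that idea using Guava's BloomFilter (Guava is an assumption here; any Bloom filter implementation would do). Note that a billion entries, even at a loose 3% false-positive rate, still needs on the order of a gigabyte of memory:

import com.google.common.base.Charsets;
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

public class NewTermDetector {
  // terms already seen in previous weeks, sized for ~1B entries with a loose
  // false-positive rate to keep the memory footprint down
  private final BloomFilter<CharSequence> seen =
      BloomFilter.create(Funnels.stringFunnel(Charsets.UTF_8), 1000000000, 0.03);

  public void addPrevious(Iterable<String> previousTerms) {
    for (String t : previousTerms) {
      seen.put(t);
    }
  }

  public void processThisWeek(Iterable<String> thisWeeksTerms) {
    for (String t : thisWeeksTerms) {
      if (!seen.mightContain(t)) {   // "definitely not seen before" -> a new term
        handleNewTerm(t);
        seen.put(t);
      }
    }
  }

  private void handleNewTerm(String term) {
    System.out.println("new term: " + term);
  }
}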
Re: What's the most efficient way to sort by number of terms matched?
Hi Trey, In an application I built a few years ago, I had a component that rewrote the input query into a Lucene BooleanQuery and we would set the minimumNumberShouldMatch value for the query. Worked well, but lately we are trying to move away from writing our own custom components, since maintaining them across releases becomes a bit of a chore. So lately we simulate this behavior in the client by constructing progressively smaller n-grams, OR'ing them, and then sending that to Solr. For your example, it becomes something like this: (python AND solr AND hadoop) OR (python AND solr) OR (solr AND hadoop) OR (python AND hadoop) OR (python) OR (solr) OR (hadoop). -sujit On Thu, Nov 6, 2014 at 7:25 AM, Ahmet Arslan iori...@yahoo.com.invalid wrote: Hi Trey, Not exactly the same, but we did something similar with (e)dismax's mm parameter, by auto-relaxing it. In your example, try with mm=3; if numFound < 20 then try with mm=2, etc. Ahmet On Thursday, November 6, 2014 8:41 AM, Trey Grainger solrt...@gmail.com wrote: Just curious if there are some suggestions here. The use case is fairly simple: Given a query like python OR solr OR hadoop, I want to sort results by number of keywords matched first, and by relevancy separately. I can think of ways to do this, but not efficiently. For example, I could do: q=python OR solr OR hadoop&p1=python&p2=solr&p3=hadoop&sort=sum(if(query($p1,0),1,0),if(query($p2,0),1,0),if(query($p3,0),1,0)) desc, score desc Other than the obvious downside that this requires me to pre-parse the user's query, it's also somewhat inefficient to run the query function once for each term in the original query, since it is re-executing multiple queries and looping through every document in the index during scoring. Ideally, I would be able to do something like the below, which could just pull the count of unique matched terms from the main query (q parameter) execution: q=python OR solr OR hadoop&sort=uniquematchedterms() desc,score desc. I don't think anything like this exists, but would love some suggestions if anyone else has solved this before. Thanks, -Trey
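A small sketch of the client-side rewrite described above - generate every AND group from largest to smallest and OR them together. Documents matching the larger groups match more clauses and therefore tend to score higher:

import java.util.ArrayList;
import java.util.List;

public class CombinationQueryBuilder {
  // For ["python", "solr", "hadoop"] this produces the progressively smaller
  // AND groups OR'ed together, largest groups first, as in the example above.
  public static String build(List<String> terms) {
    List<String> clauses = new ArrayList<>();
    int n = terms.size();
    for (int size = n; size >= 1; size--) {
      for (int mask = 0; mask < (1 << n); mask++) {
        if (Integer.bitCount(mask) != size) continue;
        List<String> subset = new ArrayList<>();
        for (int i = 0; i < n; i++) {
          if ((mask & (1 << i)) != 0) subset.add(terms.get(i));
        }
        clauses.add("(" + String.join(" AND ", subset) + ")");
      }
    }
    return String.join(" OR ", clauses);
  }
}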
Re: Query on Facet
Hi Smitha, Have you looked at facet queries? They allow you to attach Solr queries to facets. The problem with this is that you will need to know all possible combinations of language and binding (or make an initial query to find this information). https://wiki.apache.org/solr/SimpleFacetParameters#facet.query_:_Arbitrary_Query_Faceting Another alternative could be to bake language+binding pairs into a field in your index and facet on that. -sujit On Wed, Jul 30, 2014 at 7:01 AM, vamshi kiran mothevamshiki...@gmail.com wrote: Hi Alex, As you said, if we exclude the language facet field, it will get all the language facets with counts, right? It will not filter by the binding facet field of type 'paperback'; how can we do this? Thanks Regards, Vamshi. On Jul 30, 2014 4:11 PM, Alexandre Rafalovitch arafa...@gmail.com wrote: I am not sure I fully understood your question, but I would start by looking at Tagging and Excluding first: https://wiki.apache.org/solr/SimpleFacetParameters#Tagging_and_excluding_Filters Regards, Alex. Personal: http://www.outerthoughts.com/ and @arafalov Solr resources and newsletter: http://www.solr-start.com/ and @solrstart Solr popularizers community: https://www.linkedin.com/groups?gid=6713853 On Wed, Jul 30, 2014 at 5:07 PM, Smitha Rajiv smitharaji...@gmail.com wrote: Hi, I need some help on Solr faceting. How do I facet on two fields at the same time to get combination facets and their counts? I'm using the below query to get facets with a combination of language and its binding. But right now I'm getting only the selected facet in the facet list of each field, with its count. For e.g. in language facets the query is returning English and its count. Instead I need to get the other language facets which satisfy the binding type of paperback: http://localhost:8080/solr/collection1/select?q=software%20testing&fq=language%3A(%22English%22)&fq=Binding%3A(%22paperback%22)&facet=true&facet.mincount=1&facet.field=Language&facet.field=latestArrivals&facet.field=Binding&wt=json&indent=true&defType=edismax&json.nl=map Please provide me your inputs. Thanks Regards, Smitha
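A quick SolrJ sketch of the facet.query option; the "Hindi" value is just a stand-in for whichever language+binding combinations you enumerate (or discover with an initial query):

import java.util.Map;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class CombinationFacets {
  public static void main(String[] args) throws Exception {
    HttpSolrServer server = new HttpSolrServer("http://localhost:8080/solr/collection1");
    SolrQuery query = new SolrQuery("software testing");
    query.set("defType", "edismax");
    query.setFacet(true);
    query.setFacetMinCount(1);
    // one facet.query per language+binding combination you care about
    query.addFacetQuery("Language:\"English\" AND Binding:\"paperback\"");
    query.addFacetQuery("Language:\"Hindi\" AND Binding:\"paperback\"");
    QueryResponse resp = server.query(query);
    Map<String, Integer> counts = resp.getFacetQuery();   // facet query string -> count
    System.out.println(counts);
    server.shutdown();
  }
}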
Re: Implementing custom analyzer for multi-language stemming
Hi Eugene, In a system we built couple of years ago, we had a corpus of English and French mixed (and Spanish on the way but that was implemented by client after we handed off). We had different fields for each language. So (title, body) for English docs was (title_en, body_en), for French (title_fr, body_fr) and for Spanish (title_es, body_es) - each of these were associated with a different Analyzer (that was associated with the field types in schema.xml, in case of Lucene you can use PerFieldAnalyzerWrapper). Our pipeline used Google translate to detect the language and write the contents into the appropriate field set for the language. Our analyzers were custom - but Lucene/Solr provides analyzer chains for many major languages. You can find a list here: https://wiki.apache.org/solr/LanguageAnalysis -sujit On Wed, Jul 30, 2014 at 10:52 AM, Chris Morley ch...@depahelix.com wrote: I know BasisTech.com has a plugin for elasticsearch that extends stemming/lemmatization to work across 40 natural languages. I'm not sure what they have for Solr, but I think something like that may exist as well. Cheers, -Chris. From: Eugene beyondcomp...@gmail.com Sent: Wednesday, July 30, 2014 1:48 PM To: solr-user@lucene.apache.org Subject: Implementing custom analyzer for multi-language stemming Hello, fellow Solr and Lucene users and developers! In our project we receive text from users in different languages. We detect language automatically and use Google Translate APIs a lot (so having arbitrary number of languages in our system doesn't concern us). However we need to be able to search using stemming. Having nearly hundred of fields (several fields for each language with language-specific stemmers) listed in our search query is not an option. So we need a way to have a single index which has stemmed tokens for different languages. I have two questions: 1. Are there already (third-party) custom multi-language stemming analyzers? (I doubt that no one else ran into this issue) 2. If I'm going to implement such analyzer myself, could you please suggest a better way to 'pass' detected language value into such analyzer? Detecting language in analyzer itself is not an option, because: a) we already detect it in other place b) we do it based on combined values of many fields ('name', 'topic', 'description', etc.), while current field can be to short for reliable detection c) sometimes we just want to specify language explicitly. The obvious hack would be to prepend ISO 639-1 code to field value. But I'd like to believe that Solr allows for cleaner solution. I could think about either: a) custom query parameter (but I guess, it will require modifying request handlers, etc. which is highly undesirable) b) getting value from other field (we obviously have 'language' field and we do not have mixed-language records). If it is possible, could you please describe the mechanism for doing this or point to relevant code examples? Thank you very much and have a good day!
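For the Lucene route mentioned above, a minimal PerFieldAnalyzerWrapper sketch (assuming a recent Lucene 5.x where the Version-less analyzer constructors exist; the field names follow the _en/_fr convention described in this thread):

import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.fr.FrenchAnalyzer;
import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;

public class PerLanguageFields {
  public static void main(String[] args) {
    Map<String, Analyzer> perField = new HashMap<String, Analyzer>();
    perField.put("title_en", new EnglishAnalyzer());
    perField.put("body_en", new EnglishAnalyzer());
    perField.put("title_fr", new FrenchAnalyzer());
    perField.put("body_fr", new FrenchAnalyzer());
    // anything not listed falls back to the default analyzer
    Analyzer analyzer = new PerFieldAnalyzerWrapper(new StandardAnalyzer(), perField);
    IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
    // pass iwc to an IndexWriter as usual
  }
}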
Re: Any Solrj API to obtain field list?
Have you looked at IndexSchema? That would offer you methods to query index metadata. http://lucene.apache.org/solr/4_7_2/solr-core/org/apache/solr/schema/IndexSchema.html -sujit On Tue, May 27, 2014 at 1:56 PM, T. Kuro Kurosaka k...@healthline.com wrote: I'd like to write Solr client code that writes text to a language-specific field, say, myfield_es for Spanish, if the field myfield_es is defined in schema.xml, and otherwise to a fall-back field myfield. To do this, I need to obtain a list of defined fields (and dynamic fields) from the server. But I cannot find a suitable SolrJ API. Is there any? I'm using Solr 4.6.1. I could write code to use the Schema REST API (https://wiki.apache.org/solr/SchemaRESTAPI) but I would much prefer to use existing code if it exists. -- T. Kuro Kurosaka • Senior Software Engineer
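Another option that already works with 4.x SolrJ (so it applies to 4.6.1) is the Luke request handler, which reports the fields the server knows about. A hedged sketch; /admin/luke reflects the index plus (optionally) the schema, and the core URL is an assumption:

import java.util.Map;

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.LukeRequest;
import org.apache.solr.client.solrj.response.LukeResponse;

public class ListFields {
  public static void main(String[] args) throws Exception {
    HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
    LukeRequest request = new LukeRequest();
    request.setNumTerms(0);               // we only want field names, not term stats
    LukeResponse response = request.process(server);
    for (Map.Entry<String, LukeResponse.FieldInfo> e : response.getFieldInfo().entrySet()) {
      System.out.println(e.getKey() + " (" + e.getValue().getType() + ")");
    }
    server.shutdown();
  }
}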
Re: How to apply Semantic Search in Solr
Hi Sohan, Given you have 15 days and this looks like a class project, I would suggest going with John Berryman's approach - he also provides code which you can just apply to your data. Even if you don't get the exact expansions you desire, I think you will get results that will pleasantly surprise you :-). -sujit On Mon, Mar 10, 2014 at 11:07 PM, Sohan Kalsariya sohankalsar...@gmail.com wrote: Hey Sujit, thanks a lot. But what do you think about the Berryman blog post? Is it feasible to apply, or should I apply the synonym stuff? Which one is good? And the 3rd approach you told me about seems difficult and time-consuming for students like me, as I will have to submit this in the next 15 days. Please suggest me something. On Tue, Mar 11, 2014 at 5:12 AM, Sujit Pal sujit@comcast.net wrote: Hi Sohan, You would be the best person to answer your question of how to proceed :-). For your original query "musical events in New York" to be rewritten to "musical nights at ABC place" OR "concerts events" OR "classical music event", you would have to build into your knowledge base that ABC place is a synonym for New York, and that "musical event at New York" is a synonym for "concerts events" and "classical music event". You can do this using approach #1 (from the Berryman blog post) and approach #2 (my first suggestion), but these results are not guaranteed - because your corpus may not contain this relationship. Approach #3 (my second suggestion) involves lots of work and possibly domain knowledge, but gives much cleaner relationships. OTOH, you could get away with it for this one query by adding the three queries into your synonyms.txt and enabling synonym support in Solr. http://stackoverflow.com/questions/18790256/solr-synonym-not-working So how much effort you put into supporting this feature would be dictated by how important it is to your environment - that is a question only you can answer. -sujit On Sun, Mar 9, 2014 at 11:26 PM, Sohan Kalsariya sohankalsar...@gmail.com wrote: Thanks Sujit and all for your views about semantic search in Solr. But how do I proceed, I mean how do I start things off to get on track? On Sat, Mar 8, 2014 at 10:50 PM, Sujit Pal sujit@comcast.net wrote: Thanks for sharing this link Sohan, it's an interesting approach. Since you have effectively defined what you mean by Semantic Search, there are a couple of other approaches I know of to do something like this: 1) preprocess your documents looking for terms that co-occur in the same document. The more such co-occurrences you find, the more strongly these terms are related (this can help with ordering related terms from most related to least related). At query time expand the query to include the /most/ related concepts and search. 2) use an external knowledge base such as a taxonomy that indicates relationships between concepts (this is the approach we use). At query time expand the query to include related concepts and search. -sujit On Sat, Mar 8, 2014 at 8:21 AM, Sohan Kalsariya sohankalsar...@gmail.com wrote: Basically, when I searched it on Google I got this result: http://www.opensourceconnections.com/2013/08/25/semantic-search-with-solr-and-python-numpy/ And I am working on this. So is this useful? On Sat, Mar 8, 2014 at 3:11 PM, Alexandre Rafalovitch arafa...@gmail.com wrote: And how would it know to give you those results? Obviously, you have some sort of magic/algorithm in your mind. Are you doing geographic location match, category match, synonyms match? We can't really help with generic questions.
You still need to figure out what semantic means for you specifically. Regards, Alex. Personal website: http://www.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book) On Sat, Mar 8, 2014 at 4:27 PM, Sohan Kalsariya sohankalsar...@gmail.com wrote: Hello, I am working on an event listing and promotions website( http://allevents.in) and I want to apply semantic search on solr. For example, if someone search : Musical Events in New York So it would give me results such as : * Musical Night at ABC place * Concerts Events * Classical Music Event I mean all results should be Semantic to the Search_Query it should not give the results only based on tf-idf. So can you please make me understand how do i proceed to apply Semantic Search in Solr. ( allevents.in
Re: How to apply Semantic Search in Solr
Hi Sohan, You would be the best person to answer your question of how to proceed :-). For your original query "musical events in New York" to be rewritten to "musical nights at ABC place" OR "concerts events" OR "classical music event", you would have to build into your knowledge base that ABC place is a synonym for New York, and that "musical event at New York" is a synonym for "concerts events" and "classical music event". You can do this using approach #1 (from the Berryman blog post) and approach #2 (my first suggestion), but these results are not guaranteed - because your corpus may not contain this relationship. Approach #3 (my second suggestion) involves lots of work and possibly domain knowledge, but gives much cleaner relationships. OTOH, you could get away with it for this one query by adding the three queries into your synonyms.txt and enabling synonym support in Solr. http://stackoverflow.com/questions/18790256/solr-synonym-not-working So how much effort you put into supporting this feature would be dictated by how important it is to your environment - that is a question only you can answer. -sujit On Sun, Mar 9, 2014 at 11:26 PM, Sohan Kalsariya sohankalsar...@gmail.com wrote: Thanks Sujit and all for your views about semantic search in Solr. But how do I proceed, I mean how do I start things off to get on track? On Sat, Mar 8, 2014 at 10:50 PM, Sujit Pal sujit@comcast.net wrote: Thanks for sharing this link Sohan, it's an interesting approach. Since you have effectively defined what you mean by Semantic Search, there are a couple of other approaches I know of to do something like this: 1) preprocess your documents looking for terms that co-occur in the same document. The more such co-occurrences you find, the more strongly these terms are related (this can help with ordering related terms from most related to least related). At query time expand the query to include the /most/ related concepts and search. 2) use an external knowledge base such as a taxonomy that indicates relationships between concepts (this is the approach we use). At query time expand the query to include related concepts and search. -sujit On Sat, Mar 8, 2014 at 8:21 AM, Sohan Kalsariya sohankalsar...@gmail.com wrote: Basically, when I searched it on Google I got this result: http://www.opensourceconnections.com/2013/08/25/semantic-search-with-solr-and-python-numpy/ And I am working on this. So is this useful? On Sat, Mar 8, 2014 at 3:11 PM, Alexandre Rafalovitch arafa...@gmail.com wrote: And how would it know to give you those results? Obviously, you have some sort of magic/algorithm in your mind. Are you doing geographic location match, category match, synonyms match? We can't really help with generic questions. You still need to figure out what semantic means for you specifically. Regards, Alex. Personal website: http://www.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book) On Sat, Mar 8, 2014 at 4:27 PM, Sohan Kalsariya sohankalsar...@gmail.com wrote: Hello, I am working on an event listing and promotions website (http://allevents.in) and I want to apply semantic search on Solr. For example, if someone searches: Musical Events in New York it would give results such as: * Musical Night at ABC place * Concerts Events * Classical Music Event I mean all the results should be semantically related to the Search_Query; it should not give the results only based on tf-idf.
So can you please make me understand how do i proceed to apply Semantic Search in Solr. ( allevents.in) -- Regards, *Sohan Kalsariya* -- Regards, *Sohan Kalsariya* -- Regards, *Sohan Kalsariya*
Re: How to apply Semantic Search in Solr
Thanks for sharing this link Sohan, it's an interesting approach. Since you have effectively defined what you mean by Semantic Search, there are a couple of other approaches I know of to do something like this: 1) preprocess your documents looking for terms that co-occur in the same document. The more such co-occurrences you find, the more strongly these terms are related (this can help with ordering related terms from most related to least related). At query time expand the query to include the /most/ related concepts and search. 2) use an external knowledge base such as a taxonomy that indicates relationships between concepts (this is the approach we use). At query time expand the query to include related concepts and search. -sujit On Sat, Mar 8, 2014 at 8:21 AM, Sohan Kalsariya sohankalsar...@gmail.com wrote: Basically, when I searched it on Google I got this result: http://www.opensourceconnections.com/2013/08/25/semantic-search-with-solr-and-python-numpy/ And I am working on this. So is this useful? On Sat, Mar 8, 2014 at 3:11 PM, Alexandre Rafalovitch arafa...@gmail.com wrote: And how would it know to give you those results? Obviously, you have some sort of magic/algorithm in your mind. Are you doing geographic location match, category match, synonyms match? We can't really help with generic questions. You still need to figure out what semantic means for you specifically. Regards, Alex. Personal website: http://www.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book) On Sat, Mar 8, 2014 at 4:27 PM, Sohan Kalsariya sohankalsar...@gmail.com wrote: Hello, I am working on an event listing and promotions website (http://allevents.in) and I want to apply semantic search on Solr. For example, if someone searches: Musical Events in New York it would give results such as: * Musical Night at ABC place * Concerts Events * Classical Music Event I mean all the results should be semantically related to the Search_Query; it should not give the results only based on tf-idf. So can you please make me understand how do I proceed to apply Semantic Search in Solr. (allevents.in) -- Regards, *Sohan Kalsariya* -- Regards, *Sohan Kalsariya*
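A small sketch of approach (1) above - count within-document term co-occurrence and use the top co-occurring terms of each query term as expansion candidates. The tokenization/normalization step is assumed to have happened already:

import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class CooccurrenceCounter {
  // tokenizedDocs: one list of (normalized) terms per document
  public static Map<String, Map<String, Integer>> count(List<List<String>> tokenizedDocs) {
    Map<String, Map<String, Integer>> cooc = new HashMap<>();
    for (List<String> docTerms : tokenizedDocs) {
      Set<String> unique = new HashSet<>(docTerms);
      for (String a : unique) {
        for (String b : unique) {
          if (a.equals(b)) continue;
          cooc.computeIfAbsent(a, k -> new HashMap<>()).merge(b, 1, Integer::sum);
        }
      }
    }
    // cooc.get(term) maps co-occurring terms to counts; sort by count to rank expansions
    return cooc;
  }
}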
Re: Multivalued true Error?
Hi Furkan, In the stock definition of the payload field: http://svn.apache.org/viewvc/lucene/dev/trunk/solr/example/solr/collection1/conf/schema.xml?view=markup the analyzer for the payloads field type is a WhitespaceTokenizerFactory followed by a DelimitedPayloadTokenFilterFactory. So if you send it a string "foo$score1 bar$score2 ..." where foo and bar are string tokens, score1 and score2 are payload scores, and $ is your delimiter, the analyzer will tokenize it into multiple tokens with payloads, and you should be able to run the tests in the blog post. So you shouldn't make it multiValued, AFAIK. -sujit On Tue, Nov 26, 2013 at 8:44 AM, Furkan KAMACI furkankam...@gmail.com wrote: Hi; I've ported this example from Scala into Java: http://sujitpal.blogspot.com/2013/07/porting-payloads-to-solr4.html#! However, should the field be multiValued=true in that example? PS: I use Solr 4.5.1 Thanks; Furkan KAMACI
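A short SolrJ sketch of what the indexing side of that looks like (Solr 4.x era client; the field name payloads, the $ delimiter, and the token/score values are all just examples):

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class PayloadIndexer {
  public static void main(String[] args) throws Exception {
    HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "doc1");
    // one single-valued string; the field's analyzer splits on whitespace and
    // peels the payload off each token at the '$' delimiter
    doc.addField("payloads", "strength$0.75 dosage$0.25");
    server.add(doc);
    server.commit();
    server.shutdown();
  }
}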
Re: Why do people want to deploy to Tomcat?
In our case, it is because all our other applications are deployed on Tomcat and ops is familiar with the deployment process. We also had customizations that needed to go in, so we inserted our custom JAR into the solr.war's WEB-INF/lib directory, so to ops the process of deploying Solr was (almost, except for schema.xml or solrconfig.xml changes) identical to any of the other apps. But I think if Solr becomes a server with clearly defined extension points (such as dropping your custom JARs into lib/ and custom configuration in conf/solrconfig.xml or similar like it already is) then it will be treated as something other than a webapp and the expectation that it runs on Tomcat will not apply. Just my $0.02... Sujit On Tue, Nov 12, 2013 at 9:13 AM, Siegfried Goeschl sgoes...@gmx.at wrote: Hi ALex, in my case * ignorance that Tomcat is not fully supported * Tomcat configuration and operations know-how inhouse * could migrate to Jetty but need approved change request to do so Cheers, Siegfried Goeschl On 12.11.13 04:54, Alexandre Rafalovitch wrote: Hello, I keep seeing here and on Stack Overflow people trying to deploy Solr to Tomcat. We don't usually ask why, just help when where we can. But the question happens often enough that I am curious. What is the actual business case. Is that because Tomcat is well known? Is it because other apps are running under Tomcat and it is ops' requirement? Is it because Tomcat gives something - to Solr - that Jetty does not? It might be useful to know. Especially, since Solr team is considering making the server part into a black box component. What use cases will that break? So, if somebody runs Solr under Tomcat (or needed to and gave up), let's use this thread to collect this knowledge. Regards, Alex. Personal website: http://www.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)
Re: Solr language-dependent sort
Hi Lisheng, We did something similar in Solr using a custom handler (but I think you could just build a custom QueryParser to do this), but you could do this in your application as well, i.e., get the language and then rewrite your query to use the language-specific fields. Come to think of it, the QueryParser would probably be sufficiently general to qualify as a patch for custom functionality. -sujit On Apr 8, 2013, at 12:28 PM, Zhang, Lisheng wrote: Hi, I found that in Solr we need to define a special fieldType for each language (http://wiki.apache.org/solr/UnicodeCollation), then point a field to this type. But in our application one field (like 'title') can be used by various users for their own languages (user1 uses it for English, user2 uses it for Japanese, ...), so it is even difficult for us to use a dynamic field. We would prefer to pass in a parameter like language='en' at run time, and then the Solr API could use this parameter to call the Lucene API to sort a field. This approach would be much more flexible (we programmed this way when using Lucene directly)? Thanks very much for your help, Lisheng
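A tiny sketch of the application-side rewrite suggested above; the per-language collated sort fields (title_sort_en, title_sort_ja, ...) are assumed to be defined in schema.xml with the appropriate collation field types:

import org.apache.solr.client.solrj.SolrQuery;

public class LanguageAwareSort {
  // lang comes in as a request parameter ("en", "ja", ...)
  public static SolrQuery build(String userQuery, String lang) {
    SolrQuery query = new SolrQuery(userQuery);
    query.addSort("title_sort_" + lang, SolrQuery.ORDER.asc);
    return query;
  }
}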
Re: Solr Sorting is not working properly on long Fields
Hi ballusethuraman, I am sure you have done this already, but just to be sure, did you reindex your existing kilometer data after you changed the data type from string to long? If not, then you should. -sujit On Mar 23, 2013, at 11:21 PM, ballusethuraman wrote: Hi, I am having a column named 'Kilometers' and when I try to sort with that it is not working properly.The values in 'Kilometers' column are,Kilometers171119792365611Values in 'Kilometers' after soting are Kilometers979236561117111The Problem here is, when 97 is compared with 923 it is taking 97 as bigger number since 97 is greater than 923. Initially Kilometers column was having string as datatype and I thought the problem could be because of that and i changed the datatype of that column to 'long'. Even then i couldn't see any change in the results.But when I insert values which are having same number of digits say, 1, 2, 3,4,5Kilometers21452 when i try to sort now it is working perfectlyKilometers12345Datatypes that I have tries are, Can anyone helpme to get rid out of this problem... Thanks in Advance -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-Sorting-is-not-working-properly-on-long-Fields-tp4050833.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Matching an exact word
You could also do this outside Solr, in your client. If your query is surrounded by quotes, then strip away the quotes and make q=text_exact_field:your_unquoted_query. Probably better to do it outside Solr in general, keeping in mind the upgrade path. -sujit On Feb 21, 2013, at 12:20 PM, Van Tassell, Kristian wrote: Thank you. So essentially I need to write a custom query parser (extending upon something like the QParser)? -Original Message- From: Upayavira [mailto:u...@odoko.co.uk] Sent: Thursday, February 21, 2013 12:22 PM To: solr-user@lucene.apache.org Subject: Re: Matching an exact word Solr will only match on the terms as they are in the index. If it is stemmed in the index, it will match that. If it isn't, it'll match that. All term matches are (by default at least) exact matches. Only with stemming, you are doing an exact match against the stemmed term. Therefore, there really is no way to do what you are looking for within Solr. I'd suggest you'll need to do some parsing at your side and, if you find quotes, do the query against a different field. Upayavira On Thu, Feb 21, 2013, at 06:17 PM, Van Tassell, Kristian wrote: I'm trying to match the word "created". Given that it is surrounded by quotes, I would expect an exact match to occur, but instead the entire set of stemming results shows up, for words such as create, creates, created, etc.: q="created"&wt=xml&rows=1000&qf=text&defType=edismax If I copy the text field to a new one that does not stem words, text_exact for example, I get the expected results: q="created"&wt=xml&rows=1000&qf=text_exact&defType=edismax I would like the decision whether to match exactly or not to be determined by the quotes rather than the qf parameter (e.g., not have to use it at all). What topic do I need to look into more to understand this? Thanks in advance!
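A small sketch of the client-side switch suggested at the top of this thread - route quoted input to the unstemmed field and everything else to the stemmed one (the text and text_exact names follow the field names used above):

import org.apache.solr.client.solrj.SolrQuery;

public class ExactMatchSwitch {
  public static SolrQuery build(String userInput) {
    String input = userInput.trim();
    SolrQuery query = new SolrQuery();
    query.set("defType", "edismax");
    if (input.length() >= 2 && input.startsWith("\"") && input.endsWith("\"")) {
      // quoted: strip the quotes and search the unstemmed field
      query.setQuery(input.substring(1, input.length() - 1));
      query.set("qf", "text_exact");
    } else {
      query.setQuery(input);
      query.set("qf", "text");
    }
    return query;
  }
}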
Re: Can Solr analyze content and find dates and places
Hi Bart, Like I said, I didn't actually hook my UIMA stuff into Solr, content and queries are annotated before they reach Solr. What you describe sounds like a classpath problem (but of course you already knew that :-)). Since I haven't actually done what you are trying to do, here are some suggestions, they may or may not work... 1) package up the XML files into your custom JAR at the top level, that way you don't need to specify it as /RoomNumberAnnotator.xml. 2) if you are using solr4, then you should drop your custom JAR into $SOLR_HOME/collection1/lib, not $SOLR_HOME/lib. -sujit On Feb 11, 2013, at 9:40 AM, jazz wrote: Hi Sujit and others who answered my question, I have been working on the UIMA path which seems great with the available Eclipse tooling and this: http://sujitpal.blogspot.nl/2011/03/smart-query-parsing-with-uima.html Now I worked through the UIMA tutorial of the RoomNumberAnnotator: http://uima.apache.org/doc-uima-annotator.html And I am able to test it using the UIMA CAS Visual Debugger. So far so good. But now I want to use the new RoomNumberAnnotator with Solr, and it cannot find the XML file and the Java class (they are in the correct lib directories, because the WhitespaceTokenizer works fine).

<updateRequestProcessorChain name="uima">
  <processor class="org.apache.solr.uima.processor.UIMAUpdateRequestProcessorFactory">
    <lst name="uimaConfig">
      <lst name="runtimeParameters">
      </lst>
      <str name="analysisEngine">/RoomNumberAnnotator.xml</str>
      <bool name="ignoreErrors">false</bool>
      <lst name="analyzeFields">
        <bool name="merge">false</bool>
        <arr name="fields">
          <str>content</str>
        </arr>
      </lst>
      <lst name="fieldMappings">
        <lst name="type">
          <str name="name">org.apache.uima.tutorial.RoomNumber</str>
          <lst name="mapping">
            <str name="feature">building</str>
            <str name="field">UIMAname</str>
          </lst>
        </lst>
      </lst>
    </lst>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

On the Wiki (http://wiki.apache.org/solr/SolrUIMA) this is mentioned, but it fails: Deploy new jars inside one of the lib directories. Run 'ant clean dist' (or 'mvn clean package') from the solr/contrib/uima path. Is it needed to deploy the new jar (RoomAnnotator.jar)? If yes, which branch can I check out? This is the stable release I am running: Solr 4.1.0 1434440 - sarowe - 2013-01-16 17:21:36 Regards, Bart On 8 Feb 2013, at 22:11, SUJIT PAL wrote: Hi Bart, I did some work with UIMA but this was to annotate the data before it goes to Lucene/Solr, i.e. not built as an UpdateRequestProcessor. I just looked through the SolrUima wiki page [http://wiki.apache.org/solr/SolrUIMA] and I believe you will have to set up your own aggregate analysis chain in place of the one currently configured. Writing UIMA annotators is very simple (there is a tutorial here: [http://uima.apache.org/downloads/releaseDocs/2.1.0-incubating/docs/html/tutorials_and_users_guides/tutorials_and_users_guides.html]). You provide the XML description for the annotation and let UIMA generate the annotation bean. You write Java code for the annotator and also the annotator XML descriptor. UIMA uses the annotator XML descriptor to instantiate and run your annotator. Overall, it sounds really complicated but it's actually quite simple. The tutorial has quite a few examples that you will find useful, but in case you need more, I have some on this github repository: [https://github.com/sujitpal/tgni/tree/master/src/main/java/com/mycompany/tgni/analysis/uima] The dictionary and pattern annotators may be similar to what you are looking for (date and city annotators).
Best regards, Sujit On Feb 8, 2013, at 8:50 AM, Bart Rijpers wrote: Hi Alex, Indeed that is exactly what I am trying to achieve using wordcities. Date will be simple: 16-Jan becomes 16-Jan-2013 in a new dynamic field. But how do I integrate the Java library as UIMA? The documentation about changing schema.xml and solr.xml is not very detailed. Regards, Bart On 8 Feb 2013, at 16:57, Alexandre Rafalovitch arafa...@gmail.com wrote: Hi Bart, I haven't done any UIMA work (I used other stuff for my NLP phase), so not sure I can help much further. But in general, you are venturing into pure research territory here. Even for dates, what do you actually mean? Just fixed expression? Relative dates (e.g. last tuesday?). What about times (7pm?). Same with cities. If you want it offline, you need the gazetteer and disambiguation modules. Gazetteer for cities (worldwide) is huge and has a lot of duplicate names (Paris, Ontario is apparently a short drive from London, Ontario eh?). Something like http://www.maxmind.com/en/worldcities? And disambiguation
Re: Can Solr analyze content and find dates and places
Cool! Thanks for the update, this will help if I ever go all the way with UIMA and Solr. -sujit On Feb 11, 2013, at 12:13 PM, jazz wrote: Hi Sujit, Thanks for your help! I moved the RoomNumberAnnotator.xml to the top level of the jar and used the same solrconfig.xml (with the /). Now it works perfect. Best regards, Bart On 11 Feb 2013, at 20:13, SUJIT PAL wrote: Hi Bart, Like I said, I didn't actually hook my UIMA stuff into Solr, content and queries are annotated before they reach Solr. What you describe sounds like a classpath problem (but of course you already knew that :-)). Since I haven't actually done what you are trying to do, here are some suggestions, they may or may not work... 1) package up the XML files into your custom JAR at the top level, that way you don't need to specify it as /RoomNumberAnnotator.xml. 2) if you are using solr4, then you should drop your custom JAR into $SOLR_HOME/collection1/lib, not $SOLR_HOME/lib. -sujit On Feb 11, 2013, at 9:40 AM, jazz wrote: Hi Sujit and others who answered my question, I have been working on the UIMA path which seems great with the available Eclipse tooling and this: http://sujitpal.blogspot.nl/2011/03/smart-query-parsing-with-uima.html Now I worked through the UIMA tutorial of the RoomNumberAnnotator: http://uima.apache.org/doc-uima-annotator.html And I am able to test it using the UIMA CAS Virtuall Debugger. So far so good. But, now I want to use the new RoomNumberAnnotator with Solr, but it cannot find the xml file and the Java class (they are in the correct lib directories, because the WhitespaceTokenizer works fine). updateRequestProcessorChain name=uima processor class=org.apache.solr.uima.processor.UIMAUpdateRequestProcessorFactory lst name=uimaConfig lst name=runtimeParameters /lst str name=analysisEngine/RoomNumberAnnotator.xml/str bool name=ignoreErrorsfalse/bool lst name=analyzeFields bool name=mergefalse/bool arr name=fields strcontent/str /arr /lst lst name=fieldMappings lst name=type str name=nameorg.apache.uima.tutorial.RoomNumber/str lst name=mapping str name=featurebuilding/str str name=fieldUIMAname/str /lst /lst /lst /lst /processor processor class=solr.LogUpdateProcessorFactory / processor class=solr.RunUpdateProcessorFactory / On the Wiki (http://wiki.apache.org/solr/SolrUIMA) this is mentioned but it fails: Deploy new jars inside one of the lib directories Run 'ant clean dist' (or 'mvn clean package') from the solr/contrib/uima path. Is it needed to deploy the new jar (RoomAnnotator.jar)? If yes, which branch can I checkout? This is the Stable release I am running: Solr 4.1.0 1434440 - sarowe - 2013-01-16 17:21:36 Regards, Bart On 8 Feb 2013, at 22:11, SUJIT PAL wrote: Hi Bart, I did some work with UIMA but this was to annotate the data before it goes to Lucene/Solr, ie not built as a UpdateRequestProcessor. I just looked through the SolrUima wiki page [http://wiki.apache.org/solr/SolrUIMA] and I believe you will have to set up your own aggregate analysis chain in place of the one currently configured. Writing UIMA annotators is very simple (there is a tutorial here: [http://uima.apache.org/downloads/releaseDocs/2.1.0-incubating/docs/html/tutorials_and_users_guides/tutorials_and_users_guides.html]). You provide the XML description for the annotation and let UIMA generate the annotation bean. You write Java code for the annotator and also the annotator XML descriptor. UIMA uses the annotator XML descriptor to instantiate and run your annotator. Overall, sounds really complicated but its actually quite simple. 
The tutorial has quite a few examples that you will find useful, but in case you need more, I have some on this github repository: [https://github.com/sujitpal/tgni/tree/master/src/main/java/com/mycompany/tgni/analysis/uima] The dictionary and pattern annotators may be similar to what you are looking for (date and city annotators). Best regards, Sujit On Feb 8, 2013, at 8:50 AM, Bart Rijpers wrote: Hi Alex, Indeed that is exactly what I am trying to achieve using wordcities. Date will be simple: 16-Jan becomes 16-Jan-2013 in a new dynamic field. But how do I integrate the Java library as UIMA? The documentation about changing schema.xml and solr.xml is not very detailed. Regards, Bart On 8 Feb 2013, at 16:57, Alexandre Rafalovitch arafa...@gmail.com wrote: Hi Bart, I haven't done any UIMA work (I used other stuff for my NLP phase), so not sure I can help much further. But in general, you are venturing into pure research territory here. Even for dates, what do you actually mean? Just fixed expression? Relative dates (e.g
Re: Crawl Anywhere -
Hi Siva, You will probably get a better reply if you head over to the Nutch mailing list [http://nutch.apache.org/mailing_lists.html] and ask there. Nutch 2.1 may be what you are looking for (it stores pages in a NoSQL database). Regards, Sujit On Feb 10, 2013, at 9:16 PM, SivaKarthik wrote: Dear Erick, Thanks for your reply. Yes, Nutch can meet my requirement, but the problem is, I want to store the crawled documents in HTML or XML format instead of the MapReduce format. I am not sure whether Nutch plugins are available to convert into XML files. Please share if you have any idea. Thank you
Re: Can Solr analyze content and find dates and places
Hi Bart, I did some work with UIMA but this was to annotate the data before it goes to Lucene/Solr, ie not built as a UpdateRequestProcessor. I just looked through the SolrUima wiki page [http://wiki.apache.org/solr/SolrUIMA] and I believe you will have to set up your own aggregate analysis chain in place of the one currently configured. Writing UIMA annotators is very simple (there is a tutorial here: [http://uima.apache.org/downloads/releaseDocs/2.1.0-incubating/docs/html/tutorials_and_users_guides/tutorials_and_users_guides.html]). You provide the XML description for the annotation and let UIMA generate the annotation bean. You write Java code for the annotator and also the annotator XML descriptor. UIMA uses the annotator XML descriptor to instantiate and run your annotator. Overall, sounds really complicated but its actually quite simple. The tutorial has quite a few examples that you will find useful, but in case you need more, I have some on this github repository: [https://github.com/sujitpal/tgni/tree/master/src/main/java/com/mycompany/tgni/analysis/uima] The dictionary and pattern annotators may be similar to what you are looking for (date and city annotators). Best regards, Sujit On Feb 8, 2013, at 8:50 AM, Bart Rijpers wrote: Hi Alex, Indeed that is exactly what I am trying to achieve using wordcities. Date will be simple: 16-Jan becomes 16-Jan-2013 in a new dynamic field. But how do I integrate the Java library as UIMA? The documentation about changing schema.xml and solr.xml is not very detailed. Regards, Bart On 8 Feb 2013, at 16:57, Alexandre Rafalovitch arafa...@gmail.com wrote: Hi Bart, I haven't done any UIMA work (I used other stuff for my NLP phase), so not sure I can help much further. But in general, you are venturing into pure research territory here. Even for dates, what do you actually mean? Just fixed expression? Relative dates (e.g. last tuesday?). What about times (7pm?). Same with cities. If you want it offline, you need the gazetteer and disambiguation modules. Gazetteer for cities (worldwide) is huge and has a lot of duplicate names (Paris, Ontario is apparently a short drive from London, Ontario eh?). Something like http://www.maxmind.com/en/worldcities? And disambiguation usually requires training corpus that is similar to what your text will look like. Online services like OpenCalais are backed by gigantic databases and some serious corpus-training Machine Language disambiguation algorithms. So, no plug-and-play solution here. If you really need to get this done, I would recommend narrowing down the specification of exactly what you will settle for and looking for software that can do it. Once you have that, integration with Solr is your next - and smaller - concern. Regards, Alex. Personal blog: http://blog.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book) On Fri, Feb 8, 2013 at 10:41 AM, jazz jazzsa...@me.com wrote: Thanks Alex, I checked the documentation but it seems there is only a webservice (OpenCalais) available to extract dates and places. http://uima.apache.org/sandbox.html Do you know is there is a Solr Compatible UIMA add-on which detects dates and places (cities) without a webservice? If not, how do you write one? Regards, Bart On 8 Feb 2013, at 15:29, Alexandre Rafalovitch wrote: Yes, it is possible. 
You are looking at UIMA or OpenNLP integration, most probably in Update Request Processor pipeline. Have a look here as a start: https://wiki.apache.org/solr/SolrUIMA You will have to put some serious work into this, it is not all tied together and packaged. Mostly because the Natural Language Processing (the field you are getting into) is kind of messy all of its own. Good luck, Alex. Personal blog: http://blog.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book) On Fri, Feb 8, 2013 at 9:24 AM, jazz jazzsa...@me.com wrote: Hi, I want to know if Solr can analyze text and recoginze dates and places. If yes, is it then possible to create new dynamic fields with these dates and places (e.g. city). Thanks, Bart
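Since the question of how such an annotator is written comes up throughout these UIMA threads, here is a rough sketch of a pattern-based annotator, loosely following the RoomNumberAnnotator from the UIMA tutorial referenced above (the RoomNumber JCas class is generated from the tutorial's type descriptor; a date or city annotator would have the same shape, with a date regex or a dictionary lookup in its place):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.jcas.JCas;
import org.apache.uima.tutorial.RoomNumber;   // JCas class generated from the type descriptor

public class RoomNumberAnnotator extends JCasAnnotator_ImplBase {
  // Yorktown-style room numbers, roughly as in the UIMA tutorial
  private static final Pattern ROOM = Pattern.compile("\\b[0-4]\\d-[0-2]\\d\\d\\b");

  @Override
  public void process(JCas jcas) throws AnalysisEngineProcessException {
    String text = jcas.getDocumentText();
    Matcher m = ROOM.matcher(text);
    while (m.find()) {
      RoomNumber ann = new RoomNumber(jcas, m.start(), m.end());
      ann.setBuilding("Yorktown");
      ann.addToIndexes();   // makes the annotation visible to downstream consumers (e.g. Solr's field mapping)
    }
  }
}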
Re: Per user document exclusions
Hi Christian, Since customization is not a problem in your case, how about writing out the userId and excluded document ids to the database when it is excluded, and then for each query from the user (possibly identified by a userid parameter), lookup the database by userid, construct a NOT filter out of the excluded docIds, then send to Solr as the fq? We are using a variant of this approach to allow database style wildcard search on document titles. -sujit On Nov 18, 2012, at 9:05 PM, Christian Jensen wrote: Hi, We have a need to allow each user to 'exclude' individual documents in the results. We can easily do this now within the RDBMS using a FTS index and a query with 'OUTER LEFT JOIN WHERE NULL' type of thing. Can Solr do this somehow? Heavy customization is not a problem - I would bet this has already been done. I would like to avoid multiple trips back and forth from either the DB or SOLR if possible. Thanks! Christian -- *Christian Jensen* 724 Ioco Rd Port Moody, BC V3H 2W8 +1 (778) 996-4283 christ...@jensenbox.com
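As a rough illustration of the approach above, here is a minimal SolrJ sketch; the id field name, the server URL and the loadExcludedDocIds helper are all made-up placeholders for the per-user database lookup, not anything from this thread:

import java.util.Arrays;
import java.util.List;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class PerUserSearch {

  // Placeholder for the database lookup of excluded doc ids, keyed by userId.
  static List<String> loadExcludedDocIds(String userId) {
    return Arrays.asList("doc17", "doc42");
  }

  public static void main(String[] args) throws Exception {
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    String userId = "user1";

    SolrQuery query = new SolrQuery("some user query");
    List<String> excluded = loadExcludedDocIds(userId);
    if (!excluded.isEmpty()) {
      // Build a single NOT filter, e.g. *:* -id:doc17 -id:doc42
      StringBuilder fq = new StringBuilder("*:*");
      for (String docId : excluded) {
        fq.append(" -id:").append(docId);
      }
      query.addFilterQuery(fq.toString());
    }
    QueryResponse rsp = server.query(query);
    System.out.println("Found " + rsp.getResults().getNumFound() + " docs");
  }
}

One thing to watch with this pattern is that each distinct fq gets its own filterCache entry, so with many users the hit rate on those per-user filters will be low.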
Re: Query foreign language synonyms / words of equivalent meaning?
Hi, We are using google translate to do something like what you (onlinespending) want to do, so maybe it will help. During indexing, we store the searchable fields from documents into a fields named _en, _fr, _es, etc. So assuming we capture title and body from each document, the fields are (title_en, body_en), (title_fr, body_fr), etc, with their own analyzer chains. These documents come from a controlled source (ie not the web), so we know the language they are authored in. During searching, a custom component intercepts the client language and the query. The query is sent to google translate for language detection. The largest amount of docs in the corpus is english, so if the detected language is either english or the client language, then we call google translate again to find the translated query in the other (english or client) language. Another custom component constructs an OR query between the two languages one component of which is aimed at the _en field set and the other aimed at the _xx (client language) field set. -sujit On Oct 9, 2012, at 11:24 PM, Bernd Fehling wrote: As far as I know, there is no built-in functionality for language translation. I would propose to write one, but there are many many pitfalls. If you want to translate from one language to another you might have to know the starting language. Otherwise you get problems with translation. Not (german) - distress (english), affliction (english) - you might have words in one language which are stopwords in other language not - you don't have a one to one mapping, it's more like 1 to n+x toilette (french) - bathroom, rest room / restroom, powder room This are just two points which jump into my mind but there are tons of pitfalls. We use the solution of a multilingual thesaurus as synonym dictionary. http://en.wikipedia.org/wiki/Eurovoc It holds translations of 22 official languages of the European Union. So a search for europäischer währungsfonds gives also results with european monetary fund, fonds monétaire européen, ... Regards Bernd Am 10.10.2012 04:54, schrieb onlinespend...@gmail.com: Hi, English is going to be the predominant language used in my documents, but there may be a spattering of words in other languages (such as Spanish or French). What I'd like is to initiate a query for something like bathroom for example and for Solr to return documents that not only contain bathroom but also baño (Spanish). And the same goes when searching for baño. I'd like Solr to return documents that contain either bathroom or baño. One possibility is to pre-translate all indexed documents to a common language, in this case English. And if someone were to search using a foreign word, I'd need to translate that to English before issuing a query to Solr. This appears to be problematic, since I'd have to know whether the indexed words and the query are even in a foreign language, which is not trivial. Another possibility is to pre-build a list of foreign word synonyms. So baño would be listed as a synonym for bathroom. But I'd need to include other languages (such as toilette in French) and other words. This requires that I know in advance all possible words I'd need to include foreign language versions of (not to mention needing to know which languages to include). This isn't trivial either. I'm assuming there's no built-in functionality that supports the foreign language translation on the fly, so what do people propose? Thanks! -- * Bernd FehlingUniversitätsbibliothek Bielefeld Dipl.-Inform. 
(FH)LibTec - Bibliothekstechnologie Universitätsstr. 25 und Wissensmanagement 33615 Bielefeld Tel. +49 521 106-4060 bernd.fehling(at)uni-bielefeld.de BASE - Bielefeld Academic Search Engine - www.base-search.net *
Re: How to make SOLR manipulate the results?
Hi Srilatha, One way to do this would be by making two calls, one to your sponsored list where you pick two at random and a solr call where you pick all the search results and then stick them together in your client. Sujit On Oct 4, 2012, at 12:39 AM, srilatha wrote: For an E-commerce website, we have stored the products as SOLR documents with the following fields and weights: Title:5 Description:4 For some products, we need to ensure that they appear in the top ten results even if their relevance in the above two fields does not qualify them for being in top 10. For example: P1, P2, P10 are the legitimate products for a given search keyword iPhone. I have S1 ... S100 as sponsored products that want to appear in the top 10. My policy is that only 2 of these 100 sponsored products will be randomly chosen and shown in the top 10 so that the results will be: S5, S31, P1, P2, ... P8. In the next request, the sponsored products that gets slipped in may be S4, S99. The QueryElevationComponent lets us specify the docIDs for keywords but does not let us randomize the results such that only 2 of the complete set of sponsored docIDs is sent in the results. Any suggestions for implementing this would be appreciated. -- View this message in context: http://lucene.472066.n3.nabble.com/How-to-make-SOLR-manipulate-the-results-tp4011739.html Sent from the Solr - User mailing list archive at Nabble.com.
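If it helps, a bare-bones SolrJ version of the two-call idea might look like the sketch below; the sponsored flag field, the title and id fields, and the URL are assumptions for illustration, and the merge is deliberately naive (the two random sponsored picks first, then the organic results):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrDocument;

public class SponsoredMixer {
  public static void main(String[] args) throws Exception {
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    String keyword = "iphone";

    // Call 1: fetch the sponsored candidates for this keyword.
    SolrQuery sponsoredQ = new SolrQuery("title:" + keyword);
    sponsoredQ.addFilterQuery("sponsored:true");
    sponsoredQ.setRows(100);
    List<SolrDocument> sponsored =
        new ArrayList<SolrDocument>(server.query(sponsoredQ).getResults());

    // Pick two of the sponsored candidates at random.
    Collections.shuffle(sponsored);
    List<SolrDocument> picked = sponsored.subList(0, Math.min(2, sponsored.size()));

    // Call 2: the normal search, excluding sponsored docs so they don't appear twice.
    SolrQuery organicQ = new SolrQuery("title:" + keyword);
    organicQ.addFilterQuery("*:* -sponsored:true");
    organicQ.setRows(8);

    // Stitch the two lists together in the client.
    List<SolrDocument> results = new ArrayList<SolrDocument>(picked);
    results.addAll(server.query(organicQ).getResults());
    for (SolrDocument doc : results) {
      System.out.println(doc.getFieldValue("id"));
    }
  }
}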
Re: Synonym file for American-British words
Hi Alex, I implemented something similar using the rules described in this page: http://en.wikipedia.org/wiki/American_and_British_English_spelling_differences The idea is to normalize the British spelling to the American form during indexing and query, using a tokenizer that takes in a word and, if it matches one of the rules, returns the converted form. My rules were modeled as a chain of transformations. Each transformation had a set of (pattern, action) pairs. The transformations were: a) word replacement (such as artefact = artifact) - in this case the source word would directly be normalized into the specified target word. b) prefix rules (eg anae = ane for anemic) - in this case the prefix characters of the word, if matched, would be transformed into the target. c) suffix rules (eg tre = ter for center) - similar to prefix rules except it works on the suffix. d) infix rules (eg moeb = meb for ameba) - replaces characters in the middle of the word. I cannot share the actual rules, but they should be relatively simple to figure out from the wiki page, if you want to go that route. HTH, Sujit On Aug 7, 2012, at 7:08 AM, Alexander Cougarman wrote: Dear friends, Is there a downloadable synonym file for American-British words? This page has some, for example the VarCon file, but it's not in the Solr synonym.txt file. We need something that can normalize words like center to centre. The VarCon file has it, but it's in the wrong format. Thank you in advance :) Sincerely, Alex
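For anyone wanting to experiment, a toy version of that rule chain (plain Java, with a tiny illustrative rule set rather than the actual rules, which as noted above can't be shared) could look like this; wrapping it into a TokenFilter so both the index and query analyzers see the normalized form is the remaining step:

import java.util.LinkedHashMap;
import java.util.Map;

public class BritishToAmericanNormalizer {

  // A tiny illustrative rule set; a real list would be derived from the wiki page.
  private static final Map<String, String> WORD_RULES = new LinkedHashMap<String, String>();
  private static final Map<String, String> PREFIX_RULES = new LinkedHashMap<String, String>();
  private static final Map<String, String> SUFFIX_RULES = new LinkedHashMap<String, String>();
  private static final Map<String, String> INFIX_RULES = new LinkedHashMap<String, String>();
  static {
    WORD_RULES.put("artefact", "artifact");
    PREFIX_RULES.put("anae", "ane");   // anaemic -> anemic
    SUFFIX_RULES.put("tre", "ter");    // centre -> center
    INFIX_RULES.put("moeb", "meb");    // amoeba -> ameba
  }

  public static String normalize(String word) {
    String w = word.toLowerCase();
    if (WORD_RULES.containsKey(w)) return WORD_RULES.get(w);
    for (Map.Entry<String, String> e : PREFIX_RULES.entrySet()) {
      if (w.startsWith(e.getKey())) return e.getValue() + w.substring(e.getKey().length());
    }
    for (Map.Entry<String, String> e : SUFFIX_RULES.entrySet()) {
      if (w.endsWith(e.getKey())) return w.substring(0, w.length() - e.getKey().length()) + e.getValue();
    }
    for (Map.Entry<String, String> e : INFIX_RULES.entrySet()) {
      if (w.contains(e.getKey())) return w.replace(e.getKey(), e.getValue());
    }
    return w;
  }

  public static void main(String[] args) {
    System.out.println(normalize("centre"));   // center
    System.out.println(normalize("anaemic"));  // anemic
  }
}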
Re: First query to find meta data, second to search. How to group into one?
Hi Samarendra, This does look like a candidate for a custom query component if you want to do this inside Solr. You can of course continue to do this at the client. -sujit On May 15, 2012, at 12:26 PM, Samarendra Pratap wrote: Hi, I need a suggestion for improving relevance of search results. Any help/pointers are appreciated. We have following fields (plus a lot more) in our schema title description category_id (multivalued) We are using mm=70% in solrconfig.xml We are using qf=title description We are not doing phrase query in q In case of a multi-word search text, mostly the end results are the junk ones. Because the words, mentioned in search text, are written in different fields and in different contexts. For example searching for water proof (without double quotes) brings a record where title = rose water and description = ... no proof of contamination ... Our priority is to remove irrelevant results, as much as possible. Increasing mm will not solve this completely because user input may not be always correct to be benefited by high mm. To remove irrelevant records we worked on following solution (or work-around) - We are firing first query to get top n results. We assume that first n results are mostly good results. n is dynamic within a predefined minimum and maximum value. - We are calculating frequency of category ids in these top results. We are not using facets because that gives count for all, relevant or irrelevant, results. - Based on category frequencies within top matching results we are trying to find a few most frequent categories by simple calculation. Now we are very confident that these categories are the ones which best suit to our query. - Finally we are firing a second query with top categories, calculated above, in filter query (fq). The quality of results really increased very much so I thought to try it the standard way. Does it require writing a plugin if I want to move above logic into Solr? Which component do I need to modify - QueryComponent? Or is there any better or even equivalent method in Solr of doing this or similar thing? Thanks -- Regards, Samar
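Done at the client, the two-pass idea described above might look roughly like this SolrJ sketch; it assumes category_id is stored (so it comes back with each result), picks a single top category rather than "a few", and uses a made-up server URL:

import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;

public class TwoPassCategorySearch {
  public static void main(String[] args) throws Exception {
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    String userQuery = "water proof";

    // Pass 1: take the top n results and count category_id frequencies by hand
    // (not facets, so only the top-ranked docs contribute).
    SolrQuery first = new SolrQuery(userQuery);
    first.setRows(50);
    SolrDocumentList top = server.query(first).getResults();
    Map<Object, Integer> counts = new HashMap<Object, Integer>();
    for (SolrDocument doc : top) {
      Collection<Object> cats = doc.getFieldValues("category_id");
      if (cats == null) continue;
      for (Object cat : cats) {
        Integer c = counts.get(cat);
        counts.put(cat, c == null ? 1 : c + 1);
      }
    }

    // Pick the most frequent category among the top results.
    Object best = null;
    int bestCount = -1;
    for (Map.Entry<Object, Integer> e : counts.entrySet()) {
      if (e.getValue() > bestCount) { best = e.getKey(); bestCount = e.getValue(); }
    }

    // Pass 2: re-run the query restricted to the winning category.
    SolrQuery second = new SolrQuery(userQuery);
    second.addFilterQuery("category_id:" + best);
    System.out.println(server.query(second).getResults().getNumFound() + " filtered results");
  }
}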
Re: Faceting on a date field multiple times
Hi Ian, I believe you may be able to use a bunch of facet.query parameters, something like this: facet.query=yourfield:[NOW-1DAY TO NOW] facet.query=yourfield:[NOW-2DAY TO NOW-1DAY] ... and so on. -sujit On May 3, 2012, at 10:41 PM, Ian Holsman wrote: Hi. I would like to be able to do a facet on a date field, but with different ranges (in a single query). for example, I would like to show #documents by day for the last week, #documents by week for the last couple of months, and #documents by year for the last several years. is there a way to do this without hitting solr 3 times? thanks Ian
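In SolrJ the same thing would look something like the sketch below; the pubdate field name and the particular buckets are just examples:

import java.util.Map;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class DateBucketFacets {
  public static void main(String[] args) throws Exception {
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

    SolrQuery query = new SolrQuery("*:*");
    query.setRows(0);
    query.setFacet(true);
    // Mixed granularities in a single request: days, weeks, months, years.
    query.addFacetQuery("pubdate:[NOW-1DAY TO NOW]");
    query.addFacetQuery("pubdate:[NOW-7DAY TO NOW]");
    query.addFacetQuery("pubdate:[NOW-1MONTH TO NOW]");
    query.addFacetQuery("pubdate:[NOW-1YEAR TO NOW]");

    QueryResponse rsp = server.query(query);
    for (Map.Entry<String, Integer> e : rsp.getFacetQuery().entrySet()) {
      System.out.println(e.getKey() + " -> " + e.getValue());
    }
  }
}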
Re: Any way to get reference to original request object from within Solr component?
Hi Hoss, Thanks for the pointers, and sorry, it was a bug in my code (was some dead code which was alphabetizing the facet link text and also the parameters themselves indirectly by reference). I actually ended up building a servlet and a component to print out the multi-valued parameters using HttpServletRequest.getParameterValues(myparam) and ResponseBuilder.req.getParams().getParams(myparam) respectively to isolate the problem. Both of them returned the parameters in the correct order. So I went trolling through the code with a debugger, to observe exactly at what point the order got messed up, and found the bug. FWIW, I am using Tomcat 5.5. Thanks to everybody for their help, and sorry for the noise, guess I should have done the debugger thing before I threw up my hands :-). -sujit On Mar 19, 2012, at 6:55 PM, Chris Hostetter wrote: : I have a custom component which depends on the ordering of a : multi-valued parameter. Unfortunately it looks like the values do not : come back in the same order as they were put in the URL. Here is some : code to explain the behavior: ... : and I notice that the values are ordered differently than [foo, bar, : baz] that I would have expected. I am guessing its because the : SolrParams is a MultiMap structure, so order is destroyed on its way in. a) MultiMapSolrParams does not destroy order on the way in b) when dealing with HTTP requests, the request params actaully use an instance of ServletSolrParams which is backed directly by the ServletRequest.getParameterMap() -- you should get the values returned in the exact order as ServletRequest.getParameterMap().get(myparam) : 1) is there a setting in Solr can use to enforce ordering of : multi-valued parameters? I suppose I could use a single parameter with : comma-separated values, but its a bit late to do that now... Should already be enforced in MultiMapSolrParams and ServletSolrParams : 2) is it possible to use a specific SolrParams object that preserves order? If so how? see above. : 3) is it possible to get a reference to the HTTP request object from within a component? If so how? not out of the box, because there is no garuntee that solr is even running in a servlet container. you can subclass SolrDispatchFilter to do this if you wish (note the comment in the execute() method). My questions to you... 1) what servlet container are you using? 2) have you tested your servlet container with a simple servlet (ie: eliminate solr from the equation) to verify that the ServletRequest.getParameterMap() contains your request values in order? if you debug this and find evidence that something in solr is re-ordering the values in a MultiMapSolrParams or ServletSolrParams *PLEASE* open a jira with a reproducable example .. that would definitley be an anoying bug we should get to the bottom of. -Hoss
Re: Any way to get reference to original request object from within Solr component?
Thanks Russel, thats a good idea, I think this would work too... I will try this and update the thread with details once. -sujit On Mar 18, 2012, at 7:11 AM, Russell Black wrote: One way to do this is to register a servlet filter that places the current request in a global static ThreadLocal variable, thereby making it available to your Solr component. It's kind of a hack but would work. Sent from my phone On Mar 17, 2012, at 6:53 PM, SUJIT PAL sujit@comcast.net wrote: Thanks Pravesh, Yes, converting the myparam to a single (comma-separated) field is probably the best approach, but as I mentioned, this is probably a bit too late for this to be practical in my case... The myparam parameters are facet filter queries, and so far order did not matter, since the filters were just AND-ed together and applied to the result set and facets were being returned in count order. But now the requirement is to bubble up the selected facets so the one is most currently selected is on the top. This was uncovered during user-acceptance testing (since the client shows only the top N facets, and the currently selected facet to disappear since its no longer within the top N facets). Asking the client to switch to a single comma-separated field is an option, but its the last option at this point, so I was wondering if it was possible to switch to some other data structure, or at least get a handle to the original HTTP servlet request from within the component so I could grab the parameters from there. I noticed that the /select call does preserve the order of the parameters, but that is because its probably being executed by SolrServlet, which gets its parameters from the HttpServletRequest. I guess I will have to just run the request through a debugger and see where exactly the parameter order gets messed up...I'll update this thread if I find out. Meanwhile, if any of you have simpler alternatives, would really appreciate knowing... Thanks, -sujit On Mar 17, 2012, at 12:01 AM, pravesh wrote: Hi Sujit, The Http parameters ordering is above the SOLR level. Don't think this could be controlled at SOLR level. You can append all required values in a single Http param at then break at your component level. Regds Pravesh -- View this message in context: http://lucene.472066.n3.nabble.com/Any-way-to-get-reference-to-original-request-object-from-within-Solr-component-tp3833703p3834082.html Sent from the Solr - User mailing list archive at Nabble.com.
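For reference, the ThreadLocal filter Russell describes could be as small as the sketch below; the class name is made up, and it has to be mapped in web.xml ahead of SolrDispatchFilter so it wraps the Solr request on the same thread:

import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;

// Stashes the current request in a ThreadLocal so a Solr component running on the
// same thread can read the raw parameter order directly from the servlet request.
public class RequestHolderFilter implements Filter {

  private static final ThreadLocal<HttpServletRequest> CURRENT =
      new ThreadLocal<HttpServletRequest>();

  public static HttpServletRequest getCurrentRequest() {
    return CURRENT.get();
  }

  public void init(FilterConfig config) throws ServletException {
  }

  public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
      throws IOException, ServletException {
    try {
      if (req instanceof HttpServletRequest) {
        CURRENT.set((HttpServletRequest) req);
      }
      chain.doFilter(req, res);
    } finally {
      CURRENT.remove();  // avoid leaking requests across pooled threads
    }
  }

  public void destroy() {
  }
}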
Re: Any way to get reference to original request object from within Solr component?
Thanks Pravesh, Yes, converting the myparam to a single (comma-separated) field is probably the best approach, but as I mentioned, this is probably a bit too late for this to be practical in my case... The myparam parameters are facet filter queries, and so far order did not matter, since the filters were just AND-ed together and applied to the result set and facets were being returned in count order. But now the requirement is to bubble up the selected facets so the one is most currently selected is on the top. This was uncovered during user-acceptance testing (since the client shows only the top N facets, and the currently selected facet to disappear since its no longer within the top N facets). Asking the client to switch to a single comma-separated field is an option, but its the last option at this point, so I was wondering if it was possible to switch to some other data structure, or at least get a handle to the original HTTP servlet request from within the component so I could grab the parameters from there. I noticed that the /select call does preserve the order of the parameters, but that is because its probably being executed by SolrServlet, which gets its parameters from the HttpServletRequest. I guess I will have to just run the request through a debugger and see where exactly the parameter order gets messed up...I'll update this thread if I find out. Meanwhile, if any of you have simpler alternatives, would really appreciate knowing... Thanks, -sujit On Mar 17, 2012, at 12:01 AM, pravesh wrote: Hi Sujit, The Http parameters ordering is above the SOLR level. Don't think this could be controlled at SOLR level. You can append all required values in a single Http param at then break at your component level. Regds Pravesh -- View this message in context: http://lucene.472066.n3.nabble.com/Any-way-to-get-reference-to-original-request-object-from-within-Solr-component-tp3833703p3834082.html Sent from the Solr - User mailing list archive at Nabble.com.
Any way to get reference to original request object from within Solr component?
Hello, I have a custom component which depends on the ordering of a multi-valued parameter. Unfortunately it looks like the values do not come back in the same order as they were put in the URL. Here is some code to explain the behavior: URL: /solr/my_custom_handler?q=something&myparam=foo&myparam=bar&myparam=baz Inside my component's process(ResponseBuilder) method, I do the following:

public void process(ResponseBuilder rb) throws IOException {
  String[] myparams = rb.req.getParams().getParams("myparam");
  System.out.println("myparams=" + ArrayUtils.toString(myparams));
  ...
}

and I notice that the values are ordered differently than [foo, bar, baz] that I would have expected. I am guessing it's because the SolrParams is a MultiMap structure, so order is destroyed on its way in. My questions are: 1) is there a setting in Solr I can use to enforce ordering of multi-valued parameters? I suppose I could use a single parameter with comma-separated values, but it's a bit late to do that now... 2) is it possible to use a specific SolrParams object that preserves order? If so how? 3) is it possible to get a reference to the HTTP request object from within a component? If so how? I am on Solr version 3.2.0. Thanks in advance for any help you can provide, Sujit
Re: How to check if a field is a multivalue field with java
Hi Thomas, With Java (from within a custom handler in Solr) you can get a handle to the IndexSchema from the request, like so: IndexSchema schema = req.getSchema(); SchemaField sf = schema.getField(fieldname); boolean isMultiValued = sf.multiValued(); From within SolrJ code, you can use SolrDocument.getFieldValue() which returns an Object, so you could do an instanceof check - if it's a Collection it's multivalued, else not. Object value = sdoc.getFieldValue(fieldname); boolean isMultiValued = value instanceof Collection; At least this is what I do, I don't think there is a way to get a handle to the IndexSchema object over solrj... -sujit On Feb 22, 2012, at 9:41 AM, tschiela wrote: Hello, is there any way to check if a field of a SolrDocument is a multivalued field with java (solrj)? Greets Thomas -- View this message in context: http://lucene.472066.n3.nabble.com/How-to-check-if-a-field-is-a-multivalue-field-with-java-tp3767200p3767200.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: How to make search with special characters in keywords
Hi Tejinder, I had this problem yesterday (believe it or not :-)), and the fix for us was to make Tomcat UTF-8 compliant. In server.xml, there is a Connector tag; we added the attribute URIEncoding="UTF-8" to it and restarted Tomcat. Not sure what container you are using; if it's Tomcat this will solve it, else you could probably find a similar setting for your container. Here is a link that provides more specific info: http://struts.apache.org/2.0.6/docs/how-to-support-utf-8-uriencoding-with-tomcat.html -sujit On Feb 1, 2012, at 11:52 AM, Tejinder Rawat wrote: Hi all, In my implementation many fields in documents are having words with special characters like Company®, Time™. Index is created using these fields. However if I make search using these keywords in solr console, it does not work. i.e. entering Company® or Time™ in search field box does not return any document. Whereas entering Company or Time returns documents. Requirement is to be able to make search with special characters in keywords. Any pointers about how to index and search in case of special characters will be greatly appreciated. Thank you. Thanks, Tejinder
Re: How to make search with special characters in keywords
Well, sometimes people just copy-paste stuff into the search box probably because some words (at least in my world) are very hard to spell correctly. We noticed the problem because the query was getting mangled on its way in and not returning any search results even though it should have. Our analysis chain (both query and index) uses ASCIIFoldingFilter to downcast these special characters to equivalent ASCII, so a string such as Ångström for example will actually result in a search for angstrom. The indexing also does the same conversion. The mangling looked very similar to what happens when UTF-8 is passed through ISO-8859-1 encoding (and vice versa) which led us to the solution. -sujit On Feb 1, 2012, at 5:04 PM, Erick Erickson wrote: Sujit's comments are well taken, part of your problem will certainly be getting the special characters through your container... But another part of your problem will be having the characters in your index in the first place. The fact that you can find Time in the first place suggests that your index does NOT have the special characters, you need to look to your analysis chain to see what transformations occur, see the admin/analysis page... But I question why you need to search on special characters. Do you really expect the user to be happy with being required to enter Company®? A common approach is to remove such special characters during both index and query analyzing so a Company® and Company are equivalent. But your problem space may differ. Best Erick On Wed, Feb 1, 2012 at 6:55 PM, SUJIT PAL sujit@comcast.net wrote: Hi Tejinder, I had this problem yesterday (believe it or not :-)), and the fix for us was to make Tomcat UTF-8 compliant. In server.xml, there is a Controller tag, we added the attribute URIEncoding=UTF-8 and restarted Tomcat. Not sure what container you are using, if its Tomcat this will solve it, else you could probably find a similar setting for your container. Here is a link that provides more specific info: http://struts.apache.org/2.0.6/docs/how-to-support-utf-8-uriencoding-with-tomcat.html -sujit On Feb 1, 2012, at 11:52 AM, Tejinder Rawat wrote: Hi all, In my implementation many fields in documents are having words with special characters like Company® ,Time™. Index is created using these fields. However if I make search using these keywords in solr console, it does not work. i.e. entering Company® or Time™ in search field box does not return any document. Where as entering Company or Time returns documents. Requirement is to be able to make search with special characters in keywords. Any pointers about how to index and search in case of special characters will be greatly appreciated. Thank you. Thanks, Tejinder
Re: Solr, SQL Server's LIKE
Hi Devon, Have you considered using a permuterm index? Its workable, but depending on your requirements (size of fields that you want to create the index on), it may bloat your index. I've written about it here: http://sujitpal.blogspot.com/2011/10/lucene-wildcard-query-and-permuterm.html Another alternative which I've implemented is a custom mechanism that retrieves a list of matching unique ids from a database table using a SQL LIKE, then passes this list as a filter to the main query. Its hacky, but I was building a custom handler anyway, so it was quite simple to add in. -sujit On Thu, 2011-12-29 at 11:38 -0600, Devon Baumgarten wrote: I have been tinkering with Solr for a few weeks, and I am convinced that it could be very helpful in many of my upcoming projects. I am trying to decide whether Solr is appropriate for this one, and I haven't had luck looking for answers on Google. I need to search a list of names of companies and individuals pretty exactly. T-SQL's LIKE operator does this with decent performance, but I have a feeling there is a way to configure Solr to do this better. I've tried using an edge N-gram tokenizer, but it feels like it might be more complicated than necessary. What would you suggest? I know this sounds kind of 'Golden Hammer,' but there has been talk of other, more complicated (magic) searches that I don't think SQL Server can handle, since its tokens (as far as I know) can't be smaller than one word. Thanks, Devon Baumgarten
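The permuterm trick mentioned above boils down to indexing every rotation of the term plus an end marker; a tiny sketch of just the rotation step (generic, not taken from the blog post) is:

import java.util.ArrayList;
import java.util.List;

public class PermutermRotations {

  // For "clark" this produces clark$, lark$c, ark$cl, rk$cla, k$clar, $clark.
  // A leading-wildcard query like *lark can then be rewritten to the cheap
  // prefix query lark$*.
  public static List<String> rotations(String term) {
    String marked = term + "$";
    List<String> out = new ArrayList<String>();
    for (int i = 0; i < marked.length(); i++) {
      out.add(marked.substring(i) + marked.substring(0, i));
    }
    return out;
  }

  public static void main(String[] args) {
    System.out.println(rotations("clark"));
  }
}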
Re: Dynamic rating based on Like feature
Hi Eugene, I proposed a solution for something similar, maybe it will help you. http://sujitpal.blogspot.com/2011/05/custom-sorting-in-solr-using-external.html -sujit On Sat, 2011-11-05 at 16:43 -0400, Eugene Strokin wrote: Hello, I have a task which seems trivial, but I couldn't find any related information from Solr documentation. So I'm asking the community for an advice. I have relatively big amount (about 25 Millions) of documents which are describing products. Those products could be rated by humans and/or machines. The rating is nothing more but just Like kind of points. So if someone or something likes a product it adds +1 to the total points of the product. I was thinking I could just have an integer field in the document, and increment it each time when Like event is fired, and just sort this field. But, because Like event could come from external systems, I could get literally thousands of such events in first few hours. And I'm not sure that updating the document that often would be good. This is the first question - May be there is another way to do such dynamic rating? So more Liked products will be first in a search result. The second problem, that the client is asking to have time based search results. For example those Likes should not boost the document if they are a week old, a month old, etc. Ideally, they want to set the expiration time dynamically, but if this is a problem, it is acceptable to have some predefined time of expiration of those Likes, but still we are going to need at least a week and a month thresholds. Second question, if this is possible at all to do using Solr, if so, how? If not, what could you suggest? Thanks in advance, any advice, information, anything are greatly appreciated. Eugene S.
Re: Find Documents with field = maxValue
Hi Alireza, Would this work? Sort the results by age desc, then loop through the results as long as age == age[0]. -sujit On Tue, 2011-10-18 at 15:23 -0700, Otis Gospodnetic wrote: Hi, Are you just looking for: age:target age This will return all documents/records where age field is equal to target age. But maybe you want age:[0 TO target age here] This will include people aged from 0 to target age. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ From: Alireza Salimi alireza.sal...@gmail.com To: solr-user@lucene.apache.org Sent: Tuesday, October 18, 2011 10:15 AM Subject: Re: Find Documents with field = maxValue Hi Ahmet, Thanks for your reply, but I want ALL documents with age = max_age. On Tue, Oct 18, 2011 at 9:59 AM, Ahmet Arslan iori...@yahoo.com wrote: --- On Tue, 10/18/11, Alireza Salimi alireza.sal...@gmail.com wrote: From: Alireza Salimi alireza.sal...@gmail.com Subject: Find Documents with field = maxValue To: solr-user@lucene.apache.org Date: Tuesday, October 18, 2011, 4:10 PM Hi, It might be a naive question. Assume we have a list of Document, each Document contains the information of a person, there is a numeric field named 'age', how can we find those Documents whose *age* field is *max(age) *in one query. May be http://wiki.apache.org/solr/StatsComponent? Or sort by age? q=*:*start=0rows=1sort=age desc -- Alireza Salimi Java EE Developer
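A quick SolrJ sketch of that sort-and-scan approach; the age field name, the URL and the 100-row page size are assumptions, and you would page further if more than 100 docs can share the max:

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;

public class MaxAgeDocs {
  public static void main(String[] args) throws Exception {
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

    SolrQuery query = new SolrQuery("*:*");
    query.addSortField("age", SolrQuery.ORDER.desc);
    query.setRows(100);

    SolrDocumentList docs = server.query(query).getResults();
    List<SolrDocument> oldest = new ArrayList<SolrDocument>();
    if (!docs.isEmpty()) {
      // Everything up to the first doc whose age differs from the top doc shares the max.
      Object maxAge = docs.get(0).getFieldValue("age");
      for (SolrDocument doc : docs) {
        if (!maxAge.equals(doc.getFieldValue("age"))) break;
        oldest.add(doc);
      }
    }
    System.out.println(oldest.size() + " docs share the max age");
  }
}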
Re: SolrJ + Post
If you use the CommonsHttpSolrServer from your client (not sure about the other types, this is the one I use), you can pass the method as an argument to its query() method, something like this: QueryResponse rsp = server.query(params, METHOD.POST); HTH Sujit On Fri, 2011-10-14 at 13:29 +, Rohit wrote: I want to use POST instead of GET while using solrj, but I am unable to find a clear example for it. If anyone has implemented the same it would be nice to get some insight. Regards, Rohit Mobile: +91-9901768202 About Me: http://about.me/rohitg
Re: SolrJ + Post
Not the OP, but I put it in on /one/ of my solr custom handlers that acts as a proxy to itself (ie the server its part of). It basically rewrites the incoming query (usually short 50-250 chars at most) to a set of very long queries and passes them in parallel to the server, gathers up the results and returns a combo response. The logging is not an issue for me since the handler logs the expanded query before sending it off, but the caching is. Thank you for pointing it out. I was doing it because I was running afoul of the limit on the URL size (and the max boolean clauses as well, but I reset the max for that). But I just realized that we can probably reset that limit as well as this page shows: http://serverfault.com/questions/56691/whats-the-maximum-url-length-in-tomcat So perhaps if the URL length is the reason for the OP's question, increasing it may be a better option than using POST? -sujit On Fri, 2011-10-14 at 09:30 -0700, Walter Underwood wrote: Why do you want to use POST? It is the wrong HTTP request type for search results. GET is for retrieving information from the server, POST is for changing information on the server. POST responses cannot be cached (see HTTP spec). POST requests do not include the arguments in the log, which makes your HTTP logs nearly useless for diagnosing problems. wunder Walter Underwood On Oct 14, 2011, at 9:20 AM, Sujit Pal wrote: If you use the CommonsHttpSolrServer from your client (not sure about the other types, this is the one I use), you can pass the method as an argument to its query() method, something like this: QueryResponse rsp = server.query(params, METHOD.POST); HTH Sujit On Fri, 2011-10-14 at 13:29 +, Rohit wrote: I want to user POST instead of GET while using solrj, but I am unable to find a clear example for it. If anyone has implemented the same it would be nice to get some insight. Regards, Rohit Mobile: +91-9901768202 About Me: http://about.me/rohitg http://about.me/rohitg
Re: Sort five random Top Offers to the top
Hi Mouli, I was looking at the code here, not sure why you even need to do the sort... After you get the DocList, couldn't you do something like this? ListInteger topofferDocIds = new ArrayListInteger(); for (DocIterator it = ergebnis.iterator(); it.hasNext();) { topofferDocIds.add(it.next()); } Collections.shuffle(topofferDocIds); rb.req.getContext().set(TOPOFFERS, topofferDocIds); so in first-component, you have identified the top 5 offers for the query and client, and stuffed them into the context. Then you define a last component which will take the topofferDocIds and place them at the top of the search results, and remove them if they exist from the main result. Would that not work? Alternatively (kind of a hybrid way) would be to define your own (single) component that takes the query, sends back two queries to the underlying solr, one with the topoffers and one without and merges the results before sending back. This would replace the component that does the search. -sujit On Wed, 2011-09-28 at 07:15 -0700, MOuli wrote: Hey Community. I write my first component and now i got a problem hear is my code: @Override public void prepare(ResponseBuilder rb) throws IOException { try { rb.req.getParams().getBool(topoffers.show, true); String client = rb.req.getParams().get(client, 1); BooleanQuery[] queries = new BooleanQuery[2]; queries[0] = (BooleanQuery) DisMaxQParser.getParser( rb.req.getParams().get(q), DisMaxQParserPlugin.NAME, rb.req) .getQuery(); queries[1] = new BooleanQuery(); Occur occur = BooleanClause.Occur.MUST; queries[1].add(QueryParsing.parseQuery(ups_topoffer_ + client + :true, rb.req.getSearcher().getSchema()), occur); Query q = Query.mergeBooleanQueries(queries[0], queries[1]); DocList ergebnis = rb.req.getSearcher().getDocList(q, null, null, 0, 5, 0); String[] machineIds = new String[5]; int position = 0; DocIterator iter = ergebnis.iterator(); while (iter.hasNext()) { int docID = iter.nextDoc(); Document doc = rb.req.getSearcher().getReader().document(docID); for (String value : doc.getValues(machine_id)) { machineIds[position++] = value; } } Sort sort = rb.getSortSpec().getSort(); if (sort == null) { rb.getSortSpec().setSort(new Sort()); sort = rb.getSortSpec().getSort(); } SortField[] newSortings = new SortField[sort.getSort().length + 5]; int count = 0; for (String machineId : machineIds) { SortField sortMachineId = new SortField(map(machine_id, + machineId + , + machineId + ,1,0) desc, SortField.DOUBLE); newSortings[count++] = sortMachineId; } SortField[] sortings = sort.getSort(); for (SortField sorting : sortings) { newSortings[count++] = sorting; } sort.setSort(newSortings); rb.getSortSpec().setSort(sort); } catch (ParseException e) { LoggerFactory.getLogger(Topoffers.class).error( Fehler bei den Topoffers!, this); LoggerFactory.getLogger(Topoffers.class).error(e.toString(), this); } } Why can't i manipulate the sort? Is there something i miss understand? This search component is added as a first-component in the solrconfig.xml. Please can anyone help me?? -- View this message in context: http://lucene.472066.n3.nabble.com/Sort-five-random-Top-Offers-to-the-top-tp3355469p3376166.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Sort five random Top Offers to the top
Not the OP, but this is /much/ simpler, although at the expense of making 2 calls to solr. But the upside is that no customization is required. On Thu, 2011-09-22 at 09:43 +0100, Doug McKenzie wrote: Could you not just do your normal search with and add a filter query on? fq=topoffer:true That would then return only results with top offer : true and then use whatever shuffling / randomising you like in your application. Alternately you could even add sorting on relevance to show the top 5 closest matches to the query rows=5sort=score desc On 21/09/2011 21:26, Sujit Pal wrote: Hi MOuli, AFAIK (and I don't know that much about Solr), this feature does not exist out of the box in Solr. One way to achieve this could be to construct a DocSet with topoffer:true and intersect it with your result DocSet, then select the first 5 off the intersection, randomly shuffle them, sublist [0:5], and move the sublist to the top of the results like QueryElevationComponent does. Actually you may want to take a look at QueryElevationComponent code for inspiration (this is where I would have looked if I had to implement something similar). -sujit On Wed, 2011-09-21 at 06:54 -0700, MOuli wrote: Hey Community. I got a Lucene/Solr Index with many offers. Some of them are marked by a flag field topoffer that they are top offers. Now I want so sort randomly 5 of this offers on the top. For Example HTC Sensation - topoffer = true HTC Desire - topoffer = false Samsung Galaxy S2 - topoffer = ture IPhone 4 - topoffer = true ... When i search for a Handy then i want that first 3 offers are HTC Sensation, Samsung Galaxy S2 and the iPhone 4. Does anyone have an idea? PS.: I hope my english is not to bad -- View this message in context: http://lucene.472066.n3.nabble.com/Sort-five-random-Top-Offers-to-the-top-tp3355469p3355469.html Sent from the Solr - User mailing list archive at Nabble.com. -- Become a Firebox Fan on Facebook: http://facebook.com/firebox And Follow us on Twitter: http://twitter.com/firebox Firebox has been nominated for Retailer of the Year in the 2011 Stuff Awards. Who will win? It's up to you! Visit http://www.stuff.tv/awards and place your vote. We'll do a special dance if it's us. Firebox HQ is MOVING HOUSE! We're migrating from Streatham Hill to shiny new digs in Shoreditch. As of 3rd October please update your records to: Firebox.com, 6.10 The Tea Building, 56 Shoreditch High Street, London, E1 6JJ Global Head Office: Firebox House, Ardwell Road, London SW2 4RT Firebox.com Ltd is registered in England and Wales, company number 3874477 Registered Company Address: 41 Welbeck Street London W1G 8EA Firebox.com Any views expressed in this email are those of the individual sender, except where the sender expressly, and with authority, states them to be the views of Firebox.com Ltd.
Re: Sort five random Top Offers to the top
I have a few blog posts on this... http://sujitpal.blogspot.com/2011/04/custom-solr-search-components-2-dev.html http://sujitpal.blogspot.com/2011/04/more-fun-with-solr-component.html http://sujitpal.blogspot.com/2011/02/solr-custom-search-requesthandler.html but its quite simple, just look at some of the ones already in there. If you need books, check out the Apache Solr 3.1 Cookbook - it has a chapter on how to do this. -sujit On Thu, 2011-09-22 at 02:13 -0700, MOuli wrote: Hmm is it possible for me to write my own search component? I just downloaded the solr sources and need some informations how the search components work. Is there anything out there which can help me? -- View this message in context: http://lucene.472066.n3.nabble.com/Sort-five-random-Top-Offers-to-the-top-tp3355469p3358152.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Sort five random Top Offers to the top
Sorry hit send too soon. Personally, given the use case, I think I would still prefer the two query approach. It seems way too much work to do a handler (unless you want to learn how to do it) to support this. On Thu, 2011-09-22 at 12:31 -0700, Sujit Pal wrote: I have a few blog posts on this... http://sujitpal.blogspot.com/2011/04/custom-solr-search-components-2-dev.html http://sujitpal.blogspot.com/2011/04/more-fun-with-solr-component.html http://sujitpal.blogspot.com/2011/02/solr-custom-search-requesthandler.html but its quite simple, just look at some of the ones already in there. If you need books, check out the Apache Solr 3.1 Cookbook - it has a chapter on how to do this. -sujit On Thu, 2011-09-22 at 02:13 -0700, MOuli wrote: Hmm is it possible for me to write my own search component? I just downloaded the solr sources and need some informations how the search components work. Is there anything out there which can help me? -- View this message in context: http://lucene.472066.n3.nabble.com/Sort-five-random-Top-Offers-to-the-top-tp3355469p3358152.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Sort five random Top Offers to the top
Hi MOuli, AFAIK (and I don't know that much about Solr), this feature does not exist out of the box in Solr. One way to achieve this could be to construct a DocSet with topoffer:true and intersect it with your result DocSet, then select the first 5 off the intersection, randomly shuffle them, sublist [0:5], and move the sublist to the top of the results like QueryElevationComponent does. Actually you may want to take a look at QueryElevationComponent code for inspiration (this is where I would have looked if I had to implement something similar). -sujit On Wed, 2011-09-21 at 06:54 -0700, MOuli wrote: Hey Community. I got a Lucene/Solr Index with many offers. Some of them are marked by a flag field topoffer that they are top offers. Now I want so sort randomly 5 of this offers on the top. For Example HTC Sensation - topoffer = true HTC Desire - topoffer = false Samsung Galaxy S2 - topoffer = ture IPhone 4 - topoffer = true ... When i search for a Handy then i want that first 3 offers are HTC Sensation, Samsung Galaxy S2 and the iPhone 4. Does anyone have an idea? PS.: I hope my english is not to bad -- View this message in context: http://lucene.472066.n3.nabble.com/Sort-five-random-Top-Offers-to-the-top-tp3355469p3355469.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Too many results in dismax queries with one word
Would it make sense to have a Did you mean? type of functionality for which you use the EdgeNGram and Metaphone filters /if/ you don't get appropriate results for the user query? So when user types cannon and the application notices that there are no cannons for sale in the index (0 results with standard analysis), it then makes another query with the EdgeNGram and/or Metaphone filters and come back with: Did you mean Canon, Canine? Clicking on Canon or Canine would fire off a query for these terms. That way your application doesn't guess what is right, it goes back and asks the user what he wants. -sujit On Sun, 2011-08-21 at 17:19 +0200, Rafał Piekarski (RaVbaker) wrote: Thanks for reply. I know that sometimes meeting all clients needs would be impossible but then client recalls that competitive (commercial) product already do that (but has other problems, like performance). And then I'm obligated to try more tricks. :/ I'm currently using Solr 3.1 but thinking about migrating to latest stable version - 3.3. You correct, to meet client needs I have also used some hacks with boosting queries (`bq` and `bf` parameters) but I omit that to make XMLs clearer. You mentioned faceting. This is also one of my(my client?) problems. In the user interface they want to have 5 categories for products. Those 5 should be most relevance ones. When I get those with highest counts for one word queries they are most of the time not that which should be there. For example with phrase ipad which actually has only 12 most relevant products in category tablets but phonetic APT matches also part of model name for hundreds of UPS power supplies and bath tubes . And these are on the list, not tablets. :/ But you mentioned autocomplete which is something what I haven't watched yet. I'll try with that and show it to my client. -- Rafał RaVbaker Piekarski. web: http://ja.ravbaker.net mail: ravba...@gmail.com jid/xmpp/aim: ravba...@gmail.com mobile: +48-663-808-481 On Sun, Aug 21, 2011 at 4:20 PM, Erick Erickson erickerick...@gmail.comwrote: The root problem here is This is unacceptable for my client. The first thing I'd suggest is that you work with your client and get them to define what is acceptable. You'll be forever changing things (to no good purpose) if all they can say is that's not right. For instance, you apparently have two competing requirements: 1 try to correct users input, which inevitably increases the results returned 2 narrow the search to the right results. You can't have both every time! So you could try something like going with a more-restrictive search (no metaphone comparison) first and, if the results returned weren't sufficient firing the broader query back, without showing the too-small results first. You could work with your client and see if what they really want is just the most relevant results at the top of the list, in which case you can play with the dismax field boosts (by the way, what version of Solr are you using?) You could work with the client to understand the user experience if you use autocomplete and/or faceting etc. to guide their explorations. You could... But none of that will help unless and until you and your client can agree what is the correct behavior ahead of time Best Erick On Sat, Aug 20, 2011 at 11:04 AM, Rafał Piekarski (RaVbaker) ravba...@gmail.com wrote: Hi all, I have a database of e-commerce products (5M) and trying to build a search solution for it. I have used steemer, edgengram and doublemetaphone phonetic fields for omiting common typos in queries. 
It works quite good with dismax QParser for queries longer than one word: tv lc20, sny psp 3001, cannon 5d etc. For not having too many results I manipulated with `mm` parameter. But when user type a single word like ipad, cannon. I always having a lot of results (~6). This is unacceptable for my client. He would like to have then only the `good` results. That particulary match specific query. It's hard to acomplish for me cause of use doublemetaphone field which converts words like apt, opt and ipad and even ipod to the same phonetic word - APT. And then all of these words are matched fairly the same gives me huge amount of results. Similar problems I have with other words like canon, canine and cannon which are KNN in phonetic way. But lexically have different meanings: canon - camera, canine - cat food , cannon - may be a misspell for canon or part of book title about cannon weapons. My first idea was to make a second requestHandler without searching in *_phonetic fields. And use it for queries with only one word. But it didn't worked cause sometimes I want to correct user even if there is only one word and suggest him something better. Query cannon is a good example. I'm fairly sure that most
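A client-side sketch of the fallback idea suggested at the top of this message might look like the following; the qf field names (name_text, description_text, name_phonetic, name_ngram) and the URL are invented for illustration and would map to whatever strict and loose field sets the schema actually has:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrDocument;

public class DidYouMeanFallback {
  public static void main(String[] args) throws Exception {
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    String userQuery = "cannon";

    // First pass: the strict field set, no phonetic/ngram matching.
    SolrQuery strict = new SolrQuery(userQuery);
    strict.set("defType", "dismax");
    strict.set("qf", "name_text description_text");
    long found = server.query(strict).getResults().getNumFound();
    if (found > 0) {
      System.out.println(found + " results, no suggestion needed");
      return;
    }

    // Fallback: the loose (phonetic + ngram) field set, top few matches only,
    // shown back to the user as "Did you mean ...?" choices instead of as results.
    SolrQuery loose = new SolrQuery(userQuery);
    loose.set("defType", "dismax");
    loose.set("qf", "name_phonetic name_ngram");
    loose.setRows(5);
    for (SolrDocument doc : server.query(loose).getResults()) {
      System.out.println("Did you mean: " + doc.getFieldValue("name") + "?");
    }
  }
}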
Re: Exact matching on names?
Hi Ron, There was a discussion about this some time back, which I implemented (with great success btw) in my own code...basically you store both the analyzed and non-analyzed versions (use string type) in the index, then send in a query like this: +name:clarke name_s:clarke^100 The name field is text so it will analyze down clarke to clark but it will match both clark and clarke and the second clause would boost the entry with clarke up to the top, which you then select with rows=1. -sujit On Tue, 2011-08-16 at 10:20 -0500, Olson, Ron wrote: Hi all- I'm missing something fundamental yet I've been unable to find the definitive answer for exact name matching. I'm indexing names using the standard text field type and my search is for the name clarke. My results include clark, which is incorrect, it needs to match clarke exactly (case insensitive). I tried textType but that doesn't work because I believe it needs to be *really* exact, whereas I'm looking for things like clark oil, bob, frank, and clark, etc. Thanks for any help, Ron DISCLAIMER: This electronic message, including any attachments, files or documents, is intended only for the addressee and may contain CONFIDENTIAL, PROPRIETARY or LEGALLY PRIVILEGED information. If you are not the intended recipient, you are hereby notified that any use, disclosure, copying or distribution of this message or any of the information included in or with it is unauthorized and strictly prohibited. If you have received this message in error, please notify the sender immediately by reply e-mail and permanently delete and destroy this message and its attachments, along with any copies thereof. This message does not create any contractual obligation on behalf of the sender or Law Bulletin Publishing Company. Thank you.
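A small SolrJ illustration of that boost-the-exact-copy trick, assuming the analyzed field is called name and the string copyField is name_s (both names made up here):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrDocument;

public class ExactishNameSearch {
  public static void main(String[] args) throws Exception {
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    String name = "clarke";

    // name is the analyzed text field, name_s the unanalyzed string copy of the
    // same value; the boost pushes the exact spelling to the top of the results.
    SolrQuery query = new SolrQuery("+name:" + name + " name_s:" + name + "^100");
    query.setRows(10);

    for (SolrDocument doc : server.query(query).getResults()) {
      System.out.println(doc.getFieldValue("name"));
    }
  }
}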
Re: Problems generating war distribution using ant
FWIW, we have some custom classes on top of solr as well. The way we do it is using the following ant target:

<target name="war" depends="jar" description="Rebuild Solr WAR with custom code">
  <mkdir dir="${maven.webapps.output}"/>
  <!-- we unwar a copy of the 3.2.0 war file in source repo -->
  <unwar src="${prod.common.lib.external.solr}/apache-solr-3.2.0.war" dest="${maven.webapps.output}"/>
  <!-- add in some extra jar files our custom stuff needs -->
  <copy todir="${maven.webapps.output}/WEB-INF/lib">
    <fileset refid="..."/>
    <fileset refid="..."/>
    ...
  </copy>
  <!-- the jar target builds just our custom classes into a hl-solr.jar, which is copied over to the WEB-INF/lib of the exploded solr war -->
  <copy file="${maven.build.directory}/hl-solr.jar" todir="${maven.webapps.output}/WEB-INF/lib"/>
</target>

Seems to work fine...basically automates what you have described in your second paragraph, but allows us to keep our own code separately from solr code under source control. -sujit On Tue, 2011-08-16 at 16:09 -0700, arian487 wrote: So the way I generate war files now is by running an 'ant dist' in the solr folder. It generates the war fine and I get a build success, and then I deploy it to tomcat and once again the logs show it was successful (from the looks of it). However, when I go to 'myip:8080/solr/admin' I get an HTTP status 404. However, it works when I take a war from the nightly build, expand it, drop some new class files in there that I need, and close it up again. The solr I have checked out seems fine though and I can't find any differences between the war I'm generating and the one that has been generated. -- View this message in context: http://lucene.472066.n3.nabble.com/Problems-generating-war-distribution-using-ant-tp3260070p3260070.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Strip special chars like -
I have done this using a custom tokenfilter that (among other things) detects hyphenated words and converts it to the 3 variations, using a regex match on the incoming token: (\w+)-(\w+) that runs the following regex transform: s/(\w+)-(\w+)/$1$2__$1 $2/ and then splits by __ and passes the original token, the one word and two word versions through a SynonymFilter further down the chain (see Lucene in Action, 2nd Edition for code). -sujit On Tue, 2011-08-09 at 06:27 -0700, roySolr wrote: Hello, I have some terms in my index with specials characters. An example is manchester-united. I want that a user can search for manchester-united,manchester united and manchesterunited. What's the best way to fix this? i have used the patternReplaceFilter and some tokenizers but it couldn't fix the last situation(manchesterunited). Can someone helps me? -- View this message in context: http://lucene.472066.n3.nabble.com/Strip-special-chars-like-tp3238942p3238942.html Sent from the Solr - User mailing list archive at Nabble.com.
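The variant-generation half of that filter, stripped of the Lucene plumbing, is just a regex and a couple of string joins; a sketch (hypothetical class name) is below, and the three variants it returns are what would be fed through the SynonymFilter-style injection further down the chain:

import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HyphenVariants {

  private static final Pattern HYPHENATED = Pattern.compile("(\\w+)-(\\w+)");

  // "manchester-united" -> [manchester-united, manchesterunited, manchester united]
  public static List<String> variants(String token) {
    Matcher m = HYPHENATED.matcher(token);
    if (!m.matches()) {
      return Arrays.asList(token);  // not hyphenated, pass through unchanged
    }
    String joined = m.group(1) + m.group(2);
    String spaced = m.group(1) + " " + m.group(2);
    return Arrays.asList(token, joined, spaced);
  }

  public static void main(String[] args) {
    System.out.println(variants("manchester-united"));
  }
}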
Re: (Solr-UIMA) Doubt regarding integrating UIMA in to solr - Configuration.
Hi Sowmya, I basically wrote an annotator and built a buffering tokenizer around it so I could include it in a Lucene analyzer pipeline. I've blogged about it, not sure if its good form to include links to blog posts in public forums, but here they are, apologies in advance if this is wrong (let me know and I won't do it again). http://sujitpal.blogspot.com/2011/06/uima-analysis-engine-for-keyword.html http://sujitpal.blogspot.com/2011/06/running-uima-analysis-engine-in-lucene.html Of course, this is in Lucene land. I haven't worked with the SOLR-UIMA stuff so this may not answer your question directly. But I think if you build an Tokenizer or TokenFilter then you can declare it as an analyzer chain in SOLR. HTH Sujit On Fri, 2011-07-08 at 09:19 +0200, Sowmya V.B. wrote: Hi Koji Thanks for the mail. Thanks for all the clarifications. I am now using the version 3.3.. But, another query that I have about this is: How can I add an annotator that I wrote myself, in to Solr-UIMA? Here is what I did before I moved to Solr: I wrote an annotator (which worked when I used plain vanilla lucene based indexer), which enriched the document with more fields (Some statistics about the document...all fields added were numeric fields). Those fields were added to the index by extending *JCasAnnotator_ImplBase* class. But, in Solr-UIMA, I am not exactly clear on where the above setup fits in. I thought I would get an idea looking at the annotators that came with the UIMA integration of Solr, but their source was not available. So, I do not understand how to actually integrate my own annotator in to UIMA. Can you please explain on how to go about this? Sowmya. On Fri, Jul 8, 2011 at 2:03 AM, Koji Sekiguchi k...@r.email.ne.jp wrote: (11/07/07 18:38), Sowmya V.B. wrote: Hi I am trying to add UIMA module in to Solr..and began with the readme file given here. https://svn.apache.org/repos/**asf/lucene/dev/tags/lucene_** solr_3_1/solr/contrib/uima/**README.txthttps://svn.apache.org/repos/asf/lucene/dev/tags/lucene_solr_3_1/solr/contrib/uima/README.txt I would recommend you to use Solr 3.3 rather than 3.1, as we have changed some configuration in solrconfig.xml for UIMA. 2. modify your schema.xml adding the fields you want to be hold metadata specifying proper values for type, indexed, stored and multiValued options: -I understood this line as: adding to my schema.xml, the new fields that will come as a result of a UIMA pipeline. For example, in my UIMA pipeline, post-processing, I get fields A,B,C in addition to fields X,Y,Z that I already added to the SolrInputDocument. So, does this mean I should add A,B,C to the schema.xml? I think you got it. Have you tried it but you got some errors? 3. In SolrConfig.xml, inside, uimaConfig runtimeParameters The uimaConfig tag has been moved into update processor setting @ Solr 3.2. Please see the latest README.txt. if iam not using any of those alchemy api key... etc, I think I can remove those lines. However, I plan to use the openNLP tagger tokenizer, and an annotator I wrote for my task. Can I give my model file locations here as runtimeParameters? I don't have an idea of openNLP. 4. I did not understand what fieldMapping tag does. The description said: field mapping describes which features of which types should go in a field-- - For example, in this snippet from the link: type name=org.apache.uima.alchemy.**ts.concept.ConceptFS map feature=text field=concept/ /type -what does feature mean and what does field mean? 
This defines a map from a UIMA feature (http://uima.apache.org/d/uimaj-2.3.1/references.html#ugr.ref.xml.component_descriptor.type_system.features) to a Solr field. koji -- http://www.rondhuit.com/en/
Re: Results with and without whitespace (soccer club and soccerclub)
This may or may not help you; we solved something similar for hyphenated words - essentially when we encountered a hyphenated word (say word1-word2) we send in an OR query with the word (word1-word2) itself, a phrase "word1 word2"~3, and the word formed by removing the hyphen (word1word2). But in this case, soccerclub is not hyphenated, but if you have some kind of mapping of common conjunctions based on your search logs, you could write a custom QParser plugin to break it up like that. -sujit On Fri, 2011-05-20 at 05:52 -0700, roySolr wrote: Thanks for the help so far, I don't think this solves the problem. What if my data look like this: soccer club Manchester united if i search for soccerclub manchester and for soccer club manchester i want this result back. A copyfield that removes whitespaces is not an option. With the charfilter i get something like this: 1. Index time: soccer club Manchester united-- soccerclubManchesterunited indexed. 2. Search time: soccer club OR soccerclub -- soccerclub searched. In this situation i still get no result if i search soccerclub. The index is soccerclubManchesterunited. How can i fix it? -- View this message in context: http://lucene.472066.n3.nabble.com/Results-with-and-without-whitespace-soccer-club-and-soccerclub-tp2934742p2965389.html Sent from the Solr - User mailing list archive at Nabble.com.
Custom sorting based on external (database) data
Hi, Sorry for the possible double post; I wrote this up but had the incorrect sender address, so I am guessing that my previous one is going to be rejected by the list moderation daemon. I am trying to figure out options for the following problem. I am on Solr 1.4.1 (Lucene 2.9.1). I have search results which are going to be ranked by the user (using a thumbs up/down) and would translate to a score between -1 and +1. This data is stored in a database table (unique_id, thumbs_up, thumbs_down, num_calls) that is updated as the thumbs up/down component is clicked. We want to be able to sort the results by the following: score = (thumbs_up - thumbs_down) / num_calls. The unique_id field refers to the one referenced as uniqueId in the schema.xml. Based on the following conversation: http://www.mail-archive.com/solr-user@lucene.apache.org/msg06322.html ...my understanding is that I need to: 1) subclass FieldType to create my own RankFieldType. 2) In this class, override the getSortField() method to return my custom FieldSortComparatorSource object. 3) Build the custom FieldSortComparatorSource object which returns a custom FieldSortComparator object in newComparator(). 4) Configure a field type (rank_t) of class RankFieldType, and a field (called rank) of type rank_t, in schema.xml. 5) Use sort=rank+desc to do the sort. My question is: is there a simpler/more performant way? The number of database lookups seems like it's going to be pretty high with this approach. And it's hard to believe that my problem is new, so I am guessing this is either part of some Solr configuration I am missing, or there is some other (possibly simpler) approach I am overlooking. Pointers to documentation or code (or even keywords I could google) would be much appreciated. TIA for all your help, Sujit
Re: Custom sorting based on external (database) data
Thank you Ahmet, looks like we could use this. Basically we would do periodic dumps of the (unique_id|computed_score) sorted by score and write it out to this file followed by a commit. Found some more info here, for the benefit of others looking for something similar: http://dev.tailsweep.com/solr-external-scoring/ On Thu, 2011-05-05 at 13:12 -0700, Ahmet Arslan wrote: --- On Thu, 5/5/11, Sujit Pal sujit@comcast.net wrote: From: Sujit Pal sujit@comcast.net Subject: Custom sorting based on external (database) data To: solr-user solr-user@lucene.apache.org Date: Thursday, May 5, 2011, 11:03 PM Hi, Sorry for the possible double post, I wrote this up but had the incorrect sender address, so I am guessing that my previous one is going to be rejected by the list moderation daemon. I am trying to figure out options for the following problem. I am on Solr 1.4.1 (Lucene 2.9.1). I have search results which are going to be ranked by the user (using a thumbs up/down) and would translate to a score between -1 and +1. This data is stored in a database table ( unique_id thumbs_up thumbs_down num_calls as the thumbs up/down component is clicked. We want to be able to sort the results by the following score = (thumbs_up - thumbs_down) / (num_calls). The unique_id field refers to the one referenced as uniqueId in the schema.xml. Based on the following conversation: http://www.mail-archive.com/solr-user@lucene.apache.org/msg06322.html ...my understanding is that I need to: 1) subclass FieldType to create my own RankFieldType. 2) In this class I override the getSortField() method to return my custom FieldSortComparatorSource object. 3) Build the custom FieldSortComparatorSource object which returns a custom FieldSortComparator object in newComparator(). 4) Configure the field type of class RankFieldType (rank_t), and a field (called rank) of field type rank_t in schema.xml of type RankFieldType. 5) use sort=rank+desc to do the sort. My question is: is there a simpler/more performant way? The number of database lookups seems like its going to be pretty high with this approach. And its hard to believe that my problem is new, so I am guessing this is either part of some Solr configuration I am missing, or there is some other (possibly simpler) approach I am overlooking. Pointers to documentation or code (or even keywords I could google) would be much appreciated. Looks like it can be done with http://lucene.apache.org/solr/api/org/apache/solr/schema/ExternalFileField.html and http://wiki.apache.org/solr/FunctionQuery You can dump your table into three text files. Issue a commit to load these changes. Sort by function query is available in Solr3.1 though.
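To make the periodic dump concrete, a sketch like the one below (not from the thread) would do it. It assumes a ratings table shaped like the earlier post, a hypothetical field named rank declared as an ExternalFileField in schema.xml, and an output file named external_rank placed where ExternalFileField looks for it; check the ExternalFileField javadocs for the exact directory and key=value file format in your version. A commit afterwards makes Solr re-read the file, and the function-query sort mentioned above can then reference the field.

import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ExternalRankDumper {

  public static void main(String[] args) throws Exception {
    // Hypothetical locations; adjust to your environment.
    Path out = Path.of("/var/solr/data/mycore/data/external_rank");
    String jdbcUrl = "jdbc:postgresql://localhost/ratings";

    try (Connection conn = DriverManager.getConnection(jdbcUrl, "user", "pass");
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery(
             "SELECT unique_id, (thumbs_up - thumbs_down) / CAST(num_calls AS FLOAT) AS score "
             + "FROM ratings WHERE num_calls > 0");
         PrintWriter w = new PrintWriter(Files.newBufferedWriter(out, StandardCharsets.UTF_8))) {
      // ExternalFileField expects one key=value line per document key.
      while (rs.next()) {
        w.println(rs.getString("unique_id") + "=" + rs.getFloat("score"));
      }
    }
    // Then issue a commit so Solr picks up the new file, e.g.
    // curl "http://localhost:8983/solr/mycore/update?commit=true"
  }
}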
Hook to do stuff when searcher is reopened?
Hi, I am developing a SearchComponent that needs to build some initial DocSets and then intersect with the result DocSet during each query (in process()). When the searcher is reopened, I need to regenerate the initial DocSets. I am on Solr 1.4.1. My question is, which method in SearchComponent should I override to ensure that this regeneration happens whenever the searcher is reopened (for example in response to an update followed by a commit)? If no such hook method exists, how would this need to be done? Thanks Sujit
Re: Hook to do stuff when searcher is reopened?
I think I found the answer by looking through the code...specifically SpellCheckComponent. So my component would have to implement SolrCoreAware and in the inform() method, register a custom SolrEventListener which will execute the regeneration code in the postCommit and newSearcher methods. Would still appreciate knowing if there is a simpler way, or if I am wildly off the mark. Thanks Sujit On Thu, 2011-04-07 at 16:39 -0700, Sujit Pal wrote: Hi, I am developing a SearchComponent that needs to build some initial DocSets and then intersect with the result DocSet during each query (in process()). When the searcher is reopened, I need to regenerate the initial DocSets. I am on Solr 1.4.1. My question is, which method in SearchComponent should I override to ensure that this regeneration happens whenever the searcher is reopened (for example in response to an update followed by a commit)? If no such hook method exists, how would this need to be done? Thanks Sujit
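The SpellCheckComponent-style approach described above could be sketched roughly as follows. This targets a recent Solr API (on 1.4.1 the SearchComponent and listener interfaces differ slightly), and names like regenerate() and the match-all placeholder DocSet stand in for whatever the component really needs to build.

import java.io.IOException;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.core.SolrCore;
import org.apache.solr.core.SolrEventListener;
import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.handler.component.SearchComponent;
import org.apache.solr.search.DocSet;
import org.apache.solr.search.SolrIndexSearcher;
import org.apache.solr.util.plugin.SolrCoreAware;

public class DocSetFilterComponent extends SearchComponent implements SolrCoreAware {

  private volatile DocSet baseDocSet; // regenerated whenever a searcher is (re)opened

  @Override
  public void inform(SolrCore core) {
    SolrEventListener listener = new SolrEventListener() {
      public void init(NamedList args) { }
      @Override
      public void postCommit() { }      // a commit that reopens the searcher triggers newSearcher
      @Override
      public void postSoftCommit() { }
      @Override
      public void newSearcher(SolrIndexSearcher newSearcher, SolrIndexSearcher currentSearcher) {
        baseDocSet = regenerate(newSearcher);
      }
    };
    core.registerFirstSearcherListener(listener); // initial searcher at startup
    core.registerNewSearcherListener(listener);   // every reopen after a commit
  }

  private DocSet regenerate(SolrIndexSearcher searcher) {
    try {
      // Placeholder: replace with the component's real DocSet-building logic.
      return searcher.getDocSet(new MatchAllDocsQuery());
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  @Override
  public void prepare(ResponseBuilder rb) { }

  @Override
  public void process(ResponseBuilder rb) {
    // Intersect the query's result DocSet with baseDocSet here.
  }

  @Override
  public String getDescription() {
    return "Rebuilds cached DocSets when the searcher is reopened (sketch)";
  }
}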
Re: Hook to do stuff when searcher is reopened?
Thanks Erick. This looks like it would work... I sent out an update to my original query, there is another approach that would probably also work for my case that is being used by SpellCheckerComponent. I will check out both approaches. Thanks very much for your help. -sujit On Thu, 2011-04-07 at 20:58 -0400, Erick Erickson wrote: I haven't built one myself, but have you considered the Solr UserCache? See: http://wiki.apache.org/solr/SolrCaching#User.2BAC8-Generic_Caches It even receives warmup signals I believe... Best Erick On Thu, Apr 7, 2011 at 7:39 PM, Sujit Pal sujit@comcast.net wrote: Hi, I am developing a SearchComponent that needs to build some initial DocSets and then intersect with the result DocSet during each query (in process()). When the searcher is reopened, I need to regenerate the initial DocSets. I am on Solr 1.4.1. My question is, which method in SearchComponent should I override to ensure that this regeneration happens whenever the searcher is reopened (for example in response to an update followed by a commit)? If no such hook method exists, how would this need to be done? Thanks Sujit
Re: Solr and Permissions
Yes, there can be cases where a user is allowed a subset of a content type, or a combination of content-type groups and individual documents, where this would break down. And yes, AFAIK, if you want to update the permissions in the document (which seems slightly strange, since you would potentially have many more users than documents, so you may want to think this requirement through some more), you would need to update (re-index) the document. -sujit On Thu, 2011-03-10 at 21:24 -0800, go canal wrote: I have similar requirements. Content type is one solution; but there are also other use cases where this is not enough. Another requirement is, when the access permission is changed, we need to update the field - my understanding is we can not unless re-index the whole document again. Am I correct? thanks, canal From: Sujit Pal sujit@comcast.net To: solr-user@lucene.apache.org Sent: Fri, March 11, 2011 10:39:27 AM Subject: Re: Solr and Permissions How about assigning content types to documents in the index, and map users to a set of content types they are allowed to access? That way you will pass in fewer parameters in the fq. -sujit On Fri, 2011-03-11 at 11:53 +1100, Liam O'Boyle wrote: Morning, We use solr to index a range of content to which, within our application, access is restricted by a system of user groups and permissions. In order to ensure that search results don't reveal information about items which the user doesn't have access to, we need to somehow filter the results; this needs to be done within Solr itself, rather than after retrieval, so that the facet and result counts are correct. Currently we do this by creating a filter query which specifies all of the items which may be allowed to match (e.g. id: (foo OR bar OR blarg OR ...)), but this has definite scalability issues - we're starting to run into issues, as this can be a set of ORs of potentially unlimited size (and practically, we're hitting the low thousands sometimes). While we can adjust maxBooleanClauses upwards, I understand that this has performance implications... So, has anyone had to implement something similar in the past? Any suggestions for a more scalable approach? Any advice on safe and sensible limits on how far I can push maxBooleanClauses? Thanks for your advice, Liam
Any way to do payload queries in Luke?
Hello, I am denormalizing a map of (string, float) into a single Lucene document by storing it as key1|score1 key2|score2. In Solr, I pull this in using the following analyzer definition:

<fieldtype name="payloads" stored="false" indexed="true" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.DelimitedPayloadTokenFilterFactory" delimiter="|" encoder="float"/>
  </analyzer>
</fieldtype>

I have my own PayloadSimilarity which overrides scorePayload. The index is created by POSTing Solr XML to Solr. In Solr, I have a custom QParser that converts any query containing a field of type payloads into a PayloadTermQuery instead of a TermQuery (multiple sub-queries are combined using a BooleanQuery). However, in Luke, when I put my custom PayloadSimilarity and a custom PayloadAnalyzer (equivalent to the chain above) in the classpath and enter the same field:value query, the results don't come back ordered by the payload score. I do set the analyzer to my payload analyzer and the similarity to my payload similarity. I guess this is expected, as there is no way (that I know of anyway) for me to tell Luke that this is a PayloadTermQuery rather than a TermQuery. So the question is - can I use some special syntax to indicate to Luke that the query should be converted to a PayloadTermQuery? I don't think Luke can figure it out based on the definition (in Luke I see the field defined as ITS). Thanks Sujit
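Outside of Luke, one way to verify the payload ordering is a few lines of Lucene code that build the PayloadTermQuery directly. The sketch below is against the Lucene 4.x-era payload API (PayloadTermQuery was removed in later releases); PayloadSimilarity is the poster's custom similarity, and the index path, field name, and term are placeholders.

import java.io.File;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.payloads.AveragePayloadFunction;
import org.apache.lucene.search.payloads.PayloadTermQuery;
import org.apache.lucene.store.FSDirectory;

public class PayloadQueryCheck {
  public static void main(String[] args) throws Exception {
    DirectoryReader reader = DirectoryReader.open(FSDirectory.open(new File(args[0])));
    IndexSearcher searcher = new IndexSearcher(reader);
    searcher.setSimilarity(new PayloadSimilarity()); // the custom similarity from this thread

    // includeSpanScore=false so the ordering is driven purely by the payload score.
    PayloadTermQuery q = new PayloadTermQuery(
        new Term("payloads", "key1"), new AveragePayloadFunction(), false);
    TopDocs hits = searcher.search(q, 10);
    for (ScoreDoc sd : hits.scoreDocs) {
      System.out.println(sd.doc + " " + sd.score);
    }
    reader.close();
  }
}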
Re: Solr and Permissions
How about assigning content types to documents in the index, and map users to a set of content types they are allowed to access? That way you will pass in fewer parameters in the fq. -sujit On Fri, 2011-03-11 at 11:53 +1100, Liam O'Boyle wrote: Morning, We use solr to index a range of content to which, within our application, access is restricted by a system of user groups and permissions. In order to ensure that search results don't reveal information about items which the user doesn't have access to, we need to somehow filter the results; this needs to be done within Solr itself, rather than after retrieval, so that the facet and result counts are correct. Currently we do this by creating a filter query which specifies all of the items which may be allowed to match (e.g. id: (foo OR bar OR blarg OR ...)), but this has definite scalability issues - we're starting to run into issues, as this can be a set of ORs of potentially unlimited size (and practically, we're hitting the low thousands sometimes). While we can adjust maxBooleanClauses upwards, I understand that this has performance implications... So, has anyone had to implement something similar in the past? Any suggestions for a more scalable approach? Any advice on safe and sensible limits on how far I can push maxBooleanClauses? Thanks for your advice, Liam
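To make that suggestion concrete, the filter query built from a user's allowed content types might look like the SolrJ sketch below. The field name content_type, the type names, and the way the allowed types are looked up are all hypothetical; the point is that one short fq replaces thousands of OR'ed ids, and Solr caches it independently of the main query.

import java.util.List;
import java.util.stream.Collectors;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class PermissionFilteredSearch {

  public static QueryResponse search(String userQuery, List<String> allowedTypes) throws Exception {
    try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/docs").build()) {
      SolrQuery q = new SolrQuery(userQuery);
      // e.g. content_type:("public" OR "memo") -- one cached filter per permission profile.
      String fq = allowedTypes.stream()
          .map(t -> "\"" + t + "\"")
          .collect(Collectors.joining(" OR ", "content_type:(", ")"));
      q.addFilterQuery(fq);
      q.setFacet(true);
      q.addFacetField("content_type");
      return solr.query(q);
    }
  }

  public static void main(String[] args) throws Exception {
    // Hypothetical: this user may only see public docs and internal memos.
    QueryResponse rsp = search("quarterly report", List.of("public", "memo"));
    System.out.println(rsp.getResults().getNumFound());
  }
}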
Re: Understanding multi-field queries with q and fq
This could probably be done using a custom QParser plugin? Define the pattern like this: String queryTemplate = "title:%Q%^2.0 body:%Q%"; then replace the %Q% with the value of the Q param, send it through QueryParser.parse() and return the query. -sujit On Wed, 2011-03-02 at 11:28 -0800, mrw wrote: Anyone understand how to do boolean logic across multiple fields? Dismax is nice for searching multiple fields, but doesn't necessarily support our syntax requirements. eDismax appears to be not available until Solr 3.1. In the meantime, it looks like we need to support applying the user's query to multiple fields, so if the user enters led zeppelin merle we need to be able to do the logical equivalent of fq=field1:led zeppelin merle OR field2:led zeppelin merle Any ideas? :) mrw wrote: After searching this list, Google, and looking through the Pugh book, I am a little confused about the right way to structure a query. The Packt book uses the example of the MusicBrainz DB full of song metadata. What if they also had the song lyrics in English and German as files on disk, and wanted to index them along with the metadata, so that each document would basically have song title, artist, publisher, date, ..., All_Metadata (copy field of all metadata fields), Text_English, and Text_German fields? There can only be one default field, correct? So if we want to search for all songs containing (zeppelin AND (dog OR merle)) do we repeat the entire query text for all three major fields in the 'q' clause (assuming we don't want to use the cache): q=(+All_Metadata:(zeppelin AND (dog OR merle)) +Text_English:(zeppelin AND (dog OR merle)) +Text_German:(zeppelin AND (dog OR merle))) or repeat the entire query text for all three major fields in the 'fq' clause (assuming we want to use the cache): q=*:*&fq=(+All_Metadata:(zeppelin AND (dog OR merle)) +Text_English:(zeppelin AND (dog OR merle)) +Text_German:(zeppelin AND (dog OR merle))) ? Thanks! -- View this message in context: http://lucene.472066.n3.nabble.com/Understanding-multi-field-queries-with-q-and-fq-tp2528866p2619700.html Sent from the Solr - User mailing list archive at Nabble.com.
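A sketch of that template idea as a QParserPlugin follows, written against a recent Solr API (on 1.4/3.x the exception types differ, and from Solr 3.1 onward edismax covers this out of the box). The field names and boosts are illustrative, and parentheses are added around %Q% so a multi-word query stays bound to each field.

import org.apache.lucene.search.Query;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.search.QParser;
import org.apache.solr.search.QParserPlugin;
import org.apache.solr.search.SyntaxError;

public class FieldTemplateQParserPlugin extends QParserPlugin {

  // Illustrative template: apply the user's query to both fields, boosting title.
  private static final String TEMPLATE = "title:(%Q%)^2.0 body:(%Q%)";

  public void init(NamedList args) { }

  @Override
  public QParser createParser(String qstr, SolrParams localParams,
                              SolrParams params, SolrQueryRequest req) {
    return new QParser(qstr, localParams, params, req) {
      @Override
      public Query parse() throws SyntaxError {
        String expanded = TEMPLATE.replace("%Q%", qstr);
        // Delegate the expanded string to the standard lucene query parser.
        return QParser.getParser(expanded, "lucene", getReq()).parse();
      }
    };
  }
}

Registered in solrconfig.xml and invoked as {!template}, it would expand "led zeppelin merle" into title:(led zeppelin merle)^2.0 body:(led zeppelin merle).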
Re: Solr Payloads retrieval
Yes, check out the field type payloads in the schema.xml file. If you set up one or more of your fields as type payloads (you would use the DelimitedPayloadTokenFilterFactory during indexing in your analyzer chain), you can then use the PayloadTermQuery to query it with; scoring can be done with a custom PayloadSimilarity implementation. Check out this (slightly dated) article for more information: http://www.lucidimagination.com/blog/2009/08/05/getting-started-with-payloads/ -sujit On Mon, 2011-02-28 at 14:49 -0300, Fabiano Nunes wrote: Hi! I'm studying a migration from pure Lucene to Solr, but I need a crucial feature: Is it possible to retrieve payloads from Solr? I'm storing the coordinates of each term in its payload to highlight images client-side. Thank you,
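For the scoring side mentioned above, a custom PayloadSimilarity could look like the sketch below. It targets the Lucene 4.x DefaultSimilarity/TFIDFSimilarity API; the scorePayload signature is different on Solr 1.4-era Lucene, and the payload scoring classes changed again in later releases, so treat this as illustrative only.

import org.apache.lucene.analysis.payloads.PayloadHelper;
import org.apache.lucene.search.similarities.DefaultSimilarity;
import org.apache.lucene.util.BytesRef;

public class PayloadSimilarity extends DefaultSimilarity {

  @Override
  public float scorePayload(int doc, int start, int end, BytesRef payload) {
    // Decode the float that DelimitedPayloadTokenFilter stored after the '|' delimiter.
    if (payload == null) {
      return 1.0f;
    }
    return PayloadHelper.decodeFloat(payload.bytes, payload.offset);
  }
}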
Re: loading XML docbook files into solr
Hi Derek, The XML files you post to Solr need to be in the correct Solr-specific XML format. One way to preserve the original structure would be to flatten the document into field names indicating the position of the text, for example: book_titleabbrev: Advancing Return on Investment Analysis for Government IT: A Public Value Framework ... etc. But you will still have to parse your docbook XML into the appropriate schema that you want to use for Solr. I believe DIH also allows XSLT-based preprocessors so you don't have to write parsing code, but I haven't used them. -sujit On Sat, 2011-02-26 at 10:40 -0500, Derek Werthmuller wrote: I've been working on this for a while and seem to have hit a wall. The error messages aren't complete enough to give guidance on why importing a sample docbook document into Solr is not working. I'm using the curl tool to post the XML file and receive a non-error message, but the document count doesn't increase and the *:* query still returns no results. The docbook document has an id attribute and this is mapped to the uniqueKey in the schema.xml file. But it seems this may be the issue still. It's not clear how the field names map to the XML. Do they only map to attributes, or do they map to elements? How do you differentiate? Can field names in the schema.xml file have xpath statements? Are there other important sections of the solrconfig that could be keeping this from working? We want to maintain much of the document structure so we have more control over the searching. Here is what the docbook XML looks like (tried setting the uniqueKey to id and docid but no go either way):

<book label="issuebriefs" id="proi">
  <docid>245</docid>
  <titleabbrev>Advancing Return on Investment Analysis for Government IT: A Public Value Framework</titleabbrev>
  <chapter>
    <title>Advancing Return on Investment Analysis for Government IT: A Public Value Framework</title>
    <para>
      <mediaobject>
        <imageobject>
          <imagedata fileref="/publications/annualreports/ar2006/images/public-value.jpg" format="jpg" contentdepth="157" contentwidth="216" align="left"/>
        </imageobject>
        <textobject>
          <phrase>Public Value Illustration</phrase>
        </textobject>
      </mediaobject>
      ..

Here is the section of the schema.xml:

  <field name="id" type="string" indexed="true" stored="true" multiValued="false" required="true" />
  <field name="titleabbrev" type="text" indexed="true" stored="true" />
  <field name="title" type="text" indexed="true" stored="true" />
  <field name="para" type="text" indexed="true" stored="true" />
  <field name="ulink" type="string" indexed="true" stored="true" />
  <field name="listitem" type="text" indexed="true" stored="true" />
  <field name="all_text" type="text" indexed="true" stored="false" multiValued="true" />
  <copyField source="title" dest="all_text" />
  <copyField source="para" dest="all_text" />
  <copyField source="listitem" dest="all_text" />
  <copyField source="titleabbrev" dest="all_text" />
</fields>

<!-- Field to use to determine and enforce document uniqueness. Unless this field is marked with required=false, it will be a required field -->
<uniqueKey>id</uniqueKey>

<!-- field for the QueryParser to use when an explicit fieldname is absent -->
<defaultSearchField>all_text</defaultSearchField>

<!-- SolrQueryParser configuration: defaultOperator="AND|OR" -->
<solrQueryParser defaultOperator="OR"/>
</schema>

Load command results:

$ ./postfile.sh
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">56</int></lst>
</response>
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">15</int></lst>
</response>

Thanks Derek
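If DIH/XSLT is not appealing, the flattening suggested above can also be done with a small SolrJ program that parses the DocBook file with XPath and posts a Solr document with the flattened field names. The field names, XPath expressions, and core URL below are illustrative only.

import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.w3c.dom.Document;

public class DocBookIndexer {

  public static void main(String[] args) throws Exception {
    DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
    // Skip fetching the DocBook DTD so parsing works offline.
    dbf.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
    Document docbook = dbf.newDocumentBuilder().parse(new File(args[0]));
    XPath xpath = XPathFactory.newInstance().newXPath();

    // Flatten selected DocBook elements/attributes into Solr fields.
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", xpath.evaluate("/book/@id", docbook));
    doc.addField("titleabbrev", xpath.evaluate("/book/titleabbrev", docbook).trim());
    doc.addField("title", xpath.evaluate("/book/chapter/title", docbook).trim());
    doc.addField("para", xpath.evaluate("string(/book/chapter/para)", docbook).trim());

    try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/docbook").build()) {
      solr.add(doc);
      solr.commit();
    }
  }
}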
Re: manually editing spellcheck dictionary
If the dictionary is a Lucene index, wouldn't it be as simple as deleting by term? Something like this (Lucene 2.9/3.x API; spellIndexDir is the path to the spellcheck dictionary index):

IndexReader sdreader = IndexReader.open(FSDirectory.open(new File(spellIndexDir)), false); // open read-write
sdreader.deleteDocuments(new Term("word", "sherri"));
...
sdreader.close();

I am guessing your dictionary is built dynamically using content words. If so, you may want to run the words through an aspell-like filter (jazzy.sf.net is a Java implementation of aspell that works quite well with single words) to determine if more of these should be removed, and whether they should be added in the first place. -sujit On Fri, 2011-02-25 at 10:41 -0700, Tanner Postert wrote: I'm using an index-based spellcheck dictionary and I was wondering if there were a way for me to manually remove certain words from the dictionary. Some of my content has some mis-spellings, and for example when I search for the word sherrif (which should be spelled sheriff), I get recommendations like sherriff or sherri instead. If I could remove those words, it would seem like the system would work a little better.
Re: boosting results by a query?
We are currently a Lucene shop; the way we do it is to have these results come from a database table (where they are available in rank order). We want to move to Solr, so what I plan on doing to replicate this functionality is to write a custom request handler that will do the database query and put the results at the top of the search results before the SolrIndexSearcher is invoked. -sujit On Fri, 2011-02-11 at 16:31 -0500, Ryan McKinley wrote: I have an odd need, and want to make sure I am not reinventing a wheel... Similar to the QueryElevationComponent, I need to be able to move documents that match a given query to the top of a list. If there were no sort, then this could be implemented easily with BooleanQuery (I think), but with sort it gets more complicated. Seems like I need: sortSpec.setSort(new Sort(new SortField[] { new SortField(/* something that only sorts results in the boost query */), new SortField(/* the regular sort */) })); Is there an existing FieldComparator I should look at? Any other pointers/ideas? Thanks ryan
Re: Architecture decisions with Solr
Another option (assuming the case where a user can be granted access to a certain class of documents, and more than one user would be able to access certain documents) would be to store the access filter (as an OR query of content types) in an external cache (perhaps a database, or an external cache that the database changes are published to periodically), then applying this access filter as a filter on the base query. -sujit On Wed, 2011-02-09 at 14:38 -0500, Glen Newton wrote: "This application will be built to serve many users" If this means that you have thousands of users, 1000s of VMs and/or 1000s of cores is not going to scale. Have an ID in the index for each user, and filter using it. Then they can see only their own documents. Assuming that you are building an app, through which they authenticate, that talks to Solr (i.e. all requests are filtered using their ID). -Glen On Wed, Feb 9, 2011 at 2:31 PM, Greg Georges greg.geor...@biztree.com wrote: From what I understand about multicore, each of the indexes is independent of the others, right? Or would one index have access to the info of the other? My requirement is as you mention: a client has access only to his or her search data, based on their documents. Other clients have no access to the index of other clients. Greg -Original Message- From: Darren Govoni [mailto:dar...@ontrenet.com] Sent: 9 février 2011 14:28 To: solr-user@lucene.apache.org Subject: Re: Architecture decisions with Solr What about standing up a VM (a search appliance that you would make) for each client? If there's no data sharing across clients, then using the same Solr server/index doesn't seem necessary. Solr will easily meet your needs though, it's the best there is. On Wed, 2011-02-09 at 14:23 -0500, Greg Georges wrote: Hello all, I am looking into an enterprise search solution for our architecture and I am very pleased to see all the features Solr provides. In our case, we will have a need for a highly scalable application for multiple clients. This application will be built to serve many users who each will have a client account. Each client will have a multitude of documents to index (0-1000s of documents). After discussion we were talking about going multicore and having one index per client account. The reason for this is that security is achieved by having a separate index for each client etc. Is this the best approach? How feasible is it (dynamically creating indexes on client account creation)? Is it better to go the faceted search capabilities route? Thanks for your help Greg