Re: Facets, termvectors, relevancy and Multi word tokenizing
Hi Iorixxx! I haven't optimized the index, but the day after this post I saw that the problem was gone. I will follow your advice next time! I'm now avoiding so much manipulation at index time and doing more of the work in the Java code on the client side. If I had time, I would implement a new tokenizer... -- View this message in context: http://lucene.472066.n3.nabble.com/Facets-termvectors-relevancy-and-Multi-word-tokenizing-tp4120101p4122862.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Facets, termvectors, relevancy and Multi word tokenizing
Hi, Please optimize your index (you can do it from the Core Admin GUI) and see if the problem goes away. Ahmet

On Friday, March 7, 2014 1:18 PM, epnRui wrote:
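Besides the Core Admin GUI, an optimize can also be triggered over HTTP by passing the `optimize` parameter to the core's update handler; the optimize merges segments and expunges deleted documents, which is why stale counts from deleted documents can disappear afterwards. A minimal sketch, assuming a core named `collection1` on the default port (both names are assumptions, not from the thread):

```shell
# Trigger an optimize (segment merge + expunge of deleted docs) on core "collection1"
curl 'http://localhost:8983/solr/collection1/update?optimize=true'
```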
Re: Facets, termvectors, relevancy and Multi word tokenizing
Hi guys! I solved my problem on the client side; not ideal, but at least I solved it... Anyway, now I have another problem, which is the following: I had previously used replace chars and replace patterns (char filters and filters) at index time to replace "EP" with "European Parliament". At that point, it increased the facet_field count for "European Parliament". Now I have a big problem: I have already deleted the document that generated "European Parliament", and still that facet_field count will not decrease! Is there a way to either remove a facet value or to decrease its count manually? Thanks!
Re: Facets, termvectors, relevancy and Multi word tokenizing
Hi guys, So, I keep facing this problem which I can't solve. I thought it was due to HTML anchors containing the name of the hashtag, and thus repeating it, but it's not. So the use case is: 1 - I need to treat hashtags as tokens. 2 - The hashtag has to show up in the facets.

Right now if I index this text: "Action, sanctions or diplomacy: which way forward for the #EU <http://twitter.com/search?q=%23EU> & #Ukraine <http://twitter.com/search?q=%23Ukraine> ? Tell us @LinkedIn <http://twitter.com/LinkedIn> debate http://t.co/umf9olxH9f <http://t.co/umf9olxH9f> " I get the tokens as follows (see image for more detail): action, sanction, diplomacy, forward, #eu, #ukraine, tell, linkedin, debate, umf9olxh9f, ace, bate <http://lucene.472066.n3.nabble.com/file/n4121389/solr.png>

Then, if I have a look at the facets after indexing, I find that (for Ukraine) the facet count is increased for both "Ukraine" and "#Ukraine", instead of only for "#Ukraine". Does anyone have any idea why this is happening?
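For reference, one common way to keep the leading # as part of the token is to avoid StandardTokenizer (which discards punctuation like # and @) and instead use a whitespace tokenizer plus a WordDelimiterFilter with a types file that reclassifies those characters as letters. A sketch, not the poster's actual schema; the field-type name and types-file name are made up:

```xml
<!-- schema.xml: hypothetical field type that preserves #hashtags and @mentions -->
<fieldType name="text_tweet" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- wdfftypes.txt contains:
           # => ALPHA
           @ => ALPHA  -->
    <filter class="solr.WordDelimiterFilterFactory" types="wdfftypes.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```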
Re: Facets, termvectors, relevancy and Multi word tokenizing
Hi guys, I'm on my way to solving it properly. This is how my field looks now: I still have one case where I'm facing issues, because in fact I want to preserve the #: - "#European Parliament" is translated into one token instead of two ("#European" and "Parliament")... Anyway, I have some ideas on how to do it. I'll let you know what the final solution is.
Re: Facets, termvectors, relevancy and Multi word tokenizing
Hi, Let's say you have accomplished what you want. You have a .txt with the tokens to merge, like "European" and "Parliament". What is your use case then? What is your high-level goal? The MappingCharFilter approach is closer (to your .txt approach) than the PatternReplaceCharFilterFactory approach. By the way, it could also be simulated with ShingleFilterFactory + KeepWordFilterFactory + TypeTokenFilterFactory. Maybe it can be done by firing phrase queries at query time (without interfering with the index) on the client side? e.g. q="European Parliament"~0

On Friday, February 28, 2014 11:55 AM, epnRui wrote:
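The phrase-query suggestion above (handling multi-word terms at query time instead of rewriting them at index time) can be sketched on the client side. A minimal illustration; the helper name is made up:

```python
def phrase_query(term: str, slop: int = 0) -> str:
    """Quote a multi-word term as an exact Solr phrase query; single words pass through."""
    if " " in term:
        # Phrase query with slop 0 requires the words to be adjacent and in order.
        return '"%s"~%d' % (term, slop)
    return term

print(phrase_query("European Parliament"))  # → "European Parliament"~0
print(phrase_query("Ukraine"))              # → Ukraine
```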
Re: Facets, termvectors, relevancy and Multi word tokenizing
Have you tried just using a copyField? For example, I had a similar use case where I needed a particular field (f1) tokenized but also needed to facet on the complete contents. For that, I created a copyField f2 from f1; f1 used tokenizers and filters, but f2 was just a plain string. You then facet on f2... just an idea.

On 02/28/2014 04:54 AM, epnRui wrote:
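The copyField setup described above might look like this in schema.xml; the field names f1/f2 follow the example, and the exact field types are assumptions:

```xml
<!-- f1 is analyzed for full-text search; f2 keeps the complete value for faceting -->
<field name="f1" type="text_en" indexed="true" stored="true"/>
<field name="f2" type="string" indexed="true" stored="false"/>
<copyField source="f1" dest="f2"/>
```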
Re: Facets, termvectors, relevancy and Multi word tokenizing
Hi Ahmet!! I went ahead and did something I thought was not a clean solution, and then when I read your post I found we had thought of the same solution, including European_Parliament with the _ :) So I guess there is no way to do this more cleanly, except maybe by implementing my own Tokenizer and Filters, but I honestly couldn't find a tutorial for implementing a customized Solr Tokenizer. If I end up needing to do it, I will write a tutorial.

So for now I'm using PatternReplaceCharFilterFactory to replace "European Parliament" with European_Parliament (initially I didn't use the md5 hash for European_Parliament). Then, after StandardTokenizerFactory has run, I replace it back to "European Parliament". Well, I guess I just found a way to make a two-word token :)

I had seen ShingleFilterFactory, but the problem is I don't need the whole phrase in tokens of two words, and I understood that's what it does. Of course, I would need some filter that would handle a .txt with the tokens to merge, like "European" and "Parliament".

I'm still having another problem, but maybe I'll find a solution after I read the page you attached, which seems great: Solr is treating #European as both #European and European, meaning it makes two facets for one token. I want it to count only #European. I ran the analysis debugger in my Solr admin console and I don't see how it can be doing that. Would you know of a reason for this? Thanks for your reply; the page you attached seems excellent and I'll read it through.
Re: Facets, termvectors, relevancy and Multi word tokenizing
Hi epnRui, I don't fully follow your e-mail (I think you need to describe your use case), but here are some answers:

- Is it possible to have facets of two or more words? Yes. For example, if you use ShingleFilterFactory at index time, you will see two or more words in facets.

- Can I tokenize a phrase into words, but when it comes across "European Union", generate one token for "European Union" and not two tokens "European" and "Union"? Yes. For example, you can use MappingCharFilter (executed before the tokenizer) with this mapping: "European Union" => "European_Union"

Regarding the synonym filter, please see: http://nolanlawson.com/2012/10/31/better-synonym-handling-in-solr/ Ahmet

On Thursday, February 27, 2014 1:10 PM, epnRui wrote:
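The MappingCharFilter answer above could look like this in schema.xml; the field-type name and mapping-file name are made up for illustration:

```xml
<fieldType name="text_multiword" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- mapping.txt contains:
           "European Union" => "European_Union"  -->
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping.txt"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Because the char filter runs before tokenization, the tokenizer sees the single string European_Union and emits it as one token, which then surfaces as a single facet value.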
Facets, termvectors, relevancy and Multi word tokenizing
Hi everyone! I'm having a problem; I have searched and haven't found a solution yet, and am rather confused at the moment. I have an application that stores human-readable texts in my Solr index. It finds the most relevant terms in that human-readable text, I think using term vectors and facets, and it stores the facet terms. All works fine, but now I need the most relevant terms to also be terms of at least two words, like "European Union", which is quite a frequent term in my system... Still, the system is putting "European" and "Union" into the facets as two separate terms. So, the questions are:

- Is it possible to have facets of two or more words?
- Can I tokenize a phrase into words, but when it comes across "European Union", generate one token for "European Union" and not two tokens "European" and "Union"?
- Can term vectors be used to find the relevancy of multi-word terms like "European Union"?
- Can I use SynonymFilterFactory to transform "EU, UE, European Union, Union Europeene" into "European Union"?

At the moment of indexing I have the following analyzer for the English language: Thank you for the help!