Re: Facets, termvectors, relevancy and Multi word tokenizing
Hi Iorixxx! I haven't optimized the index, but the day after this post I saw that the problem was gone. I will follow your advice next time! I'm now avoiding so much manipulation at index time and doing more of the work in the Java code on the client side. If I had time, I would implement a new tokenizer... -- View this message in context: http://lucene.472066.n3.nabble.com/Facets-termvectors-relevancy-and-Multi-word-tokenizing-tp4120101p4122862.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Facets, termvectors, relevancy and Multi word tokenizing
Hi, Please optimize your index (you can do it from the Core Admin GUI) and see if the problem goes away. Ahmet

On Friday, March 7, 2014 1:18 PM, epnRui wrote:
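Besides the Core Admin GUI, an optimize can also be triggered over HTTP by passing the `optimize` parameter to the core's update handler; the optimize merges segments and expunges deleted documents, which is why stale counts from deleted documents can disappear afterwards. A minimal sketch, assuming a core named `collection1` on the default port (both names are assumptions, not from the thread):

```shell
# Trigger an optimize (segment merge + expunge of deleted docs) on core "collection1"
curl 'http://localhost:8983/solr/collection1/update?optimize=true'
```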
Re: Facets, termvectors, relevancy and Multi word tokenizing
Hi guys! I solved my problem on the client side; not ideal, but at least I solved it... Anyway, now I have another problem, which is the following: I had previously used replace chars and replace patterns (char filters and filters) at index time to replace "EP" with "European Parliament". At that point, it increased the facet_field count for "European Parliament". Now I have a big problem: I have already deleted the document that generated "European Parliament", and still that facet_field count will not decrease! Is there a way to either remove a facet value or to decrease its count manually? Thanks!
Re: Facets, termvectors, relevancy and Multi word tokenizing
Hi guys, So, I keep facing this problem which I can't solve. I thought it was due to HTML anchors containing the name of the hashtag, and thus repeating it, but it's not. So the use case is: 1 - I need to treat hashtags as tokens. 2 - The hashtag has to show up in the facets.

Right now if I index this text: "Action, sanctions or diplomacy: which way forward for the #EU <http://twitter.com/search?q=%23EU> & #Ukraine <http://twitter.com/search?q=%23Ukraine> ? Tell us @LinkedIn <http://twitter.com/LinkedIn> debate http://t.co/umf9olxH9f <http://t.co/umf9olxH9f> " I get the tokens as follows (see image for more detail): action, sanction, diplomacy, forward, #eu, #ukraine, tell, linkedin, debate, umf9olxh9f, ace, bate <http://lucene.472066.n3.nabble.com/file/n4121389/solr.png>

Then, if I have a look at the facets after indexing, I find that (for Ukraine) the facet count is increased for both "Ukraine" and "#Ukraine", instead of only for "#Ukraine". Does anyone have any idea why this is happening?
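For reference, one common way to keep the leading # as part of the token is to avoid StandardTokenizer (which discards punctuation like # and @) and instead use a whitespace tokenizer plus a WordDelimiterFilter with a types file that reclassifies those characters as letters. A sketch, not the poster's actual schema; the field-type name and types-file name are made up:

```xml
<!-- schema.xml: hypothetical field type that preserves #hashtags and @mentions -->
<fieldType name="text_tweet" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- wdfftypes.txt contains:
           # => ALPHA
           @ => ALPHA  -->
    <filter class="solr.WordDelimiterFilterFactory" types="wdfftypes.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```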
Re: Facets, termvectors, relevancy and Multi word tokenizing
Hi guys, I'm on my way to solving it properly. This is how my field looks now: I still have one case where I'm facing issues, because in fact I want to preserve the #: - "#European Parliament" is translated into one token instead of two ("#European" and "Parliament")... Anyway, I have some ideas on how to do it. I'll let you know what the final solution is.
Re: Facets, termvectors, relevancy and Multi word tokenizing
Hi, Let's say you have accomplished what you want. You have a .txt with the tokens to merge, like "European" and "Parliament". What is your use case then? What is your high-level goal? The MappingCharFilter approach is closer (to your .txt approach) than the PatternReplaceCharFilterFactory approach. By the way, it could also be simulated with ShingleFilterFactory + KeepWordFilterFactory + TypeTokenFilterFactory. Maybe it can be done by firing phrase queries at query time (without interfering with the index) on the client side? e.g. q="European Parliament"~0

On Friday, February 28, 2014 11:55 AM, epnRui wrote:
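The phrase-query suggestion above (handling multi-word terms at query time instead of rewriting them at index time) can be sketched on the client side. A minimal illustration; the helper name is made up:

```python
def phrase_query(term: str, slop: int = 0) -> str:
    """Quote a multi-word term as an exact Solr phrase query; single words pass through."""
    if " " in term:
        # Phrase query with slop 0 requires the words to be adjacent and in order.
        return '"%s"~%d' % (term, slop)
    return term

print(phrase_query("European Parliament"))  # → "European Parliament"~0
print(phrase_query("Ukraine"))              # → Ukraine
```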
Re: Facets, termvectors, relevancy and Multi word tokenizing
Have you tried just using a copyField? For example, I had a similar use case where I needed a particular field (f1) tokenized but also needed to facet on the complete contents. For that, I created a copyField f2 from f1; f1 used tokenizers and filters, but f2 was just a plain string. You then facet on f2... just an idea.

On 02/28/2014 04:54 AM, epnRui wrote:
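The copyField setup described above might look like this in schema.xml; the field names f1/f2 follow the example, and the exact field types are assumptions:

```xml
<!-- f1 is analyzed for full-text search; f2 keeps the complete value for faceting -->
<field name="f1" type="text_en" indexed="true" stored="true"/>
<field name="f2" type="string" indexed="true" stored="false"/>
<copyField source="f1" dest="f2"/>
```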
Re: Facets, termvectors, relevancy and Multi word tokenizing
Hi Ahmet!! I went ahead and did something I thought was not a clean solution, and then when I read your post I found we had thought of the same solution, including European_Parliament with the _ :) So I guess there is no way to do this more cleanly, except maybe by implementing my own Tokenizer and Filters, but I honestly couldn't find a tutorial for implementing a customized Solr Tokenizer. If I end up needing to do it, I will write a tutorial.

So for now I'm using PatternReplaceCharFilterFactory to replace "European Parliament" with European_Parliament (initially I didn't use the md5 hash for European_Parliament). Then, after StandardTokenizerFactory has run, I replace it back to "European Parliament". Well, I guess I just found a way to make a two-word token :)

I had seen ShingleFilterFactory, but the problem is I don't need the whole phrase in tokens of two words, and I understood that's what it does. Of course, I would need some filter that would handle a .txt with the tokens to merge, like "European" and "Parliament".

I'm still having another problem, but maybe I'll find a solution after I read the page you attached, which seems great: Solr is treating #European as both #European and European, meaning it makes two facets for one token. I want it to count only #European. I ran the analysis debugger in my Solr admin console and I don't see how it can be doing that. Would you know of a reason for this? Thanks for your reply; the page you attached seems excellent and I'll read it through.
Re: Facets, termvectors, relevancy and Multi word tokenizing
Hi epnRui, I don't fully follow your e-mail (I think you need to describe your use case), but here are some answers:

- Is it possible to have facets of two or more words? Yes. For example, if you use ShingleFilterFactory at index time, you will see two or more words in facets.

- Can I tokenize a phrase into words, but when it comes across "European Union", generate one token for "European Union" and not two tokens "European" and "Union"? Yes. For example, you can use MappingCharFilter (executed before the tokenizer) with this mapping: "European Union" => "European_Union"

Regarding the synonym filter, please see: http://nolanlawson.com/2012/10/31/better-synonym-handling-in-solr/ Ahmet

On Thursday, February 27, 2014 1:10 PM, epnRui wrote:
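The MappingCharFilter answer above could look like this in schema.xml; the field-type name and mapping-file name are made up for illustration:

```xml
<fieldType name="text_multiword" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- mapping.txt contains:
           "European Union" => "European_Union"  -->
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping.txt"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Because the char filter runs before tokenization, the tokenizer sees the single string European_Union and emits it as one token, which then surfaces as a single facet value.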
Facets, termvectors, relevancy and Multi word tokenizing
Hi everyone! I'm having a problem; I have searched and haven't found a solution yet, and am rather confused at the moment. I have an application that stores human-readable texts in my Solr index. It finds the most relevant terms in that human-readable text, I think using term vectors and facets, and it stores the facet terms. All works fine, but now I need the most relevant terms to also be terms of at least two words, like "European Union", which is quite a frequent term in my system... Still, the system is putting "European" and "Union" into the facets as two separate terms. So, the questions are:

- Is it possible to have facets of two or more words?
- Can I tokenize a phrase into words, but when it comes across "European Union", generate one token for "European Union" and not two tokens "European" and "Union"?
- Can term vectors be used to find the relevancy of multi-word terms like "European Union"?
- Can I use SynonymFilterFactory to transform "EU, UE, European Union, Union Europeene" into "European Union"?

At the moment of indexing I have the following analyzer for the English language: Thank you for the help!