Keywords/terms mutual exclusion

2011-04-05 Thread Octavian Covalschi
Hi there,

I'm trying to use Solr in one of my projects and I've got a small problem
that I can't figure out.

Basically our application is collecting data submitted by users. Now the
problem is that submitted data may contain some incorrect info, like some
keywords that will mess up search results. A simple example:

I've got an article about a pair of glasses. The title of the page where
that product is located also contains the "shoes" keyword, which is
irrelevant to glasses but relevant to the website as a whole. So in this
case I need to exclude "shoes" when I have the "glasses" keyword. We're
using the page's title and other content to generate keywords. If I search
for "women shoes" or even just "shoes", I get those glasses as well, which
is not what I need.

Does this make sense? Does anyone have an idea, or has anyone had a similar
problem and found a decent solution? It looks to me like I need a list of
ignored keywords for certain terms (keywords that are semantically
incompatible with them), but I'm not sure that's the best approach, and
even so, I'm not sure how to blacklist terms for other terms.
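
For illustration, the "ignored keywords for certain terms" idea could be
applied to the generated keywords before they are sent to Solr. This is only
a minimal sketch under stated assumptions: the EXCLUSIONS mapping and the
filter_keywords helper are hypothetical names, the mapping is maintained by
hand, keywords are assumed to be already lowercased, and nothing here is a
Solr API.

```python
# Hypothetical, hand-maintained mapping: when the key term is present in a
# document's keywords, the listed terms are dropped before indexing.
EXCLUSIONS = {
    "glasses": {"shoes"},
}


def filter_keywords(keywords, exclusions=EXCLUSIONS):
    """Return keywords with terms blocked by any other present term removed.

    The mapping is directional: "glasses" blocks "shoes", but "shoes" does
    not block "glasses" unless that rule is added separately.
    """
    present = set(keywords)
    blocked = set()
    for term in present:
        blocked |= exclusions.get(term, set())
    # Preserve the original order of the surviving keywords.
    return [k for k in keywords if k not in blocked]


if __name__ == "__main__":
    print(filter_keywords(["women", "glasses", "shoes"]))
    print(filter_keywords(["women", "shoes"]))
```

With this rule in place, a document whose keywords include "glasses" would be
indexed without "shoes", while a document that is really about shoes is left
untouched.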

Thank you in advance.


Re: Keywords/terms mutual exclusion

2011-04-05 Thread Octavian Covalschi
Yes, you may be right, sorry for the confusion.

Our ultimate goal is to collect user-entered data with the least possible
interaction from the users (users are lazy, you know). So basically users
just point out where they found a particular item, and the app's job is to
index it and later show it in search results. The main problem/task is to
generate good, relevant keywords for each document.

We would also like to avoid editors' involvement, so we kind of trust users,
but not really, and at the same time we don't want to verify the data... too
much, at least :)

The example I gave was a real one, but I wasn't clear enough, sorry again.
So:
No, I don't think we can know at indexing time that "shoes" should be
excluded. We also have categories, but in this particular case the category
field can be the same for both items (it's too generic), so we can't really
use it to exclude "shoes". One solution would be more specific categories
(which might let us know what to exclude), but that would mean adding too
many of them, and eventually we'd like to get rid of categories anyway. I
guess we'll have to rethink it one more time.

The mapping of which terms should exclude which was supposed to be a manual
operation. I was thinking of it as a learning process, so that after a
while less and less editor involvement would be needed. But I'm not sure
about this anymore.
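
One way such a learning process might look, purely as a sketch: each time an
editor manually rejects a keyword for a document, the (anchor term, rejected
term) pair is counted, and once a pair has been rejected often enough it
becomes an automatic rule that no longer needs review. The ExclusionLearner
class, its method names, and the threshold value are all hypothetical and
chosen only for illustration.

```python
from collections import Counter


class ExclusionLearner:
    """Promote repeated manual rejections to automatic exclusion rules."""

    def __init__(self, threshold=3):
        self.threshold = threshold  # rejections needed before a rule is trusted
        self.votes = Counter()      # (anchor, rejected) -> number of rejections
        self.rules = {}             # anchor -> set of automatically excluded terms

    def record_rejection(self, anchor, rejected):
        """Record that an editor removed `rejected` from a doc about `anchor`."""
        self.votes[(anchor, rejected)] += 1
        if self.votes[(anchor, rejected)] >= self.threshold:
            self.rules.setdefault(anchor, set()).add(rejected)

    def needs_review(self, anchor, candidate):
        """True while no learned rule covers this pair yet."""
        return candidate not in self.rules.get(anchor, set())


if __name__ == "__main__":
    learner = ExclusionLearner(threshold=2)
    learner.record_rejection("glasses", "shoes")
    print(learner.needs_review("glasses", "shoes"))
    learner.record_rejection("glasses", "shoes")
    print(learner.needs_review("glasses", "shoes"))
```

The threshold trades editor effort against the risk of learning a bad rule
from a single mistaken rejection; the right value would have to be found by
experiment.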

Thanks again for your input.



On Tue, Apr 5, 2011 at 3:14 PM, Jonathan Rochkind wrote:

> I don't completely understand.  I think maybe you replaced your
> domain-specific actualities with another example in an attempt to be more
> general or not reveal your business, but just made your explanation even
> more confusing!
>
> But. At the point you are indexing, is it possible to know that "shoes"
> should not be indexed for that record at all?
>
> If so, then the best bet is indeed to prevent it from being indexed at all
> at the point of indexing.
>
> Depending on exactly what algorithm you can describe for how you know when
> a given term should NOT have been included in the record, and should not be
> indexed -- there may be ways to do it with Solr analysis. Otherwise, you'd
> have to just preprocess before even giving it to Solr for indexing.
>
> Hope this gives you some ideas at least, I don't entirely understand what
> you're trying to do.