There are many uses for shingles.

I've used them to find common phrases in text, which is my understanding of what you are trying to achieve. It works rather well, is a very simple solution, and is easy on resources compared to real semantic analysis.

You'll get a lot of shingles such as "there is" and "we are", but using a stop word list to filter out any shingle containing one or more stop words should do the trick (I did that in post-processing, keeping all shingles in my index). Depending on your corpora, it will probably take a bit of manual work to get a really clean list of common phrases that make sense. Just create a list, inspect it by eye, and try to find patterns in the phrases you want to get rid of. You might also want to look for punctuation in your text to avoid creating shingles that span different sentences. There is a pretty good sentence extraction tool in GATE you can use.
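
To make the steps above concrete, here is a minimal sketch in plain Python rather than with Lucene's shingle filter. It counts word shingles per sentence and then drops any shingle that contains a stop word. The stop word list, the naive punctuation-based sentence split, and the names `sentences`, `shingles` and `common_phrases` are all assumptions made for illustration, not part of any library.

```python
import re
from collections import Counter

# Hypothetical stop word list; a real one would be tuned to the corpus.
STOP_WORDS = {"the", "a", "an", "is", "are", "we", "there", "its", "of", "to", "and", "in"}

def sentences(text):
    # Naive split on '.', '!' and '?'; a proper sentence extractor would do better.
    return [s for s in re.split(r"[.!?]+", text) if s.strip()]

def shingles(tokens, size=2):
    # All runs of `size` consecutive tokens.
    return [tuple(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]

def common_phrases(docs, size=2, min_count=2):
    counts = Counter()
    for doc in docs:
        # Shingle each sentence separately so no shingle spans two sentences.
        for sentence in sentences(doc):
            tokens = re.findall(r"[a-z0-9']+", sentence.lower())
            counts.update(shingles(tokens, size))
    # Post-processing: keep frequent shingles that contain no stop word.
    return [
        (" ".join(s), n)
        for s, n in counts.most_common()
        if n >= min_count and not any(t in STOP_WORDS for t in s)
    ]

docs = [
    "The NBA announced its new social media guidelines. We are glad.",
    "There is a social media policy. Social media is everywhere.",
]
print(common_phrases(docs))
```

On a real corpus you would raise `min_count` and keep extending the stop word list as you inspect the output.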


     karl

On 7 Oct 2009, at 01:39, Andrew Zhang wrote:

Hi Karl,

I think shingle is designed to make phrase search faster. It'll generate
a lot of "seemed like" phrases by position only and completely disregard
the meaning; that's not good enough.

Regards,
Andrew

On Tue, Oct 6, 2009 at 11:51 PM, Karl Wettin <karl.wet...@gmail.com> wrote:

Hi Andrew,

I think you are looking for the shingle package in contrib/analyzers.


    karl

On 6 Oct 2009, at 13:42, Andrew Zhang wrote:


Hi guys,

The requirement is very simple. For example, in the sentence 'The NBA formally announced its new *social media* guidelines Wednesday', I want to treat '*social media*' as a single phrase term. The default English analyzers that come with Lucene all deal with single words, so if you want to get the most frequent terms, *social* and *media* are separated, and neither of them carries the meaning as well as *social media*, right?

I know there's an approach based on a phrase dictionary that tries to match phrases already in it, much like what is done for Chinese, but is there an open source solution for English? I don't want to build a phrase dictionary myself, and I'd also like a smart way that can "discover" phrases automatically. I have 2 million docs analyzed the normal way, all single terms, which I can use as a base source. It's possible to find that *social media* occurs together frequently, but I really don't know how to do the reverse.

I tried to find some phrase analyzers, but no luck. Any advice?

Regards,
Andrew
--
Simple is best



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org





