Re: Implementing own Analyser components.

2016-10-31 Thread Fuad Efendi
Hi Sergey, Here is the table of tags from http://www.nltk.org/book/ch05.html

Tag   Meaning      English Examples
ADJ   adjective    new, good, high, special, big, local
ADP   adposition   on, of, at, with, by, into, under
ADV   adverb       really, already, still, early, now
CONJ  conjunction  and, or, but, if, while,

Implementing own Analyser components.

2016-10-31 Thread Sergey Repnikov
Hello. My name is Sergeiy, and I'm working on extending Lucene's functionality. As I've read in the JavaDoc for the "org.apache.lucene.analysis" package, it's preferable to ask on this mailing list before extending, because some features may already be implemented. So I want to be able to perform search by parts of

Re: Why do the Japanese analyser FST files change every release?

2015-08-07 Thread Dawid Weiss
It is (b). D. On Fri, Aug 7, 2015 at 3:05 AM, Trejkaz trej...@trypticon.org wrote: I have recently done updates from Lucene 3.6 to 4.x and 4.x to 5.2. During this process, I noticed that the FST used by the Japanese analyser (AKA Kuromoji) was changing between releases. As I fear breakages

Why do the Japanese analyser FST files change every release?

2015-08-06 Thread Trejkaz
I have recently done updates from Lucene 3.6 to 4.x and 4.x to 5.2. During this process, I noticed that the FST used by the Japanese analyser (AKA Kuromoji) was changing between releases. As I fear breakages in backwards compatibility, I worried that the dictionary had changed, so I wrote

Re: Twitter analyser

2013-11-09 Thread Stephane Nicoll
Hi, This is what I've tried: https://gist.github.com/anonymous/7383104 So far so good except that something is definitely wrong in my code as the synonym is not emitted as a valid token it seems. This is how my indexing analyzer is built: private static final class MyIndexAnalyzer extends

Re: Twitter analyser

2013-11-09 Thread Stephane Nicoll
Replying to self: silly me. I am obviously creating the array with the wrong length. final String term = new String(buffer, 1, length); should be replaced by final String term = new String(buffer, 1, length -1); and the silly trim can go away. I guess I need more coffee. S. On Sat, Nov 9,
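The off-by-one described in this message can be reproduced with plain JDK strings. The sketch below is not the code from the gist: it assumes, hypothetically, that `buffer` is a reusable character buffer that may be longer than the term and that `length` counts the valid characters including the leading `@` or `#`. The class and method names are invented for illustration.

```java
public class PrefixStrip {
    // Hypothetical helper: returns the term without its leading '@' or '#'.
    // Only the first `length` chars of `buffer` are assumed to be valid.
    static String stripPrefix(char[] buffer, int length) {
        // Starting at offset 1 leaves only length - 1 chars to copy.
        return new String(buffer, 1, length - 1);
    }

    public static void main(String[] args) {
        // A padded buffer holding "#foo" followed by two filler spaces.
        char[] buffer = {'#', 'f', 'o', 'o', ' ', ' '};
        int length = 4;

        // Copies one char too many and picks up a filler space: "foo "
        String wrong = new String(buffer, 1, length);
        // Copies exactly the term: "foo"
        String right = stripPrefix(buffer, length);

        System.out.println("[" + wrong + "]");
        System.out.println("[" + right + "]");
    }
}
```

With the corrected count no trailing character is picked up, which is why the extra trim becomes unnecessary.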

Re: Twitter analyser

2013-11-08 Thread Lance Norskog
This is a part-of-speech analyzer for tweets. It would make your index far more useful. http://www.ark.cs.cmu.edu/TweetNLP/ On 11/04/2013 11:40 PM, Stéphane Nicoll wrote: Hi, I am building an application that indexes tweets and offers some basic search facilities on them. I am trying to find

Re: Twitter analyser

2013-11-05 Thread Erick Erickson
If the universe of items you want to match this way is small, consider something akin to synonyms. Your indexing process emits two tokens, with and without the @ or #, which should cover your situation. FWIW, Erick On Tue, Nov 5, 2013 at 2:40 AM, Stéphane Nicoll stephane.nic...@gmail.com wrote:
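The two-token idea in this reply can be sketched without any Lucene classes. The example below is only an illustration under assumptions: the class `TweetTokenExpander` and its `expand` method are hypothetical, and the input is assumed to be already split into tokens with the `@` and `#` prefixes preserved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TweetTokenExpander {
    // Hypothetical helper: for "@foo" or "#foo", emit the original token
    // followed by the bare word; all other tokens pass through unchanged.
    static List<String> expand(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            out.add(token);
            boolean prefixed = token.length() > 1
                    && (token.charAt(0) == '@' || token.charAt(0) == '#');
            if (prefixed) {
                out.add(token.substring(1)); // bare form, e.g. "foo"
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [hello, @foo, foo, #bar, bar]
        System.out.println(expand(Arrays.asList("hello", "@foo", "#bar")));
    }
}
```

In an actual Lucene token filter the bare form would normally be emitted at the same position as the prefixed token (a position increment of 0), which is how synonym tokens are stacked. This sketch ignores positions entirely.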

Re: Twitter analyser

2013-11-05 Thread Stephane Nicoll
Hi, Thanks for the reply. It's an index of tweets, so really any word is a target for this. This would mean a significant increase in the size of the index. My volumes are really small so that shouldn't be a problem (but performance/scalability is a concern). I have control over the query. Another

Re: Twitter analyser

2013-11-05 Thread Erick Erickson
You have to get the values _into_ the index with the special characters; that's where the issue is. Depending on your analysis chain, special characters may or may not even be in your index to search in the first place. So it's not how many different words are after the special characters as much

Re: Twitter analyser

2013-11-05 Thread Jack Krupansky
protWords) See: http://lucene.apache.org/core/4_5_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/WordDelimiterFilter.html -- Jack Krupansky -Original Message- From: Stéphane Nicoll Sent: Tuesday, November 05, 2013 2:40 AM To: java-user@lucene.apache.org Subject: Twitter analyser

Twitter analyser

2013-11-04 Thread Stéphane Nicoll
Hi, I am building an application that indexes tweets and offers some basic search facilities on them. I am trying to find a combination where the following would work:
* foo matches the foo word, a mention (@foo) or the hashtag (#foo)
* @foo only matches the mention
* #foo matches only the

Re: Case insensitive Keyword Analyser

2011-10-18 Thread Jamir Shaikh
(Version.LUCENE_34, tokenStream); return tokenStream; } } Best, Anna -Original Message- From: Jamir Shaikh [mailto:shaikhja...@gmail.com] Sent: Saturday, 15 October 2011 02:22 To: java-user@lucene.apache.org Subject: Case insensitive Keyword Analyser

Re: Case insensitive Keyword Analyser

2011-10-17 Thread Anna Hunecke
-Original Message- From: Jamir Shaikh [mailto:shaikhja...@gmail.com] Sent: Saturday, 15 October 2011 02:22 To: java-user@lucene.apache.org Subject: Case insensitive Keyword Analyser Hi Guys, Use Case: Field: Name Data: Jose, Jose Sam
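The intent of a case-insensitive keyword analyser can be shown with plain JDK code. This is only a sketch under assumptions, not the analyser posted in the thread: the class `CaseInsensitiveKeyword` and its `normalize` method are hypothetical, and the sketch assumes the goal is to keep the whole field value as a single token while ignoring case.

```java
import java.util.Locale;

public class CaseInsensitiveKeyword {
    // Hypothetical helper: the entire field value is kept as one token
    // (no splitting on whitespace) and lower-cased for comparison.
    static String normalize(String fieldValue) {
        return fieldValue.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // Same value in a different case matches: prints true
        System.out.println(normalize("Jose Sam").equals(normalize("JOSE SAM")));
        // A different whole value does not match: prints false
        System.out.println(normalize("Jose").equals(normalize("Jose Sam")));
    }
}
```

The same normalisation has to be applied to both the indexed value and the query term, otherwise the two sides will not compare equal.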

Re: Dealing with special cases in analyser

2010-03-18 Thread Paul Taylor
Grant Ingersoll wrote: On Mar 17, 2010, at 11:34 AM, Paul Taylor wrote: Grant Ingersoll wrote: What's your current chain of TokenFilters? How many exceptions do you expect? That is, could you enumerate them? Very few. Yes, I could enumerate them, but I'm not sure exactly

Re: Dealing with special cases in analyser

2010-03-17 Thread Grant Ingersoll
What's your current chain of TokenFilters? How many exceptions do you expect? That is, could you enumerate them? On Mar 12, 2010, at 5:27 AM, Paul Taylor wrote: Hi, I'm using a custom analyser based on StandardAnalyzer with good results to search artists (e.g. Rolling Stones/Beatles

Re: Dealing with special cases in analyser

2010-03-17 Thread Paul Taylor
Grant Ingersoll wrote: What's your current chain of TokenFilters? How many exceptions do you expect? That is, could you enumerate them? Very few. Yes, I could enumerate them, but I'm not sure exactly what you are suggesting. What I was going to do was add to the charConvertMap (when I

Re: Dealing with special cases in analyser

2010-03-17 Thread Grant Ingersoll
On Mar 17, 2010, at 11:34 AM, Paul Taylor wrote: Grant Ingersoll wrote: What's your current chain of TokenFilters? How many exceptions do you expect? That is, could you enumerate them? Very few. Yes, I could enumerate them, but I'm not sure exactly what you are suggesting. What I was

Dealing with special cases in analyser

2010-03-12 Thread Paul Taylor
Hi, I'm using a custom analyser based on StandardAnalyzer with good results to search artists (e.g. Rolling Stones/Beatles), but it fails to match some unusual artist names such as '!!!'. This is not surprising, because the analyser ignores punctuation, which is what I normally want it to do. I just
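One possible reading of the "could you enumerate them?" question in the replies is an explicit exception list that is checked before punctuation is discarded. The sketch below is an assumption, not the approach the thread settled on: `ArtistNameNormalizer`, `EXCEPTIONS` and `normalize` are hypothetical names, and the punctuation stripping is a plain-JDK stand-in for whatever the custom analyser actually does.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class ArtistNameNormalizer {
    // Hypothetical, hand-maintained list of names that must survive verbatim.
    static final Set<String> EXCEPTIONS = new HashSet<>(Arrays.asList("!!!"));

    static String normalize(String name) {
        String trimmed = name.trim();
        if (EXCEPTIONS.contains(trimmed)) {
            return trimmed; // special case: keep the punctuation-only name
        }
        // Default behaviour: drop ASCII punctuation and lower-case the rest.
        return trimmed.replaceAll("\\p{Punct}", "").toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(normalize("!!!"));             // prints !!!
        System.out.println(normalize("Rolling Stones!")); // prints rolling stones
    }
}
```

The exception check has to run on both the indexed names and the query text so that a search for '!!!' still finds the stored token.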

farsi analyser

2006-10-19 Thread pc123
Sorry, I meant Farsi analyser instead of Farsi parser.

Re: analyser

2006-04-11 Thread Daniel Noll
Raghavendra Prabhu wrote: While indexing, I use a different Analyser. While searching, I use a simple standard Analyzer. Will this prevent me from getting the same best fragments when I do a search for two terms, say term1 and term2? It depends on the differences, but in general you will always