Efficient string lookup using Lucene

2012-08-24 Thread Ilya Zavorin
Hi Everyone, I have the following task. I have a set of documents in multiple languages. I don't know what these languages are. Any given doc may contain text in several languages mixed up. So to me these are just a bunch of Unicode text files. What I need is to implement an efficient EXACT str

Re: Efficient string lookup using Lucene

2012-08-24 Thread Dawid Weiss
What you need is a suffix tree or a suffix array. Both data structures will allow you to perform constant-time searches for existence/ occurrence of any input pattern. Depending on how much text you have on the input it may either be a simple task -- see here: http://labs.carrotsearch.com/jsuffixa
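As a rough illustration of the suffix-array idea mentioned above, here is a minimal plain-Java sketch. It is not the API of the library linked in the message; the class name `SuffixArraySketch` and its methods are hypothetical, and the construction is deliberately naive. The lookup is a binary search over the sorted suffixes, so it costs about O(m log n) character comparisons for a pattern of length m in a text of length n: dependent on the pattern length and only logarithmically on the text size, rather than strictly constant. Comparison is done on raw UTF-16 code units, so no language-specific analysis is involved.

```java
import java.util.Arrays;

/** Naive suffix array over one text; an illustrative sketch, not a production index. */
final class SuffixArraySketch {
    private final String text;
    private final Integer[] suffixes; // start offsets of all suffixes, sorted lexicographically

    SuffixArraySketch(String text) {
        this.text = text;
        this.suffixes = new Integer[text.length()];
        for (int i = 0; i < suffixes.length; i++) {
            suffixes[i] = i;
        }
        // Naive construction by comparison sort; dedicated algorithms build this much faster.
        Arrays.sort(suffixes, (a, b) -> compareSuffixes(a, b));
    }

    /** Lexicographic comparison of the suffixes starting at offsets a and b. */
    private int compareSuffixes(int a, int b) {
        int n = text.length();
        while (a < n && b < n) {
            int d = text.charAt(a) - text.charAt(b);
            if (d != 0) {
                return d;
            }
            a++;
            b++;
        }
        // The suffix that ran out first is the shorter one and sorts first.
        return (a == n ? 0 : 1) - (b == n ? 0 : 1);
    }

    /** Compares the suffix at 'start' with the pattern, reading at most pattern.length() chars. */
    private int comparePrefix(int start, String pattern) {
        for (int i = 0; i < pattern.length(); i++) {
            if (start + i >= text.length()) {
                return -1; // suffix is a proper prefix of the pattern, so it sorts before it
            }
            int d = text.charAt(start + i) - pattern.charAt(i);
            if (d != 0) {
                return d;
            }
        }
        return 0; // the pattern is a prefix of this suffix
    }

    /** Returns true if the pattern occurs anywhere in the text (exact, case-sensitive). */
    boolean contains(String pattern) {
        int lo = 0;
        int hi = suffixes.length;
        while (lo < hi) { // find the first suffix that is not smaller than the pattern
            int mid = (lo + hi) >>> 1;
            if (comparePrefix(suffixes[mid], pattern) < 0) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo < suffixes.length && text.startsWith(pattern, suffixes[lo]);
    }
}
```

For example, `new SuffixArraySketch("Fox is running fast").contains("run")` finds the pattern inside "running", while a lookup for "ran" does not match. The boxed `Integer[]` costs several times the memory of the text itself, which matters on a memory-constrained device; a real implementation would use a primitive `int[]`.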

Re: Efficient string lookup using Lucene

2012-08-24 Thread Jack Krupansky
rin Sent: Friday, August 24, 2012 3:48 PM To: java-user@lucene.apache.org Subject: Efficient string lookup using Lucene Hi Everyone, I have the following task. I have a set of documents in multiple languages. I don't know what these languages are. Any given doc may contain text in severa

Re: Efficient string lookup using Lucene

2012-08-24 Thread Ahmet Arslan
> search for a string "run", I do not need to find "ran" but I
> do want to find it in all of these strings below:
>
> Fox is running fast
> !%#^&$run!$!%@&$#
> run,run

With NGramFilter you can do that. But it creates a lot of tokens. For example "Fox is running fast" becomes F o
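To make the token blow-up concrete, the following plain-Java sketch enumerates character n-grams of a string. It does not use Lucene's own n-gram classes; `NGramSketch.ngrams` and the gram sizes are assumptions chosen for illustration, and the sketch runs over the whole string (spaces included) rather than over tokens produced by a tokenizer.

```java
import java.util.ArrayList;
import java.util.List;

/** Enumerates character n-grams of a string; an illustration, not Lucene's implementation. */
final class NGramSketch {
    /** Returns every substring of length minGram..maxGram, in order of start offset. */
    static List<String> ngrams(String text, int minGram, int maxGram) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start < text.length(); start++) {
            for (int n = minGram; n <= maxGram && start + n <= text.length(); n++) {
                out.add(text.substring(start, start + n));
            }
        }
        return out;
    }
}
```

With gram sizes 1 to 3, the 19-character string "Fox is running fast" yields 19 + 18 + 17 = 54 grams, one of which is "run". In general a text of length L produces about L grams per gram size, so the number of indexed terms grows with both the text length and the range of gram sizes.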

Re: Efficient string lookup using Lucene

2012-08-25 Thread Noopur Julka
Hi, I have a similar issue. I need Lucene search to work with kanji characters (Japanese). The hits object (or topDocs) returns length = 0 for results but works well for English. I know my index contains matches, as Luke (Lucene search tool) renders them. I tried lace analyser - did not work. Re

RE: Efficient string lookup using Lucene

2012-08-25 Thread Ilya Zavorin
Does it mean that the resulting index will be very large?

Thanks,
Ilya

-Original Message-
From: Ahmet Arslan [mailto:iori...@yahoo.com]
Sent: Friday, August 24, 2012 4:59 PM
To: java-user@lucene.apache.org
Subject: Re: Efficient string lookup using Lucene

> search for a string

RE: Efficient string lookup using Lucene

2012-08-25 Thread Ilya Zavorin
: Friday, August 24, 2012 4:50 PM To: java-user@lucene.apache.org Subject: Re: Efficient string lookup using Lucene What you need is a suffix tree or a suffix array. Both data structures will allow you to perform constant-time searches for existence/ occurrence of any input pattern. Depending on

Re: Efficient string lookup using Lucene

2012-08-25 Thread Noopur Julka
012 4:50 PM > To: java-user@lucene.apache.org > Subject: Re: Efficient string lookup using Lucene > > What you need is a suffix tree or a suffix array. Both data structures > will allow you to perform constant-time searches for existence/ occurrence > of any input pattern. Depending on

Re: Efficient string lookup using Lucene

2012-08-25 Thread Devon H. O'Dell
wid.we...@gmail.com] > > Sent: Friday, August 24, 2012 4:50 PM > > To: java-user@lucene.apache.org > > Subject: Re: Efficient string lookup using Lucene > > > > What you need is a suffix tree or a suffix array. Both data structures > > will allow you to perform cons

Re: Efficient string lookup using Lucene

2012-08-25 Thread Noopur Julka
> > > implement it outside Lucene?
> > >
> > > By the way, I need this to run on an Android phone so size of memory might
> > > be an issue...
> > >
> > > Thanks,
> > >
> > > Ilya Zavorin

Re: Efficient string lookup using Lucene

2012-08-26 Thread Dawid Weiss
> Does Lucene support this type of structure, or do I need to somehow implement
> it outside Lucene?

You'd have to implement it separately but it'd be much, much smaller than Lucene itself (even obfuscated).

> By the way, I need this to run on an Android phone so size of memory might be
> an is

RE: Efficient string lookup using Lucene

2012-08-26 Thread Ilya Zavorin
/-specific. Thanks, Ilya -Original Message- From: Dawid Weiss [mailto:dawid.we...@gmail.com] Sent: Sunday, August 26, 2012 3:55 AM To: java-user@lucene.apache.org Subject: Re: Efficient string lookup using Lucene > Does Lucene support this type of structure, or do I need to somehow im

Re: Efficient string lookup using Lucene

2012-08-26 Thread Lance Norskog
ge-dependent/-specific. > > Thanks, > > Ilya > > -Original Message- > From: Dawid Weiss [mailto:dawid.we...@gmail.com] > Sent: Sunday, August 26, 2012 3:55 AM > To: java-user@lucene.apache.org > Subject: Re: Efficient string lookup using Lucene > >> Does Luce

Re: Efficient string lookup using Lucene

2012-08-26 Thread Dawid Weiss
> The WhitespaceAnalyzer breaks up text by spaces and tabs and newlines.
> After that, you can use wildcards. This will use very little space. I
> believe leading & trailing wildcards are supported now, right?

If leading wildcards take too much time (don't know, really) then one could also try to index
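The whitespace-plus-wildcards suggestion quoted above can be approximated in plain Java as follows. This is only a sketch of the matching semantics, not Lucene code: `WildcardSketch.matchingTokens` is a hypothetical helper, splitting on the regex `\s+` stands in for whitespace tokenization, and `String.contains` stands in for a query with both a leading and a trailing wildcard around the search string.

```java
import java.util.ArrayList;
import java.util.List;

/** Mimics whitespace tokenization followed by a leading-and-trailing wildcard match. */
final class WildcardSketch {
    /** Returns the whitespace-delimited tokens of text that contain needle as a substring. */
    static List<String> matchingTokens(String text, String needle) {
        List<String> hits = new ArrayList<>();
        for (String token : text.split("\\s+")) {
            // A wildcard on both sides of the needle is equivalent to a substring test per token.
            if (!token.isEmpty() && token.contains(needle)) {
                hits.add(token);
            }
        }
        return hits;
    }
}
```

Searching for "run" this way matches the token "running" in "Fox is running fast" and the single token "run,run", but finds nothing in "Fox ran fast". One limitation of the approach is that a search string containing whitespace can never match, because no single token spans a space.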