
Jackrabbit - Custom indexing

2016-04-12 Thread PrasannaKumar Chamarty
Hi,

What is the best way (in terms of maintenance required with new Lucene
releases) to allow splitting words into tokens on "." and "_" for
indexing?

Please note that I am using lucene through Jackrabbit. Jackrabbit's Search
configuration can be found at http://wiki.apache.org/jackrabbit/Search

The default analyzer is org.apache.lucene.analysis.standard.StandardAnalyzer
If writing a custom analyzer is the only option, how can that be done
without maintenance overhead across new Lucene releases?

Thank you.
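[For reference: Jackrabbit selects the analyzer through the SearchIndex element in workspace.xml, so a custom analyzer would be wired in roughly as below. The class name my.app.CustomAnalyzer is a placeholder, and the exact set of parameters depends on the Jackrabbit version; see the Search wiki page linked above.]

```xml
<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
  <param name="path" value="${wsp.home}/index"/>
  <!-- Fully qualified class name of the Lucene analyzer to use;
       StandardAnalyzer is the default when this param is omitted. -->
  <param name="analyzer" value="my.app.CustomAnalyzer"/>
</SearchIndex>
```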


Re: Custom indexing

2016-04-12 Thread Ahmet Arslan
Hi Chamarty,

Well, there are a lot of options here.

1) Use LetterTokenizer
2) Use WordDelimiterFilter combined with WhitespaceTokenizer
3) Use MappingCharFilter to replace those characters with spaces
...

Ahmet
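[Editor's note: option 3 amounts to mapping "." and "_" to spaces before tokenization. A minimal, Lucene-free sketch of the resulting behavior, using only the JDK so it runs standalone (MappingCharFilter plus WhitespaceTokenizer would produce the same tokens for this input):]

```java
import java.util.Arrays;
import java.util.List;

public class DotUnderscoreSplit {
    // Map '.' and '_' to spaces (what MappingCharFilter would do),
    // then split on whitespace (what WhitespaceTokenizer would do).
    static List<String> tokenize(String text) {
        String mapped = text.replace('.', ' ').replace('_', ' ');
        return Arrays.asList(mapped.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        System.out.println(tokenize("lucene_1_new.txt"));
        // [lucene, 1, new, txt]
    }
}
```

Note that, unlike LetterTokenizer, this approach preserves digit-only parts such as "1".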


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Custom indexing

2016-04-12 Thread Jack Krupansky
The standard analyzer/tokenizer should do a decent job of splitting on dot,
hyphen, and underscore, in addition to whitespace and other punctuation.

Can you post some specific test cases you are concerned with? (You should
always run some test cases.)

-- Jack Krupansky



Re: Custom indexing

2016-04-18 Thread PK C
Hi,

   Thank you very much for your quick responses.

Jack Krupansky,

The main use case is searching in file names, for example lucene.txt,
lucene_new.txt, and lucene_1_new.txt. A search for 'lucene' should return
all 3 files; 'new' should return the last two. Please note that the
standard analyzer/tokenizer of Lucene 3.6 is not giving us these results,
as it does not tokenize on "." and "_". Are you referring to versions later
than 3.6?

Ahmet,

1. Not sure if LetterTokenizer helps with the above example, since the
file names contain both numbers and letters.
2. WordDelimiterFilter does not seem to be available in Lucene 3.6.
3. MappingCharFilter is what I am already using, by overriding the
initReader method in my CustomAnalyzer (source copied from StandardAnalyzer,
which is a final class). Is copying the source a good way to reuse the final
StandardAnalyzer class with custom changes, or is composition better?

Thank you again,
Best Regards
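[Editor's note: on question 3, composition generally survives Lucene upgrades better than copying a final class's source. The pre-processing itself can live in a plain java.io.Reader wrapper, so the analyzer is left untouched. A Lucene-free sketch of that idea (DelimiterMappingReader is an illustrative name, not a Lucene API); in Lucene it would be returned from initReader, typically as a MappingCharFilter:]

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;

// A Reader that maps '.' and '_' to spaces before the wrapped
// tokenizer ever sees them -- the same pre-processing as overriding
// initReader with a MappingCharFilter, but achieved by wrapping
// (composition) instead of copying the analyzer's source.
public class DelimiterMappingReader extends FilterReader {
    public DelimiterMappingReader(Reader in) { super(in); }

    private static int map(int c) {
        return (c == '.' || c == '_') ? ' ' : c;
    }

    @Override
    public int read() throws IOException {
        int c = in.read();
        return c == -1 ? -1 : map(c);
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        int n = in.read(buf, off, len);
        for (int i = off; i < off + n; i++) {
            buf[i] = (char) map(buf[i]);
        }
        return n;
    }
}
```

Because the mapping is applied to the character stream itself, the unmodified StandardAnalyzer (or any other analyzer) can be reused as-is downstream.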



Re: Custom indexing

2016-04-18 Thread Jack Krupansky
You failed to disclose up front that you are using such an old release of
Lucene. Lucene is now on 6.0. I'll defer to others if they wish to provide
support for such an old release.

-- Jack Krupansky



Re: Custom indexing

2016-04-18 Thread Ahmet Arslan


Hi,

Please try LetterTokenizer; it should cover your example.

Ahmet




RE: Custom indexing

2016-04-19 Thread Uwe Schindler
Hi,
> The main use case is searching in file names, for example lucene.txt,
> lucene_new.txt, and lucene_1_new.txt. A search for 'lucene' should return
> all 3 files; 'new' should return the last two. Please note that the
> standard analyzer/tokenizer of Lucene 3.6 is not giving us these results,
> as it does not tokenize on "." and "_". Are you referring to versions
> later than 3.6?

The StandardTokenizer in 3.6 is the old, non-Unicode-compliant "classic"
tokenizer. In Lucene 4+ it is called ClassicTokenizer, because it is still
used by some users, but newer code should use the new StandardTokenizer.
From Lucene 4 on, StandardTokenizer implements the word break rules from
the Unicode Text Segmentation algorithm, as specified in Unicode Standard
Annex #29.

That one is not available in such old Lucene versions, sorry. Your only
options are LetterTokenizer or writing your own.

Uwe

