Using a Lucene ShingleFilter to extract frequencies of bigrams in Lucene

2012-09-04 Thread Martin O'Shea
If a Lucene ShingleFilter can be used to tokenize a string into shingles, or
ngrams, of different sizes, e.g.:

 

"please divide this sentence into shingles"

 

Becomes:

 

shingles "please divide", "divide this", "this sentence", "sentence
into", and "into shingles"

 

Does anyone know if this can be used in conjunction with other analyzers to
return the frequencies of the bigrams or trigrams found, e.g.:

 

"please divide this please divide sentence into shingles"

 

Would return 2 for "please divide"?

 

I'm currently using Lucene 3.0.2 to extract frequencies of unigrams from a
string using a combination of a TermVectorMapper and Standard/Snowball
analyzers.

 

I should add that my strings are built up from a database and then indexed
by Lucene in memory and are not persisted beyond this. Use of other products
like Solr is not intended.

 

Thanks

 

Mr Morgan.

 

 



RE: Using a Lucene ShingleFilter to extract frequencies of bigrams in Lucene

2012-09-06 Thread Martin O'Shea
Thanks for that piece of advice.

 I ended up passing my snowballAnalyzer and standardAnalyzer objects as parameters to 
ShingleAnalyzerWrapper instances and processing the outputs via a TermVectorMapper. 

It seems to work quite well.
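In outline, that wrapping looks like this (a minimal sketch, untested; Lucene 3.0.x API assumed, names illustrative):

--
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.shingle.ShingleAnalyzerWrapper;
import org.apache.lucene.analysis.snowball.SnowballAnalyzer;
import org.apache.lucene.util.Version;

public class ShingledAnalyzers {
    // Wrap an existing analyzer so its token stream is shingled into bigrams.
    public static Analyzer bigramSnowball() {
        Analyzer base = new SnowballAnalyzer(Version.LUCENE_30, "English");
        return new ShingleAnalyzerWrapper(base, 2); // 2 = maximum shingle size
    }
}
--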

-Original Message-
From: Robert Muir [mailto:rcm...@gmail.com] 
Sent: 05 Sep 2012 01:53
To: java-user@lucene.apache.org
Subject: Re: Using a Lucene ShingleFilter to extract frequencies of bigrams in 
Lucene

On Tue, Sep 4, 2012 at 12:37 PM, Martin O'Shea  wrote:
>
> Does anyone know if this can be used in conjunction with other 
> analyzers to return the frequencies of the bigrams or trigrams found, e.g.:
>
>
>
> "please divide this please divide sentence into shingles"
>
>
>
> Would return 2 for "please divide"?
>
>
>
> I'm currently using Lucene 3.0.2 to extract frequencies of unigrams 
> from a string using a combination of a TermVectorMapper and 
> Standard/Snowball analyzers.
>
>
>
> I should add that my strings are built up from a database and then 
> indexed by Lucene in memory and are not persisted beyond this. Use of 
> other products like Solr is not intended.
>

The bigrams etc generated by shingles are terms just like the unigrams. So you 
can wrap any other analyzer with a ShingleAnalyzerWrapper if you want the 
shingles.

If you just want to use Lucene's analyzers to tokenize the text and compute 
within-document frequencies for a one-off purpose, I think indexing and 
creating term vectors could be overkill: you could just consume the tokens from 
the Analyzer and make a hashmap or whatever you need...

There are examples in the org.apache.lucene.analysis package javadocs.
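A minimal sketch of the hashmap approach (untested; Lucene 3.0.x API assumed, where TermAttribute carries the token text; the empty stop set avoids filler tokens appearing in the shingles):

--
import java.io.IOException;
import java.io.StringReader;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.shingle.ShingleAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.util.Version;

public class ShingleFrequencies {
    // Count every shingle the analyzer emits; no index or term vectors needed.
    public static Map<String, Integer> count(String text) throws IOException {
        Analyzer analyzer = new ShingleAnalyzerWrapper(
                new StandardAnalyzer(Version.LUCENE_30, Collections.<String>emptySet()), 2);
        TokenStream ts = analyzer.tokenStream("text", new StringReader(text));
        TermAttribute term = ts.addAttribute(TermAttribute.class);
        Map<String, Integer> freqs = new HashMap<String, Integer>();
        ts.reset();
        while (ts.incrementToken()) {
            Integer n = freqs.get(term.term());
            freqs.put(term.term(), n == null ? 1 : n + 1);
        }
        ts.close();
        return freqs;
    }
}
--

Counting "please divide this please divide sentence into shingles" this way would yield 2 for the shingle "please divide".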

--
lucidworks.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Using stop words with snowball analyzer and shingle filter

2012-09-20 Thread Martin O'Shea
Thanks for the responses. They've given me much food for thought.

-Original Message-
From: Steven A Rowe [mailto:sar...@syr.edu] 
Sent: 20 Sep 2012 02:19
To: java-user@lucene.apache.org
Subject: RE: Using stop words with snowball analyzer and shingle filter

Hi Martin,

SnowballAnalyzer was deprecated in Lucene 3.0.3 and will be removed in
Lucene 5.0.

Looks like you're using Lucene 3.X; here's an (untested) Analyzer, based on the
Lucene 3.6 EnglishAnalyzer (except substituting SnowballFilter for
PorterStemmer, disabling stopword holes' position increments, and adding
ShingleFilter), that should basically do what you want:

--
matchVersion = Version.LUCENE_3X;  // set before it is used below
String[] stopWords = new String[] { ... };
Set stopSet = StopFilter.makeStopSet(matchVersion, stopWords);
String[] stemExclusions = new String[] { ... };
Set stemExclusionsSet = new HashSet();
stemExclusionsSet.addAll(Arrays.asList(stemExclusions));

Analyzer analyzer = new ReusableAnalyzerBase() {
  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    final Tokenizer source = new StandardTokenizer(matchVersion, reader);
    TokenStream result = new StandardFilter(matchVersion, source);
    // prior to this we get the classic behavior, standardfilter does it for us.
    if (matchVersion.onOrAfter(Version.LUCENE_31))
      result = new EnglishPossessiveFilter(matchVersion, result);
    result = new LowerCaseFilter(matchVersion, result);
    result = new StopFilter(matchVersion, result, stopSet);
    ((StopFilter)result).setEnablePositionIncrements(false);  // Disable holes' position increments
    if (stemExclusionsSet.size() > 0) {
      result = new KeywordMarkerFilter(result, stemExclusionsSet);
    }
    result = new SnowballFilter(result, "English");
    result = new ShingleFilter(result, this.getnGramLength());
    return new TokenStreamComponents(source, result);
  }
};
--

Steve

-Original Message-
From: Jack Krupansky [mailto:j...@basetechnology.com]
Sent: Wednesday, September 19, 2012 7:16 PM
To: java-user@lucene.apache.org
Subject: Re: Using stop words with snowball analyzer and shingle filter

The underscores are due to the fact that the StopFilter defaults to "enable
position increments", so there are no terms at the positions where the stop
words appeared in the source text.

Unfortunately, SnowballAnalyzer does not pass that in as a parameter and is
"final" so you can't subclass it to override the "createComponents" method
that creates the StopFilter, so you would essentially have to copy the
source for SnowballAnalyzer and then add in the code to invoke
StopFilter.setEnablePositionIncrements the way StopFilterFactory does.
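In code, the change amounts to one extra call where the copied analyzer builds its StopFilter (a sketch; Lucene 3.0.x signature assumed, variable names hypothetical):

--
// Inside the copied SnowballAnalyzer's tokenStream() method:
StopFilter stop = new StopFilter(true, tokenStream, stopSet, false);
stop.setEnablePositionIncrements(false); // stop words no longer leave "holes"
--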

-- Jack Krupansky

-Original Message-
From: Martin O'Shea
Sent: Wednesday, September 19, 2012 4:24 AM
To: java-user@lucene.apache.org
Subject: Using stop words with snowball analyzer and shingle filter

I'm currently giving the user an option to include stop words or not when
filtering a body of text for ngram frequencies. Typically, this is done as
follows:



snowballAnalyzer = new SnowballAnalyzer(Version.LUCENE_30, "English", stopWords);

shingleAnalyzer = new ShingleAnalyzerWrapper(snowballAnalyzer, this.getnGramLength());



stopWords is set either to a full list of words to be included in the ngrams or
to a list of words to be removed from them. this.getnGramLength() simply returns
the current ngram length, up to a maximum of three.



If I use stopwords in filtering text "satellite is definitely falling to
Earth" for trigrams, the output is:



No=1, Key=to, Freq=1

No=2, Key=definitely, Freq=1

No=3, Key=falling to earth, Freq=1

No=4, Key=satellite, Freq=1

No=5, Key=is, Freq=1

No=6, Key=definitely falling to, Freq=1

No=7, Key=definitely falling, Freq=1

No=8, Key=falling, Freq=1

No=9, Key=to earth, Freq=1

No=10, Key=satellite is, Freq=1

No=11, Key=is definitely, Freq=1

No=12, Key=falling to, Freq=1

No=13, Key=is definitely falling, Freq=1

No=14, Key=earth, Freq=1

No=15, Key=satellite is definitely, Freq=1



But if I don't use stopwords for trigrams, the output is this:



No=1, Key=satellite, Freq=1

No=2, Key=falling _, Freq=1

No=3, Key=satellite _ _, Freq=1

No=4, Key=_ earth, Freq=1

No=5, Key=falling, Freq=1

No=6, Key=satellite _, Freq=1

No=7, Key=_ _, Freq=1

No=8, Key=_ falling _, Freq=1

No=9, Key=falling _ earth, Freq=1

No=10, Key=_, Freq=3

No=11, Key=earth, Freq=1

No=12, Key=_ _ falling, Freq=1

No=13, Key=_ falling, Freq=1



Why am I seeing underscores? I would have expected to see the simple unigrams
plus "satellite falling", "falling earth", and "satellite falling earth".








-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



How to disable LowerCaseFilter when using SnowballAnalyzer in Lucene 3.0.2

2014-11-10 Thread Martin O'Shea
I realise that 3.0.2 is an old version of Lucene but if I have Java code as
follows:

 

int nGramLength = 3;
Set<String> stopWords = new HashSet<String>();
stopWords.add("the");
stopWords.add("and");
...
SnowballAnalyzer snowballAnalyzer = new SnowballAnalyzer(Version.LUCENE_30, "English", stopWords);
ShingleAnalyzerWrapper shingleAnalyzer = new ShingleAnalyzerWrapper(snowballAnalyzer, nGramLength);

 

Which will generate the frequency of ngrams from a particular string of
text without stop words, how can I disable the LowerCaseFilter which forms
part of the SnowballAnalyzer? I want to preserve the case of the ngrams
generated so that I can perform various counts according to the presence /
absence of upper case characters in the ngrams.

 

I am something of a Lucene newbie. And I should add that upgrading the
version of Lucene is not an option here.



RE: How to disable LowerCaseFilter when using SnowballAnalyzer in Lucene 3.0.2

2014-11-10 Thread Martin O'Shea
Uwe

Thanks for the reply. Given that SnowballAnalyzer is made up of a series of 
filters, I was thinking about something like this, where I 'pipe' output from 
one filter to the next:

standardTokenizer = new StandardTokenizer(...);
standardFilter = new StandardFilter(standardTokenizer, ...);
stopFilter = new StopFilter(standardFilter, ...);
snowballFilter = new SnowballFilter(stopFilter, ...);

But ignore LowerCaseFilter. Does this make sense?

Thanks

Martin O'Shea.
-Original Message-
From: Uwe Schindler [mailto:u...@thetaphi.de] 
Sent: 10 Nov 2014 14:06
To: java-user@lucene.apache.org
Subject: RE: How to disable LowerCaseFilter when using SnowballAnalyzer in 
Lucene 3.0.2

Hi,

In general, you cannot change Analyzers, they are "examples" and can be seen as 
"best practise". If you want to modify them, write your own Analyzer subclass 
which uses the wanted Tokenizers and TokenFilters as you like. You can for 
example clone the source code of the original and remove LowercaseFilter. 
Analyzers are very simple, there is no logic in them, it's just some 
"configuration" (which Tokenizer and which TokenFilters). In later Lucene 3 and 
Lucene 4, this is very simple: You just need to override createComponents in 
Analyzer class and add your "configuration" there.

If you use Apache Solr or Elasticsearch you can create your analyzers by XML or 
JSON configuration.

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


> -Original Message-
> From: Martin O'Shea [mailto:m.os...@dsl.pipex.com]
> Sent: Monday, November 10, 2014 2:54 PM
> To: java-user@lucene.apache.org
> Subject: How to disable LowerCaseFilter when using SnowballAnalyzer in 
> Lucene 3.0.2
> 
> I realise that 3.0.2 is an old version of Lucene but if I have Java 
> code as
> follows:
> 
> 
> 
> int nGramLength = 3;
> 
> Set stopWords = new Set();
> 
> stopwords.add("the");
> 
> stopwords.add("and");
> 
> ...
> 
> SnowballAnalyzer snowballAnalyzer = new 
> SnowballAnalyzer(Version.LUCENE_30,
> "English", stopWords);
> 
> ShingleAnalyzerWrapper shingleAnalyzer = new 
> ShingleAnalyzerWrapper(snowballAnalyzer, nGramLength);
> 
> 
> 
> Which will generate the frequency of ngrams from a particular a string 
> of text without stop words, how can I disable the LowerCaseFilter 
> which forms part of the SnowBallAnalyzer? I want to preserve the case 
> of the ngrams generated so that I can perform various counts according 
> to the presence / absence of upper case characters in the ngrams.
> 
> 
> 
> I am something of a Lucene newbie. And I should add that upgrading the 
> version of Lucene is not an option here.



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: How to disable LowerCaseFilter when using SnowballAnalyzer in Lucene 3.0.2

2014-11-10 Thread Martin O'Shea
Thanks Uwe.

-Original Message-
From: Uwe Schindler [mailto:u...@thetaphi.de] 
Sent: 10 Nov 2014 14:43
To: java-user@lucene.apache.org
Subject: RE: How to disable LowerCaseFilter when using SnowballAnalyzer in 
Lucene 3.0.2

Hi,

> Uwe
> 
> Thanks for the reply. Given that SnowBallAnalyzer is made up of a 
> series of filters, I was thinking about something like this where I 
> 'pipe' output from one filter to the next:
> 
> standardTokenizer =new StandardTokenizer (...); standardFilter = new 
> StandardFilter(standardTokenizer,...);
> stopFilter = new StopFilter(standardFilter,...); snowballFilter = new 
> SnowballFilter(stopFilter,...);
> 
> But ignore LowerCaseFilter. Does this make sense?

Exactly. Create a clone of SnowballAnalyzer (from Lucene source package) in 
your own package and remove LowercaseFilter. But be aware, it could be that 
snowball needs lowercased terms to correctly do stemming!!! I don't know about 
this filter, I just want to make you aware.

The same applies to stop filter, but this one allows to handle that: You should 
make stop-filter case insensitive (there is a boolean to do this):
StopFilter(boolean enablePositionIncrements, TokenStream input, Set 
stopWords, boolean ignoreCase)

Uwe
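A minimal sketch of such a clone for 3.0.2, following both suggestions (untested; in 3.0.x the method to override is tokenStream rather than createComponents):

--
import java.io.Reader;
import java.util.Set;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.snowball.SnowballFilter;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

// SnowballAnalyzer's chain minus the LowerCaseFilter, so case is preserved.
public class CasePreservingSnowballAnalyzer extends Analyzer {
    private final Set<?> stopWords;

    public CasePreservingSnowballAnalyzer(Set<?> stopWords) {
        this.stopWords = stopWords;
    }

    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream result = new StandardTokenizer(Version.LUCENE_30, reader);
        result = new StandardFilter(result);
        // No LowerCaseFilter; the stop filter matches case-insensitively instead.
        result = new StopFilter(true, result, stopWords, true);
        return new SnowballFilter(result, "English");
    }
}
--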

> Martin O'Shea.
> -Original Message-
> From: Uwe Schindler [mailto:u...@thetaphi.de]
> Sent: 10 Nov 2014 14:06
> To: java-user@lucene.apache.org
> Subject: RE: How to disable LowerCaseFilter when using 
> SnowballAnalyzer in Lucene 3.0.2
> 
> Hi,
> 
> In general, you cannot change Analyzers, they are "examples" and can 
> be seen as "best practise". If you want to modify them, write your own 
> Analyzer subclass which uses the wanted Tokenizers and TokenFilters as 
> you like. You can for example clone the source code of the original 
> and remove LowercaseFilter. Analyzers are very simple, there is no 
> logic in them, it's just some "configuration" (which Tokenizer and 
> which TokenFilters). In later Lucene 3 and Lucene 4, this is very 
> simple: You just need to override createComponents in Analyzer class and add 
> your "configuration" there.
> 
> If you use Apache Solr or Elasticsearch you can create your analyzers 
> by XML or JSON configuration.
> 
> Uwe
> 
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
> 
> 
> > -Original Message-
> > From: Martin O'Shea [mailto:m.os...@dsl.pipex.com]
> > Sent: Monday, November 10, 2014 2:54 PM
> > To: java-user@lucene.apache.org
> > Subject: How to disable LowerCaseFilter when using SnowballAnalyzer 
> > in Lucene 3.0.2
> >
> > I realise that 3.0.2 is an old version of Lucene but if I have Java 
> > code as
> > follows:
> >
> >
> >
> > int nGramLength = 3;
> >
> > Set stopWords = new Set();
> >
> > stopwords.add("the");
> >
> > stopwords.add("and");
> >
> > ...
> >
> > SnowballAnalyzer snowballAnalyzer = new 
> > SnowballAnalyzer(Version.LUCENE_30,
> > "English", stopWords);
> >
> > ShingleAnalyzerWrapper shingleAnalyzer = new 
> > ShingleAnalyzerWrapper(snowballAnalyzer, nGramLength);
> >
> >
> >
> > Which will generate the frequency of ngrams from a particular a 
> > string of text without stop words, how can I disable the 
> > LowerCaseFilter which forms part of the SnowBallAnalyzer? I want to 
> > preserve the case of the ngrams generated so that I can perform 
> > various counts according to the presence / absence of upper case characters 
> > in the ngrams.
> >
> >
> >
> > I am something of a Lucene newbie. And I should add that upgrading 
> > the version of Lucene is not an option here.
> 
> 
> 
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
> 
> 
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: How to disable LowerCaseFilter when using SnowballAnalyzer in Lucene 3.0.2

2014-11-11 Thread Martin O'Shea
In the end I edited the code of the StandardAnalyzer and the
SnowballAnalyzer to disable the calls to the LowerCaseFilter. This seems to
work.

-Original Message-
From: Ahmet Arslan [mailto:iori...@yahoo.com.INVALID] 
Sent: 10 Nov 2014 15:19
To: java-user@lucene.apache.org
Subject: Re: How to disable LowerCaseFilter when using SnowballAnalyzer in
Lucene 3.0.2

Hi,

Regarding Uwe's warning, 

"NOTE: SnowballFilter expects lowercased text." [1]

[1]
https://lucene.apache.org/core/4_3_0/analyzers-common/org/apache/lucene/analysis/snowball/SnowballFilter.html



On Monday, November 10, 2014 4:43 PM, Uwe Schindler  wrote:
Hi,

> Uwe
> 
> Thanks for the reply. Given that SnowBallAnalyzer is made up of a 
> series of filters, I was thinking about something like this where I 
> 'pipe' output from one filter to the next:
> 
> standardTokenizer =new StandardTokenizer (...); standardFilter = new 
> StandardFilter(standardTokenizer,...);
> stopFilter = new StopFilter(standardFilter,...); snowballFilter = new 
> SnowballFilter(stopFilter,...);
> 
> But ignore LowerCaseFilter. Does this make sense?

Exactly. Create a clone of SnowballAnalyzer (from Lucene source package) in
your own package and remove LowercaseFilter. But be aware, it could be that
snowball needs lowercased terms to correctly do stemming!!! I don't know
about this filter, I just want to make you aware.

The same applies to stop filter, but this one allows to handle that: You
should make stop-filter case insensitive (there is a boolean to do this):
StopFilter(boolean enablePositionIncrements, TokenStream input, Set
stopWords, boolean ignoreCase)

Uwe

> Martin O'Shea.
> -Original Message-
> From: Uwe Schindler [mailto:u...@thetaphi.de]
> Sent: 10 Nov 2014 14:06
> To: java-user@lucene.apache.org
> Subject: RE: How to disable LowerCaseFilter when using 
> SnowballAnalyzer in Lucene 3.0.2
> 
> Hi,
> 
> In general, you cannot change Analyzers, they are "examples" and can 
> be seen as "best practise". If you want to modify them, write your own 
> Analyzer subclass which uses the wanted Tokenizers and TokenFilters as 
> you like. You can for example clone the source code of the original 
> and remove LowercaseFilter. Analyzers are very simple, there is no 
> logic in them, it's just some "configuration" (which Tokenizer and 
> which TokenFilters). In later Lucene 3 and Lucene 4, this is very 
> simple: You just need to override createComponents in Analyzer class and
add your "configuration" there.
> 
> If you use Apache Solr or Elasticsearch you can create your analyzers 
> by XML or JSON configuration.
> 
> Uwe
> 
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
> 
> 
> > -Original Message-
> > From: Martin O'Shea [mailto:m.os...@dsl.pipex.com]
> > Sent: Monday, November 10, 2014 2:54 PM
> > To: java-user@lucene.apache.org
> > Subject: How to disable LowerCaseFilter when using SnowballAnalyzer 
> > in Lucene 3.0.2
> >
> > I realise that 3.0.2 is an old version of Lucene but if I have Java 
> > code as
> > follows:
> >
> >
> >
> > int nGramLength = 3;
> >
> > Set stopWords = new Set();
> >
> > stopwords.add("the");
> >
> > stopwords.add("and");
> >
> > ...
> >
> > SnowballAnalyzer snowballAnalyzer = new 
> > SnowballAnalyzer(Version.LUCENE_30,
> > "English", stopWords);
> >
> > ShingleAnalyzerWrapper shingleAnalyzer = new 
> > ShingleAnalyzerWrapper(snowballAnalyzer, nGramLength);
> >
> >
> >
> > Which will generate the frequency of ngrams from a particular a 
> > string of text without stop words, how can I disable the 
> > LowerCaseFilter which forms part of the SnowBallAnalyzer? I want to 
> > preserve the case of the ngrams generated so that I can perform 
> > various counts according to the presence / absence of upper case
characters in the ngrams.
> >
> >
> >
> > I am something of a Lucene newbie. And I should add that upgrading 
> > the version of Lucene is not an option here.
> 
> 
> 
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org



> 
> 
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: How to disable LowerCaseFilter when using SnowballAnalyzer in Lucene 3.0.2

2014-11-11 Thread Martin O'Shea
Ahmet, 

Yes that is quite true. But as this is only a proof of concept application,
I'm prepared for things to be 'imperfect'.

Martin O'Shea.

-Original Message-
From: Ahmet Arslan [mailto:iori...@yahoo.com.INVALID] 
Sent: 11 Nov 2014 18:26
To: java-user@lucene.apache.org
Subject: Re: How to disable LowerCaseFilter when using SnowballAnalyzer in
Lucene 3.0.2

Hi,

With that analyser, your searches (for same word, but different capitalised)
could return different results.

Ahmet


On Tuesday, November 11, 2014 6:57 PM, Martin O'Shea 
wrote:
In the end I edited the code of the StandardAnalyzer and the
SnowballAnalyzer to disable the calls to the LowerCaseFilter. This seems to
work.

-Original Message-
From: Ahmet Arslan [mailto:iori...@yahoo.com.INVALID]
Sent: 10 Nov 2014 15:19
To: java-user@lucene.apache.org
Subject: Re: How to disable LowerCaseFilter when using SnowballAnalyzer in
Lucene 3.0.2

Hi,

Regarding Uwe's warning, 

"NOTE: SnowballFilter expects lowercased text." [1]

[1]
https://lucene.apache.org/core/4_3_0/analyzers-common/org/apache/lucene/analysis/snowball/SnowballFilter.html



On Monday, November 10, 2014 4:43 PM, Uwe Schindler  wrote:
Hi,

> Uwe
> 
> Thanks for the reply. Given that SnowBallAnalyzer is made up of a 
> series of filters, I was thinking about something like this where I 
> 'pipe' output from one filter to the next:
> 
> standardTokenizer =new StandardTokenizer (...); standardFilter = new 
> StandardFilter(standardTokenizer,...);
> stopFilter = new StopFilter(standardFilter,...); snowballFilter = new 
> SnowballFilter(stopFilter,...);
> 
> But ignore LowerCaseFilter. Does this make sense?

Exactly. Create a clone of SnowballAnalyzer (from Lucene source package) in
your own package and remove LowercaseFilter. But be aware, it could be that
snowball needs lowercased terms to correctly do stemming!!! I don't know
about this filter, I just want to make you aware.

The same applies to stop filter, but this one allows to handle that: You
should make stop-filter case insensitive (there is a boolean to do this):
StopFilter(boolean enablePositionIncrements, TokenStream input, Set
stopWords, boolean ignoreCase)

Uwe

> Martin O'Shea.
> -Original Message-
> From: Uwe Schindler [mailto:u...@thetaphi.de]
> Sent: 10 Nov 2014 14:06
> To: java-user@lucene.apache.org
> Subject: RE: How to disable LowerCaseFilter when using 
> SnowballAnalyzer in Lucene 3.0.2
> 
> Hi,
> 
> In general, you cannot change Analyzers, they are "examples" and can 
> be seen as "best practise". If you want to modify them, write your own 
> Analyzer subclass which uses the wanted Tokenizers and TokenFilters as 
> you like. You can for example clone the source code of the original 
> and remove LowercaseFilter. Analyzers are very simple, there is no 
> logic in them, it's just some "configuration" (which Tokenizer and 
> which TokenFilters). In later Lucene 3 and Lucene 4, this is very
> simple: You just need to override createComponents in Analyzer class 
> and
add your "configuration" there.
> 
> If you use Apache Solr or Elasticsearch you can create your analyzers 
> by XML or JSON configuration.
> 
> Uwe
> 
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
> 
> 
> > -Original Message-
> > From: Martin O'Shea [mailto:m.os...@dsl.pipex.com]
> > Sent: Monday, November 10, 2014 2:54 PM
> > To: java-user@lucene.apache.org
> > Subject: How to disable LowerCaseFilter when using SnowballAnalyzer 
> > in Lucene 3.0.2
> >
> > I realise that 3.0.2 is an old version of Lucene but if I have Java 
> > code as
> > follows:
> >
> >
> >
> > int nGramLength = 3;
> >
> > Set stopWords = new Set();
> >
> > stopwords.add("the");
> >
> > stopwords.add("and");
> >
> > ...
> >
> > SnowballAnalyzer snowballAnalyzer = new 
> > SnowballAnalyzer(Version.LUCENE_30,
> > "English", stopWords);
> >
> > ShingleAnalyzerWrapper shingleAnalyzer = new 
> > ShingleAnalyzerWrapper(snowballAnalyzer, nGramLength);
> >
> >
> >
> > Which will generate the frequency of ngrams from a particular a 
> > string of text without stop words, how can I disable the 
> > LowerCaseFilter which forms part of the SnowBallAnalyzer? I want to 
> > preserve the case of the ngrams generated so that I can perform 
> > various counts according to the presence / absence of upper case
characters in the ngrams.
> >
> >
> >
> > I am something of a Lucene newbie. And I should add that upgrading 
> > the version of Lucene is not an option here.

RE: Use of Lucene to store data from RSS feeds

2010-10-15 Thread Martin O'Shea
@Pulkit Singhal: Thanks for the reply. Just to clarify my post yesterday, I'm 
not sure if each row in the database table would form a document or not because 
I do not know if Lucene works in this manner. In my case, each row of the table 
represents a single polling of an RSS feed to retrieve any new postings over a 
given number of hours. If Lucene allows a document to have separate time-based 
entries, then I am happy to use it for indexing. But if a separate document is 
needed per row of the table, then I'm uncertain. I always do have the option of 
using Lucene for in-memory indexing of postings to calculate the keyword 
frequencies. This I know how to do.

The individual columns of my table represent the only two elements of each RSS 
item that I'm interested in retrieving text from, i.e. the title and 
description.

-Original Message-
From: Pulkit Singhal [mailto:pulkitsing...@gmail.com] 
Sent: 15 Oct 2010 13:36
To: java-user@lucene.apache.org
Subject: Re: Use of Lucene to store data from RSS feeds

When you ask:
a) will each feed form a Lucene document, or
b) will each database row form a Lucene document
I'm inclined to say that it really depends on what type of aggregation
tool or logic you are using.

I don't know if "Tika" does it but if there is a tool out there that
can be pointed to a feed and tweaked to spit out documents with each
field having the settings that you want then you can go with that
approach. But if you are already parsing the feed and storing the raw
data into a database table then there is no reason that you can't
leverage that. From a database row perspective you have already done a
good deal of work to collect the data and breaking it down into chunks
that Lucene can happily index as separate fields in a document.

By the way I think there are tools that read from the database
directly too but I won't try to make things too complicated.

The way I see it, if you were to use the row at this moment and index
the 4 columns as fields ... plus you could set the feed body to be
ANALYZED (why don't I see the feed body in your database table?) ...
then lucene range queries on the date/time field could possibly return
some results. I am not sure how to get keyword frequencies but if the
analyzed tokens that Lucene is keeping in its index sort of represent
the keywords that you are talking about then I do know that Lucene
keeps some sort of inverted index per token in terms of how many
occurrences of it are there ... maybe someone else on the list can
comment on how to extract that info in a query.

Sounds doable.
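A sketch of that row-per-document approach (Lucene 3.0.x API assumed; the field names mirror the table columns described below):

--
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class FeedRows {
    // One Lucene Document per polled database row.
    public static Document rowToDocument(String feedUrl, String title,
            String description, String polledAt) {
        Document doc = new Document();
        doc.add(new Field("feed_url", feedUrl,
                Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("title_element_text", title,
                Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES));
        doc.add(new Field("description_element_text", description,
                Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES));
        // A lexicographically sortable timestamp, e.g. "2010-10-15T13:00",
        // keeps this field usable in range queries.
        doc.add(new Field("polling_date_time", polledAt,
                Field.Store.YES, Field.Index.NOT_ANALYZED));
        return doc;
    }
}
--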

On Thu, Oct 14, 2010 at 10:17 AM,   wrote:
> Hello
>
> I would like to store data retrieved hourly from RSS feeds in a database or 
> in Lucene so that the text can be easily
> indexed for word frequencies.
>
> I need to get the text from the title and description elements of RSS items.
>
> Ideally, for each hourly retrieval from a given feed, I would add a row to a 
> table in a dataset made up of the
> following columns:
>
> feed_url, title_element_text, description_element_text, polling_date_time
>
> From this, I can look up any element in a feed and calculate keyword 
> frequencies based upon the length of time required.
>
> This can be done as a database table and hashmaps used to calculate word 
> frequencies. But can I do this in Lucene to
> this degree of granularity at all? If so, would each feed form a Lucene 
> document or would each 'row' from the
> database table form one?
>
> Can anyone advise?
>
> Thanks
>
> Martin O'Shea.
> --
>
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Using a TermFreqVector to get counts of all words in a document

2010-10-20 Thread Martin O'Shea
Hello

 

I am trying to use a TermFreqVector to get a count of all words in a
Document as follows:

 

// Search.
int hitsPerPage = 10;
IndexSearcher searcher = new IndexSearcher(index, true);
TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
searcher.search(q, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;

// Display results.
int docId = 0;
System.out.println("Found " + hits.length + " hits.");
for (int i = 0; i < hits.length; ++i) {
    docId = hits[i].doc;
    Document d = searcher.doc(docId);
    System.out.println((i + 1) + ". " + d.get("title"));
    IndexReader trd = IndexReader.open(index);
    TermFreqVector tfv = trd.getTermFreqVector(docId, "title");
    System.out.println(tfv.getTerms().toString());
    System.out.println(tfv.getTermFrequencies().toString());
}

 

The code is very rough as it's only an experiment, but I'm under the
impression that the getTerms and getTermFrequencies methods of a
TermFreqVector should allow each word and its frequency in the document to
be displayed. All I get though is a NullPointerException. The index consists of
a single document made up of a simple string:

 

IndexWriter w = new IndexWriter(index, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);

addDoc(w, "Lucene for Dummies"); 

 

And the queryString being used is simply "dummies".  

 

Thanks

 

Martin O'Shea.



RE: Using a TermFreqVector to get counts of all words in a document

2010-10-20 Thread Martin O'Shea
Uwe

Thanks - I figured that bit out. I'm a Lucene 'newbie'.

What I would like to know though is if it is practical to search a single
document of one field simply by doing this:

IndexReader trd = IndexReader.open(index);
TermFreqVector tfv = trd.getTermFreqVector(docId, "title");
String[] terms = tfv.getTerms();
int[] freqs = tfv.getTermFrequencies();
for (int i = 0; i < terms.length; i++) {
    System.out.println("Term " + terms[i] + " Freq: " + freqs[i]);
}
trd.close();

where docId is set to 0.

The code works but can this be improved upon at all?

My situation is where I don't want to calculate the number of documents with
a particular string. Rather, I want to get counts of individual words in a
field in a document. So I can concatenate the strings before passing them to
Lucene.

-Original Message-
From: Uwe Schindler [mailto:u...@thetaphi.de] 
Sent: 20 Oct 2010 19:40
To: java-user@lucene.apache.org
Subject: RE: Using a TermFreqVector to get counts of all words in a document

TermVectors are only available when enabled for the field/document.
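Concretely, the addDoc(...) used in the question must request term vectors when the field is created, otherwise getTermFreqVector returns null, which explains the NullPointerException (a sketch against the 3.0.x Field API):

--
import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

class AddDocHelper {
    // Field.TermVector.YES is what makes getTermFreqVector(docId, "title") non-null.
    static void addDoc(IndexWriter w, String title) throws IOException {
        Document doc = new Document();
        doc.add(new Field("title", title,
                Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES));
        w.addDocument(doc);
    }
}
--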

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


> -Original Message-
> From: Martin O'Shea [mailto:app...@dsl.pipex.com]
> Sent: Wednesday, October 20, 2010 8:23 PM
> To: java-user@lucene.apache.org
> Subject: Using a TermFreqVector to get counts of all words in a document
> 
> Hello
> 
> 
> 
> I am trying to use a TermFreqVector to get a count of all words in a
Document
> as follows:
> 
> 
> 
>// Search.
> 
> int hitsPerPage = 10;
> 
> IndexSearcher searcher = new IndexSearcher(index, true);
> 
> TopScoreDocCollector collector =
> TopScoreDocCollector.create(hitsPerPage, true);
> 
> searcher.search(q, collector);
> 
> ScoreDoc[] hits = collector.topDocs().scoreDocs;
> 
> 
> 
> // Display results.
> 
> int docId = 0;
> 
> System.out.println("Found " + hits.length + " hits.");
> 
> for (int i = 0; i < hits.length; ++i) {
> 
> docId = hits[i].doc;
> 
> Document d = searcher.doc(docId);
> 
> System.out.println((i + 1) + ". " + d.get("title"));
> 
> IndexReader trd = IndexReader.open(index);
> 
> TermFreqVector tfv = trd.getTermFreqVector(docId, "title");
> 
> System.out.println(tfv.getTerms().toString());
> 
> System.out.println(tfv.getTermFrequencies().toString());
> 
> }
> 
> 
> 
> The code is very rough as its only an experiment but I'm under the
impression
> that the getTerms and getTermFrequencies methods for a TermFreqVector
> should allow each word and its frequency in the document to be displayed.
All I
> get though is a NullPointerError. The index consists of a single document
made
> up of a simple string:
> 
> 
> 
> IndexWriter w = new IndexWriter(index, analyzer, true,
> IndexWriter.MaxFieldLength.UNLIMITED);
> 
> addDoc(w, "Lucene for Dummies");
> 
> 
> 
> And the queryString being used is simply "dummies".
> 
> 
> 
> Thanks
> 
> 
> 
> Martin O'Shea.



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Using a TermFreqVector to get counts of all words in a document

2010-10-20 Thread Martin O'Shea
http://mail-archives.apache.org/mod_mbox/lucene-java-user/201010.mbox/%3c1287065863.4cb7110774...@netmail.pipex.net%3e
will give you a better idea of what I'm moving towards.

It's all a bit grey at the moment so further investigation is inevitable.

I expect that a combination of MySQL database storage and Lucene indexing is
going to be the end result.



-Original Message-
From: Grant Ingersoll [mailto:gsing...@apache.org] 
Sent: 20 Oct 2010 21:20
To: java-user@lucene.apache.org
Subject: Re: Using a TermFreqVector to get counts of all words in a document


On Oct 20, 2010, at 2:53 PM, Martin O'Shea wrote:

> Uwe
> 
> Thanks - I figured that bit out. I'm a Lucene 'newbie'.
> 
> What I would like to know though is if it is practical to search a single
> document of one field simply by doing this:
> 
> IndexReader trd = IndexReader.open(index);
>TermFreqVector tfv = trd.getTermFreqVector(docId, "title");
>String[] terms = tfv.getTerms();
>int[] freqs = tfv.getTermFrequencies();
>for (int i = 0; i < tfv.getTerms().length; i++) {
>System.out.println("Term " + terms[i] + " Freq: " + freqs[i]);
>}
>trd.close();
> 
> where docId is set to 0.
> 
> The code works but can this be improved upon at all?
> 
> My situation is where I don't want to calculate the number of documents
with
> a particular string. Rather I want to get counts of individual words in a
> field in a document. So I can concatenate the strings before passing it to
> Lucene.

Can you describe the bigger problem you are trying to solve?  This looks
like a classic XY problem: http://people.apache.org/~hossman/#xyproblem

What you are doing above will work OK for what you describe (up to the
"passing it to Lucene" part), but you probably should explore the use of the
TermVectorMapper which provides a callback mechanism (similar to a SAX
parser) that will allow you to build your data structures on the fly instead
of having to serialize them into two parallel arrays and then loop over
those arrays to create some other structure.
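A minimal sketch of that callback style (untested; Lucene 3.0.x TermVectorMapper API):

--
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.index.TermVectorMapper;
import org.apache.lucene.index.TermVectorOffsetInfo;

// Passed to IndexReader.getTermFreqVector(docId, field, mapper); fills the map
// directly instead of materialising two parallel arrays.
public class FrequencyMapper extends TermVectorMapper {
    private final Map<String, Integer> freqs = new HashMap<String, Integer>();

    @Override
    public void setExpectations(String field, int numTerms,
            boolean storeOffsets, boolean storePositions) {
        // No-op: only term/frequency pairs are needed.
    }

    @Override
    public void map(String term, int frequency,
            TermVectorOffsetInfo[] offsets, int[] positions) {
        freqs.put(term, frequency);
    }

    public Map<String, Integer> getWordFrequencies() {
        return freqs;
    }
}
--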


--
Grant Ingersoll
http://www.lucidimagination.com


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Use of hyphens in StandardAnalyzer

2010-10-24 Thread Martin O'Shea
Hello

 

I have a StandardAnalyzer working which retrieves words and frequencies from
a single document using a TermVectorMapper which is populating a HashMap.

 

But if I use the following text as a field in my document, i.e. 

 

addDoc(w, "lucene Lawton-Browne Lucene");

 

The word frequencies returned in the HashMap are:

 

browne 1

lucene 2

lawton 1

 

The problem is the words 'lawton' and 'browne'. If this is an actual
'double-barreled' name, can Lucene recognise it as 'Lawton-Browne' where the
name is actually a single word?

 

I've tried combinations of:

 

addDoc(w, "lucene \"Lawton-Browne\" Lucene");

 

And single quotes but without success.

 

Thanks

 

Martin O'Shea.

 

 

 



RE: Use of hyphens in StandardAnalyzer

2010-10-24 Thread Martin O'Shea
A good suggestion. But I'm using Lucene 3.0.2 and the constructor for a 
StandardAnalyzer has Version.LUCENE_30 as its highest value.

-Original Message-
From: Steven A Rowe [mailto:sar...@syr.edu] 
Sent: 24 Oct 2010 21:31
To: java-user@lucene.apache.org
Subject: RE: Use of hyphens in StandardAnalyzer

Hi Martin,

StandardTokenizer and -Analyzer have been changed, as of future version 3.1 
(the next release) to support the Unicode segmentation rules in UAX#29.  My 
(untested) guess is that your hyphenated word will be kept as a single token if 
you set the version to 3.1 or higher in the constructor.

Steve
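If an upgrade were possible, the change Steve suggests would just be the Version constant (a sketch; whether the hyphenated token then survives intact is, as he says, untested):

--
// Hypothetical, requires Lucene >= 3.1; the constant selects the
// UAX#29-based StandardTokenizer rules.
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_31);
--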

> -Original Message-
> From: Martin O'Shea [mailto:app...@dsl.pipex.com]
> Sent: Sunday, October 24, 2010 3:59 PM
> To: java-user@lucene.apache.org
> Subject: Use of hyphens in StandardAnalyzer
> 
> Hello
> 
> 
> 
> I have a StandardAnalyzer working which retrieves words and frequencies
> from
> a single document using a TermVectorMapper which is populating a HashMap.
> 
> 
> 
> But if I use the following text as a field in my document, i.e.
> 
> 
> 
> addDoc(w, "lucene Lawton-Browne Lucene");
> 
> 
> 
> The word frequencies returned in the HashMap are:
> 
> 
> 
> browne 1
> 
> lucene 2
> 
> lawton 1
> 
> 
> 
> The problem is the words 'lawton' and 'browne'. If this is an actual
> 'double-barreled' name, can Lucene recognise it as 'Lawton-Browne' where
> the
> name is actually a single word?
> 
> 
> 
> I've tried combinations of:
> 
> 
> 
> addDoc(w, "lucene \"Lawton-Browne\" Lucene");
> 
> 
> 
> And single quotes but without success.
> 
> 
> 
> Thanks
> 
> 
> 
> Martin O'Shea.
> 
> 
> 
> 
> 
> 




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



FW: Use of hyphens in StandardAnalyzer

2010-10-24 Thread Martin O'Shea
A good suggestion. But I'm using Lucene 3.0.2 and the constructor for a 
StandardAnalyzer has Version.LUCENE_30 as its highest value. Do you know when 3.1 is 
due?

-Original Message-
From: Steven A Rowe [mailto:sar...@syr.edu] 
Sent: 24 Oct 2010 21:31
To: java-user@lucene.apache.org
Subject: RE: Use of hyphens in StandardAnalyzer

Hi Martin,

StandardTokenizer and -Analyzer have been changed, as of future version 3.1 
(the next release) to support the Unicode segmentation rules in UAX#29.  My 
(untested) guess is that your hyphenated word will be kept as a single token if 
you set the version to 3.1 or higher in the constructor.

Steve

> -Original Message-
> From: Martin O'Shea [mailto:app...@dsl.pipex.com]
> Sent: Sunday, October 24, 2010 3:59 PM
> To: java-user@lucene.apache.org
> Subject: Use of hyphens in StandardAnalyzer
> 
> Hello
> 
> 
> 
> I have a StandardAnalyzer working which retrieves words and frequencies
> from
> a single document using a TermVectorMapper which is populating a HashMap.
> 
> 
> 
> But if I use the following text as a field in my document, i.e.
> 
> 
> 
> addDoc(w, "lucene Lawton-Browne Lucene");
> 
> 
> 
> The word frequencies returned in the HashMap are:
> 
> 
> 
> browne 1
> 
> lucene 2
> 
> lawton 1
> 
> 
> 
> The problem is the words 'lawton' and 'browne'. If this is an actual
> 'double-barreled' name, can Lucene recognise it as 'Lawton-Browne' where
> the
> name is actually a single word?
> 
> 
> 
> I've tried combinations of:
> 
> 
> 
> addDoc(w, "lucene \"Lawton-Browne\" Lucene");
> 
> 
> 
> And single quotes but without success.
> 
> 
> 
> Thanks
> 
> 
> 
> Martin O'Shea.
> 
> 
> 
> 
> 
> 






-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Combining analyzers in Lucene

2011-03-05 Thread Martin O'Shea
Hello
I have a situation where I'm using two methods in a Java class to implement
a StandardAnalyzer in Lucene to index text strings and return their word
frequencies as follows:

public void indexText(String suffix, boolean includeStopWords)  {

StandardAnalyzer analyzer = null;

if (includeStopWords) {
analyzer = new StandardAnalyzer(Version.LUCENE_30);
}
else {

// Get Stop_Words to exclude them.
Set stopWords = (Set) Stop_Word_Listener.getStopWords();
analyzer = new StandardAnalyzer(Version.LUCENE_30, stopWords);
}

try {

// Index text.
Directory index = new RAMDirectory();
IndexWriter w = new IndexWriter(index, analyzer, true,
IndexWriter.MaxFieldLength.UNLIMITED);
this.addTextToIndex(w, this.getTextToIndex());
w.close();

// Read index.
IndexReader ir = IndexReader.open(index);
Text_TermVectorMapper ttvm = new Text_TermVectorMapper();

int docId = 0;

ir.getTermFreqVector(docId, "text", ttvm);

// Set output.
this.setWordFrequencies(ttvm.getWordFrequencies());
ir.close();  // the writer was already closed above
}
catch(Exception ex) {
logger.error("Error indexing elements of RSS_Feed for object " +
suffix + "\n", ex);
}
}

private void addTextToIndex(IndexWriter w, String value) throws
IOException {
Document doc = new Document();
doc.add(new Field("text"), value, Field.Store.YES,
Field.Index.ANALYZED, Field.TermVector.YES));
w.addDocument(doc);
}

Which works perfectly well but I would like to combine this with stemming
using a SnowballAnalyzer as well. 

This class also has two instance variables shown in a constructor below:

public Text_Indexer(String textToIndex) {
this.textToIndex = textToIndex;
this.wordFrequencies = new HashMap();
}

Can anyone tell me how best to achieve this with the code above? Should I
re-index the text when it is returned by the above code or can the stemming
be introduced into the above at all?

Thanks

Mr Morgan.
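No reply is archived for this thread, but the other threads in this collection point at the usual answer: SnowballAnalyzer already chains StandardTokenizer, StandardFilter, LowerCaseFilter, an optional StopFilter and SnowballFilter, so it can simply replace the StandardAnalyzer inside indexText, with no second indexing pass. A sketch under that assumption (Stop_Word_Listener is the class from the question above):

--
import java.util.Set;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.snowball.SnowballAnalyzer;
import org.apache.lucene.util.Version;

public class StemmingAnalyzers {
    // Drop-in replacement for the analyzer selection in indexText():
    // SnowballAnalyzer stems while it tokenizes.
    @SuppressWarnings("unchecked")
    public static Analyzer forIndexing(boolean includeStopWords) {
        if (includeStopWords) {
            return new SnowballAnalyzer(Version.LUCENE_30, "English");
        }
        Set<String> stopWords = (Set<String>) Stop_Word_Listener.getStopWords();
        return new SnowballAnalyzer(Version.LUCENE_30, "English", stopWords);
    }
}
--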




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org