Poor performances with Shingle and Phrase query

2016-01-21 Thread Bertil Chapuis
Hello,

I'm trying to improve the speed of an index when searching for long phrases.
I ran some tests with the benchmark module. With a simple analyser and
PhraseQueries I get a throughput of 118 rec/sec. My test dataset is the
latest dump of Wikipedia. Here are the filters I use at index and query
time:

var filter: TokenFilter = new StandardFilter(tokenizer)
filter = new LowerCaseFilter(filter)
filter = new EnglishPossessiveFilter(filter)
filter = new StopFilter(filter, StopAnalyzer.ENGLISH_STOP_WORDS_SET)
filter = new SnowballFilter(filter, "English")

To improve performance, I tried adding a ShingleFilter and ran benchmarks
with PhraseQueries and BooleanQueries (Should, Must). In both cases I got a
lower throughput (83 rec/sec and 84 rec/sec respectively).
Here is the filter chain:

var filter: TokenFilter = new StandardFilter(tokenizer)
filter = new LowerCaseFilter(filter)
filter = new EnglishPossessiveFilter(filter)
filter = new StopFilter(filter, StopAnalyzer.ENGLISH_STOP_WORDS_SET)
filter = new SnowballFilter(filter, "English")
val shingleFilter =  new ShingleFilter(filter, 2, 2)
shingleFilter.setOutputUnigrams(false)
filter = shingleFilter

From what I read, performance should be better, but I'm unable to get
the desired results. Does anyone have advice on the best way to use shingles
to improve performance? Should I use some other form of Query?

Thank you in advance for your help.

Regards,

Bertil


Re: Poor performances with Shingle and Phrase query

2016-01-21 Thread Doug Turnbull
In my experience, shingles can hurt query performance because the term
dictionary grows quite a bit. There are far more unique bigrams than there
are words. While the lookup time doesn't grow linearly with the number of
terms, it still grows.
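
As a rough illustration of that growth, here is a minimal sketch in plain
Scala (no Lucene involved) that counts distinct single words and distinct
two-word shingles in a toy sentence. The object name ShingleVocabulary, the
helper bigrams, and the sample text are made up for the example. It skips
stop-word removal and stemming, so it only approximates what
ShingleFilter(filter, 2, 2) with unigrams disabled would emit.

```scala
// Illustration only: plain Scala collections, no Lucene classes.
object ShingleVocabulary {
  // Join each pair of adjacent tokens with a space, similar in spirit to
  // two-word shingles. Inputs with fewer than two tokens yield nothing.
  def bigrams(tokens: Seq[String]): Seq[String] =
    tokens.sliding(2).filter(_.size == 2).map(_.mkString(" ")).toSeq

  def main(args: Array[String]): Unit = {
    // Hypothetical sample text, already lowercased and whitespace-tokenized.
    val tokens =
      "the cat sat on the mat and the dog sat on the rug".split(" ").toSeq

    println(s"unique unigrams: ${tokens.distinct.size}")
    println(s"unique bigrams:  ${bigrams(tokens).distinct.size}")
  }
}
```

Even on these 13 tokens the bigram vocabulary is larger (10 distinct bigrams
against 8 distinct words); on a real corpus the gap is usually much wider.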

I haven't specifically compared performance numbers for shingles vs. phrase
queries, but your numbers don't strike me as particularly shocking given the
performance issues I've had in the past with larger term dictionaries.

Hope that helps
-Doug




Re: Poor performances with Shingle and Phrase query

2016-01-21 Thread Jack Krupansky
Be sure to check whether your app is compute-bound or I/O-bound during this
process, i.e. whether too little of your index is cached in system memory,
so that each query requires I/O, and lots of it.

-- Jack Krupansky



Re: Poor performances with Shingle and Phrase query

2016-01-21 Thread Michael McCandless
Shingles should make a huge difference in phrase query performance if
1) the phrase queries involve high-frequency terms and 2) you have a
substantial number of documents in the index (so that
time-to-visit-postings dominates over time-to-lookup-terms).

118 rec/sec is already very fast for a long phrase on a large index
... how many documents are in your index?

You could also try using CommonGramsFilter instead: it's like
shingles, but only for high-frequency terms, so you get a smaller
increase in index size but big perf gains for the otherwise slow phrase
queries.
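
To show the idea behind common grams, here is a minimal sketch in plain
Scala. It is not the Lucene implementation, only an approximation of the
behaviour: every single token is kept, and a bigram is added only where at
least one of the two adjacent tokens is in a set of common words. The names
CommonGramsSketch and commonGrams, the sample tokens, and the choice of
common words are all assumptions made for the example.

```scala
// Illustration only: mimics the idea of common grams with plain collections.
object CommonGramsSketch {
  // Keep every unigram; after it, add "a_b" when a or b is a common word.
  def commonGrams(tokens: Seq[String], common: Set[String]): Seq[String] =
    tokens.indices.flatMap { i =>
      val hasNext = i + 1 < tokens.size
      val gram =
        if (hasNext && (common(tokens(i)) || common(tokens(i + 1))))
          Some(tokens(i) + "_" + tokens(i + 1))
        else
          None
      tokens(i) +: gram.toList
    }

  def main(args: Array[String]): Unit = {
    // Hypothetical input and common-word set.
    val tokens = Seq("the", "quick", "brown", "fox", "of", "the", "north")
    val common = Set("the", "of")
    println(commonGrams(tokens, common).mkString(" "))
  }
}
```

Pairs of rare words such as "quick brown" produce no extra term, which is
why the index grows less than with full shingles, while phrases that touch
a frequent word can be matched through a single, much rarer combined term.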

Mike McCandless

http://blog.mikemccandless.com



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Poor performances with Shingle and Phrase query

2016-01-21 Thread Bertil Chapuis
Thank you all for your answers. Initially, I also thought that shingles
should make a huge difference. I will give CommonGramsFilter a try.
In the meantime, this additional information may help you identify
a problem in my setup.

Basically, I indexed the whole Wikipedia dump (> 8 million articles; the
index size is 20 GB on disk). I also extracted a set of 1000 random
sentences from the dump to create phrase queries, and ran the following
algorithm file:

# Properties
directory=FSDirectory
work.dir=/media/sdb/wikipedia/index/2-shingle
task.max.depth.log=2
log.queries=true
query.file=/media/sdb/wikipedia/data/queries.txt.gz
query.maker=ch.unil.doplab.text.ShingleQueryMaker
query.shingle=2
# Algorithm
{ "Rounds"
OpenReader
{ "SearchSameRdr" Search > : 1000
CloseReader
NewRound
} : 10
RepSumByName

The ShingleQueryMaker uses the filter I mentioned in my previous mail. I
also tried to warm the reader ({ "WarmRdr" Warm > : 1) without noticing
a significant difference. Is there another way to warm the index before
running the queries?

The machine on which I run the benchmark has 16 GB of RAM and a Xeon CPU.
The benchmark uses a lot of memory (~40-50%). According to the javadoc,
the benchmark script I run is single-threaded, and the CPU usage
reflects that (~100%). Are there other parameters I should check?

Thank you very much.

