Re: Sharding Techniques

2011-05-10 Thread Samarendra Pratap
Hi,
 Though our total index is 30 GB, the indexes used in 75%-80% of
searches amount to only 5 GB, and our average search time is around 700 ms
(yes, the index is optimized).

Could someone please throw some light on my original question?
If I want to keep smaller indexes on different servers so that CPU and
memory can be used more efficiently, how can I aggregate the results of a
query from each of the servers? One technique I know is RMI, which I studied
a few years back, but it was too slow (or so I thought at the time). What
other techniques are there?
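[Editor's note: whatever transport carries the per-shard results back (RMI or otherwise), the aggregation step itself is just a top-N merge by score. A minimal sketch in plain Java, with no Lucene classes and all names hypothetical:]

```java
import java.util.*;

// Sketch only: merge per-shard top-N hit lists into one global top-N.
public class ShardMerge {

    public static final class Hit {
        public final String shard;
        public final int doc;
        public final float score;
        public Hit(String shard, int doc, float score) {
            this.shard = shard;
            this.doc = doc;
            this.score = score;
        }
    }

    // Keep a size-n min-heap of the best hits seen across all shards,
    // then emit them in descending score order.
    public static List<Hit> merge(int n, List<List<Hit>> perShard) {
        PriorityQueue<Hit> heap =
            new PriorityQueue<>(Comparator.comparingDouble((Hit h) -> h.score));
        for (List<Hit> hits : perShard) {
            for (Hit h : hits) {
                heap.offer(h);
                if (heap.size() > n) heap.poll(); // evict current lowest score
            }
        }
        List<Hit> top = new ArrayList<>(heap);
        top.sort((a, b) -> Float.compare(b.score, a.score));
        return top;
    }

    public static void main(String[] args) {
        List<List<Hit>> shards = Arrays.asList(
            Arrays.asList(new Hit("idx1", 3, 0.91f), new Hit("idx1", 9, 0.40f)),
            Arrays.asList(new Hit("idx2", 7, 0.85f)));
        for (Hit h : merge(2, shards)) {
            System.out.println(h.shard + "#" + h.doc + " " + h.score);
        }
    }
}
```

Note this merges by raw score only; scores from independently built shards are not directly comparable unless global term statistics are shared, which is part of why single-index or mirrored setups are simpler.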

Is 1 second a bad search time for the following?
total index size: 30 GB
index size used in 80% of searches: 5 GB
number of fields: 40, most of them numeric
one big "contents" field with 500 - 1000 words
3500 queries/hour in peak hours
on average a query uses 7 fields (1 big, 6 small) with 25-30 tokens

Are there any benchmarks against which I can compare the performance of my
application? Or an approximate formula that can guide me in estimating
(from system parameters and index/search stats) the "best" expected search
time?

Thanks in advance

On Tue, May 10, 2011 at 9:59 AM, Ganesh  wrote:

> We are using a similar technique to yours. We keep smaller indexes and use
> ParallelMultiSearcher to search across them. Keeping smaller indexes is
> good, as indexing and index optimization are faster.  There will be a small
> delay when searching across the indexes.
>
> 1. What is your search time?
> 2. Is your index optimized?
>
> I have a doubt: if we keep the index size at 30 GB, then each file (fdt,
> fdx, etc.) would be in GBs. Would a small addition or deletion to such a
> file not cause more IO, as it has to skip those bytes and write at the end
> of the file?
>
> Regards
> Ganesh
>
>
>
> - Original Message -
> From: "Samarendra Pratap" 
> To: 
> Sent: Monday, May 09, 2011 5:26 PM
> Subject: Sharding Techniques
>
>
> > Hi list,
> > We have an index directory of 30 GB which is divided into 3
> subdirectories
> > (idx1, idx2, idx3) which are again divided into 21 sub-subdirectories
> > (idx1-1, idx1-2, , idx2-1, , idx3-1, , idx3-21).
> >
> > We are running with java 1.6, lucene 2.9 (going to upgrade to 3.1 very
> > soon), linux (fedora core - kernel 2.6.17-13.1), reiserfs.
> >
> > We have almost 40 fields in each index (is it bad to have so many
> > fields?). Most of them are id-based fields.
> > We are using 8 servers for search, and each of which receives
> approximately
> > 3000/hour queries in peak hour and search time of more than 1 second is
> > considered bad (is it really bad?) as per the business requirement.
> >
> > For the past few months we have been experiencing issues (load and search
> > time) on our search servers, due to which I am looking into sharding
> > techniques. Can someone guide me or give me pointers on where I can read
> > more and test?
> >
> > Keeping parts of the index on different servers, searching all of them,
> > and then merging the results - what would be the best approach?
> >
> > Let me tell you that most queries use only 6-7 indexes and 4-5 fields (to
> > search on), but some queries (searching all the data) require all the
> > indexes and are the primary cause of the performance degradation.
> >
> > Any suggestions/ideas are greatly appreciated. Furthermore, will
> > sharding (or something similar) really reduce search time? (Load is a
> > less severe issue compared to search time.)
> >
> >
> > --
> > Regards,
> > Samar
> >
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


-- 
Regards,
Samar
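[Editor's note: the ParallelMultiSearcher approach Ganesh describes looks roughly like this against the Lucene 2.9 API. A sketch only: the index paths are hypothetical, and these classes were deprecated in 3.x and removed in Lucene 4.]

```java
import java.io.File;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ParallelMultiSearcher;
import org.apache.lucene.search.Searchable;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class ShardedSearch {
    public static void main(String[] args) throws Exception {
        // One IndexSearcher per sub-index (paths are hypothetical).
        Searchable[] shards = new Searchable[] {
            new IndexSearcher(FSDirectory.open(new File("/indexes/idx1-1"))),
            new IndexSearcher(FSDirectory.open(new File("/indexes/idx1-2"))),
        };

        // Fans the query out to every shard in parallel and merges the
        // per-shard hits into one score-sorted result list.
        Searcher searcher = new ParallelMultiSearcher(shards);
        TopDocs top = searcher.search(
            new TermQuery(new Term("contents", "lucene")), 10);
        System.out.println("total hits: " + top.totalHits);
        searcher.close();
    }
}
```

This runs the fan-out on one machine; searching shards on *different* machines additionally needs a transport (RMI via RemoteSearchable in contrib, or an external system such as Katta or Solr's distributed search).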


Re: Sharding Techniques

2011-05-10 Thread Johannes Zillmann

On May 10, 2011, at 9:42 AM, Samarendra Pratap wrote:

> Hi,
> Though we have 30 GB total index, size of the indexes that are used
> in 75%-80% searches is 5 GB. and we have average search time around 700 ms.
> (yes, we have optimized index).
> 
> Could someone please throw some light on my original doubt!!!
> If I want to keep smaller indexes on different servers so that CPU and
> memory may be optimized, how can I aggregate the results of a query from
> each of the server. One thing I know is RMI which I studied a few years
> back, but that was too slow (or i thought so that time). What are other
> techniques?

There is also http://katta.sourceforge.net/ out there...

Johannes

> 
> Is 1 second a bad search time for following?
> total index size: 30 GB
> index size which is being used in 80% searches - 5 GB
> number of fields - 40
> most of the fields being numeric fields.
> one big "contents" field with 500 - 1000 words.
> 3500 queries / second mostly on
> on an average a query uses 7 fields (1 big 6 small) with 25-30 tokens
> 
> Are there any benchmarks from which I can compare the performance of my
> application? Or any approximate formula which can guide me
> calculating (using system parameters and index/search stats) the "best"
> expected search time?
> 
> Thanks in advance
> 
> On Tue, May 10, 2011 at 9:59 AM, Ganesh  wrote:
> 
>> We are using similar technique as yours. We keep smaller indexes and use
>> ParallelMultiSearcher to search across the index. Keeping smaller indexes is
>> good as index and index optimzation would be faster.  There will be small
>> delay while searching across the indexes.
>> 
>> 1. What is your search time?
>> 2. Is your index optimized?
>> 
>> I have a doubt, If we keep the indexes size to 30 GB then each file size
>> (fdt, fdx etc) would in GB's. Small addition or deletion to the file will
>> not cause more IO as it has to skip those bytes and write it at the end of
>> file.
>> 
>> Regards
>> Ganesh
>> 
>> 
>> 
>> - Original Message -
>> From: "Samarendra Pratap" 
>> To: 
>> Sent: Monday, May 09, 2011 5:26 PM
>> Subject: Sharding Techniques
>> 
>> 
>>> Hi list,
>>> We have an index directory of 30 GB which is divided into 3
>> subdirectories
>>> (idx1, idx2, idx3) which are again divided into 21 sub-subdirectories
>>> (idx1-1, idx1-2, , idx2-1, , idx3-1, , idx3-21).
>>> 
>>> We are running with java 1.6, lucene 2.9 (going to upgrade to 3.1 very
>>> soon), linux (fedora core - kernel 2.6.17-13.1), reiserfs.
>>> 
>>> We have almost 40 fields in each index (is it a bad to have so many
>>> fields?). most of them are id based fields.
>>> We are using 8 servers for search, and each of which receives
>> approximately
>>> 3000/hour queries in peak hour and search time of more than 1 second is
>>> considered bad (is it really bad?) as per the business requirement.
>>> 
>>> Since past few months we are experiencing issues (load and search time)
>> on
>>> our search servers, due to which I am looking for sharding techniques.
>> Can
>>> someone guide or give me pointers where i can read more and test?
>>> 
>>> Keeping parts of indexes on different servers search on all of them and
>> then
>>> merging the results - what could be the best approach?
>>> 
>>> Let me tell you that most queries use only 6-7 indexes and 4 - 5 fields
>> (to
>>> search for) but some queries (searching all the data) require all the
>>> indexes and are primary cause of the performance degradation.
>>> 
>>> Any suggestions/ideas are greatly appreciated. And further more will
>>> sharding (or similar thing) really reduce search time? (load is a less
>>> severe issue when compared to search time)
>>> 
>>> 
>>> --
>>> Regards,
>>> Samar
>>> 
>> 
>> 
>> 
> 
> 
> -- 
> Regards,
> Samar





Re: Sharding Techniques

2011-05-10 Thread Toke Eskildsen
On Mon, 2011-05-09 at 13:56 +0200, Samarendra Pratap wrote:
>  We have an index directory of 30 GB which is divided into 3 subdirectories
> (idx1, idx2, idx3) which are again divided into 21 sub-subdirectories
> (idx1-1, idx1-2, , idx2-1, , idx3-1, , idx3-21).

So each part is about ½ GB in size? That gives you serious logistical
overhead. You state later that you only update the index once a day, so
it would seem that you have no need for the fast update times that such
small indexes give you. My guess is that you will get faster search
times by using a single index.


Down to basics, Lucene searches work by locating terms and resolving
documents from them. For standard term queries, a term is located by a
process akin to binary search. That means that it uses log(n) seeks to
get the term. Let's say you have 10M terms in your corpus. If you stored
that in a single field in a single index with a single segment, it would
take log(10M) ~= 24 seeks to locate a term. This is of course very
simplified.

When you have 63 indexes, log(n) works against you. Even with the
unrealistic assumption that the 10M terms are evenly distributed and
without duplicates, the number of seeks for a search that hits all parts
will still be 63 * log(10M/63) ~= 63 * 18 = 1134. And we haven't even
begun to estimate the merging part.

Due to caching, a seek is not equal to the storage being hit, but the
probability for a storage hit rises with the number of seeks and the
inevitable term duplicates when splitting the index.
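[Editor's note: Toke's back-of-the-envelope numbers can be reproduced with a few lines of plain Java. This just illustrates the arithmetic of the estimate, not Lucene's actual term-dictionary lookup.]

```java
// Sketch of the seek-count estimate from the post above.
public class SeekEstimate {

    // Seeks for a binary-search-style lookup over `terms` unique terms.
    static long seeks(long terms) {
        return (long) Math.ceil(Math.log(terms) / Math.log(2));
    }

    public static void main(String[] args) {
        long totalTerms = 10_000_000L; // 10M terms, as in the example
        int shards = 63;               // 63 sub-indexes

        // Single index, single segment: log2(10M) ~= 24 seeks.
        System.out.println("single index: " + seeks(totalTerms));

        // 63 indexes, terms split evenly: 63 * log2(10M/63) ~= 63 * 18 = 1134.
        System.out.println("63 indexes:   " + shards * seeks(totalTerms / shards));
    }
}
```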

> We have almost 40 fields in each index (is it bad to have so many
> fields?). Most of them are id-based fields.

Nah, our index is about 40GB with 100+ fields and 8M documents. We use a
single index, optimized down to 5 segments. Response times for raw searches
are a few ms, while response times for the full package (heavy faceting)
are generally below 300ms. Our queries are mostly simple boolean queries
across 13 fields.

> Keeping parts of the index on different servers, searching all of them, and
> then merging the results - what could be the best approach?

Locate your bottleneck. Some well-placed log statements or a quick peek
with visualvm (comes with the Oracle JVM) should help a lot.





Re: Sharding Techniques

2011-05-10 Thread Samarendra Pratap
Thanks
 to Johannes - I am looking into katta. Seems promising.
 to Toke - Great explanation. That's what I was looking for.

 I'll come back and share my experience.
Thank you very much.


On Tue, May 10, 2011 at 1:31 PM, Toke Eskildsen wrote:

> On Mon, 2011-05-09 at 13:56 +0200, Samarendra Pratap wrote:
> >  We have an index directory of 30 GB which is divided into 3
> subdirectories
> > (idx1, idx2, idx3) which are again divided into 21 sub-subdirectories
> > (idx1-1, idx1-2, , idx2-1, , idx3-1, , idx3-21).
>
> So each part is about ½ GB in size? That gives you a serious logistic
> overhead. You state later that you only update the index once a day, so
> it would seem that you have no need for the fast update times that such
> small indexes give you. My guess is that you will get faster search
> times by using a single index.
>
>
> Down to basics, Lucene searches work by locating terms and resolving
> documents from them. For standard term queries, a term is located by a
> process akin to binary search. That means that it uses log(n) seeks to
> get the term. Let's say you have 10M terms in your corpus. If you stored
> that in a single field in a single index with a single segment, it would
> take log(10M) ~= 24 seeks to locate a term. This is of course very
> simplified.
>
> When you have 63 indexes, log(n) works against you. Even with the
> unrealistic assumption that the 10M terms are evenly distributed and
> without duplicates, the number of seeks for a search that hits all parts
> will still be 63 * log(10M/63) ~= 63 * 18 = 1134. And we haven't even
> begun to estimate the merging part.
>
> Due to caching, a seek is not equal to the storage being hit, but the
> probability for a storage hit rises with the number of seeks and the
> inevitable term duplicates when splitting the index.
>
> > We have almost 40 fields in each index (is it a bad to have so many
> > fields?). most of them are id based fields.
>
> Nah, our index is about 40GB with 100+ fields and 8M documents. We use a
> single index, optimized to 5 segments. Response times for raw searches
> are a few ms, while response times for the full package (heavy faceting)
> is generally below 300ms. Our queries are mostly simple boolean queries
> across 13 fields.
>
> > Keeping parts of indexes on different servers search on all of them and
> then
> > merging the results - what could be the best approach?
>
> Locate your bottleneck. Some well-placed log statements or a quick peek
> with visualvm (comes with the Oracle JVM) should help a lot.
>
>
>
>


-- 
Regards,
Samar


PDF Highlighting using PDF Highlight File

2011-05-10 Thread Wulf Berschin

Hi all,

In our Lucene 3.0.3-based web application, when a user clicks on a hit
link, the targeted PDF should open in the browser with the hits highlighted.


For this purpose using the Acrobat Highlight File (Parameter xml, see 
http://www.pdfbox.org/userguide/highlighting.html and 
http://partners.adobe.com/public/developer/en/pdf/HighlightFileFormat.pdf) 
seems most reasonable to me.


Since the positions to highlight are given by (page and) character
offsets, and Lucene uses offsets as well, I think it could be easy (for
people more Lucene-skilled than me) to create a Highlighter which
produces this highlight file.


Does such a Highlighter already exist in the Lucene world?

If not, could someone please point me in the right direction (e.g. where to
hook into the existing (fast vector?) highlighter just to extract the offsets)?


BTW: Luke gave me the impression that term vectors are only stored when
the field content is stored as well. Is that true?


Wulf





An unexpected network error occurred

2011-05-10 Thread Yogesh Dabhi
Three instances of my application share a Lucene index directory.

Lucene version: 3.1

Lock factory: NativeFSLockFactory

Instance 1: 64-bit JDK, 64-bit OS

Instance 2: 64-bit JDK, 64-bit OS

Instance 3: 32-bit JDK, 32-bit OS

When I try to search the index directory from Instance 1, I get the error
below:

An unexpected network error occurred

There is a write.lock file in the Lucene directory.

I cannot read or update data in the index from Instance 1, but it works
fine for the other two instances.

Is there a way to handle such an error?

Thanks & Regards 

Yogesh
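[Editor's note: the setup described above is typically wired up like this in Lucene 3.1 (a sketch; the path is hypothetical). NativeFSLockFactory relies on OS-level file locks, which are known to be unreliable on some network filesystems, so this may be relevant to the network error.]

```java
import java.io.File;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.NativeFSLockFactory;

public class SharedIndexDir {
    // Opens the shared index directory with native OS file locking.
    // Caveat: native locks can misbehave on NFS/SMB network shares.
    public static Directory open(String path) throws Exception {
        return FSDirectory.open(new File(path), new NativeFSLockFactory());
    }
}
```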



Re: An unexpected network error occurred

2011-05-10 Thread Ian Lea
A full stack trace dump is always helpful.  Are the three instances on
one server with a local index directory, or on different servers
accessing a network drive (how?), or something else?  If the index is
locked, it would be surprising that you could update it from 2 of the
instances.


--
Ian.


On Tue, May 10, 2011 at 1:05 PM, Yogesh Dabhi  wrote:
> Three Instance of My application & lucene index directory shared for all
> instance
>
> Lucene version 3.1
>
> Lock factory:-  NativeFSLockFactory
>
> Instance1 jdk64 ,64 os
>
> Instance2 jdk64 ,64 os
>
> Instance3 jdk32 ,32 os
>
>
>
> When I try to search the data from  the index directory  from Instance1
> I got bellow error
>
> An unexpected network error occurred
>
> In lucene directory there write.lock file
>
>
>
> I cannot read data & update data in index from Instance1
>
> But for other two Instances its working fine
>
> is there a way to handle such error
>
> Thanks & Regards
>
> Yogesh
>
>




RE: SpanNearQuery - inOrder parameter

2011-05-10 Thread Gregory Tarr
Anyone able to help me with the problem below?

Thanks

Greg 

-Original Message-
From: Gregory Tarr [mailto:gregory.t...@detica.com] 
Sent: 09 May 2011 12:33
To: java-user@lucene.apache.org
Subject: RE: SpanNearQuery - inOrder parameter

Attachment didn't work - test below:
 
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TopDocsCollector;
import org.apache.lucene.search.TopScoreDocCollector;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;
import org.junit.Assert;
import org.junit.Test;
 
public class TestSpanNearQueryInOrder {
 
 @Test
 public void testSpanNearQueryInOrder() throws Exception {
  RAMDirectory directory = new RAMDirectory();
  IndexWriter writer = new IndexWriter(directory,
      new StandardAnalyzer(Version.LUCENE_29), true,
      IndexWriter.MaxFieldLength.UNLIMITED);
  TopDocsCollector collector = TopScoreDocCollector.create(3, false);
 
  Document doc = new Document();
 
  // DOC1 (the term text was stripped by the list archive)
  doc.add(new Field("text", "   ", Field.Store.YES, Field.Index.ANALYZED));
  writer.addDocument(doc);
  doc = new Document();
 
  // DOC2
  doc.add(new Field("text", "   ", Field.Store.YES, Field.Index.ANALYZED));
  writer.addDocument(doc);
  doc = new Document();
 
  // DOC3
  doc.add(new Field("text", "     ", Field.Store.YES, Field.Index.ANALYZED));
  writer.addDocument(doc);
  writer.optimize();
  writer.close();
 
  IndexSearcher searcher = new IndexSearcher(directory, false);
 
  SpanQuery[] clauses = new SpanQuery[2];
  clauses[0] = new SpanTermQuery(new Term("text", ""));
  clauses[1] = new SpanTermQuery(new Term("text", ""));
 
  // Don't care about order, so setting inOrder = false
  SpanNearQuery q = new SpanNearQuery(clauses, 1, false);
  searcher.search(q, collector);
 
  // This assert fails - 3 docs are returned. Expecting only DOC2 and DOC3
  Assert.assertEquals("Check 2 results", 2, collector.getTotalHits());
 
  collector = TopScoreDocCollector.create(3, false);
  clauses = new SpanQuery[2];
  clauses[0] = new SpanTermQuery(new Term("text", ""));
  clauses[1] = new SpanTermQuery(new Term("text", ""));
 
  // Don't care about order, so setting inOrder = false
  q = new SpanNearQuery(clauses, 0, false);
  searcher.search(q, collector);
 
  // This assert fails - 3 docs are returned. Expecting only DOC2
  Assert.assertEquals("Check 1 result", 1, collector.getTotalHits());
 }
 
}



From: Gregory Tarr [mailto:gregory.t...@detica.com]
Sent: 09 May 2011 12:29
To: java-user@lucene.apache.org
Subject: SpanNearQuery - inOrder parameter



I attach a junit test which shows strange behaviour of the inOrder
parameter on the SpanNearQuery constructor, using Lucene 2.9.4.

My understanding of this parameter is that true forces the order and
false doesn't care about the order. 

Using true always works. However using false works fine when the terms
in the query are distinct, but if they are equivalent, e.g. searching
for "john john", I do not get the expected results. The workaround seems
to be to always use true for queries with repeated terms.

Any help? 

Thanks 

Greg 



Please consider the environment before printing this email.

This message should be regarded as confidential. If you have received
this email in error please notify the sender and destroy it immediately.
Statements of intent shall only become binding when confirmed in hard
copy by an authorised signatory.  The contents of this email may relate
to dealings with other companies within the Detica Limited group of
companies.

Detica Limited is registered in England under No: 1337451.

Registered offices: Surrey Research Park, Guildford, Surrey, GU2 7YP,
England.



Re: Sharding Techniques

2011-05-10 Thread Mike Sokolov



> Down to basics, Lucene searches work by locating terms and resolving
> documents from them. For standard term queries, a term is located by a
> process akin to binary search. That means that it uses log(n) seeks to
> get the term. Let's say you have 10M terms in your corpus. If you stored
> that in a single field in a single index with a single segment, it would
> take log(10M) ~= 24 seeks to locate a term. This is of course very
> simplified.
>
> When you have 63 indexes, log(n) works against you. Even with the
> unrealistic assumption that the 10M terms are evenly distributed and
> without duplicates, the number of seeks for a search that hits all parts
> will still be 63 * log(10M/63) ~= 63 * 18 = 1134. And we haven't even
> begun to estimate the merging part.
This is true, but if the indexes are kept on 63 separate servers, those
seeks will be carried out in parallel.  The OP did indicate his indexes
would be on different servers, I think?  I still agree with your overall
point - at this scale a single server is probably best.  And if there
are performance issues, I think the usual approach is to create multiple
mirrored copies (slaves) rather than sharding.  Sharding is useful for
very large indexes: indexes too big to store on disk and cache in memory
on one commodity box.


-Mike




Re: Sharding Techniques

2011-05-10 Thread Samarendra Pratap
Hi Mike,
*"I think the usual approach is to create multiple mirrored copies (slaves)
rather than sharding"*
This is the part that caught my eye.

 We do have mirrors, and in fact a good number of them. 6 servers are being
used for serving regular queries (2 are for specific queries that do take
time), and each of them receives around 3-3.5 K queries per hour in peak
hours.

 The problem is that the interface used by end users has a lot of
options plus a few text boxes where they can type up to 64 words each (and
unfortunately I am not able to reduce these things, as they are business
requirements).

 Normal queries finish in under 500 ms, but when people start searching for
"anything", some queries take more than 100 seconds. Don't you think
distributing smaller indexes over different machines would reduce the
average search time? (Although I have a feeling that search time for smaller
queries may be slightly increased.)


On Tue, May 10, 2011 at 6:32 PM, Mike Sokolov  wrote:

>
>  Down to basics, Lucene searches work by locating terms and resolving
>> documents from them. For standard term queries, a term is located by a
>> process akin to binary search. That means that it uses log(n) seeks to
>> get the term. Let's say you have 10M terms in your corpus. If you stored
>> that in a single field in a single index with a single segment, it would
>> take log(10M) ~= 24 seeks to locate a term. This is of course very
>> simplified.
>>
>> When you have 63 indexes, log(n) works against you. Even with the
>> unrealistic assumption that the 10M terms are evenly distributed and
>> without duplicates, the number of seeks for a search that hits all parts
>> will still be 63 * log(10M/63) ~= 63 * 18 = 1134. And we haven't even
>> begun to estimate the merging part.
>>
> This is true, but if the indexes are kept on 63 separate servers, those
> seeks will be carried out in parallel.  The OP did indicate his indexes
> would be on different servers, I think?  I still agree with your overall
> point - at this scale a single server is probably best.  And if there are
> performance issues, I think the usual approach is to create multiple
> mirrored copies (slaves) rather than sharding.  Sharding is useful for very
> large indexes: indexes to big to store on disk and cache in memory on one
> commodity box
>
> -Mike
>
>
>
>


-- 
Regards,
Samar


RE: Sharding Techniques

2011-05-10 Thread Burton-West, Tom
Hi Samar,

>>Normal queries go fine under 500 ms but when people start searching
>>"anything" some queries take up to > 100 seconds. Don't you think
>>distributing smaller indexes on different machines would reduce the average
>>.search time. (Although I have a feeling that search time for smaller queries
>>may be slightly increased)

What are the characteristics of your slow queries?  Can you give examples?
Are the slow queries always slow, or only under heavy load?  Whether
splitting into smaller indexes would help depends on just what your
bottleneck is. It's not clear that your index is large enough for its size
to be the bottleneck.

We run indexes of about 350GB with average response times under 200ms and
99th percentile response times of under 2 seconds. (We have a very low qps
rate, however.)


Tom







Query on using Payload with MoreLikeThis class

2011-05-10 Thread Saurabh Gokhale
Hi,

In the Lucene 2.9.4 project, there is a requirement to boost some of the
keywords in the document using payload.

Now while searching, is there a way I can boost the MoreLikeThis result
using the index time payload values?

Or can I merge MoreLikeThis output and PayloadTermQuery output somehow to
get the final percentage output?


Re: SpanNearQuery - inOrder parameter

2011-05-10 Thread Tom Hill
Since no one else is jumping in, I'll say that I suspect that the span
query code does not bother to check to see if two of the terms are the
same.

I think that would account for the behavior you are seeing, since the
second SpanTermQuery would match the same term the first one did.

Note that I'm not familiar with the span query code, so this is just a
quick deduction.

Not sure how easy it would be to add this duplicate term detection, if
that's the problem.

Tom


On Tue, May 10, 2011 at 5:58 AM, Gregory Tarr  wrote:
> Anyone able to help me with the problem below?
>
> Thanks
>
> Greg
>
> -Original Message-
> From: Gregory Tarr [mailto:gregory.t...@detica.com]
> Sent: 09 May 2011 12:33
> To: java-user@lucene.apache.org
> Subject: RE: SpanNearQuery - inOrder parameter
>
> Attachment didn't work - test below:
>
> import org.apache.lucene.analysis.standard.StandardAnalyzer;
> import org.apache.lucene.document.Document;
> import org.apache.lucene.document.Field; import
> org.apache.lucene.index.IndexWriter;
> import org.apache.lucene.index.Term;
> import org.apache.lucene.search.IndexSearcher;
> import org.apache.lucene.search.TopDocsCollector;
> import org.apache.lucene.search.TopScoreDocCollector;
> import org.apache.lucene.search.spans.SpanNearQuery;
> import org.apache.lucene.search.spans.SpanQuery;
> import org.apache.lucene.search.spans.SpanTermQuery;
> import org.apache.lucene.store.RAMDirectory;
> import org.apache.lucene.util.Version;
> import org.junit.Assert;
> import org.junit.Test;
>
> public class TestSpanNearQueryInOrder {
>
> @Test
> public void testSpanNearQueryInOrder() {  RAMDirectory directory = new
> RAMDirectory();  IndexWriter writer = new IndexWriter(directory, new
> StandardAnalyzer(Version.LUCENE_29), true,
> IndexWriter.MaxFieldLength.UNLIMITED);
>  TopDocsCollector collector = TopScoreDocCollector.create(3, false);
>
>  Document doc = new Document();
>
>  // DOC1
>  doc.add(new Field("text","   ", Field.Store.YES,
> Field.Index.ANALYZED));
>
>  writer.addDocument(doc);
>  doc = new Document();
>
>  // DOC2
>  doc.add(new Field("text","   "));
>
>  writer.addDocument(doc);
>  doc = new Document();
>
>  // DOC3
>  doc.add(new Field("text","     "));
>
>  writer.addDocument(doc);
>  writer.optimize();
>  writer.close();
>
>  searcher = new IndexSearcher(directory, false);
>
>  SpanQuery[] clauses = new SpanQuery[2];  clauses[0] = new
> SpanTermQuery(new Term("text", ""));  clauses[1] = new
> SpanTermQuery(new Term("text", ""));
>
>  // Don't care about order, so setting inOrder = false  SpanNearQuery q
> = new SpanNearQuery(clauses, 1, false);  searcher.search(q, collector);
>
>  // This assert fails - 3 docs are returned. Expecting only DOC2 and
> DOC3
>  Assert.assertEquals("Check 2 results", 2, collector.getTotalHits());
>
>  collector = new TopScoreDocCollector.create(3, false);  clauses = new
> SpanQuery[2];  clauses[0] = new SpanTermQuery(new Term("text", ""));
> clauses[1] = new SpanTermQuery(new Term("text", ""));
>
>  // Don't care about order, so setting inOrder = false  q = new
> SpanNearQuery(clauses, 0, false);  searcher.search(q, collector);
>
>  // This assert fails - 3 docs are returned. Expecting only DOC2
> Assert.assertEquals("Check 1 result", 1, collector.getTotalHits()); }
>
> }
>
> 
>
> From: Gregory Tarr [mailto:gregory.t...@detica.com]
> Sent: 09 May 2011 12:29
> To: java-user@lucene.apache.org
> Subject: SpanNearQuery - inOrder parameter
>
>
>
> I attach a junit test which shows strange behaviour of the inOrder
> parameter on the SpanNearQuery constructor, using Lucene 2.9.4.
>
> My understanding of this parameter is that true forces the order and
> false doesn't care about the order.
>
> Using true always works. However using false works fine when the terms
> in the query are distinct, but if they are equivalent, e.g. searching
> for "john john", I do not get the expected results. The workaround seems
> to be to always use true for queries with repeated terms.
>
> Any help?
>
> Thanks
>
> Greg
>
> <>
>
>

RE: SpanNearQuery - inOrder parameter

2011-05-10 Thread Chris Hostetter


: I attach a junit test which shows strange behaviour of the inOrder
: parameter on the SpanNearQuery constructor, using Lucene 2.9.4.
: 
: My understanding of this parameter is that true forces the order and
: false doesn't care about the order. 
: 
: Using true always works. However using false works fine when the terms
: in the query are distinct, but if they are equivalent, e.g. searching
: for "john john", I do not get the expected results. The workaround seems
: to be to always use true for queries with repeated terms.

I don't think the situation of "overlapping spans" has changed much since 
this thread...

http://search.lucidimagination.com/search/document/ee23395e5a93c525/non_overlapping_span_queries#868b3a3ec6431afc

The crux of the issue (as I recall) is that there is really no conceptual 
reason why a query for "'john' near 'john', in any order, with slop of 
Z" shouldn't match a doc that contains only one instance of "john" ... the 
first SpanTermQuery says "I found a match at position X", the second 
SpanTermQuery says "I found a match at position Y", and the SpanNearQuery 
says "the difference between X and Y is less than Z, therefore I have a 
match".  (The SpanNearQuery can't fail just because X and Y are the same -- 
they might be two distinct term instances, with different payloads 
perhaps, that just happen to have the same position.)

However: the inOrder==true case works because the SpanNearQuery enforces 
that "X must be less than Y", so the same term can't ever match twice.
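The matching rule described above can be pictured with a small pure-Java sketch (not Lucene code; the positions and slop value are hypothetical). It shows why the unordered case matches a doc with a single "john" while the ordered case does not:

```java
// Pure-Java sketch of the span-matching rule described above.
// Each "SpanTermQuery" reports a match position; the "SpanNearQuery"
// checks the distance against the slop. Only the inOrder==true case
// additionally requires strictly increasing positions.
public class SpanNearSketch {

    /** Unordered: X and Y may be the SAME position, so both clauses
     *  can match one and the same "john" instance. */
    static boolean matchesUnordered(int x, int y, int slop) {
        return Math.abs(x - y) <= slop;
    }

    /** Ordered: X must come strictly before Y, so the doc needs two
     *  distinct instances of the term. */
    static boolean matchesOrdered(int x, int y, int slop) {
        return x < y && (y - x) <= slop;
    }
}
```

With a doc containing a single "john" at position 3, `matchesUnordered(3, 3, 2)` holds while `matchesOrdered(3, 3, 2)` does not, which is exactly the behaviour Greg observed and the reason the inOrder=true workaround works.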



-Hoss

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: How do I sort lucene search results by relevance and time?

2011-05-10 Thread Johnbin Wang
Thanks for your suggestion!

I tried setting a document boost factor at indexing time. To bubble up
recent documents' scores, I set the boost of the last three months'
documents to 2, and the boost of all other documents to 0.5. Then I search
the index sorting by two fields: Lucene's default score and time
descending. The sorted results look good and meet my requirement.
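The two-key ordering described above (score first, recency as tie-breaker) can be sketched in plain Java with a Comparator; the `Hit` class and its fields are hypothetical stand-ins for whatever score and timestamp values the search returns:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class RecencySort {

    /** Hypothetical search hit: an id, a (possibly boosted) score,
     *  and an indexed timestamp. */
    static class Hit {
        final String id;
        final float score;
        final long timestamp;
        Hit(String id, float score, long timestamp) {
            this.id = id; this.score = score; this.timestamp = timestamp;
        }
    }

    /** Sort by score descending, breaking ties by newest first --
     *  the "two fields" ordering described in the mail. */
    static List<Hit> sort(List<Hit> hits) {
        List<Hit> out = new ArrayList<Hit>(hits);
        Collections.sort(out, new Comparator<Hit>() {
            public int compare(Hit a, Hit b) {
                if (a.score != b.score) {
                    return Float.compare(b.score, a.score); // higher score first
                }
                return Long.compare(b.timestamp, a.timestamp); // newer first
            }
        });
        return out;
    }
}
```

Because the index-time boost already folds recency into the score, the timestamp key only decides among equally-scored documents, which is what keeps the result from being an absolute time-descending sort.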

On Mon, May 9, 2011 at 6:31 PM, Ian Lea  wrote:

> Well, you can use one of the sorting search methods and pass multiple
> sort keys including relevance and a timestamp.  But I suspect the
> Google algorithm may be a bit more complex than that.
>
> One technique is boosting: set an index time document boost on recent
> documents.  Of course what is recent today may not be next week.
> There are other, more complex ways of customizing lucene scoring.  A
> Google search for something like "customized lucene scoring" will find
> lots of info, some recent, some older, but probably all relevant one
> way or another.
>
>
> --
> Ian.
>
>
> On Mon, May 9, 2011 at 4:59 AM, Johnbin Wang 
> wrote:
> > What I want to do is just like Google search results.  The results in
> the
> > first page is the most relevant and also recent documents, but not
> > absolutely sorted by  time desc.
> >
> > --
> > cheers,
> > Johnbin Wang
> >
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


-- 
cheers,
Johnbin Wang


Can I omit ShingleFilter's filler tokens

2011-05-10 Thread William Koscho
Hi,

Can I remove the filler token _ from the n-gram-tokens that are generated by
a ShingleFilter?

I'm using a chain of filters: ClassicFilter, StopFilter, LowerCaseFilter,
and ShingleFilter to create phrase n-grams.  The ShingleFilter inserts
FILLER_TOKENs in place of the stopwords, but I don't want them.

How can I omit the filler tokens?

thanks
Bill
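For reference, the effect Bill describes can be pictured with a pure-Java sketch (not the Lucene API): when the StopFilter removes a term, the ShingleFilter fills the vacated position with a filler token ("_"), and one common workaround is a follow-up step that simply discards shingles containing the filler. The token list and stopword set below are made up for illustration:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ShingleSketch {

    /** Build bigram shingles. A removed stopword leaves a "_" filler in its
     *  slot, mimicking the filler-token behaviour described in the mail. */
    static List<String> bigrams(List<String> tokens, Set<String> stopwords) {
        List<String> slots = new ArrayList<String>();
        for (String t : tokens) {
            slots.add(stopwords.contains(t.toLowerCase()) ? "_" : t.toLowerCase());
        }
        List<String> out = new ArrayList<String>();
        for (int i = 0; i + 1 < slots.size(); i++) {
            out.add(slots.get(i) + " " + slots.get(i + 1));
        }
        return out;
    }

    /** Post-processing workaround: drop any shingle containing a filler. */
    static List<String> withoutFillers(List<String> shingles) {
        List<String> out = new ArrayList<String>();
        for (String s : shingles) {
            if (!s.contains("_")) out.add(s);
        }
        return out;
    }
}
```

So for the tokens "Quick of Brown" with "of" as a stopword, the raw bigrams come out as `"quick _"` and `"_ brown"`, and the filter step removes both. Whether a given Lucene version exposes a setting to suppress fillers directly would need checking against its ShingleFilter javadoc.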


Re: Sharding Techniques

2011-05-10 Thread Ganesh

We also use a similar technique: breaking the index into smaller indexes and 
searching with ParallelMultiSearcher. We have to do incremental indexing, and 
records older than 6 months or 1 year (based on an age-out setting) should be 
deleted. Having multiple small indexes is really fast in terms of indexing.

Since you mentioned keeping a single large index: search time would be 
faster, but indexing and index optimization will take more time. How are you 
handling that in the case of incremental indexing? If we keep the index size 
at 100+ GB, then each file (fdt, fdx, etc.) would be in the GBs. Won't a 
small addition or deletion then cause more IO, as it has to skip those bytes 
and write at the end of the file?
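The aggregation step that ParallelMultiSearcher (or any RMI/HTTP fan-out, as asked about earlier in this thread) has to perform can be sketched in plain Java: collect each shard's top hits and merge them into one global top-N. This assumes the shards' scores are directly comparable, which in practice requires care (e.g. shared or merged term statistics); the `Hit` class is a hypothetical stand-in:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class ShardMerge {

    /** Hypothetical per-shard search hit. */
    static class Hit {
        final int shard;
        final int doc;
        final float score;
        Hit(int shard, int doc, float score) {
            this.shard = shard; this.doc = doc; this.score = score;
        }
    }

    /** Merge each shard's hit list into one global top-n by score.
     *  A flatten-and-sort is the simplest correct version; a k-way
     *  heap merge would avoid sorting everything for large n. */
    static List<Hit> topN(List<List<Hit>> perShard, int n) {
        List<Hit> all = new ArrayList<Hit>();
        for (List<Hit> shard : perShard) {
            all.addAll(shard);
        }
        Collections.sort(all, new Comparator<Hit>() {
            public int compare(Hit a, Hit b) {
                return Float.compare(b.score, a.score); // highest score first
            }
        });
        return all.subList(0, Math.min(n, all.size()));
    }
}
```

Each shard only needs to ship its own top n hits to the merger, so the wire cost stays proportional to shards × n rather than to the full result set.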

Regards
Ganesh

- Original Message - 
From: "Burton-West, Tom" 
To: 
Sent: Tuesday, May 10, 2011 9:46 PM
Subject: RE: Sharding Techniques


Hi Samar,

>>Normal queries go fine under 500 ms, but when people start searching
>>"anything", some queries take more than 100 seconds. Don't you think
>>distributing smaller indexes across different machines would reduce the
>>average search time? (Although I have a feeling that search time for smaller
>>queries may be slightly increased.)

What are the characteristics of your slow queries?  Can you give examples?   
Are the slow queries always slow, or only under heavy load?   Whether 
splitting into smaller indexes would help depends on just what your 
bottleneck is, and it's not clear that the size of your index is large 
enough to be causing it.

We run indexes of about 350GB with average response times under 200ms and 99th 
percentile response times of under 2 seconds. (We have a very low qps rate, 
however.)


Tom




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org