Re: hybrid query (lucene + db)

2008-05-01 Thread Stephane Nicoll
Well, for the moment we don't. The lucene index only contains the full
text content (indexed, not stored). We use lucene to perform full-text
and fuzzy searches on the keywords field. Once we have the results, we
match them against the geospatial box provided by the user (we use Oracle
Spatial for that). We have no notion of city, state or zip code, and the
data overlaps more than one country most of the time anyway.

We are thinking of reimplementing a quad tree in lucene to flag each
item with a spatial area. That way we will be able to pre-filter by
zone accordingly.
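[Editor's note: a minimal sketch of the quad-tree flagging idea — hypothetical code, not Stephane's implementation. Each item's lat/lon is encoded as a quadtree cell path; indexed as a keyword field, a bounding box can then be pre-filtered with a prefix match on the cells that cover it.]

```java
// Sketch: encode a lat/lon point as a quadtree cell path.  Nearby points
// share a prefix, so a spatial zone maps to a small set of cell prefixes.
public class QuadCell {
    /** Returns a string of 'A'..'D' quadrant choices, one per level. */
    public static String encode(double lat, double lon, int depth) {
        double minLat = -90, maxLat = 90, minLon = -180, maxLon = 180;
        StringBuilder path = new StringBuilder(depth);
        for (int i = 0; i < depth; i++) {
            double midLat = (minLat + maxLat) / 2;
            double midLon = (minLon + maxLon) / 2;
            boolean north = lat >= midLat;
            boolean east = lon >= midLon;
            // Quadrants: A=NW, B=NE, C=SW, D=SE
            path.append(north ? (east ? 'B' : 'A') : (east ? 'D' : 'C'));
            if (north) minLat = midLat; else maxLat = midLat;
            if (east) minLon = midLon; else maxLon = midLon;
        }
        return path.toString();
    }
}
```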

Still, this does not explain the deadlock on SegmentReader. If anyone
has an idea...

Thanks,
Stéphane

On Thu, May 1, 2008 at 8:50 PM, Michael Stoppelman <[EMAIL PROTECTED]> wrote:
> Stephane,
>
>  Could you describe how you set up the spatial area? Having a BooleanQuery
>  with 200 terms in it definitely slows things down [...]
>
>  [rest of quoted thread trimmed]

Re: ParalleReader and synchronization between indexes

2008-05-01 Thread Rajesh parab

One trick I can think of is somehow keeping the internal Lucene
document id the same after a document is updated (i.e. deleted and
re-inserted). I am not sure if Lucene has this capability.

Regards,
Rajesh

--- Otis Gospodnetic <[EMAIL PROTECTED]>
wrote:

> That's correct, Rajesh.  ParallelReader has its
> uses, but I guess your case is not one of them,
> unless we are all missing some key aspect of PR or a
> trick to make it work in your case.
> 
> Otis 
> 
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr -
> Nutch
> 
> [quoted messages trimmed]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Does Lucene Supports Billions of data

2008-05-01 Thread Otis Gospodnetic
Right.  And the typical answer to that is:

- If your terms are roughly equally distributed across all N indices (e.g. random 
doc->index/shard assignment), the relevance scores will be roughly comparable.

- If you have business rules for doc->index/shard distribution, then your 
relevance scores will not be comparable.

Otis 

--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 
> From: Toke Eskildsen <[EMAIL PROTECTED]>
> Subject: Re: Does Lucene Supports Billions of data
> 
> [quoted message trimmed]



Re: ParalleReader and synchronization between indexes

2008-05-01 Thread Otis Gospodnetic
That's correct, Rajesh.  ParallelReader has its uses, but I guess your case is 
not one of them, unless we are all missing some key aspect of PR or a trick to 
make it work in your case.

Otis 

--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 
> From: Rajesh parab <[EMAIL PROTECTED]>
> Subject: Re: ParalleReader and synchronization between indexes
> 
> [quoted message trimmed]



Re: Does Lucene Supports Billions of data

2008-05-01 Thread Toke Eskildsen
From: John Wang <[EMAIL PROTECTED]>
[...]
> sub index 1: 1 billion docs
> sub index 2: 1 billion docs
> sub index 3: 1 billion docs
> 
> federating search to these subindexes, you represent an index of 3 billion 
> docs, and all internal doc ids are of type int.

That falls under Daniel's "...unless you wrap your own framework around it". 
The problem with the solution you're describing is that it's not functionally 
equivalent to a single index of 3 billion docs.

If you just create 3 independent indexes and merge the top hits from all 3, the 
ranking of the documents will be messed up. You'll need to make sure that the 
scores from the different indexes can be compared. That's tricky when the score 
depends on the frequency of the terms in the whole corpus.





Re: Does Lucene Supports Billions of data

2008-05-01 Thread John Wang
I am not sure why this is the case: the docid is internal to the sub-index. As
long as each sub-index stays below 2 billion docs, there is no need for docids
to be long. With multiple indexes, I was thinking of having an aggregator which
merges maybe only a page of search results.

Example:

sub index 1: 1 billion docs
sub index 2: 1 billion docs
sub index 3: 1 billion docs

federating search to these subindexes, you represent an index of 3 billion
docs, and all internal doc ids are of type int.

Maybe I am not understanding something.

-John
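[Editor's note: a sketch of the aggregator John describes — hypothetical code, not a Lucene API. Each hit keeps its shard id plus the shard-local int docid, so no global long docid is needed; only a page of hits per shard is merged. As Toke notes in this thread, the merged scores are only meaningful if term statistics are comparable across shards.]

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Federated search sketch: merge one page of per-shard hits into a
// global top-N, identifying documents by (shard, local docid) pairs.
public class PageAggregator {
    public static class Hit implements Comparable<Hit> {
        public final int shard, docId;   // docId is local to the shard
        public final float score;
        public Hit(int shard, int docId, float score) {
            this.shard = shard; this.docId = docId; this.score = score;
        }
        public int compareTo(Hit o) {    // descending score order
            return Float.compare(o.score, score);
        }
    }

    /** Merge per-shard top hits and return the global top 'pageSize'. */
    public static List<Hit> mergeTop(List<List<Hit>> perShard, int pageSize) {
        List<Hit> all = new ArrayList<Hit>();
        for (List<Hit> hits : perShard) all.addAll(hits);
        Collections.sort(all);
        return all.subList(0, Math.min(pageSize, all.size()));
    }
}
```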

On Wed, Apr 30, 2008 at 4:10 PM, Daniel Noll <[EMAIL PROTECTED]> wrote:

> On Thursday 01 May 2008 00:01:48 John Wang wrote:
> > I am not sure how well lucene would perform with > 2 Billion docs in a
> > single index anyway.
>
> Even if they're in multiple indexes, the doc IDs being ints will still
> prevent
> it going past 2Gi unless you wrap your own framework around it.
>
> Daniel
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


Re: hybrid query (lucene + db)

2008-05-01 Thread Michael Stoppelman
Stephane,

Could you describe how you set up the spatial area? Having a BooleanQuery with
200 terms in it definitely slows things down (I'm not sure exactly why yet
-- it seems like it shouldn't be "that" slow). If you can describe your
spatial area in fewer terms you can get much better performance. It just
depends on how you're describing your spatial areas and the number of
results in each zipcode. If you had a field like "city,state" in your index
you would have far fewer terms in your query than if that query had all the
zipcodes in a "city,state" combo, thus making your query much faster.
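[Editor's note: Michael's point about term count can be illustrated with a toy postings model — illustrative only, not Lucene internals. A disjunction walks one postings list per term, so a single "city,state" term does far less work than ~200 zipcode terms that match the same documents.]

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Toy postings index: an OR query unions one postings list per term,
// so its cost grows with the number of terms, not just the result size.
public class PostingsCost {
    private final Map<String, int[]> postings = new HashMap<String, int[]>();
    public int listsWalked;              // work counter for the last query

    public void add(String term, int... docs) { postings.put(term, docs); }

    /** OR over the given terms: union of their postings lists. */
    public SortedSet<Integer> matchAny(List<String> terms) {
        listsWalked = 0;
        SortedSet<Integer> out = new TreeSet<Integer>();
        for (String t : terms) {
            int[] docs = postings.get(t);
            if (docs == null) continue;
            listsWalked++;               // one list walked per query term
            for (int d : docs) out.add(d);
        }
        return out;
    }
}
```

A query on one combined "city,state" term returns the same documents as the zipcode disjunction while walking a single list.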

M

On Thu, May 1, 2008 at 2:15 AM, mark harwood <[EMAIL PROTECTED]>
wrote:

> The issue here is a general one of trying to perform an efficient join
> between an external resource (rdbms) and Lucene.
>
> [rest of quoted thread trimmed]


Re: ParalleReader and synchronization between indexes

2008-05-01 Thread Rajesh parab
Thanks Yonik.

So, if rebuilding the second index is not an option
due to the large number of documents, then ParallelReader will
not work :-(

And I believe there is no way other than
ParallelReader to search across multiple indexes that
contain related data. Is there any other alternative?
I think MultiSearcher or MultiReader will only work
with multiple, unrelated indexes.

Regards,
Rajesh


  




Re: ParalleReader and synchronization between indexes

2008-05-01 Thread Yonik Seeley
On Wed, Apr 30, 2008 at 10:52 PM, Rajesh parab <[EMAIL PROTECTED]> wrote:
>  Can we somehow keep
>  internal document id same after updating (i.e. delete
>  and re-insert) index document?

No.  ParallelReader is not a general solution, it's an expert-level
solution that leaves the task of keeping the indexes in sync up to
you.  The easiest thing is to really rebuild the smaller index each
time.  If you can't do that, ParallelReader is probably not what you
are looking for.
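[Editor's note: a toy illustration — not Lucene code — of why an update breaks ParallelReader. Documents in the two indexes are paired purely by position, and Lucene's delete + re-add gives the document a new docid, modelled here by removing the element and appending it.]

```java
import java.util.ArrayList;
import java.util.List;

// Toy model: element i stands for docid i.  A Lucene "update" is a
// delete followed by a re-add, which assigns a fresh docid; after it,
// every positional pairing with the parallel index is wrong.
public class ParallelAlignment {
    public static List<String> update(List<String> index, int docId, String newValue) {
        List<String> out = new ArrayList<String>(index);
        out.remove(docId);      // delete: later docids shift down
        out.add(newValue);      // re-add: document reappears at the end
        return out;
    }
}
```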

-Yonik




Re: lucene farsi problem

2008-05-01 Thread Grant Ingersoll


On May 1, 2008, at 4:36 AM, esra wrote:

> Hi,
>
> the document's encoding is "UTF-8".
>
> i tried the explain() method and the result for the "د-ژ" range search is:
>
>   fieldWeight(keywordIndex:ساب ووفر in 0), product of:
>     1.0 = tf(termFreq(keywordIndex:ساب ووفر)=1)
>     0.30685282 = idf(docFreq=1)
>     1.0 = fieldNorm(field=keywordIndex, doc=0)
>
> here keywordIndex is "ساب ووفر".
>
> i also installed "luke.jnlp" but i don't know what to check with Luke.

http://wiki.apache.org/lucene-java/LuceneFAQ#head-3558e5121806fb4fce80fc022d889484a9248b71

Luke can be used to view your index.  Not saying it is your problem
here, but often when I get back results that "seem" incorrect, the
first thing I do is look at my index with Luke and compare the
"incorrect" document with what is in the query to see where the
(mis)match is occurring.  Usually this analysis shows that my
document/query is not what I thought it was.

Luke can browse documents and parse queries, amongst other useful
things.

> [rest of quoted thread trimmed]

--
Grant Ingersoll

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ

RE: lucene farsi problem

2008-05-01 Thread Steven A Rowe
Hi Esra,

Going back to the original problem statement, I see something that looks 
illogical to me - please correct me if I'm wrong:

On Apr 30, 2008, at 3:21 AM, esra wrote:
> i am using lucene's "IndexSearcher" to search the given xml by
> keyword which contains farsi information.
> while searching i use ranges like
> 
> آ-ث  |  ج-خ  |  د-ژ  |  س-ظ  |  ع-ق  |  ک-ل  |  م-ی
> 
> when i do search for  "د-ژ"  range the results are wrong , they
> are the results of  " س-ظ "range.
> 
> for example when i do search for "د-ژ"  one of the results is
> "ساب ووفر", this result also shown on the " س-ظ " range's result
> list which is the correct range.
> 
> As IndexSearcher use "compareTo" method and this method uses
> unicodes for comparing, i found the unicodes of the characters.
> 
> د=U+62F
> ژ = U+698
> and the first letter of "ساب ووفر " is  س = U+633

It appears to me that *both* the "د-ژ" range [ U+062F - U+0698 ] and the "س-ظ" 
range [ U+0633 - U+0638 ] contain the first letter of "ساب ووفر", which is "س" 
= U+0633.  

You stated that U+0633 should be contained in the [ U+0633 - U+0638 ] range - I 
agree - but why do you think U+0633 should not be contained in the [ U+062F - 
U+0698 ] range?

In other words, it looks to me like your problem is not a problem at all.
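[Editor's note: Steven's arithmetic is easy to verify directly — Lucene's term ordering is plain code-unit comparison, and U+0633 falls inside both ranges.]

```java
// Lucene's lexicographic term order compares UTF-16 code units, so a
// single-character range check reduces to a numeric comparison.
public class CodePointRange {
    public static boolean inRange(char c, char lower, char upper) {
        return lower <= c && c <= upper;
    }
}
```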

Steve


Re: hybrid query (lucene + db)

2008-05-01 Thread mark harwood
The issue here is a general one of trying to perform an efficient join between 
an external resource (rdbms) and Lucene.
This experiment may be of interest:
http://issues.apache.org/jira/browse/LUCENE-434

KeyMap.java embodies the core service which translates from lucene doc ids to 
DB primary keys or vice versa.
There are a couple of implementations of KeyMap that are not optimal (they 
pre-date Lucene's FieldCache) but it may give you food for thought.
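[Editor's note: the KeyMap idea can be sketched roughly as follows — hypothetical code, not the LUCENE-434 patch itself. The primary-key field is loaded once into an array indexed by docid, the way FieldCache does, with a hash map for the reverse direction.]

```java
import java.util.HashMap;
import java.util.Map;

// Bidirectional docid <-> DB primary key mapping, built once per
// index reader and reused across queries.
public class KeyMap {
    private final long[] docToPk;                 // docid -> DB primary key
    private final Map<Long, Integer> pkToDoc = new HashMap<Long, Integer>();

    public KeyMap(long[] pkFieldByDoc) {          // one entry per docid
        this.docToPk = pkFieldByDoc;
        for (int doc = 0; doc < pkFieldByDoc.length; doc++)
            pkToDoc.put(pkFieldByDoc[doc], doc);
    }
    public long primaryKey(int docId) { return docToPk[docId]; }
    public Integer docId(long pk)     { return pkToDoc.get(pk); }  // null if absent
}
```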

Cheers
Mark


- Original Message 
From: Stephane Nicoll <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Thursday, 1 May, 2008 9:00:33 AM
Subject: hybrid query (lucene + db)

[quoted message trimmed]



Re: lucene farsi problem

2008-05-01 Thread esra

Hi,

the document's encoding is "UTF-8".

i tried the explain() method and the result for the "د-ژ" range search is:

  fieldWeight(keywordIndex:ساب ووفر in 0), product of:
  1.0 = tf(termFreq(keywordIndex:ساب ووفر)=1)
  0.30685282 = idf(docFreq=1)
  1.0 = fieldNorm(field=keywordIndex, doc=0)

here keywordIndex is "ساب ووفر".

i also installed "luke.jnlp" but i don't know what to check with Luke.

Thanks,

Esra



Grant Ingersoll-6 wrote:
> 
> I am not sure how Standard Analyzer will perform on Farsi.  The thing  
> to do now would be to get Luke and have a look at the actual document  
> that matches and see what its tokens look like.  You might also try  
> using the explain() method to see why that document matches.
> 
> Also, are you sure you are loading the file w/ the proper encodings,  
> etc?
> 
> -Grant
> 
> [rest of quoted thread trimmed]

-- 
View this message in context: 
http://www.nabble.com/lucene-farsi-problem-tp16977096p16993174.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.





hybrid query (lucene + db)

2008-05-01 Thread Stephane Nicoll
Hi there,

We're using lucene with Hibernate Search and we're very happy so far
with the performance and the usability of lucene. We have, however, a
specific use case that prevents us from using only lucene: spatial
queries. I already sent a mail to this list a while back about the
problem, and we started investigating multiple solutions.

When the user selects a geographic area and some keywords we do the following:

* Perform a search on the lucene index for the keywords, with a
projection that returns only the primary key of each element, sorted by
primary key
* Perform a search on the database with the other criteria, with a
projection that returns only the primary key of each element
* Iterate over both lists to find N matching IDs, optionally with paging
(going from X to X + N, where X is the first result of the page)
* Run a query on the database to return the actual objects (select a
from MyClass a where a.id IN (the list of matching IDs)). We limit the
page to 1000 results.
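[Editor's note: the iterate-and-page step described above can be sketched as a linear merge-join over the two sorted primary-key lists — an illustrative sketch, not the actual implementation. 'offset' and 'pageSize' implement the paging without materialising the full intersection.]

```java
import java.util.ArrayList;
import java.util.List;

// Merge-join of two ascending primary-key lists: advance whichever
// cursor is behind; on a match, skip until the page offset is reached,
// then collect up to pageSize IDs and stop.
public class IdMergeJoin {
    public static List<Long> page(long[] luceneIds, long[] dbIds,
                                  int offset, int pageSize) {
        List<Long> out = new ArrayList<Long>();
        int skipped = 0, i = 0, j = 0;
        while (i < luceneIds.length && j < dbIds.length
               && out.size() < pageSize) {
            if (luceneIds[i] < dbIds[j]) i++;
            else if (luceneIds[i] > dbIds[j]) j++;
            else {                        // ID present in both sources
                if (skipped++ >= offset) out.add(luceneIds[i]);
                i++; j++;
            }
        }
        return out;
    }
}
```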

We have looked for a way to optimize the queries and avoid consuming
too much memory, knowing that we must support paging.

With a single user, a search by keywords takes 30msec to complete and a
search by box takes 45msec. With both (keywords + spatial area) it
takes 300msec.

With 10 concurrent users, a search by keywords takes 150msec/user, but
for both it takes 3 sec/user!

I had the profiler running on this scenario and I've found that *all*
threads are waiting on org.apache.lucene.index.SegmentReader. I then
configured Hibernate Search to use a separate index reader per thread.
The deadlocks disappeared but it's still very slow (2.8sec).

Some questions:

* Does anyone know where the deadlocks on SegmentReader are coming from?
* Is sorting on the primary keys a bad idea regarding performance
and memory usage?
* Does anyone have an idea how to perform this kind of hybrid query in an
efficient way?

I am using lucene 2.3.1 and Hibernate Search 3.0.1. I already asked for
support on the Hibernate Search forum but did not get any answer so
far.

Thanks,
Stéphane

-- 
Large Systems Suck: This rule is 100% transitive. If you build one,
you suck" -- S.Yegge




RE: Does Lucene Supports Billions of data

2008-05-01 Thread spring
> Even if they're in multiple indexes, the doc IDs being ints 
> will still prevent 
> it going past 2Gi unless you wrap your own framework around it.

Hm. Does this mean that a MultiReader has the int-limit too?
I thought that this limit applies to a single index only...





RE: lucene farsi problem

2008-05-01 Thread esra

Hi Steve,

thanks for your reply. i know farsi is written and read right-to-left.
i am using the RangeQuery class, and its rewrite(IndexReader reader) method
decides whether a word is in range via the compareTo method; this decision
is made using unicode values.

while searching for the "د-ژ" range the lowerTerm is "د" and the upperTerm is
"ژ". 
And while comparing, the result "ساب ووفر" also takes its first letter,
س, and does the comparison on that letter.

 د = U+62F
 ژ = U+698
 and the first letter of "ساب ووفر" is  س = U+633

Esra


Steven A Rowe wrote:
> 
> Hi Esra,
> 
> Caveat: I don't speak, read, write, or dream in Farsi - I just know that
> it mostly shares its orthography with Arabic, and that they are both
> written and read right-to-left.
> 
> How are you constructing the queries?  Using QueryParser?  If so, then I
> suspect the problem is that you intend the range you supply to be read
> entirely right-to-left, but Lucene instead reads it left-to-right.  Have
> you tried using e.g. "د-ژ" instead of "د-ژ"?  (That is, placing the lower
> valued term on the left instead of the right.)
> 
> AFAICT, RangeFilter (called from ConstantScoreRangeQuery, which is called
> from QueryParser) does not test whether lowerTerm is in fact lower than
> upperTerm.  If it turns out that the problem is simply one of order, it
> might make sense to modify RangeFilter so that it flips them when lowerTerm
> > upperTerm.
> 
> Steve
> 
> On 04/30/2008 at 3:21 AM, esra wrote:
>> [original message trimmed]

-- 
View this message in context: 
http://www.nabble.com/lucene-farsi-problem-tp16977096p16993041.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

