date:20080501

Re: hybrid query (lucene + db)

2008-05-01 Thread Stephane Nicoll

Well for the moment we don't. The lucene index only contains the full
text content (indexed, not stored). We use lucene to perform full text
and fuzzy searches on the keywords field. Once we have the result, we
match them with the geospatial box provided by the user (we use Oracle
Spatial for that). We have no notion of city, state or zip code. Date
overlaps more than one countries most of the time actually.

We are thinking of reimplementing a quad tree in lucene to flag each
item with a spatial area. That way we will be able to pre-filter the
zone accordingly.

Still, this does not explain the deadlock on SegmentReader. If anyone
has an idea...

Thanks,
Stéphane

On Thu, May 1, 2008 at 8:50 PM, Michael Stoppelman <[EMAIL PROTECTED]> wrote:
> Stephane,
>
>  Could you describe how you setup the spatial area? Having BooleanQuery with
>  200 terms in it definitely slows things down (I'm not sure exactly why yet
>  -- it seems like it shouldn't be "that" slow). If you can describe your
>  spatial area in fewer terms you can get much better performance. It just
>  depends on how you're describing your spatial areas and the number of
>  results in each zipcode. If you had a field like "city,state" in your index
>  you would have far less terms in your query than if that query had all the
>  zipcodes in a "city,state" combo, thus making your query much faster.
>
>  M
>
>  On Thu, May 1, 2008 at 2:15 AM, mark harwood <[EMAIL PROTECTED]>
>  wrote:
>
>
>
>  > The issue here is a general one of trying to perform an efficient join
>  > between an external resource (rdbms) and Lucene.
>  > This experiment may be of interest:
>  >http://issues.apache.org/jira/browse/LUCENE-434
>  >
>  > KeyMap.java embodies the core service which translates from lucene doc ids
>  > to DB primary keys or vice versa.
>  > There are a couple of implementations of KeyMap that are not optimal (they
>  > pre-date Lucene's FieldCache) but it may give you food for thought.
>  >
>  > Cheers
>  > Mark
>  >
>  >
>  > - Original Message 
>  > From: Stephane Nicoll <[EMAIL PROTECTED]>
>  > To: java-user@lucene.apache.org
>  > Sent: Thursday, 1 May, 2008 9:00:33 AM
>  > Subject: hybrid query (lucene + db)
>  >
>  > Hi there,
>  >
>  > We're using lucene with Hibernate search and we're very happy so far
>  > with the performance and the usability of lucene. We have however a
>  > specific use cases that prevent us to use only lucene: spatial
>  > queries. I already sent a mail on this list a while back about the
>  > problem and we started investigating multiple solutions.
>  >
>  > When the user selects a geographic area and some keywords we do the
>  > following:
>  >
>  > * Perform a search on the lucene index for the keywords with a
>  > projection that returns only the primaryKey of the element sorted by
>  > primary key
>  > * Perform a search on the database with other criterias and a
>  > projection that returns only the primary key of the elements
>  > * Iterate on both list to find N matching IDs, optionally with paging
>  > (some from X to X + N where X is the first result of the page)
>  > * Run a query on the database to return the actual objects (select a
>  > from MyClass a where a.id IN (the list of matching IDs) ) We limit the
>  > page to 1000 results
>  >
>  > We have searched a way to optimize the queries and to avoid to consume
>  > too much memory, knowing that we must support paging.
>  >
>  > With a single user a search by kewyords takes 30msec to complete, a
>  > search by box takes 45msec. With both (keywords + spatial area)  it
>  > takes 300msec
>  >
>  > With 10 concurrent users, a search by keywords takes 150msec/user  but
>  > for both it takes 3 sec/user !!!
>  >
>  > I had the profiler running on this scenario and I've found that *all*
>  > threads are waiting on org.apache.lucene.index.SegmentReader. I then
>  > configured Hibernate Search to use a separate index reader per thread.
>  > The deadlocks disappeared but it's still very slow (2.8sec).
>  >
>  > Some questions:
>  >
>  > * Does anyone knows where the deadlocks on SegmentReader are coming from?
>  > * Is the sorting on the primary keys a bad idea regarding performance
>  > and memory usage?
>  > * Does anyone has an idea to perform this kind of hybrid query in an
>  > efficient way?
>  >
>  > I am using lucene 2.3.1 and Hibernate Search 3.0.1. I already ask for
>  > support on the Hibernate Search forum but did not get any answer so
>  > far.
>  >
>  > Thanks,
>  > Stéphane
>  >
>  > --
>  > Large Systems Suck: This rule is 100% transitive. If you build one,
>  > you suck" -- S.Yegge
>  >
>  > -
>  > To unsubscribe, e-mail: [EMAIL PROTECTED]
>  > For additional commands, e-mail: [EMAIL PROTECTED]
>  >
>  >
>  >
>  >
>  >
>  >
>  >   __
>  > Sent from Yahoo! Mail.
>  > A Smarter Email http://uk.docs.yahoo.com/nowyoucan.h

Re: ParalleReader and synchronization between indexes

2008-05-01 Thread Rajesh parab


One trick I can think of is somehow keeping internal
document id of Lucene document same after document is
updated (i.e. deleted and re-inserted). I am not sure
if we have this capability in Lucene.

Regards,
Rajesh

--- Otis Gospodnetic <[EMAIL PROTECTED]>
wrote:

> That's correct, Rajesh.  ParallelReader has its
> uses, but I guess your case is not one of them,
> unless we are all missing some key aspect of PR or a
> trick to make it work in your case.
> 
> Otis 
> 
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr -
> Nutch
> 
> - Original Message 
> > From: Rajesh parab <[EMAIL PROTECTED]>
> > To: java-user@lucene.apache.org
> > Sent: Thursday, May 1, 2008 6:55:00 PM
> > Subject: Re: ParalleReader and synchronization
> between indexes
> > 
> > Thanks Yonik.
> > 
> > So, if rebuilding the second index is not an
> option
> > due to large no of documents, then ParallelReader
> will
> > not work :-(
> > 
> > And I believe there is no other way than
> > parallelReader to search across multiple indexes
> that
> > contain related data. Is there any other
> alternative?
> > I think, MultiSearcher or MultiReader will only
> work
> > with multiple, unrelated indexes.
> > 
> > Regards,
> > Rajesh
> 
> 
> 
>
-
> To unsubscribe, e-mail:
> [EMAIL PROTECTED]
> For additional commands, e-mail:
> [EMAIL PROTECTED]
> 
> 



  

Be a better friend, newshound, and 
know-it-all with Yahoo! Mobile.  Try it now.  
http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: Does Lucene Supports Billions of data

2008-05-01 Thread Otis Gospodnetic

Right.  And the typical answer to that is:

- If your terms are roughly equally distributed in all N indices (e.g. random 
doc->index/shard assignment), the relevance score will roughly match.

- If you have business rules for doc->index/shard distribution, then your 
relevance scores will not be comparable.

Otis 

--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 
> From: Toke Eskildsen <[EMAIL PROTECTED]>
> To: java-user@lucene.apache.org
> Sent: Friday, May 2, 2008 12:13:04 AM
> Subject: Re: Does Lucene Supports Billions of data
> 
> From: John Wang 
> [...]
> > sub index 1: 1 billion docs
> > sub index 2: 1 billion docs
> > sub index 3: 1 billion docs
> > 
> > federating search to these subindexes, you represent an index of 3 
> > billiondocs, and all internal doc ids are of type int.
> 
> That falls under Daniel's "...unless you wrap your own framework around it". 
> The 
> problem with the solution you're describing is that it's not functionally 
> equivalent to a single index of 3 billion docs.
> 
> If you just create 3 independent indexes and merge the top hits from all 3, 
> the 
> ranking of the documents will be messed up. You'll need to make sure that the 
> scores from the different indexes can be compared. That's tricky when the 
> score 
> depends on the frequency of the terms in the whole corpus.
> 
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
> 



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: ParalleReader and synchronization between indexes

2008-05-01 Thread Otis Gospodnetic

That's correct, Rajesh.  ParallelReader has its uses, but I guess your case is 
not one of them, unless we are all missing some key aspect of PR or a trick to 
make it work in your case.

Otis 

--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 
> From: Rajesh parab <[EMAIL PROTECTED]>
> To: java-user@lucene.apache.org
> Sent: Thursday, May 1, 2008 6:55:00 PM
> Subject: Re: ParalleReader and synchronization between indexes
> 
> Thanks Yonik.
> 
> So, if rebuilding the second index is not an option
> due to large no of documents, then ParallelReader will
> not work :-(
> 
> And I believe there is no other way than
> parallelReader to search across multiple indexes that
> contain related data. Is there any other alternative?
> I think, MultiSearcher or MultiReader will only work
> with multiple, unrelated indexes.
> 
> Regards,
> Rajesh



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: Does Lucene Supports Billions of data

2008-05-01 Thread Toke Eskildsen

From: John Wang <[EMAIL PROTECTED]>
[...]
> sub index 1: 1 billion docs
> sub index 2: 1 billion docs
> sub index 3: 1 billion docs
> 
> federating search to these subindexes, you represent an index of 3 
> billiondocs, and all internal doc ids are of type int.

That falls under Daniel's "...unless you wrap your own framework around it". 
The problem with the solution you're describing is that it's not functionally 
equivalent to a single index of 3 billion docs.

If you just create 3 independent indexes and merge the top hits from all 3, the 
ranking of the documents will be messed up. You'll need to make sure that the 
scores from the different indexes can be compared. That's tricky when the score 
depends on the frequency of the terms in the whole corpus.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: Does Lucene Supports Billions of data

2008-05-01 Thread John Wang

I am not sure why this is the case, docid is internal to the sub index. As
long as the sub index size is below 2 bil, there is no need for docid to be
long. With multiple indexes, I was thinking having an aggregater which
merges maybe only a page of search result.

Example:

sub index 1: 1 billion docs
sub index 2: 1 billion docs
sub index 3: 1 billion docs

federating search to these subindexes, you represent an index of 3 billion
docs, and all internal doc ids are of type int.

Maybe I am not understanding something.

-John

On Wed, Apr 30, 2008 at 4:10 PM, Daniel Noll <[EMAIL PROTECTED]> wrote:

> On Thursday 01 May 2008 00:01:48 John Wang wrote:
> > I am not sure how well lucene would perform with > 2 Billion docs in a
> > single index anyway.
>
> Even if they're in multiple indexes, the doc IDs being ints will still
> prevent
> it going past 2Gi unless you wrap your own framework around it.
>
> Daniel
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>

Re: hybrid query (lucene + db)

2008-05-01 Thread Michael Stoppelman

Stephane,

Could you describe how you setup the spatial area? Having BooleanQuery with
200 terms in it definitely slows things down (I'm not sure exactly why yet
-- it seems like it shouldn't be "that" slow). If you can describe your
spatial area in fewer terms you can get much better performance. It just
depends on how you're describing your spatial areas and the number of
results in each zipcode. If you had a field like "city,state" in your index
you would have far less terms in your query than if that query had all the
zipcodes in a "city,state" combo, thus making your query much faster.

M

On Thu, May 1, 2008 at 2:15 AM, mark harwood <[EMAIL PROTECTED]>
wrote:

> The issue here is a general one of trying to perform an efficient join
> between an external resource (rdbms) and Lucene.
> This experiment may be of interest:
>http://issues.apache.org/jira/browse/LUCENE-434
>
> KeyMap.java embodies the core service which translates from lucene doc ids
> to DB primary keys or vice versa.
> There are a couple of implementations of KeyMap that are not optimal (they
> pre-date Lucene's FieldCache) but it may give you food for thought.
>
> Cheers
> Mark
>
>
> - Original Message 
> From: Stephane Nicoll <[EMAIL PROTECTED]>
> To: java-user@lucene.apache.org
> Sent: Thursday, 1 May, 2008 9:00:33 AM
> Subject: hybrid query (lucene + db)
>
> Hi there,
>
> We're using lucene with Hibernate search and we're very happy so far
> with the performance and the usability of lucene. We have however a
> specific use cases that prevent us to use only lucene: spatial
> queries. I already sent a mail on this list a while back about the
> problem and we started investigating multiple solutions.
>
> When the user selects a geographic area and some keywords we do the
> following:
>
> * Perform a search on the lucene index for the keywords with a
> projection that returns only the primaryKey of the element sorted by
> primary key
> * Perform a search on the database with other criterias and a
> projection that returns only the primary key of the elements
> * Iterate on both list to find N matching IDs, optionally with paging
> (some from X to X + N where X is the first result of the page)
> * Run a query on the database to return the actual objects (select a
> from MyClass a where a.id IN (the list of matching IDs) ) We limit the
> page to 1000 results
>
> We have searched a way to optimize the queries and to avoid to consume
> too much memory, knowing that we must support paging.
>
> With a single user a search by kewyords takes 30msec to complete, a
> search by box takes 45msec. With both (keywords + spatial area)  it
> takes 300msec
>
> With 10 concurrent users, a search by keywords takes 150msec/user  but
> for both it takes 3 sec/user !!!
>
> I had the profiler running on this scenario and I've found that *all*
> threads are waiting on org.apache.lucene.index.SegmentReader. I then
> configured Hibernate Search to use a separate index reader per thread.
> The deadlocks disappeared but it's still very slow (2.8sec).
>
> Some questions:
>
> * Does anyone knows where the deadlocks on SegmentReader are coming from?
> * Is the sorting on the primary keys a bad idea regarding performance
> and memory usage?
> * Does anyone has an idea to perform this kind of hybrid query in an
> efficient way?
>
> I am using lucene 2.3.1 and Hibernate Search 3.0.1. I already ask for
> support on the Hibernate Search forum but did not get any answer so
> far.
>
> Thanks,
> Stéphane
>
> --
> Large Systems Suck: This rule is 100% transitive. If you build one,
> you suck" -- S.Yegge
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>
>
>
>
>
>   __
> Sent from Yahoo! Mail.
> A Smarter Email http://uk.docs.yahoo.com/nowyoucan.html
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>

Re: ParalleReader and synchronization between indexes

2008-05-01 Thread Rajesh parab

Thanks Yonik.

So, if rebuilding the second index is not an option
due to large no of documents, then ParallelReader will
not work :-(

And I believe there is no other way than
parallelReader to search across multiple indexes that
contain related data. Is there any other alternative?
I think, MultiSearcher or MultiReader will only work
with multiple, unrelated indexes.

Regards,
Rajesh


  

Be a better friend, newshound, and 
know-it-all with Yahoo! Mobile.  Try it now.  
http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: ParalleReader and synchronization between indexes

2008-05-01 Thread Yonik Seeley

On Wed, Apr 30, 2008 at 10:52 PM, Rajesh parab <[EMAIL PROTECTED]> wrote:
>  Can we somehow keep
>  internal document id same after updating (i.e. delete
>  and re-insert) index document?

No.  ParallelReader is not a general solution, it's an expert-level
solution that leaves the task of keeping the indexes in sync up to
you.  The easiest thing is to really rebuild the smaller index each
time.  If you can't do that, ParallelReader is probably not what you
are looking for.

-Yonik

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: lucene farsi problem

2008-05-01 Thread Grant Ingersoll

On May 1, 2008, at 4:36 AM, esra wrote:

Hi,

document's encoding is "UTF-8".

i tried the explain() method and the result for "د-ژ" range
searching is:

fieldWeight(keywordIndex:Ø³Ø§Ø¨ ÙˆÙˆÙ�Ø± in 0),
product of:

1.0 = tf(termFreq(keywordIndex:Ø³Ø§Ø¨ ÙˆÙˆÙ�Ø±)=1)
0.30685282 = idf(docFreq=1)
1.0 = fieldNorm(field=keywordIndex, doc=0)

here keywordIndex is "ساب ووفر".

i also installed the "luke.jnlp" but i don't know what to check by
Luke.

http://wiki.apache.org/lucene-java/LuceneFAQ#head-3558e5121806fb4fce80fc022d889484a9248b71

Luke can be used to view your index. Not saying it is your problem
here, but often times when I get back results that "seem" incorrect,
the first thing I do is go look at my index using Luke, and compare
the "incorrect" document with what is in the query to see where the
(mis)match is occurring. Usually, this analysis shows that my
document/query is not what I thought it was.

Luke can browse documents and parse queries, amongst other useful
things.

Thanks,

Esra

Grant Ingersoll-6 wrote:

I am not sure how Standard Analyzer will perform on Farsi. The thing
to do now would be to get Luke and have a look at the actual document
that matches and see what it's tokens look like. You might also try
using the explain() method to see why that document matches.

Also, are you sure you are loading the file w/ the proper encodings,
etc?

-Grant

On Apr 30, 2008, at 8:06 AM, esra wrote:

Hi,
thanks for your reply.
I am using StandartAnalyzer now and my xml document is like below:

i googled for farsi analyzer and found nothing also i am not sure it
if
would solve my problem or not.

Thanks,

Esra

Grant Ingersoll-6 wrote:

What Analyzer are you using? You might try looking in Luke to see
what is in your index, etc. It also isn't clear to me what your
documents look like.

As for a Farsi analyzer, I would Google "Farsi analyzer Lucene" and
see if you can find anything. Otherwise, you will have to write
your

own (and donate it)

-Grant

On Apr 30, 2008, at 3:21 AM, esra wrote:

hi,

i am using lucene's "IndexSearcher" to search the given xml by
keyword which
contains farsi information.
while searching i use ranges like

آ-ث | ج-خ | د-ژ | س-ظ | ع-ق | ک-ل | م-ی

when i do search for "د-ژ" range the results are wrong , they
are
the
results of " س-ظ "range.

for example when i do search for "د-ژ" one of the the results
is

"ساب ووفر"
, this result also shown on the " س-ظ " range's result list
which

is the
corret range.

As IndexSearcher use "compareTo" method and this method uses
unicodes for
comparing, i found the unicodes of the characters.

د=U+62F
ژ = U+698
and the first letter of "ساب ووفر " is س = U+633

Do you have any idea how to solve this problem, there are
analyzers

for
different languages ,
will this be usefull if so do you know where to find a farsi
analyzer?

I would bu glad if you help.

thanks ,

Esra

--
View this message in context:
http://www.nabble.com/lucene-farsi-problem-
tp16977096p16977096.html

Sent from the Lucene - Java Users mailing list archive at
Nabble.com.