Re: hybrid query (lucene + db)

Stephane Nicoll Fri, 02 May 2008 09:59:37 -0700

Hi,

Thanks for the response. The very first reason we're using lucene is
because we're building a product that must support different databases
(Oracle 10, Oracle 11 and PostgreSQL with spatial extensions).

I had a look at this project already but we cannot stick to one database vendor.

Cheers,
Stéphane

On Fri, May 2, 2008 at 6:55 PM, Marcelo Ochoa <[EMAIL PROTECTED]> wrote:
> Hi Stéphane:
>   If you are using Oracle Spatial I assume that you are using Oracle
>  too for storing text :)
>   Have you taken a look at the Oracle-Lucene integration project sponsored
>  by LendingClub.com?
>  http://docs.google.com/Doc?id=ddgw7sjp_54fgj9kg
>  
> http://sourceforge.net/project/showfiles.php?group_id=56183&package_id=255524&release_id=589900
>   It's a new domain index for Oracle using Lucene inside the Oracle JVM.
>   By doing that we can use Lucene as Oracle Text, but with many other
>  features, and using inline pagination we can get better performance
>  than the latest 11g Text Compound Domain Index.
>   If you are interested in this implementation simply drop me an email.
>   Best regards, Marcelo.
>
>
>
>  On Fri, May 2, 2008 at 3:58 AM, Stephane Nicoll
>  <[EMAIL PROTECTED]> wrote:
>  > Well for the moment we don't. The lucene index only contains the full
>  >  text content (indexed, not stored). We use lucene to perform full text
>  >  and fuzzy searches on the keywords field. Once we have the result, we
>  >  match them with the geospatial box provided by the user (we use Oracle
>  >  Spatial for that). We have no notion of city, state or zip code. Data
>  >  overlaps more than one country most of the time, actually.
>  >
>  >  We are thinking of reimplementing a quad tree in lucene to flag each
>  >  item with a spatial area. That way we will be able to pre-filter the
>  >  zone accordingly.
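
The quad tree idea above could be sketched roughly as follows: each point gets a cell key built by repeatedly splitting the world into four quadrants, and that key would then be indexed as an extra term so a query can pre-filter by cell. The class name `QuadTreeCell`, the method `key` and the digit encoding are hypothetical choices for illustration, not anything from the thread or from Lucene. The sketch only handles single points; items that overlap several areas would need one key per covered cell.

```java
/** Sketch only: computes a quad tree cell key for a longitude/latitude point. */
final class QuadTreeCell {

    /**
     * Returns a string of `depth` digits (each '0'..'3'). Every digit selects
     * one quadrant of the cell chosen by the digits before it, so a shorter
     * key is always a prefix of the longer key for the same point.
     */
    static String key(double lon, double lat, int depth) {
        double minX = -180, maxX = 180, minY = -90, maxY = 90;
        StringBuilder sb = new StringBuilder(depth);
        for (int level = 0; level < depth; level++) {
            double midX = (minX + maxX) / 2;
            double midY = (minY + maxY) / 2;
            int quadrant = 0;
            // Bit 0: east half of the current cell; bit 1: north half.
            if (lon >= midX) { quadrant |= 1; minX = midX; } else { maxX = midX; }
            if (lat >= midY) { quadrant |= 2; minY = midY; } else { maxY = midY; }
            sb.append((char) ('0' + quadrant));
        }
        return sb.toString();
    }
}
```

Because of the prefix property, a coarse search area can be matched against the leading digits of the key while finer cells remain available for smaller boxes.
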
>  >
>  >  Still, this does not explain the deadlock on SegmentReader. If anyone
>  >  has an idea...
>  >
>  >  Thanks,
>  >  Stéphane
>  >
>  >
>  >
>  >  On Thu, May 1, 2008 at 8:50 PM, Michael Stoppelman <[EMAIL PROTECTED]> wrote:
>  >  > Stephane,
>  >  >
>  >  >  Could you describe how you set up the spatial area? Having a BooleanQuery
>  >  >  with 200 terms in it definitely slows things down (I'm not sure exactly
>  >  >  why yet -- it seems like it shouldn't be "that" slow). If you can describe
>  >  >  your spatial area in fewer terms you can get much better performance. It
>  >  >  just depends on how you're describing your spatial areas and the number of
>  >  >  results in each zipcode. If you had a field like "city,state" in your
>  >  >  index you would have far fewer terms in your query than if that query had
>  >  >  all the zipcodes in a "city,state" combo, thus making your query much
>  >  >  faster.
>  >  >
>  >  >  M
>  >  >
>  >  >  On Thu, May 1, 2008 at 2:15 AM, mark harwood <[EMAIL PROTECTED]>
>  >  >  wrote:
>  >  >
>  >  >
>  >  >
>  >  >  > The issue here is a general one of trying to perform an efficient join
>  >  >  > between an external resource (rdbms) and Lucene.
>  >  >  > This experiment may be of interest:
>  >  >  >    http://issues.apache.org/jira/browse/LUCENE-434
>  >  >  >
>  >  >  > KeyMap.java embodies the core service which translates from lucene doc
>  >  >  > ids to DB primary keys or vice versa.
>  >  >  > There are a couple of implementations of KeyMap that are not optimal
>  >  >  > (they pre-date Lucene's FieldCache) but it may give you food for thought.
>  >  >  >
>  >  >  > Cheers
>  >  >  > Mark
>  >  >  >
>  >  >  >
>  >  >  > ----- Original Message ----
>  >  >  > From: Stephane Nicoll <[EMAIL PROTECTED]>
>  >  >  > To: java-user@lucene.apache.org
>  >  >  > Sent: Thursday, 1 May, 2008 9:00:33 AM
>  >  >  > Subject: hybrid query (lucene + db)
>  >  >  >
>  >  >  > Hi there,
>  >  >  >
>  >  >  > We're using lucene with Hibernate Search and we're very happy so far
>  >  >  > with the performance and the usability of lucene. We have, however, a
>  >  >  > specific use case that prevents us from using only lucene: spatial
>  >  >  > queries. I already sent a mail to this list a while back about the
>  >  >  > problem and we started investigating multiple solutions.
>  >  >  >
>  >  >  > When the user selects a geographic area and some keywords we do the
>  >  >  > following:
>  >  >  >
>  >  >  > * Perform a search on the lucene index for the keywords with a
>  >  >  > projection that returns only the primary key of each element, sorted
>  >  >  > by primary key
>  >  >  > * Perform a search on the database with the other criteria and a
>  >  >  > projection that returns only the primary key of each element
>  >  >  > * Iterate over both lists to find N matching IDs, optionally with paging
>  >  >  > (matches X to X + N, where X is the first result of the page)
>  >  >  > * Run a query on the database to return the actual objects (select a
>  >  >  > from MyClass a where a.id IN (the list of matching IDs) ). We limit
>  >  >  > the page to 1000 results
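
The third step above (iterating over both key lists to find one page of matches) can be sketched as a merge-style intersection. This is a minimal illustration, assuming both lists are sorted ascending by primary key and contain no duplicates; the message only states that the Lucene result is sorted, so the database query is assumed here to use an equivalent ORDER BY. The class name `SortedKeyIntersection`, the method `page` and the use of `long` keys are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch only: pages through the intersection of two ascending key lists. */
final class SortedKeyIntersection {

    /**
     * Returns up to pageSize keys that appear in both arrays, after skipping
     * the first `offset` matches. Both inputs must be sorted ascending and
     * free of duplicates.
     */
    static List<Long> page(long[] luceneKeys, long[] dbKeys, int offset, int pageSize) {
        List<Long> result = new ArrayList<>();
        int i = 0, j = 0, matched = 0;
        // Stop as soon as the page is full, so later keys are never inspected.
        while (i < luceneKeys.length && j < dbKeys.length && result.size() < pageSize) {
            if (luceneKeys[i] < dbKeys[j]) {
                i++;                      // key only in the Lucene result
            } else if (luceneKeys[i] > dbKeys[j]) {
                j++;                      // key only in the database result
            } else {
                if (matched >= offset) {
                    result.add(luceneKeys[i]);
                }
                matched++;
                i++;
                j++;
            }
        }
        return result;
    }
}
```

The walk is linear in the number of keys inspected and stops once the page is full, but both key lists still have to be available up front, which is where the memory cost mentioned in the message comes from.
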
>  >  >  >
>  >  >  > We have looked for a way to optimize the queries and to avoid consuming
>  >  >  > too much memory, knowing that we must support paging.
>  >  >  >
>  >  >  > With a single user, a search by keywords takes 30msec to complete and a
>  >  >  > search by box takes 45msec. With both (keywords + spatial area) it
>  >  >  > takes 300msec.
>  >  >  >
>  >  >  > With 10 concurrent users, a search by keywords takes 150msec/user, but
>  >  >  > for both it takes 3 sec/user !!!
>  >  >  >
>  >  >  > I had the profiler running on this scenario and I've found that *all*
>  >  >  > threads are waiting on org.apache.lucene.index.SegmentReader. I then
>  >  >  > configured Hibernate Search to use a separate index reader per thread.
>  >  >  > The deadlocks disappeared but it's still very slow (2.8sec).
>  >  >  >
>  >  >  > Some questions:
>  >  >  >
>  >  >  > * Does anyone know where the deadlocks on SegmentReader are coming from?
>  >  >  > * Is the sorting on the primary keys a bad idea regarding performance
>  >  >  > and memory usage?
>  >  >  > * Does anyone have an idea for performing this kind of hybrid query in
>  >  >  > an efficient way?
>  >  >  >
>  >  >  > I am using lucene 2.3.1 and Hibernate Search 3.0.1. I already asked for
>  >  >  > support on the Hibernate Search forum but did not get any answer so
>  >  >  > far.
>  >  >  >
>  >  >  > Thanks,
>  >  >  > Stéphane
>  >  >  >
>  >  >  > --
>  >  >  > "Large Systems Suck: This rule is 100% transitive. If you build one,
>  >  >  > you suck" -- S.Yegge
>  >  >  >
>  >  >  > ---------------------------------------------------------------------
>  >  >  > To unsubscribe, e-mail: [EMAIL PROTECTED]
>  >  >  > For additional commands, e-mail: [EMAIL PROTECTED]
>
>  --
>  Marcelo F. Ochoa
>  http://marceloochoa.blogspot.com/
>  http://marcelo.ochoa.googlepages.com/home



-- 
"Large Systems Suck: This rule is 100% transitive. If you build one,
you suck" -- S.Yegge
