Re: PyLucene use JCC shared object by default

Thomas Koch Mon, 23 Apr 2012 09:38:03 -0700

Dear Caleb,

Thanks for the sample - as usual, code says more than a thousand words ;-)


The API really looks MUCH simpler than the current Lucene facet API (which, as
far as I can tell from my first steps, is quite complex).

> With initial tests, the algorithm is about 100 times faster in C++ than when 
> implemented in Python.

Wow, that’s a nice factor you gain! Did you also compare it to the "standard" 
Lucene facet approach?

The main difference I can observe so far is that Lucene facet search allows you to

a) define a kind of "category hierarchy" and search/count within this tree, 
e.g. for your example you would have
  'Helmets/Type/Full Face'
  'Helmets/Type/Open Face'
  ...
  'Helmets/User/Adult'
  'Helmets/User/Youth'
  etc.

This is done mainly via CategoryPath; see the example code I just posted on 
the list (though I assume you're familiar with that approach).
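
To make the idea in a) concrete, here is a minimal sketch in plain Python. It
does not use the lucene.facet API; the function name count_category_paths and
the sample documents are made up for illustration only. Each document is
counted once for every node on its category paths, including the ancestors.

```python
from collections import Counter


def count_category_paths(doc_paths):
    """Count documents per category node, including every ancestor node."""
    counts = Counter()
    for paths in doc_paths:
        nodes = set()
        for path in paths:
            parts = path.split("/")
            # Add the path itself and each of its ancestors.
            for depth in range(1, len(parts) + 1):
                nodes.add("/".join(parts[:depth]))
        # A document contributes at most once to each node.
        counts.update(nodes)
    return counts


# Hypothetical documents, each tagged with '/'-separated category paths.
docs = [
    ["Helmets/Type/Full Face", "Helmets/User/Adult"],
    ["Helmets/Type/Open Face", "Helmets/User/Adult"],
    ["Helmets/Type/Full Face", "Helmets/User/Youth"],
]
counts = count_category_paths(docs)
print(counts["Helmets/Type/Full Face"], counts["Helmets/User"])
```

With a flat "/Field/Value" scheme only the two-level nodes would exist; the
hierarchy adds the intermediate counts such as 'Helmets/Type'.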
  
b) 'drilldown' - i.e. re-run a search with the same query but restrict it to 
some facet/category of interest

Or is this also provided by your API?
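
What I mean by drilldown in b) could be sketched roughly like this, again in
plain Python and without any Lucene classes. The search function, the
predicate and the catalog entries are hypothetical; the point is only that the
same query is re-run with an additional category restriction.

```python
def search(docs, predicate, drilldown=None):
    """Return ids of docs matching predicate, optionally limited to a category."""
    hits = []
    for doc_id, doc in enumerate(docs):
        if not predicate(doc):
            continue
        if drilldown is not None and not any(
            # Match the category itself or anything below it in the tree.
            path == drilldown or path.startswith(drilldown + "/")
            for path in doc["paths"]
        ):
            continue
        hits.append(doc_id)
    return hits


# Hypothetical catalog entries with '/'-separated category paths.
catalog = [
    {"title": "Street helmet",
     "paths": ["Helmets/Type/Full Face", "Helmets/User/Adult"]},
    {"title": "Cruiser helmet",
     "paths": ["Helmets/Type/Open Face", "Helmets/User/Adult"]},
    {"title": "Junior helmet",
     "paths": ["Helmets/Type/Full Face", "Helmets/User/Youth"]},
]


def is_helmet(doc):
    return "helmet" in doc["title"].lower()


all_hits = search(catalog, is_helmet)
adult_hits = search(catalog, is_helmet, drilldown="Helmets/User/Adult")
print(all_hits, adult_hits)
```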


best regards

Thomas 
--
OrbiTeam Software GmbH & Co. KG, Germany
http://www.orbiteam.de



> -----Original Message-----
> From: Caleb Burns [mailto:[email protected]]
> Sent: Wednesday, 18 April 2012 22:17
> To: [email protected]
> Subject: Re: PyLucene use JCC shared object by default
> 
> Hi Thomas,
> 
> Our primary motivation was performance; a "pythonic" API was secondary.
> Our needs were simpler than the complexity of the whole lucene.facet
> package. On the Lucene side of things, it looks like we have something similar
> to CategoryPath (statically 2 deep: "/Field/Value") and FacetRequest (only
> allows searching at the root level, optionally restricted to a filtered set
> of docs and fields).
> Specifically, we implemented an index/cache of all documents and their
> terms. As far as I know, SOLR uses caching of the Lucene index to perform
> faceting.
> 
> Our implementation is based on
> http://lucene.apache.org/solr/api/org/apache/solr/request/UnInvertedField.html
> and the interface in Python is almost identical. You pass our object an
> IndexReader and by default all Terms with TermVectors are indexed. You can
> then selectively retrieve fields. Here's an example of use:
> http://pastebin.com/Lq3LZKMp. The whole module is ~2000 lines (python
> interface, C++ implementation, comments). With initial tests, the algorithm is
> about 100 times faster in C++ than when implemented in Python.
> 
> On Wed, Apr 18, 2012 at 9:31 AM, Thomas Koch <[email protected]> wrote:
> 
> > Hi,
> > sounds like an interesting project – may I ask what you actually
> > implemented and what’s the motivation (e.g. performance?)?
> >
> > I’ve started to experiment with the Facet support in Lucene (actually
> > in PyLucene – ported an example to Python) and found that facetted
> > search support in Lucene looks powerful (though API is still said to
> > be ‘experimental’ and I can’t say anything about performance yet).
> > I’m talking about the org.apache.lucene.facet.* packages – part of the
> > contrib part of Lucene and available as JARs that are accessible in
> > PyLucene as well.
> > I’m not that familiar with Solr but AFAIK it’s based on Lucene (Java)
> > and should (hopefully) use the same Java code for its facet search
> > support. Of course Solr adds some nice configuration support and web
> > GUI to Lucene, but the ‘core’ search is built on Lucene (to my
> > knowledge). So did you re-implement the Lucene facet search/index code
> > (like TaxonomyReader/Writer, FacetRequest stuff etc.) in C++ or what
> > part of Solr??
> >
> > Regarding Facet support in PyLucene I can share the samples I’ve ‘ported’
> > to Python so far. There’s still a patch pending for JavaList (required
> > by facet features) which I will come back to later on this list (still some
> > open issues). Hopefully this can be included in the PyLucene 3.6
> > version …
> >
> > Regards
> > Thomas
> > --
> > OrbiTeam Software GmbH & Co. KG
> > Germany  http://www.orbiteam.de
> >
> >
> > From: Caleb Burns [mailto:[email protected]]
> > Sent: Tuesday, 17 April 2012 21:16
> > To: [email protected]
> > Subject: PyLucene use JCC shared object by default
> >
> > Hi,
> >
> > I've finished the process at my organization of re-implementing SOLR's
> > faceting algorithm (in C++).
> >
> > We would like the public at large to have access to the work we've
> > done and plan to do. In order for this to be a real possibility the
> > code needs to be built against and use the same JVM as the PyLucene
> > installation does.
> > We feel the most logical way to accomplish this is to have
> > PyLucene's default installation use JCC as a shared object.
> >
> > We have yet more plans to extend and provide utilities that work with
> > PyLucene, but this all hinges on having the shared object. The only
> > alternative methodology would require the bundling of our source with
> > the PyLucene project itself as a fork.
> >
> > We are eager to start open sourcing our work, so please let us know
> > what would be the best way to integrate our work.
> >
> 
> 
> 
> --
> Caleb Burns
> Developer | Riders Discount
