Dear Caleb,

Thanks for the sample - as usual, code says more than a thousand words :-)
The API really looks MUCH simpler than the current Lucene facet API (which, as far as I can tell from my first steps, is quite complex).

> With initial tests, the algorithm is about 100 times faster in C++ than when
> implemented in Python.

Wow, that's a nice factor you gain! Did you also compare it to the "standard" Lucene facet approach?

The main difference I can observe so far is that Lucene facet search allows you to

a) define a kind of "category hierarchy" and search/count within this tree, e.g. regarding your example, have
   'Helmets/Type/Full Face'
   'Helmets/Type/Open Face'
   ...
   'Helmets/User/Adult'
   'Helmets/User/Youth'
   etc.
   This is done mainly via the CategoryPath - see the example code I just posted on the list (though I assume you're familiar with that approach).

b) 'drilldown' - i.e. re-run a search with the same query but restrict it to some facet/category of interest

Or is this also provided by your API?

Best regards
Thomas
--
OrbiTeam Software GmbH & Co. KG, Germany
http://www.orbiteam.de

> -----Original Message-----
> From: Caleb Burns [mailto:[email protected]]
> Sent: Wednesday, 18 April 2012 22:17
> To: [email protected]
> Subject: Re: PyLucene use JCC shared object by default
>
> Hi Thomas,
>
> Our primary motivation was performance and secondary was a "pythonic" API.
> Our needs were simpler than the complexity of the whole lucene.facet
> package. On the Lucene side of things, it looks like we have something similar
> to CategoryPath (statically 2 deep: "/Field/Value") and FacetRequest (only
> allows searching at root level, optionally only on a filtered doc set and
> fields).
> Specifically, we implemented an index/cache of all documents and their
> terms. As far as I know, SOLR uses caching of the Lucene index to perform
> faceting.
>
> Our implementation is based on
> http://lucene.apache.org/solr/api/org/apache/solr/request/UnInvertedField.html
> and the interface in Python is almost identical.
> You pass our object an IndexReader and by default all Terms with
> TermVectors are indexed. You can then selectively retrieve fields.
> Here's an example of use: http://pastebin.com/Lq3LZKMp. The whole module is
> ~2000 lines (Python interface, C++ implementation, comments). With initial
> tests, the algorithm is about 100 times faster in C++ than when implemented
> in Python.
>
> On Wed, Apr 18, 2012 at 9:31 AM, Thomas Koch <[email protected]> wrote:
>
> > Hi,
> >
> > sounds like an interesting project - may I ask what you actually
> > implemented and what's the motivation (e.g. performance)?
> >
> > I've started to experiment with the facet support in Lucene (actually
> > in PyLucene - I ported an example to Python) and found that faceted
> > search support in Lucene looks powerful (though the API is still said to
> > be 'experimental' and I can't say anything about performance yet).
> > I'm talking about the org.apache.lucene.facet.* packages - part of the
> > contrib part of Lucene and available as JARs that are accessible in
> > PyLucene as well.
> >
> > I'm not that familiar with Solr but AFAIK it's based on Lucene (Java)
> > and should (hopefully) use the same Java code for its facet search
> > support. Of course Solr adds some nice configuration support and web
> > GUI to Lucene, but the 'core' search is built on Lucene (to my
> > knowledge). So did you re-implement the Lucene facet search/index code
> > (like TaxonomyReader/Writer, FacetRequest stuff etc.) in C++, or what
> > part of Solr?
> >
> > Regarding facet support in PyLucene, I can share the samples I've 'ported'
> > to Python so far. There's still a patch pending for JavaList (required
> > by facet features) which I'll come back to later on this list (still some
> > open issues). Hopefully this can be included in the PyLucene 3.6
> > version ...
> >
> > Regards
> > Thomas
> > --
> > OrbiTeam Software GmbH & Co. KG
> > Germany http://www.orbiteam.de
> >
> >
> > From: Caleb Burns [mailto:[email protected]]
> > Sent: Tuesday, 17 April 2012 21:16
> > To: [email protected]
> > Subject: PyLucene use JCC shared object by default
> >
> > Hi,
> >
> > I've finished the process at my organization of re-implementing SOLR's
> > faceting algorithm (in C++).
> >
> > We would like the public at large to have access to the work we've
> > done and plan to do. In order for this to be a real possibility, the
> > code needs to be built against and use the same JVM as the PyLucene
> > installation does.
> > The most logical way we feel to have this accomplished is by having
> > PyLucene's default installation use JCC as a shared object.
> >
> > We have yet more plans to extend and provide utilities that work with
> > PyLucene, but this all hinges on having the shared object. The only
> > alternative methodology would require the bundling of our source with
> > the PyLucene project itself as a fork.
> >
> > We are eager to start open sourcing our work, so please let us know
> > what would be the best way to integrate our work.
> >
> >
>
> --
> Caleb Burns
> Developer | Riders Discount
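
A minimal sketch of the two ideas discussed in this thread: counting statically 2-deep "Field/Value" facets over a cached document-to-terms mapping (optionally only on a filtered doc set and selected fields), and a 'drilldown' that narrows a result set to one facet value. It is plain Python with no Lucene dependency. It is not Caleb's module (his own example is at the pastebin link) and it is not the lucene.facet API; the class `FacetIndex` and its methods `add`, `count` and `drilldown` are hypothetical names chosen for illustration, and the sample documents are made up from the helmet values mentioned in the thread.

```python
from collections import Counter


class FacetIndex:
    """Toy un-inverted cache: doc id -> {field: set of values}. Hypothetical."""

    def __init__(self):
        self._docs = {}

    def add(self, doc_id, **fields):
        # A field value may be a single string or an iterable of strings.
        self._docs[doc_id] = {
            name: {vals} if isinstance(vals, str) else set(vals)
            for name, vals in fields.items()
        }

    def count(self, doc_ids=None, fields=None):
        """Count 'Field/Value' paths over all docs or a filtered doc set."""
        ids = self._docs.keys() if doc_ids is None else doc_ids
        counts = Counter()
        for doc_id in ids:
            for name, values in self._docs[doc_id].items():
                if fields is not None and name not in fields:
                    continue
                for value in values:
                    counts[f"{name}/{value}"] += 1
        return counts

    def drilldown(self, doc_ids, field, value):
        """Keep only the docs that carry the chosen facet value."""
        return [d for d in doc_ids if value in self._docs[d].get(field, ())]


# Made-up sample data based on the helmet example.
index = FacetIndex()
index.add(1, Type="Full Face", User="Adult")
index.add(2, Type="Open Face", User="Adult")
index.add(3, Type="Full Face", User="Youth")

hits = [1, 2, 3]                        # stand-in for a query's result set
counts = index.count(hits)              # facet counts for the whole result set
adult_hits = index.drilldown(hits, "User", "Adult")
adult_counts = index.count(adult_hits, fields={"Type"})
```

A deeper hierarchy such as 'Helmets/Type/Full Face' would need path prefixes rather than a flat field name, which is the part the CategoryPath approach covers and this sketch deliberately leaves out.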
