Doğacan Güney wrote:
On Fri, Jul 17, 2009 at 21:32, Andrzej Bialecki<a...@getopt.org> wrote:
Doğacan Güney wrote:
Hey list,

On Fri, Jul 17, 2009 at 16:55, Andrzej Bialecki<a...@getopt.org> wrote:
Hi all,

I think we should be creating a sandbox area, where we can collaborate
on various subprojects, such as HBase, OSGI, Tika parsers, etc. Dogacan
will be importing his HBase work as 'nutchbase'. Tika work is the least
disruptive, so it could occur even on trunk. OSGI plugins work (which I'd
like to tackle) means significant refactoring so I'd rather put this on a
branch too.

Thanks for starting the discussion, Andrzej.

Can you detail your OSGI plugin framework design? Maybe I missed the
discussion, but updating the plugin system has been something that I
wanted to do for a long time :) so I am very much interested in your
design.
There's no specific design yet except I can't stand the existing plugin
framework anymore ... ;) I started reading on OSGI and it seems that it
supports the functionality that we need, and much more - it certainly looks
like a better alternative than maintaining our plugin system beyond 1.x ...

I think I remember a conversation a while back about this :) Not OSGI specifically but changing the plugin framework. I am all for changing it to something like OSGI though.

Dennis



Couldn't agree more with the "can't stand plugin framework" :D

Any good links on OSGI stuff?

Oh, an additional comment about the scoring API: I don't think the claimed
benefits of OPIC outweigh the widespread complications that it caused in the
API. Besides, getting the static scoring right is very very tricky, so from
the engineer's point of view IMHO it's better to do the computation offline,
where you have more control over the process and can easily re-run the
computation, rather than rely on an online unstable algorithm that modifies
scores in place ...


Yeah, I am convinced :) . I am not done yet, but I think OPIC-like scoring will
feel very natural in a hbase-backed nutch. Give me a couple more days to polish
the scoring API then we can change it if you are not happy with it.

Dogacan, you mentioned that you would like to work on Katta integration.
Could you shed some light on how this fits with the abstract indexing &
searching layer that we now have, and how distributed Solr fits into this
picture?

I haven't yet given much thought to Katta integration. But basically,
I am thinking of indexing newly-crawled documents as lucene shards and
uploading them
to katta for searching. This should be very possible with the new
indexing system. But so far, I have neither studied katta too much nor
given much thought to integration. So I may be missing obvious stuff.
Me too..

About distributed solr: I would very much like to do this and again, I
think this should be possible to do within nutch. However, distributed
solr is ultimately uninteresting
to me because (AFAIK) it doesn't have the reliability and
high-availability that hadoop&hbase have, i.e. if a machine dies you
lose that part of the index.
Grant Ingersoll is doing some initial work on integrating distributed Solr
and Zookeeper. Once this is in a usable shape, I think it will perhaps be
more or less equivalent to Katta. I have a patch in my queue that adds direct
Hadoop->Solr indexing, using Hadoop OutputFormat. So there will be many
options to push index updates to distributed indexes. We just need to offer
the right API to implement the integration, and the current API is IMHO
quite close.

Are there any projects going on that are live indexing systems like
solr, yet are backed up by hadoop HDFS like katta?
There is the Bailey.sf.net project that fits this description, but it's
dormant - either it was too early, or there were just too many design
questions (or simply the committers moved to other things).


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




