Re: new branch 1.4 and possible features

lewis john mcgibbney Mon, 13 Jun 2011 04:40:21 -0700

On Fri, Jun 10, 2011 at 12:11 PM, Markus Jelsma
<markus.jel...@openindex.io> wrote:


>
> > Guys,
> >
> > I added a new label 1.4 on the JIRA. Shall we create a new branch 1.4 on
> > SVN from the existing 1.3? I agree that it is a pain to have to maintain
> > 1.x AND trunk in parallel but my feeling is that 2.0 needs more work
> > before being completely reliable and in the meantime we might want to add
> > new features to the stable 1.x branch.
>
> Agreed.
>

Yes indeed. I see that Gora is still in incubation, and I have not been using
trunk for some time as it has been broken due to Gora dependencies. I think
this suggestion is the only sensible way to continue. As I have not been
using trunk, what is the current situation with it? Do we have some kind of
time-scale for fixing Gora, which would then enable Nutch trunk to build?

>
> >
> > One possible feature would be to add a new endpoint for indexing-backends
> > and make the indexing pluggable. At the moment we are hardwired to SOLR -
> > which is OK - but as other resources like ElasticSearch are becoming more
> > popular it would be better to handle this as plugins. Not sure about the
> > name of the endpoint though: we already have indexing-plugins (which are
> > about generating fields sent to the backends) and moreover the backends
> > are not necessarily for indexing / searching but could be just an external
> > storage e.g. CouchDB. The term backend on its own would be confusing in
> > 2.0 as this could be pertaining to the storage in GORA. 'indexing-backend'
> > is the best name that came to my mind so far - please suggest better ones.
>
> Yes, I'd like to see this `renamed` as well. It makes perfect sense to have
> a plugin to `index` to CouchDB as well as send the stuff to Solr and ES. I'm
> unsure how to name this. Indexing has become a bit ambiguous since 1.3.
>

This is true. From what I can see, we now have attractive alternatives to
simply using the Solr backend. Would it be reasonable to agree on the
preferred/obvious options for backend storage or indexing before we move on
to naming the plugins? How many backend options do we wish to adopt within
Nutch? Which of these are viable for Nutch integration?
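
To make the question a bit more concrete, here is a rough sketch of what such
an extension point could look like. It is plain Python rather than Nutch code,
and every name in it (IndexingBackend, InMemoryBackend, BackendRegistry) is
made up for illustration; none of it is an existing Nutch API. The only point
is that the job writes to whichever backends are enabled, without knowing
whether they are Solr, ES or CouchDB.

```python
from abc import ABC, abstractmethod
from typing import Dict, List

Doc = Dict[str, str]


class IndexingBackend(ABC):
    """Hypothetical extension point: one implementation per backend."""

    @abstractmethod
    def write(self, doc: Doc) -> None:
        """Buffer a single document for this backend."""

    @abstractmethod
    def commit(self) -> None:
        """Make all buffered documents visible in the backend."""


class InMemoryBackend(IndexingBackend):
    """Stand-in for a real backend plugin; it only keeps documents in lists."""

    def __init__(self) -> None:
        self.pending: List[Doc] = []
        self.committed: List[Doc] = []

    def write(self, doc: Doc) -> None:
        self.pending.append(doc)

    def commit(self) -> None:
        self.committed.extend(self.pending)
        self.pending.clear()


class BackendRegistry:
    """Maps a plugin id to a backend and fans documents out to enabled ones."""

    def __init__(self) -> None:
        self._backends: Dict[str, IndexingBackend] = {}

    def register(self, backend_id: str, backend: IndexingBackend) -> None:
        self._backends[backend_id] = backend

    def dispatch(self, docs: List[Doc], enabled: List[str]) -> None:
        # Unknown ids raise KeyError, so a misconfigured plugin list fails loudly.
        targets = [self._backends[backend_id] for backend_id in enabled]
        for backend in targets:
            for doc in docs:
                backend.write(doc)
            backend.commit()
```

With something along these lines, the list of enabled ids would play the role
of the plugin configuration, and the naming question reduces to what we call
the interface.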

>
> >
> > For 1.4 (and 2.0) it would be good to improve the detection of duplicates
> > so that it detects them using mapreduce on the crawldb instead of pulling
> > the docs from SOLR.
>
> Yes, I remember a ticket for deduplicating locally (or was it mentioned in
> the 404 cleaner?). Anyway, this is really desired as it can put a lot of
> strain on the Solr index, especially if it is also a query/slave node.
>
> I think we should come up with generic map/reduce jobs for indexing,
> deduplicating and cleaning and maybe add a Nutch extension point there so we
> can easily hook up indexing, cleaning and deduplicating for various ...
> end-points?
>
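
For what it is worth, the shape of a local deduplication job can be sketched
in a few lines. This is a toy, in-memory imitation of the map, shuffle and
reduce steps in Python, not Hadoop or Nutch code. The Entry record and its
fields are hypothetical stand-ins for crawldb entries, and the rule for which
copy to keep (highest score, then most recent fetch) is only an assumption.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, NamedTuple, Tuple


class Entry(NamedTuple):
    """Hypothetical stand-in for a crawldb record."""
    url: str
    signature: str   # digest of the page content
    score: float
    fetch_time: int


def map_phase(entries: Iterable[Entry]) -> Iterator[Tuple[str, Entry]]:
    # Key every entry by its content signature so identical pages meet in reduce.
    for entry in entries:
        yield entry.signature, entry


def shuffle(pairs: Iterable[Tuple[str, Entry]]) -> Dict[str, List[Entry]]:
    # Group values by key, which the framework would normally do for us.
    groups: Dict[str, List[Entry]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(group: List[Entry]) -> List[str]:
    # Assumed rule: keep the highest score, break ties on the latest fetch time.
    keep = max(group, key=lambda e: (e.score, e.fetch_time))
    return [e.url for e in group if e is not keep]


def find_duplicates(entries: Iterable[Entry]) -> List[str]:
    """Return the URLs that should be deleted from the index, sorted."""
    duplicates: List[str] = []
    for group in shuffle(map_phase(entries)).values():
        duplicates.extend(reduce_phase(group))
    return sorted(duplicates)
```

The output is just a list of URLs to delete, so the only traffic sent to the
index would be the delete requests themselves rather than a scan of every
document.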
> >
> > Let's just add to the wishlist on JIRA with the tag 1.4. Is everybody
> > happy with having a new branch 1.4?
>
> I'm not everybody but +1 anyway ;)
>

Just checked out 1.4 (thanks Julien), so I will be thinking of various
additions to those highlighted above. I'll have more suggestions once my ICLA
is approved, my commit status is confirmed and I have more rights on JIRA.


> >
> > Jul
>



-- 
*Lewis*
