On Fri, Jun 10, 2011 at 12:11 PM, Markus Jelsma <markus.jel...@openindex.io> wrote:

> > Guys,
> >
> > I added a new label 1.4 on the JIRA. Shall we create a new branch 1.4 on
> > SVN from the existing 1.3? I agree that it is a pain to have to maintain
> > 1.x AND trunk in parallel but my feeling is that 2.0 needs more work
> > before being completely reliable and in the meantime we might want to add
> > new features to the stable 1.x branch.
>
> Agreed.

Yes indeed. I see that Gora is still in incubation, and I have not been using trunk for some time as it has been broken due to Gora dependencies. I think this suggestion is the only sensible way to continue. As I have not been using trunk, what is its current state? Do we have a time-scale for fixing Gora, which would then enable Nutch trunk to build?

> > One possible feature would be to add a new endpoint for indexing-backends
> > and make the indexing pluggable. At the moment we are hardwired to SOLR -
> > which is OK - but as other resources like ElasticSearch are becoming more
> > popular it would be better to handle this as plugins. Not sure about the
> > name of the endpoint though: we already have indexing-plugins (which are
> > about generating fields sent to the backends) and moreover the backends
> > are not necessarily for indexing / searching but could be just an external
> > storage e.g. CouchDB. The term backend on its own would be confusing in
> > 2.0 as this could be pertaining to the storage in GORA. 'indexing-backend'
> > is the best name that came to my mind so far - please suggest better ones.
>
> Yes, I'd like to see this `renamed` as well. It makes perfect sense to have
> a plugin to `index` to CouchDB as well as send the stuff to Solr and ES. I'm
> unsure how to name this. Indexing has become a bit ambiguous since 1.3.

This is true. At this stage, from what I can see, we now have attractive alternatives to simply using the Solr backend.
Would it be reasonable to confirm some preferred/obvious options for backend storage or indexing before moving on to naming potential plugins? How many backend options are there that we wish to adopt within Nutch? Which of these are viable for Nutch integration?

> > For 1.4 (and 2.0) it would be good to improve the detection of duplicates
> > so that it detects them using mapreduce on the crawldb instead of pulling
> > the docs from SOLR.
>
> Yes, I remember a ticket for deduplicating locally (or was it mentioned in
> the 404 cleaner). Anyway, this is really desired as it can put a lot of
> strain on the Solr index, especially if it is also a query/slave node.
>
> I think we should come up with generic map/reduce jobs for indexing,
> deduplicating and cleaning and maybe add a Nutch extension point there so
> we can easily hook up indexing, cleaning and deduplicating for various ...
> endpoints?
>
> > Let's just add to the wishlist on JIRA with the tag 1.4. Is everybody
> > happy with having a new branch 1.4?
>
> I'm not everybody but +1 anyway ;)

Just checked out 1.4 (thanks Julien) so will be thinking of various additions to those highlighted above. I'll have more suggestions once my ICLA is approved, my commit status is confirmed and I have more rights on JIRA.

> > Jul

--
*Lewis*