Hi All,

We just wanted to flush some of our discussions on Nutch 2.0 out to the public mailing lists. A few key points:
1. Doğacan, Julien and Enis are proposing to base Nutch 2.0 on GORA, a technology that can be viewed here [1]. The upshot of GORA is that it's an ORM layer that abstracts away backend stores like Cassandra or SQL or any other database (though Gora works better with NoSQL stores). A couple of questions I've got:

   a) is GORA ASL licensed?
   b) what's the maintenance plan for GORA? Will it continue to live in Github? Will you guys propose it into the Apache Incubator as an ASF project?

2. Development on Nutchbase has occurred at Github [2] since Doğacan originally checked it into Nutch SVN under a branch [3] at the ASF. I expressed some concerns about this since it's hard to review huge sweeping patches and since a lot of development has occurred off of the public Apache mailing lists. Specifically I asked Doğacan et al. to enumerate a list of the changes in the Git version of Nutchbase [2] versus the ASF version [3]. We then need to come up with a plan of how to merge the 2 and get the latest into ASF SVN. Doğacan estimates the difference between the Nutchbase branch at the ASF [3] and that of Github [2] to be ~25 hrs of work.

Doğacan generated this list of major changes that have happened at Github and not at Apache:

---snip
1) Porting nutchbase to GORA: This was discussed in issues NUTCH-808 and NUTCH-811.
2) Using ivy in nutch: NUTCH-821 and NUTCH-825
3) Removal of nutch's custom developed search code (and using SOLR instead). IIRC, this was also discussed and accepted by nutch community. However, if not, we can simply put this code back (since this was a trivial delete).
---snip

So, that really brings everyone up to speed I think.

So, that said, I am +1 for moving forward on #1 above, provided we address the 2 questions I listed (a+b). We need to understand it from a Nutch perspective.

As for #2, we can rectify it by doing the following things:

(a) svn copy NutchBase from GitHub to the nutchbase branch in http://svn.apache.org/repos/asf/nutch/branches/nutchbase bringing the ASF branch up to date.
(b) Once the GORA license issues are figured out (they must be compatible with the ASF or we cannot use it), then we update Nutch to depend on the GORA jars via Ivy?
(c) svn tag current Nutch trunk as 1.2-branch
(d) svn merge nutchbase branch with nutch trunk
(e) roll the version # in nutch trunk to 2.0-dev
(f) all issues in JIRA should be updated to reflect 2.0-dev fixes where it makes sense
(g) a 2.1 version is added to mark anything that we don't want in 2.0 and we file post 2.0 issues there
(h) Nutch 2.0 trunk is fixed, and brought up to speed and old code is removed. All unit tests should pass regression where it makes sense.
(i) Nutch documentation is brought up to date on wiki and checked into SVN
(j) We roll a 2.0 release

That sound good to everyone?

Cheers,
Chris

[1] http://github.com/enis/gora
[2] http://github.com/dogacan/nutchbase
[3] http://svn.apache.org/repos/asf/nutch/branches/nutchbase/

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.mattm...@jpl.nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++