Re: [jira] Closed: (NUTCH-562) Port mime type framework to use Tika mime detection framework

Chris Mattmann Wed, 10 Oct 2007 11:50:33 -0700

Hi Guys,

>> 
>> I vote for reverting this patch, unless there is an overall consensus
>> among Nutch developers that it's ok to keep it as it is - on one hand
>> considering the added functionality and simplification of Nutch code,
>> and on the other hand considering the (lack of) maturity of Tika.
> 
> I agree with Andrzej here. I would have waited a bit more before rushing
> into this. Because at this point (where no Tika releases have been made)
> it might (even though it does not look like it right now) even be
> possible that the project will be retired without any releases at all.


I'm not out to beat a dead horse here, but the thought comes to mind:
what about the vitality of the code as it exists within the Nutch code base?
When was the last time anybody at all worked on the mime system? It was
pioneered by Jerome, but he's been largely inactive as a committer for more
than a year now, and it doesn't look like that's going to change.

I ported what was largely Nutch's mime system, with Jerome's improvements,
to Tika, where the code is actively being developed by me (and vetted by the
other *active* members of the team) -- in contrast to Nutch. As a developer,
I don't want to maintain the code in both places, but I'm willing to
maintain the Nutch use of and interface to Tika, which means that Nutch will
inherit the benefits of this approach. Having been a member of the Nutch
community for almost 2 years now, I can't tell you how many times people
have asked for Nutch to be able to reliably detect XML content. This is
reified in the form of a number of different JIRA issues that reference that
deficiency and that are, for all intents and purposes, not being worked on
at all.

I'm all for following the process, and so forth, but at the same time, I
think the Nutch community needs to take a serious look at itself with
regard to the "sacred" nature of the trunk, which we currently treat with a
large amount of sensitivity. On other projects (and of course I'm biased,
but I use my own work as an example, and also something like Tika), the
trunk is not expected to be "always working"; it is regularly treated as a
place where bugs can exist, and where they can be fixed before a release is
made. That's not the way it feels to me on this project, and quite honestly
I think it stymies progress.

Finally, there is precedent for what I did with the Tika patch making its
way into Nutch. If I recall, something very similar happened when Hadoop
came along: NDFS (as it was called at the time) and MapReduce made their
way into an external library, and Nutch was made to rely on that (at the
time) in-development library. This made sense, because the folks working on
Hadoop were actively working on updates to the portion of the code that
Nutch relied upon, and all the developers that were interested in that
portion of the code started developing in that arena. I'm not comparing
Hadoop to Tika, but certainly there are some similarities here.

-Chris


______________________________________________
Chris Mattmann, Ph.D.
[EMAIL PROTECTED]
Cognizant Development Engineer
Early Detection Research Network Project

_________________________________________________
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                     Mailstop:  171-246
_______________________________________________________

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.
