That's strange - I did an svn up at the root of the nutch-trunk
directory and merged all of the changes with my code base. I must
have missed the changes to the conf directory when merging as I was
only diffing the src directory.
--Ned
On Nov 6, 2007, at 7:05 PM, Chris Mattmann wrote:
[..snip..]
return type.getName();
}
The NPE was being thrown on the last line, so I did some tracing and
found out that the call to MimeType.clean(typeName) [typeName <-
"text/html"] worked fine, but the next line caused a problem. The
call to this.mimeTypes.getRepository.forName(cleanedMimeType) was
returning null. My problem was that I downloaded the trunk and it
didn't have a MimeUtils anymore so I had no way to trace this.
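For readers following the trace above, here is a minimal sketch of the
failure mode and a null-safe alternative. It is not the Tika or Nutch
API: the class MimeLookupSketch, its in-memory registry, and the
resolve() fallback are all hypothetical stand-ins, and the only
assumption taken from the thread is that a name lookup returns null
when the mime type definitions were never loaded.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for a mime type registry; NOT the Tika/Nutch API.
class MimeLookupSketch {
    // Stays empty when no definitions were loaded, e.g. a missing .xml file.
    private final Map<String, String> registry = new HashMap<>();

    void register(String name) {
        registry.put(name, name);
    }

    // Drops parameters such as "; charset=UTF-8" and lowercases the rest.
    static String clean(String typeName) {
        if (typeName == null) {
            return null;
        }
        int semi = typeName.indexOf(';');
        String base = semi >= 0 ? typeName.substring(0, semi) : typeName;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    // Returns the registered name, or the fallback when the lookup is null.
    // Calling a method directly on the lookup result would throw the NPE.
    String resolve(String typeName, String fallback) {
        String cleaned = clean(typeName);
        String type = cleaned == null ? null : registry.get(cleaned);
        return type != null ? type : fallback;
    }

    public static void main(String[] args) {
        MimeLookupSketch empty = new MimeLookupSketch();
        System.out.println(empty.resolve("text/html", "application/octet-stream"));

        MimeLookupSketch loaded = new MimeLookupSketch();
        loaded.register("text/html");
        System.out.println(loaded.resolve("text/html; charset=UTF-8", "application/octet-stream"));
    }
}
```

A fallback only hides the symptom; an empty registry still points at
definitions that were never loaded.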
Yes, this class was removed as part of NUTCH-562. Its usage was
replaced with the class of the same name within the Tika API, which is
based on the Nutch API for mime types.
Anyway, after an hour or so of banging my head against the wall I
realized the update to Nutch didn't have the correct .xml file
describing mime types in the conf/ directory. Thus, I unzipped the
Tika jar, grabbed the .xml file and changed nutch-default.xml to point
to that xml for mime types and it started working.
This is strange: as part of the patch for NUTCH-562, there was a file
called tika-mimetypes.xml, that was committed to the conf/ folder
within the trunk. Do you not have this file? The nutch-default.xml
file within the conf/ folder in the nutch trunk points to the
tika-mimetypes.xml, so that should have worked. I'm wondering if you
had an old version of the /conf directory and neglected to svn up it?
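One quick way to answer the question above is to check whether the
file is visible at runtime. The sketch below assumes the conf/
directory is on the classpath of the JVM being tested; the class name
ConfResourceCheck and the helper locate() are hypothetical, and only
standard JDK calls are used.

```java
import java.net.URL;

// Hypothetical diagnostic helper; not part of Nutch or Tika.
class ConfResourceCheck {
    // Returns the classpath URL of the named resource, or null if absent.
    static URL locate(String resourceName) {
        ClassLoader cl = Thread.currentThread().getContextClassLoader();
        if (cl == null) {
            cl = ConfResourceCheck.class.getClassLoader();
        }
        return cl.getResource(resourceName);
    }

    public static void main(String[] args) {
        // File name taken from the thread; pass another name as an argument.
        String name = args.length > 0 ? args[0] : "tika-mimetypes.xml";
        URL url = locate(name);
        System.out.println(url != null
                ? "found: " + url
                : "missing from classpath: " + name);
    }
}
```

If the file is reported missing, a stale conf/ directory is the likely
cause.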
Sorry again for being so vague. I'm not sure if I should submit a JIRA
issue for this, but I'm happy to do so if anyone else has seen this
issue.
No problem: let's discuss the JIRA issue once we get an answer to the
above questions.
Thanks for being more descriptive and looking forward to your response.
Cheers,
Chris
Thanks,
Ned
Chris Mattmann wrote:
Hi Ned,
Glad to see you're poking around with the Tika software and its use in
Nutch. To start, you probably want to go to the website for Tika:
http://incubator.apache.org/tika/
On that website, you should see the links to the SVN repository. The
version of Tika that was used was a version that I built the same day
I committed the fix for NUTCH-562:
http://issues.apache.org/jira/browse/NUTCH-562
That appears to be a version of Tika built on October 8th. The API for
the mime framework has changed a bit since then (to its betterment);
however, I neglected to upgrade the Nutch API because of the strong
objection I received from Andrzej and input from Dennis Kubes
regarding the use of the Tika API in Nutch. I stand by the email I
sent in reply to the objections:
http://www.nabble.com/forum/ViewPost.jtp?post=13142174&framed=y
However, out of respect for the other committers, I neglected to make
any updates to the Nutch use of the Tika API since I never heard back
from anyone after my response.
That said, could you be a bit more specific, Ned, as to the exact
problem you're having, e.g., "I tried visiting this site (URL here),
the content type was (content/type here), and then it got into
Content.java, and on line XXX it seems that the MimeType is getting
set to null when it tries to...". With that info, I could probably
help you quite a bit more. Also, depending upon how the rest of the
Nutch committers want to handle the use of Tika (revert and remain
stagnant, or use Tika and leverage the updates we're making to the
Mime framework there), then we could come up with a strategy to help
you out with the issue you're having.
Thanks!
Cheers,
Chris
On 11/6/07 3:47 PM, "Ned Rockson" <[EMAIL PROTECTED]> wrote:
I think there may be a bug in the Content.java when it tries to
convert the textual representation of the type to a MimeType. It
always returns null. I'm trying to fix it but I can't find an API for
Tika (or even src). Can someone point me in the right direction?
Thanks,
Ned
______________________________________________
Chris Mattmann, Ph.D.
[EMAIL PROTECTED]
Cognizant Development Engineer
Early Detection Research Network Project
_________________________________________________
Jet Propulsion Laboratory Pasadena, CA
Office: 171-266B Mailstop: 171-246
_______________________________________________________
Disclaimer: The opinions presented within are my own and do not
reflect those of either NASA, JPL, or the California Institute of
Technology.