Chris Mattmann wrote:
Hi Ned,
Glad to see you're poking around with the Tika software and its use in
Nutch. To start, you probably want to go to the website for Tika:
http://incubator.apache.org/tika/
On that website, you should see the links to the SVN repository. The
version of Tika tha
[..snip..]
> return type.getName();
> }
>
>
> The NPE was being thrown on the last line, so I did some tracing and
> found out that the call to MimeType.clean(typeName) [typeName <-
> "text/html"] worked fine, but the next line caused a problem. The
> this.mimeTypes.getRepository.forName(c
Sorry about being so vague. I was trying to run Fetcher (and Fetcher2
as well) and noticed that every fetched page was throwing an NPE.
Essentially, the NPE was coming from Content.java that referenced
HttpResponse and the line that was causing the problem had a call to
MimeUtils.getRepositor
Hi Ned,
Glad to see you're poking around with the Tika software and its use in
Nutch. To start, you probably want to go to the website for Tika:
http://incubator.apache.org/tika/
On that website, you should see the links to the SVN repository. The
version of Tika that was used was a version t
Hi,
I'm wondering which does a better job: the MD5 or the TextProfile signature.
From what I gather from the API docs, if a page has content, MD5 computes its
signature over the raw binary content of the page, while TextProfile computes
it over the page's plain-text profile. I believe the calculated values are used to delete
d
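To make the distinction in the question above concrete, here is a minimal sketch, not Nutch's actual implementation. The class SignatureSketch and its methods are hypothetical names, and the "text profile" here is a simplified stand-in (strip tags, lowercase, collapse whitespace); the real TextProfile signature may well be more involved. It only illustrates why a byte-level hash and a text-level hash can disagree about whether two pages are duplicates.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;

public class SignatureSketch {

    // Hex-encoded MD5 digest of the given bytes.
    public static String md5Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(data)) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Signature over the raw page bytes: any markup change alters it.
    public static String rawSignature(String page) {
        return md5Hex(page.getBytes(StandardCharsets.UTF_8));
    }

    // Assumption: a crude text profile (tags stripped, lowercased,
    // whitespace collapsed), hashed with MD5.
    public static String textSignature(String page) {
        String text = page.replaceAll("<[^>]*>", " ")
                          .toLowerCase(Locale.ROOT)
                          .replaceAll("\\s+", " ")
                          .trim();
        return md5Hex(text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        String a = "<html><body><p>Hello Nutch</p></body></html>";
        String b = "<html><body><div>hello   nutch</div></body></html>";
        // Same visible text, different markup.
        System.out.println("raw equal:  " + rawSignature(a).equals(rawSignature(b)));
        System.out.println("text equal: " + textSignature(a).equals(textSignature(b)));
    }
}
```

In this sketch the two pages differ only in markup and spacing, so the raw signatures differ while the text signatures match. Which behaviour is "better" depends on whether near-identical pages should count as duplicates.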
I think there may be a bug in Content.java when it tries to convert
the textual representation of the type to a MimeType: it always returns
null. I'm trying to fix it, but I can't find API documentation for Tika (or
even the source). Can someone point me in the right direction?
Thanks,
Ned
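As an illustration of the failure mode described above (a type lookup that returns null and is then dereferenced), here is a minimal defensive sketch. It does not use the Tika or Nutch APIs: MimeLookupSketch, its registry map, clean(), and resolve() are all hypothetical stand-ins, and the fallback to the cleaned type name is an assumption about what a reasonable fix might look like, not the fix that was actually applied.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class MimeLookupSketch {

    // Generic fallback when no type name is available at all.
    public static final String DEFAULT = "application/octet-stream";

    // Hypothetical stand-in for a MIME type repository (not the Tika API).
    private final Map<String, String> registry = new HashMap<>();

    public MimeLookupSketch() {
        // Deliberately sparse: "text/html" is NOT registered, mimicking
        // a lookup that comes back null for a perfectly valid type name.
        registry.put("text/plain", "text/plain");
    }

    // Strip parameters such as "; charset=UTF-8" and normalize case.
    public static String clean(String typeName) {
        if (typeName == null) {
            return null;
        }
        int semi = typeName.indexOf(';');
        String base = semi >= 0 ? typeName.substring(0, semi) : typeName;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    // Null-safe resolution: never dereference a missing registry entry.
    public String resolve(String typeName) {
        String cleaned = clean(typeName);
        if (cleaned == null || cleaned.isEmpty()) {
            return DEFAULT;
        }
        String known = registry.get(cleaned);
        // Fall back to the cleaned name instead of calling a method on null.
        return known != null ? known : cleaned;
    }

    public static void main(String[] args) {
        MimeLookupSketch lookup = new MimeLookupSketch();
        System.out.println(lookup.resolve("text/html"));
        System.out.println(lookup.resolve("Text/Plain; charset=UTF-8"));
        System.out.println(lookup.resolve(null));
    }
}
```

The point of the sketch is only the guard in resolve(): the result of the registry lookup is checked before use, so an unregistered type degrades to its cleaned name rather than throwing an NPE.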
Hi All,
I need to add dmoz meta-data to my index. I see some people have commented
about it but I didn't find a solution. Can someone read the steps below and
give me some hints or pointers? This is the code that I added:
1) injector.java: datum.setCategory("dmoz-cat");
2) crawldatum.java: add
[
https://issues.apache.org/jira/browse/NUTCH-356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12540116
]
samx edited comment on NUTCH-356 at 11/6/07 10:57 AM:
-
I applied cache_classes.patch to nutch
[
https://issues.apache.org/jira/browse/NUTCH-356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12540116
]
samx edited comment on NUTCH-356 at 11/6/07 10:56 AM:
-
I applied cache_classes.patch to nutch
[
https://issues.apache.org/jira/browse/NUTCH-356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12540116
]
samx edited comment on NUTCH-356 at 11/6/07 10:55 AM:
-
I applied cache_classes.patch to nutch