Re: Tika API

2007-11-06 Thread Dennis Kubes
Chris Mattmann wrote: Hi Ned, Glad to see you're poking around with the Tika software and its use in Nutch. To start, you probably want to go to the website for Tika: http://incubator.apache.org/tika/ On that website, you should see the links to the SVN repository. The version of Tika tha

Re: Tika API

2007-11-06 Thread Chris Mattmann
[..snip..] > return type.getName(); > } > > > The NPE was being thrown on the last line, so I did some tracing and > found out that the call to MimeType.clean(typeName) [typeName <- > "text/html"] worked fine, but the next line caused a problem. The > this.mimeTypes.getRepository.forName(c

Re: Tika API

2007-11-06 Thread Ned Rockson
Sorry about being so vague. I was trying to run Fetcher (and Fetcher2 as well) and noticed that every fetched page was throwing an NPE. Essentially, the NPE was coming from Content.java that referenced HttpResponse and the line that was causing the problem had a call to MimeUtils.getRepositor

Re: Tika API

2007-11-06 Thread Chris Mattmann
Hi Ned, Glad to see you're poking around with the Tika software and its use in Nutch. To start, you probably want to go to the website for Tika: http://incubator.apache.org/tika/ On that website, you should see the links to the SVN repository. The version of Tika that was used was a version t

MD5 vs TextProfile Signature

2007-11-06 Thread karthik085
Hi, Wondering which does a better job: the MD5 or the TextProfile signature? From what I get from the APIs, and if there is content on a page, the MD5 signature is calculated from the raw binary content of the page and the TextProfile signature from the plain text profile of the page. I believe the values calculated are used to delete d

Tika API

2007-11-06 Thread Ned Rockson
I think there may be a bug in the Content.java when it tries to convert the textual representation of the type to a MimeType. It always returns null. I'm trying to fix it but I can't find an API for Tika (or even src). Can someone point me in the right direction? Thanks, Ned

adding dmoz meta data to index.

2007-11-06 Thread [EMAIL PROTECTED]
Hi All, I need to add dmoz meta-data to my index. I see some people have commented about it but I didn't find a solution. Can someone read the steps below and give me some hints or pointers? This is the code that I added: 1) injector.java: datum.setCategory("dmoz-cat"); 2) crawldatum.java: add

[jira] Issue Comment Edited: (NUTCH-356) Plugin repository cache can lead to memory leak

2007-11-06 Thread Sam Xia (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12540116 ] samx edited comment on NUTCH-356 at 11/6/07 10:57 AM: - I applied cache_classes.patch to nutch

[jira] Issue Comment Edited: (NUTCH-356) Plugin repository cache can lead to memory leak

2007-11-06 Thread Sam Xia (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12540116 ] samx edited comment on NUTCH-356 at 11/6/07 10:56 AM: - I applied cache_classes.patch to nutch

[jira] Issue Comment Edited: (NUTCH-356) Plugin repository cache can lead to memory leak

2007-11-06 Thread Sam Xia (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12540116 ] samx edited comment on NUTCH-356 at 11/6/07 10:55 AM: - I applied cache_classes.patch to nutch