Re: Library for extracting text content from binaries

2006-07-24 Thread Jukka Zitting
Hi, On 7/24/06, Chris Mattmann <[EMAIL PROTECTED]> wrote: Thanks for your email. Jerome Charron and I proposed a project with a similar goal in mind that we wanted to dub "Tika". Tika would effectively be a Lucene sub-project, and would factor out some of the capabilities you mention below from

Re: Why was "prune" removed in 0.8?

2006-07-24 Thread Andrzej Bialecki
Stefan Neufeind wrote: Hi, I might be bringing up old discussions (sorry if so) - but while discussing segread/readseg I wondered why "prune" is missing in bin/nutch. It's still working when you give the full classname by hand. But could it be (re)added to bin/nutch again as well? I think P

Why was "prune" removed in 0.8?

2006-07-24 Thread Stefan Neufeind
Hi, I might be bringing up old discussions (sorry if so) - but while discussing segread/readseg I wondered why "prune" is missing in bin/nutch. It's still working when you give the full classname by hand. But could it be (re)added to bin/nutch again as well? Regards, Stefan

Re: segread vs. readseg

2006-07-24 Thread Stefan Groschupf
I like it! On 24.07.2006 at 16:10, Andrzej Bialecki wrote: Stefan Neufeind wrote: Andrzej Bialecki wrote: Stefan Groschupf wrote: Hi developers, we have commands like readdb and readlinkdb, but segread. Wouldn't it be more consistent to name the command readseg instead of segread? ... just a th

Re: segread vs. readseg

2006-07-24 Thread Andrzej Bialecki
Stefan Neufeind wrote: Andrzej Bialecki wrote: Stefan Groschupf wrote: Hi developers, we have commands like readdb and readlinkdb, but segread. Wouldn't it be more consistent to name the command readseg instead of segread? ... just a thought. Yes, it seems more consistent. However, if we change it

[jira] Commented: (NUTCH-322) Fetcher discards ProtocolStatus, doesn't store redirected pages

2006-07-24 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-322?page=comments#action_12423187 ] Andrzej Bialecki commented on NUTCH-322: - Good questions ... ;) ad 1: Google shows only the final page, and you can access it through both the original (s

Re: segread vs. readseg

2006-07-24 Thread Stefan Neufeind
Andrzej Bialecki wrote: Stefan Groschupf wrote: Hi developers, we have commands like readdb and readlinkdb, but segread. Wouldn't it be more consistent to name the command readseg instead of segread? ... just a thought. Yes, it seems more consistent. However, if we change it then scripts people wr

Re: Library for extracting text content from binaries

2006-07-24 Thread Michael Wechner
Jukka Zitting wrote: Hi, Any interest in this? definitely :-) Michi If not, is there some other Lucene project that I should approach? BR, Jukka Zitting On 7/18/06, Jukka Zitting <[EMAIL PROTECTED]> wrote: Hi, I'm a committer of the Apache Jackrabbit project, and I've recently been

Re: segread vs. readseg

2006-07-24 Thread Andrzej Bialecki
Stefan Groschupf wrote: Hi developers, we have commands like readdb and readlinkdb, but segread. Wouldn't it be more consistent to name the command readseg instead of segread? ... just a thought. Yes, it seems more consistent. However, if we change it then scripts people wrote would break. We could

RE: Library for extracting text content from binaries

2006-07-24 Thread Chris Mattmann
Hi Jukka, Thanks for your email. Jerome Charron and I proposed a project with a similar goal in mind that we wanted to dub "Tika". Tika would effectively be a Lucene sub-project, and would factor out some of the capabilities you mention below from Nutch, incl: 1. MimeType repository 2. Parser i

Re: Library for extracting text content from binaries

2006-07-24 Thread Jukka Zitting
Hi, Any interest in this? If not, is there some other Lucene project that I should approach? BR, Jukka Zitting On 7/18/06, Jukka Zitting <[EMAIL PROTECTED]> wrote: Hi, I'm a committer of the Apache Jackrabbit project, and I've recently been working on improving the full text indexing support

segread vs. readseg

2006-07-24 Thread Stefan Groschupf
Hi developers, we have commands like readdb and readlinkdb, but segread. Wouldn't it be more consistent to name the command readseg instead of segread? ... just a thought. Stefan

[jira] Closed: (NUTCH-324) db.score.link.internal and db.score.link.external are ignored

2006-07-24 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-324?page=all ] Andrzej Bialecki closed NUTCH-324. --- Fix Version/s: 0.8-dev Resolution: Fixed Patch applied, with minor whitespace diffs and doc. clarifications. Thank you! > db.score.link.internal an

[jira] Updated: (NUTCH-167) Observation of directive

2006-07-24 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-167?page=all ] Andrzej Bialecki updated NUTCH-167: Attachment: patch.txt This patch implements support for Pragma: no-cache and Robots: noarchive. Three "cache policies" are supported in this patch: * C

[jira] Closed: (NUTCH-329) CrawlDbReader processTopNJob does not set jobNames

2006-07-24 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-329?page=all ] Andrzej Bialecki closed NUTCH-329. --- Resolution: Fixed Fixed. Thanks! > CrawlDbReader processTopNJob does not set jobNames > -- > >

[jira] Commented: (NUTCH-322) Fetcher discards ProtocolStatus, doesn't store redirected pages

2006-07-24 Thread Enrico Triolo (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-322?page=comments#action_12422996 ] Enrico Triolo commented on NUTCH-322: - Ok, I can see your point, nevertheless I think we should consider some potential problems that could arise from such modi

Re: [jira] Commented: (NUTCH-266) hadoop bug when doing updatedb

2006-07-24 Thread Sami Siren
Are you planning to update Hadoop to trunk/? I'd rather be careful with that - I'm not sure if it's still compatible with Java 1.4, besides being unreleased/unstable ... Not planning an upgrade, just want to know if it resolves the issues. We can then decide what's the best thing to do.

Re: [jira] Commented: (NUTCH-266) hadoop bug when doing updatedb

2006-07-24 Thread Andrzej Bialecki
Sami Siren (JIRA) wrote: [ http://issues.apache.org/jira/browse/NUTCH-266?page=comments#action_12422929 ] Sami Siren commented on NUTCH-266: -- I finally found the time to set up an environment with cygwin and try this out. I can confirm that the