Re: [Nutch-dev] [jira] Updated: (NUTCH-497) Extreme Nested Tags causes StackOverflowException in DomContentUtils...Spider Trap

2007-06-25 Thread Chris Mattmann
Dennis, +1 On 6/25/07 4:42 PM, Dennis Kubes [EMAIL PROTECTED] wrote: If no one has any objections, I will go ahead and commit this. Dennis Kubes Dennis Kubes (JIRA) wrote: [ https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugi

Re: [Nutch-dev] svn commit: r550669 - in /lucene/nutch/trunk/src: java/org/apache/nutch/util/ plugin/languageidentifier/src/java/org/apache/nutch/analysis/lang/ plugin/parse-html/src/java/org/apache/n

2007-06-25 Thread Chris Mattmann
On 6/25/07 8:34 PM, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: Author: kubes Date: Mon Jun 25 20:33:59 2007 New Revision: 550669 URL: http://svn.apache.org/viewvc?view=rev&rev=550669 Log: NUTCH-497: Fixes problems relating to StackOverflow errors and extreme nested tags. Adds general
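
A note on the commit log above: a stack overflow on extremely nested tags is the usual symptom of walking a DOM tree recursively. The sketch below shows one common remedy, an explicit stack in place of recursion. It is an illustration only and makes no claim about what r550669 actually changed; the class name IterativeDomText and both of its methods are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

/** Hypothetical helper, not Nutch code: walks a DOM tree without recursion. */
final class IterativeDomText {
    private IterativeDomText() {}

    /** Collects the content of all text nodes in document order. */
    static String collectText(Node root) {
        StringBuilder out = new StringBuilder();
        Deque<Node> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            Node node = stack.pop();
            if (node.getNodeType() == Node.TEXT_NODE) {
                out.append(node.getNodeValue());
            }
            // Push children last-to-first so the first child is popped first.
            for (Node child = node.getLastChild(); child != null;
                    child = child.getPreviousSibling()) {
                stack.push(child);
            }
        }
        return out.toString();
    }

    /** Builds a document with `depth` nested div elements wrapped around `text`. */
    static Document nestedDocument(int depth, String text) {
        if (depth < 1) {
            throw new IllegalArgumentException("depth must be at least 1");
        }
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Node current = doc.createTextNode(text);
            for (int i = 0; i < depth; i++) {
                Element wrapper = doc.createElement("div");
                wrapper.appendChild(current);
                current = wrapper;
            }
            doc.appendChild(current);
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Because the pending nodes live on the heap, the depth of the markup is bounded by available memory rather than by the thread's stack size.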

Re: [Nutch-dev] svn commit: r550669 - in /lucene/nutch/trunk/src: java/org/apache/nutch/util/ plugin/languageidentifier/src/java/org/apache/nutch/analysis/lang/ plugin/parse-html/src/java/org/apache/n

2007-06-25 Thread Chris Mattmann
No problemo! Thanks! Cheers, Chris On 6/25/07 9:45 PM, Dennis Kubes [EMAIL PROTECTED] wrote: ooops... gotta remember to do that. Done. Dennis Chris Mattmann wrote: On 6/25/07 8:34 PM, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: Author: kubes Date: Mon Jun 25 20:33:59 2007 New

Re: [Nutch-dev] Build failed in Hudson: Nutch-Nightly #123

2007-06-20 Thread Chris Mattmann
Doğacan, This is strange indeed. I noticed this during my testing of parse-feed, however, thought it was an anomaly. I got this same strange cryptic unit test error message, and then after some frustration figuring it out, I did ant clean, then ant compile-core test, and miraculously the error

Re: [Nutch-dev] Build failed in Hudson: Nutch-Nightly #123

2007-06-20 Thread Chris Mattmann
On 6/20/07 7:17 AM, Doğacan Güney [EMAIL PROTECTED] wrote: It never passes for me (not even when I do it in src/plugin/feed). If you check the output, parseResult only contains a single entry which is rsstest.rss. Okay, please tell me I'm not crazy here. I'm on Mac OS X 10.4, Java version:

Re: [Nutch-dev] Build failed in Hudson: Nutch-Nightly #123

2007-06-20 Thread Chris Mattmann
On 6/20/07 8:17 AM, Doğacan Güney [EMAIL PROTECTED] wrote: Since you are doing compile-core, no plugins get compiled (say, urlfilter-prefix), then when you do a ant test in feed only protocol-file gets compiled. So, no urlfilter-prefix, no problem :). I have to say that I am certain that I

[Nutch-dev] Committer

2007-05-30 Thread Chris Mattmann
Hi Folks, I'd just like to throw out my +1 for Doğacan Güney's committer status. I've been impressed by several of his contributions and the guy just keeps them coming and coming. I'm not a member of the Lucene PMC, so I don't have official voting rights, however, I would like to express my

Re: [Nutch-dev] Nutch Release 0.9 - Waiting for release to propagate to mirrors

2007-04-05 Thread Chris Mattmann
announcing the completion of the release. Thanks! Cheers, Chris On 4/4/07 7:21 PM, Chris Mattmann [EMAIL PROTECTED] wrote: Hi Guys, I've just moved forward with step 13 in the release process (waiting for release to propagate to mirrors). Should I just go ahead and do the other

[Nutch-dev] Nutch 0.9 officially released!

2007-04-05 Thread Chris Mattmann
Hi Folks, After some hard work from all folks involved, we've managed to push out Apache Nutch, release 0.9. This is the second release of Nutch based entirely on the underlying Hadoop platform. This release includes several critical bug fixes, as well as key speedups described in more detail at

Re: [Nutch-dev] [VOTE] Release Apache Nutch 0.9

2007-04-04 Thread Chris Mattmann
wrapped up tonight! :-) Cheers, Chris On 4/4/07 8:04 AM, Sami Siren [EMAIL PROTECTED] wrote: Chris Mattmann wrote: Hi Folks, I have posted a candidate for the Apache Nutch 0.9 release at http://people.apache.org/~mattmann/nutch_0.9/rc2/ Please vote on releasing these packages as Apache

[Nutch-dev] Nutch Release 0.9 - Waiting for release to propagate to mirrors

2007-04-04 Thread Chris Mattmann
Hi Guys, I've just moved forward with step 13 in the release process (waiting for release to propagate to mirrors). Should I just go ahead and do the other steps (update Nutch site, update Lucene site, Update javadoc, create version in JIRA, etc.)? It seems that I could do these without the

Re: [Nutch-dev] [VOTE] Release Apache Nutch 0.9

2007-04-02 Thread Chris Mattmann
Hi Guys, I think we're discussing about the same thing(improving the process), I just don't think 0.9 is out yet :) But to wrap it up for me: +1 for creating 0.9 branch after fixing the bug (and removing the tag), creating new rc and starting a vote. +1. +1. So, that's 3

Re: [Nutch-dev] svn commit: r524932 - in /lucene/nutch/trunk/src/java/org/apache/nutch/segment: SegmentMerger.java SegmentReader.java

2007-04-02 Thread Chris Mattmann
Hi Dennis, Thanks for taking care of this. :-) Could you update CHANGES.txt as well? Once you take care of that, in about 2 hrs (when I get home), I'll begin the release process again. Thanks! Cheers, Chris On 4/2/07 2:40 PM, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: Author: kubes

Re: [Nutch-dev] svn commit: r524932 - in /lucene/nutch/trunk/src/java/org/apache/nutch/segment: SegmentMerger.java SegmentReader.java

2007-04-02 Thread Chris Mattmann
to it sooner. Dennis Kubes Chris Mattmann wrote: Hi Dennis, Thanks for taking care of this. :-) Could you update CHANGES.txt as well? Once you take care of that, in about 2 hrs (when I get home), I'll begin the release process again. Thanks! Cheers, Chris On 4/2/07 2:40

[Nutch-dev] [VOTE] Release Apache Nutch 0.9

2007-04-02 Thread Chris Mattmann
Hi Folks, I have posted a candidate for the Apache Nutch 0.9 release at http://people.apache.org/~mattmann/nutch_0.9/rc2/ See the included CHANGES-0.9.txt file for details on release contents and latest changes. The release was made from the 0.9-dev trunk, including the recent patch applied

Re: [Nutch-dev] [VOTE] Release Apache Nutch 0.9

2007-04-02 Thread Chris Mattmann
Folks, As an FYI, here is a link to the log of the steps that I followed to get to this point in the release: http://people.apache.org/~mattmann/NUTCH_0.9_release_log_v2.doc Cheers, Chris On 4/2/07 10:52 PM, Chris Mattmann [EMAIL PROTECTED] wrote: Hi Folks, I have posted a candidate

Re: [Nutch-dev] Next release - 0.10.0 or 1.0.0 ?

2007-03-28 Thread Chris Mattmann
My +1 for 1.0.0. I already changed it to 0.10.0, but this can be easily reverted, and was probably something that I should have brought to the attention of the dev list before I did that (sorry about that). In any case, I think 1.0.0 makes a lot of sense, politically, and software wise. Nutch is

Re: [Nutch-dev] [VOTE] Release Apache Nutch 0.9

2007-03-27 Thread Chris Mattmann
Hi Sami, A very limited acid test shows that I can do crawling and searching through web app so that part is ok. Great! Similar tests of my own showed the same. About signatures: I can't find your public gpg key anywhere (to verify the signature), not in KEYS file nor in keyservers I

Re: [Nutch-dev] [VOTE] Release Apache Nutch 0.9

2007-03-27 Thread Chris Mattmann
/, using the same convention as the others. To get the header, I did a gpg --list-keys. Thanks! Cheers, Chris On 3/27/07 8:14 AM, Chris Mattmann [EMAIL PROTECTED] wrote: Hi Sami, A very limited acid test shows that I can do crawling and searching through web app so that part is ok. Great

[Nutch-dev] FW: [jira] Created: (HADOOP-1147) remove all @author tags from source

2007-03-22 Thread Chris Mattmann
Hey Doug, Do you think we should do this in Nutch too? I'm in favor of doing this -- what does everyone else feel? Thanks! Cheers, Chris __ Chris A. Mattmann [EMAIL PROTECTED] Staff Member Modeling and Data Management Systems Section (387) Data

Re: [Nutch-dev] svn commit: r516759 - /lucene/nutch/trunk/CHANGES.txt

2007-03-10 Thread Chris Mattmann
Hi Dennis, Not to nit-pick, but the place where you inserted your change isn't at the end (where they typically should be placed). You inserted in the middle of the file, throwing off the numbering (there are now 2 sets of 18, and 19 in the unreleased changes section). Could you please append

Re: [Nutch-dev] svn commit: r516759 - /lucene/nutch/trunk/CHANGES.txt

2007-03-10 Thread Chris Mattmann
Dennis, No probs. Thanks, a lot! Cheers, Chris On 3/10/07 5:35 PM, Dennis Kubes [EMAIL PROTECTED] wrote: Chris Mattmann wrote: Hi Dennis, Not to nit-pick, but the place where you inserted your change isn't at the end (where they typically should be placed). You inserted

Re: [Nutch-dev] [jira] Commented: (NUTCH-384) Protocol-file plugin does not allow the parse plugins framework to operate properly

2007-03-08 Thread Chris Mattmann
On 3/8/07 1:55 PM, Andrzej Bialecki [EMAIL PROTECTED] wrote: Chris Mattmann wrote: Hi Andrzej, Yep, +1. I also want to make a small update, where instead of creating a new NutchConf object, to just pass it through (maybe via the protocol layer?). Does this make sense? I'm not sure

[Nutch-dev] 0.9 release

2007-03-07 Thread Chris Mattmann
Hi Folks, As suggested by Sami, I'm moving this discussion to the nutch-dev list. Seems like I am the guy that is going to do the Nutch 0.9 release :-) However, it seems also that there are some issues that need to be sorted out first. I'd like to follow up to Andrzej's email about loose ends

[Nutch-dev] FW: Nutch release process help

2007-03-06 Thread Chris Mattmann
, Chris -- Forwarded Message From: Chris Mattmann [EMAIL PROTECTED] Date: Mon, 05 Mar 2007 21:25:30 -0800 To: Piotr Kosiorowski [EMAIL PROTECTED] Cc: Chris Mattmann [EMAIL PROTECTED], Andrzej Bialecki [EMAIL PROTECTED] Conversation: Nutch release process help Subject: Nutch release process help

Re: [Nutch-dev] Issues pending before 0.9 release

2007-03-05 Thread Chris Mattmann
Hi Guys, Blocker * NUTCH-400 (Update add missing license headers) - I believe this is fixed and should be closed +1, thanks to Sami for closing it. * NUTCH-353 (pages that serverside forwards will be refetched every time) - this was partially fixed in NUTCH-273, but a more

Re: [Nutch-dev] log guards

2007-02-28 Thread Chris Mattmann
Jérôme Charron wrote: Hi Chris, The JIRA issue is the 309 : https://issues.apache.org/jira/browse/NUTCH-309 Thanks for your help. Jérôme On 2/13/07, Chris Mattmann [EMAIL PROTECTED] wrote: Hi Doug, and Jerome, Ah, yes, the log guard conversation. I remember this from a while back

Re: [Nutch-dev] Welcome Dennis Kubes as Nutch committer

2007-02-28 Thread Chris Mattmann
Dennis, I take my coffee black: with a single creamer ;) Okay, okay, sorry: I thought we were talking about *real* hazing ;) Cheers, Chris On 2/28/07 12:31 PM, Dennis Kubes [EMAIL PROTECTED] wrote: Hi All, Thank you Andrzej for your kind words. I am looking forward to working

Re: [Nutch-dev] log guards

2007-02-13 Thread Chris Mattmann
Hi Doug, and Jerome, Ah, yes, the log guard conversation. I remember this from a while back. Hmmm, do you guys know what issue this was recorded as in JIRA? I have some free time recently, so I will be able to add this to my list of Nutch stuff to work on, and would be happy to take the lead

Re: [Nutch-dev] RSS-fecter and index individul-how can i realize this function

2007-02-08 Thread Chris Mattmann
, and contacting the folks who've begun work on this issue. Thanks! Cheers, Chris On 2/7/07 1:31 PM, Doug Cutting [EMAIL PROTECTED] wrote: Chris Mattmann wrote: Got it. So, the logic behind this is, why bother waiting until the following fetch to parse (and create ParseData objects from

Re: [Nutch-dev] RSS-fecter and index individul-how can i realize this function

2007-02-07 Thread Chris Mattmann
Guys, Sorry to be so thick-headed, but could someone explain to me in really simple language what this change is requesting that is different from the current Nutch API? I still don't get it, sorry... Cheers, Chris On 2/7/07 9:58 AM, Doug Cutting [EMAIL PROTECTED] wrote: Renaud Richardet

Re: [Nutch-dev] RSS-fecter and index individul-how can i realize this function

2007-02-07 Thread Chris Mattmann
PROTECTED] wrote: Chris Mattmann wrote: Sorry to be so thick-headed, but could someone explain to me in really simple language what this change is requesting that is different from the current Nutch API? I still don't get it, sorry... A Content would no longer generate a single Parse. Instead

Re: [Nutch-dev] RSS-fecter and index individul-how can i realize this function

2007-02-06 Thread Chris Mattmann
Hi Doug, Since the target of the link must still be indexed separately from the item itself, how much use is all this? If the RSS document is considered a single page that changes frequently, and item's links are considered ordinary outlinks, isn't much the same effect achieved? IMHO, yes.

Re: [Nutch-dev] RSS-fecter and index individul-how can i realize this function

2007-01-30 Thread Chris Mattmann
Hi there, I could most likely be of assistance, if you gave me some more information. For instance: I'm wondering if the use case you describe below is already supported by the current RSS parse plugin? The current RSS parser, parse-rss, does in fact index individual items that are pointed to

Re: [Nutch-dev] RSS-fecter and index individul-how can i realize this function

2007-01-30 Thread Chris Mattmann
you mention asynchronous above, are you talking about the protocol for fetching the different RSS documents? Thanks! Cheers, Chris Thanks -Original Message- From: Chris Mattmann [EMAIL PROTECTED] Date: Tue, 30 Jan 2007 18:16:44 To:nutch-dev@lucene.apache.org Subject: Re

Re: [Nutch-dev] RSS-fecter and index individul-how can i realize this function

2007-01-30 Thread Chris Mattmann
://lucene.apache.org/nutch</link> <category>news</category> <author>kauu</author> On 1/31/07, Chris Mattmann [EMAIL PROTECTED] wrote: Hi there, I could most likely be of assistance, if you gave me some more information. For instance: I'm wondering if the use case you describe below

Re: [Nutch-dev] [jira] Commented: (NUTCH-258) Once Nutch logs a SEVERE log item, Nutch fails forevermore

2007-01-25 Thread Chris Mattmann
It's at least out-of-date and perhaps obsolete. A quick read of Fetcher.java looks like there might be a case where a fatal error is logged but the fetcher doesn't exit, in FetcherThread#output(). So this raises an interesting question: People (such as Scott G.) out there -- are you folks

Re: [Nutch-dev] [jira] Commented: (NUTCH-258) Once Nutch logs a SEVERE log item, Nutch fails forevermore

2007-01-25 Thread Chris Mattmann
Hi Doug, So, does this render the patch that I wrote obsolete? Cheers, Chris On 1/25/07 10:08 AM, Doug Cutting [EMAIL PROTECTED] wrote: Scott Ganyo (JIRA) wrote: ... since Hadoop hijacks and reassigns all log formatters (also a bad practice!) in the org.apache.hadoop.util.LogFormatter

Re: [Nutch-dev] Reviving Nutch 0.7

2007-01-22 Thread Chris Mattmann
Before doubling (or after 0.9.0 tripling?) the maintenance/development work please consider the following: One option would be refactoring the code in a way that the parts that are usable to other projects like protocols?, parsers (this actually was proposed by Jukka Zitting some time

Re: [Nutch-dev] How to Become a Nutch Developer

2007-01-21 Thread Chris Mattmann
Hi Dennis, On 1/21/07 11:47 AM, Dennis Kubes [EMAIL PROTECTED] wrote: All, I am working on a How to Become a Nutch Developer document for the wiki and I need some input. I need an overview of how the process for JIRA works? If I am a developer new to Nutch and just starting to look at

Re: [Nutch-dev] Next Nutch release

2007-01-16 Thread Chris Mattmann
Folks, When would you like to make the release? I've been working on NUTCH-185, but got a bit bogged down with other work. If there is interest in having NUTCH-185 included in the release, I could make a push to get out a patch by week's end... As for the rest, my +1 for NUTCH-61 being

Re: [Nutch-dev] svn commit: r485076 - in /lucene/nutch/trunk/src: java/org/apache/nutch/metadata/SpellCheckedMetadata.java test/org/apache/nutch/metadata/TestSpellCheckedMetadata.java

2006-12-09 Thread Chris Mattmann
Hi Sami, On 12/9/06 2:27 PM, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: Author: siren Date: Sat Dec 9 14:27:07 2006 New Revision: 485076 URL: http://svn.apache.org/viewvc?view=rev&rev=485076 Log: Optimize SpellCheckedMetadata further by taking into account the fact that it is used only

Re: [Nutch-dev] svn commit: r485076 - in /lucene/nutch/trunk/src: java/org/apache/nutch/metadata/SpellCheckedMetadata.java test/org/apache/nutch/metadata/TestSpellCheckedMetadata.java

2006-12-09 Thread Chris Mattmann
in org.apache.nutch.metadata that aggregates all the met key fields from HttpHeaders, and it would be the place that the met key fields for FileHeaders, etc. could go into. Let me know what you think, and thanks! Cheers, Chris On 12/9/06 3:53 PM, Sami Siren [EMAIL PROTECTED] wrote: Chris Mattmann wrote: Hi Sami

Re: [Nutch-dev] [jira] Closed: (NUTCH-406) Metadata tries to write null values

2006-11-23 Thread Chris Mattmann
Hi Sami, On 11/23/06 9:45 AM, Sami Siren [EMAIL PROTECTED] wrote: Couple of points: 1. You used tabs I just installed a new version of Eclipse, and forgot to change the default preference for using tabs versus just whitespaces. I've gone ahead and changed this in my Eclipse and will commit

Re: [Nutch-dev] Welcome Chris Mattmann as Nutch committer

2006-11-23 Thread Chris Mattmann
Thanks, Andrzej, thanks to the rest of the folks who voted me in! I really appreciate the honor and pledge to help maintain the high quality of the Nutch source code. Best wishes and happy holidays to all the folks on the list! Cheers, Chris On 11/23/06 4:10 AM, Andrzej Bialecki [EMAIL

[Nutch-dev] Nutch requires JDK 1.5 now?

2006-10-03 Thread Chris Mattmann
Hi Folks, I noticed that Nutch now requires JDK 5 in order to compile, due to recent changes to the PluginRepository and some other classes. I think that this is a good move, however, I wasn't sure that I had seen any official announcement that Nutch now requires 1.5... Cheers, Chris

Re: [Nutch-dev] Nutch requires JDK 1.5 now?

2006-10-03 Thread Chris Mattmann
The switch to 1.5 format was also logged on jira issue http://issues.apache.org/jira/browse/NUTCH-360 -- Sami Siren Ahh, I didn't see this. Way to go Sami, I love it when people actually keep records of changes! ;) Cheers, Chris __ Chris A.

Re: [Nutch-dev] Nutch requires JDK 1.5 now?

2006-10-03 Thread Chris Mattmann
the email address for JIRA to not use the Apache incubator one anymore, and to use the Lucene one. Sound good? If so, could someone with permissions please take care of it? :-) Cheers, Chris On 10/3/06 9:04 AM, Sami Siren [EMAIL PROTECTED] wrote: Andrzej Bialecki wrote: Chris Mattmann wrote

Re: [Nutch-dev] Patch Available status?

2006-08-30 Thread Chris Mattmann
Hi Doug and Andrzej, +1. I think that workflow makes a lot of sense. Currently users in the nutch-developers group can close and resolve issues. In the Hadoop workflow, would this continue to be the case? Cheers, Chris On 8/30/06 3:14 PM, Andrzej Bialecki [EMAIL PROTECTED] wrote: Doug

Re: [Nutch-dev] 0.8 not loading plugins

2006-08-17 Thread Chris Mattmann
Hi Chris, It seems from your email message that your plugin is located in $NUTCH_HOME/build/custom-meta? Is this where your plugin * code * is currently stored? If so, this is the wrong location and the most likely reason that your plugin isn't being loaded. Plugin code should live in

Re: [Nutch-dev] Tika update

2006-08-16 Thread Chris Mattmann
Hi Jukka, Thanks for your email. Indeed, there was discussion on the Lucene PMC email list, about the Tika project. It was decided by the powers that be to discuss it more on the Nutch mailing list before moving forward with any vote on making Tika a sub-project of Apache Lucene. With regards to

Re: [Nutch-dev] Tika update

2006-08-16 Thread Chris Mattmann
the following list of candidate committers who have expressed interest in our proposed project. The leader of the Tika project would be Chris Mattmann. Chris works at NASA's Jet Propulsion Laboratory as a Member of the Technical Staff in the Modeling and Data Management Systems Section. Chris has

Re: [Nutch-dev] Any plans to move to build Nutch using Maven?

2006-08-16 Thread Chris Mattmann
Hi Steven, On 8/16/06 7:36 AM, steven shingler [EMAIL PROTECTED] wrote: (This thread moved from the User List.) OK Lukas, lets open it up to the dev list! :) Particularly, does the group feel moving to Maven would be _a good thing_ ? +1 I suggested this (however did not make any

[Nutch-dev] Patch Available status?

2006-08-15 Thread Chris Mattmann
Hi Guys, I've seen on the Hadoop mailing list recently that there was a new status added for issues in JIRA called Patch Available to let committers know that a patch is ready for review to commit. How about we add this to the Nutch jira instance as well? I tried doing this, but I don't think I

Re: [Nutch-dev] parse-plugins.xml

2006-08-03 Thread Chris Mattmann
Hi Marko, Thanks for your question. Basically it was set up as a sort of last resort for getting at least * some * information from the PDF file, albeit littered with garbage. If indeed the parse-text does not really make sense in terms of a backup parser to handle PDF files and get at least

Re: [Nutch-dev] parse-plugins.xml

2006-08-03 Thread Chris Mattmann
Hey Andrzej, On 8/3/06 8:19 AM, Andrzej Bialecki [EMAIL PROTECTED] wrote: Chris Mattmann wrote: Hi Marko, Thanks for your question. Basically it was set up as a sort of last resort for getting at least * some * information from the PDF file, albeit littered with garbage. If indeed

Re: [Nutch-dev] Library for extracting text content from binaries

2006-07-24 Thread Chris Mattmann
Hi Jukka, Thanks for your email. Jerome Charron and I proposed a project with a similar goal in mind that we wanted to dub Tika. Tika would effectively be a Lucene sub-project, and would factor out some of the capabilities you mention below from Nutch, incl: 1. MimeType repository 2. Parser

Re: [Nutch-dev] [jira] Commented: (NUTCH-258) Once Nutch logs a SEVERE log item, Nutch fails forevermore

2006-06-05 Thread Chris Mattmann
Folks, Before I (or someone else) reopens the issue, I think it's important to understand the implications: 1) Having a *side-effect* of the entire system stop processing after merely logging a message at a certain event level is a poor practice. I'm not sure that the Fetcher quitting is a *

Re: [Nutch-dev] [jira] Commented: (NUTCH-258) Once Nutch logs a SEVERE log item, Nutch fails forevermore

2006-06-05 Thread Chris Mattmann
Hi Andrzej, The main problem, as Scott observed, is that the static flag affects all instances of the task executing inside the same JVM. If there are several Fetcher tasks (or any other tasks that check for SEVERE flag!), belonging to different jobs, all of them will quit. This is
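
To illustrate the point quoted above: a static flag is shared by every task in the JVM, so one task's failure stops unrelated jobs. A minimal sketch of the alternative, assuming each task owns its own failure state, follows. The class TaskFailureState and its methods are hypothetical and are not taken from Fetcher.java or from any NUTCH-258 patch.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical sketch, not Nutch's Fetcher: failure state scoped to one task. */
final class TaskFailureState {
    // An instance field, so each task (and therefore each job) has its own flag.
    // A static field here would be visible to every task running in the same JVM.
    private final AtomicBoolean fatal = new AtomicBoolean(false);

    /** Records that this task hit an unrecoverable error. */
    void markFatal() {
        fatal.set(true);
    }

    /** Tells the owning task, and only that task, whether it should stop. */
    boolean shouldStop() {
        return fatal.get();
    }
}
```

With this shape, a task that logs a fatal condition calls markFatal() on its own state object, and tasks belonging to other jobs in the same JVM keep running.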

[Nutch-dev] Re: Nutch Parser Bug

2006-04-25 Thread Chris Mattmann
Hi Alex, I also noticed this issue a while back. It's described here: http://mail-archives.apache.org/mod_mbox/lucene-nutch-dev/200510.mbox/%3c435 [EMAIL PROTECTED] Cheers, Chris On 4/25/06 2:41 PM, Alex [EMAIL PROTECTED] wrote: Hi there, I'm fairly new to nutch and in working on the

[Nutch-dev] RE: plugin.dtd

2006-04-16 Thread Chris Mattmann
Hi Stefan, The DTD actually does allow for custom attributes: Jerome factored them out of the form: <implementation your_attr_name="your_attr_value" your_attr2_name="your_attr2_value" ...> into the form: <implementation ...> <parameter name="your_attr_name" value="your_attr_value"/>

[Nutch-dev] Re: 0.8 release schedule (was Re: latest build throws error - critical)

2006-04-07 Thread Chris Mattmann
+1 On 4/7/06 10:20 AM, Doug Cutting [EMAIL PROTECTED] wrote: Chris Mattmann wrote: +1 for a release sooner rather than later. I think this is a good plan. There's no reason we can't do another release in a month. If it is back-compatible we can call it 0.8.x and if it's incompatible

[Nutch-dev] Re: 0.8 release schedule (was Re: latest build throws error - critical)

2006-04-07 Thread Chris Mattmann
Hi Andrzej, On 4/7/06 12:18 PM, Andrzej Bialecki [EMAIL PROTECTED] wrote: Do you guys have any additional insights / suggestions whether NUTCH-240 and/or NUTCH-61 should be included in this release? Looking at the JIRA popular issues pane for Nutch (

[Nutch-dev] Re: 0.8 release schedule (was Re: latest build throws error - critical)

2006-04-06 Thread Chris Mattmann
+1 for a release sooner rather than later. Several interesting features contributed since the 0.7 branch I believe are now tested and production-worthy, at least in my environment. Hats off to the folks who were able to split the MapReduce and NDFS into Hadoop -- I'm going to be experimenting with

[Nutch-dev] Null Pointer exception in AnalyzerFactory?

2006-03-13 Thread Chris Mattmann
Hi Folks, I updated to the latest SVN revision (385691) today, and I am now seeing a Null Pointer exception in the AnalyzerFactory.java class. It seems that in some cases, the method: private Extension getExtension(String lang) { Extension extension = (Extension)

[Nutch-dev] Re: Null Pointer exception in AnalyzerFactory?

2006-03-13 Thread Chris Mattmann
Thanks Jerome! :-) Cheers, Chris On 3/13/06 4:02 PM, Jérôme Charron [EMAIL PROTECTED] wrote: I updated to the latest SVN revision (385691) today, and I am now seeing a Null Pointer exception in the AnalyzerFactory.java class. Fixed (r385702). Thanks Chris. NOTE: not sure if

[Nutch-dev] RE: found resource parse-plugins.xm?

2006-03-06 Thread Chris Mattmann
Hi Stefan, after a short time I already had 1602 time this lines in my tasktracker log files. 060307 022707 task_m_2bu9o4 found resource parse-plugins.xml at file:/home/joa/nutch/conf/parse-plugins.xml Sounds like this file is loaded 1602 (after lets say 3 minutes) I guess that wasn't

[Nutch-dev] RE: found resource parse-plugins.xm?

2006-03-06 Thread Chris Mattmann
RuntimeException("x point " + Parser.X_POINT_ID + " not found."); Cheers, Chris Cheers, Stefan Am 07.03.2006 um 04:38 schrieb Chris Mattmann: Hi Stefan, after a short time I already had 1602 time this lines in my tasktracker log files. 060307 022707 task_m_2bu9o4 found resource parse

[Nutch-dev] RE: found resource parse-plugins.xm?

2006-03-06 Thread Chris Mattmann
) { throw new RuntimeException("x point " + Parser.X_POINT_ID + " not found."); -Original Message- From: Chris Mattmann [mailto:[EMAIL PROTECTED] Sent: Monday, March 06, 2006 7:51 PM To: 'nutch-dev@lucene.apache.org' Subject: RE: found resource parse-plugins.xm? Hi Stefan, Hi Chris

[Nutch-dev] Re: duplicate libs

2006-02-13 Thread Chris Mattmann
Hey Doug, I think that at least in the case of parse-rss, parse-pdf, and the nutch core there's probably some utility in having lib-xxx plugins (or at least putting these jars in the $NUTCH_HOME/lib) for: commons-httpclient log4j xerces Then, protocol-httpclient, parse-pdf and the rest of

[Nutch-dev] RE: duplicate libs

2006-02-13 Thread Chris Mattmann
Hi Andrzej, commons-httpclient-3.0-beta1.jar src/plugin/parse-rss/lib commons-httpclient-3.0.jarsrc/plugin/protocol-httpclient/lib Not sure what was the reason to use the beta1, perhaps no reason except that it was the latest available at the moment... Yup, I think that was

[Nutch-dev] Re: ignore eclipse .project and .classpath

2006-02-09 Thread Chris Mattmann
and .classpath +1 Am 08.02.2006 um 06:16 schrieb Chris Mattmann: Hi Folks, Just wondering if someone could add to the svn:ignore property for Nutch the files: .classpath .project I happen to use eclipse to do Nutch development and always ignore these files in my other

[Nutch-dev] ignore eclipse .project and .classpath

2006-02-07 Thread Chris Mattmann
Hi Folks, Just wondering if someone could add to the svn:ignore property for Nutch the files: .classpath .project I happen to use eclipse to do Nutch development and always ignore these files in my other eclipse projects as well. Cheers, Chris

[Nutch-dev] RE: [jira] Updated: (NUTCH-179) Proposition: Enable Nutch to use a parser plugin not just based on content type

2006-01-15 Thread Chris Mattmann
Hi Gail, Check out: http://wiki.apache.org/nutch/ParserFactoryImprovementProposal/ That's the way that the parser factory currently works. Also added, but not described in that proposal is the ability to call a parser by its id, which is a method present in ParseUtil.java. G'luck! Cheers,

[Nutch-dev] Nutch Deployment

2006-01-06 Thread Chris Mattmann
Hi Folks, Jerome and I have been thinking a bit about the whole issue of static NutchConf, versus removing it and making it a constructor parameter, etc. I personally think that a lot of this issue stems from the fact that the actual source code for nutch, and the what I would call source

[Nutch-dev] RE: [jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

2006-01-05 Thread Chris Mattmann
Hi Folks, I've tried removing the 5 copies of the comment, however I can't find a button on JIRA to remove comments. Maybe an administrator for Nutch can do it? Anyways, the dang thing is running so slow right now, it may just have to wait until the server stops returning the 503 service

[Nutch-dev] RE: [jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

2006-01-05 Thread Chris Mattmann
Guys, My apologies for the spamming comments -- I tried to submit my comment through JIRA one time and it kept giving me service unavailable. So I resubmitted like 5 times, on the fifth time it finally went through -- but I guess the other comments went through too. I'll try and remove them

[Nutch-dev] bug in parse-rtf?

2005-12-16 Thread Chris Mattmann
Hi Folks, Anybody been experiencing problems building the parse-rtf plugin? I just noticed while working on NUTCH-139 that there's a line at the end of RTFParser.java in parse-rtf that returns a new ParseImpl, however, the constructor for ParseData uses the old ParseData constructor (pre

[Nutch-dev] Standard metadata property names in the ParseData metadata

2005-12-13 Thread Chris Mattmann
Hi Folks, I was just thinking about the ParseData java.util.Properties metadata object and thinking about the way that we store names in there. Currently, people are free to name their string-based properties anything that they want, such as having names of Content-type, content-TyPe,
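
The naming problem raised above (Content-type versus content-TyPe) can be sketched with a map that compares property names without regard to case. This is only one possible approach, shown under the assumption that names differ by case alone; the class CaseInsensitiveMetadata is hypothetical and is not the Nutch metadata API.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch, not the Nutch API: property names compared case-insensitively. */
final class CaseInsensitiveMetadata {
    // String.CASE_INSENSITIVE_ORDER treats "Content-type" and "content-TyPe" as one key.
    private final Map<String, String> values =
            new TreeMap<>(String.CASE_INSENSITIVE_ORDER);

    /** Stores a value; a later write under a differently cased name overwrites it. */
    void set(String name, String value) {
        values.put(name, value);
    }

    /** Returns the value for the name in any casing, or null if it is absent. */
    String get(String name) {
        return values.get(name);
    }

    /** Number of distinct property names, ignoring case. */
    int size() {
        return values.size();
    }
}
```

A fixed set of standard name constants, as the thread subject suggests, addresses the same problem from the other side by steering writers toward one spelling in the first place.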

[Nutch-dev] Re: Standard metadata property names in the ParseData metadata

2005-12-13 Thread Chris Mattmann
insensitive? Stefan Am 13.12.2005 um 18:07 schrieb Chris Mattmann: Hi Folks, I was just thinking about the ParseData java.util.Properties metadata object and thinking about the way that we store names in there. Currently, people are free to name their string-based properties anything

[Nutch-dev] Idea about aliases in the parse-plugins.xml file

2005-12-13 Thread Chris Mattmann
Hi Folks, Jerome and I have been talking about an idea to address the current issue raised by Stefan G. about having a mapping of mimeType-list of pluginIds rather than mimeType-list of extensionIds in the parse-plugins.xml file. We've come up with the following proposed update that would

[Nutch-dev] Re: Standard metadata property names in the ParseData metadata

2005-12-13 Thread Chris Mattmann
Hi Guys, Okay, that makes sense then. I will create an issue in JIRA later today describing the update, and then begin working on this over the next few days. Thanks for your responses and reviews. Cheers, Chris On 12/13/05 12:45 PM, Jérôme Charron [EMAIL PROTECTED] wrote: I agree, too.

[Nutch-dev] RE: submitting a patch?

2005-12-06 Thread Chris Mattmann
Hi James, You can submit your patch via JIRA (http://issues.apache.org/jira/browse/NUTCH). You can create an issue there and then attach your patch to that issue. G'luck, Chris __ Chris A. Mattmann [EMAIL PROTECTED] Staff Member Modeling and

[Nutch-dev] NUTCH-112: Link in cached.jsp page to cached content is an absolute link

2005-12-06 Thread Chris Mattmann
Hi Guys, Just wondering if any of the committers checked out http://issues.apache.org/jira/browse/NUTCH-112. Turns out the link in the cached.jsp page to the cached content is an absolute link, which breaks when you don't deploy the nutch webapp in the root context. I've

[Nutch-dev] Re: Urlfilter Patch

2005-12-01 Thread Chris Mattmann
Jerome, I think that this is a great idea and ensures that there isn't replication of so-called management information across the system. It could be easily implemented as a utility method because we have utility java classes that represent the ParsePluginList, that you could get the mimeTypes

[Nutch-dev] RE: Urlfilter Patch

2005-12-01 Thread Chris Mattmann
Hi Doug, Chris Mattmann wrote: In principle, the mimeType system should give us some guidance on determining the appropriate mimeType for the content, regardless of whether it ends in .foo, .bar or the like. Right, but the URL filters run long before we know the mime type

[Nutch-dev] RE: Urlfilter Patch

2005-12-01 Thread Chris Mattmann
Hi Jerome, Yes, the fetcher can't rely on the document mime-type. The only thing we can use for filtering is the document's URL. So another alternative could be to exclude only file extensions that are registered in the mime-type repository (some well known file extensions) but for which

[Nutch-dev] RE: [proposal] Generic Markup Language Parser

2005-11-24 Thread Chris Mattmann
generic forms of XML markup content. Cheers, Chris Mattmann On 24.11.2005 at 00:01, Jérôme Charron wrote: Hi, We (Chris Mattmann, François Martelet, Sébastien Le Callonnec and me) just added a new proposal on the nutch Wiki: http://wiki.apache.org/nutch

[Nutch-dev] RE: [proposal] Generic Markup Language Parser

2005-11-24 Thread Chris Mattmann
Hi Stefan, and Jerome, A mail archive is an amazing source of information, isn't it?! :-) To answer your question, just ask yourself how many pages per second you plan to fetch and parse and how many queries per second a lucene index is able to handle - and you can deliver in the ui. I

[Nutch-dev] RE: [jira] Commented: (NUTCH-88) Enhance ParserFactory plugin selection policy

2005-10-19 Thread Chris Mattmann
Hi Doug, I just noticed this comment from your original email: First, the ParserFactory sometimes uses LOG.severe() which causes the Fetcher to exit. Is there a reason this cannot be LOG.warning()? LOG.severe() should only be used if you intend the application to exit. This configuration

[Nutch-dev] Re: developing a parse-/index-/query- plugin set

2005-10-17 Thread Chris Mattmann
Hi Doug, Thanks, that worked. Cheers, Chris On 10/17/05 11:56 AM, Doug Cutting [EMAIL PROTECTED] wrote: Chris Mattmann wrote: So, my question to you then is, what type of QueryFilter should I develop in order to get my query for contactemail:email address to work as a standalone query

[Nutch-dev] RE: [jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters

2005-10-12 Thread Chris Mattmann
Hi, I'm not an XML expert by any means, but wouldn't it be simpler to just wrap any text where illegal chars are possible with a <![CDATA[ ]]> tag? That way, the offending characters won't be dropped and the process won't be lossy, no? If the CDATA method won't work, and there's no other way

[Nutch-dev] Re: [Nutch-cvs] [Nutch Wiki] Update of ParserFactoryImprovementProposal by ChrisMattmann

2005-09-15 Thread Chris Mattmann
Hi Otis, Point taken. In actuality since both convey the same information I think that it's okay to support both, but by default say we could code the initial plugins specified in parse-plugins.xml without the order= attribute. Fair enough? Cheers, Chris On 9/15/05 3:23 PM, [EMAIL

[Nutch-dev] Re: RSS Parser Bug!?

2005-09-08 Thread Chris Mattmann
Hi Jack, Wow, that's a weird error. I'm not exactly sure what's causing that, let me look at the stack trace you provided and get back to you at some point on that. As for your 2nd question: My question is that can parse-rss support application/xml or more content-type? The answer to that is

[Nutch-dev] Re: [jira] Created: (NUTCH-88) Enhance ParserFactory plugin selection policy

2005-09-08 Thread Chris Mattmann
Hi Jerome, I may have some time to work on this over the next few days if no one else does. So, if you're taking the lead on this, I volunteer my help if you'd like it. Thanks, Chris On 9/8/05 2:06 AM, Jerome Charron (JIRA) [EMAIL PROTECTED] wrote: Enhance ParserFactory plugin selection

[Nutch-dev] RE: [jira] Commented: (NUTCH-30) rss feed parser

2005-07-30 Thread Chris Mattmann
73005. The patch and source distro are zipped up in the file: parse-rss-73005.zip. Here is a direct link: http://issues.apache.org/jira/secure/attachment/12311475/parse-rss-73005.zip Thanks! Cheers, Chris Mattmann __ Chris A. Mattmann [EMAIL PROTECTED

[Nutch-dev] Re: [VOTE] new Nutch committers

2005-06-08 Thread Chris Mattmann
They have my votes. Great job so far guys. Jérôme Charron: +1 Piotr Kosiorowski: +1 Cheers, Chris Mattmann On 6/8/05 1:09 PM, Doug Cutting [EMAIL PROTECTED] wrote: I propose that we add Jérôme Charron and Piotr Kosiorowski as Nutch committers. Both Jérôme and Piotr have contributed many

[Nutch-dev] problems running crawl tool

2005-04-26 Thread Chris Mattmann
recently. Does anyone have any clue as to what causes it? I can attach the full crawl log if necessary. Please let me know. Thanks very much, Chris Mattmann __ Chris A. Mattmann [EMAIL PROTECTED] Staff Member Modeling and Data

[Nutch-dev] RE: parse-rss fetch problems

2005-04-20 Thread Chris Mattmann
, Chris Mattmann -Original Message- From: Marco PV [mailto:[EMAIL PROTECTED] Sent: Wednesday, April 20, 2005 7:24 PM To: nutch-dev@incubator.apache.org Subject: parse-rss fetch problems Hi, I'm using /nutch-nightly from April 18th. I've downloaded and uploaded the last src/plugin/parse

[Nutch-dev] RE: [jira] Commented: (NUTCH-30) rss feed parser

2005-04-20 Thread Chris Mattmann
. Thanks for trying to help Marco out, but his problem regarding compiling had to do with having the old (0.6) version of Nutch. Thanks, Chris __ -Original Message- From: Chris Mattmann [mailto:[EMAIL PROTECTED] Sent: Wednesday, April 20, 2005 8
