Re: [jira] Commented: (NUTCH-30) rss feed parser

2005-07-27 Thread Chris Mattmann
Hi Michael,

 The zip from Hasan isn't the working version of the parser. He tried to
upgrade some of my old code from the old net.nutch.* style package names to
the org.apache.* ones, but I had already performed that work. The working
version of the parser is the 5th attachment, titled
"parse-rss-srcbin-incl-path.zip", uploaded on 17/Apr/05 at 9:42 PM. This
was discussed back in April/May on the Nutch list, so you may have missed
that conversation. Here is a direct link to the parser that I uploaded:

http://issues.apache.org/jira/secure/attachment/19661/parse-rss-srcbin-incl-path.zip

Here is a link to a page where you can see the different attachments and
upload dates:

http://issues.apache.org/jira/secure/ManageAttachments.jspa?id=31220

One thing to note is that my plugin predates Andrzej's updates to the
protocol plugins and the parser code, so it may need to be updated to work
with the latest Nutch SVN. I currently manage the parse-rss plugin in a
local CVS of mine, so I will download the latest SVN of Nutch and see if
it works with it; if an updated patch is needed, I can take care of that.

Another thing to note is that Andrzej was working with me on getting this
plugin included in the source, but that was before he left for vacation, so
we may have to wait until he gets back before we make any progress on
committing it...


Cheers,
  Chris



On 7/27/05 8:42 AM, "Michael Nebel (JIRA)" <[EMAIL PROTECTED]> wrote:

> [ 
> http://issues.apache.org/jira/browse/NUTCH-30?page=comments#action_12316928 ]
> 
> Michael Nebel commented on NUTCH-30:
> 
> 
> I loaded the latest sources from the svn yesterday and tried to integrate this
> plugin (I used the zip from Hasan). I found:
> 
> - getParse throws a ParseException which isn't supported by getParse
> - the call to new ParseData needs a new parameter "ParseStatus"
> 
> My fixes are far from perfect (I have only just identified the problems), so I'm
> not creating a patch. :-(
> 
>> rss feed parser
>> ---
>> 
>>  Key: NUTCH-30
>>  URL: http://issues.apache.org/jira/browse/NUTCH-30
>>  Project: Nutch
>> Type: Improvement
>>   Components: fetcher
>> Reporter: Stefan Groschupf
>> Assignee: Chris A. Mattmann
>> Priority: Minor
>>  Attachments: RSSParserPatch.txt, RSS_Parser.zip, parse-rss-1.0-040605.zip,
>> parse-rss-patch.txt, parse-rss-srcbin-incl-path.zip, parse-rss.zip,
>> parseRss.zip
>> 
>> A simple rss feed parser supporting:
>> rss and atom:
>> + version 0.3
>> + version 0.9
>> + version 1.0
>> + version 2.0
>> Converting of different rss versions is done via xslt.
>> The xslt was contributed by Frank Henze - Thanks!

__
Chris A. Mattmann
[EMAIL PROTECTED]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group
 
_
Jet Propulsion Laboratory        Pasadena, CA
Office: 171-266B                 Mailstop: 171-246

___
 
Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.
 
 





RE: [jira] Commented: (NUTCH-30) rss feed parser

2005-07-30 Thread Chris Mattmann
Hi Folks,
 
  In response to Michael's comment, I've gone ahead and uploaded a working
patch and an updated source distribution for the parse-rss plugin.
The latest patch and source work against the new protocol and parsing APIs
by Andrzej. The patch was made against the latest SVN as of 7/30/05.

The patch and source distro are zipped up in the file: parse-rss-73005.zip.
Here is a direct link:
http://issues.apache.org/jira/secure/attachment/12311475/parse-rss-73005.zip


Thanks!

Cheers,
  Chris Mattmann
__
Chris A. Mattmann
[EMAIL PROTECTED] 
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_
Jet Propulsion Laboratory        Pasadena, CA
Office: 171-266B                 Mailstop: 171-246
Phone:  818-354-8810
___

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.

> -Original Message-
> From: Michael Nebel (JIRA) [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, July 27, 2005 8:42 AM
> To: [EMAIL PROTECTED]
> Subject: [jira] Commented: (NUTCH-30) rss feed parser
> 
> [ http://issues.apache.org/jira/browse/NUTCH-
> 30?page=comments#action_12316928 ]
> 
> Michael Nebel commented on NUTCH-30:
> 
> 
> I loaded the latest sources from the svn yesterday and tried to integrate
> this plugin (I used the zip from Hasan). I found:
> 
> - getParse throws a ParseException which isn't supported by getParse
> - the call to new ParseData needs a new parameter "ParseStatus"
> 
> My fixes are far from perfect (I have only just identified the problems), so
> I'm not creating a patch. :-(
> 
> > rss feed parser
> > ---
> >
> >  Key: NUTCH-30
> >  URL: http://issues.apache.org/jira/browse/NUTCH-30
> >  Project: Nutch
> > Type: Improvement
> >   Components: fetcher
> > Reporter: Stefan Groschupf
> > Assignee: Chris A. Mattmann
> > Priority: Minor
> >  Attachments: RSSParserPatch.txt, RSS_Parser.zip, parse-rss-1.0-
> 040605.zip, parse-rss-patch.txt, parse-rss-srcbin-incl-path.zip, parse-
> rss.zip, parseRss.zip
> >
> > A simple rss feed parser supporting:
> > rss and atom:
> > + version 0.3
> > + version 0.9
> > + version 1.0
> > + version 2.0
> > Converting of different rss versions is done via xslt.
> > The xslt was contributed by Frank Henze - Thanks!
> 
> --
> This message is automatically generated by JIRA.
> -
> If you think it was sent incorrectly contact one of the administrators:
>http://issues.apache.org/jira/secure/Administrators.jspa
> -
> For more information on JIRA, see:
>http://www.atlassian.com/software/jira



Re: RSS Parser Bug!?

2005-09-08 Thread Chris Mattmann
Hi Jack,

 Wow, that's a weird error. I'm not exactly sure what's causing it; let me
look at the stack trace you provided and get back to you on that. As for
your 2nd question:

> My question is that can parse-rss support "application/xml" or more
> content-type?

The answer to that is a resounding "yes". Parse-rss, being based on the
commons-feedparser, can support the following (taken from the feedparser
site, http://jakarta.apache.org/commons/sandbox/feedparser/):

"Jakarta FeedParser is a Java RSS/Atom parser designed to elegantly support
all versions of RSS (0.9, 0.91, 0.92, 1.0, and 2.0), Atom 0.5 (and future
versions) as well as easy ad hoc extension and RSS 1.0 modules capability."

In my experience using the feedparser, I have found this to be true as well.
The mimeType (what you list as "application/xml" above) is just what is
returned by the webserver to describe the content type of the RSS file;
parse-rss's support for RSS is really outside the scope of the mimeType. It
really has to do with the schema of the RSS being used, and as stated above,
parse-rss can support several different feed schemas (i.e., RSS 1.x, 2.x,
Atom, etc.)

Looking at the stack trace that you provided below, I'm not even sure that
it's making it to the point where the parse-rss plugin is getting called -
however, I'll have a look and see if I can figure out what your stack trace
is being caused by. Stay tuned...


Cheers,
  Chris




On 9/8/05 8:19 AM, "Jack Tang" <[EMAIL PROTECTED]> wrote:

> Hi Chris
> 
> Thanks for your explanation.
> I want to let the "application/xml" content type go into the parse-rss
> plugin, so I added the statement
> I add the statement
> 
> if (contentType != null
>     && (!contentType.startsWith("text/xml")
>         && !contentType.startsWith("application/rss+xml")
>         && !contentType.startsWith("application/xml")))
>   return new ParseStatus(ParseStatus.FAILED_INVALID_FORMAT,
>       "Content-Type not text/xml, application/xml or application/rss+xml: "
>       + contentType).getEmptyParse();
> 
> 
>  But unfortunately, it failed again.  Here is the error message:
> --
> ---
> 050908 231018 org.apache.nutch.protocol.httpclient.Http [11] -
> http.proxy.host = null
> 050908 231018 org.apache.nutch.protocol.httpclient.Http [11] -
> http.proxy.port = 8080
> 050908 231018 org.apache.nutch.protocol.httpclient.Http [11] -
> http.timeout = 1
> 050908 231018 org.apache.nutch.protocol.httpclient.Http [11] -
> http.content.limit = 65536
> 050908 231018 org.apache.nutch.protocol.httpclient.Http [11] -
> http.agent = NutchCVS/0.06-dev (Nutch;
> http://www.nutch.org/docs/en/bot.html;
> [EMAIL PROTECTED])
> 050908 231018 org.apache.nutch.protocol.httpclient.Http [11] -
> http.auth.ntlm.username =
> 050908 231018 org.apache.nutch.protocol.httpclient.Http [11] -
> fetcher.server.delay = 1000
> 050908 231018 org.apache.nutch.protocol.httpclient.Http [11] -
> http.max.delays = 100
> 050908 231018 org.apache.nutch.protocol.httpclient.Http [11] - Configured
> Client
> 050908 231023 org.apache.nutch.fetcher.Fetcher$FetcherThread [11] -
> SEVERE error writing output:java.lang.NullPointerException
> java.lang.NullPointerException
> at org.apache.nutch.io.UTF8.writeString(UTF8.java:236)
> at org.apache.nutch.parse.Outlink.write(Outlink.java:51)
> at org.apache.nutch.parse.ParseData.write(ParseData.java:111)
> at org.apache.nutch.io.SequenceFile$Writer.append(SequenceFile.java:137)
> at org.apache.nutch.io.MapFile$Writer.append(MapFile.java:127)
> at org.apache.nutch.io.ArrayFile$Writer.append(ArrayFile.java:39)
> at org.apache.nutch.fetcher.Fetcher$FetcherThread.outputPage(Fetcher.java:281)
> at 
> org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:262)
> at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:148)
> Exception in thread "main" java.lang.RuntimeException: SEVERE error
> logged.  Exiting fetcher.
> at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:354)
> at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:488)
> at net.recruit.fetch.JobCrawlTool.main(JobCrawlTool.java:150)
> 
> It seems like a plugin conflict?
> My question is that can parse-rss support "application/xml" or more
> content-type?
> 
> Thanks
> /Jack
> 
> On 9/8/05, CHRIS A MATTMANN <[EMAIL PROTECTED]> wrote:
>> Hi Jack,
>> 
>>   I'm not necessarily sure that this is a "bug" per se: it's just the fact
>> that several different content types are potentially possible when any ol'
>> webserver returns an RSS file. To be honest, I performed a pretty detailed
>> crawl (100s of thousands of pages) when I originally wrote the plugin way
>> back in March/April of this year, and the two content types that you see in
>> the code right now that it checks for are what I found to be the most
>> pervasive content type returned from webservers for RSS. However, in n

Re: [jira] Created: (NUTCH-88) Enhance ParserFactory plugin selection policy

2005-09-08 Thread Chris Mattmann
Hi Jerome,

  I may have some time to work on this over the next few days if no one else
does. So, if you're taking the lead on this, I volunteer my help if you'd
like it.

Thanks,
 Chris



On 9/8/05 2:06 AM, "Jerome Charron (JIRA)" <[EMAIL PROTECTED]> wrote:

> Enhance ParserFactory plugin selection policy
> -
> 
>  Key: NUTCH-88
>  URL: http://issues.apache.org/jira/browse/NUTCH-88
>  Project: Nutch
> Type: Improvement
>   Components: indexer
> Versions: 0.7, 0.8-dev
> Reporter: Jerome Charron
>  Fix For: 0.8-dev
> 
> 
> The ParserFactory chooses the Parser plugin to use based on the content-types
> and path-suffix defined in each parser's plugin.xml file.
> The selection policy is as follows:
> Content type has priority: the first plugin found whose "contentType"
> attribute matches the beginning of the content's type is used.
> If none match, then the first whose "pathSuffix" attribute matches the end of
> the url's path is used.
> If neither of these match, then the first plugin whose "pathSuffix" is the
> empty string is used.
> 
> This policy has a lot of problems when no match is found, because a random
> parser is used (and there is a good chance this parser can't handle the
> content).
> On the other hand, the content-type associated with a parser plugin is
> specified in the plugin.xml of each plugin (this is the value used by the
> ParserFactory), AND each parser also checks the content-type itself in its
> code (using a hard-coded content-type value, not the value specified in the
> plugin.xml => possibility of mismatches between the hard-coded content-type
> and the content-type declared in plugin.xml).
> 
> A complete list of problems and discussion about this point is available in:
>   * http://www.mail-archive.com/nutch-user%40lucene.apache.org/msg00744.html
>   * http://www.mail-archive.com/nutch-dev%40lucene.apache.org/msg00789.html
> 

__
Chris A. Mattmann
[EMAIL PROTECTED]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group
 
_
Jet Propulsion Laboratory        Pasadena, CA
Office: 171-266B                 Mailstop: 171-246
___
 
Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.
 
 





Re: [Nutch-cvs] [Nutch Wiki] Update of "ParserFactoryImprovementProposal" by ChrisMattmann

2005-09-15 Thread Chris Mattmann
Hi Otis,



On 9/15/05 10:14 AM, "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
wrote:

> Quick comment about order="N" and the paragraph that describes how to
> deal with cases where people mess things up and enter multiple plugins
> for the same content type and the same order:
> 
> - Why is the order attribute even needed?  It looks like a redundant
> piece of information - why not derive order from the order of plugin
> definitions in the XML file?

Well, yes and no. Having the "order=" attribute explicitly forces the user
to decide up front what the order should be. By deriving order from the
order in which the tags are parsed, you hide that information from the user
and assume they know that order matters.

However, in the interest of time and brevity, we decided that it would be
nice to support both options: if users specify an order, then accept it;
otherwise, if they don't specify one, apply the ordering that you suggest.

We think that such a solution accommodates both types of users: users who
like to explicitly spell out options in their XML, and users who like the
shorthand of just listing the elements in order. A sketch of that fallback
is below. We are open to suggestions on this, though...
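
A minimal sketch of that fallback, with hypothetical names (PluginEntry and
sortByPreference are illustrative, not committed Nutch code):

import java.util.Comparator;
import java.util.List;

// Sketch only: honor an explicit order="N" attribute when present,
// otherwise fall back to the element's position in the file.
class OrderSketch {
  static class PluginEntry {
    final String id;
    final int order;          // parsed order="N" attribute, or -1 if absent
    final int documentIndex;  // position of the element in the XML file
    PluginEntry(String id, int order, int documentIndex) {
      this.id = id; this.order = order; this.documentIndex = documentIndex;
    }
  }

  // Entries with an explicit order sort by it; entries without one keep
  // their document order (how the two mix in one file is a design choice).
  static void sortByPreference(List<PluginEntry> entries) {
    entries.sort(Comparator.comparingInt(
        (PluginEntry e) -> e.order >= 0 ? e.order : e.documentIndex));
  }
}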

Thanks for your comments Otis, and for reading the proposal!

Cheers,
  Chris

> 
> For instance:
> Instead of this:
> 
>   <mimeType name="text/plain">
>     <plugin id="parse-text" order="1" />
>     <plugin id="another-one-default-parser" order="2" />
>   </mimeType>
> 
> We have this:
> 
>   <mimeType name="text/plain">
>     <plugin id="parse-text" />
>     <plugin id="another-one-default-parser" />
>   </mimeType>
> 
> parse-text first, another-one-default-parser second.  Less typing, and
> we avoid the case of equal ordering altogether.
> 
> Otis
> 
> 
> --- Apache Wiki <[EMAIL PROTECTED]> wrote:
> 
>> Dear Wiki user,
>> 
>> You have subscribed to a wiki page or wiki category on "Nutch Wiki"
>> for change notification.
>> 
>> The following page has been changed by ChrisMattmann:
>> http://wiki.apache.org/nutch/ParserFactoryImprovementProposal
>> 
>> The comment on the change is:
>> Initial Draft of ParserFactoryImprovementProposal
>> 
>> New page:
>> = Parser Factory Improvement Proposal =
>> 
>> 
>> == Summary of Issue ==
>> Currently Nutch provides a plugin mechanism wherein plugins register
>> certain metadata about themselves, including their id, classname, and
>> so forth. In particular, the set of parsing plugins register which
>> contentTypes and file suffixes they can support with a
>> PluginRepository.
>> 
>> One “adopted practice” in current Nutch parsing plugins
>> (committed in Subversion, e.g., see parse-pdf, parse-rss, etc.) has
>> also been to verify that the content type passed to it during a fetch
>> is indeed one of the contentTypes that it supports (be it
>> application/xml, or application/pdf, etc.). This practice is
>> cumbersome for a few reasons:
>> 
>>  *Any updates to supported content types for a parsing plugin will
>> require a recompilation of the plugin code
>>  *Checking for “hard coded” content types within the parsing
>> plugin is a duplication of information that already exists in the
>> plugin’s descriptor file, plugin.xml
>>  *By the time that content gets to a parsing plugin, (e.g., the
>> parsing plugin is returned by the ParserFactory, and provided content
>> during a fetch), the ParsingFactory should have already ensured that
>> the appropriate plugin is getting called for a particular
>> contentType.
>> 
>> In addition to this problem is the fact that several parsing plugins
>> may all support many of the same content types. For instance, the
>> parse-js plugin may be the only well-suited parsing plugin for
>> javascript, but perhaps it may also provide a good enough heuristic
>> parser for plain text as well, and so it may support both types.
>> However, there may be a parsing plugin for text (which there is!),
>> parse-text, whose primary purpose is to parse plain text as well.
>> 
>> == Suggested Remedy ==
>> To deal with ensuring the desired parsing plugin is called for the
>> appropriate content type, and to in effect, “kill two birds with
>> one stone”, we propose that there be a parsing plugin preference
>> list for each content type that Nutch knows how to handle, i.e., each
>> content type available via the mimeType system. Therefore, during a
>> fetch, once the appropriate mimeType has been determined for content,
>> and the ParserFactory is tasked with returning a parsing plugin, the
>> ParserFactory should consult a preference list for that contentType,
>> allowing it to determine which plugin has the highest preference for
>> the contentType. That parsing plugin should be returned via the
>> ParserFactory to the fetcher. If there is any problem using the
>> initial returned parsing plugin for a particular contentType (i.e.,
>> if a ParseException is thrown during the parse, or a null ParseStatus
>> is returned), then the ParserFactory should be called again, this
>> time asking for the “next highest ranked” plugin for that
>> contentType. Such a process should repeat on and
>
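
A minimal sketch of the retry policy described in the quoted proposal, with
hypothetical names (the Parser interface and the ranked list are
illustrative; the proposal does not fix an exact API):

import java.util.List;

// Sketch only: walk the preference list for a content type, falling
// through to the next-ranked plugin on failure.
class PreferenceListSketch {
  interface Parser {
    Object getParse(byte[] content) throws Exception;
  }

  static Object parse(List<Parser> rankedParsers, byte[] content) {
    for (Parser p : rankedParsers) {      // highest preference first
      try {
        Object parse = p.getParse(content);
        if (parse != null) {
          return parse;                   // stop at the first good parse
        }
      } catch (Exception e) {
        // ParseException-style failure: try the next-ranked plugin
      }
    }
    return null;                          // no registered plugin could parse
  }
}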

Re: [Nutch-cvs] [Nutch Wiki] Update of "ParserFactoryImprovementProposal" by ChrisMattmann

2005-09-15 Thread Chris Mattmann
Hi Otis,

 Point taken. In actuality, since both convey the same information, I think
it's okay to support both; by default, though, we could code the initial
plugins specified in parse-plugins.xml without the "order=" attribute. Fair
enough?

Cheers,
  Chris



On 9/15/05 3:23 PM, "[EMAIL PROTECTED]" <[EMAIL PROTECTED]> wrote:

> Well, you have to tell users about order="N" somewhere in the docs.
> Instead of telling them about order="N", tell them that the order in
> XML matters.  Either case requires education, and the latter one
> requires less typing and avoids the case described in the proposal.
> 
> Otis
> 
> --- Sébastien LE CALLONNEC <[EMAIL PROTECTED]> wrote:
> 
>> Hi Otis,
>> 
>> 
>> This issue arose during our discussion for this proposal, and my
>> feeling was that the XML specification doesn't state that the order
>> is
>> significant in an XML file.  I therefore read the spec again, and
>> indeed didn't find anything on that subject...
>> 
>> I think it is somehow reasonable to consider that a parser _might_
>> return the elements in a different order—though, as I mentioned to
>> Chris & Jerome, that would be quite unheard of, and, to be honest,
>> rather irritating.
>> 
>> What do you think?
>> 
>> 
>> Regards,
>> Sebastien.
>> 
>> 
>> 
>>> Quick comment about order="N" and the paragraph that describes how
>> to
>>> deal with cases where people mess things up and enter multiple
>>> plugins
>>> for the same content type and the same order:
>>> 
>>> - Why is the order attribute even needed?  It looks like a
>> redundant
>>> piece of information - why not derive order from the order of
>> plugin
>>> definitions in the XML file?
>>> 
>>> For instance:
>>> Instead of this:
>>> 
>>>   <mimeType name="text/plain">
>>>     <plugin id="parse-text" order="1" />
>>>     <plugin id="another-one-default-parser" order="2" />
>>>   </mimeType>
>>> 
>>> We have this:
>>> 
>>>   <mimeType name="text/plain">
>>>     <plugin id="parse-text" />
>>>     <plugin id="another-one-default-parser" />
>>>   </mimeType>
>>> 
>>> parse-text first, another-one-default-parser second.  Less typing, and
>>> we avoid the case of equal ordering altogether.
>>> 
>>> Otis
>>> 
>>> 
>>> --- Apache Wiki <[EMAIL PROTECTED]> wrote:
>>> 
 Dear Wiki user,
 
 You have subscribed to a wiki page or wiki category on "Nutch
>> Wiki"
 for change notification.
 
 The following page has been changed by ChrisMattmann:
 http://wiki.apache.org/nutch/ParserFactoryImprovementProposal
 
 The comment on the change is:
 Initial Draft of ParserFactoryImprovementProposal
 
 New page:
 = Parser Factory Improvement Proposal =
 
 
 == Summary of Issue ==
 Currently Nutch provides a plugin mechanism wherein plugins register
 certain metadata about themselves, including their id, classname, and
 so forth. In particular, the set of parsing plugins register which
 contentTypes and file suffixes they can support with a
 PluginRepository.
 
 One “adopted practice” in current Nutch parsing plugins
 (committed in Subversion, e.g., see parse-pdf, parse-rss, etc.) has
 also been to verify that the content type passed to it during a fetch
 is indeed one of the contentTypes that it supports (be it
 application/xml, or application/pdf, etc.). This practice is
 cumbersome for a few reasons:
 
  *Any updates to supported content types for a parsing plugin will
 require a recompilation of the plugin code
  *Checking for “hard coded” content types within the parsing
 plugin is a duplication of information that already exists in the
 plugin’s descriptor file, plugin.xml
  *By the time that content gets to a parsing plugin, (e.g., the
 parsing plugin is returned by the ParserFactory, and provided content
 during a fetch), the ParsingFactory should have already ensured that
 the appropriate plugin is getting called for a particular
 contentType.
 
 In addition to this problem is the fact that several parsing plugins
 may all support many of the same content types. For instance, the
 parse-js plugin may be the only well-suited parsing plugin for
 javascript, but perhaps it may also provide a good enough heuristic
 parser for plain text as well, and so it may support both types.
 However, there may be a parsing plugin for text (which there is!),
 parse-text, whose primary purpose is to parse plain text as well.
 
 == Suggested Remedy ==
 To deal with ensuring the desired parsing plugin is called for the
 appropriate content type, and to in effect, “kill two birds with
 one stone”, we propose that there be a parsing plugin preference
 list for each content type that Nutch knows how to handle, i.e., each
 content type available via the mimeType system. Therefore, during a
 fetch, once the appropriate mimeType has been determined for content,
 and the ParserFactory is tasked with returning a parsing plugin, the
 ParserFactory should consult a preference list for

failing of org.apache.nutch.tools.TestSegmentMergeTool?

2005-09-26 Thread Chris Mattmann
Hi there,

 

 I just noticed after checking out the latest SVN of Nutch that I am
currently failing the TestSegmentMergeTool JUnit test when I type "ant test"
for Nutch. Is anyone experiencing the same problem? Here is the relevant
information which I captured out of the
$NUTCH_HOME/build/test/TEST-org.apache.nutch.tools.TestSegmentMergeTool.txt
file:

 

Testsuite: org.apache.nutch.tools.TestSegmentMergeTool

Tests run: 3, Failures: 1, Errors: 0, Time elapsed: 46.256 sec

- Standard Error -

050926 215316 parsing file:/C:/Program%20Files/eclipse/workspace/nutch/conf/nutch-default.xml
050926 215316 parsing file:/C:/Program%20Files/eclipse/workspace/nutch/build/test/classes/nutch-site.xml
050926 215316 Using URL normalizer: org.apache.nutch.net.BasicUrlNormalizer
050926 215321 No FS indicated, using default:local
050926 215321 * Opening 10 segments:
050926 215321  - segment seg0: 500 records.
050926 215321  - segment seg1: 500 records.
050926 215321  - segment seg2: 500 records.
050926 215321  - segment seg3: 500 records.
050926 215321  - segment seg4: 500 records.
050926 215321  - segment seg5: 500 records.
050926 215321  - segment seg6: 500 records.
050926 215321  - segment seg7: 500 records.
050926 215321  - segment seg8: 500 records.
050926 215321  - segment seg9: 500 records.
050926 215321 * TOTAL 5000 input records in 10 segments.
050926 215321 * Creating master index...
050926 215328 * Creating index took 6356 ms
050926 215328 * Optimizing index took 0 ms
050926 215328 * Removing duplicate entries...
050926 215328 * Deduplicating took 652 ms
050926 215328 * Merging all segments into output
050926 215333 * Merging took 4381 ms
050926 215333 * Deleting old segments...
050926 215333 Finished SegmentMergeTool: INPUT: 5000 -> OUTPUT: 5000 entries in 12.15 s (416.6 entries/sec).
050926 215339 No FS indicated, using default:local
050926 215339 * Opening 10 segments:
050926 215339  - segment seg0: 500 records.
050926 215339  - segment seg1: 500 records.
050926 215339  - segment seg2: 500 records.
050926 215339  - segment seg3: 500 records.
050926 215339  - segment seg4: 500 records.
050926 215339  - segment seg5: 500 records.
050926 215339  - segment seg6: 500 records.
050926 215339  - segment seg7: 500 records.
050926 215339  - segment seg8: 500 records.
050926 215339  - segment seg9: 500 records.
050926 215339 * TOTAL 5000 input records in 10 segments.
050926 215339 * Creating master index...
050926 215344 * Creating index took 5083 ms
050926 215344 * Optimizing index took 0 ms
050926 215344 * Removing duplicate entries...
050926 215344 * Deduplicating took 150 ms
050926 215344 * Merging all segments into output
050926 215345 * Merging took 662 ms
050926 215345 * Deleting old segments...
050926 215345 Finished SegmentMergeTool: INPUT: 5000 -> OUTPUT: 500 entries in 6.316 s (833. entries/sec).

java.lang.Exception: Missing or invalid 'fetcher' or 'fetcher_output'
directory in
c:\DOCUME~1\mattmann\LOCALS~1\Temp\.smttest63088\output\.fastmerge_index
at org.apache.nutch.segment.SegmentReader.isParsedSegment(SegmentReader.java:168)
at org.apache.nutch.segment.SegmentReader.<init>(SegmentReader.java:143)
at org.apache.nutch.segment.SegmentReader.<init>(SegmentReader.java:82)
at org.apache.nutch.tools.TestSegmentMergeTool.testSameMerge(TestSegmentMergeTool.java:185)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:324)
at junit.framework.TestCase.runTest(TestCase.java:154)
at junit.framework.TestCase.runBare(TestCase.java:127)
at junit.framework.TestResult$1.protect(TestResult.java:106)
at junit.framework.TestResult.runProtected(TestResult.java:124)
at junit.framework.TestResult.run(TestResult.java:109)
at junit.framework.TestCase.run(TestCase.java:118)
at junit.framework.TestSuite.runTest(TestSuite.java:208)
at junit.framework.TestSuite.run(TestSuite.java:203)
at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.run(JUnitTestRunner.java:289)
at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.main(JUnitTestRunner.java:523)

junit.framework.AssertionFailedError: Missing or invalid 'fetcher' or
'fetcher_output' directory in
c:\DOCUME~1\mattmann\LOCALS~1\Temp\.smttest63088\output\.fastmerge_index
at junit.framework.Assert.fail(Assert.java:47)
at org.apache.nutch.tools.TestSegmentMergeTool.testSameMerge(TestSegmentMergeTool.java:190)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflec

Re: failing of org.apache.nutch.tools.TestSegmentMergeTool?

2005-09-27 Thread Chris Mattmann
You know what the crazy thing is:

Seemingly, all tests pass now. And I didn't change a thing. Honest. I swear.

Very strange, indeed, but I'm happy because at least the tests are passing!
:-)

Cheers,
  Chris



On 9/27/05 12:29 PM, "Paul Baclace" <[EMAIL PROTECTED]> wrote:

> Chris Mattmann wrote:
>>  I just noticed after checking out the latest SVN of Nutch that I am
>> currently failing the TestSegmentMergeTool Junit test when I type "ant test"
>> for Nutch. 
> 
> I'm on the mapred branch, not the trunk, and all tests pass.
> 
> One thing I have noticed is that it is best to start with 'ant clean'
> and if you made any mods to the conf files, rewind them back by copying
> the x.template files to x.
> 
> Paul

__
Chris A. Mattmann
[EMAIL PROTECTED]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group
 
_
Jet Propulsion Laboratory        Pasadena, CA
Office: 171-266B                 Mailstop: 171-246
___
 
Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.
 
 





RE: [jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters

2005-10-12 Thread Chris Mattmann
Hi,

 I'm not an XML expert by any means, but wouldn't it be simpler to just wrap
any text where illegal chars are possible in a CDATA section? That way, the
offending characters won't be dropped and the process won't be lossy, no?

  If the CDATA method won't work, and there's no other way to solve the
problem without losing text, then your patch has my +1.
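
For reference, the filtering the patch describes amounts to a check against
the XML 1.0 Char production — a minimal sketch, not the attached patch
itself (supplementary characters expressed as surrogate pairs are omitted
for brevity):

// Sketch: keep only characters legal per the XML 1.0 Char production
// (http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char).
public class XmlCharFilter {
  public static String stripIllegalXmlChars(String text) {
    StringBuilder out = new StringBuilder(text.length());
    for (int i = 0; i < text.length(); i++) {
      char c = text.charAt(i);
      boolean legal = c == 0x9 || c == 0xA || c == 0xD
          || (c >= 0x20 && c <= 0xD7FF)
          || (c >= 0xE000 && c <= 0xFFFD);
      if (legal) {
        out.append(c);  // anything else (e.g. FF, decimal 12) is dropped
      }
    }
    return out.toString();
  }
}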

Cheers,
 Chris


__
Chris A. Mattmann
[EMAIL PROTECTED] 
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_
Jet Propulsion Laboratory        Pasadena, CA
Office: 171-266B                 Mailstop: 171-246
___

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.

> -Original Message-
> From: [EMAIL PROTECTED] (JIRA) [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, October 12, 2005 5:19 PM
> To: nutch-dev@incubator.apache.org
> Subject: [jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml
> characters
> 
>  [ http://issues.apache.org/jira/browse/NUTCH-110?page=all ]
> 
> [EMAIL PROTECTED] updated NUTCH-110:
> 
> 
> Attachment: fixIllegalXmlChars.patch
> 
> Attached patch runs all xml text through a check for bad xml characters.
> This patch is brutal, silently dropping illegal characters.  The patch was
> made after hunting through xalan, the jdk, and nutch itself for a method
> that would do the above filtering, but I was unable to find any such
> method -- perhaps an oversight on my part?
> 
> > OpenSearchServlet outputs illegal xml characters
> > 
> >
> >  Key: NUTCH-110
> >  URL: http://issues.apache.org/jira/browse/NUTCH-110
> >  Project: Nutch
> > Type: Bug
> >   Components: searcher
> > Versions: 0.7
> >  Environment: linux, jdk 1.5
> > Reporter: [EMAIL PROTECTED]
> >  Attachments: fixIllegalXmlChars.patch
> >
> > OpenSearchServlet does not check text-to-output for illegal xml
> characters; dependent on  search result, its possible for OSS to output
> xml that is not well-formed.  For example, if text has the character FF
> character in it -- -- i.e. the ascii character at position (decimal) 12 --
> the produced XML will show the FF character as '' The
> character/entity '' is not legal in XML according to
> http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char.
> 
> --
> This message is automatically generated by JIRA.
> -
> If you think it was sent incorrectly contact one of the administrators:
>http://issues.apache.org/jira/secure/Administrators.jspa
> -
> For more information on JIRA, see:
>http://www.atlassian.com/software/jira



developing a parse-/index-/query- plugin set

2005-10-16 Thread Chris Mattmann
Hi Folks,

 

  I was wondering if anybody could give me some advice on what I'm doing
wrong in the following situation.

I am trying to fetch and search some bioinformatics data with specific data
elements that I want to parse out, index, and search on. For instance, for
each page of data I fetch, I would like to store things like PROTOCOL_ID
and CONTACT_EMAIL. To go about this, I wrote a parse-specimen plugin to
pull out the specific metadata elements I wanted to index. I have tested
and verified that this part of the process is working: after the page
content is fetched, I've instrumented the code with LOG.log calls to verify
that the metadata is being added to the Properties object that is sent back
with the ParseImpl.

I then wrote an index-specimen plugin that takes the reconstructed parse
data (as all indexing plugins do), gets out the specific properties that I
captured during the parse, adds them to the Lucene document, and returns
the document. I have verified that this portion of the process is working
as well; again, I have instrumented the code with LOG.log calls and
verified that the fields are getting added to the Document object, which is
then returned.

I then thought I could deploy and start up the Nutch web app and run
queries like "PROTOCOL_ID:36.0" and "CONTACT_EMAIL:[EMAIL PROTECTED]", and
since the metadata was stored in the index, the hits would come back.
However, of course, I found out that this wasn't the case. After some
snooping around, I saw that in order for the query to work right, a user
needs to write a query-xxx plugin that declares its support for the
specific fields that were indexed and that you want to search on. Well,
I've been trying to do this for the last day and a half, and for the life
of me, I can't get the thing working. Could someone give me some help or
suggestions on how to do this?

To write my query-specimen plugin that I have now, which doesn't work, I
used the query-more plugin as a model. I've written two classes which
extend the RawFieldQueryFilter, to test whether I could at least get the
PROTOCOL_ID and CONTACT_EMAIL queries working: a ProtocolIDQueryFilter
class and a ContactEmailQueryFilter class, each of which just extends the
RawFieldQueryFilter class and passes "PROTOCOL_ID" or "CONTACT_EMAIL" to
its constructor -- again, this is what I saw in the query-more plugin that
I used as an example. Additionally, in my plugin.xml file for the
query-specimen plugin, I've declared that my plugin supports those 2 raw
fields, along these lines (following the query-more pattern):

   <extension point="org.apache.nutch.searcher.QueryFilter" ...>
      <implementation id="ProtocolIDQueryFilter" class="...">
         <parameter name="fields" value="PROTOCOL_ID"/>
      </implementation>
      <implementation id="ContactEmailQueryFilter" class="...">
         <parameter name="fields" value="CONTACT_EMAIL"/>
      </implementation>
   </extension>
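
For reference, each of the two filter classes is essentially a one-line
subclass — a minimal sketch modeled on the query-more pattern, assuming the
0.7-era RawFieldQueryFilter constructor (per Doug's reply later in this
thread, the single-argument form leaves the clause boost at 0.0, which
turns out to matter):

import org.apache.nutch.searcher.RawFieldQueryFilter;

// Sketch only: package and constructor assumed from the query-more pattern.
public class ProtocolIDQueryFilter extends RawFieldQueryFilter {
  public ProtocolIDQueryFilter() {
    super("PROTOCOL_ID");  // the raw field this filter handles
  }
}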

However, after rebuilding the Nutch webapp with the query-specimen plugin
enabled (which I have verified via the LOG files), queries such as
"PROTOCOL_ID:36.0" still don't work. I've verified that the fields were
indexed correctly, and that 36.0 is actually a valid value for PROTOCOL_ID:
when I do a regular query that I know returns hits (I've only indexed 3
documents so far) and click on the "explain" link, it shows that I have
indexed all the fields I wanted to query on (such as PROTOCOL_ID and
CONTACT_EMAIL), and it shows me the values for each field, such as
PROTOCOL_ID = 36.0. So now I'm stuck: I can't get the queries to work, and
if anyone can help me with this, I would be really appreciative.

Oh yeah, one more thing: it turns out that a lot of my field values are
numeric-like, such as 36.0, 2.0, etc. However, I indexed them as
Field.Text() values in the Lucene document. I've never done this before, so
if that was the wrong thing to do, then that might be the problem? Here is
the snippet of code in my index-specimen plugin where I index the fields:

 

public Document filter(Document doc, Parse parse, FetcherOutput fo)
    throws IndexingException {

  // get the parse metadata
  Properties metadata = parse.getData().getMetadata();

  for (int i = 0; i < edrnCDES.length; i++) {
    String key = edrnCDES[i];
    String val = (String) metadata.get(key);
    if (val != null) {
      LOG.log(Level.INFO, "SpecimenIndexer: adding [" + key + "=>" + val + "]");
      doc.add(Field.Text(key, val));
    }
  }

  return doc;
}

 

 

"edrnCDES" is an array of the field names I want to index, such as
"PROTOCOL_ID" and "CONTACT_EMAIL". So, does the fact that some of these
fields are numerical values make a difference, even though I'm trying to
index them as text? I mean, one thing that I know is that even the
non-numerical values, e.g., CONTACT_EMAIL, aren't working, so I suspect that
the numeri

RE: developing a parse-/index-/query- plugin set

2005-10-16 Thread Chris Mattmann
Hi Folks,

 

 I've done some tracing on the problem that I previously posted to the list
about, developing a parse-/index-/query- plugin set. It seems that by
default the NutchAnalysis class turns all fields into lower case, e.g.,
PROTOCOL_ID gets turned into protocol_id. Then, in the Query.parse method,
there is a call to Query.fixup. The fixup method, if it can't match a field
in a clause to one of the fields provided by filters registered with the
QueryFilters class, turns the field into a Default field, so protocol_id
gets turned into "protocol id", two separate strings, or tokens. Then, to
top it all off, the query PROTOCOL_ID:36.0 gets turned into "protocol id 36
0".

 

 So, one thing it seems is that fields to be indexed and used in a field
query must be fully lowercase to work? Additionally, it seems that they
can't have symbols in them, such as "_"; is that correct? Would you guys
consider this to be a bug? I mean, maybe it was your intention for it to be
this way, but I haven't found anything in the Nutch documentation that
states that fields should be lowercase.

 

 Okay, so that's one thing. What I did then was make all my fields
lowercase, and I removed the "_" character, so PROTOCOL_ID becomes
protocolid. However, I'm still stuck. Now the query gets formulated
correctly; for instance, "protocolid:36.0" gets translated to
protocolid:36.0 when it is sent to the filter. Then, the filter that I
wrote correctly recognizes that it can handle that term, and it adds a
TermQuery clause to the BooleanQuery output from the QueryFilters class.
However, my query for protocolid:36.0 still returns nothing. I've traced the
call all the way down to the LuceneQueryOptimizer.optimize method, and
added two System.out.printlns at the end of that method to see what's going
on. Here is the small snippet of code:

 

 

if (sortField == null && !reverse) {
  System.out.println("Performing Lucene Query: " + query);
  System.out.println("using filter " + filter + " and numHits = " + numHits);
  return searcher.search(query, filter, numHits);
} else {
  return searcher.search(query, filter, numHits,
                         new Sort(sortField, reverse));
}

 

 

Okay, and here is what those System.out.printlns print out for me:

 

 

Performing Lucene Query:
using filter QueryFilter(+contactemail:[EMAIL PROTECTED]) and numHits = 20
051016 190347 11 total hits: 0

 

 

However, as I mentioned, even though I can look at my results and see that
there is a result with:

 

contactemail = [EMAIL PROTECTED]

 

I still get no hits. Does anybody have any clue as to what I'm doing wrong? 

 

 

Thanks in advance.

 

Cheers,

  Chris

 

__
Chris A. Mattmann
[EMAIL PROTECTED] 
Staff Member
Modeling and Data Management Systems Section (387)

Data Management Systems and Technologies Group

_
Jet Propulsion Laboratory        Pasadena, CA
Office: 171-266B                 Mailstop: 171-246
Phone:  818-354-8810
___

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.

 



Re: developing a parse-/index-/query- plugin set

2005-10-17 Thread Chris Mattmann
Hi Andrzej,


On 10/17/05 10:59 AM, "Andrzej Bialecki" <[EMAIL PROTECTED]> wrote:

> Chris Mattmann wrote:
> 
>> I still get no hits. Does anybody have any clue as to what I'm doing wrong?
> 
> I have a clue (which is not the same as a solution ;-) ). Please use
> Luke and check what the terms look like in your index. The best way to do
> it is to open the index, then go to one of the documents and press
> "Reconstruct & Edit". In the dialog that pops up you will have all the
> fields' content, and also how they were tokenized (which is more
> important). It's possible that NutchAnalyzer swallowed some of the text
> you are looking for... you should see that in the tokenized field
> content. If your query plugin returns the clause as you wrote it, i.e.
> with the at sign, dots and whatever, then a corresponding token needs to
> show up in the tokenized content - and I bet it doesn't, because it was
> broken into parts by the tokenizer...
> 

I downloaded Luke from the getopt site during the peak of my frustration,
and then browsed my small index of 3 documents (which I can send to you in a
separate email if you want to look at it; it's really small). I looked up
the "contactemail" field for one of the documents in the index. I also
verified, as I mentioned, that my query was being captured by the filter
correctly. For instance, a query for
"contactemail:[EMAIL PROTECTED]" correctly shows up as:
"contactemail:[EMAIL PROTECTED]". When I used Luke to look up the
doc in the index and its corresponding contactemail field, here is what it
appeared as under the "tokenized" tab:

"[EMAIL PROTECTED]"

Which is the exact same way that it was stored, and the same way that I
queried on it. So I'm not really sure what the problem is here. Thanks for
the suggestion, however. Any other ideas? :-)


Take care,
  Chris


__
Chris A. Mattmann
[EMAIL PROTECTED]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group
 
_
Jet Propulsion Laboratory        Pasadena, CA
Office: 171-266B                 Mailstop: 171-246
___
 
Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.
 
 





Re: developing a parse-/index-/query- plugin set

2005-10-17 Thread Chris Mattmann
Hi Doug,


On 10/17/05 11:38 AM, "Doug Cutting" <[EMAIL PROTECTED]> wrote:

> Chris Mattmann wrote:
>>  So, one thing it seems is that fields to be indexed, and used in a field
>> query must be fully lowercase to work? Additionally, it seems that they
>> can't have symbols in them, such as "_", is that correct? Would you guys
>> consider this to be a bug?
> 
> Yes, this sounds like a bug.

Okay, I will look and see if I can figure out why this is happening and if I
can, I will try and submit a patch.


> 
>> Performing Lucene Query:
>> 
>> using filter QueryFilter(+contactemail:[EMAIL PROTECTED]) and
>> numHits = 20
>> 
>> 051016 190347 11 total hits: 0
> 
> A query whose only clause has a boost of 0.0 will return no results.
> Nutch uses the convention that clauses whose boost is 0.0 may be
> converted to filters, for efficiency.  A filter affects the set of hits,
> but not their ranking.  So a boost of 0.0 is used to declare that a
> clause does not affect ranking and may not be used in isolation.  This
> makes it akin to searching for "filetype:pdf" on Google--filetype is
> only used to filter other queries and may not be a standalone query.

Okay, this makes sense. In fact, when I now do a query for:

"contactemail:[EMAIL PROTECTED] specimen"

the query actually works. Of the 3 documents I indexed, only one of them has
the contactemail [EMAIL PROTECTED], and so I only got one result
back. So your answer there makes total sense. So, my question to you then
is, what type of QueryFilter should I develop in order to get my query for
contactemail:<value> to work as a standalone query? For instance,
right now I'm sub-classing the RawFieldQueryFilter, which doesn't seem to be
the right way to do it. Is there a class in Nutch that I can sub-class
to get most of the functionality for doing a <field>:<value> type query as a
standalone query?

Thanks for the help.

Cheers,
  Chris

> 
> Doug

__
Chris A. Mattmann
[EMAIL PROTECTED]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group
 
_
Jet Propulsion Laboratory        Pasadena, CA
Office: 171-266B                 Mailstop: 171-246
___
 
Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.
 
 





Re: developing a parse-/index-/query- plugin set

2005-10-17 Thread Chris Mattmann
Hi Doug,

 Thanks, that worked.

Cheers,
  Chris



On 10/17/05 11:56 AM, "Doug Cutting" <[EMAIL PROTECTED]> wrote:

> Chris Mattmann wrote:
>> So, my question to you then
>> is, what type of QueryFilter should I develop in order to get my query for
>> contactemail:<value> to work as a standalone query? For instance,
>> right now I'm sub-classing the RawFieldQueryFilter, which doesn't seem to be
>> the right way to do it. Is there a class in Nutch that I can sub-class
>> to get most of the functionality for doing a <field>:<value> type query as a
>> standalone query?
> 
> You can simply pass a non-zero boost to the RawFieldQueryFilter
> constructor, e.g.:
> 
> public class MyQueryFilter extends RawFieldQueryFilter {
>public MyQueryFilter() {
>      super("myfield", 1.0f);
>}
> }
> 
> Or you can implement QueryFilter directly.  There's not that much to it.
> 
> Doug

__
Chris A. Mattmann
[EMAIL PROTECTED]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group
 
_
Jet Propulsion Laboratory        Pasadena, CA
Office: 171-266B                 Mailstop: 171-246
___
 
Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.
 
 





RE: [jira] Commented: (NUTCH-88) Enhance ParserFactory plugin selection policy

2005-10-19 Thread Chris Mattmann
Hi Doug,

 I just noticed this comment from your original email:

> First, the ParserFactory sometimes uses LOG.severe() which causes the
> Fetcher to exit.  Is there a reason this cannot be LOG.warning()?
> LOG.severe() should only be used if you intend the application to exit.
> This configuration problem does not seem to warrant that.  And I'm getting
> it with the default settings when an application/pdf is encountered.

In fact, I can't speak for Jerome and Sebastien, but I actually intended the
application to exit in this case. Here is a snippet, taken from:
http://wiki.apache.org/nutch/ParserFactoryImprovementProposal/

///
If an activated parse plugin is not listed in the parse-plugins.xml, then it
won't get called for parsing. The purpose of the parse-plugins.xml file
would be to map parsing-plugin to contentType. Therefore, if an activated
plugin is not mapped to a content type, then it is "activated", but won't
get called. This is very similar to Apache HTTPD. See below:

//httpd.conf example
//add handler for php

LoadModule php4_module libexec/httpd/libphp4.so

// map handler to mimeType
AddType application/x-httpd-php .php
AddType application/x-httpd-php-source .phps

AddHandler php-script .php
AddHandler php-script .phps

There are two different levels in the above example. First, the plugin is
"activated" in the LoadModule section. Then, the plugin is "mapped" to a
content type in the AddHandler section. We believe that this is the way to
go. Apache HTTPD is pervasive, and its model is well understood by many of
the same folks who would want to use Nutch. Although we realize that this is
a change from the way that Nutch currently works, and that people don't like
change, we believe that this change is entirely needful and represents
something that Nutch should adopt.
///

The above case you mention with respect to the application/pdf documents
happens because the parse-plugins.xml file maps the parse-pdf plugin to the
"application/pdf" mimeType, even though the parse-pdf plugin isn't activated
by default via the plugin.includes property (note, this is the opposite case
of the snippet that I pasted from the ImprovementProposal off the Wiki
above). Therein lies the problem. My idea was that, similar to the above
case in Apache HTTPD, if you map an "unactivated" plugin to a mimeType via
parse-plugins.xml, then really there is a configuration error there. I think
that this is a LOG.severe() configuration error because you really need to
"activate" a plugin before you "map it" to a mime type. For example, why
would you want to run a fetch if you have plugins mapped to mimeTypes via
parse-plugins.xml that will never get called because they have never been
activated? Before I run a fetch, I want to make sure of two important
things:

1. I have enabled the entire set of appropriate parse plugins for the
content that I want to fetch

2. I've mapped the enabled parsing plugins to the mimeTypes that they can
deal with (in order of preference)
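
A minimal sketch of the consistency check this policy implies, with
hypothetical names (a plain Map stands in for the real mapping; this is not
the committed ParserFactory code):

import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch only: before fetching, flag any plugin that parse-plugins.xml
// maps to a mimeType but that plugin.includes never activates.
class ConfigCheckSketch {
  static boolean isConsistent(Map<String, List<String>> mimeToPlugins,
                              Set<String> activatedPluginIds) {
    for (Map.Entry<String, List<String>> e : mimeToPlugins.entrySet()) {
      for (String pluginId : e.getValue()) {
        if (!activatedPluginIds.contains(pluginId)) {
          // mapped but never activated: the configuration error that
          // warrants LOG.severe() in the policy described above
          return false;
        }
      }
    }
    return true;
  }
}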

If I ensure that I do both of these things, then we're fine in the above
case you mention with the PDF files. 

Now, I know that this is a somewhat different process than what people are
used to with Nutch. Totally understandable. But I think that the
improvements that are reaped in the ParserFactory by doing it this way far
outweigh the inconvenience of ensuring consistency between the
plugin.includes property in nutch-default.xml and the parse-plugins.xml
file. 

Of course, there is another issue. The current code committed in the trunk
causes the fetcher to exit right out of the box for certain content types,
because, as far as I can tell, the only enabled parse plugins out of the box
are:

parse-(text|html|js)

I guess this is really a design issue in Nutch. Is there really any reason
that the rest of the parsing plugins aren't enabled by default? I mean, I
guess you guys want to go with the "smallest set" of parsing plugins that
makes Nutch a functional search engine out of the box, no? If so, then I
understand only having these parsing plugins enabled. But for instance, I
would say that many of the other parsing plugins that are committed to the
trunk and included in existing Nutch releases so far (e.g.,
parse-ext|mp3|mspowerpoint|msword|pdf|rss|rtf) are tested enough to be
enabled by default, right? If the answer to that lies in a requirement
similar to what I mentioned, i.e., you want to go with the "smallest set" of
parse plugins out of the box, then there are two ways to deal with what's in
the trunk:

1. What you suggested, changing the LOG level to warning, instead of SEVERE,
which alleviates the out-of-the-box functionality problem, but also opens up
a problem where a user will wonder why the PDF content that he tried to
fetch didn't get parsed even though it was mapped correctly in
parse-plugins

RE: [proposal] Generic Markup Language Parser

2005-11-24 Thread Chris Mattmann
Hi Stefan,


> -1!
> Xsl is terribly slow!

You have to consider what the XSL will be used for. Our proposal suggests
XSL as a means of intermediate transformation of markup content on the
"backend", as Jerome suggested in his reply. This means that whenever markup
content is encountered -- specifically, XML-based content -- XSL will be
used to create an intermediary "parse-out" xml file containing the fields
to index. Given the small percentage of xml-based markup content out there
(excluding "html", of course) compared to regular content, I don't think
this would significantly degrade performance.
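
As a rough illustration of the kind of backend step proposed, using the
standard JAXP API (the file names here are purely illustrative, not part of
the proposal):

import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Sketch only: run a per-schema stylesheet over fetched XML to produce
// an intermediate "parse-out" file listing the fields to index.
public class ParseOutSketch {
  public static void main(String[] args) throws TransformerException {
    Transformer t = TransformerFactory.newInstance()
        .newTransformer(new StreamSource("specimen-to-parseout.xsl"));
    t.transform(new StreamSource("fetched-content.xml"),
                new StreamResult("parse-out.xml"));
  }
}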

> Xml will blow up memory and storage usage.

Possibly, but I would think that we would do it in a clever fashion. For
instance, the parse-out xml files would most likely be small (~kb) files
that could be deleted if space is a concern. It could be a parameterized
option. 

> Dublin core may be good for the semantic web, but not for content storage.

I completely disagree with that, and in fact I think many people would
disagree too. Dublin Core is a "standard" metadata model for electronic
resources. It is by no means the entire spectrum of metadata that could be
stored for electronic content. However, rather than creating your own
"author" field, or "content creator", or "document creator", or whatever you
want to call it, I think it would be nice to provide the DC metadata because
at least it is well known and provides interoperability with other content
storage systems. Check out DSpace from MIT. Check out ISO-11179 registry
systems. Check out the ISO standard OAIS reference model for archiving
systems. Each of these systems has recognized that standard metadata is an
important concern in any content management system.

> In general the goal must be to minimize memory usage and improve
> performance; such a parser would increase memory usage and definitely
> slow down parsing.

I don't think it would slow down parsing significantly; as I mentioned above,
markup content represents a small portion of the content out there.

> The magic word is minimalism.
> So I vote against this suggestion!
> Stefan

In general, this proposal represents a step forward in being able to parse
generic XML content in Nutch, which is a very challenging problem. Thanks
for your suggestions; however, I think that our proposal would help Nutch
move forward in handling generic forms of XML markup content.


Cheers,
   Chris Mattmann

> 
> 
> 
> 
> 
> On 24.11.2005 at 00:01, Jérôme Charron wrote:
> 
> > Hi,
> >
> > We (Chris Mattmann, François Martelet, Sébastien Le Callonnec and
> > me) just added a new proposal on the nutch Wiki:
> > http://wiki.apache.org/nutch/MarkupLanguageParserProposal
> >
> > Here is the Summary of Issue:
> > "Currently, Nutch provides some specific markup language parsing
> > plugins:
> > one for handling HTML, another one for RSS, but no generic XML parsing
> > plugin. This is extremely cumbersome as adding support for a new
> > markup
> > language implies that you have to develop the whole XML parsing
> > code from
> > scratch. This methodology causes: (1) code duplication, with little
> > or no
> > reuse of common pieces of XML parsing code, and (2) dependency library
> > duplication, where many XML parsing plugins may rely on similar xml
> > parsing
> > libraries, such as jaxen, or jdom, or dom4j, etc., but each parsing
> > plugin
> > keeps its own local copy of these libraries. It is also very
> > difficult to
> > identify precisely the type of XML content encountered during a
> > parse. That
> > difficult issue is outside the scope of this proposal, and will be
> > identified in a future proposal."
> >
> > Thanks for your feedback, comments, suggestions (and votes).
> >
> > Regards
> >
> > Jérôme
> >
> > --
> > http://motrech.free.fr/
> > http://www.frutch.org/



RE: [proposal] Generic Markup Language Parser

2005-11-24 Thread Chris Mattmann
Hi Stefan, and Jerome,

> A mail archive is an amazing source of information, isn't it?! :-)
> To answer your question, just ask yourself how many pages per second
> you plan to fetch and parse, and how many queries per second a lucene
> index is able to handle - and can deliver in the ui.
> I have here something like 200++ pages to a maximal 20 queries per second.
> http://wiki.apache.org/nutch/HardwareRequirements

I'm not sure that our proposal affects the ui at all. Parsing occurs
only during a fetch, which creates the index for the ui, no? So why mention
the number of queries per second that the ui can handle?

> 
> Speed improvement in the ui can be done by caching components you use to
> assemble the ui. "There are some ways to improve speed."
> But seriously, I don't think there will be any pages that contain
> 'cacheable' items until parsing.
> Over the last years there is one thing I have noticed that matters in a
> search engine - minimalism.
> There is no usage in nutch of a logging library,

Correct me if I'm wrong, but isn't log4j used a lot within Nutch? :-)

> no RMI and no metadata in the web db. Why?
> Minimalism.
> Minimalism == speed, speed == scalability, scalability == serious
> enterprise search engine projects.
> 
> I don't think it would be a good move to slow down html parsing (most
> used parser) to make rss parser writing more easier for developers.

This proposal isn't meant just for RSS; that would seriously constrain its
scope. The proposal is meant to make writing * XML * parsers easier. Note
the "XML". RSS is a small subset of XML as a whole, and there
currently exists no default support for generic XML documents in Nutch.


> BTW, we already have an html and a feed parser that work, as far as I know.
> I guess 90 % of the nutch users use the html parser but only 10 % the
> feed-parser (since blogs are mostly html as well).

This may or may not be true; however, I wouldn't be surprised if it were,
because it is representative of the division of content on the web -- HTML
is definitely orders of magnitude more pervasive than RSS.

> 
>  From my perspective we have much more general things to solve in
> nutch (manageability, monitoring, ndfs block based task-routing, more
> dynamic search servers) than improving things we already have.

I would tend to agree with Jerome on this one -- these seem to be the items
on your agenda: a representative set indeed, but by no means an exhaustive
set of what's needed to improve and benefit Nutch. One of the motivations
behind our proposal was several emails posted to the Nutch list by users
interested in crawling blogs and RSS:

http://www.opensubscriber.com/message/nutch-general@lists.sourceforge.net/2369417.html

One of my replies to this thread was a message on October 19th, 2005, which
really identified the main problem:

http://www.opensubscriber.com/message/nutch-general@lists.sourceforge.net/2369576.html

There is a lack of a general XML parser in Nutch that would allow it to deal
with general XML content based on user-defined schemas and DTDs. Our
proposal would be the initial step towards a solution to this overall
problem. At least, that's part of its intention.
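
To give a flavor of what "generic" means here, a minimal, schema-agnostic
sketch in plain JAXP (no Nutch APIs -- purely illustrative, not part of the
proposal itself):

import java.io.StringReader;

import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class GenericXmlText {
  public static void main(String[] args) throws Exception {
    // Any well-formed XML, regardless of schema or DTD:
    String xml = "<doc><title>Hello</title><body>World</body></doc>";
    Document d = DocumentBuilderFactory.newInstance().newDocumentBuilder()
        .parse(new InputSource(new StringReader(xml)));
    // getTextContent() concatenates all descendant text nodes,
    // whatever the element names happen to be.
    System.out.println(d.getDocumentElement().getTextContent()); // HelloWorld
  }
}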


> Anyway as you may know we have a plugin system and one goal of the
> plugin system is to give developers the freedom to develop custom
> plugins. :-)

Indeed. And our goal is to help developers in their endeavors by providing a
starting point and a generic solution for XML-based parsing plugins :-)

Cheers,
  Chris


> 
> Cheers,
> Stefan
> B-)
> 
> P.S. Do you think it makes sense to run another public nutch mailing
> list, since 'THE nutch [...]' (mailing list  is nutch-
> [EMAIL PROTECTED]), 'Isn't it?'
> http://www.mail-archive.com/nutch-user@lucene.apache.org/msg01513.html
> 
> 
> 
> Am 24.11.2005 um 19:28 schrieb Jérôme Charron:
> 
> > Hi Stefan,
> >
> > And thanks for taking time to read the doc and giving us your
> > feedback.
> >
> > -1!
> >> Xsl is terrible slow!
> >> Xml will blow up memory and storage usage.
> >
> > But there still something I don't understand...
> > Regarding a previous discussion we had about the use of OpenSearch
> > API to
> > replace Servlet => HTML by Servlet => XML => HTML (using xsl),
> > here is a copy of one of my comment:
> >
> > In my opinion, it is the front-end "dreamed" architecture. But more
> > pragmatically, I'm not sure it's a good idea. XSL transformation is a
> > rather slow process!! And the Nutch front-end must be very responsive.
> >
> > and then your response and Doug response too:
> > Stefan:
> > We already done experiments using XSLT.
> > There are some ways to improve speed; however, it is 20++ % slower
> > than jsp.
> > Doug:
> > I don't think this would make a significant impact on overall Nutch
> > search
> > performance.
> > (the complete thread is available at
> > http://www.mail-archive.com/nutch-developers@lists.sourceforge.net/
> > msg03811.html
> > )
> >
> > I'm a little bit confused... why the use o

Re: Urlfilter Patch

2005-12-01 Thread Chris Mattmann
Jerome,

 I think that this is a great idea, and it ensures that there isn't replication
of so-called "management information" across the system. It could be easily
implemented as a utility method, because we have utility Java classes that
represent the ParsePluginList, from which you could get the mimeTypes.
Additionally, we could create a utility method that searches the extension
point list for parsing plugins and returns whether or not they are activated.
Using this information, I believe that the url filtering would be a snap.
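
As a rough sketch of the kind of utility I have in mind (plain collections
standing in for the real ParsePluginList and mime-type registry, so all the
names here are hypothetical):

import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class ExtensionWhitelist {

  // Builds the set of URL extensions worth fetching, given a map from
  // content-type to known file extensions and the set of content-types
  // that activated parse plugins can actually handle.
  public static Set<String> allowedExtensions(
      Map<String, List<String>> mimeTypeToExtensions,
      Set<String> activatedMimeTypes) {
    Set<String> allowed = new HashSet<String>();
    for (String mimeType : activatedMimeTypes) {
      List<String> exts = mimeTypeToExtensions.get(mimeType);
      if (exts != null) {
        allowed.addAll(exts);
      }
    }
    return allowed;
  }

  public static void main(String[] args) {
    Map<String, List<String>> registry = new HashMap<String, List<String>>();
    registry.put("text/html", Arrays.asList("html", "htm"));
    registry.put("application/pdf", Arrays.asList("pdf"));

    Set<String> activated = new HashSet<String>();
    activated.add("text/html");

    // Prints something like: [html, htm]
    System.out.println(allowedExtensions(registry, activated));
  }
}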

 

+1

Cheers,
  Chris



On 12/1/05 12:11 PM, "Jérôme Charron" <[EMAIL PROTECTED]> wrote:

> Suggestion:
> For consistency purposes, and ease of nutch management, why not filter the
> extensions based on the activated plugins?
> By looking at the mime-types defined in the parse-plugins.xml file and the
> activated plugins, we know which content-types will be parsed.
> So, by getting the file extensions associated with each content-type, we can
> build a list of file extensions to include (other ones will be excluded) in
> the fetch process.
> No?
> 
> Jérôme
> 
> --
> http://motrech.free.fr/
> http://www.frutch.org/

__
Chris A. Mattmann
[EMAIL PROTECTED]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_
Jet Propulsion LaboratoryPasadena, CA
Office: 171-266BMailstop:  171-246
___

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.




Re: Urlfilter Patch

2005-12-01 Thread Chris Mattmann
Hi Doug,


On 12/1/05 1:11 PM, "Doug Cutting" <[EMAIL PROTECTED]> wrote:

> Jérôme Charron wrote:
[...]
> 
> What about a site that develops a content system that has urls that end
> in .foo, which we would exclude, even though they return html?
> 
> Doug

  In principle, the mimeType system should give us some guidance on
determining the appropriate mimeType for the content, regardless of whether
it ends in .foo, .bar, or the like. I'm not sure if the mime type registry is
there yet, but I know that Jerome was working on a major update that would
help in recognizing these types of situations. Of course, efficiency comes
into play as well, in terms of not slowing down the fetch/parse, but it
would be nice to have a general solution that made use of the information
available in parse-plugins.xml to determine the appropriate set of allowed
extensions in a URLFilter, if possible. It may be a pipe dream, but I'd say
it's worth exploring...

Cheers,
  Chris



__
Chris A. Mattmann
[EMAIL PROTECTED]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_
Jet Propulsion LaboratoryPasadena, CA
Office: 171-266BMailstop:  171-246
___

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.




RE: Urlfilter Patch

2005-12-01 Thread Chris Mattmann
Hi Doug,

> 
> Chris Mattmann wrote:
> >   In principle, the mimeType system should give us some guidance on
> > determining the appropriate mimeType for the content, regardless of
> whether
> > it ends in .foo, .bar or the like.
> 
> Right, but the URL filters run long before we know the mime type, in
> order to try to keep us from fetching lots of stuff we can't process.
> The mime type is not known until we've fetched it.

Duh, you're right. Sorry about that. 

Matt Kangas wrote:
> The latter is not strictly true. Nutch could issue an HTTP HEAD  
> before the HTTP GET, and determine the mime-type before actually  
> grabbing the content.
> 
> It's not how Nutch works now, but this might be more useful than a 
> super-detailed set of regexes...


I liked Matt's idea of the HEAD request, though. I wonder if some benchmarks
on the performance of this would be useful, because in some cases (such as
focused crawling, or "non-whole-internet" crawling, such as intranet crawls),
it would seem that the extra round trip of performing the HEAD to get the
content-type would be tolerable, and worth the cost...
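
A minimal sketch of what such a HEAD probe could look like (plain java.net,
not Nutch's protocol plugins; treat it as illustrative only):

import java.net.HttpURLConnection;
import java.net.URL;

public class HeadProbe {
  public static String contentType(String url) throws Exception {
    HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
    conn.setRequestMethod("HEAD");    // ask for headers only; no body is sent
    conn.setConnectTimeout(10000);
    conn.setReadTimeout(10000);
    try {
      return conn.getContentType();   // e.g. "text/html; charset=UTF-8"
    } finally {
      conn.disconnect();
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println(contentType("http://lucene.apache.org/nutch/"));
  }
}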

Cheers,
  Chris





RE: Urlfilter Patch

2005-12-01 Thread Chris Mattmann
Hi Jerome,

> Yes, the fetcher can't rely on the document mime-type.
> The only thing we can use for filtering is the document's URL.
> So, another alternative could be to exclude only file extensions that
> are
> registered in the mime-type repository
> (some well known file extensions) but for which no parser is activated,
> and
> accept all other ones.
> So that the .foo files will be fetched...

Yup, the key phrase is "well known". It would sort of be an optimization, or
heuristic, to save some work on the regex...

Cheers,
  Chris


> 
> Jérôme



RE: submitting a patch?

2005-12-06 Thread Chris Mattmann
Hi James,

 You can submit your patch via JIRA
(http://issues.apache.org/jira/browse/NUTCH). You can create an issue there
and then attach your patch to that issue.


G'luck,
  Chris


__
Chris A. Mattmann
[EMAIL PROTECTED] 
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_
Jet Propulsion LaboratoryPasadena, CA
Office: 171-266BMailstop:  171-246
___

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.


> -Original Message-
> From: James Nelson [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, December 06, 2005 8:45 AM
> To: nutch-dev@lucene.apache.org
> Subject: submitting a patch?
> 
> Hello, hope this is the right place to ask this.
> 
> I'm working on patching nutch to support sorting results on multiple
> fields. I have a patch partially completed and would like to get
> feedback on it. I made the patch against the 0.7 line and want to know
> if it's ok to post the patch against that branch or if I should make
> it against mapred branch?
> 
>  thanks,
> 
> James



NUTCH-112: Link in cached.jsp page to cached content is an absolute link

2005-12-06 Thread Chris Mattmann
Hi Guys,

 

  Just wondering if any of the committers checked out
http://issues.apache.org/jira/browse/NUTCH-112. It turns out that the link in
the cached.jsp page to the cached content is an absolute link, which makes the
link break when you don't deploy the Nutch webapp in the root context. I've
attached a pretty simple patch to the issue, and tested it. It would be nice
to have this included for those people like me who are running Nutch deployed
at a context other than root, e.g., http://myhost/nutch/.
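
For context, the general shape of such a fix (a hedged sketch of the pattern
only -- the actual patch is attached to the JIRA issue, and "idx" here is just
an illustrative parameter name):

<%-- Derive the link prefix from the servlet context instead of hard-coding
     the root context; this works at / and at /nutch/ alike. --%>
<a href="<%= request.getContextPath() %>/cached.jsp?idx=<%= idx %>">cached</a>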

 

 

Thanks,

  Chris 

 

 

__
Chris A. Mattmann
[EMAIL PROTECTED] 
Staff Member
Modeling and Data Management Systems Section (387)

Data Management Systems and Technologies Group

_
Jet Propulsion LaboratoryPasadena, CA
Office: 171-266BMailstop:  171-246
___

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.

 



Standard metadata property names in the ParseData metadata

2005-12-13 Thread Chris Mattmann
Hi Folks,

 I was just thinking about the ParseData java.util.Properties metadata object
and the way that we store names in there. Currently, people are free to name
their string-based properties anything that they want, such as having the
names "Content-type", "content-TyPe", and "CONTENT_TYPE" all carry the same
meaning. Stefan G., I believe, proposed a solution in which all property names
would be converted to lower case, but in essence this really only fixes half
the problem (the case of identifying that "CONTENT_TYPE" and "conTeNT-TyPE"
and all their permutations are really the same). What if I named it
"Content Type", or "ContentType"?

 I propose that a way to correct this would be to create a standard set of
named Strings in the ParseData class that the protocol framework and the
parsing framework could use to identify common properties such as
"Content-type", "Creator", "Language", etc.

 The properties would be defined at the top of the ParseData class,
something like:

 public class ParseData {

    // ...

    public static final String CONTENT_TYPE = "content-type";
    public static final String CREATOR = "creator";

    // ...

 }


In this fashion, users could at least know the names of the standard
properties that they can obtain from the ParseData, for example by making a
call to ParseData.getMetadata().get(ParseData.CONTENT_TYPE) to get the content
type, or a call to ParseData.getMetadata().set(ParseData.CONTENT_TYPE,
"text/xml"). Of course, this wouldn't preclude users from doing what they are
currently doing; it would just provide a standard method of obtaining some of
the more common, critical metadata without poring over the code base to figure
out what the properties are named.
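
To make the intent concrete, a tiny self-contained sketch of the calling
pattern (java.util.Properties standing in for the real metadata object, and
the constant a stand-in for the proposed ParseData field):

import java.util.Properties;

public class MetadataDemo {

  // Stand-in for the proposed ParseData constant (name from the proposal).
  public static final String CONTENT_TYPE = "content-type";

  public static void main(String[] args) {
    Properties meta = new Properties();
    // Every writer and reader uses the same canonical key, so the
    // "CONTENT_TYPE" / "conTeNT-TyPE" casing problem disappears.
    meta.setProperty(CONTENT_TYPE, "text/xml");
    System.out.println(meta.getProperty(CONTENT_TYPE)); // text/xml
  }
}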

What do you all think? If you guys think that this is a good solution, I'll
create an issue in JIRA about it and contribute a patch near the end of the
week.

Cheers,
  Chris

__
Chris A. Mattmann
[EMAIL PROTECTED]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_
Jet Propulsion LaboratoryPasadena, CA
Office: 171-266BMailstop:  171-246
___

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.




Re: Standard metadata property names in the ParseData metadata

2005-12-13 Thread Chris Mattmann
Hi Stefan,

 Thanks. Yup, I noticed it and I think it will really help out a lot. Great
job to the both of you :-)

Cheers,
  Chris



On 12/13/05 10:59 AM, "Stefan Groschupf" <[EMAIL PROTECTED]> wrote:

> +1!
> BTW, did you notice  that Jerome committed a patch that makes Content
> meta data now case insensitive?
> 
> Stefan
> 
> Am 13.12.2005 um 18:07 schrieb Chris Mattmann:
> 
>> Hi Folks,
>> 
>>  I was just thinking about the ParseData java.util.Properties
>> metadata object
>> and thinking about the way that we store names in there. Currently,
>> people
>> are free to name their string-based properties anything that they
>> want, such
>> as having names of "Content-type", "content-TyPe", "CONTENT_TYPE"
>> all having
>> the same meaning. Stefan G. I believe proposed a solution in which all
>> property names be converted to lower case, but in essence this
>> really only
>> fixes half the problem (the case of identifying that
>> "CONTENT_TYPE"
>> and "conTeNT-TyPE" and all their permutations are really the same).
>> What if I
>> named it "Content Type", or "ContentType"?
>> 
>>  I propose that a way to correct this would be to create a standard
>> set of
>> named Strings in the ParseData class that the protocol framework
>> and the
>> parsing framework could use to identify common properties such as
>> "Content-type", "Creator", "Language", etc.
>> 
>>  The properties would be defined at the top of the ParseData class,
>> something like:
>> 
>>  public class ParseData {
>> 
>>     // ...
>> 
>>     public static final String CONTENT_TYPE = "content-type";
>>     public static final String CREATOR = "creator";
>> 
>>     // ...
>> 
>>  }
>> 
>> 
>> In this fashion, users could at least know what the name of the
>> standard
>> properties that they can obtain from the ParseData are, for example by
>> making a call to ParseData.getMetadata().get
>> (ParseData.CONTENT_TYPE) to get
>> the content type or a call to
>> ParseData.getMetadata().set(ParseData.CONTENT_TYPE, "text/xml"); Of
>> course,
>> this wouldn't preclude users from doing what they are currently
>> doing, it
>> would just provide a standard method of obtaining some of the more
>> common,
>> critical metadata without poring over the code base to figure out
>> what they
>> are named.
>> 
>> What do you all think? If you guys think that this is a good
>> solution, I'll
>> create an issue in JIRA about it and contribute a patch near the
>> end of the
>> week.
>> 
>> Cheers,
>>   Chris
>> 
>> __
>> Chris A. Mattmann
>> [EMAIL PROTECTED]
>> Staff Member
>> Modeling and Data Management Systems Section (387)
>> Data Management Systems and Technologies Group
>> 
>> _
>> Jet Propulsion LaboratoryPasadena, CA
>> Office: 171-266BMailstop:  171-246
>> ___
>> 
>> Disclaimer:  The opinions presented within are my own and do not
>> reflect
>> those of either NASA, JPL, or the California Institute of Technology.
>> 
>> 
>> 
> 
> ---
> company:http://www.media-style.com
> forum:http://www.text-mining.org
> blog:http://www.find23.net
> 
> 

__
Chris A. Mattmann
[EMAIL PROTECTED]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_
Jet Propulsion LaboratoryPasadena, CA
Office: 171-266BMailstop:  171-246
___

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.




Idea about aliases in the parse-plugins.xml file

2005-12-13 Thread Chris Mattmann
Hi Folks,

  Jerome and I have been talking about an idea to address the current issue
raised by Stefan G. about having a mapping of mimeType->list of pluginIds
rather than mimeType->list of extensionIds in the parse-plugins.xml file.
We've come up with the following proposed update that would seemingly fix
this problem.

  We propose to have the concept of "aliases" in the parse-plugins.xml file,
defined at the end of the file, something like:

<parse-plugins>

    <mimeType name="text/html">
        <plugin id="parse-html" />
    </mimeType>

    <!-- ... more mimeType-to-plugin mappings ... -->

    <aliases>
        <alias name="parse-html"
               extension-id="org.apache.nutch.parse.html.HtmlParser" />
        <alias name="parse-text"
               extension-id="org.apache.nutch.parse.text.TextParser" />
    </aliases>

</parse-plugins>


What do you guys think? This approach would be flexible enough to allow the
mapping of extensionIds to mimeTypes, but without impacting the current
"pluginId" concept.

Comments welcome.
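
To make the mechanics concrete, here is a tiny sketch of the two-step lookup
the aliases enable (plain maps with illustrative values; the real logic would
live in the ParserFactory / ParsePluginsReader):

import java.util.HashMap;
import java.util.Map;

public class AliasResolution {
  public static void main(String[] args) {
    // parse-plugins.xml maps mimeType -> pluginId...
    Map<String, String> mimeToPluginId = new HashMap<String, String>();
    mimeToPluginId.put("text/html", "parse-html");

    // ...and the aliases section maps pluginId -> extensionId.
    Map<String, String> aliases = new HashMap<String, String>();
    aliases.put("parse-html", "org.apache.nutch.parse.html.HtmlParser");

    // Resolution is then just two lookups:
    String pluginId = mimeToPluginId.get("text/html");
    String extensionId = aliases.get(pluginId);
    System.out.println(extensionId); // org.apache.nutch.parse.html.HtmlParser
  }
}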

Cheers,
  Chris

__
Chris A. Mattmann
[EMAIL PROTECTED]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_
Jet Propulsion LaboratoryPasadena, CA
Office: 171-266BMailstop:  171-246
___

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.




Re: Standard metadata property names in the ParseData metadata

2005-12-13 Thread Chris Mattmann
Hi Guys,

 Okay, that makes sense then. I will create an issue in JIRA later today
describing the update, and then begin working on this over the next few
days.

Thanks for your responses and reviews.

Cheers,
  Chris



On 12/13/05 12:45 PM, "Jérôme Charron" <[EMAIL PROTECTED]> wrote:

>> I agree, too. Perhaps we should use the names as they appear in the
>> Dublin Core for those properties that are defined there
> 
> A big YES!
> 
> 
>> - just prepended
>> them with "X-nutch-" in order to avoid name-clashes with other
>> properties (e.g. blindly copied from the protocol headers).
> 
> Another big YES!
> 
> 
> --
> http://motrech.free.fr/
> http://www.frutch.org/

__
Chris A. Mattmann
[EMAIL PROTECTED]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_
Jet Propulsion LaboratoryPasadena, CA
Office: 171-266BMailstop:  171-246
___

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.




bug in parse-rtf?

2005-12-16 Thread Chris Mattmann
Hi Folks,

 

 Has anybody been experiencing problems building the parse-rtf plugin? I just
noticed while working on NUTCH-139 that there's a line at the end of
RTFParser.java in parse-rtf that returns a new ParseImpl; however, it uses the
old ParseData constructor (pre Andrzej's protocol and parsing updates), and it
doesn't pass the ParseStatus as the first parameter.

 

 I've updated this in my locally checked out project, and will include the
fix to it in NUTCH-139.

 

Thanks,

  Chris

 

 

__
Chris A. Mattmann
[EMAIL PROTECTED] 
Staff Member
Modeling and Data Management Systems Section (387)

Data Management Systems and Technologies Group

_
Jet Propulsion LaboratoryPasadena, CA
Office: 171-266BMailstop:  171-246
Phone:  818-354-8810
___

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.

 



RE: [jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

2006-01-05 Thread Chris Mattmann
Hi Folks,
 
  I've tried removing the 5 copies of the comment; however, I can't find a
button in JIRA to remove comments. Maybe an administrator for Nutch can do
it? Anyway, the dang thing is running so slowly right now that it may just
have to wait until the server stops returning the 503 Service Unavailable
messages. Sorry again...

Cheers,
  Chris


__
Chris A. Mattmann
[EMAIL PROTECTED] 
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_
Jet Propulsion LaboratoryPasadena, CA
Office: 171-266BMailstop:  171-246
___

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.


> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
> Sent: Thursday, January 05, 2006 8:28 PM
> To: nutch-dev@lucene.apache.org
> Subject: RE: [jira] Commented: (NUTCH-139) Standard metadata property
> names in the ParseData metadata
> 
> Guys,
> 
>  My apologies for the spamming comments -- I tried to submit my comment
> through JIRA one time and it kept giving me service unavailable. So I
> resubmitted like 5 times; on the fifth time it finally went through -- but
> I
> guess the other comments went through too. I'll try and remove them right
> away.
> 
>  Sorry again.
> 
> Cheers,
>   Chris
> 
> 
> __
> Chris A. Mattmann
> [EMAIL PROTECTED]
> Staff Member
> Modeling and Data Management Systems Section (387)
> Data Management Systems and Technologies Group
> 
> _
> Jet Propulsion LaboratoryPasadena, CA
> Office: 171-266BMailstop:  171-246
> ___
> 
> Disclaimer:  The opinions presented within are my own and do not reflect
> those of either NASA, JPL, or the California Institute of Technology.
> 
> 
> > -Original Message-
> > From: Doug Cutting (JIRA) [mailto:[EMAIL PROTECTED]
> > Sent: Thursday, January 05, 2006 8:04 PM
> > To: nutch-dev@incubator.apache.org
> > Subject: [jira] Commented: (NUTCH-139) Standard metadata property names
> in
> > the ParseData metadata
> >
> > [ http://issues.apache.org/jira/browse/NUTCH-
> > 139?page=comments#action_12361922 ]
> >
> > Doug Cutting commented on NUTCH-139:
> > 
> >
> > One more thing.  Content length should also not need to be stored in the
> > metadata as an x-nutch value.  The content length is simply the length
> of
> > the Content's data.  The protocol may have truncated the content, in
> which
> > case perhaps we need an x-nutch-truncated-content metadata property or
> > something, but we should not be overwriting the HTTP "Content-Length"
> > header, nor should we trust that it reflects the length of the data
> > actually fetched.
> >
> >
> > > Standard metadata property names in the ParseData metadata
> > > --
> > >
> > >  Key: NUTCH-139
> > >  URL: http://issues.apache.org/jira/browse/NUTCH-139
> > >  Project: Nutch
> > > Type: Improvement
> > >   Components: fetcher
> > > Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
> > >  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB
> > RAM, although bug is independent of environment
> > > Reporter: Chris A. Mattmann
> > > Assignee: Chris A. Mattmann
> > > Priority: Minor
> > >  Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
> > >  Attachments: NUTCH-139.060105.patch, NUTCH-139.Mattmann.patch.txt,
> > NUTCH-139.jc.review.patch.txt
> > >
> > > Currently, people are free to name their string-based properties
> > anything that they want, such as having names of "Content-type",
> "content-
> > TyPe", "CONTENT_TYPE" all having the same meaning. Stefan G. I believe
> > proposed a solution in which all property names be converted to lower
> > case, but in essence this really only fixes half the problem right (the
> > case of identifying that "CONTENT_TYPE"
> > > and "conTeNT_TyPE" and all the permutations are really the same). What
> > about
> > > if I named it "Content Type", or "ContentType"?
> > >  I propose that a way to correct this would be to create a standard
> set
> > of named Strings in the ParseData class that the protocol framework and
> > the parsing framework could use to identify common properties such as
> > "Content-type", "Creator", "Language", etc.
> > >  The properties would be defined at the top of the ParseData class,
> > something like:
> > >  public class ParseData{
> > >.
> > > public static final String CONTENT_TYPE = "content-type";
> > > public static final String CREATOR = "creator";
> > >
> > > }
> > > 

RE: [jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

2006-01-05 Thread Chris Mattmann
Guys,

 My apologies for the spamming comments -- I tried to submit my comment
through JIRA one time and it kept giving me "service unavailable". So I
resubmitted like 5 times; on the fifth time it finally went through -- but I
guess the other comments went through too. I'll try to remove them right
away.

 Sorry again.

Cheers,
  Chris


__
Chris A. Mattmann
[EMAIL PROTECTED] 
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_
Jet Propulsion LaboratoryPasadena, CA
Office: 171-266BMailstop:  171-246
___

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.


> -Original Message-
> From: Doug Cutting (JIRA) [mailto:[EMAIL PROTECTED]
> Sent: Thursday, January 05, 2006 8:04 PM
> To: nutch-dev@incubator.apache.org
> Subject: [jira] Commented: (NUTCH-139) Standard metadata property names in
> the ParseData metadata
> 
> [ http://issues.apache.org/jira/browse/NUTCH-
> 139?page=comments#action_12361922 ]
> 
> Doug Cutting commented on NUTCH-139:
> 
> 
> One more thing.  Content length should also not need to be stored in the
> metadata as an x-nutch value.  The content length is simply the length of
> the Content's data.  The protocol may have truncated the content, in which
> case perhaps we need an x-nutch-truncated-content metadata property or
> something, but we should not be overwriting the HTTP "Content-Length"
> header, nor should we trust that it reflects the length of the data
> actually fetched.
> 
> 
> > Standard metadata property names in the ParseData metadata
> > --
> >
> >  Key: NUTCH-139
> >  URL: http://issues.apache.org/jira/browse/NUTCH-139
> >  Project: Nutch
> > Type: Improvement
> >   Components: fetcher
> > Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
> >  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB
> RAM, although bug is independent of environment
> > Reporter: Chris A. Mattmann
> > Assignee: Chris A. Mattmann
> > Priority: Minor
> >  Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
> >  Attachments: NUTCH-139.060105.patch, NUTCH-139.Mattmann.patch.txt,
> NUTCH-139.jc.review.patch.txt
> >
> > Currently, people are free to name their string-based properties
> anything that they want, such as having names of "Content-type", "content-
> TyPe", "CONTENT_TYPE" all having the same meaning. Stefan G. I believe
> proposed a solution in which all property names be converted to lower
> case, but in essence this really only fixes half the problem right (the
> case of identifying that "CONTENT_TYPE"
> > and "conTeNT_TyPE" and all the permutations are really the same). What
> about
> > if I named it "Content Type", or "ContentType"?
> >  I propose that a way to correct this would be to create a standard set
> of named Strings in the ParseData class that the protocol framework and
> the parsing framework could use to identify common properties such as
> "Content-type", "Creator", "Language", etc.
> >  The properties would be defined at the top of the ParseData class,
> something like:
> >  public class ParseData{
> >.
> > public static final String CONTENT_TYPE = "content-type";
> > public static final String CREATOR = "creator";
> >
> > }
> > In this fashion, users could at least know what the name of the standard
> properties that they can obtain from the ParseData are, for example by
> making a call to ParseData.getMetadata().get(ParseData.CONTENT_TYPE) to
> get the content type or a call to
> ParseData.getMetadata().set(ParseData.CONTENT_TYPE, "text/xml"); Of
> course, this wouldn't preclude users from doing what they are currently
> doing, it would just provide a standard method of obtaining some of the
> more common, critical metadata without pouring over the code base to
> figure out what they are named.
> > I'll contribute a patch near the end of the this week, or beg. of next
> week that addresses this issue.
> 
> --
> This message is automatically generated by JIRA.
> -
> If you think it was sent incorrectly contact one of the administrators:
>http://issues.apache.org/jira/secure/Administrators.jspa
> -
> For more information on JIRA, see:
>http://www.atlassian.com/software/jira



Nutch Deployment

2006-01-06 Thread Chris Mattmann
Hi Folks,

  Jerome and I have been thinking a bit about the whole issue of "static"
NutchConf, versus removing it and making it a constructor parameter, etc. I
personally think that a lot of this issue stems from the fact that the
actual source code for Nutch -- what I would call the "source
distribution" -- is in the same location as the actual "deployment"
distribution. For example, we have a directory structure like:

$NUTCH_HOME:

src/
build/
lib/
bin/
...

This is where you run Nutch, and also where you build Nutch. I think that
this would be drastically improved by defining the notion of a "Nutch
deployment". For example, in a lot of my projects at JPL, we check out the
source code of our projects from CM, and then we construct a "build" of the
project. This build becomes what we then "deploy" to a particular deployment
environment, or location, and that's where the system is run from. A simple
example would be:

I have project A, here is A's source code structure:

/path/to/A/src/java/my/package/Test.java
/path/to/A/build.xml
/path/to/A/


Then, when I type: ant deploy, the following structure is created:

/path/to/A/build/distribution/lib/
/path/to/A/build/distribution/bin/
/path/to/A/build/distribution/LICENSE.txt
/path/to/A/build/distribution/conf/
...and so on
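
For illustration, the kind of Ant target that could produce that layout (a
sketch only; the target and property names are made up, not taken from the
actual Nutch build.xml):

<target name="deploy" depends="jar">
  <mkdir dir="${build.dir}/distribution"/>
  <copy todir="${build.dir}/distribution/lib">
    <fileset dir="lib" includes="**/*.jar"/>
  </copy>
  <copy todir="${build.dir}/distribution/bin">
    <fileset dir="bin"/>
  </copy>
  <copy todir="${build.dir}/distribution/conf">
    <fileset dir="conf"/>
  </copy>
  <copy file="LICENSE.txt" todir="${build.dir}/distribution"/>
</target>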

Then, a user could take the /path/to/A/build/distribution folder and copy it
to a "deployment" directory; that becomes the deployment of the system, which
is separate from the source code, thereby untying the source distribution
from the deployment distribution. If we had this concept in Nutch, I think a
lot of the static NutchConf issues would disappear, because we would have the
concept of separate deployments, instead of just relying on the same
deployment to run a whole bunch of distributed processes out of.

I may be misunderstanding this whole conversation, but if I'm right, then I
would propose that we formalize a notion of a "deployment" of Nutch versus
the actual "source distribution", instead of co-mingling them. Thoughts?


Thanks!

Cheers,
  Chris


__
Chris A. Mattmann
[EMAIL PROTECTED]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_
Jet Propulsion LaboratoryPasadena, CA
Office: 171-266BMailstop:  171-246
___

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.




RE: [jira] Updated: (NUTCH-179) Proposition: Enable Nutch to use a parser plugin not just based on content type

2006-01-15 Thread Chris Mattmann
Hi Gail,

 Check out:

http://wiki.apache.org/nutch/ParserFactoryImprovementProposal/

That's the way that the parser factory currently works. Also added, but not
described in that proposal is the ability to call a parser by its id, which
is a method present in ParseUtil.java.

G'luck!

Cheers,
  Chris


__
Chris A. Mattmann
[EMAIL PROTECTED] 
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_
Jet Propulsion LaboratoryPasadena, CA
Office: 171-266BMailstop:  171-246
___

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.


> -Original Message-
> From: Gal Nitzan (JIRA) [mailto:[EMAIL PROTECTED]
> Sent: Sunday, January 15, 2006 4:10 PM
> To: nutch-dev@incubator.apache.org
> Subject: [jira] Updated: (NUTCH-179) Proposition: Enable Nutch to use a
> parser plugin not just based on content type
> 
>  [ http://issues.apache.org/jira/browse/NUTCH-179?page=all ]
> 
> Gal Nitzan updated NUTCH-179:
> -
> 
> Description:
> Sorry, please close this issue.
> 
> I figured that if I set my parse plugin first. I can always be called
> first and than decide if I want to parse or not.
> 
>   was:
> Somtime there are requirements of the "real world" (usually your boss)
> where a special parse is required for a certain site. Though the content
> type is text/html, a specialized parser is needed.
> 
> Sample: I am required to crawl certain sites where some of them are
> partners sites. when fetching from the partners site I need to look for
> certain entries in the text and boost the score.
> 
> Currently the ParserFactory looks for a plugin based only on the content
> type.
> 
> Facing this issue myself I noticed that it would give a very easy
> implementation for others if ParserFactory could use NutchConf to check
> for certain properties and if matched to use the correct plugin based on
> the url and not just the content type.
> 
> The implementation shouldn be to complicated.
> 
> Looking to hear more ideas.
> 
> 
> > Proposition: Enable Nutch to use a parser plugin not just based on
> content type
> > 
> ---
> >
> >  Key: NUTCH-179
> >  URL: http://issues.apache.org/jira/browse/NUTCH-179
> >  Project: Nutch
> > Type: Improvement
> >   Components: fetcher
> > Versions: 0.8-dev
> > Reporter: Gal Nitzan
> 
> >
> > Sorry, please close this issue.
> > I figured that if I set my parse plugin first. I can always be called
> first and than decide if I want to parse or not.
> 
> --
> This message is automatically generated by JIRA.
> -
> If you think it was sent incorrectly contact one of the administrators:
>http://issues.apache.org/jira/secure/Administrators.jspa
> -
> For more information on JIRA, see:
>http://www.atlassian.com/software/jira



ignore eclipse .project and .classpath

2006-02-07 Thread Chris Mattmann
Hi Folks,

 

 Just wondering if someone could add to the svn:ignore property for Nutch
the files:

 

.classpath

.project

 

I happen to use eclipse to do Nutch development and always ignore these
files in my other eclipse projects as well.

 

Cheers,

  Chris

 

__
Chris A. Mattmann
[EMAIL PROTECTED] 
Staff Member
Modeling and Data Management Systems Section (387)

Data Management Systems and Technologies Group

_
Jet Propulsion LaboratoryPasadena, CA
Office: 171-266BMailstop:  171-246
___

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.

 



Re: ignore eclipse .project and .classpath

2006-02-09 Thread Chris Mattmann
Thanks a lot!

Cheers,
  Chris



On 2/9/06 12:13 PM, "[EMAIL PROTECTED]" <[EMAIL PROTECTED]> wrote:

> Done.
> 
> - Original Message 
> From: Stefan Groschupf <[EMAIL PROTECTED]>
> To: nutch-dev@lucene.apache.org
> Sent: Wed 08 Feb 2006 03:15:15 PM EST
> Subject: Re: ignore eclipse .project and .classpath
> 
> +1
> 
> Am 08.02.2006 um 06:16 schrieb Chris Mattmann:
> 
>> Hi Folks,
>> 
>> 
>> 
>>  Just wondering if someone could add to the svn:ignore property for
>> Nutch
>> the files:
>> 
>> 
>> 
>> .classpath
>> 
>> .project
>> 
>> 
>> 
>> I happen to use eclipse to do Nutch development and always ignore
>> these
>> files in my other eclipse projects as well.
>> 
>> 
>> 
>> Cheers,
>> 
>>   Chris
>> 
>> 
>> 
>> __
>> Chris A. Mattmann
>> [EMAIL PROTECTED]
>> Staff Member
>> Modeling and Data Management Systems Section (387)
>> 
>> Data Management Systems and Technologies Group
>> 
>> _
>> Jet Propulsion LaboratoryPasadena, CA
>> Office: 171-266BMailstop:  171-246
>> ___
>> 
>> Disclaimer:  The opinions presented within are my own and do not
>> reflect
>> those of either NASA, JPL, or the California Institute of Technology.
>> 
>> 
>> 
> 
> ---
> company:http://www.media-style.com
> forum:http://www.text-mining.org
> blog:http://www.find23.net
> 
> 
> 
> 
> 




Re: duplicate libs

2006-02-13 Thread Chris Mattmann
Hey Doug,

  I think that, at least in the case of parse-rss, parse-pdf, and the nutch
core, there's probably some utility in having lib-xxx plugins (or at least
putting these jars in $NUTCH_HOME/lib) for:

commons-httpclient
log4j
xerces

Then protocol-httpclient, parse-pdf, and the rest of the nutch core classes
could all reference these libraries. I'm working on NUTCH-140 right now, but
if there is a need for this, I can create an issue in JIRA and then work on
it as well...
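
For reference, a rough sketch of what such a shared-library plugin's
plugin.xml could look like (modeled on the existing plugin descriptor format;
the id, version, and jar name are just for illustration):

<plugin id="lib-log4j" name="Apache log4j"
        version="1.2.11" provider-name="org.apache.log4j">
  <runtime>
    <library name="log4j-1.2.11.jar">
      <export name="*"/>
    </library>
  </runtime>
</plugin>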

Cheers,
  Chris



On 2/13/06 3:26 PM, "Doug Cutting" <[EMAIL PROTECTED]> wrote:

> There are a number of duplicated libs in the plugins, namely:
> 
> commons-httpclient-3.0-beta1.jar  src/plugin/parse-rss/lib
> commons-httpclient-3.0.jarsrc/plugin/protocol-httpclient/lib
> 
> log4j-1.2.11.jar  src/plugin/clustering-carrot2/lib
> log4j-1.2.6.jar 1 src/plugin/parse-rss/lib
> log4j-1.2.9.jar   src/plugin/parse-pdf/lib
> 
> nekohtml-0.9.2.jarsrc/plugin/clustering-carrot2/lib
> nekohtml-0.9.4.jarsrc/plugin/parse-html/lib
> 
> xerces-2_6_2.jar  lib
> xercesImpl.jarsrc/plugin/parse-rss/lib
> 
> Are there any known reasons to keep multiple versions of things, or
> should we move these each into their own plugin that can be shared?
> 
> Doug

__
Chris A. Mattmann
[EMAIL PROTECTED]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_
Jet Propulsion LaboratoryPasadena, CA
Office: 171-266BMailstop:  171-246
___

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.




RE: duplicate libs

2006-02-13 Thread Chris Mattmann
Hi Andrzej,


> > commons-httpclient-3.0-beta1.jar  src/plugin/parse-rss/lib
> > commons-httpclient-3.0.jarsrc/plugin/protocol-httpclient/lib
> 
> Not sure what was the reason to use the beta1, perhaps no reason except
> that it was the latest available at the moment...

Yup, I think that was exactly the reason in the case of parse-rss...

> 
> >
> > log4j-1.2.11.jar  src/plugin/clustering-carrot2/lib
> > log4j-1.2.6.jar 1 src/plugin/parse-rss/lib
> > log4j-1.2.9.jar   src/plugin/parse-pdf/lib
> >
> > nekohtml-0.9.2.jarsrc/plugin/clustering-carrot2/lib
> > nekohtml-0.9.4.jarsrc/plugin/parse-html/lib
> 
> The differences here AFAIK are purely accidental, and I believe we can
> just keep the latest releases.

Agreed.

> 
> >
> > xerces-2_6_2.jar  lib
> > xercesImpl.jarsrc/plugin/parse-rss/lib
> 
> Not sure about these ones, but Xerces APIs are pretty stable, so I'd
> risk removing xercesImpl.jar .

I think that xercesImpl.jar contains classes that are required by parse-rss
to function. I haven't investigated in a while, but don't xerces-2_6_2.jar
and xercesImpl.jar contain different classes?

> 
> >
> > Are there any known reasons to keep multiple versions of things, or
> > should we move these each into their own plugin that can be shared?
> 
> The latter is what I advocated for log4j and various xml-related high
> level API libs (jdom, dom4j, jaxen).

+1

Cheers,
 Chris

> 
> --
> Best regards,
> Andrzej Bialecki <><
>  ___. ___ ___ ___ _ _   __
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com




RE: found resource parse-plugins.xm?

2006-03-06 Thread Chris Mattmann
Hi Stefan,

> after a short time I already had 1602 times this line in my
> tasktracker log files:
> 060307 022707 task_m_2bu9o4  found resource parse-plugins.xml at
> file:/home/joa/nutch/conf/parse-plugins.xml
> 
> Sounds like this file is loaded 1602 times (after, let's say, 3 minutes). I
> guess that wasn't the goal, or am I overlooking anything?

It certainly wasn't the goal at all. After NUTCH-88, Jerome and I had the
following line in the ParserFactory.java class:

  /** List of parser plugins. */
  private static final ParsePluginList PARSE_PLUGIN_LIST =
      new ParsePluginsReader().parse();


(see revision 326889)

Looking at the revision history for the ParserFactory file, after the
application of NUTCH-169, the above changes to:


  private ParsePluginList parsePluginList;

  // ... code here

  public ParserFactory(NutchConf nutchConf) {
    this.nutchConf = nutchConf;
    this.extensionPoint = nutchConf.getPluginRepository().getExtensionPoint(
        Parser.X_POINT_ID);
    this.parsePluginList = new ParsePluginsReader().parse(nutchConf);

    if (this.extensionPoint == null) {
      throw new RuntimeException("x point " + Parser.X_POINT_ID
          + " not found.");
    }
    if (this.parsePluginList == null) {
      throw new RuntimeException(
          "Parse Plugins preferences could not be loaded.");
    }
  }


Thus, every time the ParserFactory is constructed, the parse-plugins.xml
file is read (it's the result of the call to
new ParsePluginsReader().parse(nutchConf)). So, if the file is loaded 1602
times, I'd guess that the ParserFactory is constructed 1602 times?
Additionally, I'm wondering why the parse-plugins.xml configuration
parameters aren't declared as final static anymore?

> That could be a serious performance improvement to just load this
> file once.

Yup, I think that's the reason we made it final static. If there is no
reason not to have it final static, I would suggest that it be put back to
final static. There may be a problem, however: since NUTCH-169, the
loading requires an existing Configuration object, I believe. So, we may need
a static Configuration object as well. Thoughts?

> I was not able to find the code that is logging this statement, has
> anyone a idea where this happens?

The statement gets logged within the ParsePluginsReader.java class, line 98:

ppInputStream = conf.getConfResourceAsInputStream(
  conf.get(PP_FILE_PROP));

HTH,
  Chris


> 
> Thanks.
> Stefan
> -
> blog: http://www.find23.org
> company: http://www.media-style.com




RE: found resource parse-plugins.xm?

2006-03-06 Thread Chris Mattmann
Hi Stefan,


> Hi Chris,
> thanks for the clarification.

No probs. 

> Do you think we can we somehow cache it in the nutchConf instance,
> since this is the way we doing this on other places as well?

Yeah I think we can. Here is a small patch to the ParserFactory that should
do the trick. Give it a test and let me know if it works. If it does, I
would say +1 to the committers to get this into the sources ASAP, no?

Index: src/java/org/apache/nutch/parse/ParserFactory.java
===================================================================
--- src/java/org/apache/nutch/parse/ParserFactory.java  (revision 383463)
+++ src/java/org/apache/nutch/parse/ParserFactory.java  (working copy)
@@ -55,7 +55,13 @@
     this.conf = conf;
     this.extensionPoint = PluginRepository.get(conf).getExtensionPoint(
         Parser.X_POINT_ID);
-    this.parsePluginList = new ParsePluginsReader().parse(conf);
+
+    if (conf.getObject("parsePluginList") != null) {
+      this.parsePluginList =
+          (ParsePluginList) conf.getObject("parsePluginList");
+    }
+    else {
+      this.parsePluginList = new ParsePluginsReader().parse(conf);
+    }
 
     if (this.extensionPoint == null) {
       throw new RuntimeException("x point " + Parser.X_POINT_ID + " not found.");


Cheers,
  Chris

> Cheers,
> Stefan
> 
> Am 07.03.2006 um 04:38 schrieb Chris Mattmann:
> 
> > Hi Stefan,
> >
> >> after a short time I already had 1602 time this lines in my
> >> tasktracker log files.
> >> 060307 022707 task_m_2bu9o4  found resource parse-plugins.xml at
> >> file:/home/joa/nutch/conf/parse-plugins.xml
> >>
> >> Sounds like this file is loaded 1602 (after lets say 3 minutes) I
> >> guess that wasn't the goal or do I oversee anything?
> >
> > It certainly wasn't the goal at all. After NUTCH-88, Jerome and I
> > had the
> > following line in the ParserFactory.java class:
> >
> >   /** List of parser plugins. */
> >   private static final ParsePluginList PARSE_PLUGIN_LIST =
> >   new ParsePluginsReader().parse();
> >
> >
> > (see revision 326889)
> >
> > Looking at the revision history for the ParserFactory file, after the
> > application of NUTCH-169, the above changes to:
> >
> >
> >   private ParsePluginList parsePluginList;
> >
> > //... code here
> >
> > public ParserFactory(NutchConf nutchConf) {
> > this.nutchConf = nutchConf;
> > this.extensionPoint = nutchConf.getPluginRepository
> > ().getExtensionPoint(
> > Parser.X_POINT_ID);
> > this.parsePluginList = new ParsePluginsReader().parse(nutchConf);
> >
> > if (this.extensionPoint == null) {
> >   throw new RuntimeException("x point " + Parser.X_POINT_ID + "
> > not
> > found.");
> > }
> > if (this.parsePluginList == null) {
> >   throw new RuntimeException(
> >   "Parse Plugins preferences could not be loaded.");
> > }
> >   }
> >
> >
> > Thus, every time the ParserFactory is constructed, the parse-
> > plugins.xml
> > file is read (it's the result of the call to
> > ParsePluginsReader().parse(nutchConf)). So, if the fie is loaded
> > 1602 times,
> > I'd guess that the ParserFactory is loaded 1602 times?
> > Additionally, I'm
> > wondering why the parse-plugins.xml configuration parameters aren't
> > declared
> > as final static anymore?
> >
> >> That could be a serious performance improvement to just load this
> >> file once.
> >
> > Yup, I think that's the reason we made it final static. If there is no
> > reason to not have it final static, I would suggest that it be put
> > back to
> > final static. There may be a problem however, now since NUTCH-169, the
> > loading requires an existing Configuration object I believe. So, we
> > may need
> > a static Configuration object as well. Thoughts?
> >
> >> I was not able to find the code that is logging this statement, has
> >> anyone a idea where this happens?
> >
> > The statement gets logged within the ParsePluginsReader.java class,
> > line 98:
> >
> > ppInputStream = conf.getConfResourceAsInputStream(
> >   conf.get(PP_FILE_PROP));
> >
> > HTH,
> >   Chris
> >
> >
> >>
> >> Thanks.
> >> Stefan
> >> -
> >> blog: http://www.find23.org
> >> company: http://www.media-style.com
> >
> >
> >
> 
> -
> blog: http://www.find23.org
> company: http://www.media-style.com




RE: found resource parse-plugins.xm?

2006-03-06 Thread Chris Mattmann
Sorry,

 My last patch was missing one line. Here's the update:

Index: src/java/org/apache/nutch/parse/ParserFactory.java
===================================================================
--- src/java/org/apache/nutch/parse/ParserFactory.java  (revision 383463)
+++ src/java/org/apache/nutch/parse/ParserFactory.java  (working copy)
@@ -55,7 +55,14 @@
     this.conf = conf;
     this.extensionPoint = PluginRepository.get(conf).getExtensionPoint(
         Parser.X_POINT_ID);
-    this.parsePluginList = new ParsePluginsReader().parse(conf);
+
+    if (conf.getObject("parsePluginList") != null) {
+      this.parsePluginList =
+          (ParsePluginList) conf.getObject("parsePluginList");
+    }
+    else {
+      this.parsePluginList = new ParsePluginsReader().parse(conf);
+      conf.setObject("parsePluginList", this.parsePluginList);
+    }
 
     if (this.extensionPoint == null) {
       throw new RuntimeException("x point " + Parser.X_POINT_ID + " not found.");


> -Original Message-
> From: Chris Mattmann [mailto:[EMAIL PROTECTED]
> Sent: Monday, March 06, 2006 7:51 PM
> To: 'nutch-dev@lucene.apache.org'
> Subject: RE: found resource parse-plugins.xm?
> 
> Hi Stefan,
> 
> 
> > Hi Chris,
> > thanks for the clarification.
> 
> No probs.
> 
> > Do you think we can we somehow cache it in the nutchConf instance,
> > since this is the way we doing this on other places as well?
> 
> Yeah I think we can. Here is a small patch to the ParserFactory that
> should do the trick. Give it a test and let me know if it works. If it
> does, I would say +1 to the committers to get this into the sources ASAP,
> no?
> 
> Index: src/java/org/apache/nutch/parse/ParserFactory.java
> ===================================================================
> --- src/java/org/apache/nutch/parse/ParserFactory.java  (revision 383463)
> +++ src/java/org/apache/nutch/parse/ParserFactory.java  (working copy)
> @@ -55,7 +55,13 @@
>      this.conf = conf;
>      this.extensionPoint = PluginRepository.get(conf).getExtensionPoint(
>          Parser.X_POINT_ID);
> -    this.parsePluginList = new ParsePluginsReader().parse(conf);
> +
> +    if (conf.getObject("parsePluginList") != null) {
> +      this.parsePluginList =
> +          (ParsePluginList) conf.getObject("parsePluginList");
> +    }
> +    else {
> +      this.parsePluginList = new ParsePluginsReader().parse(conf);
> +    }
> 
>      if (this.extensionPoint == null) {
>        throw new RuntimeException("x point " + Parser.X_POINT_ID + " not found.");
> 
> 
> Cheers,
>   Chris
> 
> > Cheers,
> > Stefan
> >
> > Am 07.03.2006 um 04:38 schrieb Chris Mattmann:
> >
> > > Hi Stefan,
> > >
> > >> after a short time I already had 1602 time this lines in my
> > >> tasktracker log files.
> > >> 060307 022707 task_m_2bu9o4  found resource parse-plugins.xml at
> > >> file:/home/joa/nutch/conf/parse-plugins.xml
> > >>
> > >> Sounds like this file is loaded 1602 (after lets say 3 minutes) I
> > >> guess that wasn't the goal or do I oversee anything?
> > >
> > > It certainly wasn't the goal at all. After NUTCH-88, Jerome and I
> > > had the
> > > following line in the ParserFactory.java class:
> > >
> > >   /** List of parser plugins. */
> > >   private static final ParsePluginList PARSE_PLUGIN_LIST =
> > >   new ParsePluginsReader().parse();
> > >
> > >
> > > (see revision 326889)
> > >
> > > Looking at the revision history for the ParserFactory file, after the
> > > application of NUTCH-169, the above changes to:
> > >
> > >
> > >   private ParsePluginList parsePluginList;
> > >
> > > //... code here
> > >
> > > public ParserFactory(NutchConf nutchConf) {
> > > this.nutchConf = nutchConf;
> > > this.extensionPoint = nutchConf.getPluginRepository
> > > ().getExtensionPoint(
> > > Parser.X_POINT_ID);
> > > this.parsePluginList = new ParsePluginsReader().parse(nutchConf);
> > >
> > > if (this.extensionPoint == null) {
> > >   throw new RuntimeException("x point " + Parser.X_POINT_ID + "
> > > not
> > > found.");
> > > }
> > > if (this.parsePluginList == null) {
> > >   throw new RuntimeException(
> > >   "Parse Plugins preferences could not be loaded.");
> > > }
> > >   }

Null Pointer exception in AnalyzerFactory?

2006-03-13 Thread Chris Mattmann
Hi Folks,

  I updated to the latest SVN revision (385691) today, and I am now seeing a
Null Pointer exception in the AnalyzerFactory.java class. It seems that in
some cases, the method:

  private Extension getExtension(String lang) {
    Extension extension = (Extension) this.conf.getObject(lang);
    if (extension == null) {
      extension = findExtension(lang);
      if (extension != null) {
        this.conf.setObject(lang, extension);
      }
    }
    return extension;
  }


has a null "lang" parameter passed to it, which causes a NullPointerException
at line 81 in src/java/org/apache/nutch/analysis/AnalyzerFactory.java.

I found that if I checked for null in the lang variable, and returned null
when lang == null, my crawl finished. Here is a small patch that will fix
the crawl:

Index: /Users/mattmann/src/nutch/src/java/org/apache/nutch/analysis/AnalyzerFactory.java
===================================================================
--- /Users/mattmann/src/nutch/src/java/org/apache/nutch/analysis/AnalyzerFactory.java  (revision 385691)
+++ /Users/mattmann/src/nutch/src/java/org/apache/nutch/analysis/AnalyzerFactory.java  (working copy)
@@ -78,14 +78,19 @@
 
   private Extension getExtension(String lang) {
-    Extension extension = (Extension) this.conf.getObject(lang);
-    if (extension == null) {
-      extension = findExtension(lang);
-      if (extension != null) {
-        this.conf.setObject(lang, extension);
-      }
-    }
-    return extension;
+    if (lang == null) {
+      return null;
+    }
+    else {
+      Extension extension = (Extension) this.conf.getObject(lang);
+      if (extension == null) {
+        extension = findExtension(lang);
+        if (extension != null) {
+          this.conf.setObject(lang, extension);
+        }
+      }
+      return extension;
+    }
   }
 
   private Extension findExtension(String lang) {


NOTE: not sure if returning null is the right thing to do here, but hey, at
least it made my crawl finish! :-)

Cheers,
  Chris



__
Chris A. Mattmann
[EMAIL PROTECTED]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_
Jet Propulsion LaboratoryPasadena, CA
Office: 171-266BMailstop:  171-246

___

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.



Re: Null Pointer exception in AnalyzerFactory?

2006-03-13 Thread Chris Mattmann
Thanks Jerome! :-)

Cheers,
  Chris



On 3/13/06 4:02 PM, "Jérôme Charron" <[EMAIL PROTECTED]> wrote:

>>   I updated to the latest SVN revision (385691) today, and I am now seeing
>> a
>> Null Pointer exception in the AnalyzerFactory.java class.
> 
> Fixed (r385702). Thanks Chris.
> 
> 
>> NOTE: not sure if returning null is the right thing to do here, but hey,
>> at
>> least it made my crawl finish! :-)
> 
> It is the right thing to do.
> 
> Cheers,
> 
> Jérôme
> 
> --
> http://motrech.free.fr/
> http://www.frutch.org/

__
Chris A. Mattmann
[EMAIL PROTECTED]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_
Jet Propulsion LaboratoryPasadena, CA
Office: 171-266BMailstop:  171-246
___

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.




Re: 0.8 release schedule (was Re: latest build throws error - critical)

2006-04-06 Thread Chris Mattmann
+1 for a release sooner rather than later. Several interesting features
contributed since the 0.7 branch are, I believe, now tested and
production-worthy, at least in my environment. Hats off to the folks who
were able to split MapReduce and NDFS out into Hadoop -- I'm going to be
experimenting with that portion of the code over the next few weeks on a
16-node, 32-processor Opteron cluster at JPL that will be used as the
development machine for a large-scale earth science data processing mission.
Because the Hadoop code is in its own project now, I can leverage and test
the Hadoop processing and HDFS capability without having to include all the
search-engine-specific stuff. Ya! :-)

Cheers,
  Chris



On 4/6/06 12:59 PM, "Andrzej Bialecki" <[EMAIL PROTECTED]> wrote:

> Doug Cutting wrote:
>> TDLN wrote:
>>> I mean, how do others keep uptodate with the main codeline? Do you
>>> advice updating everyday?
>> 
>> Should we make a 0.8.0 release soon?  What features are still missing
>> that we'd like to get into this release?
> 
> I think we should make a release soon - instabilities related to Hadoop
> split are mostly gone now, and we need to endorse the new architecture
> more officially...
> 
> The "adaptive fetch" and "scoring API" functionality are the top
> priority for me. While the scoring API change is pretty innocuous, we
> just need to clean it up, the adaptive fetch changes have a big
> potential for wrecking the main re-fetch cycle ... ;)
> 
> We could do it in two ways: I could apply this patch and let people run
> with it for a while, fixing bugs as they pop up - but then it will be
> another 3-4 weeks I suppose. Or we could wait with this after the release.

__
Chris A. Mattmann
[EMAIL PROTECTED]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_
Jet Propulsion LaboratoryPasadena, CA
Office: 171-266BMailstop:  171-246
___

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.




Re: 0.8 release schedule (was Re: latest build throws error - critical)

2006-04-07 Thread Chris Mattmann
+1


On 4/7/06 10:20 AM, "Doug Cutting" <[EMAIL PROTECTED]> wrote:

> Chris Mattmann wrote:
>> +1 for a release sooner rather than later.
> 
> I think this is a good plan.  There's no reason we can't do another
> release in a month.  If it is back-compatible we can call it 0.8.x and
> if it's incompatible we can call it 0.9.0.
> 
> I'm going to make a Hadoop 0.1.1 release today that can be included in
> Nutch 0.8.0.  (With Hadoop we're going to aim for monthly releases, with
> potential bugfix releases between when serious bugs are found.  The big
> bug in Hadoop 0.1.0 is http://issues.apache.org/jira/browse/HADOOP-117.)
> 
> So we could aim for a Nutch 0.8.0 release sometime next week.  Does that
> work for folks?
> 
> Piotr, would you like to make this release, or should I?
> 
> Doug

__
Chris A. Mattmann
[EMAIL PROTECTED]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_
Jet Propulsion LaboratoryPasadena, CA
Office: 171-266BMailstop:  171-246
___

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.




Re: 0.8 release schedule (was Re: latest build throws error - critical)

2006-04-07 Thread Chris Mattmann
Hi Andrzej,


On 4/7/06 12:18 PM, "Andrzej Bialecki" <[EMAIL PROTECTED]> wrote:

> Do you guys have any additional insights / suggestions whether NUTCH-240
> and/or NUTCH-61 should be included in this release?

Looking at the JIRA popular issues panel for Nutch
(http://issues.apache.org/jira/browse/NUTCH?report=com.atlassian.jira.plugin.system.project:popularissues-panel),
I note that NUTCH-61 is the most popular issue right now, with 7 votes.
Additionally, NUTCH-240 shares the 3rd-most votes (4) with NUTCH-134. So, all
in all, there are 4 issues with >= 4 votes in JIRA. Of those 4 issues, 3 have
attached patches in JIRA. Would it be safe to say that the committers should
focus on committing NUTCH-61, NUTCH-240, and NUTCH-48, since these 3 issues
all have attached patch files, and then freeze for the 0.8.0 release? As for
my own opinion, I recently downloaded and reviewed NUTCH-61, and really like
the patch. +1 on my end. I haven't tried out NUTCH-240 yet, but it seems to
be a logical extension point for Nutch to be able to plug in different
scoring components. So, +1 from me.

Cheers,
  Chris


__
Chris A. Mattmann
[EMAIL PROTECTED]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_
Jet Propulsion LaboratoryPasadena, CA
Office: 171-266BMailstop:  171-246
___

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.




0.8 release?

2006-04-12 Thread Chris Mattmann
Hi Guys,

  Any progress on the 0.8 release? Was there any resolution about which JIRA
issues to complete before the 0.8 release? We had a bit of conversation
there and some ideas, but no definitive answer...

Thanks for your help, and sorry to pester ;)

Cheers,
  Chris

__
Chris A. Mattmann
[EMAIL PROTECTED]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_
Jet Propulsion LaboratoryPasadena, CA
Office: 171-266BMailstop:  171-246
___

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.




RE: plugin.dtd

2006-04-16 Thread Chris Mattmann
Hi Stefan,

  The DTD actually does allow for custom attributes: Jerome factored them
out of the form:

=""
=""
 .
>

Into the form:





...

See the difference? Using the parameter tags, we can have a generic DTD that
supports any parameter name and value. The other way, I would have had to go
through all the plugin.xml files and add each attribute to the attlist for the
implementation tags, which probably isn't the way that we want to do it.
That way, you have to change the DTD every time you introduce a new
attribute to a plugin.xml file.

Thanks!

Cheers,
  Chris


__
Chris A. Mattmann
[EMAIL PROTECTED] 
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_
Jet Propulsion LaboratoryPasadena, CA
Office: 171-266BMailstop:  171-246
___

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.

> -Original Message-
> From: Stefan Groschupf [mailto:[EMAIL PROTECTED]
> Sent: Sunday, April 16, 2006 6:03 AM
> To: nutch-dev@lucene.apache.org
> Subject: plugin.dtd
> 
> Hi,
> 
> Looks like the dtd does not allow custom attributes in extension
> definition nodes.
> This would be very counterproductive for custom plugins, and one idea
> of a plugin system is to have custom plugins.
> Can this be fixed by allowing any kind of custom attributes?
> 
> Thanks.
> Stefan
> 
> ---
> company:http://www.media-style.com
> forum:http://www.text-mining.org
> blog:http://www.find23.net
> 




RE: [Proposal] New Lucene sub-project

2006-04-24 Thread Chris Mattmann
Hi Otis,

> This thread seems to have gotten very little attention.
> Jérôme - I'm all for extracting sub-libraries that can really live on
> their own and are substantial enough to warrant "their own identity".
> 
> Personally, I'm the most interested in Language Identifier plugin becoming
> a standalone, Nutch-independent piece.  Doug had suggested we move it to
> Lucene's contrib section.  If you think it makes sense to have some of
> these things lumped together, that's fine, too.  It looks like Language
> Identifier and Charset Detector may go well together.
> 
> Is this something you want/will push for and make happen?

Just to add to this, it's something that I would push for whole-heartedly.
In addition to Jerome, I would be happy to dedicate time to this
sub-project, and feel it's quite worthy of being its own stand-alone
library. 

Just my two cents, thanks!

Cheers,
  Chris


> 
> Otis
> 
> - Original Message 
> From: Jérôme Charron <[EMAIL PROTECTED]>
> To: nutch-dev@lucene.apache.org
> Sent: Friday, April 7, 2006 4:26:54 AM
> Subject: [Proposal] New Lucene sub-project
> 
> Hi all,
> 
> While chatting with Chris Mattmann, it seems to be evident to us that
> there
> is a need for a new sub-project within Lucene.
> 
> For now, Lucene's sub-projects used in Nutch are :
> 1. Lucene-java - The basis for search technology
> 2. Hadoop - The distributed computing platform
> 3. Nutch - The search engine that relies on Lucene and Hadoop.
> 
> Since Nutch contains some value added pieces of code that focus on content
> analysis,
> we think it would be a good idea to split Nutch into a new sub-project
> based
> on content analysis
> manipulation. The components we have identified are :
> 
> 1. MimeType Repository
> 2. Language Identifier
> 3. Content Signature (MD5Signature / TextProfileSignature / ...)
> (4. Generic Meta Data Infrastructure)
> (5. Charset Detector)
> (6. Parse Plugins Framework)
> 
> The idea is to expose these pieces of codes into a standalone lib, since
> we
> are convinced they could be usefull
> in many other projects than Nutch.
> The benefits will be to have some code more widely used / tested /
> contributed.
> If this proposal is accepted, we have a candidate name for this new
> project:
> Tika (comes from my son  ;-) )
> 
> Any comment is welcome.
> 
> Jérôme
> 




Re: Nutch Parser Bug

2006-04-25 Thread Chris Mattmann
Hi Alex,

 I also noticed this issue a while back. It's described here:

http://mail-archives.apache.org/mod_mbox/lucene-nutch-dev/200510.mbox/%3c435
[EMAIL PROTECTED]

Cheers,
  Chris



On 4/25/06 2:41 PM, "Alex" <[EMAIL PROTECTED]> wrote:

> Hi there,
> 
> I'm fairly new to nutch and in working on the nutch search I realize that when
> I try to search for terms such as "#1 top item sales", the search seems to
> ignore everything after the "#" sign. I also tried with other symbols such as
> @, !, $, %, ^, etc... those seem to be ignored. This seems to be a problem in
> the Query.parse method. Can this be added to the list of bug fixes for the next
> build?  Or is it something that's already been done?  Please advise. Thank you.
> 
> Alex
> 
> -
> Yahoo! Messenger with Voice. Make PC-to-Phone Calls to the US (and 30+
> countries) for 2¢/min or less.

__
Chris A. Mattmann
[EMAIL PROTECTED]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_
Jet Propulsion LaboratoryPasadena, CA
Office: 171-266BMailstop:  171-246
___

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.




Re: [jira] Commented: (NUTCH-258) Once Nutch logs a SEVERE log item, Nutch fails forevermore

2006-06-05 Thread Chris Mattmann
Folks,

 Before I (or someone else) reopens the issue, I think it's important to
understand the implications:

> 1) Having a *side-effect* of the entire system stopping processing after merely
> logging a message at a certain event level is a poor practice.

I'm not sure that the Fetcher quitting is a * side-effect * as you call it.
In fact, I think it's clearly stated as the behavior of the system, both
within the code, and in several mailing list conversations I've seen over
the course of the past two years (I can dig these up, if needed).

> In fact, I believe that this would make a fantastic anti-pattern.  If this
> kind of behavior is *really* wanted (and I argue that it should not be below),
> it should be done through an explicit mechanism, not as a side-effect.

Again, the use of side-effect here is strange to me: how is an explicit
check for any messages logged at the SEVERE level before quitting a
"side-effect"?

> For example, did you realize that since Hadoop hijacks and reassigns all log
> formatters (also a bad practice!) in the org.apache.hadoop.util.LogFormatter
> static constructor, that anyone using Nutch as a library who logs a SEVERE
> error will suffer by having Nutch stop fetching?

I'm not convinced that having Nutch stop fetching when a SEVERE error is
logged is the wrong behavior. Let's think about what possible SEVERE errors
may typically be logged: Out of Memory error, potentially,
InterruptedExceptions in Threads (possibly), failure in any of the plugin
libraries critical to the fetch running (possibly), the list goes on and on.
So, in this case, you argue that the Fetcher should continue operating?

> 2) Moreover, having the system stop processing forever more by use of a
> static(!) flag makes the use of the Nutch system as a library within a server
> or service environment impossible.  Once this logging is done, no more Fetcher
> processing in this run *or any other* can take place.

I've been using Nutch in a server environment (JSPs and Tomcat) within a
large-scale data system at NASA over the course of the past year, and have
never been impeded by the behavior of the fetcher. Can you be more specific
here as to the exact use-case that's failing in your scenario? I've also
been watching the mailing lists for the better part of almost 2 years, and
have seen little traffic (outside of the aforementioned clarifications/etc.
above) about this issue. I may be out on an island here, but again, I'm not
convinced that this is a core issue.

Just my 2 cents. If the votes continue that this is an issue, however, I'll
have no problem opening it up (or one of the committers can do it as well).

Cheers,
  Chris





On 6/5/06 7:11 AM, "Stefan Groschupf (JIRA)" <[EMAIL PROTECTED]> wrote:

> [ 
> http://issues.apache.org/jira/browse/NUTCH-258?page=comments#action_12414763 ]
> 
> Stefan Groschupf commented on NUTCH-258:
> 
> 
> Scott, 
> I agree with you. However we need a clean patch to solve the problem, we can
> not just comment things out of the code.
> So I vote for the issue and I vote to reopen this issue.
> 
>> Once Nutch logs a SEVERE log item, Nutch fails forevermore
>> --
>> 
>>  Key: NUTCH-258
>>  URL: http://issues.apache.org/jira/browse/NUTCH-258
>>  Project: Nutch
>> Type: Bug
> 
>>   Components: fetcher
>> Versions: 0.8-dev
>>  Environment: All
>> Reporter: Scott Ganyo
>> Priority: Critical
>>  Attachments: dumbfix.patch
>> 
>> Once a SEVERE log item is written, Nutch shuts down any fetching forevermore.
>> This is from the run() method in Fetcher.java:
>> public void run() {
>>   synchronized (Fetcher.this) {activeThreads++;} // count threads
>>
>>   try {
>>     UTF8 key = new UTF8();
>>     CrawlDatum datum = new CrawlDatum();
>>
>>     while (true) {
>>       if (LogFormatter.hasLoggedSevere())   // something bad happened
>>         break;                              // exit
>>   
>> Notice the last 2 lines.  This will prevent Nutch from ever Fetching again
>> once this is hit as LogFormatter is storing this data as a static.
>> (Also note that "LogFormatter.hasLoggedSevere()" is also checked in
>> org.apache.nutch.net.URLFilterChecker and will disable this class as well.)
>> This must be fixed or Nutch cannot be run as any kind of long-running
>> service.  Furthermore, I believe it is a poor decision to rely on a logging
>> event to determine the state of the application - this could have any number
>> of side-effects that would be extremely difficult to track down.  (As it has
>> already for me.)

__
Chris A. Mattmann
[EMAIL PROTECTED]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_
Jet Propulsion LaboratoryPasadena, CA

Re: [jira] Commented: (NUTCH-258) Once Nutch logs a SEVERE log item, Nutch fails forevermore

2006-06-05 Thread Chris Mattmann
Hi Andrzej,

> 
> The main problem, as Scott observed, is that the static flag affects all
> instances of the task executing inside the same JVM. If there are
> several Fetcher tasks (or any other tasks that check for SEVERE flag!),
> belonging to different jobs, all of them will quit. This is certainly
> not the intended behavior.
> 

Got it.

>>   
>>> In fact, I believe that this would make a fantastic anti-pattern.  If this
>>> kind of behavior is *really* wanted (and I argue that it should not be
>>> below),
>>> it should be done through an explicit mechanism, not as a side-effect.
>>> 
>> 
>>   
> 
> I have a proposal for a simple solution: set a flag in the current
> Configuration instance, and check for this flag. The Configuration
> instance provides a task-specific context persisting throughout the
> lifetime of a task - but limited only to that task. Voila - problem
> solved. We get rid of the dubious use of LogFormatter (I hope Chris that
> even you would agree that this pattern is slightly .. unusual ;) )

What, "unusual"? Huh? :-)

> and 
> we gain flexible mechanism limited in scope to the current task, which
> ensures isolation from other tasks in the same JVM. How about that?

+1

I like your proposed solution. I haven't really used multiple fetchers
inside the same process too much; however, I do have an application that
calls fetches in more of a sequential way in the same JVM. So, I guess I
just never ran across the behavior. The thing I like about the proposed
solution is its separation and isolation of a task context, which I think
that Nutch (now relying on Hadoop as the underlying architectural computing
platform) needed to address.

So, to summarize, the proposed resolution is:

* add flag field in Configuration instance to signify whether or not a
SEVERE error has been logged within a task's context

* check this field within the fetcher to determine whether or not to stop
the fetcher, just for that fetching task identified by its Configuration
(and no others)
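
In code, the proposal boils down to something like this (a minimal sketch
only; the property key and helper below are illustrative, not an actual
Nutch or Hadoop API):

  import org.apache.hadoop.conf.Configuration;

  public class TaskSevereFlag {
    // Hypothetical property key; the real patch would pick its own name.
    private static final String KEY = "task.logged.severe";

    // Record a SEVERE event in this task's own Configuration.
    public static void markSevere(Configuration conf) {
      conf.set(KEY, "true");
    }

    // Only this task's Configuration is consulted, so fetcher tasks in
    // other jobs within the same JVM are unaffected.
    public static boolean hasLoggedSevere(Configuration conf) {
      return conf.getBoolean(KEY, false);
    }
  }

The fetcher's run() loop would then test hasLoggedSevere(conf) instead of
the static LogFormatter.hasLoggedSevere().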

Is this representative of what you're proposing Andrzej? If so, I'd like to
take the lead on contributing a small patch that handles this, and then it
would be great if people like Scott could test this out in their existing
environments where this error was manifesting itself.

Thanks!

Cheers,
  Chris

(BTW: would you like me to re-open the JIRA issue, or do you want to do it?)

__
Chris A. Mattmann
[EMAIL PROTECTED]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_
Jet Propulsion LaboratoryPasadena, CA
Office: 171-266BMailstop:  171-246
___

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.




RE: Library for extracting text content from binaries

2006-07-24 Thread Chris Mattmann
Hi Jukka,

  Thanks for your email. Jerome Charron and I proposed a project with a
similar goal in mind that we wanted to dub "Tika". Tika would effectively be
a Lucene sub-project, and would factor out some of the capabilities you
mention below from Nutch, incl:

1. MimeType repository
2. Parser interface and Parser plugins
3. Metadata infrastructure
4. LanguageIdentifier

And a few others. Here is the mailing list thread discussion that we had a
few months back:

http://mail-archives.apache.org/mod_mbox/lucene-nutch-dev/200604.mbox/%3cc82
[EMAIL PROTECTED]

Jerome and I have been quite busy lately, however, and we haven't had a
chance to draft the proposal to send to the Lucene PMC, although Doug (and a
few others) told us that if we garner enough support and feel that the
project would make a significant contribution as its own Lucene
sub-project, to email the PMC and see what happens. If you're interested in
this idea, maybe it would be a good idea to contact Jerome and me off-list,
and maybe we could get going on a proposal.

Thanks!

Cheers,
  Chris


__
Chris A. Mattmann
[EMAIL PROTECTED] 
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_
Jet Propulsion LaboratoryPasadena, CA
Office: 171-266BMailstop:  171-246
___

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.


> -Original Message-
> From: Jukka Zitting [mailto:[EMAIL PROTECTED]
> Sent: Monday, July 24, 2006 11:29 AM
> To: nutch-dev@lucene.apache.org
> Subject: Re: Library for extracting text content from binaries
> 
> Hi,
> 
> Any interest in this? If not, is there some other Lucene project that
> I should approach?
> 
> BR,
> 
> Jukka Zitting
> 
> On 7/18/06, Jukka Zitting <[EMAIL PROTECTED]> wrote:
> > Hi,
> >
> > I'm a committer of the Apache Jackrabbit project, and I've recently
> > been working on improving the full text indexing support in
> > Jackrabbit. We've used standard Lucene Java as the embedded full text
> > search engine in Jackrabbit, but created our own set of parsers for
> > extracting text content from binary files. So far our parser interface
> > TextFilter [1] has been Jackrabbit-specific, but my recent refactoring
> > proposal, TextExtractor, [2] aims for a generic solution that converts
> > a generic InputStream into a Reader for passing to Lucene Java.
> >
> > Before coming up with the proposal I tried looking for similar
> > solutions, but couldn't find any that would have satisfied my
> > requirement of no external dependencies other than the JRE. Your
> > o.a.nutch.parse.Parser interface however came quite close, and you
> > already have an extensive set of existing implementations, so I'd like
> > to leverage your work with the Parser implementations while finding a
> > way to avoid the full Nutch and Hadoop dependencies. I believe that
> > there are a number of other Lucene users who have similar needs.
> >
> > Thus I'd like to ask if there would be interest in making your Parser
> > interface and implementations more easily accessible to external
> > projects, perhaps as a separate library. If  you're interested, I'd be
> > happy to participate in such an effort.
> >
> > [1]
> http://svn.apache.org/viewvc/jackrabbit/trunk/jackrabbit/src/main/java/org
> /apache/jackrabbit/core/query/TextFilter.java?view=markup
> > [2] http://issues.apache.org/jira/browse/JCR-415
> >
> >
> > BR,
> >
> > Jukka Zitting
> >
> > --
> > Yukatan - http://yukatan.fi/ - [EMAIL PROTECTED]
> > Software craftsmanship, JCR consulting, and Java development
> >



Re: parse-plugins.xml

2006-08-03 Thread Chris Mattmann
Hi Marko,

   Thanks for your question. Basically it was set up as a sort of "last
resort" for getting at least * some * information from the PDF file, albeit
littered with garbage. If indeed the parse-text does not really make sense
in terms of a backup parser to handle PDF files and get at least some text
to index, then we may think of either (a) removing it from the default
parse-plugins.xml, or (b) writing a simple PdfParser that can handle
truncation as a backup to the existing PdfParser. Basically the philosophy
behind each mimeType entry in parse-plugins.xml is to try and map the set of
existing Nutch parse-plugins to the available content types, giving each
mimeType as many options as possible in terms of getting some content out of
them. 

Cheers,
  Chris



On 8/3/06 4:04 AM, "Marko Bauhardt" <[EMAIL PROTECTED]> wrote:

> Hi all,
> i have a question about the parse-plugins.xml and application/pdf.
> Why is the TextParser used for parsing pdf files? The mimeType
> "application/pdf" is mapped to "parse-pdf" and "parse-text". But the
> TextParser does not support pdf files.
> The problem is, if the pdf file is truncated the textparser "parses"
> this content and the indexer indexes "waste". So what is the reason
> to map "application/pdf" to the "parse-text" plugin?
> 
> <mimeType name="application/pdf">
> <plugin id="parse-pdf" />
> <plugin id="parse-text" />
> </mimeType>
> 
> Thanks for hints,
> Marko
> 
> 




Re: parse-plugins.xml

2006-08-03 Thread Chris Mattmann
Hey Andrzej,


On 8/3/06 8:19 AM, "Andrzej Bialecki" <[EMAIL PROTECTED]> wrote:

> Chris Mattmann wrote:
>> Hi Marko,
>> 
>>Thanks for your question. Basically it was set up as a sort of "last
>> result" of getting at least * some * information from the PDF file, albeit
>> littered with garbage. If indeed the parse-text does not really make sense
>>   
> 
> IMO it doesn't make sense. PDF text content, even if it's available in
> plain text, is usually compressed. The percentage of non-compressed PDFs
> out there in my experience is negligible.
> 
>> in terms of a backup parser to handle PDF files and get at least some text
>> to index, then we may think of either (a) removing it from the default
>>   
> 
> +1

Okey dok, you'll find a quick patch for this at:

http://issues.apache.org/jira/browse/NUTCH-338

I decided to create an issue to just keep track of the fact that we made
this change, and additionally because I tried pasting the quick patch into
my email program here on my Mac and it looked like it was coming out weird
:-)

> 
>> parse-plugins.xml, or (b) writing a simple PdfParser that can handle
>> truncation as a backup to the existing PdfParser. Basically the philosophy
>>   
> 
> I think that "simple PDF parser" is an oxymoron ... ;)

Heh, I agree with you on that one. If everyone would just move to XML
DocBook, then it would be great! ;)


Thanks!

Cheers,
  Chris

> 




Patch Available status?

2006-08-15 Thread Chris Mattmann
Hi Guys,

 I've seen on the Hadoop mailing list recently that there was a new status
added for issues in JIRA called "Patch Available" to let committers know
that a patch is ready for review to commit. How about we add this to the
Nutch jira instance as well? I tried doing this, but I don't think I have
the permissions to do so.

 I've got 2 patches for issues that are attached in jira that I'd like to
set as having this new status :-)

https://issues.apache.org/jira/browse/NUTCH-338
https://issues.apache.org/jira/browse/NUTCH-258

Cheers,
  Chris





Re: Tika update

2006-08-16 Thread Chris Mattmann
Hi Jukka,

 Thanks for your email. Indeed, there was discussion on the Lucene PMC email
list about the Tika project. It was decided by the powers that be to
discuss it more on the Nutch mailing list before moving forward with any
vote on making Tika a sub-project of Apache Lucene. With regards to that, my
action was to send the Tika proposal to the nutch-dev list, and help to
start up a discussion on Tika, to get feedback from the community. Seeing as
you lit the fire under this (thanks!), it's only appropriate for
me to send out the Tika project proposal we sent to the Lucene PMC. So, here it
is, attached. I'd love to hear feedback from the Nutch community on what it
thinks of such a project.

Cheers,
   Chris



On 8/16/06 4:06 AM, "Jukka Zitting" <[EMAIL PROTECTED]> wrote:

> Hi,
> 
> There was recently discussion on perhaps starting a new Lucene
> sub-project, named Tika, to create a general-purpose library from the
> parser components and other features in Nutch that might interest a
> wider audience. To keep things rolling we've created a temporary
> staging area for the project at http://code.google.com/p/tika/ on
> Google Code, and I've started to flesh out a potential project
> structure using Maven 2.
> 
> Note that the project materials in svn refer to the project as "Apache
> Tika" even though the project has *not* been officially accepted. The
> reason for this is that the Google Code project is just a temporary
> staging ground and I wanted to give a better idea of what the project
> could look like if accepted. The jury is still out on whether to start
> a project like this, so any comments and feedback on the idea are very
> much welcome.
> 
> Most, if not all, code in Tika will be based on existing code from
> Nutch and other Apache projects, so I'm not sure if the project needs
> to go through the Incubator if accepted by the Lucene PMC.
> 
> So far the tika source tree contains just a modified version of my
> TextExtractor code from the Apache Jackrabbit project, and Jérôme is
> planning to add some of his stuff. The source tree at Google Code
> should be considered just a playground for bringing things together
> and discussing ideas, before migrating back to ASF infrastructure.
> 
> BR,
> 
> Jukka Zitting
> 
> --
> Yukatan - http://yukatan.fi/ - [EMAIL PROTECTED]
> Software craftsmanship, JCR consulting, and Java development




Re: Tika update

2006-08-16 Thread Chris Mattmann
Hmmm I guess the nutch-dev list doesn't like MS Word attachments. Here's the
content of the proposal, pasted in plain text:


Proposal for new Lucene Sub Project called "Tika"
 
Chris A. Mattmann, Jerome Charron
 
Overview
 
With its simple but efficient plugin system, Nutch is becoming more and more
of a search engine framework that can easily be tuned to many kinds of
domain-specific search applications (e.g., corporate, personal, internet,
vertical search, general search). Nutch is a standalone "library",
containing search engine tools such as a crawler, and tools for index
management. Nutch is also very much a component in its own right, exporting
its own API, and having the ability to be used as a plugin component in
other systems. 
 
However, the current Nutch software contains many "value-added" pieces of
code that are monolithically packaged together. If the services and
capabilities from the code were provided as separate, modular component
libraries, such services and capabilities could benefit many projects,
besides just Nutch. Ideally, one would not want to include the entire Nutch
jar file to take advantage of its content parsing tools. Additionally, with
the formulation of Hadoop, there is precedent for breaking Nutch down into
different component libraries. So far, the "value-added" pieces of code,
monolithically packaged, that we have identified are of three kinds:
 
* Infrastructure : The Nutch plugin system : This plugin system, as a
standalone library can be reused in Lucene, Nutch, Solr and many others
projects to easily provide some extensible capabilities.
* Content analysis : the MimeType repository, the language identifier, the
summarizers, the signature implementations. These pieces of code could be
useful in any content related project.
* Content Parsing : All of Nutch's parse plugins. These plugins are
generally more or less wrappers based on external APIs. Their
added value is to provide a common API for accessing many types of
content. Again, it could be very useful in many content based projects
(Lucene based projects, Solr, ...)
 
It is our proposal that these identified pieces of code be extracted and
maintained in a separate library that we would dub "Tika". Tika would become
a sub-project of Lucene in a similar fashion to that of Hadoop, and similar
to its graduation out of Nutch that occurred recently. Tika would be a
framework and API for content analysis, and parsing in large
scale-distributed systems. It would also include Nutch's useful plugin
system, which could be easily reused across many projects, both within and
outside of Lucene.
 
Benefits of Extracting the Aforementioned Value Added Code Fragments into
their Own Library
 
* Avoid duplicating code over Lucene's subprojects.
* Better visibility of these pieces of code.
* Wider usage of these pieces of code.
* The two previous points will provide a better extension and maintenance of
these pieces of code.
 
RoadMap
 
* tika-0.1 : Simply gathers the easiest Nutch externalizable code (MimeType,
LanguageIdentifier, Summarizer, Signature)
* tika-0.2 : Provides a generic plugin mechanism
* tika-0.3 : Provides content parsing / analysis plugins
  * Operational Tika library
  * Nutch external dependency on Tika
* tika-0.4 and beyond: issues identified by community, more content parsing
plugins, graphical user interfaces, command line tools, and more
 
 
There has been some interest in a generic framework for content
analysis and metadata management on the Nutch mailing lists recently. From
that interest, we have gathered the following list of candidate committers
who have expressed interested in our proposed project. The leader of the
Tika project would be Chris Mattmann. Chris works at NASA's Jet Propulsion
Laboratory as a Member of the Technical Staff in the Modeling and Data
Management Systems Section. Chris has contributed many patches to Nutch, and
a single patch to the Hadoop project as well. In addition to his work at
JPL, Chris is also a Ph.D. candidate at the University of Southern
California's Center for Software Engineering, where he works with his
advisor Dr. Nenad Medvidovic researching software architecture for
data-intensive systems. Chris's dissertation research investigates software
connectors and their properties in large-scale, distributed, data-intensive
systems. His expected date of defense is May 2007. The other "core" member
of the commit team would be Jerome Charron, one of Nutch's existing
committers. Jerome has contributed many useful patches to the Nutch system,
including the metadata analysis container and the mime type identification
system. Though Chris would be the lead of the project, the oversight and
vision for the project would be shared between Jerome and Chris. The full
list of candidate committers is as follows. This list is not meant to be
exhaustive ...

Re: Any plans to move to build Nutch using Maven?

2006-08-16 Thread Chris Mattmann
Hi Steven,


On 8/16/06 7:36 AM, "steven shingler" <[EMAIL PROTECTED]> wrote:

> (This thread moved from the User List.)
> 
> OK Lukas, lets open it up to the dev list! :)
> 
> Particularly, does the group feel moving to Maven would be _a good thing_ ?

+1

I suggested this (however did not make any progress on realizing it ;) ) a
while back. I think it makes a * lot of sense *. Maven's dependency system
would significantly reduce the size of the CM'ed Nutch source code, as all
the jars required by Nutch could be referenced externally (plugins are a
different beast, but we're working on that). Additionally, Maven would allow
automatic generation of a sort of "nightly build" Nutch site, showing recent
commits, unit test results and more.

> 
> Even if so, what are the problems?

The main problem I see is the plugin system, and how to appropriately
represent plugin dependencies in Maven (or just neglect to elegantly handle
them, and treat them like individual projects, like nutch, which requires
CM'ing jar files). Additionally, I think it will probably require writing
some custom Jelly scripts to do all the neat ant build stuff that Nutch does
on the side (e.g., unpack Hadoop, etc.).

> 
> There are currently two versions of Lucene in the Maven repos, but Hadoop
> would have to be added manually, I think.

It would probably make most sense to run a Maven repo explicitly for Nutch
off of the Lucene Nutch site. Something like
(http://lucene.apache.org/nutch/maven/) might be sensible.

Just my 2 cents.

Cheers,
  Chris

> 
> All thoughts gratefully received.
> Cheers
> Steven
> 
> On 8/16/06, Lukas Vlcek <[EMAIL PROTECTED]> wrote:
>> 
>> Hi,
>> 
>> I would like to help. But first of all I would suggest to start wider
>> discussion in dev list to get more feedback/suggestions. I think one
>> problem can be that Nutch depends on both Lucene and Hadoop libraries
>> and it won't be easy to maintain these dependencies if recent versions
>> are not yet committed into some maven accessible repo.
>> 
>> Regards,
>> Lukas
>> 
>> On 8/16/06, steven shingler <[EMAIL PROTECTED]> wrote:
>>> Well I'm up for giving it a try. My current work has me looking at both
>>> Nutch and Maven, so what better way to understand both projects :)
>>> 
>>> I agree it is far from trivial - so if anyone here would like to
>> collaborate
>>> on it, that would be great.
>>> Cheers,
>>> Steven
>>> 
>>> 
>>> On 8/15/06, Lukas Vlcek <[EMAIL PROTECTED]> wrote:
 
 Hi,
 
 I would warmly appreciate this activity. At least it would help more
 people to understand/join this great project. But I don't think this
 will be an easy step (this reminds me what N.Armstrong said on moon:
 That's one small step for [a] man, one giant leap for mankind.)
 :-)
 
 Regards,
 Lukas
 
 On 8/15/06, Sami Siren <[EMAIL PROTECTED]> wrote:
> steven shingler wrote:
>> Hi all,
>> 
>> I know this has come up at least once before, but I just thought
>> I'd
 raise
>> the question again:
>> 
>> Are there any plans to move to building Nutch using Maven?
> 
> Haven't heard of such activities, however if you or somebody else
> can put such thing together and it proves to be a good thing to do
>> then
> I certainly don't have anything against it.
> 
> --
>   Sami Siren
> 
 
>>> 
>>> 
>> 




Re: 0.8 not loading plugins

2006-08-17 Thread Chris Mattmann
Hi Chris,

 It seems from your email message that your plugin is located in
$NUTCH_HOME/build/custom-meta? Is this where your plugin * code * is
currently stored? If so, this is the wrong location and the most likely
reason that your plugin isn't being loaded.

 Plugin code should live in $NUTCH_HOME/src/plugin, so in your case, you'd
have /usr/local/nutch-0.8/src/plugin/custom-meta, with the underlying plugin
code dir structure underneath there. Then, to deploy your plugin to the
build directory (which is $NUTCH_HOME/build/plugins), you would type: ant
deploy.

Give this a shot and see if that fixes it.

Cheers,
  Chris



On 8/17/06 3:05 PM, "Chris Stephens" <[EMAIL PROTECTED]> wrote:

> It's definitely not trying to load my plugin; I added that debug setting
> and didn't see anything regarding my plugin.  One thing I noticed is
> that my plugin is not in the plugins directory.  At what point do the
> plugins get copied there?  Here is the output from my compile:
> 
> compile:
>  [echo] Compiling plugin: custom-meta
> [javac] Compiling 3 source files to
> /usr/local/nutch-0.8/build/custom-meta/classes
> [javac] Note: Some input files use or override a deprecated API.
> [javac] Note: Recompile with -Xlint:deprecation for details.
> 
> jar:
>   [jar] Building jar:
> /usr/local/nutch-0.8/build/custom-meta/custom-meta.jar
> 
> deps-test:
> 
> deploy:
>  [copy] Copying 1 file to /usr/local/nutch-0.8/build/plugins/custom-meta
> 
> HUYLEBROECK Jeremy RD-ILAB-SSF wrote:
>> Did you check if your plugin.xml is read by putting the plugin package
>> in debug mode?
>> (put this in the log4j.properties)
>> log4j.logger.org.apache.nutch.plugin=DEBUG
>> 
>> 
>> -Original Message-
>> From: Chris Stephens [mailto:[EMAIL PROTECTED]
>> Sent: Thursday, August 17, 2006 2:30 PM
>> To: nutch-dev@lucene.apache.org
>> Subject: Re: 0.8 not loading plugins
>> 
>> I have this line in src/plugin/build.xml under the deploy section:
>> 
>>   <ant dir="custom-meta" target="deploy"/>
>> 
>> 
>> The plugin is compiling ok.  I spent several days getting errors on
>> compile and investing how to port them to 0.8.
>> 
>> Jonathan Addison wrote:
>>   
>>> Hi Chris,
>>> 
>>> Chris Stephens wrote:
>>> 
 I think I finally have my plugin ported to 0.8, however I cannot get
 my plugin to load.
 
 My plugin.includes file in conf/nutch-site.xml has the following for
 its plugin.includes value:
 
 protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic
 |query-(basic|site|url)|summary-basic|scoring-opic|custom-meta>>>   
 My plugin is the 'custom-meta' entry at the end.
 
 My plugin never shows up in the Registered Plugins list in the
 hadoop.log, and lines in my plugin that run logger.info never show up
   
>> 
>>   
 as well.  Is there a step I am missing with 0.8, what should I do
 next to debug the problem?
   
>>> Have you also added your plugin to plugin/build.xml?
>>> 
 Thank you,
 
 Chris Stephens
 
 
   
>>> 
>> 
>> 
>>   
> 




Re: Patch Available status?

2006-08-30 Thread Chris Mattmann
Hi Doug and Andrzej,

  +1. I think that workflow makes a lot of sense. Currently users in the
nutch-developers group can close and resolve issues. In the Hadoop workflow,
would this continue to be the case?

Cheers,
  Chris



On 8/30/06 3:14 PM, "Andrzej Bialecki" <[EMAIL PROTECTED]> wrote:

> Doug Cutting wrote:
>> Sami Siren wrote:
>>> I am not able to do it either, or then I just don't know how, can
>>> Doug help us here?
>> 
>> This requires a change to the project's workflow.  I'd be happy to
>> move Nutch to use the workflow we use for Hadoop, which supports
>> "Patch Available".
>> 
>> This workflow has one other non-default feature, which is that bugs,
>> once closed, cannot be re-opened.  This works as follows: Only project
>> administrators are allowed to close issues.  Bugs are resolved as
>> they're fixed, and only closed when a release is made.  This keeps the
>> release notes Jira generates from changing after a release is made.
>> 
>> Would you like me to switch Nutch to use this Jira workflow?
> 
> +1, this would finally make sense with the "resolved" vs. "closed" ...




Re: Patch Available status?

2006-08-31 Thread Chris Mattmann
Hi Doug,

> 
> But the nutch-developers Jira group pretty closely corresponds to
> Nutch's committers, so perhaps all committers should be permitted to
> close, although this should be exercised with caution, only at releases,
> since closes cannot be undone in this workflow.
> 
> Another alternative would be to construct a new workflow that just adds
> the "Patch Available" status and still permits issues to be re-opened.
> 
> Which sounds best for Nutch?

Good question. Well, my personal preference would be for one that allows
issue closes to be undone, as I've seen several cases (even some recent ones
such as NUTCH-258) where someone in the nutch-developers group (myself
included) has closed an issue that users in fact don't believe is resolved.

So my +1 for having the 2nd option above: an alternative workflow to that of
the Hadoop one that simply adds the "Patch Available" status and still
permits issues to be re-opened.

Just my 2 cents.

Thanks!

Cheers,
  Chris
 
> 
> Doug




Nutch requires JDK 1.5 now?

2006-10-03 Thread Chris Mattmann
Hi Folks,

 I noticed that Nutch now requires JDK 5 in order to compile, due to recent
changes to the PluginRepository and some other classes. I think that this is
a good move, however, I wasn't sure that I had seen any "official"
announcement that Nutch now requires 1.5...

Cheers,
  Chris

__
Chris A. Mattmann
[EMAIL PROTECTED]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_
Jet Propulsion LaboratoryPasadena, CA
Office: 171-266BMailstop:  171-246
___

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.




Re: Nutch requires JDK 1.5 now?

2006-10-03 Thread Chris Mattmann
> 
> The switch to 1.5 format was also logged on jira issue
> http://issues.apache.org/jira/browse/NUTCH-360
> --
>  Sami Siren

Ahh, I didn't see this. Way to go Sami, I love it when people actually keep
records of changes! ;)

Cheers,
  Chris


__
Chris A. Mattmann
[EMAIL PROTECTED]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_
Jet Propulsion LaboratoryPasadena, CA
Office: 171-266BMailstop:  171-246
___

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.




Re: Nutch requires JDK 1.5 now?

2006-10-03 Thread Chris Mattmann
Hey Guys,

 Speaking of which, I noticed that Sami's issue below is a "Task" in JIRA,
which reminded me of a task that I input a long time ago that would be nice
to fix real quick (for those with JIRA permissions to do so):

http://issues.apache.org/jira/browse/NUTCH-304

We should really change the email address for JIRA to not use the Apache
incubator one anymore, and to use to Lucene one.

Sound good? If so, could someone with permissions please take care of it?
:-)

Cheers,
  Chris



On 10/3/06 9:04 AM, "Sami Siren" <[EMAIL PROTECTED]> wrote:

> Andrzej Bialecki wrote:
>> Chris Mattmann wrote:
>>> Hi Folks,
>>> 
>>>  I noticed that Nutch now requires JDK 5 in order to compile, due to
>>> recent
>>> changes to the PluginRepository and some other classes. I think that
>>> this is
>>> a good move, however, I wasn't sure that I had seen any "official"
>>> announcement that Nutch now requires 1.5...
>>>   
>> 
>> This is a proactive change - as soon as we upgrade to Hadoop 0.6.x we
>> will lose 1.4 compatibility anyway, so we may as well prepare in advance.
>> 
>> Also, "Now" refers to the unreleased 0.9, we will keep branch 0.8.x
>> compatible with 1.4.
>> 
> 
> The switch to 1.5 format was also logged on jira issue
> http://issues.apache.org/jira/browse/NUTCH-360
> --
>  Sami Siren

__
Chris A. Mattmann
[EMAIL PROTECTED]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_
Jet Propulsion LaboratoryPasadena, CA
Office: 171-266BMailstop:  171-246
___

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.




Re: [jira] Updated: (NUTCH-379) ParseUtil does not pass through the content's URL to the ParserFactory

2006-10-13 Thread Chris Mattmann
Hi Guys,

 Can we disable the selection of "released versions" within JIRA for issues
so that people like me don't continue to get confused?

Thanks!

Cheers,
  Chris



On 10/13/06 9:32 AM, "Sami Siren (JIRA)" <[EMAIL PROTECTED]> wrote:

>  [ http://issues.apache.org/jira/browse/NUTCH-379?page=all ]
> 
> Sami Siren updated NUTCH-379:
> -
> 
> Fix Version/s: (was: 0.8.1)
>(was: 0.8)
> 
> cannot fix released versions
> 
>> ParseUtil does not pass through the content's URL to the ParserFactory
>> --
>> 
>> Key: NUTCH-379
>> URL: http://issues.apache.org/jira/browse/NUTCH-379
>> Project: Nutch
>>  Issue Type: Bug
>>  Components: fetcher
>>Affects Versions: 0.8.1, 0.8, 0.9.0
>> Environment: Power Mac Dual G5, 2.0 Ghz, although fix is independent
>> of environment
>>Reporter: Chris A. Mattmann
>> Assigned To: Chris A. Mattmann
>> Fix For: 0.8.2, 0.9.0
>> 
>> Attachments: NUTCH-379.Mattmann.100406.patch.txt
>> 
>> 
>> Currently the ParseUtil class that is called by the Fetcher to actually
>> perform the parsing of content does not forward through the content's url for
>> use in the ParserFactory. A bigger issue, however, is that the url (and for
>> that matter, the pathSuffix) is no longer used to determine which parsing
>> plugin should be called. My colleague at JPL discovered that more major bug
>> and will soon input a JIRA issue for it. However, in the meantime, this small
>> patch at least sets up the forwarding of the content's URL to the
>> ParserFactory.




Re: What's the status of Nutch-GUI?

2006-11-21 Thread Chris Mattmann
Hi Armel,

On 11/20/06 1:44 PM, "Armel T. Nene" <[EMAIL PROTECTED]> wrote:

> Hi Chris,
> 
> I am trying to extend parse-xml to enable the creation of lucene fields
> straight from an xml file. For example, a database table that has been parsed
> as an XML file should be stored in the index with the relevant fields, i.e.
> customer name, address and so on. This file will not have a namespace
> associated with it and should not be stored as "xmlcontent" in the database.
> Currently, parse-xml looks for known fields in the document and stores the
> associated values with the field name. I have added an extra condition so that
> if the known fields are not present in the current document, the element or
> node in the document should become the new field stored in the index with its
> value.

I think that this is fine.
> 
> Therefore, when parse-xml receives an xml document with no namespace
> available, it will parse the document and store its element names as new fields
> in the index along with each element's associated value.
> 
> Let me know if I am on the right track because I know I don't have to write
> a separate plugin for this feature but can just extend (or modify)
> parse-xml.

I think that parse-xml will support what you are talking about. In terms of
the "check" that you are doing to see if a field exists or not before adding
another value for it in the index, as I understood Lucene, I believe that
you could just omit this check and add the field regardless. If you add
multiple values for the same field in a Document, e.g:


Document doc = new Document();

doc.add(new Field("fieldname", "fieldvalue", ...));
doc.add(new Field("fieldname", "fieldvalue2",...));



The values "fieldvalue" and "fieldvalue2" will both get stored in the
index for the key "fieldname". So, if I understand you correctly (which I
may not ;) ), then I think you can omit the check that you are talking about
above and just go with adding the same field name 2x.
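
If it helps, reading the field back shows the same thing (a quick sketch
against the Lucene 1.9/2.x API):

  // Both values come back together under the one field name.
  String[] values = doc.getValues("fieldname"); // {"fieldvalue", "fieldvalue2"}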

HTH,
  Chris

> 
> Cheers,
> 
> Armel
> 
> 
> -Original Message-
> From: Chris Mattmann [mailto:[EMAIL PROTECTED]
> Sent: 20 November 2006 18:40
> To: nutch-dev@lucene.apache.org
> Subject: Re: What's the status of Nutch-GUI?
> 
> Hi Sami and Scott,
> 
>  This is on my TO-DO list as one of the items that I will begin working on
> getting into the sources as a committer. Additionally, I plan on integrating
> and testing the parse-xml plugin into the source tree. As soon as I get my
> Apache account and SVN access, I will start working on this.
> 
> Thanks!
> 
> Cheers,
>   Chris
> 
> 
> 
> On 11/20/06 9:24 AM, "Sami Siren" <[EMAIL PROTECTED]> wrote:
> 
>> scott green wrote:
>>> Hi
>>> 
>>> Is nutch-gui dead? why i cannot find any source in svn repo?
>> 
>> Unfortunately the sources for the admin gui never got into svn. It would
>> be great if someone could pick it up and bring it up to date to get it
>> integrated.
>> 
>> --
>>   Sami Siren
>> 
> 
> 
> 
> 

__
Chris A. Mattmann
[EMAIL PROTECTED]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_
Jet Propulsion LaboratoryPasadena, CA
Office: 171-266BMailstop:  171-246
___

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.




Re: [jira] Closed: (NUTCH-406) Metadata tries to write null values

2006-11-23 Thread Chris Mattmann
Hi Sami,


On 11/23/06 9:45 AM, "Sami Siren" <[EMAIL PROTECTED]> wrote:

> Couple of points:
> 
> 1. You used tabs

I just installed a new version of Eclipse, and forgot to change the default
preference for using tabs versus just spaces. I've gone ahead and
changed this in my Eclipse and will commit an update that uses spaces
instead of tabs shortly.

> 2. You left some unnecessary comments on source, bug history is
> already in jira and commit logs

I would disagree with this statement: no comment is "unnecessary". What if
the users don't look into JIRA, or don't scan through the commit logs? The
change that we just made was critical, though subtle, and a user could gloss
over the fact that only non-null values get written now. BTW, I'm a fan of
more comments, rather than less ;)
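
For the record, the shape of the change is small; here is a paraphrase of
the guard (illustrative only, not the committed diff):

  import java.io.DataOutput;
  import java.io.IOException;
  import org.apache.hadoop.io.Text;

  public class NonNullMetadataWriter {
    // Write only the non-null values for a metadata name, so that
    // Text.writeString() never sees a null and throws an NPE.
    public static void writeValues(DataOutput out, String name, String[] values)
        throws IOException {
      for (String value : values) {
        if (value == null) {
          continue; // pdf parses can yield <key, null> pairs; skip them
        }
        Text.writeString(out, name);
        Text.writeString(out, value);
      }
    }
  }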

> 3. Why not an addition to the testcase?

Good point. I'll add a testcase for this in TestMetadata.

> 4. Issue could have been iterated in jira a bit further so all these
> could have been caught before a commit.

This is true: however, I thought that the point of bringing in new people
was to move forward on some of these critical issues that keep moving their
way down the priority stack? The issues that you raise above (e.g.,
whitespace v. tabs, and "unnecessary comments"), although relevant points,
really had nothing to do with the fix itself. I wanted to get the fix into
the sources before everyone went away for thanksgiving (at least here in the
U.S.), so that users could pull it down sooner rather than later. Is this
not the correct policy? I'm a n00b, so I dunno ;)

Cheers,
  Chris
 

> 
> --
>   Sami Siren
> 
> 
> 
> 
> Chris A. Mattmann (JIRA) wrote:
>>  [ http://issues.apache.org/jira/browse/NUTCH-406?page=all ]
>> 
>> Chris A. Mattmann closed NUTCH-406.
>> ---
>> 
>> 
>> Patch applied to trunk:
>> 
>> http://svn.apache.org/viewvc?view=rev&revision=478619
>> 
>> 
>> 
>> 
>>> Metadata tries to write null values
>>> ---
>>> 
>>> Key: NUTCH-406
>>> URL: http://issues.apache.org/jira/browse/NUTCH-406
>>> Project: Nutch
>>>  Issue Type: Bug
>>>Affects Versions: 0.9.0
>>>Reporter: Doğacan Güney
>>> Assigned To: Chris A. Mattmann
>>> Fix For: 0.9.0
>>> 
>>> Attachments: NUTCH-406.patch, NUTCH-406.patch
>>> 
>>> 
>>> During parsing, some urls (especially pdfs, it seems) may create <key, null> pairs in ParseData's parseMeta.
>>> When Metadata.write() tries to write such a pair, it causes an NPE.
>>> Stack trace will be something like this:
>>> at org.apache.hadoop.io.Text.encode(Text.java:373)
>>> at org.apache.hadoop.io.Text.encode(Text.java:354)
>>> at org.apache.hadoop.io.Text.writeString(Text.java:394)
>>> at org.apache.nutch.metadata.Metadata.write(Metadata.java:214)
>>> I can consistently reproduce this using the following url:
>>> http://www.efesbev.com/corporate_governance/pdf/MergerAgreement.pdf
>> 
> 




Re: Welcome Chris Mattmann as Nutch committer

2006-11-23 Thread Chris Mattmann
Thanks, Andrzej, thanks to the rest of the folks who voted me in! I really
appreciate the honor and pledge to help maintain the high quality of the
Nutch source code.

Best wishes and happy holidays to all the folks on the list!

Cheers,
  Chris



On 11/23/06 4:10 AM, "Andrzej Bialecki" <[EMAIL PROTECTED]> wrote:

> Hi all,
> 
> Some time ago I proposed to Lucene PMC that Chris should become a Nutch
> committer. And I am pleased to announce now that Lucene PMC members
> unanimously voted in his favor. :)
> 
> Chris will continue to provide mailing list support, as he did before,
> but now he will also be taking care of some of the accumulated JIRA issues.
> 
> Welcome, Chris!




Re: svn commit: r485076 - in /lucene/nutch/trunk/src: java/org/apache/nutch/metadata/SpellCheckedMetadata.java test/org/apache/nutch/metadata/TestSpellCheckedMetadata.java

2006-12-09 Thread Chris Mattmann
Hi Sami,

On 12/9/06 2:27 PM, "[EMAIL PROTECTED]" <[EMAIL PROTECTED]> wrote:

> Author: siren
> Date: Sat Dec  9 14:27:07 2006
> New Revision: 485076
> 
> URL: http://svn.apache.org/viewvc?view=rev&rev=485076
> Log:
> Optimize SpellCheckedMetadata further by taking into account the fact that it
> is used only for http-headers.
> 
> I am starting to believe that spellchecking should just be a utility method
> used by http protocol plugins.

I think that right now I'm -1 for this change. I would make note of all the
comments on NUTCH-139, from which this code was born. In the end, I think
what we all realized was that the spell checking capabilities is necessary,
but not everywhere, as you point out. However, I don't think it's limited
entirely to HTTP headers (what you've currently changed the code to). I
think it should be implemented as a protocol layer service, also providing
spell checking support to other protocol plugins, like protocol-file, etc.,
where field headers run the risk of being misspelled as well. What's to stop
someone from implementing protocol-file++ that returns different file header
keys than those of protocol-file? Just b/c HTTP is the most pervasively used
plugin right now, I think it's a little too convenient to assume that only HTTP protocol
field keys may need spell checking services.

Just my 2 cents...

Cheers,
  Chris





Re: svn commit: r485076 - in /lucene/nutch/trunk/src: java/org/apache/nutch/metadata/SpellCheckedMetadata.java test/org/apache/nutch/metadata/TestSpellCheckedMetadata.java

2006-12-09 Thread Chris Mattmann
Hi Sami,

 Indeed, I see your point. I guess what I was advocating for was more of a
ProtocolHeaders interface that lives in org.apache.nutch.metadata. Then, we
could update the code that you have below to use ProtocolHeaders.class
rather than HttpHeaders.class. We would then make ProtocolHeaders extend
HttpHeaders, so that it by default inherits all of the HttpHeaders, while
still allowing more ProtocolHeader met keys (e.g., we could have an
interface for FileHeaders, etc.).

 What do you think about that? Alternatively we could just create a
ProtocolHeaders interface in org.apache.nutch.metadata that aggregates all
the met key fields from HttpHeaders, and it would be the place that the met
key fields for FileHeaders, etc. could go into.
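
Sketched out, that first variant would look roughly like this (nothing of
the sort exists in the tree yet, and the FILE_* key below is purely an
example):

  package org.apache.nutch.metadata;

  // Inherits all of the HttpHeaders met keys, and gives non-HTTP protocol
  // plugins a place to contribute their own.
  public interface ProtocolHeaders extends HttpHeaders {
    // Illustrative key that a file protocol plugin might emit.
    String FILE_LAST_MODIFIED = "File-Last-Modified";
  }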

Let me know what you think, and thanks!

Cheers,
  Chris



On 12/9/06 3:53 PM, "Sami Siren" <[EMAIL PROTECTED]> wrote:

> Chris Mattmann wrote:
>> Hi Sami,
>> 
>> On 12/9/06 2:27 PM, "[EMAIL PROTECTED]" <[EMAIL PROTECTED]> wrote:
>> 
>>> Author: siren
>>> Date: Sat Dec  9 14:27:07 2006
>>> New Revision: 485076
>>> 
>>> URL: http://svn.apache.org/viewvc?view=rev&rev=485076
>>> Log:
>>> Optimize SpellCheckedMetadata further by taking into account the fact that
>>> it
>>> is used only for http-headers.
>>> 
>>> I am starting to believe that spellchecking should just be an utility method
>>> used by http protocol plugins.
>> 
>> I think that right now I'm -1 for this change. I would make note of all the
>> comments on NUTCH-139, from which this code was born. In the end, I think
>> what we all realized was that the spell checking capabilities is necessary,
>> but not everywhere, as you point out. However, I don't think it's limited
>> entirely to HTTP headers (what you've currently changed the code to). I
>> think it should be implemented as a protocol layer service, also providing
>> spell checking support to other protocol plugins, like protocol-file, etc.,
> 
> In protocol-file all headers are artificial and generated in nutch code,
> so if there's a spelling mistake there then we should fix the code
> generating the headers and not rely on spellchecking in the first place.
> 
>> where field headers run the risk of being misspelled as well. What's to stop
>> someone from implementing protocol-file++ that returns different file header
>> keys than that of protocol-file? Just b/c HTTP is the most pervasively used
>> plugin right now, I think it's convenient to assume that only HTTP protocol
>> field keys may need spell checking services.
> 
> If there's a real need for spell checking on other keys one can just add
> more classes to the array, no big deal.
> 
> --
>  Sami Siren
> 




Re: Next Nutch release

2007-01-16 Thread Chris Mattmann
Folks,

 When would you like to make the release? I've been working on NUTCH-185,
but got a bit bogged down with other work. If there is interest in having
NUTCH-185 included in the release, I could make a push to get out a patch by
week's end...

 As for the rest, my +1 for NUTCH-61 being included sooner rather than
later. It seems that the patch has garnered enough use and attention that
folks would like to see it in the release. I think the email from the user
trying to manage a terabyte of data a few days back was particularly
telling.

Cheers,
  Chris



On 1/16/07 8:19 AM, "Andrzej Bialecki" <[EMAIL PROTECTED]> wrote:

> Sami Siren wrote:
>> Hello,
>> 
>> It has been a while since the previous release (0.8.1) and looking at the
>> great fixes done in trunk I'd start thinking about baking a new release
>> soon.
>> 
>> Looking at the jira roadmaps there is 1 blocking issue (fixing the
>> license headers) for 0.8.2 and two other blocking issues for 0.9.0 of
>> which I think NUTCH-233 is safe to put in.
>>   
> 
> Agreed. The replacement regex mentioned in the original comment seems
> safe enough, and simpler.
> 
>> The top 10 voted issues are currently:
>> 
>> NUTCH-61  Adaptive re-fetch interval. Detecting unmodified content
>>   
> 
> Well ... I'm of a split mind on this. I can bring this patch up to date
> and apply it before 0.9.0, if we understand that this is a "0" release
> ... ;) Otherwise I'd prefer to wait with it right after the release.
> 
> I would like also to proceed with NUTCH-339 (Fetcher2 patches plus
> some changes I made in the meantime), since I'd like to expose the new
> fetcher to a broader audience, and it doesn't affect the existing
> implementation.
> 
> 
>> NUTCH-48  "Did you mean" query enhancement/refinement feature
>> NUTCH-251  Administration GUI
>> NUTCH-289  CrawlDatum should store IP address
>>   
> 
> I'm still not entirely convinced about this - and there is already a
> mechanism in place to support it if someone really wishes to keep this
> particular info (CrawlDatum.metaData).
> 
>> NUTCH-36  Chinese in Nutch
>> NUTCH-185  XMLParser is configurable xml parser plugin.
>> NUTCH-59  meta data support in webdb
>> NUTCH-92  DistributedSearch incorrectly scores results
> 
> This is too intrusive to fix just before the release - and needs
> additional discussion.
> 
> 
>> NUTCH-68  A tool to generate arbitrary fetchlists
> 
> Easy to port this to 0.9.0 - I can do this.
> 
> 
>> NUTCH-87  Efficient site-specific crawling for a large number of sites
>>   
> 
> 




Re: How to Become a Nutch Developer

2007-01-21 Thread Chris Mattmann
Hi Dennis,


On 1/21/07 11:47 AM, "Dennis Kubes" <[EMAIL PROTECTED]> wrote:

> All,
> 
> I am working on a "How to Become a Nutch Developer" document for the
> wiki and I need some input.
> 
> I need an overview of how the process for JIRA works.  If I am a
> developer new to Nutch, just starting to look at JIRA, and I want
> to start working on some piece of functionality or to help with bug
> fixes, where would I look?

JIRA provides a lot of search facilities: it's actually kind of nice. The
starting point for browsing bugs and other types of issues is:

http://issues.apache.org/jira/browse/NUTCH

(in general, for all Apache projects that use JIRA, you'll find that their
issue tracking system boils down to:

http://issues.apache.org/jira/browse/
)

From there, you can access canned filters for open issues like:
Blocker
Critical
Major
Minor
Trivial

For more detailed search capabilities, click on the "Find Issues" button at
the top breadcrumb bar. Search capabilities there include the ability to
look for issues by developer, status, and issue type, and to combine such fields
using AND and OR. Additionally, you can issue a free text query across all
issues by using the free text box there.

> 
> Would I just choose something that is unscheduled and begin working on it?

That's a good starting point: additionally, high priority issues marked as
"Blockers", "Critical" and "Major" are always good because the sooner we
(the committers) get a patch for those, the sooner we'll be testing it for
inclusion into the sources.

> 
> What if I see something that I want to work on but it is scheduled to
> somebody else?

Walk five paces opposite your opponent: turn, then sho...err, wait. Nah, you
don't have to do that. ;) Just speak up on the mailing list, and volunteer
your support. One of the people listed in the group "nutch-developers" in
JIRA (e.g., the committers) can reassign the issue to you so long as the
other gent it was assigned to doesn't mind...

> 
> Are items only scheduled to committers or can they be scheduled to
> developers as well?  If they can be scheduled to regular developers how
> does someone get their name on the list to be scheduled items?

Items can be scheduled to folks listed in the nutch-developers group within
JIRA. Most of these folks are the committers, however, not all of them are.
I'm not entirely sure how folks get into that group (maybe Doug?), however,
that's the real criteria for having a JIRA issue officially assigned to you.
However, that doesn't mean that you can't work on things in the meantime. If
there's an issue that you'd like to contribute to, please, prepare a patch,
attach it to JIRA, and then speak up on the mailing list. Chances are, with
the recent busy schedules of the committers (including myself), aside from Sami
and Andrzej, the committers don't have time to prepare patches for the issues
assigned to them. If you contribute a great patch, the committer will pick
it up, test it, apply it, and you'll get the same effect as if the issue
were directly assigned to you.
> 
> Should I submit a JIRA and/or notify the list before I start working on
> something?  What is the common process for this?

Yup, that's pretty much it. Voice your desire to work on a particular task
on the nutch-dev list. Many of the developers on that list have been around
for a while now, and they know what's been discussed, and implemented
before.
> 
> When I submit a JIRA is there anything else I need to do either in the
> JIRA system or with the mailing lists, committers, etc?

Nope: the nutch-dev list is automatically notified by all JIRA issue
submissions, and the committers (and rest of the folks) will pick up on this
and act accordingly.

> 
> Getting this information together in one place will go a long way toward
> helping others to start contributing more and more.  Thanks for all your
> input.

No probs, glad to be of service :-)

Cheers,
  Chris

> 
> Dennis Kubes




Re: Reviving Nutch 0.7

2007-01-22 Thread Chris Mattmann
 
> Before doubling (or after 0.9.0, tripling?) the maintenance/development work,
> please consider the following:
> 
> One option would be refactoring the code in a way that the parts that are
> usable to other projects, like the protocols and parsers (this was actually
> proposed by Jukka Zitting some time last year), would be modified to be
> independent of nutch (and hadoop) code. Yeah, this is easy to say, but it
> would require a significant amount of work.
> 
> The "more focused", smaller chunks of nutch would probably also get a bigger
> audience (perhaps also outside nutch land) and that way perhaps more people
> willing to work on them.
> 
> Don't know about others, but at least I would be more willing to work towards
> this goal than the one where there would be what are practically many separate
> projects, each sharing common functionality but with different code bases.

+1 ;)

This was actually the project proposed by Jerome Charron and myself, called
"Tika". We went so far as to create a project proposal, and send it out to
the nutch-dev list, as well as the Lucene PMC for potential Lucene
sub-project goodness. I could probably dig up the proposal should the need
arise.

Good ol' Jukka then took that effort and created us a project within Google
code, that still lives in there in fact:

http://code.google.com/p/tika/

There hasn't been active development on it because:

1. None of us (I'm speaking for Jerome, and myself here) ended up having the
time to shepherd it going forward

2. There was little, if any response, from the proposal to the nutch-dev
list, and folks willing to contribute (besides people like Jukka)

3. I think, as you correctly note above, most people thought it to be too
much of a Herculean effort that wouldn't pay the necessary dividends in the
end to undertake it


In any case, I think that, if we are going to maintain separate branches of
the source (in fact, really parallel projects), then an undertaking such as
Tika is probably needed ...

Cheers,
   Chris




> 
> --
>  Sami Siren




Re: [jira] Commented: (NUTCH-258) Once Nutch logs a SEVERE log item, Nutch fails forevermore

2007-01-25 Thread Chris Mattmann
Hi Doug,

 So, does this render the patch that I wrote obsolete?

Cheers,
  Chris



On 1/25/07 10:08 AM, "Doug Cutting" <[EMAIL PROTECTED]> wrote:

> Scott Ganyo (JIRA) wrote:
>>  ... since Hadoop hijacks and reassigns all log formatters (also a bad
>> practice!) in the org.apache.hadoop.util.LogFormatter static constructor ...
> 
> FYI, Hadoop no longer does this.
> 
> Doug

__
Chris A. Mattmann
[EMAIL PROTECTED]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_
Jet Propulsion LaboratoryPasadena, CA
Office: 171-266BMailstop:  171-246
___

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.




Re: [jira] Commented: (NUTCH-258) Once Nutch logs a SEVERE log item, Nutch fails forevermore

2007-01-25 Thread Chris Mattmann
> It's at least out-of-date and perhaps obsolete.  A quick read of
> Fetcher.java looks like there might be a case where a "fatal" error is
> logged but the fetcher doesn't exit, in FetcherThread#output().
> 

So this raises an interesting question:

People (such as Scott G.) out there -- are you folks still experiencing
similar problems? Do the recent Hadoop changes alleviate the bad behavior
you were experiencing? If so, then maybe this issue should be closed...

Cheers,
  Chris

__
Chris A. Mattmann
[EMAIL PROTECTED]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_
Jet Propulsion LaboratoryPasadena, CA
Office: 171-266BMailstop:  171-246
___

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.




Re: RSS-fecter and index individul-how can i realize this function

2007-01-30 Thread Chris Mattmann
Hi there,

 I could most likely be of assistance, if you gave me some more information.
For instance: I'm wondering if the use case you describe below is already
supported by the current RSS parse plugin?

 The current RSS parser, parse-rss, does in fact index individual items that
are pointed to by an RSS document. The items are added as Nutch Outlinks,
and added to the overall queue of URLs to fetch. Doesn't this satisfy what
you mention below? Or am I missing something?

Cheers,
  Chris



On 1/30/07 6:01 PM, "kauu" <[EMAIL PROTECTED]> wrote:

> Hi folks :
> 
>    What I want to do is to separate a rss file into several pages.
> 
>   Just as what has been discussed before, I want to fetch a rss page and index
> it as different documents in the index, so the searcher can search the
> Item's info as an individual hit.
> 
>  My opinion is to create a protocol to fetch the rss page and store it as
> several ones which each just contain one ITEM tag. But the unique key is the
> url, so how can I store them with the ITEM's link tag as the unique key for a
> document?
> 
>   So my question is how to realize this function in nutch-0.8.x.
> 
>   I've checked the code of the plug-in protocol-http, but I can't
> find the code where a page is stored to a document. I want to separate the
> rss page into several ones before storing it as a document.
> 
>   So can any one give me some hints?
> 
> Any reply will be appreciated !
> 
>  
> 
>  
> 
>   ITEM's structure:
> 
> <item>
>   <title>欧洲暴风雪后发制人 致航班延误交通混乱(组图)</title>
>   <description>暴风雪横扫欧洲,导致多次航班延误 1月24日,几架民航客机在德
>   国斯图加特机场内等待去除机身上冰雪。1月24日,工作人员在德国南部的慕尼黑机场
>   清扫飞机跑道上的积雪。 据报道,迟来的暴风雪连续两天横扫中...</description>
>   <link>http://news.sohu.com/20070125/n247833568.shtml</link>
>   <category>搜狐焦点图新闻</category>
>   <author>[EMAIL PROTECTED]</author>
>   <pubDate>Thu, 25 Jan 2007 11:29:11 +0800</pubDate>
>   <comments>http://comment.news.sohu.com/comment/topic.jsp?id=247833847</comments>
> </item>
> 




Re: RSS-fecter and index individul-how can i realize this function

2007-01-30 Thread Chris Mattmann
Hi there,

On 1/30/07 7:00 PM, "[EMAIL PROTECTED]" <[EMAIL PROTECTED]> wrote:

> Chris,
> 
> I saw your name associated with the rss parser in nutch.  My understanding is
> that nutch is using feedparser.  I had two questions:
> 
> 1.  Have you looked at vtd as an rss parser?

I haven't in fact; what are its benefits over those of commons-feedparser?

> 2.  Any view on asynchronous communication as the underlying protocol?  I do
> not believe that feedparser uses that at this point.

I'm not sure exactly what asynchronous communication affords you when
parsing rss feeds: what type of communication are you talking about above? Nutch
handles the communications layer for fetching content using a pluggable,
Protocol-based model. The only feature that Nutch's rss parser uses from the
underlying feedparser library is its object model and callback framework for
parsing RSS/Atom/Feed XML documents. When you mention asynchronous above,
are you talking about the protocol for fetching the different RSS documents?

Thanks!

Cheers,
  Chris


> 
> Thanks
>   
> 
> -Original Message-
> From: Chris Mattmann <[EMAIL PROTECTED]>
> Date: Tue, 30 Jan 2007 18:16:44
> To:
> Subject: Re: RSS-fecter and index individul-how can i realize this function
> 
> Hi there,
> 
>  I could most likely be of assistance, if you gave me some more information.
> For instance: I'm wondering if the use case you describe below is already
> supported by the current RSS parse plugin?
> 
>  The current RSS parser, parse-rss, does in fact index individual items that
> are pointed to by an RSS document. The items are added as Nutch Outlinks,
> and added to the overall queue of URLs to fetch. Doesn't this satisfy what
> you mention below? Or am I missing something?
> 
> Cheers,
>   Chris
> 
> 
> 
> On 1/30/07 6:01 PM, "kauu" <[EMAIL PROTECTED]> wrote:
> 
>> Hi folks :
>> 
>>What I want to do is to separate a rss file into several pages.
>> 
>>   Just as what has been discussed before, I want to fetch a rss page and index
>> it as different documents in the index, so the searcher can search the
>> Item's info as an individual hit.
>> 
>>  My opinion is to create a protocol to fetch the rss page and store it as
>> several ones which each just contain one ITEM tag. But the unique key is the
>> url, so how can I store them with the ITEM's link tag as the unique key for a
>> document?
>> 
>>   So my question is how to realize this function in nutch-0.8.x.
>> 
>>   I've checked the code of the plug-in protocol-http, but I can't
>> find the code where a page is stored to a document. I want to separate the
>> rss page into several ones before storing it as a document.
>> 
>>   So can any one give me some hints?
>> 
>> Any reply will be appreciated !
>> 
>>  
>> 
>>  
>> 
>>   ITEM's structure:
>> 
>> <item>
>>   <title>欧洲暴风雪后发制人 致航班延误交通混乱(组图)</title>
>>   <description>暴风雪横扫欧洲,导致多次航班延误 1月24日,几架民航客机在德
>>   国斯图加特机场内等待去除机身上冰雪。1月24日,工作人员在德国南部的慕尼黑机场
>>   清扫飞机跑道上的积雪。 据报道,迟来的暴风雪连续两天横扫中...</description>
>>   <link>http://news.sohu.com/20070125/n247833568.shtml</link>
>>   <category>搜狐焦点图新闻</category>
>>   <author>[EMAIL PROTECTED]</author>
>>   <pubDate>Thu, 25 Jan 2007 11:29:11 +0800</pubDate>
>>   <comments>http://comment.news.sohu.com/comment/topic.jsp?id=247833847</comments>
>> </item>
>> 
> 
> 




Re: RSS-fecter and index individul-how can i realize this function

2007-01-30 Thread Chris Mattmann
Hi there,

  With the explanation that you give below, it seems like parse-rss as it
exists would address what you are trying to do. parse-rss parses an RSS
channel as a set of items, and indexes overall metadata about the RSS file,
including parse text, and index data, but it also adds each item (in the
channel)'s URL as an Outlink, so that Nutch will process those pieces of
content as well. The only thing that you suggest below that parse-rss
currently doesn't do is allow you to associate the metadata fields
category: and author: with the item Outlink...
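
For concreteness, here's a rough sketch of that Outlink step. The helper
class and its list arguments are invented for illustration, and the exact
Outlink constructor has varied between releases, so treat this as a sketch
against the 0.8/0.9-era API rather than the plugin's actual source:

    // Sketch: surface each feed item's link as an Outlink so that the
    // fetcher queues the item's page on a later round.
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.parse.Outlink;

    public class RssOutlinkSketch {

      /** itemLinks/itemTitles are hypothetical inputs: one URL and one
       *  title per item pulled out of the parsed channel. */
      public static Outlink[] toOutlinks(List itemLinks, List itemTitles,
          Configuration conf) throws Exception {
        List outlinks = new ArrayList();
        for (int i = 0; i < itemLinks.size(); i++) {
          String url = (String) itemLinks.get(i);
          String anchor = (String) itemTitles.get(i); // title doubles as anchor text
          outlinks.add(new Outlink(url, anchor, conf));
        }
        return (Outlink[]) outlinks.toArray(new Outlink[outlinks.size()]);
      }
    }

The returned array goes into the ParseData for the feed page, and that alone
is enough for the item URLs to land in the fetch queue.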

Cheers,
  Chris



On 1/30/07 7:30 PM, "kauu" <[EMAIL PROTECTED]> wrote:

> thx for ur reply .
> mybe i didn't tell clearly .
> I want to index the item as an individual page . then when i search
> something, for example "nutch-open source", nutch returns a hit which
> contains:
> 
>    title : nutch-open source
>    description : nutch nutch nutch nutch nutch
>    url : http://lucene.apache.org/nutch
>    category : news
>    author : kauu
> 
> so , is the plugin parse-rss able to satisfy what i need?
> 
> <item>
>   <title>nutch--open source</title>
>   <description>nutch nutch nutch nutch nutch</description>
>   <link>http://lucene.apache.org/nutch</link>
>   <category>news</category>
>   <author>kauu</author>
> </item>



> On 1/31/07, Chris Mattmann <[EMAIL PROTECTED]> wrote:
> >
> > Hi there,
> >
> > I could most likely be of assistance, if you gave me some more
> > information. For instance: I'm wondering if the use case you describe
> > below is already supported by the current RSS parse plugin?
> >
> > The current RSS parser, parse-rss, does in fact index individual items
> > that are pointed to by an RSS document. The items are added as Nutch
> > Outlinks, and added to the overall queue of URLs to fetch. Doesn't this
> > satisfy what you mention below? Or am I missing something?
> >
> > Cheers,
> >   Chris
> >
> > On 1/30/07 6:01 PM, "kauu" <[EMAIL PROTECTED]> wrote:
> >
> > > Hi folks :
> > >
> > > What I want to do is to separate a rss file into several pages.
> > >
> > > Just as what has been discussed before, I want to fetch a rss page
> > > and index it as different documents in the index, so the searcher
> > > can search the Item's info as an individual hit.
> > >
> > > My opinion is to create a protocol to fetch the rss page and store
> > > it as several ones which each just contain one ITEM tag. But the
> > > unique key is the url, so how can I store them with the ITEM's link
> > > tag as the unique key for a document?
> > >
> > > So my question is how to realize this function in nutch-0.8.x.
> > >
> > > I've checked the code of the plug-in protocol-http, but I can't
> > > find the code where a page is stored to a document. I want to
> > > separate the rss page into several ones before storing it as a
> > > document.
> > >
> > > So can any one give me some hints?
> > >
> > > Any reply will be appreciated !
> > >
> > > ITEM's structure:
> > >
> > > <item>
> > >   <title>欧洲暴风雪后发制人 致航班延误交通混乱(组图)</title>
> > >   <description>暴风雪横扫欧洲,导致多次航班延误 1月24日,几架民航客机在德
> > >   国斯图加特机场内等待去除机身上冰雪。1月24日,工作人员在德国南部的慕尼黑机场
> > >   清扫飞机跑道上的积雪。 据报道,迟来的暴风雪连续两天横扫中...</description>
> > >   <link>http://news.sohu.com/20070125/n247833568.shtml</link>
> > >   <category>搜狐焦点图新闻</category>
> > >   <author>[EMAIL PROTECTED]</author>
> > >   <pubDate>Thu, 25 Jan 2007 11:29:11 +0800</pubDate>
> > >   <comments>http://comment.news.sohu.com/comment/topic.jsp?id=247833847</comments>
> > > </item>


--
www.babatu.com





Re: RSS-fecter and index individul-how can i realize this function

2007-02-01 Thread Chris Mattmann
Hi Gal, et al.,

  I'd like to be explicit when we talk about what the issue with the RSS
parsing plugin is here; I think we have had conversations similar to this
before and it seems that we keep talking around each other. I'd like to get
to the heart of this matter so that the issue (if there is an actual one)
gets addressed ;)

  Okay, so you mention below that the thing that you see missing from the
current RSS parsing plugin is the ability to store data in the CrawlDatum,
and parse "it" in the next fetch phase. Well, there are 2 options here for
what you refer to as "it":

 1. If you're talking about the RSS file, then in fact, it is parsed, and
its data is stored in the CrawlDatum, akin to any other form of content that
is fetched, parsed and indexed.

 2. If you're talking about the item links within the RSS file, in fact,
they are parsed (eventually), and their data stored in the CrawlDatum, akin
to any other form of content that is fetched, parsed, and indexed. This is
accomplished by adding the RSS items as Outlinks when the RSS file is
parsed: in this fashion, we go after all of the links in the RSS file, and
make sure that we index their content as well.

Thus, if you had an RSS file R that contained links to a PDF file A,
and another HTML page P, then not only would R get fetched, parsed, and
indexed, but so would A and P, because they are item links within R. Then
queries that would match R (the physical RSS file), would additionally match
things such as P and A, and all 3 would be capable of being returned in a
Nutch query. Does this make sense? Is this the issue that you're talking
about? Am I nuts? ;)

Cheers,
  Chris




On 1/31/07 10:40 PM, "Gal Nitzan" <[EMAIL PROTECTED]> wrote:

> Hi,
> 
> Many sites provide RSS feeds for several reasons, usually to save bandwidth,
> to give the users concentrated data and so forth.
> 
> Some of the RSS files supplied by sites are created specially for search
> engines where each RSS "item" represent a web page in the site.
> 
> IMHO the only thing "missing" in the parse-rss plugin is storing the data in
> the CrawlDatum and "parsing" it in the next fetch phase. Maybe adding a new
> flag to CrawlDatum, that would flag the URL as "parsable" not "fetchable"?
> 
> Just my two cents...
> 
> Gal.
> 
> -Original Message-
> From: Chris Mattmann [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, January 31, 2007 8:44 AM
> To: nutch-dev@lucene.apache.org
> Subject: Re: RSS-fecter and index individul-how can i realize this function
> 
> Hi there,
> 
>   With the explanation that you give below, it seems like parse-rss as it
> exists would address what you are trying to do. parse-rss parses an RSS
> channel as a set of items, and indexes overall metadata about the RSS file,
> including parse text, and index data, but it also adds each item (in the
> channel)'s URL as an Outlink, so that Nutch will process those pieces of
> content as well. The only thing that you suggest below that parse-rss
> currently doesn't do, is to allow you to associate the metadata fields
> category:, and author: with the item Outlink...
> 
> Cheers,
>   Chris
> 
> 
> 
> On 1/30/07 7:30 PM, "kauu" <[EMAIL PROTECTED]> wrote:
> 
>> thx for ur reply .
>> mybe i didn't tell clearly .
>> I want to index the item as an individual page . then when i search
>> something, for example "nutch-open source", nutch returns a hit which
>> contains:
>> 
>>    title : nutch-open source
>>    description : nutch nutch nutch nutch nutch
>>    url : http://lucene.apache.org/nutch
>>    category : news
>>    author : kauu
>> 
>> so , is the plugin parse-rss able to satisfy what i need?
>> 
>> <item>
>>   <title>nutch--open source</title>
>>   <description>nutch nutch nutch nutch nutch</description>
>>   <link>http://lucene.apache.org/nutch</link>
>>   <category>news</category>
>>   <author>kauu</author>
>> </item>
>> 
>> On 1/31/07, Chris Mattmann <[EMAIL PROTECTED]> wrote:
>>> 
>>> Hi there,
>>> 
>>> I could most likely be of assistance, if you gave me some more
>>> information. For instance: I'm wondering if the use case you describe
>>> below is already supported by the current RSS parse plugin?
>>> 
>>> The current RSS parser, parse-rss, does in fact index individual items
>>> that are pointed to by an RSS document. The items are added as Nutch
>>> Outlinks, and added to the overall queue of URLs to fetch.

Re: RSS-fecter and index individul-how can i realize this function

2007-02-06 Thread Chris Mattmann
Hi Doug,

> Since the target of the link must still be indexed separately from the
> item itself, how much use is all this?  If the RSS document is
> considered a single page that changes frequently, and item's links are
> considered ordinary outlinks, isn't much the same effect achieved?

IMHO, yes. That's why it's been hard for me to understand the real use case
for what Gal et al. are talking about. I've been trying to wrap my head
around it, but it seems to me the capability they require is sort of already
provided...

Cheers,
  Chris

> 
> Doug

__
Chris A. Mattmann
[EMAIL PROTECTED]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_
Jet Propulsion LaboratoryPasadena, CA
Office: 171-266BMailstop:  171-246
___

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.




Re: RSS-fecter and index individul-how can i realize this function

2007-02-07 Thread Chris Mattmann
Guys,

 Sorry to be so thick-headed, but could someone explain to me in really
simple language what this change is requesting that is different from the
current Nutch API? I still don't get it, sorry...

Cheers,
  Chris



On 2/7/07 9:58 AM, "Doug Cutting" <[EMAIL PROTECTED]> wrote:

> Renaud Richardet wrote:
>> I see. I was thinking that I could index the feed items without having
>> to fetch them individually.
> 
> Okay, so if Parser#parse returned a Map, then the URL for
> each parse should be that of its link, since you don't want to fetch
> that separately.  Right?
> 
> So now the question is, how much impact would this change to the Parser
> API have on the rest of Nutch?  It would require changes to all Parser
> implementations, to ParseSegement, to ParseUtil, and to Fetcher.  But,
> as far as I can tell, most of these changes look straightforward.
> 
> Doug

__
Chris A. Mattmann
[EMAIL PROTECTED]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_
Jet Propulsion LaboratoryPasadena, CA
Office: 171-266BMailstop:  171-246
___

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.




Re: RSS-fecter and index individul-how can i realize this function

2007-02-07 Thread Chris Mattmann
Doug, Renaud,

 Got it. So, the logic behind this is: why bother waiting until the
following fetch to parse (and create ParseData objects from) the RSS items
out of the feed? Okay, I get it, assuming that the RSS feed has *all* of the
RSS metadata in it. However, it's perfectly acceptable to have feeds that
simply have a title, description, and link in it. I guess this is still
valuable metadata information to have, however, the only caveat is that the
implication of the proposed change is:

1. We won't have cached copies, or fetched copies of the Content represented
by the item links. Therefore, in this model, we won't be able to pull up a
Nutch cache of the page corresponding to the RSS item, because we are
circumventing the fetch step

2. It sounds like a pretty fundamental API shift in Nutch, to support a
single type of content, RSS. Even if there are more content types that
follow this model, as Doug and Renaud both pointed out, there aren't a
multitude of them (perhaps archive files, but can you think of any others)?

The other main thing that comes to mind about this for me is that it prevents
the Content for the RSS items from being able to provide useful
metadata, in the sense that we never explicitly fetch that content. What if
we wanted to apply some super cool metadata extractor X that used
word-stemming, HTML design analysis, and other techniques to extract
metadata from the content pointed to by an RSS item link? In the proposed
model, we assume that the RSS xml item tag already contains all necessary
metadata for indexing, which in my mind, limits the model. Does what I am
saying make sense? I'm not shooting down the issue, I'm just trying to
brainstorm a bit here about the issue.
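
To make the brainstorm concrete, here is a sketch of the kind of API change
being proposed (the interface and method names are hypothetical, not the
committed Nutch API), where a parser hands back one Parse per item URL
instead of a single Parse per Content:

    // Hypothetical sketch of the proposed multi-Parse API.
    import java.util.Map;

    import org.apache.nutch.parse.Parse;
    import org.apache.nutch.protocol.Content;

    public interface MultiParseParser {

      /**
       * Returns a Map of String (URL) -> Parse. An HTML parser would
       * return a single entry keyed by the page's own URL; an RSS parser
       * could return one entry per item, keyed by the item's link, with
       * each entry indexed separately.
       */
      Map getParses(Content content);
    }

Under that model the caveats above still hold: anything returned this way
gets indexed without its link ever passing through the fetcher, so there is
no cached copy to run a downstream metadata extractor against.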

Cheers,
  Chris





On 2/7/07 11:11 AM, "Doug Cutting" <[EMAIL PROTECTED]> wrote:

> Chris Mattmann wrote:
>>  Sorry to be so thick-headed, but could someone explain to me in really
>> simple language what this change is requesting that is different from the
>> current Nutch API? I still don't get it, sorry...
> 
> A Content would no longer generate a single Parse.  Instead, a Content
> could potentially generate many Parses.  For most types of content,
> e.g., HTML, each Content would still generate a single Parse.  But for
> RSS, a Content might generate multiple Parses, each indexed separately
> and each with a distinct URL.
> 
> Another potential application could be processing archives: the parser
> could unpack the archive and each item in it indexed separately rather
> than indexing the archive as a whole.  This only makes sense if each
> item has a distinct URL, which it does in RSS, but it might not in an
> archive.  However some archive file formats do contain URLs, like that
> used by the Internet Archive.
> 
> http://www.archive.org/web/researcher/ArcFileFormat.php
> 
> Does that help?
> 
> Doug




Re: RSS-fecter and index individul-how can i realize this function

2007-02-08 Thread Chris Mattmann
Hi Doug,

  Okay, I see your points. It seems like this would be really useful for
some current folks, and for Nutch going forward. I see that there has been
some initial work today on preparing patches. I'd be happy to shepherd this
into the sources. I will begin reviewing what's required, and contacting the
folks who've begun work on this issue.

Thanks!

Cheers,
  Chris



On 2/7/07 1:31 PM, "Doug Cutting" <[EMAIL PROTECTED]> wrote:

> Chris Mattmann wrote:
>>  Got it. So, the logic behind this is, why bother waiting until the
>> following fetch to parse (and create ParseData objects from) the RSS items
>> out of the feed. Okay, I get it, assuming that the RSS feed has *all* of the
>> RSS metadata in it. However, it's perfectly acceptable to have feeds that
>> simply have a title, description, and link in it.
> 
> Almost.  The feed may have less than the referenced page, but it's also
> a lot easier to parse, since the link could be an anchor within a large
> page, or could be a page that has lots of navigation links, spam
> comments, etc.  So feed entries are generally much more precise than the
> pages they reference, and may make for a higher-quality search experience.
> 
>> I guess this is still
>> valuable metadata information to have, however, the only caveat is that the
>> implication of the proposed change is:
>> 
>> 1. We won't have cached copies, or fetched copies of the Content represented
>> by the item links. Therefore, in this model, we won't be able to pull up a
>> Nutch cache of the page corresponding to the RSS item, because we are
>> circumventing the fetch step
> 
> Good point.  We indeed wouldn't have these URLs in the cache.
> 
>> 2. It sounds like a pretty fundamental API shift in Nutch, to support a
>> single type of content, RSS. Even if there are more content types that
>> follow this model, as Doug and Renaud both pointed out, there aren't a
>> multitude of them (perhaps archive files, but can you think of any others)?
> 
> Also true.  On the other hand, Nutch provides 98% of an RSS search
> engine.  It'd be a shame to have to re-invent everything else and it
> would be great if Nutch could evolve to support RSS well.
> 
> Could image search also benefit from this?  One could generate a
> Parse for each image on a page whose text was from the page.  Product
> search too, perhaps.
> 
>> The other main thing that comes to mind about this for me is it prevents the
>> fetched Content for the RSS items from being able to provide useful
>> metadata, in the sense that it doesn't explicitly fetch the content. What if
>> we wanted to apply some super cool metadata extractor X that used
>> word-stemming, HTML design analysis, and other techniques to extract
>> metadata from the content pointed to by an RSS item link? In the proposed
>> model, we assume that the RSS xml item tag already contains all necessary
>> metadata for indexing, which in my mind, limits the model. Does what I am
>> saying make sense? I'm not shooting down the issue, I'm just trying to
>> brainstorm a bit here about the issue.
> 
> Sure, the RSS feed may contain less than the page it references, but
> that might be all that one wishes to index.  Otherwise, if, e.g., a blog
>   includes titles from other recent posts you're going to get lots of
> false positives.  Ideally Nutch should support various options:
> searching the feed only, searching the referenced page only, or perhaps
> searching both.
> 
> Doug




Re: log guards

2007-02-13 Thread Chris Mattmann
Hi Doug, and Jerome,

  Ah, yes, the log guard conversation. I remember this from a while back.
Hmmm, do you guys know which issue this is recorded as in JIRA? I have some
free time recently, so I will be able to add this to my list of Nutch stuff
to work on, and would be happy to take the lead on removing the guards where
needed, and reviewing whether or not the debug ones make sense where they
are. 

Cheers,
  Chris



On 2/13/07 11:17 AM, "Jérôme Charron" <[EMAIL PROTECTED]> wrote:

>> These guards were all introduced by a patch some time ago.  I complained
>> at the time and it was promised that this would be repaired, but it has
>> not yet been.
> 
> Yes, Sorry Doug that's my own fault
> I really don't have time to fix this   :-(
> 
> Best regards
> 
> Jérôme

__
Chris A. Mattmann
[EMAIL PROTECTED]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_
Jet Propulsion LaboratoryPasadena, CA
Office: 171-266BMailstop:  171-246
___

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.




Re: log guards

2007-02-28 Thread Chris Mattmann
Hi Dennis,

  I'd be happy to: please contact me off list ([EMAIL PROTECTED]),
and let's chat :-)

Cheers,
  Chris



On 2/28/07 7:38 AM, "Dennis Kubes" <[EMAIL PROTECTED]> wrote:

> I can also work on this, Chris do you want me to do it or do you want to
> coordinate our efforts?
> 
> Dennis Kubes
> 
> Jérôme Charron wrote:
>> Hi Chris,
>> 
>> The JIRA issue is the 309 : https://issues.apache.org/jira/browse/NUTCH-309
>> Thanks for your help.
>> 
>> Jérôme
>> 
>> On 2/13/07, Chris Mattmann <[EMAIL PROTECTED]> wrote:
>>> 
>>> Hi Doug, and Jerome,
>>> 
>>>   Ah, yes, the log guard conversation. I remember this from a while back.
>>> Hmmm, do you guys know what issue that this recorded as in JIRA? I have
>>> some
>>> free time recently, so I will be able to add this to my list of Nutch
>>> stuff
>>> to work on, and would be happy to take the lead on removing the guards
>>> where
>>> needed, and reviewing whether or not the debug ones make sense where they
>>> are.
>>> 
>>> Cheers,
>>>   Chris
>>> 
>>> 
>>> 
>>> On 2/13/07 11:17 AM, "Jérôme Charron" <[EMAIL PROTECTED]> wrote:
>>> 
>>>>> These guards were all introduced by a patch some time ago.  I
>>> complained
>>>>> at the time and it was promised that this would be repaired, but it
>>> has
>>>>> not yet been.
>>>> 
>>>> Yes, Sorry Doug that's my own fault
>>>> I really don't have time to fix this   :-(
>>>> 
>>>> Best regards
>>>> 
>>>> Jérôme
>>> 
>>> __
>>> Chris A. Mattmann
>>> [EMAIL PROTECTED]
>>> Staff Member
>>> Modeling and Data Management Systems Section (387)
>>> Data Management Systems and Technologies Group
>>> 
>>> _
>>> Jet Propulsion LaboratoryPasadena, CA
>>> Office: 171-266BMailstop:  171-246
>>> ___
>>> 
>>> Disclaimer:  The opinions presented within are my own and do not reflect
>>> those of either NASA, JPL, or the California Institute of Technology.
>>> 
>>> 
>>> 
>> 

__
Chris A. Mattmann
[EMAIL PROTECTED]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_
Jet Propulsion LaboratoryPasadena, CA
Office: 171-266BMailstop:  171-246
___

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.




Re: Welcome Dennis Kubes as Nutch committer

2007-02-28 Thread Chris Mattmann
Dennis,

 I take my coffee black: with a single creamer ;) Okay, okay, sorry: I
thought we were talking about *real* hazing ;)

Cheers,
  Chris



On 2/28/07 12:31 PM, "Dennis Kubes" <[EMAIL PROTECTED]> wrote:

> Hi All,
> 
> Thank you Andrzej for your kind words. I am looking forward to working
> together with everyone and I hope I can continue to be too inquisitive.
> 
> I don't know if I can introduce myself shortly but I will try. ;) For
> those that don't know me I am based in Plano (Dallas), Texas.  I am 28
> and have been programming for about 12 years.
> 
> So as first commit I need to add my name and re-publish the website.
> Let the hazing begin.
> 
> Dennis Kubes
> 
> Andrzej Bialecki wrote:
>> Hi all,
>> 
>> Some time ago I proposed to Lucene PMC that Dennis should become a Nutch
>> committer.
>> 
>> Dennis has been found guilty of providing too many good quality patches,
>> sending too many supportive emails to the mailing lists, and generally
>> being too inquisitive in nature, which led to a constant stream of
>> comments, suggestions and patches. We weren't able to keep up -
>> something had to be done about it ... ;)
>> 
>> I'm glad to announce that Lucene PMC has voted in his favor.
>> Congratulations and welcome aboard!
>> 
>> (The tradition on Apache projects is that new committers should
>> (shortly) introduce themselves, and as the first commit they should put
>> their name in the Credits section of the website and re-publish the
>> website).
>> 

__
Chris A. Mattmann
[EMAIL PROTECTED]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_
Jet Propulsion LaboratoryPasadena, CA
Office: 171-266BMailstop:  171-246
___

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.




Re: Issues pending before 0.9 release

2007-03-05 Thread Chris Mattmann
Hi Guys,

> Blocker
> 
> * NUTCH-400 (Update & add missing license headers) - I believe this is
> fixed and should be closed

+1, thanks to Sami for closing it.

> 
> * NUTCH-353 (pages that serverside forwards will be refetched every
> time) - this was partially fixed in NUTCH-273, but a more complete
> solution would require significant changes to LinkDb. As there are no
> patches implementing this, I left it open, but it's no longer as
> critical as it was before. I propose to move it to "Major" and address
> it in the next release.

+1

> 
> * NUTCH-233 (wrong regular expression hang reduce process for ever) - I
> propose to apply the fix provided by Sean Dean and close this issue for now.

+1

> 
> Critical
> 
> * NUTCH-436 (Incorrect handling of relative paths when the embedded URL
> path is empty). There is no patch available yet. If someone could
> contribute a patch I'd like to see this fixed before the release.

Looks like Dennis is on this one

> 
> * NUTCH-427 (protocol-smb). This relies on a LGPL library, and it's
> certainly not critical (as this is an optional new feature). I propose
> to change it to Major, and make a decision - do we want another plugin
> like parse-mp3 or parse-rtf, or not.

Let's hold off on this: it's not necessary for 0.9, and I don't think
there's been a bunch of traffic on the list identifying this as critical to
get into the sources for the release

> 
> * NUTCH-381 (Ignore external link not work as expected) - I'll try to
> reproduce it, and if I find an easy fix I'd like to apply it before the
> release.

+1

> 
> * NUTCH-277 (Fetcher dies because of "max. redirects") - I wasn't able
> to reproduce it. If there is no updated information on this I propose to
> close it with "Can't reproduce".

+1, I had to do something similar with NUTCH-258

> 
> * NUTCH-167 (Observation of ) -
> there's a patch which I tested in a limited production env. If there are
> no objections I'd like to apply it before the release.

+1

> 
> Major
> =
> There are 84 major issues, but some of them are either invalid, or
> should be "minor", or no longer apply and should be closed. Please
> review them if you can and provide some comments or recommendations if
> you think you have some new information.

I will spend some time going through JIRA today and see if there's any
issues that I can find that:

1. Have a patch already
2. Sound like something quick, easy, and not so far-reaching across the
entire Nutch API

> 
> 
> One decision also that we need to make is which version of Hadoop should
> be included in the release. Current trunk uses 0.10.1, I have a set of
> production-tested patches that use 0.11.2, and today the Hadoop team
> released 0.12.0 (to be followed shortly by a 0.12.1, most likely in time
> before our release). The most conservative option is to stay with
> 0.10.1, but by the time people start using Nutch this will be a fairly
> old version already. I propose to upgrade to 0.11.2. We could use 0.12.1
> - but in this case with the expectation that we release less than stable
> version of Nutch to be soon followed by a minor stable release ...

I'd agree with the upgrade to 0.11.2, +1


Cheers,
  Chris

P.S. I am going to contact Piotr and coordinate with him: I'd like to be the
release manager for this Nutch release.





FW: Nutch release process help

2007-03-06 Thread Chris Mattmann
Folks,

 Attached below is the email that I sent to Piotr. Sorry for not including
everyone on the email originally; I thought it to be minutia, but you all
are welcome to be included.

 For that matter, I guess I never got all of the committers' permission to
be the release manager for Nutch 0.9, so if someone objects to me doing it,
or alternatively if someone else wants to do it, please speak up.

 In a way, that's sort of what I was asking when I said:

[...snip...]
> I'd like to be the release manager for this Nutch release.
[...snip...]

 Seeing as though I've never done the release before, and folks like Sami,
and Piotr and Doug have, I'd be happy for them to do it. Alternatively, if
someone else wants to do it (Dennis spoke up I believe), that's also fine by
me. I think it would be good to get some of us new guys (me and Dennis)
informed of the process so that we are able to do it in the future -- that's
all I was suggesting when I said that I would email Piotr. It's too bad that
this has turned out to be an issue that I've handled incorrectly, and for
that, I apologize. I will do my best to thoroughly vet all such discussions
on the nutch list in the future.


Cheers,
  Chris



-- Forwarded Message
From: Chris Mattmann <[EMAIL PROTECTED]>
Date: Mon, 05 Mar 2007 21:25:30 -0800
To: Piotr Kosiorowski <[EMAIL PROTECTED]>
Cc: Chris Mattmann <[EMAIL PROTECTED]>, Andrzej Bialecki
<[EMAIL PROTECTED]>
Conversation: Nutch release process help
Subject: Nutch release process help

Hi Piotr,

 I am going to manage the Nutch 0.9 release (as you probably saw me
volunteer for on the Nutch list). I've been told by Andrzej that you are the
release "master" :-) Could you please pass along your wisdom to me as to how
to begin the release process, and as to the steps needed to successfully
complete it?

 Thanks a lot for your help!

Cheers,
  Chris



-- End of Forwarded Message




0.9 release

2007-03-07 Thread Chris Mattmann
Hi Folks,

  As suggested by Sami, I'm moving this discussion to the nutch-dev list.
Seems like I am the guy that is going to do the Nutch 0.9 release :-)
However, it seems also that there are some issues that need to be sorted out
first. I'd like to follow up to Andrzej's email about loose ends before
moving forward with the release. So, here are my questions:

1. What remaining issues out there need to be applied to the sources, (or
have patches contributed, then applied) and make it into 0.9? There were
some discussions about this, however, I don't think we have a concrete set
yet. The answer I'm looking for would be something like:

A. NUTCH-XXX (has a patch), NUTCH-YYY (has a patch) before 0.9 is made
B. NUTCH-ZZZ (patch in progress) before 0.9 is made
C. We've got enough in 0.9-dev in the trunk right now to make a 0.9 release

2. Any outstanding things that need to get done that aren't really code that
needs to get committed, e.g., things we need to close the loop on

3. Release Manager: I've got this taken care of, as soon as you all give me
the green light. 

 So, please, committer-brethren, let me know what you think about 1-3, as it
would help me understand how to move forward.

Thanks!

Cheers,
  Chris




Re: [jira] Commented: (NUTCH-384) Protocol-file plugin does not allow the parse plugins framework to operate properly

2007-03-08 Thread Chris Mattmann
Hi Andrzej,
 
  Yep, +1. I also want to make a small update: instead of creating a
new NutchConf object, just pass the existing one through (maybe via the
protocol layer?). Does this make sense?
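
For the record, a minimal sketch of what I mean (class and field names are
hypothetical here, not the actual protocol-file sources):

    // Sketch: the conf object is handed down from the protocol layer
    // instead of being freshly constructed inside the plugin.
    import org.apache.hadoop.conf.Configuration;

    public class ExampleFileResponse {

      private Configuration conf;

      // injected by the protocol plugin that created this response;
      // no "new Configuration()" (or NutchConf) is constructed here
      public ExampleFileResponse(Configuration conf) {
        this.conf = conf;
      }

      public Configuration getConf() {
        return this.conf;
      }
    }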

Cheers,
  Chris



On 3/8/07 1:47 PM, "Andrzej Bialecki  (JIRA)" <[EMAIL PROTECTED]> wrote:

> 
> [ 
> https://issues.apache.org/jira/browse/NUTCH-384?page=com.atlassian.jira.plugin
> .system.issuetabpanels:comment-tabpanel#action_12479442 ]
> 
> Andrzej Bialecki  commented on NUTCH-384:
> -
> 
> +1 - although the patch needs whitespace cleanup before committing
> (indentation should be 2 literal spaces, "if" keyword should be separated by
> one space from the parens).
> 
>> Protocol-file plugin does not allow the parse plugins framework to operate
>> properly
>> -
>> --
>> 
>> Key: NUTCH-384
>> URL: https://issues.apache.org/jira/browse/NUTCH-384
>> Project: Nutch
>>  Issue Type: Bug
>>Affects Versions: 0.8, 0.8.1, 0.9.0
>> Environment: All
>>Reporter: Paul Ramirez
>> Assigned To: Chris A. Mattmann
>> Attachments: file_protocol_mime_patch.diff
>> 
>> 
>> When using the file protocol one can not map a parse plugin to a content
>> type. The only way to get the plugin called is through the default plugin.
>> The issue is that the content type never gets mapped. Currently the content
>> type does not get set by the file protocol.




Re: [jira] Commented: (NUTCH-384) Protocol-file plugin does not allow the parse plugins framework to operate properly

2007-03-08 Thread Chris Mattmann
Hi Andrzej,

 Ah, yep, you're right. I just did a cursory inspection, and hadn't applied
the patch (yet). I didn't notice it was in the main method. Kk, sounds good.
I am applying patch now, and will test later this afternoon, fix the
whitespace stuff, and then commit.

Thanks!

Cheers,
  Chris



On 3/8/07 1:55 PM, "Andrzej Bialecki" <[EMAIL PROTECTED]> wrote:

> Chris Mattmann wrote:
>> Hi Andrzej,
>>  
>>   Yep, +1. I also want to make a small update, where instead of creating a
>> new NutchConf object, to just pass it through (maybe via the protocol
>> layer?). Does this make sense?
>>   
> 
> I'm not sure what you mean - the only place where this patch creates a
> Configuration object is in File.main(), which is innocuous.




Re: svn commit: r516759 - /lucene/nutch/trunk/CHANGES.txt

2007-03-10 Thread Chris Mattmann
Hi Dennis,

 Not to nit-pick, but the place where you inserted your change isn't at the
end (where they typically should be placed). You inserted in the middle of
the file, throwing off the numbering (there are now two sets of 18 and 19 in
the unreleased changes section). Could you please append your changes to the
end of the file, and recommit?

 Thanks a lot!

Cheers,
  Chris



On 3/10/07 10:03 AM, "[EMAIL PROTECTED]" <[EMAIL PROTECTED]> wrote:

> Author: kubes
> Date: Sat Mar 10 10:03:07 2007
> New Revision: 516759
> 
> URL: http://svn.apache.org/viewvc?view=rev&rev=516759
> Log:
> Updated to reflect commits of NUTCH-233 and NUTCH-436.
> 
> Modified:
> lucene/nutch/trunk/CHANGES.txt
> 
> Modified: lucene/nutch/trunk/CHANGES.txt
> URL: 
> http://svn.apache.org/viewvc/lucene/nutch/trunk/CHANGES.txt?view=diff&rev=5167
> 59&r1=516758&r2=516759
> ==
> --- lucene/nutch/trunk/CHANGES.txt (original)
> +++ lucene/nutch/trunk/CHANGES.txt Sat Mar 10 10:03:07 2007
> @@ -50,6 +50,13 @@
>  
>  17. NUTCH-383 - upgrade to Hadoop 0.7.1 and Lucene 2.0.0. (ab)
>  
> +18. NUTCH-233 - Wrong regular expression hangs reduce process forever (Stefan
> +Groschupf via kubes)
> +
> +19. NUTCH-436 - Incorrect handling of relative paths when the embedded URL
> + path is empty (kubes)
> +
> +
>** WARNING !!! 
>* This upgrade breaks data format compatibility. A tool 'convertdb'   *
>* was added to migrate existing CrawlDb-s to the new format. Segment data *
> 
> 




Re: svn commit: r516759 - /lucene/nutch/trunk/CHANGES.txt

2007-03-10 Thread Chris Mattmann
Dennis,

 No probs. Thanks, a lot!

Cheers,
  Chris



On 3/10/07 5:35 PM, "Dennis Kubes" <[EMAIL PROTECTED]> wrote:

> 
> 
> Chris Mattmann wrote:
>> Hi Dennis,
>> 
>>  Not to nit-pick, but the place where you inserted your change isn't at the
>> end (where they typically should be placed). You inserted in the middle of
>> the file, throwing off the numbering (there are now 2 sets of 18, and 19 in
>> the unreleased changes section). Could you please append your changes to the
>> end of the file, and recommit?
>> 
>>  Thanks a lot!
>> 
>> Cheers,
>>   Chris
> 
> Sorry about that.  I saw the warning message thinking it was a version
> break.  Everything should be fixed now.
> 
> Dennis Kubes
>> 
>> 
>> 
>> On 3/10/07 10:03 AM, "[EMAIL PROTECTED]" <[EMAIL PROTECTED]> wrote:
>> 
>>> Author: kubes
>>> Date: Sat Mar 10 10:03:07 2007
>>> New Revision: 516759
>>> 
>>> URL: http://svn.apache.org/viewvc?view=rev&rev=516759
>>> Log:
>>> Updated to reflect commits of NUTCH-233 and NUTCH-436.
>>> 
>>> Modified:
>>> lucene/nutch/trunk/CHANGES.txt
>>> 
>>> Modified: lucene/nutch/trunk/CHANGES.txt
>>> URL: 
>>> http://svn.apache.org/viewvc/lucene/nutch/trunk/CHANGES.txt?view=diff&rev=51
>>> 67
>>> 59&r1=516758&r2=516759
>>> 
>>> ==
>>> --- lucene/nutch/trunk/CHANGES.txt (original)
>>> +++ lucene/nutch/trunk/CHANGES.txt Sat Mar 10 10:03:07 2007
>>> @@ -50,6 +50,13 @@
>>>  
>>>  17. NUTCH-383 - upgrade to Hadoop 0.7.1 and Lucene 2.0.0. (ab)
>>>  
>>> +18. NUTCH-233 - Wrong regular expression hangs reduce process forever
>>> (Stefan
>>> +Groschupf via kubes)
>>> +
>>> +19. NUTCH-436 - Incorrect handling of relative paths when the embedded URL
>>> + path is empty (kubes)
>>> +
>>> +
>>>** WARNING !!!
>>> 
>>>* This upgrade breaks data format compatibility. A tool 'convertdb'
>>> *
>>>* was added to migrate existing CrawlDb-s to the new format. Segment data
>>> *
>>> 
>>> 
>> 
>> 




FW: [jira] Created: (HADOOP-1147) remove all @author tags from source

2007-03-22 Thread Chris Mattmann
Hey Doug,

  Do you think we should do this in Nutch too? I'm in favor of doing this --
what does everyone else feel?

Thanks!

Cheers,
  Chris

__
Chris A. Mattmann
[EMAIL PROTECTED]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_
Jet Propulsion LaboratoryPasadena, CA
Office: 171-266BMailstop:  171-246
___

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.

-- Forwarded Message
From: "Doug Cutting (JIRA)" <[EMAIL PROTECTED]>
Reply-To: 
Date: Thu, 22 Mar 2007 13:14:32 -0700 (PDT)
To: 
Subject: [jira] Created: (HADOOP-1147) remove all @author tags from source

remove all @author tags from source
---

 Key: HADOOP-1147
 URL: https://issues.apache.org/jira/browse/HADOOP-1147
 Project: Hadoop
  Issue Type: Improvement
Reporter: Doug Cutting
 Assigned To: Doug Cutting
Priority: Minor


We should remove @author tags from the source code.  We give contributors
credit in at least three places (Jira, subversion and CHANGES.txt).  Many
files have been substantially re-written by a range of contributors and
their @author tags are no longer accurate.  Also, @author tags imply
individual ownership, when we should rather strive for community ownership.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-- End of Forwarded Message




Initiation of 0.9 release process

2007-03-26 Thread Chris Mattmann
Hi Folks,

  As your friendly neighborhood 0.9 release manager, I just wanted to give
you all a heads up that I'd like to begin the release process today. If I
hear no objections by 00:00:00 UTC time, I will begin the release process
then. I will notify the list as soon as I'm done.

 Thanks!

Cheers,
  Chris

__
Chris A. Mattmann
[EMAIL PROTECTED]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_
Jet Propulsion LaboratoryPasadena, CA
Office: 171-266BMailstop:  171-246
___

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.




Re: Initiation of 0.9 release process

2007-03-26 Thread Chris Mattmann
Hey Dennis,

 I'm basically going to follow the release process on the wiki (pointed to
by Doug), and the steps that I discussed with you and Sami (posted to the
dev list). In terms of help, if there's anything in those steps that I get
stuck on, I'll holler at ya. Otherwise, if the process goes smoothly, I can
probably get it done on my own. Thanks for the offer: I'll be sure to call
on you if I get stuck. :-)


Cheers,
  Chris



On 3/26/07 10:06 AM, "Dennis Kubes" <[EMAIL PROTECTED]> wrote:

> Let me know if I can help in any way?
> 
> Dennis Kubes
> 
> Chris Mattmann wrote:
>> Hi Folks,
>> 
>>   As your friendly neighborhood 0.9 release manager, I just wanted to give
>> you all a heads up that I'd like to begin the release process today. If I
>> hear no objections by 00:00:00 UTC time, I will begin the release process
>> then. I will notify the list as soon as I'm done.
>> 
>>  Thanks!
>> 
>> Cheers,
>>   Chris
>> 
>> __
>> Chris A. Mattmann
>> [EMAIL PROTECTED]
>> Staff Member
>> Modeling and Data Management Systems Section (387)
>> Data Management Systems and Technologies Group
>> 
>> _
>> Jet Propulsion LaboratoryPasadena, CA
>> Office: 171-266BMailstop:  171-246
>> ___
>> 
>> Disclaimer:  The opinions presented within are my own and do not reflect
>> those of either NASA, JPL, or the California Institute of Technology.
>> 
>> 




Nutch 0 .9 release progress update

2007-03-26 Thread Chris Mattmann
Hi Folks,

 Just to update everyone on progress. I've made it to Step 13 (waiting for
release to appear on mirrors) in the Release Process:

  http://wiki.apache.org/nutch/Release_HOWTO

 You can view a full log of the fun that I've been having by going to:

  http://people.apache.org/~mattmann/NUTCH_0.9_release_log.doc

 Tomorrow when I wake up (here in Los Angeles, Pacific Standard Time), I
will go ahead and wrap up the rest of the process. Thanks to all the folks
who've given me guidance along the way. It's been interesting figuring out
the process.

Thanks!

Cheers,
  Chris





Re: Nutch 0 .9 release progress update

2007-03-26 Thread Chris Mattmann
Hi Sami,

 Thanks for the heads up! :-) Okay, so I did the following:

1. Removed nutch-0.9.* from
people.apache.org:/www/www.apache.org/dist/lucene/nutch

2. Removed CHANGES-0.9.txt from the same place

 I will send out a separate email calling for a vote (thanks for the pointer
to the example!)

Thanks!

Cheers,
  Chris



On 3/26/07 10:22 PM, "Sami Siren" <[EMAIL PROTECTED]> wrote:

> Chris Mattmann wrote:
>> Hi Folks,
>> 
>>  Just to update everyone on progress. I've made it to Step 13 (waiting for
>> release to appear on mirrors) in the Release Process:
> 
> Chris, thanks for your work so far.
> 
> Seems like we're missing one important point in the rtfm: release review
> & vote.
> 
> Every apache release should be voted before it is made official. Three
> binding votes are required (I believe we now have enough active
> committers to do it this way?).
> 
> So please put the artifacts in a staging area and call a vote before
> going further. (there's a nice example here for a vote mail:
> http://www.mail-archive.com/dev@jackrabbit.apache.org/msg04641.html)
> 
> --
>  Sami Siren




[VOTE] Release Apache Nutch 0.9

2007-03-26 Thread Chris Mattmann
Hi Folks,

I have posted a candidate for the Apache Nutch 0.9 release at

 http://people.apache.org/~mattmann/nutch_0.9/

See the included CHANGES-0.9.txt file for details on release
contents and latest changes. The release was made from the 0.9-dev trunk.

Please vote on releasing these packages as Apache Nutch 0.9.
The vote is open for the next 72 hours. Only votes from Nutch
committers are binding, but everyone is welcome to check the release
candidate and voice their approval or disapproval. The vote passes if
at least three binding +1 votes are cast.

[ ] +1 Release the packages as Apache Nutch 0.9
[ ] -1 Do not release the packages because...

Thanks!

Cheers,
  Chris




Re: [VOTE] Release Apache Nutch 0.9

2007-03-27 Thread Chris Mattmann
Hi Sami,

> A very limited acid test shows that I can do crawling and searching
> through web app so that part is ok.

Great! Similar tests of my own showed the same.

> 
> About signatures: I can't find your public gpg key anywhere (to verify
> the signature), not in KEYS file nor in keyservers I checked. Am i just
> blind?

Yeah, in my release log, I actually noted this. I was having a hard time
figuring out how to generate my public gpg key. Do you know what command to
run? I know where the KEYS file is in the dist directory, so I'm guessing I
just:

1. Generate my public gpg key (I already have my private one I guess)
2. Add that public gpg key to the KEYS file in the Nutch dist directory on
people.apache.org

Am I right about this? If so, could you tell me the command to run to
generate my public gpg key?

> 
> The md5 format used differs from rest of lucene sub projects.

According to the Apache sign and release guide (
http://www.apache.org/dev/mirror-step-by-step.html?Step-By-Step), I ran the
following command:

openssl md5 < nutch-0.9.tar.gz > nutch-0.9.tar.gz.md5

> To create
> it in similar format as the rest of lucene one could use
> 
>   md5sum  > .md5
> 
> We should probably adopt to same convention or wdot?

It's fine by me, but, just for my reference, what's the difference between
using the openssl md5 versus md5sum? If you want me to regenerate it, just
let me know...

Cheers,
  Chris


> 
> --
>  Sami Siren




Re: [VOTE] Release Apache Nutch 0.9

2007-03-27 Thread Chris Mattmann
I've gone ahead and figured out how to generate my GPG public key :-) It
wasn't as hard as I thought. Anyways, I placed my gpg.txt file in:

~mattmann/gpg.txt

On people.apache.org. I've also added my GPG key to the KEYS file in the
nutch dist directory, /www/www.apache.org/dist/lucene/nutch/, using the same
convention as the others. To get the header, I did a gpg --list-keys.


Thanks!

Cheers,
  Chris



On 3/27/07 8:14 AM, "Chris Mattmann" <[EMAIL PROTECTED]> wrote:

> Hi Sami,
> 
>> A very limited acid test shows that I can do crawling and searching
>> through web app so that part is ok.
> 
> Great! Similar tests of my own showed the same.
> 
>> 
>> About signatures: I can't find your public gpg key anywhere (to verify
>> the signature), not in KEYS file nor in keyservers I checked. Am i just
>> blind?
> 
> Yeah, in my release log, I actually noted this. I was having a hard time
> figuring out how to generate my public gpg key. Do you know what command to
> run? I know where the KEYS file is in the dist directory, so I'm guessing I
> just:
> 
> 1. Generate my public gpg key (I already have my private one I guess)
> 2. Add that public gpg key to the KEYS file in the Nutch dist directory on
> people.apache.org
> 
> Am I right about this? If so, could you tell me the command to run to
> generate my public gpg key?
> 
>> 
>> The md5 format used differs from rest of lucene sub projects.
> 
> According to the Apache sign and release guide (
> http://www.apache.org/dev/mirror-step-by-step.html?Step-By-Step), I ran the
> following command:
> 
> openssl md5 < nutch-0.9.tar.gz > nutch-0.9.tar.gz.md5
> 
>> To create
>> it in similar format as the rest of lucene one could use
>> 
>>   md5sum  > .md5
>> 
>> We should probably adopt to same convention or wdot?
> 
> It's fine by me, but, just for my reference, what's the difference between
> using the openssl md5 versus md5sum? If you want me to regenerate it, just
> let me know...
> 
> Cheers,
>   Chris
> 
> 
>> 
>> --
>>  Sami Siren
> 
> 



