[jira] Commented: (NUTCH-444) Possibly use a different library to parse RSS feed for improved performance and compatibility

2007-11-15 Thread Renaud Richardet (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12542819
 ] 

Renaud Richardet commented on NUTCH-444:


Hi,
I am travelling and will be offline until January 2008. Thanks for
your patience.
Renaud

-- 
renaudatoslutionsdotcom
www.oslutions.com


 Possibly use a different library to parse RSS feed for improved performance 
 and compatibility
 -

 Key: NUTCH-444
 URL: https://issues.apache.org/jira/browse/NUTCH-444
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 0.9.0
Reporter: Renaud Richardet
Assignee: Chris A. Mattmann
Priority: Minor
 Fix For: 1.0.0

 Attachments: feed.tar.bz2, NUTCH-444.1-1.patch, 
 NUTCH-444.Mattmann.061707.patch.txt, NUTCH-444.patch, parse-feed-v2.tar.bz2, 
 parse-feed.tar.bz2


 As discussed by Nutch Newbie, Gal, and Chris on NUTCH-443, the current 
 library (feedparser) has the following issues:
 - OutOfMemory when parsing > 100k feeds, since it has to convert the feed to 
 JDOM first
 - no support for Atom 1.0
 - there has been no development in the last year
 Alternatives are:
 - Rome 
 - Informa
 - custom implementation based on StAX
 - ??
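
To illustrate the StAX alternative listed above, here is a minimal sketch of pulling item titles out of an RSS 2.0 feed with the javax.xml.stream pull parser, without first building a JDOM tree. It is only an illustration under stated assumptions: the class StaxFeedTitles and the method itemTitles are hypothetical and are not taken from Nutch or from any of the attached patches, and a real parse plugin would also have to handle Atom, namespaces and character encodings.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper for illustration; not part of Nutch or any attached patch.
class StaxFeedTitles {

    // Streams through an RSS 2.0 document and collects the <title> of each <item>.
    // Only the current event is held in memory, never the whole document tree.
    static List<String> itemTitles(Reader in) {
        List<String> titles = new ArrayList<>();
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            // Feeds come from arbitrary sites: disable DTDs and external entities.
            factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
            factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, Boolean.FALSE);
            XMLStreamReader reader = factory.createXMLStreamReader(in);
            boolean inItem = false;
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    if ("item".equals(name)) {
                        inItem = true;
                    } else if (inItem && "title".equals(name)) {
                        // getElementText() reads up to the matching end tag.
                        titles.add(reader.getElementText().trim());
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && "item".equals(reader.getLocalName())) {
                    inItem = false;
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Malformed feed", e);
        }
        return titles;
    }

    public static void main(String[] args) {
        String rss = "<rss version=\"2.0\"><channel><title>Example</title>"
                + "<item><title>First</title></item>"
                + "<item><title>Second</title></item>"
                + "</channel></rss>";
        System.out.println(itemTitles(new StringReader(rss))); // prints [First, Second]
    }
}
```

The channel-level title is skipped because the sketch only records titles seen between an item start tag and its end tag.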

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Issue Comment Edited: (NUTCH-444) Possibly use a different library to parse RSS feed for improved performance and compatibility

2007-11-15 Thread Dennis Kubes (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12542814
 ] 

musepwizard edited comment on NUTCH-444 at 11/15/07 9:08 AM:
--

Fixes errors in unit test on Windows machines.  This is a trivial change so I 
went ahead and committed it.

  was (Author: musepwizard):
Fixes errors in unit test on windows machines.
  



[jira] Closed: (NUTCH-552) Upgrade Nutch to Hadoop 0.15.x

2007-11-15 Thread Dennis Kubes (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes closed NUTCH-552.
--

Resolution: Fixed

This has now been fixed and committed.

 Upgrade Nutch to Hadoop 0.15.x
 --

 Key: NUTCH-552
 URL: https://issues.apache.org/jira/browse/NUTCH-552
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.0.0
Reporter: Andrzej Bialecki 
Assignee: Dennis Kubes
 Fix For: 1.0.0

 Attachments: NUTCH-552-1.patch, NUTCH-552-2.patch, NUTCH-552-3.patch, 
 NUTCH-552-4.patch, NUTCH-552.1-1.patch


 Upgrade Nutch to Hadoop 0.15.x .




[jira] Updated: (NUTCH-444) Possibly use a different library to parse RSS feed for improved performance and compatibility

2007-11-15 Thread Dennis Kubes (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes updated NUTCH-444:
---

Attachment: NUTCH-444.1-1.patch

Fixes errors in unit test on Windows machines.




Commit Times for Issues

2007-11-15 Thread Dennis Kubes
So I have been talking with some of the other committers and I wanted to 
lay out a suggestion for standardizing some of the Nutch committer 
workflow processes in the hope of speeding up Nutch development.


The first one I was hoping to tackle is time to commit.  At least for me 
it has been hard to know when to commit something, especially when it 
was trivial or no one commented on the issue.  Here is what is being 
proposed:


Trivial changes = immediate, at the discretion of the committers
Minor changes = 24 hours from latest patch or 1 or more +1 from committers
Major and blocker changes = 4 days from latest patch or 2 or more +1 
from committers


This way if an issue has been active for some time but no one has taken 
a look at it, and it has passed all unit tests, then we can go ahead and 
commit it.  Also this should allow more of the smaller changes to be 
handled faster.


These are of course just suggestions, and I would love to hear from 
others in the community.  What I think would be best is to come to a 
consensus on this and then have a wiki page describing this and other 
processes for committers.


Dennis Kubes


Re: Commit Times for Issues

2007-11-15 Thread Andrzej Bialecki

Dennis Kubes wrote:
> So I have been talking with some of the other committers and I wanted to 
> lay out a suggestion for standardizing some of the Nutch committer 
> workflow processes in the hope of speeding up Nutch development.
>
> The first one I was hoping to tackle is time to commit.  At least for me 
> it has been hard to know when to commit something, especially when it 
> was trivial or no one commented on the issue.  Here is what is being 
> proposed:
>
> Trivial changes = immediate, at the discretion of the committers
> Minor changes = 24 hours from latest patch or 1 or more +1 from committers
> Major and blocker changes = 4 days from latest patch or 2 or more +1 
> from committers
>
> This way if an issue has been active for some time but no one has taken 
> a look at it, and it has passed all unit tests, then we can go ahead and 
> commit it.  Also this should allow more of the smaller changes to be 
> handled faster.
>
> These are of course just suggestions, and I would love to hear from 
> others in the community.  What I think would be best is to come to a 
> consensus on this and then have a wiki page describing this and other 
> processes for committers.


I agree with the overall plan - we need to speed up the process and 
release the committers from worrying too much whether a patch is ripe 
enough to commit it.


Though I think that in the case of minor changes, the 24-hour period is too 
short. By definition, since they are not trivial, they 
could use a peer review. Sometimes it's difficult to get a patch 
reviewed within 24 hours, and in the coding enthusiasm it's easy to be 
too quick ... I'd say 48 hours if there is no review, or less if the patch is 
reviewed and gets a +1.


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Nutch trunk js-parser problem with extremely long and meaningless Elements

2007-11-15 Thread Ned Rockson
I've run into the problem before where, while running, the parser gets 
caught in really deep regex loops.  As a quick fix I changed 
urlfilter-prefix to reject URLs over 300 characters and to make sure 
none of the characters have ASCII values below 32 (control characters).  I 
just ran into another one today, but this time it's in the JS parser.  Take a look 
at the source for http://www.magic-cadeaux.fr/ where it lists the 
function swap(image, num).  If it weren't for all of the slashes it 
would be well-formed JavaScript, but unfortunately the parse-js plugin doesn't 
deal with it correctly.  It just hangs in a very, very deep loop.  A 
browser such as Firefox, however, seems to deal with it okay.  Is there 
a way we can deal with these cases other than limiting the size of an 
Element?
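
The quick fix mentioned above (reject URLs longer than 300 characters or containing control characters) could be sketched roughly as follows. This is a stand-alone illustration and not the actual urlfilter-prefix change: the class name UrlSanityCheck and the constant MAX_URL_LENGTH are made up for the example, and returning null for a rejected URL simply mirrors the convention of Nutch's URLFilter.filter(String).

```java
// Hypothetical stand-alone sketch; names are illustrative, not from the Nutch source.
class UrlSanityCheck {

    // Length limit taken from the message above.
    static final int MAX_URL_LENGTH = 300;

    // Returns the URL unchanged if it passes, or null if it should be rejected.
    static String filter(String url) {
        if (url == null || url.length() > MAX_URL_LENGTH) {
            return null;
        }
        for (int i = 0; i < url.length(); i++) {
            // Characters below 32 are ASCII control characters.
            if (url.charAt(i) < 32) {
                return null;
            }
        }
        return url;
    }

    public static void main(String[] args) {
        System.out.println(filter("http://www.magic-cadeaux.fr/"));      // accepted, printed as-is
        System.out.println(filter("http://example.com/a\tb") == null);   // tab is a control character
    }
}
```

A check like this only guards the URL filters; it does nothing for the parse-js hang, which is triggered by page content rather than by the URL.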


About Heritrix crawl: who can tell me in this Nutch forum? Thanks

2007-11-15 Thread xingjian

A.3. Mirroring .html Files Only in
http://crawler.archive.org/articles/user_manual/usecases.html

..
On the Setting screen, I'll want to set the following for the
NotMatchesFilePatternDecideRule:

decision: REJECT
use-preset-pattern: CUSTOM
regexp: .*(/|\.html)$


..

How do I configure the above in the Submodules of Heritrix? I don't know. Can
anyone help me? Thanks
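
As a side note on what the quoted settings mean, the small sketch below shows which URIs the regexp keeps. It assumes the rule behaves as its name suggests, i.e. with decision REJECT it rejects every URI that does not match the pattern. It uses plain java.util.regex rather than any Heritrix class, and the class name HtmlOnlyPattern is hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical illustration using only java.util.regex; no Heritrix classes involved.
class HtmlOnlyPattern {

    // The regexp from the quoted manual section: URIs ending in "/" or ".html".
    static final Pattern KEEP = Pattern.compile(".*(/|\\.html)$");

    // True if a "not matches, then REJECT" rule would reject this URI.
    static boolean rejected(String uri) {
        return !KEEP.matcher(uri).matches();
    }

    public static void main(String[] args) {
        String[] uris = {
            "http://crawler.archive.org/",
            "http://crawler.archive.org/articles/user_manual/usecases.html",
            "http://example.com/logo.png",
            "http://example.com/page.html?x=1"
        };
        for (String uri : uris) {
            System.out.println((rejected(uri) ? "REJECT " : "keep   ") + uri);
        }
    }
}
```

Under that assumption the first two URIs are kept and the last two are rejected; note that a query string after .html is enough to make a URI fail the pattern.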

-- 
View this message in context: 
http://www.nabble.com/about-heritrix-crawl%2CWho-will-tell-me-in-this-Nutch-forum-thanks-tf4819146.html#a13787379
Sent from the Nutch - Dev mailing list archive at Nabble.com.