Re: dbunfetched URLs - team #32

2015-10-07 Thread Michael Joyce
That doesn't seem like too unreasonable a result count to me if you're running locally. Assuming you're partitioning by host, all of those URLs are on the same host, and you have a 3-second politeness delay, you should end up with a crawl lasting 21497 * 3 / 60 / 60 = 17.9 hours. There's a wiki page on
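
The estimate above can be written as a small helper. This is only a sketch under the same assumptions as the message (every URL on one host, fetched one at a time, a fixed politeness delay); the function name is hypothetical, and it ignores the time each fetch itself takes, so the result is a lower bound.

```python
def estimated_crawl_hours(url_count: int, delay_seconds: float) -> float:
    """Lower bound on crawl time when all URLs share one host.

    Assumes strictly serial fetching with a fixed politeness delay
    between requests; download and parse time are not counted.
    """
    total_seconds = url_count * delay_seconds
    return total_seconds / 60 / 60


# Figures from the message: 21497 unfetched URLs, 3-second delay.
print(round(estimated_crawl_hours(21497, 3), 1))  # 17.9
```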

[jira] [Commented] (NUTCH-2124) redirect following same link again and again , max redirect exceed and went db_gone

2015-10-07 Thread Hudson (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14947573#comment-14947573 ] Hudson commented on NUTCH-2124: SUCCESS: Integrated in Nutch-trunk #3287 (See

Re: Team 18 : Similarity scoring: goldstandard.txt, stopwords.txt contents

2015-10-07 Thread Christian Alan Mattmann
Sujen, can you provide an example on the existing Scoring Similarity wiki page of what the gold standard file should have in it and how it should be formatted? Chris Mattmann, Ph.D. Adjunct Associate Professor, Computer Science Department

[jira] [Commented] (NUTCH-2129) Track Protocol Status in Crawl Datum

2015-10-07 Thread Michael Joyce (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14947002#comment-14947002 ] Michael Joyce commented on NUTCH-2129: Fixed the unnecessary init that [~jnioche] caught. Thanks

Set up protocol-selenium

2015-10-07 Thread Huachao Zhang
Hi guys, I am trying to set up the selenium plugin using this link: https://github.com/apache/nutch/tree/trunk/src/plugin/protocol-selenium
When I execute this command:
    sudo /usr/bin/Xvfb :11 -screen 0 1024x768x24 &
the command line outputs a few lines of "Initializing built-in extension xxx", and

Re: Team 18 : Similarity scoring: goldstandard.txt, stopwords.txt contents

2015-10-07 Thread Sujen Shah
Hi Mithun, The goldstandard.txt is a file against which the parsed text of an HTML page coming from Nutch will be checked. There is no particular format for that file, just plain text. For example, if you were to score pages which were more similar to a topic relating to Robotics, you would want
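
As a rough illustration of checking parsed page text against a plain-text gold standard, here is a minimal sketch. It assumes cosine similarity over raw term counts, whitespace tokenization, and a simple stopword set; none of that is stated in the message, and every name and sample string below is hypothetical rather than the plugin's actual API or data.

```python
import math
from collections import Counter


def term_counts(text, stopwords=frozenset()):
    """Lowercase, split on whitespace, and count the non-stopword terms."""
    return Counter(w for w in text.lower().split() if w not in stopwords)


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Illustrative inputs only; real contents would come from
# goldstandard.txt, stopwords.txt, and the parsed page text.
stopwords = frozenset({"the", "a", "of"})
gold = term_counts("robot arm sensor actuator robot", stopwords)
page = term_counts("the robot uses a sensor", stopwords)

score = cosine_similarity(gold, page)
print(score)
```

A higher score means the page shares more of its vocabulary with the gold standard text; a page with no terms in common scores 0.0.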

[jira] [Resolved] (NUTCH-2124) redirect following same link again and again , max redirect exceed and went db_gone

2015-10-07 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2124. Resolution: Fixed Assignee: Sebastian Nagel Committed to trunk, r1707360. Thanks,