Varying Number of URLS Crawled.

2015-02-12 Thread Nagarjun Pola
Hi Everyone, I started to use Nutch 1.10 for my homework and I see that every time I perform a crawl using the same configuration and same seed urls I get a different number of fetched urls. This occurs even when the old crawl data is deleted. This way I would not be able to identify which URLs
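One way to see which URLs differ between two runs is to compare the fetched-URL lists directly. The sketch below is only an illustration, not part of Nutch. It assumes the fetched URLs of each run have already been exported to plain text files with one URL per line; the helper names `load_urls` and `diff_crawls` are hypothetical.

```python
from pathlib import Path


def load_urls(path):
    """Read one URL per line from a text file, skipping blank lines.

    Assumption: the file was produced by exporting a crawl's fetched URLs.
    """
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def diff_crawls(first, second):
    """Compare two iterables of URLs and report where they differ."""
    a, b = set(first), set(second)
    return {
        "only_in_first": sorted(a - b),
        "only_in_second": sorted(b - a),
        "in_both": sorted(a & b),
    }


if __name__ == "__main__":
    # In-memory stand-ins for two crawl runs (hypothetical URLs).
    run1 = ["http://example.org/", "http://example.org/a"]
    run2 = ["http://example.org/", "http://example.org/b"]
    report = diff_crawls(run1, run2)
    for key, urls in report.items():
        print(key, len(urls))
```

URLs that appear in only one run are the ones worth checking first, for example in the fetcher logs.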

[jira] [Commented] (NUTCH-1730) Scoring-depth optionally not to increment depth for external hosts

2015-02-12 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14317829#comment-14317829 ] Markus Jelsma commented on NUTCH-1730: -- Anything to add to this modification?

[jira] [Resolved] (NUTCH-1939) Fetcher fails to follow redirects

2015-02-12 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-1939. Resolution: Fixed Committed to trunk, v1659227. Thanks, [~leoyey]! Fetcher fails to

[jira] [Commented] (NUTCH-1925) Upgrade Tika to version 1.7

2015-02-12 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14319064#comment-14319064 ] Markus Jelsma commented on NUTCH-1925: -- Yes, I'll check it in tomorrow. Any comments on

Re: Nutch-Selenium in Nutch 1.10

2015-02-12 Thread Shuo Li
I think I have possibly finished installing. What you need to do: 0. git status and checkout what you have modified. 1. patch -p0 < YOUR_PATCH_FILE 2. ant clean jar 3. ant runtime Will try crawling using Selenium later on. Hope this helped. _ On Thu, Feb 12, 2015 at 9:20 AM, Mattmann, Chris A
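The steps in this message can be sketched as a small script. This is a hedged illustration only: the function names `build_steps` and `run_steps` are hypothetical, it assumes `git`, `patch`, and `ant` are on the PATH, and it assumes it is run from the root of a Nutch source checkout. It uses `patch -p0 -i FILE`, which reads the patch from a file instead of standard input.

```python
import subprocess


def build_steps(patch_file):
    """Return the commands from the message, in order, as argument lists."""
    return [
        ["git", "status"],                   # 0. review local modifications first
        ["patch", "-p0", "-i", patch_file],  # 1. apply the patch
        ["ant", "clean", "jar"],             # 2. rebuild the jar
        ["ant", "runtime"],                  # 3. rebuild the runtime directory
    ]


def run_steps(steps, cwd=".", dry_run=True):
    """Print each command and, unless dry_run is set, execute it.

    check=True stops at the first failing step, so a patch that does not
    apply cleanly is never followed by a build.
    """
    for cmd in steps:
        print("$", " ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, cwd=cwd, check=True)


if __name__ == "__main__":
    # YOUR_PATCH_FILE is the placeholder used in the message above.
    run_steps(build_steps("YOUR_PATCH_FILE"), dry_run=True)
```

With `dry_run=True` the script only prints the commands, which is a cheap way to check the order before touching the working copy.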

Re: nutch subscribe

2015-02-12 Thread Tyler Palsulich
Hi, Please send a message to dev-subscr...@nutch.apache.org to subscribe to the list. Tyler On Feb 12, 2015 6:54 PM, Poojan Jhaveri pjhav...@usc.edu wrote:

Re: org.mortbay.proxy package not found in nutch 1.x, Ref Class - ProxyTestbed

2015-02-12 Thread Preetam Pradeepkumar Shingavi
Cool. Issue resolved now. Thanks Sebastian ! On Wed, Feb 11, 2015 at 12:21 PM, Sebastian Nagel wastl.na...@googlemail.com wrote: Hi, the jetty-client-6.1.22.jar is a dependency needed only for testing. Consequently, it's placed in build/test/lib/ but only if you run the tests, resp.

nutch subscribe

2015-02-12 Thread Poojan Jhaveri

[jira] [Commented] (NUTCH-1939) Fetcher fails to follow redirects

2015-02-12 Thread Leo Ye (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14319277#comment-14319277 ] Leo Ye commented on NUTCH-1939: --- Good to see we fixed it. Thank you, [~wastl-nagel]

[jira] [Updated] (NUTCH-1323) AjaxNormalizer

2015-02-12 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1323: - Fix Version/s: (was: 1.11) 1.10 AjaxNormalizer --

[jira] [Resolved] (NUTCH-1323) AjaxNormalizer

2015-02-12 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-1323. -- Resolution: Fixed Just in time for 1.10, Committed to trunk in revision 1659167.

[jira] [Commented] (NUTCH-1921) Optionally parse fetch_not_modified

2015-02-12 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14317826#comment-14317826 ] Markus Jelsma commented on NUTCH-1921: -- Anything to add to this optional setting?

[jira] [Commented] (NUTCH-1684) ParseMeta to be added before fetch schedulers are run

2015-02-12 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14317828#comment-14317828 ] Markus Jelsma commented on NUTCH-1684: -- Anything to add to this? I think this can go

[jira] [Commented] (NUTCH-1913) LinkDB to implement db.ignore.external.links

2015-02-12 Thread Hudson (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14317874#comment-14317874 ] Hudson commented on NUTCH-1913: --- SUCCESS: Integrated in Nutch-trunk #2971 (See

[jira] [Commented] (NUTCH-1323) AjaxNormalizer

2015-02-12 Thread Hudson (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14317873#comment-14317873 ] Hudson commented on NUTCH-1323: --- SUCCESS: Integrated in Nutch-trunk #2971 (See

[jira] [Commented] (NUTCH-1925) Upgrade Tika to version 1.7

2015-02-12 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14317815#comment-14317815 ] Markus Jelsma commented on NUTCH-1925: -- Committed to trunk in revision 1659168.

Re: Nutch-Selenium in Nutch 1.10

2015-02-12 Thread Jiaxin Ye
Hi all, Does anyone here know where to find the setup tutorial for Selenium on Mac? I find it difficult to install Xvfb on Mac. Best, Jiaxin On Tue, Feb 10, 2015 at 9:42 PM, Sapnashri Suresh sapna...@usc.edu wrote: Hi Shuo Li, We were facing a similar issue. Prof. Mattmann suggested we look

Re: Nutch-Selenium in Nutch 1.10

2015-02-12 Thread Shuo Li
Interestingly, I'm a Mac user, but I don't want to mess up my laptop, so I'm using Vagrant with Ubuntu Trusty. It doesn't have a GUI, but Xvfb can still be installed properly. The issue is that I don't know how to integrate Selenium with Nutch 1.10. On Thu, Feb 12, 2015 at 12:04 AM, Jiaxin Ye

[jira] [Commented] (NUTCH-1913) LinkDB to implement db.ignore.external.links

2015-02-12 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14317816#comment-14317816 ] Markus Jelsma commented on NUTCH-1913: -- Thanks Sebastian, committed to trunk in

[jira] [Resolved] (NUTCH-1913) LinkDB to implement db.ignore.external.links

2015-02-12 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-1913. -- Resolution: Fixed LinkDB to implement db.ignore.external.links

Nutch-Selenium in Nutch 1.10

2015-02-12 Thread Jiaxin Ye
Hi Li, Shuo. You are so right. I finished installing and successfully ran Nutch with Selenium and Firefox. I have a question though: does your Firefox always pop up for all the URLs we crawled? Hi Prof Mattmann. I think here is the way we install Selenium on a Mac with OS higher than 10.6 I

Re: Nutch-Selenium in Nutch 1.10

2015-02-12 Thread Mattmann, Chris A (3980)
This is great, Jiaxin, can you please make a wiki page on the Nutch wiki that has this information? ++ Chris Mattmann, Ph.D. Chief Architect Instrument Software and Science Data Systems Section (398) NASA Jet Propulsion Laboratory

Re: Nutch-Selenium in Nutch 1.10

2015-02-12 Thread Jiaxin Ye
Sure. I will do it once I confirm it works... On Thursday, February 12, 2015, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote: This is great, Jiaxin, can you please make a wiki page on the Nutch wiki that has this information?

[jira] [Created] (NUTCH-1942) Remove TopLevelDomain

2015-02-12 Thread Julien Nioche (JIRA)
Julien Nioche created NUTCH-1942: Summary: Remove TopLevelDomain Key: NUTCH-1942 URL: https://issues.apache.org/jira/browse/NUTCH-1942 Project: Nutch Issue Type: Task Reporter:

Re: Nutch-Selenium in Nutch 1.10

2015-02-12 Thread Mattmann, Chris A (3980)
You need Selenium, Jiaxin, in order to crawl dynamic pages in the polar dataset you have been assigned in my CSCI 572 search engines class. The instructions for integrating Selenium with Nutch 1.10-trunk are here: https://issues.apache.org/jira/browse/NUTCH-1933 Cheers, Chris

[jira] [Commented] (NUTCH-1939) Fetcher fails to follow redirects

2015-02-12 Thread Hudson (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14318099#comment-14318099 ] Hudson commented on NUTCH-1939: --- SUCCESS: Integrated in Nutch-trunk #2972 (See

[jira] [Updated] (NUTCH-1925) Upgrade Tika to version 1.7

2015-02-12 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1925: - Attachment: NUTCH-1925-2x.patch Patch for 2.x, it seems to be working. Please confirm. Upgrade

Re: Nutch-Selenium in Nutch 1.10

2015-02-12 Thread Jiaxin Ye
Hi professor, but can we use Selenium on Mac? On Thursday, February 12, 2015, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote: You need Selenium Jiaxin, in order to crawl dynamic pages in the polar dataset you have been assigned in my CSCI 572 search engines class. The

[jira] [Commented] (NUTCH-1942) Remove TopLevelDomain

2015-02-12 Thread Chris A. Mattmann (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14318308#comment-14318308 ] Chris A. Mattmann commented on NUTCH-1942: -- Julien can you tell me more about

Re: Nutch-Selenium in Nutch 1.10

2015-02-12 Thread Mattmann, Chris A (3980)
Yes, I believe you need to install X11. Why don't you try it and report back what you find? Thanks. Sent from my iPhone On Feb 12, 2015, at 8:28 AM, Jiaxin Ye jiaxi...@usc.edu wrote: Hi professor, but can we use Selenium on Mac? On Thursday, February 12, 2015, Mattmann,

[jira] [Updated] (NUTCH-1724) LinkDBReader to support regex output filtering

2015-02-12 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1724: - Attachment: NUTCH-1724-trunk.patch Modified to adhere to Lewis' changes. Will commit shortly