Varying Number of URLS Crawled.
Hi Everyone, I started using Nutch 1.10 for my homework, and every time I run a crawl with the same configuration and the same seed URLs, I get a different number of fetched URLs. This happens even when the old crawl data is deleted first. As a result, I cannot tell which URLs failed to fetch or whether a failure was resolved later. Any suggestions on how to solve this would be a great help. Thank you. Best, Nagarjun Pola University of Southern California
[jira] [Commented] (NUTCH-1730) Scoring-depth optionally not to increment depth for external hosts
[ https://issues.apache.org/jira/browse/NUTCH-1730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14317829#comment-14317829 ] Markus Jelsma commented on NUTCH-1730: -- Anything to add to this modification? Scoring-depth optionally not to increment depth for external hosts -- Key: NUTCH-1730 URL: https://issues.apache.org/jira/browse/NUTCH-1730 Project: Nutch Issue Type: New Feature Affects Versions: 1.7 Reporter: Markus Jelsma Assignee: Markus Jelsma Fix For: 1.11 Attachments: NUTCH-1730-trunk.patch Currently, the plugin always increments depth, even when coming from or going to external hosts. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (NUTCH-1939) Fetcher fails to follow redirects
[ https://issues.apache.org/jira/browse/NUTCH-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-1939. Resolution: Fixed Committed to trunk, v1659227. Thanks, [~leoyey]! Fetcher fails to follow redirects - Key: NUTCH-1939 URL: https://issues.apache.org/jira/browse/NUTCH-1939 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 1.9 Reporter: Sebastian Nagel Fix For: 1.10 Attachments: NUTCH-1939.patch As reported by [~leoyey] in NUTCH-1735 which introduced the regression: with http.redirect.max 0 Fetcher does not follow redirects. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-1925) Upgrade Tika to version 1.7
[ https://issues.apache.org/jira/browse/NUTCH-1925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14319064#comment-14319064 ] Markus Jelsma commented on NUTCH-1925: -- Yes, I'll check it in tomorrow. Any comments on other minor issues on 1.10 before we decide on an RC? Upgrade Tika to version 1.7 --- Key: NUTCH-1925 URL: https://issues.apache.org/jira/browse/NUTCH-1925 Project: Nutch Issue Type: Improvement Components: build Reporter: Tyler Palsulich Assignee: Markus Jelsma Priority: Blocker Fix For: 1.10, 2.3.1 Attachments: NUTCH-1925-2x.patch, NUTCH-1925.palsulich.patch, NUTCH-1925.palsulich.v2.patch Hi Folks. Nutch currently uses version 1.6 of Tika. There were no significant API changes between 1.6 and 1.7. So, this should be a one line update. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: Nutch-Selenium in Nutch 1.10
I think I have possibly finished installing. What you need to do:

0. Run git status and check out whatever you have modified.
1. patch -p0 < YOUR_PATCH_FILE
2. ant clean jar
3. ant runtime

Will try crawling using Selenium later on. Hope this helped.

On Thu, Feb 12, 2015 at 9:20 AM, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote:

Yes, I believe you need to install X11. Why don't you try it and report back what you find? Thanks.

Sent from my iPhone

On Feb 12, 2015, at 8:28 AM, Jiaxin Ye jiaxi...@usc.edu wrote:

Hi professor, but can we use Selenium on Mac?

On Thursday, February 12, 2015, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote:

You need Selenium, Jiaxin, in order to crawl dynamic pages in the polar dataset you have been assigned in my CSCI 572 search engines class. The instructions for integrating Selenium with Nutch 1.10-trunk are here: https://issues.apache.org/jira/browse/NUTCH-1933

Cheers, Chris

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++

-----Original Message-----
From: Jiaxin Ye jiaxi...@usc.edu
Reply-To: dev@nutch.apache.org dev@nutch.apache.org
Date: Thursday, February 12, 2015 at 12:46 AM
To: dev@nutch.apache.org dev@nutch.apache.org
Subject: Re: Nutch-Selenium in Nutch 1.10

Well, good choice. I am thinking of changing to Ubuntu now. The thing is, why do we need Selenium anyway? Just to make crawling easier?

On Thu, Feb 12, 2015 at 12:25 AM, Shuo Li sli...@usc.edu wrote:

Interestingly, I'm a Mac user but I don't want to screw up my laptop, so I'm using Vagrant with Ubuntu Trusty. It doesn't have a GUI, but Xvfb can still be installed properly. The issue is that I don't know how to integrate Selenium with Nutch 1.10.

On Thu, Feb 12, 2015 at 12:04 AM, Jiaxin Ye jiaxi...@usc.edu wrote:

Hi all, does anyone here know where to find the setup tutorial for Selenium on Mac? I find it difficult to install Xvfb on Mac. Best, Jiaxin

On Tue, Feb 10, 2015 at 9:42 PM, Sapnashri Suresh sapna...@usc.edu wrote:

Hi Shuo Li, We were facing a similar issue. Prof. Mattmann suggested we look into this patch for Selenium on Nutch 1.10: https://issues.apache.org/jira/browse/NUTCH-1933. Hope this helps! Thanks, Sapna

On Tue, Feb 10, 2015 at 9:36 PM, Shuo Li sli...@usc.edu wrote:

Yop, I'm trying to install Selenium in Nutch 1.10. However, this error pops up:

error: package org.apache.nutch.storage does not exist

I can only find this package in Nutch 2.x. Is there a way to use Selenium in 1.10? Any advice would be appreciated.

Regards, Shuo Li
--
Graduate Student MS in CS (Data Science)
Viterbi School of Engineering
University of Southern California
Phone: +1 650-307-9848
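The build steps listed in this thread (review local changes, apply the patch, rebuild) can be sketched as a small script. This is only a sketch under assumptions: `NUTCH_HOME`, `PATCH_FILE`, `DRY_RUN` and the helper names are hypothetical and do not come from the thread, and the script defaults to a dry run that prints each command instead of executing it.

```shell
#!/bin/sh
# Sketch of the patch-and-build steps from this thread.
# Assumptions: NUTCH_HOME, PATCH_FILE and DRY_RUN are illustrative names.
NUTCH_HOME="${NUTCH_HOME:-.}"                  # Nutch source checkout
PATCH_FILE="${PATCH_FILE:-YOUR_PATCH_FILE}"    # patch downloaded from the JIRA issue
DRY_RUN="${DRY_RUN:-1}"                        # 1 = only print the commands

# Print a command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# patch reads the diff from standard input, hence the redirect.
apply_patch() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ patch -p0 < $PATCH_FILE"
  else
    patch -p0 < "$PATCH_FILE"
  fi
}

build_with_patch() {
  run cd "$NUTCH_HOME" || return 1
  run git status || return 1       # step 0: look at local modifications first
  apply_patch || return 1          # step 1
  run ant clean jar || return 1    # step 2
  run ant runtime                  # step 3
}

build_with_patch
```

Running it with `DRY_RUN=0` from a shell would execute the commands for real; it stops at the first step that fails.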
Re: nutch subscribe
Hi, Please send a message to dev-subscr...@nutch.apache.org to subscribe to the list. Tyler On Feb 12, 2015 6:54 PM, Poojan Jhaveri pjhav...@usc.edu wrote:
Re: org.mortbay.proxy package not found in nutch 1.x, Ref Class - ProxyTestbed
Cool. Issue resolved now. Thanks, Sebastian!

On Wed, Feb 11, 2015 at 12:21 PM, Sebastian Nagel wastl.na...@googlemail.com wrote:

Hi, the jetty-client-6.1.22.jar is a dependency needed only for testing. Consequently, it is placed in build/test/lib/, but only if you run the tests or call

% ant resolve-test

There is also a target

% ant eclipse

which writes a complete Eclipse project configuration. Sometimes, if dependencies change, you have to run it again. Of course, even with this config you have to run

% ant resolve-default resolve-test

after a clean to copy all dependencies into build/{lib,test/lib}/.

Best, Sebastian

On 02/11/2015 05:00 AM, Preetam Pradeepkumar Shingavi wrote:

Hi, I am trying to configure Nutch 1.x in Eclipse, and I configured the build path to include all jars from the build-lib folder. There is a class, ProxyTestbed.java, which has an error importing the following package:

import org.mortbay.proxy.AsyncProxyServlet; (proxy package not found)

I figured that this class is supposed to load from jetty-6.1.26.jar, but it is not actually present in that jar. Am I missing anything here? Do I need to download another jar? Thanks in advance!
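The Ant targets named in this exchange can be chained as follows. This is a sketch, not project documentation: the `DRY_RUN` switch and the function names are assumptions added for illustration, and by default the script only prints what it would run.

```shell
#!/bin/sh
# Sketch: restore the dependency jars after a clean, then regenerate the
# Eclipse project configuration. Run from the Nutch source directory.
# Assumption: DRY_RUN and the function names are illustrative only.
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands

# Print a command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

refresh_eclipse_setup() {
  # Copies dependencies into build/lib and build/test/lib (per the thread).
  run ant resolve-default resolve-test || return 1
  # Writes the Eclipse project configuration.
  run ant eclipse
}

refresh_eclipse_setup
```

After a real run, the test-only jars mentioned in the reply would be expected under build/test/lib/.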
nutch subscribe
[jira] [Commented] (NUTCH-1939) Fetcher fails to follow redirects
[ https://issues.apache.org/jira/browse/NUTCH-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14319277#comment-14319277 ] Leo Ye commented on NUTCH-1939: --- Good to see we fixed it. Thank you, [~wastl-nagel] Fetcher fails to follow redirects - Key: NUTCH-1939 URL: https://issues.apache.org/jira/browse/NUTCH-1939 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 1.9 Reporter: Sebastian Nagel Fix For: 1.10 Attachments: NUTCH-1939.patch As reported by [~leoyey] in NUTCH-1735 which introduced the regression: with http.redirect.max 0 Fetcher does not follow redirects. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (NUTCH-1323) AjaxNormalizer
[ https://issues.apache.org/jira/browse/NUTCH-1323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1323: - Fix Version/s: (was: 1.11) 1.10 AjaxNormalizer -- Key: NUTCH-1323 URL: https://issues.apache.org/jira/browse/NUTCH-1323 Project: Nutch Issue Type: New Feature Reporter: Markus Jelsma Assignee: Markus Jelsma Fix For: 1.10 Attachments: NUTCH-1323-1.6-1.patch, NUTCH-1323-1.8.patch A two-way normalizer for Nutch able to deal with AJAX URL's, converting them to _escaped_fragment_ URL's and back to an AJAX URL. https://developers.google.com/webmasters/ajax-crawling/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (NUTCH-1323) AjaxNormalizer
[ https://issues.apache.org/jira/browse/NUTCH-1323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-1323. -- Resolution: Fixed Just in time for 1.10, Committed to trunk in revision 1659167. AjaxNormalizer -- Key: NUTCH-1323 URL: https://issues.apache.org/jira/browse/NUTCH-1323 Project: Nutch Issue Type: New Feature Reporter: Markus Jelsma Assignee: Markus Jelsma Fix For: 1.10 Attachments: NUTCH-1323-1.6-1.patch, NUTCH-1323-1.8.patch A two-way normalizer for Nutch able to deal with AJAX URL's, converting them to _escaped_fragment_ URL's and back to an AJAX URL. https://developers.google.com/webmasters/ajax-crawling/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-1921) Optionally parse fetch_not_modified
[ https://issues.apache.org/jira/browse/NUTCH-1921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14317826#comment-14317826 ] Markus Jelsma commented on NUTCH-1921: -- Anything to add to this optional setting? Optionally parse fetch_not_modified --- Key: NUTCH-1921 URL: https://issues.apache.org/jira/browse/NUTCH-1921 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 1.9 Reporter: Markus Jelsma Assignee: Markus Jelsma Fix For: 1.11 Attachments: NUTCH-1921-trunk.patch Records with fetch_not_modified are not parsed, are not passed through parse filters and index filters, and are not indexed. This is a huge problem if you modified a parse filter, an indexing filter, or any other behaviour in the pipeline, because the changes never show up in the index. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-1684) ParseMeta to be added before fetch schedulers are run
[ https://issues.apache.org/jira/browse/NUTCH-1684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14317828#comment-14317828 ] Markus Jelsma commented on NUTCH-1684: -- Anything to add to this? I think this can go in. ParseMeta to be added before fetch schedulers are run - Key: NUTCH-1684 URL: https://issues.apache.org/jira/browse/NUTCH-1684 Project: Nutch Issue Type: Improvement Components: crawldb Reporter: Markus Jelsma Assignee: Markus Jelsma Fix For: 1.11 Attachments: NUTCH-1684-trunk.patch FetchSchedulers cannot operate on parseMeta in the CrawlDatum because it is added after the schedulers have run. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-1913) LinkDB to implement db.ignore.external.links
[ https://issues.apache.org/jira/browse/NUTCH-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14317874#comment-14317874 ] Hudson commented on NUTCH-1913: --- SUCCESS: Integrated in Nutch-trunk #2971 (See [https://builds.apache.org/job/Nutch-trunk/2971/]) NUTCH-1913 LinkDB to implement db.ignore.external.links (markus: http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1659169) * /nutch/trunk/CHANGES.txt * /nutch/trunk/src/java/org/apache/nutch/crawl/LinkDb.java LinkDB to implement db.ignore.external.links Key: NUTCH-1913 URL: https://issues.apache.org/jira/browse/NUTCH-1913 Project: Nutch Issue Type: New Feature Components: linkdb Affects Versions: 1.9 Reporter: Markus Jelsma Assignee: Markus Jelsma Priority: Trivial Fix For: 1.11 Attachments: NUTCH-1913-trunk-v2.patch, NUTCH-1913-trunk.patch LinkDB needs an option to ignore external links. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-1323) AjaxNormalizer
[ https://issues.apache.org/jira/browse/NUTCH-1323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14317873#comment-14317873 ] Hudson commented on NUTCH-1323: --- SUCCESS: Integrated in Nutch-trunk #2971 (See [https://builds.apache.org/job/Nutch-trunk/2971/]) NUTCH-1323 AjaxNormalizer (markus: http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1659167) * /nutch/trunk/CHANGES.txt * /nutch/trunk/src/plugin/build.xml * /nutch/trunk/src/plugin/urlnormalizer-ajax * /nutch/trunk/src/plugin/urlnormalizer-ajax/build.xml * /nutch/trunk/src/plugin/urlnormalizer-ajax/ivy.xml * /nutch/trunk/src/plugin/urlnormalizer-ajax/plugin.xml * /nutch/trunk/src/plugin/urlnormalizer-ajax/src * /nutch/trunk/src/plugin/urlnormalizer-ajax/src/java * /nutch/trunk/src/plugin/urlnormalizer-ajax/src/java/org * /nutch/trunk/src/plugin/urlnormalizer-ajax/src/java/org/apache * /nutch/trunk/src/plugin/urlnormalizer-ajax/src/java/org/apache/nutch * /nutch/trunk/src/plugin/urlnormalizer-ajax/src/java/org/apache/nutch/net * /nutch/trunk/src/plugin/urlnormalizer-ajax/src/java/org/apache/nutch/net/urlnormalizer * /nutch/trunk/src/plugin/urlnormalizer-ajax/src/java/org/apache/nutch/net/urlnormalizer/ajax * /nutch/trunk/src/plugin/urlnormalizer-ajax/src/java/org/apache/nutch/net/urlnormalizer/ajax/AjaxURLNormalizer.java * /nutch/trunk/src/plugin/urlnormalizer-ajax/src/test * /nutch/trunk/src/plugin/urlnormalizer-ajax/src/test/org * /nutch/trunk/src/plugin/urlnormalizer-ajax/src/test/org/apache * /nutch/trunk/src/plugin/urlnormalizer-ajax/src/test/org/apache/nutch * /nutch/trunk/src/plugin/urlnormalizer-ajax/src/test/org/apache/nutch/net * /nutch/trunk/src/plugin/urlnormalizer-ajax/src/test/org/apache/nutch/net/urlnormalizer * /nutch/trunk/src/plugin/urlnormalizer-ajax/src/test/org/apache/nutch/net/urlnormalizer/ajax * /nutch/trunk/src/plugin/urlnormalizer-ajax/src/test/org/apache/nutch/net/urlnormalizer/ajax/TestAjaxURLNormalizer.java AjaxNormalizer -- Key: 
NUTCH-1323 URL: https://issues.apache.org/jira/browse/NUTCH-1323 Project: Nutch Issue Type: New Feature Reporter: Markus Jelsma Assignee: Markus Jelsma Fix For: 1.10 Attachments: NUTCH-1323-1.6-1.patch, NUTCH-1323-1.8.patch A two-way normalizer for Nutch able to deal with AJAX URL's, converting them to _escaped_fragment_ URL's and back to an AJAX URL. https://developers.google.com/webmasters/ajax-crawling/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-1925) Upgrade Tika to version 1.7
[ https://issues.apache.org/jira/browse/NUTCH-1925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14317815#comment-14317815 ] Markus Jelsma commented on NUTCH-1925: -- Committed to trunk in revision 1659168. Upgrade Tika to version 1.7 --- Key: NUTCH-1925 URL: https://issues.apache.org/jira/browse/NUTCH-1925 Project: Nutch Issue Type: Improvement Components: build Reporter: Tyler Palsulich Assignee: Markus Jelsma Priority: Blocker Fix For: 1.10, 2.3.1 Attachments: NUTCH-1925.palsulich.patch, NUTCH-1925.palsulich.v2.patch Hi Folks. Nutch currently uses version 1.6 of Tika. There were no significant API changes between 1.6 and 1.7. So, this should be a one line update. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: Nutch-Selenium in Nutch 1.10
Hi all, Anyone here knows where to find the setup tutorial for Selenium on Mac ?? I find it difficult to install Xvfb on mac. Best, Jiaxin On Tue, Feb 10, 2015 at 9:42 PM, Sapnashri Suresh sapna...@usc.edu wrote: Hi Shuo Li, We were facing a similar issue. Prof. Mattman suggested we look into this patch for Selenium on Nutch 1.10 : https://issues.apache.org/jira/browse/NUTCH-1933. Hope this helps! Thanks, Sapna On Tue, Feb 10, 2015 at 9:36 PM, Shuo Li sli...@usc.edu wrote: Yop, I'm trying to install selenium in Nutch 1.10. However, this error pops out: *error: package org.apache.nutch.storage does not exist* I can only find this package in Nutch 2.x. Is there a way to use Selenium in 1.10? Any advice would be appreciated. Regards, Shuo Li -- Graduate Student MS in CS (Data Science) Viterbi School of Engineering University of Southern California Phone: +1 650-307-9848
Re: Nutch-Selenium in Nutch 1.10
Interestingly, I'm a mac user but I don't want to screw my laptop so I'm using vagrant with Ubuntu Trusty. It doesn't have GUI but Xvfb can still be installed properly. The issue would be I don't know how to integrate Selenium with Nutch 1.10. On Thu, Feb 12, 2015 at 12:04 AM, Jiaxin Ye jiaxi...@usc.edu wrote: Hi all, Anyone here knows where to find the setup tutorial for Selenium on Mac ?? I find it difficult to install Xvfb on mac. Best, Jiaxin On Tue, Feb 10, 2015 at 9:42 PM, Sapnashri Suresh sapna...@usc.edu wrote: Hi Shuo Li, We were facing a similar issue. Prof. Mattman suggested we look into this patch for Selenium on Nutch 1.10 : https://issues.apache.org/jira/browse/NUTCH-1933. Hope this helps! Thanks, Sapna On Tue, Feb 10, 2015 at 9:36 PM, Shuo Li sli...@usc.edu wrote: Yop, I'm trying to install selenium in Nutch 1.10. However, this error pops out: *error: package org.apache.nutch.storage does not exist* I can only find this package in Nutch 2.x. Is there a way to use Selenium in 1.10? Any advice would be appreciated. Regards, Shuo Li -- Graduate Student MS in CS (Data Science) Viterbi School of Engineering University of Southern California Phone: +1 650-307-9848
[jira] [Commented] (NUTCH-1913) LinkDB to implement db.ignore.external.links
[ https://issues.apache.org/jira/browse/NUTCH-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14317816#comment-14317816 ] Markus Jelsma commented on NUTCH-1913: -- Thanks Sebastian, committed to trunk in revision 1659169! LinkDB to implement db.ignore.external.links Key: NUTCH-1913 URL: https://issues.apache.org/jira/browse/NUTCH-1913 Project: Nutch Issue Type: New Feature Components: linkdb Affects Versions: 1.9 Reporter: Markus Jelsma Assignee: Markus Jelsma Priority: Trivial Fix For: 1.11 Attachments: NUTCH-1913-trunk-v2.patch, NUTCH-1913-trunk.patch LinkDB needs an option to ignore external links. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (NUTCH-1913) LinkDB to implement db.ignore.external.links
[ https://issues.apache.org/jira/browse/NUTCH-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-1913. -- Resolution: Fixed LinkDB to implement db.ignore.external.links Key: NUTCH-1913 URL: https://issues.apache.org/jira/browse/NUTCH-1913 Project: Nutch Issue Type: New Feature Components: linkdb Affects Versions: 1.9 Reporter: Markus Jelsma Assignee: Markus Jelsma Priority: Trivial Fix For: 1.11 Attachments: NUTCH-1913-trunk-v2.patch, NUTCH-1913-trunk.patch LinkDB needs an option to ignore external links. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Nutch-Selenium in Nutch 1.10
Hi Li, Shuo. You are so right. I finished installing and successfully ran Nutch with Selenium and Firefox. I have a question though: does your Firefox pop up for every URL we crawl?

Hi Prof. Mattmann. I think this is the way to install Selenium on a Mac with OS X higher than 10.6:

1. Download XQuartz. It's a dmg file; install it directly.
2. Download Nutch 1.10.
3. Download the patch and put it in the Nutch project directory.
4. patch -p0 < THE_PATCH_NAME
5. DO NOT update build.xml and ivy.xml as the Selenium tutorial on GitHub tells you to. The patch basically updates those .xml files for us. The patch also installs lib-selenium and protocol-selenium for us (correct me if I am wrong).
6. Update the Tika dependency if needed.
7. Go to the Nutch project directory and run ant runtime.
8. Download Firefox.
9. Open a new terminal and type xvfb -screen scrn 1024x758x34 (I think you can set it smaller if you want...). There should be some errors after entering the command (for me at least). Manually create a /tmp/.X11-unix folder with sudo, and then set its mode to 1777. Rerun the command. xvfb should be working.
10. Go to the Nutch runtime/local directory and run the crawl command.

Hope it helps. :)

Best, Jiaxin

On Thu, Feb 12, 2015 at 1:08 PM, Shuo Li sli...@usc.edu wrote:

I think I have possibly finished installing. What you need to do:

0. Run git status and check out whatever you have modified.
1. patch -p0 < YOUR_PATCH_FILE
2. ant clean jar
3. ant runtime

Will try crawling using Selenium later on. Hope this helped.

On Thu, Feb 12, 2015 at 9:20 AM, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote:

Yes, I believe you need to install X11. Why don't you try it and report back what you find? Thanks.
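The Xvfb step described in this thread (create the X11 socket directory, set its mode to 1777, then start the virtual display) might look like the sketch below. Several details are assumptions rather than facts from the thread: the binary is spelled `Xvfb`, and the display number `:99`, screen number `0` and geometry `1024x768x24` are illustrative values only. The `DRY_RUN` switch and function names are hypothetical, and by default the script only prints the commands.

```shell
#!/bin/sh
# Sketch of the virtual-display step from the install notes in this thread.
# Assumptions: XVFB_DISPLAY, SCREEN_GEOMETRY and DRY_RUN are illustrative;
# the thread itself gives no display number and a different geometry string.
XVFB_DISPLAY="${XVFB_DISPLAY:-:99}"
SCREEN_GEOMETRY="${SCREEN_GEOMETRY:-1024x768x24}"   # width x height x depth
DRY_RUN="${DRY_RUN:-1}"                             # 1 = only print the commands

# Print a command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

start_xvfb() {
  # The thread reports errors until this directory exists with mode 1777.
  run sudo mkdir -p /tmp/.X11-unix || return 1
  run sudo chmod 1777 /tmp/.X11-unix || return 1
  # Runs in the foreground; use a separate terminal, as the notes suggest.
  run Xvfb "$XVFB_DISPLAY" -screen 0 "$SCREEN_GEOMETRY"
}

start_xvfb
```

A crawl started from another terminal would presumably need `DISPLAY` set to the same display number so that Firefox can find the virtual screen; that is an assumption, not something the thread states.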
Re: Nutch-Selenium in Nutch 1.10
This is great, Jiaxin, can you please make a wiki page on the Nutch wiki that has this information?

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++

-----Original Message-----
From: Jiaxin Ye jiaxi...@usc.edu
Reply-To: dev@nutch.apache.org dev@nutch.apache.org
Date: Thursday, February 12, 2015 at 9:39 PM
To: dev@nutch.apache.org dev@nutch.apache.org
Subject: Nutch-Selenium in Nutch 1.10

Hi Li, Shuo. You are so right. I finished installing and successfully ran Nutch with Selenium and Firefox. I have a question though: does your Firefox pop up for every URL we crawl?

Hi Prof. Mattmann. I think this is the way to install Selenium on a Mac with OS X higher than 10.6:

1. Download XQuartz. It's a dmg file; install it directly.
2. Download Nutch 1.10.
3. Download the patch and put it in the Nutch project directory.
4. patch -p0 < THE_PATCH_NAME
5. DO NOT update build.xml and ivy.xml as the Selenium tutorial on GitHub tells you to. The patch basically updates those .xml files for us. The patch also installs lib-selenium and protocol-selenium for us (correct me if I am wrong).
6. Update the Tika dependency if needed.
7. Go to the Nutch project directory and run ant runtime.
8. Download Firefox.
9. Open a new terminal and type xvfb -screen scrn 1024x758x34 (I think you can set it smaller if you want...). There should be some errors after entering the command (for me at least). Manually create a /tmp/.X11-unix folder with sudo, and then set its mode to 1777. Rerun the command. xvfb should be working.
10. Go to the Nutch runtime/local directory and run the crawl command.

Hope it helps.
Re: Nutch-Selenium in Nutch 1.10
Sure. I will do it once I confirm it works...

On Thursday, February 12, 2015, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote:

This is great, Jiaxin, can you please make a wiki page on the Nutch wiki that has this information?

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++
[jira] [Created] (NUTCH-1942) Remove TopLevelDomain
Julien Nioche created NUTCH-1942: Summary: Remove TopLevelDomain Key: NUTCH-1942 URL: https://issues.apache.org/jira/browse/NUTCH-1942 Project: Nutch Issue Type: Task Reporter: Julien Nioche Priority: Minor Fix For: 1.11 We should leverage the domain related utilities from crawler-commons instead of duplicating them in the `org.apache.nutch.util.domain` package. For instance we could deprecate TopLevelDomain and call the corresponding class in CC instead. The resources in CC are more up-to-date and it is less code to maintain. This would be a good task for someone willing to get to know the Nutch codebase better and impress us all with the extent of his/her skills. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: Nutch-Selenium in Nutch 1.10
You need Selenium Jiaxin, in order to crawl dynamic pages in the polar dataset you have been assigned in my CSCI 572 search engines class. The instructions for integrating Selenium with Nutch 1.10-trunk are here: https://issues.apache.org/jira/browse/NUTCH-1933 Cheers, Chris ++ Chris Mattmann, Ph.D. Chief Architect Instrument Software and Science Data Systems Section (398) NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 168-519, Mailstop: 168-527 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Associate Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++ -Original Message- From: Jiaxin Ye jiaxi...@usc.edu Reply-To: dev@nutch.apache.org dev@nutch.apache.org Date: Thursday, February 12, 2015 at 12:46 AM To: dev@nutch.apache.org dev@nutch.apache.org Subject: Re: Nutch-Selenium in Nutch 1.10 Well, good choice. I am thinking of changing to Ubuntu now. The thing is why do we need Selenium anyway? Just easier to perform crawling? On Thu, Feb 12, 2015 at 12:25 AM, Shuo Li sli...@usc.edu wrote: Interestingly, I'm a Mac user but I don't want to screw my laptop so I'm using vagrant with Ubuntu Trusty. It doesn't have a GUI but Xvfb can still be installed properly. The issue would be I don't know how to integrate Selenium with Nutch 1.10. On Thu, Feb 12, 2015 at 12:04 AM, Jiaxin Ye jiaxi...@usc.edu wrote: Hi all, Does anyone here know where to find the setup tutorial for Selenium on Mac? I find it difficult to install Xvfb on Mac. Best, Jiaxin On Tue, Feb 10, 2015 at 9:42 PM, Sapnashri Suresh sapna...@usc.edu wrote: Hi Shuo Li, We were facing a similar issue. Prof. Mattmann suggested we look into this patch for Selenium on Nutch 1.10: https://issues.apache.org/jira/browse/NUTCH-1933. Hope this helps! Thanks, Sapna On Tue, Feb 10, 2015 at 9:36 PM, Shuo Li sli...@usc.edu wrote: Yop, I'm trying to install Selenium in Nutch 1.10. 
However, this error pops out: error: package org.apache.nutch.storage does not exist I can only find this package in Nutch 2.x. Is there a way to use Selenium in 1.10? Any advice would be appreciated. Regards, Shuo Li -- Graduate Student MS in CS (Data Science) Viterbi School of Engineering University of Southern California Phone: +1 650-307-9848
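The compile error quoted at the end of this message may indicate that the Selenium code being built was written against Nutch 2.x, since the message notes that `org.apache.nutch.storage` can only be found there. A quick check before applying a patch or copying in plugin sources, sketched here with a hypothetical helper name `mentions_2x_storage`, is to search the file for that package:

```shell
#!/bin/sh
# Hypothetical helper for illustration; it is not part of Nutch.
# Succeeds (exit status 0) if the given file mentions the
# org.apache.nutch.storage package, which the thread reports as 2.x-only.
mentions_2x_storage() {
    grep -q 'org\.apache\.nutch\.storage' "$1"
}
```

For example, `mentions_2x_storage YOUR_PATCH_FILE && echo "references 2.x-only code"` prints the warning only when the package name appears in the file.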
[jira] [Commented] (NUTCH-1939) Fetcher fails to follow redirects
[ https://issues.apache.org/jira/browse/NUTCH-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14318099#comment-14318099 ] Hudson commented on NUTCH-1939: --- SUCCESS: Integrated in Nutch-trunk #2972 (See [https://builds.apache.org/job/Nutch-trunk/2972/]) NUTCH-1939 Fetcher fails to follow redirects (snagel: http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1659227) * /nutch/trunk/CHANGES.txt * /nutch/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java Fetcher fails to follow redirects - Key: NUTCH-1939 URL: https://issues.apache.org/jira/browse/NUTCH-1939 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 1.9 Reporter: Sebastian Nagel Fix For: 1.10 Attachments: NUTCH-1939.patch As reported by [~leoyey] in NUTCH-1735 which introduced the regression: with http.redirect.max > 0 Fetcher does not follow redirects. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (NUTCH-1925) Upgrade Tika to version 1.7
[ https://issues.apache.org/jira/browse/NUTCH-1925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1925: - Attachment: NUTCH-1925-2x.patch Patch for 2.x, it seems to be working. Please confirm. Upgrade Tika to version 1.7 --- Key: NUTCH-1925 URL: https://issues.apache.org/jira/browse/NUTCH-1925 Project: Nutch Issue Type: Improvement Components: build Reporter: Tyler Palsulich Assignee: Markus Jelsma Priority: Blocker Fix For: 1.10, 2.3.1 Attachments: NUTCH-1925-2x.patch, NUTCH-1925.palsulich.patch, NUTCH-1925.palsulich.v2.patch Hi Folks. Nutch currently uses version 1.6 of Tika. There were no significant API changes between 1.6 and 1.7. So, this should be a one line update. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: Nutch-Selenium in Nutch 1.10
Hi professor, but can we use Selenium on Mac? On Thursday, February 12, 2015, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote: You need Selenium Jiaxin, in order to crawl dynamic pages in the polar dataset you have been assigned in my CSCI 572 search engines class. The instructions for integrating Selenium with Nutch 1.10-trunk are here: https://issues.apache.org/jira/browse/NUTCH-1933 Cheers, Chris ++ Chris Mattmann, Ph.D. Chief Architect Instrument Software and Science Data Systems Section (398) NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 168-519, Mailstop: 168-527 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Associate Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++ -Original Message- From: Jiaxin Ye jiaxi...@usc.edu Reply-To: dev@nutch.apache.org dev@nutch.apache.org Date: Thursday, February 12, 2015 at 12:46 AM To: dev@nutch.apache.org dev@nutch.apache.org Subject: Re: Nutch-Selenium in Nutch 1.10 Well, good choice. I am thinking of changing to Ubuntu now. The thing is why do we need Selenium anyway? Just easier to perform crawling? On Thu, Feb 12, 2015 at 12:25 AM, Shuo Li sli...@usc.edu wrote: Interestingly, I'm a Mac user but I don't want to screw my laptop so I'm using vagrant with Ubuntu Trusty. It doesn't have a GUI but Xvfb can still be installed properly. The issue would be I don't know how to integrate Selenium with Nutch 1.10. On Thu, Feb 12, 2015 at 12:04 AM, Jiaxin Ye jiaxi...@usc.edu wrote: Hi all, Does anyone here know where to find the setup tutorial for Selenium on Mac? I find it difficult to install Xvfb on Mac. Best, Jiaxin On Tue, Feb 10, 2015 at 9:42 PM, Sapnashri Suresh sapna...@usc.edu wrote: Hi Shuo Li, We were facing a similar issue. Prof. 
Mattmann suggested we look into this patch for Selenium on Nutch 1.10: https://issues.apache.org/jira/browse/NUTCH-1933. Hope this helps! Thanks, Sapna On Tue, Feb 10, 2015 at 9:36 PM, Shuo Li sli...@usc.edu wrote: Yop, I'm trying to install Selenium in Nutch 1.10. However, this error pops out: error: package org.apache.nutch.storage does not exist I can only find this package in Nutch 2.x. Is there a way to use Selenium in 1.10? Any advice would be appreciated. Regards, Shuo Li -- Graduate Student MS in CS (Data Science) Viterbi School of Engineering University of Southern California Phone: +1 650-307-9848
[jira] [Commented] (NUTCH-1942) Remove TopLevelDomain
[ https://issues.apache.org/jira/browse/NUTCH-1942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14318308#comment-14318308 ] Chris A. Mattmann commented on NUTCH-1942: -- Julien can you tell me more about crawler-commons? You are part of that project, right? Remove TopLevelDomain -- Key: NUTCH-1942 URL: https://issues.apache.org/jira/browse/NUTCH-1942 Project: Nutch Issue Type: Task Reporter: Julien Nioche Priority: Minor Labels: newbie Fix For: 1.11 We should leverage the domain related utilities from crawler-commons instead of duplicating them in the `org.apache.nutch.util.domain` package. For instance we could deprecate TopLevelDomain and call the corresponding class in CC instead. The resources in CC are more up-to-date and it is less code to maintain. This would be a good task for someone willing to get to know the Nutch codebase better and impress us all with the extent of his/her skills. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: Nutch-Selenium in Nutch 1.10
Yes I believe you need to install X11 - why don't you try and report back what you find thanks. Sent from my iPhone On Feb 12, 2015, at 8:28 AM, Jiaxin Ye jiaxi...@usc.edu wrote: Hi professor, but can we use Selenium on Mac? On Thursday, February 12, 2015, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote: You need Selenium Jiaxin, in order to crawl dynamic pages in the polar dataset you have been assigned in my CSCI 572 search engines class. The instructions for integrating Selenium with Nutch 1.10-trunk are here: https://issues.apache.org/jira/browse/NUTCH-1933 Cheers, Chris ++ Chris Mattmann, Ph.D. Chief Architect Instrument Software and Science Data Systems Section (398) NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 168-519, Mailstop: 168-527 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Associate Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++ -Original Message- From: Jiaxin Ye jiaxi...@usc.edu Reply-To: dev@nutch.apache.org dev@nutch.apache.org Date: Thursday, February 12, 2015 at 12:46 AM To: dev@nutch.apache.org dev@nutch.apache.org Subject: Re: Nutch-Selenium in Nutch 1.10 Well, good choice. I am thinking of changing to Ubuntu now. The thing is why do we need Selenium anyway? Just easier to perform crawling? On Thu, Feb 12, 2015 at 12:25 AM, Shuo Li sli...@usc.edu wrote: Interestingly, I'm a Mac user but I don't want to screw my laptop so I'm using vagrant with Ubuntu Trusty. It doesn't have a GUI but Xvfb can still be installed properly. The issue would be I don't know how to integrate Selenium with Nutch 1.10. On Thu, Feb 12, 2015 at 12:04 AM, Jiaxin Ye jiaxi...@usc.edu wrote: Hi all, Does anyone here know where to find the setup tutorial for Selenium on Mac? 
I find it difficult to install Xvfb on Mac. Best, Jiaxin On Tue, Feb 10, 2015 at 9:42 PM, Sapnashri Suresh sapna...@usc.edu wrote: Hi Shuo Li, We were facing a similar issue. Prof. Mattmann suggested we look into this patch for Selenium on Nutch 1.10: https://issues.apache.org/jira/browse/NUTCH-1933. Hope this helps! Thanks, Sapna On Tue, Feb 10, 2015 at 9:36 PM, Shuo Li sli...@usc.edu wrote: Yop, I'm trying to install Selenium in Nutch 1.10. However, this error pops out: error: package org.apache.nutch.storage does not exist I can only find this package in Nutch 2.x. Is there a way to use Selenium in 1.10? Any advice would be appreciated. Regards, Shuo Li -- Graduate Student MS in CS (Data Science) Viterbi School of Engineering University of Southern California Phone: +1 650-307-9848
[jira] [Updated] (NUTCH-1724) LinkDBReader to support regex output filtering
[ https://issues.apache.org/jira/browse/NUTCH-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1724: - Attachment: NUTCH-1724-trunk.patch Modified to adhere to Lewis' changes. Will commit shortly unless objected to. LinkDBReader to support regex output filtering -- Key: NUTCH-1724 URL: https://issues.apache.org/jira/browse/NUTCH-1724 Project: Nutch Issue Type: Improvement Reporter: Markus Jelsma Assignee: Markus Jelsma Fix For: 1.11 Attachments: NUTCH-1724-trunk.patch, NUTCH-1724-trunk.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)