Re: linkdb/current/part-00000/data does not exist
I was using ./bin/crawl and not incremental crawling at that time. This file appears after I start crawling *.gif, *.jpg, *.mov, etc. I will provide more information if I can reproduce this error. Thanks =)

On Sun, Feb 22, 2015 at 4:47 PM, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote:

What command are you using to crawl? Are you using bin/crawl, and/or doing incremental crawling?

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: Shuo Li sli...@usc.edu
Reply-To: dev@nutch.apache.org
Date: Friday, February 20, 2015 at 3:26 PM
To: dev@nutch.apache.org
Subject: linkdb/current/part-00000/data does not exist

Hi,

I'm trying to crawl NSF ACADIS with nutch-selenium. I ran into a problem: linkdb/current/part-00000/data does not exist. I checked my directory and my files during crawling, and it appears this file sometimes exists and sometimes disappears, which is quite weird. Another problem is that when we crawl NSIDC ADE, it gives us a 403 Forbidden error. Does this mean NSIDC ADE is blocking us? The log of the first error is at the bottom of this email. Any help would be appreciated.

Regards,
Shuo Li

LinkDb: merging with existing linkdb: nsfacadis3Crawl/linkdb
LinkDb: java.io.FileNotFoundException: File file:/vagrant/nutch/runtime/local/nsfacadis3Crawl/linkdb/current/part-00000/data does not exist.
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:402)
    at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:255)
    at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:47)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
    at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1081)
    at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1073)
    at org.apache.hadoop.mapred.JobClient.access$700(JobClient.java:179)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:983)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
    at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:208)
    at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:316)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:276)
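One way to work around the vanishing part file (a sketch only, not a fix for the underlying cause): before re-running the crawl, detect a linkdb whose Hadoop part file is missing and delete it, so the next invertlinks pass rebuilds it from the segments. "demo_crawl" below is a stand-in for the real crawl directory (e.g. nsfacadis3Crawl in the trace above).

```shell
# Sketch: guard against a linkdb whose part-00000/data file has vanished.
CRAWL_DIR="demo_crawl"

# Demo setup reproducing the broken state: the part directory exists,
# but the data file inside it does not.
mkdir -p "$CRAWL_DIR/linkdb/current/part-00000"

if [ -d "$CRAWL_DIR/linkdb" ] && [ ! -e "$CRAWL_DIR/linkdb/current/part-00000/data" ]; then
  echo "stale linkdb detected -- removing $CRAWL_DIR/linkdb"
  rm -rf "$CRAWL_DIR/linkdb"
  # ...then re-run "bin/nutch invertlinks" (or bin/crawl) to rebuild it.
fi
```

LinkDb.invert only fails this way when merging with an existing linkdb, so removing the stale one lets the merge step start clean.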
Re: Problem installing Selenium on Ubuntu with Nutch trunk 1.10
Yop,

Here's a correct ivy.xml. I think there may be some mistakes when we install the patch: it generates some duplicate <?xml?> tags, which you may need to delete manually. If anybody could provide a complete tutorial or a correct patch, that'd be great.

PS0: I didn't read the whole conversation. I hope this helped.
PS1: Please remove all the lines in that patch about ivy.xml and replace with the attachment.

Regards,
Shuo Li

On Sat, Feb 21, 2015 at 11:43 AM, Nikunj Gala nikun...@usc.edu wrote:

Hey, you are correct. I see failures while patching ivy.xml on the latest GitHub Nutch trunk. The patch logs are as follows:

---
patching file build.xml
patching file ivy/ivy.xml
Hunk #3 FAILED at 59.
1 out of 3 hunks FAILED -- saving rejects to file ivy/ivy.xml.rej
patching file src/plugin/build.xml
Hunk #2 succeeded at 148 (offset 2 lines).
patching file src/plugin/lib-selenium/build.xml
patching file src/plugin/lib-selenium/ivy.xml
patching file src/plugin/lib-selenium/plugin.xml
patching file src/plugin/lib-selenium/src/java/org/apache/nutch/protocol/selenium/HttpWebClient.java
patching file src/plugin/lib-selenium/src/pom.xml
patching file src/plugin/protocol-selenium/.idea/.name
patching file src/plugin/protocol-selenium/.idea/compiler.xml
patching file src/plugin/protocol-selenium/.idea/copyright/profiles_settings.xml
patching file src/plugin/protocol-selenium/.idea/encodings.xml
patching file src/plugin/protocol-selenium/.idea/misc.xml
patching file src/plugin/protocol-selenium/.idea/modules.xml
patching file src/plugin/protocol-selenium/.idea/scopes/scope_settings.xml
patching file src/plugin/protocol-selenium/.idea/vcs.xml
patching file src/plugin/protocol-selenium/.idea/workspace.xml
patching file src/plugin/protocol-selenium/build.xml
patching file src/plugin/protocol-selenium/ivy.xml
patching file src/plugin/protocol-selenium/plugin.xml
patching file src/plugin/protocol-selenium/src/java/org/apache/nutch/protocol/selenium/Http.java
patching file src/plugin/protocol-selenium/src/java/org/apache/nutch/protocol/selenium/HttpResponse.java
patching file src/plugin/protocol-selenium/src/java/org/apache/nutch/protocol/selenium/package.html
patching file src/plugin/protocol-selenium/src/pom.xml
patching file src/plugin/protocol-selenium/src/target/classes/org/apache/nutch/protocol/htmlunit/package.html
---

Trying to understand and fix the patch now. Has anybody else made any changes to the patch?

<?xml version="1.0"?>
<!--
   Licensed to the Apache Software Foundation (ASF) under one or more
   contributor license agreements. See the NOTICE file distributed with
   this work for additional information regarding copyright ownership.
   The ASF licenses this file to You under the Apache License, Version 2.0
   (the "License"); you may not use this file except in compliance with
   the License. You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
-->
<ivy-module version="1.0">
  <info organisation="org.apache.nutch" module="nutch">
    <license name="Apache 2.0" url="http://www.apache.org/licenses/LICENSE-2.0.txt" />
    <ivyauthor name="Apache Nutch Team" url="http://nutch.apache.org" />
    <description homepage="http://nutch.apache.org">Nutch is an open source web-search
      software. It builds on Hadoop, Tika and Solr, adding web-specifics, such as a
      crawler, a link-graph database etc.
    </description>
  </info>

  <configurations>
    <include file="${basedir}/ivy/ivy-configurations.xml" />
  </configurations>

  <publications>
    <!--get the artifact from our module name-->
    <artifact conf="master" />
  </publications>

  <dependencies>
    <dependency org="org.slf4j" name="slf4j-api" rev="1.6.1" conf="*->master" />
    <dependency org="org.slf4j" name="slf4j-log4j12" rev="1.6.1" conf="*->master" />
    <dependency org="log4j" name="log4j" rev="1.2.15" conf="*->master" />
    <dependency org="commons-lang" name="commons-lang" rev="2.6" conf="*->default" />
    <dependency org="commons-collections" name="commons-collections" rev="3.1" conf="*->default" />
    <dependency org="commons-httpclient" name="commons-httpclient" rev="3.1" conf="*->master" />
    <dependency org="commons-codec" name="commons-codec" rev="1.3" conf="*->default" />
    <dependency org="org.apache.hadoop" name="hadoop-core" rev="1.2.0" conf="*->default">
      <exclude org="hsqldb" name="hsqldb" />
      <exclude org="net.sf.kosmosfs" name="kfs" />
      <exclude org="net.java.dev.jets3t" name="jets3t" />
      <exclude org
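For anyone hitting the same ivy/ivy.xml.rej failure: the .rej file simply collects the hunks whose context no longer matched the file. A small self-contained demonstration of the mechanics (the files below are stand-ins; the real run targets the Nutch trunk with the NUTCH-1933 patch):

```shell
# Demonstrate how "patch -p0" saves failed hunks to a .rej file, as in
# the patch log above. "demo/ivy.xml" is a stand-in, not the real file.
mkdir -p demo
printf 'line one\nline two\n' > demo/ivy.xml

# A hunk whose context matches the file: applies cleanly.
cat > good.patch <<'EOF'
--- ivy.xml
+++ ivy.xml
@@ -1,2 +1,3 @@
 line one
 line two
+line three
EOF
patch -d demo -p0 < good.patch

# A hunk whose context does NOT match: it is rejected and saved to
# ivy.xml.rej, to be merged by hand -- or the whole file replaced,
# as suggested with the attached ivy.xml.
cat > bad.patch <<'EOF'
--- ivy.xml
+++ ivy.xml
@@ -1,2 +1,2 @@
 totally different
-context here
+context there
EOF
patch -d demo -p0 < bad.patch || true
ls demo/*.rej
```

Running `patch --dry-run` first is a cheap way to see which hunks will fail before the tree is half-patched.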
linkdb/current/part-00000/data does not exist
Hi,

I'm trying to crawl NSF ACADIS with nutch-selenium. I ran into a problem: *linkdb/current/part-00000/data does not exist.* I checked my directory and my files during crawling, and it appears this file sometimes exists and sometimes disappears, which is quite weird. Another problem is that when we crawl NSIDC ADE, it gives us a 403 Forbidden error. Does this mean NSIDC ADE is blocking us? The log of the first error is at the bottom of this email. Any help would be appreciated.

Regards,
Shuo Li

LinkDb: merging with existing linkdb: nsfacadis3Crawl/linkdb
LinkDb: java.io.FileNotFoundException: File file:/vagrant/nutch/runtime/local/nsfacadis3Crawl/linkdb/current/part-00000/data does not exist.
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:402)
    at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:255)
    at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:47)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
    at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1081)
    at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1073)
    at org.apache.hadoop.mapred.JobClient.access$700(JobClient.java:179)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:983)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
    at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:208)
    at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:316)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:276)
Vagrant Crashed When Using Nutch-Selenium
Hey guys,

I'm trying to use Nutch-Selenium to crawl nutch.apache.org. However, my Vagrant VM seems to have crashed after a few minutes. I forced it to shut down, and it turns out it only crawled 59 websites. My Nutch version is 1.10 and my OS is Ubuntu Trusty, 14.04. Is there anything I can provide to you guys? Does anybody else have the same issue? Or is 59 websites the complete crawl? Any suggestion would be appreciated.

Regards,
Shuo Li
Re: Vagrant Crashed When Using Nutch-Selenium
Hey guys,

After changing my RAM to 2GB, everything works fine. My bad. Thanks for your help.

Regards,
Shuo Li

On Fri, Feb 13, 2015 at 11:34 AM, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote:

Thank you Mo. I sincerely appreciate your guidance and contribution. I will work to get your nutch-selenium grid plugin contributed to work with Nutch 1.x.

Cheers,
Chris

-----Original Message-----
From: Mo Omer beancinemat...@gmail.com
Date: Friday, February 13, 2015 at 11:10 AM
To: Chris Mattmann chris.a.mattm...@jpl.nasa.gov
Cc: dev@nutch.apache.org
Subject: Re: Vagrant Crashed When Using Nutch-Selenium

Hey all,

When I had run nutch-selenium, it was in a config such that zombies were created from closing Firefox windows and they couldn't be reaped (again, due to the Docker configuration I had). In a normal setup it should not be an issue: if you're running 20 threads in Nutch, that's potentially 20 open Firefox windows, which isn't good for 512 MB. Selenium Grid is much more efficient, in that browsers are opened but tabs are used to fetch sites, and only those are closed. Additionally, ensure you're using Nutch 2.2.1. Feel free to fork, patch, tinker, and PR as needed.

Chris, if you want to be added to contribs on the GitHub project, that's cool with me! Wish I could dedicate more time to this, but I don't foresee using Nutch again in the near future, and am now working on projects that require lots of reading and possibly patches to Caffe and OpenCL R-CNN projects.

Tl;dr:
- No, this shouldn't be typical unless you're creating zombies like crazy and they're not being reaped (too many open file descriptors), running out of memory, or hitting a similar resource constraint.
- Selenium Grid is tons more efficient, but a bit more difficult to set up. I used it to crawl 100Ks of sites.
- Unfortunately I can't commit more time to this, but if I can assist in any admin way, let me know.

Thank you,
Mo

This message was drafted on a tiny touch screen; please forgive brevity and typos.

On Feb 13, 2015, at 12:41 PM, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote:

Oh yes, please up your memory to at least 2GB.

-----Original Message-----
From: Shuo Li sli...@usc.edu
Reply-To: dev@nutch.apache.org
Date: Friday, February 13, 2015 at 10:38 AM
To: dev@nutch.apache.org
Cc: Mo Omer beancinemat...@gmail.com
Subject: Re: Vagrant Crashed When Using Nutch-Selenium

Hey Mo and Prof. Mattmann,

I will try to crawl the 3 websites in the homework tonight (NASA AMD, NSF ACADIS and NSIDC Arctic Data Explorer). I will let you know what's going on. Is memory an issue? My Vagrant VM only has 512MB of memory.

Regards,
Shuo Li

On Fri, Feb 13, 2015 at 10:25 AM, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote:

Hi Shuo,

Thanks for your email. I wonder if using Selenium Grid would help? Please see this plugin: https://github.com/momer/nutch-selenium-grid-plugin

I'm CC'ing Mo, the author of the plugin, to see if he experienced this while running the original selenium plugin. Mo, did using Selenium Grid help the issue that Shuo is experiencing below? And are you cool with porting the grid plugin to trunk, or with Lewis or me doing it (with full credit to you, of course)?

Cheers,
Chris
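The fix that resolved this thread (raising the VM from 512 MB to 2 GB) is a one-line provider setting. A sketch of the relevant Vagrantfile fragment, assuming the VirtualBox provider; the box name is an assumption based on the Ubuntu Trusty setup mentioned above:

```ruby
# Vagrantfile fragment (sketch): raise VM memory from the 512 MB default
# to 2 GB, per the advice in this thread. Provider and box name assumed.
Vagrant.configure("2") do |config|
  config.vm.box = "ubuntu/trusty64"        # assumed Trusty box
  config.vm.provider "virtualbox" do |vb|
    vb.memory = 2048                       # 2 GB, as suggested
  end
end
```

After editing, `vagrant reload` restarts the VM with the new memory size.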
Re: Nutch-Selenium in Nutch 1.10
I think I have possibly finished installing. What you need to do:

0. git status and check out what you have modified.
1. patch -p0 < YOUR_PATCH_FILE
2. ant clean jar
3. ant runtime

Will try crawling using selenium later on. Hope this helped.

On Thu, Feb 12, 2015 at 9:20 AM, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote:

Yes, I believe you need to install X11. Why don't you try it and report back what you find? Thanks.

Sent from my iPhone

On Feb 12, 2015, at 8:28 AM, Jiaxin Ye jiaxi...@usc.edu wrote:

Hi professor, but can we use Selenium on Mac?

On Thursday, February 12, 2015, Mattmann, Chris A (3980) chris.a.mattm...@jpl.nasa.gov wrote:

You need Selenium, Jiaxin, in order to crawl dynamic pages in the polar dataset you have been assigned in my CSCI 572 search engines class. The instructions for integrating Selenium with Nutch 1.10-trunk are here: https://issues.apache.org/jira/browse/NUTCH-1933

Cheers,
Chris

-----Original Message-----
From: Jiaxin Ye jiaxi...@usc.edu
Reply-To: dev@nutch.apache.org
Date: Thursday, February 12, 2015 at 12:46 AM
To: dev@nutch.apache.org
Subject: Re: Nutch-Selenium in Nutch 1.10

Well, good choice. I am thinking of changing to Ubuntu now. The thing is, why do we need Selenium anyway? Just easier to perform crawling?

On Thu, Feb 12, 2015 at 12:25 AM, Shuo Li sli...@usc.edu wrote:

Interestingly, I'm a Mac user, but I don't want to screw up my laptop, so I'm using Vagrant with Ubuntu Trusty. It doesn't have a GUI, but Xvfb can still be installed properly. The issue is I don't know how to integrate Selenium with Nutch 1.10.

On Thu, Feb 12, 2015 at 12:04 AM, Jiaxin Ye jiaxi...@usc.edu wrote:

Hi all,

Does anyone here know where to find a setup tutorial for Selenium on Mac? I find it difficult to install Xvfb on Mac.

Best,
Jiaxin

On Tue, Feb 10, 2015 at 9:42 PM, Sapnashri Suresh sapna...@usc.edu wrote:

Hi Shuo Li,

We were facing a similar issue. Prof. Mattmann suggested we look into this patch for Selenium on Nutch 1.10: https://issues.apache.org/jira/browse/NUTCH-1933. Hope this helps!

Thanks,
Sapna

On Tue, Feb 10, 2015 at 9:36 PM, Shuo Li sli...@usc.edu wrote:

Yop,

I'm trying to install selenium in Nutch 1.10. However, this error pops out:

error: package org.apache.nutch.storage does not exist

I can only find this package in Nutch 2.x. Is there a way to use Selenium in 1.10? Any advice would be appreciated.

Regards,
Shuo Li

--
Graduate Student
MS in CS (Data Science)
Viterbi School of Engineering
University of Southern California
Phone: +1 650-307-9848
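For the Ubuntu-in-Vagrant route discussed above, the headless display setup boils down to installing Xvfb and pointing Firefox at it via DISPLAY. A sketch (package names are the usual Trusty ones; the display number :99 is arbitrary):

```shell
# Sketch of the headless X setup assumed by the nutch-selenium threads.
# Install and launch are shown as comments because they need root and an
# X-capable host:
#
#   sudo apt-get install -y xvfb firefox
#   Xvfb :99 -screen 0 1024x768x24 &
#
# With the framebuffer running, tell Selenium's Firefox where to render:
export DISPLAY=:99
echo "DISPLAY=$DISPLAY"
# ...then run bin/crawl as usual; the Firefox windows Selenium opens go
# to the virtual framebuffer instead of a real screen.
```

On a Mac, the equivalent would be an X11 server such as XQuartz, which is why installing X11 came up earlier in the thread.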
Re: Nutch-Selenium in Nutch 1.10
Interestingly, I'm a Mac user, but I don't want to screw up my laptop, so I'm using Vagrant with Ubuntu Trusty. It doesn't have a GUI, but Xvfb can still be installed properly. The issue is I don't know how to integrate Selenium with Nutch 1.10.

On Thu, Feb 12, 2015 at 12:04 AM, Jiaxin Ye jiaxi...@usc.edu wrote:

Hi all,

Does anyone here know where to find a setup tutorial for Selenium on Mac? I find it difficult to install Xvfb on Mac.

Best,
Jiaxin

On Tue, Feb 10, 2015 at 9:42 PM, Sapnashri Suresh sapna...@usc.edu wrote:

Hi Shuo Li,

We were facing a similar issue. Prof. Mattmann suggested we look into this patch for Selenium on Nutch 1.10: https://issues.apache.org/jira/browse/NUTCH-1933. Hope this helps!

Thanks,
Sapna

On Tue, Feb 10, 2015 at 9:36 PM, Shuo Li sli...@usc.edu wrote:

Yop,

I'm trying to install selenium in Nutch 1.10. However, this error pops out: *error: package org.apache.nutch.storage does not exist*

I can only find this package in Nutch 2.x. Is there a way to use Selenium in 1.10? Any advice would be appreciated.

Regards,
Shuo Li

--
Graduate Student
MS in CS (Data Science)
Viterbi School of Engineering
University of Southern California
Phone: +1 650-307-9848
Nutch-Selenium in Nutch 1.10
Yop,

I'm trying to install selenium in Nutch 1.10. However, this error pops out:

error: package org.apache.nutch.storage does not exist

I can only find this package in Nutch 2.x. Is there a way to use Selenium in 1.10? Any advice would be appreciated.

Regards,
Shuo Li