Re: linkdb/current/part-00000/data does not exist

2015-02-22 Thread Shuo Li
I was using ./bin/crawl without incremental crawling at that time. The
file appeared after I started crawling *.gif, *.jpg, *.mov, etc. I will
provide more information if I can reproduce the error.

Thanks =)

On Sun, Feb 22, 2015 at 4:47 PM, Mattmann, Chris A (3980) 
chris.a.mattm...@jpl.nasa.gov wrote:

 What command are you using to crawl? Are you using bin/crawl, and/or
 doing incremental crawling?

 Cheers,
 Chris


 ++
 Chris Mattmann, Ph.D.
 Chief Architect
 Instrument Software and Science Data Systems Section (398)
 NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
 Office: 168-519, Mailstop: 168-527
 Email: chris.a.mattm...@nasa.gov
 WWW:  http://sunset.usc.edu/~mattmann/
 ++
 Adjunct Associate Professor, Computer Science Department
 University of Southern California, Los Angeles, CA 90089 USA
 ++






 -Original Message-
 From: Shuo Li sli...@usc.edu
 Reply-To: dev@nutch.apache.org dev@nutch.apache.org
 Date: Friday, February 20, 2015 at 3:26 PM
 To: dev@nutch.apache.org dev@nutch.apache.org
 Subject: linkdb/current/part-00000/data does not exist

 Hi,
 
 
 I'm trying to crawl NSF ACADIS with nutch-selenium, and I ran into a
 problem: linkdb/current/part-00000/data does not exist. I checked my
 crawl directory during crawling, and the file sometimes exists and
 sometimes disappears, which is quite strange.
 
 
 Another problem: when we crawl NSIDC ADE, we get a 403 Forbidden
 error. Does this mean NSIDC ADE is blocking us?
 
 
 The log for the first error is at the bottom of this email. Any help would be
 appreciated.
 
 
 Regards,
 Shuo Li
 
 LinkDb: merging with existing linkdb: nsfacadis3Crawl/linkdb
 LinkDb: java.io.FileNotFoundException: File file:/vagrant/nutch/runtime/local/nsfacadis3Crawl/linkdb/current/part-00000/data does not exist.
 at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:402)
 at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:255)
 at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:47)
 at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
 at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1081)
 at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1073)
 at org.apache.hadoop.mapred.JobClient.access$700(JobClient.java:179)
 at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:983)
 at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:415)
 at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
 at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
 at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
 at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
 at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:208)
 at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:316)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
 at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:276)
 




Re: Problem installing Selenium on Ubuntu with Nutch trunk 1.10

2015-02-21 Thread Shuo Li
Yop,

Here's a corrected ivy.xml. I think something goes wrong when we apply the
patch: it generates duplicate <?xml?> tags, which you may need to delete
manually. If anybody could provide a complete tutorial or a correct patch,
that'd be great.

PS0: I didn't read the whole conversation. I hope this helps.
PS1: Please remove all the lines in that patch that touch ivy.xml and use
the attached ivy.xml instead.
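
A quick way to check a patched ivy.xml for the duplicated declaration is to count the lines that contain `<?xml `. This is only a minimal sketch, assuming a POSIX shell with grep; the function names `count_xml_decls` and `check_ivy_xml` are made up for illustration.

```sh
#!/bin/sh
# Count the lines in a file that contain an XML declaration.
# A well-formed file has exactly one, on its first line.
count_xml_decls() {
    # grep -c prints 0 and exits non-zero when nothing matches,
    # so "|| true" keeps the function usable under "set -e".
    grep -c '<?xml ' "$1" || true
}

# Report whether the file needs manual cleanup.
check_ivy_xml() {
    n=$(count_xml_decls "$1")
    if [ "$n" -gt 1 ]; then
        echo "$1: $n XML declarations found, delete the extras"
        return 1
    fi
    echo "$1: ok"
}
```

From the Nutch source root this would be called as `check_ivy_xml ivy/ivy.xml`.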

Regards,
Shuo Li

On Sat, Feb 21, 2015 at 11:43 AM, Nikunj Gala nikun...@usc.edu wrote:

 Hey, you are correct. I see failures while patching ivy.xml on the latest
 GitHub Nutch trunk.
 The patch logs are as follows:

 ---
 patching file build.xml
 patching file ivy/ivy.xml
 Hunk #3 FAILED at 59.
 1 out of 3 hunks FAILED -- saving rejects to file ivy/ivy.xml.rej
 patching file src/plugin/build.xml
 Hunk #2 succeeded at 148 (offset 2 lines).
 patching file src/plugin/lib-selenium/build.xml
 patching file src/plugin/lib-selenium/ivy.xml
 patching file src/plugin/lib-selenium/plugin.xml
 patching file src/plugin/lib-selenium/src/java/org/apache/nutch/protocol/selenium/HttpWebClient.java
 patching file src/plugin/lib-selenium/src/pom.xml
 patching file src/plugin/protocol-selenium/.idea/.name
 patching file src/plugin/protocol-selenium/.idea/compiler.xml
 patching file src/plugin/protocol-selenium/.idea/copyright/profiles_settings.xml
 patching file src/plugin/protocol-selenium/.idea/encodings.xml
 patching file src/plugin/protocol-selenium/.idea/misc.xml
 patching file src/plugin/protocol-selenium/.idea/modules.xml
 patching file src/plugin/protocol-selenium/.idea/scopes/scope_settings.xml
 patching file src/plugin/protocol-selenium/.idea/vcs.xml
 patching file src/plugin/protocol-selenium/.idea/workspace.xml
 patching file src/plugin/protocol-selenium/build.xml
 patching file src/plugin/protocol-selenium/ivy.xml
 patching file src/plugin/protocol-selenium/plugin.xml
 patching file src/plugin/protocol-selenium/src/java/org/apache/nutch/protocol/selenium/Http.java
 patching file src/plugin/protocol-selenium/src/java/org/apache/nutch/protocol/selenium/HttpResponse.java
 patching file src/plugin/protocol-selenium/src/java/org/apache/nutch/protocol/selenium/package.html
 patching file src/plugin/protocol-selenium/src/pom.xml
 patching file src/plugin/protocol-selenium/src/target/classes/org/apache/nutch/protocol/htmlunit/package.html

 ---

 Trying to understand and fix the patch now.
 Has anybody else made any changes to the patch?

<?xml version="1.0" ?>

<!-- Licensed to the Apache Software Foundation (ASF) under one or more contributor
	license agreements. See the NOTICE file distributed with this work for additional
	information regarding copyright ownership. The ASF licenses this file to
	You under the Apache License, Version 2.0 (the "License"); you may not use
	this file except in compliance with the License. You may obtain a copy of
	the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required
	by applicable law or agreed to in writing, software distributed under the
	License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS
	OF ANY KIND, either express or implied. See the License for the specific
	language governing permissions and limitations under the License. -->

<ivy-module version="1.0">
	<info organisation="org.apache.nutch" module="nutch">
		<license name="Apache 2.0"
			url="http://www.apache.org/licenses/LICENSE-2.0.txt/" />
		<ivyauthor name="Apache Nutch Team" url="http://nutch.apache.org" />
		<description homepage="http://nutch.apache.org">Nutch is an open source web-search
			software. It builds on
			Hadoop, Tika and Solr, adding web-specifics,
			such as a crawler, a link-graph
			database etc.
		</description>
	</info>
	
	<configurations>
		<include file="${basedir}/ivy/ivy-configurations.xml" />
	</configurations>
	
	<publications>
		<!--get the artifact from our module name -->
		<artifact conf="master" />
	</publications>
	
	<dependencies>
		<dependency org="org.slf4j" name="slf4j-api" rev="1.6.1"
			conf="*->master" />
		<dependency org="org.slf4j" name="slf4j-log4j12" rev="1.6.1"
			conf="*->master" />
		
		<dependency org="log4j" name="log4j" rev="1.2.15" conf="*->master" />
		
		<dependency org="commons-lang" name="commons-lang" rev="2.6"
			conf="*->default" />
		<dependency org="commons-collections" name="commons-collections"
			rev="3.1" conf="*->default" />
		<dependency org="commons-httpclient" name="commons-httpclient"
			rev="3.1" conf="*->master" />
		<dependency org="commons-codec" name="commons-codec" rev="1.3"
			conf="*->default" />
		
		<dependency org="org.apache.hadoop" name="hadoop-core" rev="1.2.0"
			conf="*->default">
			<exclude org="hsqldb" name="hsqldb" />
			<exclude org="net.sf.kosmosfs" name="kfs" />
			<exclude org="net.java.dev.jets3t" name="jets3t" />
			<exclude org

linkdb/current/part-00000/data does not exist

2015-02-20 Thread Shuo Li
Hi,

I'm trying to crawl NSF ACADIS with nutch-selenium, and I ran into a problem:
*linkdb/current/part-00000/data does not exist.* I checked my crawl directory
during crawling, and the file sometimes exists and sometimes disappears,
which is quite strange.

Another problem: when we crawl NSIDC ADE, we get a 403 Forbidden error.
Does this mean NSIDC ADE is blocking us?

The log for the first error is at the bottom of this email. Any help would be
appreciated.

Regards,
Shuo Li





LinkDb: merging with existing linkdb: nsfacadis3Crawl/linkdb
LinkDb: java.io.FileNotFoundException: File file:/vagrant/nutch/runtime/local/nsfacadis3Crawl/linkdb/current/part-00000/data does not exist.
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:402)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:255)
at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:47)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1081)
at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1073)
at org.apache.hadoop.mapred.JobClient.access$700(JobClient.java:179)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:983)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:208)
at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:316)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:276)
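
For anyone hitting the same exception: the linkdb is written as Hadoop MapFiles, so each `part-*` directory under `linkdb/current` is expected to contain both a `data` and an `index` file. The sketch below lists part directories where either file is missing, and could be run before the merge step. The function name `check_linkdb_parts` is hypothetical, and the check only covers a crawl directory on the local filesystem, as in the log above.

```sh
#!/bin/sh
# Print every part-* directory under <linkdb>/current that lacks the
# "data" or "index" file of a Hadoop MapFile.
# Returns non-zero if anything is missing or no parts exist.
check_linkdb_parts() {
    linkdb=$1
    status=0
    found=0
    for part in "$linkdb"/current/part-*; do
        # An unmatched glob stays literal, so skip non-directories.
        [ -d "$part" ] || continue
        found=1
        for f in data index; do
            if [ ! -f "$part/$f" ]; then
                echo "missing: $part/$f"
                status=1
            fi
        done
    done
    if [ "$found" -eq 0 ]; then
        echo "no part-* directories under $linkdb/current"
        status=1
    fi
    return $status
}
```

With the crawl directory from the log, the call would be `check_linkdb_parts nsfacadis3Crawl/linkdb`.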


Vagrant Crushed When using Nutch-Selenium

2015-02-13 Thread Shuo Li
Hey guys,

I'm trying to use Nutch-Selenium to crawl nutch.apache.org. However, my
Vagrant VM seems to crash after a few minutes. I forced it to shut down, and
it turns out it had crawled only 59 websites. My Nutch version is 1.10 and my
OS is Ubuntu Trusty, 14.04.

Is there anything I can provide to you guys? Does anybody else have the
same issue? Or are 59 websites the complete crawl?

Any suggestion would be appreciated.

Regards,
Shuo Li


Re: Vagrant Crushed When using Nutch-Selenium

2015-02-13 Thread Shuo Li
Hey guys,

After increasing my VM's RAM to 2GB, everything works fine. My bad. Thanks
for your help.
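
A small guard like the following could catch an undersized VM before a crawl starts. It is only a sketch: it assumes a Linux guest where `/proc/meminfo` is available, and the function names are made up. The 1900 MB default is an assumption too, chosen because `MemTotal` usually reports a little less than the RAM configured for the VM, so a 2GB guest would fail a strict 2048 MB check.

```sh
#!/bin/sh
# Print total memory in MB, read from a meminfo-style file
# (defaults to /proc/meminfo, which reports MemTotal in kB).
mem_total_mb() {
    awk '/^MemTotal:/ { print int($2 / 1024) }' "${1:-/proc/meminfo}"
}

# Succeed if total memory is at least min_mb (default 1900).
# The optional second argument is the meminfo file to read.
has_enough_memory() {
    min_mb=${1:-1900}
    total=$(mem_total_mb "${2:-/proc/meminfo}")
    if [ "$total" -lt "$min_mb" ]; then
        echo "only ${total} MB of RAM, expected at least ${min_mb} MB"
        return 1
    fi
    echo "${total} MB of RAM available"
}
```

Inside the guest, `has_enough_memory || exit 1` at the top of a crawl script would stop early on a 512MB box.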

Regards,
Shuo Li

On Fri, Feb 13, 2015 at 11:34 AM, Mattmann, Chris A (3980) 
chris.a.mattm...@jpl.nasa.gov wrote:

 Thank you Mo. I sincerely appreciate your guidance and contribution.

 I will work to get your nutch selenium grid plugin contributed
 to work with Nutch 1.x.

 Cheers,
 Chris


 -Original Message-
 From: Mo Omer beancinemat...@gmail.com
 Date: Friday, February 13, 2015 at 11:10 AM
 To: Chris Mattmann chris.a.mattm...@jpl.nasa.gov
 Cc: dev@nutch.apache.org dev@nutch.apache.org
 Subject: Re: Vagrant Crushed When using Nutch-Selenium

 Hey all,
 
 When I ran nutch-selenium, it was in a configuration where closing
 Firefox windows created zombie processes that couldn't be reaped (again,
 due to the Docker configuration I had).
 
 In a normal setup it should not be an issue, but if you're running 20
 threads in Nutch, that's potentially 20 open Firefox windows, which isn't
 good with 512 MB of RAM.
 
 Selenium grid is much more efficient, in that browsers are opened, but
 tabs are used to fetch sites - and only those are closed.
 
 Additionally, ensure you're using Nutch 2.2.1.
 
 Feel free to fork, patch, tinker, and open PRs as needed.
 
 Chris, if you want to be added to contribs on the GitHub project, that's
 cool with me! Wish I could dedicate more time to this, but I don't
 foresee using Nutch again in the near future, and am now working on
 projects that require lots of reading and possibly patches to Caffe and
 opencl r-CNN projects.
 
 Tl;dr:
 - no, this shouldn't be typical unless you're creating zombies like crazy
 and they're not being reaped (too many open file descriptors), running
 out of memory, or similar resource constraint.
 - selenium grid is TONs more efficient, but a bit more difficult to set
 up. I used it to crawl 100ks of sites.
 - unfortunately I can't commit more time to this, but if I can assist in
 any admin way, let me know.
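
To check for the kind of zombie build-up described in this message, one option is to count processes whose state starts with `Z`. This is a sketch, assuming a `ps` that supports the `stat` and `comm` output fields (as procps on Ubuntu does); both function names are illustrative.

```sh
#!/bin/sh
# Count zombie entries in "STAT COMMAND" lines read from stdin.
count_zombies_in() {
    awk '$1 ~ /^Z/ { n++ } END { print n + 0 }'
}

# Count zombies on the running system. Each field gets its own -o
# option with an empty header ("=") so no header line is printed.
count_zombies() {
    ps -e -o stat= -o comm= | count_zombies_in
}
```

Running `count_zombies` periodically during a crawl would show whether the number keeps growing as browser windows are closed.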
 
 Thank you,
 
 Mo
 
 This message was drafted on a tiny touch screen; please forgive brevity & tpyos
 
  On Feb 13, 2015, at 12:41 PM, Mattmann, Chris A (3980)
 chris.a.mattm...@jpl.nasa.gov wrote:
 
  Oh yes, please increase your memory to at least 2GB.
 
  -Original Message-
  From: Shuo Li sli...@usc.edu
  Reply-To: dev@nutch.apache.org dev@nutch.apache.org
  Date: Friday, February 13, 2015 at 10:38 AM
  To: dev@nutch.apache.org dev@nutch.apache.org
  Cc: Mo Omer beancinemat...@gmail.com
  Subject: Re: Vagrant Crushed When using Nutch-Selenium
 
  Hey Mo and Prof Mattmann,
 
 
  I will try to crawl the 3 websites in the homework tonight (NASA AMD, NSF
  ACADIS and NSIDC Arctic Data Explorer). I will let you know what's going on.
 
 
  Is memory an issue? My Vagrant VM only has 512MB of memory.
 
 
  Regards,
  Shuo Li
 
 
  On Fri, Feb 13, 2015 at 10:25 AM, Mattmann, Chris A (3980)
  chris.a.mattm...@jpl.nasa.gov wrote:
 
  Hi Shuo,
 
  Thanks for your email. I wonder if using selenium grid would
  help?
 
  Please see this plugin:
 
  https://github.com/momer/nutch-selenium-grid-plugin
 
 
  I’m CC’ing Mo, the author of the plugin, to see if he experienced
  this while running the original selenium plugin. Mo, did using
  selenium grid help with the issue that Shuo is experiencing below?
 
  Mo: are you cool with porting the grid plugin yourself, or with Lewis or
  me porting it to trunk (with full credit to you, of course)?
 
  Cheers,
  Chris
 
 

Re: Nutch-Selenium in Nutch 1.10

2015-02-12 Thread Shuo Li
I think I have finished installing it.

What you need to do:
0. Run git status and check out (revert) anything you have modified.
1. patch -p0 < YOUR_PATCH_FILE
2. ant clean jar
3. ant runtime
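
The steps above can be strung together roughly as follows. This is a sketch rather than a tested installer: `apply_selenium_patch` and the `DRY_RUN` switch are made up for illustration, and `--dry-run` assumes GNU patch. It deliberately stops at `git status` instead of reverting anything for you.

```sh
#!/bin/sh
# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Usage: apply_selenium_patch YOUR_PATCH_FILE (from the Nutch source root)
apply_selenium_patch() {
    patch_file=$1
    run git status --short &&                     # 0. review local changes first
    run patch -p0 --dry-run -i "$patch_file" &&   # check that every hunk applies
    run patch -p0 -i "$patch_file" &&             # 1. apply the patch
    run ant clean jar &&                          # 2. rebuild the jar
    run ant runtime                               # 3. build the runtime directory
}
```

Setting `DRY_RUN=1` before the call prints the five commands without touching the tree. In a real run, the chain stops at the first failing step, so a rejected hunk in the dry run leaves the source untouched.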

I will try crawling with Selenium later on. Hope this helps.

On Thu, Feb 12, 2015 at 9:20 AM, Mattmann, Chris A (3980) 
chris.a.mattm...@jpl.nasa.gov wrote:

  Yes, I believe you need to install X11. Why don't you try it and report
 back what you find? Thanks.

 Sent from my iPhone

 On Feb 12, 2015, at 8:28 AM, Jiaxin Ye jiaxi...@usc.edu wrote:

  Hi professor, but can we use Selenium on Mac?

 On Thursday, February 12, 2015, Mattmann, Chris A (3980) 
 chris.a.mattm...@jpl.nasa.gov wrote:

 Jiaxin, you need Selenium in order to crawl dynamic pages in the
 polar dataset you have been assigned in my CSCI 572 search engines class.

 The instructions for integrating Selenium with Nutch 1.10-trunk
 are here:

 https://issues.apache.org/jira/browse/NUTCH-1933


 Cheers,
 Chris


 -Original Message-
 From: Jiaxin Ye jiaxi...@usc.edu
 Reply-To: dev@nutch.apache.org dev@nutch.apache.org
 Date: Thursday, February 12, 2015 at 12:46 AM
 To: dev@nutch.apache.org dev@nutch.apache.org
 Subject: Re: Nutch-Selenium in Nutch 1.10

 Well, good choice. I am thinking of switching to Ubuntu now. The thing is,
 why do we need Selenium anyway? Does it just make crawling easier?
 
 On Thu, Feb 12, 2015 at 12:25 AM, Shuo Li
 sli...@usc.edu wrote:
 
 Interestingly, I'm a Mac user, but I don't want to mess up my laptop, so I'm
 using Vagrant with Ubuntu Trusty. It doesn't have a GUI, but Xvfb can still
 be installed properly. The issue is that I don't know how to integrate
 Selenium with Nutch 1.10.
 
 On Thu, Feb 12, 2015 at 12:04 AM, Jiaxin Ye
 jiaxi...@usc.edu wrote:
 
 Hi all,
 
 
 Does anyone here know where to find a setup tutorial for Selenium on Mac?
 I find it difficult to install Xvfb on a Mac.
 
 
 Best,
 Jiaxin
 
 
 On Tue, Feb 10, 2015 at 9:42 PM, Sapnashri Suresh
 sapna...@usc.edu wrote:
 
 Hi Shuo Li,
 
 
 We were facing a similar issue. Prof. Mattmann suggested we look into this
 patch for Selenium on Nutch 1.10:
 https://issues.apache.org/jira/browse/NUTCH-1933.
 
 
 Hope this helps!
 
 
 Thanks,
 Sapna
 
 On Tue, Feb 10, 2015 at 9:36 PM, Shuo Li
 sli...@usc.edu wrote:
 
 Yop,
 
 
 I'm trying to install Selenium in Nutch 1.10. However, this error pops
 up:
 
 
 error: package org.apache.nutch.storage does not exist
 
 
 
 I can only find this package in Nutch 2.x. Is there a way to use Selenium
 in 1.10?
 
 
 Any advice would be appreciated.
 
 
 Regards,
 Shuo Li
 
 --
 Graduate Student
 MS in CS (Data Science)
 Viterbi School of Engineering
 University of Southern California
 
 
 Phone:
 +1 650-307-9848




Re: Nutch-Selenium in Nutch 1.10

2015-02-12 Thread Shuo Li
Interestingly, I'm a Mac user, but I don't want to mess up my laptop, so I'm
using Vagrant with Ubuntu Trusty. It doesn't have a GUI, but Xvfb can still be
installed properly. The issue is that I don't know how to integrate
Selenium with Nutch 1.10.

On Thu, Feb 12, 2015 at 12:04 AM, Jiaxin Ye jiaxi...@usc.edu wrote:

 Hi all,

 Does anyone here know where to find a setup tutorial for Selenium on Mac?
 I find it difficult to install Xvfb on a Mac.

 Best,
 Jiaxin

 On Tue, Feb 10, 2015 at 9:42 PM, Sapnashri Suresh sapna...@usc.edu
 wrote:

 Hi Shuo Li,

 We were facing a similar issue. Prof. Mattmann suggested we look into this
 patch for Selenium on Nutch 1.10:
 https://issues.apache.org/jira/browse/NUTCH-1933.

 Hope this helps!

 Thanks,
 Sapna

 On Tue, Feb 10, 2015 at 9:36 PM, Shuo Li sli...@usc.edu wrote:

 Yop,

 I'm trying to install Selenium in Nutch 1.10. However, this error pops
 up:

 *error: package org.apache.nutch.storage does not exist*

 I can only find this package in Nutch 2.x. Is there a way to use
 Selenium in 1.10?

 Any advice would be appreciated.

 Regards,
 Shuo Li




 --
 Graduate Student
 MS in CS (Data Science)
 Viterbi School of Engineering
 University of Southern California

 Phone: +1 650-307-9848





Nutch-Selenium in Nutch 1.10

2015-02-10 Thread Shuo Li
Yop,

I'm trying to install Selenium in Nutch 1.10. However, this error pops up:

*error: package org.apache.nutch.storage does not exist*

I can only find this package in Nutch 2.x. Is there a way to use Selenium
in 1.10?
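
Since `org.apache.nutch.storage` is only found in Nutch 2.x, one way to see which plugin sources were written against 2.x is to search them for references to that package. A minimal sketch, assuming the plugin sources were copied somewhere under `src/plugin`; the function name `find_storage_refs` is hypothetical.

```sh
#!/bin/sh
# List the Java files under a directory that reference the
# org.apache.nutch.storage package.
find_storage_refs() {
    find "$1" -name '*.java' \
        -exec grep -l 'org\.apache\.nutch\.storage' {} +
}
```

Calling `find_storage_refs src/plugin` from the Nutch source root prints one path per file that would need porting before it can compile against 1.10.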

Any advice would be appreciated.

Regards,
Shuo Li