Ignoring errors in crawl

2014-07-17 Thread Adam Estrada
All, I am coming across a few pages that are not responsive at all which is causing Nutch to #failwhale before finishing the current crawl. I have increased http.timeout and it still crashes. How can I get Nutch to skip over unresponsive URLs that are causing the entire thing to bail? Thanks, Ada
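For readers hitting the same problem, a hedged sketch of `nutch-site.xml` overrides that bound how long the fetcher can hang (both property names exist in Nutch 1.x defaults; the values shown are illustrative, not recommendations):

```xml
<!-- Sketch: nutch-site.xml overrides (illustrative values). -->
<property>
  <name>http.timeout</name>
  <value>10000</value>
  <description>Milliseconds before an HTTP request is abandoned.</description>
</property>
<property>
  <name>fetcher.timelimit.mins</name>
  <value>60</value>
  <description>Hard cap on the fetch phase; remaining queues are dropped
  when it expires, so a few dead hosts cannot stall the whole crawl.</description>
</property>
```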

Re: Ignoring errors in crawl

2014-07-17 Thread Adam Estrada
) at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1468) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1441) On Thu, Jul 17, 2014 at 10:06 AM, Adam Estrada wrote: > All, > > I am coming across a

Re: Ignoring errors in crawl

2014-07-21 Thread Adam Estrada
Julien, I just bumped it up from 2 gigs to 4. Let's see how it goes. Thanks! Adam On Thu, Jul 17, 2014 at 1:40 PM, Adam Estrada wrote: > Julien and Markus, > > The logs report that a couple of threads hung while processing certain > URLs. Below that was the out of memory WAR
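The heap bump discussed above is typically applied through the `NUTCH_HEAPSIZE` environment variable that `bin/nutch` reads in Nutch 1.x (value in megabytes). A minimal sketch, assuming that variable is honored by your release; 4000 MB mirrors the 4 GB mentioned in the thread:

```shell
# Sketch: size the crawler JVM heap before launching bin/nutch.
# NUTCH_HEAPSIZE is read in megabytes by Nutch 1.x launch scripts.
export NUTCH_HEAPSIZE=4000
echo "heap set to ${NUTCH_HEAPSIZE} MB"
```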

Segment already parsed!

2014-07-21 Thread Adam Estrada
All, I have been crawling the web now for a few days without any issues. All of a sudden today I came across this error. Exception in thread "main" java.io.IOException: Segment already parsed! at org.apache.nutch.parse.ParseOutputFormat.checkOutputSpecs(ParseOutputFormat.java:89) at org.apache

Re: Segment already parsed!

2014-07-22 Thread Adam Estrada
false, which means that a separate parsing step is required after fetching is finished. Maybe you could shed some light on why this property exists so that other folks reading this thread can benefit? Thanks again! Adam On Mon, Jul 21, 2014 at 4:21 PM, Adam Estrada wrote: > All, > &g
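The property being discussed is `fetcher.parse`. When it is false, fetching and parsing are separate steps, and running the parse step twice over the same segment trips the "Segment already parsed!" check. A hedged `nutch-site.xml` sketch for reference:

```xml
<!-- Sketch: with fetcher.parse=false, parse each segment exactly once;
     re-parsing an already-parsed segment raises the error in this thread. -->
<property>
  <name>fetcher.parse</name>
  <value>false</value>
  <description>If true, the fetcher parses content as it fetches;
  if false, a separate parse step is required after fetching.</description>
</property>
```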

Re: Segment already parsed!

2014-07-22 Thread Adam Estrada
You're right! Thanks Julien. I am using 4gigs of RAM now and it seems to be cruising right along! I think I'll increase it even more on my next run. Adam On Tue, Jul 22, 2014 at 9:40 AM, Adam Estrada wrote: > Sebastian, > > Thanks so much for the quick response. You

Archiving Audio and Video

2011-01-25 Thread Adam Estrada
Curious...I have been using Nutch for a while now and have never tried to index any audio or video formats. Is it feasible to grab the audio out of both forms of media and then index it? I believe this would require some kind of transcription which may be out of reach on this project. Thanks, A

Re: Archiving Audio and Video

2011-01-26 Thread Adam Estrada
Another example would be the content embedded in this flash movie. http://digitalmedia.worldbank.org/SSP/lac/investment-in-haiti/ Adam On Wed, Jan 26, 2011 at 1:02 AM, Gora Mohanty wrote: > On Wed, Jan 26, 2011 at 9:15 AM, Adam Estrada > wrote: >> Curious...I have been using Nutch

[Example] Configuration for a Hadoop Cluster

2011-01-26 Thread Adam Estrada
Does anyone have any information on this for use with Nutch? Thanks, Adam

Re: Archiving Audio and Video

2011-01-26 Thread Adam Estrada
Thank you very much for the info! Adam On Wed, Jan 26, 2011 at 11:37 AM, Gora Mohanty wrote: > On Wed, Jan 26, 2011 at 7:38 PM, Adam Estrada > wrote: >> Another example would be the content embedded in this flash movie. >> >> http://digitalmedia.worldbank.org/SSP

Minimum Deployment Files

2011-01-31 Thread Adam Estrada
All, I am now using Nutch 1.2 and am curious as to what the minimum files are to run the app. Is there a bare-bones diagram or something that I can use to deploy the application? Adam

Re: How to speed up nutch crawling!

2011-02-02 Thread Adam Estrada
Try Hadoop'in it up... http://wiki.apache.org/nutch/NutchHadoopTutorial. The version of Nutch in trunk is dependent on a project called Gora which is supposed to help speed things up as well but I have yet to make it work...I'd stick with the tagged version 1.2 and go the Hadoop route. Best, Adam

Stupid Question

2011-02-10 Thread Adam Estrada
But is there any way to programmatically modify the config files behind Nutch? I am talking specifically about crawl-urlfilter.txt and the Solr mapping file. My inquiring mind wants to know ;-) Regards, Adam
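There is no dedicated Nutch API of that era for rewriting these files, but since `crawl-urlfilter.txt` is plain text, a script can edit it directly. A hypothetical shell sketch (the `example.org` rule is a placeholder, and a temp file stands in for `conf/crawl-urlfilter.txt`; Nutch applies the first matching rule, so the allow line must precede the catch-all deny):

```shell
# Hypothetical sketch: write an allow rule for a new domain ahead of the
# catch-all deny rule. A temp file stands in for conf/crawl-urlfilter.txt.
CONF=$(mktemp)
printf '%s\n' \
  '+^http://([a-z0-9]*\.)*example.org/' \
  '-.' > "$CONF"
grep -q 'example.org' "$CONF" && echo "rule written"
```

The same approach works for the Solr mapping file, which is ordinary XML and can be rewritten with any XML library before a crawl run.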

Re: search result page

2011-02-15 Thread Adam Estrada
I would add the -solr parameter and then add your crawled data to your Solr instance. For the client, there is a PHP version out there on Google Code. http://code.google.com/p/solr-php-client/ Adam On Feb 15, 2011, at 11:14 AM, Muwonge Ronald wrote: > Hi all, > I need someone to advise on how

Re: Problems crawling specific site

2011-02-20 Thread Adam Estrada
Can you post the full command line you ran and what you have entered in the crawl-urlfilter.txt file? Thanks, Adam On Sun, Feb 20, 2011 at 2:17 PM, McGibbney, Lewis John wrote: > Hello list, > > Whilst using Nutch-1.2 on ubuntu 10.04 and undertaking a crawl either using > crawl command or separ

Strange ERROR: Exception in thread "main" java.lang.NoClassDefFoundError: Studio

2011-04-21 Thread Adam Estrada
All, I downloaded the Nutch 1.2 binaries from here http://www.bizdirusa.com/mirrors/apache//nutch/ and get the following error when running it from a Cygwin console on a Windows 7 machine. $ bin/nutch crawl urls -depth 50 -threads 10 -topN 50 -solr http://localhost:8983/solr Exception in thread "

Re: Strange ERROR: Exception in thread "main" java.lang.NoClassDefFoundError: Studio

2011-04-22 Thread Adam Estrada
> whereis java > locate java > > and if none of those come back, how about java -version? > > Cheers, > Chris > > On Apr 21, 2011, at 7:01 PM, Adam Estrada wrote: > > > All, > > > > I downloaded the Nutch 1.2 binaries from here > > http://www.

Solr Indexer with Nutch 1.2 and 1.3

2011-04-24 Thread Adam Estrada
something? Thanks, Adam Estrada

Re: Solr Indexer with Nutch 1.2 and 1.3

2011-04-25 Thread Adam Estrada
On Sun, Apr 24, 2011 at 10:35 PM, Adam Estrada < estrada.adam.gro...@gmail.com> wrote: > All, > > I use Nutch to crawl selected websites and then store the results in Solr. > In Nutch 1.1, I was able to do this using the -solr > http://localhost:8983/solr command. This does not
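For anyone landing on this thread: in Nutch 1.3 the one-shot `crawl ... -solr` shortcut was replaced by an explicit indexing step. A hedged sketch of the 1.3-era workflow (the `crawl/` paths are placeholders for your own crawl output, and the exact argument order varies across 1.x releases, so check the usage output of `bin/nutch solrindex`):

```shell
# Sketch (Nutch 1.3-era): push crawled segments into a running Solr instance.
bin/nutch solrindex http://localhost:8983/solr/ \
  crawl/crawldb crawl/linkdb crawl/segments/*
```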

Recursively searching through web dirs

2011-08-24 Thread Adam Estrada
All, I have a root domain and a couple directories deep I have some files that I want to index. The problem is that they are not referenced on the main page using a hyperlink or anything like that. http://www.geoglobaldomination.org/kml/temp/ I want to be able to crawl down in to /kml/temp/ with
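One common approach for unlinked directories is to seed them directly, so Nutch does not have to discover them through hyperlinks. A hedged sketch, assuming the server exposes a directory listing at that path so the files inside become crawlable links:

```shell
# Sketch: inject the unlinked directory as its own seed URL.
mkdir -p urls
echo 'http://www.geoglobaldomination.org/kml/temp/' > urls/seed.txt
bin/nutch inject crawldb urls
```

Remember that `crawl-urlfilter.txt` (or `regex-urlfilter.txt`, depending on the command used) must also permit URLs under that path, or the injected seed will be filtered out.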

Re: Recursively searching through web dirs

2011-08-25 Thread Adam Estrada
i under command line options > > On Wed, Aug 24, 2011 at 9:03 PM, Adam Estrada < > estrada.adam.gro...@gmail.com > > wrote: > > > All, > > > > I have a root domain and a couple directories deep I have some files that > I > > want to index. The problem