All,
I am coming across a few pages that are not responsive at all, which is
causing Nutch to #failwhale before finishing the current crawl. I have
increased http.timeout and it still crashes. How can I get Nutch to
skip over unresponsive URLs that cause the entire crawl to bail?
Thanks,
Ada
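For others hitting this: besides http.timeout, a couple of nutch-site.xml overrides can bound a stalled fetch. The property names below are from the 1.x nutch-default.xml; the values are illustrative, not tuned, so check the defaults shipped with your release:

```xml
<!-- nutch-site.xml: illustrative overrides, verify names against
     the nutch-default.xml in your release -->
<property>
  <name>http.timeout</name>
  <value>10000</value>
  <description>Give up on a single connection after 10 seconds.</description>
</property>
<property>
  <name>fetcher.timelimit.mins</name>
  <value>180</value>
  <description>Hard stop for the whole fetch phase; remaining URLs
  are skipped instead of hanging the job.</description>
</property>
```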
at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1468)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1441)
On Thu, Jul 17, 2014 at 10:06 AM, Adam Estrada wrote:
Julien,
I just bumped it up from 2 gigs to 4. Let's see how it goes.
Thanks!
Adam
On Thu, Jul 17, 2014 at 1:40 PM, Adam Estrada wrote:
> Julien and Markus,
>
> The logs report that a couple of threads hung while processing certain
> URLs. Below that was the out-of-memory warning.
All,
I have been crawling the web now for a few days without any issues.
All of the sudden today I came across this error.
Exception in thread "main" java.io.IOException: Segment already parsed!
at org.apache.nutch.parse.ParseOutputFormat.checkOutputSpecs(ParseOutputFormat.java:89)
false, which means that a separate parsing step is required after fetching is finished.
Maybe you could shed some light on why this property exists so that
other folks reading this thread can benefit?
Thanks again!
Adam
On Mon, Jul 21, 2014 at 4:21 PM, Adam Estrada wrote:
> All,
>
>
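For readers skimming the archive, the two-pass flow (fetch, then parse as its own step) looks roughly like this. The segment timestamp is hypothetical; the subcommands are the standard Nutch 1.x bin/nutch ones:

```
# With fetch-time parsing disabled, parse runs as a separate pass:
bin/nutch generate crawl/crawldb crawl/segments
bin/nutch fetch crawl/segments/20140721123456
bin/nutch parse crawl/segments/20140721123456
bin/nutch updatedb crawl/crawldb crawl/segments/20140721123456
```

This also appears to be where the "Segment already parsed!" check from earlier in the thread fires: pointing the parse step at a segment that already contains parse data trips ParseOutputFormat.checkOutputSpecs.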
You're right! Thanks Julien. I am using 4 GB of RAM now and it seems
to be cruising right along! I think I'll increase it even more on my
next run.
Adam
On Tue, Jul 22, 2014 at 9:40 AM, Adam Estrada wrote:
> Sebastian,
>
> Thanks so much for the quick response. You
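If anyone wants to reproduce the memory bump without editing scripts: the stock 1.x bin/nutch launcher reads a NUTCH_HEAPSIZE environment variable (in megabytes) to set the JVM -Xmx. Check your copy of bin/nutch to confirm before relying on it:

```
export NUTCH_HEAPSIZE=4000   # ~4 GB heap, i.e. -Xmx4000m
bin/nutch crawl urls -depth 50 -topN 50
```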
Curious...I have been using Nutch for a while now and have never tried to index
any audio or video formats. Is it feasible to grab the audio out of both forms
of media and then index it? I believe this would require some kind of
transcription which may be out of reach on this project.
Thanks,
A
Another example would be the content embedded in this flash movie.
http://digitalmedia.worldbank.org/SSP/lac/investment-in-haiti/
Adam
On Wed, Jan 26, 2011 at 1:02 AM, Gora Mohanty wrote:
> On Wed, Jan 26, 2011 at 9:15 AM, Adam Estrada wrote:
Does anyone have any information on this for use with Nutch?
Thanks,
Adam
Thank you very much for the info!
Adam
On Wed, Jan 26, 2011 at 11:37 AM, Gora Mohanty wrote:
> On Wed, Jan 26, 2011 at 7:38 PM, Adam Estrada wrote:
All,
I am now using Nutch 1.2 and am curious as to what the minimum files
are to run the app. Is there a bare-bones diagram or something that I
can use to deploy the application?
Adam
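Not aware of an official bare-bones diagram; going by the 1.2 binary tarball, the pieces the runtime actually needs look roughly like this (verify against your own distribution):

```
bin/nutch        launcher script (sets classpath and heap)
conf/            nutch-default.xml, nutch-site.xml, crawl-urlfilter.txt, ...
lib/             dependency jars (Hadoop, Lucene, ...)
plugins/         protocol-*, parse-*, index-* plugin bundles
nutch-1.2.jar    the Nutch classes themselves
urls/            your seed list (you create this)
```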
Try Hadoop'ing it up:
http://wiki.apache.org/nutch/NutchHadoopTutorial. The version of Nutch
in trunk depends on a project called Gora, which is supposed to
help speed things up as well, but I have yet to make it work. I'd
stick with the tagged version 1.2 and go the Hadoop route.
Best,
Adam
But is there any way to programmatically modify the config files behind Nutch?
I am talking specifically about crawl-urlfilter.txt and the Solr mapping file.
My inquiring mind wants to know ;-)
Regards,
Adam
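As far as I know there is no config API in Nutch 1.x; crawl-urlfilter.txt and the Solr mapping file are plain text/XML, so the usual approach is to script the edits. A minimal sketch follows. A scratch file stands in for conf/crawl-urlfilter.txt, and both regexes are illustrative:

```shell
# Sketch: insert a new include rule above the final exclude-all ("-.")
# line of a URL filter file. A scratch copy is used so this is safe to run.
FILTER=$(mktemp)
printf '%s\n' '+^http://([a-z0-9]*\.)*apache.org/' '-.' > "$FILTER"

grep -v '^-\.$' "$FILTER" > "$FILTER.tmp"                             # drop the exclude-all rule
printf '%s\n' '+^http://([a-z0-9]*\.)*example.com/' >> "$FILTER.tmp"  # append the new include
printf '%s\n' '-.' >> "$FILTER.tmp"                                   # re-add exclude-all last
mv "$FILTER.tmp" "$FILTER"
cat "$FILTER"
```

The same trick works for the Solr mapping file, though since that one is XML a real parser is less fragile than grep/sed.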
I would add the -solr parameter and then add your crawled data to your Solr
instance. For the client, there is a PHP version out there on Google Code.
http://code.google.com/p/solr-php-client/
Adam
On Feb 15, 2011, at 11:14 AM, Muwonge Ronald wrote:
> Hi all,
> I need someone to advise on how
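For anyone searching the archive later: indexing into Solr can also run as a separate step after the crawl. In Nutch 1.x the invocation is shaped like this (paths hypothetical; run bin/nutch solrindex with no arguments to see your version's exact usage string):

```
bin/nutch solrindex http://localhost:8983/solr crawl/crawldb crawl/linkdb crawl/segments/*
```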
Can you post the full command line you ran and what you have entered
in the crawl-urlfilter.txt file?
Thanks,
Adam
On Sun, Feb 20, 2011 at 2:17 PM, McGibbney, Lewis John wrote:
> Hello list,
>
> Whilst using Nutch-1.2 on ubuntu 10.04 and undertaking a crawl either using
> crawl command or separ
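For comparison while debugging, a known-good minimal setup looks like this (example.com is a placeholder for the target domain):

```
# conf/crawl-urlfilter.txt -- include one domain, exclude everything else:
#   +^http://([a-z0-9]*\.)*example.com/
#   -.
# urls/seed.txt holds the start URLs, one per line. Then:
bin/nutch crawl urls -dir crawl -depth 3 -topN 50
```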
All,
I downloaded the Nutch 1.2 binaries from here
http://www.bizdirusa.com/mirrors/apache//nutch/ and get the following error
when running it from a Cygwin console on a Windows 7 machine.
$ bin/nutch crawl urls -depth 50 -threads 10 -topN 50 -solr
http://localhost:8983/solr
Exception in thread "
> whereis java
> locate java
>
> and if none of those come back, how about java -version?
>
> Cheers,
> Chris
>
> On Apr 21, 2011, at 7:01 PM, Adam Estrada wrote:
something?
Thanks,
Adam Estrada
On Sun, Apr 24, 2011 at 10:35 PM, Adam Estrada <estrada.adam.gro...@gmail.com> wrote:
> All,
>
> I use Nutch to crawl selected websites and then store the results in Solr.
> In Nutch 1.1, I was able to do this using the -solr
> http://localhost:8983/solr command. This does not
All,
I have a root domain and a couple directories deep I have some files that I
want to index. The problem is that they are not referenced on the main page
using a hyperlink or anything like that.
http://www.geoglobaldomination.org/kml/temp/
I want to be able to crawl down in to /kml/temp/ with
i under command line options
>
> On Wed, Aug 24, 2011 at 9:03 PM, Adam Estrada <estrada.adam.gro...@gmail.com> wrote:
>
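Since nothing links down to /kml/temp/, link-following alone will never reach it; one workaround is to seed that directory URL directly. Sketch below, with an illustrative filter line. Note this only helps if the server exposes a directory listing at that URL, so the fetcher has links to the individual files:

```
# urls/seed.txt -- add the unlinked directory itself:
http://www.geoglobaldomination.org/kml/temp/

# conf/crawl-urlfilter.txt -- make sure the path is allowed, e.g.:
#   +^http://([a-z0-9]*\.)*geoglobaldomination\.org/kml/
```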