Susam,
Oh really! I thought the contents were fetched dynamically via GET/POST by
this method. Then it is probably the rendering of the HTML on the page that
is taking time in my application. I am printing the HTML pages that I get in
the search results, and every page takes 3-4 seconds to get displayed.
Thanks for
Thanks, Susam,
This worked perfectly for me. Thanks for the reply.
*One more concern:*
Does this method fetch the contents (source code) of the web page
dynamically? For example, if the hit for which we want the source code is
"www.yahoo.com", this method will send a GET/POST request to the Yahoo server
d
Hi All,
*Can anyone help me with this problem?*
Here is my problem:
I want to get the source code of the hits I get using the Nutch crawler. I am
not sure whether Nutch stores the content of a web page (i.e. the actual
source code of the page) in the crawled results. I am afraid it does not!
If nut
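To the question above: Nutch does keep the raw page source. When the
fetcher.store.content property is true (the default in this era), the fetched
bytes are written into each segment's content directory, and bin/nutch readseg
-dump can print them from the command line. Below is a minimal sketch of
reading them back programmatically, assuming the Nutch 1.x-era API; the
segment path is a placeholder, not a real one.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.Text;
  import org.apache.nutch.protocol.Content;
  import org.apache.nutch.util.NutchConfiguration;

  public class DumpSegmentContent {
    public static void main(String[] args) throws Exception {
      Configuration conf = NutchConfiguration.create();
      FileSystem fs = FileSystem.get(conf);
      // Placeholder: one part of a segment's 'content' MapFile directory.
      Path data = new Path("crawl/segments/20090512120000/content/part-00000/data");
      SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
      Text url = new Text();
      Content content = new Content();
      while (reader.next(url, content)) {
        // getContent() returns the raw fetched bytes, i.e. the page source.
        System.out.println(url + " : " + content.getContent().length + " bytes");
      }
      reader.close();
    }
  }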
Hi Alexander,
I tried to debug by placing some XML parsing errors in the config files and
found that Nutch wasn't looking at the right location. Once that was fixed,
some URLs were still skipped, and it turned out that was due to a robots.txt
file forbidding Nutch from crawling certain URLs.
Thanks, kena
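For reference, this is the kind of robots.txt rule that makes Nutch skip URLs;
Nutch honors these rules by default, and the paths here are made-up examples:

  # robots.txt at the site root; applies to all crawlers.
  # A "User-agent: Nutch" section would target Nutch specifically.
  User-agent: *
  Disallow: /private/
  Disallow: /cgi-bin/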
Thanks for all the responses. I observed one more related issue.
Last month, before this post, I noticed that if Hadoop runs out of tmp space,
Nutch goes into a very long loop in the reduce phase of the fetch. I don't
have the exact string; basically it says:
on node A, available space = 4G, expected usage
Hi,
The message is a little bit old (Jan 2009), so I don't know if your problem
has been fixed by now... in any case, I have had the same problem trying to
run NutchBean from a standalone Java application... in my case, I fixed the
problem by adding to the source code of my application the followi
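The poster's actual code is cut off above. For anyone hitting the same issue,
one commonly mentioned approach (an assumption here, not necessarily the
poster's fix) is to build the configuration with NutchConfiguration, so that
nutch-default.xml and nutch-site.xml are loaded, and to point searcher.dir at
the crawl directory before constructing the bean; the path is a placeholder:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.nutch.searcher.NutchBean;
  import org.apache.nutch.util.NutchConfiguration;

  public class StandaloneSearch {
    public static void main(String[] args) throws Exception {
      // Loads nutch-default.xml and nutch-site.xml from the classpath,
      // which a plain "new Configuration()" would not do.
      Configuration conf = NutchConfiguration.create();
      // Placeholder: the crawl directory holding crawldb, segments, index.
      conf.set("searcher.dir", "/path/to/crawl");
      NutchBean bean = new NutchBean(conf);
      System.out.println("NutchBean ready: " + bean);
    }
  }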
Hi Andrzej,
Thanks for your advice!
I've introduced a new metadatum, which I add to the CrawlDb for each record
using:
  CrawlDatum.getMetaData().put(new Text("_fft_"), new Text(String.valueOf(firstFoundLong)));
This works as expected, and during debugging I see all the metadata in each
Cra
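A sketch of how one might verify that metadatum after an update, by scanning a
CrawlDb part file directly with the Nutch 1.x-era API; the path is a
placeholder and the key name mirrors the snippet above. From the command line,
bin/nutch readdb <crawldb> -dump gives a similar view.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.io.Writable;
  import org.apache.nutch.crawl.CrawlDatum;
  import org.apache.nutch.util.NutchConfiguration;

  public class CheckFirstFound {
    public static void main(String[] args) throws Exception {
      Configuration conf = NutchConfiguration.create();
      FileSystem fs = FileSystem.get(conf);
      // Placeholder: one part of the CrawlDb's 'current' MapFile directory.
      Path data = new Path("crawl/crawldb/current/part-00000/data");
      SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
      Text url = new Text();
      CrawlDatum datum = new CrawlDatum();
      while (reader.next(url, datum)) {
        // Look up the metadatum stored under the "_fft_" key.
        Writable fft = datum.getMetaData().get(new Text("_fft_"));
        if (fft != null) {
          System.out.println(url + "\t_fft_=" + fft);
        }
      }
      reader.close();
    }
  }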
I've set up Nutch on Windows with Tomcat and it works well. The only problem
I have is that I want to do a regular reindex of the web sites, but the
indexer can't overwrite the crawl data directory (or maybe files within it)
because it is somehow locked by the Nutch web app. I have to shut down
rest
On Mon, May 11, 2009 at 2:12 PM, Raymond Balmès wrote:
> As a side question, is it possible to control where tmp files are stored?
> I recently ran into a full filesystem, so I wish to move those files to a
> bigger partition.
>
> -Ray-
You can use the 'hadoop.tmp.dir' property in hadoop-site.xml.
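For the pre-0.20 Hadoop of this era, that property goes into
conf/hadoop-site.xml; the value below is just an example path on a larger
partition:

  <property>
    <name>hadoop.tmp.dir</name>
    <value>/data/hadoop-tmp</value>
    <description>Base for Hadoop's temporary directories (example path).</description>
  </property>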
2009/5/11 Andrzej Bialecki
> ravi jagan wrote:
>
>> Cluster Summary
>>
>> I am running a crawl on about 1 Million web d