Re: Content(source code) of web pages crawled by nutch

2009-05-11 Thread Gaurang Patel
Susam, Oh really! I thought the contents were dynamically GET/POSTed by this method. Then the HTML is probably taking time to render in my application. Actually, I am printing the HTML pages that I got in the search results. Every page takes 3-4 seconds to display. Thanks for

Re: Content(source code) of web pages crawled by nutch

2009-05-11 Thread Susam Pal
On Tue, May 12, 2009 at 10:56 AM, Gaurang Patel wrote: > Thanks Susam, > > This worked perfectly for me. Thanks for the reply. > > *One more concern:* > Does this method fetch the contents (source code) of the web page > dynamically? For example, if the hit for which we want the source code is " > www.y

Re: Content(source code) of web pages crawled by nutch

2009-05-11 Thread Gaurang Patel
Thanks Susam, this worked perfectly for me. Thanks for the reply. *One more concern:* Does this method fetch the contents (source code) of the web page dynamically? For example, if the hit for which we want the source code is "www.yahoo.com", this method will send a GET/POST request to the yahoo server d

Re: Content(source code) of web pages crawled by nutch

2009-05-11 Thread Susam Pal
On Tue, May 12, 2009 at 8:50 AM, Gaurang Patel wrote: > Hi All, > > Can anyone help me with this problem? > > Here is my problem: > > I want to get the source code of the hits I get using the nutch crawler. I am > not sure whether nutch stores the content of a web page (i.e. the actual source > code for

Content(source code) of web pages crawled by nutch

2009-05-11 Thread Gaurang Patel
Hi All, can anyone help me with this problem? Here is my problem: I want to get the source code of the hits I get using the nutch crawler. I am not sure whether nutch stores the content of a web page (i.e. the actual source code for the web page) in the crawled results. I am afraid it does not! If nut
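Nutch does keep the fetched page content in its segment data, and the segment reader can dump it for inspection. A minimal sketch using Nutch 1.0's command-line tools (the crawl directory and segment timestamp below are illustrative placeholders; substitute your own):

```shell
# Dump everything a segment holds for each URL, including the raw
# fetched content (the page source). Output lands under segdump/.
bin/nutch readseg -dump crawl/segments/20090511103000 segdump

# To see only the fetched content, suppress the other record types.
bin/nutch readseg -dump crawl/segments/20090511103000 segdump \
    -nofetch -nogenerate -noparse -noparsedata -noparsetext
```

These commands require a Nutch installation and an existing crawl, so treat them as a sketch of the approach rather than a copy-paste recipe.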

Re: Registered plugin never invoked and urls skipped

2009-05-11 Thread kazam
Hi Alexander, I tried to debug by placing some XML parsing errors in the config files and found that nutch wasn't looking at the right location. Once that was fixed, some URLs were still skipped, and it turned out this was due to a robots.txt file forbidding nutch to crawl certain URLs. Thanks. kena
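For reference, a robots.txt rule of the kind described would look like the fragment below (the path is an illustrative assumption, not from the thread). Nutch honors these rules by default, so matching URLs are silently skipped during the fetch:

```
User-agent: *
Disallow: /private/
```

If you need to confirm this is the cause, fetching the site's /robots.txt in a browser and checking the Disallow lines against the skipped URLs is usually enough.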

Re: Nutch1.0 hadoop dfs usage doesn't seem right. Experienced users please comment

2009-05-11 Thread ravi jagan
Thanks for all the responses. I observed one more related issue. Last month, before this post, I noticed that if hadoop runs out of tmp space, nutch goes into a very long loop in the reduce phase of the fetch. I don't have the exact string; basically it says on node A, available space = 4G, expected usage

Re: Nutch on Linux: common-terms.utf8 not found

2009-05-11 Thread nordez
Hi, the message is a little bit old (Jan 2009), so I don't know if your problem is now fixed or not... in any case, I have had the same problem trying to run NutchBean from a standalone Java application... in my case, I fixed the problem by adding to the source code of my application the followi

Re: Add new field to CrawlDatum

2009-05-11 Thread Koch Martina
Hi Andrzej, thanks for your advice! I've introduced a new metadatum which I've added to the CrawlDb for each record using CrawlDatum.getMetaData().put(new Text("_fft_"), new Text(String.valueOf(firstFoundLong))); This works as expected, and during debugging I see all the metadata in each Cra

Re-indexing with a live tomcat web app

2009-05-11 Thread golfman
I've set up nutch on Windows with Tomcat and it works well. The only problem I have is that I want to do a regular reindex of the web sites, but the indexer can't overwrite the crawl data directory (or maybe files within it) because it is somehow locked by the nutch web app. I have to shutdown rest

Re: Nutch1.0 hadoop dfs usage doesn't seem right. Experienced users please comment

2009-05-11 Thread Susam Pal
On Mon, May 11, 2009 at 2:12 PM, Raymond Balmès wrote: > As a side question, is it possible to control where tmp files are stored? > I recently ran into a full-filesystem situation, so I wish to move those files to a > bigger partition. > > -Ray- You can use the 'hadoop.tmp.dir' property in hadoop-site
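A sketch of the hadoop-site.xml override being suggested here; the directory path is an illustrative assumption, so point it at whatever partition has the space:

```xml
<!-- In conf/hadoop-site.xml: redirect Hadoop's temporary files
     (used heavily during the fetch/reduce phases) to a larger disk. -->
<property>
  <name>hadoop.tmp.dir</name>
  <value>/mnt/bigdisk/hadoop-tmp</value>
</property>
```

The property overrides the default from hadoop-default.xml; the daemons need a restart to pick it up.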

Re: Nutch1.0 hadoop dfs usage doesn't seem right. Experienced users please comment

2009-05-11 Thread Raymond Balmès
As a side question, is it possible to control where tmp files are stored? I recently ran into a full-filesystem situation, so I wish to move those files to a bigger partition. -Ray- 2009/5/11 Andrzej Bialecki > ravi jagan wrote: > >> Cluster Summary >> >> I am running a crawl on about 1 Million web d