[jira] Commented: (NUTCH-265) Getting Clustered results in better form.
[ http://issues.apache.org/jira/browse/NUTCH-265?page=comments#action_12413220 ] Dawid Weiss commented on NUTCH-265: --- If you just mean the user interface, then you can simply take the XSLT stylesheet from Carrot2 and reuse it in Nutch with the opensearch XML -- I believe there is even an example in Carrot2 of using opensearch, so you shouldn't have much trouble. Now, the phrases you wish to see on your screen won't always be so beautiful because search results clustering works on snippets extracted from search results. If you want clean and accurate labels then you'd need to use a predefined ontology or something -- I can't help you with that. Try playing around with the Carrot2 demo and see if the results satisfy your needs. If so, then rewriting Nutch's user interface to suit your needs shouldn't be a problem. If your expectations are more demanding then you'll need to think of some other solution. > Getting Clustered results in better form. > - > > Key: NUTCH-265 > URL: http://issues.apache.org/jira/browse/NUTCH-265 > Project: Nutch > Type: Improvement > Components: searcher > Versions: 0.7.2 > Reporter: Kris K > > The cluster results are coming with title and link to URL. For improvement it > should be clustered keyword phrases (Like Vivisimo type). Any person can > share their views on it. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
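As a toy illustration of Dawid's point (this is not Carrot2's actual algorithm, and the class below is invented purely for illustration): snippet-based clustering can only pick labels from terms that recur in the snippets themselves, which is why label quality tracks snippet quality rather than any ontology:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy sketch, NOT Carrot2: derive a cluster label from the snippets
// alone by picking the most frequent token. Real algorithms (e.g.
// suffix-based phrase extraction) are far more sophisticated, but the
// principle is the same: labels come only from the snippet text.
public class SnippetLabeler {
    /** Returns the most frequent token (length >= 4) across snippets. */
    public static String label(List<String> snippets) {
        Map<String, Integer> freq = new HashMap<String, Integer>();
        for (String s : snippets) {
            for (String tok : s.toLowerCase().split("\\W+")) {
                if (tok.length() < 4) continue; // crude short-word filter
                Integer c = freq.get(tok);
                freq.put(tok, c == null ? 1 : c + 1);
            }
        }
        String best = "";
        int bestCount = 0;
        for (Map.Entry<String, Integer> e : freq.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }
}
```

If the snippets are noisy, so is the label -- which is exactly why a predefined ontology would be needed for clean phrases like "Java Compiler".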
[jira] Commented: (NUTCH-265) Getting Clustered results in better form.
[ http://issues.apache.org/jira/browse/NUTCH-265?page=comments#action_12413214 ] Kris K commented on NUTCH-265: -- Dear Dawid, Yeah I want the same interface as you showed me before but I am not able to do that. My area of concern is like this: if I am searching for the keyword "Java", the clustered results should come in the following format: Java SourceCode Java Book Java Compiler Java Programming Java Server Java Technology Java Servlets Java Applets JavaScript. I really appreciate your help in pointing me in the right direction. > Getting Clustered results in better form. > - > > Key: NUTCH-265 > URL: http://issues.apache.org/jira/browse/NUTCH-265 > Project: Nutch > Type: Improvement > Components: searcher > Versions: 0.7.2 > Reporter: Kris K > > The cluster results are coming with title and link to URL. For improvement it > should be clustered keyword phrases (Like Vivisimo type). Any person can > share their views on it.
[jira] Closed: (NUTCH-285) LinkDb Fails rename doesn't create parent directories
[ http://issues.apache.org/jira/browse/NUTCH-285?page=all ] Andrzej Bialecki closed NUTCH-285: --- Fix Version: 0.8-dev Resolution: Fixed Fixed, I also applied the same fix in CrawlDb, which suffered from the same problem. > LinkDb Fails rename doesn't create parent directories > - > > Key: NUTCH-285 > URL: http://issues.apache.org/jira/browse/NUTCH-285 > Project: Nutch > Type: Bug > Versions: 0.8-dev > Environment: Windows XP Media Center 2005, Fedora Core 5, Java 5, DFS > Reporter: Dennis Kubes > Fix For: 0.8-dev > Attachments: create_linkdb_dirs.patch > > The LinkDb install method fails to correctly rename (move) the LinkDb working > directory to the final directory if the parent directories do not exist. > For example if I am creating a linkdb by the name of crawl/linkdb the install > method tries to rename the working linkdb directory (something like > linkdb-20060523 in root of DFS) to crawl/linkdb/current. But if the > crawl/linkdb directory does not already exist then the rename fails and the > linkdb-20060523 working directory stays in the root directory of the DFS for > the user. > The attached patch adds a mkdirs command to the install method to ensure that > the parent directories exist before trying to rename.
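The fix described in this issue can be sketched roughly as follows. This uses plain java.io.File for illustration only; the real patch works against the Nutch/Hadoop FileSystem API, and the class and method names here are made up:

```java
import java.io.File;

// Sketch of the NUTCH-285 fix: create the target's parent directories
// before renaming the working directory into place, so the rename does
// not silently fail and strand the working dir at the DFS root.
public class InstallSketch {
    /** Moves workingDir to target, creating target's parents first. */
    public static boolean install(File workingDir, File target) {
        File parent = target.getParentFile();
        if (parent != null && !parent.exists()) {
            parent.mkdirs(); // the essence of the one-line patch
        }
        return workingDir.renameTo(target);
    }
}
```

Without the mkdirs call, renameTo returns false when crawl/linkdb does not yet exist, which matches the failure described above.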
[jira] Created: (NUTCH-286) Handling common error-pages as 404
Handling common error-pages as 404 -- Key: NUTCH-286 URL: http://issues.apache.org/jira/browse/NUTCH-286 Project: Nutch Type: Improvement Reporter: Stefan Neufeind Idea: Some pages from some software-packages/scripts report an "http 200 ok" even though a specific page could not be found. An example I just found is: http://www.deteimmobilien.de/unternehmen/nbjmup;Uipnbt/IfsctuAefufjnnpcjmjfo/ef That's a TYPO3 page explaining in its standard layout and wording: "The requested page did not exist or was inaccessible." So I had the idea that somebody might create a plugin that could find commonly used formulations for "page does not exist" etc. and turn the page into a 404 before feeding it into the nutch-index - although the server responded with status 200 ok.
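A minimal sketch of the proposed plugin's core check. The class name and phrase list below are invented for illustration; a real plugin would make the phrase list configurable per language and per site:

```java
import java.util.regex.Pattern;

// Hypothetical soft-404 check for NUTCH-286: flag a 200 response as a
// "soft 404" when its body matches well-known "not found" wordings,
// so the indexer can treat it like a real 404.
public class Soft404Detector {
    // Illustrative phrase list; a production version would load these
    // from configuration, per language.
    private static final Pattern NOT_FOUND = Pattern.compile(
        "(page (did not exist|not found|could not be found)"
        + "|requested page .* (did not exist|was inaccessible))",
        Pattern.CASE_INSENSITIVE);

    /** Returns true if the body text looks like an error page. */
    public static boolean isSoft404(String body) {
        return NOT_FOUND.matcher(body).find();
    }
}
```

The TYPO3 wording quoted in the issue would match this pattern, while ordinary content would pass through untouched.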
Re: Mailing List nutch-agent Reports of Bots Submitting Forms
Ken Krugler wrote: Jeremy Bensley wrote: There are posts every three or four days to the nutch-agent regarding bots submitting empty forms to websites. I don't think I've seen any regular devs reply in-list to these issues, and am just wondering if these cases are being analyzed. 1. Is there a known (resolved or current) bug regarding Nutch submitting forms? I could find no bug listings in JIRA for this. If it is known and resolved, what versions of the bot exhibit this behavior? Yes, there was a discussion on the list about this - I'm afraid this behavior is present in both 0.7.x and 0.8. I'm going to remove the offending code (or make it an option, turned off by default). I think the biggest issue is following links for a form POST. This definitely seems wrong to me, and thus should never be done. I don't think this is happening anymore, there is an explicit check for POST method in DOMContentUtils that should prevent this. However, some horribly broken HTML may be fooling Neko or TagSoup, so that they lose the 'method' attribute (in which case it defaults to GET). There's a separate issue re whether it's OK to follow form links that do a GET, since that's what the guy complained to us about recently. He agreed that his form should be doing a POST, since it triggers a massive build process, but he also said that no other crawl besides Nutch was following these links. I could see making that a configurable option, where it was false by default. But we'd probably need to modify this setting to be domain-specific, i.e. some sites we crawl require us to follow these types of links to get at content, but in general we'd want to not follow them. For now I modified the code to skip form action URLs, depending on a boolean option. I'll commit this in a moment. This brings up an issue I've been thinking about. It might make sense to require everybody set the user-agent string, versus it having default values that point to Nutch. 
The first time you run Nutch, it would display an error re the user-agent string not being set, but if the instructions for how to do this were explicit, this wouldn't be much of a hardship for anybody trying it out. I could write up some quick text for the Wiki re what a good user agent string should contain, and what should be on the web page that it refers to, since we also went through that same process not too long ago. I like this idea. I know that I've been guilty of this in the past, out of pure laziness ... -- Best regards, Andrzej Bialecki <>< Information Retrieval, Semantic Web, Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
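The startup check proposed in this thread could look roughly like this. The property name follows Nutch's http.agent.name convention, but the sentinel value and class name below are illustrative assumptions, not the actual implementation:

```java
// Hypothetical sketch of a "refuse to run without a real user-agent"
// check: reject an unset, blank, or still-default agent name so every
// operator identifies their own crawler to webmasters.
public class AgentCheck {
    /** Throws if the agent name is unset or still a shipped default. */
    public static void checkUserAgent(String agentName) {
        if (agentName == null || agentName.trim().length() == 0
            || agentName.startsWith("NutchCVS")) { // assumed default sentinel
            throw new IllegalStateException(
                "No agent name set in 'http.agent.name'. "
                + "Please configure it to identify your crawler "
                + "and point it at a page describing your crawl.");
        }
    }
}
```

Called once at fetcher startup, this turns the "set your user-agent" convention into a hard requirement, as Ken suggests.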
Re: Extract infos from documents and query external sites
HellSpawn wrote: > Hi all, I'm new :) > > I have to extract some information from an address book in my site > (example: names and surnames) and then use it to build queries on sites like > scholar.google.com, indexing the result page with my crawler. Can I do it? > How? Not "out of the box". You'd have to figure out building query-strings (I assume they use GET-parameters) from your address book, and you could then "index" those URLs. For me the question though remains why you'd want to do that - but you could :-) Stefan
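Stefan's suggestion can be sketched as follows. The query parameter name "q" and the base URL are assumptions for illustration; check the target site's actual GET parameters before building seed URLs this way:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Sketch: turn an address-book entry into a GET query URL that can be
// injected as a crawl seed. The "q" parameter is an assumption about
// the target site's search form.
public class QueryUrlBuilder {
    /** Builds a search URL for one name from the address book. */
    public static String buildQueryUrl(String base, String name) {
        try {
            return base + "?q=" + URLEncoder.encode(name, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new RuntimeException(e); // UTF-8 is always available
        }
    }
}
```

The resulting URLs would then be fed to the injector like any other seed list, and the crawler fetches and indexes the result pages.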
Re: Mailing List nutch-agent Reports of Bots Submitting Forms
Jeremy Bensley wrote: There are posts every three or four days to the nutch-agent regarding bots submitting empty forms to websites. I don't think I've seen any regular devs reply in-list to these issues, and am just wondering if these cases are being analyzed. 1. Is there a known (resolved or current) bug regarding Nutch submitting forms? I could find no bug listings in JIRA for this. If it is known and resolved, what versions of the bot exhibit this behavior? Yes, there was a discussion on the list about this - I'm afraid this behavior is present in both 0.7.x and 0.8. I'm going to remove the offending code (or make it an option, turned off by default). I think the biggest issue is following links for a form POST. This definitely seems wrong to me, and thus should never be done. There's a separate issue re whether it's OK to follow form links that do a GET, since that's what the guy complained to us about recently. He agreed that his form should be doing a POST, since it triggers a massive build process, but he also said that no other crawl besides Nutch was following these links. I could see making that a configurable option, where it was false by default. But we'd probably need to modify this setting to be domain-specific, i.e. some sites we crawl require us to follow these types of links to get at content, but in general we'd want to not follow them. 2. Are the Nutch Devs replying to the emails sent to this list? I could understand if they are replying off-list, but to an outside observer such as myself it appears as though webmasters are not getting many replies to their inquiries. I can speak for myself only .. I'm not tracking that list. What about others? I did respond to John Masone at MacFixer.net, to get the URL to the form where Nutch was triggering a submit. 
So just FYI for testing the fix, it's: http://www.macfixer.net/contact I don't mean to be alarmist, but I think it is in the community's best interests to make sure that these kinds of complaints get resolved such that nutch is a good 'citizen' and isn't blacklisted from searching sites. Of course you are right, there is no ill will here on our part, just a long queue of issues to address ... but it seems we have to prioritize this one. This brings up an issue I've been thinking about. It might make sense to require everybody set the user-agent string, versus it having default values that point to Nutch. The first time you run Nutch, it would display an error re the user-agent string not being set, but if the instructions for how to do this were explicit, this wouldn't be much of a hardship for anybody trying it out. I could write up some quick text for the Wiki re what a good user agent string should contain, and what should be on the web page that it refers to, since we also went through that same process not too long ago. -- Ken -- Ken Krugler Krugle, Inc. +1 530-210-6378 "Find Code, Find Answers"
[jira] Updated: (NUTCH-285) LinkDb Fails rename doesn't create parent directories
[ http://issues.apache.org/jira/browse/NUTCH-285?page=all ] Dennis Kubes updated NUTCH-285: --- Attachment: create_linkdb_dirs.patch Patch to add mkdirs call to install method of LinkDb to ensure parent directories exist before attempting rename. > LinkDb Fails rename doesn't create parent directories > - > > Key: NUTCH-285 > URL: http://issues.apache.org/jira/browse/NUTCH-285 > Project: Nutch > Type: Bug > Versions: 0.8-dev > Environment: Windows XP Media Center 2005, Fedora Core 5, Java 5, DFS > Reporter: Dennis Kubes > Attachments: create_linkdb_dirs.patch > > The LinkDb install method fails to correctly rename (move) the LinkDb working > directory to the final directory if the parent directories do not exist. > For example if I am creating a linkdb by the name of crawl/linkdb the install > method tries to rename the working linkdb directory (something like > linkdb-20060523 in root of DFS) to crawl/linkdb/current. But if the > crawl/linkdb directory does not already exist then the rename fails and the > linkdb-20060523 working directory stays in the root directory of the DFS for > the user. > The attached patch adds a mkdirs command to the install method to ensure that > the parent directories exist before trying to rename.
[jira] Created: (NUTCH-285) LinkDb Fails rename doesn't create parent directories
LinkDb Fails rename doesn't create parent directories - Key: NUTCH-285 URL: http://issues.apache.org/jira/browse/NUTCH-285 Project: Nutch Type: Bug Versions: 0.8-dev Environment: Windows XP Media Center 2005, Fedora Core 5, Java 5, DFS Reporter: Dennis Kubes Attachments: create_linkdb_dirs.patch The LinkDb install method fails to correctly rename (move) the LinkDb working directory to the final directory if the parent directories do not exist. For example if I am creating a linkdb by the name of crawl/linkdb the install method tries to rename the working linkdb directory (something like linkdb-20060523 in root of DFS) to crawl/linkdb/current. But if the crawl/linkdb directory does not already exist then the rename fails and the linkdb-20060523 working directory stays in the root directory of the DFS for the user. The attached patch adds a mkdirs command to the install method to ensure that the parent directories exist before trying to rename.
Re: Mailing List nutch-agent Reports of Bots Submitting Forms
Jeremy Bensley wrote: There are posts every three or four days to the nutch-agent regarding bots submitting empty forms to websites. I don't think I've seen any regular devs reply in-list to these issues, and am just wondering if these cases are being analyzed. 1. Is there a known (resolved or current) bug regarding Nutch submitting forms? I could find no bug listings in JIRA for this. If it is known and resolved, what versions of the bot exhibit this behavior? Yes, there was a discussion on the list about this - I'm afraid this behavior is present in both 0.7.x and 0.8. I'm going to remove the offending code (or make it an option, turned off by default). 2. Are the Nutch Devs replying to the emails sent to this list? I could understand if they are replying off-list, but to an outside observer such as myself it appears as though webmasters are not getting many replies to their inquiries. I can speak for myself only .. I'm not tracking that list. What about others? I don't mean to be alarmist, but I think it is in the community's best interests to make sure that these kinds of complaints get resolved such that nutch is a good 'citizen' and isn't blacklisted from searching sites. Of course you are right, there is no ill will here on our part, just a long queue of issues to address ... but it seems we have to prioritize this one. -- Best regards, Andrzej Bialecki <>< Information Retrieval, Semantic Web, Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Mailing List nutch-agent Reports of Bots Submitting Forms
There are posts every three or four days to the nutch-agent regarding bots submitting empty forms to websites. I don't think I've seen any regular devs reply in-list to these issues, and am just wondering if these cases are being analyzed. 1. Is there a known (resolved or current) bug regarding Nutch submitting forms? I could find no bug listings in JIRA for this. If it is known and resolved, what versions of the bot exhibit this behavior? 2. Are the Nutch Devs replying to the emails sent to this list? I could understand if they are replying off-list, but to an outside observer such as myself it appears as though webmasters are not getting many replies to their inquiries. I don't mean to be alarmist, but I think it is in the community's best interests to make sure that these kinds of complaints get resolved such that nutch is a good 'citizen' and isn't blacklisted from searching sites. Thanks, Jeremy
[jira] Created: (NUTCH-284) NullPointerException during index
NullPointerException during index - Key: NUTCH-284 URL: http://issues.apache.org/jira/browse/NUTCH-284 Project: Nutch Type: Bug Components: indexer Versions: 0.8-dev Reporter: Stefan Neufeind For quite a while this "reduce > sort" has been going on. Then it fails. What could be wrong with this? 060524 212613 reduce > sort 060524 212614 reduce > sort 060524 212615 reduce > sort 060524 212615 found resource common-terms.utf8 at file:/home/mm/nutch-nightly-prod/conf/common-terms.utf8 060524 212615 found resource common-terms.utf8 at file:/home/mm/nutch-nightly-prod/conf/common-terms.utf8 060524 212619 Optimizing index. 060524 212619 job_jlbhhm java.lang.NullPointerException at org.apache.nutch.indexer.Indexer$OutputFormat$1.write(Indexer.java:111) at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:269) at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:253) at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:282) at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:114) Exception in thread "main" java.io.IOException: Job failed! at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:341) at org.apache.nutch.indexer.Indexer.index(Indexer.java:287) at org.apache.nutch.indexer.Indexer.main(Indexer.java:304)
[jira] Commented: (NUTCH-70) duplicate pages - virtual hosts in db.
[ http://issues.apache.org/jira/browse/NUTCH-70?page=comments#action_12413169 ] Stefan Neufeind commented on NUTCH-70: -- Is the content exactly the same? Maybe the page could be checked against an already existing one by an MD5 on the content? But I'm not sure if there is a clean way to work around the problem - what if all pages are the same except one, on the other vhost? Would have to crawl all anyway, wouldn't you? > duplicate pages - virtual hosts in db. > -- > > Key: NUTCH-70 > URL: http://issues.apache.org/jira/browse/NUTCH-70 > Project: Nutch > Type: Bug > Environment: 0,7 dev > Reporter: YourSoft > > Dear Developers, > I have a problem with nutch: > - There are many sites duplicates in the webdb and in the segments. > The source of this problem is: > - If the site make 'virtual hosts' (like Apache), e.g. www.origo.hu, > origo.hu, origo.matav.hu, origo.matavnet.hu etc.: the result pages are the > same, only the inlinks are differents. > - The ip address is the same. > - When search, all virtualhosts are in the results. > Google only show one of these virtual hosts, the nutch show all. The result > nutch db is larger, and this case slower, than google. > Have any idea, how to remove these duplicates? > Regards, > Ferenc
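The MD5 idea from Stefan's comment can be sketched like this (illustrative class, not Nutch's dedup code, though Nutch's deduplication relies on a similar content digest): identical content served under different virtual hosts yields the same digest, so only one copy needs to be kept:

```java
import java.math.BigInteger;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch: pages fetched from www.origo.hu and origo.hu with identical
// bytes produce identical MD5 digests, so the digest can serve as a
// duplicate-detection key independent of the host name.
public class ContentDigest {
    /** Hex MD5 of the raw page content. */
    public static String md5(byte[] content) {
        try {
            byte[] d = MessageDigest.getInstance("MD5").digest(content);
            return String.format("%032x", new BigInteger(1, d));
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e); // MD5 is always available
        }
    }
}
```

As the comment notes, this only helps after fetching: all vhosts still have to be crawled before their contents can be compared, and one differing page per vhost defeats a whole-site shortcut.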
Querying a site by extracting doc information
Hi all, I'm new here :) I have to extract some information from a crawled document in my site (for example: name and surname), using it to build a query for another site (like scholar.google.com) and having the result indexed by Nutch. Can I do this? How? Thank you Rosario Salatiello
[jira] Commented: (NUTCH-44) too many search results
[ http://issues.apache.org/jira/browse/NUTCH-44?page=comments#action_12413155 ] Stefan Neufeind commented on NUTCH-44: -- hi, any progress on this? > too many search results > --- > > Key: NUTCH-44 > URL: http://issues.apache.org/jira/browse/NUTCH-44 > Project: Nutch > Type: Bug > Components: web gui > Environment: web environment > Reporter: Emilijan Mirceski > > There should be a limitation (user defined) on the number of results the > search engine can return. > For example, if one modifies the search url as: > http:///search.jsp?query=&hitsPerPage=2&hitsPerSite=0 > The search will try to return 20,000 pages which isn't good for the server > side performance. > Is it possible to have a setting in the config xml files to control this? > Thanks, > Emilijan -- 
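The requested safeguard amounts to clamping the user-supplied value before querying. The class, method, and limit below are illustrative only; in a real fix the maximum would come from a config property rather than a hard-coded number:

```java
// Sketch of the NUTCH-44 request: never trust hitsPerPage from the
// URL; clamp it to a server-side configured maximum before searching.
public class HitsLimit {
    /** Returns hitsPerPage clamped to the range [1, maxHitsPerPage]. */
    public static int clamp(int hitsPerPage, int maxHitsPerPage) {
        if (hitsPerPage < 1) return 1;
        return Math.min(hitsPerPage, maxHitsPerPage);
    }
}
```

With a configured maximum of, say, 100, a crafted URL asking for 20,000 hits per page would be served at most 100, protecting server-side performance.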
[jira] Created: (NUTCH-283) If the Fetcher times out and abandons Fetcher Threads, severe errors will occur on those Threads
If the Fetcher times out and abandons Fetcher Threads, severe errors will occur on those Threads Key: NUTCH-283 URL: http://issues.apache.org/jira/browse/NUTCH-283 Project: Nutch Type: Bug Components: fetcher Versions: 0.8-dev Reporter: Scott Ganyo Attachments: patch.txt If a Fetcher has chosen to time out and has abandoned outstanding Fetcher Threads, resources that those Fetcher Threads may be using are closed. This naturally causes any abandoned Fetcher Threads to fail when they later attempt to finish up their work in progress. I have a patch that addresses this that I am attaching.
[jira] Updated: (NUTCH-283) If the Fetcher times out and abandons Fetcher Threads, severe errors will occur on those Threads
[ http://issues.apache.org/jira/browse/NUTCH-283?page=all ] Scott Ganyo updated NUTCH-283: -- Attachment: patch.txt > If the Fetcher times out and abandons Fetcher Threads, severe errors will > occur on those Threads > > > Key: NUTCH-283 > URL: http://issues.apache.org/jira/browse/NUTCH-283 > Project: Nutch > Type: Bug > Components: fetcher > Versions: 0.8-dev > Reporter: Scott Ganyo > Attachments: patch.txt > > If a Fetcher has chosen to time out and has abandoned outstanding Fetcher > Threads, resources that those Fetcher Threads may be using are closed. This > naturally causes any abandoned Fetcher Threads to fail when they later > attempt to finish up their work in progress. > I have a patch that addresses this that I am attaching.
Extract infos from documents and query external sites
Hi all, I'm new :) I have to extract some information from an address book in my site (example: names and surnames) and then use it to build queries on sites like scholar.google.com, indexing the result page with my crawler. Can I do it? How? Thank you Rosario Salatiello
Fetcher and MapReduce
Hi, I'm trying to crawl approx. 500.000 urls. After inject and generate I started fetchers using 6 map tasks and 3 reduce tasks. All the map tasks had successfully completed while all the reduce tasks got an OutOfMemory exception. This exception was caught after the append phase (during the sort phase). As far as I observed, during a fetch operation, all the map tasks output to a temp. sequence file. During the reduce operation, each reducer copies all map outputs to its local disk and appends them to a single seq. file. After this operation the reducer tries to sort this file and outputs the sorted file to its local disk. And then, a record writer is opened to write this sorted file to the segment, which is in DFS. If this scenario is correct, then all the reduce tasks are supposed to do the same job. All try to sort the whole map outputs and the winner of this operation will be able to write to dfs. So only one reducer is expected to write to dfs. If this is the case then an OutOfMemory exception will not be surprising for 500.000+ urls, since reducers will try to sort a file bigger than 1 GB. Any comments on this scenario are welcome. And how can I avoid these exceptions? Thanx, -- Hamza KAYA
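One note on the scenario described above: with Hadoop's default hash partitioning, each reduce task copies, appends, and sorts only its own partition of the map output, not the whole data set, so the three reducers should split the work rather than all sorting everything with one "winner". The default partitioner's key-to-reducer assignment is essentially:

```java
// Minimal version of Hadoop's default hash-partitioning logic: each
// key is deterministically assigned to exactly one of the reduce
// tasks, so a reducer only ever sees (and sorts) its own partition.
public class HashPartitionSketch {
    /** Maps a key to one of numReduceTasks partitions. */
    public static int getPartition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```

Even so, a third of the fetch output can still exceed a reducer's heap during the sort, so raising the child JVM heap or the number of reduce tasks are the usual remedies.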