entrance point of Nutch search page

2006-03-03 Thread Michael Ji
hi,

Which JSP file is the entry point for the Nutch search page?

I saw Nutch using

search(Query query, int numHits, String dedupField,
String sortField, boolean reverse)

to get the search results.

But I'm not sure which JSP triggers this function.

Does it run in the Tomcat container?

thanks,

Michael



Re: svn commit: r378655 - in /lucene/nutch/trunk/src/plugin: ./ analysis-de/ analysis-fr/ clustering-carrot2/ creativecommons/ index-basic/ index-more/ languageidentifier/ lib-commons-httpclient/ lib-

2006-03-03 Thread Jérôme Charron
On 3/3/06, Doug Cutting <[EMAIL PROTECTED]> wrote:
>
> Jérôme Charron wrote:
> > Here is my proposal. For each plugin:
> > * Define a target containing core (will be used when building single
> plugin)
> > * Define a target not containing core (will be used when building whole
> > code)
> > I commit this as soon as possible.
>
> That sounds perfect.  Thanks!

Committed.
Quick benchmarks:
* Before: around 70s
* After: around 50s

Better, but not so perfect...  :-(

Jérôme


Re: svn commit: r378655 - in /lucene/nutch/trunk/src/plugin: ./ analysis-de/ analysis-fr/ clustering-carrot2/ creativecommons/ index-basic/ index-more/ languageidentifier/ lib-commons-httpclient/ lib-

2006-03-03 Thread Doug Cutting

Jérôme Charron wrote:

Here is my proposal. For each plugin:
* Define a target containing core (will be used when building single plugin)
* Define a target not containing core (will be used when building whole
code)
I commit this as soon as possible.


That sounds perfect.  Thanks!

Doug


Nutch Crawl Vs. Merge Time Complexity

2006-03-03 Thread Alex
Hi there,

I got a couple of questions that I need help with, Please help.

I'm sort of new to this nutch-dev mailing list, and I'm not quite sure how or
what the appropriate way is to get involved with the Nutch development group.
Please let me know who I should contact regarding issues and questions
about Nutch.

I've been using Nutch and customizing it so that the returned search results
can be managed through paging on the web. I'm doing this for my company,
and my supervisor has agreed to contribute the paging code to the Nutch
community. Please help guide me on how to proceed with this.

Finally, a technical question. I've been using Nutch v0.7, running it on our
company's Unix system, set up to crawl our intranet sites for updates daily.
I've tried using merge, dedup, updatedb, etc., and I noticed that the time
complexity and efficiency were worse than doing a fresh new crawl. For
example, if I have two separate crawls from two different domains, such as
hotmail and yahoo, what would the time complexity be for Nutch to crawl
these two domains and then do a merge, compared to just doing a single full
crawl of both domains? My guess is that it would take Nutch about the same
amount of time either way; if that is so, is there a reason to use merge at
all? Please let me know what you think. I'm still trying to understand how
Nutch behaves, and I don't mean to criticize anyone who has worked on the
merge feature for Nutch.

Thanks.

Alex





RE: OutOfMemoryError/Restarting Crawl/Indexing what has already been crawled

2006-03-03 Thread Richard Braman
I think this may be a bug.

-Original Message-
From: Richard Braman [mailto:[EMAIL PROTECTED] 
Sent: Thursday, March 02, 2006 8:28 PM
To: nutch-dev@lucene.apache.org
Subject: OutOfMemoryError/Restarting Crawl/Indexing what has already
been crawled


I have nutch running on a Compaq DL 380 w/ 1GB of RAM, not my best
machine, but I am only doing a limited crawl of about 52 urls.  When I
do the crawl with depth = 3 or even 6, it completes, when I do it at 10,
it has been running out of memory.  
 
2 questions 
 
1. How do I restart the crawl?
I have seen the tutorial, which says:
"

 Recover the pages already fetched and then restart the fetcher. You'll
need to create a file fetcher.done in the segment directory and then run
updatedb, generate, and fetch. Assuming your index is at /index:

% touch /index/segments/2005somesegment/fetcher.done 

% bin/nutch updatedb /index/db/ /index/segments/2005somesegment/

% bin/nutch generate /index/db/ /index/segments/2005somesegment/

% bin/nutch fetch /index/segments/2005somesegment

All the pages that were not crawled will be re-generated for fetch. If
you fetched lots of pages, and don't want to have to re-fetch them
again, this is the best way.

", 

but I have more than one segment. Do I only need to do this for the most
recent one, or for all of them?
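For what it's worth, applying the tutorial's recovery steps to several segments is easy to script. This is only a sketch: the /index/db and /index/segments layout comes from the tutorial quote above, and whether every old segment (rather than just the interrupted one) actually needs the fetcher.done marker is exactly the open question here.

```shell
#!/bin/sh
# Sketch (untested against Nutch 0.7): mark each segment as fetched and
# fold its links into the web db, then generate a fresh segment for the
# remaining pages. Pass "echo" as the third argument to dry-run the
# nutch invocations.
recover_segments() {
  db_dir=$1; segments_dir=$2; nutch=${3:-bin/nutch}
  for seg in "$segments_dir"/*/; do
    [ -d "$seg" ] || continue
    touch "${seg}fetcher.done"        # tell Nutch this segment is fully fetched
    $nutch updatedb "$db_dir" "$seg"  # update the web db from this segment
  done
  # then generate (and later fetch) a new segment for the unfetched pages
  $nutch generate "$db_dir" "$segments_dir"
}

# e.g.: recover_segments /index/db /index/segments
```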

2. How do I index what I have already crawled?
I have seen the indexing section in the tutorial. When I run bin/nutch
invertlinks (using Cygwin), it gives me:
Exception in thread "main" java.lang.NoClassDefFoundError: invertlinks

The fetcher exited with:
 
060302 165825 SEVERE error writing output: java.lang.OutOfMemoryError:
Java heap space
java.lang.OutOfMemoryError: Java heap space
Exception in thread "main" java.lang.RuntimeException: SEVERE error
logged.  Exiting fetcher.
 at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:354)
 at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:488)
 at org.apache.nutch.tools.CrawlTool.main(CrawlTool.java:140)
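One possible workaround for the heap exhaustion itself, assuming the stock bin/nutch launcher script is used: give the JVM a larger maximum heap before starting the crawl. Whether your copy of the script reads NUTCH_HEAPSIZE should be verified in the script before relying on this.

```shell
# Assumption: bin/nutch honors NUTCH_HEAPSIZE (max heap, in MB);
# check your copy of the script before relying on it.
export NUTCH_HEAPSIZE=768
bin/nutch crawl urls -dir crawl.test -depth 10
```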

 

Richard Braman
mailto:[EMAIL PROTECTED]
561.748.4002 (voice) 

http://www.taxcodesoftware.org  
Free Open Source Tax Software

 



[jira] Commented: (NUTCH-221) prepare nutch for upcoming lucene 2.0

2006-03-03 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-221?page=comments#action_12368779 ] 

Doug Cutting commented on NUTCH-221:


+1  Thanks!

> prepare nutch for upcoming lucene 2.0
> -
>
>  Key: NUTCH-221
>  URL: http://issues.apache.org/jira/browse/NUTCH-221
>  Project: Nutch
> Type: Task
>  Environment: all
> Reporter: Sami Siren
> Assignee: Sami Siren
> Priority: Minor
>  Fix For: 0.8-dev
>  Attachments: nutch-lucene-deprecation.txt
>
> Remove all deprecated uses of lucene as they will vanish in 2.0

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



Re: svn commit: r381751 - in /lucene/nutch/trunk: site/ src/java/org/apache/nutch/crawl/ src/java/org/apache/nutch/fetcher/ src/java/org/apache/nutch/indexer/ src/java/org/apache/nutch/parse/ src/java

2006-03-03 Thread Doug Cutting

Jérôme Charron wrote:

It seems that the NUTCH-143 patch has been committed too... was that intentional?


That was indeed a mistake.  Thanks for catching it!  I just reverted the 
unintentional changes.  Thanks also to:


http://svnbook.red-bean.com/en/1.0/ch04s04.html#svn-ch-4-sect-4.2

Doug


[jira] Updated: (NUTCH-221) prepare nutch for upcoming lucene 2.0

2006-03-03 Thread Sami Siren (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-221?page=all ]

Sami Siren updated NUTCH-221:
-

Attachment: nutch-lucene-deprecation.txt

> prepare nutch for upcoming lucene 2.0
> -
>
>  Key: NUTCH-221
>  URL: http://issues.apache.org/jira/browse/NUTCH-221
>  Project: Nutch
> Type: Task
>  Environment: all
> Reporter: Sami Siren
> Assignee: Sami Siren
> Priority: Minor
>  Fix For: 0.8-dev
>  Attachments: nutch-lucene-deprecation.txt
>
> Remove all deprecated uses of lucene as they will vanish in 2.0




[jira] Created: (NUTCH-221) prepare nutch for upcoming lucene 2.0

2006-03-03 Thread Sami Siren (JIRA)
prepare nutch for upcoming lucene 2.0
-

 Key: NUTCH-221
 URL: http://issues.apache.org/jira/browse/NUTCH-221
 Project: Nutch
Type: Task
 Environment: all
Reporter: Sami Siren
 Assigned to: Sami Siren 
Priority: Minor
 Fix For: 0.8-dev


Remove all deprecated uses of lucene as they will vanish in 2.0




Re: svn commit: r381751 - in /lucene/nutch/trunk: site/ src/java/org/apache/nutch/crawl/ src/java/org/apache/nutch/fetcher/ src/java/org/apache/nutch/indexer/ src/java/org/apache/nutch/parse/ src/java

2006-03-03 Thread Jérôme Charron
> Adding DOAP for Nutch.  Contributed by Chris Mattmann.
>
> Added:
> lucene/nutch/trunk/site/doap.rdf
> Modified:
> lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Crawl.java
> lucene/nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDb.java
> lucene/nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDbReader.java
> lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Generator.java
> lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Injector.java
> lucene/nutch/trunk/src/java/org/apache/nutch/crawl/LinkDbReader.java
> lucene/nutch/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java
>
> lucene/nutch/trunk/src/java/org/apache/nutch/indexer/DeleteDuplicates.java
> lucene/nutch/trunk/src/java/org/apache/nutch/indexer/IndexMerger.java
> lucene/nutch/trunk/src/java/org/apache/nutch/indexer/Indexer.java
> lucene/nutch/trunk/src/java/org/apache/nutch/parse/ParseSegment.java
>
> lucene/nutch/trunk/src/java/org/apache/nutch/plugin/PluginRepository.java
>
> 
> lucene/nutch/trunk/src/java/org/apache/nutch/searcher/DistributedSearch.java
>
> lucene/nutch/trunk/src/java/org/apache/nutch/segment/SegmentReader.java

It seems that the NUTCH-143 patch has been committed too... was that intentional?

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: svn commit: r378655 - in /lucene/nutch/trunk/src/plugin: ./ analysis-de/ analysis-fr/ clustering-carrot2/ creativecommons/ index-basic/ index-more/ languageidentifier/ lib-commons-httpclient/ lib-

2006-03-03 Thread Jérôme Charron
> In a distributed configuration one needs to rebuild the job jar each
> time anything changes, and hence must check all plugins, etc.  So I
> would appreciate it if this didn't take quite so long.

Makes sense!
Here is my proposal. For each plugin:
* Define a target containing core (will be used when building single plugin)
* Define a target not containing core (will be used when building whole
code)
I commit this as soon as possible.
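Concretely, the proposal might look like a pair of targets in each plugin's build.xml. The target names below are invented for illustration and are not necessarily what was committed:

```xml
<!-- Hypothetical per-plugin build.xml fragment; target names
     ("deploy" / "deploy-nodeps") are assumptions. -->

<!-- building a single plugin standalone: compile core first -->
<target name="deploy" depends="compile-core, compile"/>

<!-- building the whole tree: the top-level build compiles core once,
     then invokes this core-free target in every plugin -->
<target name="deploy-nodeps" depends="compile"/>
```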

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: Unable to complete a full fetch, reason Child Error

2006-03-03 Thread Mike Smith
Hi Doug

I did some more testing using the latest svn. Child processes still die
after a while, without any clear log message.

I used two machines through Hadoop; both are datanodes and tasktrackers,
and one is also the namenode and jobtracker. I started with 2000 seed URLs,
and it went fine until the 4th cycle, reaching about 600,000 pages; the next
round had 3,000,000 pages to fetch. It failed again with this exception in
the middle of fetching:

060302 232934 task_m_7lbv7e  fetching
http://www.findarticles.com/p/articles/mi_m0KJI/is_9_115/ai_107836357
060302 232934 task_m_7lbv7e  fetching
http://www.wholehealthmd.com/hc/resourceareas_supp/1,1442,544,00.html
060302 232934 task_m_7lbv7e  fetching
http://www.dow.com/haltermann/products/d-petro.htm
060302 232934 task_m_7lbv7e 0.7877368% 700644 pages, 24594 errors,
14.0pages/s, 2254 kb/s,
060302 232934 task_m_7lbv7e  fetching
http://www.findarticles.com/p/articles/mi_hb3594/is_199510/ai_n8541042
060302 232934 task_m_7lbv7e Error reading child output
java.io.IOException: Bad file descriptor
        at java.io.FileInputStream.readBytes(Native Method)
        at java.io.FileInputStream.read(FileInputStream.java:194)
        at sun.nio.cs.StreamDecoder$CharsetSD.readBytes(StreamDecoder.java:411)
        at sun.nio.cs.StreamDecoder$CharsetSD.implRead(StreamDecoder.java:453)
        at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:183)
        at java.io.InputStreamReader.read(InputStreamReader.java:167)
        at java.io.BufferedReader.fill(BufferedReader.java:136)
        at java.io.BufferedReader.readLine(BufferedReader.java:299)
        at java.io.BufferedReader.readLine(BufferedReader.java:362)
        at org.apache.hadoop.mapred.TaskRunner.logStream(TaskRunner.java:299)
        at org.apache.hadoop.mapred.TaskRunner.access$100(TaskRunner.java:32)
        at org.apache.hadoop.mapred.TaskRunner$1.run(TaskRunner.java:266)
060302 232934 task_m_7lbv7e 0.7877451% 700644 pages, 24594 errors,
14.0pages/s, 2254 kb/s,
060302 232934 task_m_7lbv7e 0.7877451% 700644 pages, 24594 errors,
14.0pages/s, 2254 kb/s,
060302 232934 Server connection on port 50050 from 164.67.195.27: exiting
060302 232934 Server connection on port 50050 from 164.67.195.27: exiting
060302 232934 task_m_7lbv7e Child Error
java.io.IOException: Task process exit with nonzero status.
at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:273)
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:145)
060302 232937 task_m_7lbv7e done; removing files.


And this is console output:



060303 010945  map 86%  reduce 0%
060303 012033  map 86%  reduce 6%
060303 012223  map 87%  reduce 6%
060303 014623  map 88%  reduce 6%
060303 021304  map 89%  reduce 6%
060303 022921  map 50%  reduce 0%
060303 022921 SEVERE error, caught Exception in main()
java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:310)
at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:366)
at org.apache.nutch.fetcher.Fetcher.doMain(Fetcher.java:400)
at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:411)


This error has been around for large-scale crawls since a couple of months
ago. I was wondering if anybody else has had the same issue with large-scale
crawls.

Thanks, Mike.






On 2/26/06, Gal Nitzan <[EMAIL PROTECTED]> wrote:
>
> Still got the same...
>
> I'm not sure if it is relevant to this issue but the call you added to
> Fetcher.java:
>
> job.setBoolean("mapred.speculative.execution", false);
>
> Doesn't work. All task trackers still fetch together though I have only
> 3 sites in the fetchlist.
>
> The task trackers fetch the same pages...
>
> I have used latest build from hadoop trunk.
>
> Gal.
>
>
> On Fri, 2006-02-24 at 14:15 -0800, Doug Cutting wrote:
> > Mike Smith wrote:
> > > 060219 142408 task_m_grycae  Parent died.  Exiting task_m_grycae
> >
> > This means the child process, executing the task, was unable to ping its
> > parent process (the task tracker).
> >
> > > 060219 142408 task_m_grycae Child Error
> > > java.io.IOException: Task process exit with nonzero status.
> > > at org.apache.hadoop.mapred.TaskRunner.runChild(
> TaskRunner.java:144)
> > > at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:97)
> >
> > And this means that the parent was really still alive, and has noticed
> > that the child killed itself.
> >
> > It would be good to know how the child failed to contact its parent.  We
> > should probably log a stack trace when this happens.  I just made that
> > change in Hadoop and will propagate it to Nutch.
> >
> > Doug
> >
>
>
>