About tomcat

2005-11-21 Thread Anton Potehin
We have come to the conclusion that we need to restart the webapp so that
new results appear in search. How do we do this correctly without
restarting Tomcat?

 

After Tomcat has been running for a long time, we get a "too many open
files" error. Could this be a result of restarting the webapp by touching
web.xml? For now, before starting Tomcat, we set the maximum number of
open files to 4096 (1024 by default), but we do not think this is the
right solution.
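
A quick way to check whether descriptors are actually leaking across
webapp reloads (a sketch; <tomcat_pid> is a placeholder for your Tomcat
process id):

ulimit -n                      # current per-process open-file limit
lsof -p <tomcat_pid> | wc -l   # descriptors currently held by the Tomcat JVM

If the second number climbs steadily after each reload, raising the limit
only postpones the error.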

 

 

 



jobdetails.jsp and jobtracker.jsp

2005-11-21 Thread anton
How do I use jobtracker.jsp and jobdetails.jsp?
Do they need Tomcat?

When I try to start jobdetails.jsp under Tomcat, it returns this error:
java.lang.NullPointerException
    at org.apache.jsp.m.jobdetails_jsp._jspService(org.apache.jsp.m.jobdetails_jsp:53)
    at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:97)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
    at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:322)
    at org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:291)
    at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:241)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:252)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:178)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:126)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:105)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:107)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:148)
    at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:856)
    at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.processConnection(Http11Protocol.java:744)
    at org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:527)
    at org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:80)
    at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:684)
    at java.lang.Thread.run(Thread.java:595)




RE: jobdetails.jsp and jobtracker.jsp

2005-11-21 Thread anton
They do not need Tomcat? But then what must we type as the browser
address?

http://host_jobtracker:port_jobtracker/jobtracker/jobtracker.jsp ?
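
(For reference: in the mapred branch the job tracker serves these pages
itself, so the address should have the form below. Host and port are
placeholders; the port is whatever the job tracker's web interface is
configured to use, e.g. via mapred.job.tracker.info.port:)

http://jobtracker_host:info_port/jobtracker.jsp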


-Original Message-
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED] 
Sent: Monday, November 21, 2005 12:46 PM
To: nutch-dev@lucene.apache.org
Subject: Re: jobdetails.jsp and jobtracker.jsp

[EMAIL PROTECTED] wrote:

How do I use jobtracker.jsp and jobdetails.jsp?
Do they need Tomcat?


No, but jobdetails.jsp requires a parameter (job_id) - start with 
jobtracker.jsp, and then follow the links.

-- 
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com






RE: jobdetails.jsp and jobtracker.jsp

2005-11-21 Thread anton

Why do we need the parameter mapred.map.tasks to be greater than the
number of available hosts? If we set it equal to the number of hosts, we
get the negative progress percentages problem.




fetcher.thread.per.host not working ??

2005-11-21 Thread Christophe Noel

Hello,

There is something wrong with threads per host...
Only one thread should be fetching a given host at a time, so why do I
get these two connect timeouts (15 sec each) at 13:15:15 and 13:15:16?!
This is not normal, and as a result I get about 1000 errors when I crawl
about 1400 pages...
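
For reference, the setting that should enforce this is
fetcher.threads.per.host; a sketch of the relevant entry (check
nutch-default.xml in your checkout for the exact wording, and override it
in nutch-site.xml):

<property>
  <name>fetcher.threads.per.host</name>
  <value>1</value>
  <description>The maximum number of fetcher threads that should be
  allowed to access a single host at one time.</description>
</property>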


Here is the log:
051121 131515 17 fetch of http://www.forum.math.ulg.ac.be/doc/WMC.html?SESSID=b03ef782492f97b0507f7281ce8088cb failed with: java.lang.Exception: java.net.SocketTimeoutException: connect timed out
051121 131515 17 fetching http://ssel.vub.ac.be/viewcvs/viewcvs.py/cvs/cocompose/MakeAllDist.sh?rev=1.2&view=log
051121 131516 70 fetch of http://www.forum.math.ulg.ac.be/doc/WMC.html?SESSID=5e078d23b09e35fa3cb57563f3edac93 failed with: java.lang.Exception: java.net.SocketTimeoutException: connect timed out
051121 131516 70 fetching http://www.forum.math.ulg.ac.be/viewthread.html?SESSID=a514546281e260e97b0d4ffef7d3fe67&id=18614
051121 131516 47 fetching http://alexandrie.droit.fundp.ac.be/Record.htm?idlist=287&record=19145278280919634500
051121 131516 57 fetch of http://www.forum.math.ulg.ac.be/viewsection.html?SESSID=7384e568d8d3c2fd5b8cfacc11baa9a9&id=2 failed with: java.lang.Exception: java.net.SocketTimeoutException: connect timed out
051121 131516 57 fetching http://www.bib.ucl.ac.be/cgi-bin/chameleon?search=KEYWORD&function=INITREQ&SourceScreen=INITREQ&sessionid=256721&skin=gandalf&conf=.%2fchameleon.conf&lng=fr-be&itemu1=1003&u1=1003&t1=Wanko%20Nankam,%20Caroline&pos=1&prevpos=1&beginsrch=1
051121 131516 17 fetching http://www.forum.math.ulg.ac.be/viewsection.html?SESSID=08f49a63e3045f6481cba10cbe996eaa
051121 131516 42 fetching http://www.iagr.ucl.ac.be/planning/processus-staff/
051121 131516 42 fetching http://www.iagr.ucl.ac.be/staff/
051121 131516 65 fetch of http://www.forum2.math.ulg.ac.be/viewsection.html?SESSID=fa1e4296b3df0eed81bbc60b98a371f3&id=11 failed with: java.lang.Exception: java.net.SocketTimeoutException: connect timed out



merging auto-crawls

2005-11-21 Thread Ben Halsted
Hi All,

I've tried on nutch-user with no success. Can someone help me with the
questions below?

Thanks!

-- Forwarded message --
From: Ben Halsted [EMAIL PROTECTED]
Date: Nov 18, 2005 10:44 PM
Subject: merging auto-crawls
To: nutch-user@lucene.apache.org

Hello,

(Using the latest compiled SVN mapred branch)

I've modified the auto-crawl to always use a pre-existing crawldb. If I run
it multiple times I get multiple linkdb, segments, indexes, and index
directories.

Is it possible to merge the results using the bin/nutch commands?

Thanks!
Ben


Re: Problem with CRC files on NDFS

2005-11-21 Thread Doug Cutting

Andrzej Bialecki wrote:
I have a problem with the recently added CRC files when putting
stuff to NDFS. NDFS complains that these files already exist - I suspect
that it creates them on the fly just before they are actually
transmitted from the NDFSClient - and aborts the transfer. I was able to
succeed in a -put operation only if I first deleted all .*.crc files.


I have not seen this.  Can you tell me more about how to cause this
problem, perhaps by providing a transcript of a session?  Are you
overwriting existing files?


A crc file is created just after the file is opened for output.  It
overwrites any existing crc file.  See NFSDataOutputStream.java, line 44.
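
As a rough illustration of that behavior (a sketch only, not the actual
NFSDataOutputStream code; the single trailing CRC32 value is an
assumption made for brevity - the real checksum layout may differ):

import java.io.*;
import java.util.zip.CRC32;

// Sketch of checksum-on-write: opening "name" for output also creates
// (and overwrites) a ".name.crc" sidecar holding a CRC32 of the data.
public class ChecksummedOutput {
  public static OutputStream open(File dir, String name) throws IOException {
    final File crcFile = new File(dir, "." + name + ".crc");
    final CRC32 crc = new CRC32();
    return new FilterOutputStream(new FileOutputStream(new File(dir, name))) {
      public void write(int b) throws IOException {
        crc.update(b);          // checksum every byte as it is written
        out.write(b);
      }
      public void close() throws IOException {
        super.close();
        // overwrite any stale sidecar with the fresh checksum
        DataOutputStream c =
            new DataOutputStream(new FileOutputStream(crcFile));
        c.writeLong(crc.getValue());
        c.close();
      }
    };
  }
}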


There are a few cases where things will complain about non-existent .crc
files.  This happens, e.g., when putting a file that was not created by
the Nutch tools.


It also notably happens with Lucene indexes, since these are created by
FSDirectory rather than NDFSDirectory: NDFS does not permit overwrites,
and Lucene overwrites in one place (TermInfosWriter.java, line 141).  If
we modify Lucene to write the term count at EOF-8, then Lucene indexes
can be written directly through a NutchFileSystem API and will be
correctly checksummed at creation.  Is this change to Lucene justified?
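
To make the conflict concrete, here is a hedged sketch of the two write
patterns (illustrative only; the real TermInfosWriter bookkeeping is more
involved):

import java.io.*;

// Illustrative sketch: the seek-back overwrite Lucene does today versus
// the proposed append-only layout with the count at EOF-8.
public class CountPatching {

  // Current pattern: reserve a slot up front, then seek back to patch
  // it once the count is known. Requires overwrite support, which NDFS
  // does not provide.
  static void writeWithSeekBack(File f, long[] terms) throws IOException {
    RandomAccessFile out = new RandomAccessFile(f, "rw");
    out.writeLong(0);                    // placeholder for the count
    for (int i = 0; i < terms.length; i++)
      out.writeLong(terms[i]);
    out.seek(0);
    out.writeLong(terms.length);         // overwrite the placeholder
    out.close();
  }

  // Proposed pattern: write strictly front-to-back and append the count
  // as the final 8 bytes, so readers find it at EOF-8.
  static void writeAppendOnly(File f, long[] terms) throws IOException {
    DataOutputStream out = new DataOutputStream(new FileOutputStream(f));
    for (int i = 0; i < terms.length; i++)
      out.writeLong(terms[i]);
    out.writeLong(terms.length);         // count lives at EOF-8
    out.close();
  }
}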


Doug


Re: mapred.map.tasks

2005-11-21 Thread Doug Cutting

[EMAIL PROTECTED] wrote:

Why do we need the parameter mapred.map.tasks to be greater than the
number of available hosts? If we set it equal to the number of hosts, we
get the negative progress percentages problem.


Can you please post a simple example that demonstrates the negative 
progress problem?  E.g., the minimal changes to your conf/ directory 
required to illustrate this, how you start your daemons, etc.


Thanks,

Doug


Re: Urlfilter bug (doesn't return on long URLs)

2005-11-21 Thread Doug Cutting
This sounds like a bug in the URLFilter implementation.  Is this
RegexURLFilter?  Can you figure out which regex is causing this?
The patch should probably go there, no?


Doug

Rod Taylor wrote:

I stuck a few log statements within ParseOutputFormat.java. One after
'String toUrl =' and another before the 'if (toUrl != null)'. Nutch came
across a URL which hit the first but not the second.

This means it is getting stuck (no exit or error; eventually the process
times out and is reattempted, only to fail in exactly the same way).

The URL it is trying to process at the time is very long and somewhat
convoluted. The thread is idle. Adding a restriction to skip URLs longer
than 512 characters seems to have solved it.

The offending URL is about 4096 characters long:
http://www.moveandstay.com/aberdeen/::abilene/::addison/::adelaide/::1076::1042::aix_en_provence/::alexandria/::algarve/::alpharetta/::1077::amalfi_coast/::amersham/::amsterdam/::arlington/::ashgrove/::atlanta/::1080::auckland/::austin/::707::bali/::1102::bangalore/::bangkok/::1037::barcelona/::beachwood/::bedminster/::beijing/::bellevue/::belo_horizonte/::berlin/::bethesda/::beverly_hills/::1068::1082::birmingham/::birmingham/::blois/::bloomfield_hills/::boca_raton/::bogota/::bohemia/::960::bonn/::bordeaux/::boston/::bothell/::brasilia/::1145::brest/::bridgewater/::brisbane/::bristol/::brookfield/::broomfield/::brussels/::budapest/::buffalo/::burlington/::cairns/::cambridge/::cambridge/::campbell/::campinas/::canberra/::cape_town/::1040::caracas/::cardiff/::1114::carlsbad/::carlton/::century_city/::cerritos/::1061::charlotte/::cheltenham/::1016::chicago/::chonburi/::christchurch/::308::cincinnati/::cleveland/::cologne/::compiegne/::1079::coral_gables/::costa_mesa/::crete/::culver_city/::curitiba/::1064::1098::1166::dallas/::dandenong/::darwin/::denver/::1063::doncaster/::dortmund/::dubai/::dublin/::dublin/::durham/::195::east_brunswick/::east_sicily/::edina/::edinburgh/::englewood/::erlanger/::essen/::fairfax/::farmington/::fitzroy/::florence/::1090::framingham/::frankfurt/::freehold/::frisco/::1127::979::glasgow/::glendale/::1133::gold_river/::1084::greenwood_village/::1091::guadalajara/::guangzhou/::1170::hamburg/::hanoi/::1132::hauppage/::henderson/::ho_chi_minh_city/::hobart/::hongkong/::houston/::huntington_beach/::1089::independence/::indianapolis/::1059::irvine/::irvine/::irving/::iselin/::671::162::jacksonville/::jakarta/::1113::jersey_city/::johannesburg/::jolimont/::kennesaw/::kew/::king_of_prussia/::kirkland/::krabi/::kuala_lumpur/::673::1185::la_jolla/::la_mirada/::1085::la_rochelle/::lago_maggiore/::laguna_hills/::1144::lake_oswego/::lannion/::1087::1159::las_vegas/::le_mans/::leeds/::lille/::lisbon/::lisle/::london/::long_beach/::los_angeles/::lyon/::1143::1021::963::madrid/::mahwah/::maidenhead/::1067::maitland/::1088::1025::manchester/::1081::mandurah/::manhattan_beach/::manila/::1078::732::1044::1105::marseille/::mclean/::melbourne/::melville/::mexico_city/::miami/::michigan/::milan/::458::minneapolis/::minnetonka/::monterrey/::montpellier/::montreal/::morristown/::1130::686::mountain_view/::mt._laurel/::mumbai/::munich/::nagoya/::nancy/::nantes/::naples/::narre_warren/::nashville/::new_delhi/::new_york/::newark/::newcastle/::newport_beach/::newtown/::199::norcross/::northbrook/::nottingham/::novi/::191::oak_brook/::oakbrook_terrace/::orange/::orlando/::osaka/::1186::overland_park/::131::palatine/::paris/::parnell/::parsippany/::pasadena/::pattaya/::1060::perth/::1120::philadelphia/::phoenix/::phuket/::pittsburgh/::plantation/::pleasanton/::ponsoby/::portland/::porto_alegre/::1123::positano/::prague/::prahran/::1106::princeton/::1058::puglia/::rancho_santa_margarita/::rayong/::reading/::red_bank/::redmond/::rennes/::reston/::rio_de_janeiro/::693::rolling_meadows/::rome/::rosemont/::roseville/::sacramento/::saddle_brook/::1072::saint-nazaire/::1083::salvador/::1115::1029::san_antonio/::san_diego/::san_francisco/::san_jose/::san_juan/::san_mateo/::san_rafael/::san_ramon/::santa_clara/::564::sao_polo/::1049::1062::1118::schaumburg/::scottsdale/::570::seattle/::seoul/::shanghai/::1160::short_hills/::1108::singapore/::sofia/::sophia_antipolis/::sorrento/::1117::579::southfield/::1066::st_kilda/::st_louis/::stockholm/::strasbourg/::192::sun_city/::sunrise/::surabaya/::sydney/::syosset/::1126::1158::tampa/::tarrytown/::taupo/::the_entrance/::tokyo/::toronto/::1069::toulouse/::trinity_beach/::troy/::tulsa/::tuscany_cities/::1055::tuscany_seaside/::tustin/::1134::umbria/::uniondale/::vancouver/::venice/::verona/::victoria/::vienna/::vienna/::::1110::walnut_creek/::waltham/::wantirna/::warrenville/::warsaw/::washington_dc/::1128::wellesley_hills/::wellington/::west_chester/::west_sicily/::white_plains/::wiesbaden/::williamstown/::windsor/::woodland_hills/::worthington/::948::zurich/



Index: ParseOutputFormat.java
===
--- ParseOutputFormat.java  (revision 344015)
+++ ParseOutputFormat.java  
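
(The patch body is truncated above. As a rough sketch of the kind of
guard Rod describes - the class name, method shape, and 512-character
cutoff are illustrative, not the actual patch:)

// Hypothetical sketch of the workaround: drop over-long outlink URLs
// before they ever reach the regex-based URL filter, which can stall
// on pathological inputs.
public class UrlLengthGuard {
  private static final int MAX_URL_LENGTH = 512;   // illustrative cutoff

  public static String guard(String toUrl) {
    if (toUrl != null && toUrl.length() > MAX_URL_LENGTH) {
      return null;               // treat the URL as filtered out
    }
    return toUrl;
  }
}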

RE: mapred.map.tasks

2005-11-21 Thread anton
I tried to launch mapred on 2 machines: 192.168.0.250 and 192.168.0.111.

In nutch-site.xml I specified parameters:

1) On both machines:
<property>
  <name>fs.default.name</name>
  <value>192.168.0.250:9009</value>
  <description>The name of the default file system.  Either the
  literal string "local" or a host:port for NDFS.</description>
</property>

<property>
  <name>mapred.job.tracker</name>
  <value>192.168.0.250:9010</value>
  <description>The host and port that the MapReduce job tracker runs
  at.  If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
</property>

<property>
  <name>mapred.map.tasks</name>
  <value>2</value>
  <description>The default number of map tasks per job.  Typically set
  to a prime several times greater than number of available hosts.
  Ignored when mapred.job.tracker is "local".
  </description>
</property>

<property>
  <name>mapred.tasktracker.tasks.maximum</name>
  <value>2</value>
  <description>The maximum number of tasks that will be run
  simultaneously by a task tracker.
  </description>
</property>

<property>
  <name>mapred.reduce.tasks</name>
  <value>2</value>
  <description>The default number of reduce tasks per job.  Typically set
  to a prime close to the number of available hosts.  Ignored when
  mapred.job.tracker is "local".
  </description>
</property>
 



On 192.168.0.250 I started:
2)   bin/nutch-daemon.sh start datanode
3)   bin/nutch-daemon.sh start namenode
4)   bin/nutch-daemon.sh start jobtracker
5)   bin/nutch-daemon.sh start tasktracker

I created a directory "seeds" with a file "urls" in it; urls contained 2
links. Then I added that directory to NDFS (bin/nutch ndfs -put ./seeds
seeds). The directory was added successfully.

 

Then I launched the command:
bin/nutch crawl seeds -depth 2

As a result I received this log written by the jobtracker:

051123 053118 Adding task 'task_m_z66npx' to set for tracker 'tracker_53845'
051123 053118 Adding task 'task_m_xaynqo' to set for tracker 'tracker_11518'
051123 053130 Task 'task_m_z66npx' has finished successfully.
 

Log written by tasktracker on 192.168.0.111:
..
051110 142607 task_m_z66npx 0.0% /user/root/seeds/urls:0+31
051110 142607 task_m_z66npx 1.0% /user/root/seeds/urls:0+31
051110 142607 Task task_m_z66npx is done.
 

Log written by tasktracker on 192.168.0.250:

051123 053125 task_m_xaynqo 0.12903225% /user/root/seeds/urls:31+31
051123 053126 task_m_xaynqo -683.9677% /user/root/seeds/urls:31+31
051123 053127 task_m_xaynqo -2129.9678% /user/root/seeds/urls:31+31
051123 053128 task_m_xaynqo -3483.0322% /user/root/seeds/urls:31+31
051123 053129 task_m_xaynqo -4976.2256% /user/root/seeds/urls:31+31
051123 053130 task_m_xaynqo -6449.1934% /user/root/seeds/urls:31+31
051123 053131 task_m_xaynqo -7898.258% /user/root/seeds/urls:31+31
051123 053132 task_m_xaynqo -9232.193% /user/root/seeds/urls:31+31
051123 053133 task_m_xaynqo -10694.3545% /user/root/seeds/urls:31+31
051123 053134 task_m_xaynqo -12139.226% /user/root/seeds/urls:31+31
051123 053135 task_m_xaynqo -13416.677% /user/root/seeds/urls:31+31
051123 053136 task_m_xaynqo -14885.741% /user/root/seeds/urls:31+31
... and so on; i.e., the log contained more records with decreasing percentages.

 

I concluded that there was an attempt to split the inject step across the
2 machines, i.e., there were 2 tasks: 'task_m_z66npx' and 'task_m_xaynqo'.
'task_m_z66npx' finished successfully, while 'task_m_xaynqo' caused some
problems (negative progress).

But if I change the parameter mapred.reduce.tasks to 4, all tasks finish
successfully and everything works correctly.



-Original Message-
From: Doug Cutting [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, November 22, 2005 2:10 AM
To: nutch-dev@lucene.apache.org
Subject: Re: mapred.map.tasks

[EMAIL PROTECTED] wrote:
 Why do we need the parameter mapred.map.tasks to be greater than the
 number of available hosts? If we set it equal to the number of hosts, we
 get the negative progress percentages problem.

Can you please post a simple example that demonstrates the negative 
progress problem?  E.g., the minimal changes to your conf/ directory 
required to illustrate this, how you start your daemons, etc.

Thanks,

Doug