About tomcat
We have concluded that we need to restart the webapp for new results to appear in search. How can we do this correctly without restarting Tomcat? After Tomcat has run for a long time, we get a "too many open files" error. Could this be a result of restarting the webapp with a touch on web.xml? For now, before starting Tomcat, we set the maximum number of open files to 4096 (the default is 1024), but we don't think this is the right solution.
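One way to check whether descriptors are actually leaking (a sketch for Linux; in practice you would substitute the Tomcat JVM's PID for `$$`, which here just inspects the current shell) is to watch the process's descriptor count over time and compare it against the limit:

```shell
# Show this shell's open-file soft limit (1024 is a common default).
ulimit -n

# Count the open file descriptors of a process via /proc.
# $$ is this shell's own PID; replace it with the Tomcat PID
# to watch whether the count keeps growing after each webapp reload.
ls /proc/$$/fd | wc -l
```

If the count climbs after every `touch web.xml` reload and never drops, the reloads are likely leaking descriptors, and raising `ulimit -n` only postpones the error.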
jobdetails.jsp and jobtracker.jsp
How do I use jobtracker.jsp and jobdetails.jsp? Do they need Tomcat? When I try to open jobdetails.jsp under Tomcat, it returns this error:

java.lang.NullPointerException
	at org.apache.jsp.m.jobdetails_jsp._jspService(org.apache.jsp.m.jobdetails_jsp:53)
	at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:97)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
	at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:322)
	at org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:291)
	at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:241)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:252)
	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173)
	at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
	at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:178)
	at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:126)
	at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:105)
	at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:107)
	at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:148)
	at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:856)
	at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.processConnection(Http11Protocol.java:744)
	at org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:527)
	at org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:80)
	at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:684)
	at java.lang.Thread.run(Thread.java:595)
RE: jobdetails.jsp and jobtracker.jsp
So they don't need Tomcat? But then, what should we type in the browser address bar? http://host_jobtracker:port_jobtracker/jobtracker/jobtracker.jsp ?

-----Original Message-----
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]]
Sent: Monday, November 21, 2005 12:46 PM
To: nutch-dev@lucene.apache.org
Subject: Re: jobdetails.jsp and jobtracker.jsp

[EMAIL PROTECTED] wrote:
> How to use jobtracker.jsp and jobdetails.jsp? They need tomcat?

No, but jobdetails.jsp requires a parameter (job_id) - start with jobtracker.jsp, and then follow the links.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com
Contact: info at sigram dot com
RE: jobdetails.jsp and jobtracker.jsp
Why do we need the parameter mapred.map.tasks to be greater than the number of available hosts? If we set it equal to the number of hosts, we get the negative progress percentages problem.
fetcher.thread.per.host not working ??
Hello, there is something wrong with threads per host... Only one thread should fetch a given host at a time, so why do I get these connect timeouts (15 sec) at 13:15:15 and 13:15:16?! This is not normal, and I get about 1000 errors when I crawl about 1400 pages...

Here is the log:

051121 131515 17 fetch of http://www.forum.math.ulg.ac.be/doc/WMC.html?SESSID=b03ef782492f97b0507f7281ce8088cb failed with: java.lang.Exception: java.net.SocketTimeoutException: connect timed out
051121 131515 17 fetching http://ssel.vub.ac.be/viewcvs/viewcvs.py/cvs/cocompose/MakeAllDist.sh?rev=1.2&view=log
051121 131516 70 fetch of http://www.forum.math.ulg.ac.be/doc/WMC.html?SESSID=5e078d23b09e35fa3cb57563f3edac93 failed with: java.lang.Exception: java.net.SocketTimeoutException: connect timed out
051121 131516 70 fetching http://www.forum.math.ulg.ac.be/viewthread.html?SESSID=a514546281e260e97b0d4ffef7d3fe67&id=18614
051121 131516 47 fetching http://alexandrie.droit.fundp.ac.be/Record.htm?idlist=287&record=19145278280919634500
051121 131516 57 fetch of http://www.forum.math.ulg.ac.be/viewsection.html?SESSID=7384e568d8d3c2fd5b8cfacc11baa9a9&id=2 failed with: java.lang.Exception: java.net.SocketTimeoutException: connect timed out
051121 131516 57 fetching http://www.bib.ucl.ac.be/cgi-bin/chameleon?search=KEYWORD&function=INITREQ&SourceScreen=INITREQ&sessionid=256721&skin=gandalf&conf=.%2fchameleon.conf&lng=fr-be&itemu1=1003&u1=1003&t1=Wanko%20Nankam,%20Caroline&pos=1&prevpos=1&beginsrch=1
051121 131516 17 fetching http://www.forum.math.ulg.ac.be/viewsection.html?SESSID=08f49a63e3045f6481cba10cbe996eaa
051121 131516 42 fetching http://www.iagr.ucl.ac.be/planning/processus-staff/
051121 131516 42 fetching http://www.iagr.ucl.ac.be/staff/
051121 131516 65 fetch of http://www.forum2.math.ulg.ac.be/viewsection.html?SESSID=fa1e4296b3df0eed81bbc60b98a371f3&id=11 failed with: java.lang.Exception: java.net.SocketTimeoutException: connect timed out
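For reference, this behaviour is governed by a fetcher property in nutch-site.xml. A minimal override might look like the following (a sketch; the property name `fetcher.threads.per.host` is assumed from nutch-default.xml of that era, and the subject line's `fetcher.thread.per.host` would be silently ignored as an unknown key):

```xml
<!-- Sketch: allow at most one concurrent fetch per host.
     Property name assumed from nutch-default.xml; verify it against
     your checkout, since a misspelled key is ignored without warning. -->
<property>
  <name>fetcher.threads.per.host</name>
  <value>1</value>
</property>
```

Note that even with one thread per host, back-to-back requests one second apart are possible unless a per-host delay is also configured.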
merging auto-crawls
Hi All, I've tried at nutch-user with no success. Can someone help me with the questions below? Thanks!

---------- Forwarded message ----------
From: Ben Halsted [EMAIL PROTECTED]
Date: Nov 18, 2005 10:44 PM
Subject: merging auto-crawls
To: nutch-user@lucene.apache.org

Hello, (using the latest compiled SVN mapred branch) I've modified the auto-crawl to always use a pre-existing crawldb. If I run it multiple times, I get multiple linkdb, segments, indexes, and index directories. Is it possible to merge the results using the bin/nutch commands? Thanks! Ben
Re: Problem with CRC files on NDFS
Andrzej Bialecki wrote:
> I have a problem with the recently added CRC files when putting stuff to NDFS. NDFS complains that these files already exist - I suspect that it creates them on the fly just before they are actually transmitted from the NDFSClient - and aborts the transfer. I was only able to make the -put operation succeed if I first deleted all .*.crc files.

I have not seen this. Can you tell me more about how to cause this problem, perhaps by providing a transcript of a session? Are you overwriting existing files? A crc file is created just after a file is opened for output, and it overwrites any existing crc file; see NFSDataOutputStream.java, line 44. There are a few cases where things will complain about non-existent .crc files. This happens, e.g., when putting a file that was not created by the Nutch tools. It also notably happens with Lucene indexes, since these are created by FSDirectory, not NDFSDirectory: NDFS does not permit overwrites, and Lucene overwrites in one place (TermInfosWriter.java, line 141). If we modified Lucene to write the term count at EOF-8, then Lucene indexes could be written directly through a NutchFileSystem API and would be correctly checksummed at creation. Is this change to Lucene justified? Doug
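The sidecar-checksum scheme Doug describes can be illustrated with a small standalone sketch (illustrative only, not the NDFS code; file naming follows the `.<name>.crc` convention mentioned in the thread):

```python
import binascii
import os

def write_with_crc(path, data):
    """Write data plus a sidecar '.<name>.crc' file at creation time,
    overwriting any stale crc file left from a previous version."""
    with open(path, "wb") as f:
        f.write(data)
    directory, name = os.path.split(path)
    crc_path = os.path.join(directory, "." + name + ".crc")
    with open(crc_path, "w") as f:  # overwrites an existing crc file
        f.write(str(binascii.crc32(data) & 0xFFFFFFFF))

def read_with_crc(path):
    """Read data back and raise if it no longer matches its sidecar crc."""
    with open(path, "rb") as f:
        data = f.read()
    directory, name = os.path.split(path)
    with open(os.path.join(directory, "." + name + ".crc")) as f:
        expected = int(f.read())
    if (binascii.crc32(data) & 0xFFFFFFFF) != expected:
        raise IOError("checksum mismatch for %s" % path)
    return data
```

A file created outside this scheme (as a Lucene index written by FSDirectory would be) has no sidecar at all, which is exactly the "non-existent .crc" complaint described above.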
Re: mapred.map.tasks
[EMAIL PROTECTED] wrote:
> Why we need parameter mapred.map.tasks greater than number of available host? If we set it equal to number of host, we got negative progress percentages problem.

Can you please post a simple example that demonstrates the negative progress problem? E.g., the minimal changes to your conf/ directory required to illustrate it, how you start your daemons, etc. Thanks, Doug
Re: Urlfilter bug (doesn't return on long URLs)
This sounds like a bug in the URLFilter implementation. Is this RegexURLFilter? Can you figure out which regex is causing this? The fix should probably go there, no? Doug

Rod Taylor wrote:
> I stuck a few log statements within ParseOutputFormat.java, one after 'String toUrl =' and another before the 'if (toUrl != null)'. Nutch came across a URL which hit the first but not the second. This means it is getting stuck (no exit or error; eventually the process times out and is reattempted, only to fail in exactly the same way). The URL it is trying to process at the time is very long and somewhat convoluted, and the thread is idle. Adding a restriction to skip URLs longer than 512 characters seems to have solved it. The URL is 4096 characters long:
>
> http://www.moveandstay.com/aberdeen/::abilene/::addison/::adelaide/::1076::1042::aix_en_provence/::alexandria/::algarve/::alpharetta/::1077::amalfi_coast/::amersham/::amsterdam/::arlington/::ashgrove/::atlanta/::1080::auckland/::austin/::707::bali/::1102::bangalore/::bangkok/::1037::barcelona/::beachwood/::bedminster/::beijing/::bellevue/::belo_horizonte/::berlin/::bethesda/::beverly_hills/::1068::1082::birmingham/::birmingham/::blois/::bloomfield_hills/::boca_raton/::bogota/::bohemia/::960::bonn/::bordeaux/::boston/::bothell/::brasilia/::1145::brest/::bridgewater/::brisbane/::bristol/::brookfield/::broomfield/::brussels/::budapest/::buffalo/::burlington/::cairns/::cambridge/::cambridge/::campbell/::campinas/::canberra/::cape_town/::1040::caracas/::cardiff/::1114::carlsbad/::carlton/::century_city/::cerritos/::1061::charlotte/::cheltenham/::1016::chicago/::chonburi/::christchurch/::308::cincinnati/::cleveland/::cologne/::compiegne/::1079::coral_gables/::costa_mesa/::crete/::culver_city/::curitiba/::1064::1098::1166::dallas/::dandenong/::darwin/::denver/::1063::doncaster/::dortmund/::dubai/::dublin/::dublin/::durham/::195::east_brunswick/::east_sicily/::edina/::edinburgh/::englewood/::erlanger/::essen/::fairfax/::farmington/::fitzroy/::florence/::1090::framingham/::frankfurt/::freehold/::frisco/::1127::979::glasgow/::glendale/::1133::gold_river/::1084::greenwood_village/::1091::guadalajara/::guangzhou/::1170::hamburg/::hanoi/::1132::hauppage/::henderson/::ho_chi_minh_city/::hobart/::hongkong/::houston/::huntington_beach/::1089::independence/::indianapolis/::1059::irvine/::irvine/::irving/::iselin/::671::162::jacksonville/::jakarta/::1113::jersey_city/::johannesburg/::jolimont/::kennesaw/::kew/::king_of_prussia/::kirkland/::krabi/::kuala_lumpur/::673::1185::la_jolla/::la_mirada/::1085::la_rochelle/::lago_maggiore/::laguna_hills/::1144::lake_oswego/::lannion/::1087::1159::las_vegas/::le_mans/::leeds/::lille/::lisbon/::lisle/::london/::long_beach/::los_angeles/::lyon/::1143::1021::963::madrid/::mahwah/::maidenhead/::1067::maitland/::1088::1025::manchester/::1081::mandurah/::manhattan_beach/::manila/::1078::732::1044::1105::marseille/::mclean/::melbourne/::melville/::mexico_city/::miami/::michigan/::milan/::458::minneapolis/::minnetonka/::monterrey/::montpellier/::montreal/::morristown/::1130::686::mountain_view/::mt._laurel/::mumbai/::munich/::nagoya/::nancy/::nantes/::naples/::narre_warren/::nashville/::new_delhi/::new_york/::newark/::newcastle/::newport_beach/::newtown/::199::norcross/::northbrook/::nottingham/::novi/::191::oak_brook/::oakbrook_terrace/::orange/::orlando/::osaka/::1186::overland_park/::131::palatine/::paris/::parnell/::parsippany/::pasadena/::pattaya/::1060::perth/::1120::philadelphia/::phoenix/::phuket/::pittsburgh/::plantation/::pleasanton/::ponsoby/::portland/::porto_alegre/::1123::positano/::prague/::prahran/::1106::princeton/::1058::puglia/::rancho_santa_margarita/::rayong/::reading/::red_bank/::redmond/::rennes/::reston/::rio_de_janeiro/::693::rolling_meadows/::rome/::rosemont/::roseville/::sacramento/::saddle_brook/::1072::saint-nazaire/::1083::salvador/::1115::1029::san_antonio/::san_diego/::san_francisco/::san_jose/::san_juan/::san_mateo/::san_rafael/::san_ramon/::santa_clara/::564::sao_polo/::1049::1062::1118::schaumburg/::scottsdale/::570::seattle/::seoul/::shanghai/::1160::short_hills/::1108::singapore/::sofia/::sophia_antipolis/::sorrento/::1117::579::southfield/::1066::st_kilda/::st_louis/::stockholm/::strasbourg/::192::sun_city/::sunrise/::surabaya/::sydney/::syosset/::1126::1158::tampa/::tarrytown/::taupo/::the_entrance/::tokyo/::toronto/::1069::toulouse/::trinity_beach/::troy/::tulsa/::tuscany_cities/::1055::tuscany_seaside/::tustin/::1134::umbria/::uniondale/::vancouver/::venice/::verona/::victoria/::vienna/::vienna/::::1110::walnut_creek/::waltham/::wantirna/::warrenville/::warsaw/::washington_dc/::1128::wellesley_hills/::wellington/::west_chester/::west_sicily/::white_plains/::wiesbaden/::williamstown/::windsor/::woodland_hills/::worthington/::948::zurich/
>
> Index: ParseOutputFormat.java
> ===
> --- ParseOutputFormat.java (revision 344015)
> +++ ParseOutputFormat.java
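Rod's workaround can be sketched as a cheap length guard applied before the expensive regex filter ever runs (a sketch, not the Nutch patch itself; the 512-character limit comes from the message above, and the `regex_filter` callable stands in for whatever URLFilter is configured):

```python
MAX_URL_LEN = 512  # threshold taken from the workaround described above

def guarded_filter(url, regex_filter):
    """Reject very long URLs up front instead of handing them to a
    regex-based filter, where a convoluted input can trigger
    pathological backtracking and hang the thread."""
    if url is None or len(url) > MAX_URL_LEN:
        return None  # treat as filtered out, same as a regex rejection
    return regex_filter(url)
```

This does not fix the underlying regex, as Doug suggests; it only bounds the input so the worst case cannot be reached.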
RE: mapred.map.tasks
I tried to launch mapred on 2 machines: 192.168.0.250 and 192.168.0.111. In nutch-site.xml I specified these parameters:

1) On both machines:

<property>
  <name>fs.default.name</name>
  <value>192.168.0.250:9009</value>
  <description>The name of the default file system. Either the literal string "local" or a host:port for NDFS.</description>
</property>

<property>
  <name>mapred.job.tracker</name>
  <value>192.168.0.250:9010</value>
  <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
</property>

<property>
  <name>mapred.map.tasks</name>
  <value>2</value>
  <description>The default number of map tasks per job. Typically set to a prime several times greater than the number of available hosts. Ignored when mapred.job.tracker is "local".</description>
</property>

<property>
  <name>mapred.tasktracker.tasks.maximum</name>
  <value>2</value>
  <description>The maximum number of tasks that will be run simultaneously by a task tracker.</description>
</property>

<property>
  <name>mapred.reduce.tasks</name>
  <value>2</value>
  <description>The default number of reduce tasks per job. Typically set to a prime close to the number of available hosts. Ignored when mapred.job.tracker is "local".</description>
</property>

On 192.168.0.250 I started:

2) bin/nutch-daemon.sh start datanode
3) bin/nutch-daemon.sh start namenode
4) bin/nutch-daemon.sh start jobtracker
5) bin/nutch-daemon.sh start tasktracker

I created a directory "seeds" with a file "urls" in it; urls contained 2 links. Then I added that directory to NDFS (bin/nutch ndfs -put ./seeds seeds). The directory was added successfully. Then I launched the command: bin/nutch crawl seeds -depth 2. As a result, I received this log written by the jobtracker:

051123 053118 Adding task 'task_m_z66npx' to set for tracker 'tracker_53845'
051123 053118 Adding task 'task_m_xaynqo' to set for tracker 'tracker_11518'
051123 053130 Task 'task_m_z66npx' has finished successfully.

Log written by the tasktracker on 192.168.0.111:
..
051110 142607 task_m_z66npx 0.0% /user/root/seeds/urls:0+31
051110 142607 task_m_z66npx 1.0% /user/root/seeds/urls:0+31
051110 142607 Task task_m_z66npx is done.

Log written by the tasktracker on 192.168.0.250:

051123 053125 task_m_xaynqo 0.12903225% /user/root/seeds/urls:31+31
051123 053126 task_m_xaynqo -683.9677% /user/root/seeds/urls:31+31
051123 053127 task_m_xaynqo -2129.9678% /user/root/seeds/urls:31+31
051123 053128 task_m_xaynqo -3483.0322% /user/root/seeds/urls:31+31
051123 053129 task_m_xaynqo -4976.2256% /user/root/seeds/urls:31+31
051123 053130 task_m_xaynqo -6449.1934% /user/root/seeds/urls:31+31
051123 053131 task_m_xaynqo -7898.258% /user/root/seeds/urls:31+31
051123 053132 task_m_xaynqo -9232.193% /user/root/seeds/urls:31+31
051123 053133 task_m_xaynqo -10694.3545% /user/root/seeds/urls:31+31
051123 053134 task_m_xaynqo -12139.226% /user/root/seeds/urls:31+31
051123 053135 task_m_xaynqo -13416.677% /user/root/seeds/urls:31+31
051123 053136 task_m_xaynqo -14885.741% /user/root/seeds/urls:31+31
... and so on; the log keeps recording ever more negative percentages.

I concluded that there was an attempt to split the inject across the 2 machines, i.e. there were 2 tasks: 'task_m_z66npx' and 'task_m_xaynqo'. 'task_m_z66npx' finished successfully, while 'task_m_xaynqo' ran into some problem (negative progress). But if I change the parameter mapred.reduce.tasks to 4, all tasks finish successfully and everything works right.

-----Original Message-----
From: Doug Cutting [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, November 22, 2005 2:10 AM
To: nutch-dev@lucene.apache.org
Subject: Re: mapred.map.tasks

[EMAIL PROTECTED] wrote:
> Why we need parameter mapred.map.tasks greater than number of available host? If we set it equal to number of host, we got negative progress percentages problem.

Can you please post a simple example that demonstrates the negative progress problem? E.g., the minimal changes to your conf/ directory required to illustrate this, how you start your daemons, etc. Thanks, Doug
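For intuition only (this is a guess at the failure mode, not a claim about the actual mapred code): split progress is typically the fraction (pos - start) / (end - start) over a byte range such as urls:31+31, i.e. start 31, length 31. If the reader's reported position is counted relative to the split (or otherwise falls below the absolute start offset) while `start` stays absolute, the fraction goes negative and keeps falling, matching the log above:

```python
def split_progress(pos, start, end):
    """Fraction of a file split consumed, assuming 'pos' is an
    absolute file offset inside the half-open range [start, end)."""
    return float(pos - start) / float(end - start)

# Correct usage: absolute position inside the second split [31, 62).
ok = split_progress(46, 31, 62)      # positive, between 0 and 1

# Hypothetical bug: a position counted from the split's own beginning
# (so it is below the absolute start offset 31) yields a negative value.
bad = split_progress(10, 31, 62)     # negative
```

The fact that only the second split (urls:31+31, the one with a nonzero start offset) misbehaves, while urls:0+31 completes normally, is at least consistent with this kind of offset mix-up.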