On Mon, 2005-11-21 at 15:11 -0800, Doug Cutting wrote:
> This sounds like a bug in the URLFilter implementation.  Is this 
> RegexURLFilter?  Can you figure out what regex is causing this? 
> Probably the patch should be there, no?

I am using the URL filtering and normalization plugins. As for where the
patch should go, I didn't dig any deeper than ParseOutputFormat, so that
is where I applied it on my own system to keep it from breaking.

I put the URL into a file and tried using it as a seed for nutch crawl.
The lockup occurred.

I commented out all entries from regex-urlfilter.txt and
crawl-urlfilter.txt except for the "+." line at the end. The
lockup/timeout still occurred.  Commenting out the full contents of
regex-normalize.xml did not change the outcome either.
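
To answer the "which regex" question more directly, each pattern could be
run against the URL on its own with a time limit. The following is only a
minimal sketch under my own assumptions, not Nutch code: the class
RegexProbe and the method slowPatterns are hypothetical names, and the
patterns would have to be copied by hand out of the filter and normalizer
config files. Since java.util.regex does not stop on thread interruption by
itself, the input is wrapped in a CharSequence that checks the interrupt
flag on every character read.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of Nutch): reports regexes that do not finish on an input. */
final class RegexProbe {

  /** CharSequence wrapper that aborts a running match once the worker thread is interrupted. */
  private static final class Interruptible implements CharSequence {
    private final CharSequence inner;

    Interruptible(CharSequence inner) { this.inner = inner; }

    public char charAt(int index) {
      if (Thread.currentThread().isInterrupted()) {
        throw new IllegalStateException("regex match interrupted");
      }
      return inner.charAt(index);
    }

    public int length() { return inner.length(); }

    public CharSequence subSequence(int start, int end) {
      return new Interruptible(inner.subSequence(start, end));
    }

    public String toString() { return inner.toString(); }
  }

  /** Returns the regexes whose find() on the input did not complete within timeoutMillis. */
  static List<String> slowPatterns(List<String> regexes, String input, long timeoutMillis) {
    ExecutorService pool = Executors.newSingleThreadExecutor();
    List<String> slow = new ArrayList<String>();
    try {
      for (String regex : regexes) {
        final Pattern pattern = Pattern.compile(regex);
        final CharSequence text = new Interruptible(input);
        Callable<Boolean> task = () -> pattern.matcher(text).find();
        Future<Boolean> result = pool.submit(task);
        try {
          result.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
          result.cancel(true); // interrupts the worker; the next charAt() call then throws
          slow.add(regex);
        } catch (ExecutionException e) {
          throw new IllegalStateException(e.getCause());
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt(); // preserve the caller's interrupt status
          throw new IllegalStateException(e);
        }
      }
    } finally {
      pool.shutdownNow();
    }
    return slow;
  }
}
```

Calling RegexProbe.slowPatterns(patterns, url, 1000) with the URL from the
seed file would list any pattern that needs more than a second on it. An
empty result would point away from the regexes themselves.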

Attached is the seed file I used; it was placed in the seed directory.

-bash-2.05b$ ./bin/nutch crawl /home/rbt/nutch-0.8_10/test/seed/ -dir /home/rbt/nutch-0.8_10/test/test3 -topN 1
051122 120048 parsing file:/home/rbt/nutch-0.8_10/conf/nutch-default.xml
051122 120049 parsing file:/home/rbt/nutch-0.8_10/conf/crawl-tool.xml
051122 120049 parsing file:/home/rbt/nutch-0.8_10/conf/mapred-default.xml
051122 120049 parsing file:/home/rbt/nutch-0.8_10/conf/nutch-site.xml
051122 120049 crawl started in: /home/rbt/nutch-0.8_10/test/test3
051122 120049 rootUrlFile = /home/rbt/nutch-0.8_10/test/seed
051122 120049 threads = 45
051122 120049 depth = 5
051122 120049 topN = 1
051122 120049 parsing file:/home/rbt/nutch-0.8_10/conf/nutch-default.xml
051122 120049 parsing file:/home/rbt/nutch-0.8_10/conf/crawl-tool.xml
051122 120049 parsing file:/home/rbt/nutch-0.8_10/conf/nutch-site.xml
051122 120049 Injector: starting
051122 120049 Injector: crawlDb: /home/rbt/nutch-0.8_10/test/test3/crawldb
051122 120049 Injector: urlDir: /home/rbt/nutch-0.8_10/test/seed
051122 120049 Injector: Converting injected urls to crawl db entries.
051122 120049 parsing file:/home/rbt/nutch-0.8_10/conf/nutch-default.xml
051122 120049 parsing file:/home/rbt/nutch-0.8_10/conf/crawl-tool.xml
051122 120049 parsing file:/home/rbt/nutch-0.8_10/conf/mapred-default.xml
051122 120049 parsing file:/home/rbt/nutch-0.8_10/conf/mapred-default.xml
051122 120049 parsing file:/home/rbt/nutch-0.8_10/conf/nutch-site.xml


Exception in thread "main" java.net.ConnectException: Connection timed out
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:305)
        at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:171)
        at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:158)
        at java.net.Socket.connect(Socket.java:452)
        at java.net.Socket.connect(Socket.java:402)
        at java.net.Socket.<init>(Socket.java:309)
        at java.net.Socket.<init>(Socket.java:153)
        at org.apache.nutch.ipc.Client$Connection.<init>(Client.java:110)
        at org.apache.nutch.ipc.Client.getConnection(Client.java:343)
        at org.apache.nutch.ipc.Client.call(Client.java:281)
        at org.apache.nutch.ipc.RPC$Invoker.invoke(RPC.java:127)
        at $Proxy0.getFilesystemName(Unknown Source)
        at org.apache.nutch.mapred.JobClient.getFs(JobClient.java:209)
        at org.apache.nutch.mapred.JobClient.submitJob(JobClient.java:249)
        at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:288)
        at org.apache.nutch.crawl.Injector.inject(Injector.java:102)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:101)


> Rod Taylor wrote:
> > I stuck a few log statements within ParseOutputFormat.java. One after
> > 'String toUrl =' and another before the 'if (toUrl != null)'. Nutch came
> > across a URL which hit the first but not the second.
> > 
> > This means it is getting stuck (no exit or error, eventually the process
> > times out and is reattempted to fail exactly the same way).
> > 
> > The URL it is trying to process at the time is very long and somewhat
> > convoluted. The thread is idle. Adding a restriction to skip URLs longer
> > than 512 characters seems to have solved it.
> > 
> > 4096 characters long
> > http://www.moveandstay.com/aberdeen/::abilene/::addison/::adelaide/::1076::1042::aix_en_provence/::alexandria/::algarve/::alpharetta/::1077::amalfi_coast/::amersham/::amsterdam/::arlington/::ashgrove/::atlanta/::1080::auckland/::austin/::707::bali/::1102::bangalore/::bangkok/::1037::barcelona/::beachwood/::bedminster/::beijing/::bellevue/::belo_horizonte/::berlin/::bethesda/::beverly_hills/::1068::1082::birmingham/::birmingham/::blois/::bloomfield_hills/::boca_raton/::bogota/::bohemia/::960::bonn/::bordeaux/::boston/::bothell/::brasilia/::1145::brest/::bridgewater/::brisbane/::bristol/::brookfield/::broomfield/::brussels/::budapest/::buffalo/::burlington/::cairns/::cambridge/::cambridge/::campbell/::campinas/::canberra/::cape_town/::1040::caracas/::cardiff/::1114::carlsbad/::carlton/::century_city/::cerritos/::1061::charlotte/::cheltenham/::1016::chicago/::chonburi/::christchurch/::308::cincinnati/::cleveland/::cologne/::compiegne/::1079::coral_gables/::costa_mesa/::crete/::culver_city/::curitiba/::1064::1098::1166::dallas/::dandenong/::darwin/::denver/::1063::doncaster/::dortmund/::dubai/::dublin/::dublin/::durham/::195::east_brunswick/::east_sicily/::edina/::edinburgh/::englewood/::erlanger/::essen/::fairfax/::farmington/::fitzroy/::florence/::1090::framingham/::frankfurt/::freehold/::frisco/::1127::979::glasgow/::glendale/::1133::gold_river/::1084::greenwood_village/::1091::guadalajara/::guangzhou/::1170::hamburg/::hanoi/::1132::hauppage/::henderson/::ho_chi_minh_city/::hobart/::hongkong/::houston/::huntington_beach/::1089::independence/::indianapolis/::1059::irvine/::irvine/::irving/::iselin/::671::162::jacksonville/::jakarta/::1113::jersey_city/::johannesburg/::jolimont/::kennesaw/::kew/::king_of_prussia/::kirkland/::krabi/::kuala_lumpur/::673::1185::la_jolla/::la_mirada/::1085::la_rochelle/::lago_maggiore/::laguna_hills/::1144::lake_oswego/::lannion/::1087::1159::las_vegas/::le_mans/::leeds/::lille/::lisbon/::lisle/::london/::long_beach/::los_angeles/::lyon/::1143::1021::963::madrid/::mahwah/::maidenhead/::1067::maitland/::1088::1025::manchester/::1081::mandurah/::manhattan_beach/::manila/::1078::732::1044::1105::marseille/::mclean/::melbourne/::melville/::mexico_city/::miami/::michigan/::milan/::458::minneapolis/::minnetonka/::monterrey/::montpellier/::montreal/::morristown/::1130::686::mountain_view/::mt._laurel/::mumbai/::munich/::nagoya/::nancy/::nantes/::naples/::narre_warren/::nashville/::new_delhi/::new_york/::newark/::newcastle/::newport_beach/::newtown/::199::norcross/::northbrook/::nottingham/::novi/::191::oak_brook/::oakbrook_terrace/::orange/::orlando/::osaka/::1186::overland_park/::131::palatine/::paris/::parnell/::parsippany/::pasadena/::pattaya/::1060::perth/::1120::philadelphia/::phoenix/::phuket/::pittsburgh/::plantation/::pleasanton/::ponsoby/::portland/::porto_alegre/::1123::positano/::prague/::prahran/::1106::princeton/::1058::puglia/::rancho_santa_margarita/::rayong/::reading/::red_bank/::redmond/::rennes/::reston/::rio_de_janeiro/::693::rolling_meadows/::rome/::rosemont/::roseville/::sacramento/::saddle_brook/::1072::saint-nazaire/::1083::salvador/::1115::1029::san_antonio/::san_diego/::san_francisco/::san_jose/::san_juan/::san_mateo/::san_rafael/::san_ramon/::santa_clara/::564::sao_polo/::1049::1062::1118::schaumburg/::scottsdale/::570::seattle/::seoul/::shanghai/::1160::short_hills/::1108::singapore/::sofia/::sophia_antipolis/::sorrento/::1117::579::southfield/::1066::st_kilda/::st_louis/::stockholm/::strasbourg/::192::sun_city/::sunrise/::surabaya/::sydney/::syosset/::1126::1158::tampa/::tarrytown/::taupo/::the_entrance/::tokyo/::toronto/::1069::toulouse/::trinity_beach/::troy/::tulsa/::tuscany_cities/::1055::tuscany_seaside/::tustin/::1134::umbria/::uniondale/::vancouver/::venice/::verona/::victoria/::vienna/::vienna/::1111::1110::walnut_creek/::waltham/::wantirna/::warrenville/::warsaw/::washington_dc/::1128::wellesley_hills/::wellington/::west_chester/::west_sicily/::white_plains/::wiesbaden/::williamstown/::windsor/::woodland_hills/::worthington/::948::zurich/
> > 
> > 
> > Index: ParseOutputFormat.java
> > ===================================================================
> > --- ParseOutputFormat.java  (revision 344015)
> > +++ ParseOutputFormat.java  (working copy)
> > @@ -56,7 +56,7 @@
> >  
> >          public void write(WritableComparable key, Writable value)
> >            throws IOException {
> > -          
> > +
> >            Parse parse = (Parse)value;
> >            
> >            textOut.append(key, new ParseText(parse.getText()));
> > @@ -73,6 +73,10 @@
> >            for (int i = 0; i < links.length; i++) {
> >              String toUrl = links[i].getToUrl();
> >              try {
> > +              if (toUrl.length() > 512) {
> > +                 throw new Exception("URL length too long: " + toUrl.length() +" characters");
> > +              }
> > +
> >                toUrl = urlNormalizer.normalize(toUrl); // normalize the url
> >                toUrl = URLFilters.filter(toUrl);   // filter the url
> >              } catch (Exception e) {
> > 
> 
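
If the length restriction ends up living somewhere other than
ParseOutputFormat, the check itself is tiny. Below is a minimal sketch of
it as a standalone class, under my own assumptions: UrlLengthGuard is a
hypothetical name, not an existing Nutch class, and returning null for a
rejected URL simply mirrors the 'if (toUrl != null)' test mentioned in the
quoted message. The 512 limit is the value from my patch, not a tuned
number.

```java
/** Hypothetical sketch (not Nutch code): rejects overlong URLs before any regex work is done. */
final class UrlLengthGuard {
  private final int maxLength;

  UrlLengthGuard(int maxLength) {
    if (maxLength <= 0) {
      throw new IllegalArgumentException("maxLength must be positive");
    }
    this.maxLength = maxLength;
  }

  /** Returns the URL unchanged if it is short enough, or null to signal that it should be skipped. */
  String filter(String url) {
    if (url == null || url.length() > maxLength) {
      return null;
    }
    return url;
  }
}
```

With new UrlLengthGuard(512), the 4096 character URL above would come back
as null and be skipped, while ordinary links pass through untouched.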
-- 
Rod Taylor <[EMAIL PROTECTED]>
http://www.moveandstay.com/aberdeen/::abilene/::addison/::adelaide/::1076::1042::aix_en_provence/::alexandria/::algarve/::alpharetta/::1077::amalfi_coast/::amersham/::amsterdam/::arlington/::ashgrove/::atlanta/::1080::auckland/::austin/::707::bali/::1102::bangalore/::bangkok/::1037::barcelona/::beachwood/::bedminster/::beijing/::bellevue/::belo_horizonte/::berlin/::bethesda/::beverly_hills/::1068::1082::birmingham/::birmingham/::blois/::bloomfield_hills/::boca_raton/::bogota/::bohemia/::960::bonn/::bordeaux/::boston/::bothell/::brasilia/::1145::brest/::bridgewater/::brisbane/::bristol/::brookfield/::broomfield/::brussels/::budapest/::buffalo/::burlington/::cairns/::cambridge/::cambridge/::campbell/::campinas/::canberra/::cape_town/::1040::caracas/::cardiff/::1114::carlsbad/::carlton/::century_city/::cerritos/::1061::charlotte/::cheltenham/::1016::chicago/::chonburi/::christchurch/::308::cincinnati/::cleveland/::cologne/::compiegne/::1079::coral_gables/::costa_mesa/::crete/::culver_city/::curitiba/::1064::1098::1166::dallas/::dandenong/::darwin/::denver/::1063::doncaster/::dortmund/::dubai/::dublin/::dublin/::durham/::195::east_brunswick/::east_sicily/::edina/::edinburgh/::englewood/::erlanger/::essen/::fairfax/::farmington/::fitzroy/::florence/::1090::framingham/::frankfurt/::freehold/::frisco/::1127::979::glasgow/::glendale/::1133::gold_river/::1084::greenwood_village/::1091::guadalajara/::guangzhou/::1170::hamburg/::hanoi/::1132::hauppage/::henderson/::ho_chi_minh_city/::hobart/::hongkong/::houston/::huntington_beach/::1089::independence/::indianapolis/::1059::irvine/::irvine/::irving/::iselin/::671::162::jacksonville/::jakarta/::1113::jersey_city/::johannesburg/::jolimont/::kennesaw/::kew/::king_of_prussia/::kirkland/::krabi/::kuala_lumpur/::673::1185::la_jolla/::la_mirada/::1085::la_rochelle/::lago_maggiore/::laguna_hills/::1144::lake_oswego/::lannion/::1087::1159::las_vegas/::le_mans/::leeds/::lille/::lisbon/::lisle/::london/::long_beach/::los_angeles/::lyon/::1143::1021::963::madrid/::mahwah/::maidenhead/::1067::maitland/::1088::1025::manchester/::1081::mandurah/::manhattan_beach/::manila/::1078::732::1044::1105::marseille/::mclean/::melbourne/::melville/::mexico_city/::miami/::michigan/::milan/::458::minneapolis/::minnetonka/::monterrey/::montpellier/::montreal/::morristown/::1130::686::mountain_view/::mt._laurel/::mumbai/::munich/::nagoya/::nancy/::nantes/::naples/::narre_warren/::nashville/::new_delhi/::new_york/::newark/::newcastle/::newport_beach/::newtown/::199::norcross/::northbrook/::nottingham/::novi/::191::oak_brook/::oakbrook_terrace/::orange/::orlando/::osaka/::1186::overland_park/::131::palatine/::paris/::parnell/::parsippany/::pasadena/::pattaya/::1060::perth/::1120::philadelphia/::phoenix/::phuket/::pittsburgh/::plantation/::pleasanton/::ponsoby/::portland/::porto_alegre/::1123::positano/::prague/::prahran/::1106::princeton/::1058::puglia/::rancho_santa_margarita/::rayong/::reading/::red_bank/::redmond/::rennes/::reston/::rio_de_janeiro/::693::rolling_meadows/::rome/::rosemont/::roseville/::sacramento/::saddle_brook/::1072::saint-nazaire/::1083::salvador/::1115::1029::san_antonio/::san_diego/::san_francisco/::san_jose/::san_juan/::san_mateo/::san_rafael/::san_ramon/::santa_clara/::564::sao_polo/::1049::1062::1118::schaumburg/::scottsdale/::570::seattle/::seoul/::shanghai/::1160::short_hills/::1108::singapore/::sofia/::sophia_antipolis/::sorrento/::1117::579::southfield/::1066::st_kilda/::st_louis/::stockholm/::strasbourg/::192::sun_city/::sunrise/::surabaya/::sydney/::syosset/::1126::1158::tampa/::tarrytown/::taupo/::the_entrance/::tokyo/::toronto/::1069::toulouse/::trinity_beach/::troy/::tulsa/::tuscany_cities/::1055::tuscany_seaside/::tustin/::1134::umbria/::uniondale/::vancouver/::venice/::verona/::victoria/::vienna/::vienna/::1111::1110::walnut_creek/::waltham/::wantirna/::warrenville/::warsaw/::washington_dc/::1128::wellesley_hills/::wellington/::west_chester/::west_sicily/::white_plains/::wiesbaden/::williamstown/::windsor/::woodland_hills/::worthington/::948::zurich/
