Hi,

It works now! I had links with hrefs like "?domaine=1&diplome=TI-DUT" in the source code. I'm not sure that is valid, but web browsers don't complain. Changing the hrefs to something like "Sciences-Technologies-Sante?domaine=1&diplome=TI-DUT" makes Nutch work better.
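For reference, a query-only href such as "?domaine=1&diplome=TI-DUT" is a legal relative reference under RFC 3986: it keeps the base document's path and replaces only the query. The small sketch below uses Python's urllib.parse.urljoin to show that both href forms resolve to the same absolute URL. The base URL is an assumption (the page on which the links are taken to appear), and the snippet says nothing about how Nutch itself resolves links.

```python
from urllib.parse import urljoin

# Assumed base: the page on which the links appear (not stated in the thread).
BASE = "http://www.univ-lille1.fr/etudes/offre-de-formation/Sciences-Technologies-Sante"

# Original form: a query-only relative reference.
query_only = urljoin(BASE, "?domaine=1&diplome=TI-DUT")

# Rewritten form: the last path segment is repeated explicitly before the query.
explicit = urljoin(BASE, "Sciences-Technologies-Sante?domaine=1&diplome=TI-DUT")

print(query_only)
print(explicit)
print(query_only == explicit)
```

If a crawler's link extractor resolves the first form differently from the second, rewriting the hrefs as described above sidesteps the difference.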
Best regards,
David

Quoting [email protected]:
> I just removed the empty parameters from the URLs, so I now have new URLs, but
> nothing changes.
>
> Nutch fetches this:
> http://www.univ-lille1.fr/etudes/offre-de-formation/Sciences-Technologies-Sante?domaine=1&diplome=TI-DUT
>
> Nutch doesn't fetch this:
> http://www.univ-lille1.fr/etudes/offre-de-formation/Sciences-Technologies-Sante?domaine=1&diplome=TI-DUT&mention=FR_RNE_0593559Y_PR_ST-dut-000001&specialite=FR_RNE_0593559Y_PR_formation-DUT-MP
>
> Best regards
> David
>
> Quoting Markus Jelsma <[email protected]>:
> > I haven't tried it, but there's something odd (but valid) going on. It
> > doesn't fetch the URL where the equals sign is immediately followed by an
> > ampersand. Could you do a test and verify this?
> >
> > If verified, please open a ticket.
> >
> > On Wednesday 27 April 2011 16:28:36 [email protected] wrote:
> > > Hi,
> > >
> > > I'm quite lost here, I suspect that these URLs don't want to be fetched :)
> > >
> > > Seriously, I'm not sure if it's a problem with query strings, as Nutch
> > > fetches some URLs with GET parameters successfully. I tried changing lots
> > > of properties (like db.max.inlinks, db.max.anchor.length,
> > > http.content.limit) in nutch-default.xml, without success.
> > >
> > > Best regards,
> > > David
> > >
> > > Quoting "McGibbney, Lewis John" <[email protected]>:
> > > > Hi,
> > > >
> > > > Has this moved on any?
> > > >
> > > > Did you manage to successfully fetch your URLs? I have been away and
> > > > didn't get time to complete.
> > > >
> > > > ________________________________________
> > > > From: [email protected] [[email protected]]
> > > > Sent: 21 April 2011 21:11
> > > > To: [email protected]
> > > > Subject: RE: Fetching urls with query string
> > > >
> > > > Hi,
> > > >
> > > > Sorry I didn't provide the real URLs, here they are:
> > > >
> > > > Nutch fetches this:
> > > > http://www.univ-lille1.fr/etudes/offre-de-formation/Sciences-Technologies-Sante?domaine=1&diplome=TI-DUT&composante=
> > > >
> > > > Nutch does not fetch this:
> > > > http://www.univ-lille1.fr/etudes/offre-de-formation/Sciences-Technologies-Sante?domaine=1&diplome=TI-DUT&composante=&mention=FR_RNE_0593559Y_PR_ST-dut-000001&specialite=FR_RNE_0593559Y_PR_formation-DUT-INFO
> > > >
> > > > My crawl-urlfilter:
> > > >
> > > > # skip file:, ftp:, & mailto: urls
> > > > -^(file|ftp|mailto):
> > > >
> > > > # skip image and other suffixes we can't yet parse
> > > > -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|js|crt|cert)$
> > > >
> > > > # skip URLs containing certain characters as probable queries, etc.
> > > > #-[?*!@=]
> > > >
> > > > # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> > > > -.*(/.+?)/.*?\1/.*?\1/
> > > >
> > > > # crawl only on front-ig1
> > > > +^http://www.univ-lille1.fr/etudes/offre-de-formation
> > > >
> > > > # skip everything else
> > > > -.
> > > >
> > > > With the comment removed from -[?*!@=], Nutch doesn't fetch URLs with
> > > > query strings at all.
> > > > For information, I use Nutch 0.9 (but I tried with a fresh install of 1.2
> > > > and I'm having the same problem).
> > > >
> > > > Thanks for your answer John
> > > > Best regards
> > > > David
> > > >
> > > > Quoting "McGibbney, Lewis John" <[email protected]>:
> > > > > Hi,
> > > > >
> > > > > It appears that both of the URLs you posted return 404 not found then
> > > > > autoredirect to a domain seller!
> > > > >
> > > > > Further to this, did you remove the comment on this
> > > > >
> > > > > #-[?*!@=]... from the info provided below it appears you have not.
> > > > >
> > > > > hth
> > > > >
> > > > > Lewis
> > > > >
> > > > > ________________________________________
> > > > > From: [email protected] [[email protected]]
> > > > > Sent: 21 April 2011 16:15
> > > > > To: [email protected]
> > > > > Subject: Fetching urls with query string
> > > > >
> > > > > Hello,
> > > > >
> > > > > I have problems fetching some URLs having GET parameters with Nutch.
> > > > > For example, Nutch is fetching:
> > > > > http://www.mywebsite.com/studies/formation-offer/Sciences-Technologies-Sante?domaine=1&diplome=TI-DUT&composante=
> > > > >
> > > > > but will not fetch:
> > > > > http://www.mywebsite.com/studies/formation-offer/Sciences-Technologies-Sante?domaine=1&diplome=TI-DUT&composante=&mention=FR_RNE_0593559Y_PR_ST-dut-000001&specialite=FR_RNE_0593559Y_PR_formation-DUT-INFO
> > > > >
> > > > > I updated the crawl-urlfilter:
> > > > >
> > > > > #-[?*!@=]
> > > > >
> > > > > +^http://www.mywebsite.com/studies/formation-offer/
> > > > >
> > > > > and nutch-default.xml:
> > > > >
> > > > > <property>
> > > > >   <name>db.max.anchor.length</name>
> > > > >   <value>300</value>
> > > > >   <description>The maximum number of characters permitted in an anchor.
> > > > >   </description>
> > > > > </property>
> > > > >
> > > > > but I get the same result. I didn't find anything in the configuration
> > > > > files to make it work. Does anybody have an idea?
> > > > >
> > > > > Best regards,
> > > > > David
> > > > >
> > > > > Email has been scanned for viruses by Altman Technologies' email
> > > > > management service - www.altman.co.uk/emailsystems
> > > > >
> > > > > Glasgow Caledonian University is a registered Scottish charity, number
> > > > > SC021474
> > > > >
> > > > > Winner: Times Higher Education's Widening Participation Initiative of
> > > > > the Year 2009 and Herald Society's Education Initiative of the Year
> > > > > 2009.
> > > > > http://www.gcu.ac.uk/newsevents/news/bycategory/theuniversity/1/name,6219,en.html
> > > > >
> > > > > Winner: Times Higher Education's Outstanding Support for Early Career
> > > > > Researchers of the Year 2010, GCU as a lead with Universities Scotland
> > > > > partners.
> > > > > http://www.gcu.ac.uk/newsevents/news/bycategory/theuniversity/1/name,15691,en.html
> >
> > --
> > Markus Jelsma - CTO - Openindex
> > http://www.linkedin.com/in/markus17
> > 050-8536620 / 06-50258350
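The crawl-urlfilter rules quoted in this thread can be checked against the two URLs outside of Nutch. The sketch below is a rough Python approximation, not Nutch's own code: it assumes that rules are tried in file order, that the first pattern found anywhere in the URL decides, and that a URL matching no rule is rejected. Python's re module stands in for Java's regex engine, and the helper name accepts is hypothetical. Under those assumptions both URLs pass the filter while the -[?*!@=] rule is commented out, which suggests the filter was not what blocked the second URL.

```python
import re

# Rules copied from the crawl-urlfilter quoted in the thread, in file order.
# "+" accepts a URL, "-" rejects it.
RULES = [
    ("-", r"^(file|ftp|mailto):"),
    ("-", r"\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls"
          r"|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|js|crt|cert)$"),
    ("-", r".*(/.+?)/.*?\1/.*?\1/"),
    ("+", r"^http://www.univ-lille1.fr/etudes/offre-de-formation"),
    ("-", r"."),
]

# The rule that is commented out in the quoted file.
QUERY_RULE = ("-", r"[?*!@=]")


def accepts(url, rules):
    """Return True if the first rule whose pattern is found in url is a '+' rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False  # assumption: a URL that matches no rule is rejected


BASE = "http://www.univ-lille1.fr/etudes/offre-de-formation/Sciences-Technologies-Sante"
FETCHED = BASE + "?domaine=1&diplome=TI-DUT&composante="
NOT_FETCHED = (BASE + "?domaine=1&diplome=TI-DUT&composante="
               "&mention=FR_RNE_0593559Y_PR_ST-dut-000001"
               "&specialite=FR_RNE_0593559Y_PR_formation-DUT-INFO")

# Same rule list with the query-character rule enabled at its place in the file.
WITH_QUERY_RULE = RULES[:2] + [QUERY_RULE] + RULES[2:]

for url in (FETCHED, NOT_FETCHED):
    print(accepts(url, RULES), accepts(url, WITH_QUERY_RULE))
```

Enabling the query-character rule rejects both URLs in this sketch, which matches the report that Nutch then fetches no query strings at all.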

