Re: Fetching urls with query string

darkstreamer Mon, 02 May 2011 03:02:40 -0700

Hi,

It works now! The links in the page source had hrefs like
"?domaine=1&diplome=TI-DUT". I'm not sure whether that is valid, but web
browsers don't complain. Changing the hrefs to something like
"Sciences-Technologies-Sante?domaine=1&diplome=TI-DUT" makes Nutch work better.
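
For reference, an href that consists only of a query string is a valid relative reference under RFC 3986: it keeps the path of the base document and replaces only the query. The sketch below shows that resolution with Python's standard `urljoin`, which is not Nutch's own resolver. The base URL is the listing page from this thread, and the `mention` value is copied from the URL quoted further down; the variable names are only illustrative.

```python
from urllib.parse import urljoin

# Page on which the links were found (taken from this thread).
BASE = ("http://www.univ-lille1.fr/etudes/offre-de-formation/"
        "Sciences-Technologies-Sante?domaine=1&diplome=TI-DUT")

# The two href styles discussed above: query-only, and with the last
# path segment spelled out.
query_only = "?domaine=1&diplome=TI-DUT&mention=FR_RNE_0593559Y_PR_ST-dut-000001"
with_path = "Sciences-Technologies-Sante" + query_only

resolved_query_only = urljoin(BASE, query_only)
resolved_with_path = urljoin(BASE, with_path)

print(resolved_query_only)
print(resolved_with_path)
# An RFC 3986 resolver yields the same absolute URL for both forms.
print(resolved_query_only == resolved_with_path)
```

A resolver that treats the query-only form differently would produce another absolute URL. That would be consistent with the workaround above, but the thread does not confirm it as the cause.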


Best regards,
David

Quoting [email protected]:

> I just removed the empty parameters from the URLs, so I now have new URLs,
> but nothing changes.
>
> Nutch fetches this:
>
> http://www.univ-lille1.fr/etudes/offre-de-formation/Sciences-Technologies-Sante?domaine=1&diplome=TI-DUT
>
> Nutch doesn't fetch this:
>
> http://www.univ-lille1.fr/etudes/offre-de-formation/Sciences-Technologies-Sante?domaine=1&diplome=TI-DUT&mention=FR_RNE_0593559Y_PR_ST-dut-000001&specialite=FR_RNE_0593559Y_PR_formation-DUT-MP
>
> Best regards
> David
>
> Quoting Markus Jelsma <[email protected]>:
>
> > I haven't tried it but there's something odd (but valid) going on. It
> > doesn't fetch the URL where the equals sign is immediately followed by an
> > ampersand. Could you do a test and verify this?
> >
> > If verified, please open a ticket.
> >
> > On Wednesday 27 April 2011 16:28:36 [email protected] wrote:
> > > Hi,
> > >
> > > I'm quite lost here, I suspect that these URLs don't want to be fetched :)
> > >
> > > Seriously, I'm not sure it's a problem with query strings, as Nutch
> > > fetches some URLs with GET parameters successfully. I tried changing lots
> > > of properties (like db.max.inlinks, db.max.anchor.length,
> > > http.content.limit) in nutch-default.xml, without success.
> > >
> > > Best regards,
> > > David
> > >
> > > Quoting "McGibbney, Lewis John" <[email protected]>:
> > > > Hi,
> > > >
> > > > Has this moved on any?
> > > >
> > > > Did you manage to successfully fetch your URLs? I have been away and
> > > > didn't get time to complete.
> > > >
> > > > ________________________________________
> > > > From: [email protected] [[email protected]]
> > > > Sent: 21 April 2011 21:11
> > > > To: [email protected]
> > > > Subject: RE: Fetching urls with query string
> > > >
> > > > Hi,
> > > >
> > > > Sorry, I didn't provide the real URLs. Here they are:
> > > >
> > > > Nutch fetches this:
> > > >
> > > > http://www.univ-lille1.fr/etudes/offre-de-formation/Sciences-Technologies-Sante?domaine=1&diplome=TI-DUT&composante=
> > > >
> > > > Nutch does not fetch this:
> > > >
> > > > http://www.univ-lille1.fr/etudes/offre-de-formation/Sciences-Technologies-Sante?domaine=1&diplome=TI-DUT&composante=&mention=FR_RNE_0593559Y_PR_ST-dut-000001&specialite=FR_RNE_0593559Y_PR_formation-DUT-INFO
> > > >
> > > > My crawl-urlfilter:
> > > >
> > > > # skip file:, ftp:, & mailto: urls
> > > > -^(file|ftp|mailto):
> > > >
> > > > # skip image and other suffixes we can't yet parse
> > > > -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|js|crt|cert)$
> > > >
> > > > # skip URLs containing certain characters as probable queries, etc.
> > > > #-[?*!@=]
> > > >
> > > > # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> > > > -.*(/.+?)/.*?\1/.*?\1/
> > > >
> > > > # crawl only front-ig1
> > > > +^http://www.univ-lille1.fr/etudes/offre-de-formation
> > > >
> > > > # skip everything else
> > > > -.
> > > >
> > > >
> > > > If I uncomment -[?*!@=], Nutch doesn't fetch URLs with query strings
> > > > at all.
> > > > For information, I use Nutch 0.9 (but I tried a fresh install of 1.2
> > > > and I'm having the same problem).
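
To check whether the crawl-urlfilter rules quoted above could be what rejects the longer URL, here is a rough Python sketch that replays them against both URLs. It rests on two assumptions: the rules are tried in file order and the first one whose pattern is found anywhere in the URL decides, and Python's `re` behaves like Java regex for these particular patterns. The `accepts` helper and the constant names are hypothetical, not part of Nutch.

```python
import re

# (accept?, pattern) pairs in file order, copied from the filter above.
RULES = [
    (False, r"^(file|ftp|mailto):"),
    (False, r"\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg"
            r"|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|js|crt|cert)$"),
    (False, r".*(/.+?)/.*?\1/.*?\1/"),
    (True, r"^http://www.univ-lille1.fr/etudes/offre-de-formation"),
    (False, r"."),
]

# Same rule set with the commented-out query rule switched on, in its
# original position (after the suffix rule).
WITH_QUERY_RULE = RULES[:2] + [(False, r"[?*!@=]")] + RULES[2:]

PREFIX = ("http://www.univ-lille1.fr/etudes/offre-de-formation/"
          "Sciences-Technologies-Sante?domaine=1&diplome=TI-DUT&composante=")
SHORT = PREFIX
LONG = (PREFIX + "&mention=FR_RNE_0593559Y_PR_ST-dut-000001"
        "&specialite=FR_RNE_0593559Y_PR_formation-DUT-INFO")


def accepts(url, rules=RULES):
    """Assumed semantics: the first rule whose pattern is found decides."""
    for accept, pattern in rules:
        if re.search(pattern, url):
            return accept
    return False  # no rule matched


print(accepts(SHORT), accepts(LONG))
print(accepts(SHORT, WITH_QUERY_RULE), accepts(LONG, WITH_QUERY_RULE))
```

Under those assumptions the filter as posted accepts both URLs, and turning on `-[?*!@=]` rejects both, which matches what is reported above and suggests the filter file is not what separates the two URLs.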
> > > >
> > > > Thanks for your answer John
> > > > Best regards
> > > > David
> > > >
> > > > Quoting "McGibbney, Lewis John" <[email protected]>:
> > > > > Hi,
> > > > >
> > > > > It appears that both of the urls you posted return 404 not found then
> > > > > autoredirect to a domain seller!
> > > > >
> > > > > Further to this, did you remove the comment on this
> > > > >
> > > > > #-[?*!@=]... from the info provided below it appears you have not.
> > > > >
> > > > > hth
> > > > >
> > > > > Lewis
> > > > >
> > > > > ________________________________________
> > > > > From: [email protected] [[email protected]]
> > > > > Sent: 21 April 2011 16:15
> > > > > To: [email protected]
> > > > > Subject: Fetching urls with query string
> > > > >
> > > > > Hello,
> > > > >
> > > > > > I have problems fetching some URLs that have GET parameters with Nutch.
> > > > > > For example, Nutch is fetching:
> > > > > >
> > > > > > http://www.mywebsite.com/studies/formation-offer/Sciences-Technologies-Sante?domaine=1&diplome=TI-DUT&composante=
> > > > > >
> > > > > > but will not fetch:
> > > > > >
> > > > > > http://www.mywebsite.com/studies/formation-offer/Sciences-Technologies-Sante?domaine=1&diplome=TI-DUT&composante=&mention=FR_RNE_0593559Y_PR_ST-dut-000001&specialite=FR_RNE_0593559Y_PR_formation-DUT-INFO
> > > > > >
> > > > > > I updated the crawl-urlfilter:
> > > > > >
> > > > > > #-[?*!@=]
> > > > > >
> > > > > > +^http://www.mywebsite.com/studies/formation-offer/
> > > > > >
> > > > > > and nutch-default.xml:
> > > > > >
> > > > > > <property>
> > > > > >   <name>db.max.anchor.length</name>
> > > > > >   <value>300</value>
> > > > > >   <description>The maximum number of characters permitted in an anchor.
> > > > > >   </description>
> > > > > > </property>
> > > > > >
> > > > > > but I get the same result, and I didn't find anything in the
> > > > > > configuration files to make it work. Does anybody have an idea?
> > > > >
> > > > > Best regards,
> > > > > David
> > > > >
> > > > > Email has been scanned for viruses by Altman Technologies' email
> > > > > management service - www.altman.co.uk/emailsystems
> > > > >
> > > > > > Glasgow Caledonian University is a registered Scottish charity,
> > > > > > number SC021474
> > > > > >
> > > > > > Winner: Times Higher Education's Widening Participation Initiative
> > > > > > of the Year 2009 and Herald Society's Education Initiative of the
> > > > > > Year 2009.
> > > > > > http://www.gcu.ac.uk/newsevents/news/bycategory/theuniversity/1/name,6219,en.html
> > > > > >
> > > > > > Winner: Times Higher Education's Outstanding Support for Early
> > > > > > Career Researchers of the Year 2010, GCU as a lead with Universities
> > > > > > Scotland partners.
> > > > > > http://www.gcu.ac.uk/newsevents/news/bycategory/theuniversity/1/name,15691,en.html
> > >
> >
> > --
> > Markus Jelsma - CTO - Openindex
> > http://www.linkedin.com/in/markus17
> > 050-8536620 / 06-50258350
> >
>
>
>
