Re: How to extract & follow relative links ?

Rolando Espinoza La Fuente Wed, 25 Dec 2013 09:40:25 -0800

Perhaps your links get filtered out because the relative URL points to the
same page once the #fragment is removed by the link extractor.
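
To make that concrete, here is a small standard-library sketch (Python 3) of the URL arithmetic involved. It is not the code the link extractor runs internally, only an illustration of why a fragment-only href collapses to the page URL; the page URL and href are the ones from this thread.

```python
from urllib.parse import urljoin, urldefrag

page_url = 'http://www.example.com/'
href = '#date=2013-12-24&Id=1269282'

# Resolve the fragment-only href against the page it was found on.
absolute = urljoin(page_url, href)

# Split off the fragment, as URL canonicalization would.
without_fragment, fragment = urldefrag(absolute)

print(absolute)                      # page URL plus the fragment
print(without_fragment == page_url)  # the fragment-less URL is just the page
```

Once the fragment is gone, every such link looks like the page that was already fetched, so it can be dropped as a duplicate.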


You can get the links with fragments by using the canonicalize=False option.

See scrapy shell session below:

In [1]: body = '<a href="#date=2013-12-24&Id=1269282">Tynwald Titan</a>'

In [2]: from scrapy.http import HtmlResponse

In [3]: from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

In [4]: r = HtmlResponse('http://www.example.com/', body=body)

In [5]: lx = SgmlLinkExtractor(canonicalize=False)

In [6]: lx.extract_links(r)
Out[6]: [Link(url='http://www.example.com/#date=2013-12-24&Id=1269282',
text=u'Tynwald Titan', fragment='', nofollow=False)]

But the #fragment is not supposed to be sent to the server, so if you
attempt to request that URL you will get the same page. Most likely the
website uses JavaScript to display the information based on the fragment.
Given that Scrapy doesn't execute JavaScript, you will need to figure out
what the site does and reproduce that in Scrapy (e.g. by building an AJAX
request with the date and id from the fragment).
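
As a rough sketch of that last step (Python 3, standard library only): the helper below parses the fragment into parameters and moves them into the query string of another URL. The function name `fragment_to_ajax_url` and the path `/ajax/details` are made up for illustration; the real endpoint and parameter names have to be found by watching what the site's JavaScript requests.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qs, urlencode


def fragment_to_ajax_url(link_url, ajax_path='/ajax/details'):
    """Build a URL whose query string carries the fragment's parameters.

    ajax_path is a hypothetical endpoint, not something the site is
    known to expose.
    """
    parts = urlsplit(link_url)
    # The fragment has query-string syntax: date=...&Id=...
    params = parse_qs(parts.fragment)
    # Keep the first value of each parameter.
    query = urlencode({key: values[0] for key, values in params.items()})
    return urlunsplit((parts.scheme, parts.netloc, ajax_path, query, ''))


url = fragment_to_ajax_url(
    'http://www.example.com/#date=2013-12-24&Id=1269282')
print(url)
```

In a spider, a URL built this way would be what you hand to a Request in your callback, instead of following the fragment link itself.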

Regards
Rolando



On Wed, Dec 25, 2013 at 10:24 AM, Roberto López <
[email protected]> wrote:

> In the end I used this rule:
>
> rules = (
>         Rule(SgmlLinkExtractor(restrict_xpaths=('//th[@class="headline"]'),
>                                tags=("a"), attrs=("href"), allow=(r''),
>                                process_value=my_process_value),
>              callback='my_parser', follow=False),
>         )
>
>
>  def my_process_value(value):
>         print '---->'+value
>         return
>
>
> I get the links I want, the links within restrict_xpaths, so it works. This
> is the output:
>
> ---->#Day=2013-12-24&Id=33
>
> But I would like to do the same using allow and process_value. Do
> you know how I can do it?
>
>
>
> On Tuesday, 24 December 2013 23:31:43 UTC+1, Roberto López wrote:
>
>> Hi.
>>
>>
>> I have to extract and follow links like this:
>>
>> <a href="#date=2013-12-24&Id=1269282">Tynwald Titan</a>
>>
>> The following rule doesn't work; no links are found:
>>
>> rules = ( Rule(SgmlLinkExtractor(allow=r''), callback='parse_item',
>>                follow=True), )
>>
>> Do you know how I can do it ?
>>
>> Best regards
>>
>>  --
> You received this message because you are subscribed to the Google Groups
> "scrapy-users" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at http://groups.google.com/group/scrapy-users.
> For more options, visit https://groups.google.com/groups/opt_out.
>

