Thanks, Rolando. That's clear and it helped me a lot.
I'm still learning. Now it's time to learn about item loaders.
Regards!
On Wednesday, December 25, 2013, 18:38:58 UTC+1, Rolando Espinoza La fuente wrote:
>
> Perhaps your links get filtered out because the relative url becomes the
> same page once the #fragment is removed by the link extractor.
>
> You can get the links with fragments by using the canonicalize=False
> option.
>
> See scrapy shell session below:
>
> In [1]: body = '<a href="#date=2013-12-24&Id=1269282">Tynwald Titan</a>'
>
> In [2]: from scrapy.http import HtmlResponse
>
> In [3]: from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
>
> In [4]: r = HtmlResponse('http://www.example.com/', body=body)
>
> In [5]: lx = SgmlLinkExtractor(canonicalize=False)
>
> In [6]: lx.extract_links(r)
> Out[6]: [Link(url='http://www.example.com/#date=2013-12-24&Id=1269282',
> text=u'Tynwald Titan', fragment='', nofollow=False)]
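>
> In a CrawlSpider rule that would look roughly like this (the allow pattern
> here is just a guess on my part, adjust it to the links you actually want):
>
>     rules = (
>         Rule(SgmlLinkExtractor(allow=(r'date=\d{4}-\d{2}-\d{2}&Id=\d+',),
>                                canonicalize=False,
>                                process_value=my_process_value),
>              callback='my_parser', follow=False),
>     )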
>
> But the #fragment is not supposed to be sent to the server, so if you
> attempt to request that url you will get the same page. Most likely the
> website uses javascript to display the information based on the fragment.
> Given that scrapy doesn't execute javascript, you will need to figure out
> what the site does and reproduce that in scrapy (e.g. building an ajax
> request with the date and id taken from the fragment).
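>
> For example, a rough sketch of that idea (the ajax url below is made up;
> you would have to check your browser's network tab to see which request
> the page actually makes):
>
>     import urlparse
>     from scrapy.http import Request
>
>     def build_ajax_request(link_url, callback):
>         # '#date=2013-12-24&Id=1269282' -> {'date': '2013-12-24', 'Id': '1269282'}
>         params = dict(urlparse.parse_qsl(urlparse.urlparse(link_url).fragment))
>         # hypothetical endpoint -- replace with whatever the site really requests
>         ajax_url = 'http://www.example.com/ajax/item?date=%s&id=%s' % (
>             params['date'], params['Id'])
>         return Request(ajax_url, callback=callback)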
>
> Regards
> Rolando
>
>
>
> On Wed, Dec 25, 2013 at 10:24 AM, Roberto López <[email protected]>
> wrote:
>
>> Finally I have used this Rule:
>>
>>     rules = (
>>         Rule(SgmlLinkExtractor(restrict_xpaths=('//th[@class="headline"]',),
>>                                tags=('a',), attrs=('href',), allow=(r'',),
>>                                process_value=my_process_value),
>>              callback='my_parser', follow=False),
>>     )
>>
>>     def my_process_value(value):
>>         print '---->' + value
>>         return
>>
>>
>> I get the links I want (the links within the restricted xpath), so it works.
>> This is the output:
>>
>> ---->#Day=2013-12-24&Id=33
>>
>> But... I would like to do the same using allow and process_value. Do you
>> know how I can do it?
>>
>>
>>
>> On Tuesday, December 24, 2013, 23:31:43 UTC+1, Roberto López wrote:
>>
>>> Hi.
>>>
>>>
>>> I have to extract and follow links like this:
>>>
>>> <a href="#date=2013-12-24&Id=1269282">Tynwald Titan</a>
>>>
>>> The following rule doesn't work; no links are found:
>>>
>>>     rules = (
>>>         Rule(SgmlLinkExtractor(allow=r''), callback='parse_item', follow=True),
>>>     )
>>>
>>> Do you know how I can do it ?
>>>
>>> Best regards
>>>