Re: some help for regex with scrapy

Artem Utin Sat, 03 Sep 2016 21:48:07 -0700

Hello. 

I'd recommend using selectors as much as possible before diving into 
regexes, especially if you're not comfortable with them.
So, you can use response.xpath('/html/body/div[3]/ul/li/a/@href').extract() 
to extract the anchors' hrefs, 
and response.xpath('/html/body/div[3]/ul/li/a/text()').extract() to extract 
the anchors' text (it's mentioned in the docs 
<http://doc.scrapy.org/en/1.1/topics/selectors.html>, btw).
Afterwards, you can try out a regex for extracting the city names at pythex 
<http://pythex.org/?regex=.*%2F(.*%3F).php&test_string=%3Ca%20href%3D%22http%3A%2F%2Fwww.nowdl.cn%2Fcity%2Fbeijing%2Fbeijing.php%22%20target%3D%22_blank%22%3E%5Cu5317%5Cu4eac%3C%2Fa%3E&ignorecase=0&multiline=0&dotall=0&verbose=0>
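
As a rough sketch of the regex step, assuming the two xpath calls have 
already returned plain lists of strings: the lists below are hard-coded 
stand-ins (only the Beijing entry comes from this thread, the Shanghai one is 
made up), and city_slug and CITY_RE are hypothetical names. The pattern is a 
slightly stricter variant of the one in the pythex link, with the dot escaped 
and the match anchored to the end of the URL.

```python
import re

# Stand-ins for the results of the two response.xpath(...).extract() calls.
# Only the Beijing entry appears in the thread; the second one is invented.
hrefs = [
    "http://www.nowdl.cn/city/beijing/beijing.php",
    "http://www.nowdl.cn/city/shanghai/shanghai.php",  # hypothetical
]
names = ["\u5317\u4eac", "\u4e0a\u6d77"]

# Capture the last path segment, without the ".php" suffix.
CITY_RE = re.compile(r"/([^/]+)\.php$")


def city_slug(href):
    """Return the part of the URL before '.php', or None if it does not match."""
    match = CITY_RE.search(href)
    return match.group(1) if match else None


slugs = [city_slug(h) for h in hrefs]

# Pair each slug with the anchor text taken from the same position.
for slug, name in zip(slugs, names):
    print(slug, name)
```

The \uXXXX escapes are just another spelling of the same characters, so no 
conversion step should be needed for the anchor text: printing the string 
shows the Chinese characters, provided the terminal can display them.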


On Sunday, September 4, 2016 at 3:04:35 AM UTC+10, peter zhu wrote:
>
> Hey, guys!
>  http://www.nowdl.cn/all.html
> My steps:
> 1. scrapy shell http://www.nowdl.cn/all.html
> 2. response.xpath('/html/body/div[3]/ul/li/a').extract()
> I want to extract the content before the ".php" suffix.
> For example:
> u'<a href="http://www.nowdl.cn/city/beijing/*beijing*.php" 
> target="_blank">\u5317\u4eac</a>',
> I need the bold "beijing", and I want to change the unicode "\u5317\u4eac" 
> --> "北京市".
> Now my questions are:
> 1. How do I use a regex to extract the content I need?
> 2. How do I change the unicode to Chinese?
> Thanks for any suggestions!
>
