ok. the extraneous paragraph issue is solved:


>>> response.xpath("//body/div[4]/p//text()").extract()

See http://www.zvon.org/comp/r/tut-XPath_1.html#Pages~Start_with_%2F%2F 
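
For anyone who hits the same thing later, here is a minimal sketch of why 
anchoring the div to body fixed it (the html below is made up to mirror the 
shape of my page, not the real source):

>>> from scrapy.selector import Selector
>>> html = ("<html><body>"
...         "<div></div><div></div><div></div>"
...         "<div><p>test paragraph</p></div>"
...         "<div><div></div><div></div><div></div>"
...         "<div><p>footnote</p></div></div>"
...         "</body></html>")
>>> Selector(text=html).xpath("//div[4]/p/text()").extract()
[u'test paragraph', u'footnote']
>>> Selector(text=html).xpath("//body/div[4]/p/text()").extract()
[u'test paragraph']

//div[4] matches any div that is the 4th div child of its own parent, at any 
depth, which is why the nested footnote div matched too.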


The text inside the <i> tag is solved too; I got that fix from SO:
http://stackoverflow.com/questions/23459521/scrapy-combine-text-along-with-the-bold?rq=1
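
For the record, the gaps came from /text() only returning direct text 
children; //text() also descends into inline tags like <i>. A quick sketch 
(made-up markup):

>>> from scrapy.selector import Selector
>>> sel = Selector(text="<p>before <i>italic</i> after</p>")
>>> sel.xpath("//p/text()").extract()
[u'before ', u' after']
>>> sel.xpath("//p//text()").extract()
[u'before ', u'italic', u' after']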


The solution in the docs,
http://doc.scrapy.org/en/1.0/topics/selectors.html#working-with-relative-xpaths,
did not work. I got:


Traceback (most recent call last):
  File "<console>", line 1, in <module>
AttributeError: 'list' object has no attribute 'xpath'
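
My guess, without my exact shell history in front of me, is that I called 
.extract() too early; it returns a plain Python list of strings, which 
can't be chained with .xpath():

>>> divs = response.xpath("//div").extract()  # list of strings -- dead end
>>> divs.xpath(".//p")
AttributeError: 'list' object has no attribute 'xpath'
>>> divs = response.xpath("//div")            # SelectorList -- chainable
>>> divs.xpath(".//p//text()").extract()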


Perhaps I misunderstood the use case the docs had in mind? If not, that 
section is a good candidate for a rewrite. I'd even be willing to do it, if 
a more knowledgeable core dev were available to answer questions as I 
went through my drafts...

On Sunday, August 23, 2015 at 1:09:08 PM UTC-5, Travis Leleu wrote:
>
> Without knowing the exact source of the page you're trying to scrape, 
> we're flying a little blind here.
>
> It does look like you're using an absolute xpath instead of (what I think 
> you want) a relative one (try './div[4]/p' instead of '//div[4]/p').  What 
> you wrote will return the p underneath any 4th div anywhere in the tree; 
> I'm assuming you previously selected that section and now want to select a 
> child element.
>
> In all, it sounds like you could use some more practice with xpath 
> selectors, as yours don't seem to be quite right for what you're trying to 
> do.
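>
> For example (the selection is made up, but the shapes are what matter):
>
>     section = response.xpath("//body/div[25]")[0]
>     section.xpath("//div[4]/p")    # absolute: searches the whole document
>     section.xpath("./div[4]/p")    # relative: direct div children only
>     section.xpath(".//div[4]/p")   # relative: any descendant of section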
>
> On Sun, Aug 23, 2015 at 10:40 AM, Malik Rumi <[email protected]> wrote:
>
>> I let fly... and discovered two issues.
>>
>> I did a test targeting a single paragraph whose xpath I had copied. But 
>> any text inside that paragraph that was wrapped in an html tag (<i></i>, 
>> in this case) was skipped as if it were html, leaving jagged, unexplained 
>> gaps in the text I retrieved.
>>
>> Then, and this I have no explanation for, it did NOT stop at the one test 
>> paragraph, but ran on to scrape a footnote. The thing is, the footnote was 
>> NOT in the test paragraph; it came four paragraphs later. Here is the 
>> xpath for my test paragraph: /html/body/div[4]/p
>>
>> And this is the one for the rogue extra footnote: 
>> /html/body/div[25]/div[4]/p
>>
>> So Scrapy ignored the extra 'div[25]' and treated both the same. Now, I 
>> didn't put the full path in my code; I put //div[4]/p. So maybe I can play 
>> with this one and get it fixed. But what do I do about the text inside the 
>> <i> tag? If people use that instead of CSS, this will be an ongoing 
>> problem. Thanks.
>>
>>
>> On Friday, August 21, 2015 at 5:03:40 PM UTC-5, Malik Rumi wrote:
>>>
>>> Well, I have been accused of overthinking things before, so you might be 
>>> onto something. All right, I will let fly.
>>>
>>> On Thursday, August 20, 2015 at 11:09:23 PM UTC-5, Travis Leleu wrote:
>>>>
>>>> Hi Malik,
>>>>
>>>> I speak only for myself, but I always thought the emphasis on CSV item 
>>>> exporting was to eliminate barriers to entry in getting a scrape up and 
>>>> running.  If you can export to CSV, you can open it in Excel, and anybody 
>>>> reading the Scrapy tutorial can do that.
>>>>
>>>> I'm not 100% clear on your objectives, but I generally do string 
>>>> manipulation in my scrapers, using the db to handle deduplication.  Then I 
>>>> write processing scripts from there to normalize, fuzzy dedupe, etc.
>>>>
>>>> It sounds a little like you're overthinking it -- I'd recommend just 
>>>> letting it fly: grab the data you want in the scraper and save it as a 
>>>> DjangoItem.  You can, and will, rescrape later.
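>>>>
>>>> Roughly like this (a sketch; Article is a made-up Django model, and in 
>>>> Scrapy 1.0 DjangoItem lives in the separate scrapy-djangoitem package):
>>>>
>>>>     from scrapy_djangoitem import DjangoItem
>>>>     from myapp.models import Article
>>>>
>>>>     class ArticleItem(DjangoItem):
>>>>         django_model = Article
>>>>
>>>>     # a pipeline (or the spider) can then call item.save() to write
>>>>     # the row straight to the db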
>>>>
>>>>
>>>> On Thu, Aug 20, 2015 at 8:47 PM, Malik Rumi <[email protected]> 
>>>> wrote:
>>>>
>>>>> I am getting ready to work on my first Scrapy project. I've done the 
>>>>> dmoz tutorial and looked at the docs. My project is to obtain the data, 
>>>>> do a large number of search and replaces and any other needed clean up, 
>>>>> and save it to Postgres. I was wondering what 'best practices' are for 
>>>>> putting together the pipeline?
>>>>>
>>>>>
>>>>> If I am understanding things correctly, most pipelines write the 
>>>>> results of the scrape into plain text files for processing, regardless 
>>>>> of the specific tools to be used in that processing, and then bulk 
>>>>> upload the finished product to the database.
>>>>>
>>>>>
>>>>> However, I am seriously considering going straight to the database 
>>>>> with DjangoItem, so that I can calculate the urls and then incorporate 
>>>>> that data in my search and replace. I suspect trying to do all this 
>>>>> text processing in the db is a bad idea, but I don't know that for 
>>>>> sure. Maybe it makes no difference?
>>>>>
>>>>>
>>>>> Another option might be to tee the scrape into both the db and text 
>>>>> files. This way I can still use the db to calculate the urls, even if 
>>>>> that is all I do with those results. Then I could process the text 
>>>>> files and INSERT/UPDATE the final result back into Postgres, 
>>>>> overwriting the original raw scrape content. But then I wondered about 
>>>>> keeping track of all the changes. Has anyone used Git in a situation 
>>>>> like this?
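>>>>>
>>>>> For concreteness, the db side of that tee would look roughly like this 
>>>>> (a sketch only; the table and columns are placeholders):
>>>>>
>>>>>     import psycopg2
>>>>>
>>>>>     class PostgresPipeline(object):
>>>>>         def open_spider(self, spider):
>>>>>             self.conn = psycopg2.connect(dbname="scrape", user="scrapy")
>>>>>             self.cur = self.conn.cursor()
>>>>>
>>>>>         def process_item(self, item, spider):
>>>>>             # raw insert; the search and replace pass would run on
>>>>>             # these rows later
>>>>>             self.cur.execute(
>>>>>                 "INSERT INTO raw_pages (url, body) VALUES (%s, %s)",
>>>>>                 (item["url"], item["body"]))
>>>>>             self.conn.commit()
>>>>>             return item
>>>>>
>>>>>         def close_spider(self, spider):
>>>>>             self.cur.close()
>>>>>             self.conn.close()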
>>>>>
>>>>>
>>>>> Thanks for sharing.
>>>>>

