I let fly... and discovered two issues. I ran a test targeting a single paragraph whose xpath I had. But any text inside that paragraph that was wrapped in an html tag (<i></i>, in this case) was also skipped, as if it were markup rather than content, leaving jagged, unexplained gaps in the text I retrieved.
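For the record, here is a minimal reproduction of those gaps (the snippet is made up for illustration, not from the actual site):

    from scrapy.selector import Selector

    html = '<div><p>before <i>italic part</i> after</p></div>'
    sel = Selector(text=html)

    # p/text() selects only the <p> element's own text nodes,
    # so whatever sits inside <i> falls into a gap:
    print(sel.xpath('//p/text()').extract())
    # -> ['before ', ' after']

    # p//text() selects text nodes at any depth under <p>,
    # and the italic text comes back:
    print(sel.xpath('//p//text()').extract())
    # -> ['before ', 'italic part', ' after']

Switching the trailing /text() to //text() seems to pull the nested text back in, since it descends into child elements instead of stopping at the paragraph's direct children, but I don't know if that is the 'right' fix.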
Then, and this I have no explanation for, it did NOT stop at the one test paragraph, but ran on to scrape a footnote. The thing is, the footnote was NOT in the test paragraph, but came four paragraphs later. Here is the xpath for my test paragraph:

/html/body/div[4]/p

And this is the one for the rogue extra footnote:

/html/body/div[25]/div[4]/p

So scrapy ignored the extra 'div[25]' and treated both the same. Now, I didn't put the full path in my code; I put //div[4]/p. So maybe I can play with this one and get it fixed (a sketch of what I think is going on is below, after the quoted thread). But what do I do about the text inside the <i> tag? If people use that instead of CSS, this will be an ongoing problem. Thanks.

On Friday, August 21, 2015 at 5:03:40 PM UTC-5, Malik Rumi wrote:
>
> Well, I have been accused of overthinking things before, so you might be
> onto something. All right, I will let fly.
>
> On Thursday, August 20, 2015 at 11:09:23 PM UTC-5, Travis Leleu wrote:
>>
>> Hi Malik,
>>
>> I speak only for myself, but I always thought the emphasis on csv item
>> exporting was to eliminate barriers to entry when getting a scrape up and
>> running. If you can export to csv, you can open it in Excel, and anybody
>> reading the scrapy tutorial can do that.
>>
>> I'm not 100% clear on your objectives, but I generally do string
>> manipulation in my scrapers, using the db to handle deduplication. Then I
>> write processing scripts from there to normalize, fuzzy dedupe, etc.
>>
>> It sounds a little like you're overthinking it -- I'd recommend just
>> letting it fly: grab the data you want in the scraper and save it as a
>> DjangoItem. You can, and will, rescrape later.
>>
>> On Thu, Aug 20, 2015 at 8:47 PM, Malik Rumi <[email protected]> wrote:
>>
>>> I am getting ready to work on my first scrapy project. I've done the
>>> dmoz tutorial and looked at the docs. My project is to obtain the data,
>>> do a large number of search-and-replace operations and any other needed
>>> cleanup, and save it to Postgres. I was wondering what 'best practices'
>>> are for putting together the pipeline?
>>>
>>> If I am understanding things correctly, most pipelines write the
>>> results of the scrape into plain text files for processing, regardless
>>> of the specific tools to be used in that processing, and then bulk
>>> upload the finished product to the database.
>>>
>>> However, I am seriously considering going straight to the database with
>>> DjangoItem, so that I can calculate the urls and then incorporate that
>>> data in my search and replace. But I suspect trying to do all this text
>>> processing in the db is a bad idea, though I don't know that for sure.
>>> Maybe it makes no difference?
>>>
>>> Another option might be to tee the scrape into both the db and text
>>> files. This way I can still use the db to calculate the urls, even if
>>> that is all I do with those results. Then I could process the text files
>>> and INSERT/UPDATE the final result back into Postgres, overwriting the
>>> original raw scrape content. But then I wondered about keeping track of
>>> all the changes. Has anyone used Git in a situation like this?
>>>
>>> Thanks for sharing.
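p.s. Here is a sketch of the rogue footnote problem, with stand-in markup (fewer sibling divs than the real page's div[25], but the same shape). The point is that a leading // means "at any depth", so //div[4]/p matches any p whose parent is the fourth div child of *some* element, not just the fourth div under body:

    from scrapy.selector import Selector

    html = """<html><body>
    <div></div><div></div><div></div>
    <div><p>test paragraph</p></div>
    <div>
      <div></div><div></div><div></div>
      <div><p>rogue footnote</p></div>
    </div>
    </body></html>"""
    sel = Selector(text=html)

    # The relative path matches BOTH paragraphs:
    print(sel.xpath('//div[4]/p/text()').extract())
    # -> ['test paragraph', 'rogue footnote']

    # Anchoring the path at the root pins it to the intended one:
    print(sel.xpath('/html/body/div[4]/p/text()').extract())
    # -> ['test paragraph']

So anchoring the expression at the root (or at least starting it from //body/div[4]/p) looks like the fix for the extra footnote.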
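And for the DjangoItem route from the thread below, this is the minimal setup I am planning to try, roughly as Travis suggests. The Article model and app name are placeholders, it assumes DJANGO_SETTINGS_MODULE is configured so the model import works, and on recent Scrapy versions DjangoItem lives in the separate scrapy-djangoitem package:

    from scrapy_djangoitem import DjangoItem
    from myapp.models import Article  # placeholder Django model

    class ArticleItem(DjangoItem):
        # item fields are derived automatically from the model's fields
        django_model = Article

    class DjangoWriterPipeline(object):
        def process_item(self, item, spider):
            item.save()   # INSERTs a row via the Django ORM
            return item   # pass the item on so a later pipeline or a
                          # feed export can also see it (the 'tee' idea)

With the pipeline enabled in settings.py, e.g. ITEM_PIPELINES = {'myproject.pipelines.DjangoWriterPipeline': 300}, running the spider with -o items.csv would write the same items to a csv file as well, which seems to cover the tee-to-text-files option without a second pipeline.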
