Without knowing the exact source of the page you're trying to scrape, we're flying a little blind here.
It does look like you're using a document-wide xpath instead of (what I think you want) one relative to the node you've already selected: try './div[4]/p' instead of '//div[4]/p'. What you wrote, '//div[4]/p', matches a p under a fourth div at any depth of the tree, which is why the footnote under div[25]/div[4] came back along with your test paragraph; I'm assuming you previously selected that section and now want to select a child element. In all, it sounds like you could use some more practice with xpath selectors, as yours don't seem to be quite right for what you're trying to do. I've put a couple of rough sketches below the quoted thread.

On Sun, Aug 23, 2015 at 10:40 AM, Malik Rumi <[email protected]> wrote:

> I let fly... and discovered two issues.
>
> I did a test targeting a single paragraph which I got the xpath for. But
> any text inside that paragraph that was wrapped in an html tag (<i></i>, in
> this case) was also skipped as if it were html, leaving jagged, unexplained
> gaps in the text I retrieved.
>
> Then, and this I have no explanation for, it did NOT stop at the one test
> paragraph, but ran on to scrape a footnote. The thing is, the footnote was
> NOT in the test paragraph, but came four paragraphs later. Here is the xpath
> for my test paragraph: /html/body/div[4]/p
>
> And this is the one for the rogue extra
> footnote: /html/body/div[25]/div[4]/p
>
> So scrapy ignored the extra 'div[25]' and treated both the same. Now I
> didn't put the full path in my code, I put //div[4]/p. So maybe I can play
> with this one and get it fixed. But what do I do about the text inside the
> <i> tag? If people use that instead of CSS, this will be an ongoing
> problem. Thanks.
>
>
> On Friday, August 21, 2015 at 5:03:40 PM UTC-5, Malik Rumi wrote:
>>
>> Well, I have been accused of overthinking things before, so you might be
>> onto something. All right, I will let fly.
>>
>> On Thursday, August 20, 2015 at 11:09:23 PM UTC-5, Travis Leleu wrote:
>>>
>>> Hi Malik,
>>>
>>> I speak only for myself, but I always thought the emphasis on csv item
>>> exporting was to eliminate barriers to entry to get a scrape up and
>>> running. If you can export to csv, you can open it in Excel, and anybody
>>> reading the scrapy tutorial can do that.
>>>
>>> I'm not 100% clear on your objectives, but I generally do string
>>> manipulation in my scrapers, using the db to handle deduplication. Then I
>>> write processing scripts from there to normalize, fuzzy dedupe, etc.
>>>
>>> It sounds a little like you're overthinking it -- I'd recommend just
>>> letting it fly, grab the data you want in the scraper, and save it as a
>>> DjangoItem. You can, and will, rescrape later.
>>>
>>>
>>> On Thu, Aug 20, 2015 at 8:47 PM, Malik Rumi <[email protected]> wrote:
>>>
>>>> I am getting ready to work on my first scrapy project. I've done the
>>>> dmoz tutorial and looked at the docs. My project is to obtain the data, do
>>>> a large number of search and replaces, any other needed clean up, and save
>>>> it to Postgres. I was wondering what 'best practices' are for putting
>>>> together the pipeline?
>>>>
>>>> If I am understanding things correctly, most pipelines write the
>>>> results of the scrape into plain text files for processing, regardless of
>>>> the specific tools to be used in that processing, and then bulk upload the
>>>> finished product to the database.
>>>>
>>>> However, I am seriously considering going straight to the database with
>>>> DjangoItem, so that I can calculate the urls and then incorporate that data
>>>> in my search and replace. But I suspect trying to do all this text
>>>> processing in the db is a bad idea, though I don't know that for sure.
>>>> Maybe it makes no difference?
>>>>
>>>> Another option might be to Tee the scrape into both the db and text
>>>> files. This way I can still use the db to calculate the urls, even if that
>>>> is all I do with those results. Then I could process the text files and
>>>> INSERT/UPDATE the final result back into Postgres, overwriting the original
>>>> raw scrape content. But then I wondered about keeping track of all the
>>>> changes. Has anyone used Git in a situation like this?
>>>>
>>>> Thanks for sharing.
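
Here's the rough, untested sketch I mentioned. I'm guessing at the page structure from the xpaths you quoted, and the start URL, spider name, and field names are placeholders:

    import scrapy

    class TestSpider(scrapy.Spider):
        name = 'test'
        start_urls = ['http://example.com/page']  # placeholder

        def parse(self, response):
            # Anchoring the path at the root matches only body's own 4th div,
            # so the footnote at /html/body/div[25]/div[4]/p is excluded.
            paragraph = response.xpath('/html/body/div[4]/p')

            # p/text() returns only the p's own text nodes and skips anything
            # wrapped in child tags like <i>; p//text() descends into those
            # children, so the italicised words come back too.
            pieces = paragraph.xpath('.//text()').extract()
            yield {'text': ''.join(pieces)}

            # If you've already narrowed to a section, keep the child lookup
            # relative to it: './div[4]/p' searches only inside `section`,
            # while '//div[4]/p' would search the whole document again.
            section = response.xpath('/html/body/div[25]')
            footnote = section.xpath('./div[4]/p//text()').extract()
            yield {'footnote': ''.join(footnote)}

If you'd rather keep the text fragments separate for your search-and-replace step, skip the join and store the list instead.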

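And since DjangoItem came up earlier in the thread, the minimal shape of that route is roughly the following. The Article model, app name, and pipeline class are placeholders, and I'm assuming the scrapy-djangoitem package (older Scrapy versions bundled DjangoItem in core):

    # items.py -- placeholder names
    from scrapy_djangoitem import DjangoItem
    from articles.models import Article  # your Django model

    class ArticleItem(DjangoItem):
        django_model = Article

    # pipelines.py
    class DjangoWriterPipeline(object):
        def process_item(self, item, spider):
            item.save()  # creates and saves a row via the Django model
            return item

You'd still need Django's settings importable (DJANGO_SETTINGS_MODULE) before the model import runs, and the pipeline enabled in ITEM_PIPELINES; the text clean-up can then happen in process_item before the save, or later against the rows in Postgres.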