I let fly... and discovered two issues. I ran a test targeting a single paragraph whose xpath I had. But any text inside that paragraph that was wrapped in an html tag (<i></i>, in this case) was also skipped, as if it were markup rather than content, leaving jagged, unexplained gaps in the text I retrieved.
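For the record, here is a minimal reproduction of those gaps (the snippet is made up for illustration, not from the actual site):

    from scrapy.selector import Selector

    html = '<div><p>before <i>italic part</i> after</p></div>'
    sel = Selector(text=html)

    # p/text() selects only the <p> element's own text nodes,
    # so whatever sits inside <i> falls into a gap:
    print(sel.xpath('//p/text()').extract())
    # -> ['before ', ' after']

    # p//text() selects text nodes at any depth under <p>,
    # and the italic text comes back:
    print(sel.xpath('//p//text()').extract())
    # -> ['before ', 'italic part', ' after']

Switching the trailing /text() to //text() seems to pull the nested text back in, since it descends into child elements instead of stopping at the paragraph's direct children, but I don't know if that is the 'right' fix.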
Then, and this I have no explanation for, it did NOT stop at the one test paragraph, but ran on to scrape a footnote. The thing is, the footnote was NOT in the test paragraph, but came four paragraphs later. Here is the xpath for my test paragraph:

/html/body/div[4]/p

And this is the one for the rogue extra footnote:

/html/body/div[25]/div[4]/p

So scrapy ignored the extra 'div[25]' and treated both the same. Now, I didn't put the full path in my code; I put //div[4]/p. So maybe I can play with this one and get it fixed (a sketch of what I think is going on is below, after the quoted thread). But what do I do about the text inside the <i> tag? If people use that instead of CSS, this will be an ongoing problem. Thanks.

On Friday, August 21, 2015 at 5:03:40 PM UTC-5, Malik Rumi wrote:
>
> Well, I have been accused of overthinking things before, so you might be
> onto something. All right, I will let fly.
>
> On Thursday, August 20, 2015 at 11:09:23 PM UTC-5, Travis Leleu wrote:
>>
>> Hi Malik,
>>
>> I speak only for myself, but I always thought the emphasis on csv item
>> exporting was to eliminate barriers to entry when getting a scrape up and
>> running. If you can export to csv, you can open it in Excel, and anybody
>> reading the scrapy tutorial can do that.
>>
>> I'm not 100% clear on your objectives, but I generally do string
>> manipulation in my scrapers, using the db to handle deduplication. Then I
>> write processing scripts from there to normalize, fuzzy dedupe, etc.
>>
>> It sounds a little like you're overthinking it -- I'd recommend just
>> letting it fly: grab the data you want in the scraper and save it as a
>> DjangoItem. You can, and will, rescrape later.
>>
>> On Thu, Aug 20, 2015 at 8:47 PM, Malik Rumi <[email protected]> wrote:
>>
>>> I am getting ready to work on my first scrapy project. I've done the
>>> dmoz tutorial and looked at the docs. My project is to obtain the data,
>>> do a large number of search-and-replace operations and any other needed
>>> cleanup, and save it to Postgres. I was wondering what 'best practices'
>>> are for putting together the pipeline?
>>>
>>> If I am understanding things correctly, most pipelines write the
>>> results of the scrape into plain text files for processing, regardless
>>> of the specific tools to be used in that processing, and then bulk
>>> upload the finished product to the database.
>>>
>>> However, I am seriously considering going straight to the database with
>>> DjangoItem, so that I can calculate the urls and then incorporate that
>>> data in my search and replace. But I suspect trying to do all this text
>>> processing in the db is a bad idea, though I don't know that for sure.
>>> Maybe it makes no difference?
>>>
>>> Another option might be to tee the scrape into both the db and text
>>> files. This way I can still use the db to calculate the urls, even if
>>> that is all I do with those results. Then I could process the text files
>>> and INSERT/UPDATE the final result back into Postgres, overwriting the
>>> original raw scrape content. But then I wondered about keeping track of
>>> all the changes. Has anyone used Git in a situation like this?
>>>
>>> Thanks for sharing.
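p.s. Here is a sketch of the rogue footnote problem, with stand-in markup (fewer sibling divs than the real page's div[25], but the same shape). The point is that a leading // means "at any depth", so //div[4]/p matches any p whose parent is the fourth div child of *some* element, not just the fourth div under body:

    from scrapy.selector import Selector

    html = """<html><body>
    <div></div><div></div><div></div>
    <div><p>test paragraph</p></div>
    <div>
      <div></div><div></div><div></div>
      <div><p>rogue footnote</p></div>
    </div>
    </body></html>"""
    sel = Selector(text=html)

    # The relative path matches BOTH paragraphs:
    print(sel.xpath('//div[4]/p/text()').extract())
    # -> ['test paragraph', 'rogue footnote']

    # Anchoring the path at the root pins it to the intended one:
    print(sel.xpath('/html/body/div[4]/p/text()').extract())
    # -> ['test paragraph']

So anchoring the expression at the root (or at least starting it from //body/div[4]/p) looks like the fix for the extra footnote.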
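And for the DjangoItem route from the thread below, this is the minimal setup I am planning to try, roughly as Travis suggests. The Article model and app name are placeholders, it assumes DJANGO_SETTINGS_MODULE is configured so the model import works, and on recent Scrapy versions DjangoItem lives in the separate scrapy-djangoitem package:

    from scrapy_djangoitem import DjangoItem
    from myapp.models import Article  # placeholder Django model

    class ArticleItem(DjangoItem):
        # item fields are derived automatically from the model's fields
        django_model = Article

    class DjangoWriterPipeline(object):
        def process_item(self, item, spider):
            item.save()   # INSERTs a row via the Django ORM
            return item   # pass the item on so a later pipeline or a
                          # feed export can also see it (the 'tee' idea)

With the pipeline enabled in settings.py, e.g. ITEM_PIPELINES = {'myproject.pipelines.DjangoWriterPipeline': 300}, running the spider with -o items.csv would write the same items to a csv file as well, which seems to cover the tee-to-text-files option without a second pipeline.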
