If you are web scraping, you really need your code to be as efficient as
possible and to do as little work as possible. Firstly, make sure you use
everything the servers of the websites you are scraping give you to decide
whether to bother downloading the page. For example, check the ETag and
only bother to scrape if it is different from the last time you scraped
the data. If you don't trust the server's ETag, you can hash the page when
you download it and compare that against your stored hash to see whether
it changed and whether it's worth processing.
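
As a rough sketch of the two checks above, assuming you keep the last ETag
and a hash of the last body somewhere of your own: the function names here
are made up for illustration, and only hashlib and the standard
If-None-Match request header are real.

```python
import hashlib


def conditional_headers(stored_etag):
    """Build request headers asking the server to skip an unchanged page.

    A server that honours If-None-Match replies 304 Not Modified when the
    ETag still matches, so there is no body to download or process.
    """
    return {"If-None-Match": stored_etag} if stored_etag else {}


def content_fingerprint(body: bytes) -> str:
    """Return a SHA-256 hex digest of the raw page body."""
    return hashlib.sha256(body).hexdigest()


def page_changed(body: bytes, stored_hash) -> bool:
    """True when the body differs from the one hashed on the last run."""
    return content_fingerprint(body) != stored_hash
```

On each run you would send conditional_headers() with the request, and for
any page that does come back, call page_changed() before doing any parsing.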

Your approach of trying a 'get' with all the properties set and catching
the exception has costs. Assuming your tables have enough rows that
scanning the entire table for every "get" won't be efficient, you will need
to have every column you use in your "get" indexed in the database. This
obviously has a storage cost as well as an additional insert/update cost,
and the query costs more to run than a simple select against a single key.
Whether that is more efficient than getting the result and comparing the
fields in Python I don't know. I imagine it will depend on what your RDBMS
is and how it is hosted, as well as how many rows and columns will be in
your database table.

You could initialise a flag to False and, as you process your scraped
data, compare it to the attributes of your instance, setting the flag to
True if anything has changed. If you get to the end of processing your
scraped data and the modified flag is still False, don't bother saving.
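
A minimal sketch of that flag idea, in plain Python so it runs without a
database. FakeSupplier is a hypothetical stand-in for a Django model
instance, and apply_changes is a name I made up:

```python
def apply_changes(instance, scraped):
    """Copy differing values onto instance; return True if any changed."""
    modified = False
    for field, new_value in scraped.items():
        if getattr(instance, field) != new_value:
            setattr(instance, field, new_value)
            modified = True
    return modified


class FakeSupplier:
    """Hypothetical stand-in for a model instance; counts save() calls."""

    def __init__(self, name, rating):
        self.name = name
        self.rating = rating
        self.saves = 0

    def save(self):
        self.saves += 1


supplier = FakeSupplier(name="Acme", rating=4)

# Same data as stored: the flag stays False, so save() is skipped.
if apply_changes(supplier, {"name": "Acme", "rating": 4}):
    supplier.save()

# The rating differs: the flag is set, so save() runs.
if apply_changes(supplier, {"name": "Acme", "rating": 5}):
    supplier.save()
```

With a real model, a field such as last_updated would then only move when
save() is actually called.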

Dan

On 6 November 2015 at 16:12, Yunti <bkga...@gmail.com> wrote:

> Hi Dan,
>
> Thanks for the suggestion, it's a web scraper (run as a django management
> command) which then saves the data to the database via the Django ORM.
> Given it's a scraper rather than a form (or view) is the above suggested
> function an ok way to proceed or would you suggest something else is more
> appropriate/best practice?
>
>
>
> On Friday, 6 November 2015 14:40:59 UTC, Dan Tagg wrote:
>>
>> Hi Yunti,
>>
>>
>> You could go up a level in the structure of your application and apply
>> the logic there, where there is more support.
>>
>> Are you using Django forms? The ModelForm class does most of what you
>> want: it examines form data, validates it against its type and any
>> validation rules you have set in the form or your model, and compares it
>> to the instance's data, so you can check has_changed() and only save if
>> there has been some kind of change.
>>
>> Dan
>>
>> On 6 November 2015 at 13:47, Yunti <bkg...@gmail.com> wrote:
>>
>>> Jani,
>>>
>>> Thanks for your reply - you explained it much more concisely than I did.
>>> :)
>>>
>>> Good to have it confirmed that update_or_create() doesn't quite do what
>>> I needed - I was confused as to whether it would or not.
>>>
>>> Thanks for taking the time to do that function, that looks ideal. I'll
>>> test it out.
>>>
>>>
>>> On Friday, 6 November 2015 12:52:11 UTC, Jani Tiainen wrote:
>>>
>>>> Your problem lies in the way Django actually carries out create or
>>>> update.
>>>>
>>>> As the name suggests, create or update does one or the other. But
>>>> that's not what you want - you want a conditional update.
>>>>
>>>> Update only if certain fields have changed. Well, this can be done in
>>>> a few ways.
>>>>
>>>> So you want to do
>>>> "update_only_if_at_least_one_of_default_fields_changed_or_create"
>>>>
>>>> The operation is simple: if the object is not found, create a new one
>>>> using the defaults. If it is found, pull its values as a dict, compare
>>>> them against the default values and, if at least one differs, do an
>>>> update. Otherwise don't do anything.
>>>>
>>>> So basically code would look something like this:
>>>>
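>>>> # MyModel stands for your own model class (import it from your app).
>>>> def update_if_changed_or_create(**kwargs):
>>>>     defaults = kwargs.pop('defaults', None) or {}
>>>>
>>>>     qs = MyModel.objects.filter(**kwargs)
>>>>     count = len(qs)  # evaluates the queryset once
>>>>
>>>>     if count == 0:
>>>>         # Not found: create it from the lookup values plus defaults.
>>>>         obj = MyModel.objects.create(**dict(kwargs, **defaults))
>>>>         return obj, True  # Created object
>>>>     elif count == 1:
>>>>         obj = qs[0]
>>>>         changed = False
>>>>         for k, v in defaults.items():
>>>>             if getattr(obj, k) != v:
>>>>                 changed = True
>>>>                 setattr(obj, k, v)
>>>>         if changed:
>>>>             obj.save()
>>>>             return obj, False  # Updated object
>>>>         return obj, None  # No change.
>>>>     else:
>>>>         # Multiple objects match the lookup.
>>>>         raise MyModel.MultipleObjectsReturned()
>>>>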
>>>>
>>>>
>>>> On 06.11.2015 14:08, Yunti wrote:
>>>>
>>>> Carsten,
>>>>
>>>> Thanks for your reply,
>>>>
>>>> > A note about the last statement: If a Supplier object has the same
>>>> > unique_id, and all other fields (in `defaults`) are the same as well,
>>>> > logically there is no difference between updating and not updating –
>>>> > the result is the same.
>>>>
>>>> The entry in the database is the same, apart from the last_updated
>>>> field, which stays untouched if the row is not rewritten. This means I
>>>> can check for new data often and be alerted when there is an actual
>>>> update (i.e. a change to the data). If it rewrites the data every time
>>>> it checks then I have no idea when the data was actually updated.
>>>>
>>>> > Have you checked? How?
>>>> > In your create_or_update_if_diff() you seem to try to re-invent
>>>> > update_or_create(), but have you actually examined the results of the
>>>> >
>>>> >      supplier, created = Supplier.objects.update_or_create(...)
>>>> >
>>>> > call?
>>>>
>>>> I checked by seeing that the last_updated field in the database was
>>>> updated every time. (I suppose the issue could be with how that field
>>>> gets reset the next time it's run - I didn't eliminate that
>>>> possibility.)
>>>>
>>>> Yes, I was worried that I might be recreating (a poor version of)
>>>> update_or_create(), but it didn't seem to have the option where it
>>>> wouldn't write to the database if there was no change to the data.
>>>> Can it do this? And how would I verify when an item has been updated or
>>>> created (or neither) - could I output to the console?
>>>>
>>>> If it can how do I call it so it checks against all fields (unique_id
>>>> and defaults) and updates using the defaults if it finds a difference (and
>>>> creates if it doesn't find a unique_id)?
>>>>
>>>> I'm still not sure if this is possible and how to call the function,
>>>> in particular how to pass in the remaining defaults to check against -
>>>> **kwargs = defaults isn't right but I'm not sure what it should be.
>>>>
>>>> supplier, created = Supplier.objects.update_or_create(
>>>>     unique_id=product_detail['supplierId'],
>>>>     **kwargs=defaults,
>>>>     defaults={
>>>>         'name': product_detail['supplierName'],
>>>>         'entity_name_1': entity_name_1,
>>>>         'entity_name_2': entity_name_1,
>>>>         'rating': product_detail['supplierRating']})
>>>>
>>>> On Thursday, 5 November 2015 20:05:39 UTC, Carsten Fuchs wrote:
>>>>>
>>>>> Hi Yunti,
>>>>>
>>>>> On 05.11.2015 at 18:19, Yunti wrote:
>>>>> > I have tried to use the update_or_create() method assuming that it
>>>>> > would either, create a new entry in the db if it found none or
>>>>> > update an existing one if it found one and had differences to the
>>>>> > defaults passed in - or wouldn't update if there was no difference.
>>>>>
>>>>> A note about the last statement: If a Supplier object has the same
>>>>> unique_id, and all other fields (in `defaults`) are the same as well,
>>>>> logically there is no difference between updating and not updating –
>>>>> the result is the same.
>>>>>
>>>>> > However it just seemed to recreate entries each time even if there
>>>>> > were no changes.
>>>>>
>>>>> Have you checked? How?
>>>>> In your create_or_update_if_diff() you seem to try to re-invent
>>>>> update_or_create(), but have you actually examined the results of the
>>>>>
>>>>>      supplier, created = Supplier.objects.update_or_create(...)
>>>>>
>>>>> call?
>>>>>
>>>>> > I think the issue was that I wanted to:
>>>>> > 1)  get an entry if all fields were the same,
>>>>>
>>>>> update_or_create() updates an object with the given kwargs, the match
>>>>> is not made against *all* fields (i.e. for the match the fields in
>>>>> `defaults` are not accounted for).
>>>>>
>>>>> > 2) or create a new entry if it didn't find an existing entry with
>>>>> > the unique_id
>>>>> > 3) or if there was an entry with the same unique_id, update that
>>>>> > entry with remaining fields.
>>>>>
>>>>> update_or_create() should achieve this.
>>>>>
>>>>> It's hard to tell more without additional information, but
>>>>> https://docs.djangoproject.com/en/1.8/ref/models/querysets/#update-or-create
>>>>> explains the function well, including how it works. If you work through
>>>>> this in small steps, check examples and their (intermediate) results, you
>>>>> should be able to find what the original problem was.
>>>>>
>>>>> Best regards,
>>>>> Carsten
>>>>
>>>> -- You received this message because you are subscribed to the Google
>>>> Groups "Django users" group. To unsubscribe from this group and stop
>>>> receiving emails from it, send an email to
>>>> django-users...@googlegroups.com. To post to this group, send email to
>>>> django...@googlegroups.com. Visit this group at
>>>> http://groups.google.com/group/django-users. To view this discussion
>>>> on the web visit
>>>> https://groups.google.com/d/msgid/django-users/9b529e2d-7e2b-4194-a77c-8434efe6205d%40googlegroups.com
>>>> <https://groups.google.com/d/msgid/django-users/9b529e2d-7e2b-4194-a77c-8434efe6205d%40googlegroups.com?utm_medium=email&utm_source=footer>.
>>>> For more options, visit https://groups.google.com/d/optout.
>>>>
>>>
>>
>>
>



-- 
Wildman and Herring Limited, Registered Office: 52 Great Eastern Street,
London, EC2A 3EP, Company no: 05766374
