On 30/01/20 9:35 PM, R.Wieser wrote:
MRAB's scheme does have the disadvantages to me that Chris has pointed
out.
Nothing that can't be countered by keeping copies of the last X number of
to-be-downloaded-URLs files.

That's a good idea, but how would the automated system 'know' to give-up on the current file and utilise generation n-1? Unable to open the file or ???
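One possible answer, as a rough sketch only: have each generation carry its own checksum, so that "unable to open" and "opened but damaged" are both detectable. The trailer-line format ("#sha256 ...") and the function names below are my own assumptions, not anything proposed earlier in the thread.

```python
import hashlib
import os


def write_generation(path, urls):
    """Write the URL list followed by a trailer line holding its SHA-256."""
    body = "".join(u + "\n" for u in urls)
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(body)
        fh.write("#sha256 " + digest + "\n")  # hypothetical trailer format
        fh.flush()
        os.fsync(fh.fileno())


def read_generation(path):
    """Return the URL list, or None if the file is missing, unreadable
    or does not match its own checksum."""
    try:
        with open(path, encoding="utf-8") as fh:
            text = fh.read()
    except (OSError, UnicodeDecodeError):
        return None
    body, sep, trailer = text.rpartition("#sha256 ")
    if not sep:
        return None  # trailer never written: file was cut short
    if hashlib.sha256(body.encode("utf-8")).hexdigest() != trailer.strip():
        return None  # content and checksum disagree
    return body.splitlines()


def load_newest_good(paths):
    """paths is ordered newest first; fall back one generation at a time."""
    for path in paths:
        urls = read_generation(path)
        if urls is not None:
            return path, urls
    return None, []
```

With this, the 'give-up' decision is simply "read_generation() returned None", and the caller moves on to generation n-1.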


As for rewriting every time, you will /have/ to write something for every
action (and flush the file!), if you think you should be able to ctrl-c (or
worse) out of the program.

Which is the nub of the problem!
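For the record, "write something for every action (and flush the file!)" might look like the sketch below. The function name and the one-URL-per-line layout are assumptions for illustration. Note that flush() only empties Python's own buffer; os.fsync() additionally asks the operating system to push the data out to the storage device.

```python
import os


def record_done(path, url):
    """Append one completed URL and push it towards the disk before returning."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(url + "\n")
        fh.flush()             # Python's buffer -> operating system
        os.fsync(fh.fileno())  # operating system -> storage device
```

Even so, an interruption in the middle of the write itself can still leave a partial last line, so whatever reads the file back should be prepared to ignore one.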

Using ctrl+c is a VERY BAD idea. Depending upon the sophistication of the solution/existing code, surely there is another way...

Even closing/pulling-out the networking connection to cause an exception within Python, would enable management of a more 'clean' and 'data safe' shutdown!
(see also 'sledgehammer to crack a nut')

Why do you need to abandon the process mid-way?


But, you could opt to write this session's successfully downloaded URLs to a
separate file, and only merge that with the original one on program start.
That together with an integrity check of the separate file (eventually on a
line-by-line (URL) basis) should make the original file's corruption rather
unlikely.
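A minimal sketch of that scheme, under my own assumptions: each line of the session file is prefixed with a CRC32 of its URL (one way to do a line-by-line check), and the merge rewrites the main file via a temporary file plus os.replace() so the original is never half-written. The file layout and function names are hypothetical.

```python
import os
import zlib


def session_line(url):
    """Format one URL with a CRC32 prefix so a torn line can be spotted later."""
    return f"{zlib.crc32(url.encode('utf-8')):08x} {url}\n"


def valid_urls(session_path):
    """Yield only URLs whose stored checksum matches; skip damaged lines."""
    if not os.path.exists(session_path):
        return
    with open(session_path, encoding="utf-8", errors="replace") as fh:
        for line in fh:
            if not line.endswith("\n"):
                continue  # partial last line from an interrupted write
            checksum, _, url = line.rstrip("\n").partition(" ")
            if url and checksum == f"{zlib.crc32(url.encode('utf-8')):08x}":
                yield url


def merge_at_start(main_path, session_path):
    """Fold the last session's verified URLs into the main file."""
    done = set()
    if os.path.exists(main_path):
        with open(main_path, encoding="utf-8") as fh:
            done = {line.strip() for line in fh if line.strip()}
    done.update(valid_urls(session_path))
    tmp_path = main_path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as fh:
        fh.writelines(u + "\n" for u in sorted(done))
        fh.flush()
        os.fsync(fh.fileno())
    os.replace(tmp_path, main_path)  # swap in the new file in one step
    if os.path.exists(session_path):
        os.remove(session_path)
    return done
```

During a run, the program would only ever append session_line(url) to the session file; the main file is touched once, at start-up.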

What is the OP's definition of "unlikely" or "acceptable risk"?
If RDBMS == "unnecessary complexity", then (presumably) 'concern' will be commensurately low, and much of the discussion to-date, moot?

I've not worked on 'downloads' (which I take to mean data files, eg forms from the tax office - guess what task I'm procrastinating over?) but have automated the downloading of web page content/headers. There are so many reasons why such requests fail first time, when they should work every time, that it may be quite difficult to distinguish 'corruption' from the many other issues that may arise...


A database /sounds/ good, but what happens when you ctrl-c out of a
non-atomic operation ?   How do you fix that ?    IOW: Databases can be
corrupted for pretty-much the same reason as for a simple datafile (but with
much worse consequences).

[apologies for personal comment]
I, (with my skill-set, tool-set, collection of utilities, ... - see earlier mention of "bias") reach for an RDBMS more quickly than many*. Mea culpa or 'more power to [my] right arm'?


The DB suggestion (posted earlier) involved only a single table, to which fields would be added/populated during processing as a record of progress/status. Thus, replacing the single file that the OP (originally) outlined as fitting his/her needs, with a single DB-table.

Accordingly, there is no non-atomic transaction in the proposal - UPDATE is atomic in most (competent) RDBMS. (again, in my ignorance of that project, please don't (anyone) think I'm including/excluding SQLite)
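To make the single-table idea concrete, here is a sketch. It uses Python's sqlite3 module purely because it ships with the standard library and makes the example runnable; that choice, the table name, the column names and the status values are all my assumptions, and no claim is made here about any particular engine's guarantees.

```python
import sqlite3


def open_progress(db_path):
    """Open (or create) the one-table progress database."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS downloads ("
        " url TEXT PRIMARY KEY,"
        " status TEXT NOT NULL DEFAULT 'pending')"
    )
    conn.commit()
    return conn


def add_urls(conn, urls):
    """Queue URLs; ones already in the table are left untouched."""
    with conn:  # commits on success, rolls back if an exception escapes
        conn.executemany(
            "INSERT OR IGNORE INTO downloads (url) VALUES (?)",
            [(u,) for u in urls],
        )


def mark_done(conn, url):
    """Record progress with a single UPDATE inside its own transaction."""
    with conn:
        conn.execute(
            "UPDATE downloads SET status = 'done' WHERE url = ?", (url,)
        )


def pending(conn):
    """Return the URLs still waiting to be downloaded."""
    rows = conn.execute(
        "SELECT url FROM downloads WHERE status = 'pending' ORDER BY url"
    )
    return [row[0] for row in rows]
```

On restart, pending(conn) is the to-do list; there is no separate file to reconcile.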


Contrarily, if the 'single table idea' is hardly a "database" by most definitions, why bother? The answer lies in the very mechanisms to combat corruptions and interruptions being discussed! As a fundamentally-lazy person, I'd rather leave the RDBMS-coders to wrestle with such complexities 'for me'. Then, I can 'stand on the shoulders' of such 'giants', by driving their (competently working) 'black box'...
(YMMV!)


Now, it transpires, the OP possesses DB skills. So, (s)he is in a position to make the go/no-go decision which suits the actual spec. Yahoo! (not TM)


Also think of the old adage: "I had a problem, and then I thought I could
use X.  Now I have two problems..." - with X traditionally being "regular
expressions".   In other words: do KISS (keep it ....)

Good point! (I'm not a great fan of RegEx-es either)
- reduce/avoid complexity, "simple is better than complex"! (Python: import this)


Surely though, it is only appropriate to dive into the concerns and complexities of DB accuracy and "consistency", if we do likewise with file systems?

The rationale of my 'laziness' argument 'for' using an RDBMS, also applies to plain-vanilla file systems. Do I want to deal with the complexities of managing files and corruptions, in that arena?
(you could easily guess the answer to that!)

Do you?
(the answer may be quite different - but no matter, I'm not going to say you are "wrong", as long as in making such a decision (files?DB) we compare 'like with like' - in fact, before that: as long as the client's spec says that we need to be worrying about such detail! Otherwise YAGNI applies!)


By the way: The "just write the URLs in a folder" method is not at all a bad
one.   /Very/ easy to maintain, resilient (especially when you consider the
self-repairing capabilities of some filesystems) and the polar opposite of a
"customer lock-in". :-)

+1
Be aware that formation rules for URLs are not congruent with OS FS rules!
(such concerns don't apply if the URLs are data within a file/table)
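To illustrate the mismatch: a URL contains characters ('/', ':', '?') that many file systems reject or treat specially. One way around it, sketched below with hypothetical function names, is to percent-encode the whole URL using urllib.parse.quote() with an empty 'safe' set, which is reversible with unquote().

```python
from urllib.parse import quote, unquote


def url_to_filename(url):
    """Percent-encode everything except letters, digits and '_.-~'."""
    return quote(url, safe="")


def filename_to_url(name):
    """Reverse url_to_filename()."""
    return unquote(name)
```

For example, url_to_filename("https://example.com/a?b=c") gives "https%3A%2F%2Fexample.com%2Fa%3Fb%3Dc". This does not address file name length limits, which long URLs can easily exceed once encoded.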



* I was astonished to discover (a show-of-hands poll at some conference or other) that 'the average applications programmer' dislikes SQL/RDBMS and would rather have 'someone else' handle that side of things. Most of those ascribed their attitude to not having been able to 'get [their] heads around SQL' - which left me baffled because I 'just see it'. However, my mental processes have been queried (more than once)! Upon reflection, this 'discovery' made me happy - found me another niche to occupy...
--
Regards =dn