Re: Suggestions on mechanism or existing code - maintain persistence of file download history

DL Neil via Python-list Thu, 30 Jan 2020 16:05:18 -0800

On 30/01/20 9:35 PM, R.Wieser wrote:

MRAB's scheme does have the disadvantages to me that Chris has pointed
out.

Nothing that can't be countered by keeping copies of the last X number of
to-be-downloaded-URLs files.

That's a good idea, but how would the automated system 'know' to give up on the current file and utilise generation n-1? Unable to open the file, or ???

As for rewriting every time, you will /have/ to write something for every
action (and flush the file!), if you think you should be able to ctrl-c (or
worse) out of the program.
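A minimal sketch of "write something for every action (and flush the file!)", assuming a plain text history file with one URL per line. The names record_download and load_history are made up for illustration and are not from the OP's code:

```python
import os


def record_download(log_path, url):
    """Append one URL to the history file and push it towards the disk."""
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(url + "\n")
        log.flush()              # empty Python's own buffer
        os.fsync(log.fileno())   # ask the OS to write its buffers to the device


def load_history(log_path):
    """Return the set of recorded URLs (empty if the file does not exist)."""
    try:
        with open(log_path, encoding="utf-8") as log:
            # A final line without a newline was cut off mid-write: ignore it.
            return {line.rstrip("\n") for line in log if line.endswith("\n")}
    except FileNotFoundError:
        return set()
```

Appending means an interruption can damage at most the last line, which the reader then discards.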


Which is the nub of the problem!

Using ctrl+c is a VERY BAD idea. Depending upon the sophistication of the solution/existing code, surely there is another way...

Even closing/pulling-out the networking connection to cause an exception within Python, would enable management of a more 'clean' and 'data safe' shutdown!

(see also 'sledgehammer to crack a nut')
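To illustrate the 'managed shutdown' point: a rough sketch in which an interruption arrives as an exception and the loop stops between items rather than mid-write. The function run_downloads and its download/record callables are hypothetical stand-ins, not anything from the OP's program:

```python
def run_downloads(urls, download, record):
    """Process URLs until finished or interrupted; return those left undone."""
    remaining = list(urls)
    try:
        while remaining:
            url = remaining[0]
            download(url)     # may raise KeyboardInterrupt or a network error
            record(url)       # only reached once the download has completed
            remaining.pop(0)
    except (KeyboardInterrupt, OSError):
        pass  # stop cleanly; the interrupted URL is still in `remaining`
    return remaining
```

Whatever comes back can be retried on the next run, because nothing is recorded until its download has finished.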

Why do you need to abandon the process mid-way?

But, you could opt to write this session's successfully downloaded URLs to a
separate file, and only merge that with the original one on program start.
That together with an integrity check of the separate file (eventually on a
line-by-line (URL) basis) should make the original file's corruption rather
unlikely.
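One way the session-file idea might look, purely as a sketch. It assumes each line carries a CRC32 of its URL as the 'integrity check', and that the merge writes a temporary file which is then swapped into place; the file layout and the function names are my assumptions, not the poster's design:

```python
import os
import zlib


def checked_line(url):
    """Format a URL with a CRC32 prefix so damage is detectable per line."""
    return f"{zlib.crc32(url.encode('utf-8')):08x} {url}\n"


def valid_urls(path):
    """Yield only URLs whose stored checksum matches; skip damaged lines."""
    try:
        with open(path, encoding="utf-8", errors="replace") as fh:
            for line in fh:
                crc, _, url = line.rstrip("\n").partition(" ")
                if url and crc == f"{zlib.crc32(url.encode('utf-8')):08x}":
                    yield url
    except FileNotFoundError:
        return


def merge_session(main_path, session_path):
    """At program start: fold the last session's URLs into the main file."""
    merged = sorted(set(valid_urls(main_path)) | set(valid_urls(session_path)))
    tmp_path = main_path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as fh:
        fh.writelines(checked_line(u) for u in merged)
        fh.flush()
        os.fsync(fh.fileno())
    os.replace(tmp_path, main_path)  # atomic on POSIX when on one filesystem
    if os.path.exists(session_path):
        os.remove(session_path)
    return merged
```

The main file is only ever replaced as a whole, so an interruption during the merge leaves the old main file and the session file as they were.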


What is the OP's definition of "unlikely" or "acceptable risk"?

If RDBMS == "unnecessary complexity", then (presumably) 'concern' will be commensurately low, and much of the discussion to-date, moot?

I've not worked on 'downloads' (which I take to mean data files, eg forms from the tax office - guess what task I'm procrastinating over?) but have automated the downloading of web page content/headers. There are so many reasons why such won't work first-time, when they should every time; that it may be quite difficult to detect 'corruption' (as distinct from so many of these other issues that may arise)...

A database /sounds/ good, but what happens when you ctrl-c out of a
non-atomic operation ?   How do you fix that ?    IOW: Databases can be
corrupted for pretty-much the same reason as for a simple datafile (but with
much worse consequences).


[apologies for personal comment]

I, (with my skill-set, tool-set, collection of utilities, ... - see earlier mention of "bias") reach for an RDBMS more quickly than many*. Mea culpa or 'more power to [my] right arm'?

The DB suggestion (posted earlier) involved only a single table, to which fields would be added/populated during processing as a record of progress/status. Thus, replacing the single file that the OP (originally) outlined as fitting his/her needs, with a single DB-table.

Accordingly, there is no non-atomic transaction in the proposal - UPDATE is atomic in most (competent) RDBMS. (again, in my ignorance of that project, please don't (anyone) think I'm including/excluding SQLite)
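For concreteness, a sketch of the single-table idea. It uses Python's bundled sqlite3 module only because it needs no server to try out; that choice, the table name, and the column names are assumptions for illustration, not a recommendation of one RDBMS over another:

```python
import sqlite3


def open_history(db_path):
    """Open (or create) the one-table download history."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS download ("
        " url TEXT PRIMARY KEY,"
        " status TEXT NOT NULL DEFAULT 'pending')"
    )
    conn.commit()
    return conn


def add_url(conn, url):
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute("INSERT OR IGNORE INTO download (url) VALUES (?)", (url,))


def mark_done(conn, url):
    with conn:  # the single UPDATE is committed as one transaction
        conn.execute("UPDATE download SET status = 'done' WHERE url = ?", (url,))


def pending(conn):
    rows = conn.execute(
        "SELECT url FROM download WHERE status = 'pending' ORDER BY url"
    )
    return [row[0] for row in rows]
```

On restart, pending() lists what is still to be fetched, so the table doubles as both the to-do list and the history.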

Contrarily, if the 'single table idea' is hardly a "database" by most definitions, why bother? The answer lies in the very mechanisms to combat corruptions and interruptions being discussed! As a fundamentally-lazy person, I'd rather leave the RDBMS-coders to wrestle with such complexities 'for me'. Then, I can 'stand on the shoulders' of such 'giants', by driving their (competently working) 'black box'...

(YMMV!)

Now, it transpires, the OP possesses DB skills. So, (s)he is in a position to make the go/no-go decision which suits the actual spec. Yahoo! (not TM)

Also think of the old adage: "I had a problem, and then I thought I could
use X.  Now I have two problems..." - with X traditionally being "regular
expressions".   In other words: do KISS (keep it ....)


Good point! (I'm not a great fan of RegEx-es either)

- reduce/avoid complexity, "simple is better than complex"! (Python: import this)

Surely though, it is only appropriate to dive into the concerns and complexities of DB accuracy and "consistency", if we do likewise with file systems?

The rationale of my 'laziness' argument 'for' using an RDBMS, also applies to plain-vanilla file systems. Do I want to deal with the complexities of managing files and corruptions, in that arena?

(you could easily guess the answer to that!)

Do you?

(the answer may be quite different - but no matter, I'm not going to say you are "wrong", as long as in making such a decision (files? DB?) we compare 'like with like' - in fact, before that: as long as the client's spec says that we need to be worrying about such detail!)

(otherwise YAGNI applies!)

By the way: The "just write the URLs in a folder" method is not at all a bad
one.   /Very/ easy to maintain, resilient (especially when you consider the
self-repairing capabilities of some filesystems) and the polar opposite of a
"customer lock-in". :-)


+1
Be aware that formation rules for URLs are not congruent with OS FS rules!
(such concerns don't apply if the URLs are data within a file/table)
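A small sketch of one way around the URL-versus-filename mismatch, assuming percent-encoding is acceptable for names and that over-long names fall back to a hash. The function name and the 200-character cut-off are arbitrary choices of mine:

```python
import hashlib
from urllib.parse import quote


def url_to_filename(url, max_len=200):
    """Turn a URL into a name without '/', ':', '?' or similar characters."""
    name = quote(url, safe="")   # percent-encode everything outside A-Za-z0-9_.-~
    if len(name) > max_len:
        # Too long for many filesystems: use a fixed-length digest instead.
        # Unlike the percent-encoded form, this cannot be reversed.
        name = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return name
```

The percent-encoded form can be turned back into the URL with urllib.parse.unquote; the hashed form cannot, so the URL would then need to be stored inside the file.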

* was astonished to discover (a show-of-hands poll at some conference or other) that 'the average applications programmer' dislikes SQL/RDBMS and would rather have 'someone else' handle that side of things. Most of those ascribed their attitude to not having been able to 'get [their] heads around SQL' - which left me baffled because I 'just see it'. However, my mental processes have been queried (more than once)! Upon reflection, this 'discovery' made me happy - found me another niche to occupy...

--
Regards =dn
--
https://mail.python.org/mailman/listinfo/python-list
