Re: [Tutor] Which is better in principle: to store (in file) calculated data or to re-calculate it upon restarting program?
On 31/07/2019 03:02, boB Stepp wrote:

> preceding scores plus the current one.  If the data in the file
> somehow got mangled, it would be an extraordinary coincidence for
> every row to yield a correct total score if that total score was
> recalculated from the corrupted data.

True, but the likelihood of that happening is vanishingly small.
What is much more likely is that a couple of bits in the entire
file will be wrong. So a 5 becomes a 7, for example.

Remember that the data in the files is character based
(assuming it's a text file), not numerical. The conversion
to numbers happens when you read it. The conversion is more
likely to detect corrupted data than any calculations you perform.

> But the underlying question that I am trying to answer is how
> likely/unlikely is it for a file to get corrupted nowadays?

It is still quite likely. Not as much as it was 40 years ago,
but still very much a possibility. Especially if the data is
stored/accessed over a network link. It is still very much
a real issue for anyone dealing with critical data.

> worthwhile verifying the integrity of every file in a program, or, at
> least, every data file accessed by a program every program run?  Which
> leads to your point...

Anything critical should go in a database. That will be much
less likely to get corrupted since most RDBMS systems include
data cleansing and verification as part of their function.
Also, for working with large volumes of data (where corruption
risk rises just because of the volumes) a database is a more
effective way of storing data anyway.

>> Checking data integrity is what checksums are for.
>
> When should this be done in normal programming practice?

Any time you have a critical piece of data in a text file.
If it is important to know that the data has changed
(for any reason, not just data corruption) then use a checksum.
Certainly if it's publicly available or you plan on shipping
it over a network a checksum is a good idea.
--
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.amazon.com/author/alan_gauld
Follow my photo-blog on Flickr at:
http://www.flickr.com/photos/alangauldphotos

___
Tutor maillist - Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor
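
A minimal sketch of the points in the message above, assuming a small text data file whose SHA-256 digest is stored alongside it. The helper names (`digest`, `flip_bit`) and the sample scores are hypothetical, chosen only for illustration. One flipped bit turns a 5 into a 7: the text still converts to numbers without complaint, and only the checksum shows that something changed. A different flipped bit produces a non-digit, which the conversion does catch.

```python
import hashlib


def digest(data: bytes) -> str:
    """Return the SHA-256 hex digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def flip_bit(data: bytes, index: int, bit: int) -> bytes:
    """Return a copy of data with one bit flipped in the byte at index."""
    corrupted = bytearray(data)
    corrupted[index] ^= 1 << bit
    return bytes(corrupted)


original = b"5\n12\n-3\n"          # hypothetical scores, one per line
stored_digest = digest(original)   # would be saved alongside the data file

# '5' is 0x35 and '7' is 0x37, so flipping bit 1 turns a 5 into a 7.
silent = flip_bit(original, 0, 1)
scores = [int(line) for line in silent.decode("ascii").split()]
# The conversion succeeded, so only the checksum reveals the change.
digest_matches = digest(silent) == stored_digest

# Flipping bit 6 instead turns '5' (0x35) into 'u' (0x75), which int() rejects.
noisy = flip_bit(original, 0, 6)
try:
    [int(line) for line in noisy.decode("ascii").split()]
    conversion_failed = False
except ValueError:
    conversion_failed = True
```

In a real program the digest would be written when the file is saved and compared when it is loaded; where to keep it (a sidecar file, a header line) is a design choice this sketch does not make.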
Re: [Tutor] Which is better in principle: to store (in file) calculated data or to re-calculate it upon restarting program?
On 31/7/19 2:21 am, boB Stepp wrote:

> I have been using various iterations of a solitaire scorekeeper
> program to explore different programming thoughts.  In my latest
> musings I am wondering about -- in general -- whether it is best to
> store calculated data values in a file and reload these values, or
> whether to recalculate such data upon each new run of a program.  In
> terms of my solitaire scorekeeper program is it better to store "Hand
> Number, Date, Time, Score, Total Score" or instead, "Hand Number,
> Date, Time, Score"?  Of course I don't really need to store hand
> number since it is easily determined by its row/record number in its
> csv file.
>
> In this trivial example I cannot imagine there is any realistic
> difference between the two approaches, but I am trying to generalize
> my thoughts for potentially much more expensive calculations, very
> large data sets, and what is the likelihood of storage errors
> occurring in files.  Any thoughts on this?
>
> TIA!

From a scientific viewpoint, you want to keep the raw data, so you can
perform other calculations that you may not have thought of yet. But
that's not got much to do with programming ;)
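
As a rough illustration of keeping only the raw data, the sketch below assumes a CSV file with the columns "Date, Time, Score" and derives the hand number and the total score when the file is read. The function name `load_hands` and the sample rows are hypothetical; the real scorekeeper's file layout may differ.

```python
import csv
import io
from itertools import accumulate


def load_hands(csv_file):
    """Read raw (date, time, score) rows and derive hand number and total."""
    rows = list(csv.reader(csv_file))
    totals = accumulate(int(row[2]) for row in rows)   # running total, never stored
    hands = []
    for hand_number, (row, total) in enumerate(zip(rows, totals), start=1):
        date, time, score = row
        hands.append((hand_number, date, time, int(score), total))
    return hands


# Hypothetical sample data standing in for the csv file on disk.
raw = io.StringIO(
    "2019-07-30,17:21,-50\n"
    "2019-07-30,18:20,30\n"
    "2019-07-31,03:02,15\n"
)
hands = load_hands(raw)

# Because the raw scores are kept, other calculations remain possible later.
mean_score = sum(hand[3] for hand in hands) / len(hands)
```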
Re: [Tutor] Which is better in principle: to store (in file) calculated data or to re-calculate it upon restarting program?
On Tue, Jul 30, 2019 at 7:26 PM Mats Wichmann wrote:
>
> On 7/30/19 5:58 PM, Alan Gauld via Tutor wrote:
> > On 30/07/2019 17:21, boB Stepp wrote:
> >
> >> musings I am wondering about -- in general -- whether it is best to
> >> store calculated data values in a file and reload these values, or
> >> whether to recalculate such data upon each new run of a program.
> >
> > It depends on the use case.
> >
> > For example a long running server process may not care about startup
> > delays because it only starts once (or at least very rarely) so either
> > approach would do but saving disk space may be helpful so calculate the
> > values.
> >
> > On the other hand a data batch processor running once as part of a
> > chain working with high data volumes probably needs to start quickly.
> > In which case do the calculations take longer than reading the
> > extra data? Probably, so store in a file.
> >
> > There are other options too such as calculating the value every
> > time it is used - only useful if the data might change
> > dynamically during the program execution.
> >
> > It all depends on how much data?, how often it is used?,
> > how often would it be calculated? How long does the process
> > run for? etc.
>
>
> Hey, boB - I bet you *knew* the answer was going to be "it depends" :)

You are coming to know me all too well!  ~(:>))  I just wanted to
check with the professionals here if my thinking (concealed behind the
asked questions) was correct or, if not, where I am off.

> There are two very common classes of application that have to make this
> very decision - real databases, and their toy cousins, spreadsheets.
>
> In the relational database world - characterized by very long-running
> processes (like: unless it crashes, runs until reboot, and maybe even
> beyond that - if you have a multi-node replicated or distributed DB it
> may survive failure of one point) - if a field is calculated it's not
> stored. Because - what Alan said: in an RDBMS, data are _expected_ to
> change during runtime.  And then for performance reasons, there may be
> some cases where it's precomputed and stored to avoid huge delays when
> the computation is expensive.  That world even has a term for that: a
> materialized view (in contrast to a regular view).  It can get pretty
> tricky, you need something that causes the materialized view to update
> when data has changed; for databases that don't natively support the
> behavior you then have to fiddle with triggers and hopefully it works
> out.  More enlightened now?

Not more enlightened, perhaps, but more convinced than ever of how
difficult it is to manage the complexity of real world programs.

--
boB
Re: [Tutor] Which is better in principle: to store (in file) calculated data or to re-calculate it upon restarting program?
On Tue, Jul 30, 2019 at 7:05 PM Alan Gauld via Tutor wrote:
>
> On 30/07/2019 18:20, boB Stepp wrote:
>
> > What is the likelihood of file storage corruption?  I have a vague
> > sense that in earlier days of computing this was more likely to
> > happen, but nowadays?  Storing and recalculating does act as a good
> > data integrity check of the file data.
>
> No it doesn't! You are quite likely to get a successful calculation
> using nonsense data and therefore invalid results. But they look
> valid - a number is a number...

Though I may be dense here, for the particular example I started with
the total score in a solitaire game is equal to the sum of all of the
preceding scores plus the current one.  If the data in the file
somehow got mangled, it would be an extraordinary coincidence for
every row to yield a correct total score if that total score was
recalculated from the corrupted data.

But the underlying question that I am trying to answer is how
likely/unlikely is it for a file to get corrupted nowadays?  Is it
worthwhile verifying the integrity of every file in a program, or, at
least, every data file accessed by a program every program run?  Which
leads to your point...

> Checking data integrity is what checksums are for.

When should this be done in normal programming practice?

--
boB
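
The consistency check boB describes can be sketched as follows, assuming each stored row carries both a score and a total score. The function name `find_bad_rows` and the sample numbers are hypothetical. The sketch only reports rows where the stored total disagrees with the recalculated running total; it cannot say which of the two columns was damaged, and it is not a substitute for a checksum over the whole file.

```python
from itertools import accumulate


def find_bad_rows(scores, stored_totals):
    """Return 1-based hand numbers whose stored total differs from the running sum."""
    recalculated = accumulate(scores)
    return [
        hand_number
        for hand_number, (stored, fresh)
        in enumerate(zip(stored_totals, recalculated), start=1)
        if stored != fresh
    ]


scores = [-50, 30, 15, 40]          # hypothetical per-hand scores
totals = [-50, -20, -5, 35]         # totals as they would be stored in the file

clean = find_bad_rows(scores, totals)

# One corrupted score (30 read back as 70): every later total disagrees.
bad_score = find_bad_rows([-50, 70, 15, 40], totals)

# One corrupted stored total (-5 read back as -7): only that row disagrees.
bad_total = find_bad_rows(scores, [-50, -20, -7, 35])
```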
Re: [Tutor] Which is better in principle: to store (in file) calculated data or to re-calculate it upon restarting program?
On 7/30/19 5:58 PM, Alan Gauld via Tutor wrote:
> On 30/07/2019 17:21, boB Stepp wrote:
>
>> musings I am wondering about -- in general -- whether it is best to
>> store calculated data values in a file and reload these values, or
>> whether to recalculate such data upon each new run of a program.
>
> It depends on the use case.
>
> For example a long running server process may not care about startup
> delays because it only starts once (or at least very rarely) so either
> approach would do but saving disk space may be helpful so calculate the
> values.
>
> On the other hand a data batch processor running once as part of a
> chain working with high data volumes probably needs to start quickly.
> In which case do the calculations take longer than reading the
> extra data? Probably, so store in a file.
>
> There are other options too such as calculating the value every
> time it is used - only useful if the data might change
> dynamically during the program execution.
>
> It all depends on how much data?, how often it is used?,
> how often would it be calculated? How long does the process
> run for? etc.


Hey, boB - I bet you *knew* the answer was going to be "it depends" :)

There are two very common classes of application that have to make this
very decision - real databases, and their toy cousins, spreadsheets.

In the relational database world - characterized by very long-running
processes (like: unless it crashes, runs until reboot, and maybe even
beyond that - if you have a multi-node replicated or distributed DB it
may survive failure of one point) - if a field is calculated it's not
stored. Because - what Alan said: in an RDBMS, data are _expected_ to
change during runtime.  And then for performance reasons, there may be
some cases where it's precomputed and stored to avoid huge delays when
the computation is expensive.  That world even has a term for that: a
materialized view (in contrast to a regular view).  It can get pretty
tricky: you need something that causes the materialized view to update
when data has changed; for databases that don't natively support the
behavior you then have to fiddle with triggers and hopefully it works
out.  More enlightened now?
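
To make the view versus materialized view distinction concrete, here is a small sketch using Python's standard sqlite3 module. SQLite has no native materialized views, so it falls into the "fiddle with triggers" category: the stored total lives in an ordinary table that a trigger keeps up to date. The table, view and trigger names are hypothetical, and only inserts are handled; updates and deletes would each need their own trigger.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE hands (id INTEGER PRIMARY KEY, score INTEGER NOT NULL);

    -- Regular view: the total is recalculated on every query.
    CREATE VIEW total_view AS
        SELECT COALESCE(SUM(score), 0) AS total FROM hands;

    -- Stand-in for a materialized view: a stored value kept current by a trigger.
    CREATE TABLE total_stored (total INTEGER NOT NULL);
    INSERT INTO total_stored VALUES (0);

    CREATE TRIGGER hands_after_insert AFTER INSERT ON hands
    BEGIN
        UPDATE total_stored SET total = total + NEW.score;
    END;
""")

# Hypothetical scores; each insert fires the trigger once.
conn.executemany("INSERT INTO hands (score) VALUES (?)", [(-50,), (30,), (15,)])
conn.commit()

computed = conn.execute("SELECT total FROM total_view").fetchone()[0]
stored = conn.execute("SELECT total FROM total_stored").fetchone()[0]
```

Both queries should agree as long as every change to `hands` goes through a path the triggers cover, which is exactly the part that gets tricky in practice.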
Re: [Tutor] Which is better in principle: to store (in file) calculated data or to re-calculate it upon restarting program?
On 30/07/2019 18:20, boB Stepp wrote:

> What is the likelihood of file storage corruption?  I have a vague
> sense that in earlier days of computing this was more likely to
> happen, but nowadays?  Storing and recalculating does act as a good
> data integrity check of the file data.

No it doesn't! You are quite likely to get a successful calculation
using nonsense data and therefore invalid results. But they look
valid - a number is a number...

Checking data integrity is what checksums are for.

--
Alan G
Re: [Tutor] Which is better in principle: to store (in file) calculated data or to re-calculate it upon restarting program?
On 30/07/2019 17:21, boB Stepp wrote:

> musings I am wondering about -- in general -- whether it is best to
> store calculated data values in a file and reload these values, or
> whether to recalculate such data upon each new run of a program.

It depends on the use case.

For example a long running server process may not care about startup
delays because it only starts once (or at least very rarely) so either
approach would do but saving disk space may be helpful so calculate the
values.

On the other hand a data batch processor running once as part of a
chain working with high data volumes probably needs to start quickly.
In which case do the calculations take longer than reading the
extra data? Probably, so store in a file.

There are other options too such as calculating the value every
time it is used - only useful if the data might change
dynamically during the program execution.

It all depends on how much data?, how often it is used?,
how often would it be calculated? How long does the process
run for? etc.

--
Alan G
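
The difference between calculating a value once and calculating it every time it is used can be sketched as below. This assumes Python 3.8 or later for `functools.cached_property`; the class and attribute names are hypothetical.

```python
from functools import cached_property


class ScoreSheet:
    def __init__(self, scores):
        self.scores = list(scores)

    @property
    def live_total(self):
        """Recalculated on every access, so it follows later changes."""
        return sum(self.scores)

    @cached_property
    def startup_total(self):
        """Calculated on first access, then reused for the rest of the run."""
        return sum(self.scores)


sheet = ScoreSheet([-50, 30, 15])
first_live, first_cached = sheet.live_total, sheet.startup_total

sheet.scores.append(40)          # data changes while the program runs
second_live, second_cached = sheet.live_total, sheet.startup_total
```

After the append, the live value reflects the new score while the cached one does not, which is why recalculating on every use only pays off when the data can change during execution.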
Re: [Tutor] Which is better in principle: to store (in file) calculated data or to re-calculate it upon restarting program?
On Tue, Jul 30, 2019 at 12:05 PM Zachary Ware wrote:
>
> On Tue, Jul 30, 2019 at 11:24 AM boB Stepp wrote:
> > In this trivial example I cannot imagine there is any realistic
> > difference between the two approaches, but I am trying to generalize
> > my thoughts for potentially much more expensive calculations, very
> > large data sets, and what is the likelihood of storage errors
> > occurring in files.  Any thoughts on this?
>
> As with many things in programming, it comes down to how much time you
> want to trade for space.  If you have a lot of space and not much
> time, store the calculated values.  If you have a lot of time (or the
> calculation time is negligible) and not much space, recalculate every
> time.  If you have plenty of both, store it and recalculate it anyway

What is the likelihood of file storage corruption?  I have a vague
sense that in earlier days of computing this was more likely to
happen, but nowadays?  Storing and recalculating does act as a good
data integrity check of the file data.

--
boB
Re: [Tutor] Which is better in principle: to store (in file) calculated data or to re-calculate it upon restarting program?
On Tue, Jul 30, 2019 at 11:24 AM boB Stepp wrote:
> In this trivial example I cannot imagine there is any realistic
> difference between the two approaches, but I am trying to generalize
> my thoughts for potentially much more expensive calculations, very
> large data sets, and what is the likelihood of storage errors
> occurring in files.  Any thoughts on this?

As with many things in programming, it comes down to how much time you
want to trade for space.  If you have a lot of space and not much
time, store the calculated values.  If you have a lot of time (or the
calculation time is negligible) and not much space, recalculate every
time.  If you have plenty of both, store it and recalculate it anyway
:).  Storing the information can also be useful for offline debugging.

--
Zach
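
The time-for-space trade described above might look like the following sketch, which stores a calculated value in a small JSON file and, optionally, recalculates it anyway to compare. The names `expensive_total` and `load_or_compute`, the JSON layout and the `verify` flag are all assumptions made for illustration.

```python
import json
from pathlib import Path


def expensive_total(scores):
    """Placeholder for a calculation that might be costly in a real program."""
    return sum(scores)


def load_or_compute(cache_path, scores, verify=False):
    """Return the stored result if present; otherwise compute and store it.

    With verify=True the value is recalculated anyway and compared with
    the stored one, trading extra time for a consistency check.
    """
    path = Path(cache_path)
    if path.exists():
        cached = json.loads(path.read_text())["total"]
        if verify and cached != expensive_total(scores):
            raise ValueError("stored total disagrees with recalculated total")
        return cached
    total = expensive_total(scores)
    path.write_text(json.dumps({"total": total}))
    return total
```

The stored file also doubles as the kind of artifact that can be inspected offline when debugging.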