On 27/05/2021 21.28, Loris Bennett wrote:
> Hi,
>
> I currently have around 3 years' worth of files like
>
>   home.20210527
>   home.20210526
>   home.20210525
>   ...
>
> so around 1000 files, each of which contains information about data
> usage in lines like
>
>   name    kb
>   alice   123
>   bob     4
>   ...
>   zebedee 9999999
>
> (there are actually more columns).  I have about 400 users and the
> individual files are around 70 KB in size.
>
> Once a month I want to plot the historical usage as a line graph for
> the whole period for which I have data, for each user.
>
> I already have some code to extract the current usage for a single user
> from the most recent file:
>
>   for line in open(file, "r"):
>       columns = line.split()
>       if len(columns) < data_column:
>           logging.debug("no. of cols.: %i less than data col", len(columns))
>           continue
>       regex = re.compile(user)
>       if regex.match(columns[user_column]):
>           usage = columns[data_column]
>           logging.info(usage)
>           return usage
>   logging.error("unable to find %s in %s", user, file)
>   return "none"
>
> Obviously I will want to extract all the data for all users from a file
> once I have opened it.  After looping over all files I would naively
> end up with, say, a nested dict like
>
>   {"20210527": {"alice": 123, ..., "zebedee": 9999999},
>    "20210526": {"alice": 123, "bob": 3, ..., "zebedee": 9},
>    "20210525": {"alice": 123, "bob": 1, ..., "zebedee": 9999999},
>    "20210524": {"alice": 123, ..., "zebedee": 9},
>    "20210523": {"alice": 123, ..., "zebedee": 9999999},
>    ...}
>
> where the user keys would vary over time as accounts, such as 'bob',
> are added and later deleted.
>
> Is creating a potentially rather large structure like this the best
> way to go (I could obviously limit the size by, say, only considering
> the last 5 years)?  Or is there some better approach for this kind of
> problem?  For plotting I would probably use matplotlib.
NB I am predisposed to use databases. People without such skills will
likely feel the time-and-effort investment to learn is uneconomic for
such a simple, single example!

Because the expressed concern seems to be the size of the data-set,
(one assumes) only certain users' data will be graphed at any one time.
Another concern may be that every time the routine executes, it repeats
the bulk of its regex-based processing.

I would establish a DB with (at least, as appropriate) two tables: one
listing the files from which the data has already been extracted, and
the second containing the data currently held in the nested dict. NB
the second may benefit from being stored in "normal form", or split
into related tables, and certainly from indexing.

Thus the process requires two steps: firstly, capture the data (from
the files) into the DB; secondly, graph the appropriate groups or
otherwise 'chosen' users. SQL will simplify data retrieval, and feeding
into matplotlib (or whichever tool). It will also enable simple
improvements, both selecting sub-sets of users and projecting over
various periods of time. YMMV!
--
Regards,
=dn
--
https://mail.python.org/mailman/listinfo/python-list
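To make the two-table suggestion concrete: a minimal sketch using the stdlib `sqlite3` module. The table and column names here are assumptions for illustration, not anything from the original posts:

```python
import sqlite3

# ":memory:" keeps the sketch self-contained; a file path would persist it.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- which daily files have already been ingested
    CREATE TABLE processed_files (
        filename TEXT PRIMARY KEY          -- e.g. 'home.20210527'
    );
    -- one row per (date, user) observation
    CREATE TABLE usage (
        date TEXT NOT NULL,                -- e.g. '20210527'
        user TEXT NOT NULL,
        kb   INTEGER NOT NULL,
        PRIMARY KEY (date, user)
    );
    CREATE INDEX idx_usage_user ON usage (user);
""")

# Step 1 (capture): record the file, then insert its parsed rows.
conn.execute("INSERT INTO processed_files VALUES ('home.20210527')")
conn.executemany("INSERT INTO usage VALUES (?, ?, ?)",
                 [("20210527", "alice", 123), ("20210527", "bob", 4)])

# Step 2 (retrieve): one user's history, ordered and ready for plotting.
rows = conn.execute(
    "SELECT date, kb FROM usage WHERE user = ? ORDER BY date",
    ("alice",)).fetchall()
```

The `processed_files` table is what lets the monthly run skip the regex work it has already done; the composite primary key on `usage` makes re-ingesting a file harmless, and the index on `user` keeps per-user queries cheap as the table grows past a thousand files.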