Thomas Passin <li...@tompassin.net> writes:

> On 11/24/2022 9:06 AM, Loris Bennett wrote:
>> Thomas Passin <li...@tompassin.net> writes:
>>
>>> On 11/23/2022 11:00 AM, Loris Bennett wrote:
>>>> Hi,
>>>>
>>>> I am using pandas to parse a file with the following structure:
>>>>
>>>> Name fileset type KB quota limit in_doubt grace | files quota limit in_doubt grace
>>>> shortname sharedhome USR 14097664 524288000 545259520 0 none | 107110 0 0 0 none
>>>> gracedays sharedhome USR 774858944 524288000 775946240 0 5 days | 1115717 0 0 0 none
>>>> nametoolong sharedhome USR 27418496 524288000 545259520 0 none | 11581 0 0 0 none
>>>>
>>>> I was initially able to use
>>>>
>>>>   df = pandas.read_csv(file_name, delimiter=r"\s+")
>>>>
>>>> because all the values for 'grace' were 'none'. Now, however,
>>>> non-"none" values have appeared and this fails.
>>>>
>>>> I can't use
>>>>
>>>>   pandas.read_fwf
>>>>
>>>> even with an explicit colspec, because the names in the first column
>>>> which are too long for the column will displace the rest of the data
>>>> to the right.
>>>>
>>>> The report which produces the file could in fact also generate a
>>>> properly delimited CSV file, but I have a lot of historical data in
>>>> the readable but poorly parsable format above that I need to deal
>>>> with.
>>>>
>>>> If I were doing something similar in the shell, I would just pipe the
>>>> file through sed or something to replace '5 days' with, say, '5_days'.
>>>> How could I achieve a similar sort of preprocessing in Python, ideally
>>>> without having to generate a lot of temporary files?
>>>
>>> This is really annoying, isn't it? A space-separated line with spaces
>>> in data entries. If the example you give is typical, I don't think
>>> there is a general solution. If you know there are only certain
>>> values like that, then you could do a search-and-replace for them in
>>> Python, just like the example you gave for "5 days".
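The search-and-replace idea above can be sketched as a one-line regex substitution. This is a minimal sketch, not the poster's actual code: the sample line is taken from the data above, and the unit names ('days', 'hours', 'mins') are assumptions about what the report can emit.

```python
import re

# A record from the report; '5 days' in the grace column breaks
# naive whitespace splitting.
line = ("gracedays sharedhome USR 774858944 524288000 775946240 "
        "0 5 days | 1115717 0 0 0 none")

# Glue a number to its following unit word ('5 days' -> '5_days') so
# that every field becomes a single whitespace-free token.
# The unit alternatives are assumed, not confirmed by the thread.
fixed = re.sub(r"(\d+) (days|hours|mins)", r"\1_\2", line)

print(fixed.split())
```

After the substitution every record splits into the same number of fields, so `delimiter=r"\s+"` works again.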
>>>
>>> If you know that the field that might contain entries with spaces is
>>> always the same one, e.g., the one just before the "|" marker, you
>>> could make use of that. But it could be tricky.
>>>
>>> I don't know how many files like this you will need to process, nor
>>> how many rows they might contain. If I were to tackle this job, I
>>> would probably do some quality checking first. Using this example
>>> file, figure out how many fields there are supposed to be. First,
>>> split the file into lines:
>>>
>>> with open("filename") as f:
>>>     lines = f.readlines()
>>>
>>> # Check space-separated fields defined in first row:
>>> fields = lines[0].split()
>>> num_fields = len(fields)
>>> print(num_fields)  # e.g., 100
>>>
>>> # Find lines that have the wrong number of fields
>>> bad_lines = []
>>> for line in lines:
>>>     fields = line.split()
>>>     if len(fields) != num_fields:
>>>         bad_lines.append(line)
>>>
>>> print(len(bad_lines))
>>>
>>> # Inspect a sample
>>> for line in bad_lines[:10]:
>>>     print(line)
>>>
>>> This will give you an idea of how many problem lines there are, and
>>> whether they can all be fixed by a simple replacement. If they can,
>>> and this is the only file you need to handle, just fix it up and run
>>> it. I would replace the spaces with tabs or commas. Splitting a line
>>> on whitespace (split()) takes care of the issue of having a variable
>>> number of spaces, so that's easy enough.
>>>
>>> If you will need to handle many files, and you can automate the fixes
>>> - possibly with a regular expression - then you should preprocess
>>> each file before giving it to pandas. Something like this:
>>>
>>> def fix_line(line):
>>>     """Test line for field errors and fix errors if any."""
>>>     # ....
>>>     return fixed
>>>
>>> # For each file
>>> with open("filename") as f:
>>>     lines = f.readlines()
>>>
>>> fixed_lines = []
>>> for line in lines:
>>>     fixed = fix_line(line)
>>>     fields = fixed.split()
>>>     tabified = '\t'.join(fields)  # Could be done by fix_line()
>>>     fixed_lines.append(tabified)
>>>
>>> # Now use a StringIO to feed the file to pandas
>>> # From memory, some details may not be right
>>> f = io.StringIO()
>>> f.writelines(fixed_lines)
>>>
>>> # Give f to pandas as if it were an external file
>>> # ...
>>
>> Thanks to both Gerard and Thomas for the pointer to StringIO. I ended
>> up just reading the file line-by-line, using a regex to replace
>>
>>   '<n> <units> |'
>>
>> with
>>
>>   '<n><units> |'
>>
>> and writing the new lines to a StringIO, which I then passed to
>> pandas.read_csv.
>>
>> The wrapper approach looks interesting, but it looks like I need to
>> read up more on contexts before adding that to my own code, otherwise
>> I may not understand it in a month's time.
>
> Glad that StringIO works for you here. I seem to remember that after
> writing to the StringIO, you have to seek to 0 before reading from it.
> Better check that point!
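The approach Loris describes, including the seek(0) point raised above, can be sketched end to end with `io.StringIO`. The raw lines and the unit alternative ('days') in the regex are illustrative assumptions modeled on the sample data, and the pandas call is shown only as a comment so the sketch stays self-contained:

```python
import io
import re

# Illustrative raw report lines, modeled on the sample in the thread.
raw_lines = [
    "shortname sharedhome USR 14097664 524288000 545259520 0 none | 107110 0 0 0 none",
    "gracedays sharedhome USR 774858944 524288000 775946240 0 5 days | 1115717 0 0 0 none",
]

buf = io.StringIO()
for line in raw_lines:
    # Replace '<n> <units> |' with '<n><units> |' as described above;
    # the unit name 'days' is an assumption about the report's output.
    buf.write(re.sub(r"(\d+) (days) \|", r"\1\2 |", line) + "\n")

# Crucial: rewind the buffer before reading, otherwise it appears empty.
buf.seek(0)

# The buffer can now stand in for a file, e.g.:
#   df = pandas.read_csv(buf, delimiter=r"\s+")
for line in buf:
    print(line.rstrip())
```

After the substitution both records split into the same number of fields, which is exactly what `read_csv` with a whitespace delimiter needs.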
Stefan (whom I forgot to thank: Verzeihung ["my apologies"], Stefan!)
mentioned seek(0), so fortunately I was primed when I read the Python
documentation for StringIO.

Cheers,

Loris

--
This signature is currently under construction.
--
https://mail.python.org/mailman/listinfo/python-list