Steven D'Aprano wrote:

> On Mon, Apr 13, 2015 at 02:29:07PM +0200, jarod...@libero.it wrote:
>> Dear all.
>> I would like to extract from some file some data.
>> The line I'm interested is this:
>>
>> Input Read Pairs: 2127436 Both Surviving: 1795091 (84.38%) Forward
>> Only Surviving: 17315 (0.81%) Reverse Only Surviving: 6413 (0.30%)
>> Dropped: 308617 (14.51%)
>
>
>     Some people, when confronted with a problem, think "I know, I'll
>     use regular expressions." Now they have two problems.
>     -- Jamie Zawinski
>
> I swear that Perl has been a blight on an entire generation of
> programmers. All they know is regular expressions, so they turn every
> data processing problem into a regular expression. Or at least they
> *try* to. As you have learned, regular expressions are hard to read,
> hard to write, and hard to get correct.
>
> Let's write some Python code instead.
>
>
> def extract(line):
>     # Extract key:number values from the string.
>     line = line.strip()  # Remove leading and trailing whitespace.
>     words = line.split()
>     accumulator = []  # Collect parts of the string we care about.
>     for word in words:
>         if word.startswith('(') and word.endswith('%)'):
>             # We don't care about percentages in brackets.
>             continue
>         try:
>             n = int(word)
>         except ValueError:
>             accumulator.append(word)
>         else:
>             accumulator.append(n)
>     # Now accumulator will be a list of strings and ints:
>     # e.g. ['Input', 'Read', 'Pairs:', 1234, 'Both', 'Surviving:', 1000]
>     # Collect consecutive strings as the key, int to be the value.
>     results = {}
>     keyparts = []
>     for item in accumulator:
>         if isinstance(item, int):
>             key = ' '.join(keyparts)
>             keyparts = []
>             if key.endswith(':'):
>                 key = key[:-1]
>             results[key] = item
>         else:
>             keyparts.append(item)
>     # When we have finished processing, the keyparts list should be empty.
>     if keyparts:
>         extra = ' '.join(keyparts)
>         print('Warning: found extra text at end of line "%s".' % extra)
>     return results
>
>
>
> Now let me test it:
>
> py> line = ('Input Read Pairs: 2127436 Both Surviving: 1795091'
> ...         ' (84.38%) Forward Only Surviving: 17315 (0.81%)'
> ...         ' Reverse Only Surviving: 6413 (0.30%) Dropped:'
> ...         ' 308617 (14.51%)\n')
> py>
> py> print(line)
> Input Read Pairs: 2127436 Both Surviving: 1795091 (84.38%) Forward
> Only Surviving: 17315 (0.81%) Reverse Only Surviving: 6413 (0.30%)
> Dropped: 308617 (14.51%)
>
> py> extract(line)
> {'Dropped': 308617, 'Both Surviving': 1795091, 'Reverse Only Surviving':
> 6413, 'Forward Only Surviving': 17315, 'Input Read Pairs': 2127436}
>
>
> Remember that dicts are unordered. All the data is there, but in
> arbitrary order. Now that you have a nice function to extract the data,
> you can apply it to the lines of a data file in a simple loop:
>
> with open("255.trim.log") as p:
>     for line in p:
>         if line.startswith("Input "):
>             d = extract(line)
>             print(d)  # or process it somehow

The tempter took possession of me and dictated:

>>> import pprint
>>> import re
>>> pprint.pprint(
...     [(k, int(v)) for k, v in
...      re.compile(r"(.+?):\s+(\d+)(?:\s+\(.*?\))?\s*").findall(line)])
[('Input Read Pairs', 2127436),
 ('Both Surviving', 1795091),
 ('Forward Only Surviving', 17315),
 ('Reverse Only Surviving', 6413),
 ('Dropped', 308617)]
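
For anyone who finds the one-liner hard to read, here is the same pattern spelled out with re.VERBOSE and named groups. This is only a sketch: the names PAIR and parse_line are made up for illustration, and it assumes every line has the "label: count (percent)" shape of the sample above.

```python
import re

# Same pattern as the one-liner, split into commented pieces.
# In re.VERBOSE mode, whitespace and "#" comments in the pattern are ignored.
PAIR = re.compile(r"""
    (?P<key>.+?):        # label: shortest run of text up to a colon
    \s+(?P<count>\d+)    # the integer count that follows the colon
    (?:\s+\(.*?\))?      # optional "(84.38%)" style percentage, discarded
    \s*                  # whitespace before the next label
""", re.VERBOSE)


def parse_line(line):
    # Hypothetical helper: map each label to its integer count.
    return {m.group("key"): int(m.group("count")) for m in PAIR.finditer(line)}


line = ('Input Read Pairs: 2127436 Both Surviving: 1795091'
        ' (84.38%) Forward Only Surviving: 17315 (0.81%)'
        ' Reverse Only Surviving: 6413 (0.30%) Dropped:'
        ' 308617 (14.51%)\n')
result = parse_line(line)
print(result)
```

parse_line() could stand in for extract() in the file loop quoted above. Unlike extract(), it silently skips text that does not match the pattern instead of printing a warning, so it is less suited to spotting malformed lines.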