On 08/04/2017 01:52 AM, Peter Otten wrote:
<SNIP>
> It looks like Python is fairly competitive:
>
> $ wc -l hugequote.txt
> 1000000 hugequote.txt
> $ cat unquote.py
> import csv
>
> with open("hugequote.txt") as instream:
>     for field, in csv.reader(instream):
>         print(field)
>
> $ time python3 unquote.py > /dev/null
>
> real    0m3.773s
> user    0m3.665s
> sys     0m0.082s
>
> $ time cat hugequote.txt | sed 's/"""/"/g;s/""/"/g' > /dev/null
>
> real    0m4.862s
> user    0m4.721s
> sys     0m0.330s
>
> Run on ancient AMD hardware ;)
>
It's actually better than sed. What you're seeing is - I believe - load
time dominating the overall time. I reran this with a 20M-line file:

time cat superhuge.txt | sed 's/"""/"/g;s/""/"/g' >/dev/null

real    0m53.091s
user    0m52.861s
sys     0m0.820s

time python unquote.py >/dev/null

real    0m22.377s
user    0m22.021s
sys     0m0.352s

Note that this is with python2, not python3. Also, I confirmed that
piping cat into sed was not a factor in the performance.

My guess is that the delimiter-recognition logic in the csv module is
far more efficient than the general-purpose regular expression/DFA
implementation in sed.

Extra Credit Assignment: Reimplement in Python using:

- string substitution
- regular expressions

(A sketch of both follows below.)

Tschüss...

--
https://mail.python.org/mailman/listinfo/python-list
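[Editor's sketch] One way the extra-credit versions might look; this is
a minimal sketch, not anyone's posted solution. It mirrors the two sed
substitutions in the same order (s/"""/"/g first, then s/""/"/g), and
the file name superhuge.txt is carried over from the timing run above.

import re
import sys

def unquote_replace(line):
    # String substitution: str.replace applied twice, in the same
    # order as the sed script.
    return line.replace('"""', '"').replace('""', '"')

TRIPLE = re.compile('"""')
DOUBLE = re.compile('""')

def unquote_regex(line):
    # The same two-pass substitution with compiled regular
    # expressions; TRIPLE must run before DOUBLE to match sed.
    return DOUBLE.sub('"', TRIPLE.sub('"', line))

with open("superhuge.txt") as instream:
    for line in instream:
        sys.stdout.write(unquote_replace(line))

Swap unquote_replace for unquote_regex in the loop to time the other
variant, e.g. time python3 unquote2.py >/dev/null. Since both patterns
are fixed strings, one would expect str.replace to come out ahead of
the regex version here.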