[issue26436] Add the regex-dna benchmark

2016-10-18 Thread STINNER Victor

STINNER Victor added the comment:

The performance project is now hosted on GitHub. I created a pull request from 
Serhiy's patch:
https://github.com/python/performance/pull/17

I am therefore closing this issue. Let's continue the discussion there.

--
resolution:  -> third party
status: open -> closed

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue26436] Add the regex-dna benchmark

2016-10-18 Thread STINNER Victor

STINNER Victor added the comment:

I created https://github.com/python/performance/issues/15




[issue26436] Add the regex-dna benchmark

2016-10-16 Thread Serhiy Storchaka

Changes by Serhiy Storchaka :


--
components: +Regular Expressions
nosy: +ezio.melotti, mrabarnett




[issue26436] Add the regex-dna benchmark

2016-09-13 Thread STINNER Victor

STINNER Victor added the comment:

Serhiy: Can you please open a pull request on the new performance module? 
https://github.com/python/performance

--
nosy: +haypo




[issue26436] Add the regex-dna benchmark

2016-03-01 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

I used the code from the fasta and regex-dna tests almost without changes. That
is, one part creates the data in standard FASTA format (with 60-character lines
and headers), and the other part parses this format. The code could be simpler
if it generated and consumed raw data.
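
A minimal sketch of the two halves described above, assuming a single record
and bytes input. The helper names format_fasta and strip_fasta are hypothetical
and are not taken from the patch; the stripping pattern follows the first step
of the regex-dna task (remove header lines and newlines).

```python
import re


def format_fasta(header, sequence, width=60):
    """Wrap a raw sequence into FASTA text: a '>' header, then fixed-width lines."""
    lines = [b">" + header]
    for start in range(0, len(sequence), width):
        lines.append(sequence[start:start + width])
    return b"\n".join(lines) + b"\n"


def strip_fasta(data):
    """Drop header lines and newlines, leaving only the raw sequence."""
    return re.sub(rb">.*\n|\n", b"", data)


raw = b"acgt" * 40  # 160 bases
fasta = format_fasta(b"ONE example record", raw)
recovered = strip_fasta(fasta)
```

Generating and consuming raw data directly would skip both helpers, which is
the simplification mentioned above.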

As for the quality of the code, the tested code is pretty simple and Pythonic
enough. Yes, using replace() is more idiomatic and faster, but we are testing
regular expressions. bytes.translate() doesn't work with a dict, and
str.translate() is slower than replace() or re.sub().
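
To illustrate the bytes.translate() limitation: it takes a 256-byte table, so
each byte can only map to one byte. The sketch below uses a hypothetical
two-entry subset of the substitution table, not the benchmark's real one, and
falls back to bytes.replace() for the one-to-many expansion.

```python
# Hypothetical two-entry subset of the substitution table, as bytes.
SUBSTITUTIONS = {b"B": b"(c|g|t)", b"D": b"(a|g|t)"}


def translate_accepts_dict():
    """Return True if bytes.translate() accepts a dict the way str.translate() does."""
    try:
        b"aBcDg".translate({ord("B"): b"(c|g|t)"})
    except TypeError:
        return False
    return True


def expand_with_replace(seq):
    """One-to-many expansion on bytes, one replace() call per code letter."""
    for code, alternatives in SUBSTITUTIONS.items():
        seq = seq.replace(code, alternatives)
    return seq


# A one-to-one mapping is fine: bytes.maketrans() builds the 256-byte table.
one_to_one = b"aBcDg".translate(bytes.maketrans(b"BD", b"bd"))
```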

The code for generating test data is not the kind of code that should be used
in tutorials. It is highly optimized code that uses various optimization tricks
that could be hard to understand without comments. But there is nothing
unpythonic in it. It could be simplified by not formatting the data in standard
FASTA format.

> I would add another kind of question: is it stressing something useful that 
> isn't already stressed by the two other regex benchmarks we already have?

Yes, it is. The regex_v8 benchmark is 2x faster with regex than with re. But 
the regex_dna benchmark is 1.6x slower with regex than with re. Thus these 
tests are stressing different aspects of regular expressions.

It may also be worth testing regular expressions with unicode strings. I expect
some difference between the latest Python and earlier 3.x and 2.7. The question
is how to do this. Add a special option to switch between bytes and unicode
(like --force_bytes in regex_effbot), or just run the tests for bytes and
unicode sequentially and add the results?
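
A rough sketch of the second option (running the bytes and unicode variants
one after the other), assuming ASCII-only input. The function time_both and
its repeat parameter are hypothetical; this is not how the benchmark suite's
runner measures time.

```python
import re
import time


def time_both(pattern, text, repeat=5):
    """Run the same search on str and on bytes; return {label: (count, best_time)}."""
    variants = {
        "str": (pattern, text),
        "bytes": (pattern.encode("ascii"), text.encode("ascii")),
    }
    results = {}
    for label, (pat, txt) in variants.items():
        compiled = re.compile(pat)
        best = float("inf")
        for _ in range(repeat):
            start = time.perf_counter()
            count = len(compiled.findall(txt))
            best = min(best, time.perf_counter() - start)
        results[label] = (count, best)
    return results


sample = "agggtaaa" * 100 + "tttaccct" * 50
results = time_both("agggtaaa|tttaccct", sample)
total_time = results["str"][1] + results["bytes"][1]
```

Summing the two timings gives a single figure, while keeping them separate
would make a bytes-only or unicode-only regression easier to spot.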




[issue26436] Add the regex-dna benchmark

2016-02-27 Thread Terry J. Reedy

Terry J. Reedy added the comment:

DNA matching can be done with difflib.  Serious high-volume work should use 
compiled specialized matchers and aligners.

This particular benchmark, explained a bit at 
https://benchmarksgame.alioth.debian.org/u64q/regexdna-description.html#regexdna,
 manipulates and searches standard FASTA format representations of sequences 
with the regex available in each language.  (The site has another Python 
implementation at 
https://benchmarksgame.alioth.debian.org/u64q/program.php?test=regexdna=python3=1.
 It uses unicode strings rather than bytes, and multiprocessing.Pool to run 
re.findall in parallel.)

FASTA uses lowercase a,c,g,t for known bases and at least 11 uppercase letters 
for subsets of bases representing partially known bases.  The third task is to 
expand upper case letters to subsets of lowercase letters.  Since the rules 
require the use of re and one substitution at a time, the 2 Python programs run
re.sub over the current sequence 11 times.  More idiomatic for Python, and 
probably faster, would be to use seq.replace(old,new) instead.  Perhaps even 
more idiomatic and probably faster still, would be to use str.translate, as in 
this reduced example.

>>> table = {ord('B') : '(c|g|t)', ord('D') : '(a|g|t)'}
>>> 'aBcDg'.translate(table)
'a(c|g|t)c(a|g|t)g'
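
The three approaches mentioned above can be compared side by side. This is only
a sketch using the same reduced two-entry table; the real task has 11
substitutions, and the function names are made up for illustration.

```python
import re

# Reduced table from the example above; the benchmark uses 11 entries.
SUBSTITUTIONS = {"B": "(c|g|t)", "D": "(a|g|t)"}


def expand_with_re(seq):
    """One re.sub() pass per code letter, as the benchmark rules require."""
    for code, alternatives in SUBSTITUTIONS.items():
        seq = re.sub(code, alternatives, seq)
    return seq


def expand_with_replace(seq):
    """One str.replace() pass per code letter."""
    for code, alternatives in SUBSTITUTIONS.items():
        seq = seq.replace(code, alternatives)
    return seq


def expand_with_translate(seq):
    """A single str.translate() pass with a code-point keyed table."""
    table = {ord(code): alternatives for code, alternatives in SUBSTITUTIONS.items()}
    return seq.translate(table)
```

All three return the same string for 'aBcDg'; which one is fastest depends on
the Python version and input size, so it is worth measuring rather than
assuming.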




[issue26436] Add the regex-dna benchmark

2016-02-27 Thread Antoine Pitrou

Antoine Pitrou added the comment:

I would add another kind of question: is it stressing something useful that 
isn't already stressed by the two other regex benchmarks we already have?

Given that it seems built around a highly-specialized use case (DNA matching?) 
and we don't even know if regular expressions are actually the tool of choice 
in the field (unless someone here is a specialist), I'm rather skeptical.

(in general, everything coming from the "Computer Language Benchmarks Game" should 
be taken with a grain of salt IMHO: it's mostly people wanting to play writing 
toy programs)




[issue26436] Add the regex-dna benchmark

2016-02-27 Thread Brett Cannon

Brett Cannon added the comment:

Terry's right about what I meant; is the code of such quality that you would 
let it into the stdlib?

As for execution time, I would vote for increasing the input size so that a run
takes 1 second, as it's just going to get faster and faster from CPU
improvements alone.




[issue26436] Add the regex-dna benchmark

2016-02-27 Thread Terry J. Reedy

Terry J. Reedy added the comment:

I believe Brett is asking whether the code looks like the sort of Python code 
that one of us might write, as opposed to 'language X in Python'.  As far as I 
looked in my quick perusal, I would say yes, except for using floats and while 
instead of int and for because the former are supposedly faster.  (See the loop 
in the middle of random_fasta.)  So do we want a benchmark micro-optimized for 
CSF's system or written 'normally' (with for, int, and range). I did not notice 
any PEP 8 style violations.

--
nosy: +terry.reedy




[issue26436] Add the regex-dna benchmark

2016-02-27 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

> I assume the code looks idiomatic to you?

Sorry, I have not understood your question. Could you please reformulate?

The performance of all Python versions is roughly the same. 2.7 is about 8% 
slower than 3.2 and 3.3; 3.4 through default are about 3-4% slower than 3.2 and 
3.3.

I have chosen the input data size so that the regex-dna benchmark runs for 
roughly the same time as the slowest regex benchmark, regex-compile (0.7 sec 
per iteration on my computer, about a minute with default options). This is 
1/50 of the size used in The Computer Language Benchmarks Game.

Since the benchmark generates its input data, the size can easily be changed. 
Only the control sums need to be updated.




[issue26436] Add the regex-dna benchmark

2016-02-25 Thread Brett Cannon

Brett Cannon added the comment:

Oh, and how long does an execution take?




[issue26436] Add the regex-dna benchmark

2016-02-25 Thread Brett Cannon

Brett Cannon added the comment:

I assume the code looks idiomatic to you?

And out of curiosity, what does the performance look like between something 
like 3.5 and default?




[issue26436] Add the regex-dna benchmark

2016-02-25 Thread Serhiy Storchaka

New submission from Serhiy Storchaka:

The proposed patch adds the regex-dna benchmark from The Computer Language 
Benchmarks Game (http://benchmarksgame.alioth.debian.org/). This is an 
artificial but well-known benchmark.

The patch is based on the regex-dna Python 3 #5 program and the fasta Python 3 
#3 program (for generating input).

--
components: Benchmarks
files: bm_regex_dna.patch
keywords: patch
messages: 260854
nosy: brett.cannon, pitrou, serhiy.storchaka
priority: normal
severity: normal
stage: patch review
status: open
title: Add the regex-dna benchmark
type: enhancement
Added file: http://bugs.python.org/file42028/bm_regex_dna.patch
