Carl Friedrich Bolz-Tereick pushed to branch branch/py3.8 at PyPy / pypy


Commits:
7f8d83b5 by Carl Friedrich Bolz-Tereick at 2022-07-26T19:25:32+02:00
randomly fix some typos

- - - - -
41f576f4 by Carl Friedrich Bolz-Tereick at 2022-08-05T18:00:08+02:00
stop using a trie and switch to a DAWG for the bidirectional
name<->unicode code point mapping

it's smaller and lookups (in both directions) are faster

--HG--
branch : unicodedata-dawg

- - - - -
7d4c154a by Carl Friedrich Bolz-Tereick at 2022-08-05T18:38:52+02:00
just check all the names of the first 65536 characters

--HG--
branch : unicodedata-dawg

- - - - -
a0c974ae by Carl Friedrich Bolz-Tereick at 2022-08-05T20:30:51+02:00
use startswith

--HG--
branch : unicodedata-dawg

- - - - -
ff84054e by Carl Friedrich Bolz-Tereick at 2022-08-05T20:58:46+02:00
save some more bytes

--HG--
branch : unicodedata-dawg

- - - - -
c24a2c24 by Carl Friedrich Bolz-Tereick at 2022-08-06T14:53:11+02:00
intermediate check-in: more compact one character edges

(relies on alphabet being ascii)

--HG--
branch : unicodedata-dawg

- - - - -
4779cbdd by Carl Friedrich Bolz-Tereick at 2022-08-06T19:48:18+02:00
use leb128 to encode the count, saves another 22kb

another intermediate checkin with lots of mess around

--HG--
branch : unicodedata-dawg
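
For readers unfamiliar with the encoding named in this commit: unsigned LEB128 stores an integer in 7-bit groups, least significant group first, with the high bit of each byte marking "more bytes follow". Small counts therefore take one byte instead of a fixed-width field. The sketch below is a generic illustration only; the function names are hypothetical and it is not the code from this branch.

```python
def encode_uleb128(value):
    """Encode a non-negative integer as unsigned LEB128 bytes."""
    if value < 0:
        raise ValueError("unsigned LEB128 cannot encode negative numbers")
    out = bytearray()
    while True:
        byte = value & 0x7F          # take the low 7 bits
        value >>= 7
        if value:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)         # last byte has the high bit clear
            return bytes(out)


def decode_uleb128(data, pos=0):
    """Decode one unsigned LEB128 integer starting at data[pos].

    Returns (value, position just after the encoded integer).
    """
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7
```

Values below 128 occupy a single byte, which is where a saving can come from when most encoded numbers are small.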

- - - - -
dacaf7f0 by Carl Friedrich Bolz-Tereick at 2022-08-06T20:13:36+02:00
switch order of fields in edge encoding

--HG--
branch : unicodedata-dawg

- - - - -
da0db534 by Carl Friedrich Bolz-Tereick at 2022-08-06T20:38:29+02:00
make edge target encoding also be varsized

--HG--
branch : unicodedata-dawg

- - - - -
e3299d09 by Carl Friedrich Bolz-Tereick at 2022-08-06T21:20:46+02:00
compress further by storing offsets

--HG--
branch : unicodedata-dawg

- - - - -
e93701ca by Carl Friedrich Bolz-Tereick at 2022-08-06T21:49:40+02:00
remove cruft

--HG--
branch : unicodedata-dawg

- - - - -
f2f764af by Carl Friedrich Bolz-Tereick at 2022-08-07T14:20:10+02:00
refactor to not have the separate size computation

--HG--
branch : unicodedata-dawg

- - - - -
899613f9 by Carl Friedrich Bolz-Tereick at 2022-08-07T14:37:52+02:00
reshuffle a bit

--HG--
branch : unicodedata-dawg

- - - - -
b08242c8 by Carl Friedrich Bolz-Tereick at 2022-08-07T14:49:28+02:00
put the bit somewhere else, at a slight cost

--HG--
branch : unicodedata-dawg

- - - - -
c957417c by Carl Friedrich Bolz-Tereick at 2022-08-07T15:38:53+02:00
add "final" bits to the edges and remove the edge count

--HG--
branch : unicodedata-dawg

- - - - -
8fbad397 by Carl Friedrich Bolz-Tereick at 2022-08-07T19:08:35+02:00
add some hypothesis tests and fix the found problems

--HG--
branch : unicodedata-dawg

- - - - -
0a376a5c by Carl Friedrich Bolz-Tereick at 2022-08-07T20:54:01+02:00
fix rpython

--HG--
branch : unicodedata-dawg

- - - - -
defc1eaa by Carl Friedrich Bolz-Tereick at 2022-08-07T21:30:02+02:00
argh, actual fix

--HG--
branch : unicodedata-dawg

- - - - -
1064c1e2 by Carl Friedrich Bolz-Tereick at 2022-08-09T13:18:11+02:00
Use base compression again for names, make printed output less enormous

--HG--
branch : unicodedata-dawg

- - - - -
99f6a17c by Carl Friedrich Bolz-Tereick at 2022-08-09T17:04:42+02:00
use int32 for codepoints, not C longs

--HG--
branch : unicodedata-dawg

- - - - -
de0804c3 by Carl Friedrich Bolz-Tereick at 2022-08-09T17:23:47+02:00
small improvements

--HG--
branch : unicodedata-dawg

- - - - -
7121a2f6 by Carl Friedrich Bolz-Tereick at 2022-08-10T22:03:36+02:00
use a single big db to store almost all information. 10% space and is much
faster. use CPython's code for db page tables logic

--HG--
branch : unicodedata-dawg

- - - - -
f45ee617 by Carl Friedrich Bolz-Tereick at 2022-08-11T15:56:06+02:00
intermediate checkin: rewrite code generation infrastructure and estimate sizes

--HG--
branch : unicodedata-dawg

- - - - -
d30dc835 by Carl Friedrich Bolz-Tereick at 2022-08-12T12:48:04+02:00
more switching to the code writer

--HG--
branch : unicodedata-dawg

- - - - -
41f07249 by Carl Friedrich Bolz-Tereick at 2022-08-12T16:42:05+02:00
tweak guesses

--HG--
branch : unicodedata-dawg

- - - - -
71b05811 by Carl Friedrich Bolz-Tereick at 2022-08-12T19:08:52+02:00
do the composition data differently

--HG--
branch : unicodedata-dawg

- - - - -
9b83c891 by Carl Friedrich Bolz-Tereick at 2022-08-12T19:39:01+02:00
share composition_data

--HG--
branch : unicodedata-dawg

- - - - -
aecc840d by Carl Friedrich Bolz-Tereick at 2022-08-12T20:34:46+02:00
fix

--HG--
branch : unicodedata-dawg

- - - - -
68126b7e by Carl Friedrich Bolz-Tereick at 2022-08-12T21:24:37+02:00
integrate composition data into the decomposition tables

--HG--
branch : unicodedata-dawg

- - - - -
4a6d166e by Carl Friedrich Bolz-Tereick at 2022-08-12T21:46:58+02:00
compress pre- and postfix constants

--HG--
branch : unicodedata-dawg

- - - - -
73e77d19 by Carl Friedrich Bolz-Tereick at 2022-08-13T12:57:31+02:00
tests and fixes

--HG--
branch : unicodedata-dawg

- - - - -
3071f3e8 by Carl Friedrich Bolz-Tereick at 2022-08-14T11:08:44+02:00
unify all char lists into the same output list. also include casefolds.

--HG--
branch : unicodedata-dawg

- - - - -
4ad5b92d by Carl Friedrich Bolz-Tereick at 2022-08-14T11:11:16+02:00
remove some old unicode versions, only keep those for py 2.7, and 3.6 onwards

--HG--
branch : unicodedata-dawg

- - - - -
e80baf28 by Carl Friedrich Bolz-Tereick at 2022-08-14T11:17:17+02:00
fix tests

--HG--
branch : unicodedata-dawg

- - - - -
3c137c46 by Carl Friedrich Bolz-Tereick at 2022-08-14T13:28:07+02:00
refactor the db generation

--HG--
branch : unicodedata-dawg

- - - - -
7d5bc5a3 by Carl Friedrich Bolz-Tereick at 2022-08-14T16:30:46+02:00
use methods to generate less "unknown"

--HG--
branch : unicodedata-dawg

- - - - -
d7159bf7 by Carl Friedrich Bolz-Tereick at 2022-08-14T20:48:10+02:00
failing test

--HG--
branch : unicodedata-dawg

- - - - -
051d71fb by Carl Friedrich Bolz-Tereick at 2022-08-15T13:25:10+02:00
lookup should not return aliases by default

--HG--
branch : unicodedata-dawg

- - - - -
d1ce4fe4 by Carl Friedrich Bolz-Tereick at 2022-08-15T13:46:28+02:00
fix test

--HG--
branch : unicodedata-dawg

- - - - -
515c84d8 by Carl Friedrich Bolz-Tereick at 2022-08-15T13:46:44+02:00
print estimated size

--HG--
branch : unicodedata-dawg

- - - - -
a15b0dee by Carl Friedrich Bolz-Tereick at 2022-08-15T13:47:12+02:00
oops

--HG--
branch : unicodedata-dawg

- - - - -
627d9b0a by Carl Friedrich Bolz-Tereick at 2022-08-15T13:47:32+02:00
regenerate everything

--HG--
branch : unicodedata-dawg

- - - - -
34574429 by Carl Friedrich Bolz-Tereick at 2022-08-15T17:18:37+02:00
try to document the API of the rpython unicodedb

--HG--
branch : unicodedata-dawg

- - - - -
6f01c6dc by Carl Friedrich Bolz-Tereick at 2022-08-15T21:02:10+02:00
merge unicodedata-dawg: replace the trie of names in unicodedata with a 
directed acyclic word graph to make it more compact. also various other 
improvements to make unicodedata more compact. shrinks pypy2 by 2.1mb, pypy3 by 
2.6mb
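
As a rough illustration of why a directed acyclic word graph is smaller than a trie: a trie shares only common prefixes, while a DAWG also shares common suffixes by merging equivalent subtrees. The sketch below is a toy model under stated assumptions; the class and function names are hypothetical, and it does not reproduce the packed byte encoding or the code point lookup used on this branch.

```python
class Node(object):
    def __init__(self):
        self.final = False   # True if a word ends at this node
        self.edges = {}      # character -> child Node


def build_trie(words):
    root = Node()
    for word in words:
        node = root
        for ch in word:
            node = node.edges.setdefault(ch, Node())
        node.final = True
    return root


def minimize(node, registry):
    """Merge equivalent subtrees bottom-up and return the canonical node."""
    for ch in list(node.edges):
        node.edges[ch] = minimize(node.edges[ch], registry)
    # Children are already canonical, so their identity describes the subtree.
    key = (node.final,
           tuple(sorted((ch, id(child)) for ch, child in node.edges.items())))
    return registry.setdefault(key, node)


def count_nodes(root):
    """Count distinct nodes reachable from root."""
    seen = set()
    stack = [root]
    while stack:
        node = stack.pop()
        if id(node) in seen:
            continue
        seen.add(id(node))
        stack.extend(node.edges.values())
    return len(seen)


def accepts(root, word):
    node = root
    for ch in word:
        node = node.edges.get(ch)
        if node is None:
            return False
    return node.final


words = ["TAP", "TAPS", "TOP", "TOPS"]
trie = build_trie(words)
trie_size = count_nodes(trie)        # 8 nodes: only the prefix "T" is shared
dawg = minimize(trie, {})            # note: rewires the trie in place
dawg_size = count_nodes(dawg)        # 5 nodes: the "P"/"PS" tails are shared too
```

Both structures accept exactly the same words; only the number of stored nodes differs.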

- - - - -
83c79073 by Carl Friedrich Bolz-Tereick at 2022-08-16T12:43:50+02:00
merge default

--HG--
branch : py3.8

- - - - -


6 changed files:

- pypy/module/unicodedata/interp_ucd.py
- pypy/module/unicodedata/test/test_hyp.py
- pypy/module/unicodedata/test/test_unicodedata.py
- rpython/rlib/rstring.py
- − rpython/rlib/unicodedata/CaseFolding-6.0.0.txt
- − rpython/rlib/unicodedata/CaseFolding-6.1.0.txt


View it on Heptapod: 
https://foss.heptapod.net/pypy/pypy/-/compare/7e841a072c8d4db065430826d7fbedcdf903e3e7...83c79073be5a3d46620ad3a8d5787951ca62d8b0
