[issue39087] [C API] No efficient C API to get UTF-8 string from unicode object.

2020-03-14 Thread Serhiy Storchaka


Serhiy Storchaka  added the comment:

I thought there were at least 3-4 use cases in the core and stdlib.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue39087] [C API] No efficient C API to get UTF-8 string from unicode object.

2020-03-14 Thread Inada Naoki


Change by Inada Naoki :


--
resolution:  -> fixed
stage: patch review -> resolved
status: open -> closed




[issue39087] [C API] No efficient C API to get UTF-8 string from unicode object.

2020-03-14 Thread Inada Naoki


Inada Naoki  added the comment:


New changeset 3a8c56295d6272ad2177d2de8af4c3f824f3ef92 by Inada Naoki in branch 
'master':
Revert "bpo-39087: Add _PyUnicode_GetUTF8Buffer()" (GH-18985)
https://github.com/python/cpython/commit/3a8c56295d6272ad2177d2de8af4c3f824f3ef92


--




[issue39087] [C API] No efficient C API to get UTF-8 string from unicode object.

2020-03-13 Thread Inada Naoki


Change by Inada Naoki :


--
pull_requests: +18333
pull_request: https://github.com/python/cpython/pull/18985




[issue39087] [C API] No efficient C API to get UTF-8 string from unicode object.

2020-03-13 Thread Inada Naoki


Inada Naoki  added the comment:

I'm sorry about merging PR 17659, but I cannot find enough usage examples of 
_PyUnicode_GetUTF8Buffer.

PyUnicode_AsUTF8AndSize has been optimized, and the UTF-8 cache is not so bad in 
most cases.  So _PyUnicode_GetUTF8Buffer does not seem worth keeping.

I will revert PR 17659.

--




[issue39087] [C API] No efficient C API to get UTF-8 string from unicode object.

2020-03-13 Thread Inada Naoki


Change by Inada Naoki :


--
pull_requests: +18332
pull_request: https://github.com/python/cpython/pull/18984




[issue39087] [C API] No efficient C API to get UTF-8 string from unicode object.

2020-03-13 Thread Inada Naoki


Inada Naoki  added the comment:


New changeset c7ad974d341d3edb6b9d2a2dcae4d3d4794ada6b by Inada Naoki in branch 
'master':
bpo-39087: Add _PyUnicode_GetUTF8Buffer() (GH-17659)
https://github.com/python/cpython/commit/c7ad974d341d3edb6b9d2a2dcae4d3d4794ada6b


--




[issue39087] [C API] No efficient C API to get UTF-8 string from unicode object.

2020-02-26 Thread Inada Naoki


Inada Naoki  added the comment:


New changeset 02a4d57263a9846de35b0db12763ff9e7326f62c by Inada Naoki in branch 
'master':
bpo-39087: Optimize PyUnicode_AsUTF8AndSize() (GH-18327)
https://github.com/python/cpython/commit/02a4d57263a9846de35b0db12763ff9e7326f62c


--




[issue39087] [C API] No efficient C API to get UTF-8 string from unicode object.

2020-02-03 Thread Inada Naoki


Inada Naoki  added the comment:

The attached patch is the benchmark function I used in my previous post.

--
Added file: https://bugs.python.org/file48879/bench-asutf8.patch




[issue39087] [C API] No efficient C API to get UTF-8 string from unicode object.

2020-02-03 Thread Inada Naoki

Inada Naoki  added the comment:

I am still not sure whether we should add a new API only to avoid the cache.

* PyUnicode_AsUTF8String: when we need a bytes object or want to avoid the cache.
* PyUnicode_AsUTF8AndSize: when we need a C string and the cache is acceptable.


With PR-18327, PyUnicode_AsUTF8AndSize becomes more than 10% faster than on the 
master branch, and about as fast as PyUnicode_AsUTF8String.


## vs master

$ ./python -m pyperf timeit --compare-to=../cpython/python --python-names master:patched -s 'from _testcapi import unicode_bench_asutf8 as b' -- 'b(1000, "hello", "こんにちは")'
master: . 96.6 us +- 3.3 us
patched: . 83.3 us +- 0.3 us

Mean +- std dev: [master] 96.6 us +- 3.3 us -> [patched] 83.3 us +- 0.3 us: 1.16x faster (-14%)


## vs AsUTF8String

$ ./python -m pyperf timeit -s 'from _testcapi import unicode_bench_asutf8 as b' -- 'b(1000, "hello", "こんにちは")'
.
Mean +- std dev: 83.2 us +- 0.2 us

$ ./python -m pyperf timeit -s 'from _testcapi import unicode_bench_asutf8string as b' -- 'b(1000, "hello", "こんにちは")'
.
Mean +- std dev: 81.9 us +- 2.1 us


## vs AsUTF8String (ASCII)

If we cannot accept the cache, PyUnicode_AsUTF8String is slower than 
PyUnicode_AsUTF8 when the string is ASCII-only.  PyUnicode_GetUTF8Buffer 
helps only in this case.

$ ./python -m pyperf timeit -s 'from _testcapi import unicode_bench_asutf8 as b' -- 'b(1000, "hello", "world")'
.
Mean +- std dev: 37.5 us +- 1.7 us

$ ./python -m pyperf timeit -s 'from _testcapi import unicode_bench_asutf8string as b' -- 'b(1000, "hello", "world")'
.
Mean +- std dev: 46.4 us +- 1.6 us
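
The gap in the ASCII case is plausible because an ASCII-only string encodes to UTF-8 byte for byte, so its existing bytes can be handed out as-is, while any non-ASCII character forces real encoding work. A minimal standalone sketch in plain C, not CPython code (the helper names is_ascii and utf8_length are hypothetical, and input is assumed to be valid code points):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return 1 if every code point is ASCII (below 0x80), else 0. */
static int is_ascii(const uint32_t *cp, size_t n)
{
    assert(cp != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (cp[i] >= 0x80) {
            return 0;
        }
    }
    return 1;
}

/* Number of bytes needed to encode n code points as UTF-8.
   Assumption: valid scalar values only (no surrogates, nothing
   above U+10FFFF); this sketch does not validate its input. */
static size_t utf8_length(const uint32_t *cp, size_t n)
{
    assert(cp != NULL || n == 0);
    size_t bytes = 0;
    for (size_t i = 0; i < n; i++) {
        if (cp[i] < 0x80) {
            bytes += 1;          /* ASCII: one byte, identical value */
        }
        else if (cp[i] < 0x800) {
            bytes += 2;
        }
        else if (cp[i] < 0x10000) {
            bytes += 3;          /* e.g. hiragana */
        }
        else {
            bytes += 4;
        }
    }
    return bytes;
}
```

For the benchmark inputs, "hello" needs 5 bytes (one per character), while "こんにちは" needs 15 (three per character).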

--




[issue39087] [C API] No efficient C API to get UTF-8 string from unicode object.

2020-02-03 Thread Inada Naoki


Change by Inada Naoki :


--
pull_requests: +17701
pull_request: https://github.com/python/cpython/pull/18327




[issue39087] [C API] No efficient C API to get UTF-8 string from unicode object.

2019-12-24 Thread Inada Naoki


Inada Naoki  added the comment:

> I like this idea, but I think that we should at least notify Python-Dev about 
> all additions to the public C API. If somebody has objections or a better 
> idea, it is better to know earlier.

I created a post about this issue on discuss.python.org.
https://discuss.python.org/t/better-api-for-encoding-unicode-objects-with-utf-8/2909

--




[issue39087] [C API] No efficient C API to get UTF-8 string from unicode object.

2019-12-23 Thread Inada Naoki


Change by Inada Naoki :


--
pull_requests: +17140
pull_request: https://github.com/python/cpython/pull/17683




[issue39087] [C API] No efficient C API to get UTF-8 string from unicode object.

2019-12-21 Thread Serhiy Storchaka


Serhiy Storchaka  added the comment:

I like this idea, but I think that we should at least notify Python-Dev about 
all additions to the public C API. If somebody has objections or a better idea, 
it is better to know earlier.

--




[issue39087] [C API] No efficient C API to get UTF-8 string from unicode object.

2019-12-19 Thread Inada Naoki


Change by Inada Naoki :


--
keywords: +patch
pull_requests: +17127
stage:  -> patch review
pull_request: https://github.com/python/cpython/pull/17659




[issue39087] [C API] No efficient C API to get UTF-8 string from unicode object.

2019-12-19 Thread Inada Naoki


Inada Naoki  added the comment:

> Don't you need to DECREF bytes somehow, at least, in case of failure?

Thanks.  I will create a pull request with suggested changes.

--




[issue39087] [C API] No efficient C API to get UTF-8 string from unicode object.

2019-12-19 Thread STINNER Victor


STINNER Victor  added the comment:

return PyBytesType.tp_as_buffer(bytes, view, PyBUF_CONTIG_RO);

Don't you need to DECREF bytes somehow, at least, in case of failure?

--




[issue39087] [C API] No efficient C API to get UTF-8 string from unicode object.

2019-12-19 Thread Inada Naoki


Inada Naoki  added the comment:

s/return NULL/return -1/g

--




[issue39087] [C API] No efficient C API to get UTF-8 string from unicode object.

2019-12-19 Thread Inada Naoki


Inada Naoki  added the comment:

> Would it be possible to use a "container" object like a Py_buffer? Is there a 
> way to customize the code executed when a Py_buffer is "released"?

It looks like a nice idea!  Py_buffer.obj is decref-ed when releasing the buffer.
https://docs.python.org/3/c-api/buffer.html#c.PyBuffer_Release


int PyUnicode_GetUTF8Buffer(PyObject *unicode, Py_buffer *view)
{
    if (!PyUnicode_Check(unicode)) {
        PyErr_BadArgument();
        return NULL;
    }
    if (PyUnicode_READY(unicode) == -1) {
        return NULL;
    }

    if (PyUnicode_UTF8(unicode) != NULL) {
        return PyBuffer_FillInfo(view, unicode,
                                 PyUnicode_UTF8(unicode),
                                 PyUnicode_UTF8_LENGTH(unicode),
                                 1, PyBUF_CONTIG_RO);
    }
    PyObject *bytes = _PyUnicode_AsUTF8String(unicode, NULL);
    if (bytes == NULL) {
        return NULL;
    }
    return PyBytesType.tp_as_buffer(bytes, view, PyBUF_CONTIG_RO);
}

--
nosy:  -skrah




[issue39087] [C API] No efficient C API to get UTF-8 string from unicode object.

2019-12-19 Thread Serhiy Storchaka


Serhiy Storchaka  added the comment:

> Would it be possible to use a "container" object like a Py_buffer?

Looks like a good idea.

int PyUnicode_GetUTF8Buffer(Py_buffer *view, const char *errors)

--
nosy: +skrah




[issue39087] [C API] No efficient C API to get UTF-8 string from unicode object.

2019-12-19 Thread STINNER Victor


STINNER Victor  added the comment:

> The returned object is the owner of the *utf8*.  You need to Py_DECREF() it 
> after you have finished using the *utf8*.  The owner may not be the unicode 
> object.

Would it be possible to use a "container" object like a Py_buffer? Is there a 
way to customize the code executed when a Py_buffer is "released"?

Py_buffer would be nice since it already has a pointer attribute (data) and a 
length attribute, and there is an API to "release" a Py_buffer. It can be 
marked as read-only, etc.
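
The container idea can be illustrated without the CPython API. The following is a minimal standalone model in plain C, not CPython code: toy_str, toy_view, toy_get_utf8 and toy_release are hypothetical names, and the string is assumed to hold only code points below 256. It shows a view that carries a pointer, a length and enough ownership information for a single release call to do the right thing, whether the bytes were borrowed from a cache or encoded into a temporary buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy string: one byte per character (code points 0..255) plus an
   optional cached UTF-8 form.  A stand-in, not CPython's layout. */
typedef struct {
    const unsigned char *latin1;
    size_t length;
    const char *utf8_cache;   /* NULL when nothing is cached */
    size_t utf8_cache_len;
} toy_str;

/* Toy view: pointer and length, plus the resource to free on release. */
typedef struct {
    const char *data;
    size_t len;
    char *owned;              /* temporary buffer, or NULL if borrowed */
} toy_view;

/* Fill view with the UTF-8 form of s.  Returns 0 on success and -1 on
   allocation failure.  Never creates or modifies the cache. */
static int toy_get_utf8(const toy_str *s, toy_view *view)
{
    assert(s != NULL && view != NULL);
    if (s->utf8_cache != NULL) {
        /* Fast path: borrow the cached bytes; nothing to free later. */
        view->data = s->utf8_cache;
        view->len = s->utf8_cache_len;
        view->owned = NULL;
        return 0;
    }
    /* Slow path: encode into a temporary buffer owned by the view.
       Each character below 256 needs at most 2 UTF-8 bytes. */
    char *buf = malloc(2 * s->length + 1);
    if (buf == NULL) {
        return -1;
    }
    size_t j = 0;
    for (size_t i = 0; i < s->length; i++) {
        unsigned char c = s->latin1[i];
        if (c < 0x80) {
            buf[j++] = (char)c;
        }
        else {
            buf[j++] = (char)(0xC0 | (c >> 6));
            buf[j++] = (char)(0x80 | (c & 0x3F));
        }
    }
    buf[j] = '\0';
    view->data = buf;
    view->len = j;
    view->owned = buf;
    return 0;
}

/* Release whatever the view owns; safe for borrowed views too. */
static void toy_release(toy_view *view)
{
    assert(view != NULL);
    free(view->owned);        /* free(NULL) is a no-op */
    view->data = NULL;
    view->len = 0;
    view->owned = NULL;
}
```

A caller pairs every successful toy_get_utf8 with exactly one toy_release, much as PyObject_GetBuffer is paired with PyBuffer_Release.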

--




[issue39087] [C API] No efficient C API to get UTF-8 string from unicode object.

2019-12-19 Thread Serhiy Storchaka


Serhiy Storchaka  added the comment:

Do you mean some concrete code? Several times I have wished for a similar 
feature: to get the UTF-8 cache if it exists, and otherwise to encode to UTF-8 
without creating a cache.

The private _PyUnicode_UTF8() macro could help:

    if ((s = _PyUnicode_UTF8(str))) {
        size = _PyUnicode_UTF8_LENGTH(str);
        tmpbytes = NULL;
    }
    else {
        tmpbytes = _PyUnicode_AsUTF8String(str, "replace");
        s = PyBytes_AS_STRING(tmpbytes);
        size = PyBytes_GET_SIZE(tmpbytes);
    }

but it is not even available outside of unicodeobject.c.

PyUnicode_BorrowUTF8() looks too complex for the public API. I am not sure that 
it will be easy to implement in PyPy. It also does not cover all use cases: 
sometimes you want to convert to UTF-8 without any memory allocation at all 
(either use an existing buffer or raise an error if there is no cached UTF-8 
or the string is not ASCII).
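
The allocation-free use case can be sketched as a borrow-only lookup. This is a standalone model in plain C, not CPython code; toy_str and toy_borrow_utf8 are hypothetical names, and the string is assumed to store one byte per character. The function succeeds only when UTF-8 bytes already exist somewhere (a cache, or the ASCII data itself) and reports an error otherwise instead of encoding.

```c
#include <assert.h>
#include <stddef.h>

/* Toy string: one byte per character (code points 0..255) plus an
   optional cached UTF-8 form.  A stand-in, not CPython's layout. */
typedef struct {
    const unsigned char *data;
    size_t length;
    const char *utf8_cache;   /* NULL when nothing is cached */
    size_t utf8_cache_len;
} toy_str;

/* Borrow existing UTF-8 bytes without allocating.
   Returns 0 and sets *out / *out_len on success.
   Returns -1 if serving the request would require encoding. */
static int toy_borrow_utf8(const toy_str *s, const char **out, size_t *out_len)
{
    assert(s != NULL && out != NULL && out_len != NULL);
    if (s->utf8_cache != NULL) {
        /* A UTF-8 form already exists: hand it out. */
        *out = s->utf8_cache;
        *out_len = s->utf8_cache_len;
        return 0;
    }
    for (size_t i = 0; i < s->length; i++) {
        if (s->data[i] >= 0x80) {
            return -1;        /* non-ASCII and no cache: refuse */
        }
    }
    /* ASCII-only: the stored bytes are already valid UTF-8. */
    *out = (const char *)s->data;
    *out_len = s->length;
    return 0;
}
```

The caller never owns the returned pointer, so there is nothing to release; the pointer is only valid while the string (and its cache) stays alive.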

--
nosy: +serhiy.storchaka




[issue39087] [C API] No efficient C API to get UTF-8 string from unicode object.

2019-12-18 Thread STINNER Victor


Change by STINNER Victor :


--
title: No efficient API to get UTF-8 string from unicode object. -> [C API] No efficient C API to get UTF-8 string from unicode object.
