Re: [Numpy-discussion] `allclose` vs `assert_allclose`

2014-07-16 Thread Nathaniel Smith
On 16 Jul 2014 10:26, "Tony Yu"  wrote:
>
> Is there any reason why the defaults for `allclose` and `assert_allclose`
differ? This makes debugging a broken test much more difficult. More
importantly, using an absolute tolerance of 0 causes failures for some
common cases. For example, if two values are very close to zero, a test
will fail:
>
> np.testing.assert_allclose(0, 1e-14)
>
> Git blame suggests the change was made in the following commit, but I
guess that change only reverted to the original behavior.
>
>
https://github.com/numpy/numpy/commit/f43223479f917e404e724e6a3df27aa701e6d6bf
>
> It seems like the defaults for  `allclose` and `assert_allclose` should
match, and an absolute tolerance of 0 is probably not ideal. I guess this
is a pretty big behavioral change, but the current default for
`assert_allclose` doesn't seem ideal.

What you say makes sense to me, and loosening the default tolerances won't
break any existing tests. (And I'm not too worried about people who were
counting on getting 1e-7 instead of 1e-5 or whatever... if it matters that
much to you exactly what tolerance you test, you should be setting the
tolerance explicitly!) I vote that unless someone comes up with some
terrible objection in the next few days then you should submit a PR :-)
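
For anyone skimming the archives later, a minimal illustration of the
mismatch being discussed (default values taken from the docstrings --
worth double-checking against your numpy version):

import numpy as np

# np.allclose defaults: rtol=1e-05, atol=1e-08
print(np.allclose(0, 1e-14))            # True -- the nonzero atol absorbs the difference

# np.testing.assert_allclose defaults: rtol=1e-07, atol=0
np.testing.assert_allclose(0, 1e-14)    # raises AssertionError, since rtol scales with
                                        # the desired value, which is ~0 here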

-n
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] `allclose` vs `assert_allclose`

2014-07-16 Thread Ralf Gommers
On Wed, Jul 16, 2014 at 6:37 AM, Tony Yu  wrote:

> Is there any reason why the defaults for `allclose` and `assert_allclose`
> differ? This makes debugging a broken test much more difficult. More
> importantly, using an absolute tolerance of 0 causes failures for some
> common cases. For example, if two values are very close to zero, a test
> will fail:
>
> np.testing.assert_allclose(0, 1e-14)
>
> Git blame suggests the change was made in the following commit, but I
> guess that change only reverted to the original behavior.
>
>
> https://github.com/numpy/numpy/commit/f43223479f917e404e724e6a3df27aa701e6d6bf
>

Indeed, it was reverting a change that crept into
https://github.com/numpy/numpy/commit/f527b49a


>
> It seems like the defaults for  `allclose` and `assert_allclose` should
> match, and an absolute tolerance of 0 is probably not ideal. I guess this
> is a pretty big behavioral change, but the current default for
> `assert_allclose` doesn't seem ideal.
>

I agree, the current behavior is quite annoying. It would make sense to
change the atol default to 1e-8, but technically it's a backwards
compatibility break. It would probably have a very minor impact though.
Changing the default for rtol in one of the functions would be much more
painful, however; I don't think that should be done.

Ralf
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] Rounding float to integer while minimizing the difference between the two arrays?

2014-07-16 Thread Chao YUE
Dear all,

I have two arrays, both of float type, let's say X and Y. I want to round
X to integers (intX) according to some decimal threshold, while at the
same time keeping the following difference as small as possible:

diff = np.sum(X*Y) - np.sum(intX*Y)

I don't necessarily have to minimize the "diff" variable (if that demand
makes the computation time too long), but I would like to limit "diff" to,
let's say, ten percent of np.sum(X*Y).

I have tried to write some functions, but I don't know where to start the
optimization.

import math
import numpy as np

def convert_integer(x, threshold=0):
    """
    This function converts the float number x to integer according to the
    threshold.
    """
    if abs(x - 0) < 1e5:
        return 0
    else:
        pdec, pint = math.modf(x)
        if pdec > threshold:
            return int(math.ceil(pint) + 1)
        else:
            return int(math.ceil(pint))

def convert_arr(arr, threshold=0):
    out = arr.copy()
    for i, num in enumerate(arr):
        out[i] = convert_integer(num, threshold=threshold)
    return out

In [147]:
convert_arr(np.array([0.14,1.14,0.12]),0.13)

Out[147]:
array([1, 2, 0])

Now my problem is: how can I minimize or limit the following?

diff = np.sum(X*Y) - np.sum(convert_arr(X, threshold=?)*Y)

Since this is the first time I have encountered this kind of question,
please give me some clue about where to start :p Thanks a lot in advance.

Best,

Chao

-- 
please visit:
http://www.globalcarbonatlas.org/
***
Chao YUE
Laboratoire des Sciences du Climat et de l'Environnement (LSCE-IPSL)
UMR 1572 CEA-CNRS-UVSQ
Batiment 712 - Pe 119
91191 GIF Sur YVETTE Cedex
Tel: (33) 01 69 08 29 02; Fax:01.69.08.77.16

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Rounding float to integer while minimizing the difference between the two arrays?

2014-07-16 Thread Chao YUE
Sorry, there is an error in this part of the code; it should be:


def convert_integer(x, threshold=0):
    """
    This function converts the float number x to integer according to the
    threshold.
    """
    if abs(x - 0) < 1e-5:
        return 0
    else:
        pdec, pint = math.modf(x)
        if pdec > threshold:
            return int(math.ceil(pint) + 1)
        else:
            return int(math.ceil(pint))
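
For what it's worth, here is also a vectorized sketch of the same logic
using numpy directly (a rough sketch only, assuming non-negative inputs as
in the example; the function name is just something made up for
illustration):

import numpy as np

def convert_arr_vectorized(arr, threshold=0):
    arr = np.asarray(arr, dtype=float)
    pdec, pint = np.modf(arr)          # fractional and integer parts
    out = pint + (pdec > threshold)    # round up where the fraction exceeds the threshold
    out[np.abs(arr) < 1e-5] = 0        # treat near-zero values as exactly zero
    return out.astype(int)

# convert_arr_vectorized(np.array([0.14, 1.14, 0.12]), 0.13) -> array([1, 2, 0])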



On Wed, Jul 16, 2014 at 3:18 PM, Chao YUE  wrote:

> Dear all,
>
> I have two arrays with both float type, let's say X and Y. I want to round
> the X to integers (intX) according to some decimal threshold, at the same
> time I want to limit the following difference as small:
>
> diff = np.sum(X*Y) - np.sum(intX*Y)
>
> I don't have to necessarily minimize the "diff" variable (If with this
> demand the computation time is too long). But I would like to limit the
> "diff" to, let's say ten percent within np.sum(X*Y).
>
> I have tried to write some functions, but I don't know where to start the
> opitimization.
>
> def convert_integer(x,threshold=0):
>     """
>     This fucntion converts the float number x to integer according to the
>     threshold.
>     """
>     if abs(x-0) < 1e5:
>         return 0
>     else:
>         pdec,pint = math.modf(x)
>         if pdec > threshold:
>             return int(math.ceil(pint)+1)
>         else:
>             return int(math.ceil(pint))
>
> def convert_arr(arr,threshold=0):
>     out = arr.copy()
>     for i,num in enumerate(arr):
>         out[i] = convert_integer(num,threshold=threshold)
>     return out
>
> In [147]:
> convert_arr(np.array([0.14,1.14,0.12]),0.13)
>
> Out[147]:
> array([1, 2, 0])
>
> Now my problem is, how can I minimize or limit the following?
> diff = np.sum(X*Y) - np.sum(convert_arr(X,threshold=?)*Y)
>
> Because it's the first time I encounter such kind of question, so please
> give me some clue to start :p Thanks a lot in advance.
>
> Best,
>
> Chao
>
> --
> please visit:
> http://www.globalcarbonatlas.org/
>
> ***
> Chao YUE
> Laboratoire des Sciences du Climat et de l'Environnement (LSCE-IPSL)
> UMR 1572 CEA-CNRS-UVSQ
> Batiment 712 - Pe 119
> 91191 GIF Sur YVETTE Cedex
> Tel: (33) 01 69 08 29 02; Fax:01.69.08.77.16
>
> 
>



-- 
please visit:
http://www.globalcarbonatlas.org/
***
Chao YUE
Laboratoire des Sciences du Climat et de l'Environnement (LSCE-IPSL)
UMR 1572 CEA-CNRS-UVSQ
Batiment 712 - Pe 119
91191 GIF Sur YVETTE Cedex
Tel: (33) 01 69 08 29 02; Fax:01.69.08.77.16

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Rounding float to integer while minimizing the difference between the two arrays?

2014-07-16 Thread Chao YUE
Dear all,

Sorry for the noise -- this turns out not to be difficult after all.
scipy.optimize.minimize_scalar seems to solve my problem. Thanks anyway
for this great tool.
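
For the archives, a rough sketch of the sort of thing I mean (X and Y here
are just placeholder data, convert_arr is the function from my earlier
mail, and since the mismatch is piecewise constant in the threshold this
bounded search is only a heuristic):

import numpy as np
from scipy.optimize import minimize_scalar

X = np.array([0.14, 1.14, 0.12])   # placeholder data
Y = np.array([1.0, 2.0, 3.0])      # placeholder weights

def mismatch(threshold):
    # absolute difference introduced by rounding X with the given threshold
    return abs(np.sum(X * Y) - np.sum(convert_arr(X, threshold=threshold) * Y))

res = minimize_scalar(mismatch, bounds=(0, 1), method='bounded')
best_threshold = res.x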

Cheers,

Chao


On Wed, Jul 16, 2014 at 3:18 PM, Chao YUE  wrote:

> Dear all,
>
> I have two arrays with both float type, let's say X and Y. I want to round
> the X to integers (intX) according to some decimal threshold, at the same
> time I want to limit the following difference as small:
>
> diff = np.sum(X*Y) - np.sum(intX*Y)
>
> I don't have to necessarily minimize the "diff" variable (If with this
> demand the computation time is too long). But I would like to limit the
> "diff" to, let's say ten percent within np.sum(X*Y).
>
> I have tried to write some functions, but I don't know where to start the
> opitimization.
>
> def convert_integer(x,threshold=0):
>     """
>     This fucntion converts the float number x to integer according to the
>     threshold.
>     """
>     if abs(x-0) < 1e5:
>         return 0
>     else:
>         pdec,pint = math.modf(x)
>         if pdec > threshold:
>             return int(math.ceil(pint)+1)
>         else:
>             return int(math.ceil(pint))
>
> def convert_arr(arr,threshold=0):
>     out = arr.copy()
>     for i,num in enumerate(arr):
>         out[i] = convert_integer(num,threshold=threshold)
>     return out
>
> In [147]:
> convert_arr(np.array([0.14,1.14,0.12]),0.13)
>
> Out[147]:
> array([1, 2, 0])
>
> Now my problem is, how can I minimize or limit the following?
> diff = np.sum(X*Y) - np.sum(convert_arr(X,threshold=?)*Y)
>
> Because it's the first time I encounter such kind of question, so please
> give me some clue to start :p Thanks a lot in advance.
>
> Best,
>
> Chao
>
> --
> please visit:
> http://www.globalcarbonatlas.org/
>
> ***
> Chao YUE
> Laboratoire des Sciences du Climat et de l'Environnement (LSCE-IPSL)
> UMR 1572 CEA-CNRS-UVSQ
> Batiment 712 - Pe 119
> 91191 GIF Sur YVETTE Cedex
> Tel: (33) 01 69 08 29 02; Fax:01.69.08.77.16
>
> 
>



-- 
please visit:
http://www.globalcarbonatlas.org/
***
Chao YUE
Laboratoire des Sciences du Climat et de l'Environnement (LSCE-IPSL)
UMR 1572 CEA-CNRS-UVSQ
Batiment 712 - Pe 119
91191 GIF Sur YVETTE Cedex
Tel: (33) 01 69 08 29 02; Fax:01.69.08.77.16

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] String type again.

2014-07-16 Thread Chris Barker - NOAA Federal
> But HDF5
> additionally has a fixed-storage-width UTF8 type, so we could map to a
> NumPy fixed-storage-width type trivially.

Sure -- this is why *nix uses utf-8 for filenames -- it can just be a
char*. But that just punts the problem to client code.

I think a UTF-8 string type does not match the numpy model well, and I
don't think we should support it just because it would be easier for
the HDF 5 wrappers.

(To be fair, there are probably other similar systems numpy wants to
interface with that could use this...)

It seems that if you want a 1:1 binary mapping between HDF and numpy for
utf-8 strings, then a bytes type in numpy makes more sense. Numpy
could/should have encode and decode methods for converting byte arrays
to/from Unicode arrays (does it already?).
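
(Partly answering my own parenthetical: np.char does seem to provide
element-wise encode/decode already -- a quick sketch, worth checking the
details against the docs:)

import numpy as np

b = np.array([b'caf\xc3\xa9', b'plain ascii'])   # bytes ('S') array, utf-8 encoded
u = np.char.decode(b, 'utf-8')                   # -> unicode ('U') array
b2 = np.char.encode(u, 'latin-1')                # back to bytes, now latin-1 encoded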

> "Custom" in this context means a user-created HDF5 data-conversion
> filter, which is necessary since all data conversion is handled inside
> the HDF5 library.

> As far as generic Unicode goes, we currently don't support the NumPy
> "U" dtype in h5py for similar reasons; there's no destination type in
> HDF5 which (1) would preserve the dtype for round-trip write/read
> operations and (2) doesn't risk truncation.

It sounds to me like HDF5 simply doesn't support Unicode. Calling an
array of bytes utf-8 simply pushes the problem on to client libs. As
that's where the problem lies, PyHDF may be the place to address it.

If we put utf-8 in numpy, we have the truncation problem there instead
-- which is exactly what I think we should avoid.

> A Latin-1 based 'a' type
> would have similar problems.

Maybe not -- latin1 is fixed width.

>> Does HDF enforce ascii-only? what does it do with the > 127 values?
>
> Unfortunately/fortunately the charset is not enforced for either ASCII

So you can dump Latin-1 into and out of the HDF 'ASCII' type -- it's
essentially the old char* / py2 string. An ugly situation, but why not
use it?

> or UTF-8,

So ASCII and utf-8 are really the same thing, with different meta-data...

> although the HDF Group has been thinking about it.

I wonder if they would consider going Latin-1 instead of ASCII --
like utf-8 it's backward compatible with ASCII, but it gives you
a little more.

I don't know that there is another one-byte encoding worth using -- it
may be my English bias, but it seems Latin-1 gives us ASCII plus some
extra stuff handy for science (I use the degree symbol a lot, for
instance) with nothing lost.
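
(Concretely, the sort of extra I mean:)

u'10\N{DEGREE SIGN}C'.encode('latin-1')   # works: b'10\xb0C', still one byte per character
u'10\N{DEGREE SIGN}C'.encode('ascii')     # raises UnicodeEncodeError -- no degree sign in ASCII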

> Ideally, NumPy would support variable-length
> strings, in which case all these headaches would go away.

Would they? That would push the problem back to PyHDF -- which I'm
arguing is where it belongs, but I didn't think you were ;-)
>
> But I
> imagine that's also somewhat complicated. :)

That's a whole other kettle of fish, yes.


-Chris
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] parallel distutils extensions build? use gcc -flto

2014-07-16 Thread Julian Taylor
hi,
I have been playing around a bit with gcc's link time optimization
feature and found that using it actually speeds up a from-scratch build
of numpy, due to its ability to perform parallel optimization and linking.
As a bonus you should also get faster binaries due to the better
optimizations lto allows.

As compiling with lto does require some possibly lesser-known details, I
wanted to share them.

Prerequisites are a working gcc toolchain of at least gcc-4.8 and
binutils > 2.21; gcc 4.9 is better as it's faster.

First of all, numpy checks the long double representation by compiling a
file and looking at the binary. This won't work as the od -b
reimplementation here does not understand lto objects, so on x86 we must
short-circuit that:
--- a/numpy/core/setup_common.py
+++ b/numpy/core/setup_common.py
@@ -174,6 +174,7 @@ def check_long_double_representation(cmd):
 # We need to use _compile because we need the object filename
 src, object = cmd._compile(body, None, None, 'c')
 try:
+   return 'IEEE_DOUBLE_LE'
 type = long_double_representation(pyod(object))
 return type
 finally:


Next we build numpy as usual but override the compiler, linker and ar to
add our custom flags.
The setup.py call would look like this:

CC='gcc -fno-fat-lto-objects -flto=4 -fuse-linker-plugin -O3' \
LDSHARED='gcc -fno-fat-lto-objects -flto=4 -fuse-linker-plugin -shared
-O3' AR=gcc-ar \
python setup.py build_ext

Some explanation:
The AR override is needed as numpy builds a static library and ar needs
to know about lto objects; gcc-ar does exactly that.
-flto=4, the main flag, tells gcc to perform link time optimizations
using 4 parallel processes.
-fno-fat-lto-objects tells gcc to only build lto objects; normally it
builds both an lto object and a normal object for toolchain
compatibility. If our toolchain can handle lto objects this is just a
waste of time and we skip it. (The flag is the default in gcc-4.9 but not 4.8.)
-fuse-linker-plugin directs gcc to run its link time optimizer plugin in
the linking step; the linker must support plugins, and both the bfd (> 2.21)
and gold linkers do so. This allows for more optimizations.
-O3 has to be added to the linker too, as that's where the optimization
occurs. In general a problem with lto is that the compiler options of
all steps must match the flags used for linking.

If you are using C++ or gfortran you also have to override those to use
lto (CXX and FF(?)).

See https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html for a lot
more details.


For some numbers: on my machine a from-scratch numpy build with no
caching takes 1min55s; with lto on 4 processes it only takes 55s. Pretty
neat for a much more involved optimization process.

Concerning the speed gain we get from this: I ran our benchmark suite
with this build, and there were no really significant gains, which is
somewhat expected as numpy is simple C code with most function
bottlenecks already inlined.

So, in conclusion: -flto seems to work well with recent gccs and allows
for faster builds using the limited distutils. While probably not useful
for development, where compiler caching (ccache) is of utmost importance,
it is still interesting for projects doing one-shot uncached builds
(Travis-like CI) that have huge objects (e.g. swig or cython) and don't
want to change to proper parallel build systems like bento.

PS: As far as I know clang also supports lto, but I have never used it.
PPS: using NPY_SEPARATE_COMPILATION=0 crashes gcc-4.9; time for a bug
report.

Cheers,
Julian
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] __numpy_ufunc__

2014-07-16 Thread Ralf Gommers
On Wed, Jul 16, 2014 at 10:07 AM, Nathaniel Smith  wrote:

> Weirdly, I never received Chuck's original email in this thread. Should
> some list admin be informed?
>
Also weirdly, my reply didn't show up on gmane. Not sure if it got through,
so re-sending:

It's already in, so do you mean not using it? It would help to know what
the issue is, because it's finished enough that it's already used in a
released version of scipy (in sparse matrices).

Ralf

I also am not sure what/where Julian's comments were, so I second the call
> for context :-). Putting it off until 1.10 doesn't seem like an obviously
> bad idea to me, but specifics would help...
>
> (__numpy_ufunc__ is the new system for allowing arbitrary third party
> objects to override how ufuncs are applied to them, i.e. it means
> np.sin(sparsemat) and np.sin(gpuarray) can be defined to do something
> sensible. Conceptually it replaces the old __array_prepare__/__array_wrap__
> system, which was limited to ndarray subclasses and has major limits on
> what you can do. Of course __array_prepare/wrap__ will also continue to be
> supported for compatibility.)
>
-n
> On 16 Jul 2014 00:10, "Benjamin Root"  wrote:
>
>> Perhaps a bit of context might be useful? How is numpy_ufunc different
>> from the ufuncs that we know and love? What are the known implications?
>> What are the known shortcomings? Are there ABI and/or API concerns between
>> 1.9 and 1.10?
>>
>> Ben Root
>>
>>
>> On Mon, Jul 14, 2014 at 2:22 PM, Charles R Harris <
>> charlesr.har...@gmail.com> wrote:
>>
>>> Hi All,
>>>
>>> Julian has raised the question of including numpy_ufunc in numpy 1.9. I
>>> don't feel strongly one way or the other, but it doesn't seem to be
>>> finished yet and 1.10 might be a better place to work out the remaining
>>> problems along with the astropy folks testing possible uses.
>>>
>>> Thoughts?
>>>
>>> Chuck
>>>
>>> ___
>>> NumPy-Discussion mailing list
>>> NumPy-Discussion@scipy.org
>>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>>>
>>>
>>
>> ___
>> NumPy-Discussion mailing list
>> NumPy-Discussion@scipy.org
>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>>
>>
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
>
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] __numpy_ufunc__

2014-07-16 Thread Benjamin Root
Perhaps a bit of context might be useful? How is numpy_ufunc different from
the ufuncs that we know and love? What are the known implications? What are
the known shortcomings? Are there ABI and/or API concerns between 1.9 and
1.10?

Ben Root


On Mon, Jul 14, 2014 at 2:22 PM, Charles R Harris  wrote:

> Hi All,
>
> Julian has raised the question of including numpy_ufunc in numpy 1.9. I
> don't feel strongly one way or the other, but it doesn't seem to be
> finished yet and 1.10 might be a better place to work out the remaining
> problems along with the astropy folks testing possible uses.
>
> Thoughts?
>
> Chuck
>
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
>
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] String type again.

2014-07-16 Thread Charles R Harris
On Tue, Jul 15, 2014 at 9:15 AM, Charles R Harris  wrote:

>
>
>
> On Tue, Jul 15, 2014 at 5:26 AM, Sebastian Berg <
> sebast...@sipsolutions.net> wrote:
>
>> On Sa, 2014-07-12 at 12:17 -0500, Charles R Harris wrote:
>> > As previous posts have pointed out, Numpy's `S` type is currently
>> > treated as a byte string, which leads to more complicated code in
>> > python3. OTOH, the unicode type is stored as UCS4, which consumes a
>> > lot of space, especially for ascii strings. This note proposes to
>> > adapt the currently existing 'a' type letter, currently aliased to
>> > 'S', as a new fixed encoding dtype. Python 3.3 introduced two one byte
>> > internal representations for unicode strings, ascii and latin1. Ascii
>> > has the advantage that it is a subset of UTF-8, whereas latin1 has a
>> > few more symbols. Another possibility is to just make it an UTF-8
>> > encoding, but I think this would involve more overhead as Python would
>> > need to determine the maximum character size. These are just
>> > preliminary thoughts, comments are welcome.
>> >
>>
>> Just wondering, couldn't we have a type which actually has an
>> (arbitrary, python supported) encoding (and "bytes" might even just be a
>> special case of no encoding)? Basically storing bytes and on access do
>> element[i].decode(specified_encoding) and on storing element[i] =
>> value.encode(specified_encoding).
>>
>> There is always the never ending small issue of trailing null bytes. If
>> we want to be fully compatible, such a type would have to store the
>> string length explicitly to support trailing null bytes.
>>
>
> UTF-8 encoding works with null bytes. That is one of the reasons it is so
> popular.
>
>
Thinking more about it, the easiest thing to do might be to make the S
dtype a UTF-8 encoding. Most of the machinery to deal with that is already
in place. That change might affect some users though, and we might need to
do some work to make it backwards compatible with python 2.

Chuck
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] String type again.

2014-07-16 Thread Jeff Reback
In 0.15.0 pandas will have full-fledged support for categoricals, which
in effect allow you to map a smaller number of strings to integers.

This is now in pandas master:

http://pandas-docs.github.io/pandas-docs-travis/categorical.html

Feedback welcome!
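
Roughly like this (a sketch based on the docs linked above; the exact API
may still shift before the 0.15.0 release):

import pandas as pd

s = pd.Series(['red', 'blue', 'red', 'green'])
c = s.astype('category')    # each distinct string is stored once as a category
print(c.cat.categories)     # the unique strings
print(c.cat.codes)          # small integer codes, one per element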

> On Jul 14, 2014, at 1:00 PM, Olivier Grisel  wrote:
> 
> 2014-07-13 19:05 GMT+02:00 Alexander Belopolsky :
>> 
>>> On Sat, Jul 12, 2014 at 8:02 PM, Nathaniel Smith  wrote:
>>> 
>>> I feel like for most purposes, what we *really* want is a variable length
>>> string dtype (I.e., where each element can be a different length.).
>> 
>> 
>> 
>> I've been toying with the idea of creating an array type for interned
>> strings.  In many applications dealing with large arrays of variable size
>> strings, the strings come from a relatively short set of names.  Arrays of
>> interned strings can be manipulated very efficiently because in many respects
>> they are just like arrays of integers.
> 
> +1 I think this is why pandas is using dtype=object to load string
> data: in many cases short string values are used to represent
> categorical variables with a comparatively small cardinality of
> possible values for a dataset with comparatively numerous records.
> 
> In that case the dtype=object is not that bad as it just stores
> pointer on string objects managed by Python. It's possible to intern
> the strings manually at load time (I don't know if pandas or python
> already do it automatically in that case). The integer semantics is
> good for that case. Having an explicit dtype might be even better.
> 
> -- 
> Olivier
> http://twitter.com/ogrisel - http://github.com/ogrisel
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] String type again.

2014-07-16 Thread Chris Barker
On Mon, Jul 14, 2014 at 10:39 AM, Andrew Collette  wrote:


> For storing data in HDF5 (PyTables or h5py), it would be somewhat
> cleaner if either ASCII or UTF-8 are used, as these are the only two
> charsets officially supported by the library.


Good argument for ASCII, but utf-8 is a bad idea, as there is no 1:1
correspondence between the length of a string in bytes and its length in
characters -- since numpy needs to pre-allocate a defined number of bytes
for a dtype, there is a disconnect between the user and numpy as to how
long a string being stored is... This isn't a problem for immutable
strings, and it's less of a problem for HDF, as you can determine how many
bytes you need before you write the file (or does HDF support var-length
elements?)
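
(A concrete illustration of the disconnect, in plain Python byte counts:)

s = u'caf\xe9'              # 4 characters
len(s.encode('latin-1'))    # 4 bytes -- fits a 4-byte fixed-width field exactly
len(s.encode('utf-8'))      # 5 bytes -- truncating to 4 bytes (b'caf\xc3') would
                            # split the multibyte character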


>  Latin-1 would require a
> custom read/write converter, which isn't the end of the world


"custom"? it would be an encoding operation -- which you'd need to go from
utf-8 to/from unicode anyway. So you would lose the ability to have a nice
1:1 binary representation map between numpy and HDF... good argument for
ASCII, I guess. Or for HDF to use latin-1 ;-)

Does HDF enforce ascii-only? what does it do with the > 127 values?


> would be tricky to do in a correct way, and likely somewhat slow.
> We'd also run into truncation issues since certain latin-1 chars
> become multibyte sequences in UTF8.
>

that's the whole issue with UTF-8 -- it needs to be addressed somewhere,
and the numpy-HDF interface seems like a smarter place to put it than the
numpy-user interface!

> I assume 'a' strings would still be null-padded?

yup.



-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R   (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] `allclose` vs `assert_allclose`

2014-07-16 Thread Tony Yu
Is there any reason why the defaults for `allclose` and `assert_allclose`
differ? This makes debugging a broken test much more difficult. More
importantly, using an absolute tolerance of 0 causes failures for some
common cases. For example, if two values are very close to zero, a test
will fail:

np.testing.assert_allclose(0, 1e-14)

Git blame suggests the change was made in the following commit, but I guess
that change only reverted to the original behavior.

https://github.com/numpy/numpy/commit/f43223479f917e404e724e6a3df27aa701e6d6bf

It seems like the defaults for  `allclose` and `assert_allclose` should
match, and an absolute tolerance of 0 is probably not ideal. I guess this
is a pretty big behavioral change, but the current default for
`assert_allclose` doesn't seem ideal.

Thanks,
-Tony
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] __numpy_ufunc__

2014-07-16 Thread Ralf Gommers
On Mon, Jul 14, 2014 at 8:22 PM, Charles R Harris  wrote:

> Hi All,
>
> Julian has raised the question of including numpy_ufunc in numpy 1.9. I
> don't feel strongly one way or the other, but it doesn't seem to be
> finished yet and 1.10 might be a better place to work out the remaining
> problems along with the astropy folks testing possible uses.
>
> Thoughts?
>

It's already in, so do you mean not using it? It would help to know what
the issue is, because it's finished enough that it's already used in a
released version of scipy (in sparse matrices).

Ralf
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] String type again.

2014-07-16 Thread Aldcroft, Thomas
On Sat, Jul 12, 2014 at 8:02 PM, Nathaniel Smith  wrote:

> On 12 Jul 2014 23:06, "Charles R Harris" 
> wrote:
> >
> > As previous posts have pointed out, Numpy's `S` type is currently
> treated as a byte string, which leads to more complicated code in python3.
> OTOH, the unicode type is stored as UCS4, which consumes a lot of space,
> especially for ascii strings. This note proposes to adapt the currently
> existing 'a' type letter, currently aliased to 'S', as a new fixed encoding
> dtype. Python 3.3 introduced two one byte internal representations for
> unicode strings, ascii and latin1. Ascii has the advantage that it is a
> subset of UTF-8, whereas latin1 has a few more symbols. Another possibility
> is to just make it an UTF-8 encoding, but I think this would involve more
> overhead as Python would need to determine the maximum character size.
> These are just preliminary thoughts, comments are welcome.
>
> I feel like for most purposes, what we *really* want is a variable length
> string dtype (I.e., where each element can be a different length.). Pandas
> pays quite some price in overhead to fake this right now. Adding such a
> thing will cause some problems regarding compatibility (what to do with
> array(["foo"])) and education, but I think it's worth it in the long run. A
> variable length string with out of band storage also would allow for a lot
> of py3.3-style storage tricks of we want then.
>
> Given that, though, I'm a little dubious about adding a third fixed length
> string type, since it seems like it might be a temporary patch, yet raises
> the prospect of having to indefinitely support *5* distinct string types (3
> of which will map to py3 str)...
>
> OTOH, fixed length nul padded latin1 would be useful for various flat file
> reading tasks.
>
As one of the original agitators for this, let me re-iterate that what the
astronomical community *really* wants is the original proposal as described
by Chris Barker [1] and essentially what Charles said.  We have large data
archives that have ASCII string data in binary formats like FITS and HDF5.
 The current readers for those datasets present users with numpy S data
types, which in Python 3 cannot be compared to str (unicode) literals.  In
many cases those datasets are large, and in my case I regularly deal with
multi-Gb sized bytestring arrays.  Converting those to a U dtype is not
practical.

This issue is the sole blocker that I personally have in beginning to move
our operations code base to be Python 3 compatible, and eventually actually
baselining Python 3.

A variable length string would be great, but it feels like a different (and
more difficult) problem to me.  If, however, this can be the solution to
the problem I described, and it can be implemented in a finite time, then
I'm all for it!  :-)

I hate begging for features with no chance of contributing much to the
implementation (lacking the necessary expertise in numpy internals).  I
would be happy to draft a NEP if that will help the process.

Cheers,
Tom

[1]:
http://mail.scipy.org/pipermail/numpy-discussion/2014-January/068622.html

> -n
>
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
>
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Bug in np.cross for 2D vectors

2014-07-16 Thread Jaime Fernández del Río
On Tue, Jul 15, 2014 at 2:22 AM, Neil Hodgson 
wrote:

> Hi,
>
> We came across this bug while using np.cross on 3D arrays of 2D vectors.
>

What version of numpy are you using? This should already be solved in numpy
master, and be part of the 1.9 release. Here's the relevant commit,
although the code has been cleaned up a bit in later ones:

https://github.com/numpy/numpy/commit/b9454f50f23516234c325490913224c3a69fb122

Jaime

-- 
(\__/)
( O.o)
( > <) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus planes
de dominación mundial.
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] __numpy_ufunc__

2014-07-16 Thread Nathaniel Smith
Weirdly, I never received Chuck's original email in this thread. Should
some list admin be informed?

I also am not sure what/where Julian's comments were, so I second the call
for context :-). Putting it off until 1.10 doesn't seem like an obviously
bad idea to me, but specifics would help...

(__numpy_ufunc__ is the new system for allowing arbitrary third party
objects to override how ufuncs are applied to them, i.e. it means
np.sin(sparsemat) and np.sin(gpuarray) can be defined to do something
sensible. Conceptually it replaces the old __array_prepare__/__array_wrap__
system, which was limited to ndarray subclasses and has major limits on
what you can do. Of course __array_prepare/wrap__ will also continue to be
supported for compatibility.)
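
(A toy sketch of the idea as I understand the current draft -- see the
actual docs/NEP for the authoritative signature:)

import numpy as np

class Wrapped(object):
    def __init__(self, data):
        self.data = np.asarray(data)

    def __numpy_ufunc__(self, ufunc, method, i, inputs, **kwargs):
        # unwrap any Wrapped inputs, apply the ufunc, and re-wrap the result
        args = [x.data if isinstance(x, Wrapped) else x for x in inputs]
        return Wrapped(getattr(ufunc, method)(*args, **kwargs))

# np.sin(Wrapped([0.0, 1.0])) should then hand control to __numpy_ufunc__
# instead of numpy trying to coerce the object itself.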

-n
On 16 Jul 2014 00:10, "Benjamin Root"  wrote:

> Perhaps a bit of context might be useful? How is numpy_ufunc different
> from the ufuncs that we know and love? What are the known implications?
> What are the known shortcomings? Are there ABI and/or API concerns between
> 1.9 and 1.10?
>
> Ben Root
>
>
> On Mon, Jul 14, 2014 at 2:22 PM, Charles R Harris <
> charlesr.har...@gmail.com> wrote:
>
>> Hi All,
>>
>> Julian has raised the question of including numpy_ufunc in numpy 1.9. I
>> don't feel strongly one way or the other, but it doesn't seem to be
>> finished yet and 1.10 might be a better place to work out the remaining
>> problems along with the astropy folks testing possible uses.
>>
>> Thoughts?
>>
>> Chuck
>>
>> ___
>> NumPy-Discussion mailing list
>> NumPy-Discussion@scipy.org
>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>>
>>
>
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
>
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion