Re: [Numpy-discussion] py2/py3 pickling
Hi-

Would it be possible then (in relatively short order) to create a py2-to-py3 numpy pickle converter? This would run in py2, np.load or unpickle a pickle in the usual way, and then repickle and/or save using a pickler that uses an explicit pickle type for encoding the bytes associated with numpy dtypes. The numpy unpickler in py3 would then know what to do. I.e., is there a way to make the numpy py2 pickler be explicit about byte strings? Presumably this would cover most use cases, even for complicated pickled objects, and could be used transparently within py2 or py3.

Best, C

On Aug 24, 2015, at 2:30 PM, Nathaniel Smith (n...@pobox.com) wrote:

On Aug 24, 2015 9:29 AM, Pauli Virtanen (p...@iki.fi) wrote:

On 24.08.2015 at 01:02, Chris Laumann wrote: [clip] Is there documentation about the limits and workarounds for py2/py3 pickle/np.save/load compatibility? I haven't found anything except developer bug-tracking discussions (e.g. #4879 in numpy on GitHub).

Not sure if it's written down somewhere, but:

- You should consider pickles not portable between Py2/3.

- Setting encoding='bytes' or encoding='latin1' should produce correct results for numerical data. However, neither is safe, because the option also affects data other than the numpy arrays you may have saved.

For those wondering what's going on here: if you pickled a str in Python 2, then Python 3 wants to unpickle it as a str. But in Python 2 a str was a vector of arbitrary bytes in some assumed encoding, while in Python 3 a str is a vector of Unicode characters. So the unpickler needs to know what encoding to use, which is fine and what you'd expect for the py2-py3 transition. But: when pickling arrays, numpy on py2 used a str to store the raw memory of your array. Trying to run this data through a character decoder then obviously makes a mess of everything.
So the fundamental problem is that on py2 there's no way to distinguish between a string of text and a string of bytes -- they're encoded in exactly the same way in the pickle file -- and the Python 3 unpickler just has to guess. You can tell it to guess in a way that works for raw bytes -- that's what the encoding= options Pauli mentions above do -- but this will then obviously be incorrect if you have any actual non-latin1 textual strings in your pickle, and you can't get it to handle both correctly at the same time.

If you're desperate, it should be possible to get your data out of py2 pickles by loading them with one of the encoding options above, and then going through the resulting object and converting all the actual textual strings back to the correct encoding by hand. No data is actually lost. And of course even this is unnecessary if your file contains only ASCII/latin1.

-n

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
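The loading workaround described above can be wrapped in a small py3-side helper; a minimal sketch (the function name `load_py2_pickle` is ours, not numpy's, and it assumes the pickle contains only numerical data, not genuine py2 text strings):

```python
import io
import pickle

import numpy as np


def load_py2_pickle(f, encoding="latin1"):
    """Unpickle a Python 2 pickle stream under Python 3.

    encoding='latin1' (or 'bytes') maps each py2 str byte straight
    through, which is what numpy's raw array buffers need. Any real
    textual strings in the pickle may come out mis-decoded and would
    need converting back by hand, as described above.
    """
    return pickle.load(f, encoding=encoding)


# Round-trip check. The stream here is written by py3 with protocol 2
# (the py2-compatible protocol), so this only exercises the call
# signature, not a genuine py2 file.
buf = io.BytesIO()
pickle.dump(np.float64(0.99), buf, protocol=2)
buf.seek(0)
b = load_py2_pickle(buf)
print(b)  # 0.99
```

Once loaded this way, re-pickling under py3 produces a pickle that py3 can read back without any encoding option.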
[Numpy-discussion] py2/py3 pickling
Hi all-

Is there documentation about the limits and workarounds for py2/py3 pickle/np.save/load compatibility? I haven't found anything except developer bug-tracking discussions (e.g. #4879 in numpy on GitHub). The kinds of errors you get can be really obscure when save/loading complicated objects or pickles containing numpy scalars. It's really unclear to me why the following shouldn't work -- it doesn't have anything apparent to do with string handling and unicode.

Run in py2:

import pickle
import numpy as np

a = np.float64(0.99)
pickle.dump(a, open('test.pkl', 'wb'))

And then in py3:

import pickle
import numpy as np

b = pickle.load(open('test.pkl', 'rb'))

And you get:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xae in position 0: ordinal not in range(128)

If you force encoding='bytes' in the load, it works. Is this explained anywhere?

Best, C
Re: [Numpy-discussion] It looks like Py 3.5 will include a dedicated infix matrix multiply operator
That's great. Does this mean that, in the not-so-distant future, the matrix class will go the way of the dodo? I have had more subtle, hard-to-fix bugs sneak into code because something returns a matrix instead of an array than from almost any other single source I can think of. Having two almost indistinguishable types for 2d arrays, with slightly different semantics for a small subset of operations, is terrible.

Best, C

--
Chris Laumann
Sent with Airmail

On March 14, 2014 at 7:16:24 PM, Christophe Bal (projet...@gmail.com) wrote:

This is good for numpyists, but it is also another operator that could help in other contexts. As a math user, I was at first very skeptical, but in the end this is good news for non-numpyists too.

Christophe BAL

On March 15, 2014 at 02:01, Frédéric Bastien (no...@nouiz.org) wrote:

This is great news. Excellent work Nathaniel and all others!

Frédéric

On Fri, Mar 14, 2014 at 8:57 PM, Aron Ahmadia (a...@ahmadia.net) wrote:

That's the best news I've had all week. Thanks for all your work on this, Nathan.

-A

On Fri, Mar 14, 2014 at 8:51 PM, Nathaniel Smith (n...@pobox.com) wrote:

Well, that was fast. Guido says he'll accept the addition of '@' as an infix operator for matrix multiplication, once some details are ironed out:

https://mail.python.org/pipermail/python-ideas/2014-March/027109.html
http://legacy.python.org/dev/peps/pep-0465/

Specifically, we need to figure out whether we want to make an argument for a matrix power operator (@@), and what precedence/associativity we want '@' to have. I'll post two separate threads to get feedback on those in an organized way -- this is just a heads-up.

-n

--
Nathaniel J.
Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org
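The "almost indistinguishable types with slightly different semantics" complaint at the top of this thread can be seen in a three-line session; a hedged illustration (np.matrix has since been discouraged in favor of plain arrays, which is part of why @ matters):

```python
import numpy as np

a = np.arange(4).reshape(2, 2)
m = np.matrix(a)

print(a * a)       # elementwise product for ndarray
print(m * m)       # the same expression is a matrix product for matrix
print(m[0].shape)  # a row slice of a matrix is still 2-D: (1, 2)
```

The same source line silently changes meaning depending on which of the two types flows into it, which is exactly the class of bug described above.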
Re: [Numpy-discussion] [help needed] associativity and precedence of '@'
Hi all,

Let me preface my two cents by saying that I think the best part of @ being accepted is the potential for deprecating the matrix class -- the syntactic beauty of infix for matrix multiply is a nice side effect IMHO :)

This may be why my basic attitude is: I don't think it matters very much, but I would vote (weakly) for weak-right. Where there is ambiguity, I suspect most practitioners will just put in parentheses anyway -- especially with combinations of * and @, where I don't think there is a natural intuitive precedence relationship. At least, element-wise multiplication is very rare in math/physics texts as an explicitly defined elementary operation, so I'd be surprised if anybody had a strong intuition about the precedence of the '*' operator. And the binding order doesn't matter if it is scalar multiplication.

I have quite a bit of code with large matrices where the order of matrix-vector multiplies is an important optimization, and I would certainly have a few simpler-looking expressions for op @ op @ vec; hence the weak preference for right-associativity. That said, I routinely come across situations where the optimal matrix multiplication order is more complicated than can be expressed as left-right or right-left (because some matrices might be diagonal, CSR or CSC), which is why the preference is only weak. I don't see a down-side in the use-case where it is actually associative (as in matrix-matrix-vector).

Best, Chris

--
Chris Laumann
Sent with Airmail

On March 14, 2014 at 8:42:00 PM, Nathaniel Smith (n...@pobox.com) wrote:

Hi all,

Here's the main blocker for adding a matrix multiply operator '@' to Python: we need to decide what we think its precedence and associativity should be. I'll explain what that means so we're on the same page, and what the choices are, and then we can all argue about it.
But even better would be if we could get some data to guide our decision, and this would be a lot easier if some of you all can help; I'll suggest some ways you might be able to do that.

So! Precedence and left- versus right-associativity. If you already know what these are, you can skim down until you see CAPITAL LETTERS.

We all know what precedence is. Code like this:

a + b * c

gets evaluated as:

a + (b * c)

because * has higher precedence than +. It "binds more tightly", as they say. Python's complete precedence table is here:

http://docs.python.org/3/reference/expressions.html#operator-precedence

Associativity, in the parsing sense, is less well known, though it's just as important. It's about deciding how to evaluate code like this:

a * b * c

Do we use

a * (b * c)    # * is right associative

or

(a * b) * c    # * is left associative

? Here all the operators have the same precedence (because, uh... they're the same operator), so precedence doesn't help. And mostly we can ignore this in day-to-day life, because both versions give the same answer, so who cares. But a programming language has to pick one (consider what happens if one of those objects has a non-default __mul__ implementation). And of course it matters a lot for non-associative operations like

a - b - c

or

a / b / c

So when figuring out order of evaluation, what you do first is check the precedence, and then, if you have multiple operators next to each other with the same precedence, you check their associativity. Notice that this means that if you have different operators that share the same precedence level (like + and -, or * and /), then they all have to have the same associativity. All else being equal, it's generally considered nice to have fewer precedence levels, because these have to be memorized by users.

Right now in Python, every precedence level is left-associative, except for '**'.
If you write these formulas without any parentheses, then what the interpreter will actually execute is:

(a * b) * c
(a - b) - c
(a / b) / c

but

a ** (b ** c)

Okay, that's the background. Here's the question. We need to decide on precedence and associativity for '@'. In particular, there are three different options that are interesting:

OPTION 1 FOR @:
Precedence: same as *
Associativity: left
My shorthand name for it: "same-left" (yes, very creative)

This means that if you don't use parentheses, you get:
a @ b @ c  ->  (a @ b) @ c
a * b @ c  ->  (a * b) @ c
a @ b * c  ->  (a @ b) * c

OPTION 2 FOR @:
Precedence: more-weakly-binding than *
Associativity: right
My shorthand name for it: "weak-right"

This means that if you don't use parentheses, you get:
a @ b @ c  ->  a @ (b @ c)
a * b @ c  ->  (a * b) @ c
a @ b * c  ->  a @ (b * c)

OPTION 3 FOR @:
Precedence: more-tightly-binding than *
Associativity: right
My shorthand name for it: "tight-right"

This means that if you don't use parentheses, you get:
a @ b @ c  ->  a @ (b @ c)
a * b @ c  ->  a * (b @ c)
a @ b * c  ->  (a @ b) * c

We need to pick which of these options we think is best, based
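For readers arriving after the fact: PEP 465 was ultimately accepted with the "same-left" option, and the grouping any given Python version actually uses can be probed empirically with a tiny class that records how @ and * are applied (the class name Op here is purely illustrative):

```python
class Op:
    """Records the grouping of chained @ / * applications as a string."""
    def __init__(self, name):
        self.name = name

    def __matmul__(self, other):
        return Op("({} @ {})".format(self.name, other.name))

    def __mul__(self, other):
        return Op("({} * {})".format(self.name, other.name))


a, b, c = Op("a"), Op("b"), Op("c")

print((a @ b @ c).name)  # ((a @ b) @ c) -- left associative
print((a * b @ c).name)  # ((a * b) @ c) -- same precedence as *
print((a @ b * c).name)  # ((a @ b) * c)
```

The printed groupings match "same-left" exactly: @ sits on the same precedence level as * and is left-associative.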
Re: [Numpy-discussion] Memory leak?
Hi all-

Thanks for the info re: memory leak. In trying to work around it, I think I've discovered another (still using SuperPack). This leaks ~30MB / run:

hists = zeros((50,64), dtype=int)
for i in range(50):
    for j in range(2**13):
        hists[i,j%64] += 1

The code leaks using hists[i,j] = hists[i,j] + 1 as well. Is this the same leak or a different one? They don't seem to have much in common. Incidentally, using

a = ones(v.shape[0])
a.dot(v)

instead of np.sum (in the previous example that I sent) does not leak.

Re: superpack. As a fairly technically proficient user, I'm aware that the superpack installs dev builds and that they may therefore be somewhat less reliable. I'm okay with that tradeoff and I don't expect you guys to actually treat the superpack as a stable release -- I also try to report that I'm using the superpack when I report bugs. I sometimes run git versions of ipython, numpy, etc. in order to fiddle with the code and make tiny bug fixes/contributions myself. I don't know the statistics re: superpack users, but there is no link from scipy.org's main install page, so most new users won't find it easily. Fonnesbeck's webpage does say they are dev builds, though only two sentences into the paragraph.

Best, Chris

--
Chris Laumann
Sent with Airmail

On January 31, 2014 at 9:31:40 AM, Julian Taylor (jtaylor.deb...@googlemail.com) wrote:

On 31.01.2014 18:12, Nathaniel Smith wrote:

On Fri, Jan 31, 2014 at 4:29 PM, Benjamin Root (ben.r...@ou.edu) wrote:

Just to chime in here about the SciPy Superpack... this distribution tracks the master branch of many projects, and then puts out releases, on the assumption that master contains pristine code, I guess. I have gone down strange rabbit holes thinking that a particular bug was fixed already, with the user telling me a version number that would confirm that, only to discover that the superpack had actually packaged matplotlib about a month prior to releasing a version.
I will not comment on how good or bad of an idea it is for the Superpack to do that, but I just wanted to make other developers aware of this to keep them from falling down the same rabbit hole.

Wow, that is good to know. Esp. since the web page http://fonnesbeck.github.io/ScipySuperpack/ simply advertises that it gives you things like numpy 1.9 and scipy 0.14, which don't exist. (With some note about dev versions buried in prose a few sentences later.)

Empirically, development versions of numpy have always contained bugs, regressions, and compatibility breaks that were fixed in the released version; and we make absolutely no guarantees about compatibility between dev versions and any release versions. And it sort of has to be that way for us to be able to make progress. But if too many people start using dev versions for daily use, then we and downstream dependencies will have to start adding compatibility hacks and stuff to support those dev versions. Which would be a nightmare for developers and users both. Recommending this build for daily use by non-developers strikes me as dangerous for both users and the wider ecosystem.

While probably not good for the user, I think it's very good for us. This is the second bug I introduced that was found by superpack users. This one might have gone unnoticed into the next release, as it is pretty much impossible to find via tests. Even in valgrind reports it's hard to find, as it's lumped in with all of Python's hundreds of memory-arena still-reachable leaks.

Concerning the fix: it seems that if Python sees tp_free == PyObject_Del/Free, it replaces it with the tp_free of the base type, which is int_free in this case. int_free uses a special allocator for even lower overhead, so we start leaking.
We either need to find the right flag to set for our scalars so it stops doing that, add an indirection so the function pointers don't match, or stop using the object allocator, as we are apparently digging too deep into Python's internal implementation details by doing so.
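Leaks like the ones discussed in this thread are also findable from pure Python (without valgrind) using the stdlib tracemalloc module; a minimal sketch, where workload() is just an illustrative stand-in for whatever snippet is suspected of leaking:

```python
import tracemalloc

import numpy as np


def workload():
    # Stand-in for the suspect snippet; substitute the real repro here.
    P = np.random.randint(0, 2, (30, 13))
    for ai in np.ndindex((2,) * 6):
        np.sum(P[:, :6].dot(ai))


tracemalloc.start()
workload()  # warm-up run, so one-time caches don't look like leaks
before = tracemalloc.take_snapshot()
for _ in range(5):
    workload()
after = tracemalloc.take_snapshot()

# Allocation sites that grew between the snapshots. A genuine leak
# shows up as growth proportional to the number of iterations.
stats = after.compare_to(before, "lineno")
for stat in stats[:5]:
    print(stat)
```

This only sees allocations that go through Python's allocator APIs, so leaks in a C extension that calls malloc directly would still need valgrind, but it neatly avoids the still-reachable arena noise mentioned above.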
Re: [Numpy-discussion] Memory leak?
Current scipy superpack for osx, so probably pretty close to master. So it's a known leak? Hmm. Maybe I'll have to work on a different machine for a bit.

Chris
---
Sent from my iPhone using Mail Ninja

--- Original Message ---

Which version of numpy are you using? There seems to be a leak in the scalar return due to the PyObject_Malloc usage in git master, but it doesn't affect 1.8.0.

On Fri, Jan 31, 2014 at 7:20 AM, Chris Laumann (chris.laum...@gmail.com) wrote:

Hi all-

The following snippet appears to leak memory badly (about 10 MB per execution):

P = randint(0,2,(30,13))
for i in range(50):
    print "\r", i, "/", 50
    for ai in ndindex((2,)*13):
        j = np.sum(P.dot(ai))

If instead you execute (no np.sum call):

P = randint(0,2,(30,13))
for i in range(50):
    print "\r", i, "/", 50
    for ai in ndindex((2,)*13):
        j = P.dot(ai)

there is no leak. Any thoughts? I'm stumped.

Best, Chris

--
Chris Laumann
Sent with Airmail
[Numpy-discussion] Memory leak?
Hi all-

The following snippet appears to leak memory badly (about 10 MB per execution):

P = randint(0,2,(30,13))
for i in range(50):
    print "\r", i, "/", 50
    for ai in ndindex((2,)*13):
        j = np.sum(P.dot(ai))

If instead you execute (no np.sum call):

P = randint(0,2,(30,13))
for i in range(50):
    print "\r", i, "/", 50
    for ai in ndindex((2,)*13):
        j = P.dot(ai)

there is no leak. Any thoughts? I'm stumped.

Best, Chris

--
Chris Laumann
Sent with Airmail
[Numpy-discussion] Memory leak in numpy?
Hi all-

I think I just found a memory leak in numpy, or maybe I just don't understand generators. Anyway, the following snippet will quickly eat a ton of RAM:

P = randint(0,2, (20,13))
for i in range(50):
    for ai in ndindex((2,)*13):
        j = P.dot(ai)

If you replace the last line with something like j = ai, the memory leak goes away. I'm not exactly sure what's going on, but the .dot seems to be causing the memory taken by the tuple ai to be held. This devours RAM in python 2.7.5 (OS X Mavericks default, I believe), numpy version 1.8.0.dev-3084618. I'm upgrading to the latest Superpack (numpy 1.9) right now, but I somehow doubt this behavior will change.

Any thoughts?

Best, Chris

--
Chris Laumann
Sent with Airmail
Re: [Numpy-discussion] Bitwise operations and unsigned types
Good morning all -- didn't realize this would generate quite such a buzz. To answer a direct question, I'm using the github master.

A few thoughts (from a fairly heavy numpy user for numerical simulations and analysis): the current behavior is confusing and (as far as I can tell) undocumented. Scalars act up only if they are big:

In [152]: np.uint32(1) & 1
Out[152]: 1

In [153]: np.uint64(1) & 1
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/Users/claumann/<ipython-input-153-191a0b5fe216> in <module>()
----> 1 np.uint64(1) & 1

TypeError: ufunc 'bitwise_and' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

But arrays don't seem to mind:

In [154]: ones(3, dtype=np.uint32) & 1
Out[154]: array([1, 1, 1], dtype=uint32)

In [155]: ones(3, dtype=np.uint64) & 1
Out[155]: array([1, 1, 1], dtype=uint64)

As you mentioned, explicitly casting 1 to np.uint makes the scalar case above work, but I don't understand why this is unnecessary for the arrays. I could understand a general argument that type-casting rules should always be the same independent of the underlying ufunc, but I'm not sure if that is sufficiently smart. Bitwise ops probably really ought to treat nonnegative python integers as unsigned.

> I disagree, promoting to object kind of destroys the whole idea of bitwise operations. I think we *fixed* a bug.

> That is an interesting point of view. I could see that point of view. But, was this discussed as a bug prior to this change occurring?

I'm not sure what 'promoting to object' constitutes in the new numpy, but just a small thought: I can think of two reasons to go to the trouble of using bitfields over more pythonic (higher-level) representations: speed/memory overhead, and interfacing with external hardware/software. For me, it's mostly the former -- I've already implemented this program once using a much more pythonic approach, but it just has too much memory overhead to scale to where I want it.
If a coder goes to the trouble of using bitfields, there's probably a good reason they wanted a lower-level representation in which bitfield ops happen in parallel as integer operations. But, what do you mean that bitwise operations are destroyed by promotion to objects?

Best, Chris

On Apr 6, 2012, at 5:57 AM, Nathaniel Smith wrote:

On Fri, Apr 6, 2012 at 7:19 AM, Travis Oliphant (tra...@continuum.io) wrote:

That is an interesting point of view. I could see that point of view. But, was this discussed as a bug prior to this change occurring? I just heard from a very heavy user of NumPy that they are nervous about upgrading because of little changes like this one. I don't know if this particular issue would affect them or not, but I will re-iterate my view that we should be very careful with these kinds of changes.

I agree -- these changes make me very nervous as well, especially since I haven't seen any short, simple description of what changed or what the rules actually are now (comparable to the old "scalars do not affect the type of arrays"). But I also want to speak up in favor in one respect, since real-world data points are always good. I had some code that did:

def do_something(a):
    a = np.asarray(a)
    a -= np.mean(a)
    ...

If someone happens to pass in an integer array, then this is totally broken -- np.mean(a) may be non-integral, and in 1.6, numpy silently discards the fractional part and performs the subtraction anyway, e.g.:

In [4]: a
Out[4]: array([0, 1, 2, 3])

In [5]: a -= 1.5

In [6]: a
Out[6]: array([-1, 0, 0, 1])

The bug was discovered when Skipper tried running my code against numpy master, and it errored out on the -=. So Mark's changes did catch one real bug that would have silently caused completely wrong numerical results!
https://github.com/charlton/charlton/commit/d58c72529a5b33d06b49544bc3347c6480dc4512

- Nathaniel
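For anyone hitting the scalar TypeError discussed in this thread: the explicit-cast workaround mentioned above looks like the following (the behavior of a uint64 scalar mixed with a bare Python int has varied across numpy versions, so casting both operands is the portable spelling):

```python
import numpy as np

# Casting both operands to the same unsigned dtype sidesteps the
# "safe" casting question entirely.
x = np.uint64(5) & np.uint64(3)
print(x, x.dtype)  # 1 uint64

# The same works elementwise for arrays of bitfields.
flags = np.array([0b1010, 0b0110], dtype=np.uint64)
mask = np.uint64(0b0010)
print(flags & mask)
```

The result stays uint64 throughout, so the sign bit never enters the picture.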
[Numpy-discussion] Bitwise operations and unsigned types
Hi all-

I've been trying to use numpy arrays of ints as arrays of bit fields, and mostly this works fine. However, it seems that the bitwise_* ufuncs do not support unsigned integer dtypes:

In [142]: np.uint64(5) & 3
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/Users/claumann/<ipython-input-142-65e3301d5d07> in <module>()
----> 1 np.uint64(5) & 3

TypeError: ufunc 'bitwise_and' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

This seems odd, as unsigned ints are the most natural bitfields I can think of -- the sign bit is just confusing when doing bit manipulation. Python itself of course doesn't make much of a distinction between ints, longs, unsigned, etc.

Is this a bug?

Thanks, Chris

--
Chris Laumann
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)