[Numpy-discussion] How to concatenate two arrays without duplicating memory?
Hello, Let's say we have two arrays A and B of shapes (1, 2000) and (1, 4000). If I do C = numpy.concatenate((A, B), axis=1), I get a new array of dimension (1, 6000) with duplication of memory. I am looking for a way to have a non-contiguous array C in which the left (1, 2000) elements point to A and the right (1, 4000) elements point to B. Any hint will be appreciated. Thanks, Armando
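[A minimal sketch of the copying behaviour being asked about; the shapes are just the example's, nothing here is specific to them:

import numpy as np

A = np.zeros((1, 2000))
B = np.zeros((1, 4000))
C = np.concatenate((A, B), axis=1)  # allocates a fresh (1, 6000) buffer

A[0, 0] = 99.0
print(C[0, 0])  # still 0.0: C holds a copy of A's data, not a view]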
Re: [Numpy-discussion] Question about np.savez
On Wednesday 02 September 2009 05:50:57 Robert Kern wrote: On Tue, Sep 1, 2009 at 21:11, Jorge Scandaliaris jorgesmbox...@yahoo.es wrote: David Warde-Farley dwf at cs.toronto.edu writes: > If you actually want to save multiple arrays, you can use savez('fname', *[a,b,c]) and they will be accessible under the names arr_0, arr_1, etc., and a list of these names is in the 'files' attribute on the NpzFile object. To retrieve your list of arrays when you load, you can just do mynewlist = [data[arrname] for arrname in data.files] Thanks for the tip. I have realized, though, that I might need more flexibility than just the ability to save ndarrays. The data I am dealing with is best kept in a hierarchical way (I could represent the structure with ndarrays also, but I think it would be messy and difficult). I am having a look at h5py to see if it fulfills my needs. I know there is pytables, too, but from a quick look it seems h5py is simpler. Am I right on this? I also get a nice side-effect: the data would be readable by the de-facto standard software used by most people in my field. > If there is a particular format that uses HDF5 that you are trying to replicate, h5py is the clear answer. However, PyTables will, by and large, make files that are entirely readable by other HDF5 libraries when you just use the subset of features that is supported by HDF5-proper. For example, tables and arrays work just fine. What won't be supported by non-PyTables libraries are things like dataset attributes which are pickled objects. Your non-PyTables HDF5 apps will see some extraneous attributes on the arrays and tables, but those are typically not necessary for interpretation. Most of these 'extraneous' attributes are derived from the use of the high-level HDF5 interface (http://www.hdfgroup.org/HDF5/doc/HL/). If they bother you, you can get rid of them by setting the parameter ``PYTABLES_SYS_ATTRS`` to false (either in tables/parameters.py or passing it to `tables.openFile`). -- Francesc Alted
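[A short round-trip sketch of the savez/load pattern described above; the filename is arbitrary, and note that the order of data.files may not match the save order in every release:

import numpy as np

a, b, c = np.arange(3), np.zeros(4), np.ones(5)
np.savez('fname', *[a, b, c])          # stored as arr_0, arr_1, arr_2

data = np.load('fname.npz')
mynewlist = [data[name] for name in data.files]]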
Re: [Numpy-discussion] How to concatenate two arrays without duplicating memory?
On Wed, Sep 02, 2009 at 09:40:49AM +0200, V. Armando Solé wrote: > I am looking for a way to have a non contiguous array C in which the left (1, 2000) elements point to A and the right (1, 4000) elements point to B. You cannot in the numpy memory model. The numpy memory model defines an array as something that has regular strides to jump from one element to the next. Gaël
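[A small illustration of the one-fixed-stride-per-axis rule Gaël describes; the dtype and shape are arbitrary:

import numpy as np

A = np.arange(12, dtype=np.float64).reshape(3, 4)
print(A.strides)  # (32, 8): one fixed byte step per axis
# element (i, j) lives at base + i*strides[0] + j*strides[1];
# two separately allocated buffers cannot satisfy this for one array]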
Re: [Numpy-discussion] A faster median (Wirth's method)
Hello Sturla, I had a quick look at your code. Looks fine. A few notes... In select you should replace numpy with np. In _median, how can you, if n==2, use s[k] if s is not defined? What if n==1? Also, I think when returning an empty array, it should be of the same type you would get in the other cases. You could replace _median with the following. Best, Luca

def _median(x, inplace):
    assert(x.ndim == 1)
    n = x.shape[0]
    if n > 2:
        k = n >> 1
        s = select(x, k, inplace=inplace)
        if n & 1:
            return s[k]
        else:
            return 0.5 * (s[k] + s[:k].max())
    elif n == 0:
        return np.empty(0, dtype=x.dtype)
    elif n == 2:
        return 0.5 * (x[0] + x[1])
    else:  # n == 1
        return x[0]
Re: [Numpy-discussion] How to concatenate two arrays without duplicating memory?
As Gaël pointed out, you cannot create A, B and then C as the concatenation of A and B without duplicating the vectors. > I am looking for a way to have a non contiguous array C in which the left (1, 2000) elements point to A and the right (1, 4000) elements point to B. But you can still re-link A to the left elements and B to the right ones afterwards by using views into C:

C = numpy.concatenate((A, B), axis=1)
A, B = C[:, :2000], C[:, 2000:]

Best, Luca
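[A quick check, under the same example shapes, that the re-linked names really are views sharing C's memory:

import numpy as np

A = np.ones((1, 2000))
B = np.ones((1, 4000))
C = np.concatenate((A, B), axis=1)
A, B = C[:, :2000], C[:, 2000:]  # re-link the names to views into C

A[0, 0] = 42.0
print(C[0, 0])      # 42.0: A now writes through to C
print(A.base is C)  # True: A is a view, not a copy]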
Re: [Numpy-discussion] How to concatenate two arrays without duplicating memory?
Gael Varoquaux wrote: > You cannot in the numpy memory model. The numpy memory model defines an array as something that has regular strides to jump from one element to the next. I expected problems in the suggested case (concatenating columns), but I did not expect the problem would be so severe as to affect the case of row concatenation. I guess I am still considering a 2D array as an array of pointers, and that does not apply to numpy arrays. Thanks for the info. Armando
Re: [Numpy-discussion] Question about np.savez
Thanks David, Robert and Francesc for the comments and suggestions. It's nice having options, but that also means one has to choose ;) I will have a closer look at pytables. The thing that got me scared about it was the word database: I have close to zero experience using or, even worse, designing databases. Maybe I am wrong, and the way I was considering structuring the data could be considered a rudimentary database. I have the feeling this is turning into killing a fly with a cannon... Jorge
Re: [Numpy-discussion] How to concatenate two arrays without duplicating memory?
Citi, Luca wrote: > As Gaël pointed out, you cannot create A, B and then C as the concatenation of A and B without duplicating the vectors. But you can still re-link A to the left elements and B to the right ones afterwards by using views into C. Thanks for the hint. In my case the A array is already present and the contents of the B array can be read from disk. At least I have two workarounds making use of your suggested re-linking:
- create the C array, copy the contents of A into it, and read the contents of B directly into C; this duplicates the memory of A for some time.
- save the array A to disk, create the array C, read the contents of A and B into it, and re-link A and B; no duplication, but ugly.
Thanks, Armando
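[A sketch of the first workaround, with a hypothetical raw float64 file standing in for the real source of B; the filename and dtype are illustrative assumptions:

import numpy as np

A = np.random.rand(1, 2000)
C = np.empty((1, 6000))
C[:, :2000] = A        # A's memory exists twice only during this window
A = C[:, :2000]        # re-link A; the old buffer can now be freed
with open('b_data.bin', 'rb') as f:  # assumed file holding B's raw float64 data
    C[:, 2000:] = np.fromfile(f, dtype=np.float64, count=4000).reshape(1, 4000)
B = C[:, 2000:]]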
Re: [Numpy-discussion] Question about np.savez
On Wednesday 02 September 2009 11:20:55 Jorge Scandaliaris wrote: > I will have a closer look at pytables. The thing that got me scared about it was the word database. [...] Well, I agree that the term 'database' is perhaps a bit scary, and I don't actually like that term being applied to PyTables -- I always like to say that PyTables is not a database competitor, but rather a companion. Just for completeness, here is my own comparison of PyTables and h5py: http://www.pytables.org/moin/FAQ#HowdoesPyTablescomparewiththeh5pyproject.3F > I have the feeling this is turning into killing a fly with a cannon... Maybe. But if you are going to keep much data on-disk, it can be a nice advantage in the medium term. HTH, -- Francesc Alted
Re: [Numpy-discussion] How to concatenate two arrays without duplicating memory?
Hi, depending on your needs you might be interested in my minimal implementation of what I call a mock-ndarray. I needed something like this to analyze higher-dimensional stacks of 2d images, and what I needed was mostly the indexing features of nd-arrays. A mock array is initialized with a list of nd-arrays; the result is a mock array having one additional dimension in front:

>>> a = N.arange(9)
>>> b = N.arange(9)
>>> a.shape = 3, 3
>>> b.shape = 3, 3
>>> c = F.mockNDarray(a, b)
>>> c.shape
(2, 3, 3)
>>> c[1, 2, 2]
8

No memory copy is done. I put the module file here: http://drop.io/kpu4bib/asset/mockndarray-py Otherwise this is part of my (BSD) Priithon image analysis framework. Regards, Sebastian Haase

On Wed, Sep 2, 2009 at 11:31 AM, V. Armando Solé wrote: > In my case the A array is already present and the contents of the B array can be read from disk. [...]
Re: [Numpy-discussion] snow leopard and Numeric
> > Is there a way to constrain an old-style compilation just to make the code work? I have similar problems with other old pieces of code. > Use -arch i686 in the CFLAGS and LDFLAGS. I think. Unfortunately, it seems not to have any effect. I'll try something else. Thanks anyway. Stefano
Re: [Numpy-discussion] Fwd: GPU Numpy
Hi everyone, In case anyone is interested, I just set up a Google group to discuss GPU-based simulation for our Python neural simulator Brian: http://groups.google.fr/group/brian-on-gpu Our simulator relies heavily on NumPy. I would be very happy if the GPU experts here would like to share their expertise. Best, Romain

Romain Brette wrote: Sturla Molden wrote: > Thus, here is my plan:
1. a special context-manager class
2. immutable arrays inside the with statement
3. lazy evaluation: expressions build up a parse tree
4. dynamic code generation
5. evaluation on exit
There seems to be some similarity with what we want to do to accelerate our neural simulations (briansimulator.org), as described here: http://brian.svn.sourceforge.net/viewvc/brian/trunk/dev/BEPs/BEP-9-Automatic%20code%20generation.txt?view=markup (by the way, BEP is Brian Enhancement Proposal). The speed-up factor we got in our experimental code with GPU is very substantial when there are many neurons (= large vectors, e.g. 10 000 elements), even when operations are simple. Romain
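[For readers unfamiliar with step 3 of the quoted plan, here is a toy sketch, not taken from either project, of how expressions can build a parse tree that is only evaluated on demand:

import numpy as np

class Lazy(object):
    # records operations instead of executing them
    def __init__(self, op, *args):
        self.op, self.args = op, args
    def __add__(self, other):
        return Lazy('add', self, other)
    def __mul__(self, other):
        return Lazy('mul', self, other)
    def evaluate(self):
        args = [a.evaluate() if isinstance(a, Lazy) else a for a in self.args]
        ops = {'leaf': lambda v: v, 'add': np.add, 'mul': np.multiply}
        return ops[self.op](*args)

x = Lazy('leaf', np.arange(5.0))
y = (x + x) * x       # builds a tree; nothing is computed yet
print(y.evaluate())   # [  0.   2.   8.  18.  32.]]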
Re: [Numpy-discussion] A faster median (Wirth's method)
Sturla Molden wrote: > Dag Sverre Seljebotn skrev: > > Nitpick: This will fail on large arrays. I guess numpy.npy_intp is the right type to use in this case? > By the way, here is a more polished version, does it look ok? > http://projects.scipy.org/numpy/attachment/ticket/1213/generate_qselect.py > http://projects.scipy.org/numpy/attachment/ticket/1213/quickselect.pyx I didn't look at the algorithm, but the types look OK (except for the gil, as you say). Comments: a) Is the cast to numpy.npy_intp really needed? I'm pretty sure shape is defined as numpy.npy_intp*. b) If you want higher performance with contiguous arrays (which occur a lot, as inplace=False is the default I guess) you can do np.ndarray[T, ndim=1, mode="c"] to tell the compiler the array is contiguous. That doubles the number of function instances though... Cython needs something like Java's generics, by the way :-) Yes, we all long for that. It will come as soon as somebody volunteers, I suppose -- it shouldn't be all that difficult, but I don't think any of the existing devs will be up for it any time soon. Dag Sverre
[Numpy-discussion] Fastest way to parse a specific binary file
Hello, I want to be able to parse a binary file which holds information regarding the experiment configuration and, obviously, data. Both the configuration and data sections are variable-length. A chunk of this data is shown below (after a binary read operation):

'\x00\...@\x00$\x00\x02\x00\x12\x00\xff\x00\x00\x00u\xaa\xfa\xffd\x00\x08\x00\x01\x00\x08\x00\xff\x00\x00\x00u\xaa\xfb\xffl\x00\xab\x00\x01\x00\xab\x00\xff\x00\x00\x00u\xaa\xe7\x03\x17\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00u\xaa\xd9\x07\x04\x00\x02\x00\r\x00\x06\x00\x03\x00\x00\x00\x01\x00\x00\x00\xd9\x07\x04\x00\x02\x00\r\x00\x06\x00\x03\x00\x00\x00\x01\x00\x00\x00prj.300\x00; Version = 1\n', 'ProjectName = PME1 2009 King Air N825ST\n', 'FlightId = \n', 'AircraftType = WMI King Air 200\n', 'AircraftId = N825ST\n', 'OperatorName = Weather Modification Inc.\n', 'Comments = \n', '\x00\x00@

In binary form the file is 1.3 MB, and when written to a txt file it expands to 3.7 MB, totalling approximately 4 million characters. When fully processed (with an IDL code) it produces 86 separate configuration files and 46 ascii files for data, covering about 10-15 different instruments in various combinations plus sampling rates. I attempted to use the re module; however, the time it takes to parse the file is much longer than I expected. What would be the wisest and fastest way to tackle this issue? Upon successful re-construction of the data and metadata, I am planning to use a more modular structure like HDF5 or netCDF4 for easy data storage and analyses. Thank you. -- Gökhan
Re: [Numpy-discussion] A faster median (Wirth's method)
Dag Sverre Seljebotn skrev: > a) Is the cast to numpy.npy_intp really needed? I'm pretty sure shape is defined as numpy.npy_intp*. I don't know Cython internals in detail, but you do, so I take your word for it. I thought shape was a tuple of Python ints. > b) If you want higher performance with contiguous arrays (which occur a lot as inplace=False is default I guess) you can do np.ndarray[T, ndim=1, mode="c"] to tell the compiler the array is contiguous. That doubles the number of function instances though... Thanks. I could either double the number of specialized select functions, or I could make a local copy using numpy.ascontiguousarray in the select function. Quickselect touches the discontiguous array on average 2*n times, whereas numpy.ascontiguousarray touches it n times (but in order). Then there is the question of cache use: contiguous arrays are the friendlier case, and numpy.ascontiguousarray is friendlier than quickselect. Also, if quickselect is not done inplace (the common case for medians), we always have contiguous arrays, so mode="c" is almost always wanted. And when quickselect is done inplace, we usually have a contiguous input. This is also why I used a C pointer instead of your buffer syntax in the first version. Then I changed my mind, not sure why. So I'll try with a local copy first. I don't think we want close to a megabyte of Cython-generated gibberish C just for the median. Sturla Molden
Re: [Numpy-discussion] A faster median (Wirth's method)
Citi, Luca skrev: > Hello Sturla, In _median how can you, if n==2, use s[k] if s is not defined? What if n==1? That was a typo. > Also, I think when returning an empty array, it should be of the same type you would get in the other cases. Currently median returns numpy.nan for empty input arrays. I'll do that instead. I want it to behave exactly as the current implementation, except for the sorting. Sturla
Re: [Numpy-discussion] How to concatenate two arrays without duplicating memory?
V. Armando Solé skrev: > I am looking for a way to have a non contiguous array C in which the left (1, 2000) elements point to A and the right (1, 4000) elements point to B. Any hint will be appreciated. If you know in advance that A and B are going to be duplicated, you can use views:

C = np.zeros((1, 6000))
A = C[:, :2000]
B = C[:, 2000:]

Now C is A and B concatenated horizontally. If you can't do this, you could write the data to a temporary file and read it back, but it would be slow. Sturla
Re: [Numpy-discussion] Fastest way to parse a specific binary file
On Wed, Sep 2, 2009 at 09:38, Gökhan Sever wrote: > I want to be able to parse a binary file which holds information regarding the experiment configuration and data. Both the configuration and data sections are variable-length. [...] > What would be the wisest and fastest way to tackle this issue? Are there fixed delimiters? Like '\x00\...@\x00' perhaps? It might be faster to search for those using .find() instead of regexes. Without more information about how the file format gets split up, I'm not sure we can make good suggestions. -- Robert Kern
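[A sketch of the .find() approach Robert suggests; the marker value is illustrative, and whether a fixed delimiter exists at all depends on the format:

def find_all(data, marker):
    # collect every offset at which marker occurs
    hits, i = [], data.find(marker)
    while i != -1:
        hits.append(i)
        i = data.find(marker, i + 1)
    return hits

with open('raw.bin', 'rb') as f:  # hypothetical input file
    data = f.read()
offsets = find_all(data, b'\x00\x00@\x00$\x00\x02')]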
Re: [Numpy-discussion] How to concatenate two arrays without duplicating memory?
Sebastian Haase skrev: > A mockarray is initialized with a list of nd-arrays. The result is a mock array having one additional dimension in front. This is important, because often in the case of 'concatenation' a real concatenation is not needed. But then there is a common tool called Matlab, which unlike Python has no concept of lists but makes numerical programmers think they do: C = [A, B] is a horizontal concatenation in Matlab. Too much exposure to Matlab cripples the mind easily. Sturla
Re: [Numpy-discussion] Fastest way to parse a specific binary file
Gökhan Sever skrev: > What would be the wisest and fastest way to tackle this issue? Get the format, read the binary data directly, and skip the ascii/regex part. I sometimes use recarrays with formatted binary data: just construct a dtype and use numpy.fromfile to read. That works when the binary file stores C structs written successively. Sturla Molden
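[A minimal sketch of the dtype-plus-fromfile pattern; the field names, types, and filename are invented and must be replaced with the actual struct layout (beware of C struct padding as well):

import numpy as np

rec_dtype = np.dtype([('time', np.uint32),
                      ('temp', np.float32),
                      ('flag', np.uint8)])
records = np.fromfile('data.bin', dtype=rec_dtype)  # one read, no parsing
print(records['temp'].mean())]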
[Numpy-discussion] np.bitwise_and.identity
Hello, I know I am splitting hairs, but should not np.bitwise_and.identity be -1 instead of 1? I mean, something with all the bits set? I am checking whether all elements of a vector 'v' have a certain bit 'b' set:

if np.bitwise_and.reduce(v) & (1 << b):
    # do something

If v is empty, the expression is true for b == 0 and false otherwise. In fact np.bitwise_and.identity is 1. I like being able to use np.bitwise_and.reduce because it is many times faster than (v & (1 << b)).all() (it does not create the temporary vector). Of course there are workarounds, but I was wondering if there is a reason for this behaviour. Best, Luca
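[One possible workaround, assuming the identity stays 1: special-case the empty vector so the reduction behaves as if the identity were all-ones (-1); the function name is illustrative:

import numpy as np

def all_have_bit(v, b):
    # acts like bitwise_and.reduce with identity -1
    if v.size == 0:
        return True  # vacuously true for an empty vector
    return bool(np.bitwise_and.reduce(v) & (1 << b))]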
Re: [Numpy-discussion] np.bitwise_and.identity
On Wed, Sep 2, 2009 at 11:11, Citi, Luca wrote: > I know I am splitting hairs, but should not np.bitwise_and.identity be -1 instead of 1? I mean, something with all the bits set? Probably. However, the .identity parts of ufuncs were designed mostly to support multiply and add, so .identity is restricted to 0, 1, or nothing currently. It will take some effort to change that. In the C code, the sentinel value for no identity is -1, alas. -- Robert Kern
Re: [Numpy-discussion] adaptive interpolation on a regular 2d grid
Robert Kern robert.kern at gmail.com writes: > Looks good! Where can we get the code? Can this be specialized for 1D functions? Re code: sure, I'll be happy to post it if anyone points me to a real test case or two, to help me understand the envelope -- 100^2 to 500^2 grids? (Splines on regular grids are fast and robust, hard to beat.) Re 1d: I have an old version using 2-point, 2-slope splines; overkill, I will trim it. (Is there a sandbox or wiki of interpolation test cases, not images?)
Re: [Numpy-discussion] adaptive interpolation on a regular 2d grid
On Wed, Sep 2, 2009 at 11:33, denis bzowy wrote: > Re code: sure, I'll be happy to post it if anyone points me to a real test case or two [...] I have some test cases here: http://svn.scipy.org/svn/scikits/trunk/delaunay/scikits/delaunay/testfuncs.py They are meant to test scattered data interpolation. They aren't going to exercise your adaptive interpolation very hard. -- Robert Kern
Re: [Numpy-discussion] How to concatenate two arrays without duplicating memory?
I forgot to mention I also support transpose. -S.

On Wed, Sep 2, 2009 at 5:23 PM, Sturla Molden wrote: [...]
Re: [Numpy-discussion] Fastest way to parse a specific binary file
On Wed, Sep 2, 2009 at 10:11 AM, Robert Kern wrote: > Are there fixed delimiters? Like '\x00\...@\x00' perhaps? It might be faster to search for those using .find() instead of regexes. [...] Fixed delims... That is what I used to parse the metadata with a regex, something like:

r = re.compile('\0;.+?\...@\0\$', re.DOTALL)

which extracts the portions that I am interested in. However, I have yet to figure out how to parse the separate data streams; I couldn't find a way to see which data block goes with which device. I put the test binary file I am using at: http://drop.io/1plh5rt -- Gökhan
Re: [Numpy-discussion] Fastest way to parse a specific binary file
On Wed, Sep 2, 2009 at 10:34 AM, Sturla Molden wrote: > Get the format, read the binary data directly, and skip the ascii/regex part. I sometimes use recarrays with formatted binary data: just construct a dtype and use numpy.fromfile to read. [...] How do I use recarrays with variable-length data fields as well as metadata? Eventually I will record the data with numpy arrays, but I am not sure how to utilize recarrays in the first stage. -- Gökhan
Re: [Numpy-discussion] A faster median (Wirth's method)
On Wed, 2 Sep 2009, Dag Sverre Seljebotn wrote: > > Cython needs something like Java's generics by the way :-) > Yes, we all long for that. It will come as soon as somebody volunteers I suppose -- it shouldn't be all that difficult, but I don't think any of the existing devs will be up for it any time soon. Danilo's C++ project has some baby steps in that direction, though it'll need to be expanded quite a bit to handle this. - Robert
Re: [Numpy-discussion] Fastest way to parse a specific binary file
On Wed, Sep 2, 2009 at 11:53, Gökhan Sever wrote: > How to use recarrays with variable-length data fields as well as metadata? You don't. -- Robert Kern
Re: [Numpy-discussion] Fastest way to parse a specific binary file
If I understand the problem... if you are 100% sure that ', ' only occurs between fields and never within, you can use the 'split' method of the string, which could be faster than a regexp in this simple case.
Re: [Numpy-discussion] np.bitwise_and.identity
Thank you, Robert, for the quick reply. I just saw the line #define PyUFunc_None -1 in the ufuncobject.h file. It is always the same: you choose a sentinel thinking that it doesn't conflict with any possible value, and you later find there is one such case. As said, it is not a big deal; I wouldn't spend time on it. Best, Luca
Re: [Numpy-discussion] Fastest way to parse a specific binary file
On Wed, Sep 2, 2009 at 12:01 PM, Citi, Luca wrote: > If I understand the problem... if you are 100% sure that ', ' only occurs between fields and never within, you can use the 'split' method of the string. But it is not possible to extract a pattern such as within a field. A construct like in regex starting with a ; till the end of the section. ?? -- Gökhan
Re: [Numpy-discussion] Fastest way to parse a specific binary file
On Wed, Sep 2, 2009 at 12:04 PM, Robert Kern wrote: > > How to use recarrays with variable-length data fields as well as metadata? > You don't. I was just confirming my guess :) The data in the binary file was written in a variable-length fashion. Although each chunk has a specific starting indication like \x00\x...@\x00$\x00\x02, the amount of data in each section varies depending on what was in the written stream. How does your find suggestion work? It just returns the location of the first occurrence. -- Gökhan
Re: [Numpy-discussion] Fastest way to parse a specific binary file
On Wed, Sep 2, 2009 at 12:27, Gökhan Sever wrote: > But it is not possible to extract a pattern such as within a field. A construct like in regex starting with a ; till the end of the section. ?? I can't parse that sentence. Can you describe the format in a little more detail? Or point to documentation of the format? Or the IDL code that parses it? -- Robert Kern
Re: [Numpy-discussion] Fastest way to parse a specific binary file
On Wed, Sep 2, 2009 at 12:33, Gökhan Sever wrote: > How does your find suggestion work? It just returns the location of the first occurrence. http://docs.python.org/library/stdtypes.html#str.find str.find(sub[, start[, end]]): Return the lowest index in the string where substring sub is found, such that sub is contained in the range [start, end]. Optional arguments start and end are interpreted as in slice notation. Return -1 if sub is not found. But perhaps you should profile your code to see where it is actually taking up the time. Regexes on 1.3 MB of data should be quite fast.

In [21]: marker = '\x00\x...@\x00$\x00\x02'
In [22]: block = marker + '\xde\xca\xfb\xad' * ((1024-8) // 4)
In [23]: data = int(round(1.3 * 1024)) * block
In [24]: import re
In [25]: r = re.compile(re.escape(marker))
In [26]: %time r.findall(data)
CPU times: user 0.01 s, sys: 0.00 s, total: 0.01 s
Wall time: 0.01 s

-- Robert Kern
Re: [Numpy-discussion] Fastest way to parse a specific binary file
On Wed, Sep 2, 2009 at 12:29 PM, Robert Kern wrote: > Can you describe the format in a little more detail? Or point to documentation of the format? I put the reference manual at: http://drop.io/1plh5rt The first few pages describe the data format they use. -- Gökhan
Re: [Numpy-discussion] Fastest way to parse a specific binary file
On Wed, Sep 2, 2009 at 12:29 PM, Robert Kern wrote: > Or the IDL code that parses it? The IDL processing code is at: http://adpaa.svn.sourceforge.net/viewvc/adpaa/trunk/src/Level1/process_raw/ It is part of ADPAA - the Aircraft Data Processing and Analysis project. -- Gökhan
Re: [Numpy-discussion] Fastest way to parse a specific binary file
On Wed, Sep 2, 2009 at 12:46 PM, Robert Kern wrote: > But perhaps you should profile your code to see where it is actually taking up the time. Regexes on 1.3 MB of data should be quite fast. This is what I have been using. It's not returning exactly what I want, but it is very close; besides, it is slow:

I[52]: mypattern = re.compile('\0\0\1\0.+?\...@\0\$', re.DOTALL)
I[53]: res = mypattern.findall(ss)
I[54]: len(res)
O[54]: 95
I[55]: %time mypattern.findall(ss);
CPU times: user 9.14 s, sys: 0.00 s, total: 9.14 s
Wall time: 9.16 s
I[57]: res[0]
O[57]: '\x00\x00\x01\x00\x00\x00\xd9\x07\x04\x00\x02\x00\r\x00\x06\x00\x03\x00\x00\x00\x01\x00\x00\x00*prj.300*\x00; Version = 1\nProjectName = PME1 2009 King Air N825ST\nFlightId = \nAircraftType = WMI King Air 200\nAircraftId = N825ST\nOperatorName = Weather Modification Inc.\nComments = \n\x00\x00@ \x00$'

I need the part starting with the bold-typed section (*prj.300*) and running to the end of the section. I need the bold part because I can construct file names from it and write the following content into them. Oh, and when it works the resulting search should return 86 occurrences. -- Gökhan
Re: [Numpy-discussion] Fastest way to parse a specific binary file
On Wed, Sep 2, 2009 at 13:28, Gökhan Sever wrote: > I put the reference manual at: http://drop.io/1plh5rt The first few pages describe the data format they use. Ah. The fields are *not* delimited by a fixed value, so regexes are no help to you for pulling out the information you need, except perhaps later to parse the text fields. I think you are also getting spurious results because your regex matches things inside data fields. Instead, you have a header containing the length of the data field, followed by the data field. Create a structured dtype that corresponds to the DataDir struct on page 15. Note that unsigned int there is actually a numpy.uint16, not a uint32.

dt = np.dtype([('tagNumber', np.uint16),
               ('dataOffset', np.uint16),
               ('numberBytes', np.uint16),
               ('samples', np.uint16),
               ('bytesPerSample', np.uint16),
               ('type', np.uint8),
               ('param1', np.uint8),
               ('param2', np.uint8),
               ('param3', np.uint8),
               ('address', np.uint16)])

Now read dt.itemsize bytes from the file and use

header = np.fromstring(f.read(dt.itemsize), dt)[0]

to get a record object that corresponds to the header. Use the dataOffset and numberBytes fields to extract the actual data bytes from the file. For example, if we go to the second header field:

In [28]: f.seek(dt.itemsize, 0)
In [29]: header = np.fromstring(f.read(dt.itemsize), dt)[0]
In [30]: header
Out[30]: (65530, 100, 8, 1, 8, 255, 0, 0, 0, 43605)
In [31]: f.seek(header['dataOffset'], 0)
In [32]: f.read(header['numberBytes'])
Out[32]: 'prj.300\x00'

There are still some semantic issues you need to work out: there are multiple buffers per file, and the dataOffsets are relative to the start of the buffer, not the file. -- Robert Kern
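[Building on Robert's dtype, a hedged sketch of walking the directory entries at the start of one buffer; the end-of-directory sentinel value is an assumption to verify against the format manual:

import numpy as np

dt = np.dtype([('tagNumber', np.uint16), ('dataOffset', np.uint16),
               ('numberBytes', np.uint16), ('samples', np.uint16),
               ('bytesPerSample', np.uint16), ('type', np.uint8),
               ('param1', np.uint8), ('param2', np.uint8),
               ('param3', np.uint8), ('address', np.uint16)])

def read_headers(buf):
    # buf holds the raw bytes of one buffer; offsets are buffer-relative
    headers, pos = [], 0
    while pos + dt.itemsize <= len(buf):
        h = np.fromstring(buf[pos:pos + dt.itemsize], dt)[0]
        headers.append(h)
        if h['tagNumber'] == 0xFFFF:  # assumed end-of-directory sentinel
            break
        pos += dt.itemsize
    return headers]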
Re: [Numpy-discussion] A faster median (Wirth's method)
On Mon, Aug 31, 2009 at 9:06 PM, Sturla Molden wrote: > We recently had a discussion regarding an optimization of NumPy's median to average O(n) complexity. After some searching, I found out there is a selection algorithm competitive in speed with Hoare's quick select. It has the advantage of being a lot simpler to implement. [...] > Chad, you can continue to write quick select using NumPy's C quick sort in numpy/core/src/_sortmodule.c.src. When you are done, it might be about 10% faster than this. :-) I was sick for a bit last week, so got stalled on my version, but I'll be working on it this weekend. I'm going for a more general partition function that could have slightly more general use cases than just a median. Nevertheless, it's good to see there could be several options, hopefully at least one of which can be put into numpy. By the way, as far as I can tell, the above algorithm is exactly the same idea as a non-recursive Hoare (i.e. quicksort) selection: do the partition, then only proceed to the sub-partition that must contain the nth element. My version is a bit more general, allowing partitioning on a range of elements rather than just one, but the concept is the same. The numpy quicksort already does non-recursive sorting. I'd also like to, if possible, have a specialized 2D version, since image median filtering is one of my interests, and the C version works on 1D (raveled) arrays only. -C
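[For concreteness, a plain-Python sketch of the non-recursive Hoare selection idea described above: partition, then descend only into the side containing index k. This is illustrative, not the numpy C implementation:

import numpy as np

def quickselect(x, k):
    x = x.copy()
    lo, hi = 0, x.shape[0] - 1
    while lo < hi:
        pivot = x[(lo + hi) // 2]
        i, j = lo, hi
        while i <= j:                 # Hoare partition around pivot
            while x[i] < pivot: i += 1
            while x[j] > pivot: j -= 1
            if i <= j:
                x[i], x[j] = x[j], x[i]
                i += 1; j -= 1
        if k <= j:
            hi = j                    # k-th element is in the low side
        elif k >= i:
            lo = i                    # k-th element is in the high side
        else:
            break                     # j < k < i: x[k] is already in place
    return x[k]

a = np.random.rand(101)
assert quickselect(a, 50) == np.sort(a)[50]]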
[Numpy-discussion] numpy core dump on linux
This one line causes python to core dump on linux:

numpy.lexsort([numpy.array(['-','-','-','-','-','-','-','-','-','-','-','-','-'])[::-1],
               numpy.array([732685., 732685., 732685., 732685., 732685., 732685.,
                            732685., 732685., 732685., 732685., 732685., 732685.,
                            732679.])[::-1]])

Here's some version info: python 2.5.4, numpy 1.3.0. The error is:

*** glibc detected *** free(): invalid next size (fast): 0x00526be0 ***

Any ideas? --jlm
[Numpy-discussion] numpy on Snow Leopard
I am unable to build numpy on Snow Leopard. The error that I am getting is shown below. It is a linking issue related to the change in the default behavior of gcc under Snow Leopard: before, it compiled for the 32-bit i386 architecture; now the default is the 64-bit x86_64 architecture. Has anybody successfully compiled numpy for Mac OS X 10.6? If so, I would appreciate it if you could tell me how you fixed this issue. Regards, Celil

...
C compiler: gcc -arch ppc -arch i386 -isysroot /Developer/SDKs/MacOSX10.4u.sdk -fno-strict-aliasing -fno-common -dynamic -DNDEBUG -g -O3
...
gcc: _configtest.c
_configtest.c:1: warning: conflicting types for built-in function ‘exp’
_configtest.c:1: warning: conflicting types for built-in function ‘exp’
gcc _configtest.o -o _configtest
ld: warning: in _configtest.o, missing required architecture x86_64 in file
Undefined symbols: _main, referenced from: start in crt1.10.6.o
ld: symbol(s) not found
collect2: ld returned 1 exit status
ld: warning: in _configtest.o, missing required architecture x86_64 in file
Undefined symbols: _main, referenced from: start in crt1.10.6.o
ld: symbol(s) not found
collect2: ld returned 1 exit status
failure.
Re: [Numpy-discussion] numpy core dump on linux
On Wed, Sep 2, 2009 at 4:37 PM, Robert Kern wrote: > Huh. The line executes for me on OS X, but the interpreter crashes when exiting. Here is my backtrace:

Thread 0 Crashed:
0 org.python.python 0x00270760 collect + 288
1 org.python.python 0x002712ea PyGC_Collect + 42
2 org.python.python 0x00260390 Py_Finalize + 208
3 org.python.python 0x0026f750 Py_Main + 2768
4 org.python.python 0x1f82 0x1000 + 3970
5 org.python.python 0x1ea9 0x1000 + 3753

> Can you show us a gdb backtrace on your machine? It's the [::-1] what done it. I suspect a copy is being made and has a bug.

In [1]: a = np.array(['-']*100)
In [2]: b = np.array([1.0]*100)
In [3]: i = lexsort((a,b))
In [4]: i = lexsort((a[::-1]))
In [5]: i = lexsort((b[::-1]))
In [6]: i = lexsort((a,b[::-1]))
In [7]: i = lexsort((a[::-1],b))
*Crash*

These also work:

In [3]: i = lexsort((b[::-1],a))
In [4]: i = lexsort((b[::-1],b[::-1]))
In [5]: i = lexsort((a[::-1],a[::-1]))
In [6]: i = lexsort((a,b[::-1]))

So it seems to be the combination of the reversed string array a with an array of a different type. Looks like a type setting is getting skipped somewhere. Chuck
[Numpy-discussion] help creating a reversed cumulative histogram
Hello fellow numpy users, I posted some questions on histograms recently [1, 2] but still couldn't find a solution. I am trying to create an inverse cumulative histogram [3] which shall look like [4] but with the higher values at the left. The classification shall follow this exemplary rule:

class 1: all values > 0
class 2: all values > 10
class 3: all values > 15
class 4: all values > 20
class 5: all values > 25
[...]

I could get this easily in a spreadsheet by creating a matrix with conditional statements (if VALUES_COL > CLASS_BOUNDARY; VALUES_COL; '-'). With python (numpy or pylab) I was not successful: the plotted histogram envelope turned out to be just the inverted curve of the one created with the spreadsheet app. I have briefly visualised the issue here [5]; I hope that this makes it more understandable. Later I would like to sum and count all values in each bin as discussed in [2]. May someone give me a pointer or hint on how to improve my code below to achieve the desired histogram? Thanks a lot in advance, Timmie

[1]: http://www.nabble.com/np.hist-with-masked-values-to25243905.html
[2]: http://www.nabble.com/histogram%3A-sum-up-values-in-each-bin-to25171265.html
[3]: http://en.wikipedia.org/wiki/Histogram#Cumulative_histogram
[4]: http://addictedtor.free.fr/graphiques/RGraphGallery.php?graph=126
[5]: http://www.scribd.com/doc/19371606/Distribution-Histogram

# CODE
normed = False
values  # loaded data as array
bins = 10

### sum
## taken from
## http://www.nabble.com/Scipy-and-statistics%3A-probability-density-function-to24683007.html#a24683304
sums = np.histogram(values, weights=values, normed=normed, bins=bins)
ecdf_sums = np.hstack([0.0, sums[0].cumsum()])
ecdf_inv_sums = ecdf_sums[::-1]
pylab.plot(sums[1], ecdf_inv_sums)
pylab.show()
Re: [Numpy-discussion] numpy core dump on linux
I experience the same problem. A few more additional test cases:

In [1]: import numpy
In [2]: numpy.lexsort([numpy.arange(5)[::-1].copy(), numpy.arange(5)])
Out[2]: array([0, 1, 2, 3, 4])
In [3]: numpy.lexsort([numpy.arange(5)[::-1].copy(), numpy.arange(5.)])
Out[3]: array([0, 1, 2, 3, 4])
In [4]: numpy.lexsort([numpy.arange(5), numpy.arange(5)])
Out[4]: array([0, 1, 2, 3, 4])
In [5]: numpy.lexsort([numpy.arange(5), numpy.arange(5.)])
Out[5]: array([0, 1, 2, 3, 4])
In [6]: numpy.lexsort([numpy.arange(5)[::-1], numpy.arange(5)])
Out[6]: array([0, 1, 2, 3, 4])
In [7]: numpy.lexsort([numpy.arange(5)[::-1], numpy.arange(5.)])
*** glibc detected *** /usr/bin/python: free(): invalid next size (fast): 0x09be6eb8 ***

It looks like the problem is when the first array is reversed and the second is float. I am not familiar with gdb. If I run gdb python, run it, and give the commands above, it hangs at the glibc line without returning to gdb unless I hit CTRL-C. In this case, I guess, the backtrace I get is related to the CTRL-C rather than the error. Any hint on how to obtain useful information from gdb? Best, Luca
Re: [Numpy-discussion] A faster median (Wirth's method)
On Wed, Sep 2, 2009 at 1:25 PM, Chad Netzer wrote: > I'd also like to, if possible, have a specialized 2D version, since image median filtering is one of my interests, and the C version works on 1D (raveled) arrays only. There are special hardwired medians for 2, 3, 5, and 9 elements, which cover a lot of image processing. They aren't in numpy, though ;) David has implemented a NeighborhoodIter that could help extract the elements if you want to deal with images. Chuck
Re: [Numpy-discussion] numpy core dump on linux
On Wed, Sep 2, 2009 at 5:19 PM, Citi, Luca wrote: > Any hint on how to obtain useful information from gdb? The actual bug is probably not where the crash occurs. I think there is enough info to track it down for anyone who wants to crawl through the relevant code. Chuck
Re: [Numpy-discussion] help creating a reversed cumulative histogram
On Wed, Sep 2, 2009 at 18:15, Tim Michelsen wrote: > I am trying to create an inverse cumulative histogram [3] which shall look like [4] but with the higher values at the left. Okay. That is completely different from what you've asked before. > ecdf_sums = np.hstack([0.0, sums[0].cumsum()])
> ecdf_inv_sums = ecdf_sums[::-1]
This is not the kind of inversion that you are looking for. You want

ecdf_inv_sums = ecdf_sums[-1] - ecdf_sums

-- Robert Kern
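[Putting Robert's fix together with the earlier snippet, a runnable end-to-end sketch; the data is random just to have something to plot:

import numpy as np
import pylab

values = np.random.rand(1000) * 30
sums, edges = np.histogram(values, weights=values, bins=10)
ecdf_sums = np.hstack([0.0, sums.cumsum()])
ecdf_inv_sums = ecdf_sums[-1] - ecdf_sums  # high totals at the left
pylab.plot(edges, ecdf_inv_sums)
pylab.show()]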
Re: [Numpy-discussion] numpy core dump on linux
On Wed, Sep 2, 2009 at 5:19 PM, Citi, Luca lc...@essex.ac.uk wrote: I experience the same problem. A few additional test cases: [...]

In [7]: numpy.lexsort([numpy.arange(5)[::-1], numpy.arange(5.)])
*** glibc detected *** /usr/bin/python: free(): invalid next size (fast): 0x09be6eb8 ***

It looks like the problem occurs when the first array is reversed and the second is float.

It's mixing types with different bit sizes, small type first:

In [6]: a = np.array([1.0]*100, dtype=np.int16)
In [7]: b = np.array([1.0]*100, dtype=np.int32)
In [8]: np.lexsort((a[::-1], b))
*Crash*

The reverse order of types probably gives incorrect results even where it doesn't crash, but different arrays would be needed to check that.

Chuck ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
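For anyone bitten by this in the meantime, the test cases above suggest a workaround sketch: pass a contiguous copy instead of the reversed view (the .copy() variants in [2] and [3] do not crash):

    import numpy as np

    a = np.arange(5)[::-1]           # reversed view: negative strides
    b = np.arange(5.)
    # np.lexsort([a, b]) crashes in the affected versions
    idx = np.lexsort([a.copy(), b])  # contiguous copy avoids the crash
    print idx                        # -> [0 1 2 3 4]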
Re: [Numpy-discussion] help creating a reversed cumulative histogram
On Wed, Sep 2, 2009 at 7:26 PM, Robert Kern robert.k...@gmail.com wrote: On Wed, Sep 2, 2009 at 18:15, Tim Michelsen timmichel...@gmx-topmail.de wrote: I am trying to create an inverse cumulative histogram [3] which shall look like [4] but with the higher values at the left. [...]

sums = np.histogram(values, weights=values, normed=normed, bins=bins)
ecdf_sums = np.hstack([0.0, sums[0].cumsum()])
ecdf_inv_sums = ecdf_sums[::-1]

This is not the kind of inversion that you are looking for. You want

ecdf_inv_sums = ecdf_sums[-1] - ecdf_sums

and you can plot the histogram with bar:

eisf_sums = ecdf_sums[-1] - ecdf_sums  # empirical inverse survival function of the weights
width = sums[1][1] - sums[1][0]
rects1 = plt.bar(sums[1], eisf_sums, width, color='b')

Are you sure you want cumulative weights in the histogram?

Josef

-- Robert Kern I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth. -- Umberto Eco ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
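Putting the two answers together, a minimal self-contained sketch (the data, bin edges, and variable names are illustrative, not Tim's actual inputs):

    import numpy as np
    import matplotlib.pyplot as plt

    values = np.random.gamma(2.0, 10.0, 1000)   # illustrative data
    bins = np.arange(0, 105, 5)

    sums, edges = np.histogram(values, weights=values, bins=bins)
    ecdf_sums = np.hstack([0.0, sums.cumsum()])
    eisf_sums = ecdf_sums[-1] - ecdf_sums       # higher values at the left

    width = edges[1] - edges[0]
    plt.bar(edges, eisf_sums, width, color='b')
    plt.show()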
Re: [Numpy-discussion] help creating a reversed cumulative histogram
Hello Robert and Josef, thanks for the quick answers! I really appreciate this. I am trying to create an inverse cumulative histogram [3] which shall look like [4] but with the higher values at the left. Okay. That is completely different from what you've asked before. You are right. But it's sometimes hard to describe a desired and expected output in python terms and pseudocode. I still have to learn more numpy vocabulary... I will evaluate your answers and give feedback. Regards, Timmie ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] masked arrays of structured arrays
31/08/09 @ 14:37 (-0400), thus spake Pierre GM: On Aug 31, 2009, at 2:33 PM, Ernest Adrogué wrote: 30/08/09 @ 13:19 (-0400), thus spake Pierre GM: I can't reproduce that with a recent SVN version (r7348). What version of numpy are you using? Version 1.2.1 That must be it. Can you try w/ 1.3? Yes, in version 1.3.0 it's fixed. -- Ernest ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] help creating a reversed cumulative histogram
On Wed, Sep 2, 2009 at 19:11, Tim Michelsen timmichel...@gmx-topmail.de wrote: Hello Robert and Josef, thanks for the quick answers! I really appreciate this. I am trying to create an inverse cumulative histogram [3] which shall look like [4] but with the higher values at the left. Okay. That is completely different from what you've asked before. You are right. But it's sometimes hard to describe a desired and expected output in python terms and pseudocode. I still have to learn more numpy vocabulary...

Actually, I apologize. I meant to delete that line before sending the message. It was unnecessary and abusive.

-- Robert Kern I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth. -- Umberto Eco ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] numpy core dump on linux
On Wed, Sep 2, 2009 at 4:23 PM, Jeremy Mayes jeremy.ma...@gmail.com wrote: This one line causes python to core dump on linux:

numpy.lexsort([
    numpy.array(['-','-','-','-','-','-','-','-','-','-','-','-','-'])[::-1],
    numpy.array([732685., 732685., 732685., 732685., 732685., 732685.,
                 732685., 732685., 732685., 732685., 732685., 732685.,
                 732679.])[::-1]])

Here's some version info: python 2.5.4, numpy 1.3.0. The error is:

*** glibc detected *** free(): invalid next size (fast): 0x00526be0 ***

I've opened ticket #1217 for this. Chuck ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Fastest way to parse a specific binary file
On Wed, Sep 2, 2009 at 1:58 PM, Robert Kern robert.k...@gmail.com wrote: On Wed, Sep 2, 2009 at 13:28, Gökhan Sever gokhanse...@gmail.com wrote: I put the reference manual at: http://drop.io/1plh5rt The first few pages describe the data format they use.

Ah. The fields are *not* delimited by a fixed value. Regexes are no help to you for pulling out the information you need, except perhaps later to parse the text fields. I think you are also getting spurious results because your regex matches things inside data fields. Instead, you have a header containing the length of the data field, followed by the data field itself. Create a structured dtype that corresponds to the DataDir struct on page 15. Note that unsigned int there is actually a numpy.uint16, not a uint32.

    dt = np.dtype([('tagNumber', np.uint16), ('dataOffset', np.uint16),
                   ('numberBytes', np.uint16), ('samples', np.uint16),
                   ('bytesPerSample', np.uint16), ('type', np.uint8),
                   ('param1', np.uint8), ('param2', np.uint8),
                   ('param3', np.uint8), ('address', np.uint16)])

Now read dt.itemsize bytes from the file and use

    header = np.fromstring(f.read(dt.itemsize), dt)[0]

to get a record object that corresponds to the header. Use the dataOffset and numberBytes fields to extract the actual data bytes from the file. For example, if we go to the second header field:

    In [28]: f.seek(dt.itemsize, 0)
    In [29]: header = np.fromstring(f.read(dt.itemsize), dt)[0]
    In [30]: header
    Out[30]: (65530, 100, 8, 1, 8, 255, 0, 0, 0, 43605)
    In [31]: f.seek(header['dataOffset'], 0)
    In [32]: f.read(header['numberBytes'])
    Out[32]: 'prj.300\x00'

There are still some semantic issues you need to work out. There are multiple buffers per file, and the dataOffsets are relative to the start of the buffer, not the file.

-- Robert Kern I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth. -- Umberto Eco ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion

Robert, you must have thrown a couple of RTFMs while replying to my emails :) I usually take trial-and-error approaches initially, and don't give up until I hit a hurdle head-on, which in this case was the unsuccessful regex approach. On the plus side, I have learnt the basics of regular expressions and realized how powerful they can be for text-parsing tasks. Enough prattle; below is what I am working on.

So far I have been able to extract the file names and the data associated with those names (with the exception of the multiple-buffers-per-file cases). However, I am not reading the time increments correctly: I should be seeing 1-second incremental time ticks from the time-segment reads, but all I get is the same initial time. Furthermore, I still couldn't figure out how to wrap the main looping suite (range(500) is just a dummy number that lets me process the whole binary file); I don't know yet how to make the range input generic so it will work with any size of similar binary file.
import numpy as np
import struct

f = open('test.sea', 'rb')

dt = np.dtype([('tagNumber', np.uint16), ('dataOffset', np.uint16),
               ('numberBytes', np.uint16), ('samples', np.uint16),
               ('bytesPerSample', np.uint16), ('type', np.uint8),
               ('param1', np.uint8), ('param2', np.uint8),
               ('param3', np.uint8), ('address', np.uint16)])

start = 0
ct = 0

for i in range(500):
    header = np.fromstring(f.read(dt.itemsize), dt)[0]
    if header['tagNumber'] == 65530:
        loc = f.tell()
        f.seek(start + header['dataOffset'])
        f.read(header['numberBytes'])
        f.seek(loc)
    elif header['tagNumber'] == 65531:
        loc = f.tell()
        f.seek(start + header['dataOffset'])
        f.read(header['numberBytes'])
        start = f.tell()
    elif header['tagNumber'] == 0:
        loc = f.tell()
        f.seek(start + header['dataOffset'])
        print f.tell()
        k = f.read(header['numberBytes'])  # a closing parenthesis was missing here
        print struct.unpack('9h', k[:18])
        f.seek(loc)
    ct += 1

-- Gökhan ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] A faster median (Wirth's method)
Chad Netzer wrote: By the way, as far as I can tell, the above algorithm is exactly the same idea as a non-recursive Hoare (i.e. quicksort) selection: do the partition, then only proceed to the sub-partition that must contain the nth element. My version is a bit more general, allowing partitioning on a range of elements rather than just one, but the concept is the same. The numpy quicksort already does non-recursive sorting. I'd also like to, if possible, have a specialized 2D version, since image median filtering is one of my interests, and the C version works on 1D (raveled) arrays only.

I agree. NumPy (or SciPy) could have a select module similar to the sort module. If the select function takes an axis argument similar to the sort functions, only a small change to the current np.median would be needed. Take a look at this: http://projects.scipy.org/numpy/attachment/ticket/1213/_selectmodule.pyx Here is a select function that takes an axis argument. There are specialized versions for 1D, 2D, and 3D. Input can be contiguous or not. For 4D and above, axes are found by recursion on the shape array. Thus it should be fast regardless of dimensions. I haven't tested the Cython code /thoroughly/, but at least it does compile.

Sturla Molden ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
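The median-from-selection idea is easy to demonstrate with modern NumPy, where np.partition (added in NumPy 1.8, well after this thread) provides exactly this kind of k-th order statistic with an axis argument. A sketch built on it, not the _selectmodule.pyx code itself:

    import numpy as np

    def median_via_select(a, axis=-1):
        # Median from k-th order statistics instead of a full sort.
        a = np.asarray(a)
        n = a.shape[axis]
        k = n // 2
        if n % 2:
            return np.take(np.partition(a, k, axis=axis), k, axis=axis)
        part = np.partition(a, (k - 1, k), axis=axis)
        lo = np.take(part, k - 1, axis=axis)
        hi = np.take(part, k, axis=axis)
        return 0.5 * (lo + hi)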
Re: [Numpy-discussion] Fastest way to parse a specific binary file
On Wed, Sep 2, 2009 at 23:59, Gökhan Sever gokhanse...@gmail.com wrote: Robert, you must have thrown a couple of RTFMs while replying to my emails :)

Not really. There's no manual for this. Greg Wilson's _Data Crunching_ may be a good general introduction to how to think about these problems. http://www.pragprog.com/titles/gwd/data-crunching

[...] Furthermore, I still couldn't figure out how to wrap the main looping suite (range(500) is just a dummy number that lets me process the whole binary file); I don't know yet how to make the range input generic so it will work with any size of similar binary file.

    while True:
        ...
        if no_more_data():
            break

    [...]
    for i in range(500):
        header = np.fromstring(f.read(dt.itemsize), dt)[0]
        if header['tagNumber'] == 65530:
            loc = f.tell()
            f.seek(start + header['dataOffset'])
            f.read(header['numberBytes'])

Presumably you are doing something with this data, not just discarding it.

            f.seek(loc)

Note that f.seek(loc) with a single argument already seeks to an absolute offset: the whence argument defaults to 0, meaning "from the beginning of the file". To seek forward relative to the current position you would write f.seek(nbytes, 1). Spelling it f.seek(loc, 0) just makes the intent explicit.

-- Robert Kern I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth. -- Umberto Eco ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
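One concrete way to fill in the no_more_data() sketch above, assuming the file simply ends after the last complete header:

    while True:
        raw = f.read(dt.itemsize)
        if len(raw) < dt.itemsize:   # EOF: no complete header left
            break
        header = np.fromstring(raw, dt)[0]
        # ... dispatch on header['tagNumber'] as before ...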
Re: [Numpy-discussion] A faster median (Wirth's method)
On Thu, Sep 3, 2009 at 00:09, Sturla Molden stu...@molden.no wrote: Chad Netzer wrote: I'd also like to, if possible, have a specialized 2D version, since image median filtering is one of my interests, and the C version works on 1D (raveled) arrays only. I agree. NumPy (or SciPy) could have a select module similar to the sort module. [...] For 4D and above, axes are found by recursion on the shape array. Thus it should be fast regardless of dimensions.

When he is talking about 2D, I believe he is referring to median filtering rather than computing the median along an axis. I.e., replacing each pixel with the median of a specified neighborhood around the pixel.

-- Robert Kern I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth. -- Umberto Eco ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] A faster median (Wirth's method)
On Wed, Sep 2, 2009 at 10:28 PM, Robert Kern robert.k...@gmail.com wrote: When he is talking about 2D, I believe he is referring to median filtering rather than computing the median along an axis. I.e., replacing each pixel with the median of a specified neighborhood around the pixel.

That's right, Robert. Basically, I meant doing a median on a square (or rectangular) view of an array, without first having to ravel(), thus generally saving a copy. But actually, since my selection-based median overwrites the source array, it may not save a copy anyway. But Charles Harris's earlier suggestion of some hard-coded medians for common filter template sizes (i.e. 3x3, 5x5, etc.) may be a nice addition to scipy, especially if it can be generalized somewhat to other filters.

-C ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
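For concreteness, a hard-coded median of the smallest template size, written as the kind of min/max exchange network such specializations use (illustrative only, not the proposed scipy code):

    def median3(a, b, c):
        # Median of three values without sorting: two mins, two maxes.
        return max(min(a, b), min(max(a, b), c))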
Re: [Numpy-discussion] A faster median (Wirth's method)
Robert Kern wrote: When he is talking about 2D, I believe he is referring to median filtering rather than computing the median along an axis. I.e., replacing each pixel with the median of a specified neighborhood around the pixel.

That's not something numpy's median function should be specialized to do. IMHO, median filtering belongs in scipy.

Sturla ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
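In fact scipy.ndimage already ships an n-dimensional median filter; a minimal example:

    import numpy as np
    from scipy import ndimage

    img = np.random.rand(256, 256)                 # illustrative image
    denoised = ndimage.median_filter(img, size=3)  # 3x3 neighborhood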
Re: [Numpy-discussion] A faster median (Wirth's method)
Chad Netzer wrote: But Charles Harris's earlier suggestion of some hard-coded medians for common filter template sizes (i.e. 3x3, 5x5, etc.) may be a nice addition to scipy, especially if it can be generalized somewhat to other filters.

For 2D images, try looking into PIL: ImageFilter.MedianFilter.

Cheers, Jon ___ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
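A short sketch of the PIL filter Jon mentions ('noisy.png' and 'denoised.png' are hypothetical file names):

    from PIL import Image, ImageFilter

    im = Image.open('noisy.png')
    out = im.filter(ImageFilter.MedianFilter(size=3))  # 3x3 median window
    out.save('denoised.png')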