Gnata Xavier wrote:
> OK, I will try to see what I can do, but it is clear that we do need the
> plug-in system first (read "before the threads in the numpy release").
> During the devel of 1.1, I will try to find some time to understand
> where I should put some pragmas into ufunc using a very con
Anne Archibald wrote:
>
> Actually, there are a few places where a parallel for would serve to
> accelerate all ufuncs. There are build issues, yes, though they are
> mild;
Maybe, maybe not. Anyway, I said that I would step in to resolve those
issues if someone else does the coding.
> we would
Damian Eads wrote:
> Anne Archibald wrote:
>> On 23/03/2008, Damian Eads <[EMAIL PROTECTED]> wrote:
>>> Hi,
>>>
>>> I am working on a memory-intensive experiment with very large arrays so
>>> I must be careful when allocating memory. Numpy already supports a
>>> number of in-place operations (+=
Well that's fine for binops with the same types, but it's not so
obvious which type to cast to when mixing signed and unsigned types.
Should the type of N.int32(10)+N.uint32(10) be int32, uint32 or int64?
Given your answer what should the type of N.int64(10)+N.uint64(10) be
(which is the case in th
Anne Archibald wrote:
> On 23/03/2008, Damian Eads <[EMAIL PROTECTED]> wrote:
>> Hi,
>>
>> I am working on a memory-intensive experiment with very large arrays so
>> I must be careful when allocating memory. Numpy already supports a
>> number of in-place operations (+=, *=) making the task much
(please copy to the Trac page)
On Sat, Mar 22, 2008 at 6:08 PM, James Philbin <[EMAIL PROTECTED]> wrote:
> I'm not sure that #669
> (http://projects.scipy.org/scipy/numpy/ticket/669) is a bug, but
> probably needs some discussion (see the last reply on that page). The
> cast is made because we do
David Cournapeau wrote:
> Gnata Xavier wrote:
>
>> Well of course my goal was not to say that my simple testcase can be
>> copied/pasted into numpy :)
>> Of course it is one of the best cases to use OpenMP.
>> Of course pragmas can be more complex than that (you can tell variables
>> that can/can
>
> (And I suspect that OpenMP is
> smart enough to use single threads without locking when multiple
> threads won't help. Certainly all the information is available to
> OpenMP to make such decisions.)
>
Unfortunately, I don't think there is such a thing. For instance, the number
of threads used b
On 23/03/2008, David Cournapeau <[EMAIL PROTECTED]> wrote:
> Gnata Xavier wrote:
> >
> > Hi,
> >
> > I have a very limited knowledge of openmp but please consider this
> > testcase :
> >
> >
>
>
> Honestly, if it was that simple, it would already have been done for a
> long time. The proble
OK, I'm really impressed with the improvements in vectorization for
gcc 4.3. It really seems like it's able to work with real loops, which
wasn't the case with 4.1. I think Chuck's right that we should simply
special case contiguous data and allow the auto-vectorizer to do the
rest. Something like t
On Sun, Mar 23, 2008 at 6:41 AM, Francesc Altet <[EMAIL PROTECTED]> wrote:
> On Sunday 23 March 2008, Charles R Harris wrote:
> > gcc --version: gcc (GCC) 4.1.2 20070925 (Red Hat 4.1.2-33)
> > cpu: Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz
> >
> > Problem size Simple
Scott Ransom wrote:
> Hi David et al,
>
> Very interesting. I thought that the 64-bit gcc's automatically
> aligned memory on 16-bit (or 32-bit) boundaries.
Note that I am talking about bytes, not bits. Default alignment depends
on many parameters, like the OS and C runtime. For example, on mac os
Hi David et al,
Very interesting. I thought that the 64-bit gcc's automatically
aligned memory on 16-bit (or 32-bit) boundaries. But apparently
not. Because running your code certainly made the intrinsic code
quite a bit faster. However, another thing that I noticed was
that the "simple" code
On Sunday 23 March 2008, Francesc Altet wrote:
> On Sunday 23 March 2008, Anne Archibald wrote:
> > On 23/03/2008, Damian Eads <[EMAIL PROTECTED]> wrote:
> > > Hi,
> > >
> > > I am working on a memory-intensive experiment with very large
> > > arrays so I must be careful when allocating memory
>
> If the performances are so bad, OK, forget about it, but it would be
> sad, because the next generation of CPUs will not be more powerful; they
> will "only" have more than one or two cores on the same chip.
>
I don't think this is the worst that will happen. The worst is what has been
seen for
Gnata Xavier wrote:
> Well of course my goal was not to say that my simple testcase can be
> copied/pasted into numpy :)
> Of course it is one of the best cases to use OpenMP.
> Of course pragmas can be more complex than that (you can tell variables
> that can/cannot be shared, for instance).
>
> The
David Cournapeau wrote:
> Francesc Altet wrote:
>
>> Why not? IMHO, complex operations requiring a great deal of operations
>> per word, like trigonometric, exponential, etc..., are the best
>> candidates to take advantage of several cores or even SSE instructions
>> (not sure whether SSE su
>
> I find the example of sse rather enlightening: in theory, you should
> expect a 100-300 % speed increase using sse, but even with pure C code
> in a controlled manner, on one platform (linux + gcc), with varying,
> recent CPU, the results are fundamentally different. So what would
> happen in n
On Sunday 23 March 2008, Anne Archibald wrote:
> On 23/03/2008, Damian Eads <[EMAIL PROTECTED]> wrote:
> > Hi,
> >
> > I am working on a memory-intensive experiment with very large
> > arrays so I must be careful when allocating memory. Numpy already
> > supports a number of in-place operations
Francesc Altet wrote:
>
> Why not? IMHO, complex operations requiring a great deal of operations
> per word, like trigonometric, exponential, etc..., are the best
> candidates to take advantage of several cores or even SSE instructions
> (not sure whether SSE supports this sort of operations, t
Hi,
Here are my results for an AMD Opteron machine:
gcc version 4.1.3 (SUSE Linux) | Dual Core AMD Opteron 270 @ 2 GHz
$ gcc -msse -O2 vec_bench.c -o vec_bench
$ ./vec_bench
Testing methods...
All OK
Problem size    Simple          Intrin          Inline
On Sunday 23 March 2008, David Cournapeau wrote:
> Gnata Xavier wrote:
> > Hi,
> >
> > I have a very limited knowledge of openmp but please consider this
> > testcase :
>
> Honestly, if it was that simple, it would already have been done for
> a long time. The problem is that your test-case is no
On Sunday 23 March 2008, Charles R Harris wrote:
> gcc --version: gcc (GCC) 4.1.2 20070925 (Red Hat 4.1.2-33)
> cpu: Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz
>
> Problem size    Simple              Intrin              Inline
> 100             0.0002ms (100.0%)   0.0001ms ( 6
Gnata Xavier wrote:
>
> Hi,
>
> I have a very limited knowledge of openmp but please consider this
> testcase :
>
>
Honestly, if it was that simple, it would already have been done for a
long time. The problem is that your test-case is not even remotely close
to how things have to be done in
On Thu, Mar 20, 2008 at 6:41 PM, P GM <[EMAIL PROTECTED]> wrote:
> That particular test in test_old_ma will never work: the .data of a
> masked array is implemented as a property, so its id will change from
> one test to another.
I removed the broken test in r4934.
Nils: the segfault you report
Travis E. Oliphant wrote:
> Anne Archibald wrote:
>
>> On 22/03/2008, Travis E. Oliphant <[EMAIL PROTECTED]> wrote:
>>
>>
>>> James Philbin wrote:
>>> > Personally, I think that the time would be better spent optimizing
>>> > routines for single-threaded code and relying on BLAS and LA
Hi David
On Sun, Mar 23, 2008 at 7:14 AM, David Cournapeau
<[EMAIL PROTECTED]> wrote:
>- 571: This one is fixed, no ?
Works fine on my machine, so I closed the ticket.
>- 654: is there a standardized way to handle tests to skip ? I
> posted a patch which should fix the issue, bu
Wow, a much more varied set of results than I was expecting. Could
someone who has gcc 4.3 installed compile it with:
gcc -msse -O2 -ftree-vectorize -ftree-vectorizer-verbose=5 -S
vec_bench.c -o vec_bench.s
And attach vec_bench.s and the verbose output from gcc.
James
James Philbin wrote:
> OK, I've written a simple benchmark which implements an elementwise
> multiply (A=B*C) in three different ways (standard C, intrinsics, hand
> coded assembly). On the face of things the results seem to indicate
> that the vectorization works best on medium sized inputs. If pe