Re: Create many objects using threads

2014-05-06 Thread hardcoremore via Digitalmars-d-learn

On Tuesday, 6 May 2014 at 03:26:52 UTC, Ali Çehreli wrote:

On 05/05/2014 04:32 PM, Caslav Sabani wrote:

 So basically using threads in D for creating multiple instances of class is
 actually slower.

Not at all! That statement can be true only in certain 
programs. :)


Ali



But what exactly does it mean that the garbage collector blocks? What
does it block, and in what way?


And can I use threads to create multiple instances faster, or is that
just not possible?




Thanks



Re: Create many objects using threads

2014-05-06 Thread Kapps via Digitalmars-d-learn

On Monday, 5 May 2014 at 22:11:39 UTC, Ali Çehreli wrote:

On 05/05/2014 02:38 PM, Kapps wrote:

 I think that the GC actually blocks when
 creating objects, and thus multiple threads creating instances would not
 provide a significant speedup, possibly even a slowdown.

Wow! That is the case. :)

 You'd want to benchmark this to be certain it helps.

I did:

import std.range;
import std.parallelism;

class C
{}

void foo()
{
    auto c = new C;
}

void main(string[] args)
{
    enum totalElements = 10_000_000;

    if (args.length > 1) {
        // Any extra command-line argument selects the parallel version
        foreach (i; iota(totalElements).parallel) {
            foo();
        }

    } else {
        foreach (i; iota(totalElements)) {
            foo();
        }
    }
}

Typical run on my system for -O -noboundscheck -inline:

$ time ./deneme parallel

real    0m4.236s
user    0m4.325s
sys     0m9.795s

$ time ./deneme

real    0m0.753s
user    0m0.748s
sys     0m0.003s

Ali


Huh, that's a much, much higher impact than I'd expected.
I tried with GDC as well (the one in Debian stable, which is 
unfortunately still 2.055...) and got similar results. I also 
tried creating only totalCPUs threads and having each of them 
create totalElements / totalCPUs objects rather than risking that 
each creation was a task, and it still seems to be the same.


Using malloc and emplace instead of new, results are about 50% 
faster for single-threaded and ~3-4 times faster for 
multi-threaded (4-core, 8-thread machine, Linux 64-bit). The 
multi-threaded version is still twice as slow as the 
single-threaded one, though. On my Windows laptop (with the 
program compiled for 32-bit), it did not make a significant 
difference and the multi-threaded version is still 4 times slower.


That being said, I think most malloc implementations, while being 
thread-safe, usually use locks or do not scale well.


Code:
import std.range;
import std.parallelism;
import std.datetime;
import std.stdio;
import core.stdc.stdlib;
import std.conv;

class C {}

void foo() {
    //auto c = new C;
    // Allocate with malloc and construct the object in place.
    // The memory is never freed, which is acceptable for this benchmark.
    enum size = __traits(classInstanceSize, C);
    void[] mem = malloc(size)[0..size];
    emplace!C(mem);
}

void createFoos(size_t count) {
    foreach(i; 0 .. count) {
        foo();
    }
}

void main(string[] args) {
    StopWatch sw = StopWatch(AutoStart.yes);
    enum totalElements = 10_000_000;
    if (args.length <= 1) {
        // No argument: single-threaded
        foreach (i; iota(totalElements)) {
            foo();
        }
    } else if(args[1] == "tasks") {
        // Parallel foreach over all elements
        foreach (i; parallel(iota(totalElements))) {
            foo();
        }
    } else if(args[1] == "parallel") {
        // One task per CPU, each creating an equal share of the objects
        for(int i = 0; i < totalCPUs; i++) {
            taskPool.put(task(&createFoos, totalElements / totalCPUs));
        }
        taskPool.finish(true);
    } else
        writeln("Unknown argument '", args[1], "'.");
    sw.stop();
    writeln(cast(Duration)sw.peek);
}

Results (Linux 64-bit):
shardsoft:~$ dmd -O -inline -release test.d
shardsoft:~$ ./test
552 ms, 729 μs, and 7 hnsecs
shardsoft:~$ ./test
532 ms, 139 μs, and 5 hnsecs
shardsoft:~$ ./test tasks
1 sec, 171 ms, 126 μs, and 4 hnsecs
shardsoft:~$ ./test tasks
1 sec, 38 ms, 468 μs, and 6 hnsecs
shardsoft:~$ ./test parallel
1 sec, 146 ms, 738 μs, and 2 hnsecs
shardsoft:~$ ./test parallel
1 sec, 268 ms, 195 μs, and 3 hnsecs





Re: Create many objects using threads

2014-05-06 Thread Kapps via Digitalmars-d-learn

On Tuesday, 6 May 2014 at 15:56:11 UTC, Kapps wrote:


snip


I tried using an allocator that never releases memory, 
rounds up to a power of 2, and is lock-free. The results are 
quite a bit better.


shardsoft:~$ ./test
1 sec, 47 ms, 474 μs, and 4 hnsecs
shardsoft:~$ ./test
1 sec, 43 ms, 588 μs, and 2 hnsecs
shardsoft:~$ ./test tasks
692 ms, 769 μs, and 8 hnsecs
shardsoft:~$ ./test tasks
692 ms, 686 μs, and 8 hnsecs
shardsoft:~$ ./test parallel
691 ms, 856 μs, and 9 hnsecs
shardsoft:~$ ./test parallel
690 ms, 22 μs, and 3 hnsecs

I get similar results on my laptop (which is much faster than the 
results I got on it using DMD's malloc):

test
1 sec, 125 ms, and 847 μs
test
1 sec, 125 ms, 741 μs, and 6 hnsecs
test tasks
556 ms, 613 μs, and 8 hnsecs
test tasks
552 ms and 287 μs
test parallel
554 ms, 542 μs, and 6 hnsecs
test parallel
551 ms, 514 μs, and 9 hnsecs


Code:
http://pastie.org/9146326

Unfortunately it doesn't compile with the ancient version of gdc 
available in Debian, so I couldn't test with that. The results 
should be quite a bit better since core.atomic would be faster. 
And frankly, I'm not sure if the allocator actually works 
properly, but it's just for testing purposes anyways.
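
For readers who cannot reach the link, here is a rough sketch of what an allocator with the three properties described above (never releases memory, rounds sizes up to a power of 2, lock-free) might look like. It is not the code from the link. Every name in it (arenaBase, arenaCapacity, arenaOffset, roundUp, initArena, allocate, createObjects) is made up for illustration, and the 16-byte minimum block size is an assumption chosen only to keep blocks aligned.

```d
import core.atomic : atomicOp, atomicStore;
import core.stdc.stdlib : malloc;
import std.conv : emplace;
import std.exception : enforce;
import std.parallelism : parallel;
import std.stdio : writeln;

// Hypothetical names throughout; a sketch, not the linked code.
__gshared ubyte* arenaBase;      // start of one big malloc'ed block
__gshared size_t arenaCapacity;  // size of that block in bytes
shared size_t arenaOffset;       // bytes handed out so far

// Round up to a power of 2, with a floor of 16 so that every block
// starts at a multiple of 16 bytes from the base.
size_t roundUp(size_t n)
{
    size_t p = 16;
    while (p < n)
        p <<= 1;
    return p;
}

void initArena(size_t capacity)
{
    arenaBase = cast(ubyte*) malloc(capacity);
    enforce(arenaBase !is null, "malloc failed");
    arenaCapacity = capacity;
    atomicStore(arenaOffset, size_t(0));
}

// Lock-free bump allocation: one atomic add per request, nothing is freed.
void[] allocate(size_t size)
{
    const rounded = roundUp(size);
    const end = atomicOp!"+="(arenaOffset, rounded); // returns the new offset
    enforce(end <= arenaCapacity, "arena exhausted");
    const start = end - rounded;
    return arenaBase[start .. start + size];
}

class C {}

C[] createObjects(size_t count)
{
    enum size = __traits(classInstanceSize, C);
    auto objects = new C[](count);
    // Each parallel iteration constructs one object in arena memory
    foreach (ref slot; parallel(objects))
        slot = emplace!C(allocate(size));
    return objects;
}

void main()
{
    enum count = 1_000_000;
    initArena(count * roundUp(__traits(classInstanceSize, C)));
    auto objects = createObjects(count);
    writeln(objects.length, " objects created");
}
```

Note that the GC does not scan malloc'ed memory, so a class with fields that point to GC-allocated data would need extra care (for example registering the block with GC.addRange). The empty class C avoids that problem.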




Re: Create many objects using threads

2014-05-06 Thread Ali Çehreli via Digitalmars-d-learn

On 05/06/2014 05:46 AM, hardcoremore wrote:

 But what exactly does it mean that the garbage collector blocks? What
 does it block, and in what way?

I know this much: The current GC that comes in D runtime is a 
single-threaded GC (aka a stop-the-world GC), meaning that all threads 
are stopped when the GC is running a garbage collection cycle.
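
One way to see the effect of collection cycles is to suspend them during an allocation-heavy phase. The sketch below assumes only core.memory's GC.disable, GC.enable and GC.collect. The function name createMany and the object count are made up for illustration. As far as I understand, this only avoids the stop-the-world pauses. Allocation itself still goes through the GC's internal locking, so it would not by itself make multi-threaded allocation scale.

```d
import core.memory : GC;
import std.stdio : writeln;

class C {}

// Hypothetical helper: create `count` objects with collections suspended.
C[] createMany(size_t count)
{
    auto objects = new C[](count);

    GC.disable();              // no automatic collection cycles from here on
    scope (exit) GC.enable();  // re-enable however the function exits

    foreach (ref slot; objects)
        slot = new C;

    return objects;
}

void main()
{
    auto objects = createMany(1_000_000);
    writeln(objects.length, " objects created");

    GC.collect();              // run a collection explicitly, at a chosen time
}
```

The runtime may still collect while disabled if it runs out of memory, so this is a hint about timing rather than a guarantee.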


 And can I use threads to create multiple instances faster, or is that just
 not possible?

My example program, which did nothing but construct objects on the GC 
heap, cannot be an indicator of the performance of all multi-threaded 
programs. In real programs there will be computation-intensive parts; 
there will be parts blocked on I/O; etc. There is no way of knowing 
without measuring.


Ali



Re: Create many objects using threads

2014-05-05 Thread Ali Çehreli via Digitalmars-d-learn

On 05/05/2014 10:25 AM, Ali Çehreli wrote:

 On 05/05/2014 10:14 AM, Caslav Sabani wrote:

   I want to have one array where I will store like 100000 objects.
  
   But I want to use 4 threads where each thread will create 25000 objects
   and store them in the array mentioned above.

 1) If it has to be a single array, meaning that all of the objects are
 in consecutive memory, you can create the array and give four slices of
 it to the four tasks.

 To do that, you can either create a proper D array filled with objects
 with .init values; or you can allocate any type of memory and create
 objects in place there.

Here is an example:

import std.stdio;
import std.parallelism;
import std.range;
import core.thread;
import std.conv;

enum elementCount = 8;
size_t elementPerThread;

static this ()
{
    assert((elementCount % totalCPUs) == 0,
           "Cannot distribute tasks to cores evenly");

    elementPerThread = elementCount / totalCPUs;
}

void main()
{
    auto arr = new int[](elementCount);

    // Each iteration fills its own slice of the array, in parallel
    foreach (i; iota(totalCPUs).parallel) {
        const beg = i * elementPerThread;
        const end = beg + elementPerThread;
        arr[beg .. end] = i.to!int;
    }

    thread_joinAll();    // (I don't think this is necessary with std.parallelism)

    writeln(arr);    // [0, 0, 1, 1, 2, 2, 3, 3] with 4 cores
}

 2) If it doesn't have to be a single array, you can have the four tasks
 create four separate arrays. You can then use them as a single range by
 std.range.chain.

That is a lie. :) chain would work, but it would have to know the total 
number of cores at compile time. Instead, joiner or join can be used:


import std.stdio;
import std.parallelism;
import std.range;
import std.algorithm;
import core.thread;

enum elementCount = 8;
size_t elementPerThread;

static this ()
{
    assert((elementCount % totalCPUs) == 0,
           "Cannot distribute tasks to cores evenly");

    elementPerThread = elementCount / totalCPUs;
}

void main()
{
    auto arr = new int[][](totalCPUs);

    // Each iteration appends only to its own inner array
    foreach (i; iota(totalCPUs).parallel) {
        foreach (e; 0 .. elementPerThread) {
            arr[i] ~= i;
        }
    }

    thread_joinAll();    // (I don't think this is necessary with std.parallelism)

    writeln(arr);           // [[0, 0], [1, 1], [2, 2], [3, 3]] with 4 cores

    writeln(arr.joiner);    // [0, 0, 1, 1, 2, 2, 3, 3]

    auto arr2 = arr.joiner.array;
    static assert(is (typeof(arr2) == int[]));
    writeln(arr2);          // [0, 0, 1, 1, 2, 2, 3, 3]

    auto arr3 = arr.join;
    static assert(is (typeof(arr3) == int[]));
    writeln(arr3);          // [0, 0, 1, 1, 2, 2, 3, 3]
}

 This option allows you to have a single array as well.

arr2 and arr3 above are examples of that.

Ali



Re: Create many objects using threads

2014-05-05 Thread Caslav Sabani via Digitalmars-d-learn

Hi Ali,


Thanks for your reply. But I am struggling to understand from 
your example where the code is that creates or spawns a new thread.



How do you create a new thread and fill the array with instantiated 
objects in that thread?




Thanks


Re: Create many objects using threads

2014-05-05 Thread Ali Çehreli via Digitalmars-d-learn

On 05/05/2014 01:38 PM, Caslav Sabani wrote:

 I am struggling to understand from your example where the code is that
 creates or spawns a new thread.

The .parallel in the foreach loop makes the body of the loop be executed 
in parallel.


 How do you create a new thread and fill the array with instantiated objects in
 that thread?

It is automatic in that example, but you can create threads explicitly with 
std.concurrency or core.thread as well.
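
As a rough illustration of the explicit route, the sketch below uses core.thread to start four threads, each of which fills its own slice of one shared array with newly created objects. The names makeWorker and createObjects and the id field are made up for this example. It shows only the mechanics. The benchmarks elsewhere in this thread suggest it may well not be faster than a single thread when the objects come from the GC heap.

```d
import core.thread : Thread;
import std.stdio : writeln;

class C
{
    int id;
    this(int id) { this.id = id; }
}

// Hypothetical helper. Building the delegate inside a function gives
// each thread its own copy of `slice` and `firstId`.
Thread makeWorker(C[] slice, int firstId)
{
    return new Thread({
        foreach (i, ref slot; slice)
            slot = new C(firstId + cast(int) i);
    });
}

C[] createObjects(int threadCount, int perThread)
{
    auto objects = new C[](threadCount * perThread);
    Thread[] workers;

    foreach (t; 0 .. threadCount)
    {
        const beg = t * perThread;
        // Slices do not overlap, so the threads never write the same slot
        auto worker = makeWorker(objects[beg .. beg + perThread], beg);
        worker.start();
        workers ~= worker;
    }

    // Wait for every thread before using the array
    foreach (worker; workers)
        worker.join();

    return objects;
}

void main()
{
    auto objects = createObjects(4, 25_000);
    writeln(objects.length, " objects, last id ", objects[$ - 1].id);
    // prints: 100000 objects, last id 99999
}
```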


Ali



Re: Create many objects using threads

2014-05-05 Thread Kapps via Digitalmars-d-learn

On Monday, 5 May 2014 at 17:14:54 UTC, Caslav Sabani wrote:

Hi,


I have just started to learn D. It's a great language. I am 
trying to achieve the following, but I am not sure whether it is 
possible or should be done at all:


I want to have one array where I will store like 100000 
objects.


But I want to use 4 threads where each thread will create 25000 
objects and store them in the array mentioned above. And all 4 
threads should be working in parallel because I have a 4-core 
processor, for example. I do not care in which order the objects 
are created, nor should the objects be aware of one another. I 
just need them stored in the array.


Can threading help in creating many objects at once?

Note that I am a beginner at working with threads, so any help is 
welcome :)



Thanks


I could be wrong here, but I think that the GC actually blocks 
when creating objects, and thus multiple threads creating 
instances would not provide a significant speedup, possibly even 
a slowdown.


You'd want to benchmark this to be certain it helps.



Re: Create many objects using threads

2014-05-05 Thread Ali Çehreli via Digitalmars-d-learn

On 05/05/2014 02:38 PM, Kapps wrote:

 I think that the GC actually blocks when
 creating objects, and thus multiple threads creating instances would not
 provide a significant speedup, possibly even a slowdown.

Wow! That is the case. :)

 You'd want to benchmark this to be certain it helps.

I did:

import std.range;
import std.parallelism;

class C
{}

void foo()
{
    auto c = new C;
}

void main(string[] args)
{
    enum totalElements = 10_000_000;

    if (args.length > 1) {
        // Any extra command-line argument selects the parallel version
        foreach (i; iota(totalElements).parallel) {
            foo();
        }

    } else {
        foreach (i; iota(totalElements)) {
            foo();
        }
    }
}

Typical run on my system for -O -noboundscheck -inline:

$ time ./deneme parallel

real    0m4.236s
user    0m4.325s
sys     0m9.795s

$ time ./deneme

real    0m0.753s
user    0m0.748s
sys     0m0.003s

Ali




Re: Create many objects using threads

2014-05-05 Thread Caslav Sabani via Digitalmars-d-learn

Hi all,


Thanks for your reply. So basically using threads in D for 
creating multiple instances of class is actually slower.



But what exactly does it mean that the garbage collector blocks? What 
does it block, and in what way?




Thanks


Re: Create many objects using threads

2014-05-05 Thread Ali Çehreli via Digitalmars-d-learn

On 05/05/2014 04:32 PM, Caslav Sabani wrote:

 So basically using threads in D for creating multiple instances of class is
 actually slower.

Not at all! That statement can be true only in certain programs. :)

Ali