[ 
https://issues.apache.org/jira/browse/SYSTEMML-1127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15691852#comment-15691852
 ] 

Matthias Boehm commented on SYSTEMML-1127:
------------------------------------------

Thanks for taking care of this issue [~fschueler]. Yes, this is indeed a bit 
tricky. First of all, for REMOTE_MR, each task has its own buffer pool, while 
for REMOTE_SPARK all cores per executor share the same (thread-safe) buffer 
pool because we use one buffer pool per process. This is both good and bad. 
The good thing is that global reads (e.g., a dataset read by all workers, which 
is often the case for hyper-parameter tuning) are performed only once. The bad 
thing is that we need to synchronize some setup procedures. Furthermore, the 
individual names of evicted matrices/frames already use a process-wide unique 
ID, so the only problem is the creation of the cache directory. I would 
recommend synchronizing the entire buffer pool setup, which includes (1) the 
creation of the cache directory, and (2) init caching. The subsequent append of 
task attempt IDs can be removed, since it would modify a global static variable 
anyway. 
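
To illustrate the suggestion, here is a minimal sketch of a synchronized, 
once-per-process buffer pool setup. It is not the actual SystemML code: the 
class BufferPoolSetup, its methods, and the "cache_" directory naming are 
hypothetical names chosen for this example only.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch, not the SystemML implementation.
public class BufferPoolSetup {
    // Process-wide state, guarded by the class lock in setup().
    private static boolean initialized = false;
    private static Path cacheDir = null;

    // Process-wide counter, so names of evicted objects are unique
    // across all threads of this process.
    private static final AtomicLong evictSeq = new AtomicLong();

    // All worker threads of an executor call this method; only the first
    // caller creates the cache directory and initializes caching, all
    // later callers get the already created directory back.
    public static synchronized Path setup(String localTmpDir, String processId)
            throws IOException {
        if (!initialized) {
            // (1) create the cache directory, named per process only
            //     (no thread ID or task attempt ID appended)
            cacheDir = Paths.get(localTmpDir, "cache_" + processId);
            Files.createDirectories(cacheDir);
            // (2) init caching
            initCaching();
            initialized = true;
        }
        return cacheDir;
    }

    // Placeholder for the real cache initialization.
    private static void initCaching() {
        evictSeq.set(0);
    }

    // Returns a name that is unique within this process.
    public static String nextEvictedName() {
        return "evicted_" + evictSeq.incrementAndGet();
    }
}
```

With this structure, concurrent callers within one process never race on the 
directory creation, and they all share a single cache directory.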

> Distributed unique IDs are not unique
> -------------------------------------
>
>                 Key: SYSTEMML-1127
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-1127
>             Project: SystemML
>          Issue Type: Bug
>          Components: ParFor
>            Reporter: Felix Schüler
>
> When executing a Spark parfor, the SparkParforWorker throws an exception 
> which states that the localtmpdir could not be created. This is due to the 
> fact that multiple executors are running multithreaded on the same worker. 
> The createDistributedUniqueID() method in IDHandler.java creates unique 
> IDs only per pid and host, not per thread. This could potentially be solved 
> by adding the threadID to the unique ID. The question is if every thread 
> should have its own cache or if the logic should be changed so that the first 
> creation will be successful and then the threads share one cache.



