[ https://issues.apache.org/jira/browse/SYSTEMML-1127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15691852#comment-15691852 ]
Matthias Boehm commented on SYSTEMML-1127:
------------------------------------------

Thanks for taking care of this issue [~fschueler]. Yes, this is indeed a bit tricky.

First of all, for REMOTE_MR, each task has its own buffer pool, while for REMOTE_SPARK all cores per executor share the same (thread-safe) buffer pool because we use one buffer pool per process. This is good and bad. The good thing is that global data (e.g., a dataset read by all workers, which is often the case for hyper-parameter tuning) is read only once. The bad thing is that we need to synchronize some setup procedures.

Furthermore, the individual names of evicted matrices/frames already use a process-wide unique ID. So the only problem is the creation of the cache directory. I would recommend synchronizing the entire buffer pool setup, which includes (1) the creation of the cache directory, and (2) init caching. The subsequent append of task attempt IDs can be removed, as it would modify a global static variable anyway.

> Distributed unique IDs are not unique
> -------------------------------------
>
>                 Key: SYSTEMML-1127
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-1127
>             Project: SystemML
>          Issue Type: Bug
>          Components: ParFor
>            Reporter: Felix Schüler
>
> When executing a Spark parfor, the SparkParforWorker throws an exception
> which states that the localtmpdir could not be created. This is due to the
> fact that multiple executors are running multithreaded on the same worker.
> The createDistributedUniqueID() method in the IDHander.java creates unique
> IDs only per pid and host, not per thread. This could potentially be solved
> by adding the threadID to the unique ID. The question is whether every thread
> should have its own cache or whether the logic should be changed so that the
> first creation will be successful and then the threads share one cache.
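
To illustrate the recommendation in the comment above (synchronize the entire buffer pool setup so that the first caller creates the cache directory and all other threads in the process reuse it), here is a minimal Java sketch. It is not the SystemML implementation: the class name BufferPoolSetup, the methods init and nextEvictionFile, and the "cache_<pid>" and "block_<n>" naming are hypothetical, and ProcessHandle assumes Java 9 or later.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of a process-wide buffer pool setup; names are
// illustrative only and do not come from the SystemML code base.
final class BufferPoolSetup {
    // Shared by all threads of the process; guarded by the class lock.
    private static Path cacheDir = null;
    // Process-wide counter so that eviction file names never collide.
    private static final AtomicLong SEQ = new AtomicLong();

    private BufferPoolSetup() {}

    // The whole setup runs under one lock: the first caller creates the
    // directory, every later caller gets the already-created path back.
    static synchronized Path init(Path baseDir) {
        if (cacheDir == null) {
            try {
                // Assumption: one directory per process, identified by pid.
                Path dir = baseDir.resolve("cache_" + ProcessHandle.current().pid());
                // (1) create the cache directory (no error if it exists)
                Files.createDirectories(dir);
                // (2) further cache initialization would go here
                cacheDir = dir;
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return cacheDir;
    }

    // Returns a file path that is unique within this process.
    static synchronized Path nextEvictionFile() {
        if (cacheDir == null) {
            throw new IllegalStateException("init() has not been called");
        }
        return cacheDir.resolve("block_" + SEQ.incrementAndGet());
    }
}
```

With this shape, no per-thread or per-task suffix is needed on the directory name: threads share one directory, and only the file names inside it have to be unique per process.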