Hello,

The variable argsList is an array defined above the parallel block. This
variable is accessed inside the map function. Launcher.main is not thread-safe.
Is it not possible to tell Spark that every folder needs to be processed as a
separate process in a separate working directory?
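
To make the idea concrete, below is a rough sketch of the kind of thing I have
in mind: inside the map over the folders (the same "add" RDD as in my snippet
further down), launch the import tool as its own JVM with its own working
directory. The jar path, main class name and directories here are placeholders,
not our real setup:

    import java.io.File;

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.function.Function;

    JavaRDD<Integer> exitCodes = add.map(new Function<File, Integer>() {
        @Override
        public Integer call(File folder) throws Exception {
            // one working directory per folder (placeholder location)
            File workDir = new File("/tmp/import-work/" + folder.getName());
            workDir.mkdirs();

            // run the import tool as a separate JVM instead of calling Launcher.main() in-process
            ProcessBuilder pb = new ProcessBuilder(
                    "java", "-cp", "/opt/tools/launcher.jar",                   // placeholder jar path
                    "com.example.Launcher",                                     // placeholder main class
                    "/home/user/software/data/" + folder.getName(),             // input folder on the shared file system
                    "/home/user/software/output/" + folder.getName() + ".csv"); // placeholder output location
            pb.directory(workDir);                                  // separate working directory per process
            pb.redirectErrorStream(true);
            pb.redirectOutput(new File(workDir, "launcher.log"));   // per-folder log of the tool's output
            return pb.start().waitFor();                            // exit code of the child process
        }
    });

That would also sidestep Launcher.main not being thread-safe, since each folder
gets its own JVM. As far as I understand, setting spark.task.cpus to the number
of cores per executor would be another way to keep an executor down to one
folder at a time, at the cost of idling the remaining cores.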

Regards
Bala
On 14-Jul-2016 2:37 pm, "Sun Rui" <sunrise_...@163.com> wrote:

> Where is argsList defined? Is Launcher.main() thread-safe? Note that if
> multiple folders are processed on a node, multiple threads may run
> concurrently in the executor, each processing a folder.
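>
> If argsList is a String[] built on the driver, each call also mutates that
> shared array, so the appended paths compound across the folders handled by one
> executor. A per-task copy would avoid that; here is a minimal sketch, using
> the names from your snippet:
>
>     String[] localArgs = argsList.clone();                   // per-task copy; the shared array stays untouched
>     localArgs[3] = localArgs[3] + "/" + folder;              // input folder for this task only
>     localArgs[7] = localArgs[7] + "/" + folder + ".csv";     // output csv for this task only
>     Launcher.main(localArgs);                                // still assumes Launcher.main() itself can run concurrently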
>
> On Jul 14, 2016, at 12:28, Balachandar R.A. <balachandar...@gmail.com>
> wrote:
>
> Hello Ted,
>
> Thanks for the response. Here is the additional information.
>
>> I am using Spark 1.6.1 (spark-1.6.1-bin-hadoop2.6).
>>
>> Here is the code snippet:
>>
>>
>> JavaRDD<File> add = jsc.parallelize(listFolders, listFolders.size());
>> JavaRDD<Integer> test = add.map(new Function<File, Integer>() {
>>     @Override
>>     public Integer call(File file) throws Exception {
>>         String folder = file.getName();
>>         System.out.println("[x] Processing dataset from the directory " + folder);
>>         int status = 0;
>>         // Full path of the input folder. It is on a shared file system that every
>>         // worker node can access, e.g. "/home/user/software/data/", and the folder
>>         // name looks like "20161307".
>>         argsList[3] = argsList[3] + "/" + folder;
>>         // Full path of the output file.
>>         argsList[7] = argsList[7] + "/" + folder + ".csv";
>>         try {
>>             // Launcher is a black box: it processes the input folder and writes a
>>             // csv file to the output location (argsList[7]), also on the shared file system.
>>             Launcher.main(argsList);
>>             status = 0;
>>         } catch (Exception e) {
>>             System.out.println("[x] Execution of import tool for the directory " + folder + " failed");
>>             status = 1;   // non-zero so a failure can be told apart from success
>>         }
>>         accum.add(1);
>>         return status;
>>     }
>> });
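>>
>> Since map is lazy, nothing actually runs until an action is called on test. As
>> an illustration (a sketch, not my exact code), something like this on the driver
>> triggers the job and counts the folders that failed:
>>
>>     long failures = test.filter(new Function<Integer, Boolean>() {
>>         @Override
>>         public Boolean call(Integer status) {
>>             return status != 0;
>>         }
>>     }).count();
>>     System.out.println("[x] " + failures + " folder(s) failed");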
>>
>>
>> Here is the spark-env.sh
>>
>> export SPARK_WORKER_INSTANCES=1
>> export JAVA_HOME=/home/work_IW1/opt/jdk1.8.0_77/
>> export HADOOP_CONF_DIR=/home/work_IW1/opt/hadoop-2.7.2/etc/hadoop
>>
>> Here is the spark-defaults.conf
>>
>>
>>   spark.master                     spark://master:7077
>>   spark.eventLog.enabled           true
>>   spark.eventLog.dir               hdfs://master:9000/sparkEvent
>>   spark.serializer                 org.apache.spark.serializer.KryoSerializer
>>   spark.driver.memory              4g
>>
>>
>
>
> Hope it helps.
>
>
>
