Correction: we found out that Spark extracts data from the MSSQL database column by column. Spark divides the data by column, then executes 10 jobs to pull the data from the MSSQL database.

Is there a way we can run those jobs in parallel, or increase/decrease the number of jobs? By what criteria does Spark decide how many jobs to run, and why 10 in particular?
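
(For reference: within a single application, Spark runs jobs concurrently only when their actions are submitted from separate threads; with spark.scheduler.mode=FAIR they then share the available cores. A minimal sketch, with hypothetical table names that are not from this thread:)

import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration
import scala.concurrent.ExecutionContext.Implicits.global

// Each count() is an action, so each Future submits its own Spark job;
// because the jobs come from separate threads, they can run at the same time.
val jobs = Seq("dbo.table_a", "dbo.table_b").map { table =>
  Future {
    spark.read.format("jdbc")
      .option("url", "jdbc:sqlserver://10.20.10.148:1433;databaseName=testdb")
      .option("dbtable", table)
      .option("user", "gpudb")
      .option("password", "...")  // scrubbed
      .load()
      .count()
  }
}
jobs.foreach(f => Await.result(f, Duration.Inf))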



Onur EKİNCİ
Bilgi Yönetimi Yöneticisi
Knowledge Management Manager

m:+90 553 044 2341  d:+90 212 329 7000

İTÜ Ayazağa Kampüsü, Teknokent ARI4 Binası, 34469 Maslak, İstanbul - Google Maps <http://www.innova.com.tr/istanbul.asp>





From: Onur EKİNCİ
Sent: Tuesday, January 16, 2018 2:16 PM
To: 'Eyal Zituny' <eyal.zit...@equalum.io>
Cc: user@spark.apache.org
Subject: RE: Run jobs in parallel in standalone mode

Hi Eyal,

Thank you for your help. The following options worked in terms of running multiple executors simultaneously. However, Spark repeats the same 10 jobs consecutively, and it had been doing that before as well. The jobs extract data from MSSQL. Why would it run the same job 10 times?

.option("numPartitions", 4)
.option("partitionColumn", "MUHASEBESUBE_KD")
.option("lowerBound", 0)
.option("upperBound", 1000)

[inline screenshot omitted]
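
(For context: with these options, Spark splits the read into numPartitions range queries on partitionColumn and runs them as parallel tasks inside one job. A hedged illustration of what Spark generates for the options above, with stride (1000 - 0) / 4 = 250:)

// Approximate per-partition queries; the first partition also picks up
// NULLs, and the outer partitions are unbounded on one side.
//   SELECT ... WHERE MUHASEBESUBE_KD < 250 OR MUHASEBESUBE_KD IS NULL
//   SELECT ... WHERE MUHASEBESUBE_KD >= 250 AND MUHASEBESUBE_KD < 500
//   SELECT ... WHERE MUHASEBESUBE_KD >= 500 AND MUHASEBESUBE_KD < 750
//   SELECT ... WHERE MUHASEBESUBE_KD >= 750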

From: Eyal Zituny [mailto:eyal.zit...@equalum.io]
Sent: Tuesday, January 16, 2018 12:13 PM
To: Onur EKİNCİ <oeki...@innova.com.tr>
Cc: Richard Qiao <richardqiao2...@gmail.com>; user@spark.apache.org
Subject: Re: Run jobs in parallel in standalone mode

Hi,

I'm not familiar with the Kinetica Spark driver, but it seems that your job has a single task, which might indicate that you have a single partition in the df. I would suggest creating the df with more partitions; this can be done by adding the following options when reading the source:

.option("numPartitions", 4)
.option("partitionColumn", "id")
.option("lowerBound", 0)
.option("upperBound", 1000)

Take a look at the Spark JDBC configuration <https://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases> for more info.
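
(A self-contained sketch of such a partitioned read, with connection details as in the original message below and the password scrubbed:)

// Partitioned JDBC read: Spark opens numPartitions connections, one per
// range of the partition column, and reads the ranges in parallel.
val df = spark.read.format("jdbc")
  .option("url", "jdbc:sqlserver://10.20.10.148:1433;databaseName=testdb")
  .option("dbtable", "dbo.temp_muh_hareket")
  .option("user", "gpudb")
  .option("password", "...")  // scrubbed
  .option("numPartitions", 4)
  .option("partitionColumn", "MUHASEBESUBE_KD")
  .option("lowerBound", 0)
  .option("upperBound", 1000)
  .load()

df.rdd.getNumPartitions  // should report 4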

You can also do df.repartition(10), but that might be less efficient since the reading from the source will not be in parallel.

Hope it will help.

Eyal




On Tue, Jan 16, 2018 at 11:01 AM, Onur EKİNCİ <oeki...@innova.com.tr> wrote:
Sorry, it is not Oracle. It is MSSQL.

Do you have any opinion on a solution? I would really appreciate it.







From: Richard Qiao [mailto:richardqiao2...@gmail.com]
Sent: Tuesday, January 16, 2018 11:59 AM
To: Onur EKİNCİ <oeki...@innova.com.tr>
Cc: user@spark.apache.org
Subject: Re: Run jobs in parallel in standalone mode

Curious that you are using "jdbc:sqlserver" to connect to Oracle; why?
Also, a kind reminder to scrub your user id and password.

Sent from my iPhone

On Jan 16, 2018, at 03:00, Onur EKİNCİ <oeki...@innova.com.tr> wrote:
Hi,

We are trying to get data from an Oracle database into a Kinetica database through Apache Spark.

We installed Spark in standalone mode and executed the following commands. However, we have tried everything and still couldn't manage to run jobs in parallel. We use 2 IBM servers, each with 128 cores and 1 TB of memory.

We also added the following to spark-defaults.conf:
spark.executor.memory=64g
spark.executor.cores=32
spark.default.parallelism=32
spark.cores.max=64
spark.scheduler.mode=FAIR
spark.sql.shuffle.partitions=32
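
(A quick way to verify these settings from the shell; a misspelled key in spark-defaults.conf is silently ignored, so this is worth checking:)

// Run inside spark-shell; SQL settings come back as strings.
spark.conf.get("spark.sql.shuffle.partitions")  // expect "32"
sc.getConf.get("spark.cores.max")               // expect "64"
sc.defaultParallelism                           // expect 32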


On the machine 10.20.10.228:

./start-master.sh --webui-port 8585
./start-slave.sh --webui-port 8586 spark://10.20.10.228:7077

On the machine 10.20.10.229:

./start-slave.sh --webui-port 8586 spark://10.20.10.228:7077

On the machine 10.20.10.228 we start the Spark shell:

spark-shell --master spark://10.20.10.228:7077

Then we make configurations:

// Read the source table over JDBC.
val df = spark.read.format("jdbc")
  .option("url", "jdbc:sqlserver://10.20.10.148:1433;databaseName=testdb")
  .option("dbtable", "dbo.temp_muh_hareket")
  .option("user", "gpudb")
  .option("password", "Kinetica2017!")
  .load()

// Load the DataFrame into Kinetica.
import com.kinetica.spark._
val lp = new LoaderParams("http://10.20.10.228:9191",
  "jdbc:simba://10.20.10.228:9292;ParentSet=MASTER",
  "muh_hareket_20", false, "", 100000, true, true,
  "admin", "Kinetica2017!", 4, true, true, 1)
SparkKineticaLoader.KineticaWriter(df, lp)


The above commands work and the data transfer completes. However, the jobs run serially, not in parallel; the executors also run serially and take turns instead of working simultaneously.

How can we make the jobs run in parallel?


[inline screenshots omitted]

I really appreciate your help. We have done everything that we could.







