Hi,
as far as I know, this is not typical behavior for Spark;
it might be related to the implementation of the Kinetica Spark connector.
You can try writing the DataFrame to a CSV instead, using
*df.write.csv("<path-to-csv-folder>")*
and see how the Spark job behaves.

Eyal

On Tue, Jan 16, 2018 at 2:19 PM, Onur EKİNCİ <oeki...@innova.com.tr> wrote:

>
>
> Correction. We found out that Spark extracts data from the MSSQL database
> column by column.  Spark divides the data by column, then executes 10 jobs
> to pull the data from the MSSQL database.
>
>
>
> Is there a way we can run those jobs in parallel, or increase/decrease
> the number of jobs?  By what criteria does Spark decide how many jobs to
> run, and why 10?
>
>
>
> Onur EKİNCİ
> Bilgi Yönetimi Yöneticisi
> Knowledge Management Manager
>
> m: +90 553 044 2341  d: +90 212 329 7000
>
> İTÜ Ayazağa Kampüsü, Teknokent ARI4 Binası 34469 Maslak İstanbul - Google
> Maps <http://www.innova.com.tr/istanbul.asp>
>
> <http://www.innova.com.tr/>
>
>
>
> *From:* Onur EKİNCİ
> *Sent:* Tuesday, January 16, 2018 2:16 PM
> *To:* 'Eyal Zituny' <eyal.zit...@equalum.io>
> *Cc:* user@spark.apache.org
> *Subject:* RE: Run jobs in parallel in standalone mode
>
>
>
> Hi Eyal,
>
> Thank you for your help. The following options worked in terms of running
> multiple executors simultaneously. However, Spark repeats the same 10
> jobs consecutively.  It had been doing this before as well. The jobs
> extract data from MSSQL. Why would it run the same job 10 times?
>
>
>
> .option("numPartitions", 4)
>
> .option("partitionColumn", "MUHASEBESUBE_KD")
>
> .option("lowerBound", 0)
>
> .option("upperBound", 1000)
>
>
>
>
>
> *From:* Eyal Zituny [mailto:eyal.zit...@equalum.io]
> *Sent:* Tuesday, January 16, 2018 12:13 PM
> *To:* Onur EKİNCİ <oeki...@innova.com.tr>
> *Cc:* Richard Qiao <richardqiao2...@gmail.com>; user@spark.apache.org
>
> *Subject:* Re: Run jobs in parallel in standalone mode
>
>
>
> Hi,
>
>
>
> I'm not familiar with the Kinetica Spark driver, but it seems that your
> job has a single task, which might indicate that you have a single
> partition in the DataFrame.
>
> I would suggest creating your DataFrame with more partitions; this can be
> done by adding the following options when reading the source:
>
>
>
> .option("numPartitions", 4)
>
> .option("partitionColumn", "id")
>
> .option("lowerBound", 0)
>
> .option("upperBound", 1000)
>
>
>
> Take a look at the Spark JDBC configuration
> <https://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases>
> for more info.
>
>
>
> You can also do df.repartition(10), but that might be less efficient since
> the reading from the source will not be in parallel.
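As a rough illustration of what those JDBC options do under the hood, here is a simplified Python model of how Spark turns numPartitions, partitionColumn, lowerBound, and upperBound into per-partition WHERE clauses. This is a sketch loosely based on Spark's JDBCRelation.columnPartition logic, not Spark's actual code; the exact clause shapes vary across Spark versions, and the function name below is illustrative:

```python
def partition_predicates(column, lower_bound, upper_bound, num_partitions):
    """Sketch of how a partitioned JDBC read is split into WHERE clauses.

    Each returned predicate corresponds to one partition (and thus one task),
    which is what lets the source table be read in parallel.
    """
    # Width of each partition's value range.
    stride = (upper_bound - lower_bound) // num_partitions
    predicates = []
    for i in range(num_partitions):
        lo = lower_bound + i * stride
        hi = lo + stride
        if i == 0:
            # First partition has no lower bound: values below lowerBound
            # and NULLs fall in here rather than being filtered out.
            predicates.append(f"{column} < {hi} or {column} is null")
        elif i == num_partitions - 1:
            # Last partition has no upper bound: values >= upperBound land here.
            predicates.append(f"{column} >= {lo}")
        else:
            predicates.append(f"{column} >= {lo} AND {column} < {hi}")
    return predicates

# With the options from this thread (4 partitions over [0, 1000), stride 250):
for p in partition_predicates("MUHASEBESUBE_KD", 0, 1000, 4):
    print(p)
```

Note that the bounds only control how the range is split; they do not filter rows. Values below lowerBound (and NULLs) fall into the first partition, and values at or above the top of the range fall into the last.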
>
>
>
> Hope it helps.
>
>
>
> Eyal
>
>
>
>
>
>
>
>
>
> On Tue, Jan 16, 2018 at 11:01 AM, Onur EKİNCİ <oeki...@innova.com.tr>
> wrote:
>
> Sorry, it is not Oracle. It is MSSQL.
>
>
>
> Do you have any opinion on a solution? I would really appreciate it.
>
>
>
>
>
> *Onur EKİNCİ*
>
>
>
> *From:* Richard Qiao [mailto:richardqiao2...@gmail.com]
> *Sent:* Tuesday, January 16, 2018 11:59 AM
> *To:* Onur EKİNCİ <oeki...@innova.com.tr>
> *Cc:* user@spark.apache.org
> *Subject:* Re: Run jobs in parallel in standalone mode
>
>
>
> Curious that you are using "jdbc:sqlserver" to connect to Oracle. Why?
>
> Also, a kind reminder to scrub your user ID and password.
>
> Sent from my iPhone
>
>
> On Jan 16, 2018, at 03:00, Onur EKİNCİ <oeki...@innova.com.tr> wrote:
>
> Hi,
>
>
>
> We are trying to get data from an Oracle database into Kinetica database
> through Apache Spark.
>
>
>
> We installed Spark in standalone mode and executed the following commands.
> However, we have tried everything and we couldn't manage to run jobs in
> parallel. We use 2 IBM servers, each of which has 128 cores and 1 TB of
> memory.
>
>
>
> We also added the following in spark-defaults.conf:
>
> spark.executor.memory=64g
>
> spark.executor.cores=32
>
> spark.default.parallelism=32
>
> spark.cores.max=64
>
> spark.scheduler.mode=FAIR
>
> spark.sql.shuffle.partitions=32
>
>
>
>
>
> *On the machine: 10.20.10.228*
>
> ./start-master.sh --webui-port 8585
>
>
>
> ./start-slave.sh --webui-port 8586 spark://10.20.10.228:7077
>
>
>
>
>
> *On the machine 10.20.10.229:*
>
> ./start-slave.sh --webui-port 8586 spark://10.20.10.228:7077
>
>
>
>
>
> *On the machine: 10.20.10.228:*
>
>
>
> We start the Spark shell:
>
>
>
> spark-shell --master spark://10.20.10.228:7077
>
>
>
> Then we make configurations:
>
>
>
> val df = spark.read.format("jdbc")
>   .option("url", "jdbc:sqlserver://10.20.10.148:1433;databaseName=testdb")
>   .option("dbtable", "dbo.temp_muh_hareket")
>   .option("user", "gpudb")
>   .option("password", "Kinetica2017!")
>   .load()
>
> import com.kinetica.spark._
>
> val lp = new LoaderParams("http://10.20.10.228:9191",
>   "jdbc:simba://10.20.10.228:9292;ParentSet=MASTER", "muh_hareket_20",
>   false, "", 100000, true, true, "admin", "Kinetica2017!", 4, true, true, 1)
>
> SparkKineticaLoader.KineticaWriter(df,lp);
>
>
>
>
>
> The above commands work successfully and the data transfer completes.
> However, the jobs run serially, not in parallel. The executors also take
> turns rather than working in parallel.
>
>
>
> How can we make jobs work in parallel?
>
>
>
>
>
> [Attached images: image002.jpg, image006.jpg, image008.jpg, image012.jpg,
> image013.jpg, image014.jpg]
>
>
>
> I really appreciate your help. We have done everything we could.
>
>
>
> *Onur EKİNCİ*
>
>
>
>
> Legal Notice:
> This email is subject to the Terms and Conditions document available via
> this link: http://www.innova.com.tr/disclaimer-yasal-uyari.asp
>
>
>
