Hi,

As far as I know, this is not typical behavior for Spark. It might be related to the implementation of the Kinetica Spark connector. You can try writing the DF to CSV instead, using *df.write.csv("<path-to-csv-folder>")*, and see how the Spark job behaves.
Eyal

On Tue, Jan 16, 2018 at 2:19 PM, Onur EKİNCİ <oeki...@innova.com.tr> wrote:

> Correction. We found out that Spark extracts data from the MSSQL database
> column by column. Spark divides the data by column, then executes 10 jobs
> to pull it from the MSSQL database.
>
> Is there a way to run those jobs in parallel, or to increase or decrease
> the number of jobs? By what criteria does Spark decide how many jobs to
> run, and why 10 in particular?
>
> Onur EKİNCİ
> Knowledge Management Manager
>
> m: +90 553 044 2341  d: +90 212 329 7000
>
> İTÜ Ayazağa Kampüsü, Teknokent ARI4 Binası 34469 Maslak İstanbul
> <http://www.innova.com.tr/>
>
> *From:* Onur EKİNCİ
> *Sent:* Tuesday, January 16, 2018 2:16 PM
> *To:* 'Eyal Zituny' <eyal.zit...@equalum.io>
> *Cc:* user@spark.apache.org
> *Subject:* RE: Run jobs in parallel in standalone mode
>
> Hi Eyal,
>
> Thank you for your help. The following options worked in terms of running
> multiple executors simultaneously. However, Spark repeats the same 10
> jobs consecutively. It was doing that before as well. The jobs extract
> data from MSSQL. Why would it run the same job 10 times?
>
> .option("numPartitions", 4)
> .option("partitionColumn", "MUHASEBESUBE_KD")
> .option("lowerBound", 0)
> .option("upperBound", 1000)
>
> *From:* Eyal Zituny [mailto:eyal.zit...@equalum.io]
> *Sent:* Tuesday, January 16, 2018 12:13 PM
> *To:* Onur EKİNCİ <oeki...@innova.com.tr>
> *Cc:* Richard Qiao <richardqiao2...@gmail.com>; user@spark.apache.org
> *Subject:* Re: Run jobs in parallel in standalone mode
>
> Hi,
>
> I'm not familiar with the Kinetica Spark driver, but it seems that your
> job has a single task, which might indicate that the DF has a single
> partition.
>
> I would suggest creating your DF with more partitions. This can be done
> by adding the following options when reading the source:
>
> .option("numPartitions", 4)
> .option("partitionColumn", "id")
> .option("lowerBound", 0)
> .option("upperBound", 1000)
>
> Take a look at the Spark JDBC configuration
> <https://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases>
> for more info.
>
> You can also do df.repartition(10), but that might be less efficient,
> since the read from the source will not be parallel.
>
> Hope this helps.
>
> Eyal
>
> On Tue, Jan 16, 2018 at 11:01 AM, Onur EKİNCİ <oeki...@innova.com.tr> wrote:
>
> > Sorry, it is not Oracle. It is MSSQL.
> >
> > Do you have any suggestions for a solution?
> > I would really appreciate it.
> >
> > *From:* Richard Qiao [mailto:richardqiao2...@gmail.com]
> > *Sent:* Tuesday, January 16, 2018 11:59 AM
> > *To:* Onur EKİNCİ <oeki...@innova.com.tr>
> > *Cc:* user@spark.apache.org
> > *Subject:* Re: Run jobs in parallel in standalone mode
> >
> > Curious: you are using "jdbc:sqlserver" to connect to Oracle. Why?
> >
> > Also, a kind reminder to scrub your user ID and password.
> >
> > On Jan 16, 2018, at 03:00, Onur EKİNCİ <oeki...@innova.com.tr> wrote:
> >
> > > Hi,
> > >
> > > We are trying to get data from an Oracle database into a Kinetica
> > > database through Apache Spark.
> > >
> > > We installed Spark in standalone mode and executed the following
> > > commands. We have tried everything, but we couldn't manage to run
> > > jobs in parallel. We use 2 IBM servers, each of which has 128 cores
> > > and 1 TB of memory.
> > >
> > > We also added the following to spark-defaults.conf:
> > >
> > > spark.executor.memory=64g
> > > spark.executor.cores=32
> > > spark.default.parallelism=32
> > > spark.cores.max=64
> > > spark.scheduler.mode=FAIR
> > > spark.sql.shuffle.partitions=32
> > >
> > > *On the machine 10.20.10.228:*
> > >
> > > ./start-master.sh --webui-port 8585
> > > ./start-slave.sh --webui-port 8586 spark://10.20.10.228:7077
> > >
> > > *On the machine 10.20.10.229:*
> > >
> > > ./start-slave.sh --webui-port 8586 spark://10.20.10.228:7077
> > >
> > > *On the machine 10.20.10.228:*
> > >
> > > We start the Spark shell:
> > >
> > > spark-shell --master spark://10.20.10.228:7077
> > >
> > > Then we run the following:
> > >
> > > val df = spark.read.format("jdbc").
> > >   option("url", "jdbc:sqlserver://10.20.10.148:1433;databaseName=testdb").
> > >   option("dbtable", "dbo.temp_muh_hareket").
> > >   option("user", "gpudb").
> > >   option("password", "Kinetica2017!").
> > >   load()
> > >
> > > import com.kinetica.spark._
> > >
> > > val lp = new LoaderParams("http://10.20.10.228:9191",
> > >   "jdbc:simba://10.20.10.228:9292;ParentSet=MASTER", "muh_hareket_20",
> > >   false, "", 100000, true, true, "admin", "Kinetica2017!", 4, true, true, 1)
> > >
> > > SparkKineticaLoader.KineticaWriter(df, lp);
> > >
> > > The above commands work successfully and the data transfer completes.
> > > However, the jobs run serially, not in parallel. The executors also
> > > work serially and take turns; they don't work in parallel.
> > >
> > > How can we make the jobs run in parallel?
> > >
> > > I really appreciate your help. We have done everything that we could.
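For reference, the limits implied by the spark-defaults.conf quoted above can be worked through with plain arithmetic. The following is a rough sketch in plain Scala, not a Spark API call. The object name SlotEstimate and its members are made up for illustration, and spark.task.cpus = 1 is an assumption, since the thread does not set it (1 is Spark's default). A stage runs one task per partition, so a JDBC read with no partitioning options, which produces a single partition, can keep only one slot busy however many cores are configured.

```scala
object SlotEstimate {
  // Values copied from the spark-defaults.conf in the thread.
  val coresMax = 64       // spark.cores.max
  val executorCores = 32  // spark.executor.cores
  // Assumption: spark.task.cpus is not set in the thread, so Spark's default of 1 is used.
  val taskCpus = 1

  // Upper bound on executors the application can hold under spark.cores.max.
  val maxExecutors: Int = coresMax / executorCores

  // Upper bound on tasks that can run at the same moment across those executors.
  val taskSlots: Int = maxExecutors * (executorCores / taskCpus)

  // A stage cannot run more tasks at once than it has partitions.
  def concurrentTasks(partitions: Int): Int = math.min(partitions, taskSlots)

  def main(args: Array[String]): Unit = {
    println(s"executors <= $maxExecutors, task slots <= $taskSlots")
    println(s"1 partition  -> ${concurrentTasks(1)} task at a time")
    println(s"4 partitions -> ${concurrentTasks(4)} tasks at a time")
  }
}
```

Under these assumptions the configured executors offer far more slots than a four-partition read can use, so the partition count of the source DataFrame, rather than the core count, is what caps the parallelism of the read.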
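To illustrate why the numPartitions, partitionColumn, lowerBound and upperBound options discussed in this thread lead to several read tasks, here is a simplified model in plain Scala of how a numeric range can be split into one WHERE clause per partition. It is a sketch modelled on the documented behavior of Spark's JDBC source, not Spark's actual code: the object JdbcRangeSketch and the function predicates are hypothetical names, and the exact SQL text Spark generates may differ between versions. As the Spark documentation notes, the bounds only decide the stride and do not filter rows, which is why the first and last ranges are left open-ended.

```scala
object JdbcRangeSketch {
  /** One WHERE clause per partition. Simplified model, not Spark's source code. */
  def predicates(column: String, lower: Long, upper: Long, numPartitions: Int): Seq[String] = {
    // With a single partition there is nothing to split, so the sketch requires at least two.
    require(numPartitions >= 2, "this sketch needs at least 2 partitions")
    require(lower < upper, "lowerBound must be below upperBound")

    // Width of each range; the bounds are used only to compute this stride.
    val stride = upper / numPartitions - lower / numPartitions

    (0 until numPartitions).map { i =>
      val from = lower + i * stride
      val to = from + stride
      if (i == 0) s"$column < $to OR $column IS NULL"       // open below, also collects NULLs
      else if (i == numPartitions - 1) s"$column >= $from"  // open above
      else s"$column >= $from AND $column < $to"
    }
  }

  def main(args: Array[String]): Unit = {
    // Option values taken from the thread.
    predicates("MUHASEBESUBE_KD", 0L, 1000L, 4).zipWithIndex.foreach {
      case (p, i) => println(s"partition $i: WHERE $p")
    }
  }
}
```

Each clause corresponds to one task reading its own slice of the table. If most values of the partition column fall outside the range between lowerBound and upperBound, they all land in the first or last slice and the read is effectively serial again, so the bounds should roughly match the real range of the column.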