Re: Spark Streaming not reading missed data

2018-01-16 Thread vijay.bvp
You are creating a new StreamingContext each time the job starts:

val streamingContext = new StreamingContext(sparkSession.sparkContext,
Seconds(config.getInt(Constants.Properties.BatchInterval)))

If you want fault tolerance, i.e. to resume reading from where the job stopped
between restarts, the correct way is to restore the StreamingContext from the
checkpoint directory:

// Get StreamingContext from checkpoint data or create a new one
val context = StreamingContext.getOrCreate(checkpointDirectory,
functionToCreateContext _)
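
A minimal sketch of what functionToCreateContext could look like for your job
(a sketch only; config, Constants.Properties and sparkSession are names taken
from your code, and the Kafka/transformation setup is abbreviated):

def functionToCreateContext(): StreamingContext = {
  // all DStream setup must happen inside this function so that getOrCreate
  // can rebuild the same graph when restoring from the checkpoint
  val ssc = new StreamingContext(sparkSession.sparkContext,
    Seconds(config.getInt(Constants.Properties.BatchInterval)))
  ssc.checkpoint(checkpointDirectory)
  // create the Kafka direct stream and wire up all transformations/outputs here
  ssc
}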

Please refer to the checkpointing section of the Spark Streaming programming
guide for how to achieve fault tolerance in case of driver failures.

hope this helps





--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



unsubscribe

2018-01-16 Thread Anu B Nair



"Got wrong record after seeking to offset" issue

2018-01-16 Thread Justin Miller
Greetings all,

I've recently started hitting the following error in Spark Streaming with
Kafka. Adjusting maxRetries and spark.streaming.kafka.consumer.poll.ms (even up
to five minutes) doesn't seem to help. The problem only manifested in the last
few days; restarting with a new consumer group seems to remedy the issue for a
few hours (< retention, which is 12 hours).

Error:
Caused by: java.lang.AssertionError: assertion failed: Got wrong record for 
spark-executor-  76 even after seeking to offset 
1759148155
at scala.Predef$.assert(Predef.scala:170)
at 
org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:85)
at 
org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:223)
at 
org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:189)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)

I guess my questions are: why is that assertion a job killer rather than just a
warning, and is there anything I can tweak, settings-wise, that might keep it at bay?

I wouldn’t be surprised if this issue were exacerbated by the volume we do on 
Kafka topics (~150k/sec on the persister that’s crashing).
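
For reference, this is roughly how I'm passing those settings (the values are
simply what I'm currently trying, not a recommendation); the commented-out line
is a knob I haven't tried yet:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  // poll timeout for the cached Kafka consumer (0-10 integration)
  .set("spark.streaming.kafka.consumer.poll.ms", "300000") // 5 minutes
// untried: disable the executor-side consumer cache that the failing
// CachedKafkaConsumer.get call belongs to
// .set("spark.streaming.kafka.consumer.cache.enabled", "false")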

Thank you!
Justin


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



spark streaming kafka not displaying data in local eclipse

2018-01-16 Thread vr spark
Hi,
I have a simple Java program that reads data from Kafka using Spark Streaming.

When I run it from Eclipse on my Mac, it connects to ZooKeeper and the
bootstrap nodes, but it does not display any data, and it does not give any
error.

It just shows:

18/01/16 20:49:15 INFO Executor: Finished task 96.0 in stage 0.0 (TID 0).
1412 bytes result sent to driver
18/01/16 20:49:15 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID
2, localhost, partition 1, ANY, 5832 bytes)
18/01/16 20:49:15 INFO Executor: Running task 1.0 in stage 0.0 (TID 2)
18/01/16 20:49:15 INFO TaskSetManager: Finished task 96.0 in stage 0.0 (TID
0) in 111 ms on localhost (1/97)
18/01/16 20:49:15 INFO KafkaRDD: Computing topic data_stream, partition 16
offsets 25624028 -> 25624097
18/01/16 20:49:15 INFO VerifiableProperties: Verifying properties
18/01/16 20:49:15 INFO VerifiableProperties: Property auto.offset.reset is
overridden to largest
18/01/16 20:49:15 INFO VerifiableProperties: Property
fetch.message.max.bytes is overridden to 20971520
18/01/16 20:49:15 INFO VerifiableProperties: Property group.id is
overridden to VR-Test-Group
18/01/16 20:49:15 INFO VerifiableProperties: Property zookeeper.connect is
overridden to zk.kafka-cluster...:8091
18/01/16 20:49:25 INFO JobScheduler: Added jobs for time 151616456 ms
18/01/16 20:49:36 INFO JobScheduler: Added jobs for time 151616457 ms
18/01/16 20:49:45 INFO JobScheduler: Added jobs for time 151616458 ms
18/01/16 20:49:55 INFO JobScheduler: Added jobs for time 151616459 ms
18/01/16 20:50:07 INFO JobScheduler: Added jobs for time 151616460 ms
18/01/16 20:50:15 INFO JobScheduler: Added jobs for time 151616461 ms

But when I export it as a jar and run it on a remote Spark cluster, it does
display the actual data.

Please suggest what could be wrong.

thanks
VR


Re: Spark Streaming not reading missed data

2018-01-16 Thread KhajaAsmath Mohammed
Sometimes I get these messages in the logs but the job still runs. Do you have
a solution on how to fix this? I have added the code in my earlier email.

Exception in thread "pool-22-thread-9" java.lang.NullPointerException

at org.apache.spark.streaming.CheckpointWriter$CheckpointWriteHandler.run(Checkpoint.scala:233)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

On Tue, Jan 16, 2018 at 3:16 PM, Jörn Franke  wrote:

> It could be a missing persist before the checkpoint
>
> > On 16. Jan 2018, at 22:04, KhajaAsmath Mohammed 
> wrote:
> >
> > Hi,
> >
> > Spark streaming job from kafka is not picking the messages and is always
> taking the latest offsets when streaming job is stopped for 2 hours. It is
> not picking up the offsets that are required to be processed from
> checkpoint directory.  any suggestions on how to process the old messages
> too when there is shutdown or planned maintenance?
> >
> >  val topics = config.getString(Constants.Properties.KafkaTopics)
> >   val topicsSet = topics.split(",").toSet
> >   val kafkaParams = Map[String, String]("metadata.broker.list" ->
> config.getString(Constants.Properties.KafkaBrokerList))
> >   val sparkSession: SparkSession = runMode match {
> > case "local" => SparkSession.builder.config(
> sparkConfig).getOrCreate
> > case "yarn"  => SparkSession.builder.config(sparkConfig).
> enableHiveSupport.getOrCreate
> >   }
> >   val streamingContext = new StreamingContext(sparkSession.sparkContext,
> Seconds(config.getInt(Constants.Properties.BatchInterval)))
> >   streamingContext.checkpoint(config.getString(Constants.
> Properties.CheckPointDir))
> >   val messages = KafkaUtils.createDirectStream[String, String,
> StringDecoder, StringDecoder](streamingContext, kafkaParams, topicsSet)
> >   val datapointDStream = messages.map(_._2).map(TransformDatapoint.
> parseDataPointText)
> >   lazy val sqlCont = sparkSession.sqlContext
> >
> >   hiveDBInstance=config.getString("hiveDBInstance")
> >
> >   TransformDatapoint.readDstreamData(sparkSession,sqlCont,
> datapointDStream, runMode, includeIndex, indexNum, datapointTmpTableName,
> fencedDPTmpTableName, fencedVINDPTmpTableName,hiveDBInstance)
> >
> >   //transformDstreamData(sparkSession,datapointDStream,
> runMode,includeIndex,indexNum,datapointTmpTableName,fencedDPTmpTableName,
> fencedVINDPTmpTableName);
> >   streamingContext.checkpoint(config.getString(Constants.
> Properties.CheckPointDir))
> >   streamingContext.start()
> >   streamingContext.awaitTermination()
> >   streamingContext.stop(stopSparkContext = true, stopGracefully =
> true)
> >
> > Thanks,
> > Asmath
>


Re: Spark Streaming not reading missed data

2018-01-16 Thread Jörn Franke
It could be a missing persist before the checkpoint
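
i.e. something along these lines before the stream is used and checkpointed (a
sketch only; the storage level is just an example):

import org.apache.spark.storage.StorageLevel

// persist the transformed DStream so that checkpointing / repeated use does
// not force a full recomputation of the lineage
datapointDStream.persist(StorageLevel.MEMORY_AND_DISK_SER)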

> On 16. Jan 2018, at 22:04, KhajaAsmath Mohammed  
> wrote:
> 
> Hi,
> 
> Spark streaming job from kafka is not picking the messages and is always 
> taking the latest offsets when streaming job is stopped for 2 hours. It is 
> not picking up the offsets that are required to be processed from checkpoint 
> directory.  any suggestions on how to process the old messages too when there 
> is shutdown or planned maintenance?
> 
>  val topics = config.getString(Constants.Properties.KafkaTopics)
>   val topicsSet = topics.split(",").toSet
>   val kafkaParams = Map[String, String]("metadata.broker.list" -> 
> config.getString(Constants.Properties.KafkaBrokerList))
>   val sparkSession: SparkSession = runMode match {
> case "local" => SparkSession.builder.config(sparkConfig).getOrCreate
> case "yarn"  => 
> SparkSession.builder.config(sparkConfig).enableHiveSupport.getOrCreate
>   }
>   val streamingContext = new StreamingContext(sparkSession.sparkContext, 
> Seconds(config.getInt(Constants.Properties.BatchInterval)))
>   
> streamingContext.checkpoint(config.getString(Constants.Properties.CheckPointDir))
>   val messages = KafkaUtils.createDirectStream[String, String, 
> StringDecoder, StringDecoder](streamingContext, kafkaParams, topicsSet)
>   val datapointDStream = 
> messages.map(_._2).map(TransformDatapoint.parseDataPointText)
>   lazy val sqlCont = sparkSession.sqlContext
>   
>   hiveDBInstance=config.getString("hiveDBInstance")
>   
>   TransformDatapoint.readDstreamData(sparkSession,sqlCont, 
> datapointDStream, runMode, includeIndex, indexNum, datapointTmpTableName, 
> fencedDPTmpTableName, fencedVINDPTmpTableName,hiveDBInstance)
>   
>   
> //transformDstreamData(sparkSession,datapointDStream,runMode,includeIndex,indexNum,datapointTmpTableName,fencedDPTmpTableName,fencedVINDPTmpTableName);
>   
> streamingContext.checkpoint(config.getString(Constants.Properties.CheckPointDir))
>   streamingContext.start()
>   streamingContext.awaitTermination()
>   streamingContext.stop(stopSparkContext = true, stopGracefully = true)
> 
> Thanks,
> Asmath

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Spark Streaming not reading missed data

2018-01-16 Thread KhajaAsmath Mohammed
Hi,

The Spark streaming job reading from Kafka is not picking up the missed
messages and always takes the latest offsets after the streaming job has been
stopped for 2 hours. It is not picking up the offsets that need to be processed
from the checkpoint directory. Any suggestions on how to also process the old
messages when there is a shutdown or planned maintenance?

val topics = config.getString(Constants.Properties.KafkaTopics)
val topicsSet = topics.split(",").toSet
val kafkaParams = Map[String, String]("metadata.broker.list" ->
  config.getString(Constants.Properties.KafkaBrokerList))
val sparkSession: SparkSession = runMode match {
  case "local" => SparkSession.builder.config(sparkConfig).getOrCreate
  case "yarn"  => SparkSession.builder.config(sparkConfig).enableHiveSupport.getOrCreate
}
val streamingContext = new StreamingContext(sparkSession.sparkContext,
  Seconds(config.getInt(Constants.Properties.BatchInterval)))
streamingContext.checkpoint(config.getString(Constants.Properties.CheckPointDir))
val messages = KafkaUtils.createDirectStream[String, String,
  StringDecoder, StringDecoder](streamingContext, kafkaParams, topicsSet)
val datapointDStream = messages.map(_._2).map(TransformDatapoint.parseDataPointText)
lazy val sqlCont = sparkSession.sqlContext

hiveDBInstance = config.getString("hiveDBInstance")

TransformDatapoint.readDstreamData(sparkSession, sqlCont,
  datapointDStream, runMode, includeIndex, indexNum, datapointTmpTableName,
  fencedDPTmpTableName, fencedVINDPTmpTableName, hiveDBInstance)

// transformDstreamData(sparkSession, datapointDStream, runMode, includeIndex, indexNum,
//   datapointTmpTableName, fencedDPTmpTableName, fencedVINDPTmpTableName);

streamingContext.checkpoint(config.getString(Constants.Properties.CheckPointDir))
streamingContext.start()
streamingContext.awaitTermination()
streamingContext.stop(stopSparkContext = true, stopGracefully = true)

Thanks,
Asmath


Null pointer exception in checkpoint directory

2018-01-16 Thread KhajaAsmath Mohammed
Hi,

I keep getting a null pointer exception in the Spark streaming job with
checkpointing. Any suggestions to resolve this?

Exception in thread "pool-22-thread-9" java.lang.NullPointerException

at org.apache.spark.streaming.CheckpointWriter$CheckpointWriteHandler.run(Checkpoint.scala:233)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Exception in thread "pool-22-thread-10" java.lang.NullPointerException

at org.apache.spark.streaming.CheckpointWriter$CheckpointWriteHandler.run(Checkpoint.scala:233)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Exception in thread "pool-22-thread-11" java.lang.NullPointerException

at org.apache.spark.streaming.CheckpointWriter$CheckpointWriteHandler.run(Checkpoint.scala:233)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Exception in thread "pool-22-thread-12" java.lang.NullPointerException

at org.apache.spark.streaming.CheckpointWriter$CheckpointWriteHandler.run(Checkpoint.scala:233)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Thanks,
Asmath


Mail List Daily Archive

2018-01-16 Thread dmp
Hello,

Upon joining this mailing list, it seems I did not receive a link to configure
options. Is there a way to set up a daily archive of correspondence, like
Mailman offers?


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



unsubscribe

2018-01-16 Thread karthik
unsubscribe


Re: [Spark ML] Positive-Only Training Classification in Scala

2018-01-16 Thread Matt Hicks
If I try to use LogisticRegression with only positive training data, it always
gives me positive results:

Positive Only:

private def positiveOnly(): Unit = {
  val training = spark.createDataFrame(Seq(
    (1.0, Vectors.dense(0.0, 1.1, 0.1)),
    (1.0, Vectors.dense(0.0, 1.0, -1.0)),
    (1.0, Vectors.dense(0.2, 1.3, 1.0)),
    (1.0, Vectors.dense(0.1, 1.2, -0.5)))).toDF("label", "features")
  val lr = new LogisticRegression()
  lr.setMaxIter(10).setRegParam(0.01)
  val model = lr.fit(training)
  val test = spark.createDataFrame(Seq(
    (1.0, Vectors.dense(-1.0, 1.5, 1.3)),
    (0.0, Vectors.dense(3.0, 2.0, -0.1)),
    (1.0, Vectors.dense(0.0, 2.2, -1.5)))).toDF("label", "features")
  model.transform(test)
    .select("features", "label", "probability", "prediction")
    .collect()
    .foreach { case Row(features: Vector, label: Double, prob: Vector, prediction: Double) =>
      println(s"($features, $label) -> prob=$prob, prediction=$prediction")
    }
}



The results look like this:
[info] ([-1.0,1.5,1.3], 1.0) -> prob=[0.0,1.0], prediction=1.0
[info] ([3.0,2.0,-0.1], 0.0) -> prob=[0.0,1.0], prediction=1.0
[info] ([0.0,2.2,-1.5], 1.0) -> prob=[0.0,1.0], prediction=1.0
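
For reference, a minimal sketch of one common workaround (a simple
positive/unlabeled heuristic: treat a sample of unlabeled contacts as tentative
negatives so the model has two classes to separate). The feature vectors below
are purely illustrative:

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors

// positives: known donors, labeled 1.0
val positives = spark.createDataFrame(Seq(
  (1.0, Vectors.dense(0.0, 1.1, 0.1)),
  (1.0, Vectors.dense(0.2, 1.3, 1.0)))).toDF("label", "features")

// unlabeled contacts treated as tentative negatives, labeled 0.0
val unlabeled = spark.createDataFrame(Seq(
  (0.0, Vectors.dense(3.0, 2.0, -0.1)),
  (0.0, Vectors.dense(2.5, -1.0, 0.4)))).toDF("label", "features")

val training = positives.union(unlabeled)

val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)
val model = lr.fit(training)
// the "probability" column of model.transform(...) can then be used to rank contacts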





On Tue, Jan 16, 2018 8:51 AM, Matt Hicks m...@outr.com  wrote:
Hi Hari, I'm not sure I understand.  I apologize, I'm still pretty new to Spark
and Spark ML.  Can you point me to some example code or documentation that would
more fully represent this?
Thanks  





On Tue, Jan 16, 2018 2:54 AM, hosur narahari hnr1...@gmail.com  wrote:
You can make use of the probability vector from Spark classification. When you run
a Spark classification model for prediction, along with classifying each example into
its class, Spark also gives a probability vector (the probability that the example
belongs to each individual class). So just take the probability corresponding to
the donor class; that is the probability that a person will become a donor.
Best Regards,
Hari
On 15 Jan 2018 11:51 p.m., "Matt Hicks"  wrote:
I'm attempting to create a training classification, but only have positive
information.  Specifically in this case it is a donor list of users, but I want
to use it as training in order to determine classification for new contacts to
give probabilities that they will donate.
Any insights or links are appreciated. I've gone through the documentation but
have been unable to find any references to how I might do this.
Thanks
---
Matt Hicks

Chief Technology Officer

405.283.6887 | http://outr.com

Re: [Spark ML] Positive-Only Training Classification in Scala

2018-01-16 Thread Matt Hicks
Hi Hari, I'm not sure I understand.  I apologize, I'm still pretty new to
Spark and Spark ML.  Can you point me to some example code or documentation that
would more fully represent this?
Thanks  





On Tue, Jan 16, 2018 2:54 AM, hosur narahari hnr1...@gmail.com  wrote:
You can make use of the probability vector from Spark classification. When you run
a Spark classification model for prediction, along with classifying each example into
its class, Spark also gives a probability vector (the probability that the example
belongs to each individual class). So just take the probability corresponding to
the donor class; that is the probability that a person will become a donor.
Best Regards,
Hari
On 15 Jan 2018 11:51 p.m., "Matt Hicks"  wrote:
I'm attempting to create a training classification, but only have positive
information.  Specifically in this case it is a donor list of users, but I want
to use it as training in order to determine classification for new contacts to
give probabilities that they will donate.
Any insights or links are appreciated. I've gone through the documentation but
have been unable to find any references to how I might do this.
Thanks
---
Matt Hicks

Chief Technology Officer

405.283.6887 | http://outr.com

Re: Run jobs in parallel in standalone mode

2018-01-16 Thread Eyal Zituny
Hi,
As far as I know, this is not typical behavior for Spark; it might be related
to the implementation of the Kinetica Spark connector.
You can try writing the DF to a CSV instead, using
df.write.csv("")
and see how the Spark job behaves.
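
A concrete way to run that check (the output path is just an example), comparing
the number of tasks of the write stage in the Spark UI with what you see for the
Kinetica load:

// quick parallelism check, independent of the Kinetica writer
df.write.mode("overwrite").csv("/tmp/parallelism_check")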

Eyal

On Tue, Jan 16, 2018 at 2:19 PM, Onur EKİNCİ  wrote:

>
>
> Correction. We found out that Spark extracts data from mssql database
> column by column.  Spark divides data by column. Then it executes 10 jobs
> to pull data from mssql database.
>
>
>
> Is there a way that we can run those jobs in parallel or increse/decrease
> the number of jobs?  According to what criteria does Spark run jobs
> ,especially 10 jobs?
>
>
>
> Onur EKİNCİ
> Bilgi Yönetimi Yöneticisi
> Knowledge Management Manager
>
> m: +90 553 044 2341  d: +90 212 329 7000
>
> İTÜ Ayazağa Kampüsü, Teknokent ARI4 Binası 34469 Maslak İstanbul - Google Maps
>
>
>
> *From:* Onur EKİNCİ
> *Sent:* Tuesday, January 16, 2018 2:16 PM
> *To:* 'Eyal Zituny' 
> *Cc:* user@spark.apache.org
> *Subject:* RE: Run jobs in parallel in standalone mode
>
>
>
> Hi Eyal,
>
> Thank you for your help. The following commands worked in terms of running
> multiple executors simultaneously. However, Spark repeats the 10 same
> jobs consecutively.  It had been doing it before as well. The jobs are
> extracting data from Mssql. Why would it run the same job 10 times?
>
>
>
> .option("numPartitions", 4)
>
> .option("partitionColumn", "MUHASEBESUBE_KD")
>
> .option("lowerBound", 0)
>
> .option("upperBound", 1000)
>
>
>
>
>
> *From:* Eyal Zituny [mailto:eyal.zit...@equalum.io
> ]
> *Sent:* Tuesday, January 16, 2018 12:13 PM
> *To:* Onur EKİNCİ 
> *Cc:* Richard Qiao ; user@spark.apache.org
>
> *Subject:* Re: Run jobs in parallel in standalone mode
>
>
>
> hi,
>
>
>
> I'm not familiar with the Kinetica spark driver, but it seems that your
> job has a single task which might indicate that you have a single partition
> in the df
>
> i would suggest to try to create your df with more partitions, this can be
> done by adding the following options when reading the source:
>
>
>
> .option("numPartitions", 4)
>
> .option("partitionColumn", "id")
>
> .option("lowerBound", 0)
>
> .option("upperBound", 1000)
>
>
>
> take a look at the spark jdbc configuration
> 
>  for
> more info
>
>
>
> you can also do df.repartition(10) but that might be less efficient since
> the reading from the source will not be in parallel
>
>
>
> hope it will help
>
>
>
> Eyal
>
>
>
>
>
>
>
>
>
> On Tue, Jan 16, 2018 at 11:01 AM, Onur EKİNCİ 
> wrote:
>
> Sorry it is not oracle. It is Mssql.
>
>
>
> Do you have any opinion for the solution. I really appreciate
>
>
>
>
>
> *Onur EKİNCİ*
> Bilgi Yönetimi Yöneticisi
> Knowledge Management Manager
>
> m: +90 553 044 2341  d: +90 212 329 7000
>
> İTÜ Ayazağa Kampüsü, Teknokent ARI4 Binası 34469 Maslak İstanbul - Google Maps
>
>
>
> *From:* Richard Qiao [mailto:richardqiao2...@gmail.com]
> *Sent:* Tuesday, January 16, 2018 11:59 AM
> *To:* Onur EKİNCİ 
> *Cc:* user@spark.apache.org
> *Subject:* Re: Run jobs in parallel in standalone mode
>
>
>
> Curious you are using"jdbc:sqlserve" to connect oracle, why?
>
> Also kindly reminder scrubbing your user id password.
>
> Sent from my iPhone
>
>
> On Jan 16, 2018, at 03:00, Onur EKİNCİ  wrote:
>
> Hi,
>
>
>
> We are trying to get data from an Oracle database into Kinetica database
> through Apache Spark.
>
>
>
> We installed Spark in standalone mode. We executed the following commands.
> However, we have tried everything but we couldnt manage to run jobs in
> parallel. We use 2 IBM servers each of which has 128cores and 1TB memory.
>
>
>
> We also added  in the spark-defaults.conf  :
>
> spark.executor.memory=64g
>
> spark.executor.cores=32
>
> spark.default.parallelism=32
>
> spark.cores.max=64
>
> spark.scheduler.mode=FAIR
>
> spark.sql.shuffle.partions=32
>
>
>
>
>
> *On the machine: 10.20.10.228*
>
> ./start-master.sh --webui-port 8585
>
>
>
> ./start-slave.sh --webui-port 8586 spark://10.20.10.228:7077
>
>
>
>
>
> *On the machine 10.20.10.229 :*
>
> ./start-slave.sh --webui-port 8586 spark://10.20.10.228:7077
>
>
>
>
>
> *On the machine: 10.20.10.228 :*
>
>
>
> We start the Spark shell:
>
>
>
> spark-shell --master spark://10.20.10.228:7077
>
>
>
> Then we make configurations:
>
>
>
> val df  = 

unsubscribe

2018-01-16 Thread Jose Pedro de Santana Neto
unsubscribe


unsubscribe

2018-01-16 Thread Muhammad Yaseen Aftab
unsubscribe


RE: Run jobs in parallel in standalone mode

2018-01-16 Thread Onur EKİNCİ

Correction: we found out that Spark extracts data from the MSSQL database
column by column. Spark divides the data by column and then executes 10 jobs to
pull the data from the MSSQL database.

Is there a way we can run those jobs in parallel, or increase/decrease the
number of jobs? According to what criteria does Spark run the jobs, and why 10
jobs in particular?



Onur EKİNCİ
Bilgi Yönetimi Yöneticisi
Knowledge Management Manager

m:+90 553 044 2341  d:+90 212 329 7000

İTÜ Ayazağa Kampüsü, Teknokent ARI4 Binası 34469 Maslak İstanbul - Google 
Maps





From: Onur EKİNCİ
Sent: Tuesday, January 16, 2018 2:16 PM
To: 'Eyal Zituny' 
Cc: user@spark.apache.org
Subject: RE: Run jobs in parallel in standalone mode

Hi Eyal,
Thank you for your help. The following commands worked in terms of running
multiple executors simultaneously. However, Spark repeats the same 10 jobs
consecutively; it had been doing that before as well. The jobs are extracting
data from MSSQL. Why would it run the same job 10 times?

.option("numPartitions", 4)
.option("partitionColumn", "MUHASEBESUBE_KD")
.option("lowerBound", 0)
.option("upperBound", 1000)

[Spark UI screenshot omitted]

From: Eyal Zituny [mailto:eyal.zit...@equalum.io]
Sent: Tuesday, January 16, 2018 12:13 PM
To: Onur EKİNCİ >
Cc: Richard Qiao >; 
user@spark.apache.org
Subject: Re: Run jobs in parallel in standalone mode

hi,

I'm not familiar with the Kinetica spark driver, but it seems that your job has 
a single task which might indicate that you have a single partition in the df
i would suggest to try to create your df with more partitions, this can be done 
by adding the following options when reading the source:

.option("numPartitions", 4)
.option("partitionColumn", "id")
.option("lowerBound", 0)
.option("upperBound", 1000)

take a look at the spark jdbc 
configuration
 for more info

you can also do df.repartition(10) but that might be less efficient since the 
reading from the source will not be in parallel

hope it will help

Eyal




On Tue, Jan 16, 2018 at 11:01 AM, Onur EKİNCİ 
> wrote:
Sorry it is not oracle. It is Mssql.

Do you have any opinion for the solution. I really appreciate




Onur EKİNCİ
Bilgi Yönetimi Yöneticisi
Knowledge Management Manager

m:+90 553 044 2341  d:+90 212 329 
7000

İTÜ Ayazağa Kampüsü, Teknokent ARI4 Binası 34469 Maslak İstanbul - Google 
Maps





From: Richard Qiao 
[mailto:richardqiao2...@gmail.com]
Sent: Tuesday, January 16, 2018 11:59 AM
To: Onur EKİNCİ >
Cc: user@spark.apache.org
Subject: Re: Run jobs in parallel in standalone mode

Curious you are using"jdbc:sqlserve" to connect oracle, why?
Also kindly reminder scrubbing your user id password.

Sent from my iPhone

On Jan 16, 2018, at 03:00, Onur EKİNCİ 
> wrote:
Hi,

We are trying to get data from an Oracle database into Kinetica database 
through Apache Spark.

We installed Spark in standalone mode. We executed the following commands. 
However, we have tried everything but we couldnt manage to run jobs in 
parallel. We use 2 IBM servers each of which has 128cores and 1TB memory.

We also added  in the spark-defaults.conf  :
spark.executor.memory=64g
spark.executor.cores=32
spark.default.parallelism=32
spark.cores.max=64
spark.scheduler.mode=FAIR
spark.sql.shuffle.partions=32


On the machine: 10.20.10.228
./start-master.sh --webui-port 8585

./start-slave.sh --webui-port 8586 
spark://10.20.10.228:7077


On the machine 10.20.10.229:
./start-slave.sh --webui-port 8586 
spark://10.20.10.228:7077


On the machine: 10.20.10.228:

We start the Spark shell:

spark-shell --master spark://10.20.10.228:7077

Then we make configurations:

val df  = spark.read.format("jdbc").option("url", 
"jdbc:sqlserver://10.20.10.148:1433;databaseName=testdb").option("dbtable",
 "dbo.temp_muh_hareket").option("user", "gpudb").option("password", 
"Kinetica2017!").load()
import com.kinetica.spark._
val lp = new LoaderParams("http://10.20.10.228:9191;, 
"jdbc:simba://10.20.10.228:9292;ParentSet=MASTER", 

RE: Run jobs in parallel in standalone mode

2018-01-16 Thread Onur EKİNCİ
Hi Eyal,
Thank you for your help. The following commands worked in terms of running
multiple executors simultaneously. However, Spark repeats the same 10 jobs
consecutively; it had been doing that before as well. The jobs are extracting
data from MSSQL. Why would it run the same job 10 times?

.option("numPartitions", 4)
.option("partitionColumn", "MUHASEBESUBE_KD")
.option("lowerBound", 0)
.option("upperBound", 1000)

[Spark UI screenshot omitted]




Onur EKİNCİ
Bilgi Yönetimi Yöneticisi
Knowledge Management Manager

m:+90 553 044 2341  d:+90 212 329 7000

İTÜ Ayazağa Kampüsü, Teknokent ARI4 Binası 34469 Maslak İstanbul - Google 
Maps





From: Eyal Zituny [mailto:eyal.zit...@equalum.io]
Sent: Tuesday, January 16, 2018 12:13 PM
To: Onur EKİNCİ 
Cc: Richard Qiao ; user@spark.apache.org
Subject: Re: Run jobs in parallel in standalone mode

hi,

I'm not familiar with the Kinetica spark driver, but it seems that your job has 
a single task which might indicate that you have a single partition in the df
i would suggest to try to create your df with more partitions, this can be done 
by adding the following options when reading the source:

.option("numPartitions", 4)
.option("partitionColumn", "id")
.option("lowerBound", 0)
.option("upperBound", 1000)

take a look at the spark jdbc 
configuration
 for more info

you can also do df.repartition(10) but that might be less efficient since the 
reading from the source will not be in parallel

hope it will help

Eyal




On Tue, Jan 16, 2018 at 11:01 AM, Onur EKİNCİ 
> wrote:
Sorry it is not oracle. It is Mssql.

Do you have any opinion for the solution. I really appreciate




Onur EKİNCİ
Bilgi Yönetimi Yöneticisi
Knowledge Management Manager

m:+90 553 044 2341  d:+90 212 329 
7000

İTÜ Ayazağa Kampüsü, Teknokent ARI4 Binası 34469 Maslak İstanbul - Google 
Maps





From: Richard Qiao 
[mailto:richardqiao2...@gmail.com]
Sent: Tuesday, January 16, 2018 11:59 AM
To: Onur EKİNCİ >
Cc: user@spark.apache.org
Subject: Re: Run jobs in parallel in standalone mode

Curious you are using"jdbc:sqlserve" to connect oracle, why?
Also kindly reminder scrubbing your user id password.

Sent from my iPhone

On Jan 16, 2018, at 03:00, Onur EKİNCİ 
> wrote:
Hi,

We are trying to get data from an Oracle database into Kinetica database 
through Apache Spark.

We installed Spark in standalone mode. We executed the following commands. 
However, we have tried everything but we couldnt manage to run jobs in 
parallel. We use 2 IBM servers each of which has 128cores and 1TB memory.

We also added  in the spark-defaults.conf  :
spark.executor.memory=64g
spark.executor.cores=32
spark.default.parallelism=32
spark.cores.max=64
spark.scheduler.mode=FAIR
spark.sql.shuffle.partions=32


On the machine: 10.20.10.228
./start-master.sh --webui-port 8585

./start-slave.sh --webui-port 8586 
spark://10.20.10.228:7077


On the machine 10.20.10.229:
./start-slave.sh --webui-port 8586 
spark://10.20.10.228:7077


On the machine: 10.20.10.228:

We start the Spark shell:

spark-shell --master spark://10.20.10.228:7077

Then we make configurations:

val df = spark.read.format("jdbc")
  .option("url", "jdbc:sqlserver://10.20.10.148:1433;databaseName=testdb")
  .option("dbtable", "dbo.temp_muh_hareket")
  .option("user", "gpudb")
  .option("password", "Kinetica2017!")
  .load()
import com.kinetica.spark._
val lp = new LoaderParams("http://10.20.10.228:9191", "jdbc:simba://10.20.10.228:9292;ParentSet=MASTER",
  "muh_hareket_20", false, "", 10, true, true, "admin", "Kinetica2017!", 4, true, true, 1)
SparkKineticaLoader.KineticaWriter(df, lp);


The above commands successfully work. The data transfer completes. However,
jobs work serially, not in parallel. The executors also work serially and take
turns; they don't work in parallel.

How can we make jobs work in parallel?

[Spark UI screenshots omitted]

I really appreciate your help. We have done everything that we could.



Onur EKİNCİ
Bilgi Yönetimi Yöneticisi
Knowledge Management Manager

m:+90 553 044 2341  d:+90 212 329 
7000

İTÜ Ayazağa 

Re: Run jobs in parallel in standalone mode

2018-01-16 Thread Richard Qiao
Two points to consider:
1. Check the SQL Server / Simba max connection number.
2. Allocate 3-5 cores per executor and allocate more executors (rough example below).
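
A rough example of what that could look like in spark-defaults.conf instead of
the current 32-core / 64g executors (numbers are illustrative, not tuned):

spark.executor.cores=4
spark.executor.memory=16g
spark.cores.max=64
spark.default.parallelism=64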

Sent from my iPhone

> On Jan 16, 2018, at 04:01, Onur EKİNCİ  wrote:
> 
> Sorry it is not oracle. It is Mssql.
>  
> Do you have any opinion for the solution. I really appreciate
>  
>  
> 
> Onur EKİNCİ
> Bilgi Yönetimi Yöneticisi
> Knowledge Management Manager
>  
> m:+90 553 044 2341  d:+90 212 329 7000
>  
> İTÜ Ayazağa Kampüsü, Teknokent ARI4 Binası 34469 Maslak İstanbul - Google Maps
> 
>  
> 
> From: Richard Qiao [mailto:richardqiao2...@gmail.com] 
> Sent: Tuesday, January 16, 2018 11:59 AM
> To: Onur EKİNCİ 
> Cc: user@spark.apache.org
> Subject: Re: Run jobs in parallel in standalone mode
>  
> Curious you are using"jdbc:sqlserve" to connect oracle, why?
> Also kindly reminder scrubbing your user id password.
> 
> Sent from my iPhone
> 
> On Jan 16, 2018, at 03:00, Onur EKİNCİ  wrote:
> 
> Hi,
>  
> We are trying to get data from an Oracle database into Kinetica database 
> through Apache Spark.
>  
> We installed Spark in standalone mode. We executed the following commands. 
> However, we have tried everything but we couldnt manage to run jobs in 
> parallel. We use 2 IBM servers each of which has 128cores and 1TB memory.
>  
> We also added  in the spark-defaults.conf  :
> spark.executor.memory=64g
> spark.executor.cores=32
> spark.default.parallelism=32
> spark.cores.max=64
> spark.scheduler.mode=FAIR
> spark.sql.shuffle.partions=32
>  
>  
> On the machine: 10.20.10.228
> ./start-master.sh --webui-port 8585
>  
> ./start-slave.sh --webui-port 8586 spark://10.20.10.228:7077
>  
>  
> On the machine 10.20.10.229:
> ./start-slave.sh --webui-port 8586 spark://10.20.10.228:7077
>  
>  
> On the machine: 10.20.10.228:
>  
> We start the Spark shell:
>  
> spark-shell --master spark://10.20.10.228:7077
>  
> Then we make configurations:
>  
> val df  = spark.read.format("jdbc").option("url", 
> "jdbc:sqlserver://10.20.10.148:1433;databaseName=testdb").option("dbtable", 
> "dbo.temp_muh_hareket").option("user", "gpudb").option("password", 
> "Kinetica2017!").load()
> import com.kinetica.spark._
> val lp = new LoaderParams("http://10.20.10.228:9191;, 
> "jdbc:simba://10.20.10.228:9292;ParentSet=MASTER", "muh_hareket_20", 
> false,"",10,true,true,"admin","Kinetica2017!",4, true, true, 1)
> SparkKineticaLoader.KineticaWriter(df,lp);
>  
>  
> The above commands successfully work. The data transfer completes. However, 
> jobs work serially not in parallel. Also executors work serially and take 
> turns. They donw work in parallel.
>  
> How can we make jobs work in parallel?
>
> [Spark UI screenshots omitted]
>
> I really appreciate your help. We have done everything that we could.
>  
> 
> Onur EKİNCİ
> Bilgi Yönetimi Yöneticisi
> Knowledge Management Manager
>  
> m:+90 553 044 2341  d:+90 212 329 7000
>  
> İTÜ Ayazağa Kampüsü, Teknokent ARI4 Binası 34469 Maslak İstanbul - Google Maps
> 
>  
> 
> 
> Yasal Uyarı :
> Bu elektronik posta işbu linki kullanarak ulaşabileceğiniz Koşul ve Şartlar 
> dokümanına tabidir :
> http://www.innova.com.tr/disclaimer-yasal-uyari.asp


Re: Inner join with the table itself

2018-01-16 Thread Michael Shtelma
Hi Jacek,

Thank you for the workaround.
It really works this way:
pos.as("p1").join(pos.as("p2")).filter($"p1.POSITION_ID0" === $"p2.POSITION_ID")
I have checked that this way I get the same execution plan as for the join with
renamed columns.
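
For reference, comparing the plans is just a matter of calling explain() on
both variants (pos3 being the DataFrame with renamed columns from my earlier
message; assumes spark.implicits._ is imported):

pos.as("p1").join(pos.as("p2"), $"p1.POSITION_ID0" === $"p2.POSITION_ID").explain()
pos.join(pos3, pos("POSITION_ID0") === pos3("POSITION_ID_2")).explain()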

Best,
Michael


On Mon, Jan 15, 2018 at 10:33 PM, Jacek Laskowski  wrote:
> Hi Michael,
>
> scala> spark.version
> res0: String = 2.4.0-SNAPSHOT
>
> scala> val r1 = spark.range(1)
> r1: org.apache.spark.sql.Dataset[Long] = [id: bigint]
>
> scala> r1.as("left").join(r1.as("right")).filter($"left.id" ===
> $"right.id").show
> +---+---+
> | id| id|
> +---+---+
> |  0|  0|
> +---+---+
>
> Am I missing something? When aliasing a table, use the identifier in column
> refs (inside).
>
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://about.me/JacekLaskowski
> Mastering Spark SQL https://bit.ly/mastering-spark-sql
> Spark Structured Streaming https://bit.ly/spark-structured-streaming
> Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
> Follow me at https://twitter.com/jaceklaskowski
>
> On Mon, Jan 15, 2018 at 3:26 PM, Michael Shtelma  wrote:
>>
>> Hi Jacek & Gengliang,
>>
>> let's take a look at the following query:
>>
>> val pos = spark.read.parquet(prefix + "POSITION.parquet")
>> pos.createOrReplaceTempView("POSITION")
>> spark.sql("SELECT  POSITION.POSITION_ID  FROM POSITION POSITION JOIN
>> POSITION POSITION1 ON POSITION.POSITION_ID0 = POSITION1.POSITION_ID
>> ").collect()
>>
>> This query is working for me right now using spark 2.2.
>>
>> Now we can try implementing the same logic with DataFrame API:
>>
>> pos.join(pos, pos("POSITION_ID0")===pos("POSITION_ID")).collect()
>>
>> I am getting the following error:
>>
>> "Join condition is missing or trivial.
>>
>> Use the CROSS JOIN syntax to allow cartesian products between these
>> relations.;"
>>
>> I have tried using alias function, but without success:
>>
>> val pos2 = pos.alias("P2")
>> pos.join(pos2, pos("POSITION_ID0")===pos2("POSITION_ID")).collect()
>>
>> This also leads us to the same error.
>> Am  I missing smth about the usage of alias?
>>
>> Now let's rename the columns:
>>
>> val pos3 = pos.toDF(pos.columns.map(_ + "_2"): _*)
>> pos.join(pos3, pos("POSITION_ID0")===pos3("POSITION_ID_2")).collect()
>>
>> It works!
>>
>> There is one more really odd thing about all this: a colleague of mine
>> has managed to get the same exception ("Join condition is missing or
>> trivial") also using original SQL query, but I think he has been using
>> empty tables.
>>
>> Thanks,
>> Michael
>>
>>
>> On Mon, Jan 15, 2018 at 11:27 AM, Gengliang Wang
>>  wrote:
>> > Hi Michael,
>> >
>> > You can use `Explain` to see how your query is optimized.
>> >
>> > https://docs.databricks.com/spark/latest/spark-sql/language-manual/explain.html
>> > I believe your query is an actual cross join, which is usually very slow
>> > in
>> > execution.
>> >
>> > To get rid of this, you can set `spark.sql.crossJoin.enabled` as true.
>> >
>> >
>> > On 15 Jan 2018, at 6:09 PM, Jacek Laskowski wrote:
>> >
>> > Hi Michael,
>> >
>> > -dev +user
>> >
>> > What's the query? How do you "fool spark"?
>> >
>> > Pozdrawiam,
>> > Jacek Laskowski
>> > 
>> > https://about.me/JacekLaskowski
>> > Mastering Spark SQL https://bit.ly/mastering-spark-sql
>> > Spark Structured Streaming https://bit.ly/spark-structured-streaming
>> > Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
>> > Follow me at https://twitter.com/jaceklaskowski
>> >
>> > On Mon, Jan 15, 2018 at 10:23 AM, Michael Shtelma 
>> > wrote:
>> >>
>> >> Hi all,
>> >>
>> >> If I try joining the table with itself using join columns, I am
>> >> getting the following error:
>> >> "Join condition is missing or trivial. Use the CROSS JOIN syntax to
>> >> allow cartesian products between these relations.;"
>> >>
>> >> This is not true, and my join is not trivial and is not a real cross
>> >> join. I am providing join condition and expect to get maybe a couple
>> >> of joined rows for each row in the original table.
>> >>
>> >> There is a workaround for this, which implies renaming all the columns
>> >> in source data frame and only afterwards proceed with the join. This
>> >> allows us to fool spark.
>> >>
>> >> Now I am wondering if there is a way to get rid of this problem in a
>> >> better way? I do not like the idea of renaming the columns because
>> >> this makes it really difficult to keep track of the names in the
>> >> columns in result data frames.
>> >> Is it possible to deactivate this check?
>> >>
>> >> Thanks,
>> >> Michael
>> >>
>> >> -
>> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> >>
>> >
>> >
>
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Run jobs in parallel in standalone mode

2018-01-16 Thread Eyal Zituny
hi,

I'm not familiar with the Kinetica Spark driver, but it seems that your job has
a single task, which might indicate that you have a single partition in the DF.
I would suggest trying to create your DF with more partitions; this can be done
by adding the following options when reading the source:

.option("numPartitions", 4)
.option("partitionColumn", "id")
.option("lowerBound", 0)
.option("upperBound", 1000)

Take a look at the Spark JDBC configuration section of the documentation for
more info.

you can also do df.repartition(10) but that might be less efficient since
the reading from the source will not be in parallel

hope it will help

Eyal




On Tue, Jan 16, 2018 at 11:01 AM, Onur EKİNCİ  wrote:

> Sorry it is not oracle. It is Mssql.
>
>
>
> Do you have any opinion for the solution. I really appreciate
>
>
>
>
>
> Onur EKİNCİ
> Bilgi Yönetimi Yöneticisi
> Knowledge Management Manager
>
> m: +90 553 044 2341  d: +90 212 329 7000
>
> İTÜ Ayazağa Kampüsü, Teknokent ARI4 Binası 34469 Maslak İstanbul - Google Maps
>
>
>
> *From:* Richard Qiao [mailto:richardqiao2...@gmail.com]
> *Sent:* Tuesday, January 16, 2018 11:59 AM
> *To:* Onur EKİNCİ 
> *Cc:* user@spark.apache.org
> *Subject:* Re: Run jobs in parallel in standalone mode
>
>
>
> Curious you are using"jdbc:sqlserve" to connect oracle, why?
>
> Also kindly reminder scrubbing your user id password.
>
> Sent from my iPhone
>
>
> On Jan 16, 2018, at 03:00, Onur EKİNCİ  wrote:
>
> Hi,
>
>
>
> We are trying to get data from an Oracle database into Kinetica database
> through Apache Spark.
>
>
>
> We installed Spark in standalone mode. We executed the following commands.
> However, we have tried everything but we couldnt manage to run jobs in
> parallel. We use 2 IBM servers each of which has 128cores and 1TB memory.
>
>
>
> We also added  in the spark-defaults.conf  :
>
> spark.executor.memory=64g
>
> spark.executor.cores=32
>
> spark.default.parallelism=32
>
> spark.cores.max=64
>
> spark.scheduler.mode=FAIR
>
> spark.sql.shuffle.partions=32
>
>
>
>
>
> *On the machine: 10.20.10.228*
>
> ./start-master.sh --webui-port 8585
>
>
>
> ./start-slave.sh --webui-port 8586 spark://10.20.10.228:7077
>
>
>
>
>
> *On the machine 10.20.10.229 :*
>
> ./start-slave.sh --webui-port 8586 spark://10.20.10.228:7077
>
>
>
>
>
> *On the machine: 10.20.10.228 :*
>
>
>
> We start the Spark shell:
>
>
>
> spark-shell --master spark://10.20.10.228:7077
>
>
>
> Then we make configurations:
>
>
>
> val df  = spark.read.format("jdbc").option("url", "jdbc:sqlserver://
> 10.20.10.148:1433;databaseName=testdb").option("dbtable",
> "dbo.temp_muh_hareket").option("user", "gpudb").option("password",
> "Kinetica2017!").load()
>
> import com.kinetica.spark._
>
> val lp = new LoaderParams("http://10.20.10.228:9191;, "jdbc:simba://
> 10.20.10.228:9292;ParentSet=MASTER", "muh_hareket_20",
> false,"",10,true,true,"admin","Kinetica2017!",4, true, true, 1)
>
> SparkKineticaLoader.KineticaWriter(df,lp);
>
>
>
>
>
> The above commands successfully work. The data transfer completes. However,
> jobs work serially not in parallel. Also executors work serially and take
> turns. They donw work in parallel.
>
>
>
> How can we make jobs work in parallel?
>
>
>
>
>
> [Spark UI screenshots omitted]
>
> I really appreciate your help. We have done everything that we could.
>
>
>
> *Onur EKİNCİ*
> Bilgi Yönetimi Yöneticisi
> Knowledge Management Manager
>
> m: +90 553 044 2341  d: +90 212 329 7000
>
> İTÜ Ayazağa Kampüsü, Teknokent ARI4 Binası 34469 Maslak İstanbul - Google Maps
>
>
>
>
> Yasal Uyarı :
> Bu elektronik posta işbu linki kullanarak ulaşabileceğiniz Koşul ve
> Şartlar dokümanına tabidir :
> http://www.innova.com.tr/disclaimer-yasal-uyari.asp
>
>


RE: Run jobs in parallel in standalone mode

2018-01-16 Thread Onur EKİNCİ
Sorry, it is not Oracle. It is MSSQL.

Do you have any opinion on a solution? I really appreciate it.




Onur EKİNCİ
Bilgi Yönetimi Yöneticisi
Knowledge Management Manager

m:+90 553 044 2341  d:+90 212 329 7000

İTÜ Ayazağa Kampüsü, Teknokent ARI4 Binası 34469 Maslak İstanbul - Google 
Maps





From: Richard Qiao [mailto:richardqiao2...@gmail.com]
Sent: Tuesday, January 16, 2018 11:59 AM
To: Onur EKİNCİ 
Cc: user@spark.apache.org
Subject: Re: Run jobs in parallel in standalone mode

Curious you are using"jdbc:sqlserve" to connect oracle, why?
Also kindly reminder scrubbing your user id password.

Sent from my iPhone

On Jan 16, 2018, at 03:00, Onur EKİNCİ 
> wrote:
Hi,

We are trying to get data from an Oracle database into Kinetica database 
through Apache Spark.

We installed Spark in standalone mode. We executed the following commands. 
However, we have tried everything but we couldnt manage to run jobs in 
parallel. We use 2 IBM servers each of which has 128cores and 1TB memory.

We also added  in the spark-defaults.conf  :
spark.executor.memory=64g
spark.executor.cores=32
spark.default.parallelism=32
spark.cores.max=64
spark.scheduler.mode=FAIR
spark.sql.shuffle.partions=32


On the machine: 10.20.10.228
./start-master.sh --webui-port 8585

./start-slave.sh --webui-port 8586 spark://10.20.10.228:7077


On the machine 10.20.10.229:
./start-slave.sh --webui-port 8586 spark://10.20.10.228:7077


On the machine: 10.20.10.228:

We start the Spark shell:

spark-shell --master spark://10.20.10.228:7077

Then we make configurations:

val df = spark.read.format("jdbc")
  .option("url", "jdbc:sqlserver://10.20.10.148:1433;databaseName=testdb")
  .option("dbtable", "dbo.temp_muh_hareket")
  .option("user", "gpudb")
  .option("password", "Kinetica2017!")
  .load()
import com.kinetica.spark._
val lp = new LoaderParams("http://10.20.10.228:9191", "jdbc:simba://10.20.10.228:9292;ParentSet=MASTER",
  "muh_hareket_20", false, "", 10, true, true, "admin", "Kinetica2017!", 4, true, true, 1)
SparkKineticaLoader.KineticaWriter(df, lp);


The above commands successfully work. The data transfer completes. However,
jobs work serially, not in parallel. The executors also work serially and take
turns; they don't work in parallel.

How can we make jobs work in parallel?

[Spark UI screenshots omitted]

I really appreciate your help. We have done everything that we could.



Onur EKİNCİ
Bilgi Yönetimi Yöneticisi
Knowledge Management Manager

m:+90 553 044 2341  d:+90 212 329 7000

İTÜ Ayazağa Kampüsü, Teknokent ARI4 Binası 34469 Maslak İstanbul - Google 
Maps







Yasal Uyarı :
Bu elektronik posta işbu linki kullanarak ulaşabileceğiniz Koşul ve Şartlar 
dokümanına tabidir :
http://www.innova.com.tr/disclaimer-yasal-uyari.asp


Re: Run jobs in parallel in standalone mode

2018-01-16 Thread Richard Qiao
Curious that you are using "jdbc:sqlserver" to connect to Oracle. Why?
Also, a kind reminder to scrub your user id and password.

Sent from my iPhone

> On Jan 16, 2018, at 03:00, Onur EKİNCİ  wrote:
> 
> Hi,
>  
> We are trying to get data from an Oracle database into Kinetica database 
> through Apache Spark.
>  
> We installed Spark in standalone mode. We executed the following commands. 
> However, we have tried everything but we couldnt manage to run jobs in 
> parallel. We use 2 IBM servers each of which has 128cores and 1TB memory.
>  
> We also added  in the spark-defaults.conf  :
> spark.executor.memory=64g
> spark.executor.cores=32
> spark.default.parallelism=32
> spark.cores.max=64
> spark.scheduler.mode=FAIR
> spark.sql.shuffle.partions=32
>  
>  
> On the machine: 10.20.10.228
> ./start-master.sh --webui-port 8585
>  
> ./start-slave.sh --webui-port 8586 spark://10.20.10.228:7077
>  
>  
> On the machine 10.20.10.229:
> ./start-slave.sh --webui-port 8586 spark://10.20.10.228:7077
>  
>  
> On the machine: 10.20.10.228:
>  
> We start the Spark shell:
>  
> spark-shell --master spark://10.20.10.228:7077
>  
> Then we make configurations:
>  
> val df  = spark.read.format("jdbc").option("url", 
> "jdbc:sqlserver://10.20.10.148:1433;databaseName=testdb").option("dbtable", 
> "dbo.temp_muh_hareket").option("user", "gpudb").option("password", 
> "Kinetica2017!").load()
> import com.kinetica.spark._
> val lp = new LoaderParams("http://10.20.10.228:9191;, 
> "jdbc:simba://10.20.10.228:9292;ParentSet=MASTER", "muh_hareket_20", 
> false,"",10,true,true,"admin","Kinetica2017!",4, true, true, 1)
> SparkKineticaLoader.KineticaWriter(df,lp);
>  
>  
> The above commands successfully work. The data transfer completes. However, 
> jobs work serially not in parallel. Also executors work serially and take 
> turns. They donw work in parallel.
>  
> How can we make jobs work in parallel?
>  
>
> [Spark UI screenshots omitted]
>
> I really appreciate your help. We have done everything that we could.
>  
> 
> Onur EKİNCİ
> Bilgi Yönetimi Yöneticisi
> Knowledge Management Manager
>  
> m:+90 553 044 2341  d:+90 212 329 7000
>  
> İTÜ Ayazağa Kampüsü, Teknokent ARI4 Binası 34469 Maslak İstanbul - Google Maps
> 
>  
> 
> 
> Yasal Uyarı :
> Bu elektronik posta işbu linki kullanarak ulaşabileceğiniz Koşul ve Şartlar 
> dokümanına tabidir :
> http://www.innova.com.tr/disclaimer-yasal-uyari.asp


Re: [Spark ML] Positive-Only Training Classification in Scala

2018-01-16 Thread hosur narahari
You can make use of the probability vector from Spark classification.
When you run a Spark classification model for prediction, along with
classifying each example into its class, Spark also gives a probability vector
(the probability that the example belongs to each individual class). So just
take the probability corresponding to the donor class; that is the probability
that a person will become a donor.
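
A minimal sketch of what that looks like in code (assuming the donor class was
encoded as label 1.0, model is the fitted classifier, and newContacts is a
placeholder for the DataFrame of new contacts):

import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{col, udf}

// pull P(donor) out of the probability vector produced by the classifier
val donorProbability = udf((v: Vector) => v(1))

model.transform(newContacts)
  .withColumn("donor_probability", donorProbability(col("probability")))
  .select("features", "donor_probability")
  .show()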

Best Regards,
Hari

On 15 Jan 2018 11:51 p.m., "Matt Hicks"  wrote:

> I'm attempting to create a training classification, but only have positive
> information.  Specifically in this case it is a donor list of users, but I
> want to use it as training in order to determine classification for new
> contacts to give probabilities that they will donate.
>
> Any insights or links are appreciated. I've gone through the documentation
> but have been unable to find any references to how I might do this.
>
> Thanks
>
> ---*Matt Hicks*
>
> *Chief Technology Officer*
>
> 405.283.6887 | http://outr.com
>
>
>