Re: Re: how to generate a large dataset in parallel

2018-12-13 Thread lk_spark
I just want to generate some data in Spark.



From: Jean Georges Perrin
Sent: 2018-12-14 11:10
Subject: Re: how to generate a large dataset in parallel

Do you just want to generate some data in Spark, or ingest a large dataset
from outside Spark? What's the ultimate goal you're pursuing?


On Dec 13, 2018, at 21:38, lk_spark wrote:

I want to generate some test data containing about one hundred million rows.
I created a dataset with ten rows and call df.union in a 'for' loop, but this
causes the operation to happen only on the driver node.
How can I do it across the whole cluster?



Re: how to generate a large dataset in parallel

2018-12-13 Thread Jean Georges Perrin
Do you just want to generate some data in Spark, or ingest a large dataset
from outside Spark? What's the ultimate goal you're pursuing?


> On Dec 13, 2018, at 21:38, lk_spark wrote:
> Hi all,
> I want to generate some test data containing about one hundred million
> rows.
> I created a dataset with ten rows and call df.union in a 'for' loop, but
> this causes the operation to happen only on the driver node.
> How can I do it across the whole cluster?
> 2018-12-14
> lk_spark

how to generate a large dataset in parallel

2018-12-13 Thread lk_spark
I want to generate some test data containing about one hundred million rows.
I created a dataset with ten rows and call df.union in a 'for' loop, but this
causes the operation to happen only on the driver node.
How can I do it across the whole cluster?
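
One possible approach, sketched under assumptions: instead of unioning a small
DataFrame in a driver-side loop, start from spark.range, which creates the row
ids lazily and splits them across partitions, and derive the other columns
with SQL expressions. The sketch uses PySpark; the function names, the column
names, the partition count, and the output path are all hypothetical and not
taken from this thread.

```python
def generate_test_data(spark, num_rows=100_000_000, num_partitions=200):
    """Return a DataFrame of synthetic rows built from a partitioned id range.

    `spark` is expected to be a pyspark.sql.SparkSession.
    """
    # spark.range(start, end, step, numPartitions) is lazy: the ids are
    # produced by executor tasks, one task per partition, not by a loop
    # running on the driver.
    ids = spark.range(0, num_rows, 1, num_partitions)

    # Derive example columns from the id with Spark SQL expressions.
    # The column names and formulas are placeholders for illustration.
    return ids.selectExpr(
        "id",
        "concat('user_', cast(id % 1000 as string)) AS name",
        "rand(42) AS score",
    )


def main():
    # Imported here so the helper above can be read and tested on its own.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("generate-test-data").getOrCreate()
    df = generate_test_data(spark)
    # Hypothetical output location; adjust to your storage.
    df.write.mode("overwrite").parquet("/tmp/test_data")
    spark.stop()
```

To try it, call main() from a script passed to spark-submit. The number of
partitions controls how many tasks generate rows, so it is worth tuning to the
number of executor cores available.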



Re: Problem running Spark on Kubernetes: Certificate error

2018-12-13 Thread Matt Cheah
Hi Steven,


What I think is happening is that your machine has a CA certificate that is 
used for communicating with your API server, particularly because you’re using 
Digital Ocean’s cluster manager. However, it’s unclear if your pod has the same 
CA certificate or if the pod needs that certificate file. You can use 
configurations to have your pod use a particular CA certificate file to 
communicate with the API server. If you set 
spark.kubernetes.authenticate.driver.caCertFile to the path of your CA 
certificate on your local disk, spark-submit will create a secret that contains 
that certificate file and use that certificate to configure TLS for the driver 
pod’s communication with the API server.
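
As a rough sketch of the suggestion above, the snippet below assembles the
spark-submit arguments from the original post and adds the
spark.kubernetes.authenticate.driver.caCertFile setting. The helper name
build_submit_command, the example cluster address, the image name, and the
certificate path are placeholders, not values from this thread. The
application jar argument is not shown in the original command, so it is left
out here as well.

```python
import shlex


def build_submit_command(cluster_ip, image_url, ca_cert_path):
    """Return the spark-submit argument list with a driver CA cert file set."""
    return [
        "./bin/spark-submit",
        "--master", f"k8s://{cluster_ip}",
        "--deploy-mode", "cluster",
        "--name", "spark-pi",
        "--class", "org.apache.spark.examples.SparkPi",
        "--conf", "spark.executor.instances=1",
        "--conf", f"spark.kubernetes.container.image={image_url}",
        "--conf", "spark.kubernetes.authenticate.driver.serviceAccountName=spark",
        # Path to the CA certificate on the submitting machine's local disk.
        "--conf", f"spark.kubernetes.authenticate.driver.caCertFile={ca_cert_path}",
        # The application jar is elided in the original post; append it here.
    ]


# Placeholder values for illustration only.
cmd = build_submit_command(
    "https://203.0.113.10:6443", "example/spark:v1", "/path/to/ca.crt"
)
print(shlex.join(cmd))
```

Printing the command first makes it easy to check the certificate path before
actually submitting.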


It's clear that your driver pod doesn’t have the right TLS certificate to 
communicate with the API server, so I would try to introspect the driver pod 
and see what certificate it’s using for that communication. If there’s a fix 
that needs to happen in Spark, feel free to indicate as such.


-Matt Cheah


From: Steven Stetzler 
Date: Thursday, December 13, 2018 at 1:49 PM
Subject: Problem running Spark on Kubernetes: Certificate error



I am following the tutorial here to get Spark running on a Kubernetes cluster.
My Kubernetes cluster is hosted with Digital Ocean's Kubernetes cluster
manager. I have changed the KUBECONFIG environment variable to point to my
cluster access credentials, so both Spark and kubectl can speak with the nodes.

I am running into an issue when trying to run the SparkPi example as described 
in the Spark on Kubernetes tutorials. The command I am running is: 

./bin/spark-submit --master k8s://$CLUSTERIP --deploy-mode cluster --name 
spark-pi --class org.apache.spark.examples.SparkPi --conf 
spark.executor.instances=1 --conf spark.kubernetes.container.image=$IMAGEURL 
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark 

where CLUSTERIP contains the IP of my cluster and IMAGEURL contains the URL of
the Spark Docker image I am using. This Docker image was built and pushed
with the script included in the Spark 2.4
distribution. I have created a service account for Spark to ensure that it has 
proper permissions to create pods etc., which I checked using 

kubectl auth can-i create pods --as=system:serviceaccount:default:spark 


Problem running Spark on Kubernetes: Certificate error

2018-12-13 Thread Steven Stetzler

I am following the tutorial here to get Spark running on a Kubernetes
cluster. My Kubernetes cluster is hosted with Digital Ocean's Kubernetes
cluster manager. I have changed the KUBECONFIG environment variable to
point to my cluster access credentials, so both Spark and kubectl can speak
with the nodes.

I am running into an issue when trying to run the SparkPi example as
described in the Spark on Kubernetes tutorials. The command I am running is:

./bin/spark-submit --master k8s://$CLUSTERIP --deploy-mode cluster --name
spark-pi --class org.apache.spark.examples.SparkPi --conf
spark.executor.instances=1 --conf
spark.kubernetes.container.image=$IMAGEURL --conf
spark.kubernetes.authenticate.driver.serviceAccountName=spark

where CLUSTERIP contains the IP of my cluster and IMAGEURL contains the URL
of the Spark Docker image I am using. This Docker image was
built and pushed with the script included in the Spark 2.4 distribution. I
have created a service account for Spark to ensure that it has proper
permissions to create pods etc., which I checked using

kubectl auth can-i create pods --as=system:serviceaccount:default:spark

When I try to run the SparkPi example using the above command, I get the
following output:

2018-12-12 06:26:15 WARN  Utils:66 - Your hostname, docker-test resolves to
a loopback address:; using instead (on interface eth0)
2018-12-12 06:26:15 WARN  Utils:66 - Set SPARK_LOCAL_IP if you need to bind
to another address
2018-12-12 06:26:19 INFO  LoggingPodStatusWatcherImpl:54 - State changed,
new state:
 pod name: spark-pi-1544595975520-driver
 namespace: default
 labels: spark-app-selector ->
spark-ec5eb54644d348e7a213f8178b8ef61f, spark-role -> driver
 pod uid: d5d6bdc7-fdd6-11e8-b666-8e815d3815b2
 creation time: 2018-12-12T06:26:18Z
 service account name: spark
 volumes: spark-local-dir-1, spark-conf-volume, spark-token-qf9dt
 node name: N/A
 start time: N/A
 container images: N/A
 phase: Pending
 status: []
2018-12-12 06:26:19 INFO  LoggingPodStatusWatcherImpl:54 - State changed,
new state:
 pod name: spark-pi-1544595975520-driver
 namespace: default
 labels: spark-app-selector ->
spark-ec5eb54644d348e7a213f8178b8ef61f, spark-role -> driver
 pod uid: d5d6bdc7-fdd6-11e8-b666-8e815d3815b2
 creation time: 2018-12-12T06:26:18Z
 service account name: spark
 volumes: spark-local-dir-1, spark-conf-volume, spark-token-qf9dt
 node name: flamboyant-darwin-3rhc
 start time: N/A
 container images: N/A
 phase: Pending
 status: []
2018-12-12 06:26:19 INFO  LoggingPodStatusWatcherImpl:54 - State changed,
new state:
 pod name: spark-pi-1544595975520-driver
 namespace: default
 labels: spark-app-selector ->
spark-ec5eb54644d348e7a213f8178b8ef61f, spark-role -> driver
 pod uid: d5d6bdc7-fdd6-11e8-b666-8e815d3815b2
 creation time: 2018-12-12T06:26:18Z
 service account name: spark
 volumes: spark-local-dir-1, spark-conf-volume, spark-token-qf9dt
 node name: flamboyant-darwin-3rhc
 start time: 2018-12-12T06:26:18Z
 container images:
 phase: Pending
 status: [ContainerStatus(containerID=null, image=, imageID=,
lastState=ContainerState(running=null, terminated=null, waiting=null,
additionalProperties={}), name=spark-kubernetes-driver, ready=false,
restartCount=0, state=ContainerState(running=null, terminated=null,
waiting=ContainerStateWaiting(message=null, reason=ContainerCreating,
additionalProperties={}), additionalProperties={}),
2018-12-12 06:26:19 INFO  Client:54 - Waiting for application spark-pi to
2018-12-12 06:26:21 INFO  LoggingPodStatusWatcherImpl:54 - State changed,
new state:
 pod name: spark-pi-1544595975520-driver
 namespace: default
 labels: spark-app-selector ->
spark-ec5eb54644d348e7a213f8178b8ef61f, spark-role -> driver
 pod uid: d5d6bdc7-fdd6-11e8-b666-8e815d3815b2
 creation time: 2018-12-12T06:26:18Z
 service account name: spark
 volumes: spark-local-dir-1, spark-conf-volume, spark-token-qf9dt
 node name: flamboyant-darwin-3rhc
 start time: 2018-12-12T06:26:18Z
 container images: stevenstetzler/spark:v1
 phase: Running

Kalman filter with spark

2018-12-13 Thread Laurent Thiebaud

Is there any built-in implementation of a Kalman filter in Spark MLlib? Or
any other filter that achieves the same result? What's the state of the art
for this?
