Re: Re: how to generate a large dataset in parallel
I just want to generate some data in Spark.

2018-12-14
lk_spark

From: Jean Georges Perrin
Sent: 2018-12-14 11:10
Subject: Re: how to generate a large dataset in parallel
To: "lk_spark"
Cc: "user.spark"

Do you just want to generate some data in Spark, or ingest a large dataset from outside of Spark? What is the ultimate goal you are pursuing?

jg

On Dec 13, 2018, at 21:38, lk_spark wrote:

Hi all,
I want to generate some test data containing about one hundred million rows. I created a dataset with ten rows and called df.union in a 'for' loop, but this causes the operation to happen only on the driver node. How can I do it across the whole cluster?

2018-12-14
lk_spark
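A minimal sketch of one way to do this in Spark itself: spark.range generates the rows directly on the executors, so no driver-side union loop is needed. The column names, partition count, and output path below are illustrative assumptions, not part of the original thread.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.rand

val spark = SparkSession.builder.appName("generate-test-data").getOrCreate()

// 100 million rows created in parallel across 200 partitions (both numbers
// illustrative) instead of unioning small DataFrames on the driver.
val df = spark.range(0L, 100000000L, 1L, 200)
  .withColumn("key", (rand() * 1000).cast("int"))
  .withColumn("value", rand())

df.write.parquet("/tmp/test-data")  // hypothetical output path

Each partition is produced by an executor task, so generation scales with the cluster rather than being assembled on the driver.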
Re: how to generate a large dataset in parallel
Do you just want to generate some data in Spark, or ingest a large dataset from outside of Spark? What is the ultimate goal you are pursuing?

jg

> On Dec 13, 2018, at 21:38, lk_spark wrote:
>
> Hi all,
> I want to generate some test data containing about one hundred million rows.
> I created a dataset with ten rows and called df.union in a 'for' loop, but this
> causes the operation to happen only on the driver node. How can I do it across
> the whole cluster?
>
> 2018-12-14
> lk_spark
how to generate a large dataset in parallel
Hi all,
I want to generate some test data containing about one hundred million rows. I created a dataset with ten rows and called df.union in a 'for' loop, but this causes the operation to happen only on the driver node. How can I do it across the whole cluster?

2018-12-14
lk_spark
Re: Problem running Spark on Kubernetes: Certificate error
Hi Steven,

What I think is happening is that your machine has a CA certificate that is used for communicating with your API server, particularly because you're using Digital Ocean's cluster manager. However, it's unclear whether your pod has the same CA certificate, or whether the pod needs that certificate file. You can use configuration options to have your pod use a particular CA certificate file to communicate with the API server. If you set spark.kubernetes.authenticate.driver.caCertFile to the path of your CA certificate on your local disk, spark-submit will create a secret that contains that certificate file and use that certificate to configure TLS for the driver pod's communication with the API server.

It's clear that your driver pod doesn't have the right TLS certificate to communicate with the API server, so I would try to introspect the driver pod and see what certificate it's using for that communication. If there's a fix that needs to happen in Spark, feel free to indicate as such.

-Matt Cheah

From: Steven Stetzler
Date: Thursday, December 13, 2018 at 1:49 PM
To: "user@spark.apache.org"
Subject: Problem running Spark on Kubernetes: Certificate error

Hello,

I am following the tutorial here (https://spark.apache.org/docs/latest/running-on-kubernetes.html) to get Spark running on a Kubernetes cluster. My Kubernetes cluster is hosted with Digital Ocean's Kubernetes cluster manager. I have changed the KUBECONFIG environment variable to point to my cluster access credentials, so both Spark and kubectl can communicate with the cluster.

I am running into an issue when trying to run the SparkPi example as described in the Spark on Kubernetes tutorial. The command I am running is:

./bin/spark-submit --master k8s://$CLUSTERIP --deploy-mode cluster --name spark-pi --class org.apache.spark.examples.SparkPi --conf spark.executor.instances=1 --conf spark.kubernetes.container.image=$IMAGEURL --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar

where CLUSTERIP contains the IP of my cluster and IMAGEURL contains the URL of the Spark Docker image I am using (https://hub.docker.com/r/stevenstetzler/spark/). This Docker image was built and pushed with the script included in the Spark 2.4 distribution.
I have created a service account for Spark to ensure that it has the proper permissions to create pods etc., which I checked using:

kubectl auth can-i create pods --as=system:serviceaccount:default:spark

When I try to run the SparkPi example using the above command, I get the following output:

2018-12-12 06:26:15 WARN Utils:66 - Your hostname, docker-test resolves to a loopback address: 127.0.1.1; using 10.46.0.6 instead (on interface eth0)
2018-12-12 06:26:15 WARN Utils:66 - Set SPARK_LOCAL_IP if you need to bind to another address
2018-12-12 06:26:19 INFO LoggingPodStatusWatcherImpl:54 - State changed, new state:
     pod name: spark-pi-1544595975520-driver
     namespace: default
     labels: spark-app-selector -> spark-ec5eb54644d348e7a213f8178b8ef61f, spark-role -> driver
     pod uid: d5d6bdc7-fdd6-11e8-b666-8e815d3815b2
     creation time: 2018-12-12T06:26:18Z
     service account name: spark
     volumes: spark-local-dir-1, spark-conf-volume, spark-token-qf9dt
     node name: N/A
     start time: N/A
     container images: N/A
     phase: Pending
     status: []
2018-12-12 06:26:19 INFO LoggingPodStatusWatcherImpl:54 - State changed, new state:
     pod name: spark-pi-1544595975520-driver
     namespace: default
     labels: spark-app-selector -> spark-ec5eb54644d348e7a213f8178b8ef61f, spark-role -> driver
     pod uid: d5d6bdc7-fdd6-11e8-b666-8e815d3815b2
     creation time: 2018-12-12T06:26:18Z
     service account name: spark
     volumes: spark-local-dir-1, spark-conf-volume, spark-token-qf9dt
     node name: flamboyant-darwin-3rhc
     start time: N/A
     container images: N/A
     phase: Pending
     status: []
2018-12-12 06:26:19 INFO LoggingPodStatusWatcherImpl:54 - State changed, new state:
     pod name: spark-pi-1544595975520-driver
     namespace: default
     labels: spark-app-selector -> spark-ec5eb54644d348e7a213f8178b8ef61f, spark-role -> driver
     pod uid: d5d6bdc7-fdd6-11e8-b666-8e815d3815b2
     creation time: 2018-12-12T06:26:18Z
     service account name: spark
     volumes: spark-local-dir-1, spark-conf-volume, spark-token-qf9dt
     node name: flamboyant-darwin-3rhc
     start time: 2018-12-12T06:26:18Z
     container images: docker.io/stevenstetzler/spark:v1
     phase: Pending
     status: [ContainerStatus(containerID=null, image=docker.io/stevenstetzler/spark:v1, imageID=, lastState=ContainerState(running=null, term
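As a concrete sketch of the configuration described above, the same spark-submit command with the CA certificate option added might look like this; /path/to/ca.crt is a placeholder for the CA certificate that your KUBECONFIG uses:

./bin/spark-submit \
  --master k8s://$CLUSTERIP \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=1 \
  --conf spark.kubernetes.container.image=$IMAGEURL \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.authenticate.driver.caCertFile=/path/to/ca.crt \
  local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar

To introspect the driver pod as suggested, something like

kubectl exec spark-pi-1544595975520-driver -- ls /var/run/secrets/kubernetes.io/serviceaccount/

should list the ca.crt and token mounted from the pod's service account (the pod name is taken from the log above; the path is the standard Kubernetes service-account mount location).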
Problem running Spark on Kubernetes: Certificate error
Hello,

I am following the tutorial here (https://spark.apache.org/docs/latest/running-on-kubernetes.html) to get Spark running on a Kubernetes cluster. My Kubernetes cluster is hosted with Digital Ocean's Kubernetes cluster manager. I have changed the KUBECONFIG environment variable to point to my cluster access credentials, so both Spark and kubectl can communicate with the cluster.

I am running into an issue when trying to run the SparkPi example as described in the Spark on Kubernetes tutorial. The command I am running is:

./bin/spark-submit --master k8s://$CLUSTERIP --deploy-mode cluster --name spark-pi --class org.apache.spark.examples.SparkPi --conf spark.executor.instances=1 --conf spark.kubernetes.container.image=$IMAGEURL --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar

where CLUSTERIP contains the IP of my cluster and IMAGEURL contains the URL of the Spark Docker image I am using (https://hub.docker.com/r/stevenstetzler/spark/). This Docker image was built and pushed with the script included in the Spark 2.4 distribution.

I have created a service account for Spark to ensure that it has the proper permissions to create pods etc., which I checked using:

kubectl auth can-i create pods --as=system:serviceaccount:default:spark

When I try to run the SparkPi example using the above command, I get the following output:

2018-12-12 06:26:15 WARN Utils:66 - Your hostname, docker-test resolves to a loopback address: 127.0.1.1; using 10.46.0.6 instead (on interface eth0)
2018-12-12 06:26:15 WARN Utils:66 - Set SPARK_LOCAL_IP if you need to bind to another address
2018-12-12 06:26:19 INFO LoggingPodStatusWatcherImpl:54 - State changed, new state:
     pod name: spark-pi-1544595975520-driver
     namespace: default
     labels: spark-app-selector -> spark-ec5eb54644d348e7a213f8178b8ef61f, spark-role -> driver
     pod uid: d5d6bdc7-fdd6-11e8-b666-8e815d3815b2
     creation time: 2018-12-12T06:26:18Z
     service account name: spark
     volumes: spark-local-dir-1, spark-conf-volume, spark-token-qf9dt
     node name: N/A
     start time: N/A
     container images: N/A
     phase: Pending
     status: []
2018-12-12 06:26:19 INFO LoggingPodStatusWatcherImpl:54 - State changed, new state:
     pod name: spark-pi-1544595975520-driver
     namespace: default
     labels: spark-app-selector -> spark-ec5eb54644d348e7a213f8178b8ef61f, spark-role -> driver
     pod uid: d5d6bdc7-fdd6-11e8-b666-8e815d3815b2
     creation time: 2018-12-12T06:26:18Z
     service account name: spark
     volumes: spark-local-dir-1, spark-conf-volume, spark-token-qf9dt
     node name: flamboyant-darwin-3rhc
     start time: N/A
     container images: N/A
     phase: Pending
     status: []
2018-12-12 06:26:19 INFO LoggingPodStatusWatcherImpl:54 - State changed, new state:
     pod name: spark-pi-1544595975520-driver
     namespace: default
     labels: spark-app-selector -> spark-ec5eb54644d348e7a213f8178b8ef61f, spark-role -> driver
     pod uid: d5d6bdc7-fdd6-11e8-b666-8e815d3815b2
     creation time: 2018-12-12T06:26:18Z
     service account name: spark
     volumes: spark-local-dir-1, spark-conf-volume, spark-token-qf9dt
     node name: flamboyant-darwin-3rhc
     start time: 2018-12-12T06:26:18Z
     container images: docker.io/stevenstetzler/spark:v1
     phase: Pending
     status: [ContainerStatus(containerID=null, image=docker.io/stevenstetzler/spark:v1, imageID=, lastState=ContainerState(running=null, terminated=null, waiting=null, additionalProperties={}), name=spark-kubernetes-driver, ready=false, restartCount=0, state=ContainerState(running=null, terminated=null, waiting=ContainerStateWaiting(message=null, reason=ContainerCreating, additionalProperties={}), additionalProperties={}), additionalProperties={})]
2018-12-12 06:26:19 INFO Client:54 - Waiting for application spark-pi to finish...
2018-12-12 06:26:21 INFO LoggingPodStatusWatcherImpl:54 - State changed, new state:
     pod name: spark-pi-1544595975520-driver
     namespace: default
     labels: spark-app-selector -> spark-ec5eb54644d348e7a213f8178b8ef61f, spark-role -> driver
     pod uid: d5d6bdc7-fdd6-11e8-b666-8e815d3815b2
     creation time: 2018-12-12T06:26:18Z
     service account name: spark
     volumes: spark-local-dir-1, spark-conf-volume, spark-token-qf9dt
     node name: flamboyant-darwin-3rhc
     start time: 2018-12-12T06:26:18Z
     container images: stevenstetzler/spark:v1
     phase: Running
     status: [ContainerStatus(containerID=docker://b923c6ff02b93557c8c104c01a4eeb1c05f3d0c0123ec4e5895bfd6be398a03a, image=stevenstetzler/spark:v1, imageID=docker-pullable://stevenstetzler/spark@sha256:dc4bce1e410ebd7b14a88caa46a4282a61ff058c6374b7cf721b7498829bb041, lastState=ContainerState(ru
Kalman filter with spark
Hi,

Is there any built-in implementation of a Kalman filter in Spark MLlib? Or any other filter that achieves the same result? What is the state of the art here?

Thanks.
Laurent
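MLlib does not appear to include a built-in Kalman filter (as of Spark 2.4), so it would have to be implemented by hand or taken from a third-party package. A minimal sketch of a scalar, constant-state Kalman filter update in Scala follows; the process noise q, measurement noise r, and the sample measurements are illustrative assumptions only.

// One predict/update step of a 1-D Kalman filter with identity dynamics:
// the state estimate x and its variance p are corrected by measurement z.
case class KState(x: Double, p: Double)

def kalmanStep(s: KState, z: Double, q: Double = 1e-5, r: Double = 1e-2): KState = {
  val pPred = s.p + q                             // predict: variance grows by process noise
  val k = pPred / (pPred + r)                     // Kalman gain
  KState(s.x + k * (z - s.x), (1.0 - k) * pPred)  // update with the measurement
}

// Example: filter a short sequence of noisy measurements.
val measurements = Seq(1.1, 0.9, 1.05, 0.98)
val filtered = measurements.scanLeft(KState(0.0, 1.0))((s, z) => kalmanStep(s, z))

For many independent time series stored in a Dataset, the same step could be applied per key, for example with groupByKey followed by flatMapGroups, sorting each group's measurements by time.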