Filter one dataset based on values from another

2018-04-30 Thread lsn24
Hi, I have one dataset with parameters and another with data that needs to be filtered based on the first (parameter) dataset. *The scenario is as follows:* for each row in the parameter dataset, I need to apply that parameter row as a filter to the second dataset. I will end up having multiple
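A minimal sketch of one way to do this, assuming the parameter dataset is small and each row holds a column name and a threshold (the paths and schema below are hypothetical, not the poster's data): collect the parameters to the driver and derive one filtered DataFrame per parameter row.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object ParamFilter {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("param-filter").getOrCreate()

    // Hypothetical inputs: params holds (colName: String, threshold: Double),
    // data holds the rows to be filtered.
    val params: DataFrame = spark.read.parquet("/path/to/params")
    val data: DataFrame   = spark.read.parquet("/path/to/data")

    // The parameter dataset is assumed small, so collect it to the driver and
    // apply each parameter row as a filter, yielding one DataFrame per row.
    val filtered: Array[DataFrame] = params.collect().map { row =>
      data.filter(col(row.getString(0)) > row.getDouble(1))
    }

    filtered.foreach(df => println(df.count()))
  }
}
```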

Re: Spark launcher listener not getting invoked k8s Spark 2.3

2018-04-30 Thread Marcelo Vanzin
Please include the mailing list in your replies. Yes, you'll be able to launch the jobs, but the k8s backend isn't hooked up to the listener functionality. On Mon, Apr 30, 2018 at 8:13 PM, purna m wrote: > I’m able to submit the job though !! I mean spark application is

NullPointerException when scanning HBase table

2018-04-30 Thread Huiliang Zhang
Hi, in my Spark job I need to scan an HBase table. I set up a scan with custom filters, then use the newAPIHadoopRDD function to get a JavaPairRDD variable X. The problem is that when no records inside HBase match my filters, calling X.isEmpty() or X.count() will cause a
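For reference, a minimal sketch of this kind of scan (the table name is a placeholder, and this is not the poster's code); mapping the HBase key/Result pairs to plain serializable values before calling count() or isEmpty() is one commonly suggested way to avoid touching problematic records when the scan matches nothing.

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.spark.sql.SparkSession

object HBaseScan {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("hbase-scan").getOrCreate()

    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table") // hypothetical table name
    // A custom Scan with filters would be serialized and set under TableInputFormat.SCAN here.

    val rdd = spark.sparkContext.newAPIHadoopRDD(
      hbaseConf,
      classOf[TableInputFormat],
      classOf[ImmutableBytesWritable],
      classOf[Result])

    // Map to plain byte arrays before any action, so the driver never has to
    // deal with the raw ImmutableBytesWritable/Result objects.
    val keys = rdd.map { case (key, _) => key.copyBytes() }
    println(s"matched rows: ${keys.count()}")
  }
}
```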

Spark launcher listener not getting invoked k8s Spark 2.3

2018-04-30 Thread purna m
Hi, I'm using the code below to submit a Spark 2.3 application to a Kubernetes cluster, written in Scala using the Play framework. I have also tried it as a simple Scala program without the Play framework. I'm trying to run the spark-submit mentioned below programmatically
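For context, the standard programmatic submission path uses org.apache.spark.launcher.SparkLauncher with a SparkAppHandle.Listener; a minimal sketch is below (the master URL, image, jar, and main class are placeholders, not the poster's actual code). As noted in the reply above, the Spark 2.3 Kubernetes backend does not invoke these listener callbacks.

```scala
import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

object SubmitToK8s {
  def main(args: Array[String]): Unit = {
    val handle = new SparkLauncher()
      .setMaster("k8s://https://my-k8s-apiserver:6443")        // placeholder API server
      .setDeployMode("cluster")
      .setMainClass("com.example.MyApp")                        // placeholder main class
      .setAppResource("local:///opt/spark/jars/my-app.jar")     // placeholder jar
      .setConf("spark.kubernetes.container.image", "myrepo/spark:2.3.0")
      .startApplication(new SparkAppHandle.Listener {
        override def stateChanged(h: SparkAppHandle): Unit =
          println(s"state changed: ${h.getState}")
        override def infoChanged(h: SparkAppHandle): Unit =
          println(s"app id: ${h.getAppId}")
      })

    // Keep the JVM alive long enough to receive callbacks (where the backend supports them).
    while (!handle.getState.isFinal) Thread.sleep(1000)
  }
}
```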

Re: [EXT] [Spark 2.x Core] .collect() size limit

2018-04-30 Thread Michael Mansour
Well, if you don't need to actually evaluate the information on the driver, but just need to trigger some sort of action, then you may want to consider using the `foreach` or `foreachPartition` method, which is an action and will execute your process. It won't return anything to the driver and
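A minimal sketch of that pattern, assuming an RDD and a stand-in side effect (replace the println with the real per-partition work):

```scala
import org.apache.spark.sql.SparkSession

object ForeachDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("foreach-demo").getOrCreate()
    val rdd = spark.sparkContext.parallelize(1 to 1000000)

    // Runs entirely on the executors; nothing is shipped back to the driver.
    rdd.foreachPartition { iter =>
      // Typically you would open one connection/client per partition here
      // (hypothetical sink); println is just a placeholder side effect.
      iter.foreach(record => println(record))
    }
  }
}
```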

re: spark streaming / AnalysisException on collect()

2018-04-30 Thread Peter Liu
Hello there, I have a quick question regarding how to share data (a small data collection) between a kafka producer and consumer using spark streaming (spark 2.2): (A) the data published by a kafka producer is received in order on the kafka consumer side (see (a) copied below). (B) however,
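If the AnalysisException comes from calling collect() on a streaming DataFrame, one Spark 2.2-era way to inspect a small collection on the consumer side is to write the stream to the memory sink and query the resulting table; a hedged sketch is below (the broker, topic, and column handling are assumptions, and the spark-sql-kafka-0-10 package must be on the classpath).

```scala
import org.apache.spark.sql.SparkSession

object KafkaMemorySink {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kafka-memory-sink").getOrCreate()

    // Hypothetical broker and topic.
    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "my-topic")
      .load()

    // A streaming DataFrame cannot be collect()ed directly (that raises an
    // AnalysisException); write it to a sink first, e.g. the memory sink,
    // then query the resulting in-memory table.
    val query = stream.selectExpr("CAST(value AS STRING) AS value")
      .writeStream
      .format("memory")
      .queryName("received")
      .outputMode("append")
      .start()

    query.processAllAvailable()
    spark.sql("SELECT value FROM received").show()
  }
}
```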

Re: [Spark on Google Kubernetes Engine] Properties File Error

2018-04-30 Thread Yinan Li
Also looks like you are mixing configuration properties from different versions of Spark on Kubernetes. "spark.kubernetes.{driver|executor}.docker.image" is only available in the apache-spark-on-k8s fork, whereas "spark.kubernetes.container.image" is new in Spark 2.3.0. Please make sure you use

Re: [Spark on Google Kubernetes Engine] Properties File Error

2018-04-30 Thread Eric Wang
Thanks so much! I'll take a look at the guide right now. The Spark versions should all be 2.2. In my configuration, I'm using --conf spark.kubernetes.driver.docker.image=kubespark/spark-driver:v2.2.0-kubernetes-0.5.0 \ --conf

Re: [Spark on Google Kubernetes Engine] Properties File Error

2018-04-30 Thread Yinan Li
Which version of Spark are you using to run spark-submit, and which version of Spark is your container image based on? This looks to be caused by mismatched versions of Spark used for spark-submit and for the driver/executor at runtime. On Mon, Apr 30, 2018 at 12:00 PM, Holden Karau

Re: [Spark on Google Kubernetes Engine] Properties File Error

2018-04-30 Thread Holden Karau
So, while it's not perfect, I have a guide focused on running custom Spark on GKE https://cloud.google.com/blog/big-data/2018/03/testing-future-apache-spark-releases-and-changes-on-google-kubernetes-engine-and-cloud-dataproc and if you want to run pre-built Spark on GKE there is a solutions article

[Spark on Google Kubernetes Engine] Properties File Error

2018-04-30 Thread Eric Wang
Hello all, I've been trying to spark-submit a job to the Google Kubernetes Engine but I keep encountering an "Exception in thread "main" java.lang.IllegalArgumentException: Server properties file given at /opt/spark/work-dir/driver does not exist or is not a file." error. I'm unsure of how to even

Best practices to keep multiple version of schema in Spark

2018-04-30 Thread unk1102
Hi, I have a couple of datasets whose schemas keep changing, and I store them as Parquet files. I use the mergeSchema option while loading these Parquet files with different schemas into a DataFrame, and it all works fine. Now I have a requirement to maintain the differences between schemas over time, basically
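For reference, a minimal sketch of the mergeSchema load described above (the path is a placeholder); the global spark.sql.parquet.mergeSchema setting achieves the same thing for all Parquet reads.

```scala
import org.apache.spark.sql.SparkSession

object MergeSchemaDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("merge-schema-demo").getOrCreate()

    // Hypothetical path containing Parquet files written with different schema versions.
    val df = spark.read
      .option("mergeSchema", "true") // per-read flag; spark.sql.parquet.mergeSchema sets it globally
      .parquet("/data/events")

    // The merged schema is the union of the column sets seen across the files.
    df.printSchema()
  }
}
```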

Re: [Spark 2.x Core] .collect() size limit

2018-04-30 Thread Irving Duran
I don't think there is a magic number, so I would say that it will depend on how big your dataset is and the size of your worker(s). Thank You, Irving Duran On Sat, Apr 28, 2018 at 10:41 AM klrmowse wrote: > i am currently trying to find a workaround for the Spark

Re: [Spark 2.x Core] .collect() size limit

2018-04-30 Thread Vadim Semenov
`.collect` returns an Array, and arrays can't have more than Int.MaxValue elements; in most JVMs the limit is lower: `Int.MaxValue - 8`. So that imposes an upper limit; however, you can just create an Array of Arrays, and so on, which is basically limitless, albeit with some gymnastics.
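A small illustration of both points: a plain collect() is bounded by the JVM array-length limit, while glom() gives the Array-of-Arrays shape mentioned above, one inner array per partition.

```scala
import org.apache.spark.sql.SparkSession

object NestedCollect {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("nested-collect").getOrCreate()
    val rdd = spark.sparkContext.parallelize(1 to 10000000, numSlices = 100)

    // collect() builds a single Array on the driver, so it is bounded by the JVM
    // array-length limit (about Int.MaxValue - 8) as well as by driver memory.
    // glom() keeps one array per partition, so the result is an Array of Arrays.
    val nested: Array[Array[Int]] = rdd.glom().collect()
    println(nested.map(_.length.toLong).sum)
  }
}
```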

Any good book recommendations for SparkR

2018-04-30 Thread @Nandan@
Hi Team, any good book recommendations for getting in-depth knowledge, from zero to production? Let me know. Thanks.

Re: [Spark 2.x Core] .collect() size limit

2018-04-30 Thread Deepak Goel
Could you please help us and provide the source that mentions the general guideline (80-85)? Even if there is such a general guideline, it is probably there to keep the performance of the Spark application high (and to *distinguish* it from Hadoop). But if you are not too concerned about the *performance*

Re: [Spark 2.x Core] .collect() size limit

2018-04-30 Thread Lalwani, Jayesh
Although there is such a thing as virtualization of memory done at the OS layer, the JVM imposes its own limit, which is controlled by the spark.executor.memory and spark.driver.memory configurations. The amount of memory allocated by the JVM will be controlled by those parameters. General guidelines
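A small sketch of where those settings go (the 4g figure is just an example): executor memory can be set from application code, while driver memory is normally set at submit time, since the driver JVM already exists by the time user code runs.

```scala
import org.apache.spark.sql.SparkSession

object MemoryConfigDemo {
  def main(args: Array[String]): Unit = {
    // Executor memory can be configured here; driver memory is usually supplied via
    // spark-submit --driver-memory or spark-defaults.conf, because the driver JVM
    // is already running when this code executes.
    val spark = SparkSession.builder()
      .appName("memory-config-demo")
      .config("spark.executor.memory", "4g") // hypothetical sizing
      .getOrCreate()

    println(spark.conf.get("spark.executor.memory"))
  }
}
```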

Re: [Spark2.1] SparkStreaming to Cassandra performance problem

2018-04-30 Thread Javier Pareja
Hi Saulo, if the CPU is close to 100% then you are hitting the limit. I don't think that moving to Scala will make a difference. Both Spark and Cassandra are CPU-hungry, and your setup is small in terms of CPUs. Try running Spark on another (physical) machine so that the 2 cores are dedicated to

Re: [Spark2.1] SparkStreaming to Cassandra performance problem

2018-04-30 Thread Saulo Sobreiro
Hi Javier, I will try to implement this in Scala then. As far as I can see in the documentation, there is no saveToCassandra in the Python interface unless you are working with DataFrames, and the kafkaStream instance does not provide methods to convert an RDD into a DF. Regarding my table, it is
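For the Scala route, the DataStax spark-cassandra-connector adds saveToCassandra directly on DStreams; a minimal sketch is below (the keyspace, table, columns, and the socket source are placeholders; the original setup would read from Kafka instead).

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import com.datastax.spark.connector.SomeColumns
import com.datastax.spark.connector.streaming._

object StreamToCassandra {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("stream-to-cassandra")
      .set("spark.cassandra.connection.host", "127.0.0.1") // assumption: local Cassandra
    val ssc = new StreamingContext(conf, Seconds(10))

    // Illustrative source; in the thread's setup this would be a Kafka DStream.
    val lines = ssc.socketTextStream("localhost", 9999)

    // Hypothetical keyspace/table/columns; adjust to the real schema.
    lines.map(line => (line, line.length))
      .saveToCassandra("my_keyspace", "my_table", SomeColumns("text", "length"))

    ssc.start()
    ssc.awaitTermination()
  }
}
```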

Re: Connect to postgresql with pyspark

2018-04-30 Thread 刘虓
Hi, what's the problem you are facing? 2018-04-30 6:15 GMT+08:00 dimitris plakas: > I am new to pyspark and I am learning it in order to complete my thesis > project at university. > > I am trying to create a dataframe by reading from a postgresql database > table,
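For reference, reading a PostgreSQL table into a DataFrame goes through the JDBC data source; a minimal Scala sketch is below (connection details are placeholders, and the PostgreSQL JDBC driver must be on the classpath, e.g. via --jars or --packages). The PySpark DataFrameReader uses the same format and option names.

```scala
import org.apache.spark.sql.SparkSession

object PostgresRead {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("pg-read").getOrCreate()

    // Hypothetical database, table, and credentials.
    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://localhost:5432/mydb")
      .option("dbtable", "public.my_table")
      .option("user", "spark")
      .option("password", "secret")
      .load()

    df.show(5)
  }
}
```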