Hi,
I have one dataset with parameters and another with data that needs to be
filtered based on the first (parameter) dataset.
*Scenario is as follows:*
For each row in the parameter dataset, I need to apply that parameter row as
a filter to the second dataset. I will end up having multiple
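A minimal sketch of that per-parameter filtering, assuming each parameter row
names a column and a value to keep (all table and column names here are
placeholders, not from the original question):

    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.sql.functions.col

    val spark = SparkSession.builder().appName("param-filter").getOrCreate()

    val params: DataFrame = spark.table("parameters") // assumed columns: col_name, col_value
    val data: DataFrame   = spark.table("data")

    // Parameter sets are usually small, so collecting them to the driver is fine.
    // Each parameter row becomes one filtered view of the data dataset.
    val filtered: Seq[DataFrame] = params.collect().toSeq.map { p =>
      data.filter(col(p.getAs[String]("col_name")) === p.getAs[String]("col_value"))
    }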
Please include the mailing list in your replies.
Yes, you'll be able to launch the jobs, but the k8s backend isn't
hooked up to the listener functionality.
On Mon, Apr 30, 2018 at 8:13 PM, purna m wrote:
> I'm able to submit the job, though!! I mean the Spark application is
Hi,
In my Spark job, I need to scan an HBase table. I set up a scan with custom
filters, then use the newAPIHadoopRDD function to get a JavaPairRDD variable X.
The problem is that when no records in HBase match my filters,
the call X.isEmpty() or X.count() will cause a
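For reference, a minimal sketch of the setup being described, in Scala (the
table name is a placeholder and the custom filters are elided):

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.{Result, Scan}
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.{TableInputFormat, TableMapReduceUtil}
    import org.apache.spark.sql.SparkSession

    val sc = SparkSession.builder().appName("hbase-scan").getOrCreate().sparkContext

    val conf = HBaseConfiguration.create()
    conf.set(TableInputFormat.INPUT_TABLE, "my_table") // placeholder table name
    val scan = new Scan()
    // scan.setFilter(...) -- the custom filters would be attached here
    conf.set(TableInputFormat.SCAN, TableMapReduceUtil.convertScanToString(scan))

    // RDD of (row key, row result) pairs produced by the scan
    val rdd = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
      classOf[ImmutableBytesWritable], classOf[Result])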
Hi, I'm using the code below to submit a Spark 2.3 application to a Kubernetes
cluster, in Scala, using the Play framework.
I have also tried it as a plain Scala program without the Play framework.
I'm trying to run the spark-submit mentioned below programmatically.
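For what it's worth, a minimal sketch of programmatic submission via
SparkLauncher (the master URL, image, jar path, and main class below are all
placeholders); note the reply earlier in this thread that the k8s backend
isn't hooked up to the listener functionality, so handle state callbacks may
not fire:

    import org.apache.spark.launcher.SparkLauncher

    val handle = new SparkLauncher()
      .setMaster("k8s://https://<api-server>:6443")         // placeholder master URL
      .setDeployMode("cluster")
      .setMainClass("com.example.Main")                     // placeholder main class
      .setAppResource("local:///opt/spark/jars/my-app.jar") // placeholder jar
      .setConf("spark.kubernetes.container.image", "myrepo/spark:2.3.0")
      .startApplication() // returns a SparkAppHandle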
Well, if you don't need to actually evaluate the information on the driver, but
just need to trigger some sort of action, then you may want to consider using
the `foreach` or `foreachPartition` method, which is an action and will execute
your process. It won't return anything to the driver and
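A minimal sketch of that pattern (the side effect here is a placeholder):

    import org.apache.spark.sql.SparkSession

    val sc = SparkSession.builder().appName("foreach-demo").getOrCreate().sparkContext
    val rdd = sc.parallelize(1 to 100)

    rdd.foreachPartition { partition =>
      // e.g. open one connection per partition here, then reuse it per record
      partition.foreach(record => println(record)) // placeholder side effect
    }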
Hello there,
I have a quick question regarding how to share data (a small data collection)
between a Kafka producer and consumer using Spark Streaming (Spark 2.2):
(A) the data published by a Kafka producer is received in order on the Kafka
consumer side (see (a) copied below).
(B) however,
Also, it looks like you are mixing configuration properties from different
versions of Spark on Kubernetes.
"spark.kubernetes.{driver|executor}.docker.image" is only available in the
apache-spark-on-k8s fork, whereas "spark.kubernetes.container.image" is new
in Spark 2.3.0. Please make sure you use
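For example, on Spark 2.3.0 the image would be set like this (the image name
is a placeholder):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.kubernetes.container.image", "myrepo/spark:2.3.0") // Spark 2.3.0+
    // spark.kubernetes.{driver|executor}.docker.image is only read by the
    // apache-spark-on-k8s fork; stock Spark 2.3.0 does not use those keys.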
Thanks so much! I'll take a look at the guide right now. The versions
should all be Spark 2.2. In my configuration, I'm using
--conf spark.kubernetes.driver.docker.image=kubespark/spark-driver:v2.2.0-kubernetes-0.5.0 \
--conf
Which version of Spark are you using to run spark-submit, and which
version of Spark is your container image based on? This looks to be caused
by mismatched versions of Spark used for spark-submit and for the
driver/executor at runtime.
On Mon, Apr 30, 2018 at 12:00 PM, Holden Karau
So, while it's not perfect, I have a guide focused on running custom Spark
on GKE:
https://cloud.google.com/blog/big-data/2018/03/testing-future-apache-spark-releases-and-changes-on-google-kubernetes-engine-and-cloud-dataproc
And if you want to run pre-built Spark on GKE, there is a solutions article
Hello all,
I've been trying to spark-submit a job to the Google Kubernetes Engine, but
I keep encountering this error:
Exception in thread "main" java.lang.IllegalArgumentException:
Server properties file given at /opt/spark/work-dir/driver does not exist
or is not a file.
I'm unsure of how to even
Hi, I have a couple of datasets whose schema keeps changing, and I store them
as Parquet files. I use the mergeSchema option while loading these
different-schema Parquet files into a DataFrame, and it works fine. Now I have
a requirement to maintain the difference between schemas over time, basically
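For context, a sketch of that load plus one simple way to track schema drift
(the paths are placeholders):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("merge-schema").getOrCreate()

    val df = spark.read
      .option("mergeSchema", "true")
      .parquet("/data/batch1", "/data/batch2") // placeholder paths

    // One simple way to track drift: diff the field names of two snapshots.
    val oldFields = spark.read.parquet("/data/batch1").schema.fieldNames.toSet
    val addedFields = df.schema.fieldNames.toSet -- oldFields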
I don't think there is a magic number, so I would say that it will depend
on how big your dataset is and the size of your worker(s).
Thank You,
Irving Duran
On Sat, Apr 28, 2018 at 10:41 AM klrmowse wrote:
> i am currently trying to find a workaround for the Spark
`.collect` returns an Array, and arrays can't have more than Int.MaxValue
elements; in most JVMs the practical limit is lower: `Int.MaxValue - 8`.
So that puts an upper limit on it. However, you can just create an Array of
Arrays, and so on, basically limitless, albeit with some gymnastics.
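A sketch of that "array of arrays" idea using glom(), which turns each
partition into an Array so collect() returns Array[Array[T]] and no single
array has to hold the whole dataset:

    import org.apache.spark.sql.SparkSession

    val sc = SparkSession.builder().appName("glom-demo").getOrCreate().sparkContext

    // 100 partitions => collect() returns 100 inner arrays,
    // each bounded by its partition size rather than the total count.
    val rdd = sc.parallelize(1 to 1000000, numSlices = 100)
    val chunks: Array[Array[Int]] = rdd.glom().collect()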
Hi Team,
Any good book recommendations for getting in-depth knowledge, from zero to
production?
Let me know.
Thanks.
Could you please help us by providing a source for those general
guidelines (80-85)?
Even if there is a general guideline, it is probably there to keep the
performance of Spark applications high (and to *distinguish* Spark from
Hadoop). But if you are not too concerned about *performance*
Although there is such a thing as memory virtualization done at the OS
layer, the JVM imposes its own limit, controlled by the
spark.executor.memory and spark.driver.memory configurations. The amount of
memory allocated by the JVM is bounded by those parameters. General
guidelines
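For reference, those two settings look like this (the values are examples);
note that in client mode spark.driver.memory must be set before the driver
JVM starts, e.g. via spark-submit, rather than in application code:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.executor.memory", "8g") // JVM heap per executor (example value)
      .set("spark.driver.memory", "4g")   // JVM heap for the driver (example value)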
Hi Saulo,
If the CPU is close to 100% then you are hitting the limit. I don't think
that moving to Scala will make a difference. Both Spark and Cassandra are
CPU hungry, your setup is small in terms of CPUs. Try running Spark on
another (physical) machine so that the 2 cores are dedicated to
Hi Javier,
I will try to implement this in Scala then. As far as I can see in the
documentation, there is no saveToCassandra in the Python interface unless you
are working with DataFrames, and the kafkaStream instance does not provide
methods to convert an RDD into a DF.
Regarding my table, it is
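In case it helps once you switch to Scala, a minimal sketch of saveToCassandra
from the spark-cassandra-connector (host, keyspace, table, and column names
are placeholders):

    import com.datastax.spark.connector._
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("cassandra-save")
      .set("spark.cassandra.connection.host", "127.0.0.1") // placeholder host
    val sc = new SparkContext(conf)

    // Tuple fields map onto the listed columns, in order.
    val rows = sc.parallelize(Seq((1, "a"), (2, "b")))
    rows.saveToCassandra("my_keyspace", "my_table", SomeColumns("id", "value"))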
Hi,
What's the problem you are facing?
2018-04-30 6:15 GMT+08:00 dimitris plakas :
> I am new to pyspark and I am learning it in order to complete my thesis
> project at university.
>
> I am trying to create a dataframe by reading from a postgresql database
> table,
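The question is about pyspark, but for reference here is a sketch of the
equivalent DataFrameReader call in Scala (all connection details are
placeholders); the Python API mirrors it method for method:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("pg-read").getOrCreate()

    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://localhost:5432/mydb") // placeholder URL
      .option("dbtable", "my_table")                          // placeholder table
      .option("user", "user")
      .option("password", "password")
      .option("driver", "org.postgresql.Driver")
      .load()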