Hello Users,
I am using Spark 3.0.1 Structured Streaming with PySpark.
My use case:
I get a lot of records in Kafka (essentially some metadata with the location
of the actual data). I have to take that metadata from Kafka and apply some
processing.
Processing includes: reading the actual data location
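A minimal PySpark sketch of that pattern, with a hypothetical broker, topic,
and metadata schema, and only a placeholder for the per-batch processing:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("kafka-metadata-stream").getOrCreate()

# Hypothetical schema: each Kafka record carries the location of the actual data.
meta_schema = StructType([StructField("data_location", StringType())])

metadata = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
    .option("subscribe", "metadata-topic")               # placeholder topic
    .load()
    .select(from_json(col("value").cast("string"), meta_schema).alias("m"))
    .select("m.*"))

def process_batch(batch_df, batch_id):
    # For each micro-batch, read the actual data referenced by the metadata
    # and apply the processing (placeholder only).
    for row in batch_df.collect():
        actual = spark.read.parquet(row["data_location"])  # assumed Parquet location
        actual.count()  # stand-in for the real processing

query = metadata.writeStream.foreachBatch(process_batch).start()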
Hi Stephane,
If you are currently running on premises, then you should also consider
Google Cloud Platform (GCP). As a practitioner I see a number of customers
migrating from other platforms to GCP.
Databricks on GCP will be available (if I am correct) in April this year. GCP
already offers Google Compute
Roland Johann
Data Architect/Data Engineer
phenetic GmbH
Lütticher Straße 10, 50674 Köln, Germany
Mobile: +49 172 365 26 46
Email: roland.joh...@phenetic.io
Web: phenetic.io
Commercial register: Amtsgericht Köln (HRB 92595)
Managing directors: Roland Johann, Uwe Reimann
hi!
This is because of "spark.sql.shuffle.partitions". See the value 200 at the
rangepartitioning step in the physical plan:
scala> val df = sc.parallelize(1 to 1000, 10).toDF("v").sort("v")
df: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [v: int]
scala> df.explain()
== Physical Plan ==
*
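A minimal sketch of how to change that default (shown in PySpark, assuming an
active session named spark; the Scala equivalent is spark.conf.set on the
session):

# Lower the shuffle partition count before the sort, so the range
# partitioning produces 10 partitions instead of the default 200.
spark.conf.set("spark.sql.shuffle.partitions", "10")

df = spark.range(1, 1001).toDF("v").sort("v")
df.explain()  # the rangepartitioning step should now show 10 instead of 200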
Thanks Jungtaek.
I am stuck on how to add rows to BigQuery. The batch Spark API in PySpark does
it fine. However, we are talking about Structured Streaming with PySpark.
This is my code that reads and displays data on the console fine:
class MDStreaming:
    def __init__(self, spark_session, spark_context):
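Since the rest of the class is not shown, here is only a sketch of one common
approach, assuming the Google spark-bigquery connector is on the classpath and
using placeholder table, bucket, and checkpoint names: write each micro-batch
with the connector's batch writer via foreachBatch.

def write_to_bigquery(batch_df, batch_id):
    # Write one micro-batch with the (batch) BigQuery connector.
    # Table, bucket, and path names below are placeholders.
    (batch_df.write
        .format("bigquery")
        .option("table", "my_project.my_dataset.my_table")
        .option("temporaryGcsBucket", "my-temp-bucket")
        .mode("append")
        .save())

query = (streaming_df.writeStream  # streaming_df: the DataFrame already shown on the console
    .foreachBatch(write_to_bigquery)
    .option("checkpointLocation", "gs://my-temp-bucket/checkpoints")
    .start())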
AWS has two offerings built on top of Spark: EMR and Glue. You can, of course,
spin up your own EC2 instances and deploy Spark on them. The three options allow
you to trade off between flexibility and infrastructure management. EC2 gives
you the most flexibility, because it's basically a bunch of nodes,
Hi, I'm trying to control the size and/or count of the Spark output files.
Here is my code. I expect to get 5 files, but I get dozens of small files.
Why?
dataset
  .repartition(5)
  .sort("long_repeated_string_in_this_column") // should be better compressed with snappy
  .write
  .parquet(outputPath)
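The global sort() triggers a range-partitioning shuffle governed by
spark.sql.shuffle.partitions (200 by default), which undoes the repartition(5).
A minimal sketch of one way to keep 5 files while still sorting the data for
compression, shown in PySpark with the same placeholder names:

# Sort inside each of the 5 partitions instead of globally, so no extra
# shuffle is introduced and exactly 5 part files are written.
(dataset
    .repartition(5)
    .sortWithinPartitions("long_repeated_string_in_this_column")
    .write
    .parquet(outputPath))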
Hello,
We have been using Spark on an on-premises cluster for several years and are
looking at moving to a cloud deployment.
I was wondering what your current favorite cloud setup is. Just plain
AWS / Azure, or something on top like Databricks?
This would support an on-demand report application.