Structured Streaming With Kafka - processing each event

2021-02-24 Thread Sachit Murarka
Hello Users, I am using Spark 3.0.1 Structured Streaming with PySpark. My use case: I receive many records in Kafka (essentially metadata with the location of the actual data). I have to take that metadata from Kafka and apply some processing. Processing includes: reading the actual data
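One common pattern for this use case is foreachBatch: read the metadata records from Kafka as a stream, then in each micro-batch resolve the pointers and load the actual data with the ordinary batch reader. A minimal sketch, assuming the metadata arrives as JSON with a "path" field (the field name, topic, and format are assumptions, not from the thread):

```python
import json


def parse_metadata(raw: bytes) -> dict:
    """Parse a Kafka record value that carries only metadata: assumed here
    to be JSON with a 'path' field pointing at the actual data."""
    meta = json.loads(raw.decode("utf-8"))
    return {"path": meta["path"], "format": meta.get("format", "parquet")}


def start(spark, bootstrap_servers: str, topic: str):
    """Stream the metadata topic; per micro-batch, load and process the
    real data each record points at. Requires the spark-sql-kafka package
    on the classpath."""
    def process_batch(batch_df, batch_id):
        for row in batch_df.select("value").collect():
            meta = parse_metadata(row["value"])
            actual = spark.read.format(meta["format"]).load(meta["path"])
            # ... apply the actual processing to `actual` here ...

    return (spark.readStream
                 .format("kafka")
                 .option("kafka.bootstrap.servers", bootstrap_servers)
                 .option("subscribe", topic)
                 .load()
                 .writeStream
                 .foreachBatch(process_batch)
                 .start())
```

Collecting the metadata rows to the driver is fine here because each record is small; the heavy lifting stays distributed in the inner spark.read.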

Re: Spark on the cloud deployments

2021-02-24 Thread Mich Talebzadeh
Hi Stephane, If you are currently on-premises then you should also consider Google Cloud Platform (GCP). As a practitioner I see a number of customers migrating from other platforms to GCP. Databricks on GCP will be available (if I am correct) in April this year. GCP already offers Google

Unsubscribe

2021-02-24 Thread Roland Johann
unsubscribe-- Roland Johann Data Architect/Data Engineer phenetic GmbH Lütticher Straße 10, 50674 Köln, Germany Mobil: +49 172 365 26 46 Mail: roland.joh...@phenetic.io Web: phenetic.io Handelsregister: Amtsgericht Köln (HRB 92595) Geschäftsführer: Roland Johann, Uwe Reimann

Re: How to control count / size of output files for

2021-02-24 Thread Attila Zsolt Piros
Hi! It is because of "spark.sql.shuffle.partitions". See the value 200 in the physical plan at the RangePartitioning: scala> val df = sc.parallelize(1 to 1000, 10).toDF("v").sort("v") df: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [v: int] scala> df.explain() == Physical Plan ==
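The 200 in that plan is the default of spark.sql.shuffle.partitions, which decides how many partitions (and hence output files) a sort or any other shuffle produces. A minimal PySpark sketch of the two usual knobs: setting the value on the session, and a rough ceiling-division helper for choosing it (the 128 MiB per-file target is an assumption, not from the thread):

```python
def partitions_for(total_bytes: int, target_file_bytes: int = 128 * 1024 * 1024) -> int:
    """Ceiling division: enough shuffle partitions for roughly
    target-sized output files."""
    return max(1, -(-total_bytes // target_file_bytes))


def build_session(shuffle_partitions: int):
    # Import kept local so this sketch loads without a Spark installation.
    from pyspark.sql import SparkSession
    return (SparkSession.builder
            .config("spark.sql.shuffle.partitions", str(shuffle_partitions))
            .getOrCreate())
```

With spark.sql.shuffle.partitions set to 5, the RangePartitioning in the plan above would show 5 instead of 200.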

Re: Structured streaming, Writing Kafka topic to BigQuery table, throws error

2021-02-24 Thread Mich Talebzadeh
Thanks Jungtaek. I am stuck on how to add rows to BigQuery. The batch Spark API in PySpark does it fine. However, we are talking about Structured Streaming with PySpark. This is my code that reads and displays data on the console fine: class MDStreaming: def __init__(self, spark_session, spark_context):
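Since the spark-bigquery connector provides a batch writer, the usual bridge from Structured Streaming is again foreachBatch: each micro-batch arrives as a plain DataFrame, so the batch writer applies. A sketch under that assumption (table and bucket names are placeholders; the connector jar must be on the classpath):

```python
def qualified_table(project: str, dataset: str, table: str) -> str:
    """BigQuery table id in the project.dataset.table form the connector expects."""
    return f"{project}.{dataset}.{table}"


def write_batch(batch_df, batch_id, table: str, gcs_bucket: str):
    # Each micro-batch is an ordinary DataFrame, so the batch writer of
    # the spark-bigquery connector can append it directly.
    (batch_df.write
         .format("bigquery")
         .option("table", table)
         .option("temporaryGcsBucket", gcs_bucket)
         .mode("append")
         .save())


def start_stream(stream_df, table: str, gcs_bucket: str):
    return (stream_df.writeStream
            .foreachBatch(lambda df, bid: write_batch(df, bid, table, gcs_bucket))
            .start())
```

The temporaryGcsBucket option names a staging bucket the connector loads through before committing to BigQuery.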

Re: Spark on the cloud deployments

2021-02-24 Thread Lalwani, Jayesh
AWS has 2 offerings built on top of Spark: EMR and Glue. You can, of course, spin up your own EC2 instances and deploy Spark on them. The 3 offerings let you trade off flexibility against infrastructure management. EC2 gives you the most flexibility, because it's basically a bunch of nodes,

How to control count / size of output files for

2021-02-24 Thread Ivan Petrov
Hi, I'm trying to control the size and/or count of Spark output. Here is my code. I expect to get 5 files but I get dozens of small files. Why?
dataset
  .repartition(5)
  .sort("long_repeated_string_in_this_column") // should be better compressed with snappy
  .write
  .parquet(outputPath)

Spark on the cloud deployments

2021-02-24 Thread Stephane Verlet
Hello, We have been using Spark on an on-premises cluster for several years and are looking at moving to a cloud deployment. I was wondering what your current favorite cloud setup is. Just simple AWS / Azure, or something on top like Databricks? This would support an on demand report