Re: Spark Executor OOMs when writing Parquet

2020-01-17 Thread Chris Teoh
Yes. Disk spill can be a huge performance hit; with smaller partitions you may avoid this and possibly complete your job faster. I hope you don't get OOM. On Sat, 18 Jan 2020 at 10:06, Arwin Tio wrote: > Okay! I didn't realize you can pump those partition numbers up that high. > 15000

Re: Spark Executor OOMs when writing Parquet

2020-01-17 Thread Arwin Tio
Okay! I didn't realize you can pump those partition numbers up that high. 15000 partitions still failed. I am trying 3 partitions now. There is still some disk spill but it is not that high. Thanks, Arwin From: Chris Teoh Sent: January 17, 2020 7:32 PM
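
A minimal PySpark sketch of the approach discussed in this thread: repartitioning into a larger number of smaller partitions before the Parquet write, so each task holds less data and is less likely to spill or OOM. The paths and the partition count below are hypothetical placeholders, not values from the thread.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-write-repartition").getOrCreate()

# Hypothetical input; the thread concerns any large dataset written to Parquet.
df = spark.read.json("s3://my-bucket/input/")

# More, smaller partitions: each write task handles a smaller slice of the data,
# which reduces per-executor memory pressure at the cost of more tasks.
df.repartition(20000).write.mode("overwrite").parquet("s3://my-bucket/output/")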

Extract value from streaming Dataframe to a variable

2020-01-17 Thread Nick Dawes
I need to extract a value from a PySpark structured streaming Dataframe to a string variable to check something. I tried this code. agentName = kinesisDF.select(kinesisDF.agentName.getItem(0).alias("agentName")).collect()[0][0] This works on a non-streaming Dataframe only. In a streaming
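
collect() is only supported on static DataFrames, which is why the line above fails on a streaming one. A common workaround, sketched below under the assumption that the value only needs to be inspected once per micro-batch, is foreachBatch, which hands each micro-batch to a callback as a static DataFrame; kinesisDF and agentName are taken from the question, the rest is hypothetical.

def check_agent(batch_df, batch_id):
    # batch_df is a static DataFrame here, so collect() is allowed.
    rows = batch_df.select(batch_df.agentName.getItem(0).alias("agentName")).collect()
    if rows:
        agent_name = rows[0][0]  # plain Python string, usable for the check
        print(f"batch {batch_id}: agentName = {agent_name}")

query = kinesisDF.writeStream.foreachBatch(check_agent).start()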

Is there a way to get the final web URL from an active Spark context

2020-01-17 Thread Jeff Evans
Given a session/context, we can get the UI web URL like this: sparkSession.sparkContext.uiWebUrl This gives me something like http://node-name.cluster-name:4040. If opening this from outside the cluster (ex: my laptop), this redirects via HTTP 302 to something like
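
One way to discover where that 302 ultimately lands, sketched here with the third-party requests library rather than any Spark API, and assuming the reported address is reachable from wherever this snippet runs:

import requests

ui_url = spark.sparkContext.uiWebUrl  # e.g. http://node-name.cluster-name:4040
resp = requests.get(ui_url, allow_redirects=True, timeout=10)
print(resp.url)  # the final URL after following any 302 redirects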

unsubscribe

2020-01-17 Thread Bruno S. de Barros
- To unsubscribe e-mail: user-unsubscr...@spark.apache.org

unsubscribe

2020-01-17 Thread Christian Acuña

Record count query parallel processing in databricks spark delta lake

2020-01-17 Thread anbutech
Hi, I have a question on the design of a monitoring PySpark script for the large amount of source JSON data coming from more than 100 Kafka topics. These topics are stored under separate buckets in AWS S3. Each of the Kafka topics has terabytes of JSON data with respect to the
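
A rough sketch of one way to run the per-topic record counts concurrently; the bucket layout, the paths, and the choice to read raw JSON here are assumptions rather than details from the post.

from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-topic S3 prefixes; the post describes one bucket per Kafka topic.
topic_paths = [f"s3://my-data-lake/{t}/" for t in ("topic_001", "topic_002", "topic_003")]

def count_records(path):
    # Each count() is a separate Spark job; submitting them from threads lets the
    # scheduler run them concurrently when the cluster has spare capacity.
    return path, spark.read.json(path).count()

with ThreadPoolExecutor(max_workers=8) as pool:
    for path, n in pool.map(count_records, topic_paths):
        print(path, n)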

unsubscribe

2020-01-17 Thread Sethupathi T

Re: Cannot read case-sensitive Glue table backed by Parquet

2020-01-17 Thread oripwk
Sorry, but my original solution is incorrect. 1. Glue Crawlers are not supposed to set the spark.sql.sources.schema.* properties, but Spark SQL should. The default in Spark 2.4 for spark.sql.hive.caseSensitiveInferenceMode is INFER_AND_SAVE, which means that Spark infers the schema from the
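
For reference, a small sketch of where that setting lives; the table name is hypothetical, and INFER_AND_SAVE is simply the default described above being set explicitly.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .enableHiveSupport()
    .config("spark.sql.hive.caseSensitiveInferenceMode", "INFER_AND_SAVE")
    .getOrCreate()
)

# Hypothetical Glue-backed table; with INFER_AND_SAVE, Spark infers the
# case-sensitive schema from the Parquet files and saves it back to the table properties.
spark.table("my_glue_db.my_table").printSchema()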

Re: Cannot read case-sensitive Glue table backed by Parquet

2020-01-17 Thread oripwk
This bug happens because the Glue table's SERDEPROPERTIES is missing two important properties: spark.sql.sources.schema.numParts and spark.sql.sources.schema.part.0. To solve the problem, I had to add those two properties via the Glue console (couldn't do it with ALTER TABLE …). I guess
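
A sketch of how one might produce the values for those two properties before pasting them into the Glue console; the path is hypothetical, and it assumes the schema JSON is small enough to fit in a single part.

# Infer the case-sensitive schema straight from the Parquet files backing the table.
schema_json = spark.read.parquet("s3://my-bucket/table-data/").schema.json()

table_properties = {
    "spark.sql.sources.schema.numParts": "1",      # assumes one part is enough
    "spark.sql.sources.schema.part.0": schema_json,
}
print(table_properties)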

unsubscribe

2020-01-17 Thread vijay krishna

unsubscribe

2020-01-17 Thread Pingxiao Ye