Re: configuring .sparkStaging with group rwx

2021-02-25 Thread Yuri Oleynikov (‫יורי אולייניקוב‬‎)
Try spark-submit with --conf spark.hadoop.fs.permissions.umask-mode=007. You may also set the sticky bit on the staging dir.
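
A minimal sketch of both suggestions (the staging path and user name are assumptions; .sparkStaging normally sits under the submitting user's HDFS home directory):

    spark-submit \
      --conf spark.hadoop.fs.permissions.umask-mode=007 \
      ...

    # set the sticky bit on the (assumed) staging directory
    hdfs dfs -chmod +t /user/clusterowner/.sparkStaging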

configuring .sparkStaging with group rwx

2021-02-25 Thread Bulldog20630405
We have a Spark cluster with multiple users. When running as the user owning the cluster, jobs run fine; however, when trying to run pyspark as a different user it fails, because .sparkStaging/application_* is created with mode 700, so that user cannot write to the directory.
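
A quick way to confirm the symptom (the path, owner, and application id below are illustrative):

    hdfs dfs -ls /user/clusterowner/.sparkStaging
    # drwx------  - clusterowner hadoop ... application_1614235000000_0001
    # mode 700: the group has no access, so other users fail here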

Re: Structured streaming, Writing Kafka topic to BigQuery table, throws error

2021-02-25 Thread Mich Talebzadeh
Hi, I managed to make mine work using the *foreachBatch* function in writeStream. "foreach" performs custom write logic on each row, while "foreachBatch" performs custom write logic on each micro-batch, here through the SendToBigQuery function. foreachBatch(SendToBigQuery) expects 2 parameters: first, the micro-batch DataFrame, and second, the batch ID.
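
A minimal sketch of that setup, assuming streamingDataFrame is the Kafka stream from the thread; the sink options follow the spark-bigquery connector, and the table and bucket names are placeholders:

    def SendToBigQuery(df, batchId):
        # invoked once per micro-batch with its DataFrame and batch id
        df.write \
            .format("bigquery") \
            .option("table", "my_dataset.my_table") \
            .option("temporaryGcsBucket", "my-tmp-bucket") \
            .mode("append") \
            .save()

    streamingDataFrame.writeStream \
        .foreachBatch(SendToBigQuery) \
        .start()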

Re: Structured Streaming With Kafka - processing each event

2021-02-25 Thread Mich Talebzadeh
Hi Sachit, I managed to make mine work using the *foreachBatch* function in writeStream. "foreach" performs custom write logic on each row, while "foreachBatch" performs custom write logic on each micro-batch, here through the SendToBigQuery function. foreachBatch(SendToBigQuery) expects 2 parameters: the micro-batch DataFrame and the batch ID.
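
For contrast with the micro-batch sink above, a minimal sketch of the row-level "foreach" variant; process_row is a hypothetical handler:

    def process_row(row):
        # custom write logic, invoked once per row
        print(row)

    streamingDataFrame.writeStream \
        .foreach(process_row) \
        .start()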

Re: How to control count / size of output files for

2021-02-25 Thread Gourav Sengupta
Hi Ivan, sorry, but it always helps to know the version of Spark you are using, its environment, the format you are writing your files out to, and any other details if possible. Regards, Gourav Sengupta

Re: How to control count / size of output files for

2021-02-25 Thread Ivan Petrov
Ah... makes sense, thank you. I tried sortWithinPartitions before and replaced it with sort. It was a mistake. Thu, 25 Feb 2021 at 15:25, Pietro Gentile < pietro.gentile89.develo...@gmail.com>: > Hi, > It is because of *repartition* before the *sort* method invocation. If you reverse them
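
A minimal sketch of the difference being discussed (df, the column, and the output paths are placeholders): a global sort after repartition triggers its own shuffle and discards the requested partition count, while sortWithinPartitions preserves it:

    # the sort's shuffle overrides the 10 partitions
    df.repartition(10).sort("key").write.parquet("/tmp/out_sorted")

    # sorts inside each of the 10 partitions; file count is preserved
    df.repartition(10).sortWithinPartitions("key").write.parquet("/tmp/out_swp")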

Aggregating large objects and reducing memory pressure

2021-02-25 Thread Augusto
Hi, I am writing here because I need help/advice on how to perform aggregations more efficiently. In my current setup I have an Accumulator object which is used as the zeroValue for the foldByKey function. This Accumulator object can get very large, since the accumulations also include lists and maps.
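
A minimal sketch of that pattern, assuming a dict-based accumulator with illustrative fields; note that foldByKey combines values pairwise with one function, so the RDD values must already have the accumulator's shape:

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    zero = {"items": [], "count": 0}  # zeroValue for foldByKey

    def merge(left, right):
        # combine two accumulators; the embedded lists and maps
        # are what makes these objects large
        return {"items": left["items"] + right["items"],
                "count": left["count"] + right["count"]}

    pairs = sc.parallelize([("a", {"items": [1], "count": 1}),
                            ("a", {"items": [2], "count": 1})])
    result = pairs.foldByKey(zero, merge).collect()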

Re: Structured Streaming With Kafka - processing each event

2021-02-25 Thread Mich Talebzadeh
BTW, you intend to process these in 30 seconds? processingTime="30 seconds". So how many rows of data are sent in each micro-batch, and at what interval do you receive the data in batches from the producer?
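
For reference, the trigger being asked about, shown on a hypothetical query with a console sink:

    streamingDataFrame.writeStream \
        .format("console") \
        .trigger(processingTime="30 seconds") \
        .start()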

Re: Structured Streaming With Kafka - processing each event

2021-02-25 Thread Mich Talebzadeh
If you are receiving data from Kafka, wouldn't that be better in JSON format? Try: # construct a streaming DataFrame streamingDataFrame that subscribes to topic config['MDVariables']['topic'] -> md (market data) streamingDataFrame = self.spark \
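
The quoted snippet is truncated; a minimal sketch of how such a Kafka source is typically wired up (the bootstrap servers and the value cast are assumptions):

    streamingDataFrame = self.spark \
        .readStream \
        .format("kafka") \
        .option("kafka.bootstrap.servers", "broker1:9092") \
        .option("subscribe", config['MDVariables']['topic']) \
        .load() \
        .selectExpr("CAST(value AS STRING) AS value")  # the JSON payload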