Re: Creating spark context outside of the driver throws error

2021-03-08 Thread Mich Talebzadeh
Ok so I am wondering. Calling this outside of the driver:

appName = config['common']['appName']
spark_session = s.spark_session(appName)

def spark_session(appName):
    return SparkSession.builder \
        .appName(appName) \
        .enableHiveSupport() \
        .getOrCreate()

It

Detecting latecomer events in Spark structured streaming

2021-03-08 Thread Sergey Oboguev
I have a Spark Structured Streaming based application that performs a window(...) construct followed by an aggregation. This construct discards latecomer events that arrive past the watermark. I need to be able to detect these late events to handle them out-of-band. The application maintains a raw
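A minimal sketch of the window-plus-watermark construct being described, with assumed column names and a rate source standing in for the real input; events arriving more than 10 minutes behind the watermark are silently dropped by the aggregation, which is the behaviour in question:

from pyspark.sql import SparkSession
from pyspark.sql.functions import window, count

spark = SparkSession.builder.appName("latecomers").getOrCreate()

# "rate" is a stand-in source; the real application reads its own stream
events = (spark.readStream
          .format("rate")
          .load()
          .withColumnRenamed("timestamp", "eventTime"))

# rows older than the watermark never reach the aggregation state
agg = (events
       .withWatermark("eventTime", "10 minutes")
       .groupBy(window("eventTime", "5 minutes"))
       .agg(count("*").alias("n")))

query = agg.writeStream.outputMode("update").format("console").start()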

Re: Creating spark context outside of the driver throws error

2021-03-08 Thread Sean Owen
Yep, you can never use Spark inside Spark. You could run N jobs in parallel from the driver using Spark, however. On Mon, Mar 8, 2021 at 3:14 PM Mich Talebzadeh wrote: > > In structured streaming with pySpark, I need to do some work on the row > foreach(process_row) > > below > > > def
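A hedged sketch of the "N jobs in parallel from the driver" approach: each thread triggers its own action from the driver, so the jobs run concurrently on the cluster rather than inside executors. The SparkSession (spark), the prices table, and process_ticker are assumptions for illustration:

from concurrent.futures import ThreadPoolExecutor

def process_ticker(ticker):
    # hypothetical per-ticker job; any Spark action works here
    df = spark.read.table("prices").where(f"ticker = '{ticker}'")
    return ticker, df.count()

tickers = ["IBM", "MSFT", "ORCL"]
with ThreadPoolExecutor(max_workers=len(tickers)) as pool:
    results = list(pool.map(process_ticker, tickers))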

Creating spark context outside of the driver throws error

2021-03-08 Thread Mich Talebzadeh
In structured streaming with pySpark, I need to do some work on the row foreach(process_row), below:

def process_row(row):
    ticker = row['ticker']
    price = row['price']
    if ticker == 'IBM':
        print(ticker, price)
    # read data from BigQuery table for analysis
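For reference, a sketch of how such a row handler is typically attached to a streaming query (streamingDF is an assumed streaming DataFrame); process_row runs on the executors, which is why it cannot create its own SparkSession:

query = (streamingDF.writeStream
         .foreach(process_row)
         .start())
query.awaitTermination()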

Call for Presentations for ApacheCon 2021 now open

2021-03-08 Thread Rich Bowen
[Note: You are receiving this because you are subscribed to a users@ list on one or more Apache Software Foundation projects.] The ApacheCon Planners and the Apache Software Foundation are pleased to announce that ApacheCon@Home will be held online, September 21-23, 2021. Once again, we’ll be

Re: Single executor processing all tasks in spark structured streaming kafka

2021-03-08 Thread Kapil Garg
Hi Sachit, What do you mean by "spark is running only 1 executor with 1 task"? Did you submit the Spark application with multiple executors but only 1 is being used and the rest are idle? If that's the case, then it might happen due to the spark.locality.wait setting, which is by default set to 3s. This
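A sketch of lowering that setting when building the session; "0s" disables the locality wait so tasks are scheduled on any free executor immediately instead of waiting up to the default 3s for a node-local slot:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("streaming-app")
         .config("spark.locality.wait", "0s")
         .getOrCreate())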

Single executor processing all tasks in spark structured streaming kafka

2021-03-08 Thread Sachit Murarka
Hi All, I am using Spark 3.0.1 Structured Streaming with PySpark. The problem is Spark is running only 1 executor with 1 task. Following is a summary of what I am doing. Can anyone help on why my executor count is only 1?

def process_events(event):
    fetch_actual_data()
    # many more steps

def

Re: com.esotericsoftware.kryo.KryoException: java.io.IOException: No space left on device\n\t

2021-03-08 Thread Sachit Murarka
Thanks Sean. Kind Regards, Sachit Murarka On Mon, Mar 8, 2021 at 6:23 PM Sean Owen wrote: > It's there in the error: No space left on device > You ran out of disk space (local disk) on one of your machines. > > On Mon, Mar 8, 2021 at 2:02 AM Sachit Murarka > wrote: > >> Hi All, >> >> I am

Re: How to control count / size of output files for

2021-03-08 Thread Gourav Sengupta
Hi, firstly there is no need to use repartition by range. The repartition or coalesce call can come after the sort and everything will be fine. Secondly, to reduce the number of records per file there is no need to use repartition; just try to sort and then write out the files using the
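A hedged sketch of that suggestion (df and the "ts" sort column are assumptions); the maxRecordsPerFile write option caps records per output file without an extra repartition:

(df.sort("ts")
   .write
   .option("maxRecordsPerFile", 1000000)  # cap rows per part file
   .mode("overwrite")
   .parquet("/path/to/output"))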

Re: How to control count / size of output files for

2021-03-08 Thread m li
Hi Ivan, If the error you are referring to is that the data is out of order, it may be that "repartition" shuffled the data out of order. You can try "repartitionByRange" instead:

scala> val df = sc.parallelize(1 to 1000, 10).toDF("v")
scala>
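The same idea as a PySpark sketch: repartitionByRange range-partitions on the sort key, so each output file holds a contiguous, ordered slice of the data (names and paths assumed):

df = spark.range(1, 1001).toDF("v")
(df.repartitionByRange(10, "v")
   .sortWithinPartitions("v")
   .write
   .mode("overwrite")
   .parquet("/path/to/sorted-output"))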

RE: Spark 3.0.1 | Volume to use For Spark Kubernetes Executor Part Files Storage

2021-03-08 Thread Ranju Jain
Hi Mich, I will check GCP buckets; I don't have much idea about how they work. It will be easier for me to study GCP buckets if you validate my understanding below: Are you looking for performance or durability? [Ranju]: Durability, or I would say feasibility. In general, every executor on every node

Re: Spark 3.0.1 | Volume to use For Spark Kubernetes Executor Part Files Storage

2021-03-08 Thread Mich Talebzadeh
Hi Ranju, In your statement: "What is the best shared storage can be used to collate all executors part files at one place." Are you looking for performance or durability? In general, every executor on every node should have access to GCP buckets created under the project (assuming you are using

RE: Spark 3.0.1 | Volume to use For Spark Kubernetes Executor Part Files Storage

2021-03-08 Thread Ranju Jain
Hi Mich, The purpose is that all Spark executors running on K8s worker nodes write their processed task data [part files] to some shared storage, and the driver pod running on the same Kubernetes cluster will then access that shared storage and convert all those part files into a single file. So I am looking
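For what it's worth, if the single file can be produced by Spark itself, a coalesce(1) on the final write avoids the merge step altogether; a sketch, assuming the result is small enough to pass through one task:

(result_df.coalesce(1)  # funnels the final write through a single task
    .write
    .mode("overwrite")
    .csv("/shared/output", header=True))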

Re: Spark 3.0.1 | Volume to use For Spark Kubernetes Executor Part Files Storage

2021-03-08 Thread Mich Talebzadeh
If the purpose is temporary work, write it to a temporary sub-directory under a given bucket:

spark.conf.set("temporaryGcsBucket", config['GCPVariables']['tmp_bucket'])

That dict reference points to this yml file entry:

GCPVariables:
  tmp_bucket: "tmp_storage_bucket/tmp"

just create

Re: Structured Streaming Microbatch Semantics

2021-03-08 Thread Mich Talebzadeh
BTW what you pick up when you start the job depends on the setting in readStream:

.option("startingOffsets", "latest") \

In my previous example I had it as "earliest", so setting it to "latest" will result in starting from the latest topic arrival as shown below None 0 DataFrame
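In context, a sketch of where that option sits in a Kafka readStream (broker address and topic name are placeholders):

df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")  # placeholder
      .option("subscribe", "md")                         # placeholder topic
      # "latest" starts at the newest records; "earliest" replays the topic
      .option("startingOffsets", "latest")
      .load())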

Re: Spark 3.0.1 | Volume to use For Spark Kubernetes Executor Part Files Storage

2021-03-08 Thread Jacek Laskowski
Hi, On GCP I'd go for buckets in Google Storage. Not sure how reliable it is in production deployments though. Only demo experience here. Regards, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books Follow me on

Re: com.esotericsoftware.kryo.KryoException: java.io.IOException: No space left on device\n\t

2021-03-08 Thread Sean Owen
It's there in the error: No space left on device. You ran out of disk space (local disk) on one of your machines. On Mon, Mar 8, 2021 at 2:02 AM Sachit Murarka wrote: > Hi All, > > I am getting the following error in my spark job. > > Can someone please have a look ? > >

Re: Structured Streaming Microbatch Semantics

2021-03-08 Thread Mich Talebzadeh
Ok thanks for the diagram. So you have ~30 seconds duration for each batch as in foreachBatch and 60 rows per batch. Back to your question: "My question is now: Is it guaranteed by Spark that all output records of one event are always contained in a single batch or can the records also be split
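A minimal foreachBatch sketch for reference (streamingDF is assumed); each invocation receives exactly one micro-batch DataFrame plus its batch id, and batch boundaries follow trigger timing and available offsets rather than logical event boundaries:

def handle_batch(batch_df, batch_id):
    # one micro-batch at a time; ~60 rows per batch in the case above
    print(batch_id, batch_df.count())

query = (streamingDF.writeStream
         .foreachBatch(handle_batch)
         .trigger(processingTime="30 seconds")
         .start())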

RE: Spark 3.0.1 | Volume to use For Spark Kubernetes Executor Part Files Storage

2021-03-08 Thread Ranju Jain
Hi Jacek, I am using the property spark.kubernetes.executor.deleteOnTermination=true only to troubleshoot; otherwise I am freeing up resources after executors complete their job. Now I want to use some shared storage which can be shared by all executors to write the part files. Which Kubernetes

Re: Spark 3.0.1 | Volume to use For Spark Kubernetes Executor Part Files Storage

2021-03-08 Thread Jacek Laskowski
Hi, > as Executors terminates after their work completes. --conf spark.kubernetes.executor.deleteOnTermination=false ? Regards, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books Follow me on https://twitter.com/jaceklaskowski
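The same flag set programmatically, as a sketch (the rest of the session config is assumed):

from pyspark.sql import SparkSession

# keep executor pods around after termination so they can be inspected
spark = (SparkSession.builder
         .appName("k8s-app")
         .config("spark.kubernetes.executor.deleteOnTermination", "false")
         .getOrCreate())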

Unsubscribe

2021-03-08 Thread 韩天罡
| | 韩天罡 | | Email: 15175667...@163.com | Signature customized by 网易邮箱大师 (NetEase Mail Master)

(no subject)

2021-03-08 Thread 韩天罡
Unsubscribe | | 韩天罡 | | Email: 15175667...@163.com | Signature customized by 网易邮箱大师 (NetEase Mail Master)

Re: com.esotericsoftware.kryo.KryoException: java.io.IOException: No space left on device\n\t

2021-03-08 Thread Sachit Murarka
Hi Gourav, I am using PySpark, Spark version 2.4.4. I have checked and it's not a space issue. Also I am using a mount directory for storing temp files. Thanks Sachit On Mon, 8 Mar 2021, 13:53 Gourav Sengupta, wrote: > Hi, > > it would be of much help if you could at least format the message before >

Re: com.esotericsoftware.kryo.KryoException: java.io.IOException: No space left on device\n\t

2021-03-08 Thread Gourav Sengupta
Hi, it would be of much help if you could at least format the message before asking people to go through it. Also, I am pretty sure that the error is mentioned in the first line itself. Any ideas regarding the Spark version and environment that you are using? Thanks and Regards, Gourav Sengupta

com.esotericsoftware.kryo.KryoException: java.io.IOException: No space left on device\n\t

2021-03-08 Thread Sachit Murarka
Hi All, I am getting the following error in my Spark job. Can someone please have a look?

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 41.0 failed 4 times, most recent failure: Lost task 0.3 in stage 41.0 (TID 80817, executor 193):