Re: The trigger interval in spark structured streaming

2021-03-26 Thread Mich Talebzadeh
Thanks for the insight, appreciated. Well, deploying IaaS, for example using Google Dataproc clusters for handling Spark, will certainly address both the size of the cluster and the MIPS power provided by each node of the cluster (which can be adjusted by adding more resources to the existing nodes,

Re: The trigger interval in spark structured streaming

2021-03-26 Thread Lalwani, Jayesh
Short answer: Yes. Long answer: You need to understand your load characteristics to size your cluster. Most applications have 3 components to their load: A) a predictable amount of expected load. This usually changes based on time of day and day of week; the main thing is that it’s predictable.

The trigger interval in spark structured streaming

2021-03-26 Thread Mich Talebzadeh
One thing I noticed is that when the trigger interval in foreachBatch is set to something low (in this case 2 seconds, matching the interval at which the source sends data to the Kafka topic, i.e. every 2 seconds): trigger(processingTime='2 seconds') Spark sends the warning that the queue is falling
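For reference, a minimal PySpark sketch of the setup being described; the broker address, topic name, and batch function body are assumptions rather than details from the thread:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("trigger-interval-demo").getOrCreate()

# Read from a Kafka topic; bootstrap servers and topic name are placeholders.
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "prices")
      .load())

def process_batch(batch_df, batch_id):
    # Whatever per-micro-batch work the job does; here just a row count.
    print(f"batch {batch_id}: {batch_df.count()} rows")

# Trigger every 2 seconds, matching the producer's 2-second send interval.
query = (df.writeStream
         .foreachBatch(process_batch)
         .trigger(processingTime='2 seconds')
         .start())

query.awaitTermination()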

Re: Spark Views Functioning

2021-03-26 Thread Mich Talebzadeh
My view is that temporary views (*createOrReplaceTempView*, or its predecessor *registerTempTable*) are created in driver memory. The DAG states: scala> val sales = spark.read.format("jdbc").options( | Map("url" -> _ORACLEserver, | "dbtable" -> "(SELECT * FROM

Re: Spark Views Functioning

2021-03-26 Thread Sean Owen
Views are simply bookkeeping about how the query is executed, like a DataFrame. There is no data or result to store; it's just how to run a query. The views exist on the driver. The query executes like any other, on the cluster. On Fri, Mar 26, 2021 at 3:38 AM Mich Talebzadeh wrote: > > As a
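A small PySpark illustration of the point; the view name and data are made up, and the example simply shows that registering the view stores no data, while the query runs on the cluster like any other:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("temp-view-demo").getOrCreate()

# A tiny DataFrame standing in for whatever the real source is.
sales = spark.createDataFrame(
    [(1, "widget", 100.0), (2, "gadget", 250.0)],
    ["id", "product", "amount"],
)

# Registers only the logical plan under a name; nothing is computed or copied yet.
sales.createOrReplaceTempView("sales")

# The query is planned on the driver and executed on the executors like any other.
spark.sql("SELECT product, SUM(amount) AS total FROM sales GROUP BY product").show()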

Re: FW: Email to Spark Org please

2021-03-26 Thread Sean Owen
Right, could also be the case that the overhead of distributing it is just dominating. You wouldn't use sklearn with Spark, just use sklearn at this scale. What you _can_ use Spark for easily in this case is to distribute parameter tuning with something like hyperopt. If you're building hundreds
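A hedged sketch of that suggestion; the search space, dataset, and parallelism value are illustrative, and it assumes the hyperopt package with its SparkTrials integration is available:

from hyperopt import fmin, tpe, hp, SparkTrials
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

def objective(params):
    # Each trial trains a plain sklearn model; Spark only distributes the trials.
    model = RandomForestClassifier(
        n_estimators=int(params["n_estimators"]),
        max_depth=int(params["max_depth"]),
        random_state=42,
    )
    score = cross_val_score(model, X, y, cv=3).mean()
    return -score  # hyperopt minimises, so negate the accuracy

space = {
    "n_estimators": hp.quniform("n_estimators", 50, 300, 10),
    "max_depth": hp.quniform("max_depth", 2, 12, 1),
}

# SparkTrials runs the individual trials as Spark tasks across the cluster.
best = fmin(fn=objective, space=space, algo=tpe.suggest,
            max_evals=32, trials=SparkTrials(parallelism=4))
print(best)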

RE: FW: Email to Spark Org please

2021-03-26 Thread Williams, David (Risk Value Stream)
Classification: Public Thanks again Sean. We did try increasing the partitions, but to no avail. Maybe it's because of the low dataset volumes, as you say, so the overhead is the bottleneck. If we use sklearn in Spark, we have to make some changes to utilize the distributed cluster. So if we

Re: FW: Email to Spark Org please

2021-03-26 Thread Sean Owen
Simply because the data set is so small. Anything that's operating entirely in memory is faster than something splitting the same data across multiple machines, running multiple processes, and incurring all the overhead of sending the data and results, combining them, etc. That said, I suspect

Re: convert java dataframe to pyspark dataframe

2021-03-26 Thread Sean Owen
The problem is that the two are not sharing a SparkContext as far as I can see, so there is no way to share the object across them, let alone across languages. You can of course write the data from Java and read it from Python. In some hosted Spark products, you can access the same session from two
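A minimal sketch of the write-from-Java, read-from-Python workaround; the storage path and file format are assumptions:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-java-output").getOrCreate()

# The Java job is assumed to have written its DataFrame to a shared location, e.g.
#   javaDf.write().mode("overwrite").parquet("/shared/output/my_table");
# The PySpark session (a separate SparkContext) then simply reads it back.
df = spark.read.parquet("/shared/output/my_table")
df.show()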

RE: FW: Email to Spark Org please

2021-03-26 Thread Williams, David (Risk Value Stream)
Classification: Limited Many thanks for your response Sean. Question: why is Spark overkill for this, and why is sklearn faster, please? It's the same algorithm, right? Thanks again, Dave Williams From: Sean Owen Sent: 25 March 2021 16:40 To: Williams, David (Risk

convert java dataframe to pyspark dataframe

2021-03-26 Thread Aditya Singh
Hi All, I am a newbie to Spark and trying to pass a Java dataframe to pyspark. The following link has details about what I am trying to do: https://stackoverflow.com/questions/66797382/creating-pysparks-spark-context-py4j-java-gateway-object Can someone please help me with this? Thanks,

Re: Spark Views Functioning

2021-03-26 Thread Mich Talebzadeh
As a first guess, where do you think this view is created in a distributed environment? The whole purpose is fast access to this temporary storage (shared among executors in this job), and that storage is only materialised after an action is performed. scala> val sales =

Spark Views Functioning

2021-03-26 Thread Kushagra Deep
Hi all, I just wanted to know: when we create a 'createOrReplaceTempView' on a Spark dataset, where does the view reside? Does all the data come to the driver and the view is created there? Or do individual executors hold parts of the view (based on the data each executor has) with them, so that