Thanks

Can you please confirm when that work was being carried out, if you recall?

I posted the same question to Google Cloud Dataproc Discussions <
cloud-dataproc-disc...@googlegroups.com>; perhaps someone there will have a
better answer.
There is also another feature called Dataproc on GKE
<https://cloud.google.com/dataproc/docs/guides/dpgke/dataproc-gke-overview>,
which currently supports Spark 2.4 and Spark 3.1. It deploys Dataproc
virtual clusters on a GKE cluster. Unlike Dataproc on Compute Engine
clusters <https://cloud.google.com/dataproc/docs/guides/create-cluster>,
Dataproc on GKE virtual clusters do not include separate master and worker
VMs. Instead, when you create a Dataproc on GKE virtual cluster, Dataproc
on GKE creates node pools within a GKE cluster. Dataproc on GKE jobs run as
pods on these node pools, and GKE manages both the node pools and the
scheduling of pods onto them.
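
As a rough illustration of what that looks like in practice (untested by
me; the project, region, cluster and bucket names are hypothetical, and the
exact field names and version strings should be checked against the
google-cloud-dataproc Python client docs), creating such a virtual cluster
might look like this:

from google.cloud import dataproc_v1

# Regional Dataproc endpoint (region illustrative).
client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": "europe-west2-dataproc.googleapis.com:443"}
)

# A Dataproc on GKE cluster is defined by virtual_cluster_config instead of
# the master/worker VM config used by Dataproc on Compute Engine.
cluster = dataproc_v1.Cluster(
    project_id="my-project",
    cluster_name="dp-gke-virtual",
    virtual_cluster_config=dataproc_v1.VirtualClusterConfig(
        staging_bucket="my-staging-bucket",
        kubernetes_cluster_config=dataproc_v1.KubernetesClusterConfig(
            gke_cluster_config=dataproc_v1.GkeClusterConfig(
                # Existing GKE cluster that will host the virtual cluster.
                gke_cluster_target="projects/my-project/locations/europe-west2/clusters/my-gke",
            ),
            kubernetes_software_config=dataproc_v1.KubernetesSoftwareConfig(
                component_version={"SPARK": "3.1"},  # illustrative version string
            ),
        ),
    ),
)

operation = client.create_cluster(
    project_id="my-project", region="europe-west2", cluster=cluster
)
operation.result()  # wait for the virtual cluster (node pools etc.) to be created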

I guess all these features were added so that customers who cannot migrate
their workloads from Dataproc on Compute Engine to GKE proper can still
benefit from the look and feel of GKE.


   View my LinkedIn profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Mon, 28 Nov 2022 at 18:10, Holden Karau <hol...@pigscanfly.ca> wrote:

> This sounds like a great question for the Google DataProc folks (I know
> there was some interesting work being done around it but I left before it
> was finished so I don't want to provide a possibly incorrect answer).
>
> If you're a GCP customer, try reaching out to their support for details.
>
> On Mon, Nov 21, 2022 at 1:47 PM Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
>> I have not used standalone mode for a good while. Standard Dataproc uses
>> YARN as the resource manager. Vanilla Dataproc is Google's answer to
>> Hadoop on the cloud: move your analytics workload from on-premises to the
>> cloud with little effort and the same look and feel. Google then
>> introduced dynamic allocation of resources to cater for those apps that
>> could not easily be migrated to Kubernetes (GKE). The doc states that
>> without dynamic allocation, Spark only asks for containers at the
>> beginning of the job; with dynamic allocation, it will remove containers,
>> or ask for new ones, as necessary. This is still using YARN. See here
>> <https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/autoscaling#background_autoscaling_with_apache_hadoop_and_apache_spark>.
>> This approach was not necessarily very successful, as adding executors
>> dynamically for larger workloads could freeze the Spark application
>> itself. Reading the doc, the startup time for Serverless is 60 seconds,
>> compared with 90 seconds for Dataproc on Compute Engine (the one where
>> you set up your own Spark cluster on Dataproc tin boxes).
>>
>> Dataproc Serverless for Spark autoscaling
>> <https://cloud.google.com/dataproc-serverless/docs/concepts/autoscaling>
>> makes a reference to: "Dataproc Serverless autoscaling is the default
>> behavior, and uses Spark dynamic resource allocation
>> <https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation>
>> to determine whether, how, and when to scale your workload." So the key
>> point is not standalone mode, but a general reference to: "Spark provides
>> a mechanism to dynamically adjust the resources your application occupies
>> based on the workload. This means that your application may give
>> resources back to the cluster if they are no longer used and request them
>> again later when there is demand. This feature is particularly useful if
>> multiple applications share resources in your Spark cluster."
>>
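>> For reference, the dynamic allocation the doc is pointing at is driven by
>> standard Spark properties. A minimal PySpark sketch (values arbitrary;
>> shuffle tracking replaces the external shuffle service from Spark 3.0
>> onwards):
>>
>> from pyspark.sql import SparkSession
>>
>> spark = (
>>     SparkSession.builder
>>     .appName("dynamic-allocation-demo")
>>     # Request and release executors based on the backlog of pending tasks.
>>     .config("spark.dynamicAllocation.enabled", "true")
>>     # Track shuffle outputs so executors can be removed safely without
>>     # an external shuffle service (available from Spark 3.0).
>>     .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
>>     .config("spark.dynamicAllocation.minExecutors", "2")
>>     .config("spark.dynamicAllocation.maxExecutors", "20")
>>     # Give an idle executor back to the cluster after 60 seconds.
>>     .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
>>     .getOrCreate()
>> )
>>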
>> Isn't this the standard Spark resource allocation? So why has it suddenly
>> been elevated in Spark 3.2?
>>
>> Someone may give a more qualified answer here :)
>>
>>
>>
>> On Mon, 21 Nov 2022 at 17:32, Stephen Boesch <java...@gmail.com> wrote:
>>
>>> Out of curiosity: are there functional limitations in Spark Standalone
>>> that are of concern? YARN is more configurable for running non-Spark
>>> workloads and for running multiple Spark jobs in parallel. But for a
>>> single Spark job, it seems Standalone launches more quickly and does not
>>> miss any features. Are there specific limitations you are aware of or
>>> have run into?
>>>
>>> stephen b
>>>
>>> On Mon, 21 Nov 2022 at 09:01, Mich Talebzadeh <mich.talebza...@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I have not tested this myself but Google have brought out *Dataproc
>>>> Serverless for Spark*. In a nutshell, Dataproc Serverless lets you run
>>>> Spark batch workloads without requiring you to provision and manage
>>>> your own cluster. Specify workload parameters, and then submit the
>>>> workload to the Dataproc Serverless service. The service will run the
>>>> workload on a managed compute infrastructure, autoscaling resources as
>>>> needed. Dataproc Serverless charges apply only to the time when the
>>>> workload is executing. Google Dataproc is similar to Amazon EMR.
>>>>
>>>> So in short you don't need to provision your own Dataproc cluster etc.
>>>> One thing I noticed from the release doc
>>>> <https://cloud.google.com/dataproc-serverless/docs/overview> is that
>>>> the resource management is *Spark based*, as opposed to standard
>>>> Dataproc, which is YARN based. It is available for Spark 3.2. My
>>>> assumption is that "Spark based" means Spark running in standalone
>>>> mode. Has there been much improvement in release 3.2 for standalone
>>>> mode?
>>>>
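>>>> As a rough sketch of the submission model (untested; the bucket,
>>>> project and batch names are hypothetical, and the API details should be
>>>> checked against the google-cloud-dataproc client docs), note that there
>>>> is no cluster to create first; you just submit a batch:
>>>>
>>>> from google.cloud import dataproc_v1
>>>>
>>>> # Regional Dataproc endpoint (region illustrative).
>>>> client = dataproc_v1.BatchControllerClient(
>>>>     client_options={"api_endpoint": "europe-west2-dataproc.googleapis.com:443"}
>>>> )
>>>>
>>>> batch = dataproc_v1.Batch(
>>>>     pyspark_batch=dataproc_v1.PySparkBatch(
>>>>         main_python_file_uri="gs://my-bucket/jobs/etl.py",  # hypothetical job
>>>>     ),
>>>>     # Serverless runtime version; "1.0" shipped with Spark 3.2.
>>>>     runtime_config=dataproc_v1.RuntimeConfig(version="1.0"),
>>>> )
>>>>
>>>> operation = client.create_batch(
>>>>     parent="projects/my-project/locations/europe-west2",
>>>>     batch=batch,
>>>>     batch_id="etl-demo-001",  # hypothetical id
>>>> )
>>>> operation.result()  # wait on the long-running operation
>>>>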
>>>> Thanks
>>>>
>>>>
>>>>
>>>
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
