Re: docker image distribution in Kubernetes cluster

2021-12-09 Thread Mich Talebzadeh
Thanks Prasad.

My understanding of what you are implying is that we can have multiple
Docker images available for different use cases:

gcloud container images list-tags eu.gcr.io//spark-py
e2e71387c295  3.1.1-scala_2.12-8-jre-slim-buster-java8WithPyyaml   2021-12-08T22:56:17
d0bcc195a35f  3.1.2-scala_2.12-8-jre-slim-buster-addedpackages     2021-08-27T20:43:11
229e03971f73  3.1.1-scala_2.12-8-jre-slim-buster-addedpackages     2021-08-22T17:23:50

So spark-submit can use either of these via:

   --conf spark.kubernetes.driver.container.image=${IMAGEGCP} \
   --conf spark.kubernetes.executor.container.image=${IMAGEGCP} \


Note that in this case both the driver and executors will use the same
image, and ${IMAGEGCP} can be set to whatever is in the repository. The
point made in previous comments was that the driver could use the basic
image, say 3.1.1-scala_2.12-8-jre-slim-buster-java8WithPyyaml, while the
executors use 3.1.1-scala_2.12-8-jre-slim-buster-addedpackages with the
additional packages.
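
For illustration, a hedged sketch of that mixed setup in spark-submit terms
(<PROJECT_ID> is a placeholder for the GCR project elided in the listing
above; the tags are the ones shown there):

   --conf spark.kubernetes.driver.container.image=eu.gcr.io/<PROJECT_ID>/spark-py:3.1.1-scala_2.12-8-jre-slim-buster-java8WithPyyaml \
   --conf spark.kubernetes.executor.container.image=eu.gcr.io/<PROJECT_ID>/spark-py:3.1.1-scala_2.12-8-jre-slim-buster-addedpackages \

With this split, the generic spark.kubernetes.container.image need not be set
at all, since both specific images are given explicitly.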


Cheers


On Thu, 9 Dec 2021 at 05:59, Prasad Paravatha wrote:

> I agree with Khalid and Rob. We absolutely need different properties for
> Driver and Executor images for ML use-cases.
>
> Here is a real-world example of the setup at our company:
>
>    - Default setup via configmaps: when our Data scientists request Spark
>    on k8s clusters (they are not familiar with Docker or k8s), we inject
>    default Spark driver/executor images (and a whole lot of other default
>    properties).
>    - Our ML Engineers frequently build new driver and executor images to
>    include new experimental ML libraries/packages, test them, and release
>    them to the wider Data scientist community.
>
> Regards,
> Prasad
>
> On Thu, Dec 9, 2021 at 12:25 AM Mich Talebzadeh wrote:
>
>>
>> Fine. If I go back to the list itself
>>
>>
>> Property Name: spark.kubernetes.container.image
>> Default: (none)
>> Meaning: Container image to use for the Spark application. This is usually
>> of the form example.com/repo/spark:v1.0.0. This configuration is required
>> and must be provided by the user, unless explicit images are provided for
>> each different container type. (Since 2.3.0)
>>
>> Property Name: spark.kubernetes.driver.container.image
>> Default: (value of spark.kubernetes.container.image)
>> Meaning: Custom container image to use for the driver. (Since 2.3.0)
>>
>> Property Name: spark.kubernetes.executor.container.image
>> Default: (value of spark.kubernetes.container.image)
>> Meaning: Custom container image to use for executors.
>>
>> If I specify *both* the driver and executor images, then there is no need
>> for the generic container image; it will be ignored. So one either
>> specifies the driver AND executor images explicitly and omits the
>> container image, or specifies only one of them (driver *or* executor)
>> explicitly, in which case the container image must also be set so the
>> default can apply to the other. A bit of a long shot.
>>
>>
>> cheers
>>
>> On Wed, 8 Dec 2021 at 18:21, Rob Vesse wrote:
>>
>>> So the point Khalid was trying to make is that there are legitimate
>>> reasons you might use different container images for the driver pod vs the
>>> executor pod.  It has nothing to do with Docker versions.
>>>
>>>
>>>
>>> Since the bulk of the actual work happens on the executors, you may want
>>> additional libraries, tools or software in that image that your job code
>>> can call.  This same software may be entirely unnecessary on the driver,
>>> allowing you to use a smaller image for the driver than for the executors.
>>>
>>>
>>>
>>> As a practical example, for an ML use case you might want to include the
>>> optional Intel MKL or OpenBLAS dependencies, which can significantly bloat
>>> the size of your container image (by hundreds of megabytes) and would only
>>> be needed by the executor pods.
>>>
>>>
>>>
>>> Rob
>>>
>>>
>>>
>>> *From: *Mich Talebzadeh 
>>> *Date: *Wednesday, 8 December 2021 at 17:42
>>> *To: *Khalid Mammadov 
>>> *Cc: *"user @spark" , Spark dev list <dev@spark.apache.org>
>>> *Subject: *Re: docker image distribution in Kubernetes cluster
>>>
>>>
>>>
>>> Thanks Khalid for your notes
>>>
>>>
>>>

Creating a memory-efficient AggregateFunction to calculate Median

2021-12-09 Thread Nicholas Chammas
I'm trying to create a new aggregate function. It's my first time working
with Catalyst, so it's exciting, but I'm also a bit in over my head.

My goal is to create a function to calculate the median.

As a very simple solution, I could just define median to be an alias of
`Percentile(col, 0.5)`. However, the leading comment on the Percentile
expression highlights that it's very memory-intensive and can easily lead to
OutOfMemory errors.
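
For context, the alias approach is a one-liner at the user level. A minimal
sketch in Scala, assuming a DataFrame df with a numeric column named value
(both names are illustrative):

import org.apache.spark.sql.functions.expr

// exact median via the existing (memory-hungry) Percentile expression
val medianDF = df.agg(expr("percentile(value, 0.5)").as("median"))

It works, but it inherits Percentile's behaviour of buffering all values,
which is exactly the problem described above.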

So instead of using Percentile, I'm trying to create an Expression that
calculates the median without needing to hold everything in memory at once.
I'm considering two different approaches:

1. Define Median as a combination of existing expressions: the median can
perhaps be built out of the existing expressions for Count and NthValue.

I don't see a template I can follow for building a new expression out of
existing expressions (i.e. without having to implement a bunch of methods
for DeclarativeAggregate or ImperativeAggregate). I also don't know how I
would wrap NthValue to make it usable as a regular aggregate function. The
wrapped NthValue would need an implicit window that provides the necessary
ordering.


Is there any potential to this idea? Any pointers on how to implement it?
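
For what it's worth, a rough user-level sketch of the Count + NthValue
combination in Scala (assuming a numeric column named value; this is not a
DeclarativeAggregate, and the global window pulls all rows into a single
partition, which is part of why wrapping NthValue properly is the open
question):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, nth_value}

val n = df.count()                         // first pass: Count
val w = Window.orderBy(col("value"))
  .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
// middle position for odd n; even n would average positions n/2 and n/2 + 1
val mid = ((n + 1) / 2).toInt
val medianDF = df
  .select(nth_value(col("value"), mid).over(w).as("median"))
  .limit(1)                                // second pass: NthValue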


2. Another memory-light approach to calculating the median requires multiple
passes over the data to converge on the answer. The approach is described
here. (I posted a sketch implementation of this approach using Spark's
user-level API here.)
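
To make the multi-pass idea concrete, a rough bisection-style sketch in Scala
(not the linked implementation; it assumes a DoubleType column, and
approxMedian is just an illustrative name):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, max, min}

def approxMedian(df: DataFrame, valueCol: String, tol: Double = 1e-6): Double = {
  val bounds = df.agg(min(col(valueCol)), max(col(valueCol))).head()
  var lo = bounds.getDouble(0)
  var hi = bounds.getDouble(1)
  val half = df.count() / 2.0
  while (hi - lo > tol) {
    val mid = (lo + hi) / 2.0
    // each iteration is one full pass over the data
    if (df.filter(col(valueCol) <= mid).count() >= half) hi = mid else lo = mid
  }
  (lo + hi) / 2.0
}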

I am also struggling to understand how I would build an aggregate function
like this, since it requires multiple passes over the data. From what I can
see, Catalyst's aggregate functions are designed to work with a single pass
over the data.

We don't seem to have an interface for AggregateFunction that supports
multiple passes over the data. Is there some way to do this?


Again, this is my first serious foray into Catalyst. Any specific
implementation guidance is appreciated!

Nick