Re: Time to start publishing Spark Docker Images?

2021-08-16 Thread Holden Karau
These are some really good points all around.

I think, in the interest of simplicity, we'll start with just the 3 current
Dockerfiles in the Spark repo, but for the next release (3.3) we should
explore adding more Dockerfiles/build options.
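
For reference, here is a minimal sketch of how those three images (spark,
spark-py and spark-r) are built and pushed today from a Spark binary
distribution with the docker-image-tool.sh script; the "myrepo" registry
name and the 3.1.2 tag below are placeholders, not a decided naming scheme:

# Build the JVM, PySpark and SparkR images from the Dockerfiles under
# kubernetes/dockerfiles/spark/ in the distribution
./bin/docker-image-tool.sh -r myrepo -t 3.1.2 \
  -p ./kubernetes/dockerfiles/spark/bindings/python/Dockerfile \
  -R ./kubernetes/dockerfiles/spark/bindings/R/Dockerfile \
  build

# Push myrepo/spark, myrepo/spark-py and myrepo/spark-r with that tag
./bin/docker-image-tool.sh -r myrepo -t 3.1.2 push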

On Mon, Aug 16, 2021 at 10:46 AM Maciej  wrote:

> I have a few concerns regarding PySpark and SparkR images.
>
> First of all, how do we plan to handle interpreter versions? Ideally, we
> should provide images for all supported variants, but based on the
> preceding discussion and the proposed naming convention, I assume it is not
> going to happen. If that's the case, it would be great if we could fix
> interpreter versions based on some support criteria (lowest supported,
> lowest non-deprecated, highest supported at the time of release, etc.)
>
> Currently, we use the following:
>
>    - for R, the buster-cran35 Debian repositories, which install R 3.6
>    (the provided version has already changed in the past and broke the
>    image build ‒ SPARK-28606);
>    - for Python, the system-provided python3 packages, which currently
>    provide Python 3.7.
>
> Neither of these guarantees stability over time, and both might be hard to
> synchronize with our support matrix.
>
> Secondly, omitting libraries that are required for full functionality
> and performance, specifically
>
>    - NumPy, Pandas and Arrow for PySpark
>    - Arrow for SparkR
>
> is likely to severely limit the usability of the images (of these, Arrow
> is probably the hardest to manage, especially when you already depend on
> system packages to provide the R or Python interpreter).
>
> On 8/14/21 12:43 AM, Mich Talebzadeh wrote:
>
> Hi,
>
> We can cater for multiple types (spark, spark-py and spark-r) and Spark
> versions (assuming they are downloaded and available).
> The challenge is that these Docker images are snapshots once built. They
> cannot be amended in place: if you change anything inside a running
> container, whatever you did is gone as soon as you start a new container
> from the image.
>
> For example, I want to add tensorflow to my docker image. These are my
> images:
>
> REPOSITORY                             TAG           IMAGE ID       CREATED      SIZE
> eu.gcr.io/axial-glow-224522/spark-py   java8_3.1.1   cfbb0e69f204   5 days ago   2.37GB
> eu.gcr.io/axial-glow-224522/spark      3.1.1         8d1bf8e7e47d   5 days ago   805MB
>
> Using the image ID, I log in to the image as root:
>
> *docker run -u0 -it cfbb0e69f204 bash*
>
> root@b542b0f1483d:/opt/spark/work-dir# pip install keras
> Collecting keras
>   Downloading keras-2.6.0-py2.py3-none-any.whl (1.3 MB)
>      |████████████████████████████████| 1.3 MB 1.1 MB/s
> Installing collected packages: keras
> Successfully installed keras-2.6.0
> WARNING: Running pip as the 'root' user can result in broken permissions
> and conflicting behaviour with the system package manager. It is
> recommended to use a virtual environment instead:
> https://pip.pypa.io/warnings/venv
> root@b542b0f1483d:/opt/spark/work-dir# pip list
> Package       Version
> ------------- -------
> asn1crypto    0.24.0
> cryptography  2.6.1
> cx-Oracle     8.2.1
> entrypoints   0.3
> *keras         2.6.0      <--- it is here*
> keyring       17.1.1
> keyrings.alt  3.1.1
> numpy         1.21.1
> pip           21.2.3
> py4j          0.10.9
> pycrypto      2.6.1
> PyGObject     3.30.4
> pyspark       3.1.2
> pyxdg         0.25
> PyYAML        5.4.1
> SecretStorage 2.3.1
> setuptools    57.4.0
> six           1.12.0
> wheel         0.32.3
> root@b542b0f1483d:/opt/spark/work-dir# exit
>
> Now I exit the container and log in to the image again:
> (pyspark_venv) hduser@rhes76: /home/hduser/dba/bin/build> docker run -u0
> -it cfbb0e69f204 bash
>
> root@5231ee95aa83:/opt/spark/work-dir# pip list
> Package       Version
> ------------- -------
> asn1crypto    0.24.0
> cryptography  2.6.1
> cx-Oracle     8.2.1
> entrypoints   0.3
> keyring       17.1.1
> keyrings.alt  3.1.1
> numpy         1.21.1
> pip           21.2.3
> py4j          0.10.9
> pycrypto      2.6.1
> PyGObject     3.30.4
> pyspark       3.1.2
> pyxdg         0.25
> PyYAML        5.4.1
> SecretStorage 2.3.1
> setuptools    57.4.0
> six           1.12.0
> wheel         0.32.3
>
> *Hmm, that keras is not there*. Changes made inside a running container
> are never written back to the image; once the Docker image is built it is
> just a snapshot. However, it will still have tons of useful stuff for most
> users/organisations. My suggestion is to create, for a given type (spark,
> spark-py etc.):
>
>
>    1. One vanilla flavour for everyday use with a few useful packages
>    2. One for medium use with the most common packages for ETL/ELT work
>    3. One specialist flavour for ML etc. with keras, tensorflow and anything
>    else needed
>
>
> These images should be maintained just as we currently maintain Spark
> releases, with accompanying documentation. Is there any reason why we
> cannot maintain them ourselves?
>
> HTH
>

Re: Time to start publishing Spark Docker Images?

2021-08-16 Thread Maciej
I have a few concerns regarding PySpark and SparkR images.

First of all, how do we plan to handle interpreter versions? Ideally, we
should provide images for all supported variants, but based on the
preceding discussion and the proposed naming convention, I assume it is
not going to happen. If that's the case, it would be great if we could
fix interpreter versions based on some support criteria (lowest
supported, lowest non-deprecated, highest supported at the time of
release, etc.)

Currently, we use the following:

  * for R, the buster-cran35 Debian repositories, which install R 3.6
    (the provided version has already changed in the past and broke the
    image build ‒ SPARK-28606);
  * for Python, the system-provided python3 packages, which currently
    provide Python 3.7.

Neither of these guarantees stability over time, and both might be hard
to synchronize with our support matrix.
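
One way to make the Python side deterministic would be to pin the
interpreter explicitly at image build time instead of taking whatever the
distro ships. A minimal sketch of the idea, assuming we are willing to pull
Miniconda into the PySpark bindings Dockerfile (none of this is in the
current Dockerfiles, and the version below is only an example):

ARG base_img
FROM $base_img
ARG PYTHON_VERSION=3.8

USER root
# Install a pinned interpreter instead of the distro's python3
# (assumes curl is available in the base image)
RUN curl -fsSL https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh \
      -o /tmp/miniconda.sh && \
    bash /tmp/miniconda.sh -b -p /opt/conda && \
    /opt/conda/bin/conda install -y python=${PYTHON_VERSION} && \
    /opt/conda/bin/conda clean -afy && \
    rm /tmp/miniconda.sh
# Point Spark at the pinned interpreter
ENV PYSPARK_PYTHON=/opt/conda/bin/python

The R side is harder: on Debian it essentially means relying on a specific
CRAN apt suite, whose contents can still shift under us (as SPARK-28606
showed).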

Secondly, omitting libraries that are required for full functionality
and performance, specifically

  * NumPy, Pandas and Arrow for PySpark
  * Arrow for SparkR

is likely to severely limit the usability of the images (of these, Arrow
is probably the hardest to manage, especially when you already depend on
system packages to provide the R or Python interpreter).
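
If we do decide to bundle them, the extra libraries could be baked into the
bindings Dockerfiles at build time. A rough sketch only; the version floors
are illustrative, not an agreed support matrix:

# PySpark bindings Dockerfile: libraries needed for pandas UDFs and
# Arrow-based conversions
RUN pip install --no-cache-dir 'numpy>=1.15' 'pandas>=1.0' 'pyarrow>=1.0.0'

# SparkR bindings Dockerfile: the Arrow R package
RUN Rscript -e 'install.packages("arrow", repos = "https://cloud.r-project.org")'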


On 8/14/21 12:43 AM, Mich Talebzadeh wrote:
> Hi,
>
> We can cater for multiple types (spark, spark-py and spark-r) and Spark
> versions (assuming they are downloaded and available).
> The challenge is that these Docker images are snapshots once built. They
> cannot be amended in place: if you change anything inside a running
> container, whatever you did is gone as soon as you start a new container
> from the image.
>
> For example, I want to add tensorflow to my docker image. These are my
> images:
>
> REPOSITORY                             TAG           IMAGE ID       CREATED      SIZE
> eu.gcr.io/axial-glow-224522/spark-py   java8_3.1.1   cfbb0e69f204   5 days ago   2.37GB
> eu.gcr.io/axial-glow-224522/spark      3.1.1         8d1bf8e7e47d   5 days ago   805MB
>
> Using the image ID, I log in to the image as root:
>
> *docker run -u0 -it cfbb0e69f204 bash*
>
> root@b542b0f1483d:/opt/spark/work-dir# pip install keras
> Collecting keras
>   Downloading keras-2.6.0-py2.py3-none-any.whl (1.3 MB)
>      |████████████████████████████████| 1.3 MB 1.1 MB/s
> Installing collected packages: keras
> Successfully installed keras-2.6.0
> WARNING: Running pip as the 'root' user can result in broken
> permissions and conflicting behaviour with the system package manager.
> It is recommended to use a virtual environment instead:
> https://pip.pypa.io/warnings/venv 
> root@b542b0f1483d:/opt/spark/work-dir# pip list
> Package       Version
> ------------- -------
> asn1crypto    0.24.0
> cryptography  2.6.1
> cx-Oracle     8.2.1
> entrypoints   0.3
> *keras         2.6.0      <--- it is here* 
> keyring       17.1.1
> keyrings.alt  3.1.1
> numpy         1.21.1
> pip           21.2.3
> py4j          0.10.9
> pycrypto      2.6.1
> PyGObject     3.30.4
> pyspark       3.1.2
> pyxdg         0.25
> PyYAML        5.4.1
> SecretStorage 2.3.1
> setuptools    57.4.0
> six           1.12.0
> wheel         0.32.3
> root@b542b0f1483d:/opt/spark/work-dir# exit
>
> Now I exit the container and log in to the image again:
> (pyspark_venv) hduser@rhes76: /home/hduser/dba/bin/build> docker run
> -u0 -it cfbb0e69f204 bash
>
> root@5231ee95aa83:/opt/spark/work-dir# pip list
> Package       Version
> ------------- -------
> asn1crypto    0.24.0
> cryptography  2.6.1
> cx-Oracle     8.2.1
> entrypoints   0.3
> keyring       17.1.1
> keyrings.alt  3.1.1
> numpy         1.21.1
> pip           21.2.3
> py4j          0.10.9
> pycrypto      2.6.1
> PyGObject     3.30.4
> pyspark       3.1.2
> pyxdg         0.25
> PyYAML        5.4.1
> SecretStorage 2.3.1
> setuptools    57.4.0
> six           1.12.0
> wheel         0.32.3
>
> *Hmm, that keras is not there*. Changes made inside a running container
> are never written back to the image; once the Docker image is built it is
> just a snapshot. However, it will still have tons of useful stuff for most
> users/organisations. My suggestion is to create, for a given type
> (spark, spark-py etc.):
>
>  1. One vanilla flavour for everyday use with a few useful packages
>  2. One for medium use with the most common packages for ETL/ELT work
>  3. One specialist flavour for ML etc. with keras, tensorflow and anything
>     else needed
>
>
> These images should be maintained just as we currently maintain Spark
> releases, with accompanying documentation. Is there any reason why we
> cannot maintain them ourselves?
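
A minimal sketch of how such a flavour could be published so that the
additions actually persist (unlike packages installed interactively inside a
running container): layer a new image on top of the existing one and give it
its own tag. The base image, tag and package list below are just taken from
the example above; the -ml suffix is hypothetical.

# Dockerfile.ml -- an "ML" flavour layered on the PySpark image
FROM eu.gcr.io/axial-glow-224522/spark-py:java8_3.1.1
USER root
RUN pip install --no-cache-dir keras tensorflow
# 185 is the default non-root spark user in the stock Dockerfiles
USER 185

# Build and push the new image; because the packages are baked into a layer,
# they survive every docker run, unlike the interactive pip install above.
docker build -t eu.gcr.io/axial-glow-224522/spark-py:java8_3.1.1-ml -f Dockerfile.ml .
docker push eu.gcr.io/axial-glow-224522/spark-py:java8_3.1.1-ml
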
>
> HTH
>

Spark docker image convention proposal for docker repository

2021-08-16 Thread Mich Talebzadeh
Hi,

I propose that for Spark Docker images we follow a naming convention
similar to Flink's, as shown in the attached file.

So for Spark we will have



3.1.2-scala_2.12-java11
3.1.2_sparkpy-scala_2.12-java11
3.1.2_sparkR-scala_2.12-java11
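
For example, an image built locally by docker-image-tool.sh could be
retagged and pushed under this convention roughly like so (a sketch only;
the apache/spark repository name is a placeholder, not a decided location):

docker tag spark-py:3.1.2 apache/spark:3.1.2_sparkpy-scala_2.12-java11
docker push apache/spark:3.1.2_sparkpy-scala_2.12-java11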


If this makes sense, please respond; otherwise, state your preference.


HTH


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org