Re: Can’t write to PVC in K8S

2021-09-02 Thread Bjørn Jørgensen
Well, I have tried almost everything over the last two days now. 

There is no spark user in the image, and whatever I do with the executor image it 
only runs for 2 minutes in k8s and then restarts. 


The problem seems to be that files written by the executors end up with group nogroup: 
drwxr-xr-x  2 185 nogroup 4096 Sep  2 18:43 test14


So is there anything I can do about that? Or should I move on to MinIO or 
something else? 
I need to ETL 500 K - 94 GB of JSON files and save them somewhere. 
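
For reference, this is roughly the job I have in mind; a minimal pandas-on-Spark 
sketch where the paths are only examples, and an active SparkSession plus a PVC 
mounted read-write at /opt/spark/work-dir are assumed:

# Minimal ETL sketch (pandas-on-Spark, Spark >= 3.2). Paths are examples;
# assumes an active SparkSession and a PVC mounted read-write at
# /opt/spark/work-dir on both driver and executors.
from pyspark import pandas as ps

# Read a directory of JSON-lines files into a pandas-on-Spark DataFrame.
df = ps.read_json("/opt/spark/work-dir/raw_json/*.json")

# ...pandas-style transformations go here...

# Write the result back to the mounted volume as Parquet.
df.to_parquet("/opt/spark/work-dir/out/etl_result.parquet", mode="overwrite")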

On 2021/08/31 21:09:25, Mich Talebzadeh  wrote: 
> I think Holden alluded to that.
> 
> In a nutshell, users in Linux can belong to more than one group. In this
> case you want to create a new group newgroup and add two users to that
> group. Do this in the Dockerfile as USER 0:
> 
> RUN groupadd newgroup
> ## Now add the two users (these users need to exist)
> RUN usermod -a -G newgroup jovyan
> RUN usermod -a -G newgroup spark
> ## set permission on the directory
> RUN chgrp -R newgroup /path/to/the/directory
> RUN chmod -R 770 /path/to/the/directory
> 
> Check this thread as well
> 
> https://superuser.com/questions/280994/give-write-permissions-to-multiple-users-on-a-folder-in-ubuntu
> 
> HTH
> 
> 
> On Tue, 31 Aug 2021 at 20:50, Holden Karau  wrote:
> 
> > You can change the UID of one of them to match, or you could add them both
> > to a group and set permissions to 770.
> >
> > On Tue, Aug 31, 2021 at 12:18 PM Bjørn Jørgensen 
> > wrote:
> >
> >> Hi and thanks for all the good help.
> >>
> >> I will build jupyter on top of spark to be able to run jupyter in local
> >> mode with the new koalas library. The new koalas library can be imported as
> >> "from pyspark import pandas as ps".
> >>
> >> Then you can run spark on K8S the same way that you use pandas in a
> >> notebook.
> >>
> >> The easiest way to get a PV in K8S is with NFS. And with NFS you will
> >> find your files outside K8S without having to copy files out of a K8S PVC.
> >>
> >> With this setup I can use pandas code in a notebook with the power from a
> >> K8S cluster, as a normal notebook with pandas code.
> >> I hope that this project will be an easy way to convert from pandas to
> >> spark on K8S.
> >>
> >>
> >> I did some testing today with file permissions, like RUN mkdir -p
> >> /home/files and RUN chmod g+w /home/files.
> >> But:
> >>
> >> 185@myapp-38a8887b9cedae97-exec-1:~/work-dir$ id
> >> uid=185(185) gid=0(root) groups=0(root)
> >>
> >>
> >> jovyan@my-pyspark-notebook-f6d497958-t9rpk:~$ id
> >> uid=1000(jovyan) gid=100(users) groups=100(users)
> >>
> >> so it didn't work.
> >>
> >> What will be the best way to make jovyan and 185 write to the same
> >> folder?
> >> On 2021/08/30 23:00:40, Mich Talebzadeh 
> >> wrote:
> >> > To be specific uid=185 (spark user, AKA anonymous) and root are in the
> >> same
> >> > group in the docker image itself
> >> >
> >> >
> >> > id
> >> >
> >> > uid=185(185) gid=0(root) groups=0(root)
> >> >
> >> >
> >> > So in the docker image conf file, you can create your permanent
> >> directory
> >> > as root off /home say
> >> >
> >> > do it as root (USER 0)
> >> >
> >> >
> >> > RUN mkdir -p /home/
> >> >
> >> > RUN chmod g+w /home/  ## give write permission to spark
> >> >
> >> >
> >> > ARG spark_uid=185
> >> > ..
> >> >
> >> > # Specify the User that the actual main process will run as
> >> >
> >> > USER ${spark_uid}
> >> >
> >> >
> >> >
> >> > On Mon, 30 Aug 2021 at 22:26, Mich Talebzadeh <
> >> mich.talebza...@gmail.com>
> >> > wrote:
> >> >
> >> > > Forgot to mention that Spark uses that work directory to unzip the
> >> zipped
> >> > > files or gunzip archive files
> >> > >
> >> > > For example
> >> > >
> >> > > pyFiles  gs://axial-glow-224522-spark-on-k8s/codes/DSBQ.zip
> >> > >
> >> > >
> >> > > Spark will use that $SPARK_HOME/work-dir to unzip DSBQ.zip which is
> >> the
> >> > > application package here
> >> > >
> >> > >
> >> > > The alternative is to hack the docker file to create a directory for
> >> > > yourself
> >> > >
> >> > >
> >> > > RUN 

Re: Can’t write to PVC in K8S

2021-08-31 Thread Bjørn Jørgensen
Hi and thanks for all the good help. 

I will build jupyter on top of spark to be able to run jupyter in local mode 
with the new koalas library. The new koalas library can be imported as "from 
pyspark import pandas as ps".  

Then you can run spark on K8S the same way that you use pandas in a notebook. 

The easiest way to get a PV in K8S is with NFS. And with NFS you will find your 
files outside K8S without having to copy files out of a K8S PVC.

With this setup I can use pandas code in a notebook with the power from a K8S 
cluster, as a normal notebook with pandas code.
I hope that this project will be an easy way to convert from pandas to spark on 
K8S.
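
A tiny sketch (made-up data) of what I mean; the pandas-on-Spark code in the 
notebook looks just like regular pandas, but runs on the cluster:

from pyspark import pandas as ps

# pandas-style API, executed by Spark on the K8S cluster.
df = ps.DataFrame({"name": ["a", "b", "a"], "value": [1, 2, 3]})
print(df.describe())
print(df.groupby("name")["value"].sum())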


I did some testing today with file permissions, like RUN mkdir -p /home/files 
and RUN chmod g+w /home/files.
But:

185@myapp-38a8887b9cedae97-exec-1:~/work-dir$ id
uid=185(185) gid=0(root) groups=0(root)


jovyan@my-pyspark-notebook-f6d497958-t9rpk:~$ id
uid=1000(jovyan) gid=100(users) groups=100(users)

so it didn't work.

What will be the best way to make jovyan and 185 write to the same folder? 
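
A small sketch (assuming an active SparkSession named spark) of how the executor 
uid/gid can be checked from the notebook, analogous to running id in the pods 
above:

import os

# Report the distinct (uid, gid) pairs the executor processes run as.
print(spark.sparkContext
      .parallelize(range(8), 8)
      .map(lambda _: (os.getuid(), os.getgid()))
      .distinct()
      .collect())
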
On 2021/08/30 23:00:40, Mich Talebzadeh  wrote: 
> To be specific uid=185 (spark user, AKA anonymous) and root are in the same
> group in the docker image itself
> 
> 
> id
> 
> uid=185(185) gid=0(root) groups=0(root)
> 
> 
> So in the docker image conf file, you can create your permanent directory
> as root off /home say
> 
> do it as root (USER 0)
> 
> 
> RUN mkdir -p /home/
> 
> RUN chmod g+w /home/  ## give write permission to spark
> 
> 
> ARG spark_uid=185
> ..
> 
> # Specify the User that the actual main process will run as
> 
> USER ${spark_uid}
> 
> 
> 
> On Mon, 30 Aug 2021 at 22:26, Mich Talebzadeh 
> wrote:
> 
> > Forgot to mention that Spark uses that work directory to unzip the zipped
> > files or gunzip archive files
> >
> > For example
> >
> > pyFiles gs://axial-glow-224522-spark-on-k8s/codes/DSBQ.zip
> >
> >
> > Spark will use that $SPARK_HOME/work-dir to unzip DSBQ.zip which is the
> > application package here
> >
> >
> > The alternative is to hack the docker file to create a directory for
> > yourself
> >
> >
> > RUN mkdir -p /home/conf
> >
> > RUN chmod g+w /home/conf
> >
> >
> > HTH
> >
> >
> >
> > On Mon, 30 Aug 2021 at 22:13, Mich Talebzadeh 
> > wrote:
> >
> >> I am not familiar with  jupyterlab  so cannot comment on that.
> >>
> >> However, once your parquet file is written to the work-dir, how are you
> >> going to utilise it?
> >>
> >> HTH
> >>
> >>
> >>
> >>
> >> On Mon, 30 Aug 2021 at 22:05, Bjørn Jørgensen 
> >> wrote:
> >>
> >>> OK, so when I use Spark on k8s I can only save files to S3 buckets or to
> >>> a database?
> >>>
> >>> Note my setup: it's Spark with JupyterLab on top of k8s.
> >>>
> >>> What are those for if I can't write files from Spark in k8s to disk?
> >>>
> >>> "spark.kubernetes.driver.volumes.persistentVolumeClaim.nfs100.mount.readOnly",
> >>> "False"
> >>> "spark.kubernetes.executor.volumes.persistentVolumeClaim.nfs100.mount.readOnly",
> >>> "False"
> >>>
> >>> On 2021/08/30 20:50:22, Mich Talebzadeh 
> >>> wrote:
> >>> > Hi,
> >>> >
> >>> > You are trying to write to work-dir inside the docker and create
> >>> > sub-directories:
> >>> >
> >>> > The error you are getting is this
> >>> >
> >>> > Mkdirs failed to create
> >>> >
> >>> file:/opt/spark/work-dir/falk/F01test_df.parquet/_temporary/0/_temporary/attempt_202108291906304682784428756208427_0026_m_00_9563
> >>> > (exists=false, cwd=file:/opt/spark/work-dir)
> >>> >
> >>> > That directory /work-dir is not recognised as a valid directory
> >>> > for storage. It is not in HDFS or HCFS format
> >>> >
> >>> >
> >>> > From Spark you can write to

Re: Can’t write to PVC in K8S

2021-08-30 Thread Bjørn Jørgensen
OK, so when I use Spark on k8s I can only save files to S3 buckets or to a 
database? 

Note my setup: it's Spark with JupyterLab on top of k8s. 

What are those for if I can't write files from Spark in k8s to disk? 

"spark.kubernetes.driver.volumes.persistentVolumeClaim.nfs100.mount.readOnly", 
"False"
"spark.kubernetes.executor.volumes.persistentVolumeClaim.nfs100.mount.readOnly",
 "False"

On 2021/08/30 20:50:22, Mich Talebzadeh  wrote: 
> Hi,
> 
> You are trying to write to work-dir inside the docker and create
> sub-directories:
> 
> The error you are getting is this
> 
> Mkdirs failed to create
> file:/opt/spark/work-dir/falk/F01test_df.parquet/_temporary/0/_temporary/attempt_202108291906304682784428756208427_0026_m_00_9563
> (exists=false, cwd=file:/opt/spark/work-dir)
> 
> That directory /work-dir is not recognised as a valid directory
> for storage. It is not in HDFS or HCFS format
> 
> 
> From Spark you can write to a bucket outside as a permanent storage.
> 
> HTH
> 
> 
> 
> On Mon, 30 Aug 2021 at 14:11, Bjørn Jørgensen 
> wrote:
> 
> > Hi, I have built and running spark on k8s. A link to my repo
> > https://github.com/bjornjorgensen/jlpyk8s
> >
> > Everything seems to be running fine, but I can’t save to PVC.
> > If I convert the dataframe to pandas, then I can save it.
> >
> >
> >
> > from pyspark.sql import SparkSession
> > spark = SparkSession.builder \
> >     .master("k8s://https://kubernetes.default.svc.cluster.local:443") \
> >     .config("spark.kubernetes.container.image", "bjornjorgensen/spark-py:v3.2-290821") \
> >     .config("spark.kubernetes.authenticate.caCertFile", "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt") \
> >     .config("spark.kubernetes.authenticate.oauthTokenFile", "/var/run/secrets/kubernetes.io/serviceaccount/token") \
> >     .config("spark.kubernetes.authenticate.driver.serviceAccountName", "my-pyspark-notebook") \
> >     .config("spark.executor.instances", "10") \
> >     .config("spark.driver.host", "my-pyspark-notebook-spark-driver.default.svc.cluster.local") \
> >     .config("spark.driver.port", "29413") \
> >     .config("spark.kubernetes.driver.volumes.persistentVolumeClaim.nfs100.options.claimName", "nfs100") \
> >     .config("spark.kubernetes.driver.volumes.persistentVolumeClaim.nfs100.mount.path", "/opt/spark/work-dir") \
> >     .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.nfs100.options.claimName", "nfs100") \
> >     .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.nfs100.mount.path", "/opt/spark/work-dir") \
> >     .config("spark.kubernetes.driver.volumes.persistentVolumeClaim.nfs100.mount.readOnly", "False") \
> >     .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.nfs100.mount.readOnly", "False") \
> >     .appName("myApp") \
> >     .config("spark.sql.repl.eagerEval.enabled", "True") \
> >     .config("spark.driver.memory", "4g") \
> >     .config("spark.executor.memory", "4g") \
> >     .getOrCreate()
> > sc = spark.sparkContext
> >
> > pdf.to_parquet("/opt/spark/work-dir/falk/test/F01test.parquet")
> >
> >
> > 21/08/30 12:20:34 WARN WindowExec: No Partition Defined for Window
> > operation! Moving all data to a single partition, this can cause serious
> > performance degradation.
> > 21/08/30 12:20:34 WARN WindowExec: No Partition Defined for Window
> > operation! Moving all data to a single partition, this can cause serious
> > performance degradation.
> > 21/08/30 12:20:37 WARN WindowExec: No Partition Defined for Window
> > operation! Moving all data to a single partition, this can cause serious
> > performance degradation.
> > 21/08/30 12:20:39 WARN TaskSetManager: Lost task 0.0 in stage 25.0 (TID
> > 9497) (10.42.0.16 executor 3): java.io.IOException: Mkdirs failed to create
> > file:/opt/spark/work-dir/falk/test/F01test.parquet/_temporary/0/_temporary/attempt_202108301220375889526593865835092_0025_m_00_9497
> > (exists=false, cwd=file:/opt/spark/work-dir)
> > at
> > org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:515)
> > at
> > org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:500)
> > at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1195)
> > at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1175)
> > at
> > org.apache.parquet.hadoop.util.HadoopOutputFile.create(HadoopOutputFile.java:74)
> > at
> > org.apache.parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:329)
> >