Yeah, it seems like making the emptyDir larger is an option we need to consider. I've sketched below what I think that volume looks like.
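For reference, from skimming LocalDirsFeatureStep, I believe the scratch
volume Spark generates per local dir looks roughly like the fragment below
(untested sketch; the volume name, container name, and mount path are my
guesses, not copied from a real pod). Since a plain emptyDir just borrows
space from the node's disk, "larger" would mostly mean bigger node disks.
The medium and sizeLimit fields are raw Kubernetes knobs, and I don't see a
way to set them from spark-submit given that Spark builds this volume
itself:

  # Approximate shape of the generated scratch volume (assumption, untested).
  volumes:
    - name: spark-local-dir-1        # guessed name
      emptyDir: {}                   # capacity comes from the node's root disk
      # emptyDir:
      #   medium: Memory             # tmpfs variant: faster, counts against pod memory
      #   sizeLimit: 10Gi            # caps usage; the pod is evicted past this
  containers:
    - name: executor                 # illustrative container name
      volumeMounts:
        - name: spark-local-dir-1
          mountPath: /var/data/spark-...   # the directory we keep running out of space in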
Cheers

Tomasz Krol

On Fri, 1 Mar 2019 at 19:30, Matt Cheah <mch...@palantir.com> wrote:

> Ah, I see: we always force the local directory to use emptyDir, and it
> cannot be configured to use any other volume type. See here
> <https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/LocalDirsFeatureStep.scala>.
>
> I am a bit conflicted on this. On one hand, it makes sense to allow users
> to mount their own volumes to handle spill data. On the other hand, I get
> the impression that emptyDir is the right kind of volume for this in a
> majority of cases: emptyDir is meant for temporary storage and is meant
> to be fast, which keeps workloads like Spark performant. Finally, a
> significant benefit of emptyDir is that Kubernetes handles the cleanup of
> the directory for you when the pod exits; with a persistent volume claim
> you would need to ensure the files are cleaned up yourself in the case
> that the pod exits abruptly.
>
> I'd wonder if your organization could consider modifying your Kubernetes
> setup to make your emptyDir volumes larger and faster?
>
> -Matt Cheah
>
> *From: *Tomasz Krol <patric...@gmail.com>
> *Date: *Friday, March 1, 2019 at 10:53 AM
> *To: *Matt Cheah <mch...@palantir.com>
> *Cc: *"user@spark.apache.org" <user@spark.apache.org>
> *Subject: *Re: Spark on k8s - map persistentStorage for data spilling
>
> Hi Matt,
>
> Thanks for coming back to me. Yeah, that doesn't work. Basically, in the
> properties I set the volume and mount point as below:
>
> spark.kubernetes.driver.volumes.persistentVolumeClaim.checkvolume.mount.path=/checkpoint
> spark.kubernetes.driver.volumes.persistentVolumeClaim.checkvolume.mount.readOnly=false
> spark.kubernetes.driver.volumes.persistentVolumeClaim.checkvolume.mount.claimName=sparkstorage
>
> spark.kubernetes.executor.volumes.persistentVolumeClaim.checkvolume.mount.path=/checkpoint
> spark.kubernetes.executor.volumes.persistentVolumeClaim.checkvolume.mount.readOnly=false
> spark.kubernetes.executor.volumes.persistentVolumeClaim.checkvolume.mount.claimName=sparkstorage
>
> That works as expected, and the PVC is mounted in the driver and executor
> pods on the /checkpoint directory.
>
> As you suggested, the first thing I tried was setting spark.local.dir (or
> the SPARK_LOCAL_DIRS env var) to the /checkpoint directory, as my
> expectation was that Spark would then spill to my PVC. However, this
> throws the following error:
>
> "spark-kube-driver" is invalid:
> spec.containers[0].volumeMounts[3].mountPath: Invalid value: "/checkpoint":
> must be unique
>
> It seems like it's trying to mount an emptyDir with mount point
> "/checkpoint", but it can't, because "/checkpoint" is the directory where
> the PVC is already mounted.
>
> At the moment it looks to me like the emptyDir is always used for
> spilling data. The question is how to mount it on the PVC, unless I'm
> missing something here. I can't really run any bigger jobs at the moment
> because of that. I'd appreciate any feedback :)
>
> Thanks
>
> Tom
>
> On Thu, 28 Feb 2019 at 17:23, Matt Cheah <mch...@palantir.com> wrote:
>
> > I think we want to change the value of spark.local.dir to point to
> > where your PVC is mounted. Can you give that a try and let us know if
> > that moves the spills as expected?
> >
> > -Matt Cheah
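(Coming back to this bit for the archives: I think this is exactly why the
"must be unique" error above appears. Spark still adds its own emptyDir
mount for each local dir, so pointing spark.local.dir at the PVC's mount
path gives the container two volumeMounts on the same path. A rough,
untested sketch of the spec I believe gets generated; volume names are
guesses:)

  volumeMounts:
    - name: checkvolume            # the PVC from the properties above
      mountPath: /checkpoint
    - name: spark-local-dir-1      # emptyDir Spark adds for spark.local.dir
      mountPath: /checkpoint       # duplicate mountPath -> rejected as "must be unique"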
> > *From: *Tomasz Krol <patric...@gmail.com>
> > *Date: *Wednesday, February 27, 2019 at 3:41 AM
> > *To: *"user@spark.apache.org" <user@spark.apache.org>
> > *Subject: *Spark on k8s - map persistentStorage for data spilling
> >
> > Hey Guys,
> >
> > I hope someone will be able to help me, as I've been stuck on this for
> > a while :) Basically, I am running some jobs on Kubernetes as per the
> > documentation:
> >
> > https://spark.apache.org/docs/latest/running-on-kubernetes.html
> >
> > All works fine; however, if I run queries on a bigger data volume, the
> > jobs fail because there is not enough space in the /var/data/spark-1xxx
> > directory.
> >
> > Obviously the reason for this is that the mounted emptyDir doesn't have
> > enough space.
> >
> > I also mounted a PVC to the driver and executor pods, which I can see
> > during the runtime. I am wondering if someone knows how to set things
> > up so that data is spilled to a different directory (i.e. my persistent
> > storage directory) instead of the emptyDir with its limited space, or
> > whether I can somehow mount the emptyDir on my PVC. Basically, at the
> > moment I can't run any jobs, as they are failing due to insufficient
> > space in that /var/data directory.
> >
> > Thanks
> >
> > --
> > Tomasz Krol
> > patric...@gmail.com
>
> --
> Tomasz Krol
> patric...@gmail.com

--
Tomasz Krol
patric...@gmail.com