Yeah, it seems like the option of making emptyDir larger is something
we need to consider.

Cheers

Tomasz Krol

On Fri, 1 Mar 2019 at 19:30, Matt Cheah <mch...@palantir.com> wrote:

> Ah I see: we always force the local directory to use emptyDir, and it
> cannot be configured to use any other volume type. See here:
> <https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/LocalDirsFeatureStep.scala>
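>
> For context, the logic in that step is roughly the following (a
> paraphrased sketch of the linked file, not the exact source):
>
>     // Resolve local dirs from SPARK_LOCAL_DIRS / spark.local.dir,
>     // falling back to a generated /var/data/spark-<uuid> default.
>     val resolvedLocalDirs = Option(sparkConf.getenv("SPARK_LOCAL_DIRS"))
>       .orElse(sparkConf.getOption("spark.local.dir"))
>       .getOrElse(defaultLocalDir)
>       .split(",")
>     // An emptyDir volume named spark-local-dir-N is then created for
>     // each resolved dir and mounted at that exact path, which is why
>     // pointing spark.local.dir at a path where a PVC is already
>     // mounted produces a "must be unique" mountPath error.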
>
>
>
> I am a bit conflicted on this. On one hand, it makes sense to allow
> users to mount their own volumes to handle spill data. On the other
> hand, I get the impression that emptyDir is the right kind of volume
> for this in the majority of cases: emptyDir is meant for temporary
> storage and is designed to be fast, which keeps workflows like Spark
> performant. Finally, a significant benefit of emptyDir is that
> Kubernetes handles cleanup of the directory for you when the pod
> exits; if you use a persistent volume claim, you will need to ensure
> the files are cleaned up yourself in the case that the pod exits
> abruptly.
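>
> To illustrate that burden, a periodic cleanup job against the claim
> would have to do something like the following. This is purely a
> hypothetical sketch, not anything Spark ships, and the /checkpoint
> path and the "spark-" directory prefix here are assumptions:
>
>     import java.io.File
>
>     // Recursively delete a file or directory tree.
>     def deleteRecursively(f: File): Unit = {
>       Option(f.listFiles).getOrElse(Array.empty[File]).foreach(deleteRecursively)
>       f.delete()
>     }
>
>     // Prune spill directories untouched for more than 24 hours.
>     val cutoff = System.currentTimeMillis - 24L * 60 * 60 * 1000
>     Option(new File("/checkpoint").listFiles).getOrElse(Array.empty[File])
>       .filter(d => d.isDirectory && d.getName.startsWith("spark-") &&
>         d.lastModified < cutoff)
>       .foreach(deleteRecursively)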
>
>
>
> I'd wonder if your organization could consider modifying your
> Kubernetes setup to make your emptyDir volumes larger and faster?
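>
> One related knob, if RAM is an option: as of Spark 2.4 the local dirs
> can be backed by tmpfs (a memory-backed emptyDir) via
>
>     spark.kubernetes.local.dirs.tmpfs=true
>
> That makes spills fast, but the spilled data then counts against the
> pod's memory, so it only helps if the nodes have memory to spare.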
>
>
>
> -Matt Cheah
>
>
>
> *From: *Tomasz Krol <patric...@gmail.com>
> *Date: *Friday, March 1, 2019 at 10:53 AM
> *To: *Matt Cheah <mch...@palantir.com>
> *Cc: *"user@spark.apache.org" <user@spark.apache.org>
> *Subject: *Re: Spark on k8s - map persistentStorage for data spilling
>
>
>
> Hi Matt,
>
>
>
> Thanks for coming back to me. Yeah, that doesn't work. Basically, in
> the properties I set the volume and mount point as below:
>
> spark.kubernetes.driver.volumes.persistentVolumeClaim.checkvolume.mount.path=/checkpoint
> spark.kubernetes.driver.volumes.persistentVolumeClaim.checkvolume.mount.readOnly=false
> spark.kubernetes.driver.volumes.persistentVolumeClaim.checkvolume.options.claimName=sparkstorage
>
> spark.kubernetes.executor.volumes.persistentVolumeClaim.checkvolume.mount.path=/checkpoint
> spark.kubernetes.executor.volumes.persistentVolumeClaim.checkvolume.mount.readOnly=false
> spark.kubernetes.executor.volumes.persistentVolumeClaim.checkvolume.options.claimName=sparkstorage
>
>
>
> That works as expected, and the PVC is mounted in the driver and
> executor pods at the /checkpoint directory.
>
>
>
> As you suggested, the first thing I tried was setting spark.local.dir
> (or the env var SPARK_LOCAL_DIRS) to the /checkpoint directory, my
> expectation being that Spark would then spill to my PVC. However, this
> throws the following error:
>
>
>
> "spark-kube-driver" is invalid:
> spec.containers[0].volumeMounts[3].mountPath: Invalid value: "/checkpoint":
> must be unique"
>
>
>
> It seems it's trying to mount an emptyDir with the mount path
> "/checkpoint", but it can't, because "/checkpoint" is the directory
> where the PVC is already mounted.
>
>
>
> At the moment it looks to me like the emptyDir is always used for
> spilling data; the question is how to back it with the PVC, unless I'm
> missing something here.
>
> I can't really run any bigger jobs at the moment because of that.
> I'd appreciate any feedback :)
>
>
>
> Thanks
>
>
>
> Tom
>
>
>
> On Thu, 28 Feb 2019 at 17:23, Matt Cheah <mch...@palantir.com> wrote:
>
> I think we want to change the value of spark.local.dir to point to where
> your PVC is mounted. Can you give that a try and let us know if that moves
> the spills as expected?
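>
> For example, if your claim is mounted at /checkpoint, something like
>
>     --conf spark.local.dir=/checkpoint
>
> on spark-submit (or SPARK_LOCAL_DIRS=/checkpoint in the container
> environment) should redirect the scratch space there.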
>
>
>
> -Matt Cheah
>
>
>
> *From: *Tomasz Krol <patric...@gmail.com>
> *Date: *Wednesday, February 27, 2019 at 3:41 AM
> *To: *"user@spark.apache.org" <user@spark.apache.org>
> *Subject: *Spark on k8s - map persistentStorage for data spilling
>
>
>
> Hey Guys,
>
>
>
> I hope someone will be able to help me, as I've been stuck on this for
> a while :) Basically, I am running some jobs on Kubernetes as per the
> documentation:
>
>
>
> https://spark.apache.org/docs/latest/running-on-kubernetes.html
>
>
>
> All works fine; however, if I run queries on a bigger data volume, the
> jobs fail because there is not enough space in the
> /var/data/spark-1xxx directory.
>
>
>
> Obviously, the reason for this is that the mounted emptyDir doesn't
> have enough space.
>
>
>
> I also mounted a PVC to the driver and executor pods, which I can see
> during runtime. I am wondering if someone knows how to make data spill
> to a different directory (i.e. my persistent storage directory)
> instead of the emptyDir with its limited space, or whether I can
> somehow back the emptyDir with my PVC. Basically, at the moment I
> can't run any jobs, as they fail due to insufficient space in that
> /var/data directory.
>
>
>
> Thanks
>
> --
>
> Tomasz Krol
> patric...@gmail.com
>
>
>
>
> --
>
> Tomasz Krol
> patric...@gmail.com
>


-- 
Tomasz Krol
patric...@gmail.com
