Re: [EXTERNAL] Re: Spark-submit without access to HDFS

2023-12-10 Thread Eugene Miretsky
Setting PYSPARK_ARCHIVES_PATH to hdfs:// did the trick. But I don't
understand a few things:

1) The default behaviour is that if PYSPARK_ARCHIVES_PATH is empty, pyspark.zip
is uploaded from the local SPARK_HOME. If it is set to "local://" the
upload is skipped (see the sketch after this list). I would expect the latter
to be the default. What's the use case for uploading the local pyspark.zip
every time?
2) It seems like the localConfigs are meant to be copied every time (code).
What's the use case for that? Why not just use the cluster config?
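
For reference, a minimal sketch of the two options (the hdfs:// staging paths
and the hdfs:///jobs/foo.py location below are placeholders, not my actual
paths):

# Skip the upload entirely: point at archives already present on every node
export PYSPARK_ARCHIVES_PATH=local:///opt/spark/spark-3.5.0-bin-hadoop3/python/lib/pyspark.zip,\
local:///opt/spark/spark-3.5.0-bin-hadoop3/python/lib/py4j-0.10.9.7-src.zip

# Or point at copies preloaded on HDFS, which is what ended up working for me
# export PYSPARK_ARCHIVES_PATH=hdfs:///apps/spark/pyspark.zip,hdfs:///apps/spark/py4j-0.10.9.7-src.zip

spark-submit --master yarn --deploy-mode cluster hdfs:///jobs/foo.py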



On Sun, Dec 10, 2023 at 1:15 PM Eugene Miretsky  wrote:

> Thanks Mich,
>
> Tried this and am still getting
> INFO Client: "Uploading resource
> file:/opt/spark/spark-3.5.0-bin-hadoop3/python/lib/pyspark.zip ->
> hdfs:/". It is also doing it for py4j-0.10.9.7-src.zip and
> __spark_conf__.zip. It is working now because I enabled direct
> access to HDFS to allow copying the files, but ideally I would like not
> to have to copy any files directly to HDFS.
>
> 1) We would expect pyspark as well as the relevant configs to already be
> available on the cluster - why are they being copied over? (We can always
> provide the extra libraries needed using --py-files the way you did.)
> 2) If we wanted users to be able to use a custom pyspark, we would rather
> just copy the file to HDFS/GCS in other ways, and let users reference it in
> their job.
> 3) What are the PYTHONPATH and SPARK_HOME env variables in your script?
> Are they local paths, or paths on the Spark cluster?
>
> On Fri, Nov 17, 2023 at 8:57 AM Mich Talebzadeh 
> wrote:
>
>> Hi,
>>
>> How are you submitting your spark job from your client?
>>
>> Your files can either be on HDFS or HCFS such as gs, s3 etc.
>>
>> With reference to '--py-files hdfs://yarn-master-url hdfs://foo.py', I
>> assume you want your spark-submit to look something like this:
>>
>> spark-submit --verbose \
>>    --deploy-mode cluster \
>>    --conf "spark.yarn.appMasterEnv.SPARK_HOME=$SPARK_HOME" \
>>    --conf "spark.yarn.appMasterEnv.PYTHONPATH=${PYTHONPATH}" \
>>    --conf "spark.executorEnv.PYTHONPATH=${PYTHONPATH}" \
>>    --py-files $CODE_DIRECTORY_CLOUD/dataproc_on_gke.zip \
>>    --conf spark.driver.memory=4G \
>>    --conf spark.executor.memory=4G \
>>    --conf spark.executor.instances=4 \
>>    --conf spark.executor.cores=2 \
>>    $CODE_DIRECTORY_CLOUD/${APPLICATION}
>>
>> in my case I define $CODE_DIRECTORY_CLOUD as below on Google Cloud Storage
>>
>> CODE_DIRECTORY="/home/hduser/dba/bin/python/"
>> CODE_DIRECTORY_CLOUD="gs://${PROJECT}-spark-on-k8s/codes"
>> cd $CODE_DIRECTORY
>> [ -f ${source_code}.zip ] && rm -r -f ${source_code}.zip
>> echo `date` ", ===> creating source zip directory from ${source_code}"
>> # zip needs to be done at root directory of code
>> zip -rq ${source_code}.zip ${source_code}
>> gsutil cp ${source_code}.zip $CODE_DIRECTORY_CLOUD
>> gsutil cp /home/hduser/dba/bin/python/${source_code}/src/${APPLICATION} \
>>   $CODE_DIRECTORY_CLOUD
>>
>> So, in summary, I create a zip file of my project, copy it across to
>> cloud storage, put the application (.py file) there as well, and
>> reference them in spark-submit.
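>>
>> If your files have to live on HDFS rather than GCS, the same pattern
>> should work with hdfs:// URIs, roughly along these lines (the paths here
>> are only placeholders):
>>
>> spark-submit --verbose \
>>    --master yarn \
>>    --deploy-mode cluster \
>>    --py-files hdfs:///apps/jobs/${source_code}.zip \
>>    hdfs:///apps/jobs/${APPLICATION}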
>>
>> I trust this answers your question.
>>
>> HTH
>>
>>
>>
>> Mich Talebzadeh,
>> Technologist, Solutions Architect & Engineer
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Wed, 15 Nov 2023 at 21:33, Eugene Miretsky 
>> wrote:
>>
>>> Hey All,
>>>
>>> We are running PySpark spark-submit from a client outside the cluster.
>>> The client has network connectivity only to the YARN master, not the HDFS
>>> DataNodes. How can we submit the jobs? The idea would be to preload all the
>>> dependencies (job code, libraries, etc.) to HDFS, and just submit the job
>>> from the client.
>>>
>>> We tried something like this:
>>> 'PYSPARK_ARCHIVES_PATH=hdfs://some-path/pyspark.zip spark-submit
>>> --master yarn --deploy-mode cluster --py-files hdfs://yarn-master-url
>>> hdfs://foo.py'
>>>
>>> The error we are getting is
>>> "
>>>
>>> org.apache.hadoop.net.ConnectTimeoutException: 6 millis timeout
>>> while waiting for channel to be ready for connect. ch :
>>> java.nio.channels.SocketChannel[connection-pending remote=/
>>> 10.117.110.19:9866]
>>>
>>> org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
>>> /user/users/.sparkStaging/application_1698216436656_0104/
>>> *spark_conf.zip* could only be written to 0 of the 1 minReplication
>>> nodes. There are 2 datanode(s) running and 2 node(s) are excluded in this
>>> operation.
>>> "
>>>
>>> A few questions:
>>> 1) What are the spark_conf.zip files? Is it the hive-site/yarn-site conf
>>> files? Why would the client send them to the cluster? (The cluster already
>>> has all that info - this would make sense in client mode, but not cluster
>>> mode.)
>>> 2) Is it possible to use spark-submit without HDFS access?
>>> 3) How would we fix this?
>>>
>>> Cheers,
>>> Eugene
>>>
>>> --
>>>
>>> *Eugene Miretsky*
>>> Managing Partner | Badal.io | Book a meeting /w me!
>>> mobile: 416-568-9245
>>> email: eug...@badal.io