Hey Karan,

You can get the GCS connector jar from here:
https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage#non-dataproc_clusters

________________________________
From: karan alang <karan.al...@gmail.com>
Sent: 13 February 2022 20:08
To: Gourav Sengupta <gourav.sengu...@gmail.com>
Cc: Holden Karau <hol...@pigscanfly.ca>; Mich Talebzadeh <mich.talebza...@gmail.com>; user @spark <user@spark.apache.org>
Subject: [EXTERNAL] Re: Unable to access Google buckets using spark-submit

Hi Gourav, All,

I'm doing a spark-submit from my local system to a GCP Dataproc cluster. This is more for dev/testing.
I can also run a 'gcloud dataproc jobs submit' command, which is what will be done in production.

Hope that clarifies.

regds,
Karan Alang

On Sat, Feb 12, 2022 at 10:31 PM Gourav Sengupta <gourav.sengu...@gmail.com> wrote:

Hi,

I agree with Holden; I have faced quite a few issues with FUSE.

I am also trying to understand "spark-submit from local". Are you submitting your Spark jobs from a local laptop, or in local mode from a GCP Dataproc system?

If you are submitting the job from your local laptop, I guess there will be performance bottlenecks, depending on the internet bandwidth and the volume of data.

Regards,
Gourav

On Sat, Feb 12, 2022 at 7:12 PM Holden Karau <hol...@pigscanfly.ca> wrote:

You can also put the GS access jar with your Spark jars. That's what the class not found exception is pointing you towards.
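
A minimal sketch of that suggestion, assuming the connector jar has already been downloaded to the local machine. The helper name `build_submit_command` and the paths `my_job.py` and `/path/to/gcs-connector.jar` are hypothetical; only the `fs.gs.impl` class name comes from the original post. Credentials for the bucket are a separate matter and are not covered here.

```python
import shlex

# Class named in the original post; the connector jar must provide it.
GCS_FS_IMPL = "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem"


def build_submit_command(app, connector_jar, extra_jars=()):
    """Return a spark-submit argv that ships the GCS connector jar.

    `app` and `connector_jar` are hypothetical local paths; point them at
    wherever the job and the downloaded jar actually live.
    """
    # --jars takes a single comma-separated list of jar paths.
    jars = ",".join([connector_jar, *extra_jars])
    return [
        "spark-submit",
        "--jars", jars,
        "--conf", f"spark.hadoop.fs.gs.impl={GCS_FS_IMPL}",
        app,
    ]


cmd = build_submit_command("my_job.py", "/path/to/gcs-connector.jar")
print(shlex.join(cmd))  # print the command instead of running it
```

The command is only printed so that it can be inspected first; passing `cmd` to `subprocess.run` would actually launch it.
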

On Fri, Feb 11, 2022 at 11:58 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

BTW I also answered you on Stack Overflow:

https://stackoverflow.com/questions/71088934/unable-to-access-google-buckets-using-spark-submit

HTH

view my Linkedin profile: https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/
https://en.everybodywiki.com/Mich_Talebzadeh

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

On Sat, 12 Feb 2022 at 08:24, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

You are trying to access a Google Cloud Storage bucket (gs://) from your local host. It does not see it because spark-submit assumes that it is a local file system on the host, which it is not.

You need to mount the gs:// bucket as a local file system.

You can use the tool called gcsfuse: https://cloud.google.com/storage/docs/gcs-fuse

Cloud Storage FUSE is an open source FUSE (http://fuse.sourceforge.net/) adapter that allows you to mount Cloud Storage buckets as file systems on Linux or macOS systems. You can download gcsfuse from here: https://github.com/GoogleCloudPlatform/gcsfuse

Pretty simple.

It will be installed as /usr/bin/gcsfuse. To mount a bucket, create a local mount point such as /mnt/gs as root and give other users permission to use it.

As a normal user (not root) who needs to access the gs:// bucket, use gcsfuse to mount it. For example, here I am mounting a GCS bucket called spark-jars-karan. Just use the bucket name itself:

    gcsfuse spark-jars-karan /mnt/gs

Then you can refer to it as /mnt/gs in spark-submit from the on-premise host:

    spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.0 --jars /mnt/gs/spark-bigquery-with-dependencies_2.12-0.23.2.jar

HTH

On Sat, 12 Feb 2022 at 04:31, karan alang <karan.al...@gmail.com> wrote:

Hello All,

I'm trying to access GCP buckets while running spark-submit from local, and running into issues.

I'm getting this error:

```
22/02/11 20:06:59 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Exception in thread "main" org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "gs"
```

I tried adding --conf spark.hadoop.fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem to the spark-submit command, but I am getting a ClassNotFoundException.

Details are on Stack Overflow: https://stackoverflow.com/questions/71088934/unable-to-access-google-buckets-using-spark-submit

Any ideas on how to fix this?

tia!

--
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau
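
For the ClassNotFoundException in the original question, one quick check is whether a given jar actually contains the class named in `fs.gs.impl`. The sketch below is only an illustration: `jar_has_class` is a hypothetical helper, and it assumes the class sits in the jar under its ordinary package path (a jar is a zip archive, so the standard `zipfile` module can list it).

```python
import zipfile


def jar_has_class(jar_path, class_name):
    """Return True if the jar at `jar_path` contains `class_name`.

    Assumes the class is stored under its normal package path, for
    example com/example/Foo.class for com.example.Foo.
    """
    entry = class_name.replace(".", "/") + ".class"
    with zipfile.ZipFile(jar_path) as jar:
        return entry in jar.namelist()


# Example call (hypothetical jar path, so it is left commented out):
# jar_has_class("/path/to/gcs-connector.jar",
#               "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
```

If this returns False for every jar passed to spark-submit, the class is not on the classpath and the `--conf` setting alone cannot resolve it.
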