Re: Deploying Spark on Google Kubernetes Engine (GKE) Autopilot, preliminary findings

2022-02-13 Thread Gourav Sengupta
Hi,
Maybe this is useful in case someone is testing Spark in containers for
developing Spark.

*From a production-scale point of view:*
But if I am in AWS, I will just use Glue if I want to use containers for
Spark, without unnecessarily increasing my operational costs.

Also, in case I am not wrong, GCP already has Spark running in serverless
mode. Personally, I would never create the overhead of additional costs and
issues for my clients by deploying Spark when those solutions are already
available from cloud vendors. In fact, that is one of the precise reasons
why people use the cloud: to reduce operational costs.

Sorry, just trying to understand the scope of this work.


Regards,
Gourav Sengupta

On Fri, Feb 11, 2022 at 8:35 PM Mich Talebzadeh 
wrote:

> The equivalent of Google GKE Autopilot in AWS is AWS Fargate.
>
>
> I have not used AWS Fargate, so I can only mention Google's GKE
> Autopilot.
>
>
> This is developed from the concepts of containerization and microservices.
> In the standard mode of creating a GKE cluster, users can customize their
> configuration based on their requirements: GKE manages the control plane and
> users manually provision and manage their node infrastructure. So you
> choose the hardware type and memory/CPU where your Spark containers will
> be running, and the nodes will be shown as VM hosts in your account. In GKE
> Autopilot mode, GKE manages the nodes and pre-configures the cluster with
> add-ons for auto-scaling, auto-upgrades, maintenance, Day 2 operations and
> security hardening. So there is a lot there. You don't choose your nodes
> and their sizes. You are effectively paying for the pods you use.
>
>
> Within spark-submit, you still need to specify the number of executors,
> the driver and executor memory, and the cores for the driver and each
> executor. The theory is that the k8s cluster will deploy suitable
> nodes and will create enough pods on those nodes. With a standard k8s
> cluster you choose your nodes, and you ensure that one core on each node is
> reserved for the OS itself. Otherwise, if you allocate all cores to Spark
> with --conf spark.executor.cores, you will receive this error:
>
>
> kubectl describe pods -n spark
>
> ...
>
> Events:
>
>   Type     Reason            Age                From               Message
>   ----     ------            ---                ----               -------
>   Warning  FailedScheduling  9s (x17 over 15m)  default-scheduler  0/3 nodes are available: 3 Insufficient cpu.
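>
>
> As a hedged illustration (the API endpoint, container image, namespace,
> sizes and application file below are my own example values, not taken
> from this job), a spark-submit against k8s that leaves one core per node
> for the OS on 4-vCPU nodes might look like this:
>
> spark-submit \
>   --master k8s://https://<k8s-api-endpoint>:443 \
>   --deploy-mode cluster \
>   --name sparkbq \
>   --conf spark.kubernetes.namespace=spark \
>   --conf spark.kubernetes.container.image=<spark-py-image> \
>   --conf spark.executor.instances=6 \
>   --conf spark.driver.memory=4g \
>   --conf spark.executor.memory=8g \
>   --conf spark.driver.cores=3 \
>   --conf spark.executor.cores=3 \
>   local:///opt/spark/work-dir/RandomDataBigQuery.py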
>
> So with standard k8s you have a choice in selecting your core sizes.
> With Autopilot, node selection is left to Autopilot, which deploys suitable
> nodes, and this will be trial and error at the start (to get the
> configuration right). You may be lucky if the history of executions is
> kept current and the same job can be repeated. However, in my experience,
> getting the driver pod into the "running" state is expensive timewise, and
> without an executor in the running state, there is no chance of the Spark
> job doing anything:
>
>
> NAME                                         READY   STATUS    RESTARTS   AGE
> randomdatabigquery-cebab77eea6de971-exec-1   0/1     Pending   0          31s
> randomdatabigquery-cebab77eea6de971-exec-2   0/1     Pending   0          31s
> randomdatabigquery-cebab77eea6de971-exec-3   0/1     Pending   0          31s
> randomdatabigquery-cebab77eea6de971-exec-4   0/1     Pending   0          31s
> randomdatabigquery-cebab77eea6de971-exec-5   0/1     Pending   0          31s
> randomdatabigquery-cebab77eea6de971-exec-6   0/1     Pending   0          31s
> sparkbq-37405a7eea6b9468-driver              1/1     Running   0          3m4s
>
>
> NAME                                         READY   STATUS              RESTARTS   AGE
> randomdatabigquery-cebab77eea6de971-exec-6   0/1     ContainerCreating   0          112s
> sparkbq-37405a7eea6b9468-driver              1/1     Running             0          4m25s
>
>
> NAME                                         READY   STATUS    RESTARTS   AGE
> randomdatabigquery-cebab77eea6de971-exec-6   1/1     Running   0          114s
> sparkbq-37405a7eea6b9468-driver              1/1     Running   0          4m27s
>
> Basically I told Spark to use 6 executors, but only one executor could be
> brought into the running state after the driver pod had been spinning for 4
> minutes.
>
> 22/02/11 20:16:18 INFO SparkKubernetesClientFactory: Auto-configuring K8S
> client using current context from users K8S config file
>
> 22/02/11 20:16:19 INFO Utils: Using initial executors = 6, max of
> spark.dynamicAllocation.initialExecutors,
> spark.dynamicAllocation.minExecutors and spark.executor.instances
>
> 22/02/11 20:16:19 INFO ExecutorPodsAllocator: Going to request 3 executors
> from Kubernetes for ResourceProfile Id: 0, target: 6 running: 0.
>
> 22/02/11 20:16:20 INFO BasicExecutorFeatureStep: Decommissioning not
> enabled, skipping shutdown 

Re: Unable to access Google buckets using spark-submit

2022-02-13 Thread karan alang
Hi Gaurav, All,
I'm doing a spark-submit from my local system to a GCP Dataproc cluster.
This is more for dev/testing.
I can run a 'gcloud dataproc jobs submit' command as well, which is what
will be done in production.

Hope that clarifies.

regds,
Karan Alang


On Sat, Feb 12, 2022 at 10:31 PM Gourav Sengupta 
wrote:

> Hi,
>
> I agree with Holden; I have faced quite a few issues with FUSE.
>
> Also trying to understand "spark-submit from local". Are you submitting
> your Spark jobs from a local laptop, or in local mode from a GCP Dataproc
> system?
>
> If you are submitting the job from your local laptop, there will be
> performance bottlenecks, I guess, depending on the internet bandwidth and
> the volume of data.
>
> Regards,
> Gourav
>
>
> On Sat, Feb 12, 2022 at 7:12 PM Holden Karau  wrote:
>
>> You can also put the GS access jar with your Spark jars — that’s what the
>> class not found exception is pointing you towards.
>>
>> On Fri, Feb 11, 2022 at 11:58 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> BTW I also answered you on stackoverflow:
>>>
>>>
>>> https://stackoverflow.com/questions/71088934/unable-to-access-google-buckets-using-spark-submit
>>>
>>> HTH
>>>
>>>
>>> view my Linkedin profile
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Sat, 12 Feb 2022 at 08:24, Mich Talebzadeh 
>>> wrote:
>>>
 You are trying to access a Google storage bucket gs:// from your local
 host.

 Spark-submit does not see it because it assumes the path is on a local
 file system of the host, which it is not.

 You need to mount the gs:// bucket as a local file system.

 You can use the tool called gcsfuse
 (https://cloud.google.com/storage/docs/gcs-fuse). Cloud Storage FUSE is
 an open source FUSE adapter that allows you to mount Cloud Storage
 buckets as file systems on Linux or macOS systems. You can download
 gcsfuse from that page.


 Pretty simple.


 It will be installed as /usr/bin/gcsfuse, and you can mount the bucket by
 creating a local mount point like /mnt/gs as root and giving permission
 to others to use it.
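
 A minimal sketch of that setup step (run as root; the mount point path is
 just the example used here):

 mkdir -p /mnt/gs
 chmod a+rw /mnt/gs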


 As the normal user that needs to access the gs:// bucket (not as root),
 use gcsfuse to mount it. For example, I am mounting a GCS bucket called
 spark-jars-karan here.


 Just use the bucket name itself


 gcsfuse spark-jars-karan /mnt/gs


 Then you can refer to it as /mnt/gs in spark-submit from the on-premises host:

 spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.0 
 --jars /mnt/gs/spark-bigquery-with-dependencies_2.12-0.23.2.jar

 HTH

 view my Linkedin profile



 *Disclaimer:* Use it at your own risk. Any and all responsibility for
 any loss, damage or destruction of data or any other property which may
 arise from relying on this email's technical content is explicitly
 disclaimed. The author will in no case be liable for any monetary damages
 arising from such loss, damage or destruction.




 On Sat, 12 Feb 2022 at 04:31, karan alang 
 wrote:

> Hello All,
>
> I'm trying to access gcp buckets while running spark-submit from
> local, and running into issues.
>
> I'm getting error :
> ```
>
> 22/02/11 20:06:59 WARN NativeCodeLoader: Unable to load native-hadoop 
> library for your platform... using builtin-java classes where applicable
> Exception in thread "main" 
> org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for 
> scheme "gs"
>
> ```
> I tried adding the --conf
> spark.hadoop.fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
>
> to the spark-submit command, but getting ClassNotFoundException
>
> Details are in stackoverflow :
>
> https://stackoverflow.com/questions/71088934/unable-to-access-google-buckets-using-spark-submit
>
> Any ideas on how to fix this ?
> tia !
>
> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>


Re: Unable to access Google buckets using spark-submit

2022-02-13 Thread karan alang
Hi Holden,

when you mention "GS access jar" - which jar is this?
Can you please clarify?

thanks,
Karan Alang

On Sat, Feb 12, 2022 at 11:10 AM Holden Karau  wrote:

> You can also put the GS access jar with your Spark jars — that’s what the
> class not found exception is pointing you towards.
>
> On Fri, Feb 11, 2022 at 11:58 PM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> BTW I also answered you on stackoverflow:
>>
>>
>> https://stackoverflow.com/questions/71088934/unable-to-access-google-buckets-using-spark-submit
>>
>> HTH
>>
>>
>> view my Linkedin profile
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Sat, 12 Feb 2022 at 08:24, Mich Talebzadeh 
>> wrote:
>>
>>> You are trying to access a Google storage bucket gs:// from your local
>>> host.
>>>
>>> Spark-submit does not see it because it assumes the path is on a local
>>> file system of the host, which it is not.
>>>
>>> You need to mount the gs:// bucket as a local file system.
>>>
>>> You can use the tool called gcsfuse
>>> (https://cloud.google.com/storage/docs/gcs-fuse). Cloud Storage FUSE is
>>> an open source FUSE adapter that allows you to mount Cloud Storage
>>> buckets as file systems on Linux or macOS systems. You can download
>>> gcsfuse from that page.
>>>
>>>
>>> Pretty simple.
>>>
>>>
>>> It will be installed as /usr/bin/gcsfuse, and you can mount the bucket by
>>> creating a local mount point like /mnt/gs as root and giving permission
>>> to others to use it.
>>>
>>>
>>> As the normal user that needs to access the gs:// bucket (not as root),
>>> use gcsfuse to mount it. For example, I am mounting a GCS bucket called
>>> spark-jars-karan here.
>>>
>>>
>>> Just use the bucket name itself
>>>
>>>
>>> gcsfuse spark-jars-karan /mnt/gs
>>>
>>>
>>> Then you can refer to it as /mnt/gs in spark-submit from the on-premises host:
>>>
>>> spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.0 
>>> --jars /mnt/gs/spark-bigquery-with-dependencies_2.12-0.23.2.jar
>>>
>>> HTH
>>>
>>> view my Linkedin profile
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Sat, 12 Feb 2022 at 04:31, karan alang  wrote:
>>>
 Hello All,

 I'm trying to access gcp buckets while running spark-submit from local,
 and running into issues.

 I'm getting error :
 ```

 22/02/11 20:06:59 WARN NativeCodeLoader: Unable to load native-hadoop 
 library for your platform... using builtin-java classes where applicable
 Exception in thread "main" 
 org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for 
 scheme "gs"

 ```
 I tried adding the --conf
 spark.hadoop.fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem

 to the spark-submit command, but getting ClassNotFoundException

 Details are in stackoverflow :

 https://stackoverflow.com/questions/71088934/unable-to-access-google-buckets-using-spark-submit

 Any ideas on how to fix this ?
 tia !

 --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


Re: Unable to access Google buckets using spark-submit

2022-02-13 Thread karan alang
Thanks, Mich - will check this and update.

regds,
Karan Alang

On Sat, Feb 12, 2022 at 1:57 AM Mich Talebzadeh 
wrote:

> BTW I also answered you on stackoverflow:
>
>
> https://stackoverflow.com/questions/71088934/unable-to-access-google-buckets-using-spark-submit
>
> HTH
>
>
> view my Linkedin profile
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Sat, 12 Feb 2022 at 08:24, Mich Talebzadeh 
> wrote:
>
>> You are trying to access a Google storage bucket gs:// from your local
>> host.
>>
>> Spark-submit does not see it because it assumes the path is on a local
>> file system of the host, which it is not.
>>
>> You need to mount the gs:// bucket as a local file system.
>>
>> You can use the tool called gcsfuse
>> (https://cloud.google.com/storage/docs/gcs-fuse). Cloud Storage FUSE is
>> an open source FUSE adapter that allows you to mount Cloud Storage
>> buckets as file systems on Linux or macOS systems. You can download
>> gcsfuse from that page.
>>
>>
>> Pretty simple.
>>
>>
>> It will be installed as /usr/bin/gcsfuse, and you can mount the bucket by
>> creating a local mount point like /mnt/gs as root and giving permission to
>> others to use it.
>>
>>
>> As the normal user that needs to access the gs:// bucket (not as root),
>> use gcsfuse to mount it. For example, I am mounting a GCS bucket called
>> spark-jars-karan here.
>>
>>
>> Just use the bucket name itself
>>
>>
>> gcsfuse spark-jars-karan /mnt/gs
>>
>>
>> Then you can refer to it as /mnt/gs in spark-submit from the on-premises host:
>>
>> spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.0 
>> --jars /mnt/gs/spark-bigquery-with-dependencies_2.12-0.23.2.jar
>>
>> HTH
>>
>> view my Linkedin profile
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Sat, 12 Feb 2022 at 04:31, karan alang  wrote:
>>
>>> Hello All,
>>>
>>> I'm trying to access gcp buckets while running spark-submit from local,
>>> and running into issues.
>>>
>>> I'm getting error :
>>> ```
>>>
>>> 22/02/11 20:06:59 WARN NativeCodeLoader: Unable to load native-hadoop 
>>> library for your platform... using builtin-java classes where applicable
>>> Exception in thread "main" 
>>> org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for 
>>> scheme "gs"
>>>
>>> ```
>>> I tried adding the --conf
>>> spark.hadoop.fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
>>>
>>> to the spark-submit command, but getting ClassNotFoundException
>>>
>>> Details are in stackoverflow :
>>>
>>> https://stackoverflow.com/questions/71088934/unable-to-access-google-buckets-using-spark-submit
>>>
>>> Any ideas on how to fix this ?
>>> tia !
>>>
>>>


Re: Help With unstructured text file with spark scala

2022-02-13 Thread Rafael Mendes
Hi, Danilo.
Do you have only a single large file?
If so, I guess you can use tools like sed/awk to split it into multiple
files based on the layout, so you can read those files into Spark; for
instance, something like the sketch below.
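
A minimal sketch (the output file names and the choice of Contrato# as the
block delimiter are my assumptions, based on the sample layout shown in
this thread):

awk 'BEGIN{n=0} /^Contrato#/{n++} {print > ("block_" n ".txt")}' input.txt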


On Wed, Feb 9, 2022 at 09:30, Bitfox  wrote:

> Hi
>
> I am not sure about the full situation.
> But if you want a Scala implementation, I think you could use regex to
> match and capture the keywords.
> Here is one I wrote that you can modify on your end.
>
> import scala.io.Source
> import scala.collection.mutable.ArrayBuffer
>
> // collectors for three-field and two-field records
> val list1 = ArrayBuffer[(String,String,String)]()
> val list2 = ArrayBuffer[(String,String)]()
>
> // patt1 captures lines with three #-separated fields, patt2 with two
> val patt1 = """^(.*)#(.*)#([^#]*)$""".r
> val patt2 = """^(.*)#([^#]*)$""".r
>
> val file = "1.txt"
> val lines = Source.fromFile(file).getLines()
>
> // route each line to the matching collector
> for ( x <- lines ) {
>   x match {
>     case patt1(k,v,z) => list1 += ((k,v,z))
>     case patt2(k,v) => list2 += ((k,v))
>     case _ => println("no match")
>   }
> }
>
>
>
> Now list1 and list2 have the elements you wanted; you can convert them
> to a dataframe easily, as in the sketch below.
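>
> A minimal sketch of that conversion (the column names are my assumption;
> this assumes spark-shell or an existing SparkSession named spark):
>
> import spark.implicits._
>
> val df1 = list1.toDF("field1", "field2", "field3")
> val df2 = list2.toDF("key", "value")
> df1.show()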
>
>
> Thanks.
>
> On Wed, Feb 9, 2022 at 7:20 PM Danilo Sousa 
> wrote:
>
>> Hello
>>
>>
>> Yes, for this block I can open it as csv with the # delimiter, but there
>> is a block that is not in csv format.
>>
>> That block is more like key/value data.
>>
>> We have two different layouts in the same file. This is the “problem”.
>>
>> Thanks for your time.
>>
>>
>>
>> Relação de Beneficiários Ativos e Excluídos
>>> Carteira em#27/12/2019##Todos os Beneficiários
>>> Operadora#AMIL
>>> Filial#SÃO PAULO#Unidade#Guarulhos
>>>
>>> Contrato#123456 - Test
>>> Empresa#Test
>>
>>
>> On 9 Feb 2022, at 00:58, Bitfox  wrote:
>>
>> Hello
>>
>> You can treat it as a csv file and load it from Spark:
>>
>> >>> df = spark.read.format("csv").option("inferSchema",
>> "true").option("header", "true").option("sep","#").load(csv_file)
>> >>> df.show()
>> +--------------------+-------------------+-----------------+
>> |               Plano|Código Beneficiário|Nome Beneficiário|
>> +--------------------+-------------------+-----------------+
>> |58693 - NACIONAL ...|           65751353|       Jose Silva|
>> |58693 - NACIONAL ...|           65751388|      Joana Silva|
>> |58693 - NACIONAL ...|           65751353|     Felipe Silva|
>> |58693 - NACIONAL ...|           65751388|      Julia Silva|
>> +--------------------+-------------------+-----------------+
>>
>>
>> cat csv_file:
>>
>> Plano#Código Beneficiário#Nome Beneficiário
>> 58693 - NACIONAL R COPART PJCE#065751353#Jose Silva
>> 58693 - NACIONAL R COPART PJCE#065751388#Joana Silva
>> 58693 - NACIONAL R COPART PJCE#065751353#Felipe Silva
>>
>> 58693 - NACIONAL R COPART PJCE#065751388#Julia Silva
>>
>>
>> Regards
>>
>>
>> On Wed, Feb 9, 2022 at 12:50 AM Danilo Sousa 
>> wrote:
>>
>>> Hi
>>> I have to transform unstructured text to dataframe.
>>> Could anyone please help with Scala code ?
>>>
>>> Dataframe needed as:
>>>
>>> operadora filial unidade contrato empresa plano codigo_beneficiario
>>> nome_beneficiario
>>>
>>> Relação de Beneficiários Ativos e Excluídos
>>> Carteira em#27/12/2019##Todos os Beneficiários
>>> Operadora#AMIL
>>> Filial#SÃO PAULO#Unidade#Guarulhos
>>>
>>> Contrato#123456 - Test
>>> Empresa#Test
>>> Plano#Código Beneficiário#Nome Beneficiário
>>> 58693 - NACIONAL R COPART PJCE#073930312#Joao Silva
>>> 58693 - NACIONAL R COPART PJCE#073930313#Maria Silva
>>>
>>> Contrato#898011000 - FUNDACAO GERDAU
>>> Empresa#FUNDACAO GERDAU
>>> Plano#Código Beneficiário#Nome Beneficiário
>>> 58693 - NACIONAL R COPART PJCE#065751353#Jose Silva
>>> 58693 - NACIONAL R COPART PJCE#065751388#Joana Silva
>>> 58693 - NACIONAL R COPART PJCE#065751353#Felipe Silva
>>> 58693 - NACIONAL R COPART PJCE#065751388#Julia Silva
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>
>>>
>>


Re: Unable to access Google buckets using spark-submit

2022-02-13 Thread Mich Talebzadeh
Putting the GS access jar with the Spark jars may technically resolve the
spark-submit issue, but it is not a recommended practice to create a local
copy of jar files.
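
(For concreteness, a hedged sketch of that jar approach. The connector jar
name, download URL and job file below are my assumptions about what the
"GS access jar" refers to, i.e. the Cloud Storage connector for Hadoop 3:

wget https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop3-latest.jar

spark-submit \
  --conf spark.hadoop.fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem \
  --conf spark.hadoop.fs.AbstractFileSystem.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS \
  --jars gcs-connector-hadoop3-latest.jar \
  my_job.py

That clears the ClassNotFoundException at the cost of keeping a local copy
of the jar.)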

The approach that the thread owner adopted, putting the files in a Google
Cloud bucket, is correct. Indeed, this is what he states on Stack Overflow,
and I quote: "I'm trying to access google buckets, when using spark-submit
and running into issues. What needs to be done to debug/fix this?"

Hence the approach adopted is correct. He has created a bucket in GCP
called gs://spark-jars-karan/ and wants to access it. I presume *he wants
to test it locally* (on-prem, I assume), so he just needs to be able to
access the bucket in GCP remotely. The recommendation of using gcsfuse to
resolve this issue is sound.

HTH



   view my Linkedin profile


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Sat, 12 Feb 2022 at 19:10, Holden Karau  wrote:

> You can also put the GS access jar with your Spark jars — that’s what the
> class not found exception is pointing you towards.
>
> On Fri, Feb 11, 2022 at 11:58 PM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> BTW I also answered you on stackoverflow:
>>
>>
>> https://stackoverflow.com/questions/71088934/unable-to-access-google-buckets-using-spark-submit
>>
>> HTH
>>
>>
>> view my Linkedin profile
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Sat, 12 Feb 2022 at 08:24, Mich Talebzadeh 
>> wrote:
>>
>>> You are trying to access a Google storage bucket gs:// from your local
>>> host.
>>>
>>> Spark-submit does not see it because it assumes the path is on a local
>>> file system of the host, which it is not.
>>>
>>> You need to mount the gs:// bucket as a local file system.
>>>
>>> You can use the tool called gcsfuse
>>> (https://cloud.google.com/storage/docs/gcs-fuse). Cloud Storage FUSE is
>>> an open source FUSE adapter that allows you to mount Cloud Storage
>>> buckets as file systems on Linux or macOS systems. You can download
>>> gcsfuse from that page.
>>>
>>>
>>> Pretty simple.
>>>
>>>
>>> It will be installed as /usr/bin/gcsfuse, and you can mount the bucket by
>>> creating a local mount point like /mnt/gs as root and giving permission
>>> to others to use it.
>>>
>>>
>>> As the normal user that needs to access the gs:// bucket (not as root),
>>> use gcsfuse to mount it. For example, I am mounting a GCS bucket called
>>> spark-jars-karan here.
>>>
>>>
>>> Just use the bucket name itself
>>>
>>>
>>> gcsfuse spark-jars-karan /mnt/gs
>>>
>>>
>>> Then you can refer to it as /mnt/gs in spark-submit from the on-premises host:
>>>
>>> spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.0 
>>> --jars /mnt/gs/spark-bigquery-with-dependencies_2.12-0.23.2.jar
>>>
>>> HTH
>>>
>>> view my Linkedin profile
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Sat, 12 Feb 2022 at 04:31, karan alang  wrote:
>>>
 Hello All,

 I'm trying to access gcp buckets while running spark-submit from local,
 and running into issues.

 I'm getting error :
 ```

 22/02/11 20:06:59 WARN NativeCodeLoader: Unable to load native-hadoop 
 library for your platform... using builtin-java classes where applicable
 Exception in thread "main" 
 org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for 
 scheme "gs"

 ```
 I tried adding the --conf
 spark.hadoop.fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem

 to the spark-submit command, but getting ClassNotFoundException

 Details are in stackoverflow :

 https://stackoverflow.com/questions/71088934/unable-to-access-google-buckets-using-spark-submit

 Any ideas