RE: Re: Using Avro file format with SparkSQL

2022-02-14 Thread Morven Huang
Hi Steve, 

You’re correct about the '--packages' option; it seems my memory did not serve
me well :)
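
Once spark-avro is actually on the classpath (whether via '--packages' or a
build dependency), Avro behaves like any other data source. A minimal Scala
sketch, with an illustrative output path:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("avro-round-trip").getOrCreate()

// Write a small DataFrame as Avro and read it back.
val df = spark.range(10).toDF("value")
df.write.format("avro").mode("overwrite").save("/tmp/avro-demo")

val readBack = spark.read.format("avro").load("/tmp/avro-demo")
readBack.show()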

On 2022/02/15 07:04:27 Stephen Coy wrote:
> Hi Morven,
> 
> We use --packages for all of our spark jobs. Spark downloads the specified jar 
> and all of its dependencies from a Maven repository.
> 
> This means we never have to build fat or uber jars.
> 
> It does mean that the Apache Ivy configuration has to be set up correctly 
> though.
> 
> Cheers,
> 
> Steve C
> 
> > On 15 Feb 2022, at 5:58 pm, Morven Huang  wrote:
> >
> > I wrote a toy spark job and ran it within my IDE, same error if I don’t add 
> > spark-avro to my pom.xml. After putting spark-avro dependency to my 
> > pom.xml, everything works fine.
> >
> > Another thing is, if my memory serves me right, the spark-submit options 
> > for extra jars is ‘--jars’ , not ‘--packages’.
> >
> > Regards,
> >
> > Morven Huang
> >
> >
> > On 2022/02/10 03:25:28 "Karanika, Anna" wrote:
> >> Hello,
> >>
> >> I have been trying to use spark SQL’s operations that are related to the 
> >> Avro file format,
> >> e.g., stored as, save, load, in a Java class but they keep failing with 
> >> the following stack trace:
> >>
> >> Exception in thread "main" org.apache.spark.sql.AnalysisException:  Failed 
> >> to find data source: avro. Avro is built-in but external data source 
> >> module since Spark 2.4. Please deploy the application as per the 
> >> deployment section of "Apache Avro Data Source Guide".
> >>at 
> >> org.apache.spark.sql.errors.QueryCompilationErrors$.failedToFindAvroDataSourceError(QueryCompilationErrors.scala:1032)
> >>at 
> >> org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:666)
> >>at 
> >> org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:720)
> >>at 
> >> org.apache.spark.sql.DataFrameWriter.lookupV2Provider(DataFrameWriter.scala:852)
> >>at 
> >> org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:256)
> >>at 
> >> org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:239)
> >>at xsys.fileformats.SparkSQLvsAvro.main(SparkSQLvsAvro.java:57)
> >>at 
> >> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native 
> >> Method)
> >>at 
> >> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64)
> >>at 
> >> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> >>at java.base/java.lang.reflect.Method.invoke(Method.java:564)
> >>at 
> >> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
> >>at 
> >> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955)
> >>at 
> >> org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
> >>at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
> >>at 
> >> org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
> >>at 
> >> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1043)
> >>at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052)
> >>at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> >>
> >> For context, I am invoking spark-submit and adding arguments --packages 
> >> org.apache.spark:spark-avro_2.12:3.2.0.
> >> Yet, Spark responds as if the dependency was not added.
> >> I am running spark-v3.2.0 (Scala 2.12).
> >>
> >> On the other hand, everything works great with spark-shell or spark-sql.
> >>
> >> I would appreciate any advice or feedback to get this running.
> >>
> >> Thank you,
> >> Anna
> >>
> >>
> > -
> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> >
> 
> 
-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Using Avro file format with SparkSQL

2022-02-14 Thread Stephen Coy
Hi Morven,

We use --packages for all of our spark jobs. Spark downloads the specified jar 
and all of its dependencies from a Maven repository.

This means we never have to build fat or uber jars.

It does mean that the Apache Ivy configuration has to be set up correctly 
though.

Cheers,

Steve C
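
For completeness, the same Maven resolution that '--packages' triggers can also
be requested from code via the spark.jars.packages property; a sketch, assuming
the property is set before the SparkContext is created:

import org.apache.spark.sql.SparkSession

// Equivalent of `--packages org.apache.spark:spark-avro_2.12:3.2.0`:
// Spark resolves the artifact and its dependencies through Ivy at startup.
val spark = SparkSession.builder()
  .appName("avro-with-packages")
  .config("spark.jars.packages", "org.apache.spark:spark-avro_2.12:3.2.0")
  .getOrCreate()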

> On 15 Feb 2022, at 5:58 pm, Morven Huang  wrote:
>
> I wrote a toy spark job and ran it within my IDE, same error if I don’t add 
> spark-avro to my pom.xml. After putting spark-avro dependency to my pom.xml, 
> everything works fine.
>
> Another thing is, if my memory serves me right, the spark-submit options for 
> extra jars is ‘--jars’ , not ‘--packages’.
>
> Regards,
>
> Morven Huang
>
>
> On 2022/02/10 03:25:28 "Karanika, Anna" wrote:
>> Hello,
>>
>> I have been trying to use spark SQL’s operations that are related to the 
>> Avro file format,
>> e.g., stored as, save, load, in a Java class but they keep failing with the 
>> following stack trace:
>>
>> Exception in thread "main" org.apache.spark.sql.AnalysisException:  Failed 
>> to find data source: avro. Avro is built-in but external data source module 
>> since Spark 2.4. Please deploy the application as per the deployment section 
>> of "Apache Avro Data Source Guide".
>>at 
>> org.apache.spark.sql.errors.QueryCompilationErrors$.failedToFindAvroDataSourceError(QueryCompilationErrors.scala:1032)
>>at 
>> org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:666)
>>at 
>> org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:720)
>>at 
>> org.apache.spark.sql.DataFrameWriter.lookupV2Provider(DataFrameWriter.scala:852)
>>at 
>> org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:256)
>>at 
>> org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:239)
>>at xsys.fileformats.SparkSQLvsAvro.main(SparkSQLvsAvro.java:57)
>>at 
>> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native 
>> Method)
>>at 
>> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64)
>>at 
>> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>at java.base/java.lang.reflect.Method.invoke(Method.java:564)
>>at 
>> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
>>at 
>> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955)
>>at 
>> org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
>>at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
>>at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
>>at 
>> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1043)
>>at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052)
>>at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>>
>> For context, I am invoking spark-submit and adding arguments --packages 
>> org.apache.spark:spark-avro_2.12:3.2.0.
>> Yet, Spark responds as if the dependency was not added.
>> I am running spark-v3.2.0 (Scala 2.12).
>>
>> On the other hand, everything works great with spark-shell or spark-sql.
>>
>> I would appreciate any advice or feedback to get this running.
>>
>> Thank you,
>> Anna
>>
>>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>



RE: Using Avro file format with SparkSQL

2022-02-14 Thread Morven Huang
I wrote a toy Spark job and ran it within my IDE; I get the same error if I
don’t add spark-avro to my pom.xml. After adding the spark-avro dependency to
my pom.xml, everything works fine.
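
(For an sbt build, the equivalent of that pom.xml entry would be a single
library dependency; a sketch assuming Spark 3.2.0 and Scala 2.12:)

// build.sbt
libraryDependencies += "org.apache.spark" %% "spark-avro" % "3.2.0"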

Another thing: if my memory serves me right, the spark-submit option for extra
jars is ‘--jars’, not ‘--packages’.

Regards, 

Morven Huang


On 2022/02/10 03:25:28 "Karanika, Anna" wrote:
> Hello,
> 
> I have been trying to use spark SQL’s operations that are related to the Avro 
> file format,
> e.g., stored as, save, load, in a Java class but they keep failing with the 
> following stack trace:
> 
> Exception in thread "main" org.apache.spark.sql.AnalysisException:  Failed to 
> find data source: avro. Avro is built-in but external data source module 
> since Spark 2.4. Please deploy the application as per the deployment section 
> of "Apache Avro Data Source Guide".
> at 
> org.apache.spark.sql.errors.QueryCompilationErrors$.failedToFindAvroDataSourceError(QueryCompilationErrors.scala:1032)
> at 
> org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:666)
> at 
> org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:720)
> at 
> org.apache.spark.sql.DataFrameWriter.lookupV2Provider(DataFrameWriter.scala:852)
> at 
> org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:256)
> at 
> org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:239)
> at xsys.fileformats.SparkSQLvsAvro.main(SparkSQLvsAvro.java:57)
> at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64)
> at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.base/java.lang.reflect.Method.invoke(Method.java:564)
> at 
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
> at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955)
> at 
> org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
> at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
> at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
> at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1043)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> 
> For context, I am invoking spark-submit and adding arguments --packages 
> org.apache.spark:spark-avro_2.12:3.2.0.
> Yet, Spark responds as if the dependency was not added.
> I am running spark-v3.2.0 (Scala 2.12).
> 
> On the other hand, everything works great with spark-shell or spark-sql.
> 
> I would appreciate any advice or feedback to get this running.
> 
> Thank you,
> Anna
> 
> 
-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Unsubscribe

2022-02-14 Thread William R
Unsubscribe


Position for 'cf.content' not found in row

2022-02-14 Thread 潘明文
Hi,
Could you help me with the issue below? Thanks!
This is my source code:
SparkConf sparkConf = new SparkConf(true);
sparkConf.setAppName(ESTest.class.getName());

SparkSession spark = null;
sparkConf.setMaster("local[*]");
sparkConf.set("spark.cleaner.ttl", "3600");
sparkConf.set("es.nodes", "10.12.65.10");
sparkConf.set("es.port", "9200");
sparkConf.set("es.nodes.discovery", "false");
sparkConf.set("es.nodes.wan.only", "true");
spark = SparkSession.builder().config(sparkConf).getOrCreate();

Dataset<Row> df1 = JavaEsSparkSQL.esDF(spark, "index");
df1.printSchema();
df1.show();


elasticsearch index:


When I run the job, I get the error below:
Caused by: org.elasticsearch.hadoop.EsHadoopIllegalStateException: Position for
'cf.content' not found in row; typically this is caused by a mapping
inconsistency
at 
org.elasticsearch.spark.sql.RowValueReader$class.addToBuffer(RowValueReader.scala:60)
at 
org.elasticsearch.spark.sql.ScalaRowValueReader.addToBuffer(ScalaEsRowValueReader.scala:32)
at 
org.elasticsearch.spark.sql.ScalaRowValueReader.addToMap(ScalaEsRowValueReader.scala:118)
at 
org.elasticsearch.hadoop.serialization.ScrollReader.map(ScrollReader.java:1047)
at 
org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:889)
at 
org.elasticsearch.hadoop.serialization.ScrollReader.readHitAsMap(ScrollReader.java:602)
at 
org.elasticsearch.hadoop.serialization.ScrollReader.readHit(ScrollReader.java:426)
... 34 more




Thanks.

Re: Apache spark 3.0.3 [Spark lower version enhancements]

2022-02-14 Thread Sean Owen
What vulnerabilities are you referring to? I'm not aware of any critical
outstanding issues, but not sure what you have in mind either.
See https://spark.apache.org/versioning-policy.html - 3.0.x is about EOL
now, which doesn't mean there can't be another release, but I would not
generally expect one.

On Mon, Feb 14, 2022 at 3:48 PM Rajesh Krishnamurthy <
rkrishnamur...@perforce.com> wrote:

> Hi Sean,
>
>Thanks for the response. Does the community have any plans of fixing
> any vulnerabilities that have been identified in the 3.0.3 version? Do you
> have any fixed date that 3.0.x is going to be EOL?
>
>
>
> Rajesh Krishnamurthy | Enterprise Architect
> T: +1 510-833-7189 | M: +1 925-917-9208
> http://www.perforce.com
> Visit us on: Twitter | LinkedIn | Facebook
>
> On Feb 11, 2022, at 3:09 PM, Sean Owen  wrote:
>
> 3.0.x is about EOL now, and I hadn't heard anyone come forward to push a
> final maintenance release. Is there a specific issue you're concerned about?
>
> On Fri, Feb 11, 2022 at 4:24 PM Rajesh Krishnamurthy <
> rkrishnamur...@perforce.com> wrote:
>
>> Hi there,
>>
>>   We are just wondering if there are any agenda by the Spark community to
>> actively engage development activities on the 3.0.x path. I know we have
>> the latest version of Spark with 3.2.x, but we are just wondering if any
>> development plans to have the vulnerabilities fixed on the 3.0.x path that
>> were identified on the 3.0.3 version, so that we don’t need to migrate to
>> next major version(3.1.x in this case), but at the same time all the
>> vulnerabilities fixed within the minor version upgrade(eg:3.0.x)
>>
>>
>> Rajesh Krishnamurthy | Enterprise Architect
>> T: +1 510-833-7189 | M: +1 925-917-9208
>> http://www.perforce.com
>> Visit us on: Twitter | LinkedIn | Facebook
>>
>>
>>
>>
>
>
>
>
>
>


Re: Spark kubernetes s3 connectivity issue

2022-02-14 Thread Mich Talebzadeh
Actually, can you create an uber jar in the conventional way using those
two Hadoop versions? You have HADOOP_AWS_VERSION=3.3.0 alongside Hadoop 3.2.

HTH



   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Mon, 14 Feb 2022 at 20:04, Raj ks  wrote:

> I understand what you are saying . However, I am not sure how to implement
> when i create a docker image using spark 3.2.1 with hadoop 3.2 which has
> guava jar already added as part of distribution.
>
> On Tue, Feb 15, 2022, 01:17 Mich Talebzadeh 
> wrote:
>
>> Hi Raj,
>>
>> I found the old email. That is what I did but it is 2018 stuff.
>>
>> The email says
>>
>>  I sorted out this problem. I rewrote the assembly with shade rules to
>> avoid old jar files as follows:
>>
>> lazy val root = (project in file(".")).
>>   settings(
>> name := "${APPLICATION}",
>> version := "1.0",
>> scalaVersion := "2.11.8",
>> mainClass in Compile := Some("myPackage.${APPLICATION}")
>>   )
>> assemblyShadeRules in assembly := Seq(
>> ShadeRule.rename("com.google.common.**" -> "my_conf.@1").inAll
>> )
>> libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.0" %
>> "provided"
>> libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.4.0"
>> libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.0"  %
>> "provided" exclude("org.apache.hadoop", "hadoop-client")
>> resolvers += "Akka Repository" at "http://repo.akka.io/releases/;
>> libraryDependencies += "com.amazonaws" % "aws-java-sdk" % "1.7.8"
>> libraryDependencies += "commons-io" % "commons-io" % "2.4"
>> libraryDependencies += "javax.servlet" % "javax.servlet-api" % "3.0.1" %
>> "provided"
>> libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.0" %
>> "provided"
>> libraryDependencies += "org.apache.spark" %% "spark-hive" % "2.0.0" %
>> "provided"
>> libraryDependencies += "com.google.cloud.bigdataoss" %
>> "bigquery-connector" % "0.13.4-hadoop3"
>> libraryDependencies += "com.google.cloud.bigdataoss" % "gcs-connector" %
>> "1.9.4-hadoop3"
>> libraryDependencies += "com.google.code.gson" % "gson" % "2.8.5"
>> libraryDependencies += "org.apache.httpcomponents" % "httpcore" % "4.4.8"
>> libraryDependencies += "org.apache.hadoop" % "hadoop-hdfs" % "2.4.0"
>> libraryDependencies += "com.github.samelamin" %% "spark-bigquery" %
>> "0.2.5"
>>
>> // META-INF discarding
>> assemblyMergeStrategy in assembly := {
>>  case PathList("META-INF", "MANIFEST.MF") => MergeStrategy.discard
>>  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
>>  case x => MergeStrategy.first
>> }
>>
>> HTH
>>
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Mon, 14 Feb 2022 at 19:40, Raj ks  wrote:
>>
>>> Should we remove the existing jar and upgrade it to some recent version?
>>>
>>> On Tue, Feb 15, 2022, 01:08 Mich Talebzadeh 
>>> wrote:
>>>
 I recall I had similar issues running Spark on Google Dataproc.

 sounds like it gets Hadoop's jars on the classpath which include an
 older version of Guava. The solution is to shade/relocate Guava in your
 distribution


 HTH


view my Linkedin profile
 


  https://en.everybodywiki.com/Mich_Talebzadeh



 *Disclaimer:* Use it at your own risk. Any and all responsibility for
 any loss, damage or destruction of data or any other property which may
 arise from relying on this email's technical content is explicitly
 disclaimed. The author will in no case be liable for any monetary damages
 arising from such loss, damage or destruction.




 On Mon, 14 Feb 2022 at 19:10, Raj ks  wrote:

> Hi Team ,
>
> We are trying to build a docker image using Centos and trying to
> connect through S3. Same works with Hadoop 3.2.0 and spark.3.1.2
>
> #Installing spark binaries
> ENV SPARK_HOME /opt/spark
> ENV SPARK_VERSION 3.2.1
> ENV HADOOP_VERSION 3.2.0
> ARG HADOOP_VERSION_SHORT=3.2
> ARG HADOOP_AWS_VERSION=3.3.0
> ARG AWS_SDK_VERSION=1.11.563
>
>

Re: Spark kubernetes s3 connectivity issue

2022-02-14 Thread Raj ks
I understand what you are saying. However, I am not sure how to implement this
when I create a Docker image using Spark 3.2.1 with Hadoop 3.2, which already
includes a Guava jar as part of the distribution.

On Tue, Feb 15, 2022, 01:17 Mich Talebzadeh 
wrote:

> Hi Raj,
>
> I found the old email. That is what I did but it is 2018 stuff.
>
> The email says
>
>  I sorted out this problem. I rewrote the assembly with shade rules to
> avoid old jar files as follows:
>
> lazy val root = (project in file(".")).
>   settings(
> name := "${APPLICATION}",
> version := "1.0",
> scalaVersion := "2.11.8",
> mainClass in Compile := Some("myPackage.${APPLICATION}")
>   )
> assemblyShadeRules in assembly := Seq(
> ShadeRule.rename("com.google.common.**" -> "my_conf.@1").inAll
> )
> libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.0" %
> "provided"
> libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.4.0"
> libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.0"  %
> "provided" exclude("org.apache.hadoop", "hadoop-client")
> resolvers += "Akka Repository" at "http://repo.akka.io/releases/;
> libraryDependencies += "com.amazonaws" % "aws-java-sdk" % "1.7.8"
> libraryDependencies += "commons-io" % "commons-io" % "2.4"
> libraryDependencies += "javax.servlet" % "javax.servlet-api" % "3.0.1" %
> "provided"
> libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.0" %
> "provided"
> libraryDependencies += "org.apache.spark" %% "spark-hive" % "2.0.0" %
> "provided"
> libraryDependencies += "com.google.cloud.bigdataoss" %
> "bigquery-connector" % "0.13.4-hadoop3"
> libraryDependencies += "com.google.cloud.bigdataoss" % "gcs-connector" %
> "1.9.4-hadoop3"
> libraryDependencies += "com.google.code.gson" % "gson" % "2.8.5"
> libraryDependencies += "org.apache.httpcomponents" % "httpcore" % "4.4.8"
> libraryDependencies += "org.apache.hadoop" % "hadoop-hdfs" % "2.4.0"
> libraryDependencies += "com.github.samelamin" %% "spark-bigquery" %
> "0.2.5"
>
> // META-INF discarding
> assemblyMergeStrategy in assembly := {
>  case PathList("META-INF", "MANIFEST.MF") => MergeStrategy.discard
>  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
>  case x => MergeStrategy.first
> }
>
> HTH
>
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Mon, 14 Feb 2022 at 19:40, Raj ks  wrote:
>
>> Should we remove the existing jar and upgrade it to some recent version?
>>
>> On Tue, Feb 15, 2022, 01:08 Mich Talebzadeh 
>> wrote:
>>
>>> I recall I had similar issues running Spark on Google Dataproc.
>>>
>>> sounds like it gets Hadoop's jars on the classpath which include an
>>> older version of Guava. The solution is to shade/relocate Guava in your
>>> distribution
>>>
>>>
>>> HTH
>>>
>>>
>>>view my Linkedin profile
>>> 
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Mon, 14 Feb 2022 at 19:10, Raj ks  wrote:
>>>
 Hi Team ,

 We are trying to build a docker image using Centos and trying to
 connect through S3. Same works with Hadoop 3.2.0 and spark.3.1.2

 #Installing spark binaries
 ENV SPARK_HOME /opt/spark
 ENV SPARK_VERSION 3.2.1
 ENV HADOOP_VERSION 3.2.0
 ARG HADOOP_VERSION_SHORT=3.2
 ARG HADOOP_AWS_VERSION=3.3.0
 ARG AWS_SDK_VERSION=1.11.563


 RUN set -xe \
   && cd /tmp \
   && wget
 http://mirrors.gigenet.com/apache/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION_SHORT}.tgz
  \
   && tar -zxvf
 spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION_SHORT}.tgz \
   && rm *.tgz \
   && mv spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION_SHORT}
 ${SPARK_HOME} \
   && cp ${SPARK_HOME}/kubernetes/dockerfiles/spark/entrypoint.sh
 ${SPARK_HOME} \
   && wget
 https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/${HADOOP_AWS_VERSION}/hadoop-aws-${HADOOP_AWS_VERSION}.jar
  \
  && wget
 https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/${AWS_SDK_VERSION}/aws-java-sdk-bundle-${AWS_SDK_VERSION}.jar
  \
 && wget
 

Re: Spark kubernetes s3 connectivity issue

2022-02-14 Thread Mich Talebzadeh
Hi Raj,

I found the old email. That is what I did but it is 2018 stuff.

The email says

 I sorted out this problem. I rewrote the assembly with shade rules to
avoid old jar files as follows:

lazy val root = (project in file(".")).
  settings(
name := "${APPLICATION}",
version := "1.0",
scalaVersion := "2.11.8",
mainClass in Compile := Some("myPackage.${APPLICATION}")
  )
assemblyShadeRules in assembly := Seq(
ShadeRule.rename("com.google.common.**" -> "my_conf.@1").inAll
)
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.0" %
"provided"
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.4.0"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.0"  %
"provided" exclude("org.apache.hadoop", "hadoop-client")
resolvers += "Akka Repository" at "http://repo.akka.io/releases/;
libraryDependencies += "com.amazonaws" % "aws-java-sdk" % "1.7.8"
libraryDependencies += "commons-io" % "commons-io" % "2.4"
libraryDependencies += "javax.servlet" % "javax.servlet-api" % "3.0.1" %
"provided"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.0" %
"provided"
libraryDependencies += "org.apache.spark" %% "spark-hive" % "2.0.0" %
"provided"
libraryDependencies += "com.google.cloud.bigdataoss" % "bigquery-connector"
% "0.13.4-hadoop3"
libraryDependencies += "com.google.cloud.bigdataoss" % "gcs-connector" %
"1.9.4-hadoop3"
libraryDependencies += "com.google.code.gson" % "gson" % "2.8.5"
libraryDependencies += "org.apache.httpcomponents" % "httpcore" % "4.4.8"
libraryDependencies += "org.apache.hadoop" % "hadoop-hdfs" % "2.4.0"
libraryDependencies += "com.github.samelamin" %% "spark-bigquery" % "0.2.5"

// META-INF discarding
assemblyMergeStrategy in assembly := {
 case PathList("META-INF", "MANIFEST.MF") => MergeStrategy.discard
 case PathList("META-INF", xs @ _*) => MergeStrategy.discard
 case x => MergeStrategy.first
}

HTH



   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Mon, 14 Feb 2022 at 19:40, Raj ks  wrote:

> Should we remove the existing jar and upgrade it to some recent version?
>
> On Tue, Feb 15, 2022, 01:08 Mich Talebzadeh 
> wrote:
>
>> I recall I had similar issues running Spark on Google Dataproc.
>>
>> sounds like it gets Hadoop's jars on the classpath which include an older
>> version of Guava. The solution is to shade/relocate Guava in your
>> distribution
>>
>>
>> HTH
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Mon, 14 Feb 2022 at 19:10, Raj ks  wrote:
>>
>>> Hi Team ,
>>>
>>> We are trying to build a docker image using Centos and trying to connect
>>> through S3. Same works with Hadoop 3.2.0 and spark.3.1.2
>>>
>>> #Installing spark binaries
>>> ENV SPARK_HOME /opt/spark
>>> ENV SPARK_VERSION 3.2.1
>>> ENV HADOOP_VERSION 3.2.0
>>> ARG HADOOP_VERSION_SHORT=3.2
>>> ARG HADOOP_AWS_VERSION=3.3.0
>>> ARG AWS_SDK_VERSION=1.11.563
>>>
>>>
>>> RUN set -xe \
>>>   && cd /tmp \
>>>   && wget
>>> http://mirrors.gigenet.com/apache/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION_SHORT}.tgz
>>>  \
>>>   && tar -zxvf
>>> spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION_SHORT}.tgz \
>>>   && rm *.tgz \
>>>   && mv spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION_SHORT}
>>> ${SPARK_HOME} \
>>>   && cp ${SPARK_HOME}/kubernetes/dockerfiles/spark/entrypoint.sh
>>> ${SPARK_HOME} \
>>>   && wget
>>> https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/${HADOOP_AWS_VERSION}/hadoop-aws-${HADOOP_AWS_VERSION}.jar
>>>  \
>>>  && wget
>>> https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/${AWS_SDK_VERSION}/aws-java-sdk-bundle-${AWS_SDK_VERSION}.jar
>>>  \
>>> && wget
>>> https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk/${AWS_SDK_VERSION}/aws-java-sdk-${AWS_SDK_VERSION}.jar
>>>  \
>>>  && mv *.jar /opt/spark/jars/
>>>
>>> Error:
>>>
>>> Any help on this is appreciated
>>> java.lang.NoSuchMethodError:
>>> com/google/common/base/Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;Ljava/lang/Object;)V
>>> (loaded from file:/opt/spark/jars/guava-14.0.1.jar by
>>> 

Re: Spark kubernetes s3 connectivity issue

2022-02-14 Thread Raj ks
Should we remove the existing jar and upgrade it to some recent version?

On Tue, Feb 15, 2022, 01:08 Mich Talebzadeh 
wrote:

> I recall I had similar issues running Spark on Google Dataproc.
>
> sounds like it gets Hadoop's jars on the classpath which include an older
> version of Guava. The solution is to shade/relocate Guava in your
> distribution
>
>
> HTH
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Mon, 14 Feb 2022 at 19:10, Raj ks  wrote:
>
>> Hi Team ,
>>
>> We are trying to build a docker image using Centos and trying to connect
>> through S3. Same works with Hadoop 3.2.0 and spark.3.1.2
>>
>> #Installing spark binaries
>> ENV SPARK_HOME /opt/spark
>> ENV SPARK_VERSION 3.2.1
>> ENV HADOOP_VERSION 3.2.0
>> ARG HADOOP_VERSION_SHORT=3.2
>> ARG HADOOP_AWS_VERSION=3.3.0
>> ARG AWS_SDK_VERSION=1.11.563
>>
>>
>> RUN set -xe \
>>   && cd /tmp \
>>   && wget
>> http://mirrors.gigenet.com/apache/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION_SHORT}.tgz
>>  \
>>   && tar -zxvf
>> spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION_SHORT}.tgz \
>>   && rm *.tgz \
>>   && mv spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION_SHORT}
>> ${SPARK_HOME} \
>>   && cp ${SPARK_HOME}/kubernetes/dockerfiles/spark/entrypoint.sh
>> ${SPARK_HOME} \
>>   && wget
>> https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/${HADOOP_AWS_VERSION}/hadoop-aws-${HADOOP_AWS_VERSION}.jar
>>  \
>>  && wget
>> https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/${AWS_SDK_VERSION}/aws-java-sdk-bundle-${AWS_SDK_VERSION}.jar
>>  \
>> && wget
>> https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk/${AWS_SDK_VERSION}/aws-java-sdk-${AWS_SDK_VERSION}.jar
>>  \
>>  && mv *.jar /opt/spark/jars/
>>
>> Error:
>>
>> Any help on this is appreciated
>> java.lang.NoSuchMethodError:
>> com/google/common/base/Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;Ljava/lang/Object;)V
>> (loaded from file:/opt/spark/jars/guava-14.0.1.jar by
>> jdk.internal.loader.ClassLoaders$AppClassLoader@1e4553e) called from
>> class org.apache.hadoop.fs.s3a.S3AUtils (loaded from
>> file:/opt/spark/jars/hadoop-aws-3.3.0.jar by
>> jdk.internal.loader.ClassLoaders$AppClassLoader@1e4553e).
>>
>>


Re: Spark kubernetes s3 connectivity issue

2022-02-14 Thread Mich Talebzadeh
I recall I had similar issues running Spark on Google Dataproc.

It sounds like it is picking up Hadoop's jars on the classpath, which include
an older version of Guava. The solution is to shade/relocate Guava in your
distribution.


HTH


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Mon, 14 Feb 2022 at 19:10, Raj ks  wrote:

> Hi Team ,
>
> We are trying to build a docker image using Centos and trying to connect
> through S3. Same works with Hadoop 3.2.0 and spark.3.1.2
>
> #Installing spark binaries
> ENV SPARK_HOME /opt/spark
> ENV SPARK_VERSION 3.2.1
> ENV HADOOP_VERSION 3.2.0
> ARG HADOOP_VERSION_SHORT=3.2
> ARG HADOOP_AWS_VERSION=3.3.0
> ARG AWS_SDK_VERSION=1.11.563
>
>
> RUN set -xe \
>   && cd /tmp \
>   && wget
> http://mirrors.gigenet.com/apache/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION_SHORT}.tgz
>  \
>   && tar -zxvf
> spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION_SHORT}.tgz \
>   && rm *.tgz \
>   && mv spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION_SHORT}
> ${SPARK_HOME} \
>   && cp ${SPARK_HOME}/kubernetes/dockerfiles/spark/entrypoint.sh
> ${SPARK_HOME} \
>   && wget
> https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/${HADOOP_AWS_VERSION}/hadoop-aws-${HADOOP_AWS_VERSION}.jar
>  \
>  && wget
> https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/${AWS_SDK_VERSION}/aws-java-sdk-bundle-${AWS_SDK_VERSION}.jar
>  \
> && wget
> https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk/${AWS_SDK_VERSION}/aws-java-sdk-${AWS_SDK_VERSION}.jar
>  \
>  && mv *.jar /opt/spark/jars/
>
> Error:
>
> Any help on this is appreciated
> java.lang.NoSuchMethodError:
> com/google/common/base/Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;Ljava/lang/Object;)V
> (loaded from file:/opt/spark/jars/guava-14.0.1.jar by
> jdk.internal.loader.ClassLoaders$AppClassLoader@1e4553e) called from
> class org.apache.hadoop.fs.s3a.S3AUtils (loaded from
> file:/opt/spark/jars/hadoop-aws-3.3.0.jar by
> jdk.internal.loader.ClassLoaders$AppClassLoader@1e4553e).
>
>


Spark kubernetes s3 connectivity issue

2022-02-14 Thread Raj ks
Hi Team ,

We are trying to build a Docker image using CentOS and trying to connect to S3.
The same setup works with Hadoop 3.2.0 and Spark 3.1.2.

#Installing spark binaries
ENV SPARK_HOME /opt/spark
ENV SPARK_VERSION 3.2.1
ENV HADOOP_VERSION 3.2.0
ARG HADOOP_VERSION_SHORT=3.2
ARG HADOOP_AWS_VERSION=3.3.0
ARG AWS_SDK_VERSION=1.11.563


RUN set -xe \
  && cd /tmp \
  && wget http://mirrors.gigenet.com/apache/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION_SHORT}.tgz \
  && tar -zxvf spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION_SHORT}.tgz \
  && rm *.tgz \
  && mv spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION_SHORT} ${SPARK_HOME} \
  && cp ${SPARK_HOME}/kubernetes/dockerfiles/spark/entrypoint.sh ${SPARK_HOME} \
  && wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/${HADOOP_AWS_VERSION}/hadoop-aws-${HADOOP_AWS_VERSION}.jar \
  && wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/${AWS_SDK_VERSION}/aws-java-sdk-bundle-${AWS_SDK_VERSION}.jar \
  && wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk/${AWS_SDK_VERSION}/aws-java-sdk-${AWS_SDK_VERSION}.jar \
  && mv *.jar /opt/spark/jars/

Error (any help on this is appreciated):
java.lang.NoSuchMethodError:
com/google/common/base/Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;Ljava/lang/Object;)V
(loaded from file:/opt/spark/jars/guava-14.0.1.jar by
jdk.internal.loader.ClassLoaders$AppClassLoader@1e4553e) called from class
org.apache.hadoop.fs.s3a.S3AUtils (loaded from
file:/opt/spark/jars/hadoop-aws-3.3.0.jar by
jdk.internal.loader.ClassLoaders$AppClassLoader@1e4553e).


Re: Deploying Spark on Google Kubernetes (GKE) autopilot, preliminary findings

2022-02-14 Thread Gourav Sengupta
Hi,

Sorry in case it appeared otherwise; Mich's takes are super interesting. It is
just that, when applying solutions in commercial undertakings, things are quite
different from research/development scenarios.



Regards,
Gourav Sengupta





On Mon, Feb 14, 2022 at 5:02 PM ashok34...@yahoo.com.INVALID
 wrote:

> Thanks Mich. Very insightful.
>
>
> AK
> On Monday, 14 February 2022, 11:18:19 GMT, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>
> Good question. However, we ought to look at what options we have so to
> speak.
>
> Let us consider Spark on Dataproc, Spark on Kubernetes and Spark on
> Dataflow
>
>
> Spark on DataProc  is proven and it is
> in use at many organizations, I have deployed it extensively. It is
> infrastructure as a service provided including Spark, Hadoop and other
> artefacts. You have to manage cluster creation, automate cluster creation
> and tear down, submitting jobs etc. However, it is another stack that needs
> to be managed. It now has an autoscaling policy (enables cluster worker VM
> autoscaling) as well.
>
> Spark on GKE
> 
> is something newer. Worth adding that the Spark DEV team are working hard
> to improve the performance of Spark on Kubernetes, for example, through 
> Support
> for Customized Kubernetes Scheduler
> .
> As I explained in the first thread, Spark on Kubernetes relies on
> containerisation. Containers make applications more portable. Moreover,
> they simplify the packaging of dependencies, especially with PySpark and
> enable repeatable and reliable build workflows which is cost effective.
> They also reduce the overall devops load and allow one to iterate on the
> code faster. From a purely cost perspective it would be cheaper with Docker 
> *as
> you can share resources* with your other services. You can create Spark
> docker with different versions of Spark, Scala, Java, OS etc. That docker
> file is portable. Can be used on Prem, AWS, GCP etc in container registries
> and devops and data science people can share it as well. Built once used by
> many. Kubernetes with autopilot helps scale the nodes of the Kubernetes
> cluster depending on the load. *That
> is what I am currently looking into*.
>
> With regard to Dataflow , which I
> believe is similar to AWS Glue
> ,
> it is a managed service for executing data processing patterns. Patterns or
> pipelines are built with the Apache Beam SDK
> , which is an open
> source programming model that supports Java, Python and GO. It enables
> batch and streaming pipelines. You create your pipelines with an Apache
> Beam program and then run them on the Dataflow service. The Apache Spark
> Runner
> 
> can be used to execute Beam pipelines using Spark. When you run a job on
> Dataflow, it spins up a cluster of virtual machines, distributes the tasks
> in the job to the VMs, and dynamically scales the cluster based on how the
> job is performing. As I understand both iterative processing and notebooks
> plus Machine learning with Spark ML are not currently supported by Dataflow
>
> So we have three choices here. If you are migrating from on-prem
> Hadoop/spark/YARN set-up, you may go for Dataproc which will provide the
> same look and feel. If you want to use microservices and containers in your
> event driven architecture, you can adopt docker images that run on
> Kubernetes clusters, including Multi-Cloud Kubernetes Cluster. Dataflow is
> probably best suited for green-field projects.  Less operational
> overhead, unified approach for batch and streaming pipelines.
>
> *So as ever your mileage varies*. If you want to migrate from your
> existing Hadoop/Spark cluster to GCP, or take advantage of your existing
> workforce, choose Dataproc or GKE. In many cases, a big consideration is
> that one already has a codebase written against a particular framework, and
> one just wants to deploy it on the GCP, so even if, say, the Beam
> programming model/Dataflow is superior to Hadoop, someone with a lot of
> Hadoop code might still choose Dataproc or GKE for the time being, rather
> than rewriting their code on Beam to run on Dataflow.
>
>  HTH
>
>
>view my Linkedin 

Re: [MLlib]: GLM with multinomial family

2022-02-14 Thread Sean Owen
SparkR is just a wrapper over the Scala implementations. Are you just looking
for setting family = "multinomial" on LogisticRegression? Sure, it's there in
the Scala API.
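
A minimal Scala sketch of that (the tiny three-class dataset below is
illustrative, not from this thread):

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("multinomial-logit").getOrCreate()

// Toy dataset: label is the class index, features is a vector.
val training = spark.createDataFrame(Seq(
  (0.0, Vectors.dense(0.1, 1.2)),
  (1.0, Vectors.dense(2.3, 0.4)),
  (2.0, Vectors.dense(1.5, 3.1)),
  (0.0, Vectors.dense(0.2, 1.0)),
  (1.0, Vectors.dense(2.1, 0.6)),
  (2.0, Vectors.dense(1.7, 2.9))
)).toDF("label", "features")

val lr = new LogisticRegression()
  .setFamily("multinomial")
  .setMaxIter(100)

val model = lr.fit(training)

// One row of coefficients and one intercept per class.
println(model.coefficientMatrix)
println(model.interceptVector)

Note that, as far as I know, LogisticRegression does not expose p-values;
GeneralizedLinearRegression does, but only for non-multinomial families, which
is presumably why the original approach fits several binomial models.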

On Mon, Feb 14, 2022, 11:50 AM Surya Rajaraman Iyer 
wrote:

> Hi Team,
>
> I am using a multinomial regression in Spark Scala. I want to generate the
> coefficient and p-values for every category.
>
> For example, given two variables salary group (dependent variable) and age
> group (Independent variable)
>
> salary-group: 10,000-, 10,000-100,000, 100,000+
> age-group: 30-, 30-40, 40+
>
> I am looking to get an output like
>
> With 10,000- as baseline,  get the coefficients and pvalues for each
> category. in the salary group
>
> 10,000-100,000,
> coefficient Pvalue
> Intercept .. ..
>
> age group
> 30-40 .. ..
> 40+ .. ..
> 30- 0 0
>
> 100,000+ coefficient Pvalue
> Intercept .. ..
>
> age group
> 30-40 .. ..
> 40+ .. ..
> 30- 0 0
> To do this, I am forced to use glm with binomial family twice. In order to
> parallelize it,  I am using thread pools which doesn't seem ideal.
>
> Do you think there is a way to do multinomial logit in spark scala.I do
> see it in spark R : https://rdrr.io/cran/SparkR/man/spark.logit.html
>
> Is there a spark way to make the glms parallel? Something like:-
>
> SparkLogisticRegressionResult glm (df: DataFrame) {
> }
>
> dfs : Seq[df]
> dfs.map(glm)
>
>
> Thanks a lot for the help!
>
> Regards,
> Surya,
>
>


[MLlib]: GLM with multinomial family

2022-02-14 Thread Surya Rajaraman Iyer
Hi Team,

I am using a multinomial regression in Spark Scala. I want to generate the
coefficient and p-values for every category.

For example, given two variables salary group (dependent variable) and age
group (Independent variable)

salary-group: 10,000-, 10,000-100,000, 100,000+
age-group: 30-, 30-40, 40+

I am looking to get an output like

With 10,000- as the baseline, get the coefficients and p-values for each
category in the salary group:

10,000-100,000,
coefficient Pvalue
Intercept .. ..

age group
30-40 .. ..
40+ .. ..
30- 0 0

100,000+ coefficient Pvalue
Intercept .. ..

age group
30-40 .. ..
40+ .. ..
30- 0 0
To do this, I am forced to use glm with binomial family twice. In order to
parallelize it,  I am using thread pools which doesn't seem ideal.

Do you think there is a way to do multinomial logit in Spark Scala? I do see
it in SparkR: https://rdrr.io/cran/SparkR/man/spark.logit.html

Is there a spark way to make the glms parallel? Something like:-

SparkLogisticRegressionResult glm (df: DataFrame) {
}

dfs : Seq[df]
dfs.map(glm)
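
Not an official Spark API, but one pattern that is sometimes used for this:
because Spark can schedule jobs submitted from several threads against the same
SparkSession, the independent fits can be wrapped in Futures. A rough Scala
sketch (fitOne/fitAll are illustrative names, and each DataFrame is assumed to
already have "features" and "label" columns):

import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}
import org.apache.spark.sql.DataFrame

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Fit one binomial logistic model on a prepared DataFrame.
def fitOne(df: DataFrame): LogisticRegressionModel =
  new LogisticRegression().setFamily("binomial").fit(df)

// Launch the independent fits concurrently and wait for all of them.
def fitAll(dfs: Seq[DataFrame]): Seq[LogisticRegressionModel] = {
  val futures = dfs.map(df => Future(fitOne(df)))
  Await.result(Future.sequence(futures), Duration.Inf)
}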


Thanks a lot for the help!

Regards,
Surya,




Re: Spark 3.2.1 in Google Kubernetes Version 1.19 or 1.21

2022-02-14 Thread Mich Talebzadeh
Hi


It is complaining about the missing driver container image. Does
$SPARK_IMAGE point to a valid image in the GCP container registry?

Example for a docker image for PySpark


IMAGEDRIVER="eu.gcr.io/
/spark-py:3.1.1-scala_2.12-8-jre-slim-buster-java8PlusPackages"


spark-submit --verbose \
   --properties-file ${property_file} \
   --master k8s://https://$KUBERNETES_MASTER_IP:443 \
   --deploy-mode cluster \
   --name sparkBQ \
   --py-files $CODE_DIRECTORY_CLOUD/spark_on_gke.zip \
   --conf spark.kubernetes.namespace=$NAMESPACE \
   --conf spark.network.timeout=300 \
   --conf spark.executor.instances=$NEXEC \
   --conf spark.kubernetes.allocation.batch.size=3 \
   --conf spark.kubernetes.allocation.batch.delay=1 \
   --conf spark.driver.cores=3 \
   --conf spark.executor.cores=3 \
   --conf spark.driver.memory=8092m \
   --conf spark.executor.memory=8092m \
   --conf spark.dynamicAllocation.enabled=true \
   --conf spark.dynamicAllocation.shuffleTracking.enabled=true \
   --conf spark.kubernetes.driver.container.image=${IMAGEDRIVER} \
   --conf spark.kubernetes.executor.container.image=${IMAGEDRIVER} \
   --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-bq \
   --conf spark.driver.extraJavaOptions="-Dio.netty.tryReflectionSetAccessible=true" \
   --conf spark.executor.extraJavaOptions="-Dio.netty.tryReflectionSetAccessible=true" \
   $CODE_DIRECTORY_CLOUD/${APPLICATION}

HTH


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Mon, 14 Feb 2022 at 17:04, Gnana Kumar  wrote:

> Also im using the below parameters while submitting the spark job.
>
> spark-submit \
>   --master k8s://$K8S_SERVER \
>   --deploy-mode cluster \
>   --name $POD_NAME \
>   --class org.apache.spark.examples.SparkPi \
>   --conf spark.executor.instances=2 \
>   --conf spark.kubernetes.driver.container.image=$SPARK_IMAGE \
>   --conf spark.kubernetes.executor.container.image=$SPARK_IMAGE \
>   --conf spark.kubernetes.container.image=$SPARK_IMAGE \
>   --conf spark.kubernetes.driver.pod.name=$POD_NAME \
>   --conf spark.kubernetes.namespace=spark-demo \
>   --conf spark.kubernetes.container.image.pullPolicy=Never \
>   --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
> $SPARK_HOME/examples/jars/spark-examples_2.12-3.2.1.jar
>
> On Mon, Feb 14, 2022 at 9:51 PM Gnana Kumar 
> wrote:
>
>> Hi There,
>>
>> I have been trying to run Spark 3.2.1 in Google Cloud's Kubernetes
>> Cluster version 1.19 or 1.21
>>
>> But I kept on getting on following error and could not proceed.
>>
>> Please help me resolve this issue.
>>
>> 22/02/14 16:00:48 INFO SparkKubernetesClientFactory: Auto-configuring K8S
>> client using current context from users K8S config file
>> Exception in thread "main" org.apache.spark.SparkException: Must specify
>> the driver container image
>> at
>> org.apache.spark.deploy.k8s.features.BasicDriverFeatureStep.$anonfun$driverContainerImage$1(BasicDriverFeatureStep.scala:45)
>> at scala.Option.getOrElse(Option.scala:189)
>> at
>> org.apache.spark.deploy.k8s.features.BasicDriverFeatureStep.(BasicDriverFeatureStep.scala:45)
>> at
>> org.apache.spark.deploy.k8s.submit.KubernetesDriverBuilder.buildFromFeatures(KubernetesDriverBuilder.scala:46)
>> at
>> org.apache.spark.deploy.k8s.submit.Client.run(KubernetesClientApplication.scala:106)
>> at
>> org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$4(KubernetesClientApplication.scala:220)
>> at
>> org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$4$adapted(KubernetesClientApplication.scala:214)
>> at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2713)
>> at
>> org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:214)
>> at
>> org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:186)
>> at org.apache.spark.deploy.SparkSubmit.org
>> $apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955)
>> at
>> org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
>> at
>> org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
>> at
>> org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
>> at
>> 

Re: Spark 3.2.1 in Google Kubernetes Version 1.19 or 1.21

2022-02-14 Thread Gnana Kumar
Also, I'm using the below parameters while submitting the Spark job.

spark-submit \
  --master k8s://$K8S_SERVER \
  --deploy-mode cluster \
  --name $POD_NAME \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.driver.container.image=$SPARK_IMAGE \
  --conf spark.kubernetes.executor.container.image=$SPARK_IMAGE \
  --conf spark.kubernetes.container.image=$SPARK_IMAGE \
  --conf spark.kubernetes.driver.pod.name=$POD_NAME \
  --conf spark.kubernetes.namespace=spark-demo \
  --conf spark.kubernetes.container.image.pullPolicy=Never \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
$SPARK_HOME/examples/jars/spark-examples_2.12-3.2.1.jar

On Mon, Feb 14, 2022 at 9:51 PM Gnana Kumar 
wrote:

> Hi There,
>
> I have been trying to run Spark 3.2.1 in Google Cloud's Kubernetes Cluster
> version 1.19 or 1.21
>
> But I kept on getting on following error and could not proceed.
>
> Please help me resolve this issue.
>
> 22/02/14 16:00:48 INFO SparkKubernetesClientFactory: Auto-configuring K8S
> client using current context from users K8S config file
> Exception in thread "main" org.apache.spark.SparkException: Must specify
> the driver container image
> at
> org.apache.spark.deploy.k8s.features.BasicDriverFeatureStep.$anonfun$driverContainerImage$1(BasicDriverFeatureStep.scala:45)
> at scala.Option.getOrElse(Option.scala:189)
> at
> org.apache.spark.deploy.k8s.features.BasicDriverFeatureStep.(BasicDriverFeatureStep.scala:45)
> at
> org.apache.spark.deploy.k8s.submit.KubernetesDriverBuilder.buildFromFeatures(KubernetesDriverBuilder.scala:46)
> at
> org.apache.spark.deploy.k8s.submit.Client.run(KubernetesClientApplication.scala:106)
> at
> org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$4(KubernetesClientApplication.scala:220)
> at
> org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$4$adapted(KubernetesClientApplication.scala:214)
> at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2713)
> at
> org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:214)
> at
> org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:186)
> at org.apache.spark.deploy.SparkSubmit.org
> $apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955)
> at
> org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
> at
> org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
> at
> org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
> at
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1043)
> at
> org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>
> --
> Thanks
> Gnana
>


-- 
Thanks
Gnana


Spark 3.2.1 in Google Kubernetes Version 1.19 or 1.21

2022-02-14 Thread Gnana Kumar
Hi There,

I have been trying to run Spark 3.2.1 in Google Cloud's Kubernetes Cluster
version 1.19 or 1.21

But I kept getting the following error and could not proceed.

Please help me resolve this issue.

22/02/14 16:00:48 INFO SparkKubernetesClientFactory: Auto-configuring K8S
client using current context from users K8S config file
Exception in thread "main" org.apache.spark.SparkException: Must specify
the driver container image
at
org.apache.spark.deploy.k8s.features.BasicDriverFeatureStep.$anonfun$driverContainerImage$1(BasicDriverFeatureStep.scala:45)
at scala.Option.getOrElse(Option.scala:189)
at
org.apache.spark.deploy.k8s.features.BasicDriverFeatureStep.(BasicDriverFeatureStep.scala:45)
at
org.apache.spark.deploy.k8s.submit.KubernetesDriverBuilder.buildFromFeatures(KubernetesDriverBuilder.scala:46)
at
org.apache.spark.deploy.k8s.submit.Client.run(KubernetesClientApplication.scala:106)
at
org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$4(KubernetesClientApplication.scala:220)
at
org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$4$adapted(KubernetesClientApplication.scala:214)
at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2713)
at
org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:214)
at
org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:186)
at org.apache.spark.deploy.SparkSubmit.org
$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955)
at
org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at
org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at
org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1043)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

-- 
Thanks
Gnana


Re: Deploying Spark on Google Kubernetes (GKE) autopilot, preliminary findings

2022-02-14 Thread ashok34...@yahoo.com.INVALID
Thanks Mich. Very insightful.

AK

On Monday, 14 February 2022, 11:18:19 GMT, Mich Talebzadeh  wrote:
 
Good question. However, we ought to look at what options we have, so to speak.
Let us consider Spark on Dataproc, Spark on Kubernetes and Spark on Dataflow.

Spark on Dataproc is proven and it is in use at many organizations; I have
deployed it extensively. It is infrastructure as a service provided including
Spark, Hadoop and other artefacts. You have to manage cluster creation,
automate cluster creation and tear down, submitting jobs etc. However, it is
another stack that needs to be managed. It now has an autoscaling policy
(enables cluster worker VM autoscaling) as well.

Spark on GKE is something newer. Worth adding that the Spark DEV team are
working hard to improve the performance of Spark on Kubernetes, for example,
through Support for Customized Kubernetes Scheduler. As I explained in the
first thread, Spark on Kubernetes relies on containerisation. Containers make
applications more portable. Moreover, they simplify the packaging of
dependencies, especially with PySpark, and enable repeatable and reliable
build workflows, which is cost effective. They also reduce the overall devops
load and allow one to iterate on the code faster. From a purely cost
perspective it would be cheaper with Docker as you can share resources with
your other services. You can create Spark dockers with different versions of
Spark, Scala, Java, OS etc. That docker file is portable. It can be used
on-prem, AWS, GCP etc in container registries, and devops and data science
people can share it as well. Built once, used by many. Kubernetes with
autopilot helps scale the nodes of the Kubernetes cluster depending on the
load. That is what I am currently looking into.

With regard to Dataflow, which I believe is similar to AWS Glue, it is a
managed service for executing data processing patterns. Patterns or pipelines
are built with the Apache Beam SDK, which is an open source programming model
that supports Java, Python and Go. It enables batch and streaming pipelines.
You create your pipelines with an Apache Beam program and then run them on the
Dataflow service. The Apache Spark Runner can be used to execute Beam
pipelines using Spark. When you run a job on Dataflow, it spins up a cluster
of virtual machines, distributes the tasks in the job to the VMs, and
dynamically scales the cluster based on how the job is performing. As I
understand it, both iterative processing and notebooks plus machine learning
with Spark ML are not currently supported by Dataflow.

So we have three choices here. If you are migrating from an on-prem
Hadoop/Spark/YARN set-up, you may go for Dataproc, which will provide the same
look and feel. If you want to use microservices and containers in your event
driven architecture, you can adopt docker images that run on Kubernetes
clusters, including a multi-cloud Kubernetes cluster. Dataflow is probably
best suited for green-field projects: less operational overhead, unified
approach for batch and streaming pipelines.

So, as ever, your mileage varies. If you want to migrate from your existing
Hadoop/Spark cluster to GCP, or take advantage of your existing workforce,
choose Dataproc or GKE. In many cases, a big consideration is that one already
has a codebase written against a particular framework, and one just wants to
deploy it on the GCP, so even if, say, the Beam programming model/Dataflow is
superior to Hadoop, someone with a lot of Hadoop code might still choose
Dataproc or GKE for the time being, rather than rewriting their code on Beam
to run on Dataflow.

 HTH




   view my Linkedin profile




 https://en.everybodywiki.com/Mich_Talebzadeh

 

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from relying 
on this email's technical content is explicitly disclaimed. The author will in 
no case be liable for any monetary damages arising from such loss, damage or 
destruction. 

 


On Mon, 14 Feb 2022 at 05:46, Gourav Sengupta  wrote:

Hi,

Maybe this is useful in case someone is testing SPARK in containers for
developing SPARK.

From a production scale work point of view: if I am in AWS, I will just use
GLUE if I want to use containers for SPARK, without massively increasing my
costs for operations unnecessarily.

Also, in case I am not wrong, GCP already has SPARK running in serverless mode.

Personally I would never create the overhead of additional costs and issues to
my clients of deploying SPARK when those solutions are already available from
the cloud vendors. In fact, that is one of the precise reasons why people use
cloud - to reduce operational costs.

Sorry, just trying to understand what is the scope of this work.

Regards,
Gourav Sengupta
On Fri, Feb 11, 2022 at 8:35 PM Mich Talebzadeh  
wrote:

The equivalent of Google GKE autopilot in AWS is AWS Fargate




I have not used the AWS Fargate so I can only 

Re: Deploying Spark on Google Kubernetes (GKE) autopilot, preliminary findings

2022-02-14 Thread Gourav Sengupta
Hi,

I would still not build any custom solution, and if in GCP use serverless
Dataproc. I think that it is always better to be hands on with AWS Glue
before commenting on it.
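
For reference, a Dataproc Serverless submission is roughly a one-liner; the
bucket, script name and region below are placeholders, and this is only a
sketch of the general shape of the command:

# submit a PySpark script as a Dataproc Serverless batch (no cluster to manage)
gcloud dataproc batches submit pyspark gs://my-bucket/my_job.py \
  --region=europe-west2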

Regards,
Gourav Sengupta

On Mon, Feb 14, 2022 at 11:18 AM Mich Talebzadeh 
wrote:

> Good question. However, we ought to look at what options we have so to
> speak.
>
> Let us consider Spark on Dataproc, Spark on Kubernetes and Spark on
> Dataflow
>
>
> Spark on DataProc  is proven and it is
> in use at many organizations, I have deployed it extensively. It is
> infrastructure as a service provided including Spark, Hadoop and other
> artefacts. You have to manage cluster creation, automate cluster creation
> and tear down, submitting jobs etc. However, it is another stack that needs
> to be managed. It now has autoscaling
> (enables cluster worker VM autoscaling) policy as well.
>
> Spark on GKE
> 
> is something newer. Worth adding that the Spark DEV team are working hard
> to improve the performance of Spark on Kubernetes, for example, through Support
> for Customized Kubernetes Scheduler.
> As I explained in the first thread, Spark on Kubernetes relies on
> containerisation. Containers make applications more portable. Moreover,
> they simplify the packaging of dependencies, especially with PySpark and
> enable repeatable and reliable build workflows which is cost effective.
> They also reduce the overall devops load and allow one to iterate on the
> code faster. From a purely cost perspective it would be cheaper with Docker 
> *as
> you can share resources* with your other services. You can create Spark
> docker with different versions of Spark, Scala, Java, OS etc. That docker
> file is portable. Can be used on Prem, AWS, GCP etc in container registries
> and devops and data science people can share it as well. Built once used by
> many. Kubernetes with autopilot
> helps scale the nodes of the Kubernetes cluster depending on the load. *That
> is what I am currently looking into*.
>
> With regard to Dataflow, which I
> believe is similar to AWS Glue,
> it is a managed service for executing data processing patterns. Patterns or
> pipelines are built with the Apache Beam SDK
> , which is an open
> source programming model that supports Java, Python and GO. It enables
> batch and streaming pipelines. You create your pipelines with an Apache
> Beam program and then run them on the Dataflow service. The Apache Spark
> Runner
> can be used to execute Beam pipelines using Spark. When you run a job on
> Dataflow, it spins up a cluster of virtual machines, distributes the tasks
> in the job to the VMs, and dynamically scales the cluster based on how the
> job is performing. As I understand both iterative processing and notebooks
> plus Machine learning with Spark ML are not currently supported by Dataflow
>
> So we have three choices here. If you are migrating from on-prem
> Hadoop/spark/YARN set-up, you may go for Dataproc which will provide the
> same look and feel. If you want to use microservices and containers in your
> event driven architecture, you can adopt docker images that run on
> Kubernetes clusters, including Multi-Cloud Kubernetes Cluster. Dataflow is
> probably best suited for green-field projects.  Less operational
> overhead, unified approach for batch and streaming pipelines.
>
> *So as ever your mileage varies*. If you want to migrate from your
> existing Hadoop/Spark cluster to GCP, or take advantage of your existing
> workforce, choose Dataproc or GKE. In many cases, a big consideration is
> that one already has a codebase written against a particular framework, and
> one just wants to deploy it on the GCP, so even if, say, the Beam
> programming mode/dataflow is superior to Hadoop, someone with a lot of
> Hadoop code might still choose Dataproc or GKE for the time being, rather
> than rewriting their code on Beam to run on Dataflow.
>
>  HTH
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all 

Re: Deploying Spark on Google Kubernetes (GKE) autopilot, preliminary findings

2022-02-14 Thread Mich Talebzadeh
Good question. However, we ought to look at what options we have so to
speak.

Let us consider Spark on Dataproc, Spark on Kubernetes and Spark on Dataflow


Spark on DataProc  is proven and it is
in use at many organizations, I have deployed it extensively. It is
infrastructure as a service provided including Spark, Hadoop and other
artefacts. You have to manage cluster creation, automate cluster creation
and tear down, submitting jobs etc. However, it is another stack that needs
to be managed. It now has autoscaling
(enables cluster worker VM autoscaling) policy as well.
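
For illustration, that cluster lifecycle roughly maps onto gcloud commands like
the following; the cluster name, region and autoscaling policy ID are
placeholders, and this is a sketch rather than a tested recipe:

# create a cluster with an autoscaling policy attached (the policy must already exist in the region)
gcloud dataproc clusters create demo-cluster \
  --region=europe-west2 \
  --autoscaling-policy=my-autoscaling-policy

# submit a Spark job to it
gcloud dataproc jobs submit spark \
  --cluster=demo-cluster \
  --region=europe-west2 \
  --class=org.apache.spark.examples.SparkPi \
  --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
  -- 1000

# tear the cluster down when finished
gcloud dataproc clusters delete demo-cluster --region=europe-west2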

Spark on GKE

is something newer. Worth adding that the Spark DEV team are working hard
to improve the performance of Spark on Kubernetes, for example, through Support
for Customized Kubernetes Scheduler.
As I explained in the first thread, Spark on Kubernetes relies on
containerisation. Containers make applications more portable. Moreover,
they simplify the packaging of dependencies, especially with PySpark and
enable repeatable and reliable build workflows which is cost effective.
They also reduce the overall devops load and allow one to iterate on the
code faster. From a purely cost perspective it would be cheaper with Docker *as
you can share resources* with your other services. You can create Spark
docker with different versions of Spark, Scala, Java, OS etc. That docker
file is portable. Can be used on Prem, AWS, GCP etc in container registries
and devops and data science people can share it as well. Built once used by
many. Kubernetes with autopilot
helps scale the nodes of the Kubernetes cluster depending on the load. *That
is what I am currently looking into*.
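
For illustration, building such an image can be done with the
docker-image-tool.sh script that ships with the Spark distribution; the
registry name and tag below are placeholders and this is only a sketch:

# run from the root of an unpacked Spark 3.2.x distribution
./bin/docker-image-tool.sh -r <your-registry> -t v3.2.1 build

# optionally build the PySpark variant as well
./bin/docker-image-tool.sh -r <your-registry> -t v3.2.1 \
  -p kubernetes/dockerfiles/spark/bindings/python/Dockerfile build

# push the images so any Kubernetes cluster (GKE, on-prem, etc.) can pull them
./bin/docker-image-tool.sh -r <your-registry> -t v3.2.1 push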

With regard to Dataflow, which I
believe is similar to AWS Glue,
it is a managed service for executing data processing patterns. Patterns or
pipelines are built with the Apache Beam SDK
, which is an open
source programming model that supports Java, Python and GO. It enables
batch and streaming pipelines. You create your pipelines with an Apache
Beam program and then run them on the Dataflow service. The Apache Spark
Runner

can be used to execute Beam pipelines using Spark. When you run a job on
Dataflow, it spins up a cluster of virtual machines, distributes the tasks
in the job to the VMs, and dynamically scales the cluster based on how the
job is performing. As I understand both iterative processing and notebooks
plus Machine learning with Spark ML are not currently supported by Dataflow
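
For illustration, running the Beam WordCount example on the Spark runner looks
roughly like the Beam quickstart command below; the main class, arguments and
the spark-runner Maven profile come from the Beam examples archetype and should
be treated as an approximation:

mvn compile exec:java \
  -Dexec.mainClass=org.apache.beam.examples.WordCount \
  -Dexec.args="--runner=SparkRunner --inputFile=pom.xml --output=counts" \
  -Pspark-runner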

So we have three choices here. If you are migrating from on-prem
Hadoop/spark/YARN set-up, you may go for Dataproc which will provide the
same look and feel. If you want to use microservices and containers in your
event driven architecture, you can adopt docker images that run on
Kubernetes clusters, including Multi-Cloud Kubernetes Cluster. Dataflow is
probably best suited for green-field projects.  Less operational overhead,
unified approach for batch and streaming pipelines.

*So as ever your mileage varies*. If you want to migrate from your existing
Hadoop/Spark cluster to GCP, or take advantage of your existing workforce,
choose Dataproc or GKE. In many cases, a big consideration is that one
already has a codebase written against a particular framework, and one just
wants to deploy it on the GCP, so even if, say, the Beam programming
mode/dataflow is superior to Hadoop, someone with a lot of Hadoop code
might still choose Dataproc or GKE for the time being, rather than
rewriting their code on Beam to run on Dataflow.

 HTH


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Mon, 14 Feb 2022 at 05:46, Gourav Sengupta 
wrote:

> Hi,
> may be this is useful in case someone is testing SPARK in containers for
> 

Re: [EXTERNAL] Re: Unable to access Google buckets using spark-submit

2022-02-14 Thread Saurabh Gulati
Hey Karan,
you can get the jar from 
here
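
Once downloaded, the connector jar can be passed to spark-submit along the
lines of the sketch below; the jar version, key file path, script and bucket
names are illustrative rather than taken from this thread:

# assumes the shaded GCS connector jar has been downloaded locally
spark-submit \
  --jars gcs-connector-hadoop3-2.2.5-shaded.jar \
  --conf spark.hadoop.fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem \
  --conf spark.hadoop.fs.AbstractFileSystem.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS \
  --conf spark.hadoop.google.cloud.auth.service.account.enable=true \
  --conf spark.hadoop.google.cloud.auth.service.account.json.keyfile=/path/to/service-account.json \
  my_job.py gs://my-bucket/input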

From: karan alang 
Sent: 13 February 2022 20:08
To: Gourav Sengupta 
Cc: Holden Karau ; Mich Talebzadeh 
; user @spark 
Subject: [EXTERNAL] Re: Unable to access Google buckets using spark-submit

Hi Gourav, All,
I'm doing a spark-submit from my local system to a GCP Dataproc cluster. This
is more for dev/testing.
I can run a 'gcloud dataproc jobs submit' command as well, which is what will
be done in production.

Hope that clarifies.

regds,
Karan Alang


On Sat, Feb 12, 2022 at 10:31 PM Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
Hi,

agree with Holden, have faced quite a few issues with FUSE.

Also trying to understand "spark-submit from local" . Are you submitting your 
SPARK jobs from a local laptop or in local mode from a GCP dataproc / system?

If you are submitting the job from your local laptop, there will be performance 
bottlenecks I guess based on the internet bandwidth and volume of data.

Regards,
Gourav


On Sat, Feb 12, 2022 at 7:12 PM Holden Karau <hol...@pigscanfly.ca> wrote:
You can also put the GS access jar with your Spark jars — that’s what the class 
not found exception is pointing you towards.

On Fri, Feb 11, 2022 at 11:58 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
BTW I also answered you on Stack Overflow:

https://stackoverflow.com/questions/71088934/unable-to-access-google-buckets-using-spark-submit


HTH


 
   view my Linkedin 
profile


 
https://en.everybodywiki.com/Mich_Talebzadeh



Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.




On Sat, 12 Feb 2022 at 08:24, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
You are trying to access a Google storage bucket gs:// from your local host.

It does not see it because spark-submit assumes that it is a local file system 
on the host, which it is not.

You need to mount gs:// bucket as a local file system.

You can use the tool called gcsfuse
(https://cloud.google.com/storage/docs/gcs-fuse). Cloud Storage FUSE is an open
source FUSE adapter that allows you to mount Cloud Storage buckets as file
systems on Linux or macOS systems. You can download gcsfuse from here.


Pretty simple.


It will be installed as /usr/bin/gcsfuse. You can mount a bucket by creating a
local mount point like /mnt/gs as root and giving permission to others to use
it, for example as sketched below.
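
A minimal sketch, assuming root/sudo access; adjust ownership and permissions
to your own policy:

# create the mount point and hand it to the non-root user who will mount the bucket
sudo mkdir -p /mnt/gs
sudo chown $USER:$USER /mnt/gs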


As a normal user that needs to access the gs:// bucket (not as root), use
gcsfuse to mount it. For example, I am mounting a GCS bucket called
spark-jars-karan here:


Just use the bucket name itself


gcsfuse spark-jars-karan /mnt/gs


Then you can refer to it as /mnt/gs in spark-submit from the on-premises host:

spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.0 --jars 
/mnt/gs/spark-bigquery-with-dependencies_2.12-0.23.2.jar

HTH

 
   view my Linkedin 
profile



Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may