[jira] [Created] (SPARK-47917) Accounting the impact of failures in spark jobs

2024-04-19 Thread Faiz Halde (Jira)
Faiz Halde created SPARK-47917:
--

 Summary: Accounting the impact of failures in spark jobs
 Key: SPARK-47917
 URL: https://issues.apache.org/jira/browse/SPARK-47917
 Project: Spark
  Issue Type: Question
  Components: Spark Core
Affects Versions: 3.5.1
Reporter: Faiz Halde


Hello,
 
In my organization, we have an accounting system for Spark jobs that uses task 
execution time to determine how much executor time each job consumes, and we use 
that to apportion cost. We sum all the task times per job and apply proportions. 
Our clusters follow a 1-task-per-core model, and this works well.
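
For concreteness, the aggregation is roughly equivalent to the following sketch (illustrative only; the listener name and the stage-to-job mapping are simplifications of what we actually run):

```

import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart, SparkListenerTaskEnd}
import scala.collection.mutable

// Sums task wall-clock time per job. Listener events are delivered on a single
// thread per listener, so plain mutable maps are sufficient for this sketch.
class TaskTimeAccountingListener extends SparkListener {
  private val jobOfStage = mutable.Map.empty[Int, Int]  // stageId -> jobId
  private val taskTimeMs = mutable.Map.empty[Int, Long] // jobId   -> summed task time (ms)

  override def onJobStart(jobStart: SparkListenerJobStart): Unit =
    jobStart.stageIds.foreach(stageId => jobOfStage(stageId) = jobStart.jobId)

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit =
    jobOfStage.get(taskEnd.stageId).foreach { jobId =>
      val runTime = taskEnd.taskInfo.finishTime - taskEnd.taskInfo.launchTime
      taskTimeMs(jobId) = taskTimeMs.getOrElse(jobId, 0L) + runTime
    }
}

```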
 
A job goes through several failures during its run due to executor failures and 
node failures (spot interruptions), and Spark retries tasks, and sometimes entire 
stages, in response.
 
We now want to account for these failures and determine what percentage of a job's 
total task time is due to retries. Basically, if a job with failures and retries 
has a total task time of X, there is an X' representing the goodput of that job 
– i.e. a hypothetical run of the job with zero failures and retries. In that case, 
( X - X' ) / X quantifies the cost of failures; for example, if X = 100 task-hours 
and X' = 80, failures account for 20% of the job's task time.
 
This form of accounting requires tracking the execution history of each task, i.e. 
all task attempts that compute the same logical partition of some RDD. This was 
quite easy with AQE disabled, since stage ids never changed, but with AQE enabled 
that is no longer the case.
 
Do you have any suggestions on how I can achieve this using the Spark event (listener) system?
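
To make the question concrete, the kind of listener I have in mind looks roughly like this (a sketch, not working production code; FailureCostListener and its fields are made-up names):

```

import org.apache.spark.Success
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}
import scala.collection.mutable

// Counts the first successful attempt of each (stageId, partition index) as
// goodput and every other attempt (failed, killed, speculative, stage retries)
// as retry cost. The gap: with AQE, a re-planned stage gets a *new* stageId,
// so the same logical partition is not recognised across stage attempts.
class FailureCostListener extends SparkListener {
  private val firstSuccess = mutable.Set.empty[(Int, Int)] // (stageId, partition index)
  var goodputMs   = 0L
  var retryCostMs = 0L

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val info    = taskEnd.taskInfo
    val runTime = info.finishTime - info.launchTime
    val key     = (taskEnd.stageId, info.index)
    if (taskEnd.reason == Success && firstSuccess.add(key)) goodputMs += runTime
    else retryCostMs += runTime
  }
}

```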






[jira] [Commented] (SPARK-46032) connect: cannot assign instance of java.lang.invoke.SerializedLambda to field org.apache.spark.rdd.MapPartitionsRDD.f

2023-12-07 Thread Faiz Halde (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-46032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17794376#comment-17794376
 ] 

Faiz Halde commented on SPARK-46032:


We had a pressing need to fix this, and the following patch worked for us:

[https://github.com/apache/spark/pull/44240]

I'm not sure it addresses the right root cause, though.

> connect: cannot assign instance of java.lang.invoke.SerializedLambda to field 
> org.apache.spark.rdd.MapPartitionsRDD.f
> -
>
> Key: SPARK-46032
> URL: https://issues.apache.org/jira/browse/SPARK-46032
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Bobby Wang
>Priority: Major
>
> I downloaded Spark 3.5 from the official Spark website and started a Spark 
> Standalone cluster in which the master and the only worker run on the same 
> node. 
>  
> Then I started the connect server by 
> {code:java}
> start-connect-server.sh \
>     --master spark://10.19.183.93:7077 \
>     --packages org.apache.spark:spark-connect_2.12:3.5.0 \
>     --conf spark.executor.cores=12 \
>     --conf spark.task.cpus=1 \
>     --executor-memory 30G \
>     --conf spark.executor.resource.gpu.amount=1 \
>     --conf spark.task.resource.gpu.amount=0.08 \
>     --driver-memory 1G{code}
>  
> I can confirm from the web UI that the Spark standalone cluster, the connect 
> server, and the Spark driver have all started.
>  
> Finally, I tried to run a very simple spark job 
> (spark.range(100).filter("id>2").collect()) from spark-connect-client using 
> pyspark, but I got the below error.
>  
> _pyspark --remote sc://localhost_
> _Python 3.10.0 (default, Mar  3 2022, 09:58:08) [GCC 7.5.0] on linux_
> _Type "help", "copyright", "credits" or "license" for more information._
> _Welcome to_
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /__ / .__/\_,_/_/ /_/\_\   version 3.5.0
>       /_/
>  
> _Using Python version 3.10.0 (default, Mar  3 2022 09:58:08)_
> _Client connected to the Spark Connect server at localhost_
> _SparkSession available as 'spark'._
> _>>> spark.range(100).filter("id > 3").collect()_
> _Traceback (most recent call last):_
>   _File "", line 1, in _
>   _File 
> "/home/xxx/github/mytools/spark.home/spark-3.5.0-bin-hadoop3/python/pyspark/sql/connect/dataframe.py",
>  line 1645, in collect_
>     _table, schema = self._session.client.to_table(query)_
>   _File 
> "/home/xxx/github/mytools/spark.home/spark-3.5.0-bin-hadoop3/python/pyspark/sql/connect/client/core.py",
>  line 858, in to_table_
>     _table, schema, _, _, _ = self._execute_and_fetch(req)_
>   _File 
> "/home/xxx/github/mytools/spark.home/spark-3.5.0-bin-hadoop3/python/pyspark/sql/connect/client/core.py",
>  line 1282, in _execute_and_fetch_
>     _for response in self._execute_and_fetch_as_iterator(req):_
>   _File 
> "/home/xxx/github/mytools/spark.home/spark-3.5.0-bin-hadoop3/python/pyspark/sql/connect/client/core.py",
>  line 1263, in _execute_and_fetch_as_iterator_
>     _self._handle_error(error)_
>   _File 
> "/home/xxx/github/mytools/spark.home/spark-3.5.0-bin-hadoop3/python/pyspark/sql/connect/client/core.py",
>  line 1502, in _handle_error_
>     _self._handle_rpc_error(error)_
>   _File 
> "/home/xxx/github/mytools/spark.home/spark-3.5.0-bin-hadoop3/python/pyspark/sql/connect/client/core.py",
>  line 1538, in _handle_rpc_error_
>     _raise convert_exception(info, status.message) from None_
> _pyspark.errors.exceptions.connect.SparkConnectGrpcException: 
> (org.apache.spark.SparkException) Job aborted due to stage failure: Task 0 in 
> stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 
> (TID 35) (10.19.183.93 executor 0): java.lang.ClassCastException: cannot 
> assign instance of java.lang.invoke.SerializedLambda to field 
> org.apache.spark.rdd.MapPartitionsRDD.f of type scala.Function3 in instance 
> of org.apache.spark.rdd.MapPartitionsRDD_
> _at 
> java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2301)_
> _at java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1431)_
> _at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2437)_
> _at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)_
> _at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)_
> _at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)_
> _at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)_
> _at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)_
> _at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)_
> 

[jira] [Comment Edited] (SPARK-45255) Spark connect client failing with java.lang.NoClassDefFoundError

2023-12-01 Thread Faiz Halde (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17791996#comment-17791996
 ] 

Faiz Halde edited comment on SPARK-45255 at 12/1/23 10:40 AM:
--

it seems to be failing on another issue now

a fresh project with build.sbt

```

{{resolvers += "snaps" at "https://repository.apache.org/snapshots/"}}

{{libraryDependencies += "org.apache.spark" %% "spark-sql-api" % 
"3.5.1-SNAPSHOT"}}
{{libraryDependencies += "org.apache.spark" %% "spark-connect-client-jvm" % 
"3.5.1-SNAPSHOT"}}

```

 

```

{{scala> val spark = SparkSession.builder().remote("sc://localhost").build()}}
{{warning: one deprecation (since 3.5.0); for details, enable `:setting 
-deprecation' or `:replay -deprecation'}}
{{java.lang.NoSuchMethodError: 'void 
com.google.common.base.Preconditions.checkArgument(boolean, java.lang.String, 
char, java.lang.Object)'}}
{{  at org.sparkproject.io.grpc.Metadata$Key.validateName(Metadata.java:754)}}
{{  at org.sparkproject.io.grpc.Metadata$Key.(Metadata.java:762)}}
{{  at org.sparkproject.io.grpc.Metadata$Key.(Metadata.java:671)}}
{{  at org.sparkproject.io.grpc.Metadata$AsciiKey.(Metadata.java:971)}}
{{  at org.sparkproject.io.grpc.Metadata$AsciiKey.(Metadata.java:966)}}
{{  at org.sparkproject.io.grpc.Metadata$Key.of(Metadata.java:708)}}
{{  at org.sparkproject.io.grpc.Metadata$Key.of(Metadata.java:704)}}
{{  at 
org.apache.spark.sql.connect.client.SparkConnectClient$.(SparkConnectClient.scala:329)}}
{{  at 
org.apache.spark.sql.connect.client.SparkConnectClient$.(SparkConnectClient.scala)}}
{{  at 
org.apache.spark.sql.SparkSession$Builder.(SparkSession.scala:789)}}
{{  at org.apache.spark.sql.SparkSession$.builder(SparkSession.scala:777)}}
{{  ... 55 elided}}

```
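
If it helps, the missing Preconditions.checkArgument(boolean, String, char, Object) overload only exists in newer Guava releases, so my working assumption is that an older Guava is leaking onto the classpath transitively. A possible (untested) workaround in build.sbt; the exact version below is a placeholder:

```

// Force a recent Guava so grpc's Metadata.Key.validateName finds the overload it needs.
// Run `sbt evicted` first to confirm which Guava actually wins on the classpath.
dependencyOverrides += "com.google.guava" % "guava" % "32.1.3-jre"

```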


was (Author: JIRAUSER300204):
it seems to be failing on another issue now

a fresh project with build.sbt

```

{{resolvers += "snaps" at "https://repository.apache.org/snapshots/"}}

{{libraryDependencies += "org.apache.spark" %% "spark-sql-api" % 
"3.5.1-SNAPSHOT"}}
{{libraryDependencies += "org.apache.spark" %% "spark-connect-client-jvm" % 
"3.5.1-SNAPSHOT"}}

```

```

{{scala> import org.apache.spark.sql.SparkSession}}
{{import org.apache.spark.sql.SparkSession}}

{{scala> val spark = SparkSession.builder().remote("sc://localhost").build()}}
{{warning: one deprecation (since 3.5.0); for details, enable `:setting 
-deprecation' or `:replay -deprecation'}}
{{java.lang.NoClassDefFoundError: io/netty/buffer/PooledByteBufAllocator}}
{{  at java.base/java.lang.ClassLoader.defineClass1(Native Method)}}
{{  at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1017)}}
{{  at 
java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:150)}}
{{  at java.base/java.net.URLClassLoader.defineClass(URLClassLoader.java:524)}}
{{  at java.base/java.net.URLClassLoader$1.run(URLClassLoader.java:427)}}
{{  at java.base/java.net.URLClassLoader$1.run(URLClassLoader.java:421)}}
{{  at 
java.base/java.security.AccessController.doPrivileged(AccessController.java:712)}}
{{  at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:420)}}
{{  at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:592)}}
{{  at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:525)}}
{{  at 
io.netty.buffer.PooledByteBufAllocatorL.(PooledByteBufAllocatorL.java:49)}}
{{  at 
org.apache.arrow.memory.NettyAllocationManager.(NettyAllocationManager.java:51)}}
{{  at 
org.apache.arrow.memory.DefaultAllocationManagerFactory.(DefaultAllocationManagerFactory.java:26)}}
{{  at java.base/java.lang.Class.forName0(Native Method)}}
{{  at java.base/java.lang.Class.forName(Class.java:375)}}
{{  at 
org.apache.arrow.memory.DefaultAllocationManagerOption.getFactory(DefaultAllocationManagerOption.java:108)}}
{{  at 
org.apache.arrow.memory.DefaultAllocationManagerOption.getDefaultAllocationManagerFactory(DefaultAllocationManagerOption.java:98)}}
{{  at 
org.apache.arrow.memory.BaseAllocator$Config.getAllocationManagerFactory(BaseAllocator.java:772)}}
{{  at 
org.apache.arrow.memory.ImmutableConfig.access$801(ImmutableConfig.java:24)}}
{{  at 
org.apache.arrow.memory.ImmutableConfig$InitShim.getAllocationManagerFactory(ImmutableConfig.java:83)}}
{{  at org.apache.arrow.memory.ImmutableConfig.(ImmutableConfig.java:47)}}
{{  at org.apache.arrow.memory.ImmutableConfig.(ImmutableConfig.java:24)}}
{{  at 
org.apache.arrow.memory.ImmutableConfig$Builder.build(ImmutableConfig.java:485)}}
{{  at org.apache.arrow.memory.BaseAllocator.(BaseAllocator.java:61)}}
{{  at org.apache.spark.sql.SparkSession.(SparkSession.scala:75)}}
{{  at org.apache.spark.sql.SparkSession$.create(SparkSession.scala:737)}}
{{  at 
org.apache.spark.sql.SparkSession$Builder.$anonfun$create$1(SparkSession.scala:805)}}
{{  at scala.Option.getOrElse(Option.scala:189)}}
{{  at 

[jira] [Commented] (SPARK-45255) Spark connect client failing with java.lang.NoClassDefFoundError

2023-12-01 Thread Faiz Halde (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17791996#comment-17791996
 ] 

Faiz Halde commented on SPARK-45255:


it seems to be failing on another issue now

a fresh project with build.sbt

```

resolvers += "snaps" at "https://repository.apache.org/snapshots/;

libraryDependencies += "org.apache.spark" %% "spark-sql-api" % "3.5.1-SNAPSHOT"
libraryDependencies += "org.apache.spark" %% "spark-connect-client-jvm" % 
"3.5.0-SNAPSHOT"

```

```

scala> import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.SparkSession

scala> val spark = SparkSession.builder().remote("sc://localhost").build()
warning: one deprecation (since 3.5.0); for details, enable `:setting 
-deprecation' or `:replay -deprecation'
java.lang.NoClassDefFoundError: io/netty/buffer/PooledByteBufAllocator
  at java.base/java.lang.ClassLoader.defineClass1(Native Method)
  at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1017)
  at 
java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:150)
  at java.base/java.net.URLClassLoader.defineClass(URLClassLoader.java:524)
  at java.base/java.net.URLClassLoader$1.run(URLClassLoader.java:427)
  at java.base/java.net.URLClassLoader$1.run(URLClassLoader.java:421)
  at 
java.base/java.security.AccessController.doPrivileged(AccessController.java:712)
  at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:420)
  at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:592)
  at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:525)
  at 
io.netty.buffer.PooledByteBufAllocatorL.(PooledByteBufAllocatorL.java:49)
  at 
org.apache.arrow.memory.NettyAllocationManager.(NettyAllocationManager.java:51)
  at 
org.apache.arrow.memory.DefaultAllocationManagerFactory.(DefaultAllocationManagerFactory.java:26)
  at java.base/java.lang.Class.forName0(Native Method)
  at java.base/java.lang.Class.forName(Class.java:375)
  at 
org.apache.arrow.memory.DefaultAllocationManagerOption.getFactory(DefaultAllocationManagerOption.java:108)
  at 
org.apache.arrow.memory.DefaultAllocationManagerOption.getDefaultAllocationManagerFactory(DefaultAllocationManagerOption.java:98)
  at 
org.apache.arrow.memory.BaseAllocator$Config.getAllocationManagerFactory(BaseAllocator.java:772)
  at org.apache.arrow.memory.ImmutableConfig.access$801(ImmutableConfig.java:24)
  at 
org.apache.arrow.memory.ImmutableConfig$InitShim.getAllocationManagerFactory(ImmutableConfig.java:83)
  at org.apache.arrow.memory.ImmutableConfig.(ImmutableConfig.java:47)
  at org.apache.arrow.memory.ImmutableConfig.(ImmutableConfig.java:24)
  at 
org.apache.arrow.memory.ImmutableConfig$Builder.build(ImmutableConfig.java:485)
  at org.apache.arrow.memory.BaseAllocator.(BaseAllocator.java:61)
  at org.apache.spark.sql.SparkSession.(SparkSession.scala:75)
  at org.apache.spark.sql.SparkSession$.create(SparkSession.scala:737)
  at 
org.apache.spark.sql.SparkSession$Builder.$anonfun$create$1(SparkSession.scala:805)
  at scala.Option.getOrElse(Option.scala:189)
  at org.apache.spark.sql.SparkSession$Builder.create(SparkSession.scala:805)
  at org.apache.spark.sql.SparkSession$Builder.build(SparkSession.scala:795)
  ... 55 elided
Caused by: java.lang.ClassNotFoundException: 
io.netty.buffer.PooledByteBufAllocator
  at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:445)
  at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:592)
  at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:525)
  ... 85 more

```
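
As an experiment (not a verified fix), one could try explicitly adding the Netty artifacts that Arrow's allocator is looking for; the versions below are placeholders and should be aligned with whatever Spark 3.5.x actually pulls in:

```

// io.netty.buffer.PooledByteBufAllocator comes from netty-buffer; pulling it (and
// netty-common) in directly is a guess at a workaround, not a confirmed solution.
libraryDependencies ++= Seq(
  "io.netty" % "netty-buffer" % "4.1.96.Final",
  "io.netty" % "netty-common" % "4.1.96.Final"
)

```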

> Spark connect client failing with java.lang.NoClassDefFoundError
> 
>
> Key: SPARK-45255
> URL: https://issues.apache.org/jira/browse/SPARK-45255
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Faiz Halde
>Assignee: Herman van Hövell
>Priority: Major
> Fix For: 4.0.0, 3.5.1
>
>
> java 1.8, sbt 1.9, scala 2.12
>  
> I have a very simple repo with the following dependency in `build.sbt`
> ```
> {{libraryDependencies ++= Seq("org.apache.spark" %% 
> "spark-connect-client-jvm" % "3.5.0")}}
> ```
> A simple application
> ```
> {{object Main extends App {}}
> {{   val s = SparkSession.builder().remote("sc://localhost").getOrCreate()}}
> {{   s.read.json("/tmp/input.json").repartition(10).show(false)}}
> {{}}}
> ```
> But when I run it, I get the following error
>  
> ```
> {{Exception in thread "main" java.lang.NoClassDefFoundError: 
> org/sparkproject/connect/client/com/google/common/cache/CacheLoader}}
> {{    at Main$.delayedEndpoint$Main$1(Main.scala:4)}}
> {{    at Main$delayedInit$body.apply(Main.scala:3)}}
> {{    at scala.Function0.apply$mcV$sp(Function0.scala:39)}}
> {{    at 

[jira] [Comment Edited] (SPARK-45255) Spark connect client failing with java.lang.NoClassDefFoundError

2023-12-01 Thread Faiz Halde (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17791996#comment-17791996
 ] 

Faiz Halde edited comment on SPARK-45255 at 12/1/23 10:39 AM:
--

it seems to be failing on another issue now

a fresh project with build.sbt

```

{{resolvers += "snaps" at "https://repository.apache.org/snapshots/"}}

{{libraryDependencies += "org.apache.spark" %% "spark-sql-api" % 
"3.5.1-SNAPSHOT"}}
{{libraryDependencies += "org.apache.spark" %% "spark-connect-client-jvm" % 
"3.5.0-SNAPSHOT"}}

```

```

{{scala> import org.apache.spark.sql.SparkSession}}
{{import org.apache.spark.sql.SparkSession}}

{{scala> val spark = SparkSession.builder().remote("sc://localhost").build()}}
{{warning: one deprecation (since 3.5.0); for details, enable `:setting 
-deprecation' or `:replay -deprecation'}}
{{java.lang.NoClassDefFoundError: io/netty/buffer/PooledByteBufAllocator}}
{{  at java.base/java.lang.ClassLoader.defineClass1(Native Method)}}
{{  at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1017)}}
{{  at 
java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:150)}}
{{  at java.base/java.net.URLClassLoader.defineClass(URLClassLoader.java:524)}}
{{  at java.base/java.net.URLClassLoader$1.run(URLClassLoader.java:427)}}
{{  at java.base/java.net.URLClassLoader$1.run(URLClassLoader.java:421)}}
{{  at 
java.base/java.security.AccessController.doPrivileged(AccessController.java:712)}}
{{  at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:420)}}
{{  at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:592)}}
{{  at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:525)}}
{{  at 
io.netty.buffer.PooledByteBufAllocatorL.(PooledByteBufAllocatorL.java:49)}}
{{  at 
org.apache.arrow.memory.NettyAllocationManager.(NettyAllocationManager.java:51)}}
{{  at 
org.apache.arrow.memory.DefaultAllocationManagerFactory.(DefaultAllocationManagerFactory.java:26)}}
{{  at java.base/java.lang.Class.forName0(Native Method)}}
{{  at java.base/java.lang.Class.forName(Class.java:375)}}
{{  at 
org.apache.arrow.memory.DefaultAllocationManagerOption.getFactory(DefaultAllocationManagerOption.java:108)}}
{{  at 
org.apache.arrow.memory.DefaultAllocationManagerOption.getDefaultAllocationManagerFactory(DefaultAllocationManagerOption.java:98)}}
{{  at 
org.apache.arrow.memory.BaseAllocator$Config.getAllocationManagerFactory(BaseAllocator.java:772)}}
{{  at 
org.apache.arrow.memory.ImmutableConfig.access$801(ImmutableConfig.java:24)}}
{{  at 
org.apache.arrow.memory.ImmutableConfig$InitShim.getAllocationManagerFactory(ImmutableConfig.java:83)}}
{{  at org.apache.arrow.memory.ImmutableConfig.(ImmutableConfig.java:47)}}
{{  at org.apache.arrow.memory.ImmutableConfig.(ImmutableConfig.java:24)}}
{{  at 
org.apache.arrow.memory.ImmutableConfig$Builder.build(ImmutableConfig.java:485)}}
{{  at org.apache.arrow.memory.BaseAllocator.(BaseAllocator.java:61)}}
{{  at org.apache.spark.sql.SparkSession.(SparkSession.scala:75)}}
{{  at org.apache.spark.sql.SparkSession$.create(SparkSession.scala:737)}}
{{  at 
org.apache.spark.sql.SparkSession$Builder.$anonfun$create$1(SparkSession.scala:805)}}
{{  at scala.Option.getOrElse(Option.scala:189)}}
{{  at 
org.apache.spark.sql.SparkSession$Builder.create(SparkSession.scala:805)}}
{{  at org.apache.spark.sql.SparkSession$Builder.build(SparkSession.scala:795)}}
{{  ... 55 elided}}
{{Caused by: java.lang.ClassNotFoundException: 
io.netty.buffer.PooledByteBufAllocator}}
{{  at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:445)}}
{{  at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:592)}}
{{  at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:525)}}
{{  ... 85 more}}

```


was (Author: JIRAUSER300204):
it seems to be failing on another issue now

a fresh project with build.sbt

```

{{resolvers += "snaps" at "https://repository.apache.org/snapshots/"}}

{{libraryDependencies += "org.apache.spark" %% "spark-sql-api" % "3.5.1-SNAPSHOT"}}
{{libraryDependencies += "org.apache.spark" %% "spark-connect-client-jvm" % 
"3.5.0-SNAPSHOT"}}

```

```

{{scala> import org.apache.spark.sql.SparkSession}}
{{import org.apache.spark.sql.SparkSession}}

{{scala> val spark = SparkSession.builder().remote("sc://localhost").build()}}
{{warning: one deprecation (since 3.5.0); for details, enable `:setting 
-deprecation' or `:replay -deprecation'}}
{{java.lang.NoClassDefFoundError: io/netty/buffer/PooledByteBufAllocator}}
{{  at java.base/java.lang.ClassLoader.defineClass1(Native Method)}}
{{  at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1017)}}
{{  at 
java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:150)}}
{{  at java.base/java.net.URLClassLoader.defineClass(URLClassLoader.java:524)}}
{{  at 

[jira] [Comment Edited] (SPARK-45255) Spark connect client failing with java.lang.NoClassDefFoundError

2023-12-01 Thread Faiz Halde (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17791996#comment-17791996
 ] 

Faiz Halde edited comment on SPARK-45255 at 12/1/23 10:39 AM:
--

it seems to be failing on another issue now

a fresh project with build.sbt

```

{{resolvers += "snaps" at "https://repository.apache.org/snapshots/"}}

{{libraryDependencies += "org.apache.spark" %% "spark-sql-api" % 
"3.5.1-SNAPSHOT"}}
{{libraryDependencies += "org.apache.spark" %% "spark-connect-client-jvm" % 
"3.5.1-SNAPSHOT"}}

```

```

{{scala> import org.apache.spark.sql.SparkSession}}
{{import org.apache.spark.sql.SparkSession}}

{{scala> val spark = SparkSession.builder().remote("sc://localhost").build()}}
{{warning: one deprecation (since 3.5.0); for details, enable `:setting 
-deprecation' or `:replay -deprecation'}}
{{java.lang.NoClassDefFoundError: io/netty/buffer/PooledByteBufAllocator}}
{{  at java.base/java.lang.ClassLoader.defineClass1(Native Method)}}
{{  at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1017)}}
{{  at 
java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:150)}}
{{  at java.base/java.net.URLClassLoader.defineClass(URLClassLoader.java:524)}}
{{  at java.base/java.net.URLClassLoader$1.run(URLClassLoader.java:427)}}
{{  at java.base/java.net.URLClassLoader$1.run(URLClassLoader.java:421)}}
{{  at 
java.base/java.security.AccessController.doPrivileged(AccessController.java:712)}}
{{  at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:420)}}
{{  at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:592)}}
{{  at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:525)}}
{{  at 
io.netty.buffer.PooledByteBufAllocatorL.(PooledByteBufAllocatorL.java:49)}}
{{  at 
org.apache.arrow.memory.NettyAllocationManager.(NettyAllocationManager.java:51)}}
{{  at 
org.apache.arrow.memory.DefaultAllocationManagerFactory.(DefaultAllocationManagerFactory.java:26)}}
{{  at java.base/java.lang.Class.forName0(Native Method)}}
{{  at java.base/java.lang.Class.forName(Class.java:375)}}
{{  at 
org.apache.arrow.memory.DefaultAllocationManagerOption.getFactory(DefaultAllocationManagerOption.java:108)}}
{{  at 
org.apache.arrow.memory.DefaultAllocationManagerOption.getDefaultAllocationManagerFactory(DefaultAllocationManagerOption.java:98)}}
{{  at 
org.apache.arrow.memory.BaseAllocator$Config.getAllocationManagerFactory(BaseAllocator.java:772)}}
{{  at 
org.apache.arrow.memory.ImmutableConfig.access$801(ImmutableConfig.java:24)}}
{{  at 
org.apache.arrow.memory.ImmutableConfig$InitShim.getAllocationManagerFactory(ImmutableConfig.java:83)}}
{{  at org.apache.arrow.memory.ImmutableConfig.(ImmutableConfig.java:47)}}
{{  at org.apache.arrow.memory.ImmutableConfig.(ImmutableConfig.java:24)}}
{{  at 
org.apache.arrow.memory.ImmutableConfig$Builder.build(ImmutableConfig.java:485)}}
{{  at org.apache.arrow.memory.BaseAllocator.(BaseAllocator.java:61)}}
{{  at org.apache.spark.sql.SparkSession.(SparkSession.scala:75)}}
{{  at org.apache.spark.sql.SparkSession$.create(SparkSession.scala:737)}}
{{  at 
org.apache.spark.sql.SparkSession$Builder.$anonfun$create$1(SparkSession.scala:805)}}
{{  at scala.Option.getOrElse(Option.scala:189)}}
{{  at 
org.apache.spark.sql.SparkSession$Builder.create(SparkSession.scala:805)}}
{{  at org.apache.spark.sql.SparkSession$Builder.build(SparkSession.scala:795)}}
{{  ... 55 elided}}
{{Caused by: java.lang.ClassNotFoundException: 
io.netty.buffer.PooledByteBufAllocator}}
{{  at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:445)}}
{{  at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:592)}}
{{  at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:525)}}
{{  ... 85 more}}

```


was (Author: JIRAUSER300204):
it seems to be failing on another issue now

a fresh project with build.sbt

```

{{resolvers += "snaps" at "https://repository.apache.org/snapshots/"}}

{{libraryDependencies += "org.apache.spark" %% "spark-sql-api" % 
"3.5.1-SNAPSHOT"}}
{{libraryDependencies += "org.apache.spark" %% "spark-connect-client-jvm" % 
"3.5.0-SNAPSHOT"}}

```

```

{{scala> import org.apache.spark.sql.SparkSession}}
{{import org.apache.spark.sql.SparkSession}}

{{scala> val spark = SparkSession.builder().remote("sc://localhost").build()}}
{{warning: one deprecation (since 3.5.0); for details, enable `:setting 
-deprecation' or `:replay -deprecation'}}
{{java.lang.NoClassDefFoundError: io/netty/buffer/PooledByteBufAllocator}}
{{  at java.base/java.lang.ClassLoader.defineClass1(Native Method)}}
{{  at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1017)}}
{{  at 
java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:150)}}
{{  at java.base/java.net.URLClassLoader.defineClass(URLClassLoader.java:524)}}
{{  at 

[jira] [Comment Edited] (SPARK-45255) Spark connect client failing with java.lang.NoClassDefFoundError

2023-12-01 Thread Faiz Halde (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17791996#comment-17791996
 ] 

Faiz Halde edited comment on SPARK-45255 at 12/1/23 10:38 AM:
--

it seems to be failing on another issue now

a fresh project with build.sbt

```

{{resolvers += "snaps" at "https://repository.apache.org/snapshots/"}}

{{libraryDependencies += "org.apache.spark" %% "spark-sql-api" % "3.5.1-SNAPSHOT"}}
{{libraryDependencies += "org.apache.spark" %% "spark-connect-client-jvm" % 
"3.5.0-SNAPSHOT"}}

```

```

{{scala> import org.apache.spark.sql.SparkSession}}
{{import org.apache.spark.sql.SparkSession}}

{{scala> val spark = SparkSession.builder().remote("sc://localhost").build()}}
{{warning: one deprecation (since 3.5.0); for details, enable `:setting 
-deprecation' or `:replay -deprecation'}}
{{java.lang.NoClassDefFoundError: io/netty/buffer/PooledByteBufAllocator}}
{{  at java.base/java.lang.ClassLoader.defineClass1(Native Method)}}
{{  at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1017)}}
{{  at 
java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:150)}}
{{  at java.base/java.net.URLClassLoader.defineClass(URLClassLoader.java:524)}}
{{  at java.base/java.net.URLClassLoader$1.run(URLClassLoader.java:427)}}
{{  at java.base/java.net.URLClassLoader$1.run(URLClassLoader.java:421)}}
{{  at 
java.base/java.security.AccessController.doPrivileged(AccessController.java:712)}}
{{  at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:420)}}
{{  at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:592)}}
{{  at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:525)}}
{{  at 
io.netty.buffer.PooledByteBufAllocatorL.(PooledByteBufAllocatorL.java:49)}}
{{  at 
org.apache.arrow.memory.NettyAllocationManager.(NettyAllocationManager.java:51)}}
{{  at 
org.apache.arrow.memory.DefaultAllocationManagerFactory.(DefaultAllocationManagerFactory.java:26)}}
{{  at java.base/java.lang.Class.forName0(Native Method)}}
{{  at java.base/java.lang.Class.forName(Class.java:375)}}
{{  at 
org.apache.arrow.memory.DefaultAllocationManagerOption.getFactory(DefaultAllocationManagerOption.java:108)}}
{{  at 
org.apache.arrow.memory.DefaultAllocationManagerOption.getDefaultAllocationManagerFactory(DefaultAllocationManagerOption.java:98)}}
{{  at 
org.apache.arrow.memory.BaseAllocator$Config.getAllocationManagerFactory(BaseAllocator.java:772)}}
{{  at 
org.apache.arrow.memory.ImmutableConfig.access$801(ImmutableConfig.java:24)}}
{{  at 
org.apache.arrow.memory.ImmutableConfig$InitShim.getAllocationManagerFactory(ImmutableConfig.java:83)}}
{{  at org.apache.arrow.memory.ImmutableConfig.(ImmutableConfig.java:47)}}
{{  at org.apache.arrow.memory.ImmutableConfig.(ImmutableConfig.java:24)}}
{{  at 
org.apache.arrow.memory.ImmutableConfig$Builder.build(ImmutableConfig.java:485)}}
{{  at org.apache.arrow.memory.BaseAllocator.(BaseAllocator.java:61)}}
{{  at org.apache.spark.sql.SparkSession.(SparkSession.scala:75)}}
{{  at org.apache.spark.sql.SparkSession$.create(SparkSession.scala:737)}}
{{  at 
org.apache.spark.sql.SparkSession$Builder.$anonfun$create$1(SparkSession.scala:805)}}
{{  at scala.Option.getOrElse(Option.scala:189)}}
{{  at 
org.apache.spark.sql.SparkSession$Builder.create(SparkSession.scala:805)}}
{{  at org.apache.spark.sql.SparkSession$Builder.build(SparkSession.scala:795)}}
{{  ... 55 elided}}
{{Caused by: java.lang.ClassNotFoundException: 
io.netty.buffer.PooledByteBufAllocator}}
{{  at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:445)}}
{{  at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:592)}}
{{  at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:525)}}
{{  ... 85 more}}

```


was (Author: JIRAUSER300204):
it seems to be failing on another issue now

a fresh project with build.sbt

```

resolvers += "snaps" at "https://repository.apache.org/snapshots/;

libraryDependencies += "org.apache.spark" %% "spark-sql-api" % "3.5.1-SNAPSHOT"
libraryDependencies += "org.apache.spark" %% "spark-connect-client-jvm" % 
"3.5.0-SNAPSHOT"

```

```

scala> import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.SparkSession

scala> val spark = SparkSession.builder().remote("sc://localhost").build()
warning: one deprecation (since 3.5.0); for details, enable `:setting 
-deprecation' or `:replay -deprecation'
java.lang.NoClassDefFoundError: io/netty/buffer/PooledByteBufAllocator
  at java.base/java.lang.ClassLoader.defineClass1(Native Method)
  at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1017)
  at 
java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:150)
  at java.base/java.net.URLClassLoader.defineClass(URLClassLoader.java:524)
  at java.base/java.net.URLClassLoader$1.run(URLClassLoader.java:427)
  at 

[jira] [Updated] (SPARK-46046) Isolated classloader per spark session

2023-11-21 Thread Faiz Halde (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Faiz Halde updated SPARK-46046:
---
Description: 
Hello,

We use spark 3.5.0 and were wondering if the following is achievable using 
spark-core

Our use case involves spinning up a spark cluster wherein the driver 
application loads user jars on-the-fly ( the user jar is not the spark 
driver/application ) but merely a catalog of transformations. A single spark 
application can load multiple jars in its lifetime with potential of classpath 
conflict if care is not taken by the framework

The driver needs to load the jar, add the jar to the executor & call a 
predefined class.method to trigger the transformation

Each transformation runs in its own spark session inside the same spark 
application

AFAIK, on the executor side, isolated classloader per session is only possible 
when using the spark-connect facilities. Is it possible to do this without 
using spark connect?

Spark connect is the only facility that adds the jar into a sessionUUID 
directory of executor and when an executor runs tasks of a job from that 
session, it sets a ChildFirstClassLoader pointing to the sessionUUID directory

 

Thank you

  was:
Hello,

We use spark 3.5.0 and were wondering if the following is achievable using 
spark-core

Our use case involves spinning up a spark cluster wherein the driver 
application loads user jars on-the-fly ( the user jar is not the spark 
driver/application ) but merely a catalog of transformations. A single spark 
application can load multiple jars in its lifetime with potential of classpath 
conflict if care is not taken by the framework

The driver needs to load the jar, add the jar to the executor & call a 
predefined class.method to trigger the transformation

Each transformation runs in its own spark session inside the same spark 
application

AFAIK, on the executor side, isolated classloader per session is only possible 
when using the spark-connect facilities. Is it possible to do this without 
using spark connect?

Spark connect is the only facility that adds the jar into a sessionUUID 
directory of executor and when an executor runs a job from that session, it 
sets a childfirstclassloader pointing to the sessionUUID directory

 

Thank you


> Isolated classloader per spark session
> --
>
> Key: SPARK-46046
> URL: https://issues.apache.org/jira/browse/SPARK-46046
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Faiz Halde
>Priority: Major
>
> Hello,
> We use spark 3.5.0 and were wondering if the following is achievable using 
> spark-core
> Our use case involves spinning up a spark cluster wherein the driver 
> application loads user jars on-the-fly ( the user jar is not the spark 
> driver/application ) but merely a catalog of transformations. A single spark 
> application can load multiple jars in its lifetime with potential of 
> classpath conflict if care is not taken by the framework
> The driver needs to load the jar, add the jar to the executor & call a 
> predefined class.method to trigger the transformation
> Each transformation runs in its own spark session inside the same spark 
> application
> AFAIK, on the executor side, isolated classloader per session is only 
> possible when using the spark-connect facilities. Is it possible to do this 
> without using spark connect?
> Spark connect is the only facility that adds the jar into a sessionUUID 
> directory of executor and when an executor runs tasks of a job from that 
> session, it sets a ChildFirstClassLoader pointing to the sessionUUID directory
>  
> Thank you






[jira] [Updated] (SPARK-46046) Isolated classloader per spark session

2023-11-21 Thread Faiz Halde (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Faiz Halde updated SPARK-46046:
---
Description: 
Hello,

We use spark 3.5.0 and were wondering if the following is achievable using 
spark-core

Our use case involves spinning up a spark cluster wherein the driver 
application loads user jars on-the-fly ( the user jar is not the spark 
driver/application ) but merely a catalog of transformations. A single spark 
application can load multiple jars in its lifetime with potential of classpath 
conflict if care is not taken by the framework

The driver needs to load the jar, add the jar to the executor & call a 
predefined class.method to trigger the transformation

Each transformation runs in its own spark session inside the same spark 
application

AFAIK, on the executor side, isolated classloader per session is only possible 
when using the spark-connect facilities. Is it possible to do this without 
using spark connect?

Spark connect is the only facility that adds the jar into a sessionUUID 
directory of executor and when an executor runs a job from that session, it 
sets a childfirstclassloader pointing to the sessionUUID directory

 

Thank you

  was:
Hello,

We use spark 3.5.0 and were wondering if the following is achievable using 
spark-core

Our use case involves spinning up a spark cluster wherein the driver 
application loads user jars on-the-fly ( the user jar is not the spark 
driver/application ) but merely a catalog of transformations. A single spark 
application can load multiple jars in its lifetime with potential of classpath 
conflict if care is not taken by the framework

The driver needs to load the jar, add the jar to the executor & calls a 
predefined class.method to trigger the transformation

Each transformation runs in its own spark session inside the same spark 
application

AFAIK, on the executor side, isolated classloader per session is only possible 
when using the spark-connect facilities. Is it possible to do this without 
using spark connect?

Spark connect is the only facility that adds the jar into a sessionUUID 
directory of executor and when an executor runs a job from that session, it 
sets a childfirstclassloader pointing to the sessionUUID directory

 

Thank you


> Isolated classloader per spark session
> --
>
> Key: SPARK-46046
> URL: https://issues.apache.org/jira/browse/SPARK-46046
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Faiz Halde
>Priority: Major
>
> Hello,
> We use spark 3.5.0 and were wondering if the following is achievable using 
> spark-core
> Our use case involves spinning up a spark cluster wherein the driver 
> application loads user jars on-the-fly ( the user jar is not the spark 
> driver/application ) but merely a catalog of transformations. A single spark 
> application can load multiple jars in its lifetime with potential of 
> classpath conflict if care is not taken by the framework
> The driver needs to load the jar, add the jar to the executor & call a 
> predefined class.method to trigger the transformation
> Each transformation runs in its own spark session inside the same spark 
> application
> AFAIK, on the executor side, isolated classloader per session is only 
> possible when using the spark-connect facilities. Is it possible to do this 
> without using spark connect?
> Spark connect is the only facility that adds the jar into a sessionUUID 
> directory of executor and when an executor runs a job from that session, it 
> sets a childfirstclassloader pointing to the sessionUUID directory
>  
> Thank you






[jira] [Created] (SPARK-46046) Isolated classloader per spark session

2023-11-21 Thread Faiz Halde (Jira)
Faiz Halde created SPARK-46046:
--

 Summary: Isolated classloader per spark session
 Key: SPARK-46046
 URL: https://issues.apache.org/jira/browse/SPARK-46046
 Project: Spark
  Issue Type: Question
  Components: Spark Core
Affects Versions: 3.5.0
Reporter: Faiz Halde


Hello,

We use spark 3.5.0 and were wondering if the following is achievable using 
spark-core

Our use case involves spinning up a spark cluster wherein the driver 
application loads user jars on-the-fly ( the user jar is not the spark 
driver/application ) but merely a catalog of transformations. A single spark 
application can load multiple jars in its lifetime with potential of classpath 
conflict if care is not taken by the framework

The driver needs to load the jar, add the jar to the executor & calls a 
predefined class.method to trigger the transformation

Each transformation runs in its own spark session inside the same spark 
application

AFAIK, on the executor side, isolated classloader per session is only possible 
when using the spark-connect facilities. Is it possible to do this without 
using spark connect?

Spark connect is the only facility that adds the jar into a sessionUUID 
directory of executor and when an executor runs a job from that session, it 
sets a childfirstclassloader pointing to the sessionUUID directory

 

Thank you
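
For reference, the driver-side flow described above looks roughly like the sketch below (names are illustrative; ChildFirstClassLoader is hand-rolled, not a Spark class). What we are missing is the equivalent per-session isolation on the executor side without going through Spark Connect:

```

import java.net.{URL, URLClassLoader}

// Child-first classloader: try the user jar before delegating to the parent.
// This is a hand-rolled sketch, not a Spark API.
class ChildFirstClassLoader(urls: Array[URL], parent: ClassLoader)
    extends URLClassLoader(urls, parent) {
  override def loadClass(name: String, resolve: Boolean): Class[_] =
    getClassLoadingLock(name).synchronized {
      Option(findLoadedClass(name))
        .orElse(try Some(findClass(name)) catch { case _: ClassNotFoundException => None })
        .getOrElse(super.loadClass(name, resolve))
    }
}

// Driver side (entry-point class and method names are hypothetical):
//   spark.sparkContext.addJar("/path/to/user-transforms.jar")   // ship jar to executors
//   val cl    = new ChildFirstClassLoader(
//                 Array(new URL("file:/path/to/user-transforms.jar")),
//                 getClass.getClassLoader)
//   val clazz = cl.loadClass("com.example.Transform")
//   clazz.getMethod("run", classOf[org.apache.spark.sql.SparkSession])
//        .invoke(clazz.getDeclaredConstructor().newInstance(), spark)

```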






[jira] [Comment Edited] (SPARK-42471) Distributed ML <> spark connect

2023-11-15 Thread Faiz Halde (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17786292#comment-17786292
 ] 

Faiz Halde edited comment on SPARK-42471 at 11/15/23 10:22 AM:
---

Hello, our use-case requires us to use spark-connect and we have some of our 
jobs that used spark ML ( scala ). Is this Umbrella tracking the work required 
to make spark ml compatible with spark-connect? Because so far we've been 
struggling with this. May I know if this is already done and if there are docs 
on how to make this work?

 

Thanks!


was (Author: JIRAUSER300204):
Hello, our use-case requires us to use spark-connect and we have some of our 
jobs that used spark ML. Is this Umbrella tracking the work required to make 
spark ml compatible with spark-connect? Because so far we've been struggling 
with this. May I know if this is already done and if there are docs on how to 
make this work?

 

Thanks!

> Distributed ML <> spark connect
> ---
>
> Key: SPARK-42471
> URL: https://issues.apache.org/jira/browse/SPARK-42471
> Project: Spark
>  Issue Type: Umbrella
>  Components: Connect, ML
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Weichen Xu
>Priority: Major
>







[jira] [Updated] (SPARK-45598) Delta table 3.0.0 not working with Spark Connect 3.5.0

2023-10-29 Thread Faiz Halde (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Faiz Halde updated SPARK-45598:
---
Description: 
Spark version 3.5.0

Spark Connect version 3.5.0

Delta table 3.0.0

Spark connect server was started using

{{./sbin/start-connect-server.sh --master spark://localhost:7077 \}}
{{  --packages org.apache.spark:spark-connect_2.12:3.5.0,io.delta:delta-spark_2.12:3.0.0 \}}
{{  --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \}}
{{  --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog" \}}
{{  --conf 'spark.jars.repositories=https://oss.sonatype.org/content/repositories/iodelta-1120'}}

{{Connect client depends on}}
*libraryDependencies += "io.delta" %% "delta-spark" % "3.0.0"*
*and the connect libraries*
 

When trying to run a simple job that writes to a delta table

{{val spark = SparkSession.builder().remote("sc://localhost").getOrCreate()}}
{{val data = spark.read.json("profiles.json")}}
{{data.write.format("delta").save("/tmp/delta")}}

 

{{Error log in connect client}}

{{Exception in thread "main" org.apache.spark.SparkException: 
io.grpc.StatusRuntimeException: INTERNAL: Job aborted due to stage failure: 
Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 
1.0 (TID 4) (172.23.128.15 executor 0): java.lang.ClassCastException: cannot 
assign instance of java.lang.invoke.SerializedLambda to field 
org.apache.spark.sql.catalyst.expressions.ScalaUDF.f of type scala.Function1 in 
instance of org.apache.spark.sql.catalyst.expressions.ScalaUDF}}
{{    at 
java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2301)}}
{{    at 
java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1431)}}
{{    at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2437)}}
{{    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)}}
{{    at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)}}
{{    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)}}
{{    at java.io.ObjectInputStream.readArray(ObjectInputStream.java:2119)}}
{{    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1657)}}
{{    at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)}}
{{    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)}}
{{    at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)}}
{{    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)}}
{{    at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)}}
{{    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)}}
{{    at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)}}
{{    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)}}
{{    at java.io.ObjectInputStream.readArray(ObjectInputStream.java:2119)}}
{{    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1657)}}
{{    at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)}}
{{    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)}}
{{    at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)}}
{{    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)}}
{{...}}
{{    at 
org.apache.spark.sql.connect.client.GrpcExceptionConverter$.toThrowable(GrpcExceptionConverter.scala:110)}}
{{    at 
org.apache.spark.sql.connect.client.GrpcExceptionConverter$.convert(GrpcExceptionConverter.scala:41)}}
{{    at 
org.apache.spark.sql.connect.client.GrpcExceptionConverter$$anon$1.hasNext(GrpcExceptionConverter.scala:49)}}
{{    at scala.collection.Iterator.foreach(Iterator.scala:943)}}
{{    at scala.collection.Iterator.foreach$(Iterator.scala:943)}}
{{    at 
org.apache.spark.sql.connect.client.GrpcExceptionConverter$$anon$1.foreach(GrpcExceptionConverter.scala:46)}}
{{    at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)}}
{{    at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)}}
{{    at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)}}
{{    at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)}}
{{    at scala.collection.TraversableOnce.to(TraversableOnce.scala:366)}}
{{    at scala.collection.TraversableOnce.to$(TraversableOnce.scala:364)}}
{{    at 
org.apache.spark.sql.connect.client.GrpcExceptionConverter$$anon$1.to(GrpcExceptionConverter.scala:46)}}
{{    at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:358)}}
{{    at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:358)}}
{{    at 
org.apache.spark.sql.connect.client.GrpcExceptionConverter$$anon$1.toBuffer(GrpcExceptionConverter.scala:46)}}
{{    at 

[jira] [Updated] (SPARK-45598) Delta table 3.0.0 not working with Spark Connect 3.5.0

2023-10-29 Thread Faiz Halde (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Faiz Halde updated SPARK-45598:
---
Description: 
Spark version 3.5.0

Spark Connect version 3.5.0

Delta table 3.0.0

Spark connect server was started using

*{{./sbin/start-connect-server.sh --master spark://localhost:7077 --packages 
org.apache.spark:spark-connect_2.12:3.5.0,io.delta:delta-spark_2.12:3.0.0}}* 
*{{--conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf 
"spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
 --conf 
'spark.jars.repositories=[https://oss.sonatype.org/content/repositories/iodelta-1120']}}*

{{Connect client depends on}}
*libraryDependencies += "io.delta" %% "delta-spark" % "3.0.0"*
*and the connect libraries*
 

When trying to run a simple job that writes to a delta table

{{val spark = SparkSession.builder().remote("sc://localhost").getOrCreate()}}
{{val data = spark.read.json("profiles.json")}}
{{data.write.format("delta").save("/tmp/delta")}}

 

{{Error log in connect client}}

{{Exception in thread "main" org.apache.spark.SparkException: 
io.grpc.StatusRuntimeException: INTERNAL: Job aborted due to stage failure: 
Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 
1.0 (TID 4) (172.23.128.15 executor 0): java.lang.ClassCastException: cannot 
assign instance of java.lang.invoke.SerializedLambda to field 
org.apache.spark.sql.catalyst.expressions.ScalaUDF.f of type scala.Function1 in 
instance of org.apache.spark.sql.catalyst.expressions.ScalaUDF}}
{{    at 
java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2301)}}
{{    at 
java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1431)}}
{{    at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2437)}}
{{    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)}}
{{    at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)}}
{{    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)}}
{{    at java.io.ObjectInputStream.readArray(ObjectInputStream.java:2119)}}
{{    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1657)}}
{{    at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)}}
{{    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)}}
{{    at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)}}
{{    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)}}
{{    at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)}}
{{    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)}}
{{    at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)}}
{{    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)}}
{{    at java.io.ObjectInputStream.readArray(ObjectInputStream.java:2119)}}
{{    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1657)}}
{{    at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)}}
{{    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)}}
{{    at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)}}
{{    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)}}
{{...}}
{{    at 
org.apache.spark.sql.connect.client.GrpcExceptionConverter$.toThrowable(GrpcExceptionConverter.scala:110)}}
{{    at 
org.apache.spark.sql.connect.client.GrpcExceptionConverter$.convert(GrpcExceptionConverter.scala:41)}}
{{    at 
org.apache.spark.sql.connect.client.GrpcExceptionConverter$$anon$1.hasNext(GrpcExceptionConverter.scala:49)}}
{{    at scala.collection.Iterator.foreach(Iterator.scala:943)}}
{{    at scala.collection.Iterator.foreach$(Iterator.scala:943)}}
{{    at 
org.apache.spark.sql.connect.client.GrpcExceptionConverter$$anon$1.foreach(GrpcExceptionConverter.scala:46)}}
{{    at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)}}
{{    at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)}}
{{    at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)}}
{{    at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)}}
{{    at scala.collection.TraversableOnce.to(TraversableOnce.scala:366)}}
{{    at scala.collection.TraversableOnce.to$(TraversableOnce.scala:364)}}
{{    at 
org.apache.spark.sql.connect.client.GrpcExceptionConverter$$anon$1.to(GrpcExceptionConverter.scala:46)}}
{{    at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:358)}}
{{    at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:358)}}
{{    at 
org.apache.spark.sql.connect.client.GrpcExceptionConverter$$anon$1.toBuffer(GrpcExceptionConverter.scala:46)}}
{{    at 

[jira] [Updated] (SPARK-45598) Delta table 3.0.0 not working with Spark Connect 3.5.0

2023-10-29 Thread Faiz Halde (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Faiz Halde updated SPARK-45598:
---
Description: 
Spark version 3.5.0

Spark Connect version 3.5.0

Delta table 3.0.0

Spark connect server was started using

*{{./sbin/start-connect-server.sh -master spark://localhost:7077 --packages 
org.apache.spark:spark-connect_2.12:3.5.0,io.delta:delta-spark_2.12:3.0.0}}* 
*{{-conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf 
"spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
 --conf 
'spark.jars.repositories=[https://oss.sonatype.org/content/repositories/iodelta-1120']}}*

{{Connect client depends on}}
*libraryDependencies += "io.delta" %% "delta-spark" % "3.0.0"*
*and the connect libraries*
 

When trying to run a simple job that writes to a delta table

{{val spark = SparkSession.builder().remote("sc://localhost").getOrCreate()}}
{{val data = spark.read.json("profiles.json")}}
{{data.write.format("delta").save("/tmp/delta")}}

 

{{Error log in connect client}}

{{Exception in thread "main" org.apache.spark.SparkException: 
io.grpc.StatusRuntimeException: INTERNAL: Job aborted due to stage failure: 
Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 
1.0 (TID 4) (172.23.128.15 executor 0): java.lang.ClassCastException: cannot 
assign instance of java.lang.invoke.SerializedLambda to field 
org.apache.spark.sql.catalyst.expressions.ScalaUDF.f of type scala.Function1 in 
instance of org.apache.spark.sql.catalyst.expressions.ScalaUDF}}
{{    at 
java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2301)}}
{{    at 
java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1431)}}
{{    at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2437)}}
{{    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)}}
{{    at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)}}
{{    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)}}
{{    at java.io.ObjectInputStream.readArray(ObjectInputStream.java:2119)}}
{{    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1657)}}
{{    at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)}}
{{    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)}}
{{    at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)}}
{{    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)}}
{{    at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)}}
{{    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)}}
{{    at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)}}
{{    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)}}
{{    at java.io.ObjectInputStream.readArray(ObjectInputStream.java:2119)}}
{{    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1657)}}
{{    at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)}}
{{    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)}}
{{    at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)}}
{{    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)}}
{{...}}
{{    at 
org.apache.spark.sql.connect.client.GrpcExceptionConverter$.toThrowable(GrpcExceptionConverter.scala:110)}}
{{    at 
org.apache.spark.sql.connect.client.GrpcExceptionConverter$.convert(GrpcExceptionConverter.scala:41)}}
{{    at 
org.apache.spark.sql.connect.client.GrpcExceptionConverter$$anon$1.hasNext(GrpcExceptionConverter.scala:49)}}
{{    at scala.collection.Iterator.foreach(Iterator.scala:943)}}
{{    at scala.collection.Iterator.foreach$(Iterator.scala:943)}}
{{    at 
org.apache.spark.sql.connect.client.GrpcExceptionConverter$$anon$1.foreach(GrpcExceptionConverter.scala:46)}}
{{    at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)}}
{{    at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)}}
{{    at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)}}
{{    at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)}}
{{    at scala.collection.TraversableOnce.to(TraversableOnce.scala:366)}}
{{    at scala.collection.TraversableOnce.to$(TraversableOnce.scala:364)}}
{{    at 
org.apache.spark.sql.connect.client.GrpcExceptionConverter$$anon$1.to(GrpcExceptionConverter.scala:46)}}
{{    at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:358)}}
{{    at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:358)}}
{{    at 
org.apache.spark.sql.connect.client.GrpcExceptionConverter$$anon$1.toBuffer(GrpcExceptionConverter.scala:46)}}
{{    at 

[jira] [Commented] (SPARK-45598) Delta table 3.0.0 not working with Spark Connect 3.5.0

2023-10-26 Thread Faiz Halde (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17779843#comment-17779843
 ] 

Faiz Halde commented on SPARK-45598:


Hi, do we have any updates here? Happy to help

> Delta table 3.0.0 not working with Spark Connect 3.5.0
> --
>
> Key: SPARK-45598
> URL: https://issues.apache.org/jira/browse/SPARK-45598
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Faiz Halde
>Priority: Major
>
> Spark version 3.5.0
> Spark Connect version 3.5.0
> Delta table 3.0-rc2
> Spark connect server was started using
> *{{./sbin/start-connect-server.sh --master spark://localhost:7077 --packages 
> org.apache.spark:spark-connect_2.12:3.5.0,io.delta:delta-spark_2.12:3.0.0rc2 
> --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf 
> "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
>  --conf 
> 'spark.jars.repositories=[https://oss.sonatype.org/content/repositories/iodelta-1120']}}*
> {{Connect client depends on}}
> *libraryDependencies += "io.delta" %% "delta-spark" % "3.0.0rc2"*
> *and the connect libraries*
>  
> When trying to run a simple job that writes to a delta table
> {{val spark = SparkSession.builder().remote("sc://localhost").getOrCreate()}}
> {{val data = spark.read.json("profiles.json")}}
> {{data.write.format("delta").save("/tmp/delta")}}
>  
> {{Error log in connect client}}
> {{Exception in thread "main" org.apache.spark.SparkException: 
> io.grpc.StatusRuntimeException: INTERNAL: Job aborted due to stage failure: 
> Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in 
> stage 1.0 (TID 4) (172.23.128.15 executor 0): java.lang.ClassCastException: 
> cannot assign instance of java.lang.invoke.SerializedLambda to field 
> org.apache.spark.sql.catalyst.expressions.ScalaUDF.f of type scala.Function1 
> in instance of org.apache.spark.sql.catalyst.expressions.ScalaUDF}}
> {{    at 
> java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2301)}}
> {{    at 
> java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1431)}}
> {{    at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2437)}}
> {{    at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)}}
> {{    at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)}}
> {{    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)}}
> {{    at java.io.ObjectInputStream.readArray(ObjectInputStream.java:2119)}}
> {{    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1657)}}
> {{    at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)}}
> {{    at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)}}
> {{    at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)}}
> {{    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)}}
> {{    at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)}}
> {{    at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)}}
> {{    at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)}}
> {{    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)}}
> {{    at java.io.ObjectInputStream.readArray(ObjectInputStream.java:2119)}}
> {{    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1657)}}
> {{    at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)}}
> {{    at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)}}
> {{    at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)}}
> {{    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)}}
> {{...}}
> {{    at 
> org.apache.spark.sql.connect.client.GrpcExceptionConverter$.toThrowable(GrpcExceptionConverter.scala:110)}}
> {{    at 
> org.apache.spark.sql.connect.client.GrpcExceptionConverter$.convert(GrpcExceptionConverter.scala:41)}}
> {{    at 
> org.apache.spark.sql.connect.client.GrpcExceptionConverter$$anon$1.hasNext(GrpcExceptionConverter.scala:49)}}
> {{    at scala.collection.Iterator.foreach(Iterator.scala:943)}}
> {{    at scala.collection.Iterator.foreach$(Iterator.scala:943)}}
> {{    at 
> org.apache.spark.sql.connect.client.GrpcExceptionConverter$$anon$1.foreach(GrpcExceptionConverter.scala:46)}}
> {{    at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)}}
> {{    at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)}}
> {{    at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)}}
> {{    at 
> 

[jira] [Comment Edited] (SPARK-45598) Delta table 3.0.0 not working with Spark Connect 3.5.0

2023-10-19 Thread Faiz Halde (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1246#comment-1246
 ] 

Faiz Halde edited comment on SPARK-45598 at 10/19/23 4:04 PM:
--

Hi [~sdaberdaku] , corrected the title. I tested it with 3.0.0 delta. What I 
meant was, delta table does not work with {*}spark connect{*}. It does work 
with vanilla spark 3.5.0 otherwise


was (Author: JIRAUSER300204):
Hi [~sdaberdaku] , corrected the title. I tested it with 3.0.0 delta. What I 
meant was, delta table does not work with {*}spark connect{*}. It does work 
with spark 3.5.0 otherwise

> Delta table 3.0.0 not working with Spark Connect 3.5.0
> --
>
> Key: SPARK-45598
> URL: https://issues.apache.org/jira/browse/SPARK-45598
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Faiz Halde
>Priority: Major
>
> Spark version 3.5.0
> Spark Connect version 3.5.0
> Delta table 3.0-rc2
> Spark connect server was started using
> *{{./sbin/start-connect-server.sh --master spark://localhost:7077 --packages 
> org.apache.spark:spark-connect_2.12:3.5.0,io.delta:delta-spark_2.12:3.0.0rc2 
> --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf 
> "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
>  --conf 
> 'spark.jars.repositories=[https://oss.sonatype.org/content/repositories/iodelta-1120']}}*
> {{Connect client depends on}}
> *libraryDependencies += "io.delta" %% "delta-spark" % "3.0.0rc2"*
> *and the connect libraries*
>  
> When trying to run a simple job that writes to a delta table
> {{val spark = SparkSession.builder().remote("sc://localhost").getOrCreate()}}
> {{val data = spark.read.json("profiles.json")}}
> {{data.write.format("delta").save("/tmp/delta")}}
>  
> {{Error log in connect client}}
> {{Exception in thread "main" org.apache.spark.SparkException: 
> io.grpc.StatusRuntimeException: INTERNAL: Job aborted due to stage failure: 
> Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in 
> stage 1.0 (TID 4) (172.23.128.15 executor 0): java.lang.ClassCastException: 
> cannot assign instance of java.lang.invoke.SerializedLambda to field 
> org.apache.spark.sql.catalyst.expressions.ScalaUDF.f of type scala.Function1 
> in instance of org.apache.spark.sql.catalyst.expressions.ScalaUDF}}
> {{    at 
> java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2301)}}
> {{    at 
> java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1431)}}
> {{    at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2437)}}
> {{    at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)}}
> {{    at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)}}
> {{    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)}}
> {{    at java.io.ObjectInputStream.readArray(ObjectInputStream.java:2119)}}
> {{    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1657)}}
> {{    at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)}}
> {{    at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)}}
> {{    at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)}}
> {{    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)}}
> {{    at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)}}
> {{    at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)}}
> {{    at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)}}
> {{    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)}}
> {{    at java.io.ObjectInputStream.readArray(ObjectInputStream.java:2119)}}
> {{    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1657)}}
> {{    at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)}}
> {{    at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)}}
> {{    at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)}}
> {{    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)}}
> {{...}}
> {{    at 
> org.apache.spark.sql.connect.client.GrpcExceptionConverter$.toThrowable(GrpcExceptionConverter.scala:110)}}
> {{    at 
> org.apache.spark.sql.connect.client.GrpcExceptionConverter$.convert(GrpcExceptionConverter.scala:41)}}
> {{    at 
> org.apache.spark.sql.connect.client.GrpcExceptionConverter$$anon$1.hasNext(GrpcExceptionConverter.scala:49)}}
> {{    at scala.collection.Iterator.foreach(Iterator.scala:943)}}
> {{    at scala.collection.Iterator.foreach$(Iterator.scala:943)}}
> {{    at 
> 

[jira] [Commented] (SPARK-45598) Delta table 3.0.0 not working with Spark Connect 3.5.0

2023-10-19 Thread Faiz Halde (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1246#comment-1246
 ] 

Faiz Halde commented on SPARK-45598:


Hi [~sdaberdaku] , corrected the title. I tested it with 3.0.0 delta. What I 
meant was, delta table does not work with {*}spark connect{*}. It does work 
with spark 3.5.0 otherwise

> Delta table 3.0.0 not working with Spark Connect 3.5.0
> --
>
> Key: SPARK-45598
> URL: https://issues.apache.org/jira/browse/SPARK-45598
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Faiz Halde
>Priority: Major
>
> Spark version 3.5.0
> Spark Connect version 3.5.0
> Delta table 3.0-rc2
> Spark connect server was started using
> *{{./sbin/start-connect-server.sh --master spark://localhost:7077 --packages 
> org.apache.spark:spark-connect_2.12:3.5.0,io.delta:delta-spark_2.12:3.0.0rc2 
> --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf 
> "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
>  --conf 
> 'spark.jars.repositories=[https://oss.sonatype.org/content/repositories/iodelta-1120']}}*
> {{Connect client depends on}}
> *libraryDependencies += "io.delta" %% "delta-spark" % "3.0.0rc2"*
> *and the connect libraries*
>  
> When trying to run a simple job that writes to a delta table
> {{val spark = SparkSession.builder().remote("sc://localhost").getOrCreate()}}
> {{val data = spark.read.json("profiles.json")}}
> {{data.write.format("delta").save("/tmp/delta")}}
>  
> {{Error log in connect client}}
> {{Exception in thread "main" org.apache.spark.SparkException: 
> io.grpc.StatusRuntimeException: INTERNAL: Job aborted due to stage failure: 
> Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in 
> stage 1.0 (TID 4) (172.23.128.15 executor 0): java.lang.ClassCastException: 
> cannot assign instance of java.lang.invoke.SerializedLambda to field 
> org.apache.spark.sql.catalyst.expressions.ScalaUDF.f of type scala.Function1 
> in instance of org.apache.spark.sql.catalyst.expressions.ScalaUDF}}
> {{    at 
> java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2301)}}
> {{    at 
> java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1431)}}
> {{    at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2437)}}
> {{    at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)}}
> {{    at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)}}
> {{    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)}}
> {{    at java.io.ObjectInputStream.readArray(ObjectInputStream.java:2119)}}
> {{    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1657)}}
> {{    at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)}}
> {{    at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)}}
> {{    at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)}}
> {{    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)}}
> {{    at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)}}
> {{    at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)}}
> {{    at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)}}
> {{    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)}}
> {{    at java.io.ObjectInputStream.readArray(ObjectInputStream.java:2119)}}
> {{    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1657)}}
> {{    at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)}}
> {{    at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)}}
> {{    at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)}}
> {{    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)}}
> {{...}}
> {{    at 
> org.apache.spark.sql.connect.client.GrpcExceptionConverter$.toThrowable(GrpcExceptionConverter.scala:110)}}
> {{    at 
> org.apache.spark.sql.connect.client.GrpcExceptionConverter$.convert(GrpcExceptionConverter.scala:41)}}
> {{    at 
> org.apache.spark.sql.connect.client.GrpcExceptionConverter$$anon$1.hasNext(GrpcExceptionConverter.scala:49)}}
> {{    at scala.collection.Iterator.foreach(Iterator.scala:943)}}
> {{    at scala.collection.Iterator.foreach$(Iterator.scala:943)}}
> {{    at 
> org.apache.spark.sql.connect.client.GrpcExceptionConverter$$anon$1.foreach(GrpcExceptionConverter.scala:46)}}
> {{    at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)}}
> {{    at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)}}
> {{    at 

[jira] [Updated] (SPARK-45598) Delta table 3.0.0 not working with Spark Connect 3.5.0

2023-10-19 Thread Faiz Halde (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Faiz Halde updated SPARK-45598:
---
Summary: Delta table 3.0.0 not working with Spark Connect 3.5.0  (was: 
Delta table 3.0-rc2 not working with Spark Connect 3.5.0)

> Delta table 3.0.0 not working with Spark Connect 3.5.0
> --
>
> Key: SPARK-45598
> URL: https://issues.apache.org/jira/browse/SPARK-45598
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Faiz Halde
>Priority: Major
>
> Spark version 3.5.0
> Spark Connect version 3.5.0
> Delta table 3.0-rc2
> Spark connect server was started using
> *{{./sbin/start-connect-server.sh --master spark://localhost:7077 --packages 
> org.apache.spark:spark-connect_2.12:3.5.0,io.delta:delta-spark_2.12:3.0.0rc2 
> --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf 
> "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
>  --conf 
> 'spark.jars.repositories=[https://oss.sonatype.org/content/repositories/iodelta-1120']}}*
> {{Connect client depends on}}
> *libraryDependencies += "io.delta" %% "delta-spark" % "3.0.0rc2"*
> *and the connect libraries*
>  
> When trying to run a simple job that writes to a delta table
> {{val spark = SparkSession.builder().remote("sc://localhost").getOrCreate()}}
> {{val data = spark.read.json("profiles.json")}}
> {{data.write.format("delta").save("/tmp/delta")}}
>  
> {{Error log in connect client}}
> {{Exception in thread "main" org.apache.spark.SparkException: 
> io.grpc.StatusRuntimeException: INTERNAL: Job aborted due to stage failure: 
> Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in 
> stage 1.0 (TID 4) (172.23.128.15 executor 0): java.lang.ClassCastException: 
> cannot assign instance of java.lang.invoke.SerializedLambda to field 
> org.apache.spark.sql.catalyst.expressions.ScalaUDF.f of type scala.Function1 
> in instance of org.apache.spark.sql.catalyst.expressions.ScalaUDF}}
> {{    at 
> java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2301)}}
> {{    at 
> java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1431)}}
> {{    at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2437)}}
> {{    at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)}}
> {{    at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)}}
> {{    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)}}
> {{    at java.io.ObjectInputStream.readArray(ObjectInputStream.java:2119)}}
> {{    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1657)}}
> {{    at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)}}
> {{    at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)}}
> {{    at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)}}
> {{    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)}}
> {{    at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)}}
> {{    at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)}}
> {{    at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)}}
> {{    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)}}
> {{    at java.io.ObjectInputStream.readArray(ObjectInputStream.java:2119)}}
> {{    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1657)}}
> {{    at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)}}
> {{    at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)}}
> {{    at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)}}
> {{    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)}}
> {{...}}
> {{    at 
> org.apache.spark.sql.connect.client.GrpcExceptionConverter$.toThrowable(GrpcExceptionConverter.scala:110)}}
> {{    at 
> org.apache.spark.sql.connect.client.GrpcExceptionConverter$.convert(GrpcExceptionConverter.scala:41)}}
> {{    at 
> org.apache.spark.sql.connect.client.GrpcExceptionConverter$$anon$1.hasNext(GrpcExceptionConverter.scala:49)}}
> {{    at scala.collection.Iterator.foreach(Iterator.scala:943)}}
> {{    at scala.collection.Iterator.foreach$(Iterator.scala:943)}}
> {{    at 
> org.apache.spark.sql.connect.client.GrpcExceptionConverter$$anon$1.foreach(GrpcExceptionConverter.scala:46)}}
> {{    at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)}}
> {{    at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)}}
> {{    at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)}}
> {{    at 

[jira] [Updated] (SPARK-45598) Delta table 3.0-rc2 not working with Spark Connect 3.5.0

2023-10-18 Thread Faiz Halde (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Faiz Halde updated SPARK-45598:
---
Description: 
Spark version 3.5.0

Spark Connect version 3.5.0

Delta table 3.0-rc2

Spark connect server was started using

*{{./sbin/start-connect-server.sh --master spark://localhost:7077 --packages 
org.apache.spark:spark-connect_2.12:3.5.0,io.delta:delta-spark_2.12:3.0.0rc2 
--conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf 
"spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
 --conf 
'spark.jars.repositories=https://oss.sonatype.org/content/repositories/iodelta-1120'}}*

{{Connect client depends on}}
*libraryDependencies += "io.delta" %% "delta-spark" % "3.0.0rc2"*
*and the connect libraries*
 

When trying to run a simple job that writes to a delta table

{{val spark = SparkSession.builder().remote("sc://localhost").getOrCreate()}}
{{val data = spark.read.json("profiles.json")}}
{{data.write.format("delta").save("/tmp/delta")}}

 

{{Error log in connect client}}

{{Exception in thread "main" org.apache.spark.SparkException: 
io.grpc.StatusRuntimeException: INTERNAL: Job aborted due to stage failure: 
Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 
1.0 (TID 4) (172.23.128.15 executor 0): java.lang.ClassCastException: cannot 
assign instance of java.lang.invoke.SerializedLambda to field 
org.apache.spark.sql.catalyst.expressions.ScalaUDF.f of type scala.Function1 in 
instance of org.apache.spark.sql.catalyst.expressions.ScalaUDF}}
{{    at 
java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2301)}}
{{    at 
java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1431)}}
{{    at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2437)}}
{{    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)}}
{{    at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)}}
{{    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)}}
{{    at java.io.ObjectInputStream.readArray(ObjectInputStream.java:2119)}}
{{    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1657)}}
{{    at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)}}
{{    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)}}
{{    at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)}}
{{    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)}}
{{    at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)}}
{{    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)}}
{{    at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)}}
{{    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)}}
{{    at java.io.ObjectInputStream.readArray(ObjectInputStream.java:2119)}}
{{    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1657)}}
{{    at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)}}
{{    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)}}
{{    at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)}}
{{    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)}}
{{...}}
{{    at 
org.apache.spark.sql.connect.client.GrpcExceptionConverter$.toThrowable(GrpcExceptionConverter.scala:110)}}
{{    at 
org.apache.spark.sql.connect.client.GrpcExceptionConverter$.convert(GrpcExceptionConverter.scala:41)}}
{{    at 
org.apache.spark.sql.connect.client.GrpcExceptionConverter$$anon$1.hasNext(GrpcExceptionConverter.scala:49)}}
{{    at scala.collection.Iterator.foreach(Iterator.scala:943)}}
{{    at scala.collection.Iterator.foreach$(Iterator.scala:943)}}
{{    at 
org.apache.spark.sql.connect.client.GrpcExceptionConverter$$anon$1.foreach(GrpcExceptionConverter.scala:46)}}
{{    at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)}}
{{    at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)}}
{{    at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)}}
{{    at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)}}
{{    at scala.collection.TraversableOnce.to(TraversableOnce.scala:366)}}
{{    at scala.collection.TraversableOnce.to$(TraversableOnce.scala:364)}}
{{    at 
org.apache.spark.sql.connect.client.GrpcExceptionConverter$$anon$1.to(GrpcExceptionConverter.scala:46)}}
{{    at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:358)}}
{{    at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:358)}}
{{    at 
org.apache.spark.sql.connect.client.GrpcExceptionConverter$$anon$1.toBuffer(GrpcExceptionConverter.scala:46)}}
{{    at 

[jira] [Updated] (SPARK-44526) Porting k8s PVC reuse logic to spark standalone

2023-10-18 Thread Faiz Halde (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Faiz Halde updated SPARK-44526:
---
Affects Version/s: 3.4.1
   (was: 3.5.0)

> Porting k8s PVC reuse logic to spark standalone
> ---
>
> Key: SPARK-44526
> URL: https://issues.apache.org/jira/browse/SPARK-44526
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle, Spark Core
>Affects Versions: 3.4.1
>Reporter: Faiz Halde
>Priority: Major
>
> Hi,
> This ticket is meant to understand the work that would be involved in porting 
> the k8s PVC reuse feature onto the spark standalone cluster manager which 
> reuses the shuffle files present locally in the disk
> We are heavy users of spot instances, and spot terminations impact our 
> long-running jobs
> The logic in `KubernetesLocalDiskShuffleExecutorComponents` itself is not 
> that large. However, when I tried the same approach with 
> `LocalDiskShuffleExecutorComponents`, it was not successful, which suggests 
> there is more to recovering shuffle files than that component alone
> I'd like to understand what work would be involved here. We'll be more than 
> happy to contribute
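
For reference, roughly how the existing Kubernetes-side recovery is switched on today 
(config and class names believed correct for Spark 3.4/3.5 but worth verifying against 
your version's docs). This is the feature the ticket proposes porting; standalone mode 
has no equivalent yet:

```scala
import org.apache.spark.sql.SparkSession

// Sketch of the k8s shuffle-recovery settings referenced above: executors scan
// reused PVC-backed local dirs for shuffle files written by previous executors.
val spark = SparkSession.builder()
  .config("spark.shuffle.sort.io.plugin.class",
    "org.apache.spark.shuffle.KubernetesLocalDiskShuffleDataIO")
  .config("spark.kubernetes.driver.reusePersistentVolumeClaim", "true")
  .config("spark.kubernetes.driver.ownPersistentVolumeClaim", "true")
  .getOrCreate()
```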



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44526) Porting k8s PVC reuse logic to spark standalone

2023-10-18 Thread Faiz Halde (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Faiz Halde updated SPARK-44526:
---
Affects Version/s: 3.5.0
   (was: 3.4.1)

> Porting k8s PVC reuse logic to spark standalone
> ---
>
> Key: SPARK-44526
> URL: https://issues.apache.org/jira/browse/SPARK-44526
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle, Spark Core
>Affects Versions: 3.5.0
>Reporter: Faiz Halde
>Priority: Major
>
> Hi,
> This ticket is meant to understand the work that would be involved in porting 
> the k8s PVC reuse feature onto the spark standalone cluster manager which 
> reuses the shuffle files present locally in the disk
> We are heavy users of spot instances, and spot terminations impact our 
> long-running jobs
> The logic in `KubernetesLocalDiskShuffleExecutorComponents` itself is not 
> that large. However, when I tried the same approach with 
> `LocalDiskShuffleExecutorComponents`, it was not successful, which suggests 
> there is more to recovering shuffle files than that component alone
> I'd like to understand what work would be involved here. We'll be more than 
> happy to contribute



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45598) Delta table 3.0-rc2 not working with Spark Connect 3.5.0

2023-10-18 Thread Faiz Halde (Jira)
Faiz Halde created SPARK-45598:
--

 Summary: Delta table 3.0-rc2 not working with Spark Connect 3.5.0
 Key: SPARK-45598
 URL: https://issues.apache.org/jira/browse/SPARK-45598
 Project: Spark
  Issue Type: Bug
  Components: Connect
Affects Versions: 3.5.0
Reporter: Faiz Halde


Spark version 3.5.0

Spark Connect version 3.5.0

Delta table 3.0-rc2

When trying to run a simple job that writes to a delta table

{{val spark = SparkSession.builder().remote("sc://localhost").getOrCreate()}}
{{val data = spark.read.json("profiles.json")}}
{{data.write.format("delta").save("/tmp/delta")}}

 

{{Error log in connect client}}

{{Exception in thread "main" org.apache.spark.SparkException: 
io.grpc.StatusRuntimeException: INTERNAL: Job aborted due to stage failure: 
Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 
1.0 (TID 4) (172.23.128.15 executor 0): java.lang.ClassCastException: cannot 
assign instance of java.lang.invoke.SerializedLambda to field 
org.apache.spark.sql.catalyst.expressions.ScalaUDF.f of type scala.Function1 in 
instance of org.apache.spark.sql.catalyst.expressions.ScalaUDF}}
{{    at 
java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2301)}}
{{    at 
java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1431)}}
{{    at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2437)}}
{{    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)}}
{{    at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)}}
{{    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)}}
{{    at java.io.ObjectInputStream.readArray(ObjectInputStream.java:2119)}}
{{    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1657)}}
{{    at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)}}
{{    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)}}
{{    at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)}}
{{    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)}}
{{    at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)}}
{{    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)}}
{{    at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)}}
{{    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)}}
{{    at java.io.ObjectInputStream.readArray(ObjectInputStream.java:2119)}}
{{    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1657)}}
{{    at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)}}
{{    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)}}
{{    at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)}}
{{    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)}}
{{...}}
{{    at 
org.apache.spark.sql.connect.client.GrpcExceptionConverter$.toThrowable(GrpcExceptionConverter.scala:110)}}
{{    at 
org.apache.spark.sql.connect.client.GrpcExceptionConverter$.convert(GrpcExceptionConverter.scala:41)}}
{{    at 
org.apache.spark.sql.connect.client.GrpcExceptionConverter$$anon$1.hasNext(GrpcExceptionConverter.scala:49)}}
{{    at scala.collection.Iterator.foreach(Iterator.scala:943)}}
{{    at scala.collection.Iterator.foreach$(Iterator.scala:943)}}
{{    at 
org.apache.spark.sql.connect.client.GrpcExceptionConverter$$anon$1.foreach(GrpcExceptionConverter.scala:46)}}
{{    at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)}}
{{    at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)}}
{{    at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)}}
{{    at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)}}
{{    at scala.collection.TraversableOnce.to(TraversableOnce.scala:366)}}
{{    at scala.collection.TraversableOnce.to$(TraversableOnce.scala:364)}}
{{    at 
org.apache.spark.sql.connect.client.GrpcExceptionConverter$$anon$1.to(GrpcExceptionConverter.scala:46)}}
{{    at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:358)}}
{{    at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:358)}}
{{    at 
org.apache.spark.sql.connect.client.GrpcExceptionConverter$$anon$1.toBuffer(GrpcExceptionConverter.scala:46)}}
{{    at org.apache.spark.sql.SparkSession.execute(SparkSession.scala:554)}}
{{    at 
org.apache.spark.sql.DataFrameWriter.executeWriteOperation(DataFrameWriter.scala:257)}}
{{    at 
org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:221)}}
{{    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:210)}}
{{    at Main$.main(Main.scala:11)}}
{{    at Main.main(Main.scala)}}

 

{{Error log in spark connect 

[jira] [Comment Edited] (SPARK-45255) Spark connect client failing with java.lang.NoClassDefFoundError

2023-09-22 Thread Faiz Halde (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17768040#comment-17768040
 ] 

Faiz Halde edited comment on SPARK-45255 at 9/22/23 2:31 PM:
-

to get past the error
org/sparkproject/connect/client/com/google/common/cache/CacheLoader
even after adding guava library, you need to copy their shading rules

```

    (assembly / assemblyShadeRules) := Seq(
      ShadeRule.rename("io.grpc.**" -> 
"org.sparkproject.connect.client.io.grpc.@1").inAll,
      ShadeRule.rename("com.google.**" -> 
"org.sparkproject.connect.client.com.google.@1").inAll,
      ShadeRule.rename("io.netty.**" -> 
"org.sparkproject.connect.client.io.netty.@1").inAll,
      ShadeRule.rename("org.checkerframework.**" -> 
"org.sparkproject.connect.client.org.checkerframework.@1").inAll,
      ShadeRule.rename("javax.annotation.**" -> 
"org.sparkproject.connect.client.javax.annotation.@1").inAll,
      ShadeRule.rename("io.perfmark.**" -> 
"org.sparkproject.connect.client.io.perfmark.@1").inAll,
      ShadeRule.rename("org.codehaus.**" -> 
"org.sparkproject.connect.client.org.codehaus.@1").inAll,
      ShadeRule.rename("android.annotation.**" -> 
"org.sparkproject.connect.client.android.annotation.@1").inAll
    ),

```
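
A fuller sketch of where these rules sit in an sbt-assembly build. The plugin and 
dependency versions below are illustrative assumptions, and only a subset of the rules 
listed above is repeated:

```scala
// project/plugins.sbt (version is an assumption; any recent sbt-assembly should do)
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "2.1.1")

// build.sbt: relocate dependencies the same way the Spark Connect client shades them,
// so references such as org.sparkproject.connect.client.com.google.* resolve at runtime
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-connect-client-jvm" % "3.5.0",
  "com.google.guava"  % "guava"                    % "32.1.2-jre" // assumption: any recent Guava
)

assembly / assemblyShadeRules := Seq(
  ShadeRule.rename("com.google.**" -> "org.sparkproject.connect.client.com.google.@1").inAll,
  ShadeRule.rename("io.grpc.**"    -> "org.sparkproject.connect.client.io.grpc.@1").inAll,
  ShadeRule.rename("io.netty.**"   -> "org.sparkproject.connect.client.io.netty.@1").inAll
  // ...plus the remaining rules listed in the comment above
)
```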


was (Author: JIRAUSER300204):
to get pas the error
org/sparkproject/connect/client/com/google/common/cache/CacheLoader
even after adding guava library, you need to copy their shading rules

```

    (assembly / assemblyShadeRules) := Seq(
      ShadeRule.rename("io.grpc.**" -> 
"org.sparkproject.connect.client.io.grpc.@1").inAll,
      ShadeRule.rename("com.google.**" -> 
"org.sparkproject.connect.client.com.google.@1").inAll,
      ShadeRule.rename("io.netty.**" -> 
"org.sparkproject.connect.client.io.netty.@1").inAll,
      ShadeRule.rename("org.checkerframework.**" -> 
"org.sparkproject.connect.client.org.checkerframework.@1").inAll,
      ShadeRule.rename("javax.annotation.**" -> 
"org.sparkproject.connect.client.javax.annotation.@1").inAll,
      ShadeRule.rename("io.perfmark.**" -> 
"org.sparkproject.connect.client.io.perfmark.@1").inAll,
      ShadeRule.rename("org.codehaus.**" -> 
"org.sparkproject.connect.client.org.codehaus.@1").inAll,
      ShadeRule.rename("android.annotation.**" -> 
"org.sparkproject.connect.client.android.annotation.@1").inAll
    ),

```

> Spark connect client failing with java.lang.NoClassDefFoundError
> 
>
> Key: SPARK-45255
> URL: https://issues.apache.org/jira/browse/SPARK-45255
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Faiz Halde
>Priority: Major
>
> java 1.8, sbt 1.9, scala 2.12
>  
> I have a very simple repo with the following dependency in `build.sbt`
> ```
> {{libraryDependencies ++= Seq("org.apache.spark" %% 
> "spark-connect-client-jvm" % "3.5.0")}}
> ```
> A simple application
> ```
> {{object Main extends App {}}
> {{   val s = SparkSession.builder().remote("sc://localhost").getOrCreate()}}
> {{   s.read.json("/tmp/input.json").repartition(10).show(false)}}
> {{}}}
> ```
> But when I run it, I get the following error
>  
> ```
> {{Exception in thread "main" java.lang.NoClassDefFoundError: 
> org/sparkproject/connect/client/com/google/common/cache/CacheLoader}}
> {{    at Main$.delayedEndpoint$Main$1(Main.scala:4)}}
> {{    at Main$delayedInit$body.apply(Main.scala:3)}}
> {{    at scala.Function0.apply$mcV$sp(Function0.scala:39)}}
> {{    at scala.Function0.apply$mcV$sp$(Function0.scala:39)}}
> {{    at 
> scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:17)}}
> {{    at scala.App.$anonfun$main$1$adapted(App.scala:80)}}
> {{    at scala.collection.immutable.List.foreach(List.scala:431)}}
> {{    at scala.App.main(App.scala:80)}}
> {{    at scala.App.main$(App.scala:78)}}
> {{    at Main$.main(Main.scala:3)}}
> {{    at Main.main(Main.scala)}}
> {{Caused by: java.lang.ClassNotFoundException: 
> org.sparkproject.connect.client.com.google.common.cache.CacheLoader}}
> {{    at java.net.URLClassLoader.findClass(URLClassLoader.java:387)}}
> {{    at java.lang.ClassLoader.loadClass(ClassLoader.java:418)}}
> {{    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)}}
> {{    at java.lang.ClassLoader.loadClass(ClassLoader.java:351)}}
> {{    ... 11 more}}
> ```
> I know the connect does a bunch of shading during assembly so it could be 
> related to that. This application is not started via spark-submit or 
> anything. It's not run neither under a `SPARK_HOME` ( I guess that's the 
> whole point of connect client )
>  
> EDIT
> Not sure if it's the right mitigation but explicitly adding guava worked but 
> now I am in the 2nd territory of error
> {{Sep 21, 2023 8:21:59 PM 
> 

[jira] [Comment Edited] (SPARK-45255) Spark connect client failing with java.lang.NoClassDefFoundError

2023-09-22 Thread Faiz Halde (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17768040#comment-17768040
 ] 

Faiz Halde edited comment on SPARK-45255 at 9/22/23 2:31 PM:
-

to get past the error
`org/sparkproject/connect/client/com/google/common/cache/CacheLoader`
even after adding guava library, you need to copy their shading rules

```

    (assembly / assemblyShadeRules) := Seq(
      ShadeRule.rename("io.grpc.**" -> 
"org.sparkproject.connect.client.io.grpc.@1").inAll,
      ShadeRule.rename("com.google.**" -> 
"org.sparkproject.connect.client.com.google.@1").inAll,
      ShadeRule.rename("io.netty.**" -> 
"org.sparkproject.connect.client.io.netty.@1").inAll,
      ShadeRule.rename("org.checkerframework.**" -> 
"org.sparkproject.connect.client.org.checkerframework.@1").inAll,
      ShadeRule.rename("javax.annotation.**" -> 
"org.sparkproject.connect.client.javax.annotation.@1").inAll,
      ShadeRule.rename("io.perfmark.**" -> 
"org.sparkproject.connect.client.io.perfmark.@1").inAll,
      ShadeRule.rename("org.codehaus.**" -> 
"org.sparkproject.connect.client.org.codehaus.@1").inAll,
      ShadeRule.rename("android.annotation.**" -> 
"org.sparkproject.connect.client.android.annotation.@1").inAll
    ),

```


was (Author: JIRAUSER300204):
to get past the error
org/sparkproject/connect/client/com/google/common/cache/CacheLoader
even after adding guava library, you need to copy their shading rules

```

    (assembly / assemblyShadeRules) := Seq(
      ShadeRule.rename("io.grpc.**" -> 
"org.sparkproject.connect.client.io.grpc.@1").inAll,
      ShadeRule.rename("com.google.**" -> 
"org.sparkproject.connect.client.com.google.@1").inAll,
      ShadeRule.rename("io.netty.**" -> 
"org.sparkproject.connect.client.io.netty.@1").inAll,
      ShadeRule.rename("org.checkerframework.**" -> 
"org.sparkproject.connect.client.org.checkerframework.@1").inAll,
      ShadeRule.rename("javax.annotation.**" -> 
"org.sparkproject.connect.client.javax.annotation.@1").inAll,
      ShadeRule.rename("io.perfmark.**" -> 
"org.sparkproject.connect.client.io.perfmark.@1").inAll,
      ShadeRule.rename("org.codehaus.**" -> 
"org.sparkproject.connect.client.org.codehaus.@1").inAll,
      ShadeRule.rename("android.annotation.**" -> 
"org.sparkproject.connect.client.android.annotation.@1").inAll
    ),

```

> Spark connect client failing with java.lang.NoClassDefFoundError
> 
>
> Key: SPARK-45255
> URL: https://issues.apache.org/jira/browse/SPARK-45255
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Faiz Halde
>Priority: Major
>
> java 1.8, sbt 1.9, scala 2.12
>  
> I have a very simple repo with the following dependency in `build.sbt`
> ```
> {{libraryDependencies ++= Seq("org.apache.spark" %% 
> "spark-connect-client-jvm" % "3.5.0")}}
> ```
> A simple application
> ```
> {{object Main extends App {}}
> {{   val s = SparkSession.builder().remote("sc://localhost").getOrCreate()}}
> {{   s.read.json("/tmp/input.json").repartition(10).show(false)}}
> {{}}}
> ```
> But when I run it, I get the following error
>  
> ```
> {{Exception in thread "main" java.lang.NoClassDefFoundError: 
> org/sparkproject/connect/client/com/google/common/cache/CacheLoader}}
> {{    at Main$.delayedEndpoint$Main$1(Main.scala:4)}}
> {{    at Main$delayedInit$body.apply(Main.scala:3)}}
> {{    at scala.Function0.apply$mcV$sp(Function0.scala:39)}}
> {{    at scala.Function0.apply$mcV$sp$(Function0.scala:39)}}
> {{    at 
> scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:17)}}
> {{    at scala.App.$anonfun$main$1$adapted(App.scala:80)}}
> {{    at scala.collection.immutable.List.foreach(List.scala:431)}}
> {{    at scala.App.main(App.scala:80)}}
> {{    at scala.App.main$(App.scala:78)}}
> {{    at Main$.main(Main.scala:3)}}
> {{    at Main.main(Main.scala)}}
> {{Caused by: java.lang.ClassNotFoundException: 
> org.sparkproject.connect.client.com.google.common.cache.CacheLoader}}
> {{    at java.net.URLClassLoader.findClass(URLClassLoader.java:387)}}
> {{    at java.lang.ClassLoader.loadClass(ClassLoader.java:418)}}
> {{    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)}}
> {{    at java.lang.ClassLoader.loadClass(ClassLoader.java:351)}}
> {{    ... 11 more}}
> ```
> I know the connect does a bunch of shading during assembly so it could be 
> related to that. This application is not started via spark-submit or 
> anything. It's not run neither under a `SPARK_HOME` ( I guess that's the 
> whole point of connect client )
>  
> EDIT
> Not sure if it's the right mitigation but explicitly adding guava worked but 
> now I am in the 2nd territory of error
> {{Sep 21, 2023 8:21:59 PM 
> 

[jira] [Commented] (SPARK-45255) Spark connect client failing with java.lang.NoClassDefFoundError

2023-09-22 Thread Faiz Halde (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17768040#comment-17768040
 ] 

Faiz Halde commented on SPARK-45255:


to get past the error
org/sparkproject/connect/client/com/google/common/cache/CacheLoader
even after adding guava library, you need to copy their shading rules

```

    (assembly / assemblyShadeRules) := Seq(
      ShadeRule.rename("io.grpc.**" -> 
"org.sparkproject.connect.client.io.grpc.@1").inAll,
      ShadeRule.rename("com.google.**" -> 
"org.sparkproject.connect.client.com.google.@1").inAll,
      ShadeRule.rename("io.netty.**" -> 
"org.sparkproject.connect.client.io.netty.@1").inAll,
      ShadeRule.rename("org.checkerframework.**" -> 
"org.sparkproject.connect.client.org.checkerframework.@1").inAll,
      ShadeRule.rename("javax.annotation.**" -> 
"org.sparkproject.connect.client.javax.annotation.@1").inAll,
      ShadeRule.rename("io.perfmark.**" -> 
"org.sparkproject.connect.client.io.perfmark.@1").inAll,
      ShadeRule.rename("org.codehaus.**" -> 
"org.sparkproject.connect.client.org.codehaus.@1").inAll,
      ShadeRule.rename("android.annotation.**" -> 
"org.sparkproject.connect.client.android.annotation.@1").inAll
    ),

```

> Spark connect client failing with java.lang.NoClassDefFoundError
> 
>
> Key: SPARK-45255
> URL: https://issues.apache.org/jira/browse/SPARK-45255
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Faiz Halde
>Priority: Major
>
> java 1.8, sbt 1.9, scala 2.12
>  
> I have a very simple repo with the following dependency in `build.sbt`
> ```
> {{libraryDependencies ++= Seq("org.apache.spark" %% 
> "spark-connect-client-jvm" % "3.5.0")}}
> ```
> A simple application
> ```
> {{object Main extends App {}}
> {{   val s = SparkSession.builder().remote("sc://localhost").getOrCreate()}}
> {{   s.read.json("/tmp/input.json").repartition(10).show(false)}}
> {{}}}
> ```
> But when I run it, I get the following error
>  
> ```
> {{Exception in thread "main" java.lang.NoClassDefFoundError: 
> org/sparkproject/connect/client/com/google/common/cache/CacheLoader}}
> {{    at Main$.delayedEndpoint$Main$1(Main.scala:4)}}
> {{    at Main$delayedInit$body.apply(Main.scala:3)}}
> {{    at scala.Function0.apply$mcV$sp(Function0.scala:39)}}
> {{    at scala.Function0.apply$mcV$sp$(Function0.scala:39)}}
> {{    at 
> scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:17)}}
> {{    at scala.App.$anonfun$main$1$adapted(App.scala:80)}}
> {{    at scala.collection.immutable.List.foreach(List.scala:431)}}
> {{    at scala.App.main(App.scala:80)}}
> {{    at scala.App.main$(App.scala:78)}}
> {{    at Main$.main(Main.scala:3)}}
> {{    at Main.main(Main.scala)}}
> {{Caused by: java.lang.ClassNotFoundException: 
> org.sparkproject.connect.client.com.google.common.cache.CacheLoader}}
> {{    at java.net.URLClassLoader.findClass(URLClassLoader.java:387)}}
> {{    at java.lang.ClassLoader.loadClass(ClassLoader.java:418)}}
> {{    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)}}
> {{    at java.lang.ClassLoader.loadClass(ClassLoader.java:351)}}
> {{    ... 11 more}}
> ```
> I know the connect does a bunch of shading during assembly so it could be 
> related to that. This application is not started via spark-submit or 
> anything. It's not run neither under a `SPARK_HOME` ( I guess that's the 
> whole point of connect client )
>  
> EDIT
> Not sure if it's the right mitigation but explicitly adding guava worked but 
> now I am in the 2nd territory of error
> {{Sep 21, 2023 8:21:59 PM 
> org.sparkproject.connect.client.io.grpc.NameResolverRegistry 
> getDefaultRegistry}}
> {{WARNING: No NameResolverProviders found via ServiceLoader, including for 
> DNS. This is probably due to a broken build. If using ProGuard, check your 
> configuration}}
> {{Exception in thread "main" 
> org.sparkproject.connect.client.com.google.common.util.concurrent.UncheckedExecutionException:
>  
> org.sparkproject.connect.client.io.grpc.ManagedChannelRegistry$ProviderNotFoundException:
>  No functional channel service provider found. Try adding a dependency on the 
> grpc-okhttp, grpc-netty, or grpc-netty-shaded artifact}}
> {{    at 
> org.sparkproject.connect.client.com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2085)}}
> {{    at 
> org.sparkproject.connect.client.com.google.common.cache.LocalCache.get(LocalCache.java:4011)}}
> {{    at 
> org.sparkproject.connect.client.com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4034)}}
> {{    at 
> org.sparkproject.connect.client.com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:5010)}}
> {{    at 
> 

[jira] [Commented] (SPARK-45255) Spark connect client failing with java.lang.NoClassDefFoundError

2023-09-22 Thread Faiz Halde (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17768037#comment-17768037
 ] 

Faiz Halde commented on SPARK-45255:


For now, I unblocked myself by manually building spark connect

{{build/mvn -Pconnect -DskipTests clean package}}

{{and then running}}

{{mkdir connect-jars}}

{{./bin/spark-connect-scala-client-classpath | tr ':' '\n' | xargs -I{} cp {} 
connect-jars}}

 

{{Then, in your client application, have the connect-jars directory in your 
classpath. Not sure if this is the right way though}}
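
One possible way to wire that directory into an sbt project (a sketch, assuming the jars 
were copied into ./connect-jars as shown above):

```scala
// build.sbt: put every jar under ./connect-jars on the compile/runtime classpath
Compile / unmanagedJars ++= {
  val connectJars = baseDirectory.value / "connect-jars"
  (connectJars ** "*.jar").classpath
}
```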

> Spark connect client failing with java.lang.NoClassDefFoundError
> 
>
> Key: SPARK-45255
> URL: https://issues.apache.org/jira/browse/SPARK-45255
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Faiz Halde
>Priority: Major
>
> java 1.8, sbt 1.9, scala 2.12
>  
> I have a very simple repo with the following dependency in `build.sbt`
> ```
> {{libraryDependencies ++= Seq("org.apache.spark" %% 
> "spark-connect-client-jvm" % "3.5.0")}}
> ```
> A simple application
> ```
> {{object Main extends App {}}
> {{   val s = SparkSession.builder().remote("sc://localhost").getOrCreate()}}
> {{   s.read.json("/tmp/input.json").repartition(10).show(false)}}
> {{}}}
> ```
> But when I run it, I get the following error
>  
> ```
> {{Exception in thread "main" java.lang.NoClassDefFoundError: 
> org/sparkproject/connect/client/com/google/common/cache/CacheLoader}}
> {{    at Main$.delayedEndpoint$Main$1(Main.scala:4)}}
> {{    at Main$delayedInit$body.apply(Main.scala:3)}}
> {{    at scala.Function0.apply$mcV$sp(Function0.scala:39)}}
> {{    at scala.Function0.apply$mcV$sp$(Function0.scala:39)}}
> {{    at 
> scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:17)}}
> {{    at scala.App.$anonfun$main$1$adapted(App.scala:80)}}
> {{    at scala.collection.immutable.List.foreach(List.scala:431)}}
> {{    at scala.App.main(App.scala:80)}}
> {{    at scala.App.main$(App.scala:78)}}
> {{    at Main$.main(Main.scala:3)}}
> {{    at Main.main(Main.scala)}}
> {{Caused by: java.lang.ClassNotFoundException: 
> org.sparkproject.connect.client.com.google.common.cache.CacheLoader}}
> {{    at java.net.URLClassLoader.findClass(URLClassLoader.java:387)}}
> {{    at java.lang.ClassLoader.loadClass(ClassLoader.java:418)}}
> {{    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)}}
> {{    at java.lang.ClassLoader.loadClass(ClassLoader.java:351)}}
> {{    ... 11 more}}
> ```
> I know the connect does a bunch of shading during assembly so it could be 
> related to that. This application is not started via spark-submit or 
> anything. It's not run neither under a `SPARK_HOME` ( I guess that's the 
> whole point of connect client )
>  
> EDIT
> Not sure if it's the right mitigation but explicitly adding guava worked but 
> now I am in the 2nd territory of error
> {{Sep 21, 2023 8:21:59 PM 
> org.sparkproject.connect.client.io.grpc.NameResolverRegistry 
> getDefaultRegistry}}
> {{WARNING: No NameResolverProviders found via ServiceLoader, including for 
> DNS. This is probably due to a broken build. If using ProGuard, check your 
> configuration}}
> {{Exception in thread "main" 
> org.sparkproject.connect.client.com.google.common.util.concurrent.UncheckedExecutionException:
>  
> org.sparkproject.connect.client.io.grpc.ManagedChannelRegistry$ProviderNotFoundException:
>  No functional channel service provider found. Try adding a dependency on the 
> grpc-okhttp, grpc-netty, or grpc-netty-shaded artifact}}
> {{    at 
> org.sparkproject.connect.client.com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2085)}}
> {{    at 
> org.sparkproject.connect.client.com.google.common.cache.LocalCache.get(LocalCache.java:4011)}}
> {{    at 
> org.sparkproject.connect.client.com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4034)}}
> {{    at 
> org.sparkproject.connect.client.com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:5010)}}
> {{    at 
> org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$1(SparkSession.scala:945)}}
> {{    at scala.Option.getOrElse(Option.scala:189)}}
> {{    at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:945)}}
> {{    at Main$.delayedEndpoint$Main$1(Main.scala:4)}}
> {{    at Main$delayedInit$body.apply(Main.scala:3)}}
> {{    at scala.Function0.apply$mcV$sp(Function0.scala:39)}}
> {{    at scala.Function0.apply$mcV$sp$(Function0.scala:39)}}
> {{    at 
> scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:17)}}
> {{    at scala.App.$anonfun$main$1$adapted(App.scala:80)}}
> {{    at scala.collection.immutable.List.foreach(List.scala:431)}}
> {{    at scala.App.main(App.scala:80)}}
> {{   

[jira] [Comment Edited] (SPARK-45255) Spark connect client failing with java.lang.NoClassDefFoundError

2023-09-22 Thread Faiz Halde (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17768037#comment-17768037
 ] 

Faiz Halde edited comment on SPARK-45255 at 9/22/23 2:29 PM:
-

For now, I unblocked myself by manually building spark connect

{{build/mvn -Pconnect -DskipTests clean package}}

{{and then running}}

{{mkdir connect-jars}}

{{./bin/spark-connect-scala-client-classpath | tr ':' '\n' | xargs -I{} cp {} 
connect-jars}}

 

{{Then, when starting your client application, have the connect-jars directory 
in your classpath. Not sure if this is the right way though}}


was (Author: JIRAUSER300204):
For now, I unblocked myself by manually building spark connect

{{build/mvn -Pconnect -DskipTests clean package}}

{{and then running}}

{{mkdir connect-jars}}

{{./bin/spark-connect-scala-client-classpath | tr ':' '\n' | xargs -I{} cp {} 
connect-jars}}

 

{{Then, in your client application, have the connect-jars directory in your 
classpath. Not sure if this is the right way though}}

> Spark connect client failing with java.lang.NoClassDefFoundError
> 
>
> Key: SPARK-45255
> URL: https://issues.apache.org/jira/browse/SPARK-45255
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Faiz Halde
>Priority: Major
>
> java 1.8, sbt 1.9, scala 2.12
>  
> I have a very simple repo with the following dependency in `build.sbt`
> ```
> {{libraryDependencies ++= Seq("org.apache.spark" %% 
> "spark-connect-client-jvm" % "3.5.0")}}
> ```
> A simple application
> ```
> {{object Main extends App {}}
> {{   val s = SparkSession.builder().remote("sc://localhost").getOrCreate()}}
> {{   s.read.json("/tmp/input.json").repartition(10).show(false)}}
> {{}}}
> ```
> But when I run it, I get the following error
>  
> ```
> {{Exception in thread "main" java.lang.NoClassDefFoundError: 
> org/sparkproject/connect/client/com/google/common/cache/CacheLoader}}
> {{    at Main$.delayedEndpoint$Main$1(Main.scala:4)}}
> {{    at Main$delayedInit$body.apply(Main.scala:3)}}
> {{    at scala.Function0.apply$mcV$sp(Function0.scala:39)}}
> {{    at scala.Function0.apply$mcV$sp$(Function0.scala:39)}}
> {{    at 
> scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:17)}}
> {{    at scala.App.$anonfun$main$1$adapted(App.scala:80)}}
> {{    at scala.collection.immutable.List.foreach(List.scala:431)}}
> {{    at scala.App.main(App.scala:80)}}
> {{    at scala.App.main$(App.scala:78)}}
> {{    at Main$.main(Main.scala:3)}}
> {{    at Main.main(Main.scala)}}
> {{Caused by: java.lang.ClassNotFoundException: 
> org.sparkproject.connect.client.com.google.common.cache.CacheLoader}}
> {{    at java.net.URLClassLoader.findClass(URLClassLoader.java:387)}}
> {{    at java.lang.ClassLoader.loadClass(ClassLoader.java:418)}}
> {{    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)}}
> {{    at java.lang.ClassLoader.loadClass(ClassLoader.java:351)}}
> {{    ... 11 more}}
> ```
> I know the connect does a bunch of shading during assembly so it could be 
> related to that. This application is not started via spark-submit or 
> anything. It's not run neither under a `SPARK_HOME` ( I guess that's the 
> whole point of connect client )
>  
> EDIT
> Not sure if it's the right mitigation but explicitly adding guava worked but 
> now I am in the 2nd territory of error
> {{Sep 21, 2023 8:21:59 PM 
> org.sparkproject.connect.client.io.grpc.NameResolverRegistry 
> getDefaultRegistry}}
> {{WARNING: No NameResolverProviders found via ServiceLoader, including for 
> DNS. This is probably due to a broken build. If using ProGuard, check your 
> configuration}}
> {{Exception in thread "main" 
> org.sparkproject.connect.client.com.google.common.util.concurrent.UncheckedExecutionException:
>  
> org.sparkproject.connect.client.io.grpc.ManagedChannelRegistry$ProviderNotFoundException:
>  No functional channel service provider found. Try adding a dependency on the 
> grpc-okhttp, grpc-netty, or grpc-netty-shaded artifact}}
> {{    at 
> org.sparkproject.connect.client.com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2085)}}
> {{    at 
> org.sparkproject.connect.client.com.google.common.cache.LocalCache.get(LocalCache.java:4011)}}
> {{    at 
> org.sparkproject.connect.client.com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4034)}}
> {{    at 
> org.sparkproject.connect.client.com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:5010)}}
> {{    at 
> org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$1(SparkSession.scala:945)}}
> {{    at scala.Option.getOrElse(Option.scala:189)}}
> {{    at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:945)}}
> {{    at 
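
For the two errors quoted above, a hedged build.sbt sketch of that mitigation path (explicit Guava plus one of the gRPC channel artifacts the error message suggests); the Guava and gRPC versions here are assumptions, not pins taken from the Spark build:

```scala
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-connect-client-jvm" % "3.5.0",
  // Assumed versions; the error itself only says a channel provider is missing.
  "com.google.guava"  % "guava"             % "32.1.2-jre",
  "io.grpc"           % "grpc-netty-shaded" % "1.56.1"
)
```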

[jira] [Updated] (SPARK-45255) Spark connect client failing with java.lang.NoClassDefFoundError

2023-09-21 Thread Faiz Halde (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Faiz Halde updated SPARK-45255:
---
Description: 
java 1.8, sbt 1.9, scala 2.12

 

I have a very simple repo with the following dependency in `build.sbt`

```

{{libraryDependencies ++= Seq("org.apache.spark" %% "spark-connect-client-jvm" 
% "3.5.0")}}

```

A simple application

```

{{object Main extends App {}}
{{   val s = SparkSession.builder().remote("sc://localhost").getOrCreate()}}
{{   s.read.json("/tmp/input.json").repartition(10).show(false)}}
{{}}}

```

But when I run it, I get the following error

 

```

{{Exception in thread "main" java.lang.NoClassDefFoundError: 
org/sparkproject/connect/client/com/google/common/cache/CacheLoader}}
{{    at Main$.delayedEndpoint$Main$1(Main.scala:4)}}
{{    at Main$delayedInit$body.apply(Main.scala:3)}}
{{    at scala.Function0.apply$mcV$sp(Function0.scala:39)}}
{{    at scala.Function0.apply$mcV$sp$(Function0.scala:39)}}
{{    at 
scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:17)}}
{{    at scala.App.$anonfun$main$1$adapted(App.scala:80)}}
{{    at scala.collection.immutable.List.foreach(List.scala:431)}}
{{    at scala.App.main(App.scala:80)}}
{{    at scala.App.main$(App.scala:78)}}
{{    at Main$.main(Main.scala:3)}}
{{    at Main.main(Main.scala)}}
{{Caused by: java.lang.ClassNotFoundException: 
org.sparkproject.connect.client.com.google.common.cache.CacheLoader}}
{{    at java.net.URLClassLoader.findClass(URLClassLoader.java:387)}}
{{    at java.lang.ClassLoader.loadClass(ClassLoader.java:418)}}
{{    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)}}
{{    at java.lang.ClassLoader.loadClass(ClassLoader.java:351)}}
{{    ... 11 more}}

```

I know the connect client does a bunch of shading during assembly, so it could be 
related to that. This application is not started via spark-submit or anything, 
nor is it run under a `SPARK_HOME` ( I guess that's the whole point of the 
connect client )

 

EDIT

Not sure if it's the right mitigation, but explicitly adding guava worked; now I 
am in a second category of error

{{Sep 21, 2023 8:21:59 PM 
org.sparkproject.connect.client.io.grpc.NameResolverRegistry 
getDefaultRegistry}}
{{WARNING: No NameResolverProviders found via ServiceLoader, including for DNS. 
This is probably due to a broken build. If using ProGuard, check your 
configuration}}
{{Exception in thread "main" 
org.sparkproject.connect.client.com.google.common.util.concurrent.UncheckedExecutionException:
 
org.sparkproject.connect.client.io.grpc.ManagedChannelRegistry$ProviderNotFoundException:
 No functional channel service provider found. Try adding a dependency on the 
grpc-okhttp, grpc-netty, or grpc-netty-shaded artifact}}
{{    at 
org.sparkproject.connect.client.com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2085)}}
{{    at 
org.sparkproject.connect.client.com.google.common.cache.LocalCache.get(LocalCache.java:4011)}}
{{    at 
org.sparkproject.connect.client.com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4034)}}
{{    at 
org.sparkproject.connect.client.com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:5010)}}
{{    at 
org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$1(SparkSession.scala:945)}}
{{    at scala.Option.getOrElse(Option.scala:189)}}
{{    at 
org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:945)}}
{{    at Main$.delayedEndpoint$Main$1(Main.scala:4)}}
{{    at Main$delayedInit$body.apply(Main.scala:3)}}
{{    at scala.Function0.apply$mcV$sp(Function0.scala:39)}}
{{    at scala.Function0.apply$mcV$sp$(Function0.scala:39)}}
{{    at 
scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:17)}}
{{    at scala.App.$anonfun$main$1$adapted(App.scala:80)}}
{{    at scala.collection.immutable.List.foreach(List.scala:431)}}
{{    at scala.App.main(App.scala:80)}}
{{    at scala.App.main$(App.scala:78)}}
{{    at Main$.main(Main.scala:3)}}
{{    at Main.main(Main.scala)}}
{{Caused by: 
org.sparkproject.connect.client.io.grpc.ManagedChannelRegistry$ProviderNotFoundException:
 No functional channel service provider found. Try adding a dependency on the 
grpc-okhttp, grpc-netty, or grpc-netty-shaded artifact}}
{{    at 
org.sparkproject.connect.client.io.grpc.ManagedChannelRegistry.newChannelBuilder(ManagedChannelRegistry.java:179)}}
{{    at 
org.sparkproject.connect.client.io.grpc.ManagedChannelRegistry.newChannelBuilder(ManagedChannelRegistry.java:155)}}
{{    at 
org.sparkproject.connect.client.io.grpc.Grpc.newChannelBuilder(Grpc.java:101)}}
{{    at 
org.sparkproject.connect.client.io.grpc.Grpc.newChannelBuilderForAddress(Grpc.java:111)}}
{{    at 
org.apache.spark.sql.connect.client.SparkConnectClient$Configuration.createChannel(SparkConnectClient.scala:633)}}
{{    at 

[jira] [Updated] (SPARK-45255) Spark connect client failing with java.lang.NoClassDefFoundError

2023-09-21 Thread Faiz Halde (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Faiz Halde updated SPARK-45255:
---
Description: 
java 1.8, sbt 1.9, scala 2.12

 

I have a very simple repo with the following dependency in `build.sbt`

```

{{libraryDependencies ++= Seq("org.apache.spark" %% "spark-connect-client-jvm" 
% "3.5.0")}}

```

A simple application

```

{{object Main extends App {}}
{{   val s = SparkSession.builder().remote("sc://localhost").getOrCreate()}}
{{   s.read.json("/tmp/input.json").repartition(10).show(false)}}
{{}}}

```

But when I run it, I get the following error

 

```

{{Exception in thread "main" java.lang.NoClassDefFoundError: 
org/sparkproject/connect/client/com/google/common/cache/CacheLoader}}
{{    at Main$.delayedEndpoint$Main$1(Main.scala:4)}}
{{    at Main$delayedInit$body.apply(Main.scala:3)}}
{{    at scala.Function0.apply$mcV$sp(Function0.scala:39)}}
{{    at scala.Function0.apply$mcV$sp$(Function0.scala:39)}}
{{    at 
scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:17)}}
{{    at scala.App.$anonfun$main$1$adapted(App.scala:80)}}
{{    at scala.collection.immutable.List.foreach(List.scala:431)}}
{{    at scala.App.main(App.scala:80)}}
{{    at scala.App.main$(App.scala:78)}}
{{    at Main$.main(Main.scala:3)}}
{{    at Main.main(Main.scala)}}
{{Caused by: java.lang.ClassNotFoundException: 
org.sparkproject.connect.client.com.google.common.cache.CacheLoader}}
{{    at java.net.URLClassLoader.findClass(URLClassLoader.java:387)}}
{{    at java.lang.ClassLoader.loadClass(ClassLoader.java:418)}}
{{    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)}}
{{    at java.lang.ClassLoader.loadClass(ClassLoader.java:351)}}
{{    ... 11 more}}

```

I know the connect client does a bunch of shading during assembly, so it could be 
related to that. This application is not started via spark-submit or anything, 
nor is it run under a `SPARK_HOME` ( I guess that's the whole point of the 
connect client )

 

I followed the doc exactly as described. Can somebody help?

  was:
I have a very simple repo with the following dependency in `build.sbt`

```

{{libraryDependencies ++= Seq("org.apache.spark" %% "spark-connect-client-jvm" 
% "3.5.0")}}

```

A simple application

```

{{object Main extends App {}}
{{   val s = SparkSession.builder().remote("sc://localhost").getOrCreate()}}
{{   s.read.json("/tmp/input.json").repartition(10).show(false)}}
{{}}}

```

But when I run it, I get the following error

 

```

{{Exception in thread "main" java.lang.NoClassDefFoundError: 
org/sparkproject/connect/client/com/google/common/cache/CacheLoader}}
{{    at Main$.delayedEndpoint$Main$1(Main.scala:4)}}
{{    at Main$delayedInit$body.apply(Main.scala:3)}}
{{    at scala.Function0.apply$mcV$sp(Function0.scala:39)}}
{{    at scala.Function0.apply$mcV$sp$(Function0.scala:39)}}
{{    at 
scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:17)}}
{{    at scala.App.$anonfun$main$1$adapted(App.scala:80)}}
{{    at scala.collection.immutable.List.foreach(List.scala:431)}}
{{    at scala.App.main(App.scala:80)}}
{{    at scala.App.main$(App.scala:78)}}
{{    at Main$.main(Main.scala:3)}}
{{    at Main.main(Main.scala)}}
{{Caused by: java.lang.ClassNotFoundException: 
org.sparkproject.connect.client.com.google.common.cache.CacheLoader}}
{{    at java.net.URLClassLoader.findClass(URLClassLoader.java:387)}}
{{    at java.lang.ClassLoader.loadClass(ClassLoader.java:418)}}
{{    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)}}
{{    at java.lang.ClassLoader.loadClass(ClassLoader.java:351)}}
{{    ... 11 more}}

```

I know the connect does a bunch of shading during assembly so it could be 
related to that. This application is not started via spark-submit or anything. 
It's not run neither under a `SPARK_HOME` ( I guess that's the whole point of 
connect client )

 

I followed the doc exactly as described. Can somebody help


> Spark connect client failing with java.lang.NoClassDefFoundError
> 
>
> Key: SPARK-45255
> URL: https://issues.apache.org/jira/browse/SPARK-45255
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Faiz Halde
>Priority: Major
>
> java 1.8, sbt 1.9, scala 2.12
>  
> I have a very simple repo with the following dependency in `build.sbt`
> ```
> {{libraryDependencies ++= Seq("org.apache.spark" %% 
> "spark-connect-client-jvm" % "3.5.0")}}
> ```
> A simple application
> ```
> {{object Main extends App {}}
> {{   val s = SparkSession.builder().remote("sc://localhost").getOrCreate()}}
> {{   s.read.json("/tmp/input.json").repartition(10).show(false)}}
> {{}}}
> ```
> But when I run it, I get the following error
>  
> ```
> {{Exception in thread "main" 

[jira] [Updated] (SPARK-45255) Spark connect client failing with java.lang.NoClassDefFoundError

2023-09-21 Thread Faiz Halde (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Faiz Halde updated SPARK-45255:
---
Issue Type: Bug  (was: New Feature)

> Spark connect client failing with java.lang.NoClassDefFoundError
> 
>
> Key: SPARK-45255
> URL: https://issues.apache.org/jira/browse/SPARK-45255
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Faiz Halde
>Priority: Major
>
> I have a very simple repo with the following dependency in `build.sbt`
> ```
> {{libraryDependencies ++= Seq("org.apache.spark" %% 
> "spark-connect-client-jvm" % "3.5.0")}}
> ```
> A simple application
> ```
> {{object Main extends App {}}
> {{   val s = SparkSession.builder().remote("sc://localhost").getOrCreate()}}
> {{   s.read.json("/tmp/input.json").repartition(10).show(false)}}
> {{}}}
> ```
> But when I run it, I get the following error
>  
> ```
> {{Exception in thread "main" java.lang.NoClassDefFoundError: 
> org/sparkproject/connect/client/com/google/common/cache/CacheLoader}}
> {{    at Main$.delayedEndpoint$Main$1(Main.scala:4)}}
> {{    at Main$delayedInit$body.apply(Main.scala:3)}}
> {{    at scala.Function0.apply$mcV$sp(Function0.scala:39)}}
> {{    at scala.Function0.apply$mcV$sp$(Function0.scala:39)}}
> {{    at 
> scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:17)}}
> {{    at scala.App.$anonfun$main$1$adapted(App.scala:80)}}
> {{    at scala.collection.immutable.List.foreach(List.scala:431)}}
> {{    at scala.App.main(App.scala:80)}}
> {{    at scala.App.main$(App.scala:78)}}
> {{    at Main$.main(Main.scala:3)}}
> {{    at Main.main(Main.scala)}}
> {{Caused by: java.lang.ClassNotFoundException: 
> org.sparkproject.connect.client.com.google.common.cache.CacheLoader}}
> {{    at java.net.URLClassLoader.findClass(URLClassLoader.java:387)}}
> {{    at java.lang.ClassLoader.loadClass(ClassLoader.java:418)}}
> {{    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)}}
> {{    at java.lang.ClassLoader.loadClass(ClassLoader.java:351)}}
> {{    ... 11 more}}
> ```
> I know the connect does a bunch of shading during assembly so it could be 
> related to that. This application is not started via spark-submit or 
> anything. It's not run neither under a `SPARK_HOME` ( I guess that's the 
> whole point of connect client )
>  
> I followed the doc exactly as described. Can somebody help



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45255) Spark connect client failing with java.lang.NoClassDefFoundError

2023-09-21 Thread Faiz Halde (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Faiz Halde updated SPARK-45255:
---
Description: 
I have a very simple repo with the following dependency in `build.sbt`

```

{{libraryDependencies ++= Seq("org.apache.spark" %% "spark-connect-client-jvm" 
% "3.5.0")}}

```

A simple application

```

{{object Main extends App {}}
{{   val s = SparkSession.builder().remote("sc://localhost").getOrCreate()}}
{{   s.read.json("/tmp/input.json").repartition(10).show(false)}}
{{}}}

```

But when I run it, I get the following error

 

```

{{Exception in thread "main" java.lang.NoClassDefFoundError: 
org/sparkproject/connect/client/com/google/common/cache/CacheLoader}}
{{    at Main$.delayedEndpoint$Main$1(Main.scala:4)}}
{{    at Main$delayedInit$body.apply(Main.scala:3)}}
{{    at scala.Function0.apply$mcV$sp(Function0.scala:39)}}
{{    at scala.Function0.apply$mcV$sp$(Function0.scala:39)}}
{{    at 
scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:17)}}
{{    at scala.App.$anonfun$main$1$adapted(App.scala:80)}}
{{    at scala.collection.immutable.List.foreach(List.scala:431)}}
{{    at scala.App.main(App.scala:80)}}
{{    at scala.App.main$(App.scala:78)}}
{{    at Main$.main(Main.scala:3)}}
{{    at Main.main(Main.scala)}}
{{Caused by: java.lang.ClassNotFoundException: 
org.sparkproject.connect.client.com.google.common.cache.CacheLoader}}
{{    at java.net.URLClassLoader.findClass(URLClassLoader.java:387)}}
{{    at java.lang.ClassLoader.loadClass(ClassLoader.java:418)}}
{{    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)}}
{{    at java.lang.ClassLoader.loadClass(ClassLoader.java:351)}}
{{    ... 11 more}}

```

I know the connect client does a bunch of shading during assembly, so it could be 
related to that. This application is not started via spark-submit or anything, 
nor is it run under a `SPARK_HOME` ( I guess that's the whole point of the 
connect client )

 

I followed the doc exactly as described. Can somebody help?

  was:
I have a very simple repo with the following dependency in `build.sbt`

```

{{libraryDependencies ++= Seq("org.apache.spark" %% "spark-connect-client-jvm" 
% "3.5.0")}}

```

A simple application

```

{{object Main extends App {}}
{{   val s = SparkSession.builder().remote("sc://localhost").getOrCreate()}}
{{   s.read.json("/tmp/input.json").repartition(10).show(false)}}
{{}}}

```

But when I run it, I get the following error

 

```

{{Exception in thread "main" java.lang.NoClassDefFoundError: 
org/sparkproject/connect/client/com/google/common/cache/CacheLoader}}
{{    at Main$.delayedEndpoint$Main$1(Main.scala:4)}}
{{    at Main$delayedInit$body.apply(Main.scala:3)}}
{{    at scala.Function0.apply$mcV$sp(Function0.scala:39)}}
{{    at scala.Function0.apply$mcV$sp$(Function0.scala:39)}}
{{    at 
scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:17)}}
{{    at scala.App.$anonfun$main$1$adapted(App.scala:80)}}
{{    at scala.collection.immutable.List.foreach(List.scala:431)}}
{{    at scala.App.main(App.scala:80)}}
{{    at scala.App.main$(App.scala:78)}}
{{    at Main$.main(Main.scala:3)}}
{{    at Main.main(Main.scala)}}
{{Caused by: java.lang.ClassNotFoundException: 
org.sparkproject.connect.client.com.google.common.cache.CacheLoader}}
{{    at java.net.URLClassLoader.findClass(URLClassLoader.java:387)}}
{{    at java.lang.ClassLoader.loadClass(ClassLoader.java:418)}}
{{    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)}}
{{    at java.lang.ClassLoader.loadClass(ClassLoader.java:351)}}
{{    ... 11 more}}

```

I know the connect does a bunch of shading during assembly so it could be 
related to that. This application is not started via spark-submit or anything. 
It's not run neither under a `SPARK_HOME` ( I guess that's the whole point of 
connect client )

 

I followed the doc exactly as described. Can somebody help?

BTW it did work if I copied the exact shading rules in my project but I wonder 
if that's the right thing to do?


> Spark connect client failing with java.lang.NoClassDefFoundError
> 
>
> Key: SPARK-45255
> URL: https://issues.apache.org/jira/browse/SPARK-45255
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Faiz Halde
>Priority: Major
>
> I have a very simple repo with the following dependency in `build.sbt`
> ```
> {{libraryDependencies ++= Seq("org.apache.spark" %% 
> "spark-connect-client-jvm" % "3.5.0")}}
> ```
> A simple application
> ```
> {{object Main extends App {}}
> {{   val s = SparkSession.builder().remote("sc://localhost").getOrCreate()}}
> {{   s.read.json("/tmp/input.json").repartition(10).show(false)}}
> {{}}}
> ```
> But when I run it, I get the 

[jira] [Updated] (SPARK-45255) Spark connect client failing with java.lang.NoClassDefFoundError

2023-09-21 Thread Faiz Halde (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Faiz Halde updated SPARK-45255:
---
Description: 
I have a very simple repo with the following dependency in `build.sbt`

```

{{libraryDependencies ++= Seq("org.apache.spark" %% "spark-connect-client-jvm" 
% "3.5.0")}}

```

A simple application

```

{{object Main extends App {}}
{{   val s = SparkSession.builder().remote("sc://localhost").getOrCreate()}}
{{   s.read.json("/tmp/input.json").repartition(10).show(false)}}
{{}}}

```

But when I run it, I get the following error

 

```

{{Exception in thread "main" java.lang.NoClassDefFoundError: 
org/sparkproject/connect/client/com/google/common/cache/CacheLoader}}
{{    at Main$.delayedEndpoint$Main$1(Main.scala:4)}}
{{    at Main$delayedInit$body.apply(Main.scala:3)}}
{{    at scala.Function0.apply$mcV$sp(Function0.scala:39)}}
{{    at scala.Function0.apply$mcV$sp$(Function0.scala:39)}}
{{    at 
scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:17)}}
{{    at scala.App.$anonfun$main$1$adapted(App.scala:80)}}
{{    at scala.collection.immutable.List.foreach(List.scala:431)}}
{{    at scala.App.main(App.scala:80)}}
{{    at scala.App.main$(App.scala:78)}}
{{    at Main$.main(Main.scala:3)}}
{{    at Main.main(Main.scala)}}
{{Caused by: java.lang.ClassNotFoundException: 
org.sparkproject.connect.client.com.google.common.cache.CacheLoader}}
{{    at java.net.URLClassLoader.findClass(URLClassLoader.java:387)}}
{{    at java.lang.ClassLoader.loadClass(ClassLoader.java:418)}}
{{    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)}}
{{    at java.lang.ClassLoader.loadClass(ClassLoader.java:351)}}
{{    ... 11 more}}

```

I know the connect client does a bunch of shading during assembly, so it could be 
related to that. This application is not started via spark-submit or anything, 
nor is it run under a `SPARK_HOME` ( I guess that's the whole point of the 
connect client )

 

I followed the doc exactly as described. Can somebody help?

BTW it did work if I copied the exact shading rules in my project but I wonder 
if that's the right thing to do?

  was:
I have a very simple repo with the following dependency in `build.sbt`

```

{{libraryDependencies ++= Seq("org.apache.spark" %% "spark-connect-client-jvm" 
% "3.5.0")}}

```

A simple application

```

{{object Main extends App {}}
{{   val s = SparkSession.builder().remote("sc://localhost").getOrCreate()}}
{{   s.read.json("/tmp/input.json").repartition(10).show(false)}}
{{}}}

```

But when I run it, I get the following error

 

```

{{Exception in thread "main" java.lang.NoClassDefFoundError: 
org/sparkproject/connect/client/com/google/common/cache/CacheLoader}}
{{    at Main$.delayedEndpoint$Main$1(Main.scala:4)}}
{{    at Main$delayedInit$body.apply(Main.scala:3)}}
{{    at scala.Function0.apply$mcV$sp(Function0.scala:39)}}
{{    at scala.Function0.apply$mcV$sp$(Function0.scala:39)}}
{{    at 
scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:17)}}
{{    at scala.App.$anonfun$main$1$adapted(App.scala:80)}}
{{    at scala.collection.immutable.List.foreach(List.scala:431)}}
{{    at scala.App.main(App.scala:80)}}
{{    at scala.App.main$(App.scala:78)}}
{{    at Main$.main(Main.scala:3)}}
{{    at Main.main(Main.scala)}}
{{Caused by: java.lang.ClassNotFoundException: 
org.sparkproject.connect.client.com.google.common.cache.CacheLoader}}
{{    at java.net.URLClassLoader.findClass(URLClassLoader.java:387)}}
{{    at java.lang.ClassLoader.loadClass(ClassLoader.java:418)}}
{{    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)}}
{{    at java.lang.ClassLoader.loadClass(ClassLoader.java:351)}}
{{    ... 11 more}}

```

I know the connect does a bunch of shading during assembly so it could be 
related to that. This application is not started via spark-submit or anything. 
Neither under `SPARK_HOME` ( I guess that's the whole point of connect client )

 

I followed the doc exactly as described. Can somebody help?

BTW it did work if I copied the exact shading rules in my project but I wonder 
if that's the right thing to do?


> Spark connect client failing with java.lang.NoClassDefFoundError
> 
>
> Key: SPARK-45255
> URL: https://issues.apache.org/jira/browse/SPARK-45255
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Faiz Halde
>Priority: Major
>
> I have a very simple repo with the following dependency in `build.sbt`
> ```
> {{libraryDependencies ++= Seq("org.apache.spark" %% 
> "spark-connect-client-jvm" % "3.5.0")}}
> ```
> A simple application
> ```
> {{object Main extends App {}}
> {{   val s = SparkSession.builder().remote("sc://localhost").getOrCreate()}}
> {{   

[jira] [Created] (SPARK-45255) Spark connect client failing with java.lang.NoClassDefFoundError

2023-09-21 Thread Faiz Halde (Jira)
Faiz Halde created SPARK-45255:
--

 Summary: Spark connect client failing with 
java.lang.NoClassDefFoundError
 Key: SPARK-45255
 URL: https://issues.apache.org/jira/browse/SPARK-45255
 Project: Spark
  Issue Type: New Feature
  Components: Connect
Affects Versions: 3.5.0
Reporter: Faiz Halde


I have a very simple repo with the following dependency in `build.sbt`

```

{{libraryDependencies ++= Seq("org.apache.spark" %% "spark-connect-client-jvm" 
% "3.5.0")}}

```

A simple application

```

{{object Main extends App {}}
{{   val s = SparkSession.builder().remote("sc://localhost").getOrCreate()}}
{{   s.read.json("/tmp/input.json").repartition(10).show(false)}}
{{}}}

```

But when I run it, I get the following error

 

```

Exception in thread "main" java.lang.NoClassDefFoundError: 
org/sparkproject/connect/client/com/google/common/cache/CacheLoader
    at Main$.delayedEndpoint$Main$1(Main.scala:4)
    at Main$delayedInit$body.apply(Main.scala:3)
    at scala.Function0.apply$mcV$sp(Function0.scala:39)
    at scala.Function0.apply$mcV$sp$(Function0.scala:39)
    at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:17)
    at scala.App.$anonfun$main$1$adapted(App.scala:80)
    at scala.collection.immutable.List.foreach(List.scala:431)
    at scala.App.main(App.scala:80)
    at scala.App.main$(App.scala:78)
    at Main$.main(Main.scala:3)
    at Main.main(Main.scala)
Caused by: java.lang.ClassNotFoundException: 
org.sparkproject.connect.client.com.google.common.cache.CacheLoader
    at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
    ... 11 more

```

I know the connect client does a bunch of shading during assembly, so it could be 
related to that. This application is not started via spark-submit or anything, 
nor is it run under a `SPARK_HOME` ( I guess that's the whole point of the 
connect client )

 

I followed the doc exactly as described. Can somebody help?

BTW, it did work when I copied the exact shading rules into my project, but I 
wonder if that's the right thing to do?
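
For what it's worth, a hedged sketch of the kind of sbt-assembly shading rule being described here (the relocation prefix comes from the stack trace above; whether this one rule is sufficient is an assumption):

```scala
// Requires the sbt-assembly plugin in project/plugins.sbt.
assembly / assemblyShadeRules := Seq(
  ShadeRule
    .rename("com.google.common.**" -> "org.sparkproject.connect.client.com.google.common.@1")
    .inAll
)
```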



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45255) Spark connect client failing with java.lang.NoClassDefFoundError

2023-09-21 Thread Faiz Halde (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Faiz Halde updated SPARK-45255:
---
Description: 
I have a very simple repo with the following dependency in `build.sbt`

```

{{libraryDependencies ++= Seq("org.apache.spark" %% "spark-connect-client-jvm" 
% "3.5.0")}}

```

A simple application

```

{{object Main extends App {}}
{{   val s = SparkSession.builder().remote("sc://localhost").getOrCreate()}}
{{   s.read.json("/tmp/input.json").repartition(10).show(false)}}
{{}}}

```

But when I run it, I get the following error

 

```

{{Exception in thread "main" java.lang.NoClassDefFoundError: 
org/sparkproject/connect/client/com/google/common/cache/CacheLoader}}
{{    at Main$.delayedEndpoint$Main$1(Main.scala:4)}}
{{    at Main$delayedInit$body.apply(Main.scala:3)}}
{{    at scala.Function0.apply$mcV$sp(Function0.scala:39)}}
{{    at scala.Function0.apply$mcV$sp$(Function0.scala:39)}}
{{    at 
scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:17)}}
{{    at scala.App.$anonfun$main$1$adapted(App.scala:80)}}
{{    at scala.collection.immutable.List.foreach(List.scala:431)}}
{{    at scala.App.main(App.scala:80)}}
{{    at scala.App.main$(App.scala:78)}}
{{    at Main$.main(Main.scala:3)}}
{{    at Main.main(Main.scala)}}
{{Caused by: java.lang.ClassNotFoundException: 
org.sparkproject.connect.client.com.google.common.cache.CacheLoader}}
{{    at java.net.URLClassLoader.findClass(URLClassLoader.java:387)}}
{{    at java.lang.ClassLoader.loadClass(ClassLoader.java:418)}}
{{    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)}}
{{    at java.lang.ClassLoader.loadClass(ClassLoader.java:351)}}
{{    ... 11 more}}

```

I know the connect client does a bunch of shading during assembly, so it could be 
related to that. This application is not started via spark-submit or anything, 
nor is it run under a `SPARK_HOME` ( I guess that's the whole point of the 
connect client )

 

I followed the doc exactly as described. Can somebody help?

BTW it did work if I copied the exact shading rules in my project but I wonder 
if that's the right thing to do?

  was:
I have a very simple repo with the following dependency in `build.sbt`

```

{{libraryDependencies ++= Seq("org.apache.spark" %% "spark-connect-client-jvm" 
% "3.5.0")}}

```

A simple application

```

{{object Main extends App {}}
{{   val s = SparkSession.builder().remote("sc://localhost").getOrCreate()}}
{{   s.read.json("/tmp/input.json").repartition(10).show(false)}}
{{}}}

```

But when I run it, I get the following error

 

```

Exception in thread "main" java.lang.NoClassDefFoundError: 
org/sparkproject/connect/client/com/google/common/cache/CacheLoader
    at Main$.delayedEndpoint$Main$1(Main.scala:4)
    at Main$delayedInit$body.apply(Main.scala:3)
    at scala.Function0.apply$mcV$sp(Function0.scala:39)
    at scala.Function0.apply$mcV$sp$(Function0.scala:39)
    at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:17)
    at scala.App.$anonfun$main$1$adapted(App.scala:80)
    at scala.collection.immutable.List.foreach(List.scala:431)
    at scala.App.main(App.scala:80)
    at scala.App.main$(App.scala:78)
    at Main$.main(Main.scala:3)
    at Main.main(Main.scala)
Caused by: java.lang.ClassNotFoundException: 
org.sparkproject.connect.client.com.google.common.cache.CacheLoader
    at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
    ... 11 more

```

I know the connect does a bunch of shading during assembly so it could be 
related to that. This application is not started via spark-submit or anything. 
Neither under `SPARK_HOME` ( I guess that's the whole point of connect client )

 

I followed the doc exactly as described. Can somebody help?

BTW it did work if I copied the exact shading rules in my project but I wonder 
if that's the right thing to do?


> Spark connect client failing with java.lang.NoClassDefFoundError
> 
>
> Key: SPARK-45255
> URL: https://issues.apache.org/jira/browse/SPARK-45255
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Faiz Halde
>Priority: Major
>
> I have a very simple repo with the following dependency in `build.sbt`
> ```
> {{libraryDependencies ++= Seq("org.apache.spark" %% 
> "spark-connect-client-jvm" % "3.5.0")}}
> ```
> A simple application
> ```
> {{object Main extends App {}}
> {{   val s = SparkSession.builder().remote("sc://localhost").getOrCreate()}}
> {{   s.read.json("/tmp/input.json").repartition(10).show(false)}}
> {{}}}
> ```
> But when I run it, I get 

[jira] [Updated] (SPARK-44526) Porting k8s PVC reuse logic to spark standalone

2023-07-24 Thread Faiz Halde (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Faiz Halde updated SPARK-44526:
---
Description: 
Hi,

This ticket is meant to understand the work that would be involved in porting 
the k8s PVC reuse feature onto the spark standalone cluster manager which 
reuses the shuffle files present locally in the disk

We are a heavy user of spot instances and we suffer from spot terminations 
impacting our long running jobs

The logic in `KubernetesLocalDiskShuffleDataIO`
itself is not that much. However when I tried this on the 
`LocalDiskShuffleExecutorComponents` it was not a successful experiment which 
suggests there is more to it

I'd like to understand what will be the work involved for this. We'll be more 
than happy to contribute

  was:
Hi,

This ticket is meant to understand the work that would be involved in porting 
the k8s PVC reuse feature onto the spark standalone cluster manager which 
reuses the shuffle files present locally in the disk

We are a heavy user of spot instances and we suffer from spot terminations 
impacting our long running jobs

The logic in
KubernetesLocalDiskShuffleDataIO
itself is not that much. However when I tried this on the 
`LocalDiskShuffleExecutorComponents` it was not a successful experiment which 
suggests there is more to it

I'd like to understand what will be the work involved for this. We'll be more 
than happy to contribute


> Porting k8s PVC reuse logic to spark standalone
> ---
>
> Key: SPARK-44526
> URL: https://issues.apache.org/jira/browse/SPARK-44526
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle, Spark Core
>Affects Versions: 3.4.1
>Reporter: Faiz Halde
>Priority: Major
>
> Hi,
> This ticket is meant to understand the work that would be involved in porting 
> the k8s PVC reuse feature onto the spark standalone cluster manager which 
> reuses the shuffle files present locally in the disk
> We are a heavy user of spot instances and we suffer from spot terminations 
> impacting our long running jobs
> The logic in `KubernetesLocalDiskShuffleDataIO`
> itself is not that much. However when I tried this on the 
> `LocalDiskShuffleExecutorComponents` it was not a successful experiment which 
> suggests there is more to it
> I'd like to understand what will be the work involved for this. We'll be more 
> than happy to contribute



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44526) Porting k8s PVC reuse logic to spark standalone

2023-07-24 Thread Faiz Halde (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Faiz Halde updated SPARK-44526:
---
Description: 
Hi,

This ticket is meant to understand the work that would be involved in porting the 
k8s PVC reuse feature, which reuses the shuffle files already present on local 
disk, to the Spark standalone cluster manager

We are heavy users of spot instances and we suffer from spot terminations 
impacting our long-running jobs

The logic in `KubernetesLocalDiskShuffleExecutorComponents` itself is not that 
extensive. However, when I tried the same approach with 
`LocalDiskShuffleExecutorComponents`, the experiment was not successful, which 
suggests there is more to recovering shuffle files

I'd like to understand what work would be involved for this. We'd be more than 
happy to contribute
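
For context, a hedged sketch of how the k8s-side behaviour this ticket wants to port is enabled today (config keys as I understand them for Spark 3.4 on Kubernetes, usually passed via spark-submit --conf; treat them as assumptions to verify against the docs):

```scala
import org.apache.spark.sql.SparkSession

// Shuffle data recovery plugin plus PVC reuse on the k8s scheduler; a standalone
// port would presumably need an analogous plugin keyed off spark.local.dir.
val spark = SparkSession.builder()
  .config("spark.shuffle.sort.io.plugin.class",
    "org.apache.spark.shuffle.KubernetesLocalDiskShuffleDataIO")
  .config("spark.kubernetes.driver.reusePersistentVolumeClaim", "true")
  .config("spark.kubernetes.driver.ownPersistentVolumeClaim", "true")
  .getOrCreate()
```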

  was:
Hi,

This ticket is meant to understand the work that would be involved in porting 
the k8s PVC reuse feature onto the spark standalone cluster manager which 
reuses the shuffle files present locally in the disk

We are a heavy user of spot instances and we suffer from spot terminations 
impacting our long running jobs

The logic in `KubernetesLocalDiskShuffleDataIO`
itself is not that much. However when I tried this on the 
`LocalDiskShuffleExecutorComponents` it was not a successful experiment which 
suggests there is more to it

I'd like to understand what will be the work involved for this. We'll be more 
than happy to contribute


> Porting k8s PVC reuse logic to spark standalone
> ---
>
> Key: SPARK-44526
> URL: https://issues.apache.org/jira/browse/SPARK-44526
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle, Spark Core
>Affects Versions: 3.4.1
>Reporter: Faiz Halde
>Priority: Major
>
> Hi,
> This ticket is meant to understand the work that would be involved in porting 
> the k8s PVC reuse feature onto the spark standalone cluster manager which 
> reuses the shuffle files present locally in the disk
> We are a heavy user of spot instances and we suffer from spot terminations 
> impacting our long running jobs
> The logic in `KubernetesLocalDiskShuffleExecutorComponents` itself is not 
> that much. However when I tried this on the 
> `LocalDiskShuffleExecutorComponents` it was not a successful experiment which 
> suggests there is more to recovering shuffle files
> I'd like to understand what will be the work involved for this. We'll be more 
> than happy to contribute



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44526) Porting k8s PVC reuse logic to spark standalone

2023-07-24 Thread Faiz Halde (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Faiz Halde updated SPARK-44526:
---
Description: 
Hi,

This ticket is meant to understand the work that would be involved in porting 
the k8s PVC reuse feature onto the spark standalone cluster manager which 
reuses the shuffle files present locally in the disk

We are a heavy user of spot instances and we suffer from spot terminations 
impacting our long running jobs

The logic in
KubernetesLocalDiskShuffleDataIO
itself is not that much. However when I tried this on the 
`LocalDiskShuffleExecutorComponents` it was not a successful experiment which 
suggests there is more to it

I'd like to understand what will be the work involved for this. We'll be more 
than happy to contribute

  was:
Hi,

This ticket is meant to understand the work that would be involved in porting 
the PVC reuse feature onto the spark standalone cluster manager

 

The logic in
KubernetesLocalDiskShuffleDataIO
itself is not that much. However when I tried this on the 
`LocalDiskShuffleExecutorComponents` it was not a successful experiment which 
suggests there is more to it

I'd like to understand what will be the work involved for this. We'll be more 
than happy to contribute


> Porting k8s PVC reuse logic to spark standalone
> ---
>
> Key: SPARK-44526
> URL: https://issues.apache.org/jira/browse/SPARK-44526
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle, Spark Core
>Affects Versions: 3.4.1
>Reporter: Faiz Halde
>Priority: Major
>
> Hi,
> This ticket is meant to understand the work that would be involved in porting 
> the k8s PVC reuse feature onto the spark standalone cluster manager which 
> reuses the shuffle files present locally in the disk
> We are a heavy user of spot instances and we suffer from spot terminations 
> impacting our long running jobs
> The logic in
> KubernetesLocalDiskShuffleDataIO
> itself is not that much. However when I tried this on the 
> `LocalDiskShuffleExecutorComponents` it was not a successful experiment which 
> suggests there is more to it
> I'd like to understand what will be the work involved for this. We'll be more 
> than happy to contribute



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44526) Porting k8s PVC reuse logic to spark standalone

2023-07-24 Thread Faiz Halde (Jira)
Faiz Halde created SPARK-44526:
--

 Summary: Porting k8s PVC reuse logic to spark standalone
 Key: SPARK-44526
 URL: https://issues.apache.org/jira/browse/SPARK-44526
 Project: Spark
  Issue Type: New Feature
  Components: Shuffle, Spark Core
Affects Versions: 3.4.1
Reporter: Faiz Halde


Hi,

This ticket is meant to understand the work that would be involved in porting 
the PVC reuse feature onto the spark standalone cluster manager

 

The logic in `KubernetesLocalDiskShuffleDataIO` itself is not that extensive. 
However, when I tried the same approach with `LocalDiskShuffleExecutorComponents` 
the experiment was not successful, which suggests there is more to it

I'd like to understand what work would be involved for this. We'd be more than 
happy to contribute



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43426) Introducing a new SparkListenerEvent for jobs running in AQE mode

2023-05-09 Thread Faiz Halde (Jira)
Faiz Halde created SPARK-43426:
--

 Summary: Introducing a new SparkListenerEvent for jobs running in 
AQE mode
 Key: SPARK-43426
 URL: https://issues.apache.org/jira/browse/SPARK-43426
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 3.4.0
Reporter: Faiz Halde


Prior to AQE, we used to rely on SparkListenerJobEnd to figure out when a job had 
been completed from Spark's scheduler point of view

Unfortunately, with AQE what used to be a single job is split into multiple jobs, 
and there's no way for us to figure out when the overall job is completed

We have a listener that tracks some Spark metrics. In order to establish the 
right relations among stages, jobs, and tasks we had to keep a mapping around. We 
used the SparkListenerJobEnd event to clear out these entries; however, with AQE 
we cannot know the right time to remove the entries that we track

What suggestion would you have for this?

Would having a SparkListenerJobGroupEnd be a rational addition?

Because our jobs have a single action, we can probably rely on stage metrics, 
i.e. if a completed stage had output metrics > 0, that's a useful indicator that 
the next job end was likely the last job in that job group
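
A sketch of that heuristic as a listener follows; the "output bytes > 0 marks the last logical step" rule is the reporter's heuristic, not an API guarantee:

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerJobStart, SparkListenerStageCompleted}
import scala.collection.mutable

// Group jobs by their job-group id, remember whether any completed stage of that
// group wrote output, and treat the next job end after that as the likely end of
// the whole group.
class JobGroupEndHeuristic extends SparkListener {
  private val jobToGroup     = mutable.Map[Int, String]()
  private val stageToGroup   = mutable.Map[Int, String]()
  private val groupSawOutput = mutable.Set[String]()

  override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
    val group = Option(jobStart.properties)
      .flatMap(p => Option(p.getProperty("spark.jobGroup.id")))
      .getOrElse(s"job-${jobStart.jobId}")
    jobToGroup.update(jobStart.jobId, group)
    jobStart.stageIds.foreach(id => stageToGroup.update(id, group))
  }

  override def onStageCompleted(stageCompleted: SparkListenerStageCompleted): Unit = {
    val info = stageCompleted.stageInfo
    val wroteOutput = info.taskMetrics != null &&
      info.taskMetrics.outputMetrics.bytesWritten > 0
    if (wroteOutput) stageToGroup.get(info.stageId).foreach(groupSawOutput += _)
  }

  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit = {
    jobToGroup.remove(jobEnd.jobId).foreach { group =>
      val noMoreRunningJobs = !jobToGroup.valuesIterator.contains(group)
      if (groupSawOutput.contains(group) && noMoreRunningJobs) {
        // Heuristic "job group end": clean up whatever per-group state you track.
        groupSawOutput -= group
        val staleStages = stageToGroup.filter(_._2 == group).keys.toList
        stageToGroup --= staleStages
      }
    }
  }
}
```

A listener like this would be registered via sparkContext.addSparkListener or spark.extraListeners; whether the output-metrics signal is reliable enough is exactly the open question here.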



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43408) Spark caching in the context of a single job

2023-05-08 Thread Faiz Halde (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Faiz Halde updated SPARK-43408:
---
Description: 
Does caching benefit a spark job with only a single action in it? Spark IIRC 
already optimizes shuffles by persisting them onto the disk

I am unable to find a counter-example where caching would benefit a job with a 
single action. In every case I can think of, the shuffle checkpoint acts as a 
good enough caching mechanism in itself

FWIW, I am talking specifically in the context of the DataFrame API. The 
StorageLevel allowed in my case is DISK_ONLY, i.e. I am not looking to speed 
things up by caching data in memory

To rephrase: is DISK_ONLY caching better than, or the same as, shuffle 
checkpointing in the context of a single action?
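
To make the comparison concrete, a small sketch of the single-action shape in question (paths and column names are placeholders):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object SingleActionCaching extends App {
  val spark = SparkSession.builder().appName("disk-only-vs-shuffle").getOrCreate()

  // Placeholder input; the shuffle comes from the aggregation.
  val counts = spark.read.parquet("/tmp/input").groupBy("key").count()

  // The question: for a single action that reuses `counts`, does DISK_ONLY
  // persistence buy anything over relying on the shuffle files Spark already
  // persisted to local disk for the groupBy?
  val reused = counts.persist(StorageLevel.DISK_ONLY)

  // A single action that uses the intermediate result twice.
  reused.join(reused.select("key"), "key")
    .write.mode("overwrite").parquet("/tmp/output")

  spark.stop()
}
```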

  was:
Does caching benefit a spark job with only a single action in it? Spark IIRC 
already optimizes shuffles by persisting them onto the disk

I am unable to find a counter-example where caching would benefit a job with a 
single action. In every case I can think of, the shuffle checkpoint acts as a 
good enough caching mechanism in itself

FWIW, I am talking specifically in the context of the Dataframe API. The 
StorageLevel allowed in my case would only be DISK_ONLY i.e. I am not looking 
to speed up by caching data in memory


> Spark caching in the context of a single job
> 
>
> Key: SPARK-43408
> URL: https://issues.apache.org/jira/browse/SPARK-43408
> Project: Spark
>  Issue Type: Question
>  Components: Shuffle
>Affects Versions: 3.3.1
>Reporter: Faiz Halde
>Priority: Trivial
>
> Does caching benefit a spark job with only a single action in it? Spark IIRC 
> already optimizes shuffles by persisting them onto the disk
> I am unable to find a counter-example where caching would benefit a job with 
> a single action. In every case I can think of, the shuffle checkpoint acts as 
> a good enough caching mechanism in itself
> FWIW, I am talking specifically in the context of the Dataframe API. The 
> StorageLevel allowed in my case is DISK_ONLY i.e. I am not looking to speed 
> up by caching data in memory
> To rephrase, is DISK_ONLY caching better or same as shuffle checkpointing in 
> the context of a single action



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43408) Spark caching in the context of a single job

2023-05-08 Thread Faiz Halde (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Faiz Halde updated SPARK-43408:
---
Description: 
Does caching benefit a spark job with only a single action in it? Spark IIRC 
already optimizes shuffles by persisting them onto the disk

I am unable to find a counter-example where caching would benefit a job with a 
single action. In every case I can think of, the shuffle checkpoint acts as a 
good enough caching mechanism in itself

FWIW, I am talking specifically in the context of the Dataframe API. The 
StorageLevel allowed in my case would only be DISK_ONLY i.e. I am not looking 
to speed up by caching data in memory

  was:
Does caching benefit a spark job with only a single action in it? Spark IIRC 
already optimizes shuffles by persisting them onto the disk

I am unable to find a counter-example where caching would benefit a job with a 
single action. In every case I can think of, the shuffle checkpoint acts as a 
good enough caching mechanism in itself

FWIW, I am talking specifically in the context of the Dataframe API


> Spark caching in the context of a single job
> 
>
> Key: SPARK-43408
> URL: https://issues.apache.org/jira/browse/SPARK-43408
> Project: Spark
>  Issue Type: Question
>  Components: Shuffle
>Affects Versions: 3.3.1
>Reporter: Faiz Halde
>Priority: Trivial
>
> Does caching benefit a spark job with only a single action in it? Spark IIRC 
> already optimizes shuffles by persisting them onto the disk
> I am unable to find a counter-example where caching would benefit a job with 
> a single action. In every case I can think of, the shuffle checkpoint acts as 
> a good enough caching mechanism in itself
> FWIW, I am talking specifically in the context of the Dataframe API. The 
> StorageLevel allowed in my case would only be DISK_ONLY i.e. I am not looking 
> to speed up by caching data in memory



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43408) Spark caching in the context of a single job

2023-05-08 Thread Faiz Halde (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Faiz Halde updated SPARK-43408:
---
Description: 
Does caching benefit a spark job with only a single action in it? Spark IIRC 
already optimizes shuffles by persisting them onto the disk

I am unable to find a counter-example where caching would benefit a job with a 
single action. In every case I can think of, the shuffle checkpoint acts as a 
good enough caching mechanism in itself

FWIW, I am talking specifically in the context of the Dataframe API

  was:
Does caching benefit a spark job with only a single action in it? Spark IIRC 
already optimizes shuffles by persisting them onto the disk

I am unable to find a counter-example where caching would benefit a job with a 
single action. In every case I can think of, the shuffle checkpoint acts as a 
good enough caching mechanism in itself


> Spark caching in the context of a single job
> 
>
> Key: SPARK-43408
> URL: https://issues.apache.org/jira/browse/SPARK-43408
> Project: Spark
>  Issue Type: Question
>  Components: Shuffle
>Affects Versions: 3.3.1
>Reporter: Faiz Halde
>Priority: Trivial
>
> Does caching benefit a spark job with only a single action in it? Spark IIRC 
> already optimizes shuffles by persisting them onto the disk
> I am unable to find a counter-example where caching would benefit a job with 
> a single action. In every case I can think of, the shuffle checkpoint acts as 
> a good enough caching mechanism in itself
> FWIW, I am talking specifically in the context of the Dataframe API



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43408) Spark caching in the context of a single job

2023-05-08 Thread Faiz Halde (Jira)
Faiz Halde created SPARK-43408:
--

 Summary: Spark caching in the context of a single job
 Key: SPARK-43408
 URL: https://issues.apache.org/jira/browse/SPARK-43408
 Project: Spark
  Issue Type: Question
  Components: Shuffle
Affects Versions: 3.3.1
Reporter: Faiz Halde


Does caching benefit a spark job with only a single action in it? Spark IIRC 
already optimizes shuffles by persisting them onto the disk

I am unable to find a counter-example where caching would benefit a job with a 
single action. In every case I can think of, the shuffle checkpoint acts as a 
good enough caching mechanism in itself



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43407) Can executors recover/reuse shuffle files upon failure?

2023-05-08 Thread Faiz Halde (Jira)
Faiz Halde created SPARK-43407:
--

 Summary: Can executors recover/reuse shuffle files upon failure?
 Key: SPARK-43407
 URL: https://issues.apache.org/jira/browse/SPARK-43407
 Project: Spark
  Issue Type: Question
  Components: Spark Core
Affects Versions: 3.3.1
Reporter: Faiz Halde


Hello,

We've been in touch with a few Spark specialists who suggested a potential 
solution to us for improving the reliability of our shuffle-heavy jobs

Here is what our setup looks like
 * Spark version: 3.3.1
 * Java version: 1.8
 * We do not use external shuffle service
 * We use spot instances

We run Spark jobs on clusters that use Amazon EBS volumes. The spark.local.dir 
is mounted on this EBS volume. One of the offerings from the service we use is 
EBS migration, which basically means that if a host is about to be evicted, a 
new host is created and the EBS volume is attached to it

When Spark assigns a new executor to the newly created instance, it can basically 
recover all the shuffle files that are already persisted on the migrated EBS 
volume

Is this how it works? Do executors recover / re-register the shuffle files that 
they find?

So far I have not come across any recovery mechanism. I can only see 
{noformat}
KubernetesLocalDiskShuffleDataIO{noformat}
 which has a pre-init step where it tries to register the available shuffle 
files with itself

A natural follow-up on this:

If what they claim is true, then ideally we should expect that when an executor 
is killed or OOM'd and a new executor is spawned on the same host, the new 
executor registers the existing shuffle files with itself. Is that so?

Thanks



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org