Re: Re: EXT: Dual Write to HDFS and MinIO in faster way

2024-05-21 Thread eab...@163.com
Hi,
I think you should write to HDFS first, then copy the files (Parquet or ORC) from HDFS
to MinIO.
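For example, something like this (a rough sketch; the paths, bucket, and MinIO
endpoint are placeholders, and it assumes the s3a connector is configured to point
at MinIO and that df is the DataFrame produced by your transformations):

```python
import subprocess

# 1) Materialize the result once on HDFS.
df.write.mode("overwrite").parquet("hdfs:///tmp/out/my_table")

# 2) Copy the finished Parquet files to MinIO with DistCp (s3a pointed at MinIO).
subprocess.run([
    "hadoop", "distcp",
    "-Dfs.s3a.endpoint=http://minio:9000",
    "-Dfs.s3a.path.style.access=true",
    "hdfs:///tmp/out/my_table",
    "s3a://my-bucket/my_table",
], check=True)
```

The copy is then a pure file transfer, so the transformations only run once.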



eabour
 
From: Prem Sahoo
Date: 2024-05-22 00:38
To: Vibhor Gupta; user
Subject: Re: EXT: Dual Write to HDFS and MinIO in faster way


On Tue, May 21, 2024 at 6:58 AM Prem Sahoo  wrote:
Hello Vibhor,
Thanks for the suggestion.
I am looking for other alternatives where the same dataframe can be
written to two destinations without re-execution and without cache or persist.

Can someone help me with scenario 2?
How can we make Spark write to MinIO faster?
Sent from my iPhone

On May 21, 2024, at 1:18 AM, Vibhor Gupta  wrote:

 
Hi Prem,
 
You can try to write to HDFS then read from HDFS and write to MinIO.
 
This will prevent duplicate transformation.
 
You can also try persisting the dataframe using the DISK_ONLY level.
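For example (an untested sketch; the input/output paths and the s3a settings for
MinIO are placeholders):

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dual-write").getOrCreate()

df = (spark.read.parquet("hdfs:///data/in")      # placeholder input path
           .filter("some_col IS NOT NULL"))      # your transformations

df.persist(StorageLevel.DISK_ONLY)               # materialize to local disk only
df.count()                                       # force the materialization once

df.write.mode("overwrite").parquet("hdfs:///data/out")
df.write.mode("overwrite").parquet("s3a://my-bucket/data/out")  # MinIO via s3a

df.unpersist()
```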
 
Regards,
Vibhor
From: Prem Sahoo 
Date: Tuesday, 21 May 2024 at 8:16 AM
To: Spark dev list 
Subject: EXT: Dual Write to HDFS and MinIO in faster way
EXTERNAL: Report suspicious emails to Email Abuse.
Hello Team,
I am planning to write to two data sources at the same time.
 
Scenario:-
 
Writing the same dataframe to HDFS and MinIO without re-executing the
transformations and without cache(). How can we make it faster?
 
Read a Parquet file, do a few transformations, and write to both HDFS and MinIO.
 
Here, for both writes, Spark needs to execute the transformations again. Do we know
how we can avoid re-executing the transformations without cache()/persist()?
 
Scenario2 :-
I am writing 3.2 GB of data to HDFS and MinIO, which takes ~6 minutes.
Is there any way to make this write faster?
 
I don't want to repartition before writing, as repartitioning has the overhead of
shuffling.
 
Please provide some inputs.
 
 


Re: Re: [EXTERNAL] Re: Spark-submit without access to HDFS

2023-11-15 Thread eab...@163.com
Hi Eugene,
  As the logs indicate, when executing spark-submit, Spark will package and
upload spark/conf to HDFS, along with spark/jars. These files are uploaded to
HDFS unless you configure them to be uploaded to another object store (OSS); to do
so, you'll need to modify the configuration, for instance fs.oss.impl in
hdfs-site.xml, etc.



eabour
 
From: Eugene Miretsky
Date: 2023-11-16 09:58
To: eab...@163.com
CC: Eugene Miretsky; user @spark
Subject: Re: [EXTERNAL] Re: Spark-submit without access to HDFS
Hey! 

Thanks for the response. 

We are getting the error because there is no network connectivity to the data 
nodes - that's expected. 

What I am trying to find out is WHY we need access to the data nodes, and if 
there is a way to submit a job without it. 

Cheers,
Eugene

On Wed, Nov 15, 2023 at 7:32 PM eab...@163.com  wrote:
Hi Eugene,
I think you should check whether the HDFS service is running properly. From the
logs, it appears that there are two datanodes in HDFS, but none of them are
healthy. Please investigate why the datanodes are not functioning
properly; it seems the issue might be due to insufficient disk space.



eabour
 
From: Eugene Miretsky
Date: 2023-11-16 05:31
To: user
Subject: Spark-submit without access to HDFS
Hey All, 

We are running Pyspark spark-submit from a client outside the cluster. The 
client has network connectivity only to the Yarn Master, not the HDFS 
Datanodes. How can we submit the jobs? The idea would be to preload all the 
dependencies (job code, libraries, etc) to HDFS, and just submit the job from 
the client. 

We tried something like this
'PYSPARK_ARCHIVES_PATH=hdfs://some-path/pyspark.zip spark-submit --master yarn 
--deploy-mode cluster --py-files hdfs://yarn-master-url hdfs://foo.py'

The error we are getting is 
"
org.apache.hadoop.net.ConnectTimeoutException: 6 millis timeout while 
waiting for channel to be ready for connect. ch : 
java.nio.channels.SocketChannel[connection-pending remote=/10.117.110.19:9866]
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File 
/user/users/.sparkStaging/application_1698216436656_0104/spark_conf.zip could 
only be written to 0 of the 1 minReplication nodes. There are 2 datanode(s) 
running and 2 node(s) are excluded in this operation.
" 

A few questions:
1) What are the spark_conf.zip files? Are they the hive-site/yarn-site conf files?
Why would the client send them to the cluster? (The cluster already has all
that info - this would make sense in client mode, but not cluster mode.)
2) Is it possible to use spark-submit without HDFS access?
3) How would we fix this?

Cheers,
Eugene

-- 

Eugene Miretsky
Managing Partner |  Badal.io | Book a meeting /w me! 
mobile:  416-568-9245
email: eug...@badal.io


-- 

Eugene Miretsky
Managing Partner |  Badal.io | Book a meeting /w me! 
mobile:  416-568-9245
email: eug...@badal.io


Re: Spark-submit without access to HDFS

2023-11-15 Thread eab...@163.com
Hi Eugene,
I think you should check whether the HDFS service is running properly. From the
logs, it appears that there are two datanodes in HDFS, but none of them are
healthy. Please investigate why the datanodes are not functioning
properly; it seems the issue might be due to insufficient disk space.



eabour
 
From: Eugene Miretsky
Date: 2023-11-16 05:31
To: user
Subject: Spark-submit without access to HDFS
Hey All, 

We are running Pyspark spark-submit from a client outside the cluster. The 
client has network connectivity only to the Yarn Master, not the HDFS 
Datanodes. How can we submit the jobs? The idea would be to preload all the 
dependencies (job code, libraries, etc) to HDFS, and just submit the job from 
the client. 

We tried something like this
'PYSPARK_ARCHIVES_PATH=hdfs://some-path/pyspark.zip spark-submit --master yarn 
--deploy-mode cluster --py-files hdfs://yarn-master-url hdfs://foo.py'

The error we are getting is 
"
org.apache.hadoop.net.ConnectTimeoutException: 6 millis timeout while 
waiting for channel to be ready for connect. ch : 
java.nio.channels.SocketChannel[connection-pending remote=/10.117.110.19:9866]
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File 
/user/users/.sparkStaging/application_1698216436656_0104/spark_conf.zip could 
only be written to 0 of the 1 minReplication nodes. There are 2 datanode(s) 
running and 2 node(s) are excluded in this operation.
" 

A few questions:
1) What are the spark_conf.zip files? Are they the hive-site/yarn-site conf files?
Why would the client send them to the cluster? (The cluster already has all
that info - this would make sense in client mode, but not cluster mode.)
2) Is it possible to use spark-submit without HDFS access?
3) How would we fix this?

Cheers,
Eugene

-- 

Eugene Miretsky
Managing Partner |  Badal.io | Book a meeting /w me! 
mobile:  416-568-9245
email: eug...@badal.io


Re: Re: jackson-databind version mismatch

2023-11-02 Thread eab...@163.com
Hi,
But in fact, it does have those packages.

 D:\02_bigdata\spark-3.5.0-bin-hadoop3\jars

2023/09/09  10:08        75,567 jackson-annotations-2.15.2.jar
2023/09/09  10:08       549,207 jackson-core-2.15.2.jar
2023/09/09  10:08       232,248 jackson-core-asl-1.9.13.jar
2023/09/09  10:08     1,620,088 jackson-databind-2.15.2.jar
2023/09/09  10:08        54,630 jackson-dataformat-yaml-2.15.2.jar
2023/09/09  10:08       122,937 jackson-datatype-jsr310-2.15.2.jar
2023/09/09  10:08       780,664 jackson-mapper-asl-1.9.13.jar
2023/09/09  10:08       513,968 jackson-module-scala_2.12-2.15.2.jar



eabour
 
From: Bjørn Jørgensen
Date: 2023-11-02 16:40
To: eab...@163.com
CC: user @spark; Saar Barhoom; moshik.vitas
Subject: Re: jackson-databind version mismatch
[SPARK-43225][BUILD][SQL] Remove jackson-core-asl and jackson-mapper-asl from 
pre-built distribution

On Thu, 2 Nov 2023 at 09:15, Bjørn Jørgensen wrote:
In Spark 3.5.0, jackson-core-asl and jackson-mapper-asl were removed; those have
groupId org.codehaus.jackson.

The other jackson-* jars have groupId com.fasterxml.jackson.core.


On Thu, 2 Nov 2023 at 01:43, eab...@163.com wrote:
Hi,
Please check the versions of the jar files starting with "jackson-" and make sure
all versions are consistent. The Jackson jar list in spark-3.3.0:

2022/06/10  04:37        75,714 jackson-annotations-2.13.3.jar
2022/06/10  04:37       374,895 jackson-core-2.13.3.jar
2022/06/10  04:37       232,248 jackson-core-asl-1.9.13.jar
2022/06/10  04:37     1,536,542 jackson-databind-2.13.3.jar
2022/06/10  04:37        52,020 jackson-dataformat-yaml-2.13.3.jar
2022/06/10  04:37       121,201 jackson-datatype-jsr310-2.13.3.jar
2022/06/10  04:37       780,664 jackson-mapper-asl-1.9.13.jar
2022/06/10  04:37       458,981 jackson-module-scala_2.12-2.13.3.jar

Spark 3.3.0 uses Jackson 2.13.3, while Spark 3.5.0 uses Jackson 2.15.2. I think
you can remove the lower-versioned Jackson jars to keep the versions consistent.
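If it helps, a small script like the one below can list the Jackson jars in the
Spark distribution and highlight a version mix (a sketch; it assumes SPARK_HOME
points at your Spark installation):

```python
import os
import re
from collections import defaultdict

jars_dir = os.path.join(os.environ.get("SPARK_HOME", "."), "jars")
by_version = defaultdict(list)

# Group every jackson-*.jar by its version suffix.
for name in sorted(os.listdir(jars_dir)):
    m = re.match(r"(jackson-.+?)-(\d[\w.]*)\.jar$", name)
    if m:
        artifact, version = m.groups()
        by_version[version].append(artifact)

for version, artifacts in sorted(by_version.items()):
    print(version, "->", ", ".join(artifacts))
# More than one com.fasterxml version in the output usually means a conflicting
# jar was pulled in by the application classpath.
```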
eabour
 
From: moshik.vi...@veeva.com.INVALID
Date: 2023-11-01 15:03
To: user@spark.apache.org
CC: 'Saar Barhoom'
Subject: jackson-databind version mismatch
Hi Spark team,
 
On upgrading the Spark version from 3.2.1 to 3.4.1, we got the following issue:
java.lang.NoSuchMethodError: 'com.fasterxml.jackson.core.JsonGenerator 
com.fasterxml.jackson.databind.ObjectMapper.createGenerator(java.io.OutputStream,
 com.fasterxml.jackson.core.JsonEncoding)'
at 
org.apache.spark.util.JsonProtocol$.toJsonString(JsonProtocol.scala:75)
at 
org.apache.spark.SparkThrowableHelper$.getMessage(SparkThrowableHelper.scala:74)
at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$7(SQLExecution.scala:127)
at scala.Option.map(Option.scala:230)
at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:125)
at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:195)
at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:103)
at 
org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:827)
at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:4165)
at org.apache.spark.sql.Dataset.head(Dataset.scala:3161)
at org.apache.spark.sql.Dataset.take(Dataset.scala:3382)
at org.apache.spark.sql.Dataset.takeAsList(Dataset.scala:3405)
at 
com.crossix.safemine.cloud.utils.DebugRDDLogger.showDataset(DebugRDDLogger.java:84)
at 
com.crossix.safemine.cloud.components.statistics.spark.StatisticsTransformer.getFillRateCountsWithSparkQuery(StatisticsTransformer.java:122)
at 
com.crossix.safemine.cloud.components.statistics.spark.StatisticsTransformer.calculateStatistics(StatisticsTransformer.java:61)
at 
com.crossix.safemine.cloud.components.statistics.spark.SparkFileStatistics.execute(SparkFileStatistics.java:102)
at 
com.crossix.safemine.cloud.StatisticsFlow.calculateAllStatistics(StatisticsFlow.java:146)
at 
com.crossix.safemine.cloud.StatisticsFlow.runStatistics(StatisticsFlow.java:119)
at 
com.crossix.safemine.cloud.StatisticsFlow.initialFileStatistics(StatisticsFlow.java:77)
at com.crossix.safemine.cloud.SMCFlow.process(SMCFlow.java:221)
at com.crossix.safemine.cloud.SMCFlow.execute(SMCFlow.java:132)
at com.crossix.safemine.cloud.SMCFlow.run(SMCFlow.java:91)

I see that the Spark package contains the dependency:
com.

Re: jackson-databind version mismatch

2023-11-01 Thread eab...@163.com
Hi,
Please check the versions of the jar files starting with "jackson-" and make sure
all versions are consistent. The Jackson jar list in spark-3.3.0:

2022/06/10  04:37        75,714 jackson-annotations-2.13.3.jar
2022/06/10  04:37       374,895 jackson-core-2.13.3.jar
2022/06/10  04:37       232,248 jackson-core-asl-1.9.13.jar
2022/06/10  04:37     1,536,542 jackson-databind-2.13.3.jar
2022/06/10  04:37        52,020 jackson-dataformat-yaml-2.13.3.jar
2022/06/10  04:37       121,201 jackson-datatype-jsr310-2.13.3.jar
2022/06/10  04:37       780,664 jackson-mapper-asl-1.9.13.jar
2022/06/10  04:37       458,981 jackson-module-scala_2.12-2.13.3.jar

Spark 3.3.0 uses Jackson 2.13.3, while Spark 3.5.0 uses Jackson 2.15.2. I think
you can remove the lower-versioned Jackson jars to keep the versions consistent.
eabour
 
From: moshik.vi...@veeva.com.INVALID
Date: 2023-11-01 15:03
To: user@spark.apache.org
CC: 'Saar Barhoom'
Subject: jackson-databind version mismatch
Hi Spark team,
 
On upgrading the Spark version from 3.2.1 to 3.4.1, we got the following issue:
java.lang.NoSuchMethodError: 'com.fasterxml.jackson.core.JsonGenerator 
com.fasterxml.jackson.databind.ObjectMapper.createGenerator(java.io.OutputStream,
 com.fasterxml.jackson.core.JsonEncoding)'
at 
org.apache.spark.util.JsonProtocol$.toJsonString(JsonProtocol.scala:75)
at 
org.apache.spark.SparkThrowableHelper$.getMessage(SparkThrowableHelper.scala:74)
at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$7(SQLExecution.scala:127)
at scala.Option.map(Option.scala:230)
at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:125)
at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:195)
at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:103)
at 
org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:827)
at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:4165)
at org.apache.spark.sql.Dataset.head(Dataset.scala:3161)
at org.apache.spark.sql.Dataset.take(Dataset.scala:3382)
at org.apache.spark.sql.Dataset.takeAsList(Dataset.scala:3405)
at 
com.crossix.safemine.cloud.utils.DebugRDDLogger.showDataset(DebugRDDLogger.java:84)
at 
com.crossix.safemine.cloud.components.statistics.spark.StatisticsTransformer.getFillRateCountsWithSparkQuery(StatisticsTransformer.java:122)
at 
com.crossix.safemine.cloud.components.statistics.spark.StatisticsTransformer.calculateStatistics(StatisticsTransformer.java:61)
at 
com.crossix.safemine.cloud.components.statistics.spark.SparkFileStatistics.execute(SparkFileStatistics.java:102)
at 
com.crossix.safemine.cloud.StatisticsFlow.calculateAllStatistics(StatisticsFlow.java:146)
at 
com.crossix.safemine.cloud.StatisticsFlow.runStatistics(StatisticsFlow.java:119)
at 
com.crossix.safemine.cloud.StatisticsFlow.initialFileStatistics(StatisticsFlow.java:77)
at com.crossix.safemine.cloud.SMCFlow.process(SMCFlow.java:221)
at com.crossix.safemine.cloud.SMCFlow.execute(SMCFlow.java:132)
at com.crossix.safemine.cloud.SMCFlow.run(SMCFlow.java:91)

I see that the Spark package contains the dependency:
com.fasterxml.jackson.core:jackson-databind:jar:2.10.5:compile
 
But jackson-databind 2.10.5 does not contain
ObjectMapper.createGenerator(java.io.OutputStream,
com.fasterxml.jackson.core.JsonEncoding);
it was added in 2.11.0.
 
Trying to upgrade jackson-databind fails with:
com.fasterxml.jackson.databind.JsonMappingException: Scala module 2.10.5 
requires Jackson Databind version >= 2.10.0 and < 2.11.0
 
According to the Spark 3.3.0 release notes ("Upgrade Jackson to 2.13.3",
https://spark.apache.org/releases/spark-release-3-3-0.html), Jackson was upgraded,
but the Spark 3.4.1 package contains Jackson 2.10.5.
What am I missing?
 
--
Moshik Vitas
Senior Software Developer, Crossix
Veeva Systems
m: +972-54-5326-400
moshik.vi...@veeva.com
 


[Resolved] Re: spark.stop() cannot stop spark connect session

2023-10-25 Thread eab...@163.com
Hi all.
I read the source code (spark/python/pyspark/sql/connect/session.py at master in
apache/spark on GitHub), and the comment for the "stop" method reads as follows:
def stop(self) -> None:
    # Stopping the session will only close the connection to the current session (and
    # the life cycle of the session is maintained by the server),
    # whereas the regular PySpark session immediately terminates the Spark Context
    # itself, meaning that stopping all Spark sessions.
    # It is controversial to follow the existing the regular Spark session's behavior
    # specifically in Spark Connect the Spark Connect server is designed for
    # multi-tenancy - the remote client side cannot just stop the server and stop
    # other remote clients being used from other users.
 
So, that's how it was designed.
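In other words, stop() only drops the client connection, and you can simply open a
new remote session afterwards. A minimal sketch (using the host from my earlier
mail):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://172.29.190.147").getOrCreate()
spark.range(3).show()

spark.stop()  # closes only this client's connection; the server keeps running

# Reconnect with a fresh client session (DataFrames from the old session are gone).
spark = SparkSession.builder.remote("sc://172.29.190.147").getOrCreate()
spark.range(3).show()
```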


eabour
 
From: eab...@163.com
Date: 2023-10-20 15:56
To: user @spark
Subject: spark.stop() cannot stop spark connect session
Hi,
my code:
from pyspark.sql import SparkSession
spark = SparkSession.builder.remote("sc://172.29.190.147").getOrCreate()

import pandas as pd
# create a pandas DataFrame
pdf = pd.DataFrame({
"name": ["Alice", "Bob", "Charlie"],
"age": [25, 30, 35],
"gender": ["F", "M", "M"]
})

# convert the pandas DataFrame to a Spark DataFrame
sdf = spark.createDataFrame(pdf)

# show the Spark DataFrame
sdf.show()

spark.stop()

After stop(), executing sdf.show() throws
pyspark.errors.exceptions.connect.SparkConnectException: [NO_ACTIVE_SESSION] No 
active Spark session found. Please create a new Spark session before running 
the code. Visit the Spark web UI at http://172.29.190.147:4040/connect/ to 
check if the current session is still running and has not been stopped yet.
The Spark Connect web UI still shows the session as online:

1 session(s) are online, running 0 Request(s)

Session Statistics (1)
User | Session ID                           | Start Time          | Finish Time | Duration              | Total Execute
     | 29f05cde-8f8b-418d-95c0-8dbbbfb556d2 | 2023/10/20 15:30:04 |             | 14 minutes 49 seconds | 2


eabour


submitting tasks failed in Spark standalone mode due to missing failureaccess jar file

2023-10-23 Thread eab...@163.com
Hi Team.
I use Spark 3.5.0 to start a Spark cluster with start-master.sh and
start-worker.sh. When I run ./bin/spark-shell --master
spark://LAPTOP-TC4A0SCV.:7077, I get the following error log:
```
23/10/24 12:00:46 ERROR TaskSchedulerImpl: Lost an executor 1 (already 
removed): Command exited with code 50
```
  The worker's logs for the finished executor:
```
Spark Executor Command: 
"/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.372.b07-1.el7_9.x86_64/jre/bin/java" 
"-cp" 
"/root/spark-3.5.0-bin-hadoop3/conf/:/root/spark-3.5.0-bin-hadoop3/jars/*" 
"-Xmx1024M" "-Dspark.driver.port=43765" "-Djava.net.preferIPv6Addresses=false" 
"-XX:+IgnoreUnrecognizedVMOptions" 
"--add-opens=java.base/java.lang=ALL-UNNAMED" 
"--add-opens=java.base/java.lang.invoke=ALL-UNNAMED" 
"--add-opens=java.base/java.lang.reflect=ALL-UNNAMED" 
"--add-opens=java.base/java.io=ALL-UNNAMED" 
"--add-opens=java.base/java.net=ALL-UNNAMED" 
"--add-opens=java.base/java.nio=ALL-UNNAMED" 
"--add-opens=java.base/java.util=ALL-UNNAMED" 
"--add-opens=java.base/java.util.concurrent=ALL-UNNAMED" 
"--add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED" 
"--add-opens=java.base/sun.nio.ch=ALL-UNNAMED" 
"--add-opens=java.base/sun.nio.cs=ALL-UNNAMED" 
"--add-opens=java.base/sun.security.action=ALL-UNNAMED" 
"--add-opens=java.base/sun.util.calendar=ALL-UNNAMED" 
"--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED" 
"-Djdk.reflect.useDirectMethodHandle=false" 
"org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" 
"spark://CoarseGrainedScheduler@172.29.190.147:43765" "--executor-id" "0" 
"--hostname" "172.29.190.147" "--cores" "6" "--app-id" 
"app-20231024120037-0001" "--worker-url" "spark://Worker@172.29.190.147:34707" 
"--resourceProfileId" "0"

Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
23/10/24 12:00:39 INFO CoarseGrainedExecutorBackend: Started daemon with 
process name: 19535@LAPTOP-TC4A0SCV
23/10/24 12:00:39 INFO SignalUtils: Registering signal handler for TERM
23/10/24 12:00:39 INFO SignalUtils: Registering signal handler for HUP
23/10/24 12:00:39 INFO SignalUtils: Registering signal handler for INT
23/10/24 12:00:39 WARN Utils: Your hostname, LAPTOP-TC4A0SCV resolves to a 
loopback address: 127.0.1.1; using 172.29.190.147 instead (on interface eth0)
23/10/24 12:00:39 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another 
address
23/10/24 12:00:42 INFO CoarseGrainedExecutorBackend: Successfully registered 
with driver
23/10/24 12:00:42 INFO Executor: Starting executor ID 0 on host 172.29.190.147
23/10/24 12:00:42 INFO Executor: OS info Linux, 
5.15.123.1-microsoft-standard-WSL2, amd64
23/10/24 12:00:42 INFO Executor: Java version 1.8.0_372
23/10/24 12:00:42 INFO Utils: Successfully started service 
'org.apache.spark.network.netty.NettyBlockTransferService' on port 35227.
23/10/24 12:00:42 INFO NettyBlockTransferService: Server created on 
172.29.190.147:35227
23/10/24 12:00:42 INFO BlockManager: Using 
org.apache.spark.storage.RandomBlockReplicationPolicy for block replication 
policy
23/10/24 12:00:42 ERROR Inbox: An error happened while processing message in 
the inbox for Executor
java.lang.NoClassDefFoundError: 
org/sparkproject/guava/util/concurrent/internal/InternalFutureFailureAccess
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:756)
at 
java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:473)
at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:756)
at 
java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:473)
at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
at java.lang.ClassLoader.d

spark.stop() cannot stop spark connect session

2023-10-20 Thread eab...@163.com
Hi,
my code:
from pyspark.sql import SparkSession
spark = SparkSession.builder.remote("sc://172.29.190.147").getOrCreate()

import pandas as pd
# create a pandas DataFrame
pdf = pd.DataFrame({
"name": ["Alice", "Bob", "Charlie"],
"age": [25, 30, 35],
"gender": ["F", "M", "M"]
})

# convert the pandas DataFrame to a Spark DataFrame
sdf = spark.createDataFrame(pdf)

# show the Spark DataFrame
sdf.show()

spark.stop()

After stop(), executing sdf.show() throws
pyspark.errors.exceptions.connect.SparkConnectException: [NO_ACTIVE_SESSION] No 
active Spark session found. Please create a new Spark session before running 
the code. Visit the Spark web UI at http://172.29.190.147:4040/connect/ to 
check if the current session is still running and has not been stopped yet.
The Spark Connect web UI still shows the session as online:

1 session(s) are online, running 0 Request(s)

Session Statistics (1)
User | Session ID                           | Start Time          | Finish Time | Duration              | Total Execute
     | 29f05cde-8f8b-418d-95c0-8dbbbfb556d2 | 2023/10/20 15:30:04 |             | 14 minutes 49 seconds | 2


eabour


Re: Re: Running Spark Connect Server in Cluster Mode on Kubernetes

2023-10-19 Thread eab...@163.com
Hi,
I have found three important classes:
org.apache.spark.sql.connect.service.SparkConnectServer: the
./sbin/start-connect-server.sh script uses SparkConnectServer as its main class.
In its main function, it creates a local session with
SparkSession.builder.getOrCreate() and starts SparkConnectService.
org.apache.spark.sql.connect.SparkConnectPlugin: to enable Spark Connect, simply
make sure that the appropriate JAR is available in the CLASSPATH and the driver
plugin is configured to load this class.
org.apache.spark.sql.connect.SimpleSparkConnectService: a simple main class to
start the Spark Connect server as a service for client tests.

   So, I believe that by configuring spark.plugins and starting the Spark
cluster on Kubernetes, clients can use sc://ip:port to connect to the
remote server.
   Let me give it a try.
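If that works, the client side should only need the remote URL, something like this
(a sketch; the driver host is a placeholder and 15002 is the default Connect port):

```python
from pyspark.sql import SparkSession

# Assumes the driver was started with
#   --conf spark.plugins=org.apache.spark.sql.connect.SparkConnectPlugin
# and that the Connect gRPC port is reachable from the client.
spark = SparkSession.builder.remote("sc://<driver-host>:15002").getOrCreate()
spark.range(5).show()
```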



eabour
 
From: eab...@163.com
Date: 2023-10-19 14:28
To: Nagatomi Yasukazu; user @spark
Subject: Re: Re: Running Spark Connect Server in Cluster Mode on Kubernetes
Hi all,
Has the functionality to run the Spark Connect server on k8s been implemented yet?



 
From: Nagatomi Yasukazu
Date: 2023-09-05 17:51
To: user
Subject: Re: Running Spark Connect Server in Cluster Mode on Kubernetes
Dear Spark Community,

I've been exploring the capabilities of the Spark Connect Server and 
encountered an issue when trying to launch it in a cluster deploy mode with 
Kubernetes as the master.

While initiating the `start-connect-server.sh` script with the `--conf` 
parameter for `spark.master` and `spark.submit.deployMode`, I was met with an 
error message:

```
Exception in thread "main" org.apache.spark.SparkException: Cluster deploy mode 
is not applicable to Spark Connect server.
```

This error message can be traced back to Spark's source code here:
https://github.com/apache/spark/blob/6c885a7cf57df328b03308cff2eed814bda156e4/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L307

Given my observations, I'm curious about the Spark Connect Server roadmap:

Is there a plan or current conversation to enable Kubernetes as a master in 
Spark Connect Server's cluster deploy mode?

I have tried to gather information from existing JIRA tickets, but have not 
been able to get a definitive answer:

https://issues.apache.org/jira/browse/SPARK-42730
https://issues.apache.org/jira/browse/SPARK-39375
https://issues.apache.org/jira/browse/SPARK-44117

Any thoughts, updates, or references to similar conversations or initiatives 
would be greatly appreciated.

Thank you for your time and expertise!

Best regards,
Yasukazu

On Tue, Sep 5, 2023 at 12:09, Nagatomi Yasukazu wrote:
Hello Mich,
Thank you for your questions. Here are my responses:

> 1. What investigation have you done to show that it is running in local mode?

I have verified through the History Server's Environment tab that:
- "spark.master" is set to local[*]
- "spark.app.id" begins with local-xxx
- "spark.submit.deployMode" is set to local


> 2. who has configured this kubernetes cluster? Is it supplied by a cloud 
> vendor?

Our Kubernetes cluster was set up in an on-prem environment using RKE2( 
https://docs.rke2.io/ ).


> 3. Confirm that you have configured Spark Connect Server correctly for 
> cluster mode. Make sure you specify the cluster manager (e.g., Kubernetes) 
> and other relevant Spark configurations in your Spark job submission.

Based on the Spark Connect documentation I've read, there doesn't seem to be 
any specific settings for cluster mode related to the Spark Connect Server.

Configuration - Spark 3.4.1 Documentation
https://spark.apache.org/docs/3.4.1/configuration.html#spark-connect

Quickstart: Spark Connect — PySpark 3.4.1 documentation
https://spark.apache.org/docs/latest/api/python/getting_started/quickstart_connect.html

Spark Connect Overview - Spark 3.4.1 Documentation
https://spark.apache.org/docs/latest/spark-connect-overview.html

The documentation only suggests running ./sbin/start-connect-server.sh 
--packages org.apache.spark:spark-connect_2.12:3.4.0, leaving me at a loss.


> 4. Can you provide a full spark submit command

Given the nature of Spark Connect, I don't use the spark-submit command. 
Instead, as per the documentation, I can execute workloads using only a Python 
script. For the Spark Connect Server, I have a Kubernetes manifest executing 
"/opt.spark/sbin/start-connect-server.sh --packages 
org.apache.spark:spark-connect_2.12:3.4.0".


> 5. Make sure that the Python client script connecting to Spark Connect Server 
> specifies the cluster mode explicitly, like using --master or --deploy-mode 
> flags when creating a SparkSession.

The Spark Connect Server operates as a Driver, so it isn't possible to specify 
the --master or --deploy-mode flags in the Python client script. If I try, I 
encounter a RuntimeError.

like this:
RuntimeError: Spark master cannot be 

Re: Re: Running Spark Connect Server in Cluster Mode on Kubernetes

2023-10-18 Thread eab...@163.com
Hi all,
Has the functionality to run the Spark Connect server on k8s been implemented yet?



 
From: Nagatomi Yasukazu
Date: 2023-09-05 17:51
To: user
Subject: Re: Running Spark Connect Server in Cluster Mode on Kubernetes
Dear Spark Community,

I've been exploring the capabilities of the Spark Connect Server and 
encountered an issue when trying to launch it in a cluster deploy mode with 
Kubernetes as the master.

While initiating the `start-connect-server.sh` script with the `--conf` 
parameter for `spark.master` and `spark.submit.deployMode`, I was met with an 
error message:

```
Exception in thread "main" org.apache.spark.SparkException: Cluster deploy mode 
is not applicable to Spark Connect server.
```

This error message can be traced back to Spark's source code here:
https://github.com/apache/spark/blob/6c885a7cf57df328b03308cff2eed814bda156e4/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L307

Given my observations, I'm curious about the Spark Connect Server roadmap:

Is there a plan or current conversation to enable Kubernetes as a master in 
Spark Connect Server's cluster deploy mode?

I have tried to gather information from existing JIRA tickets, but have not 
been able to get a definitive answer:

https://issues.apache.org/jira/browse/SPARK-42730
https://issues.apache.org/jira/browse/SPARK-39375
https://issues.apache.org/jira/browse/SPARK-44117

Any thoughts, updates, or references to similar conversations or initiatives 
would be greatly appreciated.

Thank you for your time and expertise!

Best regards,
Yasukazu

On Tue, Sep 5, 2023 at 12:09, Nagatomi Yasukazu wrote:
Hello Mich,
Thank you for your questions. Here are my responses:

> 1. What investigation have you done to show that it is running in local mode?

I have verified through the History Server's Environment tab that:
- "spark.master" is set to local[*]
- "spark.app.id" begins with local-xxx
- "spark.submit.deployMode" is set to local


> 2. who has configured this kubernetes cluster? Is it supplied by a cloud 
> vendor?

Our Kubernetes cluster was set up in an on-prem environment using RKE2( 
https://docs.rke2.io/ ).


> 3. Confirm that you have configured Spark Connect Server correctly for 
> cluster mode. Make sure you specify the cluster manager (e.g., Kubernetes) 
> and other relevant Spark configurations in your Spark job submission.

Based on the Spark Connect documentation I've read, there doesn't seem to be 
any specific settings for cluster mode related to the Spark Connect Server.

Configuration - Spark 3.4.1 Documentation
https://spark.apache.org/docs/3.4.1/configuration.html#spark-connect

Quickstart: Spark Connect — PySpark 3.4.1 documentation
https://spark.apache.org/docs/latest/api/python/getting_started/quickstart_connect.html

Spark Connect Overview - Spark 3.4.1 Documentation
https://spark.apache.org/docs/latest/spark-connect-overview.html

The documentation only suggests running ./sbin/start-connect-server.sh 
--packages org.apache.spark:spark-connect_2.12:3.4.0, leaving me at a loss.


> 4. Can you provide a full spark submit command

Given the nature of Spark Connect, I don't use the spark-submit command. 
Instead, as per the documentation, I can execute workloads using only a Python 
script. For the Spark Connect Server, I have a Kubernetes manifest executing 
"/opt.spark/sbin/start-connect-server.sh --packages 
org.apache.spark:spark-connect_2.12:3.4.0".


> 5. Make sure that the Python client script connecting to Spark Connect Server 
> specifies the cluster mode explicitly, like using --master or --deploy-mode 
> flags when creating a SparkSession.

The Spark Connect Server operates as a Driver, so it isn't possible to specify 
the --master or --deploy-mode flags in the Python client script. If I try, I 
encounter a RuntimeError.

like this:
RuntimeError: Spark master cannot be configured with Spark Connect server; 
however, found URL for Spark Connect [sc://.../]


> 6. Ensure that you have allocated the necessary resources (CPU, memory etc) 
> to Spark Connect Server when running it on Kubernetes.

Resources are ample, so that shouldn't be the problem.


> 7. Review the environment variables and configurations you have set, 
> including the SPARK_NO_DAEMONIZE=1 variable. Ensure that these variables are 
> not conflicting with 

I'm unsure if SPARK_NO_DAEMONIZE=1 conflicts with cluster mode settings. But 
without it, the process goes to the background when executing 
start-connect-server.sh, causing the Pod to terminate prematurely.


> 8. Are you using the correct spark client version that is fully compatible 
> with your spark on the server?

Yes, I have verified that without using Spark Connect (e.g., using Spark 
Operator), Spark applications run as expected.

> 9. check the kubernetes error logs

The Kubernetes logs don't show any errors, and jobs are running in local mode.


> 10. Insufficient resources can lead to the application running in local mode

I wasn't aware that insufficient resources cou

Running 30 Spark applications at the same time is slower than one on average

2022-10-26 Thread eab...@163.com
Hi All,

I have a CDH 5.16.2 Hadoop cluster with 1+3 nodes (64C/128G, 1 NN/RM + 3 DN/NM),
and YARN with 192C/240G. I used the following test scenario:

1. Spark app resources: 2G driver memory / 2 driver vcores / 1 executor / 2G
executor memory / 2 executor vcores.
2. One Spark app will use 5G/4C on YARN.
3. First, running only one Spark app takes 40s.
4. Then, I run 30 of the same Spark app at once, and each app takes 80s on
average.
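(For context on the capacity math: 30 apps × 5G/4C is roughly 150G and 120 vcores,
which should fit within the 192C/240G YARN capacity, so in theory all 30 apps can
run concurrently rather than queue.)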

So, I want to know why the run-time gap is so big, and how can it be optimized?

Thanks



Re: How can I config hive.metastore.warehouse.dir

2021-08-11 Thread eab...@163.com
 Hi,
I think you should set hive-site.xml (on the classpath) before initializing the
SparkSession; Spark will then connect to the metastore and log something like this:
==
2021-08-12 09:21:21 INFO  HiveUtils:54 - Initializing HiveMetastoreConnection 
version 1.2.1 using Spark classes.
2021-08-12 09:21:22 INFO  metastore:376 - Trying to connect to metastore with 
URI thrift://hadoop001:9083
2021-08-12 09:21:22 WARN  UserGroupInformation:1535 - No groups available for 
user hdfs
2021-08-12 09:21:22 INFO  metastore:472 - Connected to metastore.
2021-08-12 09:21:22 INFO  SessionState:641 - Created local directory: 
/tmp/8bc342dd-aa0b-407b-b9ad-ff7ed3cd4076_resources
2021-08-12 09:21:22 INFO  SessionState:641 - Created HDFS directory: 
/tmp/hive/hdfs/8bc342dd-aa0b-407b-b9ad-ff7ed3cd4076
2021-08-12 09:21:22 INFO  SessionState:641 - Created local directory: 
/tmp/tempo/8bc342dd-aa0b-407b-b9ad-ff7ed3cd4076
2021-08-12 09:21:22 INFO  SessionState:641 - Created HDFS directory: 
/tmp/hive/hdfs/8bc342dd-aa0b-407b-b9ad-ff7ed3cd4076/_tmp_space.db
===
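If you prefer to set it in code instead of hive-site.xml, the equivalent (PySpark
shown for brevity; the metastore URI and warehouse path below are placeholders
matching the log above) is to configure the builder and enable Hive support before
the session is created:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("write-to-hive")
         .config("hive.metastore.uris", "thrift://hadoop001:9083")
         .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
         .enableHiveSupport()   # must be set before getOrCreate()
         .getOrCreate())

spark.sql("SHOW DATABASES").show()
```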

eab...@163.com
 
From: igyu
Date: 2021-08-12 11:33
To: user
Subject: How can I config hive.metastore.warehouse.dir
I need write data to hive with spark

val proper =  new Properties
proper.setProperty("fs.defaultFS", "hdfs://nameservice1")
proper.setProperty("dfs.nameservices", "nameservice1")
proper.setProperty("dfs.ha.namenodes.nameservice1", "namenode337,namenode369")
proper.setProperty("dfs.namenode.rpc-address.nameservice1.namenode337", 
"bigdser1:8020")
proper.setProperty("dfs.namenode.rpc-address.nameservice1.namenode369", 
"bigdser5:8020")
proper.setProperty("dfs.client.failover.proxy.provider.nameservice1", 
"org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider")
proper.setProperty("hadoop.security.authentication", "Kerberos")
proper.setProperty("fs.hdfs.impl", 
"org.apache.hadoop.hdfs.DistributedFileSystem")
proper.setProperty("spark.sql.warehouse.dir", "/user/hive/warehouse")
proper.setProperty("hive.metastore.warehouse.dir","/user/hive/warehouse")
proper.setProperty("hive.metastore.uris", "thrift://bigdser1:9083")
sparkSession.sqlContext.setConf(proper)
sparkSession.sqlContext.setConf("hive.exec.dynamic.partition", "true")
sparkSession.sqlContext.setConf("hive.exec.dynamic.partition.mode", 
"nonstrict")   DF.write.format("jdbc")
  .option("timestampFormat", "/MM/dd HH:mm:ss ZZ")
  .options(cfg)
//  .partitionBy(partitions:_*)
  .mode(mode)
  .insertInto(table)
but I get an error

21/08/12 11:25:07 INFO SharedState: Setting hive.metastore.warehouse.dir 
('null') to the value of spark.sql.warehouse.dir 
('file:/D:/file/code/Java/jztsynctools/spark-warehouse/').
21/08/12 11:25:07 INFO SharedState: Warehouse path is 
'file:/D:/file/code/Java/jztsynctools/spark-warehouse/'.
21/08/12 11:25:08 INFO StateStoreCoordinatorRef: Registered 
StateStoreCoordinator endpoint
21/08/12 11:25:08 INFO Version: Elasticsearch Hadoop v7.10.2 [f53f4b7b2b]
21/08/12 11:25:08 INFO Utils: Supplied authorities: tidb4ser:11000
21/08/12 11:25:08 INFO Utils: Resolved authority: tidb4ser:11000
21/08/12 11:25:16 INFO UserGroupInformation: Login successful for user 
jztwk/had...@join.com using keytab file D:\file\jztwk.keytab. Keytab auto 
renewal enabled : false
login user: jztwk/had...@join.com (auth:KERBEROS)
21/08/12 11:25:25 WARN ProcfsMetricsGetter: Exception when trying to compute 
pagesize, as a result reporting of ProcessTree metrics is stopped
Exception in thread "main" org.apache.spark.sql.AnalysisException: Table or 
view not found: hivetest.chinese_part1, the database hivetest doesn't exist.;
at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:47)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveRelations$$lookupTableFromCatalog(Analyzer.scala:737)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:710)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:708)
at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$apply$1.apply(AnalysisHelper.scala:90)
at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$apply$1.apply(AnalysisHelper.scala:90)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:89)

Is there a better way to read kerberized impala tables by spark jdbc?

2020-12-07 Thread eab...@163.com
Hi:

I want to use Spark JDBC to read Kerberized Impala tables, like:
```
val impalaUrl = 
"jdbc:impala://:21050;AuthMech=1;KrbRealm=REALM.COM;KrbHostFQDN=;KrbServiceName=impala"
spark.read.jdbc(impalaUrl)
```

As we know, Spark reads Impala data in the executors rather than the driver, so it
throws the exception: javax.security.sasl.SaslException: GSS initiate failed

```
Caused by: org.ietf.jgss.GSSException: No valid credentials provided (Mechanism 
level: Failed to find any Kerberos tgt)
at 
sun.security.jgss.krb5.Krb5InitCredential.getInstance(Krb5InitCredential.java:147)
at 
sun.security.jgss.krb5.Krb5MechFactory.getCredentialElement(Krb5MechFactory.java:122)
at 
sun.security.jgss.krb5.Krb5MechFactory.getMechanismContext(Krb5MechFactory.java:187)
at 
sun.security.jgss.GSSManagerImpl.getMechanismContext(GSSManagerImpl.java:224)
at 
sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:212)
at 
sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:179)
at 
com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:192)
... 20 common frames omitted

``` 

One way to solve this problem is to set jaas.conf via the
"java.security.auth.login.config" property.

This is jaas.conf:

```
Client {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  doNotPrompt=true
  useTicketCache=true
  principal="test"
  keyTab="/home/keytab/user.keytab";
   };

```

Then set spark.executor.extraJavaOptions like :
```
--conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=/data/disk1/spark-jdbc-impala/conf/jaas.conf -Djavax.security.auth.useSubjectCredsOnly=false"
```
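Putting it together, the read path looks roughly like this (a sketch; the host,
table, and driver class are placeholders and depend on the Impala JDBC connector
jar you use):

```python
from pyspark.sql import SparkSession

jaas_opts = ("-Djava.security.auth.login.config="
             "/data/disk1/spark-jdbc-impala/conf/jaas.conf "
             "-Djavax.security.auth.useSubjectCredsOnly=false")

spark = (SparkSession.builder
         .appName("impala-jdbc")
         .config("spark.executor.extraJavaOptions", jaas_opts)
         .getOrCreate())

impala_url = ("jdbc:impala://<host>:21050;AuthMech=1;KrbRealm=REALM.COM;"
              "KrbHostFQDN=<host>;KrbServiceName=impala")

df = (spark.read.format("jdbc")
      .option("url", impala_url)
      .option("dbtable", "db.table")                          # placeholder table
      .option("driver", "com.cloudera.impala.jdbc41.Driver")  # depends on your jar
      .load())
df.show()
```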

This approach requires absolute paths for the jaas.conf and keytab files; in other
words, these files must be placed at the same path on every node. Is there a better
way?

Please help.

Regards




eab...@163.com