Re: Re: EXT: Dual Write to HDFS and MinIO in faster way
Hi,

I think you should write to HDFS, then copy the files (Parquet or ORC) from HDFS to MinIO.

eabour

From: Prem Sahoo
Date: 2024-05-22 00:38
To: Vibhor Gupta; user
Subject: Re: EXT: Dual Write to HDFS and MinIO in faster way

On Tue, May 21, 2024 at 6:58 AM Prem Sahoo wrote:

Hello Vibhor,
Thanks for the suggestion. I am looking for other alternatives where the same dataframe can be written to two destinations without re-execution and without cache or persist. Can someone help me with scenario 2? How can we make Spark write to MinIO faster?

On May 21, 2024, at 1:18 AM, Vibhor Gupta wrote:

Hi Prem,
You can try writing to HDFS, then reading from HDFS and writing to MinIO. This will prevent the duplicate transformation. You can also try persisting the dataframe using the DISK_ONLY level.
Regards,
Vibhor

From: Prem Sahoo
Date: Tuesday, 21 May 2024 at 8:16 AM
To: Spark dev list
Subject: EXT: Dual Write to HDFS and MinIO in faster way

Hello Team,
I am planning to write to two data sources at the same time.

Scenario 1: Write the same dataframe to HDFS and MinIO without re-executing the transformations and without cache(). We read a Parquet file, do a few transformations, and write to HDFS and MinIO; in both writes Spark needs to execute the transformations again. How can we avoid re-executing the transformations without cache()/persist()?

Scenario 2: Writing 3.2 GB of data to HDFS and MinIO takes ~6 minutes. Is there any way to make this write faster? I don't want to repartition before writing, since repartitioning has the overhead of shuffling.

Please provide some inputs.
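Below is a minimal PySpark sketch of the approach suggested in this thread: materialize the transformed dataframe on HDFS once, then re-read the already-written files and copy them to MinIO, so the transformations run only a single time. The paths, MinIO endpoint settings, and the filter are placeholders, not taken from the original job.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("dual-write-sketch")
    # Hypothetical s3a/MinIO settings; adjust endpoint and credentials to your environment.
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio.example.com:9000")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

df = spark.read.parquet("hdfs:///data/input")
transformed = df.filter("value IS NOT NULL")  # stand-in for the real transformations

# 1) Execute the transformations once and land the result on HDFS.
hdfs_out = "hdfs:///data/output/parquet"
transformed.write.mode("overwrite").parquet(hdfs_out)

# 2) Re-read the materialized files and write them to MinIO; this is just a
#    file copy through Spark, the transformations are not executed again.
spark.read.parquet(hdfs_out).write.mode("overwrite").parquet("s3a://bucket/data/output")
```

The alternative Vibhor mentions would replace step 1 with transformed.persist(StorageLevel.DISK_ONLY) followed by the two writes, which also avoids recomputation at the cost of spilling the dataframe to local disk.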
Re: Re: [EXTERNAL] Re: Spark-submit without access to HDFS
Hi Eugene,

As the logs indicate, when executing spark-submit, Spark packages and uploads spark/conf to HDFS, along with uploading spark/jars. These files are uploaded to HDFS unless you configure them to be uploaded to another object store. To do so, you'll need to modify the configuration in hdfs-site.xml, for instance fs.oss.impl, etc.

eabour

From: Eugene Miretsky
Date: 2023-11-16 09:58
To: eab...@163.com
CC: Eugene Miretsky; user @spark
Subject: Re: [EXTERNAL] Re: Spark-submit without access to HDFS

Hey! Thanks for the response. We are getting the error because there is no network connectivity to the data nodes - that's expected. What I am trying to find out is WHY we need access to the data nodes, and whether there is a way to submit a job without it.

Cheers,
Eugene

On Wed, Nov 15, 2023 at 7:32 PM eab...@163.com wrote:

Hi Eugene,
I think you should check whether the HDFS service is running properly. From the logs, it appears that there are two datanodes in HDFS, but none of them are healthy. Please investigate why the datanodes are not functioning properly; it seems the issue might be due to insufficient disk space.

eabour

From: Eugene Miretsky
Date: 2023-11-16 05:31
To: user
Subject: Spark-submit without access to HDFS

Hey All,

We are running PySpark spark-submit from a client outside the cluster. The client has network connectivity only to the YARN master, not to the HDFS datanodes. How can we submit the jobs?

The idea would be to preload all the dependencies (job code, libraries, etc.) to HDFS, and just submit the job from the client. We tried something like this:

'PYSPARK_ARCHIVES_PATH=hdfs://some-path/pyspark.zip spark-submit --master yarn --deploy-mode cluster --py-files hdfs://yarn-master-url hdfs://foo.py'

The error we are getting is:

"org.apache.hadoop.net.ConnectTimeoutException: 6 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=/10.117.110.19:9866]
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /user/users/.sparkStaging/application_1698216436656_0104/spark_conf.zip could only be written to 0 of the 1 minReplication nodes. There are 2 datanode(s) running and 2 node(s) are excluded in this operation."

A few questions:
1) What is the spark_conf.zip file? Is it the hive-site/yarn-site conf files? Why would the client send them to the cluster? (The cluster already has all that info - this would make sense in client mode, but not cluster mode.)
2) Is it possible to use spark-submit without HDFS access?
3) How would we fix this?

Cheers,
Eugene

--
Eugene Miretsky
Managing Partner | Badal.io | Book a meeting /w me!
mobile: 416-568-9245
email: eug...@badal.io
Re: Spark-submit without access to HDFS
Hi Eugene,

I think you should check whether the HDFS service is running properly. From the logs, it appears that there are two datanodes in HDFS, but none of them are healthy. Please investigate why the datanodes are not functioning properly; it seems the issue might be due to insufficient disk space.

eabour

From: Eugene Miretsky
Date: 2023-11-16 05:31
To: user
Subject: Spark-submit without access to HDFS

Hey All,

We are running PySpark spark-submit from a client outside the cluster. The client has network connectivity only to the YARN master, not to the HDFS datanodes. How can we submit the jobs?

The idea would be to preload all the dependencies (job code, libraries, etc.) to HDFS, and just submit the job from the client. We tried something like this:

'PYSPARK_ARCHIVES_PATH=hdfs://some-path/pyspark.zip spark-submit --master yarn --deploy-mode cluster --py-files hdfs://yarn-master-url hdfs://foo.py'

The error we are getting is:

"org.apache.hadoop.net.ConnectTimeoutException: 6 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=/10.117.110.19:9866]
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /user/users/.sparkStaging/application_1698216436656_0104/spark_conf.zip could only be written to 0 of the 1 minReplication nodes. There are 2 datanode(s) running and 2 node(s) are excluded in this operation."

A few questions:
1) What is the spark_conf.zip file? Is it the hive-site/yarn-site conf files? Why would the client send them to the cluster? (The cluster already has all that info - this would make sense in client mode, but not cluster mode.)
2) Is it possible to use spark-submit without HDFS access?
3) How would we fix this?

Cheers,
Eugene

--
Eugene Miretsky
Managing Partner | Badal.io | Book a meeting /w me!
mobile: 416-568-9245
email: eug...@badal.io
Re: Re: jackson-databind version mismatch
Hi,

But in fact, it does have those packages.

D:\02_bigdata\spark-3.5.0-bin-hadoop3\jars
2023/09/09 10:08        75,567  jackson-annotations-2.15.2.jar
2023/09/09 10:08       549,207  jackson-core-2.15.2.jar
2023/09/09 10:08       232,248  jackson-core-asl-1.9.13.jar
2023/09/09 10:08     1,620,088  jackson-databind-2.15.2.jar
2023/09/09 10:08        54,630  jackson-dataformat-yaml-2.15.2.jar
2023/09/09 10:08       122,937  jackson-datatype-jsr310-2.15.2.jar
2023/09/09 10:08       780,664  jackson-mapper-asl-1.9.13.jar
2023/09/09 10:08       513,968  jackson-module-scala_2.12-2.15.2.jar

eabour

From: Bjørn Jørgensen
Date: 2023-11-02 16:40
To: eab...@163.com
CC: user @spark; Saar Barhoom; moshik.vitas
Subject: Re: jackson-databind version mismatch

[SPARK-43225][BUILD][SQL] Remove jackson-core-asl and jackson-mapper-asl from pre-built distribution

On Thu, Nov 2, 2023 at 09:15, Bjørn Jørgensen wrote:

In Spark 3.5.0, jackson-core-asl and jackson-mapper-asl were removed; those have the groupId org.codehaus.jackson. The other jackson-* artifacts have the groupId com.fasterxml.jackson.core.

On Thu, Nov 2, 2023 at 01:43, eab...@163.com wrote:

Hi,

Please check the versions of the jar files starting with "jackson-" and make sure all versions are consistent. Jackson jar list in spark-3.3.0:

2022/06/10 04:37        75,714  jackson-annotations-2.13.3.jar
2022/06/10 04:37       374,895  jackson-core-2.13.3.jar
2022/06/10 04:37       232,248  jackson-core-asl-1.9.13.jar
2022/06/10 04:37     1,536,542  jackson-databind-2.13.3.jar
2022/06/10 04:37        52,020  jackson-dataformat-yaml-2.13.3.jar
2022/06/10 04:37       121,201  jackson-datatype-jsr310-2.13.3.jar
2022/06/10 04:37       780,664  jackson-mapper-asl-1.9.13.jar
2022/06/10 04:37       458,981  jackson-module-scala_2.12-2.13.3.jar

Spark 3.3.0 uses Jackson version 2.13.3, while Spark 3.5.0 uses Jackson version 2.15.2. I think you can remove the lower-version Jackson packages to keep the versions consistent.

eabour

From: moshik.vi...@veeva.com.INVALID
Date: 2023-11-01 15:03
To: user@spark.apache.org
CC: 'Saar Barhoom'
Subject: jackson-databind version mismatch

Hi Spark team,

On upgrading the Spark version from 3.2.1 to 3.4.1 we got the following issue:

java.lang.NoSuchMethodError: 'com.fasterxml.jackson.core.JsonGenerator com.fasterxml.jackson.databind.ObjectMapper.createGenerator(java.io.OutputStream, com.fasterxml.jackson.core.JsonEncoding)'
	at org.apache.spark.util.JsonProtocol$.toJsonString(JsonProtocol.scala:75)
	at org.apache.spark.SparkThrowableHelper$.getMessage(SparkThrowableHelper.scala:74)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$7(SQLExecution.scala:127)
	at scala.Option.map(Option.scala:230)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:125)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:195)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:103)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:827)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:4165)
	at org.apache.spark.sql.Dataset.head(Dataset.scala:3161)
	at org.apache.spark.sql.Dataset.take(Dataset.scala:3382)
	at org.apache.spark.sql.Dataset.takeAsList(Dataset.scala:3405)
	at com.crossix.safemine.cloud.utils.DebugRDDLogger.showDataset(DebugRDDLogger.java:84)
	at com.crossix.safemine.cloud.components.statistics.spark.StatisticsTransformer.getFillRateCountsWithSparkQuery(StatisticsTransformer.java:122)
	at com.crossix.safemine.cloud.components.statistics.spark.StatisticsTransformer.calculateStatistics(StatisticsTransformer.java:61)
	at com.crossix.safemine.cloud.components.statistics.spark.SparkFileStatistics.execute(SparkFileStatistics.java:102)
	at com.crossix.safemine.cloud.StatisticsFlow.calculateAllStatistics(StatisticsFlow.java:146)
	at com.crossix.safemine.cloud.StatisticsFlow.runStatistics(StatisticsFlow.java:119)
	at com.crossix.safemine.cloud.StatisticsFlow.initialFileStatistics(StatisticsFlow.java:77)
	at com.crossix.safemine.cloud.SMCFlow.process(SMCFlow.java:221)
	at com.crossix.safemine.cloud.SMCFlow.execute(SMCFlow.java:132)
	at com.crossix.safemine.cloud.SMCFlow.run(SMCFlow.java:91)

I see that the Spark package contains the dependency: com.fasterxml.jackson.core:jackson-databind:jar:2.10.5:compile
Re: jackson-databind version mismatch
Hi,

Please check the versions of the jar files starting with "jackson-" and make sure all versions are consistent. Jackson jar list in spark-3.3.0:

2022/06/10 04:37        75,714  jackson-annotations-2.13.3.jar
2022/06/10 04:37       374,895  jackson-core-2.13.3.jar
2022/06/10 04:37       232,248  jackson-core-asl-1.9.13.jar
2022/06/10 04:37     1,536,542  jackson-databind-2.13.3.jar
2022/06/10 04:37        52,020  jackson-dataformat-yaml-2.13.3.jar
2022/06/10 04:37       121,201  jackson-datatype-jsr310-2.13.3.jar
2022/06/10 04:37       780,664  jackson-mapper-asl-1.9.13.jar
2022/06/10 04:37       458,981  jackson-module-scala_2.12-2.13.3.jar

Spark 3.3.0 uses Jackson version 2.13.3, while Spark 3.5.0 uses Jackson version 2.15.2. I think you can remove the lower-version Jackson packages to keep the versions consistent.

eabour

From: moshik.vi...@veeva.com.INVALID
Date: 2023-11-01 15:03
To: user@spark.apache.org
CC: 'Saar Barhoom'
Subject: jackson-databind version mismatch

Hi Spark team,

On upgrading the Spark version from 3.2.1 to 3.4.1 we got the following issue:

java.lang.NoSuchMethodError: 'com.fasterxml.jackson.core.JsonGenerator com.fasterxml.jackson.databind.ObjectMapper.createGenerator(java.io.OutputStream, com.fasterxml.jackson.core.JsonEncoding)'
	at org.apache.spark.util.JsonProtocol$.toJsonString(JsonProtocol.scala:75)
	at org.apache.spark.SparkThrowableHelper$.getMessage(SparkThrowableHelper.scala:74)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$7(SQLExecution.scala:127)
	at scala.Option.map(Option.scala:230)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:125)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:195)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:103)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:827)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:4165)
	at org.apache.spark.sql.Dataset.head(Dataset.scala:3161)
	at org.apache.spark.sql.Dataset.take(Dataset.scala:3382)
	at org.apache.spark.sql.Dataset.takeAsList(Dataset.scala:3405)
	at com.crossix.safemine.cloud.utils.DebugRDDLogger.showDataset(DebugRDDLogger.java:84)
	at com.crossix.safemine.cloud.components.statistics.spark.StatisticsTransformer.getFillRateCountsWithSparkQuery(StatisticsTransformer.java:122)
	at com.crossix.safemine.cloud.components.statistics.spark.StatisticsTransformer.calculateStatistics(StatisticsTransformer.java:61)
	at com.crossix.safemine.cloud.components.statistics.spark.SparkFileStatistics.execute(SparkFileStatistics.java:102)
	at com.crossix.safemine.cloud.StatisticsFlow.calculateAllStatistics(StatisticsFlow.java:146)
	at com.crossix.safemine.cloud.StatisticsFlow.runStatistics(StatisticsFlow.java:119)
	at com.crossix.safemine.cloud.StatisticsFlow.initialFileStatistics(StatisticsFlow.java:77)
	at com.crossix.safemine.cloud.SMCFlow.process(SMCFlow.java:221)
	at com.crossix.safemine.cloud.SMCFlow.execute(SMCFlow.java:132)
	at com.crossix.safemine.cloud.SMCFlow.run(SMCFlow.java:91)

I see that the Spark package contains the dependency: com.fasterxml.jackson.core:jackson-databind:jar:2.10.5:compile

But jackson-databind 2.10.5 does not contain ObjectMapper.createGenerator(java.io.OutputStream, com.fasterxml.jackson.core.JsonEncoding) - it was added in 2.11.0.

Trying to upgrade jackson-databind fails with:
com.fasterxml.jackson.databind.JsonMappingException: Scala module 2.10.5 requires Jackson Databind version >= 2.10.0 and < 2.11.0

According to the Spark 3.3.0 release notes, "Upgrade Jackson to 2.13.3" (https://spark.apache.org/releases/spark-release-3-3-0.html), yet the Spark 3.4.1 package contains Jackson 2.10.5. What am I missing?

--
Moshik Vitas
Senior Software Developer, Crossix
Veeva Systems
m: +972-54-5326-400
moshik.vi...@veeva.com
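As a small aid for the advice above ("make sure all Jackson versions are consistent"), here is a hedged helper sketch, not from the thread, that scans a Spark distribution's jars directory and flags mixed Jackson versions; the jars path is an assumption and should be adjusted to your install.

```python
import re
from pathlib import Path

def jackson_versions(jars_dir: str) -> dict[str, str]:
    """Map each jackson-* jar file name to the version embedded in its name."""
    versions = {}
    for jar in Path(jars_dir).glob("jackson-*.jar"):
        match = re.search(r"-(\d+(?:\.\d+)+)\.jar$", jar.name)
        if match:
            versions[jar.name] = match.group(1)
    return versions

if __name__ == "__main__":
    found = jackson_versions("/opt/spark/jars")  # hypothetical install path
    for name, version in sorted(found.items()):
        print(f"{version:>10}  {name}")
    distinct = set(found.values())
    if len(distinct) > 1:
        print(f"WARNING: mixed Jackson versions on the classpath: {sorted(distinct)}")
```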
[Resolved] Re: spark.stop() cannot stop spark connect session
Hi all,

I read the source code at spark/python/pyspark/sql/connect/session.py (master branch of apache/spark on GitHub), and the comment on the "stop" method reads as follows:

def stop(self) -> None:
    # Stopping the session will only close the connection to the current session (and
    # the life cycle of the session is maintained by the server),
    # whereas the regular PySpark session immediately terminates the Spark Context
    # itself, meaning that stopping all Spark sessions.
    # It is controversial to follow the existing regular Spark session's behavior;
    # specifically, in Spark Connect the Spark Connect server is designed for
    # multi-tenancy - the remote client side cannot just stop the server and stop
    # other remote clients being used from other users.

So, that's how it was designed.

eabour

From: eab...@163.com
Date: 2023-10-20 15:56
To: user @spark
Subject: spark.stop() cannot stop spark connect session

Hi, my code:

from pyspark.sql import SparkSession
spark = SparkSession.builder.remote("sc://172.29.190.147").getOrCreate()

import pandas as pd
# create a pandas dataframe
pdf = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35],
    "gender": ["F", "M", "M"]
})
# convert the pandas dataframe to a spark dataframe
sdf = spark.createDataFrame(pdf)
# show the spark dataframe
sdf.show()

spark.stop()

After stop(), executing sdf.show() throws pyspark.errors.exceptions.connect.SparkConnectException: [NO_ACTIVE_SESSION] No active Spark session found. Please create a new Spark session before running the code.

However, the Spark web UI at http://172.29.190.147:4040/connect/ shows that the session is still running and has not been stopped yet: 1 session(s) are online, running 0 Request(s). Session Statistics (1): Session ID 29f05cde-8f8b-418d-95c0-8dbbbfb556d2, Start Time 2023/10/20 15:30:04, Duration 14 minutes 49 seconds, Total Execute 2.

eabour
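A minimal sketch of the behavior described above, reusing the remote URL from the thread as a placeholder: stop() only closes this client's connection to the Spark Connect server, so the remedy is simply to build a new remote session before running further commands.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://172.29.190.147").getOrCreate()
spark.range(3).show()

spark.stop()  # closes only this client's connection; the server-side session lives on

# Using the old session now raises [NO_ACTIVE_SESSION]; create a fresh one instead.
spark = SparkSession.builder.remote("sc://172.29.190.147").getOrCreate()
spark.range(3).show()
```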
submitting tasks failed in Spark standalone mode due to missing failureaccess jar file
Hi Team,

I use Spark 3.5.0 and start a Spark cluster with start-master.sh and start-worker.sh. When I run ./bin/spark-shell --master spark://LAPTOP-TC4A0SCV.:7077 I get these error logs:

```
23/10/24 12:00:46 ERROR TaskSchedulerImpl: Lost an executor 1 (already removed): Command exited with code 50
```

The worker's finished-executor logs:

```
Spark Executor Command: "/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.372.b07-1.el7_9.x86_64/jre/bin/java" "-cp" "/root/spark-3.5.0-bin-hadoop3/conf/:/root/spark-3.5.0-bin-hadoop3/jars/*" "-Xmx1024M" "-Dspark.driver.port=43765" "-Djava.net.preferIPv6Addresses=false" "-XX:+IgnoreUnrecognizedVMOptions" "--add-opens=java.base/java.lang=ALL-UNNAMED" "--add-opens=java.base/java.lang.invoke=ALL-UNNAMED" "--add-opens=java.base/java.lang.reflect=ALL-UNNAMED" "--add-opens=java.base/java.io=ALL-UNNAMED" "--add-opens=java.base/java.net=ALL-UNNAMED" "--add-opens=java.base/java.nio=ALL-UNNAMED" "--add-opens=java.base/java.util=ALL-UNNAMED" "--add-opens=java.base/java.util.concurrent=ALL-UNNAMED" "--add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED" "--add-opens=java.base/sun.nio.ch=ALL-UNNAMED" "--add-opens=java.base/sun.nio.cs=ALL-UNNAMED" "--add-opens=java.base/sun.security.action=ALL-UNNAMED" "--add-opens=java.base/sun.util.calendar=ALL-UNNAMED" "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED" "-Djdk.reflect.useDirectMethodHandle=false" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@172.29.190.147:43765" "--executor-id" "0" "--hostname" "172.29.190.147" "--cores" "6" "--app-id" "app-20231024120037-0001" "--worker-url" "spark://Worker@172.29.190.147:34707" "--resourceProfileId" "0"

Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
23/10/24 12:00:39 INFO CoarseGrainedExecutorBackend: Started daemon with process name: 19535@LAPTOP-TC4A0SCV
23/10/24 12:00:39 INFO SignalUtils: Registering signal handler for TERM
23/10/24 12:00:39 INFO SignalUtils: Registering signal handler for HUP
23/10/24 12:00:39 INFO SignalUtils: Registering signal handler for INT
23/10/24 12:00:39 WARN Utils: Your hostname, LAPTOP-TC4A0SCV resolves to a loopback address: 127.0.1.1; using 172.29.190.147 instead (on interface eth0)
23/10/24 12:00:39 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
23/10/24 12:00:42 INFO CoarseGrainedExecutorBackend: Successfully registered with driver
23/10/24 12:00:42 INFO Executor: Starting executor ID 0 on host 172.29.190.147
23/10/24 12:00:42 INFO Executor: OS info Linux, 5.15.123.1-microsoft-standard-WSL2, amd64
23/10/24 12:00:42 INFO Executor: Java version 1.8.0_372
23/10/24 12:00:42 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 35227.
23/10/24 12:00:42 INFO NettyBlockTransferService: Server created on 172.29.190.147:35227
23/10/24 12:00:42 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
23/10/24 12:00:42 ERROR Inbox: An error happened while processing message in the inbox for Executor
java.lang.NoClassDefFoundError: org/sparkproject/guava/util/concurrent/internal/InternalFutureFailureAccess
	at java.lang.ClassLoader.defineClass1(Native Method)
	at java.lang.ClassLoader.defineClass(ClassLoader.java:756)
	at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
	at java.net.URLClassLoader.defineClass(URLClassLoader.java:473)
	at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
	at java.lang.ClassLoader.defineClass1(Native Method)
	at java.lang.ClassLoader.defineClass(ClassLoader.java:756)
	at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
	at java.net.URLClassLoader.defineClass(URLClassLoader.java:473)
	at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
	at java.lang.ClassLoader.d
```
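The error points at the shaded Guava "failureaccess" classes being absent from the executor classpath. As a quick diagnostic (not from the thread), a script like the one below can scan the distribution's jars directory for the class named in the NoClassDefFoundError; the jars path is taken from the log above and may differ on your machine.

```python
import zipfile
from pathlib import Path

# Shaded class reported missing in the executor log above.
MISSING = "org/sparkproject/guava/util/concurrent/internal/InternalFutureFailureAccess.class"

def providers(jars_dir: str) -> list[str]:
    """Return the jar files under jars_dir that contain the missing class."""
    hits = []
    for jar in sorted(Path(jars_dir).glob("*.jar")):
        with zipfile.ZipFile(jar) as zf:
            if MISSING in zf.namelist():
                hits.append(jar.name)
    return hits

found = providers("/root/spark-3.5.0-bin-hadoop3/jars")
print(found if found else "class not found in any jar - the failureaccess classes are missing")
```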
spark.stop() cannot stop spark connect session
Hi, my code:

from pyspark.sql import SparkSession
spark = SparkSession.builder.remote("sc://172.29.190.147").getOrCreate()

import pandas as pd
# create a pandas dataframe
pdf = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35],
    "gender": ["F", "M", "M"]
})
# convert the pandas dataframe to a spark dataframe
sdf = spark.createDataFrame(pdf)
# show the spark dataframe
sdf.show()

spark.stop()

After stop(), executing sdf.show() throws pyspark.errors.exceptions.connect.SparkConnectException: [NO_ACTIVE_SESSION] No active Spark session found. Please create a new Spark session before running the code.

However, the Spark web UI at http://172.29.190.147:4040/connect/ shows that the session is still running and has not been stopped yet: 1 session(s) are online, running 0 Request(s). Session Statistics (1): Session ID 29f05cde-8f8b-418d-95c0-8dbbbfb556d2, Start Time 2023/10/20 15:30:04, Duration 14 minutes 49 seconds, Total Execute 2.

eabour
Re: Re: Running Spark Connect Server in Cluster Mode on Kubernetes
Hi,

I have found three important classes:

org.apache.spark.sql.connect.service.SparkConnectServer: the ./sbin/start-connect-server.sh script uses SparkConnectServer as its main class. In the main function, it uses SparkSession.builder.getOrCreate() to create a local session and starts the SparkConnectService.

org.apache.spark.sql.connect.SparkConnectPlugin: to enable Spark Connect, simply make sure that the appropriate JAR is available in the CLASSPATH and that the driver plugin is configured to load this class.

org.apache.spark.sql.connect.SimpleSparkConnectService: a simple main class to start the Spark Connect server as a service for client tests.

So I believe that by configuring spark.plugins and starting the Spark cluster on Kubernetes, clients can use sc://ip:port to connect to the remote server. Let me give it a try.

eabour

From: eab...@163.com
Date: 2023-10-19 14:28
To: Nagatomi Yasukazu; user @spark
Subject: Re: Re: Running Spark Connect Server in Cluster Mode on Kubernetes

Hi all,

Has the Spark Connect server running on k8s functionality been implemented?

From: Nagatomi Yasukazu
Date: 2023-09-05 17:51
To: user
Subject: Re: Running Spark Connect Server in Cluster Mode on Kubernetes

Dear Spark Community,

I've been exploring the capabilities of the Spark Connect Server and encountered an issue when trying to launch it in cluster deploy mode with Kubernetes as the master. While initiating the `start-connect-server.sh` script with the `--conf` parameters for `spark.master` and `spark.submit.deployMode`, I was met with an error message:

```
Exception in thread "main" org.apache.spark.SparkException: Cluster deploy mode is not applicable to Spark Connect server.
```

This error message can be traced back to Spark's source code here:
https://github.com/apache/spark/blob/6c885a7cf57df328b03308cff2eed814bda156e4/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L307

Given my observations, I'm curious about the Spark Connect Server roadmap: is there a plan or current conversation to enable Kubernetes as a master in Spark Connect Server's cluster deploy mode? I have tried to gather information from existing JIRA tickets, but have not been able to get a definitive answer:
https://issues.apache.org/jira/browse/SPARK-42730
https://issues.apache.org/jira/browse/SPARK-39375
https://issues.apache.org/jira/browse/SPARK-44117

Any thoughts, updates, or references to similar conversations or initiatives would be greatly appreciated. Thank you for your time and expertise!

Best regards,
Yasukazu

On Tue, Sep 5, 2023 at 12:09, Nagatomi Yasukazu wrote:

Hello Mich,

Thank you for your questions. Here are my responses:

> 1. What investigation have you done to show that it is running in local mode?

I have verified through the History Server's Environment tab that:
- "spark.master" is set to local[*]
- "spark.app.id" begins with local-xxx
- "spark.submit.deployMode" is set to local

> 2. Who has configured this Kubernetes cluster? Is it supplied by a cloud vendor?

Our Kubernetes cluster was set up in an on-prem environment using RKE2 (https://docs.rke2.io/).

> 3. Confirm that you have configured Spark Connect Server correctly for cluster mode. Make sure you specify the cluster manager (e.g., Kubernetes) and other relevant Spark configurations in your Spark job submission.

Based on the Spark Connect documentation I've read, there doesn't seem to be any specific setting for cluster mode related to the Spark Connect Server.

Configuration - Spark 3.4.1 Documentation
https://spark.apache.org/docs/3.4.1/configuration.html#spark-connect
Quickstart: Spark Connect — PySpark 3.4.1 documentation
https://spark.apache.org/docs/latest/api/python/getting_started/quickstart_connect.html
Spark Connect Overview - Spark 3.4.1 Documentation
https://spark.apache.org/docs/latest/spark-connect-overview.html

The documentation only suggests running ./sbin/start-connect-server.sh --packages org.apache.spark:spark-connect_2.12:3.4.0, leaving me at a loss.

> 4. Can you provide a full spark submit command?

Given the nature of Spark Connect, I don't use the spark-submit command. Instead, as per the documentation, I can execute workloads using only a Python script. For the Spark Connect Server, I have a Kubernetes manifest executing "/opt/spark/sbin/start-connect-server.sh --packages org.apache.spark:spark-connect_2.12:3.4.0".

> 5. Make sure that the Python client script connecting to Spark Connect Server specifies the cluster mode explicitly, like using --master or --deploy-mode flags when creating a SparkSession.

The Spark Connect Server operates as a driver, so it isn't possible to specify the --master or --deploy-mode flags in the Python client script. If I try, I encounter a RuntimeError like this:

RuntimeError: Spark master cannot be configured with Spark Connect server; however, found URL for Spark Connect [sc://.../]
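Below is a hedged PySpark sketch of the approach eabour outlines above: once a driver is running with spark.plugins=org.apache.spark.sql.connect.SparkConnectPlugin (or via start-connect-server.sh), a client only needs the sc:// URL to attach to it. The host name is a placeholder; 15002 is the default Spark Connect port.

```python
from pyspark.sql import SparkSession

# Server side (for reference, passed to the driver's configuration, not run here):
#   --conf spark.plugins=org.apache.spark.sql.connect.SparkConnectPlugin

spark = (
    SparkSession.builder
    .remote("sc://connect-server.example.com:15002")  # hypothetical service address
    .getOrCreate()
)

spark.sql("SELECT 1 AS ok").show()
spark.stop()  # closes only this client's connection to the server
```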
Re: Re: Running Spark Connect Server in Cluster Mode on Kubernetes
Hi all,

Has the Spark Connect server running on k8s functionality been implemented?

From: Nagatomi Yasukazu
Date: 2023-09-05 17:51
To: user
Subject: Re: Running Spark Connect Server in Cluster Mode on Kubernetes

Dear Spark Community,

I've been exploring the capabilities of the Spark Connect Server and encountered an issue when trying to launch it in cluster deploy mode with Kubernetes as the master. While initiating the `start-connect-server.sh` script with the `--conf` parameters for `spark.master` and `spark.submit.deployMode`, I was met with an error message:

```
Exception in thread "main" org.apache.spark.SparkException: Cluster deploy mode is not applicable to Spark Connect server.
```

This error message can be traced back to Spark's source code here:
https://github.com/apache/spark/blob/6c885a7cf57df328b03308cff2eed814bda156e4/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L307

Given my observations, I'm curious about the Spark Connect Server roadmap: is there a plan or current conversation to enable Kubernetes as a master in Spark Connect Server's cluster deploy mode? I have tried to gather information from existing JIRA tickets, but have not been able to get a definitive answer:
https://issues.apache.org/jira/browse/SPARK-42730
https://issues.apache.org/jira/browse/SPARK-39375
https://issues.apache.org/jira/browse/SPARK-44117

Any thoughts, updates, or references to similar conversations or initiatives would be greatly appreciated. Thank you for your time and expertise!

Best regards,
Yasukazu

On Tue, Sep 5, 2023 at 12:09, Nagatomi Yasukazu wrote:

Hello Mich,

Thank you for your questions. Here are my responses:

> 1. What investigation have you done to show that it is running in local mode?

I have verified through the History Server's Environment tab that:
- "spark.master" is set to local[*]
- "spark.app.id" begins with local-xxx
- "spark.submit.deployMode" is set to local

> 2. Who has configured this Kubernetes cluster? Is it supplied by a cloud vendor?

Our Kubernetes cluster was set up in an on-prem environment using RKE2 (https://docs.rke2.io/).

> 3. Confirm that you have configured Spark Connect Server correctly for cluster mode. Make sure you specify the cluster manager (e.g., Kubernetes) and other relevant Spark configurations in your Spark job submission.

Based on the Spark Connect documentation I've read, there doesn't seem to be any specific setting for cluster mode related to the Spark Connect Server.

Configuration - Spark 3.4.1 Documentation
https://spark.apache.org/docs/3.4.1/configuration.html#spark-connect
Quickstart: Spark Connect — PySpark 3.4.1 documentation
https://spark.apache.org/docs/latest/api/python/getting_started/quickstart_connect.html
Spark Connect Overview - Spark 3.4.1 Documentation
https://spark.apache.org/docs/latest/spark-connect-overview.html

The documentation only suggests running ./sbin/start-connect-server.sh --packages org.apache.spark:spark-connect_2.12:3.4.0, leaving me at a loss.

> 4. Can you provide a full spark submit command?

Given the nature of Spark Connect, I don't use the spark-submit command. Instead, as per the documentation, I can execute workloads using only a Python script. For the Spark Connect Server, I have a Kubernetes manifest executing "/opt/spark/sbin/start-connect-server.sh --packages org.apache.spark:spark-connect_2.12:3.4.0".

> 5. Make sure that the Python client script connecting to Spark Connect Server specifies the cluster mode explicitly, like using --master or --deploy-mode flags when creating a SparkSession.

The Spark Connect Server operates as a driver, so it isn't possible to specify the --master or --deploy-mode flags in the Python client script. If I try, I encounter a RuntimeError like this:

RuntimeError: Spark master cannot be configured with Spark Connect server; however, found URL for Spark Connect [sc://.../]

> 6. Ensure that you have allocated the necessary resources (CPU, memory etc) to Spark Connect Server when running it on Kubernetes.

Resources are ample, so that shouldn't be the problem.

> 7. Review the environment variables and configurations you have set, including the SPARK_NO_DAEMONIZE=1 variable. Ensure that these variables are not conflicting with

I'm unsure if SPARK_NO_DAEMONIZE=1 conflicts with cluster mode settings. But without it, the process goes to the background when executing start-connect-server.sh, causing the Pod to terminate prematurely.

> 8. Are you using the correct spark client version that is fully compatible with your spark on the server?

Yes, I have verified that without using Spark Connect (e.g., using Spark Operator), Spark applications run as expected.

> 9. Check the Kubernetes error logs.

The Kubernetes logs don't show any errors, and jobs are running in local mode.

> 10. Insufficient resources can lead to the application running in local mode.

I wasn't aware that insufficient resources could lead to running in local mode.
Running 30 Spark applications at the same time is slower than one on average
Hi All,

I have a CDH 5.16.2 Hadoop cluster with 1+3 nodes (64C/128G each, 1 NN/RM + 3 DN/NM), and YARN has 192C/240G in total. I used the following test scenario:

1. Each Spark app requests 2G driver memory, 2 driver vcores, 1 executor, 2G executor memory, and 2 executor vcores.
2. One Spark app therefore uses 5G/4C on YARN.
3. First, running only one Spark app takes 40s.
4. Then, running 30 of the same Spark app at once, each app takes 80s on average.

So I want to know why the runtime gap is so big, and how to optimize it?

Thanks
Re: How can I config hive.metastore.warehouse.dir
Hi,

I think you should set hive-site.xml before initializing the SparkSession; Spark will then connect to the metastore and log something like this:

==
2021-08-12 09:21:21 INFO HiveUtils:54 - Initializing HiveMetastoreConnection version 1.2.1 using Spark classes.
2021-08-12 09:21:22 INFO metastore:376 - Trying to connect to metastore with URI thrift://hadoop001:9083
2021-08-12 09:21:22 WARN UserGroupInformation:1535 - No groups available for user hdfs
2021-08-12 09:21:22 INFO metastore:472 - Connected to metastore.
2021-08-12 09:21:22 INFO SessionState:641 - Created local directory: /tmp/8bc342dd-aa0b-407b-b9ad-ff7ed3cd4076_resources
2021-08-12 09:21:22 INFO SessionState:641 - Created HDFS directory: /tmp/hive/hdfs/8bc342dd-aa0b-407b-b9ad-ff7ed3cd4076
2021-08-12 09:21:22 INFO SessionState:641 - Created local directory: /tmp/tempo/8bc342dd-aa0b-407b-b9ad-ff7ed3cd4076
2021-08-12 09:21:22 INFO SessionState:641 - Created HDFS directory: /tmp/hive/hdfs/8bc342dd-aa0b-407b-b9ad-ff7ed3cd4076/_tmp_space.db
===

eab...@163.com

From: igyu
Date: 2021-08-12 11:33
To: user
Subject: How can I config hive.metastore.warehouse.dir

I need to write data to Hive with Spark:

val proper = new Properties
proper.setProperty("fs.defaultFS", "hdfs://nameservice1")
proper.setProperty("dfs.nameservices", "nameservice1")
proper.setProperty("dfs.ha.namenodes.nameservice1", "namenode337,namenode369")
proper.setProperty("dfs.namenode.rpc-address.nameservice1.namenode337", "bigdser1:8020")
proper.setProperty("dfs.namenode.rpc-address.nameservice1.namenode369", "bigdser5:8020")
proper.setProperty("dfs.client.failover.proxy.provider.nameservice1", "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider")
proper.setProperty("hadoop.security.authentication", "Kerberos")
proper.setProperty("fs.hdfs.impl", "org.apache.hadoop.hdfs.DistributedFileSystem")
proper.setProperty("spark.sql.warehouse.dir", "/user/hive/warehouse")
proper.setProperty("hive.metastore.warehouse.dir", "/user/hive/warehouse")
proper.setProperty("hive.metastore.uris", "thrift://bigdser1:9083")
sparkSession.sqlContext.setConf(proper)
sparkSession.sqlContext.setConf("hive.exec.dynamic.partition", "true")
sparkSession.sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")

DF.write.format("jdbc")
  .option("timestampFormat", "/MM/dd HH:mm:ss ZZ")
  .options(cfg)
  // .partitionBy(partitions:_*)
  .mode(mode)
  .insertInto(table)

but I get an error:

21/08/12 11:25:07 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/D:/file/code/Java/jztsynctools/spark-warehouse/').
21/08/12 11:25:07 INFO SharedState: Warehouse path is 'file:/D:/file/code/Java/jztsynctools/spark-warehouse/'.
21/08/12 11:25:08 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
21/08/12 11:25:08 INFO Version: Elasticsearch Hadoop v7.10.2 [f53f4b7b2b]
21/08/12 11:25:08 INFO Utils: Supplied authorities: tidb4ser:11000
21/08/12 11:25:08 INFO Utils: Resolved authority: tidb4ser:11000
21/08/12 11:25:16 INFO UserGroupInformation: Login successful for user jztwk/had...@join.com using keytab file D:\file\jztwk.keytab. Keytab auto renewal enabled : false
login user: jztwk/had...@join.com (auth:KERBEROS)
21/08/12 11:25:25 WARN ProcfsMetricsGetter: Exception when trying to compute pagesize, as a result reporting of ProcessTree metrics is stopped
Exception in thread "main" org.apache.spark.sql.AnalysisException: Table or view not found: hivetest.chinese_part1, the database hivetest doesn't exist.;
	at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:47)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveRelations$$lookupTableFromCatalog(Analyzer.scala:737)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:710)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:708)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$apply$1.apply(AnalysisHelper.scala:90)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$apply$1.apply(AnalysisHelper.scala:90)
	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:89)
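As a PySpark-flavored sketch of the advice above (the original code is Scala), the metastore settings normally carried in hive-site.xml can also be passed on the builder before the session is created, so SharedState does not fall back to a local spark-warehouse directory. The URIs and paths below come from the thread and should be treated as examples.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-write-sketch")
    .config("hive.metastore.uris", "thrift://bigdser1:9083")
    .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
    .enableHiveSupport()   # must be configured before getOrCreate()
    .getOrCreate()
)

# If the metastore connection is set up correctly, the target database is visible here.
spark.sql("SHOW DATABASES").show()
```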
Is there a better way to read kerberized impala tables by spark jdbc?
Hi:

I want to use Spark JDBC to read kerberized Impala tables, like:

```
val impalaUrl = "jdbc:impala://:21050;AuthMech=1;KrbRealm=REALM.COM;KrbHostFQDN=;KrbServiceName=impala"
spark.read.jdbc(impalaUrl)
```

As we know, Spark reads the Impala data on the executors rather than the driver, so it throws a javax.security.sasl.SaslException: GSS initiate failed:

```
Caused by: org.ietf.jgss.GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)
	at sun.security.jgss.krb5.Krb5InitCredential.getInstance(Krb5InitCredential.java:147)
	at sun.security.jgss.krb5.Krb5MechFactory.getCredentialElement(Krb5MechFactory.java:122)
	at sun.security.jgss.krb5.Krb5MechFactory.getMechanismContext(Krb5MechFactory.java:187)
	at sun.security.jgss.GSSManagerImpl.getMechanismContext(GSSManagerImpl.java:224)
	at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:212)
	at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:179)
	at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:192)
	... 20 common frames omitted
```

One way to solve this problem is to set jaas.conf via the "java.security.auth.login.config" property. This is jaas.conf:

```
Client {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  doNotPrompt=true
  useTicketCache=true
  principal="test"
  keyTab="/home/keytab/user.keytab";
};
```

Then set spark.executor.extraJavaOptions like:

```
--conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=/data/disk1/spark-jdbc-impala/conf/jaas.conf -Djavax.security.auth.useSubjectCredsOnly=false"
```

This approach requires absolute paths for the jaas.conf file and the keytab file; in other words, these files must be placed at the same path on every node. Is there a better way? Please help.

Regards
eab...@163.com
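For reference, here is a hedged PySpark sketch of the workaround described above: point the executor JVMs at a jaas.conf via spark.executor.extraJavaOptions so the JDBC connections they open can authenticate with Kerberos. The host name, table name, and driver class are placeholders (the driver class depends on the Impala JDBC jar you ship), and the file paths are the ones from the thread.

```python
from pyspark.sql import SparkSession

jaas_opts = (
    "-Djava.security.auth.login.config=/data/disk1/spark-jdbc-impala/conf/jaas.conf "
    "-Djavax.security.auth.useSubjectCredsOnly=false"
)

spark = (
    SparkSession.builder
    .appName("impala-jdbc-sketch")
    .config("spark.executor.extraJavaOptions", jaas_opts)
    .getOrCreate()
)

impala_url = ("jdbc:impala://impala-host:21050;AuthMech=1;KrbRealm=REALM.COM;"
              "KrbHostFQDN=impala-host;KrbServiceName=impala")

df = (
    spark.read.format("jdbc")
    .option("url", impala_url)
    .option("dbtable", "default.some_table")              # hypothetical table
    .option("driver", "com.cloudera.impala.jdbc.Driver")  # class name depends on the JDBC jar used
    .load()
)
df.show()
```

Note that jaas.conf and the keytab still have to be readable at the same absolute path on every executor node (for example by baking them into the image or distributing them with --files), which is exactly the limitation the question is asking to avoid.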