Re: error trying to save to database (Phoenix)
Sorry for being so dense, and thank you for your help. I was using phoenix-spark-5.0.0-HBase-2.0.jar because it was the latest version in this repo:
https://mvnrepository.com/artifact/org.apache.phoenix/phoenix-spark

On Mon, Aug 21, 2023 at 5:07 PM Sean Owen wrote:
> It is. But you have a third-party library in here which seems to require a
> different version.
>
> On Mon, Aug 21, 2023, 7:04 PM Kal Stevens wrote:
>> OK, it was my impression that Scala was packaged with Spark to avoid a
>> mismatch: https://spark.apache.org/downloads.html
>>
>> It looks like Spark 3.4.1 (my version) uses Scala 2.12.
>> How do I specify the Scala version?
>>
>> On Mon, Aug 21, 2023 at 4:47 PM Sean Owen wrote:
>>> That's a mismatch in the version of Scala that your library uses vs. the
>>> one Spark uses.
>>>
>>> On Mon, Aug 21, 2023, 6:46 PM Kal Stevens wrote:
>>>> I am having a hard time figuring out what I am doing wrong here.
>>>> I am not sure if I have an incompatible version of something installed
>>>> or something else.
>>>> I cannot find anything relevant on Google to figure out what I am
>>>> doing wrong.
>>>> I am using *spark 3.4.1* and *Python 3.10*.
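For readers hitting the same error: you do not pick Spark's Scala version at runtime. Each Spark download is built against one Scala line (3.4.1 ships on Scala 2.12), and any connector jar on the classpath has to be built for that same line; the fix is to swap the connector, not Spark. phoenix-spark-5.0.0-HBase-2.0.jar appears to date from the Scala 2.11 era, which is consistent with the refArrayOps NoSuchMethodError above. A quick way to confirm what a given install was built with (the exact patch version shown is an assumption and may differ):

    $ spark-submit --version
    # the version banner includes a line like:
    #   Using Scala version 2.12.17, OpenJDK 64-Bit Server VM, ...

    # Or from inside a running PySpark session, through the internal py4j
    # gateway ("spark" is the SparkSession the shell provides):
    print(spark.sparkContext._jvm.scala.util.Properties.versionString())
    # e.g. "version 2.12.17"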
Re: error trying to save to database (Phoenix)
OK, it was my impression that Scala was packaged with Spark to avoid a mismatch:
https://spark.apache.org/downloads.html

It looks like Spark 3.4.1 (my version) uses Scala 2.12.
How do I specify the Scala version?

On Mon, Aug 21, 2023 at 4:47 PM Sean Owen wrote:
> That's a mismatch in the version of Scala that your library uses vs. the
> one Spark uses.
>
> On Mon, Aug 21, 2023, 6:46 PM Kal Stevens wrote:
>> I am having a hard time figuring out what I am doing wrong here.
>> I am not sure if I have an incompatible version of something installed
>> or something else.
>> I cannot find anything relevant on Google to figure out what I am doing
>> wrong.
>> I am using *spark 3.4.1* and *Python 3.10*.
error trying to save to database (Phoenix)
I am having a hard time figuring out what I am doing wrong here.
I am not sure if I have an incompatible version of something installed or something else.
I cannot find anything relevant on Google to figure out what I am doing wrong.
I am using *spark 3.4.1* and *Python 3.10*.

This is my code to save my dataframe:

    urls = []
    pull_sitemap_xml(robot, urls)
    df = spark.createDataFrame(data=urls, schema=schema)
    df.write.format("org.apache.phoenix.spark") \
        .mode("overwrite") \
        .option("table", "property") \
        .option("zkUrl", "192.168.1.162:2181") \
        .save()

urls is an array of maps, each containing a "url" and a "last_mod" field.

Here is the error that I am getting:

    Traceback (most recent call last):
      File "/home/kal/real-estate/pullhttp/pull_properties.py", line 65, in main
        .save()
      File "/hadoop/spark/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 1396, in save
        self._jwrite.save()
      File "/hadoop/spark/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__
        return_value = get_return_value(
      File "/hadoop/spark/spark/python/lib/pyspark.zip/pyspark/errors/exceptions/captured.py", line 169, in deco
        return f(*a, **kw)
      File "/hadoop/spark/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py", line 326, in get_return_value
        raise Py4JJavaError(
    py4j.protocol.Py4JJavaError: An error occurred while calling o636.save.
    : java.lang.NoSuchMethodError: 'scala.collection.mutable.ArrayOps scala.Predef$.refArrayOps(java.lang.Object[])'
        at org.apache.phoenix.spark.DataFrameFunctions.getFieldArray(DataFrameFunctions.scala:76)
        at org.apache.phoenix.spark.DataFrameFunctions.saveToPhoenix(DataFrameFunctions.scala:35)
        at org.apache.phoenix.spark.DataFrameFunctions.saveToPhoenix(DataFrameFunctions.scala:28)
        at org.apache.phoenix.spark.DefaultSource.createRelation(DefaultSource.scala:47)
        at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:47)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
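For context, here is a minimal self-contained sketch of the DataFrame construction described above. The original "schema" and "pull_sitemap_xml" are not shown in the post, so the string field types and sample rows below are assumptions for illustration:

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType

    spark = SparkSession.builder.appName("phoenix-save-sketch").getOrCreate()

    # "urls is an array of maps, each containing a 'url' and a 'last_mod' field"
    urls = [
        {"url": "https://example.com/listing/1", "last_mod": "2023-08-20"},
        {"url": "https://example.com/listing/2", "last_mod": "2023-08-21"},
    ]

    # Hypothetical schema; the real one is not shown in the post.
    schema = StructType([
        StructField("url", StringType(), nullable=False),
        StructField("last_mod", StringType(), nullable=True),
    ])

    df = spark.createDataFrame(data=urls, schema=schema)
    df.show(truncate=False)

The createDataFrame call itself is fine; per the stack trace, the failure happens later inside the Phoenix connector's save path (DataFrameFunctions.saveToPhoenix), which points at the connector jar rather than this code.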
Re: Problem with spark 3.4.1 not finding spark java classes
Nevermind, I was doing something dumb.

On Sun, Aug 20, 2023 at 9:53 PM Kal Stevens wrote:
> Are there installation instructions for Spark 3.4.1?
>
> I defined SPARK_HOME as it describes here:
> https://spark.apache.org/docs/latest/api/python/getting_started/install.html
>
> ls $SPARK_HOME/python/lib
> py4j-0.10.9.7-src.zip  PY4J_LICENSE.txt  pyspark.zip
>
> I am getting a class not found error
> import org.apache.spark.SparkContext
>
> I also unzipped those files just in case, but that gives the same error.
>
> It sounds like this is because pyspark is not installed, but as far as I
> can tell it is. PySpark is installed in the correct Python version.
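The thread never states the fix explicitly, but the traceback shows what went wrong: "import org.apache.spark.SparkContext" is Scala/Java import syntax pasted into a Python file, so Python tries to resolve it as a Python module path and raises ModuleNotFoundError. A sketch of the PySpark equivalents:

    # PySpark equivalents of the Scala import in the traceback above.
    from pyspark import SparkContext          # RDD-era entry point
    from pyspark.sql import SparkSession      # preferred entry point since Spark 2.x

    spark = SparkSession.builder.appName("pull_apartments").getOrCreate()
    sc = spark.sparkContext                   # the underlying SparkContext

(In the pyspark shell, as in the session above, "spark" and "sc" already exist and no import is needed at all.)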
Problem with spark 3.4.1 not finding spark java classes
Are there installation instructions for Spark 3.4.1?

I defined SPARK_HOME as it describes here:
https://spark.apache.org/docs/latest/api/python/getting_started/install.html

    ls $SPARK_HOME/python/lib
    py4j-0.10.9.7-src.zip  PY4J_LICENSE.txt  pyspark.zip

I am getting a class not found error:

    import org.apache.spark.SparkContext

I also unzipped those files just in case, but that gives the same error.

It sounds like this is because pyspark is not installed, but as far as I can tell it is. PySpark is installed in the correct Python version:

    root@namenode:/home/spark/# pip3.10 install pyspark
    Requirement already satisfied: pyspark in /usr/local/lib/python3.10/dist-packages (3.4.1)
    Requirement already satisfied: py4j==0.10.9.7 in /usr/local/lib/python3.10/dist-packages (from pyspark) (0.10.9.7)

          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /__ / .__/\_,_/_/ /_/\_\   version 3.4.1
          /_/

    Using Python version 3.10.12 (main, Jun 11 2023 05:26:28)
    Spark context Web UI available at http://namenode:4040
    Spark context available as 'sc' (master = yarn, app id = application_1692452853354_0008).
    SparkSession available as 'spark'.
    Traceback (most recent call last):
      File "/home/spark/real-estate/pullhttp/pull_apartments.py", line 11, in <module>
        import org.apache.spark.SparkContext
    ModuleNotFoundError: No module named 'org.apache.spark.SparkContext'

    2023-08-20T19:45:19,242 INFO [Thread-5] spark.SparkContext: SparkContext is stopping with exitCode 0.
    2023-08-20T19:45:19,246 INFO [Thread-5] server.AbstractConnector: Stopped Spark@467be156{HTTP/1.1, (http/1.1)}{0.0.0.0:4040}
    2023-08-20T19:45:19,247 INFO [Thread-5] ui.SparkUI: Stopped Spark web UI at http://namenode:4040
    2023-08-20T19:45:19,251 INFO [YARN application state monitor] cluster.YarnClientSchedulerBackend: Interrupting monitor thread
    2023-08-20T19:45:19,260 INFO [Thread-5] cluster.YarnClientSchedulerBackend: Shutting down all executors
    2023-08-20T19:45:19,260 INFO [dispatcher-CoarseGrainedScheduler] cluster.YarnSchedulerBackend$YarnDriverEndpoint: Asking each executor to shut down
    2023-08-20T19:45:19,263 INFO [Thread-5] cluster.YarnClientSchedulerBackend: YARN client scheduler backend Stopped
    2023-08-20T19:45:19,267 INFO [dispatcher-event-loop-29] spark.MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
    2023-08-20T19:45:19,271 INFO [Thread-5] memory.MemoryStore: MemoryStore cleared
    2023-08-20T19:45:19,271 INFO [Thread-5] storage.BlockManager: BlockManager stopped
    2023-08-20T19:45:19,275 INFO [Thread-5] storage.BlockManagerMaster: BlockManagerMaster stopped
    2023-08-20T19:45:19,276 INFO [dispatcher-event-loop-8] scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
    2023-08-20T19:45:19,279 INFO [Thread-5] spark.SparkContext: Successfully stopped SparkContext
    2023-08-20T19:45:19,687 INFO [shutdown-hook-0] util.ShutdownHookManager: Shutdown hook called
    2023-08-20T19:45:19,688 INFO [shutdown-hook-0] util.ShutdownHookManager: Deleting directory /tmp/spark-9375452d-1989-4df5-9d85-950f751ce034/pyspark-2fcfbc8e-fd40-41f5-bf8d-e4c460332895
    2023-08-20T19:45:19,689 INFO [shutdown-hook-0] util.ShutdownHookManager: Deleting directory /tmp/spark-bf6cbc46-ad8b-429a-9d7a-7d98b7d7912e
    2023-08-20T19:45:19,690 INFO [shutdown-hook-0] util.ShutdownHookManager: Deleting directory /tmp/spark-9375452d-1989-4df5-9d85-950f751ce034
    2023-08-20T19:45:19,691 INFO [shutdown-hook-0] util.ShutdownHookManager: Deleting directory /tmp/localPyFiles-6c113b2b-9ac3-45e3-9032-d1c83419aa64
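A quick sanity check when the pip install and SPARK_HOME might disagree is to ask the interpreter itself which pyspark it resolves (a sketch; the expected values shown are the ones from this thread):

    # Confirm which pyspark package the Python 3.10 interpreter actually
    # resolves, and that it matches the cluster's Spark version (3.4.1 here).
    import pyspark
    print(pyspark.__version__)   # expect: 3.4.1
    print(pyspark.__file__)      # e.g. /usr/local/lib/python3.10/dist-packages/pyspark/__init__.py

Note that this alone would not have fixed the error above, since the import that failed was Scala syntax rather than a missing package.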