Re: error trying to save to database (Phoenix)

2023-08-21 Thread Kal Stevens
Sorry for being so dense, and thank you for your help.

I was using this version:
phoenix-spark-5.0.0-HBase-2.0.jar

because it was the latest in this repo:
https://mvnrepository.com/artifact/org.apache.phoenix/phoenix-spark
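
For anyone who finds this thread later: phoenix-spark-5.0.0-HBase-2.0 is a Spark 2.x-era connector and appears to be built against an older Scala line than the Scala 2.12 that the stock Spark 3.4.x distribution ships, which is consistent with the refArrayOps NoSuchMethodError quoted below. A quick way to confirm the versions in play (a hedged sketch; it assumes a running PySpark session named spark and that SPARK_HOME points at the local install):

# Spark version of the running session
print(spark.version)  # e.g. "3.4.1"

# Spark bundles its own Scala runtime; the jar name in $SPARK_HOME/jars
# shows which Scala line this build uses (e.g. scala-library-2.12.x.jar)
import glob, os
print(glob.glob(os.path.join(os.environ["SPARK_HOME"], "jars", "scala-library-*.jar")))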


On Mon, Aug 21, 2023 at 5:07 PM Sean Owen  wrote:

> It is. But you have a third-party library in here which seems to require a
> different version.
>
> On Mon, Aug 21, 2023, 7:04 PM Kal Stevens  wrote:
>
>> OK, it was my impression that Scala was packaged with Spark to avoid a
>> mismatch:
>> https://spark.apache.org/downloads.html
>>
>> It looks like Spark 3.4.1 (my version) uses Scala 2.12.
>> How do I specify the Scala version?
>>
>> On Mon, Aug 21, 2023 at 4:47 PM Sean Owen  wrote:
>>
>>> That's a mismatch between the Scala version your library uses and the one
>>> Spark uses.
>>>
>>> On Mon, Aug 21, 2023, 6:46 PM Kal Stevens  wrote:
>>>
>>>> I am having a hard time figuring out what I am doing wrong here.
>>>> I am not sure if I have an incompatible version of something installed
>>>> or something else.
>>>> I cannot find anything relevant on Google to figure out what I am doing
>>>> wrong.
>>>> I am using *Spark 3.4.1* and *Python 3.10*.
>>>>
>>>> This is my code to save my dataframe
>>>> urls = []
>>>> pull_sitemap_xml(robot, urls)
>>>> df = spark.createDataFrame(data=urls, schema=schema)
>>>> df.write.format("org.apache.phoenix.spark") \
>>>> .mode("overwrite") \
>>>> .option("table", "property") \
>>>> .option("zkUrl", "192.168.1.162:2181") \
>>>> .save()
>>>>
>>>> urls is an array of maps, containing a "url" and a "last_mod" field.
>>>>
>>>> Here is the error that I am getting
>>>>
>>>> Traceback (most recent call last):
>>>>
>>>>   File "/home/kal/real-estate/pullhttp/pull_properties.py", line 65, in
>>>> main
>>>>
>>>> .save()
>>>>
>>>>   File
>>>> "/hadoop/spark/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py",
>>>> line 1396, in save
>>>>
>>>> self._jwrite.save()
>>>>
>>>>   File
>>>> "/hadoop/spark/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py",
>>>> line 1322, in __call__
>>>>
>>>> return_value = get_return_value(
>>>>
>>>>   File
>>>> "/hadoop/spark/spark/python/lib/pyspark.zip/pyspark/errors/exceptions/captured.py",
>>>> line 169, in deco
>>>>
>>>> return f(*a, **kw)
>>>>
>>>>   File
>>>> "/hadoop/spark/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py",
>>>> line 326, in get_return_value
>>>>
>>>> raise Py4JJavaError(
>>>>
>>>> py4j.protocol.Py4JJavaError: An error occurred while calling o636.save.
>>>>
>>>> : java.lang.NoSuchMethodError: 'scala.collection.mutable.ArrayOps
>>>> scala.Predef$.refArrayOps(java.lang.Object[])'
>>>>
>>>> at
>>>> org.apache.phoenix.spark.DataFrameFunctions.getFieldArray(DataFrameFunctions.scala:76)
>>>>
>>>> at
>>>> org.apache.phoenix.spark.DataFrameFunctions.saveToPhoenix(DataFrameFunctions.scala:35)
>>>>
>>>> at
>>>> org.apache.phoenix.spark.DataFrameFunctions.saveToPhoenix(DataFrameFunctions.scala:28)
>>>>
>>>> at
>>>> org.apache.phoenix.spark.DefaultSource.createRelation(DefaultSource.scala:47)
>>>>
>>>> at
>>>> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:47)
>>>>
>>>> at
>>>> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
>>>>
>>>> at
>>>> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
>>>>
>>>


Re: error trying to save to database (Phoenix)

2023-08-21 Thread Kal Stevens
OK, it was my impression that Scala was packaged with Spark to avoid a
mismatch:
https://spark.apache.org/downloads.html

It looks like Spark 3.4.1 (my version) uses Scala 2.12.
How do I specify the Scala version?
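
For the archive: the Scala version is not something you choose from PySpark; it is fixed by the Spark build you run on, so the practical fix is to use a Phoenix connector built for the same Scala/Spark line (Scala 2.12 for the stock Spark 3.4.x download). A minimal, hedged sketch of shipping such a jar with the session; the jar path below is a placeholder, not a real artifact name:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("phoenix-write")
    # placeholder path: a Phoenix Spark connector built for Scala 2.12 / Spark 3.x
    .config("spark.jars", "/path/to/phoenix-connector-for-scala-2.12.jar")
    .getOrCreate()
)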

On Mon, Aug 21, 2023 at 4:47 PM Sean Owen  wrote:

> That's a mismatch between the Scala version your library uses and the one
> Spark uses.
>
> On Mon, Aug 21, 2023, 6:46 PM Kal Stevens  wrote:
>
>> I am having a hard time figuring out what I am doing wrong here.
>> I am not sure if I have an incompatible version of something installed or
>> something else.
>> I cannot find anything relevant on Google to figure out what I am doing
>> wrong.
>> I am using *Spark 3.4.1* and *Python 3.10*.
>>
>> This is my code to save my dataframe
>> urls = []
>> pull_sitemap_xml(robot, urls)
>> df = spark.createDataFrame(data=urls, schema=schema)
>> df.write.format("org.apache.phoenix.spark") \
>> .mode("overwrite") \
>> .option("table", "property") \
>> .option("zkUrl", "192.168.1.162:2181") \
>> .save()
>>
>> urls is an array of maps, containing a "url" and a "last_mod" field.
>>
>> Here is the error that I am getting
>>
>> Traceback (most recent call last):
>>
>>   File "/home/kal/real-estate/pullhttp/pull_properties.py", line 65, in
>> main
>>
>> .save()
>>
>>   File
>> "/hadoop/spark/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py",
>> line 1396, in save
>>
>> self._jwrite.save()
>>
>>   File
>> "/hadoop/spark/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py",
>> line 1322, in __call__
>>
>> return_value = get_return_value(
>>
>>   File
>> "/hadoop/spark/spark/python/lib/pyspark.zip/pyspark/errors/exceptions/captured.py",
>> line 169, in deco
>>
>> return f(*a, **kw)
>>
>>   File
>> "/hadoop/spark/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py",
>> line 326, in get_return_value
>>
>> raise Py4JJavaError(
>>
>> py4j.protocol.Py4JJavaError: An error occurred while calling o636.save.
>>
>> : java.lang.NoSuchMethodError: 'scala.collection.mutable.ArrayOps
>> scala.Predef$.refArrayOps(java.lang.Object[])'
>>
>> at
>> org.apache.phoenix.spark.DataFrameFunctions.getFieldArray(DataFrameFunctions.scala:76)
>>
>> at
>> org.apache.phoenix.spark.DataFrameFunctions.saveToPhoenix(DataFrameFunctions.scala:35)
>>
>> at
>> org.apache.phoenix.spark.DataFrameFunctions.saveToPhoenix(DataFrameFunctions.scala:28)
>>
>> at
>> org.apache.phoenix.spark.DefaultSource.createRelation(DefaultSource.scala:47)
>>
>> at
>> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:47)
>>
>> at
>> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
>>
>> at
>> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
>>
>


error trying to save to database (Phoenix)

2023-08-21 Thread Kal Stevens
I am having a hard time figuring out what I am doing wrong here.
I am not sure if I have an incompatible version of something installed or
something else.
I cannot find anything relevant on Google to figure out what I am doing
wrong.
I am using *Spark 3.4.1* and *Python 3.10*.

This is my code to save my DataFrame:
urls = []
pull_sitemap_xml(robot, urls)
df = spark.createDataFrame(data=urls, schema=schema)
df.write.format("org.apache.phoenix.spark") \
    .mode("overwrite") \
    .option("table", "property") \
    .option("zkUrl", "192.168.1.162:2181") \
    .save()

urls is an array of maps, each containing a "url" and a "last_mod" field.
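
A small, self-contained sketch of that input shape (the field names come from this thread; the types and example values are assumptions), using the same running spark session:

from pyspark.sql.types import StructType, StructField, StringType

# field names from the thread; StringType for both columns is an assumption
schema = StructType([
    StructField("url", StringType(), False),
    StructField("last_mod", StringType(), True),
])

# hypothetical example rows in the shape described above
urls = [
    {"url": "https://example.com/listing/1", "last_mod": "2023-08-20"},
    {"url": "https://example.com/listing/2", "last_mod": "2023-08-21"},
]

df = spark.createDataFrame(data=urls, schema=schema)
df.printSchema()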

Here is the error that I am getting

Traceback (most recent call last):

  File "/home/kal/real-estate/pullhttp/pull_properties.py", line 65, in main

.save()

  File
"/hadoop/spark/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py",
line 1396, in save

self._jwrite.save()

  File
"/hadoop/spark/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py",
line 1322, in __call__

return_value = get_return_value(

  File
"/hadoop/spark/spark/python/lib/pyspark.zip/pyspark/errors/exceptions/captured.py",
line 169, in deco

return f(*a, **kw)

  File
"/hadoop/spark/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py",
line 326, in get_return_value

raise Py4JJavaError(

py4j.protocol.Py4JJavaError: An error occurred while calling o636.save.

: java.lang.NoSuchMethodError: 'scala.collection.mutable.ArrayOps
scala.Predef$.refArrayOps(java.lang.Object[])'

at
org.apache.phoenix.spark.DataFrameFunctions.getFieldArray(DataFrameFunctions.scala:76)

at
org.apache.phoenix.spark.DataFrameFunctions.saveToPhoenix(DataFrameFunctions.scala:35)

at
org.apache.phoenix.spark.DataFrameFunctions.saveToPhoenix(DataFrameFunctions.scala:28)

at
org.apache.phoenix.spark.DefaultSource.createRelation(DefaultSource.scala:47)

at
org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:47)

at
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)

at
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)


Problem with spark 3.4.1 not finding spark java classes

2023-08-21 Thread Kal Stevens
I am getting a class not found error on this import:
import org.apache.spark.SparkContext

It sounds like this is because pyspark is not installed, but as far as I
can tell it is.
PySpark is installed for the correct Python version:


root@namenode:/home/spark/# pip3.10 install pyspark
Requirement already satisfied: pyspark in
/usr/local/lib/python3.10/dist-packages (3.4.1)
Requirement already satisfied: py4j==0.10.9.7 in
/usr/local/lib/python3.10/dist-packages (from pyspark) (0.10.9.7)
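
A complementary check from inside the interpreter, confirming which pyspark the script actually imports and its version (a sketch that only restates what pip reports above):

import pyspark
print(pyspark.__version__)  # expected to match the cluster, e.g. "3.4.1"
print(pyspark.__file__)     # e.g. /usr/local/lib/python3.10/dist-packages/pyspark/__init__.py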


Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.4.1
      /_/

Using Python version 3.10.12 (main, Jun 11 2023 05:26:28)
Spark context Web UI available at http://namenode:4040
Spark context available as 'sc' (master = yarn, app id =
application_1692452853354_0008).
SparkSession available as 'spark'.
Traceback (most recent call last):
  File "/home/spark/real-estate/pullhttp/pull_apartments.py", line 11, in

import org.apache.spark.SparkContext
ModuleNotFoundError: No module named 'org.apache.spark.SparkContext'
2023-08-20T19:45:19,242 INFO  [Thread-5] spark.SparkContext: SparkContext
is stopping with exitCode 0.
2023-08-20T19:45:19,246 INFO  [Thread-5] server.AbstractConnector: Stopped
Spark@467be156{HTTP/1.1, (http/1.1)}{0.0.0.0:4040}
2023-08-20T19:45:19,247 INFO  [Thread-5] ui.SparkUI: Stopped Spark web UI
at http://namenode:4040
2023-08-20T19:45:19,251 INFO  [YARN application state monitor]
cluster.YarnClientSchedulerBackend: Interrupting monitor thread
2023-08-20T19:45:19,260 INFO  [Thread-5]
cluster.YarnClientSchedulerBackend: Shutting down all executors
2023-08-20T19:45:19,260 INFO  [dispatcher-CoarseGrainedScheduler]
cluster.YarnSchedulerBackend$YarnDriverEndpoint: Asking each executor to
shut down
2023-08-20T19:45:19,263 INFO  [Thread-5]
cluster.YarnClientSchedulerBackend: YARN client scheduler backend Stopped
2023-08-20T19:45:19,267 INFO  [dispatcher-event-loop-29]
spark.MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint
stopped!
2023-08-20T19:45:19,271 INFO  [Thread-5] memory.MemoryStore: MemoryStore
cleared
2023-08-20T19:45:19,271 INFO  [Thread-5] storage.BlockManager: BlockManager
stopped
2023-08-20T19:45:19,275 INFO  [Thread-5] storage.BlockManagerMaster:
BlockManagerMaster stopped
2023-08-20T19:45:19,276 INFO  [dispatcher-event-loop-8]
scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:
OutputCommitCoordinator stopped!
2023-08-20T19:45:19,279 INFO  [Thread-5] spark.SparkContext: Successfully
stopped SparkContext
2023-08-20T19:45:19,687 INFO  [shutdown-hook-0] util.ShutdownHookManager:
Shutdown hook called
2023-08-20T19:45:19,688 INFO  [shutdown-hook-0] util.ShutdownHookManager:
Deleting directory
/tmp/spark-9375452d-1989-4df5-9d85-950f751ce034/pyspark-2fcfbc8e-fd40-41f5-bf8d-e4c460332895
2023-08-20T19:45:19,689 INFO  [shutdown-hook-0] util.ShutdownHookManager:
Deleting directory /tmp/spark-bf6cbc46-ad8b-429a-9d7a-7d98b7d7912e
2023-08-20T19:45:19,690 INFO  [shutdown-hook-0] util.ShutdownHookManager:
Deleting directory /tmp/spark-9375452d-1989-4df5-9d85-950f751ce034
2023-08-20T19:45:19,691 INFO  [shutdown-hook-0] util.ShutdownHookManager:
Deleting directory /tmp/localPyFiles-6c113b2b-9ac3-45e3-9032-d1c83419aa64


Re: Problem with spark 3.4.1 not finding spark java classes

2023-08-21 Thread Kal Stevens
Never mind, I was doing something dumb.
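
For anyone searching the archive later: the likely culprit is visible in the quoted traceback below. import org.apache.spark.SparkContext is a JVM class path, not a Python module, so Python raises ModuleNotFoundError regardless of how PySpark is installed. The driver-side entry points live in the pyspark package; a minimal sketch:

# Python-side classes, not org.apache.spark.*
from pyspark import SparkContext
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("example").getOrCreate()
sc = spark.sparkContext  # the SparkContext behind the session
print(sc.version)
spark.stop()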

On Sun, Aug 20, 2023 at 9:53 PM Kal Stevens  wrote:

> Are there installation instructions for Spark 3.4.1?
>
> I defined SPARK_HOME as it describes here
>
> https://spark.apache.org/docs/latest/api/python/getting_started/install.html
>
> ls $SPARK_HOME/python/lib
> py4j-0.10.9.7-src.zip  PY4J_LICENSE.txt  pyspark.zip
>
>
> I am getting a class not found error
> import org.apache.spark.SparkContext
>
> I also unzipped those files just in case but that gives the same error.
>
>
> It sounds like this is because pyspark is not installed, but as far as I
> can tell it is.
> PySpark is installed for the correct Python version:
>
>
> root@namenode:/home/spark/# pip3.10 install pyspark
> Requirement already satisfied: pyspark in
> /usr/local/lib/python3.10/dist-packages (3.4.1)
> Requirement already satisfied: py4j==0.10.9.7 in
> /usr/local/lib/python3.10/dist-packages (from pyspark) (0.10.9.7)
>
>
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /__ / .__/\_,_/_/ /_/\_\   version 3.4.1
>       /_/
>
> Using Python version 3.10.12 (main, Jun 11 2023 05:26:28)
> Spark context Web UI available at http://namenode:4040
> Spark context available as 'sc' (master = yarn, app id =
> application_1692452853354_0008).
> SparkSession available as 'spark'.
> Traceback (most recent call last):
>   File "/home/spark/real-estate/pullhttp/pull_apartments.py", line 11, in <module>
> import org.apache.spark.SparkContext
> ModuleNotFoundError: No module named 'org.apache.spark.SparkContext'
> 2023-08-20T19:45:19,242 INFO  [Thread-5] spark.SparkContext: SparkContext
> is stopping with exitCode 0.
> 2023-08-20T19:45:19,246 INFO  [Thread-5] server.AbstractConnector: Stopped
> Spark@467be156{HTTP/1.1, (http/1.1)}{0.0.0.0:4040}
> 2023-08-20T19:45:19,247 INFO  [Thread-5] ui.SparkUI: Stopped Spark web UI
> at http://namenode:4040
> 2023-08-20T19:45:19,251 INFO  [YARN application state monitor]
> cluster.YarnClientSchedulerBackend: Interrupting monitor thread
> 2023-08-20T19:45:19,260 INFO  [Thread-5]
> cluster.YarnClientSchedulerBackend: Shutting down all executors
> 2023-08-20T19:45:19,260 INFO  [dispatcher-CoarseGrainedScheduler]
> cluster.YarnSchedulerBackend$YarnDriverEndpoint: Asking each executor to
> shut down
> 2023-08-20T19:45:19,263 INFO  [Thread-5]
> cluster.YarnClientSchedulerBackend: YARN client scheduler backend Stopped
> 2023-08-20T19:45:19,267 INFO  [dispatcher-event-loop-29]
> spark.MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint
> stopped!
> 2023-08-20T19:45:19,271 INFO  [Thread-5] memory.MemoryStore: MemoryStore
> cleared
> 2023-08-20T19:45:19,271 INFO  [Thread-5] storage.BlockManager:
> BlockManager stopped
> 2023-08-20T19:45:19,275 INFO  [Thread-5] storage.BlockManagerMaster:
> BlockManagerMaster stopped
> 2023-08-20T19:45:19,276 INFO  [dispatcher-event-loop-8]
> scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:
> OutputCommitCoordinator stopped!
> 2023-08-20T19:45:19,279 INFO  [Thread-5] spark.SparkContext: Successfully
> stopped SparkContext
> 2023-08-20T19:45:19,687 INFO  [shutdown-hook-0] util.ShutdownHookManager:
> Shutdown hook called
> 2023-08-20T19:45:19,688 INFO  [shutdown-hook-0] util.ShutdownHookManager:
> Deleting directory
> /tmp/spark-9375452d-1989-4df5-9d85-950f751ce034/pyspark-2fcfbc8e-fd40-41f5-bf8d-e4c460332895
> 2023-08-20T19:45:19,689 INFO  [shutdown-hook-0] util.ShutdownHookManager:
> Deleting directory /tmp/spark-bf6cbc46-ad8b-429a-9d7a-7d98b7d7912e
> 2023-08-20T19:45:19,690 INFO  [shutdown-hook-0] util.ShutdownHookManager:
> Deleting directory /tmp/spark-9375452d-1989-4df5-9d85-950f751ce034
> 2023-08-20T19:45:19,691 INFO  [shutdown-hook-0] util.ShutdownHookManager:
> Deleting directory /tmp/localPyFiles-6c113b2b-9ac3-45e3-9032-d1c83419aa64
>
>


Problem with spark 3.4.1 not finding spark java classes

2023-08-20 Thread Kal Stevens
Are there installation instructions for Spark 3.4.1?

I defined SPARK_HOME as described here:
https://spark.apache.org/docs/latest/api/python/getting_started/install.html

ls $SPARK_HOME/python/lib
py4j-0.10.9.7-src.zip  PY4J_LICENSE.txt  pyspark.zip
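
When running scripts against a downloaded Spark distribution rather than a pip-installed pyspark, those two zip files (or the $SPARK_HOME/python directory) typically have to be on the Python path. A hedged sketch of doing that from the script itself; an equivalent PYTHONPATH export, or the findspark package, achieves the same thing:

import glob, os, sys

spark_home = os.environ["SPARK_HOME"]
# make the bundled PySpark sources and the py4j zip importable
sys.path.insert(0, os.path.join(spark_home, "python"))
sys.path.extend(glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*-src.zip")))

import pyspark
print(pyspark.__version__)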


I am getting a class not found error on this import:
import org.apache.spark.SparkContext

I also unzipped those files just in case but that gives the same error.


It sounds like this is because pyspark is not installed, but as far as I
can tell it is.
PySpark is installed for the correct Python version:


root@namenode:/home/spark/# pip3.10 install pyspark
Requirement already satisfied: pyspark in
/usr/local/lib/python3.10/dist-packages (3.4.1)
Requirement already satisfied: py4j==0.10.9.7 in
/usr/local/lib/python3.10/dist-packages (from pyspark) (0.10.9.7)


Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.4.1
      /_/

Using Python version 3.10.12 (main, Jun 11 2023 05:26:28)
Spark context Web UI available at http://namenode:4040
Spark context available as 'sc' (master = yarn, app id =
application_1692452853354_0008).
SparkSession available as 'spark'.
Traceback (most recent call last):
  File "/home/spark/real-estate/pullhttp/pull_apartments.py", line 11, in

import org.apache.spark.SparkContext
ModuleNotFoundError: No module named 'org.apache.spark.SparkContext'
2023-08-20T19:45:19,242 INFO  [Thread-5] spark.SparkContext: SparkContext
is stopping with exitCode 0.
2023-08-20T19:45:19,246 INFO  [Thread-5] server.AbstractConnector: Stopped
Spark@467be156{HTTP/1.1, (http/1.1)}{0.0.0.0:4040}
2023-08-20T19:45:19,247 INFO  [Thread-5] ui.SparkUI: Stopped Spark web UI
at http://namenode:4040
2023-08-20T19:45:19,251 INFO  [YARN application state monitor]
cluster.YarnClientSchedulerBackend: Interrupting monitor thread
2023-08-20T19:45:19,260 INFO  [Thread-5]
cluster.YarnClientSchedulerBackend: Shutting down all executors
2023-08-20T19:45:19,260 INFO  [dispatcher-CoarseGrainedScheduler]
cluster.YarnSchedulerBackend$YarnDriverEndpoint: Asking each executor to
shut down
2023-08-20T19:45:19,263 INFO  [Thread-5]
cluster.YarnClientSchedulerBackend: YARN client scheduler backend Stopped
2023-08-20T19:45:19,267 INFO  [dispatcher-event-loop-29]
spark.MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint
stopped!
2023-08-20T19:45:19,271 INFO  [Thread-5] memory.MemoryStore: MemoryStore
cleared
2023-08-20T19:45:19,271 INFO  [Thread-5] storage.BlockManager: BlockManager
stopped
2023-08-20T19:45:19,275 INFO  [Thread-5] storage.BlockManagerMaster:
BlockManagerMaster stopped
2023-08-20T19:45:19,276 INFO  [dispatcher-event-loop-8]
scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:
OutputCommitCoordinator stopped!
2023-08-20T19:45:19,279 INFO  [Thread-5] spark.SparkContext: Successfully
stopped SparkContext
2023-08-20T19:45:19,687 INFO  [shutdown-hook-0] util.ShutdownHookManager:
Shutdown hook called
2023-08-20T19:45:19,688 INFO  [shutdown-hook-0] util.ShutdownHookManager:
Deleting directory
/tmp/spark-9375452d-1989-4df5-9d85-950f751ce034/pyspark-2fcfbc8e-fd40-41f5-bf8d-e4c460332895
2023-08-20T19:45:19,689 INFO  [shutdown-hook-0] util.ShutdownHookManager:
Deleting directory /tmp/spark-bf6cbc46-ad8b-429a-9d7a-7d98b7d7912e
2023-08-20T19:45:19,690 INFO  [shutdown-hook-0] util.ShutdownHookManager:
Deleting directory /tmp/spark-9375452d-1989-4df5-9d85-950f751ce034
2023-08-20T19:45:19,691 INFO  [shutdown-hook-0] util.ShutdownHookManager:
Deleting directory /tmp/localPyFiles-6c113b2b-9ac3-45e3-9032-d1c83419aa64