unsubscribe

2023-08-22 Thread heri wijayanto
unsubscribe


Re: error trying to save to database (Phoenix)

2023-08-22 Thread Gera Shegalov
If you look at the dependencies of the 5.0.0-HBase-2.0 artifact
https://mvnrepository.com/artifact/org.apache.phoenix/phoenix-spark/5.0.0-HBase-2.0
it was built against Spark 2.3.0, Scala 2.11.8

You may need to check with the Phoenix community whether your setup with
Spark 3.4.1 etc. is supported by something like
https://github.com/apache/phoenix-connectors/tree/master/phoenix5-spark3
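
A quick way to confirm which Scala version a given Spark build ships with
(a rough sketch, assuming an active SparkSession named spark and py4j
access to the driver JVM):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# scala.util.Properties.versionString() reports the Scala version the
# running Spark classes were built against, e.g. "version 2.12.x" on Spark 3.4
print(spark.sparkContext._jvm.scala.util.Properties.versionString())

Running "spark-submit --version" on the command line prints the same
information.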



On Mon, Aug 21, 2023 at 6:12 PM Kal Stevens  wrote:

> Sorry for being so dense, and thank you for your help.
>
> I was using this version
> phoenix-spark-5.0.0-HBase-2.0.jar
>
> Because it was the latest in this repo
> https://mvnrepository.com/artifact/org.apache.phoenix/phoenix-spark
>
>
> On Mon, Aug 21, 2023 at 5:07 PM Sean Owen  wrote:
>
>> It is. But you have a third party library in here which seems to require
>> a different version.
>>
>> On Mon, Aug 21, 2023, 7:04 PM Kal Stevens  wrote:
>>
>>> OK, it was my impression that Scala was packaged with Spark to avoid a
>>> mismatch:
>>> https://spark.apache.org/downloads.html
>>>
>>> It looks like Spark 3.4.1 (my version) uses Scala 2.12.
>>> How do I specify the Scala version?
>>>
>>> On Mon, Aug 21, 2023 at 4:47 PM Sean Owen  wrote:
>>>
 That's a mismatch between the version of Scala that your library uses and
 the version Spark uses.

 On Mon, Aug 21, 2023, 6:46 PM Kal Stevens 
 wrote:

> I am having a hard time figuring out what I am doing wrong here.
> I am not sure if I have an incompatible version of something installed
> or something else.
> I cannot find anything relevant on Google to figure out what I am
> doing wrong.
> I am using *Spark 3.4.1* and *Python 3.10*.
>
> This is my code to save my dataframe:
>
> urls = []
> pull_sitemap_xml(robot, urls)
> df = spark.createDataFrame(data=urls, schema=schema)
> df.write.format("org.apache.phoenix.spark") \
>     .mode("overwrite") \
>     .option("table", "property") \
>     .option("zkUrl", "192.168.1.162:2181") \
>     .save()
>
> urls is an array of maps, containing a "url" and a "last_mod" field.
>
> Here is the error that I am getting
>
> Traceback (most recent call last):
>
>   File "/home/kal/real-estate/pullhttp/pull_properties.py", line 65,
> in main
>
> .save()
>
>   File
> "/hadoop/spark/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py",
> line 1396, in save
>
> self._jwrite.save()
>
>   File
> "/hadoop/spark/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py",
> line 1322, in __call__
>
> return_value = get_return_value(
>
>   File
> "/hadoop/spark/spark/python/lib/pyspark.zip/pyspark/errors/exceptions/captured.py",
> line 169, in deco
>
> return f(*a, **kw)
>
>   File
> "/hadoop/spark/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py",
> line 326, in get_return_value
>
> raise Py4JJavaError(
>
> py4j.protocol.Py4JJavaError: An error occurred while calling o636.save.
>
> : java.lang.NoSuchMethodError: 'scala.collection.mutable.ArrayOps
> scala.Predef$.refArrayOps(java.lang.Object[])'
>
> at
> org.apache.phoenix.spark.DataFrameFunctions.getFieldArray(DataFrameFunctions.scala:76)
>
> at
> org.apache.phoenix.spark.DataFrameFunctions.saveToPhoenix(DataFrameFunctions.scala:35)
>
> at
> org.apache.phoenix.spark.DataFrameFunctions.saveToPhoenix(DataFrameFunctions.scala:28)
>
> at
> org.apache.phoenix.spark.DefaultSource.createRelation(DefaultSource.scala:47)
>
> at
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:47)
>
> at
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
>
> at
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
>



Fwd: Recap on current status of "SPIP: Support Customized Kubernetes Schedulers"

2023-08-22 Thread Mich Talebzadeh
I found some of the notes on Volcano and my tests from back in Feb 2022. I
did my Volcano tests on Spark 3.1.1, and the results were not great then.
Hence I asked in the thread from @santosh whether any updated comparisons
are available. I will try the test with Spark 3.4.1 at some point. Maybe
some users have run tests on Volcano with newer versions of Spark that
they care to share?


Thanks



Forwarded Conversation
Subject: Recap on current status of "SPIP: Support Customized Kubernetes
Schedulers",



--
From: Mich Talebzadeh 
Date: Thu, 24 Feb 2022 at 09:16
To: Yikun Jiang 
Cc: dev , Dongjoon Hyun , Holden
Karau , William Wang ,
Attila Zsolt Piros , Hyukjin Kwon <
gurwls...@gmail.com>, , Weiwei Yang ,
Thomas Graves 


Hi,

What do you expect the performance gain to be from using Volcano versus the
standard scheduler?

Just to be sure, there are two aspects here:


   1. Procuring the Kubernetes cluster
   2. Running the job through spark-submit


Item 1 is left untouched; we should see improvements in item 2 with
Volcano.

Thanks




--
From: Mich Talebzadeh 
Date: Thu, 24 Feb 2022 at 23:35
To: Yikun Jiang 
Cc: dev , Dongjoon Hyun , Holden
Karau , William Wang ,
Attila Zsolt Piros , Hyukjin Kwon <
gurwls...@gmail.com>, , Weiwei Yang ,
Thomas Graves 



I did some preliminary tests without Volcano and with the Volcano additions
to spark-submit.


*setup*


The K8s cluster used was a standard Google Kubernetes cluster with three
nodes and autoscaling up to 6 nodes. It runs *Spark 3.1.1* with spark-py
Docker images also built on *Spark 3.1.1 with Java 8*. In every run, the
job creates a million rows of random data and inserts them from a Spark
DataFrame into a Google BigQuery table. The choice of Spark 3.1.1 and Java
8 was for compatibility between the Spark API and BigQuery.
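
For reference, a rough sketch of that workload in PySpark (the dataset,
table and bucket names below are placeholders, and it assumes the
spark-bigquery connector is on the classpath):

from pyspark.sql import SparkSession
from pyspark.sql.functions import rand

spark = SparkSession.builder.appName("sparkBQ").getOrCreate()

# a million rows of random data
df = spark.range(1000000).withColumn("random_value", rand())

# insert the DataFrame into BigQuery via the spark-bigquery connector;
# the staging bucket is only needed for the indirect write path
df.write.format("bigquery") \
    .option("table", "my_dataset.random_data") \
    .option("temporaryGcsBucket", "my-staging-bucket") \
    .mode("overwrite") \
    .save()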


To keep everything else the same, I used the same cluster, with the only
difference being the additional spark-submit lines below for Volcano:


NEXEC=2
MEMORY="8192m"
VCORES=3
FEATURES="org.apache.spark.deploy.k8s.features.VolcanoFeatureStep"

gcloud config set compute/zone $ZONE
export PROJECT=$(gcloud info --format='value(config.project)')
gcloud container clusters get-credentials ${CLUSTER_NAME} --zone $ZONE
export KUBERNETES_MASTER_IP=$(gcloud container clusters list \
    --filter name=${CLUSTER_NAME} --format='value(MASTER_IP)')

spark-submit --verbose \
    --properties-file ${property_file} \
    --master k8s://https://$KUBERNETES_MASTER_IP:443 \
    --deploy-mode cluster \
    --name sparkBQ \
    --conf spark.kubernetes.scheduler=volcano \
    --conf spark.kubernetes.driver.pod.featureSteps=$FEATURES \
    --conf spark.kubernetes.executor.pod.featureSteps=$FEATURES \
    --conf spark.kubernetes.job.queue=queue1 \
    --py-files $CODE_DIRECTORY_CLOUD/spark_on_gke.zip \
    --conf spark.kubernetes.namespace=$NAMESPACE \
    --conf spark.executor.instances=$NEXEC \
    --conf spark.driver.cores=$VCORES \
    --conf spark.executor.cores=$VCORES \
    --conf spark.driver.memory=$MEMORY \
    --conf spark.executor.memory=$MEMORY \
    --conf spark.network.timeout=300 \
    --conf spark.kubernetes.allocation.batch.size=3 \
    --conf spark.kubernetes.allocation.batch.delay=1 \
    --conf spark.dynamicAllocation.enabled=true \
    --conf spark.dynamicAllocation.shuffleTracking.enabled=true \
    --conf spark.kubernetes.driver.container.image=${IMAGEDRIVER} \
    --conf spark.kubernetes.executor.container.image=${IMAGEDRIVER} \
    --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-bq \
    --conf spark.driver.extraJavaOptions="-Dio.netty.tryReflectionSetAccessible=true" \
    --conf spark.executor.extraJavaOptions="-Dio.netty.tryReflectionSetAccessible=true" \
    --conf spark.kubernetes.authenticate.caCertFile=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt \
    --conf spark.kubernetes.authenticate.oauthTokenFile=/var/run/secrets/kubernetes.io/serviceaccount/token \
    $CODE_DIRECTORY_CLOUD/${APPLICATION}



In contrast, the standard spark-submit does not include those four
Volcano-specific --conf lines (the scheduler, the two pod featureSteps, and
the job queue). This is the output from *spark-submit --verbose*:


Spark properties used, including those specified through

 --conf and those from the 
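
For completeness, the four Volcano-specific settings from the spark-submit
above can also be expressed programmatically (a sketch only; the property
names and values are copied verbatim from the script, and for cluster-mode
submission they would normally still be passed on the spark-submit line or
in the properties file):

from pyspark import SparkConf

FEATURES = "org.apache.spark.deploy.k8s.features.VolcanoFeatureStep"

# mirror the four Volcano-related --conf lines from the spark-submit above
conf = SparkConf()
conf.set("spark.kubernetes.scheduler", "volcano")
conf.set("spark.kubernetes.driver.pod.featureSteps", FEATURES)
conf.set("spark.kubernetes.executor.pod.featureSteps", FEATURES)
conf.set("spark.kubernetes.job.queue", "queue1")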

[ANNOUNCE] Apache Spark 3.3.3 released

2023-08-22 Thread Yuming Wang
We are happy to announce the availability of Apache Spark 3.3.3!

Spark 3.3.3 is a maintenance release containing stability fixes. This
release is based on the branch-3.3 maintenance branch of Spark. We strongly
recommend all 3.3 users to upgrade to this stable release.

To download Spark 3.3.3, head over to the download page:
https://spark.apache.org/downloads.html

To view the release notes:
https://spark.apache.org/releases/spark-release-3-3-3.html

We would like to acknowledge all community members for contributing to this
release. This release would not have been possible without you.