How to create a Spark UDF using FunctionCatalog?

2023-04-14 Thread 许新浩
We are using Spark. Today I came across the FunctionCatalog API, and I have
read the source of
spark\sql\core\src\test\scala\org\apache\spark\sql\connector\DataSourceV2FunctionSuite.scala
and have implemented a ScalarFunction. But I still do not know how
to register it in SQL.

Re: How to create a Spark UDF using FunctionCatalog?

2023-04-14 Thread Jacek Laskowski
Hi,

I'm not sure I understand the question, but if you are asking how to
register (plug in) your own custom FunctionCatalog, it's through the
spark.sql.catalog configuration property, e.g.

spark.sql.catalog.catalog-name=com.example.YourCatalogClass

A spark.sql.catalog.<name> entry registers a CatalogPlugin that, in your
case, should also be a FunctionCatalog.

When needed, the implicit class CatalogHelper.asFunctionCatalog is used to
expose your custom CatalogPlugin (e.g., catalog-name above) as a function
catalog, so that functions identified by three-part identifiers
(catalog.schema.function) are resolved against the custom catalog
implementation.
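
For illustration, here is a minimal sketch of such a catalog, loosely
modelled on DataSourceV2FunctionSuite. The names DemoFunctionCatalog and
my_strlen are made up, and namespace handling is deliberately simplified:

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.catalog.{FunctionCatalog, Identifier}
import org.apache.spark.sql.connector.catalog.functions.{BoundFunction, ScalarFunction, UnboundFunction}
import org.apache.spark.sql.types.{DataType, IntegerType, StringType, StructType}
import org.apache.spark.sql.util.CaseInsensitiveStringMap

// Illustrative catalog that serves a single scalar function, my_strlen.
class DemoFunctionCatalog extends FunctionCatalog {
  private var catalogName: String = _

  // CatalogPlugin contract: called once with the name used in spark.sql.catalog.<name>.
  override def initialize(name: String, options: CaseInsensitiveStringMap): Unit =
    catalogName = name

  override def name(): String = catalogName

  override def listFunctions(namespace: Array[String]): Array[Identifier] =
    Array(Identifier.of(namespace, "my_strlen"))

  // Simplified: ignores the namespace part of the identifier.
  override def loadFunction(ident: Identifier): UnboundFunction =
    ident.name() match {
      case "my_strlen" => StrLen
      case other => throw new NoSuchElementException(s"Function not found: $other")
    }
}

// Unbound entry point: checks the argument types and hands back the bound form.
object StrLen extends UnboundFunction {
  override def name(): String = "my_strlen"
  override def description(): String = "my_strlen(string) - length of the input string"
  override def bind(inputType: StructType): BoundFunction = {
    require(inputType.fields.length == 1 && inputType.fields(0).dataType == StringType,
      "my_strlen expects exactly one STRING argument")
    StrLenImpl
  }
}

// Bound scalar function: one InternalRow in, an Int out.
object StrLenImpl extends ScalarFunction[Int] {
  override def name(): String = "my_strlen"
  override def inputTypes(): Array[DataType] = Array(StringType)
  override def resultType(): DataType = IntegerType
  override def produceResult(input: InternalRow): Int =
    input.getUTF8String(0).numChars()
}

With that class on the classpath and registered as, say,
spark.sql.catalog.mycat=com.example.DemoFunctionCatalog, the function should
be callable in SQL by its three-part name, e.g. SELECT mycat.ns.my_strlen('spark').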

HTH

Pozdrawiam,
Jacek Laskowski

"The Internals Of" Online Books 
Follow me on https://twitter.com/jaceklaskowski




On Fri, Apr 14, 2023 at 2:10 PM 许新浩 <948718...@qq.com.invalid> wrote:

> We are using Spark. Today I came across the FunctionCatalog API, and I have
> read the source of
> spark\sql\core\src\test\scala\org\apache\spark\sql\connector\DataSourceV2FunctionSuite.scala
> and have implemented a ScalarFunction. But I still do not know how
> to register it in SQL.


Re: How to determine the function of tasks on each stage in an Apache Spark application?

2023-04-14 Thread Jacek Laskowski
Hi,

Start by intercepting stage completions using SparkListenerStageCompleted
[1]. That's Spark Core (jobs, stages and tasks).

Go up the execution chain to Spark SQL with SparkListenerSQLExecutionStart
[2] and SparkListenerSQLExecutionEnd [3], and correlate the events.

You may also want to look at how the web UI works under the covers to
collect all this information. Start from SQLTab [4], which shows what is
displayed (and hence what information is needed and how it is collected).

[1]
https://github.com/apache/spark/blob/8cceb3946bdfa5ceac0f2b4fe6a7c43eafb76d59/core/src/main/scala/org/apache/spark/scheduler/SparkListener.scala#L46
[2]
https://github.com/apache/spark/blob/24cdae8f3dcfc825c6c0b8ab8aa8505ae194050b/sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLListener.scala#L44
[3]
https://github.com/apache/spark/blob/24cdae8f3dcfc825c6c0b8ab8aa8505ae194050b/sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLListener.scala#L60
[4]
https://github.com/apache/spark/blob/c124037b97538b2656d29ce547b2a42209a41703/sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLTab.scala#L24
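
As a starting point, here is a minimal sketch of such a listener (the name
StageToSqlListener is made up). SQL execution events arrive via onOtherEvent,
and jobs triggered by Spark SQL carry the execution id in the
spark.sql.execution.id property, which is one way to tie stages and tasks
back to a query:

import org.apache.spark.scheduler.{SparkListener, SparkListenerEvent, SparkListenerJobStart, SparkListenerStageCompleted}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.ui.{SparkListenerSQLExecutionEnd, SparkListenerSQLExecutionStart}

// Sketch: logs stage completions, job starts and SQL execution boundaries
// so they can be correlated offline.
class StageToSqlListener extends SparkListener {

  override def onStageCompleted(stageCompleted: SparkListenerStageCompleted): Unit = {
    val info = stageCompleted.stageInfo
    println(s"Stage ${info.stageId} '${info.name}' completed with ${info.numTasks} tasks")
  }

  override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
    // Jobs submitted by Spark SQL carry the execution id in their properties.
    val execId = Option(jobStart.properties)
      .flatMap(p => Option(p.getProperty("spark.sql.execution.id")))
    println(s"Job ${jobStart.jobId} (SQL execution ${execId.getOrElse("n/a")}) " +
      s"with stages ${jobStart.stageIds.mkString(", ")}")
  }

  // SQL execution events are posted as 'other' events on the core listener bus.
  override def onOtherEvent(event: SparkListenerEvent): Unit = event match {
    case e: SparkListenerSQLExecutionStart =>
      println(s"SQL execution ${e.executionId} started: ${e.description}")
    case e: SparkListenerSQLExecutionEnd =>
      println(s"SQL execution ${e.executionId} ended at ${e.time}")
    case _ => // not interested
  }
}

// Register before running the queries under study.
val spark = SparkSession.builder().appName("listener-demo").getOrCreate()
spark.sparkContext.addSparkListener(new StageToSqlListener)

Alternatively, a listener class with a no-arg constructor can be registered
via the spark.extraListeners configuration property.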

Pozdrawiam,
Jacek Laskowski

"The Internals Of" Online Books 
Follow me on https://twitter.com/jaceklaskowski




On Thu, Apr 13, 2023 at 10:40 AM Trường Trần Phan An 
wrote:

> Hi,
>
> Can you give me more details, or point me to a tutorial, on "You'd have to
> intercept execution events and correlate them. Not an easy task yet doable"?
>
> Thanks
>
> On Wed, 12 Apr 2023 at 21:04, Jacek Laskowski 
> wrote:
>
>> Hi,
>>
>> tl;dr it's not possible to "reverse-engineer" tasks to functions.
>>
>> In essence, Spark SQL is an abstraction layer over RDD API that's made up
>> of partitions and tasks. Tasks are Scala functions (possibly with some
>> Python for PySpark). A simple-looking high-level operator like
>> DataFrame.join can end up with multiple RDDs, each with a set of partitions
>> (and hence tasks). What the tasks do is an implementation detail that you'd
>> have to know about by reading the source code of Spark SQL that produces
>> the "bytecode".
>>
>> Just looking at the DAG or the tasks screenshots won't give you that
>> level of detail. You'd have to intercept execution events and correlate
>> them. Not an easy task yet doable. HTH.
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> 
>> "The Internals Of" Online Books 
>> Follow me on https://twitter.com/jaceklaskowski
>>
>> 
>>
>>
>> On Tue, Apr 11, 2023 at 6:53 PM Trường Trần Phan An <
>> truong...@vlute.edu.vn> wrote:
>>
>>> Hi all,
>>>
>>> I am conducting a study comparing the execution time of Bloom Filter
>>> Join operation on two environments: Apache Spark Cluster and Apache Spark.
>>> I have compared the overall time of the two environments, but I want to
>>> compare specific "tasks on each stage" to see which computation has the
>>> most significant difference.
>>>
>>> I have taken a screenshot of the DAG of Stage 0 and the list of tasks
>>> executed in Stage 0.
>>> - DAG.png
>>> - Task.png
>>>
>>> *I have questions:*
>>> 1. Can we determine which tasks are responsible for executing each step
>>> scheduled on the DAG during the processing?
>>> 2. Is it possible to know the function of each task (e.g., what is task
>>> ID 0 responsible for? What is task ID 1 responsible for? ... )?
>>>
>>> Best regards,
>>> Truong
>>>
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>


Re: Accessing python runner file in AWS EKS kubernetes cluster as in local://

2023-04-14 Thread Mich Talebzadeh
OK, I managed to load the zipped Python package and the runner .py file onto
s3 for AWS EKS to work.

It is a bit of a nightmare compared to the same on the Google SDK, which is
simpler.

Anyhow, you will require additional jar files to be added to
$SPARK_HOME/jars. These two files will be picked up when you build the
docker image and will be available to the pods:

   1. hadoop-aws-3.2.0.jar
   2. aws-java-sdk-bundle-1.11.375.jar

Then build your docker image and push the image to ecr registry on AWS.

This will allow you to refer to both the zipped package and your source
file as

 spark-submit --verbose \
   --master k8s://$KUBERNETES_MASTER_IP:443 \
   --deploy-mode cluster \
   --py-files s3a://spark-on-k8s/codes/spark_on_eks.zip \
   s3a://spark-on-k8s/codes/

Note that you refer to the bucket as s3a rather than s3.
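
On the configuration side, a minimal sketch of the s3a settings those two
jars enable. The credentials provider shown is an assumption on my part
(it presumes EKS IAM roles for service accounts); adjust to however your
pods authenticate to S3:

import org.apache.spark.sql.SparkSession

// Sketch only. fs.s3a.impl wires in the S3A filesystem from hadoop-aws;
// the credentials provider below assumes EKS IRSA (web identity tokens),
// which is an assumption, not something prescribed in this thread.
val spark = SparkSession.builder()
  .appName("eks-s3a-demo")
  .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
  .config("spark.hadoop.fs.s3a.aws.credentials.provider",
    "com.amazonaws.auth.WebIdentityTokenCredentialsProvider")
  .getOrCreate()

// Once configured, s3a:// URIs resolve like any other path (path illustrative).
val df = spark.read.parquet("s3a://spark-on-k8s/data/some_table")

The same settings can equally be passed as --conf flags on the spark-submit
command line.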

Output from driver log

kubectl logs   -n spark

Started at
14/04/2023 15:08:11.11
starting at ID =  1 ,ending on =  100
root
 |-- ID: integer (nullable = false)
 |-- CLUSTERED: float (nullable = true)
 |-- SCATTERED: float (nullable = true)
 |-- RANDOMISED: float (nullable = true)
 |-- RANDOM_STRING: string (nullable = true)
 |-- SMALL_VC: string (nullable = true)
 |-- PADDING: string (nullable = true)
 |-- op_type: integer (nullable = false)
 |-- op_time: timestamp (nullable = false)

+---+---------+---------+----------+--------------------------------------------------+----------+-------+-------+-----------------------+
|ID |CLUSTERED|SCATTERED|RANDOMISED|RANDOM_STRING                                     |SMALL_VC  |PADDING|op_type|op_time                |
+---+---------+---------+----------+--------------------------------------------------+----------+-------+-------+-----------------------+
|1  |0.0      |0.0      |17.0      |KZWeqhFWCEPyYngFbyBMWXaSCrUZoLgubbbPIayRnBUbHoWCFJ|         1|xx     |1      |2023-04-14 15:08:15.534|
|2  |0.01     |1.0      |7.0       |ffxkVZQtqMnMcLRkBOzZUGxICGrcbxDuyBHkJlpobluliGGxGR|         2|xx     |1      |2023-04-14 15:08:15.534|
|3  |0.02     |2.0      |30.0      |LIixMEOLeMaEqJomTEIJEzOjoOjHyVaQXekWLctXbrEMUyTYBz|         3|xx     |1      |2023-04-14 15:08:15.534|
|4  |0.03     |3.0      |30.0      |tgUzEjfebzJsZWdoHIxrXlgqnbPZqZrmktsOUxfMvQyGplpErf|         4|xx     |1      |2023-04-14 15:08:15.534|
|5  |0.04     |4.0      |79.0      |qVwYSVPHbDXpPdkhxEpyIgKpaUnArlXykWZeiNNCiiaanXnkks|         5|xx     |1      |2023-04-14 15:08:15.534|
|6  |0.05     |5.0      |73.0      |fFWqcajQLEWVxuXbrFZmUAIIRgmKJSZUqQZNRfBvfxZAZqCSgW|         6|xx     |1      |2023-04-14 15:08:15.534|
|7  |0.06     |6.0      |41.0      |jzPdeIgxLdGncfBAepfJBdKhoOOLdKLzdocJisAjIhKtJRlgLK|         7|xx     |1      |2023-04-14 15:08:15.534|
|8  |0.07     |7.0      |29.0      |xyimTcfipZGnzPbDFDyFKmzfFoWbSrHAEyUhQqgeyNygQdvpSf|         8|xx     |1      |2023-04-14 15:08:15.534|
|9  |0.08     |8.0      |59.0      |NxrilRavGDMfvJNScUykTCUBkkpdhiGLeXSyYVgsnRoUYAfXrn|         9|xx     |1      |2023-04-14 15:08:15.534|
|10 |0.09     |9.0      |73.0      |cBEKanDFrPZkcHFuepVxcAiMwyAsRqDlRtQxiDXpCNycLapimt|        10|xx     |1      |2023-04-14 15:08:15.534|
+---+---------+---------+----------+--------------------------------------------------+----------+-------+-------+-----------------------+
only showing top 10 rows

Finished at
14/04/2023 15:08:16.16

I will provide the details under the *spark-on-aws* section in
http://sparkcommunitytalk.slack.com/

HTH

Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Wed, 12 Apr 2023 at 19:04, Mich Talebzadeh 
wrote:

> Thanks! I will have a look.
>
> Mich Talebzadeh,
> Lead Solutions Architect/Engineering Lead
> Palantir Technologies Limited
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Wed, 12 Apr 2023 at 18:26, Bjørn Jørgensen 
> wrote:
>
>> Yes, it looks inside the docker containers folder. It will work if you
>> are using s3 or gs.
>>
>> On Wed, 12 Apr 2023 at 18:02, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> In my spark-submit to eks cluster, I use the standard code to submit to
>>> the clu

Spark Kubernetes Operator

2023-04-14 Thread Yuval Itzchakov
Hi,

ATM I see the most used option for a Spark operator is the one provided by
Google: https://github.com/GoogleCloudPlatform/spark-on-k8s-operator

Unfortunately, it doesn't seem to be actively maintained. Are there any plans
to support an official, Apache Spark community-driven operator?


Re: Spark Kubernetes Operator

2023-04-14 Thread Mich Talebzadeh
Hi,

What exactly are you trying to achieve? Spark on GKE works fine, and you can
run Dataproc on GKE now:
https://www.linkedin.com/pulse/running-google-dataproc-kubernetes-engine-gke-spark-mich/?trackingId=lz12GC5dRFasLiaJm5qDSw%3D%3D

Unless I misunderstood your point.

HTH


Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Fri, 14 Apr 2023 at 17:42, Yuval Itzchakov  wrote:

> Hi,
>
> ATM I see the most used option for a Spark operator is the one provided by
> Google: https://github.com/GoogleCloudPlatform/spark-on-k8s-operator
>
> Unfortunately, it doesn't seem to be actively maintained. Are there any plans
> to support an official, Apache Spark community-driven operator?
>


Re: Spark Kubernetes Operator

2023-04-14 Thread Yuval Itzchakov
I'm not running on GKE. I am wondering what the long-term strategy around
a Spark operator is. Operators are the de facto way to run complex
deployments. The Flink community now has an official community-led
operator, and I was wondering if there are any similar plans for Spark.

On Fri, Apr 14, 2023, 19:51 Mich Talebzadeh 
wrote:

> Hi,
>
> What exactly are you trying to achieve? Spark on GKE works fine, and you
> can run Dataproc on GKE now:
> https://www.linkedin.com/pulse/running-google-dataproc-kubernetes-engine-gke-spark-mich/?trackingId=lz12GC5dRFasLiaJm5qDSw%3D%3D
>
> Unless I misunderstood your point.
>
> HTH
>
>
> Mich Talebzadeh,
> Lead Solutions Architect/Engineering Lead
> Palantir Technologies Limited
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Fri, 14 Apr 2023 at 17:42, Yuval Itzchakov  wrote:
>
>> Hi,
>>
>> ATM I see the most used option for a Spark operator is the one provided
>> by Google: https://github.com/GoogleCloudPlatform/spark-on-k8s-operator
>>
>> Unfortunately, it doesn't seem to be actively maintained. Are there any
>> plans to support an official, Apache Spark community-driven operator?
>>
>


Scala commands syntax shortcuts (alias)

2023-04-14 Thread Ankit Singla
Hi there,

I'm a Spark user in a Data Engineer role doing daily analytical work. I
write a few commands hundreds of times a day, and I always wonder whether
there is some way to define aliases for Spark commands instead of retyping
the whole syntax every time. I checked and there seems to be no *eval*
option available.

I would like to know if any option is available to cut down on the keystrokes.

Thanks in advance for your help and time.

Regards,
Ankit Singla
+1 847 471 4988