Re: [ANNOUNCE] Apache Spark 3.5.2 released

2024-08-12 Thread Xiao Li
Thank you, Kent!

Kent Yao  于2024年8月12日周一 08:03写道:

> We are happy to announce the availability of Apache Spark 3.5.2!
>
> Spark 3.5.2 is the second maintenance release containing security
> and correctness fixes. This release is based on the branch-3.5
> maintenance branch of Spark. We strongly recommend all 3.5 users
> to upgrade to this stable release.
>
> To download Spark 3.5.2, head over to the download page:
> https://spark.apache.org/downloads.html
>
> To view the release notes:
> https://spark.apache.org/releases/spark-release-3-5-2.html
>
> We would like to acknowledge all community members for contributing to this
> release. This release would not have been possible without you.
>
> Kent Yao
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


[ANNOUNCE] Apache Spark 3.5.2 released

2024-08-12 Thread Kent Yao
We are happy to announce the availability of Apache Spark 3.5.2!

Spark 3.5.2 is the second maintenance release containing security
and correctness fixes. This release is based on the branch-3.5
maintenance branch of Spark. We strongly recommend that all 3.5 users
upgrade to this stable release.

To download Spark 3.5.2, head over to the download page:
https://spark.apache.org/downloads.html

To view the release notes:
https://spark.apache.org/releases/spark-release-3-5-2.html

We would like to acknowledge all community members for contributing to this
release. This release would not have been possible without you.

Kent Yao

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Question about installing Apache Spark [PySpark] computer requirements

2024-07-29 Thread Meena Rajani
k.scheduler.ResultTask.runTask(ResultTask.scala:93)
>   at 
> org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)
>   at org.apache.spark.scheduler.Task.run(Task.scala:141)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
>   at 
> org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
>   at 
> org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>   ... 1 more
>
>
> On Mon, Jul 29, 2024 at 4:34 PM Sadha Chilukoori 
> wrote:
>
>> Hi Mike,
>>
>> I'm not sure about the minimum requirements of a machine for running
>> Spark. But to run some Pyspark scripts (and Jupiter notbebooks) on a local
>> machine, I found the following steps are the easiest.
>>
>>
>> I installed Amazon corretto and updated the java_home variable as
>> instructed here
>> https://docs.aws.amazon.com/corretto/latest/corretto-11-ug/downloads-list.html
>> (Any other java works too, I'm used to corretto from work).
>>
>> Then installed the Pyspark module using pip, which enabled me run Pyspark
>> on my machine.
>>
>> -Sadha
>>
>> On Mon, Jul 29, 2024, 12:51 PM mike Jadoo  wrote:
>>
>>> Hello,
>>>
>>> I am trying to run Pyspark on my computer without success.  I follow
>>> several different directions from online sources and it appears that I need
>>> to get a faster computer.
>>>
>>> I wanted to ask what are some recommendations for computer
>>> specifications to run PySpark (Apache Spark).
>>>
>>> Any help would be greatly appreciated.
>>>
>>> Thank you,
>>>
>>> Mike
>>>
>>


Re: Question about installing Apache Spark [PySpark] computer requirements

2024-07-29 Thread Sadha Chilukoori
7)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)
>   at 
> org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)
>   at org.apache.spark.scheduler.Task.run(Task.scala:141)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
>   at 
> org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
>   at 
> org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>   ... 1 more
>
>
> On Mon, Jul 29, 2024 at 4:34 PM Sadha Chilukoori 
> wrote:
>
>> Hi Mike,
>>
>> I'm not sure about the minimum requirements of a machine for running
>> Spark. But to run some Pyspark scripts (and Jupiter notbebooks) on a local
>> machine, I found the following steps are the easiest.
>>
>>
>> I installed Amazon corretto and updated the java_home variable as
>> instructed here
>> https://docs.aws.amazon.com/corretto/latest/corretto-11-ug/downloads-list.html
>> (Any other java works too, I'm used to corretto from work).
>>
>> Then installed the Pyspark module using pip, which enabled me run Pyspark
>> on my machine.
>>
>> -Sadha
>>
>> On Mon, Jul 29, 2024, 12:51 PM mike Jadoo  wrote:
>>
>>> Hello,
>>>
>>> I am trying to run Pyspark on my computer without success.  I follow
>>> several different directions from online sources and it appears that I need
>>> to get a faster computer.
>>>
>>> I wanted to ask what are some recommendations for computer
>>> specifications to run PySpark (Apache Spark).
>>>
>>> Any help would be greatly appreciated.
>>>
>>> Thank you,
>>>
>>> Mike
>>>
>>


Re: Question about installing Apache Spark [PySpark] computer requirements

2024-07-29 Thread mike Jadoo
On Mon, Jul 29, 2024 at 4:34 PM Sadha Chilukoori 
wrote:

> Hi Mike,
>
> I'm not sure about the minimum requirements of a machine for running
> Spark. But to run some Pyspark scripts (and Jupiter notbebooks) on a local
> machine, I found the following steps are the easiest.
>
>
> I installed Amazon corretto and updated the java_home variable as
> instructed here
> https://docs.aws.amazon.com/corretto/latest/corretto-11-ug/downloads-list.html
> (Any other java works too, I'm used to corretto from work).
>
> Then installed the Pyspark module using pip, which enabled me run Pyspark
> on my machine.
>
> -Sadha
>
> On Mon, Jul 29, 2024, 12:51 PM mike Jadoo  wrote:
>
>> Hello,
>>
>> I am trying to run Pyspark on my computer without success.  I follow
>> several different directions from online sources and it appears that I need
>> to get a faster computer.
>>
>> I wanted to ask what are some recommendations for computer specifications
>> to run PySpark (Apache Spark).
>>
>> Any help would be greatly appreciated.
>>
>> Thank you,
>>
>> Mike
>>
>


Re: Question about installing Apache Spark [PySpark] computer requirements

2024-07-29 Thread Sadha Chilukoori
Hi Mike,

I'm not sure about the minimum requirements of a machine for running Spark,
but to run some PySpark scripts (and Jupyter notebooks) on a local
machine, I found the following steps to be the easiest.

I installed Amazon Corretto and updated the JAVA_HOME variable as
instructed here:
https://docs.aws.amazon.com/corretto/latest/corretto-11-ug/downloads-list.html
(any other Java works too; I'm used to Corretto from work).

Then I installed the PySpark module using pip, which enabled me to run PySpark
on my machine.
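
A minimal sketch of a quick check after those steps, assuming Java/Corretto is
already installed, JAVA_HOME points at it, and PySpark came from pip (the app
name below is arbitrary):

# Smoke test for a local PySpark install (sketch only).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")           # run everything inside the local JVM
    .appName("local-smoke-test")
    .getOrCreate()
)
print(spark.version)              # prints the Spark version that pip installed
spark.range(5).show()             # runs a trivial job end to end
spark.stop()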

-Sadha

On Mon, Jul 29, 2024, 12:51 PM mike Jadoo  wrote:

> Hello,
>
> I am trying to run Pyspark on my computer without success.  I follow
> several different directions from online sources and it appears that I need
> to get a faster computer.
>
> I wanted to ask what are some recommendations for computer specifications
> to run PySpark (Apache Spark).
>
> Any help would be greatly appreciated.
>
> Thank you,
>
> Mike
>


Question about installing Apache Spark [PySpark] computer requirements

2024-07-29 Thread mike Jadoo
Hello,

I am trying to run PySpark on my computer without success. I have followed
several different sets of directions from online sources, and it appears that
I need a faster computer.

I wanted to ask what the recommended computer specifications are for running
PySpark (Apache Spark).

Any help would be greatly appreciated.

Thank you,

Mike


Re: 7368396 - Apache Spark 3.5.1 (Support)

2024-06-07 Thread Sadha Chilukoori
Hi Alex,

Spark is open-source software available under the Apache License 2.0 (
https://www.apache.org/licenses/); further details can be found on the
FAQ page (https://spark.apache.org/faq.html).

Hope this helps.


Thanks,

Sadha

On Thu, Jun 6, 2024, 1:32 PM SANTOS SOUZA, ALEX 
wrote:

> Hey guys!
>
>
>
> I am part of the team responsible for software approval at EMBRAER S.A.
> We are currently in the process of approving the Apache Spark 3.5.1
> software and are verifying the licensing of the application.
> Therefore, I would like to kindly request you to answer the questions
> below.
>
> -What type of software? (Commercial, Freeware, Component, etc...)
>  A:
>
> -What is the licensing model for commercial use? (Subscription, Perpetual,
> GPL, etc...)
> A:
>
> -What type of license? (By user, Competitor, Device, Server or others)?
> A:
>
> -Number of installations allowed per license/subscription?
> A:
>
> Can it be used in the defense and aerospace industry? (Company that
> manufactures products for national defense)
> A:
>
> -Does the license allow use in any location regardless of the origin of
> the purchase (tax restriction)?
> A:
>
> -Where can I find the End User License Agreement (EULA) for the version in
> question?
> A:
>
>
>
> Desde já, muito obrigado e qualquer dúvida estou à disposição. / Thank you
> very much in advance and I am at your disposal if you have any questions.
>
>
> Att,
>
>
> Alex Santos Souza
>
> Software Asset Management - Embraer
>
> WhatsApp: +55 12 99731-7579
>
> E-mail: alex.santosso...@dxc.com
>
> DXC Technology
>
> São José dos Campos, SP - Brazil
>
>


7368396 - Apache Spark 3.5.1 (Support)

2024-06-06 Thread SANTOS SOUZA, ALEX
Hey guys!



I am part of the team responsible for software approval at EMBRAER S.A.
We are currently in the process of approving the Apache Spark 3.5.1 software 
and are verifying the licensing of the application.
Therefore, I would like to kindly request you to answer the questions below.

-What type of software? (Commercial, Freeware, Component, etc...)
 A:

-What is the licensing model for commercial use? (Subscription, Perpetual, GPL, 
etc...)
A:

-What type of license? (By user, Competitor, Device, Server or others)?
A:

-Number of installations allowed per license/subscription?
A:

-Can it be used in the defense and aerospace industry? (Company that 
manufactures products for national defense)
A:

-Does the license allow use in any location regardless of the origin of the 
purchase (tax restriction)?
A:

-Where can I find the End User License Agreement (EULA) for the version in 
question?
A:



Desde já, muito obrigado e qualquer dúvida estou à disposição. / Thank you very 
much in advance and I am at your disposal if you have any questions.


Att,


Alex Santos Souza

Software Asset Management - Embraer

WhatsApp: +55 12 99731-7579

E-mail: alex.santosso...@dxc.com

DXC Technology

São José dos Campos, SP - Brazil



Inquiry Regarding Security Compliance of Apache Spark Docker Image

2024-06-05 Thread Tonmoy Sagar
Dear Apache Team,

I hope this email finds you well.

We are a team from Ernst and Young LLP - India, dedicated to providing 
innovative supply chain solutions for a diverse range of clients. Our team 
recently encountered a pivotal use case necessitating the utilization of 
PySpark for a project aimed at handling substantial volumes of data. As part of 
our deployment strategy, we are endeavouring to implement a Spark-based 
application on our Azure Kubernetes service.

Regrettably, we have encountered challenges from a security perspective with 
the latest Apache Spark Docker image, specifically apache/spark-py:latest. Our 
security team has meticulously conducted an assessment and has generated a 
comprehensive vulnerability report highlighting areas of concern.

Given the non-compliance of the Docker image with our organization's stringent 
security protocols, we find ourselves unable to proceed with its integration 
into our applications. We attach the vulnerability report herewith for your 
perusal.

Considering these circumstances, we kindly request your esteemed team to 
provide any resolutions or guidance that may assist us in mitigating the 
identified security vulnerabilities. Your prompt attention to this matter would 
be greatly appreciated, as it is crucial for the successful deployment and 
operation of our Spark-based application within our infrastructure.

Thank you for your attention to this inquiry, and we look forward to your 
valued support and assistance.



Please find the vulnerability report attached.
Best Regards,
Tonmoy Sagar | Sr. Consultant | Advisory | Asterisk
Ernst & Young LLP
C-401, Panchshil Tech Park One, Yerawada, Pune, Maharashtra 411006, India
Mobile: +91 8724918230 | tonmoy.sa...@in.ey.com<mailto:tonmoy.sa...@in.ey.com>
Thrive in the Transformative Age with the better-connected consultants - 
ey.com/consulting<http://ey.com/consulting>



The information contained in this communication is intended solely for the use 
of the individual or entity to whom it is addressed and others authorized to 
receive it. It may contain confidential or legally privileged information. If 
you are not the intended recipient you are hereby notified that any disclosure, 
copying, distribution or taking any action in reliance on the contents of this 
information is strictly prohibited and may be unlawful. If you have received 
this communication in error, please notify us immediately by responding to this 
email and then delete it from your system. The firm is neither liable for the 
proper and complete transmission of the information contained in this 
communication nor for any delay in its receipt.


spark_vulnerability_report.xlsx
Description: spark_vulnerability_report.xlsx

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

[ANNOUNCE] Announcing Apache Spark 4.0.0-preview1

2024-06-03 Thread Wenchen Fan
Hi all,

To enable wide-scale community testing of the upcoming Spark 4.0 release,
the Apache Spark community has posted a preview release of Spark 4.0. This
preview is not a stable release in terms of either API or functionality,
but it is meant to give the community early access to try the code that
will become Spark 4.0. If you would like to test the release, please
download it, and send feedback using either the mailing lists or JIRA.

There are a lot of exciting new features added to Spark 4.0, including ANSI
mode by default, Python data source, polymorphic Python UDTF, string
collation support, new VARIANT data type, streaming state store data
source, structured logging, Java 17 by default, and many more.

We'd like to thank our contributors and users for their contributions and
early feedback to this release. This release would not have been possible
without you.

To download Spark 4.0.0-preview1, head over to the download page:
https://archive.apache.org/dist/spark/spark-4.0.0-preview1 . It's also
available in PyPI, with version name "4.0.0.dev1".
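
For those testing the PyPI artifact, a minimal sketch of confirming which build
ended up in a local environment, assuming it was installed with
pip install pyspark==4.0.0.dev1:

# Sketch: confirm the preview build is the one on the local environment.
import pyspark
print(pyspark.__version__)   # expected to report 4.0.0.dev1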

Thanks,

Wenchen


[apache-spark][spark-dataframe] DataFrameWriter.partitionBy does not guarantee previous sort result

2024-05-31 Thread leeyc0
I have a dataset that has the following schema:
(timestamp, partitionKey, logValue)

I want the dataset to be sorted by timestamp, but written to files in the
following directory layout:
outputDir/partitionKey/files
The output files contain only logValue; that is, timestamp is used for
sorting only and does not appear in the output.
(FYI, logValue contains a textual representation of the timestamp, which is
not sortable.)

My first attempt is to use DataFrameWriter.partitionBy:
dataset
    .sort("timestamp")
    .select("partitionKey", "logValue")
    .write()
    .partitionBy("partitionKey")
    .text("output");

However, as mentioned in SPARK-44512 (
https://issues.apache.org/jira/browse/SPARK-44512), this does not guarantee
the output is globally sorted.
(Note: I found that even setting
spark.sql.optimizer.plannedWrite.enabled=false still does not guarantee
a sorted result in a low-memory environment.)

And the developers say DataFrameWriter.partitionBy does not guarantee
sorted results:
"Although I understand Apache Spark 3.4.0 changes the behavior like the
above, I don't think there is a contract that Apache Spark's `partitionBy`
operation preserves the previous ordering."

To work around this problem, I had to resort to creating a Hadoop output
format by extending org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
and writing the files with saveAsHadoopFile:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

public final class PartitionedMultipleTextOutputFormat<V>
        extends MultipleTextOutputFormat<Object, V> {
    @SuppressWarnings("MissingJavadocMethod")
    public PartitionedMultipleTextOutputFormat() {
        super();
    }

    @Override
    protected Object generateActualKey(final Object key, final V value) {
        // The key is only used for routing; emit no key in the output records.
        return NullWritable.get();
    }

    @Override
    protected String generateFileNameForKeyValue(final Object key, final V value,
            final String leaf) {
        // Write each record under a subdirectory named after its key (partitionKey).
        return new Path(key.toString(), leaf).toString();
    }
}

private static Tuple2<String, Text> mapRDDToDomainLogPair(final Row row) {
    final String domain = row.getAs("partitionKey");
    final var log = (String) row.getAs("logValue");
    final var logTextClass = new Text(log);
    return new Tuple2<>(domain, logTextClass);
}

dataset
    .sort("timestamp")
    .javaRDD()
    .mapToPair(TheClass::mapRDDToDomainLogPair)
    .saveAsHadoopFile(hdfsTmpPath, String.class, Text.class,
        PartitionedMultipleTextOutputFormat.class, GzipCodec.class);

This seems a little hacky.
Does anyone have a better method?
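
For comparison, one commonly suggested alternative (not from the thread) is
sketched below in PySpark, under the assumption that per-key ordering by
timestamp, rather than one global ordering across all keys, is the actual
requirement; whether the ordering survives spills under memory pressure would
still need to be verified:

# Sketch: route each partitionKey to a single task and sort by timestamp there,
# so the write path's implicit sort on the partition column is already satisfied.
# `dataset` is the DataFrame described above: (timestamp, partitionKey, logValue).
(dataset
    .repartition("partitionKey")
    .sortWithinPartitions("partitionKey", "timestamp")
    .select("partitionKey", "logValue")
    .write
    .partitionBy("partitionKey")
    .text("output"))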


Request for Assistance: Adding User Authentication to Apache Spark Application

2024-05-16 Thread NIKHIL RAJ SHRIVASTAVA
Dear Team,

I hope this email finds you well. My name is Nikhil Raj, and I am currently
working with Apache Spark on one of my projects, where we are creating an
external table in Spark from a Parquet file.

I am reaching out to seek assistance regarding user authentication for our
Apache Spark application. Currently, we can connect to the application
using only the host and port information. However, for security reasons, we
would like to implement user authentication to control access and ensure
data integrity.

After reviewing the available documentation and resources, I found that
adding user authentication to our Spark setup requires additional
configurations or plugins. However, I'm facing challenges in understanding
the exact steps or best practices to implement this.

Could you please provide guidance or point me towards relevant
documentation/resources that detail how to integrate user authentication
into Apache Spark?  Additionally, if there are any recommended practices or
considerations for ensuring the security of our Spark setup, we would
greatly appreciate your insights on that as well.

Your assistance in this matter would be invaluable to us, as we aim to
enhance the security of our Spark application and safeguard our data
effectively.
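
For reference, a minimal sketch of the RPC-level settings described in Spark's
security documentation (spark.authenticate with a shared secret); the secret
below is a placeholder, and whether this layer also covers authenticating
external clients on a host/port endpoint is exactly the kind of guidance being
sought:

# Sketch only: enable Spark's internal RPC authentication with a shared secret.
# These settings protect traffic between Spark processes; external client
# authentication (e.g. on a JDBC/Thrift-style endpoint) may need a separate mechanism.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("auth-sketch")
    .config("spark.authenticate", "true")
    .config("spark.authenticate.secret", "replace-with-a-real-secret")
    .getOrCreate()
)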

Thank you very much for your time and consideration. I look forward to
hearing from you and your suggestions.

Warm regards,

NIKHIL RAJ
Developer
Estuate Software Pvt. Ltd.
Thanks & Regards


[ANNOUNCE] Apache Spark 3.4.3 released

2024-04-18 Thread Dongjoon Hyun
We are happy to announce the availability of Apache Spark 3.4.3!

Spark 3.4.3 is a maintenance release containing many fixes in the
security and correctness domains. This release is based on the
branch-3.4 maintenance branch of Spark. We strongly
recommend that all 3.4 users upgrade to this stable release.

To download Spark 3.4.3, head over to the download page:
https://spark.apache.org/downloads.html

To view the release notes:
https://spark.apache.org/releases/spark-release-3-4-3.html

We would like to acknowledge all community members for contributing to this
release. This release would not have been possible without you.

Dongjoon Hyun


Apache Spark integration with Spring Boot 3.0.0+

2024-03-28 Thread Szymon Kasperkiewicz
Hello,

I've got a project which has to use the newest versions of both Apache
Spark and Spring Boot due to vulnerability issues. I build my project using
Gradle, and when I try to run it I get an unsatisfied dependency exception
about javax/servlet/Servlet. I've tried adding the Jakarta servlet, an older
javax version, etc. None of them worked. The only solution I saw was to
downgrade Spring Boot, but unfortunately I can't do that. Is there any
known option to use both Apache Spark and Spring Boot in one project?

Best regards
Szymon


Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-23 Thread Winston Lai
+1

--
Thank You & Best Regards
Winston Lai

From: Jay Han 
Date: Sunday, 24 March 2024 at 08:39
To: Kiran Kumar Dusi 
Cc: Farshid Ashouri , Matei Zaharia 
, Mich Talebzadeh , Spark 
dev list , user @spark 
Subject: Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark 
Community
+1. It sounds awesome!

Kiran Kumar Dusi mailto:kirankumard...@gmail.com>> 
于2024年3月21日周四 14:16写道:
+1

On Thu, 21 Mar 2024 at 7:46 AM, Farshid Ashouri 
mailto:farsheed.asho...@gmail.com>> wrote:
+1

On Mon, 18 Mar 2024, 11:00 Mich Talebzadeh, 
mailto:mich.talebza...@gmail.com>> wrote:
Some of you may be aware that Databricks community Home | Databricks
have just launched a knowledge sharing hub. I thought it would be a
good idea for the Apache Spark user group to have the same, especially
for repeat questions on Spark core, Spark SQL, Spark Structured
Streaming, Spark Mlib and so forth.

Apache Spark user and dev groups have been around for a good while.
They are serving their purpose . We went through creating a slack
community that managed to create more more heat than light.. This is
what Databricks community came up with and I quote

"Knowledge Sharing Hub
Dive into a collaborative space where members like YOU can exchange
knowledge, tips, and best practices. Join the conversation today and
unlock a wealth of collective wisdom to enhance your experience and
drive success."

I don't know the logistics of setting it up.but I am sure that should
not be that difficult. If anyone is supportive of this proposal, let
the usual +1, 0, -1 decide

HTH

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile


 https://en.everybodywiki.com/Mich_Talebzadeh



Disclaimer: The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner Von Braun)".

-
To unsubscribe e-mail: 
user-unsubscr...@spark.apache.org<mailto:user-unsubscr...@spark.apache.org>


Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-23 Thread Jay Han
+1. It sounds awesome!

Kiran Kumar Dusi  于2024年3月21日周四 14:16写道:

> +1
>
> On Thu, 21 Mar 2024 at 7:46 AM, Farshid Ashouri <
> farsheed.asho...@gmail.com> wrote:
>
>> +1
>>
>> On Mon, 18 Mar 2024, 11:00 Mich Talebzadeh, 
>> wrote:
>>
>>> Some of you may be aware that Databricks community Home | Databricks
>>> have just launched a knowledge sharing hub. I thought it would be a
>>> good idea for the Apache Spark user group to have the same, especially
>>> for repeat questions on Spark core, Spark SQL, Spark Structured
>>> Streaming, Spark Mlib and so forth.
>>>
>>> Apache Spark user and dev groups have been around for a good while.
>>> They are serving their purpose . We went through creating a slack
>>> community that managed to create more more heat than light.. This is
>>> what Databricks community came up with and I quote
>>>
>>> "Knowledge Sharing Hub
>>> Dive into a collaborative space where members like YOU can exchange
>>> knowledge, tips, and best practices. Join the conversation today and
>>> unlock a wealth of collective wisdom to enhance your experience and
>>> drive success."
>>>
>>> I don't know the logistics of setting it up.but I am sure that should
>>> not be that difficult. If anyone is supportive of this proposal, let
>>> the usual +1, 0, -1 decide
>>>
>>> HTH
>>>
>>> Mich Talebzadeh,
>>> Dad | Technologist | Solutions Architect | Engineer
>>> London
>>> United Kingdom
>>>
>>>
>>>view my Linkedin profile
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> Disclaimer: The information provided is correct to the best of my
>>> knowledge but of course cannot be guaranteed . It is essential to note
>>> that, as with any advice, quote "one test result is worth one-thousand
>>> expert opinions (Werner Von Braun)".
>>>
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>
>>>


Re: Feature article: Leveraging Generative AI with Apache Spark: Transforming Data Engineering

2024-03-22 Thread Mich Talebzadeh
Sorry, here is the correct link:

Leveraging Generative AI with Apache Spark: Transforming Data Engineering |
LinkedIn
<https://www.linkedin.com/pulse/leveraging-generative-ai-apache-spark-transforming-mich-lxbte/?trackingId=aqZMBOg4O1KYRB4Una7NEg%3D%3D>

Mich Talebzadeh,
Technologist | Data | Generative AI | Financial Fraud
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von
Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".


On Fri, 22 Mar 2024 at 16:16, Mich Talebzadeh 
wrote:

> You may find this link of mine in Linkedin for the said article. We
> can use Linkedin for now.
>
> Leveraging Generative AI with Apache Spark: Transforming Data
> Engineering | LinkedIn
>
>
> Mich Talebzadeh,
>
> Technologist | Data | Generative AI | Financial Fraud
>
> London
> United Kingdom
>
>
>view my Linkedin profile
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> Disclaimer: The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed . It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Werner Von Braun)".
>


Feature article: Leveraging Generative AI with Apache Spark: Transforming Data Engineering

2024-03-22 Thread Mich Talebzadeh
You may find the link to the said article on my LinkedIn below. We
can use LinkedIn for now.

Leveraging Generative AI with Apache Spark: Transforming Data
Engineering | LinkedIn


Mich Talebzadeh,

Technologist | Data | Generative AI | Financial Fraud

London
United Kingdom


   view my Linkedin profile


 https://en.everybodywiki.com/Mich_Talebzadeh



Disclaimer: The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner Von Braun)".

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-20 Thread Kiran Kumar Dusi
+1

On Thu, 21 Mar 2024 at 7:46 AM, Farshid Ashouri 
wrote:

> +1
>
> On Mon, 18 Mar 2024, 11:00 Mich Talebzadeh, 
> wrote:
>
>> Some of you may be aware that Databricks community Home | Databricks
>> have just launched a knowledge sharing hub. I thought it would be a
>> good idea for the Apache Spark user group to have the same, especially
>> for repeat questions on Spark core, Spark SQL, Spark Structured
>> Streaming, Spark Mlib and so forth.
>>
>> Apache Spark user and dev groups have been around for a good while.
>> They are serving their purpose . We went through creating a slack
>> community that managed to create more more heat than light.. This is
>> what Databricks community came up with and I quote
>>
>> "Knowledge Sharing Hub
>> Dive into a collaborative space where members like YOU can exchange
>> knowledge, tips, and best practices. Join the conversation today and
>> unlock a wealth of collective wisdom to enhance your experience and
>> drive success."
>>
>> I don't know the logistics of setting it up.but I am sure that should
>> not be that difficult. If anyone is supportive of this proposal, let
>> the usual +1, 0, -1 decide
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Dad | Technologist | Solutions Architect | Engineer
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> Disclaimer: The information provided is correct to the best of my
>> knowledge but of course cannot be guaranteed . It is essential to note
>> that, as with any advice, quote "one test result is worth one-thousand
>> expert opinions (Werner Von Braun)".
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>


Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-20 Thread Farshid Ashouri
+1

On Mon, 18 Mar 2024, 11:00 Mich Talebzadeh, 
wrote:

> Some of you may be aware that Databricks community Home | Databricks
> have just launched a knowledge sharing hub. I thought it would be a
> good idea for the Apache Spark user group to have the same, especially
> for repeat questions on Spark core, Spark SQL, Spark Structured
> Streaming, Spark Mlib and so forth.
>
> Apache Spark user and dev groups have been around for a good while.
> They are serving their purpose . We went through creating a slack
> community that managed to create more more heat than light.. This is
> what Databricks community came up with and I quote
>
> "Knowledge Sharing Hub
> Dive into a collaborative space where members like YOU can exchange
> knowledge, tips, and best practices. Join the conversation today and
> unlock a wealth of collective wisdom to enhance your experience and
> drive success."
>
> I don't know the logistics of setting it up.but I am sure that should
> not be that difficult. If anyone is supportive of this proposal, let
> the usual +1, 0, -1 decide
>
> HTH
>
> Mich Talebzadeh,
> Dad | Technologist | Solutions Architect | Engineer
> London
> United Kingdom
>
>
>view my Linkedin profile
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> Disclaimer: The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed . It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Werner Von Braun)".
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-19 Thread Mich Talebzadeh
One option that comes to my mind is that, given the cyclic nature of these
types of proposals in these two forums, we should be able to use
Databricks' existing knowledge sharing hub, Knowledge Sharing Hub -
Databricks
<https://community.databricks.com/t5/knowledge-sharing-hub/bd-p/Knowledge-Sharing-Hub>,
as well.

The majority of topics will be of interest to their audience as well. In
addition, they seem to invite everyone to contribute. Unless there is an
overriding concern why we should not take this approach, I can enquire with
the Databricks community managers whether they can entertain this idea. They
seem to have a well-defined structure for hosting topics.

Let me know your thoughts

Thanks
Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von
Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".


On Tue, 19 Mar 2024 at 08:25, Joris Billen 
wrote:

> +1
>
>
> On 18 Mar 2024, at 21:53, Mich Talebzadeh 
> wrote:
>
> Well as long as it works.
>
> Please all check this link from Databricks and let us know your thoughts.
> Will something similar work for us?. Of course Databricks have much deeper
> pockets than our ASF community. Will it require moderation in our side to
> block spams and nutcases.
>
> Knowledge Sharing Hub - Databricks
> <https://community.databricks.com/t5/knowledge-sharing-hub/bd-p/Knowledge-Sharing-Hub>
>
>
> Mich Talebzadeh,
> Dad | Technologist | Solutions Architect | Engineer
> London
> United Kingdom
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed . It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Werner  <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von
> Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".
>
>
> On Mon, 18 Mar 2024 at 20:31, Bjørn Jørgensen 
> wrote:
>
>> something like this  Spark community · GitHub
>> <https://github.com/Spark-community>
>>
>>
>> man. 18. mars 2024 kl. 17:26 skrev Parsian, Mahmoud
>> :
>>
>>> Good idea. Will be useful
>>>
>>>
>>>
>>> +1
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> *From: *ashok34...@yahoo.com.INVALID 
>>> *Date: *Monday, March 18, 2024 at 6:36 AM
>>> *To: *user @spark , Spark dev list <
>>> d...@spark.apache.org>, Mich Talebzadeh 
>>> *Cc: *Matei Zaharia 
>>> *Subject: *Re: A proposal for creating a Knowledge Sharing Hub for
>>> Apache Spark Community
>>>
>>> External message, be mindful when clicking links or attachments
>>>
>>>
>>>
>>> Good idea. Will be useful
>>>
>>>
>>>
>>> +1
>>>
>>>
>>>
>>> On Monday, 18 March 2024 at 11:00:40 GMT, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>
>>>
>>>
>>>
>>> Some of you may be aware that Databricks community Home | Databricks
>>>
>>> have just launched a knowledge sharing hub. I thought it would be a
>>>
>>> good idea for the Apache Spark user group to have the same, especially
>>>
>>> for repeat questions on Spark core, Spark SQL, Spark Structured
>>>
>>> Streaming, Spark Mlib and so forth.
>>>
>>>
>>>
>>> Apache Spark user and dev groups have been around for a good while.
>>>
>>> They are serving their purpose . We went through creating a slack
>>>
>>> community that managed to create more more heat than light.. This is
>>>
>>> what Databricks community came up with and I quote
>>>
>>>
>>>
>>> "Knowledge Sharing Hub
>>>
>>> Dive into a collaborative space where members like YOU c

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-19 Thread Joris Billen
+1


On 18 Mar 2024, at 21:53, Mich Talebzadeh  wrote:

Well as long as it works.

Please all check this link from Databricks and let us know your thoughts. Will 
something similar work for us?. Of course Databricks have much deeper pockets 
than our ASF community. Will it require moderation in our side to block spams 
and nutcases.

Knowledge Sharing Hub - 
Databricks<https://community.databricks.com/t5/knowledge-sharing-hub/bd-p/Knowledge-Sharing-Hub>


Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom

 
   view my Linkedin 
profile<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

 https://en.everybodywiki.com/Mich_Talebzadeh



Disclaimer: The information provided is correct to the best of my knowledge but 
of course cannot be guaranteed . It is essential to note that, as with any 
advice, quote "one test result is worth one-thousand expert opinions (Werner 
<https://en.wikipedia.org/wiki/Wernher_von_Braun> Von 
Braun<https://en.wikipedia.org/wiki/Wernher_von_Braun>)".


On Mon, 18 Mar 2024 at 20:31, Bjørn Jørgensen 
mailto:bjornjorgen...@gmail.com>> wrote:
something like this  Spark community · 
GitHub<https://github.com/Spark-community>


man. 18. mars 2024 kl. 17:26 skrev Parsian, Mahmoud 
:
Good idea. Will be useful

+1



From: ashok34...@yahoo.com.INVALID 
Date: Monday, March 18, 2024 at 6:36 AM
To: user @spark mailto:user@spark.apache.org>>, Spark 
dev list mailto:d...@spark.apache.org>>, Mich Talebzadeh 
mailto:mich.talebza...@gmail.com>>
Cc: Matei Zaharia mailto:matei.zaha...@gmail.com>>
Subject: Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark 
Community
External message, be mindful when clicking links or attachments

Good idea. Will be useful

+1

On Monday, 18 March 2024 at 11:00:40 GMT, Mich Talebzadeh 
mailto:mich.talebza...@gmail.com>> wrote:


Some of you may be aware that Databricks community Home | Databricks
have just launched a knowledge sharing hub. I thought it would be a
good idea for the Apache Spark user group to have the same, especially
for repeat questions on Spark core, Spark SQL, Spark Structured
Streaming, Spark Mlib and so forth.

Apache Spark user and dev groups have been around for a good while.
They are serving their purpose . We went through creating a slack
community that managed to create more more heat than light.. This is
what Databricks community came up with and I quote

"Knowledge Sharing Hub
Dive into a collaborative space where members like YOU can exchange
knowledge, tips, and best practices. Join the conversation today and
unlock a wealth of collective wisdom to enhance your experience and
drive success."

I don't know the logistics of setting it up.but I am sure that should
not be that difficult. If anyone is supportive of this proposal, let
the usual +1, 0, -1 decide

HTH

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


  view my Linkedin profile


https://en.everybodywiki.com/Mich_Talebzadeh<https://urldefense.com/v3/__https:/en.everybodywiki.com/Mich_Talebzadeh__;!!HrbR-XT-OQ!Wu9fFP8RFJW2N_YUvwl9yctGHxtM-CFPe6McqOJDrxGBjIaRoF8vRwpjT9WzHojwI2R09Nbg8YE9ggB4FtocU8cQFw$>



Disclaimer: The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner Von Braun)".

-
To unsubscribe e-mail: 
user-unsubscr...@spark.apache.org<mailto:user-unsubscr...@spark.apache.org>



--
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297



Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Varun Shah
+1  Great initiative.

QQ: Stack Overflow has a similar feature called "Collectives", but I am not
sure of the cost of creating one for Apache Spark. Since SO is widely used
(at least it was before ChatGPT became the norm for searching questions), it
already has a lot of questions asked and answered by the community over a
period of time, so, if possible, we could leverage it as a starting point for
building a community before creating a completely new website from scratch.
Any thoughts on this?

Regards,
Varun Shah


On Mon, Mar 18, 2024, 16:29 Mich Talebzadeh 
wrote:

> Some of you may be aware that Databricks community Home | Databricks
> have just launched a knowledge sharing hub. I thought it would be a
> good idea for the Apache Spark user group to have the same, especially
> for repeat questions on Spark core, Spark SQL, Spark Structured
> Streaming, Spark Mlib and so forth.
>
> Apache Spark user and dev groups have been around for a good while.
> They are serving their purpose . We went through creating a slack
> community that managed to create more more heat than light.. This is
> what Databricks community came up with and I quote
>
> "Knowledge Sharing Hub
> Dive into a collaborative space where members like YOU can exchange
> knowledge, tips, and best practices. Join the conversation today and
> unlock a wealth of collective wisdom to enhance your experience and
> drive success."
>
> I don't know the logistics of setting it up.but I am sure that should
> not be that difficult. If anyone is supportive of this proposal, let
> the usual +1, 0, -1 decide
>
> HTH
>
> Mich Talebzadeh,
> Dad | Technologist | Solutions Architect | Engineer
> London
> United Kingdom
>
>
>view my Linkedin profile
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> Disclaimer: The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed . It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Werner Von Braun)".
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Deepak Sharma
+1.
I can contribute to it as well.

On Tue, 19 Mar 2024 at 9:19 AM, Code Tutelage 
wrote:

> +1
>
> Thanks for proposing
>
> On Mon, Mar 18, 2024 at 9:25 AM Parsian, Mahmoud
>  wrote:
>
>> Good idea. Will be useful
>>
>>
>>
>> +1
>>
>>
>>
>>
>>
>>
>>
>> *From: *ashok34...@yahoo.com.INVALID 
>> *Date: *Monday, March 18, 2024 at 6:36 AM
>> *To: *user @spark , Spark dev list <
>> d...@spark.apache.org>, Mich Talebzadeh 
>> *Cc: *Matei Zaharia 
>> *Subject: *Re: A proposal for creating a Knowledge Sharing Hub for
>> Apache Spark Community
>>
>> External message, be mindful when clicking links or attachments
>>
>>
>>
>> Good idea. Will be useful
>>
>>
>>
>> +1
>>
>>
>>
>> On Monday, 18 March 2024 at 11:00:40 GMT, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>
>>
>>
>>
>> Some of you may be aware that Databricks community Home | Databricks
>>
>> have just launched a knowledge sharing hub. I thought it would be a
>>
>> good idea for the Apache Spark user group to have the same, especially
>>
>> for repeat questions on Spark core, Spark SQL, Spark Structured
>>
>> Streaming, Spark Mlib and so forth.
>>
>>
>>
>> Apache Spark user and dev groups have been around for a good while.
>>
>> They are serving their purpose . We went through creating a slack
>>
>> community that managed to create more more heat than light.. This is
>>
>> what Databricks community came up with and I quote
>>
>>
>>
>> "Knowledge Sharing Hub
>>
>> Dive into a collaborative space where members like YOU can exchange
>>
>> knowledge, tips, and best practices. Join the conversation today and
>>
>> unlock a wealth of collective wisdom to enhance your experience and
>>
>> drive success."
>>
>>
>>
>> I don't know the logistics of setting it up.but I am sure that should
>>
>> not be that difficult. If anyone is supportive of this proposal, let
>>
>> the usual +1, 0, -1 decide
>>
>>
>>
>> HTH
>>
>>
>>
>> Mich Talebzadeh,
>>
>> Dad | Technologist | Solutions Architect | Engineer
>>
>> London
>>
>> United Kingdom
>>
>>
>>
>>
>>
>>   view my Linkedin profile
>>
>>
>>
>>
>>
>> https://en.everybodywiki.com/Mich_Talebzadeh
>> <https://urldefense.com/v3/__https:/en.everybodywiki.com/Mich_Talebzadeh__;!!HrbR-XT-OQ!Wu9fFP8RFJW2N_YUvwl9yctGHxtM-CFPe6McqOJDrxGBjIaRoF8vRwpjT9WzHojwI2R09Nbg8YE9ggB4FtocU8cQFw$>
>>
>>
>>
>>
>>
>>
>>
>> Disclaimer: The information provided is correct to the best of my
>>
>> knowledge but of course cannot be guaranteed . It is essential to note
>>
>> that, as with any advice, quote "one test result is worth one-thousand
>>
>> expert opinions (Werner Von Braun)".
>>
>>
>>
>> -
>>
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>>
>


Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Hyukjin Kwon
One very good example is the SparkR releases in the Conda channel (
https://github.com/conda-forge/r-sparkr-feedstock).
This is fully run by the community, unofficially.

On Tue, 19 Mar 2024 at 09:54, Mich Talebzadeh 
wrote:

> +1 for me
>
> Mich Talebzadeh,
> Dad | Technologist | Solutions Architect | Engineer
> London
> United Kingdom
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed . It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Werner  <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von
> Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".
>
>
> On Mon, 18 Mar 2024 at 16:23, Parsian, Mahmoud 
> wrote:
>
>> Good idea. Will be useful
>>
>>
>>
>> +1
>>
>>
>>
>>
>>
>>
>>
>> *From: *ashok34...@yahoo.com.INVALID 
>> *Date: *Monday, March 18, 2024 at 6:36 AM
>> *To: *user @spark , Spark dev list <
>> d...@spark.apache.org>, Mich Talebzadeh 
>> *Cc: *Matei Zaharia 
>> *Subject: *Re: A proposal for creating a Knowledge Sharing Hub for
>> Apache Spark Community
>>
>> External message, be mindful when clicking links or attachments
>>
>>
>>
>> Good idea. Will be useful
>>
>>
>>
>> +1
>>
>>
>>
>> On Monday, 18 March 2024 at 11:00:40 GMT, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>
>>
>>
>>
>> Some of you may be aware that Databricks community Home | Databricks
>>
>> have just launched a knowledge sharing hub. I thought it would be a
>>
>> good idea for the Apache Spark user group to have the same, especially
>>
>> for repeat questions on Spark core, Spark SQL, Spark Structured
>>
>> Streaming, Spark Mlib and so forth.
>>
>>
>>
>> Apache Spark user and dev groups have been around for a good while.
>>
>> They are serving their purpose . We went through creating a slack
>>
>> community that managed to create more more heat than light.. This is
>>
>> what Databricks community came up with and I quote
>>
>>
>>
>> "Knowledge Sharing Hub
>>
>> Dive into a collaborative space where members like YOU can exchange
>>
>> knowledge, tips, and best practices. Join the conversation today and
>>
>> unlock a wealth of collective wisdom to enhance your experience and
>>
>> drive success."
>>
>>
>>
>> I don't know the logistics of setting it up.but I am sure that should
>>
>> not be that difficult. If anyone is supportive of this proposal, let
>>
>> the usual +1, 0, -1 decide
>>
>>
>>
>> HTH
>>
>>
>>
>> Mich Talebzadeh,
>>
>> Dad | Technologist | Solutions Architect | Engineer
>>
>> London
>>
>> United Kingdom
>>
>>
>>
>>
>>
>>   view my Linkedin profile
>>
>>
>>
>>
>>
>> https://en.everybodywiki.com/Mich_Talebzadeh
>> <https://urldefense.com/v3/__https:/en.everybodywiki.com/Mich_Talebzadeh__;!!HrbR-XT-OQ!Wu9fFP8RFJW2N_YUvwl9yctGHxtM-CFPe6McqOJDrxGBjIaRoF8vRwpjT9WzHojwI2R09Nbg8YE9ggB4FtocU8cQFw$>
>>
>>
>>
>>
>>
>>
>>
>> Disclaimer: The information provided is correct to the best of my
>>
>> knowledge but of course cannot be guaranteed . It is essential to note
>>
>> that, as with any advice, quote "one test result is worth one-thousand
>>
>> expert opinions (Werner Von Braun)".
>>
>>
>>
>> -
>>
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>>
>


Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Mich Talebzadeh
OK thanks for the update.

What does "officially blessed" signify here? Can we have and run it as a
sister site? The reason this comes to my mind is that the interested
parties should have easy access to this site (from ISUG Spark sites) as a
reference repository. I guess the advice would be that the information
(topics) is provided on a best-effort basis and cannot be guaranteed.

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von
Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".


On Mon, 18 Mar 2024 at 21:04, Reynold Xin  wrote:

> One of the problem in the past when something like this was brought up was
> that the ASF couldn't have officially blessed venues beyond the already
> approved ones. So that's something to look into.
>
> Now of course you are welcome to run unofficial things unblessed as long
> as they follow trademark rules.
>
>
>
> On Mon, Mar 18, 2024 at 1:53 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Well as long as it works.
>>
>> Please all check this link from Databricks and let us know your thoughts.
>> Will something similar work for us?. Of course Databricks have much deeper
>> pockets than our ASF community. Will it require moderation in our side to
>> block spams and nutcases.
>>
>> Knowledge Sharing Hub - Databricks
>> <https://community.databricks.com/t5/knowledge-sharing-hub/bd-p/Knowledge-Sharing-Hub>
>>
>>
>> Mich Talebzadeh,
>> Dad | Technologist | Solutions Architect | Engineer
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* The information provided is correct to the best of my
>> knowledge but of course cannot be guaranteed . It is essential to note
>> that, as with any advice, quote "one test result is worth one-thousand
>> expert opinions (Werner
>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von Braun
>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".
>>
>>
>> On Mon, 18 Mar 2024 at 20:31, Bjørn Jørgensen 
>> wrote:
>>
>>> something like this  Spark community · GitHub
>>> <https://github.com/Spark-community>
>>>
>>>
>>> man. 18. mars 2024 kl. 17:26 skrev Parsian, Mahmoud <
>>> mpars...@illumina.com.invalid>:
>>>
>>>> Good idea. Will be useful
>>>>
>>>>
>>>>
>>>> +1
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> *From: *ashok34...@yahoo.com.INVALID 
>>>> *Date: *Monday, March 18, 2024 at 6:36 AM
>>>> *To: *user @spark , Spark dev list <
>>>> d...@spark.apache.org>, Mich Talebzadeh 
>>>> *Cc: *Matei Zaharia 
>>>> *Subject: *Re: A proposal for creating a Knowledge Sharing Hub for
>>>> Apache Spark Community
>>>>
>>>> External message, be mindful when clicking links or attachments
>>>>
>>>>
>>>>
>>>> Good idea. Will be useful
>>>>
>>>>
>>>>
>>>> +1
>>>>
>>>>
>>>>
>>>> On Monday, 18 March 2024 at 11:00:40 GMT, Mich Talebzadeh <
>>>> mich.talebza...@gmail.com> wrote:
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Some of you may be aware that Databricks community Home | Databricks
>>>>
>>>> have just launched a knowledge sharing hub. I thought it would be a
>>>>
>>>> good idea for the Apache Spark user group to have the same, especially
>>>>
>>>> for repeat questions on Spark core, Spark SQL, Spark Structured
>>>>
>>>> Streaming, Spark Mlib and so forth.
>>>>
>>>>
>>>>
>>>> Apache Spark user and dev groups have been around for a good while.
>>>>
>>>> They are serving their purpose . We went

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Reynold Xin
One of the problems in the past when something like this was brought up was that 
the ASF couldn't have officially blessed venues beyond the already approved 
ones. So that's something to look into.

Now of course you are welcome to run unofficial things unblessed as long as 
they follow trademark rules.

On Mon, Mar 18, 2024 at 1:53 PM, Mich Talebzadeh < mich.talebza...@gmail.com > 
wrote:

> 
> Well as long as it works.
> 
> Please all check this link from Databricks and let us know your thoughts.
> Will something similar work for us?. Of course Databricks have much deeper
> pockets than our ASF community. Will it require moderation in our side to
> block spams and nutcases.
> 
> 
> 
> Knowledge Sharing Hub - Databricks (
> https://community.databricks.com/t5/knowledge-sharing-hub/bd-p/Knowledge-Sharing-Hub
> )
> 
> 
> 
> Mich Talebzadeh,
> Dad | Technologist | Solutions Architect | Engineer
> 
> London
> 
> United Kingdom
> 
> 
> 
> 
> 
> 
> 
> ** view my Linkedin profile (
> https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/ )
> 
> 
> 
> 
> 
> 
> 
> 
> https:/ / en. everybodywiki. com/ Mich_Talebzadeh (
> https://en.everybodywiki.com/Mich_Talebzadeh )
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed . It is essential to note
> that, as with any advice, quote "one test result is worth one - thousand
> expert opinions ( Werner ( https://en.wikipedia.org/wiki/Wernher_von_Braun
> ) Von Braun ( https://en.wikipedia.org/wiki/Wernher_von_Braun ) )".
> 
> 
> 
> 
> 
> On Mon, 18 Mar 2024 at 20:31, Bjørn Jørgensen < bjornjorgensen@ gmail. com
> ( bjornjorgen...@gmail.com ) > wrote:
> 
> 
>> something like this Spark community · GitHub (
>> https://github.com/Spark-community )
>> 
>> 
>> 
>> man. 18. mars 2024 kl. 17:26 skrev Parsian, Mahmoud < mparsian@ illumina. 
>> com.
>> invalid ( mpars...@illumina.com.invalid ) >:
>> 
>> 
>>> 
>>> 
>>> Good idea. Will be useful
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> +1
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> *From:* ashok34668@ yahoo. com. INVALID ( ashok34...@yahoo.com.INVALID ) <
>>> ashok34668@ yahoo. com. INVALID ( ashok34...@yahoo.com.INVALID ) >
>>> *Date:* Monday, March 18 , 2024 at 6:36 AM
>>> *To:* user @spark < user@ spark. apache. org ( user@spark.apache.org ) >,
>>> Spark dev list < dev@ spark. apache. org ( d...@spark.apache.org ) >, Mich
>>> Talebzadeh < mich. talebzadeh@ gmail. com ( mich.talebza...@gmail.com ) >
>>> *Cc:* Matei Zaharia < matei. zaharia@ gmail. com ( matei.zaha...@gmail.com
>>> ) >
>>> *Subject:* Re: A proposal for creating a Knowledge Sharing Hub for Apache
>>> Spark Community
>>> 
>>> 
>>> 
>>> 
>>> External message, be mindful when clicking links or attachments
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> Good idea. Will be useful
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> +1
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On Monday, 18 March 2024 at 11:00:40 GMT, Mich Talebzadeh < mich. 
>>> talebzadeh@
>>> gmail. com ( mich.talebza...@gmail.com ) > wrote:
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> Some of you may be aware that Databricks community Home | Databricks
>>> 
>>> 
>>> 
>>> 
>>> have just launched a knowledge sharing hub. I thought it would be a
>>> 
>>> 
>>> 
>>> 
>>> good idea for the Apache Spark user group to have the same, especially
>>> 
>>> 
>>> 
>>> 
>>> for repeat questions on Spark core, Spark SQL, Spark Structured
>>> 
>>> 
>>> 
>>> 
>>> Streaming, Spark Mlib and so forth.
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>>

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Mich Talebzadeh
Well as long as it works.

Please all check this link from Databricks and let us know your thoughts.
Will something similar work for us? Of course, Databricks has much deeper
pockets than our ASF community. Will it require moderation on our side to
block spam and nutcases?

Knowledge Sharing Hub - Databricks
<https://community.databricks.com/t5/knowledge-sharing-hub/bd-p/Knowledge-Sharing-Hub>


Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von
Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".


On Mon, 18 Mar 2024 at 20:31, Bjørn Jørgensen 
wrote:

> something like this  Spark community · GitHub
> <https://github.com/Spark-community>
>
>
> On Mon, 18 Mar 2024 at 17:26, Parsian, Mahmoud
> wrote:
>
>> Good idea. Will be useful
>>
>>
>>
>> +1
>>
>>
>>
>>
>>
>>
>>
>> *From: *ashok34...@yahoo.com.INVALID 
>> *Date: *Monday, March 18, 2024 at 6:36 AM
>> *To: *user @spark , Spark dev list <
>> d...@spark.apache.org>, Mich Talebzadeh 
>> *Cc: *Matei Zaharia 
>> *Subject: *Re: A proposal for creating a Knowledge Sharing Hub for
>> Apache Spark Community
>>
>> External message, be mindful when clicking links or attachments
>>
>>
>>
>> Good idea. Will be useful
>>
>>
>>
>> +1
>>
>>
>>
>> On Monday, 18 March 2024 at 11:00:40 GMT, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>
>>
>>
>>
>> Some of you may be aware that Databricks community Home | Databricks
>>
>> have just launched a knowledge sharing hub. I thought it would be a
>>
>> good idea for the Apache Spark user group to have the same, especially
>>
>> for repeat questions on Spark core, Spark SQL, Spark Structured
>>
>> Streaming, Spark Mlib and so forth.
>>
>>
>>
>> Apache Spark user and dev groups have been around for a good while.
>>
>> They are serving their purpose . We went through creating a slack
>>
>> community that managed to create more more heat than light.. This is
>>
>> what Databricks community came up with and I quote
>>
>>
>>
>> "Knowledge Sharing Hub
>>
>> Dive into a collaborative space where members like YOU can exchange
>>
>> knowledge, tips, and best practices. Join the conversation today and
>>
>> unlock a wealth of collective wisdom to enhance your experience and
>>
>> drive success."
>>
>>
>>
>> I don't know the logistics of setting it up.but I am sure that should
>>
>> not be that difficult. If anyone is supportive of this proposal, let
>>
>> the usual +1, 0, -1 decide
>>
>>
>>
>> HTH
>>
>>
>>
>> Mich Talebzadeh,
>>
>> Dad | Technologist | Solutions Architect | Engineer
>>
>> London
>>
>> United Kingdom
>>
>>
>>
>>
>>
>>   view my Linkedin profile
>>
>>
>>
>>
>>
>> https://en.everybodywiki.com/Mich_Talebzadeh
>> <https://urldefense.com/v3/__https:/en.everybodywiki.com/Mich_Talebzadeh__;!!HrbR-XT-OQ!Wu9fFP8RFJW2N_YUvwl9yctGHxtM-CFPe6McqOJDrxGBjIaRoF8vRwpjT9WzHojwI2R09Nbg8YE9ggB4FtocU8cQFw$>
>>
>>
>>
>>
>>
>>
>>
>> Disclaimer: The information provided is correct to the best of my
>>
>> knowledge but of course cannot be guaranteed . It is essential to note
>>
>> that, as with any advice, quote "one test result is worth one-thousand
>>
>> expert opinions (Werner Von Braun)".
>>
>>
>>
>> -
>>
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>>
>
>
> --
> Bjørn Jørgensen
> Vestre Aspehaug 4, 6010 Ålesund
> Norge
>
> +47 480 94 297
>


Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Bjørn Jørgensen
something like this  Spark community · GitHub
<https://github.com/Spark-community>


On Mon, 18 Mar 2024 at 17:26, Parsian, Mahmoud
wrote:

> Good idea. Will be useful
>
>
>
> +1
>
>
>
>
>
>
>
> *From: *ashok34...@yahoo.com.INVALID 
> *Date: *Monday, March 18, 2024 at 6:36 AM
> *To: *user @spark , Spark dev list <
> d...@spark.apache.org>, Mich Talebzadeh 
> *Cc: *Matei Zaharia 
> *Subject: *Re: A proposal for creating a Knowledge Sharing Hub for Apache
> Spark Community
>
> External message, be mindful when clicking links or attachments
>
>
>
> Good idea. Will be useful
>
>
>
> +1
>
>
>
> On Monday, 18 March 2024 at 11:00:40 GMT, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>
>
>
>
> Some of you may be aware that Databricks community Home | Databricks
>
> have just launched a knowledge sharing hub. I thought it would be a
>
> good idea for the Apache Spark user group to have the same, especially
>
> for repeat questions on Spark core, Spark SQL, Spark Structured
>
> Streaming, Spark Mlib and so forth.
>
>
>
> Apache Spark user and dev groups have been around for a good while.
>
> They are serving their purpose . We went through creating a slack
>
> community that managed to create more more heat than light.. This is
>
> what Databricks community came up with and I quote
>
>
>
> "Knowledge Sharing Hub
>
> Dive into a collaborative space where members like YOU can exchange
>
> knowledge, tips, and best practices. Join the conversation today and
>
> unlock a wealth of collective wisdom to enhance your experience and
>
> drive success."
>
>
>
> I don't know the logistics of setting it up.but I am sure that should
>
> not be that difficult. If anyone is supportive of this proposal, let
>
> the usual +1, 0, -1 decide
>
>
>
> HTH
>
>
>
> Mich Talebzadeh,
>
> Dad | Technologist | Solutions Architect | Engineer
>
> London
>
> United Kingdom
>
>
>
>
>
>   view my Linkedin profile
>
>
>
>
>
> https://en.everybodywiki.com/Mich_Talebzadeh
> <https://urldefense.com/v3/__https:/en.everybodywiki.com/Mich_Talebzadeh__;!!HrbR-XT-OQ!Wu9fFP8RFJW2N_YUvwl9yctGHxtM-CFPe6McqOJDrxGBjIaRoF8vRwpjT9WzHojwI2R09Nbg8YE9ggB4FtocU8cQFw$>
>
>
>
>
>
>
>
> Disclaimer: The information provided is correct to the best of my
>
> knowledge but of course cannot be guaranteed . It is essential to note
>
> that, as with any advice, quote "one test result is worth one-thousand
>
> expert opinions (Werner Von Braun)".
>
>
>
> -
>
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>
>


-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297


Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Code Tutelage
+1

Thanks for proposing

On Mon, Mar 18, 2024 at 9:25 AM Parsian, Mahmoud
 wrote:

> Good idea. Will be useful
>
>
>
> +1
>
>
>
>
>
>
>
> *From: *ashok34...@yahoo.com.INVALID 
> *Date: *Monday, March 18, 2024 at 6:36 AM
> *To: *user @spark , Spark dev list <
> d...@spark.apache.org>, Mich Talebzadeh 
> *Cc: *Matei Zaharia 
> *Subject: *Re: A proposal for creating a Knowledge Sharing Hub for Apache
> Spark Community
>
> External message, be mindful when clicking links or attachments
>
>
>
> Good idea. Will be useful
>
>
>
> +1
>
>
>
> On Monday, 18 March 2024 at 11:00:40 GMT, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>
>
>
>
> Some of you may be aware that Databricks community Home | Databricks
>
> have just launched a knowledge sharing hub. I thought it would be a
>
> good idea for the Apache Spark user group to have the same, especially
>
> for repeat questions on Spark core, Spark SQL, Spark Structured
>
> Streaming, Spark Mlib and so forth.
>
>
>
> Apache Spark user and dev groups have been around for a good while.
>
> They are serving their purpose . We went through creating a slack
>
> community that managed to create more more heat than light.. This is
>
> what Databricks community came up with and I quote
>
>
>
> "Knowledge Sharing Hub
>
> Dive into a collaborative space where members like YOU can exchange
>
> knowledge, tips, and best practices. Join the conversation today and
>
> unlock a wealth of collective wisdom to enhance your experience and
>
> drive success."
>
>
>
> I don't know the logistics of setting it up.but I am sure that should
>
> not be that difficult. If anyone is supportive of this proposal, let
>
> the usual +1, 0, -1 decide
>
>
>
> HTH
>
>
>
> Mich Talebzadeh,
>
> Dad | Technologist | Solutions Architect | Engineer
>
> London
>
> United Kingdom
>
>
>
>
>
>   view my Linkedin profile
>
>
>
>
>
> https://en.everybodywiki.com/Mich_Talebzadeh
> <https://urldefense.com/v3/__https:/en.everybodywiki.com/Mich_Talebzadeh__;!!HrbR-XT-OQ!Wu9fFP8RFJW2N_YUvwl9yctGHxtM-CFPe6McqOJDrxGBjIaRoF8vRwpjT9WzHojwI2R09Nbg8YE9ggB4FtocU8cQFw$>
>
>
>
>
>
>
>
> Disclaimer: The information provided is correct to the best of my
>
> knowledge but of course cannot be guaranteed . It is essential to note
>
> that, as with any advice, quote "one test result is worth one-thousand
>
> expert opinions (Werner Von Braun)".
>
>
>
> -
>
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>
>


Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Mich Talebzadeh
+1 for me

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von
Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".


On Mon, 18 Mar 2024 at 16:23, Parsian, Mahmoud 
wrote:

> Good idea. Will be useful
>
>
>
> +1
>
>
>
>
>
>
>
> *From: *ashok34...@yahoo.com.INVALID 
> *Date: *Monday, March 18, 2024 at 6:36 AM
> *To: *user @spark , Spark dev list <
> d...@spark.apache.org>, Mich Talebzadeh 
> *Cc: *Matei Zaharia 
> *Subject: *Re: A proposal for creating a Knowledge Sharing Hub for Apache
> Spark Community
>
> External message, be mindful when clicking links or attachments
>
>
>
> Good idea. Will be useful
>
>
>
> +1
>
>
>
> On Monday, 18 March 2024 at 11:00:40 GMT, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>
>
>
>
> Some of you may be aware that Databricks community Home | Databricks
>
> have just launched a knowledge sharing hub. I thought it would be a
>
> good idea for the Apache Spark user group to have the same, especially
>
> for repeat questions on Spark core, Spark SQL, Spark Structured
>
> Streaming, Spark Mlib and so forth.
>
>
>
> Apache Spark user and dev groups have been around for a good while.
>
> They are serving their purpose . We went through creating a slack
>
> community that managed to create more more heat than light.. This is
>
> what Databricks community came up with and I quote
>
>
>
> "Knowledge Sharing Hub
>
> Dive into a collaborative space where members like YOU can exchange
>
> knowledge, tips, and best practices. Join the conversation today and
>
> unlock a wealth of collective wisdom to enhance your experience and
>
> drive success."
>
>
>
> I don't know the logistics of setting it up.but I am sure that should
>
> not be that difficult. If anyone is supportive of this proposal, let
>
> the usual +1, 0, -1 decide
>
>
>
> HTH
>
>
>
> Mich Talebzadeh,
>
> Dad | Technologist | Solutions Architect | Engineer
>
> London
>
> United Kingdom
>
>
>
>
>
>   view my Linkedin profile
>
>
>
>
>
> https://en.everybodywiki.com/Mich_Talebzadeh
> <https://urldefense.com/v3/__https:/en.everybodywiki.com/Mich_Talebzadeh__;!!HrbR-XT-OQ!Wu9fFP8RFJW2N_YUvwl9yctGHxtM-CFPe6McqOJDrxGBjIaRoF8vRwpjT9WzHojwI2R09Nbg8YE9ggB4FtocU8cQFw$>
>
>
>
>
>
>
>
> Disclaimer: The information provided is correct to the best of my
>
> knowledge but of course cannot be guaranteed . It is essential to note
>
> that, as with any advice, quote "one test result is worth one-thousand
>
> expert opinions (Werner Von Braun)".
>
>
>
> -
>
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>
>


Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Parsian, Mahmoud
Good idea. Will be useful

+1



From: ashok34...@yahoo.com.INVALID 
Date: Monday, March 18, 2024 at 6:36 AM
To: user @spark , Spark dev list 
, Mich Talebzadeh 
Cc: Matei Zaharia 
Subject: Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark 
Community
External message, be mindful when clicking links or attachments

Good idea. Will be useful

+1

On Monday, 18 March 2024 at 11:00:40 GMT, Mich Talebzadeh 
 wrote:


Some of you may be aware that Databricks community Home | Databricks
have just launched a knowledge sharing hub. I thought it would be a
good idea for the Apache Spark user group to have the same, especially
for repeat questions on Spark core, Spark SQL, Spark Structured
Streaming, Spark Mlib and so forth.

Apache Spark user and dev groups have been around for a good while.
They are serving their purpose . We went through creating a slack
community that managed to create more more heat than light.. This is
what Databricks community came up with and I quote

"Knowledge Sharing Hub
Dive into a collaborative space where members like YOU can exchange
knowledge, tips, and best practices. Join the conversation today and
unlock a wealth of collective wisdom to enhance your experience and
drive success."

I don't know the logistics of setting it up.but I am sure that should
not be that difficult. If anyone is supportive of this proposal, let
the usual +1, 0, -1 decide

HTH

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


  view my Linkedin profile


https://en.everybodywiki.com/Mich_Talebzadeh<https://urldefense.com/v3/__https:/en.everybodywiki.com/Mich_Talebzadeh__;!!HrbR-XT-OQ!Wu9fFP8RFJW2N_YUvwl9yctGHxtM-CFPe6McqOJDrxGBjIaRoF8vRwpjT9WzHojwI2R09Nbg8YE9ggB4FtocU8cQFw$>



Disclaimer: The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner Von Braun)".

-
To unsubscribe e-mail: 
user-unsubscr...@spark.apache.org<mailto:user-unsubscr...@spark.apache.org>



Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread ashok34...@yahoo.com.INVALID
 Good idea. Will be useful
+1
On Monday, 18 March 2024 at 11:00:40 GMT, Mich Talebzadeh 
 wrote:  
 
 Some of you may be aware that Databricks community Home | Databricks
have just launched a knowledge sharing hub. I thought it would be a
good idea for the Apache Spark user group to have the same, especially
for repeat questions on Spark core, Spark SQL, Spark Structured
Streaming, Spark Mlib and so forth.

Apache Spark user and dev groups have been around for a good while.
They are serving their purpose . We went through creating a slack
community that managed to create more more heat than light.. This is
what Databricks community came up with and I quote

"Knowledge Sharing Hub
Dive into a collaborative space where members like YOU can exchange
knowledge, tips, and best practices. Join the conversation today and
unlock a wealth of collective wisdom to enhance your experience and
drive success."

I don't know the logistics of setting it up.but I am sure that should
not be that difficult. If anyone is supportive of this proposal, let
the usual +1, 0, -1 decide

HTH

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


  view my Linkedin profile


 https://en.everybodywiki.com/Mich_Talebzadeh



Disclaimer: The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner Von Braun)".

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

  

A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Mich Talebzadeh
Some of you may be aware that Databricks community Home | Databricks
have just launched a knowledge sharing hub. I thought it would be a
good idea for the Apache Spark user group to have the same, especially
for repeat questions on Spark core, Spark SQL, Spark Structured
Streaming, Spark MLlib and so forth.

Apache Spark user and dev groups have been around for a good while.
They are serving their purpose. We went through creating a Slack
community that managed to create more heat than light. This is
what Databricks community came up with and I quote

"Knowledge Sharing Hub
Dive into a collaborative space where members like YOU can exchange
knowledge, tips, and best practices. Join the conversation today and
unlock a wealth of collective wisdom to enhance your experience and
drive success."

I don't know the logistics of setting it up, but I am sure that should
not be that difficult. If anyone is supportive of this proposal, let
the usual +1, 0, -1 decide.

HTH

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile


 https://en.everybodywiki.com/Mich_Talebzadeh



Disclaimer: The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner Von Braun)".

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-05 Thread Pan,Bingkun
Okay, let me double-check it carefully.

Thank you very much for your help!



From: Jungtaek Lim 
Sent: March 5, 2024 21:56:41
To: Pan,Bingkun
Cc: Dongjoon Hyun; dev; user
Subject: Re: [ANNOUNCE] Apache Spark 3.5.1 released

Yeah the approach seems OK to me - please double check that the doc generation 
in Spark repo won't fail after the move of the js file. Other than that, it 
would be probably just a matter of updating the release process.

On Tue, Mar 5, 2024 at 7:24 PM Pan,Bingkun 
mailto:panbing...@baidu.com>> wrote:

Okay, I see.

Perhaps we can solve this confusion by sharing the same file `version.json` 
across `all versions` in the `Spark website repo`? Make each version of the 
document display the `same` data in the dropdown menu.


From: Jungtaek Lim 
mailto:kabhwan.opensou...@gmail.com>>
Sent: March 5, 2024 17:09:07
To: Pan,Bingkun
Cc: Dongjoon Hyun; dev; user
Subject: Re: [ANNOUNCE] Apache Spark 3.5.1 released

Let me be more specific.

We have two active release version lines, 3.4.x and 3.5.x. We just released 
Spark 3.5.1, having a dropdown as 3.5.1 and 3.4.2 given the fact the last 
version of 3.4.x is 3.4.2. After a month we released Spark 3.4.3. In the 
dropdown of Spark 3.4.3, there will be 3.5.1 and 3.4.3. But if we call this as 
done, 3.5.1 (still latest) won't show 3.4.3 in the dropdown, giving confusion 
that 3.4.3 wasn't ever released.

This is just about two active release version lines with keeping only the 
latest version of version lines. If you expand this to EOLed version lines and 
versions which aren't the latest in their version line, the problem gets much 
more complicated.

On Tue, Mar 5, 2024 at 6:01 PM Pan,Bingkun 
mailto:panbing...@baidu.com>> wrote:

Based on my understanding, we should not update versions that have already been 
released,

such as the situation you mentioned: `But what about dropout of version D? 
Should we add E in the dropdown?` We only need to record the latest `version. 
json` file that has already been published at the time of each new document 
release.

Of course, if we need to keep the latest in every document, I think it's also 
possible.

Only by sharing the same version. json file in each version.


From: Jungtaek Lim 
mailto:kabhwan.opensou...@gmail.com>>
Sent: March 5, 2024 16:47:30
To: Pan,Bingkun
Cc: Dongjoon Hyun; dev; user
Subject: Re: [ANNOUNCE] Apache Spark 3.5.1 released

But this does not answer my question about updating the dropdown for the doc of 
"already released versions", right?

Let's say we just released version D, and the dropdown has version A, B, C. We 
have another release tomorrow as version E, and it's probably easy to add A, B, 
C, D in the dropdown of E. But what about dropdown of version D? Should we add 
E in the dropdown? How do we maintain it if we will have 10 releases afterwards?

On Tue, Mar 5, 2024 at 5:27 PM Pan,Bingkun 
mailto:panbing...@baidu.com>> wrote:

According to my understanding, the original intention of this feature is that 
when a user has entered the pyspark document, if he finds that the version he 
is currently in is not the version he wants, he can easily jump to the version 
he wants by clicking on the drop-down box. Additionally, in this PR, the 
current automatic mechanism for PRs did not merge in.

https://github.com/apache/spark/pull/42881<https://mailshield.baidu.com/check?q=NXF5O0EN4F6TOoAzxFGzXSJvMnQlPeztKpu%2fBYaKpd2sRl6qEYTx2NGUrTYUrhOk>

So, we need to manually update this file. I can manually submit an update first 
to get this feature working.


From: Jungtaek Lim 
mailto:kabhwan.opensou...@gmail.com>>
Sent: March 4, 2024 6:34:42
To: Dongjoon Hyun
Cc: dev; user
Subject: Re: [ANNOUNCE] Apache Spark 3.5.1 released

Shall we revisit this functionality? The API doc is built with individual 
versions, and for each individual version we depend on other released versions. 
This does not seem to be right to me. Also, the functionality is only in 
PySpark API doc which does not seem to be consistent as well.

I don't think this is manageable with the current approach (listing versions in 
version-dependent doc). Let's say we release 3.4.3 after 3.5.1. Should we 
update the versions in 3.5.1 to add 3.4.3 in version switcher? How about the 
time we are going to release the new version after releasing 10 versions? 
What's the criteria of pruning the version?

Unless we have a good answer to these questions, I think it's better to revert 
the functionality - it missed various considerations.

On Fri, Mar 1, 2024 at 2:44 PM Jungtaek Lim 
mailto:kabhwan.opensou...@gmail.com>> wrote:
Thanks for reporting - this is odd - the dropdown did not exist in other recent 
releases.

https://spark.apache.org/docs/3.5.0/api/python/index.html<https://mailshield.baidu.com/check?q=uXELebgeq9S

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-05 Thread Jungtaek Lim
Yeah the approach seems OK to me - please double check that the doc
generation in Spark repo won't fail after the move of the js file. Other
than that, it would probably just be a matter of updating the release
process.

On Tue, Mar 5, 2024 at 7:24 PM Pan,Bingkun  wrote:

> Okay, I see.
>
> Perhaps we can solve this confusion by sharing the same file `version.json`
> across `all versions` in the `Spark website repo`? Make each version of
> the document display the `same` data in the dropdown menu.
> --
> *From:* Jungtaek Lim 
> *Sent:* March 5, 2024 17:09:07
> *To:* Pan,Bingkun
> *Cc:* Dongjoon Hyun; dev; user
> *Subject:* Re: [ANNOUNCE] Apache Spark 3.5.1 released
>
> Let me be more specific.
>
> We have two active release version lines, 3.4.x and 3.5.x. We just
> released Spark 3.5.1, having a dropdown as 3.5.1 and 3.4.2 given the fact
> the last version of 3.4.x is 3.4.2. After a month we released Spark 3.4.3.
> In the dropdown of Spark 3.4.3, there will be 3.5.1 and 3.4.3. But if we
> call this as done, 3.5.1 (still latest) won't show 3.4.3 in the dropdown,
> giving confusion that 3.4.3 wasn't ever released.
>
> This is just about two active release version lines with keeping only the
> latest version of version lines. If you expand this to EOLed version lines
> and versions which aren't the latest in their version line, the problem
> gets much more complicated.
>
> On Tue, Mar 5, 2024 at 6:01 PM Pan,Bingkun  wrote:
>
>> Based on my understanding, we should not update versions that have
>> already been released,
>>
>> such as the situation you mentioned: `But what about dropout of version
>> D? Should we add E in the dropdown?` We only need to record the latest
>> `version. json` file that has already been published at the time of each
>> new document release.
>>
>> Of course, if we need to keep the latest in every document, I think it's
>> also possible.
>>
>> Only by sharing the same version. json file in each version.
>> --
>> *From:* Jungtaek Lim 
>> *Sent:* March 5, 2024 16:47:30
>> *To:* Pan,Bingkun
>> *Cc:* Dongjoon Hyun; dev; user
>> *Subject:* Re: [ANNOUNCE] Apache Spark 3.5.1 released
>>
>> But this does not answer my question about updating the dropdown for the
>> doc of "already released versions", right?
>>
>> Let's say we just released version D, and the dropdown has version A, B,
>> C. We have another release tomorrow as version E, and it's probably easy to
>> add A, B, C, D in the dropdown of E. But what about dropdown of version D?
>> Should we add E in the dropdown? How do we maintain it if we will have 10
>> releases afterwards?
>>
>> On Tue, Mar 5, 2024 at 5:27 PM Pan,Bingkun  wrote:
>>
>>> According to my understanding, the original intention of this feature is
>>> that when a user has entered the pyspark document, if he finds that the
>>> version he is currently in is not the version he wants, he can easily jump
>>> to the version he wants by clicking on the drop-down box. Additionally, in
>>> this PR, the current automatic mechanism for PRs did not merge in.
>>>
>>> https://github.com/apache/spark/pull/42881
>>> <https://mailshield.baidu.com/check?q=NXF5O0EN4F6TOoAzxFGzXSJvMnQlPeztKpu%2fBYaKpd2sRl6qEYTx2NGUrTYUrhOk>
>>>
>>> So, we need to manually update this file. I can manually submit an
>>> update first to get this feature working.
>>> --
>>> *From:* Jungtaek Lim 
>>> *Sent:* March 4, 2024 6:34:42
>>> *To:* Dongjoon Hyun
>>> *Cc:* dev; user
>>> *Subject:* Re: [ANNOUNCE] Apache Spark 3.5.1 released
>>>
>>> Shall we revisit this functionality? The API doc is built with
>>> individual versions, and for each individual version we depend on other
>>> released versions. This does not seem to be right to me. Also, the
>>> functionality is only in PySpark API doc which does not seem to be
>>> consistent as well.
>>>
>>> I don't think this is manageable with the current approach (listing
>>> versions in version-dependent doc). Let's say we release 3.4.3 after 3.5.1.
>>> Should we update the versions in 3.5.1 to add 3.4.3 in version switcher?
>>> How about the time we are going to release the new version after releasing
>>> 10 versions? What's the criteria of pruning the version?
>>>
>>> Unless we have a good answer to these questions, I think it's better to
>>> revert the functionality - it missed various c

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-05 Thread Pan,Bingkun
Okay, I see.

Perhaps we can resolve this confusion by sharing the same `version.json` file 
across all versions in the Spark website repo, so that every version of the 
documentation displays the same data in the dropdown menu.
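
As an illustration of the idea above, here is a minimal sketch of what such a 
shared switcher file could look like and how it might be generated. The file 
name, field names, and output path are assumptions for illustration only; the 
actual schema used by the PySpark docs may differ.

# Hypothetical sketch: generate one switcher file shared by every docs
# version. The schema (name/version/url) and the output path are assumptions,
# not the actual layout of the spark-website repository.
import json

RELEASED_VERSIONS = ["3.3.4", "3.4.2", "3.5.1"]  # example data only
LATEST = "3.5.1"

entries = [
    {
        "name": f"{v} (latest)" if v == LATEST else v,
        "version": v,
        "url": f"https://spark.apache.org/docs/{v}/api/python/",
    }
    for v in RELEASED_VERSIONS
]

# Because every docs version points at this one file, they all render the
# same dropdown regardless of when they were built.
with open("versions.json", "w") as f:
    json.dump(entries, f, indent=2)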


发件人: Jungtaek Lim 
发送时间: 2024年3月5日 17:09:07
收件人: Pan,Bingkun
抄送: Dongjoon Hyun; dev; user
主题: Re: [ANNOUNCE] Apache Spark 3.5.1 released

Let me be more specific.

We have two active release version lines, 3.4.x and 3.5.x. We just released 
Spark 3.5.1, having a dropdown as 3.5.1 and 3.4.2 given the fact the last 
version of 3.4.x is 3.4.2. After a month we released Spark 3.4.3. In the 
dropdown of Spark 3.4.3, there will be 3.5.1 and 3.4.3. But if we call this as 
done, 3.5.1 (still latest) won't show 3.4.3 in the dropdown, giving confusion 
that 3.4.3 wasn't ever released.

This is just about two active release version lines with keeping only the 
latest version of version lines. If you expand this to EOLed version lines and 
versions which aren't the latest in their version line, the problem gets much 
more complicated.

On Tue, Mar 5, 2024 at 6:01 PM Pan,Bingkun 
mailto:panbing...@baidu.com>> wrote:

Based on my understanding, we should not update versions that have already been 
released,

such as the situation you mentioned: `But what about dropout of version D? 
Should we add E in the dropdown?` We only need to record the latest `version. 
json` file that has already been published at the time of each new document 
release.

Of course, if we need to keep the latest in every document, I think it's also 
possible.

Only by sharing the same version. json file in each version.


发件人: Jungtaek Lim 
mailto:kabhwan.opensou...@gmail.com>>
发送时间: 2024年3月5日 16:47:30
收件人: Pan,Bingkun
抄送: Dongjoon Hyun; dev; user
主题: Re: [ANNOUNCE] Apache Spark 3.5.1 released

But this does not answer my question about updating the dropdown for the doc of 
"already released versions", right?

Let's say we just released version D, and the dropdown has version A, B, C. We 
have another release tomorrow as version E, and it's probably easy to add A, B, 
C, D in the dropdown of E. But what about dropdown of version D? Should we add 
E in the dropdown? How do we maintain it if we will have 10 releases afterwards?

On Tue, Mar 5, 2024 at 5:27 PM Pan,Bingkun 
mailto:panbing...@baidu.com>> wrote:

According to my understanding, the original intention of this feature is that 
when a user has entered the pyspark document, if he finds that the version he 
is currently in is not the version he wants, he can easily jump to the version 
he wants by clicking on the drop-down box. Additionally, in this PR, the 
current automatic mechanism for PRs did not merge in.

https://github.com/apache/spark/pull/42881<https://mailshield.baidu.com/check?q=NXF5O0EN4F6TOoAzxFGzXSJvMnQlPeztKpu%2fBYaKpd2sRl6qEYTx2NGUrTYUrhOk>

So, we need to manually update this file. I can manually submit an update first 
to get this feature working.


发件人: Jungtaek Lim 
mailto:kabhwan.opensou...@gmail.com>>
发送时间: 2024年3月4日 6:34:42
收件人: Dongjoon Hyun
抄送: dev; user
主题: Re: [ANNOUNCE] Apache Spark 3.5.1 released

Shall we revisit this functionality? The API doc is built with individual 
versions, and for each individual version we depend on other released versions. 
This does not seem to be right to me. Also, the functionality is only in 
PySpark API doc which does not seem to be consistent as well.

I don't think this is manageable with the current approach (listing versions in 
version-dependent doc). Let's say we release 3.4.3 after 3.5.1. Should we 
update the versions in 3.5.1 to add 3.4.3 in version switcher? How about the 
time we are going to release the new version after releasing 10 versions? 
What's the criteria of pruning the version?

Unless we have a good answer to these questions, I think it's better to revert 
the functionality - it missed various considerations.

On Fri, Mar 1, 2024 at 2:44 PM Jungtaek Lim 
mailto:kabhwan.opensou...@gmail.com>> wrote:
Thanks for reporting - this is odd - the dropdown did not exist in other recent 
releases.

https://spark.apache.org/docs/3.5.0/api/python/index.html<https://mailshield.baidu.com/check?q=uXELebgeq9ShKrQ3HDYtw08xGdWbbrT3FEzFk%2fzTZ%2bVxzlJrJa41y1xJkZ7RbZcLmQNMGzBVvVX6KlpxrtsKRQ%3d%3d>
https://spark.apache.org/docs/3.4.2/api/python/index.html<https://mailshield.baidu.com/check?q=vFHg6IjqXnlPilWEcpu6a0oCJLXpFeNnsL6hZ%2fpZY0nGPd6tnUFbimhVD6zVpMlL8RAgxzN8GQM6cNBFe8hXvA%3d%3d>
https://spark.apache.org/docs/3.3.4/api/python/index.html<https://mailshield.baidu.com/check?q=cfoH89Pu%2fNbZC4s7657SqqfHpT7hoppw7e6%2fZzsz8S7ZoEMm2LijOxwcGgKS5O29HzYUyQoooMRdy%2fd5Y36e2Q%3d%3d>

Looks like the dropdown feature was recently introduced but partially done. The 
addition of a dropdown was done, but th

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-05 Thread Jungtaek Lim
Let me be more specific.

We have two active release version lines, 3.4.x and 3.5.x. We just released
Spark 3.5.1, having a dropdown as 3.5.1 and 3.4.2 given the fact the last
version of 3.4.x is 3.4.2. After a month we released Spark 3.4.3. In the
dropdown of Spark 3.4.3, there will be 3.5.1 and 3.4.3. But if we call this
done, 3.5.1 (still the latest) won't show 3.4.3 in the dropdown, giving the
impression that 3.4.3 was never released.

This is just about two active release version lines with keeping only the
latest version of version lines. If you expand this to EOLed version lines
and versions which aren't the latest in their version line, the problem
gets much more complicated.

On Tue, Mar 5, 2024 at 6:01 PM Pan,Bingkun  wrote:

> Based on my understanding, we should not update versions that have already
> been released,
>
> such as the situation you mentioned: `But what about dropout of version D?
> Should we add E in the dropdown?` We only need to record the latest
> `version. json` file that has already been published at the time of each
> new document release.
>
> Of course, if we need to keep the latest in every document, I think it's
> also possible.
>
> Only by sharing the same version. json file in each version.
> --
> *From:* Jungtaek Lim 
> *Sent:* March 5, 2024 16:47:30
> *To:* Pan,Bingkun
> *Cc:* Dongjoon Hyun; dev; user
> *Subject:* Re: [ANNOUNCE] Apache Spark 3.5.1 released
>
> But this does not answer my question about updating the dropdown for the
> doc of "already released versions", right?
>
> Let's say we just released version D, and the dropdown has version A, B,
> C. We have another release tomorrow as version E, and it's probably easy to
> add A, B, C, D in the dropdown of E. But what about dropdown of version D?
> Should we add E in the dropdown? How do we maintain it if we will have 10
> releases afterwards?
>
> On Tue, Mar 5, 2024 at 5:27 PM Pan,Bingkun  wrote:
>
>> According to my understanding, the original intention of this feature is
>> that when a user has entered the pyspark document, if he finds that the
>> version he is currently in is not the version he wants, he can easily jump
>> to the version he wants by clicking on the drop-down box. Additionally, in
>> this PR, the current automatic mechanism for PRs did not merge in.
>>
>> https://github.com/apache/spark/pull/42881
>> <https://mailshield.baidu.com/check?q=NXF5O0EN4F6TOoAzxFGzXSJvMnQlPeztKpu%2fBYaKpd2sRl6qEYTx2NGUrTYUrhOk>
>>
>> So, we need to manually update this file. I can manually submit an update
>> first to get this feature working.
>> --
>> *From:* Jungtaek Lim 
>> *Sent:* March 4, 2024 6:34:42
>> *To:* Dongjoon Hyun
>> *Cc:* dev; user
>> *Subject:* Re: [ANNOUNCE] Apache Spark 3.5.1 released
>>
>> Shall we revisit this functionality? The API doc is built with individual
>> versions, and for each individual version we depend on other released
>> versions. This does not seem to be right to me. Also, the functionality is
>> only in PySpark API doc which does not seem to be consistent as well.
>>
>> I don't think this is manageable with the current approach (listing
>> versions in version-dependent doc). Let's say we release 3.4.3 after 3.5.1.
>> Should we update the versions in 3.5.1 to add 3.4.3 in version switcher?
>> How about the time we are going to release the new version after releasing
>> 10 versions? What's the criteria of pruning the version?
>>
>> Unless we have a good answer to these questions, I think it's better to
>> revert the functionality - it missed various considerations.
>>
>> On Fri, Mar 1, 2024 at 2:44 PM Jungtaek Lim 
>> wrote:
>>
>>> Thanks for reporting - this is odd - the dropdown did not exist in other
>>> recent releases.
>>>
>>> https://spark.apache.org/docs/3.5.0/api/python/index.html
>>> <https://mailshield.baidu.com/check?q=uXELebgeq9ShKrQ3HDYtw08xGdWbbrT3FEzFk%2fzTZ%2bVxzlJrJa41y1xJkZ7RbZcLmQNMGzBVvVX6KlpxrtsKRQ%3d%3d>
>>> https://spark.apache.org/docs/3.4.2/api/python/index.html
>>> <https://mailshield.baidu.com/check?q=vFHg6IjqXnlPilWEcpu6a0oCJLXpFeNnsL6hZ%2fpZY0nGPd6tnUFbimhVD6zVpMlL8RAgxzN8GQM6cNBFe8hXvA%3d%3d>
>>> https://spark.apache.org/docs/3.3.4/api/python/index.html
>>> <https://mailshield.baidu.com/check?q=cfoH89Pu%2fNbZC4s7657SqqfHpT7hoppw7e6%2fZzsz8S7ZoEMm2LijOxwcGgKS5O29HzYUyQoooMRdy%2fd5Y36e2Q%3d%3d>
>>>
>>> Looks like the dropdown feature was recently introduced but partially
>>> done. The addition of a dropdown was done, but the way how

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-05 Thread Pan,Bingkun
Based on my understanding, we should not update versions that have already been 
released, such as the situation you mentioned: `But what about the dropdown of 
version D? Should we add E in the dropdown?` We only need to record the latest 
`version.json` file that has already been published at the time of each new 
document release.

Of course, if we need to keep every document up to date, I think that is also 
possible: only by sharing the same `version.json` file across each version.
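
A rough sketch of the "record the latest at each release" idea follows: a small 
step in the release process that appends the new release to the shared file 
instead of touching docs that are already published. The file path, schema, and 
function name are hypothetical, for illustration only.

# Hypothetical release-time step: add the new release to a shared versions
# file rather than editing already-published docs. Path and field names are
# assumptions for illustration.
import json

def add_release(path: str, new_version: str) -> None:
    with open(path) as f:
        entries = json.load(f)
    # Drop any stale "(latest)" label from earlier entries.
    for e in entries:
        e["name"] = e["version"]
    entries.append({
        "name": f"{new_version} (latest)",
        "version": new_version,
        "url": f"https://spark.apache.org/docs/{new_version}/api/python/",
    })
    entries.sort(key=lambda e: [int(x) for x in e["version"].split(".")])
    with open(path, "w") as f:
        json.dump(entries, f, indent=2)

# Example usage: add_release("versions.json", "3.4.3")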


From: Jungtaek Lim 
Sent: March 5, 2024 16:47:30
To: Pan,Bingkun
Cc: Dongjoon Hyun; dev; user
Subject: Re: [ANNOUNCE] Apache Spark 3.5.1 released

But this does not answer my question about updating the dropdown for the doc of 
"already released versions", right?

Let's say we just released version D, and the dropdown has version A, B, C. We 
have another release tomorrow as version E, and it's probably easy to add A, B, 
C, D in the dropdown of E. But what about dropdown of version D? Should we add 
E in the dropdown? How do we maintain it if we will have 10 releases afterwards?

On Tue, Mar 5, 2024 at 5:27 PM Pan,Bingkun 
mailto:panbing...@baidu.com>> wrote:

According to my understanding, the original intention of this feature is that 
when a user has entered the pyspark document, if he finds that the version he 
is currently in is not the version he wants, he can easily jump to the version 
he wants by clicking on the drop-down box. Additionally, in this PR, the 
current automatic mechanism for PRs did not merge in.

https://github.com/apache/spark/pull/42881<https://mailshield.baidu.com/check?q=NXF5O0EN4F6TOoAzxFGzXSJvMnQlPeztKpu%2fBYaKpd2sRl6qEYTx2NGUrTYUrhOk>

So, we need to manually update this file. I can manually submit an update first 
to get this feature working.


From: Jungtaek Lim 
mailto:kabhwan.opensou...@gmail.com>>
Sent: March 4, 2024 6:34:42
To: Dongjoon Hyun
Cc: dev; user
Subject: Re: [ANNOUNCE] Apache Spark 3.5.1 released

Shall we revisit this functionality? The API doc is built with individual 
versions, and for each individual version we depend on other released versions. 
This does not seem to be right to me. Also, the functionality is only in 
PySpark API doc which does not seem to be consistent as well.

I don't think this is manageable with the current approach (listing versions in 
version-dependent doc). Let's say we release 3.4.3 after 3.5.1. Should we 
update the versions in 3.5.1 to add 3.4.3 in version switcher? How about the 
time we are going to release the new version after releasing 10 versions? 
What's the criteria of pruning the version?

Unless we have a good answer to these questions, I think it's better to revert 
the functionality - it missed various considerations.

On Fri, Mar 1, 2024 at 2:44 PM Jungtaek Lim 
mailto:kabhwan.opensou...@gmail.com>> wrote:
Thanks for reporting - this is odd - the dropdown did not exist in other recent 
releases.

https://spark.apache.org/docs/3.5.0/api/python/index.html<https://mailshield.baidu.com/check?q=uXELebgeq9ShKrQ3HDYtw08xGdWbbrT3FEzFk%2fzTZ%2bVxzlJrJa41y1xJkZ7RbZcLmQNMGzBVvVX6KlpxrtsKRQ%3d%3d>
https://spark.apache.org/docs/3.4.2/api/python/index.html<https://mailshield.baidu.com/check?q=vFHg6IjqXnlPilWEcpu6a0oCJLXpFeNnsL6hZ%2fpZY0nGPd6tnUFbimhVD6zVpMlL8RAgxzN8GQM6cNBFe8hXvA%3d%3d>
https://spark.apache.org/docs/3.3.4/api/python/index.html<https://mailshield.baidu.com/check?q=cfoH89Pu%2fNbZC4s7657SqqfHpT7hoppw7e6%2fZzsz8S7ZoEMm2LijOxwcGgKS5O29HzYUyQoooMRdy%2fd5Y36e2Q%3d%3d>

Looks like the dropdown feature was recently introduced but partially done. The 
addition of a dropdown was done, but the way how to bump the version was missed 
to be documented.
The contributor proposed the way to update the version "automatically", but the 
PR wasn't merged. As a result, we are neither having the instruction how to 
bump the version manually, nor having the automatic bump.

* PR for addition of dropdown: 
https://github.com/apache/spark/pull/42428<https://mailshield.baidu.com/check?q=pSDq2Cdb4aBtjOEg7J1%2fXPtYeSxjVkQfXKV%2fmfX1Y7NeT77hnIS%2bsvMbbXwT3DLm>
* PR for automatically bumping version: 
https://github.com/apache/spark/pull/42881<https://mailshield.baidu.com/check?q=NXF5O0EN4F6TOoAzxFGzXSJvMnQlPeztKpu%2fBYaKpd2sRl6qEYTx2NGUrTYUrhOk>

We will probably need to add an instruction in the release process to update 
the version. (For automatic bumping I don't have a good idea.)
I'll look into it. Please expect some delay during the holiday weekend in S. 
Korea.

Thanks again.
Jungtaek Lim (HeartSaVioR)


On Fri, Mar 1, 2024 at 2:14 PM Dongjoon Hyun 
mailto:dongjoon.h...@gmail.com>> wrote:
BTW, Jungtaek.

PySpark document seems to show a wrong branch. At this time, `master`.


https://spark.apache.org/docs/3.5.1/api/python/index.html<https://mailshield.baidu.com/c

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-05 Thread Jungtaek Lim
But this does not answer my question about updating the dropdown for the
doc of "already released versions", right?

Let's say we just released version D, and the dropdown has version A, B, C.
We have another release tomorrow as version E, and it's probably easy to
add A, B, C, D in the dropdown of E. But what about dropdown of version D?
Should we add E in the dropdown? How do we maintain it if we will have 10
releases afterwards?

On Tue, Mar 5, 2024 at 5:27 PM Pan,Bingkun  wrote:

> According to my understanding, the original intention of this feature is
> that when a user has entered the pyspark document, if he finds that the
> version he is currently in is not the version he wants, he can easily jump
> to the version he wants by clicking on the drop-down box. Additionally, in
> this PR, the current automatic mechanism for PRs did not merge in.
>
> https://github.com/apache/spark/pull/42881
>
> So, we need to manually update this file. I can manually submit an update
> first to get this feature working.
> --
> *From:* Jungtaek Lim 
> *Sent:* March 4, 2024 6:34:42
> *To:* Dongjoon Hyun
> *Cc:* dev; user
> *Subject:* Re: [ANNOUNCE] Apache Spark 3.5.1 released
>
> Shall we revisit this functionality? The API doc is built with individual
> versions, and for each individual version we depend on other released
> versions. This does not seem to be right to me. Also, the functionality is
> only in PySpark API doc which does not seem to be consistent as well.
>
> I don't think this is manageable with the current approach (listing
> versions in version-dependent doc). Let's say we release 3.4.3 after 3.5.1.
> Should we update the versions in 3.5.1 to add 3.4.3 in version switcher?
> How about the time we are going to release the new version after releasing
> 10 versions? What's the criteria of pruning the version?
>
> Unless we have a good answer to these questions, I think it's better to
> revert the functionality - it missed various considerations.
>
> On Fri, Mar 1, 2024 at 2:44 PM Jungtaek Lim 
> wrote:
>
>> Thanks for reporting - this is odd - the dropdown did not exist in other
>> recent releases.
>>
>> https://spark.apache.org/docs/3.5.0/api/python/index.html
>> <https://mailshield.baidu.com/check?q=uXELebgeq9ShKrQ3HDYtw08xGdWbbrT3FEzFk%2fzTZ%2bVxzlJrJa41y1xJkZ7RbZcLmQNMGzBVvVX6KlpxrtsKRQ%3d%3d>
>> https://spark.apache.org/docs/3.4.2/api/python/index.html
>> <https://mailshield.baidu.com/check?q=vFHg6IjqXnlPilWEcpu6a0oCJLXpFeNnsL6hZ%2fpZY0nGPd6tnUFbimhVD6zVpMlL8RAgxzN8GQM6cNBFe8hXvA%3d%3d>
>> https://spark.apache.org/docs/3.3.4/api/python/index.html
>> <https://mailshield.baidu.com/check?q=cfoH89Pu%2fNbZC4s7657SqqfHpT7hoppw7e6%2fZzsz8S7ZoEMm2LijOxwcGgKS5O29HzYUyQoooMRdy%2fd5Y36e2Q%3d%3d>
>>
>> Looks like the dropdown feature was recently introduced but partially
>> done. The addition of a dropdown was done, but the way how to bump the
>> version was missed to be documented.
>> The contributor proposed the way to update the version "automatically",
>> but the PR wasn't merged. As a result, we are neither having the
>> instruction how to bump the version manually, nor having the automatic bump.
>>
>> * PR for addition of dropdown: https://github.com/apache/spark/pull/42428
>> <https://mailshield.baidu.com/check?q=pSDq2Cdb4aBtjOEg7J1%2fXPtYeSxjVkQfXKV%2fmfX1Y7NeT77hnIS%2bsvMbbXwT3DLm>
>> * PR for automatically bumping version:
>> https://github.com/apache/spark/pull/42881
>> <https://mailshield.baidu.com/check?q=NXF5O0EN4F6TOoAzxFGzXSJvMnQlPeztKpu%2fBYaKpd2sRl6qEYTx2NGUrTYUrhOk>
>>
>> We will probably need to add an instruction in the release process to
>> update the version. (For automatic bumping I don't have a good idea.)
>> I'll look into it. Please expect some delay during the holiday weekend
>> in S. Korea.
>>
>> Thanks again.
>> Jungtaek Lim (HeartSaVioR)
>>
>>
>> On Fri, Mar 1, 2024 at 2:14 PM Dongjoon Hyun 
>> wrote:
>>
>>> BTW, Jungtaek.
>>>
>>> PySpark document seems to show a wrong branch. At this time, `master`.
>>>
>>> https://spark.apache.org/docs/3.5.1/api/python/index.html
>>> <https://mailshield.baidu.com/check?q=KwooIjNwx9R5XjkTxvpqs6ApF2YX2ZujKl%2bha1PX%2bf3X4CQowIWtvSFmFPVO1297fFYMkgFMgmFuEBDkuDwpig%3d%3d>
>>>
>>> PySpark Overview
>>> <https://mailshield.baidu.com/check?q=rahGq5g%2bcbjBOU3xXCbESExdvGhXXTpk%2f%2f3BUMatX7zAgGbgcBy3mkuJmlmgtZZIoahnY2Cj2t4uylAFmefkTY1%2bQbN0rqSWYUU6qjrQRqY%3d>
>>>
>>>Date: Feb 24, 2024 Version: master
>>>

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-05 Thread Pan,Bingkun
According to my understanding, the original intention of this feature is that 
when a user lands on the PySpark documentation and finds that the version they 
are currently viewing is not the one they want, they can easily jump to the 
desired version by clicking on the drop-down box. Additionally, the PR that 
would have automated this mechanism was not merged:

https://github.com/apache/spark/pull/42881

So, we need to manually update this file. I can manually submit an update first 
to get this feature working.


From: Jungtaek Lim 
Sent: March 4, 2024 6:34:42
To: Dongjoon Hyun
Cc: dev; user
Subject: Re: [ANNOUNCE] Apache Spark 3.5.1 released

Shall we revisit this functionality? The API doc is built with individual 
versions, and for each individual version we depend on other released versions. 
This does not seem to be right to me. Also, the functionality is only in 
PySpark API doc which does not seem to be consistent as well.

I don't think this is manageable with the current approach (listing versions in 
version-dependent doc). Let's say we release 3.4.3 after 3.5.1. Should we 
update the versions in 3.5.1 to add 3.4.3 in version switcher? How about the 
time we are going to release the new version after releasing 10 versions? 
What's the criteria of pruning the version?

Unless we have a good answer to these questions, I think it's better to revert 
the functionality - it missed various considerations.

On Fri, Mar 1, 2024 at 2:44 PM Jungtaek Lim 
mailto:kabhwan.opensou...@gmail.com>> wrote:
Thanks for reporting - this is odd - the dropdown did not exist in other recent 
releases.

https://spark.apache.org/docs/3.5.0/api/python/index.html<https://mailshield.baidu.com/check?q=uXELebgeq9ShKrQ3HDYtw08xGdWbbrT3FEzFk%2fzTZ%2bVxzlJrJa41y1xJkZ7RbZcLmQNMGzBVvVX6KlpxrtsKRQ%3d%3d>
https://spark.apache.org/docs/3.4.2/api/python/index.html<https://mailshield.baidu.com/check?q=vFHg6IjqXnlPilWEcpu6a0oCJLXpFeNnsL6hZ%2fpZY0nGPd6tnUFbimhVD6zVpMlL8RAgxzN8GQM6cNBFe8hXvA%3d%3d>
https://spark.apache.org/docs/3.3.4/api/python/index.html<https://mailshield.baidu.com/check?q=cfoH89Pu%2fNbZC4s7657SqqfHpT7hoppw7e6%2fZzsz8S7ZoEMm2LijOxwcGgKS5O29HzYUyQoooMRdy%2fd5Y36e2Q%3d%3d>

Looks like the dropdown feature was recently introduced but partially done. The 
addition of a dropdown was done, but the way how to bump the version was missed 
to be documented.
The contributor proposed the way to update the version "automatically", but the 
PR wasn't merged. As a result, we are neither having the instruction how to 
bump the version manually, nor having the automatic bump.

* PR for addition of dropdown: 
https://github.com/apache/spark/pull/42428<https://mailshield.baidu.com/check?q=pSDq2Cdb4aBtjOEg7J1%2fXPtYeSxjVkQfXKV%2fmfX1Y7NeT77hnIS%2bsvMbbXwT3DLm>
* PR for automatically bumping version: 
https://github.com/apache/spark/pull/42881<https://mailshield.baidu.com/check?q=NXF5O0EN4F6TOoAzxFGzXSJvMnQlPeztKpu%2fBYaKpd2sRl6qEYTx2NGUrTYUrhOk>

We will probably need to add an instruction in the release process to update 
the version. (For automatic bumping I don't have a good idea.)
I'll look into it. Please expect some delay during the holiday weekend in S. 
Korea.

Thanks again.
Jungtaek Lim (HeartSaVioR)


On Fri, Mar 1, 2024 at 2:14 PM Dongjoon Hyun 
mailto:dongjoon.h...@gmail.com>> wrote:
BTW, Jungtaek.

PySpark document seems to show a wrong branch. At this time, `master`.


https://spark.apache.org/docs/3.5.1/api/python/index.html<https://mailshield.baidu.com/check?q=KwooIjNwx9R5XjkTxvpqs6ApF2YX2ZujKl%2bha1PX%2bf3X4CQowIWtvSFmFPVO1297fFYMkgFMgmFuEBDkuDwpig%3d%3d>

PySpark 
Overview<https://mailshield.baidu.com/check?q=rahGq5g%2bcbjBOU3xXCbESExdvGhXXTpk%2f%2f3BUMatX7zAgGbgcBy3mkuJmlmgtZZIoahnY2Cj2t4uylAFmefkTY1%2bQbN0rqSWYUU6qjrQRqY%3d>

   Date: Feb 24, 2024 Version: master

[Screenshot 2024-02-29 at 21.12.24.png]


Could you do the follow-up, please?

Thank you in advance.

Dongjoon.


On Thu, Feb 29, 2024 at 2:48 PM John Zhuge 
mailto:jzh...@apache.org>> wrote:
Excellent work, congratulations!

On Wed, Feb 28, 2024 at 10:12 PM Dongjoon Hyun 
mailto:dongjoon.h...@gmail.com>> wrote:
Congratulations!

Bests,
Dongjoon.

On Wed, Feb 28, 2024 at 11:43 AM beliefer 
mailto:belie...@163.com>> wrote:

Congratulations!



At 2024-02-28 17:43:25, "Jungtaek Lim" 
mailto:kabhwan.opensou...@gmail.com>> wrote:

Hi everyone,

We are happy to announce the availability of Spark 3.5.1!

Spark 3.5.1 is a maintenance release containing stability fixes. This
release is based on the branch-3.5 maintenance branch of Spark. We strongly
recommend all 3.5 users to upgrade to this stable release.

To download Spark 3.5.1, head over to the download page:
https://spark.apache.org/downloads.html<https://mailshield.baidu.com/check?q=aV5QpxMQ4pApHhycByY17SDpg%2fyWowLsFKuT2QIJ%2blgKNmM

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-04 Thread yangjie01
That sounds like a great suggestion.

From: Jungtaek Lim 
Date: Tuesday, March 5, 2024 10:46
To: Hyukjin Kwon 
Cc: yangjie01 , Dongjoon Hyun , 
dev , user 
Subject: Re: [ANNOUNCE] Apache Spark 3.5.1 released

Yes, it's relevant to that PR. I wonder, if we want to expose version switcher, 
it should be in versionless doc (spark-website) rather than the doc being 
pinned to a specific version.

On Tue, Mar 5, 2024 at 11:18 AM Hyukjin Kwon 
mailto:gurwls...@apache.org>> wrote:
Is this related to 
https://github.com/apache/spark/pull/42428<https://mailshield.baidu.com/check?q=pSDq2Cdb4aBtjOEg7J1%2fXPtYeSxjVkQfXKV%2fmfX1Y7NeT77hnIS%2bsvMbbXwT3DLm>?

cc @Yang,Jie(INF)<mailto:yangji...@baidu.com>

On Mon, 4 Mar 2024 at 22:21, Jungtaek Lim 
mailto:kabhwan.opensou...@gmail.com>> wrote:
Shall we revisit this functionality? The API doc is built with individual 
versions, and for each individual version we depend on other released versions. 
This does not seem to be right to me. Also, the functionality is only in 
PySpark API doc which does not seem to be consistent as well.

I don't think this is manageable with the current approach (listing versions in 
version-dependent doc). Let's say we release 3.4.3 after 3.5.1. Should we 
update the versions in 3.5.1 to add 3.4.3 in version switcher? How about the 
time we are going to release the new version after releasing 10 versions? 
What's the criteria of pruning the version?

Unless we have a good answer to these questions, I think it's better to revert 
the functionality - it missed various considerations.

On Fri, Mar 1, 2024 at 2:44 PM Jungtaek Lim 
mailto:kabhwan.opensou...@gmail.com>> wrote:
Thanks for reporting - this is odd - the dropdown did not exist in other recent 
releases.

https://spark.apache.org/docs/3.5.0/api/python/index.html<https://mailshield.baidu.com/check?q=uXELebgeq9ShKrQ3HDYtw08xGdWbbrT3FEzFk%2fzTZ%2bVxzlJrJa41y1xJkZ7RbZcLmQNMGzBVvVX6KlpxrtsKRQ%3d%3d>
https://spark.apache.org/docs/3.4.2/api/python/index.html<https://mailshield.baidu.com/check?q=vFHg6IjqXnlPilWEcpu6a0oCJLXpFeNnsL6hZ%2fpZY0nGPd6tnUFbimhVD6zVpMlL8RAgxzN8GQM6cNBFe8hXvA%3d%3d>
https://spark.apache.org/docs/3.3.4/api/python/index.html<https://mailshield.baidu.com/check?q=cfoH89Pu%2fNbZC4s7657SqqfHpT7hoppw7e6%2fZzsz8S7ZoEMm2LijOxwcGgKS5O29HzYUyQoooMRdy%2fd5Y36e2Q%3d%3d>

Looks like the dropdown feature was recently introduced but partially done. The 
addition of a dropdown was done, but the way how to bump the version was missed 
to be documented.
The contributor proposed the way to update the version "automatically", but the 
PR wasn't merged. As a result, we are neither having the instruction how to 
bump the version manually, nor having the automatic bump.

* PR for addition of dropdown: 
https://github.com/apache/spark/pull/42428<https://mailshield.baidu.com/check?q=pSDq2Cdb4aBtjOEg7J1%2fXPtYeSxjVkQfXKV%2fmfX1Y7NeT77hnIS%2bsvMbbXwT3DLm>
* PR for automatically bumping version: 
https://github.com/apache/spark/pull/42881<https://mailshield.baidu.com/check?q=NXF5O0EN4F6TOoAzxFGzXSJvMnQlPeztKpu%2fBYaKpd2sRl6qEYTx2NGUrTYUrhOk>

We will probably need to add an instruction in the release process to update 
the version. (For automatic bumping I don't have a good idea.)
I'll look into it. Please expect some delay during the holiday weekend in S. 
Korea.

Thanks again.
Jungtaek Lim (HeartSaVioR)


On Fri, Mar 1, 2024 at 2:14 PM Dongjoon Hyun 
mailto:dongjoon.h...@gmail.com>> wrote:
BTW, Jungtaek.

PySpark document seems to show a wrong branch. At this time, `master`.


https://spark.apache.org/docs/3.5.1/api/python/index.html<https://mailshield.baidu.com/check?q=KwooIjNwx9R5XjkTxvpqs6ApF2YX2ZujKl%2bha1PX%2bf3X4CQowIWtvSFmFPVO1297fFYMkgFMgmFuEBDkuDwpig%3d%3d>

PySpark Overview

   Date: Feb 24, 2024 Version: master
[cid:image001.png@01DA6F13.CD4B0B00]



Could you do the follow-up, please?

Thank you in advance.

Dongjoon.


On Thu, Feb 29, 2024 at 2:48 PM John Zhuge 
mailto:jzh...@apache.org>> wrote:
Excellent work, congratulations!

On Wed, Feb 28, 2024 at 10:12 PM Dongjoon Hyun 
mailto:dongjoon.h...@gmail.com>> wrote:
Congratulations!

Bests,
Dongjoon.

On Wed, Feb 28, 2024 at 11:43 AM beliefer 
mailto:belie...@163.com>> wrote:

Congratulations!





At 2024-02-28 17:43:25, "Jungtaek Lim" 
mailto:kabhwan.opensou...@gmail.com>> wrote:
Hi everyone,

We are happy to announce the availability of Spark 3.5.1!

Spark 3.5.1 is a maintenance release containing stability fixes. This
release is based on the branch-3.5 maintenance branch of Spark. We strongly
recommend all 3.5 users to upgrade to this stable release.

To download Spark 3.5.1, head over to the download page:
https://spark.apache.org/downloads.html<https://mailshield.baidu.com/check?q=aV5QpxMQ4pApHhycByY17SDpg%2fyWowLsFKuT2QIJ%2blg

Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-04 Thread Jungtaek Lim
Yes, it's relevant to that PR. I wonder whether, if we want to expose a version
switcher, it should live in the versionless doc (spark-website) rather than in
docs pinned to a specific version.
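
For context, a minimal sketch of what "versionless" could mean in practice is
shown below, assuming the dropdown is the pydata-sphinx-theme version switcher
and that spark-website serves a single switcher file at a stable URL; the URL
and navbar layout here are assumptions, not the project's actual configuration.

# Sketch of a Sphinx conf.py fragment (assumptions: the docs use the
# pydata-sphinx-theme version switcher, and spark-website hosts one
# switcher file at a stable, versionless URL).
release = "3.5.1"  # the version this particular doc build is pinned to

html_theme_options = {
    "switcher": {
        # One URL outside the versioned docs, so already-published builds
        # pick up new releases without being rebuilt.
        "json_url": "https://spark.apache.org/static/versions.json",  # assumed path
        "version_match": release,
    },
    "navbar_end": ["version-switcher", "navbar-icon-links"],
}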

On Tue, Mar 5, 2024 at 11:18 AM Hyukjin Kwon  wrote:

> Is this related to https://github.com/apache/spark/pull/42428?
>
> cc @Yang,Jie(INF) 
>
> On Mon, 4 Mar 2024 at 22:21, Jungtaek Lim 
> wrote:
>
>> Shall we revisit this functionality? The API doc is built with individual
>> versions, and for each individual version we depend on other released
>> versions. This does not seem to be right to me. Also, the functionality is
>> only in PySpark API doc which does not seem to be consistent as well.
>>
>> I don't think this is manageable with the current approach (listing
>> versions in version-dependent doc). Let's say we release 3.4.3 after 3.5.1.
>> Should we update the versions in 3.5.1 to add 3.4.3 in version switcher?
>> How about the time we are going to release the new version after releasing
>> 10 versions? What's the criteria of pruning the version?
>>
>> Unless we have a good answer to these questions, I think it's better to
>> revert the functionality - it missed various considerations.
>>
>> On Fri, Mar 1, 2024 at 2:44 PM Jungtaek Lim 
>> wrote:
>>
>>> Thanks for reporting - this is odd - the dropdown did not exist in other
>>> recent releases.
>>>
>>> https://spark.apache.org/docs/3.5.0/api/python/index.html
>>> https://spark.apache.org/docs/3.4.2/api/python/index.html
>>> https://spark.apache.org/docs/3.3.4/api/python/index.html
>>>
>>> Looks like the dropdown feature was recently introduced but partially
>>> done. The addition of a dropdown was done, but the way how to bump the
>>> version was missed to be documented.
>>> The contributor proposed the way to update the version "automatically",
>>> but the PR wasn't merged. As a result, we are neither having the
>>> instruction how to bump the version manually, nor having the automatic bump.
>>>
>>> * PR for addition of dropdown:
>>> https://github.com/apache/spark/pull/42428
>>> * PR for automatically bumping version:
>>> https://github.com/apache/spark/pull/42881
>>>
>>> We will probably need to add an instruction in the release process to
>>> update the version. (For automatic bumping I don't have a good idea.)
>>> I'll look into it. Please expect some delay during the holiday weekend
>>> in S. Korea.
>>>
>>> Thanks again.
>>> Jungtaek Lim (HeartSaVioR)
>>>
>>>
>>> On Fri, Mar 1, 2024 at 2:14 PM Dongjoon Hyun 
>>> wrote:
>>>
>>>> BTW, Jungtaek.
>>>>
>>>> PySpark document seems to show a wrong branch. At this time, `master`.
>>>>
>>>> https://spark.apache.org/docs/3.5.1/api/python/index.html
>>>>
>>>> PySpark Overview
>>>> <https://spark.apache.org/docs/3.5.1/api/python/index.html#pyspark-overview>
>>>>
>>>>Date: Feb 24, 2024 Version: master
>>>>
>>>> [image: Screenshot 2024-02-29 at 21.12.24.png]
>>>>
>>>>
>>>> Could you do the follow-up, please?
>>>>
>>>> Thank you in advance.
>>>>
>>>> Dongjoon.
>>>>
>>>>
>>>> On Thu, Feb 29, 2024 at 2:48 PM John Zhuge  wrote:
>>>>
>>>>> Excellent work, congratulations!
>>>>>
>>>>> On Wed, Feb 28, 2024 at 10:12 PM Dongjoon Hyun <
>>>>> dongjoon.h...@gmail.com> wrote:
>>>>>
>>>>>> Congratulations!
>>>>>>
>>>>>> Bests,
>>>>>> Dongjoon.
>>>>>>
>>>>>> On Wed, Feb 28, 2024 at 11:43 AM beliefer  wrote:
>>>>>>
>>>>>>> Congratulations!
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> At 2024-02-28 17:43:25, "Jungtaek Lim" 
>>>>>>> wrote:
>>>>>>>
>>>>>>> Hi everyone,
>>>>>>>
>>>>>>> We are happy to announce the availability of Spark 3.5.1!
>>>>>>>
>>>>>>> Spark 3.5.1 is a maintenance release containing stability fixes. This
>>>>>>> release is based on the branch-3.5 maintenance branch of Spark. We
>>>>>>> strongly
>>>>>>> recommend all 3.5 users to upgrade to this stable release.
>>>>>>>
>>>>>>> To download Spark 3.5.1, head over to the download page:
>>>>>>> https://spark.apache.org/downloads.html
>>>>>>>
>>>>>>> To view the release notes:
>>>>>>> https://spark.apache.org/releases/spark-release-3-5-1.html
>>>>>>>
>>>>>>> We would like to acknowledge all community members for contributing
>>>>>>> to this
>>>>>>> release. This release would not have been possible without you.
>>>>>>>
>>>>>>> Jungtaek Lim
>>>>>>>
>>>>>>> ps. Yikun is helping us through releasing the official docker image
>>>>>>> for Spark 3.5.1 (Thanks Yikun!) It may take some time to be generally
>>>>>>> available.
>>>>>>>
>>>>>>>
>>>>>
>>>>> --
>>>>> John Zhuge
>>>>>
>>>>


Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-04 Thread Hyukjin Kwon
Is this related to https://github.com/apache/spark/pull/42428?

cc @Yang,Jie(INF) 

On Mon, 4 Mar 2024 at 22:21, Jungtaek Lim 
wrote:

> Shall we revisit this functionality? The API doc is built with individual
> versions, and for each individual version we depend on other released
> versions. This does not seem to be right to me. Also, the functionality is
> only in PySpark API doc which does not seem to be consistent as well.
>
> I don't think this is manageable with the current approach (listing
> versions in version-dependent doc). Let's say we release 3.4.3 after 3.5.1.
> Should we update the versions in 3.5.1 to add 3.4.3 in version switcher?
> How about the time we are going to release the new version after releasing
> 10 versions? What's the criteria of pruning the version?
>
> Unless we have a good answer to these questions, I think it's better to
> revert the functionality - it missed various considerations.
>
> On Fri, Mar 1, 2024 at 2:44 PM Jungtaek Lim 
> wrote:
>
>> Thanks for reporting - this is odd - the dropdown did not exist in other
>> recent releases.
>>
>> https://spark.apache.org/docs/3.5.0/api/python/index.html
>> https://spark.apache.org/docs/3.4.2/api/python/index.html
>> https://spark.apache.org/docs/3.3.4/api/python/index.html
>>
>> Looks like the dropdown feature was recently introduced but partially
>> done. The addition of a dropdown was done, but the way how to bump the
>> version was missed to be documented.
>> The contributor proposed the way to update the version "automatically",
>> but the PR wasn't merged. As a result, we are neither having the
>> instruction how to bump the version manually, nor having the automatic bump.
>>
>> * PR for addition of dropdown: https://github.com/apache/spark/pull/42428
>> * PR for automatically bumping version:
>> https://github.com/apache/spark/pull/42881
>>
>> We will probably need to add an instruction in the release process to
>> update the version. (For automatic bumping I don't have a good idea.)
>> I'll look into it. Please expect some delay during the holiday weekend
>> in S. Korea.
>>
>> Thanks again.
>> Jungtaek Lim (HeartSaVioR)
>>
>>
>> On Fri, Mar 1, 2024 at 2:14 PM Dongjoon Hyun 
>> wrote:
>>
>>> BTW, Jungtaek.
>>>
>>> PySpark document seems to show a wrong branch. At this time, `master`.
>>>
>>> https://spark.apache.org/docs/3.5.1/api/python/index.html
>>>
>>> PySpark Overview
>>> <https://spark.apache.org/docs/3.5.1/api/python/index.html#pyspark-overview>
>>>
>>>Date: Feb 24, 2024 Version: master
>>>
>>> [image: Screenshot 2024-02-29 at 21.12.24.png]
>>>
>>>
>>> Could you do the follow-up, please?
>>>
>>> Thank you in advance.
>>>
>>> Dongjoon.
>>>
>>>
>>> On Thu, Feb 29, 2024 at 2:48 PM John Zhuge  wrote:
>>>
>>>> Excellent work, congratulations!
>>>>
>>>> On Wed, Feb 28, 2024 at 10:12 PM Dongjoon Hyun 
>>>> wrote:
>>>>
>>>>> Congratulations!
>>>>>
>>>>> Bests,
>>>>> Dongjoon.
>>>>>
>>>>> On Wed, Feb 28, 2024 at 11:43 AM beliefer  wrote:
>>>>>
>>>>>> Congratulations!
>>>>>>
>>>>>>
>>>>>>
>>>>>> At 2024-02-28 17:43:25, "Jungtaek Lim" 
>>>>>> wrote:
>>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> We are happy to announce the availability of Spark 3.5.1!
>>>>>>
>>>>>> Spark 3.5.1 is a maintenance release containing stability fixes. This
>>>>>> release is based on the branch-3.5 maintenance branch of Spark. We
>>>>>> strongly
>>>>>> recommend all 3.5 users to upgrade to this stable release.
>>>>>>
>>>>>> To download Spark 3.5.1, head over to the download page:
>>>>>> https://spark.apache.org/downloads.html
>>>>>>
>>>>>> To view the release notes:
>>>>>> https://spark.apache.org/releases/spark-release-3-5-1.html
>>>>>>
>>>>>> We would like to acknowledge all community members for contributing
>>>>>> to this
>>>>>> release. This release would not have been possible without you.
>>>>>>
>>>>>> Jungtaek Lim
>>>>>>
>>>>>> ps. Yikun is helping us through releasing the official docker image
>>>>>> for Spark 3.5.1 (Thanks Yikun!) It may take some time to be generally
>>>>>> available.
>>>>>>
>>>>>>
>>>>
>>>> --
>>>> John Zhuge
>>>>
>>>


Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-03 Thread Jungtaek Lim
Shall we revisit this functionality? The API doc is built with individual
versions, and for each individual version we depend on other released
versions. This does not seem to be right to me. Also, the functionality is
only in PySpark API doc which does not seem to be consistent as well.

I don't think this is manageable with the current approach (listing
versions in version-dependent doc). Let's say we release 3.4.3 after 3.5.1.
Should we update the versions in 3.5.1 to add 3.4.3 in version switcher?
How about the time we are going to release the new version after releasing
10 versions? What's the criteria of pruning the version?

Unless we have a good answer to these questions, I think it's better to
revert the functionality - it missed various considerations.

On Fri, Mar 1, 2024 at 2:44 PM Jungtaek Lim 
wrote:

> Thanks for reporting - this is odd - the dropdown did not exist in other
> recent releases.
>
> https://spark.apache.org/docs/3.5.0/api/python/index.html
> https://spark.apache.org/docs/3.4.2/api/python/index.html
> https://spark.apache.org/docs/3.3.4/api/python/index.html
>
> Looks like the dropdown feature was recently introduced but partially
> done. The addition of a dropdown was done, but the way how to bump the
> version was missed to be documented.
> The contributor proposed the way to update the version "automatically",
> but the PR wasn't merged. As a result, we are neither having the
> instruction how to bump the version manually, nor having the automatic bump.
>
> * PR for addition of dropdown: https://github.com/apache/spark/pull/42428
> * PR for automatically bumping version:
> https://github.com/apache/spark/pull/42881
>
> We will probably need to add an instruction in the release process to
> update the version. (For automatic bumping I don't have a good idea.)
> I'll look into it. Please expect some delay during the holiday weekend
> in S. Korea.
>
> Thanks again.
> Jungtaek Lim (HeartSaVioR)
>
>
> On Fri, Mar 1, 2024 at 2:14 PM Dongjoon Hyun 
> wrote:
>
>> BTW, Jungtaek.
>>
>> PySpark document seems to show a wrong branch. At this time, `master`.
>>
>> https://spark.apache.org/docs/3.5.1/api/python/index.html
>>
>> PySpark Overview
>> <https://spark.apache.org/docs/3.5.1/api/python/index.html#pyspark-overview>
>>
>>Date: Feb 24, 2024 Version: master
>>
>> [image: Screenshot 2024-02-29 at 21.12.24.png]
>>
>>
>> Could you do the follow-up, please?
>>
>> Thank you in advance.
>>
>> Dongjoon.
>>
>>
>> On Thu, Feb 29, 2024 at 2:48 PM John Zhuge  wrote:
>>
>>> Excellent work, congratulations!
>>>
>>> On Wed, Feb 28, 2024 at 10:12 PM Dongjoon Hyun 
>>> wrote:
>>>
>>>> Congratulations!
>>>>
>>>> Bests,
>>>> Dongjoon.
>>>>
>>>> On Wed, Feb 28, 2024 at 11:43 AM beliefer  wrote:
>>>>
>>>>> Congratulations!
>>>>>
>>>>>
>>>>>
>>>>> At 2024-02-28 17:43:25, "Jungtaek Lim" 
>>>>> wrote:
>>>>>
>>>>> Hi everyone,
>>>>>
>>>>> We are happy to announce the availability of Spark 3.5.1!
>>>>>
>>>>> Spark 3.5.1 is a maintenance release containing stability fixes. This
>>>>> release is based on the branch-3.5 maintenance branch of Spark. We
>>>>> strongly
>>>>> recommend all 3.5 users to upgrade to this stable release.
>>>>>
>>>>> To download Spark 3.5.1, head over to the download page:
>>>>> https://spark.apache.org/downloads.html
>>>>>
>>>>> To view the release notes:
>>>>> https://spark.apache.org/releases/spark-release-3-5-1.html
>>>>>
>>>>> We would like to acknowledge all community members for contributing to
>>>>> this
>>>>> release. This release would not have been possible without you.
>>>>>
>>>>> Jungtaek Lim
>>>>>
>>>>> ps. Yikun is helping us through releasing the official docker image
>>>>> for Spark 3.5.1 (Thanks Yikun!) It may take some time to be generally
>>>>> available.
>>>>>
>>>>>
>>>
>>> --
>>> John Zhuge
>>>
>>


Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-02-29 Thread Peter Toth
Congratulations and thanks Jungtaek for driving this!

Xinrong Meng  ezt írta (időpont: 2024. márc. 1.,
P, 5:24):

> Congratulations!
>
> Thanks,
> Xinrong
>
> On Thu, Feb 29, 2024 at 11:16 AM Dongjoon Hyun 
> wrote:
>
>> Congratulations!
>>
>> Bests,
>> Dongjoon.
>>
>> On Wed, Feb 28, 2024 at 11:43 AM beliefer  wrote:
>>
>>> Congratulations!
>>>
>>>
>>>
>>> At 2024-02-28 17:43:25, "Jungtaek Lim" 
>>> wrote:
>>>
>>> Hi everyone,
>>>
>>> We are happy to announce the availability of Spark 3.5.1!
>>>
>>> Spark 3.5.1 is a maintenance release containing stability fixes. This
>>> release is based on the branch-3.5 maintenance branch of Spark. We
>>> strongly
>>> recommend all 3.5 users to upgrade to this stable release.
>>>
>>> To download Spark 3.5.1, head over to the download page:
>>> https://spark.apache.org/downloads.html
>>>
>>> To view the release notes:
>>> https://spark.apache.org/releases/spark-release-3-5-1.html
>>>
>>> We would like to acknowledge all community members for contributing to
>>> this
>>> release. This release would not have been possible without you.
>>>
>>> Jungtaek Lim
>>>
>>> ps. Yikun is helping us through releasing the official docker image for
>>> Spark 3.5.1 (Thanks Yikun!) It may take some time to be generally available.
>>>
>>>


Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-02-29 Thread Jungtaek Lim
Thanks for reporting - this is odd - the dropdown did not exist in other
recent releases.

https://spark.apache.org/docs/3.5.0/api/python/index.html
https://spark.apache.org/docs/3.4.2/api/python/index.html
https://spark.apache.org/docs/3.3.4/api/python/index.html

Looks like the dropdown feature was recently introduced but partially done.
The addition of a dropdown was done, but the way how to bump the version
was missed to be documented.
The contributor proposed the way to update the version "automatically", but
the PR wasn't merged. As a result, we are neither having the instruction
how to bump the version manually, nor having the automatic bump.

* PR for addition of dropdown: https://github.com/apache/spark/pull/42428
* PR for automatically bumping version:
https://github.com/apache/spark/pull/42881

We will probably need to add an instruction in the release process to
update the version. (For automatic bumping I don't have a good idea.)
I'll look into it. Please expect some delay during the holiday weekend
in S. Korea.

Thanks again.
Jungtaek Lim (HeartSaVioR)


On Fri, Mar 1, 2024 at 2:14 PM Dongjoon Hyun 
wrote:

> BTW, Jungtaek.
>
> PySpark document seems to show a wrong branch. At this time, `master`.
>
> https://spark.apache.org/docs/3.5.1/api/python/index.html
>
> PySpark Overview
> <https://spark.apache.org/docs/3.5.1/api/python/index.html#pyspark-overview>
>
>Date: Feb 24, 2024 Version: master
>
> [image: Screenshot 2024-02-29 at 21.12.24.png]
>
>
> Could you do the follow-up, please?
>
> Thank you in advance.
>
> Dongjoon.
>
>
> On Thu, Feb 29, 2024 at 2:48 PM John Zhuge  wrote:
>
>> Excellent work, congratulations!
>>
>> On Wed, Feb 28, 2024 at 10:12 PM Dongjoon Hyun 
>> wrote:
>>
>>> Congratulations!
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>> On Wed, Feb 28, 2024 at 11:43 AM beliefer  wrote:
>>>
>>>> Congratulations!
>>>>
>>>>
>>>>
>>>> At 2024-02-28 17:43:25, "Jungtaek Lim" 
>>>> wrote:
>>>>
>>>> Hi everyone,
>>>>
>>>> We are happy to announce the availability of Spark 3.5.1!
>>>>
>>>> Spark 3.5.1 is a maintenance release containing stability fixes. This
>>>> release is based on the branch-3.5 maintenance branch of Spark. We
>>>> strongly
>>>> recommend all 3.5 users to upgrade to this stable release.
>>>>
>>>> To download Spark 3.5.1, head over to the download page:
>>>> https://spark.apache.org/downloads.html
>>>>
>>>> To view the release notes:
>>>> https://spark.apache.org/releases/spark-release-3-5-1.html
>>>>
>>>> We would like to acknowledge all community members for contributing to
>>>> this
>>>> release. This release would not have been possible without you.
>>>>
>>>> Jungtaek Lim
>>>>
>>>> ps. Yikun is helping us through releasing the official docker image for
>>>> Spark 3.5.1 (Thanks Yikun!) It may take some time to be generally 
>>>> available.
>>>>
>>>>
>>
>> --
>> John Zhuge
>>
>


Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-02-29 Thread Dongjoon Hyun
BTW, Jungtaek.

PySpark document seems to show a wrong branch. At this time, `master`.

https://spark.apache.org/docs/3.5.1/api/python/index.html

PySpark Overview


   Date: Feb 24, 2024 Version: master

[image: Screenshot 2024-02-29 at 21.12.24.png]


Could you do the follow-up, please?

Thank you in advance.

Dongjoon.


On Thu, Feb 29, 2024 at 2:48 PM John Zhuge  wrote:

> Excellent work, congratulations!
>
> On Wed, Feb 28, 2024 at 10:12 PM Dongjoon Hyun 
> wrote:
>
>> Congratulations!
>>
>> Bests,
>> Dongjoon.
>>
>> On Wed, Feb 28, 2024 at 11:43 AM beliefer  wrote:
>>
>>> Congratulations!
>>>
>>>
>>>
>>> At 2024-02-28 17:43:25, "Jungtaek Lim" 
>>> wrote:
>>>
>>> Hi everyone,
>>>
>>> We are happy to announce the availability of Spark 3.5.1!
>>>
>>> Spark 3.5.1 is a maintenance release containing stability fixes. This
>>> release is based on the branch-3.5 maintenance branch of Spark. We
>>> strongly
>>> recommend all 3.5 users to upgrade to this stable release.
>>>
>>> To download Spark 3.5.1, head over to the download page:
>>> https://spark.apache.org/downloads.html
>>>
>>> To view the release notes:
>>> https://spark.apache.org/releases/spark-release-3-5-1.html
>>>
>>> We would like to acknowledge all community members for contributing to
>>> this
>>> release. This release would not have been possible without you.
>>>
>>> Jungtaek Lim
>>>
>>> ps. Yikun is helping us through releasing the official docker image for
>>> Spark 3.5.1 (Thanks Yikun!) It may take some time to be generally available.
>>>
>>>
>
> --
> John Zhuge
>


Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-02-29 Thread John Zhuge
Excellent work, congratulations!

On Wed, Feb 28, 2024 at 10:12 PM Dongjoon Hyun 
wrote:

> Congratulations!
>
> Bests,
> Dongjoon.
>
> On Wed, Feb 28, 2024 at 11:43 AM beliefer  wrote:
>
>> Congratulations!
>>
>>
>>
>> At 2024-02-28 17:43:25, "Jungtaek Lim" 
>> wrote:
>>
>> Hi everyone,
>>
>> We are happy to announce the availability of Spark 3.5.1!
>>
>> Spark 3.5.1 is a maintenance release containing stability fixes. This
>> release is based on the branch-3.5 maintenance branch of Spark. We
>> strongly
>> recommend all 3.5 users to upgrade to this stable release.
>>
>> To download Spark 3.5.1, head over to the download page:
>> https://spark.apache.org/downloads.html
>>
>> To view the release notes:
>> https://spark.apache.org/releases/spark-release-3-5-1.html
>>
>> We would like to acknowledge all community members for contributing to
>> this
>> release. This release would not have been possible without you.
>>
>> Jungtaek Lim
>>
>> ps. Yikun is helping us through releasing the official docker image for
>> Spark 3.5.1 (Thanks Yikun!) It may take some time to be generally available.
>>
>>

-- 
John Zhuge


Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-02-29 Thread Prem Sahoo
Congratulations 👍

Sent from my iPhone

On Feb 29, 2024, at 4:54 PM, Xinrong Meng  wrote:

Congratulations!

Thanks,
Xinrong

On Thu, Feb 29, 2024 at 11:16 AM Dongjoon Hyun  wrote:

Congratulations!

Bests,
Dongjoon.

On Wed, Feb 28, 2024 at 11:43 AM beliefer  wrote:

Congratulations!

At 2024-02-28 17:43:25, "Jungtaek Lim"  wrote:

Hi everyone,

We are happy to announce the availability of Spark 3.5.1!

Spark 3.5.1 is a maintenance release containing stability fixes. This
release is based on the branch-3.5 maintenance branch of Spark. We strongly
recommend all 3.5 users to upgrade to this stable release.

To download Spark 3.5.1, head over to the download page:
https://spark.apache.org/downloads.html

To view the release notes:
https://spark.apache.org/releases/spark-release-3-5-1.html

We would like to acknowledge all community members for contributing to this
release. This release would not have been possible without you.

Jungtaek Lim

ps. Yikun is helping us through releasing the official docker image for
Spark 3.5.1 (Thanks Yikun!) It may take some time to be generally available.




Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-02-29 Thread Xinrong Meng
Congratulations!

Thanks,
Xinrong

On Thu, Feb 29, 2024 at 11:16 AM Dongjoon Hyun 
wrote:

> Congratulations!
>
> Bests,
> Dongjoon.
>
> On Wed, Feb 28, 2024 at 11:43 AM beliefer  wrote:
>
>> Congratulations!
>>
>>
>>
>> At 2024-02-28 17:43:25, "Jungtaek Lim" 
>> wrote:
>>
>> Hi everyone,
>>
>> We are happy to announce the availability of Spark 3.5.1!
>>
>> Spark 3.5.1 is a maintenance release containing stability fixes. This
>> release is based on the branch-3.5 maintenance branch of Spark. We
>> strongly
>> recommend all 3.5 users to upgrade to this stable release.
>>
>> To download Spark 3.5.1, head over to the download page:
>> https://spark.apache.org/downloads.html
>>
>> To view the release notes:
>> https://spark.apache.org/releases/spark-release-3-5-1.html
>>
>> We would like to acknowledge all community members for contributing to
>> this
>> release. This release would not have been possible without you.
>>
>> Jungtaek Lim
>>
>> ps. Yikun is helping us through releasing the official docker image for
>> Spark 3.5.1 (Thanks Yikun!) It may take some time to be generally available.
>>
>>


Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-02-28 Thread Dongjoon Hyun
Congratulations!

Bests,
Dongjoon.

On Wed, Feb 28, 2024 at 11:43 AM beliefer  wrote:

> Congratulations!
>
>
>
> At 2024-02-28 17:43:25, "Jungtaek Lim" 
> wrote:
>
> Hi everyone,
>
> We are happy to announce the availability of Spark 3.5.1!
>
> Spark 3.5.1 is a maintenance release containing stability fixes. This
> release is based on the branch-3.5 maintenance branch of Spark. We strongly
> recommend all 3.5 users to upgrade to this stable release.
>
> To download Spark 3.5.1, head over to the download page:
> https://spark.apache.org/downloads.html
>
> To view the release notes:
> https://spark.apache.org/releases/spark-release-3-5-1.html
>
> We would like to acknowledge all community members for contributing to this
> release. This release would not have been possible without you.
>
> Jungtaek Lim
>
> ps. Yikun is helping us through releasing the official docker image for
> Spark 3.5.1 (Thanks Yikun!) It may take some time to be generally available.
>
>


Re:[ANNOUNCE] Apache Spark 3.5.1 released

2024-02-28 Thread beliefer
Congratulations!







At 2024-02-28 17:43:25, "Jungtaek Lim"  wrote:

Hi everyone,


We are happy to announce the availability of Spark 3.5.1!

Spark 3.5.1 is a maintenance release containing stability fixes. This
release is based on the branch-3.5 maintenance branch of Spark. We strongly
recommend all 3.5 users to upgrade to this stable release.

To download Spark 3.5.1, head over to the download page:
https://spark.apache.org/downloads.html

To view the release notes:
https://spark.apache.org/releases/spark-release-3-5-1.html

We would like to acknowledge all community members for contributing to this
release. This release would not have been possible without you.

Jungtaek Lim



ps. Yikun is helping us through releasing the official docker image for Spark 
3.5.1 (Thanks Yikun!) It may take some time to be generally available.



[ANNOUNCE] Apache Spark 3.5.1 released

2024-02-28 Thread Jungtaek Lim
Hi everyone,

We are happy to announce the availability of Spark 3.5.1!

Spark 3.5.1 is a maintenance release containing stability fixes. This
release is based on the branch-3.5 maintenance branch of Spark. We strongly
recommend all 3.5 users to upgrade to this stable release.

To download Spark 3.5.1, head over to the download page:
https://spark.apache.org/downloads.html

To view the release notes:
https://spark.apache.org/releases/spark-release-3-5-1.html

We would like to acknowledge all community members for contributing to this
release. This release would not have been possible without you.

Jungtaek Lim

ps. Yikun is helping us through releasing the official docker image for
Spark 3.5.1 (Thanks Yikun!) It may take some time to be generally available.


[apache-spark] documentation on File Metadata _metadata struct

2024-01-10 Thread Jason Horner
All, the only documentation about the File Metadata (hidden _metadata struct) I
can seem to find is on the Databricks website:
https://docs.databricks.com/en/ingestion/file-metadata-column.html#file-metadata-column

For reference, here is the struct:

_metadata: struct (nullable = false)
 |-- file_path: string (nullable = false)
 |-- file_name: string (nullable = false)
 |-- file_size: long (nullable = false)
 |-- file_block_start: long (nullable = false)
 |-- file_block_length: long (nullable = false)
 |-- file_modification_time: timestamp (nullable = false)

As far as I can tell this feature was released as part of Spark 3.2.0, based on
this Stack Overflow post:
https://stackoverflow.com/questions/62846669/can-i-get-metadata-of-files-reading-by-spark/77238087#77238087
Unfortunately I wasn't able to locate it in the release notes, though I may
have missed it somehow. So I have the following questions and am seeking
guidance from the list on how best to approach this:

1. Is the documentation "missing" from the Spark 3.2.0 site, or am I just
   unable to find it?
2. While it provides file_modification_time, there doesn't seem to be a
   corresponding file_creation_time. Would both of these be issues that should
   be opened in JIRA?

Both of these seem like simple and useful things to add, but they are above my
ability to submit PRs for without some guidance. I'm happy to help, especially
with a documentation PR, if someone can confirm and get me started in the right
direction. I don't really have the Java / Scala skills needed to implement the
feature. Thanks for any pointers.
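
For anyone who wants to inspect this column quickly, below is a minimal PySpark
sketch of selecting the hidden column described above. The input path is a
placeholder, and the exact set of fields available depends on the Spark version
and file source in use.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("file-metadata-demo").getOrCreate()

# The _metadata column is hidden: it only appears when selected explicitly.
# "/tmp/example_parquet" is a hypothetical path used for illustration.
df = spark.read.format("parquet").load("/tmp/example_parquet")

df.select(
    "_metadata.file_path",
    "_metadata.file_name",
    "_metadata.file_size",
    "_metadata.file_modification_time",
).show(truncate=False)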

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



[ANNOUNCE] Apache Spark 3.3.4 released

2023-12-16 Thread Dongjoon Hyun
We are happy to announce the availability of Apache Spark 3.3.4!

Spark 3.3.4 is the last maintenance release based on the
branch-3.3 maintenance branch of Spark. It contains many fixes
including security and correctness domains. We strongly
recommend all 3.3 users to upgrade to this or higher stable release.

To download Spark 3.3.4, head over to the download page:
https://spark.apache.org/downloads.html

To view the release notes:
https://spark.apache.org/releases/spark-release-3-3-4.html

We would like to acknowledge all community members for contributing to
this release. This release would not have been possible without you.

Dongjoon Hyun


Re: SSH Tunneling issue with Apache Spark

2023-12-06 Thread Venkatesan Muniappan
Thanks for the clarification. I will try to do plain jdbc connection on
Scala/Java and will update this thread on how it goes.

*Thanks,*
*Venkat*



On Thu, Dec 7, 2023 at 9:40 AM Nicholas Chammas 
wrote:

> PyMySQL has its own implementation
> <https://github.com/PyMySQL/PyMySQL/blob/f13f054abcc18b39855a760a84be0a517f0da658/pymysql/protocol.py>
>  of
> the MySQL client-server protocol. It does not use JDBC.
>
>
> On Dec 6, 2023, at 10:43 PM, Venkatesan Muniappan <
> venkatesa...@noonacademy.com> wrote:
>
> Thanks for the advice Nicholas.
>
> As mentioned in the original email, I have tried JDBC + SSH Tunnel using
> pymysql and sshtunnel and it worked fine. The problem happens only with
> Spark.
>
> *Thanks,*
> *Venkat*
>
>
>
> On Wed, Dec 6, 2023 at 10:21 PM Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> This is not a question for the dev list. Moving dev to bcc.
>>
>> One thing I would try is to connect to this database using JDBC + SSH
>> tunnel, but without Spark. That way you can focus on getting the JDBC
>> connection to work without Spark complicating the picture for you.
>>
>>
>> On Dec 5, 2023, at 8:12 PM, Venkatesan Muniappan <
>> venkatesa...@noonacademy.com> wrote:
>>
>> Hi Team,
>>
>> I am facing an issue with SSH Tunneling in Apache Spark. The behavior is
>> same as the one in this Stackoverflow question
>> <https://stackoverflow.com/questions/68278369/how-to-use-pyspark-to-read-a-mysql-database-using-a-ssh-tunnel>
>> but there are no answers there.
>>
>> This is what I am trying:
>>
>>
>> with SSHTunnelForwarder(
>> (ssh_host, ssh_port),
>> ssh_username=ssh_user,
>> ssh_pkey=ssh_key_file,
>> remote_bind_address=(sql_hostname, sql_port),
>> local_bind_address=(local_host_ip_address, sql_port)) as tunnel:
>> tunnel.local_bind_port
>> b1_semester_df = spark.read \
>> .format("jdbc") \
>> .option("url", b2b_mysql_url.replace("<>", 
>> str(tunnel.local_bind_port)))
>> \
>> .option("query", b1_semester_sql) \
>> .option("database", 'b2b') \
>> .option("password", b2b_mysql_password) \
>> .option("driver", "com.mysql.cj.jdbc.Driver") \
>> .load()
>> b1_semester_df.count()
>>
>> Here, the b1_semester_df is loaded but when I try count on the same Df it
>> fails saying this
>>
>> 23/12/05 11:49:17 ERROR TaskSetManager: Task 0 in stage 2.0 failed 4
>> times; aborting job
>> Traceback (most recent call last):
>>   File "", line 1, in 
>>   File "/usr/lib/spark/python/pyspark/sql/dataframe.py", line 382, in
>> show
>> print(self._jdf.showString(n, 20, vertical))
>>   File
>> "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line
>> 1257, in __call__
>>   File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
>> return f(*a, **kw)
>>   File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py",
>> line 328, in get_return_value
>> py4j.protocol.Py4JJavaError: An error occurred while calling
>> o284.showString.
>> : org.apache.spark.SparkException: Job aborted due to stage failure: Task
>> 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage
>> 2.0 (TID 11, ip-172-32-108-1.eu-central-1.compute.internal, executor 3):
>> com.mysql.cj.jdbc.exceptions.CommunicationsException: Communications link
>> failure
>>
>> However, the same is working fine with pandas df. I have tried this below
>> and it worked.
>>
>>
>> with SSHTunnelForwarder(
>> (ssh_host, ssh_port),
>> ssh_username=ssh_user,
>> ssh_pkey=ssh_key_file,
>> remote_bind_address=(sql_hostname, sql_port)) as tunnel:
>> conn = pymysql.connect(host=local_host_ip_address, user=sql_username,
>> passwd=sql_password, db=sql_main_database,
>> port=tunnel.local_bind_port)
>> df = pd.read_sql_query(b1_semester_sql, conn)
>> spark.createDataFrame(df).createOrReplaceTempView("b1_semester")
>>
>> So wanted to check what I am missing with my Spark usage. Please help.
>>
>> *Thanks,*
>> *Venkat*
>>
>>
>>
>


Re: SSH Tunneling issue with Apache Spark

2023-12-06 Thread Nicholas Chammas
PyMySQL has its own implementation 
<https://github.com/PyMySQL/PyMySQL/blob/f13f054abcc18b39855a760a84be0a517f0da658/pymysql/protocol.py>
 of the MySQL client-server protocol. It does not use JDBC.


> On Dec 6, 2023, at 10:43 PM, Venkatesan Muniappan 
>  wrote:
> 
> Thanks for the advice Nicholas. 
> 
> As mentioned in the original email, I have tried JDBC + SSH Tunnel using 
> pymysql and sshtunnel and it worked fine. The problem happens only with Spark.
> 
> Thanks,
> Venkat
> 
> 
> 
> On Wed, Dec 6, 2023 at 10:21 PM Nicholas Chammas  <mailto:nicholas.cham...@gmail.com>> wrote:
>> This is not a question for the dev list. Moving dev to bcc.
>> 
>> One thing I would try is to connect to this database using JDBC + SSH 
>> tunnel, but without Spark. That way you can focus on getting the JDBC 
>> connection to work without Spark complicating the picture for you.
>> 
>> 
>>> On Dec 5, 2023, at 8:12 PM, Venkatesan Muniappan 
>>> mailto:venkatesa...@noonacademy.com>> wrote:
>>> 
>>> Hi Team,
>>> 
>>> I am facing an issue with SSH Tunneling in Apache Spark. The behavior is 
>>> same as the one in this Stackoverflow question 
>>> <https://stackoverflow.com/questions/68278369/how-to-use-pyspark-to-read-a-mysql-database-using-a-ssh-tunnel>
>>>  but there are no answers there.
>>> 
>>> This is what I am trying:
>>> 
>>> 
>>> with SSHTunnelForwarder(
>>> (ssh_host, ssh_port),
>>> ssh_username=ssh_user,
>>> ssh_pkey=ssh_key_file,
>>> remote_bind_address=(sql_hostname, sql_port),
>>> local_bind_address=(local_host_ip_address, sql_port)) as tunnel:
>>> tunnel.local_bind_port
>>> b1_semester_df = spark.read \
>>> .format("jdbc") \
>>> .option("url", b2b_mysql_url.replace("<>", 
>>> str(tunnel.local_bind_port))) \
>>> .option("query", b1_semester_sql) \
>>> .option("database", 'b2b') \
>>> .option("password", b2b_mysql_password) \
>>> .option("driver", "com.mysql.cj.jdbc.Driver") \
>>> .load()
>>> b1_semester_df.count()
>>> 
>>> Here, the b1_semester_df is loaded but when I try count on the same Df it 
>>> fails saying this
>>> 
>>> 23/12/05 11:49:17 ERROR TaskSetManager: Task 0 in stage 2.0 failed 4 times; 
>>> aborting job
>>> Traceback (most recent call last):
>>>   File "", line 1, in 
>>>   File "/usr/lib/spark/python/pyspark/sql/dataframe.py", line 382, in show
>>> print(self._jdf.showString(n, 20, vertical))
>>>   File 
>>> "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 
>>> 1257, in __call__
>>>   File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
>>> return f(*a, **kw)
>>>   File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", 
>>> line 328, in get_return_value
>>> py4j.protocol.Py4JJavaError: An error occurred while calling 
>>> o284.showString.
>>> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
>>> in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 
>>> 2.0 (TID 11, ip-172-32-108-1.eu-central-1.compute.internal, executor 3): 
>>> com.mysql.cj.jdbc.exceptions.CommunicationsException: Communications link 
>>> failure
>>> 
>>> However, the same is working fine with pandas df. I have tried this below 
>>> and it worked.
>>> 
>>> 
>>> with SSHTunnelForwarder(
>>> (ssh_host, ssh_port),
>>> ssh_username=ssh_user,
>>> ssh_pkey=ssh_key_file,
>>> remote_bind_address=(sql_hostname, sql_port)) as tunnel:
>>> conn = pymysql.connect(host=local_host_ip_address, user=sql_username,
>>>passwd=sql_password, db=sql_main_database,
>>>port=tunnel.local_bind_port)
>>> df = pd.read_sql_query(b1_semester_sql, conn)
>>> spark.createDataFrame(df).createOrReplaceTempView("b1_semester")
>>> 
>>> So wanted to check what I am missing with my Spark usage. Please help.
>>> 
>>> Thanks,
>>> Venkat
>>> 
>> 



Re: SSH Tunneling issue with Apache Spark

2023-12-06 Thread Venkatesan Muniappan
Thanks for the advice Nicholas.

As mentioned in the original email, I have tried JDBC + SSH Tunnel using
pymysql and sshtunnel and it worked fine. The problem happens only with
Spark.

*Thanks,*
*Venkat*



On Wed, Dec 6, 2023 at 10:21 PM Nicholas Chammas 
wrote:

> This is not a question for the dev list. Moving dev to bcc.
>
> One thing I would try is to connect to this database using JDBC + SSH
> tunnel, but without Spark. That way you can focus on getting the JDBC
> connection to work without Spark complicating the picture for you.
>
>
> On Dec 5, 2023, at 8:12 PM, Venkatesan Muniappan <
> venkatesa...@noonacademy.com> wrote:
>
> Hi Team,
>
> I am facing an issue with SSH Tunneling in Apache Spark. The behavior is
> same as the one in this Stackoverflow question
> <https://stackoverflow.com/questions/68278369/how-to-use-pyspark-to-read-a-mysql-database-using-a-ssh-tunnel>
> but there are no answers there.
>
> This is what I am trying:
>
>
> with SSHTunnelForwarder(
> (ssh_host, ssh_port),
> ssh_username=ssh_user,
> ssh_pkey=ssh_key_file,
> remote_bind_address=(sql_hostname, sql_port),
> local_bind_address=(local_host_ip_address, sql_port)) as tunnel:
> tunnel.local_bind_port
> b1_semester_df = spark.read \
> .format("jdbc") \
> .option("url", b2b_mysql_url.replace("<>", 
> str(tunnel.local_bind_port)))
> \
> .option("query", b1_semester_sql) \
> .option("database", 'b2b') \
> .option("password", b2b_mysql_password) \
> .option("driver", "com.mysql.cj.jdbc.Driver") \
> .load()
> b1_semester_df.count()
>
> Here, the b1_semester_df is loaded but when I try count on the same Df it
> fails saying this
>
> 23/12/05 11:49:17 ERROR TaskSetManager: Task 0 in stage 2.0 failed 4
> times; aborting job
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/usr/lib/spark/python/pyspark/sql/dataframe.py", line 382, in show
> print(self._jdf.showString(n, 20, vertical))
>   File
> "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line
> 1257, in __call__
>   File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
> return f(*a, **kw)
>   File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py",
> line 328, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling
> o284.showString.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task
> 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage
> 2.0 (TID 11, ip-172-32-108-1.eu-central-1.compute.internal, executor 3):
> com.mysql.cj.jdbc.exceptions.CommunicationsException: Communications link
> failure
>
> However, the same is working fine with pandas df. I have tried this below
> and it worked.
>
>
> with SSHTunnelForwarder(
> (ssh_host, ssh_port),
> ssh_username=ssh_user,
> ssh_pkey=ssh_key_file,
> remote_bind_address=(sql_hostname, sql_port)) as tunnel:
> conn = pymysql.connect(host=local_host_ip_address, user=sql_username,
> passwd=sql_password, db=sql_main_database,
> port=tunnel.local_bind_port)
> df = pd.read_sql_query(b1_semester_sql, conn)
> spark.createDataFrame(df).createOrReplaceTempView("b1_semester")
>
> So wanted to check what I am missing with my Spark usage. Please help.
>
> *Thanks,*
> *Venkat*
>
>
>


SSH Tunneling issue with Apache Spark

2023-12-05 Thread Venkatesan Muniappan
Hi Team,

I am facing an issue with SSH Tunneling in Apache Spark. The behavior is
same as the one in this Stackoverflow question
<https://stackoverflow.com/questions/68278369/how-to-use-pyspark-to-read-a-mysql-database-using-a-ssh-tunnel>
but there are no answers there.

This is what I am trying:


with SSHTunnelForwarder(
        (ssh_host, ssh_port),
        ssh_username=ssh_user,
        ssh_pkey=ssh_key_file,
        remote_bind_address=(sql_hostname, sql_port),
        local_bind_address=(local_host_ip_address, sql_port)) as tunnel:
    tunnel.local_bind_port
    b1_semester_df = spark.read \
        .format("jdbc") \
        .option("url", b2b_mysql_url.replace("<>", str(tunnel.local_bind_port))) \
        .option("query", b1_semester_sql) \
        .option("database", 'b2b') \
        .option("password", b2b_mysql_password) \
        .option("driver", "com.mysql.cj.jdbc.Driver") \
        .load()
    b1_semester_df.count()

Here, the b1_semester_df is loaded but when I try count on the same Df it
fails saying this

23/12/05 11:49:17 ERROR TaskSetManager: Task 0 in stage 2.0 failed 4 times;
aborting job
Traceback (most recent call last):
  File "", line 1, in 
  File "/usr/lib/spark/python/pyspark/sql/dataframe.py", line 382, in show
    print(self._jdf.showString(n, 20, vertical))
  File
"/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line
1257, in __call__
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py",
line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling
o284.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage
2.0 (TID 11, ip-172-32-108-1.eu-central-1.compute.internal, executor 3):
com.mysql.cj.jdbc.exceptions.CommunicationsException: Communications link
failure

However, the same is working fine with pandas df. I have tried this below
and it worked.


with SSHTunnelForwarder(
        (ssh_host, ssh_port),
        ssh_username=ssh_user,
        ssh_pkey=ssh_key_file,
        remote_bind_address=(sql_hostname, sql_port)) as tunnel:
    conn = pymysql.connect(host=local_host_ip_address, user=sql_username,
                           passwd=sql_password, db=sql_main_database,
                           port=tunnel.local_bind_port)
    df = pd.read_sql_query(b1_semester_sql, conn)
    spark.createDataFrame(df).createOrReplaceTempView("b1_semester")

So wanted to check what I am missing with my Spark usage. Please help.

*Thanks,*
*Venkat*
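
The thread does not reach a conclusion, but one hypothesis worth checking is
that the JDBC read runs on the executors, which cannot reach a tunnel bound
only to the driver machine's loopback interface, whereas the pandas version
works because pymysql connects entirely from the driver. The sketch below is
offered only under that assumption; it is not a confirmed fix, and every
hostname, path, and credential in it is a placeholder.

from pyspark.sql import SparkSession
from sshtunnel import SSHTunnelForwarder

spark = SparkSession.builder.appName("jdbc-over-ssh-sketch").getOrCreate()

# Hypothetical value: an address of the driver machine that the executors can
# resolve and reach over the network.
driver_host = "driver-node.internal"

with SSHTunnelForwarder(
        ("bastion.example.com", 22),              # placeholder SSH host
        ssh_username="ssh_user",
        ssh_pkey="/path/to/key.pem",
        remote_bind_address=("mysql.internal", 3306),
        # Bind on all interfaces so connections from other hosts are accepted.
        local_bind_address=("0.0.0.0", 3306)) as tunnel:
    df = (spark.read.format("jdbc")
          .option("url", f"jdbc:mysql://{driver_host}:{tunnel.local_bind_port}/b2b")
          .option("query", "SELECT 1 AS ok")
          .option("user", "sql_user")
          .option("password", "sql_password")
          .option("driver", "com.mysql.cj.jdbc.Driver")
          .load())
    # Run the action while the tunnel is still open; executors connect to
    # driver_host, which forwards through the SSH tunnel to MySQL.
    df.count()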


Re:[ANNOUNCE] Apache Spark 3.4.2 released

2023-11-30 Thread beliefer
Congratulations!







At 2023-12-01 01:23:55, "Dongjoon Hyun"  wrote:

We are happy to announce the availability of Apache Spark 3.4.2!

Spark 3.4.2 is a maintenance release containing many fixes including
security and correctness domains. This release is based on the
branch-3.4 maintenance branch of Spark. We strongly
recommend all 3.4 users to upgrade to this stable release.

To download Spark 3.4.2, head over to the download page:
https://spark.apache.org/downloads.html

To view the release notes:
https://spark.apache.org/releases/spark-release-3-4-2.html

We would like to acknowledge all community members for contributing to this
release. This release would not have been possible without you.

Dongjoon Hyun


[ANNOUNCE] Apache Spark 3.4.2 released

2023-11-30 Thread Dongjoon Hyun
We are happy to announce the availability of Apache Spark 3.4.2!

Spark 3.4.2 is a maintenance release containing many fixes including
security and correctness domains. This release is based on the
branch-3.4 maintenance branch of Spark. We strongly
recommend all 3.4 users to upgrade to this stable release.

To download Spark 3.4.2, head over to the download page:
https://spark.apache.org/downloads.html

To view the release notes:
https://spark.apache.org/releases/spark-release-3-4-2.html

We would like to acknowledge all community members for contributing to this
release. This release would not have been possible without you.

Dongjoon Hyun


Re: [ SPARK SQL ]: UPPER in WHERE condition is not working in Apache Spark 3.5.0 for Mysql ENUM Column

2023-11-07 Thread Suyash Ajmera
Any update on this?


On Fri, 13 Oct, 2023, 12:56 pm Suyash Ajmera, 
wrote:

> This issue is related to CharVarcharCodegenUtils readSidePadding method .
>
> Appending white spaces while reading ENUM data from mysql
>
> Causing issue in querying , writing the same data to Cassandra.
>
> On Thu, 12 Oct, 2023, 7:46 pm Suyash Ajmera, 
> wrote:
>
>> I have upgraded my spark job from spark 3.3.1 to spark 3.5.0, I am
>> querying to Mysql Database and applying
>>
>> `*UPPER(col) = UPPER(value)*` in the subsequent sql query. It is working
>> as expected in spark 3.3.1 , but not working with 3.5.0.
>>
>> Where Condition ::  `*UPPER(vn) = 'ERICSSON' AND (upper(st) = 'OPEN' OR
>> upper(st) = 'REOPEN' OR upper(st) = 'CLOSED')*`
>>
>> The *st *column is ENUM in the database and it is causing the issue.
>>
>> Below is the Physical Plan of *FILTER* phase :
>>
>> For 3.3.1 :
>>
>> +- Filter ((upper(vn#11) = ERICSSON) AND (((upper(st#42) = OPEN) OR
>> (upper(st#42) = REOPEN)) OR (upper(st#42) = CLOSED)))
>>
>> For 3.5.0 :
>>
>> +- Filter ((upper(vn#11) = ERICSSON) AND (((upper(staticinvoke(class
>> org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType,
>> readSidePadding, st#42, 13, true, false, true)) = OPEN) OR
>> (upper(staticinvoke(class
>> org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType,
>> readSidePadding, st#42, 13, true, false, true)) = REOPEN)) OR
>> (upper(staticinvoke(class
>> org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType,
>> readSidePadding, st#42, 13, true, false, true)) = CLOSED)))
>>
>> -
>>
>> I have debug it and found that Spark added a property in version 3.4.0 ,
>> i.e. **spark.sql.readSideCharPadding** which has default value **true**.
>>
>> Link to the JIRA : https://issues.apache.org/jira/browse/SPARK-40697
>>
>> Added a new method in Class **CharVarcharCodegenUtils**
>>
>> public static UTF8String readSidePadding(UTF8String inputStr, int limit) {
>> int numChars = inputStr.numChars();
>> if (numChars == limit) {
>>   return inputStr;
>> } else if (numChars < limit) {
>>   return inputStr.rpad(limit, SPACE);
>> } else {
>>   return inputStr;
>> }
>>   }
>>
>>
>> **This method is appending some whitespace padding to the ENUM values
>> while reading and causing the Issue.**
>>
>> ---
>>
>> When I am removing the UPPER function from the where condition the
>> **FILTER** Phase looks like this :
>>
>>  +- Filter (((staticinvoke(class
>> org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils,
>>  StringType, readSidePadding, st#42, 13, true, false, true) = OPEN
>> ) OR (staticinvoke(class
>> org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType,
>> readSidePadding, st#42, 13, true, false, true) = REOPEN   )) OR
>> (staticinvoke(class
>> org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType,
>> readSidePadding, st#42, 13, true, false, true) = CLOSED   ))
>>
>>
>> **You can see it has added some white space after the value and the query
>> runs fine giving the correct result.**
>>
>> But with the UPPER function I am not getting the data.
>>
>> --
>>
>> I have also tried to disable this Property *spark.sql.readSideCharPadding
>> = false* with following cases :
>>
>> 1. With Upper function in where clause :
>>It is not pushing the filters to Database and the *query works fine*.
>>
>>
>>   +- Filter (((upper(st#42) = OPEN) OR (upper(st#42) = REOPEN)) OR
>> (upper(st#42) = CLOSED))
>>
>> 2. But when I am removing the upper function
>>
>>  *It is pushing the filter to Mysql with the white spaces and I am not
>> getting the data. (THIS IS A CAUSING VERY BIG ISSUE)*
>>
>>   PushedFilters: [*IsNotNull(vn), *EqualTo(vn,ERICSSON),
>> *Or(Or(EqualTo(st,OPEN ),EqualTo(st,REOPEN
>> )),EqualTo(st,CLOSED   ))]
>>
>> I cannot move this filter to JDBC read query , also I can't remove this
>> UPPER function in the where clause.
>>
>>
>> 
>>
>> Also I found same data getting written to CASSANDRA with *PADDING .*
>>
>


Re: [ SPARK SQL ]: UPPER in WHERE condition is not working in Apache Spark 3.5.0 for Mysql ENUM Column

2023-10-13 Thread Suyash Ajmera
This issue is related to CharVarcharCodegenUtils readSidePadding method .

Appending white spaces while reading ENUM data from mysql

Causing issue in querying , writing the same data to Cassandra.

On Thu, 12 Oct, 2023, 7:46 pm Suyash Ajmera, 
wrote:

> I have upgraded my spark job from spark 3.3.1 to spark 3.5.0, I am
> querying to Mysql Database and applying
>
> `*UPPER(col) = UPPER(value)*` in the subsequent sql query. It is working
> as expected in spark 3.3.1 , but not working with 3.5.0.
>
> Where Condition ::  `*UPPER(vn) = 'ERICSSON' AND (upper(st) = 'OPEN' OR
> upper(st) = 'REOPEN' OR upper(st) = 'CLOSED')*`
>
> The *st *column is ENUM in the database and it is causing the issue.
>
> Below is the Physical Plan of *FILTER* phase :
>
> For 3.3.1 :
>
> +- Filter ((upper(vn#11) = ERICSSON) AND (((upper(st#42) = OPEN) OR
> (upper(st#42) = REOPEN)) OR (upper(st#42) = CLOSED)))
>
> For 3.5.0 :
>
> +- Filter ((upper(vn#11) = ERICSSON) AND (((upper(staticinvoke(class
> org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType,
> readSidePadding, st#42, 13, true, false, true)) = OPEN) OR
> (upper(staticinvoke(class
> org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType,
> readSidePadding, st#42, 13, true, false, true)) = REOPEN)) OR
> (upper(staticinvoke(class
> org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType,
> readSidePadding, st#42, 13, true, false, true)) = CLOSED)))
>
> -
>
> I have debug it and found that Spark added a property in version 3.4.0 ,
> i.e. **spark.sql.readSideCharPadding** which has default value **true**.
>
> Link to the JIRA : https://issues.apache.org/jira/browse/SPARK-40697
>
> Added a new method in Class **CharVarcharCodegenUtils**
>
> public static UTF8String readSidePadding(UTF8String inputStr, int limit) {
> int numChars = inputStr.numChars();
> if (numChars == limit) {
>   return inputStr;
> } else if (numChars < limit) {
>   return inputStr.rpad(limit, SPACE);
> } else {
>   return inputStr;
> }
>   }
>
>
> **This method is appending some whitespace padding to the ENUM values
> while reading and causing the Issue.**
>
> ---
>
> When I am removing the UPPER function from the where condition the
> **FILTER** Phase looks like this :
>
>  +- Filter (((staticinvoke(class
> org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils,
>  StringType, readSidePadding, st#42, 13, true, false, true) = OPEN
> ) OR (staticinvoke(class
> org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType,
> readSidePadding, st#42, 13, true, false, true) = REOPEN   )) OR
> (staticinvoke(class
> org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType,
> readSidePadding, st#42, 13, true, false, true) = CLOSED   ))
>
>
> **You can see it has added some white space after the value and the query
> runs fine giving the correct result.**
>
> But with the UPPER function I am not getting the data.
>
> --
>
> I have also tried to disable this Property *spark.sql.readSideCharPadding
> = false* with following cases :
>
> 1. With Upper function in where clause :
>It is not pushing the filters to Database and the *query works fine*.
>
>   +- Filter (((upper(st#42) = OPEN) OR (upper(st#42) = REOPEN)) OR
> (upper(st#42) = CLOSED))
>
> 2. But when I am removing the upper function
>
>  *It is pushing the filter to Mysql with the white spaces and I am not
> getting the data. (THIS IS A CAUSING VERY BIG ISSUE)*
>
>   PushedFilters: [*IsNotNull(vn), *EqualTo(vn,ERICSSON),
> *Or(Or(EqualTo(st,OPEN ),EqualTo(st,REOPEN
> )),EqualTo(st,CLOSED   ))]
>
> I cannot move this filter to JDBC read query , also I can't remove this
> UPPER function in the where clause.
>
>
> 
>
> Also I found same data getting written to CASSANDRA with *PADDING .*
>


[ SPARK SQL ]: UPPER in WHERE condition is not working in Apache Spark 3.5.0 for Mysql ENUM Column

2023-10-12 Thread Suyash Ajmera
I have upgraded my spark job from spark 3.3.1 to spark 3.5.0, I am querying
to Mysql Database and applying

`*UPPER(col) = UPPER(value)*` in the subsequent sql query. It is working as
expected in spark 3.3.1 , but not working with 3.5.0.

Where Condition ::  `*UPPER(vn) = 'ERICSSON' AND (upper(st) = 'OPEN' OR
upper(st) = 'REOPEN' OR upper(st) = 'CLOSED')*`

The *st *column is ENUM in the database and it is causing the issue.

Below is the Physical Plan of *FILTER* phase :

For 3.3.1 :

+- Filter ((upper(vn#11) = ERICSSON) AND (((upper(st#42) = OPEN) OR
(upper(st#42) = REOPEN)) OR (upper(st#42) = CLOSED)))

For 3.5.0 :

+- Filter ((upper(vn#11) = ERICSSON) AND (((upper(staticinvoke(class
org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType,
readSidePadding, st#42, 13, true, false, true)) = OPEN) OR
(upper(staticinvoke(class
org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType,
readSidePadding, st#42, 13, true, false, true)) = REOPEN)) OR
(upper(staticinvoke(class
org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType,
readSidePadding, st#42, 13, true, false, true)) = CLOSED)))

-

I have debug it and found that Spark added a property in version 3.4.0 ,
i.e. **spark.sql.readSideCharPadding** which has default value **true**.

Link to the JIRA : https://issues.apache.org/jira/browse/SPARK-40697

Added a new method in Class **CharVarcharCodegenUtils**

public static UTF8String readSidePadding(UTF8String inputStr, int limit) {
int numChars = inputStr.numChars();
if (numChars == limit) {
  return inputStr;
} else if (numChars < limit) {
  return inputStr.rpad(limit, SPACE);
} else {
  return inputStr;
}
  }


**This method is appending some whitespace padding to the ENUM values while
reading and causing the Issue.**

---

When I am removing the UPPER function from the where condition the
**FILTER** Phase looks like this :

 +- Filter (((staticinvoke(class
org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils,
 StringType, readSidePadding, st#42, 13, true, false, true) = OPEN
) OR (staticinvoke(class
org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType,
readSidePadding, st#42, 13, true, false, true) = REOPEN   )) OR
(staticinvoke(class
org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType,
readSidePadding, st#42, 13, true, false, true) = CLOSED   ))


**You can see it has added some white space after the value and the query
runs fine giving the correct result.**

But with the UPPER function I am not getting the data.

--

I have also tried to disable this Property *spark.sql.readSideCharPadding =
false* with following cases :

1. With Upper function in where clause :
   It is not pushing the filters to Database and the *query works fine*.

  +- Filter (((upper(st#42) = OPEN) OR (upper(st#42) = REOPEN)) OR
(upper(st#42) = CLOSED))

2. But when I am removing the upper function

 *It is pushing the filter to Mysql with the white spaces and I am not
getting the data. (THIS IS A CAUSING VERY BIG ISSUE)*

  PushedFilters: [*IsNotNull(vn), *EqualTo(vn,ERICSSON),
*Or(Or(EqualTo(st,OPEN ),EqualTo(st,REOPEN
)),EqualTo(st,CLOSED   ))]

I cannot move this filter to JDBC read query , also I can't remove this
UPPER function in the where clause.



Also I found same data getting written to CASSANDRA with *PADDING .*
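
For readers hitting the same padding behaviour, here is a small self-contained
sketch of one possible workaround: trimming the read-side padding in Spark
before the comparison. It only illustrates the idea on stand-in data built with
createDataFrame; it is not a confirmed fix for the JDBC pushdown behaviour
described above, and the column names are taken from the query in this thread.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("padding-demo").getOrCreate()

# Stand-in data: the "st" values carry the trailing spaces that read-side char
# padding adds; in the real job this DataFrame would come from the JDBC read.
df = spark.createDataFrame(
    [("ERICSSON", "OPEN         "), ("ERICSSON", "CLOSED       ")],
    ["vn", "st"],
)

# rtrim() removes the trailing padding before upper-casing, so the value
# matches the unpadded literals again.
filtered = df.filter(
    (F.upper(F.col("vn")) == "ERICSSON")
    & (F.upper(F.rtrim(F.col("st"))).isin("OPEN", "REOPEN", "CLOSED"))
)
filtered.show()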


APACHE Spark adoption/growth chart

2023-09-12 Thread Andrew Petersen
Hello Spark community

Can anyone direct me to a simple graph/chart that shows APACHE Spark
adoption, preferably one that includes recent years? Of less importance, a
similar Databricks plot?

An internet search gave me plots only up to 2015. I also searched
spark.apache.org and databricks.com, but found no plots.

Regards
-- 
Andrew Petersen, PhD
Advanced Computing, Office of Information Technology
2620 Hillsborough Street
datascience.oit.ncsu.edu


Re: Seeking Professional Advice on Career and Personal Growth in the Apache Spark Community

2023-09-07 Thread Mich Talebzadeh
Hi Varun,

With all that said, I forgot one worthy sentence.

"It doesn't really matter what background you come from or your wealth,
everything is possible. Use every negative source in your life as a
positive and you will never ever fail!"

Cheers

Mich Talebzadeh,
Distinguished Technologist, Solutions Architect & Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Wed, 6 Sept 2023 at 18:33, Mich Talebzadeh 
wrote:

> Hi Varun,
>
> In answer to your questions, these are my views. However, they are just
> views and cannot be taken as facts so to speak
>
>
>1.
>
>*Focus and Time Management:* I often struggle with maintaining focus
>and effectively managing my time. This leads to productivity issues and
>affects my ability to undertake and complete projects efficiently.
>
>
>- Set clear goals.
>   - Prioritize tasks.
>   - Create a to-do list.
>   - Avoid multitasking.
>   - Eliminate distractions.
>   - Take regular breaks.
>   - Go to the gym and try to rest your mind and refresh yourself .
>
>
>1.
>
>*Graduate Studies Dilemma:*
>
>
>- Your mileage varies and it all depends on what you are trying to
>   achieve. Graduate Studies will help you to think independently and out 
> of
>   the box. Will also lead you on "how to go about solving the problem". 
> So it
>   will give you that experience.
>
>
>1.
>
>*Long-Term Project Building:* I am interested in working on long-term
>projects, but I am uncertain about the right approach and how to stay
>committed throughout the project's lifecycle.
>
>
>- I assume you have a degree. That means that you had the discipline
>   to wake up in the morning, go to lectures and not to miss the lectures
>   (hopefully you did not!). In other words, it proves that you have 
> already
>   been through a structured discipline and you have the will to do it.
>
>
>1.
>
>*Overcoming Fear of Failure and Procrastination:* I often find myself
>in a constant fear mode of failure, which leads to abandoning pet projects
>shortly after starting them or procrastinating over initiating new ones.
>
>
>- Failure is natural and can and do happen. However, the important
>   point is that you learn from your failures. Just call them experience. 
> You
>   need to overcome fear of failure and embrace the challenges.
>
>
>1.
>
>*Risk Aversion:* With no inherited wealth or financial security, I am
>often apprehensive about taking risks, even when they may potentially lead
>to significant personal or professional growth.
>- Welcome to the club! In 2020
>   
> <https://equalitytrust.org.uk/scale-economic-inequality-uk#:~:text=In%202020%2C%20the%20ONS%20calculated,and%202013%2C%20reaching%209%25.>,
>   it was estimated that in the UK, the richest 10% of households hold 43% 
> of
>   all wealth. The poorest 50% by contrast own just 9%  Risk is part of 
> life.
>   When crossing the street, you are taking a calculated view of the cars
>   coming and going.In short, risk assessment is a fundamental aspect of 
> life!
>
> HTH
>
> Mich Talebzadeh,
> Distinguished Technologist, Solutions Architect & Engineer
> London
> United Kingdom
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Tue, 5 Sept 2023 at 22:17, Varun Shah 
> wrote:
>
>> Dear Apache Spark Community,
>>
>> I hope this email finds you well. I am writing to seek your valuable
>> insights and advice on some challenges I've been facing in my career and
>> personal development journey, particularly in the context of Apache Spark
>> and the broader big data eco

Re: Seeking Professional Advice on Career and Personal Growth in the Apache Spark Community

2023-09-06 Thread ashok34...@yahoo.com.INVALID
 Hello Mich,
Thanking you for providing these useful feedbacks and responses.
We appreciate your contribution to this community forum. I for myself find your 
posts insightful.
+1 for me
Best,
AK
On Wednesday, 6 September 2023 at 18:34:27 BST, Mich Talebzadeh 
 wrote:  
 
 Hi Varun,
In answer to your questions, these are my views. However, they are just views 
and cannot be taken as facts so to speak
   
   1. Focus and Time Management: I often struggle with maintaining focus and
      effectively managing my time. This leads to productivity issues and affects
      my ability to undertake and complete projects efficiently.

      - Set clear goals.
      - Prioritize tasks.
      - Create a to-do list.
      - Avoid multitasking.
      - Eliminate distractions.
      - Take regular breaks.
      - Go to the gym and try to rest your mind and refresh yourself.

   2. Graduate Studies Dilemma:

      - Your mileage varies and it all depends on what you are trying to achieve.
        Graduate studies will help you to think independently and out of the box.
        They will also guide you on "how to go about solving the problem", so it
        will give you that experience.

   3. Long-Term Project Building: I am interested in working on long-term projects,
      but I am uncertain about the right approach and how to stay committed
      throughout the project's lifecycle.

      - I assume you have a degree. That means that you had the discipline to wake
        up in the morning, go to lectures and not miss them (hopefully you did
        not!). In other words, it proves that you have already been through a
        structured discipline and you have the will to do it.

   4. Overcoming Fear of Failure and Procrastination: I often find myself in a
      constant fear mode of failure, which leads to abandoning pet projects shortly
      after starting them or procrastinating over initiating new ones.

      - Failure is natural and can and does happen. However, the important point is
        that you learn from your failures. Just call them experience. You need to
        overcome the fear of failure and embrace the challenges.

   5. Risk Aversion: With no inherited wealth or financial security, I am often
      apprehensive about taking risks, even when they may potentially lead to
      significant personal or professional growth.

      - Welcome to the club! In 2020, it was estimated that in the UK, the richest
        10% of households hold 43% of all wealth. The poorest 50% by contrast own
        just 9%. Risk is part of life. When crossing the street, you are taking a
        calculated view of the cars coming and going. In short, risk assessment is
        a fundamental aspect of life!
HTH
Mich Talebzadeh,
Distinguished Technologist, Solutions Architect & Engineer
London
United Kingdom



   view my Linkedin profile




 https://en.everybodywiki.com/Mich_Talebzadeh

 

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from relying 
on this email's technical content is explicitly disclaimed. The author will in 
no case be liable for any monetary damages arising from such loss, damage or 
destruction. 

 


On Tue, 5 Sept 2023 at 22:17, Varun Shah  wrote:


Dear Apache Spark Community,

I hope this email finds you well. I am writing to seek your valuable insights 
and advice on some challenges I've been facing in my career and personal 
development journey, particularly in the context of Apache Spark and the 
broader big data ecosystem.

A little background about myself: I graduated in 2019 and have since been 
working in the field of AWS cloud and big data tools such as Spark, Airflow, 
AWS services, Databricks, and Snowflake. My interest in the world of big data 
tools dates back to 2016-17, where I initially began exploring concepts like 
big data with spark using scala, and the Scala ecosystem, including 
technologies like Akka. Additionally, I have a keen interest in functional 
programming and data structures and algorithms (DSA) applied to big data 
optimizations.

However, despite my enthusiasm and passion for these areas, I am encountering 
some challenges that are hindering my growth:
   
   -
Focus and Time Management: I often struggle with maintaining focus and 
effectively managing my time. This leads to productivity issues and affects my 
ability to undertake and complete projects efficiently.

   -
Graduate Studies Dilemma: I am unsure about whether to pursue a master's 
degree. The fear of GRE and uncertainty about getting into a reputable 
university have been holding me back. I'm unsure whether further education 
would significantly benefit my career in big data.

   -
Long-Term Project Building: I am interested in working on long-term projects, 
but I am uncertain about the right approach and how to stay committed 
throughout the project's lifecycle.

   -
Overcoming Fear of Failure and Procrastination: I o

Re: Seeking Professional Advice on Career and Personal Growth in the Apache Spark Community

2023-09-06 Thread Mich Talebzadeh
Hi Varun,

In answer to your questions, these are my views. However, they are just
views and cannot be taken as facts so to speak


   1.

   *Focus and Time Management:* I often struggle with maintaining focus and
   effectively managing my time. This leads to productivity issues and affects
   my ability to undertake and complete projects efficiently.


   - Set clear goals.
  - Prioritize tasks.
  - Create a to-do list.
  - Avoid multitasking.
  - Eliminate distractions.
  - Take regular breaks.
  - Go to the gym and try to rest your mind and refresh yourself.


   2.

   *Graduate Studies Dilemma:*


   - Your mileage varies and it all depends on what you are trying to
  achieve. Graduate studies will help you to think independently and out of
  the box. They will also guide you on "how to go about solving the
  problem", so it will give you that experience.


   3.

   *Long-Term Project Building:* I am interested in working on long-term
   projects, but I am uncertain about the right approach and how to stay
   committed throughout the project's lifecycle.


   - I assume you have a degree. That means that you had the discipline to
  wake up in the morning, go to lectures and not to miss the lectures
  (hopefully you did not!). In other words, it proves that you have already
  been through a structured discipline and you have the will to do it.


   4.

   *Overcoming Fear of Failure and Procrastination:* I often find myself in
   a constant fear mode of failure, which leads to abandoning pet projects
   shortly after starting them or procrastinating over initiating new ones.


   - Failure is natural and can and does happen. However, the important point
  is that you learn from your failures. Just call them experience. You need
  to overcome the fear of failure and embrace the challenges.


   5.

   *Risk Aversion:* With no inherited wealth or financial security, I am
   often apprehensive about taking risks, even when they may potentially lead
   to significant personal or professional growth.
   - Welcome to the club! In 2020
  
<https://equalitytrust.org.uk/scale-economic-inequality-uk#:~:text=In%202020%2C%20the%20ONS%20calculated,and%202013%2C%20reaching%209%25.>,
  it was estimated that in the UK, the richest 10% of households hold 43% of
  all wealth. The poorest 50% by contrast own just 9%. Risk is part of life.
  When crossing the street, you are taking a calculated view of the cars
  coming and going. In short, risk assessment is a fundamental aspect of life!

HTH

Mich Talebzadeh,
Distinguished Technologist, Solutions Architect & Engineer
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Tue, 5 Sept 2023 at 22:17, Varun Shah  wrote:

> Dear Apache Spark Community,
>
> I hope this email finds you well. I am writing to seek your valuable
> insights and advice on some challenges I've been facing in my career and
> personal development journey, particularly in the context of Apache Spark
> and the broader big data ecosystem.
>
> A little background about myself: I graduated in 2019 and have since been
> working in the field of AWS cloud and big data tools such as Spark,
> Airflow, AWS services, Databricks, and Snowflake. My interest in the world
> of big data tools dates back to 2016-17, where I initially began exploring
> concepts like big data with spark using scala, and the Scala ecosystem,
> including technologies like Akka. Additionally, I have a keen interest in
> functional programming and data structures and algorithms (DSA) applied to
> big data optimizations.
>
> However, despite my enthusiasm and passion for these areas, I am
> encountering some challenges that are hindering my growth:
>
>1.
>
>*Focus and Time Management:* I often struggle with maintaining focus
>and effectively managing my time. This leads to productivity issues and
>affects my ability to undertake and complete projects efficiently.
>2.
>
>*Graduate Studies Dilemma:* I am unsure about whether to pursue a
>master's degree. The fear of GRE and uncertainty about getting into a
>reputable university have been holding me back. I'm unsure whether further
>education would significantly benefit my career in big data.
>3.
>
>*Long-Term Project Building:* I am interested in working on long-term
>projects, but 

Seeking Professional Advice on Career and Personal Growth in the Apache Spark Community

2023-09-05 Thread Varun Shah
Dear Apache Spark Community,

I hope this email finds you well. I am writing to seek your valuable
insights and advice on some challenges I've been facing in my career and
personal development journey, particularly in the context of Apache Spark
and the broader big data ecosystem.

A little background about myself: I graduated in 2019 and have since been
working in the field of AWS cloud and big data tools such as Spark,
Airflow, AWS services, Databricks, and Snowflake. My interest in the world
of big data tools dates back to 2016-17, where I initially began exploring
concepts like big data with spark using scala, and the Scala ecosystem,
including technologies like Akka. Additionally, I have a keen interest in
functional programming and data structures and algorithms (DSA) applied to
big data optimizations.

However, despite my enthusiasm and passion for these areas, I am
encountering some challenges that are hindering my growth:

   1.

   *Focus and Time Management:* I often struggle with maintaining focus and
   effectively managing my time. This leads to productivity issues and affects
   my ability to undertake and complete projects efficiently.
   2.

   *Graduate Studies Dilemma:* I am unsure about whether to pursue a
   master's degree. The fear of GRE and uncertainty about getting into a
   reputable university have been holding me back. I'm unsure whether further
   education would significantly benefit my career in big data.
   3.

   *Long-Term Project Building:* I am interested in working on long-term
   projects, but I am uncertain about the right approach and how to stay
   committed throughout the project's lifecycle.
   4.

   *Overcoming Fear of Failure and Procrastination:* I often find myself in
   a constant fear mode of failure, which leads to abandoning pet projects
   shortly after starting them or procrastinating over initiating new ones.
   5.

   *Risk Aversion:* With no inherited wealth or financial security, I am
   often apprehensive about taking risks, even when they may potentially lead
   to significant personal or professional growth.

Given my background and aspirations, I am reaching out to the Apache Spark
Community, hoping to receive advice, guidance, and mentorship from
experienced professionals who may have faced similar challenges or can
offer valuable insights. I believe that the collective wisdom and
experience within this community can provide me with valuable perspectives
to navigate these hurdles.

If any of you have experienced similar challenges or have insights to share
on any of the mentioned points, I would greatly appreciate your guidance.
Additionally, if you are aware of resources, courses, or opportunities that
could help me address these challenges, please do let me know.

Thank you in advance for considering my request. Your advice will play a
crucial role in shaping my career and personal development journey in the
world of big data and Apache Spark.

I am looking forward to hearing from you and learning from your experiences.

Sincerely,

Varun Shah


[ANNOUNCE] Apache Spark 3.3.3 released

2023-08-22 Thread Yuming Wang
We are happy to announce the availability of Apache Spark 3.3.3!

Spark 3.3.3 is a maintenance release containing stability fixes. This
release is based on the branch-3.3 maintenance branch of Spark. We strongly
recommend all 3.3 users to upgrade to this stable release.

To download Spark 3.3.3, head over to the download page:
https://spark.apache.org/downloads.html

To view the release notes:
https://spark.apache.org/releases/spark-release-3-3-3.html

We would like to acknowledge all community members for contributing to this
release. This release would not have been possible without you.


Re: dockerhub does not contain apache/spark-py 3.4.1

2023-08-10 Thread Mich Talebzadeh
Hi Mark,

I created a Spark 3.4.1 Dockerfile. Details from
spark-py-3.4.1-scala_2.12-11-jre-slim-buster
<https://hub.docker.com/repository/docker/michtalebzadeh/spark_dockerfiles/tags?page=1&ordering=last_updated>

Pull instructions are given

docker pull
michtalebzadeh/spark_dockerfiles:spark-py-3.4.1-scala_2.12-11-jre-slim-buster

It is 3.4.1 spark-py with no extra Python packages.

You can tag it as you wish.

Log in to it as below:

docker run -it
michtalebzadeh/spark_dockerfiles:spark-py-3.4.1-scala_2.12-11-jre-slim-buster
bash

185@b031a15c6730:/opt/spark/work-dir$ pip list

Package   Version
- ---
asn1crypto0.24.0
cryptography  2.6.1
entrypoints   0.3
keyring   17.1.1
keyrings.alt  3.1.1
pip   23.2.1
pycrypto  2.6.1
PyGObject 3.30.4
pyxdg 0.25
SecretStorage 2.3.1
setuptools68.0.0
six   1.12.0
wheel 0.32.3

$SPARK_HOME/bin/spark-submit --version
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.4.1
  /_/

Using Scala version 2.12.17, OpenJDK 64-Bit Server VM, 11.0.11
Branch HEAD
Compiled by user centos on 2023-06-19T23:01:01Z
Revision 6b1ff22dde1ead51cbf370be6e48a802daae58b6
Url https://github.com/apache/spark

Built on java 11

185@b031a15c6730:/opt/spark/work-dir$ java --version
openjdk 11.0.11 2021-04-20
OpenJDK Runtime Environment 18.9 (build 11.0.11+9)
OpenJDK 64-Bit Server VM 18.9 (build 11.0.11+9, mixed mode, sharing)

HTH


Mich Talebzadeh,
Solutions Architect/Engineering Lead
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Wed, 9 Aug 2023 at 17:41, Mich Talebzadeh 
wrote:

> Hi Mark,
>
> you can build it yourself, no big deal :)
>
> REPOSITORY TAG
> IMAGE ID   CREATED
>  SIZE
> sparkpy/spark-py
>  3.4.1-scala_2.12-11-jre-slim-buster-Dockerfile a876102b2206   1
> second ago1.09GB
> sparkpy/spark
> 3.4.1-scala_2.12-11-jre-slim-buster-Dockerfile 6f74f7475e01   3
> minutes ago   695MB
>
> Based on
>
> ARG java_image_tag=11-jre-slim  ## java 11
> FROM openjdk:${java_image_tag}
>
> BASE_OS="buster"
> SPARK_VERSION="3.4.1"
> SCALA_VERSION="scala_2.12"
> DOCKERFILE="Dockerfile"
> DOCKERIMAGETAG="11-jre-slim"
>
> You need to modify the file
>
> $SPARK_HOME/kubernetes/dockerfiles/spark/Dockerfile
>
> and replace
>
> #ARG java_image_tag=17-jre
> #FROM eclipse-temurin:${java_image_tag}
>
> With
>
> ARG java_image_tag=11-jre-slim
> FROM openjdk:${java_image_tag}
>
> Which is Java 11
>
> HTH
>
>
> Mich Talebzadeh,
> Solutions Architect/Engineering Lead
> London
> United Kingdom
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Wed, 9 Aug 2023 at 16:43, Mark Elliot 
> wrote:
>
>> Hello,
>>
>> I noticed that the apache/spark-py image for Spark's 3.4.1 release is not
>> available (apache/spark@3.4.1 is available). Would it be possible to get
>> the 3.4.1 release build for the apache/spark-py image published?
>>
>> Thanks,
>>
>> Mark
>>
>> --
>>
>> This communication, together with any attachments, is intended only for
>> the addressee(s) and may contain confidential, privileged or proprietary
>> information of Theorem Partners LLC ("Theorem"). By accepting this
>> communication you agree to keep confidential all information contained in
>> this communication, as well as any information derived by you from the
>> confidential information contained in this communication. Theorem does not
>> waive any confidentiality by misdelivery.
>>
>> If you receive this communication in error, any use,

Re: dockerhub does not contain apache/spark-py 3.4.1

2023-08-09 Thread Mich Talebzadeh
Hi Mark,

you can build it yourself, no big deal :)

REPOSITORY TAG
  IMAGE ID   CREATED
 SIZE
sparkpy/spark-py
 3.4.1-scala_2.12-11-jre-slim-buster-Dockerfile a876102b2206   1
second ago1.09GB
sparkpy/spark
3.4.1-scala_2.12-11-jre-slim-buster-Dockerfile 6f74f7475e01   3
minutes ago   695MB

Based on

ARG java_image_tag=11-jre-slim  ## java 11
FROM openjdk:${java_image_tag}

BASE_OS="buster"
SPARK_VERSION="3.4.1"
SCALA_VERSION="scala_2.12"
DOCKERFILE="Dockerfile"
DOCKERIMAGETAG="11-jre-slim"

You need to modify the file

$SPARK_HOME/kubernetes/dockerfiles/spark/Dockerfile

and replace

#ARG java_image_tag=17-jre
#FROM eclipse-temurin:${java_image_tag}

With

ARG java_image_tag=11-jre-slim
FROM openjdk:${java_image_tag}

Which is Java 11

HTH


Mich Talebzadeh,
Solutions Architect/Engineering Lead
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Wed, 9 Aug 2023 at 16:43, Mark Elliot  wrote:

> Hello,
>
> I noticed that the apache/spark-py image for Spark's 3.4.1 release is not
> available (apache/spark@3.4.1 is available). Would it be possible to get
> the 3.4.1 release build for the apache/spark-py image published?
>
> Thanks,
>
> Mark
>
> --
>
> This communication, together with any attachments, is intended only for
> the addressee(s) and may contain confidential, privileged or proprietary
> information of Theorem Partners LLC ("Theorem"). By accepting this
> communication you agree to keep confidential all information contained in
> this communication, as well as any information derived by you from the
> confidential information contained in this communication. Theorem does not
> waive any confidentiality by misdelivery.
>
> If you receive this communication in error, any use, dissemination,
> printing or copying of all or any part of it is strictly prohibited; please
> destroy all electronic and paper copies and notify the sender immediately.
> Nothing in this email is intended to constitute (1) investment, legal or
> tax advice, (2) any recommendation to purchase or sell any security, (3)
> any advertisement or offer of advisory services or (4) any offer to sell or
> solicitation of an offer to buy any securities or other financial
> instrument in any jurisdiction.
>
> Theorem, including its agents or affiliates, reserves the right to
> intercept, archive, monitor and review all communications to and from its
> network, including this email and any email response to it.
>
> Theorem makes no representation as to the accuracy or completeness of the
> information in this communication and does not accept liability for any
> errors or omissions in this communication, including any liability
> resulting from its transmission by email, and undertakes no obligation to
> update any information in this email or its attachments.
>


dockerhub does not contain apache/spark-py 3.4.1

2023-08-09 Thread Mark Elliot
Hello,

I noticed that the apache/spark-py image for Spark's 3.4.1 release is not
available (apache/spark@3.4.1 is available). Would it be possible to get
the 3.4.1 release build for the apache/spark-py image published?

Thanks,

Mark

-- 










This communication, together with any attachments, is intended 
only for the addressee(s) and may contain confidential, privileged or 
proprietary information of Theorem Partners LLC ("Theorem"). By accepting 
this communication you agree to keep confidential all information contained 
in this communication, as well as any information derived by you from the 
confidential information contained in this communication. Theorem does not 
waive any confidentiality by misdelivery.

If you receive this 
communication in error, any use, dissemination, printing or copying of all 
or any part of it is strictly prohibited; please destroy all electronic and 
paper copies and notify the sender immediately. Nothing in this email is 
intended to constitute (1) investment, legal or tax advice, (2) any 
recommendation to purchase or sell any security, (3) any advertisement or 
offer of advisory services or (4) any offer to sell or solicitation of an 
offer to buy any securities or other financial instrument in any 
jurisdiction.

Theorem, including its agents or affiliates, reserves the 
right to intercept, archive, monitor and review all communications to and 
from its network, including this email and any email response to it.

Theorem makes no representation as to the accuracy or completeness of the 
information in this communication and does not accept liability for any 
errors or omissions in this communication, including any liability 
resulting from its transmission by email, and undertakes no obligation to 
update any information in this email or its attachments.


Re: The performance difference when running Apache Spark on K8s and traditional server

2023-07-27 Thread Mich Talebzadeh
Spark on tin boxes like Google Dataproc or AWS EC2 often utilises the YARN
resource manager. YARN is the most widely used resource manager, not just
for Spark but for other artefacts as well. On-premise, YARN is used
extensively. In the cloud it is also widely used in Infrastructure as a Service
offerings such as Google Dataproc, which I mentioned.

With regard to your questions:

Q1: What are the causes and reasons for Spark on K8s to be slower than
Serverful?
--> It should be noted that Spark on Kubernetes is a work in progress, and as
of now there is future work outstanding. It is not yet at parity with Spark on
YARN.

Q2: How or is there a scenario to show the most apparent difference in
performance and cost of these two environments (Serverless (K8S) and
Serverful (Traditional server)?
--> Simple. One experiment is worth ten hypotheses. Install Spark on a
serverful cluster and on K8s, run the same workload on both, and observe
the performance through the Spark GUI.
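To make that concrete, below is a minimal, illustrative benchmark sketch (the
join, data sizes and class name are invented for the example, not taken from
any specific workload) that can be submitted unchanged to both environments:

```
// Illustrative join workload; sizes and names are made up for the sketch.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.rand

object JoinBenchmark {
  def main(args: Array[String]): Unit = {
    // Do not hard-code master(): pass it on spark-submit so the same jar
    // runs on YARN ("serverful") and on Kubernetes.
    val spark = SparkSession.builder().appName("join-benchmark").getOrCreate()
    import spark.implicits._

    val left  = spark.range(0L, 50000000L).withColumn("k", $"id" % 1000000L)
    val right = spark.range(0L, 1000000L).withColumn("v", rand())

    val start = System.nanoTime()
    val rows  = left.join(right, left("k") === right("id")).count()
    println(s"joined rows = $rows, elapsed = ${(System.nanoTime() - start) / 1e9} s")

    spark.stop()
  }
}
```

Submit the same jar with --master yarn (or whichever serverful manager you
use) and with --master k8s://https://<api-server> on Kubernetes, then compare
stage and task timings in the Spark UI.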

See this article of mine to help you with some features. A bit dated but
still covers concepts

Spark on Kubernetes, A Practitioner’s Guide


HTH

Mich Talebzadeh,
Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 27 Jul 2023 at 18:20, Trường Trần Phan An 
wrote:

> Hi all,
>
> I am learning about the performance difference of Spark when performing a
> JOIN problem on Serverless (K8S) and Serverful (Traditional server)
> environments.
>
> Through experiment, Spark on K8s tends to run slower than Serverful.
> Through understanding the architecture, I know that Spark runs on K8s as
> Containers (Pods) so it takes a certain time to initialize, but when I look
> at each job, stage, and task, Spark on K8s tends to be slower than Serverful.
>
> *I have some questions:*
> Q1: What are the causes and reasons for Spark on K8s to be slower than
> Serverful?
> Q2: How or is there a scenario to show the most apparent difference in
> performance and cost of these two environments (Serverless (K8S) and
> Serverful (Traditional server)?
>
> Thank you so much!
>
> Best regards,
> Truong
>
>
>


The performance difference when running Apache Spark on K8s and traditional server

2023-07-27 Thread Trường Trần Phan An
Hi all,

I am learning about the performance difference of Spark when performing a
JOIN problem on Serverless (K8S) and Serverful (Traditional server)
environments.

Through experiment, Spark on K8s tends to run slower than Serverful.
Through understanding the architecture, I know that Spark runs on K8s as
Containers (Pods) so it takes a certain time to initialize, but when I look
at each job, stage, and task, Spark on K8s tends to be slower than Serverful.

*I have some questions:*
Q1: What are the causes and reasons for Spark on K8s to be slower than
Serverful?
Q2: How or is there a scenario to show the most apparent difference in
performance and cost of these two environments (Serverless (K8S) and
Serverful (Traditional server)?

Thank you so much!

Best regards,
Truong


Re: Introducing English SDK for Apache Spark - Seeking Your Feedback and Contributions

2023-07-03 Thread Gavin Ray
Wow, really neat -- thanks for sharing!

On Mon, Jul 3, 2023 at 8:12 PM Gengliang Wang  wrote:

> Dear Apache Spark community,
>
> We are delighted to announce the launch of a groundbreaking tool that aims
> to make Apache Spark more user-friendly and accessible - the English SDK
> <https://github.com/databrickslabs/pyspark-ai/>. Powered by the
> application of Generative AI, the English SDK
> <https://github.com/databrickslabs/pyspark-ai/> allows you to execute
> complex tasks with simple English instructions. This exciting news was 
> announced
> recently at the Data+AI Summit
> <https://www.youtube.com/watch?v=yj7XlTB1Jvc&t=511s> and also introduced
> through a detailed blog post
> <https://www.databricks.com/blog/introducing-english-new-programming-language-apache-spark>
> .
>
> Now, we need your invaluable feedback and contributions. The aim of the
> English SDK is not only to simplify and enrich your Apache Spark experience
> but also to grow with the community. We're calling upon Spark developers
> and users to explore this innovative tool, offer your insights, provide
> feedback, and contribute to its evolution.
>
> You can find more details about the SDK and usage examples on the GitHub
> repository https://github.com/databrickslabs/pyspark-ai/. If you have any
> feedback or suggestions, please feel free to open an issue directly on the
> repository. We are actively monitoring the issues and value your insights.
>
> We also welcome pull requests and are eager to see how you might extend or
> refine this tool. Let's come together to continue making Apache Spark more
> approachable and user-friendly.
>
> Thank you in advance for your attention and involvement. We look forward
> to hearing your thoughts and seeing your contributions!
>
> Best,
> Gengliang Wang
>


Re: Introducing English SDK for Apache Spark - Seeking Your Feedback and Contributions

2023-07-03 Thread Hyukjin Kwon
The demo was really amazing.

On Tue, 4 Jul 2023 at 09:17, Farshid Ashouri 
wrote:

> This is wonderful news!
>
> On Tue, 4 Jul 2023 at 01:14, Gengliang Wang  wrote:
>
>> Dear Apache Spark community,
>>
>> We are delighted to announce the launch of a groundbreaking tool that
>> aims to make Apache Spark more user-friendly and accessible - the
>> English SDK <https://github.com/databrickslabs/pyspark-ai/>. Powered by
>> the application of Generative AI, the English SDK
>> <https://github.com/databrickslabs/pyspark-ai/> allows you to execute
>> complex tasks with simple English instructions. This exciting news was 
>> announced
>> recently at the Data+AI Summit
>> <https://www.youtube.com/watch?v=yj7XlTB1Jvc&t=511s> and also introduced
>> through a detailed blog post
>> <https://www.databricks.com/blog/introducing-english-new-programming-language-apache-spark>
>> .
>>
>> Now, we need your invaluable feedback and contributions. The aim of the
>> English SDK is not only to simplify and enrich your Apache Spark experience
>> but also to grow with the community. We're calling upon Spark developers
>> and users to explore this innovative tool, offer your insights, provide
>> feedback, and contribute to its evolution.
>>
>> You can find more details about the SDK and usage examples on the GitHub
>> repository https://github.com/databrickslabs/pyspark-ai/. If you have
>> any feedback or suggestions, please feel free to open an issue directly on
>> the repository. We are actively monitoring the issues and value your
>> insights.
>>
>> We also welcome pull requests and are eager to see how you might extend
>> or refine this tool. Let's come together to continue making Apache Spark
>> more approachable and user-friendly.
>>
>> Thank you in advance for your attention and involvement. We look forward
>> to hearing your thoughts and seeing your contributions!
>>
>> Best,
>> Gengliang Wang
>>
> --
>
>
> *Farshid Ashouri*,
> Senior Vice President,
> J.P. Morgan & Chase Co.
> +44 7932 650 788
>
>


Re: Introducing English SDK for Apache Spark - Seeking Your Feedback and Contributions

2023-07-03 Thread Farshid Ashouri
This is wonderful news!

On Tue, 4 Jul 2023 at 01:14, Gengliang Wang  wrote:

> Dear Apache Spark community,
>
> We are delighted to announce the launch of a groundbreaking tool that aims
> to make Apache Spark more user-friendly and accessible - the English SDK
> <https://github.com/databrickslabs/pyspark-ai/>. Powered by the
> application of Generative AI, the English SDK
> <https://github.com/databrickslabs/pyspark-ai/> allows you to execute
> complex tasks with simple English instructions. This exciting news was 
> announced
> recently at the Data+AI Summit
> <https://www.youtube.com/watch?v=yj7XlTB1Jvc&t=511s> and also introduced
> through a detailed blog post
> <https://www.databricks.com/blog/introducing-english-new-programming-language-apache-spark>
> .
>
> Now, we need your invaluable feedback and contributions. The aim of the
> English SDK is not only to simplify and enrich your Apache Spark experience
> but also to grow with the community. We're calling upon Spark developers
> and users to explore this innovative tool, offer your insights, provide
> feedback, and contribute to its evolution.
>
> You can find more details about the SDK and usage examples on the GitHub
> repository https://github.com/databrickslabs/pyspark-ai/. If you have any
> feedback or suggestions, please feel free to open an issue directly on the
> repository. We are actively monitoring the issues and value your insights.
>
> We also welcome pull requests and are eager to see how you might extend or
> refine this tool. Let's come together to continue making Apache Spark more
> approachable and user-friendly.
>
> Thank you in advance for your attention and involvement. We look forward
> to hearing your thoughts and seeing your contributions!
>
> Best,
> Gengliang Wang
>
-- 


*Farshid Ashouri*,
Senior Vice President,
J.P. Morgan & Chase Co.
+44 7932 650 788


Introducing English SDK for Apache Spark - Seeking Your Feedback and Contributions

2023-07-03 Thread Gengliang Wang
Dear Apache Spark community,

We are delighted to announce the launch of a groundbreaking tool that aims
to make Apache Spark more user-friendly and accessible - the English SDK
<https://github.com/databrickslabs/pyspark-ai/>. Powered by the application
of Generative AI, the English SDK
<https://github.com/databrickslabs/pyspark-ai/> allows you to execute
complex tasks with simple English instructions. This exciting news was
announced
recently at the Data+AI Summit
<https://www.youtube.com/watch?v=yj7XlTB1Jvc&t=511s> and also introduced
through a detailed blog post
<https://www.databricks.com/blog/introducing-english-new-programming-language-apache-spark>
.

Now, we need your invaluable feedback and contributions. The aim of the
English SDK is not only to simplify and enrich your Apache Spark experience
but also to grow with the community. We're calling upon Spark developers
and users to explore this innovative tool, offer your insights, provide
feedback, and contribute to its evolution.

You can find more details about the SDK and usage examples on the GitHub
repository https://github.com/databrickslabs/pyspark-ai/. If you have any
feedback or suggestions, please feel free to open an issue directly on the
repository. We are actively monitoring the issues and value your insights.

We also welcome pull requests and are eager to see how you might extend or
refine this tool. Let's come together to continue making Apache Spark more
approachable and user-friendly.

Thank you in advance for your attention and involvement. We look forward to
hearing your thoughts and seeing your contributions!

Best,
Gengliang Wang


Re: [ANNOUNCE] Apache Spark 3.4.1 released

2023-06-24 Thread yangjie01
Thanks Dongjoon ~

On 2023/6/24 10:29, "L. C. Hsieh"  wrote:


Thanks Dongjoon!


On Fri, Jun 23, 2023 at 7:10 PM Hyukjin Kwon  wrote:
>
> Thanks!
>
> On Sat, Jun 24, 2023 at 11:01 AM Mridul Muralidharan  wrote:
>>
>>
>> Thanks Dongjoon !
>>
>> Regards,
>> Mridul
>>
>> On Fri, Jun 23, 2023 at 6:58 PM Dongjoon Hyun  wrote:
>>>
>>> We are happy to announce the availability of Apache Spark 3.4.1!
>>>
>>> Spark 3.4.1 is a maintenance release containing stability fixes. This
>>> release is based on the branch-3.4 maintenance branch of Spark. We strongly
>>> recommend all 3.4 users to upgrade to this stable release.
>>>
>>> To download Spark 3.4.1, head over to the download page:
>>> https://spark.apache.org/downloads.html
>>>
>>> To view the release notes:
>>> https://spark.apache.org/releases/spark-release-3-4-1.html
>>>
>>> We would like to acknowledge all community members for contributing to this
>>> release. This release would not have been possible without you.
>>>
>>>
>>> Dongjoon Hyun


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org






-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re:[ANNOUNCE] Apache Spark 3.4.1 released

2023-06-24 Thread beliefer
Thanks, Dongjoon Hyun!
Congratulations too!







At 2023-06-24 07:57:05, "Dongjoon Hyun"  wrote:

We are happy to announce the availability of Apache Spark 3.4.1!

Spark 3.4.1 is a maintenance release containing stability fixes. This
release is based on the branch-3.4 maintenance branch of Spark. We strongly
recommend all 3.4 users to upgrade to this stable release.

To download Spark 3.4.1, head over to the download page:
https://spark.apache.org/downloads.html

To view the release notes:
https://spark.apache.org/releases/spark-release-3-4-1.html

We would like to acknowledge all community members for contributing to this
release. This release would not have been possible without you.

Dongjoon Hyun


Apache Spark with watermark - processing data different LogTypes in same kafka topic

2023-06-24 Thread karan alang
Hello All -

I'm using Apache Spark Structured Streaming to read data from a Kafka topic
and do some processing. I'm using a watermark to account for late-arriving
records, and the code works fine.

Here is the working(sample) code:
```

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col, to_timestamp, window, max, expr
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, IntegerType

spark = SparkSession \
.builder \
.master("local[3]") \
.appName("Sliding Window Demo") \
.config("spark.streaming.stopGracefullyOnShutdown", "true") \
.config("spark.sql.shuffle.partitions", 1) \
.getOrCreate()


stock_schema = StructType([
StructField("LogType", StringType()),
StructField("CreatedTime", StringType()),
StructField("Type", StringType()),
StructField("Amount", IntegerType()),
StructField("BrokerCode", StringType())
])

kafka_df = spark.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "localhost:9092") \
.option("subscribe", "trades") \
.option("startingOffsets", "earliest") \
.load()

value_df = kafka_df.select(from_json(col("value").cast("string"),
stock_schema).alias("value"))

trade_df = value_df.select("value.*") \
.withColumn("CreatedTime", to_timestamp(col("CreatedTime"),
"-MM-dd HH:mm:ss")) \
.withColumn("Buy", expr("case when Type == 'BUY' then Amount
else 0 end")) \
.withColumn("Sell", expr("case when Type == 'SELL' then Amount
else 0 end"))


window_agg_df = trade_df \
.withWatermark("CreatedTime", "10 minute") \
.groupBy(window(col("CreatedTime"), "10 minute")) \
.agg({"Buy":"sum",
"Sell":"sum"}).withColumnRenamed("sum(Buy)",
"TotalBuy").withColumnRenamed("sum(Sell)", "TotalSell")

output_df = window_agg_df.select("window.start", "window.end",
"TotalBuy", "TotalSell")

window_query = output_df.writeStream \
.format("console") \
.outputMode("append") \
.option("checkpointLocation", "chk-point-dir-mar28") \
.trigger(processingTime="30 second") \
.start()

window_query.awaitTermination()


```

Currently, I'm processing a single LogType; the requirement is to process
multiple LogTypes in the same flow. The LogTypes will be config-driven (not
hard-coded). The objective is to have generic code that can process all
LogTypes.

As an example, for LogType X, I will need to group by columns col1 and col2
and get the sum of the values 'sent' & 'received'. For LogType Y, the groupBy
columns will remain the same, but the sum will be on column col3 instead.

Without the watermark, I can look at the LogType and do the processing in batch
mode (using foreachBatch). However, with the watermark, I'm unable to figure
out how to process based on LogType.

Any inputs on this ?

Here is the stackoverflow for this

https://stackoverflow.com/questions/76547349/apache-spark-with-watermark-processing-data-different-logtypes-in-same-kafka-t

tia!


Re: [ANNOUNCE] Apache Spark 3.4.1 released

2023-06-23 Thread L. C. Hsieh
Thanks Dongjoon!

On Fri, Jun 23, 2023 at 7:10 PM Hyukjin Kwon  wrote:
>
> Thanks!
>
> On Sat, Jun 24, 2023 at 11:01 AM Mridul Muralidharan  wrote:
>>
>>
>> Thanks Dongjoon !
>>
>> Regards,
>> Mridul
>>
>> On Fri, Jun 23, 2023 at 6:58 PM Dongjoon Hyun  wrote:
>>>
>>> We are happy to announce the availability of Apache Spark 3.4.1!
>>>
>>> Spark 3.4.1 is a maintenance release containing stability fixes. This
>>> release is based on the branch-3.4 maintenance branch of Spark. We strongly
>>> recommend all 3.4 users to upgrade to this stable release.
>>>
>>> To download Spark 3.4.1, head over to the download page:
>>> https://spark.apache.org/downloads.html
>>>
>>> To view the release notes:
>>> https://spark.apache.org/releases/spark-release-3-4-1.html
>>>
>>> We would like to acknowledge all community members for contributing to this
>>> release. This release would not have been possible without you.
>>>
>>>
>>> Dongjoon Hyun

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: [ANNOUNCE] Apache Spark 3.4.1 released

2023-06-23 Thread Hyukjin Kwon
Thanks!

On Sat, Jun 24, 2023 at 11:01 AM Mridul Muralidharan 
wrote:

>
> Thanks Dongjoon !
>
> Regards,
> Mridul
>
> On Fri, Jun 23, 2023 at 6:58 PM Dongjoon Hyun  wrote:
>
>> We are happy to announce the availability of Apache Spark 3.4.1!
>>
>> Spark 3.4.1 is a maintenance release containing stability fixes. This
>> release is based on the branch-3.4 maintenance branch of Spark. We
>> strongly
>> recommend all 3.4 users to upgrade to this stable release.
>>
>> To download Spark 3.4.1, head over to the download page:
>> https://spark.apache.org/downloads.html
>>
>> To view the release notes:
>> https://spark.apache.org/releases/spark-release-3-4-1.html
>>
>> We would like to acknowledge all community members for contributing to
>> this
>> release. This release would not have been possible without you.
>>
>>
>> Dongjoon Hyun
>>
>


Re: [ANNOUNCE] Apache Spark 3.4.1 released

2023-06-23 Thread Mridul Muralidharan
Thanks Dongjoon !

Regards,
Mridul

On Fri, Jun 23, 2023 at 6:58 PM Dongjoon Hyun  wrote:

> We are happy to announce the availability of Apache Spark 3.4.1!
>
> Spark 3.4.1 is a maintenance release containing stability fixes. This
> release is based on the branch-3.4 maintenance branch of Spark. We strongly
> recommend all 3.4 users to upgrade to this stable release.
>
> To download Spark 3.4.1, head over to the download page:
> https://spark.apache.org/downloads.html
>
> To view the release notes:
> https://spark.apache.org/releases/spark-release-3-4-1.html
>
> We would like to acknowledge all community members for contributing to this
> release. This release would not have been possible without you.
>
>
> Dongjoon Hyun
>


[ANNOUNCE] Apache Spark 3.4.1 released

2023-06-23 Thread Dongjoon Hyun
We are happy to announce the availability of Apache Spark 3.4.1!

Spark 3.4.1 is a maintenance release containing stability fixes. This
release is based on the branch-3.4 maintenance branch of Spark. We strongly
recommend all 3.4 users to upgrade to this stable release.

To download Spark 3.4.1, head over to the download page:
https://spark.apache.org/downloads.html

To view the release notes:
https://spark.apache.org/releases/spark-release-3-4-1.html

We would like to acknowledge all community members for contributing to this
release. This release would not have been possible without you.

Dongjoon Hyun


Re: Apache Spark not reading UTC timestamp from MongoDB correctly

2023-06-08 Thread Enrico Minack
Sean is right, casting timestamps to strings (which is what show() does) 
uses the local timezone, either the Java default zone `user.timezone`, 
the Spark default zone `spark.sql.session.timeZone` or the default 
DataFrameWriter zone `timeZone` (when writing to file).


You say you are in PST, which is UTC - 8 hours. But I think this 
currently observes daylight saving, so PDT, which is UTC - 7 hours.


Then, your UTC timestamp is correctly displayed in local PDT time. Try 
changing the above settings to display in different timezones. Inspecting 
the underlying long value as suggested by Sean is best practice to get 
hold of the true timestamp.
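For illustration, here is a minimal spark-shell sketch (it reuses the epoch
value quoted in this thread instead of reading from MongoDB) showing that only
the rendering changes with the session time zone:

```
// Assumes a running SparkSession named `spark` (e.g. in spark-shell).
import spark.implicits._
import org.apache.spark.sql.functions.col

// 1683527400 seconds since the epoch == 2023-05-08 06:30:00 UTC
val df = Seq(1683527400L).toDF("timeslot")
  .withColumn("timeslot_date", col("timeslot").cast("timestamp"))

spark.conf.set("spark.sql.session.timeZone", "UTC")
df.show(false)   // timeslot_date rendered as 2023-05-08 06:30:00

spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
df.show(false)   // same instant rendered as 2023-05-07 23:30:00 (PDT)

// The underlying value does not change with the time zone:
df.select(col("timeslot_date").cast("long")).show()   // 1683527400
```

The same reasoning applies when the timestamp comes from MongoDB: the stored
instant is a single epoch value, and only its string rendering depends on the
time zone settings above.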


Cheers,
Enrico


Am 09.06.23 um 00:53 schrieb Sean Owen:
You sure it is not just that it's displaying in your local TZ? Check 
the actual value as a long for example. That is likely the same time.


On Thu, Jun 8, 2023, 5:50 PM karan alang  wrote:

ref :

https://stackoverflow.com/questions/76436159/apache-spark-not-reading-utc-timestamp-from-mongodb-correctly

Hello All,
I've data stored in MongoDB collection and the timestamp column is
not being read by Apache Spark correctly. I'm running Apache Spark
on GCP Dataproc.

Here is sample data :

-

In Mongo :

timeslot  |timeslot_date         |
----------+----------------------+
1683527400|{2023-05-08T06:30:00Z}|

When I use pyspark to read this :

timeslot  |timeslot_date      |
----------+-------------------+
1683527400|2023-05-07 23:30:00|

My understanding is, data in Mongo is in UTC format i.e.
2023-05-08T06:30:00Z is in UTC format. I'm in PST timezone. I'm
not clear why spark is reading it a different timezone format
(neither PST nor UTC) Note - it is not reading it as PST timezone,
if it was doing that it would advance the time by 7 hours, instead
it is doing the opposite.

Where is the default timezone format taken from, when Spark is
reading data from MongoDB ?

Any ideas on this ?

tia!

|




Re: Apache Spark not reading UTC timestamp from MongoDB correctly

2023-06-08 Thread Sean Owen
You sure it is not just that it's displaying in your local TZ? Check the
actual value as a long for example. That is likely the same time.

On Thu, Jun 8, 2023, 5:50 PM karan alang  wrote:

> ref :
> https://stackoverflow.com/questions/76436159/apache-spark-not-reading-utc-timestamp-from-mongodb-correctly
>
> Hello All,
> I've data stored in MongoDB collection and the timestamp column is not
> being read by Apache Spark correctly. I'm running Apache Spark on GCP
> Dataproc.
>
> Here is sample data :
>
> -
>
> In Mongo :
>
> timeslot  |timeslot_date         |
> ----------+----------------------+
> 1683527400|{2023-05-08T06:30:00Z}|
>
>
> When I use pyspark to read this  :
>
> timeslot  |timeslot_date      |
> ----------+-------------------+
> 1683527400|2023-05-07 23:30:00|
>
> -
>
> My understanding is, data in Mongo is in UTC format i.e. 2023-05-08T06:30:00Z 
> is in UTC format. I'm in PST timezone. I'm not clear why spark is reading it 
> a different timezone format (neither PST nor UTC) Note - it is not reading it 
> as PST timezone, if it was doing that it would advance the time by 7 hours, 
> instead it is doing the opposite.
>
> Where is the default timezone format taken from, when Spark is reading data 
> from MongoDB ?
>
> Any ideas on this ?
>
> tia!
>
>
>
>
>


Apache Spark not reading UTC timestamp from MongoDB correctly

2023-06-08 Thread karan alang
ref :
https://stackoverflow.com/questions/76436159/apache-spark-not-reading-utc-timestamp-from-mongodb-correctly

Hello All,
I have data stored in a MongoDB collection, and the timestamp column is not
being read by Apache Spark correctly. I'm running Apache Spark on GCP
Dataproc.

Here is sample data :

-

In Mongo :

timeslot  |timeslot_date         |
----------+----------------------+
1683527400|{2023-05-08T06:30:00Z}|


When I use pyspark to read this  :

timeslot  |timeslot_date      |
----------+-------------------+
1683527400|2023-05-07 23:30:00|

-

My understanding is that the data in Mongo is in UTC format, i.e.
2023-05-08T06:30:00Z is UTC. I'm in the PST timezone. I'm not
clear why Spark is reading it in a different timezone format (neither PST
nor UTC). Note - it is not reading it as the PST timezone; if it was doing
that it would advance the time by 7 hours, whereas instead it is doing the
opposite.

Where is the default timezone format taken from, when Spark is reading
data from MongoDB ?

Any ideas on this ?

tia!


Re: How to determine the function of tasks on each stage in an Apache Spark application?

2023-05-02 Thread Trường Trần Phan An
Hi all,

I have written a program and overridden two events onStageCompleted and
onTaskEnd. However, these two events do not provide information on when a
Task/Stage is completed.

What I want to know is which Task corresponds to which stage of a DAG (the
Spark history server only tells me how many stages a Job has and how many
Jobs a Stage has).

Can I print out the edges of the Tasks according to the DAGScheduler?
Below is the program I have written:

import org.apache.spark.rdd.RDD
import org.apache.spark.TaskContext
import org.apache.spark.sql.SparkSession
import org.apache.spark.{SparkConf, SparkContext, TaskEndReason}
import org.apache.spark.scheduler.{SparkListener,
SparkListenerEnvironmentUpdate, SparkListenerStageCompleted,
SparkListenerTaskEnd}
import scala.collection.mutable
import org.apache.spark.sql.execution.SparkPlan

class CustomListener extends SparkListener {
  override def onStageCompleted(stageCompleted:
SparkListenerStageCompleted): Unit = {
val rdds = stageCompleted.stageInfo.rddInfos
val stageInfo = stageCompleted.stageInfo
println(s"Stage ${stageInfo.stageId}")
println(s"Number of tasks: ${stageInfo.numTasks}")

stageInfo.rddInfos.foreach { rddInfo =>
  println(s"RDD ${rddInfo.id} has ${rddInfo.numPartitions} partitions.")
}
  }

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
val stageId = taskEnd.stageId
val stageAttemptId = taskEnd.stageAttemptId
val taskInfo = taskEnd.taskInfo
println(s"Task: ${taskInfo.taskId}; Stage: $stageId; Duration:
${taskInfo.duration} ms.")
  }

  def wordCount(sc: SparkContext, inputPath: String): Unit = {
val data = sc.textFile(inputPath)
val flatMap = data.flatMap(line => line.split(","))
val map = flatMap.map(word => (word, 1))
val reduceByKey = map.reduceByKey(_ + _)
reduceByKey.foreach(println)
  }
}

object Scenario1 {
  def main(args: Array[String]): Unit = {

val appName = "scenario1"
val spark = SparkSession.builder()
  .master("local[*]")
  .appName(appName)
  .getOrCreate()

val sc = spark.sparkContext
val sparkListener = new CustomListener()
sc.addSparkListener(sparkListener)
val inputPath = "s3a://data-join/file00"
sparkListener.wordCount(sc, inputPath)
sc.stop()

  }
}

Best regards,

Truong


On Sun, 16 Apr 2023 at 09:32, Trường Trần Phan An <
truong...@vlute.edu.vn> wrote:

> Dear Jacek Laskowski,
>
> Thank you for your guide. I will try it out for my problem.
>
> Best regards,
> Truong
>
>
> On Fri, 14 Apr 2023 at 21:00, Jacek Laskowski 
> wrote:
>
>> Hi,
>>
>> Start with intercepting stage completions
>> using SparkListenerStageCompleted [1]. That's Spark Core (jobs, stages and
>> tasks).
>>
>> Go up the execution chain to Spark SQL
>> with SparkListenerSQLExecutionStart [2] and SparkListenerSQLExecutionEnd
>> [3], and correlate infos.
>>
>> You may want to look at how web UI works under the covers to collect all
>> the information. Start from SQLTab that should give you what is displayed
>> (that should give you then what's needed and how it's collected).
>>
>> [1]
>> https://github.com/apache/spark/blob/8cceb3946bdfa5ceac0f2b4fe6a7c43eafb76d59/core/src/main/scala/org/apache/spark/scheduler/SparkListener.scala#L46
>> [2]
>> https://github.com/apache/spark/blob/24cdae8f3dcfc825c6c0b8ab8aa8505ae194050b/sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLListener.scala#L44
>> [3]
>> https://github.com/apache/spark/blob/24cdae8f3dcfc825c6c0b8ab8aa8505ae194050b/sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLListener.scala#L60
>> [4]
>> https://github.com/apache/spark/blob/c124037b97538b2656d29ce547b2a42209a41703/sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLTab.scala#L24
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> 
>> "The Internals Of" Online Books <https://books.japila.pl/>
>> Follow me on https://twitter.com/jaceklaskowski
>>
>> <https://twitter.com/jaceklaskowski>
>>
>>
>> On Thu, Apr 13, 2023 at 10:40 AM Trường Trần Phan An <
>> truong...@vlute.edu.vn> wrote:
>>
>>> Hi,
>>>
>>> Can you give me more details or give me a tutorial on "You'd have to
>>> intercept execution events and correlate them. Not an easy task yet doable"
>>>
>>> Thank
>>>
>>> On Wed, 12 Apr 2023 at 21:04, Jacek Laskowski <
>>> ja...@japila.pl> wrote:
>>>
>>>> Hi,
>>>>
>>>> tl;dr it's not possible to "reverse-engineer" tasks to fun

CVE-2023-32007: Apache Spark: Shell command injection via Spark UI

2023-05-02 Thread Arnout Engelen
Severity: important

Affected versions:

- Apache Spark 3.1.1 before 3.2.2

Description:

** UNSUPPORTED WHEN ASSIGNED ** The Apache Spark UI offers the possibility to 
enable ACLs via the configuration option spark.acls.enable. With an 
authentication filter, this checks whether a user has access permissions to 
view or modify the application. If ACLs are enabled, a code path in 
HttpSecurityFilter can allow someone to perform impersonation by providing an 
arbitrary user name. A malicious user might then be able to reach a permission 
check function that will ultimately build a Unix shell command based on their 
input, and execute it. This will result in arbitrary shell command execution as 
the user Spark is currently running as. This issue was disclosed earlier as 
CVE-2022-33891, but incorrectly claimed version 3.1.3 (which has since gone 
EOL) would not be affected.

NOTE: This vulnerability only affects products that are no longer supported by 
the maintainer.

Users are recommended to upgrade to a supported version of Apache Spark, such 
as version 3.4.0.

Credit:

Sven Krewitt, Flashpoint (reporter)

References:

https://www.cve.org/CVERecord?id=CVE-2022-33891
https://spark.apache.org/security.html
https://spark.apache.org/
https://www.cve.org/CVERecord?id=CVE-2023-32007


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



CVE-2023-22946: Apache Spark proxy-user privilege escalation from malicious configuration class

2023-04-15 Thread Sean R. Owen
Description:

In Apache Spark versions prior to 3.4.0, applications using spark-submit can 
specify a 'proxy-user' to run as, limiting privileges. The application can 
execute code with the privileges of the submitting user, however, by providing 
malicious configuration-related classes on the classpath. This affects 
architectures relying on proxy-user, for example those using Apache Livy to 
manage submitted applications.

This issue is being tracked as SPARK-41958 

Work Arounds:

Update to Apache Spark 3.4.0 or later, and ensure that 
spark.submit.proxyUser.allowCustomClasspathInClusterMode is set to its default 
of "false", and is not overridden by submitted applications.

Credit:

Hideyuki Furue (finder)
Yi Wu (Databricks) (remediation developer)

References:

https://spark.apache.org/
https://www.cve.org/CVERecord?id=CVE-2023-22946
https://issues.apache.org/jira/browse/SPARK-41958


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: How to determine the function of tasks on each stage in an Apache Spark application?

2023-04-14 Thread Jacek Laskowski
Hi,

Start with intercepting stage completions using SparkListenerStageCompleted
[1]. That's Spark Core (jobs, stages and tasks).

Go up the execution chain to Spark SQL with SparkListenerSQLExecutionStart
[2] and SparkListenerSQLExecutionEnd [3], and correlate infos.

You may want to look at how web UI works under the covers to collect all
the information. Start from SQLTab that should give you what is displayed
(that should give you then what's needed and how it's collected).

[1]
https://github.com/apache/spark/blob/8cceb3946bdfa5ceac0f2b4fe6a7c43eafb76d59/core/src/main/scala/org/apache/spark/scheduler/SparkListener.scala#L46
[2]
https://github.com/apache/spark/blob/24cdae8f3dcfc825c6c0b8ab8aa8505ae194050b/sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLListener.scala#L44
[3]
https://github.com/apache/spark/blob/24cdae8f3dcfc825c6c0b8ab8aa8505ae194050b/sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLListener.scala#L60
[4]
https://github.com/apache/spark/blob/c124037b97538b2656d29ce547b2a42209a41703/sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLTab.scala#L24
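For a rough, untested sketch of that correlation (assuming only the standard
listener API plus the internal job property "spark.sql.execution.id" that
Spark SQL sets on jobs launched for an execution), something like this:

```
// Correlates SQL executions -> jobs -> stages using listener events only.
import scala.collection.concurrent.TrieMap
import org.apache.spark.scheduler.{SparkListener, SparkListenerEvent, SparkListenerJobStart, SparkListenerStageCompleted}
import org.apache.spark.sql.execution.ui.{SparkListenerSQLExecutionEnd, SparkListenerSQLExecutionStart}

class SqlStageCorrelator extends SparkListener {
  private val stageToJob     = TrieMap.empty[Int, Int]     // stageId -> jobId
  private val jobToExecution = TrieMap.empty[Int, String]  // jobId   -> SQL execution id
  private val executionDescr = TrieMap.empty[Long, String] // execId  -> description

  override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
    val execId = Option(jobStart.properties)
      .flatMap(p => Option(p.getProperty("spark.sql.execution.id")))
    execId.foreach(id => jobToExecution(jobStart.jobId) = id)
    jobStart.stageIds.foreach(stageId => stageToJob(stageId) = jobStart.jobId)
  }

  override def onStageCompleted(stage: SparkListenerStageCompleted): Unit = {
    val stageId = stage.stageInfo.stageId
    val execId  = stageToJob.get(stageId).flatMap(jobToExecution.get)
    val descr   = execId.flatMap(id => executionDescr.get(id.toLong)).getOrElse("<non-SQL>")
    println(s"stage $stageId (${stage.stageInfo.name}) belongs to SQL execution " +
      s"${execId.getOrElse("-")}: $descr")
  }

  override def onOtherEvent(event: SparkListenerEvent): Unit = event match {
    case e: SparkListenerSQLExecutionStart => executionDescr(e.executionId) = e.description
    case e: SparkListenerSQLExecutionEnd   => println(s"SQL execution ${e.executionId} finished")
    case _                                 => // ignore everything else
  }
}
```

Register it with spark.sparkContext.addSparkListener(new SqlStageCorrelator())
or via spark.extraListeners, and add onTaskEnd handling if task-level
correlation is needed.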

Pozdrawiam,
Jacek Laskowski

"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski

<https://twitter.com/jaceklaskowski>


On Thu, Apr 13, 2023 at 10:40 AM Trường Trần Phan An 
wrote:

> Hi,
>
> Can you give me more details or give me a tutorial on "You'd have to
> intercept execution events and correlate them. Not an easy task yet doable"
>
> Thank
>
> On Wed, 12 Apr 2023 at 21:04, Jacek Laskowski 
> wrote:
>
>> Hi,
>>
>> tl;dr it's not possible to "reverse-engineer" tasks to functions.
>>
>> In essence, Spark SQL is an abstraction layer over RDD API that's made up
>> of partitions and tasks. Tasks are Scala functions (possibly with some
>> Python for PySpark). A simple-looking high-level operator like
>> DataFrame.join can end up with multiple RDDs, each with a set of partitions
>> (and hence tasks). What the tasks do is an implementation detail that you'd
>> have to know about by reading the source code of Spark SQL that produces
>> the "bytecode".
>>
>> Just looking at the DAG or the tasks screenshots won't give you that
>> level of detail. You'd have to intercept execution events and correlate
>> them. Not an easy task yet doable. HTH.
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> 
>> "The Internals Of" Online Books <https://books.japila.pl/>
>> Follow me on https://twitter.com/jaceklaskowski
>>
>>
>>
>> On Tue, Apr 11, 2023 at 6:53 PM Trường Trần Phan An <
>> truong...@vlute.edu.vn> wrote:
>>
>>> Hi all,
>>>
>>> I am conducting a study comparing the execution time of Bloom Filter
>>> Join operation on two environments: Apache Spark Cluster and Apache Spark.
>>> I have compared the overall time of the two environments, but I want to
>>> compare specific "tasks on each stage" to see which computation has the
>>> most significant difference.
>>>
>>> I have taken a screenshot of the DAG of Stage 0 and the list of tasks
>>> executed in Stage 0.
>>> - DAG.png
>>> - Task.png
>>>
>>> *I have questions:*
>>> 1. Can we determine which tasks are responsible for executing each step
>>> scheduled on the DAG during the processing?
>>> 2. Is it possible to know the function of each task (e.g., what is task
>>> ID 0 responsible for? What is task ID 1 responsible for? ... )?
>>>
>>> Best regards,
>>> Truong
>>>
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>


[ANNOUNCE] Apache Spark 3.2.4 released

2023-04-13 Thread Dongjoon Hyun
We are happy to announce the availability of Apache Spark 3.2.4!

Spark 3.2.4 is a maintenance release containing stability fixes. This
release is based on the branch-3.2 maintenance branch of Spark. We strongly
recommend all 3.2 users to upgrade to this stable release.

To download Spark 3.2.4, head over to the download page:
https://spark.apache.org/downloads.html

To view the release notes:
https://spark.apache.org/releases/spark-release-3-2-4.html

We would like to acknowledge all community members for contributing to this
release. This release would not have been possible without you.

Dongjoon Hyun


Re: How to determine the function of tasks on each stage in an Apache Spark application?

2023-04-13 Thread Trường Trần Phan An
Hi,

Can you give me more details or point me to a tutorial on "You'd have to
intercept execution events and correlate them. Not an easy task yet doable"?

Thank you

On Wed, Apr 12, 2023 at 21:04 Jacek Laskowski 
wrote:

> Hi,
>
> tl;dr it's not possible to "reverse-engineer" tasks to functions.
>
> In essence, Spark SQL is an abstraction layer over RDD API that's made up
> of partitions and tasks. Tasks are Scala functions (possibly with some
> Python for PySpark). A simple-looking high-level operator like
> DataFrame.join can end up with multiple RDDs, each with a set of partitions
> (and hence tasks). What the tasks do is an implementation detail that you'd
> have to know about by reading the source code of Spark SQL that produces
> the "bytecode".
>
> Just looking at the DAG or the tasks screenshots won't give you that level
> of detail. You'd have to intercept execution events and correlate them. Not
> an easy task yet doable. HTH.
>
> Pozdrawiam,
> Jacek Laskowski
> 
> "The Internals Of" Online Books <https://books.japila.pl/>
> Follow me on https://twitter.com/jaceklaskowski
>
>
>
> On Tue, Apr 11, 2023 at 6:53 PM Trường Trần Phan An <
> truong...@vlute.edu.vn> wrote:
>
>> Hi all,
>>
>> I am conducting a study comparing the execution time of Bloom Filter Join
>> operation on two environments: Apache Spark Cluster and Apache Spark. I
>> have compared the overall time of the two environments, but I want to
>> compare specific "tasks on each stage" to see which computation has the
>> most significant difference.
>>
>> I have taken a screenshot of the DAG of Stage 0 and the list of tasks
>> executed in Stage 0.
>> - DAG.png
>> - Task.png
>>
>> *I have questions:*
>> 1. Can we determine which tasks are responsible for executing each step
>> scheduled on the DAG during the processing?
>> 2. Is it possible to know the function of each task (e.g., what is task
>> ID 0 responsible for? What is task ID 1 responsible for? ... )?
>>
>> Best regards,
>> Truong
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: How to determine the function of tasks on each stage in an Apache Spark application?

2023-04-12 Thread Maytas Monsereenusorn
Hi,

I was wondering: if it's not possible to map tasks back to functions, is it
still possible to easily figure out from the UI which job and stage completed
which part of the query?
For example, in the SQL tab of the Spark UI, I am able to see the query and
the Job IDs for that query. However, when looking at the details for the
Query, how do I know which part of the execution plan was completed by
which job/stage?

Thanks,
Maytas


On Wed, Apr 12, 2023 at 7:06 AM Jacek Laskowski  wrote:

> Hi,
>
> tl;dr it's not possible to "reverse-engineer" tasks to functions.
>
> In essence, Spark SQL is an abstraction layer over RDD API that's made up
> of partitions and tasks. Tasks are Scala functions (possibly with some
> Python for PySpark). A simple-looking high-level operator like
> DataFrame.join can end up with multiple RDDs, each with a set of partitions
> (and hence tasks). What the tasks do is an implementation detail that you'd
> have to know about by reading the source code of Spark SQL that produces
> the "bytecode".
>
> Just looking at the DAG or the tasks screenshots won't give you that level
> of detail. You'd have to intercept execution events and correlate them. Not
> an easy task yet doable. HTH.
>
> Pozdrawiam,
> Jacek Laskowski
> 
> "The Internals Of" Online Books <https://books.japila.pl/>
> Follow me on https://twitter.com/jaceklaskowski
>
>
>
> On Tue, Apr 11, 2023 at 6:53 PM Trường Trần Phan An <
> truong...@vlute.edu.vn> wrote:
>
>> Hi all,
>>
>> I am conducting a study comparing the execution time of Bloom Filter Join
>> operation on two environments: Apache Spark Cluster and Apache Spark. I
>> have compared the overall time of the two environments, but I want to
>> compare specific "tasks on each stage" to see which computation has the
>> most significant difference.
>>
>> I have taken a screenshot of the DAG of Stage 0 and the list of tasks
>> executed in Stage 0.
>> - DAG.png
>> - Task.png
>>
>> *I have questions:*
>> 1. Can we determine which tasks are responsible for executing each step
>> scheduled on the DAG during the processing?
>> 2. Is it possible to know the function of each task (e.g., what is task
>> ID 0 responsible for? What is task ID 1 responsible for? ... )?
>>
>> Best regards,
>> Truong
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: How to determine the function of tasks on each stage in an Apache Spark application?

2023-04-12 Thread Jacek Laskowski
Hi,

tl;dr it's not possible to "reverse-engineer" tasks to functions.

In essence, Spark SQL is an abstraction layer over RDD API that's made up
of partitions and tasks. Tasks are Scala functions (possibly with some
Python for PySpark). A simple-looking high-level operator like
DataFrame.join can end up with multiple RDDs, each with a set of partitions
(and hence tasks). What the tasks do is an implementation detail that you'd
have to know about by reading the source code of Spark SQL that produces
the "bytecode".

Just looking at the DAG or the tasks screenshots won't give you that level
of detail. You'd have to intercept execution events and correlate them. Not
an easy task yet doable. HTH.
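
One way to see this layering is a minimal local sketch (the data, the column
names and the disabled broadcast-join threshold are illustrative; the
threshold is only lowered to force a shuffle so that several RDDs and stage
boundaries show up):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("join-lineage-sketch")
      .master("local[*]")
      .config("spark.sql.autoBroadcastJoinThreshold", "-1") // force a shuffle join
      .getOrCreate()
    import spark.implicits._

    val left  = Seq((1, "a"), (2, "b")).toDF("id", "l")
    val right = Seq((1, "x"), (3, "y")).toDF("id", "r")

    val joined = left.join(right, "id")

    // The physical plan Spark SQL generates for the high-level join operator.
    joined.explain()

    // The underlying RDD lineage: the shuffle boundaries in this output are
    // where the scheduler cuts stages, and each partition becomes a task.
    println(joined.rdd.toDebugString)

Reading the two outputs side by side shows how a single DataFrame.join expands
into several RDDs, partitions and hence tasks, without telling you which task
did what.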

Pozdrawiam,
Jacek Laskowski

"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski



On Tue, Apr 11, 2023 at 6:53 PM Trường Trần Phan An 
wrote:

> Hi all,
>
> I am conducting a study comparing the execution time of Bloom Filter Join
> operation on two environments: Apache Spark Cluster and Apache Spark. I
> have compared the overall time of the two environments, but I want to
> compare specific "tasks on each stage" to see which computation has the
> most significant difference.
>
> I have taken a screenshot of the DAG of Stage 0 and the list of tasks
> executed in Stage 0.
> - DAG.png
> - Task.png
>
> *I have questions:*
> 1. Can we determine which tasks are responsible for executing each step
> scheduled on the DAG during the processing?
> 2. Is it possible to know the function of each task (e.g., what is task ID
> 0 responsible for? What is task ID 1 responsible for? ... )?
>
> Best regards,
> Truong
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org


Re: Help me learn about JOB TASK and DAG in Apache Spark

2023-04-01 Thread Mich Talebzadeh
Good stuff Khalid.

I have created a section in the Apache Spark Community Slack called
spark-foundation: spark-foundation - Apache Spark Community - Slack
<https://app.slack.com/client/T04URTRBZ1R/C051CL5T1KL/thread/C0501NBTNQG-1680132989.091199>

I invite you to add your weblink to that section.

HTH
Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Sat, 1 Apr 2023 at 13:12, Khalid Mammadov 
wrote:

> Hey AN-TRUONG
>
> I have got some articles about this subject that should help.
> E.g.
> https://khalidmammadov.github.io/spark/spark_internals_rdd.html
>
> Also check other Spark Internals on web.
>
> Regards
> Khalid
>
> On Fri, 31 Mar 2023, 16:29 AN-TRUONG Tran Phan, 
> wrote:
>
>> Thank you for the information.
>>
>> I have tracked the Spark history server on port 18080 and the Spark UI on
>> port 4040. The results from these two tools look similar, right?
>>
>> I want to know what each Task ID in the images does (for example, Task ID
>> 0, 1, 3, 4, 5, ...). Is that possible?
>> https://i.stack.imgur.com/Azva4.png
>>
>> Best regards,
>>
>> An - Truong
>>
>>
>> On Fri, Mar 31, 2023 at 9:38 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Are you familiar with the Spark GUI, which runs on port 4040 by default?
>>>
>>> have a look.
>>>
>>> HTH
>>>
>>> Mich Talebzadeh,
>>> Lead Solutions Architect/Engineering Lead
>>> Palantir Technologies Limited
>>>
>>>
>>>view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Fri, 31 Mar 2023 at 15:15, AN-TRUONG Tran Phan <
>>> tr.phan.tru...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I am learning about Apache Spark and want to know the meaning of each
>>>> task created in the jobs recorded in the Spark history server.
>>>>
>>>> For example, the application I wrote creates 17 jobs; job 0 runs for 10
>>>> minutes and contains 2384 small tasks, and I want to understand what
>>>> those 2384 tasks do. Is that possible?
>>>>
>>>> I found a picture of the DAG in the Jobs tab and want to know the
>>>> relationship between the DAG and the tasks. Is that possible
>>>> (specifically, for the attached DAG file and the 2384 tasks below)?
>>>>
>>>> Thank you very much, have a nice day everyone.
>>>>
>>>> Best regards,
>>>>
>>>> An-Trường.
>>>>
>>>> -
>>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>
>>>
>>
>> --
>> Best regards,
>>
>> An Trường.
>>
>

