Re: [ANNOUNCE] Apache Spark 3.2.1 released

2022-01-28 Thread FengYu Cao
https://spark.apache.org/downloads.html

The *2. Choose a package type:* menu shows "Pre-built for Hadoop 3.3",

but the download link is *spark-3.2.1-bin-hadoop3.2.tgz*.

Does this need an update?

On Sat, Jan 29, 2022 at 2:26 PM L. C. Hsieh  wrote:

> Thanks Huaxin for the 3.2.1 release!
>
> On Fri, Jan 28, 2022 at 10:14 PM Dongjoon Hyun 
> wrote:
> >
> > Thank you again, Huaxin!
> >
> > Dongjoon.
> >
> > On Fri, Jan 28, 2022 at 6:23 PM DB Tsai  wrote:
> >>
> >> Thank you, Huaxin for the 3.2.1 release!
> >>
> >> Sent from my iPhone
> >>
> >> On Jan 28, 2022, at 5:45 PM, Chao Sun  wrote:
> >>
> >> 
> >> Thanks Huaxin for driving the release!
> >>
> >> On Fri, Jan 28, 2022 at 5:37 PM Ruifeng Zheng 
> wrote:
> >>>
> >>> It's Great!
> >>> Congrats and thanks, huaxin!
> >>>
> >>>
> >>> -- Original Message --
> >>> From: "huaxin gao" ;
> >>> Sent: Saturday, Jan 29, 2022, 9:07 AM
> >>> To: "dev";"user";
> >>> Subject: [ANNOUNCE] Apache Spark 3.2.1 released
> >>>
> >>> We are happy to announce the availability of Spark 3.2.1!
> >>>
> >>> Spark 3.2.1 is a maintenance release containing stability fixes. This
> >>> release is based on the branch-3.2 maintenance branch of Spark. We
> strongly
> >>> recommend all 3.2 users to upgrade to this stable release.
> >>>
> >>> To download Spark 3.2.1, head over to the download page:
> >>> https://spark.apache.org/downloads.html
> >>>
> >>> To view the release notes:
> >>> https://spark.apache.org/releases/spark-release-3-2-1.html
> >>>
> >>> We would like to acknowledge all community members for contributing to
> this
> >>> release. This release would not have been possible without you.
> >>>
> >>> Huaxin Gao
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>

-- 
*camper42 (曹丰宇)*
Douban, Inc.

Mobile: +86 15691996359
E-mail:  camper.x...@gmail.com


Re: [ANNOUNCE] Apache Spark 3.2.1 released

2022-01-28 Thread L. C. Hsieh
Thanks Huaxin for the 3.2.1 release!

On Fri, Jan 28, 2022 at 10:14 PM Dongjoon Hyun  wrote:
>
> Thank you again, Huaxin!
>
> Dongjoon.
>
> On Fri, Jan 28, 2022 at 6:23 PM DB Tsai  wrote:
>>
>> Thank you, Huaxin for the 3.2.1 release!
>>
>> Sent from my iPhone
>>
>> On Jan 28, 2022, at 5:45 PM, Chao Sun  wrote:
>>
>> 
>> Thanks Huaxin for driving the release!
>>
>> On Fri, Jan 28, 2022 at 5:37 PM Ruifeng Zheng  wrote:
>>>
>>> It's Great!
>>> Congrats and thanks, huaxin!
>>>
>>>
>>> -- Original Message --
>>> From: "huaxin gao" ;
>>> Sent: Saturday, Jan 29, 2022, 9:07 AM
>>> To: "dev";"user";
>>> Subject: [ANNOUNCE] Apache Spark 3.2.1 released
>>>
>>> We are happy to announce the availability of Spark 3.2.1!
>>>
>>> Spark 3.2.1 is a maintenance release containing stability fixes. This
>>> release is based on the branch-3.2 maintenance branch of Spark. We strongly
>>> recommend all 3.2 users to upgrade to this stable release.
>>>
>>> To download Spark 3.2.1, head over to the download page:
>>> https://spark.apache.org/downloads.html
>>>
>>> To view the release notes:
>>> https://spark.apache.org/releases/spark-release-3-2-1.html
>>>
>>> We would like to acknowledge all community members for contributing to this
>>> release. This release would not have been possible without you.
>>>
>>> Huaxin Gao

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: [ANNOUNCE] Apache Spark 3.2.1 released

2022-01-28 Thread Dongjoon Hyun
Thank you again, Huaxin!

Dongjoon.

On Fri, Jan 28, 2022 at 6:23 PM DB Tsai  wrote:

> Thank you, Huaxin for the 3.2.1 release!
>
> Sent from my iPhone
>
> On Jan 28, 2022, at 5:45 PM, Chao Sun  wrote:
>
> 
> Thanks Huaxin for driving the release!
>
> On Fri, Jan 28, 2022 at 5:37 PM Ruifeng Zheng 
> wrote:
>
>> It's Great!
>> Congrats and thanks, huaxin!
>>
>>
>> -- Original Message --
>> *From:* "huaxin gao" ;
>> *Sent:* Saturday, Jan 29, 2022, 9:07 AM
>> *To:* "dev";"user";
>> *Subject:* [ANNOUNCE] Apache Spark 3.2.1 released
>>
>> We are happy to announce the availability of Spark 3.2.1!
>>
>> Spark 3.2.1 is a maintenance release containing stability fixes. This
>> release is based on the branch-3.2 maintenance branch of Spark. We
>> strongly
>> recommend all 3.2 users to upgrade to this stable release.
>>
>> To download Spark 3.2.1, head over to the download page:
>> https://spark.apache.org/downloads.html
>>
>> To view the release notes:
>> https://spark.apache.org/releases/spark-release-3-2-1.html
>>
>> We would like to acknowledge all community members for contributing to
>> this
>> release. This release would not have been possible without you.
>>
>> Huaxin Gao
>>
>


[mongo-spark-connector] How can I improve the performance of Mongo spark write?

2022-01-28 Thread sj p

Hello, I'm having a performance problem writing with the MongoDB Spark
connector.

I want to write 0.3M records, each a string plus a 4 kB hex binary, through
the Mongo Spark connector:
spark = create_c3s_spark_session(app_name, spark_config=[
    ("spark.executor.cores", "1"),
    ("spark.executor.memory", "6g"),
    ("spark.executor.instances", "50"),
    ("spark.archives", f'{GIT_SOURCE_BASE}/{pyenv path}.tar.gz#environment'),
    ("spark.jars.packages",
     "org.mongodb.spark:mongo-spark-connector_2.12:3.0.1"),
    ("spark.mongodb.output.uri", default_mongodb_uri),
], c3s_username=c3s_username)
However, the write takes 30 hours with executor instances set to 1, and it
still takes 30 hours with instances set to 50:
writer = (list_df.write.format("mongo")
          .mode("append")
          .option("database", mongo_config.get('database'))
          .option("collection", f'{collection_name}'))
writer.save()
I don't understand it: the total data size is only 1.2 GB, yet it takes the
same time even when the number of instances is increased.

The strange thing is that with a 400 B hex binary instead of 4 kB, the job
completes within an hour, and increasing the instances clearly reduces the
time required.

What actions are needed to address performance issues?
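
For reference, a minimal sketch of two knobs commonly tried in this
situation, assuming the mongo-spark-connector 3.0.x write options
(maxBatchSize and ordered are documented connector options; the partition
count and batch size below are illustrative assumptions, not tuned values):

# Sketch only: spread the write across all executors and batch more
# documents per bulk insert. The numbers are assumptions to adjust.
writer = (list_df
          .repartition(150)                # ~3x the 50 single-core executors
          .write.format("mongo")
          .mode("append")
          .option("database", mongo_config.get('database'))
          .option("collection", collection_name)
          .option("maxBatchSize", "1024")  # docs per bulk write; default 512
          .option("ordered", "false"))     # allow unordered bulk inserts
writer.save()

If the Spark UI shows only a handful of tasks in the save stage, the
DataFrame had too few partitions for 50 executors to matter; if the tasks
are spread evenly and still slow, the bottleneck is more likely on the
MongoDB side (4 kB documents, write concern, indexes), which no Spark
setting will fix.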

Re: [ANNOUNCE] Apache Spark 3.2.1 released

2022-01-28 Thread Bitfox
Is there a guide for upgrading from 3.2.0 to 3.2.1?

thanks

On Sat, Jan 29, 2022 at 9:14 AM huaxin gao  wrote:

> We are happy to announce the availability of Spark 3.2.1!
>
> Spark 3.2.1 is a maintenance release containing stability fixes. This
> release is based on the branch-3.2 maintenance branch of Spark. We strongly
> recommend all 3.2 users to upgrade to this stable release.
>
> To download Spark 3.2.1, head over to the download page:
> https://spark.apache.org/downloads.html
>
> To view the release notes:
> https://spark.apache.org/releases/spark-release-3-2-1.html
>
> We would like to acknowledge all community members for contributing to this
> release. This release would not have been possible without you.
>
> Huaxin Gao
>


Re: [ANNOUNCE] Apache Spark 3.2.1 released

2022-01-28 Thread DB Tsai
Thank you, Huaxin for the 3.2.1 release!

Sent from my iPhone

> On Jan 28, 2022, at 5:45 PM, Chao Sun  wrote:
> 
> 
> Thanks Huaxin for driving the release!
> 
>> On Fri, Jan 28, 2022 at 5:37 PM Ruifeng Zheng  wrote:
>> It's Great!
>> Congrats and thanks, huaxin!
>> 
>> 
>> -- Original Message --
>> From: "huaxin gao" ;
>> Sent: Saturday, Jan 29, 2022, 9:07 AM
>> To: "dev";"user";
>> Subject: [ANNOUNCE] Apache Spark 3.2.1 released
>> 
>> We are happy to announce the availability of Spark 3.2.1!
>> 
>> Spark 3.2.1 is a maintenance release containing stability fixes. This
>> release is based on the branch-3.2 maintenance branch of Spark. We strongly
>> recommend all 3.2 users to upgrade to this stable release.
>> 
>> To download Spark 3.2.1, head over to the download page:
>> https://spark.apache.org/downloads.html
>> 
>> To view the release notes:
>> https://spark.apache.org/releases/spark-release-3-2-1.html
>> 
>> We would like to acknowledge all community members for contributing to this
>> release. This release would not have been possible without you.
>> 
>> Huaxin Gao


Re: [ANNOUNCE] Apache Spark 3.2.1 released

2022-01-28 Thread Chao Sun
Thanks Huaxin for driving the release!

On Fri, Jan 28, 2022 at 5:37 PM Ruifeng Zheng  wrote:

> It's Great!
> Congrats and thanks, huaxin!
>
>
> -- Original Message --
> *From:* "huaxin gao" ;
> *Sent:* Saturday, Jan 29, 2022, 9:07 AM
> *To:* "dev";"user";
> *Subject:* [ANNOUNCE] Apache Spark 3.2.1 released
>
> We are happy to announce the availability of Spark 3.2.1!
>
> Spark 3.2.1 is a maintenance release containing stability fixes. This
> release is based on the branch-3.2 maintenance branch of Spark. We strongly
> recommend all 3.2 users to upgrade to this stable release.
>
> To download Spark 3.2.1, head over to the download page:
> https://spark.apache.org/downloads.html
>
> To view the release notes:
> https://spark.apache.org/releases/spark-release-3-2-1.html
>
> We would like to acknowledge all community members for contributing to this
> release. This release would not have been possible without you.
>
> Huaxin Gao
>


Re: [ANNOUNCE] Apache Spark 3.2.1 released

2022-01-28 Thread Ruifeng Zheng
It's Great!
Congrats and thanks, huaxin!




-- Original Message --
From: "huaxin gao"

We are happy to announce the availability of Spark 3.2.1!

Spark 3.2.1 is a maintenance release containing stability fixes. This
release is based on the branch-3.2 maintenance branch of Spark. We strongly
recommend all 3.2 users to upgrade to this stable release.

To download Spark 3.2.1, head over to the download page:
https://spark.apache.org/downloads.html

To view the release notes:
https://spark.apache.org/releases/spark-release-3-2-1.html

We would like to acknowledge all community members for contributing to this
release. This release would not have been possible without you.


Huaxin Gao

Re: [ANNOUNCE] Apache Spark 3.2.1 released

2022-01-28 Thread Yuming Wang
Thank you Huaxin.

On Sat, Jan 29, 2022 at 9:08 AM huaxin gao  wrote:

> We are happy to announce the availability of Spark 3.2.1!
>
> Spark 3.2.1 is a maintenance release containing stability fixes. This
> release is based on the branch-3.2 maintenance branch of Spark. We strongly
> recommend all 3.2 users to upgrade to this stable release.
>
> To download Spark 3.2.1, head over to the download page:
> https://spark.apache.org/downloads.html
>
> To view the release notes:
> https://spark.apache.org/releases/spark-release-3-2-1.html
>
> We would like to acknowledge all community members for contributing to this
> release. This release would not have been possible without you.
>
> Huaxin Gao
>


[ANNOUNCE] Apache Spark 3.2.1 released

2022-01-28 Thread huaxin gao
We are happy to announce the availability of Spark 3.2.1!

Spark 3.2.1 is a maintenance release containing stability fixes. This
release is based on the branch-3.2 maintenance branch of Spark. We strongly
recommend all 3.2 users to upgrade to this stable release.

To download Spark 3.2.1, head over to the download page:
https://spark.apache.org/downloads.html

To view the release notes:
https://spark.apache.org/releases/spark-release-3-2-1.html

We would like to acknowledge all community members for contributing to this
release. This release would not have been possible without you.

Huaxin Gao


Kafka to spark streaming

2022-01-28 Thread Amit Sharma
Hello everyone, we have a Spark Streaming application. We send requests to
the stream through an Akka actor via a Kafka topic, and we wait for the
response since it is real time. I just want a suggestion: is there any
better option, like Livy, for sending requests to and receiving responses
from Spark Streaming?


Thanks
Amit


Re: Small optimization questions

2022-01-28 Thread Gourav Sengupta
Hi,

have you looked at the row_number function in Spark SQL, or the Spark
functions API?

A larger data volume also means that you may run into memory issues.
Therefore, as Sean mentioned, your scale-in/scale-out rule has to be
responsive, along with the number of cores per task, to keep memory under
control.

We do not have any other data regarding your clusters or environments, so
it is difficult to picture the setup and give more specific advice.

Regards,
Gourav Sengupta

On Thu, Jan 27, 2022 at 12:58 PM Aki Riisiö  wrote:

> Ah, sorry for spamming, I found the answer from documentation. Thank you
> for the clarification!
>
> Best regards, Aki Riisiö
>
> On Thu, 27 Jan 2022 at 10:39, Aki Riisiö  wrote:
>
>> Hello.
>>
>> Thank you for the reply again. I just checked how many tasks are spawned
>> when we read the data from S3, and in the latest run this was a little over
>> 23000. What determines the number of tasks during the read? Does it
>> correspond directly to the number of files to be read?
>>
>> Thank you.
>>
>> On Tue, 25 Jan 2022 at 17:35, Sean Owen  wrote:
>>
>>> Yes, you will end up with 80 partitions, and if you write the result,
>>> you end up with 80 files. If you don't have at least 80 partitions, there
>>> is no point in having 80 cores. You will probably see 56 of them idle even
>>> under load.
>>> The partitionBy might end up causing the whole job to have more
>>> partitions anyway. I would settle this by actually watching how many tasks
>>> the streaming job spawns. Is it 1, 24, more?
>>>
>>> On Tue, Jan 25, 2022 at 7:57 AM Aki Riisiö  wrote:
>>>
 Thank you for the reply.
 The stream is partitioned by year/month/day/hour, and we read the data
 once a day, so we are reading 24 partitions.

 " A crude rule of thumb is to have 2-3x as many tasks as cores" thank
 you very much, I will set this as default. Will this however change, if we
 also partition the data by year/month/day/hour? If I set:
 df.repartition(80),write ... partitionBy("year", "month", "day",
 "hour"), will this cause each hour to have 80 output files?

 The output data in a "normal" run is very small, so a high repartition
 value would result in a large number of files that are too small.
 I am not sure how Glue autoscales itself, but I definitely need to look
 that up a bit more.

 One of our jobs actually has a requirement to produce only one output
 file; is repartition(1) the only way to achieve that? As I understand it,
 this is a major performance issue.

 Thank you!


 On Tue, 25 Jan 2022 at 15:29, Sean Owen  wrote:

> How many partitions does the stream have? With 80 cores, you need at
> least 80 tasks to even take advantage of them, so if it's less than 80, at
> least .repartition(80). A crude rule of thumb is to have 2-3x as many 
> tasks
> as cores, to help even out differences in task size by more finely
> distributing the work. You might even go for more. I'd watch the task
> length, and as long as the tasks aren't completing in a few seconds or
> less, you probably don't have too many.
>
> This is also a good reason to use autoscaling, so that when not busy
> you can (for example) scale down to 1 executor, but under load, scale up 
> to
> 10 or 20 machines if needed. That is also a good reason to repartition
> more, so that it's possible to take advantage of more parallelism when
> needed.
>
> On Tue, Jan 25, 2022 at 7:07 AM Aki Riisiö 
> wrote:
>
>> Hello.
>>
>> We have a very simple AWS Glue job running with Spark: get some
>> events from Kafka stream, do minor transformations, and write to S3.
>>
>> Recently, there was a change in Kafka topic which suddenly increased
>> our data size * 10 and at the same time we were testing with different
>> repartition values during df.repartition(n).write ...
>> At the time when Kafka started sending an increased volume of data,
>> we didn't actually have the repartition value set in our write.
>> Suddenly, our Glue job (or save at NativeMethodAccessorImpl.java:0)
>> jumped from 2h to 10h. Here are some details of the save stage from 
>> SparkUI:
>> - Only 5 executors, each able to run 16 tasks in parallel
>> - 10500 tasks (job is still running...) with median task duration = 2.6 min
>> and median GC time = 2 s
>> - Input size per executor is 9 GB and output is 4.5 GB
>> - executor memory is 20GB
>>
>> My question is: now that we're trying to find a proper value for
>> repartition, what would be the optimal value here? Our data volume was 
>> not
>> expected to go this high, but there are times when it might be. As this 
>> job
>> is running in AWS Glue, should I also consider setting the executor 
>> amount,
>> cores, and memory manually? I think Glue is actually setting those based 
>> on
>> 
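
For reference, a minimal sketch of the sizing advice from this thread,
assuming PySpark; the 3x multiplier, the S3 paths, and the coalesce(1)
variant are illustrative assumptions, not Glue-specific guidance:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("s3://bucket/input")  # hypothetical input path

# Rule of thumb from the thread: 2-3x as many tasks as total cores,
# so differences in task size even out across the cluster.
total_cores = spark.sparkContext.defaultParallelism  # roughly total cores
(df.repartition(total_cores * 3)
   .write
   .partitionBy("year", "month", "day", "hour")
   .mode("append")
   .parquet("s3://bucket/output"))  # hypothetical output path

# Note: repartition(80) before partitionBy(...) yields up to 80 files per
# year/month/day/hour directory, since each task writes one file for every
# hour value it holds rows for.

# For the single-output-file requirement, coalesce(1) avoids the full
# shuffle that repartition(1) triggers, but the final write is still a
# single task, so it only suits small outputs.
df.coalesce(1).write.mode("overwrite").parquet("s3://bucket/one-file")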

Re: how can I remove the warning message

2022-01-28 Thread Mich Talebzadeh
you can try

spark.sparkContext.setLogLevel("ERROR")
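
Note that setLogLevel controls Spark's log4j output, while the "illegal
reflective access" WARNINGs quoted below are printed by the JVM itself on
Java 9+, so they may survive it. A sketch of a commonly used JVM-side
workaround, assuming the --add-opens target matches the package named in
the warning:

# Sketch only: open java.nio to reflective access so the JVM stops
# warning about it. JVM options must be in place before the driver JVM
# starts (spark-defaults.conf, --driver-java-options, or spark-submit
# --conf); a new session's builder is used here purely for illustration.
from pyspark.sql import SparkSession

opens = "--add-opens=java.base/java.nio=ALL-UNNAMED"
spark = (SparkSession.builder
         .config("spark.driver.extraJavaOptions", opens)
         .config("spark.executor.extraJavaOptions", opens)
         .getOrCreate())
spark.sparkContext.setLogLevel("ERROR")  # silence Spark's own logging too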


HTH


   view my Linkedin profile




*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Fri, 28 Jan 2022 at 11:14,  wrote:

> When I submitted the job from scala client, I got the warning messages:
>
> WARNING: An illegal reflective access operation has occurred
> WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform
> (file:/opt/spark/jars/spark-unsafe_2.12-3.2.0.jar) to constructor
> java.nio.DirectByteBuffer(long,int)
> WARNING: Please consider reporting this to the maintainers of
> org.apache.spark.unsafe.Platform
> WARNING: Use --illegal-access=warn to enable warnings of further illegal
> reflective access operations
> WARNING: All illegal access operations will be denied in a future
> release
>
> How can I just remove those messages?
>
> spark: 3.2.0
> scala: 2.13.7
>
> Thank you.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


how can I remove the warning message

2022-01-28 Thread capitnfrakass

When I submitted the job from scala client, I got the warning messages:

WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform 
(file:/opt/spark/jars/spark-unsafe_2.12-3.2.0.jar) to constructor 
java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of 
org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal 
reflective access operations
WARNING: All illegal access operations will be denied in a future 
release


How can I just remove those messages?

spark: 3.2.0
scala: 2.13.7

Thank you.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org