[ANNOUNCE] Announcing Apache Spark 4.0.0-preview1

2024-06-03 Thread Wenchen Fan
Hi all,

To enable wide-scale community testing of the upcoming Spark 4.0 release,
the Apache Spark community has posted a preview release of Spark 4.0. This
preview is not a stable release in terms of either API or functionality,
but it is meant to give the community early access to try the code that
will become Spark 4.0. If you would like to test the release, please
download it, and send feedback using either the mailing lists or JIRA.

There are a lot of exciting new features added to Spark 4.0, including ANSI
mode by default, Python data source, polymorphic Python UDTF, string
collation support, new VARIANT data type, streaming state store data
source, structured logging, Java 17 by default, and many more.

We'd like to thank our contributors and users for their contributions and
early feedback to this release. This release would not have been possible
without you.

To download Spark 4.0.0-preview1, head over to the download page:
https://archive.apache.org/dist/spark/spark-4.0.0-preview1 . It's also
available in PyPI, with version name "4.0.0.dev1".

Thanks,

Wenchen


Re: Terabytes data processing via Glue

2024-06-03 Thread Russell Jurney
You could use either Glue or Spark for your job. Use what you’re more
comfortable with.

Thanks,
Russell Jurney @rjurney 
russell.jur...@gmail.com LI  FB
 datasyndrome.com


On Sun, Jun 2, 2024 at 9:59 PM Perez  wrote:

> Hello,
>
> Can I get some suggestions?
>
> On Sat, Jun 1, 2024 at 1:18 PM Perez  wrote:
>
>> Hi Team,
>>
>> I am planning to load and process around 2 TB historical data. For that
>> purpose I was planning to go ahead with Glue.
>>
>> So is it ok if I use glue if I calculate my DPUs needed correctly? or
>> should I go with EMR.
>>
>> This will be a one time activity.
>>
>>
>> TIA
>>
>