Re: [Java 17] --add-exports required?

2022-06-22 Thread Yang,Jie(INF)
Hi, Greg "--add-exports java.base/sun.nio.ch=ALL-UNNAMED " does not need to be added when SPARK-33772 is completed, so in order to answer your question, I need more details for testing: 1. Where can I download Java 17 (Temurin-17+35)? 2. What test commands do you use? Yang Jie 在 2022/6/23

Need help with the configuration for AWS glue jobs

2022-06-22 Thread Sid
Hi Team, could anyone help me with the problem below: https://stackoverflow.com/questions/72724999/how-to-calculate-number-of-g-1-workers-in-aws-glue-for-processing-1tb-data Thanks, Sid

Re: Will it lead to OOM error?

2022-06-22 Thread Sid
Thanks all for your answers. Much appreciated. On Thu, Jun 23, 2022 at 6:07 AM Yong Walt wrote: > We have many cases like this. It won't cause OOM. > > Thanks > > On Wed, Jun 22, 2022 at 8:28 PM Sid wrote: >> I have a 150TB CSV file. >> >> I have a total of 100 TB RAM and 100TB disk. So if I

[Java 17] --add-exports required?

2022-06-22 Thread Greg Kopff
Hi. According to the release notes[1], and specifically the ticket Build and Run Spark on Java 17 (SPARK-33772)[2], Spark now supports running on Java 17. However, using Java 17 (Temurin-17+35) with Maven (3.8.6) and maven-surefire-plugin (3.0.0-M7), when running a unit test that uses Spark
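
For context, a sketch of the workaround under discussion, assuming the unit tests launch Spark inside the JVM forked by surefire (the plugin coordinates come from the original post; whether this flag is still required is exactly what the thread is trying to establish). A surefire-forked test JVM does not receive the module flags that the spark-submit/spark-class scripts normally inject, so the flag from the error can be passed explicitly:

    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-surefire-plugin</artifactId>
      <version>3.0.0-M7</version>
      <configuration>
        <!-- flag from the reported error; Spark's launcher scripts add
             this (and related --add-opens flags) automatically, but a
             surefire-forked JVM does not go through those scripts -->
        <argLine>--add-exports=java.base/sun.nio.ch=ALL-UNNAMED</argLine>
      </configuration>
    </plugin>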

Re: Will it lead to OOM error?

2022-06-22 Thread Yong Walt
We have many cases like this. It won't cause OOM. Thanks On Wed, Jun 22, 2022 at 8:28 PM Sid wrote: > I have a 150TB CSV file. > > I have a total of 100 TB RAM and 100TB disk. So if I do something like this > > spark.read.option("header","true").csv(filepath).show(false) > > Will it lead to an

StructuredStreaming - read from Kafka, writing data into Mongo every 10 minutes

2022-06-22 Thread karan alang
Hello All, I have data in a Kafka topic (data published every 10 mins) and I'm planning to read this data using Apache Spark Structured Streaming (batch mode) and push it into MongoDB. Please note: this will be scheduled using Composer/Airflow on GCP, which will create a Dataproc cluster, run the spark
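
A minimal sketch of that pattern, assuming Spark 3.3+ (for Trigger.AvailableNow; Trigger.Once on older versions) and the MongoDB Spark connector 10.x on the classpath; the brokers, topic, checkpoint path, and Mongo URI are placeholders:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.streaming.Trigger

    val kafkaDf = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092") // placeholder
      .option("subscribe", "my-topic")                  // placeholder
      .load()
      .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

    // Each scheduled run drains whatever is currently in the topic and
    // stops; the checkpoint remembers the consumed offsets across runs.
    val query = kafkaDf.writeStream
      .trigger(Trigger.AvailableNow())
      .option("checkpointLocation", "gs://bucket/checkpoints/kafka-to-mongo")
      .foreachBatch { (batch: DataFrame, _: Long) =>
        batch.write
          .format("mongodb") // connector 10.x; older connectors use "mongo"
          .option("spark.mongodb.write.connection.uri",
            "mongodb://host:27017/db.collection") // placeholder
          .mode("append")
          .save()
      }
      .start()
    query.awaitTermination()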

Re: repartition(n) should be deprecated/alerted

2022-06-22 Thread Igor Berman
I'd argue it's strange and unexpected. I understand there are precision issues here, but I'm fine that the result might be slightly different each time for the specific column. What I'm not expecting (as an end user, for sure) is that a presumably trivial computation might, under retry scenarios, cause a few

Re: Will it lead to OOM error?

2022-06-22 Thread Enrico Minack
Yes, a single file compressed with a non-splittable compression (e.g. gzip) would have to be read by a single executor. That takes forever. You should consider recompressing the file with a splittable compression first. You will not want to read that file more than once, so you should
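
A sketch of the one-time recompression being suggested (paths are placeholders): pay the single-threaded read once, then work with a splittable copy.

    // One gzip file cannot be split, so this read runs on a single executor.
    val df = spark.read.option("header", "true").csv("/data/huge.csv.gz")

    // Write a splittable, compressed copy once; later jobs read it in parallel.
    df.write.option("compression", "snappy").parquet("/data/huge.parquet")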

Re: Will it lead to OOM error?

2022-06-22 Thread Sid
Hi Enrico, Thanks for the insights. Could you please help me understand, with one example, which compressed files wouldn't be split into partitions, would put the load on a single partition, and might lead to an OOM error? Thanks, Sid On Wed, Jun 22, 2022 at 6:40 PM Enrico Minack wrote:
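
As a concrete illustration (a sketch with placeholder paths): a gzip-compressed CSV lands in a single partition, so one executor carries the whole file, while the same data uncompressed is split across many partitions.

    val gz = spark.read.option("header", "true").csv("/data/big.csv.gz")
    println(gz.rdd.getNumPartitions)    // 1 -- gzip is not splittable

    val plain = spark.read.option("header", "true").csv("/data/big.csv")
    println(plain.rdd.getNumPartitions) // many, one per input split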

Re: repartition(n) should be deprecated/alerted

2022-06-22 Thread Sean Owen
Eh, there is a huge caveat - you are making your input non-deterministic, where determinism is assumed. I don't think that supports such a drastic statement. On Wed, Jun 22, 2022 at 12:39 PM Igor Berman wrote: > Hi All > tldr; IMHO repartition(n) should be deprecated or red-flagged, so that >

repartition(n) should be deprecated/alerted

2022-06-22 Thread Igor Berman
Hi All, tl;dr: IMHO repartition(n) should be deprecated or red-flagged, so that everybody will understand the consequences of using this method. Following the conversation in https://issues.apache.org/jira/browse/SPARK-38388 (still relevant for recent versions of Spark), I think it's very important to
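
A sketch of the hazard described in SPARK-38388 (column names are illustrative): repartition(n) uses round-robin partitioning, so a row's target partition depends on the order rows arrive in; if the input is non-deterministic and a task is retried, the rerun can place rows differently than the first attempt, duplicating or dropping rows downstream.

    import org.apache.spark.sql.functions.{col, rand}

    // Non-deterministic input: values (and hence row order) can differ
    // between task attempts.
    val df = spark.read.parquet("/data/in").withColumn("score", rand())

    // Hazardous under retries: round-robin placement depends on row order.
    val risky = df.repartition(200)

    // Safer: hash-partition by a deterministic key instead.
    val byKey = df.repartition(200, col("id"))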

[Spark Dataframe] How to load compressed file? (lz4, snappy)

2022-06-22 Thread HelloWorld
Hello. I am a developer who is learning Spark programming. I am asking for help because it is difficult for me to solve the current problem on my own. My development environment is as follows. ---
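
A sketch of the usual approach, with one caveat worth stating up front: Hadoop chooses the codec from the file extension, and its lz4/snappy framing differs from that of the standalone lz4/snappy command-line tools, so files compressed outside the Hadoop ecosystem may fail to decode (path is a placeholder).

    // Decodes only if the file was written with Hadoop's Lz4Codec/SnappyCodec;
    // the extension (.lz4 / .snappy) is what selects the codec.
    val df = spark.read.option("header", "true").csv("/data/part-00000.csv.lz4")
    df.show()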

Re: Will it lead to OOM error?

2022-06-22 Thread Enrico Minack
The RAM and disk memory consumption depends on what you do with the data after reading it. Your particular action will read 20 lines from the first partition and show them. So it will not use any RAM or disk, no matter how large the CSV is. If you do a count instead of a show, it will
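
A sketch of the distinction being drawn (placeholder path):

    val df = spark.read.option("header", "true").csv("/data/huge.csv")

    // Needs only 20 rows: reads from the first partition(s) until it has
    // them, then stops -- the size of the file barely matters.
    df.show(false)

    // Must touch every row: scans all partitions, but processes and
    // discards them one at a time rather than holding them all in memory.
    println(df.count())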

Re: Will it lead to OOM error?

2022-06-22 Thread Deepak Sharma
It will spill to disk if everything can’t be loaded in memory. On Wed, 22 Jun 2022 at 5:58 PM, Sid wrote: > I have a 150TB CSV file. > > I have a total of 100 TB RAM and 100TB disk. So if I do something like this > > spark.read.option("header","true").csv(filepath).show(false) > > Will it

Will it lead to OOM error?

2022-06-22 Thread Sid
I have a 150TB CSV file. I have a total of 100 TB RAM and 100TB disk. So if I do something like this: spark.read.option("header","true").csv(filepath).show(false) Will it lead to an OOM error since it doesn't have enough memory? Or will it spill data onto the disk and process it? Thanks, Sid

Re: Spark Doubts

2022-06-22 Thread Sid
Hi, Thanks for your answers. Much appreciated. I know that we can cache the data frame in memory or on disk, but I want to understand when the data frame is initially loaded and where it resides by default. Thanks, Sid On Wed, Jun 22, 2022 at 6:10 AM Yong Walt wrote: > These are the basic
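
A sketch of the point in question, assuming default settings: a DataFrame is a lazy execution plan, so nothing "resides" anywhere until an action runs, and without an explicit cache each action re-reads the source.

    import org.apache.spark.storage.StorageLevel

    val df = spark.read.option("header", "true").csv("/data/in.csv") // no I/O yet, just a plan

    df.count() // first action: data is read, processed, and discarded

    df.persist(StorageLevel.MEMORY_AND_DISK) // opt in to keeping partitions
    df.count() // this action materializes the cache
    df.count() // now served from memory/disk without re-reading the source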