Re: [Java 17] --add-exports required?

2022-06-22 Thread Yang,Jie(INF)
Hi, Greg "--add-exports java.base/sun.nio.ch=ALL-UNNAMED " does not need to be added when SPARK-33772 is completed, so in order to answer your question, I need more details for testing: 1. Where can I download Java 17 (Temurin-17+35)? 2. What test commands do you use? Yang Jie 在 2022/6/23

Need help with the configuration for AWS glue jobs

2022-06-22 Thread Sid
Hi Team, could anyone help me with the problem below: https://stackoverflow.com/questions/72724999/how-to-calculate-number-of-g-1-workers-in-aws-glue-for-processing-1tb-data Thanks, Sid

Re: Will it lead to OOM error?

2022-06-22 Thread Sid
Thanks all for your answers. Much appreciated. On Thu, Jun 23, 2022 at 6:07 AM Yong Walt wrote: > We have many cases like this. It won't cause OOM. > > Thanks > > On Wed, Jun 22, 2022 at 8:28 PM Sid wrote: >> I have a 150TB CSV file. >> >> I have a total of 100 TB RAM and 100TB disk. So if I

[Java 17] --add-exports required?

2022-06-22 Thread Greg Kopff
Hi. According to the release notes[1], and specifically the ticket Build and Run Spark on Java 17 (SPARK-33772)[2], Spark now supports running on Java 17. However, using Java 17 (Temurin-17+35) with Maven (3.8.6) and maven-surefire-plugin (3.0.0-M7), when running a unit test that uses Spark
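
For context, a sketch of the workaround under discussion, assuming the unit tests launch Spark inside the JVM forked by surefire (the plugin coordinates come from the original post; whether this flag is still required is exactly what the thread is trying to establish). A surefire-forked test JVM does not receive the module flags that the spark-submit/spark-class scripts normally inject, so the flag from the error can be passed explicitly:

    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-surefire-plugin</artifactId>
      <version>3.0.0-M7</version>
      <configuration>
        <!-- flag from the reported error; Spark's launcher scripts add
             this (and related --add-opens flags) automatically, but a
             surefire-forked JVM does not go through those scripts -->
        <argLine>--add-exports=java.base/sun.nio.ch=ALL-UNNAMED</argLine>
      </configuration>
    </plugin>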

Re: Will it lead to OOM error?

2022-06-22 Thread Yong Walt
We have many cases like this. It won't cause OOM. Thanks On Wed, Jun 22, 2022 at 8:28 PM Sid wrote: > I have a 150TB CSV file. > > I have a total of 100 TB RAM and 100TB disk. So if I do something like this > > spark.read.option("header","true").csv(filepath).show(false) > > Will it lead to an

StructuredStreaming - read from Kafka, writing data into Mongo every 10 minutes

2022-06-22 Thread karan alang
Hello All, I have data in a Kafka topic (data published every 10 mins) and I'm planning to read this data using Apache Spark Structured Streaming (batch mode) and push it into MongoDB. Please note: this will be scheduled using Composer/Airflow on GCP, which will create a Dataproc cluster, run the spark
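
A minimal sketch of that pattern, assuming Spark 3.3+ (for Trigger.AvailableNow; Trigger.Once on older versions) and the MongoDB Spark connector 10.x on the classpath; the brokers, topic, checkpoint path, and Mongo URI are placeholders:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.streaming.Trigger

    val kafkaDf = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092") // placeholder
      .option("subscribe", "my-topic")                  // placeholder
      .load()
      .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

    // Each scheduled run drains whatever is currently in the topic and
    // stops; the checkpoint remembers the consumed offsets across runs.
    val query = kafkaDf.writeStream
      .trigger(Trigger.AvailableNow())
      .option("checkpointLocation", "gs://bucket/checkpoints/kafka-to-mongo")
      .foreachBatch { (batch: DataFrame, _: Long) =>
        batch.write
          .format("mongodb") // connector 10.x; older connectors use "mongo"
          .option("spark.mongodb.write.connection.uri",
            "mongodb://host:27017/db.collection") // placeholder
          .mode("append")
          .save()
      }
      .start()
    query.awaitTermination()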

Re: repartition(n) should be deprecated/alerted

2022-06-22 Thread Igor Berman
I'd argue it's strange and unexpected. I understand there are precision issues here, but I'm fine that the result might be slightly different each time for the specific column. What I'm not expecting (as an end user, for sure) is that a presumably trivial computation might, under retry scenarios, cause a few

Re: Will it lead to OOM error?

2022-06-22 Thread Enrico Minack
Yes, a single file compressed with a non-splittable compression (e.g. gzip) would have to be read by a single executor. That takes forever. You should consider recompressing the file with a splittable compression first. You will not want to read that file more than once, so you should
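
A sketch of the one-time recompression being suggested (paths are placeholders): pay the single-threaded read once, then work with a splittable copy.

    // One gzip file cannot be split, so this read runs on a single executor.
    val df = spark.read.option("header", "true").csv("/data/huge.csv.gz")

    // Write a splittable, compressed copy once; later jobs read it in parallel.
    df.write.option("compression", "snappy").parquet("/data/huge.parquet")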

Re: Will it lead to OOM error?

2022-06-22 Thread Sid
Hi Enrico, Thanks for the insights. Could you please help me understand, with one example, which compressed files wouldn't be split into partitions, would put the load on a single partition, and might lead to an OOM error? Thanks, Sid On Wed, Jun 22, 2022 at 6:40 PM Enrico Minack wrote:
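
As a concrete illustration (a sketch with placeholder paths): a gzip-compressed CSV lands in a single partition, so one executor carries the whole file, while the same data uncompressed is split across many partitions.

    val gz = spark.read.option("header", "true").csv("/data/big.csv.gz")
    println(gz.rdd.getNumPartitions)    // 1 -- gzip is not splittable

    val plain = spark.read.option("header", "true").csv("/data/big.csv")
    println(plain.rdd.getNumPartitions) // many, one per input split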

Re: repartition(n) should be deprecated/alerted

2022-06-22 Thread Sean Owen
Eh, there is a huge caveat - you are making your input non-deterministic, where determinism is assumed. I don't think that supports such a drastic statement. On Wed, Jun 22, 2022 at 12:39 PM Igor Berman wrote: > Hi All > tldr; IMHO repartition(n) should be deprecated or red-flagged, so that >

repartition(n) should be deprecated/alerted

2022-06-22 Thread Igor Berman
Hi All, tl;dr: IMHO repartition(n) should be deprecated or red-flagged, so that everybody will understand the consequences of using this method. Following the conversation in https://issues.apache.org/jira/browse/SPARK-38388 (still relevant for recent versions of Spark), I think it's very important to
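
A sketch of the hazard described in SPARK-38388 (column names are illustrative): repartition(n) uses round-robin partitioning, so a row's target partition depends on the order rows arrive in; if the input is non-deterministic and a task is retried, the rerun can place rows differently than the first attempt, duplicating or dropping rows downstream.

    import org.apache.spark.sql.functions.{col, rand}

    // Non-deterministic input: values (and hence row order) can differ
    // between task attempts.
    val df = spark.read.parquet("/data/in").withColumn("score", rand())

    // Hazardous under retries: round-robin placement depends on row order.
    val risky = df.repartition(200)

    // Safer: hash-partition by a deterministic key instead.
    val byKey = df.repartition(200, col("id"))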

[Spark Dataframe] How to load compressed file? (lz4, snappy)

2022-06-22 Thread HelloWorld
Hello. I am a developer who is learning Spark programming. I am asking for help because it is difficult for me to solve the current problem on my own. My development environment is as follows. ---
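
A sketch of the usual approach, with one caveat worth stating up front: Hadoop chooses the codec from the file extension, and its lz4/snappy framing differs from that of the standalone lz4/snappy command-line tools, so files compressed outside the Hadoop ecosystem may fail to decode (path is a placeholder).

    // Decodes only if the file was written with Hadoop's Lz4Codec/SnappyCodec;
    // the extension (.lz4 / .snappy) is what selects the codec.
    val df = spark.read.option("header", "true").csv("/data/part-00000.csv.lz4")
    df.show()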

Re: Will it lead to OOM error?

2022-06-22 Thread Enrico Minack
The RAM and disk memory consumption depends on what you do with the data after reading it. Your particular action will read 20 lines from the first partition and show them. So it will not use any RAM or disk, no matter how large the CSV is. If you do a count instead of a show, it will
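
A sketch of the distinction being drawn (placeholder path):

    val df = spark.read.option("header", "true").csv("/data/huge.csv")

    // Needs only 20 rows: reads from the first partition(s) until it has
    // them, then stops -- the size of the file barely matters.
    df.show(false)

    // Must touch every row: scans all partitions, but processes and
    // discards them one at a time rather than holding them all in memory.
    println(df.count())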

Re: Will it lead to OOM error?

2022-06-22 Thread Deepak Sharma
It will spill to disk if everything can’t be loaded in memory. On Wed, 22 Jun 2022 at 5:58 PM, Sid wrote: > I have a 150TB CSV file. > > I have a total of 100 TB RAM and 100TB disk. So if I do something like this > > spark.read.option("header","true").csv(filepath).show(false) > > Will it

Will it lead to OOM error?

2022-06-22 Thread Sid
I have a 150TB CSV file. I have a total of 100 TB RAM and 100TB disk. So if I do something like this: spark.read.option("header","true").csv(filepath).show(false) Will it lead to an OOM error since it doesn't have enough memory? Or will it spill data onto the disk and process it? Thanks, Sid

Re: Spark Doubts

2022-06-22 Thread Sid
Hi, Thanks for your answers. Much appreciated. I know that we can cache the data frame in memory or on disk, but I want to understand when the data frame is initially loaded and where it resides by default. Thanks, Sid On Wed, Jun 22, 2022 at 6:10 AM Yong Walt wrote: > These are the basic
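
A sketch of the point in question, assuming default settings: a DataFrame is a lazy execution plan, so nothing "resides" anywhere until an action runs, and without an explicit cache each action re-reads the source.

    import org.apache.spark.storage.StorageLevel

    val df = spark.read.option("header", "true").csv("/data/in.csv") // no I/O yet, just a plan

    df.count() // first action: data is read, processed, and discarded

    df.persist(StorageLevel.MEMORY_AND_DISK) // opt in to keeping partitions
    df.count() // this action materializes the cache
    df.count() // now served from memory/disk without re-reading the source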