Hi.
I am running on macOS 12.4, using an ‘Adoptium’ JDK from
https://adoptium.net/download. The version details are:

$ java -version
openjdk version "17.0.3" 2022-04-19
OpenJDK Runtime Environment Temurin-17.0.3+7 (build 17.0.3+7)
OpenJDK 64-Bit Server VM Temurin-17.0.3+7 (build 17.0.3+7, mixed mode, sharing)
Hi, Greg
"--add-exports java.base/sun.nio.ch=ALL-UNNAMED " does not need to be added
when SPARK-33772 is completed, so in order to answer your question, I need more
details for testing:
1. Where can I download Java 17 (Temurin-17+35)?
2. What test commands do you use?
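(As an aside, if a workaround is needed in the meantime, the export flag can
be passed to surefire's forked JVM via the standard argLine user property;
a minimal sketch, assuming the default surefire configuration:

  mvn test -DargLine="--add-exports=java.base/sun.nio.ch=ALL-UNNAMED"
)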
Yang Jie
On 2022/6/23 1
Hi Team,
Could anyone help me with the problem below:
https://stackoverflow.com/questions/72724999/how-to-calculate-number-of-g-1-workers-in-aws-glue-for-processing-1tb-data
Thanks,
Sid
Thanks all for your answers. Much appreciated.
On Thu, Jun 23, 2022 at 6:07 AM Yong Walt wrote:
> We have many cases like this. It won't cause an OOM.
>
> Thanks
>
> On Wed, Jun 22, 2022 at 8:28 PM Sid wrote:
>
>> I have a 150TB CSV file.
>>
>> I have a total of 100 TB RAM and 100TB disk. So if I
Hi.
According to the release notes[1], and specifically the ticket Build and Run
Spark on Java 17 (SPARK-33772)[2], Spark now supports running on Java 17.
However, using Java 17 (Temurin-17+35) with Maven (3.8.6) and
maven-surefire-plugin (3.0.0-M7), when running a unit test that uses Spark
(3
We have many cases like this. It won't cause an OOM.
Thanks
On Wed, Jun 22, 2022 at 8:28 PM Sid wrote:
> I have a 150TB CSV file.
>
> I have a total of 100 TB RAM and 100TB disk. So if I do something like this
>
> spark.read.option("header","true").csv(filepath).show(false)
>
> Will it lead to an
Hello All,
I have data in a Kafka topic (data is published every 10 mins) and I'm planning
to read this data using Apache Spark Structured Streaming (batch mode) and push
it to MongoDB.
Please note: this will be scheduled using Composer/Airflow on GCP - which
will create a Dataproc cluster, run the spark cod
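A rough sketch of what I have in mind (broker, topic, URI, and checkpoint
path are placeholders; the "mongodb" sink name assumes the MongoDB Spark
Connector 10.x, and Trigger.AvailableNow() needs Spark 3.3, with
Trigger.Once() as the older equivalent):

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.streaming.Trigger

  val spark = SparkSession.builder().appName("kafka-to-mongo").getOrCreate()

  // Read whatever the topic currently holds, as a bounded (batch-like) stream
  val df = spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  // placeholder
    .option("subscribe", "my-topic")                   // placeholder
    .load()
    .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")

  // Push to MongoDB, then stop once the available data is processed
  df.writeStream
    .format("mongodb")
    .option("checkpointLocation", "gs://my-bucket/checkpoints/")     // placeholder
    .option("spark.mongodb.connection.uri", "mongodb://host:27017")  // placeholder
    .option("spark.mongodb.database", "mydb")                        // placeholder
    .option("spark.mongodb.collection", "mycoll")                    // placeholder
    .trigger(Trigger.AvailableNow())
    .start()
    .awaitTermination()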
I'd argue it's strange and unexpected.
I understand there are precision issues here, and I'm fine with the result
being slightly different each time for the specific column.
What I'm not expecting (as an end user, for sure) is that a presumably trivial
computation might, under retry scenarios, cause a few hund
Yes, a single file compressed with a non-splittable compression (e.g.
gzip) would have to be read by a single executor. That takes forever.
You should consider recompressing the file with a splittable compression
first. You will not want to read that file more than once, so you should
uncompress
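A minimal sketch of that one-time conversion (paths and the partition count
are placeholders, assuming an existing spark session):

  // The gzip file can only be read by a single task...
  val df = spark.read.option("header", "true").csv("s3://bucket/big.csv.gz")

  // ...so spread it over many partitions and write a splittable format once.
  // Parquet stays splittable even when compressed, since compression is per block.
  df.repartition(1000)
    .write
    .option("compression", "snappy")
    .parquet("s3://bucket/big.parquet")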
Hi Enrico,
Thanks for the insights.
Could you please help me understand, with an example, which compressed files
wouldn't be split into partitions, putting the load on a single
partition and possibly leading to an OOM error?
Thanks,
Sid
On Wed, Jun 22, 2022 at 6:40 PM Enrico Minack wrote:
Eh, there is a huge caveat - you are making your input non-deterministic,
where determinism is assumed. I don't think that supports such a drastic
statement.
On Wed, Jun 22, 2022 at 12:39 PM Igor Berman wrote:
> Hi All
> tldr; IMHO repartition(n) should be deprecated or red-flagged, so that
> ev
Hi All
tldr; IMHO repartition(n) should be deprecated or red-flagged, so that
everybody will understand the consequences of using this method.
Following the conversation in https://issues.apache.org/jira/browse/SPARK-38388
(still relevant for recent versions of Spark), I think it's very important
to mark
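A contrived sketch of the class of problem (not the exact case from the
ticket): repartition(n) round-robins rows over partitions, so if a
non-deterministic upstream produces rows in a different order after a task
retry, rows can land in different partitions, and any per-partition result
can change between runs:

  import org.apache.spark.sql.functions._

  val df = spark.range(0, 1000000L)
    .select(rand().as("v"))  // non-deterministic source
    .repartition(10)         // round-robin: row-to-partition mapping depends on input order

  // Per-partition floating-point sums can differ across runs if a task is retried.
  df.groupBy(spark_partition_id().as("pid")).agg(sum("v")).show()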
Hello. I am a developer who is learning Spark programming.
I am asking for help because it is difficult for me to solve the current
problem on my own.
My development environment is as follows.
---
OS
The RAM and disk consumption depends on what you do with the data
after reading it.
Your particular action will read 20 lines from the first partition and
show them. So it will not use any RAM or disk, no matter how large the
CSV is.
If you do a count instead of show, it will iterate
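For example, same reader, very different cost:

  // show() only materializes the first 20 rows, touching just the first partition(s):
  spark.read.option("header", "true").csv(filepath).show(false)

  // count() has to scan every row of every partition:
  spark.read.option("header", "true").csv(filepath).count()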
It will spill to disk if everything can’t be loaded in memory.
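Spilling during shuffles is automatic; when caching, the storage level
decides whether partitions that don't fit in RAM go to disk. A small sketch,
assuming disk fallback is wanted:

  import org.apache.spark.storage.StorageLevel

  val df = spark.read.option("header", "true").csv(filepath)
  // Partitions that don't fit in memory spill to local disk instead of failing
  df.persist(StorageLevel.MEMORY_AND_DISK)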
On Wed, 22 Jun 2022 at 5:58 PM, Sid wrote:
> I have a 150TB CSV file.
>
> I have a total of 100 TB RAM and 100TB disk. So if I do something like this
>
> spark.read.option("header","true").csv(filepath).show(false)
>
> Will it lead
I have a 150TB CSV file.
I have a total of 100 TB RAM and 100 TB disk. So if I do something like this:
spark.read.option("header","true").csv(filepath).show(false)
Will it lead to an OOM error since it doesn't have enough memory? Or will it
spill data onto the disk and process it?
Thanks,
Sid