it deletes and copies the partitions.
My issue is something related to this -
https://groups.google.com/g/cloud-dataproc-discuss/c/neMyhytlfyg?pli=1
With Best Regards,
Dipayan Dev
On Wed, Jul 19, 2023 at 12:06 AM Mich Talebzadeh
wrote:
> Spark has no role in creating that hive stag
Spark has no role in creating that Hive staging directory. That directory
belongs to Hive, and Spark simply does ETL there, loading to the Hive
managed table, which in your case ends up in the staging directory.
I suggest that you review your design and use an external Hive table with an
explicit location.
It does help performance, but not significantly.
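For illustration, a minimal PySpark sketch of what an external Hive table with an explicit GCS location could look like; the database, table, schema and bucket names below are hypothetical:

```python
from pyspark.sql import SparkSession

# Hypothetical names and paths; adjust to your own warehouse layout.
spark = (SparkSession.builder
         .appName("external-table-sketch")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS mydb.sales_ext (
        id BIGINT,
        amount DOUBLE
    )
    PARTITIONED BY (event_date STRING)
    STORED AS ORC
    LOCATION 'gs://my-bucket/warehouse/sales_ext'
""")
```

This only changes where the data lives; as noted above, it helps performance but not dramatically.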
I am just wondering: once Spark creates that staging directory along with
the SUCCESS file, can we just run a gsutil rsync command and move these
files to the original directory? Has anyone tried this approach or does anyone
foresee any concerns?
On Mon, 17 Jul 2023
++ DEV community
On Mon, Jul 17, 2023 at 4:14 PM Varun Shah
wrote:
> Resending this message with a proper Subject line
>
> Hi Spark Community,
>
> I am trying to set up my forked apache/spark project locally for my 1st
> Open Source Contribution, by building and creating a pa
Hi Team,
I am still looking for guidance here. I would really appreciate anything that
points me in the right direction.
On Mon, Jul 17, 2023, 16:14 Varun Shah wrote:
> Resending this message with a proper Subject line
>
> Hi Spark Community,
>
> I am trying to set up my forked apach
will take a long time to perform this step. One workaround is
> to create a smaller number of larger files if that is possible from Spark, and
> if this is not possible then those configurations allow for configuring the
> thread pool which does the metadata copy.
>
> You can go thr
Note the
> MLlib-specific contribution guidelines section in particular.
>
> https://spark.apache.org/contributing.html
>
> Since you are looking for something to start with, take a look at this
> Jira query for starter issues.
>
>
> https://issues.apache.org/jira/browse/S
FileOutputCommitter v2 is supported in GCS, but the rename is a metadata
copy-and-delete operation in GCS, and therefore if there are a large number of
files it will take a long time to perform this step. One workaround is
to create a smaller number of larger files if that is possible from Spark, and
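As a rough, hedged sketch of both workarounds mentioned in this thread (the v2 committer algorithm and producing fewer, larger files), assuming a PySpark job writing partitioned output to GCS; the paths, column name and numbers are placeholders:

```python
from pyspark.sql import SparkSession

# The committer setting is the standard Hadoop FileOutputCommitter knob,
# passed through Spark's spark.hadoop.* prefix.
spark = (SparkSession.builder
         .appName("gcs-commit-sketch")
         .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
         .getOrCreate())

df = spark.read.parquet("gs://my-bucket/input/")  # placeholder input path

# Fewer, larger files: coalesce before writing so the commit phase has
# far fewer objects to copy/rename in GCS.
(df.coalesce(32)  # tune to your data volume
   .write
   .mode("overwrite")
   .partitionBy("event_date")          # hypothetical partition column
   .parquet("gs://my-bucket/output/"))  # placeholder output path
```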
Interestingly, it took only 10 minutes to write the output to the staging
> directory; the rest of the time was spent renaming the objects. That's the
> concern.
>
> Looks like a known issue with how Spark behaves with GCS, but I'm not finding
> any workaround for this.
>
>
> On Mon, 17 Jul
It does support it; it doesn't error out for me, at least. But it took around 4
hours to finish the job.
Interestingly, it took only 10 minutes to write the output to the staging
directory; the rest of the time was spent renaming the objects. That's the
concern.
Looks like a known issue with how Spark behaves
committer algorithms?
>
> I tried the v2 algorithm but it's not improving the runtime. What's the best
> practice in Dataproc for dynamic partition updates in Spark?
>
>
> On Mon, 17 Jul 2023 at 7:05 PM, Jay wrote:
>
>> You can try increasing fs.gs.batch.threads and
>> fs.gs.max.requests
Thanks Jay,
I will try that option.
Any insight on the file committer algorithms?
I tried the v2 algorithm but it's not improving the runtime. What's the best
practice in Dataproc for dynamic partition updates in Spark?
On Mon, 17 Jul 2023 at 7:05 PM, Jay wrote:
> You can try increas
You can try increasing fs.gs.batch.threads and fs.gs.max.requests.per.batch.
The definitions for these flags are available here -
https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/gcs/CONFIGURATION.md
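A hedged sketch of how these connector flags might be passed to a Spark session on Dataproc; the values are illustrative only, so check CONFIGURATION.md for the exact semantics and sensible defaults:

```python
from pyspark.sql import SparkSession

# Assumption: GCS connector options can be supplied as Hadoop configuration
# through the spark.hadoop.* prefix; the numbers below are made-up starting points.
spark = (SparkSession.builder
         .appName("gcs-batch-tuning-sketch")
         .config("spark.hadoop.fs.gs.batch.threads", "30")
         .config("spark.hadoop.fs.gs.max.requests.per.batch", "30")
         .getOrCreate())
```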
On Mon, 17 Jul 2023 at 14:59, Dipayan Dev wrote:
> No, I am using Sp
Resending this message with a proper Subject line
Hi Spark Community,
I am trying to set up my forked apache/spark project locally for my 1st
Open Source Contribution, by building and creating a package as mentioned here
under Running Individual Tests
<https://spark.apache.org/develo
No, I am using Spark 2.4 to update the GCS partitions. I have a managed
Hive table on top of this.
[image: image.png]
When I do a dynamic partition update with Spark, it creates the new files in a
staging area, as shown here.
But the GCS blob renaming takes a lot of time. I have a partition based on
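For context, a minimal sketch of the kind of dynamic partition overwrite being described here, assuming Spark 2.3+ and an existing partitioned Hive table; the DataFrame and table names are hypothetical:

```python
# Only the partitions present in incremental_df are overwritten;
# all other partitions of the table are left untouched.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

(incremental_df.write            # hypothetical DataFrame holding the new data
    .mode("overwrite")
    .insertInto("mydb.events"))  # hypothetical partitioned Hive table
```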
So you are using GCP and your Hive is installed on Dataproc which happens
to run your Spark as well. Is that correct?
What version of Hive are you using?
HTH
Mich Talebzadeh,
Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom
view my Linkedin profile
Hi All,
Of late, I have encountered an issue where I have to overwrite a lot of
partitions of a Hive table through Spark. It looks like writing to the
hive_staging_directory takes 25% of the total time, whereas 75% or more of the
time goes into moving the ORC files from the staging directory to the final
this Jira
query for starter issues.
https://issues.apache.org/jira/browse/SPARK-38719?jql=project%20%3D%20SPARK%20AND%20labels%20%3D%20%22starter%22%20AND%20status%20%3D%20Open
Cheers,
Brian
On Sun, Jul 16, 2023 at 8:49 AM Dipayan Dev wrote:
> Hi Spark Community,
>
> A very good morni
Hi Spark Community,
A very good morning to you.
I have been using Spark for the last few years now, and I am new to the community.
I am very much interested in becoming a contributor.
I am looking to contribute to Spark MLlib. Can anyone please suggest how
to start contributing to a new MLlib feature? Is
Hey Spark Community,
Our Jupyterhub/Jupyterlab (with spark client) runs behind two layers of
HAProxy and the Yarn cluster runs remotely. We want to use deploy mode
'client' so that we can capture the output of any spark sql query in
jupyterlab. I'm aware of other technologies like
Gentle reminder on this.
On Sat, Jul 8, 2023 at 7:59 PM Surya Soma wrote:
> Hello,
>
> I am trying to publish custom metrics using Spark CustomMetric API as
> supported since spark 3.2 https://github.com/apache/spark/pull/31476,
>
>
> https://spark.apache.org/docs/3.2.
Well, in that case, you may want to make sure your Spark server is
running properly and that you can access the Spark UI using your browser. If
you don't own the Spark cluster, contact your Spark admin.
On 7/12/23 1:56 PM, timi ayoade wrote:
I can't even connect to the spark UI
O
unsubscribe
From: timi ayoade
Sent: Wednesday, July 12, 2023 6:11 AM
To: user@spark.apache.org
Subject: [EXTERNAL] Spark Not Connecting
Hi Apache Spark community, I am a Data Engineer. I have been using Apache Spark
for some time now. I recently tried to use it
Hi Apache Spark community, I am a Data Engineer. I have been using Apache
Spark for some time now. I recently tried to use it but I have been getting
some errors. I have tried debugging the error but to no avail. The
screenshot is attached below. I would be glad for any response. Thanks.
Are you using Spark 3.4?
Under directory $SPARK_HOME get a list of jar files for hive and hadoop.
This one is for version 3.4.0
/opt/spark/jars> ltr *hive* *hadoop*
-rw-r--r--. 1 hduser hadoop 717820 Apr 7 03:43 spark-hive_2.12-3.4.0.jar
-rw-r--r--. 1 hduser hadoop 563632 Apr 7 03:43
sp
Hi all,
We made some changes to hive which require changes to the hive jars that
Spark is bundled with. Since Spark 3.3.1 comes bundled with Hive 2.3.9
jars, we built our changes in Hive 2.3.9 and put the necessary jars under
$SPARK_HOME/jars (replacing the original jars that were there
Hello,
I am trying to publish custom metrics using Spark CustomMetric API as
supported since spark 3.2 https://github.com/apache/spark/pull/31476,
https://spark.apache.org/docs/3.2.0/api/java/org/apache/spark/sql/connector/metric/CustomMetric.html
I have created a custom metric implementing
Hello everyone,
I’m really sorry to use this mailing list, but it seems impossible to report a
strange behaviour that is happening with the Spark UI any other way. I’m also
sending the link to the Stack Overflow question here:
https://stackoverflow.com/questions/76632692/spark-ui-executors-tab-its-empty
I’m
Dear spark users,
I'm experiencing an unusual issue with Spark 3.4.x.
When creating a new column as the sum of several existing columns, the time
taken almost doubles as the number of columns increases. This operation doesn't
require many resources, so I suspect there might be a pr
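For reference, a hedged sketch of the kind of column-sum operation being described; the DataFrame and column names are hypothetical:

```python
from functools import reduce
from pyspark.sql import functions as F

cols = ["c1", "c2", "c3", "c4"]  # hypothetical existing numeric columns

# New column as the sum of several existing columns.
df = df.withColumn("total", reduce(lambda a, b: a + b, [F.col(c) for c in cols]))
```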
ds = spark.read().schema(schema).option("mode",
>>>> "PERMISSIVE").json("path").collect();
>>>> ds = ds.filter(functions.col("_corrupt_record").isNull()).collect();
>>>>
>>>> However, I haven't figured out how to
However, I haven't figured out how to ignore record 5.
>>>
>>> Any help is appreciated.
>>>
>>> On Mon, 3 Jul 2023 at 19:24, Shashank Rao
>>> wrote:
>>>
>>>> Hi all,
>>>> I'm trying to read around 1,000,000 JSONL
ctions.col("_corrupt_record").isNull()).collect();
>>
>> However, I haven't figured out how to ignore record 5.
>>
>> Any help is appreciated.
>>
>> On Mon, 3 Jul 2023 at 19:24, Shashank Rao
>> wrote:
>>
>>> Hi all,
>&g
>
> On Mon, 3 Jul 2023 at 19:24, Shashank Rao wrote:
>
>> Hi all,
>> I'm trying to read around 1,000,000 JSONL files present in S3 using
>> Spark. Once read, I need to write them to BigQuery.
>> I have a schema that may not be an exact match with all the
Wow, really neat -- thanks for sharing!
On Mon, Jul 3, 2023 at 8:12 PM Gengliang Wang wrote:
> Dear Apache Spark community,
>
> We are delighted to announce the launch of a groundbreaking tool that aims
> to make Apache Spark more user-friendly and accessible - the English
The demo was really amazing.
On Tue, 4 Jul 2023 at 09:17, Farshid Ashouri
wrote:
> This is wonderful news!
>
> On Tue, 4 Jul 2023 at 01:14, Gengliang Wang wrote:
>
>> Dear Apache Spark community,
>>
>> We are delighted to announce the launch of a groundbreaking to
This is wonderful news!
On Tue, 4 Jul 2023 at 01:14, Gengliang Wang wrote:
> Dear Apache Spark community,
>
> We are delighted to announce the launch of a groundbreaking tool that aims
> to make Apache Spark more user-friendly and accessible - the English SDK
> <
Dear Apache Spark community,
We are delighted to announce the launch of a groundbreaking tool that aims
to make Apache Spark more user-friendly and accessible - the English SDK
<https://github.com/databrickslabs/pyspark-ai/>. Powered by the application
of Generative AI, the English SDK
ds.filter(functions.col("_corrupt_record").isNull()).collect();
However, I haven't figured out how to ignore record 5.
Any help is appreciated.
On Mon, 3 Jul 2023 at 19:24, Shashank Rao wrote:
> Hi all,
> I'm trying to read around 1,000,000 JSONL files present in S3 usin
Hi Ruben,
I’m not sure if this answers your question, but if you’re interested in
exploring the underlying tables, you could always try something like the
below in a Databricks notebook:
display(spark.read.table('samples.nyctaxi.trips'))
(For vanilla Spark users, it would be
spark.read.table
Hi all,
I'm trying to read around 1,000,000 JSONL files present in S3 using Spark.
Once read, I need to write them to BigQuery.
I have a schema that may not be an exact match with all the records.
How can I filter out records where there isn't an exact schema match?
E.g., if my records wer
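A hedged PySpark sketch of the approach discussed later in this thread: read with an explicit schema in PERMISSIVE mode, capture rows that fail to parse in a corrupt-record column, and filter them out. Paths and field names are placeholders, and note that records which parse but carry extra fields may still pass this filter:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("jsonl-filter-sketch").getOrCreate()

schema = StructType([
    StructField("id", LongType()),
    StructField("name", StringType()),
    StructField("_corrupt_record", StringType()),  # receives rows that don't parse
])

df = (spark.read
      .schema(schema)
      .option("mode", "PERMISSIVE")
      .option("columnNameOfCorruptRecord", "_corrupt_record")
      .json("s3a://my-bucket/jsonl/"))  # placeholder path

# Some Spark versions disallow queries that reference only the corrupt-record
# column on raw JSON; caching the parsed DataFrame first is the documented workaround.
df = df.cache()
clean = df.filter(F.col("_corrupt_record").isNull()).drop("_corrupt_record")
```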
Dear Apache Spark community,
I hope this email finds you well. My name is Ruben, and I am an enthusiastic
user of Apache Spark, specifically through the Databricks platform. I am
reaching out to you today to seek your assistance and guidance regarding a
specific use case.
I have been
Hi,
When I launch pyspark CLI on my M1 Macbook (standalone mode), I
intermittently get the following error and the Spark session doesn't get
initialized. 7~8 times out of 10, it doesn't have the issue, but it
intermittently fails. And, this occurs only when I specify
`spark.jars.packag
Hi,
We have recently upgraded Spark from 2.4.x to 3.3.1, and managed table
creation while writing a DataFrame with saveAsTable fails with the error below:
Can not create the managed table(``) The associated
location('hdfs:') already exists.
At a high level, our code does the following before writing da
t just be the reality of trying to process a 240m record file
> with 80+ columns, unless there's an obvious issue with my setup that
> someone sees. The solution is likely going to involve increasing
> parallelization.
>
> To that end, I extracted and re-zipped this file in bzip. Since
file with
80+ columns, unless there's an obvious issue with my setup that someone
sees. The solution is likely going to involve increasing parallelization.
To that end, I extracted and re-zipped this file in bzip. Since bzip is
splittable and gzip is not, Spark can process the bzip file in par
Hello,
I am trying to publish custom metrics using Spark CustomMetric API as
supported since spark 3.2 https://github.com/apache/spark/pull/31476,
https://spark.apache.org/docs/3.2.0/api/java/org/apache/spark/sql/connector/metric/CustomMetric.html
I have created a custom metric implementing
OK, for now, have you analyzed statistics on the Hive external table?
spark-sql (default)> ANALYZE TABLE test.stg_t2 COMPUTE STATISTICS FOR ALL
COLUMNS;
spark-sql (default)> DESC EXTENDED test.stg_t2;
Hive external tables have little optimization
HTH
Mich Talebzadeh,
Solutions Architect/Engin
Hello,
I'm using Spark 3.4.0 in standalone mode with Hadoop 3.3.5. The master node
has 2 cores and 8GB of RAM. There is a single worker node with 8 cores and
64GB of RAM.
I'm trying to process a large pipe delimited file that has been compressed
with gzip (9.2GB zipped, ~58GB unzip
if len(df.take(1)) > 0:
    print(batchId)
    # do your logic
else:
    print("DataFrame is empty")
You should also include
option('checkpointLocation', checkpoint_path).
See this article of mine:
Processing Change Data Capture with Spark Structured S
Hi,
I am using the Spark 3.3.1 distribution and Spark streaming in my application. Is
there a way to add a micro-batch ID to all logs generated by Spark and the Spark
application?
Thanks.
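There is no setting I know of that stamps Spark's own internal logs with a batch ID, but for the application's logging a hedged sketch using foreachBatch (logger name, source and paths are placeholders) could look like this:

```python
import logging
from pyspark.sql import SparkSession

logging.basicConfig(format="%(asctime)s %(message)s")
log = logging.getLogger("my-stream")  # hypothetical application logger

spark = SparkSession.builder.appName("batch-id-logging-sketch").getOrCreate()

def process(df, batch_id):
    # Prefix application log lines with the micro-batch id handed in by Spark.
    log.warning("batch_id=%s rows=%d", batch_id, df.count())
    # ... business logic ...

(spark.readStream.format("rate").load()          # placeholder source
      .writeStream
      .foreachBatch(process)
      .option("checkpointLocation", "/tmp/chk")  # placeholder path
      .start())
```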
to:mri...@gmail.com>> wrote:
>>
>>
>> Thanks Dongjoon !
>>
>> Regards,
>> Mridul
>>
>> On Fri, Jun 23, 2023 at 6:58 PM Dongjoon Hyun > <mailto:dongj...@apache.org>> wrote:
>>>
>>> We are happy to announce the availability of A
Thanks! Dongjoon Hyun.
Congratulations too!
At 2023-06-24 07:57:05, "Dongjoon Hyun" wrote:
We are happy to announce the availability of Apache Spark 3.4.1!
Spark 3.4.1 is a maintenance release containing stability fixes. This
release is based on the branch-3.4 maintenance branc
Hello All -
I'm using Apache Spark Structured Streaming to read data from a Kafka topic
and do some processing. I'm using a watermark to account for late-arriving
records and the code works fine.
Here is the working (sample) code:
```
from pyspark.sql import SparkSession
from pyspark.sql
M Dongjoon Hyun wrote:
>>>
>>> We are happy to announce the availability of Apache Spark 3.4.1!
>>>
>>> Spark 3.4.1 is a maintenance release containing stability fixes. This
>>> release is based on the branch-3.4 maintenance branch of Spark. We strongly
Thanks!
On Sat, Jun 24, 2023 at 11:01 AM Mridul Muralidharan
wrote:
>
> Thanks Dongjoon !
>
> Regards,
> Mridul
>
> On Fri, Jun 23, 2023 at 6:58 PM Dongjoon Hyun wrote:
>
>> We are happy to announce the availability of Apache Spark 3.4.1!
>>
>> Spark
Thanks Dongjoon !
Regards,
Mridul
On Fri, Jun 23, 2023 at 6:58 PM Dongjoon Hyun wrote:
> We are happy to announce the availability of Apache Spark 3.4.1!
>
> Spark 3.4.1 is a maintenance release containing stability fixes. This
> release is based on the branch-3.4 maintenance bra
We are happy to announce the availability of Apache Spark 3.4.1!
Spark 3.4.1 is a maintenance release containing stability fixes. This
release is based on the branch-3.4 maintenance branch of Spark. We strongly
recommend all 3.4 users to upgrade to this stable release.
To download Spark 3.4.1
> Hi,
>
> I am using Spark with Iceberg, updating a table with 1,700 columns.
> We are loading 0.6 million rows from Parquet files; in future it will be
> 16 million rows. We are trying to update the data in the table, which has 16
> buckets,
> using the default partitioner of sp
Hi,
I am using Spark with Iceberg, updating a table with 1,700 columns.
We are loading 0.6 million rows from Parquet files; in future it will be 16
million rows. We are trying to update the data in the table, which has 16
buckets,
using the default partitioner of Spark. Also we don't d
at group)
>
> You need to raise a JIRA for this request plus the related doc
>
>
> Example JIRA
>
> https://issues.apache.org/jira/browse/SPARK-42485
>
> and the related *Spark project improvement proposals (SPIP) *to be filled
> in
>
> https://spark.apache.org/imp
Sean is right, casting timestamps to strings (which is what show() does)
uses the local timezone, either the Java default zone `user.timezone`,
the Spark default zone `spark.sql.session.timeZone` or the default
DataFrameWriter zone `timeZone` (when writing to file).
You say you are in PST
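A hedged sketch of checking this: pin the session time zone explicitly and compare the rendered timestamp against the raw epoch value (the column name `ts` is hypothetical):

```python
from pyspark.sql import functions as F

# Render timestamps in UTC instead of the local zone when casting/showing.
spark.conf.set("spark.sql.session.timeZone", "UTC")

# The underlying epoch value does not change with the display time zone.
df.select(
    F.col("ts"),                                     # hypothetical timestamp column
    F.col("ts").cast("long").alias("epoch_seconds"),
).show(truncate=False)
```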
Are you sure it is not just that it's displaying in your local TZ? Check the
actual value as a long, for example. That is likely the same time.
On Thu, Jun 8, 2023, 5:50 PM karan alang wrote:
> ref :
> https://stackoverflow.com/questions/76436159/apache-spark-not-reading-utc-timestamp-f
ref :
https://stackoverflow.com/questions/76436159/apache-spark-not-reading-utc-timestamp-from-mongodb-correctly
Hello All,
I've data stored in MongoDB collection and the timestamp column is not
being read by Apache Spark correctly. I'm running Apache Spark on GCP
Dataproc.
Here is s
Try sending it to d...@spark.apache.org (and join that group)
You need to raise a JIRA for this request plus the related doc
Example JIRA
https://issues.apache.org/jira/browse/SPARK-42485
and the related *Spark project improvement proposals (SPIP) *to be filled in
https
Do Spark **devs** read this mailing list?
Is there another/a better way to make feature requests?
I tried in the past to write a mail to the dev mailing list but it did not
show up at all.
Cheers
keen schrieb am Do., 1. Juni 2023, 07:11:
> Hi all,
> currently only *temporary* Spark Views
Hi all,
currently only *temporary* Spark Views can be created from a DataFrame
(df.createOrReplaceTempView or df.createOrReplaceGlobalTempView).
When I want a *permanent* Spark View I need to specify it via Spark SQL
(CREATE VIEW AS SELECT ...).
Sometimes it is easier to specify the desired
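For completeness, a minimal sketch of the Spark SQL route mentioned above; the database, view and table names are placeholders (note that a permanent view generally cannot reference a temporary view):

```python
# Define a permanent (catalog) view directly over an existing table.
spark.sql("""
    CREATE OR REPLACE VIEW mydb.recent_orders AS
    SELECT id, amount, event_date
    FROM mydb.orders              -- hypothetical existing table
    WHERE event_date >= '2023-01-01'
""")
```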
Great stuff Winston. I added a channel in Slack Community for Spark
https://sparkcommunitytalk.slack.com/archives/C05ACMS63RT
cheers
Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom
view my Linkedin profile
<ht
Hi everyone,
We published an article on the performance and correctness of Trino, Spark,
and Hive-MR3, and thought that it could be of interest to Spark users.
https://www.datamonad.com/post/2023-05-31-trino-spark-hive-performance-1.7/
Omitted in the article is the performance of Spark 2.3.1 vs
Hi Nikhil
Spark operator supports ingress for exposing all UIs of running spark
applications.
reference:
https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/quick-start-guide.md#driver-ui-access-and-ingress
On Thu, Jun 1, 2023 at 6:19 AM Nikhil Goyal wrote:
>
Hi Mich,
I have been using ChatGPT free version, Bing AI, Google Bard and other AI
chatbots.
My use cases so far include writing, debugging code, generating documentation
and explanation on Spark key terminologies for beginners to quickly pick up new
concepts, summarizing pros and cons or
Hi folks,
Is there an equivalent of the YARN RM page for Spark on Kubernetes? We can
port-forward the UI from the driver pod for each job, but this process is
tedious given we have multiple jobs running. Is there a clever solution to
exposing all Driver UIs in a centralized place?
Thanks
Nikhil
I have started looking into ChatGPT as a consumer. The one I have tried is
the free, not the Plus, version.
I asked a question entitled "what is the future for spark" and asked for a
concise response
This was the answer
"Spark has a promising future due to its capabilities in
Hi,
Thanks for your response.
I understand there is no explicit way to configure dynamic scaling for
Spark Structured Streaming as the ticket is still open for that. But is
there a way to manage dynamic scaling with the existing Batch Dynamic
scaling algorithm as this kicks in when Dynamic
2.13.8
<https://github.com/apache/spark/blob/87a5442f7ed96b11051d8a9333476d080054e5a0/pom.xml#LL3606C46-L3606C46>
you must change 2.13.6 to 2.13.8
man. 29. mai 2023 kl. 18:02 skrev Mich Talebzadeh :
> Thanks everyone. Still not much progress :(. It is becoming a bit
> confusing as
parkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org
$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:1020)
at
org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:192)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:215)
at
org.apac
Change

  <groupId>org.scala-lang</groupId>
  <artifactId>scala-library</artifactId>
  <version>2.13.11-M2</version>

to

  <groupId>org.scala-lang</groupId>
  <artifactId>scala-library</artifactId>
  <version>${scala.version}</version>
man. 29. mai 2023 kl. 13:20 skrev Lingzhe Sun :
> Hi Mich,
>
> Spark 3.4.0 prebuilt with scala 2.13 is built with version 2.13.8
> <https://github.com/ap
Hi Mich,
Spark 3.4.0 prebuilt with Scala 2.13 is built with version 2.13.8. Since you
are using spark-core_2.13 and spark-sql_2.13, you should stick to the major (13)
and the minor version (8). Not matching these may cause unexpected
behaviour (though Scala claims compatibility among minor
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at
org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org
$apache$sp
ion` property matches the
version of `scala-library` defined in your dependencies. In your case, you
may want to change `scala.version` to `2.13.6`.
Here's the corrected part of your POM:
```xml
1.7
1.7
UTF-8
2.13.6
2.15.2
```
Additionally, ensure that the Scala versions in the
Hi,
Autoscaling is not compatible with Spark Structured Streaming
<https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html>
since
Spark Structured Streaming currently does not support dynamic allocation
(see SPARK-24815: Structured Streaming should support d
Hi Team,
I have been working on Spark Structured Streaming and trying to autoscale
our application through dynamic allocation. But I couldn't find any
documentation or configurations that support dynamic scaling in Spark
Structured Streaming, due to which I had been using Spark Batch
a shuffle involved.
Raghavendra proposed iterating over the partitions. I presume that you
partition by "Date" and order within each partition by "Row", which puts
multiple dates into one partition. Even if you have one date per
partition, AQE might coale
can you
> suggest a method for this scenario? Also, for your information, this dataset is
> really large; it has around 1 records and I am using Spark with
> Scala.
>
> Thank You,
> Best Regards
>
>
as a partition so row numbers cannot get jumbled. So can you
suggest a method for this scenario? Also, for your information, this dataset is
really large; it has around 1 records and I am using Spark with
Scala.
Thank You,
Best Regards
Just to correct the last sentence: if we end up starting a new instance of
Spark, I don't think it will be able to read the shuffle data written to storage
by another instance, but I stand to be corrected.
Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
U
back to the source. In case of failure, there will be
a delay in getting the results. The amount of delay depends on how much
reprocessing Spark needs to do.
Spark, by itself, doesn't add executors when executors fail. It just moves
the tasks to other executors. If you are installing plain va
levant.
We are running Spark in standalone mode.
Best regards,
maksym
On 2023/05/17 12:28:35 vaquar khan wrote:
> From the following link you will get all the required details
>
> https://aws.amazon.com/blogs/containers/best-practices-for-running-spark-on-amazon-eks/
>
> Let me know if you
from pyspark.sql import functions as F
from pyspark.sql.functions import *
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, FloatType, TimestampType
import time

def main():
    appName = "skew"
    spark = SparkSession.builder.appName(appName).getOrCreate()
    spa
From the following link you will get all the required details:
https://aws.amazon.com/blogs/containers/best-practices-for-running-spark-on-amazon-eks/
Let me know if you require further information.
Regards,
Vaquar khan
On Mon, May 15, 2023, 10:14 PM Mich Talebzadeh
wrote:
> Couple of points
>
: Understanding Spark S3 Read Performance
Hi,
I'm trying to set up a Spark pipeline which reads data from S3 and writes it
into Google Big Query.
Environment Details:
---
Java 8
AWS EMR-6.10.0
Spark v3.3.1
2 m5.xlarge executor nodes
S3 Directory structure:
---
bucket-name:
|---fo
Hi,
I'm trying to set up a Spark pipeline which reads data from S3 and writes
it into Google Big Query.
Environment Details:
---
Java 8
AWS EMR-6.10.0
Spark v3.3.1
2 m5.xlarge executor nodes
S3 Directory structure:
---
bucket-name:
|---folder1:
|---fo
Hi,
On the issue of Spark shuffle, it is accepted that shuffle *often involves*
the following, if not all of the below:
- Disk I/O
- Data serialization and deserialization
- Network I/O
Excluding external shuffle service and without relying on the configuration
options provided by spark for
On Mon, 15 May 2023 at 13:11, Faiz Halde
wrote:
> Hello,
>
> We've been in touch with a few Spark specialists who suggested a
> potential solution to improve the reliability of our jobs that are shuffle
> heavy
>
> Here is what our setup looks like
>
>-
Hello,
We've been in touch with a few Spark specialists who suggested a
potential solution to improve the reliability of our jobs that are shuffle
heavy.
Here is what our setup looks like
- Spark version: 3.3.1
- Java version: 1.8
- We do not use external shuffle service
- W
In my view Spark is behaving as expected.
TL;DR:
Every time a DataFrame is reused, branched, or forked, the sequence of operations
is evaluated again. Use cache or persist to avoid this behavior, and unpersist
when no longer required; Spark does not unpersist automatically.
Couple of things
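A minimal sketch of the cache/unpersist pattern described above; the paths and column names are hypothetical:

```python
# Branching a DataFrame normally re-runs its lineage for every action;
# caching materializes it once so both branches reuse the same result.
base = spark.read.parquet("/data/events")  # placeholder path
base.cache()

summary = base.groupBy("event_date").count()
errors = base.filter("status = 'ERROR'")

summary.show()
errors.show()

# Spark does not unpersist automatically; release the storage when done.
base.unpersist()
```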
Please see if this works
-- aggregate array into map of element of count
SELECT aggregate(array(1,2,3,4,5),
map('cnt',0),
(acc,x) -> map('cnt', acc.cnt+1)) as array_count
thanks
Vijay
On 2023/05/05 19:32:04 Yong Zhang wrote:
> Hi, This is on Spark 3.1 environment.
When I run this job in local mode spark-submit --master local[4]
with
spark = SparkSession.builder \
.appName("tests") \
.enableHiveSupport() \
.getOrCreate()
spark.conf.set("spark.sql.adaptive.enabled", "true")
df3.explain(extended=Tru
a'), map(),
(acc, x) -> ???, acc -> acc) AS feq_cnt
Here are my questions:
* Is using "map()" above the best way? The "start" structure in this case
should be Map.empty[String, Int], but of course, it won
v], PartitionFilters: [],
PushedFilters: [], ReadSchema: struct
```
On Mon, May 8, 2023 at 1:07 AM Mich Talebzadeh
wrote:
> When I run this job in local mode spark-submit --master local[4]
>
> with
>
> spark = SparkSession.builder \
> .appName(&
map(),
(acc, x) -> ???,
acc -> acc) AS feq_cnt
Here are my questions:
* Is using "map()" above the best way? The "start" structure in this case
should be Map.empty[String, Int], but of course, it won't wor