Hi All,
We're using Spark 2.4.x to write a dataframe into an Elasticsearch index.
As we upgrade to Spark 3.3.0, it throws this error:
Caused by: java.lang.ClassNotFoundException: es.DefaultSource
at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:476)
at
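For reference, a hedged sketch of what the Spark 3.x write path usually looks like. The connector artifact and version are assumptions: the "es" / DefaultSource name only resolves if a matching elasticsearch-spark connector for your Spark and Scala version is on the classpath, e.g. via spark-submit --packages org.elasticsearch:elasticsearch-spark-30_2.12:8.9.0 (version is a placeholder; match it to your ES cluster).

    # Sketch only: host and index names below are placeholders.
    (df.write
       .format("org.elasticsearch.spark.sql")  # long-form name behind the "es" alias
       .option("es.nodes", "es-host:9200")
       .option("es.resource", "my-index")
       .mode("append")
       .save())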
I am using python 3.7 and Spark 2.4.7
I am not sure what the best way to do this is.
I have a dataframe with a url in one of the columns, and I want to download the
contents of that url and put it in a new column.
Can someone point me in the right direction on how to do this? I looked at the
UDFs
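For reference, a minimal UDF-based sketch for this. It makes one HTTP call per row, so it is fine for small data but slow at scale.

    import urllib.request
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    @F.udf(returnType=StringType())
    def fetch_url(url):
        # Return the page body, or null if the download fails
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                return resp.read().decode("utf-8", errors="replace")
        except Exception:
            return None

    df = df.withColumn("contents", fetch_url(F.col("url")))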
Hi Srivastan,
Ground investigation
1. Does this union explicitly exist in your code? If not, where are the
7- and 6-column counts coming from?
2. On 3.3.1, have you looked at the Spark UI and the relevant DAG diagram?
3. Check the query execution plan using explain() (see the sketch after this list).
4. Can
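For point 3, a quick sketch:

    df.explain()             # physical plan only
    df.explain(True)         # logical + physical plans
    df.explain("formatted")  # Spark 3.x formatted output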
Hello Users,
I have been seeing some weird issues when I upgraded my
EMR setup to 6.11 (which uses Spark 3.3.2); the call stack seems to point
to a code location where there is no explicit union, also I have
unionByName everywhere in the codebase with allowMissingColumns set
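For reference, a minimal sketch of the call being described (df_a and df_b are placeholder names): unionByName aligns columns by name rather than position, and allowMissingColumns=True (Spark 3.1+) null-fills columns missing on either side instead of failing.

    merged = df_a.unionByName(df_b, allowMissingColumns=True)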
Hi everyone,
I’m trying to use the “extension” feature of the Spark Connect CommandPlugin
(Spark 3.4.1).
I created a simple protobuf message `MyMessage` that I want to send from the
connect client-side to the connect server (where I registered my plugin).
The SparkSession class in `spark
Hi Jeremy,
This error concerns me
"23/08/23 20:01:03 ERROR LevelDBProvider: error opening leveldb file
file:/mnt/data_ebs/infrastructure/spark/tmp/registeredExecutors.ldb.
Creating new file, will not be able to recover state for existing
applications org.fusesource.leveldbjni.internal.Nat
Hi Spark Community,
We have a cluster running with Spark 3.3.1. All nodes are AWS EC2’s with an
Ubuntu OS version 22.04.
One of the workers disconnected from the main node. When we run
$SPARK_HOME/sbin/start-worker.sh spark://{main_host}:{cluster_port} it
appears to run successfully; there is no
We are happy to announce the availability of Apache Spark 3.3.3!
Spark 3.3.3 is a maintenance release containing stability fixes. This
release is based on the branch-3.3 maintenance branch of Spark. We strongly
recommend that all 3.3 users upgrade to this stable release.
To download Spark 3.3.3
In your file /home/spark/real-estate/pullhttp/pull_apartments.py
replace import org.apache.spark.SparkContext with from pyspark import
SparkContext
Mon, 21 Aug 2023 at 15:13, Kal Stevens wrote:
> I am getting a class not found error
> import org.apache.spark.SparkContext
>
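For reference, a sketch of the Python-side equivalent of that import:

    # The Scala-style "import org.apache.spark.SparkContext" fails in Python;
    # the PySpark equivalent is:
    from pyspark import SparkContext
    sc = SparkContext.getOrCreate()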
I am getting a class not found error
import org.apache.spark.SparkContext
It sounds like this is because pyspark is not installed, but as far as I
can tell it is.
Pyspark is installed in the correct Python version
root@namenode:/home/spark/# pip3.10 install pyspark
Requirement already
Hi Team,
I need some help: could someone replicate the issue at their end, or
let me know if I am doing anything wrong?
https://issues.apache.org/jira/browse/SPARK-44884
We have recently upgraded to Spark 3.3.0 in our Production Dataproc.
We have a lot of downstream applications that rely
Interesting.
Spark supports the following cluster managers
- Standalone: A cluster-manager, limited in features, shipped with Spark.
- Apache Hadoop YARN is the most widely used resource manager not just
for Spark but for other artefacts as well. On-premise YARN is used
extensively
This should work.
Check your path. Running which pyspark should return the copy from your Spark
installation:
which pyspark
/opt/spark/bin/pyspark
And your installation should contain
cd $SPARK_HOME
/opt/spark> ls
LICENSE NOTICE R README.md RELEASE bin conf data examples jars
kubernetes licenses logs python sbin yarn
You sho
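As an extra sanity check, a small sketch to run in the same Python that launches your job:

    import pyspark
    print(pyspark.__version__)  # should match the installed Spark
    print(pyspark.__file__)     # shows where the package is actually loaded from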
Nevermind I was doing something dumb
On Sun, Aug 20, 2023 at 9:53 PM Kal Stevens wrote:
> Are there installation instructions for Spark 3.4.1?
>
> I defined SPARK_HOME as it describes here
>
> https://spark.apache.org/docs/latest/api/python/getting_started/install.html
>
>
Good afternoon.
Perhaps you will be discouraged by what I will write below, but nevertheless, I
ask for help in solving my problem. Perhaps the architecture of our solution
will not seem correct to you.
There are backend services that communicate with a service that implements
spark-driver
Are there installation instructions for Spark 3.4.1?
I defined SPARK_HOME as it describes here
https://spark.apache.org/docs/latest/api/python/getting_started/install.html
ls $SPARK_HOME/python/lib
py4j-0.10.9.7-src.zip PY4J_LICENSE.txt pyspark.zip
I am getting a class not found error
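A hedged sketch of making the bundled PySpark importable without pip, using the zips listed above (paths are assumptions based on your listing):

    import os, sys
    spark_home = os.environ["SPARK_HOME"]
    sys.path.insert(0, os.path.join(spark_home, "python"))
    sys.path.insert(0, os.path.join(spark_home, "python", "lib",
                                    "py4j-0.10.9.7-src.zip"))
    from pyspark import SparkContext  # should now resolve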
> On Sat, 19 Aug 2023 at 21:36, Dipayan Dev wrote:
>
>> Hi Everyone,
>>
>> I'm stuck with one problem, where I need to provide a custom GCS locat
ed to provide a custom GCS location
> for the Hive table from Spark. The code fails while doing an *'insert
> into'* whenever my Hive table has a flat GS location like
> gs://, but works for nested locations like
> gs://bucket_name/blob_name.
>
> Is anyone aware if it
Hi Everyone,
I'm stuck with one problem, where I need to provide a custom GCS location
for the Hive table from Spark. The code fails while doing an *'insert into'*
whenever my Hive table has a flat GS location like gs://, but
works for nested locations like gs://bucket_name/blob_n
> Concurrency and Thread Issues: If there are too many concurrent
> connections or thread limitations,
> it could result in failed connections. Adjust
> spark.shuffle.io.clientThreads
> - It might be prudent to do the same to spark.shuffle.io.serverThreads
> - Check how stable your envi
it could result in failed connections. Adjust
spark.shuffle.io.clientThreads
- It might be prudent to do the same to spark.shuffle.io.serverThreads
- Check how stable your environment is. Observe any issues reported in
Spark UI
HTH
Mich Talebzadeh,
Solutions Architect/Engineering Lead
trust that you are familiar with the concept of shuffle in Spark.
> Spark Shuffle is an expensive operation since it involves the following:
> - Disk I/O
> - Data serialization and deserialization
> - Network I/O
> Bas
Hi,
These two threads that you sent seem to be duplicates of each other?
Anyhow I trust that you are familiar with the concept of shuffle in Spark.
Spark Shuffle is an expensive operation since it involves the following:
- Disk I/O
- Data serialization and deserialization
I want to learn the differences among the thread configurations below.
spark.shuffle.io.serverThreads
spark.shuffle.io.clientThreads
spark.shuffle.io.threads
spark.rpc.io.serverThreads
spark.rpc.io.clientThreads
spark.rpc.io.threads
Thanks.
I want to learn the differences among the thread configurations below.
spark.shuffle.io.serverThreads
spark.shuffle.io.clientThreads
spark.shuffle.io.threads
spark.rpc.io.serverThreads
spark.rpc.io.clientThreads
spark.rpc.io.threads
Thanks.
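To my understanding, the spark.shuffle.io.* settings size the Netty thread pools of the shuffle transport, and the spark.rpc.io.* settings those of the RPC layer; the serverThreads/clientThreads variants override the role-agnostic *.threads value for the server and client side respectively. A sketch of setting them explicitly (values purely illustrative, not recommendations):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
        .config("spark.shuffle.io.serverThreads", "16")  # shuffle server pool
        .config("spark.shuffle.io.clientThreads", "16")  # shuffle fetch client pool
        .config("spark.rpc.io.threads", "8")             # RPC layer pool
        .getOrCreate())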
s back.
Thanks,
Sankavi
From: Bjørn Jørgensen
Sent: Monday, August 14, 2023 6:11 PM
To: Sankavi Nagalingam
Cc: user@spark.apache.org; Vijaya Kumar Mathupaiyan
Subject: [EXT MSG] Re: Spark Vulnerabilities
I have added links to the github
Yes, it sounds like it. The broadcast DF size seems to be between 1 and
4GB, so I suggest that you leave it as it is.
I have not used the standalone mode since spark-2.4.3 so I may be missing a
fair bit of context here. I am sure there are others like you that are
still using it!
HTH
Mich
> On Thu, 17 Aug 2023 at 21:01, Patrick Tucci
> wrote:
>
>> Hi Mich,
>>
>> Here are my config values from spark-defaults.conf:
>>
>> spark.eventLog.enabled true
>> spark.eventLog.dir hdfs:/
On Thu, 17 Aug 2023 at 21:01, Patrick Tucci wrote:
> Hi Mich,
>
> Here are my config values from spark-defaults.conf:
>
> spark.eventLog.enabled true
> spark.eventLog.dir hdfs://10.0.50.1:8020/spark-
Hi Mich,
Here are my config values from spark-defaults.conf:
spark.eventLog.enabled true
spark.eventLog.dir hdfs://10.0.50.1:8020/spark-logs
spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
spark.history.fs.logDirectory hdfs://10.0.50.1:8020/spark-logs
Hello Patrick,
As a matter of interest, what parameters and their respective values do
you use in spark-submit? I assume it is running in YARN mode.
HTH
Mich Talebzadeh,
Solutions Architect/Engineering Lead
London
United Kingdom
view my Linkedin profile
<https://www.linkedin.com/in/m
Hi Mich,
Yes, that's the sequence of events. I think the big breakthrough is that
(for now at least) Spark is throwing errors instead of the queries hanging,
which is a big step forward. I can at least troubleshoot issues if I know
what they are.
When I reflect on the issues I faced an
Hi Patrick,
glad that you have managed to sort this problem out. Hopefully it will go
away for good.
Still, we are in the dark about how this problem is going away and coming
back :( As I recall, the chronology of events was as follows:
1. The Issue with hanging Spark job reported
2
Hi Everyone,
I just wanted to follow up on this issue. This issue has continued since
our last correspondence. Today I had a query hang and couldn't resolve the
issue. I decided to upgrade my Spark install from 3.4.0 to 3.4.1. After
doing so, instead of the query hanging, I got an error me
For the Guava case, you may be interested in
https://github.com/apache/spark/pull/42493
Thanks,
Cheng Pan
> On Aug 14, 2023, at 16:50, Sankavi Nagalingam
> wrote:
>
> Hi Team,
> We could see there are many dependent vulnerabilities present in the latest
> spark-core:3.4.
Yeah, we generally don't respond to "look at the output of my static
analyzer".
Some of these are already addressed in a later version.
Some don't affect Spark.
Some are possibly an issue but hard to change without breaking lots of
things - they are really issues with upstrea
I have added links to the GitHub PRs, or comments for those that I have not
seen before.
Apache Spark has very many dependencies, some can easily be upgraded while
others are very hard to fix.
Please feel free to open a PR if you wanna help.
Mon, 14 Aug 2023 at 14:06, Sankavi Nagalingam wrote:
Hi Team,
We could see that there are many dependency vulnerabilities present in the
latest spark-core:3.4.1.jar. PFA.
Could you please let us know when the fix version will be available for
users?
Thanks,
Sankavi
or install an additional Java version, I attempted to use the
> latest alpha as well. This appears to have worked, although I couldn't
> figure out how to get it to use the metastore_db from Spark.
>
> After turning my attention back to Spark, I determined the issue. After
> much trouble
tions
suggest it might be a Java incompatibility issue. Since I didn't want to
downgrade or install an additional Java version, I attempted to use the
latest alpha as well. This appears to have worked, although I couldn't
figure out how to get it to use the metastore_db from Spark.
After turni
to migrate
>>> to Delta Lake and see if that solves the issue.
>>>
>>> Thanks again for your feedback.
>>>
>>> Patrick
>>>
>>> On Fri, Aug 11, 2023 at 10:09 AM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>
;
>> On Fri, Aug 11, 2023 at 10:09 AM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Hi Patrick,
>>>
>>> There is nothing wrong with Hive on-premise; it is the best data
>>> warehouse there is.
>>>
>>> Hive handles bo
..@gmail.com> wrote:
>
>> Hi Patrick,
>>
>> There is nothing wrong with Hive on-premise; it is the best data
>> warehouse there is.
>>
>> Hive handles both ORC and Parquet formats well. They are both columnar
>> implementations of the relational mod
both ORC and Parquet formats well. They are both columnar
> implementations of the relational model. What you are seeing is the Spark API
> to Hive, which prefers Parquet. I found out a few years ago.
>
> From your point of view I suggest you stick to parquet format with Hive
> specific t
Hi Patrick,
There is nothing wrong with Hive on-premise; it is the best data
warehouse there is.
Hive handles both ORC and Parquet formats well. They are both columnar
implementations of the relational model. What you are seeing is the Spark API
to Hive, which prefers Parquet. I found out a few
Thanks for the reply Stephen and Mich.
Stephen, you're right, it feels like Spark is waiting for something, but
I'm not sure what. I'm the only user on the cluster and there are plenty of
resources (+60 cores, +250GB RAM). I even tried restarting Hadoop, Spark
and the host serve
Steve may have a valid point. You raised an issue with concurrent writes
before, if I recall correctly; this limitation may be due to the Hive
metastore. By default Spark uses Apache Derby for its database
persistence. However,
it is limited to only one Spark session at any time for the purposes
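For what it's worth, the usual way around the single-session Derby limit is to point Spark at a standalone Hive metastore service; a hedged sketch (the thrift URI is an assumption for illustration):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
        .config("hive.metastore.uris", "thrift://metastore-host:9083")
        .enableHiveSupport()
        .getOrCreate())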
Hi Kezhi,
Yes, you no longer need to start a master to make the client work. Please
see the quickstart.
https://spark.apache.org/docs/latest/api/python/getting_started/quickstart_connect.html
You can think of Spark Connect as an API on top of Master so workers can be
added to the cluster same
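For completeness, a minimal client-side sketch; 15002 is the connect server's default port, so adjust for your deployment:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
    spark.range(5).show()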
Hi Patrick,
When this has happened to me in the past (admittedly via spark-submit) it has
been because another job was still running and had already claimed some of the
resources (cores and memory).
I think this can also happen if your configuration tries to claim resources
that will never be
Hi Mich,
I don't believe Hive is installed. I set up this cluster from scratch. I
installed Hadoop and Spark by downloading them from their project websites.
If Hive isn't bundled with Hadoop or Spark, I don't believe I have it. I'm
running the Thrift server distributed
Hi Mark,
I created a Spark 3.4.1 Dockerfile. Details from
spark-py-3.4.1-scala_2.12-11-jre-slim-buster
<https://hub.docker.com/repository/docker/michtalebzadeh/spark_dockerfiles/tags?page=1&ordering=last_updated>
Pull instructions are given
docker pull
michtalebzadeh/spark_dockerfile
> On Thu, 10 Aug 2023 at 20:02, Patrick Tucci
> wrote:
>
>> Hi Mich,
>>
>> Thanks for the reply. Unfortunately I don't have Hive set up on my
>> cluster. I can explore this if th
my
> cluster. I can explore this if there are no other ways to troubleshoot.
>
> I'm using beeline to run commands against the Thrift server. Here's the
> command I use:
>
> ~/spark/bin/beeline -u jdbc:hive2://10.0.50.1:1 -n hadoop -f
> command.sql
>
> Thanks
Hi Mich,
Thanks for the reply. Unfortunately I don't have Hive set up on my cluster.
I can explore this if there are no other ways to troubleshoot.
I'm using beeline to run commands against the Thrift server. Here's the
command I use:
~/spark/bin/beeline -u jdbc:hive2://10.
On Thu, 10 Aug 2023 at 18:39, Patrick Tucci wrote:
> Hello,
>
> I'm attempting to run a query on Spar
Hello,
I'm attempting to run a query on Spark 3.4.0 through the Spark
ThriftServer. The cluster has 64 cores, 250GB RAM, and operates in
standalone mode using HDFS for storage.
The query is as follows:
SELECT ME.*, MB.BenefitID
FROM MemberEnrollment ME
JOIN MemberBenefits MB
ON
Hi,
I'm recently learning Spark Connect but have some questions regarding the
connect server's relation with master or workers: so when I'm using the
connect server, I don't have to start a master alone side to make clients
work. Is the connect server simply using "local[
Hi Mark,
you can build it yourself, no big deal :)
REPOSITORY: sparkpy/spark-py
TAG:        3.4.1-scala_2.12-11-jre-slim-buster-Dockerfile
IMAGE ID:   a876102b2206
CREATED:    1 second ago
Hello,
I noticed that the apache/spark-py image for Spark's 3.4.1 release is not
available (apache/spark@3.4.1 is available). Would it be possible to get
the 3.4.1 release build for the apache/spark-py image published?
Thanks,
Mark
unsubscribe
From: Mich Talebzadeh
Sent: Tuesday, August 8, 2023 4:43 PM
To: user @spark
Subject: [EXTERNAL] Use of ML in certain aspects of Spark to improve the
performance
I am currently pondering and sharing my thoughts openly. Given our reliance on
gathered
I am currently pondering and sharing my thoughts openly. Given our reliance
on gathered statistics, it prompts the question of whether we could
integrate specific machine learning components into Spark Structured
Streaming. Consider a scenario where we aim to adjust configuration values
on the fly
Hi,
I would like to share my experience with Spark 3.4.1 running on k8s autopilot,
or what some refer to as serverless.
My current experience is on Google GKE autopilot
<https://cloud.google.com/kubernetes-engine/docs/concepts/autopilot-overview>.
So essentially you specify the name and region a
pp4 has one row, I'm guessing - containing an array of 10 images. You want
10 rows of 1 image each.
But, just don't do this. Pass the bytes of the image as an array,
along with width/height/channels, and reshape it on use. It's just easier.
That is how the Spark image representati
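A minimal sketch of that suggestion, assuming a 500x333x3 uint8 image (the zeros array is a stand-in for real pixel data):

    import numpy as np
    from pyspark.sql import Row

    img = np.zeros((500, 333, 3), dtype=np.uint8)
    df = spark.createDataFrame([
        Row(data=bytearray(img.tobytes()), height=500, width=333, channels=3)
    ])

    r = df.first()
    restored = np.frombuffer(bytes(r.data), dtype=np.uint8).reshape(
        r.height, r.width, r.channels)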
Hello Adrian,
here is the snippet
import tensorflow_datasets as tfds

(ds_train, ds_test), ds_info = tfds.load(
    dataset_name, data_dir='', split=["train", "test"],
    with_info=True, as_supervised=True
)

schema = StructType([
    StructField("image", ArrayType(ArrayType(ArrayType(Integer
will cover me as I plan to be out of the
> office soon)
>
> Hi Kent and Sean,
>
> Nice to meet you. I am working on the OSS legal aspects with Pavan who is
> planning to make the contribution request to the Spark project. I saw that
> Sean mentioned in his email that the contribu
(Adding my manager Eugene Kim who will cover me as I plan to be out of the
office soon)
Hi Kent and Sean,
Nice to meet you. I am working on the OSS legal aspects with Pavan who is
planning to make the contribution request to the Spark project. I saw that
Sean mentioned in his email that the
Hello,
can you also please show us how you created the pandas dataframe? I mean,
how you added the actual data into the dataframe. It would help us
reproduce the error.
Thank you,
Pop-Tifrea Adrian
On Mon, Jul 31, 2023 at 5:03 AM second_co...@yahoo.com <
second_co...@yahoo.com> wrote:
> i
Hi,
I am new to Spark and looking for help regarding the session windowing
<https://spark.apache.org/docs/3.4.1/structured-streaming-programming-guide.html#types-of-time-windows>
in Spark. I want to create session windows on a user activity stream with a
gap duration of `x` minutes and als
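For reference, a minimal session-window sketch with a 5-minute gap; the stream, the column names ts and userId, and the watermark value are assumptions:

    from pyspark.sql import functions as F

    sessions = (events
        .withWatermark("ts", "10 minutes")
        .groupBy(F.session_window("ts", "5 minutes"), "userId")
        .count())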
I changed to
ArrayType(ArrayType(ArrayType(IntegerType()))) and still get the same error.
Thank you for responding.
On Thursday, July 27, 2023 at 06:58:09 PM GMT+8, Adrian Pop-Tifrea
wrote:
Hello,
when you said your pandas Dataframe has 10 rows, does that mean it contains 10
images? Becaus
OK, so as expected the underlying database is Hive. Hive uses HDFS storage.
You said you encountered limitations on concurrent writes. The order and
limitations are introduced by the Hive metastore, so to speak. Since this is all
happening through Spark, by default implementation of the Hive metastore
4:28 PM Mich Talebzadeh
> wrote:
>
>> It is not Spark SQL that throws the error. It is the underlying Database
>> or layer that throws the error.
>>
>> Spark acts as an ETL tool. What is the underlying DB where the table
>> resides? Is concurrency supported?
that
will work better in different use cases according to the writing pattern,
type of queries, data characteristics, etc.
*Pol Santamaria*
On Sat, Jul 29, 2023 at 4:28 PM Mich Talebzadeh
wrote:
> It is not Spark SQL that throws the error. It is the underlying Database
> or layer that
It is not Spark SQL that throws the error. It is the underlying Database or
layer that throws the error.
Spark acts as an ETL tool. What is the underlying DB where the table
resides? Is concurrency supported? Please send the error to this list.
HTH
Mich Talebzadeh,
Solutions Architect
Hello,
I'm building an application on Spark SQL. The cluster is set up in
standalone mode with HDFS as storage. The only Spark application running is
the Spark Thrift Server using FAIR scheduling mode. Queries are submitted
to Thrift Server using beeline.
I have multiple queries that insert
Spark on tin boxes like Google Dataproc or AWS EC2 often utilise YARN
resource manager. YARN is the most widely used resource manager not just
for Spark but for other artefacts as well. On-premise YARN is used
extensively. In Cloud it is also used widely in Infrastructure as a Service
such as
Hi all,
I am learning about the performance difference of Spark when performing a
JOIN on serverless (K8s) and serverful (traditional server)
environments.
In my experiments, Spark on K8s tends to run slower than serverful.
From my understanding of the architecture, I know that Spark runs
Hello,
when you said your pandas Dataframe has 10 rows, does that mean it contains
10 images? Because if that's the case, then you'd want to only use 3 layers
of ArrayType when you define the schema.
Best regards,
Adrian
On Thu, Jul 27, 2023, 11:04 second_co...@yahoo.com.INVALID
wrote:
> i h
I have a pandas dataframe with column 'image' using numpy.ndarray; the shape is
(500, 333, 3) per image. My pandas dataframe has 10 rows, thus the shape is
(10, 500, 333, 3).
When using spark.createDataFrame(panda_dataframe, schema), I need to specify
the schema:
schema = StructType([
StructField(
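The preview is cut off here, but following Adrian's advice above of one image per row, a sketch of a three-level schema (height x width x channels):

    from pyspark.sql.types import StructType, StructField, ArrayType, IntegerType

    schema = StructType([
        StructField("image", ArrayType(ArrayType(ArrayType(IntegerType()))))
    ])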
There is no such method in Spark. I think that's some EMR-specific
modification.
On Wed, Jul 26, 2023 at 11:06 PM second_co...@yahoo.com.INVALID
wrote:
> I ran the following code
>
> spark.sparkContext.list_packages()
>
> on spark 3.4.1 and i get below error
>
>
I ran the following code
spark.sparkContext.list_packages()
on spark 3.4.1 and i get below error
An error was encountered:
AttributeError
Traceback (most recent call last):
  File "/tmp/spark-3d66c08a-08a3-4d4e-9fdf-45853f65e03d/shell_wrapper.py", line 113, in exec
    self._exec
A with Twilio and consider
> establishing that to govern contributions.
> >
> > On Mon, Jul 24, 2023 at 6:10 PM Pavan Kotikalapudi <
> pkotikalap...@twilio.com.invalid> wrote:
> >>
> >> Hi Spark Dev,
> >>
> >> My name is Pavan Kotikalapudi,
ributed to the project is assumed
> to have been licensed per above already.
>
> It might be wise to review the CCLA with Twilio and consider establishing
> that to govern contributions.
>
> On Mon, Jul 24, 2023 at 6:10 PM Pavan Kotikalapudi
> wrote:
>>
>> Hi S
e wise to review the CCLA with Twilio and consider establishing
that to govern contributions.
On Mon, Jul 24, 2023 at 6:10 PM Pavan Kotikalapudi
wrote:
> Hi Spark Dev,
>
> My name is Pavan Kotikalapudi, I work at Twilio.
>
> I am looking to contribute to this spark issue
> h
Hi Spark Dev,
My name is Pavan Kotikalapudi, I work at Twilio.
I am looking to contribute to this spark issue
https://issues.apache.org/jira/browse/SPARK-24815.
There is a clause from the company's OSS saying
- The proposed contribution is about 100 lines of code modification in the
Personally, I have not done it myself.
CCing the Spark user group in case some user has tried it.
HTH
Mich Talebzadeh,
Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom
view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-p
This is the downloaded docker?
Try this with the added configuration options as below
/opt/spark/sbin/start-connect-server.sh --conf
spark.driver.extraJavaOptions="-Divy.cache.dir=/tmp -Divy.home=/tmp"
--packages
org.apache.spark:spark-connect_2.12:3.4.1
And you will get
Hello,
I am trying to launch Spark connect on Docker Image
❯ docker run -it apache/spark:3.4.1-scala2.12-java11-r-ubuntu /bin/bash
spark@aa0a670f7433:/opt/spark/work-dir$
/opt/spark/sbin/start-connect-server.sh --packages
org.apache.spark:spark-connect_2.12:3.4.1
starting
this link might help
https://stackoverflow.com/questions/46929351/spark-reading-orc-file-in-driver-not-in-executors
Mich Talebzadeh,
Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom
view my Linkedin profile
<https://www.linkedin.com/in/mich-talebza
>>> partition updates (insert overwrite) daily for the last 30 days
>>> (partitions).
>>> The ETL inside the staging directories is completed in hardly 5 minutes,
>>> but then renaming takes a lot of time as it deletes and copies the
>>> partitions.
>>> My issue is somethi
rtitions).
>> The ETL inside the staging directories is completed in hardly 5 minutes,
>> but then renaming takes a lot of time as it deletes and copies the
>> partitions.
>> My issue is something related to this -
>> https://groups.google.com/g/cloud-dataproc-discuss/c/neMyhyt
tories is completed in hardly 5 minutes,
> but then renaming takes a lot of time as it deletes and copies the
> partitions.
> My issue is something related to this -
> https://groups.google.com/g/cloud-dataproc-discuss/c/neMyhytlfyg?pli=1
>
>
>
> With Best Regards,
>
>
it deletes and copies the partitions.
My issue is something related to this -
https://groups.google.com/g/cloud-dataproc-discuss/c/neMyhytlfyg?pli=1
With Best Regards,
Dipayan Dev
On Wed, Jul 19, 2023 at 12:06 AM Mich Talebzadeh
wrote:
> Spark has no role in creating that hive stag
Spark has no role in creating that hive staging directory. That directory
belongs to Hive and Spark simply does ETL there, loading to the Hive
managed table in your case, which ends up in the staging directory.
I suggest that you review your design and use an external hive table with
explicit location
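A hedged sketch of such an external table with an explicit GCS location (database, table, columns, and bucket are illustrative):

    spark.sql("""
        CREATE EXTERNAL TABLE IF NOT EXISTS mydb.events (id INT, payload STRING)
        STORED AS PARQUET
        LOCATION 'gs://my-bucket/warehouse/events'
    """)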
It does help performance but not significantly.
I am just wondering, once Spark creates that staging directory along with
the SUCCESS file, can we just do a gsutil rsync command and move these
files to the original directory? Has anyone tried this approach, or does
anyone foresee any concerns?
On Mon, 17 Jul 2023
++ DEV community
On Mon, Jul 17, 2023 at 4:14 PM Varun Shah
wrote:
> Resending this message with a proper Subject line
>
> Hi Spark Community,
>
> I am trying to set up my forked apache/spark project locally for my 1st
> Open Source Contribution, by building and creating a pa
Hi Team,
I am still looking for guidance here. I would really appreciate anything that
points me in the right direction.
On Mon, Jul 17, 2023, 16:14 Varun Shah wrote:
> Resending this message with a proper Subject line
>
> Hi Spark Community,
>
> I am trying to set up my forked apach
ll take a long time to perform this step. One workaround will be
> to create smaller number of larger files if that is possible from Spark and
> if this is not possible then those configurations allow for configuring the
> threadpool which does the metadata copy.
>
> You can go thr
Note the
> MLlib-specific contribution guidelines section in particular.
>
> https://spark.apache.org/contributing.html
>
> Since you are looking for something to start with, take a look at this
> Jira query for starter issues.
>
>
> https://issues.apache.org/jira/browse/S
FileOutputCommitter v2 is supported in GCS, but the rename is a metadata
copy-and-delete operation in GCS, and therefore if there are a large number of
files it will take a long time to perform this step. One workaround will be
to create a smaller number of larger files if that is possible from Spark, and
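A sketch of that workaround on the Spark side; the target of 32 output files is illustrative:

    (df.coalesce(32)       # fewer, larger files -> less rename metadata in GCS
       .write
       .mode("overwrite")
       .parquet("gs://my-bucket/output/"))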
Interestingly, it took only 10 minutes to write the output in the staging
> directory and rest of the time it took to rename the objects. Thats the
> concern.
>
> Looks like a known issue in how Spark behaves with GCS, but I am not
> getting any workaround for this.
>
>
> On Mon, 17 Jul