Hi Chhavi,
Currently there is no way to handle backticks (`) in Spark StructType. Hence the
field names a.b and `a.b` are treated as completely different within StructType.
To handle that, I have added a custom implementation fixing StringIndexer#
validateAndTransformSchema. You can refer to the code on my
Hi Someshwar,
Thanks for the response. I have added my comments to the ticket
<https://issues.apache.org/jira/browse/SPARK-48463>.
Thanks,
Chhavi Bansal
On Thu, 6 Jun 2024 at 17:28, Someshwar Kale wrote:
> As a fix, you may consider adding a transformer to rename columns (perhaps
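For illustration, a minimal PySpark sketch of that rename-first workaround (the
column names are hypothetical): rename the dotted column before StringIndexer,
since dots in top-level field names collide with StructType path resolution.

from pyspark.ml.feature import StringIndexer

# withColumnRenamed matches the existing name literally, so the dot is not
# parsed as a struct path.
renamed = df.withColumnRenamed("location.longitude", "location_longitude")
si = StringIndexer(inputCol="location_longitude", outputCol="longitude_idx")
si.fit(renamed).transform(renamed).show()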
Hey, do you perform stateful operations? Maybe your state is growing
indefinitely - a screenshot with state metrics would help (you can find it
in Spark UI -> Structured Streaming -> your query). Do you have a
driver-only cluster or do you have workers too? What's the memory usage
p
Hi All,
I am using PySpark Structured Streaming with Azure Databricks for a data
load process.
In the pipeline I am using a job cluster and I am running only one
pipeline. I am getting an OUT OF MEMORY issue when running for a
long time. When I inspect the metrics of the cluster I found that,
Hi Alex,
Spark is open source software available under the Apache License 2.0
(https://www.apache.org/licenses/); further details can be found on the
FAQ page (https://spark.apache.org/faq.html).
Hope this helps.
Thanks,
Sadha
On Thu, Jun 6, 2024, 1:32 PM SANTOS SOUZA, ALEX
wrote
Hey guys!
I am part of the team responsible for software approval at EMBRAER S.A.
We are currently in the process of approving the Apache Spark 3.5.1 software
and are verifying the licensing of the application.
Therefore, I would like to kindly request you to answer the questions below.
-What
")
val si = new StringIndexer().setInputCol("location_longitude").setOutputCol("longitutdee")
val pipeline = new Pipeline().setStages(Array(renameColumn, si))
pipeline.fit(flattenedDf).transform(flattenedDf).show()
Refer to my comment
<https://issues.apache.org/jira/b
a ticket
on spark https://issues.apache.org/jira/browse/SPARK-48423, where I
describe the entire scenario. I tried debugging the code and found that
this key is being explicitly asked for in the code. The only solution was
to again set it as part of spark.conf, which could result in a race condition
since
Hello team,
I was exploring feature transformations exposed via MLlib on a nested dataset,
and encountered an error while applying any transformer to a column with
dot-notation naming. I have raised a ticket on Spark,
https://issues.apache.org/jira/browse/SPARK-48463, where I have mentioned
aimed at handling substantial volumes of data. As part of
our deployment strategy, we are endeavouring to implement a Spark-based
application on our Azure Kubernetes Service.
Regrettably, we have encountered challenges from a security perspective with
the latest Apache Spark Docker image
Hi all,
To enable wide-scale community testing of the upcoming Spark 4.0 release,
the Apache Spark community has posted a preview release of Spark 4.0. This
preview is not a stable release in terms of either API or functionality,
but it is meant to give the community early access to try the code
tionKey")
.text("output");
However, as mentioned in SPARK-44512 (
https://issues.apache.org/jira/browse/SPARK-44512), this does not guarantee
the output is globally sorted.
(note: I found that even setting
spark.sql.optimizer.plannedWrite.enabled=false still does not
I am reading from a single file:
df = spark.read.text("s3a://test-bucket/testfile.csv")
On Fri, May 31, 2024 at 5:26 AM Mich Talebzadeh
wrote:
> Tell Spark to read from a single file
>
> data = spark.read.text("s3a://test-bucket/testfile.csv")
>
> This clar
Tell Spark to read from a single file
data = spark.read.text("s3a://test-bucket/testfile.csv")
This clarifies to Spark that you are dealing with a single file and avoids
any bucket-like interpretation.
HTH
Mich Talebzadeh,
Technologist | Architect | Data Engineer | Generative AI
I will work on the first two possible causes.
For the third one, which I guess is the real problem, Spark treats the
testfile.csv object with the URL s3a://test-bucket/testfile.csv as a bucket
and tries to access _spark_metadata at the URL
s3a://test-bucket/testfile.csv/_spark_metadata
testfile.csv
OK, some observations:
- Spark job successfully lists the S3 bucket containing testfile.csv.
- Spark job can retrieve the file size (33 Bytes) for testfile.csv.
- Spark job fails to read the actual data from testfile.csv.
- The printed content from testfile.csv is an empty list
The code should read the testfile.csv file from S3 and print its content. It
only prints an empty list, although the file has content.
I have also checked our custom S3 storage (Ceph based) logs and I see only
LIST operations coming from Spark; there is no GET object operation for
testfile.csv
Hello,
Overall, the exit code of 0 suggests a successful run of your Spark job.
Analyze the intended purpose of your code and verify the output or Spark UI
for further confirmation.
24/05/30 01:23:43 INFO SparkContext: SparkContext is stopping with exitCode
0.
what to check
1. Verify
Hi Mich,
Thank you for the help and sorry about the late reply.
I ran your provided code but I got "exitCode 0". Here is the complete output:
===
24/05/30 01:23:38 INFO SparkContext: Running Spark version 3.5.0
24/05/30 01:23:38 INFO SparkContext: OS info Li
Hi, team! I have a Spark-on-k8s issue which I posted at
https://stackoverflow.com/questions/78537132/spark-on-k8s-resource-creation-order
Need help.
Did you try using to_protobuf and from_protobuf ?
https://spark.apache.org/docs/latest/sql-data-sources-protobuf.html
On Mon, May 27, 2024 at 15:45 Satyam Raj wrote:
> Hello guys,
> We're using Spark 3.5.0 for processing Kafka source that contains protobuf
> serialized data. T
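As a rough sketch of that suggestion (the descriptor path, topic, and server
names are assumptions, not from the thread), from_protobuf can decode the Kafka
value bytes given a compiled descriptor set:

from pyspark.sql import SparkSession
from pyspark.sql.protobuf.functions import from_protobuf

spark = SparkSession.builder.appName("protobuf-decode").getOrCreate()

df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "requests")  # hypothetical topic name
      .load())

# /tmp/request.desc is a descriptor set produced with
# `protoc --descriptor_set_out`; the path is an assumption.
decoded = df.select(
    from_protobuf(df.value, "Request", descFilePath="/tmp/request.desc")
    .alias("request"))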
When you use applyInPandasWithState, Spark processes each input row as it
arrives, regardless of whether certain columns, such as the timestamp
column, contain NULL values. This behavior is useful where you want to
handle incomplete or missing data gracefully within your stateful
processing logic
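A minimal sketch of that behavior (the schema and names are illustrative, not
from the thread): the stateful function still receives rows whose timestamp is
NULL, so the function filters them out itself.

import pandas as pd
from pyspark.sql.streaming.state import GroupState, GroupStateTimeout

def count_valid(key, pdfs, state: GroupState):
    # Keep a running count in state; rows with NULL event_ts still arrive
    # here, so we skip them explicitly.
    count = state.get[0] if state.exists else 0
    for pdf in pdfs:
        count += int(pdf["event_ts"].notna().sum())
    state.update((count,))
    yield pd.DataFrame({"id": [key[0]], "valid_events": [count]})

out = (events.groupBy("id").applyInPandasWithState(
    count_valid,
    outputStructType="id string, valid_events long",
    stateStructType="count long",
    outputMode="update",
    timeoutConf=GroupStateTimeout.NoTimeout))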
Hello guys,
We're using Spark 3.5.0 for processing Kafka source that contains protobuf
serialized data. The format is as follows:
message Request {
  int64 sent_ts = 1;
  repeated Event event = 2;
}
message Event {
  string event_name = 1;
  bytes event_bytes = 2;
}
The event_bytes contains the data
: Is this a supported feature of Spark? Can I rely on this
behavior for my production code?
Thanks,
Juan
drawbacks, and is not recommended in general, but at least we had something
working. The real, long-term solution was to replace the many (> 200)
withColumn() calls with only a few select() calls. You can easily find sources on
the internet for why this is better. (It was on Spark 2.3, but I th
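As an illustration of that refactor (column names hypothetical): each
withColumn call adds a projection to the logical plan, so collapsing them into
one select keeps the plan, and driver memory, small.

from pyspark.sql import functions as F

# Before: one plan node per derived column
# df = df.withColumn("a_doubled", F.col("a") * 2)
# df = df.withColumn("b_doubled", F.col("b") * 2)
# ... repeated 200+ times ...

# After: a single projection
df = df.select(
    "*",
    (F.col("a") * 2).alias("a_doubled"),
    (F.col("b") * 2).alias("b_doubled"),
)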
involves deep nesting, consider using techniques like explode or flattening
to transform it into a less nested structure. This can reduce memory usage
during operations like withColumn.
3. Lazy Evaluation: Use select before withColumn: this ensures lazy
evaluation, meaning Spark only
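A small sketch of the flattening idea (the nested schema is assumed for
illustration): explode the array and promote struct fields to top-level
columns so later transformations work on a shallow schema.

from pyspark.sql import functions as F

flat = (nested_df
        .select("id", F.explode("events").alias("event"))
        .select("id",
                F.col("event.name").alias("event_name"),
                F.col("event.ts").alias("event_ts")))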
Dear Community,
I'm reaching out to seek your assistance with a memory issue we've been
facing while processing certain large and nested DataFrames using Apache
Spark. We have encountered a scenario where the driver runs out of memory
when applying the `withColumn` method on specific DataFrames
Sorry, I thought I gave an explanation.
The issue you are encountering with incorrect record numbers in the
"ShuffleWrite Size/Records" column in the Spark DAG UI when data is read
from cache/persist is a known limitation. This discrepancy arises due to
the way Spark handles and repor
Just to further clarify: the "Shuffle Write Size/Records" column in
the Spark UI can be misleading when working with cached/persisted data
because it reflects the shuffled data size and record count, not the
entire cached/persisted data. So it is fair to say that this is a
limitation.
Yep, the Spark UI's "Shuffle Write Size/Records" column can sometimes show
incorrect record counts *when data is retrieved from cache or persisted
data*. This happens because the record count reflects the number of records
written to disk for shuffling, and not the actual number of re
please assist me?
On Fri, May 24, 2024 at 12:29 AM Prem Sahoo wrote:
Does anyone have a clue ?
On Thu, May 23, 2024 at 11:40 AM Prem Sahoo wrote:
Hello Team, in the Spark DAG UI we have a Stages tab. Once you click on each stage
you can view the tasks.
In each task we have a column "Shuffle
Can anyone please assist me?
On Fri, May 24, 2024 at 12:29 AM Prem Sahoo wrote:
> Does anyone have a clue ?
>
> On Thu, May 23, 2024 at 11:40 AM Prem Sahoo wrote:
>
>> Hello Team,
>> in the Spark DAG UI we have a Stages tab. Once you click on each stage you
>> can
Something like this in Python
from pyspark import SparkConf
from pyspark.sql import SparkSession

# Configure Spark Session with JDBC URLs
spark_conf = SparkConf() \
    .setAppName("SparkCatalogMultipleSources") \
    .set("hive.metastore.uris",
         "thrift://hive1-metastore:9080,thrift://hive2-metastore:9080")
I have two Hive clusters, hive1 and hive2, as well as a MySQL database. Can I
use the Spark Catalog to register them, or can I only use one catalog at a
time? Can multiple catalogs be joined across databases, e.g.:
select * from
hive1.table1 join hive2.table2 join mysql.table1
where
308027
Does anyone have a clue ?
On Thu, May 23, 2024 at 11:40 AM Prem Sahoo wrote:
> Hello Team,
> in the Spark DAG UI we have a Stages tab. Once you click on each stage you can
> view the tasks.
>
> In each task we have a column "Shuffle Write Size/Records"; that column
> pr
Could be a number of reasons.
First, test reading the file with the CLI:
aws s3 cp s3://input/testfile.csv .
cat testfile.csv
Try this code with a debug option to diagnose the problem:
from pyspark.sql import SparkSession
from pyspark.sql.utils import AnalysisException
try:
# Initialize Spark
Hello Team,
in the Spark DAG UI we have a Stages tab. Once you click on each stage you can
view the tasks.
In each task we have a column "Shuffle Write Size/Records"; that column
prints wrong data when it gets the data from cache/persist. It
typically will show the wrong record number thoug
I am trying to read an S3 object from local S3 storage (Ceph based)
using Spark 3.5.1. I see it can access the bucket and list the files (I
have verified it on Ceph side by checking its logs), even returning the
correct size of the object. But the content is not read.
The object URL is:
s3a
Hello,
We are hoping someone can help us understand the spark behavior
for scenarios listed below.
Q. *Will running Spark queries fail when an S3 parquet object changes
underneath, with S3A remote file change detection enabled? Is it 100%?*
Our understanding is that S3A has
cannot be guaranteed. It is essential to note that, as with any
advice, quote "one test result is worth one-thousand expert opinions" (Wernher
von Braun).
On Tue, 21 May 2024 at 16:21, Mich Talebzadeh wrote:
I just wanted to share a tool I built called spark-column-analyzer. It's a
Py
as with any advice, quote "one test result is worth one-thousand
expert opinions" (Wernher von Braun
<https://en.wikipedia.org/wiki/Wernher_von_Braun>).
On Tue, 21 May 2024 at 16:21, Mich Talebzadeh
wrote:
> I just wanted to s
I just wanted to share a tool I built called *spark-column-analyzer*. It's
a Python package that helps you dig into your Spark DataFrames with ease.
Ever spend ages figuring out what's going on in your columns? Like, how
many null values are there, or how many unique entries? Built with data
Dear Team,
I hope this email finds you well. My name is Nikhil Raj, and I am currently
working with Apache Spark on one of my projects, where, with the help
of a parquet file, we are creating an external table in Spark.
I am reaching out to seek assistance regarding user authentication
Hi Spark Community,
I have a question regarding the support for User-Defined Functions (UDFs)
in Spark Connect, specifically when using Kubernetes as the Cluster Manager.
According to the Spark documentation, UDFs are supported by default for the
shell and in standalone applications
nt M-CS) ;
user@spark.apache.org
Subject: Re: [spark-graphframes]: Generating incorrect edges
Hi Steve,
Thanks for your statement. I tend to use uuid myself to avoid collisions. This
built-in function generates random IDs that are highly likely to be unique
across systems. My concerns are
Hi everyone,
We’re about to upgrade our Spark clusters from Java 11 and Spark 3.2.1 to Spark
3.5.1.
I know that 3.5.1 is supposed to be fine on Java 17, but will it run OK on Java
21?
Thanks,
Steve C
ided is correct to the best of my
knowledge but of course cannot be guaranteed. It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions" (Wernher von Braun
<https://en.wikipedia.or
Hi Folks,
I wanted to check why Spark doesn't create a staging dir while doing an
insertInto on partitioned tables. I'm running the example code below –
```
spark.sql("set hive.exec.dynamic.partition.mode=nonstrict")
val rdd = sc.parallelize(Seq((1, 5, 1), (2, 1, 2), (4, 4, 3
Hi Team,
We are trying to use *Spark Structured Streaming* for our use case.
We will be joining 2 streaming sources (from Kafka topics) with watermarks.
As time progresses, the records that are prior to the watermark timestamp
are removed from the state. For our use case, we want to *store
Hi Kartrick,
Unfortunately materialised views are not available in Spark yet. I
raised Jira [SPARK-48117] Spark Materialized Views: Improve Query
Performance and Data Management - ASF JIRA (apache.org)
<https://issues.apache.org/jira/browse/SPARK-48117> as a feature request.
Let me
the view data into an Elastic index by using CDC?
Thanks in advance.
On Fri, May 3, 2024 at 3:39 PM Mich Talebzadeh
wrote:
> My recommendation! is using materialized views (MVs) created in Hive with
> Spark Structured Streaming and Change Data Capture (CDC) is a good
> combination for ef
Hi,
I have raised a ticket SPARK-48117
<https://issues.apache.org/jira/browse/SPARK-48117> for enhancing Spark
capabilities with Materialised Views (MV). Currently both Hive and
Databricks support this. I have added these potential benefits to the
ticket
-* Improved Query Perfo
Sadly, it sounds like Apache Spark has nothing to do with materialised
views. I was hoping it could read them!
>>> *spark.sql("SELECT * FROM test.mv").show()*
Traceback (most recent call last):
File "", line 1, in
File "/opt/spark/p
Dear Spark Community,
I'm writing to seek your expertise in optimizing the performance of our
Spark History Server (SHS) deployed on Amazon EKS. We're encountering
timeouts (HTTP 504) when loading large event logs exceeding 5 GB.
*Our Setup:*
- Deployment: SHS on EKS with Nginx ingress (idle
My recommendation: using materialized views (MVs) created in Hive with
Spark Structured Streaming and Change Data Capture (CDC) is a good
combination for efficiently streaming view data updates in your scenario.
HTH
Mich Talebzadeh,
Technologist | Architect | Data Engineer | Generative AI
Thanks for the comments I received.
So in summary, Apache Spark itself doesn't directly manage materialized
views (MVs), but it can work with them through integration with the
underlying data storage systems like Hive or through Iceberg. I believe
Databricks, through Unity Catalog, supports MVs
(removing dev@ as I don't think this is a dev@-related thread but more of a
user question)
My understanding is that Apache Spark does not support materialized views.
That's all. IMHO it's not a proper expectation that all operations in
Apache Hive will be supported in Apache Spark. They are
I do not think the issue is with DROP MATERIALIZED VIEW only, but also with
CREATE MATERIALIZED VIEW, because neither is supported in Spark. I guess
you must have created the view from Hive and are trying to drop it from
Spark, and that is why you are running into the issue with DROP first.
An issue I encountered while working with Materialized Views in Spark SQL:
it appears that there is an inconsistency between the behavior of
Materialized Views in Spark SQL and Hive.
When attempting to execute a statement like DROP MATERIALIZED VIEW IF
EXISTS test.mv in Spark SQL, I encountered
from view definition) by using spark structured
streaming.
Issue:
1. Here we are facing an issue - for each incoming id we run the view
definition (so it reads all the data) and check if any
of the incoming ids are present in the collective ids of the view result, due
to which
Hi Steve,
Thanks for your statement. I tend to use uuid myself to avoid
collisions. This built-in function generates random IDs that are highly
likely to be unique across systems. My concerns are about edge cases, so to speak. If
the Spark application runs for a very long time or encounters restarts
Hi Mich,
I was just reading random questions on the user list when I noticed that you
said:
On 25 Apr 2024, at 2:12 AM, Mich Talebzadeh wrote:
1) You are using monotonically_increasing_id(), which is not
collision-resistant in distributed environments like Spark. Multiple hosts
can
.
My suggestions
- Increase Executor Memory: Allocate more memory per executor (e.g., 2GB
or 3GB) to allow for multiple executors within available cluster memory.
- Adjust Driver Pod Resources: Ensure the driver pod has enough memory
to run Spark and manage executors.
- Optimize
Respected Sir/Madam,
I am Tarunraghav. I have a query regarding Spark on Kubernetes.
We have an eks cluster, within which we have spark installed in the pods.
We set the executor memory to 1GB and the executor instances to 2; I
have also set dynamic allocation to true. So when I try to read
Thank you.
My main purpose is to pass "MAXDOP 1" to MSSQL to control the CPU usage. From the
official doc, I guess the problem with my code is that Spark wraps the query as
select * from (SELECT TOP 10 * FROM dbo.Demo with (nolock) WHERE Id = 1 option
(maxdop 1)) spark_gen_alias
"128G"
).set("spark.executor.memoryOverhead", "32G"
).set("spark.driver.cores", "16"
).set("spark.driver.memory", "64G"
)
I don't think b) applies as it's a single machine.
Kind regards,
Jelle
Fr
OK, let us have a look at these:
1) You are using monotonically_increasing_id(), which is not
collision-resistant in distributed environments like Spark. Multiple hosts
can generate the same ID. I suggest switching to UUIDs (e.g.,
uuid.uuid4()) for guaranteed uniqueness.
2) Missing values
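For illustration, a minimal sketch of the suggested switch (the dataframe name
is assumed): Spark SQL's built-in uuid() yields a random UUID per row.

from pyspark.sql import functions as F

# uuid() generates an RFC 4122 random UUID for each row, unlike
# monotonically_increasing_id(), which is only unique within one job.
df_with_id = df.withColumn("id", F.expr("uuid()"))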
___
From: Mich Talebzadeh
Sent: Wednesday, April 24, 2024 4:40 PM
To: Nijland, J.G.W. (Jelle, Student M-CS)
Cc: user@spark.apache.org
Subject: Re: [spark-graphframes]: Generating incorrect edges
OK, a few observations
1) ID Generation Method: How are you generating unique IDs (UUIDs, seque
jl...@student.utwente.nl> wrote:
> tags: pyspark,spark-graphframes
>
> Hello,
>
> I am running pyspark in a podman container and I have issues with
> incorrect edges when I build my graph.
> I start with loading a source dataframe from a parquet directory on my
&
You might be able to leverage the prepareQuery option, that is at
https://spark.apache.org/docs/3.5.1/sql-data-sources-jdbc.html#data-source-option
... this was introduced in Spark 3.4.0 to handle temp table and CTE
queries against MSSQL Server, since what you send in is not actually what
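A sketch of how that option composes (the connection details and table are
assumptions): prepareQuery is prefixed to the final query, so constructs such
as CTEs survive Spark's subquery wrapping.

df = (spark.read
      .format("jdbc")
      .option("url", "jdbc:sqlserver://host:1433;databaseName=mydb")
      .option("prepareQuery",
              "WITH t AS (SELECT TOP 10 * FROM dbo.Demo WITH (NOLOCK) WHERE Id = 1)")
      .option("query", "SELECT * FROM t")
      .load())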
tags: pyspark,spark-graphframes
Hello,
I am running pyspark in a podman container and I have issues with incorrect
edges when I build my graph.
I start with loading a source dataframe from a parquet directory on my server.
The source dataframe has the following columns
[QUESTION] How to pass MAXDOP option · Issue #2395 · microsoft/mssql-jdbc
(github.com)
Hi team,
I was advised to seek help from the Spark community.
We suspect Spark rewrites the query before passing it to MS SQL, which leads
to a syntax error.
Is there any workaround to make my code work?
In Flink, you can create streaming tables using Flink SQL, and connect directly
with SQL through CDC and Kafka. How can one use SQL for streaming computation in
Spark?
308027...@qq.com
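A rough PySpark analogue (the topic and server names are assumptions):
register the streaming DataFrame as a temp view, then express the continuous
computation in SQL.

stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "orders")
          .load()
          .selectExpr("CAST(value AS STRING) AS value"))
stream.createOrReplaceTempView("orders_stream")

# The SQL runs continuously once the streaming query starts.
counts = spark.sql(
    "SELECT value, count(*) AS cnt FROM orders_stream GROUP BY value")
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())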
I want to use Spark JDBC to access the Alibaba Cloud Hologres
(https://www.alibabacloud.com/product/hologres) internal hidden column
`hg_binlog_timestamp_us` but met the following error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot
resolve 'hg_binlog_ti
Hello,
In my organization, we have an accounting system for Spark jobs that uses
the task execution time to determine how much time a Spark job uses the
executors for, and we use it as a way to segregate cost. We sum all the task
times per job and apply proportions. Our clusters follow a 1 task
Hi all,
Is it possible to integrate StreamingQueryListener with Spark metrics so
that metrics can be reported through Spark's internal metric system?
Ideally, I would like to report some custom metrics through
StreamingQueryListener and export them to Spark's JmxSink.
Best,
Mason
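A minimal sketch of the listener side (names illustrative; exporting into
Spark's internal metric system/JmxSink would additionally need a custom
metrics Source, which this sketch does not cover):

from pyspark.sql.streaming import StreamingQueryListener

class ProgressListener(StreamingQueryListener):
    def onQueryStarted(self, event):
        pass

    def onQueryProgress(self, event):
        # event.progress carries per-trigger metrics you could export.
        print(event.progress.processedRowsPerSecond)

    def onQueryIdle(self, event):
        pass

    def onQueryTerminated(self, event):
        pass

spark.streams.addListener(ProgressListener())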
We are happy to announce the availability of Apache Spark 3.4.3!
Spark 3.4.3 is a maintenance release containing many fixes including
security and correctness domains. This release is based on the
branch-3.4 maintenance branch of Spark. We strongly
recommend all 3.4 users upgrade
Hello,
I'm very new to the Spark ecosystem, apologies if this question is a bit
simple.
I want to modify a custom fork of Spark to remove function support. For
example, I want to remove the query runner's ability to call reflect and
java_method. I saw that there exists a data structure in Spark
(slice library, used by trino)
https://github.com/airlift/slice/blob/master/src/main/java/io/airlift/slice/XxHash64.java
Was there a special motivation behind this? Or is 42 just used for the sake
of the hitchhiker's guide reference? It's very common for Spark to interact
with other tools (either via
h consumes committed
messages from kafka directly (which is not so scalable, I think).
But the main point of this approach, which I need, is that the spark
session needs to be used to save the rdd (parallelized consumed messages) to an
iceberg table.
Consumed messages will be converted to a spark rdd which wil
Interesting
My concern is the infinite loop in *foreachRDD*: the *while(true)* loop within
foreachRDD creates an infinite loop within each Spark executor. This might
not be the most efficient approach, especially since offsets are committed
asynchronously.
HTH
Mich Talebzadeh,
Technologist
Because spark streaming for kafka transactions does not work correctly to
suit my need, I moved to another approach using a raw kafka consumer which
handles read_committed messages from kafka correctly.
My codes look like the following.
JavaDStream stream = ssc.receiverStream(new CustomReceiver
)
(chango-private-1.chango.private executor driver):
java.lang.IllegalArgumentException: requirement failed: Got wrong record
for spark-executor-school-student-group school-student-7 even after seeking
to offset 11206961 got offset 11206962 instead. If this is a compacted
topic, consider enabling
Hi Kidong,
There may be a few potential reasons why the message counts from your Kafka
producer and Spark Streaming consumer might not match, especially with
transactional messages and read_committed isolation level.
1) Just ensure that both your Spark Streaming job and the Kafka consumer
written
Hi,
I have a kafka producer which sends messages transactionally to kafka and a
spark streaming job which should consume read_committed messages from kafka.
But there is a problem for spark streaming to consume read_committed
messages.
The count of messages sent by kafka producer transactionally
convention for Spark DataFrames
(usually snake_case). Use snake_case for better readability like:
"total_price_in_millions_gbp"
So this is the gist
+--+-+---+
|district |NumberOfOffshoreOwned|total_p
I think this answers your question about what to do if you need more space
on nodes.
https://spark.apache.org/docs/latest/running-on-kubernetes.html#local-storage
Local Storage
<https://spark.apache.org/docs/latest/running-on-kubernetes.html#local-storage>
Spark supports using volumes to
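For illustration, a hedged sketch of those volume options (the claim name and
mount path are placeholders); executor mounts whose volume name starts with
spark-local-dir- are used as scratch space. These configs are normally passed
at submit time:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .config("spark.kubernetes.executor.volumes.persistentVolumeClaim."
            "spark-local-dir-1.mount.path", "/data/spark-local")
    .config("spark.kubernetes.executor.volumes.persistentVolumeClaim."
            "spark-local-dir-1.options.claimName", "OnDemand")
    .getOrCreate())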
Hi Mich,
Thanks for the reply.
I did come across that file but it didn't align with the appearance of
`PartitionedFile`:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/PartitionedFileUtil.scala
In fact, the code snippet you shared also
thanks everyone for their contributions
>>
>> I was going to reply to @Enrico Minack but
>> noticed additional info. As I understand for example, Apache Uniffle is an
>> incubating project aimed at providing a pluggable shuffle service for
>> Spark. So basically, all these "external shuffle services" have in c
ons
>
> I was going to reply to @Enrico Minack but
> noticed additional info. As I understand for example, Apache Uniffle is an
> incubating project aimed at providing a pluggable shuffle service for
> Spark. So basically, all these "external shuffle services" have in c
Interesting. So below should be the corrected code with the suggestion in
the [SPARK-47718] .sql() does not recognize watermark defined upstream -
ASF JIRA (apache.org) <https://issues.apache.org/jira/browse/SPARK-47718>
# Define schema for parsing Kafka messages
schema = Stru
Sorry, this is not a bug but essentially a user error. Spark throws a really
confusing error and I'm also confused. Please see the reply in the ticket
for how to make things correct.
https://issues.apache.org/jira/browse/SPARK-47718
On Sat, 6 Apr 2024 at 11:41, 刘唯 wrote:
> This indeed looks like a bug
If you're using just Spark you could try turning on the history server
<https://spark.apache.org/docs/latest/monitoring.html> and try to glean
statistics from there. But there is no one location or log file which
stores them all.
Databricks, which is a managed Spark solution, pr
Hi,
I believe this is the package
https://raw.githubusercontent.com/apache/spark/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FilePartition.scala
And the code
case class FilePartition(index: Int, files: Array[PartitionedFile])
extends Partition
Well, you can do a fair bit with the available tools.
The Spark UI, particularly the Stages and Executors tabs, does provide some
valuable insights related to database health metrics for applications using
a JDBC source.
Stage Overview:
This section provides a summary of all the stages executed
Hi All,
I've been diving into the source code to get a better understanding of how
file splitting works from a user perspective. I've hit a dead end at
`PartitionedFile`, for which I cannot seem to find a definition. It appears
as though it should be found at
Hi,
First thanks everyone for their contributions
I was going to reply to @Enrico Minack but
noticed additional info. As I understand for example, Apache Uniffle is an
incubating project aimed at providing a pluggable shuffle service for
Spark. So basically, all these "external sh
I see that both Uniffle and Celeborn support S3/HDFS backends, which is
great.
In the case someone is using S3/HDFS, I wonder what the advantages would be
of using Celeborn or Uniffle vs the IBM shuffle service plugin
<https://github.com/IBM/spark-s3-shuffle> or Cloud Shuffle Storage Plugin
fr
Hello, I have a Spark application with a JDBC source that does some calculation.
To monitor application health, I need db-related metrics per database, like
number of connections, SQL execution time, SQL fired-time distribution, etc.
Does anybody know how to get them? Thanks!