Request for Assistance: Adding User Authentication to Apache Spark Application

2024-05-16 Thread NIKHIL RAJ SHRIVASTAVA
Dear Team, I hope this email finds you well. My name is Nikhil Raj, and I am currently working with Apache Spark for one of my projects, where we are creating an external table in Spark with the help of a parquet file. I am reaching out to seek assistance regarding user authentication

Query Regarding UDF Support in Spark Connect with Kubernetes as Cluster Manager

2024-05-15 Thread Nagatomi Yasukazu
Hi Spark Community, I have a question regarding the support for User-Defined Functions (UDFs) in Spark Connect, specifically when using Kubernetes as the Cluster Manager. According to the Spark documentation, UDFs are supported by default for the shell and in standalone applications

Re: [spark-graphframes]: Generating incorrect edges

2024-05-11 Thread Nijland, J.G.W. (Jelle, Student M-CS)
Hi Steve, Thanks for your statement. I tend to use uuid myself to avoid collisions. This built-in function generates random IDs that are highly likely to be unique across systems. My concerns are

Spark 3.5.x on Java 21?

2024-05-08 Thread Stephen Coy
Hi everyone, We’re about to upgrade our Spark clusters from Java 11 and Spark 3.2.1 to Spark 3.5.1. I know that 3.5.1 is supposed to be fine on Java 17, but will it run OK on Java 21? Thanks, Steve C This email contains confidential information of and is the copyright of Infomedia

Re: [Spark Streaming]: Save the records that are dropped by watermarking in spark structured streaming

2024-05-08 Thread Mich Talebzadeh
ided is correct to the best of my knowledge but of course cannot be guaranteed. It is essential to note that, as with any advice, quote "one test result is worth one-thousand expert opinions" (Werner Von Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)

Spark not creating staging dir for insertInto partitioned table

2024-05-07 Thread Sanskar Modi
Hi Folks, I wanted to check why Spark doesn't create a staging dir while doing an insertInto on partitioned tables. I'm running the example code below – ``` spark.sql("set hive.exec.dynamic.partition.mode=nonstrict") val rdd = sc.parallelize(Seq((1, 5, 1), (2, 1, 2), (4, 4, 3
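
A minimal PySpark sketch of the scenario above, assuming a Hive-style partitioned table; the table and column names are placeholders, not the poster's actual ones.

```python
from pyspark.sql import SparkSession

# Assumed setup: Hive support enabled, hypothetical table/column names.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
spark.sql("set hive.exec.dynamic.partition.mode=nonstrict")

spark.sql("CREATE TABLE IF NOT EXISTS demo_tbl (id INT, value INT) PARTITIONED BY (part INT)")

df = spark.createDataFrame([(1, 5, 1), (2, 1, 2), (4, 4, 3)], ["id", "value", "part"])

# insertInto matches columns by position and expects partition columns last,
# so the DataFrame column order must line up with the table definition.
df.write.mode("overwrite").insertInto("demo_tbl")
```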

[Spark Streaming]: Save the records that are dropped by watermarking in spark structured streaming

2024-05-07 Thread Nandha Kumar
Hi Team, We are trying to use *spark structured streaming *for our use case. We will be joining 2 streaming sources(from kafka topic) with watermarks. As time progresses, the records that are prior to the watermark timestamp are removed from the state. For our use case, we want to *store

Re: ********Spark streaming issue to Elastic data**********

2024-05-06 Thread Mich Talebzadeh
Hi Kartrick, Unfortunately Materialised views are not available in Spark as yet. I raised Jira [SPARK-48117] Spark Materialized Views: Improve Query Performance and Data Management - ASF JIRA (apache.org) <https://issues.apache.org/jira/browse/SPARK-48117> as a feature request. Let me

Re: ********Spark streaming issue to Elastic data**********

2024-05-06 Thread Karthick Nk
the view data into elastic index by using cdc? Thanks in advance. On Fri, May 3, 2024 at 3:39 PM Mich Talebzadeh wrote: > My recommendation! is using materialized views (MVs) created in Hive with > Spark Structured Streaming and Change Data Capture (CDC) is a good > combination for ef

Spark Materialized Views: Improve Query Performance and Data Management

2024-05-03 Thread Mich Talebzadeh
Hi, I have raised a ticket SPARK-48117 <https://issues.apache.org/jira/browse/SPARK-48117> for enhancing Spark capabilities with Materialised Views (MV). Currently both Hive and Databricks support this. I have added these potential benefits to the ticket -* Improved Query Perfo

Re: Issue with Materialized Views in Spark SQL

2024-05-03 Thread Mich Talebzadeh
Sadly Apache Spark sounds like it has nothing to do with materialised views. I was hoping it could read it! >>> spark.sql("SELECT * FROM test.mv").show() Traceback (most recent call last): File "", line 1, in File "/opt/spark/p

Help needed optimize spark history server performance

2024-05-03 Thread Vikas Tharyani
Dear Spark Community, I'm writing to seek your expertise in optimizing the performance of our Spark History Server (SHS) deployed on Amazon EKS. We're encountering timeouts (HTTP 504) when loading large event logs exceeding 5 GB. *Our Setup:* - Deployment: SHS on EKS with Nginx ingress (idle
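
A hedged sketch of settings that are often tried for oversized event logs; the values are assumptions, not the poster's configuration, and rolling event logs require Spark 3.0+.

```python
from pyspark.sql import SparkSession

# Application side: roll the event log into smaller files so the history server
# does not have to parse one multi-GB file in a single pass.
spark = (
    SparkSession.builder
    .config("spark.eventLog.enabled", "true")
    .config("spark.eventLog.rolling.enabled", "true")
    .config("spark.eventLog.rolling.maxFileSize", "128m")
    .getOrCreate()
)

# History-server side (spark-defaults.conf for the SHS pods), shown here as comments:
#   spark.history.fs.eventLog.rolling.maxFilesToRetain   10
#   spark.history.store.path                             /data/shs-cache
# The second setting persists parsed application listings to local disk so repeated
# page loads do not re-parse the logs; the ingress timeout may still need raising.
```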

Re: ********Spark streaming issue to Elastic data**********

2024-05-03 Thread Mich Talebzadeh
My recommendation: using materialized views (MVs) created in Hive with Spark Structured Streaming and Change Data Capture (CDC) is a good combination for efficiently streaming view data updates in your scenario. HTH Mich Talebzadeh, Technologist | Architect | Data Engineer | Generative AI

Re: Issue with Materialized Views in Spark SQL

2024-05-03 Thread Mich Talebzadeh
Thanks for the comments I received. So in summary, Apache Spark itself doesn't directly manage materialized views (MVs), but it can work with them through integration with the underlying data storage systems like Hive or through Iceberg. I believe Databricks supports MVs through Unity Catalog

Re: Issue with Materialized Views in Spark SQL

2024-05-02 Thread Jungtaek Lim
(removing dev@ as I don't think this is dev@ related thread but more about "question") My understanding is that Apache Spark does not support Materialized View. That's all. IMHO it's not a proper expectation that all operations in Apache Hive will be supported in Apache Spark. They are

Re: Issue with Materialized Views in Spark SQL

2024-05-02 Thread Walaa Eldin Moustafa
I do not think the issue is with DROP MATERIALIZED VIEW only, but also with CREATE MATERIALIZED VIEW, because neither is supported in Spark. I guess you must have created the view from Hive and are trying to drop it from Spark and that is why you are running to the issue with DROP first

Issue with Materialized Views in Spark SQL

2024-05-02 Thread Mich Talebzadeh
An issue I encountered while working with Materialized Views in Spark SQL. It appears that there is an inconsistency between the behavior of Materialized Views in Spark SQL and Hive. When attempting to execute a statement like DROP MATERIALIZED VIEW IF EXISTS test.mv in Spark SQL, I encountered

********Spark streaming issue to Elastic data**********

2024-05-02 Thread Karthick Nk
from view definition) by using spark structured streaming. Issue: 1. Here we are facing an issue – for each incoming id we run the view definition (so it reads all the data) and check if any of the incoming ids is present in the collective ids of the view result, due to which

Re: [spark-graphframes]: Generating incorrect edges

2024-05-01 Thread Mich Talebzadeh
Hi Steve, Thanks for your statement. I tend to use uuid myself to avoid collisions. This built-in function generates random IDs that are highly likely to be unique across systems. My concerns are on edge so to speak. If the Spark application runs for a very long time or encounters restarts

Re: [spark-graphframes]: Generating incorrect edges

2024-04-30 Thread Stephen Coy
Hi Mich, I was just reading random questions on the user list when I noticed that you said: On 25 Apr 2024, at 2:12 AM, Mich Talebzadeh wrote: 1) You are using monotonically_increasing_id(), which is not collision-resistant in distributed environments like Spark. Multiple hosts can

Re: Spark on Kubernetes

2024-04-30 Thread Mich Talebzadeh
. My suggestions - Increase Executor Memory: Allocate more memory per executor (e.g., 2GB or 3GB) to allow for multiple executors within available cluster memory. - Adjust Driver Pod Resources: Ensure the driver pod has enough memory to run Spark and manage executors. - Optimize
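
A small sketch of the settings the reply refers to; the values are illustrative assumptions, not a recommendation for any particular cluster.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.executor.memory", "2g")              # more memory per executor
    .config("spark.executor.instances", "2")
    .config("spark.driver.memory", "2g")                # headroom for the driver pod
    .config("spark.dynamicAllocation.enabled", "true")
    # On k8s there is no external shuffle service by default, so dynamic allocation
    # needs shuffle tracking to decide when executors can be released.
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
    .getOrCreate()
)
```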

Spark on Kubernetes

2024-04-29 Thread Tarun raghav
Respected Sir/Madam, I am Tarunraghav. I have a query regarding Spark on Kubernetes. We have an EKS cluster, within which we have Spark installed in the pods. We set the executor memory to 1GB and the executor instances to 2; I have also set dynamic allocation to true. So when I try to read

Re:RE: How to add MaxDOP option in spark mssql JDBC

2024-04-25 Thread Elite
Thank you. My main purpose is to pass "MaxDop 1" to MSSQL to control the CPU usage. From the official doc, I guess the problem with my code is that Spark wraps the query as select * from (SELECT TOP 10 * FROM dbo.Demo with (nolock) WHERE Id = 1 option (maxdop 1)) spark_gen_alias

Re: [spark-graphframes]: Generating incorrect edges

2024-04-25 Thread Nijland, J.G.W. (Jelle, Student M-CS)
"128G" ).set("spark.executor.memoryOverhead", "32G" ).set("spark.driver.cores", "16" ).set("spark.driver.memory", "64G" ) I dont think b) applies as its a single machine. Kind regards, Jelle Fr

Re: [spark-graphframes]: Generating incorrect edges

2024-04-24 Thread Mich Talebzadeh
OK let us have a look at these 1) You are using monotonically_increasing_id(), which is not collision-resistant in distributed environments like Spark. Multiple hosts can generate the same ID. I suggest switching to UUIDs (e.g., uuid.uuid4()) for guaranteed uniqueness. 2) Missing values
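
A minimal PySpark sketch contrasting the two ID strategies discussed in this thread; the column names are placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(5)

# monotonically_increasing_id(): unique within one job, but built from the partition id
# and the row position, so values can repeat across runs or after repartitioning.
with_mono = df.withColumn("vertex_id", F.monotonically_increasing_id())

# Built-in uuid(): a random UUID per row, effectively collision-free across systems.
# It is non-deterministic, so persist the result before reusing the IDs elsewhere.
with_uuid = df.withColumn("vertex_id", F.expr("uuid()")).persist()
with_uuid.show(truncate=False)
```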

Re: [spark-graphframes]: Generating incorrect edges

2024-04-24 Thread Nijland, J.G.W. (Jelle, Student M-CS)
___ From: Mich Talebzadeh Sent: Wednesday, April 24, 2024 4:40 PM To: Nijland, J.G.W. (Jelle, Student M-CS) Cc: user@spark.apache.org Subject: Re: [spark-graphframes]: Generating incorrect edges OK few observations 1) ID Generation Method: How are you generating unique IDs (UUIDs, seque

Re: [spark-graphframes]: Generating incorrect edges

2024-04-24 Thread Mich Talebzadeh
jl...@student.utwente.nl> wrote: > tags: pyspark,spark-graphframes > > Hello, > > I am running pyspark in a podman container and I have issues with > incorrect edges when I build my graph. > I start with loading a source dataframe from a parquet directory on my &

RE: How to add MaxDOP option in spark mssql JDBC

2024-04-24 Thread Appel, Kevin
You might be able to leverage the prepareQuery option, that is at https://spark.apache.org/docs/3.5.1/sql-data-sources-jdbc.html#data-source-option ... this was introduced in Spark 3.4.0 to handle temp table query and CTE query against MSSQL server since what you send in is not actually what
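
A hedged sketch of the prepareQuery pattern mentioned above (available from Spark 3.4.0); the connection string, table, and temp-table names are placeholders, and whether MAXDOP survives this particular rewrite should be verified against your server.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# prepareQuery is prepended verbatim before the query that Spark wraps as a subquery,
# so server-side constructs like temp tables or OPTION (MAXDOP 1) can live there
# instead of inside the wrapped SELECT, where they would be a syntax error.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://myhost:1433;databaseName=mydb")
    .option("prepareQuery",
            "SELECT TOP 10 * INTO #tmp FROM dbo.Demo WITH (NOLOCK) "
            "WHERE Id = 1 OPTION (MAXDOP 1)")
    .option("query", "SELECT * FROM #tmp")
    .option("user", "my_user")
    .option("password", "***")
    .load()
)
```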

[spark-graphframes]: Generating incorrect edges

2024-04-24 Thread Nijland, J.G.W. (Jelle, Student M-CS)
tags: pyspark,spark-graphframes Hello, I am running pyspark in a podman container and I have issues with incorrect edges when I build my graph. I start with loading a source dataframe from a parquet directory on my server. The source dataframe has the following columns

How to add MaxDOP option in spark mssql JDBC

2024-04-23 Thread Elite
[QUESTION] How to pass MAXDOP option · Issue #2395 · microsoft/mssql-jdbc (github.com) Hi team, I was advised to ask the Spark community for help. We suspect Spark rewrites the query before passing it to MS SQL, and this leads to a syntax error. Is there any workaround to make my code work

How to use Structured Streaming in Spark SQL

2024-04-22 Thread ????
In Flink, you can create streaming tables using Flink SQL and connect directly with SQL through CDC and Kafka. How can I use SQL for streaming computation in Spark? 308027...@qq.com

How to access the internal hidden columns of table by spark jdbc

2024-04-20 Thread casel.chen
I want to use spark jdbc to access the Alibaba Cloud Hologres (https://www.alibabacloud.com/product/hologres) internal hidden column `hg_binlog_timestamp_us` but got the following error: Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'hg_binlog_ti

Accounting the impact of failures in spark jobs

2024-04-19 Thread Faiz Halde
Hello, In my organization, we have an accounting system for spark jobs that uses the task execution time to determine how much time a spark job uses the executors for and we use it as a way to segregate cost. We sum all the task times per job and apply proportions. Our clusters follow a 1 task

StreamingQueryListener integration with Spark native metric sink (JmxSink)

2024-04-18 Thread Mason Chen
Hi all, Is it possible to integrate StreamingQueryListener with Spark metrics so that metrics can be reported through Spark's internal metric system? Ideally, I would like to report some custom metrics through StreamingQueryListener and export them to Spark's JmxSink. Best, Mason
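
A sketch of the listener half of this question, assuming PySpark 3.4+ where StreamingQueryListener is exposed in Python; forwarding the values into Spark's JmxSink is not shown here.

```python
from pyspark.sql import SparkSession
from pyspark.sql.streaming import StreamingQueryListener

spark = SparkSession.builder.getOrCreate()

class ProgressListener(StreamingQueryListener):
    """Logs per-batch progress; a real integration would forward these numbers
    to a metrics backend instead of printing them."""

    def onQueryStarted(self, event):
        print(f"query started: {event.id}")

    def onQueryProgress(self, event):
        p = event.progress
        print(f"{p.name}: inputRowsPerSecond={p.inputRowsPerSecond}, "
              f"processedRowsPerSecond={p.processedRowsPerSecond}")

    def onQueryIdle(self, event):
        pass

    def onQueryTerminated(self, event):
        print(f"query terminated: {event.id}")

spark.streams.addListener(ProgressListener())
```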

[ANNOUNCE] Apache Spark 3.4.3 released

2024-04-18 Thread Dongjoon Hyun
We are happy to announce the availability of Apache Spark 3.4.3! Spark 3.4.3 is a maintenance release containing many fixes including security and correctness domains. This release is based on the branch-3.4 maintenance branch of Spark. We strongly recommend all 3.4 users to upgrade

[Spark SQL][How-To] Remove builtin function support from Spark

2024-04-17 Thread Matthew McMillian
Hello, I'm very new to the Spark ecosystem, apologies if this question is a bit simple. I want to modify a custom fork of Spark to remove function support. For example, I want to remove the query runners ability to call reflect and java_method. I saw that there exists a data structure in spark

[Spark SQL] xxhash64 default seed of 42 confusion

2024-04-16 Thread Igor Calabria
(slice library, used by trino) https://github.com/airlift/slice/blob/master/src/main/java/io/airlift/slice/XxHash64.java Was there a special motivation behind this? or is 42 just used for the sake of the hitchhiker's guide reference? It's very common for spark to interact with other tools (either via
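
For reference, a tiny sketch of the function in question: Spark's xxhash64 hashes with a fixed default seed of 42, so matching output from another engine generally requires that engine to hash with seed 42 as well.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# xxhash64 takes no seed argument in the DataFrame API; the default seed 42 is built in.
spark.range(3).select(F.xxhash64("id").alias("hash")).show()
```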

Re: Spark streaming job for kafka transaction does not consume read_committed messages correctly.

2024-04-14 Thread Kidong Lee
h consumes committed messages from kafka directly(, which is not so scalable, I think.). But the main point of this approach which I need is that spark session needs to be used to save rdd(parallelized consumed messages) to iceberg table. Consumed messages will be converted to spark rdd which wil

Re: Spark streaming job for kafka transaction does not consume read_committed messages correctly.

2024-04-14 Thread Mich Talebzadeh
Interesting. My concern is the infinite loop in *foreachRDD*: the *while(true)* loop within foreachRDD creates an infinite loop within each Spark executor. This might not be the most efficient approach, especially since offsets are committed asynchronously? HTH Mich Talebzadeh, Technologist

Re: Spark streaming job for kafka transaction does not consume read_committed messages correctly.

2024-04-14 Thread Kidong Lee
Because spark streaming for kafka transactions does not work correctly to suit my need, I moved to another approach using a raw kafka consumer which handles read_committed messages from kafka correctly. My code looks like the following. JavaDStream stream = ssc.receiverStream(new CustomReceiver

Re: Spark streaming job for kafka transaction does not consume read_committed messages correctly.

2024-04-13 Thread Kidong Lee
) (chango-private-1.chango.private executor driver): java.lang.IllegalArgumentException: requirement failed: Got wrong record for spark-executor-school-student-group school-student-7 even after seeking to offset 11206961 got offset 11206962 instead. If this is a compacted topic, consider enabling

Re: Spark streaming job for kafka transaction does not consume read_committed messages correctly.

2024-04-13 Thread Mich Talebzadeh
Hi Kidong, There may be a few potential reasons why the message counts from your Kafka producer and Spark Streaming consumer might not match, especially with transactional messages and the read_committed isolation level. 1) Just ensure that both your Spark Streaming job and the Kafka consumer written

Spark streaming job for kafka transaction does not consume read_committed messages correctly.

2024-04-12 Thread Kidong Lee
Hi, I have a kafka producer which sends messages transactionally to kafka and spark streaming job which should consume read_committed messages from kafka. But there is a problem for spark streaming to consume read_committed messages. The count of messages sent by kafka producer transactionally

Spark column headings, camelCase or snake case?

2024-04-11 Thread Mich Talebzadeh
convention for Spark DataFrames (usually snake_case). Use snake_case for better readability like: "total_price_in_millions_gbp" So this is the gist +--+-+---+ |district |NumberOfOffshoreOwned|total_p
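
A small illustrative sketch (the regex and column names are assumptions) of renaming camelCase columns to the snake_case convention mentioned above.

```python
import re
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

def to_snake(name: str) -> str:
    # Insert an underscore before each interior capital letter, then lower-case.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()

df = spark.createDataFrame(
    [("city of westminster", 100, 1.5)],
    ["district", "NumberOfOffshoreOwned", "totalPriceInMillionsGbp"],
)
df_snake = df.select([F.col(c).alias(to_snake(c)) for c in df.columns])
# Columns become: district, number_of_offshore_owned, total_price_in_millions_gbp
df_snake.printSchema()
```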

Re: External Spark shuffle service for k8s

2024-04-11 Thread Bjørn Jørgensen
I think this answers your question about what to do if you need more space on nodes. https://spark.apache.org/docs/latest/running-on-kubernetes.html#local-storage Local Storage <https://spark.apache.org/docs/latest/running-on-kubernetes.html#local-storage> Spark supports using volumes to

Re: [Spark SQL]: Source code for PartitionedFile

2024-04-11 Thread Ashley McManamon
Hi Mich, Thanks for the reply. I did come across that file but it didn't align with the appearance of `PartitionedFile`: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/PartitionedFileUtil.scala In fact, the code snippet you shared also

Re: External Spark shuffle service for k8s

2024-04-11 Thread Bjørn Jørgensen
thanks everyone for their contributions >> >> I was going to reply to @Enrico Minack but >> noticed additional info. As I understand for example, Apache Uniffle is an >> incubating project aimed at providing a pluggable shuffle service for >> Spark. So basically, all these &quo

Re: External Spark shuffle service for k8s

2024-04-10 Thread Arun Ravi
ons > > I was going to reply to @Enrico Minack but > noticed additional info. As I understand for example, Apache Uniffle is an > incubating project aimed at providing a pluggable shuffle service for > Spark. So basically, all these "external shuffle services" have in c

Re: Re: [Spark SQL] How can I use .sql() in conjunction with watermarks?

2024-04-09 Thread Mich Talebzadeh
interesting. So below should be the corrected code with the suggestion in the [SPARK-47718] .sql() does not recognize watermark defined upstream - ASF JIRA (apache.org) <https://issues.apache.org/jira/browse/SPARK-47718> # Define schema for parsing Kafka messages schema = Stru

Re: Re: [Spark SQL] How can I use .sql() in conjunction with watermarks?

2024-04-09 Thread 刘唯
Sorry, this is not a bug but essentially a user error. Spark throws a really confusing error and I'm also confused. Please see the reply in the ticket for how to make things correct. https://issues.apache.org/jira/browse/SPARK-47718 刘唯 wrote on Sat, Apr 6, 2024 at 11:41: > This indeed looks like a bug

Re: How to get db related metrics when use spark jdbc to read db table?

2024-04-08 Thread Femi Anthony
If you're using just Spark you could try turning on the history server <https://spark.apache.org/docs/latest/monitoring.html> and try to glean statistics from there. But there is no one location or log file which stores them all. Databricks, which is a managed Spark solution, pr

Re: [Spark SQL]: Source code for PartitionedFile

2024-04-08 Thread Mich Talebzadeh
Hi, I believe this is the package https://raw.githubusercontent.com/apache/spark/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FilePartition.scala And the code case class FilePartition(index: Int, files: Array[PartitionedFile]) extends Partition

Re: How to get db related metrics when use spark jdbc to read db table?

2024-04-08 Thread Mich Talebzadeh
Well, you can do a fair bit with the available tools. The Spark UI, particularly the Stages and Executors tabs, provides some valuable insights related to database health metrics for applications using a JDBC source. Stage Overview: This section provides a summary of all the stages executed

[Spark SQL]: Source code for PartitionedFile

2024-04-08 Thread Ashley McManamon
Hi All, I've been diving into the source code to get a better understanding of how file splitting works from a user perspective. I've hit a dead end at `PartitionedFile`, for which I cannot seem to find a definition. It appears as though it should be found at

Re: External Spark shuffle service for k8s

2024-04-08 Thread Mich Talebzadeh
Hi, First thanks everyone for their contributions I was going to reply to @Enrico Minack but noticed additional info. As I understand for example, Apache Uniffle is an incubating project aimed at providing a pluggable shuffle service for Spark. So basically, all these "external sh

Re: External Spark shuffle service for k8s

2024-04-08 Thread Vakaris Baškirov
I see that both Uniffle and Celeborn support S3/HDFS backends, which is great. In case someone is using S3/HDFS, I wonder what the advantages would be of using Celeborn or Uniffle vs the IBM shuffle service plugin <https://github.com/IBM/spark-s3-shuffle> or the Cloud Shuffle Storage Plugin fr

How to get db related metrics when use spark jdbc to read db table?

2024-04-08 Thread casel.chen
Hello, I have a spark application with a jdbc source that does some calculation. To monitor application health, I need db-related metrics per database, like the number of connections, sql execution time and sql fired time distribution, etc. Does anybody know how to get them? Thanks!

Re: External Spark shuffle service for k8s

2024-04-08 Thread roryqi
did > > The configurations below can be used with k8s deployments of Spark. Spark > applications running on k8s can utilize these configurations to seamlessly > access data stored in Google Cloud Storage (GCS) and Amazon S3. > > For Google GCS we may have

Re: External Spark shuffle service for k8s

2024-04-07 Thread Enrico Minack
There is Apache incubator project Uniffle: https://github.com/apache/incubator-uniffle It stores shuffle data on remote servers in memory, on local disk and HDFS. Cheers, Enrico Am 06.04.24 um 15:41 schrieb Mich Talebzadeh: I have seen some older references for shuffle service for k8s,

Spark UDAF in examples fail with not serializable error

2024-04-07 Thread Owen Bell
The type-safe example given at https://spark.apache.org/docs/latest/sql-ref-functions-udf-aggregate.html fails with a not serializable exception Is this a known issue?

Re: External Spark shuffle service for k8s

2024-04-07 Thread Mich Talebzadeh
_von_Braun>Von Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)". On Sun, 7 Apr 2024 at 15:08, Cheng Pan wrote: > Instead of External Shuffle Shufle, Apache Celeborn might be a good option > as a Remote Shuffle Service for Spark on K8s. > > There are some

Re: External Spark shuffle service for k8s

2024-04-07 Thread Vakaris Baškirov
There is an IBM shuffle service plugin that supports S3 https://github.com/IBM/spark-s3-shuffle Though I would think a feature like this could be a part of the main Spark repo. Trino already has out-of-box support for s3 exchange (shuffle) and it's very useful. Vakaris On Sun, Apr 7, 2024 at 12

Re: External Spark shuffle service for k8s

2024-04-07 Thread Cheng Pan
Instead of the External Shuffle Service, Apache Celeborn might be a good option as a Remote Shuffle Service for Spark on K8s. There are some useful resources you might be interested in. [1] https://celeborn.apache.org/ [2] https://www.youtube.com/watch?v=s5xOtG6Venw [3] https://github.com/aws
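
A hedged sketch of how a Spark job is typically pointed at Celeborn; the configuration keys follow the Celeborn project's Spark integration docs, while the client jar version and master endpoint are placeholders.

```python
from pyspark.sql import SparkSession

# Requires the matching celeborn-client-spark jar on the driver and executor classpath.
spark = (
    SparkSession.builder
    .config("spark.shuffle.manager",
            "org.apache.spark.shuffle.celeborn.SparkShuffleManager")
    .config("spark.celeborn.master.endpoints", "celeborn-master-svc:9097")
    # Celeborn takes over shuffle storage, so Spark's external shuffle service stays off.
    .config("spark.shuffle.service.enabled", "false")
    .getOrCreate()
)
```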

Re: External Spark shuffle service for k8s

2024-04-07 Thread Mich Talebzadeh
Splendid. The configurations below can be used with k8s deployments of Spark. Spark applications running on k8s can utilize these configurations to seamlessly access data stored in Google Cloud Storage (GCS) and Amazon S3. For Google GCS we may have spark_config_gcs

Re: External Spark shuffle service for k8s

2024-04-06 Thread Mich Talebzadeh
pedia.org/wiki/Wernher_von_Braun>Von Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)". On Sat, 6 Apr 2024 at 21:28, Bjørn Jørgensen wrote: > You can make a PVC on K8S call it 300GB > > make a folder in yours dockerfile > WORKDIR /opt/spark/work-dir > RUN chmod g+w /opt

Re: External Spark shuffle service for k8s

2024-04-06 Thread Bjørn Jørgensen
You can make a PVC on K8S and call it 300GB. Make a folder in your dockerfile: WORKDIR /opt/spark/work-dir RUN chmod g+w /opt/spark/work-dir Then start spark adding this: .config("spark.kubernetes.driver.volumes.persistentVolumeClaim.300gb.options.claimName", "300
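
The same idea written out as SparkSession config keys, a sketch using the claim name from the example; the executor-side keys are shown, and the driver-side keys follow the same pattern with `driver` in place of `executor`.

```python
from pyspark.sql import SparkSession

pvc = "spark.kubernetes.executor.volumes.persistentVolumeClaim.300gb"
spark = (
    SparkSession.builder
    .config(f"{pvc}.options.claimName", "300gb")
    .config(f"{pvc}.mount.path", "/opt/spark/work-dir")
    .config(f"{pvc}.mount.readOnly", "false")
    # Point Spark's scratch space at the mounted volume so shuffle spill lands there.
    .config("spark.local.dir", "/opt/spark/work-dir")
    .getOrCreate()
)
```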

Re: Re: [Spark SQL] How can I use .sql() in conjunction with watermarks?

2024-04-06 Thread 刘唯
n Kafka). However, your query > involves a streaming aggregation: group by provinceId, window('createTime', > '1 hour', '30 minutes'). The problem is that Spark Structured Streaming > requires a watermark to ensure exactly-once processing when using > aggregations with append mode. Your c

Re: [External] Re: Issue of spark with antlr version

2024-04-06 Thread Bjørn Jørgensen
[[VOTE] Release Plan for Apache Spark 4.0.0 (June 2024)]( https://lists.apache.org/thread/r0zn6rd8y25yn2dg59ktw3ttrwxzqrfb) Apache Spark 4.0.0 Release Plan === 1. After creating `branch-3.5`, set "4.0.0-SNAPSHOT" in master branch. 2. Creating `branch-4.0

External Spark shuffle service for k8s

2024-04-06 Thread Mich Talebzadeh
I have seen some older references for shuffle service for k8s, although it is not clear they are talking about a generic shuffle service for k8s. Anyhow with the advent of genai and the need to allow for a larger volume of data, I was wondering if there has been any more work on this matter.

Re: [EXTERNAL] Re: [Spark]: Spark / Iceberg / hadoop-aws compatibility matrix

2024-04-03 Thread Oxlade, Dan
I don't really understand how Iceberg and the hadoop libraries can coexist in a deployment. The latest spark (3.5.1) base image contains the hadoop-client*-3.3.4.jar. The AWS v2 SDK is only supported in hadoop*-3.4.0.jar and onward. Iceberg AWS integration states AWS v2 SDK is required<ht

Re: [EXTERNAL] Re: [Spark]: Spark / Iceberg / hadoop-aws compatibility matrix

2024-04-03 Thread Oxlade, Dan
(ParquetFileFormat.scala:429) From: Oxlade, Dan Sent: 03 April 2024 14:33 To: Aaron Grubb ; user@spark.apache.org Subject: Re: [EXTERNAL] Re: [Spark]: Spark / Iceberg / hadoop-aws compatibility matrix [sorry; replying all this time] With hadoop-*-3.3.6 in place of the 3.4.0

Re: [EXTERNAL] Re: [Spark]: Spark / Iceberg / hadoop-aws compatibility matrix

2024-04-03 Thread Oxlade, Dan
April 2024 13:52 To: user@spark.apache.org Subject: [EXTERNAL] Re: [Spark]: Spark / Iceberg / hadoop-aws compatibility matrix Downgrade to hadoop-*:3.3.x, Hadoop 3.4.x is based on the AWS SDK v2 and should probably be considered as breaking for tools that build on < 3.4.0 while using

Re: [Spark]: Spark / Iceberg / hadoop-aws compatibility matrix

2024-04-03 Thread Aaron Grubb
ect: [Spark]: Spark / Iceberg / hadoop-aws compatibility matrix Hi all, I’ve struggled with this for quite some time. My requirement is to read a parquet file from s3 to a Dataframe then append to an existing iceberg table. In order to read the parquet I need the hadoop-aws dependency for

[Spark]: Spark / Iceberg / hadoop-aws compatibility matrix

2024-04-03 Thread Oxlade, Dan
. Both of these dependencies have a transitive dependency on the aws SDK. I can't find versions for Spark 3.4 that work together. Current Versions: Spark 3.4.1 iceberg-spark-runtime-3.4-2.12:1.4.1 iceberg-aws-bundle:1.4.1 hadoop-aws:3.4.0 hadoop-common:3.4.0 I've tried a number of combinations

Re: Re: [Spark SQL] How can I use .sql() in conjunction with watermarks?

2024-04-02 Thread Mich Talebzadeh
is designed for scenarios where you want to append new data to an existing dataset at the sink (in this case, the "sink" topic in Kafka). However, your query involves a streaming aggregation: group by provinceId, window('createTime', '1 hour', '30 minutes'). The problem is that Spark

RE: Re: [Spark SQL] How can I use .sql() in conjunction with watermarks?

2024-04-02 Thread Chloe He
rue") \ > .option("startingOffsets", "earliest") \ > .load() \ > .select(from_json(col("value").cast("string"), > schema).alias("parsed_value")) > .select

Re: [Spark SQL] How can I use .sql() in conjunction with watermarks?

2024-04-02 Thread Mich Talebzadeh
rom the streaming DataFrame with watermark streaming_df.createOrReplaceTempView("michboy") # Execute SQL queries on the temporary view result_df = (spark.sql(""" SELECT window.start, window.end, provinceId, sum(payAmount) as totalPayAmount FROM michboy
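
A condensed sketch of the pattern from this thread and SPARK-47718: declare the watermark on the streaming DataFrame, register the temp view, then run the windowed aggregation through .sql(). Broker, topic, and schema details are placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, LongType, TimestampType

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField("provinceId", StringType()),
    StructField("payAmount", LongType()),
    StructField("createTime", TimestampType()),
])

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # placeholder
    .option("subscribe", "payments")                     # placeholder topic
    .load()
    .select(F.from_json(F.col("value").cast("string"), schema).alias("v"))
    .select("v.*")
    .withWatermark("createTime", "10 seconds")           # watermark defined before the view
)
events.createOrReplaceTempView("events_wm")

result = spark.sql("""
    SELECT window.start, window.end, provinceId, SUM(payAmount) AS totalPayAmount
    FROM events_wm
    GROUP BY provinceId, window(createTime, '1 hour', '30 minutes')
""")
```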

[Spark SQL] How can I use .sql() in conjunction with watermarks?

2024-04-02 Thread Chloe He
Hello! I am attempting to write a streaming pipeline that would consume data from a Kafka source, manipulate the data, and then write results to a downstream sink (Kafka, Redis, etc). I want to write fully formed SQL instead of using the function API that Spark offers. I read a few guides

Re: [External] Re: Issue of spark with antlr version

2024-04-01 Thread Chawla, Parul
Hi Team, Can you let us know when Spark 4.x will be released to Maven? regards, Parul Get Outlook for iOS<https://aka.ms/o0ukef> From: Bjørn Jørgensen Sent: Wednesday, February 28, 2024 5:06:54 PM To: Chawla, Parul Cc: Sahni, Ashima

Apache Spark integration with Spring Boot 3.0.0+

2024-03-28 Thread Szymon Kasperkiewicz
Hello, I've got a project which has to use the newest versions of both Apache Spark and Spring Boot due to vulnerability issues. I build my project using Gradle, and when I try to run it I get an unsatisfied dependency exception about javax/servlet/Servlet. I've tried to add the jakarta servlet

Is one Spark partition mapped to one and only Spark Task ?

2024-03-24 Thread Sreyan Chakravarty
I am trying to understand the Spark Architecture for my upcoming certification, however there seems to be conflicting information available. https://stackoverflow.com/questions/47782099/what-is-the-relationship-between-tasks-and-partitions Does Spark assign a Spark partition to only a single

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-23 Thread Winston Lai
+1 -- Thank You & Best Regards Winston Lai From: Jay Han Date: Sunday, 24 March 2024 at 08:39 To: Kiran Kumar Dusi Cc: Farshid Ashouri , Matei Zaharia , Mich Talebzadeh , Spark dev list , user @spark Subject: Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Communit

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-23 Thread Jay Han
> Some of you may be aware that Databricks community Home | Databricks >>> have just launched a knowledge sharing hub. I thought it would be a >>> good idea for the Apache Spark user group to have the same, especially >>> for repeat questions on Spark core, Spark SQL, Spa

Re: Feature article: Leveraging Generative AI with Apache Spark: Transforming Data Engineering

2024-03-22 Thread Mich Talebzadeh
Sorry from this link Leveraging Generative AI with Apache Spark: Transforming Data Engineering | LinkedIn <https://www.linkedin.com/pulse/leveraging-generative-ai-apache-spark-transforming-mich-lxbte/?trackingId=aqZMBOg4O1KYRB4Una7NEg%3D%3D> Mich Talebzadeh, Technologist | Data | Generat

Feature article: Leveraging Generative AI with Apache Spark: Transforming Data Engineering

2024-03-22 Thread Mich Talebzadeh
You may find this link of mine in Linkedin for the said article. We can use Linkedin for now. Leveraging Generative AI with Apache Spark: Transforming Data Engineering | LinkedIn Mich Talebzadeh, Technologist | Data | Generative AI | Financial Fraud London United Kingdom view my Linkedin

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-20 Thread Kiran Kumar Dusi
>> good idea for the Apache Spark user group to have the same, especially >> for repeat questions on Spark core, Spark SQL, Spark Structured >> Streaming, Spark Mlib and so forth. >> >> Apache Spark user and dev groups have been around for a good while. >> Th

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-20 Thread Farshid Ashouri
+1 On Mon, 18 Mar 2024, 11:00 Mich Talebzadeh, wrote: > Some of you may be aware that Databricks community Home | Databricks > have just launched a knowledge sharing hub. I thought it would be a > good idea for the Apache Spark user group to have the same, especially > for repe

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-19 Thread Mich Talebzadeh
se cannot be guaranteed . It is essential to note > that, as with any advice, quote "one test result is worth one-thousand > expert opinions (Werner <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von > Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)". >

Spark-UI stages and other tabs not accessible in standalone mode when reverse-proxy is enabled

2024-03-19 Thread sharad mishra
Hi Team, We're encountering an issue with the Spark UI. I've documented the details here: https://issues.apache.org/jira/browse/SPARK-47232 When reverse proxy is enabled in the master and worker configOptions, we're not able to access the different tabs available in the Spark UI, e.g. (stages, environment, storage, etc

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-19 Thread Joris Billen
t;. On Mon, 18 Mar 2024 at 20:31, Bjørn Jørgensen mailto:bjornjorgen...@gmail.com>> wrote: something like this Spark community · GitHub<https://github.com/Spark-community> man. 18. mars 2024 kl. 17:26 skrev Parsian, Mahmoud : Good idea. Will be useful +1 From: ashok34...@yahoo.

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-19 Thread Varun Shah
+1 Great initiative. QQ: Stack Overflow has a similar feature called "Collectives", but I am not sure of the expenses to create one for Apache Spark. With SO being used (at least before ChatGPT became quite the norm for searching questions), it already has a lot of questions asked an

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Deepak Sharma
>> >> >> >> >> >> >> *From: *ashok34...@yahoo.com.INVALID >> *Date: *Monday, March 18, 2024 at 6:36 AM >> *To: *user @spark , Spark dev list < >> d...@spark.apache.org>, Mich Talebzadeh >> *Cc: *Matei Zaharia >> *Subject: *R

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Hyukjin Kwon
org/wiki/Wernher_von_Braun>)". > > > On Mon, 18 Mar 2024 at 16:23, Parsian, Mahmoud > wrote: > >> Good idea. Will be useful >> >> >> >> +1 >> >> >> >> >> >> >> >> *From: *ashok34...@yahoo.com.INVALI

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Mich Talebzadeh
OK thanks for the update. What does officially blessed signify here? Can we have and run it as a sister site? The reason this comes to my mind is that the interested parties should have easy access to this site (from ISUG Spark sites) as a reference repository. I guess the advice would

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Reynold Xin
aranteed . It is essential to note > that, as with any advice, quote "one test result is worth one - thousand > expert opinions ( Werner ( https://en.wikipedia.org/wiki/Wernher_von_Braun > ) Von Braun ( https://en.wikipedia.org/wiki/Wernher_von_Braun ) )". > > > > >

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Mich Talebzadeh
n.wikipedia.org/wiki/Wernher_von_Braun>Von Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>)". On Mon, 18 Mar 2024 at 20:31, Bjørn Jørgensen wrote: > something like this Spark community · GitHub > <https://github.com/Spark-community> > > > man. 18. m

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Bjørn Jørgensen
something like this Spark community · GitHub <https://github.com/Spark-community> man. 18. mars 2024 kl. 17:26 skrev Parsian, Mahmoud : > Good idea. Will be useful > > > > +1 > > > > > > > > *From: *ashok34...@yahoo.com.INVALID > *Date: *Monda

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Code Tutelage
+1 Thanks for proposing On Mon, Mar 18, 2024 at 9:25 AM Parsian, Mahmoud wrote: > Good idea. Will be useful > > > > +1 > > > > > > > > *From: *ashok34...@yahoo.com.INVALID > *Date: *Monday, March 18, 2024 at 6:36 AM > *To: *user @spark , Sp

pyspark - Use Spark to generate a large dataset on the fly

2024-03-18 Thread Sreyan Chakravarty
Hi, I have a specific problem where I have to get the data from REST APIs and store it, and then do some transformations on it and then write to a RDBMS table. I am wondering if Spark will help in this regard. I am confused as to how do I store the data while I actually acquire it on the driver
