Re: Can Spark Catalog Perform Multimodal Database Query Analysis

2024-05-24 Thread Mich Talebzadeh
Something like this in Python from pyspark.sql import SparkSession # Configure Spark Session with JDBC URLs spark_conf = SparkConf() \ .setAppName("SparkCatalogMultipleSources") \ .set("hive.metastore.uris", "thrift://hive1-metastore:9080,thrift://hive2-metastore:9080") jdbc_urls =
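A minimal sketch of that idea, assuming a single Hive metastore plus a MySQL source; the metastore URI, JDBC URL, credentials and the join column `id` are placeholders. Note that one SparkSession can only point at one Hive metastore this way, which is part of what the question below asks about; the MySQL side is pulled in separately over JDBC and joined as a temp view.

```python
from pyspark.sql import SparkSession

# Placeholder endpoints and credentials -- replace with your own.
spark = (SparkSession.builder
         .appName("SparkCatalogMultipleSources")
         .config("hive.metastore.uris", "thrift://hive1-metastore:9083")
         .enableHiveSupport()
         .getOrCreate())

# MySQL table exposed to Spark SQL as a temp view via JDBC.
mysql_df = (spark.read.format("jdbc")
            .option("url", "jdbc:mysql://mysql-host:3306/testdb")
            .option("dbtable", "table1")
            .option("user", "user")
            .option("password", "password")
            .load())
mysql_df.createOrReplaceTempView("mysql_table1")

# Hive tables are visible through the session catalog, so both sources
# can now be joined in a single SQL statement.
spark.sql("""
    SELECT *
    FROM hive_db.table1 h
    JOIN mysql_table1 m ON h.id = m.id
""").show()
```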

Can Spark Catalog Perform Multimodal Database Query Analysis

2024-05-24 Thread ????
I have two clusters hive1 and hive2, as well as a MySQL database. Can I register them with the Spark Catalog, or can I only use one catalog at a time? Can multiple catalogs be joined across databases? select * from hive1.table1 join hive2.table2 join mysql.table1 where

Re: Dstream HasOffsetRanges equivalent in Structured streaming

2024-05-24 Thread Anil Dasari
It appears that structured streaming and Dstream have entirely different microbatch metadata representations. Can someone assist me in finding the equivalents of the following Dstream microbatch metadata in Structured streaming? 1. microbatch timestamp: structured streaming foreachBatch gives batchID
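For reference, a minimal sketch of where the batch id surfaces in Structured Streaming; the rate source and checkpoint path are placeholders, and the DStream-style batch timestamp has no direct foreachBatch equivalent (it has to come from query progress events instead).

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ForeachBatchIds").getOrCreate()

stream_df = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

def process_batch(batch_df, batch_id):
    # batch_id is the monotonically increasing micro-batch id that
    # Structured Streaming passes into foreachBatch.
    print(f"processing micro-batch {batch_id} with {batch_df.count()} rows")

query = (stream_df.writeStream
         .foreachBatch(process_batch)
         .option("checkpointLocation", "/tmp/foreach-batch-ckpt")
         .start())
```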

Re: BUG :: UI Spark

2024-05-23 Thread Prem Sahoo
Does anyone have a clue? On Thu, May 23, 2024 at 11:40 AM Prem Sahoo wrote: > Hello Team, > in spark DAG UI , we have Stages tab. Once you click on each stage you can > view the tasks. > > In each task we have a column "ShuffleWrite Size/Records " that column > prints wrong data when it gets

Re: [s3a] Spark is not reading s3 object content

2024-05-23 Thread Mich Talebzadeh
Could be a number of reasons. First, test reading the file with the CLI: aws s3 cp s3a://input/testfile.csv . cat testfile.csv Try this code with the debug option to diagnose the problem from pyspark.sql import SparkSession from pyspark.sql.utils import AnalysisException try: # Initialize Spark
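A rough sketch of that diagnosis pattern, assuming a Ceph RADOS Gateway endpoint; the endpoint, credentials and object path are placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql.utils import AnalysisException

spark = (SparkSession.builder
         .appName("S3aReadCheck")
         .config("spark.hadoop.fs.s3a.endpoint", "http://ceph-gateway:7480")
         .config("spark.hadoop.fs.s3a.access.key", "ACCESS_KEY")
         .config("spark.hadoop.fs.s3a.secret.key", "SECRET_KEY")
         .config("spark.hadoop.fs.s3a.path.style.access", "true")
         .getOrCreate())

try:
    df = spark.read.option("header", "true").csv("s3a://input/testfile.csv")
    df.show(5)
    print(f"read {df.count()} rows")
except AnalysisException as e:
    # Raised when the path cannot be resolved or the object is unreadable.
    print(f"failed to read the object: {e}")
```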

BUG :: UI Spark

2024-05-23 Thread Prem Sahoo
Hello Team, in the Spark DAG UI we have a Stages tab. Once you click on each stage you can view the tasks. In each task we have a column "ShuffleWrite Size/Records"; that column prints wrong data when it gets the data from cache/persist. It typically will show the wrong record number though the data

[s3a] Spark is not reading s3 object content

2024-05-23 Thread Amin Mosayyebzadeh
I am trying to read an s3 object from a local S3 storage (Ceph based) using Spark 3.5.1. I see it can access the bucket and list the files (I have verified it on Ceph side by checking its logs), even returning the correct size of the object. But the content is not read. The object url is:

Re: Dstream HasOffsetRanges equivalent in Structured streaming

2024-05-22 Thread Anil Dasari
You are right. - another question on migration. Is there a way to get the microbatch id during the microbatch dataset `transform` operation like in rdd transform? I am attempting to implement the following pseudo functionality with structured streaming. In this approach, recordCategoriesMetadata

Re: Dstream HasOffsetRanges equivalent in Structured streaming

2024-05-22 Thread Mich Talebzadeh
With regard to this sentence *Offset Tracking with Structured Streaming:* While storing offsets in an external storage with DStreams was necessary, SSS handles this automatically through checkpointing. The checkpoints include the offsets processed by each micro-batch. However, you can still

Remote File change detection in S3 when spark queries are running and parquet files in S3 changes

2024-05-22 Thread Raghvendra Yadav
Hello, We are hoping someone can help us understand the Spark behavior for the scenarios listed below. Q. *Will running Spark queries fail when an S3 parquet object changes underneath with S3A remote file change detection enabled? Is it 100%?* Our understanding is that S3A has a

Re: Dstream HasOffsetRanges equivalent in Structured streaming

2024-05-22 Thread Mich Talebzadeh
Hi Anil, Ok let us put the complete picture here * Current DStreams Setup:* - Data Source: Kafka - Processing Engine: Spark DStreams - Data Transformation with Spark - Sink: S3 - Data Format: Parquet - Exactly-Once Delivery (Attempted): You're attempting exactly-once

Re: Dstream HasOffsetRanges equivalent in Structured streaming

2024-05-22 Thread Tathagata Das
The right way to associate microbatches when committing to external storage is to use the microbatch id that you can get in foreachBatch. That microbatch id guarantees that the data produced in the batch is always the same no matter any recomputations (assuming all processing logic is

Re: Dstream HasOffsetRanges equivalent in Structured streaming

2024-05-22 Thread Anil Dasari
Thanks Das, Mich. Mich, We process data from Kafka and write it to S3 in Parquet format using Dstreams. To ensure exactly-once delivery and prevent data loss, our process records micro-batch offsets to an external storage at the end of each micro-batch in foreachRDD, which is then used when the

Re: Dstream HasOffsetRanges equivalent in Structured streaming

2024-05-22 Thread Tathagata Das
If you want to find what offset ranges are present in a microbatch in Structured Streaming, you have to look at the StreamingQuery.lastProgress or use the QueryProgressListener . Both of these
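A brief sketch of both routes; the Python StreamingQueryListener is available from Spark 3.4 onwards, and the broker, topic and checkpoint path are placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql.streaming import StreamingQueryListener

spark = SparkSession.builder.appName("OffsetTracking").getOrCreate()

class OffsetListener(StreamingQueryListener):
    def onQueryStarted(self, event):
        pass

    def onQueryProgress(self, event):
        # Each source reports startOffset/endOffset per micro-batch -- the
        # Structured Streaming analogue of DStream's HasOffsetRanges.
        for source in event.progress.sources:
            print(source.description, source.startOffset, source.endOffset)

    def onQueryTerminated(self, event):
        pass

spark.streams.addListener(OffsetListener())

df = (spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "events")
      .load())

query = (df.writeStream.format("console")
         .option("checkpointLocation", "/tmp/offset-tracking-ckpt")
         .start())

# Alternatively, poll after a batch completes:
# query.lastProgress["sources"][0]["endOffset"]
```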

[ANNOUNCE] Apache Celeborn 0.4.1 available

2024-05-22 Thread Nicholas Jiang
Hi all, Apache Celeborn community is glad to announce the new release of Apache Celeborn 0.4.1. Celeborn is dedicated to improving the efficiency and elasticity of different map-reduce engines and provides an elastic, high-efficient service for intermediate data including shuffle data, spilled

Re: Dstream HasOffsetRanges equivalent in Structured streaming

2024-05-22 Thread Mich Talebzadeh
OK, to understand better: your current model relies on streaming data input through a Kafka topic, Spark does some ETL and you send to a sink, a database or file storage like HDFS etc? Your current architecture relies on Direct Streams (DStream) and RDDs and you want to move to Spark Structured

Re: Dstream HasOffsetRanges equivalent in Structured streaming

2024-05-22 Thread ashok34...@yahoo.com.INVALID
Hello, what options are you considering yourself? On Wednesday 22 May 2024 at 07:37:30 BST, Anil Dasari wrote: Hello, We are on Spark 3.x and using Spark dstream + kafka and planning to use structured streaming + Kafka. Is there an equivalent of Dstream HasOffsetRanges in structure

Dstream HasOffsetRanges equivalent in Structured streaming

2024-05-22 Thread Anil Dasari
Hello, We are on Spark 3.x and using Spark dstream + kafka and planning to use structured streaming + Kafka. Is there an equivalent of Dstream HasOffsetRanges in structured streaming to get the microbatch end offsets to checkpoint in our external checkpoint store? Thanks in advance. Regards

Re: Re: EXT: Dual Write to HDFS and MinIO in faster way

2024-05-21 Thread Prem Sahoo
I am looking for writer/committer optimization which can make the spark write faster. On Tue, May 21, 2024 at 9:15 PM eab...@163.com wrote: > Hi, > I think you should write to HDFS then copy file (parquet or orc) from > HDFS to MinIO. > > -- > eabour > > > *From:*
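For context, the usual way to write one DataFrame to two destinations without recomputing the lineage is to persist it first; a sketch with placeholder paths (this avoids re-execution rather than being a committer-level speed-up).

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DualWrite").getOrCreate()

df = spark.read.parquet("hdfs:///data/source")   # placeholder input

# Materialize once so the two writes below do not re-execute the lineage.
df.persist(StorageLevel.MEMORY_AND_DISK)
df.count()

df.write.mode("overwrite").parquet("hdfs:///data/target")             # HDFS copy
df.write.mode("overwrite").parquet("s3a://minio-bucket/data/target")  # MinIO copy

df.unpersist()
```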

Re: Re: EXT: Dual Write to HDFS and MinIO in faster way

2024-05-21 Thread eab...@163.com
Hi, I think you should write to HDFS then copy file (parquet or orc) from HDFS to MinIO. eabour From: Prem Sahoo Date: 2024-05-22 00:38 To: Vibhor Gupta; user Subject: Re: EXT: Dual Write to HDFS and MinIO in faster way On Tue, May 21, 2024 at 6:58 AM Prem Sahoo wrote: Hello Vibhor,

Re: A handy tool called spark-column-analyser

2024-05-21 Thread ashok34...@yahoo.com.INVALID
Great work. Very handy for identifying problems, thanks. On Tuesday 21 May 2024 at 18:12:15 BST, Mich Talebzadeh wrote: A colleague kindly pointed out about giving an example of output which will be added to README Doing analysis for column Postcode Json formatted output {    "Postcode":

Re: A handy tool called spark-column-analyser

2024-05-21 Thread Mich Talebzadeh
A colleague kindly pointed out about giving an example of output which will be added to README Doing analysis for column Postcode Json formatted output { "Postcode": { "exists": true, "num_rows": 93348, "data_type": "string", "null_count": 21921,
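A rough sketch of how such per-column stats can be computed in plain PySpark; this is not the package's actual code, and the input path and the `Postcode` column are assumptions taken from the example above.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("ColumnStats").getOrCreate()
df = spark.read.parquet("/data/addresses")   # placeholder input

col_name = "Postcode"
stats = df.agg(
    F.count(F.lit(1)).alias("num_rows"),
    F.sum(F.when(F.col(col_name).isNull(), 1).otherwise(0)).alias("null_count"),
    F.countDistinct(col_name).alias("distinct_count"),
).first()

print({
    col_name: {
        "exists": col_name in df.columns,
        "num_rows": stats["num_rows"],
        "data_type": dict(df.dtypes)[col_name],
        "null_count": stats["null_count"],
        "distinct_count": stats["distinct_count"],
    }
})
```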

Re: EXT: Dual Write to HDFS and MinIO in faster way

2024-05-21 Thread Prem Sahoo
On Tue, May 21, 2024 at 6:58 AM Prem Sahoo wrote: > Hello Vibhor, > Thanks for the suggestion. > I am looking for some other alternatives where the same > dataframe can be written to two destinations without re-execution and cache > or persist. > > Can someone help me in scenario 2

A handy tool called spark-column-analyser

2024-05-21 Thread Mich Talebzadeh
I just wanted to share a tool I built called *spark-column-analyzer*. It's a Python package that helps you dig into your Spark DataFrames with ease. Ever spend ages figuring out what's going on in your columns? Like, how many null values are there, or how many unique entries? Built with data

Re: pyspark dataframe join with two different data type

2024-05-17 Thread Karthick Nk
Hi All, I have tried to get the same result with pyspark and with a SQL query by creating a tempView; with SQL I was able to achieve it, whereas I have to do it in the pyspark code itself. Could you help on this? incoming_data = [["a"], ["b"], ["d"]] column_names = ["column1"] df =

Request for Assistance: Adding User Authentication to Apache Spark Application

2024-05-16 Thread NIKHIL RAJ SHRIVASTAVA
Dear Team, I hope this email finds you well. My name is Nikhil Raj, and I am currently working with Apache Spark for one of my projects , where through the help of a parquet file we are creating an external table in Spark. I am reaching out to seek assistance regarding user authentication for

Re: pyspark dataframe join with two different data type

2024-05-15 Thread Karthick Nk
Thanks Mich, I have tried this solution, but I want all the columns from the dataframe df_1. If I explode df_1 I am getting only the data column, but the result should have all the columns from df_1 with distinct results like below. Results in *df:* +---+ |column1| +---+ |

How to provide a Zstd "training mode" dictionary object

2024-05-15 Thread Saha, Daniel
Hi, I understand that Zstd compression can optionally be provided a dictionary object to improve performance. See “training mode” here https://facebook.github.io/zstd/ Does Spark surface a way to provide this dictionary object when writing/reading data? What about for intermediate shuffle

Query Regarding UDF Support in Spark Connect with Kubernetes as Cluster Manager

2024-05-15 Thread Nagatomi Yasukazu
Hi Spark Community, I have a question regarding the support for User-Defined Functions (UDFs) in Spark Connect, specifically when using Kubernetes as the Cluster Manager. According to the Spark documentation, UDFs are supported by default for the shell and in standalone applications with

Re: pyspark dataframe join with two different data type

2024-05-14 Thread Mich Talebzadeh
You can use a combination of explode and distinct before joining. from pyspark.sql import SparkSession from pyspark.sql.functions import explode # Create a SparkSession spark = SparkSession.builder \ .appName("JoinExample") \ .getOrCreate() sc = spark.sparkContext # Set the log level to
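A compact sketch of that explode-then-distinct approach, using the small example data from this thread (df_1 with an array column, df_2 with scalar values).

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode

spark = SparkSession.builder.appName("JoinExample").getOrCreate()

df_1 = spark.createDataFrame([(["a", "b"],), (["d"],)], ["data"])
df_2 = spark.createDataFrame([("a",), ("b",), ("d",)], ["column1"])

# Explode the array so each element becomes a row, de-duplicate, then join.
exploded = df_1.select(explode(col("data")).alias("column1")).distinct()
exploded.join(df_2, on="column1", how="inner").show()
```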

Re: pyspark dataframe join with two different data type

2024-05-14 Thread Karthick Nk
Hi All, Does anyone have any idea or suggestion of an alternate way to achieve this scenario? Thanks. On Sat, May 11, 2024 at 6:55 AM Damien Hawes wrote: > Right now, with the structure of your data, it isn't possible. > > The rows aren't duplicates of each other. "a" and "b" both exist in

Display a warning in EMR welcome screen

2024-05-11 Thread Abhishek Basu
Hi Team, for the Elastic Map Reduce (EMR) cluster, it would be great if there were a warning message that logging should be handled carefully and INFO or DEBUG should be enabled only when required. This logging thing took my whole day, and finally I discovered that it’s exploding the HDFS and not

Re: [spark-graphframes]: Generating incorrect edges

2024-05-11 Thread Nijland, J.G.W. (Jelle, Student M-CS)
Hi all, The issue is solved. I conducted a lot more testing and built checkers to verify at which size it's going wrong. When checking for specific edges, I could construct successful graphs up to 261k records. When verifying all edges created, it breaks somewhere in the 200-250k records. I

unsubscribe

2024-05-10 Thread J UDAY KIRAN
unsubscribe

Re: pyspark dataframe join with two different data type

2024-05-10 Thread Damien Hawes
Right now, with the structure of your data, it isn't possible. The rows aren't duplicates of each other. "a" and "b" both exist in the array. So Spark is correctly performing the join. It looks like you need to find another way to model this data to get what you want to achieve. Are the values

Re: pyspark dataframe join with two different data type

2024-05-10 Thread Karthick Nk
Hi Mich, Thanks for the solution. But I am getting duplicate results by using array_contains. I have explained the scenario below; could you help me with how we can achieve it? I have tried different ways but could not achieve it. For example data = [ ["a"], ["b"], ["d"], ]

Spark 3.5.x on Java 21?

2024-05-08 Thread Stephen Coy
Hi everyone, We’re about to upgrade our Spark clusters from Java 11 and Spark 3.2.1 to Spark 3.5.1. I know that 3.5.1 is supposed to be fine on Java 17, but will it run OK on Java 21? Thanks, Steve C This email contains confidential information of and is the copyright of Infomedia. It

Re: [Spark Streaming]: Save the records that are dropped by watermarking in spark structured streaming

2024-05-08 Thread Mich Talebzadeh
you may consider - Increase Watermark Retention: Consider increasing the watermark retention duration. This allows keeping records for a longer period before dropping them. However, this might increase processing latency and violate at-least-once semantics if the watermark lags behind real-time.
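For reference, the watermark delay is set per stream with withWatermark; a minimal sketch where the rate source, the 2-hour delay and the window size are illustrative only.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import count, window

spark = SparkSession.builder.appName("WatermarkExample").getOrCreate()

events = (spark.readStream.format("rate").option("rowsPerSecond", 5).load()
          .withColumnRenamed("timestamp", "event_time"))

# A longer watermark keeps late records in state longer before dropping them,
# at the cost of larger state and later finalized results.
agg = (events
       .withWatermark("event_time", "2 hours")
       .groupBy(window("event_time", "10 minutes"))
       .agg(count("*").alias("events")))

query = (agg.writeStream.outputMode("update").format("console")
         .option("checkpointLocation", "/tmp/watermark-ckpt")
         .start())
```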

Spark not creating staging dir for insertInto partitioned table

2024-05-07 Thread Sanskar Modi
Hi Folks, I wanted to check why spark doesn't create staging dir while doing an insertInto on partitioned tables. I'm running below example code – ``` spark.sql("set hive.exec.dynamic.partition.mode=nonstrict") val rdd = sc.parallelize(Seq((1, 5, 1), (2, 1, 2), (4, 4, 3))) val df =

[Spark Streaming]: Save the records that are dropped by watermarking in spark structured streaming

2024-05-07 Thread Nandha Kumar
Hi Team, We are trying to use *spark structured streaming *for our use case. We will be joining 2 streaming sources(from kafka topic) with watermarks. As time progresses, the records that are prior to the watermark timestamp are removed from the state. For our use case, we want to *store

unsubscribe

2024-05-07 Thread Wojciech Bombik
unsubscribe

unsubscribe

2024-05-06 Thread Moise
unsubscribe

Re: ********Spark streaming issue to Elastic data**********

2024-05-06 Thread Mich Talebzadeh
Hi Karthick, Unfortunately Materialised views are not available in Spark as yet. I raised Jira [SPARK-48117] Spark Materialized Views: Improve Query Performance and Data Management - ASF JIRA (apache.org) as a feature request. Let me think

Re: ********Spark streaming issue to Elastic data**********

2024-05-06 Thread Karthick Nk
Thanks Mich, can you please confirm whether my understanding is correct? First, we have to create the materialized view based on the mapping details we have, using multiple tables as source (since we have multiple join conditions from different tables). From the materialised view we can stream the

unsubscribe

2024-05-04 Thread chen...@birdiexx.com
unsubscribe

unsubscribe

2024-05-03 Thread Bing
Replied Message | From | Wood Super | | Date | 05/01/2024 07:49 | | To | user | | Subject | unsubscribe | unsubscribe

Spark Materialized Views: Improve Query Performance and Data Management

2024-05-03 Thread Mich Talebzadeh
Hi, I have raised a ticket SPARK-48117 for enhancing Spark capabilities with Materialised Views (MV). Currently both Hive and Databricks support this. I have added these potential benefits to the ticket -* Improved Query Performance

Re: Issue with Materialized Views in Spark SQL

2024-05-03 Thread Mich Talebzadeh
Sadly it sounds like Apache Spark can do nothing with materialised views. I was hoping it could read them! >>> *spark.sql("SELECT * FROM test.mv ").show()* Traceback (most recent call last): File "", line 1, in File "/opt/spark/python/pyspark/sql/session.py", line 1440, in

Help needed optimize spark history server performance

2024-05-03 Thread Vikas Tharyani
Dear Spark Community, I'm writing to seek your expertise in optimizing the performance of our Spark History Server (SHS) deployed on Amazon EKS. We're encountering timeouts (HTTP 504) when loading large event logs exceeding 5 GB. *Our Setup:* - Deployment: SHS on EKS with Nginx ingress (idle

Re: ********Spark streaming issue to Elastic data**********

2024-05-03 Thread Mich Talebzadeh
My recommendation is that using materialized views (MVs) created in Hive with Spark Structured Streaming and Change Data Capture (CDC) is a good combination for efficiently streaming view data updates in your scenario. HTH Mich Talebzadeh, Technologist | Architect | Data Engineer | Generative AI |

Re: Issue with Materialized Views in Spark SQL

2024-05-03 Thread Mich Talebzadeh
Thanks for the comments I received. So in summary, Apache Spark itself doesn't directly manage materialized views (MVs), but it can work with them through integration with the underlying data storage systems like Hive or through Iceberg. I believe Databricks through Unity Catalog supports MVs as

Re: Issue with Materialized Views in Spark SQL

2024-05-02 Thread Jungtaek Lim
(removing dev@ as I don't think this is dev@ related thread but more about "question") My understanding is that Apache Spark does not support Materialized View. That's all. IMHO it's not a proper expectation that all operations in Apache Hive will be supported in Apache Spark. They are different

Re: Issue with Materialized Views in Spark SQL

2024-05-02 Thread Walaa Eldin Moustafa
I do not think the issue is with DROP MATERIALIZED VIEW only, but also with CREATE MATERIALIZED VIEW, because neither is supported in Spark. I guess you must have created the view from Hive and are trying to drop it from Spark, and that is why you are running into the issue with DROP first. There is

Issue with Materialized Views in Spark SQL

2024-05-02 Thread Mich Talebzadeh
An issue I encountered while working with Materialized Views in Spark SQL. It appears that there is an inconsistency between the behavior of Materialized Views in Spark SQL and Hive. When attempting to execute a statement like DROP MATERIALIZED VIEW IF EXISTS test.mv in Spark SQL, I encountered a

********Spark streaming issue to Elastic data**********

2024-05-02 Thread Karthick Nk
Hi All, Requirements: I am working on the data flow, which will use the view definition(view definition already defined in schema), there are multiple tables used in the view definition. Here we want to stream the view data into elastic index based on if any of the table(used in the view

unsubscribe

2024-05-01 Thread Nebi Aydin
unsubscribe

unsubscribe

2024-05-01 Thread Atakala Selam
unsubscribe

unsubscribe

2024-05-01 Thread Yoel Benharrous
unsubscribe

Traceback is missing content in pyspark when invoked with UDF

2024-05-01 Thread Indivar Mishra
Hi *Tl;Dr:* I have a scenario where I generate code string on fly and execute that code, now for me if an error occurs I need the traceback but for executable code I just get partial traceback i.e. the line which caused the error is missing. Consider below MRC: def fun(): from pyspark.sql

Re: [spark-graphframes]: Generating incorrect edges

2024-05-01 Thread Mich Talebzadeh
Hi Steve, Thanks for your statement. I tend to use uuid myself to avoid collisions. This built-in function generates random IDs that are highly likely to be unique across systems. My concerns are on edge so to speak. If the Spark application runs for a very long time or encounters restarts, the

Re: [spark-graphframes]: Generating incorrect edges

2024-04-30 Thread Stephen Coy
Hi Mich, I was just reading random questions on the user list when I noticed that you said: On 25 Apr 2024, at 2:12 AM, Mich Talebzadeh wrote: 1) You are using monotonically_increasing_id(), which is not collision-resistant in distributed environments like Spark. Multiple hosts can

unsubscribe

2024-04-30 Thread Wood Super
unsubscribe

unsubscribe

2024-04-30 Thread junhua . xie
unsubscribe

unsubscribe

2024-04-30 Thread Yoel Benharrous

Re: spark.sql.shuffle.partitions=auto

2024-04-30 Thread Mich Talebzadeh
spark.sql.shuffle.partitions=auto does not exist in Apache Spark itself. This configuration option is specific to Databricks and their managed Spark offering. It allows Databricks to automatically determine an optimal number of shuffle partitions for your workload. HTH Mich Talebzadeh,
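On vanilla Spark the closest built-in alternative is Adaptive Query Execution, which can coalesce small shuffle partitions at runtime; a minimal sketch (the output path is a placeholder).

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("AqeShufflePartitions")
         # AQE picks the post-shuffle partition count at runtime instead of
         # using a fixed spark.sql.shuffle.partitions value for every shuffle.
         .config("spark.sql.adaptive.enabled", "true")
         .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
         .getOrCreate())

df = spark.range(1_000_000)
agg = df.groupBy((df.id % 100).alias("bucket")).count()
agg.write.mode("overwrite").parquet("/tmp/aqe-demo")
```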

Re: Spark on Kubernetes

2024-04-30 Thread Mich Talebzadeh
Hi, In k8s the driver is responsible for executor creation. The likely cause of your problem is insufficient memory allocated for executors in the K8s cluster. Even with dynamic allocation, k8s won't schedule executor pods if there is not enough free memory to fulfill their resource requests.

spark.sql.shuffle.partitions=auto

2024-04-30 Thread second_co...@yahoo.com.INVALID
May I know whether spark.sql.shuffle.partitions=auto is only available on Databricks? What about on vanilla Spark? When I set this, it gives an error saying it needs an int. Is there any open source library that auto-finds the best partition or block size for a dataframe?

Spark on Kubernetes

2024-04-29 Thread Tarun raghav
Respected Sir/Madam, I am Tarunraghav. I have a query regarding spark on kubernetes. We have an eks cluster, within which we have spark installed in the pods. We set the executor memory as 1GB and set the executor instances as 2, I have also set dynamic allocation as true. So when I try to read a

Re: Python for the kids and now PySpark

2024-04-28 Thread Meena Rajani
Mich, you are right, these days the attention span is getting shorter. Christian could work on a completely new thing for 3 hours and is proud to explain it. It is amazing. Thanks for sharing. On Sat, Apr 27, 2024 at 9:40 PM Farshid Ashouri wrote: > Mich, this is absolutely amazing. > > Thanks

Re: Python for the kids and now PySpark

2024-04-27 Thread Farshid Ashouri
Mich, this is absolutely amazing. Thanks for sharing. On Sat, 27 Apr 2024, 22:26 Mich Talebzadeh, wrote: > Python for the kids. Slightly off-topic but worthwhile sharing. > > One of the things that may benefit kids is starting to learn something > new. Basically anything that can focus their

[Release Question]: Estimate on 3.5.2 release?

2024-04-26 Thread Paul Gerver
Hello, I'm curious if there is an estimate when 3.5.2 for Spark Core will be released. There are several bug and security vulnerability fixes in the dependencies we are excited to receive! If anyone has any insights, that would be greatly appreciated. Thanks! - ​Paul

[SparkListener] Accessing classes loaded via the '--packages' option

2024-04-26 Thread Damien Hawes
Hi folks, I'm contributing to the OpenLineage project, specifically the Apache Spark integration. My current focus is on extending the project to support data lineage extraction for Spark Streaming, beginning with Apache Kafka sources and sinks. I've encountered an obstacle when attempting to

unsubscribe

2024-04-25 Thread Code Tutelage

Re:RE: How to add MaxDOP option in spark mssql JDBC

2024-04-25 Thread Elite
Thank you. My main purpose is to pass "MaxDop 1" to MSSQL to control the CPU usage. From the official doc, I guess the problem with my code is that spark wraps the query as select * from (SELECT TOP 10 * FROM dbo.Demo with (nolock) WHERE Id = 1 option (maxdop 1)) spark_gen_alias Apparently, this

Re: [spark-graphframes]: Generating incorrect edges

2024-04-25 Thread Nijland, J.G.W. (Jelle, Student M-CS)
Hi Mich, Thanks for your suggestions. 1) It currently runs on one server with plenty of resources assigned. But I will keep it in mind to replace monotonically_increasing_id() with uuid() once we scale up. 2) I have replaced the null values in origin with a string

DataFrameReader: timestampFormat default value

2024-04-24 Thread keen
Is anyone familiar with [Datetime patterns]( https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html) and `TimestampType` parsing in PySpark? When reading CSV or JSON files, timestamp columns need to be parsed via the datasource property `timestampFormat`. [According to documentation](
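A small sketch of setting that option explicitly when reading CSV; the path, schema and pattern are illustrative, with the pattern syntax following the Datetime Patterns page linked above.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("TimestampFormatExample").getOrCreate()

schema = StructType([
    StructField("id", StringType()),
    StructField("created_at", TimestampType()),
])

df = (spark.read
      .option("header", "true")
      # Pins down how the timestamp column is parsed instead of relying on
      # the reader's default pattern.
      .option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss")
      .schema(schema)
      .csv("/data/events.csv"))   # placeholder path
df.printSchema()
```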

Re: [spark-graphframes]: Generating incorrect edges

2024-04-24 Thread Mich Talebzadeh
OK let us have a look at these 1) You are using monotonically_increasing_id(), which is not collision-resistant in distributed environments like Spark. Multiple hosts can generate the same ID. I suggest switching to UUIDs (e.g., uuid.uuid4()) for guaranteed uniqueness. 2) Missing values in
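A brief sketch of the switch suggested in point 1, using Spark SQL's built-in uuid() expression; the toy DataFrame is just an example.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr, monotonically_increasing_id

spark = SparkSession.builder.appName("IdGeneration").getOrCreate()
df = spark.createDataFrame([("alice",), ("bob",)], ["name"])

# monotonically_increasing_id() is only guaranteed unique within a single
# DataFrame evaluation; uuid() generates a random UUID per row, avoiding
# collisions across runs and hosts.
with_ids = (df
            .withColumn("mono_id", monotonically_increasing_id())
            .withColumn("uuid_id", expr("uuid()")))
with_ids.show(truncate=False)
```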

Re: [spark-graphframes]: Generating incorrect edges

2024-04-24 Thread Nijland, J.G.W. (Jelle, Student M-CS)
Hi Mich, Thanks for your reply, 1) ID generation is done using monotonically_increasing_id() this is then prefixed with "p_", "m_", "o_" or "org_" depending on the

Re: [spark-graphframes]: Generating incorrect edges

2024-04-24 Thread Mich Talebzadeh
OK few observations 1) ID Generation Method: How are you generating unique IDs (UUIDs, sequential numbers, etc.)? 2) Data Inconsistencies: Have you checked for missing values impacting ID generation? 3) Join Verification: If relevant, can you share the code for joining data points during ID

RE: How to add MaxDOP option in spark mssql JDBC

2024-04-24 Thread Appel, Kevin
You might be able to leverage the prepareQuery option, that is at https://spark.apache.org/docs/3.5.1/sql-data-sources-jdbc.html#data-source-option ... this was introduced in Spark 3.4.0 to handle temp table query and CTE query against MSSQL server since what you send in is not actually what
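An untested sketch of that prepareQuery pattern, loosely following the temp-table example in the JDBC data-source docs; the URL, credentials and table are placeholders, and whether SQL Server accepts the MAXDOP hint in this prepared statement needs verifying.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MssqlPrepareQuery").getOrCreate()

# prepareQuery is prepended to the generated statement rather than being
# wrapped as a subquery, so clauses that Spark's usual
# "select * from (...) spark_gen_alias" rewrite would break (temp tables,
# CTEs, possibly query hints) can live there.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:sqlserver://mssql-host:1433;databaseName=testdb")
      .option("user", "user")
      .option("password", "password")
      .option("prepareQuery",
              "SELECT TOP 10 * INTO #Demo FROM dbo.Demo WITH (NOLOCK) "
              "WHERE Id = 1 OPTION (MAXDOP 1)")
      .option("query", "SELECT * FROM #Demo")
      .load())
df.show()
```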

[spark-graphframes]: Generating incorrect edges

2024-04-24 Thread Nijland, J.G.W. (Jelle, Student M-CS)
tags: pyspark,spark-graphframes Hello, I am running pyspark in a podman container and I have issues with incorrect edges when I build my graph. I start with loading a source dataframe from a parquet directory on my server. The source dataframe has the following columns:

How to add MaxDOP option in spark mssql JDBC

2024-04-23 Thread Elite
[QUESTION] How to pass MAXDOP option · Issue #2395 · microsoft/mssql-jdbc (github.com) Hi team, I was advised to ask for help from the spark community. We suspect spark rewrites the query before passing it to MS SQL, and it leads to a syntax error. Is there any workaround to make my code work?

How to use Structured Streaming in Spark SQL

2024-04-22 Thread ????
In Flink, you can create streaming tables using Flink SQL and connect directly with SQL through CDC and Kafka. How can I use SQL for stream processing in Spark? 308027...@qq.com

How to access the internal hidden columns of table by spark jdbc

2024-04-20 Thread casel.chen
I want to use spark jdbc to access the alibaba cloud hologres (https://www.alibabacloud.com/product/hologres) internal hidden column `hg_binlog_timestamp_us` but got the following error: Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'hg_binlog_timestamp_us'

Accounting the impact of failures in spark jobs

2024-04-19 Thread Faiz Halde
Hello, In my organization, we have an accounting system for spark jobs that uses the task execution time to determine how much time a spark job uses the executors for and we use it as a way to segregate cost. We sum all the task times per job and apply proportions. Our clusters follow a 1 task

StreamingQueryListener integration with Spark native metric sink (JmxSink)

2024-04-18 Thread Mason Chen
Hi all, Is it possible to integrate StreamingQueryListener with Spark metrics so that metrics can be reported through Spark's internal metric system? Ideally, I would like to report some custom metrics through StreamingQueryListener and export them to Spark's JmxSink. Best, Mason

[ANNOUNCE] Apache Spark 3.4.3 released

2024-04-18 Thread Dongjoon Hyun
We are happy to announce the availability of Apache Spark 3.4.3! Spark 3.4.3 is a maintenance release containing many fixes including security and correctness domains. This release is based on the branch-3.4 maintenance branch of Spark. We strongly recommend all 3.4 users to upgrade to this

[Spark SQL][How-To] Remove builtin function support from Spark

2024-04-17 Thread Matthew McMillian
Hello, I'm very new to the Spark ecosystem, apologies if this question is a bit simple. I want to modify a custom fork of Spark to remove function support. For example, I want to remove the query runners ability to call reflect and java_method. I saw that there exists a data structure in

should OutputCommitCoordinator fail stages for authorized committer failures when using s3a optimized committers?

2024-04-17 Thread Dylan McClelland
In https://issues.apache.org/jira/browse/SPARK-39195, OutputCommitCoordinator was modified to fail a stage if an authorized committer task fails. We run our spark jobs on a k8s cluster managed by karpenter and mostly built from spot instances. As a result, our executors are frequently killed.

[Spark SQL] xxhash64 default seed of 42 confusion

2024-04-16 Thread Igor Calabria
Hi all, I've noticed that spark's xxhash64 output doesn't match other tools' due to using seed=42 as a default. I've looked at a few libraries and they use 0 as a default seed: - python https://github.com/ifduyue/python-xxhash - java https://github.com/OpenHFT/Zero-Allocation-Hashing/ - java

auto create event log directory if not exist

2024-04-15 Thread second_co...@yahoo.com.INVALID
Spark history server is set to use s3a, like below spark.eventLog.enabled true spark.eventLog.dir s3a://bucket-test/test-directory-log Is there any configuration option I can set in the Spark config such that if the directory 'test-directory-log' does not exist it is auto-created before starting the Spark history

Re: Spark streaming job for kafka transaction does not consume read_committed messages correctly.

2024-04-14 Thread Kidong Lee
Thanks, Mich for your reply. I agree, it is not so scalable and efficient. But it works correctly for kafka transaction, and there is no problem with committing offset to kafka async for now. I try to tell you some more details about my streaming job. CustomReceiver does not receive anything

Re: Spark streaming job for kafka transaction does not consume read_committed messages correctly.

2024-04-14 Thread Mich Talebzadeh
Interesting. My concern is the infinite loop in *foreachRDD*: the *while(true)* loop within foreachRDD creates an infinite loop within each Spark executor. This might not be the most efficient approach, especially since offsets are committed asynchronously? HTH Mich Talebzadeh, Technologist |

Re: Spark streaming job for kafka transaction does not consume read_committed messages correctly.

2024-04-14 Thread Kidong Lee
Because spark streaming for kafka transactions does not work correctly to suit my need, I moved to another approach using a raw kafka consumer which handles read_committed messages from kafka correctly. My code looks like the following. JavaDStream stream = ssc.receiverStream(new CustomReceiver());

Re: Spark streaming job for kafka transaction does not consume read_committed messages correctly.

2024-04-13 Thread Kidong Lee
Thank you Mich for your reply. Actually, I tried to do most of your advice. When spark.streaming.kafka.allowNonConsecutiveOffsets=false, I got the following error. Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 3)

Re: Spark streaming job for kafka transaction does not consume read_committed messages correctly.

2024-04-13 Thread Mich Talebzadeh
Hi Kidong, There may be a few potential reasons why the message counts from your Kafka producer and Spark Streaming consumer might not match, especially with transactional messages and read_committed isolation level. 1) Just ensure that both your Spark Streaming job and the Kafka consumer written

Spark streaming job for kafka transaction does not consume read_committed messages correctly.

2024-04-12 Thread Kidong Lee
Hi, I have a kafka producer which sends messages transactionally to kafka and spark streaming job which should consume read_committed messages from kafka. But there is a problem for spark streaming to consume read_committed messages. The count of messages sent by kafka producer transactionally is
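Not what the thread uses (it is on DStreams), but for comparison, a hedged sketch of the Structured Streaming Kafka source with the consumer's isolation level passed through the kafka.-prefixed option; broker, topic and checkpoint path are placeholders, and support for this option should be checked against the Kafka integration guide for your Spark version.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ReadCommittedKafka").getOrCreate()

# kafka.-prefixed options are forwarded to the underlying Kafka consumer, so
# isolation.level=read_committed should skip aborted transactional messages.
df = (spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "tx-topic")
      .option("kafka.isolation.level", "read_committed")
      .option("startingOffsets", "earliest")
      .load())

query = (df.selectExpr("CAST(value AS STRING) AS value")
         .writeStream.format("console")
         .option("checkpointLocation", "/tmp/read-committed-ckpt")
         .start())
```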

Spark column headings, camelCase or snake case?

2024-04-11 Thread Mich Talebzadeh
I know this is a bit of a silly question. But what is the norm for Spark column headings? Is it camelCase or snake_case? For example, here someone suggested, and I quote, "SumTotalInMillionGBP" accurately conveys the meaning but is a bit long and uses camelCase, which is not the standard convention

Re: External Spark shuffle service for k8s

2024-04-11 Thread Bjørn Jørgensen
I think this answers your question about what to do if you need more space on nodes. https://spark.apache.org/docs/latest/running-on-kubernetes.html#local-storage Local Storage Spark supports using volumes to spill
