[no subject]

2024-02-03 Thread Gavin McDonald
Hello to all users, contributors and Committers! The Travel Assistance Committee (TAC) are pleased to announce that travel assistance applications for Community over Code EU 2024 are now open! We will be supporting Community over Code EU, Bratislava, Slovakia, June 3rd - 5th, 2024. TAC exists

Re: Issue in Creating Temp_view in databricks and using spark.sql().

2024-01-31 Thread Mich Talebzadeh
I agree with what is stated. This is the gist of my understanding, having tested it. When working with Spark Structured Streaming, each streaming query runs in its own separate Spark session to ensure isolation and avoid conflicts between different queries. So here I have: def process_data(self,

Re: Issue in Creating Temp_view in databricks and using spark.sql().

2024-01-31 Thread Mich Talebzadeh
hm. In your logic here: def process_micro_batch(micro_batch_df, batchId): micro_batch_df.createOrReplaceTempView("temp_view"); df = spark.sql(f"select * from temp_view"); return df. Is this function called, and if so, do you check if micro_batch_df contains rows -> if

deploy spark as cluster

2024-01-31 Thread ali sharifi
Hi everyone! I followed this guide https://dev.to/mvillarrealb/creating-a-spark-standalone-cluster-with-docker-and-docker-compose-2021-update-6l4 to create a Spark cluster on an Ubuntu server with Docker. However, when I try to submit my PySpark code to the master, the jobs are registered in the

Create Custom Logs

2024-01-31 Thread PRASHANT L
Hi, I just wanted to check if there is a way to create custom logs in Spark. I want to write selective/custom log messages to S3, running spark submit on EMR. I would not want all the Spark generated logs ... I would just need the log messages that are logged as part of the Spark Application

Re: Issue in Creating Temp_view in databricks and using spark.sql().

2024-01-31 Thread Jungtaek Lim
Hi, a streaming query clones the Spark session - when you create a temp view from a DataFrame, the temp view is created under the cloned session. You will need to use micro_batch_df.sparkSession to access the cloned session. Thanks, Jungtaek Lim (HeartSaVioR) On Wed, Jan 31, 2024 at 3:29 PM
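
A minimal sketch of the fix described above, assuming a foreachBatch sink; the rate source, view name, and query are illustrative, not the original poster's code:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName("cloned-session-demo").getOrCreate()

    def process_micro_batch(micro_batch_df, batch_id):
        # The temp view lives in the cloned session attached to the
        # micro-batch, so query it via micro_batch_df.sparkSession,
        # not via the outer `spark` session.
        session = micro_batch_df.sparkSession
        micro_batch_df.createOrReplaceTempView("temp_view")
        session.sql("SELECT count(*) AS cnt FROM temp_view").show()

    stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()
    query = stream.writeStream.foreachBatch(process_micro_batch).start()
    query.awaitTermination()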

randomsplit has issue?

2024-01-31 Thread second_co...@yahoo.com.INVALID
based on this blog post https://sergei-ivanov.medium.com/why-you-should-not-use-randomsplit-in-pyspark-to-split-data-into-train-and-test-58576d539a36 , I noticed a recommendation against using randomSplit for data splitting due to data sorting. Is the information provided in the blog accurate?
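
For context, a hedged sketch of the two mitigations commonly suggested for this: caching before randomSplit so both splits see the same materialized data, or a deterministic hash-based split. The 80/20 bucket split on the `id` column is illustrative:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.master("local[*]").getOrCreate()
    df = spark.range(1000)

    # Option 1: materialize once so both splits see identical data
    df.cache()
    train, test = df.randomSplit([0.8, 0.2], seed=42)

    # Option 2: deterministic split keyed on a stable column
    bucketed = df.withColumn("bucket", F.abs(F.hash("id")) % 100)
    train2 = bucketed.filter("bucket < 80").drop("bucket")
    test2 = bucketed.filter("bucket >= 80").drop("bucket")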

Issue in Creating Temp_view in databricks and using spark.sql().

2024-01-30 Thread Karthick Nk
Hi Team, I am using structured streaming in pyspark in Azure Databricks, and I am creating a temp_view from a dataframe (df.createOrReplaceTempView('temp_view')) for performing spark sql query transformations. I am facing the issue that the temp_view is not found, so as a workaround I have

[Spark SQL]: Crash when attempting to select PostgreSQL bpchar without length specifier in Spark 3.5.0

2024-01-29 Thread Lily Hahn
Hi, I’m currently migrating an ETL project to Spark 3.5.0 from 3.2.1 and ran into an issue with some of our queries that read from PostgreSQL databases. Any attempt to run a Spark SQL query that selects a bpchar without a length specifier from the source DB seems to crash:

Re: startTimestamp doesn't work when using rate-micro-batch format

2024-01-29 Thread Mich Talebzadeh
As I stated earlier on, there are alternatives that you might explore, such as socket sources for testing purposes. from pyspark.sql import SparkSession from pyspark.sql.functions import expr, when from pyspark.sql.types import StructType, StructField, LongType spark = SparkSession.builder \

Re: startTimestamp doesn't work when using rate-micro-batch format

2024-01-29 Thread Perfect Stranger
Yes, there's definitely an issue, can someone fix it? I'm not familiar with apache jira, do I need to make a bug report or what? On Mon, Jan 29, 2024 at 2:57 AM Mich Talebzadeh wrote: > OK > > This is the equivalent Python code > > from pyspark.sql import SparkSession > from

Re: startTimestamp doesn't work when using rate-micro-batch format

2024-01-28 Thread Mich Talebzadeh
OK This is the equivalent Python code from pyspark.sql import SparkSession from pyspark.sql.functions import expr, when from pyspark.sql.types import StructType, StructField, LongType from datetime import datetime spark = SparkSession.builder \ .master("local[*]") \

startTimestamp doesn't work when using rate-micro-batch format

2024-01-28 Thread Perfect Stranger
I described the issue here: https://stackoverflow.com/questions/77893939/how-does-starttimestamp-option-work-for-the-rate-micro-batch-format Could someone please respond? The rate-micro-batch format doesn't seem to respect the startTimestamp option. Thanks.
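
For reference, a minimal repro sketch of the option in question; the rows-per-batch value and timestamp are arbitrary, and startTimestamp is epoch milliseconds:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").getOrCreate()

    stream = (spark.readStream
        .format("rate-micro-batch")
        .option("rowsPerBatch", 10)
        .option("startTimestamp", 1704067200000)  # 2024-01-01T00:00:00Z
        .load())

    # The `timestamp` column of the first batch is expected to start at
    # startTimestamp; the reported issue is that it does not.
    query = stream.writeStream.format("console").start()
    query.awaitTermination()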

subscribe

2024-01-26 Thread Sahib Aulakh
subscribe

subscribe

2024-01-26 Thread Sahib Aulakh

Re: [Structured Streaming] Avoid one microbatch delay with multiple stateful operations

2024-01-24 Thread Andrzej Zera
Hi, I'm sorry but I got confused about the inner workings of the late events watermark. You're completely right. Thanks for clarifying. Regards, Andrzej On Thu, 11 Jan 2024 at 13:02, Jungtaek Lim wrote: > Hi, > > The time window is closed and evicted as long as "eviction watermark" > passes the

Some optimization questions about our beloved engine Spark

2024-01-23 Thread Aissam Chia
Hi, I hope this email finds you well. Currently, I'm working on Spark SQL and I have two main questions that I've been struggling with for 2 weeks now. I'm running Spark on AWS EMR: 1. I'm running 30 Spark applications in the same cluster. My applications are basically some SQL

Facing Error org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for s3ablock-0001-

2024-01-17 Thread Abhishek Singla
Hi Team, Version: 3.2.2 Java Version: 1.8.0_211 Scala Version: 2.12.15 Cluster: Standalone I am using Spark Streaming to read from Kafka and write to S3. The job fails with below error if there are no records published to Kafka for a few days and then there are some records published. Could
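
One hedged guess while this is diagnosed: the s3ablock-* files are staged in the local directories behind fs.s3a.buffer.dir, so pointing that at a valid, writable path (or buffering uploads in memory) is a common remedy; the path below is a placeholder, not a confirmed fix for this report:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
        .appName("s3a-buffer-demo")
        # must exist on every executor and be writable
        .config("spark.hadoop.fs.s3a.buffer.dir", "/mnt/tmp/s3a")
        # alternative: skip disk staging and buffer uploads in memory
        # .config("spark.hadoop.fs.s3a.fast.upload.buffer", "bytebuffer")
        .getOrCreate())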

Re: unsubscribe

2024-01-17 Thread Крюков Виталий Семенович
unsubscribe From: Leandro Martelli Sent: 17 January 2024 01:53:04 To: user@spark.apache.org Subject: unsubscribe unsubscribe

unsubscribe

2024-01-16 Thread Leandro Martelli
unsubscribe

Unsubscribe

2024-01-13 Thread Andrew Redd
Unsubscribe

Re: [spark.local.dir] comma separated list does not work

2024-01-12 Thread Andrew Petersen
Actually, that did work, thanks. What I previously tried that did not work was #BSUB -env "all,SPARK_LOCAL_DIRS=/tmp,/share/,SPARK_PID_DIR=..." However, I am still getting "No space left on device" errors. It seems that I need hierarchical directories, and round robin distribution is not good

Re: [spark.local.dir] comma separated list does not work

2024-01-12 Thread Andrew Petersen
Without spaces was the first thing I tried. The information in the pdf file inspired me to try the space. On Fri, Jan 12, 2024 at 10:23 PM Koert Kuipers wrote: > try it without spaces? > export SPARK_LOCAL_DIRS="/tmp,/share/" > > On Fri, Jan 12, 2024 at 5:00 PM Andrew Petersen > wrote: >

Re: [spark.local.dir] comma separated list does not work

2024-01-12 Thread Koert Kuipers
try it without spaces? export SPARK_LOCAL_DIRS="/tmp,/share/" On Fri, Jan 12, 2024 at 5:00 PM Andrew Petersen wrote: > Hello Spark community > > SPARK_LOCAL_DIRS or > spark.local.dir > is supposed to accept a list. > > I want to list one local (fast) drive, followed by a gpfs network drive,

[spark.local.dir] comma separated list does not work

2024-01-12 Thread Andrew Petersen
Hello Spark community SPARK_LOCAL_DIRS or spark.local.dir is supposed to accept a list. I want to list one local (fast) drive, followed by a gpfs network drive, similar to what is done here: https://cug.org/proceedings/cug2016_proceedings/includes/files/pap129s2-file1.pdf "Thus it is preferable
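
For reference, a sketch of the two equivalent ways to pass the list; paths are examples, there must be no spaces around the comma, and spark.local.dir has to be set before the SparkContext starts:

    # export SPARK_LOCAL_DIRS="/tmp/spark,/share/spark"   # env-var form
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
        .config("spark.local.dir", "/tmp/spark,/share/spark")
        .getOrCreate())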

[GraphFrames Spark Package]: Why is there not a distribution for Spark 3.3?

2024-01-12 Thread Boileau, Brad
Hello, I was hoping to use a distribution of GraphFrames for AWS Glue 4, which has Spark 3.3, but there is no distribution for Spark 3.3 at this location: https://spark-packages.org/package/graphframes/graphframes Do you have any advice on the best compatible version to use for Spark 3.3?

Re: [Structured Streaming] Keeping checkpointing cost under control

2024-01-11 Thread Mich Talebzadeh
catching up a bit late on this, I mentioned optimising RocksDB as below in my earlier thread, specifically # Add RocksDB configurations here spark.conf.set("spark.sql.streaming.stateStore.providerClass",

Re: Structured Streaming Process Each Records Individually

2024-01-11 Thread Mich Talebzadeh
Hi, Let us revisit the approach, as some fellow members correctly highlighted the use case for Spark Structured Streaming, and two key concepts that I will mention: - foreach: A method for applying custom write logic to each individual row in a streaming DataFrame or Dataset. -

Best option to process single kafka stream in parallel: PySpark Vs Dask

2024-01-11 Thread lab22
I am creating a setup to process packets from a single Kafka topic in parallel. For example, I have 3 containers (let's take 4 cores) on one VM, and from 1 Kafka topic stream I create 10 jobs depending on packet source. These packets have a small workload. 1. I can install dask in each

Re: [Structured Streaming] Keeping checkpointing cost under control

2024-01-11 Thread Jungtaek Lim
If you use RocksDB state store provider, you can turn on changelog checkpoint to put the single changelog file per partition per batch. With disabling changelog checkpoint, Spark uploads newly created SST files and some log files. If compaction had happened, most SST files have to be re-uploaded.
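
A hedged sketch of the settings being referred to; key names are as documented for recent Spark releases (3.4+, as far as I know), and they must be set before the streaming query starts:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    spark.conf.set(
        "spark.sql.streaming.stateStore.providerClass",
        "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider")
    # changelog checkpointing: upload a per-batch changelog instead of SST files
    spark.conf.set(
        "spark.sql.streaming.stateStore.rocksdb.changelogCheckpointing.enabled",
        "true")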

Re: [Structured Streaming] Avoid one microbatch delay with multiple stateful operations

2024-01-11 Thread Jungtaek Lim
Hi, The time window is closed and evicted as long as "eviction watermark" passes the end of the window. Late events watermark only deals with discarding late events from "inputs". We did not introduce additional delay on the work of multiple stateful operators. We just allowed more late events to

Re: Okio Vulnerability in Spark 3.4.1

2024-01-11 Thread Bjørn Jørgensen
[SPARK-46662][K8S][BUILD] Upgrade kubernetes-client to 6.10.0 A new version of kubernetes-client with okio version 1.17.6 is now merged to master and will be in the Spark 4.0 version. Tue, 14 Nov 2023 at 15:21 Bjørn Jørgensen wrote: > FYI > I have

Re: [Structured Streaming] Avoid one microbatch delay with multiple stateful operations

2024-01-10 Thread Ant Kutschera
Hi, "Do we have any option to make streaming queries with multiple stateful operations output data without waiting for this extra iteration?" One of my ideas was to force an empty microbatch to run and propagate the late events watermark without any new data. While this conceptually works, I didn't find a

Re: Structured Streaming Process Each Records Individually

2024-01-10 Thread Ant Kutschera
It might be good to first split the stream up into smaller streams, one per type. If ordering of the Kafka records is important, then you could partition them at the source based on the type, but be careful how you configure Spark to read from Kafka as that could also influence ordering. kdf

Re: Structured Streaming Process Each Records Individually

2024-01-10 Thread Mich Talebzadeh
Use an intermediate work table to land the streaming JSON data in the first place, and then store the data in the correct table according to the tag. HTH Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London United Kingdom view my Linkedin profile

Re: Structured Streaming Process Each Records Individually

2024-01-10 Thread Khalid Mammadov
Use foreachBatch or foreach methods: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#using-foreach-and-foreachbatch On Wed, 10 Jan 2024, 17:42 PRASHANT L, wrote: > Hi > I have a use case where I need to process json payloads coming from Kafka > using structured

Structured Streaming Process Each Records Individually

2024-01-10 Thread PRASHANT L
Hi, I have a use case where I need to process JSON payloads coming from Kafka using structured streaming, but the thing is the JSON can have different formats; the schema is not fixed and each JSON will have a @type tag, so based on the tag the JSON has to be parsed and loaded to a table with the tag name, and if a
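
A hedged sketch of one way to do this with foreachBatch; the broker, topic, and table naming are made up for illustration, and real payloads would need a proper per-type schema rather than a generic write:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    raw = (spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")  # placeholder
        .option("subscribe", "events")                     # placeholder
        .load()
        .select(F.col("value").cast("string").alias("json")))

    def route_by_type(batch_df, batch_id):
        # pull out the @type tag, then write each subset to its own table
        tagged = batch_df.withColumn(
            "type", F.from_json("json", "`@type` STRING")["@type"])
        for row in tagged.select("type").distinct().collect():
            (tagged.filter(F.col("type") == row["type"])
                .write.mode("append")
                .saveAsTable(f"events_{row['type']}"))

    query = raw.writeStream.foreachBatch(route_by_type).start()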

Re: [Structured Streaming] Keeping checkpointing cost under control

2024-01-10 Thread Andrzej Zera
Yes, I agree. But apart from maintaining this state internally (in memory or in memory+disk, as in the case of RocksDB), on every trigger it saves some information about this state in a checkpoint location. I'm afraid we can't do much about this checkpointing operation. I'll continue looking for

Re: [Structured Streaming] Keeping checkpointing cost under control

2024-01-10 Thread Mich Talebzadeh
Hi, You may have a point on scenario 2. Caching Streaming DataFrames: In Spark Streaming, each batch of data is processed incrementally, and it may not fit the typical caching we discussed. Instead, Spark Streaming has its mechanisms to manage and optimize the processing of streaming data. Case

[Structured Streaming] Avoid one microbatch delay with multiple stateful operations

2024-01-10 Thread Andrzej Zera
I'm struggling with the following issue in Spark >=3.4, related to multiple stateful operations. When spark.sql.streaming.statefulOperator.allowMultiple is enabled, Spark keeps track of two types of watermarks: eventTimeWatermarkForEviction and eventTimeWatermarkForLateEvents. Introducing them

Re: [Structured Streaming] Keeping checkpointing cost under control

2024-01-10 Thread Andrzej Zera
Hey, Yes, that's how I understood it (scenario 1). However, I'm not sure if scenario 2 is possible. I think cache on a streaming DataFrame is supported only in foreachBatch (in which it's actually no longer a streaming DF). Wed, 10 Jan 2024 at 15:01 Mich Talebzadeh wrote: > Hi, > > With

Re: [Structured Streaming] Keeping checkpointing cost under control

2024-01-10 Thread Mich Talebzadeh
Hi, With regard to your point - Caching: Can you please explain what you mean by caching? I know that when you have batch and streaming sources in a streaming query, then you can try to cache batch ones to save on reads. But I'm not sure if it's what you mean, and I don't know how to apply what

Re: [Structured Streaming] Keeping checkpointing cost under control

2024-01-10 Thread Andrzej Zera
Thank you very much for your suggestions. Yes, my main concern is checkpointing costs. I went through your suggestions and here're my comments: - Caching: Can you please explain what you mean by caching? I know that when you have batch and streaming sources in a streaming query, then you can try

unsubscribe

2024-01-10 Thread Daniel Maangi

[apache-spark] documentation on File Metadata _metadata struct

2024-01-10 Thread Jason Horner
All, the only documentation about the File Metadata (hidden _metadata struct) I can seem to find is on the Databricks website https://docs.databricks.com/en/ingestion/file-metadata-column.html#file-metadata-column for reference here is the struct: _metadata: struct (nullable = false) |-- file_path:
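
For reference, a short sketch of how the hidden column behaves on the open-source side (the path is illustrative); it only appears when selected explicitly:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.format("parquet").load("/data/events")  # placeholder path

    df.select("*", "_metadata").printSchema()
    df.select("_metadata.file_path", "_metadata.file_name").show(truncate=False)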

Unsubscribe

2024-01-09 Thread qi bryce
Unsubscribe

Re: Spark Structured Streaming and Flask REST API for Real-Time Data Ingestion and Analytics.

2024-01-09 Thread Mich Talebzadeh
Hi Ashok, Thanks for pointing out the Databricks article Scalable Spark Structured Streaming for REST API Destinations | Databricks Blog I browsed it and it is basically similar to many of us involved

Re: Spark Structured Streaming and Flask REST API for Real-Time Data Ingestion and Analytics.

2024-01-09 Thread ashok34...@yahoo.com.INVALID
Hey Mich, Thanks for this introduction on your forthcoming proposal "Spark Structured Streaming and Flask REST API for Real-Time Data Ingestion and Analytics". I recently came across an article by Databricks with the title Scalable Spark Structured Streaming for REST API Destinations. Their use

Unsubscribe

2024-01-09 Thread mahzad kalantari
Unsubscribe

Unsubscribe

2024-01-09 Thread Kalhara Gurugamage
Unsubscribe

Re: Spark Structured Streaming and Flask REST API for Real-Time Data Ingestion and Analytics.

2024-01-08 Thread Mich Talebzadeh
Please also note that Flask, by default, is a single-threaded web framework. While it is suitable for development and small-scale applications, it may not handle concurrent requests efficiently in a production environment. In production, one can utilise Gunicorn (Green Unicorn) which is a WSGI (

Spark Structured Streaming and Flask REST API for Real-Time Data Ingestion and Analytics.

2024-01-08 Thread Mich Talebzadeh
Thought it might be useful to share my idea with fellow forum members. During the breaks, I worked on the seamless integration of Spark Structured Streaming with Flask REST API for real-time data ingestion and analytics. The use case revolves around a scenario where data is generated through

Re: Pyspark UDF as a data source for streaming

2024-01-08 Thread Mich Talebzadeh
Hi, Have you come back with some ideas for implementing this? Specifically integrating Spark Structured Streaming with REST API? FYI, I did some work on it as it can have potential wider use cases, i.e. the seamless integration of Spark Structured Streaming with Flask REST API for real-time data

[ANNOUNCE] Apache Celeborn(incubating) 0.3.2 available

2024-01-07 Thread Nicholas Jiang
Hi all, Apache Celeborn(Incubating) community is glad to announce the new release of Apache Celeborn(Incubating) 0.3.2. Celeborn is dedicated to improving the efficiency and elasticity of different map-reduce engines and provides an elastic, high-efficient service for intermediate data including

Re: [Structured Streaming] Keeping checkpointing cost under control

2024-01-07 Thread Mich Talebzadeh
OK, I assume that your main concern is checkpointing costs. - Caching: If your queries read the same data multiple times, caching the data might reduce the amount of data that needs to be checkpointed. - Optimize checkpointing frequency, i.e. consider changelog checkpointing with RocksDB.

Re: [Structured Streaming] Keeping checkpointing cost under control

2024-01-07 Thread Andrzej Zera
Usually one or two topics per query. Each query has its own checkpoint directory. Each topic has a few partitions. Performance-wise I don't experience any bottlenecks in terms of checkpointing. It's all about the number of requests (including a high number of LIST requests) and the associated

Re: [Structured Streaming] Keeping checkpointing cost under control

2024-01-06 Thread Mich Talebzadeh
How many topics and checkpoint directories are you dealing with? Does each topic have its own checkpoint on S3? All these checkpoints are sequential writes, so even SSDs would not really help. HTH Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London United Kingdom view

[Structured Streaming] Keeping checkpointing cost under control

2024-01-05 Thread Andrzej Zera
Hey, I'm running a few Structured Streaming jobs (with Spark 3.5.0) that require near-real-time accuracy with trigger intervals on the order of 5-10 seconds. I usually run 3-6 streaming queries as part of the job and each query includes at least one stateful operation (and usually two or more).

Re: Issue with Spark Session Initialization in Kubernetes Deployment

2024-01-05 Thread Mich Talebzadeh
Hi, I personally do not use the Spark operator. Anyhow, the Spark Operator automates the deployment and management of Spark applications within Kubernetes. However, it does not eliminate the need to configure Spark sessions for proper communication with the k8s cluster. So specifying the master
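
An illustrative sketch only; the API server address and container image are placeholders, and with the operator these values are normally injected through the application spec rather than hard-coded in the session builder:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
        .master("k8s://https://kubernetes.default.svc:443")  # in-cluster API server
        .config("spark.kubernetes.container.image", "repo/spark:3.5.0")  # placeholder
        .getOrCreate())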

Issue with Spark Session Initialization in Kubernetes Deployment

2024-01-04 Thread Atul Patil
Hello Team, I am currently working on initializing a Spark session using Spark Structured Streaming within a Kubernetes deployment managed by the Spark Operator. During the initialization process, I encountered an error message indicating the necessity to set a master URL: "Caused by:

Unsubscribe

2024-01-02 Thread Atlas - Samir Souidi
Unsubscribe - To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: Select Columns from Dataframe in Java

2023-12-30 Thread Grisha Weintraub
Hi, Have a look here - https://repost.aws/knowledge-center/spark-driver-logs-emr-cluster. Usually, you have application logs out-of-the-box in the driver stdout. It looks like

Re: Select Columns from Dataframe in Java

2023-12-30 Thread PRASHANT L
Hi Grisha, this is great :) It worked, thanks a lot. I have this requirement: I will be running my spark application on EMR and building custom logging to create logs on S3. Any idea what I should do? Or in general, if I create a custom log (with my application name), where will logs be generated

Re: Select Columns from Dataframe in Java

2023-12-30 Thread Grisha Weintraub
In Java, it expects an array of Columns, so you can simply convert your list to an array: array_df.select(fields.toArray(new Column[0])) On Fri, Dec 29, 2023 at 10:58 PM PRASHANT L wrote: > > Team > I am using Java and want to select columns from Dataframe , columns are > stored in List >

Unsubscribe

2023-12-29 Thread Vinti Maheshwari

Re: Pyspark UDF as a data source for streaming

2023-12-29 Thread Mich Talebzadeh
Hi, Do you have more info on this Jira besides the github link as I don't seem to find it! Thanks Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London United Kingdom view my Linkedin profile

Re: the life cycle shuffle Dependency

2023-12-29 Thread murat migdisoglu
Hello, why would you like to delete the shuffle data yourself in the first place? On Thu, Dec 28, 2023, 10:08 yang chen wrote: > > hi, I'm learning spark, and wonder when to delete shuffle data, I find the > ContextCleaner class which cleans the shuffle data when the shuffle dependency > is GC-ed.

Select Columns from Dataframe in Java

2023-12-29 Thread PRASHANT L
Team, I am using Java and want to select columns from a Dataframe; the columns are stored in a List. The equivalent of the below Scala code: array_df = array_df.select(fields: _*) When I try array_df = array_df.select(fields), I get an error saying Cast to Column. I am using Spark 3.4

Re: Pyspark UDF as a data source for streaming

2023-12-28 Thread Mich Talebzadeh
Hi Stanislav, On the Pyspark DF can you run the following: df.printSchema() and send the output please. HTH Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London United Kingdom view my Linkedin profile

RE: Pyspark UDF as a data source for streaming

2023-12-28 Thread Поротиков Станислав Вячеславович
Ok. Thank you very much! Best regards, Stanislav Porotikov From: Mich Talebzadeh Sent: Thursday, December 28, 2023 5:14 PM To: Hyukjin Kwon Cc: Поротиков Станислав Вячеславович ; user@spark.apache.org Subject: Re: Pyspark UDF as a data source for streaming You can work around this issue by

Re: Pyspark UDF as a data source for streaming

2023-12-28 Thread Mich Talebzadeh
You can work around this issue by trying to write your DF to a flat file and use Kafka to pick it up from the flat file and stream it in. Bear in mind that Kafka will require a unique identifier as a K/V pair. Check this link on how to generate a UUID for this purpose
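
A hedged sketch of the Kafka leg of that workaround, publishing a batch DataFrame with a generated UUID key; the broker and topic are placeholders, and the spark-sql-kafka package is assumed to be on the classpath:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(100).withColumn("payload", F.col("id").cast("string"))

    (df.selectExpr("uuid() AS key", "to_json(struct(*)) AS value")
        .write
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")  # placeholder
        .option("topic", "ingest")                         # placeholder
        .save())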

Re: Pyspark UDF as a data source for streaming

2023-12-28 Thread Hyukjin Kwon
Just FYI, a streaming Python data source is in progress https://github.com/apache/spark/pull/44416; we will likely release this in Spark 4.0. On Thu, Dec 28, 2023 at 4:53 PM Поротиков Станислав Вячеславович wrote: > Yes, it's actual data. > > > > Best regards, > > Stanislav Porotikov > > > > *From:*

RE: Pyspark UDF as a data source for streaming

2023-12-27 Thread Поротиков Станислав Вячеславович
Yes, it's actual data. Best regards, Stanislav Porotikov From: Mich Talebzadeh Sent: Wednesday, December 27, 2023 9:43 PM Cc: user@spark.apache.org Subject: Re: Pyspark UDF as a data source for streaming Is this generated data actual data or you are testing the application? Sounds like a form

Fwd: the life cycle shuffle Dependency

2023-12-27 Thread yang chen
hi, I'm learning spark, and wonder when to delete shuffle data. I find the ContextCleaner class which cleans the shuffle data when the shuffle dependency is GC-ed. Based on the source code, the shuffle dependency is GC-ed only when the active job finishes, but I'm not sure. Could you explain the life cycle of

RE: Pyspark UDF as a data source for streaming

2023-12-27 Thread Поротиков Станислав Вячеславович
Actually it's json with specific structure from API server. But the task is to check constantly if new data appears on API server and load it to Kafka. Full pipeline can be presented like that: REST API -> Kafka -> some processing -> Kafka/Mongo -> … Best regards, Stanislav Porotikov From: Mich

Re: Pyspark UDF as a data source for streaming

2023-12-27 Thread Mich Talebzadeh
Ok so you want to generate some random data and load it into Kafka on a regular interval and the rest? HTH Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London United Kingdom view my Linkedin profile

Pyspark UDF as a data source for streaming

2023-12-27 Thread Поротиков Станислав Вячеславович
Hello! Is it possible to write a pyspark UDF that generates data for a streaming dataframe? I want to get some data from REST API requests in real time and am considering saving this data to a dataframe, and then putting it to Kafka. I can't figure out how to create a streaming dataframe from generated data. I am new in

Re: Validate spark sql

2023-12-26 Thread Gourav Sengupta
Dear friend, thanks a ton; I was looking for linting for SQL for a long time. It looks like https://sqlfluff.com/ is something that can be used :) Thank you so much, and wish you all a wonderful new year. Regards, Gourav On Tue, Dec 26, 2023 at 4:42 AM Bjørn Jørgensen wrote: > You can try sqlfluff

Re: Validate spark sql

2023-12-26 Thread Mich Talebzadeh
Worth trying EXPLAIN statement as suggested by @tianlangstudio HTH Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London United Kingdom view my Linkedin profile

Re: Validate spark sql

2023-12-25 Thread Bjørn Jørgensen
You can try sqlfluff; it's a linter for SQL code and it seems to have support for sparksql. Mon, 25 Dec 2023 at 17:13 ram manickam wrote: > Thanks Mich, Nicholas. I tried looking over the stack overflow post and > none of them >

Re: Validate spark sql

2023-12-25 Thread Bjørn Jørgensen
Mailing lists: For broad or opinion-based questions, requests for external resources, debugging issues, bugs, contributing to the project, and scenarios, it is recommended you use the user@spark.apache.org mailing list. - user@spark.apache.org is for

Re: Validate spark sql

2023-12-25 Thread tianlangstudio
What about EXPLAIN? https://spark.apache.org/docs/3.5.0/sql-ref-syntax-qry-explain.html#content Fusion Zhu
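
A small sketch of this approach: spark.sql parses and analyzes eagerly, so both syntax errors and unresolved references surface without running the query. The bad statement below is deliberate:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").getOrCreate()

    # valid statement: prints the parsed/analyzed/optimized plans
    spark.sql("EXPLAIN EXTENDED SELECT 1 AS x").show(truncate=False)

    try:
        spark.sql("SELEC 1")  # typo: the parser raises before any execution
    except Exception as err:
        print("invalid SQL:", err)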

Re: Validate spark sql

2023-12-24 Thread ram manickam
Thanks Mich, Nicholas. I tried looking over the stack overflow post and none of them seems to cover syntax validation. Do you know if it's even possible to do syntax validation in spark? Thanks Ram On Sun, Dec 24, 2023 at 12:49 PM Mich Talebzadeh wrote: > Well, not to put too fine a point on

Re: Validate spark sql

2023-12-24 Thread Mich Talebzadeh
Well, not to put too fine a point on it: in a public forum, one ought to respect the importance of open communication. Everyone has the right to ask questions, seek information, and engage in discussions without facing unnecessary patronization. Mich Talebzadeh, Dad | Technologist | Solutions

Re: Validate spark sql

2023-12-24 Thread Nicholas Chammas
This is a user-list question, not a dev-list question. Moving this conversation to the user list and BCC-ing the dev list. Also, this statement > We are not validating against table or column existence. is not correct. When you call spark.sql(…), Spark will look up the table references and

Unsubscribe

2023-12-21 Thread yxj1141
Unsubscribe

India Scala & Big Data Job Referral

2023-12-21 Thread sri hari kali charan Tummala
Hi Community, I was laid off from Apple in February 2023, which led to my relocation from the USA due to immigration issues related to my H1B visa. I have over 12 years of experience as a consultant in Big Data, Spark, Scala, Python, and Flink. Despite my move to India, I haven't secured a

About shuffle partition size

2023-12-20 Thread Nebi Aydin
Hi all, What happens when the number of unique join keys is less than the number of shuffle partitions? Are we going to end up with lots of empty partitions? If yes, is there any point in having more shuffle partitions than unique join keys?
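
A quick local sketch of the effect; AQE is disabled so the empty partitions stay visible, and the sizes and key count are arbitrary:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.master("local[*]").getOrCreate()
    spark.conf.set("spark.sql.shuffle.partitions", "200")
    spark.conf.set("spark.sql.adaptive.enabled", "false")  # AQE would coalesce them

    left = spark.range(100).withColumn("k", F.col("id") % 3)
    right = spark.range(100).withColumn("k", F.col("id") % 3)
    joined = left.join(right, "k")

    # count the partitions that actually hold rows: at most 3 of the 200
    non_empty = joined.rdd.mapPartitions(
        lambda it: iter([1]) if next(it, None) is not None else iter([])).count()
    print("non-empty partitions:", non_empty)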

[ANNOUNCE] Apache Spark 3.3.4 released

2023-12-16 Thread Dongjoon Hyun
We are happy to announce the availability of Apache Spark 3.3.4! Spark 3.3.4 is the last maintenance release based on the branch-3.3 maintenance branch of Spark. It contains many fixes including security and correctness domains. We strongly recommend all 3.3 users to upgrade to this or higher

Unsubscribe

2023-12-16 Thread Andrew Milkowski

Re: Does Spark support role-based authentication and access to Amazon S3? (Kubernetes cluster deployment)

2023-12-15 Thread Mich Talebzadeh
Apologies Koert! Mich Talebzadeh, Dad | Technologist | Solutions Architect | Engineer London United Kingdom view my Linkedin profile https://en.everybodywiki.com/Mich_Talebzadeh Disclaimer: Use it at your own risk. Any and

Re: Does Spark support role-based authentication and access to Amazon S3? (Kubernetes cluster deployment)

2023-12-15 Thread Mich Talebzadeh
Hi kurt, I read this document of yours. Indeed interesting and pretty recent (9th Dec). I am more focused on GCP and GKE. But obviously the concepts are the same. One thing I noticed, there was a lack of mention of Workload Identity federation

Re: Architecture of Spark Connect

2023-12-14 Thread Hyukjin Kwon
By default for now, yes. One Spark Connect server handles multiple Spark Sessions. To multiplex or run multiple Drivers, you need some work such as a gateway. On Thu, 14 Dec 2023 at 12:03, Kezhi Xiong wrote: > Hi, > > My understanding is there is only one driver/spark context for all user >

Re: Architecture of Spark Connect

2023-12-14 Thread Kezhi Xiong
Hi, My understanding is there is only one driver/spark context for all user sessions. When you run the bin/start-connect-server script, you are submitting one long standing spark job / application. Every time a new user request comes in, a new user session is created under that. Please correct me

Re: Architecture of Spark Connect

2023-12-14 Thread Nikhil Goyal
If multiple applications are running, would we need multiple Spark Connect servers? If so, is the user responsible for creating these servers, or are they just created on the fly when the user requests a new spark session? On Thu, Dec 14, 2023 at 10:28 AM Nikhil Goyal wrote: > Hi folks, > I am

Architecture of Spark Connect

2023-12-14 Thread Nikhil Goyal
Hi folks, I am trying to understand one question. Does Spark Connect create a new driver in the backend for every user or there are a fixed number of drivers running to which requests are sent to? Thanks Nikhil

Re: Does Spark support role-based authentication and access to Amazon S3? (Kubernetes cluster deployment)

2023-12-13 Thread Koert Kuipers
Yes, it does, using IAM roles for service accounts. See: https://docs.aws.amazon.com/eks/latest/userguide/iam-roles-for-service-accounts.html I wrote a little bit about this also here: https://technotes.tresata.com/spark-on-k8s/ On Wed, Dec 13, 2023 at 7:52 AM Atul Patil wrote: > Hello Team, > >
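
For completeness, a hedged sketch of the Spark-side S3A setting that usually accompanies IRSA; the service-account annotation and IAM trust policy live in Kubernetes/AWS, and the bucket is a placeholder:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
        .config("spark.hadoop.fs.s3a.aws.credentials.provider",
                "com.amazonaws.auth.WebIdentityTokenCredentialsProvider")
        .getOrCreate())

    df = spark.read.parquet("s3a://my-bucket/input/")  # placeholder bucket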

Unsubscribe

2023-12-13 Thread kritika jain

Does Spark support role-based authentication and access to Amazon S3? (Kubernetes cluster deployment)

2023-12-13 Thread Atul Patil
Hello Team, Does Spark support role-based authentication and access to Amazon S3 for Kubernetes deployment? Note: we have deployed our spark application in the Kubernetes cluster. Below is the Hadoop-AWS dependency we are using: org.apache.hadoop:hadoop-aws:3.3.4 We are
