Re: [PySpark Structured Streaming] How to tune .repartition(N) ?

2023-10-04 Thread Raghavendra Ganesh
Hi, What is the purpose for which you want to use repartition()? To reduce the number of files in delta? Also note that there is the alternative option of using coalesce() instead of repartition(). -- Raghavendra On Thu, Oct 5, 2023 at 10:15 AM Shao Yang Hong wrote: > Hi all on user@spark: >
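The repartition-vs-coalesce distinction raised above can be illustrated with a toy pure-Python model (this is not Spark API code; `toy_coalesce` and `toy_repartition` are invented names): coalesce moves whole existing partitions together without a row-level shuffle, so it can only reduce the partition count, while repartition redistributes every row in a full shuffle and can both grow and shrink the count.

```python
# Toy model (not Spark code) contrasting the two operations.
# coalesce: merges whole existing partitions -- no row-level shuffle,
#           so it can only reduce the partition count and keeps skew.
# repartition: redistributes every individual row (a full shuffle), so it
#           can change the count in either direction and rebalances skew.

def toy_coalesce(partitions, n):
    """Merge existing partitions down to n buckets, keeping rows where they are."""
    buckets = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        buckets[i % n].extend(part)   # whole partitions move, never single rows
    return buckets

def toy_repartition(partitions, n):
    """Hash every row across n buckets, like a full shuffle."""
    buckets = [[] for _ in range(n)]
    for part in partitions:
        for row in part:
            buckets[hash(row) % n].append(row)
    return buckets

skewed = [[1, 2, 3, 4, 5, 6], [7], [8], [9]]           # one big, three tiny partitions
print([len(p) for p in toy_coalesce(skewed, 2)])       # [7, 2]: skew survives the merge
print([len(p) for p in toy_repartition(skewed, 2)])    # rows spread row-by-row
```

This is why coalesce is usually cheaper for "fewer output files" while repartition is the tool when the partitions themselves are skewed.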

[PySpark Structured Streaming] How to tune .repartition(N) ?

2023-10-04 Thread Shao Yang Hong
Hi all on user@spark: We are looking for advice and suggestions on how to tune the .repartition() parameter. We are using Spark Streaming on our data pipeline to consume messages and persist them to a Delta Lake (https://delta.io/learn/getting-started/). We read messages from a Kafka topic,

[Spark Core]: Recomputation cost of a job due to executor failures

2023-10-04 Thread Faiz Halde
Hello, Due to the way Spark implements shuffle, a loss of an executor sometimes results in the recomputation of partitions that were lost The definition of a *partition* is the tuple ( RDD-ids, partition id ) RDD-ids is a sequence of RDD ids In our system, we define the unit of work performed

Updating delta file column data

2023-10-02 Thread Karthick Nk
Hi community members, In Databricks ADLS2 delta tables, I need to perform the below operation; could you help me with your thoughts? I have delta tables with one column of data type string, which contains JSON data stored as a string. I need to do the following: 1. I have to update one

Re: Seeking Guidance on Spark on Kubernetes Secrets Configuration

2023-10-01 Thread Jon Rodríguez Aranguren
Dear Jörn Franke, Jayabindu Singh and Spark Community members, Thank you profoundly for your initial insights. I feel it's necessary to provide more precision on our setup to facilitate a deeper understanding. We're interfacing with S3 Compatible storages, but our operational context is somewhat

Re: Seeking Guidance on Spark on Kubernetes Secrets Configuration

2023-10-01 Thread Jörn Franke
There is nowadays more a trend to move away from static credentials/certificates that are stored in a secret vault. The issue is that the rotation of them is complex, once they are leaked they can be abused, making minimal permissions feasible is cumbersome etc. That is why keyless approaches are

Re: Seeking Guidance on Spark on Kubernetes Secrets Configuration

2023-10-01 Thread Jörn Franke
With OIDC something comparable is possible: https://docs.aws.amazon.com/eks/latest/userguide/enable-iam-roles-for-service-accounts.html On 01.10.2023 at 11:13, Mich Talebzadeh wrote: It seems that workload identity is not available on AWS. Workload Identity replaces the need to use Metadata concealment

Re: Seeking Guidance on Spark on Kubernetes Secrets Configuration

2023-10-01 Thread Mich Talebzadeh
It seems that workload identity is not available on AWS. Workload Identity replaces the need to use Metadata concealment on exposed storage such as s3 and gcs. The sensitive metadata protected by metadata concealment is also

Re: Seeking Guidance on Spark on Kubernetes Secrets Configuration

2023-09-30 Thread Jayabindu Singh
Hi Jon, Using IAM as suggested by Jorn is the best approach. We recently moved our spark workload from HDP to Spark on K8 and utilizing IAM. It will save you from secret management headaches and also allows a lot more flexibility on access control and option to allow access to multiple S3 buckets

Re: Seeking Guidance on Spark on Kubernetes Secrets Configuration

2023-09-30 Thread Jörn Franke
Don’t use static iam (s3) credentials. It is an outdated insecure method - even AWS recommend against using this for anything (cf eg https://docs.aws.amazon.com/cli/latest/userguide/cli-authentication-user.html). It is almost a guarantee to get your data stolen and your account manipulated. If

using facebook Prophet + pyspark for forecasting - Dataframe has less than 2 non-NaN rows

2023-09-29 Thread karan alang
Hello - Anyone used Prophet + pyspark for forecasting ? I'm trying to backfill forecasts, and running into issues (error - Dataframe has less than 2 non-NaN rows) I'm removing all records with NaN values, yet getting this error. details are in stackoverflow link ->

Seeking Guidance on Spark on Kubernetes Secrets Configuration

2023-09-29 Thread Jon Rodríguez Aranguren
Dear Spark Community Members, I trust this message finds you all in good health and spirits. I'm reaching out to the collective expertise of this esteemed community with a query regarding Spark on Kubernetes. As a newcomer, I have always admired the depth and breadth of knowledge shared within

Re: Inquiry about Processing Speed

2023-09-28 Thread Jack Goodson
Hi Haseeb, I think the user mailing list is what you're looking for, people are usually pretty active on here if you present a direct question about apache spark. I've linked below the community guidelines which says which mailing lists are for what etc https://spark.apache.org/community.html

Thread dump only shows 10 shuffle clients

2023-09-28 Thread Nebi Aydin
Hi all, I set the spark.shuffle.io.serverThreads and spark.shuffle.io.clientThreads to *800* But when I click Thread dump from the Spark UI for the executor: I only see 10 shuffle client threads for the executor. Is that normal, am I missing something?

Re: Inquiry about Processing Speed

2023-09-27 Thread Deepak Goel
Hi "Processing Speed" can be at a software level (Code Optimization) and at a hardware level (Capacity Planning) Deepak "The greatness of a nation can be judged by the way its animals are treated - Mahatma Gandhi" +91 73500 12833 deic...@gmail.com Facebook: https://www.facebook.com/deicool

Files io threads vs shuffle io threads

2023-09-27 Thread Nebi Aydin
Hi all, Can someone explain the difference between Files io threads and shuffle io threads, as I couldn't find any explanation. I'm specifically asking about these: spark.rpc.io.serverThreads spark.rpc.io.clientThreads spark.rpc.io.threads spark.files.io.serverThreads spark.files.io.clientThreads

Inquiry about Processing Speed

2023-09-27 Thread Haseeb Khalid
Dear Support Team, I hope this message finds you well. My name is Haseeb Khalid, and I am reaching out to discuss a scenario related to processing speed in Apache Spark. I have been utilizing these technologies in our projects, and we have encountered a specific use case where we are seeking to

Reading Glue Catalog Views through Spark.

2023-09-25 Thread Agrawal, Sanket
Hello Everyone, We have setup spark and setup Iceberg-Glue connectors as mentioned at https://iceberg.apache.org/docs/latest/aws/ to integrate Spark, Iceberg, and AWS Glue Catalog. We are able to read tables through this but we are unable to read data through views. PFB, the error:

[PySpark][Spark logs] Is it possible to dynamically customize Spark logs?

2023-09-25 Thread Ayman Rekik
Hello, What would be the right way, if any, to inject a runtime variable into Spark logs. So that, for example, if Spark (driver/worker) logs some info/warning/error message, the variable will be output there (in order to help filtering logs for the sake of monitoring and troubleshooting).

[ANNOUNCE] Apache Kyuubi released 1.7.3

2023-09-25 Thread Zhen Wang
Hi all, The Apache Kyuubi community is pleased to announce that Apache Kyuubi 1.7.3 has been released! Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses. Kyuubi provides a pure SQL gateway through Thrift JDBC/ODBC interface for

Spark Connect Multi-tenant Support

2023-09-22 Thread Kezhi Xiong
Hi, From Spark Connect's official site's image, it mentions the "Multi-tenant Application Gateway" on driver. Are there any more documents about it? Can I know how users can utilize such a feature? Thanks, Kezhi

Re: Urgent: Seeking Guidance on Kafka Slow Consumer and Data Skew Problem

2023-09-22 Thread Karthick
Hi All, It would be helpful if anyone could offer pointers on the problem described. Thanks Karthick. On Wed, Sep 20, 2023 at 3:03 PM Gowtham S wrote: > Hi Spark Community, > > Thank you for bringing up this issue. We've also encountered the same > challenge and are actively working on finding a

Re: Parallel write to different partitions

2023-09-21 Thread Shrikant Prasad
Found this issue reported earlier but was bulk closed: https://issues.apache.org/jira/browse/SPARK-27030 Regards, Shrikant On Fri, 22 Sep 2023 at 12:03 AM, Shrikant Prasad wrote: > Hi all, > > We have multiple spark jobs running in parallel trying to write into same > hive table but each job

Parallel write to different partitions

2023-09-21 Thread Shrikant Prasad
Hi all, We have multiple spark jobs running in parallel trying to write into same hive table but each job writing into different partition. This was working fine with Spark 2.3 and Hadoop 2.7. But after upgrading to Spark 3.2 and Hadoop 3.2.2, these parallel jobs are failing with FileNotFound

Re: Need to split incoming data into PM on time column and find the top 5 by volume of data

2023-09-21 Thread Mich Talebzadeh
In general you can probably do all this in spark-sql by reading in Hive table through a DF in Pyspark, then creating a TempView on that DF, select PM data through CAST() function and then use a windowing function to select the top 5 with DENSE_RANK() #Read Hive table as a DataFrame df =
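The approach described above (filter to PM rows, aggregate volume per IP, rank, take the top 5) can be sketched in plain Python to show what the Spark SQL window query computes; the data below is invented for illustration, and in Spark the same logic would be a filter on HOUR(time_in) >= 12 plus a DENSE_RANK() window over SUM(volume):

```python
from collections import defaultdict
from datetime import datetime

# Plain-Python rendering of the logic for a table like
# sample_data(incoming_ip, time_in, volume); rows are made-up sample data.
rows = [
    ("10.0.0.1", datetime(2023, 9, 21, 13, 5), 300),
    ("10.0.0.2", datetime(2023, 9, 21, 9, 30), 999),   # AM row, filtered out
    ("10.0.0.1", datetime(2023, 9, 21, 18, 0), 200),
    ("10.0.0.3", datetime(2023, 9, 21, 15, 45), 150),
    ("10.0.0.4", datetime(2023, 9, 21, 23, 10), 120),
    ("10.0.0.5", datetime(2023, 9, 21, 12, 0), 80),
    ("10.0.0.6", datetime(2023, 9, 21, 14, 20), 60),
]

pm_volume = defaultdict(int)
for ip, ts, vol in rows:
    if ts.hour >= 12:                 # keep PM rows only
        pm_volume[ip] += vol          # total volume per IP

# Rank by total volume descending and keep the top 5.
top5 = sorted(pm_volume.items(), key=lambda kv: kv[1], reverse=True)[:5]
print(top5)
```

DENSE_RANK() in the Spark version additionally keeps all IPs that tie at rank 5, which a plain `[:5]` slice does not.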

Need to split incoming data into PM on time column and find the top 5 by volume of data

2023-09-21 Thread ashok34...@yahoo.com.INVALID
Hello gurus, I have a Hive table created as below (there are more columns) CREATE TABLE hive.sample_data ( incoming_ip STRING, time_in TIMESTAMP, volume INT ); Data is stored in that table. In PySpark, I want to select the top 5 incoming IP addresses with the highest total volume of data

Re: PySpark 3.5.0 on PyPI

2023-09-20 Thread Kezhi Xiong
Oh, I saw it now. Thanks! On Wed, Sep 20, 2023 at 1:04 PM Sean Owen wrote: > [ External sender. Exercise caution. ] > > I think the announcement mentioned there were some issues with pypi and > the upload size this time. I am sure it's intended to be there when > possible. > > On Wed, Sep 20,

Re: PySpark 3.5.0 on PyPI

2023-09-20 Thread Sean Owen
I think the announcement mentioned there were some issues with pypi and the upload size this time. I am sure it's intended to be there when possible. On Wed, Sep 20, 2023, 3:00 PM Kezhi Xiong wrote: > Hi, > > Are there any plans to upload PySpark 3.5.0 to PyPI ( >

PySpark 3.5.0 on PyPI

2023-09-20 Thread Kezhi Xiong
Hi, Are there any plans to upload PySpark 3.5.0 to PyPI ( https://pypi.org/project/pyspark/)? It's still 3.4.1. Thanks, Kezhi

[Spark 3.5.0] Is the protobuf-java JAR no longer shipped with Spark?

2023-09-20 Thread Gijs Hendriksen
Hi all, This week, I tried upgrading to Spark 3.5.0, as it contained some fixes for spark-protobuf that I need for my project. However, my code is no longer running under Spark 3.5.0. My build.sbt file is configured as follows: val sparkV  = "3.5.0" val hadoopV = "3.3.6"

Re: Discrepancy sample standard deviation pyspark and Excel

2023-09-20 Thread Sean Owen
This has turned into a big thread for a simple thing and has been answered 3 times over now. Neither is better, they just calculate different things. That the 'default' is sample stddev is just convention. stddev_pop is the simple standard deviation of a set of numbers stddev_samp is used when

Re: Urgent: Seeking Guidance on Kafka Slow Consumer and Data Skew Problem

2023-09-20 Thread Gowtham S
Hi Spark Community, Thank you for bringing up this issue. We've also encountered the same challenge and are actively working on finding a solution. It's reassuring to know that we're not alone in this. If you have any insights or suggestions regarding how to address this problem, please feel

Re: Discrepancy sample standard deviation pyspark and Excel

2023-09-20 Thread Mich Talebzadeh
Spark uses the sample standard deviation stddev_samp by default, whereas *Hive* uses population standard deviation stddev_pop as default. My understanding is that spark uses sample standard deviation by default because - It is more commonly used. - It is more efficient to calculate. -

unsubscribe

2023-09-19 Thread Danilo Sousa
unsubscribe - To unsubscribe e-mail: user-unsubscr...@spark.apache.org

unsubscribe

2023-09-19 Thread Ghousia
unsubscribe

Re: Discrepancy sample standard deviation pyspark and Excel

2023-09-19 Thread Mich Talebzadeh
Hi Helen, Assuming you want to calculate stddev_samp, Spark correctly points STDDEV to STDDEV_SAMP. In below replace sales with your table name and AMOUNT_SOLD with the column you want to do the calculation SELECT

Re: Discrepancy sample standard deviation pyspark and Excel

2023-09-19 Thread Bjørn Jørgensen
from pyspark.sql import SparkSession from pyspark.sql.functions import stddev_samp, stddev_pop spark = SparkSession.builder.getOrCreate() data = [(52.7,), (45.3,), (60.2,), (53.8,), (49.1,), (44.6,), (58.0,), (56.5,), (47.9,), (50.3,)] df = spark.createDataFrame(data, ["value"])

Create an external table with DataFrameWriterV2

2023-09-19 Thread Christophe Préaud
Hi, I usually create an external Delta table with the command below, using DataFrameWriter API: df.write    .format("delta")    .option("path", "")    .saveAsTable("") Now I would like to use the DataFrameWriterV2 API. I have tried the following command: df.writeTo("")    .using("delta")    

Re: Discrepancy sample standard deviation pyspark and Excel

2023-09-19 Thread Sean Owen
Pyspark follows SQL databases here. stddev is stddev_samp, and sample standard deviation is the calculation with the Bessel correction, n-1 in the denominator. stddev_pop is simply standard deviation, with n in the denominator. On Tue, Sep 19, 2023 at 7:13 AM Helene Bøe wrote: > Hi! > > > > I
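Sean's point about the Bessel correction can be checked directly: Python's `statistics` module computes the same two quantities as Spark's stddev_samp and stddev_pop (and Excel's STDEV.S and STDEV.P). The sample values below reuse the data from Bjørn's example elsewhere in the thread.

```python
import math
import statistics

data = [52.7, 45.3, 60.2, 53.8, 49.1, 44.6, 58.0, 56.5, 47.9, 50.3]
n = len(data)
mean = sum(data) / n

# Sample standard deviation: Bessel correction, n-1 in the denominator.
# This is what Spark's stddev / stddev_samp and Excel's STDEV.S compute.
samp = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))

# Population standard deviation: n in the denominator.
# This is Spark's stddev_pop and Excel's STDEV.P.
pop = math.sqrt(sum((x - mean) ** 2 for x in data) / n)

# The stdlib implements exactly these two formulas.
assert abs(samp - statistics.stdev(data)) < 1e-12
assert abs(pop - statistics.pstdev(data)) < 1e-12
print(round(samp, 4), round(pop, 4))
```

Since n-1 < n, the sample value is always slightly larger than the population value on the same data, which is the whole discrepancy the thread is about.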

Spark streaming sourceArchiveDir does not move file to archive directory

2023-09-19 Thread Yunus Emre Gürses
Hello everyone, I'm using scala and spark with the version 3.4.1 in Windows 10. While streaming using Spark, I give the `cleanSource` option as "archive" and the `sourceArchiveDir` option as "archived" as in the code below. ``` spark.readStream .option("cleanSource", "archive")
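For reference, a fuller sketch of the option combination being described (a config fragment only: `spark` and `input_schema` are assumed to exist, and the paths are placeholders). As I read the Structured Streaming file-source docs, `cleanSource` accepts "archive", "delete", or "off"; with "archive", `sourceArchiveDir` is required and should point outside the source directory so archived files are not picked up again, and the cleanup runs in a separate thread, so the move may lag behind the batch that consumed the file.

```python
# Config-fragment sketch, not a runnable program: `spark` is an existing
# SparkSession and `input_schema` a schema you supply.
stream = (
    spark.readStream
    .option("cleanSource", "archive")          # move consumed files instead of deleting
    .option("sourceArchiveDir", "archived")    # must be outside the source dir
    .schema(input_schema)                      # placeholder schema
    .json("data/incoming")                     # placeholder source directory
)
```

On Windows specifically it is worth checking whether the relative "archived" path resolves where expected; an absolute path removes that ambiguity.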

Discrepancy sample standard deviation pyspark and Excel

2023-09-19 Thread Helene Bøe
Hi! I am applying the stddev function (so actually stddev_samp); however, when comparing with the sample standard deviation in Excel the results do not match. I cannot find in your documentation any more specifics on how the sample standard deviation is calculated, so I cannot compare the

Re: Spark stand-alone mode

2023-09-19 Thread Patrick Tucci
Multiple applications can run at once, but you need to either configure Spark or your applications to allow that. In stand-alone mode, each application attempts to take all resources available by default. This section of the documentation has more details:

Urgent: Seeking Guidance on Kafka Slow Consumer and Data Skew Problem

2023-09-19 Thread Karthick
Subject: Seeking Guidance on Kafka Slow Consumer and Data Skew Problem Dear Spark Community, I recently reached out to the Apache Flink community for assistance with a critical issue we are facing in our IoT platform, which relies on Apache Kafka and real-time data processing. We received some

unsubscribe

2023-09-18 Thread Ghazi Naceur
unsubscribe

Re: Data Duplication Bug Found - Structured Streaming Versions 3.4.1, 3.2.4, and 3.3.2

2023-09-18 Thread Jerry Peng
Hi Craig, Thank you for sending us more information. Can you answer my previous question which I don't think the document addresses. How did you determine duplicates in the output? How was the output data read? The FileStreamSink provides exactly-once writes ONLY if you read the output with the

Re: getting emails in different order!

2023-09-18 Thread Mich Talebzadeh
OK thanks Sean. Not a big issue for me. It normally happens in the AM, GMT/London time. I see the email trail but not the thread owner's email first; normally the responses come first. Mich Talebzadeh, Distinguished Technologist, Solutions Architect & Engineer London United Kingdom view my Linkedin profile

Re: getting emails in different order!

2023-09-18 Thread Sean Owen
I have seen this, and not sure if it's just the ASF mailer being weird, or more likely, because emails are moderated and we inadvertently moderate them out of order On Mon, Sep 18, 2023 at 10:59 AM Mich Talebzadeh wrote: > Hi, > > I use gmail to receive spark user group emails. > > On

Re: Spark stand-alone mode

2023-09-18 Thread Ilango
Thanks all for your suggestions. Noted with thanks. Just wanted to share a few more details about the environment: 1. We use NFS for data storage and data is in parquet format 2. All HPC nodes are connected and already work as a cluster for Studio workbench. I can set up passwordless SSH if it does not exist

getting emails in different order!

2023-09-18 Thread Mich Talebzadeh
Hi, I use gmail to receive spark user group emails. On occasions, I get the latest emails first and later in the day I receive the original email. Has anyone else seen this behaviour recently? Thanks Mich Talebzadeh, Distinguished Technologist, Solutions Architect & Engineer London United

[ANNOUNCE] Apache Kyuubi released 1.7.2

2023-09-18 Thread Zhen Wang
Hi all, The Apache Kyuubi community is pleased to announce that Apache Kyuubi 1.7.2 has been released! Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses. Kyuubi provides a pure SQL gateway through Thrift JDBC/ODBC interface for

Re: First Time contribution.

2023-09-17 Thread Haejoon Lee
Welcome Ram! :-) I would recommend you to check https://issues.apache.org/jira/browse/SPARK-37935 out as a starter task. Refer to https://github.com/apache/spark/pull/41504, https://github.com/apache/spark/pull/41455 as an example PR. Or you can also add a new sub-task if you find any error

Re: First Time contribution.

2023-09-17 Thread Denny Lee
Hi Ram, We have some good guidance at https://spark.apache.org/contributing.html HTH! Denny On Sun, Sep 17, 2023 at 17:18 ram manickam wrote: > > > > Hello All, > Recently, joined this community and would like to contribute. Is there a > guideline or recommendation on tasks that can be

About Peak Jvm Memory Onheap

2023-09-17 Thread Nebi Aydin
Hi all, I couldn't find any useful doc that explains the `Peak JVM Memory Onheap` field on the Spark UI. Most of the time my applications have very low On heap storage memory and Peak execution memory on heap, but a very big `Peak JVM Memory Onheap` on the Spark UI. Can someone please explain the

Fwd: First Time contribution.

2023-09-17 Thread ram manickam
Hello All, I recently joined this community and would like to contribute. Is there a guideline or recommendation on tasks that can be picked up by a first-timer, or a starter task? Tried looking at the Stack Overflow tag apache-spark, couldn't

Re: Filter out 20% of rows

2023-09-16 Thread ashok34...@yahoo.com.INVALID
Thank you Bjorn and Mich.  Appreciated Best On Saturday, 16 September 2023 at 16:50:04 BST, Mich Talebzadeh wrote: Hi Bjorn, I thought that one is better off using percentile_approx as it seems to be the recommended approach for computing percentiles and can simplify the code. I have

Re: Filter out 20% of rows

2023-09-16 Thread Bjørn Jørgensen
EDIT: I don't think that the question asker will have only returned the top 25 percentages. On Sat 16 Sep 2023 at 21:54, Bjørn Jørgensen wrote: > percentile_approx returns the approximate percentile(s) > The memory consumption is > bounded. The

Re: Filter out 20% of rows

2023-09-16 Thread Bjørn Jørgensen
percentile_approx returns the approximate percentile(s) The memory consumption is bounded. The larger accuracy parameter we choose, the smaller error we get. The default accuracy value is 1, to match with Hive default setting. Choose a smaller value

Re: Filter out 20% of rows

2023-09-16 Thread Mich Talebzadeh
Hi Bjorn, I thought that one is better off using percentile_approx as it seems to be the recommended approach for computing percentiles and can simplify the code. I have modified your code to use percentile_approx rather than manually computing it. It would be interesting to hear ideas on this.
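The percentile_approx approach discussed above boils down to two steps: compute a percentile cutoff, then filter rows against it. percentile_approx does the cutoff step inside Spark with bounded memory; the same logic is shown here with exact arithmetic on a plain list (the data is synthetic and the `percentile` helper is an illustrative nearest-rank implementation, not Spark's algorithm).

```python
import random

random.seed(42)
values = [random.gauss(50, 10) for _ in range(1000)]

def percentile(xs, p):
    """Exact p-th percentile by nearest rank on the sorted data (0 < p <= 100)."""
    s = sorted(xs)
    k = max(0, min(len(s) - 1, int(round(p / 100 * len(s))) - 1))
    return s[k]

# "Filter out 20% of rows": keep everything at or below the 80th percentile.
cutoff = percentile(values, 80)
kept = [v for v in values if v <= cutoff]
print(cutoff, len(kept) / len(values))
```

The trade-off Bjørn describes is exactly here: the exact version must sort (or otherwise hold) all values, while percentile_approx trades a controllable error bound for bounded memory.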

Re: Filter out 20% of rows

2023-09-16 Thread Mich Talebzadeh
Happy Saturday coding  Mich Talebzadeh, Distinguished Technologist, Solutions Architect & Engineer London United Kingdom view my Linkedin profile https://en.everybodywiki.com/Mich_Talebzadeh *Disclaimer:* Use it at your own

Re: Filter out 20% of rows

2023-09-16 Thread Bjørn Jørgensen
Ah, yes, that's right. I did spend some time on this one and was having some issues with the code. I restarted the notebook kernel now and reran it, and I get the same result. On Sat 16 Sep 2023 at 11:41, Mich Talebzadeh < mich.talebza...@gmail.com> wrote: > Splendid code. A minor error

Re: Filter out 20% of rows

2023-09-16 Thread Mich Talebzadeh
Splendid code. A minor error glancing at your code. print(df.count()) print(result_df.count()) You have not defined result_df. I gather you meant "result"? print(result.count()) That should fix it. HTH Mich Talebzadeh, Distinguished Technologist, Solutions Architect & Engineer London

[Spark Core]: How does rpc threads influence shuffle?

2023-09-15 Thread Nebi Aydin
Hello all, I know that these parameters exist for shuffle tuning: spark.shuffle.io.serverThreads, spark.shuffle.io.clientThreads, spark.shuffle.io.threads. But we also have spark.rpc.io.serverThreads, spark.rpc.io.clientThreads, spark.rpc.io.threads. So specifically talking about Shuffling,
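For context on the question above: each network module in Spark (rpc, shuffle, files) gets its own transport, configured through the corresponding spark.<module>.io.* properties, so the two families tune different traffic. A config-fragment sketch (the thread counts are placeholders, not recommendations):

```python
# Config fragment only: the two setting families set explicitly on a SparkConf.
from pyspark import SparkConf

conf = (
    SparkConf()
    # Shuffle transport: serves and fetches shuffle blocks between executors.
    .set("spark.shuffle.io.serverThreads", "64")
    .set("spark.shuffle.io.clientThreads", "64")
    # RPC transport: control-plane messages (task launch, heartbeats, etc.).
    .set("spark.rpc.io.serverThreads", "16")
    .set("spark.rpc.io.clientThreads", "16")
)
```

So raising the rpc.io values would not, by itself, change shuffle block-transfer parallelism.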

Re: Spark stand-alone mode

2023-09-15 Thread Bjørn Jørgensen
You need to set up SSH without a password; use a key instead. See "How to connect without password using SSH (passwordless)". On Fri 15 Sep 2023 at 20:55, Mich Talebzadeh <

Re: Filter out 20% of rows

2023-09-15 Thread Bjørn Jørgensen
Something like this? # Standard library imports import json import multiprocessing import os import re import sys import random # Third-party imports import numpy as np import pandas as pd import pyarrow # Pyspark imports from pyspark import SparkConf, SparkContext from pyspark.sql import

Re: Spark stand-alone mode

2023-09-15 Thread Mich Talebzadeh
Hi, Can these 4 nodes talk to each other through ssh as trusted hosts (on top of the network that Sean already mentioned)? Otherwise you need to set it up. You can install a LAN if you have another free port at the back of your HPC nodes. You ought to try to set up a Hadoop cluster

Re: Spark stand-alone mode

2023-09-15 Thread Sean Owen
Yes, should work fine, just set up according to the docs. There needs to be network connectivity between whatever the driver node is and these 4 nodes. On Thu, Sep 14, 2023 at 11:57 PM Ilango wrote: > > Hi all, > > We have 4 HPC nodes and installed spark individually in all nodes. > > Spark is

Re: Spark stand-alone mode

2023-09-15 Thread Patrick Tucci
I use Spark in standalone mode. It works well, and the instructions on the site are accurate for the most part. The only thing that didn't work for me was the start-all.sh script. Instead, I use a simple script that starts the master node, then uses SSH to connect to the worker machines and start

Spark stand-alone mode

2023-09-14 Thread Ilango
Hi all, We have 4 HPC nodes and installed spark individually in all nodes. Spark is used as local mode(each driver/executor will have 8 cores and 65 GB) in Sparklyr/pyspark using Rstudio/Posit workbench. Slurm is used as scheduler. As this is local mode, we are facing performance issue(as only

Re: Data Duplication Bug Found - Structured Streaming Versions 3.4.1, 3.2.4, and 3.3.2

2023-09-14 Thread Craig Alfieri
Hi Russell et al, Acknowledging receipt; we'll get these answers back to the group. Follow-up forthcoming. Craig. On Sep 14, 2023, at 6:38 PM, russell.spit...@gmail.com wrote: Exactly once should be output sink dependent, what sink was being used? Sent from my iPhone. On Sep 14, 2023, at 4:52 PM, Jerry

Re: Data Duplication Bug Found - Structured Streaming Versions 3.4.1, 3.2.4, and 3.3.2

2023-09-14 Thread russell . spitzer
Exactly once should be output sink dependent, what sink was being used? Sent from my iPhone. On Sep 14, 2023, at 4:52 PM, Jerry Peng wrote: Craig, Thanks! Please let us know the result! Best, Jerry. On Thu, Sep 14, 2023 at 12:22 PM Mich Talebzadeh wrote: Hi Craig, Can you please

Re: Data Duplication Bug Found - Structured Streaming Versions 3.4.1, 3.2.4, and 3.3.2

2023-09-14 Thread Jerry Peng
Craig, Thanks! Please let us know the result! Best, Jerry On Thu, Sep 14, 2023 at 12:22 PM Mich Talebzadeh wrote: > > Hi Craig, > > Can you please clarify what this bug is and provide sample code causing > this issue? > > HTH > > Mich Talebzadeh, > Distinguished Technologist, Solutions

Re: Data Duplication Bug Found - Structured Streaming Versions 3.4.1, 3.2.4, and 3.3.2

2023-09-14 Thread Craig Alfieri
Hi Jerry- This is exactly the type of help we're seeking, to confirm the FileStreamSink was not utilized on our test runs. Our team is going to work towards implementing this and re-running our experiments across the versions. If everything comes back with similar results, we will reach back

Re: Write Spark Connection client application in Go

2023-09-14 Thread bo yang
Thanks Holden and Martin for the nice words and feedback :) On Wed, Sep 13, 2023 at 8:22 AM Martin Grund wrote: > This is absolutely awesome! Thank you so much for dedicating your time to > this project! > > > On Wed, Sep 13, 2023 at 6:04 AM Holden Karau wrote: > >> That’s so cool! Great work

unsubscribe

2023-09-13 Thread randy clinton
unsubscribe -- I appreciate your time, ~Randy - To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: Write Spark Connection client application in Go

2023-09-13 Thread Martin Grund
This is absolutely awesome! Thank you so much for dedicating your time to this project! On Wed, Sep 13, 2023 at 6:04 AM Holden Karau wrote: > That’s so cool! Great work y’all :) > > On Tue, Sep 12, 2023 at 8:14 PM bo yang wrote: > >> Hi Spark Friends, >> >> Anyone interested in using Golang

Re: Write Spark Connection client application in Go

2023-09-12 Thread Holden Karau
That’s so cool! Great work y’all :) On Tue, Sep 12, 2023 at 8:14 PM bo yang wrote: > Hi Spark Friends, > > Anyone interested in using Golang to write Spark application? We created a > Spark > Connect Go Client library . > Would love to hear

APACHE Spark adoption/growth chart

2023-09-12 Thread Andrew Petersen
Hello Spark community Can anyone direct me to a simple graph/chart that shows APACHE Spark adoption, preferably one that includes recent years? Of less importance, a similar Databricks plot? An internet search gave me plots only up to 2015. I also searched spark.apache.org and databricks.com,

Write Spark Connection client application in Go

2023-09-12 Thread bo yang
Hi Spark Friends, Anyone interested in using Golang to write Spark application? We created a Spark Connect Go Client library . Would love to hear feedback/thoughts from the community. Please see the quick start guide

Unsubscribe

2023-09-11 Thread Tom Praison
Unsubscribe

Feedback on Testing Guidelines for Data Stream Processing Applications

2023-09-11 Thread Alexandre Strapacao Guedes Vianna
Greetings, I hope this message finds you well. As part of my PhD research, I've developed guidelines tailored to assist professionals in planning testing of Data Stream Processing applications. If you've worked directly with stream processing or have experience with similar systems, like

Re: IDEA compile fail but sbt test succeed

2023-09-09 Thread Pasha Finkelshteyn
Dear AlphaBetaGo, First of all, there are not only guys here, but also women. Second, you didn't give any context that would allow us to understand the connection with Spark. From what I see, it's more likely that it's an issue in Spark/sbt support in IDEA. Feel free to create an issue in the

Re: [External Email] Re: About /mnt/hdfs/current/BP directories

2023-09-08 Thread Nebi Aydin
Usually the job never reaches that point; it fails during shuffle. And storage memory and executor memory when it failed are usually low. On Fri, Sep 8, 2023 at 16:49 Jack Wells wrote: > Assuming you’re not writing to HDFS in your code, Spark can spill to HDFS > if it runs out of memory on a per-executor

Re: [External Email] Re: About /mnt/hdfs/current/BP directories

2023-09-08 Thread Jack Wells
Assuming you’re not writing to HDFS in your code, Spark can spill to HDFS if it runs out of memory on a per-executor basis. This could happen when evaluating a cache operation like you have below or during shuffle operations in joins, etc. You might try to increase executor memory, tune shuffle

Re: [External Email] Re: About /mnt/hdfs/current/BP directories

2023-09-08 Thread Nebi Aydin
Sure df = spark.read.option("basePath", some_path).parquet(*list_of_s3_file_paths()) ( df .where(SOME FILTER) .repartition(6) .cache() ) On Fri, Sep 8, 2023 at 14:56 Jack Wells wrote: > Hi Nebi, can you share the code you’re using to read and write from S3? > > On Sep 8,

Re: About /mnt/hdfs/current/BP directories

2023-09-08 Thread Jack Wells
Hi Nebi, can you share the code you’re using to read and write from S3? On Sep 8, 2023 at 10:59:59, Nebi Aydin wrote: > Hi all, > I am using spark on EMR to process data. Basically i read data from AWS S3 > and do the transformation and post transformation i am loading/writing data > to s3. >

About /mnt/hdfs/current/BP directories

2023-09-08 Thread Nebi Aydin
Hi all, I am using spark on EMR to process data. Basically i read data from AWS S3 and do the transformation and post transformation i am loading/writing data to s3. Recently we have found that hdfs(/mnt/hdfs) utilization is going too high. I disabled `yarn.log-aggregation-enable` by setting it

RE: Spark 3.4.1 and Hive 3.1.3

2023-09-08 Thread Agrawal, Sanket
Hi Yasukazu, I tried replacing the jar; though the Spark code didn't work, the vulnerability was removed. But I agree that even 3.1.3 has other vulnerabilities listed on the Maven page; these are medium-level vulnerabilities. We are currently targeting Critical and High vulnerabilities

Re: Elasticsearch support for Spark 3.x

2023-09-08 Thread Dipayan Dev
@Alfie Davidson: Awesome, it worked with “org.elasticsearch.spark.sql”. But as soon as I switched to elasticsearch-spark-20_2.12, “es” also worked. On Fri, Sep 8, 2023 at 12:45 PM Dipayan Dev wrote: > > Let me try that and get back. Just wondering, if there a change in the > way we pass

Re: Elasticsearch support for Spark 3.x

2023-09-08 Thread Dipayan Dev
Let me try that and get back. Just wondering, if there a change in the way we pass the format in connector from Spark 2 to 3? On Fri, 8 Sep 2023 at 12:35 PM, Alfie Davidson wrote: > I am pretty certain you need to change the write.format from “es” to > “org.elasticsearch.spark.sql” > > Sent

RE: Spark 3.4.1 and Hive 3.1.3

2023-09-07 Thread Agrawal, Sanket
Hi, I tried replacing just this JAR but getting errors. From: Nagatomi Yasukazu Sent: Friday, September 8, 2023 9:35 AM To: Agrawal, Sanket Cc: Chao Sun ; Yeachan Park ; user@spark.apache.org Subject: [EXT] Re: Spark 3.4.1 and Hive 3.1.3 Hi Sanket, While migrating to Hive 3.1.3 may resolve

Re: Spark 3.4.1 and Hive 3.1.3

2023-09-07 Thread Nagatomi Yasukazu
Hi Sanket, While migrating to Hive 3.1.3 may resolve many issues, the link below suggests that there might still be some vulnerabilities present. Do you think the specific vulnerability you're concerned about can be addressed with Hive 3.1.3?

Re: Elasticsearch support for Spark 3.x

2023-09-07 Thread Sean Owen
I mean, have you checked if this is in your jar? Are you building an assembly? Where do you expect elastic classes to be and are they there? Need some basic debugging here On Thu, Sep 7, 2023, 8:49 PM Dipayan Dev wrote: > Hi Sean, > > Removed the provided thing, but still the same issue. > > >

Re: Elasticsearch support for Spark 3.x

2023-09-07 Thread Dipayan Dev
Hi Sean, Removed the provided thing, but still the same issue. org.elasticsearch elasticsearch-spark-30_${scala.compat.version} 7.12.1 On Fri, Sep 8, 2023 at 4:41 AM Sean Owen wrote: > By marking it provided, you are not including this dependency with your > app. If it is also

RE: Spark 3.4.1 and Hive 3.1.3

2023-09-07 Thread Agrawal, Sanket
Hi Chao, The reason to migrate to Hive 3.1.3 is to remove a vulnerability from hive-exec-2.3.9.jar. Thanks Sanket From: Chao Sun Sent: Thursday, September 7, 2023 10:23 PM To: Agrawal, Sanket Cc: Yeachan Park ; user@spark.apache.org Subject: [EXT] Re: Spark 3.4.1 and Hive 3.1.3 Hi Sanket,

Re: Elasticsearch support for Spark 3.x

2023-09-07 Thread Sean Owen
By marking it provided, you are not including this dependency with your app. If it is also not somehow already provided by your spark cluster (this is what it means), then yeah this is not anywhere on the class path at runtime. Remove the provided scope. On Thu, Sep 7, 2023, 4:09 PM Dipayan Dev

Re: Change default timestamp offset on data load

2023-09-07 Thread Jack Goodson
Thanks Mich figured that might be the case, regardless, appreciate the help :) On Thu, Sep 7, 2023 at 8:36 PM Mich Talebzadeh wrote: > Hi, > > As far as I am aware there is no Spark or JVM setting that can make Spark > assume a different timezone during the initial load from Parquet as Parquet

Re: Spark 3.4.1 and Hive 3.1.3

2023-09-07 Thread Chao Sun
Hi Sanket, Spark 3.4.1 currently only works with Hive 2.3.9, and it would require a lot of work to upgrade the Hive version to 3.x and up. Normally though, you only need the Hive client in Spark to talk to HiveMetastore (HMS) for things like table or partition metadata information. In this case,

Re: Elasticsearch support for Spark 3.x

2023-09-07 Thread Dipayan Dev
Hi, Can you please elaborate your last response? I don’t have any external dependencies added, and just updated the Spark version as mentioned below. Can someone help me with this? On Fri, 1 Sep 2023 at 5:58 PM, Koert Kuipers wrote: > could the provided scope be the issue? > > On Sun, Aug 27,

Re: Elasticsearch support for Spark 3.x

2023-09-07 Thread Dipayan Dev
++ Dev On Thu, 7 Sep 2023 at 10:22 PM, Dipayan Dev wrote: > Hi, > > Can you please elaborate your last response? I don’t have any external > dependencies added, and just updated the Spark version as mentioned below. > > Can someone help me with this? > > On Fri, 1 Sep 2023 at 5:58 PM, Koert

Re: Spark 3.4.1 and Hive 3.1.3

2023-09-07 Thread Yeachan Park
Hi, The maven option is good for testing but I wouldn't recommend it running in production from a security perspective, and also, depending on your setup, you might be downloading jars at the start of every spark session. By the way, Spark definitely does not require all the jars from Hive, since from
