unsubscribe

2023-09-19 Thread Ghousia
unsubscribe

Re: Discriptency sample standard deviation pyspark and Excel

2023-09-19 Thread Mich Talebzadeh
Hi Helen, Assuming you want to calculate stddev_samp, Spark correctly points STDDEV to STDDEV_SAMP. In below replace sales with your table name and AMOUNT_SOLD with the column you want to do the calculation SELECT

Re: Discriptency sample standard deviation pyspark and Excel

2023-09-19 Thread Bjørn Jørgensen
from pyspark.sql import SparkSession from pyspark.sql.functions import stddev_samp, stddev_pop spark = SparkSession.builder.getOrCreate() data = [(52.7,), (45.3,), (60.2,), (53.8,), (49.1,), (44.6,), (58.0,), (56.5,), (47.9,), (50.3,)] df = spark.createDataFrame(data, ["value"])

Create an external table with DataFrameWriterV2

2023-09-19 Thread Christophe Préaud
Hi, I usually create an external Delta table with the command below, using DataFrameWriter API: df.write    .format("delta")    .option("path", "")    .saveAsTable("") Now I would like to use the DataFrameWriterV2 API. I have tried the following command: df.writeTo("")    .using("delta")    
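The truncated commands above can be fleshed out as a sketch. The helper name is hypothetical, and carrying the "path" option over to DataFrameWriterV2's create() is an assumption based on the V1 behaviour, not a confirmed recipe:

```python
def create_external_delta_table(df, table_name, path):
    """Sketch of the DataFrameWriterV2 equivalent of
    df.write.format("delta").option("path", ...).saveAsTable(...).
    Assumes the "path" option marks the table as external,
    as it does in the V1 API."""
    (df.writeTo(table_name)       # DataFrameWriterV2 entry point (Spark 3.0+)
       .using("delta")
       .option("path", path)
       .create())                 # errors if the table exists; see createOrReplace()
```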

Re: Discriptency sample standard deviation pyspark and Excel

2023-09-19 Thread Sean Owen
Pyspark follows SQL databases here. stddev is stddev_samp, and sample standard deviation is the calculation with the Bessel correction, n-1 in the denominator. stddev_pop is simply standard deviation, with n in the denominator. On Tue, Sep 19, 2023 at 7:13 AM Helene Bøe wrote: > Hi! > > > > I
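Sean's distinction can be checked without Spark at all. A minimal sketch using Python's standard library and the sample values from Bjørn's reply (`statistics.stdev` applies the n-1 Bessel correction like `stddev_samp` and Excel's STDEV.S; `statistics.pstdev` divides by n like `stddev_pop` and STDEV.P):

```python
import statistics

values = [52.7, 45.3, 60.2, 53.8, 49.1, 44.6, 58.0, 56.5, 47.9, 50.3]

# Sample standard deviation: Bessel correction, n-1 in the denominator.
sample_sd = statistics.stdev(values)

# Population standard deviation: n in the denominator.
population_sd = statistics.pstdev(values)

# For n > 1 the sample figure is always the larger of the two.
print(sample_sd > population_sd)  # True
```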

Spark streaming sourceArchiveDir does not move file to archive directory

2023-09-19 Thread Yunus Emre Gürses
Hello everyone, I'm using scala and spark with the version 3.4.1 in Windows 10. While streaming using Spark, I give the `cleanSource` option as "archive" and the `sourceArchiveDir` option as "archived" as in the code below. ``` spark.readStream .option("cleanSource", "archive")
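The options in the preview can be sketched as a small helper; the source format and path names are hypothetical. One assumption worth stating: Spark expects `sourceArchiveDir` to be an absolute path outside the source glob, and a relative directory like "archived" that overlaps the source pattern is a common reason files are not moved:

```python
def archived_stream(spark, source_dir, archive_dir):
    """Sketch of a file stream that archives consumed files.
    archive_dir should be an absolute path that does not overlap
    source_dir, or Spark may leave the consumed files in place."""
    return (spark.readStream
            .format("csv")                       # hypothetical source format
            .option("cleanSource", "archive")
            .option("sourceArchiveDir", archive_dir)
            .load(source_dir))
```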

Discriptency sample standard deviation pyspark and Excel

2023-09-19 Thread Helene Bøe
Hi! I am applying the stddev function (so actually stddev_samp), however when comparing with the sample standard deviation in Excel the resuls do not match. I cannot find in your documentation any more specifics on how the sample standard deviation is calculated, so I cannot compare the

Re: Spark stand-alone mode

2023-09-19 Thread Patrick Tucci
Multiple applications can run at once, but you need to either configure Spark or your applications to allow that. In stand-alone mode, each application attempts to take all resources available by default. This section of the documentation has more details:

Urgent: Seeking Guidance on Kafka Slow Consumer and Data Skew Problem

2023-09-19 Thread Karthick
Subject: Seeking Guidance on Kafka Slow Consumer and Data Skew Problem Dear Spark Community, I recently reached out to the Apache Flink community for assistance with a critical issue we are facing in our IoT platform, which relies on Apache Kafka and real-time data processing. We received some

unsubscribe

2023-09-18 Thread Ghazi Naceur
unsubscribe

Re: Data Duplication Bug Found - Structured Streaming Versions 3..4.1, 3.2.4, and 3.3.2

2023-09-18 Thread Jerry Peng
Hi Craig, Thank you for sending us more information. Can you answer my previous question which I don't think the document addresses. How did you determine duplicates in the output? How was the output data read? The FileStreamSink provides exactly-once writes ONLY if you read the output with the

Re: getting emails in different order!

2023-09-18 Thread Mich Talebzadeh
OK thanks Sean. Not a big issue for me. Normally happens AM GMT/London time.. I see the email trail but not the thread owner's email first. Normally responses first. Mich Talebzadeh, Distinguished Technologist, Solutions Architect & Engineer London United Kingdom view my Linkedin profile

Re: getting emails in different order!

2023-09-18 Thread Sean Owen
I have seen this, and not sure if it's just the ASF mailer being weird, or more likely, because emails are moderated and we inadvertently moderate them out of order On Mon, Sep 18, 2023 at 10:59 AM Mich Talebzadeh wrote: > Hi, > > I use gmail to receive spark user group emails. > > On

Re: Spark stand-alone mode

2023-09-18 Thread Ilango
Thanks all for your suggestions. Noted with thanks. Just wanted share few more details about the environment 1. We use NFS for data storage and data is in parquet format 2. All HPC nodes are connected and already work as a cluster for Studio workbench. I can setup password less SSH if it not exist

getting emails in different order!

2023-09-18 Thread Mich Talebzadeh
Hi, I use gmail to receive spark user group emails. On occasions, I get the latest emails first and later in the day I receive the original email. Has anyone else seen this behaviour recently? Thanks Mich Talebzadeh, Distinguished Technologist, Solutions Architect & Engineer London United

[ANNOUNCE] Apache Kyuubi released 1.7.2

2023-09-18 Thread Zhen Wang
Hi all, The Apache Kyuubi community is pleased to announce that Apache Kyuubi 1.7.2 has been released! Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses. Kyuubi provides a pure SQL gateway through Thrift JDBC/ODBC interface for

Re: First Time contribution.

2023-09-17 Thread Haejoon Lee
Welcome Ram! :-) I would recommend you to check https://issues.apache.org/jira/browse/SPARK-37935 out as a starter task. Refer to https://github.com/apache/spark/pull/41504, https://github.com/apache/spark/pull/41455 as an example PR. Or you can also add a new sub-task if you find any error

Re: First Time contribution.

2023-09-17 Thread Denny Lee
Hi Ram, We have some good guidance at https://spark.apache.org/contributing.html HTH! Denny On Sun, Sep 17, 2023 at 17:18 ram manickam wrote: > > > > Hello All, > Recently, joined this community and would like to contribute. Is there a > guideline or recommendation on tasks that can be

About Peak Jvm Memory Onheap

2023-09-17 Thread Nebi Aydin
Hi all, I couldn't find any useful doc that explains `*Peak JVM Memory Onheap`* field on Spark UI. Most of the time my applications have very low *On heap storage memory *and *Peak execution memory on heap* But have very big `*Peak JVM Memory Onheap`.* on Spark UI Can someone please explain the

Fwd: First Time contribution.

2023-09-17 Thread ram manickam
Hello All, Recently, joined this community and would like to contribute. Is there a guideline or recommendation on tasks that can be picked up by a first timer or a started task?. Tried looking at stack overflow tag: apache-spark , couldn't

Re: Filter out 20% of rows

2023-09-16 Thread ashok34...@yahoo.com.INVALID
Thank you Bjorn and Mich.  Appreciated Best On Saturday, 16 September 2023 at 16:50:04 BST, Mich Talebzadeh wrote: Hi Bjorn, I thought that one is better off using percentile_approx as it seems to be the recommended approach for computing percentiles and can simplify the code. I have

Re: Filter out 20% of rows

2023-09-16 Thread Bjørn Jørgensen
EDIT: I don't think that the question asker will have only returned the top 25 percentages. On Sat, 16 Sep 2023 at 21:54, Bjørn Jørgensen wrote: > percentile_approx returns the approximate percentile(s) > The memory consumption is > bounded. The

Re: Filter out 20% of rows

2023-09-16 Thread Bjørn Jørgensen
percentile_approx returns the approximate percentile(s) The memory consumption is bounded. The larger accuracy parameter we choose, the smaller error we get. The default accuracy value is 1, to match with Hive default setting. Choose a smaller value
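As a plain-Python illustration of the thread's approach (compute a percentile threshold, then filter out the rows above it), here is a hedged sketch using the exact nearest-rank method rather than Spark's approximate `percentile_approx`; the function name is hypothetical:

```python
import math

def drop_top_20_percent(values):
    """Keep only values at or below the 80th percentile
    (a plain-Python analogue of filtering on percentile_approx)."""
    ordered = sorted(values)
    # Index of the 80th-percentile element (nearest-rank method).
    idx = max(0, math.ceil(0.80 * len(ordered)) - 1)
    threshold = ordered[idx]
    return [v for v in values if v <= threshold]

print(drop_top_20_percent(list(range(1, 11))))  # [1, 2, 3, 4, 5, 6, 7, 8]
```

In PySpark the same shape would be a `percentile_approx` aggregation followed by a `filter` on the resulting threshold.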

Re: Filter out 20% of rows

2023-09-16 Thread Mich Talebzadeh
Hi Bjorn, I thought that one is better off using percentile_approx as it seems to be the recommended approach for computing percentiles and can simplify the code. I have modified your code to use percentile_approx rather than manually computing it. It would be interesting to hear ideas on this.

Re: Filter out 20% of rows

2023-09-16 Thread Mich Talebzadeh
Happy Saturday coding  Mich Talebzadeh, Distinguished Technologist, Solutions Architect & Engineer London United Kingdom view my Linkedin profile https://en.everybodywiki.com/Mich_Talebzadeh *Disclaimer:* Use it at your own

Re: Filter out 20% of rows

2023-09-16 Thread Bjørn Jørgensen
ah.. yes that's right. I did have to use some time on this one and I was having some issues with the code. I restarted the notebook kernel and reran it, and I get the same result. On Sat, 16 Sep 2023 at 11:41, Mich Talebzadeh < mich.talebza...@gmail.com> wrote: > Splendid code. A minor error

Re: Filter out 20% of rows

2023-09-16 Thread Mich Talebzadeh
Splendid code. A minor error glancing at your code. print(df.count()) print(result_df.count()) You have not defined result_df. I gather you meant "result"? print(result.count()) That should fix it. HTH Mich Talebzadeh, Distinguished Technologist, Solutions Architect & Engineer London

[Spark Core]: How does rpc threads influence shuffle?

2023-09-15 Thread Nebi Aydin
Hello all, I know that these parameters exist for shuffle tuning: *spark.shuffle.io.serverThreadsspark.shuffle.io.clientThreadsspark.shuffle.io.threads* But we also have *spark.rpc.io.serverThreadsspark.rpc.io.clientThreadsspark.rpc.io.threads* So specifically talking about *Shuffling,

Re: Spark stand-alone mode

2023-09-15 Thread Bjørn Jørgensen
you need to setup ssh without password, use key instead. How to connect without password using SSH (passwordless) On Fri, 15 Sep 2023 at 20:55, Mich Talebzadeh <

Re: Filter out 20% of rows

2023-09-15 Thread Bjørn Jørgensen
Something like this? # Standard library imports import json import multiprocessing import os import re import sys import random # Third-party imports import numpy as np import pandas as pd import pyarrow # Pyspark imports from pyspark import SparkConf, SparkContext from pyspark.sql import

Re: Spark stand-alone mode

2023-09-15 Thread Mich Talebzadeh
Hi, Can these 4 nodes talk to each other through ssh as trusted hosts (on top of the network that Sean already mentioned)? Otherwise you need to set it up. You can install a LAN if you have another free port at the back of your HPC nodes. They should You ought to try to set up a Hadoop cluster

Re: Spark stand-alone mode

2023-09-15 Thread Sean Owen
Yes, should work fine, just set up according to the docs. There needs to be network connectivity between whatever the driver node is and these 4 nodes. On Thu, Sep 14, 2023 at 11:57 PM Ilango wrote: > > Hi all, > > We have 4 HPC nodes and installed spark individually in all nodes. > > Spark is

Re: Spark stand-alone mode

2023-09-15 Thread Patrick Tucci
I use Spark in standalone mode. It works well, and the instructions on the site are accurate for the most part. The only thing that didn't work for me was the start_all.sh script. Instead, I use a simple script that starts the master node, then uses SSH to connect to the worker machines and start

Spark stand-alone mode

2023-09-14 Thread Ilango
Hi all, We have 4 HPC nodes and installed spark individually in all nodes. Spark is used as local mode(each driver/executor will have 8 cores and 65 GB) in Sparklyr/pyspark using Rstudio/Posit workbench. Slurm is used as scheduler. As this is local mode, we are facing performance issue(as only

Re: Data Duplication Bug Found - Structured Streaming Versions 3..4.1, 3.2.4, and 3.3.2

2023-09-14 Thread Craig Alfieri
Hi Russell et al, Acknowledging receipt; we’ll get these answers back to the group. Follow-up forthcoming. Craig On Sep 14, 2023, at 6:38 PM, russell.spit...@gmail.com wrote: Exactly once should be output sink dependent, what sink was being used? Sent from my iPhone On Sep 14, 2023, at 4:52 PM, Jerry

Re: Data Duplication Bug Found - Structured Streaming Versions 3..4.1, 3.2.4, and 3.3.2

2023-09-14 Thread russell . spitzer
Exactly once should be output sink dependent, what sink was being used? Sent from my iPhone On Sep 14, 2023, at 4:52 PM, Jerry Peng wrote: Craig, Thanks! Please let us know the result! Best, Jerry On Thu, Sep 14, 2023 at 12:22 PM Mich Talebzadeh wrote: Hi Craig, Can you please

Re: Data Duplication Bug Found - Structured Streaming Versions 3..4.1, 3.2.4, and 3.3.2

2023-09-14 Thread Jerry Peng
Craig, Thanks! Please let us know the result! Best, Jerry On Thu, Sep 14, 2023 at 12:22 PM Mich Talebzadeh wrote: > > Hi Craig, > > Can you please clarify what this bug is and provide sample code causing > this issue? > > HTH > > Mich Talebzadeh, > Distinguished Technologist, Solutions

Re: Data Duplication Bug Found - Structured Streaming Versions 3..4.1, 3.2.4, and 3.3.2

2023-09-14 Thread Craig Alfieri
Hi Jerry- This is exactly the type of help we're seeking, to confirm the FilestreamSink was not utilized on our test runs. Our team is going to work towards implementing this and re-running our experiments across the versions. If everything comes back with similar results, we will reach back

Re: Write Spark Connection client application in Go

2023-09-14 Thread bo yang
Thanks Holden and Martin for the nice words and feedback :) On Wed, Sep 13, 2023 at 8:22 AM Martin Grund wrote: > This is absolutely awesome! Thank you so much for dedicating your time to > this project! > > > On Wed, Sep 13, 2023 at 6:04 AM Holden Karau wrote: > >> That’s so cool! Great work

unsubscribe

2023-09-13 Thread randy clinton
unsubscribe -- I appreciate your time, ~Randy - To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: Write Spark Connection client application in Go

2023-09-13 Thread Martin Grund
This is absolutely awesome! Thank you so much for dedicating your time to this project! On Wed, Sep 13, 2023 at 6:04 AM Holden Karau wrote: > That’s so cool! Great work y’all :) > > On Tue, Sep 12, 2023 at 8:14 PM bo yang wrote: > >> Hi Spark Friends, >> >> Anyone interested in using Golang

Re: Write Spark Connection client application in Go

2023-09-12 Thread Holden Karau
That’s so cool! Great work y’all :) On Tue, Sep 12, 2023 at 8:14 PM bo yang wrote: > Hi Spark Friends, > > Anyone interested in using Golang to write Spark application? We created a > Spark > Connect Go Client library . > Would love to hear

APACHE Spark adoption/growth chart

2023-09-12 Thread Andrew Petersen
Hello Spark community Can anyone direct me to a simple graph/chart that shows APACHE Spark adoption, preferably one that includes recent years? Of less importance, a similar Databricks plot? An internet search gave me plots only up to 2015. I also searched spark.apache.org and databricks.com,

Write Spark Connection client application in Go

2023-09-12 Thread bo yang
Hi Spark Friends, Anyone interested in using Golang to write Spark application? We created a Spark Connect Go Client library . Would love to hear feedback/thoughts from the community. Please see the quick start guide

Unsubscribe

2023-09-11 Thread Tom Praison
Unsubscribe

Feedback on Testing Guidelines for Data Stream Processing Applications

2023-09-11 Thread Alexandre Strapacao Guedes Vianna
Greetings, I hope this message finds you well. As part of my PhD research, I've developed guidelines tailored to assist professionals in planning testing of Data Stream Processing applications. If you've worked directly with stream processing or have experience with similar systems, like

Re: IDEA compile fail but sbt test succeed

2023-09-09 Thread Pasha Finkelshteyn
Dear AlphaBetaGo, First of all, there are not only guys here, but also women. Second, you didn't give a context that would allow to understand the connection with Spark. From what I see, it's more likely that it's an issue in Spark/sbt support in IDEA. Feel free to create an issue in the

Re: [External Email] Re: About /mnt/hdfs/current/BP directories

2023-09-08 Thread Nebi Aydin
Usually job never reaches that point fails during shuffle. And storage memory and executor memory when it failed is usually low On Fri, Sep 8, 2023 at 16:49 Jack Wells wrote: > Assuming you’re not writing to HDFS in your code, Spark can spill to HDFS > if it runs out of memory on a per-executor

Re: [External Email] Re: About /mnt/hdfs/current/BP directories

2023-09-08 Thread Jack Wells
Assuming you’re not writing to HDFS in your code, Spark can spill to HDFS if it runs out of memory on a per-executor basis. This could happen when evaluating a cache operation like you have below or during shuffle operations in joins, etc. You might try to increase executor memory, tune shuffle

Re: [External Email] Re: About /mnt/hdfs/current/BP directories

2023-09-08 Thread Nebi Aydin
Sure df = spark.read.option("basePath", some_path).parquet(*list_of_s3_file_paths()) ( df .where(SOME FILTER) .repartition(6) .cache() ) On Fri, Sep 8, 2023 at 14:56 Jack Wells wrote: > Hi Nebi, can you share the code you’re using to read and write from S3? > > On Sep 8,

Re: About /mnt/hdfs/current/BP directories

2023-09-08 Thread Jack Wells
Hi Nebi, can you share the code you’re using to read and write from S3? On Sep 8, 2023 at 10:59:59, Nebi Aydin wrote: > Hi all, > I am using spark on EMR to process data. Basically i read data from AWS S3 > and do the transformation and post transformation i am loading/writing data > to s3. >

About /mnt/hdfs/current/BP directories

2023-09-08 Thread Nebi Aydin
Hi all, I am using spark on EMR to process data. Basically i read data from AWS S3 and do the transformation and post transformation i am loading/writing data to s3. Recently we have found that hdfs(/mnt/hdfs) utilization is going too high. I disabled `yarn.log-aggregation-enable` by setting it

RE: Spark 3.4.1 and Hive 3.1.3

2023-09-08 Thread Agrawal, Sanket
Hi Yasukazu, I tried by replacing the jar though the spark code didn’t work but the vulnerability was removed. But I agree that even 3.1.3 has other vulnerabilities listed on maven page but these are medium level vulnerabilities. We are currently targeting Critical and High vulnerabilities

Re: Elasticsearch support for Spark 3.x

2023-09-08 Thread Dipayan Dev
@Alfie Davidson : Awesome, it worked with "“org.elasticsearch.spark.sql”" But as soon as I switched to *elasticsearch-spark-20_2.12, *"es" also worked. On Fri, Sep 8, 2023 at 12:45 PM Dipayan Dev wrote: > > Let me try that and get back. Just wondering, if there a change in the > way we pass

Re: Elasticsearch support for Spark 3.x

2023-09-08 Thread Dipayan Dev
Let me try that and get back. Just wondering, if there a change in the way we pass the format in connector from Spark 2 to 3? On Fri, 8 Sep 2023 at 12:35 PM, Alfie Davidson wrote: > I am pretty certain you need to change the write.format from “es” to > “org.elasticsearch.spark.sql” > > Sent

RE: Spark 3.4.1 and Hive 3.1.3

2023-09-07 Thread Agrawal, Sanket
Hi, I tried replacing just this JAR but getting errors. From: Nagatomi Yasukazu Sent: Friday, September 8, 2023 9:35 AM To: Agrawal, Sanket Cc: Chao Sun ; Yeachan Park ; user@spark.apache.org Subject: [EXT] Re: Spark 3.4.1 and Hive 3.1.3 Hi Sanket, While migrating to Hive 3.1.3 may resolve

Re: Spark 3.4.1 and Hive 3.1.3

2023-09-07 Thread Nagatomi Yasukazu
Hi Sanket, While migrating to Hive 3.1.3 may resolve many issues, the link below suggests that there might still be some vulnerabilities present. Do you think the specific vulnerability you're concerned about can be addressed with Hive 3.1.3?

Re: Elasticsearch support for Spark 3.x

2023-09-07 Thread Sean Owen
I mean, have you checked if this is in your jar? Are you building an assembly? Where do you expect elastic classes to be and are they there? Need some basic debugging here On Thu, Sep 7, 2023, 8:49 PM Dipayan Dev wrote: > Hi Sean, > > Removed the provided thing, but still the same issue. > > >

Re: Elasticsearch support for Spark 3.x

2023-09-07 Thread Dipayan Dev
Hi Sean, Removed the provided thing, but still the same issue. org.elasticsearch elasticsearch-spark-30_${scala.compat.version} 7.12.1 On Fri, Sep 8, 2023 at 4:41 AM Sean Owen wrote: > By marking it provided, you are not including this dependency with your > app. If it is also

RE: Spark 3.4.1 and Hive 3.1.3

2023-09-07 Thread Agrawal, Sanket
Hi Chao, The reason to migrate to Hive 3.1.3 is to remove a vulnerability from hive-exec-2.3.9.jar. Thanks Sanket From: Chao Sun Sent: Thursday, September 7, 2023 10:23 PM To: Agrawal, Sanket Cc: Yeachan Park ; user@spark.apache.org Subject: [EXT] Re: Spark 3.4.1 and Hive 3.1.3 Hi Sanket,

Re: Elasticsearch support for Spark 3.x

2023-09-07 Thread Sean Owen
By marking it provided, you are not including this dependency with your app. If it is also not somehow already provided by your spark cluster (this is what it means), then yeah this is not anywhere on the class path at runtime. Remove the provided scope. On Thu, Sep 7, 2023, 4:09 PM Dipayan Dev

Re: Change default timestamp offset on data load

2023-09-07 Thread Jack Goodson
Thanks Mich figured that might be the case, regardless, appreciate the help :) On Thu, Sep 7, 2023 at 8:36 PM Mich Talebzadeh wrote: > Hi, > > As far as I am aware there is no Spark or JVM setting that can make Spark > assume a different timezone during the initial load from Parquet as Parquet

Re: Spark 3.4.1 and Hive 3.1.3

2023-09-07 Thread Chao Sun
Hi Sanket, Spark 3.4.1 currently only works with Hive 2.3.9, and it would require a lot of work to upgrade the Hive version to 3.x and up. Normally though, you only need the Hive client in Spark to talk to HiveMetastore (HMS) for things like table or partition metadata information. In this case,

Re: Elasticsearch support for Spark 3.x

2023-09-07 Thread Dipayan Dev
Hi, Can you please elaborate your last response? I don’t have any external dependencies added, and just updated the Spark version as mentioned below. Can someone help me with this? On Fri, 1 Sep 2023 at 5:58 PM, Koert Kuipers wrote: > could the provided scope be the issue? > > On Sun, Aug 27,

Re: Elasticsearch support for Spark 3.x

2023-09-07 Thread Dipayan Dev
++ Dev On Thu, 7 Sep 2023 at 10:22 PM, Dipayan Dev wrote: > Hi, > > Can you please elaborate your last response? I don’t have any external > dependencies added, and just updated the Spark version as mentioned below. > > Can someone help me with this? > > On Fri, 1 Sep 2023 at 5:58 PM, Koert

Re: Spark 3.4.1 and Hive 3.1.3

2023-09-07 Thread Yeachan Park
Hi, The maven option is good for testing but I wouldn't recommend it running in production from a security perspective and also depending on your setup you might be downloading jars at the start of every spark session. By the way, Spark definitely not require all the jars from Hive, since from

Re: Seeking Professional Advice on Career and Personal Growth in the Apache Spark Community

2023-09-07 Thread Mich Talebzadeh
Hi Varun, With all that said, I forgot one worthy sentence. "It doesn't really matter what background you come from or your wealth, everything is possible. Use every negative source in your life as a positive and you will never ever fail!" Cheers Mich Talebzadeh, Distinguished Technologist,

RE: Spark 3.4.1 and Hive 3.1.3

2023-09-07 Thread Agrawal, Sanket
Hi I Tried using the maven option and it’s working. But we are not allowed to download jars at runtime from maven because of some security restrictions. So, I tried again with downloading hive 3.1.3 and giving the location of jars and it worked this time. But now in our docker image we have 40

Re: Change default timestamp offset on data load

2023-09-07 Thread Mich Talebzadeh
Hi, As far as I am aware there is no Spark or JVM setting that can make Spark assume a different timezone during the initial load from Parquet as Parquet files store timestamps in UTC. The timezone conversion can be done (as I described before) after the load. HTH Mich Talebzadeh, Distinguished

Re: Change default timestamp offset on data load

2023-09-06 Thread Jack Goodson
Thanks Mich, sorry, I might have been a bit unclear in my original email. The timestamps are getting loaded as 2003-11-24T09:02:32+ for example but I want it loaded as 2003-11-24T09:02:32+1300 I know how to do this with various transformations however I'm wondering if there's any spark or jvm

Re: Seeking Professional Advice on Career and Personal Growth in the Apache Spark Community

2023-09-06 Thread ashok34...@yahoo.com.INVALID
Hello Mich, Thanking you for providing these useful feedbacks and responses. We appreciate your contribution to this community forum. I for myself find your posts insightful. +1 for me Best, AK On Wednesday, 6 September 2023 at 18:34:27 BST, Mich Talebzadeh wrote: Hi Varun, In answer

Re: how can i use spark with yarn cluster in java

2023-09-06 Thread Mich Talebzadeh
Sounds like a network issue, for example connecting to remote server? try ping 172.21.242.26 telnet 172.21.242.26 596590 or nc -vz 172.21.242.26 596590 example nc -vz rhes76 1521 Ncat: Version 7.50 ( https://nmap.org/ncat ) Ncat: Connected to 50.140.197.230:1521. Ncat: 0 bytes sent, 0 bytes

Re: Seeking Professional Advice on Career and Personal Growth in the Apache Spark Community

2023-09-06 Thread Mich Talebzadeh
Hi Varun, In answer to your questions, these are my views. However, they are just views and cannot be taken as facts so to speak 1. *Focus and Time Management:* I often struggle with maintaining focus and effectively managing my time. This leads to productivity issues and affects

how can i use spark with yarn cluster in java

2023-09-06 Thread BCMS
i want to use yarn cluster with my current code. if i use conf.set("spark.master","local[*]") inplace of conf.set("spark.master","yarn"), everything is very well. but i try to use yarn in setmaster, my code give an below error. ``` package com.example.pocsparkspring; import

Re: Change default timestamp offset on data load

2023-09-06 Thread Mich Talebzadeh
Hi Jack, You may use from_utc_timestamp and to_utc_timestamp to see if they help. from pyspark.sql.functions import from_utc_timestamp You can read your Parquet file into DF df = spark.read.parquet('parquet_file_path') # Convert timestamps (assuming your column name) from UTC to
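One subtlety worth separating out: `from_utc_timestamp` converts (shifts the wall clock), while the original question asks to relabel the offset and keep the wall clock. The difference can be shown with the standard library alone, using the +13:00 offset from the thread:

```python
from datetime import datetime, timezone, timedelta

ts = datetime(2003, 11, 24, 9, 2, 32, tzinfo=timezone.utc)
nzdt = timezone(timedelta(hours=13))  # fixed +13:00 offset

# Converting shifts the wall clock (what from_utc_timestamp does):
converted = ts.astimezone(nzdt)       # 2003-11-24T22:02:32+13:00

# Relabelling keeps the wall clock and swaps the offset,
# which is what the question asks for:
relabelled = ts.replace(tzinfo=nzdt)  # 2003-11-24T09:02:32+13:00

print(relabelled.isoformat())
```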

Change default timestamp offset on data load

2023-09-05 Thread Jack Goodson
Hi, I've got a number of tables that I'm loading in from a SQL server. The timestamp in SQL server is stored like 2003-11-24T09:02:32 I get these as parquet files in our raw storage location and pick them up in Databricks. When I load the data in databricks, the dataframe/spark assumes UTC or

Seeking Professional Advice on Career and Personal Growth in the Apache Spark Community

2023-09-05 Thread Varun Shah
Dear Apache Spark Community, I hope this email finds you well. I am writing to seek your valuable insights and advice on some challenges I've been facing in my career and personal development journey, particularly in the context of Apache Spark and the broader big data ecosystem. A little

Re: Running Spark Connect Server in Cluster Mode on Kubernetes

2023-09-05 Thread Nagatomi Yasukazu
Dear Spark Community, I've been exploring the capabilities of the Spark Connect Server and encountered an issue when trying to launch it in a cluster deploy mode with Kubernetes as the master. While initiating the `start-connect-server.sh` script with the `--conf` parameter for `spark.master`

Re: pyspark.ml.recommendation is using the wrong python version

2023-09-04 Thread Mich Talebzadeh
Hi, Have you set python environment variables correctly? PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON? You can print the environment variables within your PySpark script to verify this: import os print("PYTHONPATH:", os.environ.get("PYTHONPATH")) print("PYSPARK_PYTHON:",
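The check Mich describes can be run as a small diagnostic inside the failing job; the point is to compare the driver's actual interpreter against what the PySpark environment variables claim (both may legitimately be unset):

```python
import os
import sys

# Which interpreter is this process actually running?
print("driver interpreter:", sys.executable, sys.version_info[:2])

# What will PySpark use for workers and for the driver?
print("PYSPARK_PYTHON:", os.environ.get("PYSPARK_PYTHON"))
print("PYSPARK_DRIVER_PYTHON:", os.environ.get("PYSPARK_DRIVER_PYTHON"))
```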

Re: pyspark.ml.recommendation is using the wrong python version

2023-09-04 Thread Harry Jamison
That did not paste well, let me try again. I am using python3.7 and spark 2.4.7. I am trying to figure out why my job is using the wrong python version. This is how it is starting up; the logs confirm that I am using python 3.7, but I later see the error message showing it is trying to use 3.8, and I

pyspark.ml.recommendation is using the wrong python version

2023-09-04 Thread Harry Jamison
I am using python3.7 and spark 2.4.7. I am trying to figure out why my job is using the wrong python version. This is how it is starting up; the logs confirm that I am using python 3.7, but I later see the error message showing it is trying to use 3.8, and I am not sure where it is picking that up.

Re: Running Spark Connect Server in Cluster Mode on Kubernetes

2023-09-04 Thread Nagatomi Yasukazu
Hello Mich, Thank you for your questions. Here are my responses: > 1. What investigation have you done to show that it is running in local mode? I have verified through the History Server's Environment tab that: - "spark.master" is set to local[*] - "spark.app.id" begins with local-xxx -

Re: Running Spark Connect Server in Cluster Mode on Kubernetes

2023-09-04 Thread Mich Talebzadeh
personally I have not used this feature myself. However, some points 1. What investigation have you done to show that it is running in local mode? 2. who has configured this kubernetes cluster? Is it supplied by a cloud vendor? 3. Confirm that you have configured Spark Connect

Re: Running Spark Connect Server in Cluster Mode on Kubernetes

2023-09-03 Thread Nagatomi Yasukazu
Hi Cley, Thank you for taking the time to respond to my query. Your insights on Spark cluster deployment are much appreciated. However, I'd like to clarify that my specific challenge is related to running the Spark Connect Server on Kubernetes in Cluster Mode. While I understand the general

Re: Running Spark Connect Server in Cluster Mode on Kubernetes

2023-09-03 Thread Cleyson Barros
Hi Nagatomi, Use Apache imagers, then run your master node, then start your many slavers. You can add a command line in the docker files to call for the master using the docker container names in your service composition if you wish to run 2 masters active and standby follow the instructions in

Running Spark Connect Server in Cluster Mode on Kubernetes

2023-09-02 Thread Nagatomi Yasukazu
Hello Apache Spark community, I'm currently trying to run Spark Connect Server on Kubernetes in Cluster Mode and facing some challenges. Any guidance or hints would be greatly appreciated. ## Environment: Apache Spark version: 3.4.1 Kubernetes version: 1.23 Command executed:

[Spark Connect]Running Spark Connect Server in Cluster Mode on Kubernetes

2023-09-02 Thread Nagatomi Yasukazu
Hello Apache Spark community, I'm currently trying to run Spark Connect Server on Kubernetes in Cluster Mode and facing some challenges. Any guidance or hints would be greatly appreciated. ## Environment: Apache Spark version: 3.4.1 Kubernetes version: 1.23 Command executed:

Re: Spark Connect, Master, and Workers

2023-09-01 Thread James Yu
Can I simply understand Spark Connect this way: The client process is now the Spark driver? From: Brian Huynh Sent: Thursday, August 10, 2023 10:15 PM To: Kezhi Xiong Cc: user@spark.apache.org Subject: Re: Spark Connect, Master, and Workers Hi Kezhi, Yes,

Re: Elasticsearch support for Spark 3.x

2023-09-01 Thread Koert Kuipers
could the provided scope be the issue? On Sun, Aug 27, 2023 at 2:58 PM Dipayan Dev wrote: > Using the following dependency for Spark 3 in POM file (My Scala version > is 2.12.14) > > > > > > > *org.elasticsearch > elasticsearch-spark-30_2.12 > 7.12.0 provided* > > > The code throws error

Reg read json inference schema

2023-08-31 Thread Manoj Babu
Hi Team, I am getting the below error when reading a column with a value with JSON string. json_schema_ctx_rdd = record_df.rdd.map(lambda row: row.contexts_parsed) spark.read.option("mode", "PERMISSIVE").option("inferSchema", "true").option("inferTimestamp", "false").json(json_schema_ctx_rdd)

Re:

2023-08-31 Thread leibnitz
me too. On Thu, 24 Aug 2023 at 09:02, ayan guha wrote: > Unsubscribe-- > Best Regards, > Ayan Guha >

Re: Okio Vulnerability in Spark 3.4.1

2023-08-31 Thread Bjørn Jørgensen
Have tried to upgrade it. It is from kubernetes-client [SPARK-43990][BUILD] Upgrade kubernetes-client to 6.7.2 On Thu, 31 Aug 2023 at 14:47, Agrawal, Sanket wrote: > I don’t see an entry in pom.xml while building spark. I think

Re: Okio Vulnerability in Spark 3.4.1

2023-08-31 Thread Sean Owen
It's a dependency of some other HTTP library. Use mvn dependency:tree to see where it comes from. It may be more straightforward to upgrade the library that brings it in, assuming a later version brings in a later okio. You can also manage up the version directly with a new entry in However,
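Sean's preview cuts off at "a new entry in"; presumably a `<dependencyManagement>` override in the parent POM, which is the standard Maven way to force a transitive dependency's version. A hypothetical sketch (the version shown is illustrative; as Sean notes, verify binary compatibility with whatever okhttp/kubernetes-client versions Spark actually ships before relying on it):

```xml
<!-- Hypothetical override: pin okio via dependencyManagement in the parent pom. -->
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>com.squareup.okio</groupId>
      <artifactId>okio</artifactId>
      <version>3.4.0</version>
    </dependency>
  </dependencies>
</dependencyManagement>
```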

RE: Okio Vulnerability in Spark 3.4.1

2023-08-31 Thread Agrawal, Sanket
I don’t see an entry in pom.xml while building spark. I think it is being downloaded as part of some other dependency. From: Sean Owen Sent: Thursday, August 31, 2023 5:10 PM To: Agrawal, Sanket Cc: user@spark.apache.org Subject: [EXT] Re: Okio Vulnerability in Spark 3.4.1 Does the

Re: Okio Vulnerability in Spark 3.4.1

2023-08-31 Thread Sean Owen
Does the vulnerability affect Spark? In any event, have you tried updating Okio in the Spark build? I don't believe you could just replace the JAR, as other libraries probably rely on it and compiled against the current version. On Thu, Aug 31, 2023 at 6:02 AM Agrawal, Sanket wrote: > Hi All, >

Okio Vulnerability in Spark 3.4.1

2023-08-31 Thread Agrawal, Sanket
Hi All, Amazon inspector has detected a vulnerability in okio-1.15.0.jar JAR in Spark 3.4.1. It suggests to upgrade the jar version to 3.4.0. But when we try this version of jar then the spark application is failing with below error: py4j.protocol.Py4JJavaError: An error occurred while calling

CommunityOverCode(CoC) 2023

2023-08-28 Thread Uma Maheswara Rao Gangumalla
Hi All, The CommunityOverCode (CoC) 2023 Conference is approaching real quick. This year's conference is happening at Halifax, Nova Scotia, Canada (Oct 07 - Oct 10 2023). We have an exciting set of talks lined up from compute and storage experts. Please take a moment to check the compute track

Registration open for Community Over Code North America

2023-08-28 Thread Rich Bowen
Hello! Registration is still open for the upcoming Community Over Code NA event in Halifax, NS! We invite you to register for the event https://communityovercode.org/registration/ Apache Committers, note that you have a special discounted rate for the conference at US$250. To take advantage of

Re: Elasticsearch support for Spark 3.x

2023-08-27 Thread Dipayan Dev
Using the following dependency for Spark 3 in POM file (My Scala version is 2.12.14) *org.elasticsearch elasticsearch-spark-30_2.12 7.12.0 provided* The code throws error at this line : df.write.format("es").mode("overwrite").options(elasticOptions).save("index_name") The same code

Re: Elasticsearch support for Spark 3.x

2023-08-27 Thread Holden Karau
What’s the version of the ES connector you are using? On Sat, Aug 26, 2023 at 10:17 AM Dipayan Dev wrote: > Hi All, > > We're using Spark 2.4.x to write dataframe into the Elasticsearch index. > As we're upgrading to Spark 3.3.0, it throwing out error > Caused by:
