Spark shuffle and inevitability of writing to Disk

2023-05-16 Thread Mich Talebzadeh
Hi, On the issue of Spark shuffle, it is accepted that a shuffle *often involves* some, if not all, of the following: - Disk I/O - Data serialization and deserialization - Network I/O Excluding the external shuffle service and without relying on the configuration options provided by Spark

Re: [spark-core] Can executors recover/reuse shuffle files upon failure?

2023-05-15 Thread Mich Talebzadeh
Mon, 15 May 2023 at 13:11, Faiz Halde wrote: > Hello, > > We've been in touch with a few spark specialists who suggested us a > potential solution to improve the reliability of our jobs that are shuffle > heavy > > Here is what our setup looks like > >- Spark version:

[spark-core] Can executors recover/reuse shuffle files upon failure?

2023-05-15 Thread Faiz Halde
Hello, We've been in touch with a few Spark specialists who suggested a potential solution to improve the reliability of our shuffle-heavy jobs. Here is what our setup looks like - Spark version: 3.3.1 - Java version: 1.8 - We do not use external shuffle service - We use

Does spark read the same file twice, if two stages are using the same DataFrame?

2023-05-11 Thread Vijay B
In my view Spark is behaving as expected. TL;DR: every time a dataframe is reused, branched or forked, the sequence of operations is evaluated and run again. Use cache or persist to avoid this behavior, and un-persist when no longer required; Spark does not un-persist automatically. Couple of things
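A minimal PySpark sketch of that cache/un-persist advice; the input path and column names are made up for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-example").getOrCreate()

# Hypothetical input; any DataFrame feeding more than one action behaves the same way.
df = spark.read.csv("/tmp/input.csv", header=True, inferSchema=True)

df.cache()                                          # or df.persist(...) for a specific storage level
counts = df.groupBy("col_a").count().collect()      # first action populates the cache
distinct_b = df.select("col_b").distinct().count()  # second action reuses the cached data

df.unpersist()                                      # Spark does not un-persist automatically
```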

RE: Can Spark SQL (not DataFrame or Dataset) aggregate array into map of element of count?

2023-05-10 Thread Vijay B
Please see if this works -- aggregate array into map of element of count SELECT aggregate(array(1,2,3,4,5), map('cnt',0), (acc,x) -> map('cnt', acc.cnt+1)) as array_count thanks Vijay On 2023/05/05 19:32:04 Yong Zhang wrote: > Hi, This is on Spark 3.1 environment. > > For some r
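To connect that reply to the original question (a per-element frequency map rather than a single running total), here is a hedged PySpark sketch, not taken from the thread itself; it assumes the default non-ANSI behaviour in which looking up a missing map key returns NULL:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-aggregate-example").getOrCreate()

# Running total, as in the reply above.
spark.sql("""
  SELECT aggregate(array(1, 2, 3, 4, 5), map('cnt', 0),
                   (acc, x) -> map('cnt', acc.cnt + 1)) AS array_count
""").show(truncate=False)

# Per-element frequency map. The start value is an empty map<string,int>;
# map_filter drops the current key before map_concat re-adds it with an updated count,
# and coalesce(acc[x], 0) relies on a missing key returning NULL (non-ANSI default).
spark.sql("""
  SELECT aggregate(
           array('a', 'b', 'a', 'c', 'b', 'a'),
           cast(map() AS map<string, int>),
           (acc, x) -> map_concat(
             map_filter(acc, (k, v) -> k != x),
             map(x, coalesce(acc[x], 0) + 1)
           )
         ) AS freq_cnt
""").show(truncate=False)
```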

Re: Does spark read the same file twice, if two stages are using the same DataFrame?

2023-05-09 Thread Mich Talebzadeh
When I run this job in local mode spark-submit --master local[4] with spark = SparkSession.builder \ .appName("tests") \ .enableHiveSupport() \ .getOrCreate() spark.conf.set("spark.sql.adaptive.enabled", "true") df3.explain(extende

Re: Can Spark SQL (not DataFrame or Dataset) aggregate array into map of element of count?

2023-05-09 Thread Yong Zhang
acc -> acc) AS feq_cnt Here are my questions: * Is using "map()" above the best way? The "start" structure in this case should be Map.empty[String, Int], but of course, it won't work in pure Spark SQL, so the best solution I can think of is "map()"

Re: Does spark read the same file twice, if two stages are using the same DataFrame?

2023-05-09 Thread Nitin Siwach
v], PartitionFilters: [], PushedFilters: [], ReadSchema: struct ``` On Mon, May 8, 2023 at 1:07 AM Mich Talebzadeh wrote: > When I run this job in local mode spark-submit --master local[4] > > with > > spark = SparkSession.builder \ > .appName("

Re: Can Spark SQL (not DataFrame or Dataset) aggregate array into map of element of count?

2023-05-09 Thread Yong Zhang
acc -> acc) AS feq_cnt Here are my questions: * Is using "map()" above the best way? The "start" structure in this case should be Map.empty[String, Int], but of course, it won't work in pure Spark SQL, so the best solution I can think of is "map()", and it is

Re: Does spark read the same file twice, if two stages are using the same DataFrame?

2023-05-09 Thread Mich Talebzadeh
here. It has to read it twice to perform this operation. HJ (hash join) was not invented by Spark; it has been around in databases for years, along with NLJ (nested loop join) and MJ (merge join). Mich Talebzadeh, Lead Solutions Architect/Engineering Lead Palantir Technologies Limited London United Kingdom view my Linkedin

Re: Does spark read the same file twice, if two stages are using the same DataFrame?

2023-05-07 Thread Nitin Siwach
to help me out is minimized I don't think Spark validating the file's existence qualifies as an action in Spark parlance. Sure, there would be an AnalysisException if the file is not found at the location provided; however, if you provided a schema and a valid path then no job would

Re: Does spark read the same file twice, if two stages are using the same DataFrame?

2023-05-07 Thread Mich Talebzadeh
You have started with a pandas DF which won't scale outside of the driver itself. Let us put that aside. df1.to_csv("./df1.csv", index_label="index") ## write the dataframe to the underlying file system starting with spark df1 = spark.read.csv("./df1.csv", header=

Re: Does spark read the same file twice, if two stages are using the same DataFrame?

2023-05-07 Thread Nitin Siwach
> >> I get how using .cache I can ensure that the data from a particular >> checkpoint is reused and the computations do not happen again. >> >> However, In my case here I am calling just one action. Within the purview >> of one action Spark should not rerun t

Re: Does spark read the same file twice, if two stages are using the same DataFrame?

2023-05-07 Thread Mich Talebzadeh
st one action. Within the purview > of one action Spark should not rerun the overlapping parts of the DAG. I do > not understand why the file scan is happening several times. I can easily > mitigate the issue by using window functions and creating all the columns > in one go without hav

Re: Does spark read the same file twice, if two stages are using the same DataFrame?

2023-05-07 Thread Nitin Siwach
checkpoint is reused and the computations do not happen again. However, In my case here I am calling just one action. Within the purview of one action Spark should not rerun the overlapping parts of the DAG. I do not understand why the file scan is happening several times. I can easily mitigate the issue

Re: Does spark read the same file twice, if two stages are using the same DataFrame?

2023-05-07 Thread Winston Lai
When your memory is not sufficient to keep the cached data for your jobs in two different stages, it might be read twice because Spark might have to clear the previous cache for other jobs. In those cases, a spill may be triggered when Spark writes your data from memory to disk. One way
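A small PySpark sketch of that mitigation, with a made-up source path: persisting with a disk-backed storage level lets evicted partitions be re-read from local disk instead of forcing Spark back to the source files:

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("persist-example").getOrCreate()

df = spark.read.parquet("/tmp/events")   # hypothetical source

# MEMORY_AND_DISK spills partitions that no longer fit in memory to local disk
# instead of dropping them, so a later stage re-reads the spilled blocks rather
# than scanning and recomputing from the source.
df.persist(StorageLevel.MEMORY_AND_DISK)
```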

Re: Can Spark SQL (not DataFrame or Dataset) aggregate array into map of element of count?

2023-05-06 Thread Mich Talebzadeh
e author will in no case be liable for any monetary damages arising from such loss, damage or destruction. On Fri, 5 May 2023 at 20:33, Yong Zhang wrote: > Hi, This is on Spark 3.1 environment. > > For some reason, I can ONLY do this in Spark SQL, instead of either Scala > or PySpark en

Can Spark SQL (not DataFrame or Dataset) aggregate array into map of element of count?

2023-05-05 Thread Yong Zhang
Hi, This is on Spark 3.1 environment. For some reason, I can ONLY do this in Spark SQL, instead of either Scala or PySpark environment. I want to aggregate an array into a Map of element count, within that array, but in Spark SQL. I know that there is an aggregate function available like

How to create spark udf use functioncatalog?

2023-05-03 Thread tzxxh
We are using Spark. Today I saw the FunctionCatalog, and I have seen the source of spark\sql\core\src\test\scala\org\apache\spark\sql\connector\DataSourceV2FunctionSuite.scala and have implemented the ScalarFunction. But I still do not know how to register it in SQL
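For reference, V2 functions become callable once a catalog implementing FunctionCatalog is registered and the function is addressed by its catalog-qualified name. A hedged sketch follows; the catalog name, JVM class and function name are placeholders, not part of the Spark test suite:

```python
from pyspark.sql import SparkSession

# "com.example.MyFunctionCatalog" stands in for a JVM class that implements
# org.apache.spark.sql.connector.catalog.FunctionCatalog and serves your ScalarFunction.
spark = (SparkSession.builder
         .appName("v2-function-example")
         .config("spark.sql.catalog.my_cat", "com.example.MyFunctionCatalog")
         .getOrCreate())

# V2 functions are resolved by catalog-qualified name: catalog.namespace.function
spark.sql("SELECT my_cat.ns.strlen('hello')").show()
```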

Re: How to determine the function of tasks on each stage in an Apache Spark application?

2023-05-02 Thread Trường Trần Phan An
Hi all, I have written a program and overridden two events onStageCompleted and onTaskEnd. However, these two events do not provide information on when a Task/Stage is completed. What I want to know is which Task corresponds to which stage of a DAG (the Spark history server only tells me how

CVE-2023-32007: Apache Spark: Shell command injection via Spark UI

2023-05-02 Thread Arnout Engelen
Severity: important Affected versions: - Apache Spark 3.1.1 before 3.2.2 Description: ** UNSUPPORTED WHEN ASSIGNED ** The Apache Spark UI offers the possibility to enable ACLs via the configuration option spark.acls.enable. With an authentication filter, this checks whether a user has access

Re: Tensorflow on Spark CPU

2023-04-30 Thread Sean Owen
l.com> wrote: > > > You don't want to use CPUs with Tensorflow. > If it's not scaling, you may have a problem that is far too small to > distribute. > > On Sat, Apr 29, 2023 at 7:30 AM second_co...@yahoo.com.INVALID > wrote: > > Anyone successfully run nativ

How to read text files with GBK encoding in the spark core

2023-04-30 Thread lianyou1...@126.com
Hello all, Is there any way to use the pyspark core to read some text files with GBK encoding? Although pyspark sql has an option to set the encoding, these text files are not in a structured format. Any advice is appreciated. Thank you lianyou Li
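One workaround, sketched below under the assumption that the files are small enough to be handled one file per task: read them as raw bytes with binaryFiles and decode from GBK in Python, since the core text reader has no encoding option.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gbk-read").getOrCreate()
sc = spark.sparkContext

# binaryFiles yields (path, file_bytes); decode each file from GBK and split into lines.
lines = (sc.binaryFiles("/data/gbk_texts/*.txt")   # hypothetical path
           .flatMap(lambda kv: kv[1].decode("gbk").splitlines()))

print(lines.take(5))
```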

Re: Tensorflow on Spark CPU

2023-04-30 Thread second_co...@yahoo.com.INVALID
second_co...@yahoo.com.INVALID wrote: Anyone successfully run native tensorflow on Spark? I tested the example at https://github.com/tensorflow/ecosystem/tree/master/spark/spark-tensorflow-distributor on Kubernetes CPU, by running it on multiple workers' CPUs. I do not see any speed up in training

Re: Tensorflow on Spark CPU

2023-04-29 Thread Sean Owen
You don't want to use CPUs with Tensorflow. If it's not scaling, you may have a problem that is far too small to distribute. On Sat, Apr 29, 2023 at 7:30 AM second_co...@yahoo.com.INVALID wrote: > Anyone successfully run native tensorflow on Spark ? i tested example at > https://gith

Tensorflow on Spark CPU

2023-04-29 Thread second_co...@yahoo.com.INVALID
Anyone successfully run native tensorflow on Spark? I tested the example at https://github.com/tensorflow/ecosystem/tree/master/spark/spark-tensorflow-distributor on Kubernetes CPU, by running it on multiple workers' CPUs. I do not see any speed up in training time by setting the number of slots from 1

RE: Spark Kubernetes Operator

2023-04-26 Thread Aldo Culquicondor
/04/14 16:41:36 Yuval Itzchakov wrote: > Hi, > > ATM I see the most used option for a Spark operator is the one provided by > Google: https://github.com/GoogleCloudPlatform/spark-on-k8s-operator > > Unfortunately, it doesn't seem actively maintained. Are there any plans to >

Reg: create spark using virtual machine through chef

2023-04-24 Thread sunkara akhil sai teja
Hi team, I am Akhil. I am trying to set up Spark on a virtual machine through Chef. Could you please help us with how we can do it? If possible, could you please share the documentation. Regards Akhil

Re: Use Spark Aggregator in PySpark

2023-04-24 Thread Enrico Minack
Hi, For an aggregating UDF, use spark.udf.registerJavaUDAF(name, className). Enrico Am 23.04.23 um 23:42 schrieb Thomas Wang: Hi Spark Community, I have implemented a custom Spark Aggregator (a subclass to |org.apache.spark.sql.expressions.Aggregator|). Now I'm trying to use
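A short usage sketch of that call; the SQL function name and JVM class below are placeholders, and the class is assumed to be on the driver and executor classpath (for example via --jars) and acceptable to registerJavaUDAF:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("udaf-example").getOrCreate()

# "my_agg" and "com.example.MyAggregator" are placeholders; the JVM class must be
# on the classpath and registerable as a UDAF.
spark.udf.registerJavaUDAF("my_agg", "com.example.MyAggregator")

spark.range(10).createOrReplaceTempView("t")
spark.sql("SELECT my_agg(id) FROM t").show()
```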

Use Spark Aggregator in PySpark

2023-04-23 Thread Thomas Wang
Hi Spark Community, I have implemented a custom Spark Aggregator (a subclass to org.apache.spark.sql.expressions.Aggregator). Now I'm trying to use it in a PySpark application, but for some reason, I'm not able to trigger the function. Here is what I'm doing, could someone help me take a look

Re: Spark Aggregator with ARRAY input and ARRAY output

2023-04-23 Thread Thomas Wang
should work. > -- > Raghavendra > > > On Sun, Apr 23, 2023 at 9:20 PM Thomas Wang wrote: > >> Hi Spark Community, >> >> I'm trying to implement a custom Spark Aggregator (a subclass to >> org.apache.spark.sql.expressions.Aggregator). Correct me if I'm

Re: Spark Aggregator with ARRAY input and ARRAY output

2023-04-23 Thread Raghavendra Ganesh
For simple array types setting encoder to ExpressionEncoder() should work. -- Raghavendra On Sun, Apr 23, 2023 at 9:20 PM Thomas Wang wrote: > Hi Spark Community, > > I'm trying to implement a custom Spark Aggregator (a subclass to > org.apache.spark.sql.expressions.Aggregator)

Spark Aggregator with ARRAY input and ARRAY output

2023-04-23 Thread Thomas Wang
Hi Spark Community, I'm trying to implement a custom Spark Aggregator (a subclass to org.apache.spark.sql.expressions.Aggregator). Correct me if I'm wrong, but I'm assuming I will be able to use it as an aggregation function like SUM. What I'm trying to do is that I have a column of ARRAY and I

Dependency injection for spark executors

2023-04-20 Thread Deepak Patankar
I am writing a spark application which uses java and spring boot to process rows. For every row it performs some logic and saves data into the database. The logic is performed using some services defined in my application and some ex

Re: [Spark on SBT] Executor just keeps running

2023-04-18 Thread Dhruv Singla
You can reproduce the behavior in ordinary Scala code if you keep reduce in an object outside the main method. Hope it might help On Mon, Apr 17, 2023 at 10:22 PM Dhruv Singla wrote: > Hi Team >I was trying to run spark using `sbt console` on the terminal. I am > able

Re: Spark Multiple Hive Metastore Catalog Support

2023-04-17 Thread Ankit Gupta
Thanks Elliot ! Let me check it out ! On Mon, 17 Apr, 2023, 10:08 pm Elliot West, wrote: > Hi Ankit, > > While not a part of Spark, there is a project called 'WaggleDance' that > can federate multiple Hive metastores so that they are accessible via a > single URI: htt

[Spark on SBT] Executor just keeps running

2023-04-17 Thread Dhruv Singla
Hi Team I was trying to run spark using `sbt console` on the terminal. I am able to build the project successfully using build.sbt and the following piece of code runs fine on IntelliJ. The only issue I am facing while running the same on terminal is that the Executor keeps running

Re: Spark Multiple Hive Metastore Catalog Support

2023-04-17 Thread Cheng Pan
There is a DSv2-based Hive connector in Apache Kyuubi[1] that supports connecting multiple HMS in a single Spark application. Some limitations - currently only supports Spark 3.3 - has a known issue when using w/ `spark-sql`, but OK w/ spark-shell and normal jar-based Spark application. [1
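As an illustration of how a DSv2 catalog of this kind is typically wired into a Spark session; the catalog name, connector class and option key below are placeholders, so check the Kyuubi connector documentation for the exact values:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("multi-hms-example")
         # Placeholder class and option names; take the real ones from the connector docs.
         .config("spark.sql.catalog.hms2", "org.apache.kyuubi.spark.connector.hive.HiveTableCatalog")
         .config("spark.sql.catalog.hms2.hive.metastore.uris", "thrift://second-hms:9083")
         .getOrCreate())

# Tables in the second metastore are then addressed through the catalog prefix.
spark.sql("SELECT * FROM hms2.db_name.table_name").show()
```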

Re: Spark Multiple Hive Metastore Catalog Support

2023-04-17 Thread Elliot West
Hi Ankit, While not a part of Spark, there is a project called 'WaggleDance' that can federate multiple Hive metastores so that they are accessible via a single URI: https://github.com/ExpediaGroup/waggle-dance This may be useful or perhaps serve as inspiration. Thanks, Elliot. On Mon, 17 Apr

Spark Log Shipper to Cloud Bucket

2023-04-17 Thread Jayabindu Singh
Greetings Everyone! We need to ship Spark (driver and executor) logs (not Spark event logs) from K8s to a cloud bucket (ADLS/S3). Using Fluent Bit we are able to ship the log files, but only to one single path, container/logs/. This will cause a huge number of files in a single folder

Re: Spark Multiple Hive Metastore Catalog Support

2023-04-17 Thread Ankit Gupta
++ User Mailing List Just a reminder, anyone who can help on this. Thanks a lot ! Ankit Prakash Gupta On Wed, Apr 12, 2023 at 8:22 AM Ankit Gupta wrote: > Hi All > > The question is regarding the support of multiple Remote Hive Metastore > catalogs with Spark. Starting Spark

CVE-2023-22946: Apache Spark proxy-user privilege escalation from malicious configuration class

2023-04-15 Thread Sean R. Owen
Description: In Apache Spark versions prior to 3.4.0, applications using spark-submit can specify a 'proxy-user' to run as, limiting privileges. The application can execute code with the privileges of the submitting user, however, by providing malicious configuration-related classes

Re: Spark Kubernetes Operator

2023-04-14 Thread Yuval Itzchakov
I'm not running on GKE. I am wondering what's the long term strategy around a Spark operator. Operators are the de-facto way to run complex deployments. The Flink community now has an official community led operator, and I was wondering if there are any similar plans for Spark. On Fri, Apr 14

Re: Spark Kubernetes Operator

2023-04-14 Thread Mich Talebzadeh
Hi, What exactly are you trying to achieve? Spark on GKE works fine and you can run Dataproc now on GKE https://www.linkedin.com/pulse/running-google-dataproc-kubernetes-engine-gke-spark-mich/?trackingId=lz12GC5dRFasLiaJm5qDSw%3D%3D Unless I misunderstood your point. HTH Mich Talebzadeh, Lead

Spark Kubernetes Operator

2023-04-14 Thread Yuval Itzchakov
Hi, ATM I see the most used option for a Spark operator is the one provided by Google: https://github.com/GoogleCloudPlatform/spark-on-k8s-operator Unfortunately, it doesn't seem actively maintained. Are there any plans to support an official Apache Spark community driven operator?

Re: How to determine the function of tasks on each stage in an Apache Spark application?

2023-04-14 Thread Jacek Laskowski
Hi, Start with intercepting stage completions using SparkListenerStageCompleted [1]. That's Spark Core (jobs, stages and tasks). Go up the execution chain to Spark SQL with SparkListenerSQLExecutionStart [2] and SparkListenerSQLExecutionEnd [3], and correlate infos. You may want to look at how

Re: How to create spark udf use functioncatalog?

2023-04-14 Thread Jacek Laskowski
com.invalid> wrote: > We are using spark.Today I see the FunctionCatalog , and I have seen the > source of > spark\sql\core\src\test\scala\org\apache\spark\sql\connector\DataSourceV2FunctionSuite.scala > and have implements the ScalarFunction.But i still not konw how > to register it in sql

How to create spark udf use functioncatalog?

2023-04-14 Thread ??????
We are using Spark. Today I saw the FunctionCatalog, and I have seen the source of spark\sql\core\src\test\scala\org\apache\spark\sql\connector\DataSourceV2FunctionSuite.scala and have implemented the ScalarFunction. But I still do not know how to register it in SQL

[ANNOUNCE] Apache Spark 3.2.4 released

2023-04-13 Thread Dongjoon Hyun
We are happy to announce the availability of Apache Spark 3.2.4! Spark 3.2.4 is a maintenance release containing stability fixes. This release is based on the branch-3.2 maintenance branch of Spark. We strongly recommend all 3.2 users to upgrade to this stable release. To download Spark 3.2.4

Re: How to determine the function of tasks on each stage in an Apache Spark application?

2023-04-13 Thread Trường Trần Phan An
"reverse-engineer" tasks to functions. > > In essence, Spark SQL is an abstraction layer over RDD API that's made up > of partitions and tasks. Tasks are Scala functions (possibly with some > Python for PySpark). A simple-looking high-level operator like > DataFrame.join can end up with

Re: How to determine the function of tasks on each stage in an Apache Spark application?

2023-04-12 Thread Maytas Monsereenusorn
Hi, I was wondering if it's not possible to determine tasks to functions, is it still possible to easily figure out which job and stage completed which part of the query from the UI? For example, in the SQL tab of the Spark UI, I am able to see the query and the Job IDs for that query. However

Re: Re: spark streaming and kinesis integration

2023-04-12 Thread Mich Talebzadeh
laimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction. On Wed, 12 Apr 2023 at 02:55, Lingzhe Sun wrote: > Hi Mich, > > FYI we're using spark operator( > https://github.com/GoogleCloudPlatform/spark-on-k8s-operator) t

Re: How to determine the function of tasks on each stage in an Apache Spark application?

2023-04-12 Thread Jacek Laskowski
Hi, tl;dr it's not possible to "reverse-engineer" tasks to functions. In essence, Spark SQL is an abstraction layer over RDD API that's made up of partitions and tasks. Tasks are Scala functions (possibly with some Python for PySpark). A simple-looking high-level operator like DataFram

Re: Re: spark streaming and kinesis integration

2023-04-12 Thread 孙令哲
Hi Rajesh, It's working fine, at least for now. But you'll need to build your own spark image using later versions. Lingzhe Sun Hirain Technologies Original: From:Rajesh Katkar Date:2023-04-12 21:36:52To:Lingzhe SunCc:Mich Talebzadeh , user Subject:Re: Re: spark streaming

Re: Re: spark streaming and kinesis integration

2023-04-12 Thread Yi Huang
unsubscribe On Wed, Apr 12, 2023 at 3:59 PM Rajesh Katkar wrote: > Hi Lingzhe, > > We are also started using this operator. > Do you see any issues with it? > > > On Wed, 12 Apr, 2023, 7:25 am Lingzhe Sun, wrote: > >> Hi Mich, >> >> FYI we're u

Re: Re: spark streaming and kinesis integration

2023-04-12 Thread Rajesh Katkar
Hi Lingzhe, We have also started using this operator. Do you see any issues with it? On Wed, 12 Apr, 2023, 7:25 am Lingzhe Sun, wrote: > Hi Mich, > > FYI we're using spark operator( > https://github.com/GoogleCloudPlatform/spark-on-k8s-operator) to build > stateful structured s

Re: Re: spark streaming and kinesis integration

2023-04-11 Thread Lingzhe Sun
Hi Mich, FYI we have been using the spark operator (https://github.com/GoogleCloudPlatform/spark-on-k8s-operator) to build stateful structured streaming on k8s for a year. We haven't tested it the non-operator way. Besides that, the main contributor of the spark operator, Yinan Li, has been inactive

Re: spark streaming and kinesis integration

2023-04-10 Thread Mich Talebzadeh
Just to clarify, a major benefit of k8s in this case is to host your Spark applications in the form of containers in an automated fashion so that one can easily deploy as many instances of the application as required (autoscaling). From below: https://price2meet.com/gcp/docs

Re: spark streaming and kinesis integration

2023-04-10 Thread Mich Talebzadeh
What I said was this "In so far as I know k8s does not support spark structured streaming?" So it is an open question. I just recalled it. I have not tested myself. I know structured streaming works on Google Dataproc cluster but I have not seen any official link that says Spark

Re: spark streaming and kinesis integration

2023-04-10 Thread Rajesh Katkar
Do you have any link or ticket which justifies that k8s does not support spark streaming ? On Thu, 6 Apr, 2023, 9:15 pm Mich Talebzadeh, wrote: > Do you have a high level diagram of the proposed solution? > > In so far as I know k8s does not support spark structured streaming?

Re: Troubleshooting ArrayIndexOutOfBoundsException in long running Spark application

2023-04-09 Thread Andrew Redd
remove On Wed, Apr 5, 2023 at 8:06 AM Mich Talebzadeh wrote: > OK Spark Structured Streaming. > > How are you getting messages into Spark? Is it Kafka? > > This to me index that the message is incomplete or having another value in > Json > > HTH > > Mich Talebzad

Re: spark streaming and kinesis integration

2023-04-06 Thread Rajesh Katkar
Our use case is that we want to read/write Kinesis streams using k8s. Officially I could not find a connector or reader for Kinesis in Spark like it has for Kafka. Checking here if anyone has used the Kinesis and Spark Streaming combination? On Thu, 6 Apr, 2023, 7:23 pm Mich Talebzadeh, wrote: >

RE: spark streaming and kinesis integration

2023-04-06 Thread Jonske, Kurt
kar Cc: u...@spark.incubator.apache.org Subject: Re: spark streaming and kinesis integration ⚠ [EXTERNAL EMAIL]: Use Caution Do you have a high level diagram of the proposed solution? In so far as I know k8s does not support spark structured streaming? Mich Talebzadeh, Lead Solutions

Re: spark streaming and kinesis integration

2023-04-06 Thread Mich Talebzadeh
Do you have a high level diagram of the proposed solution? In so far as I know k8s does not support spark structured streaming? Mich Talebzadeh, Lead Solutions Architect/Engineering Lead Palantir Technologies London United Kingdom view my Linkedin profile <https://www.linkedin.com/in/m

Re: spark streaming and kinesis integration

2023-04-06 Thread Mich Talebzadeh
elying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction. On Thu, 6 Apr 2023 at 13:08, Rajesh Katkar wrote: > Hi Spark Team, > > We need to read/write the kinesis streams using

spark streaming and kinesis integration

2023-04-06 Thread Rajesh Katkar
Hi Spark Team, We need to read/write Kinesis streams using Spark Streaming. We checked the official documentation - https://spark.apache.org/docs/latest/streaming-kinesis-integration.html It does not mention a Kinesis connector. An alternative is https://github.com/qubole/kinesis-sql, which
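For context, connectors such as kinesis-sql plug in through the usual readStream/format mechanism. The sketch below is an assumption-laden illustration; the format name and option keys follow that project's README and should be verified against it:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kinesis-example").getOrCreate()

# Option keys follow the kinesis-sql README; stream name, region and endpoint are placeholders.
events = (spark.readStream
          .format("kinesis")
          .option("streamName", "my-stream")
          .option("endpointUrl", "https://kinesis.us-east-1.amazonaws.com")
          .option("startingposition", "TRIM_HORIZON")
          .load())

query = events.writeStream.format("console").start()
query.awaitTermination()
```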

Re: Troubleshooting ArrayIndexOutOfBoundsException in long running Spark application

2023-04-05 Thread Mich Talebzadeh
OK Spark Structured Streaming. How are you getting messages into Spark? Is it Kafka? This to me indicates that the message is incomplete or has another value in the JSON. HTH Mich Talebzadeh, Lead Solutions Architect/Engineering Lead Palantir Technologies London United Kingdom view my Linkedin

Troubleshooting ArrayIndexOutOfBoundsException in long running Spark application

2023-04-05 Thread me
Dear Apache Spark users, I have a long running Spark application that is encountering an ArrayIndexOutOfBoundsException once every two weeks. The exception does not disrupt the operation of my app, but I'm still concerned about it and would like to find a solution. Here's some

Re: Help me learn about JOB TASK and DAG in Apache Spark

2023-04-01 Thread Mich Talebzadeh
Good stuff Khalid. I have created a section in Apache Spark Community Stack called spark foundation. spark-foundation - Apache Spark Community - Slack <https://app.slack.com/client/T04URTRBZ1R/C051CL5T1KL/thread/C0501NBTNQG-1680132989.091199> I invite you to add your weblink to that s

Re: Help me learn about JOB TASK and DAG in Apache Spark

2023-04-01 Thread Khalid Mammadov
Hey AN-TRUONG I have got some articles about this subject that should help. E.g. https://khalidmammadov.github.io/spark/spark_internals_rdd.html Also check other Spark Internals on web. Regards Khalid On Fri, 31 Mar 2023, 16:29 AN-TRUONG Tran Phan, wrote: > Thank you for your informat

Re: Help me learn about JOB TASK and DAG in Apache Spark

2023-03-31 Thread Mich Talebzadeh
author will in no case be liable for any monetary damages arising from such loss, damage or destruction. On Fri, 31 Mar 2023 at 16:17, AN-TRUONG Tran Phan wrote: > Thank you for your information, > > I have tracked the spark history server on port 18080 and the spark UI on > port 4040

Re: Help me learn about JOB TASK and DAG in Apache Spark

2023-03-31 Thread AN-TRUONG Tran Phan
Thank you for your information, I have tracked the spark history server on port 18080 and the spark UI on port 4040. I see the result of these two tools as similar right? I want to know what each Task ID (Example Task ID 0, 1, 3, 4, 5, ) in the images does, is it possible? https

Re: Help me learn about JOB TASK and DAG in Apache Spark

2023-03-31 Thread Mich Talebzadeh
Are you familiar with spark GUI default on port 4040? have a look. HTH Mich Talebzadeh, Lead Solutions Architect/Engineering Lead Palantir Technologies Limited view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> https://en.everybodywi

Re: Topics for Spark online classes & webinars

2023-03-28 Thread Mich Talebzadeh
> >>> There is a section in slack called webinars >>> >>> https://sparkcommunitytalk.slack.com/x-p4977943407059-5006939220983-5006939446887/messages/C0501NBTNQG >>> >>> Asma Zgolli, agreed to prepare materials for Spark internals and/

Re: Topics for Spark online classes & webinars

2023-03-28 Thread Mich Talebzadeh
> >> There is a section in slack called webinars >> >> >> https://sparkcommunitytalk.slack.com/x-p4977943407059-5006939220983-5006939446887/messages/C0501NBTNQG >> >> Asma Zgolli, agreed to prepare materials for Spark internals and/or >> comparing spark

Re: Topics for Spark online classes & webinars

2023-03-28 Thread Bjørn Jørgensen
887/messages/C0501NBTNQG > > Asma Zgolli, agreed to prepare materials for Spark internals and/or > comparing spark 3 and 2. > > I like to contribute to "Spark Streaming & Spark Structured Streaming" > plus "Spark on k8s for both GCP and EKS concepts and con

Re: Topics for Spark online classes & webinars

2023-03-28 Thread Mich Talebzadeh
Hi all, There is a section in slack called webinars https://sparkcommunitytalk.slack.com/x-p4977943407059-5006939220983-5006939446887/messages/C0501NBTNQG Asma Zgolli, agreed to prepare materials for Spark internals and/or comparing spark 3 and 2. I like to contribute to "Spark Stre

Re: Topics for Spark online classes & webinars

2023-03-28 Thread asma zgolli
Hello everyone, I suggest using the slack for the spark community created recently to collaborate and work together on these topics and use the LinkedIn page to publish the events and the webinars. Cheers, Asma Le jeu. 16 mars 2023 à 01:39, Denny Lee a écrit : > What we can do is

Re: Question related to asynchronously map transformation using java spark structured streaming

2023-03-26 Thread Mich Talebzadeh
Agreed. How does asynchronous communication relate to Spark Structured Streaming? In your previous post, you made Spark run on the driver in a single JVM. You attempted to increase the number of executors to 3 after submission of the job, which (as Sean alluded to) would not work

Re: Question related to asynchronously map transformation using java spark structured streaming

2023-03-26 Thread Sean Owen
What do you mean by asynchronously here? On Sun, Mar 26, 2023, 10:22 AM Emmanouil Kritharakis < kritharakismano...@gmail.com> wrote: > Hello again, > > Do we have any news for the above question? > I would really appreciate it. > > Thank you, > >

Re: Question related to asynchronously map transformation using java spark structured streaming

2023-03-26 Thread Emmanouil Kritharakis
Hello again, Do we have any news for the above question? I would really appreciate it. Thank you, -- Emmanouil (Manos) Kritharakis Ph.D. candidate in the Department of Computer Science

Topics for Spark online classes & webinars, next steps

2023-03-21 Thread Mich Talebzadeh
Hi all, As you may be aware we are proposing to set-up community classes and webinars for Spark interest group or simply for those who could benefit from them. @Denny Lee and myself had a discussion on how to put this framework forward. The idea is first and foremost getting support from

Re: Spark StructuredStreaming - watermark not working as expected

2023-03-17 Thread karan alang
e or destruction of data or any other property which may arise > from relying on this email's technical content is explicitly disclaimed. > The author will in no case be liable for any monetary damages arising from > such loss, damage or destruction. > > > > > On Thu, 16 Mar 2023

Re: Spark StructuredStreaming - watermark not working as expected

2023-03-17 Thread Mich Talebzadeh
netary damages arising from such loss, damage or destruction. On Thu, 16 Mar 2023 at 23:49, karan alang wrote: > Fyi .. apache spark version is 3.1.3 > > On Wed, Mar 15, 2023 at 4:34 PM karan alang wrote: > >> Hi Mich, this doesn't seem to be working for me .. the watermar

Single node spark issue in Sparkly/RStudio

2023-03-16 Thread elango vaidyanathan
Hi team, On a single Linux node, I would like to set up RStudio with sparklyr. Three to four people make up the dev team. I am aware of the single-node Spark cluster's constraints. When there is a resource problem with Spark, I want to know when more users join in to use sparklyr in RStudio

Re: Spark StructuredStreaming - watermark not working as expected

2023-03-16 Thread karan alang
Fyi .. apache spark version is 3.1.3 On Wed, Mar 15, 2023 at 4:34 PM karan alang wrote: > Hi Mich, this doesn't seem to be working for me .. the watermark seems to > be getting ignored ! > > Here is the data pu

Re: Topics for Spark online classes & webinars

2023-03-15 Thread Denny Lee
n an article perhaps. Comments and >> contributions are welcome. >> >> HTH >> >> Mich Talebzadeh, >> Lead Solutions Architect/Engineering Lead, >> Palantir Technologies Limited >> >> >> >>view my Linkedin profile >>

Re: Spark StructuredStreaming - watermark not working as expected

2023-03-15 Thread karan alang
"2023-03-13T10:12:00.000-07:00" should have got dropped, it is more than 2 days old (i.e. dated - 2023-03-13)! Any ideas what needs to be changed to make this work ? Here is the code (modified for my requirement, but essentially the same) ``` schema = StructType([

Re: Topics for Spark online classes & webinars

2023-03-15 Thread Mich Talebzadeh
2/> > > > https://en.everybodywiki.com/Mich_Talebzadeh > > > > *Disclaimer:* Use it at your own risk. Any and all responsibility for any > loss, damage or destruction of data or any other property which may arise > from relying on this email's technical content is explici

Re: Topics for Spark online classes & webinars

2023-03-15 Thread Mich Talebzadeh
aimed. > The author will in no case be liable for any monetary damages arising from > such loss, damage or destruction. > > > > > On Tue, 14 Mar 2023 at 15:09, Mich Talebzadeh > wrote: > > Hi Denny, > > That Apache Spark Linkedin page > https://www.linkedin.

Re: Topics for Spark online classes & webinars

2023-03-15 Thread Bjørn Jørgensen
Great. A case that I hope can be better documented, especially now that we have the Pandas API on Spark and many potential new users coming from pandas, is how to start Spark with the full available memory and CPU. I use this function to do this in a notebook. import multiprocessing import os import sys
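A minimal sketch of that idea (not the original function, which is truncated above): derive the core count and physical memory from the host and pass them to the builder before the session is created. The function name and the 2 GiB headroom are arbitrary choices:

```python
import multiprocessing
import os

from pyspark.sql import SparkSession

def spark_with_full_resources(app_name="notebook"):
    cores = multiprocessing.cpu_count()
    # Physical RAM in GiB, leaving ~2 GiB headroom for the OS and the notebook kernel.
    total_gib = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") // (1024 ** 3)
    driver_mem = max(1, total_gib - 2)
    return (SparkSession.builder
            .appName(app_name)
            .master(f"local[{cores}]")
            .config("spark.driver.memory", f"{driver_mem}g")
            .getOrCreate())

spark = spark_with_full_resources()
```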

Re: Topics for Spark online classes & webinars

2023-03-15 Thread Denny Lee
ng from > such loss, damage or destruction. > > > > > On Tue, 14 Mar 2023 at 15:09, Mich Talebzadeh > wrote: > >> Hi Denny, >> >> That Apache Spark Linkedin page >> https://www.linkedin.com/company/apachespark/ looks fine. It also allows >> a wider

Re: Topics for Spark online classes & webinars

2023-03-15 Thread Mich Talebzadeh
uction. On Tue, 14 Mar 2023 at 15:09, Mich Talebzadeh wrote: > Hi Denny, > > That Apache Spark Linkedin page > https://www.linkedin.com/company/apachespark/ looks fine. It also allows > a wider audience to benefit from it. > > +1 for me > > > >view my Linke

Question related to asynchronously map transformation using java spark structured streaming

2023-03-14 Thread Emmanouil Kritharakis
Hello, I hope this email finds you well! I have a simple dataflow in which I read from a kafka topic, perform a map transformation and then I write the result to another topic. Based on your documentation here

Re: Topics for Spark online classes & webinars

2023-03-14 Thread Mich Talebzadeh
Hi Denny, That Apache Spark Linkedin page https://www.linkedin.com/company/apachespark/ looks fine. It also allows a wider audience to benefit from it. +1 for me view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> https://en.everybodywi

Re: Topics for Spark online classes & webinars

2023-03-14 Thread Denny Lee
In the past, we've been using the Apache Spark LinkedIn page <https://www.linkedin.com/company/apachespark/> and group to broadcast these type of events - if you're cool with this? Or we could go through the process of submitting and updating the current https://spark.apache.org or r

Re: Topics for Spark online classes & webinars

2023-03-14 Thread Joris Billen
, Mich Talebzadeh wrote: Apologies I missed the list. To move forward I selected these topics from the thread "Online classes for spark topics". To take this further I propose a confluence page to be seup. 1. Spark UI 2. Dynamic allocation 3. Tuning of jobs 4. Collec

Re: Spark 3.3.2 not running with Antlr4 runtime latest version

2023-03-14 Thread yangjie01
From the release notes of antlr4, there are two key changes in antlr4 4.10: 1. 4.10-generated parsers are incompatible with previous runtimes 2. The minimum Java version increases to Java 11 So I personally think it is temporarily impossible for Spark to upgrade to an antlr4 version above 4.10

Re: Spark 3.3.2 not running with Antlr4 runtime latest version

2023-03-14 Thread Sean Owen
You want Antlr 3 and Spark is on 4? no I don't think Spark would downgrade. You can shade your app's dependencies maybe. On Tue, Mar 14, 2023 at 8:21 AM Sahu, Karuna wrote: > Hi Team > > > > We are upgrading a legacy application using Spring boot , Spark and > Hibernat
