Re: ChatGPT and prediction of Spark future

2023-06-01 Thread Mich Talebzadeh
Great stuff Winston. I added a channel in Slack Community for Spark https://sparkcommunitytalk.slack.com/archives/C05ACMS63RT cheers Mich Talebzadeh, Lead Solutions Architect/Engineering Lead Palantir Technologies Limited London United Kingdom view my Linkedin profile

Comparison of Trino, Spark, and Hive-MR3

2023-05-31 Thread Sungwoo Park
Hi everyone, We published an article on the performance and correctness of Trino, Spark, and Hive-MR3, and thought that it could be of interest to Spark users. https://www.datamonad.com/post/2023-05-31-trino-spark-hive-performance-1.7/ Omitted in the article is the performance of Spark 2.3.1 vs

Re: Viewing UI for spark jobs running on K8s

2023-05-31 Thread Qian Sun
Hi Nikhil Spark operator supports ingress for exposing all UIs of running spark applications. reference: https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/quick-start-guide.md#driver-ui-access-and-ingress On Thu, Jun 1, 2023 at 6:19 AM Nikhil Goyal wrote: > Hi

Re: ChatGPT and prediction of Spark future

2023-05-31 Thread Winston Lai
Hi Mich, I have been using ChatGPT free version, Bing AI, Google Bard and other AI chatbots. My use cases so far include writing, debugging code, generating documentation and explanation on Spark key terminologies for beginners to quickly pick up new concepts, summarizing pros and cons or

Viewing UI for spark jobs running on K8s

2023-05-31 Thread Nikhil Goyal
Hi folks, Is there an equivalent of the Yarn RM page for Spark on Kubernetes? We can port-forward the UI from the driver pod for each job, but this process is tedious given we have multiple jobs running. Is there a clever solution to exposing all Driver UIs in a centralized place? Thanks Nikhil

ChatGPT and prediction of Spark future

2023-05-31 Thread Mich Talebzadeh
I have started looking into ChatGPT as a consumer. The one I have tried is the free not plus version. I asked a question entitled "what is the future for spark" and asked for a concise response This was the answer "Spark has a promising future due to its capabilities in data processing,

Structured streaming append mode picture question

2023-05-31 Thread Hill Liu
Hi, I have a question related to this picture https://spark.apache.org/docs/latest/img/structured-streaming-watermark-append-mode.png in structured streaming programming guide web page. At 12:20, the wm is 12:11, so the window 12:00 ~ 12:10 will be flushed out. But in this pic, it seems the window

Re: [Spark Structured Streaming]: Dynamic Scaling of Executors

2023-05-29 Thread Aishwarya Panicker
Hi, Thanks for your response. I understand there is no explicit way to configure dynamic scaling for Spark Structured Streaming as the ticket is still open for that. But is there a way to manage dynamic scaling with the existing Batch Dynamic scaling algorithm as this kicks in when Dynamic
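A minimal PySpark sketch of the batch dynamic-allocation settings under discussion, assuming Spark 3.x with shuffle tracking; whether they kick in usefully for a Structured Streaming workload is exactly the open question in this thread, and the values are illustrative only:

    from pyspark.sql import SparkSession

    # Hypothetical settings: reuse the batch dynamic-allocation algorithm for a
    # Structured Streaming app. shuffleTracking avoids needing an external
    # shuffle service (Spark 3.0+).
    spark = (SparkSession.builder
             .appName("streaming-dynamic-allocation-sketch")
             .config("spark.dynamicAllocation.enabled", "true")
             .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
             .config("spark.dynamicAllocation.minExecutors", "2")
             .config("spark.dynamicAllocation.maxExecutors", "10")
             .getOrCreate())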

Re: Re: maven with Spark 3.4.0 fails compilation

2023-05-29 Thread Bjørn Jørgensen
2.13.8. You must change 2.13.6 to 2.13.8. On Mon 29 May 2023 at 18:02, Mich Talebzadeh wrote: > Thanks everyone. Still not much progress :(. It is becoming a bit > confusing as I am getting

Re: Re: maven with Spark 3.4.0 fails compilation

2023-05-29 Thread Mich Talebzadeh
Thanks everyone. Still not much progress :(. It is becoming a bit confusing as I am getting this error Compiling ReduceByKey [INFO] Scanning for projects... [INFO] [INFO] -< spark:ReduceByKey >-- [INFO] Building ReduceByKey 3.0 [INFO] from pom.xml

Re: Re: maven with Spark 3.4.0 fails compilation

2023-05-29 Thread Bjørn Jørgensen
Change org.scala-lang scala-library 2.13.11-M2 to org.scala-lang scala-library ${scala.version}. On Mon 29 May 2023 at 13:20, Lingzhe Sun wrote: > Hi Mich, > > Spark 3.4.0 prebuilt with scala 2.13 is built with version 2.13.8 >

Re: JDK version support information

2023-05-29 Thread Sean Owen
Per docs, it is Java 8. It's possible Java 11 partly works with 2.x but not supported. But then again 2.x is not supported either. On Mon, May 29, 2023, 6:43 AM Poorna Murali wrote: > We are currently using JDK 11 and spark 2.4.5.1 is working fine with that. > So, we wanted to check the maximum

Re: JDK version support information

2023-05-29 Thread Poorna Murali
We are currently using JDK 11 and spark 2.4.5.1 is working fine with that. So, we wanted to check the maximum JDK version supported for 2.4.5.1. On Mon, 29 May, 2023, 5:03 pm Aironman DirtDiver, wrote: > Spark version 2.4.5.1 is based on Apache Spark 2.4.5. According to the > official Spark

Re: JDK version support information

2023-05-29 Thread Aironman DirtDiver
Spark version 2.4.5.1 is based on Apache Spark 2.4.5. According to the official Spark documentation for version 2.4.5, the maximum supported JDK (Java Development Kit) version is JDK 8 (Java 8). Spark 2.4.5 is not compatible with JDK versions higher than Java 8. Therefore, you should use JDK 8 to

JDK version support information

2023-05-29 Thread Poorna Murali
Hi, We are using spark version 2.4.5.1. We would like to know the maximum JDK version supported for the same. Thanks, Poorna

Re: Re: maven with Spark 3.4.0 fails compilation

2023-05-29 Thread Lingzhe Sun
Hi Mich, Spark 3.4.0 prebuilt with scala 2.13 is built with version 2.13.8. Since you are using spark-core_2.13 and spark-sql_2.13, you should stick to the major (13) and the minor version (8). Not using any of these may cause unexpected behaviour (though scala claims compatibility among minor

Re: maven with Spark 3.4.0 fails compilation

2023-05-29 Thread Mich Talebzadeh
Thanks for your helpful comments Bjorn. I managed to compile the code with maven but when it run it fails with Application is ReduceByKey Exception in thread "main" java.lang.NoSuchMethodError: scala.package$.Seq()Lscala/collection/immutable/Seq$; at

Re: maven with Spark 3.4.0 fails compilation

2023-05-28 Thread Bjørn Jørgensen
From chatgpt4: The problem appears to be that there is a mismatch between the version of Scala used by the Scala Maven plugin and the version of the Scala library defined as a dependency in your POM. You've defined your Scala version in your properties as `2.12.17` but you're pulling in

Re: [Spark Structured Streaming]: Dynamic Scaling of Executors

2023-05-25 Thread Mich Talebzadeh
Hi, Autoscaling is not compatible with Spark Structured Streaming since Spark Structured Streaming currently does not support dynamic allocation (see SPARK-24815: Structured Streaming should support dynamic

[Spark Structured Streaming]: Dynamic Scaling of Executors

2023-05-25 Thread Aishwarya Panicker
Hi Team, I have been working on Spark Structured Streaming and trying to autoscale our application through dynamic allocation. But I couldn't find any documentation or configurations that supports dynamic scaling in Spark Structured Streaming, due to which I had been using Spark Batch mode

Re: [MLlib] how-to find implementation of Decision Tree Regressor fit function

2023-05-25 Thread Sean Owen
Are you looking for https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala On Thu, May 25, 2023 at 6:54 AM Max wrote: > Good day, I'm working on an implementation of Joint Probability Trees > (JPT) using the Spark framework. For this

Re: Incremental Value dependents on another column of Data frame Spark

2023-05-24 Thread Enrico Minack
Hi, given your dataset: val df=Seq( (1, 20230523, "M01"), (2, 20230523, "M01"), (3, 20230523, "M01"), (4, 20230523, "M02"), (5, 20230523, "M02"), (6, 20230523, "M02"), (7, 20230523, "M01"), (8, 20230523, "M01"), (9, 20230523, "M02"), (10, 20230523, "M02"), (11, 20230523, "M02"), (12,

Dynamic value as the offset of lag() function

2023-05-23 Thread Nipuna Shantha
Hi all This is the sample set of data that I used for this task ,[image: image.png] My need is to pass count as the offset of lag() function. *[ lag(col(), lag(count)).over(windowspec) ]* But as the lag function expects lag(Column, Int) above code does not work. So can you guys suggest a
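Since lag() only accepts a literal offset, one workaround, sketched here against made-up data (column names and values are hypothetical), is to collect the preceding values with collect_list over a running window and index into that array with the per-row count:

    from pyspark.sql import SparkSession, Window
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical data: `count` is the per-row offset we would like to pass to lag().
    df = spark.createDataFrame(
        [(1, "a", 0), (2, "b", 1), (3, "c", 2), (4, "d", 1)],
        ["id", "value", "count"],
    )

    w = Window.orderBy("id").rowsBetween(Window.unboundedPreceding, Window.currentRow)

    # Collect everything seen so far, then pick the element `count` positions back.
    # The when() guard skips offsets larger than the number of preceding rows.
    df = (df.withColumn("hist", F.collect_list("value").over(w))
            .withColumn("lagged",
                        F.when(F.col("count") < F.size("hist"),
                               F.expr("element_at(hist, size(hist) - count)")))
            .drop("hist"))
    df.show()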

Re: Incremental Value dependents on another column of Data frame Spark

2023-05-23 Thread Raghavendra Ganesh
Given, you are already stating the above can be imagined as a partition, I can think of mapPartitions iterator. val inputSchema = inputDf.schema val outputRdd = inputDf.rdd.mapPartitions(rows => new SomeClass(rows)) val outputDf = sparkSession.createDataFrame(outputRdd,

Incremental Value dependents on another column of Data frame Spark

2023-05-23 Thread Nipuna Shantha
Hi all, This is the sample set of data that I used for this task [image: image.png] My expected output is as below [image: image.png] My scenario is if Type is M01 the count should be 0 and if Type is M02 it should be incremented from 1 or 0 until the sequence of M02 is finished. Imagine this
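One way to express this kind of run-based counter, sketched against a hypothetical reconstruction of the sample data (adjust if the M02 counter should start at 0 rather than 1): derive a run id with a running sum over type changes, then number the rows within each M02 run.

    from pyspark.sql import SparkSession, Window
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical reconstruction of the sample: (seq, date, type).
    df = spark.createDataFrame(
        [(1, 20230523, "M01"), (2, 20230523, "M01"), (3, 20230523, "M01"),
         (4, 20230523, "M02"), (5, 20230523, "M02"), (6, 20230523, "M02"),
         (7, 20230523, "M01"), (8, 20230523, "M01"),
         (9, 20230523, "M02"), (10, 20230523, "M02")],
        ["seq", "date", "type"],
    )

    w = Window.orderBy("seq")

    # Flag the start of each run of identical types, turn the flags into a run id
    # with a running sum, then number rows within each M02 run; M01 rows get 0.
    df = (df.withColumn("new_run",
                        (F.col("type") != F.lag("type", 1, "none").over(w)).cast("int"))
            .withColumn("run_id", F.sum("new_run").over(w))
            .withColumn("count",
                        F.when(F.col("type") == "M02",
                               F.row_number().over(Window.partitionBy("run_id").orderBy("seq")))
                         .otherwise(F.lit(0)))
            .drop("new_run", "run_id"))
    df.show()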

Re: Shuffle with Window().partitionBy()

2023-05-23 Thread ashok34...@yahoo.com.INVALID
Thanks great Rauf. Regards On Tuesday, 23 May 2023 at 13:18:55 BST, Rauf Khan wrote: Hi, PartitionBy() is analogous to group by: all rows that have the same value in the specified column will form one window. The data will be shuffled to form the groups. Regards Raouf On Fri, May 12,

Re: Shuffle with Window().partitionBy()

2023-05-23 Thread Rauf Khan
Hi, PartitionBy() is analogous to group by: all rows that have the same value in the specified column will form one window. The data will be shuffled to form the groups. Regards Raouf On Fri, May 12, 2023, 18:48 ashok34...@yahoo.com.INVALID wrote: > Hello, > > In Spark windowing does call
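A quick way to see the shuffle being described is to look for the Exchange node in the physical plan; a small sketch with made-up data:

    from pyspark.sql import SparkSession, Window
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(100).withColumn("grp", F.col("id") % 3)

    # The plan shows "Exchange hashpartitioning(grp, ...)": that is the shuffle
    # introduced by Window().partitionBy("grp").
    df.withColumn("rn",
                  F.row_number().over(Window.partitionBy("grp").orderBy("id"))).explain()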

cannot load model using pyspark

2023-05-23 Thread second_co...@yahoo.com.INVALID
spark.sparkContext.textFile("s3a://a_bucket/models/random_forest_zepp/bestModel/metadata", 1).getNumPartitions() When I run the above code, I get the below error. Can you advise how to troubleshoot? I'm using Spark 3.3.0. The above file path exists.

Data Stream Processing applications testing

2023-05-22 Thread Alexandre Strapacao Guedes Vianna
Hey everyone, I wanted to share my latest paper, "A Grey Literature Review on Data Stream Processing Applications Testing," in the Journal of Systems and Software (JSS), Elsevier. This paper provides unique industry insights, addresses the challenges faced in Data Stream Processing (DSP)

Re: Re: [spark-core] Can executors recover/reuse shuffle files upon failure?

2023-05-22 Thread Mich Talebzadeh
Just to correct the last sentence, if we end up starting a new instance of Spark, I don't think it will be able to read the shuffle data from storage from another instance, I stand corrected. Mich Talebzadeh, Lead Solutions Architect/Engineering Lead Palantir Technologies Limited London United

Re: Re: [spark-core] Can executors recover/reuse shuffle files upon failure?

2023-05-22 Thread Mich Talebzadeh
Hi Maksym. Let us understand the basics here first. My thoughts: Spark replicates the partitions among multiple nodes. If one executor fails, it moves the processing over to the other executor. However, if the data is lost, it re-executes the processing that generated the data, and might have to go

RE: Re: [spark-core] Can executors recover/reuse shuffle files upon failure?

2023-05-22 Thread Maksym M
Hey vaquar, The link doesn't explain the crucial detail we're interested in - does the executor re-use the data that exists on a node from a previous executor and if not, how can we configure it to do so? We are not running on kubernetes, so EKS/Kubernetes-specific advice isn't very relevant. We are

Re: Spark shuffle and inevitability of writing to Disk

2023-05-17 Thread Mich Talebzadeh
Ok, I did a bit of a test that shows that the shuffle does spill to memory then to disk if my assertion is valid. The sample code I wrote is as follows: import sys from pyspark.sql import SparkSession from pyspark import SparkContext from pyspark.sql import SQLContext from pyspark.sql import

Re: [spark-core] Can executors recover/reuse shuffle files upon failure?

2023-05-17 Thread vaquar khan
The following link will give you all the required details https://aws.amazon.com/blogs/containers/best-practices-for-running-spark-on-amazon-eks/ Let me know if you require further information. Regards, Vaquar khan On Mon, May 15, 2023, 10:14 PM Mich Talebzadeh wrote: > Couple of points > > Why

RE: Understanding Spark S3 Read Performance

2023-05-16 Thread info
Hi, For clarification, are those 12 / 14 minutes cumulative CPU time or wall clock time? How many executors executed those 1 / 375 tasks? Cheers, Enrico Original message From: Shashank Rao Date: 16.05.23 19:48 (GMT+01:00) To: user@spark.apache.org Subject:

Understanding Spark S3 Read Performance

2023-05-16 Thread Shashank Rao
Hi, I'm trying to set up a Spark pipeline which reads data from S3 and writes it into Google Big Query. Environment Details: --- Java 8 AWS EMR-6.10.0 Spark v3.3.1 2 m5.xlarge executor nodes S3 Directory structure: --- bucket-name: |---folder1: |---folder2:

Spark shuffle and inevitability of writing to Disk

2023-05-16 Thread Mich Talebzadeh
Hi, On the issue of Spark shuffle it is accepted that shuffle *often involves* the following if not all below: - Disk I/O - Data serialization and deserialization - Network I/O Excluding external shuffle service and without relying on the configuration options provided by spark for

Re: [spark-core] Can executors recover/reuse shuffle files upon failure?

2023-05-15 Thread Mich Talebzadeh
Couple of points. Why use spot or pre-emptible instances when your application, as you stated, shuffles heavily? Have you looked at why you are having these shuffles? What is the cause of these large transformations ending up in a shuffle? Also on your point: "..then ideally we should expect that when an

[spark-core] Can executors recover/reuse shuffle files upon failure?

2023-05-15 Thread Faiz Halde
Hello, We've been in touch with a few spark specialists who suggested us a potential solution to improve the reliability of our jobs that are shuffle heavy Here is what our setup looks like - Spark version: 3.3.1 - Java version: 1.8 - We do not use external shuffle service - We use

Pyspark cluster mode on standalone deployment

2023-05-14 Thread خالد القحطاني
Hi, Can I deploy my PySpark application on a standalone cluster in cluster mode? I believe it was not possible to do that, but I searched all the documentation and did not find it. My Spark standalone cluster version is 3.3.1

Re: Error while merge in delta table

2023-05-12 Thread Farhan Misarwala
Hi Karthick, If you have confirmed that the incompatibility between Delta and spark versions is not the case, then I would say the same what Jacek said earlier, there’s not enough “data” here. To further comment on it, we would need to know more on how you are structuring your multi threaded

Shuffle with Window().partitionBy()

2023-05-12 Thread ashok34...@yahoo.com.INVALID
Hello, In Spark, does a windowing call with Window().partitionBy() cause a shuffle to take place? If so, what is the performance impact, if any, when the result set is large? Thanks

Re: Error while merge in delta table

2023-05-12 Thread Karthick Nk
Hi Farhan, Thank you for your response. I am using Databricks with 11.3.x-scala2.12. Here I am overwriting all the tables in the same database in concurrent threads, but when I do it in an iterative manner it works fine. For example, I have 200 tables in the same database, and I am overwriting the

Does spark read the same file twice, if two stages are using the same DataFrame?

2023-05-11 Thread Vijay B
In my view spark is behaving as expected. TL:DR Every time a dataframe is reused or branched or forked the sequence operations evaluated run again. Use Cache or persist to avoid this behavior and un-persist when no longer required, spark does not un-persist automatically. Couple of things
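A minimal sketch of the cache-and-unpersist pattern described above (the path and column names are hypothetical):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Cache so that the two downstream actions share a single file scan.
    df = spark.read.csv("/tmp/df1.csv", header=True, inferSchema=True)
    df.cache()

    agg = df.groupBy("col1").count()
    distinct_vals = df.select("col2").distinct()

    agg.count()            # first action materialises the cache
    distinct_vals.count()  # second action reads from the cache, not the file again

    df.unpersist()         # Spark does not unpersist automatically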

Re: Error while merge in delta table

2023-05-11 Thread Farhan Misarwala
Hi Karthick, I think I have seen this before and this probably could be because of an incompatibility between your spark and delta versions. Or an incompatibility between the delta version you are using now vs the one you used earlier on the existing table you are merging with. Let me know if

Re: Error while merge in delta table

2023-05-11 Thread Jacek Laskowski
Hi Karthick, Sorry to say it but there's not enough "data" to help you. There should be something more above or below this exception snippet you posted that could pinpoint the root cause. Pozdrawiam, Jacek Laskowski "The Internals Of" Online Books Follow me on

Error while merge in delta table

2023-05-10 Thread Karthick Nk
Hi, I am trying to merge a dataframe with a delta table in Databricks, but I am getting an error. I have attached the code snippet and error message for reference below. code: [image: image.png] error: [image: image.png] Thanks

RE: Can Spark SQL (not DataFrame or Dataset) aggregate array into map of element of count?

2023-05-10 Thread Vijay B
Please see if this works -- aggregate array into map of element of count SELECT aggregate(array(1,2,3,4,5), map('cnt',0), (acc,x) -> map('cnt', acc.cnt+1)) as array_count thanks Vijay On 2023/05/05 19:32:04 Yong Zhang wrote: > Hi, This is on Spark 3.1 environment. > > For some reason, I can

Re: Does spark read the same file twice, if two stages are using the same DataFrame?

2023-05-09 Thread Mich Talebzadeh
When I run this job in local mode spark-submit --master local[4] with spark = SparkSession.builder \ .appName("tests") \ .enableHiveSupport() \ .getOrCreate() spark.conf.set("spark.sql.adaptive.enabled", "true") df3.explain(extended=True) and no caching I see this

Re: Can Spark SQL (not DataFrame or Dataset) aggregate array into map of element of count?

2023-05-09 Thread Yong Zhang
Hi, Mich: Thanks for your reply, but maybe I didn't make my question clear. I am looking for a solution to compute the count of each element in an array, without "exploding" the array, and output a Map structure as a column. For example, for an array as ('a', 'b', 'a'), I want to output a
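One pure-SQL way to get per-element counts without exploding, sketched with Spark's higher-order functions (available since 2.4, so usable on 3.1); the inline array('a','b','a') stands in for the real array column:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # For each distinct element, count its occurrences with filter/size, then zip
    # the two arrays back into a map, e.g. {'a' -> 2, 'b' -> 1}.
    spark.sql("""
        SELECT map_from_arrays(
                 array_distinct(arr),
                 transform(array_distinct(arr), x -> size(filter(arr, y -> y = x)))
               ) AS element_counts
        FROM (SELECT array('a', 'b', 'a') AS arr) AS t
    """).show(truncate=False)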

Re: Does spark read the same file twice, if two stages are using the same DataFrame?

2023-05-09 Thread Nitin Siwach
I do not think InMemoryFileIndex means it is caching the data. The caches get shown as InMemoryTableScan. InMemoryFileIndex is just for partition discovery and partition pruning. Any read will always show up as a scan from InMemoryFileIndex. It is not cached data. It is a cached file index. Please

Re: Does spark read the same file twice, if two stages are using the same DataFrame?

2023-05-09 Thread Mich Talebzadeh
When you run this in yarn mode, it uses Broadcast Hash Join for join operation as shown in the following output. The datasets here are the same size, so it broadcasts one dataset to all of the executors and then reads the same dataset and does a hash join. It is typical of joins . No surprises

unsubscribe

2023-05-09 Thread Balakumar iyer S

Re: Does spark read the same file twice, if two stages are using the same DataFrame?

2023-05-07 Thread Nitin Siwach
Thank you for the help Mich :) I have not started with a pandas DF. I have used pandas to create a dummy .csv which I dump on the disk that I intend to use to showcase my pain point. Providing pandas code was to ensure an end-to-end runnable example is provided and the effort on anyone trying to

Re: Does spark read the same file twice, if two stages are using the same DataFrame?

2023-05-07 Thread Mich Talebzadeh
You have started with a pandas DF which won't scale outside of the driver itself. Let us put that aside. df1.to_csv("./df1.csv",index_label = "index") ## write the dataframe to the underlying file system starting with spark df1 = spark.read.csv("./df1.csv", header=True, schema = schema) ## read

Re: Does spark read the same file twice, if two stages are using the same DataFrame?

2023-05-07 Thread Nitin Siwach
Thank you for your response, Sir. My understanding is that the final ```df3.count()``` is the only action in the code I have attached. In fact, I tried running the rest of the code (commenting out just the final df3.count()) and, as I expected, no computations were triggered On Sun, 7 May, 2023,

Re: Does spark read the same file twice, if two stages are using the same DataFrame?

2023-05-07 Thread Mich Talebzadeh
...However, In my case here I am calling just one action. .. ok, which line in your code is called one action? Mich Talebzadeh, Lead Solutions Architect/Engineering Lead Palantir Technologies Limited London United Kingdom view my Linkedin profile

Re: Does spark read the same file twice, if two stages are using the same DataFrame?

2023-05-07 Thread Nitin Siwach
@Vikas Kumar I am sorry but I thought that you had answered the other question that I had raised to the same email address yesterday. It was around the SQL tab in web UI and the output of .explain showing different plans. I get how using .cache I can ensure that the data from a particular

unsubscribe

2023-05-07 Thread Utkarsh Jain

Re: Does spark read the same file twice, if two stages are using the same DataFrame?

2023-05-07 Thread Winston Lai
When your memory is not sufficient to keep the cached data for your jobs in two different stages, it might be read twice because Spark might have to clear the previous cache for other jobs. In those cases, a spill may be triggered when Spark writes your data from memory to disk. One way to

Re: Can Spark SQL (not DataFrame or Dataset) aggregate array into map of element of count?

2023-05-06 Thread Mich Talebzadeh
you can create DF from your SQL RS and work with that in Python the way you want ## you don't need all these import findspark findspark.init() from pyspark.sql import SparkSession from pyspark import SparkContext from pyspark.sql import SQLContext from pyspark.sql.functions import udf, col,

Re: Write DataFrame with Partition and choose Filename in PySpark

2023-05-06 Thread Mich Talebzadeh
So what are you intending to do with the resultset produced? Mich Talebzadeh, Lead Solutions Architect/Engineering Lead Palantir Technologies Limited London United Kingdom view my Linkedin profile

Can Spark SQL (not DataFrame or Dataset) aggregate array into map of element of count?

2023-05-05 Thread Yong Zhang
Hi, This is on Spark 3.1 environment. For some reason, I can ONLY do this in Spark SQL, instead of either Scala or PySpark environment. I want to aggregate an array into a Map of element count, within that array, but in Spark SQL. I know that there is an aggregate function available like

Re: Write DataFrame with Partition and choose Filename in PySpark

2023-05-05 Thread Marco Costantini
Hi Mich, Thank you. Ah, I want to avoid bringing all data to the driver node. That is my understanding of what will happen in that case. Perhaps, I'll trigger a Lambda to rename/combine the files after PySpark writes them. Cheers, Marco. On Thu, May 4, 2023 at 5:25 PM Mich Talebzadeh wrote: >

Re: Write DataFrame with Partition and choose Filename in PySpark

2023-05-04 Thread Mich Talebzadeh
you can try df2.coalesce(1).write.mode("overwrite").json("/tmp/pairs.json") hdfs dfs -ls /tmp/pairs.json Found 2 items -rw-r--r-- 3 hduser supergroup 0 2023-05-04 22:21 /tmp/pairs.json/_SUCCESS -rw-r--r-- 3 hduser supergroup 96 2023-05-04 22:21

Re: Write DataFrame with Partition and choose Filename in PySpark

2023-05-04 Thread Marco Costantini
Hi Mich, Thank you. Are you saying this satisfies my requirement? On the other hand, I am smelling something going on. Perhaps the Spark 'part' files should not be thought of as files, but rather pieces of a conceptual file. If that is true, then your approach (of which I'm well aware) makes

Re: Write DataFrame with Partition and choose Filename in PySpark

2023-05-04 Thread Mich Talebzadeh
AWS S3, or Google gs are hadoop compatible file systems (HCFS) , so they do sharding to improve read performance when writing to HCFS file systems. Let us take your code for a drive import findspark findspark.init() from pyspark.sql import SparkSession from pyspark.sql.functions import struct

Write DataFrame with Partition and choose Filename in PySpark

2023-05-04 Thread Marco Costantini
Hello, I am testing writing my DataFrame to S3 using the DataFrame `write` method. It mostly does a great job. However, it fails one of my requirements. Here are my requirements. - Write to S3 - use `partitionBy` to automatically make folders based on my chosen partition columns - control the

Re: Write custom JSON from DataFrame in PySpark

2023-05-04 Thread Marco Costantini
Hi Enrico, What a great answer. Thank you. Seems like I need to get comfortable with the 'struct' and then I will be golden. Thank you again, friend. Marco. On Thu, May 4, 2023 at 3:00 AM Enrico Minack wrote: > Hi, > > You could rearrange the DataFrame so that writing the DataFrame as-is >

Re: Write custom JSON from DataFrame in PySpark

2023-05-04 Thread Enrico Minack
Hi, You could rearrange the DataFrame so that writing the DataFrame as-is produces your structure: df = spark.createDataFrame([(1, "a1"), (2, "a2"), (3, "a3")], "id int, datA string") +---++ | id|datA| +---++ |  1|  a1| |  2|  a2| |  3|  a3| +---++ df2 = df.select(df.id,
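A sketch of the struct-based rearrangement described here (the output path is hypothetical): nesting datA inside a struct named stuff makes the default JSON writer produce the requested shape.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import struct

    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame([(1, "a1"), (2, "a2"), (3, "a3")], "id int, datA string")

    # Nest datA under "stuff" so each output line is {"id":1,"stuff":{"datA":"a1"}}.
    df2 = df.select(df.id, struct(df.datA).alias("stuff"))
    df2.write.mode("overwrite").json("/tmp/custom_json")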

Write custom JSON from DataFrame in PySpark

2023-05-03 Thread Marco Costantini
Hello, Let's say I have a very simple DataFrame, as below. +---++ | id|datA| +---++ | 1| a1| | 2| a2| | 3| a3| +---++ Let's say I have a requirement to write this to a bizarre JSON structure. For example: { "id": 1, "stuff": { "datA": "a1" } } How can I achieve

How to create spark udf use functioncatalog?

2023-05-03 Thread tzxxh
We are using Spark. Today I saw the FunctionCatalog, and I have looked at the source of spark\sql\core\src\test\scala\org\apache\spark\sql\connector\DataSourceV2FunctionSuite.scala and have implemented the ScalarFunction. But I still do not know how to register it in SQL

unsubscribe

2023-05-03 Thread Kang
-- Best Regards! Kang

Re: How to determine the function of tasks on each stage in an Apache Spark application?

2023-05-02 Thread Trường Trần Phan An
Hi all, I have written a program and overridden two events onStageCompleted and onTaskEnd. However, these two events do not provide information on when a Task/Stage is completed. What I want to know is which Task corresponds to which stage of a DAG (the Spark history server only tells me how

CVE-2023-32007: Apache Spark: Shell command injection via Spark UI

2023-05-02 Thread Arnout Engelen
Severity: important Affected versions: - Apache Spark 3.1.1 before 3.2.2 Description: ** UNSUPPORTED WHEN ASSIGNED ** The Apache Spark UI offers the possibility to enable ACLs via the configuration option spark.acls.enable. With an authentication filter, this checks whether a user has access

Unsubscribe

2023-05-02 Thread rau-jannik
Sent from the mobile Mail app - To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Unsubscribe

2023-05-01 Thread peng

Unsubscribe

2023-05-01 Thread sandeep vura
-- Sandeep V

Re: Change column values using several when conditions

2023-05-01 Thread Bjørn Jørgensen
you can check if the value exists by using distinct before you loop over the dataset. On Mon 1 May 2023 at 10:38, marc nicole wrote: > Hello > > I want to change values of a column in a dataset according to a mapping > list that maps original values of that column to other new values. Each >

Change column values using several when conditions

2023-05-01 Thread marc nicole
Hello I want to change values of a column in a dataset according to a mapping list that maps original values of that column to other new values. Each element of the list (colMappingValues) is a string that separates the original values from the new values using a ";". So for a given column (in
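A sketch of one way to apply such a mapping with a single chained when() expression; the column name and mapping values below are made up, following the "original;new" format described above:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical mapping list in the "original;new" format.
    col_mapping_values = ["a;x", "b;y", "c;z"]

    df = spark.createDataFrame([("a",), ("b",), ("d",)], ["col1"])

    # Build one chained when(...) expression instead of looping over the data.
    expr = None
    for pair in col_mapping_values:
        original, new = pair.split(";")
        expr = (F.when(F.col("col1") == original, new) if expr is None
                else expr.when(F.col("col1") == original, new))

    # Values with no mapping keep their original value.
    df = df.withColumn("col1", expr.otherwise(F.col("col1")))
    df.show()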

How to change column values using several when conditions ?

2023-04-30 Thread marc nicole
Hello to you Sparkling community :) I want to change values of a column in a dataset according to a mapping list that maps original values of that column to other new values. Each element of the list (colMappingValues) is a string that separates the original values from the new values using a

Any experience with K8s Remote Shuffling Service at scale?

2023-04-30 Thread Andrey Gourine
Hi All, I am looking for people that have experience running external shuffling service at scale with Spark 3 and K8s I have already tried internal shuffling service (available from spark 3) and trying to work with Uniffle (Incubating) Any other

Re: Tensorflow on Spark CPU

2023-04-30 Thread Sean Owen
There is a large overhead to distributing this type of workload. I imagine that for a small problem, the overhead dominates. You do not remotely need to distribute a problem of this size, so more workers is probably just worse. On Sun, Apr 30, 2023 at 1:46 AM second_co...@yahoo.com <

How to read text files with GBK encoding in the spark core

2023-04-30 Thread lianyou1...@126.com
Hello all, Is there any way to use the pyspark core to read some text files with GBK encoding? Although the pyspark sql has an option to set the encoding, but these text files are not structural format. Any advices are appreciated. Thank you lianyou Li
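One possible workaround, sketched here with a hypothetical path: since textFile assumes UTF-8, read the raw bytes with binaryFiles and decode them as GBK on the Python side (each file then arrives as a single record, so this suits many small files rather than one huge file).

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    # Read whole files as bytes, decode each as GBK, then split into lines.
    lines = (sc.binaryFiles("/data/gbk_texts/*.txt")
               .flatMap(lambda name_bytes: name_bytes[1].decode("gbk").splitlines()))
    print(lines.take(5))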

Re: Tensorflow on Spark CPU

2023-04-30 Thread second_co...@yahoo.com.INVALID
I re-tested with the cifar10 example and below are the results. Can you advise why fewer num_slots is faster compared with more slots? num_slots=20: 231 seconds, num_slots=5: 52 seconds, num_slots=1: 34 seconds. The code is at https://gist.github.com/cometta/240bbc549155e22f80f6ba670c9a2e32 Do you

Re: Tensorflow on Spark CPU

2023-04-29 Thread Sean Owen
You don't want to use CPUs with Tensorflow. If it's not scaling, you may have a problem that is far too small to distribute. On Sat, Apr 29, 2023 at 7:30 AM second_co...@yahoo.com.INVALID wrote: > Anyone successfully run native tensorflow on Spark ? i tested example at >

Tensorflow on Spark CPU

2023-04-29 Thread second_co...@yahoo.com.INVALID
Has anyone successfully run native TensorFlow on Spark? I tested the example at https://github.com/tensorflow/ecosystem/tree/master/spark/spark-tensorflow-distributor on Kubernetes CPU, running it on multiple workers' CPUs. I do not see any speed up in training time by setting the number of slots from 1

driver and executors shared same Kubernetes PVC

2023-04-28 Thread second_co...@yahoo.com.INVALID
I was able to share the same PVC on Spark 3.3, but on Spark 3.4 onward I get the below error. I would like all the executors and the driver to mount the same PVC. Is this a bug? I don't want to use SPARK_EXECUTOR_ID or OnDemand because otherwise each of the executors will use a unique and separate PVC.

Re: ***pyspark.sql.functions.monotonically_increasing_id()***

2023-04-28 Thread Winston Lai
Hi Karthick, A few points that may help you: As stated in the URL you posted, "The function is non-deterministic because its result depends on partition IDs." Hence, the generated ID is dependent on partition IDs. Based on the code snippet you provided, I didn't see the partition columns you
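A small sketch of the behaviour being described, plus a deterministic alternative (at the cost of pulling the global ordering through a single partition):

    from pyspark.sql import SparkSession, Window
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    df = spark.range(0, 10, 1, numPartitions=4).withColumnRenamed("id", "value")

    # monotonically_increasing_id encodes the partition id in the upper bits, so the
    # values are unique and increasing but not consecutive, and they depend on how
    # the data happens to be partitioned.
    df = df.withColumn("mono_id", F.monotonically_increasing_id())

    # Deterministic alternative: row_number over an explicit ordering.
    df = df.withColumn("row_id", F.row_number().over(Window.orderBy("value")))
    df.show()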

***pyspark.sql.functions.monotonically_increasing_id()***

2023-04-28 Thread Karthick Nk
Hi @all, I am using monotonically_increasing_id(), in the pyspark function, for removing one field from json field in one column from the delta table, please refer the below code df = spark.sql(f"SELECT * from {database}.{table}") df1 = spark.read.json(df.rdd.map(lambda x: x.data), multiLine =

Re: config: minOffsetsPerTrigger not working

2023-04-27 Thread Abhishek Singla
Thanks, Mich, for acknowledging. Yes, I am providing the checkpoint path. I omitted it here in the code snippet. I believe this is due to Spark version 3.1.x; this config is only available in 3.2.x and above. On Thu, Apr 27, 2023 at 9:26 PM Mich Talebzadeh wrote: > Is this all of your

Re: config: minOffsetsPerTrigger not working

2023-04-27 Thread Mich Talebzadeh
Is this all of your writeStream? df.writeStream() .foreachBatch(new KafkaS3PipelineImplementation(applicationId, appConfig)) .start() .awaitTermination(); What happened to the checkpoint location? option('checkpointLocation', checkpoint_path). example checkpoint_path =

config: minOffsetsPerTrigger not working

2023-04-27 Thread Abhishek Singla
Hi Team, I am using Spark Streaming to read from Kafka and write to S3. Version: 3.1.2 Scala Version: 2.12 Spark Kafka connector: spark-sql-kafka-0-10_2.12 Dataset df = spark .readStream() .format("kafka") .options(appConfig.getKafka().getConf()) .load()
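For reference, a sketch of how these options are set on the Kafka source; they are only honoured on Spark 3.2+, which matches the diagnosis in this thread, and the broker and topic names are hypothetical:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # minOffsetsPerTrigger waits for at least this many offsets per micro-batch;
    # maxTriggerDelay caps how long to wait for that minimum (both Spark 3.2+).
    df = (spark.readStream
               .format("kafka")
               .option("kafka.bootstrap.servers", "broker:9092")
               .option("subscribe", "events")
               .option("minOffsetsPerTrigger", "100000")
               .option("maxTriggerDelay", "5m")
               .load())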

Re: What is the best way to organize a join within a foreach?

2023-04-27 Thread Amit Joshi
Hi Marco, I am not sure if you will get access to the dataframe inside the foreach, as the Spark context is not serializable, if I remember correctly. One thing you can do: use the cogroup operation on both datasets. This will give you (key, iter(v1), iter(v2)). And then use for each partition

RE: Spark Kubernetes Operator

2023-04-26 Thread Aldo Culquicondor
We are welcoming contributors, as announced in the Kubernetes WG Batch https://docs.google.com/document/d/1XOeUN-K0aKmJJNq7H07r74n-mGgSFyiEDQ3ecwsGhec/edit#bookmark=id.gfgjt0nmbgjl If you are interested, you can find us in slack.k8s.io #wg-batch or ping @mwielgus on github/slack. Thanks On

Re: What is the best way to organize a join within a foreach?

2023-04-26 Thread Mich Talebzadeh
Again, one try is worth many opinions. Try it, gather metrics from the Spark UI and see how it performs. On Wed, 26 Apr 2023 at 14:57, Marco Costantini < marco.costant...@rocketfncl.com> wrote: > Thanks team, > Email was just an example. The point was to illustrate that some actions > could be

Re: What is the best way to organize a join within a foreach?

2023-04-26 Thread Marco Costantini
Thanks team, Email was just an example. The point was to illustrate that some actions could be chained using Spark's foreach. In reality, this is an S3 write and a Kafka message production, which I think is quite reasonable for spark to do. To answer Ayan's first question. Yes, all a users

Re: What is the best way to organize a join within a foreach?

2023-04-26 Thread Mich Talebzadeh
Indeed very valid points by Ayan. How is email going to handle 1000s of records? As a solution architect I tend to replace users by customers, and for each order there must be products, a sort of many-to-many relationship. If I was a customer I would also be interested in product details as

Re: What is the best way to organize a join within a foreach?

2023-04-26 Thread ayan guha
Adding to what Mitch said, 1. Are you trying to send statements of all orders to all users? Or the latest order only? 2. Sending email is not a good use of spark. instead, I suggest to use a notification service or function. Spark should write to a queue (kafka, sqs...pick your choice here).
