Re: A Persisted Spark DataFrame is computed twice

2022-01-30 Thread Benjamin Du
I don't think coalesce (by repartitioning I assume you mean coalesce) itself and deserialising take that much time. To add a little bit more context, the computation of the DataFrame is CPU intensive rather than data/IO intensive. I purposely keep coalesce after df.count as I want to keep the

Re: [ANNOUNCE] Apache Kyuubi (Incubating) released 1.4.1-incubating

2022-01-30 Thread Vino Yang
Hi, As you can see from the description on the website[1] of Apache Kyuubi (incubating): "Kyuubi provides a pure SQL gateway through Thrift JDBC/ODBC interface for end-users to manipulate large-scale data with pre-programmed and extensible Spark SQL engines." [1]: https://kyuubi.apache.org/

Re: [ANNOUNCE] Apache Kyuubi (Incubating) released 1.4.1-incubating

2022-01-30 Thread Bitfox
What’s the difference between Spark and Kyuubi? Thanks On Mon, Jan 31, 2022 at 2:45 PM Vino Yang wrote: > Hi all, > > The Apache Kyuubi (Incubating) community is pleased to announce that > Apache Kyuubi (Incubating) 1.4.1-incubating has been released! > > Apache Kyuubi (Incubating) is a

[ANNOUNCE] Apache Kyuubi (Incubating) released 1.4.1-incubating

2022-01-30 Thread Vino Yang
Hi all, The Apache Kyuubi (Incubating) community is pleased to announce that Apache Kyuubi (Incubating) 1.4.1-incubating has been released! Apache Kyuubi (Incubating) is a distributed multi-tenant JDBC server for large-scale data processing and analytics, built on top of Apache Spark and

Unsubscribe

2022-01-30 Thread Yogitha Ramanathan

Re: Migration to Spark 3.2

2022-01-30 Thread Aurélien Mazoyer
Hi Stephen, Thank you for your answer. Yes, I changed the scope to "provided" but got the same error :-( FYI, I am getting this error while running tests. Regards, Aurelien On Thu, Jan 27, 2022 at 23:57, Stephen Coy wrote: > Hi Aurélien, > > Your Jackson versions look fine. > > What

RE: why the pyspark RDD API is so slow?

2022-01-30 Thread Theodore J Griesenbrock
Any particular code sample you can suggest to illustrate your tips? > On Jan 30, 2022, at 06:16, Sebastian Piu wrote: > > It's because all data needs to be pickled back and forth between java and a

Re: how can I remove the warning message

2022-01-30 Thread Sean Owen
This one you can ignore. It's from the JVM so you might be able to disable it by configuring the right JVM logger as well, but it also tells you right in the message how to turn it off! But this is saying that some reflective operations are discouraged in Java 9+. They still work and Spark needs
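
For reference, a hedged sketch of one way to silence these JDK 9+ reflective-access warnings, assuming the standard --add-opens mechanism applies; the module/package below is purely illustrative and should be taken from the actual warning text:

    # Illustrative spark-submit flags; read the real module/package from the
    # warning message itself. java.base/java.lang here is only an example.
    spark-submit \
      --conf "spark.driver.extraJavaOptions=--add-opens=java.base/java.lang=ALL-UNNAMED" \
      --conf "spark.executor.extraJavaOptions=--add-opens=java.base/java.lang=ALL-UNNAMED" \
      my_app.py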

Re: unsubscribe

2022-01-30 Thread Bitfox
The signature in your mail already shows the info: To unsubscribe e-mail: user-unsubscr...@spark.apache.org On Sun, Jan 30, 2022 at 8:50 PM Lucas Schroeder Rossi wrote: > unsubscribe > > - > To unsubscribe e-mail:

unsubscribe

2022-01-30 Thread Lucas Schroeder Rossi
unsubscribe - To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: why the pyspark RDD API is so slow?

2022-01-30 Thread Sebastian Piu
It's because all data needs to be pickled back and forth between Java and a spawned Python worker, so there is additional overhead compared to staying fully in Scala. Your Python code might make this worse too, for example if not yielding from operations. You can look at using UDFs and Arrow or trying
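
As a hedged illustration of the Arrow route mentioned above (not the poster's code; column and function names are made up), a pandas UDF moves data across the JVM/Python boundary as vectorized Arrow batches instead of pickling one row at a time:

    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf

    spark = SparkSession.builder.getOrCreate()

    # Vectorized UDF: whole pandas Series cross the JVM/Python boundary
    # as Arrow record batches, avoiding per-row pickling.
    @pandas_udf("double")
    def times_two(v: pd.Series) -> pd.Series:
        return v * 2.0

    df = spark.range(1_000_000)          # toy data; "id" is a bigint column
    df.select(times_two("id")).show(3)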

why the pyspark RDD API is so slow?

2022-01-30 Thread Bitfox
Hello list, I did a comparison for pyspark RDD, scala RDD, pyspark dataframe and a pure scala program. The result shows the pyspark RDD is too slow. For the operations and dataset please see: https://blog.cloudcache.net/computing-performance-comparison-for-words-statistics/ The result table is

Re: [Spark UDF]: Where does UDF stores temporary Arrays/Sets

2022-01-30 Thread Gourav Sengupta
Hi, Can you please try to see if you can increase the number of cores per task, and thereby give each task more memory per executor? I do not understand what the XML is, what data it contains, or what problem you are trying to solve by writing UDFs to parse XML. So maybe we are not
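
A hedged sketch of the cores-per-task suggestion, assuming spark.task.cpus is the setting meant; with 4 executor cores and 2 cpus reserved per task, at most 2 tasks run concurrently per executor, so each task gets a larger share of executor memory. All numbers are illustrative:

    from pyspark.sql import SparkSession

    # Illustrative values: 4 cores per executor, 2 cpus reserved per task
    # => at most 2 concurrent tasks per executor, more memory headroom each.
    spark = (
        SparkSession.builder
        .config("spark.executor.cores", "4")
        .config("spark.task.cpus", "2")
        .getOrCreate()
    )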

Re: A Persisted Spark DataFrame is computed twice

2022-01-30 Thread Gourav Sengupta
Hi, without getting into suppositions, the best option is to look into the SPARK UI SQL section. It is the most wonderful tool to explain what is happening, and why. In SPARK 3.x they have made the UI even better, with different sets of granularity and detail. On another note, you might want to

Re: How to delete the record

2022-01-30 Thread Gourav Sengupta
Hi, I think it will be useful to understand the problem before solving the problem. Can I please ask what this table is? Is it a fact (event store) kind of a table, or a dimension (master data) kind of table? And what are the downstream consumptions of this table? Besides that what is the

Re: how can I remove the warning message

2022-01-30 Thread Gourav Sengupta
Hi, I have often found that keeping warnings in the logs is extremely useful; they are just logs, and they provide a lot of insight during upgrades, external package loading, deprecation, debugging, etc. Do you have any particular reason to disable the warnings in a submitted job? I used to disable

Re: A Persisted Spark DataFrame is computed twice

2022-01-30 Thread Deepak Sharma
coalesce returns a new dataset. That will cause the recomputation. Thanks Deepak On Sun, 30 Jan 2022 at 14:06, Benjamin Du wrote: > I have some PySpark code like below. Basically, I persist a DataFrame > (which is time-consuming to compute) to disk, call the method > DataFrame.count to trigger

Re: A Persisted Spark DataFrame is computed twice

2022-01-30 Thread Sebastian Piu
It's probably the repartitioning and deserialising of the df that you are seeing take time. Try doing this: 1. Add another count after your current one and compare times. 2. Move coalesce before persist. You should see On Sun, 30 Jan 2022, 08:37 Benjamin Du, wrote: > I have some PySpark code like
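
A hedged sketch of suggestion 2 above, with a hypothetical stand-in for the expensive computation and illustrative numbers; the idea is that the persisted plan then already carries the target partitioning:

    from pyspark import StorageLevel

    df = (
        expensive_computation(spark)   # hypothetical stand-in for the costly DataFrame
        .coalesce(300)                 # repartition first...
        .persist(StorageLevel.DISK_ONLY)
    )
    df.count()                         # ...so this materializes the coalesced data once
    df.write.mode("overwrite").parquet("/tmp/out")  # should reuse the persisted partitions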

Re: Kafka to spark streaming

2022-01-30 Thread Gourav Sengupta
Hi Amit, before answering your question, I am just trying to understand it. I am not exactly clear how the Akka application, Kafka, and SPARK Streaming application sit together, or what exactly you are trying to achieve. Can you please elaborate? Regards, Gourav On Fri, Jan 28, 2022 at

A Persisted Spark DataFrame is computed twice

2022-01-30 Thread Benjamin Du
I have some PySpark code like below. Basically, I persist a DataFrame (which is time-consuming to compute) to disk, call the method DataFrame.count to trigger the caching/persist immediately, and then I coalesce the DataFrame to reduce the number of partitions (the original DataFrame has 30,000
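
The code itself is truncated in this digest; a minimal sketch of the pattern as described, with a hypothetical stand-in for the expensive computation and illustrative paths and numbers:

    from pyspark import StorageLevel

    df = expensive_computation(spark)      # hypothetical; yields ~30,000 partitions
    df = df.persist(StorageLevel.DISK_ONLY)
    df.count()                             # meant to force materialization here
    df = df.coalesce(300)                  # reduce the number of partitions
    df.write.mode("overwrite").parquet("/tmp/out")  # reportedly recomputes the DataFrame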