Re: About some Spark technical help

2021-12-24 Thread sam smith
Thanks for the feedback Andrew. On Sat, Dec 25, 2021 at 03:17, Andrew Davidson wrote: > Hi Sam > > It is kind of hard to review straight code. Adding some sample data, > a unit test, and expected results would be a good place to start, i.e. > determine the fidelity of your implementation

Re: measure running time

2021-12-24 Thread bitfox
Thanks a lot Hollis. It was indeed due to the PyPI version. Now I updated it.

$ pip3 -V
pip 9.0.1 from /usr/lib/python3/dist-packages (python 3.6)
$ pip3 install sparkmeasure
Collecting sparkmeasure
  Using cached

Re: About some Spark technical help

2021-12-24 Thread Andrew Davidson
Hi Sam, It is kind of hard to review straight code. Adding some sample data, a unit test, and expected results would be a good place to start, i.e. determine the fidelity of your implementation compared to the original. Also, a verbal description of the algorithm would be helpful. Happy Holidays

Re: measure running time

2021-12-24 Thread Hollis
Hi, I can run this on my PC. I checked the email chain: bitfox installed sparkmeasure with Python 2 but launched pyspark with Python 3. I think that's the reason. Regards, Hollis

Replied mail: From Mich Talebzadeh | Date 12/25/2021 00:25 | To Sean Owen | Cc

Re: df.show() to text file

2021-12-24 Thread Sean Owen
You can redirect the stdout of your program, I guess, but show is for display, not saving data. Use the df.write methods for that. On Fri, Dec 24, 2021, 7:02 PM wrote: > Hello list, > > spark newbie here :0 > How can I write the df.show() result to a text file on the system? > I run with pyspark, not
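
A minimal sketch of the df.write approach Sean describes, assuming an existing SparkSession named spark; the output paths are illustrative:

# df.show() only prints for humans; persist with the DataFrameWriter instead
df = spark.range(10).toDF("id")
df.write.mode("overwrite").csv("/tmp/df_csv")
# the text source requires a single string column
df.selectExpr("cast(id as string) as value").write.mode("overwrite").text("/tmp/df_txt")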

Re: OOM Joining thousands of dataframes Was: AnalysisException: Trouble using select() to append multiple columns

2021-12-24 Thread Hollis
Hi, it is the same thing whether you are using the SQL or the DataFrame API; both are optimized by Catalyst and then pushed down to RDDs. In this case there are many iterations (16000 of them), so you get a very big execution plan when you join the dataframe again and again. I think this
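
As a sketch of one common mitigation for this plan blow-up (assuming an existing SparkSession named spark; the loop size and per-iteration frames are illustrative), periodically checkpointing truncates the accumulated lineage:

spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")

result = spark.range(100).toDF("key")
for i in range(50):                       # stand-in for the thousands of joins
    step = spark.range(100).toDF("key")   # illustrative per-iteration frame
    result = result.join(step, "key")
    if (i + 1) % 10 == 0:
        # cut the ever-growing plan/lineage every few iterations
        result = result.checkpoint(eager=True)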

df.show() to text file

2021-12-24 Thread bitfox
Hello list, spark newbie here :0 How can I write the df.show() result to a text file on the system? I run with pyspark, not Python client programming. Thanks.

Re: Dataframe's storage size

2021-12-24 Thread Gourav Sengupta
Hi, even cached data takes up different amounts of memory for dataframes with exactly the same data, depending on a lot of conditions. I generally tend to try to understand the problem before jumping to conclusions through assumptions, sadly a habit I cannot overcome. Is there a way to understand what

Re: OOM Joining thousands of dataframes Was: AnalysisException: Trouble using select() to append multiple columns

2021-12-24 Thread Gourav Sengupta
Hi, maybe I am getting confused as always :), but the requirement looked pretty simple to me to implement in SQL, or perhaps it is just the euphoria of Christmas eve. Anyway, in case the above can be implemented in SQL, I can have a look at it. Yes, indeed there are bespoke scenarios where

Re: OOM Joining thousands of dataframes Was: AnalysisException: Trouble using select() to append multiple columns

2021-12-24 Thread Sean Owen
This is simply not generally true, no, and not in this case. The programmatic and SQL APIs overlap a lot, and where they do, they're essentially aliases. Use whatever is more natural. What I wouldn't recommend doing is emulating SQL-like behavior in custom code, UDFs, etc. The native operators

Re: OOM Joining thousands of dataframes Was: AnalysisException: Trouble using select() to append multiple columns

2021-12-24 Thread Gourav Sengupta
Hi, yeah, I think that in practice you will always find that dataframes can give issues regarding a lot of things, and then you can argue. At the Spark conference, I think last year, it was shown that more than 92% or 95% of users use the Spark SQL API, if I am not mistaken. I think that you can do the

OOM Joining thousands of dataframes Was: AnalysisException: Trouble using select() to append multiple columns

2021-12-24 Thread Andrew Davidson
Hi Sean and Gourav, Thanks for the suggestions. I thought that both the SQL and DataFrame APIs are wrappers around the same framework, i.e. Catalyst? I tend to mix and match my code: sometimes I find it easier to write using SQL, sometimes dataframes. What is considered best practice? Here

Re: AnalysisException: Trouble using select() to append multiple columns

2021-12-24 Thread Sean Owen
(That's not the situation we are commenting on below.) On Fri, Dec 24, 2021, 9:28 AM Gourav Sengupta wrote: > Hi, > > try to write several withColumn calls on a dataframe with functions and then > see the UI for time differences. This should be done with large data sets > of course, on the order of a

Re: measure running time

2021-12-24 Thread Mich Talebzadeh
Hi Sean, I have already discussed the issue in my case (Spark 3.1.1 with sparkmeasure) with the author, Luca Canali. It has been reproduced; I think we ought to wait for a patch. HTH, Mich

Re: AnalysisException: Trouble using select() to append multiple columns

2021-12-24 Thread Gourav Sengupta
Hi, try to write several withColumn calls on a dataframe with functions and then see the UI for time differences. This should be done with large data sets of course, on the order of around 200 GB+. With scenarios involving nested queries and joins, the time differences shown in the UI become a bit more
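
For reference, the two styles compared in this thread might look as follows (a sketch assuming an existing SparkSession named spark; column names are illustrative). A single select appends all new columns in one projection instead of stacking one projection per withColumn:

from pyspark.sql import functions as F

df = spark.range(5).toDF("id")

# one withColumn per new column: each call adds another projection
df1 = df.withColumn("a", F.col("id") + 1).withColumn("b", F.col("id") * 2)

# equivalent single select: both columns appended in one projection
df2 = df.select("*", (F.col("id") + 1).alias("a"), (F.col("id") * 2).alias("b"))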

Re: measure running time

2021-12-24 Thread Sean Owen
You probably did not install it on your cluster, nor include the Python package with your app. On Fri, Dec 24, 2021, 4:35 AM wrote: > but I already installed it: > > Requirement already satisfied: sparkmeasure in > /usr/local/lib/python2.7/dist-packages > > so how? thank you. > > On 2021-12-24
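
A sketch of what installing or shipping the package might look like; the zip file name is hypothetical and the sparkmeasure Maven coordinates are an assumption to verify against the sparkmeasure docs:

$ pip3 install sparkmeasure            # on every node, or ship it with the app:
$ spark-submit \
    --packages ch.cern.sparkmeasure:spark-measure_2.12:0.17 \
    --py-files sparkmeasure_deps.zip \
    my_app.py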

Re: Dataframe's storage size

2021-12-24 Thread Sean Owen
I assume it means size in memory when cached, which does make sense. The fastest thing is to look at it in the UI's Storage tab after it is cached. On Fri, Dec 24, 2021, 4:54 AM Gourav Sengupta wrote: > Hi, > > This question, once again like the last one, does not make much sense at > all. Where are
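
A quick way to try that, as a minimal sketch assuming an existing SparkSession named spark:

df = spark.range(1000000).toDF("id")
df.cache()
df.count()  # materializes the cache; the in-memory size then shows in the UI's Storage tab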

Re: AnalysisException: Trouble using select() to append multiple columns

2021-12-24 Thread Sean Owen
Nah, it's going to translate to the same plan as the equivalent SQL. On Fri, Dec 24, 2021, 5:09 AM Gourav Sengupta wrote: > Hi, > > please note that using SQL is much more performant, and easier to manage > these kinds of issues. You might want to look at the SPARK UI to see the > advantage of
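
One way to verify this yourself is to compare the plans directly; a minimal sketch, assuming an existing SparkSession named spark:

df = spark.range(10).toDF("id")
df.createOrReplaceTempView("t")

spark.sql("SELECT id + 1 AS x FROM t").explain(True)  # extended plan for the SQL form
df.selectExpr("id + 1 AS x").explain(True)            # the optimized plan should match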

Re: Conda Python Env in K8S

2021-12-24 Thread Hyukjin Kwon
Can you share the logs, settings, environment, etc. and file a JIRA? There are integration test cases for K8S support, and I myself also tested it before. It would be helpful if you try what I did at https://databricks.com/blog/2020/12/22/how-to-manage-python-dependencies-in-pyspark.html and see
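
Roughly, the conda-pack flow from the linked post looks like this (a sketch; the environment name, Python version, and packages are illustrative):

$ conda create -y -n pyspark_env -c conda-forge python=3.8 numpy
$ conda activate pyspark_env
$ pip install conda-pack
$ conda pack -f -o pyspark_env.tar.gz
$ export PYSPARK_PYTHON=./environment/bin/python
$ spark-submit --archives pyspark_env.tar.gz#environment app.py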

Re: About some Spark technical help

2021-12-24 Thread sam smith
Hi Gourav, Good question! That's the programming language I am most proficient at. You are always welcome to suggest corrective remarks about my (Spark) code. Kind regards. On Fri, Dec 24, 2021 at 11:58, Gourav Sengupta wrote: > Hi, > > out of sheer and utter curiosity, why JAVA? > >

Re: AnalysisException: Trouble using select() to append multiple columns

2021-12-24 Thread Gourav Sengupta
Hi, please note that using SQL is much more performant and makes it easier to manage these kinds of issues. You might want to look at the Spark UI to see the advantage of using SQL over the DataFrame API. Regards, Gourav Sengupta On Sat, Dec 18, 2021 at 5:40 PM Andrew Davidson wrote: > Thanks Nicholas >

Re: Unable to use WriteStream to write to delta file.

2021-12-24 Thread Gourav Sengupta
Hi, also please ensure that you have read all the required documentation to understand whether you need to do any metadata migration or not. Regards, Gourav Sengupta On Sun, Dec 19, 2021 at 11:55 AM Alex Ott wrote: > Make sure that you're using a compatible version of the Delta Lake library. For >

Re: About some Spark technical help

2021-12-24 Thread Gourav Sengupta
Hi, out of sheer and utter curiosity, why JAVA? Regards, Gourav Sengupta On Thu, Dec 23, 2021 at 5:10 PM sam smith wrote: > Hi Andrew, > > Thanks, here's the GitHub repo with the code and the publication: > https://github.com/SamSmithDevs10/paperReplicationForReview > > Kind regards > > On

Re: Dataframe's storage size

2021-12-24 Thread Gourav Sengupta
Hi, This question, once again like the last one, does not make much sense at all. Where are you trying to store the data frame, and how? Are you just trying to write a blog, as you were mentioning in an earlier email, and trying to fill in some gaps? I think that the questions are entirely

Re: measure running time

2021-12-24 Thread Gourav Sengupta
Hi, There are too many blogs out there with absolutely no value. Before writing another blog doing run-time comparisons between RDDs and dataframes, which (as stated earlier) does not make much sense, it may be useful to first understand what you are trying to achieve by writing it.

Re: measure running time

2021-12-24 Thread bitfox
As you see below:

$ pip install sparkmeasure
Collecting sparkmeasure
  Using cached https://files.pythonhosted.org/packages/9f/bf/c9810ff2d88513ffc185e65a3ab9df6121ad5b4c78aa8d134a06177f9021/sparkmeasure-0.14.0-py2.py3-none-any.whl
Installing collected packages: sparkmeasure
Successfully

Re: measure running time

2021-12-24 Thread bitfox
but I already installed it:

Requirement already satisfied: sparkmeasure in /usr/local/lib/python2.7/dist-packages

so how? thank you. On 2021-12-24 18:15, Hollis wrote: Hi bitfox, you need to pip install sparkmeasure first; then you can launch it in pyspark. from sparkmeasure import StageMetrics

Re: Re: measure running time

2021-12-24 Thread Hollis
Hi bitfox, you need to pip install sparkmeasure first; then you can launch it in pyspark:

>>> from sparkmeasure import StageMetrics
>>> stagemetrics = StageMetrics(spark)
>>> stagemetrics.runandmeasure(locals(), 'spark.sql("select count(*) from
>>> range(1000) cross join range(1000) cross join
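
For completeness, a runnable version of that snippet might look like the following (a sketch; it assumes pyspark was started with the sparkmeasure jar available, e.g. via --packages, and the cross-join query is only illustrative):

from sparkmeasure import StageMetrics

stagemetrics = StageMetrics(spark)
stagemetrics.runandmeasure(
    locals(),
    'spark.sql("select count(*) from range(1000) cross join range(1000)").show()'
)
stagemetrics.print_report()  # prints stage-level metrics for the measured code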