Re: Duplicates in Collaborative Filtering Output

2023-01-23 Thread Kartik Ohri
Hi again! Ironically, soon after sending the previous email I actually found the bug in our setup that was resulting in duplicates and it wasn't Mllib ALS after all. Sorry for the confusion. Regards. On Mon, Jan 23, 2023 at 1:09 PM Kartik Ohri wrote: > Hi! > > We are using Spark mllib (on

Re: Any advantages of using sql.adaptive.autoBroadcastJoinThreshold over sql.autoBroadcastJoinThreshold?

2023-01-22 Thread Balakrishnan Ayyappan
Hi Soumyadeep, Both configs are more or less the same. However, the sql.adaptive.auto* config is applicable (starting from version 3.2.0) only in the adaptive execution framework. As per the doc, the default value for "spark.sql.adaptive.autoBroadcastJoinThreshold" is the same as "
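
[Editor's note: a minimal PySpark sketch of setting the two thresholds discussed above (the 10 MB value is illustrative, not a recommendation); the adaptive variant is only consulted when AQE is enabled, per the reply.]

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Static threshold, used when broadcast hash joins are planned up front.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(10 * 1024 * 1024))

    # Adaptive variant (Spark 3.2.0+): only consulted when AQE re-plans joins at runtime.
    spark.conf.set("spark.sql.adaptive.enabled", "true")
    spark.conf.set("spark.sql.adaptive.autoBroadcastJoinThreshold", str(10 * 1024 * 1024))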

Re: Table created with saveAsTable behaves differently than a table created with spark.sql("CREATE TABLE....)

2023-01-21 Thread krexos
But in this case too the single partition is dynamic. I would expect the error to be thrown here too. When I create the table through a query I do it with PARTITION BY 'partitionCol' thanks --- Original Message --- On Saturday, January 21st, 2023 at 9:27 PM, Peyman Mohajerian wrote:

Re: Table created with saveAsTable behaves differently than a table created with spark.sql("CREATE TABLE....)

2023-01-21 Thread Peyman Mohajerian
In the case of saveAsTable("tablename") you specified the partition: ' partitionBy("partitionCol")' On Sat, Jan 21, 2023 at 4:03 AM krexos wrote: > My periodically running process writes data to a table over parquet files > with the configuration "spark.sql.sources.partitionOverwriteMode" = >
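
[Editor's note: a hedged sketch of the pattern under discussion — "partitionCol" and "tablename" come from the thread, the data is a toy stand-in — writing through saveAsTable with partitionBy while dynamic partition overwrite is enabled.]

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

    df = spark.createDataFrame([(1, "a"), (2, "b")], ["value", "partitionCol"])

    (df.write
       .mode("overwrite")
       .format("parquet")
       .partitionBy("partitionCol")   # the partition spec saveAsTable records for the table
       .saveAsTable("tablename"))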

Re: [PySPark] How to check if value of one column is in array of another column

2023-01-18 Thread Oliver Ruebenacker
Awesome, thanks, this was exactly what I needed! On Tue, Jan 17, 2023 at 5:23 PM Sean Owen wrote: > I think you want array_contains: > > https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.array_contains.html > > On Tue, Jan 17, 2023 at 4:18 PM Oliver

Re: [PySPark] How to check if value of one column is in array of another column

2023-01-17 Thread Sean Owen
I think you want array_contains: https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.array_contains.html On Tue, Jan 17, 2023 at 4:18 PM Oliver Ruebenacker < oliv...@broadinstitute.org> wrote: > > Hello, > > I have data originally stored as
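
[Editor's note: a small sketch of the suggestion, with invented column names; expr() is used so the column-in-column check works regardless of whether the Python array_contains helper accepts a Column for the value argument.]

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(1, [1, 2, 3]), (5, [2, 4])],
        ["value", "arr"],
    )

    # True where the value column appears in the arr column of the same row
    df.withColumn("is_member", F.expr("array_contains(arr, value)")).show()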

Re: pyspark.sql.dataframe.DataFrame versus pyspark.pandas.frame.DataFrame

2023-01-13 Thread Sean Owen
One is a normal Pyspark DataFrame, the other is a pandas work-alike wrapper on a Pyspark DataFrame. They're the same thing with different APIs. Neither has a 'storage format'. spark-excel might be fine, and it's used with Spark DataFrames. Because it emulates pandas's read_excel API, the Pyspark
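
[Editor's note: to make the distinction concrete, a sketch (Spark 3.2+, toy data) showing the same distributed data behind both APIs.]

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    sdf = spark.range(5)            # pyspark.sql.dataframe.DataFrame
    psdf = sdf.pandas_api()         # pyspark.pandas.frame.DataFrame, pandas-like API, same data
    print(type(sdf), type(psdf))

    sdf_again = psdf.to_spark()     # back to a plain Spark DataFrame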

Re: Hive 3 has big performance improvement from my test

2023-01-08 Thread Mich Talebzadeh
What bothers me is that you are making sweeping statements about Spark inability to handle quote " ... the key weakness of Spark is 1) its poor performance when executing concurrent queries and 2) its poor resource utilization when executing multiple Spark applications concurrently" and conversely

Re: Hive 3 has big performance improvement from my test

2023-01-07 Thread Mich Talebzadeh
Thanks for this insight guys. On your point below and I quote: ... "It's even as fast as Spark by using the default mr engine" OK as we are all experimentalists, are we stating that the classic MapReduce computation can outdo Spark's in-memory computation. I would be curious to know this.

Re: [pyspark/sparksql]: How to overcome redundant/repetitive code? Is a for loop over an sql statement with a variable a bad idea?

2023-01-06 Thread Sean Owen
Right, nothing wrong with a for loop here. Seems like just the right thing. On Fri, Jan 6, 2023, 3:20 PM Joris Billen wrote: > Hello Community, > I am working in pyspark with sparksql and have a very similar very complex > list of dataframes that Ill have to execute several times for all the >

Re: [PySpark] Error using SciPy: ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject

2023-01-06 Thread Oliver Ruebenacker
So I think now that my problem is Spark-related after all. It looks like my bootstrap script installs SciPy just fine in a regular environment, but somehow interaction with PySpark breaks it. On Fri, Jan 6, 2023 at 12:39 PM Bjørn Jørgensen wrote: > Create a Dockerfile > > FROM fedora > > RUN

Re: [PySpark] Error using SciPy: ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject

2023-01-06 Thread Bjørn Jørgensen
Create a Dockerfile FROM fedora RUN sudo yum install -y python3-devel RUN sudo pip3 install -U Cython && \ sudo pip3 install -U pybind11 && \ sudo pip3 install -U pythran && \ sudo pip3 install -U numpy && \ sudo pip3 install -U scipy docker build --pull --rm -f "Dockerfile" -t

Re: [PySpark] Error using SciPy: ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject

2023-01-06 Thread Mich Talebzadeh
https://stackoverflow.com/questions/66060487/valueerror-numpy-ndarray-size-changed-may-indicate-binary-incompatibility-exp view my Linkedin profile https://en.everybodywiki.com/Mich_Talebzadeh *Disclaimer:* Use it at your own

Re: [PySpark] Error using SciPy: ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject

2023-01-06 Thread Oliver Ruebenacker
Thank you for the link. I already tried most of what was suggested there, but without success. On Fri, Jan 6, 2023 at 11:35 AM Bjørn Jørgensen wrote: > > > > https://stackoverflow.com/questions/66060487/valueerror-numpy-ndarray-size-changed-may-indicate-binary-incompatibility-exp > > > > > fre.

Re: [PySpark] Error using SciPy: ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject

2023-01-06 Thread Bjørn Jørgensen
https://stackoverflow.com/questions/66060487/valueerror-numpy-ndarray-size-changed-may-indicate-binary-incompatibility-exp fre. 6. jan. 2023, 16:01 skrev Oliver Ruebenacker < oliv...@broadinstitute.org>: > > Hello, > > I'm trying to install SciPy using a bootstrap script and then use

Re: Spark reading from HBase using hbase-connectors - any benefit from localization?

2023-01-06 Thread Aaron Grubb
Hi Mich, Thanks a lot for the insight, it was very helpful. Aaron On Thu, 2023-01-05 at 23:44 +, Mich Talebzadeh wrote: Hi Aaron, Thanks for the details. It is a general practice when running Spark on premise to use Hadoop

Re: Spark reading from HBase using hbase-connectors - any benefit from localization?

2023-01-05 Thread Mich Talebzadeh
Hi Aaron, Thanks for the details. It is a general practice when running Spark on premise to use Hadoop clusters. This comes from the notion of data locality. Data locality in

Re: Spark reading from HBase using hbase-connectors - any benefit from localization?

2023-01-05 Thread Aaron Grubb
Hi Mich, Thanks for your reply. In hindsight I realize I didn't provide enough information about the infrastructure for the question to be answered properly. We are currently running a Hadoop cluster with nodes that have the following services: - HDFS NameNode (3.3.4) - YARN NodeManager

Re: Spark reading from HBase using hbase-connectors - any benefit from localization?

2023-01-05 Thread Mich Talebzadeh
Few questions - As I understand you already have a Hadoop cluster. Are you going to put your spark as Hadoopp nodes? - Where is your HBase cluster? Is it sharing nodes with Hadoop or has its own cluster I looked at that link and it does not say much. Essentially you want to use HBase

Re: Got Error Creating permanent view in Postgresql through Pyspark code

2023-01-05 Thread ayan guha
Hi What you are trying to do does not make sense. I suggest you to understand how Views work in SQL. IMHO you are better off creating a table. Ayan On Fri, 6 Jan 2023 at 12:20 am, Stelios Philippou wrote: > Vajiha, > > I dont see your query working as you hope it will. > > spark.sql will

Re: GPU Support

2023-01-05 Thread Sean Owen
Spark itself does not use GPUs, but you can write and run code on Spark that uses GPUs. You'd typically use software like Tensorflow that uses CUDA to access the GPU. On Thu, Jan 5, 2023 at 7:05 AM K B M Kaala Subhikshan < kbmkaalasubhiks...@gmail.com> wrote: > Is Gigabyte GeForce RTX 3080 GPU

Re: Got Error Creating permanent view in Postgresql through Pyspark code

2023-01-05 Thread Stelios Philippou
Vajiha, I dont see your query working as you hope it will. spark.sql will execute a query on a database level to retrieve the temp view you need to go from the sessions. i.e session.sql("SELECT * FROM TEP_VIEW") You might need to retrieve the data in a collection and iterate over them to do
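
[Editor's note: a minimal sketch of the Spark-side temp view being described (names made up); the view lives only in the Spark session's catalog, not in Postgres.]

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])

    df.createOrReplaceTempView("temp_view")          # session-scoped metadata only
    spark.sql("SELECT * FROM temp_view").show()      # queried through the same session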

Re: [EXTERNAL] Re: Incorrect csv parsing when delimiter used within the data

2023-01-05 Thread Saurabh Gulati
and 2 single quotes together ('') look like a single double quote ("). Mvg/Regards Saurabh Gulati From: Saurabh Gulati Sent: 05 January 2023 12:24 To: Sean Owen Cc: User Subject: Re: [EXTERNAL] Re: Incorrect csv parsing when delimiter used within the

Re: [EXTERNAL] Re: Incorrect csv parsing when delimiter used within the data

2023-01-05 Thread Saurabh Gulati
Its the same input except that headers are also being read with csv reader. Mvg/Regards Saurabh Gulati From: Sean Owen Sent: 04 January 2023 15:12 To: Saurabh Gulati Cc: User Subject: Re: [EXTERNAL] Re: Incorrect csv parsing when delimiter used within the data

Re: [EXTERNAL] Re: Re: Incorrect csv parsing when delimiter used within the data

2023-01-05 Thread Saurabh Gulati
Cc: Mich Talebzadeh ; User Subject: Re: [EXTERNAL] Re: Re: Incorrect csv parsing when delimiter used within the data If you have found a parser that works, simply read the data as text files, apply the parser manually, and convert to DataFrame (if needed at all), ___

Re: How to set a config for a single query?

2023-01-05 Thread Khalid Mammadov
Hi I believe there is a feature in Spark specifically for this purpose. You can create a new spark session and set those configs. Note that it's not the same as creating a separate driver processes with separate sessions, here you will still have the same SparkContext that works as a backend for
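
[Editor's note: a hedged sketch of that approach. newSession() shares the SparkContext but gets its own SQL configuration, so an override applies only to queries run through it; note the follow-up later in this thread that custom configs from the original session are not inherited.]

    from pyspark.sql import SparkSession

    base = SparkSession.builder.getOrCreate()

    isolated = base.newSession()                      # same SparkContext, separate SQLConf
    isolated.conf.set("spark.sql.shuffle.partitions", "400")

    print(base.conf.get("spark.sql.shuffle.partitions"))      # unchanged (default 200)
    print(isolated.conf.get("spark.sql.shuffle.partitions"))  # 400, only for this session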

Re: [EXTERNAL] Re: Incorrect csv parsing when delimiter used within the data

2023-01-04 Thread Sean Owen
04 January 2023 14:25 > *To:* Saurabh Gulati > *Cc:* Mich Talebzadeh ; User < > user@spark.apache.org> > *Subject:* Re: [EXTERNAL] Re: Incorrect csv parsing when delimiter used > within the data > > That input is just invalid as CSV for any parser. You end a quote

Re: [EXTERNAL] Re: Re: Incorrect csv parsing when delimiter used within the data

2023-01-04 Thread Shay Elbaz
: [EXTERNAL] Re: Re: Incorrect csv parsing when delimiter used within the data ATTENTION: This email originated from outside of GM. Hi @Sean Owen<mailto:sro...@gmail.com> Probably the data is incorrect, and the source needs to fix it. But using python's csv parser returns the correct r

Re: [EXTERNAL] Re: Incorrect csv parsing when delimiter used within the data

2023-01-04 Thread Saurabh Gulati
__ From: Sean Owen Sent: 04 January 2023 14:25 To: Saurabh Gulati Cc: Mich Talebzadeh ; User Subject: Re: [EXTERNAL] Re: Incorrect csv parsing when delimiter used within the data That input is just invalid as CSV for any parser. You end a quoted col witho

Re: Got Error Creating permanent view in Postgresql through Pyspark code

2023-01-04 Thread Stelios Philippou
Vajiha, I believe that you might be confusing stuff ? Permanent View in PSQL is a standard view. Temp view or Global View is the Spark View that is internal for Spark. Can we get a snippet of the code please. On Wed, 4 Jan 2023 at 15:10, Vajiha Begum S A wrote: > > I have tried to Create a

Re: [EXTERNAL] Re: Incorrect csv parsing when delimiter used within the data

2023-01-04 Thread Sean Owen
That input is just invalid as CSV for any parser. You end a quoted col without following with a col separator. What would the intended parsing be and how would it work? On Wed, Jan 4, 2023 at 4:30 AM Saurabh Gulati wrote: > > @Sean Owen Also see the example below with quotes > feedback: > >

Re: [EXTERNAL] Re: Incorrect csv parsing when delimiter used within the data

2023-01-04 Thread Saurabh Gulati
"|null|null| |2 |null|abc | +---+++ df.select("c").show(10, False) ++ |c | ++ |",see what ""I did""| |null

Re: How to set a config for a single query?

2023-01-04 Thread Shay Elbaz
for the new SparkSession does not contain custom configurations from the original session. I had to re-apply the important configurations (catalogs, etc.) on the new Sessions as well. Hope that helps. Shay From: Saurabh Gulati Sent: Wednesday, January 4, 2023 11:54 AM

Re: How to set a config for a single query?

2023-01-04 Thread Saurabh Gulati
Hey Felipe, Since you are collecting the dataframes, you might as well run them separately with desired configs and store them in your storage. Regards Saurabh From: Felipe Pessoto Sent: 04 January 2023 01:14 To: user@spark.apache.org Subject: [EXTERNAL] How to

Re: Incorrect csv parsing when delimiter used within the data

2023-01-04 Thread Mich Talebzadeh
What is the point of having *,* as a column value? From a business point of view it does not signify anything IMO view my Linkedin profile https://en.everybodywiki.com/Mich_Talebzadeh *Disclaimer:* Use it at your own risk. Any

Re: Incorrect csv parsing when delimiter used within the data

2023-01-03 Thread Sean Owen
Why does the data even need cleaning? That's all perfectly correct. The error was setting quote to be an escape char. On Tue, Jan 3, 2023, 2:32 PM Mich Talebzadeh wrote: > if you take your source CSV as below > > "a","b","c" > "1","","," > "2","","abc" > > > and define your code as below > > >

Re: Incorrect csv parsing when delimiter used within the data

2023-01-03 Thread Mich Talebzadeh
if you take your source CSV as below "a","b","c" "1","","," "2","","abc" and define your code as below csv_file="hdfs://rhes75:9000/data/stg/test/testcsv.csv" # read hive table in spark listing_df = spark.read.format("com.databricks.spark.csv").option("inferSchema",

Re: Incorrect csv parsing when delimiter used within the data

2023-01-03 Thread Sean Owen
No, you've set the escape character to double-quote, when it looks like you mean for it to be the quote character (which it already is). Remove this setting, as it's incorrect. On Tue, Jan 3, 2023 at 11:00 AM Saurabh Gulati wrote: > Hello, > We are seeing a case with csv data when it parses csv
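
[Editor's note: a sketch of the corrected read (file path is a placeholder): keep the default quote character and drop the escape override that caused the mis-parse.]

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df = (spark.read
          .option("header", True)
          .option("quote", '"')        # default quote char; shown only for clarity
          # no .option("escape", '"') -- setting escape to the quote char broke the parse
          .csv("/path/to/data.csv"))
    df.show(truncate=False)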

Re: Spark migration from 2.3 to 3.0.1

2023-01-02 Thread Shrikant Prasad
I agree with you that it's not the recommended approach. But I just want to understand which change caused this change in behavior. If you can point me to some Jira in which this change was made, that would be greatly appreciated. Regards, Shrikant On Mon, 2 Jan 2023 at 9:46 PM, Sean Owen

Re: Spark migration from 2.3 to 3.0.1

2023-01-02 Thread Sean Owen
Not true, you've never been able to use the SparkSession inside a Spark task. You aren't actually using it, if the application worked in Spark 2.x. Now, you need to avoid accidentally serializing it, which was the right thing to do even in Spark 2.x. Just move the sesion inside main(), not a

Re: Spark migration from 2.3 to 3.0.1

2023-01-02 Thread Shrikant Prasad
If that was the case and deserialized session would not work, the application would not have worked. As per the logs and debug prints, in spark 2.3 the main object is not getting deserialized in executor, otherise it would have failed then also. On Mon, 2 Jan 2023 at 9:15 PM, Sean Owen wrote:

Re: Spark migration from 2.3 to 3.0.1

2023-01-02 Thread Sean Owen
It silently allowed the object to serialize, though the serialized/deserialized session would not work. Now it explicitly fails. On Mon, Jan 2, 2023 at 9:43 AM Shrikant Prasad wrote: > Thats right. But the serialization would be happening in Spark 2.3 also, > why we dont see this error there? >

Re: Spark migration from 2.3 to 3.0.1

2023-01-02 Thread Shrikant Prasad
Thats right. But the serialization would be happening in Spark 2.3 also, why we dont see this error there? On Mon, 2 Jan 2023 at 9:09 PM, Sean Owen wrote: > Oh, it's because you are defining "spark" within your driver object, and > then it's getting serialized because you are trying to use

Re: Spark migration from 2.3 to 3.0.1

2023-01-02 Thread Sean Owen
Oh, it's because you are defining "spark" within your driver object, and then it's getting serialized because you are trying to use TestMain methods in your program. This was never correct, but now it's an explicit error in Spark 3. The session should not be a member variable. On Mon, Jan 2, 2023
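
[Editor's note: the application in this thread is Scala, but the principle Sean describes can be sketched in PySpark terms — build the session inside main() instead of holding it as an object/module-level member, so it is never captured in a closure shipped to executors.]

    from pyspark.sql import SparkSession

    def main():
        # Created here and used only on the driver; nothing that runs inside a
        # task references it, so nothing tries to serialize it to executors.
        spark = SparkSession.builder.appName("TestMain").getOrCreate()

        df = spark.range(100)
        print(df.selectExpr("sum(id)").collect())
        spark.stop()

    if __name__ == "__main__":
        main()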

Re: Spark migration from 2.3 to 3.0.1

2023-01-02 Thread Shrikant Prasad
Please see these logs. The error is thrown in executor: 23/01/02 15:14:44 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) java.lang.ExceptionInInitializerError at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at

Re: Spark migration from 2.3 to 3.0.1

2023-01-02 Thread Sean Owen
It's not running on the executor; that's not the issue. See your stack trace, where it clearly happens in the driver. On Mon, Jan 2, 2023 at 8:58 AM Shrikant Prasad wrote: > Even if I set the master as yarn, it will not have access to rest of the > spark confs. It will need spark.yarn.app.id. >

Re: Spark migration from 2.3 to 3.0.1

2023-01-02 Thread Shrikant Prasad
Even if I set the master as yarn, it will not have access to rest of the spark confs. It will need spark.yarn.app.id. The main issue is if its working as it is in Spark 2.3 why its not working in Spark 3 i.e why the session is getting created on executor. Another thing we tried is removing the df

Re: Spark migration from 2.3 to 3.0.1

2023-01-02 Thread Sean Owen
So call .setMaster("yarn"), per the error On Mon, Jan 2, 2023 at 8:20 AM Shrikant Prasad wrote: > We are running it in cluster deploy mode with yarn. > > Regards, > Shrikant > > On Mon, 2 Jan 2023 at 6:15 PM, Stelios Philippou > wrote: > >> Can we see your Spark Configuration parameters ? >>

Re: Spark migration from 2.3 to 3.0.1

2023-01-02 Thread Shrikant Prasad
We are running it in cluster deploy mode with yarn. Regards, Shrikant On Mon, 2 Jan 2023 at 6:15 PM, Stelios Philippou wrote: > Can we see your Spark Configuration parameters ? > > The mater URL refers to as per java > new SparkConf()setMaster("local[*]") > according to where you want to

Re: Spark migration from 2.3 to 3.0.1

2023-01-02 Thread Stelios Philippou
Can we see your Spark Configuration parameters ? The mater URL refers to as per java new SparkConf()setMaster("local[*]") according to where you want to run this On Mon, 2 Jan 2023 at 14:38, Shrikant Prasad wrote: > Hi, > > I am trying to migrate one spark application from Spark 2.3 to

Re: Profiling data quality with Spark

2022-12-29 Thread Chitral Verma
Hi Rajat, I have worked for years in democratizing data quality for some of the top organizations and I'm also an Apache Griffin Contributor and PMC - so I know a lot about this space. :) Coming back to your original question, there are a lot of data quality options available in the market today

Re: Profiling data quality with Spark

2022-12-28 Thread infa elance
You can also look at informatica data quality that runs on spark. Of course it’s not free but you can sign up for a 30 day free trial. They have both profiling and prebuilt data quality rules and accelerators. Sent from my iPhoneOn Dec 28, 2022, at 10:02 PM, vaquar khan wrote:@ Gourav Sengupta

Re: EXT: Re: Check if shuffle is caused for repartitioned pyspark dataframes

2022-12-28 Thread Vibhor Gupta
m Verma Sent: Monday, December 26, 2022 8:08 PM To: Russell Jurney Cc: Gurunandan ; user@spark.apache.org Subject: EXT: Re: Check if shuffle is caused for repartitioned pyspark dataframes EXTERNAL: Report suspicious emails to Email Abuse. I tried sorting the repartitioned dataframes on the pa

Re: Profiling data quality with Spark

2022-12-28 Thread vaquar khan
@ Gourav Sengupta why you are sending unnecessary emails ,if you think snowflake good plz use it ,here question was different and you are talking totally different topic. Plz respects group guidelines Regards, Vaquar khan On Wed, Dec 28, 2022, 10:29 AM vaquar khan wrote: > Here you can find

Re: Profiling data quality with Spark

2022-12-28 Thread vaquar khan
Here you can find all details , you just need to pass spark dataframe and deequ also generate recommendations for rules and you can also write custom complex rules. https://aws.amazon.com/blogs/big-data/test-data-quality-at-scale-with-deequ/ Regards, Vaquar khan On Wed, Dec 28, 2022, 9:40 AM
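
[Editor's note: a hedged pydeequ sketch along the lines of that blog post; the package coordinates and check names follow the pydeequ README, and the DataFrame is a toy stand-in. Some pydeequ versions also require the SPARK_VERSION environment variable to be set.]

    from pyspark.sql import SparkSession
    import pydeequ
    from pydeequ.checks import Check, CheckLevel
    from pydeequ.verification import VerificationSuite, VerificationResult

    spark = (SparkSession.builder
             .config("spark.jars.packages", pydeequ.deequ_maven_coord)
             .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
             .getOrCreate())

    df = spark.createDataFrame([(1, 10.0), (2, 5.5), (3, None)], ["id", "amount"])

    check = Check(spark, CheckLevel.Error, "basic quality checks")
    result = (VerificationSuite(spark)
              .onData(df)
              .addCheck(check.isComplete("amount")   # no nulls expected
                             .isUnique("id"))        # primary-key style constraint
              .run())

    VerificationResult.checkResultsAsDataFrame(spark, result).show(truncate=False)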

Re: Profiling data quality with Spark

2022-12-28 Thread rajat kumar
Thanks for the input folks. Hi Vaquar , I saw that we have various types of checks in GE and Deequ. Could you please suggest what types of check did you use for Metric based columns Regards Rajat On Wed, Dec 28, 2022 at 12:15 PM vaquar khan wrote: > I would suggest Deequ , I have

Re: [Spark Core] [Advanced] [How-to] How to map any external field to job ids spawned by Spark.

2022-12-28 Thread Gourav Sengupta
Hi Khalid, just out of curiosity, does the API help us in setting JOB ID's or just job Descriptions? Regards, Gourav Sengupta On Wed, Dec 28, 2022 at 10:58 AM Khalid Mammadov wrote: > There is a feature in SparkContext to set localProperties > (setLocalProperty) where you can set your Request

Re: [Spark Core] [Advanced] [How-to] How to map any external field to job ids spawned by Spark.

2022-12-28 Thread Khalid Mammadov
There is a feature in SparkContext to set localProperties (setLocalProperty) where you can set your Request ID and then using SparkListener instance read that ID with Job ID using onJobStart event. Hope this helps. On Tue, 27 Dec 2022, 13:04 Dhruv Toshniwal, wrote: > TL;Dr - >

Re: Profiling data quality with Spark

2022-12-27 Thread Gourav Sengupta
Hi Sean, the entire narrative of SPARK being a unified analytics tool falls flat as what should have been an engine on SPARK is now deliberately floated off as a separate company called as Ray, and all the unified narrative rings hollow. SPARK is nothing more than a SQL engine as per SPARKs own

Re: Profiling data quality with Spark

2022-12-27 Thread vaquar khan
I would suggest Deequ , I have implemented many time easy and effective. Regards, Vaquar khan On Tue, Dec 27, 2022, 10:30 PM ayan guha wrote: > The way I would approach is to evaluate GE, Deequ (there is a python > binding called pydeequ) and others like Delta Live tables with expectations >

Re: Profiling data quality with Spark

2022-12-27 Thread ayan guha
The way I would approach is to evaluate GE, Deequ (there is a python binding called pydeequ) and others like Delta Live tables with expectations from Data Quality feature perspective. All these tools have their pros and cons, and all of them are compatible with spark as a compute engine. Also,

Re: Profiling data quality with Spark

2022-12-27 Thread Walaa Eldin Moustafa
Rajat, You might want to read about Data Sentinel, a data validation tool on Spark that is developed at LinkedIn. https://engineering.linkedin.com/blog/2020/data-sentinel-automating-data-validation The project is not open source, but the blog post might give you insights about how such a system

Re: Profiling data quality with Spark

2022-12-27 Thread Sean Owen
I think this is kind of mixed up. Data warehouses are simple SQL creatures; Spark is (also) a distributed compute framework. Kind of like comparing maybe a web server to Java. Are you thinking of Spark SQL? then I dunno sure you may well find it more complicated, but it's also just a data

Re: Profiling data quality with Spark

2022-12-27 Thread Gourav Sengupta
Hi, SPARK is just another querying engine with a lot of hype. I would highly suggest using Redshift (storage and compute decoupled mode) or Snowflake without all this super complicated understanding of containers/ disk-space, mind numbing variables, rocket science tuning, hair splitting failure

Re: Profiling data quality with Spark

2022-12-27 Thread Mich Talebzadeh
Well, you need to qualify your statement on data quality. Are you talking about data lineage here? HTH view my Linkedin profile https://en.everybodywiki.com/Mich_Talebzadeh *Disclaimer:* Use it at your own risk. Any and all

Re: spark-submit fails in kubernetes 1.24.x cluster

2022-12-27 Thread Saurabh Gulati
Hello Thimme, Your issue is related to https://kubernetes.io/docs/reference/using-api/deprecation-guide/#ingress-v122 Deprecated API Migration Guide | Kubernetes As the Kubernetes API evolves, APIs are periodically

Re: Check if shuffle is caused for repartitioned pyspark dataframes

2022-12-26 Thread Shivam Verma
I tried sorting the repartitioned dataframes on the partition key before saving them as parquet files, however, when I read those repartitioned-sorted dataframes and join them on the partition key, the spark plan still shows `Exchange hashpartitioning` step, which I want to avoid:
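
[Editor's note: plain parquet output keeps no record of how the data was partitioned, so a later join re-shuffles it. A hedged sketch of bucketed tables (invented names, arbitrary bucket count), which is the mechanism Spark uses to persist a hash layout across writes so the join can skip the Exchange.]

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df1 = spark.range(1000).withColumnRenamed("id", "key")
    df2 = spark.range(1000).withColumnRenamed("id", "key")

    # Bucketing requires saveAsTable; both sides must use the same bucket count.
    for name, df in [("left_bucketed", df1), ("right_bucketed", df2)]:
        (df.write.bucketBy(16, "key").sortBy("key")
           .mode("overwrite").saveAsTable(name))

    joined = spark.table("left_bucketed").join(spark.table("right_bucketed"), "key")
    joined.explain()   # with matching bucket specs, no hashpartitioning Exchange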

RE: Re: RDD to InputStream

2022-12-25 Thread ayuio5799
porary file and then opening an inputstream on the file but that is not really optimal. Does anybody know a better way to do that? Thanks, Ayoub.

Re: Check if shuffle is caused for repartitioned pyspark dataframes

2022-12-23 Thread Russell Jurney
This may not be good advice but... could you sort by the partition key to ensure the partitions match up? Thinking of olden times :) On Fri, Dec 23, 2022 at 4:42 AM Shivam Verma wrote: > Hi Gurunandan, > > Thanks for the reply! > > I do see the exchange operator in the SQL tab, but I can see it

Re: Check if shuffle is caused for repartitioned pyspark dataframes

2022-12-23 Thread Shivam Verma
Hi Gurunandan, Thanks for the reply! I do see the exchange operator in the SQL tab, but I can see it in both the experiments: 1. Using repartitioned dataframes 2. Using initial dataframes Does that mean that the repartitioned dataframes are not actually "co-partitioned"? If that's the case, I

Re: [PySpark] Getting the best row from each group

2022-12-21 Thread Oliver Ruebenacker
Wow, thank you so much! On Wed, Dec 21, 2022 at 10:27 AM Mich Talebzadeh wrote: > OK let us try this > > 1) we have a csv file as below called cities.csv > > country,city,population > Germany,Berlin,3520031 > Germany,Hamburg,1787408 > Germany,Munich,1450381 > Turkey,Ankara,4587558 >

Re: [PySpark] Getting the best row from each group

2022-12-21 Thread Mich Talebzadeh
OK let us try this 1) we have a csv file as below called cities.csv country,city,population Germany,Berlin,3520031 Germany,Hamburg,1787408 Germany,Munich,1450381 Turkey,Ankara,4587558 Turkey,Istanbul,14025646 Turkey,Izmir,2847691 United States,Chicago IL,2670406 United States,Los Angeles

Re: [PySpark] Getting the best row from each group

2022-12-20 Thread Artemis User
Try this one:  "select country, city, max(population) from your_table group by country" Please note this returns a table of three columns, instead of two. This is a standard SQL query, and supported by Spark as well. On 12/20/22 3:35 PM, Oliver Ruebenacker wrote: Hello,   Let's say

Re: [PySpark] Getting the best row from each group

2022-12-20 Thread Bjørn Jørgensen
https://github.com/apache/spark/pull/39134 tir. 20. des. 2022, 22:42 skrev Oliver Ruebenacker < oliv...@broadinstitute.org>: > Thank you for the suggestion. This would, however, involve converting my > Dataframe to an RDD (and back later), which involves additional costs. > > On Tue, Dec 20,

Re: [PySpark] Getting the best row from each group

2022-12-20 Thread Oliver Ruebenacker
Thank you for the suggestion. This would, however, involve converting my Dataframe to an RDD (and back later), which involves additional costs. On Tue, Dec 20, 2022 at 7:30 AM Raghavendra Ganesh wrote: > you can groupBy(country). and use mapPartitions method in which you can > iterate over all

Re: [PySpark] Getting the best row from each group

2022-12-20 Thread Oliver Ruebenacker
Hello, Let's say the data is like this: +---+---++ | country | city | population | +---+---++ | Germany | Berlin| 3520031| | Germany | Hamburg | 1787408

Re: [PySpark] Getting the best row from each group

2022-12-20 Thread Raghavendra Ganesh
you can groupBy(country). and use mapPartitions method in which you can iterate over all rows keeping 2 variables for maxPopulationSoFar and corresponding city. Then return the city with max population. I think as others suggested, it may be possible to use Bucketing, it would give a more friendly

Re: [PySpark] Getting the best row from each group

2022-12-20 Thread Mich Talebzadeh
Hi, Windowing functions were invented to avoid doing lengthy group by etc. As usual there is a lot of heat but little light Please provide: 1. Sample input. I gather this data is stored in some csv, tsv, table format 2. The output that you would like to see. Have a look at this

Re: [Spark SQL]: unpredictable errors: java.io.IOException: can not read class org.apache.parquet.format.PageHeader

2022-12-19 Thread Eric Hanchrow
We’ve discovered a workaround for this; it’s described here. From: Eric Hanchrow Date: Thursday, December 8, 2022 at 17:03 To: user@spark.apache.org Subject: [Spark SQL]: unpredictable errors: java.io.IOException: can not read class

Re: [PySpark] Getting the best row from each group

2022-12-19 Thread Bjørn Jørgensen
Post an example dataframe and how you will have the result. man. 19. des. 2022 kl. 20:36 skrev Oliver Ruebenacker < oliv...@broadinstitute.org>: > Thank you, that is an interesting idea. Instead of finding the maximum > population, we are finding the maximum (population, city name) tuple. > > On

Re: [PySpark] Getting the best row from each group

2022-12-19 Thread Oliver Ruebenacker
Thank you, that is an interesting idea. Instead of finding the maximum population, we are finding the maximum (population, city name) tuple. On Mon, Dec 19, 2022 at 2:10 PM Bjørn Jørgensen wrote: > We have pandas API on spark >

Re: [PySpark] Getting the best row from each group

2022-12-19 Thread Bjørn Jørgensen
We have pandas API on spark which is very good. from pyspark import pandas as ps You can use pdf = df.pandas_api() where df is your pyspark dataframe. Does this help you?

Re: [PySpark] Getting the best row from each group

2022-12-19 Thread Patrick Tucci
Window functions don't work like traditional GROUP BYs. They allow you to partition data and pull any relevant column, whether it's used in the partition or not. I'm not sure what the syntax is for PySpark, but the standard SQL would be something like this: WITH InputData AS ( SELECT 'USA'
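
[Editor's note: since the PySpark syntax was the open question, a hedged DataFrame-API equivalent of that SQL; the data mirrors the country/city/population sample quoted elsewhere in the thread.]

    from pyspark.sql import SparkSession, Window
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("Germany", "Berlin", 3520031),
         ("Germany", "Hamburg", 1787408),
         ("Turkey", "Istanbul", 14025646)],
        ["country", "city", "population"],
    )

    # One row per country: the city with the largest population
    w = Window.partitionBy("country").orderBy(F.col("population").desc())
    largest = (df.withColumn("rn", F.row_number().over(w))
                 .filter(F.col("rn") == 1)
                 .drop("rn"))
    largest.show()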

Re: [PySpark] Getting the best row from each group

2022-12-19 Thread Oliver Ruebenacker
If we only wanted to know the biggest population, max function would suffice. The problem is I also want the name of the city with the biggest population. On Mon, Dec 19, 2022 at 11:58 AM Sean Owen wrote: > As Mich says, isn't this just max by population partitioned by country in > a window

Re: [PySpark] Getting the best row from each group

2022-12-19 Thread Sean Owen
As Mich says, isn't this just max by population partitioned by country in a window function? On Mon, Dec 19, 2022, 9:45 AM Oliver Ruebenacker wrote: > > Hello, > > Thank you for the response! > > I can think of two ways to get the largest city by country, but both > seem to be

Re: [PySpark] Getting the best row from each group

2022-12-19 Thread Oliver Ruebenacker
Hello, Thank you for the response! I can think of two ways to get the largest city by country, but both seem to be inefficient: (1) I could group by country, sort each group by population, add the row number within each group, and then retain only cities with a row number equal to 1.

Re: [PySpark] Getting the best row from each group

2022-12-19 Thread Mich Talebzadeh
In spark you can use windowing function s to achieve this HTH view my Linkedin profile https://en.everybodywiki.com/Mich_Talebzadeh *Disclaimer:* Use it at your own

Re: Spark-on-Yarn ClassNotFound Exception

2022-12-18 Thread Hariharan
Hi scrypso, Sorry for the late reply. Yes, I did mean spark.driver.extraClassPath. I was able to work around this issue by removing the need for an extra class, but I'll investigate along these lines nonetheless. Thanks again for all your help! On Thu, Dec 15, 2022 at 9:56 PM scrypso wrote: >

Re: Unable to run Spark Job(3.3.2 SNAPSHOT) with Volcano scheduler in Kubernetes

2022-12-16 Thread Gnana Kumar
I have opened the pom file under cd ~/spark/sql/catalyst$ vi pom.xml and increased "-Xss4m" to "-Xss4g" but still no luck. It is the same StackOverflowError. On Fri, Dec 16, 2022 at 9:42 PM Sean Owen wrote: > OK that's good. Hm, I seem to recall the build needs more mem in Java 11 >

Re: Unable to run Spark Job(3.3.2 SNAPSHOT) with Volcano scheduler in Kubernetes

2022-12-16 Thread Bjørn Jørgensen
I use java 17 to build this. Are there any reasons why you have to build spark yourself? Can't you start from spark 3.3.1 tar file and build a docker image from there? fre. 16. des. 2022 kl. 18:26 skrev Gnana Kumar : > I have opened the pom file under cd ~/spark/sql/catalyst$ vi pom.xml and >

Re: Unable to run Spark Job(3.3.2 SNAPSHOT) with Volcano scheduler in Kubernetes

2022-12-16 Thread Sean Owen
OK that's good. Hm, I seem to recall the build needs more mem in Java 11 and/or some envs. As a quick check, try replacing all "-Xss4m" with "-Xss16m" or something larger, in the project build files. Just search and replace. On Fri, Dec 16, 2022 at 9:53 AM Gnana Kumar wrote: > I have been

Re: Unable to run Spark Job(3.3.2 SNAPSHOT) with Volcano scheduler in Kubernetes

2022-12-16 Thread Sean Owen
You need to increase the stack size during compilation. The included mvn wrapper in build does this. Are you using it? On Fri, Dec 16, 2022 at 9:13 AM Gnana Kumar wrote: > This is my latest error and fails to build SPARK CATALYST > > Exception in thread "main" java.lang.StackOverflowError >

RE: [EXTERNAL] Re: [Spark vulnerability] replace jackson-mapper-asl

2022-12-15 Thread haibo.w...@morganstanley.com
: Sean Owen Sent: Wednesday, December 14, 2022 10:27 PM To: Wang, Harper (FRPPE) Cc: user@spark.apache.org Subject: Re: [EXTERNAL] Re: [Spark vulnerability] replace jackson-mapper-asl The CVE you mention seems to affect jackson-databind, not jackson-mapper-asl. 3.3.1 already uses databind 2.13.x

Re: [EXTERNAL] Re: [Spark vulnerability] replace jackson-mapper-asl

2022-12-15 Thread Sean Owen
Regards > > Harper > > > > *From:* Sean Owen > *Sent:* Wednesday, December 14, 2022 10:27 PM > *To:* Wang, Harper (FRPPE) > *Cc:* user@spark.apache.org > *Subject:* Re: [EXTERNAL] Re: [Spark vulnerability] replace > jackson-mapper-asl > > > > The CVE y

Re: Spark-on-Yarn ClassNotFound Exception

2022-12-15 Thread scrypso
Hmm, did you mean spark.*driver*.extraClassPath? That is very odd then - if you check the logs directory for the driver (on the cluster) I think there should be a launch container log, where you can see the exact command used to start the JVM (at the very end), and a line starting "export

Re: Query regarding Apache spark version 3.0.1

2022-12-15 Thread Sean Owen
Do you mean, when is branch 3.0.x EOL? It was EOL around the end of 2021. But there were releases 3.0.2 and 3.0.3 beyond 3.0.1, so not clear what you mean by support for 3.0.1. On Thu, Dec 15, 2022 at 9:53 AM Pranav Kumar (EXT) wrote: > Hi Team, > > > > Could you please help us to know when

Re: [EXTERNAL] Re: [Spark vulnerability] replace jackson-mapper-asl

2022-12-14 Thread Sean Owen
78a3a34c28fc15e898307e458d501a7e11d6d51?context=explore > > https://pypi.org/project/pyspark/ > > > > Regards > > Harper > > > > > > *From:* Sean Owen > *Sent:* Wednesday, December 14, 2022 9:32 PM > *To:* Wang, Harper (FRPPE) > *Cc:* user@spa

RE: [EXTERNAL] Re: [Spark vulnerability] replace jackson-mapper-asl

2022-12-14 Thread haibo.w...@morganstanley.com
-0d4fd8bcb2ad63a35c9ba5be278a3a34c28fc15e898307e458d501a7e11d6d51?context=explore https://pypi.org/project/pyspark/ Regards Harper From: Sean Owen Sent: Wednesday, December 14, 2022 9:32 PM To: Wang, Harper (FRPPE) Cc: user@spark.apache.org Subject: [EXTERNAL] Re: [Spark vulnerability] replace jackson-mapper-asl What Spark

Re: [Spark vulnerability] replace jackson-mapper-asl

2022-12-14 Thread Sean Owen
What Spark version are you referring to? If it's an unsupported version, no, no plans to update it. What image are you referring to? On Wed, Dec 14, 2022 at 7:14 AM haibo.w...@morganstanley.com < haibo.w...@morganstanley.com> wrote: > Hi All > > > > Hope you are doing well. > > > > Writing this

Re: Spark-on-Yarn ClassNotFound Exception

2022-12-13 Thread Hariharan
Hi scrypso, Thanks for the help so far, and I think you're definitely on to something here. I tried loading the class as you suggested with the code below: try { Thread.currentThread().getContextClassLoader().loadClass(MyS3ClientFactory.class.getCanonicalName()); logger.info("Loaded
