Re: Purpose of type in pandas_udf

2020-11-12 Thread Sean Owen
It's the return value.

On Thu, Nov 12, 2020 at 5:20 PM Daniel Stojanov wrote:
> Hi,
>
> Note "double" in the function decorator. Is this specifying the type of
> the data that goes into pandas_mean, or the type returned by that function?
>
> Regards,
>
> @pandas_udf("double",
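A minimal sketch of the point, in PySpark, assuming a hypothetical DataFrame df with a grouping column "id" and a double column "v": "double" in the decorator declares the type pandas_mean returns, while the input arrives as a pandas.Series typed by the input column.

from pyspark.sql.functions import pandas_udf, PandasUDFType

# "double" is the declared return type of the UDF, not the input type.
@pandas_udf("double", PandasUDFType.GROUPED_AGG)
def pandas_mean(v):
    # v is a pandas.Series holding one group's values of the input column
    return v.sum()

# Usage (df, "id" and "v" are hypothetical):
# df.groupBy("id").agg(pandas_mean(df["v"])).show()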

Purpose of type in pandas_udf

2020-11-12 Thread Daniel Stojanov
Hi,

Note "double" in the function decorator. Is this specifying the type of the data that goes into pandas_mean, or the type returned by that function?

Regards,

@pandas_udf("double", PandasUDFType.GROUPED_AGG)
def pandas_mean(v):
    return v.sum()

Re: Path of jars added to a Spark Job - spark-submit // // Override jars in spark submit

2020-11-12 Thread Dominique De Vito
Thanks Mich.

To be sure, are you really saying that, using the option "spark.yarn.archive", YOU have been able to OVERRIDE the installed Spark JARs with the JARs given with the option "spark.yarn.archive"? Using nothing more than "spark.yarn.archive"?

Thanks,
Dominique

On Thu, Nov 12, 2020 at 18:01, Mich

Re: Path of jars added to a Spark Job - spark-submit // // Override jars in spark submit

2020-11-12 Thread Dominique De Vito
Thanks Russell.

> Since the driver is responsible for moving jars specified in --jars, you
> cannot use a jar specified by --jars to be in driver-class-path, since the
> driver is already started and its classpath is already set before any jars
> are moved.

Your point is interesting, however I see

Re: Spark Dataset withColumn issue

2020-11-12 Thread Lalwani, Jayesh
Note that Spark never guarantees the ordering of columns. There's nothing in the Spark documentation that says the columns will be ordered a certain way. The proposed solution relies on an implementation detail that might change in a future version of Spark. Ideally, you shouldn't rely on Dataframe
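One robust way to pin the order, sketched here in PySpark with the thread's column names, is to list the columns explicitly after withColumn rather than rely on where the new column lands:

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.getOrCreate()
ds = spark.createDataFrame([(1, "a")], ["Col1", "Col3"])

# withColumn appends Col2 at the end; the trailing select fixes the
# order explicitly instead of relying on an implementation detail.
ds1 = ds.withColumn("Col2", lit("sample")).select("Col1", "Col2", "Col3")
ds1.show()  # columns come back as Col1 | Col2 | Col3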

Re: Spark Dataset withColumn issue

2020-11-12 Thread Vikas Garg
Ohh, thanks a lot.

On Thu, Nov 12, 2020, 21:23 Subash Prabakar wrote:
> Hi Vikas,
>
> He suggested using the select() function after your withColumn function.
>
> val ds1 = ds.select("Col1", "Col3").withColumn("Col2",
>   lit("sample")).select("Col1", "Col2", "Col3")
>
> Thanks,
> Subash

Re: Path of jars added to a Spark Job - spark-submit // // Override jars in spark submit

2020-11-12 Thread Mich Talebzadeh
As I understand it, Spark expects the jar files to be available on all nodes or, if applicable, in an HDFS directory.

Putting Spark jar files on HDFS

In YARN mode, *it is important that Spark jar files are available throughout the Spark cluster*. I have spent a fair bit of time on this and I recommend
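A sketch of the HDFS-based setup discussed in this thread, with a hypothetical path: zip the contents of $SPARK_HOME/jars, upload the archive to HDFS once, and point "spark.yarn.archive" at it in spark-defaults.conf (or via --conf on spark-submit):

# spark-defaults.conf (path is hypothetical; the archive is a zip of
# the jars under $SPARK_HOME/jars, uploaded to HDFS beforehand)
spark.yarn.archive  hdfs:///spark/jars/spark-libs.zip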

Re: Path of jars added to a Spark Job - spark-submit // // Override jars in spark submit

2020-11-12 Thread Russell Spitzer
--driver-class-path does not move jars, so it is dependent on your Spark resource manager (master). It is interpreted literally, so if your files do not exist in the location you provide, relative to where the driver is run, they will not be placed on the classpath. Since the driver is responsible for
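To illustrate the distinction with a hypothetical jar path: the extraClassPath entries (the config equivalents of --driver-class-path) are used verbatim when each JVM starts, so the files must already exist at those locations on the respective machines; unlike --jars, nothing is copied for you.

# spark-defaults.conf (paths are hypothetical)
spark.driver.extraClassPath   /opt/app/libs/my-override.jar
spark.executor.extraClassPath /opt/app/libs/my-override.jar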

Path of jars added to a Spark Job - spark-submit // // Override jars in spark submit

2020-11-12 Thread Dominique De Vito
Hi,

I am using Spark 2.1 (BTW) on YARN.

I am trying to upload JARs to the YARN cluster, and to use them to replace the on-site (already in place) JARs.

I am trying to do so through spark-submit.

One helpful answer

Re: Spark Dataset withColumn issue

2020-11-12 Thread Subash Prabakar
Hi Vikas,

He suggested using the select() function after your withColumn function.

val ds1 = ds.select("Col1", "Col3").withColumn("Col2",
  lit("sample")).select("Col1", "Col2", "Col3")

Thanks,
Subash

On Thu, Nov 12, 2020 at 9:19 PM Vikas Garg wrote:
> I am deriving the col2 using with

Re: Spark Dataset withColumn issue

2020-11-12 Thread Sean Owen
You can still simply select the columns by name, in order, after .withColumn().

On Thu, Nov 12, 2020 at 9:49 AM Vikas Garg wrote:
> I am deriving the col2 using withColumn, which is why I can't use it like
> you told me.
>
> On Thu, Nov 12, 2020, 20:11 German Schiavon wrote:
>>

Re: Spark Dataset withColumn issue

2020-11-12 Thread Vikas Garg
I am deriving the col2 using withColumn, which is why I can't use it like you told me.

On Thu, Nov 12, 2020, 20:11 German Schiavon wrote:
> ds.select("Col1", "Col2", "Col3")
>
> On Thu, 12 Nov 2020 at 15:28, Vikas Garg wrote:
>> In Spark Dataset, if we add an additional column using
>> withColumn

Re: Spark Dataset withColumn issue

2020-11-12 Thread German Schiavon
ds.select("Col1", "Col2", "Col3")

On Thu, 12 Nov 2020 at 15:28, Vikas Garg wrote:
> In Spark Dataset, if we add an additional column using withColumn,
> then the column is added at the end.
>
> e.g.
> val ds1 = ds.select("Col1", "Col3").withColumn("Col2", lit("sample"))
>
> then the order of

Spark Dataset withColumn issue

2020-11-12 Thread Vikas Garg
In Spark Dataset, if we add an additional column using withColumn, then the column is added at the end.

e.g.
val ds1 = ds.select("Col1", "Col3").withColumn("Col2", lit("sample"))

then the order of columns is >> Col1 | Col3 | Col2

I want the order to be >> Col1 | Col2 | Col3

How can I

Spark[SqL] performance tuning

2020-11-12 Thread Lakshmi Nivedita
Hi all,

I have a PySpark SQL script that loads one table of 80 MB, one of 2 MB, and three other small tables, and performs lots of joins in the script to fetch the data.

My system configuration is 4 nodes, 300 GB, 64 cores.

To write a data frame of 24 MB of records into a table, the system is taking 4
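Since every table here is small, one common first step, sketched below in PySpark with hypothetical table and key names, is to broadcast the small sides of the joins so they are shipped whole to each executor instead of being shuffled:

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

big = spark.table("big_table")      # the ~80 MB table (name hypothetical)
small = spark.table("small_table")  # a ~2 MB table (name hypothetical)

# broadcast() hints Spark to ship the small table to every executor,
# turning a shuffle join into a broadcast hash join.
result = big.join(broadcast(small), "join_key")
result.write.mode("overwrite").saveAsTable("output_table")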