Re: Spark Dataset withColumn issue

2020-11-12 Thread Subash Prabakar
Hi Vikas, He suggested using the select() function after your withColumn call: val ds1 = ds.select("Col1", "Col3").withColumn("Col2", lit("sample")).select("Col1", "Col2", "Col3") Thanks, Subash On Thu, Nov 12, 2020 at 9:19 PM Vikas Garg wrote: > I am deriving the col2 using with
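A minimal runnable sketch of the suggestion above (the sample data and the lit value are illustrative; only the column names come from the thread):

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions.lit

  val spark = SparkSession.builder().appName("column-order").getOrCreate()
  import spark.implicits._

  val ds = Seq(("a", "b")).toDF("Col1", "Col3")

  // withColumn appends Col2 at the end; the trailing select restores the wanted order.
  val ds1 = ds
    .withColumn("Col2", lit("sample"))
    .select("Col1", "Col2", "Col3")

  ds1.printSchema() // Col1, Col2, Col3 in the requested order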

Re: Going it alone.

2020-04-16 Thread Subash Prabakar
Looks like he had a very bad appraisal this year. Fun fact: the coming year would be too :) On Thu, 16 Apr 2020 at 12:07, Qi Kang wrote: > Well man, check your attitude, you’re way over the line > > > On Apr 16, 2020, at 13:26, jane thorpe > wrote: > > F*U*C*K O*F*F > C*U*N*T*S

Apache Arrow support for Apache Spark

2020-02-16 Thread Subash Prabakar
Hi Team, I have two questions regarding Arrow and Spark integration. 1. I am joining two huge tables (1 PB each) - will there be a big performance gain if I use the Arrow format before shuffling? Will the serialization/deserialization cost improve significantly? 2. Can we store the final data in
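For context: in stock Spark (2.x/3.x), Arrow accelerates JVM-to-Python interchange (toPandas, pandas UDFs), not the shuffle path, so an "Arrow format before shuffling" is not something the engine exposes. The usual serialization lever for RDD shuffles is Kryo instead - a hedged sketch of that knob, not a claim about Arrow itself:

  import org.apache.spark.sql.SparkSession

  // DataFrame shuffles already use Tungsten's internal binary row format;
  // this setting mainly affects RDD shuffle and caching serialization.
  val spark = SparkSession.builder()
    .appName("serializer-example")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()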

Re: [Spark SQL] failure in query

2019-08-29 Thread Subash Prabakar
What is the number of part files in that big table? And what is the distribution of the request ID? Is the variance of the column small or huge? The partitionBy clause will move all rows with the same request ID to one executor, and if that data is huge it can overload the executor. On Sun, 25 Aug 2019 at
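A small sketch of how one might check that distribution before blaming the executor (the table and column names are illustrative stand-ins for the thread's data):

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions._

  val spark = SparkSession.builder().appName("skew-check").getOrCreate()
  import spark.implicits._

  // Stand-in for the big table from the thread.
  val df = Seq(("r1", 1), ("r1", 2), ("r2", 3)).toDF("request_id", "value")

  // One dominant request_id in this output means the partitionBy is skewed.
  df.groupBy("request_id").count().orderBy(desc("count")).show(20)

  // A common mitigation: salt the key so a hot ID spreads over several partitions.
  val salted = df
    .withColumn("salt", (rand() * 16).cast("int"))
    .repartition(col("request_id"), col("salt"))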

Re: Caching tables in spark

2019-08-28 Thread Subash Prabakar
When you say process, do you mean two separate Spark jobs? Or two stages within the same Spark code? Thanks, Subash On Wed, 28 Aug 2019 at 19:06, wrote: > Take a look at this article > https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-rdd-caching.html > *From:* Tzahi File
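The distinction matters because a cache is scoped to one application - a sketch, assuming the two "processes" are separate spark-submit runs:

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions.sum

  val spark = SparkSession.builder().appName("cache-scope").getOrCreate()
  import spark.implicits._

  val df = Seq(1, 2, 3).toDF("n")

  // cache() keeps blocks on this application's executors for later actions;
  // a separate Spark job (another spark-submit) cannot see or reuse them.
  df.cache()
  df.count()                  // first action materializes the cache
  df.select(sum("n")).show()  // subsequent actions reuse it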

Re: Spark SQL reads all leaf directories on a partitioned Hive table

2019-08-12 Thread Subash Prabakar
I had a similar issue reading an external Parquet table. In my case I had a permission issue in one partition, so I added a filter to exclude that partition, but Spark still didn't prune it. Then I read that in order for Spark to be aware of all the partitions it first reads the folders and then
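A hedged sketch of the two usual workarounds (the paths and the dt partition column are assumptions, not from the thread):

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("partition-pruning").getOrCreate()

  // Filtering on the partition column prunes directories at query planning,
  // but discovery itself may still list every folder under the table root.
  val pruned = spark.read
    .parquet("/warehouse/events")
    .filter("dt != '2019-08-01'")

  // If one partition directory is unreadable, pointing the reader only at the
  // good directories (with basePath so dt survives as a column) avoids
  // touching the bad one during listing.
  val good = spark.read
    .option("basePath", "/warehouse/events")
    .parquet("/warehouse/events/dt=2019-08-02", "/warehouse/events/dt=2019-08-03")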

Spark Dataframe NTILE function

2019-06-12 Thread Subash Prabakar
Hi, I am running the Spark DataFrame NTILE function over huge data - it spills a lot of data while sorting and eventually fails. The data is roughly 80 million records, about 4 GB in size (not sure whether that is serialized or deserialized) - I am calculating NTILE(10) over all these records
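The likely culprit: a global NTILE needs a window with no partitionBy, which forces every row through a single task. A sketch contrasting the two forms (column names are illustrative):

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.expressions.Window
  import org.apache.spark.sql.functions.ntile

  val spark = SparkSession.builder().appName("ntile-example").getOrCreate()
  import spark.implicits._

  val df = Seq(("a", 10), ("a", 20), ("b", 30)).toDF("key", "score")

  // Global deciles: no partitionBy, so all rows sort in one task - this is
  // the shape that spills and eventually fails at 80M records.
  val global = Window.orderBy("score")
  val globalDeciles = df.withColumn("decile", ntile(10).over(global))

  // If deciles per key are acceptable, partitioning the window keeps the
  // sort distributed across executors.
  val perKey = Window.partitionBy("key").orderBy("score")
  val keyedDeciles = df.withColumn("decile", ntile(10).over(perKey))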

Feature engineering ETL for machine learning

2019-04-20 Thread Subash Prabakar
Hi, I have a series of queries that extract from multiple tables in Hive, followed by feature engineering on the extracted final data. I can run the queries using Spark SQL and use MLlib to perform the feature transformations I need. The question is: do you guys use any kind of tool to perform this
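One built-in option is the MLlib Pipeline API, which chains feature stages into a single reusable object - a sketch with assumed table and column names:

  import org.apache.spark.ml.Pipeline
  import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("feature-etl").enableHiveSupport().getOrCreate()

  // Extraction: Spark SQL over Hive (query is illustrative).
  val raw = spark.sql("SELECT user_id, country, clicks FROM db.events")

  // Transformation: each stage is declared once; fit/transform runs them in order.
  val indexer = new StringIndexer().setInputCol("country").setOutputCol("country_idx")
  val assembler = new VectorAssembler()
    .setInputCols(Array("country_idx", "clicks"))
    .setOutputCol("features")

  val features = new Pipeline()
    .setStages(Array(indexer, assembler))
    .fit(raw)
    .transform(raw)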

Re: --jars vs --spark.executor.extraClassPath vs --spark.driver.extraClassPath

2019-04-20 Thread Subash Prabakar
Hey Rajat, The documentation page is self-explanatory. You can refer to https://spark.apache.org/docs/2.0.0/configuration.html (or the matching page for your Spark version) for more configs. Thanks, Subash On Sat, 20 Apr 2019 at 16:04, rajat kumar wrote: > Hi, > > Can anyone pls explain ?
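The short version from that page: --jars ships the listed jars with the job and adds them to both driver and executor classpaths, while the extraClassPath configs only prepend entries to the JVM classpath and ship nothing, so those paths must already exist on every node. An illustrative invocation (paths and class names are made up):

  # --jars distributes foo.jar and puts it on driver and executor classpaths.
  spark-submit \
    --jars /local/libs/foo.jar \
    --conf spark.driver.extraClassPath=/opt/shared/bar.jar \
    --conf spark.executor.extraClassPath=/opt/shared/bar.jar \
    --class com.example.Main app.jar

  # extraClassPath does not copy bar.jar anywhere; it must already be present
  # at /opt/shared/bar.jar on each node, but it is visible at JVM startup,
  # which --jars cannot guarantee.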

Difference between Checkpointing and Persist

2019-04-18 Thread Subash Prabakar
Hi All, I have a doubt about checkpointing versus persist/saving. Say we have one RDD containing huge data: 1. We checkpoint it and perform a join. 2. We persist it as StorageLevel.MEMORY_AND_DISK and perform a join. 3. We save that intermediate RDD and perform a join (using the same RDD - saving is to just
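A sketch of the first two options side by side (the data and checkpoint directory are illustrative) - the key difference is that persist keeps the lineage while checkpoint truncates it:

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.storage.StorageLevel

  val spark = SparkSession.builder().appName("checkpoint-vs-persist").getOrCreate()
  val sc = spark.sparkContext
  sc.setCheckpointDir("/tmp/checkpoints")

  val rdd = sc.parallelize(1 to 1000000)

  // persist: blocks live on executors, lineage is retained; a lost block is
  // recomputed from the original DAG.
  val cached = rdd.map(_ * 2).persist(StorageLevel.MEMORY_AND_DISK)

  // checkpoint: written to reliable storage, lineage is cut; recovery reads
  // the files back instead of recomputing.
  val chk = rdd.map(_ * 2)
  chk.checkpoint()   // must be called before the action
  chk.count()        // the action triggers computation and the checkpoint write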

Spark2: Deciphering saving text file name

2019-04-08 Thread Subash Prabakar
Hi, While saving as a text file in Spark 2, I see an encoded/hash value attached to the part files along with the part number. I am curious to know what that value is about. Example: ds.write.mode(SaveMode.Overwrite).option("compression","gzip").text(path) Produces,
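For reference, a runnable version of that snippet with the output name pattern in a comment (the pattern describes typical Spark 2 behavior; it is not copied from the thread):

  import org.apache.spark.sql.{SaveMode, SparkSession}

  val spark = SparkSession.builder().appName("text-output-names").getOrCreate()
  import spark.implicits._

  val ds = Seq("a", "b").toDS()
  ds.write.mode(SaveMode.Overwrite).option("compression", "gzip").text("/tmp/text-out")

  // Typical output: part-00000-<uuid>-c000.txt.gz
  // The <uuid> is a job-level identifier Spark embeds in every part-file name
  // so files written by different jobs (e.g. appends) can never collide.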