Re: [Spark SQL] why does Spark SQL hash() return the same hash value though the keys/exprs are not the same

2018-09-28 Thread Gokula Krishnan D
+------------+-----------+
|hash(40514X)|hash(41751)|
+------------+-----------+
| -1898845883|  916273350|
+------------+-----------+

scala> spark.sql("select hash('14589'),hash('40004')").show()
+---+---+
|hash(1

[Spark SQL] why does Spark SQL hash() return the same hash value though the keys/exprs are not the same

2018-09-25 Thread Gokula Krishnan D
Hello All, I am calculating the hash value of a few columns and determining whether it is an Insert/Delete/Update record, but I found a scenario which is a little weird since some of the records return the same hash value though the keys are totally different. For instance, scala> spark.sql("select
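A small illustrative sketch (not from the original thread, whose example is truncated above): Spark SQL's hash() is a 32-bit Murmur3 hash, so distinct keys can collide; for insert/update/delete detection a longer digest such as sha2() makes collisions far less likely.

    scala> spark.sql("select hash('40514X'), hash('41751')").show()
    scala> spark.sql("select sha2('40514X', 256), sha2('41751', 256)").show(false)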

Re: How to avoid duplicate column names after join with multiple conditions

2018-07-06 Thread Gokula Krishnan D
Nirav, the withColumnRenamed() API might help, but it does not distinguish between the duplicate columns and renames all occurrences of the given column name. Instead, use the select() API and rename the columns as you want. Thanks & Regards, Gokula Krishnan* (Gokul)* On Mon, Jul 2, 2018 at 5:52 PM, Nirav Patel wrote: > Expr is `df1(a)
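A minimal sketch of the select()-and-rename approach suggested above; the data and column names are made up for illustration.

    import spark.implicits._
    val df1 = Seq((1, "x", 10), (2, "y", 20)).toDF("a", "b", "v")
    val df2 = Seq((1, "x", 100), (2, "y", 200)).toDF("a", "b", "v")
    val joined = df1.join(df2, df1("a") === df2("a") && df1("b") === df2("b"))
    // keep one copy of the join keys and alias the clashing value columns
    val deduped = joined.select(df1("a"), df1("b"), df1("v").as("v_left"), df2("v").as("v_right"))
    deduped.show()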

Re: What factors need to be considered when upgrading to Spark 2.1.0 from Spark 1.6.0

2017-09-29 Thread Gokula Krishnan D
Do you see any changes or improvements in the *Core API* in Spark 2.x when compared with Spark 1.6.0? Thanks & Regards, Gokula Krishnan* (Gokul)* On Mon, Sep 25, 2017 at 1:32 PM, Gokula Krishnan D <email2...@gmail.com> wrote: > Thanks for the reply. Forgot to mention that,

Re: What factors need to be considered when upgrading to Spark 2.1.0 from Spark 1.6.0

2017-09-25 Thread Gokula Krishnan D
2. Try to compare the CPU time instead of the wall-clock time.
3. Check the stages that got slower and compare the DAGs.
4. Test with dynamic allocation disabled.
On Fri, Sep 22, 2017 at 2:39 PM, Gokula Krishnan D <email2...@gmail.com> wrote: > Hello All, > Currently our Batch ETL Jo

Re: What factors need to be considered when upgrading to Spark 2.1.0 from Spark 1.6.0

2017-09-22 Thread Gokula Krishnan D
actors that influence that.
2. Try to compare the CPU time instead of the wall-clock time.
3. Check the stages that got slower and compare the DAGs.
4. Test with dynamic allocation disabled.
On Fri, Sep 22, 2017 at 2:39 PM, Gokula Krishnan D <email2...@gmail.com

What factors need to be considered when upgrading to Spark 2.1.0 from Spark 1.6.0

2017-09-22 Thread Gokula Krishnan D
Hello All, currently our batch ETL jobs are on Spark 1.6.0 and we are planning to upgrade to Spark 2.1.0. With minor code changes (like configuration and SparkSession/sc) we were able to execute the existing job on Spark 2.1.0, but we noticed that job completion timings are much better in Spark 1.6.0 but no
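A sketch of the kind of "minor code changes" mentioned above when moving from Spark 1.6 (SQLContext/HiveContext plus sc) to Spark 2.1 (SparkSession); the names are illustrative.

    import org.apache.spark.sql.SparkSession
    val spark = SparkSession.builder()
      .appName("BatchEtlJob")
      .enableHiveSupport()
      .getOrCreate()
    val sc = spark.sparkContext                  // the old sc is still available via the session
    val df = spark.table("some_db.some_table")   // replaces hiveContext.table(...)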

Re: [Spark-Core] sc.textFile() explicit minPartitions did not work

2017-07-25 Thread Gokula Krishnan D
nt-core/src/main/java/org/apache/hadoop/mapred/InputFormat.java#L85>, and the partition desired is at most a hint, so the final result could be a bit different. On 25 July 2017 at 19:54, Gokula Krishnan D <email2...@gmail.com> wrote: > Excuse for the too

Re: [Spark-Core] sc.textFile() explicit minPartitions did not work

2017-07-25 Thread Gokula Krishnan D
Excuse the too many emails on this post. I found a similar issue: https://stackoverflow.com/questions/24671755/how-to-partition-a-rdd Thanks & Regards, Gokula Krishnan* (Gokul)* On Tue, Jul 25, 2017 at 8:21 AM, Gokula Krishnan D <email2...@gmail.com> wrote: > In addition to th

Re: [Spark-Core] sc.textFile() explicit minPartitions did not work

2017-07-25 Thread Gokula Krishnan D
In addition to that, I tried to read the same file with 3000 partitions but it used 3070 partitions, and it took more time than the previous run; please refer to the attachment. Thanks & Regards, Gokula Krishnan* (Gokul)* On Tue, Jul 25, 2017 at 8:15 AM, Gokula Krishnan D <email2...@gmail.com> wrote

[Spark-Core] sc.textFile() explicit minPartitions did not work

2017-07-25 Thread Gokula Krishnan D
Hello All, I have an HDFS file with approx. *1.5 billion records* in 500 part files (258.2 GB) and when I tried to execute the following I could see that it used 2290 tasks, but it was supposed to be 500 like the HDFS file, wasn't it? val inputFile = val inputRdd = sc.textFile(inputFile)
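A small sketch of the behaviour discussed in this thread (the path is a placeholder): the minPartitions argument is only a lower-bound hint passed to the Hadoop InputFormat, which still splits by HDFS block size, so the actual partition count can be higher; coalesce() is one way to force a specific count after reading.

    val inputRdd = sc.textFile("/path/to/input", 500)   // hint, not a guarantee
    println(inputRdd.getNumPartitions)                  // may print more than 500
    val with500 = inputRdd.coalesce(500)
    println(with500.getNumPartitions)                   // 500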

Re: Spark on Cloudera Configuration (Scheduler Mode = FAIR)

2017-07-21 Thread Gokula Krishnan D
> https://spark.apache.org/docs/latest/job-scheduling.html#scheduling-across-applications > https://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application > On Thu, Jul 20, 2017 at 2:02 PM, Gokula Krishnan D <email

Re: Spark on Cloudera Configuration (Scheduler Mode = FAIR)

2017-07-20 Thread Gokula Krishnan D
scheduled using the prior Task's resources -- the fair scheduler is not preemptive of running Tasks. On Thu, Jul 20, 2017 at 1:45 PM, Gokula Krishnan D <email2...@gmail.com> wrote: > Hello All, > We are having cluster with 50 Executors each with

Spark on Cloudera Configuration (Scheduler Mode = FAIR)

2017-07-20 Thread Gokula Krishnan D
Hello All, we have a cluster with 50 executors, each with 4 cores, so a maximum of 200 cores is available. I am submitting a Spark application (JOB A) with scheduler.mode set to FAIR and dynamicAllocation=true, and it got all the available executors. In the meantime, submitting another Spark Application
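A sketch of the relevant configuration (the values are examples, not the original job's settings): note that spark.scheduler.mode=FAIR only shares resources between jobs inside one application; sharing executors between two applications is handled by the cluster manager and dynamic allocation, so capping maxExecutors is one way to leave room for a second job.

    val spark = org.apache.spark.sql.SparkSession.builder()
      .appName("JOB A")
      .config("spark.scheduler.mode", "FAIR")
      .config("spark.dynamicAllocation.enabled", "true")
      .config("spark.shuffle.service.enabled", "true")        // required for dynamic allocation
      .config("spark.dynamicAllocation.maxExecutors", "25")   // leave executors for other applications
      .getOrCreate()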

Spark sc.textFile() files with more partitions Vs files with less partitions

2017-07-20 Thread Gokula Krishnan D
Hello All, our Spark applications are designed to process HDFS files (Hive external tables). We recently modified the Hive file sizes by setting the following parameters to ensure that files have an average size of 512 MB: set hive.merge.mapfiles=true set hive.merge.mapredfiles=true
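A quick sketch (paths are placeholders) for checking the effect of the merge settings above: sc.textFile() splits by HDFS block size (commonly 128 MB), so fewer-but-larger files mainly reduce the number of tiny tasks and scheduling overhead rather than mapping one file to one partition.

    val beforeMerge = sc.textFile("/warehouse/db/table_before_merge")
    val afterMerge  = sc.textFile("/warehouse/db/table_after_merge")
    println(s"partitions before merge: ${beforeMerge.getNumPartitions}")
    println(s"partitions after merge:  ${afterMerge.getNumPartitions}")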

Re: unit testing in spark

2017-04-10 Thread Gokula Krishnan D
Hello Shiv, unit testing really helps when you follow a TDD approach. It is also a safe way to code a program locally, and you can make use of those test cases during the build process with any of the continuous integration tools (Bamboo, Jenkins). If so, you can ensure that artifacts are
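A minimal sketch of such a test, assuming ScalaTest 3.x and a local SparkSession; the word-count logic is just an example.

    import org.apache.spark.sql.SparkSession
    import org.scalatest.funsuite.AnyFunSuite

    class WordCountSpec extends AnyFunSuite {
      test("counts words") {
        val spark = SparkSession.builder().master("local[2]").appName("test").getOrCreate()
        try {
          val counts = spark.sparkContext
            .parallelize(Seq("a b", "a"))
            .flatMap(_.split(" "))
            .map((_, 1))
            .reduceByKey(_ + _)
            .collectAsMap()
          assert(counts("a") == 2 && counts("b") == 1)
        } finally {
          spark.stop()
        }
      }
    }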

Re: Task Deserialization Error

2016-09-21 Thread Gokula Krishnan D
Hello Sumit - I could see that the SparkConf() specification is not mentioned in your program, but the rest looks good. Output: By the way, I have used the README.md template https://gist.github.com/jxson/1784669 Thanks & Regards, Gokula Krishnan* (Gokul)* On Tue, Sep 20, 2016 at 2:15 AM,
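A minimal sketch of the missing piece (the app name, master, and file are assumptions): create a SparkConf and pass it to the SparkContext before using sc.

    import org.apache.spark.{SparkConf, SparkContext}
    val conf = new SparkConf().setAppName("ReadmeWordCount").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val readme = sc.textFile("README.md")
    println(readme.count())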

Things to do learn Cassandra in Apache Spark Environment

2016-08-23 Thread Gokula Krishnan D
Hello All - hope you are doing well. I have a general question. I am working on Hadoop using Apache Spark. At this moment we are not using Cassandra, but I would like to know the scope of learning and using it in the Hadoop environment. It would be great if you could provide the use

How to view the RDD data based on Partition

2016-01-12 Thread Gokula Krishnan D
Hello All - I'm just trying to understand aggregate() and in the meantime got a question. *Is there any way to view the RDD data based on the partition?* For instance, the following RDD has 2 partitions: val multi2s = List(2,4,6,8,10,12,14,16,18,20) val multi2s_RDD =
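A small sketch of one way to see which element ends up in which partition, using mapPartitionsWithIndex (the same idea as in the reply below).

    val multi2s = List(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)
    val multi2s_RDD = sc.parallelize(multi2s, 2)
    multi2s_RDD
      .mapPartitionsWithIndex { (index, iter) => iter.map(x => s"partition $index -> $x") }
      .collect()
      .foreach(println)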

Re: How to view the RDD data based on Partition

2016-01-12 Thread Gokula Krishnan D
nt]) : Iterator[String] = {
  iter.toList.map(x => index + "," + x).iterator
}
x.mapPartitionsWithIndex(myfunc).collect()
res10: Array[String] = Array(0,1, 0,2, 0,3, 1,4, 1,5, 1,6, 2,7, 2,8, 2,9)
On Tue, Jan 12, 2016 at 2:06 PM, Gokula Krishnan D <email2...@gm

Re: Problem with WINDOW functions?

2015-12-30 Thread Gokula Krishnan D
Hello Vadim - Alternatively, you can achieve this by using the *window functions*, which are available from 1.4.0. *code_value.txt (Input)* =
1000,200,Descr-200,01
1000,200,Descr-200-new,02
1000,201,Descr-201,01
1000,202,Descr-202-new,03
1000,202,Descr-202,01
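A sketch of the window-function approach (written with the Spark 2.x DataFrame reader for brevity; the column names and the goal of keeping the latest version per code are assumptions, since the original answer is truncated above).

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.row_number
    import spark.implicits._

    val df = spark.read.csv("code_value.txt").toDF("account", "code", "descr", "version")
    val w = Window.partitionBy("account", "code").orderBy($"version".desc)
    val latest = df.withColumn("rn", row_number().over(w)).filter($"rn" === 1).drop("rn")
    latest.show()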

Re: difference between ++ and Union of a RDD

2015-12-29 Thread Gokula Krishnan D
Ted - Thanks for the update. Then it's the same case with sc.parallelize() and sc.makeRDD(), right? Thanks & Regards, Gokula Krishnan* (Gokul)* On Tue, Dec 29, 2015 at 1:43 PM, Ted Yu wrote: > From RDD.scala : > > def ++(other: RDD[T]): RDD[T] = withScope { >
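A quick sketch verifying the point: ++ simply delegates to union (and, similarly, sc.makeRDD delegates to sc.parallelize), so both keep duplicates, unlike a set union.

    val a = sc.parallelize(Seq(1, 2, 3))
    val b = sc.makeRDD(Seq(3, 4, 5))
    val viaPlusPlus = (a ++ b).collect().sorted
    val viaUnion    = a.union(b).collect().sorted
    assert(viaPlusPlus.sameElements(viaUnion))   // Array(1, 2, 3, 3, 4, 5) in both cases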

Re: How to Parse & flatten JSON object in a text file using Spark & Scala into Dataframe

2015-12-23 Thread Gokula Krishnan D
You can try this, but I slightly modified the input structure since the first two columns were not in JSON format. [image: Inline image 1] Thanks & Regards, Gokula Krishnan* (Gokul)* On Wed, Dec 23, 2015 at 9:46 AM, Eran Witkon wrote: > Did you get a solution for this? > >
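Since the original answer is only an inline image, here is a rough sketch of the same idea (the schema, field names, and file name are assumptions): parse the JSON column with from_json and flatten the nested fields.

    import org.apache.spark.sql.functions.{col, from_json}
    import org.apache.spark.sql.types._

    val schema = new StructType()
      .add("id", StringType)
      .add("attributes", new StructType().add("name", StringType).add("value", StringType))

    val raw = spark.read.text("input.txt")   // one record per line; non-JSON prefix columns stripped beforehand
    val flat = raw
      .select(from_json(col("value"), schema).as("j"))
      .select(col("j.id"), col("j.attributes.name").as("attr_name"), col("j.attributes.value").as("attr_value"))
    flat.show()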

Database does not exist: (Spark-SQL ===> Hive)

2015-12-14 Thread Gokula Krishnan D
Hello All - I tried to execute a Spark-Scala program in order to create a table in Hive and faced a couple of errors, so I just tried to execute "show tables" and "show databases". I have already created a database named "test_db", but I encountered the error "Database does not exist"
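A sketch of a common fix for this error on Spark 1.x (assuming test_db was created in the real Hive metastore): use a HiveContext and make sure hive-site.xml is on the driver classpath, otherwise Spark falls back to a local Derby metastore that has no test_db.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setAppName("HiveCheck"))
    val hiveContext = new HiveContext(sc)
    hiveContext.sql("show databases").show()
    hiveContext.sql("use test_db")
    hiveContext.sql("show tables").show()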

Re: Filtering records based on empty value of column in SparkSql

2015-12-09 Thread Gokula Krishnan D
Hello Prashant - Can you please try it like this: for instance, the input file name is "student_detail.txt" and
ID,Name,Sex,Age
===
101,Alfred,Male,30
102,Benjamin,Male,31
103,Charlie,Female,30
104,Julie,Female,30
105,Maven,Male,30
106,Dexter,Male,30
107,Lundy,Male,32
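A sketch of the filtering itself (Spark 2.x DataFrame syntax; the empty values are assumed to be empty strings or nulls in the Name column of the sample above).

    import spark.implicits._

    val students = spark.read.option("header", "true").csv("student_detail.txt")
    val nonEmpty = students.filter($"Name".isNotNull && $"Name" =!= "")
    nonEmpty.show()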

Re: Filtering records based on empty value of column in SparkSql

2015-12-09 Thread Gokula Krishnan D
On Wed, Dec 9, 2015 at 9:33 PM, Gokula Krishnan D <email2...@gmail.com> wrote: > Hello Prashant - > Can you please try like this: > For the instance, input file name is "student_detail.txt" and > ID,Name,S

Re: Filtering records based on empty value of column in SparkSql

2015-12-09 Thread Gokula Krishnan D
2 AM, Gokula Krishnan D <email2...@gmail.com> wrote: > Ok, then you can slightly change like > [image: Inline image 1] > Thanks & Regards, > Gokula Krishnan* (Gokul)* > On Wed, Dec 9, 2015 at 11:09 AM, Prashant Bhardwaj <prashant2006s...@gmail

How to get the list of available Transformations and actions for a RDD in Spark-Shell

2015-12-04 Thread Gokula Krishnan D
Hello All - In spark-shell, when we press Tab after ".", we can see the possible list of transformations and actions, but we are unable to see the full list. Is there any other way to get the rest of the list? I'm mainly looking for sortByKey(). val sales_RDD = sc.textFile("Data/Scala/phone_sales.txt")
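A sketch of why sortByKey() does not show up in Tab completion here: it is defined on pair RDDs (PairRDDFunctions), so the RDD has to be mapped to (key, value) tuples first (the layout of phone_sales.txt is assumed to be "phone,sales" per line).

    val sales_RDD = sc.textFile("Data/Scala/phone_sales.txt")
    val sales_pairs = sales_RDD.map { line =>
      val Array(phone, sales) = line.split(",")
      (phone, sales.toInt)
    }
    sales_pairs.sortByKey().collect().foreach(println)   // Tab completion on sales_pairs. now lists sortByKey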

Re: How to get the list of available Transformations and actions for a RDD in Spark-Shell

2015-12-04 Thread Gokula Krishnan D
value pair to work. I think in Scala there are transformations such as .toPairRDD(). On Sat, Dec 5, 2015 at 12:01 AM, Gokula Krishnan D <email2...@gmail.com> wrote: > Hello All - > In spark-shell when we press tab after . ; we could see the