Right now, I am doing it as shown below:
import scala.io.Source

// File with one animal type (a column name of TEST1) per line
val animalsFile = "/home/ajay/dataset/animal_types.txt"
val animalTypes = Source.fromFile(animalsFile).getLines.toArray

for (anmtyp <- animalTypes) {
  // The query string must stay on one line; a plain Scala string literal cannot span lines
  val distinctAnmTypCount =
    sqlContext.sql("select count(distinct(" + anmtyp + ")) from TEST1")
  println("Calculating Metrics for Animal Type: " + anmtyp)
  // head() triggers a separate Spark job for each animal type
  if (distinctAnmTypCount.head().getAs[Long](0) <= 10) {
    println("Animal Type: " + anmtyp + " has <= 10 distinct values")
  } else {
    println("Animal Type: " + anmtyp + " has > 10 distinct values")
  }
}
The problem is that this runs sequentially: each animal type gets its own query, and the loop waits for one to finish before starting the next.
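
One possible alternative, given here only as an untested sketch, is to compute all the distinct counts in a single aggregation so that Spark scans TEST1 in one job instead of one job per animal type. It reuses sqlContext and animalTypes from the code above and assumes that every entry in animalTypes is a valid column name of TEST1 and that the list is not empty. Whether this is actually faster with 70 columns and millions of rows is an open question, since many distinct aggregates in one query can be expensive.

```scala
import org.apache.spark.sql.functions.{col, countDistinct}

// Assumption: TEST1 is registered as a table in this sqlContext
val test1 = sqlContext.table("TEST1")

// One count(distinct <column>) expression per animal type, aliased by column name
val aggExprs = animalTypes.map(a => countDistinct(col(a)).alias(a))

// A single aggregation returns one row holding all the counts
// (assumes animalTypes has at least one entry)
val counts = test1.agg(aggExprs.head, aggExprs.tail: _*).head()

animalTypes.foreach { a =>
  val n = counts.getAs[Long](a)
  val relation = if (n <= 10) "<=" else ">"
  println("Animal Type: " + a + " has " + relation + " 10 distinct values")
}
```

The design choice here is to let Spark parallelize inside one job rather than submitting 70 small jobs from the driver.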
Any input is appreciated. Thank you.
Regards,
Ajay
On Tue, Oct 4, 2016 at 7:44 PM, Ajay Chander <[email protected]> wrote:
> Hi Everyone,
>
> I have a use case with two DataFrames, as shown below.
>
> 1) The first DataFrame (DF1) contains:
>
> * ANIMALS *
> Mammals
> Birds
> Fish
> Reptiles
> Amphibians
>
> 2) The second DataFrame (DF2) contains:
>
> * ID, Mammals, Birds, Fish, Reptiles, Amphibians *
> 1, Dogs, Eagle, Goldfish, NULL, Frog
> 2, Cats, Peacock, Guppy, Turtle, Salamander
> 3, Dolphins, Eagle, Zander, NULL, Frog
> 4, Whales, Parrot, Guppy, Snake, Frog
> 5, Horses, Owl, Guppy, Snake, Frog
> 6, Dolphins, Kingfisher, Zander, Turtle, Frog
> 7, Dogs, Sparrow, Goldfish, NULL, Salamander
>
> Now I want to take each row from DF1 and find its distinct count in DF2.
> For example, pick Mammals from DF1, then compute count(distinct(Mammals))
> on DF2, which is 5 for the sample data above.
>
> DF1 has 70 distinct rows (animal types).
> DF2 has several million rows.
>
> What is the best way to do this efficiently and in parallel?
>
> Any input is helpful. Thank you.
>
> Regards,
> Ajay
>
>