Re: Code optimization

2016-04-19 Thread Alonso Isidoro Roman
Hi Angel,

How about pulling the repeated

k.filter(k("WT_ID").equalTo(...))

query out into a val (or a small helper)? I think you can avoid all that copy-pasting.
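
For example, here is an untested sketch, reusing the k and temp from your code below (maxCountFor is just a name I made up, and it assumes your imports are in scope):

// One helper instead of twelve copy-pasted Table0..Table11 statements:
// for a given WT_ID, the highest per-Couple_time count.
def maxCountFor(wtId: Any): Long =
  k.filter(k("WT_ID").equalTo(wtId))
    .groupBy("Couple_time")
    .count()
    .select(max($"count"))
    .first()
    .getLong(0)

// Compute it once per candidate and keep the pairs, so the final
// comparison is a simple maxBy instead of reading twelve printouts.
val results = temp.map(id => (id, maxCountFor(id)))
val winner = results.maxBy(_._2)

That also answers your last question: winner is the (WT_ID, max count) pair with the largest value.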
And do not forget to use System.nanoTime to measure the gain...
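
Something like this tiny helper would do (again, just a sketch):

// Minimal wall-clock timer: wrap any block and print how long it took.
def timed[T](label: String)(body: => T): T = {
  val t0 = System.nanoTime()
  val result = body
  println(s"$label took ${(System.nanoTime() - t0) / 1e6} ms")
  result
}

// usage: timed("all maxima") { temp.map(id => (id, maxCountFor(id))) }

Time the old version and the refactored one and compare.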

Alonso Isidoro Roman.


My favorite quotes (today):
"If debugging is the process of removing software bugs, then programming
must be the process of putting them in."
  - Edsger Dijkstra

"If you pay peanuts you get monkeys"




Code optimization

2016-04-19 Thread Angel Angel
Hello,

I am writing a Spark application. It runs well, but the execution time is
long. Can anyone help me optimize my query to increase the processing speed?


In this application I have to construct histograms and compare them in order
to find the final candidate.


My code reads a text file, matches on the first field, subtracts the second
field from the matched candidates, and updates the table.

Here is my code; please help me optimize it.


val sqlContext = new org.apache.spark.sql.SQLContext(sc)


import sqlContext.implicits._
import org.apache.spark.sql.functions._ // needed for desc and max below


val Array_Ele = sc
  .textFile("/root/Desktop/database_200/patch_time_All_20_modified_1.txt")
  .flatMap(line => line.split(" "))
  .take(900)


val df1 = sqlContext.read
  .parquet("hdfs://hadoopm0:8020/tmp/input1/database_modified_No_name_400.parquet")


var k = df1.filter(df1("Address").equalTo(Array_Ele(0)))

// Array_Ele holds alternating address/offset tokens: match each address and
// shift Couple_time by the value that follows it.
for (a <- 2 until 900 by 2) {
  k = k.unionAll(
    df1.filter(df1("Address").equalTo(Array_Ele(a)))
      .select(df1("Address"), df1("Couple_time") - Array_Ele(a + 1), df1("WT_ID")))
}


// k is reused by every query below, so keep it in memory.
k.cache()


val WT_ID_Sort = k.groupBy("WT_ID").count().sort(desc("count"))


// temp(0) .. temp(11) are used below, so take 12 candidates.
val temp = WT_ID_Sort.select("WT_ID").rdd.map(r => r(0)).take(12)


// For each candidate WT_ID, the highest per-Couple_time count.
val Table0 = k.filter(k("WT_ID").equalTo(temp(0))).groupBy("Couple_time").count().select(max($"count")).show()

val Table1 = k.filter(k("WT_ID").equalTo(temp(1))).groupBy("Couple_time").count().select(max($"count")).show()

val Table2 = k.filter(k("WT_ID").equalTo(temp(2))).groupBy("Couple_time").count().select(max($"count")).show()

val Table3 = k.filter(k("WT_ID").equalTo(temp(3))).groupBy("Couple_time").count().select(max($"count")).show()

val Table4 = k.filter(k("WT_ID").equalTo(temp(4))).groupBy("Couple_time").count().select(max($"count")).show()

val Table5 = k.filter(k("WT_ID").equalTo(temp(5))).groupBy("Couple_time").count().select(max($"count")).show()

val Table6 = k.filter(k("WT_ID").equalTo(temp(6))).groupBy("Couple_time").count().select(max($"count")).show()

val Table7 = k.filter(k("WT_ID").equalTo(temp(7))).groupBy("Couple_time").count().select(max($"count")).show()

val Table8 = k.filter(k("WT_ID").equalTo(temp(8))).groupBy("Couple_time").count().select(max($"count")).show()

val Table9 = k.filter(k("WT_ID").equalTo(temp(9))).groupBy("Couple_time").count().select(max($"count")).show()

val Table10 = k.filter(k("WT_ID").equalTo(temp(10))).groupBy("Couple_time").count().select(max($"count")).show()

val Table11 = k.filter(k("WT_ID").equalTo(temp(11))).groupBy("Couple_time").count().select(max($"count")).show()


And finally, how can I compare all these tables to find the maximum value?




Thanks,