Re: Need help on Spark UDF (Join) Performance tuning .

2014-07-18 Thread Michael Armbrust
It's likely that since your UDF is a black box to hive's query optimizer
that it must choose a less efficient join algorithm that passes all
possible  matches to your function for comparison.  This will happen any
time your UDF touches attributes from both sides of the join.

In general you can learn more about the chosen execution strategy by
running explain.
On Jul 18, 2014 12:04 PM, "S Malligarjunan" 
wrote:

> Hello Experts,
>
> Appreciate your input highly, please suggest/ give me hint, what would be
> the issue here?
>
>
> Thanks and Regards,
> Malligarjunan S.
>
>
>
>   On Thursday, 17 July 2014, 22:47, S Malligarjunan <
> smalligarju...@yahoo.com> wrote:
>
>
> Hello Experts,
>
> I am facing performance problem when I use the UDF function call. Please
> help me to tune the query.
> Please find the details below
>
> shark> select count(*) from table1;
> OK
> 151096
> Time taken: 7.242 seconds
> shark> select count(*) from table2;
> OK
> 938
> Time taken: 1.273 seconds
>
>
> *Without UDF:*shark> SELECT
>  > count(pvc1.time)
>  >   FROM table2 pvc2 JOIN table1 pvc1
>  >   WHERE pvc1.col1 = pvc2.col2
>  >   AND  unix_timestamp(pvc2.time, '-MM-dd HH:mm:ss,SSS') >
> unix_timestamp(pvc1.time, '-MM-dd HH:mm:ss,SSS');
> OK
> 328
> Time taken: 200.487 seconds
>
>
> shark>
>  > SELECT
>  > count(pvc1.time)
>  >   FROM table2 pvc2 JOIN table1 pvc1
>  >   WHERE (pvc1.col1 = pvc2.col1 OR pvc1.col1 = pvc2.col2)
>  >   AND  unix_timestamp(pvc2.time, '-MM-dd HH:mm:ss,SSS') >
> unix_timestamp(pvc1.time, '-MM-dd HH:mm:ss,SSS');
> OK
> 331
> Time taken: 292.86 seconds
>
> *With UDF:*
> shark>
>  >  SELECT
>  > count(pvc1.time)
>  >   FROM table2 pvc2 JOIN table1 pvc1
>  >   WHERE testCompare(pvc1.col1, pvc1.col2, pvc2.col1,pvc2.col2)
>  >   AND  unix_timestamp(pvc2.time, '-MM-dd HH:mm:ss,SSS') >
> unix_timestamp(pvc1.time, '-MM-dd HH:mm:ss,SSS');
>
> OK
> 331
> Time taken: 3718.23 seconds
> The above UDF query takes more time to run.
>
> Where testCompare is an udf function, The function just does the pvc1.col1
> = pvc2.col1 OR pvc1.col1 = pvc2.col2
>
> Please let me know what is the issue here?
>
>
> Thanks and Regards,
> Sankar S.
>
>
>
>


Re: Need help on Spark UDF (Join) Performance tuning .

2014-07-18 Thread S Malligarjunan
Hello Experts,

Appreciate your input highly, please suggest/ give me hint, what would be the 
issue here?

 
Thanks and Regards,
Malligarjunan S.  



On Thursday, 17 July 2014, 22:47, S Malligarjunan  
wrote:
 


Hello Experts,

I am facing performance problem when I use the UDF function call. Please help 
me to tune the query.
Please find the details below

shark> select count(*) from table1;
OK
151096
Time taken: 7.242 seconds
shark> select count(*) from table2; 
OK
938
Time taken: 1.273 seconds

Without UDF:
shark> SELECT
 >
 count(pvc1.time)
 >   FROM table2 pvc2 JOIN table1 pvc1
 >   WHERE pvc1.col1 = pvc2.col2
 >   AND  unix_timestamp(pvc2.time, '-MM-dd HH:mm:ss,SSS') > 
unix_timestamp(pvc1.time, '-MM-dd HH:mm:ss,SSS');
OK
328
Time taken: 200.487 seconds


shark> 
 > SELECT
 > count(pvc1.time)
     >   FROM table2 pvc2 JOIN table1 pvc1
 >   WHERE (pvc1.col1 = pvc2.col1 OR pvc1.col1 = pvc2.col2)
 >   AND  unix_timestamp(pvc2.time, '-MM-dd HH:mm:ss,SSS') > 
unix_timestamp(pvc1.time, '-MM-dd HH:mm:ss,SSS');
OK
331
Time taken: 292.86 seconds

With UDF:
shark>  
 >  SELECT
 > count(pvc1.time)
 >   FROM table2 pvc2 JOIN table1 pvc1
 >   WHERE testCompare(pvc1.col1, pvc1.col2, pvc2.col1,pvc2.col2)
 >  
 AND  unix_timestamp(pvc2.time, '-MM-dd HH:mm:ss,SSS') > 
unix_timestamp(pvc1.time, '-MM-dd HH:mm:ss,SSS');

OK
331
Time taken: 3718.23 seconds

The above UDF query takes more time to run. 


Where testCompare is an udf function, The function just does the pvc1.col1 = 
pvc2.col1 OR pvc1.col1 = pvc2.col2

Please let me know what is the issue here?

 
Thanks and Regards,
Sankar S.  

Need help on Spark UDF (Join) Performance tuning .

2014-07-17 Thread S Malligarjunan
Hello Experts,

I am facing performance problem when I use the UDF function call. Please help 
me to tune the query.
Please find the details below

shark> select count(*) from table1;
OK
151096
Time taken: 7.242 seconds
shark> select count(*) from table2; 
OK
938
Time taken: 1.273 seconds

Without UDF:
shark> SELECT
 > count(pvc1.time)
 >   FROM table2 pvc2 JOIN table1 pvc1
 >   WHERE pvc1.col1 = pvc2.col2
 >   AND  unix_timestamp(pvc2.time, '-MM-dd HH:mm:ss,SSS') > 
unix_timestamp(pvc1.time, '-MM-dd HH:mm:ss,SSS');
OK
328
Time taken: 200.487 seconds


shark> 
 > SELECT
 > count(pvc1.time)
     >   FROM table2 pvc2 JOIN table1 pvc1
 >   WHERE (pvc1.col1 = pvc2.col1 OR pvc1.col1 = pvc2.col2)
 >   AND  unix_timestamp(pvc2.time, '-MM-dd HH:mm:ss,SSS') > 
unix_timestamp(pvc1.time, '-MM-dd HH:mm:ss,SSS');
OK
331
Time taken: 292.86 seconds

With UDF:
shark>  
 >  SELECT
 > count(pvc1.time)
 >   FROM table2 pvc2 JOIN table1 pvc1
 >   WHERE testCompare(pvc1.col1,pvc1.col2, pvc2.col1,pvc2.col2)
 >   AND  unix_timestamp(pvc2.time, '-MM-dd HH:mm:ss,SSS') > 
unix_timestamp(pvc1.time, '-MM-dd HH:mm:ss,SSS');

OK
331
Time taken: 3718.23 seconds

The above UDF query takes more time to run. 


Where testCompare is an udf function, The function just does the pvc1.col1 = 
pvc2.col1 OR pvc1.col1 = pvc2.col2

Please let me know what is the issue here?

 
Thanks and Regards,
Sankar S.