Re: Need help on Spark UDF (Join) Performance tuning .
It's likely that since your UDF is a black box to hive's query optimizer that it must choose a less efficient join algorithm that passes all possible matches to your function for comparison. This will happen any time your UDF touches attributes from both sides of the join. In general you can learn more about the chosen execution strategy by running explain. On Jul 18, 2014 12:04 PM, "S Malligarjunan" wrote: > Hello Experts, > > Appreciate your input highly, please suggest/ give me hint, what would be > the issue here? > > > Thanks and Regards, > Malligarjunan S. > > > > On Thursday, 17 July 2014, 22:47, S Malligarjunan < > smalligarju...@yahoo.com> wrote: > > > Hello Experts, > > I am facing performance problem when I use the UDF function call. Please > help me to tune the query. > Please find the details below > > shark> select count(*) from table1; > OK > 151096 > Time taken: 7.242 seconds > shark> select count(*) from table2; > OK > 938 > Time taken: 1.273 seconds > > > *Without UDF:*shark> SELECT > > count(pvc1.time) > > FROM table2 pvc2 JOIN table1 pvc1 > > WHERE pvc1.col1 = pvc2.col2 > > AND unix_timestamp(pvc2.time, '-MM-dd HH:mm:ss,SSS') > > unix_timestamp(pvc1.time, '-MM-dd HH:mm:ss,SSS'); > OK > 328 > Time taken: 200.487 seconds > > > shark> > > SELECT > > count(pvc1.time) > > FROM table2 pvc2 JOIN table1 pvc1 > > WHERE (pvc1.col1 = pvc2.col1 OR pvc1.col1 = pvc2.col2) > > AND unix_timestamp(pvc2.time, '-MM-dd HH:mm:ss,SSS') > > unix_timestamp(pvc1.time, '-MM-dd HH:mm:ss,SSS'); > OK > 331 > Time taken: 292.86 seconds > > *With UDF:* > shark> > > SELECT > > count(pvc1.time) > > FROM table2 pvc2 JOIN table1 pvc1 > > WHERE testCompare(pvc1.col1, pvc1.col2, pvc2.col1,pvc2.col2) > > AND unix_timestamp(pvc2.time, '-MM-dd HH:mm:ss,SSS') > > unix_timestamp(pvc1.time, '-MM-dd HH:mm:ss,SSS'); > > OK > 331 > Time taken: 3718.23 seconds > The above UDF query takes more time to run. > > Where testCompare is an udf function, The function just does the pvc1.col1 > = pvc2.col1 OR pvc1.col1 = pvc2.col2 > > Please let me know what is the issue here? > > > Thanks and Regards, > Sankar S. > > > >
Re: Need help on Spark UDF (Join) Performance tuning .
Hello Experts, Appreciate your input highly, please suggest/ give me hint, what would be the issue here? Thanks and Regards, Malligarjunan S. On Thursday, 17 July 2014, 22:47, S Malligarjunan wrote: Hello Experts, I am facing performance problem when I use the UDF function call. Please help me to tune the query. Please find the details below shark> select count(*) from table1; OK 151096 Time taken: 7.242 seconds shark> select count(*) from table2; OK 938 Time taken: 1.273 seconds Without UDF: shark> SELECT > count(pvc1.time) > FROM table2 pvc2 JOIN table1 pvc1 > WHERE pvc1.col1 = pvc2.col2 > AND unix_timestamp(pvc2.time, '-MM-dd HH:mm:ss,SSS') > unix_timestamp(pvc1.time, '-MM-dd HH:mm:ss,SSS'); OK 328 Time taken: 200.487 seconds shark> > SELECT > count(pvc1.time) > FROM table2 pvc2 JOIN table1 pvc1 > WHERE (pvc1.col1 = pvc2.col1 OR pvc1.col1 = pvc2.col2) > AND unix_timestamp(pvc2.time, '-MM-dd HH:mm:ss,SSS') > unix_timestamp(pvc1.time, '-MM-dd HH:mm:ss,SSS'); OK 331 Time taken: 292.86 seconds With UDF: shark> > SELECT > count(pvc1.time) > FROM table2 pvc2 JOIN table1 pvc1 > WHERE testCompare(pvc1.col1, pvc1.col2, pvc2.col1,pvc2.col2) > AND unix_timestamp(pvc2.time, '-MM-dd HH:mm:ss,SSS') > unix_timestamp(pvc1.time, '-MM-dd HH:mm:ss,SSS'); OK 331 Time taken: 3718.23 seconds The above UDF query takes more time to run. Where testCompare is an udf function, The function just does the pvc1.col1 = pvc2.col1 OR pvc1.col1 = pvc2.col2 Please let me know what is the issue here? Thanks and Regards, Sankar S.
Need help on Spark UDF (Join) Performance tuning .
Hello Experts, I am facing performance problem when I use the UDF function call. Please help me to tune the query. Please find the details below shark> select count(*) from table1; OK 151096 Time taken: 7.242 seconds shark> select count(*) from table2; OK 938 Time taken: 1.273 seconds Without UDF: shark> SELECT > count(pvc1.time) > FROM table2 pvc2 JOIN table1 pvc1 > WHERE pvc1.col1 = pvc2.col2 > AND unix_timestamp(pvc2.time, '-MM-dd HH:mm:ss,SSS') > unix_timestamp(pvc1.time, '-MM-dd HH:mm:ss,SSS'); OK 328 Time taken: 200.487 seconds shark> > SELECT > count(pvc1.time) > FROM table2 pvc2 JOIN table1 pvc1 > WHERE (pvc1.col1 = pvc2.col1 OR pvc1.col1 = pvc2.col2) > AND unix_timestamp(pvc2.time, '-MM-dd HH:mm:ss,SSS') > unix_timestamp(pvc1.time, '-MM-dd HH:mm:ss,SSS'); OK 331 Time taken: 292.86 seconds With UDF: shark> > SELECT > count(pvc1.time) > FROM table2 pvc2 JOIN table1 pvc1 > WHERE testCompare(pvc1.col1,pvc1.col2, pvc2.col1,pvc2.col2) > AND unix_timestamp(pvc2.time, '-MM-dd HH:mm:ss,SSS') > unix_timestamp(pvc1.time, '-MM-dd HH:mm:ss,SSS'); OK 331 Time taken: 3718.23 seconds The above UDF query takes more time to run. Where testCompare is an udf function, The function just does the pvc1.col1 = pvc2.col1 OR pvc1.col1 = pvc2.col2 Please let me know what is the issue here? Thanks and Regards, Sankar S.