RE: Hive using Spark engine vs native spark with hive integration.
Thank you so much Mich! Although a bit older, this is the most detailed comparison I've read on the subject. Thanks again.

Regards,
-Manu

From: Mich Talebzadeh
Sent: Tuesday, October 06, 2020 12:37 PM
To: user
Subject: Re: Hive using Spark engine vs native spark with hive integration.

Hi Manu,

In the past (July 2016), I gave a presentation, organised by the then Hortonworks in London, titled "Query Engines for Hive: MR, Spark, Tez with LLAP – Considerations!" The PDF is here: https://talebzadehmich.files.wordpress.com/2016/08/hive_on_spark_only.pdf — with the caveat that it is more than four years old!

However, as of today I would recommend writing the code in Spark with Scala and running it against Spark. You can try it using spark-shell to start with. If you are reading from a Hive table or any other source such as CSV, there are plenty of examples on the Spark website: https://spark.apache.org/examples.html

I also suggest that you use Scala, as Spark itself is written in Scala (though Python is more popular with the data science crowd).
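As a concrete starting point for the recommendation above, here is a minimal spark-shell style sketch for reading a Hive table, under the assumption that Spark was built with Hive support and can reach the metastore; the database/table name and file path are hypothetical:

```scala
// Minimal sketch: read a Hive table from Spark with Hive support enabled.
// "sales_db.sale1" and "/path/to/data.csv" are hypothetical placeholders.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("HiveReadExample")
  .enableHiveSupport()   // required to query tables registered in the Hive metastore
  .getOrCreate()

// Query a Hive table through Spark SQL
val df = spark.sql("SELECT * FROM sales_db.sale1 WHERE response = '00'")
df.show(10)

// Reading from another source, e.g. CSV, uses the same session
val csv = spark.read.option("header", "true").csv("/path/to/data.csv")
csv.printSchema()
```

Inside spark-shell the `spark` session already exists, so the builder lines can be skipped there.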
HTH,

Mich

LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

On Tue, 6 Oct 2020 at 16:47, Manu Jacob <manu.ja...@sas.com> wrote:
Hive using Spark engine vs native spark with hive integration.
Hi All,

Not sure whether I need to ask this question on the Hive community or the Spark community. We have a set of Hive scripts that run on EMR (Tez engine). We would like to experiment by moving some of them onto Spark, and we are planning to try two options:

1. Use the current HQL-based code, with the execution engine set to Spark.
2. Write pure Spark code in Scala/Python using Spark SQL and Hive integration.

The first approach helps us transition to Spark quickly, but we are not sure it is the best approach in terms of performance. We could not find any reasonable comparisons of these two approaches. It looks like writing pure Spark code gives us more control to add logic and also to tune some of the performance features, for example caching/evicting. Any advice on this is much appreciated.

Thanks,
-Manu
Help in hive query
Hi All,

The result of the query below is 194965.0 0.0, but 194965 is the result of the inner query's count(q1.response). It looks like the outer query [select avg(q2.auth_count), stddev_pop(q2.auth_count)] didn't work at all.

//Query
select avg(q2.auth_count), stddev_pop(q2.auth_count)
from (
    select q1.TEXT_CCYY, count(q1.response) as auth_count
    from (
        select * from Sale1
        where TEXT_DD=7 AND TEXT_HH=15 AND response=00
    ) q1
    group by q1.TEXT_CCYY, q1.response
) q2
group by q2.auth_count;

Please help me: is there anything I have to change in the query?

Thanks & Regards,
Manu
Re: Help in hive query
Thanks Jan. It worked!

Regards,
Manu

On Wed, Oct 10, 2012 at 12:00 PM, Jan Dolinár <dolik@gmail.com> wrote:

Hi Manu,

I believe the last "group by q2.auth_count" is wrong, because it causes the average to be computed only across rows with the same value of q2.auth_count, which is of course equal to that value itself.

Best regards,
J. Dolinar

On Wed, Oct 10, 2012 at 8:19 AM, Manu A <hadoophi...@gmail.com> wrote:
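Following Jan's diagnosis, the fix is simply to drop the trailing GROUP BY, so that avg and stddev_pop aggregate over all the per-TEXT_CCYY counts instead of over single-valued groups. A sketch of the corrected query:

```sql
-- Without the outer GROUP BY, avg/stddev_pop run over all rows of q2
select avg(q2.auth_count), stddev_pop(q2.auth_count)
from (
    select q1.TEXT_CCYY, count(q1.response) as auth_count
    from (
        select * from Sale1
        where TEXT_DD=7 AND TEXT_HH=15 AND response=00
    ) q1
    group by q1.TEXT_CCYY, q1.response
) q2;
```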
Re: Custom MR scripts using java in Hive
Hi Manish,

Thanks, I did the same. But how do I invoke the custom Java map/reduce functions (com.hive.test.TestMapper), since there is no script, only a jar file? The process looks a bit different from a UDF (where I used CREATE TEMPORARY FUNCTION).

On Wed, Sep 26, 2012 at 12:25 PM, Manish.Bhoge <manish.bh...@target.com> wrote:

Sorry for the late reply.

For anything which you want to run as MAP and REDUCE, you have to extend the map/reduce classes for your functionality, irrespective of language (Java, Python or any other). Once you have the extended class, move the jar to the Hadoop cluster. Bertrand also mentioned reflection; that is something new for me, but you can give it a try.

Thank You,
Manish

From: Tamil A [mailto:4tamil...@gmail.com]
Sent: Tuesday, September 25, 2012 6:48 PM
To: user@hive.apache.org
Subject: Re: Custom MR scripts using java in Hive

Hi Manish,

Thanks for your help. I did the same using a UDF, and am now trying the TRANSFORM, MAP and REDUCE clauses. So does it mean that with Java we have to go through a UDF, while for other languages we use MapReduce scripts, i.e. the TRANSFORM, MAP and REDUCE clauses? Please correct me if I am wrong.

Thanks & Regards,
Manu

On Tue, Sep 25, 2012 at 5:19 PM, Manish.Bhoge <manish.bh...@target.com> wrote:

Manu, if you have written a UDF in Java for Hive, then you need to copy your JAR into the /usr/lib/hive/lib/ folder on your Hadoop cluster for Hive to use it.

Thank You,
Manish

From: Manu A [mailto:hadoophi...@gmail.com]
Sent: Tuesday, September 25, 2012 3:44 PM
To: user@hive.apache.org
Subject: Custom MR scripts using java in Hive

Hi All,

I am learning Hive. Please let me know if anyone has tried custom Map Reduce scripts using Java in Hive, or point me to some links or blogs with an example.
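To connect the two approaches discussed in this thread: a Java UDF is registered with ADD JAR plus CREATE TEMPORARY FUNCTION, while TRANSFORM pipes rows through an external command, which can be a Java program launched from the jar. A sketch, where the jar name, function name, table and column names are hypothetical (only com.hive.test.TestMapper comes from the thread):

```sql
-- UDF route: register the jar, then bind a function name to the class
ADD JAR /usr/lib/hive/lib/hive-test.jar;
CREATE TEMPORARY FUNCTION my_udf AS 'com.hive.test.MyUDF';
SELECT my_udf(col1) FROM src;

-- Streaming route: TRANSFORM streams rows through an external command;
-- for a Java mapper, that command can launch the class from the jar
ADD FILE hive-test.jar;
FROM src
SELECT TRANSFORM(col1, col2)
  USING 'java -cp hive-test.jar com.hive.test.TestMapper'
  AS (out1, out2);
```

The streaming route reads rows on stdin (tab-delimited by default) and writes rows to stdout, so the Java class must be written as a standalone program rather than a Hadoop Mapper subclass.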
When I tried, I got the below error:

Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2012-09-25 02:47:23,720 Stage-1 map = 0%, reduce = 0%
2012-09-25 02:47:56,943 Stage-1 map = 100%, reduce = 100%
Ended Job = job_20120931_0001 with errors
Error during job, obtaining debugging information...
Examining task ID: task_20120931_0001_m_02 (and more) from job job_20120931_0001
Exception in thread Thread-51 java.lang.RuntimeException: Error while reading from task log url
    at org.apache.hadoop.hive.ql.exec.errors.TaskLogProcessor.getErrors(TaskLogProcessor.java:130)
    at org.apache.hadoop.hive.ql.exec.JobDebugger.showJobFailDebugInfo(JobDebugger.java:211)
    at org.apache.hadoop.hive.ql.exec.JobDebugger.run(JobDebugger.java:81)
    at java.lang.Thread.run(Thread.java:619)
Caused by: java.io.IOException: Server returned HTTP response code: 400 for URL: // removed as confidential
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1313)
    at java.net.URL.openStream(URL.java:1010)
    at org.apache.hadoop.hive.ql.exec.errors.TaskLogProcessor.getErrors(TaskLogProcessor.java:120)
    ... 3 more
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
MapReduce Jobs Launched:
Job 0: Map: 1  HDFS Read: 0  HDFS Write: 0  FAIL
Total MapReduce CPU Time Spent: 0 msec

Thanks for your help in advance :)

Thanks & Regards,
Manu

--
Thanks & Regards,
Tamil
Re: Custom MR scripts using java in Hive
Thanks Manish, I'll try the same.

Thanks & Regards,
Manu

On Tue, Sep 25, 2012 at 5:19 PM, Manish.Bhoge <manish.bh...@target.com> wrote:

Manu, if you have written a UDF in Java for Hive, then you need to copy your JAR into the /usr/lib/hive/lib/ folder on your Hadoop cluster for Hive to use it.

Thank You,
Manish