RE: Hive using Spark engine vs native spark with hive integration.

2020-10-06 Thread Manu Jacob
Thank you so much Mich! Although a bit older, this is the most detailed 
comparison I’ve read on the subject. Thanks again.

Regards,
-Manu

From: Mich Talebzadeh 
Sent: Tuesday, October 06, 2020 12:37 PM
To: user 
Subject: Re: Hive using Spark engine vs native spark with hive integration.


Hi Manu,

In the past (July 2016), I gave a presentation in London, organised by the then
Hortonworks, titled "Query Engines for Hive: MR, Spark, Tez with LLAP –
Considerations!"

The PDF of the presentation is here:
https://talebzadehmich.files.wordpress.com/2016/08/hive_on_spark_only.pdf
(with the caveat that it is more than four years old!)

However, as of today I would recommend writing the code in Spark with Scala and
running it against Spark. You can try it out in spark-shell to start with.

If you are reading from a Hive table or any other source such as CSV, there are
plenty of examples on the Spark website:
https://spark.apache.org/examples.html

Also, I suggest you use Scala, as Spark itself is written in Scala (though
Python is more popular with the data science guys).
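
As a quick illustration, here is a minimal sketch you can paste into
spark-shell (the table name and CSV path below are made up, and it assumes a
Hive-enabled Spark build that can see your hive-site.xml):

import org.apache.spark.sql.SparkSession

// In spark-shell a SparkSession is already available as "spark";
// standalone you would build one with Hive support enabled:
val spark = SparkSession.builder()
  .appName("HiveOnSparkSketch")
  .enableHiveSupport()   // picks up the Hive metastore from hive-site.xml
  .getOrCreate()

// Read from an existing Hive table (hypothetical name)
val hiveDf = spark.sql("SELECT * FROM mydb.sales")

// Or read a CSV file directly (hypothetical path)
val csvDf = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("hdfs:///data/sales.csv")

hiveDf.show(10)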

HTH




LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw







Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.




On Tue, 6 Oct 2020 at 16:47, Manu Jacob <manu.ja...@sas.com> wrote:
Hi All,

Not sure if I need to ask this question in the Hive community or the Spark community.

We have a set of Hive scripts that run on EMR (Tez engine). We would like to
experiment with moving some of them onto Spark. We are planning to experiment
with two options.

  1.  Use the current HQL-based code, with the execution engine set to Spark.
  2.  Write pure Spark code in Scala/Python using Spark SQL and Hive integration.

The first approach helps us transition to Spark quickly, but we are not sure
whether it is the best approach in terms of performance. We could not find any
reasonable comparison of these two approaches. It looks like writing pure Spark
code gives us more control to add logic, and also control over some of the
performance features, for example caching/eviction.
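
For option 1, as far as we understand, the scripts themselves would not change;
we would just switch the execution engine at the top of each script, something
like the following (assuming Hive on Spark is configured on the cluster):

set hive.execution.engine=spark;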


Any advice on this is much appreciated.


Thanks,
-Manu


Hive using Spark engine vs native spark with hive integration.

2020-10-06 Thread Manu Jacob
Hi All,

Not sure if I need to ask this question in the Hive community or the Spark community.

We have a set of Hive scripts that run on EMR (Tez engine). We would like to
experiment with moving some of them onto Spark. We are planning to experiment
with two options.

  1.  Use the current HQL-based code, with the execution engine set to Spark.
  2.  Write pure Spark code in Scala/Python using Spark SQL and Hive integration.

The first approach helps us transition to Spark quickly, but we are not sure
whether it is the best approach in terms of performance. We could not find any
reasonable comparison of these two approaches. It looks like writing pure Spark
code gives us more control to add logic, and also control over some of the
performance features, for example caching/eviction.


Any advice on this is much appreciated.


Thanks,
-Manu


Help in hive query

2012-10-10 Thread Manu A
 Hi All,
The result of the query below is 194965.0 0.0, but 194965 is the result of the
inner query's count(q1.response). It looks like the outer query
[select avg(q2.auth_count), stddev_pop(q2.auth_count)] didn't work at all.


 //Query
select avg(q2.auth_count), stddev_pop(q2.auth_count)
from (
  select q1.TEXT_CCYY, count(q1.response) as auth_count
  from (
    select * from Sale1
    where TEXT_DD = 7 and TEXT_HH = 15 and response = 00
  ) q1
  group by q1.TEXT_CCYY, q1.response
) q2
group by q2.auth_count;


Please help me: is there anything I have to change in the query?


Thanks & Regards,
Manu


Re: Help in hive query

2012-10-10 Thread Manu A
Thanks Jan. It worked!
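
For the archives: the fix was simply to drop the final group by q2.auth_count,
as Jan suggested, so the working query (reconstructed from the original) is
essentially:

select avg(q2.auth_count), stddev_pop(q2.auth_count)
from (
  select q1.TEXT_CCYY, count(q1.response) as auth_count
  from (
    select * from Sale1
    where TEXT_DD = 7 and TEXT_HH = 15 and response = 00
  ) q1
  group by q1.TEXT_CCYY, q1.response
) q2;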


Regards,
Manu

On Wed, Oct 10, 2012 at 12:00 PM, Jan Dolinár <dolik@gmail.com> wrote:

 Hi Manu,

 I believe the last group by q2.auth_count is wrong, because it
 causes the average to be computed only across rows with the same value of
 q2.auth_count, which is of course equal to that value.

 Best regards,
 J. Dolinar

 On Wed, Oct 10, 2012 at 8:19 AM, Manu A <hadoophi...@gmail.com> wrote:
  Hi All,
  The result of the query below is 194965.0 0.0, but 194965 is the result
  of the inner query's count(q1.response). It looks like the outer query
  [select avg(q2.auth_count), stddev_pop(q2.auth_count)] didn't work at all.
 
 
   //Query
  select avg(q2.auth_count), stddev_pop(q2.auth_count)
  from (
    select q1.TEXT_CCYY, count(q1.response) as auth_count
    from (
      select * from Sale1
      where TEXT_DD = 7 and TEXT_HH = 15 and response = 00
    ) q1
    group by q1.TEXT_CCYY, q1.response
  ) q2
  group by q2.auth_count;
 
 
  Please help me: is there anything I have to change in the query?
 
 
  Thanks & Regards,
  Manu
 
 



Re: Custom MR scripts using java in Hive

2012-09-26 Thread Manu A
Hi Manish,
Thanks, I did the same. But how do I invoke the custom Java map/reduce
functions (com.hive.test.TestMapper), since there is no script, only a jar
file? The process looks a bit different from a UDF (where I used create
temporary function).


On Wed, Sep 26, 2012 at 12:25 PM, Manish.Bhoge <manish.bh...@target.com> wrote:

  Sorry for the late reply.

 For anything which you want to run as MAP and REDUCE, you have to extend
 the map and reduce classes for your functionality, irrespective of language
 (Java, Python or any other). Once you have extended the class, move the jar
 to the Hadoop cluster.

 Bertrand has also mentioned reflection. That is something new for me.
 You can give reflection a try.


 Thank You,

 Manish


 From: Tamil A [mailto:4tamil...@gmail.com]
 Sent: Tuesday, September 25, 2012 6:48 PM
 To: user@hive.apache.org
 Subject: Re: Custom MR scripts using java in Hive


 Hi Manish,

 Thanks for your help. I did the same using a UDF. Now I am trying the
 Transform, Map and Reduce clauses. So does that mean that with Java we go
 through a UDF, while for other languages we use MapReduce scripts, i.e.
 the Transform, Map and Reduce clauses?

 Please correct me if I am wrong.
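
 To make sure I understand, here is a sketch of the two routes as I read
 them (the jar path, class name, script name and table name are all just
 made-up examples):

 -- Java route: package the class in a jar and register it as a UDF
 ADD JAR /usr/lib/hive/lib/my-udf.jar;
 CREATE TEMPORARY FUNCTION my_fn AS 'com.hive.test.MyUDF';
 SELECT my_fn(col1) FROM some_table;

 -- Script route: stream rows through an external script with TRANSFORM
 ADD FILE my_mapper.py;
 SELECT TRANSFORM (col1, col2)
   USING 'python my_mapper.py'
   AS (out1, out2)
 FROM some_table;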

  

  

  

 Thanks & Regards,

 Manu


 On Tue, Sep 25, 2012 at 5:19 PM, Manish.Bhoge <manish.bh...@target.com> wrote:

 Manu,

  

 If you have written a UDF in Java for Hive, then you need to copy your JAR
 onto your Hadoop cluster into the /usr/lib/hive/lib/ folder for Hive to use
 this JAR.

  

 Thank You,

 Manish

  

 From: Manu A [mailto:hadoophi...@gmail.com]
 Sent: Tuesday, September 25, 2012 3:44 PM
 To: user@hive.apache.org
 Subject: Custom MR scripts using java in Hive

  

 Hi All,

 I am learning Hive. Please let me know if anyone has tried custom Map
 Reduce scripts using Java in Hive, or refer me to some links and blogs
 with an example.

  

 When I tried, I got the below error:

  

 Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
 2012-09-25 02:47:23,720 Stage-1 map = 0%,  reduce = 0%
 2012-09-25 02:47:56,943 Stage-1 map = 100%,  reduce = 100%
 Ended Job = job_20120931_0001 with errors
 Error during job, obtaining debugging information...
 Examining task ID: task_20120931_0001_m_02 (and more) from job job_20120931_0001
 Exception in thread "Thread-51" java.lang.RuntimeException: Error while reading from task log url
     at org.apache.hadoop.hive.ql.exec.errors.TaskLogProcessor.getErrors(TaskLogProcessor.java:130)
     at org.apache.hadoop.hive.ql.exec.JobDebugger.showJobFailDebugInfo(JobDebugger.java:211)
     at org.apache.hadoop.hive.ql.exec.JobDebugger.run(JobDebugger.java:81)
     at java.lang.Thread.run(Thread.java:619)
 Caused by: java.io.IOException: Server returned HTTP response code: 400 for URL: // removed as confidential
     at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1313)
     at java.net.URL.openStream(URL.java:1010)
     at org.apache.hadoop.hive.ql.exec.errors.TaskLogProcessor.getErrors(TaskLogProcessor.java:120)
     ... 3 more
 FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
 MapReduce Jobs Launched:
 Job 0: Map: 1   HDFS Read: 0 HDFS Write: 0 FAIL
 Total MapReduce CPU Time Spent: 0 msec

  

  

  

  

 Thanks for your help in advance :)

  

  

  

 Thanks & Regards,

 Manu

  

  

  

  

  

  




 --
 Thanks & Regards,

 Tamil




Re: Custom MR scripts using java in Hive

2012-09-25 Thread Manu A
Thanks Manish. I'll try the same.



Thanks & Regards,

Manu

 


On Tue, Sep 25, 2012 at 5:19 PM, Manish.Bhoge <manish.bh...@target.com> wrote:

  Manu,


 If you have written a UDF in Java for Hive, then you need to copy your JAR
 onto your Hadoop cluster into the /usr/lib/hive/lib/ folder for Hive to use
 this JAR.


 Thank You,

 Manish


 From: Manu A [mailto:hadoophi...@gmail.com]
 Sent: Tuesday, September 25, 2012 3:44 PM
 To: user@hive.apache.org
 Subject: Custom MR scripts using java in Hive


 Hi All,

 I am learning Hive. Please let me know if anyone has tried custom Map
 Reduce scripts using Java in Hive, or refer me to some links and blogs
 with an example.

  

 When I tried, I got the below error:

  

 Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
 2012-09-25 02:47:23,720 Stage-1 map = 0%,  reduce = 0%
 2012-09-25 02:47:56,943 Stage-1 map = 100%,  reduce = 100%
 Ended Job = job_20120931_0001 with errors
 Error during job, obtaining debugging information...
 Examining task ID: task_20120931_0001_m_02 (and more) from job job_20120931_0001
 Exception in thread "Thread-51" java.lang.RuntimeException: Error while reading from task log url
     at org.apache.hadoop.hive.ql.exec.errors.TaskLogProcessor.getErrors(TaskLogProcessor.java:130)
     at org.apache.hadoop.hive.ql.exec.JobDebugger.showJobFailDebugInfo(JobDebugger.java:211)
     at org.apache.hadoop.hive.ql.exec.JobDebugger.run(JobDebugger.java:81)
     at java.lang.Thread.run(Thread.java:619)
 Caused by: java.io.IOException: Server returned HTTP response code: 400 for URL: // removed as confidential
     at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1313)
     at java.net.URL.openStream(URL.java:1010)
     at org.apache.hadoop.hive.ql.exec.errors.TaskLogProcessor.getErrors(TaskLogProcessor.java:120)
     ... 3 more
 FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
 MapReduce Jobs Launched:
 Job 0: Map: 1   HDFS Read: 0 HDFS Write: 0 FAIL
 Total MapReduce CPU Time Spent: 0 msec

  

  

  

  

 Thanks for your help in advance :)

  

  

  

 Thanks & Regards,

 Manu

  

  

  

  
