Difference between a local Hive metastore server and a Hive-based metastore server

2015-12-17 Thread Divya Gehlot
Hi,
I am a newbie to Spark and am using 1.4.1.
I am confused about the difference between a local metastore server and a
Hive-based metastore server.
Can somebody share the use cases for when to use which one, and the pros and cons?

I am using HDP 2.3.2, in which hive-site.xml is already in the Spark
configuration directory, which means HDP 2.3.2 already uses a Hive-based
metastore server.
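For reference, a quick way to check which mode a given installation is using (a
minimal sketch, assuming you can open the Hive CLI or a Spark SQL shell) is to
print the relevant metastore properties:

set hive.metastore.uris;
-- a thrift://host:9083 style value means a remote, Hive-based metastore service
set javax.jdo.option.ConnectionURL;
-- a jdbc:derby:...metastore_db value means the local, embedded Derby metastore

Broadly, the local/embedded Derby metastore supports only a single session and
is meant for quick tests, while a Hive-based (remote) metastore backed by a
proper RDBMS is what multi-user distributions such as HDP configure by default.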


Hive on Spark throws java.lang.NullPointerException

2015-12-17 Thread Jone Zhang
*My query is *
set hive.execution.engine=spark;
select
t3.pcid,channel,version,ip,hour,app_id,app_name,app_apk,app_version,app_type,dwl_tool,dwl_status,err_type,dwl_store,dwl_maxspeed,dwl_minspeed,dwl_avgspeed,last_time,dwl_num,
(case when t4.cnt is null then 0 else 1 end) as is_evil
from
(select /*+mapjoin(t2)*/
pcid,channel,version,ip,hour,
(case when t2.app_id is null then t1.app_id else t2.app_id end) as app_id,
t2.name as app_name,
app_apk,
app_version,app_type,dwl_tool,dwl_status,err_type,dwl_store,dwl_maxspeed,dwl_minspeed,dwl_avgspeed,last_time,dwl_num
from
t_ed_soft_downloadlog_molo t1 left outer join t_rd_soft_app_pkg_name t2 on
(lower(t1.app_apk) = lower(t2.package_id) and t1.ds = 20151217 and t2.ds =
20151217)
where
t1.ds = 20151217) t3
left outer join
(
select pcid,count(1) cnt  from t_ed_soft_evillog_molo where ds=20151217
 group by pcid
) t4
on t3.pcid=t4.pcid;


*And the error log is *
2015-12-18 08:10:18,685 INFO  [main]: spark.SparkMapJoinOptimizer
(SparkMapJoinOptimizer.java:process(79)) - Check if it can be converted to
map join
2015-12-18 08:10:18,686 ERROR [main]: ql.Driver
(SessionState.java:printError(966)) - FAILED: NullPointerException null
java.lang.NullPointerException
        at org.apache.hadoop.hive.ql.optimizer.spark.SparkMapJoinOptimizer.getConnectedParentMapJoinSize(SparkMapJoinOptimizer.java:312)
        at org.apache.hadoop.hive.ql.optimizer.spark.SparkMapJoinOptimizer.getConnectedMapJoinSize(SparkMapJoinOptimizer.java:292)
        at org.apache.hadoop.hive.ql.optimizer.spark.SparkMapJoinOptimizer.getMapJoinConversionInfo(SparkMapJoinOptimizer.java:271)
        at org.apache.hadoop.hive.ql.optimizer.spark.SparkMapJoinOptimizer.process(SparkMapJoinOptimizer.java:80)
        at org.apache.hadoop.hive.ql.optimizer.spark.SparkJoinOptimizer.process(SparkJoinOptimizer.java:58)
        at org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher.dispatch(DefaultRuleDispatcher.java:92)
        at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatchAndReturn(DefaultGraphWalker.java:97)
        at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatch(DefaultGraphWalker.java:81)
        at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.walk(DefaultGraphWalker.java:135)
        at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.startWalking(DefaultGraphWalker.java:112)
        at org.apache.hadoop.hive.ql.parse.spark.SparkCompiler.optimizeOperatorPlan(SparkCompiler.java:128)
        at org.apache.hadoop.hive.ql.parse.TaskCompiler.compile(TaskCompiler.java:102)
        at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:10238)
        at org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:210)
        at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:233)
        at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:425)
        at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:308)
        at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1123)
        at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1171)
        at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1060)
        at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1050)
        at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:208)
        at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:160)
        at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:447)
        at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:357)
        at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:795)
        at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:767)
        at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:704)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:212)


*Some properties in hive-site.xml are*

<property>
   <name>hive.ignore.mapjoin.hint</name>
   <value>false</value>
</property>

<property>
   <name>hive.auto.convert.join</name>
   <value>true</value>
</property>

<property>
   <name>hive.auto.convert.join.noconditionaltask</name>
   <value>true</value>
</property>



*The code relevant to the error is*
long mjSize = ctx.getMjOpSizes().get(op);
*I think it should be checked whether* ctx.getMjOpSizes().get(op) *is null.*

*Of course, stricter logic is up to you to decide.*
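A possible interim workaround (untested, and only a guess based on the
optimizer code path in the stack trace) is to keep this query away from the
Spark map-join conversion altogether:

set hive.ignore.mapjoin.hint=true;   -- ignore the /*+mapjoin(t2)*/ hint
set hive.auto.convert.join=false;    -- and/or turn off automatic map-join conversion

Whether either setting actually avoids the failing code path would need to be
verified.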


*Thanks.*
*Best Wishes.*


Re: Hive partition load

2015-12-17 Thread Suyog Parlikar
Thanks Alan for the reply.

I have one more question along similar lines:

Can we move data from one partition to another in a Hive table based on a
condition?

If yes, what would be an efficient way to do that?
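For example, would something along these lines be the right approach? (The
partition column, data columns and filter below are made up for illustration.)

-- copy the matching rows from partition p1 into partition p2
insert into table test partition (part_col='p2')
select col1, col2
from test
where part_col='p1' and some_condition;

-- then rewrite p1 without the moved rows
insert overwrite table test partition (part_col='p1')
select col1, col2
from test
where part_col='p1' and not some_condition;

(Depending on the Hive version, overwriting a partition that is also being read
in the same statement may need to be staged via a temporary table.)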

Thanks in advance.

Regards,
Suyog
On Dec 17, 2015 11:43 PM, "Alan Gates"  wrote:

> Yes, you can load different partitions simultaneously.
>
> Alan.
>
> Suyog Parlikar 
> December 17, 2015 at 5:02
>
> Hello everyone,
>
> Can we load different partitions of a Hive table simultaneously?
>
> Are there any locking issues with that? If so, what are they?
>
> Please see the example below for more details.
>
> Consider I have a Hive table test with two partitions p1 and p2.
>
> I want to load data into partitions p1 and p2 at the same time.
>
> Awaiting your reply.
>
> Thanks,
> Suyog
>
>


Re: Discussion: permanent UDF with database name

2015-12-17 Thread jipengz...@meilishuo.com
@Furcy Pin
I agree with your idea!
I found that since Hive 0.13 a user can define a permanent UDF, but it must be
bound to a database name. So if we want to use the UDF without a database name,
we must create it in every database. That brings another problem: when we
create a new database, we need to gather all of the UDFs we have already
defined and then create them one by one.
This is the biggest problem I have encountered in using this feature.

jipengzeng



 
From: Furcy Pin
Date: 2015-12-17 20:14
To: user
Subject: Discussion: permanent UDF with database name
Hi Hive users,

I would like to pursue the discussion that happened during the design of the 
feature:
https://issues.apache.org/jira/browse/HIVE-6167

Some concerns were raised back then, and I think that now that it has been
implemented, some user feedback could bring grist to the mill.

Even if I understand the utility of grouping UDFs inside databases, I find it 
really annoying not to be able to define my UDFs globally.

For me, one of the main interests of UDFs is to extend the built-in Hive 
functions with the company's user-defined functions, either because some useful 
generic functions are missing in the built-in functions or to add 
business-specific functions.

In the latter case, I understand very well the necessity of qualifying them 
with a business-specific database name. But in the former case?


Let's take an example:
It happened several times that we needed a Hive UDF that did not exist yet in
the Hive version that we were currently running. To use it, all we had to do
was take the UDF's source code from a more recent version of Hive, build it
into a JAR, and add the UDF manually.

When we upgraded, we only had to remove our UDF since it was now built-in.

(To be more specific it happened with collect_list prior to Hive 0.13).

With HIVE-6167, this became impossible, since we have to create a
"database_name.function_name" and use it as such. Hence, when upgrading, we
need to replace "database_name.function_name" with "function_name" everywhere.
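To make the contrast concrete (the jar path and database name below are
invented for illustration; the class name is that of Hive's collect_list
implementation, the example mentioned above):

-- pre-HIVE-6167 style: session-scoped, no database qualifier
add jar hdfs:///libs/backported-udfs.jar;
create temporary function collect_list
  as 'org.apache.hadoop.hive.ql.udf.generic.GenericUDAFCollectList';

-- 0.13+ permanent UDF: has to live in a database
create function mydb.collect_list
  as 'org.apache.hadoop.hive.ql.udf.generic.GenericUDAFCollectList'
  using jar 'hdfs:///libs/backported-udfs.jar';

-- callers outside mydb then have to use the qualified name
select mydb.collect_list(col) from some_table;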

This is just an example, but I would like to emphasize the point that sometimes 
we want to create permanent UDFs that are as global as built-in UDFs and not 
bother if it is a built-in or user-defined function. As someone pointed out in 
HIVE-6167's discussion, imagine if all the built-in UDFs had to be called with 
"sys.function_name".

I would just like to have other Hive user's feedback on that matter.

Has anyone else had similar issues with this behavior? How did you deal with them?

Maybe it would make sense to create a feature request for being able to specify 
a GLOBAL keyword when creating a permanent UDF, when we really want it to be 
global?

What do you think?

Regards,

Furcy



Is there any documentation of how the field delimiter is specified?

2015-12-17 Thread Toby Allsopp
What we want to do is to generate the CREATE TABLE statement for a
delimited file where the delimiter has been specified by the user.

That is, given a character with ASCII code C, how should we generate the
FIELDS TERMINATED BY '?' clause?

Is it correct to convert to octal and say '\ooo'?

We're confused because specifying '1', '\1', and '\001' all result in the
default delimiter of ASCII char 1, but '\01' does not (or at least it
doesn't function correctly).

Also, specifying '\u0009' specifies TAB - is this something that we can
expect to work or is this an accident?
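For concreteness, this is the sort of DDL we are generating (the table and
columns are placeholders); the '\001' form is one we know works, and the
octal-for-arbitrary-C form is the part we are unsure about:

create table staging_example (col1 string, col2 string)
row format delimited
fields terminated by '\001'    -- ASCII 1, the default delimiter; observed to work
stored as textfile;

-- for an arbitrary character with ASCII code C, is the generated clause meant to be
--   fields terminated by '\ooo'     (C as three octal digits, e.g. '\011' for TAB)
-- or is the '\u0009'-style unicode escape the supported spelling?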

The documentation says "Hive uses C-style escaping within the strings" when
describing the string type, but that doesn't appear to be entirely true. Is
there any documentation of exactly which escapes are supported and what
they mean?

Thanks,
Toby.


RE: Synchronizing Hive metastores across clusters

2015-12-17 Thread Mich Talebzadeh
Hi Elliot.

 

Strictly speaking, I believe your question is about what happens when the
metastore on the replica gets out of sync. So any query against the cloud table
would only show, say, partitions as at time T0 as opposed to T1?

 

I don’t know what your metastore is on. With ours on Oracle this can happen
when there is a network glitch, hence the metadata tables can get out of sync.
Each table has a materialized view (MV) log that keeps the deltas for that
table and pushes them to the replica table every, say, 30 seconds
(configurable). So these are the scenarios:

 

1. Network issue. The deltas cannot be delivered and the replica table is out
of sync. The delta data is kept in the primary table's MV log until the network
is back and the next scheduled refresh delivers it. There could be a backlog.

2. The replica table gets out of sync for some other reason. In this case the
Oracle package DBMS_MVIEW.REFRESH is used to re-sync the replica table. Again,
this is best done when there is no activity on the primary.
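For illustration (the materialized view name is made up), the manual re-sync in
case 2 is just a refresh call against the replica:

BEGIN
  -- 'F' = fast (incremental) refresh from the MV log; use 'C' for a complete rebuild
  DBMS_MVIEW.REFRESH('HIVEMETA.TBLS_MV', 'F');
END;
/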

 

 

We use Oracle for our metastore as the Bank has many instances of Oracle,
Sybase and Microsoft SQL Server, and it is pretty easy for DBAs to look after a
small Hive schema on an Oracle instance.

 

I gather if we build a model based on what classic databases do to keep 
reporting database tables in sync (which is in essence what we are talking 
about) then we should be OK.

 

That takes care of the metadata, but I noticed that you are also mentioning
syncing the data on HDFS to the replica as well. It sounds like many people go
for DistCp — an application shipped with Hadoop that uses a MapReduce job to
copy files in parallel. There also seems to be a good article out there on how
Facebook handles replication in general.

 

 

HTH,

 

 

Mich Talebzadeh

http://talebzadehmich.wordpress.com

From: Elliot West [mailto:tea...@gmail.com] 
Sent: 17 December 2015 17:17
To: user@hive.apache.org
Subject: Re: Synchronizing Hive metastores across clusters

 

Hi Mich,

 

In your scenario is there any coordination of data syncing on HDFS and metadata 
in HCatalog? I.e. could a situation occur where the replicated metastore shows 
a partition as 'present' yet the data that backs the partition in HDFS has not 
yet arrived at the replica filesystem? I imagine one could avoid this by 
snapshotting the source metastore, then syncing HDFS, and then finally shipping 
the snapshot to the replica(?).

 

Thanks - Elliot.

 

On 17 December 2015 at 16:57, Mich Talebzadeh <m...@peridale.co.uk> wrote:

Sounds like one way replication of metastore. Depending on your metastore 
platform that could be achieved pretty easily. 

 

Mine is Oracle and I use materialised view replication, which is pretty good but
not the latest technology. Others would be GoldenGate or SAP Replication Server.

 

HTH,

 

Mich

 

From: Mich Talebzadeh [mailto:m...@peridale.co.uk]
Sent: 17 December 2015 16:47
To: user@hive.apache.org  
Subject: RE: Synchronizing Hive metastores across clusters

 

Are both clusters in active/active mode, or is the cloud-based cluster standby?

 

From: Elliot West [mailto:tea...@gmail.com] 
Sent: 17 December 2015 16:21
To: user@hive.apache.org  
Subject: Synchronizing Hive metastores across clusters

 

Hello,

 

I'm thinking about the steps required to repeatedly push Hive datasets out from 
a traditional Hadoop cluster into a parallel cloud based cluster. This is not a 
one off, it needs to be a constantly running sync process. As new tables and 
partitions are added in one cluster, they need to be s

Re: Hive on Spark - Error: Child process exited before connecting back

2015-12-17 Thread Xuefu Zhang
These missing classes are in the Hadoop jars. If you have HADOOP_HOME set, then
they should be on the Hive classpath.

--Xuefu

On Thu, Dec 17, 2015 at 10:12 AM, Ophir Etzion  wrote:

> It seems like the problem is that the Spark client needs FSDataInputStream,
> but it is not included in the hive-exec-1.1.0-cdh5.4.3.jar that is passed on
> the classpath.
> I need to look more in spark-submit / org.apache.spark.deploy to see if
> there is a way to include more jars.
>
>
> 2015-12-17 17:34:01,679 INFO org.apache.hive.spark.client.SparkClientImpl:
> Running client driver with argv:
> /export/hdb3/data/cloudera/parcels/CDH-5.4.3-1.cdh5.4.3.p0.6/lib/spark/bin/spark-submit
> --executor-cores 1 --executor-memory 268435456 --proxy-user anonymous
> --properties-file /tmp/spark-submit.1508744664719491459.properties --class
> org.apache.hive.spark.client.RemoteDriver
> /export/hdb3/data/cloudera/parcels/CDH-5.4.3-1.cdh5.4.3.p0.6/jars/hive-exec-1.1.0-cdh5.4.3.jar
> --remote-host ezaq6.prod.foursquare.com --remote-port 44306 --conf
> hive.spark.client.connect.timeout=1000 --conf
> hive.spark.client.server.connect.timeout=9 --conf
> hive.spark.client.channel.log.level=null --conf
> hive.spark.client.rpc.max.size=52428800 --conf
> hive.spark.client.rpc.threads=8 --conf hive.spark.client.secret.bits=256
> 2015-12-17 17:34:02,435 INFO org.apache.hive.spark.client.SparkClientImpl:
> Exception in thread "main" java.lang.NoClassDefFoundError:
> org/apache/hadoop/fs/FSDataInputStream
> 2015-12-17 17:34:02,435 INFO org.apache.hive.spark.client.SparkClientImpl: at
> org.apache.spark.deploy.SparkSubmitDriverBootstrapper$.main(SparkSubmitDriverBootstrapper.scala:71)
> 2015-12-17 17:34:02,435 INFO org.apache.hive.spark.client.SparkClientImpl: at
> org.apache.spark.deploy.SparkSubmitDriverBootstrapper.main(SparkSubmitDriverBootstrapper.scala)
> 2015-12-17 17:34:02,435 INFO org.apache.hive.spark.client.SparkClientImpl:
> Caused by: java.lang.ClassNotFoundException:
> org.apache.hadoop.fs.FSDataInputStream
> 2015-12-17 17:34:02,435 INFO org.apache.hive.spark.client.SparkClientImpl: at
> java.net.URLClassLoader$1.run(URLClassLoader.java:366)
> 2015-12-17 17:34:02,435 INFO org.apache.hive.spark.client.SparkClientImpl: at
> java.net.URLClassLoader$1.run(URLClassLoader.java:355)
> 2015-12-17 17:34:02,435 INFO org.apache.hive.spark.client.SparkClientImpl: at
> java.security.AccessController.doPrivileged(Native Method)
> 2015-12-17 17:34:02,435 INFO org.apache.hive.spark.client.SparkClientImpl: at
> java.net.URLClassLoader.findClass(URLClassLoader.java:354)
> 2015-12-17 17:34:02,435 INFO org.apache.hive.spark.client.SparkClientImpl: at
> java.lang.ClassLoader.loadClass(ClassLoader.java:425)
> 2015-12-17 17:34:02,435 INFO org.apache.hive.spark.client.SparkClientImpl: at
> sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
> 2015-12-17 17:34:02,435 INFO org.apache.hive.spark.client.SparkClientImpl: at
> java.lang.ClassLoader.loadClass(ClassLoader.java:358)
> 2015-12-17 17:34:02,435 INFO org.apache.hive.spark.client.SparkClientImpl: ...
> 2 more
> 2015-12-17 17:34:02,438 WARN org.apache.hive.spark.client.SparkClientImpl:
> Child process exited with code 1.
>
> On Tue, Dec 15, 2015 at 11:15 PM, Xuefu Zhang  wrote:
>
>> As to the spark versions that are supported. Spark has made
>> non-compatible API changes in 1.5, and that's the reason why Hive 1.1.0
>> doesn't work with Spark 1.5. However, the latest Hive in master or branch-1
>> should work with spark 1.5.
>>
>> Also, later CDH 5.4.x versions already support Spark 1.5. CDH 5.7,
>> which is coming soon, will support Spark 1.6.
>>
>> --Xuefu
>>
>> On Tue, Dec 15, 2015 at 3:50 PM, Mich Talebzadeh 
>> wrote:
>>
>>> To answer your point:
>>>
>>>
>>>
>>> “why would spark 1.5.2 specifically would not work with hive?”
>>>
>>>
>>>
>>> Because I tried Spark 1.5.2 and it did not work, and unfortunately the
>>> only version that seems to work (albeit requiring some messing around) is
>>> version 1.3.1 of Spark.
>>>
>>>
>>>
>>> Look at the threads on “Managed to make Hive run on Spark engine” in
>>> user@hive.apache.org
>>>
>>>
>>>
>>>
>>>
>>> HTH,
>>>
>>>
>>>
>>>
>>>
>>> Mich Talebzadeh
>>>
>>>
>>> http://talebzadehmich.wordpress.com

Re: Synchronizing Hive metastores across clusters

2015-12-17 Thread Sushanth Sowmyan
Also, while I have not wiki-ized the documentation for the above, I
have uploaded slides from talks that I've given at Hive user group
meetups on the subject, and also a doc that describes the replication
protocol followed for the EXIM replication; both are attached over at
https://issues.apache.org/jira/browse/HIVE-10264

On Thu, Dec 17, 2015 at 11:59 AM, Sushanth Sowmyan  wrote:
> Hi,
>
> I think that the replication work added with
> https://issues.apache.org/jira/browse/HIVE-7973 is exactly up this
> alley.
>
> Per Eugene's suggestion of MetaStoreEventListener, this replication
> system plugs into that and gets you a stream of notification events
> from HCatClient for the exact purpose you mention.
>
> There's some work still outstanding on this task, most notably
> documentation (sorry!) but please have a look at
> HCatClient.getReplicationTasks(...) and
> org.apache.hive.hcatalog.api.repl.ReplicationTask. You can plug in
> your implementation of  ReplicationTask.Factory to inject your own
> logic for how to handle the replication according to your needs.
> (currently there exists an implementation that uses Hive EXPORT/IMPORT
> to perform replication - you can look at the code for this, and the
> tests for these classes to see how that is achieved. Falcon already
> uses this to perform cross-hive-warehouse replication)
>
>
> Thanks,
>
> -Sushanth
>
> On Thu, Dec 17, 2015 at 11:22 AM, Eugene Koifman
>  wrote:
>> Metastore supports MetaStoreEventListener and MetaStorePreEventListener
>> which may be useful here
>>
>> Eugene
>>
>> From: Elliot West 
>> Reply-To: "user@hive.apache.org" 
>> Date: Thursday, December 17, 2015 at 8:21 AM
>> To: "user@hive.apache.org" 
>> Subject: Synchronizing Hive metastores across clusters
>>
>> Hello,
>>
>> I'm thinking about the steps required to repeatedly push Hive datasets out
>> from a traditional Hadoop cluster into a parallel cloud based cluster. This
>> is not a one off, it needs to be a constantly running sync process. As new
>> tables and partitions are added in one cluster, they need to be synced to
>> the cloud cluster. Assuming for a moment that I have the HDFS data syncing
>> working, I'm wondering what steps I need to take to reliably ship the
>> HCatalog metadata across. I use HCatalog as the point of truth as to when
>> data is available and where it is located, and so I think that metadata
>> is a critical element to replicate in the cloud based cluster.
>>
>> Does anyone have any recommendations on how to achieve this in practice? One
>> issue (of many I suspect) is that Hive appears to store table/partition
>> locations internally with absolute, fully qualified URLs, therefore unless
>> the target cloud cluster is similarly named and configured some path
>> transformation step will be needed as part of the synchronisation process.
>>
>> I'd appreciate any suggestions, thoughts, or experiences related to this.
>>
>> Cheers - Elliot.
>>
>>


Re: Synchronizing Hive metastores across clusters

2015-12-17 Thread Sushanth Sowmyan
Hi,

I think that the replication work added with
https://issues.apache.org/jira/browse/HIVE-7973 is exactly up this
alley.

Per Eugene's suggestion of MetaStoreEventListener, this replication
system plugs into that and gets you a stream of notification events
from HCatClient for the exact purpose you mention.

There's some work still outstanding on this task, most notably
documentation (sorry!) but please have a look at
HCatClient.getReplicationTasks(...) and
org.apache.hive.hcatalog.api.repl.ReplicationTask. You can plug in
your implementation of  ReplicationTask.Factory to inject your own
logic for how to handle the replication according to your needs.
(currently there exists an implementation that uses Hive EXPORT/IMPORT
to perform replication - you can look at the code for this, and the
tests for these classes to see how that is achieved. Falcon already
uses this to perform cross-hive-warehouse replication)


Thanks,

-Sushanth

On Thu, Dec 17, 2015 at 11:22 AM, Eugene Koifman
 wrote:
> Metastore supports MetaStoreEventListener and MetaStorePreEventListener
> which may be useful here
>
> Eugene
>
> From: Elliot West 
> Reply-To: "user@hive.apache.org" 
> Date: Thursday, December 17, 2015 at 8:21 AM
> To: "user@hive.apache.org" 
> Subject: Synchronizing Hive metastores across clusters
>
> Hello,
>
> I'm thinking about the steps required to repeatedly push Hive datasets out
> from a traditional Hadoop cluster into a parallel cloud based cluster. This
> is not a one off, it needs to be a constantly running sync process. As new
> tables and partitions are added in one cluster, they need to be synced to
> the cloud cluster. Assuming for a moment that I have the HDFS data syncing
> working, I'm wondering what steps I need to take to reliably ship the
> HCatalog metadata across. I use HCatalog as the point of truth as to when
> data is available and where it is located, and so I think that metadata
> is a critical element to replicate in the cloud based cluster.
>
> Does anyone have any recommendations on how to achieve this in practice? One
> issue (of many I suspect) is that Hive appears to store table/partition
> locations internally with absolute, fully qualified URLs, therefore unless
> the target cloud cluster is similarly named and configured some path
> transformation step will be needed as part of the synchronisation process.
>
> I'd appreciate any suggestions, thoughts, or experiences related to this.
>
> Cheers - Elliot.
>
>


Re: Synchronizing Hive metastores across clusters

2015-12-17 Thread Eugene Koifman
Metastore supports MetaStoreEventListener and MetaStorePreEventListener which 
may be useful here

Eugene

From: Elliot West <tea...@gmail.com>
Reply-To: "user@hive.apache.org" <user@hive.apache.org>
Date: Thursday, December 17, 2015 at 8:21 AM
To: "user@hive.apache.org" <user@hive.apache.org>
Subject: Synchronizing Hive metastores across clusters

Hello,

I'm thinking about the steps required to repeatedly push Hive datasets out from 
a traditional Hadoop cluster into a parallel cloud based cluster. This is not a 
one off, it needs to be a constantly running sync process. As new tables and 
partitions are added in one cluster, they need to be synced to the cloud 
cluster. Assuming for a moment that I have the HDFS data syncing working, I'm 
wondering what steps I need to take to reliably ship the HCatalog metadata 
across. I use HCatalog as the point of truth as to when data is available 
and where it is located and so I think that metadata is a critical element to 
replicate in the cloud based cluster.

Does anyone have any recommendations on how to achieve this in practice? One 
issue (of many I suspect) is that Hive appears to store table/partition 
locations internally with absolute, fully qualified URLs, therefore unless the 
target cloud cluster is similarly named and configured some path transformation 
step will be needed as part of the synchronisation process.

I'd appreciate any suggestions, thoughts, or experiences related to this.

Cheers - Elliot.




Re: Hive partition load

2015-12-17 Thread Alan Gates

Yes, you can load different partitions simultaneously.

Alan.
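For example (the partition column name and file paths are invented for
illustration), the two statements below can be issued from two separate
sessions at the same time, since each one targets its own static partition:

-- session 1
load data inpath '/staging/p1_data' into table test partition (part_col='p1');

-- session 2
load data inpath '/staging/p2_data' into table test partition (part_col='p2');

If the lock manager is enabled (hive.support.concurrency=true), each statement
takes an exclusive lock only on its own target partition plus a shared lock on
the table, so the two loads do not block each other.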


Suyog Parlikar 
December 17, 2015 at 5:02

Hello everyone,

Can we load different partitions of a Hive table simultaneously?

Are there any locking issues with that? If so, what are they?

Please see the example below for more details.

Consider I have a Hive table test with two partitions p1 and p2.

I want to load data into partitions p1 and p2 at the same time.

Awaiting your reply.

Thanks,
Suyog



Re: Hive on Spark - Error: Child process exited before connecting back

2015-12-17 Thread Ophir Etzion
It seems like the problem is that the Spark client needs FSDataInputStream,
but it is not included in the hive-exec-1.1.0-cdh5.4.3.jar that is passed on
the classpath.
I need to look more in spark-submit / org.apache.spark.deploy to see if
there is a way to include more jars.


2015-12-17 17:34:01,679 INFO org.apache.hive.spark.client.SparkClientImpl:
Running client driver with argv:
/export/hdb3/data/cloudera/parcels/CDH-5.4.3-1.cdh5.4.3.p0.6/lib/spark/bin/spark-submit
--executor-cores 1 --executor-memory 268435456 --proxy-user anonymous
--properties-file /tmp/spark-submit.1508744664719491459.properties --class
org.apache.hive.spark.client.RemoteDriver
/export/hdb3/data/cloudera/parcels/CDH-5.4.3-1.cdh5.4.3.p0.6/jars/hive-exec-1.1.0-cdh5.4.3.jar
--remote-host ezaq6.prod.foursquare.com --remote-port 44306 --conf
hive.spark.client.connect.timeout=1000 --conf
hive.spark.client.server.connect.timeout=9 --conf
hive.spark.client.channel.log.level=null --conf
hive.spark.client.rpc.max.size=52428800 --conf
hive.spark.client.rpc.threads=8 --conf hive.spark.client.secret.bits=256
2015-12-17 17:34:02,435 INFO org.apache.hive.spark.client.SparkClientImpl:
Exception in thread "main" java.lang.NoClassDefFoundError:
org/apache/hadoop/fs/FSDataInputStream
2015-12-17 17:34:02,435 INFO org.apache.hive.spark.client.SparkClientImpl: at
org.apache.spark.deploy.SparkSubmitDriverBootstrapper$.main(SparkSubmitDriverBootstrapper.scala:71)
2015-12-17 17:34:02,435 INFO org.apache.hive.spark.client.SparkClientImpl: at
org.apache.spark.deploy.SparkSubmitDriverBootstrapper.main(SparkSubmitDriverBootstrapper.scala)
2015-12-17 17:34:02,435 INFO org.apache.hive.spark.client.SparkClientImpl:
Caused by: java.lang.ClassNotFoundException:
org.apache.hadoop.fs.FSDataInputStream
2015-12-17 17:34:02,435 INFO org.apache.hive.spark.client.SparkClientImpl: at
java.net.URLClassLoader$1.run(URLClassLoader.java:366)
2015-12-17 17:34:02,435 INFO org.apache.hive.spark.client.SparkClientImpl: at
java.net.URLClassLoader$1.run(URLClassLoader.java:355)
2015-12-17 17:34:02,435 INFO org.apache.hive.spark.client.SparkClientImpl: at
java.security.AccessController.doPrivileged(Native Method)
2015-12-17 17:34:02,435 INFO org.apache.hive.spark.client.SparkClientImpl: at
java.net.URLClassLoader.findClass(URLClassLoader.java:354)
2015-12-17 17:34:02,435 INFO org.apache.hive.spark.client.SparkClientImpl: at
java.lang.ClassLoader.loadClass(ClassLoader.java:425)
2015-12-17 17:34:02,435 INFO org.apache.hive.spark.client.SparkClientImpl: at
sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
2015-12-17 17:34:02,435 INFO org.apache.hive.spark.client.SparkClientImpl: at
java.lang.ClassLoader.loadClass(ClassLoader.java:358)
2015-12-17 17:34:02,435 INFO org.apache.hive.spark.client.SparkClientImpl: ...
2 more
2015-12-17 17:34:02,438 WARN org.apache.hive.spark.client.SparkClientImpl:
Child process exited with code 1.

On Tue, Dec 15, 2015 at 11:15 PM, Xuefu Zhang  wrote:

> As to the spark versions that are supported. Spark has made non-compatible
> API changes in 1.5, and that's the reason why Hive 1.1.0 doesn't work with
> Spark 1.5. However, the latest Hive in master or branch-1 should work with
> spark 1.5.
>
> Also, later CDH 5.4.x versions already support Spark 1.5. CDH 5.7,
> which is coming soon, will support Spark 1.6.
>
> --Xuefu
>
> On Tue, Dec 15, 2015 at 3:50 PM, Mich Talebzadeh 
> wrote:
>
>> To answer your point:
>>
>>
>>
>> “why would spark 1.5.2 specifically would not work with hive?”
>>
>>
>>
>> Because I tried Spark 1.5.2 and it did not work, and unfortunately the
>> only version that seems to work (albeit requiring some messing around) is
>> version 1.3.1 of Spark.
>>
>>
>>
>> Look at the threads on “Managed to make Hive run on Spark engine” in
>> user@hive.apache.org
>>
>>
>>
>>
>>
>> HTH,
>>
>>
>>
>>
>>
>> Mich Talebzadeh
>>
>>
>> http://talebzadehmich.wordpress.com

Re: Synchronizing Hive metastores across clusters

2015-12-17 Thread Jörn Franke
Hive has the export/import commands, alternatively Falcon+oozie
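A minimal sketch of that route (paths and the partition spec are placeholders):
export the table or partition on the source cluster, copy the export directory
across (e.g. with DistCp), then import it on the target:

-- on the source cluster
export table my_table partition (ds='2015-12-17')
  to '/staging/exports/my_table_ds20151217';

-- after copying the export directory to the target cluster's HDFS
import table my_table partition (ds='2015-12-17')
  from '/staging/exports/my_table_ds20151217';

Falcon/Oozie can then schedule this on a recurring basis.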

> On 17 Dec 2015, at 17:21, Elliot West  wrote:
> 
> Hello,
> 
> I'm thinking about the steps required to repeatedly push Hive datasets out 
> from a traditional Hadoop cluster into a parallel cloud based cluster. This 
> is not a one off, it needs to be a constantly running sync process. As new 
> tables and partitions are added in one cluster, they need to be synced to the 
> cloud cluster. Assuming for a moment that I have the HDFS data syncing 
> working, I'm wondering what steps I need to take to reliably ship the 
> HCatalog metadata across. I use HCatalog as the point of truth as to when 
> when data is available and where it is located and so I think that metadata 
> is a critical element to replicate in the cloud based cluster.
> 
> Does anyone have any recommendations on how to achieve this in practice? One 
> issue (of many I suspect) is that Hive appears to store table/partition 
> locations internally with absolute, fully qualified URLs, therefore unless 
> the target cloud cluster is similarly named and configured some path 
> transformation step will be needed as part of the synchronisation process.
> 
> I'd appreciate any suggestions, thoughts, or experiences related to this.
> 
> Cheers - Elliot.
> 
> 


Re: Synchronizing Hive metastores across clusters

2015-12-17 Thread Elliot West
Hi Mich,

In your scenario is there any coordination of data syncing on HDFS and
metadata in HCatalog? I.e. could a situation occur where the replicated
metastore shows a partition as 'present' yet the data that backs the
partition in HDFS has not yet arrived at the replica filesystem? I imagine
one could avoid this by snapshotting the source metastore, then syncing
HDFS, and then finally shipping the snapshot to the replica(?).

Thanks - Elliot.

On 17 December 2015 at 16:57, Mich Talebzadeh  wrote:

> Sounds like one way replication of metastore. Depending on your metastore
> platform that could be achieved pretty easily.
>
>
>
> Mine is Oracle and I use materialised view replication, which is pretty
> good but not the latest technology. Others would be GoldenGate or SAP
> Replication Server.
>
>
>
> HTH,
>
>
>
> Mich
>
>
>
> *From:* Mich Talebzadeh [mailto:m...@peridale.co.uk]
> *Sent:* 17 December 2015 16:47
> *To:* user@hive.apache.org
> *Subject:* RE: Synchronizing Hive metastores across clusters
>
>
>
> Are both clusters in active/active mode, or is the cloud-based cluster
> standby?
>
>
>
> *From:* Elliot West [mailto:tea...@gmail.com ]
> *Sent:* 17 December 2015 16:21
> *To:* user@hive.apache.org
> *Subject:* Synchronizing Hive metastores across clusters
>
>
>
> Hello,
>
>
>
> I'm thinking about the steps required to repeatedly push Hive datasets out
> from a traditional Hadoop cluster into a parallel cloud based cluster. This
> is not a one off, it needs to be a constantly running sync process. As new
> tables and partitions are added in one cluster, they need to be synced to
> the cloud cluster. Assuming for a moment that I have the HDFS data syncing
> working, I'm wondering what steps I need to take to reliably ship the
> HCatalog metadata across. I use HCatalog as the point of truth as to when
> data is available and where it is located, and so I think that metadata
> is a critical element to replicate in the cloud based cluster.
>
>
>
> Does anyone have any recommendations on how to achieve this in practice?
> One issue (of many I suspect) is that Hive appears to store table/partition
> locations internally with absolute, fully qualified URLs, therefore unless
> the target cloud cluster is similarly named and configured some path
> transformation step will be needed as part of the synchronisation process.
>
>
>
> I'd appreciate any suggestions, thoughts, or experiences related to this.
>
>
>
> Cheers - Elliot.
>
>
>
>
>


Re: Synchronizing Hive metastores across clusters

2015-12-17 Thread Elliot West
Hi Mich,

Thanks for your reply. The cloud cluster is to be used for read-only
analytics, so effectively one-way, stand-by. I'll take a look at your
suggested technologies as I'm not familiar with them.

Thanks - Elliot.

On 17 December 2015 at 16:57, Mich Talebzadeh  wrote:

> Sounds like one way replication of metastore. Depending on your metastore
> platform that could be achieved pretty easily.
>
>
>
> Mine is Oracle and I use materialised view replication, which is pretty
> good but not the latest technology. Others would be GoldenGate or SAP
> Replication Server.
>
>
>
> HTH,
>
>
>
> Mich
>
>
>
> *From:* Mich Talebzadeh [mailto:m...@peridale.co.uk]
> *Sent:* 17 December 2015 16:47
> *To:* user@hive.apache.org
> *Subject:* RE: Synchronizing Hive metastores across clusters
>
>
>
> Are both clusters in active/active mode, or is the cloud-based cluster
> standby?
>
>
>
> *From:* Elliot West [mailto:tea...@gmail.com ]
> *Sent:* 17 December 2015 16:21
> *To:* user@hive.apache.org
> *Subject:* Synchronizing Hive metastores across clusters
>
>
>
> Hello,
>
>
>
> I'm thinking about the steps required to repeatedly push Hive datasets out
> from a traditional Hadoop cluster into a parallel cloud based cluster. This
> is not a one off, it needs to be a constantly running sync process. As new
> tables and partitions are added in one cluster, they need to be synced to
> the cloud cluster. Assuming for a moment that I have the HDFS data syncing
> working, I'm wondering what steps I need to take to reliably ship the
> HCatalog metadata across. I use HCatalog as the point of truth as to when
> data is available and where it is located, and so I think that metadata
> is a critical element to replicate in the cloud based cluster.
>
>
>
> Does anyone have any recommendations on how to achieve this in practice?
> One issue (of many I suspect) is that Hive appears to store table/partition
> locations internally with absolute, fully qualified URLs, therefore unless
> the target cloud cluster is similarly named and configured some path
> transformation step will be needed as part of the synchronisation process.
>
>
>
> I'd appreciate any suggestions, thoughts, or experiences related to this.
>
>
>
> Cheers - Elliot.
>
>
>
>
>


RE: Synchronizing Hive metastores across clusters

2015-12-17 Thread Mich Talebzadeh
Sounds like one way replication of metastore. Depending on your metastore 
platform that could be achieved pretty easily. 

 

Mine is Oracle and I use materialised view replication, which is pretty good but
not the latest technology. Others would be GoldenGate or SAP Replication Server.

 

HTH,

 

Mich

 

From: Mich Talebzadeh [mailto:m...@peridale.co.uk] 
Sent: 17 December 2015 16:47
To: user@hive.apache.org
Subject: RE: Synchronizing Hive metastores across clusters

 

Are both clusters in active/active mode, or is the cloud-based cluster standby?

 

From: Elliot West [mailto:tea...@gmail.com] 
Sent: 17 December 2015 16:21
To: user@hive.apache.org  
Subject: Synchronizing Hive metastores across clusters

 

Hello,

 

I'm thinking about the steps required to repeatedly push Hive datasets out from 
a traditional Hadoop cluster into a parallel cloud based cluster. This is not a 
one off, it needs to be a constantly running sync process. As new tables and 
partitions are added in one cluster, they need to be synced to the cloud 
cluster. Assuming for a moment that I have the HDFS data syncing working, I'm 
wondering what steps I need to take to reliably ship the HCatalog metadata 
across. I use HCatalog as the point of truth as to when data is available 
and where it is located and so I think that metadata is a critical element to 
replicate in the cloud based cluster.

 

Does anyone have any recommendations on how to achieve this in practice? One 
issue (of many I suspect) is that Hive appears to store table/partition 
locations internally with absolute, fully qualified URLs, therefore unless the 
target cloud cluster is similarly named and configured some path transformation 
step will be needed as part of the synchronisation process.

 

I'd appreciate any suggestions, thoughts, or experiences related to this.

 

Cheers - Elliot.

 

 



RE: Synchronizing Hive metastores across clusters

2015-12-17 Thread Mich Talebzadeh
Are both clusters in active/active mode, or is the cloud-based cluster standby?

 

From: Elliot West [mailto:tea...@gmail.com] 
Sent: 17 December 2015 16:21
To: user@hive.apache.org
Subject: Synchronizing Hive metastores across clusters

 

Hello,

 

I'm thinking about the steps required to repeatedly push Hive datasets out from 
a traditional Hadoop cluster into a parallel cloud based cluster. This is not a 
one off, it needs to be a constantly running sync process. As new tables and 
partitions are added in one cluster, they need to be synced to the cloud 
cluster. Assuming for a moment that I have the HDFS data syncing working, I'm 
wondering what steps I need to take to reliably ship the HCatalog metadata 
across. I use HCatalog as the point of truth as to when data is available 
and where it is located and so I think that metadata is a critical element to 
replicate in the cloud based cluster.

 

Does anyone have any recommendations on how to achieve this in practice? One 
issue (of many I suspect) is that Hive appears to store table/partition 
locations internally with absolute, fully qualified URLs, therefore unless the 
target cloud cluster is similarly named and configured some path transformation 
step will be needed as part of the synchronisation process.

 

I'd appreciate any suggestions, thoughts, or experiences related to this.

 

Cheers - Elliot.

 

 



Synchronizing Hive metastores across clusters

2015-12-17 Thread Elliot West
Hello,

I'm thinking about the steps required to repeatedly push Hive datasets out
from a traditional Hadoop cluster into a parallel cloud based cluster. This
is not a one off, it needs to be a constantly running sync process. As new
tables and partitions are added in one cluster, they need to be synced to
the cloud cluster. Assuming for a moment that I have the HDFS data syncing
working, I'm wondering what steps I need to take to reliably ship the
HCatalog metadata across. I use HCatalog as the point of truth as to when
when data is available and where it is located and so I think that metadata
is a critical element to replicate in the cloud based cluster.

Does anyone have any recommendations on how to achieve this in practice?
One issue (of many I suspect) is that Hive appears to store table/partition
locations internally with absolute, fully qualified URLs, therefore unless
the target cloud cluster is similarly named and configured some path
transformation step will be needed as part of the synchronisation process.

I'd appreciate any suggestions, thoughts, or experiences related to this.

Cheers - Elliot.


Re: increase number of reducers

2015-12-17 Thread Muni Chada
Is this table bucketed? If so, please set the number of reducers (set
mapreduce.job.reduces=bucket_size) to match the table's bucket size.
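A minimal illustration of that suggestion (the bucket count of 32 is just an
example; use whatever the table was created with):

set mapreduce.job.reduces=32;   -- match the CLUSTERED BY ... INTO 32 BUCKETS value

-- if the table is not bucketed, lowering the per-reducer input size also raises
-- the reducer count that Hive computes at compile time:
set hive.exec.reducers.bytes.per.reducer=256000000;
set hive.exec.reducers.max=50;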

On Thu, Dec 17, 2015 at 1:25 AM, Awhan Patnaik  wrote:

> 3 node cluster with 15 gigs of RAM per node. Two tables L is approximately
> 1 Million rows, U is 100 Million. They both have latitude and longitude
> columns. I want to find the count of rows in U that are within a 10 mile
> radius of each of the row in L.
>
> I have indexed the latitude and longitude columns in U. U is date wise
> partitioned. U and L are both stored in ORC Snappy file format.
>
> My query is like this:
>
> select l.id, count(u.id) from L l, U u
> where
> u.lat !=0 and
> u.lat > l.lat - 10/69 and u.lat < l.lat + 10/69 and
> u.lon > l.lon - ( 10 / ( 69 * cos(radians(l.lat)) ) ) and
> u.lon < l.lon + ( 10 / ( 69 * cos(radians(l.lat)) ) ) and
> 3960 *acos(cos(radians(l.lat)) * cos(radians(u.lat)) * cos(radians(l.lon)
> - radians(u.lon)) + sin(radians(l.lat)) * sin(radians(u.lat))) < 10.0
> group by l.id;
>
> The conditions in the where part enforce a bounding box filtering
> constraint based on lat/long values.
>
> The problem is that this results in 9 mappers but only 1 reducer. I notice
> that the job gets stuck at the 67% of the reduce phase. When I run htop I
> find that 2 of the nodes are sitting idle while the third node is busy
> running the single reduce task.
>
> I tried using "set mapreduce.job.reduces=50;" but that did not help as the
> number of reduce jobs was deduced to be 1 during compile time.
>
> How do I force more reducers?
>


Fwd: problem with hive.reloadable.aux.jars.path

2015-12-17 Thread Justyna



Hi,
I wanted to use HiveServer2 without restarting it for every auxiliary jar
change.
According to https://issues.apache.org/jira/browse/HIVE-7553, I swapped the
jars containing UDFs in the folder specified by
hive.reloadable.aux.jars.path.
I executed the reload command via Beeline.
It turned out that it does not work exactly as it should.
Adding and deleting UDFs without restarting HiveServer2 works fine.
The problem is with updating UDFs.
When the modification is in the initialize method of a UDF, HS2 is not aware
of the changes and the method works as before, whilst
modifications in the evaluate method are picked up.

What may be the reason for this? Can I do something to fix it?

The configuration is that hive.reloadable.aux.jars.path and
hive.aux.jars.path specify the same path to a local directory, and
reload is included in hive.security.command.whitelist.

Kind Regards,

Justyna Wardzinska







Hive partition load

2015-12-17 Thread Suyog Parlikar
Hello everyone,

Can we load different partitions of a Hive table simultaneously?

Are there any locking issues with that? If so, what are they?

Please see the example below for more details.

Consider I have a Hive table test with two partitions p1 and p2.

I want to load data into partitions p1 and p2 at the same time.

Awaiting your reply.

Thanks,
Suyog


Discussion: permanent UDF with database name

2015-12-17 Thread Furcy Pin
Hi Hive users,

I would like to pursue the discussion that happened during the design of
the feature:
https://issues.apache.org/jira/browse/HIVE-6167

Some concerns were raised back then, and I think that now that it has been
implemented, some user feedback could bring grist to the mill.

Even if I understand the utility of grouping UDFs inside databases, I find
it really annoying not to be able to define my UDFs globally.

For me, one of the main interests of UDFs is to extend the built-in Hive
functions with the company's user-defined functions, either because some
useful generic functions are missing in the built-in functions or to add
business-specific functions.

In the latter case, I understand very well the necessity of qualifying them
with a business-specific database name. But in the former case?


Let's take an example:
It happened several times that we needed a Hive UDF that did not exist
yet in the Hive version that we were currently running. To use it, all we
had to do was take the UDF's source code from a more recent version of
Hive, build it into a JAR, and add the UDF manually.

When we upgraded, we only had to remove our UDF since it was now built-in.

(To be more specific it happened with collect_list prior to Hive 0.13).

With HIVE-6167, this became impossible, since we have to create a
"database_name.function_name" and use it as such. Hence, when upgrading, we
need to replace "database_name.function_name" with "function_name"
everywhere.

This is just an example, but I would like to emphasize the point that
sometimes we want to create permanent UDFs that are as global as built-in
UDFs and not bother if it is a built-in or user-defined function. As
someone pointed out in HIVE-6167's discussion, imagine if all the built-in
UDFs had to be called with "sys.function_name".

I would just like to have other Hive user's feedback on that matter.

Has anyone else had similar issues with this behavior? How did you deal with
them?

Maybe it would make sense to create a feature request for being able to
specify a GLOBAL keyword when creating a permanent UDF, when we really want
it to be global?

What do you think?

Regards,

Furcy