Issue in Insert Overwrite directory operation

2016-06-13 Thread Udit Mehta
Hi All,

I see a weird issue when trying to do an "INSERT OVERWRITE DIRECTORY"
operation. The query works when I limit the data set but fails with
the following exception when the data set is larger:

Failed with exception Unable to move source
hdfs://namenode/user/grp_admin/external_test1/output/.hive-staging_hive_2016-06-13_21-34-36_449_7074605
to destination /user/grp_admin/external_test1/output

I ensured that the directory has enough space, so there are no disk quota
issues here.
Does anyone know what is happening?

Running Hive on Tez; the Hive version is 1.2.1. It fails even with Hive on MR.

Run 1 with smaller data set:

> insert overwrite directory '/user/grp_admin/external_test1/output'
> row format delimited fields terminated by '\t'
> select * from test_table limit 1000;

Query ID = hive_20160613213624_d9d54ef0-0b28-4e98-b49e-197043f67c43

Total jobs = 3

Launching Job 1 out of 3

Status: Running (Executing on YARN cluster with App id
application_1464825277140_26149)

VERTICES     STATUS     TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
Map 1 .....  SUCCEEDED     12         12        0        0       0       0
Reducer 2 .  SUCCEEDED      1          1        0        0       0       0

VERTICES: 02/02  [==========>>] 100%  ELAPSED TIME: 21.03 s



Stage-4 is selected by condition resolver.

Stage-3 is filtered out by condition resolver.

Stage-5 is filtered out by condition resolver.

Moving data to:
hdfs://namenode/user/grp_admin/external_test1/output/.hive-staging_hive_2016-06-13_21-36-24_620_4270199609063911787-1/-ext-1

Moving data to: /user/grp_admin/external_test1/output

OK

Time taken: 21.501 seconds

Run 2 with larger data set:

> insert overwrite directory '/user/grp_admin/external_test1/output'
> row format delimited fields terminated by '\t'
> select * from test_table;

Query ID = hive_20160613213436_a1b0087a-84ff-48a0-ac76-25811aaafe28

Total jobs = 3

Launching Job 1 out of 3

Tez session was closed. Reopening...

Session re-established.

Status: Running (Executing on YARN cluster with App id
application_1464825277140_26149)

VERTICES     STATUS     TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
Map 1 .....  SUCCEEDED     12         12        0        0       0       0

VERTICES: 01/01  [==========>>] 100%  ELAPSED TIME: 72.69 s



Stage-4 is selected by condition resolver.

Stage-3 is filtered out by condition resolver.

Stage-5 is filtered out by condition resolver.

Moving data to:
hdfs://namenode/user/grp_admin/external_test1/output/.hive-staging_hive_2016-06-13_21-34-36_449_7074605303086037347-1/-ext-1

Moving data to: /user/grp_admin/external_test1/output

Failed with exception Unable to move source
hdfs://namenode/user/grp_admin/external_test1/output/.hive-staging_hive_2016-06-13_21-34-36_449_7074605303086037347-1/-ext-1/00_0
to destination /user/grp_admin/external_test1/output

FAILED: Execution Error, return code 1 from
org.apache.hadoop.hive.ql.exec.MoveTask
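One hedged way to narrow this down, not a confirmed fix: since the failing rename moves files from a staging directory that lives *under* the destination, writing to a fresh scratch directory (the `output_tmp` path below is hypothetical) and moving the files into place afterwards can tell you whether the problem is specific to overwriting a populated destination:

```sql
-- Sketch only: write to an empty scratch directory instead of the
-- existing output directory, paths reuse the example above
INSERT OVERWRITE DIRECTORY '/user/grp_admin/external_test1/output_tmp'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
SELECT * FROM test_table;
```

The results can then be moved into the real output directory outside Hive, e.g. with `hdfs dfs -mv /user/grp_admin/external_test1/output_tmp/* /user/grp_admin/external_test1/output/`.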


Re: Disable Hive autogather optimization

2016-04-29 Thread Udit Mehta
>> | Table Type:           | MANAGED_TABLE                                               |
>> | Table Parameters:     |                                                             |
>> |                       | last_modified_by       hduser                               |
>> |                       | last_modified_time     1461973002                           |
>> |                       | numFiles               1                                    |
>> |                       | numRows                -1                                   |
>> |                       | rawDataSize            54853                                |
>> |                       | totalSize              55853                                |
>> |                       | transient_lastDdlTime  1461973002                           |
>> | # Storage Information |                                                             |
>> | SerDe Library:        | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe          |
>> | InputFormat:          | org.apache.hadoop.mapred.TextInputFormat                    |
>> | OutputFormat:         | org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat  |
>> | Compressed:           | No                                                          |
>> | Num Buckets:          | -1                                                          |
>> | Bucket Columns:       | []                                                          |
>> | Sort Columns:         | []                                                          |
>> | Storage Desc Params:  |                                                             |
>> |                       | serialization.format   1                                    |
>> +-----------------------+-------------------------------------------------------------+
>>
>> Hopefully that will turn off the autogather feature for existing tables.
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>


Re: Disable Hive autogather optimization

2016-04-29 Thread Udit Mehta
Hi,

Thanks for the replies.
We have a scenario where an ETL job inserts into a table with thousands of
partitions using dynamic partitioning. We have certain SLAs within which we
would like the job to finish, and sometimes they are missed (extra data or
a busy cluster). I understand that stats are essential for the Hive CBO,
but we are trying to measure how much overhead this stats collection adds
to the job runtime. A lot of these tables are intermediary, so having stats
for them might not be entirely necessary.

I just wanted to figure out if there was an easy way to disable the stats
and then compare the performance.

Mich, can you give more information on how to disable it in the table
struct, as I can't find any documentation on it?

Thanks again.
Udit

On Fri, Apr 29, 2016 at 10:42 AM, Pengcheng Xiong <pxi...@apache.org> wrote:

> Hi Udit,
>
> Could you be more specific about your problem? For example, what settings
> you have, what query you run, what the result is, and what result you
> expect?
>
> From what you said, my understanding is that you want to wipe out the
> basic stats for existing tables? Could you also let us know why you would
> like to get rid of the stats? Stats are crucial for the Hive CBO to work,
> and we are moving towards making table/column stats collection automatic.
> It seems that you prefer the opposite direction. There is nothing wrong
> with that, and we would like to hear your idea and motivation so that we
> can better design Hive stats collection. Thanks!
>
> Best
> Pengcheng
>


Re: Disable Hive autogather optimization

2016-04-28 Thread Udit Mehta
Any insights on this?



Re: Disable Hive autogather optimization

2016-04-26 Thread Udit Mehta
Update: I realized this works if we create a fresh table with the config
already disabled, but it does not work for a table that was created while
the config was enabled. We now need to figure out how to disable this
behavior for tables created when the config was true.

>


Re: Disable Hive autogather optimization

2016-04-26 Thread Udit Mehta
Hive version we are using is 1.2.1.



Disable Hive autogather optimization

2016-04-26 Thread Udit Mehta
Hi,

We need to disable the Hive autogather stats optimization by turning off
"*hive.stats.autogather*", but for some reason the config change doesn't
seem to take effect. We modified this config in hive-site.xml and restarted
the Hive metastore. We also set it explicitly in the job, but that doesn't
seem to help either:

*set hive.stats.autogather=false;*

Does anyone know the right way to disable this config, since we don't want
to compute stats in our jobs?

Thanks,
Udit
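One hedged way to check whether the session-level setting actually took effect for a given insert (the table, partition, and column values below are illustrative, not from the original job):

```sql
-- Sketch: toggle autogather off, run the insert, then inspect the
-- partition's basic stats
SET hive.stats.autogather=false;
INSERT OVERWRITE TABLE tgt PARTITION (ds='2016-04-26')
SELECT * FROM src;
-- If autogather is really off, numRows and rawDataSize in the
-- "Partition Parameters" section should remain -1 rather than
-- being populated by the insert
DESCRIBE FORMATTED tgt PARTITION (ds='2016-04-26');
```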


Re: Hive Metastore Bottleneck

2016-03-30 Thread Udit Mehta
But don't the clients always pick the first URI of the multiple instances
listed in "*hive.metastore.uris*", and fall back to the others only if the
first is unreachable? That way, we would still have a bottleneck, right?
Can you give a little more information on your setup and how you enable
load balancing?
I think I am missing something here.

Thanks,
Udit

On Wed, Mar 30, 2016 at 3:20 PM, Gautam <gautamkows...@gmail.com> wrote:

> The metastore service is a Java process that is a Thrift server, so you
> can run multiple such Hive metastore instances with
> "javax.jdo.option.ConnectionURL" pointing to the same MySQL db.
>
> On Wed, Mar 30, 2016 at 3:11 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>>
>>
>> Can you clarify this please
>>
>> "Have you tried putting multiple metastores behind a load balancer"
>>
>> Are you implying that metastore and backend DB are different entities
>> here.
>>
>> As far as I know $HIVE_HOME/bin/hive --service metastore & starts Hive
>> threads to the backend database/metastore and Hive server2 acts a gateway
>> for remote access to Hive metastore through beeline or other clients
>>
>> There is only one metastore here namely MySQL/Oracle or others.
>>
>> Thanks
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 30 March 2016 at 22:53, Gautam <gautamkows...@gmail.com> wrote:
>>
>>> Can you elaborate on where you see the bottleneck? A general overview
>>> of your access path would be useful, for instance whether you're
>>> accessing the Hive metastore via HiveServer2, from WebHCat, using the
>>> embedded CLI, or something else.
>>>
>>> Have you tried putting multiple metastores behind a load balancer? It's
>>> just a Thrift service over MySQL, so you can have multiple instances
>>> pointing to the same backend db.
>
>
> --
> "If you really want something in this life, you have to work for it. Now,
> quiet! They're about to announce the lottery numbers..."
>


Re: Hive Metastore Bottleneck

2016-03-30 Thread Udit Mehta
I was looking at *hive.metastore.max.server.threads*, but reading more into
it tells me it's a config for the Thrift server and not the metastore.

Most of our applications accessing the metastore are Spark SQL applications
which do INSERT operations on multiple partitions on an hourly basis. This
basically implies that most of these queries don't use the Thrift server
but connect directly to the metastore from the Spark application.

Can you give me more information on how we can put multiple metastores
behind a load balancer? I read about providing multiple URIs in
"*hive.metastore.uris*", but saw that it would always pick the first URI
and choose from the rest only in case of a failure.

Thanks again for the replies.
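For reference, a hedged sketch of the load-balancer approach Gautam describes: rather than listing every metastore in the client config (where clients prefer the first URI), point all clients at a single virtual address and let the balancer spread connections across the metastore processes. The hostname below is illustrative:

```xml
<!-- Sketch: a VIP/load balancer sits in front of several
     "hive --service metastore" processes, all sharing the same
     javax.jdo.option.ConnectionURL backend database -->
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://metastore-vip.example.com:9083</value>
</property>
```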

On Wed, Mar 30, 2016 at 2:30 PM, Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Are you talking about increase in number of threads from Hive server2
> connection to your database (MySQL)?
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>


Hive Metastore Bottleneck

2016-03-30 Thread Udit Mehta
Hi all,

We are currently running Hive in production and staging, with the metastore
connecting to a MySQL database in the backend. The traffic accessing the
metastore is higher in production than in staging, which is expected. We
have had a sudden increase in traffic, which has made metastore operations
take much longer than before. The same query takes much less time on
staging due to the lighter traffic on that cluster.

We tried increasing the heap space for the metastore process as well as
bumping up the memory for the MySQL database. Neither change seemed to help
much, and we still see delays. Is there any other config we can tune to
counter the increased traffic? I am looking at the max-threads config as
well, but I'm not sure this is the right path.

I'm wondering if the metastore is the bottleneck here, or if I'm missing
something.

Looking forward to your reply,
Udit
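Besides heap size and MySQL memory, the metastore's own connection pool to the database can also be a ceiling under increased traffic. A hedged example of widening it (the property name comes from DataNucleus, which Hive uses for metastore persistence; the value is illustrative, so verify both against your Hive version):

```xml
<!-- Sketch: allow more concurrent metastore->MySQL connections;
     defaults are small, tune alongside MySQL's max_connections -->
<property>
  <name>datanucleus.connectionPool.maxPoolSize</name>
  <value>30</value>
</property>
```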


Hive on spark table caching

2015-12-02 Thread Udit Mehta
Hi,

I have started using Hive on Spark recently and am exploring the benefits
it offers. I was wondering if Hive on Spark can cache tables the way Spark
SQL does, or whether it does any form of implicit caching in the
long-running job it starts after the first query.

Thanks,
Udit


Re: Hive on spark table caching

2015-12-02 Thread Udit Mehta
I'm using Spark 1.3 with Hive 1.2.1. I don't mind using a higher version of
Spark, but I read somewhere that 1.3 is the version of Spark currently
supported by Hive. Can I use Spark 1.4 or 1.5 with Hive 1.2.1?

On Wed, Dec 2, 2015 at 3:19 PM, Mich Talebzadeh <m...@peridale.co.uk> wrote:

> Hi,
>
>
>
> Which version of spark are you using please?
>
>
>
> Mich Talebzadeh
>
>
> http://talebzadehmich.wordpress.com
>
>
>


Building Spark to use for Hive on Spark

2015-11-18 Thread Udit Mehta
Hi,

I am planning to test the Hive on Spark functionality provided by the newer
versions of Hive. I wanted to know why it is necessary to remove the Hive
jars from the Spark build, as mentioned on this page.

This would require me to keep two Spark builds, one with the Hive jars and
one without.

Any help is appreciated,
Udit
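For reference, the wiki's recipe amounts to producing a Spark distribution built without the `-Phive` profile, so Hive's own jars take precedence at runtime. A hedged sketch for a Spark 1.x source tree (profile names and versions are illustrative, so check the Hive on Spark wiki for the combination matching your release):

```shell
# Sketch: build a "without hive" Spark distribution for Hive on Spark,
# run from the root of a Spark 1.x source checkout
./make-distribution.sh --name hadoop2-without-hive --tgz \
  -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0
```

The regular build with `-Phive` can then be kept separately for Spark SQL work, which is why two builds end up being needed.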


Re: Hive version with Spark

2015-11-18 Thread Udit Mehta
As per this link :
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started,
you need to build Spark without Hive.

On Wed, Nov 18, 2015 at 8:50 AM, Sofia  wrote:

> Hello
>
> After various failed tries to use my Hive (1.2.1) with my Spark (Spark
> 1.4.1 built for Hadoop 2.2.0) I decided to try to build again Spark with
> Hive.
> I would like to know what is the latest Hive version that can be used to
> build Spark at this point.
>
> When downloading Spark 1.5 source and trying:
>
> *mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -Phive -Phive-1.2.1
> -Phive-thriftserver  -DskipTests clean package*
>
> I get :
>
> *The requested profile "hive-1.2.1" could not be activated because it does
> not exist.*
>
> Thank you
> Sofia
>


Hive Server2 Monitoring

2015-11-11 Thread Udit Mehta
Hi,

I was planning to use the Hive Server2 in production and was wondering how
others monitor the Hive Server2 usage. I saw some recent errors related to
the PermGen and Heap Space so maybe that could be a start. It would also be
useful to detect some unusual activity like a sudden increase in
connections or queries.

Let me know if anyone has any ideas around this.

Thanks in advance,
Udit


Re: Hive Server2 Monitoring

2015-11-11 Thread Udit Mehta
Yes, that's right. We don't use Cloudera Manager, so we cannot use that.
Increasing the PermGen and heap space is what I did as well, but I still
have no insight into what causes them to grow.

On Wed, Nov 11, 2015 at 11:23 AM, Personal  wrote:

> We use Cloudera Manager, but I’m guessing since you’re asking you don’t
> use that.
> We also had an issue with PermGen and Heap Space — the default settings
> had these very low so we were getting OOM errors for very simple requests.


Best way to deal with incompatible column type changes

2015-10-15 Thread Udit Mehta
Hi,

I have a Hive external table with a lot of partitions where the underlying
data is in JSON. I use this popular serde to read and write in JSON format.

So I have a data stream where sometimes there are changes to the JSON
structure. For example, a key might change its type from string to a struct
or an array. Replacing/changing the column via an ALTER statement does not
really help, since it results in a ClassCastException (based on this
ticket).

My question is: what would be the best way to deal with such schema changes
without dropping/creating the table again? Basically, I don't want to lose
my partitions.

I am currently using Hive version 0.13.1 but planning to move to version
1.2.1 soon.

Any help/advise would be appreciated.

Thanks,
Udit
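Since the table is external, one hedged option: dropping an external table removes only the metadata, the HDFS files are untouched, so the table can be recreated with the new column types and its partitions rediscovered. The table name, schema placeholder, and location below are illustrative, and this assumes the partition directories follow the key=value layout that MSCK can recognize:

```sql
-- Sketch: swap the schema of an external table without losing data
DROP TABLE json_events;                   -- external: HDFS files remain
CREATE EXTERNAL TABLE json_events (
  ...                                     -- new schema with changed types
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION '/path/to/json_events';          -- same location as before
MSCK REPAIR TABLE json_events;            -- re-registers existing partitions
```

This is still a drop/create of the table object, but the partitions and data survive, which seems to be the actual concern.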


Change hive column size

2015-05-28 Thread Udit Mehta
Hi,

Per this ticket: https://issues.apache.org/jira/browse/HIVE-1364, the max
column size in Hive is limited to 4000 chars. But I read that there is a
way to increase it via *MySQL*, which is the database for our metastore.
Can anyone point me to how I can do this?

Our columns have deeply nested structs and easily cross the *4000*-char
limit.

Thanks,
Udit
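For reference, the recipe usually described is widening the relevant metastore columns directly in MySQL. A hedged sketch (back up the metastore database first; the table and column names follow the common metastore schema, so verify them against your schema version before running anything):

```sql
-- Sketch: run against the Hive metastore database in MySQL.
-- COLUMNS_V2.TYPE_NAME holds the serialized column type (VARCHAR(4000)
-- by default), which is what deeply nested structs overflow.
ALTER TABLE COLUMNS_V2 MODIFY TYPE_NAME MEDIUMTEXT;
ALTER TABLE TABLE_PARAMS MODIFY PARAM_VALUE MEDIUMTEXT;
```

Note that hand-editing the metastore schema is unsupported territory, so this is a workaround sketch rather than a recommended practice.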


Re: Change hive column size

2015-05-28 Thread Udit Mehta
Also this might be relevant now:
https://issues.apache.org/jira/browse/HIVE-9815

On Thu, May 28, 2015 at 10:41 PM, Udit Mehta ume...@groupon.com wrote:

 Hi Steve,

 I do see that the ticket applies to Hive 0.5. But I am facing a similar
 issue where a column of complex type cannot hold a nested struct beyond a
 certain size (around 4000), so I guessed 4000 might still be the enforced
 limit. Could it be a restriction imposed by the serde (which I doubt)? I
 use the JSON serde: https://github.com/rcongiu/Hive-JSON-Serde/tree/master

 Thanks,
 Udit

 On Thu, May 28, 2015 at 7:19 PM, Steve Howard stevedhow...@gmail.com
 wrote:

 Hi Udit,

 That JIRA is five years old and applies to hive 0.5.  Newer releases are
 far larger...

 https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types

 Thanks,

 Steve





Storage Based Authorization

2015-05-11 Thread Udit Mehta
Hi,

I have enabled storage based authorization in the hive metastore by adding
the following configs to hive-site:

   <property>
     <name>hive.security.authorization.enabled</name>
     <value>true</value>
   </property>

   <property>
     <name>hive.security.authorization.manager</name>
     <value>org.apache.hadoop.hive.ql.security.authorization.StorageBasedAuthorizationProvider</value>
   </property>

   <property>
     <name>hive.server2.enable.doAs</name>
     <value>true</value>
   </property>

   <property>
     <name>hive.security.metastore.authorization.manager</name>
     <value>org.apache.hadoop.hive.ql.security.authorization.StorageBasedAuthorizationProvider</value>
   </property>

   <property>
     <name>hive.warehouse.subdir.inherit.perms</name>
     <value>true</value>
   </property>


These configs work fine when I run Hive using the Hive CLI. But when I
connect to the Thrift server (HiveServer2) using Beeline, they don't seem
to take effect.
Is there any other config I need to add to enable storage-based
authorization in the Thrift server?

Any help would be appreciated.

Thanks,
Udit


Re: Hive drop table error

2015-04-08 Thread Udit Mehta
As a secondary thought, is it possible to remove the table from MySQL if
it's not possible to remove it from Hive? Which entries in the MySQL tables
would I need to remove?




Hive drop table error

2015-04-07 Thread Udit Mehta
Hi,

I was able to create a highly nested table in Hive, but for some reason I
am now unable to drop it or describe it. I get an IllegalArgumentException
and don't know how to delete the table now.
Does anyone have any ideas on how I can do this?
The table has more than 100 fields.

Thanks,
Udit


hyphen in hive struct field

2015-03-25 Thread Udit Mehta
Hi,

I have a Hive table query:

create external table test (field1 struct<`inner-table`:string>);

I believe hyphens are disallowed in identifiers, but I read that we can use
backticks around them to work around that. Even this seems to fail, though.

Is there a way around this, or are hyphens simply not allowed in nested
Hive fields?

Thanks,
Udit


Hive Json Serde

2015-02-23 Thread Udit Mehta
I am using Hive from HDP 2.2 and need to create a Hive table to query
multilevel JSON data in HDFS of the following format:

{
  "timestamp": 1424100629409,
  "head": {
    "time": "2015-02-16T15:30:29.409Z",
    "place": {
      "url": null,
      "country": "US"
    },
    "event_type": null,
    "name": "hive_test",
    "event_id": "1234",
    "metadata": {
      "scope": "search",
      "context": "test",
      "extra_info": null
    }
  },
  "sourceType": "test_source",
  "millisecond": null,
  "sourceFile": "test_file"
}

I am currently using the JSON serde:
https://github.com/rcongiu/Hive-JSON-Serde

But this does not let me define a table of the above format, where the
"head" key has a few string-to-string mappings and a few string-to-map
mappings.

Does anyone know of a serde that can define a table in this format?
Any help will be appreciated.

Thanks,
Udit
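For what it's worth, that serde can usually model mixed shapes like this by combining struct and map types in one column. A hedged sketch of a matching DDL (column names come from the JSON sample above; the serde class name is the one that serde's README uses; `metadata` is modeled as map<string,string> on the assumption that its values stay scalar, and `event_id` as string since the sample is ambiguous):

```sql
-- Sketch only: table/location names are illustrative
CREATE EXTERNAL TABLE events (
  `timestamp` bigint,                 -- backticks: "timestamp" is reserved
  head struct<
    time:string,
    place:struct<url:string,country:string>,
    event_type:string,
    name:string,
    event_id:string,
    metadata:map<string,string>       -- scope/context/extra_info land here
  >,
  sourceType string,
  millisecond bigint,
  sourceFile string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION '/path/to/events';
```

Fields inside the struct are then reachable as `head.place.country` or `head.metadata['scope']` in queries.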