Re: Record too large for Tez in-memory buffer...

2016-02-11 Thread Gautam
setting tez.task.scale.memory.enabled=false; failed the job sooner.

But the fix from TEZ-2575 worked!

Upgraded from 0.7.0 to 0.7.1 and applied
https://github.com/apache/tez/commit/0e155e7185d1350f64dead488103777295ac76d1


Goes through without any fatal issues. Will continue testing / benchmarking
further.

thanks!
-Gautam.

On Wed, Feb 10, 2016 at 8:12 PM, Gopal Vijayaraghavan 
wrote:

>
> > Good to know there's a fix .. Is there a jira that talks about this
> >issue? Coz I couldn't find one.
>
> https://github.com/apache/tez/commit/714461f47e6408ec331acd0ddd640335e6a7a06c
>
>
> Also, it looks like Reducer 16 is the one failing - not Reducer 17.
>
> You can draw out the explain using https://github.com/t3rmin4t0r/lipwig
>
> PTF doesn't actually tell the UDAF name in the explain, so I'm guessing it's
> a ROW_NUMBER() <= 50 - because that's the only one which didn't get
> optimized.
>
> I see absolutely no broadcast edges in this, so it's possible to disable
> the weighted memory scaler in Tez to sort of dumb it down to MRv2 mode.
>
> set tez.task.scale.memory.enabled=false;
>
> *or* do extensive tuning for it (see
> tez.task.scale.memory.additionalreservation.fraction.max).
>
> Cheers,
> Gopal
>
>


-- 
"If you really want something in this life, you have to work for it. Now,
quiet! They're about to announce the lottery numbers..."
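For reference, a minimal sketch of the two knobs Gopal mentions above, as they
would be set from a Hive session (the 0.3 value is an illustrative assumption,
not a recommendation from this thread):

-- dumb the Tez memory allocation down to MRv2-style behaviour
set tez.task.scale.memory.enabled=false;

-- or keep scaling enabled and tune the additional reservation instead
set tez.task.scale.memory.additionalreservation.fraction.max=0.3;  -- example value only

Per the messages above, the actual fix for the "record too large" failure was
upgrading to 0.7.1 plus the TEZ-2575 commit; these settings only change how the
task memory is sized.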


RE: Add partition data to an external ORC table.

2016-02-11 Thread no jihun
Actually, the original source is a Flume stream of Avro-formatted rows.
The Flume sink streams them into HDFS partition directories.

Current data flow:
flume > avro > hdfs sink > daily partition dir

My expected best flow:
flume > orc > hdfs sink > partition dir

Another option:
flume > hdfs sink
then a Hive 'load data' command,
which would let Hive load the text into Hive in ORC format.

Because a large amount of data has to be processed, the HDFS sink distributes
the load.
If I used Flume's Hive sink, the Hive daemon could become a bottleneck, I think.

There seem to be many cases where people convert Avro to ORC.
If their previous data flow was based on Flume + HDFS sink, I am curious how
they did it in detail.
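One hedged sketch of the Avro-to-ORC step done entirely in Hive (the staging
table name, its location, its single-column schema, and Hive 0.14+ support for
STORED AS AVRO are assumptions for illustration, not details from this thread):

-- external staging table over the Flume/Avro landing directory
CREATE EXTERNAL TABLE message_avro_stage (message STRING)
  PARTITIONED BY (date_string STRING)
  STORED AS AVRO
  LOCATION '/flume/message';

-- register the directory the Flume HDFS sink filled for the day
ALTER TABLE message_avro_stage ADD PARTITION (date_string='20160212')
  LOCATION '/flume/message/20160212';

-- rewrite that day's rows into the ORC table; Hive produces the ORC files
INSERT OVERWRITE TABLE test PARTITION (date_string='20160212')
SELECT message FROM message_avro_stage WHERE date_string='20160212';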
On Feb 12, 2016, 4:34 AM, "Ryan Harris" wrote:

> If your original source is text, why don't you make your ORC-based table a
> Hive managed table instead of an external table?
>
> Then you can load/partition your text data into the external table, query
> from that and insert into your ORC-backed Hive managed table.
>
>
>
> Theoretically, if you had your data in ORC files, you could just copy them
> to the external table/partition like you do with the text data, but the
> challenge is, how are you going to create the ORC source data?  You can
> create it with Hive, Pig, custom Java, etc, but **somehow** you are going
> to have to get your data into ORC format.  Hive is probably the easiest
> tool to use to do that.  You could load the data into a hive managed table,
> and then copy the ORC files back to an external table, but why?
>
>
>
> *From:* no jihun [mailto:jees...@gmail.com]
> *Sent:* Thursday, February 11, 2016 11:48 AM
> *To:* user@hive.apache.org
> *Subject:* Add partition data to an external ORC table.
>
>
>
> hello.
>
> I want to know whether this is possible or not.
>
> There is a table created by:
>
> create external table test (
>   message String)
> PARTITIONED BY (date_string STRING)
> STORED AS ORC
> LOCATION '/message';
>
> With this table I will never add rows with an 'insert' statement,
> but instead want to:
> #1. add each day's data to the HDFS partition location directly,
>   e.g. /message/20160212
>   (by $ hadoop fs -put)
> #2. then add the partition every morning:
> ALTER TABLE test
> ADD PARTITION (date_string='20160212')
> LOCATION '/message/20160212';
> #3. query the added data.
>
> With this scenario, how can I prepare the ORC-formatted data in step #1?
> When the stored format is textfile I just need to copy the raw files to the
> partition directory, but with an ORC table I don't think this is possible so
> easily.
>
> The raw application log is JSON formatted and each day may have 1M JSON rows.
>
> Actually I already do this job on my cluster with a textfile table, not ORC;
> now I am trying to change the table format to ORC.
>
> Any advice would be great.
> thanks
> --
> THIS ELECTRONIC MESSAGE, INCLUDING ANY ACCOMPANYING DOCUMENTS, IS
> CONFIDENTIAL and may contain information that is privileged and exempt from
> disclosure under applicable law. If you are neither the intended recipient
> nor responsible for delivering the message to the intended recipient,
> please note that any dissemination, distribution, copying or the taking of
> any action in reliance upon the message is strictly prohibited. If you have
> received this communication in error, please notify the sender immediately.
> Thank you.
>


Anyway to show current user name on beeline

2016-02-11 Thread Jim Green
Hi Team,

I could not find a way to show the currently logged-on user in Beeline.
Is there any way to show that?

Something like:
Show current_user;
?


-- 
Thanks,
www.openkb.info
(Open KnowledgeBase for Hadoop/Database/OS/Network/Tool)


RE: Hive Permanent functions not working after a cluster restart.

2016-02-11 Thread Chagarlamudi, Prasanth
FYI Bastian,
This is fixed in version 1.2: https://issues.apache.org/jira/browse/HIVE-10288
If you are using the Cloudera distribution, the issue is fixed in CDH 5.5.
So far I haven't found an alternative other than moving to a newer version
or recreating the functions every time you restart.

Thanks
Prasanth Chagarlamudi

From: Bastian Kronenbitter [mailto:bastian.kronenbit...@adello.com]
Sent: Thursday, February 11, 2016 2:20 AM
To: user@hive.apache.org
Subject: RE: Hive Permanent functions not working after a cluster restart.

Hi all,

We face the same problem. Every time we restart the hive server, permanent 
functions are not working. We are also using the hive 1.1 version of CDH5.
Listing the functions using “SHOW FUNCTIONS” shows the permanent functions, but 
without the database prefix they normally have.
We use a mysql metastore db and looking into it, I see the permanent functions 
in FUNCS and the corresponding resources in FUNC_RU.
Dropping the functions and creating them again solves the problem (until the 
next restart).

Any help or pointer is very much appreciated.

Best regards,
Bastian

From: Surendra , Manchikanti 
[mailto:surendra.manchika...@gmail.com]
Sent: Mittwoch, 10. Februar 2016 20:09
To: user@hive.apache.org
Subject: Re: Hive Permanent functions not working after a cluster restart.

Please check whether your metastore database tables are being reset after a
restart. Permanent functions are stored in the DB.

-- Surendra Manchikanti

On Wed, Feb 10, 2016 at 7:58 AM, Chagarlamudi, Prasanth 
mailto:prasanth.chagarlam...@epsilon.com>> 
wrote:
Hi Surendra,
Its Derby.

Thanks
Prasanth C

From: Surendra , Manchikanti 
[mailto:surendra.manchika...@gmail.com]
Sent: Tuesday, February 09, 2016 3:44 PM
To: user@hive.apache.org
Subject: Re: Hive Permanent functions not working after a cluster restart.

Hi,

What's your metastore DB? Is it Derby (internal) or an external database?

Regards,
Surendra M

On Fri, Feb 5, 2016 at 10:07 AM, Chagarlamudi, Prasanth 
mailto:prasanth.chagarlam...@epsilon.com>> 
wrote:
I created permanent functions (rather than temp functions) in Hive to use them
across different sessions. It all works fine until I actually restart the Hive
server or cluster for any reason.

So is this the intended functionality of permanent functions?
Here is the hive doc link for Permanent functions.
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Create/Drop/ReloadFunction


1)  Placed my utils jar for Hive in the HDFS location
hdfs:///opt/myUtiljars/myUtil.jar

2)  CREATE FUNCTION schemaName.myFunctName AS
'com.myclass.name' USING JAR
'hdfs:///opt/myUtiljars/myUtil.jar';

3)  SELECT schemaName.myFunctName() FROM tableName;

This is what I did to create a permanent function through Beeline, and it is
working fine in other Beeline sessions as well.
Now, after I restart the servers, I can still see the function name in the
"show functions;" output, but I cannot use the function in any of my queries.

When I issue the command in 3) I get: Error: Error while compiling statement:
FAILED: SemanticException Line 0:-1 Invalid function schemaName.myFunctName '
(state=42000,code=4)
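Until an upgrade to a version with the HIVE-10288 fix, the drop-and-recreate
workaround mentioned earlier in this thread would look roughly like this after
each restart (a sketch only; the names and jar path are the ones from the
example above):

-- drop the stale registration and register the function again
DROP FUNCTION IF EXISTS schemaName.myFunctName;
CREATE FUNCTION schemaName.myFunctName
  AS 'com.myclass.name'
  USING JAR 'hdfs:///opt/myUtiljars/myUtil.jar';

-- verify it resolves again
SELECT schemaName.myFunctName() FROM tableName;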

I would like to create Permanent functions as mentioned above and I don’t want 
to deal with them every time I restart.

Any corrections (if I am missing anything) or suggestions are greatly appreciated.

Thanks in advance
Prasanth Chagarlamudi




This e-mail and files transmitted with it are confidential, and are intended 
solely for the use of the individual or entity to whom this e-mail is 
addressed. If you are not the intended recipient, or the employee or agent 
responsible to deliver it to the intended recipient, you are hereby notified 
that any dissemination, distribution or copying of this communication is 
strictly prohibited. If you are not one of the named recipient(s) or otherwise 
have reason to believe that you received this message in error, please 
immediately notify sender by e-mail, and destroy the original message. Thank 
You.





RE: Add partition data to an external ORC table.

2016-02-11 Thread Ryan Harris
If your original source is text, why don't you make your ORC-based table a Hive
managed table instead of an external table?
Then you can load/partition your text data into the external table, query from 
that and insert into your ORC-backed Hive managed table.

Theoretically, if you had your data in ORC files, you could just copy them to 
the external table/partition like you do with the text data, but the challenge 
is, how are you going to create the ORC source data?  You can create it with 
Hive, Pig, custom Java, etc, but *somehow* you are going to have to get your 
data into ORC format.  Hive is probably the easiest tool to use to do that.  
You could load the data into a hive managed table, and then copy the ORC files 
back to an external table, but why?
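A rough sketch of that staged flow, under the assumption of a one-column text
staging table (all table names, the column, and the partition value below are
illustrative placeholders, not details from this thread):

-- external staging table over the raw text landing directory
CREATE EXTERNAL TABLE message_raw (message STRING)
  PARTITIONED BY (date_string STRING)
  STORED AS TEXTFILE
  LOCATION '/message_raw';

-- register the day's directory that was filled with hadoop fs -put
ALTER TABLE message_raw ADD PARTITION (date_string='20160212')
  LOCATION '/message_raw/20160212';

-- managed ORC table; Hive writes the ORC files on insert
CREATE TABLE message_orc (message STRING)
  PARTITIONED BY (date_string STRING)
  STORED AS ORC;

INSERT OVERWRITE TABLE message_orc PARTITION (date_string='20160212')
SELECT message FROM message_raw WHERE date_string='20160212';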

From: no jihun [mailto:jees...@gmail.com]
Sent: Thursday, February 11, 2016 11:48 AM
To: user@hive.apache.org
Subject: Add partition data to an external ORC table.


hello.

I want to know whether this is possible or not.

There is a table created by:

create external table test (
  message String)
PARTITIONED BY (date_string STRING)
STORED AS ORC
LOCATION '/message';

With this table I will never add rows with an 'insert' statement,
but instead want to:
#1. add each day's data to the HDFS partition location directly,
  e.g. /message/20160212
  (by $ hadoop fs -put)
#2. then add the partition every morning:
ALTER TABLE test
ADD PARTITION (date_string='20160212')
LOCATION '/message/20160212';
#3. query the added data.

With this scenario, how can I prepare the ORC-formatted data in step #1?
When the stored format is textfile I just need to copy the raw files to the
partition directory, but with an ORC table I don't think this is possible so
easily.

The raw application log is JSON formatted and each day may have 1M JSON rows.

Actually I already do this job on my cluster with a textfile table, not ORC;
now I am trying to change the table format to ORC.

Any advice would be great.
thanks

==
THIS ELECTRONIC MESSAGE, INCLUDING ANY ACCOMPANYING DOCUMENTS, IS CONFIDENTIAL 
and may contain information that is privileged and exempt from disclosure under 
applicable law. If you are neither the intended recipient nor responsible for 
delivering the message to the intended recipient, please note that any 
dissemination, distribution, copying or the taking of any action in reliance 
upon the message is strictly prohibited. If you have received this 
communication in error, please notify the sender immediately.  Thank you.


Add partition data to an external ORC table.

2016-02-11 Thread no jihun
hello.

I want to know whether this is possible or not.

There is a table created by:

create external table test (
  message String)
PARTITIONED BY (date_string STRING)
STORED AS ORC
LOCATION '/message';

With this table I will never add rows with an 'insert' statement,
but instead want to:
#1. add each day's data to the HDFS partition location directly,
  e.g. /message/20160212
  (by $ hadoop fs -put)
#2. then add the partition every morning:
ALTER TABLE test
ADD PARTITION (date_string='20160212')
LOCATION '/message/20160212';
#3. query the added data.

With this scenario, how can I prepare the ORC-formatted data in step #1?
When the stored format is textfile I just need to copy the raw files to the
partition directory, but with an ORC table I don't think this is possible so
easily.

The raw application log is JSON formatted and each day may have 1M JSON rows.

Actually I already do this job on my cluster with a textfile table, not ORC;
now I am trying to change the table format to ORC.

Any advice would be great.
thanks


ApacheCon NA 2016 - Important Dates!!!

2016-02-11 Thread Melissa Warnkin
 Hello everyone!
I hope this email finds you well.  I hope everyone is as excited about 
ApacheCon as I am!
I'd like to remind you all of a couple of important dates, as well as ask for 
your assistance in spreading the word! Please use your social media platform(s) 
to get the word out! The more visibility, the better ApacheCon will be for 
all!! :)
CFP Close: February 12, 2016
CFP Notifications: February 29, 2016
Schedule Announced: March 3, 2016
To submit a talk, please visit:  
http://events.linuxfoundation.org/events/apache-big-data-north-america/program/cfp

Link to the main site can be found here:  
http://events.linuxfoundation.org/events/apache-big-data-north-america

Apache: Big Data North America 2016 Registration Fees:
Attendee Registration Fee: US$599 through March 6, US$799 through April 10,
US$999 thereafter
Committer Registration Fee: US$275 through April 10, US$375 thereafter
Student Registration Fee: US$275 through April 10, $375 thereafter
Planning to attend ApacheCon North America 2016 May 11 - 13, 2016? There is an 
add-on option on the registration form to join the conference for a discounted 
fee of US$399, available only to Apache: Big Data North America attendees.
So, please tweet away!!
I look forward to seeing you in Vancouver! Have a groovy day!!
~Melissa, on behalf of the ApacheCon Team






Re: HIVE insert to dynamic partition table runs forever / hangs

2016-02-11 Thread Harshit Sharan
Hey Prasanth

Thanks. Setting this param worked like a charm.

The query was straightforward, but it ran on big data.
Hive version is 0.13.1

Quoting apache wiki for this param:
hive.optimize.sort.dynamic.partition

   - Default Value: true in Hive 0.13.0 and 0.13.1; false in Hive 0.14.0
   and later (HIVE-8151 )
   - Added In: Hive 0.13.0 with HIVE-6455
   

When enabled, dynamic partitioning column will be globally sorted. This way
we can keep only one record writer open for each partition value in the
reducer thereby reducing the memory pressure on reducers.
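A minimal sketch of what this looks like in practice, assuming a
dynamic-partition insert like the one described in this thread (the column
names and the partition column dt are made up for illustration):

-- disable the sorted dynamic-partition optimization for the session
SET hive.optimize.sort.dynamic.partition=false;

-- standard dynamic-partitioning settings for the insert
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- the partition column must come last in the SELECT list
INSERT OVERWRITE TABLE tableB PARTITION (dt)
SELECT col1, col2, dt
FROM tableA;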

On Thu, Feb 11, 2016 at 8:14 PM, Prasanth Jayachandran <
pjayachand...@hortonworks.com> wrote:

> Hi
>
> What query are you running? And what hive version are you using?
>
> You can try with hive.optimize.sort.dynamic.partition set to false.
>
> Thanks
> Prasanth
>
> _
> From: Harshit Sharan 
> Sent: Thursday, February 11, 2016 5:07 AM
> Subject: HIVE insert to dynamic partition table runs forever / hangs
> To: 
>
>
>
> Let us say we have 2 hive tables, tableA & tableB. I am exploding tableA,
> JOINing it with few other tables, and then inserting into tableB.
>
> Insert works fine when tableB has no partitions, or insertions are done
> using static partition.
>
> However, when there is a dynamic partition, the map reduce jobs doesn't
> even start. It sort of hangs.
>
> To debug more, I set the following param while initializing hive:
>
> -hiveconf hive.root.logger=DEBUG,console
>
> Now, I can see that the job is not actually hung. It is continuously
> printing logs like:
>
> 16/02/11 09:25:50 [main]: INFO 
> optimizer.SortedDynPartitionOptimizer: Sorted dynamic partitioning 
> optimization kicked in..16/02/11 09:25:50 [main]: INFO 
> optimizer.SortedDynPartitionOptimizer: Inserted RS_2139 and EX_2140 as parent 
> of FS_68 and child of EX_213816/02/11 09:25:55 [main]: INFO 
> optimizer.SortedDynPartitionOptimizer: Sorted dynamic partitioning 
> optimization kicked in..16/02/11 09:25:55 [main]: INFO 
> optimizer.SortedDynPartitionOptimizer: Inserted RS_2141 and EX_2142 as parent 
> of FS_68 and child of EX_214016/02/11 09:25:59 [main]: INFO 
> optimizer.SortedDynPartitionOptimizer: Sorted dynamic partitioning 
> optimization kicked in..16/02/11 09:25:59 [main]: INFO 
> optimizer.SortedDynPartitionOptimizer: Inserted RS_2143 and EX_2144 as parent 
> of FS_68 and child of EX_214216/02/11 09:26:03 [main]: INFO 
> optimizer.SortedDynPartitionOptimizer: Sorted dynamic partitioning 
> optimization kicked in..16/02/11 09:26:03 [main]: INFO 
> optimizer.SortedDynPartitionOptimizer: Inserted RS_2145 and EX_2146 as parent 
> of FS_68 and child of EX_214416/02/11 09:26:08 [main]: INFO 
> optimizer.SortedDynPartitionOptimizer: Sorted dynamic partitioning 
> optimization kicked in..16/02/11 09:26:08 [main]: INFO 
> optimizer.SortedDynPartitionOptimizer: Inserted RS_2147 and EX_2148 as parent 
> of FS_68 and child of EX_214616/02/11 09:26:12 [main]: INFO 
> optimizer.SortedDynPartitionOptimizer: Sorted dynamic partitioning 
> optimization kicked in..16/02/11 09:26:12 [main]: INFO 
> optimizer.SortedDynPartitionOptimizer: Inserted RS_2149 and EX_2150 as parent 
> of FS_68 and child of EX_214816/02/11 09:26:17 [main]: INFO 
> optimizer.SortedDynPartitionOptimizer: Sorted dynamic partitioning 
> optimization kicked in..16/02/11 09:26:17 [main]: INFO 
> optimizer.SortedDynPartitionOptimizer: Inserted RS_2151 and EX_2152 as parent 
> of FS_68 and child of EX_215016/02/11 09:26:19 [Thread-5]: INFO 
> metrics.MetricsSaver: Saved 8:22 records to 
> /mnt/var/em/raw/i-63eec5e6_20160211_RunJar_14276_raw.bin16/02/11 09:26:21 
> [main]: INFO optimizer.SortedDynPartitionOptimizer: Sorted dynamic 
> partitioning optimization kicked in..16/02/11 09:26:21 [main]: INFO 
> optimizer.SortedDynPartitionOptimizer: Inserted RS_2153 and EX_2154 as parent 
> of FS_68 and child of EX_215216/02/11 09:26:26 [main]: INFO 
> optimizer.SortedDynPartitionOptimizer: Sorted dynamic partitioning 
> optimization kicked in..16/02/11 09:26:26 [main]: INFO 
> optimizer.SortedDynPartitionOptimizer: Inserted RS_2155 and EX_2156 as parent 
> of FS_68 and child of EX_215416/02/11 09:26:30 [main]: INFO 
> optimizer.SortedDynPartitionOptimizer: Sorted dynamic partitioning 
> optimization kicked in..16/02/11 09:26:30 [main]: INFO 
> optimizer.SortedDynPartitionOptimizer: Inserted RS_2157 and EX_2158 as parent 
> of FS_68 and child of EX_215616/02/11 09:26:35 [main]: INFO 
> optimizer.SortedDynPartitionOptimizer: Sorted dynamic partitioning 
> optimization kicked in..16/02/11 09:26:35 [main]: INFO 
> optimizer.SortedDynPartitionOptimizer: Inserted RS_2159 and EX_2160 as parent 
> of FS_68 and child of EX_215816/02/11 09:26:40 [main]: INFO 
> optimizer.SortedDynPartitionOptimizer: Sorted dynami

Using Contains function call

2016-02-11 Thread Mich Talebzadeh
 

Hi, 

Is it possible to look for more than one word in a file using
contains?

Example

scala> oralog.filter(line =>
line.contains("Errors")).collect().foreach(line => println(line))

This operation will return anything with "Errors" in it.

A few considerations, please:

Much like UNIX or HDFS tools, can I do a case-insensitive contains call, for
ERROR or error (like grep -i error)?

Is it possible to have more than one word entry in the contains call, like
egrep -i "Errors|ORA-"?

Thanks
-- 

Dr Mich Talebzadeh

LinkedIn
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com

NOTE: The information in this email is proprietary and confidential.
This message is for the designated recipient only, if you are not the
intended recipient, you should destroy it immediately. Any information
in this message shall not be understood as given or endorsed by Cloud
Technology Partners Ltd, its subsidiaries or their employees, unless
expressly so stated. It is the responsibility of the recipient to ensure
that this email is virus free, therefore neither Cloud Technology
partners Ltd, its subsidiaries nor their employees accept any
responsibility.

 

Re: reading ORC format on Spark-SQL

2016-02-11 Thread Philip Lee
Thanks for your reply!

According to you, because of this natural property of ORC, it cannot be
split beyond the default chunk size,
because it is not composed of lines like CSV.

Until you run out of capacity, a distributed system *has* to show sub-linear
scaling -
and will show flat scaling up to a particular point because of Amdahl's law.

This sentence is a bit confusing. So the time to read a CSV file on Spark
increases linearly as the data increases,
because it employs the full cluster, which means it runs out of capacity?

On the other hand, the time to read the ORC format shows flat
scaling
because it is not over capacity yet?

But the CSV file being loaded is not that big, I guess.

Could you correct me?
Thanks in advance.

Best,
Phil

On Wed, Feb 10, 2016 at 11:17 PM, Philip Lee  wrote:

> Thanks for your reply!
>
> According to you, because of this natural property of ORC, it cannot be
> split beyond the default chunk size,
> because it is not composed of lines like CSV.
>
> Until you run out of capacity, a distributed system *has* to show sub-linear
> scaling -
> and will show flat scaling up to a particular point because of Amdahl's
> law.
>
> This sentence is a bit confusing. So the time to read a CSV file on Spark
> increases linearly as the data increases,
> because it employs the full cluster, which means it runs out of capacity?
>
> On the other hand, the time to read the ORC format shows flat
> scaling
> because it is not over capacity yet?
>
> But the CSV file being loaded is not that big, I guess.
>
> Could you correct me?
> Thanks in advance.
>
> Best,
> Phil
>
> On Wed, Feb 10, 2016 at 10:51 PM, Mich Talebzadeh 
> wrote:
>
>> Hi,
>>
>>
>>
>> Your point on
>>
>>
>>
>> *" ORC readers are more efficient than reading text, but ORC readers
>> cannot*
>>
>> *split beyond a 64Mb chunk, while text readers can split down to 1 line
>> per*
>>
>> *task."*
>>
>>
>>
>> I thought you could decide on the stripe sizes less than default 64MB.
>> For example 16MB with setting 'orc.stripe.size'='16777216'
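(As a minimal sketch of the table property Mich mentions just above, set at
table-creation time; the table name and column are made-up placeholders, and
16777216 bytes is the 16MB from his example:

CREATE TABLE logs_orc (msg STRING)
  STORED AS ORC
  TBLPROPERTIES ('orc.stripe.size'='16777216');

Whether smaller stripes actually help Spark parallelism is exactly the point
being debated in this thread.)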
>>
>>
>>
>> Thanks
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> NOTE: The information in this email is proprietary and confidential. This
>> message is for the designated recipient only, if you are not the intended
>> recipient, you should destroy it immediately. Any information in this
>> message shall not be understood as given or endorsed by Peridale Technology
>> Ltd, its subsidiaries or their employees, unless expressly so stated. It is
>> the responsibility of the recipient to ensure that this email is virus
>> free, therefore neither Peridale Technology Ltd, its subsidiaries nor their
>> employees accept any responsibility.
>>
>>
>>
>>
>>
>>
>>
>> -Original Message-
>> From: Gopal Vijayaraghavan [mailto:go...@hortonworks.com] On Behalf Of
>> Gopal Vijayaraghavan
>> Sent: 10 February 2016 21:43
>> To: user@hive.apache.org
>> Subject: Re: reading ORC format on Spark-SQL
>>
>>
>>
>>
>>
>> > The reason why I am asking this kind of question is reading csv file on
>>
>> >Spark is linearly increasing as the data size increase a bit, but reading
>>
>> >ORC format on Spark-SQL is still same as the data size increses in
>>
>> >.
>>
>> ...
>>
>> > This cause is from (just property of reading ORC format) or (creating
>>
>> >the table for input and loading the input in the table) or both?
>>
>>
>>
>> ORC readers are more efficient than reading text, but ORC readers cannot
>>
>> split beyond a 64Mb chunk, while text readers can split down to 1 line per
>>
>> task.
>>
>>
>>
>> So, it's possible the CSV readers are producing many many more divisions
>>
>> and running the query using the full cluster always - splitting
>>
>> indiscriminately is not always faster as each task has some fixed overhead
>>
>> unrelated to the data size (like plan deserialization in Kryo).
>>
>>
>>
>> For ORC - 59 tasks can run in the same time as 193 tasks, as long as
>>
>> there's capacity to run 193 in a single pass (like 200 executors).
>>
>>
>>
>> Until you run out of capacity, a distributed system *has* to show
>>
>> sub-linear scaling - and will show flat scaling upto a particular point
>>
>> because of Amdahl's law.
>>
>>
>>
>> Cheers,
>>
>> Gopal
>>
>
>
>
> --
>
> ==
>
> *Hae Joon Lee*
>
>
> Now, in Germany,
>
> M.S. Candidate, Interested in Distributed System, Iterative Processing
>
> Dept. of Computer Science, Informatik in German, TUB
>
> Technical University of Berlin
>
>
> In Korea,
>
> M.S. Candidate, Computer Architecture Laboratory
>
> Dept. of Computer Science, KAIST
>
>
> Rm# 4414 CS Dept. KAIST
>
> 373-1 Guseong-dong, Yuseong-gu, Daejon, South Korea (305-701)
>
>
> Mobile) 49) 015-251-448-278 in Germany, no cellular in Korea
>
> =

Re: HIVE insert to dynamic partition table runs forever / hangs

2016-02-11 Thread Prasanth Jayachandran
Hi

What query are you running? And what hive version are you using?

You can try with hive.optimize.sort.dynamic.partition set to false.

Thanks
Prasanth

_
From: Harshit Sharan mailto:hsincredi...@gmail.com>>
Sent: Thursday, February 11, 2016 5:07 AM
Subject: HIVE insert to dynamic partition table runs forever / hangs
To: mailto:user@hive.apache.org>>



Let us say we have 2 hive tables, tableA & tableB. I am exploding tableA, 
JOINing it with few other tables, and then inserting into tableB.

Insert works fine when tableB has no partitions, or insertions are done using 
static partition.

However, when there is a dynamic partition, the map reduce jobs doesn't even 
start. It sort of hangs.

To debug more, I set the following param while initializing hive:

-hiveconf hive.root.logger=DEBUG,console

Now, I can see that the job is not actually hung. It is continuously printing 
logs like:


16/02/11 09:25:50 [main]: INFO 
optimizer.SortedDynPartitionOptimizer: Sorted dynamic partitioning optimization 
kicked in..16/02/11 09:25:50 [main]: INFO 
optimizer.SortedDynPartitionOptimizer: Inserted RS_2139 and EX_2140 as parent 
of FS_68 and child of EX_213816/02/11 09:25:55 [main]: INFO 
optimizer.SortedDynPartitionOptimizer: Sorted dynamic partitioning optimization 
kicked in..16/02/11 09:25:55 [main]: INFO 
optimizer.SortedDynPartitionOptimizer: Inserted RS_2141 and EX_2142 as parent 
of FS_68 and child of EX_214016/02/11 09:25:59 [main]: INFO 
optimizer.SortedDynPartitionOptimizer: Sorted dynamic partitioning optimization 
kicked in..16/02/11 09:25:59 [main]: INFO 
optimizer.SortedDynPartitionOptimizer: Inserted RS_2143 and EX_2144 as parent 
of FS_68 and child of EX_214216/02/11 09:26:03 [main]: INFO 
optimizer.SortedDynPartitionOptimizer: Sorted dynamic partitioning optimization 
kicked in..16/02/11 09:26:03 [main]: INFO 
optimizer.SortedDynPartitionOptimizer: Inserted RS_2145 and EX_2146 as parent 
of FS_68 and child of EX_214416/02/11 09:26:08 [main]: INFO 
optimizer.SortedDynPartitionOptimizer: Sorted dynamic partitioning optimization 
kicked in..16/02/11 09:26:08 [main]: INFO 
optimizer.SortedDynPartitionOptimizer: Inserted RS_2147 and EX_2148 as parent 
of FS_68 and child of EX_214616/02/11 09:26:12 [main]: INFO 
optimizer.SortedDynPartitionOptimizer: Sorted dynamic partitioning optimization 
kicked in..16/02/11 09:26:12 [main]: INFO 
optimizer.SortedDynPartitionOptimizer: Inserted RS_2149 and EX_2150 as parent 
of FS_68 and child of EX_214816/02/11 09:26:17 [main]: INFO 
optimizer.SortedDynPartitionOptimizer: Sorted dynamic partitioning optimization 
kicked in..16/02/11 09:26:17 [main]: INFO 
optimizer.SortedDynPartitionOptimizer: Inserted RS_2151 and EX_2152 as parent 
of FS_68 and child of EX_215016/02/11 09:26:19 [Thread-5]: INFO 
metrics.MetricsSaver: Saved 8:22 records to 
/mnt/var/em/raw/i-63eec5e6_20160211_RunJar_14276_raw.bin16/02/11 09:26:21 
[main]: INFO optimizer.SortedDynPartitionOptimizer: Sorted dynamic partitioning 
optimization kicked in..16/02/11 09:26:21 [main]: INFO 
optimizer.SortedDynPartitionOptimizer: Inserted RS_2153 and EX_2154 as parent 
of FS_68 and child of EX_215216/02/11 09:26:26 [main]: INFO 
optimizer.SortedDynPartitionOptimizer: Sorted dynamic partitioning optimization 
kicked in..16/02/11 09:26:26 [main]: INFO 
optimizer.SortedDynPartitionOptimizer: Inserted RS_2155 and EX_2156 as parent 
of FS_68 and child of EX_215416/02/11 09:26:30 [main]: INFO 
optimizer.SortedDynPartitionOptimizer: Sorted dynamic partitioning optimization 
kicked in..16/02/11 09:26:30 [main]: INFO 
optimizer.SortedDynPartitionOptimizer: Inserted RS_2157 and EX_2158 as parent 
of FS_68 and child of EX_215616/02/11 09:26:35 [main]: INFO 
optimizer.SortedDynPartitionOptimizer: Sorted dynamic partitioning optimization 
kicked in..16/02/11 09:26:35 [main]: INFO 
optimizer.SortedDynPartitionOptimizer: Inserted RS_2159 and EX_2160 as parent 
of FS_68 and child of EX_215816/02/11 09:26:40 [main]: INFO 
optimizer.SortedDynPartitionOptimizer: Sorted dynamic partitioning optimization 
kicked in..16/02/11 09:26:40 [main]: INFO 
optimizer.SortedDynPartitionOptimizer: Inserted RS_2161 and EX_2162 as parent 
of FS_68 and child of EX_216016/02/11 09:26:45 [main]: INFO 
optimizer.SortedDynPartitionOptimizer: Sorted dynamic partitioning optimization 
kicked in..16/02/11 09:26:45 [main]: INFO 
optimizer.SortedDynPartitionOptimizer: Inserted RS_2163 and EX_2164 as parent 
of FS_68 and child of EX_216216/02/11 09:26:49 [Thread-5]: INFO 
metrics.MetricsSaver: Saved 8:22 records to 
/mnt/var/em/raw/i-63eec5e6_20160211_RunJar_14276_raw.bin16/02/11 09:26:50 
[main]: INFO optimizer.SortedDynPartitionOptimizer: Sorted dynamic partitioning 
optimization kicked in..16/02/11 09:26:50 [main]: INFO 
optimizer.SortedDynPartitionOptimizer: Inserted RS_2165 and EX_2166 as parent

HIVE insert to dynamic partition table runs forever / hangs

2016-02-11 Thread Harshit Sharan
Let us say we have 2 hive tables, tableA & tableB. I am exploding tableA,
JOINing it with few other tables, and then inserting into tableB.

Insert works fine when tableB has no partitions, or insertions are done
using static partition.

However, when there is a dynamic partition, the map reduce jobs doesn't
even start. It sort of hangs.

To debug more, I set the following param while initializing hive:

-hiveconf hive.root.logger=DEBUG,console

Now, I can see that the job is not actually hung. It is continuously
printing logs like:



16/02/11 09:25:50 [main]: INFO
optimizer.SortedDynPartitionOptimizer: Sorted dynamic partitioning
optimization kicked in..
16/02/11 09:25:50 [main]: INFO
optimizer.SortedDynPartitionOptimizer: Inserted RS_2139 and EX_2140 as
parent of FS_68 and child of EX_2138
16/02/11 09:25:55 [main]: INFO
optimizer.SortedDynPartitionOptimizer: Sorted dynamic partitioning
optimization kicked in..
16/02/11 09:25:55 [main]: INFO
optimizer.SortedDynPartitionOptimizer: Inserted RS_2141 and EX_2142 as
parent of FS_68 and child of EX_2140
16/02/11 09:25:59 [main]: INFO
optimizer.SortedDynPartitionOptimizer: Sorted dynamic partitioning
optimization kicked in..
16/02/11 09:25:59 [main]: INFO
optimizer.SortedDynPartitionOptimizer: Inserted RS_2143 and EX_2144 as
parent of FS_68 and child of EX_2142
16/02/11 09:26:03 [main]: INFO
optimizer.SortedDynPartitionOptimizer: Sorted dynamic partitioning
optimization kicked in..
16/02/11 09:26:03 [main]: INFO
optimizer.SortedDynPartitionOptimizer: Inserted RS_2145 and EX_2146 as
parent of FS_68 and child of EX_2144
16/02/11 09:26:08 [main]: INFO
optimizer.SortedDynPartitionOptimizer: Sorted dynamic partitioning
optimization kicked in..
16/02/11 09:26:08 [main]: INFO
optimizer.SortedDynPartitionOptimizer: Inserted RS_2147 and EX_2148 as
parent of FS_68 and child of EX_2146
16/02/11 09:26:12 [main]: INFO
optimizer.SortedDynPartitionOptimizer: Sorted dynamic partitioning
optimization kicked in..
16/02/11 09:26:12 [main]: INFO
optimizer.SortedDynPartitionOptimizer: Inserted RS_2149 and EX_2150 as
parent of FS_68 and child of EX_2148
16/02/11 09:26:17 [main]: INFO
optimizer.SortedDynPartitionOptimizer: Sorted dynamic partitioning
optimization kicked in..
16/02/11 09:26:17 [main]: INFO
optimizer.SortedDynPartitionOptimizer: Inserted RS_2151 and EX_2152 as
parent of FS_68 and child of EX_2150
16/02/11 09:26:19 [Thread-5]: INFO metrics.MetricsSaver: Saved
8:22 records to
/mnt/var/em/raw/i-63eec5e6_20160211_RunJar_14276_raw.bin
16/02/11 09:26:21 [main]: INFO
optimizer.SortedDynPartitionOptimizer: Sorted dynamic partitioning
optimization kicked in..
16/02/11 09:26:21 [main]: INFO
optimizer.SortedDynPartitionOptimizer: Inserted RS_2153 and EX_2154 as
parent of FS_68 and child of EX_2152
16/02/11 09:26:26 [main]: INFO
optimizer.SortedDynPartitionOptimizer: Sorted dynamic partitioning
optimization kicked in..
16/02/11 09:26:26 [main]: INFO
optimizer.SortedDynPartitionOptimizer: Inserted RS_2155 and EX_2156 as
parent of FS_68 and child of EX_2154
16/02/11 09:26:30 [main]: INFO
optimizer.SortedDynPartitionOptimizer: Sorted dynamic partitioning
optimization kicked in..
16/02/11 09:26:30 [main]: INFO
optimizer.SortedDynPartitionOptimizer: Inserted RS_2157 and EX_2158 as
parent of FS_68 and child of EX_2156
16/02/11 09:26:35 [main]: INFO
optimizer.SortedDynPartitionOptimizer: Sorted dynamic partitioning
optimization kicked in..
16/02/11 09:26:35 [main]: INFO
optimizer.SortedDynPartitionOptimizer: Inserted RS_2159 and EX_2160 as
parent of FS_68 and child of EX_2158
16/02/11 09:26:40 [main]: INFO
optimizer.SortedDynPartitionOptimizer: Sorted dynamic partitioning
optimization kicked in..
16/02/11 09:26:40 [main]: INFO
optimizer.SortedDynPartitionOptimizer: Inserted RS_2161 and EX_2162 as
parent of FS_68 and child of EX_2160
16/02/11 09:26:45 [main]: INFO
optimizer.SortedDynPartitionOptimizer: Sorted dynamic partitioning
optimization kicked in..
16/02/11 09:26:45 [main]: INFO
optimizer.SortedDynPartitionOptimizer: Inserted RS_2163 and EX_2164 as
parent of FS_68 and child of EX_2162
16/02/11 09:26:49 [Thread-5]: INFO metrics.MetricsSaver: Saved
8:22 records to
/mnt/var/em/raw/i-63eec5e6_20160211_RunJar_14276_raw.bin
16/02/11 09:26:50 [main]: INFO
optimizer.SortedDynPartitionOptimizer: Sorted dynamic partitioning
optimization kicked in..
16/02/11 09:26:50 [main]: INFO
optimizer.SortedDynPartitionOptimizer: Inserted RS_2165 and EX_2166 as
parent of FS_68 and child of EX_2164
16/02/11 09:26:56 [main]: INFO
optimizer.SortedDynPartitionOptimizer: Sorted dynamic partitioning
optimization kicked in..
16/02/11 09:26:56 [main]: INFO
optimizer.SortedDynPartitionOptimizer: Inserted RS_2167 and EX_2168 as
parent of FS_68 and child of EX_2166

..


These logs are printed like forever! However, without the dynamic
partition, the complete insert quer

RE: Hive Permanent functions not working after a cluster restart.

2016-02-11 Thread Bastian Kronenbitter
Hi all,

We face the same problem. Every time we restart the hive server, permanent 
functions are not working. We are also using the hive 1.1 version of CDH5.
Listing the functions using “SHOW FUNCTIONS” shows the permanent functions, but 
without the database prefix they normally have.
We use a mysql metastore db and looking into it, I see the permanent functions 
in FUNCS and the corresponding resources in FUNC_RU.
Dropping the functions and creating them again solves the problem (until the 
next restart).

Any help or pointer is very much appreciated.

Best regards,
Bastian

From: Surendra , Manchikanti [mailto:surendra.manchika...@gmail.com]
Sent: Mittwoch, 10. Februar 2016 20:09
To: user@hive.apache.org
Subject: Re: Hive Permanent functions not working after a cluster restart.

Please check whether your metastore database tables are being reset after a
restart. Permanent functions are stored in the DB.

-- Surendra Manchikanti

On Wed, Feb 10, 2016 at 7:58 AM, Chagarlamudi, Prasanth 
mailto:prasanth.chagarlam...@epsilon.com>> 
wrote:
Hi Surendra,
Its Derby.

Thanks
Prasanth C

From: Surendra , Manchikanti 
[mailto:surendra.manchika...@gmail.com]
Sent: Tuesday, February 09, 2016 3:44 PM
To: user@hive.apache.org
Subject: Re: Hive Permanent functions not working after a cluster restart.

Hi,

What's your metastore DB? Is it Derby (internal) or an external database?

Regards,
Surendra M

On Fri, Feb 5, 2016 at 10:07 AM, Chagarlamudi, Prasanth 
mailto:prasanth.chagarlam...@epsilon.com>> 
wrote:
I created permanent functions (rather than temp functions) in Hive to use them
across different sessions. It all works fine until I actually restart the Hive
server or cluster for any reason.

So is this the intended functionality of permanent functions?
Here is the hive doc link for Permanent functions.
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Create/Drop/ReloadFunction


1)  Placed my utils jar for Hive in the HDFS location
hdfs:///opt/myUtiljars/myUtil.jar

2)  CREATE FUNCTION schemaName.myFunctName AS
'com.myclass.name' USING JAR
'hdfs:///opt/myUtiljars/myUtil.jar';

3)  SELECT schemaName.myFunctName() FROM tableName;

This is what I did to create a permanent function through Beeline, and it is
working fine in other Beeline sessions as well.
Now, after I restart the servers, I can still see the function name in the
"show functions;" output, but I cannot use the function in any of my queries.

When I issue the command in 3) I get: Error: Error while compiling statement:
FAILED: SemanticException Line 0:-1 Invalid function schemaName.myFunctName '
(state=42000,code=4)

I would like to create Permanent functions as mentioned above and I don’t want 
to deal with them every time I restart.

Any corrections (if I am missing anything) or suggestions are greatly appreciated.

Thanks in advance
Prasanth Chagarlamudi




This e-mail and files transmitted with it are confidential, and are intended 
solely for the use of the individual or entity to whom this e-mail is 
addressed. If you are not the intended recipient, or the employee or agent 
responsible to deliver it to the intended recipient, you are hereby notified 
that any dissemination, distribution or copying of this communication is 
strictly prohibited. If you are not one of the named recipient(s) or otherwise 
have reason to believe that you received this message in error, please 
immediately notify sender by e-mail, and destroy the original message. Thank 
You.



