Re: [DISCUSS] Supporting Hadoop-1 and experimental features

2015-05-24 Thread Edward Capriolo
"Same goes for stuff like MR; supporting it, esp. for perf work, becomes a
burden, and it’s outdated with 2 alternatives, one of which has been
around for 2 releases."

I am not trying to pick on your words here but I want to acknowledge
something.

"Been around for 2 releases" means less to people than you would think.
Many users are locked in by when their distribution chooses to cut a
release. As it turns out, there are two major distributions, and one of
them does pretty much nothing to support Tez. Here is what "around for
two releases" means for a CDH user:

http://search-hadoop.com/m/8er9RFVSf2&subj=Re+Getting+Tez+working+against+cdh+5+3

After much hacking on a rather new CDH version, I was still unable to get
the alternative running.

The other alternative, which I presume means Hive-on-Spark, probably has
not shipped in many distributions either. I do not think either
"alternative" has much real-world battlefield experience.

The reality is that a normal user has to test a series of processes before
they can pull the trigger on an upgrade. For example, I used to work at an
adtech company. Hive added a feature called "Exchange partitions". This
actually broke a number of our processes, because we used the word
"exchange" all the time (for instance, as a column or table name in our
scripts); once it became a keyword, many of our scripts no longer parsed.
This is not a fault of Hive or the feature, but it is just a fact that no
one wants to touch and re-test big, lumbering ETL processes (even with
lightning-fast sexy engines) five times a year.

I mentioned this before but I want to repeat it: Hive was "releasable
trunk" for a long time, and it served users well. We never had 2-4 feature
branches. One binary dropped on top of Hadoop 0.17, 0.20, 0.21, 0.20.203,
and 2.0. If we get into a situation where all the "old users" "don't care
about new features", we can easily land in a situation where our actual
users are running the "old" Hadoop, unable to upgrade to the "hive with
the new features" because it requires dependencies less than two months
old that have not yet been ported to their distribution. As a user I am
already starting to see this: the distributions lag behind Hive because a
point upgrade is not compelling for the distributor.

On Fri, May 22, 2015 at 4:19 PM, Alan Gates  wrote:

> I agree with *All* features with the exception that some features might be
> branch-1 specific (if it's a feature on something no longer supported in
> master, like hadoop-1).  Without this we prevent new features for older
> technology, which doesn't strike me as reasonable.
>
> I see your point on saying the contributor may not understand where best
> to put the patch, and thus the committer decides.  However, it would be
> very disappointing for a contributor who uses branch-1 to build a new
> feature only to have the committer put it only in master.  So I would
> modify your modification to say "at the discretion of the contributor and
> Hive committers".
>
> Alan.
>
>   kulkarni.swar...@gmail.com
>  May 22, 2015 at 11:41
> +1 on the new proposal. Feedback below:
>
> > New features must be put into master.  Whether to put them into
> branch-1 is at the discretion of the developer.
>
> How about we change this to "*All* features must be put into master.
> Whether to put them into branch-1 is at the discretion of the *committer*."
> The reason, I think, is that going forward, for us to sustain a happy and
> healthy community, it's imperative for us to make it easy not only for
> users, but also for developers and committers to contribute and commit
> patches. To me, as a Hive contributor, it would be hard to determine which
> branch my code belongs in. Also, IMO (and I might be wrong), many
> committers have their own areas of expertise, and it is very hard for them
> to immediately determine which branch a patch should go to unless that is
> very well documented somewhere. Putting all code into master would be an
> easy approach to follow; cherry-picking to other branches can then be
> done. So even if people forget to do that, we can always go back to master
> and port the patches out to those branches. So we have a master branch, a
> branch-1 for stable code, a branch-2 for experimental and "bleeding edge"
> code, and so on. Once branch-2 is stable, we deprecate branch-1, create
> branch-3, and move on.
>
> Another reason I say this is that, in my experience, a pretty significant
> amount of the work in Hive is still bug fixes, and I think that is what
> users care most about (correctness above anything else). So with this
> approach, it would be very obvious which branches to commit those fixes to.
>
>
>
>
> --
> Swarnim
>   Chris Drome
>  May 22, 2015 at 0:49
> I understand the motivation and benefits of creating a branch-2 where more
> disruptive work can go on without affecting branch-1. While not necessarily
> against this approach, from Yahoo's standpoint, I do have some questions
> (concerns).
> Upgrading to a new version of Hive requires a significant commitment of
> time and resources to stabilize and certify a build for deployment 

Re: Review Request 34455: HIVE-10550 Dynamic RDD caching optimization for HoS.[Spark Branch]

2015-05-24 Thread Xuefu Zhang

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/34455/#review85107
---



ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlan.java


Could you change this to DEBUG level and guard it with "if (LOG.isDebugEnabled())"?

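For reference, a minimal sketch of the guarded-logging pattern being requested, assuming the commons-logging Log API used in Hive at the time; the class name and message are illustrative, not the actual SparkPlan code:

    import org.apache.commons.logging.Log;
    import org.apache.commons.logging.LogFactory;

    public class GuardedDebugLogging {
        private static final Log LOG = LogFactory.getLog(GuardedDebugLogging.class);

        void logCachedWork(String workName) {
            // The guard ensures the message string is only concatenated
            // when DEBUG logging is actually enabled.
            if (LOG.isDebugEnabled()) {
                LOG.debug("Caching RDD for work: " + workName);
            }
        }
    }
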


ql/src/java/org/apache/hadoop/hive/ql/plan/SparkWork.java


worksToCache or WorksToClone?

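For context on the optimization under review: the idea behind RDD caching is that when one work's output is consumed by several downstream works, Spark can persist it in memory rather than recompute its lineage for each consumer. A rough, self-contained sketch of that idea in Spark's Java API; this is not the patch itself, and all names here are illustrative:

    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.storage.StorageLevel;

    public class RddCachingSketch {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("rdd-caching-sketch").setMaster("local");
            JavaSparkContext sc = new JavaSparkContext(conf);

            JavaRDD<Integer> shared = sc.parallelize(Arrays.asList(1, 2, 3, 4));
            // Persist once so the two consumers below reuse the computed
            // partitions instead of re-evaluating the lineage twice.
            shared.persist(StorageLevel.MEMORY_ONLY());

            long evens = shared.filter(x -> x % 2 == 0).count();
            long total = shared.count();
            System.out.println(evens + " of " + total + " values are even");

            sc.stop();
        }
    }
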

- Xuefu Zhang


On May 22, 2015, 6:18 a.m., chengxiang li wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/34455/
> ---
> 
> (Updated May 22, 2015, 6:18 a.m.)
> 
> 
> Review request for hive, Chao Sun, Jimmy Xiang, and Xuefu Zhang.
> 
> 
> Bugs: HIVE-10550
> https://issues.apache.org/jira/browse/HIVE-10550
> 
> 
> Repository: hive-git
> 
> 
> Description
> ---
> 
> see jira description
> 
> 
> Diffs
> -
> 
>   common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 43c53fc 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/CacheTran.java 
> PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/MapTran.java 2170243 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/ReduceTran.java e60dfac 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlan.java ee5c78a 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java 
> 3f240f5 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkUtilities.java 
> e6c845c 
>   
> ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkRddCachingResolver.java
>  PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkCompiler.java 
> 19aae70 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/SparkWork.java bb5dd79 
> 
> Diff: https://reviews.apache.org/r/34455/diff/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> chengxiang li
> 
>



[jira] [Created] (HIVE-10814) hive on tez skew table plan wrong

2015-05-24 Thread tangjunjie (JIRA)
tangjunjie created HIVE-10814:
-

 Summary: hive on tez skew table plan wrong
 Key: HIVE-10814
 URL: https://issues.apache.org/jira/browse/HIVE-10814
 Project: Hive
  Issue Type: Bug
  Components: Query Processor
Affects Versions: 1.1.0
 Environment: hive 1.1.0 + tez 0.53 
Reporter: tangjunjie


set hive.execution.engine=mr; 
set hive.mapred.supports.subdirectories=true; 
set hive.optimize.skewjoin.compiletime = true; 

ALTER TABLE tandem.fct_traffic_navpage_path_detl SKEWED BY 
(ordr_code,cart_prod_id) ON (('','NULL')); 

Vertex failed, vertexName=initialmap, vertexId=vertex_1419300485749_1514787_1_00, diagnostics=[Task failed, taskId=task_1419300485749_1514787_1_00_000245, diagnostics=[TaskAttempt 0 failed, info=[Error: Failure while running task:java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {"parnt_ordr_id":3715999535959,"parnt_ordr_code":"3715999535959","end_user_id":163846959,"comb_prod_id":7873715,"sale_amt":99.0,"actl_sale_amt":99.0,"sale_num":1,"updt_time":"2015-05-11 03:58:13","etl_batch_id":0,"ds":"2015-05-10"}
at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:179)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
at org.apache.tez.mapreduce.processor.map.MapProcessor.runOldMapper(MapProcessor.java:183)
at org.apache.tez.mapreduce.processor.map.MapProcessor.run(MapProcessor.java:126)
at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:324)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:176)
at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:168)

I think this is the error: the cart_prod_id column does not exist in table
univ_parnt_tranx_comb_detl, yet the generated plan filters on it:
TableScan
  alias: o
  Statistics: Num rows: 109845709 Data size: 14499651703 Basic stats: PARTIAL Column stats: NONE
  Filter Operator
    predicate: (not ((ordr_code = '') and (cart_prod_id = null))) (type: boolean)
    Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE
    Reduce Output Operator
      key expressions: parnt_ordr_code (type: string), comb_prod_id (type: bigint)
      sort order: ++
      Map-reduce partition columns: parnt_ordr_code (type: string), comb_prod_id (type: bigint)
      Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE
      value expressions: end_user_id (type: bigint), actl_sale_amt (type: double), sale_num (type: bigint), ds (type: string)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)