[GitHub] spark issue #22144: [SPARK-24935][SQL] : Problem with Executing Hive UDF's f...

2018-11-12 Thread tgravescs
Github user tgravescs commented on the issue:

https://github.com/apache/spark/pull/22144
  
I think we should add the support back.  It sounded like some people didn't 
like this PR for a fix so we would need to investigate something else, 
@cloud-fan  did you have more specifics about what is missing from the PR 
(other then a test) or what approach should be taken here?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22144: [SPARK-24935][SQL] : Problem with Executing Hive UDF's f...

2018-11-10 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22144
  
So, 2.4 is out. Where are we? Rereading the comments above, looks we should:

1. Find the root cause
2. Officially drop it if the workaround is not easy
3. Fix it if the workaround is simple
4. add a test

If not, I would just document that we dropped this.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22144: [SPARK-24935][SQL] : Problem with Executing Hive UDF's f...

2018-10-24 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22144
  
Maybe we should discuss it in another thread, I think we should officially 
write down the procedure about how to evaluate a bug report and label it as a 
blocker or not. Currently https://spark.apache.org/contributing.html only says 
that data correctness issues should be blockers, which can't cover all the 
cases(like this one). It's also inefficient if we require a PMC vote for every 
issue to decide if it's a blocker or not.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22144: [SPARK-24935][SQL] : Problem with Executing Hive UDF's f...

2018-10-24 Thread markhamstra
Github user markhamstra commented on the issue:

https://github.com/apache/spark/pull/22144
  
Thanks @tgravescs for your latest posts -- they've saved me from posting 
something similar in many respects but more strongly worded.

What is bothering me (not just in the discussion of this PR, but more 
broadly) is that we have individuals making declarative statements about 
whether something can or can't block a release, or that something "is not that 
important to Spark at this point", etc. -- things for which there is no 
supporting PMC vote or declared policy. It may be your opinion, @cloud-fan , 
that Hive compatibility should no longer be important to the Apache Spark 
project, and I have no problem with you expressing your opinion on the matter. 
That may even be the internal policy at your employer, I don't know. But you 
are just not in a position on your own to make this declaration for the Apache 
Spark project.

I don't mean to single you out, @cloud-fan , as the only offender, since 
this isn't a unique instance. For example, heading into a recent release we 
also saw individual declarations that the data correctness issue caused by the 
shuffle replay partitioning issue was not a blocker because it was not a 
regression or that it was not significant enough to alter the release schedule. 
Rather, my point is that things like release schedules, the declaration of 
release candidates, labeling JIRA tickets with "blocker", and de facto or even 
declared policy on regressions and release blockers are just tools in the 
service of the PMC. If, as was the case with the shuffle data correctness 
issue, PMC members think that the issue must be fixed before the next release, 
then release schedules, RC-status, other individuals' perceptions of importance 
to the project or of policy ultimately don't matter -- only the vote of the PMC 
does. What is concerning me is that, instead of efforts to persuade the 
 PMC members that something should not block the next release or should not be 
important to the project, I am seeing flat declarations that an issue is not a 
blocker or not important. That may serve to stifle work to immediately fix a 
bug, or to discourage other contributions, but I can assure that trying to make 
the PMC serve the tools instead of the other way around won't serve to persuade 
at least some PMC members on how they should vote.

Sorry, I guess I can't avoid wording things strongly after all.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22144: [SPARK-24935][SQL] : Problem with Executing Hive UDF's f...

2018-10-24 Thread tgravescs
Github user tgravescs commented on the issue:

https://github.com/apache/spark/pull/22144
  
@cloud-fan 

I agree with you that most issues do not have to block a release but I 
don't agree with most of your criteria for making that decision.   I've already 
talked about the other points you said that I disagree with so no point in 
going over again. 

The one I agree goes into the decision is your point 2:
> It fails the job instead of returning wrong result

 I think at this point it makes sense to start discussion on dev list to 
make sure people are in sync and in fact there is no written policy that I am 
aware of.  I want to make sure we don't have hard rules that state things like  
"since it was a bug in previous release it shouldn't be a blocker now".Its 
always going to be at people discretion as I think its impossible to have exact 
rules for this, but that is part of what we trust committers and PMC members to 
do.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22144: [SPARK-24935][SQL] : Problem with Executing Hive UDF's f...

2018-10-24 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22144
  
@tgravescs please quote my full comment instead of part of it.

> After all, this is a bug and a regression from previous releases, like 
other 1000 we've fixed before.

The point I was making there is, this issue is not the ones that HAVE TO 
block a release, like correctness issue. I immediately list the reasons 
afterward why I don't think it's a blocker.

> hive compatibility is not that important to Spark at this point

I'm sorry if this worries you. It's true that we focus more on Spark itself 
instead of Hive compatibility in the recent development, but this should not be 
applied to existing Hive compatibility features in Spark and we should still 
maintain them.

BTW, I removed the `supportPartial` flag because no aggregate functions in 
Spark need it(including the adapted Hive UDAF), but the problem exists in how 
to adapt Hive UDAF, which was introduced by  
https://issues.apache.org/jira/browse/SPARK-18186


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22144: [SPARK-24935][SQL] : Problem with Executing Hive UDF's f...

2018-10-24 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/22144
  
@tgravescs yes, I didn't say it was normal or good, but that it's not 
forbidden. Ex: no more Java 7 support in Spark 2.3. The point was: if this 
change were on purpose, then no it's not a regression. Can it have been on 
purpose? yes, cf. supra. Was it on purpose? no, not here, as it turns out.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22144: [SPARK-24935][SQL] : Problem with Executing Hive UDF's f...

2018-10-24 Thread tgravescs
Github user tgravescs commented on the issue:

https://github.com/apache/spark/pull/22144
  
while I'm ok with not blocking 2.4 for this as well, not for many of the 
reasons stated though. Note the jira was filed a Major not a blocker. Based on 
the information we have, the impact on the number of users for this issue seems 
low, it doesn't seem to cause a correctness issue as it seems to fail when the 
issue is hit, the 2.4 release is far enough underway and has other things users 
are waiting on that we don't want to delay.  But I think we need to investigate 
more and make a decision what we are doing with it, if we find that this does 
have higher impact we can do a 2.4.1 and really we would want it in previous 
versions as well.

I think the overall decision has to be based on the impact of the issue.  
As far as I know we don't have any written rules about this, but perhaps we 
need some.   The ultimate decision is basically if the release vote passes.  If 
the PMC members pass it they think its sufficient as a release.

I do also agree with @markhamstra about our criteria for calling it a 
non-blocker.  We should not be making that decision based on if it was 
regression from only last previous release.

I do NOT agree with @cloud-fan on most of his points as to why this is ok.

"After all, this is a bug and a regression from previous releases, like 
other 1000 we've fixed before. "

I'll state this again, this should have very little to do with the decision 
on if its a blocker, if its a correctness bug we are going to ignore because 
its been wrong for multiple release, the answer is NO we better not.  Many 
people don't upgrade immediately so things are found right away or its a 
obscure thing that only happens occasionally so it takes time for it to be 
reported.I do agree that the time the issue has been around does go into 
the calculation of the impact though. 

 - is a hive compatibility bug. Spark fails to run some Hive UDAFs

Really ?  what does a hive compatibility bug have to do with anything?  We 
state in our docs we support hive UDFs and UDAFs.  You seem very unconcerned 
with this which concerns me ("hive compatibility is not that important to Spark 
at this point") . Was there an official decision to drop this?  If so please 
point it out as I would strongly -1 this, otherwise anyone making changes here 
should keep compatibility and our committers are the ones that should enforce 
this and make sure it happens.  This is the basics of api compatibility. If we 
drop support for this many users will be hurt.  Just because your particular 
users don't use this, others do and as a member of Apache you should be 
concerned with the community not just your companies users.

@cloud-fan  you are the one that removed the supportPartial flag here: 
https://issues.apache.org/jira/browse/SPARK-19060 so we were assuming you had 
some knowledge of the code in this area and might have the back ground on it.

@srowen  your statement here: "Dropping support for something in a minor 
release isn't crazy though."  also concerns me.  We should not be dropping 
features on purpose in minor releases. This again is an api compatibility 
thing.  Unless its developer api or experimental. Obviously things get dropped 
by accident but we should not be doing this on purpose. Otherwise why do we 
have minor vs major releases at all.   



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22144: [SPARK-24935][SQL] : Problem with Executing Hive UDF's f...

2018-10-24 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22144
  
Adding it as a known issue sounds reasonable to me as well.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22144: [SPARK-24935][SQL] : Problem with Executing Hive UDF's f...

2018-10-23 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/22144
  
I think this is different from the blocker tickets we opened before. We 
should try our best to avoid accidentally dropping the existing support. Please 
encourage more people in the community to try our RCs and find out all the new 
regressions or bugs. 

During the RC stage, for the new regressions, we can either fix it if the 
fix is very safe/tiny; or revert the PRs that introduce the regressions. This 
is how we handle the regressions during the RC stage. 

For this specific case, I do not think the root cause is found. If we 
revert the previous PRs https://issues.apache.org/jira/browse/SPARK-18186 that 
were merged in 2.2 release, it could easily introduce new regressions. Thus, we 
normally do not revert the PRs that were merged in the previous releases. 
Please add it as a known issue in the release note. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22144: [SPARK-24935][SQL] : Problem with Executing Hive UDF's f...

2018-10-23 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22144
  
Unfortunately, we didn't drop it mistakenly. It's a mistake and we should 
fix it. What I try to avoid is adding back the `supportsPartial` flag. We 
should look into the root cause and see how to fix it better.

I don't know if this policy is written down officially, but I do remember 
we followed this policy many times in the previous releases. Please correct me 
if I am wrong.

I'll list it as a known issue in 2.4.0 release notes. It will be great if 
someone can investigate the root cause and propose a fix(with a test).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22144: [SPARK-24935][SQL] : Problem with Executing Hive UDF's f...

2018-10-23 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22144
  
> According to the policy, we don't have to block the current release 
because of i

@cloud-fan, BTW, would you mind if I ask to share what you read? I want to 
be aware of the policy as well. Maybe do you mean correctness issue and data 
lose mentioned at https://spark.apache.org/contributing.html? They can be 
considered as blockers?

Blocker means: pointless to release without this change as the release 
would be unusable to a large minority of users. Correctness and data loss 
issues should be considered Blockers.

It's difficult to say but at least looks not so few of users.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22144: [SPARK-24935][SQL] : Problem with Executing Hive UDF's f...

2018-10-23 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22144
  
After all, this is a bug and a regression from previous releases, like 
other 1000 we've fixed before. According to the policy, we don't have to block 
the current release because of it, and this bug
1. is a hive compatibility bug. Spark fails to run some Hive UDAFs
2. It fails the job instead of returning wrong result
3. the root cause is unknown

Thus, I think it's a non-blocker.

I looked into it more and I'm 90% sure this is caused by 
https://issues.apache.org/jira/browse/SPARK-18186 . We should spend more time 
on understanding how hive execute UDAF and fix the way we adapt it to Spark 
aggregate function.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22144: [SPARK-24935][SQL] : Problem with Executing Hive UDF's f...

2018-10-23 Thread markhamstra
Github user markhamstra commented on the issue:

https://github.com/apache/spark/pull/22144
  
Yes, @rxin I know that I was a little unfair to you in order to make my 
point sharper. Apologies. My concern is real, though.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22144: [SPARK-24935][SQL] : Problem with Executing Hive UDF's f...

2018-10-23 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/22144
  

@markhamstra how did you arrive at that conclusion? I said "it’s not a 
new regression and also
somewhat esoteric"


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22144: [SPARK-24935][SQL] : Problem with Executing Hive UDF's f...

2018-10-23 Thread markhamstra
Github user markhamstra commented on the issue:

https://github.com/apache/spark/pull/22144
  
@srowen I understand and agree. What bothers me is that the block-no block 
decision now often seems to be "not a regression; automatic no block" -- and 
that doesn't seem right to me.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22144: [SPARK-24935][SQL] : Problem with Executing Hive UDF's f...

2018-10-23 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/22144
  
@markhamstra I do think there's an unspoken but legitimate consideration 
here, and that's that there is also a cost to not shipping the N thousand other 
things users are waiting on in this release. This one sounds like it may cause 
weeks of delay, which is a lot. (And while the fact that this is already well 
delayed shouldn't matter to that logic, it kind of does in practice.)

That cost is why I'd say this change shouldn't block, rather than an 
erosion of some standard. In fact, there seemed to be plenty of appetite to 
back and fix several things in 2.4 that weren't, probably, correctness or 
regression problems.

As to whether this is a regression -- does sound like the current behavior 
is not what's intended and even wrong in some cases and that has to be fixed. 
It seems it was a regression in 2.2 that wasn't caught, and therefore should 
have blocked that release, but that ship has sailed. Now, it's become simply an 
established bug like any other. That doesn't make it less bad but it should be 
treated like something we found today vis-a-vis the release process.

I am left a little unclear on whether there's a correctness issue here. 
And: does 'fixing' this cause some user UDAFs that worked around this to break 
too?



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22144: [SPARK-24935][SQL] : Problem with Executing Hive UDF's f...

2018-10-23 Thread markhamstra
Github user markhamstra commented on the issue:

https://github.com/apache/spark/pull/22144
  
> It’s certainly not a blocker since it’s not a new regression

I really hate this line of argument. Somehow we seem to have slipped from 
"if it is a regression, then we must block" to "if it is not a regression, then 
we don't necessarily have to block, but we might still choose to" to "if it is 
not a regression, then we necessarily should not block." I'm not at all 
comfortable with what now appears to be our de facto policy on regressions and 
blocking releases.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22144: [SPARK-24935][SQL] : Problem with Executing Hive UDF's f...

2018-10-23 Thread AlexanderSaydakov
Github user AlexanderSaydakov commented on the issue:

https://github.com/apache/spark/pull/22144
  
I believe there is a misunderstanding. It is not just about HLL Sketch 
UDFs. It seems to be a wrong way of executing Hive UDFs in general. It does not 
always manifest itself as a failure. A wrong state is being passed to some 
merge() calls. This violates the contract described in Hive's documentation.


https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFEvaluator.java

A state is instantiated for PARTIAL1 mode, but then passed to merge() 
method, which should only happen in PARTIAL2 mode. This is just wrong. it just 
happens to work for many simple aggregations, which don't notice.

HLL Sketch UDFs use different subclass of state for optimization, so 
merge() fails to cast the state the the class it expects. A workaround would be 
to give up the optimization and use the same state, but I am not sure we can 
trust the results.



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22144: [SPARK-24935][SQL] : Problem with Executing Hive UDF's f...

2018-10-23 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/22144
  
So, some particular functionality worked in Spark 2.1, but didn't work 
starting in 2.2. If anything, that was the breaking change. (I wonder if it was 
on purpose, or documented, but, what's done is done in any event.) Dropping 
support for something in a minor release isn't crazy though.

But it doesn't work in 2.2 or 2.3. I don't see that it has to work in 2.4, 
right? 

I understand it impacts a user and was reported a long time ago, but, so 
were 100 other things.

Consider the broader context of the release cycle, no I do not think this 
should block 2.4.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22144: [SPARK-24935][SQL] : Problem with Executing Hive UDF's f...

2018-10-23 Thread tgravescs
Github user tgravescs commented on the issue:

https://github.com/apache/spark/pull/22144
  
The fact that it isn't a new regression shouldn't determine if its a 
blocker,  there are lots of things you don't hit right away. 

The impact does affect if we decide if this is a blocker.  Here it is 
definitely impacting the data sketches folks, it was reported a long time ago, 
I'll check back with them to see if they have worked around the issue.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22144: [SPARK-24935][SQL] : Problem with Executing Hive UDF's f...

2018-10-23 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22144
  
Wait wait .. do we only care about regressions as blockers for the last 
release (2.3)? I'm asking this because I really don't know. If so, I'm okay.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22144: [SPARK-24935][SQL] : Problem with Executing Hive UDF's f...

2018-10-23 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/22144
  
It’s certainly not a blocker since it’s not a new regression and also
somewhat esoteric. Would be good to fix though.

On Tue, Oct 23, 2018 at 8:20 AM Wenchen Fan 
wrote:

> This is not a PR that is ready to merge. We are likely talking about
> delaying 2.4.0 for multiple weeks because of this issue. Is it really 
worth?
>
> I'm not sure what's the exact policy, let's ping more people. @rxin
>  @srowen  @vanzin
>  @felixcheung 
> @gatorsmile 
>
> —
> You are receiving this because you were mentioned.
> Reply to this email directly, view it on GitHub
> , or 
mute
> the thread
> 

> .
>



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22144: [SPARK-24935][SQL] : Problem with Executing Hive UDF's f...

2018-10-23 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22144
  
This is not a PR that is ready to merge. We are likely talking about 
delaying 2.4.0 for multiple weeks because of this issue. Is it really worth?

I'm not sure what's the exact policy, let's ping more people. @rxin  
@srowen @vanzin @felixcheung @gatorsmile 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22144: [SPARK-24935][SQL] : Problem with Executing Hive UDF's f...

2018-10-23 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22144
  
For instance, 
https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/sketches-user/GmH4-OlHP9g/MW-J7Hg4BwAJ
 this discussion thread was started almost one year ago.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22144: [SPARK-24935][SQL] : Problem with Executing Hive UDF's f...

2018-10-23 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22144
  
It does try to fix a backward compatibility issue. It is found later now 
but still it is true we found a breaking issue to fix. Broken backward 
compatibility that potentially affects a set of users (it at least looks 
blocking an external Hive UDF library repo) sounds like a blocker to me.

Some of Hive compatibility issues are reported late because IMHO it's a bit 
complicated to follow (it requires to take a look for both Hive and Spark at 
least).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22144: [SPARK-24935][SQL] : Problem with Executing Hive UDF's f...

2018-10-23 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22144
  
Note that the `supportsPartial` flag was dropped at Spark 2.2, not 2.4.

I'm not very familiar with Hive code so I don't clearly know how it is 
broken. The worst case is, Hive has some UDAF that don't support partial 
aggregate, and Spark needs to adjust its aggregate framework. Or it's just we 
incorrectly adapt Hive UDAF to Spark aggregation function, and we can simply 
work around it.

I shouldn't state is as a feature, it's an ability of Spark's aggregate 
framework to stop partial aggregate for some functions.

This fix is not ready. We should at least update the doc of 
`HiveUDAFFunction`, so that we can know where we misunderstand Hive UDAF 
framework.

If we were at Spark 2.2, we should definitely revert the PRs that caused 
this issue. But it's 2.4 now, reverting very old commits is not safe.

Personally I don't think this is a blocker that we have to fix it before 
releasing. It's not a correctness issue. it doesn't impact a lot of users, and 
it's there for nearly 2 years.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22144: [SPARK-24935][SQL] : Problem with Executing Hive UDF's f...

2018-10-23 Thread tgravescs
Github user tgravescs commented on the issue:

https://github.com/apache/spark/pull/22144
  
@cloud-fan  I disagree, it seems like you broke functionality that was 
there and you can't do that in a minor release.   3.0 would be fine to drop I 
think but we should fix it for 2.4, this PR was in a hope to get people to 
respond to see if the change was close to a fix, if we think this will work 
then he can add unit tests, otherwise I don't see a reason to waste time 
writing unit tests if the code changes are not going to be accepted.

If I'm missing something please explain as you state this is a new feature 
but it certainly seems like a broken feature to me.




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22144: [SPARK-24935][SQL] : Problem with Executing Hive UDF's f...

2018-10-23 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22144
  
If we were going for 3.0, then I would definitely leave +1 and I agree that 
we should rather focus on Spark itself as a higher priority - we should do that 
when we go 3.0 and rather drop such cases IMHO. However, looks we are going for 
2.4 for now. 

It looks indeed a mistake and bringing a feature back that we had in 
previous releases, and sounds like the problem is not isolated (let's say, it's 
not specific to a few set UDAFs only), which, IMHO, looks a proper blocker.

How about we bring this back if the changes proposed here are all what we 
need? If the current change is not enough, we could at least mention this 
feature was dropped intentionally in migration guide.

About testing, this PR described "How was this patch tested?" although it 
has no test. @pgandhi999, are you able to write a test?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22144: [SPARK-24935][SQL] : Problem with Executing Hive UDF's f...

2018-10-23 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22144
  
My feeling is that, hive compatibility is not that important to Spark at 
this point. *ALL* aggregate functions in Spark (including Spark UDAF) support 
partial aggregate, but now we need to complicate the aggregation framework and 
support un-partial-able aggregate functions, only for a few Hive UDAFs.

Unless there is a simple workaround, or we can justify that Spark needs 
un-partial-able aggregate functions, IMO it's not worth to introduce this 
feature.

BTW this PR doesn't even have a test, so I'm not sure if we can have a 
simple workaround for it.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22144: [SPARK-24935][SQL] : Problem with Executing Hive UDF's f...

2018-09-27 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22144
  
hmm .. this is actually a regression and compatibility issue we should take 
a look into .. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22144: [SPARK-24935][SQL] : Problem with Executing Hive UDF's f...

2018-09-18 Thread tgravescs
Github user tgravescs commented on the issue:

https://github.com/apache/spark/pull/22144
  
@cloud-fan would you have time to look at this to at least see if this is 
something important?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22144: [SPARK-24935][SQL] : Problem with Executing Hive UDF's f...

2018-09-07 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/22144
  
hey, this looks important, could someone review this?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22144: [SPARK-24935][SQL] : Problem with Executing Hive UDF's f...

2018-09-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22144
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95763/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22144: [SPARK-24935][SQL] : Problem with Executing Hive UDF's f...

2018-09-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22144
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22144: [SPARK-24935][SQL] : Problem with Executing Hive UDF's f...

2018-09-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22144
  
**[Test build #95763 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95763/testReport)**
 for PR 22144 at commit 
[`173035d`](https://github.com/apache/spark/commit/173035dcd239299f9f93cb70f7bdd863053463ff).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22144: [SPARK-24935][SQL] : Problem with Executing Hive UDF's f...

2018-09-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22144
  
**[Test build #95763 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95763/testReport)**
 for PR 22144 at commit 
[`173035d`](https://github.com/apache/spark/commit/173035dcd239299f9f93cb70f7bdd863053463ff).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22144: [SPARK-24935][SQL] : Problem with Executing Hive UDF's f...

2018-09-06 Thread tgravescs
Github user tgravescs commented on the issue:

https://github.com/apache/spark/pull/22144
  
ok to test


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22144: [SPARK-24935][SQL] : Problem with Executing Hive UDF's f...

2018-09-05 Thread pgandhi999
Github user pgandhi999 commented on the issue:

https://github.com/apache/spark/pull/22144
  
@cloud-fan I have updated the PR and added the configurable flags for 
Window UDAF and Hive UDAF for now. Would appreciate to hear your thoughts. 
Thank you.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22144: [SPARK-24935][SQL] : Problem with Executing Hive UDF's f...

2018-08-23 Thread pgandhi999
Github user pgandhi999 commented on the issue:

https://github.com/apache/spark/pull/22144
  
@cloud-fan I see a flag supportsPartial which was removed in PR 
https://github.com/apache/spark/pull/16461. Is this the flag you were talking 
about? If so, I was thinking that we make this flag configurable from the user 
end, so it does not change any existing stuff but will create a backward 
compatibility for users whose custom UDAF's might not support partial 
aggregation like the one in the example mentioned in the PR. Am I thinking in 
the right direction? Would really appreciate your guidance in this matter. 
Thank you.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22144: [SPARK-24935][SQL] : Problem with Executing Hive UDF's f...

2018-08-21 Thread pgandhi999
Github user pgandhi999 commented on the issue:

https://github.com/apache/spark/pull/22144
  
@cloud-fan your comment makes sense, will work on it.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22144: [SPARK-24935][SQL] : Problem with Executing Hive UDF's f...

2018-08-19 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22144
  
This should not be a global config but per-UDAF flag. IIRC we do have such 
a flag before, but get removed later. Maybe we should bring it back.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22144: [SPARK-24935][SQL] : Problem with Executing Hive UDF's f...

2018-08-19 Thread pgandhi999
Github user pgandhi999 commented on the issue:

https://github.com/apache/spark/pull/22144
  
@dilipbiswal This property is by default set to true so it does not effect 
anything currently in the way UDAF's run. This property has been added purely 
for the purpose of maintaining backward compatibility with custom UDAF's that 
do not support partial aggregation yet. The method 
'planAggregateWithoutPartial' was previously present in version 2.1 but was 
removed from 2.2 onwards assuming that all UDAF's support partial aggregation 
which might not truly be the case as described in the JIRA.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22144: [SPARK-24935][SQL] : Problem with Executing Hive UDF's f...

2018-08-19 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22144
  
@pgandhi999 I have a basic question. Setting this property will have a 
global effect on all the aggregations ? 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22144: [SPARK-24935][SQL] : Problem with Executing Hive UDF's f...

2018-08-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22144
  
Can one of the admins verify this patch?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22144: [SPARK-24935][SQL] : Problem with Executing Hive UDF's f...

2018-08-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22144
  
Can one of the admins verify this patch?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22144: [SPARK-24935][SQL] : Problem with Executing Hive UDF's f...

2018-08-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22144
  
Can one of the admins verify this patch?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org