[jira] [Commented] (HIVE-18955) HoS: Unable to create Channel from class NioServerSocketChannel

2018-04-02 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16422529#comment-16422529
 ] 

Rui Li commented on HIVE-18955:
---

The failures don't seem related, since they've been failing in previous runs as 
well. I tried some of the "likely timed out" cases and they passed locally.
[~jcamachorodriguez], what do you think about the patch?

As for downgrading netty, I guess that may introduce other conflicts, because we 
upgraded netty to align with Spark. [~stakiar], please correct me if I'm 
misunderstanding.

> HoS: Unable to create Channel from class NioServerSocketChannel
> ---
>
> Key: HIVE-18955
> URL: https://issues.apache.org/jira/browse/HIVE-18955
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Rui Li
>Assignee: Rui Li
>Priority: Blocker
> Attachments: HIVE-18955.1.patch, HIVE-18955.1.patch
>
>
> Hit the issue when trying to launch a Spark job. Stack trace:
> {noformat}
> Caused by: java.lang.NoSuchMethodError: 
> io.netty.channel.DefaultChannelId.newInstance()Lio/netty/channel/DefaultChannelId;
> at io.netty.channel.AbstractChannel.newId(AbstractChannel.java:111) 
> ~[netty-all-4.1.17.Final.jar:4.1.17.Final]
> at io.netty.channel.AbstractChannel.<init>(AbstractChannel.java:83) 
> ~[netty-all-4.1.17.Final.jar:4.1.17.Final]
> at 
> io.netty.channel.nio.AbstractNioChannel.<init>(AbstractNioChannel.java:84) 
> ~[netty-all-4.1.17.Final.jar:4.1.17.Final]
> at 
> io.netty.channel.nio.AbstractNioMessageChannel.<init>(AbstractNioMessageChannel.java:42)
>  ~[netty-all-4.1.17.Final.jar:4.1.17.Final]
> at 
> io.netty.channel.socket.nio.NioServerSocketChannel.<init>(NioServerSocketChannel.java:86)
>  ~[netty-all-4.1.17.Final.jar:4.1.17.Final]
> at 
> io.netty.channel.socket.nio.NioServerSocketChannel.<init>(NioServerSocketChannel.java:72)
>  ~[netty-all-4.1.17.Final.jar:4.1.17.Final]
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method) ~[?:1.8.0_151]
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>  ~[?:1.8.0_151]
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>  ~[?:1.8.0_151]
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423) 
> ~[?:1.8.0_151]
> at 
> io.netty.channel.ReflectiveChannelFactory.newChannel(ReflectiveChannelFactory.java:38)
>  ~[netty-all-4.1.17.Final.jar:4.1.17.Final]
> ... 32 more
> {noformat}
> It seems we have conflicting versions of the class 
> {{io.netty.channel.DefaultChannelId}} from async-http-client.jar and 
> netty-all.jar.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-18831) Differentiate errors that are thrown by Spark tasks

2018-03-30 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16420253#comment-16420253
 ] 

Rui Li commented on HIVE-18831:
---

[~stakiar], yeah I agree sending a Throwable makes it easier to retrieve the 
root cause. Let's just use RuntimeException in such cases. I don't think we 
need another wrapper class just to be serializable.

> Differentiate errors that are thrown by Spark tasks
> ---
>
> Key: HIVE-18831
> URL: https://issues.apache.org/jira/browse/HIVE-18831
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
>Priority: Major
> Attachments: HIVE-18831.1.patch, HIVE-18831.2.patch, 
> HIVE-18831.3.patch, HIVE-18831.4.patch, HIVE-18831.6.patch, 
> HIVE-18831.7.patch, HIVE-18831.8.WIP.patch
>
>
> We propagate exceptions from Spark task failures to the client well, but we 
> don't differentiate between errors from HS2 / RSC vs. errors thrown by 
> individual tasks.
> The main motivation is that when the client sees a propagated Spark exception, it's 
> difficult to know what part of the execution threw the exception.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-18955) HoS: Unable to create Channel from class NioServerSocketChannel

2018-03-29 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16420062#comment-16420062
 ] 

Rui Li commented on HIVE-18955:
---

Re-upload the patch for another ptest run.

[~stakiar], do you think we can downgrade netty as [~yuchaoran2011] suggested?

> HoS: Unable to create Channel from class NioServerSocketChannel
> ---
>
> Key: HIVE-18955
> URL: https://issues.apache.org/jira/browse/HIVE-18955
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Rui Li
>Assignee: Rui Li
>Priority: Blocker
> Attachments: HIVE-18955.1.patch, HIVE-18955.1.patch
>
>
> Hit the issue when trying to launch a Spark job. Stack trace:
> {noformat}
> Caused by: java.lang.NoSuchMethodError: 
> io.netty.channel.DefaultChannelId.newInstance()Lio/netty/channel/DefaultChannelId;
> at io.netty.channel.AbstractChannel.newId(AbstractChannel.java:111) 
> ~[netty-all-4.1.17.Final.jar:4.1.17.Final]
> at io.netty.channel.AbstractChannel.<init>(AbstractChannel.java:83) 
> ~[netty-all-4.1.17.Final.jar:4.1.17.Final]
> at 
> io.netty.channel.nio.AbstractNioChannel.<init>(AbstractNioChannel.java:84) 
> ~[netty-all-4.1.17.Final.jar:4.1.17.Final]
> at 
> io.netty.channel.nio.AbstractNioMessageChannel.<init>(AbstractNioMessageChannel.java:42)
>  ~[netty-all-4.1.17.Final.jar:4.1.17.Final]
> at 
> io.netty.channel.socket.nio.NioServerSocketChannel.<init>(NioServerSocketChannel.java:86)
>  ~[netty-all-4.1.17.Final.jar:4.1.17.Final]
> at 
> io.netty.channel.socket.nio.NioServerSocketChannel.<init>(NioServerSocketChannel.java:72)
>  ~[netty-all-4.1.17.Final.jar:4.1.17.Final]
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method) ~[?:1.8.0_151]
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>  ~[?:1.8.0_151]
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>  ~[?:1.8.0_151]
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423) 
> ~[?:1.8.0_151]
> at 
> io.netty.channel.ReflectiveChannelFactory.newChannel(ReflectiveChannelFactory.java:38)
>  ~[netty-all-4.1.17.Final.jar:4.1.17.Final]
> ... 32 more
> {noformat}
> It seems we have conflicting versions of the class 
> {{io.netty.channel.DefaultChannelId}} from async-http-client.jar and 
> netty-all.jar.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-18955) HoS: Unable to create Channel from class NioServerSocketChannel

2018-03-29 Thread Rui Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-18955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rui Li updated HIVE-18955:
--
Attachment: HIVE-18955.1.patch

> HoS: Unable to create Channel from class NioServerSocketChannel
> ---
>
> Key: HIVE-18955
> URL: https://issues.apache.org/jira/browse/HIVE-18955
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Rui Li
>Assignee: Rui Li
>Priority: Blocker
> Attachments: HIVE-18955.1.patch, HIVE-18955.1.patch
>
>
> Hit the issue when trying to launch a Spark job. Stack trace:
> {noformat}
> Caused by: java.lang.NoSuchMethodError: 
> io.netty.channel.DefaultChannelId.newInstance()Lio/netty/channel/DefaultChannelId;
> at io.netty.channel.AbstractChannel.newId(AbstractChannel.java:111) 
> ~[netty-all-4.1.17.Final.jar:4.1.17.Final]
> at io.netty.channel.AbstractChannel.<init>(AbstractChannel.java:83) 
> ~[netty-all-4.1.17.Final.jar:4.1.17.Final]
> at 
> io.netty.channel.nio.AbstractNioChannel.<init>(AbstractNioChannel.java:84) 
> ~[netty-all-4.1.17.Final.jar:4.1.17.Final]
> at 
> io.netty.channel.nio.AbstractNioMessageChannel.<init>(AbstractNioMessageChannel.java:42)
>  ~[netty-all-4.1.17.Final.jar:4.1.17.Final]
> at 
> io.netty.channel.socket.nio.NioServerSocketChannel.<init>(NioServerSocketChannel.java:86)
>  ~[netty-all-4.1.17.Final.jar:4.1.17.Final]
> at 
> io.netty.channel.socket.nio.NioServerSocketChannel.<init>(NioServerSocketChannel.java:72)
>  ~[netty-all-4.1.17.Final.jar:4.1.17.Final]
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method) ~[?:1.8.0_151]
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>  ~[?:1.8.0_151]
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>  ~[?:1.8.0_151]
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423) 
> ~[?:1.8.0_151]
> at 
> io.netty.channel.ReflectiveChannelFactory.newChannel(ReflectiveChannelFactory.java:38)
>  ~[netty-all-4.1.17.Final.jar:4.1.17.Final]
> ... 32 more
> {noformat}
> It seems we have conflicting versions of the class 
> {{io.netty.channel.DefaultChannelId}} from async-http-client.jar and 
> netty-all.jar.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-18831) Differentiate errors that are thrown by Spark tasks

2018-03-29 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16418971#comment-16418971
 ] 

Rui Li commented on HIVE-18831:
---

Besides, I think the {{JobResultSerializer}} has some problems:
# It shouldn't clear the Output when the Java serializer fails. The Output is 
maintained by Kryo and may already contain data. Instead, I think we can try 
(with the fallback) to serialize the JobResult into a byte array first, and then 
write the byte array into the Output. The deserializer can read the byte array 
and use an ObjectInputStream to deserialize the data.
# How about we just send the stack trace if the Throwable can't be serialized? 
We can set the Throwable to null in JobResult and keep the stack trace string 
as we do now. I think this is simpler and doesn't need the assumption that 
SparkException is always serializable. A rough sketch of both ideas is below.
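
For illustration only, a minimal sketch of the above, assuming the Kryo 3.x 
{{Serializer}} API; the nested {{JobResult}} stand-in and its 
{{withStackTraceOnly()}} helper are hypothetical placeholders, not the actual 
classes in the patch:
{code:java}
import com.esotericsoftware.kryo.Kryo;
import com.esotericsoftware.kryo.Serializer;
import com.esotericsoftware.kryo.io.Input;
import com.esotericsoftware.kryo.io.Output;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.PrintWriter;
import java.io.Serializable;
import java.io.StringWriter;

public class JobResultSerializer extends Serializer<JobResultSerializer.JobResult> {

  // Hypothetical stand-in for the real JobResult message; only here to make
  // the sketch self-contained.
  public static class JobResult implements Serializable {
    Object result;
    Throwable error;    // may not be serializable
    String stackTrace;  // fallback representation of the error

    JobResult withStackTraceOnly() {
      JobResult copy = new JobResult();
      copy.result = result;
      copy.error = null;
      if (error != null) {
        StringWriter sw = new StringWriter();
        error.printStackTrace(new PrintWriter(sw, true));
        copy.stackTrace = sw.toString();
      } else {
        copy.stackTrace = stackTrace;
      }
      return copy;
    }
  }

  @Override
  public void write(Kryo kryo, Output output, JobResult value) {
    byte[] bytes;
    try {
      bytes = toBytes(value);
    } catch (Exception e) {
      // The Throwable (or something it references) isn't serializable:
      // keep only its stack trace string and retry.
      try {
        bytes = toBytes(value.withStackTraceOnly());
      } catch (Exception e2) {
        throw new RuntimeException("Unable to serialize JobResult", e2);
      }
    }
    // Touch Kryo's Output only after serialization has fully succeeded, so a
    // failure can never leave partial data in the buffer.
    output.writeInt(bytes.length, true);
    output.writeBytes(bytes);
  }

  @Override
  public JobResult read(Kryo kryo, Input input, Class<JobResult> type) {
    byte[] bytes = input.readBytes(input.readInt(true));
    try (ObjectInputStream ois =
        new ObjectInputStream(new ByteArrayInputStream(bytes))) {
      return (JobResult) ois.readObject();
    } catch (IOException | ClassNotFoundException e) {
      throw new RuntimeException("Unable to deserialize JobResult", e);
    }
  }

  private static byte[] toBytes(Serializable value) throws IOException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
      oos.writeObject(value);
    }
    return bos.toByteArray();
  }
}
{code}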

> Differentiate errors that are thrown by Spark tasks
> ---
>
> Key: HIVE-18831
> URL: https://issues.apache.org/jira/browse/HIVE-18831
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
>Priority: Major
> Attachments: HIVE-18831.1.patch, HIVE-18831.2.patch, 
> HIVE-18831.3.patch, HIVE-18831.4.patch, HIVE-18831.6.patch, 
> HIVE-18831.7.patch, HIVE-18831.8.WIP.patch
>
>
> We propagate exceptions from Spark task failures to the client well, but we 
> don't differentiate between errors from HS2 / RSC vs. errors thrown by 
> individual tasks.
> The main motivation is that when the client sees a propagated Spark exception, it's 
> difficult to know what part of the execution threw the exception.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-18955) HoS: Unable to create Channel from class NioServerSocketChannel

2018-03-29 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16418871#comment-16418871
 ] 

Rui Li commented on HIVE-18955:
---

Hi [~jcamachorodriguez], [~bslim], do you have further comments about the 
patch? And could you let me know in which use case shading needs to be 
skipped? Thanks.

> HoS: Unable to create Channel from class NioServerSocketChannel
> ---
>
> Key: HIVE-18955
> URL: https://issues.apache.org/jira/browse/HIVE-18955
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Rui Li
>Assignee: Rui Li
>Priority: Blocker
> Attachments: HIVE-18955.1.patch
>
>
> Hit the issue when trying to launch a Spark job. Stack trace:
> {noformat}
> Caused by: java.lang.NoSuchMethodError: 
> io.netty.channel.DefaultChannelId.newInstance()Lio/netty/channel/DefaultChannelId;
> at io.netty.channel.AbstractChannel.newId(AbstractChannel.java:111) 
> ~[netty-all-4.1.17.Final.jar:4.1.17.Final]
> at io.netty.channel.AbstractChannel.<init>(AbstractChannel.java:83) 
> ~[netty-all-4.1.17.Final.jar:4.1.17.Final]
> at 
> io.netty.channel.nio.AbstractNioChannel.<init>(AbstractNioChannel.java:84) 
> ~[netty-all-4.1.17.Final.jar:4.1.17.Final]
> at 
> io.netty.channel.nio.AbstractNioMessageChannel.<init>(AbstractNioMessageChannel.java:42)
>  ~[netty-all-4.1.17.Final.jar:4.1.17.Final]
> at 
> io.netty.channel.socket.nio.NioServerSocketChannel.<init>(NioServerSocketChannel.java:86)
>  ~[netty-all-4.1.17.Final.jar:4.1.17.Final]
> at 
> io.netty.channel.socket.nio.NioServerSocketChannel.<init>(NioServerSocketChannel.java:72)
>  ~[netty-all-4.1.17.Final.jar:4.1.17.Final]
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method) ~[?:1.8.0_151]
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>  ~[?:1.8.0_151]
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>  ~[?:1.8.0_151]
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423) 
> ~[?:1.8.0_151]
> at 
> io.netty.channel.ReflectiveChannelFactory.newChannel(ReflectiveChannelFactory.java:38)
>  ~[netty-all-4.1.17.Final.jar:4.1.17.Final]
> ... 32 more
> {noformat}
> It seems we have conflicting versions of the class 
> {{io.netty.channel.DefaultChannelId}} from async-http-client.jar and 
> netty-all.jar.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-18831) Differentiate errors that are thrown by Spark tasks

2018-03-28 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16418463#comment-16418463
 ] 

Rui Li commented on HIVE-18831:
---

[~stakiar], I think the right way to register the serializer is to call 
{{Kryo::register(Class type, Serializer serializer)}}.
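For example (a hedged sketch; JobResult and JobResultSerializer are placeholders 
for the classes in the patch):
{code:java}
Kryo kryo = new Kryo();
// Tell Kryo to use the custom serializer for every JobResult instance.
kryo.register(JobResult.class, new JobResultSerializer());
{code}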

> Differentiate errors that are thrown by Spark tasks
> ---
>
> Key: HIVE-18831
> URL: https://issues.apache.org/jira/browse/HIVE-18831
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
>Priority: Major
> Attachments: HIVE-18831.1.patch, HIVE-18831.2.patch, 
> HIVE-18831.3.patch, HIVE-18831.4.patch, HIVE-18831.6.patch, 
> HIVE-18831.7.patch, HIVE-18831.8.WIP.patch
>
>
> We propagate exceptions from Spark task failures to the client well, but we 
> don't differentiate between errors from HS2 / RSC vs. errors thrown by 
> individual tasks.
> The main motivation is that when the client sees a propagated Spark exception, it's 
> difficult to know what part of the execution threw the exception.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-18831) Differentiate errors that are thrown by Spark tasks

2018-03-27 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16416858#comment-16416858
 ] 

Rui Li commented on HIVE-18831:
---

How about we register a custom serializer for {{JobResult}}, which can try to 
serialize the Throwable like {{JavaSerializer}} and fall back to stack trace 
string if that fails? And we can use an extra boolean to indicate whether the 
Throwable is serialized, so that the deserializer can deserialize accordingly.
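
A rough sketch of just the write side under that scheme, assuming the Kryo 3.x 
{{Serializer}} API and a hypothetical {{JobResult}} with 
{{getError()}}/{{getStackTrace()}} accessors (names are illustrative only):
{code:java}
@Override
public void write(Kryo kryo, Output output, JobResult value) {
  byte[] errorBytes = null;
  try {
    // Try plain Java serialization of the Throwable, similar to Kryo's JavaSerializer.
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
      oos.writeObject(value.getError());
    }
    errorBytes = bos.toByteArray();
  } catch (Exception e) {
    errorBytes = null; // fall back to the stack trace string below
  }
  boolean errorSerialized = errorBytes != null;
  // The extra boolean tells the deserializer which representation follows.
  output.writeBoolean(errorSerialized);
  if (errorSerialized) {
    output.writeInt(errorBytes.length, true);
    output.writeBytes(errorBytes);
  } else {
    output.writeString(value.getStackTrace());
  }
  // ... write the remaining JobResult fields as before ...
}
{code}
The read side would check the boolean first and then either deserialize the 
Throwable or read the stack trace string.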

> Differentiate errors that are thrown by Spark tasks
> ---
>
> Key: HIVE-18831
> URL: https://issues.apache.org/jira/browse/HIVE-18831
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
>Priority: Major
> Attachments: HIVE-18831.1.patch, HIVE-18831.2.patch, 
> HIVE-18831.3.patch, HIVE-18831.4.patch, HIVE-18831.6.patch, HIVE-18831.7.patch
>
>
> We propagate exceptions from Spark task failures to the client well, but we 
> don't differentiate between errors from HS2 / RSC vs. errors thrown by 
> individual tasks.
> The main motivation is that when the client sees a propagated Spark exception, it's 
> difficult to know what part of the execution threw the exception.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-18831) Differentiate errors that are thrown by Spark tasks

2018-03-27 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16415746#comment-16415746
 ] 

Rui Li commented on HIVE-18831:
---

Hi [~stakiar], thanks for the explanation. I agree differentiating task 
exceptions is useful, as described by this JIRA. But I don't feel a strong need 
for the new wrapper class. I think it's more natural to expect a Throwable when 
something goes wrong. And that's usually what you need at the end of the day, 
e.g. when calling {{Task::setException}} and {{JobHandleImpl::setFailure}}. But 
returning a Throwable doesn't imply the Throwable has to come from the RPC -- 
that's an implementation detail.
A possible improvement, similar to SPARK-8625, is to try to transfer the original 
Throwable and fall back to the current implementation when that fails. It can be 
done in a separate JIRA though. What do you think?

> Differentiate errors that are thrown by Spark tasks
> ---
>
> Key: HIVE-18831
> URL: https://issues.apache.org/jira/browse/HIVE-18831
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
>Priority: Major
> Attachments: HIVE-18831.1.patch, HIVE-18831.2.patch, 
> HIVE-18831.3.patch, HIVE-18831.4.patch, HIVE-18831.6.patch, HIVE-18831.7.patch
>
>
> We propagate exceptions from Spark task failures to the client well, but we 
> don't differentiate between errors from HS2 / RSC vs. errors thrown by 
> individual tasks.
> The main motivation is that when the client sees a propagated Spark exception, it's 
> difficult to know what part of the execution threw the exception.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-18925) Hive doesn't work when JVM is America/Bahia_Banderas time zone

2018-03-26 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16413785#comment-16413785
 ] 

Rui Li commented on HIVE-18925:
---

BTW, the patch should be named {{HIVE-18925.2.patch}}. [~findepi] please upload 
a patch with the proper name for testing. Refer to [this 
wiki|https://cwiki.apache.org/confluence/display/Hive/HowToContribute#HowToContribute-CreatingaPatch]
 for more details.

> Hive doesn't work when JVM is America/Bahia_Banderas time zone
> --
>
> Key: HIVE-18925
> URL: https://issues.apache.org/jira/browse/HIVE-18925
> Project: Hive
>  Issue Type: Bug
> Environment: JVM in America/Bahia_Banderas zone
>Reporter: Piotr Findeisen
>Assignee: Piotr Findeisen
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-18925-2.patch, HIVE-18925.patch
>
>
> HiveServer2 doesn't work if started with 
> {{-Duser.timezone=America/Bahia_Banderas}}
>  
> Steps to reproduce
>  # use [https://github.com/big-data-europe/docker-hive]
>  # Add {{HADOOP_CLIENT_OPTS: '-Duser.timezone=America/Bahia_Banderas'}} to 
> {{hive-server}} docker container environment configuration
>  # {{docker-compose up}}
>  # 
> {code:java}
> host# docker-compose exec hive-server bash
> container# /opt/hive/bin/beeline -u jdbc:hive2://localhost:1 
> --verbose=true
> ...
> jdbc:hive2://localhost:1> select 1;{code}
> The above fails and prints
> {noformat}
> Error: java.lang.IllegalStateException: Can't overwrite cause with 
> org.joda.time.IllegalInstantException: Illegal instant due to time zone 
> offset transition (daylight savings time 'gap'): 1970-01-01T00:00:00.000 
> (America/Bahia_Banderas) (state=08S01,code=0)
> java.sql.SQLException: java.lang.IllegalStateException: Can't overwrite cause 
> with org.joda.time.IllegalInstantException: Illegal instant due to time zone 
> offset transition (daylight savings time 'gap'): 1970-01-01T00:00:00.000 
> (America/Bahia_Banderas)
> at org.apache.hive.jdbc.HiveStatement.runAsyncOnServer(HiveStatement.java:323)
> at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:253)
> at org.apache.hive.beeline.Commands.executeInternal(Commands.java:997)
> at org.apache.hive.beeline.Commands.execute(Commands.java:1205)
> at org.apache.hive.beeline.Commands.sql(Commands.java:1134)
> at org.apache.hive.beeline.BeeLine.dispatch(BeeLine.java:1314)
> at org.apache.hive.beeline.BeeLine.execute(BeeLine.java:1178)
> at org.apache.hive.beeline.BeeLine.begin(BeeLine.java:1033)
> at org.apache.hive.beeline.BeeLine.mainWithInputRedirection(BeeLine.java:519)
> at org.apache.hive.beeline.BeeLine.main(BeeLine.java:501)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
> Caused by: java.lang.IllegalStateException: Can't overwrite cause with 
> org.joda.time.IllegalInstantException: Illegal instant due to time zone 
> offset transition (daylight savings time 'gap'): 1970-01-01T00:00:00.000 
> (America/Bahia_Banderas)
> at java.lang.Throwable.initCause(Throwable.java:457)
> at 
> org.apache.hive.service.cli.HiveSQLException.toStackTrace(HiveSQLException.java:237)
> at 
> org.apache.hive.service.cli.HiveSQLException.toStackTrace(HiveSQLException.java:237)
> at 
> org.apache.hive.service.cli.HiveSQLException.toCause(HiveSQLException.java:198)
> at 
> org.apache.hive.service.cli.HiveSQLException.<init>(HiveSQLException.java:108)
> at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:267)
> at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:253)
> at org.apache.hive.jdbc.HiveStatement.runAsyncOnServer(HiveStatement.java:313)
> ... 15 more
> Caused by: java.lang.ExceptionInInitializerError: null
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> at 
> org.apache.hive.service.cli.HiveSQLException.newInstance(HiveSQLException.java:245)
> at 
> org.apache.hive.service.cli.HiveSQLException.toStackTrace(HiveSQLException.java:211)
> ... 21 more{noformat}
> From the above stack trace it's not visible what the cause is, but I think 
> it's the initialization of 
> {{org.apache.hive.common.util.TimestampParser#startingDateValue}}
>  



--
This message was sent by Atlassian JIRA

[jira] [Commented] (HIVE-18925) Hive doesn't work when JVM is America/Bahia_Banderas time zone

2018-03-26 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16413779#comment-16413779
 ] 

Rui Li commented on HIVE-18925:
---

Thanks [~findepi] for the update. Currently we don't have a way to prevent 
users from specifying patterns like {{MM:dd:ss}} as {{timestamp.formats}} in 
the SerDe property. That's why I think we should handle the case.

The change looks good to me. [~sershe], [~ashutoshc], would you mind taking a 
look at patch v2?

> Hive doesn't work when JVM is America/Bahia_Banderas time zone
> --
>
> Key: HIVE-18925
> URL: https://issues.apache.org/jira/browse/HIVE-18925
> Project: Hive
>  Issue Type: Bug
> Environment: JVM in America/Bahia_Banderas zone
>Reporter: Piotr Findeisen
>Assignee: Piotr Findeisen
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-18925-2.patch, HIVE-18925.patch
>
>
> HiveServer2 doesn't work if started with 
> {{-Duser.timezone=America/Bahia_Banderas}}
>  
> Steps to reproduce
>  # use [https://github.com/big-data-europe/docker-hive]
>  # Add {{HADOOP_CLIENT_OPTS: '-Duser.timezone=America/Bahia_Banderas'}} to 
> {{hive-server}} docker container environment configuration
>  # {{docker-compose up}}
>  # 
> {code:java}
> host# docker-compose exec hive-server bash
> container# /opt/hive/bin/beeline -u jdbc:hive2://localhost:1 
> --verbose=true
> ...
> jdbc:hive2://localhost:1> select 1;{code}
> The above fails and prints
> {noformat}
> Error: java.lang.IllegalStateException: Can't overwrite cause with 
> org.joda.time.IllegalInstantException: Illegal instant due to time zone 
> offset transition (daylight savings time 'gap'): 1970-01-01T00:00:00.000 
> (America/Bahia_Banderas) (state=08S01,code=0)
> java.sql.SQLException: java.lang.IllegalStateException: Can't overwrite cause 
> with org.joda.time.IllegalInstantException: Illegal instant due to time zone 
> offset transition (daylight savings time 'gap'): 1970-01-01T00:00:00.000 
> (America/Bahia_Banderas)
> at org.apache.hive.jdbc.HiveStatement.runAsyncOnServer(HiveStatement.java:323)
> at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:253)
> at org.apache.hive.beeline.Commands.executeInternal(Commands.java:997)
> at org.apache.hive.beeline.Commands.execute(Commands.java:1205)
> at org.apache.hive.beeline.Commands.sql(Commands.java:1134)
> at org.apache.hive.beeline.BeeLine.dispatch(BeeLine.java:1314)
> at org.apache.hive.beeline.BeeLine.execute(BeeLine.java:1178)
> at org.apache.hive.beeline.BeeLine.begin(BeeLine.java:1033)
> at org.apache.hive.beeline.BeeLine.mainWithInputRedirection(BeeLine.java:519)
> at org.apache.hive.beeline.BeeLine.main(BeeLine.java:501)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
> Caused by: java.lang.IllegalStateException: Can't overwrite cause with 
> org.joda.time.IllegalInstantException: Illegal instant due to time zone 
> offset transition (daylight savings time 'gap'): 1970-01-01T00:00:00.000 
> (America/Bahia_Banderas)
> at java.lang.Throwable.initCause(Throwable.java:457)
> at 
> org.apache.hive.service.cli.HiveSQLException.toStackTrace(HiveSQLException.java:237)
> at 
> org.apache.hive.service.cli.HiveSQLException.toStackTrace(HiveSQLException.java:237)
> at 
> org.apache.hive.service.cli.HiveSQLException.toCause(HiveSQLException.java:198)
> at 
> org.apache.hive.service.cli.HiveSQLException.<init>(HiveSQLException.java:108)
> at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:267)
> at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:253)
> at org.apache.hive.jdbc.HiveStatement.runAsyncOnServer(HiveStatement.java:313)
> ... 15 more
> Caused by: java.lang.ExceptionInInitializerError: null
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> at 
> org.apache.hive.service.cli.HiveSQLException.newInstance(HiveSQLException.java:245)
> at 
> org.apache.hive.service.cli.HiveSQLException.toStackTrace(HiveSQLException.java:211)
> ... 21 more{noformat}
> From the above stack trace it's not visible what the cause is, but I think 
> it's the initialization of 
> {{org.apache.hive.common.util.TimestampParser#startingDateValue}}

[jira] [Commented] (HIVE-18831) Differentiate errors that are thrown by Spark tasks

2018-03-26 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16413610#comment-16413610
 ] 

Rui Li commented on HIVE-18831:
---

Hey [~stakiar], I'm not sure about the benefit of this refactor. It seems you 
still need to serialize the Throwable as a string and parse the string to tell 
apart different kinds of errors. Is the improvement just to avoid the parsing 
logic of HIVE-15237?

> Differentiate errors that are thrown by Spark tasks
> ---
>
> Key: HIVE-18831
> URL: https://issues.apache.org/jira/browse/HIVE-18831
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
>Priority: Major
> Attachments: HIVE-18831.1.patch, HIVE-18831.2.patch, 
> HIVE-18831.3.patch, HIVE-18831.4.patch, HIVE-18831.6.patch, HIVE-18831.7.patch
>
>
> We propagate exceptions from Spark task failures to the client well, but we 
> don't differentiate between errors from HS2 / RSC vs. errors thrown by 
> individual tasks.
> The main motivation is that when the client sees a propagated Spark exception, it's 
> difficult to know what part of the execution threw the exception.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-18925) Hive doesn't work when JVM is America/Bahia_Banderas time zone

2018-03-23 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16411192#comment-16411192
 ] 

Rui Li commented on HIVE-18925:
---

I tried TestTimestampParser#testPattern2. We parsed {{05:06:07 (of pattern 
"MM:dd:ss")}} to {{1970-05-06 01:00:07.0}}. I don't think this is the expected 
result when a user uses TimestampParser. Is there a way to prevent such a use 
case?

> Hive doesn't work when JVM is America/Bahia_Banderas time zone
> --
>
> Key: HIVE-18925
> URL: https://issues.apache.org/jira/browse/HIVE-18925
> Project: Hive
>  Issue Type: Bug
> Environment: JVM in America/Bahia_Banderas zone
>Reporter: Piotr Findeisen
>Assignee: Piotr Findeisen
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-18925.patch
>
>
> HiveServer2 doesn't work if started with 
> {{-Duser.timezone=America/Bahia_Banderas}}
>  
> Steps to reproduce
>  # use [https://github.com/big-data-europe/docker-hive]
>  # Add {{HADOOP_CLIENT_OPTS: '-Duser.timezone=America/Bahia_Banderas'}} to 
> {{hive-server}} docker container environment configuration
>  # {{docker-compose up}}
>  # 
> {code:java}
> host# docker-compose exec hive-server bash
> container# /opt/hive/bin/beeline -u jdbc:hive2://localhost:1 
> --verbose=true
> ...
> jdbc:hive2://localhost:1> select 1;{code}
> The above fails and prints
> {noformat}
> Error: java.lang.IllegalStateException: Can't overwrite cause with 
> org.joda.time.IllegalInstantException: Illegal instant due to time zone 
> offset transition (daylight savings time 'gap'): 1970-01-01T00:00:00.000 
> (America/Bahia_Banderas) (state=08S01,code=0)
> java.sql.SQLException: java.lang.IllegalStateException: Can't overwrite cause 
> with org.joda.time.IllegalInstantException: Illegal instant due to time zone 
> offset transition (daylight savings time 'gap'): 1970-01-01T00:00:00.000 
> (America/Bahia_Banderas)
> at org.apache.hive.jdbc.HiveStatement.runAsyncOnServer(HiveStatement.java:323)
> at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:253)
> at org.apache.hive.beeline.Commands.executeInternal(Commands.java:997)
> at org.apache.hive.beeline.Commands.execute(Commands.java:1205)
> at org.apache.hive.beeline.Commands.sql(Commands.java:1134)
> at org.apache.hive.beeline.BeeLine.dispatch(BeeLine.java:1314)
> at org.apache.hive.beeline.BeeLine.execute(BeeLine.java:1178)
> at org.apache.hive.beeline.BeeLine.begin(BeeLine.java:1033)
> at org.apache.hive.beeline.BeeLine.mainWithInputRedirection(BeeLine.java:519)
> at org.apache.hive.beeline.BeeLine.main(BeeLine.java:501)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
> Caused by: java.lang.IllegalStateException: Can't overwrite cause with 
> org.joda.time.IllegalInstantException: Illegal instant due to time zone 
> offset transition (daylight savings time 'gap'): 1970-01-01T00:00:00.000 
> (America/Bahia_Banderas)
> at java.lang.Throwable.initCause(Throwable.java:457)
> at 
> org.apache.hive.service.cli.HiveSQLException.toStackTrace(HiveSQLException.java:237)
> at 
> org.apache.hive.service.cli.HiveSQLException.toStackTrace(HiveSQLException.java:237)
> at 
> org.apache.hive.service.cli.HiveSQLException.toCause(HiveSQLException.java:198)
> at 
> org.apache.hive.service.cli.HiveSQLException.<init>(HiveSQLException.java:108)
> at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:267)
> at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:253)
> at org.apache.hive.jdbc.HiveStatement.runAsyncOnServer(HiveStatement.java:313)
> ... 15 more
> Caused by: java.lang.ExceptionInInitializerError: null
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> at 
> org.apache.hive.service.cli.HiveSQLException.newInstance(HiveSQLException.java:245)
> at 
> org.apache.hive.service.cli.HiveSQLException.toStackTrace(HiveSQLException.java:211)
> ... 21 more{noformat}
> From the above stack trace it's not visible what the cause is, but I think 
> it's the initialization of 
> {{org.apache.hive.common.util.TimestampParser#startingDateValue}}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-18955) HoS: Unable to create Channel from class NioServerSocketChannel

2018-03-23 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16411122#comment-16411122
 ] 

Rui Li commented on HIVE-18955:
---

Hi [~bslim], we have these dependency chains to {{async-http-client}}:
{noformat}
com.metamx:java-util -> org.asynchttpclient:async-http-client
io.druid:druid-server -> com.metamx:java-util -> 
org.asynchttpclient:async-http-client
io.druid:druid-processing -> com.metamx:java-util -> 
org.asynchttpclient:async-http-client
{noformat}

{{com.metamx:java-util}}, {{io.druid:druid-server}} and 
{{io.druid:druid-processing}} are already included in the shaded jar. By making 
them optional, we won't package these jars (and therefore not the 
{{async-http-client}} jar either) into the lib folder. I think this is OK 
because otherwise we'd have both the original and the relocated classes on our 
classpath. Do you think that's the right way to go?

Besides, this solution doesn't work if shading is skipped. Any thoughts on how 
we should handle that?

> HoS: Unable to create Channel from class NioServerSocketChannel
> ---
>
> Key: HIVE-18955
> URL: https://issues.apache.org/jira/browse/HIVE-18955
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Rui Li
>Assignee: Rui Li
>Priority: Blocker
> Attachments: HIVE-18955.1.patch
>
>
> Hit the issue when trying to launch a Spark job. Stack trace:
> {noformat}
> Caused by: java.lang.NoSuchMethodError: 
> io.netty.channel.DefaultChannelId.newInstance()Lio/netty/channel/DefaultChannelId;
> at io.netty.channel.AbstractChannel.newId(AbstractChannel.java:111) 
> ~[netty-all-4.1.17.Final.jar:4.1.17.Final]
> at io.netty.channel.AbstractChannel.<init>(AbstractChannel.java:83) 
> ~[netty-all-4.1.17.Final.jar:4.1.17.Final]
> at 
> io.netty.channel.nio.AbstractNioChannel.<init>(AbstractNioChannel.java:84) 
> ~[netty-all-4.1.17.Final.jar:4.1.17.Final]
> at 
> io.netty.channel.nio.AbstractNioMessageChannel.<init>(AbstractNioMessageChannel.java:42)
>  ~[netty-all-4.1.17.Final.jar:4.1.17.Final]
> at 
> io.netty.channel.socket.nio.NioServerSocketChannel.<init>(NioServerSocketChannel.java:86)
>  ~[netty-all-4.1.17.Final.jar:4.1.17.Final]
> at 
> io.netty.channel.socket.nio.NioServerSocketChannel.<init>(NioServerSocketChannel.java:72)
>  ~[netty-all-4.1.17.Final.jar:4.1.17.Final]
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method) ~[?:1.8.0_151]
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>  ~[?:1.8.0_151]
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>  ~[?:1.8.0_151]
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423) 
> ~[?:1.8.0_151]
> at 
> io.netty.channel.ReflectiveChannelFactory.newChannel(ReflectiveChannelFactory.java:38)
>  ~[netty-all-4.1.17.Final.jar:4.1.17.Final]
> ... 32 more
> {noformat}
> It seems we have conflicting versions of the class 
> {{io.netty.channel.DefaultChannelId}} from async-http-client.jar and 
> netty-all.jar.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-18955) HoS: Unable to create Channel from class NioServerSocketChannel

2018-03-22 Thread Rui Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-18955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rui Li updated HIVE-18955:
--
Status: Patch Available  (was: Open)

> HoS: Unable to create Channel from class NioServerSocketChannel
> ---
>
> Key: HIVE-18955
> URL: https://issues.apache.org/jira/browse/HIVE-18955
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Rui Li
>Assignee: Rui Li
>Priority: Blocker
> Attachments: HIVE-18955.1.patch
>
>
> Hit the issue when trying to launch a Spark job. Stack trace:
> {noformat}
> Caused by: java.lang.NoSuchMethodError: 
> io.netty.channel.DefaultChannelId.newInstance()Lio/netty/channel/DefaultChannelId;
> at io.netty.channel.AbstractChannel.newId(AbstractChannel.java:111) 
> ~[netty-all-4.1.17.Final.jar:4.1.17.Final]
> at io.netty.channel.AbstractChannel.<init>(AbstractChannel.java:83) 
> ~[netty-all-4.1.17.Final.jar:4.1.17.Final]
> at 
> io.netty.channel.nio.AbstractNioChannel.<init>(AbstractNioChannel.java:84) 
> ~[netty-all-4.1.17.Final.jar:4.1.17.Final]
> at 
> io.netty.channel.nio.AbstractNioMessageChannel.<init>(AbstractNioMessageChannel.java:42)
>  ~[netty-all-4.1.17.Final.jar:4.1.17.Final]
> at 
> io.netty.channel.socket.nio.NioServerSocketChannel.<init>(NioServerSocketChannel.java:86)
>  ~[netty-all-4.1.17.Final.jar:4.1.17.Final]
> at 
> io.netty.channel.socket.nio.NioServerSocketChannel.<init>(NioServerSocketChannel.java:72)
>  ~[netty-all-4.1.17.Final.jar:4.1.17.Final]
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method) ~[?:1.8.0_151]
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>  ~[?:1.8.0_151]
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>  ~[?:1.8.0_151]
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423) 
> ~[?:1.8.0_151]
> at 
> io.netty.channel.ReflectiveChannelFactory.newChannel(ReflectiveChannelFactory.java:38)
>  ~[netty-all-4.1.17.Final.jar:4.1.17.Final]
> ... 32 more
> {noformat}
> It seems we have conflicting versions of the class 
> {{io.netty.channel.DefaultChannelId}} from async-http-client.jar and 
> netty-all.jar.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-18955) HoS: Unable to create Channel from class NioServerSocketChannel

2018-03-22 Thread Rui Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-18955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rui Li updated HIVE-18955:
--
Attachment: HIVE-18955.1.patch

> HoS: Unable to create Channel from class NioServerSocketChannel
> ---
>
> Key: HIVE-18955
> URL: https://issues.apache.org/jira/browse/HIVE-18955
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Rui Li
>Assignee: Rui Li
>Priority: Blocker
> Attachments: HIVE-18955.1.patch
>
>
> Hit the issue when trying to launch a Spark job. Stack trace:
> {noformat}
> Caused by: java.lang.NoSuchMethodError: 
> io.netty.channel.DefaultChannelId.newInstance()Lio/netty/channel/DefaultChannelId;
> at io.netty.channel.AbstractChannel.newId(AbstractChannel.java:111) 
> ~[netty-all-4.1.17.Final.jar:4.1.17.Final]
> at io.netty.channel.AbstractChannel.<init>(AbstractChannel.java:83) 
> ~[netty-all-4.1.17.Final.jar:4.1.17.Final]
> at 
> io.netty.channel.nio.AbstractNioChannel.<init>(AbstractNioChannel.java:84) 
> ~[netty-all-4.1.17.Final.jar:4.1.17.Final]
> at 
> io.netty.channel.nio.AbstractNioMessageChannel.<init>(AbstractNioMessageChannel.java:42)
>  ~[netty-all-4.1.17.Final.jar:4.1.17.Final]
> at 
> io.netty.channel.socket.nio.NioServerSocketChannel.<init>(NioServerSocketChannel.java:86)
>  ~[netty-all-4.1.17.Final.jar:4.1.17.Final]
> at 
> io.netty.channel.socket.nio.NioServerSocketChannel.<init>(NioServerSocketChannel.java:72)
>  ~[netty-all-4.1.17.Final.jar:4.1.17.Final]
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method) ~[?:1.8.0_151]
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>  ~[?:1.8.0_151]
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>  ~[?:1.8.0_151]
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423) 
> ~[?:1.8.0_151]
> at 
> io.netty.channel.ReflectiveChannelFactory.newChannel(ReflectiveChannelFactory.java:38)
>  ~[netty-all-4.1.17.Final.jar:4.1.17.Final]
> ... 32 more
> {noformat}
> It seems we have conflicting versions of the class 
> {{io.netty.channel.DefaultChannelId}} from async-http-client.jar and 
> netty-all.jar.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (HIVE-18955) HoS: Unable to create Channel from class NioServerSocketChannel

2018-03-22 Thread Rui Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-18955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rui Li reassigned HIVE-18955:
-

Assignee: Rui Li

> HoS: Unable to create Channel from class NioServerSocketChannel
> ---
>
> Key: HIVE-18955
> URL: https://issues.apache.org/jira/browse/HIVE-18955
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Rui Li
>Assignee: Rui Li
>Priority: Blocker
>
> Hit the issue when trying to launch a Spark job. Stack trace:
> {noformat}
> Caused by: java.lang.NoSuchMethodError: 
> io.netty.channel.DefaultChannelId.newInstance()Lio/netty/channel/DefaultChannelId;
> at io.netty.channel.AbstractChannel.newId(AbstractChannel.java:111) 
> ~[netty-all-4.1.17.Final.jar:4.1.17.Final]
> at io.netty.channel.AbstractChannel.<init>(AbstractChannel.java:83) 
> ~[netty-all-4.1.17.Final.jar:4.1.17.Final]
> at 
> io.netty.channel.nio.AbstractNioChannel.<init>(AbstractNioChannel.java:84) 
> ~[netty-all-4.1.17.Final.jar:4.1.17.Final]
> at 
> io.netty.channel.nio.AbstractNioMessageChannel.<init>(AbstractNioMessageChannel.java:42)
>  ~[netty-all-4.1.17.Final.jar:4.1.17.Final]
> at 
> io.netty.channel.socket.nio.NioServerSocketChannel.<init>(NioServerSocketChannel.java:86)
>  ~[netty-all-4.1.17.Final.jar:4.1.17.Final]
> at 
> io.netty.channel.socket.nio.NioServerSocketChannel.<init>(NioServerSocketChannel.java:72)
>  ~[netty-all-4.1.17.Final.jar:4.1.17.Final]
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method) ~[?:1.8.0_151]
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>  ~[?:1.8.0_151]
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>  ~[?:1.8.0_151]
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423) 
> ~[?:1.8.0_151]
> at 
> io.netty.channel.ReflectiveChannelFactory.newChannel(ReflectiveChannelFactory.java:38)
>  ~[netty-all-4.1.17.Final.jar:4.1.17.Final]
> ... 32 more
> {noformat}
> It seems we have conflicting versions of the class 
> {{io.netty.channel.DefaultChannelId}} from async-http-client.jar and 
> netty-all.jar.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-18955) HoS: Unable to create Channel from class NioServerSocketChannel

2018-03-21 Thread Rui Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-18955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rui Li updated HIVE-18955:
--
Priority: Blocker  (was: Major)

> HoS: Unable to create Channel from class NioServerSocketChannel
> ---
>
> Key: HIVE-18955
> URL: https://issues.apache.org/jira/browse/HIVE-18955
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Rui Li
>Priority: Blocker
>
> Hit the issue when trying to launch a Spark job. Stack trace:
> {noformat}
> Caused by: java.lang.NoSuchMethodError: 
> io.netty.channel.DefaultChannelId.newInstance()Lio/netty/channel/DefaultChannelId;
> at io.netty.channel.AbstractChannel.newId(AbstractChannel.java:111) 
> ~[netty-all-4.1.17.Final.jar:4.1.17.Final]
> at io.netty.channel.AbstractChannel.<init>(AbstractChannel.java:83) 
> ~[netty-all-4.1.17.Final.jar:4.1.17.Final]
> at 
> io.netty.channel.nio.AbstractNioChannel.<init>(AbstractNioChannel.java:84) 
> ~[netty-all-4.1.17.Final.jar:4.1.17.Final]
> at 
> io.netty.channel.nio.AbstractNioMessageChannel.<init>(AbstractNioMessageChannel.java:42)
>  ~[netty-all-4.1.17.Final.jar:4.1.17.Final]
> at 
> io.netty.channel.socket.nio.NioServerSocketChannel.<init>(NioServerSocketChannel.java:86)
>  ~[netty-all-4.1.17.Final.jar:4.1.17.Final]
> at 
> io.netty.channel.socket.nio.NioServerSocketChannel.<init>(NioServerSocketChannel.java:72)
>  ~[netty-all-4.1.17.Final.jar:4.1.17.Final]
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method) ~[?:1.8.0_151]
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>  ~[?:1.8.0_151]
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>  ~[?:1.8.0_151]
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423) 
> ~[?:1.8.0_151]
> at 
> io.netty.channel.ReflectiveChannelFactory.newChannel(ReflectiveChannelFactory.java:38)
>  ~[netty-all-4.1.17.Final.jar:4.1.17.Final]
> ... 32 more
> {noformat}
> It seems we have conflicting versions of the class 
> {{io.netty.channel.DefaultChannelId}} from async-http-client.jar and 
> netty-all.jar.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-18974) Wrong UTC time while converting from CST by to_utc_timestamp UDFs

2018-03-19 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16405762#comment-16405762
 ] 

Rui Li commented on HIVE-18974:
---

Could be the same issue as HIVE-14305. Does the machine's system timezone use 
DST?

> Wrong UTC time while converting from CST by to_utc_timestamp UDFs 
> --
>
> Key: HIVE-18974
> URL: https://issues.apache.org/jira/browse/HIVE-18974
> Project: Hive
>  Issue Type: Bug
>  Components: CLI
>Affects Versions: 1.2.1
>Reporter: swayam
>Assignee: Hive QA
>Priority: Critical
>
> {color:#FF}Error on Daylight Saving 2017{color}
>  
> select to_utc_timestamp("2017-03-11 19:00:00",'CST');
> OK
> 2017-03-12 01:00:00 --> expected 6 hr difference 
> Time taken: 0.08 seconds, Fetched: 1 row(s)
> hive> select to_utc_timestamp("2017-03-11 20:00:00",'CST');
> OK
> 2017-03-12 03:00:00 --> wrong 7 hr difference 
> Time taken: 0.088 seconds, Fetched: 1 row(s)
> hive> select to_utc_timestamp("2017-03-11 21:00:00",'CST');
> OK
> 2017-03-12 04:00:00--> wrong 7 hr difference 
> Time taken: 2.884 seconds, Fetched: 1 row(s)
> hive> select to_utc_timestamp("2017-03-11 22:00:00",'CST');
> OK
> 2017-03-12 05:00:00--> wrong 7 hr difference 
> Time taken: 0.075 seconds, Fetched: 1 row(s)
> hive> select to_utc_timestamp("2017-03-11 23:00:00",'CST');
> OK
> 2017-03-12 06:00:00 --> wrong 7 hr difference 
> Time taken: 0.068 seconds, Fetched: 1 row(s)
> hive> select to_utc_timestamp("2017-03-12 00:00:00",'CST');
> OK
> 2017-03-12 07:00:00 --> wrong 7 hr difference 
> Time taken: 4.769 seconds, Fetched: 1 row(s)
> hive> select to_utc_timestamp("2017-03-12 01:00:00",'CST');
> OK
> 2017-03-12 08:00:00 --> wrong 7 hr difference 
> Time taken: 0.066 seconds, Fetched: 1 row(s)
> hive> select to_utc_timestamp("2017-03-12 02:00:00",'CST');
> OK
> 2017-03-12 08:00:00 --> wrong 7 hr difference 
> Time taken: 0.066 seconds, Fetched: 1 row(s)
> hive> select to_utc_timestamp("2017-03-12 03:00:00",'CST');
> OK
> 2017-03-12 08:00:00 --> expected 5 hr 
> Time taken: 0.061 seconds, Fetched: 1 row(s)
> hive> select to_utc_timestamp("2017-03-12 04:00:00",'CST');
> OK
> 2017-03-12 09:00:00--> expected 5 hr 
> Time taken: 0.065 seconds, Fetched: 1 row(s)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-18955) HoS: Unable to create Channel from class NioServerSocketChannel

2018-03-15 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16401364#comment-16401364
 ] 

Rui Li commented on HIVE-18955:
---

Hi [~bslim], both {{netty-all-4.1.17.Final.jar}} and 
{{async-http-client-2.0.37.jar}} contain the class 
{{io.netty.channel.DefaultChannelId}}. The two copies have different 
definitions of the {{newInstance()}} method.
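
As a quick way to confirm which jar actually provides the class at runtime (a 
generic diagnostic, not part of the patch):
{code:java}
// Prints the jar that DefaultChannelId was loaded from, e.g.
// netty-all-4.1.17.Final.jar or async-http-client-2.0.37.jar.
System.out.println(io.netty.channel.DefaultChannelId.class
    .getProtectionDomain().getCodeSource().getLocation());
{code}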

> HoS: Unable to create Channel from class NioServerSocketChannel
> ---
>
> Key: HIVE-18955
> URL: https://issues.apache.org/jira/browse/HIVE-18955
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Rui Li
>Priority: Major
>
> Hit the issue when trying to launch a Spark job. Stack trace:
> {noformat}
> Caused by: java.lang.NoSuchMethodError: 
> io.netty.channel.DefaultChannelId.newInstance()Lio/netty/channel/DefaultChannelId;
> at io.netty.channel.AbstractChannel.newId(AbstractChannel.java:111) 
> ~[netty-all-4.1.17.Final.jar:4.1.17.Final]
> at io.netty.channel.AbstractChannel.<init>(AbstractChannel.java:83) 
> ~[netty-all-4.1.17.Final.jar:4.1.17.Final]
> at 
> io.netty.channel.nio.AbstractNioChannel.<init>(AbstractNioChannel.java:84) 
> ~[netty-all-4.1.17.Final.jar:4.1.17.Final]
> at 
> io.netty.channel.nio.AbstractNioMessageChannel.<init>(AbstractNioMessageChannel.java:42)
>  ~[netty-all-4.1.17.Final.jar:4.1.17.Final]
> at 
> io.netty.channel.socket.nio.NioServerSocketChannel.<init>(NioServerSocketChannel.java:86)
>  ~[netty-all-4.1.17.Final.jar:4.1.17.Final]
> at 
> io.netty.channel.socket.nio.NioServerSocketChannel.<init>(NioServerSocketChannel.java:72)
>  ~[netty-all-4.1.17.Final.jar:4.1.17.Final]
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method) ~[?:1.8.0_151]
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>  ~[?:1.8.0_151]
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>  ~[?:1.8.0_151]
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423) 
> ~[?:1.8.0_151]
> at 
> io.netty.channel.ReflectiveChannelFactory.newChannel(ReflectiveChannelFactory.java:38)
>  ~[netty-all-4.1.17.Final.jar:4.1.17.Final]
> ... 32 more
> {noformat}
> It seems we have conflicting versions of the class 
> {{io.netty.channel.DefaultChannelId}} from async-http-client.jar and 
> netty-all.jar.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-18955) HoS: Unable to create Channel from class NioServerSocketChannel

2018-03-15 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16400408#comment-16400408
 ] 

Rui Li commented on HIVE-18955:
---

Found a similar issue here: [https://github.com/impossibl/pgjdbc-ng/issues/332]. 
I tried removing {{async-http-client-2.0.37.jar}} from lib, and that fixes the 
issue.

[~jcamachorodriguez], do you think we can add async-http-client to the shaded 
jar of hive-druid-handler and not include it in the lib folder? Any suggestions 
are appreciated.

> HoS: Unable to create Channel from class NioServerSocketChannel
> ---
>
> Key: HIVE-18955
> URL: https://issues.apache.org/jira/browse/HIVE-18955
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Rui Li
>Priority: Major
>
> Hit the issue when trying to launch a Spark job. Stack trace:
> {noformat}
> Caused by: java.lang.NoSuchMethodError: 
> io.netty.channel.DefaultChannelId.newInstance()Lio/netty/channel/DefaultChannelId;
> at io.netty.channel.AbstractChannel.newId(AbstractChannel.java:111) 
> ~[netty-all-4.1.17.Final.jar:4.1.17.Final]
> at io.netty.channel.AbstractChannel.<init>(AbstractChannel.java:83) 
> ~[netty-all-4.1.17.Final.jar:4.1.17.Final]
> at 
> io.netty.channel.nio.AbstractNioChannel.<init>(AbstractNioChannel.java:84) 
> ~[netty-all-4.1.17.Final.jar:4.1.17.Final]
> at 
> io.netty.channel.nio.AbstractNioMessageChannel.<init>(AbstractNioMessageChannel.java:42)
>  ~[netty-all-4.1.17.Final.jar:4.1.17.Final]
> at 
> io.netty.channel.socket.nio.NioServerSocketChannel.<init>(NioServerSocketChannel.java:86)
>  ~[netty-all-4.1.17.Final.jar:4.1.17.Final]
> at 
> io.netty.channel.socket.nio.NioServerSocketChannel.<init>(NioServerSocketChannel.java:72)
>  ~[netty-all-4.1.17.Final.jar:4.1.17.Final]
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method) ~[?:1.8.0_151]
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>  ~[?:1.8.0_151]
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>  ~[?:1.8.0_151]
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423) 
> ~[?:1.8.0_151]
> at 
> io.netty.channel.ReflectiveChannelFactory.newChannel(ReflectiveChannelFactory.java:38)
>  ~[netty-all-4.1.17.Final.jar:4.1.17.Final]
> ... 32 more
> {noformat}
> It seems we have conflicting versions of the class 
> {{io.netty.channel.DefaultChannelId}} from async-http-client.jar and 
> netty-all.jar.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-18955) HoS: Unable to create Channel from class NioServerSocketChannel

2018-03-14 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16398614#comment-16398614
 ] 

Rui Li commented on HIVE-18955:
---

Maybe related to HIVE-18436.

> HoS: Unable to create Channel from class NioServerSocketChannel
> ---
>
> Key: HIVE-18955
> URL: https://issues.apache.org/jira/browse/HIVE-18955
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Rui Li
>Priority: Major
>
> Hit the issue when trying to launch a Spark job. Stack trace:
> {noformat}
> Caused by: java.lang.NoSuchMethodError: 
> io.netty.channel.DefaultChannelId.newInstance()Lio/netty/channel/DefaultChannelId;
> at io.netty.channel.AbstractChannel.newId(AbstractChannel.java:111) 
> ~[netty-all-4.1.17.Final.jar:4.1.17.Final]
> at io.netty.channel.AbstractChannel.<init>(AbstractChannel.java:83) 
> ~[netty-all-4.1.17.Final.jar:4.1.17.Final]
> at 
> io.netty.channel.nio.AbstractNioChannel.<init>(AbstractNioChannel.java:84) 
> ~[netty-all-4.1.17.Final.jar:4.1.17.Final]
> at 
> io.netty.channel.nio.AbstractNioMessageChannel.<init>(AbstractNioMessageChannel.java:42)
>  ~[netty-all-4.1.17.Final.jar:4.1.17.Final]
> at 
> io.netty.channel.socket.nio.NioServerSocketChannel.<init>(NioServerSocketChannel.java:86)
>  ~[netty-all-4.1.17.Final.jar:4.1.17.Final]
> at 
> io.netty.channel.socket.nio.NioServerSocketChannel.<init>(NioServerSocketChannel.java:72)
>  ~[netty-all-4.1.17.Final.jar:4.1.17.Final]
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method) ~[?:1.8.0_151]
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>  ~[?:1.8.0_151]
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>  ~[?:1.8.0_151]
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423) 
> ~[?:1.8.0_151]
> at 
> io.netty.channel.ReflectiveChannelFactory.newChannel(ReflectiveChannelFactory.java:38)
>  ~[netty-all-4.1.17.Final.jar:4.1.17.Final]
> ... 32 more
> {noformat}
> It seems we have conflicting versions of the class 
> {{io.netty.channel.DefaultChannelId}} from async-http-client.jar and 
> netty-all.jar.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-18925) Hive doesn't work when JVM is America/Bahia_Banderas time zone

2018-03-14 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16398313#comment-16398313
 ] 

Rui Li commented on HIVE-18925:
---

Hi [~findepi], the patch looks good to me. You need to press "Submit Patch" to 
trigger the tests.

> Hive doesn't work when JVM is America/Bahia_Banderas time zone
> --
>
> Key: HIVE-18925
> URL: https://issues.apache.org/jira/browse/HIVE-18925
> Project: Hive
>  Issue Type: Bug
> Environment: JVM in America/Bahia_Banderas zone
>Reporter: Piotr Findeisen
>Assignee: Piotr Findeisen
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-18925.patch
>
>
> Hive Server2 doesn't work if started with 
> {{-Duser.timezone=America/Bahia_Banderas}}
>  
> Steps to reproduce
>  # use [https://github.com/big-data-europe/docker-hive]
>  # Add {{HADOOP_CLIENT_OPTS: '-Duser.timezone=America/Bahia_Banderas'}} to 
> {{hive-server}} docker container environment configuration
>  # {{docker-compose up}}
>  # 
> {code:java}
> host# docker-compose exec hive-server bash
> container# /opt/hive/bin/beeline -u jdbc:hive2://localhost:1 
> --verbose=true
> ...
> jdbc:hive2://localhost:1> select 1;{code}
> The above fails and prints
> {noformat}
> Error: java.lang.IllegalStateException: Can't overwrite cause with 
> org.joda.time.IllegalInstantException: Illegal instant due to time zone 
> offset transition (daylight savings time 'gap'): 1970-01-01T00:00:00.000 
> (America/Bahia_Banderas) (state=08S01,code=0)
> java.sql.SQLException: java.lang.IllegalStateException: Can't overwrite cause 
> with org.joda.time.IllegalInstantException: Illegal instant due to time zone 
> offset transition (daylight savings time 'gap'): 1970-01-01T00:00:00.000 
> (America/Bahia_Banderas)
> at org.apache.hive.jdbc.HiveStatement.runAsyncOnServer(HiveStatement.java:323)
> at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:253)
> at org.apache.hive.beeline.Commands.executeInternal(Commands.java:997)
> at org.apache.hive.beeline.Commands.execute(Commands.java:1205)
> at org.apache.hive.beeline.Commands.sql(Commands.java:1134)
> at org.apache.hive.beeline.BeeLine.dispatch(BeeLine.java:1314)
> at org.apache.hive.beeline.BeeLine.execute(BeeLine.java:1178)
> at org.apache.hive.beeline.BeeLine.begin(BeeLine.java:1033)
> at org.apache.hive.beeline.BeeLine.mainWithInputRedirection(BeeLine.java:519)
> at org.apache.hive.beeline.BeeLine.main(BeeLine.java:501)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
> Caused by: java.lang.IllegalStateException: Can't overwrite cause with 
> org.joda.time.IllegalInstantException: Illegal instant due to time zone 
> offset transition (daylight savings time 'gap'): 1970-01-01T00:00:00.000 
> (America/Bahia_Banderas)
> at java.lang.Throwable.initCause(Throwable.java:457)
> at 
> org.apache.hive.service.cli.HiveSQLException.toStackTrace(HiveSQLException.java:237)
> at 
> org.apache.hive.service.cli.HiveSQLException.toStackTrace(HiveSQLException.java:237)
> at 
> org.apache.hive.service.cli.HiveSQLException.toCause(HiveSQLException.java:198)
> at 
> org.apache.hive.service.cli.HiveSQLException.<init>(HiveSQLException.java:108)
> at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:267)
> at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:253)
> at org.apache.hive.jdbc.HiveStatement.runAsyncOnServer(HiveStatement.java:313)
> ... 15 more
> Caused by: java.lang.ExceptionInInitializerError: null
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> at 
> org.apache.hive.service.cli.HiveSQLException.newInstance(HiveSQLException.java:245)
> at 
> org.apache.hive.service.cli.HiveSQLException.toStackTrace(HiveSQLException.java:211)
> ... 21 more{noformat}
> From the above stacktrace it's not visible what the cause is, but I think 
> it's the initialization of 
> {{org.apache.hive.common.util.TimestampParser#startingDateValue}}
>  
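
For what it's worth, the failure can be reduced to a joda-time one-liner (a minimal 
reproduction sketch based on the error message above; the assumption is that 
constructing the epoch in the JVM's default zone, as the 
{{TimestampParser#startingDateValue}} initialization appears to do, is the trigger):

{code:java}
import org.joda.time.DateTime;
import org.joda.time.DateTimeZone;

public class BahiaBanderasRepro {
  public static void main(String[] args) {
    // 1970-01-01T00:00:00 local time falls into a DST 'gap' in this zone,
    // so joda-time throws IllegalInstantException when asked to construct it.
    DateTimeZone zone = DateTimeZone.forID("America/Bahia_Banderas");
    DateTime epoch = new DateTime(1970, 1, 1, 0, 0, 0, 0, zone);
    System.out.println(epoch);
  }
}
{code}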



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-18034) Improving logging with HoS executors spend lots of time in GC

2018-03-13 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16396827#comment-16396827
 ] 

Rui Li commented on HIVE-18034:
---

+1

> Improving logging with HoS executors spend lots of time in GC
> -
>
> Key: HIVE-18034
> URL: https://issues.apache.org/jira/browse/HIVE-18034
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
>Priority: Major
> Attachments: HIVE-18034.1.patch, HIVE-18034.2.patch, 
> HIVE-18034.3.patch, HIVE-18034.4.patch, HIVE-18034.6.patch, HIVE-18034.7.patch
>
>
> There are times when Spark will spend lots of time doing GC. The Spark 
> History UI shows a bunch of red flags when too much time is spent in GC. It 
> would be nice if those warnings are propagated to Hive.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-18916) SparkClientImpl doesn't error out if spark-submit fails

2018-03-12 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16395269#comment-16395269
 ] 

Rui Li commented on HIVE-18916:
---

[~stakiar], is this issue easy to reproduce? When the thread monitoring 
{{spark-submit}} finds it returned a non-zero exit code, the thread calls 
{{rpcServer.cancelClient}}, which should ideally cancel the wait for the client 
to connect.
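
To make the discussion concrete, the monitoring path is roughly the following (a 
simplified sketch, not the actual {{SparkClientImpl}} code; {{cancelPending}} stands 
in for {{rpcServer.cancelClient}}, and the {{false}} command stands in for a failing 
spark-submit):

{code:java}
import java.io.IOException;

public class SubmitMonitorSketch {
  // Stand-in for rpcServer.cancelClient -- an assumption for illustration only.
  static void cancelPending(Throwable cause) {
    System.err.println("Cancelling pending client: " + cause);
  }

  public static void main(String[] args) throws IOException {
    // "false" exits with a non-zero code, standing in for a failed spark-submit.
    Process submit = new ProcessBuilder("false").start();
    Thread monitor = new Thread(() -> {
      try {
        int exitCode = submit.waitFor();
        if (exitCode != 0) {
          // Fail fast instead of letting the client-connection timeout fire later.
          cancelPending(new RuntimeException("spark-submit exited with code " + exitCode));
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    monitor.setDaemon(true);
    monitor.start();
  }
}
{code}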

> SparkClientImpl doesn't error out if spark-submit fails
> ---
>
> Key: HIVE-18916
> URL: https://issues.apache.org/jira/browse/HIVE-18916
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Sahil Takiar
>Priority: Major
>
> If {{spark-submit}} returns a non-zero exit code, {{SparkClientImpl}} will 
> simply log the exit code, but won't throw an error. Eventually, the 
> connection timeout will get triggered and an exception like {{Timed out 
> waiting for client connection}} will be logged, which is pretty misleading.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-17178) Spark Partition Pruning Sink Operator can't target multiple Works

2018-03-11 Thread Rui Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rui Li updated HIVE-17178:
--
   Resolution: Fixed
Fix Version/s: 3.0.0
   Status: Resolved  (was: Patch Available)

Pushed to master. Thanks [~stakiar] for the review.

> Spark Partition Pruning Sink Operator can't target multiple Works
> -
>
> Key: HIVE-17178
> URL: https://issues.apache.org/jira/browse/HIVE-17178
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: Rui Li
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: HIVE-17178.1.patch, HIVE-17178.2.patch, 
> HIVE-17178.3.patch, HIVE-17178.4.patch, HIVE-17178.5.patch, HIVE-17178.6.patch
>
>
> A Spark Partition Pruning Sink Operator cannot be used to target multiple Map 
> Work objects. The entire DPP subtree (SEL-GBY-SPARKPRUNINGSINK) is duplicated 
> if a single table needs to be used to target multiple Map Works.
> The following query shows the issue:
> {code}
> set hive.spark.dynamic.partition.pruning=true;
> set hive.auto.convert.join=true;
> create table part_table_1 (col int) partitioned by (part_col int);
> create table part_table_2 (col int) partitioned by (part_col int);
> create table regular_table (col int);
> insert into table regular_table values (1);
> alter table part_table_1 add partition (part_col=1);
> insert into table part_table_1 partition (part_col=1) values (1), (2), (3), 
> (4);
> alter table part_table_1 add partition (part_col=2);
> insert into table part_table_1 partition (part_col=2) values (1), (2), (3), 
> (4);
> alter table part_table_2 add partition (part_col=1);
> insert into table part_table_2 partition (part_col=1) values (1), (2), (3), 
> (4);
> alter table part_table_2 add partition (part_col=2);
> insert into table part_table_2 partition (part_col=2) values (1), (2), (3), 
> (4);
> explain select * from regular_table, part_table_1, part_table_2 where 
> regular_table.col = part_table_1.part_col and regular_table.col = 
> part_table_2.part_col;
> {code}
> The explain plan is
> {code}
> STAGE DEPENDENCIES:
>   Stage-2 is a root stage
>   Stage-1 depends on stages: Stage-2
>   Stage-0 depends on stages: Stage-1
> STAGE PLANS:
>   Stage: Stage-2
> Spark
>  A masked pattern was here 
>   Vertices:
> Map 1 
> Map Operator Tree:
> TableScan
>   alias: regular_table
>   Statistics: Num rows: 1 Data size: 1 Basic stats: COMPLETE 
> Column stats: NONE
>   Filter Operator
> predicate: col is not null (type: boolean)
> Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
> Select Operator
>   expressions: col (type: int)
>   outputColumnNames: _col0
>   Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
>   Spark HashTable Sink Operator
> keys:
>   0 _col0 (type: int)
>   1 _col1 (type: int)
>   2 _col1 (type: int)
>   Select Operator
> expressions: _col0 (type: int)
> outputColumnNames: _col0
> Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
> Group By Operator
>   keys: _col0 (type: int)
>   mode: hash
>   outputColumnNames: _col0
>   Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
>   Spark Partition Pruning Sink Operator
> partition key expr: part_col
> Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
> target column name: part_col
> target work: Map 2
>   Select Operator
> expressions: _col0 (type: int)
> outputColumnNames: _col0
> Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
> Group By Operator
>   keys: _col0 (type: int)
>   mode: hash
>   outputColumnNames: _col0
>   Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
>   Spark Partition Pruning Sink Operator
> partition key expr: 

[jira] [Commented] (HIVE-18436) Upgrade to Spark 2.3.0

2018-03-08 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16392316#comment-16392316
 ] 

Rui Li commented on HIVE-18436:
---

Got it. Thanks for explaining. +1

> Upgrade to Spark 2.3.0
> --
>
> Key: HIVE-18436
> URL: https://issues.apache.org/jira/browse/HIVE-18436
> Project: Hive
>  Issue Type: Task
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
>Priority: Major
> Attachments: HIVE-18436.1.patch, HIVE-18436.2.patch, 
> HIVE-18436.3.patch
>
>
> Branching has been completed. Release candidates should be published soon. 
> Might be a while before the actual release, but at least we get to identify 
> any issues early.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-18436) Upgrade to Spark 2.3.0

2018-03-08 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16391072#comment-16391072
 ] 

Rui Li commented on HIVE-18436:
---

Thanks [~stakiar] for tracking this. Why do we need the changes in tests?

> Upgrade to Spark 2.3.0
> --
>
> Key: HIVE-18436
> URL: https://issues.apache.org/jira/browse/HIVE-18436
> Project: Hive
>  Issue Type: Task
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
>Priority: Major
> Attachments: HIVE-18436.1.patch, HIVE-18436.2.patch, 
> HIVE-18436.3.patch
>
>
> Branching has been completed. Release candidates should be published soon. 
> Might be a while before the actual release, but at least we get to identify 
> any issues early.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-17178) Spark Partition Pruning Sink Operator can't target multiple Works

2018-03-07 Thread Rui Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rui Li updated HIVE-17178:
--
Attachment: HIVE-17178.6.patch

> Spark Partition Pruning Sink Operator can't target multiple Works
> -
>
> Key: HIVE-17178
> URL: https://issues.apache.org/jira/browse/HIVE-17178
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: Rui Li
>Priority: Major
> Attachments: HIVE-17178.1.patch, HIVE-17178.2.patch, 
> HIVE-17178.3.patch, HIVE-17178.4.patch, HIVE-17178.5.patch, HIVE-17178.6.patch
>
>
> A Spark Partition Pruning Sink Operator cannot be used to target multiple Map 
> Work objects. The entire DPP subtree (SEL-GBY-SPARKPRUNINGSINK) is duplicated 
> if a single table needs to be used to target multiple Map Works.
> The following query shows the issue:
> {code}
> set hive.spark.dynamic.partition.pruning=true;
> set hive.auto.convert.join=true;
> create table part_table_1 (col int) partitioned by (part_col int);
> create table part_table_2 (col int) partitioned by (part_col int);
> create table regular_table (col int);
> insert into table regular_table values (1);
> alter table part_table_1 add partition (part_col=1);
> insert into table part_table_1 partition (part_col=1) values (1), (2), (3), 
> (4);
> alter table part_table_1 add partition (part_col=2);
> insert into table part_table_1 partition (part_col=2) values (1), (2), (3), 
> (4);
> alter table part_table_2 add partition (part_col=1);
> insert into table part_table_2 partition (part_col=1) values (1), (2), (3), 
> (4);
> alter table part_table_2 add partition (part_col=2);
> insert into table part_table_2 partition (part_col=2) values (1), (2), (3), 
> (4);
> explain select * from regular_table, part_table_1, part_table_2 where 
> regular_table.col = part_table_1.part_col and regular_table.col = 
> part_table_2.part_col;
> {code}
> The explain plan is
> {code}
> STAGE DEPENDENCIES:
>   Stage-2 is a root stage
>   Stage-1 depends on stages: Stage-2
>   Stage-0 depends on stages: Stage-1
> STAGE PLANS:
>   Stage: Stage-2
> Spark
>  A masked pattern was here 
>   Vertices:
> Map 1 
> Map Operator Tree:
> TableScan
>   alias: regular_table
>   Statistics: Num rows: 1 Data size: 1 Basic stats: COMPLETE 
> Column stats: NONE
>   Filter Operator
> predicate: col is not null (type: boolean)
> Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
> Select Operator
>   expressions: col (type: int)
>   outputColumnNames: _col0
>   Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
>   Spark HashTable Sink Operator
> keys:
>   0 _col0 (type: int)
>   1 _col1 (type: int)
>   2 _col1 (type: int)
>   Select Operator
> expressions: _col0 (type: int)
> outputColumnNames: _col0
> Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
> Group By Operator
>   keys: _col0 (type: int)
>   mode: hash
>   outputColumnNames: _col0
>   Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
>   Spark Partition Pruning Sink Operator
> partition key expr: part_col
> Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
> target column name: part_col
> target work: Map 2
>   Select Operator
> expressions: _col0 (type: int)
> outputColumnNames: _col0
> Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
> Group By Operator
>   keys: _col0 (type: int)
>   mode: hash
>   outputColumnNames: _col0
>   Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
>   Spark Partition Pruning Sink Operator
> partition key expr: part_col
> Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
> 

[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

2018-03-06 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16388915#comment-16388915
 ] 

Rui Li commented on HIVE-15104:
---

[~stakiar], thanks for trying this out.
bq. The HiveKryoRegistrator still seems to be serializing the hashCode so where 
are the actual savings coming from?
I didn't look deeply into kryo, but I think the reason is that the generic kryo 
SerDe has some overhead to store class meta info, while in 
{{HiveKryoRegistrator}} we just store the data. My earlier 
[comment|https://issues.apache.org/jira/browse/HIVE-15104?focusedCommentId=16007788=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16007788]
 shows a custom SerDe can bring improvements for BytesWritable too.

bq. I'm not sure I understand why the performance should improve when 
hive.spark.use.groupby.shuffle is set to false.
I guess the difference is due to the different shuffles we use -- if 
{{hive.spark.use.groupby.shuffle}} is false, the group-by-key shuffle is replaced 
with a repartition-and-sort-within-partition shuffle. And yes, the registrator is 
the same for the two cases.

bq. why do we need the hashCode after deserializing the data?
For MR, the hash code is not needed for a deserialized HiveKey (see 
HiveKey::hashCode), because by the time a HiveKey is deserialized, it has already 
been distributed to the proper reducer. For Spark, RDDs may get cached during 
execution. So if we deserialize a cached RDD and try to partition it for a 
downstream reducer, we need the hash code to be available after deserialization.
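
To make the trade-off concrete, the "store just the bytes plus the hash code" idea 
looks roughly like this (a sketch using a hypothetical key class; it only mirrors the 
approach and is not the actual {{HiveKryoRegistrator}} code):

{code:java}
import com.esotericsoftware.kryo.Kryo;
import com.esotericsoftware.kryo.Serializer;
import com.esotericsoftware.kryo.io.Input;
import com.esotericsoftware.kryo.io.Output;

// Hypothetical stand-in for HiveKey: raw key bytes plus a precomputed hash code.
class KeyWithHash {
  final byte[] bytes;
  final int hash;
  KeyWithHash(byte[] bytes, int hash) { this.bytes = bytes; this.hash = hash; }
}

class KeyWithHashSerializer extends Serializer<KeyWithHash> {
  @Override
  public void write(Kryo kryo, Output output, KeyWithHash key) {
    // Only the payload and the hash code are written -- no class metadata,
    // which is where the savings over generic kryo serialization come from.
    output.writeInt(key.bytes.length, true);
    output.writeBytes(key.bytes);
    output.writeInt(key.hash);
  }

  @Override
  public KeyWithHash read(Kryo kryo, Input input, Class<KeyWithHash> type) {
    int len = input.readInt(true);
    byte[] bytes = input.readBytes(len);
    // Restoring the hash keeps a cached RDD partitionable after deserialization.
    return new KeyWithHash(bytes, input.readInt());
  }
}
{code}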

> Hive on Spark generate more shuffle data than hive on mr
> 
>
> Key: HIVE-15104
> URL: https://issues.apache.org/jira/browse/HIVE-15104
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Affects Versions: 1.2.1
>Reporter: wangwenli
>Assignee: Rui Li
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: HIVE-15104.1.patch, HIVE-15104.10.patch, 
> HIVE-15104.2.patch, HIVE-15104.3.patch, HIVE-15104.4.patch, 
> HIVE-15104.5.patch, HIVE-15104.6.patch, HIVE-15104.7.patch, 
> HIVE-15104.8.patch, HIVE-15104.9.patch, TPC-H 100G.xlsx
>
>
> the same sql, running on the spark and mr engines, will generate different sizes 
> of shuffle data.
> i think it is because hive on mr only serializes part of the HiveKey, while hive 
> on spark, which uses kryo, serializes the full HiveKey object.
> what is your opinion?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-17178) Spark Partition Pruning Sink Operator can't target multiple Works

2018-03-06 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16387444#comment-16387444
 ] 

Rui Li commented on HIVE-17178:
---

[~stakiar], let me know if you have any further comments on the latest patch. 
Thanks.

> Spark Partition Pruning Sink Operator can't target multiple Works
> -
>
> Key: HIVE-17178
> URL: https://issues.apache.org/jira/browse/HIVE-17178
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: Rui Li
>Priority: Major
> Attachments: HIVE-17178.1.patch, HIVE-17178.2.patch, 
> HIVE-17178.3.patch, HIVE-17178.4.patch, HIVE-17178.5.patch
>
>
> A Spark Partition Pruning Sink Operator cannot be used to target multiple Map 
> Work objects. The entire DPP subtree (SEL-GBY-SPARKPRUNINGSINK) is duplicated 
> if a single table needs to be used to target multiple Map Works.
> The following query shows the issue:
> {code}
> set hive.spark.dynamic.partition.pruning=true;
> set hive.auto.convert.join=true;
> create table part_table_1 (col int) partitioned by (part_col int);
> create table part_table_2 (col int) partitioned by (part_col int);
> create table regular_table (col int);
> insert into table regular_table values (1);
> alter table part_table_1 add partition (part_col=1);
> insert into table part_table_1 partition (part_col=1) values (1), (2), (3), 
> (4);
> alter table part_table_1 add partition (part_col=2);
> insert into table part_table_1 partition (part_col=2) values (1), (2), (3), 
> (4);
> alter table part_table_2 add partition (part_col=1);
> insert into table part_table_2 partition (part_col=1) values (1), (2), (3), 
> (4);
> alter table part_table_2 add partition (part_col=2);
> insert into table part_table_2 partition (part_col=2) values (1), (2), (3), 
> (4);
> explain select * from regular_table, part_table_1, part_table_2 where 
> regular_table.col = part_table_1.part_col and regular_table.col = 
> part_table_2.part_col;
> {code}
> The explain plan is
> {code}
> STAGE DEPENDENCIES:
>   Stage-2 is a root stage
>   Stage-1 depends on stages: Stage-2
>   Stage-0 depends on stages: Stage-1
> STAGE PLANS:
>   Stage: Stage-2
> Spark
>  A masked pattern was here 
>   Vertices:
> Map 1 
> Map Operator Tree:
> TableScan
>   alias: regular_table
>   Statistics: Num rows: 1 Data size: 1 Basic stats: COMPLETE 
> Column stats: NONE
>   Filter Operator
> predicate: col is not null (type: boolean)
> Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
> Select Operator
>   expressions: col (type: int)
>   outputColumnNames: _col0
>   Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
>   Spark HashTable Sink Operator
> keys:
>   0 _col0 (type: int)
>   1 _col1 (type: int)
>   2 _col1 (type: int)
>   Select Operator
> expressions: _col0 (type: int)
> outputColumnNames: _col0
> Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
> Group By Operator
>   keys: _col0 (type: int)
>   mode: hash
>   outputColumnNames: _col0
>   Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
>   Spark Partition Pruning Sink Operator
> partition key expr: part_col
> Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
> target column name: part_col
> target work: Map 2
>   Select Operator
> expressions: _col0 (type: int)
> outputColumnNames: _col0
> Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
> Group By Operator
>   keys: _col0 (type: int)
>   mode: hash
>   outputColumnNames: _col0
>   Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
>   Spark Partition Pruning Sink Operator
> partition key expr: part_col
> Statistics: Num rows: 1 Data size: 1 Basic 

[jira] [Updated] (HIVE-17178) Spark Partition Pruning Sink Operator can't target multiple Works

2018-02-26 Thread Rui Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rui Li updated HIVE-17178:
--
Attachment: HIVE-17178.5.patch

> Spark Partition Pruning Sink Operator can't target multiple Works
> -
>
> Key: HIVE-17178
> URL: https://issues.apache.org/jira/browse/HIVE-17178
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: Rui Li
>Priority: Major
> Attachments: HIVE-17178.1.patch, HIVE-17178.2.patch, 
> HIVE-17178.3.patch, HIVE-17178.4.patch, HIVE-17178.5.patch
>
>
> A Spark Partition Pruning Sink Operator cannot be used to target multiple Map 
> Work objects. The entire DPP subtree (SEL-GBY-SPARKPRUNINGSINK) is duplicated 
> if a single table needs to be used to target multiple Map Works.
> The following query shows the issue:
> {code}
> set hive.spark.dynamic.partition.pruning=true;
> set hive.auto.convert.join=true;
> create table part_table_1 (col int) partitioned by (part_col int);
> create table part_table_2 (col int) partitioned by (part_col int);
> create table regular_table (col int);
> insert into table regular_table values (1);
> alter table part_table_1 add partition (part_col=1);
> insert into table part_table_1 partition (part_col=1) values (1), (2), (3), 
> (4);
> alter table part_table_1 add partition (part_col=2);
> insert into table part_table_1 partition (part_col=2) values (1), (2), (3), 
> (4);
> alter table part_table_2 add partition (part_col=1);
> insert into table part_table_2 partition (part_col=1) values (1), (2), (3), 
> (4);
> alter table part_table_2 add partition (part_col=2);
> insert into table part_table_2 partition (part_col=2) values (1), (2), (3), 
> (4);
> explain select * from regular_table, part_table_1, part_table_2 where 
> regular_table.col = part_table_1.part_col and regular_table.col = 
> part_table_2.part_col;
> {code}
> The explain plan is
> {code}
> STAGE DEPENDENCIES:
>   Stage-2 is a root stage
>   Stage-1 depends on stages: Stage-2
>   Stage-0 depends on stages: Stage-1
> STAGE PLANS:
>   Stage: Stage-2
> Spark
>  A masked pattern was here 
>   Vertices:
> Map 1 
> Map Operator Tree:
> TableScan
>   alias: regular_table
>   Statistics: Num rows: 1 Data size: 1 Basic stats: COMPLETE 
> Column stats: NONE
>   Filter Operator
> predicate: col is not null (type: boolean)
> Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
> Select Operator
>   expressions: col (type: int)
>   outputColumnNames: _col0
>   Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
>   Spark HashTable Sink Operator
> keys:
>   0 _col0 (type: int)
>   1 _col1 (type: int)
>   2 _col1 (type: int)
>   Select Operator
> expressions: _col0 (type: int)
> outputColumnNames: _col0
> Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
> Group By Operator
>   keys: _col0 (type: int)
>   mode: hash
>   outputColumnNames: _col0
>   Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
>   Spark Partition Pruning Sink Operator
> partition key expr: part_col
> Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
> target column name: part_col
> target work: Map 2
>   Select Operator
> expressions: _col0 (type: int)
> outputColumnNames: _col0
> Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
> Group By Operator
>   keys: _col0 (type: int)
>   mode: hash
>   outputColumnNames: _col0
>   Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
>   Spark Partition Pruning Sink Operator
> partition key expr: part_col
> Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
> target column name: 

[jira] [Updated] (HIVE-18645) invalid url address in README.txt from module hbase-handler

2018-02-26 Thread Rui Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-18645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rui Li updated HIVE-18645:
--
   Resolution: Fixed
Fix Version/s: 3.0.0
   Status: Resolved  (was: Patch Available)

Pushed to master. Thanks Saijin for the fix and Zoltan for the review.

> invalid url address in README.txt from module hbase-handler
> ---
>
> Key: HIVE-18645
> URL: https://issues.apache.org/jira/browse/HIVE-18645
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Saijin Huang
>Assignee: Saijin Huang
>Priority: Trivial
> Fix For: 3.0.0
>
> Attachments: HIVE-18645.1.patch
>
>
> The url "http://wiki.apache.org/hadoop/Hive/HBaseIntegration; is invalid in 
> README.txt from module hbase-handler.
> Update the url and change .txt to .md



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-17178) Spark Partition Pruning Sink Operator can't target multiple Works

2018-02-18 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16368740#comment-16368740
 ] 

Rui Li commented on HIVE-17178:
---

Latest failures are not related. [~stakiar], please take another look. Thanks.

> Spark Partition Pruning Sink Operator can't target multiple Works
> -
>
> Key: HIVE-17178
> URL: https://issues.apache.org/jira/browse/HIVE-17178
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: Rui Li
>Priority: Major
> Attachments: HIVE-17178.1.patch, HIVE-17178.2.patch, 
> HIVE-17178.3.patch, HIVE-17178.4.patch
>
>
> A Spark Partition Pruning Sink Operator cannot be used to target multiple Map 
> Work objects. The entire DPP subtree (SEL-GBY-SPARKPRUNINGSINK) is duplicated 
> if a single table needs to be used to target multiple Map Works.
> The following query shows the issue:
> {code}
> set hive.spark.dynamic.partition.pruning=true;
> set hive.auto.convert.join=true;
> create table part_table_1 (col int) partitioned by (part_col int);
> create table part_table_2 (col int) partitioned by (part_col int);
> create table regular_table (col int);
> insert into table regular_table values (1);
> alter table part_table_1 add partition (part_col=1);
> insert into table part_table_1 partition (part_col=1) values (1), (2), (3), 
> (4);
> alter table part_table_1 add partition (part_col=2);
> insert into table part_table_1 partition (part_col=2) values (1), (2), (3), 
> (4);
> alter table part_table_2 add partition (part_col=1);
> insert into table part_table_2 partition (part_col=1) values (1), (2), (3), 
> (4);
> alter table part_table_2 add partition (part_col=2);
> insert into table part_table_2 partition (part_col=2) values (1), (2), (3), 
> (4);
> explain select * from regular_table, part_table_1, part_table_2 where 
> regular_table.col = part_table_1.part_col and regular_table.col = 
> part_table_2.part_col;
> {code}
> The explain plan is
> {code}
> STAGE DEPENDENCIES:
>   Stage-2 is a root stage
>   Stage-1 depends on stages: Stage-2
>   Stage-0 depends on stages: Stage-1
> STAGE PLANS:
>   Stage: Stage-2
> Spark
>  A masked pattern was here 
>   Vertices:
> Map 1 
> Map Operator Tree:
> TableScan
>   alias: regular_table
>   Statistics: Num rows: 1 Data size: 1 Basic stats: COMPLETE 
> Column stats: NONE
>   Filter Operator
> predicate: col is not null (type: boolean)
> Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
> Select Operator
>   expressions: col (type: int)
>   outputColumnNames: _col0
>   Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
>   Spark HashTable Sink Operator
> keys:
>   0 _col0 (type: int)
>   1 _col1 (type: int)
>   2 _col1 (type: int)
>   Select Operator
> expressions: _col0 (type: int)
> outputColumnNames: _col0
> Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
> Group By Operator
>   keys: _col0 (type: int)
>   mode: hash
>   outputColumnNames: _col0
>   Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
>   Spark Partition Pruning Sink Operator
> partition key expr: part_col
> Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
> target column name: part_col
> target work: Map 2
>   Select Operator
> expressions: _col0 (type: int)
> outputColumnNames: _col0
> Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
> Group By Operator
>   keys: _col0 (type: int)
>   mode: hash
>   outputColumnNames: _col0
>   Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
>   Spark Partition Pruning Sink Operator
> partition key expr: part_col
> Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column 

[jira] [Commented] (HIVE-17178) Spark Partition Pruning Sink Operator can't target multiple Works

2018-02-17 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16368421#comment-16368421
 ] 

Rui Li commented on HIVE-17178:
---

Updated the test to give the tables different sizes, so that the query plan 
should be more deterministic.
Tried {{bucketizedhiveinputformat}} locally and it fails due to an OOME. It fails 
on master too, so it's not related to the patch here.
The {{bucketmapjoin6}} and {{dynamic_rdd_cache}} failures cannot be reproduced locally.
{{spark_opt_shuffle_serde}} should have already been fixed.

> Spark Partition Pruning Sink Operator can't target multiple Works
> -
>
> Key: HIVE-17178
> URL: https://issues.apache.org/jira/browse/HIVE-17178
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: Rui Li
>Priority: Major
> Attachments: HIVE-17178.1.patch, HIVE-17178.2.patch, 
> HIVE-17178.3.patch, HIVE-17178.4.patch
>
>
> A Spark Partition Pruning Sink Operator cannot be used to target multiple Map 
> Work objects. The entire DPP subtree (SEL-GBY-SPARKPRUNINGSINK) is duplicated 
> if a single table needs to be used to target multiple Map Works.
> The following query shows the issue:
> {code}
> set hive.spark.dynamic.partition.pruning=true;
> set hive.auto.convert.join=true;
> create table part_table_1 (col int) partitioned by (part_col int);
> create table part_table_2 (col int) partitioned by (part_col int);
> create table regular_table (col int);
> insert into table regular_table values (1);
> alter table part_table_1 add partition (part_col=1);
> insert into table part_table_1 partition (part_col=1) values (1), (2), (3), 
> (4);
> alter table part_table_1 add partition (part_col=2);
> insert into table part_table_1 partition (part_col=2) values (1), (2), (3), 
> (4);
> alter table part_table_2 add partition (part_col=1);
> insert into table part_table_2 partition (part_col=1) values (1), (2), (3), 
> (4);
> alter table part_table_2 add partition (part_col=2);
> insert into table part_table_2 partition (part_col=2) values (1), (2), (3), 
> (4);
> explain select * from regular_table, part_table_1, part_table_2 where 
> regular_table.col = part_table_1.part_col and regular_table.col = 
> part_table_2.part_col;
> {code}
> The explain plan is
> {code}
> STAGE DEPENDENCIES:
>   Stage-2 is a root stage
>   Stage-1 depends on stages: Stage-2
>   Stage-0 depends on stages: Stage-1
> STAGE PLANS:
>   Stage: Stage-2
> Spark
>  A masked pattern was here 
>   Vertices:
> Map 1 
> Map Operator Tree:
> TableScan
>   alias: regular_table
>   Statistics: Num rows: 1 Data size: 1 Basic stats: COMPLETE 
> Column stats: NONE
>   Filter Operator
> predicate: col is not null (type: boolean)
> Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
> Select Operator
>   expressions: col (type: int)
>   outputColumnNames: _col0
>   Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
>   Spark HashTable Sink Operator
> keys:
>   0 _col0 (type: int)
>   1 _col1 (type: int)
>   2 _col1 (type: int)
>   Select Operator
> expressions: _col0 (type: int)
> outputColumnNames: _col0
> Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
> Group By Operator
>   keys: _col0 (type: int)
>   mode: hash
>   outputColumnNames: _col0
>   Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
>   Spark Partition Pruning Sink Operator
> partition key expr: part_col
> Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
> target column name: part_col
> target work: Map 2
>   Select Operator
> expressions: _col0 (type: int)
> outputColumnNames: _col0
> Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
> Group By Operator
>   keys: _col0 (type: int)
>   mode: hash
>   outputColumnNames: _col0
>   Statistics: Num rows: 1 Data 

[jira] [Updated] (HIVE-17178) Spark Partition Pruning Sink Operator can't target multiple Works

2018-02-17 Thread Rui Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rui Li updated HIVE-17178:
--
Attachment: HIVE-17178.4.patch

> Spark Partition Pruning Sink Operator can't target multiple Works
> -
>
> Key: HIVE-17178
> URL: https://issues.apache.org/jira/browse/HIVE-17178
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: Rui Li
>Priority: Major
> Attachments: HIVE-17178.1.patch, HIVE-17178.2.patch, 
> HIVE-17178.3.patch, HIVE-17178.4.patch
>
>
> A Spark Partition Pruning Sink Operator cannot be used to target multiple Map 
> Work objects. The entire DPP subtree (SEL-GBY-SPARKPRUNINGSINK) is duplicated 
> if a single table needs to be used to target multiple Map Works.
> The following query shows the issue:
> {code}
> set hive.spark.dynamic.partition.pruning=true;
> set hive.auto.convert.join=true;
> create table part_table_1 (col int) partitioned by (part_col int);
> create table part_table_2 (col int) partitioned by (part_col int);
> create table regular_table (col int);
> insert into table regular_table values (1);
> alter table part_table_1 add partition (part_col=1);
> insert into table part_table_1 partition (part_col=1) values (1), (2), (3), 
> (4);
> alter table part_table_1 add partition (part_col=2);
> insert into table part_table_1 partition (part_col=2) values (1), (2), (3), 
> (4);
> alter table part_table_2 add partition (part_col=1);
> insert into table part_table_2 partition (part_col=1) values (1), (2), (3), 
> (4);
> alter table part_table_2 add partition (part_col=2);
> insert into table part_table_2 partition (part_col=2) values (1), (2), (3), 
> (4);
> explain select * from regular_table, part_table_1, part_table_2 where 
> regular_table.col = part_table_1.part_col and regular_table.col = 
> part_table_2.part_col;
> {code}
> The explain plan is
> {code}
> STAGE DEPENDENCIES:
>   Stage-2 is a root stage
>   Stage-1 depends on stages: Stage-2
>   Stage-0 depends on stages: Stage-1
> STAGE PLANS:
>   Stage: Stage-2
> Spark
>  A masked pattern was here 
>   Vertices:
> Map 1 
> Map Operator Tree:
> TableScan
>   alias: regular_table
>   Statistics: Num rows: 1 Data size: 1 Basic stats: COMPLETE 
> Column stats: NONE
>   Filter Operator
> predicate: col is not null (type: boolean)
> Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
> Select Operator
>   expressions: col (type: int)
>   outputColumnNames: _col0
>   Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
>   Spark HashTable Sink Operator
> keys:
>   0 _col0 (type: int)
>   1 _col1 (type: int)
>   2 _col1 (type: int)
>   Select Operator
> expressions: _col0 (type: int)
> outputColumnNames: _col0
> Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
> Group By Operator
>   keys: _col0 (type: int)
>   mode: hash
>   outputColumnNames: _col0
>   Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
>   Spark Partition Pruning Sink Operator
> partition key expr: part_col
> Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
> target column name: part_col
> target work: Map 2
>   Select Operator
> expressions: _col0 (type: int)
> outputColumnNames: _col0
> Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
> Group By Operator
>   keys: _col0 (type: int)
>   mode: hash
>   outputColumnNames: _col0
>   Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
>   Spark Partition Pruning Sink Operator
> partition key expr: part_col
> Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
> target column name: part_col
>   

[jira] [Updated] (HIVE-18442) HoS: No FileSystem for scheme: nullscan

2018-02-16 Thread Rui Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-18442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rui Li updated HIVE-18442:
--
   Resolution: Fixed
Fix Version/s: 3.0.0
   Status: Resolved  (was: Patch Available)

Pushed to master. Thanks Xuefu for the review.

> HoS: No FileSystem for scheme: nullscan
> ---
>
> Key: HIVE-18442
> URL: https://issues.apache.org/jira/browse/HIVE-18442
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Rui Li
>Assignee: Rui Li
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: HIVE-18442.1.patch
>
>
> Hit the issue when I run the following query in yarn-cluster mode:
> {code}
> select * from (select key from src where false) a left outer join (select key 
> from srcpart limit 0) b on a.key=b.key;
> {code}
> Stack trace:
> {noformat}
> Job failed with java.io.IOException: No FileSystem for scheme: nullscan
>   at 
> org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2799)
>   at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2810)
>   at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:100)
>   at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2849)
>   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2831)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:389)
>   at org.apache.hadoop.fs.Path.getFileSystem(Path.java:356)
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities.isEmptyPath(Utilities.java:2605)
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities.isEmptyPath(Utilities.java:2601)
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities$GetInputPathsCallable.call(Utilities.java:3409)
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities.getInputPaths(Utilities.java:3347)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.cloneJobConf(SparkPlanGenerator.java:299)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generate(SparkPlanGenerator.java:222)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generate(SparkPlanGenerator.java:109)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient$JobStatusJob.call(RemoteHiveSparkClient.java:354)
>   at 
> org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:358)
>   at 
> org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:323)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-18442) HoS: No FileSystem for scheme: nullscan

2018-02-16 Thread Rui Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-18442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rui Li updated HIVE-18442:
--
Attachment: (was: HIVE-18442.2.patch)

> HoS: No FileSystem for scheme: nullscan
> ---
>
> Key: HIVE-18442
> URL: https://issues.apache.org/jira/browse/HIVE-18442
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Rui Li
>Assignee: Rui Li
>Priority: Major
> Attachments: HIVE-18442.1.patch
>
>
> Hit the issue when I run the following query in yarn-cluster mode:
> {code}
> select * from (select key from src where false) a left outer join (select key 
> from srcpart limit 0) b on a.key=b.key;
> {code}
> Stack trace:
> {noformat}
> Job failed with java.io.IOException: No FileSystem for scheme: nullscan
>   at 
> org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2799)
>   at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2810)
>   at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:100)
>   at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2849)
>   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2831)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:389)
>   at org.apache.hadoop.fs.Path.getFileSystem(Path.java:356)
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities.isEmptyPath(Utilities.java:2605)
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities.isEmptyPath(Utilities.java:2601)
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities$GetInputPathsCallable.call(Utilities.java:3409)
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities.getInputPaths(Utilities.java:3347)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.cloneJobConf(SparkPlanGenerator.java:299)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generate(SparkPlanGenerator.java:222)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generate(SparkPlanGenerator.java:109)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient$JobStatusJob.call(RemoteHiveSparkClient.java:354)
>   at 
> org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:358)
>   at 
> org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:323)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-17178) Spark Partition Pruning Sink Operator can't target multiple Works

2018-02-12 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16361912#comment-16361912
 ] 

Rui Li commented on HIVE-17178:
---

Update to address RB comments.

> Spark Partition Pruning Sink Operator can't target multiple Works
> -
>
> Key: HIVE-17178
> URL: https://issues.apache.org/jira/browse/HIVE-17178
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: Rui Li
>Priority: Major
> Attachments: HIVE-17178.1.patch, HIVE-17178.2.patch, 
> HIVE-17178.3.patch
>
>
> A Spark Partition Pruning Sink Operator cannot be used to target multiple Map 
> Work objects. The entire DPP subtree (SEL-GBY-SPARKPRUNINGSINK) is duplicated 
> if a single table needs to be used to target multiple Map Works.
> The following query shows the issue:
> {code}
> set hive.spark.dynamic.partition.pruning=true;
> set hive.auto.convert.join=true;
> create table part_table_1 (col int) partitioned by (part_col int);
> create table part_table_2 (col int) partitioned by (part_col int);
> create table regular_table (col int);
> insert into table regular_table values (1);
> alter table part_table_1 add partition (part_col=1);
> insert into table part_table_1 partition (part_col=1) values (1), (2), (3), 
> (4);
> alter table part_table_1 add partition (part_col=2);
> insert into table part_table_1 partition (part_col=2) values (1), (2), (3), 
> (4);
> alter table part_table_2 add partition (part_col=1);
> insert into table part_table_2 partition (part_col=1) values (1), (2), (3), 
> (4);
> alter table part_table_2 add partition (part_col=2);
> insert into table part_table_2 partition (part_col=2) values (1), (2), (3), 
> (4);
> explain select * from regular_table, part_table_1, part_table_2 where 
> regular_table.col = part_table_1.part_col and regular_table.col = 
> part_table_2.part_col;
> {code}
> The explain plan is
> {code}
> STAGE DEPENDENCIES:
>   Stage-2 is a root stage
>   Stage-1 depends on stages: Stage-2
>   Stage-0 depends on stages: Stage-1
> STAGE PLANS:
>   Stage: Stage-2
> Spark
>  A masked pattern was here 
>   Vertices:
> Map 1 
> Map Operator Tree:
> TableScan
>   alias: regular_table
>   Statistics: Num rows: 1 Data size: 1 Basic stats: COMPLETE 
> Column stats: NONE
>   Filter Operator
> predicate: col is not null (type: boolean)
> Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
> Select Operator
>   expressions: col (type: int)
>   outputColumnNames: _col0
>   Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
>   Spark HashTable Sink Operator
> keys:
>   0 _col0 (type: int)
>   1 _col1 (type: int)
>   2 _col1 (type: int)
>   Select Operator
> expressions: _col0 (type: int)
> outputColumnNames: _col0
> Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
> Group By Operator
>   keys: _col0 (type: int)
>   mode: hash
>   outputColumnNames: _col0
>   Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
>   Spark Partition Pruning Sink Operator
> partition key expr: part_col
> Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
> target column name: part_col
> target work: Map 2
>   Select Operator
> expressions: _col0 (type: int)
> outputColumnNames: _col0
> Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
> Group By Operator
>   keys: _col0 (type: int)
>   mode: hash
>   outputColumnNames: _col0
>   Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
>   Spark Partition Pruning Sink Operator
> partition key expr: part_col
> Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
> target column name: 

[jira] [Updated] (HIVE-17178) Spark Partition Pruning Sink Operator can't target multiple Works

2018-02-12 Thread Rui Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rui Li updated HIVE-17178:
--
Attachment: HIVE-17178.3.patch

> Spark Partition Pruning Sink Operator can't target multiple Works
> -
>
> Key: HIVE-17178
> URL: https://issues.apache.org/jira/browse/HIVE-17178
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: Rui Li
>Priority: Major
> Attachments: HIVE-17178.1.patch, HIVE-17178.2.patch, 
> HIVE-17178.3.patch
>
>
> A Spark Partition Pruning Sink Operator cannot be used to target multiple Map 
> Work objects. The entire DPP subtree (SEL-GBY-SPARKPRUNINGSINK) is duplicated 
> if a single table needs to be used to target multiple Map Works.
> The following query shows the issue:
> {code}
> set hive.spark.dynamic.partition.pruning=true;
> set hive.auto.convert.join=true;
> create table part_table_1 (col int) partitioned by (part_col int);
> create table part_table_2 (col int) partitioned by (part_col int);
> create table regular_table (col int);
> insert into table regular_table values (1);
> alter table part_table_1 add partition (part_col=1);
> insert into table part_table_1 partition (part_col=1) values (1), (2), (3), 
> (4);
> alter table part_table_1 add partition (part_col=2);
> insert into table part_table_1 partition (part_col=2) values (1), (2), (3), 
> (4);
> alter table part_table_2 add partition (part_col=1);
> insert into table part_table_2 partition (part_col=1) values (1), (2), (3), 
> (4);
> alter table part_table_2 add partition (part_col=2);
> insert into table part_table_2 partition (part_col=2) values (1), (2), (3), 
> (4);
> explain select * from regular_table, part_table_1, part_table_2 where 
> regular_table.col = part_table_1.part_col and regular_table.col = 
> part_table_2.part_col;
> {code}
> The explain plan is
> {code}
> STAGE DEPENDENCIES:
>   Stage-2 is a root stage
>   Stage-1 depends on stages: Stage-2
>   Stage-0 depends on stages: Stage-1
> STAGE PLANS:
>   Stage: Stage-2
> Spark
>  A masked pattern was here 
>   Vertices:
> Map 1 
> Map Operator Tree:
> TableScan
>   alias: regular_table
>   Statistics: Num rows: 1 Data size: 1 Basic stats: COMPLETE 
> Column stats: NONE
>   Filter Operator
> predicate: col is not null (type: boolean)
> Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
> Select Operator
>   expressions: col (type: int)
>   outputColumnNames: _col0
>   Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
>   Spark HashTable Sink Operator
> keys:
>   0 _col0 (type: int)
>   1 _col1 (type: int)
>   2 _col1 (type: int)
>   Select Operator
> expressions: _col0 (type: int)
> outputColumnNames: _col0
> Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
> Group By Operator
>   keys: _col0 (type: int)
>   mode: hash
>   outputColumnNames: _col0
>   Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
>   Spark Partition Pruning Sink Operator
> partition key expr: part_col
> Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
> target column name: part_col
> target work: Map 2
>   Select Operator
> expressions: _col0 (type: int)
> outputColumnNames: _col0
> Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
> Group By Operator
>   keys: _col0 (type: int)
>   mode: hash
>   outputColumnNames: _col0
>   Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
>   Spark Partition Pruning Sink Operator
> partition key expr: part_col
> Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
> target column name: part_col
> 

[jira] [Commented] (HIVE-18442) HoS: No FileSystem for scheme: nullscan

2018-02-12 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16361645#comment-16361645
 ] 

Rui Li commented on HIVE-18442:
---

[~xuefuz], it doesn't change the class path, but it adds the binding to the 
configuration, so that FileSystem knows the implementing class for the scheme 
"nullscan" -- FileSystem gets FS instances either via ServiceLoader or from the 
configuration.
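
For reference, the binding looks like this on the configuration side (a minimal 
sketch; the implementing class name is written out only for illustration and should 
be whatever FileSystem subclass backs the {{nullscan}} scheme):

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NullScanBinding {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Bind the scheme to its implementation in the conf, so FileSystem can
    // resolve it even without a ServiceLoader entry on the class path.
    // The class name below is an assumption made for illustration.
    conf.set("fs.nullscan.impl", "org.apache.hadoop.hive.ql.io.NullScanFileSystem");
    FileSystem fs = new Path("nullscan://null/").getFileSystem(conf);
    System.out.println(fs.getClass().getName());
  }
}
{code}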

> HoS: No FileSystem for scheme: nullscan
> ---
>
> Key: HIVE-18442
> URL: https://issues.apache.org/jira/browse/HIVE-18442
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Rui Li
>Assignee: Rui Li
>Priority: Major
> Attachments: HIVE-18442.1.patch, HIVE-18442.2.patch
>
>
> Hit the issue when I run the following query in yarn-cluster mode:
> {code}
> select * from (select key from src where false) a left outer join (select key 
> from srcpart limit 0) b on a.key=b.key;
> {code}
> Stack trace:
> {noformat}
> Job failed with java.io.IOException: No FileSystem for scheme: nullscan
>   at 
> org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2799)
>   at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2810)
>   at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:100)
>   at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2849)
>   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2831)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:389)
>   at org.apache.hadoop.fs.Path.getFileSystem(Path.java:356)
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities.isEmptyPath(Utilities.java:2605)
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities.isEmptyPath(Utilities.java:2601)
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities$GetInputPathsCallable.call(Utilities.java:3409)
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities.getInputPaths(Utilities.java:3347)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.cloneJobConf(SparkPlanGenerator.java:299)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generate(SparkPlanGenerator.java:222)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generate(SparkPlanGenerator.java:109)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient$JobStatusJob.call(RemoteHiveSparkClient.java:354)
>   at 
> org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:358)
>   at 
> org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:323)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-18350) load data should rename files consistent with insert statements

2018-02-10 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16359812#comment-16359812
 ] 

Rui Li commented on HIVE-18350:
---

Thanks [~djaiswal], it works now.

> load data should rename files consistent with insert statements
> ---
>
> Key: HIVE-18350
> URL: https://issues.apache.org/jira/browse/HIVE-18350
> Project: Hive
>  Issue Type: Bug
>Reporter: Deepak Jaiswal
>Assignee: Deepak Jaiswal
>Priority: Major
> Attachments: HIVE-18350.1.patch, HIVE-18350.10.patch, 
> HIVE-18350.11.patch, HIVE-18350.12.patch, HIVE-18350.13.patch, 
> HIVE-18350.14.patch, HIVE-18350.15.patch, HIVE-18350.16.patch, 
> HIVE-18350.2.patch, HIVE-18350.3.patch, HIVE-18350.4.patch, 
> HIVE-18350.5.patch, HIVE-18350.6.patch, HIVE-18350.7.patch, 
> HIVE-18350.8.patch, HIVE-18350.9.patch
>
>
> Insert statements create files with names ending in _0, 0001_0, etc. 
> However, load data uses the input file name. That results in an inconsistent 
> naming convention, which makes SMB joins difficult in some scenarios and may 
> cause trouble for other types of queries in the future.
> We need a consistent naming convention.
> For non-bucketed tables, hive renames all the files regardless of how they 
> were named by the user.
>  For bucketed tables, hive relies on the user to name the files matching the 
> bucket in non-strict mode. Hive assumes that the data in a file belongs to the 
> same bucket. In strict mode, loading a bucketed table is disabled.
> This will likely affect most of the tests which load data, which is pretty 
> significant, so it is further divided into two subtasks for a smoother 
> merge.
> For existing tables in a customer database, it is recommended to reload 
> bucketed tables; otherwise, if the customer tries to run an SMB join and there is a 
> bucket for which there is no split, there is a possibility of getting 
> incorrect results. However, this is not a regression, as it would happen even 
> without the patch.
> With this patch, however, and after reloading the data, the results should be correct.
> For non-bucketed tables and external tables, there is no difference in 
> behavior and reloading data is not needed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-18442) HoS: No FileSystem for scheme: nullscan

2018-02-10 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16359802#comment-16359802
 ] 

Rui Li commented on HIVE-18442:
---

[~xuefuz], my concern about {{spark.yarn.user.classpath.first}} is that it's not a 
public config. On the other hand, setting {{fs.SCHEME.impl}} in the 
configuration is a somewhat more "official" solution -- at least the javadoc 
implies the filesystem binding will be checked in the conf when getting a 
FileSystem instance:
https://github.com/apache/hadoop/blob/release-3.0.0-beta1-RC0/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileSystem.java#L3237
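To illustrate the {{fs.SCHEME.impl}} binding, here is a minimal sketch. The 
nullscan implementation class name below is an assumption for illustration, and 
the snippet still needs the jar providing that class on its classpath:
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FsImplBindingSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Bind the "nullscan" scheme to its FileSystem implementation directly in the
    // conf, so FileSystem resolution doesn't depend on hive-exec.jar being on the
    // system class path of the driver JVM. The class name is an assumption.
    conf.set("fs.nullscan.impl", "org.apache.hadoop.hive.ql.io.NullScanFileSystem");

    // Path#getFileSystem consults the fs.SCHEME.impl binding in the conf.
    FileSystem fs = new Path("nullscan:///dummy").getFileSystem(conf);
    System.out.println(fs.getClass().getName());
  }
}
{code}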

> HoS: No FileSystem for scheme: nullscan
> ---
>
> Key: HIVE-18442
> URL: https://issues.apache.org/jira/browse/HIVE-18442
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Rui Li
>Assignee: Rui Li
>Priority: Major
> Attachments: HIVE-18442.1.patch, HIVE-18442.2.patch
>
>
> Hit the issue when I run the following query in yarn-cluster mode:
> {code}
> select * from (select key from src where false) a left outer join (select key 
> from srcpart limit 0) b on a.key=b.key;
> {code}
> Stack trace:
> {noformat}
> Job failed with java.io.IOException: No FileSystem for scheme: nullscan
>   at 
> org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2799)
>   at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2810)
>   at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:100)
>   at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2849)
>   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2831)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:389)
>   at org.apache.hadoop.fs.Path.getFileSystem(Path.java:356)
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities.isEmptyPath(Utilities.java:2605)
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities.isEmptyPath(Utilities.java:2601)
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities$GetInputPathsCallable.call(Utilities.java:3409)
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities.getInputPaths(Utilities.java:3347)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.cloneJobConf(SparkPlanGenerator.java:299)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generate(SparkPlanGenerator.java:222)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generate(SparkPlanGenerator.java:109)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient$JobStatusJob.call(RemoteHiveSparkClient.java:354)
>   at 
> org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:358)
>   at 
> org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:323)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-17178) Spark Partition Pruning Sink Operator can't target multiple Works

2018-02-09 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358336#comment-16358336
 ] 

Rui Li commented on HIVE-17178:
---

[~xuefuz], could you also take a look? Thanks.

> Spark Partition Pruning Sink Operator can't target multiple Works
> -
>
> Key: HIVE-17178
> URL: https://issues.apache.org/jira/browse/HIVE-17178
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: Rui Li
>Priority: Major
> Attachments: HIVE-17178.1.patch, HIVE-17178.2.patch
>
>
> A Spark Partition Pruning Sink Operator cannot be used to target multiple Map 
> Work objects. The entire DPP subtree (SEL-GBY-SPARKPRUNINGSINK) is duplicated 
> if a single table needs to be used to target multiple Map Works.
> The following query shows the issue:
> {code}
> set hive.spark.dynamic.partition.pruning=true;
> set hive.auto.convert.join=true;
> create table part_table_1 (col int) partitioned by (part_col int);
> create table part_table_2 (col int) partitioned by (part_col int);
> create table regular_table (col int);
> insert into table regular_table values (1);
> alter table part_table_1 add partition (part_col=1);
> insert into table part_table_1 partition (part_col=1) values (1), (2), (3), 
> (4);
> alter table part_table_1 add partition (part_col=2);
> insert into table part_table_1 partition (part_col=2) values (1), (2), (3), 
> (4);
> alter table part_table_2 add partition (part_col=1);
> insert into table part_table_2 partition (part_col=1) values (1), (2), (3), 
> (4);
> alter table part_table_2 add partition (part_col=2);
> insert into table part_table_2 partition (part_col=2) values (1), (2), (3), 
> (4);
> explain select * from regular_table, part_table_1, part_table_2 where 
> regular_table.col = part_table_1.part_col and regular_table.col = 
> part_table_2.part_col;
> {code}
> The explain plan is
> {code}
> STAGE DEPENDENCIES:
>   Stage-2 is a root stage
>   Stage-1 depends on stages: Stage-2
>   Stage-0 depends on stages: Stage-1
> STAGE PLANS:
>   Stage: Stage-2
> Spark
>  A masked pattern was here 
>   Vertices:
> Map 1 
> Map Operator Tree:
> TableScan
>   alias: regular_table
>   Statistics: Num rows: 1 Data size: 1 Basic stats: COMPLETE 
> Column stats: NONE
>   Filter Operator
> predicate: col is not null (type: boolean)
> Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
> Select Operator
>   expressions: col (type: int)
>   outputColumnNames: _col0
>   Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
>   Spark HashTable Sink Operator
> keys:
>   0 _col0 (type: int)
>   1 _col1 (type: int)
>   2 _col1 (type: int)
>   Select Operator
> expressions: _col0 (type: int)
> outputColumnNames: _col0
> Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
> Group By Operator
>   keys: _col0 (type: int)
>   mode: hash
>   outputColumnNames: _col0
>   Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
>   Spark Partition Pruning Sink Operator
> partition key expr: part_col
> Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
> target column name: part_col
> target work: Map 2
>   Select Operator
> expressions: _col0 (type: int)
> outputColumnNames: _col0
> Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
> Group By Operator
>   keys: _col0 (type: int)
>   mode: hash
>   outputColumnNames: _col0
>   Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
>   Spark Partition Pruning Sink Operator
> partition key expr: part_col
> Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
> target column name: part_col
> 

[jira] [Commented] (HIVE-18442) HoS: No FileSystem for scheme: nullscan

2018-02-09 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358335#comment-16358335
 ] 

Rui Li commented on HIVE-18442:
---

Hi [~xuefuz], I still can't find a way to add hive-exec to the system class path 
except via the {{spark.yarn.user.classpath.first}} config. So I prefer to go with 
patch v1 instead of hacking around class paths. We can revisit this if there are 
indeed other class loading issues in the future. What do you think?

> HoS: No FileSystem for scheme: nullscan
> ---
>
> Key: HIVE-18442
> URL: https://issues.apache.org/jira/browse/HIVE-18442
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Rui Li
>Assignee: Rui Li
>Priority: Major
> Attachments: HIVE-18442.1.patch, HIVE-18442.2.patch
>
>
> Hit the issue when I run the following query in yarn-cluster mode:
> {code}
> select * from (select key from src where false) a left outer join (select key 
> from srcpart limit 0) b on a.key=b.key;
> {code}
> Stack trace:
> {noformat}
> Job failed with java.io.IOException: No FileSystem for scheme: nullscan
>   at 
> org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2799)
>   at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2810)
>   at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:100)
>   at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2849)
>   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2831)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:389)
>   at org.apache.hadoop.fs.Path.getFileSystem(Path.java:356)
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities.isEmptyPath(Utilities.java:2605)
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities.isEmptyPath(Utilities.java:2601)
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities$GetInputPathsCallable.call(Utilities.java:3409)
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities.getInputPaths(Utilities.java:3347)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.cloneJobConf(SparkPlanGenerator.java:299)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generate(SparkPlanGenerator.java:222)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generate(SparkPlanGenerator.java:109)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient$JobStatusJob.call(RemoteHiveSparkClient.java:354)
>   at 
> org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:358)
>   at 
> org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:323)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-17837) Explicitly check if the HoS Remote Driver has been lost in the RemoteSparkJobMonitor

2018-02-09 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358284#comment-16358284
 ] 

Rui Li commented on HIVE-17837:
---

+1, thanks [~stakiar] for the clarifications

> Explicitly check if the HoS Remote Driver has been lost in the 
> RemoteSparkJobMonitor 
> -
>
> Key: HIVE-17837
> URL: https://issues.apache.org/jira/browse/HIVE-17837
> Project: Hive
>  Issue Type: Sub-task
>  Components: Hive
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
>Priority: Major
> Attachments: HIVE-17837.1.patch, HIVE-17837.2.patch
>
>
> Right now the {{RemoteSparkJobMonitor}} implicitly checks whether the connection 
> to the Spark remote driver is active. It does this every time it triggers an 
> invocation of the {{Rpc#call}} method (so any call to {{SparkClient#run}}).
> There are scenarios where we have seen the {{RemoteSparkJobMonitor}} hang when 
> the connection to the driver dies, because the implicit call fails to be invoked 
> (see HIVE-15860).
> It would be ideal if we made this call explicit, so we fail as soon as we 
> know that the connection to the driver has died.
> The fix has the added benefit that it allows us to fail faster in the case 
> where the {{RemoteSparkJobMonitor}} is in the QUEUED / SENT state. If it's 
> stuck in that state, it won't fail until it hits the monitor timeout (by 
> default 1 minute), even though we already know the connection has died. The 
> error message that is thrown is also a little imprecise: it says there could 
> be queue contention, even though we know the real reason is that the 
> connection was lost.
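As a rough sketch of what the explicit check could look like, with hypothetical 
interfaces standing in for the real monitor and RPC classes:
{code}
// Sketch only: fail the monitor loop as soon as the driver connection is known
// to be dead, instead of waiting for an implicit Rpc#call failure or a timeout.
public class ExplicitDriverCheckSketch {

  interface RemoteDriverConnection {
    boolean isAlive();   // hypothetical explicit liveness check
  }

  interface JobStatus {
    boolean isDone();
  }

  static void monitor(RemoteDriverConnection driver, JobStatus job, long intervalMs)
      throws InterruptedException {
    while (!job.isDone()) {
      if (!driver.isAlive()) {
        // Precise error, rather than blaming queue contention after a timeout.
        throw new IllegalStateException("Connection to the remote Spark driver was lost");
      }
      Thread.sleep(intervalMs);   // then continue normal status polling
    }
  }
}
{code}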



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-18647) Cannot create table: Unknown column 'CREATION_METADATA_MV_CREATION_METADATA_ID_OID'

2018-02-08 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356778#comment-16356778
 ] 

Rui Li commented on HIVE-18647:
---

Seems related to HIVE-18350

> Cannot create table: Unknown column 
> 'CREATION_METADATA_MV_CREATION_METADATA_ID_OID'
> ---
>
> Key: HIVE-18647
> URL: https://issues.apache.org/jira/browse/HIVE-18647
> Project: Hive
>  Issue Type: Bug
>Reporter: Rui Li
>Priority: Major
> Fix For: 3.0.0
>
>
> I'm using the latest master branch code and MySQL as the metastore.
> Creating a table hits this error:
> {noformat}
> 2018-02-07T22:04:55,438 ERROR [41f91bf4-bc49-4a73-baee-e2a1d79b8a4e main] 
> metastore.RetryingHMSHandler: Retrying HMSHandler after 2000 ms (attempt 1 of 
> 10) with error: javax.jdo.JDODataStoreException: Insert of object 
> "org.apache.hadoop.hive.metastore.model.MTable@28d16af8" using statement 
> "INSERT INTO `TBLS` 
> (`TBL_ID`,`CREATE_TIME`,`CREATION_METADATA_MV_CREATION_METADATA_ID_OID`,`DB_ID`,`LAST_ACCESS_TIME`,`OWNER`,`RETENTION`,`IS_REWRITE_ENABLED`,`SD_ID`,`TBL_NAME`,`TBL_TYPE`,`VIEW_EXPANDED_TEXT`,`VIEW_ORIGINAL_TEXT`)
>  VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?)" failed : Unknown column 
> 'CREATION_METADATA_MV_CREATION_METADATA_ID_OID' in 'field list'
> at 
> org.datanucleus.api.jdo.NucleusJDOHelper.getJDOExceptionForNucleusException(NucleusJDOHelper.java:543)
> at 
> org.datanucleus.api.jdo.JDOPersistenceManager.jdoMakePersistent(JDOPersistenceManager.java:729)
> at 
> org.datanucleus.api.jdo.JDOPersistenceManager.makePersistent(JDOPersistenceManager.java:749)
> at 
> org.apache.hadoop.hive.metastore.ObjectStore.createTable(ObjectStore.java:1125)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.hadoop.hive.metastore.RawStoreProxy.invoke(RawStoreProxy.java:97)
> at com.sun.proxy.$Proxy36.createTable(Unknown Source)
> at 
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.create_table_core(HiveMetaStore.java:1506)
> at 
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.create_table_core(HiveMetaStore.java:1412)
> at 
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.create_table_with_environment_context(HiveMetaStore.java:1614)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-18350) load data should rename files consistent with insert statements

2018-02-08 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356776#comment-16356776
 ] 

Rui Li commented on HIVE-18350:
---

Hi [~djaiswal], with this change, I hit [this 
error|https://issues.apache.org/jira/browse/HIVE-18647?focusedCommentId=16356761=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16356761]
 when creating a table. Could you please take a look? Thanks.

> load data should rename files consistent with insert statements
> ---
>
> Key: HIVE-18350
> URL: https://issues.apache.org/jira/browse/HIVE-18350
> Project: Hive
>  Issue Type: Bug
>Reporter: Deepak Jaiswal
>Assignee: Deepak Jaiswal
>Priority: Major
> Attachments: HIVE-18350.1.patch, HIVE-18350.10.patch, 
> HIVE-18350.11.patch, HIVE-18350.12.patch, HIVE-18350.13.patch, 
> HIVE-18350.14.patch, HIVE-18350.15.patch, HIVE-18350.16.patch, 
> HIVE-18350.2.patch, HIVE-18350.3.patch, HIVE-18350.4.patch, 
> HIVE-18350.5.patch, HIVE-18350.6.patch, HIVE-18350.7.patch, 
> HIVE-18350.8.patch, HIVE-18350.9.patch
>
>
> Insert statements create files with names ending in _0, 0001_0, etc. 
> However, the load data statement uses the input file name. That results in an 
> inconsistent naming convention, which makes SMB joins difficult in some 
> scenarios and may cause trouble for other types of queries in the future.
> We need a consistent naming convention.
> For non-bucketed tables, Hive renames all the files regardless of how they 
> were named by the user.
>  For bucketed tables, Hive relies on the user to name the files to match the 
> buckets in non-strict mode. Hive assumes that all data in a file belongs to the 
> same bucket. In strict mode, loading a bucketed table is disabled.
> This will likely affect most of the tests that load data, which is pretty 
> significant; the work is therefore divided into two subtasks for a smoother 
> merge.
> For existing tables in a customer database, it is recommended to reload 
> bucketed tables; otherwise, if the customer runs an SMB join and there is a 
> bucket for which there is no split, there is a possibility of getting 
> incorrect results. However, this is not a regression, as it would happen even 
> without the patch.
> With this patch, however, and after reloading the data, the results should be 
> correct.
> For non-bucketed tables and external tables, there is no difference in 
> behavior and reloading data is not needed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (HIVE-18647) Cannot create table: Unknown column 'CREATION_METADATA_MV_CREATION_METADATA_ID_OID'

2018-02-08 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356761#comment-16356761
 ] 

Rui Li edited comment on HIVE-18647 at 2/8/18 10:18 AM:


Hi [~jcamachorodriguez], with the latest code, I hit a different issue:
{noformat}
2018-02-08T18:13:54,913 ERROR [eeb906f4-bfb8-461f-ada9-fe1b3a8aa22c main] 
metastore.RetryingHMSHandler: Retrying HMSHandler after 2000 ms (attempt 1 of 
10) with error: javax.jdo.JDOException: Exception thrown when executing query : 
SELECT DISTINCT 'org.apache.hadoop.hive.metastore.model.MTable' AS 
`NUCLEUS_TYPE`,`A0`.`BUCKETING_VERSION`,`A0`.`CREATE_TIME`,`A0`.`LAST_ACCESS_TIME`,`A0`.`LOAD_IN_BUCKETED_TABLE`,`A0`.`OWNER`,`A0`.`RETENTION`,`A0`.`IS_REWRITE_ENABLED`,`A0`.`TBL_NAME`,`A0`.`TBL_TYPE`,`A0`.`TBL_ID`
 FROM `TBLS` `A0` LEFT OUTER JOIN `DBS` `B0` ON `A0`.`DB_ID` = `B0`.`DB_ID` 
WHERE `A0`.`TBL_NAME` = ? AND `B0`.`NAME` = ?
at 
org.datanucleus.api.jdo.NucleusJDOHelper.getJDOExceptionForNucleusException(NucleusJDOHelper.java:677)
at org.datanucleus.api.jdo.JDOQuery.executeInternal(JDOQuery.java:391)
at org.datanucleus.api.jdo.JDOQuery.execute(JDOQuery.java:241)
at 
org.apache.hadoop.hive.metastore.ObjectStore.getMTable(ObjectStore.java:1579)
at 
org.apache.hadoop.hive.metastore.ObjectStore.getMTable(ObjectStore.java:1615)
at 
org.apache.hadoop.hive.metastore.ObjectStore.getTable(ObjectStore.java:1333)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.hadoop.hive.metastore.RawStoreProxy.invoke(RawStoreProxy.java:97)
at com.sun.proxy.$Proxy36.getTable(Unknown Source)
at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.is_table_exists(HiveMetaStore.java:1922)
at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.create_table_core(HiveMetaStore.java:1462)

...

NestedThrowablesStackTrace:
com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Unknown column 
'A0.BUCKETING_VERSION' in 'field list'
{noformat}
I re-initialized my metastore using the schema tool but the issue persists.
 


was (Author: lirui):
Hi [~jcamachorodriguez], with the latest code, I hit a different issue:
{noformat}
2018-02-08T18:13:54,913 ERROR [eeb906f4-bfb8-461f-ada9-fe1b3a8aa22c main] 
metastore.RetryingHMSHandler: Retrying HMSHandler after 2000 ms (attempt 1 of 
10) with error: javax.jdo.JDOException: Exception thrown when executing query : 
SELECT DISTINCT 'org.apache.hadoop.hive.metastore.model.MTable' AS 
`NUCLEUS_TYPE`,`A0`.`BUCKETING_VERSION`,`A0`.`CREATE_TIME`,`A0`.`LAST_ACCESS_TIME`,`A0`.`LOAD_IN_BUCKETED_TABLE`,`A0`.`OWNER`,`A0`.`RETENTION`,`A0`.`IS_REWRITE_ENABLED`,`A0`.`TBL_NAME`,`A0`.`TBL_TYPE`,`A0`.`TBL_ID`
 FROM `TBLS` `A0` LEFT OUTER JOIN `DBS` `B0` ON `A0`.`DB_ID` = `B0`.`DB_ID` 
WHERE `A0`.`TBL_NAME` = ? AND `B0`.`NAME` = ?
at 
org.datanucleus.api.jdo.NucleusJDOHelper.getJDOExceptionForNucleusException(NucleusJDOHelper.java:677)
at org.datanucleus.api.jdo.JDOQuery.executeInternal(JDOQuery.java:391)
at org.datanucleus.api.jdo.JDOQuery.execute(JDOQuery.java:241)
at 
org.apache.hadoop.hive.metastore.ObjectStore.getMTable(ObjectStore.java:1579)
at 
org.apache.hadoop.hive.metastore.ObjectStore.getMTable(ObjectStore.java:1615)
at 
org.apache.hadoop.hive.metastore.ObjectStore.getTable(ObjectStore.java:1333)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.hadoop.hive.metastore.RawStoreProxy.invoke(RawStoreProxy.java:97)
at com.sun.proxy.$Proxy36.getTable(Unknown Source)
at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.is_table_exists(HiveMetaStore.java:1922)
at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.create_table_core(HiveMetaStore.java:1462)
{noformat}
I re-initialized my metastore using the schema tool but the issue persists.
 

> Cannot create table: Unknown column 
> 'CREATION_METADATA_MV_CREATION_METADATA_ID_OID'
> ---
>
> Key: HIVE-18647
> URL: https://issues.apache.org/jira/browse/HIVE-18647
> Project: Hive
>  Issue Type: Bug
>Reporter: Rui Li
>Priority: Major
> Fix For: 3.0.0
>
>
> I'm using latest master branch code and mysql 

[jira] [Commented] (HIVE-18647) Cannot create table: Unknown column 'CREATION_METADATA_MV_CREATION_METADATA_ID_OID'

2018-02-08 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356761#comment-16356761
 ] 

Rui Li commented on HIVE-18647:
---

Hi [~jcamachorodriguez], with the latest code, I hit a different issue:
{noformat}
2018-02-08T18:13:54,913 ERROR [eeb906f4-bfb8-461f-ada9-fe1b3a8aa22c main] 
metastore.RetryingHMSHandler: Retrying HMSHandler after 2000 ms (attempt 1 of 
10) with error: javax.jdo.JDOException: Exception thrown when executing query : 
SELECT DISTINCT 'org.apache.hadoop.hive.metastore.model.MTable' AS 
`NUCLEUS_TYPE`,`A0`.`BUCKETING_VERSION`,`A0`.`CREATE_TIME`,`A0`.`LAST_ACCESS_TIME`,`A0`.`LOAD_IN_BUCKETED_TABLE`,`A0`.`OWNER`,`A0`.`RETENTION`,`A0`.`IS_REWRITE_ENABLED`,`A0`.`TBL_NAME`,`A0`.`TBL_TYPE`,`A0`.`TBL_ID`
 FROM `TBLS` `A0` LEFT OUTER JOIN `DBS` `B0` ON `A0`.`DB_ID` = `B0`.`DB_ID` 
WHERE `A0`.`TBL_NAME` = ? AND `B0`.`NAME` = ?
at 
org.datanucleus.api.jdo.NucleusJDOHelper.getJDOExceptionForNucleusException(NucleusJDOHelper.java:677)
at org.datanucleus.api.jdo.JDOQuery.executeInternal(JDOQuery.java:391)
at org.datanucleus.api.jdo.JDOQuery.execute(JDOQuery.java:241)
at 
org.apache.hadoop.hive.metastore.ObjectStore.getMTable(ObjectStore.java:1579)
at 
org.apache.hadoop.hive.metastore.ObjectStore.getMTable(ObjectStore.java:1615)
at 
org.apache.hadoop.hive.metastore.ObjectStore.getTable(ObjectStore.java:1333)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.hadoop.hive.metastore.RawStoreProxy.invoke(RawStoreProxy.java:97)
at com.sun.proxy.$Proxy36.getTable(Unknown Source)
at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.is_table_exists(HiveMetaStore.java:1922)
at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.create_table_core(HiveMetaStore.java:1462)
{noformat}
I re-initialized my metastore using the schema tool but the issue persists.
 

> Cannot create table: Unknown column 
> 'CREATION_METADATA_MV_CREATION_METADATA_ID_OID'
> ---
>
> Key: HIVE-18647
> URL: https://issues.apache.org/jira/browse/HIVE-18647
> Project: Hive
>  Issue Type: Bug
>Reporter: Rui Li
>Priority: Major
> Fix For: 3.0.0
>
>
> I'm using the latest master branch code and MySQL as the metastore.
> Creating a table hits this error:
> {noformat}
> 2018-02-07T22:04:55,438 ERROR [41f91bf4-bc49-4a73-baee-e2a1d79b8a4e main] 
> metastore.RetryingHMSHandler: Retrying HMSHandler after 2000 ms (attempt 1 of 
> 10) with error: javax.jdo.JDODataStoreException: Insert of object 
> "org.apache.hadoop.hive.metastore.model.MTable@28d16af8" using statement 
> "INSERT INTO `TBLS` 
> (`TBL_ID`,`CREATE_TIME`,`CREATION_METADATA_MV_CREATION_METADATA_ID_OID`,`DB_ID`,`LAST_ACCESS_TIME`,`OWNER`,`RETENTION`,`IS_REWRITE_ENABLED`,`SD_ID`,`TBL_NAME`,`TBL_TYPE`,`VIEW_EXPANDED_TEXT`,`VIEW_ORIGINAL_TEXT`)
>  VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?)" failed : Unknown column 
> 'CREATION_METADATA_MV_CREATION_METADATA_ID_OID' in 'field list'
> at 
> org.datanucleus.api.jdo.NucleusJDOHelper.getJDOExceptionForNucleusException(NucleusJDOHelper.java:543)
> at 
> org.datanucleus.api.jdo.JDOPersistenceManager.jdoMakePersistent(JDOPersistenceManager.java:729)
> at 
> org.datanucleus.api.jdo.JDOPersistenceManager.makePersistent(JDOPersistenceManager.java:749)
> at 
> org.apache.hadoop.hive.metastore.ObjectStore.createTable(ObjectStore.java:1125)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.hadoop.hive.metastore.RawStoreProxy.invoke(RawStoreProxy.java:97)
> at com.sun.proxy.$Proxy36.createTable(Unknown Source)
> at 
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.create_table_core(HiveMetaStore.java:1506)
> at 
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.create_table_core(HiveMetaStore.java:1412)
> at 
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.create_table_with_environment_context(HiveMetaStore.java:1614)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-18647) Cannot create table: Unknown column 'CREATION_METADATA_MV_CREATION_METADATA_ID_OID'

2018-02-07 Thread Rui Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-18647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rui Li updated HIVE-18647:
--
Description: 
I'm using the latest master branch code and MySQL as the metastore.
Creating a table hits this error:
{noformat}
2018-02-07T22:04:55,438 ERROR [41f91bf4-bc49-4a73-baee-e2a1d79b8a4e main] 
metastore.RetryingHMSHandler: Retrying HMSHandler after 2000 ms (attempt 1 of 
10) with error: javax.jdo.JDODataStoreException: Insert of object 
"org.apache.hadoop.hive.metastore.model.MTable@28d16af8" using statement 
"INSERT INTO `TBLS` 
(`TBL_ID`,`CREATE_TIME`,`CREATION_METADATA_MV_CREATION_METADATA_ID_OID`,`DB_ID`,`LAST_ACCESS_TIME`,`OWNER`,`RETENTION`,`IS_REWRITE_ENABLED`,`SD_ID`,`TBL_NAME`,`TBL_TYPE`,`VIEW_EXPANDED_TEXT`,`VIEW_ORIGINAL_TEXT`)
 VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?)" failed : Unknown column 
'CREATION_METADATA_MV_CREATION_METADATA_ID_OID' in 'field list'
at 
org.datanucleus.api.jdo.NucleusJDOHelper.getJDOExceptionForNucleusException(NucleusJDOHelper.java:543)
at 
org.datanucleus.api.jdo.JDOPersistenceManager.jdoMakePersistent(JDOPersistenceManager.java:729)
at 
org.datanucleus.api.jdo.JDOPersistenceManager.makePersistent(JDOPersistenceManager.java:749)
at 
org.apache.hadoop.hive.metastore.ObjectStore.createTable(ObjectStore.java:1125)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.hadoop.hive.metastore.RawStoreProxy.invoke(RawStoreProxy.java:97)
at com.sun.proxy.$Proxy36.createTable(Unknown Source)
at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.create_table_core(HiveMetaStore.java:1506)
at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.create_table_core(HiveMetaStore.java:1412)
at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.create_table_with_environment_context(HiveMetaStore.java:1614)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
{noformat}

> Cannot create table: Unknown column 
> 'CREATION_METADATA_MV_CREATION_METADATA_ID_OID'
> ---
>
> Key: HIVE-18647
> URL: https://issues.apache.org/jira/browse/HIVE-18647
> Project: Hive
>  Issue Type: Bug
>Reporter: Rui Li
>Priority: Major
>
> I'm using the latest master branch code and MySQL as the metastore.
> Creating a table hits this error:
> {noformat}
> 2018-02-07T22:04:55,438 ERROR [41f91bf4-bc49-4a73-baee-e2a1d79b8a4e main] 
> metastore.RetryingHMSHandler: Retrying HMSHandler after 2000 ms (attempt 1 of 
> 10) with error: javax.jdo.JDODataStoreException: Insert of object 
> "org.apache.hadoop.hive.metastore.model.MTable@28d16af8" using statement 
> "INSERT INTO `TBLS` 
> (`TBL_ID`,`CREATE_TIME`,`CREATION_METADATA_MV_CREATION_METADATA_ID_OID`,`DB_ID`,`LAST_ACCESS_TIME`,`OWNER`,`RETENTION`,`IS_REWRITE_ENABLED`,`SD_ID`,`TBL_NAME`,`TBL_TYPE`,`VIEW_EXPANDED_TEXT`,`VIEW_ORIGINAL_TEXT`)
>  VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?)" failed : Unknown column 
> 'CREATION_METADATA_MV_CREATION_METADATA_ID_OID' in 'field list'
> at 
> org.datanucleus.api.jdo.NucleusJDOHelper.getJDOExceptionForNucleusException(NucleusJDOHelper.java:543)
> at 
> org.datanucleus.api.jdo.JDOPersistenceManager.jdoMakePersistent(JDOPersistenceManager.java:729)
> at 
> org.datanucleus.api.jdo.JDOPersistenceManager.makePersistent(JDOPersistenceManager.java:749)
> at 
> org.apache.hadoop.hive.metastore.ObjectStore.createTable(ObjectStore.java:1125)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.hadoop.hive.metastore.RawStoreProxy.invoke(RawStoreProxy.java:97)
> at com.sun.proxy.$Proxy36.createTable(Unknown Source)
> at 
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.create_table_core(HiveMetaStore.java:1506)
> at 
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.create_table_core(HiveMetaStore.java:1412)
> at 
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.create_table_with_environment_context(HiveMetaStore.java:1614)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-18368) Improve Spark Debug RDD Graph

2018-02-07 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16355472#comment-16355472
 ] 

Rui Li commented on HIVE-18368:
---

+1. Thanks [~stakiar] for the update.

> Improve Spark Debug RDD Graph
> -
>
> Key: HIVE-18368
> URL: https://issues.apache.org/jira/browse/HIVE-18368
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
>Priority: Major
> Attachments: Completed Stages.png, HIVE-18368.1.patch, 
> HIVE-18368.2.patch, HIVE-18368.3.patch, HIVE-18368.4.patch, Job Ids.png, 
> Stage DAG 1.png, Stage DAG 2.png
>
>
> The {{SparkPlan}} class does some logging to show the mapping between 
> different {{SparkTran}}, what shuffle types are used, and what trans are 
> cached. However, there is room for improvement.
> When debug logging is enabled the RDD graph is logged, but there isn't much 
> information printed about each RDD.
> We should combine both of the graphs and improve them. We could even make the 
> Spark Plan graph part of the {{EXPLAIN EXTENDED}} output.
> Ideally, the final graph shows a clear relationship between Tran objects, 
> RDDs, and BaseWorks. Edge should include information about number of 
> partitions, shuffle types, Spark operations used, etc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-18442) HoS: No FileSystem for scheme: nullscan

2018-02-06 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16354088#comment-16354088
 ] 

Rui Li commented on HIVE-18442:
---

I made a mistake about spark.driver.extraClassPath: it can only contain jars 
that are locally available on all nodes in the cluster, and is thus not suitable 
for fixing this issue. I'll explore other options.

> HoS: No FileSystem for scheme: nullscan
> ---
>
> Key: HIVE-18442
> URL: https://issues.apache.org/jira/browse/HIVE-18442
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Rui Li
>Assignee: Rui Li
>Priority: Major
> Attachments: HIVE-18442.1.patch, HIVE-18442.2.patch
>
>
> Hit the issue when I run the following query in yarn-cluster mode:
> {code}
> select * from (select key from src where false) a left outer join (select key 
> from srcpart limit 0) b on a.key=b.key;
> {code}
> Stack trace:
> {noformat}
> Job failed with java.io.IOException: No FileSystem for scheme: nullscan
>   at 
> org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2799)
>   at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2810)
>   at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:100)
>   at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2849)
>   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2831)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:389)
>   at org.apache.hadoop.fs.Path.getFileSystem(Path.java:356)
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities.isEmptyPath(Utilities.java:2605)
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities.isEmptyPath(Utilities.java:2601)
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities$GetInputPathsCallable.call(Utilities.java:3409)
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities.getInputPaths(Utilities.java:3347)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.cloneJobConf(SparkPlanGenerator.java:299)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generate(SparkPlanGenerator.java:222)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generate(SparkPlanGenerator.java:109)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient$JobStatusJob.call(RemoteHiveSparkClient.java:354)
>   at 
> org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:358)
>   at 
> org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:323)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-18442) HoS: No FileSystem for scheme: nullscan

2018-02-01 Thread Rui Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-18442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rui Li updated HIVE-18442:
--
Attachment: HIVE-18442.2.patch

> HoS: No FileSystem for scheme: nullscan
> ---
>
> Key: HIVE-18442
> URL: https://issues.apache.org/jira/browse/HIVE-18442
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Rui Li
>Assignee: Rui Li
>Priority: Major
> Attachments: HIVE-18442.1.patch, HIVE-18442.2.patch
>
>
> Hit the issue when I run the following query in yarn-cluster mode:
> {code}
> select * from (select key from src where false) a left outer join (select key 
> from srcpart limit 0) b on a.key=b.key;
> {code}
> Stack trace:
> {noformat}
> Job failed with java.io.IOException: No FileSystem for scheme: nullscan
>   at 
> org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2799)
>   at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2810)
>   at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:100)
>   at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2849)
>   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2831)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:389)
>   at org.apache.hadoop.fs.Path.getFileSystem(Path.java:356)
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities.isEmptyPath(Utilities.java:2605)
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities.isEmptyPath(Utilities.java:2601)
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities$GetInputPathsCallable.call(Utilities.java:3409)
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities.getInputPaths(Utilities.java:3347)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.cloneJobConf(SparkPlanGenerator.java:299)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generate(SparkPlanGenerator.java:222)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generate(SparkPlanGenerator.java:109)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient$JobStatusJob.call(RemoteHiveSparkClient.java:354)
>   at 
> org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:358)
>   at 
> org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:323)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-18442) HoS: No FileSystem for scheme: nullscan

2018-02-01 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16348266#comment-16348266
 ] 

Rui Li commented on HIVE-18442:
---

[~xuefuz], the two options achieve the same purpose: add hive-exec.jar to the 
class path when launching the JVM. It means we'll add hive-exec.jar twice, which 
I think is OK because we've been doing this in our tests.
I'll upload a patch to use the extra driver class path; 
{{spark.yarn.user.classpath.first}} is not a documented property, so I prefer 
not to use it.
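For reference, a minimal sketch of setting the property programmatically on a 
{{SparkConf}}; in HoS it would normally be set through the Hive/Spark 
configuration instead, and the jar path below is only a placeholder:
{code}
import org.apache.spark.SparkConf;

public class DriverClassPathSketch {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf();
    // Placeholder path: the jar must be readable by the driver JVM when it launches.
    conf.set("spark.driver.extraClassPath", "/path/to/hive-exec.jar");
    System.out.println(conf.get("spark.driver.extraClassPath"));
  }
}
{code}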

> HoS: No FileSystem for scheme: nullscan
> ---
>
> Key: HIVE-18442
> URL: https://issues.apache.org/jira/browse/HIVE-18442
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Rui Li
>Assignee: Rui Li
>Priority: Major
> Attachments: HIVE-18442.1.patch
>
>
> Hit the issue when I run the following query in yarn-cluster mode:
> {code}
> select * from (select key from src where false) a left outer join (select key 
> from srcpart limit 0) b on a.key=b.key;
> {code}
> Stack trace:
> {noformat}
> Job failed with java.io.IOException: No FileSystem for scheme: nullscan
>   at 
> org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2799)
>   at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2810)
>   at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:100)
>   at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2849)
>   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2831)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:389)
>   at org.apache.hadoop.fs.Path.getFileSystem(Path.java:356)
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities.isEmptyPath(Utilities.java:2605)
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities.isEmptyPath(Utilities.java:2601)
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities$GetInputPathsCallable.call(Utilities.java:3409)
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities.getInputPaths(Utilities.java:3347)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.cloneJobConf(SparkPlanGenerator.java:299)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generate(SparkPlanGenerator.java:222)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generate(SparkPlanGenerator.java:109)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient$JobStatusJob.call(RemoteHiveSparkClient.java:354)
>   at 
> org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:358)
>   at 
> org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:323)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-18301) Investigate to enable MapInput cache in Hive on Spark

2018-01-30 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16346196#comment-16346196
 ] 

Rui Li commented on HIVE-18301:
---

Hi [~kellyzly], is the input path the only thing we need to store with the cached 
RDD? The IOContext has quite a few other fields, and I wonder whether they are 
available if the RDD is cached.

> Investigate to enable MapInput cache in Hive on Spark
> -
>
> Key: HIVE-18301
> URL: https://issues.apache.org/jira/browse/HIVE-18301
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang
>Assignee: liyunzhang
>Priority: Major
> Attachments: HIVE-18301.1.patch, HIVE-18301.patch
>
>
> An IOContext problem was found in MapTran when Spark RDD cache was enabled 
> (HIVE-8920), so we disabled RDD cache in MapTran at 
> [SparkPlanGenerator|https://github.com/kellyzly/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java#L202].
> The problem is that IOContext doesn't seem to be initialized correctly in Spark 
> yarn client/cluster mode, which causes an exception like 
> {code}
> Job aborted due to stage failure: Task 93 in stage 0.0 failed 4 times, most 
> recent failure: Lost task 93.3 in stage 0.0 (TID 616, bdpe48): 
> java.lang.RuntimeException: Error processing row: 
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.processRow(SparkMapRecordHandler.java:165)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:48)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:27)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList.hasNext(HiveBaseFunctionResultList.java:85)
>   at 
> scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.exec.AbstractMapOperator.getNominalPath(AbstractMapOperator.java:101)
>   at 
> org.apache.hadoop.hive.ql.exec.MapOperator.cleanUpInputFileChangedOp(MapOperator.java:516)
>   at 
> org.apache.hadoop.hive.ql.exec.Operator.cleanUpInputFileChanged(Operator.java:1187)
>   at 
> org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:546)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.processRow(SparkMapRecordHandler.java:152)
>   ... 12 more
> Driver stacktrace:
> {code}
> in yarn client/cluster mode, sometimes 
> [ExecMapperContext#currentInputPath|https://github.com/kellyzly/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/mr/ExecMapperContext.java#L109]
>  is null when rdd cache is enabled.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-17178) Spark Partition Pruning Sink Operator can't target multiple Works

2018-01-30 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16345025#comment-16345025
 ] 

Rui Li commented on HIVE-17178:
---

[~stakiar], could you take a look? Thanks.

> Spark Partition Pruning Sink Operator can't target multiple Works
> -
>
> Key: HIVE-17178
> URL: https://issues.apache.org/jira/browse/HIVE-17178
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: Rui Li
>Priority: Major
> Attachments: HIVE-17178.1.patch, HIVE-17178.2.patch
>
>
> A Spark Partition Pruning Sink Operator cannot be used to target multiple Map 
> Work objects. The entire DPP subtree (SEL-GBY-SPARKPRUNINGSINK) is duplicated 
> if a single table needs to be used to target multiple Map Works.
> The following query shows the issue:
> {code}
> set hive.spark.dynamic.partition.pruning=true;
> set hive.auto.convert.join=true;
> create table part_table_1 (col int) partitioned by (part_col int);
> create table part_table_2 (col int) partitioned by (part_col int);
> create table regular_table (col int);
> insert into table regular_table values (1);
> alter table part_table_1 add partition (part_col=1);
> insert into table part_table_1 partition (part_col=1) values (1), (2), (3), 
> (4);
> alter table part_table_1 add partition (part_col=2);
> insert into table part_table_1 partition (part_col=2) values (1), (2), (3), 
> (4);
> alter table part_table_2 add partition (part_col=1);
> insert into table part_table_2 partition (part_col=1) values (1), (2), (3), 
> (4);
> alter table part_table_2 add partition (part_col=2);
> insert into table part_table_2 partition (part_col=2) values (1), (2), (3), 
> (4);
> explain select * from regular_table, part_table_1, part_table_2 where 
> regular_table.col = part_table_1.part_col and regular_table.col = 
> part_table_2.part_col;
> {code}
> The explain plan is
> {code}
> STAGE DEPENDENCIES:
>   Stage-2 is a root stage
>   Stage-1 depends on stages: Stage-2
>   Stage-0 depends on stages: Stage-1
> STAGE PLANS:
>   Stage: Stage-2
> Spark
>  A masked pattern was here 
>   Vertices:
> Map 1 
> Map Operator Tree:
> TableScan
>   alias: regular_table
>   Statistics: Num rows: 1 Data size: 1 Basic stats: COMPLETE 
> Column stats: NONE
>   Filter Operator
> predicate: col is not null (type: boolean)
> Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
> Select Operator
>   expressions: col (type: int)
>   outputColumnNames: _col0
>   Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
>   Spark HashTable Sink Operator
> keys:
>   0 _col0 (type: int)
>   1 _col1 (type: int)
>   2 _col1 (type: int)
>   Select Operator
> expressions: _col0 (type: int)
> outputColumnNames: _col0
> Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
> Group By Operator
>   keys: _col0 (type: int)
>   mode: hash
>   outputColumnNames: _col0
>   Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
>   Spark Partition Pruning Sink Operator
> partition key expr: part_col
> Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
> target column name: part_col
> target work: Map 2
>   Select Operator
> expressions: _col0 (type: int)
> outputColumnNames: _col0
> Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
> Group By Operator
>   keys: _col0 (type: int)
>   mode: hash
>   outputColumnNames: _col0
>   Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
>   Spark Partition Pruning Sink Operator
> partition key expr: part_col
> Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
> target column name: part_col
> 

[jira] [Commented] (HIVE-18368) Improve Spark Debug RDD Graph

2018-01-26 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16340806#comment-16340806
 ] 

Rui Li commented on HIVE-18368:
---

Looks good to me overall. Left some minor comments on RB.

> Improve Spark Debug RDD Graph
> -
>
> Key: HIVE-18368
> URL: https://issues.apache.org/jira/browse/HIVE-18368
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
>Priority: Major
> Attachments: Completed Stages.png, HIVE-18368.1.patch, 
> HIVE-18368.2.patch, HIVE-18368.3.patch, Job Ids.png, Stage DAG 1.png, Stage 
> DAG 2.png
>
>
> The {{SparkPlan}} class does some logging to show the mapping between 
> different {{SparkTran}}, what shuffle types are used, and what trans are 
> cached. However, there is room for improvement.
> When debug logging is enabled the RDD graph is logged, but there isn't much 
> information printed about each RDD.
> We should combine both of the graphs and improve them. We could even make the 
> Spark Plan graph part of the {{EXPLAIN EXTENDED}} output.
> Ideally, the final graph shows a clear relationship between Tran objects, 
> RDDs, and BaseWorks. Edge should include information about number of 
> partitions, shuffle types, Spark operations used, etc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-18301) Investigate to enable MapInput cache in Hive on Spark

2018-01-24 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16337632#comment-16337632
 ] 

Rui Li commented on HIVE-18301:
---

Hi [~kellyzly], does the proposed solution mean we need to cache the input path 
for each record of the table? I wonder whether we can reuse the Text for the same 
input paths. Besides, it's not efficient to check the job conf each time we 
process a row. You can just check it once and remember the value. Anyway, it's 
better to get some measurement of the overhead.
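A minimal sketch of the two suggestions, using made-up class and property names: 
read the flag from the job conf once, and reuse a single Text instance for the 
current input path instead of allocating one per row:
{code}
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;

// Sketch only: hypothetical helper illustrating the idea, not actual Hive code.
public class CachedInputPathTracker {
  private final boolean rddCacheEnabled;      // looked up once, not per row
  private final Text currentPath = new Text(); // reused for every row

  public CachedInputPathTracker(JobConf jobConf) {
    // "hive.spark.rdd.cache.enabled" is a made-up property name for illustration.
    this.rddCacheEnabled = jobConf.getBoolean("hive.spark.rdd.cache.enabled", false);
  }

  public void onRow(String inputPath) {
    if (!rddCacheEnabled) {
      return;
    }
    // Only update the reused Text when the path actually changes.
    if (!currentPath.toString().equals(inputPath)) {
      currentPath.set(inputPath);
    }
  }
}
{code}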

> Investigate to enable MapInput cache in Hive on Spark
> -
>
> Key: HIVE-18301
> URL: https://issues.apache.org/jira/browse/HIVE-18301
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang
>Assignee: liyunzhang
>Priority: Major
> Attachments: HIVE-18301.patch
>
>
> An IOContext problem was found in MapTran when Spark RDD cache was enabled 
> (HIVE-8920), so we disabled RDD cache in MapTran at 
> [SparkPlanGenerator|https://github.com/kellyzly/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java#L202].
> The problem is that IOContext doesn't seem to be initialized correctly in Spark 
> yarn client/cluster mode, which causes an exception like 
> {code}
> Job aborted due to stage failure: Task 93 in stage 0.0 failed 4 times, most 
> recent failure: Lost task 93.3 in stage 0.0 (TID 616, bdpe48): 
> java.lang.RuntimeException: Error processing row: 
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.processRow(SparkMapRecordHandler.java:165)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:48)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:27)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList.hasNext(HiveBaseFunctionResultList.java:85)
>   at 
> scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.exec.AbstractMapOperator.getNominalPath(AbstractMapOperator.java:101)
>   at 
> org.apache.hadoop.hive.ql.exec.MapOperator.cleanUpInputFileChangedOp(MapOperator.java:516)
>   at 
> org.apache.hadoop.hive.ql.exec.Operator.cleanUpInputFileChanged(Operator.java:1187)
>   at 
> org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:546)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.processRow(SparkMapRecordHandler.java:152)
>   ... 12 more
> Driver stacktrace:
> {code}
> in yarn client/cluster mode, sometimes 
> [ExecMapperContext#currentInputPath|https://github.com/kellyzly/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/mr/ExecMapperContext.java#L109]
>  is null when rdd cache is enabled.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-18301) Investigate to enable MapInput cache in Hive on Spark

2018-01-23 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16336898#comment-16336898
 ] 

Rui Li commented on HIVE-18301:
---

{quote}We need not to call MapOperator#cleanUpInputFileChanged because 
MapOperator#cleanUpInputFileChanged is only designed for one Mapper scanning 
multiple files
{quote}
When the RDD is cached, the mapper reads records from the cache. But I think those 
records may come from multiple underlying files, right? And we won't be able to 
tell the file boundaries because they're cached.

> Investigate to enable MapInput cache in Hive on Spark
> -
>
> Key: HIVE-18301
> URL: https://issues.apache.org/jira/browse/HIVE-18301
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang
>Assignee: liyunzhang
>Priority: Major
>
> An IOContext problem was found in MapTran when Spark RDD cache was enabled 
> (HIVE-8920), so we disabled RDD cache in MapTran at 
> [SparkPlanGenerator|https://github.com/kellyzly/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java#L202].
> The problem is that IOContext doesn't seem to be initialized correctly in Spark 
> yarn client/cluster mode, which causes an exception like 
> {code}
> Job aborted due to stage failure: Task 93 in stage 0.0 failed 4 times, most 
> recent failure: Lost task 93.3 in stage 0.0 (TID 616, bdpe48): 
> java.lang.RuntimeException: Error processing row: 
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.processRow(SparkMapRecordHandler.java:165)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:48)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:27)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList.hasNext(HiveBaseFunctionResultList.java:85)
>   at 
> scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.exec.AbstractMapOperator.getNominalPath(AbstractMapOperator.java:101)
>   at 
> org.apache.hadoop.hive.ql.exec.MapOperator.cleanUpInputFileChangedOp(MapOperator.java:516)
>   at 
> org.apache.hadoop.hive.ql.exec.Operator.cleanUpInputFileChanged(Operator.java:1187)
>   at 
> org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:546)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.processRow(SparkMapRecordHandler.java:152)
>   ... 12 more
> Driver stacktrace:
> {code}
> in yarn client/cluster mode, sometimes 
> [ExecMapperContext#currentInputPath|https://github.com/kellyzly/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/mr/ExecMapperContext.java#L109]
>  is null when RDD cache is enabled.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-17178) Spark Partition Pruning Sink Operator can't target multiple Works

2018-01-22 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16334274#comment-16334274
 ] 

Rui Li commented on HIVE-17178:
---

Update with some refactoring and a new test. A brief summary of the change:
# The main idea is that we'll try to find an existing reusable DPP branch when a new 
DPP branch is created in {{DynamicPartitionPruningOptimization}}. To be 
reusable, the existing DPP branch should have the same operators as the new DPP 
branch (same comparison logic as in {{CombineEquivalentWorkResolver}}). If such a 
reusable DPP branch exists, we just add the new target to it and discard the newly 
created one.
# The change breaks the assumption made in several places that a DPP sink has a 
single target. So existing code that accesses the target of a DPP sink is changed 
to a loop that iterates over all the targets (see the sketch after this list).
# Changes to how we clone the sub-tree in {{SplitOpTreeForDPP}}. The current code 
has some problems associating cloned DPP sinks with the original ones. With the 
patch, we avoid cloning DPP branches, which hopefully makes the logic a 
little simpler.
# Changes to {{SparkDynamicPartitionPruner::processFiles}}. MapWork now uses the 
DPP sink's unique ID as the source event ID, so if a unique ID exists in the 
source event map, it must have output for all the associated target columns. 
And when we read from the output file, we update the values for all the columns 
associated with that unique ID.
# Some refactoring to make the code more reusable.
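
Below is a minimal, self-contained sketch of point 2 (hypothetical classes, not the 
real SparkPartitionPruningSinkDesc API): the sink holds a list of targets rather 
than a single one, and callers iterate over all of them:
{code:java}
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a pruning sink that can target multiple map works.
public class MultiTargetSinkDemo {
  static class Target {
    final String workName;     // e.g. "Map 2"
    final String partitionCol; // e.g. "part_col"
    Target(String workName, String partitionCol) {
      this.workName = workName;
      this.partitionCol = partitionCol;
    }
  }

  static class PruningSink {
    private final List<Target> targets = new ArrayList<>();
    // Reusing an existing DPP branch boils down to adding another target to its sink.
    void addTarget(Target t) { targets.add(t); }
    // Callers must iterate instead of assuming a single target.
    List<Target> getTargets() { return targets; }
  }

  public static void main(String[] args) {
    PruningSink sink = new PruningSink();
    sink.addTarget(new Target("Map 2", "part_col"));
    sink.addTarget(new Target("Map 3", "part_col")); // second target reuses the same branch
    for (Target t : sink.getTargets()) {
      System.out.println("prune " + t.partitionCol + " for " + t.workName);
    }
  }
}
{code}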

> Spark Partition Pruning Sink Operator can't target multiple Works
> -
>
> Key: HIVE-17178
> URL: https://issues.apache.org/jira/browse/HIVE-17178
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: Rui Li
>Priority: Major
> Attachments: HIVE-17178.1.patch, HIVE-17178.2.patch
>
>
> A Spark Partition Pruning Sink Operator cannot be used to target multiple Map 
> Work objects. The entire DPP subtree (SEL-GBY-SPARKPRUNINGSINK) is duplicated 
> if a single table needs to be used to target multiple Map Works.
> The following query shows the issue:
> {code}
> set hive.spark.dynamic.partition.pruning=true;
> set hive.auto.convert.join=true;
> create table part_table_1 (col int) partitioned by (part_col int);
> create table part_table_2 (col int) partitioned by (part_col int);
> create table regular_table (col int);
> insert into table regular_table values (1);
> alter table part_table_1 add partition (part_col=1);
> insert into table part_table_1 partition (part_col=1) values (1), (2), (3), 
> (4);
> alter table part_table_1 add partition (part_col=2);
> insert into table part_table_1 partition (part_col=2) values (1), (2), (3), 
> (4);
> alter table part_table_2 add partition (part_col=1);
> insert into table part_table_2 partition (part_col=1) values (1), (2), (3), 
> (4);
> alter table part_table_2 add partition (part_col=2);
> insert into table part_table_2 partition (part_col=2) values (1), (2), (3), 
> (4);
> explain select * from regular_table, part_table_1, part_table_2 where 
> regular_table.col = part_table_1.part_col and regular_table.col = 
> part_table_2.part_col;
> {code}
> The explain plan is
> {code}
> STAGE DEPENDENCIES:
>   Stage-2 is a root stage
>   Stage-1 depends on stages: Stage-2
>   Stage-0 depends on stages: Stage-1
> STAGE PLANS:
>   Stage: Stage-2
> Spark
>  A masked pattern was here 
>   Vertices:
> Map 1 
> Map Operator Tree:
> TableScan
>   alias: regular_table
>   Statistics: Num rows: 1 Data size: 1 Basic stats: COMPLETE 
> Column stats: NONE
>   Filter Operator
> predicate: col is not null (type: boolean)
> Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
> Select Operator
>   expressions: col (type: int)
>   outputColumnNames: _col0
>   Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
>   Spark HashTable Sink Operator
> keys:
>   0 _col0 (type: int)
>   1 _col1 (type: int)
>   2 _col1 (type: int)
>   Select Operator
> expressions: _col0 (type: int)
> outputColumnNames: _col0
> Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
> Group By Operator
>   keys: _col0 (type: int)
>   mode: hash
>   outputColumnNames: _col0
>   

[jira] [Updated] (HIVE-17178) Spark Partition Pruning Sink Operator can't target multiple Works

2018-01-22 Thread Rui Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rui Li updated HIVE-17178:
--
Attachment: HIVE-17178.2.patch

> Spark Partition Pruning Sink Operator can't target multiple Works
> -
>
> Key: HIVE-17178
> URL: https://issues.apache.org/jira/browse/HIVE-17178
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: Rui Li
>Priority: Major
> Attachments: HIVE-17178.1.patch, HIVE-17178.2.patch
>
>
> A Spark Partition Pruning Sink Operator cannot be used to target multiple Map 
> Work objects. The entire DPP subtree (SEL-GBY-SPARKPRUNINGSINK) is duplicated 
> if a single table needs to be used to target multiple Map Works.
> The following query shows the issue:
> {code}
> set hive.spark.dynamic.partition.pruning=true;
> set hive.auto.convert.join=true;
> create table part_table_1 (col int) partitioned by (part_col int);
> create table part_table_2 (col int) partitioned by (part_col int);
> create table regular_table (col int);
> insert into table regular_table values (1);
> alter table part_table_1 add partition (part_col=1);
> insert into table part_table_1 partition (part_col=1) values (1), (2), (3), 
> (4);
> alter table part_table_1 add partition (part_col=2);
> insert into table part_table_1 partition (part_col=2) values (1), (2), (3), 
> (4);
> alter table part_table_2 add partition (part_col=1);
> insert into table part_table_2 partition (part_col=1) values (1), (2), (3), 
> (4);
> alter table part_table_2 add partition (part_col=2);
> insert into table part_table_2 partition (part_col=2) values (1), (2), (3), 
> (4);
> explain select * from regular_table, part_table_1, part_table_2 where 
> regular_table.col = part_table_1.part_col and regular_table.col = 
> part_table_2.part_col;
> {code}
> The explain plan is
> {code}
> STAGE DEPENDENCIES:
>   Stage-2 is a root stage
>   Stage-1 depends on stages: Stage-2
>   Stage-0 depends on stages: Stage-1
> STAGE PLANS:
>   Stage: Stage-2
> Spark
>  A masked pattern was here 
>   Vertices:
> Map 1 
> Map Operator Tree:
> TableScan
>   alias: regular_table
>   Statistics: Num rows: 1 Data size: 1 Basic stats: COMPLETE 
> Column stats: NONE
>   Filter Operator
> predicate: col is not null (type: boolean)
> Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
> Select Operator
>   expressions: col (type: int)
>   outputColumnNames: _col0
>   Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
>   Spark HashTable Sink Operator
> keys:
>   0 _col0 (type: int)
>   1 _col1 (type: int)
>   2 _col1 (type: int)
>   Select Operator
> expressions: _col0 (type: int)
> outputColumnNames: _col0
> Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
> Group By Operator
>   keys: _col0 (type: int)
>   mode: hash
>   outputColumnNames: _col0
>   Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
>   Spark Partition Pruning Sink Operator
> partition key expr: part_col
> Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
> target column name: part_col
> target work: Map 2
>   Select Operator
> expressions: _col0 (type: int)
> outputColumnNames: _col0
> Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
> Group By Operator
>   keys: _col0 (type: int)
>   mode: hash
>   outputColumnNames: _col0
>   Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
>   Spark Partition Pruning Sink Operator
> partition key expr: part_col
> Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
> target column name: part_col
> target work: Map 3
> 

[jira] [Updated] (HIVE-17178) Spark Partition Pruning Sink Operator can't target multiple Works

2018-01-18 Thread Rui Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rui Li updated HIVE-17178:
--
Status: Patch Available  (was: Open)

> Spark Partition Pruning Sink Operator can't target multiple Works
> -
>
> Key: HIVE-17178
> URL: https://issues.apache.org/jira/browse/HIVE-17178
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: Rui Li
>Priority: Major
> Attachments: HIVE-17178.1.patch
>
>
> A Spark Partition Pruning Sink Operator cannot be used to target multiple Map 
> Work objects. The entire DPP subtree (SEL-GBY-SPARKPRUNINGSINK) is duplicated 
> if a single table needs to be used to target multiple Map Works.
> The following query shows the issue:
> {code}
> set hive.spark.dynamic.partition.pruning=true;
> set hive.auto.convert.join=true;
> create table part_table_1 (col int) partitioned by (part_col int);
> create table part_table_2 (col int) partitioned by (part_col int);
> create table regular_table (col int);
> insert into table regular_table values (1);
> alter table part_table_1 add partition (part_col=1);
> insert into table part_table_1 partition (part_col=1) values (1), (2), (3), 
> (4);
> alter table part_table_1 add partition (part_col=2);
> insert into table part_table_1 partition (part_col=2) values (1), (2), (3), 
> (4);
> alter table part_table_2 add partition (part_col=1);
> insert into table part_table_2 partition (part_col=1) values (1), (2), (3), 
> (4);
> alter table part_table_2 add partition (part_col=2);
> insert into table part_table_2 partition (part_col=2) values (1), (2), (3), 
> (4);
> explain select * from regular_table, part_table_1, part_table_2 where 
> regular_table.col = part_table_1.part_col and regular_table.col = 
> part_table_2.part_col;
> {code}
> The explain plan is
> {code}
> STAGE DEPENDENCIES:
>   Stage-2 is a root stage
>   Stage-1 depends on stages: Stage-2
>   Stage-0 depends on stages: Stage-1
> STAGE PLANS:
>   Stage: Stage-2
> Spark
>  A masked pattern was here 
>   Vertices:
> Map 1 
> Map Operator Tree:
> TableScan
>   alias: regular_table
>   Statistics: Num rows: 1 Data size: 1 Basic stats: COMPLETE 
> Column stats: NONE
>   Filter Operator
> predicate: col is not null (type: boolean)
> Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
> Select Operator
>   expressions: col (type: int)
>   outputColumnNames: _col0
>   Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
>   Spark HashTable Sink Operator
> keys:
>   0 _col0 (type: int)
>   1 _col1 (type: int)
>   2 _col1 (type: int)
>   Select Operator
> expressions: _col0 (type: int)
> outputColumnNames: _col0
> Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
> Group By Operator
>   keys: _col0 (type: int)
>   mode: hash
>   outputColumnNames: _col0
>   Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
>   Spark Partition Pruning Sink Operator
> partition key expr: part_col
> Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
> target column name: part_col
> target work: Map 2
>   Select Operator
> expressions: _col0 (type: int)
> outputColumnNames: _col0
> Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
> Group By Operator
>   keys: _col0 (type: int)
>   mode: hash
>   outputColumnNames: _col0
>   Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
>   Spark Partition Pruning Sink Operator
> partition key expr: part_col
> Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
> target column name: part_col
> target work: Map 3
> Local 

[jira] [Updated] (HIVE-17178) Spark Partition Pruning Sink Operator can't target multiple Works

2018-01-18 Thread Rui Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rui Li updated HIVE-17178:
--
Attachment: HIVE-17178.1.patch

> Spark Partition Pruning Sink Operator can't target multiple Works
> -
>
> Key: HIVE-17178
> URL: https://issues.apache.org/jira/browse/HIVE-17178
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: Rui Li
>Priority: Major
> Attachments: HIVE-17178.1.patch
>
>
> A Spark Partition Pruning Sink Operator cannot be used to target multiple Map 
> Work objects. The entire DPP subtree (SEL-GBY-SPARKPRUNINGSINK) is duplicated 
> if a single table needs to be used to target multiple Map Works.
> The following query shows the issue:
> {code}
> set hive.spark.dynamic.partition.pruning=true;
> set hive.auto.convert.join=true;
> create table part_table_1 (col int) partitioned by (part_col int);
> create table part_table_2 (col int) partitioned by (part_col int);
> create table regular_table (col int);
> insert into table regular_table values (1);
> alter table part_table_1 add partition (part_col=1);
> insert into table part_table_1 partition (part_col=1) values (1), (2), (3), 
> (4);
> alter table part_table_1 add partition (part_col=2);
> insert into table part_table_1 partition (part_col=2) values (1), (2), (3), 
> (4);
> alter table part_table_2 add partition (part_col=1);
> insert into table part_table_2 partition (part_col=1) values (1), (2), (3), 
> (4);
> alter table part_table_2 add partition (part_col=2);
> insert into table part_table_2 partition (part_col=2) values (1), (2), (3), 
> (4);
> explain select * from regular_table, part_table_1, part_table_2 where 
> regular_table.col = part_table_1.part_col and regular_table.col = 
> part_table_2.part_col;
> {code}
> The explain plan is
> {code}
> STAGE DEPENDENCIES:
>   Stage-2 is a root stage
>   Stage-1 depends on stages: Stage-2
>   Stage-0 depends on stages: Stage-1
> STAGE PLANS:
>   Stage: Stage-2
> Spark
>  A masked pattern was here 
>   Vertices:
> Map 1 
> Map Operator Tree:
> TableScan
>   alias: regular_table
>   Statistics: Num rows: 1 Data size: 1 Basic stats: COMPLETE 
> Column stats: NONE
>   Filter Operator
> predicate: col is not null (type: boolean)
> Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
> Select Operator
>   expressions: col (type: int)
>   outputColumnNames: _col0
>   Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
>   Spark HashTable Sink Operator
> keys:
>   0 _col0 (type: int)
>   1 _col1 (type: int)
>   2 _col1 (type: int)
>   Select Operator
> expressions: _col0 (type: int)
> outputColumnNames: _col0
> Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
> Group By Operator
>   keys: _col0 (type: int)
>   mode: hash
>   outputColumnNames: _col0
>   Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
>   Spark Partition Pruning Sink Operator
> partition key expr: part_col
> Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
> target column name: part_col
> target work: Map 2
>   Select Operator
> expressions: _col0 (type: int)
> outputColumnNames: _col0
> Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
> Group By Operator
>   keys: _col0 (type: int)
>   mode: hash
>   outputColumnNames: _col0
>   Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
>   Spark Partition Pruning Sink Operator
> partition key expr: part_col
> Statistics: Num rows: 1 Data size: 1 Basic stats: 
> COMPLETE Column stats: NONE
> target column name: part_col
> target work: Map 3
> Local Work:
>   

[jira] [Commented] (HIVE-18442) HoS: No FileSystem for scheme: nullscan

2018-01-17 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16329909#comment-16329909
 ] 

Rui Li commented on HIVE-18442:
---

Hi [~xuefuz], I believe it's related to how hive-exec.jar is added to the driver's 
classpath. FileSystem uses a ServiceLoader to [load FS 
implementations|https://github.com/apache/hadoop/blob/release-2.8.3-RC0/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileSystem.java#L2757].
This method is only called once, so for NullScanFileSystem to be loaded, we have 
to make sure hive-exec.jar has been loaded when the method is called. Alternatively, 
we can set the implementation class in the JobConf, which is what the patch does.
It seems hive-exec.jar is added differently between yarn-client and yarn-cluster 
mode. I can do some more investigation into that.
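
For reference, a minimal sketch of the JobConf approach (the property name follows 
Hadoop's generic {{fs.<scheme>.impl}} convention, which FileSystem#getFileSystemClass 
checks before falling back to the ServiceLoader-discovered implementations; the 
fully-qualified NullScanFileSystem name below is an assumption about where the class 
lives in hive-exec):
{code:java}
import org.apache.hadoop.mapred.JobConf;

// Sketch: register the nullscan scheme explicitly so the cluster-side driver can
// resolve it from the conf even if ServiceLoader never saw hive-exec.jar.
public class NullScanConfDemo {
  public static void main(String[] args) {
    JobConf jobConf = new JobConf();
    // "fs.<scheme>.impl" in the conf takes precedence over ServiceLoader lookup.
    jobConf.set("fs.nullscan.impl",
        "org.apache.hadoop.hive.ql.io.NullScanFileSystem"); // assumed fully-qualified name
    System.out.println(jobConf.get("fs.nullscan.impl"));
  }
}
{code}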

> HoS: No FileSystem for scheme: nullscan
> ---
>
> Key: HIVE-18442
> URL: https://issues.apache.org/jira/browse/HIVE-18442
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Rui Li
>Assignee: Rui Li
>Priority: Major
> Attachments: HIVE-18442.1.patch
>
>
> Hit the issue when I run following query in yarn-cluster mode:
> {code}
> select * from (select key from src where false) a left outer join (select key 
> from srcpart limit 0) b on a.key=b.key;
> {code}
> Stack trace:
> {noformat}
> Job failed with java.io.IOException: No FileSystem for scheme: nullscan
>   at 
> org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2799)
>   at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2810)
>   at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:100)
>   at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2849)
>   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2831)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:389)
>   at org.apache.hadoop.fs.Path.getFileSystem(Path.java:356)
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities.isEmptyPath(Utilities.java:2605)
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities.isEmptyPath(Utilities.java:2601)
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities$GetInputPathsCallable.call(Utilities.java:3409)
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities.getInputPaths(Utilities.java:3347)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.cloneJobConf(SparkPlanGenerator.java:299)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generate(SparkPlanGenerator.java:222)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generate(SparkPlanGenerator.java:109)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient$JobStatusJob.call(RemoteHiveSparkClient.java:354)
>   at 
> org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:358)
>   at 
> org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:323)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-18442) HoS: No FileSystem for scheme: nullscan

2018-01-17 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16328530#comment-16328530
 ] 

Rui Li commented on HIVE-18442:
---

The failures are not related.

[~stakiar] [~xuefuz] could you take a look? Thanks.

 

> HoS: No FileSystem for scheme: nullscan
> ---
>
> Key: HIVE-18442
> URL: https://issues.apache.org/jira/browse/HIVE-18442
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Rui Li
>Assignee: Rui Li
>Priority: Major
> Attachments: HIVE-18442.1.patch
>
>
> Hit the issue when I run following query in yarn-cluster mode:
> {code}
> select * from (select key from src where false) a left outer join (select key 
> from srcpart limit 0) b on a.key=b.key;
> {code}
> Stack trace:
> {noformat}
> Job failed with java.io.IOException: No FileSystem for scheme: nullscan
>   at 
> org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2799)
>   at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2810)
>   at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:100)
>   at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2849)
>   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2831)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:389)
>   at org.apache.hadoop.fs.Path.getFileSystem(Path.java:356)
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities.isEmptyPath(Utilities.java:2605)
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities.isEmptyPath(Utilities.java:2601)
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities$GetInputPathsCallable.call(Utilities.java:3409)
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities.getInputPaths(Utilities.java:3347)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.cloneJobConf(SparkPlanGenerator.java:299)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generate(SparkPlanGenerator.java:222)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generate(SparkPlanGenerator.java:109)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient$JobStatusJob.call(RemoteHiveSparkClient.java:354)
>   at 
> org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:358)
>   at 
> org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:323)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-18442) HoS: No FileSystem for scheme: nullscan

2018-01-15 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16326133#comment-16326133
 ] 

Rui Li commented on HIVE-18442:
---

I didn't add a qtest for this because in the test we have hive-exec.jar in the 
driver's extra classpath, which already avoids the issue. But since we can't expect 
that in a real deployment, the code change is still needed.

> HoS: No FileSystem for scheme: nullscan
> ---
>
> Key: HIVE-18442
> URL: https://issues.apache.org/jira/browse/HIVE-18442
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Rui Li
>Assignee: Rui Li
>Priority: Major
> Attachments: HIVE-18442.1.patch
>
>
> Hit the issue when I run following query in yarn-cluster mode:
> {code}
> select * from (select key from src where false) a left outer join (select key 
> from srcpart limit 0) b on a.key=b.key;
> {code}
> Stack trace:
> {noformat}
> Job failed with java.io.IOException: No FileSystem for scheme: nullscan
>   at 
> org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2799)
>   at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2810)
>   at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:100)
>   at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2849)
>   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2831)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:389)
>   at org.apache.hadoop.fs.Path.getFileSystem(Path.java:356)
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities.isEmptyPath(Utilities.java:2605)
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities.isEmptyPath(Utilities.java:2601)
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities$GetInputPathsCallable.call(Utilities.java:3409)
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities.getInputPaths(Utilities.java:3347)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.cloneJobConf(SparkPlanGenerator.java:299)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generate(SparkPlanGenerator.java:222)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generate(SparkPlanGenerator.java:109)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient$JobStatusJob.call(RemoteHiveSparkClient.java:354)
>   at 
> org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:358)
>   at 
> org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:323)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-18442) HoS: No FileSystem for scheme: nullscan

2018-01-15 Thread Rui Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-18442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rui Li updated HIVE-18442:
--
Status: Patch Available  (was: Open)

> HoS: No FileSystem for scheme: nullscan
> ---
>
> Key: HIVE-18442
> URL: https://issues.apache.org/jira/browse/HIVE-18442
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Rui Li
>Assignee: Rui Li
>Priority: Major
> Attachments: HIVE-18442.1.patch
>
>
> Hit the issue when I run following query in yarn-cluster mode:
> {code}
> select * from (select key from src where false) a left outer join (select key 
> from srcpart limit 0) b on a.key=b.key;
> {code}
> Stack trace:
> {noformat}
> Job failed with java.io.IOException: No FileSystem for scheme: nullscan
>   at 
> org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2799)
>   at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2810)
>   at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:100)
>   at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2849)
>   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2831)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:389)
>   at org.apache.hadoop.fs.Path.getFileSystem(Path.java:356)
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities.isEmptyPath(Utilities.java:2605)
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities.isEmptyPath(Utilities.java:2601)
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities$GetInputPathsCallable.call(Utilities.java:3409)
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities.getInputPaths(Utilities.java:3347)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.cloneJobConf(SparkPlanGenerator.java:299)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generate(SparkPlanGenerator.java:222)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generate(SparkPlanGenerator.java:109)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient$JobStatusJob.call(RemoteHiveSparkClient.java:354)
>   at 
> org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:358)
>   at 
> org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:323)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-18442) HoS: No FileSystem for scheme: nullscan

2018-01-15 Thread Rui Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-18442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rui Li updated HIVE-18442:
--
Attachment: HIVE-18442.1.patch

> HoS: No FileSystem for scheme: nullscan
> ---
>
> Key: HIVE-18442
> URL: https://issues.apache.org/jira/browse/HIVE-18442
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Rui Li
>Assignee: Rui Li
>Priority: Major
> Attachments: HIVE-18442.1.patch
>
>
> Hit the issue when I run following query in yarn-cluster mode:
> {code}
> select * from (select key from src where false) a left outer join (select key 
> from srcpart limit 0) b on a.key=b.key;
> {code}
> Stack trace:
> {noformat}
> Job failed with java.io.IOException: No FileSystem for scheme: nullscan
>   at 
> org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2799)
>   at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2810)
>   at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:100)
>   at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2849)
>   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2831)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:389)
>   at org.apache.hadoop.fs.Path.getFileSystem(Path.java:356)
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities.isEmptyPath(Utilities.java:2605)
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities.isEmptyPath(Utilities.java:2601)
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities$GetInputPathsCallable.call(Utilities.java:3409)
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities.getInputPaths(Utilities.java:3347)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.cloneJobConf(SparkPlanGenerator.java:299)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generate(SparkPlanGenerator.java:222)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generate(SparkPlanGenerator.java:109)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient$JobStatusJob.call(RemoteHiveSparkClient.java:354)
>   at 
> org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:358)
>   at 
> org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:323)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-18442) HoS: No FileSystem for scheme: nullscan

2018-01-12 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16323967#comment-16323967
 ] 

Rui Li commented on HIVE-18442:
---

The issue seems to be related to class loading. FileSystem uses a ServiceLoader to 
load FS implementations, but it only loads them once. So if hive-exec.jar has not 
been loaded by the context class loader when {{FileSystem::loadFileSystems}} is 
called, NullScanFileSystem will not be registered, hence the failure.

The issue is specific to cluster mode. I don't see it in client mode.

I think we can set {{fs.nullscan.impl}} in the JobConf to fix it.

> HoS: No FileSystem for scheme: nullscan
> ---
>
> Key: HIVE-18442
> URL: https://issues.apache.org/jira/browse/HIVE-18442
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Rui Li
>Assignee: Rui Li
>
> Hit the issue when I run following query in yarn-cluster mode:
> {code}
> select * from (select key from src where false) a left outer join (select key 
> from srcpart limit 0) b on a.key=b.key;
> {code}
> Stack trace:
> {noformat}
> Job failed with java.io.IOException: No FileSystem for scheme: nullscan
>   at 
> org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2799)
>   at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2810)
>   at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:100)
>   at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2849)
>   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2831)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:389)
>   at org.apache.hadoop.fs.Path.getFileSystem(Path.java:356)
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities.isEmptyPath(Utilities.java:2605)
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities.isEmptyPath(Utilities.java:2601)
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities$GetInputPathsCallable.call(Utilities.java:3409)
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities.getInputPaths(Utilities.java:3347)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.cloneJobConf(SparkPlanGenerator.java:299)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generate(SparkPlanGenerator.java:222)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generate(SparkPlanGenerator.java:109)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient$JobStatusJob.call(RemoteHiveSparkClient.java:354)
>   at 
> org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:358)
>   at 
> org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:323)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (HIVE-18442) HoS: No FileSystem for scheme: nullscan

2018-01-12 Thread Rui Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-18442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rui Li updated HIVE-18442:
--
Description: 
Hit the issue when I run following query in yarn-cluster mode:
{code}
select * from (select key from src where false) a left outer join (select key 
from srcpart limit 0) b on a.key=b.key;
{code}

Stack trace:
{noformat}
Job failed with java.io.IOException: No FileSystem for scheme: nullscan
at 
org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2799)
at 
org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2810)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:100)
at 
org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2849)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2831)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:389)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:356)
at 
org.apache.hadoop.hive.ql.exec.Utilities.isEmptyPath(Utilities.java:2605)
at 
org.apache.hadoop.hive.ql.exec.Utilities.isEmptyPath(Utilities.java:2601)
at 
org.apache.hadoop.hive.ql.exec.Utilities$GetInputPathsCallable.call(Utilities.java:3409)
at 
org.apache.hadoop.hive.ql.exec.Utilities.getInputPaths(Utilities.java:3347)
at 
org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.cloneJobConf(SparkPlanGenerator.java:299)
at 
org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generate(SparkPlanGenerator.java:222)
at 
org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generate(SparkPlanGenerator.java:109)
at 
org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient$JobStatusJob.call(RemoteHiveSparkClient.java:354)
at 
org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:358)
at 
org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:323)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
{noformat}

> HoS: No FileSystem for scheme: nullscan
> ---
>
> Key: HIVE-18442
> URL: https://issues.apache.org/jira/browse/HIVE-18442
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Rui Li
>Assignee: Rui Li
>
> Hit the issue when I run following query in yarn-cluster mode:
> {code}
> select * from (select key from src where false) a left outer join (select key 
> from srcpart limit 0) b on a.key=b.key;
> {code}
> Stack trace:
> {noformat}
> Job failed with java.io.IOException: No FileSystem for scheme: nullscan
>   at 
> org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2799)
>   at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2810)
>   at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:100)
>   at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2849)
>   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2831)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:389)
>   at org.apache.hadoop.fs.Path.getFileSystem(Path.java:356)
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities.isEmptyPath(Utilities.java:2605)
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities.isEmptyPath(Utilities.java:2601)
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities$GetInputPathsCallable.call(Utilities.java:3409)
>   at 
> org.apache.hadoop.hive.ql.exec.Utilities.getInputPaths(Utilities.java:3347)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.cloneJobConf(SparkPlanGenerator.java:299)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generate(SparkPlanGenerator.java:222)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generate(SparkPlanGenerator.java:109)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient$JobStatusJob.call(RemoteHiveSparkClient.java:354)
>   at 
> org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:358)
>   at 
> org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:323)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (HIVE-18148) NPE in SparkDynamicPartitionPruningResolver

2018-01-11 Thread Rui Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-18148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rui Li updated HIVE-18148:
--
   Resolution: Fixed
Fix Version/s: 3.0.0
   Status: Resolved  (was: Patch Available)

Pushed to master. Thanks Sahil and Liyun for the review.

> NPE in SparkDynamicPartitionPruningResolver
> ---
>
> Key: HIVE-18148
> URL: https://issues.apache.org/jira/browse/HIVE-18148
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Rui Li
>Assignee: Rui Li
> Fix For: 3.0.0
>
> Attachments: HIVE-18148.1.patch, HIVE-18148.2.patch, 
> HIVE-18148.3.patch, HIVE-18148.4.patch
>
>
> The stack trace is:
> {noformat}
> 2017-11-27T10:32:38,752 ERROR [e6c8aab5-ddd2-461d-b185-a7597c3e7519 main] 
> ql.Driver: FAILED: NullPointerException null
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hive.ql.optimizer.physical.SparkDynamicPartitionPruningResolver$SparkDynamicPartitionPruningDispatcher.dispatch(SparkDynamicPartitionPruningResolver.java:100)
> at 
> org.apache.hadoop.hive.ql.lib.TaskGraphWalker.dispatch(TaskGraphWalker.java:111)
> at 
> org.apache.hadoop.hive.ql.lib.TaskGraphWalker.walk(TaskGraphWalker.java:180)
> at 
> org.apache.hadoop.hive.ql.lib.TaskGraphWalker.startWalking(TaskGraphWalker.java:125)
> at 
> org.apache.hadoop.hive.ql.optimizer.physical.SparkDynamicPartitionPruningResolver.resolve(SparkDynamicPartitionPruningResolver.java:74)
> at 
> org.apache.hadoop.hive.ql.parse.spark.SparkCompiler.optimizeTaskPlan(SparkCompiler.java:568)
> {noformat}
> At this stage, there shouldn't be a DPP sink whose target map work is null. 
> The root cause seems to be a malformed operator tree generated by 
> SplitOpTreeForDPP.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (HIVE-18442) HoS: No FileSystem for scheme: nullscan

2018-01-11 Thread Rui Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-18442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rui Li reassigned HIVE-18442:
-


> HoS: No FileSystem for scheme: nullscan
> ---
>
> Key: HIVE-18442
> URL: https://issues.apache.org/jira/browse/HIVE-18442
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Rui Li
>Assignee: Rui Li
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-18441) NullPointerException due to Hadoop23Shims doesn't compatible with Hadoop 2.2

2018-01-11 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16322182#comment-16322182
 ] 

Rui Li commented on HIVE-18441:
---

Hi [~hengyu.dai], do you know why the path has no scheme? I suppose we should 
have set the scheme in NullScanOptimizer?
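
For reference, a null-safe version of the scheme check described in the snippet 
below would look roughly like this (a sketch of the general fix direction, not 
necessarily what the attached patches do):
{code:java}
import java.net.URI;

// Sketch: guard against URIs without a scheme (as returned by Hadoop 2.2's listStatus)
// by making the string constant the receiver of equals().
public class SchemeCheckDemo {
  static boolean isNullScan(URI uri) {
    return "nullscan".equals(uri.getScheme()); // safe even when getScheme() is null
  }

  public static void main(String[] args) {
    System.out.println(isNullScan(URI.create("nullscan://null/default.srcpart/part"))); // true
    System.out.println(isNullScan(URI.create("/default.srcpart/part")));                // false, no NPE
  }
}
{code}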

> NullPointerException due to Hadoop23Shims doesn't compatible with Hadoop 2.2
> 
>
> Key: HIVE-18441
> URL: https://issues.apache.org/jira/browse/HIVE-18441
> Project: Hive
>  Issue Type: Bug
>  Components: Query Planning
>Affects Versions: 2.1.1, 2.2.0, 2.3.0
>Reporter: Hengyu Dai
> Attachments: HIVE-18441.01.patch, HIVE-18441.02.patch, 
> HIVE-18441.patch, hadoop2.2.jpg, hadoop2.9.jpg
>
>
> Hive 2.x is not compatible with Hadoop 2.2 (maybe the same problem exists in 
> other Hadoop versions too) when a "nullscan" path is present.
> Here is the listStatus() method in Hadoop23Shims.java:
> {code:java}
> protected List<FileStatus> listStatus(JobContext job) throws IOException {
>   List<FileStatus> result = super.listStatus(job);
>   Iterator<FileStatus> it = result.iterator();
>   while (it.hasNext()) {
>     FileStatus stat = it.next();
>     if (!stat.isFile() || (stat.getLen() == 0 && 
>         !stat.getPath().toUri().getScheme().equals("nullscan"))) {
>       it.remove();
>     }
>   }
>   return result;
> }
> {code}
> The first line, "super.listStatus(job)", returns different FileStatus objects on 
> Hadoop 2.2 and Hadoop 2.9.
> I have tested Hive 2.1 with Hadoop 2.2 and Hive 2.1 with Hadoop 2.9, and the NPE 
> occurs in Hive 2.1 with Hadoop 2.2.
> My test SQL is 
> {code:java}
> select * from (select key from src where false) a left outer join (select key 
> from srcpart limit 0) b on a.key=b.key;
> {code}
> It's from optimize_nullscan.q; tables src and srcpart in the SQL are created by 
> q_test_init.sql.
> The problem is that in Hadoop 2.2, super.listStatus(job) returns a FileStatus 
> object whose "Path" field doesn't contain a scheme for the "nullscan" path, so 
> "stat.getPath().toUri().getScheme()" in the if statement returns NULL, and calling 
> null.equals("nullscan") leads to an NPE.
> In contrast, on Hadoop 2.9 super.listStatus(job) returns a valid Path whose 
> scheme is "nullscan".
> Debug screenshots from Hadoop 2.2 and Hadoop 2.9 are attached; we can see that the 
> result lists returned by super.listStatus(job) differ: Hadoop 2.2 gets 
> "/default.srcpart/part..." and Hadoop 2.9 gets 
> "nullscan://null/default.srcpart/part...".
> (This bug does not happen with normal paths like "hdfs://...".)
> We should take into account that stat.getPath().toUri().getScheme() may return 
> null.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (HIVE-18148) NPE in SparkDynamicPartitionPruningResolver

2018-01-11 Thread Rui Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-18148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rui Li updated HIVE-18148:
--
Attachment: HIVE-18148.4.patch

Moved the code into SparkUtilities

> NPE in SparkDynamicPartitionPruningResolver
> ---
>
> Key: HIVE-18148
> URL: https://issues.apache.org/jira/browse/HIVE-18148
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Rui Li
>Assignee: Rui Li
> Attachments: HIVE-18148.1.patch, HIVE-18148.2.patch, 
> HIVE-18148.3.patch, HIVE-18148.4.patch
>
>
> The stack trace is:
> {noformat}
> 2017-11-27T10:32:38,752 ERROR [e6c8aab5-ddd2-461d-b185-a7597c3e7519 main] 
> ql.Driver: FAILED: NullPointerException null
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hive.ql.optimizer.physical.SparkDynamicPartitionPruningResolver$SparkDynamicPartitionPruningDispatcher.dispatch(SparkDynamicPartitionPruningResolver.java:100)
> at 
> org.apache.hadoop.hive.ql.lib.TaskGraphWalker.dispatch(TaskGraphWalker.java:111)
> at 
> org.apache.hadoop.hive.ql.lib.TaskGraphWalker.walk(TaskGraphWalker.java:180)
> at 
> org.apache.hadoop.hive.ql.lib.TaskGraphWalker.startWalking(TaskGraphWalker.java:125)
> at 
> org.apache.hadoop.hive.ql.optimizer.physical.SparkDynamicPartitionPruningResolver.resolve(SparkDynamicPartitionPruningResolver.java:74)
> at 
> org.apache.hadoop.hive.ql.parse.spark.SparkCompiler.optimizeTaskPlan(SparkCompiler.java:568)
> {noformat}
> At this stage, there shouldn't be a DPP sink whose target map work is null. 
> The root cause seems to be a malformed operator tree generated by 
> SplitOpTreeForDPP.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-12338) Add webui to HiveServer2

2018-01-11 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-12338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16321859#comment-16321859
 ] 

Rui Li commented on HIVE-12338:
---

Hey [~jxiang], [~szehon], is there any way to programmatically retrieve the 
information displayed in the web UI?

> Add webui to HiveServer2
> 
>
> Key: HIVE-12338
> URL: https://issues.apache.org/jira/browse/HIVE-12338
> Project: Hive
>  Issue Type: Improvement
>  Components: HiveServer2
>Reporter: Jimmy Xiang
>Assignee: Jimmy Xiang
> Attachments: HIVE-12338.1.patch, HIVE-12338.2.patch, 
> HIVE-12338.3.patch, HIVE-12338.4.patch, hs2-conf.png, hs2-logs.png, 
> hs2-metrics.png, hs2-webui.png
>
>
> A web ui for HiveServer2 can show some useful information such as:
>  
> 1. Sessions,
> 2. Queries that are executing on the HS2, their states, starting time, etc.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (HIVE-18148) NPE in SparkDynamicPartitionPruningResolver

2018-01-10 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16319959#comment-16319959
 ] 

Rui Li edited comment on HIVE-18148 at 1/10/18 9:45 AM:


Hi [~stakiar],
bq. Is there any way to move the code changes into SplitOpTreeForDPP?
The added code runs during OP tree optimization, which is before 
SplitOpTreeForDPP. So we won't generate the malformed tree. I put it there as 
another kind of DPP to be removed, together with cyclic DPPs, too big DPPs, etc.
bq. I don't think this is an issue with map-joins correct?
Yeah the issue is not related to map join. And we won't remove nested DPP sink 
if it's with map join, because SplitOpTreeForDPP doesn't split the tree in this 
case.


was (Author: lirui):
Hi [~stakiar],
bq. Is there any way to move the code changes into SplitOpTreeForDPP?
The added code runs during OP tree optimization, which is before 
SplitOpTreeForDPP. So we won't generate the malformed tree.
bq. I don't think this is an issue with map-joins correct?
The issue is not related to map join. And we won't remove nested DPP sink if 
it's with map join, because SplitOpTreeForDPP doesn't split the tree in this 
case.

> NPE in SparkDynamicPartitionPruningResolver
> ---
>
> Key: HIVE-18148
> URL: https://issues.apache.org/jira/browse/HIVE-18148
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Rui Li
>Assignee: Rui Li
> Attachments: HIVE-18148.1.patch, HIVE-18148.2.patch, 
> HIVE-18148.3.patch
>
>
> The stack trace is:
> {noformat}
> 2017-11-27T10:32:38,752 ERROR [e6c8aab5-ddd2-461d-b185-a7597c3e7519 main] 
> ql.Driver: FAILED: NullPointerException null
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hive.ql.optimizer.physical.SparkDynamicPartitionPruningResolver$SparkDynamicPartitionPruningDispatcher.dispatch(SparkDynamicPartitionPruningResolver.java:100)
> at 
> org.apache.hadoop.hive.ql.lib.TaskGraphWalker.dispatch(TaskGraphWalker.java:111)
> at 
> org.apache.hadoop.hive.ql.lib.TaskGraphWalker.walk(TaskGraphWalker.java:180)
> at 
> org.apache.hadoop.hive.ql.lib.TaskGraphWalker.startWalking(TaskGraphWalker.java:125)
> at 
> org.apache.hadoop.hive.ql.optimizer.physical.SparkDynamicPartitionPruningResolver.resolve(SparkDynamicPartitionPruningResolver.java:74)
> at 
> org.apache.hadoop.hive.ql.parse.spark.SparkCompiler.optimizeTaskPlan(SparkCompiler.java:568)
> {noformat}
> At this stage, there shouldn't be a DPP sink whose target map work is null. 
> The root cause seems to be a malformed operator tree generated by 
> SplitOpTreeForDPP.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-18148) NPE in SparkDynamicPartitionPruningResolver

2018-01-10 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16319959#comment-16319959
 ] 

Rui Li commented on HIVE-18148:
---

Hi [~stakiar],
bq. Is there any way to move the code changes into SplitOpTreeForDPP?
The added code runs during OP tree optimization, which is before 
SplitOpTreeForDPP. So we won't generate the malformed tree.
bq. I don't think this is an issue with map-joins correct?
The issue is not related to map join. And we won't remove a nested DPP sink if it's 
used with a map join, because SplitOpTreeForDPP doesn't split the tree in that 
case.

> NPE in SparkDynamicPartitionPruningResolver
> ---
>
> Key: HIVE-18148
> URL: https://issues.apache.org/jira/browse/HIVE-18148
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Rui Li
>Assignee: Rui Li
> Attachments: HIVE-18148.1.patch, HIVE-18148.2.patch, 
> HIVE-18148.3.patch
>
>
> The stack trace is:
> {noformat}
> 2017-11-27T10:32:38,752 ERROR [e6c8aab5-ddd2-461d-b185-a7597c3e7519 main] 
> ql.Driver: FAILED: NullPointerException null
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hive.ql.optimizer.physical.SparkDynamicPartitionPruningResolver$SparkDynamicPartitionPruningDispatcher.dispatch(SparkDynamicPartitionPruningResolver.java:100)
> at 
> org.apache.hadoop.hive.ql.lib.TaskGraphWalker.dispatch(TaskGraphWalker.java:111)
> at 
> org.apache.hadoop.hive.ql.lib.TaskGraphWalker.walk(TaskGraphWalker.java:180)
> at 
> org.apache.hadoop.hive.ql.lib.TaskGraphWalker.startWalking(TaskGraphWalker.java:125)
> at 
> org.apache.hadoop.hive.ql.optimizer.physical.SparkDynamicPartitionPruningResolver.resolve(SparkDynamicPartitionPruningResolver.java:74)
> at 
> org.apache.hadoop.hive.ql.parse.spark.SparkCompiler.optimizeTaskPlan(SparkCompiler.java:568)
> {noformat}
> At this stage, there shouldn't be a DPP sink whose target map work is null. 
> The root cause seems to be a malformed operator tree generated by 
> SplitOpTreeForDPP.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-18368) Improve Spark Debug RDD Graph

2018-01-10 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16319913#comment-16319913
 ] 

Rui Li commented on HIVE-18368:
---

Hi [~stakiar], two questions regarding the screenshot:
# Why is the number of partitions of MapInput 0?
# It seems confusing to have two RDDs with the same work name, e.g. "Reducer 
3" or "Map 11". Can we name the shuffled RDD "ShuffleTran" and the Hadoop RDD 
"MapInput"?

> Improve Spark Debug RDD Graph
> -
>
> Key: HIVE-18368
> URL: https://issues.apache.org/jira/browse/HIVE-18368
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
> Attachments: HIVE-18368.1.patch, HIVE-18368.2.patch, Spark UI - Named 
> RDDs.png
>
>
> The {{SparkPlan}} class does some logging to show the mapping between 
> different {{SparkTran}}, what shuffle types are used, and what trans are 
> cached. However, there is room for improvement.
> When debug logging is enabled the RDD graph is logged, but there isn't much 
> information printed about each RDD.
> We should combine both of the graphs and improve them. We could even make the 
> Spark Plan graph part of the {{EXPLAIN EXTENDED}} output.
> Ideally, the final graph shows a clear relationship between Tran objects, 
> RDDs, and BaseWorks. Edge should include information about number of 
> partitions, shuffle types, Spark operations used, etc.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-18148) NPE in SparkDynamicPartitionPruningResolver

2018-01-09 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16318338#comment-16318338
 ] 

Rui Li commented on HIVE-18148:
---

[~xuefuz], [~stakiar], could you also take a look? This is somewhat blocking 
HIVE-17178.

> NPE in SparkDynamicPartitionPruningResolver
> ---
>
> Key: HIVE-18148
> URL: https://issues.apache.org/jira/browse/HIVE-18148
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Rui Li
>Assignee: Rui Li
> Attachments: HIVE-18148.1.patch, HIVE-18148.2.patch, 
> HIVE-18148.3.patch
>
>
> The stack trace is:
> {noformat}
> 2017-11-27T10:32:38,752 ERROR [e6c8aab5-ddd2-461d-b185-a7597c3e7519 main] 
> ql.Driver: FAILED: NullPointerException null
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hive.ql.optimizer.physical.SparkDynamicPartitionPruningResolver$SparkDynamicPartitionPruningDispatcher.dispatch(SparkDynamicPartitionPruningResolver.java:100)
> at 
> org.apache.hadoop.hive.ql.lib.TaskGraphWalker.dispatch(TaskGraphWalker.java:111)
> at 
> org.apache.hadoop.hive.ql.lib.TaskGraphWalker.walk(TaskGraphWalker.java:180)
> at 
> org.apache.hadoop.hive.ql.lib.TaskGraphWalker.startWalking(TaskGraphWalker.java:125)
> at 
> org.apache.hadoop.hive.ql.optimizer.physical.SparkDynamicPartitionPruningResolver.resolve(SparkDynamicPartitionPruningResolver.java:74)
> at 
> org.apache.hadoop.hive.ql.parse.spark.SparkCompiler.optimizeTaskPlan(SparkCompiler.java:568)
> {noformat}
> At this stage, there shouldn't be a DPP sink whose target map work is null. 
> The root cause seems to be a malformed operator tree generated by 
> SplitOpTreeForDPP.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-16484) Investigate SparkLauncher for HoS as alternative to bin/spark-submit

2018-01-08 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16317692#comment-16317692
 ] 

Rui Li commented on HIVE-16484:
---

bq. Hive wouldn't need a separate Spark installation to be able to launch Spark 
apps. It could ship with everything ready to run HoS out of the box.
Yeah I also believe that's the main benefit. But if SparkLauncher cannot give 
us that, why don't we just use {{InProcessLauncher}}?

Regarding the extra connection, I'm not sure how it impacts us 
performance-wise. My main concern is that it adds extra risk of issues while 
the benefits are not quite clear. For example, we had several connection 
timeout issues with the RPC framework, and it seems 
{{LauncherServer}}/{{LauncherBackend}} have very similar configs to tweak, like 
{{spark.launcher.childConnectionTimeout}}.

Regarding debugging, I assume it's mainly for yarn-client mode, right? The 
process we launch in yarn-cluster mode is only a lightweight client talking to 
the RM, and by default it exits once the app starts running (HIVE-13895). I 
agree it makes debugging easier, but again that requires InProcessLauncher.

So my suggestion is we wait until InProcessLauncher is released and implement 
another SparkClient using it. We can decide whether to get rid of the current 
SparkClientImpl when InProcessLauncher is mature. Does that make sense?

BTW, are there any docs about the SparkLauncher implementation? I just want to 
have a better understanding of it.
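
For reference, the launcher-based flow I understand we're discussing looks 
roughly like the sketch below (the app resource, main class, and wrapper class 
are placeholders, not a proposed change to SparkClientImpl):
{code:java}
import org.apache.spark.launcher.SparkAppHandle;
import org.apache.spark.launcher.SparkLauncher;

// Rough sketch only: launch an app via SparkLauncher and observe its state through
// the returned SparkAppHandle instead of parsing spark-submit output.
public final class LauncherSketch {
  public static SparkAppHandle launch() throws Exception {
    return new SparkLauncher()
        .setMaster("yarn")
        .setDeployMode("cluster")
        .setAppResource("/path/to/hive-exec.jar")                   // placeholder
        .setMainClass("org.apache.hive.spark.client.RemoteDriver")  // placeholder
        .startApplication(new SparkAppHandle.Listener() {
          @Override
          public void stateChanged(SparkAppHandle handle) {
            // finer-grained states than plain spark-submit gives us, e.g. SUBMITTED, RUNNING
            System.out.println("state: " + handle.getState());
          }

          @Override
          public void infoChanged(SparkAppHandle handle) {
            System.out.println("appId: " + handle.getAppId());
          }
        });
  }
}
{code}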

> Investigate SparkLauncher for HoS as alternative to bin/spark-submit
> 
>
> Key: HIVE-16484
> URL: https://issues.apache.org/jira/browse/HIVE-16484
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
> Attachments: HIVE-16484.1.patch, HIVE-16484.10.patch, 
> HIVE-16484.2.patch, HIVE-16484.3.patch, HIVE-16484.4.patch, 
> HIVE-16484.5.patch, HIVE-16484.6.patch, HIVE-16484.7.patch, 
> HIVE-16484.8.patch, HIVE-16484.9.patch
>
>
> The {{SparkClientImpl#startDriver}} currently looks for the {{SPARK_HOME}} 
> directory and invokes the {{bin/spark-submit}} script, which spawns a 
> separate process to run the Spark application.
> {{SparkLauncher}} was added in SPARK-4924 and is a programmatic way to launch 
> Spark applications.
> I see a few advantages:
> * No need to spawn a separate process to launch a HoS session --> lower startup time
> * Simplifies the code in {{SparkClientImpl}} --> easier to debug
> * {{SparkLauncher#startApplication}} returns a {{SparkAppHandle}} which 
> contains some useful utilities for querying the state of the Spark job
> ** It also allows the launcher to specify a list of job listeners



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-16484) Investigate SparkLauncher for HoS as alternative to bin/spark-submit

2018-01-08 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16316130#comment-16316130
 ] 

Rui Li commented on HIVE-16484:
---

Hi [~stakiar], a higher-level question: what's the main advantage of moving to 
SparkLauncher if it still requires spark-submit? One advantage I see is 
finer-grained app state changes, e.g. SUBMITTED.
On the other hand, LauncherServer and LauncherBackend look similar to our RPC 
framework and some functions overlap, e.g. we already have EndSession to shut 
down the Spark app, so I'm not sure whether we still need 
{{SparkAppHandle::stop/kill}}.

> Investigate SparkLauncher for HoS as alternative to bin/spark-submit
> 
>
> Key: HIVE-16484
> URL: https://issues.apache.org/jira/browse/HIVE-16484
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
> Attachments: HIVE-16484.1.patch, HIVE-16484.10.patch, 
> HIVE-16484.2.patch, HIVE-16484.3.patch, HIVE-16484.4.patch, 
> HIVE-16484.5.patch, HIVE-16484.6.patch, HIVE-16484.7.patch, 
> HIVE-16484.8.patch, HIVE-16484.9.patch
>
>
> The {{SparkClientImpl#startDriver}} currently looks for the {{SPARK_HOME}} 
> directory and invokes the {{bin/spark-submit}} script, which spawns a 
> separate process to run the Spark application.
> {{SparkLauncher}} was added in SPARK-4924 and is a programmatic way to launch 
> Spark applications.
> I see a few advantages:
> * No need to spawn a separate process to launch a HoS session --> lower startup time
> * Simplifies the code in {{SparkClientImpl}} --> easier to debug
> * {{SparkLauncher#startApplication}} returns a {{SparkAppHandle}} which 
> contains some useful utilities for querying the state of the Spark job
> ** It also allows the launcher to specify a list of job listeners



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-16484) Investigate SparkLauncher for HoS as alternative to bin/spark-submit

2018-01-05 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16313162#comment-16313162
 ] 

Rui Li commented on HIVE-16484:
---

Thanks [~stakiar] for the work! I had a quick glance at the code and left some 
comments. Will try to find some time to look into the details tomorrow.

> Investigate SparkLauncher for HoS as alternative to bin/spark-submit
> 
>
> Key: HIVE-16484
> URL: https://issues.apache.org/jira/browse/HIVE-16484
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
> Attachments: HIVE-16484.1.patch, HIVE-16484.2.patch, 
> HIVE-16484.3.patch, HIVE-16484.4.patch, HIVE-16484.5.patch, 
> HIVE-16484.6.patch, HIVE-16484.7.patch, HIVE-16484.8.patch, HIVE-16484.9.patch
>
>
> The {{SparkClientImpl#startDriver}} currently looks for the {{SPARK_HOME}} 
> directory and invokes the {{bin/spark-submit}} script, which spawns a 
> separate process to run the Spark application.
> {{SparkLauncher}} was added in SPARK-4924 and is a programmatic way to launch 
> Spark applications.
> I see a few advantages:
> * No need to spawn a separate process to launch a HoS session --> lower startup time
> * Simplifies the code in {{SparkClientImpl}} --> easier to debug
> * {{SparkLauncher#startApplication}} returns a {{SparkAppHandle}} which 
> contains some useful utilities for querying the state of the Spark job
> ** It also allows the launcher to specify a list of job listeners



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-17929) Use sessionId for HoS Remote Driver Client id

2018-01-02 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-17929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16309247#comment-16309247
 ] 

Rui Li commented on HIVE-17929:
---

+1

> Use sessionId for HoS Remote Driver Client id
> -
>
> Key: HIVE-17929
> URL: https://issues.apache.org/jira/browse/HIVE-17929
> Project: Hive
>  Issue Type: Sub-task
>  Components: Spark
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
> Attachments: HIVE-17929.1.patch, HIVE-17929.2.patch, 
> HIVE-17929.3.patch
>
>
> Each {{SparkClientImpl}} creates a client connection using a client id. The 
> client id is created via {{UUID.randomUUID()}}.
> Since each HoS session has a single client connection we should just use the 
> sessionId instead (which is also a UUID). This should help simplify the code 
> and some of the client logging.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-18301) Investigate to enable MapInput cache in Hive on Spark

2017-12-29 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16306218#comment-16306218
 ] 

Rui Li commented on HIVE-18301:
---

I think SMB map join is one case where the "input file change" event is 
necessary -- whenever the big table's input file changes, SMBMapJoinOperator 
needs to find the corresponding input files for the small tables in order to 
perform a bucketed join. Maybe we can identify all such cases and make sure the 
MapInput cache is disabled for them. For other cases, we can cache MapInput and 
just fix the NPE.
[~xuefuz], could you share your thoughts on this? Thanks.
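
To illustrate the idea, something along these lines (a rough sketch with 
simplified signatures, not a patch):
{code:java}
import java.util.Collection;
import org.apache.hadoop.hive.ql.exec.Operator;
import org.apache.hadoop.hive.ql.exec.SMBMapJoinOperator;

// Very rough sketch: treat a map work as cacheable only when none of its operators
// relies on the "input file changed" event (SMB map join being the known case).
final class MapInputCacheCheck {
  static boolean safeToCache(Collection<Operator<?>> operatorsOfMapWork) {
    for (Operator<?> op : operatorsOfMapWork) {
      if (op instanceof SMBMapJoinOperator) {
        return false;  // needs per-file setup for the small tables of a bucketed join
      }
    }
    return true;
  }
}
{code}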

> Investigate to enable MapInput cache in Hive on Spark
> -
>
> Key: HIVE-18301
> URL: https://issues.apache.org/jira/browse/HIVE-18301
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang
>Assignee: liyunzhang
>
> The IOContext problem was found in MapTran when the Spark RDD cache was 
> enabled (HIVE-8920), so we disabled the RDD cache in MapTran at 
> [SparkPlanGenerator|https://github.com/kellyzly/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java#L202].
> The problem is that IOContext doesn't seem to be initialized correctly in 
> Spark yarn client/cluster mode, which causes exceptions like 
> {code}
> Job aborted due to stage failure: Task 93 in stage 0.0 failed 4 times, most 
> recent failure: Lost task 93.3 in stage 0.0 (TID 616, bdpe48): 
> java.lang.RuntimeException: Error processing row: 
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.processRow(SparkMapRecordHandler.java:165)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:48)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:27)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList.hasNext(HiveBaseFunctionResultList.java:85)
>   at 
> scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.exec.AbstractMapOperator.getNominalPath(AbstractMapOperator.java:101)
>   at 
> org.apache.hadoop.hive.ql.exec.MapOperator.cleanUpInputFileChangedOp(MapOperator.java:516)
>   at 
> org.apache.hadoop.hive.ql.exec.Operator.cleanUpInputFileChanged(Operator.java:1187)
>   at 
> org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:546)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.processRow(SparkMapRecordHandler.java:152)
>   ... 12 more
> Driver stacktrace:
> {code}
> In yarn client/cluster mode, 
> [ExecMapperContext#currentInputPath|https://github.com/kellyzly/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/mr/ExecMapperContext.java#L109]
>  is sometimes null when the RDD cache is enabled.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-18301) Investigate to enable MapInput cache in Hive on Spark

2017-12-27 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16304383#comment-16304383
 ] 

Rui Li commented on HIVE-18301:
---

My understanding is that if the HadoopRDD is cached, the records are not 
produced by the record reader and IOContext is not populated. Therefore the 
information in IOContext, e.g. the input path, will be unavailable. This may 
cause problems because some operators need to take certain actions when the 
input file changes -- {{Operator::cleanUpInputFileChanged}}.
So basically my point is that we have to figure out the scenarios where 
IOContext is necessary, and then decide whether we should disable caching in 
such cases.

> Investigate to enable MapInput cache in Hive on Spark
> -
>
> Key: HIVE-18301
> URL: https://issues.apache.org/jira/browse/HIVE-18301
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang
>Assignee: liyunzhang
>
> The IOContext problem was found in MapTran when the Spark RDD cache was 
> enabled (HIVE-8920), so we disabled the RDD cache in MapTran at 
> [SparkPlanGenerator|https://github.com/kellyzly/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java#L202].
> The problem is that IOContext doesn't seem to be initialized correctly in 
> Spark yarn client/cluster mode, which causes exceptions like 
> {code}
> Job aborted due to stage failure: Task 93 in stage 0.0 failed 4 times, most 
> recent failure: Lost task 93.3 in stage 0.0 (TID 616, bdpe48): 
> java.lang.RuntimeException: Error processing row: 
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.processRow(SparkMapRecordHandler.java:165)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:48)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:27)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList.hasNext(HiveBaseFunctionResultList.java:85)
>   at 
> scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.exec.AbstractMapOperator.getNominalPath(AbstractMapOperator.java:101)
>   at 
> org.apache.hadoop.hive.ql.exec.MapOperator.cleanUpInputFileChangedOp(MapOperator.java:516)
>   at 
> org.apache.hadoop.hive.ql.exec.Operator.cleanUpInputFileChanged(Operator.java:1187)
>   at 
> org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:546)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.processRow(SparkMapRecordHandler.java:152)
>   ... 12 more
> Driver stacktrace:
> {code}
> In yarn client/cluster mode, 
> [ExecMapperContext#currentInputPath|https://github.com/kellyzly/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/mr/ExecMapperContext.java#L109]
>  is sometimes null when the RDD cache is enabled.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-18301) Investigate to enable MapInput cache in Hive on Spark

2017-12-26 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16303680#comment-16303680
 ] 

Rui Li commented on HIVE-18301:
---

I think we need to investigate how the input path is used, and what the 
operators need to do when the input file changes, etc. My understanding is that 
this information will be lost if the HadoopRDD is cached.

> Investigate to enable MapInput cache in Hive on Spark
> -
>
> Key: HIVE-18301
> URL: https://issues.apache.org/jira/browse/HIVE-18301
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang
>Assignee: liyunzhang
>
> The IOContext problem was found in MapTran when the Spark RDD cache was 
> enabled (HIVE-8920), so we disabled the RDD cache in MapTran at 
> [SparkPlanGenerator|https://github.com/kellyzly/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java#L202].
> The problem is that IOContext doesn't seem to be initialized correctly in 
> Spark yarn client/cluster mode, which causes exceptions like 
> {code}
> Job aborted due to stage failure: Task 93 in stage 0.0 failed 4 times, most 
> recent failure: Lost task 93.3 in stage 0.0 (TID 616, bdpe48): 
> java.lang.RuntimeException: Error processing row: 
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.processRow(SparkMapRecordHandler.java:165)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:48)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:27)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList.hasNext(HiveBaseFunctionResultList.java:85)
>   at 
> scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.exec.AbstractMapOperator.getNominalPath(AbstractMapOperator.java:101)
>   at 
> org.apache.hadoop.hive.ql.exec.MapOperator.cleanUpInputFileChangedOp(MapOperator.java:516)
>   at 
> org.apache.hadoop.hive.ql.exec.Operator.cleanUpInputFileChanged(Operator.java:1187)
>   at 
> org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:546)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.processRow(SparkMapRecordHandler.java:152)
>   ... 12 more
> Driver stacktrace:
> {code}
> In yarn client/cluster mode, 
> [ExecMapperContext#currentInputPath|https://github.com/kellyzly/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/mr/ExecMapperContext.java#L109]
>  is sometimes null when the RDD cache is enabled.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (HIVE-18148) NPE in SparkDynamicPartitionPruningResolver

2017-12-21 Thread Rui Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-18148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rui Li updated HIVE-18148:
--
Attachment: HIVE-18148.3.patch

Fix the check style.

> NPE in SparkDynamicPartitionPruningResolver
> ---
>
> Key: HIVE-18148
> URL: https://issues.apache.org/jira/browse/HIVE-18148
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Rui Li
>Assignee: Rui Li
> Attachments: HIVE-18148.1.patch, HIVE-18148.2.patch, 
> HIVE-18148.3.patch
>
>
> The stack trace is:
> {noformat}
> 2017-11-27T10:32:38,752 ERROR [e6c8aab5-ddd2-461d-b185-a7597c3e7519 main] 
> ql.Driver: FAILED: NullPointerException null
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hive.ql.optimizer.physical.SparkDynamicPartitionPruningResolver$SparkDynamicPartitionPruningDispatcher.dispatch(SparkDynamicPartitionPruningResolver.java:100)
> at 
> org.apache.hadoop.hive.ql.lib.TaskGraphWalker.dispatch(TaskGraphWalker.java:111)
> at 
> org.apache.hadoop.hive.ql.lib.TaskGraphWalker.walk(TaskGraphWalker.java:180)
> at 
> org.apache.hadoop.hive.ql.lib.TaskGraphWalker.startWalking(TaskGraphWalker.java:125)
> at 
> org.apache.hadoop.hive.ql.optimizer.physical.SparkDynamicPartitionPruningResolver.resolve(SparkDynamicPartitionPruningResolver.java:74)
> at 
> org.apache.hadoop.hive.ql.parse.spark.SparkCompiler.optimizeTaskPlan(SparkCompiler.java:568)
> {noformat}
> At this stage, there shouldn't be a DPP sink whose target map work is null. 
> The root cause seems to be a malformed operator tree generated by 
> SplitOpTreeForDPP.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-18148) NPE in SparkDynamicPartitionPruningResolver

2017-12-19 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16297994#comment-16297994
 ] 

Rui Li commented on HIVE-18148:
---

Both the target table size and the DPP sink output size (smaller output means 
more partitions are pruned) should be taken into account if we want to base the 
decision on statistics. Besides, we also need to consider the cost of 
re-computing, as I mentioned above. Let's handle that as a follow-up.

> NPE in SparkDynamicPartitionPruningResolver
> ---
>
> Key: HIVE-18148
> URL: https://issues.apache.org/jira/browse/HIVE-18148
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Rui Li
>Assignee: Rui Li
> Attachments: HIVE-18148.1.patch, HIVE-18148.2.patch
>
>
> The stack trace is:
> {noformat}
> 2017-11-27T10:32:38,752 ERROR [e6c8aab5-ddd2-461d-b185-a7597c3e7519 main] 
> ql.Driver: FAILED: NullPointerException null
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hive.ql.optimizer.physical.SparkDynamicPartitionPruningResolver$SparkDynamicPartitionPruningDispatcher.dispatch(SparkDynamicPartitionPruningResolver.java:100)
> at 
> org.apache.hadoop.hive.ql.lib.TaskGraphWalker.dispatch(TaskGraphWalker.java:111)
> at 
> org.apache.hadoop.hive.ql.lib.TaskGraphWalker.walk(TaskGraphWalker.java:180)
> at 
> org.apache.hadoop.hive.ql.lib.TaskGraphWalker.startWalking(TaskGraphWalker.java:125)
> at 
> org.apache.hadoop.hive.ql.optimizer.physical.SparkDynamicPartitionPruningResolver.resolve(SparkDynamicPartitionPruningResolver.java:74)
> at 
> org.apache.hadoop.hive.ql.parse.spark.SparkCompiler.optimizeTaskPlan(SparkCompiler.java:568)
> {noformat}
> At this stage, there shouldn't be a DPP sink whose target map work is null. 
> The root cause seems to be a malformed operator tree generated by 
> SplitOpTreeForDPP.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-18148) NPE in SparkDynamicPartitionPruningResolver

2017-12-19 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16297915#comment-16297915
 ] 

Rui Li commented on HIVE-18148:
---

[~kellyzly], we can remove either DPP1 or DPP2 to fix the NPE. I keep the 
uppermost DPP sink mainly for simplicity. Another rationale is that the deeper 
the DPP sink, the more operators get re-computed.
We can implement more complicated rules based on statistics, which can be done 
as follow-ups.

> NPE in SparkDynamicPartitionPruningResolver
> ---
>
> Key: HIVE-18148
> URL: https://issues.apache.org/jira/browse/HIVE-18148
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Rui Li
>Assignee: Rui Li
> Attachments: HIVE-18148.1.patch, HIVE-18148.2.patch
>
>
> The stack trace is:
> {noformat}
> 2017-11-27T10:32:38,752 ERROR [e6c8aab5-ddd2-461d-b185-a7597c3e7519 main] 
> ql.Driver: FAILED: NullPointerException null
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hive.ql.optimizer.physical.SparkDynamicPartitionPruningResolver$SparkDynamicPartitionPruningDispatcher.dispatch(SparkDynamicPartitionPruningResolver.java:100)
> at 
> org.apache.hadoop.hive.ql.lib.TaskGraphWalker.dispatch(TaskGraphWalker.java:111)
> at 
> org.apache.hadoop.hive.ql.lib.TaskGraphWalker.walk(TaskGraphWalker.java:180)
> at 
> org.apache.hadoop.hive.ql.lib.TaskGraphWalker.startWalking(TaskGraphWalker.java:125)
> at 
> org.apache.hadoop.hive.ql.optimizer.physical.SparkDynamicPartitionPruningResolver.resolve(SparkDynamicPartitionPruningResolver.java:74)
> at 
> org.apache.hadoop.hive.ql.parse.spark.SparkCompiler.optimizeTaskPlan(SparkCompiler.java:568)
> {noformat}
> At this stage, there shouldn't be a DPP sink whose target map work is null. 
> The root cause seems to be a malformed operator tree generated by 
> SplitOpTreeForDPP.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-18301) Investigate to enable MapInput cache in Hive on Spark

2017-12-19 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16297894#comment-16297894
 ] 

Rui Li commented on HIVE-18301:
---

If we can cache MapInput, would it be simpler to dynamically identify identical 
MapInputs and cache them, in order to achieve the purpose of HIVE-17486?

> Investigate to enable MapInput cache in Hive on Spark
> -
>
> Key: HIVE-18301
> URL: https://issues.apache.org/jira/browse/HIVE-18301
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang
>Assignee: liyunzhang
>
> The IOContext problem was found in MapTran when the Spark RDD cache was 
> enabled (HIVE-8920), so we disabled the RDD cache in MapTran at 
> [SparkPlanGenerator|https://github.com/kellyzly/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java#L202].
> The problem is that IOContext doesn't seem to be initialized correctly in 
> Spark yarn client/cluster mode, which causes exceptions like 
> {code}
> Job aborted due to stage failure: Task 93 in stage 0.0 failed 4 times, most 
> recent failure: Lost task 93.3 in stage 0.0 (TID 616, bdpe48): 
> java.lang.RuntimeException: Error processing row: 
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.processRow(SparkMapRecordHandler.java:165)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:48)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:27)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList.hasNext(HiveBaseFunctionResultList.java:85)
>   at 
> scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.exec.AbstractMapOperator.getNominalPath(AbstractMapOperator.java:101)
>   at 
> org.apache.hadoop.hive.ql.exec.MapOperator.cleanUpInputFileChangedOp(MapOperator.java:516)
>   at 
> org.apache.hadoop.hive.ql.exec.Operator.cleanUpInputFileChanged(Operator.java:1187)
>   at 
> org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:546)
>   at 
> org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.processRow(SparkMapRecordHandler.java:152)
>   ... 12 more
> Driver stacktrace:
> {code}
> In yarn client/cluster mode, 
> [ExecMapperContext#currentInputPath|https://github.com/kellyzly/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/mr/ExecMapperContext.java#L109]
>  is sometimes null when the RDD cache is enabled.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-18289) Fix jar dependency when enable rdd cache in Hive on Spark

2017-12-19 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16297893#comment-16297893
 ] 

Rui Li commented on HIVE-18289:
---

It seems the reason is that OrcStruct doesn't have a no-arg constructor.
[~owen.omalley], any thoughts on this? Thanks.
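
A minimal sketch of the failure mode, based on the stack trace above: 
{{WritableUtils.clone}} builds the copy via {{ReflectionUtils.newInstance}}, 
which requires a no-arg constructor.
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.ql.io.orc.OrcStruct;
import org.apache.hadoop.util.ReflectionUtils;

// Minimal reproduction of the constructor requirement: OrcStruct only exposes a
// constructor that takes the number of fields, so reflective instantiation fails
// with a RuntimeException wrapping NoSuchMethodException: OrcStruct.<init>().
public final class OrcStructCloneRepro {
  public static void main(String[] args) {
    OrcStruct copy = ReflectionUtils.newInstance(OrcStruct.class, new Configuration());
    System.out.println(copy);  // never reached
  }
}
{code}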

> Fix jar dependency when enable rdd cache in Hive on Spark
> -
>
> Key: HIVE-18289
> URL: https://issues.apache.org/jira/browse/HIVE-18289
> Project: Hive
>  Issue Type: Bug
>Reporter: liyunzhang
>Assignee: liyunzhang
>
> Running DS/query28 with HIVE-17486's 4th patch enabled, on 
> tpcds_bin_partitioned_orc_10, in either Spark local or yarn mode.
> Command:
> {code}
> set spark.local=yarn-client;
> echo 'use tpcds_bin_partitioned_orc_10;source query28.sql;'|hive --hiveconf 
> spark.app.name=query28.sql  --hiveconf hive.spark.optimize.shared.work=true 
> -i testbench.settings -i query28.sql.setting
> {code}
> The exception:
> {code}
> ava.lang.RuntimeException: java.lang.NoSuchMethodException: 
> org.apache.hadoop.hive.ql.io.orc.OrcStruct.()
> 748678 at 
> org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:134) 
> ~[hadoop-common-2.7.3.jar:?]
> 748679 at 
> org.apache.hadoop.io.WritableUtils.clone(WritableUtils.java:217) 
> ~[hadoop-common-2.7.3.jar:?]
> 748680 at 
> org.apache.hadoop.hive.ql.exec.spark.MapInput$CopyFunction.call(MapInput.java:85)
>  ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.   0-SNAPSHOT]
> 748681 at 
> org.apache.hadoop.hive.ql.exec.spark.MapInput$CopyFunction.call(MapInput.java:72)
>  ~[hive-exec-3.0.0-SNAPSHOT.jar:3.0.   0-SNAPSHOT]
> 748682 at 
> org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:1031)
>  ~[spark-core_2.11-2.   0.0.jar:2.0.0]
> 748683 at 
> org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:1031)
>  ~[spark-core_2.11-2.   0.0.jar:2.0.0]
> 748684 at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) 
> ~[scala-library-2.11.8.jar:?]
> 748685 at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:214)
>  ~[spark-core_2.11-2.0.0.jar:2.   0.0]
> 748686 at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:919)
>  ~[spark-core_2.11-2.0.0.   jar:2.0.0]
> 748687 at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:910)
>  ~[spark-core_2.11-2.0.0.   jar:2.0.0]
> 748688 at 
> org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866) 
> ~[spark-core_2.11-2.0.0.jar:2.0.0]
> 748689 at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:910) 
> ~[spark-core_2.11-2.0.0.jar:2.0.0]
> 748690 at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:668) 
> ~[spark-core_2.11-2.0.0.jar:2.0.0]
> 748691 at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330) 
> ~[spark-core_2.11-2.0.0.jar:2.0.0]
> 748692 at org.apache.spark.rdd.RDD.iterator(RDD.scala:281) 
> ~[spark-core_2.11-2.0.0.jar:2.0.0]
> 748693 at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) 
> ~[spark-core_2.11-2.0.0.jar:2.0.0]
> 748694 at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) 
> ~[spark-core_2.11-2.0.0.jar:2.0.0]
> 748695 at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) 
> ~[spark-core_2.11-2.0.0.jar:2.0.0]
> 748696 at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) 
> ~[spark-core_2.11-2
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-18148) NPE in SparkDynamicPartitionPruningResolver

2017-12-19 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16297883#comment-16297883
 ] 

Rui Li commented on HIVE-18148:
---

bq. If first tranverses JOIN, then remove DPP2.
No, it only collects DPP sinks in the downstream tree starting from a branching 
operator. So if it first traverses JOIN, it won't find any nested DPP sinks.
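
To be concrete, the traversal is roughly like the sketch below (illustration 
only; the predicate stands in for the DPP-sink check so no exact class name is 
assumed):
{code:java}
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;
import org.apache.hadoop.hive.ql.exec.Operator;

// Rough sketch, not the resolver code: starting from a branching operator, walk the
// downstream operator tree and collect every operator the caller considers a DPP sink.
final class DownstreamDppSinks {
  static List<Operator<?>> collect(Operator<?> branchingOp, Predicate<Operator<?>> isDppSink) {
    List<Operator<?>> sinks = new ArrayList<>();
    walk(branchingOp, isDppSink, sinks, new HashSet<>());
    return sinks;
  }

  private static void walk(Operator<?> op, Predicate<Operator<?>> isDppSink,
      List<Operator<?>> sinks, Set<Operator<?>> visited) {
    if (!visited.add(op)) {
      return;  // already seen this operator
    }
    if (isDppSink.test(op)) {
      sinks.add(op);
    }
    if (op.getChildOperators() != null) {
      for (Operator<?> child : op.getChildOperators()) {
        walk(child, isDppSink, sinks, visited);
      }
    }
  }
}
{code}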

> NPE in SparkDynamicPartitionPruningResolver
> ---
>
> Key: HIVE-18148
> URL: https://issues.apache.org/jira/browse/HIVE-18148
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Rui Li
>Assignee: Rui Li
> Attachments: HIVE-18148.1.patch, HIVE-18148.2.patch
>
>
> The stack trace is:
> {noformat}
> 2017-11-27T10:32:38,752 ERROR [e6c8aab5-ddd2-461d-b185-a7597c3e7519 main] 
> ql.Driver: FAILED: NullPointerException null
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hive.ql.optimizer.physical.SparkDynamicPartitionPruningResolver$SparkDynamicPartitionPruningDispatcher.dispatch(SparkDynamicPartitionPruningResolver.java:100)
> at 
> org.apache.hadoop.hive.ql.lib.TaskGraphWalker.dispatch(TaskGraphWalker.java:111)
> at 
> org.apache.hadoop.hive.ql.lib.TaskGraphWalker.walk(TaskGraphWalker.java:180)
> at 
> org.apache.hadoop.hive.ql.lib.TaskGraphWalker.startWalking(TaskGraphWalker.java:125)
> at 
> org.apache.hadoop.hive.ql.optimizer.physical.SparkDynamicPartitionPruningResolver.resolve(SparkDynamicPartitionPruningResolver.java:74)
> at 
> org.apache.hadoop.hive.ql.parse.spark.SparkCompiler.optimizeTaskPlan(SparkCompiler.java:568)
> {noformat}
> At this stage, there shouldn't be a DPP sink whose target map work is null. 
> The root cause seems to be a malformed operator tree generated by 
> SplitOpTreeForDPP.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-18282) Spark tar is downloaded every time for itest

2017-12-19 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16297869#comment-16297869
 ] 

Rui Li commented on HIVE-18282:
---

Thanks [~stakiar] for uploading the file. Do you think we can make the code 
change, so that we don't hit a similar problem in the future?

> Spark tar is downloaded every time for itest
> 
>
> Key: HIVE-18282
> URL: https://issues.apache.org/jira/browse/HIVE-18282
> Project: Hive
>  Issue Type: Test
>Reporter: Rui Li
> Attachments: HIVE-18282.1.patch
>
>
> Seems we missed the md5 file for spark-2.2.0?
> cc [~kellyzly], [~stakiar]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-18304) datediff() UDF returns a wrong result when dealing with a (date, string) input

2017-12-19 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16297832#comment-16297832
 ] 

Rui Li commented on HIVE-18304:
---

I can't reproduce the issue on my side - the two queries return the same 
result, and my laptop is in UTC+8. Maybe it's fixed by HIVE-15338?
[~hengyu.dai], which Hive version are you using?

[~xuefuz], the timezone stuff I worked on is about the timestamptz type, so 
it's not related here.
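
For reference, a back-of-the-envelope check of the reported numbers, assuming 
the conversions described in the issue and a UTC+8 session:
{code:java}
// The date-typed argument becomes local midnight (2017-12-18T00:00+08:00) while the
// string argument becomes UTC midnight (2012-01-01T08:00+08:00), so the difference is
// 8 hours short of a whole number of days and integer division drops one day.
public final class DatediffOffByOne {
  public static void main(String[] args) {
    long msPerDay = 24L * 60 * 60 * 1000;
    long diffMs = 2178L * msPerDay - 8L * 60 * 60 * 1000;  // 2178 days minus 8 hours
    System.out.println(diffMs / msPerDay);                 // prints 2177, matching the report
  }
}
{code}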

> datediff() UDF returns a wrong result when dealing with a (date, string) input
> --
>
> Key: HIVE-18304
> URL: https://issues.apache.org/jira/browse/HIVE-18304
> Project: Hive
>  Issue Type: Bug
>  Components: UDF
>Reporter: Hengyu Dai
>Assignee: Hengyu Dai
>Priority: Minor
> Attachments: 0001.patch
>
>
> For a date-typed argument, datediff() uses DateConverter to convert the input 
> to a java Date object; for example, '2017-12-18' becomes 
> 2017-12-18T00:00:00.000+0800.
> For a string-typed argument, datediff() uses TextConverter to convert the 
> string to a date; for '2012-01-01' we get 2012-01-01T08:00:00.000+0800.
> As a result, datediff() returns a number less than the real date diff.
> We should use TextConverter to deal with date input too.
> reproduce:
> {code:java}
> select datediff(cast('2017-12-18' as date), '2012-01-01'); --2177
> select datediff('2017-12-18', '2012-01-01'); --2178
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (HIVE-18111) Fix temp path for Spark DPP sink

2017-12-17 Thread Rui Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-18111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rui Li updated HIVE-18111:
--
   Resolution: Fixed
Fix Version/s: 3.0.0
   Status: Resolved  (was: Patch Available)

Pushed to master. Thanks Sahil for reviewing.

> Fix temp path for Spark DPP sink
> 
>
> Key: HIVE-18111
> URL: https://issues.apache.org/jira/browse/HIVE-18111
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Rui Li
>Assignee: Rui Li
> Fix For: 3.0.0
>
> Attachments: HIVE-18111.1.patch, HIVE-18111.2.patch, 
> HIVE-18111.3.patch, HIVE-18111.4.patch, HIVE-18111.5.patch, HIVE-18111.5.patch
>
>
> Before HIVE-17877, each DPP sink has only one target work. The output path of 
> a DPP work is {{TMP_PATH/targetWorkId/dppWorkId}}. When we do the pruning, 
> each map work reads DPP outputs under {{TMP_PATH/targetWorkId}}.
> After HIVE-17877, each DPP sink can have multiple target works. It's possible 
> that a map work needs to read DPP outputs from multiple 
> {{TMP_PATH/targetWorkId}}. To solve this, I think we can have a DPP output 
> path specific to each query, e.g. {{QUERY_TMP_PATH/dpp_output}}. Each DPP 
> work outputs to {{QUERY_TMP_PATH/dpp_output/dppWorkId}}. And each map work 
> reads from {{QUERY_TMP_PATH/dpp_output}}.
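
A rough illustration of the per-query layout described above (the helper and 
names are for illustration only):
{code:java}
import org.apache.hadoop.fs.Path;

// Sketch only: every DPP sink writes under a single query-scoped directory, and a
// map work simply scans that directory instead of one directory per target work.
final class DppPaths {
  static Path sinkOutputDir(Path queryTmpPath, String dppWorkId) {
    return new Path(new Path(queryTmpPath, "dpp_output"), dppWorkId);  // write side
  }

  static Path pruningInputDir(Path queryTmpPath) {
    return new Path(queryTmpPath, "dpp_output");                       // read side
  }
}
{code}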



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (HIVE-18148) NPE in SparkDynamicPartitionPruningResolver

2017-12-14 Thread Rui Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-18148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rui Li updated HIVE-18148:
--
Attachment: HIVE-18148.2.patch

Update patch v2 to fix a bug: we shouldn't stop searching at the MJ operator. 
Instead, only nested DPP sinks associated with a common join will be removed. 
Another test is added and the comments are updated accordingly.

> NPE in SparkDynamicPartitionPruningResolver
> ---
>
> Key: HIVE-18148
> URL: https://issues.apache.org/jira/browse/HIVE-18148
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Rui Li
>Assignee: Rui Li
> Attachments: HIVE-18148.1.patch, HIVE-18148.2.patch
>
>
> The stack trace is:
> {noformat}
> 2017-11-27T10:32:38,752 ERROR [e6c8aab5-ddd2-461d-b185-a7597c3e7519 main] 
> ql.Driver: FAILED: NullPointerException null
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hive.ql.optimizer.physical.SparkDynamicPartitionPruningResolver$SparkDynamicPartitionPruningDispatcher.dispatch(SparkDynamicPartitionPruningResolver.java:100)
> at 
> org.apache.hadoop.hive.ql.lib.TaskGraphWalker.dispatch(TaskGraphWalker.java:111)
> at 
> org.apache.hadoop.hive.ql.lib.TaskGraphWalker.walk(TaskGraphWalker.java:180)
> at 
> org.apache.hadoop.hive.ql.lib.TaskGraphWalker.startWalking(TaskGraphWalker.java:125)
> at 
> org.apache.hadoop.hive.ql.optimizer.physical.SparkDynamicPartitionPruningResolver.resolve(SparkDynamicPartitionPruningResolver.java:74)
> at 
> org.apache.hadoop.hive.ql.parse.spark.SparkCompiler.optimizeTaskPlan(SparkCompiler.java:568)
> {noformat}
> At this stage, there shouldn't be a DPP sink whose target map work is null. 
> The root cause seems to be a malformed operator tree generated by 
> SplitOpTreeForDPP.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (HIVE-18282) Spark tar is downloaded every time for itest

2017-12-14 Thread Rui Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-18282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rui Li updated HIVE-18282:
--
Status: Patch Available  (was: Open)

> Spark tar is downloaded every time for itest
> 
>
> Key: HIVE-18282
> URL: https://issues.apache.org/jira/browse/HIVE-18282
> Project: Hive
>  Issue Type: Test
>Reporter: Rui Li
> Attachments: HIVE-18282.1.patch
>
>
> Seems we missed the md5 file for spark-2.2.0?
> cc [~kellyzly], [~stakiar]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (HIVE-18282) Spark tar is downloaded every time for itest

2017-12-14 Thread Rui Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-18282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rui Li updated HIVE-18282:
--
Attachment: HIVE-18282.1.patch

Meanwhile, I think it's better to avoid the re-download when we can't download 
the checksum.

> Spark tar is downloaded every time for itest
> 
>
> Key: HIVE-18282
> URL: https://issues.apache.org/jira/browse/HIVE-18282
> Project: Hive
>  Issue Type: Test
>Reporter: Rui Li
> Attachments: HIVE-18282.1.patch
>
>
> Seems we missed the md5 file for spark-2.2.0?
> cc [~kellyzly], [~stakiar]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-18148) NPE in SparkDynamicPartitionPruningResolver

2017-12-13 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16288933#comment-16288933
 ] 

Rui Li commented on HIVE-18148:
---

[~kellyzly], that's OK. I didn't add it because it's already the default.

> NPE in SparkDynamicPartitionPruningResolver
> ---
>
> Key: HIVE-18148
> URL: https://issues.apache.org/jira/browse/HIVE-18148
> Project: Hive
>  Issue Type: Bug
>  Components: Spark
>Reporter: Rui Li
>Assignee: Rui Li
> Attachments: HIVE-18148.1.patch
>
>
> The stack trace is:
> {noformat}
> 2017-11-27T10:32:38,752 ERROR [e6c8aab5-ddd2-461d-b185-a7597c3e7519 main] 
> ql.Driver: FAILED: NullPointerException null
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hive.ql.optimizer.physical.SparkDynamicPartitionPruningResolver$SparkDynamicPartitionPruningDispatcher.dispatch(SparkDynamicPartitionPruningResolver.java:100)
> at 
> org.apache.hadoop.hive.ql.lib.TaskGraphWalker.dispatch(TaskGraphWalker.java:111)
> at 
> org.apache.hadoop.hive.ql.lib.TaskGraphWalker.walk(TaskGraphWalker.java:180)
> at 
> org.apache.hadoop.hive.ql.lib.TaskGraphWalker.startWalking(TaskGraphWalker.java:125)
> at 
> org.apache.hadoop.hive.ql.optimizer.physical.SparkDynamicPartitionPruningResolver.resolve(SparkDynamicPartitionPruningResolver.java:74)
> at 
> org.apache.hadoop.hive.ql.parse.spark.SparkCompiler.optimizeTaskPlan(SparkCompiler.java:568)
> {noformat}
> At this stage, there shouldn't be a DPP sink whose target map work is null. 
> The root cause seems to be a malformed operator tree generated by 
> SplitOpTreeForDPP.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

