[jira] [Comment Edited] (SPARK-25829) Duplicated map keys are not handled consistently

2018-11-14 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16663118#comment-16663118
 ] 

Wenchen Fan edited comment on SPARK-25829 at 11/15/18 7:32 AM:
---

More investigation into the "later entry wins" semantic.

If we still allow duplicated keys in maps physically, the following functions need
to be updated:
Explode, PosExplode, GetMapValue, MapKeys, MapValues, MapEntries,
TransformKeys, TransformValues, MapZipWith

If we want to forbid duplicated keys in maps, the following functions need to be
updated:
CreateMap, MapFromArrays, MapFromEntries, MapFromString, MapConcat,
TransformKeys, MapFilter, and also reading maps from data sources.

So the "later entry wins" semantic is the better choice, but it needs more work.


was (Author: cloud_fan):
More investigation into the "later entry wins" semantic.

If we still allow duplicated keys in maps physically, the following functions need
to be updated:
Explode, PosExplode, GetMapValue, MapKeys, MapValues, MapEntries,
TransformKeys, TransformValues, MapZipWith

If we want to forbid duplicated keys in maps, the following functions need to be
updated:
CreateMap, MapFromArrays, MapFromEntries, MapFromString, MapConcat, MapFilter,
and also reading maps from data sources.

So the "later entry wins" semantic is the better choice, but it needs more work.

> Duplicated map keys are not handled consistently
> 
>
> Key: SPARK-25829
> URL: https://issues.apache.org/jira/browse/SPARK-25829
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Priority: Major
>
> In Spark SQL, we apply the "earlier entry wins" semantic to duplicated map keys,
> e.g.
> {code}
> scala> sql("SELECT map(1,2,1,3)[1]").show
> +------------------+
> |map(1, 2, 1, 3)[1]|
> +------------------+
> |                 2|
> +------------------+
> {code}
> However, this handling is not applied consistently.






[jira] [Comment Edited] (SPARK-25829) Duplicated map keys are not handled consistently

2018-11-14 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16663118#comment-16663118
 ] 

Wenchen Fan edited comment on SPARK-25829 at 11/15/18 7:28 AM:
---

More investigation into the "later entry wins" semantic.

If we still allow duplicated keys in maps physically, the following functions need
to be updated:
Explode, PosExplode, GetMapValue, MapKeys, MapValues, MapEntries,
TransformKeys, TransformValues, MapZipWith

If we want to forbid duplicated keys in maps, the following functions need to be
updated:
CreateMap, MapFromArrays, MapFromEntries, MapFromString, MapConcat, MapFilter,
and also reading maps from data sources.

So the "later entry wins" semantic is the better choice, but it needs more work.


was (Author: cloud_fan):
More investigation into the "later entry wins" semantic.

If we still allow duplicated keys in maps physically, the following functions need
to be updated:
Explode, PosExplode, GetMapValue, MapKeys, MapValues, MapEntries,
TransformKeys, TransformValues, MapZipWith

If we want to forbid duplicated keys in maps, the following functions need to be
updated:
CreateMap, MapFromArrays, MapFromEntries, MapConcat, MapFilter, and also
reading maps from data sources.

So the "later entry wins" semantic is the better choice, but it needs more work.

> Duplicated map keys are not handled consistently
> 
>
> Key: SPARK-25829
> URL: https://issues.apache.org/jira/browse/SPARK-25829
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Priority: Major
>
> In Spark SQL, we apply the "earlier entry wins" semantic to duplicated map keys,
> e.g.
> {code}
> scala> sql("SELECT map(1,2,1,3)[1]").show
> +------------------+
> |map(1, 2, 1, 3)[1]|
> +------------------+
> |                 2|
> +------------------+
> {code}
> However, this handling is not applied consistently.






[jira] [Assigned] (SPARK-26069) Flaky test: RpcIntegrationSuite.sendRpcWithStreamFailures

2018-11-14 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26069:


Assignee: Shixiong Zhu  (was: Apache Spark)

> Flaky test: RpcIntegrationSuite.sendRpcWithStreamFailures
> -
>
> Key: SPARK-26069
> URL: https://issues.apache.org/jira/browse/SPARK-26069
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
>
> {code}
> sbt.ForkMain$ForkError: java.lang.AssertionError: expected:<1> but was:<2>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:834)
>   at org.junit.Assert.assertEquals(Assert.java:645)
>   at org.junit.Assert.assertEquals(Assert.java:631)
>   at 
> org.apache.spark.network.RpcIntegrationSuite.assertErrorAndClosed(RpcIntegrationSuite.java:386)
>   at 
> org.apache.spark.network.RpcIntegrationSuite.sendRpcWithStreamFailures(RpcIntegrationSuite.java:347)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at org.junit.runners.Suite.runChild(Suite.java:128)
>   at org.junit.runners.Suite.runChild(Suite.java:27)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
>   at org.junit.runner.JUnitCore.run(JUnitCore.java:115)
>   at com.novocode.junit.JUnitRunner$1.execute(JUnitRunner.java:132)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:296)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:286)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {code}






[jira] [Commented] (SPARK-26069) Flaky test: RpcIntegrationSuite.sendRpcWithStreamFailures

2018-11-14 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16687589#comment-16687589
 ] 

Apache Spark commented on SPARK-26069:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/23041

> Flaky test: RpcIntegrationSuite.sendRpcWithStreamFailures
> -
>
> Key: SPARK-26069
> URL: https://issues.apache.org/jira/browse/SPARK-26069
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
>
> {code}
> sbt.ForkMain$ForkError: java.lang.AssertionError: expected:<1> but was:<2>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:834)
>   at org.junit.Assert.assertEquals(Assert.java:645)
>   at org.junit.Assert.assertEquals(Assert.java:631)
>   at 
> org.apache.spark.network.RpcIntegrationSuite.assertErrorAndClosed(RpcIntegrationSuite.java:386)
>   at 
> org.apache.spark.network.RpcIntegrationSuite.sendRpcWithStreamFailures(RpcIntegrationSuite.java:347)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at org.junit.runners.Suite.runChild(Suite.java:128)
>   at org.junit.runners.Suite.runChild(Suite.java:27)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
>   at org.junit.runner.JUnitCore.run(JUnitCore.java:115)
>   at com.novocode.junit.JUnitRunner$1.execute(JUnitRunner.java:132)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:296)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:286)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {code}






[jira] [Assigned] (SPARK-26069) Flaky test: RpcIntegrationSuite.sendRpcWithStreamFailures

2018-11-14 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26069:


Assignee: Apache Spark  (was: Shixiong Zhu)

> Flaky test: RpcIntegrationSuite.sendRpcWithStreamFailures
> -
>
> Key: SPARK-26069
> URL: https://issues.apache.org/jira/browse/SPARK-26069
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>Priority: Major
>
> {code}
> sbt.ForkMain$ForkError: java.lang.AssertionError: expected:<1> but was:<2>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:834)
>   at org.junit.Assert.assertEquals(Assert.java:645)
>   at org.junit.Assert.assertEquals(Assert.java:631)
>   at 
> org.apache.spark.network.RpcIntegrationSuite.assertErrorAndClosed(RpcIntegrationSuite.java:386)
>   at 
> org.apache.spark.network.RpcIntegrationSuite.sendRpcWithStreamFailures(RpcIntegrationSuite.java:347)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at org.junit.runners.Suite.runChild(Suite.java:128)
>   at org.junit.runners.Suite.runChild(Suite.java:27)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
>   at org.junit.runner.JUnitCore.run(JUnitCore.java:115)
>   at com.novocode.junit.JUnitRunner$1.execute(JUnitRunner.java:132)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:296)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:286)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {code}






[jira] [Created] (SPARK-26069) Flaky test: RpcIntegrationSuite.sendRpcWithStreamFailures

2018-11-14 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-26069:


 Summary: Flaky test: RpcIntegrationSuite.sendRpcWithStreamFailures
 Key: SPARK-26069
 URL: https://issues.apache.org/jira/browse/SPARK-26069
 Project: Spark
  Issue Type: Test
  Components: Tests
Affects Versions: 2.4.0
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu


{code}
sbt.ForkMain$ForkError: java.lang.AssertionError: expected:<1> but was:<2>
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.failNotEquals(Assert.java:834)
at org.junit.Assert.assertEquals(Assert.java:645)
at org.junit.Assert.assertEquals(Assert.java:631)
at 
org.apache.spark.network.RpcIntegrationSuite.assertErrorAndClosed(RpcIntegrationSuite.java:386)
at 
org.apache.spark.network.RpcIntegrationSuite.sendRpcWithStreamFailures(RpcIntegrationSuite.java:347)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
at 
org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
at org.junit.runners.Suite.runChild(Suite.java:128)
at org.junit.runners.Suite.runChild(Suite.java:27)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
at org.junit.runner.JUnitCore.run(JUnitCore.java:115)
at com.novocode.junit.JUnitRunner$1.execute(JUnitRunner.java:132)
at sbt.ForkMain$Run$2.call(ForkMain.java:296)
at sbt.ForkMain$Run$2.call(ForkMain.java:286)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
{code}






[jira] [Commented] (SPARK-26068) ChunkedByteBufferInputStream is truncated by empty chunk

2018-11-14 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16687497#comment-16687497
 ] 

Apache Spark commented on SPARK-26068:
--

User 'linhong-intel' has created a pull request for this issue:
https://github.com/apache/spark/pull/23040

> ChunkedByteBufferInputStream is truncated by empty chunk
> 
>
> Key: SPARK-26068
> URL: https://issues.apache.org/jira/browse/SPARK-26068
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Liu, Linhong
>Priority: Major
>
> If ChunkedByteBuffer contains empty chunk in the middle of it, then the 
> ChunkedByteBufferInputStream will be truncated. All data behind the empty 
> chunk will not be read.
> The problematic code:
> {code:java}
> // ChunkedByteBuffer.scala
> // Assume chunks.next returns an empty chunk, then we will reach
> // else branch no matter chunks.hasNext = true or not. So some data is lost.
> override def read(dest: Array[Byte], offset: Int, length: Int): Int = {
>   if (currentChunk != null && !currentChunk.hasRemaining && chunks.hasNext)   
>  {
> currentChunk = chunks.next()
>   }
>   if (currentChunk != null && currentChunk.hasRemaining) {
> val amountToGet = math.min(currentChunk.remaining(), length)
> currentChunk.get(dest, offset, amountToGet)
> amountToGet
>   } else {
> close()
> -1
>   }
> } {code}






[jira] [Assigned] (SPARK-26068) ChunkedByteBufferInputStream is truncated by empty chunk

2018-11-14 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26068:


Assignee: Apache Spark

> ChunkedByteBufferInputStream is truncated by empty chunk
> 
>
> Key: SPARK-26068
> URL: https://issues.apache.org/jira/browse/SPARK-26068
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Liu, Linhong
>Assignee: Apache Spark
>Priority: Major
>
> If ChunkedByteBuffer contains empty chunk in the middle of it, then the 
> ChunkedByteBufferInputStream will be truncated. All data behind the empty 
> chunk will not be read.
> The problematic code:
> {code:java}
> // ChunkedByteBuffer.scala
> // Assume chunks.next returns an empty chunk, then we will reach
> // else branch no matter chunks.hasNext = true or not. So some data is lost.
> override def read(dest: Array[Byte], offset: Int, length: Int): Int = {
>   if (currentChunk != null && !currentChunk.hasRemaining && chunks.hasNext)   
>  {
> currentChunk = chunks.next()
>   }
>   if (currentChunk != null && currentChunk.hasRemaining) {
> val amountToGet = math.min(currentChunk.remaining(), length)
> currentChunk.get(dest, offset, amountToGet)
> amountToGet
>   } else {
> close()
> -1
>   }
> } {code}






[jira] [Commented] (SPARK-26068) ChunkedByteBufferInputStream is truncated by empty chunk

2018-11-14 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16687498#comment-16687498
 ] 

Apache Spark commented on SPARK-26068:
--

User 'linhong-intel' has created a pull request for this issue:
https://github.com/apache/spark/pull/23040

> ChunkedByteBufferInputStream is truncated by empty chunk
> 
>
> Key: SPARK-26068
> URL: https://issues.apache.org/jira/browse/SPARK-26068
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Liu, Linhong
>Priority: Major
>
> If ChunkedByteBuffer contains empty chunk in the middle of it, then the 
> ChunkedByteBufferInputStream will be truncated. All data behind the empty 
> chunk will not be read.
> The problematic code:
> {code:java}
> // ChunkedByteBuffer.scala
> // Assume chunks.next returns an empty chunk, then we will reach
> // else branch no matter chunks.hasNext = true or not. So some data is lost.
> override def read(dest: Array[Byte], offset: Int, length: Int): Int = {
>   if (currentChunk != null && !currentChunk.hasRemaining && chunks.hasNext)   
>  {
> currentChunk = chunks.next()
>   }
>   if (currentChunk != null && currentChunk.hasRemaining) {
> val amountToGet = math.min(currentChunk.remaining(), length)
> currentChunk.get(dest, offset, amountToGet)
> amountToGet
>   } else {
> close()
> -1
>   }
> } {code}






[jira] [Assigned] (SPARK-26068) ChunkedByteBufferInputStream is truncated by empty chunk

2018-11-14 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26068:


Assignee: (was: Apache Spark)

> ChunkedByteBufferInputStream is truncated by empty chunk
> 
>
> Key: SPARK-26068
> URL: https://issues.apache.org/jira/browse/SPARK-26068
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Liu, Linhong
>Priority: Major
>
> If ChunkedByteBuffer contains empty chunk in the middle of it, then the 
> ChunkedByteBufferInputStream will be truncated. All data behind the empty 
> chunk will not be read.
> The problematic code:
> {code:java}
> // ChunkedByteBuffer.scala
> // Assume chunks.next returns an empty chunk, then we will reach
> // else branch no matter chunks.hasNext = true or not. So some data is lost.
> override def read(dest: Array[Byte], offset: Int, length: Int): Int = {
>   if (currentChunk != null && !currentChunk.hasRemaining && chunks.hasNext)   
>  {
> currentChunk = chunks.next()
>   }
>   if (currentChunk != null && currentChunk.hasRemaining) {
> val amountToGet = math.min(currentChunk.remaining(), length)
> currentChunk.get(dest, offset, amountToGet)
> amountToGet
>   } else {
> close()
> -1
>   }
> } {code}






[jira] [Updated] (SPARK-26068) ChunkedByteBufferInputStream is truncated by empty chunk

2018-11-14 Thread Liu, Linhong (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liu, Linhong updated SPARK-26068:
-
Description: 
If ChunkedByteBuffer contains empty chunk in the middle of it, then the 
ChunkedByteBufferInputStream will be truncated. All data behind the empty chunk 
will not be read.

The problematic code:
{code:java}
// ChunkedByteBuffer.scala
// Assume chunks.next returns an empty chunk, then we will reach
// else branch no matter chunks.hasNext = true or not. So some data is lost.
override def read(dest: Array[Byte], offset: Int, length: Int): Int = {
  if (currentChunk != null && !currentChunk.hasRemaining && chunks.hasNext){
currentChunk = chunks.next()
  }
  if (currentChunk != null && currentChunk.hasRemaining) {
val amountToGet = math.min(currentChunk.remaining(), length)
currentChunk.get(dest, offset, amountToGet)
amountToGet
  } else {
close()
-1
  }
} {code}

  was:
If ChunkedByteBuffer contains empty chunk in the middle of it, then the 
ChunkedByteBufferInputStream will be truncated. All data behind the empty chunk 
will not be read.

The problematic code:
{code:java}
// ChunkedByteBuffer.scala
// Assume chunks.next returns an empty chunk, then we will reach
// else branch no matter chunks.hasNext = true or not. So some data is lost.
override def read(dest: Array[Byte], offset: Int, length: Int): Int = {
  if (currentChunk != null && !currentChunk.hasRemaining && chunks.hasNext){
currentChunk = chunks.next()
  }
  if (currentChunk != null && currentChunk.hasRemaining) {
val amountToGet = math.min(currentChunk.remaining(), length)
currentChunk.get(dest, offset, amountToGet)
amountToGet
  } else {
close()
-1
  }
}
{code}
 

 


> ChunkedByteBufferInputStream is truncated by empty chunk
> 
>
> Key: SPARK-26068
> URL: https://issues.apache.org/jira/browse/SPARK-26068
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Liu, Linhong
>Priority: Major
>
> If ChunkedByteBuffer contains empty chunk in the middle of it, then the 
> ChunkedByteBufferInputStream will be truncated. All data behind the empty 
> chunk will not be read.
> The problematic code:
> {code:java}
> // ChunkedByteBuffer.scala
> // Assume chunks.next returns an empty chunk, then we will reach
> // else branch no matter chunks.hasNext = true or not. So some data is lost.
> override def read(dest: Array[Byte], offset: Int, length: Int): Int = {
>   if (currentChunk != null && !currentChunk.hasRemaining && chunks.hasNext)   
>  {
> currentChunk = chunks.next()
>   }
>   if (currentChunk != null && currentChunk.hasRemaining) {
> val amountToGet = math.min(currentChunk.remaining(), length)
> currentChunk.get(dest, offset, amountToGet)
> amountToGet
>   } else {
> close()
> -1
>   }
> } {code}






[jira] [Updated] (SPARK-26068) ChunkedByteBufferInputStream is truncated by empty chunk

2018-11-14 Thread Liu, Linhong (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liu, Linhong updated SPARK-26068:
-
Description: 
If ChunkedByteBuffer contains empty chunk in the middle of it, then the 
ChunkedByteBufferInputStream will be truncated. All data behind the empty chunk 
will not be read.

The problematic code:
{code:java}
// ChunkedByteBuffer.scala
// Assume chunks.next returns an empty chunk, then we will reach
// else branch no matter chunks.hasNext = true or not. So some data is lost.
override def read(dest: Array[Byte], offset: Int, length: Int): Int = {
  if (currentChunk != null && !currentChunk.hasRemaining && chunks.hasNext){
currentChunk = chunks.next()
  }
  if (currentChunk != null && currentChunk.hasRemaining) {
val amountToGet = math.min(currentChunk.remaining(), length)
currentChunk.get(dest, offset, amountToGet)
amountToGet
  } else {
close()
-1
  }
}
{code}
 

 

  was:
If ChunkedByteBuffer contains empty chunk in the middle of it, then the 
ChunkedByteBufferInputStream will be truncated. All data behind the empty chunk 
will not be read.

The problematic code

 
{code:java}
// ChunkedByteBuffer.scala
// Assume chunks.next returns an empty chunk, then we will reach
// else branch no matter chunks.hasNext = true or not. So some data is lost.
override def read(dest: Array[Byte], offset: Int, length: Int): Int = {
  if (currentChunk != null && !currentChunk.hasRemaining && chunks.hasNext){
currentChunk = chunks.next()
  }
  if (currentChunk != null && currentChunk.hasRemaining) {
val amountToGet = math.min(currentChunk.remaining(), length)
currentChunk.get(dest, offset, amountToGet)
amountToGet
  } else {
close()
-1
  }
}
{code}
 

 


> ChunkedByteBufferInputStream is truncated by empty chunk
> 
>
> Key: SPARK-26068
> URL: https://issues.apache.org/jira/browse/SPARK-26068
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Liu, Linhong
>Priority: Major
>
> If ChunkedByteBuffer contains empty chunk in the middle of it, then the 
> ChunkedByteBufferInputStream will be truncated. All data behind the empty 
> chunk will not be read.
> The problematic code:
> {code:java}
> // ChunkedByteBuffer.scala
> // Assume chunks.next returns an empty chunk, then we will reach
> // else branch no matter chunks.hasNext = true or not. So some data is lost.
> override def read(dest: Array[Byte], offset: Int, length: Int): Int = {
>   if (currentChunk != null && !currentChunk.hasRemaining && chunks.hasNext)   
>  {
> currentChunk = chunks.next()
>   }
>   if (currentChunk != null && currentChunk.hasRemaining) {
> val amountToGet = math.min(currentChunk.remaining(), length)
> currentChunk.get(dest, offset, amountToGet)
> amountToGet
>   } else {
> close()
> -1
>   }
> }
> {code}
>  
>  






[jira] [Created] (SPARK-26068) ChunkedByteBufferInputStream is truncated by empty chunk

2018-11-14 Thread Liu, Linhong (JIRA)
Liu, Linhong created SPARK-26068:


 Summary: ChunkedByteBufferInputStream is truncated by empty chunk
 Key: SPARK-26068
 URL: https://issues.apache.org/jira/browse/SPARK-26068
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Liu, Linhong


If a ChunkedByteBuffer contains an empty chunk in the middle, then the
ChunkedByteBufferInputStream is truncated: all data after the empty chunk is
never read.

The problematic code:

{code:java}
// ChunkedByteBuffer.scala
// If chunks.next() returns an empty chunk, we fall into the else branch
// regardless of whether chunks.hasNext is true, so the remaining data is lost.
override def read(dest: Array[Byte], offset: Int, length: Int): Int = {
  if (currentChunk != null && !currentChunk.hasRemaining && chunks.hasNext) {
    currentChunk = chunks.next()
  }
  if (currentChunk != null && currentChunk.hasRemaining) {
    val amountToGet = math.min(currentChunk.remaining(), length)
    currentChunk.get(dest, offset, amountToGet)
    amountToGet
  } else {
    close()
    -1
  }
}
{code}
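For illustration, a minimal sketch of one possible fix (an assumption, not
necessarily the actual change made for this issue): advance past exhausted chunks
in a loop, so that a zero-length chunk in the middle does not end the read early.

{code:scala}
// Hypothetical fix sketch; it assumes the same ChunkedByteBufferInputStream
// members as the snippet above (currentChunk, chunks, close()).
// Skip *all* consecutive exhausted chunks before deciding the stream is done.
override def read(dest: Array[Byte], offset: Int, length: Int): Int = {
  while (currentChunk != null && !currentChunk.hasRemaining && chunks.hasNext) {
    currentChunk = chunks.next() // keep skipping empty chunks until data or exhaustion
  }
  if (currentChunk != null && currentChunk.hasRemaining) {
    val amountToGet = math.min(currentChunk.remaining(), length)
    currentChunk.get(dest, offset, amountToGet)
    amountToGet
  } else {
    close()
    -1
  }
}
{code}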
 

 






[jira] [Resolved] (SPARK-26036) Break large tests.py files into smaller files

2018-11-14 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-26036.
--
Resolution: Fixed

Issue resolved by pull request 23033
[https://github.com/apache/spark/pull/23033]

> Break large tests.py files into smaller files
> -
>
> Key: SPARK-26036
> URL: https://issues.apache.org/jira/browse/SPARK-26036
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, Spark Core
>Affects Versions: 2.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.0.0
>
>







[jira] [Assigned] (SPARK-26036) Break large tests.py files into smaller files

2018-11-14 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-26036:


Assignee: Hyukjin Kwon

> Break large tests.py files into smaller files
> -
>
> Key: SPARK-26036
> URL: https://issues.apache.org/jira/browse/SPARK-26036
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, Spark Core
>Affects Versions: 2.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.0.0
>
>







[jira] [Updated] (SPARK-26067) Pandas GROUPED_MAP udf breaks if DF has >255 columns

2018-11-14 Thread Abdeali Kothari (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Abdeali Kothari updated SPARK-26067:

Description: 
When I run Spark's Pandas GROUPED_MAP udfs to apply a UDAF I wrote in
Python/pandas on a grouped dataframe in Spark, it fails if the number of
columns is greater than 255 on Python 3.6 and lower.


{code:java}
import pyspark
from pyspark.sql import types as T, functions as F

spark = pyspark.sql.SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
[[i for i in range(256)], [i+1 for i in range(256)]], schema=["a" + str(i) 
for i in range(256)])

new_schema = T.StructType([
field for field in df.schema] + [T.StructField("new_row", T.DoubleType())])

def myfunc(df):
df['new_row'] = 1
return df

myfunc_udf = F.pandas_udf(new_schema, F.PandasUDFType.GROUPED_MAP)(myfunc)

df2 = df.groupBy(["a1"]).apply(myfunc_udf)

print(df2.count())  # This FAILS
# ERROR:
# Caused by: org.apache.spark.api.python.PythonException: Traceback (most 
recent call last):
#   File 
"/usr/local/hadoop/spark2.3.1/python/lib/pyspark.zip/pyspark/worker.py", line 
219, in main
# func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, 
eval_type)
#   File 
"/usr/local/hadoop/spark2.3.1/python/lib/pyspark.zip/pyspark/worker.py", line 
148, in read_udfs
# mapper = eval(mapper_str, udfs)
#   File "", line 1
# SyntaxError: more than 255 arguments
{code}

Note: In Python 3.7 the 255-argument limit was raised, but I have not tried with
Python 3.7: https://docs.python.org/3.7/whatsnew/3.7.html#other-language-changes

I was using Python 3.5 (from Anaconda) and Spark 2.3.1 to reproduce this on my
Hadoop Linux cluster and also on my standalone Spark installation on my Mac.

  was:
When I run Spark's Pandas GROUPED_MAP udfs to apply a UDAF I wrote in
Python/pandas on a grouped dataframe in Spark, it fails if the number of
columns is greater than 255 on Python 3.6 and lower.


{code:java}
import pyspark
from pyspark.sql import types as T, functions as F

spark = pyspark.sql.SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
[[i for i in range(256)], [i+1 for i in range(256)]], schema=["a" + str(i) 
for i in range(256)])

new_schema = T.StructType([
field for field in df.schema] + [T.StructField("new_row", T.DoubleType())])

def myfunc(df):
df['new_row'] = 1
return df

myfunc_udf = F.pandas_udf(new_schema, F.PandasUDFType.GROUPED_MAP)(myfunc)

df2 = df.groupBy(["a1"]).apply(myfunc_udf)

print(df2.count())  # This FAILS
# ERROR:
# Caused by: org.apache.spark.api.python.PythonException: Traceback (most 
recent call last):
#   File 
"/usr/local/hadoop/spark2.3.1/python/lib/pyspark.zip/pyspark/worker.py", line 
219, in main
# func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, 
eval_type)
#   File 
"/usr/local/hadoop/spark2.3.1/python/lib/pyspark.zip/pyspark/worker.py", line 
148, in read_udfs
# mapper = eval(mapper_str, udfs)
#   File "", line 1
# SyntaxError: more than 255 arguments
{code}


I believe this is happening because internally this creates a UDF whose inputs
are every column in the DF:
https://github.com/apache/spark/blob/41c2227a2318029709553a588e44dee28f106350/python/pyspark/sql/group.py#L274

Note: In Python 3.7 the 255-argument limit was raised, but I have not tried with
Python 3.7: https://docs.python.org/3.7/whatsnew/3.7.html#other-language-changes

I was using Python 3.5 (from Anaconda) and Spark 2.3.1 to reproduce this on my
Hadoop Linux cluster and also on my standalone Spark installation on my Mac.


> Pandas GROUPED_MAP udf breaks if DF has >255 columns
> 
>
> Key: SPARK-26067
> URL: https://issues.apache.org/jira/browse/SPARK-26067
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Abdeali Kothari
>Priority: Major
>
> When I run Spark's Pandas GROUPED_MAP udfs to apply a UDAF I wrote in 
> Python/pandas on a grouped dataframe in Spark, it fails if the number of 
> columns is greater than 255 on Python 3.6 and lower.
> {code:java}
> import pyspark
> from pyspark.sql import types as T, functions as F
> spark = pyspark.sql.SparkSession.builder.getOrCreate()
> df = spark.createDataFrame(
> [[i for i in range(256)], [i+1 for i in range(256)]], schema=["a" + 
> str(i) for i in range(256)])
> new_schema = T.StructType([
> field for field in df.schema] + [T.StructField("new_row", 
> T.DoubleType())])
> def myfunc(df):
> df['new_row'] = 1
> return df
> myfunc_udf = F.pandas_udf(new_schema, F.PandasUDFType.GROUPED_MAP)(myfunc)
> df2 = df.groupBy(["a1"]).apply(myfunc_udf)
> print(df2.count())  # This FAILS
> # ERROR:
> # Caused by: org.apache.spark.api.python.PythonException: Traceback (most 
> recent call last):
> #   File 
> 

[jira] [Created] (SPARK-26067) Pandas GROUPED_MAP udf breaks if DF has >255 columns

2018-11-14 Thread Abdeali Kothari (JIRA)
Abdeali Kothari created SPARK-26067:
---

 Summary: Pandas GROUPED_MAP udf breaks if DF has >255 columns
 Key: SPARK-26067
 URL: https://issues.apache.org/jira/browse/SPARK-26067
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.4.0, 2.3.2
Reporter: Abdeali Kothari


When I run Spark's Pandas GROUPED_MAP udfs to apply a UDAF I wrote in
Python/pandas on a grouped dataframe in Spark, it fails if the number of
columns is greater than 255 on Python 3.6 and lower.


{code:java}
import pyspark
from pyspark.sql import types as T, functions as F

spark = pyspark.sql.SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
[[i for i in range(256)], [i+1 for i in range(256)]], schema=["a" + str(i) 
for i in range(256)])

new_schema = T.StructType([
field for field in df.schema] + [T.StructField("new_row", T.DoubleType())])

def myfunc(df):
df['new_row'] = 1
return df

myfunc_udf = F.pandas_udf(new_schema, F.PandasUDFType.GROUPED_MAP)(myfunc)

df2 = df.groupBy(["a1"]).apply(myfunc_udf)

print(df2.count())  # This FAILS
# ERROR:
# Caused by: org.apache.spark.api.python.PythonException: Traceback (most 
recent call last):
#   File 
"/usr/local/hadoop/spark2.3.1/python/lib/pyspark.zip/pyspark/worker.py", line 
219, in main
# func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, 
eval_type)
#   File 
"/usr/local/hadoop/spark2.3.1/python/lib/pyspark.zip/pyspark/worker.py", line 
148, in read_udfs
# mapper = eval(mapper_str, udfs)
#   File "", line 1
# SyntaxError: more than 255 arguments
{code}


I believe this is happening because internally this creates a UDF whose inputs
are every column in the DF:
https://github.com/apache/spark/blob/41c2227a2318029709553a588e44dee28f106350/python/pyspark/sql/group.py#L274

Note: In Python 3.7 the 255-argument limit was raised, but I have not tried with
Python 3.7: https://docs.python.org/3.7/whatsnew/3.7.html#other-language-changes

I was using Python 3.5 (from Anaconda) and Spark 2.3.1 to reproduce this on my
Hadoop Linux cluster and also on my standalone Spark installation on my Mac.






[jira] [Commented] (SPARK-26017) SVD++ error rate is high in the test suite.

2018-11-14 Thread shahid (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16687361#comment-16687361
 ] 

shahid commented on SPARK-26017:


I am analyzing it

> SVD++ error rate is high in the test suite.
> ---
>
> Key: SPARK-26017
> URL: https://issues.apache.org/jira/browse/SPARK-26017
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 2.3.2
>Reporter: shahid
>Priority: Major
> Attachments: image-2018-11-12-20-41-49-370.png
>
>
> In the test suite, "{color:#008000}Test SVD++ with mean square error on
> training set{color}", the error rate is quite high, even for a large number of
> iterations.
>  
> !image-2018-11-12-20-41-49-370.png!






[jira] [Commented] (SPARK-25957) Skip building spark-r docker image if spark distribution does not have R support

2018-11-14 Thread Nagaram Prasad Addepally (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16687337#comment-16687337
 ] 

Nagaram Prasad Addepally commented on SPARK-25957:
--

Thanks [~vanzin]... we can do skip flags instead. I think we can auto-detect the R
installation by checking for the presence of the "$SPARK_HOME/R/lib" folder.
Correct me if I am wrong.

I will work on this change and post a PR. Can you assign this Jira to me? I do
not seem to have permission to assign it to myself.

> Skip building spark-r docker image if spark distribution does not have R 
> support
> 
>
> Key: SPARK-25957
> URL: https://issues.apache.org/jira/browse/SPARK-25957
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Nagaram Prasad Addepally
>Priority: Major
>
> [docker-image-tool.sh|https://github.com/apache/spark/blob/master/bin/docker-image-tool.sh]
>  script by default tries to build spark-r image. We may not always build 
> spark distribution with R support. It would be good to skip building and 
> publishing spark-r images if R support is not available in the spark 
> distribution.






[jira] [Resolved] (SPARK-25956) Make Scala 2.12 as default Scala version in Spark 3.0

2018-11-14 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-25956.
---
   Resolution: Fixed
 Assignee: DB Tsai
Fix Version/s: 3.0.0

This is resolved via https://github.com/apache/spark/pull/22967

> Make Scala 2.12 as default Scala version in Spark 3.0
> -
>
> Key: SPARK-25956
> URL: https://issues.apache.org/jira/browse/SPARK-25956
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
> Fix For: 3.0.0
>
>
> Scala 2.11 is unlikely to support Java 11
> (https://github.com/scala/scala-dev/issues/559#issuecomment-436160166); hence,
> we will make Scala 2.12 the default Scala version in Spark 3.0.






[jira] [Comment Edited] (SPARK-25957) Skip building spark-r docker image if spark distribution does not have R support

2018-11-14 Thread Marcelo Vanzin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16687306#comment-16687306
 ] 

Marcelo Vanzin edited comment on SPARK-25957 at 11/15/18 12:16 AM:
---

I prefer to keep the current behavior and add options to disable specific 
images (e.g. "\-\-skip-r", "\-\-skip-pyspark"). If "\-\-skip-r" could be 
auto-detected, even better.


was (Author: vanzin):
I prefer to keep the current behavior and add options to disable specific 
images (e.g. "--skip-r", "--skip-pyspark"). If "--skip-r" could be 
auto-detected, even better.

> Skip building spark-r docker image if spark distribution does not have R 
> support
> 
>
> Key: SPARK-25957
> URL: https://issues.apache.org/jira/browse/SPARK-25957
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Nagaram Prasad Addepally
>Priority: Major
>
> [docker-image-tool.sh|https://github.com/apache/spark/blob/master/bin/docker-image-tool.sh]
>  script by default tries to build spark-r image. We may not always build 
> spark distribution with R support. It would be good to skip building and 
> publishing spark-r images if R support is not available in the spark 
> distribution.






[jira] [Commented] (SPARK-25957) Skip building spark-r docker image if spark distribution does not have R support

2018-11-14 Thread Marcelo Vanzin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16687306#comment-16687306
 ] 

Marcelo Vanzin commented on SPARK-25957:


I prefer to keep the current behavior and add options to disable specific 
images (e.g. "--skip-r", "--skip-pyspark"). If "--skip-r" could be 
auto-detected, even better.

> Skip building spark-r docker image if spark distribution does not have R 
> support
> 
>
> Key: SPARK-25957
> URL: https://issues.apache.org/jira/browse/SPARK-25957
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Nagaram Prasad Addepally
>Priority: Major
>
> [docker-image-tool.sh|https://github.com/apache/spark/blob/master/bin/docker-image-tool.sh]
>  script by default tries to build spark-r image. We may not always build 
> spark distribution with R support. It would be good to skip building and 
> publishing spark-r images if R support is not available in the spark 
> distribution.






[jira] [Commented] (SPARK-25957) Skip building spark-r docker image if spark distribution does not have R support

2018-11-14 Thread Nagaram Prasad Addepally (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16687276#comment-16687276
 ] 

Nagaram Prasad Addepally commented on SPARK-25957:
--

I think we can parameterize which images we want to build and publish using
[docker-image-tool.sh|https://github.com/apache/spark/blob/master/bin/docker-image-tool.sh].
By default, we can build and publish all images (to keep the existing behavior
intact) and provide an override option to specify explicitly which images to
build. Note that we will always build the base Spark (JVM) docker image.

For example,
{noformat}
./docker-image-tool.sh -r  -t  build|publish # Builds/publishes all 
docker images

./docker-image-tool.sh -r  -t  --select [p,R] build|publish # 
Builds/publishes docker images specified in select param. We will always build 
spark base (JVM) docker image.{noformat}
Does this approach sound reasonable? Or does anyone have a better suggestion?

 

> Skip building spark-r docker image if spark distribution does not have R 
> support
> 
>
> Key: SPARK-25957
> URL: https://issues.apache.org/jira/browse/SPARK-25957
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Nagaram Prasad Addepally
>Priority: Major
>
> [docker-image-tool.sh|https://github.com/apache/spark/blob/master/bin/docker-image-tool.sh]
>  script by default tries to build spark-r image. We may not always build 
> spark distribution with R support. It would be good to skip building and 
> publishing spark-r images if R support is not available in the spark 
> distribution.






[jira] [Commented] (SPARK-26066) Moving truncatedString to sql/catalyst

2018-11-14 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16687169#comment-16687169
 ] 

Apache Spark commented on SPARK-26066:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/23039

> Moving truncatedString to sql/catalyst
> --
>
> Key: SPARK-26066
> URL: https://issues.apache.org/jira/browse/SPARK-26066
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> The truncatedString method is used to convert elements of TreeNodes and
> expressions to strings, and it is called only from sql.* packages. The ticket aims
> to move the method out of core. We also need to introduce a SQL config to
> control the maximum number of fields shown by default.
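For context, a minimal sketch (assumed behavior and signature, not Spark's exact
implementation) of what such a truncatedString helper does: render at most a given
number of fields and summarize how many were dropped.

{code:scala}
// Illustrative sketch only: renders at most maxFields elements and notes how
// many were omitted, which is how long TreeNode/expression lists stay readable.
def truncatedString[T](seq: Seq[T], start: String, sep: String, end: String,
    maxFields: Int): String = {
  if (seq.length <= maxFields) {
    seq.mkString(start, sep, end)
  } else {
    val dropped = seq.length - maxFields
    (seq.take(maxFields).map(_.toString) :+ s"... $dropped more fields")
      .mkString(start, sep, end)
  }
}

// truncatedString(Seq("a", "b", "c", "d"), "[", ", ", "]", maxFields = 2)
// == "[a, b, ... 2 more fields]"
{code}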






[jira] [Assigned] (SPARK-26066) Moving truncatedString to sql/catalyst

2018-11-14 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26066:


Assignee: (was: Apache Spark)

> Moving truncatedString to sql/catalyst
> --
>
> Key: SPARK-26066
> URL: https://issues.apache.org/jira/browse/SPARK-26066
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> The truncatedString method is used to convert elements of TreeNodes and 
> expressions to strings, and called only from sql.* packages. The ticket aims 
> to move the method out from core. Also need to introduce SQL config to 
> control maximum number of fields by default.






[jira] [Assigned] (SPARK-26066) Moving truncatedString to sql/catalyst

2018-11-14 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26066:


Assignee: Apache Spark

> Moving truncatedString to sql/catalyst
> --
>
> Key: SPARK-26066
> URL: https://issues.apache.org/jira/browse/SPARK-26066
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Minor
>
> The truncatedString method is used to convert elements of TreeNodes and 
> expressions to strings, and called only from sql.* packages. The ticket aims 
> to move the method out from core. Also need to introduce SQL config to 
> control maximum number of fields by default.






[jira] [Assigned] (SPARK-25451) Stages page doesn't show the right number of the total tasks

2018-11-14 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25451:


Assignee: Apache Spark

> Stages page doesn't show the right number of the total tasks
> 
>
> Key: SPARK-25451
> URL: https://issues.apache.org/jira/browse/SPARK-25451
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.1
>Reporter: zuotingbing
>Assignee: Apache Spark
>Priority: Major
> Attachments: mshot.png
>
>
>  
> See the attached pic.
>   !mshot.png!
> The executor 1 has 7 tasks, but in the Stages Page the total tasks of 
> executor is 6.
>  
> to reproduce this simply start a shell:
> {code:java}
> $SPARK_HOME/bin/spark-shell --executor-cores 1 --executor-memory 1g 
> --total-executor-cores 2 --master spark://localhost.localdomain:7077{code}
> Run a job as follows:
> {code:java}
> sc.parallelize(1 to 1, 3).map{ x => throw new RuntimeException("Bad 
> executor")}.collect() {code}
>  
> Go to the stages page and you will see the Total Tasks  is not right in
> {code:java}
> Aggregated Metrics by Executor{code}
> table. 
>  






[jira] [Commented] (SPARK-25451) Stages page doesn't show the right number of the total tasks

2018-11-14 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16687130#comment-16687130
 ] 

Apache Spark commented on SPARK-25451:
--

User 'shahidki31' has created a pull request for this issue:
https://github.com/apache/spark/pull/23038

> Stages page doesn't show the right number of the total tasks
> 
>
> Key: SPARK-25451
> URL: https://issues.apache.org/jira/browse/SPARK-25451
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.1
>Reporter: zuotingbing
>Priority: Major
> Attachments: mshot.png
>
>
>  
> See the attached pic.
>   !mshot.png!
> Executor 1 has 7 tasks, but on the Stages page the total number of tasks for 
> the executor is 6.
>  
> To reproduce this, simply start a shell:
> {code:java}
> $SPARK_HOME/bin/spark-shell --executor-cores 1 --executor-memory 1g 
> --total-executor-cores 2 --master spark://localhost.localdomain:7077{code}
> Run a job as follows:
> {code:java}
> sc.parallelize(1 to 1, 3).map{ x => throw new RuntimeException("Bad 
> executor")}.collect() {code}
>  
> Go to the Stages page and you will see that Total Tasks is not right in the
> {code:java}
> Aggregated Metrics by Executor{code}
> table. 
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25451) Stages page doesn't show the right number of the total tasks

2018-11-14 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25451:


Assignee: (was: Apache Spark)

> Stages page doesn't show the right number of the total tasks
> 
>
> Key: SPARK-25451
> URL: https://issues.apache.org/jira/browse/SPARK-25451
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.1
>Reporter: zuotingbing
>Priority: Major
> Attachments: mshot.png
>
>
>  
> See the attached pic.
>   !mshot.png!
> Executor 1 has 7 tasks, but on the Stages page the total number of tasks for 
> the executor is 6.
>  
> To reproduce this, simply start a shell:
> {code:java}
> $SPARK_HOME/bin/spark-shell --executor-cores 1 --executor-memory 1g 
> --total-executor-cores 2 --master spark://localhost.localdomain:7077{code}
> Run a job as follows:
> {code:java}
> sc.parallelize(1 to 1, 3).map{ x => throw new RuntimeException("Bad 
> executor")}.collect() {code}
>  
> Go to the Stages page and you will see that Total Tasks is not right in the
> {code:java}
> Aggregated Metrics by Executor{code}
> table. 
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25451) Stages page doesn't show the right number of the total tasks

2018-11-14 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16687128#comment-16687128
 ] 

Apache Spark commented on SPARK-25451:
--

User 'shahidki31' has created a pull request for this issue:
https://github.com/apache/spark/pull/23038

> Stages page doesn't show the right number of the total tasks
> 
>
> Key: SPARK-25451
> URL: https://issues.apache.org/jira/browse/SPARK-25451
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.1
>Reporter: zuotingbing
>Priority: Major
> Attachments: mshot.png
>
>
>  
> See the attached pic.
>   !mshot.png!
> Executor 1 has 7 tasks, but on the Stages page the total number of tasks for 
> the executor is 6.
>  
> To reproduce this, simply start a shell:
> {code:java}
> $SPARK_HOME/bin/spark-shell --executor-cores 1 --executor-memory 1g 
> --total-executor-cores 2 --master spark://localhost.localdomain:7077{code}
> Run a job as follows:
> {code:java}
> sc.parallelize(1 to 1, 3).map{ x => throw new RuntimeException("Bad 
> executor")}.collect() {code}
>  
> Go to the Stages page and you will see that Total Tasks is not right in the
> {code:java}
> Aggregated Metrics by Executor{code}
> table. 
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25986) Banning throw new Errors

2018-11-14 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-25986.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 22989
[https://github.com/apache/spark/pull/22989]

> Banning throw new Errors
> 
>
> Key: SPARK-25986
> URL: https://issues.apache.org/jira/browse/SPARK-25986
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Assignee: Yuanjian Li
>Priority: Major
>  Labels: release-notes
> Fix For: 3.0.0
>
>
> Add a linter rule to ban the construction of new Errors, and then make sure 
> that we throw the correct exceptions. See the PR 
> https://github.com/apache/spark/pull/22969



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25986) Banning throw new Errors

2018-11-14 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-25986:
--
Docs Text: 
Release notes text:

Certain methods in Spark MLlib would throw NotImplementedError or UnknownError 
on illegal input. These have been changed to more standard 
UnsupportedOperationException and IllegalArgumentException.
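
For illustration only (hypothetical code, not taken from the Spark source), this 
is the kind of rewrite the lint rule is meant to enforce:

{code:scala}
def setThreshold(value: Double): Unit = {
  // Before: signalling bad user input with a java.lang.Error subtype.
  // if (value < 0) throw new UnknownError(s"negative threshold: $value")

  // After: using a standard exception for illegal arguments instead.
  if (value < 0) throw new IllegalArgumentException(s"negative threshold: $value")
}
{code}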

> Banning throw new Errors
> 
>
> Key: SPARK-25986
> URL: https://issues.apache.org/jira/browse/SPARK-25986
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Assignee: Yuanjian Li
>Priority: Major
>  Labels: release-notes
>
> Add a linter rule to ban the construction of new Errors, and then make sure 
> that we throw the correct exceptions. See the PR 
> https://github.com/apache/spark/pull/22969



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25778) WriteAheadLogBackedBlockRDD in YARN Cluster Mode Fails due lack of access to tmpDir from $PWD to HDFS

2018-11-14 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-25778:
--

Assignee: Greg Senia

> WriteAheadLogBackedBlockRDD in YARN Cluster Mode Fails due lack of access to 
> tmpDir from $PWD to HDFS
> -
>
> Key: SPARK-25778
> URL: https://issues.apache.org/jira/browse/SPARK-25778
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming, YARN
>Affects Versions: 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.2.0, 2.2.1, 2.2.2, 2.3.1, 
> 2.3.2
>Reporter: Greg Senia
>Assignee: Greg Senia
>Priority: Major
> Fix For: 2.4.1, 3.0.0
>
> Attachments: SPARK-25778.2.patch, SPARK-25778.4.patch, 
> SPARK-25778.patch
>
>
> WriteAheadLogBackedBlockRDD in YARN cluster mode fails due to lack of access 
> to an HDFS path, because the path it uses has a name similar to the $PWD 
> folder of the YARN AM in cluster mode.
> While attempting to use Spark Streaming and WriteAheadLogs, I noticed the 
> following errors after the driver attempted to recover the already-read data 
> that was being written to HDFS in the checkpoint folder. After spending many 
> hours looking at the cause of the error below, which happens because the 
> parent folder /hadoop exists in our HDFS filesystem, I wonder if it is 
> possible to make a configurable option to choose an alternate bogus directory 
> that will never be used.
> hadoop fs -ls /
> drwx--   - dsadmdsadm   0 2017-06-20 13:20 /hadoop
> hadoop fs -ls /hadoop/apps
> drwx--   - dsadm dsadm  0 2017-06-20 13:20 /hadoop/apps
> streaming/src/main/scala/org/apache/spark/streaming/rdd/WriteAheadLogBackedBlockRDD.scala
>   val nonExistentDirectory = new File(
>   System.getProperty("java.io.tmpdir"), 
> UUID.randomUUID().toString).getAbsolutePath
> writeAheadLog = WriteAheadLogUtils.createLogForReceiver(
>   SparkEnv.get.conf, nonExistentDirectory, hadoopConf)
> dataRead = writeAheadLog.read(partition.walRecordHandle)
> 18/10/19 00:03:03 DEBUG YarnSchedulerBackend$YarnDriverEndpoint: Launching 
> task 72 on executor id: 1 hostname: ha20t5002dn.tech.hdp.example.com.
> 18/10/19 00:03:03 DEBUG BlockManager: Getting local block broadcast_4_piece0 
> as bytes
> 18/10/19 00:03:03 DEBUG BlockManager: Level for block broadcast_4_piece0 is 
> StorageLevel(disk, memory, 1 replicas)
> 18/10/19 00:03:03 INFO BlockManagerInfo: Added broadcast_4_piece0 in memory 
> on ha20t5002dn.tech.hdp.example.com:32768 (size: 33.7 KB, free: 912.2 MB)
> 18/10/19 00:03:03 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 71, 
> ha20t5002dn.tech.hdp.example.com, executor 1): 
> org.apache.spark.SparkException: Could not read data from write ahead log 
> record 
> FileBasedWriteAheadLogSegment(hdfs://tech/user/hdpdevspark/sparkstreaming/Spark_Streaming_MQ_IDMS/receivedData/0/log-1539921695606-1539921755606,0,1017)
>   at 
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD.org$apache$spark$streaming$rdd$WriteAheadLogBackedBlockRDD$$getBlockFromWriteAheadLog$1(WriteAheadLogBackedBlockRDD.scala:145)
>   at 
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD$$anonfun$compute$1.apply(WriteAheadLogBackedBlockRDD.scala:173)
>   at 
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD$$anonfun$compute$1.apply(WriteAheadLogBackedBlockRDD.scala:173)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD.compute(WriteAheadLogBackedBlockRDD.scala:173)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:108)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.hadoop.security.AccessControlException: Permission 
> denied: user=hdpdevspark, access=EXECUTE, 
> inode="/hadoop/diskc/hadoop/yarn/local/usercache/hdpdevspark/appcache/application_1539554105597_0338/container_e322_1539554105597_0338_01_02/tmp/170f36b8-9202-4556-89a4-64587c7136b6":dsadm:dsadm:drwx--
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:259)
>   at 
> 

[jira] [Resolved] (SPARK-25778) WriteAheadLogBackedBlockRDD in YARN Cluster Mode Fails due lack of access to tmpDir from $PWD to HDFS

2018-11-14 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-25778.

   Resolution: Fixed
Fix Version/s: 2.4.1
   3.0.0

Issue resolved by pull request 22867
[https://github.com/apache/spark/pull/22867]

> WriteAheadLogBackedBlockRDD in YARN Cluster Mode Fails due lack of access to 
> tmpDir from $PWD to HDFS
> -
>
> Key: SPARK-25778
> URL: https://issues.apache.org/jira/browse/SPARK-25778
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming, YARN
>Affects Versions: 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.2.0, 2.2.1, 2.2.2, 2.3.1, 
> 2.3.2
>Reporter: Greg Senia
>Priority: Major
> Fix For: 3.0.0, 2.4.1
>
> Attachments: SPARK-25778.2.patch, SPARK-25778.4.patch, 
> SPARK-25778.patch
>
>
> WriteAheadLogBackedBlockRDD in YARN cluster mode fails due to lack of access 
> to an HDFS path, because the path it uses has a name similar to the $PWD 
> folder of the YARN AM in cluster mode.
> While attempting to use Spark Streaming and WriteAheadLogs, I noticed the 
> following errors after the driver attempted to recover the already-read data 
> that was being written to HDFS in the checkpoint folder. After spending many 
> hours looking at the cause of the error below, which happens because the 
> parent folder /hadoop exists in our HDFS filesystem, I wonder if it is 
> possible to make a configurable option to choose an alternate bogus directory 
> that will never be used.
> hadoop fs -ls /
> drwx--   - dsadmdsadm   0 2017-06-20 13:20 /hadoop
> hadoop fs -ls /hadoop/apps
> drwx--   - dsadm dsadm  0 2017-06-20 13:20 /hadoop/apps
> streaming/src/main/scala/org/apache/spark/streaming/rdd/WriteAheadLogBackedBlockRDD.scala
>   val nonExistentDirectory = new File(
>   System.getProperty("java.io.tmpdir"), 
> UUID.randomUUID().toString).getAbsolutePath
> writeAheadLog = WriteAheadLogUtils.createLogForReceiver(
>   SparkEnv.get.conf, nonExistentDirectory, hadoopConf)
> dataRead = writeAheadLog.read(partition.walRecordHandle)
> 18/10/19 00:03:03 DEBUG YarnSchedulerBackend$YarnDriverEndpoint: Launching 
> task 72 on executor id: 1 hostname: ha20t5002dn.tech.hdp.example.com.
> 18/10/19 00:03:03 DEBUG BlockManager: Getting local block broadcast_4_piece0 
> as bytes
> 18/10/19 00:03:03 DEBUG BlockManager: Level for block broadcast_4_piece0 is 
> StorageLevel(disk, memory, 1 replicas)
> 18/10/19 00:03:03 INFO BlockManagerInfo: Added broadcast_4_piece0 in memory 
> on ha20t5002dn.tech.hdp.example.com:32768 (size: 33.7 KB, free: 912.2 MB)
> 18/10/19 00:03:03 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 71, 
> ha20t5002dn.tech.hdp.example.com, executor 1): 
> org.apache.spark.SparkException: Could not read data from write ahead log 
> record 
> FileBasedWriteAheadLogSegment(hdfs://tech/user/hdpdevspark/sparkstreaming/Spark_Streaming_MQ_IDMS/receivedData/0/log-1539921695606-1539921755606,0,1017)
>   at 
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD.org$apache$spark$streaming$rdd$WriteAheadLogBackedBlockRDD$$getBlockFromWriteAheadLog$1(WriteAheadLogBackedBlockRDD.scala:145)
>   at 
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD$$anonfun$compute$1.apply(WriteAheadLogBackedBlockRDD.scala:173)
>   at 
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD$$anonfun$compute$1.apply(WriteAheadLogBackedBlockRDD.scala:173)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD.compute(WriteAheadLogBackedBlockRDD.scala:173)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:108)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.hadoop.security.AccessControlException: Permission 
> denied: user=hdpdevspark, access=EXECUTE, 
> inode="/hadoop/diskc/hadoop/yarn/local/usercache/hdpdevspark/appcache/application_1539554105597_0338/container_e322_1539554105597_0338_01_02/tmp/170f36b8-9202-4556-89a4-64587c7136b6":dsadm:dsadm:drwx--
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319)
>   at 
> 

[jira] [Resolved] (SPARK-24421) Accessing sun.misc.Cleaner in JDK11

2018-11-14 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-24421.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 22993
[https://github.com/apache/spark/pull/22993]

> Accessing sun.misc.Cleaner in JDK11
> ---
>
> Key: SPARK-24421
> URL: https://issues.apache.org/jira/browse/SPARK-24421
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: DB Tsai
>Assignee: Sean Owen
>Priority: Major
>  Labels: release-notes
> Fix For: 3.0.0
>
>
> Many internal APIs such as Unsafe are encapsulated in JDK 9+; see 
> http://openjdk.java.net/jeps/260 for details.
> To use Unsafe, we need to add *jdk.unsupported* to our code’s module 
> declaration:
> {code:java}
> module java9unsafe {
> requires jdk.unsupported;
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24421) Accessing sun.misc.Cleaner in JDK11

2018-11-14 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-24421:
-

Assignee: Sean Owen

> Accessing sun.misc.Cleaner in JDK11
> ---
>
> Key: SPARK-24421
> URL: https://issues.apache.org/jira/browse/SPARK-24421
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: DB Tsai
>Assignee: Sean Owen
>Priority: Major
>  Labels: release-notes
> Fix For: 3.0.0
>
>
> Many internal APIs such as Unsafe are encapsulated in JDK 9+; see 
> http://openjdk.java.net/jeps/260 for details.
> To use Unsafe, we need to add *jdk.unsupported* to our code’s module 
> declaration:
> {code:java}
> module java9unsafe {
> requires jdk.unsupported;
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26066) Moving truncatedString to sql/catalyst

2018-11-14 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-26066:
--

 Summary: Moving truncatedString to sql/catalyst
 Key: SPARK-26066
 URL: https://issues.apache.org/jira/browse/SPARK-26066
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.4.0
Reporter: Maxim Gekk


The truncatedString method is used to convert elements of TreeNodes and 
expressions to strings, and it is called only from sql.* packages. The ticket 
aims to move the method out of core. We also need to introduce a SQL config to 
control the default maximum number of fields.
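
For illustration, here is a minimal, self-contained sketch of what such a helper 
can look like (the real method in Spark may have a different signature); the 
maxFields parameter is the limit that the proposed SQL config would control:

{code:scala}
object TruncatedStringSketch {
  // Render a sequence as "start e1 sep e2 ... end", eliding elements beyond maxFields.
  def truncatedString[T](seq: Seq[T], start: String, sep: String, end: String,
                         maxFields: Int): String = {
    if (seq.length > maxFields) {
      val shown = seq.take(math.max(maxFields - 1, 1)).mkString(sep)
      s"$start$shown$sep... ${seq.length - maxFields + 1} more fields$end"
    } else {
      seq.mkString(start, sep, end)
    }
  }

  def main(args: Array[String]): Unit = {
    // Prints: [1, 2, ... 8 more fields]
    println(truncatedString(1 to 10, "[", ", ", "]", maxFields = 3))
  }
}
{code}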



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26041) catalyst cuts out some columns from dataframes: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute

2018-11-14 Thread Ruslan Dautkhanov (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruslan Dautkhanov updated SPARK-26041:
--
Environment: 
Spark 2.3.2 

Hadoop 2.6

When we materialize one of the intermediate dataframes as a parquet table and 
read it back in, this error doesn't happen (exact same downstream queries).

 

  was:
Spark 2.3.2 

PySpark 2.7.15 + Hadoop 2.6

When we materialize one of intermediate dataframes as a parquet table, and read 
it back in, this error doesn't happen (exact same downflow queries ). 

 


> catalyst cuts out some columns from dataframes: 
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
> attribute
> -
>
> Key: SPARK-26041
> URL: https://issues.apache.org/jira/browse/SPARK-26041
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, Spark Core
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
> Environment: Spark 2.3.2 
> Hadoop 2.6
> When we materialize one of the intermediate dataframes as a parquet table and 
> read it back in, this error doesn't happen (exact same downstream queries).
>  
>Reporter: Ruslan Dautkhanov
>Priority: Major
>  Labels: catalyst, optimization
> Attachments: SPARK-26041.txt
>
>
> There is a workflow with a number of group-by's, joins, `exists` and `in`s 
> between a set of dataframes. 
> We are getting the following exception; the reason is that Catalyst cuts some 
> columns out of the dataframes: 
> {noformat}
> Unhandled error: , An error occurred 
> while calling o1187.cache.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 153 
> in stage 2011.0 failed 4 times, most recent failure: Lost task 153.3 in stage 
> 2011.0 (TID 832340, pc1udatahad23, execut
> or 153): org.apache.spark.sql.catalyst.errors.package$TreeNodeException: 
> Binding attribute, tree: part_code#56012
>  at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
>  at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:91)
>  at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:90)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>  at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
>  at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
>  at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:90)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:45)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:40)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1318)
>  at org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:401)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition$lzycompute(BroadcastNestedLoopJoinExec.scala:87)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition(BroadcastNestedLoopJoinExec.scala:85)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$5$$anonfun$apply$4$$anonfun$6.apply(BroadcastNestedLoopJoinExec.scala:210)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$5$$anonfun$apply$4$$anonfun$6.apply(BroadcastNestedLoopJoinExec.scala:210)
>  at 
> scala.collection.IndexedSeqOptimized$class.prefixLengthImpl(IndexedSeqOptimized.scala:38)
>  at 
> scala.collection.IndexedSeqOptimized$class.exists(IndexedSeqOptimized.scala:46)
>  at scala.collection.mutable.ArrayOps$ofRef.exists(ArrayOps.scala:186)
>  at 
> 

[jira] [Commented] (SPARK-25982) Dataframe write is non blocking in fair scheduling mode

2018-11-14 Thread Ramandeep Singh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16687034#comment-16687034
 ] 

Ramandeep Singh commented on SPARK-25982:
-

Sure,

a) The setting for scheduler is fair scheduler

--conf 'spark.scheduler.mode'='FAIR'

b) There are independent jobs scheduled at one stage. This is okay; all of 
them block on the dataframe write to complete. 

```

val futures = steps.par.map(stepId => Future {
 processWrite(stepsMap(stepId))
}).par
futures.foreach(Await.result(_, Duration.create(timeout, TimeUnit.MINUTES)))

```

Here, processWrite runs the write operations in parallel and awaits each of 
them, but the persist or write operation returns before all partitions of the 
dataframes have been written, so jobs from a later stage end up running. A 
minimal sketch of the intended blocking pattern is shown below.
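
A minimal, self-contained sketch of the intended pattern (names such as 
processWrite, steps, and stepsMap are assumptions taken from the comment above, 
not Spark APIs), where each write is expected to block until every partition 
has been written:

{code:scala}
import java.util.concurrent.TimeUnit
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
import org.apache.spark.sql.DataFrame

def processWrite(df: DataFrame, path: String): Unit = {
  // Expected to return only after all partitions have been written.
  df.write.mode("overwrite").parquet(path)
}

def runSteps(steps: Seq[String], stepsMap: Map[String, DataFrame], timeoutMinutes: Long): Unit = {
  val futures = steps.map { stepId =>
    Future(processWrite(stepsMap(stepId), s"/tmp/out/$stepId"))
  }
  // Block until every write has finished before any later-stage job is submitted.
  futures.foreach(Await.result(_, Duration.create(timeoutMinutes, TimeUnit.MINUTES)))
}
{code}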

 

> Dataframe write is non blocking in fair scheduling mode
> ---
>
> Key: SPARK-25982
> URL: https://issues.apache.org/jira/browse/SPARK-25982
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Ramandeep Singh
>Priority: Major
>
> Hi,
> I have noticed that the expected blocking behavior of the dataframe write 
> operation does not hold in fair scheduling mode.
> Ideally, when a dataframe write is in progress and a future is blocking on 
> AwaitResult, no other job should be started, but this is not the case. I have 
> noticed that other jobs are started while the partitions are being written.
>  
> Regards,
> Ramandeep Singh
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26051) Can't create table with column name '22222d'

2018-11-14 Thread Dilip Biswal (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16687025#comment-16687025
 ] 

Dilip Biswal commented on SPARK-26051:
--

I would like to take a look at this one.

> Can't create table with column name '2d'
> 
>
> Key: SPARK-26051
> URL: https://issues.apache.org/jira/browse/SPARK-26051
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Xie Juntao
>Priority: Minor
>
> I can't create a table in which the column name is '2d' when I use 
> spark-sql. It seems to be a SQL parser bug, because it is OK to create a 
> table with the column name '2m'.
> {code:java}
> spark-sql> create table t1(2d int);
> Error in query:
> no viable alternative at input 'create table t1(2d'(line 1, pos 16)
> == SQL ==
> create table t1(2d int)
> ^^^
> spark-sql> create table t1(2m int);
> 18/11/14 09:13:53 INFO HiveMetaStore: 0: get_database: global_temp
> 18/11/14 09:13:53 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_database: 
> global_temp
> 18/11/14 09:13:53 WARN ObjectStore: Failed to get database global_temp, 
> returning NoSuchObjectException
> 18/11/14 09:13:55 INFO HiveMetaStore: 0: get_database: default
> 18/11/14 09:13:55 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_database: 
> default
> 18/11/14 09:13:55 INFO HiveMetaStore: 0: get_database: default
> 18/11/14 09:13:55 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_database: 
> default
> 18/11/14 09:13:55 INFO HiveMetaStore: 0: get_table : db=default tbl=t1
> 18/11/14 09:13:55 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_table : 
> db=default tbl=t1
> 18/11/14 09:13:55 INFO HiveMetaStore: 0: get_database: default
> 18/11/14 09:13:55 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_database: 
> default
> 18/11/14 09:13:55 INFO HiveMetaStore: 0: create_table: Table(tableName:t1, 
> dbName:default, owner:root, createTime:1542158033, lastAccessTime:0, 
> retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:2m, type:int, 
> comment:null)], 
> location:file:/opt/UQuery/spark_/spark-2.3.1-bin-hadoop2.7/spark-warehouse/t1,
>  inputFormat:org.apache.hadoop.mapred.TextInputFormat, 
> outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, 
> compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, 
> serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, 
> parameters:{serialization.format=1}), bucketCols:[], sortCols:[], 
> parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], 
> skewedColValueLocationMaps:{})), partitionKeys:[], 
> parameters:{spark.sql.sources.schema.part.0={"type":"struct","fields":[{"name":"2m","type":"integer","nullable":true,"metadata":{}}]},
>  spark.sql.sources.schema.numParts=1, spark.sql.create.version=2.3.1}, 
> viewOriginalText:null, viewExpandedText:null, tableType:MANAGED_TABLE, 
> privileges:PrincipalPrivilegeSet(userPrivileges:{}, groupPrivileges:null, 
> rolePrivileges:null))
> 18/11/14 09:13:55 INFO audit: ugi=root ip=unknown-ip-addr cmd=create_table: 
> Table(tableName:t1, dbName:default, owner:root, createTime:1542158033, 
> lastAccessTime:0, retention:0, 
> sd:StorageDescriptor(cols:[FieldSchema(name:2m, type:int, comment:null)], 
> location:file:/opt/UQuery/spark_/spark-2.3.1-bin-hadoop2.7/spark-warehouse/t1,
>  inputFormat:org.apache.hadoop.mapred.TextInputFormat, 
> outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, 
> compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, 
> serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, 
> parameters:{serialization.format=1}), bucketCols:[], sortCols:[], 
> parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], 
> skewedColValueLocationMaps:{})), partitionKeys:[], 
> parameters:{spark.sql.sources.schema.part.0={"type":"struct","fields":[{"name":"2m","type":"integer","nullable":true,"metadata":{}}]},
>  spark.sql.sources.schema.numParts=1, spark.sql.create.version=2.3.1}, 
> viewOriginalText:null, viewExpandedText:null, tableType:MANAGED_TABLE, 
> privileges:PrincipalPrivilegeSet(userPrivileges:{}, groupPrivileges:null, 
> rolePrivileges:null))
> 18/11/14 09:13:55 WARN HiveMetaStore: Location: 
> file:/opt/UQuery/spark_/spark-2.3.1-bin-hadoop2.7/spark-warehouse/t1 
> specified for non-external table:t1
> 18/11/14 09:13:55 INFO FileUtils: Creating directory if it doesn't exist: 
> file:/opt/UQuery/spark_/spark-2.3.1-bin-hadoop2.7/spark-warehouse/t1
> Time taken: 2.15 seconds
> 18/11/14 09:13:56 INFO SparkSQLCLIDriver: Time taken: 2.15 seconds{code}
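
As a hypothetical workaround (not from the ticket): '2d' most likely lexes as a 
double literal because of the D suffix, so quoting the identifier with 
backticks should make the parser read it as a column name. Assuming an existing 
SparkSession named spark:

{code:scala}
// Backticks turn `2d` into a quoted identifier instead of a numeric literal.
spark.sql("CREATE TABLE t1 (`2d` INT) USING parquet")
spark.sql("DESCRIBE TABLE t1").show()
{code}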



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: 

[jira] [Assigned] (SPARK-25965) Add read benchmark for Avro

2018-11-14 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-25965:
-

Assignee: Gengliang Wang

> Add read benchmark for Avro
> ---
>
> Key: SPARK-25965
> URL: https://issues.apache.org/jira/browse/SPARK-25965
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Minor
> Fix For: 3.0.0
>
>
> Add a read benchmark for Avro, which has been missing for a while.
> The benchmark is similar to DataSourceReadBenchmark and OrcReadBenchmark.
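
A minimal sketch of such a read benchmark (not the actual benchmark code), 
assuming a SparkSession named spark and the external Avro data source module on 
the classpath:

{code:scala}
val path = "/tmp/avro_read_benchmark"

// Write some Avro data to scan.
spark.range(10L * 1000 * 1000)
  .selectExpr("id", "cast(id as string) AS s")
  .write.mode("overwrite").format("avro").save(path)

// Time a full scan with a simple aggregate.
val start = System.nanoTime()
val sum = spark.read.format("avro").load(path).selectExpr("sum(id)").first()
println(s"Avro scan took ${(System.nanoTime() - start) / 1000000} ms, result: $sum")
{code}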



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25965) Add read benchmark for Avro

2018-11-14 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-25965.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

This is resolved via https://github.com/apache/spark/pull/22966

> Add read benchmark for Avro
> ---
>
> Key: SPARK-25965
> URL: https://issues.apache.org/jira/browse/SPARK-25965
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Minor
> Fix For: 3.0.0
>
>
> Add a read benchmark for Avro, which has been missing for a while.
> The benchmark is similar to DataSourceReadBenchmark and OrcReadBenchmark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26065) Change query hint from a `LogicalPlan` to a field

2018-11-14 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26065:


Assignee: Apache Spark

> Change query hint from a `LogicalPlan` to a field
> -
>
> Key: SPARK-26065
> URL: https://issues.apache.org/jira/browse/SPARK-26065
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maryann Xue
>Assignee: Apache Spark
>Priority: Major
>
> The existing query hint implementation relies on a logical plan node 
> {{ResolvedHint}} to store query hints in logical plans, and on {{Statistics}} 
> in physical plans. Since {{ResolvedHint}} is not really a logical operator 
> and can break the pattern matching of existing and future optimization 
> rules, it is as much an issue for the Optimizer as the old 
> {{AnalysisBarrier}} was for the Analyzer.
> Given that all our query hints are either 1) a join hint, i.e., the broadcast 
> hint, or 2) a re-partition hint, which is indeed an operator, we only need to 
> add a hint field on the {{Join}} plan, and that will be a good enough 
> solution for current hint usage.
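
A minimal sketch of the proposed shape (toy classes, not the actual Spark plan 
nodes): the hint becomes an optional field on the Join node, so optimizer rules 
that pattern-match on Join keep working whether or not a hint is present.

{code:scala}
object JoinHintSketch {
  sealed trait JoinHint
  case object BroadcastLeft extends JoinHint
  case object BroadcastRight extends JoinHint

  sealed trait Plan
  case class Relation(name: String) extends Plan
  case class Join(left: Plan, right: Plan, condition: Option[String],
                  hint: Option[JoinHint] = None) extends Plan

  def main(args: Array[String]): Unit = {
    val plan = Join(Relation("a"), Relation("b"), Some("a.id = b.id"), Some(BroadcastRight))
    // A rule matching on Join sees the same node shape with or without a hint.
    plan match {
      case Join(_, _, _, hint) => println(s"join hint: $hint")
    }
  }
}
{code}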



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26065) Change query hint from a `LogicalPlan` to a field

2018-11-14 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26065:


Assignee: (was: Apache Spark)

> Change query hint from a `LogicalPlan` to a field
> -
>
> Key: SPARK-26065
> URL: https://issues.apache.org/jira/browse/SPARK-26065
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maryann Xue
>Priority: Major
>
> The existing query hint implementation relies on a logical plan node 
> {{ResolvedHint}} to store query hints in logical plans, and on {{Statistics}} 
> in physical plans. Since {{ResolvedHint}} is not really a logical operator 
> and can break the pattern matching of existing and future optimization 
> rules, it is as much an issue for the Optimizer as the old 
> {{AnalysisBarrier}} was for the Analyzer.
> Given that all our query hints are either 1) a join hint, i.e., the broadcast 
> hint, or 2) a re-partition hint, which is indeed an operator, we only need to 
> add a hint field on the {{Join}} plan, and that will be a good enough 
> solution for current hint usage.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26065) Change query hint from a `LogicalPlan` to a field

2018-11-14 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16687003#comment-16687003
 ] 

Apache Spark commented on SPARK-26065:
--

User 'maryannxue' has created a pull request for this issue:
https://github.com/apache/spark/pull/23036

> Change query hint from a `LogicalPlan` to a field
> -
>
> Key: SPARK-26065
> URL: https://issues.apache.org/jira/browse/SPARK-26065
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maryann Xue
>Priority: Major
>
> The existing query hint implementation relies on a logical plan node 
> {{ResolvedHint}} to store query hints in logical plans, and on {{Statistics}} 
> in physical plans. Since {{ResolvedHint}} is not really a logical operator 
> and can break the pattern matching of existing and future optimization 
> rules, it is as much an issue for the Optimizer as the old 
> {{AnalysisBarrier}} was for the Analyzer.
> Given that all our query hints are either 1) a join hint, i.e., the broadcast 
> hint, or 2) a re-partition hint, which is indeed an operator, we only need to 
> add a hint field on the {{Join}} plan, and that will be a good enough 
> solution for current hint usage.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26042) KafkaContinuousSourceTopicDeletionSuite may hang forever

2018-11-14 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-26042.
--
   Resolution: Fixed
Fix Version/s: 3.0.0
   2.4.1

> KafkaContinuousSourceTopicDeletionSuite may hang forever
> 
>
> Key: SPARK-26042
> URL: https://issues.apache.org/jira/browse/SPARK-26042
> Project: Spark
>  Issue Type: Test
>  Components: Structured Streaming, Tests
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
> Fix For: 2.4.1, 3.0.0
>
>
> Saw the following thread dump in some build:
> {code}
> "stream execution thread for [id = 1c13482e-1edf-4b5c-b63a-d652738c8a48, 
> runId = 10667ce9-7eac-4cef-a525-f1bd08eb50f1]" #4406 daemon prio=5 os_prio=0 
> tid=0x7fab1d3c5000 nid=0x7f4b waiting on condition [0x7fa96efcb000]
>java.lang.Thread.State: WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0x00070a904cf8> (a 
> scala.concurrent.impl.Promise$CompletionLatch)
>   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
> ...
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:180)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:109)
>   - locked <0x00070a913ee8> (a 
> org.apache.spark.sql.execution.streaming.IncrementalExecution)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:109)
>   at 
> org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution$$anonfun$runContinuous$3$$anonfun$apply$1.apply(ContinuousExecution.scala:270)
>   at 
> org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution$$anonfun$runContinuous$3$$anonfun$apply$1.apply(ContinuousExecution.scala:270)
> ,,,
> "pool-1-thread-1-ScalaTest-running-KafkaContinuousSourceTopicDeletionSuite" 
> #20 prio=5 os_prio=0 tid=0x7fabc4e78800 nid=0x23be waiting for monitor 
> entry [0x7fab3dbff000]
>java.lang.Thread.State: BLOCKED (on object monitor)
>   at 
> org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:100)
>   - waiting to lock <0x00070a913ee8> (a 
> org.apache.spark.sql.execution.streaming.IncrementalExecution)
>   at 
> org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:100)
>   at 
> org.apache.spark.sql.kafka010.KafkaContinuousSourceTopicDeletionSuite$$anonfun$3$$anonfun$apply$mcV$sp$12$$anonfun$apply$15.apply(KafkaContinuousSourceSuite.scala:210)
>   at 
> org.apache.spark.sql.kafka010.KafkaContinuousSourceTopicDeletionSuite$$anonfun$3$$anonfun$apply$mcV$sp$12$$anonfun$apply$15.apply(KafkaContinuousSourceSuite.scala:209)
> ...
> {code}
> It hung forever because the test's main thread was trying to access 
> `executedPlan` while the lock was held by the streaming thread.
> This is a pretty common issue when using lazy vals, as all lazy vals of the 
> same object share the same lock.
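
A self-contained demo of the underlying behavior (plain Scala, not Spark code): 
in Scala 2, all lazy vals of an instance are initialized under the same monitor 
(the instance itself), so a slow initializer blocks access to unrelated lazy 
vals from other threads.

{code:scala}
object LazyValLockDemo {
  class Holder {
    lazy val slow: Int = { Thread.sleep(2000); 1 } // holds the instance lock while initializing
    lazy val fast: Int = 2                         // unrelated, but guarded by the same lock
  }

  def main(args: Array[String]): Unit = {
    val h = new Holder
    val t = new Thread(new Runnable { override def run(): Unit = { h.slow; () } })
    t.start()
    Thread.sleep(100) // let the slow initializer grab the lock first
    val start = System.nanoTime()
    h.fast            // blocks until `slow` finishes initializing
    println(s"accessing `fast` waited ${(System.nanoTime() - start) / 1000000} ms")
    t.join()
  }
}
{code}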



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26041) catalyst cuts out some columns from dataframes: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute

2018-11-14 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16686907#comment-16686907
 ] 

Ruslan Dautkhanov edited comment on SPARK-26041 at 11/14/18 5:45 PM:
-

thanks for checking this [~mgaido] 

just attached txt file that shows sequence of dataframe creation and last 
failing dataframe too 

All SparkSQL 

It always reproduces this issue for us.

Let us know what you find out.



was (Author: tagar):
thank for checking this [~mgaido] 

just attached txt file that shows sequence of dataframe creation and last 
failing dataframe too 

All SparkSQL 

It always reproduces this issue for us.

Let us know what you find out.


> catalyst cuts out some columns from dataframes: 
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
> attribute
> -
>
> Key: SPARK-26041
> URL: https://issues.apache.org/jira/browse/SPARK-26041
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, Spark Core
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
> Environment: Spark 2.3.2 
> PySpark 2.7.15 + Hadoop 2.6
> When we materialize one of the intermediate dataframes as a parquet table and 
> read it back in, this error doesn't happen (exact same downstream queries).
>  
>Reporter: Ruslan Dautkhanov
>Priority: Major
>  Labels: catalyst, optimization
> Attachments: SPARK-26041.txt
>
>
> There is a workflow with a number of group-by's, joins, `exists` and `in`s 
> between a set of dataframes. 
> We are getting the following exception; the reason is that Catalyst cuts some 
> columns out of the dataframes: 
> {noformat}
> Unhandled error: , An error occurred 
> while calling o1187.cache.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 153 
> in stage 2011.0 failed 4 times, most recent failure: Lost task 153.3 in stage 
> 2011.0 (TID 832340, pc1udatahad23, execut
> or 153): org.apache.spark.sql.catalyst.errors.package$TreeNodeException: 
> Binding attribute, tree: part_code#56012
>  at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
>  at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:91)
>  at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:90)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>  at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
>  at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
>  at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:90)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:45)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:40)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1318)
>  at org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:401)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition$lzycompute(BroadcastNestedLoopJoinExec.scala:87)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition(BroadcastNestedLoopJoinExec.scala:85)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$5$$anonfun$apply$4$$anonfun$6.apply(BroadcastNestedLoopJoinExec.scala:210)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$5$$anonfun$apply$4$$anonfun$6.apply(BroadcastNestedLoopJoinExec.scala:210)
>  at 
> scala.collection.IndexedSeqOptimized$class.prefixLengthImpl(IndexedSeqOptimized.scala:38)
>  at 
> 

[jira] [Comment Edited] (SPARK-26041) catalyst cuts out some columns from dataframes: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute

2018-11-14 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16686907#comment-16686907
 ] 

Ruslan Dautkhanov edited comment on SPARK-26041 at 11/14/18 5:45 PM:
-

thanks for checking this [~mgaido] 

just attached txt file that shows sequence of dataframe creation and last 
failing dataframe too 

All SparkSQL 

It always reproduces this issue for us.

Let us know what you find out.



was (Author: tagar):
thank for checking this [~mgaido] 

just attached sql that shows sequence of dataframe creation and last failing 
dataframe too 

All SparkSQL 

It always reproduces this issue for us.

Let us know what you find out.


> catalyst cuts out some columns from dataframes: 
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
> attribute
> -
>
> Key: SPARK-26041
> URL: https://issues.apache.org/jira/browse/SPARK-26041
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, Spark Core
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
> Environment: Spark 2.3.2 
> PySpark 2.7.15 + Hadoop 2.6
> When we materialize one of the intermediate dataframes as a parquet table and 
> read it back in, this error doesn't happen (exact same downstream queries).
>  
>Reporter: Ruslan Dautkhanov
>Priority: Major
>  Labels: catalyst, optimization
> Attachments: SPARK-26041.txt
>
>
> There is a workflow with a number of group-by's, joins, `exists` and `in`s 
> between a set of dataframes. 
> We are getting the following exception; the reason is that Catalyst cuts some 
> columns out of the dataframes: 
> {noformat}
> Unhandled error: , An error occurred 
> while calling o1187.cache.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 153 
> in stage 2011.0 failed 4 times, most recent failure: Lost task 153.3 in stage 
> 2011.0 (TID 832340, pc1udatahad23, execut
> or 153): org.apache.spark.sql.catalyst.errors.package$TreeNodeException: 
> Binding attribute, tree: part_code#56012
>  at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
>  at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:91)
>  at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:90)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>  at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
>  at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
>  at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:90)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:45)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:40)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1318)
>  at org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:401)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition$lzycompute(BroadcastNestedLoopJoinExec.scala:87)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition(BroadcastNestedLoopJoinExec.scala:85)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$5$$anonfun$apply$4$$anonfun$6.apply(BroadcastNestedLoopJoinExec.scala:210)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$5$$anonfun$apply$4$$anonfun$6.apply(BroadcastNestedLoopJoinExec.scala:210)
>  at 
> scala.collection.IndexedSeqOptimized$class.prefixLengthImpl(IndexedSeqOptimized.scala:38)
>  at 
> 

[jira] [Commented] (SPARK-26041) catalyst cuts out some columns from dataframes: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute

2018-11-14 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16686907#comment-16686907
 ] 

Ruslan Dautkhanov commented on SPARK-26041:
---

thanks for checking this [~mgaido] 

just attached sql that shows sequence of dataframe creation and last failing 
dataframe too 

All SparkSQL 

It always reproduces this issue for us.

Let us know what you find out.


> catalyst cuts out some columns from dataframes: 
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
> attribute
> -
>
> Key: SPARK-26041
> URL: https://issues.apache.org/jira/browse/SPARK-26041
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, Spark Core
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
> Environment: Spark 2.3.2 
> PySpark 2.7.15 + Hadoop 2.6
> When we materialize one of the intermediate dataframes as a parquet table and 
> read it back in, this error doesn't happen (exact same downstream queries).
>  
>Reporter: Ruslan Dautkhanov
>Priority: Major
>  Labels: catalyst, optimization
> Attachments: SPARK-26041.txt
>
>
> There is a workflow with a number of group-by's, joins, `exists` and `in`s 
> between a set of dataframes. 
> We are getting the following exception; the reason is that Catalyst cuts some 
> columns out of the dataframes: 
> {noformat}
> Unhandled error: , An error occurred 
> while calling o1187.cache.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 153 
> in stage 2011.0 failed 4 times, most recent failure: Lost task 153.3 in stage 
> 2011.0 (TID 832340, pc1udatahad23, execut
> or 153): org.apache.spark.sql.catalyst.errors.package$TreeNodeException: 
> Binding attribute, tree: part_code#56012
>  at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
>  at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:91)
>  at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:90)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>  at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
>  at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
>  at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:90)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:45)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:40)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1318)
>  at org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:401)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition$lzycompute(BroadcastNestedLoopJoinExec.scala:87)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition(BroadcastNestedLoopJoinExec.scala:85)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$5$$anonfun$apply$4$$anonfun$6.apply(BroadcastNestedLoopJoinExec.scala:210)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$5$$anonfun$apply$4$$anonfun$6.apply(BroadcastNestedLoopJoinExec.scala:210)
>  at 
> scala.collection.IndexedSeqOptimized$class.prefixLengthImpl(IndexedSeqOptimized.scala:38)
>  at 
> scala.collection.IndexedSeqOptimized$class.exists(IndexedSeqOptimized.scala:46)
>  at scala.collection.mutable.ArrayOps$ofRef.exists(ArrayOps.scala:186)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$5$$anonfun$apply$4.apply(BroadcastNestedLoopJoinExec.scala:210)
>  

[jira] [Updated] (SPARK-26041) catalyst cuts out some columns from dataframes: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute

2018-11-14 Thread Ruslan Dautkhanov (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruslan Dautkhanov updated SPARK-26041:
--
Attachment: SPARK-26041.txt

> catalyst cuts out some columns from dataframes: 
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
> attribute
> -
>
> Key: SPARK-26041
> URL: https://issues.apache.org/jira/browse/SPARK-26041
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, Spark Core
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
> Environment: Spark 2.3.2 
> PySpark 2.7.15 + Hadoop 2.6
> When we materialize one of the intermediate dataframes as a parquet table and 
> read it back in, this error doesn't happen (exact same downstream queries).
>  
>Reporter: Ruslan Dautkhanov
>Priority: Major
>  Labels: catalyst, optimization
> Attachments: SPARK-26041.txt
>
>
> There is a workflow with a number of group-by's, joins, `exists` and `in`s 
> between a set of dataframes. 
> We are getting the following exception; the reason is that Catalyst cuts some 
> columns out of the dataframes: 
> {noformat}
> Unhandled error: , An error occurred 
> while calling o1187.cache.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 153 
> in stage 2011.0 failed 4 times, most recent failure: Lost task 153.3 in stage 
> 2011.0 (TID 832340, pc1udatahad23, execut
> or 153): org.apache.spark.sql.catalyst.errors.package$TreeNodeException: 
> Binding attribute, tree: part_code#56012
>  at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
>  at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:91)
>  at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:90)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>  at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
>  at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
>  at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:90)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:45)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:40)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1318)
>  at org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:401)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition$lzycompute(BroadcastNestedLoopJoinExec.scala:87)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition(BroadcastNestedLoopJoinExec.scala:85)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$5$$anonfun$apply$4$$anonfun$6.apply(BroadcastNestedLoopJoinExec.scala:210)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$5$$anonfun$apply$4$$anonfun$6.apply(BroadcastNestedLoopJoinExec.scala:210)
>  at 
> scala.collection.IndexedSeqOptimized$class.prefixLengthImpl(IndexedSeqOptimized.scala:38)
>  at 
> scala.collection.IndexedSeqOptimized$class.exists(IndexedSeqOptimized.scala:46)
>  at scala.collection.mutable.ArrayOps$ofRef.exists(ArrayOps.scala:186)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$5$$anonfun$apply$4.apply(BroadcastNestedLoopJoinExec.scala:210)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$5$$anonfun$apply$4.apply(BroadcastNestedLoopJoinExec.scala:209)
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>  at 

[jira] [Resolved] (SPARK-23067) Allow for easier debugging of the docker container

2018-11-14 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-23067.

Resolution: Duplicate

This was added in SPARK-24534. Yay for searching before filing a new bug...

> Allow for easier debugging of the docker container
> --
>
> Key: SPARK-23067
> URL: https://issues.apache.org/jira/browse/SPARK-23067
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Anirudh Ramanathan
>Priority: Minor
>
> `docker run -it foxish/spark:v2.3.0 /bin/bash` fails because we don't accept 
> any command except driver, executor and init. Consider piping unknown 
> commands through instead of rejecting them.
> It is still possible to debug with 
> `docker run -it --entrypoint=/bin/bash foxish/spark:v2.3.0`, but it's common to 
> try to run a different command as shown above. Also consider documenting how to 
> debug/inspect the docker images.
> [~vanzin] [~kimoonkim]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26065) Change query hint from a `LogicalPlan` to a field

2018-11-14 Thread Maryann Xue (JIRA)
Maryann Xue created SPARK-26065:
---

 Summary: Change query hint from a `LogicalPlan` to a field
 Key: SPARK-26065
 URL: https://issues.apache.org/jira/browse/SPARK-26065
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Maryann Xue


The existing query hint implementation relies on a logical plan node 
{{ResolvedHint}} to store query hints in logical plans, and on {{Statistics}} 
in physical plans. Since {{ResolvedHint}} is not really a logical operator and 
can break pattern matching for existing and future optimization rules, it is as 
much of a problem for the Optimizer as the old {{AnalysisBarrier}} was for the 
Analyzer.

Given that all our query hints are either 1) a join hint, i.e., the broadcast 
hint, or 2) a re-partition hint, which is itself an operator, we only need to 
add a hint field on the {{Join}} plan; that is good enough for current hint 
usage.
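
For illustration, a minimal, self-contained sketch of the difference between 
carrying a hint as a wrapper node versus as a field on the join operator. These 
are simplified stand-in types, not the real Catalyst {{LogicalPlan}}, 
{{ResolvedHint}} or {{Join}} classes:

{code:scala}
// Simplified stand-ins, not the actual Catalyst classes.
sealed trait Plan
case class Relation(name: String) extends Plan

// Current shape: the hint is an extra node in the tree, so a rule that
// pattern-matches on a join of two children misses it whenever a
// ResolvedHint wrapper sits in between.
case class ResolvedHint(child: Plan, broadcast: Boolean) extends Plan
case class JoinAsToday(left: Plan, right: Plan) extends Plan

// Proposed shape: the hint travels as a field on the join node itself,
// so pattern matching on the operator keeps working.
case class JoinHint(broadcastLeft: Boolean, broadcastRight: Boolean)
case class JoinProposed(left: Plan, right: Plan, hint: JoinHint) extends Plan

val today = JoinAsToday(ResolvedHint(Relation("t1"), broadcast = true), Relation("t2"))
val proposed = JoinProposed(Relation("t1"), Relation("t2"),
  JoinHint(broadcastLeft = true, broadcastRight = false))
{code}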



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26041) catalyst cuts out some columns from dataframes: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute

2018-11-14 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16686870#comment-16686870
 ] 

Marco Gaido commented on SPARK-26041:
-

No, it is not; for 2.3 we would need a dedicated fix.

> catalyst cuts out some columns from dataframes: 
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
> attribute
> -
>
> Key: SPARK-26041
> URL: https://issues.apache.org/jira/browse/SPARK-26041
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, Spark Core
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
> Environment: Spark 2.3.2 
> PySpark 2.7.15 + Hadoop 2.6
> When we materialize one of intermediate dataframes as a parquet table, and 
> read it back in, this error doesn't happen (exact same downflow queries ). 
>  
>Reporter: Ruslan Dautkhanov
>Priority: Major
>  Labels: catalyst, optimization
>
> There is a workflow with a number of group-bys, joins, `exists` and `in`s 
> between a set of dataframes. 
> We are getting the following exception, and the reason is that Catalyst cuts 
> some columns out of the dataframes: 
> {noformat}
> Unhandled error: , An error occurred 
> while calling o1187.cache.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 153 
> in stage 2011.0 failed 4 times, most recent failure: Lost task 153.3 in stage 
> 2011.0 (TID 832340, pc1udatahad23, execut
> or 153): org.apache.spark.sql.catalyst.errors.package$TreeNodeException: 
> Binding attribute, tree: part_code#56012
>  at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
>  at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:91)
>  at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:90)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>  at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
>  at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
>  at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:90)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:45)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:40)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1318)
>  at org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:401)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition$lzycompute(BroadcastNestedLoopJoinExec.scala:87)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition(BroadcastNestedLoopJoinExec.scala:85)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$5$$anonfun$apply$4$$anonfun$6.apply(BroadcastNestedLoopJoinExec.scala:210)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$5$$anonfun$apply$4$$anonfun$6.apply(BroadcastNestedLoopJoinExec.scala:210)
>  at 
> scala.collection.IndexedSeqOptimized$class.prefixLengthImpl(IndexedSeqOptimized.scala:38)
>  at 
> scala.collection.IndexedSeqOptimized$class.exists(IndexedSeqOptimized.scala:46)
>  at scala.collection.mutable.ArrayOps$ofRef.exists(ArrayOps.scala:186)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$5$$anonfun$apply$4.apply(BroadcastNestedLoopJoinExec.scala:210)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$5$$anonfun$apply$4.apply(BroadcastNestedLoopJoinExec.scala:209)
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>  at 

[jira] [Commented] (SPARK-26041) catalyst cuts out some columns from dataframes: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute

2018-11-14 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16686857#comment-16686857
 ] 

Marco Gaido commented on SPARK-26041:
-

Then it'd help if you could provide a reproducer for this... Thanks.

> catalyst cuts out some columns from dataframes: 
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
> attribute
> -
>
> Key: SPARK-26041
> URL: https://issues.apache.org/jira/browse/SPARK-26041
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, Spark Core
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
> Environment: Spark 2.3.2 
> PySpark 2.7.15 + Hadoop 2.6
> When we materialize one of intermediate dataframes as a parquet table, and 
> read it back in, this error doesn't happen (exact same downflow queries ). 
>  
>Reporter: Ruslan Dautkhanov
>Priority: Major
>  Labels: catalyst, optimization
>
> There is a workflow with a number of group-bys, joins, `exists` and `in`s 
> between a set of dataframes. 
> We are getting the following exception, and the reason is that Catalyst cuts 
> some columns out of the dataframes: 
> {noformat}
> Unhandled error: , An error occurred 
> while calling o1187.cache.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 153 
> in stage 2011.0 failed 4 times, most recent failure: Lost task 153.3 in stage 
> 2011.0 (TID 832340, pc1udatahad23, execut
> or 153): org.apache.spark.sql.catalyst.errors.package$TreeNodeException: 
> Binding attribute, tree: part_code#56012
>  at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
>  at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:91)
>  at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:90)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>  at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
>  at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
>  at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:90)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:45)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:40)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1318)
>  at org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:401)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition$lzycompute(BroadcastNestedLoopJoinExec.scala:87)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition(BroadcastNestedLoopJoinExec.scala:85)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$5$$anonfun$apply$4$$anonfun$6.apply(BroadcastNestedLoopJoinExec.scala:210)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$5$$anonfun$apply$4$$anonfun$6.apply(BroadcastNestedLoopJoinExec.scala:210)
>  at 
> scala.collection.IndexedSeqOptimized$class.prefixLengthImpl(IndexedSeqOptimized.scala:38)
>  at 
> scala.collection.IndexedSeqOptimized$class.exists(IndexedSeqOptimized.scala:46)
>  at scala.collection.mutable.ArrayOps$ofRef.exists(ArrayOps.scala:186)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$5$$anonfun$apply$4.apply(BroadcastNestedLoopJoinExec.scala:210)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$5$$anonfun$apply$4.apply(BroadcastNestedLoopJoinExec.scala:209)
>  at 

[jira] [Commented] (SPARK-26041) catalyst cuts out some columns from dataframes: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute

2018-11-14 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16686851#comment-16686851
 ] 

Ruslan Dautkhanov commented on SPARK-26041:
---

Thanks for referencing that jira, [~mgaido].

From its description, SPARK-26057 seems specific to Spark 2.4 only, but we see 
this problem in Spark 2.3.1 and 2.3.2 as well.

Can you check whether https://github.com/apache/spark/pull/23035 is applicable to 
Spark 2.3 too? Thanks
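
For reference, the environment field of this ticket notes that materializing the 
intermediate dataframe as a parquet table and reading it back avoids the error. 
A minimal spark-shell style sketch of that workaround (the dataframe and path 
below are made-up stand-ins, not the real workflow):

{code:scala}
import spark.implicits._

// Stand-in for the real intermediate dataframe in the reported workflow.
val intermediateDf = Seq((1, "a"), (2, "b")).toDF("id", "v")

// Materialize it as parquet...
val checkpointPath = "/tmp/spark-26041-intermediate.parquet"
intermediateDf.write.mode("overwrite").parquet(checkpointPath)

// ...and read it back, so downstream joins start from a fresh scan
// instead of the plan that triggers the binding error.
val materialized = spark.read.parquet(checkpointPath)
// Continue the workflow with `materialized` in place of `intermediateDf`.
{code}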

> catalyst cuts out some columns from dataframes: 
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
> attribute
> -
>
> Key: SPARK-26041
> URL: https://issues.apache.org/jira/browse/SPARK-26041
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, Spark Core
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
> Environment: Spark 2.3.2 
> PySpark 2.7.15 + Hadoop 2.6
> When we materialize one of intermediate dataframes as a parquet table, and 
> read it back in, this error doesn't happen (exact same downflow queries ). 
>  
>Reporter: Ruslan Dautkhanov
>Priority: Major
>  Labels: catalyst, optimization
>
> There is a workflow with a number of group-bys, joins, `exists` and `in`s 
> between a set of dataframes. 
> We are getting the following exception, and the reason is that Catalyst cuts 
> some columns out of the dataframes: 
> {noformat}
> Unhandled error: , An error occurred 
> while calling o1187.cache.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 153 
> in stage 2011.0 failed 4 times, most recent failure: Lost task 153.3 in stage 
> 2011.0 (TID 832340, pc1udatahad23, execut
> or 153): org.apache.spark.sql.catalyst.errors.package$TreeNodeException: 
> Binding attribute, tree: part_code#56012
>  at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
>  at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:91)
>  at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:90)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>  at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
>  at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
>  at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:90)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:45)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:40)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1318)
>  at org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:401)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition$lzycompute(BroadcastNestedLoopJoinExec.scala:87)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition(BroadcastNestedLoopJoinExec.scala:85)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$5$$anonfun$apply$4$$anonfun$6.apply(BroadcastNestedLoopJoinExec.scala:210)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$5$$anonfun$apply$4$$anonfun$6.apply(BroadcastNestedLoopJoinExec.scala:210)
>  at 
> scala.collection.IndexedSeqOptimized$class.prefixLengthImpl(IndexedSeqOptimized.scala:38)
>  at 
> scala.collection.IndexedSeqOptimized$class.exists(IndexedSeqOptimized.scala:46)
>  at scala.collection.mutable.ArrayOps$ofRef.exists(ArrayOps.scala:186)
>  at 
> 

[jira] [Updated] (SPARK-26041) catalyst cuts out some columns from dataframes: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute

2018-11-14 Thread Ruslan Dautkhanov (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruslan Dautkhanov updated SPARK-26041:
--
Affects Version/s: 2.3.0
   2.3.1

> catalyst cuts out some columns from dataframes: 
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
> attribute
> -
>
> Key: SPARK-26041
> URL: https://issues.apache.org/jira/browse/SPARK-26041
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, Spark Core
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
> Environment: Spark 2.3.2 
> PySpark 2.7.15 + Hadoop 2.6
> When we materialize one of intermediate dataframes as a parquet table, and 
> read it back in, this error doesn't happen (exact same downflow queries ). 
>  
>Reporter: Ruslan Dautkhanov
>Priority: Major
>  Labels: catalyst, optimization
>
> There is a workflow with a number of group-bys, joins, `exists` and `in`s 
> between a set of dataframes. 
> We are getting the following exception, and the reason is that Catalyst cuts 
> some columns out of the dataframes: 
> {noformat}
> Unhandled error: , An error occurred 
> while calling o1187.cache.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 153 
> in stage 2011.0 failed 4 times, most recent failure: Lost task 153.3 in stage 
> 2011.0 (TID 832340, pc1udatahad23, execut
> or 153): org.apache.spark.sql.catalyst.errors.package$TreeNodeException: 
> Binding attribute, tree: part_code#56012
>  at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
>  at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:91)
>  at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:90)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>  at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
>  at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
>  at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:90)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:45)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:40)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1318)
>  at org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:401)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition$lzycompute(BroadcastNestedLoopJoinExec.scala:87)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition(BroadcastNestedLoopJoinExec.scala:85)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$5$$anonfun$apply$4$$anonfun$6.apply(BroadcastNestedLoopJoinExec.scala:210)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$5$$anonfun$apply$4$$anonfun$6.apply(BroadcastNestedLoopJoinExec.scala:210)
>  at 
> scala.collection.IndexedSeqOptimized$class.prefixLengthImpl(IndexedSeqOptimized.scala:38)
>  at 
> scala.collection.IndexedSeqOptimized$class.exists(IndexedSeqOptimized.scala:46)
>  at scala.collection.mutable.ArrayOps$ofRef.exists(ArrayOps.scala:186)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$5$$anonfun$apply$4.apply(BroadcastNestedLoopJoinExec.scala:210)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$5$$anonfun$apply$4.apply(BroadcastNestedLoopJoinExec.scala:209)
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>  at 

[jira] [Commented] (SPARK-26041) catalyst cuts out some columns from dataframes: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute

2018-11-14 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16686828#comment-16686828
 ] 

Marco Gaido commented on SPARK-26041:
-

I think this may be a duplicate of SPARK-26057 (or the other way around). Could 
you please check whether the fix I just submitted for SPARK-26057 also fixes your 
case? Thanks.

> catalyst cuts out some columns from dataframes: 
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
> attribute
> -
>
> Key: SPARK-26041
> URL: https://issues.apache.org/jira/browse/SPARK-26041
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, Spark Core
>Affects Versions: 2.3.2, 2.4.0
> Environment: Spark 2.3.2 
> PySpark 2.7.15 + Hadoop 2.6
> When we materialize one of intermediate dataframes as a parquet table, and 
> read it back in, this error doesn't happen (exact same downflow queries ). 
>  
>Reporter: Ruslan Dautkhanov
>Priority: Major
>  Labels: catalyst, optimization
>
> There is a workflow with a number of group-bys, joins, `exists` and `in`s 
> between a set of dataframes. 
> We are getting the following exception, and the reason is that Catalyst cuts 
> some columns out of the dataframes: 
> {noformat}
> Unhandled error: , An error occurred 
> while calling o1187.cache.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 153 
> in stage 2011.0 failed 4 times, most recent failure: Lost task 153.3 in stage 
> 2011.0 (TID 832340, pc1udatahad23, execut
> or 153): org.apache.spark.sql.catalyst.errors.package$TreeNodeException: 
> Binding attribute, tree: part_code#56012
>  at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
>  at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:91)
>  at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:90)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>  at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
>  at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
>  at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:90)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:45)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:40)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1318)
>  at org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:401)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition$lzycompute(BroadcastNestedLoopJoinExec.scala:87)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition(BroadcastNestedLoopJoinExec.scala:85)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$5$$anonfun$apply$4$$anonfun$6.apply(BroadcastNestedLoopJoinExec.scala:210)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$5$$anonfun$apply$4$$anonfun$6.apply(BroadcastNestedLoopJoinExec.scala:210)
>  at 
> scala.collection.IndexedSeqOptimized$class.prefixLengthImpl(IndexedSeqOptimized.scala:38)
>  at 
> scala.collection.IndexedSeqOptimized$class.exists(IndexedSeqOptimized.scala:46)
>  at scala.collection.mutable.ArrayOps$ofRef.exists(ArrayOps.scala:186)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$5$$anonfun$apply$4.apply(BroadcastNestedLoopJoinExec.scala:210)
>  at 
> 

[jira] [Resolved] (SPARK-25118) Need a solution to persist Spark application console outputs when running in shell/yarn client mode

2018-11-14 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-25118.

   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 22504
[https://github.com/apache/spark/pull/22504]

> Need a solution to persist Spark application console outputs when running in 
> shell/yarn client mode
> ---
>
> Key: SPARK-25118
> URL: https://issues.apache.org/jira/browse/SPARK-25118
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.0
>Reporter: Ankur Gupta
>Assignee: Ankur Gupta
>Priority: Major
> Fix For: 3.0.0
>
>
> We execute Spark applications in YARN client mode a lot of the time. When we do 
> so, the Spark driver logs are printed to the console.
> We need a solution to persist the console output for later use, either for 
> troubleshooting or for other log analysis. 
> Ideally, we would like to persist it along with the YARN logs (when the 
> application is run in YARN client mode). Also, this has to be end-user 
> agnostic, so that the logs are available later without requiring 
> the end-user to make configuration changes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25118) Need a solution to persist Spark application console outputs when running in shell/yarn client mode

2018-11-14 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-25118:
--

Assignee: Ankur Gupta

> Need a solution to persist Spark application console outputs when running in 
> shell/yarn client mode
> ---
>
> Key: SPARK-25118
> URL: https://issues.apache.org/jira/browse/SPARK-25118
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.0
>Reporter: Ankur Gupta
>Assignee: Ankur Gupta
>Priority: Major
>
> We execute Spark applications in YARN client mode a lot of the time. When we do 
> so, the Spark driver logs are printed to the console.
> We need a solution to persist the console output for later use, either for 
> troubleshooting or for other log analysis. 
> Ideally, we would like to persist it along with the YARN logs (when the 
> application is run in YARN client mode). Also, this has to be end-user 
> agnostic, so that the logs are available later without requiring 
> the end-user to make configuration changes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26057) Table joining is broken in Spark 2.4

2018-11-14 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16686715#comment-16686715
 ] 

Apache Spark commented on SPARK-26057:
--

User 'mgaido91' has created a pull request for this issue:
https://github.com/apache/spark/pull/23035

> Table joining is broken in Spark 2.4
> 
>
> Key: SPARK-26057
> URL: https://issues.apache.org/jira/browse/SPARK-26057
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Pavel Parkhomenko
>Priority: Major
>
> This sample works in spark-shell 2.3.1 and throws an exception in 2.4.0
> {code:java}
> import java.util.Arrays.asList
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types._
> spark.createDataFrame(
>   asList(
>     Row("1-1", "sp", 6),
>     Row("1-1", "pc", 5),
>     Row("1-2", "pc", 4),
>     Row("2-1", "sp", 3),
>     Row("2-2", "pc", 2),
>     Row("2-2", "sp", 1)
>   ),
>   StructType(List(StructField("id", StringType), StructField("layout", 
> StringType), StructField("n", IntegerType)))
> ).createOrReplaceTempView("cc")
> spark.createDataFrame(
>   asList(
>     Row("sp", 1),
>     Row("sp", 1),
>     Row("sp", 2),
>     Row("sp", 3),
>     Row("sp", 3),
>     Row("sp", 4),
>     Row("sp", 5),
>     Row("sp", 5),
>     Row("pc", 1),
>     Row("pc", 2),
>     Row("pc", 2),
>     Row("pc", 3),
>     Row("pc", 4),
>     Row("pc", 4),
>     Row("pc", 5)
>   ),
>   StructType(List(StructField("layout", StringType), StructField("ts", 
> IntegerType)))
> ).createOrReplaceTempView("p")
> spark.createDataFrame(
>  asList(
>     Row("1-1", "sp", 1),
>     Row("1-1", "sp", 2),
>     Row("1-1", "pc", 3),
>     Row("1-2", "pc", 3),
>     Row("1-2", "pc", 4),
>     Row("2-1", "sp", 4),
>     Row("2-1", "sp", 5),
>     Row("2-2", "pc", 6),
>     Row("2-2", "sp", 6)
>   ),
>   StructType(List(StructField("id", StringType), StructField("layout", 
> StringType), StructField("ts", IntegerType)))
> ).createOrReplaceTempView("c")
> spark.sql("""
> SELECT cc.id, cc.layout, count(*) as m
>   FROM cc
>   JOIN p USING(layout)
>   WHERE EXISTS(SELECT 1 FROM c WHERE c.id = cc.id AND c.layout = cc.layout 
> AND c.ts > p.ts)
>   GROUP BY cc.id, cc.layout
> """).createOrReplaceTempView("pcc")
> spark.sql("SELECT * FROM pcc ORDER BY id, layout").show
> spark.sql("""
> SELECT cc.id, cc.layout, n, m
>   FROM cc
>   LEFT OUTER JOIN pcc ON pcc.id = cc.id AND pcc.layout = cc.layout
> """).createOrReplaceTempView("k")
> spark.sql("SELECT * FROM k ORDER BY id, layout").show
> {code}
> Actually I tried to catch another bug: similar calculations with joins and 
> nested queries have different results in Spark 2.3.1 and 2.4.0, but when I 
> tried to create a minimal example I received exception
> {code:java}
> java.lang.RuntimeException: Couldn't find id#0 in 
> [id#38,layout#39,ts#7,id#10,layout#11,ts#12]
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26057) Table joining is broken in Spark 2.4

2018-11-14 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26057:


Assignee: (was: Apache Spark)

> Table joining is broken in Spark 2.4
> 
>
> Key: SPARK-26057
> URL: https://issues.apache.org/jira/browse/SPARK-26057
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Pavel Parkhomenko
>Priority: Major
>
> This sample works in spark-shell 2.3.1 and throws an exception in 2.4.0
> {code:java}
> import java.util.Arrays.asList
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types._
> spark.createDataFrame(
>   asList(
>     Row("1-1", "sp", 6),
>     Row("1-1", "pc", 5),
>     Row("1-2", "pc", 4),
>     Row("2-1", "sp", 3),
>     Row("2-2", "pc", 2),
>     Row("2-2", "sp", 1)
>   ),
>   StructType(List(StructField("id", StringType), StructField("layout", 
> StringType), StructField("n", IntegerType)))
> ).createOrReplaceTempView("cc")
> spark.createDataFrame(
>   asList(
>     Row("sp", 1),
>     Row("sp", 1),
>     Row("sp", 2),
>     Row("sp", 3),
>     Row("sp", 3),
>     Row("sp", 4),
>     Row("sp", 5),
>     Row("sp", 5),
>     Row("pc", 1),
>     Row("pc", 2),
>     Row("pc", 2),
>     Row("pc", 3),
>     Row("pc", 4),
>     Row("pc", 4),
>     Row("pc", 5)
>   ),
>   StructType(List(StructField("layout", StringType), StructField("ts", 
> IntegerType)))
> ).createOrReplaceTempView("p")
> spark.createDataFrame(
>  asList(
>     Row("1-1", "sp", 1),
>     Row("1-1", "sp", 2),
>     Row("1-1", "pc", 3),
>     Row("1-2", "pc", 3),
>     Row("1-2", "pc", 4),
>     Row("2-1", "sp", 4),
>     Row("2-1", "sp", 5),
>     Row("2-2", "pc", 6),
>     Row("2-2", "sp", 6)
>   ),
>   StructType(List(StructField("id", StringType), StructField("layout", 
> StringType), StructField("ts", IntegerType)))
> ).createOrReplaceTempView("c")
> spark.sql("""
> SELECT cc.id, cc.layout, count(*) as m
>   FROM cc
>   JOIN p USING(layout)
>   WHERE EXISTS(SELECT 1 FROM c WHERE c.id = cc.id AND c.layout = cc.layout 
> AND c.ts > p.ts)
>   GROUP BY cc.id, cc.layout
> """).createOrReplaceTempView("pcc")
> spark.sql("SELECT * FROM pcc ORDER BY id, layout").show
> spark.sql("""
> SELECT cc.id, cc.layout, n, m
>   FROM cc
>   LEFT OUTER JOIN pcc ON pcc.id = cc.id AND pcc.layout = cc.layout
> """).createOrReplaceTempView("k")
> spark.sql("SELECT * FROM k ORDER BY id, layout").show
> {code}
> Actually I tried to catch another bug: similar calculations with joins and 
> nested queries have different results in Spark 2.3.1 and 2.4.0, but when I 
> tried to create a minimal example I received exception
> {code:java}
> java.lang.RuntimeException: Couldn't find id#0 in 
> [id#38,layout#39,ts#7,id#10,layout#11,ts#12]
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26057) Table joining is broken in Spark 2.4

2018-11-14 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16686710#comment-16686710
 ] 

Apache Spark commented on SPARK-26057:
--

User 'mgaido91' has created a pull request for this issue:
https://github.com/apache/spark/pull/23035

> Table joining is broken in Spark 2.4
> 
>
> Key: SPARK-26057
> URL: https://issues.apache.org/jira/browse/SPARK-26057
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Pavel Parkhomenko
>Priority: Major
>
> This sample works in spark-shell 2.3.1 and throws an exception in 2.4.0
> {code:java}
> import java.util.Arrays.asList
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types._
> spark.createDataFrame(
>   asList(
>     Row("1-1", "sp", 6),
>     Row("1-1", "pc", 5),
>     Row("1-2", "pc", 4),
>     Row("2-1", "sp", 3),
>     Row("2-2", "pc", 2),
>     Row("2-2", "sp", 1)
>   ),
>   StructType(List(StructField("id", StringType), StructField("layout", 
> StringType), StructField("n", IntegerType)))
> ).createOrReplaceTempView("cc")
> spark.createDataFrame(
>   asList(
>     Row("sp", 1),
>     Row("sp", 1),
>     Row("sp", 2),
>     Row("sp", 3),
>     Row("sp", 3),
>     Row("sp", 4),
>     Row("sp", 5),
>     Row("sp", 5),
>     Row("pc", 1),
>     Row("pc", 2),
>     Row("pc", 2),
>     Row("pc", 3),
>     Row("pc", 4),
>     Row("pc", 4),
>     Row("pc", 5)
>   ),
>   StructType(List(StructField("layout", StringType), StructField("ts", 
> IntegerType)))
> ).createOrReplaceTempView("p")
> spark.createDataFrame(
>  asList(
>     Row("1-1", "sp", 1),
>     Row("1-1", "sp", 2),
>     Row("1-1", "pc", 3),
>     Row("1-2", "pc", 3),
>     Row("1-2", "pc", 4),
>     Row("2-1", "sp", 4),
>     Row("2-1", "sp", 5),
>     Row("2-2", "pc", 6),
>     Row("2-2", "sp", 6)
>   ),
>   StructType(List(StructField("id", StringType), StructField("layout", 
> StringType), StructField("ts", IntegerType)))
> ).createOrReplaceTempView("c")
> spark.sql("""
> SELECT cc.id, cc.layout, count(*) as m
>   FROM cc
>   JOIN p USING(layout)
>   WHERE EXISTS(SELECT 1 FROM c WHERE c.id = cc.id AND c.layout = cc.layout 
> AND c.ts > p.ts)
>   GROUP BY cc.id, cc.layout
> """).createOrReplaceTempView("pcc")
> spark.sql("SELECT * FROM pcc ORDER BY id, layout").show
> spark.sql("""
> SELECT cc.id, cc.layout, n, m
>   FROM cc
>   LEFT OUTER JOIN pcc ON pcc.id = cc.id AND pcc.layout = cc.layout
> """).createOrReplaceTempView("k")
> spark.sql("SELECT * FROM k ORDER BY id, layout").show
> {code}
> Actually I tried to catch another bug: similar calculations with joins and 
> nested queries have different results in Spark 2.3.1 and 2.4.0, but when I 
> tried to create a minimal example I received exception
> {code:java}
> java.lang.RuntimeException: Couldn't find id#0 in 
> [id#38,layout#39,ts#7,id#10,layout#11,ts#12]
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26057) Table joining is broken in Spark 2.4

2018-11-14 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26057:


Assignee: Apache Spark

> Table joining is broken in Spark 2.4
> 
>
> Key: SPARK-26057
> URL: https://issues.apache.org/jira/browse/SPARK-26057
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Pavel Parkhomenko
>Assignee: Apache Spark
>Priority: Major
>
> This sample works in spark-shell 2.3.1 and throws an exception in 2.4.0
> {code:java}
> import java.util.Arrays.asList
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types._
> spark.createDataFrame(
>   asList(
>     Row("1-1", "sp", 6),
>     Row("1-1", "pc", 5),
>     Row("1-2", "pc", 4),
>     Row("2-1", "sp", 3),
>     Row("2-2", "pc", 2),
>     Row("2-2", "sp", 1)
>   ),
>   StructType(List(StructField("id", StringType), StructField("layout", 
> StringType), StructField("n", IntegerType)))
> ).createOrReplaceTempView("cc")
> spark.createDataFrame(
>   asList(
>     Row("sp", 1),
>     Row("sp", 1),
>     Row("sp", 2),
>     Row("sp", 3),
>     Row("sp", 3),
>     Row("sp", 4),
>     Row("sp", 5),
>     Row("sp", 5),
>     Row("pc", 1),
>     Row("pc", 2),
>     Row("pc", 2),
>     Row("pc", 3),
>     Row("pc", 4),
>     Row("pc", 4),
>     Row("pc", 5)
>   ),
>   StructType(List(StructField("layout", StringType), StructField("ts", 
> IntegerType)))
> ).createOrReplaceTempView("p")
> spark.createDataFrame(
>  asList(
>     Row("1-1", "sp", 1),
>     Row("1-1", "sp", 2),
>     Row("1-1", "pc", 3),
>     Row("1-2", "pc", 3),
>     Row("1-2", "pc", 4),
>     Row("2-1", "sp", 4),
>     Row("2-1", "sp", 5),
>     Row("2-2", "pc", 6),
>     Row("2-2", "sp", 6)
>   ),
>   StructType(List(StructField("id", StringType), StructField("layout", 
> StringType), StructField("ts", IntegerType)))
> ).createOrReplaceTempView("c")
> spark.sql("""
> SELECT cc.id, cc.layout, count(*) as m
>   FROM cc
>   JOIN p USING(layout)
>   WHERE EXISTS(SELECT 1 FROM c WHERE c.id = cc.id AND c.layout = cc.layout 
> AND c.ts > p.ts)
>   GROUP BY cc.id, cc.layout
> """).createOrReplaceTempView("pcc")
> spark.sql("SELECT * FROM pcc ORDER BY id, layout").show
> spark.sql("""
> SELECT cc.id, cc.layout, n, m
>   FROM cc
>   LEFT OUTER JOIN pcc ON pcc.id = cc.id AND pcc.layout = cc.layout
> """).createOrReplaceTempView("k")
> spark.sql("SELECT * FROM k ORDER BY id, layout").show
> {code}
> Actually I tried to catch another bug: similar calculations with joins and 
> nested queries have different results in Spark 2.3.1 and 2.4.0, but when I 
> tried to create a minimal example I received exception
> {code:java}
> java.lang.RuntimeException: Couldn't find id#0 in 
> [id#38,layout#39,ts#7,id#10,layout#11,ts#12]
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26035) Break large streaming/tests.py files into smaller files

2018-11-14 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16686661#comment-16686661
 ] 

Apache Spark commented on SPARK-26035:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/23034

> Break large streaming/tests.py files into smaller files
> ---
>
> Key: SPARK-26035
> URL: https://issues.apache.org/jira/browse/SPARK-26035
> Project: Spark
>  Issue Type: Sub-task
>  Components: DStreams, PySpark
>Affects Versions: 2.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25868) One part of Spark MLlib Kmean Logic Performance problem

2018-11-14 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-25868.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 22893
[https://github.com/apache/spark/pull/22893]

> One part of Spark MLlib Kmean Logic Performance problem
> ---
>
> Key: SPARK-25868
> URL: https://issues.apache.org/jira/browse/SPARK-25868
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.3.2
>Reporter: Liang Li
>Assignee: Liang Li
>Priority: Minor
> Fix For: 3.0.0
>
>
> In the function fastSquaredDistance there is a low-performance code path:
> computing sqDist = Vectors.sqdist(v1, v2) directly performs better than 
> sqDist = sumSquaredNorm - 2.0 * dot(v1, v2).
> So get rid of the low-performance logic in fastSquaredDistance.
> More test results (end-to-end and functional) can be found in 
> https://github.com/apache/spark/pull/22893
> A patch (#22893) has already been submitted for merge.
>  
>  
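
For illustration, a minimal spark-shell style sketch of the two ways of computing 
the squared distance compared above. The vectors are made-up sample data, and 
this is not the private MLUtils.fastSquaredDistance itself:

{code:scala}
import org.apache.spark.ml.linalg.Vectors

// Made-up sample vectors, just to show the two formulas side by side.
val v1 = Vectors.dense(1.0, 2.0, 3.0)
val v2 = Vectors.dense(1.1, 2.1, 3.1)

// Direct squared distance: a single pass over both vectors.
val direct = Vectors.sqdist(v1, v2)

// Norm-expansion form: ||v1||^2 + ||v2||^2 - 2 * dot(v1, v2),
// the variant this ticket proposes to drop.
val norm1 = Vectors.norm(v1, 2.0)
val norm2 = Vectors.norm(v2, 2.0)
val dot = v1.toArray.zip(v2.toArray).map { case (a, b) => a * b }.sum
val expanded = norm1 * norm1 + norm2 * norm2 - 2.0 * dot

println(s"direct = $direct, expanded = $expanded")
{code}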



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25868) One part of Spark MLlib Kmean Logic Performance problem

2018-11-14 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-25868:
-

Assignee: Liang Li

> One part of Spark MLlib Kmean Logic Performance problem
> ---
>
> Key: SPARK-25868
> URL: https://issues.apache.org/jira/browse/SPARK-25868
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.3.2
>Reporter: Liang Li
>Assignee: Liang Li
>Priority: Minor
> Fix For: 3.0.0
>
>
> In the function fastSquaredDistance there is a low-performance code path:
> computing sqDist = Vectors.sqdist(v1, v2) directly performs better than 
> sqDist = sumSquaredNorm - 2.0 * dot(v1, v2).
> So get rid of the low-performance logic in fastSquaredDistance.
> More test results (end-to-end and functional) can be found in 
> https://github.com/apache/spark/pull/22893
> A patch (#22893) has already been submitted for merge.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26054) Creating a computed column applying the spark sql rounding on a column of type decimal affects the orginal column as well.

2018-11-14 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16686677#comment-16686677
 ] 

Apache Spark commented on SPARK-26054:
--

User 'mgaido91' has created a pull request for this issue:
https://github.com/apache/spark/pull/23035

> Creating a computed column applying the spark sql rounding on a column of 
> type decimal affects the orginal column as well.
> --
>
> Key: SPARK-26054
> URL: https://issues.apache.org/jira/browse/SPARK-26054
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jaya Krishna
>Priority: Minor
> Attachments: sparksql-rounding.png
>
>
> When a computed column that rounds the value is added to a data frame, it is 
> affecting the value of the original column as well. The behavior depends on 
> the database column type - If it is either float or double, the result is as 
> expected - the original column will have its own formatting and the computed 
> column will be rounded as per the rounding definition specified for it. 
> However if the column type in the database is decimal, then Spark SQL is 
> applying the rounding even to the original column. Attached image has the 
> spark sql code that shows the issue.
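
For illustration, a minimal spark-shell style sketch of the scenario described 
above. The column names and values are made up; the attached image has the 
reporter's actual code:

{code:scala}
import spark.implicits._
import org.apache.spark.sql.functions.{col, round}
import org.apache.spark.sql.types.DecimalType

// Made-up data: a decimal source column plus a rounded computed column.
val df = Seq("123.456789", "0.987654").toDF("raw")
  .select(col("raw").cast(DecimalType(18, 6)).as("amount"))
  .withColumn("amount_rounded", round(col("amount"), 2))

// Per the report, with a DecimalType source both columns come back rounded,
// while a float/double source keeps the original values in "amount".
df.show(false)
{code}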



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26054) Creating a computed column applying the spark sql rounding on a column of type decimal affects the orginal column as well.

2018-11-14 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16686671#comment-16686671
 ] 

Apache Spark commented on SPARK-26054:
--

User 'mgaido91' has created a pull request for this issue:
https://github.com/apache/spark/pull/23035

> Creating a computed column applying the spark sql rounding on a column of 
> type decimal affects the orginal column as well.
> --
>
> Key: SPARK-26054
> URL: https://issues.apache.org/jira/browse/SPARK-26054
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jaya Krishna
>Priority: Minor
> Attachments: sparksql-rounding.png
>
>
> When a computed column that rounds the value is added to a data frame, it is 
> affecting the value of the original column as well. The behavior depends on 
> the database column type - If it is either float or double, the result is as 
> expected - the original column will have its own formatting and the computed 
> column will be rounded as per the rounding definition specified for it. 
> However if the column type in the database is decimal, then Spark SQL is 
> applying the rounding even to the original column. Attached image has the 
> spark sql code that shows the issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26035) Break large streaming/tests.py files into smaller files

2018-11-14 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26035:


Assignee: (was: Apache Spark)

> Break large streaming/tests.py files into smaller files
> ---
>
> Key: SPARK-26035
> URL: https://issues.apache.org/jira/browse/SPARK-26035
> Project: Spark
>  Issue Type: Sub-task
>  Components: DStreams, PySpark
>Affects Versions: 2.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26035) Break large streaming/tests.py files into smaller files

2018-11-14 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26035:


Assignee: Apache Spark

> Break large streaming/tests.py files into smaller files
> ---
>
> Key: SPARK-26035
> URL: https://issues.apache.org/jira/browse/SPARK-26035
> Project: Spark
>  Issue Type: Sub-task
>  Components: DStreams, PySpark
>Affects Versions: 2.4.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23831) Add org.apache.derby to IsolatedClientLoader

2018-11-14 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16686640#comment-16686640
 ] 

Hyukjin Kwon commented on SPARK-23831:
--

[~marmbrus], what made you come here? Did reverting this actually break 
something?

> Add org.apache.derby to IsolatedClientLoader
> 
>
> Key: SPARK-23831
> URL: https://issues.apache.org/jira/browse/SPARK-23831
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Priority: Major
>
> Add org.apache.derby to IsolatedClientLoader, otherwise it may throw an 
> exception:
> {noformat}
> [info] Cause: java.sql.SQLException: Failed to start database 'metastore_db' 
> with class loader 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@2439ab23, see 
> the next exception for details.
> [info] at 
> org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
> [info] at 
> org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
> [info] at org.apache.derby.impl.jdbc.Util.seeNextException(Unknown Source)
> [info] at org.apache.derby.impl.jdbc.EmbedConnection.bootDatabase(Unknown 
> Source)
> [info] at org.apache.derby.impl.jdbc.EmbedConnection.(Unknown Source)
> [info] at org.apache.derby.jdbc.InternalDriver$1.run(Unknown Source)
> {noformat}
> How to reproduce:
> {noformat}
> sed 's/HiveExternalCatalogSuite/HiveExternalCatalog2Suite/g' 
> sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalogSuite.scala
>  > 
> sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveExternalCatalog2Suite.scala
> build/sbt -Phive "hive/test-only *.HiveExternalCatalogSuite 
> *.HiveExternalCatalog2Suite"
> {noformat}
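
For illustration, a self-contained sketch of the shared-class-prefix idea behind 
this change. This is a simplified illustration, not Spark's actual 
IsolatedClientLoader internals:

{code:scala}
// Simplified illustration only, not the real IsolatedClientLoader code.
// Idea: classes matching a "shared" prefix are loaded by the main class loader,
// so the isolated Hive client does not boot a second embedded Derby metastore.
val sharedPrefixes = Seq(
  "org.apache.spark.",
  "org.slf4j",
  "org.apache.log4j",
  "org.apache.derby."   // the addition proposed here
)

def isSharedClass(name: String): Boolean =
  sharedPrefixes.exists(p => name.startsWith(p))

println(isSharedClass("org.apache.derby.jdbc.EmbeddedDriver"))    // true
println(isSharedClass("org.apache.hadoop.hive.ql.metadata.Hive")) // false
{code}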



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26064) Unable to fetch jar from remote repo while running spark-submit on kubernetes

2018-11-14 Thread Bala Bharath Reddy Resapu (JIRA)
Bala Bharath Reddy Resapu created SPARK-26064:
-

 Summary: Unable to fetch jar from remote repo while running 
spark-submit on kubernetes
 Key: SPARK-26064
 URL: https://issues.apache.org/jira/browse/SPARK-26064
 Project: Spark
  Issue Type: Question
  Components: Kubernetes
Affects Versions: 2.3.2
Reporter: Bala Bharath Reddy Resapu


I am trying to run Spark on Kubernetes with a Docker image. My requirement is 
to download the application jar from an external repository while running 
spark-submit. I am able to download the jar using wget inside the container, but 
it doesn't work when I pass the remote jar URL to spark-submit. I am not 
packaging the jar into the Docker image; it works fine when the jar file is 
included inside the image. 

 

./bin/spark-submit \
  --master k8s://[https://ip:port|https://ipport/] \
  --deploy-mode cluster \
  --name test3 \
  --class hello \
  --conf spark.kubernetes.container.image.pullSecrets=abcd \
  --conf spark.kubernetes.container.image=spark:h2.0 \
  [https://devops.com/artifactory/local/testing/testing_2.11/h|https://bala.bharath.reddy.resapu%40ibm.com:akcp5bcbktykg2ti28sju4gtebsqwkg2mqkaf9w6g5rdbo3iwrwx7qb1m5dokgd54hdru2...@na.artifactory.swg-devops.com/artifactory/txo-cedp-garage-artifacts-sbt-local/testing/testing_2.11/arithmetic.jar]ello.jar



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26054) Creating a computed column applying the spark sql rounding on a column of type decimal affects the orginal column as well.

2018-11-14 Thread Marco Gaido (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido updated SPARK-26054:

Affects Version/s: (was: 2.4.0)
   2.2.0

> Creating a computed column applying the spark sql rounding on a column of 
> type decimal affects the orginal column as well.
> --
>
> Key: SPARK-26054
> URL: https://issues.apache.org/jira/browse/SPARK-26054
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jaya Krishna
>Priority: Minor
> Attachments: sparksql-rounding.png
>
>
> When a computed column that rounds the value is added to a data frame, it is 
> affecting the value of the original column as well. The behavior depends on 
> the database column type - If it is either float or double, the result is as 
> expected - the original column will have its own formatting and the computed 
> column will be rounded as per the rounding definition specified for it. 
> However if the column type in the database is decimal, then Spark SQL is 
> applying the rounding even to the original column. Attached image has the 
> spark sql code that shows the issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26054) Creating a computed column applying the spark sql rounding on a column of type decimal affects the orginal column as well.

2018-11-14 Thread Marco Gaido (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido updated SPARK-26054:

Component/s: (was: Spark Core)
 SQL

> Creating a computed column applying the spark sql rounding on a column of 
> type decimal affects the orginal column as well.
> --
>
> Key: SPARK-26054
> URL: https://issues.apache.org/jira/browse/SPARK-26054
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jaya Krishna
>Priority: Minor
> Attachments: sparksql-rounding.png
>
>
> When a computed column that rounds the value is added to a data frame, it is 
> affecting the value of the original column as well. The behavior depends on 
> the database column type - If it is either float or double, the result is as 
> expected - the original column will have its own formatting and the computed 
> column will be rounded as per the rounding definition specified for it. 
> However if the column type in the database is decimal, then Spark SQL is 
> applying the rounding even to the original column. Attached image has the 
> spark sql code that shows the issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26054) Creating a computed column applying the spark sql rounding on a column of type decimal affects the orginal column as well.

2018-11-14 Thread Marco Gaido (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido resolved SPARK-26054.
-
Resolution: Cannot Reproduce

> Creating a computed column applying the spark sql rounding on a column of 
> type decimal affects the orginal column as well.
> --
>
> Key: SPARK-26054
> URL: https://issues.apache.org/jira/browse/SPARK-26054
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jaya Krishna
>Priority: Minor
> Attachments: sparksql-rounding.png
>
>
> When a computed column that rounds the value is added to a data frame, it is 
> affecting the value of the original column as well. The behavior depends on 
> the database column type - If it is either float or double, the result is as 
> expected - the original column will have its own formatting and the computed 
> column will be rounded as per the rounding definition specified for it. 
> However if the column type in the database is decimal, then Spark SQL is 
> applying the rounding even to the original column. Attached image has the 
> spark sql code that shows the issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26054) Creating a computed column applying the spark sql rounding on a column of type decimal affects the orginal column as well.

2018-11-14 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16686395#comment-16686395
 ] 

Marco Gaido commented on SPARK-26054:
-

Then the affected version is 2.2.0, not 2.4.0. I am updating this. I'll also 
close this ticket, as it is fixed in the current version. Please fill in the 
JIRA more carefully next time. Thanks.

> Creating a computed column applying the spark sql rounding on a column of 
> type decimal affects the orginal column as well.
> --
>
> Key: SPARK-26054
> URL: https://issues.apache.org/jira/browse/SPARK-26054
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Jaya Krishna
>Priority: Minor
> Attachments: sparksql-rounding.png
>
>
> When a computed column that rounds the value is added to a data frame, it is 
> affecting the value of the original column as well. The behavior depends on 
> the database column type - If it is either float or double, the result is as 
> expected - the original column will have its own formatting and the computed 
> column will be rounded as per the rounding definition specified for it. 
> However if the column type in the database is decimal, then Spark SQL is 
> applying the rounding even to the original column. Attached image has the 
> spark sql code that shows the issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26054) Creating a computed column applying the spark sql rounding on a column of type decimal affects the orginal column as well.

2018-11-14 Thread Jaya Krishna (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16686386#comment-16686386
 ] 

Jaya Krishna commented on SPARK-26054:
--

Are you not seeing the issue even with the BigDecimal data type? I am using the 
embedded Spark in Zeppelin; in our product we use Spark 2.2.0. Maybe the issue 
is fixed in a later version of Spark. I will check with the latest Spark 
release. Thanks for the quick response.

> Creating a computed column applying the spark sql rounding on a column of 
> type decimal affects the orginal column as well.
> --
>
> Key: SPARK-26054
> URL: https://issues.apache.org/jira/browse/SPARK-26054
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Jaya Krishna
>Priority: Minor
> Attachments: sparksql-rounding.png
>
>
> When a computed column that rounds the value is added to a data frame, it is 
> affecting the value of the original column as well. The behavior depends on 
> the database column type - If it is either float or double, the result is as 
> expected - the original column will have its own formatting and the computed 
> column will be rounded as per the rounding definition specified for it. 
> However if the column type in the database is decimal, then Spark SQL is 
> applying the rounding even to the original column. Attached image has the 
> spark sql code that shows the issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26054) Creating a computed column applying the spark sql rounding on a column of type decimal affects the orginal column as well.

2018-11-14 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16686378#comment-16686378
 ] 

Marco Gaido commented on SPARK-26054:
-

Yes, sorry, I forgot to copy its definition. It is:

{code}
case class AA(id: String, amount: BigDecimal)
{code}

> Creating a computed column applying the spark sql rounding on a column of 
> type decimal affects the orginal column as well.
> --
>
> Key: SPARK-26054
> URL: https://issues.apache.org/jira/browse/SPARK-26054
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Jaya Krishna
>Priority: Minor
> Attachments: sparksql-rounding.png
>
>
> When a computed column that rounds the value is added to a data frame, it is 
> affecting the value of the original column as well. The behavior depends on 
> the database column type - If it is either float or double, the result is as 
> expected - the original column will have its own formatting and the computed 
> column will be rounded as per the rounding definition specified for it. 
> However if the column type in the database is decimal, then Spark SQL is 
> applying the rounding even to the original column. Attached image has the 
> spark sql code that shows the issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26063) CatalystDataToAvro gives "UnresolvedException: Invalid call to dataType on unresolved object" when requested for numberedTreeString

2018-11-14 Thread Jacek Laskowski (JIRA)
Jacek Laskowski created SPARK-26063:
---

 Summary: CatalystDataToAvro gives "UnresolvedException: Invalid 
call to dataType on unresolved object" when requested for numberedTreeString
 Key: SPARK-26063
 URL: https://issues.apache.org/jira/browse/SPARK-26063
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
Reporter: Jacek Laskowski


The following gives 
{{org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
dataType on unresolved object, tree: 'id}}:
{code:java}
// ./bin/spark-shell --packages org.apache.spark:spark-avro_2.12:2.4.0
scala> spark.version
res0: String = 2.4.0

import org.apache.spark.sql.avro._
val q = spark.range(1).withColumn("to_avro_id", to_avro('id))
val logicalPlan = q.queryExecution.logical

scala> logicalPlan.expressions.drop(1).head.numberedTreeString
org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
dataType on unresolved object, tree: 'id
at 
org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(unresolved.scala:105)
at 
org.apache.spark.sql.avro.CatalystDataToAvro.simpleString(CatalystDataToAvro.scala:56)
at 
org.apache.spark.sql.catalyst.expressions.Expression.verboseString(Expression.scala:233)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:548)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:569)
at org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:472)
at org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:469)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.numberedTreeString(TreeNode.scala:483)
... 51 elided{code}
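
A possible workaround, offered here only as a sketch (assuming the same spark-shell 
session as above): take the expression from the analyzed plan instead of the parsed 
logical plan, since there 'id is already resolved and dataType no longer throws.

{code}
// Sketch of a workaround: on the analyzed plan, 'id is a resolved
// AttributeReference, so CatalystDataToAvro.simpleString can call dataType.
val analyzed = q.queryExecution.analyzed
analyzed.expressions.drop(1).head.numberedTreeString
{code}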



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26054) Creating a computed column applying the spark sql rounding on a column of type decimal affects the orginal column as well.

2018-11-14 Thread Jaya Krishna (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16686373#comment-16686373
 ] 

Jaya Krishna commented on SPARK-26054:
--

Sorry for the confusion. I actually joined screenshots of several sections of 
the Zeppelin notebook, and I changed the value in between. I have attached the 
correct picture now. Have you tried defining the case class AA as "case class 
AA(id: String, amount: BigDecimal)"?

> Creating a computed column applying the spark sql rounding on a column of 
> type decimal affects the orginal column as well.
> --
>
> Key: SPARK-26054
> URL: https://issues.apache.org/jira/browse/SPARK-26054
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Jaya Krishna
>Priority: Minor
> Attachments: sparksql-rounding.png
>
>
> When a computed column that rounds the value is added to a data frame, it is 
> affecting the value of the original column as well. The behavior depends on 
> the database column type - If it is either float or double, the result is as 
> expected - the original column will have its own formatting and the computed 
> column will be rounded as per the rounding definition specified for it. 
> However if the column type in the database is decimal, then Spark SQL is 
> applying the rounding even to the original column. Attached image has the 
> spark sql code that shows the issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26054) Creating a computed column applying the spark sql rounding on a column of type decimal affects the orginal column as well.

2018-11-14 Thread Jaya Krishna (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jaya Krishna updated SPARK-26054:
-
Attachment: (was: sparksql-rounding.png)

> Creating a computed column applying the spark sql rounding on a column of 
> type decimal affects the orginal column as well.
> --
>
> Key: SPARK-26054
> URL: https://issues.apache.org/jira/browse/SPARK-26054
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Jaya Krishna
>Priority: Minor
> Attachments: sparksql-rounding.png
>
>
> When a computed column that rounds the value is added to a data frame, it is 
> affecting the value of the original column as well. The behavior depends on 
> the database column type - If it is either float or double, the result is as 
> expected - the original column will have its own formatting and the computed 
> column will be rounded as per the rounding definition specified for it. 
> However if the column type in the database is decimal, then Spark SQL is 
> applying the rounding even to the original column. Attached image has the 
> spark sql code that shows the issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26054) Creating a computed column applying the spark sql rounding on a column of type decimal affects the orginal column as well.

2018-11-14 Thread Jaya Krishna (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jaya Krishna updated SPARK-26054:
-
Attachment: sparksql-rounding.png

> Creating a computed column applying the spark sql rounding on a column of 
> type decimal affects the orginal column as well.
> --
>
> Key: SPARK-26054
> URL: https://issues.apache.org/jira/browse/SPARK-26054
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Jaya Krishna
>Priority: Minor
> Attachments: sparksql-rounding.png
>
>
> When a computed column that rounds the value is added to a data frame, it is 
> affecting the value of the original column as well. The behavior depends on 
> the database column type - If it is either float or double, the result is as 
> expected - the original column will have its own formatting and the computed 
> column will be rounded as per the rounding definition specified for it. 
> However if the column type in the database is decimal, then Spark SQL is 
> applying the rounding even to the original column. Attached image has the 
> spark sql code that shows the issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26035) Break large streaming/tests.py files into smaller files

2018-11-14 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-26035:
-
Target Version/s: 3.0.0
   Fix Version/s: (was: 3.0.0)

> Break large streaming/tests.py files into smaller files
> ---
>
> Key: SPARK-26035
> URL: https://issues.apache.org/jira/browse/SPARK-26035
> Project: Spark
>  Issue Type: Sub-task
>  Components: DStreams, PySpark
>Affects Versions: 2.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26062) Rename spark-avro external module to spark-sql-avro (to match spark-sql-kafka)

2018-11-14 Thread Jacek Laskowski (JIRA)
Jacek Laskowski created SPARK-26062:
---

 Summary: Rename spark-avro external module to spark-sql-avro (to 
match spark-sql-kafka)
 Key: SPARK-26062
 URL: https://issues.apache.org/jira/browse/SPARK-26062
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Jacek Laskowski


Given the name of {{spark-sql-kafka}} external module it seems appropriate (and 
consistent) to rename {{spark-avro}} external module to {{spark-sql-avro}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26036) Break large tests.py files into smaller files

2018-11-14 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26036:


Assignee: (was: Apache Spark)

> Break large tests.py files into smaller files
> -
>
> Key: SPARK-26036
> URL: https://issues.apache.org/jira/browse/SPARK-26036
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, Spark Core
>Affects Versions: 2.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26054) Creating a computed column applying the spark sql rounding on a column of type decimal affects the orginal column as well.

2018-11-14 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16686333#comment-16686333
 ] 

Marco Gaido commented on SPARK-26054:
-

{code}
val data = Seq(AA("0101", "2500.98".toDouble), AA("0102", "5690.9876".toDouble))
val rdd = sparkContext.parallelize(data);

val df = rdd.toDF
df.select($"id", $"amount", round($"amount", 2)).show()
{code}

returns

{code}
+----+---------+----------------+
|  id|   amount|round(amount, 2)|
+----+---------+----------------+
|0101|  2500.98|         2500.98|
|0102|5690.9876|         5690.99|
+----+---------+----------------+
{code}

Please check what you are doing...your example seems pretty strange: the values 
returned in the double example are very different from the string values...
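
One way to see what is going on here, offered only as a small diagnostic sketch rather 
than a definitive explanation: a Scala BigDecimal field maps to Spark's default 
decimal(38,18), so the original column prints with 18 digits of scale, while the 
round(...) column carries its own, smaller scale. Comparing the schemas makes that visible.

{code}
// Diagnostic sketch; assumes the case class AA(id: String, amount: BigDecimal)
// from this thread and a spark-shell session with spark.implicits._ in scope.
import org.apache.spark.sql.functions.round
val dec = Seq(AA("0101", BigDecimal("2500.98")), AA("0102", BigDecimal("5690.9876"))).toDF
dec.printSchema()  // amount: decimal(38,18) -- the default mapping for Scala BigDecimal
dec.select($"id", $"amount", round($"amount", 2)).printSchema()  // the rounded column has its own scale
{code}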

> Creating a computed column applying the spark sql rounding on a column of 
> type decimal affects the orginal column as well.
> --
>
> Key: SPARK-26054
> URL: https://issues.apache.org/jira/browse/SPARK-26054
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Jaya Krishna
>Priority: Minor
> Attachments: sparksql-rounding.png
>
>
> When a computed column that rounds the value is added to a data frame, it is 
> affecting the value of the original column as well. The behavior depends on 
> the database column type - If it is either float or double, the result is as 
> expected - the original column will have its own formatting and the computed 
> column will be rounded as per the rounding definition specified for it. 
> However if the column type in the database is decimal, then Spark SQL is 
> applying the rounding even to the original column. Attached image has the 
> spark sql code that shows the issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26036) Break large tests.py files into smaller files

2018-11-14 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16686328#comment-16686328
 ] 

Apache Spark commented on SPARK-26036:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/23033

> Break large tests.py files into smaller files
> -
>
> Key: SPARK-26036
> URL: https://issues.apache.org/jira/browse/SPARK-26036
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, Spark Core
>Affects Versions: 2.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26036) Break large tests.py files into smaller files

2018-11-14 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26036:


Assignee: Apache Spark

> Break large tests.py files into smaller files
> -
>
> Key: SPARK-26036
> URL: https://issues.apache.org/jira/browse/SPARK-26036
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, Spark Core
>Affects Versions: 2.4.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26036) Break large tests.py files into smaller files

2018-11-14 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16686327#comment-16686327
 ] 

Apache Spark commented on SPARK-26036:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/23033

> Break large tests.py files into smaller files
> -
>
> Key: SPARK-26036
> URL: https://issues.apache.org/jira/browse/SPARK-26036
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, Spark Core
>Affects Versions: 2.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26054) Creating a computed column applying the spark sql rounding on a column of type decimal affects the orginal column as well.

2018-11-14 Thread Jaya Krishna (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16686318#comment-16686318
 ] 

Jaya Krishna commented on SPARK-26054:
--

Hmm. There seems to be an issue if we start with an RDD, convert it to a 
DataFrame, and then do these operations. Can you try the following:

case class AA(id: String, amount: BigDecimal)
val data = Seq(AA("0101", "2500.98".toDouble), AA("0102", "5690.9876".toDouble))
var rdd = sc.parallelize(data);

//val df = Seq(AA("0101", "2500.98".toDouble), AA("0102", "5690.9876".toDouble)).toDF
val df = rdd.toDF
df.select($"id", $"amount", round($"amount", 2)).show()

> Creating a computed column applying the spark sql rounding on a column of 
> type decimal affects the orginal column as well.
> --
>
> Key: SPARK-26054
> URL: https://issues.apache.org/jira/browse/SPARK-26054
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Jaya Krishna
>Priority: Minor
> Attachments: sparksql-rounding.png
>
>
> When a computed column that rounds the value is added to a data frame, it is 
> affecting the value of the original column as well. The behavior depends on 
> the database column type - If it is either float or double, the result is as 
> expected - the original column will have its own formatting and the computed 
> column will be rounded as per the rounding definition specified for it. 
> However if the column type in the database is decimal, then Spark SQL is 
> applying the rounding even to the original column. Attached image has the 
> spark sql code that shows the issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26061) Reduce the number of unused UnsafeRowWriters created in whole-stage codegen

2018-11-14 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26061:


Assignee: Apache Spark

> Reduce the number of unused UnsafeRowWriters created in whole-stage codegen
> ---
>
> Key: SPARK-26061
> URL: https://issues.apache.org/jira/browse/SPARK-26061
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Kris Mok
>Assignee: Apache Spark
>Priority: Trivial
>
> Reduce the number of unused UnsafeRowWriters created in whole-stage generated 
> code.
> They come from the CodegenSupport.consume() calling prepareRowVar(), which 
> uses GenerateUnsafeProjection.createCode() and registers an UnsafeRowWriter 
> mutable state, regardless of whether or not the downstream (parent) operator 
> will use the rowVar or not.
> Even when the downstream doConsume function doesn't use the rowVar (i.e. 
> doesn't put row.code as a part of this operator's codegen template), the 
> registered UnsafeRowWriter stays there, which makes the init function of the 
> generated code a bit bloated.
> This ticket doesn't track the root issue, but makes it slightly less painful: 
> when the doConsume function is split out, the prepareRowVar() function is 
> called twice, so it's double the pain of unused UnsafeRowWriters. This fix 
> simply moves the original call to prepareRowVar() down into the doConsume 
> split/no-split branch so that we're back to just 1x the pain.
> To fix the root issue, something that allows the CodegenSupport operators to 
> indicate whether or not they're going to use the rowVar would be needed. 
> That's a much more elaborate change so I'd like to just make a minor fix 
> first.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26061) Reduce the number of unused UnsafeRowWriters created in whole-stage codegen

2018-11-14 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16686299#comment-16686299
 ] 

Apache Spark commented on SPARK-26061:
--

User 'rednaxelafx' has created a pull request for this issue:
https://github.com/apache/spark/pull/23032

> Reduce the number of unused UnsafeRowWriters created in whole-stage codegen
> ---
>
> Key: SPARK-26061
> URL: https://issues.apache.org/jira/browse/SPARK-26061
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Kris Mok
>Priority: Trivial
>
> Reduce the number of unused UnsafeRowWriters created in whole-stage generated 
> code.
> They come from the CodegenSupport.consume() calling prepareRowVar(), which 
> uses GenerateUnsafeProjection.createCode() and registers an UnsafeRowWriter 
> mutable state, regardless of whether or not the downstream (parent) operator 
> will use the rowVar or not.
> Even when the downstream doConsume function doesn't use the rowVar (i.e. 
> doesn't put row.code as a part of this operator's codegen template), the 
> registered UnsafeRowWriter stays there, which makes the init function of the 
> generated code a bit bloated.
> This ticket doesn't track the root issue, but makes it slightly less painful: 
> when the doConsume function is split out, the prepareRowVar() function is 
> called twice, so it's double the pain of unused UnsafeRowWriters. This fix 
> simply moves the original call to prepareRowVar() down into the doConsume 
> split/no-split branch so that we're back to just 1x the pain.
> To fix the root issue, something that allows the CodegenSupport operators to 
> indicate whether or not they're going to use the rowVar would be needed. 
> That's a much more elaborate change so I'd like to just make a minor fix 
> first.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26061) Reduce the number of unused UnsafeRowWriters created in whole-stage codegen

2018-11-14 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16686300#comment-16686300
 ] 

Apache Spark commented on SPARK-26061:
--

User 'rednaxelafx' has created a pull request for this issue:
https://github.com/apache/spark/pull/23032

> Reduce the number of unused UnsafeRowWriters created in whole-stage codegen
> ---
>
> Key: SPARK-26061
> URL: https://issues.apache.org/jira/browse/SPARK-26061
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Kris Mok
>Priority: Trivial
>
> Reduce the number of unused UnsafeRowWriters created in whole-stage generated 
> code.
> They come from the CodegenSupport.consume() calling prepareRowVar(), which 
> uses GenerateUnsafeProjection.createCode() and registers an UnsafeRowWriter 
> mutable state, regardless of whether or not the downstream (parent) operator 
> will use the rowVar or not.
> Even when the downstream doConsume function doesn't use the rowVar (i.e. 
> doesn't put row.code as a part of this operator's codegen template), the 
> registered UnsafeRowWriter stays there, which makes the init function of the 
> generated code a bit bloated.
> This ticket doesn't track the root issue, but makes it slightly less painful: 
> when the doConsume function is split out, the prepareRowVar() function is 
> called twice, so it's double the pain of unused UnsafeRowWriters. This fix 
> simply moves the original call to prepareRowVar() down into the doConsume 
> split/no-split branch so that we're back to just 1x the pain.
> To fix the root issue, something that allows the CodegenSupport operators to 
> indicate whether or not they're going to use the rowVar would be needed. 
> That's a much more elaborate change so I'd like to just make a minor fix 
> first.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26061) Reduce the number of unused UnsafeRowWriters created in whole-stage codegen

2018-11-14 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26061:


Assignee: (was: Apache Spark)

> Reduce the number of unused UnsafeRowWriters created in whole-stage codegen
> ---
>
> Key: SPARK-26061
> URL: https://issues.apache.org/jira/browse/SPARK-26061
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Kris Mok
>Priority: Trivial
>
> Reduce the number of unused UnsafeRowWriters created in whole-stage generated 
> code.
> They come from the CodegenSupport.consume() calling prepareRowVar(), which 
> uses GenerateUnsafeProjection.createCode() and registers an UnsafeRowWriter 
> mutable state, regardless of whether or not the downstream (parent) operator 
> will use the rowVar or not.
> Even when the downstream doConsume function doesn't use the rowVar (i.e. 
> doesn't put row.code as a part of this operator's codegen template), the 
> registered UnsafeRowWriter stays there, which makes the init function of the 
> generated code a bit bloated.
> This ticket doesn't track the root issue, but makes it slightly less painful: 
> when the doConsume function is split out, the prepareRowVar() function is 
> called twice, so it's double the pain of unused UnsafeRowWriters. This fix 
> simply moves the original call to prepareRowVar() down into the doConsume 
> split/no-split branch so that we're back to just 1x the pain.
> To fix the root issue, something that allows the CodegenSupport operators to 
> indicate whether or not they're going to use the rowVar would be needed. 
> That's a much more elaborate change so I'd like to just make a minor fix 
> first.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26061) Reduce the number of unused UnsafeRowWriters created in whole-stage codegen

2018-11-14 Thread Kris Mok (JIRA)
Kris Mok created SPARK-26061:


 Summary: Reduce the number of unused UnsafeRowWriters created in 
whole-stage codegen
 Key: SPARK-26061
 URL: https://issues.apache.org/jira/browse/SPARK-26061
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0, 2.3.2, 2.3.1, 2.3.0
Reporter: Kris Mok


Reduce the number of unused UnsafeRowWriters created in whole-stage generated 
code.
They come from the CodegenSupport.consume() calling prepareRowVar(), which uses 
GenerateUnsafeProjection.createCode() and registers an UnsafeRowWriter mutable 
state, regardless of whether the downstream (parent) operator will use the rowVar.
Even when the downstream doConsume function doesn't use the rowVar (i.e. 
doesn't put row.code as a part of this operator's codegen template), the 
registered UnsafeRowWriter stays there, which makes the init function of the 
generated code a bit bloated.

This ticket doesn't track the root issue, but makes it slightly less painful: 
when the doConsume function is split out, the prepareRowVar() function is 
called twice, so it's double the pain of unused UnsafeRowWriters. This fix 
simply moves the original call to prepareRowVar() down into the doConsume 
split/no-split branch so that we're back to just 1x the pain.

To fix the root issue, something that allows the CodegenSupport operators to 
indicate whether or not they're going to use the rowVar would be needed. That's 
a much more elaborate change so I'd like to just make a minor fix first.
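
To make the mechanism easier to follow, here is a toy, self-contained sketch of the 
pattern being proposed; it is not Spark's CodegenSupport API, just an illustration of 
registering the writer-like mutable state only in the branch that actually needs the 
row variable, so no unused registration is left behind.

{code}
// Toy illustration only -- not Spark's real codegen classes.
object LazyStateDemo {
  final case class Ctx(var mutableStates: List[String] = Nil) {
    def addMutableState(name: String): String = { mutableStates ::= name; name }
  }

  // Register the "UnsafeRowWriter"-like state only where the row variable is used.
  def consume(ctx: Ctx, splitFunction: Boolean, parentUsesRow: Boolean): String = {
    def prepareRowVar(): String = ctx.addMutableState("rowWriter")
    if (splitFunction) s"doConsumeFn(${prepareRowVar()})"   // split-out function needs the row
    else if (parentUsesRow) s"use(${prepareRowVar()})"      // inline consumer uses the row
    else "/* parent ignores rowVar: nothing registered */"
  }

  def main(args: Array[String]): Unit = {
    val ctx = Ctx()
    println(consume(ctx, splitFunction = false, parentUsesRow = false))
    println(s"registered states: ${ctx.mutableStates}")  // List() -- no unused writer
  }
}
{code}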



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26054) Creating a computed column applying the spark sql rounding on a column of type decimal affects the orginal column as well.

2018-11-14 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16686293#comment-16686293
 ] 

Marco Gaido commented on SPARK-26054:
-

I cannot reproduce this:

{code}
val df = Seq(AA("0101", "2500.98".toDouble), AA("0102", "5690.9876".toDouble)).toDF
df.select($"id", $"amount", round($"amount", 2)).show()
{code}

returned

{code}
+----+------------+----------------+
|  id|      amount|round(amount, 2)|
+----+------------+----------------+
|0101|2500.9800...|         2500.98|
|0102|5690.9876...|         5690.99|
+----+------------+----------------+
{code}

Moreover, in the image you posted the values are pretty weird... I mean, even 
the double ones are very different from what is represented in the strings...

> Creating a computed column applying the spark sql rounding on a column of 
> type decimal affects the orginal column as well.
> --
>
> Key: SPARK-26054
> URL: https://issues.apache.org/jira/browse/SPARK-26054
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Jaya Krishna
>Priority: Minor
> Attachments: sparksql-rounding.png
>
>
> When a computed column that rounds the value is added to a data frame, it is 
> affecting the value of the original column as well. The behavior depends on 
> the database column type - If it is either float or double, the result is as 
> expected - the original column will have its own formatting and the computed 
> column will be rounded as per the rounding definition specified for it. 
> However if the column type in the database is decimal, then Spark SQL is 
> applying the rounding even to the original column. Attached image has the 
> spark sql code that shows the issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26060) Track SparkConf entries and make SET command reject such entries.

2018-11-14 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26060:


Assignee: Apache Spark

> Track SparkConf entries and make SET command reject such entries.
> -
>
> Key: SPARK-26060
> URL: https://issues.apache.org/jira/browse/SPARK-26060
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.4.0
>Reporter: Takuya Ueshin
>Assignee: Apache Spark
>Priority: Major
>
> Currently the {{SET}} command works without any warnings even if the 
> specified key is for {{SparkConf}} entries and it has no effect because the 
> command does not update {{SparkConf}}, but the behavior might confuse users. 
> We should track {{SparkConf}} entries and make the command reject for such 
> entries.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26060) Track SparkConf entries and make SET command reject such entries.

2018-11-14 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16686233#comment-16686233
 ] 

Apache Spark commented on SPARK-26060:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/23031

> Track SparkConf entries and make SET command reject such entries.
> -
>
> Key: SPARK-26060
> URL: https://issues.apache.org/jira/browse/SPARK-26060
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.4.0
>Reporter: Takuya Ueshin
>Priority: Major
>
> Currently the {{SET}} command works without any warnings even if the 
> specified key is for {{SparkConf}} entries and it has no effect because the 
> command does not update {{SparkConf}}, but the behavior might confuse users. 
> We should track {{SparkConf}} entries and make the command reject for such 
> entries.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26060) Track SparkConf entries and make SET command reject such entries.

2018-11-14 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26060:


Assignee: (was: Apache Spark)

> Track SparkConf entries and make SET command reject such entries.
> -
>
> Key: SPARK-26060
> URL: https://issues.apache.org/jira/browse/SPARK-26060
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.4.0
>Reporter: Takuya Ueshin
>Priority: Major
>
> Currently the {{SET}} command works without any warnings even if the 
> specified key is for {{SparkConf}} entries and it has no effect because the 
> command does not update {{SparkConf}}, but the behavior might confuse users. 
> We should track {{SparkConf}} entries and make the command reject for such 
> entries.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26060) Track SparkConf entries and make SET command reject such entries.

2018-11-14 Thread Takuya Ueshin (JIRA)
Takuya Ueshin created SPARK-26060:
-

 Summary: Track SparkConf entries and make SET command reject such 
entries.
 Key: SPARK-26060
 URL: https://issues.apache.org/jira/browse/SPARK-26060
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, SQL
Affects Versions: 2.4.0
Reporter: Takuya Ueshin


Currently the {{SET}} command works without any warnings even if the specified 
key is for a {{SparkConf}} entry, and it has no effect because the command does 
not update {{SparkConf}}; this behavior might confuse users. We should track 
{{SparkConf}} entries and make the command reject such entries.
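
For concreteness, a minimal sketch of the confusing behavior (assuming a running 
SparkSession named {{spark}}): the command succeeds silently, but the application's 
SparkConf is not updated.

{code}
// Minimal sketch, assuming a running SparkSession named `spark`.
// The SET command only touches the session's runtime SQL configuration;
// the SparkConf the application was launched with is not updated.
spark.sql("SET spark.executor.memory=6g").show(truncate = false)
println(spark.sparkContext.getConf.getOption("spark.executor.memory"))
// Typically still the launch-time value (or None), not "6g".
{code}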



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


