[jira] [Assigned] (SPARK-23129) Lazy init DiskMapIterator#deserializeStream to reduce memory usage when ExternalAppendOnlyMap spills too many times

2018-01-24 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-23129:
---

Assignee: zhoukang

> Lazy init DiskMapIterator#deserializeStream to reduce memory usage when 
> ExternalAppendOnlyMap spills too many times
> ---
>
> Key: SPARK-23129
> URL: https://issues.apache.org/jira/browse/SPARK-23129
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: zhoukang
>Assignee: zhoukang
>Priority: Major
> Fix For: 2.3.0
>
>
> Currently, the deserializeStream in ExternalAppendOnlyMap#DiskMapIterator is initialized 
> when the DiskMapIterator instance is created. This causes memory overhead 
> when ExternalAppendOnlyMap spills too many times.
> We can avoid this by initializing deserializeStream lazily, the first time it is 
> used.
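A minimal sketch of the lazy-initialization idea (illustrative only; the class and
field names below are hypothetical, not the actual DiskMapIterator internals):

{code:scala}
import java.io.InputStream

// Hypothetical sketch: the deserialization stream is opened on first read instead of
// in the constructor, so spill files that are never read back hold no stream buffers.
class LazySpillReader(openStream: () => InputStream) {
  private var streamOpt: Option[InputStream] = None

  // Open the underlying stream only on first use.
  private def stream: InputStream = {
    if (streamOpt.isEmpty) streamOpt = Some(openStream())
    streamOpt.get
  }

  def readByte(): Int = stream.read()

  // Close only if the stream was ever opened.
  def close(): Unit = streamOpt.foreach(_.close())
}
{code}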






[jira] [Resolved] (SPARK-23129) Lazy init DiskMapIterator#deserializeStream to reduce memory usage when ExternalAppendOnlyMap spills too many times

2018-01-24 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-23129.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 20292
[https://github.com/apache/spark/pull/20292]

> Lazy init DiskMapIterator#deserializeStream to reduce memory usage when 
> ExternalAppendOnlyMap spills too many times
> ---
>
> Key: SPARK-23129
> URL: https://issues.apache.org/jira/browse/SPARK-23129
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: zhoukang
>Assignee: zhoukang
>Priority: Major
> Fix For: 2.3.0
>
>
> Currently, the deserializeStream in ExternalAppendOnlyMap#DiskMapIterator is initialized 
> when the DiskMapIterator instance is created. This causes memory overhead 
> when ExternalAppendOnlyMap spills too many times.
> We can avoid this by initializing deserializeStream lazily, the first time it is 
> used.






[jira] [Resolved] (SPARK-23211) SparkR MLlib randomForest parameter problem

2018-01-24 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-23211.
---
Resolution: Invalid

I can't make out what you're asking. Please put this to the mailing list first.

> SparkR MLlib randomForest parameter problem
> 
>
> Key: SPARK-23211
> URL: https://issues.apache.org/jira/browse/SPARK-23211
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
> Environment: {code:R}
> sdf_list <- randomSplit(train_data, rep(7, 3), 10086) 
> model <- spark.randomForest(
>   sdf_list[[1]],  
>   forward_count ~ .,   
>   type  = "regression",   
>   path  = paste0("./predict/model/randomForest_", x),   
>   overwrite = TRUE,  
>   newData   = sdf_list[[2]])
> {code}
> train_data is a SparkDataFrame.
> The documentation for the newData parameter says "a SparkDataFrame for testing."
> The documentation for the path parameter says "The directory where the model is saved."
> Neither of these works as described.
> Why?
>Reporter: 黄龙龙
>Priority: Major
>  Labels: documentation, usability
>
> spark.randomForest() and randomSplit() problem






[jira] [Created] (SPARK-23211) SparkR MLlib randomForest parameter problem

2018-01-24 Thread JIRA
黄龙龙 created SPARK-23211:
---

 Summary: SparkR MLlib randomForest parameter problem
 Key: SPARK-23211
 URL: https://issues.apache.org/jira/browse/SPARK-23211
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.1.0
 Environment: {code:R}
sdf_list <- randomSplit(train_data, rep(7, 3), 10086) 

model <- spark.randomForest(
  sdf_list[[1]],  
  forward_count ~ .,   
  type  = "regression",   
  path  = paste0("./predict/model/randomForest_", x),   
  overwrite = TRUE,  
  newData   = sdf_list[[2]])
{code}
train_data is a SparkDataFrame.
The documentation for the newData parameter says "a SparkDataFrame for testing."
The documentation for the path parameter says "The directory where the model is saved."
Neither of these works as described.

Why?

Reporter: 黄龙龙


spark.randomForest() and randomSplit() problem






[jira] [Commented] (SPARK-23187) Accumulator object can not be sent from Executor to Driver

2018-01-24 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338737#comment-16338737
 ] 

Saisai Shao commented on SPARK-23187:
-

Actually, according to my investigation the heartbeat report is OK: a registered 
accumulator reports its current updates from the executors back to the driver 
without waiting for the task to end.

The only caveat is that in the Spark UI, user accumulators are only displayed once 
a task has finished, while internal metric accumulators are displayed live. So I 
guess the UI simply doesn't show your registered accumulator in time, which makes 
it look as if the accumulator is not reported in the heartbeat.
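For illustration, a minimal standalone sketch (not code from this ticket) showing a
registered accumulator whose value is visible on the driver, independent of what the
UI chooses to display:

{code:scala}
import org.apache.spark.sql.SparkSession

// Minimal sketch: a named LongAccumulator registered on the driver is updated on the
// executors, and its merged value can be read back on the driver via acc.value.
object AccumulatorSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("acc-sketch").getOrCreate()
    val sc = spark.sparkContext

    val acc = sc.longAccumulator("testAcc")
    sc.parallelize(1 to 1000, numSlices = 4).foreach(_ => acc.add(1L))

    // The driver sees the updates merged back from the executors.
    println(s"accumulator value = ${acc.value}") // 1000
    spark.stop()
  }
}
{code}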

> Accumulator object can not be sent from Executor to Driver
> --
>
> Key: SPARK-23187
> URL: https://issues.apache.org/jira/browse/SPARK-23187
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Lantao Jin
>Priority: Major
>
> In Executor.scala -> reportHeartBeat(), task metrics values cannot be sent 
> to the driver (on the receiving side all values are zero).
> I wrote a unit test to illustrate:
> {code}
> diff --git 
> a/core/src/test/scala/org/apache/spark/rpc/netty/NettyRpcEnvSuite.scala 
> b/core/src/test/scala/org/apache/spark/rpc/netty/NettyRpcEnvSuite.scala
> index f9481f8..57fb096 100644
> --- a/core/src/test/scala/org/apache/spark/rpc/netty/NettyRpcEnvSuite.scala
> +++ b/core/src/test/scala/org/apache/spark/rpc/netty/NettyRpcEnvSuite.scala
> @@ -17,11 +17,16 @@
>  package org.apache.spark.rpc.netty
> +import scala.collection.mutable.ArrayBuffer
> +
>  import org.scalatest.mockito.MockitoSugar
>  import org.apache.spark._
>  import org.apache.spark.network.client.TransportClient
>  import org.apache.spark.rpc._
> +import org.apache.spark.util.AccumulatorContext
> +import org.apache.spark.util.AccumulatorV2
> +import org.apache.spark.util.LongAccumulator
>  class NettyRpcEnvSuite extends RpcEnvSuite with MockitoSugar {
> @@ -83,5 +88,21 @@ class NettyRpcEnvSuite extends RpcEnvSuite with 
> MockitoSugar {
>  assertRequestMessageEquals(
>msg3,
>RequestMessage(nettyEnv, client, msg3.serialize(nettyEnv)))
> +
> +val acc = new LongAccumulator
> +val sc = SparkContext.getOrCreate(new 
> SparkConf().setMaster("local").setAppName("testAcc"));
> +sc.register(acc, "testAcc")
> +acc.setValue(1)
> +//val msg4 = new RequestMessage(senderAddress, receiver, acc)
> +//assertRequestMessageEquals(
> +//  msg4,
> +//  RequestMessage(nettyEnv, client, msg4.serialize(nettyEnv)))
> +
> +val accbuf = new ArrayBuffer[AccumulatorV2[_, _]]()
> +accbuf += acc
> +val msg5 = new RequestMessage(senderAddress, receiver, accbuf)
> +assertRequestMessageEquals(
> +  msg5,
> +  RequestMessage(nettyEnv, client, msg5.serialize(nettyEnv)))
>}
>  }
> {code}
> msg4 and msg5 both fail.






[jira] [Commented] (SPARK-23210) Introduce the concept of default value to schema

2018-01-24 Thread LvDongrong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338675#comment-16338675
 ] 

LvDongrong commented on SPARK-23210:


Can we set the default value to be NULL, like Hive? @maropu @gatorsmile 
Thank you!

> Introduce the concept of default value to schema
> 
>
> Key: SPARK-23210
> URL: https://issues.apache.org/jira/browse/SPARK-23210
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: LvDongrong
>Priority: Major
>
> There is no concept of a DEFAULT VALUE for schema columns in Spark today.
> For our use case, our team wants to support inserting into a subset of a table's 
> columns, like "insert into (a, c) values ("value1", "value2")", but the default 
> value of a column is not defined. In Hive, the default value of a column is 
> NULL if we don't specify one.
> So I think it may be necessary to introduce the concept of a default value to 
> schema in Spark.






[jira] [Created] (SPARK-23210) Introduce the concept of default value to schema

2018-01-24 Thread LvDongrong (JIRA)
LvDongrong created SPARK-23210:
--

 Summary: Introduce the concept of default value to schema
 Key: SPARK-23210
 URL: https://issues.apache.org/jira/browse/SPARK-23210
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.2.1
Reporter: LvDongrong


There is no concept of a DEFAULT VALUE for schema columns in Spark today.
For our use case, our team wants to support inserting into a subset of a table's 
columns, like "insert into (a, c) values ("value1", "value2")", but the default 
value of a column is not defined. In Hive, the default value of a column is 
NULL if we don't specify one.
So I think it may be necessary to introduce the concept of a default value to 
schema in Spark.
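For illustration, a hedged sketch of the requested behavior (the table and statements
below are hypothetical examples, not from this ticket; the partial-column INSERT is
shown only as the desired syntax and is not supported without column defaults):

{code:scala}
// Hypothetical table used only to illustrate the request.
spark.sql("CREATE TABLE t (a STRING, b STRING, c STRING) USING parquet")

// Desired, Hive-style partial-column insert; the unlisted column b would default to NULL:
//   INSERT INTO t (a, c) VALUES ('value1', 'value2')

// Without column defaults, the NULL currently has to be written out explicitly:
spark.sql("INSERT INTO t VALUES ('value1', NULL, 'value2')")
{code}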






[jira] [Commented] (SPARK-23206) Additional Memory Tuning Metrics

2018-01-24 Thread Edwina Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338559#comment-16338559
 ] 

Edwina Lu commented on SPARK-23206:
---

Thanks, [~zsxwing]. Making the new metrics available in the metrics system 
would be very useful. Could this be done separately?

We would also like the metrics to be available in the Spark web UI, since that is 
easy for users to see and use, and in the REST API, which is used by [Dr. 
Elephant|https://github.com/linkedin/dr-elephant] (an open source tool for 
analyzing and tuning Hadoop and now Spark) and by other metrics gathering and 
analysis projects that we maintain.

> Additional Memory Tuning Metrics
> 
>
> Key: SPARK-23206
> URL: https://issues.apache.org/jira/browse/SPARK-23206
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Edwina Lu
>Priority: Major
> Attachments: MemoryTuningMetricsDesignDoc.pdf
>
>
> At LinkedIn, we have multiple clusters, running thousands of Spark 
> applications, and these numbers are growing rapidly. We need to ensure that 
> these Spark applications are well tuned – cluster resources, including 
> memory, should be used efficiently so that the cluster can support running 
> more applications concurrently, and applications should run quickly and 
> reliably.
> Currently there is limited visibility into how much memory executors are 
> using, and users are guessing numbers for executor and driver memory sizing. 
> These estimates are often much larger than needed, leading to memory wastage. 
> Examining the metrics for one cluster for a month, the average percentage of 
> used executor memory (max JVM used memory across executors /  
> spark.executor.memory) is 35%, leading to an average of 591GB unused memory 
> per application (number of executors * (spark.executor.memory - max JVM used 
> memory)). Spark has multiple memory regions (user memory, execution memory, 
> storage memory, and overhead memory), and to understand how memory is being 
> used and fine-tune allocation between regions, it would be useful to have 
> information about how much memory is being used for the different regions.
> To improve visibility into memory usage for the driver and executors and 
> different memory regions, the following additional memory metrics can be 
> tracked for each executor and driver:
>  * JVM used memory: the JVM heap size for the executor/driver.
>  * Execution memory: memory used for computation in shuffles, joins, sorts 
> and aggregations.
>  * Storage memory: memory used for caching and propagating internal data across 
> the cluster.
>  * Unified memory: sum of execution and storage memory.
> The peak values for each memory metric can be tracked for each executor, and 
> also per stage. This information can be shown in the Spark UI and the REST 
> APIs. Information for peak JVM used memory can help with determining 
> appropriate values for spark.executor.memory and spark.driver.memory, and 
> information about the unified memory region can help with determining 
> appropriate values for spark.memory.fraction and 
> spark.memory.storageFraction. Stage memory information can help identify 
> which stages are most memory intensive, and users can look into the relevant 
> code to determine if it can be optimized.
> The memory metrics can be gathered by adding the current JVM used memory, 
> execution memory and storage memory to the heartbeat. SparkListeners are 
> modified to collect the new metrics for the executors, stages and Spark 
> history log. Only interesting values (peak values per stage per executor) are 
> recorded in the Spark history log, to minimize the amount of additional 
> logging.
> We have attached our design documentation with this ticket and would like to 
> receive feedback from the community for this proposal.
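As a point of reference, a minimal sketch of how a "JVM used memory" sample could be
taken on an executor or driver (an assumption for illustration, not the design doc's
implementation; the proposal attaches such values to heartbeats and listener events):

{code:scala}
import java.lang.management.ManagementFactory

// Illustrative only: sample the currently used JVM heap, the kind of value that the
// proposal would track as a peak per executor and per stage.
object JvmMemorySample {
  def usedHeapBytes(): Long =
    ManagementFactory.getMemoryMXBean.getHeapMemoryUsage.getUsed

  def main(args: Array[String]): Unit =
    println(s"JVM used heap: ${usedHeapBytes() / (1024 * 1024)} MB")
}
{code}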






[jira] [Commented] (SPARK-23206) Additional Memory Tuning Metrics

2018-01-24 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338553#comment-16338553
 ] 

Saisai Shao commented on SPARK-23206:
-

I think this JIRA duplicates SPARK-9103. Also, it seems some of these metrics 
already exist in Spark, right?

> Additional Memory Tuning Metrics
> 
>
> Key: SPARK-23206
> URL: https://issues.apache.org/jira/browse/SPARK-23206
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Edwina Lu
>Priority: Major
> Attachments: MemoryTuningMetricsDesignDoc.pdf
>
>
> At LinkedIn, we have multiple clusters, running thousands of Spark 
> applications, and these numbers are growing rapidly. We need to ensure that 
> these Spark applications are well tuned – cluster resources, including 
> memory, should be used efficiently so that the cluster can support running 
> more applications concurrently, and applications should run quickly and 
> reliably.
> Currently there is limited visibility into how much memory executors are 
> using, and users are guessing numbers for executor and driver memory sizing. 
> These estimates are often much larger than needed, leading to memory wastage. 
> Examining the metrics for one cluster for a month, the average percentage of 
> used executor memory (max JVM used memory across executors /  
> spark.executor.memory) is 35%, leading to an average of 591GB unused memory 
> per application (number of executors * (spark.executor.memory - max JVM used 
> memory)). Spark has multiple memory regions (user memory, execution memory, 
> storage memory, and overhead memory), and to understand how memory is being 
> used and fine-tune allocation between regions, it would be useful to have 
> information about how much memory is being used for the different regions.
> To improve visibility into memory usage for the driver and executors and 
> different memory regions, the following additional memory metrics can be 
> tracked for each executor and driver:
>  * JVM used memory: the JVM heap size for the executor/driver.
>  * Execution memory: memory used for computation in shuffles, joins, sorts 
> and aggregations.
>  * Storage memory: memory used for caching and propagating internal data across 
> the cluster.
>  * Unified memory: sum of execution and storage memory.
> The peak values for each memory metric can be tracked for each executor, and 
> also per stage. This information can be shown in the Spark UI and the REST 
> APIs. Information for peak JVM used memory can help with determining 
> appropriate values for spark.executor.memory and spark.driver.memory, and 
> information about the unified memory region can help with determining 
> appropriate values for spark.memory.fraction and 
> spark.memory.storageFraction. Stage memory information can help identify 
> which stages are most memory intensive, and users can look into the relevant 
> code to determine if it can be optimized.
> The memory metrics can be gathered by adding the current JVM used memory, 
> execution memory and storage memory to the heartbeat. SparkListeners are 
> modified to collect the new metrics for the executors, stages and Spark 
> history log. Only interesting values (peak values per stage per executor) are 
> recorded in the Spark history log, to minimize the amount of additional 
> logging.
> We have attached our design documentation with this ticket and would like to 
> receive feedback from the community for this proposal.






[jira] [Created] (SPARK-23209) HiveDelegationTokenProvider throws an exception if Hive jars are not on the classpath

2018-01-24 Thread Sahil Takiar (JIRA)
Sahil Takiar created SPARK-23209:


 Summary: HiveDelegationTokenProvider throws an exception if Hive 
jars are not on the classpath
 Key: SPARK-23209
 URL: https://issues.apache.org/jira/browse/SPARK-23209
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.3.0
 Environment: OSX, Java(TM) SE Runtime Environment (build 
1.8.0_92-b14), Java HotSpot(TM) 64-Bit Server VM (build 25.92-b14, mixed mode)
Reporter: Sahil Takiar


While doing some Hive-on-Spark testing against the Spark 2.3.0 release 
candidates we came across a bug (see HIVE-18436).

Stack-trace:

{code}
Exception in thread "main" java.lang.NoClassDefFoundError: 
org/apache/hadoop/hive/conf/HiveConf
at 
org.apache.spark.deploy.security.HadoopDelegationTokenManager.getDelegationTokenProviders(HadoopDelegationTokenManager.scala:68)
at 
org.apache.spark.deploy.security.HadoopDelegationTokenManager.(HadoopDelegationTokenManager.scala:54)
at 
org.apache.spark.deploy.yarn.security.YARNHadoopDelegationTokenManager.(YARNHadoopDelegationTokenManager.scala:44)
at org.apache.spark.deploy.yarn.Client.(Client.scala:123)
at 
org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1502)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:879)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:197)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:227)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: 
org.apache.hadoop.hive.conf.HiveConf
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 10 more
{code}

Looks like the bug was introduced by SPARK-20434. SPARK-20434 changed 
{{HiveDelegationTokenProvider}} so that it constructs 
{{o.a.h.hive.conf.HiveConf}} inside {{HiveCredentialProvider#hiveConf}} rather 
than trying to manually load the class via the class loader. Looks like with 
the new code the JVM tries to load {{HiveConf}} as soon as 
{{HiveDelegationTokenProvider}} is referenced. Since there is no try-catch 
around the construction of {{HiveDelegationTokenProvider}} a 
{{ClassNotFoundException}} is thrown, which causes spark-submit to crash. 
Spark's {{docs/running-on-yarn.md}} says "a Hive token will be obtained if Hive 
is on the classpath". This behavior would seem to contradict that.
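A hedged sketch of the kind of guard that would avoid the crash (an assumption for
illustration; the helper name is hypothetical and this is not necessarily the fix
adopted in Spark): check whether HiveConf is loadable before instantiating the
provider, so a missing Hive jar degrades gracefully instead of killing spark-submit.

{code:scala}
// Hypothetical guard: probe for HiveConf without initializing it, and skip the Hive
// token provider when the class (or one of its dependencies) is absent.
def hiveClassesArePresent(loader: ClassLoader): Boolean =
  try {
    Class.forName("org.apache.hadoop.hive.conf.HiveConf", false, loader)
    true
  } catch {
    case _: ClassNotFoundException | _: NoClassDefFoundError => false
  }
{code}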






[jira] [Commented] (SPARK-23201) Cannot create view when duplicate columns exist in subquery

2018-01-24 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338551#comment-16338551
 ] 

Dongjoon Hyun commented on SPARK-23201:
---

Apache Spark 1.6.3 (released on November 7, 2016) also has this issue.
I don't think there will be Apache Spark 1.6.4 for this bug.

> Cannot create view when duplicate columns exist in subquery
> ---
>
> Key: SPARK-23201
> URL: https://issues.apache.org/jira/browse/SPARK-23201
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 1.6.3
>Reporter: Johannes Mayer
>Priority: Critical
>
> I have two tables, A(colA, col2, col3) and B(colB, col3, col5).
> If I join them in a subquery on A.colA = B.colB I can select the non-duplicate 
> columns, but I cannot create a view (col3 is duplicated, but not selected).
>  
> {code:java}
> create view testview as select
> tmp.colA, tmp.col2, tmp.colB, tmp.col5
> from (
> select * from A left join B
> on (A.colA = B.colB)
> )
> {code}
>  
>  
> This works:
>  
> {code:java}
> select
> tmp.colA, tmp.col2, tmp.colB, tmp.col5
> from (
> select * from A left join B
> on (A.colA = B.colB)
> )
> {code}
>  
>  






[jira] [Updated] (SPARK-23201) Cannot create view when duplicate columns exist in subquery

2018-01-24 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-23201:
--
Affects Version/s: 1.6.3

> Cannot create view when duplicate columns exist in subquery
> ---
>
> Key: SPARK-23201
> URL: https://issues.apache.org/jira/browse/SPARK-23201
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 1.6.3
>Reporter: Johannes Mayer
>Priority: Critical
>
> I have two tables, A(colA, col2, col3) and B(colB, col3, col5).
> If I join them in a subquery on A.colA = B.colB I can select the non-duplicate 
> columns, but I cannot create a view (col3 is duplicated, but not selected).
>  
> {code:java}
> create view testview as select
> tmp.colA, tmp.col2, tmp.colB, tmp.col5
> from (
> select * from A left join B
> on (A.colA = B.colB)
> )
> {code}
>  
>  
> This works:
>  
> {code:java}
> select
> tmp.colA, tmp.col2, tmp.colB, tmp.col5
> from (
> select * from A left join B
> on (A.colA = B.colB)
> )
> {code}
>  
>  






[jira] [Updated] (SPARK-21717) Decouple the generated codes of consuming rows in operators under whole-stage codegen

2018-01-24 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal updated SPARK-21717:
---
Target Version/s: 2.3.0
Priority: Critical  (was: Major)

> Decouple the generated codes of consuming rows in operators under whole-stage 
> codegen
> -
>
> Key: SPARK-21717
> URL: https://issues.apache.org/jira/browse/SPARK-21717
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Liang-Chi Hsieh
>Priority: Critical
>
> It has been observed in SPARK-21603 that whole-stage codegen suffers 
> performance degradation if generated functions are too long to be optimized 
> by the JIT.
> We basically produce a single function that incorporates the generated code from 
> all physical operators in a whole-stage. Thus, the size of the generated function 
> can grow past the threshold beyond which the JIT no longer optimizes it.
> This ticket tries to decouple the logic of consuming rows in physical 
> operators to avoid one giant row-processing function.






[jira] [Updated] (SPARK-22221) Add User Documentation for Working with Arrow in Spark

2018-01-24 Thread Takuya Ueshin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin updated SPARK-22221:
--
Target Version/s: 2.3.0

> Add User Documentation for Working with Arrow in Spark
> --
>
> Key: SPARK-22221
> URL: https://issues.apache.org/jira/browse/SPARK-22221
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Priority: Major
>
> There needs to be user-facing documentation that shows how to enable/use 
> Arrow with Spark, what the user should expect, and any differences 
> from similar existing functionality.
> A comment from Xiao Li on https://github.com/apache/spark/pull/18664:
> Suppose users/applications have Timestamp columns in their Datasets and their 
> processing algorithms rely on the corresponding time-zone related assumptions.
> * New users/applications first enable Arrow and later hit an Arrow bug. Can they 
> simply turn off spark.sql.execution.arrow.enable? If not, what should they do?
> * Existing users/applications want to utilize Arrow for better performance. Can 
> they just turn on spark.sql.execution.arrow.enable? What should they do?
> Note: hopefully the guides/solutions are user-friendly. That means they must 
> be very simple to understand for most users.






[jira] [Commented] (SPARK-23208) GenArrayData produces illegal code

2018-01-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338537#comment-16338537
 ] 

Apache Spark commented on SPARK-23208:
--

User 'hvanhovell' has created a pull request for this issue:
https://github.com/apache/spark/pull/20391

> GenArrayData produces illegal code
> --
>
> Key: SPARK-23208
> URL: https://issues.apache.org/jira/browse/SPARK-23208
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
>Priority: Blocker
>
> The GenArrayData.genCodeToCreateArrayData produces illegal Java code when 
> code splitting is enabled. This is caused by a typo on the following line: 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala#L114






[jira] [Assigned] (SPARK-23208) GenArrayData produces illegal code

2018-01-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23208:


Assignee: Apache Spark  (was: Herman van Hovell)

> GenArrayData produces illegal code
> --
>
> Key: SPARK-23208
> URL: https://issues.apache.org/jira/browse/SPARK-23208
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Herman van Hovell
>Assignee: Apache Spark
>Priority: Blocker
>
> The GenArrayData.genCodeToCreateArrayData produces illegal Java code when 
> code splitting is enabled. This is caused by a typo on the following line: 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala#L114






[jira] [Assigned] (SPARK-23208) GenArrayData produces illegal code

2018-01-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23208:


Assignee: Herman van Hovell  (was: Apache Spark)

> GenArrayData produces illegal code
> --
>
> Key: SPARK-23208
> URL: https://issues.apache.org/jira/browse/SPARK-23208
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
>Priority: Blocker
>
> The GenArrayData.genCodeToCreateArrayData produces illegal Java code when 
> code splitting is enabled. This is caused by a typo on the following line: 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala#L114






[jira] [Assigned] (SPARK-23207) Shuffle+Repartition on an RDD/DataFrame could lead to Data Loss

2018-01-24 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal reassigned SPARK-23207:
--

Assignee: Jiang Xingbo

> Shuffle+Repartition on an RDD/DataFrame could lead to Data Loss
> ---
>
> Key: SPARK-23207
> URL: https://issues.apache.org/jira/browse/SPARK-23207
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Jiang Xingbo
>Assignee: Jiang Xingbo
>Priority: Major
>
> Currently shuffle repartition uses RoundRobinPartitioning, so the generated 
> result is nondeterministic because the order of the input rows is not 
> determined.
> The bug can be triggered when there is a repartition call following a shuffle 
> (which leads to non-deterministic row ordering), as in the pattern 
> below:
> upstream stage -> repartition stage -> result stage
> (-> indicates a shuffle)
> When one of the executor processes goes down, some tasks of the repartition 
> stage will be retried and generate an inconsistent ordering, and some tasks of 
> the result stage will be retried, generating different data.
> The following code returns 931532 instead of the expected 1,000,000:
> {code}
> import scala.sys.process._
> import org.apache.spark.TaskContext
> val res = spark.range(0, 1000 * 1000, 1).repartition(200).map { x =>
>   x
> }.repartition(200).map { x =>
>   if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 2) {
> throw new Exception("pkill -f java".!!)
>   }
>   x
> }
> res.distinct().count()
> {code}






[jira] [Updated] (SPARK-23207) Shuffle+Repartition on an RDD/DataFrame could lead to Data Loss

2018-01-24 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal updated SPARK-23207:
---
Priority: Blocker  (was: Major)

> Shuffle+Repartition on an RDD/DataFrame could lead to Data Loss
> ---
>
> Key: SPARK-23207
> URL: https://issues.apache.org/jira/browse/SPARK-23207
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Jiang Xingbo
>Assignee: Jiang Xingbo
>Priority: Blocker
>
> Currently shuffle repartition uses RoundRobinPartitioning, so the generated 
> result is nondeterministic because the order of the input rows is not 
> determined.
> The bug can be triggered when there is a repartition call following a shuffle 
> (which leads to non-deterministic row ordering), as in the pattern 
> below:
> upstream stage -> repartition stage -> result stage
> (-> indicates a shuffle)
> When one of the executor processes goes down, some tasks of the repartition 
> stage will be retried and generate an inconsistent ordering, and some tasks of 
> the result stage will be retried, generating different data.
> The following code returns 931532 instead of the expected 1,000,000:
> {code}
> import scala.sys.process._
> import org.apache.spark.TaskContext
> val res = spark.range(0, 1000 * 1000, 1).repartition(200).map { x =>
>   x
> }.repartition(200).map { x =>
>   if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 2) {
> throw new Exception("pkill -f java".!!)
>   }
>   x
> }
> res.distinct().count()
> {code}






[jira] [Comment Edited] (SPARK-23201) Cannot create view when duplicate columns exist in subquery

2018-01-24 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338525#comment-16338525
 ] 

Dongjoon Hyun edited comment on SPARK-23201 at 1/25/18 1:11 AM:


Hi, [~joha0123].

It seems to work in the latest Apache Spark releases (2.2.1 and 2.1.2). Do you really 
want to report this against *1.6.0*?

{code}
scala> sql("create view v1 as select tmp.colA, tmp.col2, tmp.colB, tmp.col5 
from (select * from A left join B on (A.colA = B.colB)) tmp").show

scala> sql("select * from v1").show
+----+----+----+----+
|colA|col2|colB|col5|
+----+----+----+----+
+----+----+----+----+

scala> spark.version
res7: String = 2.2.1
{code}


was (Author: dongjoon):
Hi, [~joha0123].

It seems to work in the latest Apache Spark (2.2.1). Do you really want to 
report this against *1.6.0*?

{code}
scala> sql("create view v1 as select tmp.colA, tmp.col2, tmp.colB, tmp.col5 
from (select * from A left join B on (A.colA = B.colB)) tmp").show

scala> sql("select * from v1").show
+----+----+----+----+
|colA|col2|colB|col5|
+----+----+----+----+
+----+----+----+----+

scala> spark.version
res7: String = 2.2.1
{code}

> Cannot create view when duplicate columns exist in subquery
> ---
>
> Key: SPARK-23201
> URL: https://issues.apache.org/jira/browse/SPARK-23201
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Johannes Mayer
>Priority: Critical
>
> I have two tables, A(colA, col2, col3) and B(colB, col3, col5).
> If I join them in a subquery on A.colA = B.colB I can select the non-duplicate 
> columns, but I cannot create a view (col3 is duplicated, but not selected).
>  
> {code:java}
> create view testview as select
> tmp.colA, tmp.col2, tmp.colB, tmp.col5
> from (
> select * from A left join B
> on (A.colA = B.colB)
> )
> {code}
>  
>  
> This works:
>  
> {code:java}
> select
> tmp.colA, tmp.col2, tmp.colB, tmp.col5
> from (
> select * from A left join B
> on (A.colA = B.colB)
> )
> {code}
>  
>  






[jira] [Updated] (SPARK-23208) GenArrayData produces illegal code

2018-01-24 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell updated SPARK-23208:
--
Target Version/s: 2.3.0

> GenArrayData produces illegal code
> --
>
> Key: SPARK-23208
> URL: https://issues.apache.org/jira/browse/SPARK-23208
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
>Priority: Blocker
>
> The GenArrayData.genCodeToCreateArrayData produces illegal Java code when 
> code splitting is enabled. This is caused by a typo on the following line: 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala#L114






[jira] [Commented] (SPARK-23201) Cannot create view when duplicate columns exist in subquery

2018-01-24 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338525#comment-16338525
 ] 

Dongjoon Hyun commented on SPARK-23201:
---

Hi, [~joha0123].

It seems to work in the latest Apache Spark (2.2.1). Do you really want to 
report this against *1.6.0*?

{code}
scala> sql("create view v1 as select tmp.colA, tmp.col2, tmp.colB, tmp.col5 
from (select * from A left join B on (A.colA = B.colB)) tmp").show

scala> sql("select * from v1").show
+----+----+----+----+
|colA|col2|colB|col5|
+----+----+----+----+
+----+----+----+----+

scala> spark.version
res7: String = 2.2.1
{code}
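For reference, a minimal setup sketch (the table shapes are assumed from the report)
that makes the reproduction above runnable in a fresh session:

{code:scala}
// Hypothetical empty tables matching the column layout described in the report.
spark.sql("CREATE TABLE A (colA STRING, col2 STRING, col3 STRING) USING parquet")
spark.sql("CREATE TABLE B (colB STRING, col3 STRING, col5 STRING) USING parquet")
{code}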

> Cannot create view when duplicate columns exist in subquery
> ---
>
> Key: SPARK-23201
> URL: https://issues.apache.org/jira/browse/SPARK-23201
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Johannes Mayer
>Priority: Critical
>
> I have two tables, A(colA, col2, col3) and B(colB, col3, col5).
> If I join them in a subquery on A.colA = B.colB I can select the non-duplicate 
> columns, but I cannot create a view (col3 is duplicated, but not selected).
>  
> {code:java}
> create view testview as select
> tmp.colA, tmp.col2, tmp.colB, tmp.col5
> from (
> select * from A left join B
> on (A.colA = B.colB)
> )
> {code}
>  
>  
> This works:
>  
> {code:java}
> select
> tmp.colA, tmp.col2, tmp.colB, tmp.col5
> from (
> select * from A left join B
> on (A.colA = B.colB)
> )
> {code}
>  
>  






[jira] [Assigned] (SPARK-23081) Add colRegex API to PySpark

2018-01-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23081:


Assignee: (was: Apache Spark)

> Add colRegex API to PySpark
> ---
>
> Key: SPARK-23081
> URL: https://issues.apache.org/jira/browse/SPARK-23081
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>







[jira] [Assigned] (SPARK-23081) Add colRegex API to PySpark

2018-01-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23081:


Assignee: Apache Spark

> Add colRegex API to PySpark
> ---
>
> Key: SPARK-23081
> URL: https://issues.apache.org/jira/browse/SPARK-23081
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Commented] (SPARK-23081) Add colRegex API to PySpark

2018-01-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338486#comment-16338486
 ] 

Apache Spark commented on SPARK-23081:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/20390

> Add colRegex API to PySpark
> ---
>
> Key: SPARK-23081
> URL: https://issues.apache.org/jira/browse/SPARK-23081
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>







[jira] [Created] (SPARK-23208) GenArrayData produces illegal code

2018-01-24 Thread Herman van Hovell (JIRA)
Herman van Hovell created SPARK-23208:
-

 Summary: GenArrayData produces illegal code
 Key: SPARK-23208
 URL: https://issues.apache.org/jira/browse/SPARK-23208
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: Herman van Hovell
Assignee: Herman van Hovell


The GenArrayData.genCodeToCreateArrayData produces illegal Java code when code 
splitting is enabled. This is caused by a typo on the following line: 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala#L114







[jira] [Commented] (SPARK-23206) Additional Memory Tuning Metrics

2018-01-24 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338427#comment-16338427
 ] 

Shixiong Zhu commented on SPARK-23206:
--

We can also just add more information to the metrics system and let an external 
system store the metrics data and display it.

> Additional Memory Tuning Metrics
> 
>
> Key: SPARK-23206
> URL: https://issues.apache.org/jira/browse/SPARK-23206
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Edwina Lu
>Priority: Major
> Attachments: MemoryTuningMetricsDesignDoc.pdf
>
>
> At LinkedIn, we have multiple clusters, running thousands of Spark 
> applications, and these numbers are growing rapidly. We need to ensure that 
> these Spark applications are well tuned – cluster resources, including 
> memory, should be used efficiently so that the cluster can support running 
> more applications concurrently, and applications should run quickly and 
> reliably.
> Currently there is limited visibility into how much memory executors are 
> using, and users are guessing numbers for executor and driver memory sizing. 
> These estimates are often much larger than needed, leading to memory wastage. 
> Examining the metrics for one cluster for a month, the average percentage of 
> used executor memory (max JVM used memory across executors /  
> spark.executor.memory) is 35%, leading to an average of 591GB unused memory 
> per application (number of executors * (spark.executor.memory - max JVM used 
> memory)). Spark has multiple memory regions (user memory, execution memory, 
> storage memory, and overhead memory), and to understand how memory is being 
> used and fine-tune allocation between regions, it would be useful to have 
> information about how much memory is being used for the different regions.
> To improve visibility into memory usage for the driver and executors and 
> different memory regions, the following additional memory metrics can be 
> tracked for each executor and driver:
>  * JVM used memory: the JVM heap size for the executor/driver.
>  * Execution memory: memory used for computation in shuffles, joins, sorts 
> and aggregations.
>  * Storage memory: memory used for caching and propagating internal data across 
> the cluster.
>  * Unified memory: sum of execution and storage memory.
> The peak values for each memory metric can be tracked for each executor, and 
> also per stage. This information can be shown in the Spark UI and the REST 
> APIs. Information for peak JVM used memory can help with determining 
> appropriate values for spark.executor.memory and spark.driver.memory, and 
> information about the unified memory region can help with determining 
> appropriate values for spark.memory.fraction and 
> spark.memory.storageFraction. Stage memory information can help identify 
> which stages are most memory intensive, and users can look into the relevant 
> code to determine if it can be optimized.
> The memory metrics can be gathered by adding the current JVM used memory, 
> execution memory and storage memory to the heartbeat. SparkListeners are 
> modified to collect the new metrics for the executors, stages and Spark 
> history log. Only interesting values (peak values per stage per executor) are 
> recorded in the Spark history log, to minimize the amount of additional 
> logging.
> We have attached our design documentation with this ticket and would like to 
> receive feedback from the community for this proposal.






[jira] [Commented] (SPARK-23206) Additional Memory Tuning Metrics

2018-01-24 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338421#comment-16338421
 ] 

Shixiong Zhu commented on SPARK-23206:
--

Also cc [~jerryshao] since you were working on the metrics system.

> Additional Memory Tuning Metrics
> 
>
> Key: SPARK-23206
> URL: https://issues.apache.org/jira/browse/SPARK-23206
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Edwina Lu
>Priority: Major
> Attachments: MemoryTuningMetricsDesignDoc.pdf
>
>
> At LinkedIn, we have multiple clusters, running thousands of Spark 
> applications, and these numbers are growing rapidly. We need to ensure that 
> these Spark applications are well tuned – cluster resources, including 
> memory, should be used efficiently so that the cluster can support running 
> more applications concurrently, and applications should run quickly and 
> reliably.
> Currently there is limited visibility into how much memory executors are 
> using, and users are guessing numbers for executor and driver memory sizing. 
> These estimates are often much larger than needed, leading to memory wastage. 
> Examining the metrics for one cluster for a month, the average percentage of 
> used executor memory (max JVM used memory across executors /  
> spark.executor.memory) is 35%, leading to an average of 591GB unused memory 
> per application (number of executors * (spark.executor.memory - max JVM used 
> memory)). Spark has multiple memory regions (user memory, execution memory, 
> storage memory, and overhead memory), and to understand how memory is being 
> used and fine-tune allocation between regions, it would be useful to have 
> information about how much memory is being used for the different regions.
> To improve visibility into memory usage for the driver and executors and 
> different memory regions, the following additional memory metrics can be 
> tracked for each executor and driver:
>  * JVM used memory: the JVM heap size for the executor/driver.
>  * Execution memory: memory used for computation in shuffles, joins, sorts 
> and aggregations.
>  * Storage memory: memory used for caching and propagating internal data across 
> the cluster.
>  * Unified memory: sum of execution and storage memory.
> The peak values for each memory metric can be tracked for each executor, and 
> also per stage. This information can be shown in the Spark UI and the REST 
> APIs. Information for peak JVM used memory can help with determining 
> appropriate values for spark.executor.memory and spark.driver.memory, and 
> information about the unified memory region can help with determining 
> appropriate values for spark.memory.fraction and 
> spark.memory.storageFraction. Stage memory information can help identify 
> which stages are most memory intensive, and users can look into the relevant 
> code to determine if it can be optimized.
> The memory metrics can be gathered by adding the current JVM used memory, 
> execution memory and storage memory to the heartbeat. SparkListeners are 
> modified to collect the new metrics for the executors, stages and Spark 
> history log. Only interesting values (peak values per stage per executor) are 
> recorded in the Spark history log, to minimize the amount of additional 
> logging.
> We have attached our design documentation with this ticket and would like to 
> receive feedback from the community for this proposal.






[jira] [Commented] (SPARK-23207) Shuffle+Repartition on an RDD/DataFrame could lead to Data Loss

2018-01-24 Thread Jiang Xingbo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338419#comment-16338419
 ] 

Jiang Xingbo commented on SPARK-23207:
--

I'm working on this.

> Shuffle+Repartition on an RDD/DataFrame could lead to Data Loss
> ---
>
> Key: SPARK-23207
> URL: https://issues.apache.org/jira/browse/SPARK-23207
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Jiang Xingbo
>Priority: Major
>
> Currently shuffle repartition uses RoundRobinPartitioning, so the generated 
> result is nondeterministic because the order of the input rows is not 
> determined.
> The bug can be triggered when there is a repartition call following a shuffle 
> (which leads to non-deterministic row ordering), as in the pattern 
> below:
> upstream stage -> repartition stage -> result stage
> (-> indicates a shuffle)
> When one of the executor processes goes down, some tasks of the repartition 
> stage will be retried and generate an inconsistent ordering, and some tasks of 
> the result stage will be retried, generating different data.
> The following code returns 931532 instead of the expected 1,000,000:
> {code}
> import scala.sys.process._
> import org.apache.spark.TaskContext
> val res = spark.range(0, 1000 * 1000, 1).repartition(200).map { x =>
>   x
> }.repartition(200).map { x =>
>   if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 2) {
> throw new Exception("pkill -f java".!!)
>   }
>   x
> }
> res.distinct().count()
> {code}






[jira] [Created] (SPARK-23207) Shuffle+Repartition on an RDD/DataFrame could lead to Data Loss

2018-01-24 Thread Jiang Xingbo (JIRA)
Jiang Xingbo created SPARK-23207:


 Summary: Shuffle+Repartition on an RDD/DataFrame could lead to 
Data Loss
 Key: SPARK-23207
 URL: https://issues.apache.org/jira/browse/SPARK-23207
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: Jiang Xingbo


Currently shuffle repartition uses RoundRobinPartitioning, so the generated result 
is nondeterministic because the order of the input rows is not determined.

The bug can be triggered when there is a repartition call following a shuffle 
(which leads to non-deterministic row ordering), as in the pattern below:
upstream stage -> repartition stage -> result stage
(-> indicates a shuffle)
When one of the executor processes goes down, some tasks of the repartition 
stage will be retried and generate an inconsistent ordering, and some tasks of the 
result stage will be retried, generating different data.

The following code returns 931532 instead of the expected 1,000,000:
{code}
import scala.sys.process._

import org.apache.spark.TaskContext
val res = spark.range(0, 1000 * 1000, 1).repartition(200).map { x =>
  x
}.repartition(200).map { x =>
  if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 2) {
throw new Exception("pkill -f java".!!)
  }
  x
}
res.distinct().count()
{code}
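A hedged sketch of a possible user-level mitigation (an assumption for illustration,
not necessarily the fix this ticket will adopt): give the round-robin repartition a
deterministic input order, so a retried task reproduces the same row-to-partition
assignment.

{code:scala}
// Illustrative workaround: sort within partitions between the two shuffles so the
// second (round-robin) repartition sees a stable input order even after task retries.
val stable = spark.range(0, 1000 * 1000, 1)
  .repartition(200)            // first shuffle: output order is nondeterministic
  .sortWithinPartitions("id")  // restore a deterministic per-partition order
  .repartition(200)
println(stable.distinct().count()) // expected: 1000000
{code}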






[jira] [Commented] (SPARK-20641) Key-value store abstraction and implementation for storing application data

2018-01-24 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338402#comment-16338402
 ] 

Marcelo Vanzin commented on SPARK-20641:


[~rxin] sorry I missed your comment. As I explained in the spec, I chose LevelDB 
because Spark already depends on it. We can definitely add a RocksDB 
implementation if it's better, and I tried to design the abstraction so that 
it's not too hard to add such things later on. HDFS support sounds like an 
interesting thing to have.
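A hedged sketch of the kind of abstraction being discussed (the trait and method names
are illustrative assumptions, not Spark's actual KVStore API): a small key-value
interface that a LevelDB-backed implementation, or later RocksDB- or HDFS-backed ones,
could provide.

{code:scala}
// Illustrative key-value store abstraction for UI/REST application data; a concrete
// implementation would wrap LevelDB (or another engine) behind this interface.
trait AppDataStore extends AutoCloseable {
  def read[T](klass: Class[T], key: AnyRef): T   // look up one object by its key
  def write(value: AnyRef): Unit                 // insert or update an object
  def delete(klass: Class[_], key: AnyRef): Unit // remove an object by key
}
{code}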

> Key-value store abstraction and implementation for storing application data
> ---
>
> Key: SPARK-20641
> URL: https://issues.apache.org/jira/browse/SPARK-20641
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Major
> Fix For: 2.3.0
>
>
> See spec in parent issue (SPARK-18085) for more details.
> This task tracks adding a key-value store abstraction and initial LevelDB 
> implementation to be used to store application data for building the UI and 
> REST API.






[jira] [Assigned] (SPARK-20650) Remove JobProgressListener (and other unneeded classes)

2018-01-24 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-20650:
--

Assignee: Marcelo Vanzin

> Remove JobProgressListener (and other unneeded classes)
> ---
>
> Key: SPARK-20650
> URL: https://issues.apache.org/jira/browse/SPARK-20650
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Major
> Fix For: 2.3.0
>
>
> See spec in parent issue (SPARK-18085) for more details.
> This task tracks removing JobProgressListener and other classes that will be 
> made obsolete by the other changes in this project, and making adjustments to 
> parts of the code that still rely on them.






[jira] [Assigned] (SPARK-20641) Key-value store abstraction and implementation for storing application data

2018-01-24 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-20641:
--

Assignee: Marcelo Vanzin

> Key-value store abstraction and implementation for storing application data
> ---
>
> Key: SPARK-20641
> URL: https://issues.apache.org/jira/browse/SPARK-20641
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Major
> Fix For: 2.3.0
>
>
> See spec in parent issue (SPARK-18085) for more details.
> This task tracks adding a key-value store abstraction and initial LevelDB 
> implementation to be used to store application data for building the UI and 
> REST API.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23206) Additional Memory Tuning Metrics

2018-01-24 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338396#comment-16338396
 ] 

Shixiong Zhu commented on SPARK-23206:
--

cc [~vanzin] 

> Additional Memory Tuning Metrics
> 
>
> Key: SPARK-23206
> URL: https://issues.apache.org/jira/browse/SPARK-23206
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Edwina Lu
>Priority: Major
> Attachments: MemoryTuningMetricsDesignDoc.pdf
>
>
> At LinkedIn, we have multiple clusters, running thousands of Spark 
> applications, and these numbers are growing rapidly. We need to ensure that 
> these Spark applications are well tuned – cluster resources, including 
> memory, should be used efficiently so that the cluster can support running 
> more applications concurrently, and applications should run quickly and 
> reliably.
> Currently there is limited visibility into how much memory executors are 
> using, and users are guessing numbers for executor and driver memory sizing. 
> These estimates are often much larger than needed, leading to memory wastage. 
> Examining the metrics for one cluster for a month, the average percentage of 
> used executor memory (max JVM used memory across executors /  
> spark.executor.memory) is 35%, leading to an average of 591GB unused memory 
> per application (number of executors * (spark.executor.memory - max JVM used 
> memory)). Spark has multiple memory regions (user memory, execution memory, 
> storage memory, and overhead memory), and to understand how memory is being 
> used and fine-tune allocation between regions, it would be useful to have 
> information about how much memory is being used for the different regions.
> To improve visibility into memory usage for the driver and executors and 
> different memory regions, the following additional memory metrics can be 
> tracked for each executor and driver:
>  * JVM used memory: the JVM heap size for the executor/driver.
>  * Execution memory: memory used for computation in shuffles, joins, sorts 
> and aggregations.
>  * Storage memory: memory used for caching and propagating internal data across 
> the cluster.
>  * Unified memory: sum of execution and storage memory.
> The peak values for each memory metric can be tracked for each executor, and 
> also per stage. This information can be shown in the Spark UI and the REST 
> APIs. Information for peak JVM used memory can help with determining 
> appropriate values for spark.executor.memory and spark.driver.memory, and 
> information about the unified memory region can help with determining 
> appropriate values for spark.memory.fraction and 
> spark.memory.storageFraction. Stage memory information can help identify 
> which stages are most memory intensive, and users can look into the relevant 
> code to determine if it can be optimized.
> The memory metrics can be gathered by adding the current JVM used memory, 
> execution memory and storage memory to the heartbeat. SparkListeners are 
> modified to collect the new metrics for the executors, stages and Spark 
> history log. Only interesting values (peak values per stage per executor) are 
> recorded in the Spark history log, to minimize the amount of additional 
> logging.
> We have attached our design documentation with this ticket and would like to 
> receive feedback from the community for this proposal.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23206) Additional Memory Tuning Metrics

2018-01-24 Thread Ye Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338378#comment-16338378
 ] 

Ye Zhou commented on SPARK-23206:
-

[~zsxwing] Hi, can you help find someone who can review this design doc?  
Thanks.

> Additional Memory Tuning Metrics
> 
>
> Key: SPARK-23206
> URL: https://issues.apache.org/jira/browse/SPARK-23206
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Edwina Lu
>Priority: Major
> Attachments: MemoryTuningMetricsDesignDoc.pdf
>
>
> At LinkedIn, we have multiple clusters, running thousands of Spark 
> applications, and these numbers are growing rapidly. We need to ensure that 
> these Spark applications are well tuned – cluster resources, including 
> memory, should be used efficiently so that the cluster can support running 
> more applications concurrently, and applications should run quickly and 
> reliably.
> Currently there is limited visibility into how much memory executors are 
> using, and users are guessing numbers for executor and driver memory sizing. 
> These estimates are often much larger than needed, leading to memory wastage. 
> Examining the metrics for one cluster for a month, the average percentage of 
> used executor memory (max JVM used memory across executors /  
> spark.executor.memory) is 35%, leading to an average of 591GB unused memory 
> per application (number of executors * (spark.executor.memory - max JVM used 
> memory)). Spark has multiple memory regions (user memory, execution memory, 
> storage memory, and overhead memory), and to understand how memory is being 
> used and fine-tune allocation between regions, it would be useful to have 
> information about how much memory is being used for the different regions.
> To improve visibility into memory usage for the driver and executors and 
> different memory regions, the following additional memory metrics can be 
> tracked for each executor and driver:
>  * JVM used memory: the JVM heap size for the executor/driver.
>  * Execution memory: memory used for computation in shuffles, joins, sorts 
> and aggregations.
>  * Storage memory: memory used for caching and propagating internal data across 
> the cluster.
>  * Unified memory: sum of execution and storage memory.
> The peak values for each memory metric can be tracked for each executor, and 
> also per stage. This information can be shown in the Spark UI and the REST 
> APIs. Information for peak JVM used memory can help with determining 
> appropriate values for spark.executor.memory and spark.driver.memory, and 
> information about the unified memory region can help with determining 
> appropriate values for spark.memory.fraction and 
> spark.memory.storageFraction. Stage memory information can help identify 
> which stages are most memory intensive, and users can look into the relevant 
> code to determine if it can be optimized.
> The memory metrics can be gathered by adding the current JVM used memory, 
> execution memory and storage memory to the heartbeat. SparkListeners are 
> modified to collect the new metrics for the executors, stages and Spark 
> history log. Only interesting values (peak values per stage per executor) are 
> recorded in the Spark history log, to minimize the amount of additional 
> logging.
> We have attached our design documentation with this ticket and would like to 
> receive feedback from the community for this proposal.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23206) Additional Memory Tuning Metrics

2018-01-24 Thread Edwina Lu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Edwina Lu updated SPARK-23206:
--
Description: 
At LinkedIn, we have multiple clusters, running thousands of Spark 
applications, and these numbers are growing rapidly. We need to ensure that 
these Spark applications are well tuned – cluster resources, including memory, 
should be used efficiently so that the cluster can support running more 
applications concurrently, and applications should run quickly and reliably.

Currently there is limited visibility into how much memory executors are using, 
and users are guessing numbers for executor and driver memory sizing. These 
estimates are often much larger than needed, leading to memory wastage. 
Examining the metrics for one cluster for a month, the average percentage of 
used executor memory (max JVM used memory across executors /  
spark.executor.memory) is 35%, leading to an average of 591GB unused memory per 
application (number of executors * (spark.executor.memory - max JVM used 
memory)). Spark has multiple memory regions (user memory, execution memory, 
storage memory, and overhead memory), and to understand how memory is being 
used and fine-tune allocation between regions, it would be useful to have 
information about how much memory is being used for the different regions.

To improve visibility into memory usage for the driver and executors and 
different memory regions, the following additional memory metrics can be 
tracked for each executor and driver:
 * JVM used memory: the JVM heap size for the executor/driver.
 * Execution memory: memory used for computation in shuffles, joins, sorts and 
aggregations.
 * Storage memory: memory used for caching and propagating internal data across the 
cluster.
 * Unified memory: sum of execution and storage memory.

The peak values for each memory metric can be tracked for each executor, and 
also per stage. This information can be shown in the Spark UI and the REST 
APIs. Information for peak JVM used memory can help with determining 
appropriate values for spark.executor.memory and spark.driver.memory, and 
information about the unified memory region can help with determining 
appropriate values for spark.memory.fraction and spark.memory.storageFraction. 
Stage memory information can help identify which stages are most memory 
intensive, and users can look into the relevant code to determine if it can be 
optimized.

The memory metrics can be gathered by adding the current JVM used memory, 
execution memory and storage memory to the heartbeat. SparkListeners are 
modified to collect the new metrics for the executors, stages and Spark history 
log. Only interesting values (peak values per stage per executor) are recorded 
in the Spark history log, to minimize the amount of additional logging.

We have attached our design documentation with this ticket and would like to 
receive feedback from the community for this proposal.

  was:
At LinkedIn, we have multiple clusters, running thousands of Spark 
applications, and these numbers are growing rapidly. We need to ensure that 
these Spark applications are well tuned -- cluster resources, including memory, 
should be used efficiently so that the cluster can support running more 
applications concurrently, and applications should run quickly and reliably.

Currently there is limited visibility into how much memory executors are using, 
and users are guessing numbers for executor and driver memory sizing. These 
estimates are often much larger than needed, leading to memory wastage. 
Examining the metrics for one cluster for a month, the average percentage of 
used executor memory (max JVM used memory across executors /  
spark.executor.memory) is 35%, leading to an average of 591GB unused memory per 
application (number of executors * (spark.executor.memory - max JVM used 
memory)). Spark has multiple memory regions (user memory, execution memory, 
storage memory, and overhead memory), and to understand how memory is being 
used and fine-tune allocation between regions, it would be useful to have 
information about how much memory is being used for the different regions.

To improve visibility into memory usage for the driver and executors and 
different memory regions, the following additional memory metrics can be 
tracked for each executor and driver:
 * JVM used memory: the JVM heap size for the executor/driver. 
 * Execution memory: memory used for computation in shuffles, joins, sorts and 
aggregations.
 * Storage memory: memory used for caching and propagating internal data across the 
cluster.
 * Unified memory: sum of execution and storage memory.

The peak values for each memory metric can be tracked for each executor, and 
also per stage. This information can be shown in the Spark UI and the REST 
APIs. Information for peak JVM used memory can help with determining 
appropriate values for 

[jira] [Updated] (SPARK-23206) Additional Memory Tuning Metrics

2018-01-24 Thread Edwina Lu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Edwina Lu updated SPARK-23206:
--
Attachment: MemoryTuningMetricsDesignDoc.pdf

> Additional Memory Tuning Metrics
> 
>
> Key: SPARK-23206
> URL: https://issues.apache.org/jira/browse/SPARK-23206
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Edwina Lu
>Priority: Major
> Attachments: MemoryTuningMetricsDesignDoc.pdf
>
>
> At LinkedIn, we have multiple clusters, running thousands of Spark 
> applications, and these numbers are growing rapidly. We need to ensure that 
> these Spark applications are well tuned -- cluster resources, including 
> memory, should be used efficiently so that the cluster can support running 
> more applications concurrently, and applications should run quickly and 
> reliably.
> Currently there is limited visibility into how much memory executors are 
> using, and users are guessing numbers for executor and driver memory sizing. 
> These estimates are often much larger than needed, leading to memory wastage. 
> Examining the metrics for one cluster for a month, the average percentage of 
> used executor memory (max JVM used memory across executors /  
> spark.executor.memory) is 35%, leading to an average of 591GB unused memory 
> per application (number of executors * (spark.executor.memory - max JVM used 
> memory)). Spark has multiple memory regions (user memory, execution memory, 
> storage memory, and overhead memory), and to understand how memory is being 
> used and fine-tune allocation between regions, it would be useful to have 
> information about how much memory is being used for the different regions.
> To improve visibility into memory usage for the driver and executors and 
> different memory regions, the following additional memory metrics can be 
> tracked for each executor and driver:
>  * JVM used memory: the JVM heap size for the executor/driver. 
>  * Execution memory: memory used for computation in shuffles, joins, sorts 
> and aggregations.
>  * Storage memory: memory used for caching and propagating internal data across 
> the cluster.
>  * Unified memory: sum of execution and storage memory.
> The peak values for each memory metric can be tracked for each executor, and 
> also per stage. This information can be shown in the Spark UI and the REST 
> APIs. Information for peak JVM used memory can help with determining 
> appropriate values for spark.executor.memory and spark.driver.memory, and 
> information about the unified memory region can help with determining 
> appropriate values for spark.memory.fraction and 
> spark.memory.storageFraction. Stage memory information can help identify 
> which stages are most memory intensive, and users can look into the relevant 
> code to determine if it can be optimized.
> The memory metrics can be gathered by adding the current JVM used memory, 
> execution memory and storage memory to the heartbeat. SparkListeners are 
> modified to collect the new metrics for the executors, stages and Spark 
> history log. Only interesting values (peak values per stage per executor) are 
> recorded in the Spark history log, to minimize the amount of additional 
> logging.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23206) Additional Memory Tuning Metrics

2018-01-24 Thread Edwina Lu (JIRA)
Edwina Lu created SPARK-23206:
-

 Summary: Additional Memory Tuning Metrics
 Key: SPARK-23206
 URL: https://issues.apache.org/jira/browse/SPARK-23206
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.2.1
Reporter: Edwina Lu


At LinkedIn, we have multiple clusters, running thousands of Spark 
applications, and these numbers are growing rapidly. We need to ensure that 
these Spark applications are well tuned -- cluster resources, including memory, 
should be used efficiently so that the cluster can support running more 
applications concurrently, and applications should run quickly and reliably.

Currently there is limited visibility into how much memory executors are using, 
and users are guessing numbers for executor and driver memory sizing. These 
estimates are often much larger than needed, leading to memory wastage. 
Examining the metrics for one cluster for a month, the average percentage of 
used executor memory (max JVM used memory across executors /  
spark.executor.memory) is 35%, leading to an average of 591GB unused memory per 
application (number of executors * (spark.executor.memory - max JVM used 
memory)). Spark has multiple memory regions (user memory, execution memory, 
storage memory, and overhead memory), and to understand how memory is being 
used and fine-tune allocation between regions, it would be useful to have 
information about how much memory is being used for the different regions.

To improve visibility into memory usage for the driver and executors and 
different memory regions, the following additional memory metrics can be 
tracked for each executor and driver:
 * JVM used memory: the JVM heap size for the executor/driver. 
 * Execution memory: memory used for computation in shuffles, joins, sorts and 
aggregations.
 * Storage memory: memory used for caching and propagating internal data across the 
cluster.
 * Unified memory: sum of execution and storage memory.

The peak values for each memory metric can be tracked for each executor, and 
also per stage. This information can be shown in the Spark UI and the REST 
APIs. Information for peak JVM used memory can help with determining 
appropriate values for spark.executor.memory and spark.driver.memory, and 
information about the unified memory region can help with determining 
appropriate values for spark.memory.fraction and spark.memory.storageFraction. 
Stage memory information can help identify which stages are most memory 
intensive, and users can look into the relevant code to determine if it can be 
optimized.

The memory metrics can be gathered by adding the current JVM used memory, 
execution memory and storage memory to the heartbeat. SparkListeners are 
modified to collect the new metrics for the executors, stages and Spark history 
log. Only interesting values (peak values per stage per executor) are recorded 
in the Spark history log, to minimize the amount of additional logging.
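
For illustration only, a minimal sketch of the unused-memory estimate described above; the helper name and the example numbers are assumptions, not part of the proposal.

{code}
// Hypothetical helper mirroring the formula in the description:
// unused = number of executors * (spark.executor.memory - max JVM used memory)
def unusedMemoryGb(numExecutors: Int, executorMemoryGb: Double, maxJvmUsedGb: Double): Double =
  numExecutors * (executorMemoryGb - maxJvmUsedGb)

// Example: 100 executors of 16 GB each with a peak JVM usage of 5.6 GB (35%)
// leaves roughly 100 * (16 - 5.6) = 1040 GB unused for that application.
println(unusedMemoryGb(100, 16.0, 5.6))
{code}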



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23205) ImageSchema.readImages incorrectly sets alpha channel to 255 for four-channel images

2018-01-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23205:


Assignee: Apache Spark

> ImageSchema.readImages incorrectly sets alpha channel to 255 for four-channel 
> images
> 
>
> Key: SPARK-23205
> URL: https://issues.apache.org/jira/browse/SPARK-23205
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Siddharth Murching
>Assignee: Apache Spark
>Priority: Critical
>
> When parsing raw image data in ImageSchema.decode(), we use a [java.awt.Color 
> constructor|https://docs.oracle.com/javase/7/docs/api/java/awt/Color.html#Color(int)]
>  that sets alpha = 255, even for four-channel images.
> See the offending line here: 
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala#L172
> A fix is to simply update the line to: 
> val color = new Color(img.getRGB(w, h), nChannels == 4)
> instead of
> val color = new Color(img.getRGB(w, h))



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23205) ImageSchema.readImages incorrectly sets alpha channel to 255 for four-channel images

2018-01-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338362#comment-16338362
 ] 

Apache Spark commented on SPARK-23205:
--

User 'smurching' has created a pull request for this issue:
https://github.com/apache/spark/pull/20389

> ImageSchema.readImages incorrectly sets alpha channel to 255 for four-channel 
> images
> 
>
> Key: SPARK-23205
> URL: https://issues.apache.org/jira/browse/SPARK-23205
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Siddharth Murching
>Priority: Critical
>
> When parsing raw image data in ImageSchema.decode(), we use a [java.awt.Color 
> constructor|https://docs.oracle.com/javase/7/docs/api/java/awt/Color.html#Color(int)]
>  that sets alpha = 255, even for four-channel images.
> See the offending line here: 
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala#L172
> A fix is to simply update the line to: 
> val color = new Color(img.getRGB(w, h), nChannels == 4)
> instead of
> val color = new Color(img.getRGB(w, h))



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23205) ImageSchema.readImages incorrectly sets alpha channel to 255 for four-channel images

2018-01-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23205:


Assignee: (was: Apache Spark)

> ImageSchema.readImages incorrectly sets alpha channel to 255 for four-channel 
> images
> 
>
> Key: SPARK-23205
> URL: https://issues.apache.org/jira/browse/SPARK-23205
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Siddharth Murching
>Priority: Critical
>
> When parsing raw image data in ImageSchema.decode(), we use a [java.awt.Color 
> constructor|https://docs.oracle.com/javase/7/docs/api/java/awt/Color.html#Color(int)]
>  that sets alpha = 255, even for four-channel images.
> See the offending line here: 
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala#L172
> A fix is to simply update the line to: 
> val color = new Color(img.getRGB(w, h), nChannels == 4)
> instead of
> val color = new Color(img.getRGB(w, h))



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23205) ImageSchema.readImages incorrectly sets alpha channel to 255 for four-channel images

2018-01-24 Thread Siddharth Murching (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338360#comment-16338360
 ] 

Siddharth Murching commented on SPARK-23205:


Working on a PR to address this issue

> ImageSchema.readImages incorrectly sets alpha channel to 255 for four-channel 
> images
> 
>
> Key: SPARK-23205
> URL: https://issues.apache.org/jira/browse/SPARK-23205
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Siddharth Murching
>Priority: Critical
>
> When parsing raw image data in ImageSchema.decode(), we use a [java.awt.Color 
> constructor|https://docs.oracle.com/javase/7/docs/api/java/awt/Color.html#Color(int)]
>  that sets alpha = 255, even for four-channel images.
> See the offending line here: 
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala#L172
> A fix is to simply update the line to: 
> val color = new Color(img.getRGB(w, h), nChannels == 4)
> instead of
> val color = new Color(img.getRGB(w, h))



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23205) ImageSchema.readImages incorrectly sets alpha channel to 255 for four-channel images

2018-01-24 Thread Siddharth Murching (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338360#comment-16338360
 ] 

Siddharth Murching edited comment on SPARK-23205 at 1/24/18 10:40 PM:
--

I'm working on a PR to address this issue if that's alright :)


was (Author: siddharth murching):
Working on a PR to address this issue

> ImageSchema.readImages incorrectly sets alpha channel to 255 for four-channel 
> images
> 
>
> Key: SPARK-23205
> URL: https://issues.apache.org/jira/browse/SPARK-23205
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Siddharth Murching
>Priority: Critical
>
> When parsing raw image data in ImageSchema.decode(), we use a [java.awt.Color 
> constructor|https://docs.oracle.com/javase/7/docs/api/java/awt/Color.html#Color(int)]
>  that sets alpha = 255, even for four-channel images.
> See the offending line here: 
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala#L172
> A fix is to simply update the line to: 
> val color = new Color(img.getRGB(w, h), nChannels == 4)
> instead of
> val color = new Color(img.getRGB(w, h))



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23205) ImageSchema.readImages incorrectly sets alpha channel to 255 for four-channel images

2018-01-24 Thread Siddharth Murching (JIRA)
Siddharth Murching created SPARK-23205:
--

 Summary: ImageSchema.readImages incorrectly sets alpha channel to 
255 for four-channel images
 Key: SPARK-23205
 URL: https://issues.apache.org/jira/browse/SPARK-23205
 Project: Spark
  Issue Type: Bug
  Components: ML, MLlib
Affects Versions: 2.3.0
Reporter: Siddharth Murching


When parsing raw image data in ImageSchema.decode(), we use a [java.awt.Color 
constructor|https://docs.oracle.com/javase/7/docs/api/java/awt/Color.html#Color(int)]
 that sets alpha = 255, even for four-channel images.

See the offending line here: 
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala#L172

A fix is to simply update the line to: 

val color = new Color(img.getRGB(w, h), nChannels == 4)


instead of

val color = new Color(img.getRGB(w, h))
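
As a standalone illustration (not the ImageSchema code itself) of why the single-argument constructor drops the alpha channel; the ARGB value below is only an example:

{code}
import java.awt.Color

// 0x80FF0000 is semi-transparent red: the alpha component is 0x80 (128).
val argb = 0x80FF0000
new Color(argb).getAlpha        // 255 -- Color(int) ignores the alpha bits
new Color(argb, true).getAlpha  // 128 -- alpha is preserved when hasalpha = true
{code}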



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23020) Re-enable Flaky Test: org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher

2018-01-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338317#comment-16338317
 ] 

Apache Spark commented on SPARK-23020:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/20388

> Re-enable Flaky Test: 
> org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher
> 
>
> Key: SPARK-23020
> URL: https://issues.apache.org/jira/browse/SPARK-23020
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.3.0
>Reporter: Sameer Agarwal
>Assignee: Marcelo Vanzin
>Priority: Blocker
> Fix For: 2.3.0
>
>
> https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.3-test-maven-hadoop-2.7/42/testReport/junit/org.apache.spark.launcher/SparkLauncherSuite/testInProcessLauncher/history/



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23198) Fix KafkaContinuousSourceStressForDontFailOnDataLossSuite to test ContinuousExecution

2018-01-24 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu reassigned SPARK-23198:


Assignee: Dongjoon Hyun

> Fix KafkaContinuousSourceStressForDontFailOnDataLossSuite to test 
> ContinuousExecution
> -
>
> Key: SPARK-23198
> URL: https://issues.apache.org/jira/browse/SPARK-23198
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 2.3.0
>
>
> Currently, `KafkaContinuousSourceStressForDontFailOnDataLossSuite` runs on 
> `MicroBatchExecution`. It should test `ContinuousExecution`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23198) Fix KafkaContinuousSourceStressForDontFailOnDataLossSuite to test ContinuousExecution

2018-01-24 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-23198.
--
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 20374
[https://github.com/apache/spark/pull/20374]

> Fix KafkaContinuousSourceStressForDontFailOnDataLossSuite to test 
> ContinuousExecution
> -
>
> Key: SPARK-23198
> URL: https://issues.apache.org/jira/browse/SPARK-23198
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 2.3.0
>
>
> Currently, `KafkaContinuousSourceStressForDontFailOnDataLossSuite` runs on 
> `MicroBatchExecution`. It should test `ContinuousExecution`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23203) DataSourceV2 should use immutable trees.

2018-01-24 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338120#comment-16338120
 ] 

Marcelo Vanzin commented on SPARK-23203:


Ah, cool, wasn't aware that it was still experimental. I was just wondering since 
I saw this right before I meant to reply to the rc1 e-mail. (I'm also concerned about 
the number of patches going in after the rcs started, but I generally count that as 
standard operating procedure during initial rcs...)

> DataSourceV2 should use immutable trees.
> 
>
> Key: SPARK-23203
> URL: https://issues.apache.org/jira/browse/SPARK-23203
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Priority: Major
>
> The DataSourceV2 integration doesn't use [immutable 
> trees|https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html],
>  which is a basic requirement of Catalyst. The v2 relation should not wrap a 
> mutable reader and change the logical plan by pushing projections and filters.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22711) _pickle.PicklingError: args[0] from __newobj__ args has the wrong class from cloudpickle.py

2018-01-24 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338116#comment-16338116
 ] 

Bryan Cutler commented on SPARK-22711:
--

Yes, normally you would not need to import inside the functions.  I think it is 
because the wordnet module initializes lazily, so when cloudpickle tries to 
pickle it, it is not complete and something is missed.  I tried calling 
{{wordnet.ensure_loaded()}} in main, which forces the module to load, and that 
looks like it allows it to be pickled properly, but then there is a problem 
trying to unpickle it.  So I think it is best to use the workaround above for 
this case.

> _pickle.PicklingError: args[0] from __newobj__ args has the wrong class from 
> cloudpickle.py
> ---
>
> Key: SPARK-22711
> URL: https://issues.apache.org/jira/browse/SPARK-22711
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Submit
>Affects Versions: 2.2.0, 2.2.1
> Environment: Ubuntu pseudo distributed installation of Spark 2.2.0
>Reporter: Prateek
>Priority: Major
> Attachments: Jira_Spark_minimized_code.py
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> When I submit a Pyspark program with spark-submit command this error is 
> thrown.
> It happens when for code like below
> RDD2 = RDD1.map(lambda m: function_x(m)).reduceByKey(lambda c,v :c+v)
> or 
> RDD2 = RDD1.flatMap(lambda m: function_x(m)).reduceByKey(lambda c,v :c+v)
> or
> RDD2 = RDD1.flatMap(lambda m: function_x(m)).reduce(lambda c,v :c+v)
> Traceback (most recent call last):
>   File "/home/prateek/Project/textrank.py", line 299, in 
> summaryRDD = sentenceTokensReduceRDD.map(lambda m: 
> get_summary(m)).reduceByKey(lambda c,v :c+v)
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1608, 
> in reduceByKey
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1846, 
> in combineByKey
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1783, 
> in partitionBy
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2455, 
> in _jrdd
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2388, 
> in _wrap_function
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2374, 
> in _prepare_for_python_RDD
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 
> 460, in dumps
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 704, in dumps
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 148, in dump
>   File "/usr/lib/python3.5/pickle.py", line 408, in dump
> self.save(obj)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 740, in save_tuple
> save(element)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 255, in save_function
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 292, in save_function_tuple
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 725, in save_tuple
> save(element)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 770, in save_list
> self._batch_appends(obj)
>   File "/usr/lib/python3.5/pickle.py", line 794, in _batch_appends
> save(x)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 255, in save_function
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 292, in save_function_tuple
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 725, in save_tuple
> save(element)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 770, in save_list
> self._batch_appends(obj)
>   File "/usr/lib/python3.5/pickle.py", line 794, in _batch_appends
> save(x)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 

[jira] [Commented] (SPARK-23203) DataSourceV2 should use immutable trees.

2018-01-24 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338112#comment-16338112
 ] 

Ryan Blue commented on SPARK-23203:
---

[~vanzin], given that DataSourceV2 is experimental, I don't think it should 
block the release.

 

But I also don't think it was a great idea to rush patches lately to get them 
into 2.3.0. We are shipping an experimental implementation that will no doubt 
change a lot as we add basics, like checking a data source table's schema 
against the data frame that is being written (which is completely bypassed 
right now).

> DataSourceV2 should use immutable trees.
> 
>
> Key: SPARK-23203
> URL: https://issues.apache.org/jira/browse/SPARK-23203
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Priority: Major
>
> The DataSourceV2 integration doesn't use [immutable 
> trees|https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html],
>  which is a basic requirement of Catalyst. The v2 relation should not wrap a 
> mutable reader and change the logical plan by pushing projections and filters.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23204) DataSourceV2 should support named tables in DataFrameReader, DataFrameWriter

2018-01-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338089#comment-16338089
 ] 

Apache Spark commented on SPARK-23204:
--

User 'rdblue' has created a pull request for this issue:
https://github.com/apache/spark/pull/20387

> DataSourceV2 should support named tables in DataFrameReader, DataFrameWriter
> 
>
> Key: SPARK-23204
> URL: https://issues.apache.org/jira/browse/SPARK-23204
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Priority: Major
>
> DataSourceV2 is currently only configured with a path, passed in options as 
> {{path}}. For many data sources, like JDBC, a table name is more appropriate. 
> I propose testing the "location" passed to load(String) and save(String) to 
> see if it is a path and if not, parsing it as a table name and passing 
> "database" and "table" options to readers and writers.
> This also creates a way to pass the table identifier when using DataSourceV2 
> tables from SQL. For example, {{SELECT * FROM db.table}} creates an 
> {{UnresolvedRelation(db,table)}} that could be resolved using the default 
> source, passing the db and table name using the same options. Similarly, we 
> can add a table property for the datasource implementation to metastore 
> tables and add a rule to convert them to DataSourceV2 relations.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23203) DataSourceV2 should use immutable trees.

2018-01-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338088#comment-16338088
 ] 

Apache Spark commented on SPARK-23203:
--

User 'rdblue' has created a pull request for this issue:
https://github.com/apache/spark/pull/20387

> DataSourceV2 should use immutable trees.
> 
>
> Key: SPARK-23203
> URL: https://issues.apache.org/jira/browse/SPARK-23203
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Priority: Major
>
> The DataSourceV2 integration doesn't use [immutable 
> trees|https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html],
>  which is a basic requirement of Catalyst. The v2 relation should not wrap a 
> mutable reader and change the logical plan by pushing projections and filters.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23203) DataSourceV2 should use immutable trees.

2018-01-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23203:


Assignee: (was: Apache Spark)

> DataSourceV2 should use immutable trees.
> 
>
> Key: SPARK-23203
> URL: https://issues.apache.org/jira/browse/SPARK-23203
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Priority: Major
>
> The DataSourceV2 integration doesn't use [immutable 
> trees|https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html],
>  which is a basic requirement of Catalyst. The v2 relation should not wrap a 
> mutable reader and change the logical plan by pushing projections and filters.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23204) DataSourceV2 should support named tables in DataFrameReader, DataFrameWriter

2018-01-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23204:


Assignee: Apache Spark

> DataSourceV2 should support named tables in DataFrameReader, DataFrameWriter
> 
>
> Key: SPARK-23204
> URL: https://issues.apache.org/jira/browse/SPARK-23204
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Assignee: Apache Spark
>Priority: Major
>
> DataSourceV2 is currently only configured with a path, passed in options as 
> {{path}}. For many data sources, like JDBC, a table name is more appropriate. 
> I propose testing the "location" passed to load(String) and save(String) to 
> see if it is a path and if not, parsing it as a table name and passing 
> "database" and "table" options to readers and writers.
> This also creates a way to pass the table identifier when using DataSourceV2 
> tables from SQL. For example, {{SELECT * FROM db.table}} creates an 
> {{UnresolvedRelation(db,table)}} that could be resolved using the default 
> source, passing the db and table name using the same options. Similarly, we 
> can add a table property for the datasource implementation to metastore 
> tables and add a rule to convert them to DataSourceV2 relations.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23203) DataSourceV2 should use immutable trees.

2018-01-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23203:


Assignee: Apache Spark

> DataSourceV2 should use immutable trees.
> 
>
> Key: SPARK-23203
> URL: https://issues.apache.org/jira/browse/SPARK-23203
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Assignee: Apache Spark
>Priority: Major
>
> The DataSourceV2 integration doesn't use [immutable 
> trees|https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html],
>  which is a basic requirement of Catalyst. The v2 relation should not wrap a 
> mutable reader and change the logical plan by pushing projections and filters.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23204) DataSourceV2 should support named tables in DataFrameReader, DataFrameWriter

2018-01-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23204:


Assignee: (was: Apache Spark)

> DataSourceV2 should support named tables in DataFrameReader, DataFrameWriter
> 
>
> Key: SPARK-23204
> URL: https://issues.apache.org/jira/browse/SPARK-23204
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Priority: Major
>
> DataSourceV2 is currently only configured with a path, passed in options as 
> {{path}}. For many data sources, like JDBC, a table name is more appropriate. 
> I propose testing the "location" passed to load(String) and save(String) to 
> see if it is a path and if not, parsing it as a table name and passing 
> "database" and "table" options to readers and writers.
> This also creates a way to pass the table identifier when using DataSourceV2 
> tables from SQL. For example, {{SELECT * FROM db.table}} creates an 
> {{UnresolvedRelation(db,table)}} that could be resolved using the default 
> source, passing the db and table name using the same options. Similarly, we 
> can add a table property for the datasource implementation to metastore 
> tables and add a rule to convert them to DataSourceV2 relations.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22711) _pickle.PicklingError: args[0] from __newobj__ args has the wrong class from cloudpickle.py

2018-01-24 Thread Prateek (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338086#comment-16338086
 ] 

Prateek commented on SPARK-22711:
-

Thanks [~bryanc]. I will check. Do we have to import the libraries in all 
dependent functions? That's uncommon, right?

> _pickle.PicklingError: args[0] from __newobj__ args has the wrong class from 
> cloudpickle.py
> ---
>
> Key: SPARK-22711
> URL: https://issues.apache.org/jira/browse/SPARK-22711
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Submit
>Affects Versions: 2.2.0, 2.2.1
> Environment: Ubuntu pseudo distributed installation of Spark 2.2.0
>Reporter: Prateek
>Priority: Major
> Attachments: Jira_Spark_minimized_code.py
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> When I submit a Pyspark program with spark-submit command this error is 
> thrown.
> It happens when for code like below
> RDD2 = RDD1.map(lambda m: function_x(m)).reduceByKey(lambda c,v :c+v)
> or 
> RDD2 = RDD1.flatMap(lambda m: function_x(m)).reduceByKey(lambda c,v :c+v)
> or
> RDD2 = RDD1.flatMap(lambda m: function_x(m)).reduce(lambda c,v :c+v)
> Traceback (most recent call last):
>   File "/home/prateek/Project/textrank.py", line 299, in 
> summaryRDD = sentenceTokensReduceRDD.map(lambda m: 
> get_summary(m)).reduceByKey(lambda c,v :c+v)
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1608, 
> in reduceByKey
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1846, 
> in combineByKey
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1783, 
> in partitionBy
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2455, 
> in _jrdd
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2388, 
> in _wrap_function
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2374, 
> in _prepare_for_python_RDD
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 
> 460, in dumps
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 704, in dumps
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 148, in dump
>   File "/usr/lib/python3.5/pickle.py", line 408, in dump
> self.save(obj)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 740, in save_tuple
> save(element)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 255, in save_function
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 292, in save_function_tuple
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 725, in save_tuple
> save(element)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 770, in save_list
> self._batch_appends(obj)
>   File "/usr/lib/python3.5/pickle.py", line 794, in _batch_appends
> save(x)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 255, in save_function
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 292, in save_function_tuple
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 725, in save_tuple
> save(element)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 770, in save_list
> self._batch_appends(obj)
>   File "/usr/lib/python3.5/pickle.py", line 794, in _batch_appends
> save(x)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 255, in save_function
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 292, in save_function_tuple
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 725, in save_tuple
> save(element)
>   File "/usr

[jira] [Updated] (SPARK-23204) DataSourceV2 should support named tables in DataFrameReader, DataFrameWriter

2018-01-24 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated SPARK-23204:
--
Description: 
DataSourceV2 is currently only configured with a path, passed in options as 
{{path}}. For many data sources, like JDBC, a table name is more appropriate. I 
propose testing the "location" passed to load(String) and save(String) to see 
if it is a path and if not, parsing it as a table name and passing "database" 
and "table" options to readers and writers.

This also creates a way to pass the table identifier when using DataSourceV2 
tables from SQL. For example, {{SELECT * FROM db.table}} creates an 
{{UnresolvedRelation(db,table)}} that could be resolved using the default 
source, passing the db and table name using the same options. Similarly, we can 
add a table property for the datasource implementation to metastore tables and 
add a rule to convert them to DataSourceV2 relations.

  was:
DataSourceV2 is currently only configured with a path, passed in options as 
{{path}}. For many data sources, like JDBC, a table name is more appropriate. I 
propose testing the "location" passed to load(String) and save(String) to see 
if it is a path and if not, parsing it as a table name and passing "database" 
and "table" options to readers and writers.

This also creates a way to pass the table identifier when using DataSourceV2 
tables from SQL. For example, {{SELECT * FROM db.table}} creates an 
{{UnresolvedRelation(db,table)}} that could be resolved using the default 
source, passing the db and table name using the same options.


> DataSourceV2 should support named tables in DataFrameReader, DataFrameWriter
> 
>
> Key: SPARK-23204
> URL: https://issues.apache.org/jira/browse/SPARK-23204
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Priority: Major
>
> DataSourceV2 is currently only configured with a path, passed in options as 
> {{path}}. For many data sources, like JDBC, a table name is more appropriate. 
> I propose testing the "location" passed to load(String) and save(String) to 
> see if it is a path and if not, parsing it as a table name and passing 
> "database" and "table" options to readers and writers.
> This also creates a way to pass the table identifier when using DataSourceV2 
> tables from SQL. For example, {{SELECT * FROM db.table}} creates an 
> {{UnresolvedRelation(db,table)}} that could be resolved using the default 
> source, passing the db and table name using the same options. Similarly, we 
> can add a table property for the datasource implementation to metastore 
> tables and add a rule to convert them to DataSourceV2 relations.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23189) reflect stage level blacklisting on executor tab

2018-01-24 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338083#comment-16338083
 ] 

Imran Rashid commented on SPARK-23189:
--

OK, since nobody feels strongly and to avoid bike-shedding here, my suggestion 
to move forward is that Attila goes ahead with this since it should be a small 
change, but if it gets complicated for whatever reason it's probably not worth 
sinking a lot of time into.

I'd rather we just fixed the stage page to be better -- I don't like adding 
this if we plan on ripping it out later.  But I also realize we don't have a ton 
of contributors working on the UI right now, so who knows when the other UI work 
will happen, so we can make this small change, since I think you probably have 
a pretty good sense of what would be useful in these pages.


> reflect stage level blacklisting on executor tab 
> -
>
> Key: SPARK-23189
> URL: https://issues.apache.org/jira/browse/SPARK-23189
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.1
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> This issue came up while working on SPARK-22577, where the conclusion was that 
> not only should the stage tab reflect stage and application level blacklisting, 
> but the executor tab should also be extended with stage level blacklisting 
> information.
> As [~irashid] and [~tgraves] discussed, the blacklisted stages should be 
> listed for an executor like "*stage[ , ,...]*". One idea was to list only the 
> 3 most recent blacklisted stages; another was to list all the active 
> stages which are blacklisted.  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23204) DataSourceV2 should support named tables in DataFrameReader, DataFrameWriter

2018-01-24 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated SPARK-23204:
--
Description: 
DataSourceV2 is currently only configured with a path, passed in options as 
{{path}}. For many data sources, like JDBC, a table name is more appropriate. I 
propose testing the "location" passed to load(String) and save(String) to see 
if it is a path and if not, parsing it as a table name and passing "database" 
and "table" options to readers and writers.

This also creates a way to pass the table identifier when using DataSourceV2 
tables from SQL. For example, {{SELECT * FROM db.table}} creates an 
{{UnresolvedRelation(db,table)}} that could be resolved using the default 
source, passing the db and table name using the same options.

  was:DataSourceV2 is currently only configured with a path, passed in options 
as `path`. For many data sources, like JDBC, a table name is more appropriate. 
I propose testing the "location" passed to load(String) and save(String) to see 
if it is a path and if not, parsing it as a table name and passing "database" 
and "table" options to readers and writers.


> DataSourceV2 should support named tables in DataFrameReader, DataFrameWriter
> 
>
> Key: SPARK-23204
> URL: https://issues.apache.org/jira/browse/SPARK-23204
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Priority: Major
>
> DataSourceV2 is currently only configured with a path, passed in options as 
> {{path}}. For many data sources, like JDBC, a table name is more appropriate. 
> I propose testing the "location" passed to load(String) and save(String) to 
> see if it is a path and if not, parsing it as a table name and passing 
> "database" and "table" options to readers and writers.
> This also creates a way to pass the table identifier when using DataSourceV2 
> tables from SQL. For example, {{SELECT * FROM db.table}} creates an 
> {{UnresolvedRelation(db,table)}} that could be resolved using the default 
> source, passing the db and table name using the same options.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23204) DataSourceV2 should support named tables in DataFrameReader, DataFrameWriter

2018-01-24 Thread Ryan Blue (JIRA)
Ryan Blue created SPARK-23204:
-

 Summary: DataSourceV2 should support named tables in 
DataFrameReader, DataFrameWriter
 Key: SPARK-23204
 URL: https://issues.apache.org/jira/browse/SPARK-23204
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.3.0
Reporter: Ryan Blue


DataSourceV2 is currently only configured with a path, passed in options as 
`path`. For many data sources, like JDBC, a table name is more appropriate. I 
propose testing the "location" passed to load(String) and save(String) to see 
if it is a path and if not, parsing it as a table name and passing "database" 
and "table" options to readers and writers.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22386) Data Source V2 improvements

2018-01-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338071#comment-16338071
 ] 

Apache Spark commented on SPARK-22386:
--

User 'rdblue' has created a pull request for this issue:
https://github.com/apache/spark/pull/20387

> Data Source V2 improvements
> ---
>
> Key: SPARK-22386
> URL: https://issues.apache.org/jira/browse/SPARK-22386
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22386) Data Source V2 improvements

2018-01-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22386:


Assignee: (was: Apache Spark)

> Data Source V2 improvements
> ---
>
> Key: SPARK-22386
> URL: https://issues.apache.org/jira/browse/SPARK-22386
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22386) Data Source V2 improvements

2018-01-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22386:


Assignee: Apache Spark

> Data Source V2 improvements
> ---
>
> Key: SPARK-22386
> URL: https://issues.apache.org/jira/browse/SPARK-22386
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22711) _pickle.PicklingError: args[0] from __newobj__ args has the wrong class from cloudpickle.py

2018-01-24 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338070#comment-16338070
 ] 

Bryan Cutler commented on SPARK-22711:
--

Hi [~PrateekRM], here is your code trimmed down to where the problem is. It 
seems like CloudPickle in pyspark is having trouble with wordnet:

{code}
from pyspark import SparkContext
from nltk.corpus import wordnet as wn

def to_synset(word):
return str(wn.synsets(word))

sc = SparkContext(appName="Text Rank")
rdd = sc.parallelize(["cat", "dog"])
print(rdd.map(to_synset).collect())
{code}

I can look into it, but as a workaround, if you import wordnet inside your 
function it seems to work fine:

{code}
def to_synset(word):
    # Importing inside the function means wordnet is loaded where the function
    # runs, instead of a module-level reference being captured by the closure
    # and pickled by CloudPickle.
    from nltk.corpus import wordnet as wn
    return str(wn.synsets(word))
{code}

> _pickle.PicklingError: args[0] from __newobj__ args has the wrong class from 
> cloudpickle.py
> ---
>
> Key: SPARK-22711
> URL: https://issues.apache.org/jira/browse/SPARK-22711
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Submit
>Affects Versions: 2.2.0, 2.2.1
> Environment: Ubuntu pseudo distributed installation of Spark 2.2.0
>Reporter: Prateek
>Priority: Major
> Attachments: Jira_Spark_minimized_code.py
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> When I submit a PySpark program with the spark-submit command this error is 
> thrown.
> It happens for code like the following:
> RDD2 = RDD1.map(lambda m: function_x(m)).reduceByKey(lambda c,v :c+v)
> or 
> RDD2 = RDD1.flatMap(lambda m: function_x(m)).reduceByKey(lambda c,v :c+v)
> or
> RDD2 = RDD1.flatMap(lambda m: function_x(m)).reduce(lambda c,v :c+v)
> Traceback (most recent call last):
>   File "/home/prateek/Project/textrank.py", line 299, in 
> summaryRDD = sentenceTokensReduceRDD.map(lambda m: 
> get_summary(m)).reduceByKey(lambda c,v :c+v)
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1608, 
> in reduceByKey
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1846, 
> in combineByKey
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1783, 
> in partitionBy
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2455, 
> in _jrdd
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2388, 
> in _wrap_function
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2374, 
> in _prepare_for_python_RDD
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 
> 460, in dumps
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 704, in dumps
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 148, in dump
>   File "/usr/lib/python3.5/pickle.py", line 408, in dump
> self.save(obj)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 740, in save_tuple
> save(element)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 255, in save_function
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 292, in save_function_tuple
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 725, in save_tuple
> save(element)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 770, in save_list
> self._batch_appends(obj)
>   File "/usr/lib/python3.5/pickle.py", line 794, in _batch_appends
> save(x)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 255, in save_function
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 292, in save_function_tuple
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 725, in save_tuple
> save(element)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 770, in save_list
> self._batch_appends(obj)
>   File "/usr/lib/python3.5/pickle.py", line 794, in _batch_appends
> save(x)
>   File "/usr/lib/python3.5/pickle.py", line 475, in sav

[jira] [Commented] (SPARK-23203) DataSourceV2 should use immutable trees.

2018-01-24 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338063#comment-16338063
 ] 

Marcelo Vanzin commented on SPARK-23203:


[~rdblue] is this something that affects the API or is it just an internal 
change? (Or, in other words, "should it block 2.3"?)

> DataSourceV2 should use immutable trees.
> 
>
> Key: SPARK-23203
> URL: https://issues.apache.org/jira/browse/SPARK-23203
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Priority: Major
>
> The DataSourceV2 integration doesn't use [immutable 
> trees|https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html],
>  which is a basic requirement of Catalyst. The v2 relation should not wrap a 
> mutable reader and change the logical plan by pushing projections and filters.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23203) DataSourceV2 should use immutable trees.

2018-01-24 Thread Ryan Blue (JIRA)
Ryan Blue created SPARK-23203:
-

 Summary: DataSourceV2 should use immutable trees.
 Key: SPARK-23203
 URL: https://issues.apache.org/jira/browse/SPARK-23203
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.3.0
 Environment: The DataSourceV2 integration doesn't use [immutable 
trees|https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html],
 which is a basic requirement of Catalyst. The v2 relation should not wrap a 
mutable reader and change the logical plan by pushing projections and filters.
Reporter: Ryan Blue






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23203) DataSourceV2 should use immutable trees.

2018-01-24 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated SPARK-23203:
--
Environment: (was: The DataSourceV2 integration doesn't use [immutable 
trees|https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html],
 which is a basic requirement of Catalyst. The v2 relation should not wrap a 
mutable reader and change the logical plan by pushing projections and filters.)

> DataSourceV2 should use immutable trees.
> 
>
> Key: SPARK-23203
> URL: https://issues.apache.org/jira/browse/SPARK-23203
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23203) DataSourceV2 should use immutable trees.

2018-01-24 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated SPARK-23203:
--
Description: The DataSourceV2 integration doesn't use [immutable 
trees|https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html],
 which is a basic requirement of Catalyst. The v2 relation should not wrap a 
mutable reader and change the logical plan by pushing projections and filters.
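
For illustration only, here is a minimal sketch of the immutable style the description asks 
for; the node and field names below are invented for the example and are not Spark's actual 
DataSourceV2 or Catalyst classes. Pushing a filter returns a new copy of the node instead of 
mutating a reader held inside the relation:

{code:scala}
// Illustrative only: a tiny immutable "plan node". Pushing a filter produces a
// new node via copy() instead of mutating state inside the existing one.
case class Filter(condition: String)

case class ScanNode(table: String, pushedFilters: Seq[Filter] = Nil) {
  def withFilter(f: Filter): ScanNode = copy(pushedFilters = pushedFilters :+ f)
}

object ImmutablePushdownExample extends App {
  val scan     = ScanNode("events")
  val filtered = scan.withFilter(Filter("ts > 100"))
  println(scan)     // ScanNode(events,List()) -- the original node is unchanged
  println(filtered) // ScanNode(events,List(Filter(ts > 100)))
}
{code}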

> DataSourceV2 should use immutable trees.
> 
>
> Key: SPARK-23203
> URL: https://issues.apache.org/jira/browse/SPARK-23203
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Priority: Major
>
> The DataSourceV2 integration doesn't use [immutable 
> trees|https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html],
>  which is a basic requirement of Catalyst. The v2 relation should not wrap a 
> mutable reader and change the logical plan by pushing projections and filters.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23117) SparkR 2.3 QA: Check for new R APIs requiring example code

2018-01-24 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338043#comment-16338043
 ] 

Felix Cheung commented on SPARK-23117:
--

I'm ok to sign off if we don't have examples for SPARK-20307 or SPARK-21381.

Perhaps this is something we should explain more in the ML guide, since the 
changes go into the Python and Scala APIs as well.

> SparkR 2.3 QA: Check for new R APIs requiring example code
> --
>
> Key: SPARK-23117
> URL: https://issues.apache.org/jira/browse/SPARK-23117
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Priority: Major
>
> Audit list of new features added to MLlib's R API, and see which major items 
> are missing example code (in the examples folder).  We do not need examples 
> for everything, only for major items such as new algorithms.
> For any such items:
> * Create a JIRA for that feature, and assign it to the author of the feature 
> (or yourself if interested).
> * Link it to (a) the original JIRA which introduced that feature ("related 
> to") and (b) to this JIRA ("requires").



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17147) Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets (i.e. Log Compaction)

2018-01-24 Thread Justin Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338032#comment-16338032
 ] 

Justin Miller commented on SPARK-17147:
---

I'm also seeing this behavior on a topic that has cleanup.policy=delete. The 
volume on this topic is very large, > 10 billion messages per day, and it seems 
to happen about once per day. Another topic with lower volume but larger 
messages hits it every few days.

18/01/23 18:30:10 WARN TaskSetManager: Lost task 28.0 in stage 26.0 (TID 861, 
,executor 15): java.lang.AssertionError: assertion failed: Got wrong record 
for -8002  124 even after seeking to offset 1769485661 
got back record.offset 1769485662
18/01/23 18:30:12 INFO TaskSetManager: Lost task 28.1 in stage 26.0 (TID 865) 
on ,executor 24: java.lang.AssertionError (assertion failed: Got wrong 
record for -8002  124 even after seeking to offset 
1769485661 got back record.offset 1769485662) [duplicate 1]
18/01/23 18:30:14 INFO TaskSetManager: Lost task 28.2 in stage 26.0 (TID 866) 
on ,executor 15: java.lang.AssertionError (assertion failed: Got wrong 
record for -8002  124 even after seeking to offset 
1769485661 got back record.offset 1769485662) [duplicate 2]
18/01/23 18:30:15 INFO TaskSetManager: Lost task 28.3 in stage 26.0 (TID 867) 
on ,executor 15: java.lang.AssertionError (assertion failed: Got wrong 
record for -8002  124 even after seeking to offset 
1769485661 got back record.offset 1769485662) [duplicate 3]
18/01/23 18:30:18 WARN TaskSetManager: Lost task 28.0 in stage 27.0 (TID 898, 
,executor 6): java.lang.AssertionError: assertion failed: Got wrong record 
for -8002  124 even after seeking to offset 1769485661 
got back record.offset 1769485662
18/01/23 18:30:19 INFO TaskSetManager: Lost task 28.1 in stage 27.0 (TID 900) 
on ,executor 15: java.lang.AssertionError (assertion failed: Got wrong 
record for -8002  124 even after seeking to offset 
1769485661 got back record.offset 1769485662) [duplicate 1]
18/01/23 18:30:20 INFO TaskSetManager: Lost task 28.2 in stage 27.0 (TID 901) 
on ,executor 15: java.lang.AssertionError (assertion failed: Got wrong 
record for -8002  124 even after seeking to offset 
1769485661 got back record.offset 1769485662) [duplicate 2]
18/01/23 18:30:21 INFO TaskSetManager: Lost task 28.3 in stage 27.0 (TID 902) 
on ,executor 15: java.lang.AssertionError (assertion failed: Got wrong 
record for -8002  124 even after seeking to offset 
1769485661 got back record.offset 1769485662) [duplicate 3]

When checked with kafka-simple-consumer-shell, the offset is in fact missing:

next offset = 1769485661
next offset = 1769485663
next offset = 1769485664
next offset = 1769485665

I'm currently testing out this branch in the persister and will post if it 
crashes again over the next few days (I currently have the kafka-10 source from 
the branch with a few extra log lines deployed). We're currently on log format 
0.10.2 (upgraded yesterday) but saw the same issue on 0.9.0.0.

chao.wu - Is this behavior similar to what you're seeing?

> Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets 
> (i.e. Log Compaction)
> --
>
> Key: SPARK-17147
> URL: https://issues.apache.org/jira/browse/SPARK-17147
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.0
>Reporter: Robert Conrad
>Priority: Major
>
> When Kafka does log compaction, offsets often end up with gaps, meaning the 
> next requested offset will frequently not be offset+1. The logic in 
> KafkaRDD & CachedKafkaConsumer has a baked-in assumption that the next offset 
> will always be just an increment of 1 above the previous offset. 
> I have worked around this problem by changing CachedKafkaConsumer to use the 
> returned record's offset, from:
> {{nextOffset = offset + 1}}
> to:
> {{nextOffset = record.offset + 1}}
> and changed KafkaRDD from:
> {{requestOffset += 1}}
> to:
> {{requestOffset = r.offset() + 1}}
> (I also had to change some assert logic in CachedKafkaConsumer).
> There's a strong possibility that I have misconstrued how to use the 
> streaming kafka consumer, and I'm happy to close this out if that's the case. 
> If, however, it is supposed to support non-consecutive offsets (e.g. due to 
> log compaction) I am also happy to contribute a PR.
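
To make the offset-gap point concrete, here is a tiny self-contained sketch. It does not use 
the real KafkaRDD or CachedKafkaConsumer classes; Record below is a stand-in for Kafka's 
ConsumerRecord, and the range logic is simplified. It only illustrates why advancing from the 
returned record's offset tolerates compaction gaps while assuming offset+1 does not:

{code:scala}
// Self-contained sketch: the compacted log below is missing offset 3, so
// asserting "next record == previous offset + 1" would fail, while advancing
// from record.offset + 1 works through the gap.
object CompactedOffsets {
  case class Record(offset: Long, value: String)

  /** Collect records in [from, until), tolerating offset gaps from compaction. */
  def collectRange(log: Seq[Record], from: Long, until: Long): Seq[Record] = {
    var nextOffset = from
    log.filter(r => r.offset >= from && r.offset < until).map { r =>
      // Tolerant check: only require that offsets never go backwards, then
      // advance from the offset actually returned rather than asserting
      // r.offset == nextOffset.
      require(r.offset >= nextOffset, s"offset went backwards: ${r.offset} < $nextOffset")
      nextOffset = r.offset + 1
      r
    }
  }

  def main(args: Array[String]): Unit = {
    val compacted = Seq(Record(1, "a"), Record(2, "b"), Record(4, "c"), Record(5, "d"))
    println(collectRange(compacted, 2L, 6L).map(_.offset)) // List(2, 4, 5)
  }
}
{code}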



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22297) Flaky test: BlockManagerSuite "Shuffle registration timeout and maxAttempts conf"

2018-01-24 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-22297:
--

Assignee: Mark Petruska

> Flaky test: BlockManagerSuite "Shuffle registration timeout and maxAttempts 
> conf"
> -
>
> Key: SPARK-22297
> URL: https://issues.apache.org/jira/browse/SPARK-22297
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Assignee: Mark Petruska
>Priority: Minor
> Fix For: 2.4.0
>
>
> Ran into this locally; the test code seems to use timeouts which generally 
> end up in flakiness like this.
> {noformat}
> [info] - SPARK-20640: Shuffle registration timeout and maxAttempts conf are 
> working *** FAILED *** (1 second, 203 milliseconds)
> [info]   "Unable to register with external shuffle server due to : 
> java.util.concurrent.TimeoutException: Timeout waiting for task." did not 
> contain "test_spark_20640_try_again" (BlockManagerSuite.scala:1370)
> [info]   org.scalatest.exceptions.TestFailedException:
> [info]   at 
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528)
> [info]   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
> [info]   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501)
> [info]   at 
> org.apache.spark.storage.BlockManagerSuite$$anonfun$14.apply$mcV$sp(BlockManagerSuite.scala:1370)
> [info]   at 
> org.apache.spark.storage.BlockManagerSuite$$anonfun$14.apply(BlockManagerSuite.scala:1323)
> [info]   at 
> org.apache.spark.storage.BlockManagerSuite$$anonfun$14.apply(BlockManagerSuite.scala:1323)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22297) Flaky test: BlockManagerSuite "Shuffle registration timeout and maxAttempts conf"

2018-01-24 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-22297.

   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 19671
[https://github.com/apache/spark/pull/19671]

> Flaky test: BlockManagerSuite "Shuffle registration timeout and maxAttempts 
> conf"
> -
>
> Key: SPARK-22297
> URL: https://issues.apache.org/jira/browse/SPARK-22297
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Priority: Minor
> Fix For: 2.4.0
>
>
> Ran into this locally; the test code seems to use timeouts which generally 
> end up in flakiness like this.
> {noformat}
> [info] - SPARK-20640: Shuffle registration timeout and maxAttempts conf are 
> working *** FAILED *** (1 second, 203 milliseconds)
> [info]   "Unable to register with external shuffle server due to : 
> java.util.concurrent.TimeoutException: Timeout waiting for task." did not 
> contain "test_spark_20640_try_again" (BlockManagerSuite.scala:1370)
> [info]   org.scalatest.exceptions.TestFailedException:
> [info]   at 
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528)
> [info]   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
> [info]   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501)
> [info]   at 
> org.apache.spark.storage.BlockManagerSuite$$anonfun$14.apply$mcV$sp(BlockManagerSuite.scala:1370)
> [info]   at 
> org.apache.spark.storage.BlockManagerSuite$$anonfun$14.apply(BlockManagerSuite.scala:1323)
> [info]   at 
> org.apache.spark.storage.BlockManagerSuite$$anonfun$14.apply(BlockManagerSuite.scala:1323)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23020) Re-enable Flaky Test: org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher

2018-01-24 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338023#comment-16338023
 ] 

Marcelo Vanzin commented on SPARK-23020:


Argh. Feel free to disable it in branch-2.3; please leave it enabled on master 
so we can get more info while I look at it.

> Re-enable Flaky Test: 
> org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher
> 
>
> Key: SPARK-23020
> URL: https://issues.apache.org/jira/browse/SPARK-23020
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.3.0
>Reporter: Sameer Agarwal
>Assignee: Marcelo Vanzin
>Priority: Blocker
> Fix For: 2.3.0
>
>
> https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.3-test-maven-hadoop-2.7/42/testReport/junit/org.apache.spark.launcher/SparkLauncherSuite/testInProcessLauncher/history/



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23020) Re-enable Flaky Test: org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher

2018-01-24 Thread Sameer Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338016#comment-16338016
 ] 

Sameer Agarwal commented on SPARK-23020:


FYI, the {{SparkLauncherSuite}} test is still failing occasionally (though a 
lot less commonly): 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.6/142/testReport/org.apache.spark.launcher/SparkLauncherSuite/testInProcessLauncher/history/

> Re-enable Flaky Test: 
> org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher
> 
>
> Key: SPARK-23020
> URL: https://issues.apache.org/jira/browse/SPARK-23020
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.3.0
>Reporter: Sameer Agarwal
>Assignee: Marcelo Vanzin
>Priority: Blocker
> Fix For: 2.3.0
>
>
> https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.3-test-maven-hadoop-2.7/42/testReport/junit/org.apache.spark.launcher/SparkLauncherSuite/testInProcessLauncher/history/



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23152) Invalid guard condition in org.apache.spark.ml.classification.Classifier

2018-01-24 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-23152:
-

Assignee: Matthew Tovbin

> Invalid guard condition in org.apache.spark.ml.classification.Classifier
> 
>
> Key: SPARK-23152
> URL: https://issues.apache.org/jira/browse/SPARK-23152
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.3.0, 
> 2.3.1
>Reporter: Matthew Tovbin
>Assignee: Matthew Tovbin
>Priority: Minor
>  Labels: easyfix
> Fix For: 2.4.0
>
>
> When fitting a classifier that extends 
> "org.apache.spark.ml.classification.Classifier" (NaiveBayes, 
> DecisionTreeClassifier, RandomForestClassifier) on an empty dataset, a 
> misleading NullPointerException is thrown.
> Steps to reproduce: 
> {code:java}
> val data = spark.createDataset(Seq.empty[(Double, 
> org.apache.spark.ml.linalg.Vector)])
> new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data)
> {code}
>  The error: 
> {code:java}
> java.lang.NullPointerException: Value at index 0 is null
> at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472)
> at org.apache.spark.sql.Row$class.getDouble(Row.scala:248)
> at 
> org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165)
> at 
> org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115)
> at 
> org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102)
> at 
> org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45)
> at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code}
>   
> The problem happens due to an incorrect guard condition in function 
> getNumClasses at org.apache.spark.ml.classification.Classifier:106
> {code:java}
> val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1)
> if (maxLabelRow.isEmpty) {
>   throw new SparkException("ML algorithm was given empty dataset.")
> }
> {code}
> When the input data is empty, the resulting "maxLabelRow" array is not empty. 
> Instead it contains a single Row(null) element.
>  
> Proposed solution: the condition can be modified to also verify that the 
> value is not null.
> {code:java}
> if (maxLabelRow.isEmpty || maxLabelRow(0).get(0) == null) {
>   throw new SparkException("ML algorithm was given empty dataset.")
> }
> {code}
>  
>  
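
The surprising part is that a global aggregate over an empty Dataset still returns one row. A 
minimal sketch of that behaviour, runnable in spark-shell where `spark` is the active 
SparkSession (the column names here are made up and a plain Double stands in for the features 
Vector, since only the max-over-empty-input behaviour matters):

{code:scala}
// In spark-shell: an aggregate over an empty Dataset still returns one row,
// whose value is null, which is why the isEmpty check alone does not catch
// the empty-input case.
import org.apache.spark.sql.functions.max
import spark.implicits._

val empty = Seq.empty[(Double, Double)].toDF("label", "features")
val maxLabelRow = empty.select(max("label")).take(1)
// maxLabelRow.isEmpty == false; maxLabelRow(0).get(0) == null
{code}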



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23152) Invalid guard condition in org.apache.spark.ml.classification.Classifier

2018-01-24 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-23152.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 20321
[https://github.com/apache/spark/pull/20321]

> Invalid guard condition in org.apache.spark.ml.classification.Classifier
> 
>
> Key: SPARK-23152
> URL: https://issues.apache.org/jira/browse/SPARK-23152
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.3.0, 
> 2.3.1
>Reporter: Matthew Tovbin
>Assignee: Matthew Tovbin
>Priority: Minor
>  Labels: easyfix
> Fix For: 2.4.0
>
>
> When fitting a classifier that extends 
> "org.apache.spark.ml.classification.Classifier" (NaiveBayes, 
> DecisionTreeClassifier, RandomForestClassifier) on an empty dataset, a 
> misleading NullPointerException is thrown.
> Steps to reproduce: 
> {code:java}
> val data = spark.createDataset(Seq.empty[(Double, 
> org.apache.spark.ml.linalg.Vector)])
> new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data)
> {code}
>  The error: 
> {code:java}
> java.lang.NullPointerException: Value at index 0 is null
> at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472)
> at org.apache.spark.sql.Row$class.getDouble(Row.scala:248)
> at 
> org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165)
> at 
> org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115)
> at 
> org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102)
> at 
> org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45)
> at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code}
>   
> The problem happens due to an incorrect guard condition in function 
> getNumClasses at org.apache.spark.ml.classification.Classifier:106
> {code:java}
> val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1)
> if (maxLabelRow.isEmpty) {
>   throw new SparkException("ML algorithm was given empty dataset.")
> }
> {code}
> When the input data is empty, the resulting "maxLabelRow" array is not empty. 
> Instead it contains a single Row(null) element.
>  
> Proposed solution: the condition can be modified to also verify that the 
> value is not null.
> {code:java}
> if (maxLabelRow.isEmpty || maxLabelRow(0).get(0) == null) {
>   throw new SparkException("ML algorithm was given empty dataset.")
> }
> {code}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22837) Session timeout checker does not work in SessionManager

2018-01-24 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-22837.
-
   Resolution: Fixed
 Assignee: zuotingbing
Fix Version/s: 2.3.0

> Session timeout checker does not work in SessionManager
> ---
>
> Key: SPARK-22837
> URL: https://issues.apache.org/jira/browse/SPARK-22837
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.2.1
>Reporter: zuotingbing
>Assignee: zuotingbing
>Priority: Major
> Fix For: 2.3.0
>
>
> Currently, 
> {code:java}
> SessionManager.init
> {code}
>  will not be called, so the configs 
> {code:java}
> HIVE_SERVER2_SESSION_CHECK_INTERVAL HIVE_SERVER2_IDLE_SESSION_TIMEOUT 
> HIVE_SERVER2_IDLE_SESSION_CHECK_OPERATION
> {code}
> for the session timeout checker cannot be loaded, which causes the session 
> timeout checker not to work.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23189) reflect stage level blacklisting on executor tab

2018-01-24 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338005#comment-16338005
 ] 

Thomas Graves commented on SPARK-23189:
---

For large jobs the specific stage page is a pain to navigate. Most of the time 
when I go there I immediately shrink the executors table just to be able to see 
the task table more easily. All of those pages are a pain to navigate and find 
things in, in my opinion, which is why we had a PR to change to datatables, but 
it got back-burnered at vanzin's request because of changes in the history 
server. That work is done now, so the datatables change could be picked up 
again, but I don't have time at the moment. But that is just one reason.

Most of the time I can take a quick look at the job or stages tab (list of all 
stages) to see what is happening at a high level, jump to the executors tab to 
see what is going on there at a high level, and then only if needed load the 
specific stage page. It might just be the way I use it. It also depends on what 
the user is complaining about as to where I go first. I think if we fixed the 
stage page to be more usable that might change. Like I already said, I'm fine 
with waiting on this.

> reflect stage level blacklisting on executor tab 
> -
>
> Key: SPARK-23189
> URL: https://issues.apache.org/jira/browse/SPARK-23189
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.1
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> This issue came up while working on SPARK-22577, where the conclusion was that 
> not only should the stage tab reflect stage and application level blacklisting, 
> but the executor tab should also be extended with stage level blacklisting 
> information.
> As [~irashid] and [~tgraves] discussed, the blacklisted stages should be 
> listed for an executor like "*stage[ , ,...]*". One idea was to list only the 
> 3 most recent blacklisted stages; another was to list all the active 
> stages which are blacklisted.  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23189) reflect stage level blacklisting on executor tab

2018-01-24 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16337957#comment-16337957
 ] 

Imran Rashid commented on SPARK-23189:
--

[~tgraves] -- why do you use the executors page instead of the stage page for 
the active stage?  Is it because you're typically running lots of jobs 
simultaneously?  Or is it just because it's easier to navigate directly to the 
executors page -- no need to have a jobId / stageId?

If it's just about having a convenient spot to navigate to, we could easily add 
a "lastStage" / "lastJob" redirect; that would be pretty trivial.  Or maybe 
there is some other summary view we're missing to capture the current state of 
the cluster -- the "executors" page may be close to what you want to know, but 
perhaps we're actually missing something else.

I'm not trying to block this change or anything, just want to avoid UI clutter 
and think a bit about the right way to add this.

> reflect stage level blacklisting on executor tab 
> -
>
> Key: SPARK-23189
> URL: https://issues.apache.org/jira/browse/SPARK-23189
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.1
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> This issue came up while working on SPARK-22577, where the conclusion was that 
> not only should the stage tab reflect stage and application level blacklisting, 
> but the executor tab should also be extended with stage level blacklisting 
> information.
> As [~irashid] and [~tgraves] discussed, the blacklisted stages should be 
> listed for an executor like "*stage[ , ,...]*". One idea was to list only the 
> 3 most recent blacklisted stages; another was to list all the active 
> stages which are blacklisted.  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23115) SparkR 2.3 QA: New R APIs and API docs

2018-01-24 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-23115.
--
Resolution: Fixed
  Assignee: Felix Cheung

> SparkR 2.3 QA: New R APIs and API docs
> --
>
> Key: SPARK-23115
> URL: https://issues.apache.org/jira/browse/SPARK-23115
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Assignee: Felix Cheung
>Priority: Blocker
>
> Audit new public R APIs.  Take note of:
> * Correctness and uniformity of API
> * Documentation: Missing?  Bad links or formatting?
> ** Check both the generated docs linked from the user guide and the R command 
> line docs `?read.df`. These are generated using roxygen.
> As you find issues, please create JIRAs and link them to this issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23115) SparkR 2.3 QA: New R APIs and API docs

2018-01-24 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16337947#comment-16337947
 ] 

Felix Cheung commented on SPARK-23115:
--

done

> SparkR 2.3 QA: New R APIs and API docs
> --
>
> Key: SPARK-23115
> URL: https://issues.apache.org/jira/browse/SPARK-23115
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Priority: Blocker
>
> Audit new public R APIs.  Take note of:
> * Correctness and uniformity of API
> * Documentation: Missing?  Bad links or formatting?
> ** Check both the generated docs linked from the user guide and the R command 
> line docs `?read.df`. These are generated using roxygen.
> As you find issues, please create JIRAs and link them to this issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22577) executor page blacklist status should update with TaskSet level blacklisting

2018-01-24 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid reassigned SPARK-22577:


Assignee: Attila Zsolt Piros

> executor page blacklist status should update with TaskSet level blacklisting
> 
>
> Key: SPARK-22577
> URL: https://issues.apache.org/jira/browse/SPARK-22577
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.1.1
>Reporter: Thomas Graves
>Assignee: Attila Zsolt Piros
>Priority: Major
> Fix For: 2.4.0
>
> Attachments: app_blacklisting.png, node_blacklisting_for_stage.png, 
> stage_blacklisting.png
>
>
> Right now the executor blacklist status only updates with the 
> BlacklistTracker after a task set has finished and propagated the 
> blacklisting to the application level. We should change that to show 
> blacklisting at the task set level as well. Without this it can be very 
> confusing to the user why things aren't running.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22577) executor page blacklist status should update with TaskSet level blacklisting

2018-01-24 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid resolved SPARK-22577.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 20203
[https://github.com/apache/spark/pull/20203]

> executor page blacklist status should update with TaskSet level blacklisting
> 
>
> Key: SPARK-22577
> URL: https://issues.apache.org/jira/browse/SPARK-22577
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.1.1
>Reporter: Thomas Graves
>Assignee: Attila Zsolt Piros
>Priority: Major
> Fix For: 2.4.0
>
> Attachments: app_blacklisting.png, node_blacklisting_for_stage.png, 
> stage_blacklisting.png
>
>
> Right now the executor blacklist status only updates with the 
> BlacklistTracker after a task set has finished and propagated the 
> blacklisting to the application level. We should change that to show 
> blacklisting at the task set level as well. Without this it can be very 
> confusing to the user why things aren't running.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23202) Break down DataSourceV2Writer.commit into two phase

2018-01-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16337927#comment-16337927
 ] 

Apache Spark commented on SPARK-23202:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/20386

> Break down DataSourceV2Writer.commit into two phase
> ---
>
> Key: SPARK-23202
> URL: https://issues.apache.org/jira/browse/SPARK-23202
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Gengliang Wang
>Priority: Major
>
> Currently, the API DataSourceV2Writer#commit(WriterCommitMessage[]) commits a 
> writing job with a list of commit messages.
> It makes sense in some scenarios, e.g. MicroBatchExecution.
> However, on receiving a commit message, the driver can start processing 
> messages (e.g. persisting messages into files) before all the messages are 
> collected.
> The proposal is to break down DataSourceV2Writer.commit into two phases:
>  # add(WriterCommitMessage message): Handles a commit message produced by 
> \{@link DataWriter#commit()}.
>  # commit(): Commits the writing job.
> This should make the API more flexible, and more reasonable for implementing 
> some data sources.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23202) Break down DataSourceV2Writer.commit into two phase

2018-01-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23202:


Assignee: (was: Apache Spark)

> Break down DataSourceV2Writer.commit into two phase
> ---
>
> Key: SPARK-23202
> URL: https://issues.apache.org/jira/browse/SPARK-23202
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Gengliang Wang
>Priority: Major
>
> Currently, the API DataSourceV2Writer#commit(WriterCommitMessage[]) commits a 
> writing job with a list of commit messages.
> It makes sense in some scenarios, e.g. MicroBatchExecution.
> However, on receiving a commit message, the driver can start processing 
> messages (e.g. persisting messages into files) before all the messages are 
> collected.
> The proposal is to break down DataSourceV2Writer.commit into two phases:
>  # add(WriterCommitMessage message): Handles a commit message produced by 
> \{@link DataWriter#commit()}.
>  # commit(): Commits the writing job.
> This should make the API more flexible, and more reasonable for implementing 
> some data sources.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23202) Break down DataSourceV2Writer.commit into two phase

2018-01-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23202:


Assignee: Apache Spark

> Break down DataSourceV2Writer.commit into two phase
> ---
>
> Key: SPARK-23202
> URL: https://issues.apache.org/jira/browse/SPARK-23202
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Major
>
> Currently, the API DataSourceV2Writer#commit(WriterCommitMessage[]) commits a 
> writing job with a list of commit messages.
> It makes sense in some scenarios, e.g. MicroBatchExecution.
> However, on receiving a commit message, the driver can start processing 
> messages (e.g. persisting messages into files) before all the messages are 
> collected.
> The proposal is to break down DataSourceV2Writer.commit into two phases:
>  # add(WriterCommitMessage message): Handles a commit message produced by 
> \{@link DataWriter#commit()}.
>  # commit(): Commits the writing job.
> This should make the API more flexible, and more reasonable for implementing 
> some data sources.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23202) Break down DataSourceV2Writer.commit into two phase

2018-01-24 Thread Gengliang Wang (JIRA)
Gengliang Wang created SPARK-23202:
--

 Summary: Break down DataSourceV2Writer.commit into two phase
 Key: SPARK-23202
 URL: https://issues.apache.org/jira/browse/SPARK-23202
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.2.1
Reporter: Gengliang Wang


Currently, the API DataSourceV2Writer#commit(WriterCommitMessage[]) commits a 
writing job with a list of commit messages.

It makes sense in some scenarios, e.g. MicroBatchExecution.

However, on receiving a commit message, the driver can start processing 
messages (e.g. persisting messages into files) before all the messages are 
collected.

The proposal is to break down DataSourceV2Writer.commit into two phases:
 # add(WriterCommitMessage message): Handles a commit message produced by 
\{@link DataWriter#commit()}.
 # commit(): Commits the writing job.

This should make the API more flexible, and more reasonable for implementing 
some data sources.
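
A rough Scala sketch of the proposed two-phase shape, for illustration only; the trait and 
class names below are placeholders, not the actual DataSourceV2 writer interfaces:

{code:scala}
// Placeholder traits sketching the two phases described above; these are not
// the real Spark interfaces, just an outline of the proposed shape.
trait WriterCommitMessage

trait TwoPhaseWriter {
  /** Phase 1: handle one commit message as soon as the driver receives it. */
  def add(message: WriterCommitMessage): Unit

  /** Phase 2: commit the writing job once all messages have been added. */
  def commit(): Unit
}

// Example implementation that could persist each message incrementally
// (here it just buffers them) and then finalizes the job in commit().
class BufferingWriter extends TwoPhaseWriter {
  private val received = scala.collection.mutable.ArrayBuffer.empty[WriterCommitMessage]

  override def add(message: WriterCommitMessage): Unit = received += message

  override def commit(): Unit =
    println(s"committing job with ${received.size} task commit messages")
}
{code}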



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22711) _pickle.PicklingError: args[0] from __newobj__ args has the wrong class from cloudpickle.py

2018-01-24 Thread Prateek (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16337875#comment-16337875
 ] 

Prateek commented on SPARK-22711:
-

I don't understand what you mean. The updated code is in the attachment, [~seemab].

> _pickle.PicklingError: args[0] from __newobj__ args has the wrong class from 
> cloudpickle.py
> ---
>
> Key: SPARK-22711
> URL: https://issues.apache.org/jira/browse/SPARK-22711
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Submit
>Affects Versions: 2.2.0, 2.2.1
> Environment: Ubuntu pseudo distributed installation of Spark 2.2.0
>Reporter: Prateek
>Priority: Major
> Attachments: Jira_Spark_minimized_code.py
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> When I submit a PySpark program with the spark-submit command this error is 
> thrown.
> It happens for code like the following:
> RDD2 = RDD1.map(lambda m: function_x(m)).reduceByKey(lambda c,v :c+v)
> or 
> RDD2 = RDD1.flatMap(lambda m: function_x(m)).reduceByKey(lambda c,v :c+v)
> or
> RDD2 = RDD1.flatMap(lambda m: function_x(m)).reduce(lambda c,v :c+v)
> Traceback (most recent call last):
>   File "/home/prateek/Project/textrank.py", line 299, in 
> summaryRDD = sentenceTokensReduceRDD.map(lambda m: 
> get_summary(m)).reduceByKey(lambda c,v :c+v)
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1608, 
> in reduceByKey
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1846, 
> in combineByKey
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1783, 
> in partitionBy
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2455, 
> in _jrdd
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2388, 
> in _wrap_function
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2374, 
> in _prepare_for_python_RDD
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 
> 460, in dumps
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 704, in dumps
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 148, in dump
>   File "/usr/lib/python3.5/pickle.py", line 408, in dump
> self.save(obj)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 740, in save_tuple
> save(element)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 255, in save_function
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 292, in save_function_tuple
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 725, in save_tuple
> save(element)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 770, in save_list
> self._batch_appends(obj)
>   File "/usr/lib/python3.5/pickle.py", line 794, in _batch_appends
> save(x)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 255, in save_function
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 292, in save_function_tuple
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 725, in save_tuple
> save(element)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 770, in save_list
> self._batch_appends(obj)
>   File "/usr/lib/python3.5/pickle.py", line 794, in _batch_appends
> save(x)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 255, in save_function
>   File "/usr/local/spark/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 
> 292, in save_function_tuple
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f(self, obj) # Call unbound method with explicit self
>   File "/usr/lib/python3.5/pickle.py", line 725, in save_tuple
> save(element)
>   File "/usr/lib/python3.5/pickle.py", line 475, in save
> f

[jira] [Assigned] (SPARK-21396) Spark Hive Thriftserver doesn't return UDT field

2018-01-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21396:


Assignee: Apache Spark

> Spark Hive Thriftserver doesn't return UDT field
> 
>
> Key: SPARK-21396
> URL: https://issues.apache.org/jira/browse/SPARK-21396
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Haopu Wang
>Assignee: Apache Spark
>Priority: Major
>  Labels: Hive, ThriftServer2, user-defined-type
>
> I want to query a table with an MLlib Vector field and get the exception below.
> Can Spark Hive Thriftserver be enhanced to return UDT fields?
> ==
> 2017-07-13 13:14:25,435 WARN  
> [org.apache.hive.service.cli.thrift.ThriftCLIService] 
> (HiveServer2-Handler-Pool: Thread-18537;) Error fetching results: 
> java.lang.RuntimeException: scala.MatchError: 
> org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 (of class 
> org.apache.spark.ml.linalg.VectorUDT)
>   at 
> org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:83)
>   at 
> org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:36)
>   at 
> org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:63)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556)
>   at 
> org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:59)
>   at com.sun.proxy.$Proxy29.fetchResults(Unknown Source)
>   at 
> org.apache.hive.service.cli.CLIService.fetchResults(CLIService.java:454)
>   at 
> org.apache.hive.service.cli.thrift.ThriftCLIService.FetchResults(ThriftCLIService.java:621)
>   at 
> org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1553)
>   at 
> org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1538)
>   at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
>   at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
>   at 
> org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:56)
>   at 
> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: scala.MatchError: org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 
> (of class org.apache.spark.ml.linalg.VectorUDT)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.addNonNullColumnValue(SparkExecuteStatementOperation.scala:80)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.getNextRowSet(SparkExecuteStatementOperation.scala:144)
>   at 
> org.apache.hive.service.cli.operation.OperationManager.getOperationNextRowSet(OperationManager.java:220)
>   at 
> org.apache.hive.service.cli.session.HiveSessionImpl.fetchResults(HiveSessionImpl.java:685)
>   at sun.reflect.GeneratedMethodAccessor54.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:78)
>   ... 18 more



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23195) Hint of cached data is lost

2018-01-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23195:


Assignee: Apache Spark  (was: Xiao Li)

> Hint of cached data is lost
> ---
>
> Key: SPARK-23195
> URL: https://issues.apache.org/jira/browse/SPARK-23195
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>Priority: Major
>
> {noformat}
> withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") {
>   val df1 = spark.createDataFrame(Seq((1, "4"), (2, "2"))).toDF("key", 
> "value")
>   val df2 = spark.createDataFrame(Seq((1, "1"), (2, "2"))).toDF("key", 
> "value")
>   broadcast(df2).cache()
>   df2.collect()
>   val df3 = df1.join(df2, Seq("key"), "inner")
>   val numBroadCastHashJoin = df3.queryExecution.executedPlan.collect {
> case b: BroadcastHashJoinExec => b
>   }.size
>   assert(numBroadCastHashJoin === 1)
> }
> {noformat}
> The broadcast hint is not respected.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23195) Hint of cached data is lost

2018-01-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23195:


Assignee: Xiao Li  (was: Apache Spark)

> Hint of cached data is lost
> ---
>
> Key: SPARK-23195
> URL: https://issues.apache.org/jira/browse/SPARK-23195
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Major
>
> {noformat}
> withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") {
>   val df1 = spark.createDataFrame(Seq((1, "4"), (2, "2"))).toDF("key", 
> "value")
>   val df2 = spark.createDataFrame(Seq((1, "1"), (2, "2"))).toDF("key", 
> "value")
>   broadcast(df2).cache()
>   df2.collect()
>   val df3 = df1.join(df2, Seq("key"), "inner")
>   val numBroadCastHashJoin = df3.queryExecution.executedPlan.collect {
> case b: BroadcastHashJoinExec => b
>   }.size
>   assert(numBroadCastHashJoin === 1)
> }
> {noformat}
> The broadcast hint is not respected.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21396) Spark Hive Thriftserver doesn't return UDT field

2018-01-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21396:


Assignee: (was: Apache Spark)

> Spark Hive Thriftserver doesn't return UDT field
> 
>
> Key: SPARK-21396
> URL: https://issues.apache.org/jira/browse/SPARK-21396
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Haopu Wang
>Priority: Major
>  Labels: Hive, ThriftServer2, user-defined-type
>
> I want to query a table with an MLlib Vector field and get the exception below.
> Can the Spark Hive Thriftserver be enhanced to return UDT fields?
> ==
> 2017-07-13 13:14:25,435 WARN  
> [org.apache.hive.service.cli.thrift.ThriftCLIService] 
> (HiveServer2-Handler-Pool: Thread-18537;) Error fetching results: 
> java.lang.RuntimeException: scala.MatchError: 
> org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 (of class 
> org.apache.spark.ml.linalg.VectorUDT)
>   at 
> org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:83)
>   at 
> org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:36)
>   at 
> org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:63)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556)
>   at 
> org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:59)
>   at com.sun.proxy.$Proxy29.fetchResults(Unknown Source)
>   at 
> org.apache.hive.service.cli.CLIService.fetchResults(CLIService.java:454)
>   at 
> org.apache.hive.service.cli.thrift.ThriftCLIService.FetchResults(ThriftCLIService.java:621)
>   at 
> org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1553)
>   at 
> org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1538)
>   at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
>   at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
>   at 
> org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:56)
>   at 
> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: scala.MatchError: org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 
> (of class org.apache.spark.ml.linalg.VectorUDT)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.addNonNullColumnValue(SparkExecuteStatementOperation.scala:80)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.getNextRowSet(SparkExecuteStatementOperation.scala:144)
>   at 
> org.apache.hive.service.cli.operation.OperationManager.getOperationNextRowSet(OperationManager.java:220)
>   at 
> org.apache.hive.service.cli.session.HiveSessionImpl.fetchResults(HiveSessionImpl.java:685)
>   at sun.reflect.GeneratedMethodAccessor54.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:78)
>   ... 18 more



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21396) Spark Hive Thriftserver doesn't return UDT field

2018-01-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16337820#comment-16337820
 ] 

Apache Spark commented on SPARK-21396:
--

User 'atallahhezbor' has created a pull request for this issue:
https://github.com/apache/spark/pull/20385

> Spark Hive Thriftserver doesn't return UDT field
> 
>
> Key: SPARK-21396
> URL: https://issues.apache.org/jira/browse/SPARK-21396
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Haopu Wang
>Priority: Major
>  Labels: Hive, ThriftServer2, user-defined-type
>
> I want to query a table with an MLlib Vector field and get the exception below.
> Can the Spark Hive Thriftserver be enhanced to return UDT fields?
> ==
> 2017-07-13 13:14:25,435 WARN  
> [org.apache.hive.service.cli.thrift.ThriftCLIService] 
> (HiveServer2-Handler-Pool: Thread-18537;) Error fetching results: 
> java.lang.RuntimeException: scala.MatchError: 
> org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 (of class 
> org.apache.spark.ml.linalg.VectorUDT)
>   at 
> org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:83)
>   at 
> org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:36)
>   at 
> org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:63)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556)
>   at 
> org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:59)
>   at com.sun.proxy.$Proxy29.fetchResults(Unknown Source)
>   at 
> org.apache.hive.service.cli.CLIService.fetchResults(CLIService.java:454)
>   at 
> org.apache.hive.service.cli.thrift.ThriftCLIService.FetchResults(ThriftCLIService.java:621)
>   at 
> org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1553)
>   at 
> org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1538)
>   at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
>   at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
>   at 
> org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:56)
>   at 
> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: scala.MatchError: org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 
> (of class org.apache.spark.ml.linalg.VectorUDT)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.addNonNullColumnValue(SparkExecuteStatementOperation.scala:80)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.getNextRowSet(SparkExecuteStatementOperation.scala:144)
>   at 
> org.apache.hive.service.cli.operation.OperationManager.getOperationNextRowSet(OperationManager.java:220)
>   at 
> org.apache.hive.service.cli.session.HiveSessionImpl.fetchResults(HiveSessionImpl.java:685)
>   at sun.reflect.GeneratedMethodAccessor54.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:78)
>   ... 18 more
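
Until the Thriftserver maps UDTs itself, a client-side stopgap is possible. The 
sketch below is only illustrative (the table name ml_table and column name 
features are assumptions, not from this issue): it renders the Vector column as 
a string so JDBC clients see a plain VARCHAR instead of a UDT.

{code:scala}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

object UdtWorkaroundSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("udt-workaround").enableHiveSupport().getOrCreate()

    // Render the ml Vector as text so no VectorUDT ever reaches the Thrift result set.
    val vecToString = udf((v: Vector) => if (v == null) null else v.toString)

    spark.table("ml_table")                                     // assumed table with a Vector column
      .withColumn("features_str", vecToString(col("features")))
      .drop("features")
      .createOrReplaceTempView("ml_table_plain")                // query this view over JDBC instead
  }
}
{code}

Whether such a temp view is visible to an external JDBC client depends on how the 
Thriftserver shares its session, so this is a stopgap rather than a fix for the 
MatchError above.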



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-23195) Hint of cached data is lost

2018-01-24 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reopened SPARK-23195:
---

Since the fix has been reverted, let's keep this open until the ongoing PR is 
merged back.

> Hint of cached data is lost
> ---
>
> Key: SPARK-23195
> URL: https://issues.apache.org/jira/browse/SPARK-23195
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Major
>
> {noformat}
> withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") {
>   val df1 = spark.createDataFrame(Seq((1, "4"), (2, "2"))).toDF("key", 
> "value")
>   val df2 = spark.createDataFrame(Seq((1, "1"), (2, "2"))).toDF("key", 
> "value")
>   broadcast(df2).cache()
>   df2.collect()
>   val df3 = df1.join(df2, Seq("key"), "inner")
>   val numBroadCastHashJoin = df3.queryExecution.executedPlan.collect {
> case b: BroadcastHashJoinExec => b
>   }.size
>   assert(numBroadCastHashJoin === 1)
> }
> {noformat}
> The broadcast hint is not respected.
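
Until the merged fix is back in, a minimal workaround sketch, reusing df1 and df2 
from the snippet above (and assuming the same autoBroadcastJoinThreshold=-1 
setting): attach the hint at join time, after the cache, instead of relying on 
the hint that was applied before caching.

{code:scala}
import org.apache.spark.sql.execution.joins.BroadcastHashJoinExec
import org.apache.spark.sql.functions.broadcast

// df2 was cached above; re-applying the hint on the join input puts it on the
// analyzed join plan itself rather than on the plan that the cache replaced.
val df3 = df1.join(broadcast(df2), Seq("key"), "inner")

val numBroadcastHashJoin = df3.queryExecution.executedPlan.collect {
  case b: BroadcastHashJoinExec => b
}.size
assert(numBroadcastHashJoin == 1)  // the hint is respected when applied at join time
{code}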



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23195) Hint of cached data is lost

2018-01-24 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-23195:
--
Fix Version/s: (was: 2.3.1)

> Hint of cached data is lost
> ---
>
> Key: SPARK-23195
> URL: https://issues.apache.org/jira/browse/SPARK-23195
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Major
>
> {noformat}
> withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") {
>   val df1 = spark.createDataFrame(Seq((1, "4"), (2, "2"))).toDF("key", 
> "value")
>   val df2 = spark.createDataFrame(Seq((1, "1"), (2, "2"))).toDF("key", 
> "value")
>   broadcast(df2).cache()
>   df2.collect()
>   val df3 = df1.join(df2, Seq("key"), "inner")
>   val numBroadCastHashJoin = df3.queryExecution.executedPlan.collect {
> case b: BroadcastHashJoinExec => b
>   }.size
>   assert(numBroadCastHashJoin === 1)
> }
> {noformat}
> The broadcast hint is not respected.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13108) Encoding not working with non-ascii compatible encodings (UTF-16/32 etc.)

2018-01-24 Thread Rafael Cavazin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16337775#comment-16337775
 ] 

Rafael Cavazin commented on SPARK-13108:


[~hyukjin.kwon] Was this issue fixed? The PR was closed.

> Encoding not working with non-ascii compatible encodings (UTF-16/32 etc.)
> -
>
> Key: SPARK-13108
> URL: https://issues.apache.org/jira/browse/SPARK-13108
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> This library uses Hadoop's 
> [{{TextInputFormat}}|https://github.com/apache/hadoop/blob/master/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/TextInputFormat.java],
>  which uses 
> [{{LineRecordReader}}|https://github.com/apache/hadoop/blob/master/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/LineRecordReader.java].
> According to 
> [MAPREDUCE-232|https://issues.apache.org/jira/browse/MAPREDUCE-232], it looks like 
> [{{TextInputFormat}}|https://github.com/apache/hadoop/blob/master/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/TextInputFormat.java]
>  does not guarantee all encoding types but officially only UTF-8 (as 
> commented in 
> [{{LineRecordReader#L147}}|https://github.com/apache/hadoop/blob/master/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/LineRecordReader.java#L147]).
> According to 
> [MAPREDUCE-232#comment-13183601|https://issues.apache.org/jira/browse/MAPREDUCE-232?focusedCommentId=13183601&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13183601],
>  it still looks fine with most encodings, though not with UTF-16/32.
> In more detail:
> I tested this in Mac OS. I converted `cars_iso-8859-1.csv` into 
> `cars_utf-16.csv` as below:
> {code}
> iconv -f iso-8859-1 -t utf-16 < cars_iso-8859-1.csv > cars_utf-16.csv
> {code}
> and run the codes below:
> {code}
> val cars = "cars_utf-16.csv"
> sqlContext.read
>   .format("csv")
>   .option("charset", "utf-16")
>   .option("delimiter", 'þ')
>   .load(cars)
>   .show()
> {code}
> This produces the wrong results below:
> {code}
> +----+-----+-----+--------------------+------+
> |year| make|model|             comment|blank�|
> +----+-----+-----+--------------------+------+
> |2012|Tesla|    S|          No comment|     �|
> |   �| null| null|                null|  null|
> |1997| Ford| E350|Go get one now th...|     �|
> |2015|Chevy|Volt�|                null|  null|
> |   �| null| null|                null|  null|
> +----+-----+-----+--------------------+------+
> {code}
> Instead of the correct results below:
> {code}
> +----+-----+-----+--------------------+-----+
> |year| make|model|             comment|blank|
> +----+-----+-----+--------------------+-----+
> |2012|Tesla|    S|          No comment|     |
> |1997| Ford| E350|Go get one now th...|     |
> |2015|Chevy| Volt|                null| null|
> +----+-----+-----+--------------------+-----+
> {code}
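
A possible stopgap, sketched below under the assumption that the file is small 
enough to transcode on the driver (file paths are illustrative): convert the 
UTF-16 file to UTF-8 before handing it to the Hadoop-backed csv source, since 
TextInputFormat only splits reliably on ASCII-compatible encodings.

{code:scala}
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

object TranscodeThenRead {
  def main(args: Array[String]): Unit = {
    // Decode with the declared charset, re-encode as UTF-8 (driver-side, small files only).
    val bytes = Files.readAllBytes(Paths.get("cars_utf-16.csv"))
    val utf8  = new String(bytes, StandardCharsets.UTF_16).getBytes(StandardCharsets.UTF_8)
    Files.write(Paths.get("cars_utf-8.csv"), utf8)

    // The UTF-8 copy can then be read as in the snippet above, e.g.:
    // sqlContext.read.format("csv").option("delimiter", "þ").load("cars_utf-8.csv").show()
  }
}
{code}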



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23195) Hint of cached data is lost

2018-01-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16337771#comment-16337771
 ] 

Apache Spark commented on SPARK-23195:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/20384

> Hint of cached data is lost
> ---
>
> Key: SPARK-23195
> URL: https://issues.apache.org/jira/browse/SPARK-23195
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Major
> Fix For: 2.3.1
>
>
> {noformat}
> withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") {
>   val df1 = spark.createDataFrame(Seq((1, "4"), (2, "2"))).toDF("key", 
> "value")
>   val df2 = spark.createDataFrame(Seq((1, "1"), (2, "2"))).toDF("key", 
> "value")
>   broadcast(df2).cache()
>   df2.collect()
>   val df3 = df1.join(df2, Seq("key"), "inner")
>   val numBroadCastHashJoin = df3.queryExecution.executedPlan.collect {
> case b: BroadcastHashJoinExec => b
>   }.size
>   assert(numBroadCastHashJoin === 1)
> }
> {noformat}
> The broadcast hint is not respected.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15348) Hive ACID

2018-01-24 Thread Arvind Jajoo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16337760#comment-16337760
 ] 

Arvind Jajoo commented on SPARK-15348:
--

I think that in order to have an end-to-end streaming ETL implementation within 
Spark, this feature needs to be supported in Spark SQL now, especially after 
Structured Streaming.

That is, a MERGE INTO statement should be runnable directly from Spark SQL for 
batch or micro-batch incremental updates, as sketched below.

Currently this has to be done outside of Spark using Hive, which breaks the 
end-to-end streaming ETL semantics.
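
For concreteness, this is roughly the statement such a pipeline would issue per 
(micro-)batch. A hedged sketch only: the table and column names are made up for 
illustration, and stock Spark SQL in the versions listed on this issue does not 
accept ACID MERGE, so today the statement has to go through Hive itself.

{code:scala}
// Illustration only: the Hive-side MERGE an end-to-end pipeline would like to
// issue from Spark SQL after each (micro-)batch. Table names are hypothetical.
val mergeSql =
  """MERGE INTO warehouse.orders AS t
    |USING staging.order_updates AS s
    |ON t.order_id = s.order_id
    |WHEN MATCHED THEN UPDATE SET status = s.status
    |WHEN NOT MATCHED THEN INSERT VALUES (s.order_id, s.status)
  """.stripMargin

// spark.sql(mergeSql)  // fails to parse on stock Spark SQL of this era
{code}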

> Hive ACID
> -
>
> Key: SPARK-15348
> URL: https://issues.apache.org/jira/browse/SPARK-15348
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.6.3, 2.0.2, 2.1.2, 2.2.0
>Reporter: Ran Haim
>Priority: Major
>
> Spark does not support any feature of Hive's transactional tables:
> you cannot use Spark to delete/update a table, and it also has problems 
> reading the aggregated data when no compaction has been done.
> Also, it seems that compaction is not supported - alter table ... partition 
>  COMPACT 'major'



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22784) Configure reading buffer size in Spark History Server

2018-01-24 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-22784.
---
Resolution: Won't Fix

> Configure reading buffer size in Spark History Server
> -
>
> Key: SPARK-22784
> URL: https://issues.apache.org/jira/browse/SPARK-22784
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Mikhail Erofeev
>Priority: Minor
> Attachments: replay-baseline.svg
>
>
> Motivation:
> Our Spark History Server spends most of the backfill time inside 
> BufferedReader and StringBuffer. This happens because the average line size of 
> our events is ~1,500,000 chars (due to a lot of partitions and iterations), 
> whereas the default buffer size is 2048 bytes. See the attached flame graph.
> Implementation:
> I've added logging of the time spent and the line size for each job.
> Parametrised ReplayListenerBus with a new buffer-size parameter. 
> Measured the best buffer size: x20 of the average line size (30 MB) gives a 
> 32% speedup in a local test.
> Result:
> Backfill of the Spark History Server and reading into the cache will be up to 
> 30% faster after tuning.
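
For reference, a minimal sketch of the kind of buffering change the description 
measures. This is illustrative, not the actual patch: the helper name and its 
parameters are assumptions; only the x20-of-average-line-size factor comes from 
the text above.

{code:scala}
import java.io.{BufferedReader, InputStream, InputStreamReader}
import java.nio.charset.StandardCharsets

object EventLogReaderSketch {
  // Size the reader to a multiple of the expected event-line length instead of
  // the JDK default, so a single multi-megabyte JSON event line is not assembled
  // from many small refills of the internal buffer.
  def eventLogReader(in: InputStream, avgLineChars: Int): BufferedReader = {
    val bufferChars = math.max(8 * 1024, 20 * avgLineChars) // x20 of the average line size
    new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8), bufferChars)
  }
}
{code}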



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


