[jira] [Commented] (SPARK-21174) Validate sampling fraction in logical operator level

2017-06-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16058869#comment-16058869
 ] 

Apache Spark commented on SPARK-21174:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/18387

> Validate sampling fraction in logical operator level
> 
>
> Key: SPARK-21174
> URL: https://issues.apache.org/jira/browse/SPARK-21174
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 2.2.0
>Reporter: Gengliang Wang
>Priority: Minor
>
> Currently the validation of the sampling fraction in Dataset is incomplete.
> As an improvement, validate the sampling ratio at the logical operator level:
> 1) with replacement: the ratio should be nonnegative
> 2) without replacement: the ratio should be in the interval [0, 1]
> Also add test cases for the validation.
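 
For illustration, here is a minimal sketch of the check being asked for, written as a
standalone Scala class rather than Spark's actual Sample operator (the class and field
names below are assumptions, not the real API):

{code}
// Validate the sampling fraction when the node is constructed, per the rule above.
case class SampleSketch(fraction: Double, withReplacement: Boolean) {
  if (withReplacement) {
    require(fraction >= 0.0,
      s"Sampling fraction ($fraction) must be nonnegative with replacement")
  } else {
    require(fraction >= 0.0 && fraction <= 1.0,
      s"Sampling fraction ($fraction) must be in [0, 1] without replacement")
  }
}
{code}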






[jira] [Resolved] (SPARK-21167) Path is not decoded correctly when reading output of FileSink

2017-06-22 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-21167.
--
   Resolution: Fixed
Fix Version/s: 2.3.0
   2.2.1
   2.1.2

> Path is not decoded correctly when reading output of FileSink
> -
>
> Key: SPARK-21167
> URL: https://issues.apache.org/jira/browse/SPARK-21167
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.1
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.1.2, 2.2.1, 2.3.0
>
>
> When reading the output of FileSink, the path is not decoded correctly, so if 
> the path contains special characters such as spaces, Spark cannot read it.
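 
As a side note, a minimal illustration of the decoding involved (plain JVM URI handling
in Scala, not the FileStreamSink code itself; the example path is invented):

{code}
import java.net.URI

// A path containing a space is percent-encoded in its URI form; reading it back
// requires decoding, otherwise the literal "%20" name is looked up and not found.
object PathDecodingSketch extends App {
  val encoded = "hdfs://host/output/part%20with%20spaces.parquet"
  val decoded = new URI(encoded).getPath  // "/output/part with spaces.parquet"
  println(decoded)
}
{code}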






[jira] [Assigned] (SPARK-21174) Validate sampling fraction in logical operator level

2017-06-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21174:


Assignee: Apache Spark

> Validate sampling fraction in logical operator level
> 
>
> Key: SPARK-21174
> URL: https://issues.apache.org/jira/browse/SPARK-21174
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 2.2.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Minor
>
> Currently the validation of the sampling fraction in Dataset is incomplete.
> As an improvement, validate the sampling ratio at the logical operator level:
> 1) with replacement: the ratio should be nonnegative
> 2) without replacement: the ratio should be in the interval [0, 1]
> Also add test cases for the validation.






[jira] [Assigned] (SPARK-21174) Validate sampling fraction in logical operator level

2017-06-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21174:


Assignee: (was: Apache Spark)

> Validate sampling fraction in logical operator level
> 
>
> Key: SPARK-21174
> URL: https://issues.apache.org/jira/browse/SPARK-21174
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 2.2.0
>Reporter: Gengliang Wang
>Priority: Minor
>
> Currently the validation of the sampling fraction in Dataset is incomplete.
> As an improvement, validate the sampling ratio at the logical operator level:
> 1) with replacement: the ratio should be nonnegative
> 2) without replacement: the ratio should be in the interval [0, 1]
> Also add test cases for the validation.






[jira] [Issue Comment Deleted] (SPARK-21167) Path is not decoded correctly when reading output of FileSink

2017-06-22 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-21167:
-
Comment: was deleted

(was: User 'dijingran' has created a pull request for this issue:
https://github.com/apache/spark/pull/18383)

> Path is not decoded correctly when reading output of FileSink
> -
>
> Key: SPARK-21167
> URL: https://issues.apache.org/jira/browse/SPARK-21167
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.1
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.1.2, 2.2.1, 2.3.0
>
>
> When reading the output of FileSink, the path is not decoded correctly, so if 
> the path contains special characters such as spaces, Spark cannot read it.






[jira] [Updated] (SPARK-21174) Validate sampling fraction in logical operator level

2017-06-22 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-21174:

Component/s: (was: Optimizer)
 SQL

> Validate sampling fraction in logical operator level
> 
>
> Key: SPARK-21174
> URL: https://issues.apache.org/jira/browse/SPARK-21174
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Gengliang Wang
>Priority: Minor
>
> Currently the validation of the sampling fraction in Dataset is incomplete.
> As an improvement, validate the sampling ratio at the logical operator level:
> 1) with replacement: the ratio should be nonnegative
> 2) without replacement: the ratio should be in the interval [0, 1]
> Also add test cases for the validation.






[jira] [Created] (SPARK-21175) Slow down "open blocks" on shuffle service when memory shortage to avoid OOM.

2017-06-22 Thread jin xing (JIRA)
jin xing created SPARK-21175:


 Summary: Slow down "open blocks" on shuffle service when memory 
shortage to avoid OOM.
 Key: SPARK-21175
 URL: https://issues.apache.org/jira/browse/SPARK-21175
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle
Affects Versions: 2.1.1
Reporter: jin xing


A shuffle service can serve blocks from multiple apps/tasks, so it can suffer 
high memory usage when many {{shuffle-read}} requests arrive at the same time. 
In my cluster, OOM always happens on the shuffle service. Analyzing a heap dump, 
the memory held by Netty (chunks) can be up to 2~3 GB. It might make sense to 
reject "open blocks" requests when memory usage on the shuffle service is high.
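 
A minimal sketch of the kind of guard being proposed, assuming a simple heap-usage
threshold (illustrative Scala only, not the external shuffle service code; the
threshold and names are made up):

{code}
// Check JVM heap pressure before serving an "open blocks" request and signal that
// the request should be rejected (or delayed) when usage crosses the threshold.
object OpenBlocksGuard {
  private val highWatermark = 0.90  // assumed threshold, as a fraction of max heap

  private def heapUsageRatio(): Double = {
    val rt = Runtime.getRuntime
    (rt.totalMemory() - rt.freeMemory()).toDouble / rt.maxMemory()
  }

  /** True if an "open blocks" request may proceed, false if it should be rejected. */
  def admitOpenBlocks(): Boolean = heapUsageRatio() < highWatermark
}
{code}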






[jira] [Assigned] (SPARK-21175) Slow down "open blocks" on shuffle service when memory shortage to avoid OOM.

2017-06-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21175:


Assignee: Apache Spark

> Slow down "open blocks" on shuffle service when memory shortage to avoid OOM.
> -
>
> Key: SPARK-21175
> URL: https://issues.apache.org/jira/browse/SPARK-21175
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 2.1.1
>Reporter: jin xing
>Assignee: Apache Spark
>
> A shuffle service can serve blocks from multiple apps/tasks, so it can suffer 
> high memory usage when many {{shuffle-read}} requests arrive at the same time. 
> In my cluster, OOM always happens on the shuffle service. Analyzing a heap dump, 
> the memory held by Netty (chunks) can be up to 2~3 GB. It might make sense to 
> reject "open blocks" requests when memory usage on the shuffle service is high.






[jira] [Assigned] (SPARK-21175) Slow down "open blocks" on shuffle service when memory shortage to avoid OOM.

2017-06-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21175:


Assignee: (was: Apache Spark)

> Slow down "open blocks" on shuffle service when memory shortage to avoid OOM.
> -
>
> Key: SPARK-21175
> URL: https://issues.apache.org/jira/browse/SPARK-21175
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 2.1.1
>Reporter: jin xing
>
> A shuffle service can serve blocks from multiple apps/tasks, so it can suffer 
> high memory usage when many {{shuffle-read}} requests arrive at the same time. 
> In my cluster, OOM always happens on the shuffle service. Analyzing a heap dump, 
> the memory held by Netty (chunks) can be up to 2~3 GB. It might make sense to 
> reject "open blocks" requests when memory usage on the shuffle service is high.






[jira] [Commented] (SPARK-21175) Slow down "open blocks" on shuffle service when memory shortage to avoid OOM.

2017-06-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16058892#comment-16058892
 ] 

Apache Spark commented on SPARK-21175:
--

User 'jinxing64' has created a pull request for this issue:
https://github.com/apache/spark/pull/18388

> Slow down "open blocks" on shuffle service when memory shortage to avoid OOM.
> -
>
> Key: SPARK-21175
> URL: https://issues.apache.org/jira/browse/SPARK-21175
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 2.1.1
>Reporter: jin xing
>
> A shuffle service can serve blocks from multiple apps/tasks, so it can suffer 
> high memory usage when many {{shuffle-read}} requests arrive at the same time. 
> In my cluster, OOM always happens on the shuffle service. Analyzing a heap dump, 
> the memory held by Netty (chunks) can be up to 2~3 GB. It might make sense to 
> reject "open blocks" requests when memory usage on the shuffle service is high.






[jira] [Commented] (SPARK-19700) Design an API for pluggable scheduler implementations

2017-06-22 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16058917#comment-16058917
 ] 

Andrew Ash commented on SPARK-19700:


Found another potential implementation: Facebook's in-house scheduler mentioned 
by [~tejasp] at 
https://github.com/apache/spark/pull/18209#issuecomment-307538061

> Design an API for pluggable scheduler implementations
> -
>
> Key: SPARK-19700
> URL: https://issues.apache.org/jira/browse/SPARK-19700
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Matt Cheah
>
> One point that was brought up in discussing SPARK-18278 was that schedulers 
> cannot easily be added to Spark without forking the whole project. The main 
> reason is that much of the scheduler's behavior fundamentally depends on the 
> CoarseGrainedSchedulerBackend class, which is not part of the public API of 
> Spark and is in fact quite a complex module. As resource management and 
> allocation continue to evolve, Spark will need to be integrated with more 
> cluster managers, but maintaining support for all possible allocators in the 
> Spark project would be untenable. Furthermore, it would be impossible for 
> Spark to support proprietary frameworks that are developed by specific users 
> for their own particular use cases.
> Therefore, this ticket proposes making scheduler implementations fully 
> pluggable. The idea is that Spark will provide a Java/Scala interface that is 
> to be implemented by a scheduler that is backed by the cluster manager of 
> interest. The user can compile their scheduler's code into a JAR that is 
> placed on the driver's classpath. Finally, as is the case in the current 
> world, the scheduler implementation is selected and dynamically loaded 
> depending on the user's provided master URL.
> Determining the correct API is the most challenging problem. The current 
> CoarseGrainedSchedulerBackend handles many responsibilities, some of which 
> will be common across all cluster managers, and some which will be specific 
> to a particular cluster manager. For example, the particular mechanism for 
> creating the executor processes will differ between YARN and Mesos, but, once 
> these executors have started running, the means to submit tasks to them over 
> the Netty RPC is identical across the board.
> We must also consider a plugin model and interface for submitting the 
> application as well, because different cluster managers support different 
> configuration options, and thus the driver must be bootstrapped accordingly. 
> For example, in YARN mode the application and Hadoop configuration must be 
> packaged and shipped to the distributed cache prior to launching the job. A 
> prototype of a Kubernetes implementation starts a Kubernetes pod that runs 
> the driver in cluster mode.
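 
To make the plugin model concrete, a hedged sketch of what such an interface and its
selection by master URL could look like (the trait and method names are hypothetical,
not an actual Spark API):

{code}
// Hypothetical plugin interface: one implementation per external cluster manager,
// packaged in a JAR on the driver's classpath and chosen by the master URL scheme.
trait ExternalSchedulerFactory {
  def canHandle(masterURL: String): Boolean
  def createSchedulerBackend(masterURL: String, conf: Map[String, String]): AnyRef
}

object SchedulerPluginLoader {
  // Pick the plugin that claims the user's master URL, as the description proposes.
  def select(masterURL: String,
             plugins: Seq[ExternalSchedulerFactory]): ExternalSchedulerFactory =
    plugins.find(_.canHandle(masterURL)).getOrElse(
      throw new IllegalArgumentException(s"No scheduler plugin handles $masterURL"))
}
{code}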






[jira] [Comment Edited] (SPARK-19700) Design an API for pluggable scheduler implementations

2017-06-22 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16058930#comment-16058930
 ] 

Andrew Ash edited comment on SPARK-19700 at 6/22/17 7:47 AM:
-

Public implementation that's been around a while (Hamel is familiar with this): 
Two Sigma's integration with their Cook scheduler, recent diff at 
https://github.com/twosigma/spark/compare/cd0a083...2.1.0-cook


was (Author: aash):
Public implementation that's been around a while: Two Sigma's integration with 
their Cook scheduler, recent diff at 
https://github.com/twosigma/spark/compare/cd0a083...2.1.0-cook

> Design an API for pluggable scheduler implementations
> -
>
> Key: SPARK-19700
> URL: https://issues.apache.org/jira/browse/SPARK-19700
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Matt Cheah
>
> One point that was brought up in discussing SPARK-18278 was that schedulers 
> cannot easily be added to Spark without forking the whole project. The main 
> reason is that much of the scheduler's behavior fundamentally depends on the 
> CoarseGrainedSchedulerBackend class, which is not part of the public API of 
> Spark and is in fact quite a complex module. As resource management and 
> allocation continue to evolve, Spark will need to be integrated with more 
> cluster managers, but maintaining support for all possible allocators in the 
> Spark project would be untenable. Furthermore, it would be impossible for 
> Spark to support proprietary frameworks that are developed by specific users 
> for their own particular use cases.
> Therefore, this ticket proposes making scheduler implementations fully 
> pluggable. The idea is that Spark will provide a Java/Scala interface that is 
> to be implemented by a scheduler that is backed by the cluster manager of 
> interest. The user can compile their scheduler's code into a JAR that is 
> placed on the driver's classpath. Finally, as is the case in the current 
> world, the scheduler implementation is selected and dynamically loaded 
> depending on the user's provided master URL.
> Determining the correct API is the most challenging problem. The current 
> CoarseGrainedSchedulerBackend handles many responsibilities, some of which 
> will be common across all cluster managers, and some which will be specific 
> to a particular cluster manager. For example, the particular mechanism for 
> creating the executor processes will differ between YARN and Mesos, but, once 
> these executors have started running, the means to submit tasks to them over 
> the Netty RPC is identical across the board.
> We must also consider a plugin model and interface for submitting the 
> application as well, because different cluster managers support different 
> configuration options, and thus the driver must be bootstrapped accordingly. 
> For example, in YARN mode the application and Hadoop configuration must be 
> packaged and shipped to the distributed cache prior to launching the job. A 
> prototype of a Kubernetes implementation starts a Kubernetes pod that runs 
> the driver in cluster mode.






[jira] [Commented] (SPARK-19700) Design an API for pluggable scheduler implementations

2017-06-22 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16058930#comment-16058930
 ] 

Andrew Ash commented on SPARK-19700:


Public implementation that's been around a while: Two Sigma's integration with 
their Cook scheduler, recent diff at 
https://github.com/twosigma/spark/compare/cd0a083...2.1.0-cook

> Design an API for pluggable scheduler implementations
> -
>
> Key: SPARK-19700
> URL: https://issues.apache.org/jira/browse/SPARK-19700
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Matt Cheah
>
> One point that was brought up in discussing SPARK-18278 was that schedulers 
> cannot easily be added to Spark without forking the whole project. The main 
> reason is that much of the scheduler's behavior fundamentally depends on the 
> CoarseGrainedSchedulerBackend class, which is not part of the public API of 
> Spark and is in fact quite a complex module. As resource management and 
> allocation continue to evolve, Spark will need to be integrated with more 
> cluster managers, but maintaining support for all possible allocators in the 
> Spark project would be untenable. Furthermore, it would be impossible for 
> Spark to support proprietary frameworks that are developed by specific users 
> for their own particular use cases.
> Therefore, this ticket proposes making scheduler implementations fully 
> pluggable. The idea is that Spark will provide a Java/Scala interface that is 
> to be implemented by a scheduler that is backed by the cluster manager of 
> interest. The user can compile their scheduler's code into a JAR that is 
> placed on the driver's classpath. Finally, as is the case in the current 
> world, the scheduler implementation is selected and dynamically loaded 
> depending on the user's provided master URL.
> Determining the correct API is the most challenging problem. The current 
> CoarseGrainedSchedulerBackend handles many responsibilities, some of which 
> will be common across all cluster managers, and some which will be specific 
> to a particular cluster manager. For example, the particular mechanism for 
> creating the executor processes will differ between YARN and Mesos, but, once 
> these executors have started running, the means to submit tasks to them over 
> the Netty RPC is identical across the board.
> We must also consider a plugin model and interface for submitting the 
> application as well, because different cluster managers support different 
> configuration options, and thus the driver must be bootstrapped accordingly. 
> For example, in YARN mode the application and Hadoop configuration must be 
> packaged and shipped to the distributed cache prior to launching the job. A 
> prototype of a Kubernetes implementation starts a Kubernetes pod that runs 
> the driver in cluster mode.






[jira] [Updated] (SPARK-14174) Accelerate KMeans via Mini-Batch EM

2017-06-22 Thread zhengruifeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-14174:
-
Attachment: (was: MiniBatchKMeans_Performance_II.pdf)

> Accelerate KMeans via Mini-Batch EM
> ---
>
> Key: SPARK-14174
> URL: https://issues.apache.org/jira/browse/SPARK-14174
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
> Attachments: MiniBatchKMeans_Performance.pdf
>
>
> The MiniBatchKMeans is a variant of the KMeans algorithm which uses 
> mini-batches to reduce the computation time, while still attempting to 
> optimise the same objective function. Mini-batches are subsets of the input 
> data, randomly sampled in each training iteration. These mini-batches 
> drastically reduce the amount of computation required to converge to a local 
> solution. In contrast to other algorithms that reduce the convergence time of 
> k-means, mini-batch k-means produces results that are generally only slightly 
> worse than the standard algorithm.
> I have implemented mini-batch k-means in MLlib, and the acceleration is really 
> significant.
> The MiniBatch KMeans is named XMeans in the following lines.
> {code}
> val path = "/tmp/mnist8m.scale"
> val data = MLUtils.loadLibSVMFile(sc, path)
> val vecs = data.map(_.features).persist()
> val km = KMeans.train(data=vecs, k=10, maxIterations=10, runs=1, 
> initializationMode="k-means||", seed=123l)
> km.computeCost(vecs)
> res0: Double = 3.317029898599564E8
> val xm = XMeans.train(data=vecs, k=10, maxIterations=10, runs=1, 
> initializationMode="k-means||", miniBatchFraction=0.1, seed=123l)
> xm.computeCost(vecs)
> res1: Double = 3.3169865959604424E8
> val xm2 = XMeans.train(data=vecs, k=10, maxIterations=10, runs=1, 
> initializationMode="k-means||", miniBatchFraction=0.01, seed=123l)
> xm2.computeCost(vecs)
> res2: Double = 3.317195831216454E8
> {code}
> All three training runs above reached the maximum of 10 iterations.
> We can see that the WSSSEs are almost the same, while their speeds differ 
> significantly:
> {code}
> KMeans                             2876 sec
> MiniBatch KMeans (fraction=0.1)     263 sec
> MiniBatch KMeans (fraction=0.01)     90 sec
> {code}
> With an appropriate fraction, the bigger the dataset, the higher the speedup.
> The data used above has 8,100,000 samples and 784 features. It can be 
> downloaded here 
> (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/mnist8m.scale.bz2)
> Comparison of the K-Means and MiniBatchKMeans on sklearn : 
> http://scikit-learn.org/stable/auto_examples/cluster/plot_mini_batch_kmeans.html#example-cluster-plot-mini-batch-kmeans-py
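 
For readers unfamiliar with the algorithm, a self-contained sketch of one mini-batch
update step in plain Scala (this follows the standard mini-batch k-means rule with a
per-center learning rate; it is not the MLlib/XMeans implementation discussed above):

{code}
import scala.util.Random

object MiniBatchKMeansStep {
  type Point = Array[Double]

  private def dist2(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // One iteration: sample a fraction of the data, assign each sampled point to its
  // nearest center, and move that center toward the point with rate 1/countSeen.
  def update(points: Seq[Point], centers: Array[Point], counts: Array[Long],
             miniBatchFraction: Double, rng: Random): Unit = {
    val batch = points.filter(_ => rng.nextDouble() < miniBatchFraction)
    batch.foreach { p =>
      val j = centers.indices.minBy(i => dist2(p, centers(i)))
      counts(j) += 1
      val eta = 1.0 / counts(j)
      centers(j) = centers(j).zip(p).map { case (c, x) => (1 - eta) * c + eta * x }
    }
  }
}
{code}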






[jira] [Updated] (SPARK-14174) Accelerate KMeans via Mini-Batch EM

2017-06-22 Thread zhengruifeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-14174:
-
Attachment: (was: MiniBatchKMeans_Performance.pdf)

> Accelerate KMeans via Mini-Batch EM
> ---
>
> Key: SPARK-14174
> URL: https://issues.apache.org/jira/browse/SPARK-14174
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
>
> The MiniBatchKMeans is a variant of the KMeans algorithm which uses 
> mini-batches to reduce the computation time, while still attempting to 
> optimise the same objective function. Mini-batches are subsets of the input 
> data, randomly sampled in each training iteration. These mini-batches 
> drastically reduce the amount of computation required to converge to a local 
> solution. In contrast to other algorithms that reduce the convergence time of 
> k-means, mini-batch k-means produces results that are generally only slightly 
> worse than the standard algorithm.
> I have implemented mini-batch k-means in MLlib, and the acceleration is really 
> significant.
> The MiniBatch KMeans is named XMeans in the following lines.
> {code}
> val path = "/tmp/mnist8m.scale"
> val data = MLUtils.loadLibSVMFile(sc, path)
> val vecs = data.map(_.features).persist()
> val km = KMeans.train(data=vecs, k=10, maxIterations=10, runs=1, 
> initializationMode="k-means||", seed=123l)
> km.computeCost(vecs)
> res0: Double = 3.317029898599564E8
> val xm = XMeans.train(data=vecs, k=10, maxIterations=10, runs=1, 
> initializationMode="k-means||", miniBatchFraction=0.1, seed=123l)
> xm.computeCost(vecs)
> res1: Double = 3.3169865959604424E8
> val xm2 = XMeans.train(data=vecs, k=10, maxIterations=10, runs=1, 
> initializationMode="k-means||", miniBatchFraction=0.01, seed=123l)
> xm2.computeCost(vecs)
> res2: Double = 3.317195831216454E8
> {code}
> All three training runs above reached the maximum of 10 iterations.
> We can see that the WSSSEs are almost the same, while their speeds differ 
> significantly:
> {code}
> KMeans                             2876 sec
> MiniBatch KMeans (fraction=0.1)     263 sec
> MiniBatch KMeans (fraction=0.01)     90 sec
> {code}
> With an appropriate fraction, the bigger the dataset, the higher the speedup.
> The data used above has 8,100,000 samples and 784 features. It can be 
> downloaded here 
> (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/mnist8m.scale.bz2)
> Comparison of the K-Means and MiniBatchKMeans on sklearn : 
> http://scikit-learn.org/stable/auto_examples/cluster/plot_mini_batch_kmeans.html#example-cluster-plot-mini-batch-kmeans-py






[jira] [Updated] (SPARK-14174) Implement the Mini-Batch KMeans

2017-06-22 Thread zhengruifeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-14174:
-
Attachment: MBKM.xlsx

> Implement the Mini-Batch KMeans
> ---
>
> Key: SPARK-14174
> URL: https://issues.apache.org/jira/browse/SPARK-14174
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
> Attachments: MBKM.xlsx
>
>
> The MiniBatchKMeans is a variant of the KMeans algorithm which uses 
> mini-batches to reduce the computation time, while still attempting to 
> optimise the same objective function. Mini-batches are subsets of the input 
> data, randomly sampled in each training iteration. These mini-batches 
> drastically reduce the amount of computation required to converge to a local 
> solution. In contrast to other algorithms that reduce the convergence time of 
> k-means, mini-batch k-means produces results that are generally only slightly 
> worse than the standard algorithm.
> Comparison of the K-Means and MiniBatchKMeans on sklearn : 
> http://scikit-learn.org/stable/auto_examples/cluster/plot_mini_batch_kmeans.html#example-cluster-plot-mini-batch-kmeans-py
> Since MiniBatch-KMeans with fraction=1.0 is not equivalent to KMeans, I make it 
> a new estimator.






[jira] [Updated] (SPARK-14174) Accelerate KMeans via Mini-Batch EM

2017-06-22 Thread zhengruifeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-14174:
-
Description: 
The MiniBatchKMeans is a variant of the KMeans algorithm which uses 
mini-batches to reduce the computation time, while still attempting to optimise 
the same objective function. Mini-batches are subsets of the input data, 
randomly sampled in each training iteration. These mini-batches drastically 
reduce the amount of computation required to converge to a local solution. In 
contrast to other algorithms that reduce the convergence time of k-means, 
mini-batch k-means produces results that are generally only slightly worse than 
the standard algorithm.


Comparison of the K-Means and MiniBatchKMeans on sklearn : 
http://scikit-learn.org/stable/auto_examples/cluster/plot_mini_batch_kmeans.html#example-cluster-plot-mini-batch-kmeans-py

Since MiniBatch-KMeans with fraction=1.0 is not equivalent to KMeans, I make it a 
new estimator.


  was:
The MiniBatchKMeans is a variant of the KMeans algorithm which uses 
mini-batches to reduce the computation time, while still attempting to optimise 
the same objective function. Mini-batches are subsets of the input data, 
randomly sampled in each training iteration. These mini-batches drastically 
reduce the amount of computation required to converge to a local solution. In 
contrast to other algorithms that reduce the convergence time of k-means, 
mini-batch k-means produces results that are generally only slightly worse than 
the standard algorithm.

I have implemented mini-batch k-means in MLlib, and the acceleration is really 
significant.
The MiniBatch KMeans is named XMeans in the following lines.
{code}
val path = "/tmp/mnist8m.scale"
val data = MLUtils.loadLibSVMFile(sc, path)
val vecs = data.map(_.features).persist()

val km = KMeans.train(data=vecs, k=10, maxIterations=10, runs=1, 
initializationMode="k-means||", seed=123l)
km.computeCost(vecs)
res0: Double = 3.317029898599564E8

val xm = XMeans.train(data=vecs, k=10, maxIterations=10, runs=1, 
initializationMode="k-means||", miniBatchFraction=0.1, seed=123l)
xm.computeCost(vecs)
res1: Double = 3.3169865959604424E8

val xm2 = XMeans.train(data=vecs, k=10, maxIterations=10, runs=1, 
initializationMode="k-means||", miniBatchFraction=0.01, seed=123l)
xm2.computeCost(vecs)
res2: Double = 3.317195831216454E8
{code}
All three training runs above reached the maximum of 10 iterations.
We can see that the WSSSEs are almost the same, while their speeds differ 
significantly:
{code}
KMeans                             2876 sec
MiniBatch KMeans (fraction=0.1)     263 sec
MiniBatch KMeans (fraction=0.01)     90 sec
{code}

With an appropriate fraction, the bigger the dataset, the higher the speedup.

The data used above has 8,100,000 samples and 784 features. It can be downloaded 
here 
(https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/mnist8m.scale.bz2)

Comparison of the K-Means and MiniBatchKMeans on sklearn : 
http://scikit-learn.org/stable/auto_examples/cluster/plot_mini_batch_kmeans.html#example-cluster-plot-mini-batch-kmeans-py


> Accelerate KMeans via Mini-Batch EM
> ---
>
> Key: SPARK-14174
> URL: https://issues.apache.org/jira/browse/SPARK-14174
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
> Attachments: MBKM.xlsx
>
>
> The MiniBatchKMeans is a variant of the KMeans algorithm which uses 
> mini-batches to reduce the computation time, while still attempting to 
> optimise the same objective function. Mini-batches are subsets of the input 
> data, randomly sampled in each training iteration. These mini-batches 
> drastically reduce the amount of computation required to converge to a local 
> solution. In contrast to other algorithms that reduce the convergence time of 
> k-means, mini-batch k-means produces results that are generally only slightly 
> worse than the standard algorithm.
> Comparison of the K-Means and MiniBatchKMeans on sklearn : 
> http://scikit-learn.org/stable/auto_examples/cluster/plot_mini_batch_kmeans.html#example-cluster-plot-mini-batch-kmeans-py
> Since MiniBatch-KMeans with fraction=1.0 is not equivalent to KMeans, I make it 
> a new estimator.






[jira] [Updated] (SPARK-14174) Implement the Mini-Batch KMeans

2017-06-22 Thread zhengruifeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-14174:
-
Summary: Implement the Mini-Batch KMeans  (was: Accelerate KMeans via 
Mini-Batch EM)

> Implement the Mini-Batch KMeans
> ---
>
> Key: SPARK-14174
> URL: https://issues.apache.org/jira/browse/SPARK-14174
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
> Attachments: MBKM.xlsx
>
>
> The MiniBatchKMeans is a variant of the KMeans algorithm which uses 
> mini-batches to reduce the computation time, while still attempting to 
> optimise the same objective function. Mini-batches are subsets of the input 
> data, randomly sampled in each training iteration. These mini-batches 
> drastically reduce the amount of computation required to converge to a local 
> solution. In contrast to other algorithms that reduce the convergence time of 
> k-means, mini-batch k-means produces results that are generally only slightly 
> worse than the standard algorithm.
> Comparison of the K-Means and MiniBatchKMeans on sklearn : 
> http://scikit-learn.org/stable/auto_examples/cluster/plot_mini_batch_kmeans.html#example-cluster-plot-mini-batch-kmeans-py
> Since MiniBatch-KMeans with fraction=1.0 is not equivalent to KMeans, I make it 
> a new estimator.






[jira] [Resolved] (SPARK-21163) DataFrame.toPandas should respect the data type

2017-06-22 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-21163.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18378
[https://github.com/apache/spark/pull/18378]

> DataFrame.toPandas should respect the data type
> ---
>
> Key: SPARK-21163
> URL: https://issues.apache.org/jira/browse/SPARK-21163
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.3.0
>
>







[jira] [Commented] (SPARK-14174) Implement the Mini-Batch KMeans

2017-06-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16058975#comment-16058975
 ] 

Apache Spark commented on SPARK-14174:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/18389

> Implement the Mini-Batch KMeans
> ---
>
> Key: SPARK-14174
> URL: https://issues.apache.org/jira/browse/SPARK-14174
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
> Attachments: MBKM.xlsx
>
>
> The MiniBatchKMeans is a variant of the KMeans algorithm which uses 
> mini-batches to reduce the computation time, while still attempting to 
> optimise the same objective function. Mini-batches are subsets of the input 
> data, randomly sampled in each training iteration. These mini-batches 
> drastically reduce the amount of computation required to converge to a local 
> solution. In contrast to other algorithms that reduce the convergence time of 
> k-means, mini-batch k-means produces results that are generally only slightly 
> worse than the standard algorithm.
> Comparison of the K-Means and MiniBatchKMeans on sklearn : 
> http://scikit-learn.org/stable/auto_examples/cluster/plot_mini_batch_kmeans.html#example-cluster-plot-mini-batch-kmeans-py
> Since MiniBatch-KMeans with fraction=1.0 is not equivalent to KMeans, I make it 
> a new estimator.






[jira] [Commented] (SPARK-14174) Implement the Mini-Batch KMeans

2017-06-22 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16058982#comment-16058982
 ] 

zhengruifeng commented on SPARK-14174:
--

[~mlnick] I sent a new PR for MiniBatch KMeans, and the corresponding test 
results are attached.

> Implement the Mini-Batch KMeans
> ---
>
> Key: SPARK-14174
> URL: https://issues.apache.org/jira/browse/SPARK-14174
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
> Attachments: MBKM.xlsx
>
>
> The MiniBatchKMeans is a variant of the KMeans algorithm which uses 
> mini-batches to reduce the computation time, while still attempting to 
> optimise the same objective function. Mini-batches are subsets of the input 
> data, randomly sampled in each training iteration. These mini-batches 
> drastically reduce the amount of computation required to converge to a local 
> solution. In contrast to other algorithms that reduce the convergence time of 
> k-means, mini-batch k-means produces results that are generally only slightly 
> worse than the standard algorithm.
> Comparison of the K-Means and MiniBatchKMeans on sklearn : 
> http://scikit-learn.org/stable/auto_examples/cluster/plot_mini_batch_kmeans.html#example-cluster-plot-mini-batch-kmeans-py
> Since MiniBatch-KMeans with fraction=1.0 is not equivalent to KMeans, I make it 
> a new estimator.






[jira] [Created] (SPARK-21176) Master UI hangs with spark.ui.reverseProxy=true if the master node has many CPUs

2017-06-22 Thread Ingo Schuster (JIRA)
Ingo Schuster created SPARK-21176:
-

 Summary: Master UI hangs with spark.ui.reverseProxy=true if the 
master node has many CPUs
 Key: SPARK-21176
 URL: https://issues.apache.org/jira/browse/SPARK-21176
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 2.1.0
 Environment: ppc64le GNU/Linux, POWER8; only the master node is reachable 
externally, other nodes are in an internal network
Reporter: Ingo Schuster


When *reverse proxy is enabled*
{quote}
spark.ui.reverseProxy=true
spark.ui.reverseProxyUrl=/
{quote}
 first of all, any invocation of the Spark master Web UI hangs forever, both 
locally (e.g. http://192.168.10.16:25001) and via the external URL, with no data 
received. 
One, sometimes two, Spark applications succeed without error, and then workers 
start throwing exceptions:
{quote}
Caused by: java.io.IOException: Failed to connect to /192.168.10.16:25050
{quote}
The application dies during creation of SparkContext:
{quote}
2017-05-22 16:11:23 INFO  StandaloneAppClient$ClientEndpoint:54 - Connecting to 
master spark://node0101:25000...
2017-05-22 16:11:23 INFO  TransportClientFactory:254 - Successfully created 
connection to node0101/192.168.10.16:25000 after 169 ms (132 ms spent in 
bootstraps)
2017-05-22 16:11:43 INFO  StandaloneAppClient$ClientEndpoint:54 - Connecting to 
master spark://node0101:25000...
2017-05-22 16:12:03 INFO  StandaloneAppClient$ClientEndpoint:54 - Connecting to 
master spark://node0101:25000...
2017-05-22 16:12:23 ERROR StandaloneSchedulerBackend:70 - Application has been 
killed. Reason: All masters are unresponsive! Giving up.
2017-05-22 16:12:23 WARN  StandaloneSchedulerBackend:66 - Application ID is not 
initialized yet.
2017-05-22 16:12:23 INFO  Utils:54 - Successfully started service 
'org.apache.spark.network.netty.NettyBlockTransferService' on port 25056.
.
Caused by: java.lang.IllegalArgumentException: requirement failed: Can only 
call getServletHandlers on a running MetricsSystem
{quote}

*This definitely does not happen without reverse proxy enabled!*






[jira] [Updated] (SPARK-21176) Master UI hangs with spark.ui.reverseProxy=true if the master node has many CPUs

2017-06-22 Thread Ingo Schuster (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ingo Schuster updated SPARK-21176:
--
Affects Version/s: 2.2.1
   2.2.0
   2.1.1

> Master UI hangs with spark.ui.reverseProxy=true if the master node has many 
> CPUs
> 
>
> Key: SPARK-21176
> URL: https://issues.apache.org/jira/browse/SPARK-21176
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.0, 2.1.1, 2.2.0, 2.2.1
> Environment: ppc64le GNU/Linux, POWER8; only the master node is reachable 
> externally, other nodes are in an internal network
>Reporter: Ingo Schuster
>  Labels: network, web-ui
>
> When *reverse proxy is enabled*
> {quote}
> spark.ui.reverseProxy=true
> spark.ui.reverseProxyUrl=/
> {quote}
>  first of all, any invocation of the Spark master Web UI hangs forever, both 
> locally (e.g. http://192.168.10.16:25001) and via the external URL, with no 
> data received. 
> One, sometimes two, Spark applications succeed without error, and then workers 
> start throwing exceptions:
> {quote}
> Caused by: java.io.IOException: Failed to connect to /192.168.10.16:25050
> {quote}
> The application dies during creation of SparkContext:
> {quote}
> 2017-05-22 16:11:23 INFO  StandaloneAppClient$ClientEndpoint:54 - Connecting 
> to master spark://node0101:25000...
> 2017-05-22 16:11:23 INFO  TransportClientFactory:254 - Successfully created 
> connection to node0101/192.168.10.16:25000 after 169 ms (132 ms spent in 
> bootstraps)
> 2017-05-22 16:11:43 INFO  StandaloneAppClient$ClientEndpoint:54 - Connecting 
> to master spark://node0101:25000...
> 2017-05-22 16:12:03 INFO  StandaloneAppClient$ClientEndpoint:54 - Connecting 
> to master spark://node0101:25000...
> 2017-05-22 16:12:23 ERROR StandaloneSchedulerBackend:70 - Application has 
> been killed. Reason: All masters are unresponsive! Giving up.
> 2017-05-22 16:12:23 WARN  StandaloneSchedulerBackend:66 - Application ID is 
> not initialized yet.
> 2017-05-22 16:12:23 INFO  Utils:54 - Successfully started service 
> 'org.apache.spark.network.netty.NettyBlockTransferService' on port 25056.
> .
> Caused by: java.lang.IllegalArgumentException: requirement failed: Can only 
> call getServletHandlers on a running MetricsSystem
> {quote}
> *This definitely does not happen without reverse proxy enabled!*






[jira] [Commented] (SPARK-21080) Workaround for HDFS delegation token expiry broken with some Hadoop versions

2017-06-22 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16058985#comment-16058985
 ] 

Sean Owen commented on SPARK-21080:
---

This really isn't my area; maybe [~vanzin]?

> Workaround for HDFS delegation token expiry broken with some Hadoop versions
> 
>
> Key: SPARK-21080
> URL: https://issues.apache.org/jira/browse/SPARK-21080
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.1.0
> Environment: Spark 2.1.0 on Yarn, Hadoop 2.7.3
>Reporter: Lukasz Raszka
>Priority: Minor
>
> We're being hit by SPARK-11182, where the core issue in HDFS has been 
> fixed in more recent versions. It seems that the [workaround introduced by user 
> SaintBacchus|https://github.com/apache/spark/commit/646366b5d2f12e42f8e7287672ba29a8c918a17d]
>  doesn't work in newer versions of Hadoop. This seems to be caused by a move of 
> the property name from {{fs.hdfs.impl}} to {{fs.AbstractFileSystem.hdfs.impl}} 
> which happened somewhere around 2.7.0 or earlier. Taking this into account 
> should make the workaround work again for less recent Hadoop versions.
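 
A hedged sketch of the compatibility idea described here (not the actual workaround
code): consult the newer {{fs.AbstractFileSystem.hdfs.impl}} key first and fall back to
the older {{fs.hdfs.impl}} key, so the logic holds across Hadoop versions.

{code}
import org.apache.hadoop.conf.Configuration

object HdfsImplKey {
  // Returns whichever of the two property values is set, preferring the newer key.
  def resolve(conf: Configuration): Option[String] =
    Option(conf.get("fs.AbstractFileSystem.hdfs.impl"))
      .orElse(Option(conf.get("fs.hdfs.impl")))
}
{code}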






[jira] [Updated] (SPARK-21176) Master UI hangs with spark.ui.reverseProxy=true if the master node has many CPUs

2017-06-22 Thread Ingo Schuster (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ingo Schuster updated SPARK-21176:
--
Description: 
In reverse proxy mode, Spark exhausts the Jetty thread pool if the master node 
has too many CPUs or the cluster has too many executors:

For each connector, Jetty creates Selector threads: minimum 4, maximum half the 
number of available CPUs:
{{selectors>0?selectors:Math.max(1,Math.min(4,Runtime.getRuntime().availableProcessors()/2)));}}
(see 
https://github.com/eclipse/jetty.project/blob/jetty-9.3.x/jetty-server/src/main/java/org/eclipse/jetty/server/ServerConnector.java)

In reverse proxy mode, a connector is set up for each executor and one for the 
master UI.
I have a system with 88 CPUs on the master node and 7 executors. Jetty tries to 
instantiate 8*44 = 352 selector threads, but since the QueuedThreadPool is 
initialized with 200 threads by default, the UI gets stuck.

I have patched JettyUtils.scala to extend the thread pool ( {{val pool = new 
QueuedThreadPool*(400)* }}). With this hack, the UI works.

Obviously, the Jetty defaults are meant for a real web server. If that has 88 
CPUs, you do certainly expect a lot of traffic.
For the Spark admin UI however, there will rarely be concurrent accesses for 
the same application or the same executor.
I therefore propose to dramatically reduce the number of selector threads that 
get instantiated - at least by default.

I will propose a fix in a pull request.

  was:
In reverse proxy mode, Spark exhausts the Jetty thread pool if the master node 
has too many CPUs or the cluster has too many executors:

For each connector, Jetty creates Selector threads: minimum 4, maximum half the 
number of available CPUs:
{{  
selectors>0?selectors:Math.max(1,Math.min(4,Runtime.getRuntime().availableProcessors()/2)));}}
(see 
https://github.com/eclipse/jetty.project/blob/jetty-9.3.x/jetty-server/src/main/java/org/eclipse/jetty/server/ServerConnector.java)

In reverse proxy mode, a connector is set up for each executor and one for the 
master UI.
I have a system with 88 CPUs on the master node and 7 executors. Jetty tries to 
instantiate 8*44 = 352 selector threads, but since the QueuedThreadPool is 
initialized with 200 threads by default, the UI gets stuck.

I have patched JettyUtils.scala to extend the thread pool ( {{val pool = new 
QueuedThreadPool*(400)* }}). With this hack, the UI works.

Obviously, the Jetty defaults are meant for a real web server. If that has 88 
CPUs, you do certainly expect a lot of traffic.
For the Spark admin UI however, there will rarely be concurrent accesses for 
the same application or the same executor.
I therefore propose to dramatically reduce the number of selector threads that 
get instantiated - at least by default.

I will propose a fix in a pull request.


> Master UI hangs with spark.ui.reverseProxy=true if the master node has many 
> CPUs
> 
>
> Key: SPARK-21176
> URL: https://issues.apache.org/jira/browse/SPARK-21176
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.0, 2.1.1, 2.2.0, 2.2.1
> Environment: ppc64le GNU/Linux, POWER8; only the master node is reachable 
> externally, other nodes are in an internal network
>Reporter: Ingo Schuster
>  Labels: network, web-ui
>
> In reverse proxy mode, Spark exhausts the Jetty thread pool if the master 
> node has too many CPUs or the cluster has too many executors:
> For each connector, Jetty creates Selector threads: minimum 4, maximum half 
> the number of available CPUs:
> {{selectors>0?selectors:Math.max(1,Math.min(4,Runtime.getRuntime().availableProcessors()/2)));}}
> (see 
> https://github.com/eclipse/jetty.project/blob/jetty-9.3.x/jetty-server/src/main/java/org/eclipse/jetty/server/ServerConnector.java)
> In reverse proxy mode, a connector is set up for each executor and one for 
> the master UI.
> I have a system with 88 CPUs on the master node and 7 executors. Jetty tries 
> to instantiate 8*44 = 352 selector threads, but since the QueuedThreadPool is 
> initialized with 200 threads by default, the UI gets stuck.
> I have patched JettyUtils.scala to extend the thread pool ( {{val pool = new 
> QueuedThreadPool*(400)* }}). With this hack, the UI works.
> Obviously, the Jetty defaults are meant for a real web server. If that has 88 
> CPUs, you do certainly expect a lot of traffic.
> For the Spark admin UI however, there will rarely be concurrent accesses for 
> the same application or the same executor.
> I therefore propose to dramatically reduce the number of selector threads 
> that get instantiated - at least by default.
> I will propose a fix in a pull request.
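 
A small sketch of the arithmetic in this report, using only the numbers quoted above
(illustrative Scala, not Spark or Jetty code):

{code}
// With reverse proxy enabled there is one Jetty connector per executor plus one for
// the master UI; on the reported system each connector ends up with
// availableProcessors/2 selector threads, while the QueuedThreadPool defaults to 200.
object ReverseProxyThreadMath extends App {
  val cpus = 88
  val executors = 7
  val connectors = executors + 1          // 8
  val selectorsPerConnector = cpus / 2    // 44, as observed in the report
  val defaultPoolSize = 200
  val needed = connectors * selectorsPerConnector
  println(s"selector threads needed: $needed vs. default pool size: $defaultPoolSize")
  // prints: selector threads needed: 352 vs. default pool size: 200
}
{code}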




[jira] [Updated] (SPARK-21176) Master UI hangs with spark.ui.reverseProxy=true if the master node has many CPUs

2017-06-22 Thread Ingo Schuster (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ingo Schuster updated SPARK-21176:
--
Description: 
In reverse proxy mode, Spark exhausts the Jetty thread pool if the master node 
has too many CPUs or the cluster has too many executors:

For each connector, Jetty creates Selector threads: minimum 4, maximum half the 
number of available CPUs:
{{  
selectors>0?selectors:Math.max(1,Math.min(4,Runtime.getRuntime().availableProcessors()/2)));}}
(see 
https://github.com/eclipse/jetty.project/blob/jetty-9.3.x/jetty-server/src/main/java/org/eclipse/jetty/server/ServerConnector.java)

In reverse proxy mode, a connector is set up for each executor and one for the 
master UI.
I have a system with 88 CPUs on the master node and 7 executors. Jetty tries to 
instantiate 8*44 = 352 selector threads, but since the QueuedThreadPool is 
initialized with 200 threads by default, the UI gets stuck.

I have patched JettyUtils.scala to extend the thread pool ( {{val pool = new 
QueuedThreadPool*(400)* }}). With this hack, the UI works.

Obviously, the Jetty defaults are meant for a real web server. If that has 88 
CPUs, you do certainly expect a lot of traffic.
For the Spark admin UI however, there will rarely be concurrent accesses for 
the same application or the same executor.
I therefore propose to dramatically reduce the number of selector threads that 
get instantiated - at least by default.

I will propose a fix in a pull request.

  was:
When *reverse proxy is enabled*
{quote}
spark.ui.reverseProxy=true
spark.ui.reverseProxyUrl=/
{quote}
 first of all, any invocation of the Spark master Web UI hangs forever, both 
locally (e.g. http://192.168.10.16:25001) and via the external URL, with no data 
received. 
One, sometimes two, Spark applications succeed without error, and then workers 
start throwing exceptions:
{quote}
Caused by: java.io.IOException: Failed to connect to /192.168.10.16:25050
{quote}
The application dies during creation of SparkContext:
{quote}
2017-05-22 16:11:23 INFO  StandaloneAppClient$ClientEndpoint:54 - Connecting to 
master spark://node0101:25000...
2017-05-22 16:11:23 INFO  TransportClientFactory:254 - Successfully created 
connection to node0101/192.168.10.16:25000 after 169 ms (132 ms spent in 
bootstraps)
2017-05-22 16:11:43 INFO  StandaloneAppClient$ClientEndpoint:54 - Connecting to 
master spark://node0101:25000...
2017-05-22 16:12:03 INFO  StandaloneAppClient$ClientEndpoint:54 - Connecting to 
master spark://node0101:25000...
2017-05-22 16:12:23 ERROR StandaloneSchedulerBackend:70 - Application has been 
killed. Reason: All masters are unresponsive! Giving up.
2017-05-22 16:12:23 WARN  StandaloneSchedulerBackend:66 - Application ID is not 
initialized yet.
2017-05-22 16:12:23 INFO  Utils:54 - Successfully started service 
'org.apache.spark.network.netty.NettyBlockTransferService' on port 25056.
.
Caused by: java.lang.IllegalArgumentException: requirement failed: Can only 
call getServletHandlers on a running MetricsSystem
{quote}

*This definitely does not happen without reverse proxy enabled!*


> Master UI hangs with spark.ui.reverseProxy=true if the master node has many 
> CPUs
> 
>
> Key: SPARK-21176
> URL: https://issues.apache.org/jira/browse/SPARK-21176
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.0, 2.1.1, 2.2.0, 2.2.1
> Environment: ppc64le GNU/Linux, POWER8; only the master node is reachable 
> externally, other nodes are in an internal network
>Reporter: Ingo Schuster
>  Labels: network, web-ui
>
> In reverse proxy mode, Spark exhausts the Jetty thread pool if the master 
> node has too many CPUs or the cluster has too many executors:
> For each connector, Jetty creates Selector threads: minimum 4, maximum half 
> the number of available CPUs:
> {{  
> selectors>0?selectors:Math.max(1,Math.min(4,Runtime.getRuntime().availableProcessors()/2)));}}
> (see 
> https://github.com/eclipse/jetty.project/blob/jetty-9.3.x/jetty-server/src/main/java/org/eclipse/jetty/server/ServerConnector.java)
> In reverse proxy mode, a connector is set up for each executor and one for 
> the master UI.
> I have a system with 88 CPUs on the master node and 7 executors. Jetty tries 
> to instantiate 8*44 = 352 selector threads, but since the QueuedThreadPool is 
> initialized with 200 threads by default, the UI gets stuck.
> I have patched JettyUtils.scala to extend the thread pool ( {{val pool = new 
> QueuedThreadPool*(400)* }}). With this hack, the UI works.
> Obviously, the Jetty defaults are meant for a real web server. If that has 88 
> CPUs, you do certainly expect a lot of traffic.
> For the Spark admin UI however, there will rarely be concurrent accesses for 
> the same application 

[jira] [Resolved] (SPARK-21173) There are several configuration about SSL displayed in configuration.md but never be used.

2017-06-22 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-21173.
---
Resolution: Not A Problem

> There are several configuration about SSL displayed in configuration.md but 
> never be used.
> --
>
> Key: SPARK-21173
> URL: https://issues.apache.org/jira/browse/SPARK-21173
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Spark Core
>Affects Versions: 2.1.1
>Reporter: liuzhaokun
>Priority: Trivial
>
> There are several SSL configurations displayed in configuration.md that are 
> never used and do not appear in Spark's code, so I think they should be removed 
> from configuration.md.






[jira] [Resolved] (SPARK-21161) SparkContext stopped when execute a query on Solr

2017-06-22 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-21161.
---
Resolution: Not A Problem

> SparkContext stopped when execute a query on Solr
> -
>
> Key: SPARK-21161
> URL: https://issues.apache.org/jira/browse/SPARK-21161
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
> Environment: Hadoop2.7.3, Spark 2.1.1, spark-solr-3.0.1.jar, 
> solr-solrj-6.5.1.jar
>Reporter: Jian Wu
>
> The SparkContext stopped due to DAGSchedulerEventProcessLoop failed when I 
> query Solr data in Spark.
> {code:none}
> 17/06/21 12:40:53 INFO ContextLauncher: 17/06/21 12:40:53 ERROR 
> scheduler.DAGSchedulerEventProcessLoop: DAGSchedulerEventProcessLoop failed; 
> shutting down SparkContext
> 17/06/21 12:40:53 INFO ContextLauncher: java.lang.NumberFormatException: For 
> input string: “8983_solr”
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> java.lang.Integer.parseInt(Integer.java:580)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> java.lang.Integer.parseInt(Integer.java:615)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> scala.collection.immutable.StringLike$class.toInt(StringLike.scala:272)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> scala.collection.immutable.StringOps.toInt(StringOps.scala:29)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.util.Utils$.parseHostPort(Utils.scala:959)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.cluster.YarnScheduler.getRackForHost(YarnScheduler.scala:36)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.TaskSetManager$$anonfun$org$apache$spark$scheduler$TaskSetManager$$addPendingTask$1.apply(TaskSetManager.scala:200)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.TaskSetManager$$anonfun$org$apache$spark$scheduler$TaskSetManager$$addPendingTask$1.apply(TaskSetManager.scala:181)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.TaskSetManager.org$apache$spark$scheduler$TaskSetManager$$addPendingTask(TaskSetManager.scala:181)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.TaskSetManager$$anonfun$1.apply$mcVI$sp(TaskSetManager.scala:160)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.TaskSetManager.(TaskSetManager.scala:159)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.createTaskSetManager(TaskSchedulerImpl.scala:212)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.submitTasks(TaskSchedulerImpl.scala:176)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1043)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:918)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:921)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:920)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> scala.collection.immutable.List.foreach(List.scala:381)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:920)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:862)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1613)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
> 17/06/21 12:40:53 INFO ContextLauncher:   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
> 17/06/21 12:40:53 INFO ContextLau

[jira] [Commented] (SPARK-21171) Speculate task scheduling block dirve handle normal task when a job task number more than one hundred thousand

2017-06-22 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16059018#comment-16059018
 ] 

Sean Owen commented on SPARK-21171:
---

There's no real detail here. I'd have to close this. This should start as a 
discussion on a mailing list, or at least include specific benchmarks and 
specific proposed changes.

> Speculate task scheduling block dirve handle normal task when a job task 
> number more than one hundred thousand
> --
>
> Key: SPARK-21171
> URL: https://issues.apache.org/jira/browse/SPARK-21171
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Affects Versions: 2.0.0
> Environment: We have more than two hundred high-performance machine 
> to handle more than 2T data by one query
>Reporter: wangminfeng
>
> If a job has more than one hundred thousand tasks and spark.speculation is 
> true, then when speculatable tasks start, choosing a speculatable task wastes a lot 
> of time and blocks other tasks. We run ad-hoc queries for data analysis, and we can't 
> tolerate one job wasting time even if it is a large job.
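
For reference, a minimal sketch of the speculation settings this report is about, set via SparkConf; the interval/quantile/multiplier values shown are simply the documented defaults, not a recommendation:

{code:java}
import org.apache.spark.SparkConf

// Illustrative values only.
val conf = new SparkConf()
  .set("spark.speculation", "true")            // enable speculative execution
  .set("spark.speculation.interval", "100ms")  // how often to check for speculatable tasks
  .set("spark.speculation.quantile", "0.75")   // fraction of tasks that must finish before checking
  .set("spark.speculation.multiplier", "1.5")  // how much slower than the median counts as slow
{code}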



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21167) Path is not decoded correctly when reading output of FileSink

2017-06-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16059019#comment-16059019
 ] 

Apache Spark commented on SPARK-21167:
--

User 'dijingran' has created a pull request for this issue:
https://github.com/apache/spark/pull/18383

> Path is not decoded correctly when reading output of FileSink
> -
>
> Key: SPARK-21167
> URL: https://issues.apache.org/jira/browse/SPARK-21167
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.1
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.1.2, 2.2.1, 2.3.0
>
>
> When reading output of FileSink, path is not decoded correctly. So if the 
> path has some special characters, such as spaces, Spark cannot read it.
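
As a rough illustration of the failure mode described above (not the actual fix), the difference comes down to how a percent-encoded path string is turned into a Hadoop Path; the sample path below is made up:

{code:java}
import java.net.URI
import org.apache.hadoop.fs.Path

// A logged output path in which a space has been percent-encoded.
val logged = "/data/output%20dir/part-00000"

// Treating the logged string as a plain path keeps the literal "%20",
// so a file named "/data/output dir/part-00000" is never found.
val undecoded = new Path(logged)

// Going through java.net.URI first decodes the percent-encoding and
// recovers the real path containing the space.
val decoded = new Path(new URI(logged))
{code}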



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21080) Workaround for HDFS delegation token expiry broken with some Hadoop versions

2017-06-22 Thread Lukasz Raszka (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16059022#comment-16059022
 ] 

Lukasz Raszka commented on SPARK-21080:
---

Update: I think it might be a mistake on our side; it was not a Spark-internal 
HDFS access attempt that caused this, but one of ours. Sorry for all the 
confusion. Still, I agree it would be great to have the mentioned PR rebased. 
Meanwhile, I guess you can close this one.

> Workaround for HDFS delegation token expiry broken with some Hadoop versions
> 
>
> Key: SPARK-21080
> URL: https://issues.apache.org/jira/browse/SPARK-21080
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.1.0
> Environment: Spark 2.1.0 on Yarn, Hadoop 2.7.3
>Reporter: Lukasz Raszka
>Priority: Minor
>
> We're getting hit by SPARK-11182, where the core issue in HDFS has been 
> fixed in more recent versions. It seems that the [workaround introduced by user 
> SaintBacchus|https://github.com/apache/spark/commit/646366b5d2f12e42f8e7287672ba29a8c918a17d]
>  doesn't work in newer versions of Hadoop. This seems to be caused by a move of 
> the property name from {{fs.hdfs.impl}} to {{fs.AbstractFileSystem.hdfs.impl}}, 
> which happened somewhere around 2.7.0 or earlier. Taking this into account 
> should make the workaround work again for more recent Hadoop versions.
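
A hedged sketch of the adjustment the description suggests, using only Hadoop's Configuration API - consult the old key first and fall back to the one newer Hadoop versions use:

{code:java}
import org.apache.hadoop.conf.Configuration

// Sketch only: pick up whichever property is actually set.
val hadoopConf = new Configuration()
val hdfsImplSetting =
  Option(hadoopConf.get("fs.hdfs.impl"))
    .orElse(Option(hadoopConf.get("fs.AbstractFileSystem.hdfs.impl")))
{code}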



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21080) Workaround for HDFS delegation token expiry broken with some Hadoop versions

2017-06-22 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-21080.
---
Resolution: Not A Problem

> Workaround for HDFS delegation token expiry broken with some Hadoop versions
> 
>
> Key: SPARK-21080
> URL: https://issues.apache.org/jira/browse/SPARK-21080
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.1.0
> Environment: Spark 2.1.0 on Yarn, Hadoop 2.7.3
>Reporter: Lukasz Raszka
>Priority: Minor
>
> We're getting hit by SPARK-11182, where the core issue in HDFS has been 
> fixed in more recent versions. It seems that the [workaround introduced by user 
> SaintBacchus|https://github.com/apache/spark/commit/646366b5d2f12e42f8e7287672ba29a8c918a17d]
>  doesn't work in newer versions of Hadoop. This seems to be caused by a move of 
> the property name from {{fs.hdfs.impl}} to {{fs.AbstractFileSystem.hdfs.impl}}, 
> which happened somewhere around 2.7.0 or earlier. Taking this into account 
> should make the workaround work again for more recent Hadoop versions.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11373) Add metrics to the History Server and providers

2017-06-22 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16059141#comment-16059141
 ] 

Steve Loughran commented on SPARK-11373:


metrics might help with understanding the s3 load issues in SPARK-19111

> Add metrics to the History Server and providers
> ---
>
> Key: SPARK-11373
> URL: https://issues.apache.org/jira/browse/SPARK-11373
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Steve Loughran
>
> The History server doesn't publish metrics about JVM load or anything from 
> the history provider plugins. This means that performance problems from 
> massive job histories aren't visible to management tools, nor are any 
> provider-generated metrics such as time to load histories, failed history 
> loads, the number of connectivity failures talking to remote services, etc.
> If the history server set up a metrics registry and offered the option to 
> publish its metrics, then management tools could view this data.
> # the metrics registry would need to be passed down to the instantiated 
> {{ApplicationHistoryProvider}}, in order for it to register its metrics.
> # if the codahale metrics servlet were registered under a path such as 
> {{/metrics}}, the values would be visible as HTML and JSON, without the need 
> for management tools.
> # Integration tests could also retrieve the JSON-formatted data and use it as 
> part of the test suites.
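
A hedged sketch of what 'set up a metrics registry and offer the option to publish its metrics' could look like with the Codahale/Dropwizard API mentioned above; the metric name and timer are made up for illustration:

{code:java}
import com.codahale.metrics.{MetricRegistry, Timer}
import com.codahale.metrics.servlets.MetricsServlet

// A registry the history server could own and pass down to providers.
val registry = new MetricRegistry()

// Example provider-side metric: time spent loading one application's history.
val historyLoadTimer: Timer = registry.timer("history.provider.appui.load")
val ctx = historyLoadTimer.time()
try {
  // ... load the application history here ...
} finally {
  ctx.stop()
}

// The servlet that would expose the registry as JSON, e.g. mounted under /metrics.
val metricsServlet = new MetricsServlet(registry)
{code}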



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21177) Append to hive slows down linearly, with number of appends.

2017-06-22 Thread Prashant Sharma (JIRA)
Prashant Sharma created SPARK-21177:
---

 Summary: Append to hive slows down linearly, with number of 
appends.
 Key: SPARK-21177
 URL: https://issues.apache.org/jira/browse/SPARK-21177
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: Prashant Sharma



In short, please use the following shell transcript for the reproducer. 

{code:java}

Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.0-SNAPSHOT
  /_/
 
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_91)
Type in expressions to have them evaluated.
Type :help for more information.

scala> def printTimeTaken(str: String, f: () => Unit) {
val start = System.nanoTime()
f()
val end = System.nanoTime()
val timetaken = end - start
import scala.concurrent.duration._
println(s"Time taken for $str is ${timetaken.nanos.toMillis}\n")
  }
 |  |  |  |  |  |  | printTimeTaken: (str: String, 
f: () => Unit)Unit

scala> 
for(i <- 1 to 1) {printTimeTaken("time to append to hive:", () => { Seq(1, 
2).toDF().write.mode("append").saveAsTable("t1"); })}
Time taken for time to append to hive: is 284

Time taken for time to append to hive: is 211

...
...

Time taken for time to append to hive: is 2615

Time taken for time to append to hive: is 3055

Time taken for time to append to hive: is 22425


{code}

Why does it matter?

In a streaming job, it is not possible to append to Hive using this DataFrame 
operation.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21177) df.SaveAsTable slows down linearly, with number of appends.

2017-06-22 Thread Prashant Sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Sharma updated SPARK-21177:

Summary: df.SaveAsTable slows down linearly, with number of appends.  (was: 
Append to hive slows down linearly, with number of appends.)

> df.SaveAsTable slows down linearly, with number of appends.
> ---
>
> Key: SPARK-21177
> URL: https://issues.apache.org/jira/browse/SPARK-21177
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Prashant Sharma
>
> In short, please use the following shell transcript for the reproducer. 
> {code:java}
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 2.3.0-SNAPSHOT
>   /_/
>  
> Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_91)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> def printTimeTaken(str: String, f: () => Unit) {
> val start = System.nanoTime()
> f()
> val end = System.nanoTime()
> val timetaken = end - start
> import scala.concurrent.duration._
> println(s"Time taken for $str is ${timetaken.nanos.toMillis}\n")
>   }
>  |  |  |  |  |  |  | printTimeTaken: (str: 
> String, f: () => Unit)Unit
> scala> 
> for(i <- 1 to 1) {printTimeTaken("time to append to hive:", () => { 
> Seq(1, 2).toDF().write.mode("append").saveAsTable("t1"); })}
> Time taken for time to append to hive: is 284
> Time taken for time to append to hive: is 211
> ...
> ...
> Time taken for time to append to hive: is 2615
> Time taken for time to append to hive: is 3055
> Time taken for time to append to hive: is 22425
> 
> {code}
> Why does it matter?
> In a streaming job, it is not possible to append to Hive using this DataFrame 
> operation.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21177) df.saveAsTable slows down linearly, with number of appends

2017-06-22 Thread Prashant Sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Sharma updated SPARK-21177:

Summary: df.saveAsTable slows down linearly, with number of appends  (was: 
df.SaveAsTable slows down linearly, with number of appends.)

> df.saveAsTable slows down linearly, with number of appends
> --
>
> Key: SPARK-21177
> URL: https://issues.apache.org/jira/browse/SPARK-21177
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Prashant Sharma
>
> In short, please use the following shell transcript for the reproducer. 
> {code:java}
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 2.3.0-SNAPSHOT
>   /_/
>  
> Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_91)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> def printTimeTaken(str: String, f: () => Unit) {
> val start = System.nanoTime()
> f()
> val end = System.nanoTime()
> val timetaken = end - start
> import scala.concurrent.duration._
> println(s"Time taken for $str is ${timetaken.nanos.toMillis}\n")
>   }
>  |  |  |  |  |  |  | printTimeTaken: (str: 
> String, f: () => Unit)Unit
> scala> 
> for(i <- 1 to 1) {printTimeTaken("time to append to hive:", () => { 
> Seq(1, 2).toDF().write.mode("append").saveAsTable("t1"); })}
> Time taken for time to append to hive: is 284
> Time taken for time to append to hive: is 211
> ...
> ...
> Time taken for time to append to hive: is 2615
> Time taken for time to append to hive: is 3055
> Time taken for time to append to hive: is 22425
> 
> {code}
> Why does it matter?
> In a streaming job, it is not possible to append to Hive using this DataFrame 
> operation.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21178) Add support for label specific metrics in MulticlassClassificationEvaluator

2017-06-22 Thread Aman Rawat (JIRA)
Aman Rawat created SPARK-21178:
--

 Summary: Add support for label specific metrics in 
MulticlassClassificationEvaluator
 Key: SPARK-21178
 URL: https://issues.apache.org/jira/browse/SPARK-21178
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 2.1.1
Reporter: Aman Rawat


MulticlassClassificationEvaluator is restricted to the global metrics f1, 
weightedPrecision, weightedRecall, and accuracy.

However, we have a requirement where we want to optimize learning on a 
metric for a specific label - for instance, the true positive rate of label 'B'.

For example: take a fraud detection use case with labels 'good' and 'fraud' 
being passed to a manual verification team. We want to maximize the 
true-positive rate of the 'fraud' label, so that whenever the model predicts a 
data point as 'good', there is a strong likelihood of it being 'good' and the 
manual team can ignore it.
It's OK to predict some 'good' data points as 'fraud', as those will be 
taken care of by the manual verification team.
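
Until the evaluator supports this, a rough workaround sketch using the existing RDD-based {{MulticlassMetrics}}, which already exposes per-label metrics; the {{predictions}} DataFrame and the label encoding below are assumptions, not part of the request:

{code:java}
import org.apache.spark.mllib.evaluation.MulticlassMetrics

// Assumptions: `predictions` came from model.transform(test) with double-typed
// "prediction" and "label" columns, and the 'fraud' class was indexed as 1.0.
val predictionAndLabels = predictions
  .select("prediction", "label")
  .rdd
  .map(row => (row.getDouble(0), row.getDouble(1)))

val metrics = new MulticlassMetrics(predictionAndLabels)
val fraudTruePositiveRate = metrics.truePositiveRate(1.0)
val fraudPrecision = metrics.precision(1.0)
{code}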



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21178) Add support for label specific metrics in MulticlassClassificationEvaluator

2017-06-22 Thread Aman Rawat (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aman Rawat updated SPARK-21178:
---
Target Version/s:   (was: 2.1.2)

> Add support for label specific metrics in MulticlassClassificationEvaluator
> ---
>
> Key: SPARK-21178
> URL: https://issues.apache.org/jira/browse/SPARK-21178
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.1.1
>Reporter: Aman Rawat
>
> MulticlassClassificationEvaluator is restricted to the global metrics f1, 
> weightedPrecision, weightedRecall, and accuracy.
> However, we have a requirement where we want to optimize learning 
> on a metric for a specific label - for instance, the true positive rate of label 'B'.
> For example: take a fraud detection use case with labels 'good' and 'fraud' 
> being passed to a manual verification team. We want to maximize the 
> true-positive rate of the 'fraud' label, so that whenever the model predicts a 
> data point as 'good', there is a strong likelihood of it being 'good' and the 
> manual team can ignore it.
> It's OK to predict some 'good' data points as 'fraud', as those will be 
> taken care of by the manual verification team.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21178) Add support for label specific metrics in MulticlassClassificationEvaluator

2017-06-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21178:


Assignee: Apache Spark

> Add support for label specific metrics in MulticlassClassificationEvaluator
> ---
>
> Key: SPARK-21178
> URL: https://issues.apache.org/jira/browse/SPARK-21178
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.1.1
>Reporter: Aman Rawat
>Assignee: Apache Spark
>
> MulticlassClassificationEvaluator is restricted to the global metrics f1, 
> weightedPrecision, weightedRecall, and accuracy.
> However, we have a requirement where we want to optimize learning 
> on a metric for a specific label - for instance, the true positive rate of label 'B'.
> For example: take a fraud detection use case with labels 'good' and 'fraud' 
> being passed to a manual verification team. We want to maximize the 
> true-positive rate of the 'fraud' label, so that whenever the model predicts a 
> data point as 'good', there is a strong likelihood of it being 'good' and the 
> manual team can ignore it.
> It's OK to predict some 'good' data points as 'fraud', as those will be 
> taken care of by the manual verification team.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21178) Add support for label specific metrics in MulticlassClassificationEvaluator

2017-06-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21178:


Assignee: (was: Apache Spark)

> Add support for label specific metrics in MulticlassClassificationEvaluator
> ---
>
> Key: SPARK-21178
> URL: https://issues.apache.org/jira/browse/SPARK-21178
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.1.1
>Reporter: Aman Rawat
>
> MulticlassClassificationEvaluator is restricted to the global metrics f1, 
> weightedPrecision, weightedRecall, and accuracy.
> However, we have a requirement where we want to optimize learning 
> on a metric for a specific label - for instance, the true positive rate of label 'B'.
> For example: take a fraud detection use case with labels 'good' and 'fraud' 
> being passed to a manual verification team. We want to maximize the 
> true-positive rate of the 'fraud' label, so that whenever the model predicts a 
> data point as 'good', there is a strong likelihood of it being 'good' and the 
> manual team can ignore it.
> It's OK to predict some 'good' data points as 'fraud', as those will be 
> taken care of by the manual verification team.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21178) Add support for label specific metrics in MulticlassClassificationEvaluator

2017-06-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16059251#comment-16059251
 ] 

Apache Spark commented on SPARK-21178:
--

User 'rawataaryan9' has created a pull request for this issue:
https://github.com/apache/spark/pull/18390

> Add support for label specific metrics in MulticlassClassificationEvaluator
> ---
>
> Key: SPARK-21178
> URL: https://issues.apache.org/jira/browse/SPARK-21178
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.1.1
>Reporter: Aman Rawat
>
> MulticlassClassificationEvaluator is restricted to the global metrics f1, 
> weightedPrecision, weightedRecall, and accuracy.
> However, we have a requirement where we want to optimize learning 
> on a metric for a specific label - for instance, the true positive rate of label 'B'.
> For example: take a fraud detection use case with labels 'good' and 'fraud' 
> being passed to a manual verification team. We want to maximize the 
> true-positive rate of the 'fraud' label, so that whenever the model predicts a 
> data point as 'good', there is a strong likelihood of it being 'good' and the 
> manual team can ignore it.
> It's OK to predict some 'good' data points as 'fraud', as those will be 
> taken care of by the manual verification team.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20295) when spark.sql.adaptive.enabled is enabled, have conflict with Exchange Resue

2017-06-22 Thread cen yuhai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16059273#comment-16059273
 ] 

cen yuhai commented on SPARK-20295:
---

I hit this bug...

> when  spark.sql.adaptive.enabled is enabled, have conflict with Exchange Resue
> --
>
> Key: SPARK-20295
> URL: https://issues.apache.org/jira/browse/SPARK-20295
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, SQL
>Affects Versions: 2.1.0
>Reporter: Ruhui Wang
>
> When running tpcds-q95 with spark.sql.adaptive.enabled = true, the physical 
> plan is initially:
> Sort
> :  +- Exchange(coordinator id: 1)
> : +- Project***
> ::-Sort **
> ::  +- Exchange(coordinator id: 2)
> :: :- Project ***
> :+- Sort
> ::  +- Exchange(coordinator id: 3)
> If spark.sql.exchange.reuse is enabled, the physical plan then becomes:
> Sort
> :  +- Exchange(coordinator id: 1)
> : +- Project***
> ::-Sort **
> ::  +- Exchange(coordinator id: 2)
> :: :- Project ***
> :+- Sort
> ::  +- ReusedExchange  Exchange(coordinator id: 2)
> If spark.sql.adaptive.enabled = true, the call stack is: 
> ShuffleExchange#doExecute --> the postShuffleRDD function --> 
> doEstimationIfNecessary. In this function, 
> assert(exchanges.length == numExchanges) fails, because the left side has only 
> one element while the right side equals 2.
> Is this a bug in the interaction between spark.sql.adaptive.enabled and exchange reuse?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20832) Standalone master should explicitly inform drivers of worker deaths and invalidate external shuffle service outputs

2017-06-22 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-20832.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18362
[https://github.com/apache/spark/pull/18362]

> Standalone master should explicitly inform drivers of worker deaths and 
> invalidate external shuffle service outputs
> ---
>
> Key: SPARK-20832
> URL: https://issues.apache.org/jira/browse/SPARK-20832
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Scheduler
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
> Fix For: 2.3.0
>
>
> In SPARK-17370 (a patch authored by [~ekhliang] and reviewed by me), we added 
> logic to the DAGScheduler to mark external shuffle service instances as 
> unavailable upon task failure when the task failure reason was "SlaveLost" 
> and this was known to be caused by worker death. If the Spark Master 
> discovered that a worker was dead then it would notify any drivers with 
> executors on those workers to mark those executors as dead. The linked patch 
> simply piggybacked on this logic to have the executor death notification also 
> imply worker death and to have worker-death-caused-executor-death imply 
> shuffle file loss.
> However, there are modes of external shuffle service loss which this 
> mechanism does not detect, leaving the system prone to race conditions. Consider 
> the following:
> * Spark standalone is configured to run an external shuffle service embedded 
> in the Worker.
> * Application has shuffle outputs and executors on Worker A.
> * Stage depending on outputs of tasks that ran on Worker A starts.
> * All executors on worker A are removed due to dying with exceptions, 
> scaling-down via the dynamic allocation APIs, but _not_ due to worker death. 
> Worker A is still healthy at this point.
> * At this point the MapOutputTracker still records map output locations on 
> Worker A's shuffle service. This is expected behavior. 
> * Worker A dies at an instant where the application has no executors running 
> on it.
> * The Master knows that Worker A died but does not inform the driver (which 
> had no executors on that worker at the time of its death).
> * Some task from the running stage attempts to fetch map outputs from Worker 
> A but these requests time out because Worker A's shuffle service isn't 
> available.
> * Due to other logic in the scheduler, these preventable FetchFailures don't 
> wind up invalidating the now-unavailable map output locations (this is 
> a distinct bug / behavior which I'll discuss in a separate JIRA ticket).
> * This behavior leads to several unsuccessful stage reattempts and ultimately 
> to a job failure.
> A simple way to address this would be to have the Master explicitly notify 
> drivers of all Worker deaths, even if those drivers don't currently have 
> executors. The Spark Standalone scheduler backend can receive the explicit 
> WorkerLost message and can bubble up the right calls to the task scheduler 
> and DAGScheduler to invalidate map output locations from the now-dead 
> external shuffle service.
> This relates to SPARK-20115 in the sense that both tickets aim to address 
> issues where the external shuffle service is unavailable. The key difference 
> is the mechanism for detection: SPARK-20115 marks the external shuffle 
> service as unavailable whenever any fetch failure occurs from it, whereas the 
> proposal here relies on more explicit signals. This JIRA ticket's proposal is 
> scoped only to Spark Standalone mode. As a compromise, we might be able to 
> consider "all of a single shuffle's outputs lost on a single external shuffle 
> service" following a fetch failure (to be discussed in separate JIRA). 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20832) Standalone master should explicitly inform drivers of worker deaths and invalidate external shuffle service outputs

2017-06-22 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-20832:
---

Assignee: Jiang Xingbo

> Standalone master should explicitly inform drivers of worker deaths and 
> invalidate external shuffle service outputs
> ---
>
> Key: SPARK-20832
> URL: https://issues.apache.org/jira/browse/SPARK-20832
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Scheduler
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
>Assignee: Jiang Xingbo
> Fix For: 2.3.0
>
>
> In SPARK-17370 (a patch authored by [~ekhliang] and reviewed by me), we added 
> logic to the DAGScheduler to mark external shuffle service instances as 
> unavailable upon task failure when the task failure reason was "SlaveLost" 
> and this was known to be caused by worker death. If the Spark Master 
> discovered that a worker was dead then it would notify any drivers with 
> executors on those workers to mark those executors as dead. The linked patch 
> simply piggybacked on this logic to have the executor death notification also 
> imply worker death and to have worker-death-caused-executor-death imply 
> shuffle file loss.
> However, there are modes of external shuffle service loss which this 
> mechanism does not detect, leaving the system prone to race conditions. Consider 
> the following:
> * Spark standalone is configured to run an external shuffle service embedded 
> in the Worker.
> * Application has shuffle outputs and executors on Worker A.
> * Stage depending on outputs of tasks that ran on Worker A starts.
> * All executors on worker A are removed due to dying with exceptions, 
> scaling-down via the dynamic allocation APIs, but _not_ due to worker death. 
> Worker A is still healthy at this point.
> * At this point the MapOutputTracker still records map output locations on 
> Worker A's shuffle service. This is expected behavior. 
> * Worker A dies at an instant where the application has no executors running 
> on it.
> * The Master knows that Worker A died but does not inform the driver (which 
> had no executors on that worker at the time of its death).
> * Some task from the running stage attempts to fetch map outputs from Worker 
> A but these requests time out because Worker A's shuffle service isn't 
> available.
> * Due to other logic in the scheduler, these preventable FetchFailures don't 
> wind up invalidating the now-unavailable map output locations (this is 
> a distinct bug / behavior which I'll discuss in a separate JIRA ticket).
> * This behavior leads to several unsuccessful stage reattempts and ultimately 
> to a job failure.
> A simple way to address this would be to have the Master explicitly notify 
> drivers of all Worker deaths, even if those drivers don't currently have 
> executors. The Spark Standalone scheduler backend can receive the explicit 
> WorkerLost message and can bubble up the right calls to the task scheduler 
> and DAGScheduler to invalidate map output locations from the now-dead 
> external shuffle service.
> This relates to SPARK-20115 in the sense that both tickets aim to address 
> issues where the external shuffle service is unavailable. The key difference 
> is the mechanism for detection: SPARK-20115 marks the external shuffle 
> service as unavailable whenever any fetch failure occurs from it, whereas the 
> proposal here relies on more explicit signals. This JIRA ticket's proposal is 
> scoped only to Spark Standalone mode. As a compromise, we might be able to 
> consider "all of a single shuffle's outputs lost on a single external shuffle 
> service" following a fetch failure (to be discussed in separate JIRA). 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21179) Unable to return Hive INT data type into Spark SQL via Hive JDBC driver: Caused by: java.sql.SQLDataException: [Simba][JDBC](10140) Error converting value to int.

2017-06-22 Thread Matthew Walton (JIRA)
Matthew Walton created SPARK-21179:
--

 Summary: Unable to return Hive INT data type into Spark SQL via 
Hive JDBC driver:  Caused by: java.sql.SQLDataException: [Simba][JDBC](10140) 
Error converting value to int.  
 Key: SPARK-21179
 URL: https://issues.apache.org/jira/browse/SPARK-21179
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell, SQL
Affects Versions: 2.0.0, 1.6.0
 Environment: OS:  Linux
HDP version 2.5.0.1-60
Hive version: 1.2.1
Spark  version 2.0.0.2.5.0.1-60
JDBC:  Download the latest Hortonworks JDBC driver
Reporter: Matthew Walton


I'm trying to fetch data back into Spark SQL using a JDBC connection to Hive.  
Unfortunately, when I try to query data that resides in an INT column, I get the 
following error:  

17/06/22 12:14:37 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.sql.SQLDataException: [Simba][JDBC](10140) Error converting value to int.  

Steps to reproduce:

1) On Hive create a simple table with an INT column and insert some data (I 
used SQuirreL Client with the Hortonworks JDBC driver):

create table wh2.hivespark (country_id int, country_name string)
insert into wh2.hivespark values (1, 'USA')

2) Copy the Hortonworks Hive JDBC driver to the machine where you will run 
Spark Shell

3) Start Spark shell loading the Hortonworks Hive JDBC driver jar files

./spark-shell --jars 
/home/spark/jdbc/hortonworkshive/HiveJDBC41.jar,/home/spark/jdbc/hortonworkshive/TCLIServiceClient.jar,/home/spark/jdbc/hortonworkshive/commons-codec-1.3.jar,/home/spark/jdbc/hortonworkshive/commons-logging-1.1.1.jar,/home/spark/jdbc/hortonworkshive/hive_metastore.jar,/home/spark/jdbc/hortonworkshive/hive_service.jar,/home/spark/jdbc/hortonworkshive/httpclient-4.1.3.jar,/home/spark/jdbc/hortonworkshive/httpcore-4.1.3.jar,/home/spark/jdbc/hortonworkshive/libfb303-0.9.0.jar,/home/spark/jdbc/hortonworkshive/libthrift-0.9.0.jar,/home/spark/jdbc/hortonworkshive/log4j-1.2.14.jar,/home/spark/jdbc/hortonworkshive/ql.jar,/home/spark/jdbc/hortonworkshive/slf4j-api-1.5.11.jar,/home/spark/jdbc/hortonworkshive/slf4j-log4j12-1.5.11.jar,/home/spark/jdbc/hortonworkshive/zookeeper-3.4.6.jar

4) In Spark shell load the data from Hive using the JDBC driver

val hivespark = spark.read.format("jdbc").options(Map("url" -> 
"jdbc:hive2://localhost:1/wh2;AuthMech=3;UseNativeQuery=1;user=hfs;password=hdfs","dbtable"
 -> 
"wh2.hivespark")).option("driver","com.simba.hive.jdbc41.HS2Driver").option("user","hdfs").option("password","hdfs").load()

5) In Spark shell try to display the data

hivespark.show()

At this point you should see the error:

scala> hivespark.show()
17/06/22 12:14:37 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.sql.SQLDataException: [Simba][JDBC](10140) Error converting value to int.
at 
com.simba.hiveserver2.exceptions.ExceptionConverter.toSQLException(Unknown 
Source)
at 
com.simba.hiveserver2.utilities.conversion.TypeConverter.toInt(Unknown Source)
at com.simba.hiveserver2.jdbc.common.SForwardResultSet.getInt(Unknown 
Source)
at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.getNext(JDBCRDD.scala:437)
at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.hasNext(JDBCRDD.scala:535)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Note:  I also tested this issue using a JDBC driver from Progress DataDirect 
and I see a similar error message so this does not seem to be driver specific.
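
One possible workaround sketch, not verified against this driver: hand Spark a parenthesized subquery as {{dbtable}} so the INT is cast to STRING on the Hive side and cast back afterwards; {{jdbcOptions}} stands for the same url/driver/user/password options used in step 4:

{code:java}
// Sketch only.
val hivesparkWorkaround = spark.read.format("jdbc")
  .options(jdbcOptions)
  .option("dbtable",
    "(SELECT CAST(country_id AS STRING) AS country_id, country_name FROM wh2.hivespark) t")
  .load()
  .selectExpr("CAST(country_id AS INT) AS country_id", "country_name")
{code}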

scala> hivespark.show()
17/06/22 12:0

[jira] [Commented] (SPARK-21166) Automated ML persistence

2017-06-22 Thread Mark Hamilton (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16059481#comment-16059481
 ] 

Mark Hamilton commented on SPARK-21166:
---

The code is currently being developed here: 

https://github.com/Azure/mmlspark/pull/25


> Automated ML persistence
> 
>
> Key: SPARK-21166
> URL: https://issues.apache.org/jira/browse/SPARK-21166
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Joseph K. Bradley
>
> This JIRA is for discussing the possibility of automating ML persistence.  
> Currently, custom save/load methods are written for every Model.  However, we 
> could design a mixin which provides automated persistence, inspecting model 
> data and Params and reading/writing (known types) automatically.  This was 
> brought up in discussions with developers behind 
> https://github.com/azure/mmlspark
> Some issues we will need to consider:
> * Providing generic mixin usable in most or all cases
> * Handling corner cases (strange Param types, etc.)
> * Backwards compatibility (loading models saved by old Spark versions)
> Because of backwards compatibility in particular, it may make sense to 
> implement testing for that first, before we try to address automated 
> persistence: [SPARK-15573]
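
A rough sketch (not the proposed design) of the core idea: every Param value can already be enumerated generically through the public Params API, which is what would replace hand-written save/load methods for the Params part; the helper name is made up:

{code:java}
import org.apache.spark.ml.param.Params

// Enumerate all Param values of a pipeline stage generically.
def paramsToPairs(stage: Params): Seq[(String, String)] =
  stage.extractParamMap().toSeq.map { pair =>
    (pair.param.name, String.valueOf(pair.value))  // a real design would use jsonEncode
  }
{code}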



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21180) Remove conf from stats functions since now we have conf in LogicalPlan

2017-06-22 Thread Zhenhua Wang (JIRA)
Zhenhua Wang created SPARK-21180:


 Summary: Remove conf from stats functions since now we have conf 
in LogicalPlan
 Key: SPARK-21180
 URL: https://issues.apache.org/jira/browse/SPARK-21180
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.3.0
Reporter: Zhenhua Wang






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21176) Master UI hangs with spark.ui.reverseProxy=true if the master node has many CPUs

2017-06-22 Thread Ingo Schuster (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ingo Schuster updated SPARK-21176:
--
Description: 
In reverse proxy mode, Spark exhausts the Jetty thread pool if the master node 
has too many CPUs or the cluster has too many executors:

For each connector, Jetty creates Selector threads: minimum 4, maximum half the 
number of available CPUs:
{{this(Math.max(1, Runtime.getRuntime().availableProcessors() / 2));}}
(see 
https://github.com/eclipse/jetty.project/blob/0c8273f2ca1f9bf2064cd9c4c939d2546443f759/jetty-client/src/main/java/org/eclipse/jetty/client/http/HttpClientTransportOverHTTP.java)

In reverse proxy mode, a connector is set up for each executor and one for the 
master UI.
I have a system with 88 CPUs on the master node and 7 executors. Jetty tries to 
instantiate 8*44 = 352 selector threads, but since the QueuedThreadPool is 
initialized with 200 threads by default, the UI gets stuck.

I have patched JettyUtils.scala to extend the thread pool ( {{val pool = new 
QueuedThreadPool*(400)* }}). With this hack, the UI works.

Obviously, the Jetty defaults are meant for a real web server. If that has 88 
CPUs, you certainly expect a lot of traffic.
For the Spark admin UI however, there will rarely be concurrent accesses for 
the same application or the same executor.
I therefore propose to dramatically reduce the number of selector threads that 
get instantiated - at least by default.

I will propose a fix in a pull request.

  was:
In reverse proxy mode, Spark exhausts the Jetty thread pool if the master node 
has too many CPUs or the cluster has too many executors:

For each connector, Jetty creates Selector threads: minimum 4, maximum half the 
number of available CPUs:
{{ this(Math.max(1, Runtime.getRuntime().availableProcessors() / 2));}}
(see 
https://github.com/eclipse/jetty.project/blob/0c8273f2ca1f9bf2064cd9c4c939d2546443f759/jetty-client/src/main/java/org/eclipse/jetty/client/http/HttpClientTransportOverHTTP.java)

In reverse proxy mode, a connector is set up for each executor and one for the 
master UI.
I have a system with 88 CPUs on the master node and 7 executors. Jetty tries to 
instantiate 8*44 = 352 selector threads, but since the QueuedThreadPool is 
initialized with 200 threads by default, the UI gets stuck.

I have patched JettyUtils.scala to extend the thread pool ( {{val pool = new 
QueuedThreadPool*(400)* }}). With this hack, the UI works.

Obviously, the Jetty defaults are meant for a real web server. If that has 88 
CPUs, you certainly expect a lot of traffic.
For the Spark admin UI however, there will rarely be concurrent accesses for 
the same application or the same executor.
I therefore propose to dramatically reduce the number of selector threads that 
get instantiated - at least by default.

I will propose a fix in a pull request.


> Master UI hangs with spark.ui.reverseProxy=true if the master node has many 
> CPUs
> 
>
> Key: SPARK-21176
> URL: https://issues.apache.org/jira/browse/SPARK-21176
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.0, 2.1.1, 2.2.0, 2.2.1
> Environment: ppc64le GNU/Linux, POWER8, only master node is reachable 
> externally other nodes are in an internal network
>Reporter: Ingo Schuster
>  Labels: network, web-ui
>
> In reverse proxy mode, Spark exhausts the Jetty thread pool if the master 
> node has too many CPUs or the cluster has too many executors:
> For each connector, Jetty creates Selector threads: minimum 4, maximum half 
> the number of available CPUs:
> {{this(Math.max(1, Runtime.getRuntime().availableProcessors() / 2));}}
> (see 
> https://github.com/eclipse/jetty.project/blob/0c8273f2ca1f9bf2064cd9c4c939d2546443f759/jetty-client/src/main/java/org/eclipse/jetty/client/http/HttpClientTransportOverHTTP.java)
> In reverse proxy mode, a connector is set up for each executor and one for 
> the master UI.
> I have a system with 88 CPUs on the master node and 7 executors. Jetty tries 
> to instantiate 8*44 = 352 selector threads, but since the QueuedThreadPool is 
> initialized with 200 threads by default, the UI gets stuck.
> I have patched JettyUtils.scala to extend the thread pool ( {{val pool = new 
> QueuedThreadPool*(400)* }}). With this hack, the UI works.
> Obviously, the Jetty defaults are meant for a real web server. If that has 88 
> CPUs, you certainly expect a lot of traffic.
> For the Spark admin UI however, there will rarely be concurrent accesses for 
> the same application or the same executor.
> I therefore propose to dramatically reduce the number of selector threads 
> that get instantiated - at least by default.
> I will propose a fix in a pull request.



--
This message was sent by Atl

[jira] [Updated] (SPARK-21176) Master UI hangs with spark.ui.reverseProxy=true if the master node has many CPUs

2017-06-22 Thread Ingo Schuster (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ingo Schuster updated SPARK-21176:
--
Description: 
In reverse proxy mode, Spark exhausts the Jetty thread pool if the master node 
has too many CPUs or the cluster has too many executors:

For each connector, Jetty creates Selector threads: minimum 4, maximum half the 
number of available CPUs:
{{ this(Math.max(1, Runtime.getRuntime().availableProcessors() / 2));}}
(see 
https://github.com/eclipse/jetty.project/blob/0c8273f2ca1f9bf2064cd9c4c939d2546443f759/jetty-client/src/main/java/org/eclipse/jetty/client/http/HttpClientTransportOverHTTP.java)

In reverse proxy mode, a connector is set up for each executor and one for the 
master UI.
I have a system with 88 CPUs on the master node and 7 executors. Jetty tries to 
instantiate 8*44 = 352 selector threads, but since the QueuedThreadPool is 
initialized with 200 threads by default, the UI gets stuck.

I have patched JettyUtils.scala to extend the thread pool ( {{val pool = new 
QueuedThreadPool*(400)* }}). With this hack, the UI works.

Obviously, the Jetty defaults are meant for a real web server. If that has 88 
CPUs, you certainly expect a lot of traffic.
For the Spark admin UI however, there will rarely be concurrent accesses for 
the same application or the same executor.
I therefore propose to dramatically reduce the number of selector threads that 
get instantiated - at least by default.

I will propose a fix in a pull request.

  was:
In reverse proxy mode, Spark exhausts the Jetty thread pool if the master node 
has too many CPUs or the cluster has too many executors:

For each connector, Jetty creates Selector threads: minimum 4, maximum half the 
number of available CPUs:
{{selectors>0?selectors:Math.max(1,Math.min(4,Runtime.getRuntime().availableProcessors()/2)));}}
(see 
https://github.com/eclipse/jetty.project/blob/jetty-9.3.x/jetty-server/src/main/java/org/eclipse/jetty/server/ServerConnector.java)

In reverse proxy mode, a connector is set up for each executor and one for the 
master UI.
I have a system with 88 CPUs on the master node and 7 executors. Jetty tries to 
instantiate 8*44 = 352 selector threads, but since the QueuedThreadPool is 
initialized with 200 threads by default, the UI gets stuck.

I have patched JettyUtils.scala to extend the thread pool ( {{val pool = new 
QueuedThreadPool*(400)* }}). With this hack, the UI works.

Obviously, the Jetty defaults are meant for a real web server. If that has 88 
CPUs, you certainly expect a lot of traffic.
For the Spark admin UI however, there will rarely be concurrent accesses for 
the same application or the same executor.
I therefore propose to dramatically reduce the number of selector threads that 
get instantiated - at least by default.

I will propose a fix in a pull request.


> Master UI hangs with spark.ui.reverseProxy=true if the master node has many 
> CPUs
> 
>
> Key: SPARK-21176
> URL: https://issues.apache.org/jira/browse/SPARK-21176
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.0, 2.1.1, 2.2.0, 2.2.1
> Environment: ppc64le GNU/Linux, POWER8, only master node is reachable 
> externally other nodes are in an internal network
>Reporter: Ingo Schuster
>  Labels: network, web-ui
>
> In reverse proxy mode, Spark exhausts the Jetty thread pool if the master 
> node has too many CPUs or the cluster has too many executors:
> For each connector, Jetty creates Selector threads: minimum 4, maximum half 
> the number of available CPUs:
> {{ this(Math.max(1, Runtime.getRuntime().availableProcessors() / 2));}}
> (see 
> https://github.com/eclipse/jetty.project/blob/0c8273f2ca1f9bf2064cd9c4c939d2546443f759/jetty-client/src/main/java/org/eclipse/jetty/client/http/HttpClientTransportOverHTTP.java)
> In reverse proxy mode, a connector is set up for each executor and one for 
> the master UI.
> I have a system with 88 CPUs on the master node and 7 executors. Jetty tries 
> to instantiate 8*44 = 352 selector threads, but since the QueuedThreadPool is 
> initialized with 200 threads by default, the UI gets stuck.
> I have patched JettyUtils.scala to extend the thread pool ( {{val pool = new 
> QueuedThreadPool*(400)* }}). With this hack, the UI works.
> Obviously, the Jetty defaults are meant for a real web server. If that has 88 
> CPUs, you certainly expect a lot of traffic.
> For the Spark admin UI however, there will rarely be concurrent accesses for 
> the same application or the same executor.
> I therefore propose to dramatically reduce the number of selector threads 
> that get instantiated - at least by default.
> I will propose a fix in a pull request.
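
For context, a hedged sketch of the kind of change being proposed - capping acceptors/selectors explicitly when the connector is created instead of enlarging the thread pool (Jetty 9.x API; the numbers are illustrative, not the actual patch):

{code:java}
import org.eclipse.jetty.server.{Server, ServerConnector}
import org.eclipse.jetty.util.thread.QueuedThreadPool

// Illustrative only: a small fixed number of acceptor/selector threads per
// connector keeps an 88-core master from spawning 44 selectors per proxied executor.
val pool = new QueuedThreadPool(200)
val server = new Server(pool)
val connector = new ServerConnector(server, /* acceptors = */ 1, /* selectors = */ 2)
connector.setPort(8080)
server.addConnector(connector)
{code}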



--
This message was sent by Atlassian JIRA
(v6.4.1

[jira] [Assigned] (SPARK-21180) Remove conf from stats functions since now we have conf in LogicalPlan

2017-06-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21180:


Assignee: (was: Apache Spark)

> Remove conf from stats functions since now we have conf in LogicalPlan
> --
>
> Key: SPARK-21180
> URL: https://issues.apache.org/jira/browse/SPARK-21180
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Zhenhua Wang
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21180) Remove conf from stats functions since now we have conf in LogicalPlan

2017-06-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21180:


Assignee: Apache Spark

> Remove conf from stats functions since now we have conf in LogicalPlan
> --
>
> Key: SPARK-21180
> URL: https://issues.apache.org/jira/browse/SPARK-21180
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Zhenhua Wang
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21180) Remove conf from stats functions since now we have conf in LogicalPlan

2017-06-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16059525#comment-16059525
 ] 

Apache Spark commented on SPARK-21180:
--

User 'wzhfy' has created a pull request for this issue:
https://github.com/apache/spark/pull/18391

> Remove conf from stats functions since now we have conf in LogicalPlan
> --
>
> Key: SPARK-21180
> URL: https://issues.apache.org/jira/browse/SPARK-21180
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Zhenhua Wang
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21110) Structs should be usable in inequality filters

2017-06-22 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16059536#comment-16059536
 ] 

Nicholas Chammas commented on SPARK-21110:
--

cc [~marmbrus] - Assuming this is a valid feature request, maybe it belongs as 
part of a larger umbrella task.

> Structs should be usable in inequality filters
> --
>
> Key: SPARK-21110
> URL: https://issues.apache.org/jira/browse/SPARK-21110
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Nicholas Chammas
>Priority: Minor
>
> It seems like a missing feature that you can't compare structs in a filter on 
> a DataFrame.
> Here's a simple demonstration of a) where this would be useful and b) how 
> it's different from simply comparing each of the components of the structs.
> {code}
> import pyspark
> from pyspark.sql.functions import col, struct, concat
> spark = pyspark.sql.SparkSession.builder.getOrCreate()
> df = spark.createDataFrame(
> [
> ('Boston', 'Bob'),
> ('Boston', 'Nick'),
> ('San Francisco', 'Bob'),
> ('San Francisco', 'Nick'),
> ],
> ['city', 'person']
> )
> pairs = (
> df.select(
> struct('city', 'person').alias('p1')
> )
> .crossJoin(
> df.select(
> struct('city', 'person').alias('p2')
> )
> )
> )
> print("Everything")
> pairs.show()
> print("Comparing parts separately (doesn't give me what I want)")
> (pairs
> .where(col('p1.city') < col('p2.city'))
> .where(col('p1.person') < col('p2.person'))
> .show())
> print("Comparing parts together with concat (gives me what I want but is 
> hacky)")
> (pairs
> .where(concat('p1.city', 'p1.person') < concat('p2.city', 'p2.person'))
> .show())
> print("Comparing parts together with struct (my desired solution but 
> currently yields an error)")
> (pairs
> .where(col('p1') < col('p2'))
> .show())
> {code}
> The last query yields the following error in Spark 2.1.1:
> {code}
> org.apache.spark.sql.AnalysisException: cannot resolve '(`p1` < `p2`)' due to 
> data type mismatch: '(`p1` < `p2`)' requires (boolean or tinyint or smallint 
> or int or bigint or float or double or decimal or timestamp or date or string 
> or binary) type, not struct;;
> 'Filter (p1#5 < p2#8)
> +- Join Cross
>:- Project [named_struct(city, city#0, person, person#1) AS p1#5]
>:  +- LogicalRDD [city#0, person#1]
>+- Project [named_struct(city, city#0, person, person#1) AS p2#8]
>   +- LogicalRDD [city#0, person#1]
> {code}
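> Until struct ordering is supported, a hedged workaround sketch (written against the 
> Scala API here, but the idea carries over to PySpark) is to spell out the 
> lexicographic comparison over the struct fields by hand, assuming the same 
> cross-joined {{pairs}} DataFrame as above:
> {code}
> import org.apache.spark.sql.functions.col
>
> val lexLessThan =
>   (col("p1.city") < col("p2.city")) ||
>   (col("p1.city") === col("p2.city") && col("p1.person") < col("p2.person"))
>
> pairs.where(lexLessThan).show()
> {code}
> Unlike filtering on each component separately, this matches the usual lexicographic 
> ordering, and it avoids the string-concatenation hack.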



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21181) Suppress memory leak errors reported by netty

2017-06-22 Thread Dhruve Ashar (JIRA)
Dhruve Ashar created SPARK-21181:


 Summary: Suppress memory leak errors reported by netty
 Key: SPARK-21181
 URL: https://issues.apache.org/jira/browse/SPARK-21181
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 2.1.0
Reporter: Dhruve Ashar
Priority: Minor


We are seeing netty report memory leak errors like the one below after switching 
to 2.1. 

{code}
ERROR ResourceLeakDetector: LEAK: ByteBuf.release() was not called before it's 
garbage-collected. Enable advanced leak reporting to find out where the leak 
occurred. To enable advanced leak reporting, specify the JVM option 
'-Dio.netty.leakDetection.level=advanced' or call 
ResourceLeakDetector.setLevel() See 
http://netty.io/wiki/reference-counted-objects.html for more information.
{code}

Looking a bit deeper, Spark is not leaking any memory here, but it is confusing 
for the user to see the error message in the driver logs. 

After enabling '-Dio.netty.leakDetection.level=advanced', netty reveals the 
SparkSaslServer to be the source of these leaks.

Sample trace :https://gist.github.com/dhruve/b299ebc35aa0a185c244a0468927daf1



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21181) Suppress memory leak errors reported by netty

2017-06-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21181:


Assignee: Apache Spark

> Suppress memory leak errors reported by netty
> -
>
> Key: SPARK-21181
> URL: https://issues.apache.org/jira/browse/SPARK-21181
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.1.0
>Reporter: Dhruve Ashar
>Assignee: Apache Spark
>Priority: Minor
>
> We are seeing netty report memory leak errors like the one below after 
> switching to 2.1. 
> {code}
> ERROR ResourceLeakDetector: LEAK: ByteBuf.release() was not called before 
> it's garbage-collected. Enable advanced leak reporting to find out where the 
> leak occurred. To enable advanced leak reporting, specify the JVM option 
> '-Dio.netty.leakDetection.level=advanced' or call 
> ResourceLeakDetector.setLevel() See 
> http://netty.io/wiki/reference-counted-objects.html for more information.
> {code}
> Looking a bit deeper, Spark is not leaking any memory here, but it is 
> confusing for the user to see the error message in the driver logs. 
> After enabling '-Dio.netty.leakDetection.level=advanced', netty reveals the 
> SparkSaslServer to be the source of these leaks.
> Sample trace :https://gist.github.com/dhruve/b299ebc35aa0a185c244a0468927daf1



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21181) Suppress memory leak errors reported by netty

2017-06-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16059671#comment-16059671
 ] 

Apache Spark commented on SPARK-21181:
--

User 'dhruve' has created a pull request for this issue:
https://github.com/apache/spark/pull/18392

> Suppress memory leak errors reported by netty
> -
>
> Key: SPARK-21181
> URL: https://issues.apache.org/jira/browse/SPARK-21181
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.1.0
>Reporter: Dhruve Ashar
>Priority: Minor
>
> We are seeing netty report memory leak errors like the one below after 
> switching to 2.1. 
> {code}
> ERROR ResourceLeakDetector: LEAK: ByteBuf.release() was not called before 
> it's garbage-collected. Enable advanced leak reporting to find out where the 
> leak occurred. To enable advanced leak reporting, specify the JVM option 
> '-Dio.netty.leakDetection.level=advanced' or call 
> ResourceLeakDetector.setLevel() See 
> http://netty.io/wiki/reference-counted-objects.html for more information.
> {code}
> Looking a bit deeper, Spark is not leaking any memory here, but it is 
> confusing for the user to see the error message in the driver logs. 
> After enabling '-Dio.netty.leakDetection.level=advanced', netty reveals the 
> SparkSaslServer to be the source of these leaks.
> Sample trace :https://gist.github.com/dhruve/b299ebc35aa0a185c244a0468927daf1



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21181) Suppress memory leak errors reported by netty

2017-06-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21181:


Assignee: (was: Apache Spark)

> Suppress memory leak errors reported by netty
> -
>
> Key: SPARK-21181
> URL: https://issues.apache.org/jira/browse/SPARK-21181
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.1.0
>Reporter: Dhruve Ashar
>Priority: Minor
>
> We are seeing netty report memory leak errors like the one below after 
> switching to 2.1. 
> {code}
> ERROR ResourceLeakDetector: LEAK: ByteBuf.release() was not called before 
> it's garbage-collected. Enable advanced leak reporting to find out where the 
> leak occurred. To enable advanced leak reporting, specify the JVM option 
> '-Dio.netty.leakDetection.level=advanced' or call 
> ResourceLeakDetector.setLevel() See 
> http://netty.io/wiki/reference-counted-objects.html for more information.
> {code}
> Looking a bit deeper, Spark is not leaking any memory here, but it is 
> confusing for the user to see the error message in the driver logs. 
> After enabling '-Dio.netty.leakDetection.level=advanced', netty reveals the 
> SparkSaslServer to be the source of these leaks.
> Sample trace :https://gist.github.com/dhruve/b299ebc35aa0a185c244a0468927daf1



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21155) Add (? running tasks) into Spark UI progress

2017-06-22 Thread Eric Vandenberg (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Vandenberg updated SPARK-21155:

Attachment: Screen Shot 2017-06-22 at 9.58.08 AM.png

> Add (? running tasks) into Spark UI progress
> 
>
> Key: SPARK-21155
> URL: https://issues.apache.org/jira/browse/SPARK-21155
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.1.1
>Reporter: Eric Vandenberg
>Priority: Minor
> Attachments: Screen Shot 2017-06-20 at 12.32.58 PM.png, Screen Shot 
> 2017-06-20 at 3.40.39 PM.png, Screen Shot 2017-06-22 at 9.58.08 AM.png
>
>
> The progress UI for Active Jobs / Tasks should show the exact number of running 
> tasks. See the screenshot attachments for what this looks like.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20599) ConsoleSink should work with write (batch)

2017-06-22 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-20599.
--
   Resolution: Fixed
Fix Version/s: 2.3.0

> ConsoleSink should work with write (batch)
> --
>
> Key: SPARK-20599
> URL: https://issues.apache.org/jira/browse/SPARK-20599
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Jacek Laskowski
>Priority: Minor
>  Labels: starter
> Fix For: 2.3.0
>
>
> I think the following should just work.
> {code}
> spark.
>   read.  // <-- it's a batch query not streaming query if that matters
>   format("kafka").
>   option("subscribe", "topic1").
>   option("kafka.bootstrap.servers", "localhost:9092").
>   load.
>   write.
>   format("console").  // <-- that's not supported currently
>   save
> {code}
> The above combination of {{kafka}} source and {{console}} sink leads to the 
> following exception:
> {code}
> java.lang.RuntimeException: 
> org.apache.spark.sql.execution.streaming.ConsoleSinkProvider does not allow 
> create table as select.
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:479)
>   at 
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:48)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:93)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:93)
>   at 
> org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:610)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233)
>   ... 48 elided
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-21167) Path is not decoded correctly when reading output of FileSink

2017-06-22 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-21167:
-
Comment: was deleted

(was: User 'dijingran' has created a pull request for this issue:
https://github.com/apache/spark/pull/18383)

> Path is not decoded correctly when reading output of FileSink
> -
>
> Key: SPARK-21167
> URL: https://issues.apache.org/jira/browse/SPARK-21167
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.1
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.1.2, 2.2.1, 2.3.0
>
>
> When reading output of FileSink, path is not decoded correctly. So if the 
> path has some special characters, such as spaces, Spark cannot read it.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21167) Path is not decoded correctly when reading output of FileSink

2017-06-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16059816#comment-16059816
 ] 

Apache Spark commented on SPARK-21167:
--

User 'dijingran' has created a pull request for this issue:
https://github.com/apache/spark/pull/18383

> Path is not decoded correctly when reading output of FileSink
> -
>
> Key: SPARK-21167
> URL: https://issues.apache.org/jira/browse/SPARK-21167
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.1
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.1.2, 2.2.1, 2.3.0
>
>
> When reading output of FileSink, path is not decoded correctly. So if the 
> path has some special characters, such as spaces, Spark cannot read it.
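> A minimal repro sketch (hypothetical paths; it uses the rate source for brevity, 
> which assumes Spark 2.2+, but any streaming source written through the file sink 
> should do):
> {code}
> // Write with the file sink to an output directory whose name contains spaces.
> val query = spark.readStream.format("rate").load()
>   .writeStream
>   .format("parquet")
>   .option("checkpointLocation", "/tmp/filesink-checkpoint")
>   .start("/tmp/file sink output")
> query.processAllAvailable()
> query.stop()
>
> // Reading the sink output back goes through the sink's metadata log; before the fix
> // the encoded path was not decoded, so this read could fail to locate the files.
> spark.read.format("parquet").load("/tmp/file sink output").show()
> {code}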



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21168) KafkaRDD should always set kafka clientId.

2017-06-22 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-21168:
-
Component/s: (was: Structured Streaming)
 DStreams

> KafkaRDD should always set kafka clientId.
> --
>
> Key: SPARK-21168
> URL: https://issues.apache.org/jira/browse/SPARK-21168
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.2
>Reporter: Xingxing Di
>Priority: Trivial
>
> I found that KafkaRDD does not set the Kafka client.id in the "fetchBatch" method 
> (FetchRequestBuilder sets clientId to empty by default). Normally this affects 
> nothing, but in our case we use the clientId on the Kafka server side, so we had 
> to rebuild spark-streaming-kafka.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21179) Unable to return Hive INT data type into Spark via Hive JDBC driver: Caused by: java.sql.SQLDataException: [Simba][JDBC](10140) Error converting value to int.

2017-06-22 Thread Matthew Walton (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Walton updated SPARK-21179:
---
Affects Version/s: 2.1.1
  Summary: Unable to return Hive INT data type into Spark via Hive 
JDBC driver:  Caused by: java.sql.SQLDataException: [Simba][JDBC](10140) Error 
converting value to int.(was: Unable to return Hive INT data type into 
Spark SQL via Hive JDBC driver:  Caused by: java.sql.SQLDataException: 
[Simba][JDBC](10140) Error converting value to int.  )

> Unable to return Hive INT data type into Spark via Hive JDBC driver:  Caused 
> by: java.sql.SQLDataException: [Simba][JDBC](10140) Error converting value to 
> int.  
> -
>
> Key: SPARK-21179
> URL: https://issues.apache.org/jira/browse/SPARK-21179
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL
>Affects Versions: 1.6.0, 2.0.0, 2.1.1
> Environment: OS:  Linux
> HDP version 2.5.0.1-60
> Hive version: 1.2.1
> Spark  version 2.0.0.2.5.0.1-60
> JDBC:  Download the latest Hortonworks JDBC driver
>Reporter: Matthew Walton
>
> I'm trying to fetch back data in Spark SQL using a JDBC connection to Hive.  
> Unfortunately, when I try to query data that resides in an INT column I get 
> the following error:  
> 17/06/22 12:14:37 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> java.sql.SQLDataException: [Simba][JDBC](10140) Error converting value to 
> int.  
> Steps to reproduce:
> 1) On Hive create a simple table with an INT column and insert some data (I 
> used SQuirreL Client with the Hortonworks JDBC driver):
> create table wh2.hivespark (country_id int, country_name string)
> insert into wh2.hivespark values (1, 'USA')
> 2) Copy the Hortonworks Hive JDBC driver to the machine where you will run 
> Spark Shell
> 3) Start Spark shell loading the Hortonworks Hive JDBC driver jar files
> ./spark-shell --jars 
> /home/spark/jdbc/hortonworkshive/HiveJDBC41.jar,/home/spark/jdbc/hortonworkshive/TCLIServiceClient.jar,/home/spark/jdbc/hortonworkshive/commons-codec-1.3.jar,/home/spark/jdbc/hortonworkshive/commons-logging-1.1.1.jar,/home/spark/jdbc/hortonworkshive/hive_metastore.jar,/home/spark/jdbc/hortonworkshive/hive_service.jar,/home/spark/jdbc/hortonworkshive/httpclient-4.1.3.jar,/home/spark/jdbc/hortonworkshive/httpcore-4.1.3.jar,/home/spark/jdbc/hortonworkshive/libfb303-0.9.0.jar,/home/spark/jdbc/hortonworkshive/libthrift-0.9.0.jar,/home/spark/jdbc/hortonworkshive/log4j-1.2.14.jar,/home/spark/jdbc/hortonworkshive/ql.jar,/home/spark/jdbc/hortonworkshive/slf4j-api-1.5.11.jar,/home/spark/jdbc/hortonworkshive/slf4j-log4j12-1.5.11.jar,/home/spark/jdbc/hortonworkshive/zookeeper-3.4.6.jar
> 4) In Spark shell load the data from Hive using the JDBC driver
> val hivespark = spark.read.format("jdbc").options(Map("url" -> 
> "jdbc:hive2://localhost:1/wh2;AuthMech=3;UseNativeQuery=1;user=hfs;password=hdfs","dbtable"
>  -> 
> "wh2.hivespark")).option("driver","com.simba.hive.jdbc41.HS2Driver").option("user","hdfs").option("password","hdfs").load()
> 5) In Spark shell try to display the data
> hivespark.show()
> At this point you should see the error:
> scala> hivespark.show()
> 17/06/22 12:14:37 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> java.sql.SQLDataException: [Simba][JDBC](10140) Error converting value to int.
> at 
> com.simba.hiveserver2.exceptions.ExceptionConverter.toSQLException(Unknown 
> Source)
> at 
> com.simba.hiveserver2.utilities.conversion.TypeConverter.toInt(Unknown Source)
> at com.simba.hiveserver2.jdbc.common.SForwardResultSet.getInt(Unknown 
> Source)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.getNext(JDBCRDD.scala:437)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.hasNext(JDBCRDD.scala:535)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)

[jira] [Created] (SPARK-21182) Structured streaming on Spark-shell on windows

2017-06-22 Thread Vijay (JIRA)
Vijay created SPARK-21182:
-

 Summary: Structured streaming on Spark-shell on windows
 Key: SPARK-21182
 URL: https://issues.apache.org/jira/browse/SPARK-21182
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.1.1
 Environment: Windows 10
spark-2.1.1-bin-hadoop2.7
Reporter: Vijay
Priority: Minor


Structured Streaming output operations fail in the Windows shell.

As the error message shows, the offsets path is being prefixed with a Linux-style 
file separator ("/"), which makes it an invalid DFS filename and causes the 
IllegalArgumentException.

Following is the error message.

scala> val query = wordCounts.writeStream  .outputMode("complete")  
.format("console")  .start()
java.lang.IllegalArgumentException: Pathname 
/C:/Users/Vijay/AppData/Local/Temp/temporary-081b482c-98a4-494e-8cfb-22d966c2da01/offsets
 from 
C:/Users/Vijay/AppData/Local/Temp/temporary-081b482c-98a4-494e-8cfb-22d966c2da01/offsets
 is not a valid DFS filename.
  at 
org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:197)
  at 
org.apache.hadoop.hdfs.DistributedFileSystem.access$000(DistributedFileSystem.java:106)
  at 
org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1305)
  at 
org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301)
  at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
  at 
org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1317)
  at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1426)
  at 
org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:222)
  at 
org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:280)
  at 
org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:268)
  ... 52 elided
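A possible workaround sketch until this is fixed (the path below is a placeholder, 
and whether it helps depends on the filesystem configuration): point the query at an 
explicit checkpoint location with a {{file:///}} scheme, instead of letting Spark 
derive one from a temporary directory, so the mangled "/C:/..." path never reaches 
the DFS filename check.

{code}
val query = wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .option("checkpointLocation", "file:///C:/tmp/streaming-checkpoint")  // placeholder path
  .start()
{code}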



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21183) Unable to return Google BigQuery INTEGER data type into Spark via google BigQuery JDBC driver: java.sql.SQLDataException: [Simba][JDBC](10140) Error converting value to

2017-06-22 Thread Matthew Walton (JIRA)
Matthew Walton created SPARK-21183:
--

 Summary: Unable to return Google BigQuery INTEGER data type into 
Spark via google BigQuery JDBC driver: java.sql.SQLDataException: 
[Simba][JDBC](10140) Error converting value to long.
 Key: SPARK-21183
 URL: https://issues.apache.org/jira/browse/SPARK-21183
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell, SQL
Affects Versions: 2.1.1, 2.0.0, 1.6.0
 Environment: OS:  Linux
Spark  version 2.1.1
JDBC:  Download the latest google BigQuery JDBC Driver from Google
Reporter: Matthew Walton


I'm trying to fetch back data in Spark using a JDBC connection to Google 
BigQuery.  Unfortunately, when I try to query data that resides in an INTEGER 
column I get the following error:  

java.sql.SQLDataException: [Simba][JDBC](10140) Error converting value to long. 
 

Steps to reproduce:

1) On Google BigQuery console create a simple table with an INT column and 
insert some data 

2) Copy the Google BigQuery JDBC driver to the machine where you will run Spark 
Shell

3) Start Spark shell loading the GoogleBigQuery JDBC driver jar files

./spark-shell --jars 
/home/ec2-user/jdbc/gbq/GoogleBigQueryJDBC42.jar,/home/ec2-user/jdbc/gbq/google-api-client-1.22.0.jar,/home/ec2-user/jdbc/gbq/google-api-services-bigquery-v2-rev320-1.22.0.jar,/home/ec2-user/jdbc/gbq/google-http-client-1.22.0.jar,/home/ec2-user/jdbc/gbq/google-http-client-jackson2-1.22.0.jar,/home/ec2-user/jdbc/gbq/google-oauth-client-1.22.0.jar,/home/ec2-user/jdbc/gbq/jackson-core-2.1.3.jar

4) In Spark shell load the data from Google BigQuery using the JDBC driver

val gbq = spark.read.format("jdbc").options(Map("url" -> 
"jdbc:bigquery://https://www.googleapis.com/bigquery/v2;ProjectId=your-project-name-here;OAuthType=0;OAuthPvtKeyPath=/usr/lib/spark/YourProjectPrivateKey.json;OAuthServiceAcctEmail=YourEmail@gmail.comAllowLargeResults=1;LargeResultDataset=_bqodbc_temp_tables;LargeResultTable=_matthew;Timeout=600","dbtable";
 -> 
"test.lu_test_integer")).option("driver","com.simba.googlebigquery.jdbc42.Driver").option("user","").option("password","").load()

5) In Spark shell try to display the data

gbq.show()

At this point you should see the error:

scala> gbq.show()
17/06/22 19:34:57 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 6, 
ip-172-31-37-165.ec2.internal, executor 3): java.sql.SQLDataException: 
[Simba][JDBC](10140) Error converting value to long.
at com.simba.exceptions.ExceptionConverter.toSQLException(Unknown 
Source)
at com.simba.utilities.conversion.TypeConverter.toLong(Unknown Source)
at com.simba.jdbc.common.SForwardResultSet.getLong(Unknown Source)
at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$8.apply(JdbcUtils.scala:365)
at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$8.apply(JdbcUtils.scala:364)
at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:286)
at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:268)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at 
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(

[jira] [Created] (SPARK-21184) QuantileSummaries implementation is wrong and QuantileSummariesSuite fails with larger n

2017-06-22 Thread Andrew Ray (JIRA)
Andrew Ray created SPARK-21184:
--

 Summary: QuantileSummaries implementation is wrong and 
QuantileSummariesSuite fails with larger n
 Key: SPARK-21184
 URL: https://issues.apache.org/jira/browse/SPARK-21184
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.1
Reporter: Andrew Ray


1. QuantileSummaries implementation does not match the paper it is supposed to 
be based on.

1a. The compress method 
(https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/QuantileSummaries.scala#L240)
 merges neighboring buckets, but that's not what the paper says to do. The paper 
(http://infolab.stanford.edu/~datar/courses/cs361a/papers/quantiles.pdf) 
describes an implicit tree structure and the compress method deletes selected 
subtrees.

1b. The paper does not discuss merging these summary data structures at all. 
The following comment is in the merge method of QuantileSummaries:

{quote}  // The GK algorithm is a bit unclear about it, but it seems there 
is no need to adjust the
  // statistics during the merging: the invariants are still respected 
after the merge.{quote}

Unless I'm missing something, that claim needs substantiation: it's not clear that 
the invariants still hold (the invariant in question is spelled out at the end of 
this description).

2. QuantileSummariesSuite fails with n = 1 (and other non-trivial values)
https://github.com/apache/spark/blob/master/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/QuantileSummariesSuite.scala#L27

One possible solution if these issues can't be resolved would be to move to an 
algorithm that explicitly supports merging and is well tested like 
https://github.com/tdunning/t-digest
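
Regarding the invariants mentioned in 1b, as a reference point (this is just the 
standard statement from the GK paper linked above, not anything specific to the 
Spark code): each stored tuple (v_i, g_i, Delta_i) is supposed to satisfy

{code}
g_i + \Delta_i \le 2 \varepsilon n,
\qquad g_i = r_{\min}(v_i) - r_{\min}(v_{i-1}),
\qquad \Delta_i = r_{\max}(v_i) - r_{\min}(v_i)
{code}

so any merge procedure needs an argument that this bound (with n now the combined 
count) still holds for the merged tuples, which is exactly the substantiation being 
asked for.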




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19937) Collect metrics of block sizes when shuffle.

2017-06-22 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-19937.

   Resolution: Fixed
 Assignee: jin xing
Fix Version/s: 2.3.0

> Collect metrics of block sizes when shuffle.
> 
>
> Key: SPARK-19937
> URL: https://issues.apache.org/jira/browse/SPARK-19937
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 2.1.0
>Reporter: jin xing
>Assignee: jin xing
> Fix For: 2.3.0
>
>
> This improvement is related to SPARK-19659 and is also documented there. Metrics 
> of block sizes (during shuffle) should be collected for later analysis. This is 
> helpful for analyzing skew situations or OOMs that happen even though 
> maxBytesInFlight is set. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20342) DAGScheduler sends SparkListenerTaskEnd before updating task's accumulators

2017-06-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16060063#comment-16060063
 ] 

Apache Spark commented on SPARK-20342:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/18393

> DAGScheduler sends SparkListenerTaskEnd before updating task's accumulators
> ---
>
> Key: SPARK-20342
> URL: https://issues.apache.org/jira/browse/SPARK-20342
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Marcelo Vanzin
>
> Hit this on 2.2, but probably has been there forever. This is similar in 
> spirit to SPARK-20205.
> Event is sent here, around L1154:
> {code}
> listenerBus.post(SparkListenerTaskEnd(
>stageId, task.stageAttemptId, taskType, event.reason, event.taskInfo, 
> taskMetrics))
> {code}
> Accumulators are updated later, around L1173:
> {code}
> val stage = stageIdToStage(task.stageId)
> event.reason match {
>   case Success =>
> task match {
>   case rt: ResultTask[_, _] =>
> // Cast to ResultStage here because it's part of the ResultTask
> // TODO Refactor this out to a function that accepts a ResultStage
> val resultStage = stage.asInstanceOf[ResultStage]
> resultStage.activeJob match {
>   case Some(job) =>
> if (!job.finished(rt.outputId)) {
>   updateAccumulators(event)
> {code}
> Same thing applies here; UI shows correct info because it's pointing at the 
> mutable {{TaskInfo}} structure. But the event log, for example, may record 
> the wrong information.
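> A hedged sketch of a listener that can observe this (the class name is made up; 
> the listener API calls are the public ones):
> {code}
> import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}
>
> class AccumSnapshotListener extends SparkListener {
>   override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
>     // The event is posted before updateAccumulators runs, so depending on timing
>     // the values read here (or written out by the event log) may not be final yet.
>     taskEnd.taskInfo.accumulables.foreach { acc =>
>       println(s"${acc.name.getOrElse("unnamed")} = ${acc.value.getOrElse("n/a")}")
>     }
>   }
> }
> // sc.addSparkListener(new AccumSnapshotListener())
> {code}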



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20379) Allow setting SSL-related passwords through env variables

2017-06-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16060065#comment-16060065
 ] 

Apache Spark commented on SPARK-20379:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/18394

> Allow setting SSL-related passwords through env variables
> -
>
> Key: SPARK-20379
> URL: https://issues.apache.org/jira/browse/SPARK-20379
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Marcelo Vanzin
>
> Currently, Spark reads all SSL options from configuration, which can be 
> provided in a file or through the command line. This means that to set the 
> SSL keystore / trust store / key passwords, you have to use one of those 
> options.
> Using the command line for that is not secure, and in some environments 
> admins prefer to not have the password written in plain text in a file (since 
> the file and the data it's protecting could be stored in the same 
> filesystem). So for these cases it would be nice to be able to provide these 
> passwords through environment variables, which are not written to disk and 
> also not readable by other users on the same machine.
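> Until built-in support exists, a hedged approximation for applications that build 
> their own SparkConf is to read the password from the environment in user code (the 
> variable name below is hypothetical):
> {code}
> import org.apache.spark.SparkConf
>
> val conf = new SparkConf()
> // SSL_KEYSTORE_PASSWORD is a deployment-chosen, hypothetical variable name.
> sys.env.get("SSL_KEYSTORE_PASSWORD").foreach { pw =>
>   conf.set("spark.ssl.keyStorePassword", pw)
> }
> {code}
> This only covers settings read from the application's own SparkConf; daemons such 
> as the history server or the standalone master would still need the built-in 
> support this issue proposes.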



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20655) In-memory key-value store implementation

2017-06-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16060067#comment-16060067
 ] 

Apache Spark commented on SPARK-20655:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/18395

> In-memory key-value store implementation
> 
>
> Key: SPARK-20655
> URL: https://issues.apache.org/jira/browse/SPARK-20655
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>
> See spec in parent issue (SPARK-18085) for more details.
> This task tracks adding an in-memory implementation for the key-value store 
> abstraction added in SPARK-20641. This is desired because people might want 
> to avoid having to store this data on disk when running applications, and 
> also because the LevelDB native libraries are not available on all platforms.
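> As a rough illustration only (a generic sketch, not the actual SPARK-20641 
> interface), an in-memory implementation is essentially a type-partitioned 
> concurrent map kept on the heap:
> {code}
> import java.util.concurrent.ConcurrentHashMap
>
> class InMemoryStore {
>   // Values are grouped by their class and a caller-supplied key.
>   private val data = new ConcurrentHashMap[(Class[_], Any), Any]()
>
>   def write(key: Any, value: Any): Unit = data.put((value.getClass, key), value)
>
>   def read[T](klass: Class[T], key: Any): Option[T] =
>     Option(data.get((klass, key))).map(v => klass.cast(v))
>
>   def delete(klass: Class[_], key: Any): Unit = data.remove((klass, key))
> }
> {code}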



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20391) Properly rename the memory related fields in ExecutorSummary REST API

2017-06-22 Thread Jose Soltren (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16060097#comment-16060097
 ] 

Jose Soltren commented on SPARK-20391:
--

So, this is months old now and irrelevant, but since you pinged me, I'll say 
that jerryshao's changes look fine to me. Thanks.

> Properly rename the memory related fields in ExecutorSummary REST API
> -
>
> Key: SPARK-20391
> URL: https://issues.apache.org/jira/browse/SPARK-20391
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Blocker
> Fix For: 2.2.0
>
>
> Currently in Spark we could get executor summary through REST API 
> {{/api/v1/applications//executors}}. The format of executor summary 
> is:
> {code}
> class ExecutorSummary private[spark](
> val id: String,
> val hostPort: String,
> val isActive: Boolean,
> val rddBlocks: Int,
> val memoryUsed: Long,
> val diskUsed: Long,
> val totalCores: Int,
> val maxTasks: Int,
> val activeTasks: Int,
> val failedTasks: Int,
> val completedTasks: Int,
> val totalTasks: Int,
> val totalDuration: Long,
> val totalGCTime: Long,
> val totalInputBytes: Long,
> val totalShuffleRead: Long,
> val totalShuffleWrite: Long,
> val isBlacklisted: Boolean,
> val maxMemory: Long,
> val executorLogs: Map[String, String],
> val onHeapMemoryUsed: Option[Long],
> val offHeapMemoryUsed: Option[Long],
> val maxOnHeapMemory: Option[Long],
> val maxOffHeapMemory: Option[Long])
> {code}
> Here are 6 memory related fields: {{memoryUsed}}, {{maxMemory}}, 
> {{onHeapMemoryUsed}}, {{offHeapMemoryUsed}}, {{maxOnHeapMemory}}, 
> {{maxOffHeapMemory}}.
> All 6 of these fields reflect the *storage* memory usage in Spark, but from their 
> names a user cannot tell whether they refer to *storage* memory or to the total 
> memory (storage memory + execution memory). This is misleading.
> So I think we should properly rename these fields to reflect their real 
> meanings, or at least document them.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21145) Restarted queries reuse same StateStoreProvider, causing multiple concurrent tasks to update same StateStore

2017-06-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16060196#comment-16060196
 ] 

Apache Spark commented on SPARK-21145:
--

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/18396

> Restarted queries reuse same StateStoreProvider, causing multiple concurrent 
> tasks to update same StateStore
> 
>
> Key: SPARK-21145
> URL: https://issues.apache.org/jira/browse/SPARK-21145
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>
> StateStoreProvider instances are loaded on demand in an executor when a query 
> is started. When a query is restarted, the loaded provider instance gets 
> reused. Now, there is a non-trivial chance that a task of the previous query 
> run is still running while the tasks of the restarted run have started. So 
> there may be two concurrent tasks for the same stateful partition, and 
> therefore two tasks using the same provider instance. This can lead to 
> inconsistent results and possibly random failures, as state store 
> implementations are not designed to be thread-safe.
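> One hedged way to picture a fix (the names below are illustrative, not the actual 
> patch): key the executor-side provider cache by a per-run identifier, so a 
> restarted query never picks up the instance a lingering task of the previous run 
> is using:
> {code}
> import scala.collection.mutable
>
> // Hypothetical key; the real classes are StateStoreProvider / StateStore.
> case class ProviderKey(checkpointLocation: String, operatorId: Long,
>                        partitionId: Int, runId: String)
>
> object ProviderRegistry {
>   private val loaded = mutable.HashMap.empty[ProviderKey, AnyRef]
>
>   // runId differs for every (re)start, so two runs never share an instance.
>   def getOrLoad(key: ProviderKey)(load: => AnyRef): AnyRef = synchronized {
>     loaded.getOrElseUpdate(key, load)
>   }
> }
> {code}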



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18343) FileSystem$Statistics$StatisticsDataReferenceCleaner hangs on s3 write

2017-06-22 Thread Kanagha Pradha (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16060203#comment-16060203
 ] 

Kanagha Pradha commented on SPARK-18343:


I am getting the same error with Spark 2.0.2 (Scala 2.11). As [~lminer] 
mentioned, I looked at the versions: the Spark pom.xml references json4s 3.2.11 
by default, and I'm not sure why.
Also, this error occurs intermittently and spark-submit often succeeds. 
Any pointers on what could be causing this?


Thread 24 (org.apache.hadoop.hdfs.PeerCache@1c80e49b):
  State: TIMED_WAITING
  Blocked count: 0
  Waited count: 23
  Stack:
java.lang.Thread.sleep(Native Method)
org.apache.hadoop.hdfs.PeerCache.run(PeerCache.java:255)
org.apache.hadoop.hdfs.PeerCache.access$000(PeerCache.java:46)
org.apache.hadoop.hdfs.PeerCache$1.run(PeerCache.java:124)
java.lang.Thread.run(Thread.java:748)
Thread 22 (communication thread):
  State: RUNNABLE
  Blocked count: 20
  Waited count: 63
  Stack:
sun.management.ThreadImpl.getThreadInfo1(Native Method)
sun.management.ThreadImpl.getThreadInfo(ThreadImpl.java:178)
sun.management.ThreadImpl.getThreadInfo(ThreadImpl.java:139)

org.apache.hadoop.util.ReflectionUtils.printThreadInfo(ReflectionUtils.java:168)

org.apache.hadoop.util.ReflectionUtils.logThreadInfo(ReflectionUtils.java:223)
org.apache.hadoop.mapred.Task$TaskReporter.run(Task.java:785)
java.lang.Thread.run(Thread.java:748)
Thread 16 (process reaper):
  State: TIMED_WAITING
  Blocked count: 0
  Waited count: 5
  Stack:
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)

java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:460)

java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:362)
java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:941)

java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1066)

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:748)
Thread 14 
(org.apache.hadoop.fs.FileSystem$Statistics$StatisticsDataReferenceCleaner):
  State: WAITING
  Blocked count: 0
  Waited count: 1
  Waiting on java.lang.ref.ReferenceQueue$Lock@407a7f2a
  Stack:
java.lang.Object.wait(Native Method)
java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:143)
java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:164)

org.apache.hadoop.fs.FileSystem$Statistics$StatisticsDataReferenceCleaner.run(FileSystem.java:3060)
java.lang.Thread.run(Thread.java:748)
Thread 13 (IPC Parameter Sending Thread #0):
  State: TIMED_WAITING
  Blocked count: 0
  Waited count: 31
  Stack:
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)

java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:460)

java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:362)
java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:941)

java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1066)

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:748)
Thread 4 (Signal Dispatcher):
  State: RUNNABLE
  Blocked count: 0
  Waited count: 0
  Stack:
Thread 3 (Finalizer):
  State: WAITING
  Blocked count: 567
  Waited count: 6
  Waiting on java.lang.ref.ReferenceQueue$Lock@2cb2fc20
  Stack:
java.lang.Object.wait(Native Method)
java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:143)
java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:164)
java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:209)
Thread 2 (Reference Handler):
  State: WAITING
  Blocked count: 9
  Waited count: 5
  Waiting on java.lang.ref.Reference$Lock@4f4c4b1a
  Stack:
java.lang.Object.wait(Native Method)
java.lang.Object.wait(Object.java:502)
java.lang.ref.Reference.tryHandlePending(Reference.java:191)
java.lang.ref.Reference$ReferenceHandler.run(Reference.java:153)

Jun 22, 2017 11:01:23 PM org.apache.hadoop.mapred.Task run
WARNING: Last retry, killing attempt_1498171863687_0003_m_00_0

> FileSystem$Statistics$StatisticsDataReferenceCleaner hangs on s3 write
> --
>
> Key: SPARK-18343
> URL: https://issues.apache.org/jira/browse/SPARK-18343
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.1
> 

[jira] [Assigned] (SPARK-21159) Cluster mode, driver throws connection refused exception submitted by SparkLauncher

2017-06-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21159:


Assignee: Apache Spark

> Cluster mode, driver throws connection refused exception submitted by 
> SparkLauncher
> ---
>
> Key: SPARK-21159
> URL: https://issues.apache.org/jira/browse/SPARK-21159
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Spark Submit
>Affects Versions: 2.1.0
> Environment: Server A-Master
> Server B-Slave
>Reporter: niefei
>Assignee: Apache Spark
>
> When a Spark application is submitted via the SparkLauncher#startApplication 
> method, the caller gets a SparkAppHandle. In the test environment the launcher 
> runs on server A; in client mode everything is OK. In cluster mode, however, the 
> launcher runs on server A while the driver runs on server B. In this scenario, 
> when the SparkContext is initialized, a LauncherBackend tries to connect to the 
> launcher application via the specified port and IP address. The problem is that 
> the LauncherBackend implementation connects to the loopback IP, 127.0.0.1, which 
> causes a connection refused error because server B never ran the launcher. 
> The expected behavior is that the LauncherBackend should connect to server A's IP 
> address to report the running status.
> Below is the stacktrace:
> 17/06/20 17:24:37 ERROR SparkContext: Error initializing SparkContext.
> java.net.ConnectException: Connection refused
>   at java.net.PlainSocketImpl.socketConnect(Native Method)
>   at 
> java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
>   at 
> java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
>   at 
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
>   at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>   at java.net.Socket.connect(Socket.java:589)
>   at java.net.Socket.connect(Socket.java:538)
>   at java.net.Socket.(Socket.java:434)
>   at java.net.Socket.(Socket.java:244)
>   at 
> org.apache.spark.launcher.LauncherBackend.connect(LauncherBackend.scala:43)
>   at 
> org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend.start(StandaloneSchedulerBackend.scala:60)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:156)
>   at org.apache.spark.SparkContext.(SparkContext.scala:509)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2313)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:868)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:860)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:860)
>   at 
> com.asura.grinder.datatask.task.AbstractCommonSparkTask.executeSparkJob(AbstractCommonSparkTask.scala:91)
>   at 
> com.asura.grinder.datatask.task.AbstractCommonSparkTask.runSparkJob(AbstractCommonSparkTask.scala:25)
>   at com.asura.grinder.datatask.main.TaskMain$.main(TaskMain.scala:61)
>   at com.asura.grinder.datatask.main.TaskMain.main(TaskMain.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:58)
>   at 
> org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
> 17/06/20 17:24:37 INFO SparkUI: Stopped Spark web UI at 
> http://172.25.108.62:4040
> 17/06/20 17:24:37 INFO StandaloneSchedulerBackend: Shutting down all executors
> 17/06/20 17:24:37 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asking 
> each executor to shut down
> 17/06/20 17:24:37 ERROR Utils: Uncaught exception in thread main
> java.lang.NullPointerException
>   at 
> org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend.org$apache$spark$scheduler$cluster$StandaloneSchedulerBackend$$stop(StandaloneSchedulerBackend.scala:214)
>   at 
> org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend.stop(StandaloneSchedulerBackend.scala:116)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:467)
>   at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1588)
>   at 
> org.apache.spark.SparkContext$$anonfun$stop$8.apply$mcV$sp(SparkContext.scala:1826)
>   at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.sca

[jira] [Assigned] (SPARK-21159) Cluster mode, driver throws connection refused exception submitted by SparkLauncher

2017-06-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21159:


Assignee: (was: Apache Spark)

> Cluster mode, driver throws connection refused exception submitted by 
> SparkLauncher
> ---
>
> Key: SPARK-21159
> URL: https://issues.apache.org/jira/browse/SPARK-21159
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Spark Submit
>Affects Versions: 2.1.0
> Environment: Server A-Master
> Server B-Slave
>Reporter: niefei
>
> When a Spark application is submitted via the SparkLauncher#startApplication 
> method, the caller gets a SparkAppHandle. In the test environment the launcher 
> runs on server A; in client mode everything is OK. In cluster mode, however, the 
> launcher runs on server A while the driver runs on server B. In this scenario, 
> when the SparkContext is initialized, a LauncherBackend tries to connect to the 
> launcher application via the specified port and IP address. The problem is that 
> the LauncherBackend implementation connects to the loopback IP, 127.0.0.1, which 
> causes a connection refused error because server B never ran the launcher. 
> The expected behavior is that the LauncherBackend should connect to server A's IP 
> address to report the running status.
> Below is the stacktrace:
> 17/06/20 17:24:37 ERROR SparkContext: Error initializing SparkContext.
> java.net.ConnectException: Connection refused
>   at java.net.PlainSocketImpl.socketConnect(Native Method)
>   at 
> java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
>   at 
> java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
>   at 
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
>   at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>   at java.net.Socket.connect(Socket.java:589)
>   at java.net.Socket.connect(Socket.java:538)
>   at java.net.Socket.(Socket.java:434)
>   at java.net.Socket.(Socket.java:244)
>   at 
> org.apache.spark.launcher.LauncherBackend.connect(LauncherBackend.scala:43)
>   at 
> org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend.start(StandaloneSchedulerBackend.scala:60)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:156)
>   at org.apache.spark.SparkContext.(SparkContext.scala:509)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2313)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:868)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:860)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:860)
>   at 
> com.asura.grinder.datatask.task.AbstractCommonSparkTask.executeSparkJob(AbstractCommonSparkTask.scala:91)
>   at 
> com.asura.grinder.datatask.task.AbstractCommonSparkTask.runSparkJob(AbstractCommonSparkTask.scala:25)
>   at com.asura.grinder.datatask.main.TaskMain$.main(TaskMain.scala:61)
>   at com.asura.grinder.datatask.main.TaskMain.main(TaskMain.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:58)
>   at 
> org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
> 17/06/20 17:24:37 INFO SparkUI: Stopped Spark web UI at 
> http://172.25.108.62:4040
> 17/06/20 17:24:37 INFO StandaloneSchedulerBackend: Shutting down all executors
> 17/06/20 17:24:37 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asking 
> each executor to shut down
> 17/06/20 17:24:37 ERROR Utils: Uncaught exception in thread main
> java.lang.NullPointerException
>   at 
> org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend.org$apache$spark$scheduler$cluster$StandaloneSchedulerBackend$$stop(StandaloneSchedulerBackend.scala:214)
>   at 
> org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend.stop(StandaloneSchedulerBackend.scala:116)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:467)
>   at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1588)
>   at 
> org.apache.spark.SparkContext$$anonfun$stop$8.apply$mcV$sp(SparkContext.scala:1826)
>   at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1283)
>   at org.a

[jira] [Commented] (SPARK-21159) Cluster mode, driver throws connection refused exception submitted by SparkLauncher

2017-06-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16060208#comment-16060208
 ] 

Apache Spark commented on SPARK-21159:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/18397

> Cluster mode, driver throws connection refused exception submitted by 
> SparkLauncher
> ---
>
> Key: SPARK-21159
> URL: https://issues.apache.org/jira/browse/SPARK-21159
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Spark Submit
>Affects Versions: 2.1.0
> Environment: Server A-Master
> Server B-Slave
>Reporter: niefei
>
> When a Spark application is submitted via the SparkLauncher#startApplication 
> method, the caller gets a SparkAppHandle. In the test environment the launcher 
> runs on server A; in client mode everything is OK. In cluster mode, however, the 
> launcher runs on server A while the driver runs on server B. In this scenario, 
> when the SparkContext is initialized, a LauncherBackend tries to connect to the 
> launcher application via the specified port and IP address. The problem is that 
> the LauncherBackend implementation connects to the loopback IP, 127.0.0.1, which 
> causes a connection refused error because server B never ran the launcher. 
> The expected behavior is that the LauncherBackend should connect to server A's IP 
> address to report the running status.
> Below is the stacktrace:
> 17/06/20 17:24:37 ERROR SparkContext: Error initializing SparkContext.
> java.net.ConnectException: Connection refused
>   at java.net.PlainSocketImpl.socketConnect(Native Method)
>   at 
> java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
>   at 
> java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
>   at 
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
>   at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>   at java.net.Socket.connect(Socket.java:589)
>   at java.net.Socket.connect(Socket.java:538)
>   at java.net.Socket.(Socket.java:434)
>   at java.net.Socket.(Socket.java:244)
>   at 
> org.apache.spark.launcher.LauncherBackend.connect(LauncherBackend.scala:43)
>   at 
> org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend.start(StandaloneSchedulerBackend.scala:60)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:156)
>   at org.apache.spark.SparkContext.(SparkContext.scala:509)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2313)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:868)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:860)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:860)
>   at 
> com.asura.grinder.datatask.task.AbstractCommonSparkTask.executeSparkJob(AbstractCommonSparkTask.scala:91)
>   at 
> com.asura.grinder.datatask.task.AbstractCommonSparkTask.runSparkJob(AbstractCommonSparkTask.scala:25)
>   at com.asura.grinder.datatask.main.TaskMain$.main(TaskMain.scala:61)
>   at com.asura.grinder.datatask.main.TaskMain.main(TaskMain.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:58)
>   at 
> org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
> 17/06/20 17:24:37 INFO SparkUI: Stopped Spark web UI at 
> http://172.25.108.62:4040
> 17/06/20 17:24:37 INFO StandaloneSchedulerBackend: Shutting down all executors
> 17/06/20 17:24:37 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asking 
> each executor to shut down
> 17/06/20 17:24:37 ERROR Utils: Uncaught exception in thread main
> java.lang.NullPointerException
>   at 
> org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend.org$apache$spark$scheduler$cluster$StandaloneSchedulerBackend$$stop(StandaloneSchedulerBackend.scala:214)
>   at 
> org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend.stop(StandaloneSchedulerBackend.scala:116)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:467)
>   at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1588)
>   at 
> org.apache.spark.SparkContext$$anonfun$stop$8.apply$mcV$sp(SparkC

[jira] [Resolved] (SPARK-13534) Implement Apache Arrow serializer for Spark DataFrame for use in DataFrame.toPandas

2017-06-22 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-13534.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 15821
[https://github.com/apache/spark/pull/15821]

> Implement Apache Arrow serializer for Spark DataFrame for use in 
> DataFrame.toPandas
> ---
>
> Key: SPARK-13534
> URL: https://issues.apache.org/jira/browse/SPARK-13534
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: Wes McKinney
> Fix For: 2.3.0
>
> Attachments: benchmark.py
>
>
> The current code path for accessing Spark DataFrame data in Python using 
> PySpark passes through an inefficient serialization-deserialization process 
> that I've examined at a high level here: 
> https://gist.github.com/wesm/0cb5531b1c2e346a0007. Currently, RDD[Row] 
> objects are being deserialized in pure Python as a list of tuples, which are 
> then converted to pandas.DataFrame using its {{from_records}} alternate 
> constructor. This also uses a large amount of memory.
> For flat (no nested types) schemas, the Apache Arrow memory layout 
> (https://github.com/apache/arrow/tree/master/format) can be deserialized to 
> {{pandas.DataFrame}} objects with comparatively small overhead compared with 
> memcpy / system memory bandwidth -- Arrow's bitmasks must be examined, 
> replacing the corresponding null values with pandas's sentinel values (None 
> or NaN as appropriate).
> I will be contributing patches to Arrow in the coming weeks for converting 
> between Arrow and pandas in the general case, so if Spark can send Arrow 
> memory to PySpark, we will hopefully be able to increase the Python data 
> access throughput by an order of magnitude or more. I propose to add a new 
> serializer for Spark DataFrame and a new method that can be invoked from 
> PySpark to request an Arrow memory-layout byte stream, prefixed by a data 
> header indicating array buffer offsets and sizes.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13534) Implement Apache Arrow serializer for Spark DataFrame for use in DataFrame.toPandas

2017-06-22 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-13534:
---

Assignee: Bryan Cutler

> Implement Apache Arrow serializer for Spark DataFrame for use in 
> DataFrame.toPandas
> ---
>
> Key: SPARK-13534
> URL: https://issues.apache.org/jira/browse/SPARK-13534
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: Wes McKinney
>Assignee: Bryan Cutler
> Fix For: 2.3.0
>
> Attachments: benchmark.py
>
>
> The current code path for accessing Spark DataFrame data in Python using 
> PySpark passes through an inefficient serialization-deserialization process 
> that I've examined at a high level here: 
> https://gist.github.com/wesm/0cb5531b1c2e346a0007. Currently, RDD[Row] 
> objects are being deserialized in pure Python as a list of tuples, which are 
> then converted to pandas.DataFrame using its {{from_records}} alternate 
> constructor. This also uses a large amount of memory.
> For flat (no nested types) schemas, the Apache Arrow memory layout 
> (https://github.com/apache/arrow/tree/master/format) can be deserialized to 
> {{pandas.DataFrame}} objects with comparatively small overhead compared with 
> memcpy / system memory bandwidth -- Arrow's bitmasks must be examined, 
> replacing the corresponding null values with pandas's sentinel values (None 
> or NaN as appropriate).
> I will be contributing patches to Arrow in the coming weeks for converting 
> between Arrow and pandas in the general case, so if Spark can send Arrow 
> memory to PySpark, we will hopefully be able to increase the Python data 
> access throughput by an order of magnitude or more. I propose to add a new 
> serializer for Spark DataFrame and a new method that can be invoked from 
> PySpark to request an Arrow memory-layout byte stream, prefixed by a data 
> header indicating array buffer offsets and sizes.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21185) Spurious errors in unidoc causing PRs to fail

2017-06-22 Thread Tathagata Das (JIRA)
Tathagata Das created SPARK-21185:
-

 Summary: Spurious errors in unidoc causing PRs to fail
 Key: SPARK-21185
 URL: https://issues.apache.org/jira/browse/SPARK-21185
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 2.2.0
Reporter: Tathagata Das
Assignee: Hyukjin Kwon


Some PRs are failing because of unidoc throwing random errors. When GenJavaDoc 
generates Java files from Scala files, the generated java files can have errors 
in them. When JavaDoc attempts to generate docs on these generated java files, 
it throws errors. Usually, the errors are marked as warnings, so the unidoc 
does not fail the build. 
Example - 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78270/consoleFull

{code}
[info] Constructing Javadoc information...
[warn] 
/home/jenkins/workspace/SparkPullRequestBuilder/core/target/java/org/apache/spark/scheduler/BlacklistTracker.java:117:
 error: ExecutorAllocationClient is not public in org.apache.spark; cannot be 
accessed from outside package
[warn]   public   BlacklistTracker (org.apache.spark.scheduler.LiveListenerBus 
listenerBus, org.apache.spark.SparkConf conf, 
scala.Option allocationClient, 
org.apache.spark.util.Clock clock)  { throw new RuntimeException(); }
[warn]  
  ^
[warn] 
/home/jenkins/workspace/SparkPullRequestBuilder/core/target/java/org/apache/spark/scheduler/BlacklistTracker.java:118:
 error: ExecutorAllocationClient is not public in org.apache.spark; cannot be 
accessed from outside package
[warn]   public   BlacklistTracker (org.apache.spark.SparkContext sc, 
scala.Option allocationClient)  { 
throw new RuntimeException(); }
{code}

However in some PR builds these are marked as errors, thus causing the build to 
fail due to unidoc. Example - 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78484/consoleFull

{code}
[info] Constructing Javadoc information...
[error] 
/home/jenkins/workspace/SparkPullRequestBuilder@3/core/target/java/org/apache/spark/scheduler/BlacklistTracker.java:117:
 error: ExecutorAllocationClient is not public in org.apache.spark; cannot be 
accessed from outside package
[error]   public   BlacklistTracker (org.apache.spark.scheduler.LiveListenerBus 
listenerBus, org.apache.spark.SparkConf conf, 
scala.Option allocationClient, 
org.apache.spark.util.Clock clock)  { throw new RuntimeException(); }
[error] 
   ^
[error] 
/home/jenkins/workspace/SparkPullRequestBuilder@3/core/target/java/org/apache/spark/scheduler/BlacklistTracker.java:118:
 error: ExecutorAllocationClient is not public in org.apache.spark; cannot be 
accessed from outside package
[error]   public   BlacklistTracker (org.apache.spark.SparkContext sc, 
scala.Option allocationClient)  { 
throw new RuntimeException(); }
[error] 
{code}




https://github.com/apache/spark/pull/18355



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21185) Spurious errors in unidoc causing PRs to fail

2017-06-22 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-21185:
--
Description: 
Some PRs are failing because of unidoc throwing random errors. When GenJavaDoc 
generates Java files from Scala files, the generated java files can have errors 
in them. When JavaDoc attempts to generate docs on these generated java files, 
it throws errors. Usually, the errors are marked as warnings, so the unidoc 
does not fail the build. 
Example - 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78270/consoleFull

{code}
[info] Constructing Javadoc information...
[warn] 
/home/jenkins/workspace/SparkPullRequestBuilder/core/target/java/org/apache/spark/scheduler/BlacklistTracker.java:117:
 error: ExecutorAllocationClient is not public in org.apache.spark; cannot be 
accessed from outside package
[warn]   public   BlacklistTracker (org.apache.spark.scheduler.LiveListenerBus 
listenerBus, org.apache.spark.SparkConf conf, 
scala.Option allocationClient, 
org.apache.spark.util.Clock clock)  { throw new RuntimeException(); }
[warn]  
  ^
[warn] 
/home/jenkins/workspace/SparkPullRequestBuilder/core/target/java/org/apache/spark/scheduler/BlacklistTracker.java:118:
 error: ExecutorAllocationClient is not public in org.apache.spark; cannot be 
accessed from outside package
[warn]   public   BlacklistTracker (org.apache.spark.SparkContext sc, 
scala.Option allocationClient)  { 
throw new RuntimeException(); }
{code}

However in some PR builds these are marked as errors, thus causing the build to 
fail due to unidoc. Example - 
https://github.com/apache/spark/pull/18355
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78484/consoleFull

{code}
[info] Constructing Javadoc information...
[error] 
/home/jenkins/workspace/SparkPullRequestBuilder@3/core/target/java/org/apache/spark/scheduler/BlacklistTracker.java:117:
 error: ExecutorAllocationClient is not public in org.apache.spark; cannot be 
accessed from outside package
[error]   public   BlacklistTracker (org.apache.spark.scheduler.LiveListenerBus 
listenerBus, org.apache.spark.SparkConf conf, 
scala.Option allocationClient, 
org.apache.spark.util.Clock clock)  { throw new RuntimeException(); }
[error] 
   ^
[error] 
/home/jenkins/workspace/SparkPullRequestBuilder@3/core/target/java/org/apache/spark/scheduler/BlacklistTracker.java:118:
 error: ExecutorAllocationClient is not public in org.apache.spark; cannot be 
accessed from outside package
[error]   public   BlacklistTracker (org.apache.spark.SparkContext sc, 
scala.Option allocationClient)  { 
throw new RuntimeException(); }
[error] 
{code}



  was:
Some PRs are failing because of unidoc throwing random errors. When GenJavaDoc 
generates Java files from Scala files, the generated java files can have errors 
in them. When JavaDoc attempts to generate docs on these generated java files, 
it throws errors. Usually, the errors are marked as warnings, so the unidoc 
does not fail the build. 
Example - 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78270/consoleFull

{code}
[info] Constructing Javadoc information...
[warn] 
/home/jenkins/workspace/SparkPullRequestBuilder/core/target/java/org/apache/spark/scheduler/BlacklistTracker.java:117:
 error: ExecutorAllocationClient is not public in org.apache.spark; cannot be 
accessed from outside package
[warn]   public   BlacklistTracker (org.apache.spark.scheduler.LiveListenerBus 
listenerBus, org.apache.spark.SparkConf conf, 
scala.Option allocationClient, 
org.apache.spark.util.Clock clock)  { throw new RuntimeException(); }
[warn]  
  ^
[warn] 
/home/jenkins/workspace/SparkPullRequestBuilder/core/target/java/org/apache/spark/scheduler/BlacklistTracker.java:118:
 error: ExecutorAllocationClient is not public in org.apache.spark; cannot be 
accessed from outside package
[warn]   public   BlacklistTracker (org.apache.spark.SparkContext sc, 
scala.Option allocationClient)  { 
throw new RuntimeException(); }
{code}

However in some PR builds these are marked as errors, thus causing the build to 
fail due to unidoc. Example - 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78484/consoleFull

{code}
[info] Constructing Javadoc information...
[error] 
/home/jenkins/workspace/SparkPullRequestBuilder@3/core/target/java/org/apache/spark/scheduler/BlacklistTracker.java:117:
 error: ExecutorAllocationClient is not public i

[jira] [Assigned] (SPARK-20923) TaskMetrics._updatedBlockStatuses uses a lot of memory

2017-06-22 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-20923:
---

Assignee: Thomas Graves

> TaskMetrics._updatedBlockStatuses uses a lot of memory
> --
>
> Key: SPARK-20923
> URL: https://issues.apache.org/jira/browse/SPARK-20923
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>
> The driver appears to use a ton of memory in certain cases to store the task 
> metrics' updated block statuses.  For instance, I had a user reading data from 
> Hive and caching it.  The # of tasks to read was around 62,000, they were 
> using 1000 executors, and it ended up caching a couple of TBs of data.  The 
> driver kept running out of memory. 
> I investigated, and it looks like 5GB of a 10GB heap was being used up 
> by TaskMetrics._updatedBlockStatuses because there are a lot of blocks.
> The updatedBlockStatuses field was already removed from the task end event 
> under SPARK-20084.  I don't see anything else that seems to be using it.  
> Anybody know if I missed something?
>  If it's not being used we should remove it; otherwise we need to figure out a 
> better way of doing this so it doesn't use so much memory.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20923) TaskMetrics._updatedBlockStatuses uses a lot of memory

2017-06-22 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-20923.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

> TaskMetrics._updatedBlockStatuses uses a lot of memory
> --
>
> Key: SPARK-20923
> URL: https://issues.apache.org/jira/browse/SPARK-20923
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>  Labels: releasenotes
> Fix For: 2.3.0
>
>
> The driver appears to use a ton of memory in certain cases to store the task 
> metrics' updated block statuses.  For instance, I had a user reading data from 
> Hive and caching it.  The # of tasks to read was around 62,000, they were 
> using 1000 executors, and it ended up caching a couple of TBs of data.  The 
> driver kept running out of memory. 
> I investigated, and it looks like 5GB of a 10GB heap was being used up 
> by TaskMetrics._updatedBlockStatuses because there are a lot of blocks.
> The updatedBlockStatuses field was already removed from the task end event 
> under SPARK-20084.  I don't see anything else that seems to be using it.  
> Anybody know if I missed something?
>  If it's not being used we should remove it; otherwise we need to figure out a 
> better way of doing this so it doesn't use so much memory.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20923) TaskMetrics._updatedBlockStatuses uses a lot of memory

2017-06-22 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-20923:

Labels: releasenotes  (was: )

> TaskMetrics._updatedBlockStatuses uses a lot of memory
> --
>
> Key: SPARK-20923
> URL: https://issues.apache.org/jira/browse/SPARK-20923
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>  Labels: releasenotes
> Fix For: 2.3.0
>
>
> The driver appears to use a ton of memory in certain cases to store the task 
> metrics' updated block statuses.  For instance, I had a user reading data from 
> Hive and caching it.  The # of tasks to read was around 62,000, they were 
> using 1000 executors, and it ended up caching a couple of TBs of data.  The 
> driver kept running out of memory. 
> I investigated, and it looks like 5GB of a 10GB heap was being used up 
> by TaskMetrics._updatedBlockStatuses because there are a lot of blocks.
> The updatedBlockStatuses field was already removed from the task end event 
> under SPARK-20084.  I don't see anything else that seems to be using it.  
> Anybody know if I missed something?
>  If it's not being used we should remove it; otherwise we need to figure out a 
> better way of doing this so it doesn't use so much memory.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20923) TaskMetrics._updatedBlockStatuses uses a lot of memory

2017-06-22 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16060283#comment-16060283
 ] 

Wenchen Fan commented on SPARK-20923:
-

This patch changes public behavior, and we should mention it in the release 
notes. Users could previously track the status of updated blocks via the 
{{SparkListenerTaskEnd}} event, but this feature was originally introduced for 
internal usage and I'm wondering how many users actually rely on it. 
After this patch we no longer track it by default; users can still turn it 
on by setting {{spark.taskMetrics.trackUpdatedBlockStatuses}} to true.
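
A minimal PySpark sketch of re-enabling the tracking, assuming the config name 
quoted above ({{spark.taskMetrics.trackUpdatedBlockStatuses}}); this is an 
illustration, not a recommendation:

{code}
from pyspark.sql import SparkSession

# Re-enable per-task updated block status tracking (off by default after this
# patch, per the comment above). Only needed if a listener consumes the
# updatedBlockStatuses field of SparkListenerTaskEnd.
spark = (SparkSession.builder
         .appName("track-updated-block-statuses")
         .config("spark.taskMetrics.trackUpdatedBlockStatuses", "true")
         .getOrCreate())

spark.range(10).count()  # run any job; task-end events now carry block statuses
spark.stop()
{code}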

> TaskMetrics._updatedBlockStatuses uses a lot of memory
> --
>
> Key: SPARK-20923
> URL: https://issues.apache.org/jira/browse/SPARK-20923
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>  Labels: releasenotes
> Fix For: 2.3.0
>
>
> The driver appears to use a ton of memory in certain cases to store the task 
> metrics' updated block statuses.  For instance, I had a user reading data from 
> Hive and caching it.  The # of tasks to read was around 62,000, they were 
> using 1000 executors, and it ended up caching a couple of TBs of data.  The 
> driver kept running out of memory. 
> I investigated, and it looks like 5GB of a 10GB heap was being used up 
> by TaskMetrics._updatedBlockStatuses because there are a lot of blocks.
> The updatedBlockStatuses field was already removed from the task end event 
> under SPARK-20084.  I don't see anything else that seems to be using it.  
> Anybody know if I missed something?
>  If it's not being used we should remove it; otherwise we need to figure out a 
> better way of doing this so it doesn't use so much memory.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21174) Validate sampling fraction in logical operator level

2017-06-22 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-21174.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18387
[https://github.com/apache/spark/pull/18387]
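
For reference, a minimal PySpark sketch of the fraction rules described below; 
the exact exception raised for an invalid fraction is an assumption, so this is 
an illustration rather than a definitive test:

{code}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sample-fraction-check").getOrCreate()
df = spark.range(100)

df.sample(True, 1.5).count()    # with replacement: any nonnegative fraction is valid
df.sample(False, 0.3).count()   # without replacement: fraction must lie in [0, 1]

# Per the rules below, a call like df.sample(False, 1.5) should now be rejected
# when the Sample logical operator is constructed, rather than failing later
# inside the sampler.
{code}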

> Validate sampling fraction in logical operator level
> 
>
> Key: SPARK-21174
> URL: https://issues.apache.org/jira/browse/SPARK-21174
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Gengliang Wang
>Priority: Minor
> Fix For: 2.3.0
>
>
> Currently the validation of sampling fraction in dataset is incomplete.
> As an improvement, validate sampling ratio in logical operator level:
> 1) if with replacement: ratio should be nonnegative
> 2) else: ratio should be on interval [0, 1]
> Also add test cases for the validation.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21174) Validate sampling fraction in logical operator level

2017-06-22 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-21174:
---

Assignee: Gengliang Wang

> Validate sampling fraction in logical operator level
> 
>
> Key: SPARK-21174
> URL: https://issues.apache.org/jira/browse/SPARK-21174
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Minor
> Fix For: 2.3.0
>
>
> Currently the validation of sampling fraction in dataset is incomplete.
> As an improvement, validate sampling ratio in logical operator level:
> 1) if with replacement: ratio should be nonnegative
> 2) else: ratio should be on interval [0, 1]
> Also add test cases for the validation.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21185) Spurious errors in unidoc causing PRs to fail

2017-06-22 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16060296#comment-16060296
 ] 

Hyukjin Kwon commented on SPARK-21185:
--

Yea, I believe this is a duplicate of 
https://issues.apache.org/jira/browse/SPARK-20840

> Spurious errors in unidoc causing PRs to fail
> -
>
> Key: SPARK-21185
> URL: https://issues.apache.org/jira/browse/SPARK-21185
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.2.0
>Reporter: Tathagata Das
>Assignee: Hyukjin Kwon
>
> Some PRs are failing because of unidoc throwing random errors. When 
> GenJavaDoc generates Java files from Scala files, the generated java files 
> can have errors in them. When JavaDoc attempts to generate docs on these 
> generated java files, it throws errors. Usually, the errors are marked as 
> warnings, so the unidoc does not fail the build. 
> Example - 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78270/consoleFull
> {code}
> [info] Constructing Javadoc information...
> [warn] 
> /home/jenkins/workspace/SparkPullRequestBuilder/core/target/java/org/apache/spark/scheduler/BlacklistTracker.java:117:
>  error: ExecutorAllocationClient is not public in org.apache.spark; cannot be 
> accessed from outside package
> [warn]   public   BlacklistTracker 
> (org.apache.spark.scheduler.LiveListenerBus listenerBus, 
> org.apache.spark.SparkConf conf, 
> scala.Option allocationClient, 
> org.apache.spark.util.Clock clock)  { throw new RuntimeException(); }
> [warn]
> ^
> [warn] 
> /home/jenkins/workspace/SparkPullRequestBuilder/core/target/java/org/apache/spark/scheduler/BlacklistTracker.java:118:
>  error: ExecutorAllocationClient is not public in org.apache.spark; cannot be 
> accessed from outside package
> [warn]   public   BlacklistTracker (org.apache.spark.SparkContext sc, 
> scala.Option allocationClient)  { 
> throw new RuntimeException(); }
> {code}
> However in some PR builds these are marked as errors, thus causing the build 
> to fail due to unidoc. Example - 
> https://github.com/apache/spark/pull/18355
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78484/consoleFull
> {code}
> [info] Constructing Javadoc information...
> [error] 
> /home/jenkins/workspace/SparkPullRequestBuilder@3/core/target/java/org/apache/spark/scheduler/BlacklistTracker.java:117:
>  error: ExecutorAllocationClient is not public in org.apache.spark; cannot be 
> accessed from outside package
> [error]   public   BlacklistTracker 
> (org.apache.spark.scheduler.LiveListenerBus listenerBus, 
> org.apache.spark.SparkConf conf, 
> scala.Option allocationClient, 
> org.apache.spark.util.Clock clock)  { throw new RuntimeException(); }
> [error]   
>  ^
> [error] 
> /home/jenkins/workspace/SparkPullRequestBuilder@3/core/target/java/org/apache/spark/scheduler/BlacklistTracker.java:118:
>  error: ExecutorAllocationClient is not public in org.apache.spark; cannot be 
> accessed from outside package
> [error]   public   BlacklistTracker (org.apache.spark.SparkContext sc, 
> scala.Option allocationClient)  { 
> throw new RuntimeException(); }
> [error] 
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20599) ConsoleSink should work with write (batch)

2017-06-22 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu reassigned SPARK-20599:


Assignee: Lubo Zhang

> ConsoleSink should work with write (batch)
> --
>
> Key: SPARK-20599
> URL: https://issues.apache.org/jira/browse/SPARK-20599
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Jacek Laskowski
>Assignee: Lubo Zhang
>Priority: Minor
>  Labels: starter
> Fix For: 2.3.0
>
>
> I think the following should just work.
> {code}
> spark.
>   read.  // <-- it's a batch query not streaming query if that matters
>   format("kafka").
>   option("subscribe", "topic1").
>   option("kafka.bootstrap.servers", "localhost:9092").
>   load.
>   write.
>   format("console").  // <-- that's not supported currently
>   save
> {code}
> The above combination of {{kafka}} source and {{console}} sink leads to the 
> following exception:
> {code}
> java.lang.RuntimeException: 
> org.apache.spark.sql.execution.streaming.ConsoleSinkProvider does not allow 
> create table as select.
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:479)
>   at 
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:48)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:93)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:93)
>   at 
> org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:610)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233)
>   ... 48 elided
> {code}
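
Until the console format supports batch writes, a minimal PySpark sketch of the 
same batch flow (Kafka options taken from the snippet above; {{show()}} used as 
a stand-in for the console sink, and the Kafka connector package is assumed to 
be on the classpath):

{code}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-batch-to-console").getOrCreate()

df = (spark.read                      # batch query, not a streaming one
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "topic1")
      .load())

# Print to the console the way format("console") eventually will for batch.
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)").show(truncate=False)
{code}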



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13534) Implement Apache Arrow serializer for Spark DataFrame for use in DataFrame.toPandas

2017-06-22 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16060328#comment-16060328
 ] 

Reynold Xin commented on SPARK-13534:
-

Was this done? I thought there are still other data types that are not 
supported. We should either turn this into an umbrella ticket, or create a new 
umbrella ticket.


> Implement Apache Arrow serializer for Spark DataFrame for use in 
> DataFrame.toPandas
> ---
>
> Key: SPARK-13534
> URL: https://issues.apache.org/jira/browse/SPARK-13534
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: Wes McKinney
>Assignee: Bryan Cutler
> Fix For: 2.3.0
>
> Attachments: benchmark.py
>
>
> The current code path for accessing Spark DataFrame data in Python using 
> PySpark passes through an inefficient serialization-deserialization process 
> that I've examined at a high level here: 
> https://gist.github.com/wesm/0cb5531b1c2e346a0007. Currently, RDD[Row] 
> objects are being deserialized in pure Python as a list of tuples, which are 
> then converted to pandas.DataFrame using its {{from_records}} alternate 
> constructor. This also uses a large amount of memory.
> For flat (no nested types) schemas, the Apache Arrow memory layout 
> (https://github.com/apache/arrow/tree/master/format) can be deserialized to 
> {{pandas.DataFrame}} objects with comparatively small overhead compared with 
> memcpy / system memory bandwidth -- Arrow's bitmasks must be examined, 
> replacing the corresponding null values with pandas's sentinel values (None 
> or NaN as appropriate).
> I will be contributing patches to Arrow in the coming weeks for converting 
> between Arrow and pandas in the general case, so if Spark can send Arrow 
> memory to PySpark, we will hopefully be able to increase the Python data 
> access throughput by an order of magnitude or more. I propose to add a new 
> serializer for Spark DataFrame and a new method that can be invoked from 
> PySpark to request an Arrow memory-layout byte stream, prefixed by a data 
> header indicating array buffer offsets and sizes.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21171) Speculative task scheduling blocks the driver from handling normal tasks when a job has more than one hundred thousand tasks

2017-06-22 Thread wangminfeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16060348#comment-16060348
 ] 

wangminfeng commented on SPARK-21171:
-

We have modified some code for this feature; I will add a benchmark soon.  This 
is the first time I am contributing to the community, so please give me some 
time to learn the rules. Thank you.
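
For reference, the speculation-related settings involved in the issue below can 
be set as in this minimal PySpark sketch (the values shown are placeholders, not 
recommendations for a benchmark):

{code}
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = (SparkConf()
        .set("spark.speculation", "true")             # enable speculative execution
        .set("spark.speculation.interval", "100ms")   # how often to check for slow tasks
        .set("spark.speculation.multiplier", "1.5")   # how much slower than the median counts as slow
        .set("spark.speculation.quantile", "0.75"))   # fraction of tasks that must finish first

spark = (SparkSession.builder
         .config(conf=conf)
         .appName("speculation-demo")
         .getOrCreate())
{code}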

> Speculative task scheduling blocks the driver from handling normal tasks when 
> a job has more than one hundred thousand tasks
> --
>
> Key: SPARK-21171
> URL: https://issues.apache.org/jira/browse/SPARK-21171
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Affects Versions: 2.0.0
> Environment: We have more than two hundred high-performance machines 
> handling more than 2 TB of data per query
>Reporter: wangminfeng
>
> If a job has more than one hundred thousand tasks and spark.speculation is 
> true, then once speculatable tasks start, choosing a speculatable task wastes a 
> lot of time and blocks other tasks. We run ad-hoc queries for data analysis, 
> and we can't tolerate a job wasting time even when it is a large job.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21186) PySpark with --packages fails to import library due to lack of pythonpath to .ivy2/jars/*.jar

2017-06-22 Thread HanCheol Cho (JIRA)
HanCheol Cho created SPARK-21186:


 Summary: PySpark with --packages fails to import library due to 
lack of pythonpath to .ivy2/jars/*.jar
 Key: SPARK-21186
 URL: https://issues.apache.org/jira/browse/SPARK-21186
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.2.0
 Environment: Spark is downloaded and compiled by myself.

Spark: 2.2.0-SNAPSHOT
Python: Anaconda Python2 (on client and workers)
Reporter: HanCheol Cho
Priority: Minor


I experienced "ImportError: No module named sparkdl" exception while trying to 
use databricks' spark-deep-learning (sparkdl) in PySpark.
The package is included with --packages option as below.

{code}
$ pyspark --packages databricks:spark-deep-learning:0.1.0-spark2.1-s_2.11
{code}

The problem is that PySpark fails to detect this package's jar files located 
in the .ivy2/jars/ directory.
I could circumvent this issue by manually adding this path to PYTHONPATH after 
launching PySpark as follows.

{code}
>>> import sys, glob, os
>>> sys.path.extend(glob.glob(os.path.join(os.path.expanduser("~"), 
>>> ".ivy2/jars/*.jar")))
>>>
>>> import sparkdl
Using TensorFlow backend.
>>> my_images = sparkdl.readImages("data/flower_photos/daisy/*.jpg")
>>> my_images.show()
+++
|filePath|   image|
+++
|hdfs://mycluster/...|[RGB,263,320,3,[B...|
|hdfs://mycluster/...|[RGB,313,500,3,[B...|
|hdfs://mycluster/...|[RGB,215,320,3,[B...|
...
{code}

I think it may be better to add the .ivy2/jars directory path to PYTHONPATH 
when launching PySpark.
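
As a stopgap, the same workaround can be wrapped in a small helper at the top of 
a script or notebook; a minimal sketch (the ~/.ivy2/jars location is the default 
ivy cache used by --packages, and is an assumption if spark.jars.ivy points 
elsewhere):

{code}
import glob
import os
import sys

def add_ivy_jars_to_pythonpath(ivy_jar_dir=None):
    """Append every jar in the local ivy cache to sys.path so that Python
    modules bundled inside --packages jars (e.g. sparkdl) become importable."""
    ivy_jar_dir = ivy_jar_dir or os.path.join(os.path.expanduser("~"), ".ivy2", "jars")
    jars = glob.glob(os.path.join(ivy_jar_dir, "*.jar"))
    sys.path.extend(jars)
    return jars

# Usage, before importing the package shipped via --packages:
# add_ivy_jars_to_pythonpath()
# import sparkdl
{code}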



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21186) PySpark with --packages fails to import library due to lack of pythonpath to .ivy2/jars/*.jar

2017-06-22 Thread HanCheol Cho (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

HanCheol Cho updated SPARK-21186:
-
Description: 
I experienced "ImportError: No module named sparkdl" exception while trying to 
use databricks' spark-deep-learning (sparkdl) in PySpark.
The package is included with --packages option as below.

{code}
$ pyspark --packages databricks:spark-deep-learning:0.1.0-spark2.1-s_2.11

>>> import sparkdl
Traceback (most recent call last):
  File "", line 1, in 
ImportError: No module named sparkdl
{code}

The problem is that PySpark fails to detect this package's jar files located 
in the .ivy2/jars/ directory.
I could circumvent this issue by manually adding this path to PYTHONPATH after 
launching PySpark as follows.

{code}
$ pyspark --packages databricks:spark-deep-learning:0.1.0-spark2.1-s_2.11

>>> import sys, glob, os
>>> sys.path.extend(glob.glob(os.path.join(os.path.expanduser("~"), 
>>> ".ivy2/jars/*.jar")))
>>>
>>> import sparkdl
Using TensorFlow backend.
>>> my_images = sparkdl.readImages("data/flower_photos/daisy/*.jpg")
>>> my_images.show()
+++
|filePath|   image|
+++
|hdfs://mycluster/...|[RGB,263,320,3,[B...|
|hdfs://mycluster/...|[RGB,313,500,3,[B...|
|hdfs://mycluster/...|[RGB,215,320,3,[B...|
...
{code}

I think it may be better to add the .ivy2/jars directory path to PYTHONPATH 
when launching PySpark.

  was:
I experienced "ImportError: No module named sparkdl" exception while trying to 
use databricks' spark-deep-learning (sparkdl) in PySpark.
The package is included with --packages option as below.

{code}
$ pyspark --packages databricks:spark-deep-learning:0.1.0-spark2.1-s_2.11
{code}

The problem is that PySpark fails to detect this package's jar files located 
in the .ivy2/jars/ directory.
I could circumvent this issue by manually adding this path to PYTHONPATH after 
launching PySpark as follows.

{code}
>>> import sys, glob, os
>>> sys.path.extend(glob.glob(os.path.join(os.path.expanduser("~"), 
>>> ".ivy2/jars/*.jar")))
>>>
>>> import sparkdl
Using TensorFlow backend.
>>> my_images = sparkdl.readImages("data/flower_photos/daisy/*.jpg")
>>> my_images.show()
+++
|filePath|   image|
+++
|hdfs://mycluster/...|[RGB,263,320,3,[B...|
|hdfs://mycluster/...|[RGB,313,500,3,[B...|
|hdfs://mycluster/...|[RGB,215,320,3,[B...|
...
{code}

I think it may be better to add the .ivy2/jars directory path to PYTHONPATH 
when launching PySpark.


> PySpark with --packages fails to import library due to lack of pythonpath to 
> .ivy2/jars/*.jar
> -
>
> Key: SPARK-21186
> URL: https://issues.apache.org/jira/browse/SPARK-21186
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
> Environment: Spark is downloaded and compiled by myself.
> Spark: 2.2.0-SNAPSHOT
> Python: Anaconda Python2 (on client and workers)
>Reporter: HanCheol Cho
>Priority: Minor
>
> I experienced "ImportError: No module named sparkdl" exception while trying 
> to use databricks' spark-deep-learning (sparkdl) in PySpark.
> The package is included with --packages option as below.
> {code}
> $ pyspark --packages databricks:spark-deep-learning:0.1.0-spark2.1-s_2.11
> >>> import sparkdl
> Traceback (most recent call last):
>   File "", line 1, in 
> ImportError: No module named sparkdl
> {code}
> The problem is that PySpark fails to detect this package's jar files located 
> in the .ivy2/jars/ directory.
> I could circumvent this issue by manually adding this path to PYTHONPATH 
> after launching PySpark as follows.
> {code}
> $ pyspark --packages databricks:spark-deep-learning:0.1.0-spark2.1-s_2.11
> >>> import sys, glob, os
> >>> sys.path.extend(glob.glob(os.path.join(os.path.expanduser("~"), 
> >>> ".ivy2/jars/*.jar")))
> >>>
> >>> import sparkdl
> Using TensorFlow backend.
> >>> my_images = sparkdl.readImages("data/flower_photos/daisy/*.jpg")
> >>> my_images.show()
> +++
> |filePath|   image|
> +++
> |hdfs://mycluster/...|[RGB,263,320,3,[B...|
> |hdfs://mycluster/...|[RGB,313,500,3,[B...|
> |hdfs://mycluster/...|[RGB,215,320,3,[B...|
> ...
> {code}
> I think it may be better to add the .ivy2/jars directory path to PYTHONPATH 
> when launching PySpark.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21066) LibSVM load just one input file

2017-06-22 Thread 颜发才

 [ 
https://issues.apache.org/jira/browse/SPARK-21066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Facai (颜发才) updated SPARK-21066:

Priority: Trivial  (was: Major)

> LibSVM load just one input file
> ---
>
> Key: SPARK-21066
> URL: https://issues.apache.org/jira/browse/SPARK-21066
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.1
>Reporter: darion yaphet
>Priority: Trivial
>
> Currently, when we use LibSVM to train on a dataset, we found that only one 
> input file is allowed.
> A file stored on a distributed file system such as HDFS is split into 
> multiple pieces, and I think this limit is not necessary.
>  We could join the input paths into a comma-separated string. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21066) LibSVM load just one input file

2017-06-22 Thread 颜发才

[ 
https://issues.apache.org/jira/browse/SPARK-21066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16060381#comment-16060381
 ] 

Yan Facai (颜发才) commented on SPARK-21066:
-

Downgrade to Trivial since `numFeatures` should work.
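
A minimal PySpark sketch of that workaround: specifying {{numFeatures}} up front 
lets the libsvm source skip the single-file schema inference, so several input 
paths can be loaded together (the paths and feature count below are placeholders):

{code}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("libsvm-multi-path").getOrCreate()

df = (spark.read.format("libsvm")
      .option("numFeatures", "780")   # known dimensionality of the training data
      .load(["hdfs:///data/libsvm/part-00000",
             "hdfs:///data/libsvm/part-00001"]))

df.show(5)
{code}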

> LibSVM load just one input file
> ---
>
> Key: SPARK-21066
> URL: https://issues.apache.org/jira/browse/SPARK-21066
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.1
>Reporter: darion yaphet
>Priority: Trivial
>
> Currently, when we use LibSVM to train on a dataset, we found that only one 
> input file is allowed.
> A file stored on a distributed file system such as HDFS is split into 
> multiple pieces, and I think this limit is not necessary.
>  We could join the input paths into a comma-separated string. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21187) Complete support for remaining Spark data type in Arrow Converters

2017-06-22 Thread Bryan Cutler (JIRA)
Bryan Cutler created SPARK-21187:


 Summary: Complete support for remaining Spark data type in Arrow 
Converters
 Key: SPARK-21187
 URL: https://issues.apache.org/jira/browse/SPARK-21187
 Project: Spark
  Issue Type: Umbrella
  Components: PySpark, SQL
Affects Versions: 2.3.0
Reporter: Bryan Cutler


This is to track adding the remaining type support in Arrow Converters.  
Currently, only primitive data types are supported.
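
For the primitive types that are already covered, a minimal PySpark sketch of the 
Arrow-backed path added by SPARK-13534 (the configuration name 
{{spark.sql.execution.arrow.enabled}} is an assumption based on the 2.3.0 work, 
and pyarrow must be installed):

{code}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("arrow-topandas").getOrCreate()
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

# Flat schema of primitive types only, which is what the converters support today.
df = spark.range(1000000).selectExpr("id", "id * 2 AS doubled", "rand() AS noise")

pdf = df.toPandas()   # served through Arrow record batches instead of pickled rows
print(pdf.dtypes)
{code}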



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-20738) Option to turn off building docs in sbt build.

2017-06-22 Thread Prashant Sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Sharma closed SPARK-20738.
---
Resolution: Won't Fix

> Option to turn off building docs in sbt build.
> --
>
> Key: SPARK-20738
> URL: https://issues.apache.org/jira/browse/SPARK-20738
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.2.0
>Reporter: Prashant Sharma
>Priority: Minor
>  Labels: sbt
>
> The default behavior is unchanged. If the option is not set, the docs are 
> built by default.
> sbt publish-local tries to build the docs along with the other artifacts, and 
> since the codebase is updated with no build checks for the sbt docs build, it 
> appears to be difficult to keep the docs building correctly with sbt.
> 
> An alternative is that we can hide the building of docs behind an option, 
> `-Dbuild.docs=false`.
> 
> This is also useful if someone uses sbt publish and does not need the docs to 
> be built, as it is generally time consuming.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13534) Implement Apache Arrow serializer for Spark DataFrame for use in DataFrame.toPandas

2017-06-22 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16060413#comment-16060413
 ] 

Bryan Cutler commented on SPARK-13534:
--

That is correct [~rxin], this did not have support for complex types or 
date/timestamp.  I created SPARK-21187 as an umbrella to track addition of all 
remaining types.

> Implement Apache Arrow serializer for Spark DataFrame for use in 
> DataFrame.toPandas
> ---
>
> Key: SPARK-13534
> URL: https://issues.apache.org/jira/browse/SPARK-13534
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: Wes McKinney
>Assignee: Bryan Cutler
> Fix For: 2.3.0
>
> Attachments: benchmark.py
>
>
> The current code path for accessing Spark DataFrame data in Python using 
> PySpark passes through an inefficient serialization-deserialization process 
> that I've examined at a high level here: 
> https://gist.github.com/wesm/0cb5531b1c2e346a0007. Currently, RDD[Row] 
> objects are being deserialized in pure Python as a list of tuples, which are 
> then converted to pandas.DataFrame using its {{from_records}} alternate 
> constructor. This also uses a large amount of memory.
> For flat (no nested types) schemas, the Apache Arrow memory layout 
> (https://github.com/apache/arrow/tree/master/format) can be deserialized to 
> {{pandas.DataFrame}} objects with comparatively small overhead compared with 
> memcpy / system memory bandwidth -- Arrow's bitmasks must be examined, 
> replacing the corresponding null values with pandas's sentinel values (None 
> or NaN as appropriate).
> I will be contributing patches to Arrow in the coming weeks for converting 
> between Arrow and pandas in the general case, so if Spark can send Arrow 
> memory to PySpark, we will hopefully be able to increase the Python data 
> access throughput by an order of magnitude or more. I propose to add a new 
> serializer for Spark DataFrame and a new method that can be invoked from 
> PySpark to request an Arrow memory-layout byte stream, prefixed by a data 
> header indicating array buffer offsets and sizes.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21188) releaseAllLocksForTask should synchronize the whole method

2017-06-22 Thread Feng Liu (JIRA)
Feng Liu created SPARK-21188:


 Summary: releaseAllLocksForTask should synchronize the whole method
 Key: SPARK-21188
 URL: https://issues.apache.org/jira/browse/SPARK-21188
 Project: Spark
  Issue Type: Bug
  Components: Block Manager, Spark Core
Affects Versions: 2.1.0, 2.2.0
Reporter: Feng Liu


Since the objects readLocksByTask, writeLocksByTask and infos are coupled and 
can be modified concurrently by other threads, all the reads and writes 
of them in the releaseAllLocksForTask method should be protected by a single 
synchronized block. The fine-grained synchronization in the current code can 
cause some test flakiness.
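
A language-agnostic illustration of the point (sketched here in Python with a 
single threading.Lock; the field names mirror the description, but this is not 
Spark's actual BlockInfoManager code):

{code}
import threading

class LockRegistry:
    def __init__(self):
        self._lock = threading.Lock()
        self.read_locks_by_task = {}    # task id -> set of block ids holding read locks
        self.write_locks_by_task = {}   # task id -> set of block ids holding write locks
        self.infos = {}                 # block id -> {"writer": task id or None, "readers": int}

    def release_all_locks_for_task(self, task_id):
        # One synchronized region covers every coupled read and write, so another
        # thread can never observe the three structures in a half-released state.
        with self._lock:
            for block in self.write_locks_by_task.pop(task_id, set()):
                info = self.infos.get(block)
                if info is not None:
                    info["writer"] = None
            for block in self.read_locks_by_task.pop(task_id, set()):
                info = self.infos.get(block)
                if info is not None:
                    info["readers"] = max(0, info["readers"] - 1)
{code}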



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21187) Complete support for remaining Spark data types in Arrow Converters

2017-06-22 Thread Bryan Cutler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-21187:
-
Summary: Complete support for remaining Spark data types in Arrow 
Converters  (was: Complete support for remaining Spark data type in Arrow 
Converters)

> Complete support for remaining Spark data types in Arrow Converters
> ---
>
> Key: SPARK-21187
> URL: https://issues.apache.org/jira/browse/SPARK-21187
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>
> This is to track adding the remaining type support in Arrow Converters.  
> Currently, only primitive data types are supported.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21188) releaseAllLocksForTask should synchronize the whole method

2017-06-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21188:


Assignee: Apache Spark

> releaseAllLocksForTask should synchronize the whole method
> --
>
> Key: SPARK-21188
> URL: https://issues.apache.org/jira/browse/SPARK-21188
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Spark Core
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Feng Liu
>Assignee: Apache Spark
>
> Since the objects readLocksByTask, writeLocksByTask and infos are coupled and 
> can be modified concurrently by other threads, all the reads and writes of 
> them in the releaseAllLocksForTask method should be protected by a single 
> synchronized block. The fine-grained synchronization in the current code can 
> cause some test flakiness.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


