[jira] [Commented] (SPARK-21174) Validate sampling fraction in logical operator level
[ https://issues.apache.org/jira/browse/SPARK-21174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16058869#comment-16058869 ] Apache Spark commented on SPARK-21174: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/18387 > Validate sampling fraction in logical operator level > > > Key: SPARK-21174 > URL: https://issues.apache.org/jira/browse/SPARK-21174 > Project: Spark > Issue Type: Improvement > Components: Optimizer >Affects Versions: 2.2.0 >Reporter: Gengliang Wang >Priority: Minor > > Currently the validation of the sampling fraction in Dataset is incomplete. > As an improvement, validate the sampling ratio at the logical operator level: > 1) if with replacement: the ratio should be nonnegative > 2) otherwise: the ratio should be on the interval [0, 1] > Also add test cases for the validation. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
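For illustration, a minimal sketch of the kind of check being proposed, expressed as a require() in a simplified Sample operator; the class and field names here are assumptions for illustration, not the code in the pull request above.
{code}
// Simplified stand-in for the Sample logical operator; real field names may differ.
case class Sample(
    lowerBound: Double,
    upperBound: Double,
    withReplacement: Boolean,
    seed: Long) {

  // The effective sampling fraction of this operator.
  val fraction: Double = upperBound - lowerBound

  if (withReplacement) {
    // With replacement the fraction is an expected count per row: any nonnegative value is valid.
    require(fraction >= 0.0,
      s"Sampling fraction ($fraction) must be nonnegative with replacement")
  } else {
    // Without replacement the fraction is a probability and must lie on [0, 1].
    require(fraction >= 0.0 && fraction <= 1.0,
      s"Sampling fraction ($fraction) must be on interval [0, 1] without replacement")
  }
}
{code}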
[jira] [Resolved] (SPARK-21167) Path is not decoded correctly when reading output of FileSink
[ https://issues.apache.org/jira/browse/SPARK-21167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-21167. -- Resolution: Fixed Fix Version/s: 2.3.0 2.2.1 2.1.2 > Path is not decoded correctly when reading output of FileSink > - > > Key: SPARK-21167 > URL: https://issues.apache.org/jira/browse/SPARK-21167 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.1 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Fix For: 2.1.2, 2.2.1, 2.3.0 > > > When reading output of FileSink, path is not decoded correctly. So if the > path has some special characters, such as spaces, Spark cannot read it. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
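To make the symptom concrete, a small sketch (not the actual fix for this ticket) of why an undecoded URI string breaks path handling when a directory name contains a space; the example path is hypothetical.
{code}
import java.net.URI

// The file-sink metadata log stores paths as URI strings, so special characters
// such as spaces arrive percent-encoded and must be decoded before lookup.
val logged = "file:///tmp/output/dir%20with%20space/part-00000"

val raw     = logged.stripPrefix("file://")    // "/tmp/output/dir%20with%20space/part-00000" -- no such file on disk
val decoded = new URI(logged).getPath          // "/tmp/output/dir with space/part-00000"     -- matches the real file

println(raw)
println(decoded)
{code}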
[jira] [Assigned] (SPARK-21174) Validate sampling fraction in logical operator level
[ https://issues.apache.org/jira/browse/SPARK-21174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21174: Assignee: Apache Spark > Validate sampling fraction in logical operator level > > > Key: SPARK-21174 > URL: https://issues.apache.org/jira/browse/SPARK-21174 > Project: Spark > Issue Type: Improvement > Components: Optimizer >Affects Versions: 2.2.0 >Reporter: Gengliang Wang >Assignee: Apache Spark >Priority: Minor > > Currently the validation of sampling fraction in dataset is incomplete. > As an improvement, validate sampling ratio in logical operator level: > 1) if with replacement: ratio should be nonnegative > 2) else: ratio should be on interval [0, 1] > Also add test cases for the validation. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21174) Validate sampling fraction in logical operator level
[ https://issues.apache.org/jira/browse/SPARK-21174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21174: Assignee: (was: Apache Spark) > Validate sampling fraction in logical operator level > > > Key: SPARK-21174 > URL: https://issues.apache.org/jira/browse/SPARK-21174 > Project: Spark > Issue Type: Improvement > Components: Optimizer >Affects Versions: 2.2.0 >Reporter: Gengliang Wang >Priority: Minor > > Currently the validation of sampling fraction in dataset is incomplete. > As an improvement, validate sampling ratio in logical operator level: > 1) if with replacement: ratio should be nonnegative > 2) else: ratio should be on interval [0, 1] > Also add test cases for the validation. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-21167) Path is not decoded correctly when reading output of FileSink
[ https://issues.apache.org/jira/browse/SPARK-21167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-21167: - Comment: was deleted (was: User 'dijingran' has created a pull request for this issue: https://github.com/apache/spark/pull/18383) > Path is not decoded correctly when reading output of FileSink > - > > Key: SPARK-21167 > URL: https://issues.apache.org/jira/browse/SPARK-21167 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.1 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Fix For: 2.1.2, 2.2.1, 2.3.0 > > > When reading output of FileSink, path is not decoded correctly. So if the > path has some special characters, such as spaces, Spark cannot read it. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21174) Validate sampling fraction in logical operator level
[ https://issues.apache.org/jira/browse/SPARK-21174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-21174: Component/s: (was: Optimizer) SQL > Validate sampling fraction in logical operator level > > > Key: SPARK-21174 > URL: https://issues.apache.org/jira/browse/SPARK-21174 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Gengliang Wang >Priority: Minor > > Currently the validation of sampling fraction in dataset is incomplete. > As an improvement, validate sampling ratio in logical operator level: > 1) if with replacement: ratio should be nonnegative > 2) else: ratio should be on interval [0, 1] > Also add test cases for the validation. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21175) Slow down "open blocks" on shuffle service when memory shortage to avoid OOM.
jin xing created SPARK-21175: Summary: Slow down "open blocks" on shuffle service when memory shortage to avoid OOM. Key: SPARK-21175 URL: https://issues.apache.org/jira/browse/SPARK-21175 Project: Spark Issue Type: Improvement Components: Shuffle Affects Versions: 2.1.1 Reporter: jin xing A shuffle service can serve blocks from multiple apps/tasks. Thus the shuffle service can suffer high memory usage when many {{shuffle-read}} requests happen at the same time. In my cluster, OOM always happens on the shuffle service. Analyzing the heap dump, the memory used by Netty (chunks) can be up to 2~3 GB. It might make sense to reject "open blocks" requests when memory usage is high on the shuffle service. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
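A hypothetical sketch of the general idea only, not the code in the linked pull request: track the bytes currently being served and defer new "open blocks" requests above a high-water mark. The class name, fields, and threshold handling are assumptions.
{code}
import java.util.concurrent.atomic.AtomicLong

// Illustrative throttle for the shuffle service's "open blocks" path.
class OpenBlocksThrottler(maxBytesInFlight: Long) {
  private val bytesInFlight = new AtomicLong(0L)

  /** Returns true if a new "open blocks" request may be served right now. */
  def tryAcquire(requestBytes: Long): Boolean = {
    val after = bytesInFlight.addAndGet(requestBytes)
    if (after > maxBytesInFlight) {
      bytesInFlight.addAndGet(-requestBytes)  // roll back; caller should retry or queue the request
      false
    } else {
      true
    }
  }

  /** Called once the chunks for a request have been fully streamed out. */
  def release(requestBytes: Long): Unit = bytesInFlight.addAndGet(-requestBytes)
}
{code}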
[jira] [Assigned] (SPARK-21175) Slow down "open blocks" on shuffle service when memory shortage to avoid OOM.
[ https://issues.apache.org/jira/browse/SPARK-21175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21175: Assignee: Apache Spark > Slow down "open blocks" on shuffle service when memory shortage to avoid OOM. > - > > Key: SPARK-21175 > URL: https://issues.apache.org/jira/browse/SPARK-21175 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 2.1.1 >Reporter: jin xing >Assignee: Apache Spark > > A shuffle service can serves blocks from multiple apps/tasks. Thus the > shuffle service can suffers high memory usage when lots of {{shuffle-read}} > happen at the same time. In my cluster, OOM always happens on shuffle > service. Analyzing heap dump, memory cost by Netty(chunks) can be up to 2~3G. > It might make sense to reject "open blocks" request when memory usage is high > on shuffle service. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21175) Slow down "open blocks" on shuffle service when memory shortage to avoid OOM.
[ https://issues.apache.org/jira/browse/SPARK-21175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21175: Assignee: (was: Apache Spark) > Slow down "open blocks" on shuffle service when memory shortage to avoid OOM. > - > > Key: SPARK-21175 > URL: https://issues.apache.org/jira/browse/SPARK-21175 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 2.1.1 >Reporter: jin xing > > A shuffle service can serves blocks from multiple apps/tasks. Thus the > shuffle service can suffers high memory usage when lots of {{shuffle-read}} > happen at the same time. In my cluster, OOM always happens on shuffle > service. Analyzing heap dump, memory cost by Netty(chunks) can be up to 2~3G. > It might make sense to reject "open blocks" request when memory usage is high > on shuffle service. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21175) Slow down "open blocks" on shuffle service when memory shortage to avoid OOM.
[ https://issues.apache.org/jira/browse/SPARK-21175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16058892#comment-16058892 ] Apache Spark commented on SPARK-21175: -- User 'jinxing64' has created a pull request for this issue: https://github.com/apache/spark/pull/18388 > Slow down "open blocks" on shuffle service when memory shortage to avoid OOM. > - > > Key: SPARK-21175 > URL: https://issues.apache.org/jira/browse/SPARK-21175 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 2.1.1 >Reporter: jin xing > > A shuffle service can serves blocks from multiple apps/tasks. Thus the > shuffle service can suffers high memory usage when lots of {{shuffle-read}} > happen at the same time. In my cluster, OOM always happens on shuffle > service. Analyzing heap dump, memory cost by Netty(chunks) can be up to 2~3G. > It might make sense to reject "open blocks" request when memory usage is high > on shuffle service. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19700) Design an API for pluggable scheduler implementations
[ https://issues.apache.org/jira/browse/SPARK-19700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16058917#comment-16058917 ] Andrew Ash commented on SPARK-19700: Found another potential implementation: Facebook's in-house scheduler mentioned by [~tejasp] at https://github.com/apache/spark/pull/18209#issuecomment-307538061 > Design an API for pluggable scheduler implementations > - > > Key: SPARK-19700 > URL: https://issues.apache.org/jira/browse/SPARK-19700 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Matt Cheah > > One point that was brought up in discussing SPARK-18278 was that schedulers > cannot easily be added to Spark without forking the whole project. The main > reason is that much of the scheduler's behavior fundamentally depends on the > CoarseGrainedSchedulerBackend class, which is not part of the public API of > Spark and is in fact quite a complex module. As resource management and > allocation continues evolves, Spark will need to be integrated with more > cluster managers, but maintaining support for all possible allocators in the > Spark project would be untenable. Furthermore, it would be impossible for > Spark to support proprietary frameworks that are developed by specific users > for their other particular use cases. > Therefore, this ticket proposes making scheduler implementations fully > pluggable. The idea is that Spark will provide a Java/Scala interface that is > to be implemented by a scheduler that is backed by the cluster manager of > interest. The user can compile their scheduler's code into a JAR that is > placed on the driver's classpath. Finally, as is the case in the current > world, the scheduler implementation is selected and dynamically loaded > depending on the user's provided master URL. > Determining the correct API is the most challenging problem. The current > CoarseGrainedSchedulerBackend handles many responsibilities, some of which > will be common across all cluster managers, and some which will be specific > to a particular cluster manager. For example, the particular mechanism for > creating the executor processes will differ between YARN and Mesos, but, once > these executors have started running, the means to submit tasks to them over > the Netty RPC is identical across the board. > We must also consider a plugin model and interface for submitting the > application as well, because different cluster managers support different > configuration options, and thus the driver must be bootstrapped accordingly. > For example, in YARN mode the application and Hadoop configuration must be > packaged and shipped to the distributed cache prior to launching the job. A > prototype of a Kubernetes implementation starts a Kubernetes pod that runs > the driver in cluster mode. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
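To make the proposal concrete, a purely hypothetical sketch of what such a driver-side plugin interface could look like; the trait and its methods are assumptions for illustration, not an existing Spark API. An implementation would be compiled into a JAR on the driver classpath and selected by the master URL, as the ticket describes.
{code}
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical service-provider interface for a pluggable scheduler.
trait PluggableScheduler {
  /** Whether this plugin handles the given master URL, e.g. "cook://host:port". */
  def canCreate(masterURL: String): Boolean

  /** Bootstrap whatever the cluster manager needs before tasks can be scheduled. */
  def initialize(sc: SparkContext, conf: SparkConf, masterURL: String): Unit

  /** Ask the cluster manager to start `count` executor processes. */
  def requestExecutors(count: Int): Boolean

  /** Release the given executors back to the cluster manager. */
  def killExecutors(executorIds: Seq[String]): Boolean
}
{code}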
[jira] [Comment Edited] (SPARK-19700) Design an API for pluggable scheduler implementations
[ https://issues.apache.org/jira/browse/SPARK-19700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16058930#comment-16058930 ] Andrew Ash edited comment on SPARK-19700 at 6/22/17 7:47 AM: - Public implementation that's been around a while (Hamel is familiar with this): Two Sigma's integration with their Cook scheduler, recent diff at https://github.com/twosigma/spark/compare/cd0a083...2.1.0-cook was (Author: aash): Public implementation that's been around a while: Two Sigma's integration with their Cook scheduler, recent diff at https://github.com/twosigma/spark/compare/cd0a083...2.1.0-cook > Design an API for pluggable scheduler implementations > - > > Key: SPARK-19700 > URL: https://issues.apache.org/jira/browse/SPARK-19700 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Matt Cheah > > One point that was brought up in discussing SPARK-18278 was that schedulers > cannot easily be added to Spark without forking the whole project. The main > reason is that much of the scheduler's behavior fundamentally depends on the > CoarseGrainedSchedulerBackend class, which is not part of the public API of > Spark and is in fact quite a complex module. As resource management and > allocation continues evolves, Spark will need to be integrated with more > cluster managers, but maintaining support for all possible allocators in the > Spark project would be untenable. Furthermore, it would be impossible for > Spark to support proprietary frameworks that are developed by specific users > for their other particular use cases. > Therefore, this ticket proposes making scheduler implementations fully > pluggable. The idea is that Spark will provide a Java/Scala interface that is > to be implemented by a scheduler that is backed by the cluster manager of > interest. The user can compile their scheduler's code into a JAR that is > placed on the driver's classpath. Finally, as is the case in the current > world, the scheduler implementation is selected and dynamically loaded > depending on the user's provided master URL. > Determining the correct API is the most challenging problem. The current > CoarseGrainedSchedulerBackend handles many responsibilities, some of which > will be common across all cluster managers, and some which will be specific > to a particular cluster manager. For example, the particular mechanism for > creating the executor processes will differ between YARN and Mesos, but, once > these executors have started running, the means to submit tasks to them over > the Netty RPC is identical across the board. > We must also consider a plugin model and interface for submitting the > application as well, because different cluster managers support different > configuration options, and thus the driver must be bootstrapped accordingly. > For example, in YARN mode the application and Hadoop configuration must be > packaged and shipped to the distributed cache prior to launching the job. A > prototype of a Kubernetes implementation starts a Kubernetes pod that runs > the driver in cluster mode. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19700) Design an API for pluggable scheduler implementations
[ https://issues.apache.org/jira/browse/SPARK-19700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16058930#comment-16058930 ] Andrew Ash commented on SPARK-19700: Public implementation that's been around a while: Two Sigma's integration with their Cook scheduler, recent diff at https://github.com/twosigma/spark/compare/cd0a083...2.1.0-cook > Design an API for pluggable scheduler implementations > - > > Key: SPARK-19700 > URL: https://issues.apache.org/jira/browse/SPARK-19700 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Matt Cheah > > One point that was brought up in discussing SPARK-18278 was that schedulers > cannot easily be added to Spark without forking the whole project. The main > reason is that much of the scheduler's behavior fundamentally depends on the > CoarseGrainedSchedulerBackend class, which is not part of the public API of > Spark and is in fact quite a complex module. As resource management and > allocation continues evolves, Spark will need to be integrated with more > cluster managers, but maintaining support for all possible allocators in the > Spark project would be untenable. Furthermore, it would be impossible for > Spark to support proprietary frameworks that are developed by specific users > for their other particular use cases. > Therefore, this ticket proposes making scheduler implementations fully > pluggable. The idea is that Spark will provide a Java/Scala interface that is > to be implemented by a scheduler that is backed by the cluster manager of > interest. The user can compile their scheduler's code into a JAR that is > placed on the driver's classpath. Finally, as is the case in the current > world, the scheduler implementation is selected and dynamically loaded > depending on the user's provided master URL. > Determining the correct API is the most challenging problem. The current > CoarseGrainedSchedulerBackend handles many responsibilities, some of which > will be common across all cluster managers, and some which will be specific > to a particular cluster manager. For example, the particular mechanism for > creating the executor processes will differ between YARN and Mesos, but, once > these executors have started running, the means to submit tasks to them over > the Netty RPC is identical across the board. > We must also consider a plugin model and interface for submitting the > application as well, because different cluster managers support different > configuration options, and thus the driver must be bootstrapped accordingly. > For example, in YARN mode the application and Hadoop configuration must be > packaged and shipped to the distributed cache prior to launching the job. A > prototype of a Kubernetes implementation starts a Kubernetes pod that runs > the driver in cluster mode. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14174) Accelerate KMeans via Mini-Batch EM
[ https://issues.apache.org/jira/browse/SPARK-14174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-14174: - Attachment: (was: MiniBatchKMeans_Performance_II.pdf) > Accelerate KMeans via Mini-Batch EM > --- > > Key: SPARK-14174 > URL: https://issues.apache.org/jira/browse/SPARK-14174 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: zhengruifeng > Attachments: MiniBatchKMeans_Performance.pdf > > > The MiniBatchKMeans is a variant of the KMeans algorithm which uses > mini-batches to reduce the computation time, while still attempting to > optimise the same objective function. Mini-batches are subsets of the input > data, randomly sampled in each training iteration. These mini-batches > drastically reduce the amount of computation required to converge to a local > solution. In contrast to other algorithms that reduce the convergence time of > k-means, mini-batch k-means produces results that are generally only slightly > worse than the standard algorithm. > I have implemented mini-batch kmeans in Mllib, and the acceleration is realy > significant. > The MiniBatch KMeans is named XMeans in following lines. > {code} > val path = "/tmp/mnist8m.scale" > val data = MLUtils.loadLibSVMFile(sc, path) > val vecs = data.map(_.features).persist() > val km = KMeans.train(data=vecs, k=10, maxIterations=10, runs=1, > initializationMode="k-means||", seed=123l) > km.computeCost(vecs) > res0: Double = 3.317029898599564E8 > val xm = XMeans.train(data=vecs, k=10, maxIterations=10, runs=1, > initializationMode="k-means||", miniBatchFraction=0.1, seed=123l) > xm.computeCost(vecs) > res1: Double = 3.3169865959604424E8 > val xm2 = XMeans.train(data=vecs, k=10, maxIterations=10, runs=1, > initializationMode="k-means||", miniBatchFraction=0.01, seed=123l) > xm2.computeCost(vecs) > res2: Double = 3.317195831216454E8 > {code} > The above three training all reached the max number of iterations 10. > We can see that the WSSSEs are almost the same. While their speed perfermence > have significant difference: > {code} > KMeans2876sec > MiniBatch KMeans (fraction=0.1) 263sec > MiniBatch KMeans (fraction=0.01) 90sec > {code} > With appropriate fraction, the bigger the dataset is, the higher speedup is. > The data used above have 8,100,000 samples, 784 features. It can be > downloaded here > (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/mnist8m.scale.bz2) > Comparison of the K-Means and MiniBatchKMeans on sklearn : > http://scikit-learn.org/stable/auto_examples/cluster/plot_mini_batch_kmeans.html#example-cluster-plot-mini-batch-kmeans-py -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14174) Accelerate KMeans via Mini-Batch EM
[ https://issues.apache.org/jira/browse/SPARK-14174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-14174: - Attachment: (was: MiniBatchKMeans_Performance.pdf) > Accelerate KMeans via Mini-Batch EM > --- > > Key: SPARK-14174 > URL: https://issues.apache.org/jira/browse/SPARK-14174 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: zhengruifeng > > The MiniBatchKMeans is a variant of the KMeans algorithm which uses > mini-batches to reduce the computation time, while still attempting to > optimise the same objective function. Mini-batches are subsets of the input > data, randomly sampled in each training iteration. These mini-batches > drastically reduce the amount of computation required to converge to a local > solution. In contrast to other algorithms that reduce the convergence time of > k-means, mini-batch k-means produces results that are generally only slightly > worse than the standard algorithm. > I have implemented mini-batch kmeans in Mllib, and the acceleration is realy > significant. > The MiniBatch KMeans is named XMeans in following lines. > {code} > val path = "/tmp/mnist8m.scale" > val data = MLUtils.loadLibSVMFile(sc, path) > val vecs = data.map(_.features).persist() > val km = KMeans.train(data=vecs, k=10, maxIterations=10, runs=1, > initializationMode="k-means||", seed=123l) > km.computeCost(vecs) > res0: Double = 3.317029898599564E8 > val xm = XMeans.train(data=vecs, k=10, maxIterations=10, runs=1, > initializationMode="k-means||", miniBatchFraction=0.1, seed=123l) > xm.computeCost(vecs) > res1: Double = 3.3169865959604424E8 > val xm2 = XMeans.train(data=vecs, k=10, maxIterations=10, runs=1, > initializationMode="k-means||", miniBatchFraction=0.01, seed=123l) > xm2.computeCost(vecs) > res2: Double = 3.317195831216454E8 > {code} > The above three training all reached the max number of iterations 10. > We can see that the WSSSEs are almost the same. While their speed perfermence > have significant difference: > {code} > KMeans2876sec > MiniBatch KMeans (fraction=0.1) 263sec > MiniBatch KMeans (fraction=0.01) 90sec > {code} > With appropriate fraction, the bigger the dataset is, the higher speedup is. > The data used above have 8,100,000 samples, 784 features. It can be > downloaded here > (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/mnist8m.scale.bz2) > Comparison of the K-Means and MiniBatchKMeans on sklearn : > http://scikit-learn.org/stable/auto_examples/cluster/plot_mini_batch_kmeans.html#example-cluster-plot-mini-batch-kmeans-py -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14174) Implement the Mini-Batch KMeans
[ https://issues.apache.org/jira/browse/SPARK-14174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-14174: - Attachment: MBKM.xlsx > Implement the Mini-Batch KMeans > --- > > Key: SPARK-14174 > URL: https://issues.apache.org/jira/browse/SPARK-14174 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: zhengruifeng > Attachments: MBKM.xlsx > > > The MiniBatchKMeans is a variant of the KMeans algorithm which uses > mini-batches to reduce the computation time, while still attempting to > optimise the same objective function. Mini-batches are subsets of the input > data, randomly sampled in each training iteration. These mini-batches > drastically reduce the amount of computation required to converge to a local > solution. In contrast to other algorithms that reduce the convergence time of > k-means, mini-batch k-means produces results that are generally only slightly > worse than the standard algorithm. > Comparison of the K-Means and MiniBatchKMeans on sklearn : > http://scikit-learn.org/stable/auto_examples/cluster/plot_mini_batch_kmeans.html#example-cluster-plot-mini-batch-kmeans-py > Since MiniBatch-KMeans with fraction=1.0 is not equal to KMeans, so I make it > a new estimator -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14174) Accelerate KMeans via Mini-Batch EM
[ https://issues.apache.org/jira/browse/SPARK-14174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-14174: - Description: The MiniBatchKMeans is a variant of the KMeans algorithm which uses mini-batches to reduce the computation time, while still attempting to optimise the same objective function. Mini-batches are subsets of the input data, randomly sampled in each training iteration. These mini-batches drastically reduce the amount of computation required to converge to a local solution. In contrast to other algorithms that reduce the convergence time of k-means, mini-batch k-means produces results that are generally only slightly worse than the standard algorithm. Comparison of the K-Means and MiniBatchKMeans on sklearn : http://scikit-learn.org/stable/auto_examples/cluster/plot_mini_batch_kmeans.html#example-cluster-plot-mini-batch-kmeans-py Since MiniBatch-KMeans with fraction=1.0 is not equal to KMeans, so I make it a new estimator was: The MiniBatchKMeans is a variant of the KMeans algorithm which uses mini-batches to reduce the computation time, while still attempting to optimise the same objective function. Mini-batches are subsets of the input data, randomly sampled in each training iteration. These mini-batches drastically reduce the amount of computation required to converge to a local solution. In contrast to other algorithms that reduce the convergence time of k-means, mini-batch k-means produces results that are generally only slightly worse than the standard algorithm. I have implemented mini-batch kmeans in Mllib, and the acceleration is realy significant. The MiniBatch KMeans is named XMeans in following lines. {code} val path = "/tmp/mnist8m.scale" val data = MLUtils.loadLibSVMFile(sc, path) val vecs = data.map(_.features).persist() val km = KMeans.train(data=vecs, k=10, maxIterations=10, runs=1, initializationMode="k-means||", seed=123l) km.computeCost(vecs) res0: Double = 3.317029898599564E8 val xm = XMeans.train(data=vecs, k=10, maxIterations=10, runs=1, initializationMode="k-means||", miniBatchFraction=0.1, seed=123l) xm.computeCost(vecs) res1: Double = 3.3169865959604424E8 val xm2 = XMeans.train(data=vecs, k=10, maxIterations=10, runs=1, initializationMode="k-means||", miniBatchFraction=0.01, seed=123l) xm2.computeCost(vecs) res2: Double = 3.317195831216454E8 {code} The above three training all reached the max number of iterations 10. We can see that the WSSSEs are almost the same. While their speed perfermence have significant difference: {code} KMeans2876sec MiniBatch KMeans (fraction=0.1) 263sec MiniBatch KMeans (fraction=0.01) 90sec {code} With appropriate fraction, the bigger the dataset is, the higher speedup is. The data used above have 8,100,000 samples, 784 features. It can be downloaded here (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/mnist8m.scale.bz2) Comparison of the K-Means and MiniBatchKMeans on sklearn : http://scikit-learn.org/stable/auto_examples/cluster/plot_mini_batch_kmeans.html#example-cluster-plot-mini-batch-kmeans-py > Accelerate KMeans via Mini-Batch EM > --- > > Key: SPARK-14174 > URL: https://issues.apache.org/jira/browse/SPARK-14174 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: zhengruifeng > Attachments: MBKM.xlsx > > > The MiniBatchKMeans is a variant of the KMeans algorithm which uses > mini-batches to reduce the computation time, while still attempting to > optimise the same objective function. 
Mini-batches are subsets of the input > data, randomly sampled in each training iteration. These mini-batches > drastically reduce the amount of computation required to converge to a local > solution. In contrast to other algorithms that reduce the convergence time of > k-means, mini-batch k-means produces results that are generally only slightly > worse than the standard algorithm. > Comparison of the K-Means and MiniBatchKMeans on sklearn : > http://scikit-learn.org/stable/auto_examples/cluster/plot_mini_batch_kmeans.html#example-cluster-plot-mini-batch-kmeans-py > Since MiniBatch-KMeans with fraction=1.0 is not equal to KMeans, so I make it > a new estimator -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
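For readers unfamiliar with the algorithm, a simplified sketch of one mini-batch update; the mini-batch is collected to the driver here purely for clarity, and this is not the implementation in the attached PR.
{code}
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// One mini-batch k-means iteration: sample a fraction of the data, assign each
// sampled point to its nearest center, and move that center toward the point
// with a per-center learning rate of 1 / (points seen so far by that center).
def miniBatchStep(
    data: RDD[Vector],
    centers: Array[Vector],
    counts: Array[Long],
    miniBatchFraction: Double,
    seed: Long): Unit = {
  val batch = data.sample(withReplacement = false, miniBatchFraction, seed).collect()
  batch.foreach { point =>
    val c = centers.indices.minBy(i => Vectors.sqdist(centers(i), point))  // nearest center
    counts(c) += 1
    val eta = 1.0 / counts(c)
    // centers(c) := (1 - eta) * centers(c) + eta * point
    val updated = centers(c).toArray.zip(point.toArray).map { case (ci, xi) => (1 - eta) * ci + eta * xi }
    centers(c) = Vectors.dense(updated)
  }
}
{code}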
[jira] [Updated] (SPARK-14174) Implement the Mini-Batch KMeans
[ https://issues.apache.org/jira/browse/SPARK-14174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-14174: - Summary: Implement the Mini-Batch KMeans (was: Accelerate KMeans via Mini-Batch EM) > Implement the Mini-Batch KMeans > --- > > Key: SPARK-14174 > URL: https://issues.apache.org/jira/browse/SPARK-14174 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: zhengruifeng > Attachments: MBKM.xlsx > > > The MiniBatchKMeans is a variant of the KMeans algorithm which uses > mini-batches to reduce the computation time, while still attempting to > optimise the same objective function. Mini-batches are subsets of the input > data, randomly sampled in each training iteration. These mini-batches > drastically reduce the amount of computation required to converge to a local > solution. In contrast to other algorithms that reduce the convergence time of > k-means, mini-batch k-means produces results that are generally only slightly > worse than the standard algorithm. > Comparison of the K-Means and MiniBatchKMeans on sklearn : > http://scikit-learn.org/stable/auto_examples/cluster/plot_mini_batch_kmeans.html#example-cluster-plot-mini-batch-kmeans-py > Since MiniBatch-KMeans with fraction=1.0 is not equal to KMeans, so I make it > a new estimator -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21163) DataFrame.toPandas should respect the data type
[ https://issues.apache.org/jira/browse/SPARK-21163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-21163. - Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 18378 [https://github.com/apache/spark/pull/18378] > DataFrame.toPandas should respect the data type > --- > > Key: SPARK-21163 > URL: https://issues.apache.org/jira/browse/SPARK-21163 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.3.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14174) Implement the Mini-Batch KMeans
[ https://issues.apache.org/jira/browse/SPARK-14174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16058975#comment-16058975 ] Apache Spark commented on SPARK-14174: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/18389 > Implement the Mini-Batch KMeans > --- > > Key: SPARK-14174 > URL: https://issues.apache.org/jira/browse/SPARK-14174 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: zhengruifeng > Attachments: MBKM.xlsx > > > The MiniBatchKMeans is a variant of the KMeans algorithm which uses > mini-batches to reduce the computation time, while still attempting to > optimise the same objective function. Mini-batches are subsets of the input > data, randomly sampled in each training iteration. These mini-batches > drastically reduce the amount of computation required to converge to a local > solution. In contrast to other algorithms that reduce the convergence time of > k-means, mini-batch k-means produces results that are generally only slightly > worse than the standard algorithm. > Comparison of the K-Means and MiniBatchKMeans on sklearn : > http://scikit-learn.org/stable/auto_examples/cluster/plot_mini_batch_kmeans.html#example-cluster-plot-mini-batch-kmeans-py > Since MiniBatch-KMeans with fraction=1.0 is not equal to KMeans, so I make it > a new estimator -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14174) Implement the Mini-Batch KMeans
[ https://issues.apache.org/jira/browse/SPARK-14174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16058982#comment-16058982 ] zhengruifeng commented on SPARK-14174: -- [~mlnick] I send a new PR for MiniBatch KMeans, and the corresponding test results are attached. > Implement the Mini-Batch KMeans > --- > > Key: SPARK-14174 > URL: https://issues.apache.org/jira/browse/SPARK-14174 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: zhengruifeng > Attachments: MBKM.xlsx > > > The MiniBatchKMeans is a variant of the KMeans algorithm which uses > mini-batches to reduce the computation time, while still attempting to > optimise the same objective function. Mini-batches are subsets of the input > data, randomly sampled in each training iteration. These mini-batches > drastically reduce the amount of computation required to converge to a local > solution. In contrast to other algorithms that reduce the convergence time of > k-means, mini-batch k-means produces results that are generally only slightly > worse than the standard algorithm. > Comparison of the K-Means and MiniBatchKMeans on sklearn : > http://scikit-learn.org/stable/auto_examples/cluster/plot_mini_batch_kmeans.html#example-cluster-plot-mini-batch-kmeans-py > Since MiniBatch-KMeans with fraction=1.0 is not equal to KMeans, so I make it > a new estimator -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21176) Master UI hangs with spark.ui.reverseProxy=true if the master node has many CPUs
Ingo Schuster created SPARK-21176: - Summary: Master UI hangs with spark.ui.reverseProxy=true if the master node has many CPUs Key: SPARK-21176 URL: https://issues.apache.org/jira/browse/SPARK-21176 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 2.1.0 Environment: ppc64le GNU/Linux, POWER8, only master node is reachable externally other nodes are in an internal network Reporter: Ingo Schuster When *reverse proxy is enabled* {quote} spark.ui.reverseProxy=true spark.ui.reverseProxyUrl=/ {quote} first of all any invocation of the spark master Web UI hangs forever locally (e.g. http://192.168.10.16:25001) and via external URL without any data received. One, sometimes two spark applications succeed without error and than workers start throwing exceptions: {quote} Caused by: java.io.IOException: Failed to connect to /192.168.10.16:25050 {quote} The application dies during creation of SparkContext: {quote} 2017-05-22 16:11:23 INFO StandaloneAppClient$ClientEndpoint:54 - Connecting to master spark://node0101:25000... 2017-05-22 16:11:23 INFO TransportClientFactory:254 - Successfully created connection to node0101/192.168.10.16:25000 after 169 ms (132 ms spent in bootstraps) 2017-05-22 16:11:43 INFO StandaloneAppClient$ClientEndpoint:54 - Connecting to master spark://node0101:25000... 2017-05-22 16:12:03 INFO StandaloneAppClient$ClientEndpoint:54 - Connecting to master spark://node0101:25000... 2017-05-22 16:12:23 ERROR StandaloneSchedulerBackend:70 - Application has been killed. Reason: All masters are unresponsive! Giving up. 2017-05-22 16:12:23 WARN StandaloneSchedulerBackend:66 - Application ID is not initialized yet. 2017-05-22 16:12:23 INFO Utils:54 - Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 25056. . Caused by: java.lang.IllegalArgumentException: requirement failed: Can only call getServletHandlers on a running MetricsSystem {quote} *This definitively does not happen without reverse proxy enabled!* -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21176) Master UI hangs with spark.ui.reverseProxy=true if the master node has many CPUs
[ https://issues.apache.org/jira/browse/SPARK-21176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ingo Schuster updated SPARK-21176: -- Affects Version/s: 2.2.1 2.2.0 2.1.1 > Master UI hangs with spark.ui.reverseProxy=true if the master node has many > CPUs > > > Key: SPARK-21176 > URL: https://issues.apache.org/jira/browse/SPARK-21176 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.1.0, 2.1.1, 2.2.0, 2.2.1 > Environment: ppc64le GNU/Linux, POWER8, only master node is reachable > externally other nodes are in an internal network >Reporter: Ingo Schuster > Labels: network, web-ui > > When *reverse proxy is enabled* > {quote} > spark.ui.reverseProxy=true > spark.ui.reverseProxyUrl=/ > {quote} > first of all any invocation of the spark master Web UI hangs forever locally > (e.g. http://192.168.10.16:25001) and via external URL without any data > received. > One, sometimes two spark applications succeed without error and than workers > start throwing exceptions: > {quote} > Caused by: java.io.IOException: Failed to connect to /192.168.10.16:25050 > {quote} > The application dies during creation of SparkContext: > {quote} > 2017-05-22 16:11:23 INFO StandaloneAppClient$ClientEndpoint:54 - Connecting > to master spark://node0101:25000... > 2017-05-22 16:11:23 INFO TransportClientFactory:254 - Successfully created > connection to node0101/192.168.10.16:25000 after 169 ms (132 ms spent in > bootstraps) > 2017-05-22 16:11:43 INFO StandaloneAppClient$ClientEndpoint:54 - Connecting > to master spark://node0101:25000... > 2017-05-22 16:12:03 INFO StandaloneAppClient$ClientEndpoint:54 - Connecting > to master spark://node0101:25000... > 2017-05-22 16:12:23 ERROR StandaloneSchedulerBackend:70 - Application has > been killed. Reason: All masters are unresponsive! Giving up. > 2017-05-22 16:12:23 WARN StandaloneSchedulerBackend:66 - Application ID is > not initialized yet. > 2017-05-22 16:12:23 INFO Utils:54 - Successfully started service > 'org.apache.spark.network.netty.NettyBlockTransferService' on port 25056. > . > Caused by: java.lang.IllegalArgumentException: requirement failed: Can only > call getServletHandlers on a running MetricsSystem > {quote} > *This definitively does not happen without reverse proxy enabled!* -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21080) Workaround for HDFS delegation token expiry broken with some Hadoop versions
[ https://issues.apache.org/jira/browse/SPARK-21080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16058985#comment-16058985 ] Sean Owen commented on SPARK-21080: --- This really isn't my area; maybe [~vanzin]? > Workaround for HDFS delegation token expiry broken with some Hadoop versions > > > Key: SPARK-21080 > URL: https://issues.apache.org/jira/browse/SPARK-21080 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.1.0 > Environment: Spark 2.1.0 on Yarn, Hadoop 2.7.3 >Reporter: Lukasz Raszka >Priority: Minor > > We're getting hit by SPARK-11182, where the core issue in HDFS has been > fixed in more recent versions. It seems that the [workaround introduced by user > SaintBacchus|https://github.com/apache/spark/commit/646366b5d2f12e42f8e7287672ba29a8c918a17d] > doesn't work in newer versions of Hadoop. This seems to be caused by a move of the > property name from {{fs.hdfs.impl}} to {{fs.AbstractFileSystem.hdfs.impl}}, > which happened somewhere around 2.7.0 or earlier. Taking this into account > should make the workaround work again for more recent Hadoop versions. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
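A small illustrative check of the two property names mentioned above; whether both need handling is an assumption of this sketch, not something confirmed in the ticket.
{code}
import org.apache.hadoop.conf.Configuration

// Newer Hadoop resolves the HDFS implementation through the
// "fs.AbstractFileSystem.hdfs.impl" key instead of "fs.hdfs.impl",
// so code keyed off the old name may need to look at both.
val hadoopConf = new Configuration()
val oldStyle = hadoopConf.get("fs.hdfs.impl")                      // property name before the move
val newStyle = hadoopConf.get("fs.AbstractFileSystem.hdfs.impl")   // property name after the move
println(s"fs.hdfs.impl=$oldStyle, fs.AbstractFileSystem.hdfs.impl=$newStyle")
{code}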
[jira] [Updated] (SPARK-21176) Master UI hangs with spark.ui.reverseProxy=true if the master node has many CPUs
[ https://issues.apache.org/jira/browse/SPARK-21176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ingo Schuster updated SPARK-21176: -- Description: In reverse proxy mode, Sparks exhausts the Jetty thread pool if the master node has too many cpus or the cluster has too many executers: For each connector, Jetty creates Selector threads: minimum 4, maximum half the number of available CPUs: {{selectors>0?selectors:Math.max(1,Math.min(4,Runtime.getRuntime().availableProcessors()/2)));}} (see https://github.com/eclipse/jetty.project/blob/jetty-9.3.x/jetty-server/src/main/java/org/eclipse/jetty/server/ServerConnector.java) In reverse proxy mode, a connector is set up for each executor and one for the master UI. I have a system with 88 CPUs on the master node and 7 executors. Jetty tries to instantiate 8*44 = 352 selector threads, but since the QueuedThreadPool is initialized with 200 threads by default, the UI gets stuck. I have patched JettyUtils.scala to extend the thread pool ( {{val pool = new QueuedThreadPool*(400)* }}). With this hack, the UI works. Obviously, the Jetty defaults are meant for a real web server. If that has 88 CPUs, you do certainly expect a lot of traffic. For the Spark admin UI however, there will rarely be concurrent accesses for the same application or the same executor. I therefore propose to dramatically reduce the number of selector threads that get instantiated - at least by default. I will propose a fix in a pull request. was: In reverse proxy mode, Sparks exhausts the Jetty thread pool if the master node has too many cpus or the cluster has too many executers: For each connector, Jetty creates Selector threads: minimum 4, maximum half the number of available CPUs: {{ selectors>0?selectors:Math.max(1,Math.min(4,Runtime.getRuntime().availableProcessors()/2)));}} (see https://github.com/eclipse/jetty.project/blob/jetty-9.3.x/jetty-server/src/main/java/org/eclipse/jetty/server/ServerConnector.java) In reverse proxy mode, a connector is set up for each executor and one for the master UI. I have a system with 88 CPUs on the master node and 7 executors. Jetty tries to instantiate 8*44 = 352 selector threads, but since the QueuedThreadPool is initialized with 200 threads by default, the UI gets stuck. I have patched JettyUtils.scala to extend the thread pool ( {{val pool = new QueuedThreadPool*(400)* }}). With this hack, the UI works. Obviously, the Jetty defaults are meant for a real web server. If that has 88 CPUs, you do certainly expect a lot of traffic. For the Spark admin UI however, there will rarely be concurrent accesses for the same application or the same executor. I therefore propose to dramatically reduce the number of selector threads that get instantiated - at least by default. I will propose a fix in a pull request. 
> Master UI hangs with spark.ui.reverseProxy=true if the master node has many > CPUs > > > Key: SPARK-21176 > URL: https://issues.apache.org/jira/browse/SPARK-21176 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.1.0, 2.1.1, 2.2.0, 2.2.1 > Environment: ppc64le GNU/Linux, POWER8, only master node is reachable > externally other nodes are in an internal network >Reporter: Ingo Schuster > Labels: network, web-ui > > In reverse proxy mode, Sparks exhausts the Jetty thread pool if the master > node has too many cpus or the cluster has too many executers: > For each connector, Jetty creates Selector threads: minimum 4, maximum half > the number of available CPUs: > {{selectors>0?selectors:Math.max(1,Math.min(4,Runtime.getRuntime().availableProcessors()/2)));}} > (see > https://github.com/eclipse/jetty.project/blob/jetty-9.3.x/jetty-server/src/main/java/org/eclipse/jetty/server/ServerConnector.java) > In reverse proxy mode, a connector is set up for each executor and one for > the master UI. > I have a system with 88 CPUs on the master node and 7 executors. Jetty tries > to instantiate 8*44 = 352 selector threads, but since the QueuedThreadPool is > initialized with 200 threads by default, the UI gets stuck. > I have patched JettyUtils.scala to extend the thread pool ( {{val pool = new > QueuedThreadPool*(400)* }}). With this hack, the UI works. > Obviously, the Jetty defaults are meant for a real web server. If that has 88 > CPUs, you do certainly expect a lot of traffic. > For the Spark admin UI however, there will rarely be concurrent accesses for > the same application or the same executor. > I therefore propose to dramatically reduce the number of selector threads > that get instantiated - at least by default. > I will propose a fix in a pull request. -- This message was sent by Atlassian JIRA (v6.4.14#64029) -
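As an illustration of the direction the reporter describes (explicit selector counts and pool sizing rather than the CPU-derived default), a minimal Jetty sketch; the exact values and where this would live in JettyUtils are assumptions, not the announced fix.
{code}
import org.eclipse.jetty.server.{Server, ServerConnector}
import org.eclipse.jetty.util.thread.QueuedThreadPool

// Cap acceptor/selector threads per connector instead of letting Jetty derive
// them from the CPU count, so many reverse-proxied connectors on a large master
// node do not exhaust the shared thread pool.
val pool = new QueuedThreadPool(200)          // Jetty's default maximum
val server = new Server(pool)

// ServerConnector(server, acceptors, selectors): explicit small values override
// the CPU-derived default quoted above.
val connector = new ServerConnector(server, 1, 2)
connector.setPort(25001)
server.addConnector(connector)
{code}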
[jira] [Updated] (SPARK-21176) Master UI hangs with spark.ui.reverseProxy=true if the master node has many CPUs
[ https://issues.apache.org/jira/browse/SPARK-21176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ingo Schuster updated SPARK-21176: -- Description: In reverse proxy mode, Sparks exhausts the Jetty thread pool if the master node has too many cpus or the cluster has too many executers: For each connector, Jetty creates Selector threads: minimum 4, maximum half the number of available CPUs: {{ selectors>0?selectors:Math.max(1,Math.min(4,Runtime.getRuntime().availableProcessors()/2)));}} (see https://github.com/eclipse/jetty.project/blob/jetty-9.3.x/jetty-server/src/main/java/org/eclipse/jetty/server/ServerConnector.java) In reverse proxy mode, a connector is set up for each executor and one for the master UI. I have a system with 88 CPUs on the master node and 7 executors. Jetty tries to instantiate 8*44 = 352 selector threads, but since the QueuedThreadPool is initialized with 200 threads by default, the UI gets stuck. I have patched JettyUtils.scala to extend the thread pool ( {{val pool = new QueuedThreadPool*(400)* }}). With this hack, the UI works. Obviously, the Jetty defaults are meant for a real web server. If that has 88 CPUs, you do certainly expect a lot of traffic. For the Spark admin UI however, there will rarely be concurrent accesses for the same application or the same executor. I therefore propose to dramatically reduce the number of selector threads that get instantiated - at least by default. I will propose a fix in a pull request. was: When *reverse proxy is enabled* {quote} spark.ui.reverseProxy=true spark.ui.reverseProxyUrl=/ {quote} first of all any invocation of the spark master Web UI hangs forever locally (e.g. http://192.168.10.16:25001) and via external URL without any data received. One, sometimes two spark applications succeed without error and than workers start throwing exceptions: {quote} Caused by: java.io.IOException: Failed to connect to /192.168.10.16:25050 {quote} The application dies during creation of SparkContext: {quote} 2017-05-22 16:11:23 INFO StandaloneAppClient$ClientEndpoint:54 - Connecting to master spark://node0101:25000... 2017-05-22 16:11:23 INFO TransportClientFactory:254 - Successfully created connection to node0101/192.168.10.16:25000 after 169 ms (132 ms spent in bootstraps) 2017-05-22 16:11:43 INFO StandaloneAppClient$ClientEndpoint:54 - Connecting to master spark://node0101:25000... 2017-05-22 16:12:03 INFO StandaloneAppClient$ClientEndpoint:54 - Connecting to master spark://node0101:25000... 2017-05-22 16:12:23 ERROR StandaloneSchedulerBackend:70 - Application has been killed. Reason: All masters are unresponsive! Giving up. 2017-05-22 16:12:23 WARN StandaloneSchedulerBackend:66 - Application ID is not initialized yet. 2017-05-22 16:12:23 INFO Utils:54 - Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 25056. . 
Caused by: java.lang.IllegalArgumentException: requirement failed: Can only call getServletHandlers on a running MetricsSystem {quote} *This definitively does not happen without reverse proxy enabled!* > Master UI hangs with spark.ui.reverseProxy=true if the master node has many > CPUs > > > Key: SPARK-21176 > URL: https://issues.apache.org/jira/browse/SPARK-21176 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.1.0, 2.1.1, 2.2.0, 2.2.1 > Environment: ppc64le GNU/Linux, POWER8, only master node is reachable > externally other nodes are in an internal network >Reporter: Ingo Schuster > Labels: network, web-ui > > In reverse proxy mode, Sparks exhausts the Jetty thread pool if the master > node has too many cpus or the cluster has too many executers: > For each connector, Jetty creates Selector threads: minimum 4, maximum half > the number of available CPUs: > {{ > selectors>0?selectors:Math.max(1,Math.min(4,Runtime.getRuntime().availableProcessors()/2)));}} > (see > https://github.com/eclipse/jetty.project/blob/jetty-9.3.x/jetty-server/src/main/java/org/eclipse/jetty/server/ServerConnector.java) > In reverse proxy mode, a connector is set up for each executor and one for > the master UI. > I have a system with 88 CPUs on the master node and 7 executors. Jetty tries > to instantiate 8*44 = 352 selector threads, but since the QueuedThreadPool is > initialized with 200 threads by default, the UI gets stuck. > I have patched JettyUtils.scala to extend the thread pool ( {{val pool = new > QueuedThreadPool*(400)* }}). With this hack, the UI works. > Obviously, the Jetty defaults are meant for a real web server. If that has 88 > CPUs, you do certainly expect a lot of traffic. > For the Spark admin UI however, there will rarely be concurrent accesses for > the same application
[jira] [Resolved] (SPARK-21173) There are several configuration about SSL displayed in configuration.md but never be used.
[ https://issues.apache.org/jira/browse/SPARK-21173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-21173. --- Resolution: Not A Problem > There are several configuration about SSL displayed in configuration.md but > never be used. > -- > > Key: SPARK-21173 > URL: https://issues.apache.org/jira/browse/SPARK-21173 > Project: Spark > Issue Type: Bug > Components: Documentation, Spark Core >Affects Versions: 2.1.1 >Reporter: liuzhaokun >Priority: Trivial > > There are several configurations about SSL displayed in configuration.md that are > never used and do not appear in Spark's code, so I think they should be removed from > configuration.md. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21161) SparkContext stopped when execute a query on Solr
[ https://issues.apache.org/jira/browse/SPARK-21161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-21161. --- Resolution: Not A Problem > SparkContext stopped when execute a query on Solr > - > > Key: SPARK-21161 > URL: https://issues.apache.org/jira/browse/SPARK-21161 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.1 > Environment: Hadoop2.7.3, Spark 2.1.1, spark-solr-3.0.1.jar, > solr-solrj-6.5.1.jar >Reporter: Jian Wu > > The SparkContext stopped due to DAGSchedulerEventProcessLoop failed when I > query Solr data in Spark. > {code:none} > 17/06/21 12:40:53 INFO ContextLauncher: 17/06/21 12:40:53 ERROR > scheduler.DAGSchedulerEventProcessLoop: DAGSchedulerEventProcessLoop failed; > shutting down SparkContext > 17/06/21 12:40:53 INFO ContextLauncher: java.lang.NumberFormatException: For > input string: “8983_solr” > 17/06/21 12:40:53 INFO ContextLauncher: at > java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) > 17/06/21 12:40:53 INFO ContextLauncher: at > java.lang.Integer.parseInt(Integer.java:580) > 17/06/21 12:40:53 INFO ContextLauncher: at > java.lang.Integer.parseInt(Integer.java:615) > 17/06/21 12:40:53 INFO ContextLauncher: at > scala.collection.immutable.StringLike$class.toInt(StringLike.scala:272) > 17/06/21 12:40:53 INFO ContextLauncher: at > scala.collection.immutable.StringOps.toInt(StringOps.scala:29) > 17/06/21 12:40:53 INFO ContextLauncher: at > org.apache.spark.util.Utils$.parseHostPort(Utils.scala:959) > 17/06/21 12:40:53 INFO ContextLauncher: at > org.apache.spark.scheduler.cluster.YarnScheduler.getRackForHost(YarnScheduler.scala:36) > 17/06/21 12:40:53 INFO ContextLauncher: at > org.apache.spark.scheduler.TaskSetManager$$anonfun$org$apache$spark$scheduler$TaskSetManager$$addPendingTask$1.apply(TaskSetManager.scala:200) > 17/06/21 12:40:53 INFO ContextLauncher: at > org.apache.spark.scheduler.TaskSetManager$$anonfun$org$apache$spark$scheduler$TaskSetManager$$addPendingTask$1.apply(TaskSetManager.scala:181) > 17/06/21 12:40:53 INFO ContextLauncher: at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > 17/06/21 12:40:53 INFO ContextLauncher: at > scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > 17/06/21 12:40:53 INFO ContextLauncher: at > org.apache.spark.scheduler.TaskSetManager.org$apache$spark$scheduler$TaskSetManager$$addPendingTask(TaskSetManager.scala:181) > 17/06/21 12:40:53 INFO ContextLauncher: at > org.apache.spark.scheduler.TaskSetManager$$anonfun$1.apply$mcVI$sp(TaskSetManager.scala:160) > 17/06/21 12:40:53 INFO ContextLauncher: at > scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160) > 17/06/21 12:40:53 INFO ContextLauncher: at > org.apache.spark.scheduler.TaskSetManager.(TaskSetManager.scala:159) > 17/06/21 12:40:53 INFO ContextLauncher: at > org.apache.spark.scheduler.TaskSchedulerImpl.createTaskSetManager(TaskSchedulerImpl.scala:212) > 17/06/21 12:40:53 INFO ContextLauncher: at > org.apache.spark.scheduler.TaskSchedulerImpl.submitTasks(TaskSchedulerImpl.scala:176) > 17/06/21 12:40:53 INFO ContextLauncher: at > org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1043) > 17/06/21 12:40:53 INFO ContextLauncher: at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:918) > 17/06/21 12:40:53 INFO ContextLauncher: at > 
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:921) > 17/06/21 12:40:53 INFO ContextLauncher: at > org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:920) > 17/06/21 12:40:53 INFO ContextLauncher: at > scala.collection.immutable.List.foreach(List.scala:381) > 17/06/21 12:40:53 INFO ContextLauncher: at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:920) > 17/06/21 12:40:53 INFO ContextLauncher: at > org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:862) > 17/06/21 12:40:53 INFO ContextLauncher: at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1613) > 17/06/21 12:40:53 INFO ContextLauncher: at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605) > 17/06/21 12:40:53 INFO ContextLauncher: at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594) > 17/06/21 12:40:53 INFO ContextLau
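For context on the exception itself: the preferred-location strings coming back from the Solr connector carry a Solr node name (e.g. {{host1.example.com:8983_solr}}, hostname hypothetical) rather than a plain host:port pair, so the port parsing in Utils.parseHostPort fails. A minimal sketch that approximates (not copies) that parsing and reproduces the error:
{code}
// Roughly what Utils.parseHostPort does with the location string (simplified sketch):
def parseHostPort(hostPort: String): (String, Int) = {
  val idx = hostPort.lastIndexOf(':')
  if (idx == -1) (hostPort, 0)
  else (hostPort.substring(0, idx), hostPort.substring(idx + 1).toInt)
}

parseHostPort("host1.example.com:8983_solr")
// java.lang.NumberFormatException: For input string: "8983_solr"
{code}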
[jira] [Commented] (SPARK-21171) Speculative task scheduling blocks the driver from handling normal tasks when a job has more than one hundred thousand tasks
[ https://issues.apache.org/jira/browse/SPARK-21171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16059018#comment-16059018 ] Sean Owen commented on SPARK-21171: --- There's no real detail here. I'd have to close this. This should start as a discussion on a mailing list, or at least include specific benchmarks and specific proposed changes. > Speculative task scheduling blocks the driver from handling normal tasks when a job > has more than one hundred thousand tasks > -- > > Key: SPARK-21171 > URL: https://issues.apache.org/jira/browse/SPARK-21171 > Project: Spark > Issue Type: Improvement > Components: Block Manager >Affects Versions: 2.0.0 > Environment: We have more than two hundred high-performance machines > to handle more than 2T of data in one query >Reporter: wangminfeng > > If a job has more than one hundred thousand tasks and spark.speculation is > true, then when speculatable tasks start, choosing a speculatable task wastes lots of > time and blocks other tasks. We run ad-hoc queries for data analysis, so we can't > tolerate one job wasting time even if it is a large job. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
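For readers trying to scope the problem: speculation is governed by a handful of existing configs, and tuning them is one way to reduce how often the scheduler scans a very large stage for speculatable tasks. A sketch of the relevant settings (values are illustrative, not recommendations):
{code}
import org.apache.spark.SparkConf

// All of these configs exist today; raising the check interval and quantile reduces how
// often and how aggressively speculatable tasks are searched for, though it does not
// remove the per-check work that grows with the number of tasks in a stage.
val conf = new SparkConf()
  .set("spark.speculation", "true")
  .set("spark.speculation.interval", "1s")     // default 100ms: how often to check
  .set("spark.speculation.quantile", "0.90")   // default 0.75: fraction of tasks that must finish first
  .set("spark.speculation.multiplier", "1.5")  // default 1.5: how much slower than the median counts as slow
{code}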
[jira] [Commented] (SPARK-21167) Path is not decoded correctly when reading output of FileSink
[ https://issues.apache.org/jira/browse/SPARK-21167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16059019#comment-16059019 ] Apache Spark commented on SPARK-21167: -- User 'dijingran' has created a pull request for this issue: https://github.com/apache/spark/pull/18383 > Path is not decoded correctly when reading output of FileSink > - > > Key: SPARK-21167 > URL: https://issues.apache.org/jira/browse/SPARK-21167 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.1 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Fix For: 2.1.2, 2.2.1, 2.3.0 > > > When reading output of FileSink, path is not decoded correctly. So if the > path has some special characters, such as spaces, Spark cannot read it. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
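For illustration of the decoding problem (the path below is hypothetical): the file sink's metadata log stores paths in URI form, so a space ends up percent-encoded, and reading the string back verbatim points at a directory that does not exist until the URI is decoded:
{code}
import java.net.URI

// Hypothetical entry as it might appear in the sink's metadata log:
val logged = "file:///tmp/output%20dir/part-00000.parquet"

// Using the string directly looks for a literal "output%20dir"; decoding via java.net.URI
// recovers the real on-disk path containing the space.
new URI(logged).getPath   // "/tmp/output dir/part-00000.parquet"
{code}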
[jira] [Commented] (SPARK-21080) Workaround for HDFS delegation token expiry broken with some Hadoop versions
[ https://issues.apache.org/jira/browse/SPARK-21080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16059022#comment-16059022 ] Lukasz Raszka commented on SPARK-21080: --- Update: I think it might be a mistake on our side, and it was not a Spark-internal HDFS access attempt that caused this, but one of ours. Sorry for all the confusion. Still, I agree it would be great to have the mentioned PR rebased. Meanwhile I guess you can close this one. > Workaround for HDFS delegation token expiry broken with some Hadoop versions > > > Key: SPARK-21080 > URL: https://issues.apache.org/jira/browse/SPARK-21080 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.1.0 > Environment: Spark 2.1.0 on Yarn, Hadoop 2.7.3 >Reporter: Lukasz Raszka >Priority: Minor > > We're getting hit by SPARK-11182, where the core issue in HDFS has been > fixed in more recent versions. It seems that the [workaround introduced by user > SaintBacchus|https://github.com/apache/spark/commit/646366b5d2f12e42f8e7287672ba29a8c918a17d] > doesn't work in newer versions of Hadoop. This seems to be caused by a move of > the property name from {{fs.hdfs.impl}} to {{fs.AbstractFileSystem.hdfs.impl}}, > which happened somewhere around 2.7.0 or earlier. Taking this into account > should make the workaround work again for less recent Hadoop versions. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21080) Workaround for HDFS delegation token expiry broken with some Hadoop versions
[ https://issues.apache.org/jira/browse/SPARK-21080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-21080. --- Resolution: Not A Problem > Workaround for HDFS delegation token expiry broken with some Hadoop versions > > > Key: SPARK-21080 > URL: https://issues.apache.org/jira/browse/SPARK-21080 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.1.0 > Environment: Spark 2.1.0 on Yarn, Hadoop 2.7.3 >Reporter: Lukasz Raszka >Priority: Minor > > We're getting struck by SPARK-11182, where the core issue in HDFS has been > fixed in more recent versions. It seems that [workaround introduced by user > SaintBacchus|https://github.com/apache/spark/commit/646366b5d2f12e42f8e7287672ba29a8c918a17d] > doesn't work in newer version of Hadoop. This seems to be cause by a move of > property name from {{fs.hdfs.impl}} to {{fs.AbstractFileSystem.hdfs.impl}} > which happened somewhere around 2.7.0 or earlier. Taking this into account > should make workaround work again for less recent Hadoop versions. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
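A sketch of what a version-agnostic lookup would have to do, given the property rename quoted in the description above (illustrative only, not the actual patch):
{code}
import org.apache.hadoop.conf.Configuration

val hadoopConf = new Configuration()

// Newer Hadoop keys the HDFS implementation under fs.AbstractFileSystem.hdfs.impl,
// older releases under fs.hdfs.impl; consulting both keeps the workaround effective
// across versions.
val hdfsImplClass =
  Option(hadoopConf.get("fs.AbstractFileSystem.hdfs.impl"))
    .orElse(Option(hadoopConf.get("fs.hdfs.impl")))
{code}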
[jira] [Commented] (SPARK-11373) Add metrics to the History Server and providers
[ https://issues.apache.org/jira/browse/SPARK-11373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16059141#comment-16059141 ] Steve Loughran commented on SPARK-11373: metrics might help with understanding the s3 load issues in SPARK-19111 > Add metrics to the History Server and providers > --- > > Key: SPARK-11373 > URL: https://issues.apache.org/jira/browse/SPARK-11373 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Steve Loughran > > The History server doesn't publish metrics about JVM load or anything from > the history provider plugins. This means that performance problems from > massive job histories aren't visible to management tools, and nor are any > provider-generated metrics such as time to load histories, failed history > loads, the number of connectivity failures talking to remote services, etc. > If the history server set up a metrics registry and offered the option to > publish its metrics, then management tools could view this data. > # the metrics registry would need to be passed down to the instantiated > {{ApplicationHistoryProvider}}, in order for it to register its metrics. > # if the codahale metrics servlet were registered under a path such as > {{/metrics}}, the values would be visible as HTML and JSON, without the need > for management tools. > # Integration tests could also retrieve the JSON-formatted data and use it as > part of the test suites. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
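To make the proposal concrete, here is a rough sketch of the kind of provider instrumentation being described, using the Codahale (Dropwizard) API that Spark's metrics system already builds on; the metric names and the load method are hypothetical:
{code}
import com.codahale.metrics.MetricRegistry

val registry = new MetricRegistry()
val loadTimer    = registry.timer("history.provider.appui.load")            // time to load a history
val loadFailures = registry.counter("history.provider.appui.load.failures") // failed history loads

// Hypothetical provider method wrapped with the metrics above.
def loadAppUI(appId: String): Unit = {
  val ctx = loadTimer.time()
  try {
    // ... replay the event log and build the application UI ...
  } catch {
    case e: Exception =>
      loadFailures.inc()
      throw e
  } finally {
    ctx.stop()
  }
}
{code}
With the registry in place, exposing it through the Codahale metrics servlet at a path such as {{/metrics}} would give the HTML/JSON view the ticket mentions.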
[jira] [Created] (SPARK-21177) Append to hive slows down linearly, with number of appends.
Prashant Sharma created SPARK-21177: --- Summary: Append to hive slows down linearly, with number of appends. Key: SPARK-21177 URL: https://issues.apache.org/jira/browse/SPARK-21177 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0 Reporter: Prashant Sharma In short, please use the following shell transcript for the reproducer. {code:java} Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 2.3.0-SNAPSHOT /_/ Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_91) Type in expressions to have them evaluated. Type :help for more information. scala> def printTimeTaken(str: String, f: () => Unit) { val start = System.nanoTime() f() val end = System.nanoTime() val timetaken = end - start import scala.concurrent.duration._ println(s"Time taken for $str is ${timetaken.nanos.toMillis}\n") } | | | | | | | printTimeTaken: (str: String, f: () => Unit)Unit scala> for(i <- 1 to 1) {printTimeTaken("time to append to hive:", () => { Seq(1, 2).toDF().write.mode("append").saveAsTable("t1"); })} Time taken for time to append to hive: is 284 Time taken for time to append to hive: is 211 ... ... Time taken for time to append to hive: is 2615 Time taken for time to append to hive: is 3055 Time taken for time to append to hive: is 22425 {code} Why does it matter ? In a streaming job it is not possible to append to hive using this dataframe operation. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21177) df.SaveAsTable slows down linearly, with number of appends.
[ https://issues.apache.org/jira/browse/SPARK-21177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Sharma updated SPARK-21177: Summary: df.SaveAsTable slows down linearly, with number of appends. (was: Append to hive slows down linearly, with number of appends.) > df.SaveAsTable slows down linearly, with number of appends. > --- > > Key: SPARK-21177 > URL: https://issues.apache.org/jira/browse/SPARK-21177 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Prashant Sharma > > In short, please use the following shell transcript for the reproducer. > {code:java} > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 2.3.0-SNAPSHOT > /_/ > > Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_91) > Type in expressions to have them evaluated. > Type :help for more information. > scala> def printTimeTaken(str: String, f: () => Unit) { > val start = System.nanoTime() > f() > val end = System.nanoTime() > val timetaken = end - start > import scala.concurrent.duration._ > println(s"Time taken for $str is ${timetaken.nanos.toMillis}\n") > } > | | | | | | | printTimeTaken: (str: > String, f: () => Unit)Unit > scala> > for(i <- 1 to 1) {printTimeTaken("time to append to hive:", () => { > Seq(1, 2).toDF().write.mode("append").saveAsTable("t1"); })} > Time taken for time to append to hive: is 284 > Time taken for time to append to hive: is 211 > ... > ... > Time taken for time to append to hive: is 2615 > Time taken for time to append to hive: is 3055 > Time taken for time to append to hive: is 22425 > > {code} > Why does it matter ? > In a streaming job it is not possible to append to hive using this dataframe > operation. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21177) df.saveAsTable slows down linearly, with number of appends
[ https://issues.apache.org/jira/browse/SPARK-21177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Sharma updated SPARK-21177: Summary: df.saveAsTable slows down linearly, with number of appends (was: df.SaveAsTable slows down linearly, with number of appends.) > df.saveAsTable slows down linearly, with number of appends > -- > > Key: SPARK-21177 > URL: https://issues.apache.org/jira/browse/SPARK-21177 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Prashant Sharma > > In short, please use the following shell transcript for the reproducer. > {code:java} > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 2.3.0-SNAPSHOT > /_/ > > Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_91) > Type in expressions to have them evaluated. > Type :help for more information. > scala> def printTimeTaken(str: String, f: () => Unit) { > val start = System.nanoTime() > f() > val end = System.nanoTime() > val timetaken = end - start > import scala.concurrent.duration._ > println(s"Time taken for $str is ${timetaken.nanos.toMillis}\n") > } > | | | | | | | printTimeTaken: (str: > String, f: () => Unit)Unit > scala> > for(i <- 1 to 1) {printTimeTaken("time to append to hive:", () => { > Seq(1, 2).toDF().write.mode("append").saveAsTable("t1"); })} > Time taken for time to append to hive: is 284 > Time taken for time to append to hive: is 211 > ... > ... > Time taken for time to append to hive: is 2615 > Time taken for time to append to hive: is 3055 > Time taken for time to append to hive: is 22425 > > {code} > Why does it matter ? > In a streaming job it is not possible to append to hive using this dataframe > operation. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21178) Add support for label specific metrics in MulticlassClassificationEvaluator
Aman Rawat created SPARK-21178: -- Summary: Add support for label specific metrics in MulticlassClassificationEvaluator Key: SPARK-21178 URL: https://issues.apache.org/jira/browse/SPARK-21178 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 2.1.1 Reporter: Aman Rawat MulticlassClassificationEvaluator is restricted to the global metrics - f1, weightedPrecision, weightedRecall, accuracy. However, we have a requirement where we want to optimize the learning on a metric for a specific label - for instance, the true-positive rate of label 'B'. For example: take a fraud detection use-case with labels 'good' and 'fraud' being passed to a manual verification team. We want to maximize the true-positive rate of the 'fraud' label, so that whenever the model predicts a data point as 'good', it has a strong likelihood of being 'good', and the manual team can ignore it. It's ok to predict some 'good' data points as 'fraud', as those will be taken care of by the manual verification team. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
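For reference, the RDD-based {{MulticlassMetrics}} class already exposes per-label values, which is roughly the shape being requested here for the DataFrame-based evaluator. A small spark-shell sketch (the prediction/label pairs are made up; 0.0 = 'good', 1.0 = 'fraud'):
{code}
import org.apache.spark.mllib.evaluation.MulticlassMetrics

// (prediction, label) pairs -- toy values for illustration only.
val predictionAndLabels = sc.parallelize(Seq(
  (0.0, 0.0), (1.0, 1.0), (0.0, 1.0), (1.0, 1.0), (0.0, 0.0)
))

val metrics = new MulticlassMetrics(predictionAndLabels)
metrics.truePositiveRate(1.0)   // per-label true-positive rate, here for 'fraud'
metrics.precision(1.0)          // per-label precision
metrics.recall(1.0)             // per-label recall
{code}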
[jira] [Updated] (SPARK-21178) Add support for label specific metrics in MulticlassClassificationEvaluator
[ https://issues.apache.org/jira/browse/SPARK-21178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aman Rawat updated SPARK-21178: --- Target Version/s: (was: 2.1.2) > Add support for label specific metrics in MulticlassClassificationEvaluator > --- > > Key: SPARK-21178 > URL: https://issues.apache.org/jira/browse/SPARK-21178 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.1.1 >Reporter: Aman Rawat > > MulticlassClassificationEvaluator is restricted to the global metrics - f1, > weightedPrecision, weightedRecall, accuracy > However, we have a requirement where we would want to optimize the learning > on metric for a specific label - for instance, true positive rate (label 'B') > For example : Take a fraud detection use-case with labels 'good' and 'fraud' > being passed to a manual verification team. We want to maximize the > true-positive rate of ('fraud') label, so that whenever the model predicts a > data point as 'good', it has a strong likelihood of it being 'good', and the > manual team can ignore it. > While it's ok to predict some 'good' data points as 'fraud', as it will be > taken care by the manual verification team. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21178) Add support for label specific metrics in MulticlassClassificationEvaluator
[ https://issues.apache.org/jira/browse/SPARK-21178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21178: Assignee: Apache Spark > Add support for label specific metrics in MulticlassClassificationEvaluator > --- > > Key: SPARK-21178 > URL: https://issues.apache.org/jira/browse/SPARK-21178 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.1.1 >Reporter: Aman Rawat >Assignee: Apache Spark > > MulticlassClassificationEvaluator is restricted to the global metrics - f1, > weightedPrecision, weightedRecall, accuracy > However, we have a requirement where we would want to optimize the learning > on metric for a specific label - for instance, true positive rate (label 'B') > For example : Take a fraud detection use-case with labels 'good' and 'fraud' > being passed to a manual verification team. We want to maximize the > true-positive rate of ('fraud') label, so that whenever the model predicts a > data point as 'good', it has a strong likelihood of it being 'good', and the > manual team can ignore it. > While it's ok to predict some 'good' data points as 'fraud', as it will be > taken care by the manual verification team. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21178) Add support for label specific metrics in MulticlassClassificationEvaluator
[ https://issues.apache.org/jira/browse/SPARK-21178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21178: Assignee: (was: Apache Spark) > Add support for label specific metrics in MulticlassClassificationEvaluator > --- > > Key: SPARK-21178 > URL: https://issues.apache.org/jira/browse/SPARK-21178 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.1.1 >Reporter: Aman Rawat > > MulticlassClassificationEvaluator is restricted to the global metrics - f1, > weightedPrecision, weightedRecall, accuracy > However, we have a requirement where we would want to optimize the learning > on metric for a specific label - for instance, true positive rate (label 'B') > For example : Take a fraud detection use-case with labels 'good' and 'fraud' > being passed to a manual verification team. We want to maximize the > true-positive rate of ('fraud') label, so that whenever the model predicts a > data point as 'good', it has a strong likelihood of it being 'good', and the > manual team can ignore it. > While it's ok to predict some 'good' data points as 'fraud', as it will be > taken care by the manual verification team. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21178) Add support for label specific metrics in MulticlassClassificationEvaluator
[ https://issues.apache.org/jira/browse/SPARK-21178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16059251#comment-16059251 ] Apache Spark commented on SPARK-21178: -- User 'rawataaryan9' has created a pull request for this issue: https://github.com/apache/spark/pull/18390 > Add support for label specific metrics in MulticlassClassificationEvaluator > --- > > Key: SPARK-21178 > URL: https://issues.apache.org/jira/browse/SPARK-21178 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.1.1 >Reporter: Aman Rawat > > MulticlassClassificationEvaluator is restricted to the global metrics - f1, > weightedPrecision, weightedRecall, accuracy > However, we have a requirement where we would want to optimize the learning > on metric for a specific label - for instance, true positive rate (label 'B') > For example : Take a fraud detection use-case with labels 'good' and 'fraud' > being passed to a manual verification team. We want to maximize the > true-positive rate of ('fraud') label, so that whenever the model predicts a > data point as 'good', it has a strong likelihood of it being 'good', and the > manual team can ignore it. > While it's ok to predict some 'good' data points as 'fraud', as it will be > taken care by the manual verification team. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20295) when spark.sql.adaptive.enabled is enabled, it conflicts with Exchange Reuse
[ https://issues.apache.org/jira/browse/SPARK-20295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16059273#comment-16059273 ] cen yuhai commented on SPARK-20295: --- I hit this bug... > when spark.sql.adaptive.enabled is enabled, it conflicts with Exchange Reuse > -- > > Key: SPARK-20295 > URL: https://issues.apache.org/jira/browse/SPARK-20295 > Project: Spark > Issue Type: Bug > Components: Shuffle, SQL >Affects Versions: 2.1.0 >Reporter: Ruhui Wang > > When running tpcds-q95 with spark.sql.adaptive.enabled = true, the physical > plan is initially: > Sort > : +- Exchange(coordinator id: 1) > : +- Project*** > ::-Sort ** > :: +- Exchange(coordinator id: 2) > :: :- Project *** > :+- Sort > :: +- Exchange(coordinator id: 3) > With spark.sql.exchange.reuse enabled, the physical plan then becomes: > Sort > : +- Exchange(coordinator id: 1) > : +- Project*** > ::-Sort ** > :: +- Exchange(coordinator id: 2) > :: :- Project *** > :+- Sort > :: +- ReusedExchange Exchange(coordinator id: 2) > With spark.sql.adaptive.enabled = true, the call stack is: > ShuffleExchange#doExecute --> postShuffleRDD function --> > doEstimationIfNecessary. In this function, > assert(exchanges.length == numExchanges) fails, as the left side has only > one element while the right side equals 2. > Is this a bug in the combination of spark.sql.adaptive.enabled and exchange reuse? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
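For anyone trying to reproduce, the combination described above amounts to enabling both features (config names as quoted in the report; the second is on by default) and running a query whose plan contains two identical shuffle exchanges, as TPC-DS q95 does:
{code}
// Sketch of the configuration combination described in the report.
spark.conf.set("spark.sql.adaptive.enabled", "true")   // adaptive shuffle-partition estimation
spark.conf.set("spark.sql.exchange.reuse", "true")     // default: identical exchanges are deduplicated
// With both on, the second identical exchange is planned as a ReusedExchange, and the
// coordinator's assert(exchanges.length == numExchanges) fails as described above.
{code}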
[jira] [Resolved] (SPARK-20832) Standalone master should explicitly inform drivers of worker deaths and invalidate external shuffle service outputs
[ https://issues.apache.org/jira/browse/SPARK-20832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-20832. - Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 18362 [https://github.com/apache/spark/pull/18362] > Standalone master should explicitly inform drivers of worker deaths and > invalidate external shuffle service outputs > --- > > Key: SPARK-20832 > URL: https://issues.apache.org/jira/browse/SPARK-20832 > Project: Spark > Issue Type: Bug > Components: Deploy, Scheduler >Affects Versions: 2.0.0 >Reporter: Josh Rosen > Fix For: 2.3.0 > > > In SPARK-17370 (a patch authored by [~ekhliang] and reviewed by me), we added > logic to the DAGScheduler to mark external shuffle service instances as > unavailable upon task failure when the task failure reason was "SlaveLost" > and this was known to be caused by worker death. If the Spark Master > discovered that a worker was dead then it would notify any drivers with > executors on those workers to mark those executors as dead. The linked patch > simply piggybacked on this logic to have the executor death notification also > imply worker death and to have worker-death-caused-executor-death imply > shuffle file loss. > However, there are modes of external shuffle service loss which this > mechanism does not detect, leaving the system prone race conditions. Consider > the following: > * Spark standalone is configured to run an external shuffle service embedded > in the Worker. > * Application has shuffle outputs and executors on Worker A. > * Stage depending on outputs of tasks that ran on Worker A starts. > * All executors on worker A are removed due to dying with exceptions, > scaling-down via the dynamic allocation APIs, but _not_ due to worker death. > Worker A is still healthy at this point. > * At this point the MapOutputTracker still records map output locations on > Worker A's shuffle service. This is expected behavior. > * Worker A dies at an instant where the application has no executors running > on it. > * The Master knows that Worker A died but does not inform the driver (which > had no executors on that worker at the time of its death). > * Some task from the running stage attempts to fetch map outputs from Worker > A but these requests time out because Worker A's shuffle service isn't > available. > * Due to other logic in the scheduler, these preventable FetchFailures don't > wind up invaliding the now-invalid unavailable map output locations (this is > a distinct bug / behavior which I'll discuss in a separate JIRA ticket). > * This behavior leads to several unsuccessful stage reattempts and ultimately > to a job failure. > A simple way to address this would be to have the Master explicitly notify > drivers of all Worker deaths, even if those drivers don't currently have > executors. The Spark Standalone scheduler backend can receive the explicit > WorkerLost message and can bubble up the right calls to the task scheduler > and DAGScheduler to invalidate map output locations from the now-dead > external shuffle service. > This relates to SPARK-20115 in the sense that both tickets aim to address > issues where the external shuffle service is unavailable. The key difference > is the mechanism for detection: SPARK-20115 marks the external shuffle > service as unavailable whenever any fetch failure occurs from it, whereas the > proposal here relies on more explicit signals. This JIRA ticket's proposal is > scoped only to Spark Standalone mode. 
As a compromise, we might be able to > consider "all of a single shuffle's outputs lost on a single external shuffle > service" following a fetch failure (to be discussed in separate JIRA). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20832) Standalone master should explicitly inform drivers of worker deaths and invalidate external shuffle service outputs
[ https://issues.apache.org/jira/browse/SPARK-20832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-20832: --- Assignee: Jiang Xingbo > Standalone master should explicitly inform drivers of worker deaths and > invalidate external shuffle service outputs > --- > > Key: SPARK-20832 > URL: https://issues.apache.org/jira/browse/SPARK-20832 > Project: Spark > Issue Type: Bug > Components: Deploy, Scheduler >Affects Versions: 2.0.0 >Reporter: Josh Rosen >Assignee: Jiang Xingbo > Fix For: 2.3.0 > > > In SPARK-17370 (a patch authored by [~ekhliang] and reviewed by me), we added > logic to the DAGScheduler to mark external shuffle service instances as > unavailable upon task failure when the task failure reason was "SlaveLost" > and this was known to be caused by worker death. If the Spark Master > discovered that a worker was dead then it would notify any drivers with > executors on those workers to mark those executors as dead. The linked patch > simply piggybacked on this logic to have the executor death notification also > imply worker death and to have worker-death-caused-executor-death imply > shuffle file loss. > However, there are modes of external shuffle service loss which this > mechanism does not detect, leaving the system prone race conditions. Consider > the following: > * Spark standalone is configured to run an external shuffle service embedded > in the Worker. > * Application has shuffle outputs and executors on Worker A. > * Stage depending on outputs of tasks that ran on Worker A starts. > * All executors on worker A are removed due to dying with exceptions, > scaling-down via the dynamic allocation APIs, but _not_ due to worker death. > Worker A is still healthy at this point. > * At this point the MapOutputTracker still records map output locations on > Worker A's shuffle service. This is expected behavior. > * Worker A dies at an instant where the application has no executors running > on it. > * The Master knows that Worker A died but does not inform the driver (which > had no executors on that worker at the time of its death). > * Some task from the running stage attempts to fetch map outputs from Worker > A but these requests time out because Worker A's shuffle service isn't > available. > * Due to other logic in the scheduler, these preventable FetchFailures don't > wind up invaliding the now-invalid unavailable map output locations (this is > a distinct bug / behavior which I'll discuss in a separate JIRA ticket). > * This behavior leads to several unsuccessful stage reattempts and ultimately > to a job failure. > A simple way to address this would be to have the Master explicitly notify > drivers of all Worker deaths, even if those drivers don't currently have > executors. The Spark Standalone scheduler backend can receive the explicit > WorkerLost message and can bubble up the right calls to the task scheduler > and DAGScheduler to invalidate map output locations from the now-dead > external shuffle service. > This relates to SPARK-20115 in the sense that both tickets aim to address > issues where the external shuffle service is unavailable. The key difference > is the mechanism for detection: SPARK-20115 marks the external shuffle > service as unavailable whenever any fetch failure occurs from it, whereas the > proposal here relies on more explicit signals. This JIRA ticket's proposal is > scoped only to Spark Standalone mode. 
As a compromise, we might be able to > consider "all of a single shuffle's outputs lost on a single external shuffle > service" following a fetch failure (to be discussed in separate JIRA). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21179) Unable to return Hive INT data type into Spark SQL via Hive JDBC driver: Caused by: java.sql.SQLDataException: [Simba][JDBC](10140) Error converting value to int.
Matthew Walton created SPARK-21179: -- Summary: Unable to return Hive INT data type into Spark SQL via Hive JDBC driver: Caused by: java.sql.SQLDataException: [Simba][JDBC](10140) Error converting value to int. Key: SPARK-21179 URL: https://issues.apache.org/jira/browse/SPARK-21179 Project: Spark Issue Type: Bug Components: Spark Shell, SQL Affects Versions: 2.0.0, 1.6.0 Environment: OS: Linux HDP version 2.5.0.1-60 Hive version: 1.2.1 Spark version 2.0.0.2.5.0.1-60 JDBC: Download the latest Hortonworks JDBC driver Reporter: Matthew Walton I'm trying to fetch back data in Spark SQL using a JDBC connection to Hive. Unfortunately, when I try to query data that resides in an INT column I get the following error: 17/06/22 12:14:37 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) java.sql.SQLDataException: [Simba][JDBC](10140) Error converting value to int. Steps to reproduce: 1) On Hive create a simple table with an INT column and insert some data (I used SQuirreL Client with the Hortonworks JDBC driver): create table wh2.hivespark (country_id int, country_name string) insert into wh2.hivespark values (1, 'USA') 2) Copy the Hortonworks Hive JDBC driver to the machine where you will run Spark Shell 3) Start Spark shell loading the Hortonworks Hive JDBC driver jar files ./spark-shell --jars /home/spark/jdbc/hortonworkshive/HiveJDBC41.jar,/home/spark/jdbc/hortonworkshive/TCLIServiceClient.jar,/home/spark/jdbc/hortonworkshive/commons-codec-1.3.jar,/home/spark/jdbc/hortonworkshive/commons-logging-1.1.1.jar,/home/spark/jdbc/hortonworkshive/hive_metastore.jar,/home/spark/jdbc/hortonworkshive/hive_service.jar,/home/spark/jdbc/hortonworkshive/httpclient-4.1.3.jar,/home/spark/jdbc/hortonworkshive/httpcore-4.1.3.jar,/home/spark/jdbc/hortonworkshive/libfb303-0.9.0.jar,/home/spark/jdbc/hortonworkshive/libthrift-0.9.0.jar,/home/spark/jdbc/hortonworkshive/log4j-1.2.14.jar,/home/spark/jdbc/hortonworkshive/ql.jar,/home/spark/jdbc/hortonworkshive/slf4j-api-1.5.11.jar,/home/spark/jdbc/hortonworkshive/slf4j-log4j12-1.5.11.jar,/home/spark/jdbc/hortonworkshive/zookeeper-3.4.6.jar 4) In Spark shell load the data from Hive using the JDBC driver val hivespark = spark.read.format("jdbc").options(Map("url" -> "jdbc:hive2://localhost:1/wh2;AuthMech=3;UseNativeQuery=1;user=hfs;password=hdfs","dbtable" -> "wh2.hivespark")).option("driver","com.simba.hive.jdbc41.HS2Driver").option("user","hdfs").option("password","hdfs").load() 5) In Spark shell try to display the data hivespark.show() At this point you should see the error: scala> hivespark.show() 17/06/22 12:14:37 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) java.sql.SQLDataException: [Simba][JDBC](10140) Error converting value to int. 
at com.simba.hiveserver2.exceptions.ExceptionConverter.toSQLException(Unknown Source) at com.simba.hiveserver2.utilities.conversion.TypeConverter.toInt(Unknown Source) at com.simba.hiveserver2.jdbc.common.SForwardResultSet.getInt(Unknown Source) at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.getNext(JDBCRDD.scala:437) at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.hasNext(JDBCRDD.scala:535) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) at org.apache.spark.scheduler.Task.run(Task.scala:85) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Note: I also tested this issue using a JDBC driver from Progress DataDirect and I see a similar error message so this does not seem to be driver specific. scala> hivespark.show() 17/06/22 12:0
[jira] [Commented] (SPARK-21166) Automated ML persistence
[ https://issues.apache.org/jira/browse/SPARK-21166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16059481#comment-16059481 ] Mark Hamilton commented on SPARK-21166: --- The code is currently being developed here: https://github.com/Azure/mmlspark/pull/25 > Automated ML persistence > > > Key: SPARK-21166 > URL: https://issues.apache.org/jira/browse/SPARK-21166 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.2.0 >Reporter: Joseph K. Bradley > > This JIRA is for discussing the possibility of automating ML persistence. > Currently, custom save/load methods are written for every Model. However, we > could design a mixin which provides automated persistence, inspecting model > data and Params and reading/writing (known types) automatically. This was > brought up in discussions with developers behind > https://github.com/azure/mmlspark > Some issues we will need to consider: > * Providing generic mixin usable in most or all cases > * Handling corner cases (strange Param types, etc.) > * Backwards compatibility (loading models saved by old Spark versions) > Because of backwards compatibility in particular, it may make sense to > implement testing for that first, before we try to address automated > persistence: [SPARK-15573] -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21180) Remove conf from stats functions since now we have conf in LogicalPlan
Zhenhua Wang created SPARK-21180: Summary: Remove conf from stats functions since now we have conf in LogicalPlan Key: SPARK-21180 URL: https://issues.apache.org/jira/browse/SPARK-21180 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.3.0 Reporter: Zhenhua Wang -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21176) Master UI hangs with spark.ui.reverseProxy=true if the master node has many CPUs
[ https://issues.apache.org/jira/browse/SPARK-21176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ingo Schuster updated SPARK-21176: -- Description: In reverse proxy mode, Sparks exhausts the Jetty thread pool if the master node has too many cpus or the cluster has too many executers: For each connector, Jetty creates Selector threads: minimum 4, maximum half the number of available CPUs: {{this(Math.max(1, Runtime.getRuntime().availableProcessors() / 2));}} (see https://github.com/eclipse/jetty.project/blob/0c8273f2ca1f9bf2064cd9c4c939d2546443f759/jetty-client/src/main/java/org/eclipse/jetty/client/http/HttpClientTransportOverHTTP.java) In reverse proxy mode, a connector is set up for each executor and one for the master UI. I have a system with 88 CPUs on the master node and 7 executors. Jetty tries to instantiate 8*44 = 352 selector threads, but since the QueuedThreadPool is initialized with 200 threads by default, the UI gets stuck. I have patched JettyUtils.scala to extend the thread pool ( {{val pool = new QueuedThreadPool*(400)* }}). With this hack, the UI works. Obviously, the Jetty defaults are meant for a real web server. If that has 88 CPUs, you do certainly expect a lot of traffic. For the Spark admin UI however, there will rarely be concurrent accesses for the same application or the same executor. I therefore propose to dramatically reduce the number of selector threads that get instantiated - at least by default. I will propose a fix in a pull request. was: In reverse proxy mode, Sparks exhausts the Jetty thread pool if the master node has too many cpus or the cluster has too many executers: For each connector, Jetty creates Selector threads: minimum 4, maximum half the number of available CPUs: {{ this(Math.max(1, Runtime.getRuntime().availableProcessors() / 2));}} (see https://github.com/eclipse/jetty.project/blob/0c8273f2ca1f9bf2064cd9c4c939d2546443f759/jetty-client/src/main/java/org/eclipse/jetty/client/http/HttpClientTransportOverHTTP.java) In reverse proxy mode, a connector is set up for each executor and one for the master UI. I have a system with 88 CPUs on the master node and 7 executors. Jetty tries to instantiate 8*44 = 352 selector threads, but since the QueuedThreadPool is initialized with 200 threads by default, the UI gets stuck. I have patched JettyUtils.scala to extend the thread pool ( {{val pool = new QueuedThreadPool*(400)* }}). With this hack, the UI works. Obviously, the Jetty defaults are meant for a real web server. If that has 88 CPUs, you do certainly expect a lot of traffic. For the Spark admin UI however, there will rarely be concurrent accesses for the same application or the same executor. I therefore propose to dramatically reduce the number of selector threads that get instantiated - at least by default. I will propose a fix in a pull request. 
> Master UI hangs with spark.ui.reverseProxy=true if the master node has many > CPUs > > > Key: SPARK-21176 > URL: https://issues.apache.org/jira/browse/SPARK-21176 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.1.0, 2.1.1, 2.2.0, 2.2.1 > Environment: ppc64le GNU/Linux, POWER8, only master node is reachable > externally other nodes are in an internal network >Reporter: Ingo Schuster > Labels: network, web-ui > > In reverse proxy mode, Sparks exhausts the Jetty thread pool if the master > node has too many cpus or the cluster has too many executers: > For each connector, Jetty creates Selector threads: minimum 4, maximum half > the number of available CPUs: > {{this(Math.max(1, Runtime.getRuntime().availableProcessors() / 2));}} > (see > https://github.com/eclipse/jetty.project/blob/0c8273f2ca1f9bf2064cd9c4c939d2546443f759/jetty-client/src/main/java/org/eclipse/jetty/client/http/HttpClientTransportOverHTTP.java) > In reverse proxy mode, a connector is set up for each executor and one for > the master UI. > I have a system with 88 CPUs on the master node and 7 executors. Jetty tries > to instantiate 8*44 = 352 selector threads, but since the QueuedThreadPool is > initialized with 200 threads by default, the UI gets stuck. > I have patched JettyUtils.scala to extend the thread pool ( {{val pool = new > QueuedThreadPool*(400)* }}). With this hack, the UI works. > Obviously, the Jetty defaults are meant for a real web server. If that has 88 > CPUs, you do certainly expect a lot of traffic. > For the Spark admin UI however, there will rarely be concurrent accesses for > the same application or the same executor. > I therefore propose to dramatically reduce the number of selector threads > that get instantiated - at least by default. > I will propose a fix in a pull request. -- This message was sent by Atl
[jira] [Updated] (SPARK-21176) Master UI hangs with spark.ui.reverseProxy=true if the master node has many CPUs
[ https://issues.apache.org/jira/browse/SPARK-21176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ingo Schuster updated SPARK-21176: -- Description: In reverse proxy mode, Sparks exhausts the Jetty thread pool if the master node has too many cpus or the cluster has too many executers: For each connector, Jetty creates Selector threads: minimum 4, maximum half the number of available CPUs: {{ this(Math.max(1, Runtime.getRuntime().availableProcessors() / 2));}} (see https://github.com/eclipse/jetty.project/blob/0c8273f2ca1f9bf2064cd9c4c939d2546443f759/jetty-client/src/main/java/org/eclipse/jetty/client/http/HttpClientTransportOverHTTP.java) In reverse proxy mode, a connector is set up for each executor and one for the master UI. I have a system with 88 CPUs on the master node and 7 executors. Jetty tries to instantiate 8*44 = 352 selector threads, but since the QueuedThreadPool is initialized with 200 threads by default, the UI gets stuck. I have patched JettyUtils.scala to extend the thread pool ( {{val pool = new QueuedThreadPool*(400)* }}). With this hack, the UI works. Obviously, the Jetty defaults are meant for a real web server. If that has 88 CPUs, you do certainly expect a lot of traffic. For the Spark admin UI however, there will rarely be concurrent accesses for the same application or the same executor. I therefore propose to dramatically reduce the number of selector threads that get instantiated - at least by default. I will propose a fix in a pull request. was: In reverse proxy mode, Sparks exhausts the Jetty thread pool if the master node has too many cpus or the cluster has too many executers: For each connector, Jetty creates Selector threads: minimum 4, maximum half the number of available CPUs: {{selectors>0?selectors:Math.max(1,Math.min(4,Runtime.getRuntime().availableProcessors()/2)));}} (see https://github.com/eclipse/jetty.project/blob/jetty-9.3.x/jetty-server/src/main/java/org/eclipse/jetty/server/ServerConnector.java) In reverse proxy mode, a connector is set up for each executor and one for the master UI. I have a system with 88 CPUs on the master node and 7 executors. Jetty tries to instantiate 8*44 = 352 selector threads, but since the QueuedThreadPool is initialized with 200 threads by default, the UI gets stuck. I have patched JettyUtils.scala to extend the thread pool ( {{val pool = new QueuedThreadPool*(400)* }}). With this hack, the UI works. Obviously, the Jetty defaults are meant for a real web server. If that has 88 CPUs, you do certainly expect a lot of traffic. For the Spark admin UI however, there will rarely be concurrent accesses for the same application or the same executor. I therefore propose to dramatically reduce the number of selector threads that get instantiated - at least by default. I will propose a fix in a pull request. 
> Master UI hangs with spark.ui.reverseProxy=true if the master node has many > CPUs > > > Key: SPARK-21176 > URL: https://issues.apache.org/jira/browse/SPARK-21176 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.1.0, 2.1.1, 2.2.0, 2.2.1 > Environment: ppc64le GNU/Linux, POWER8, only master node is reachable > externally other nodes are in an internal network >Reporter: Ingo Schuster > Labels: network, web-ui > > In reverse proxy mode, Sparks exhausts the Jetty thread pool if the master > node has too many cpus or the cluster has too many executers: > For each connector, Jetty creates Selector threads: minimum 4, maximum half > the number of available CPUs: > {{ this(Math.max(1, Runtime.getRuntime().availableProcessors() / 2));}} > (see > https://github.com/eclipse/jetty.project/blob/0c8273f2ca1f9bf2064cd9c4c939d2546443f759/jetty-client/src/main/java/org/eclipse/jetty/client/http/HttpClientTransportOverHTTP.java) > In reverse proxy mode, a connector is set up for each executor and one for > the master UI. > I have a system with 88 CPUs on the master node and 7 executors. Jetty tries > to instantiate 8*44 = 352 selector threads, but since the QueuedThreadPool is > initialized with 200 threads by default, the UI gets stuck. > I have patched JettyUtils.scala to extend the thread pool ( {{val pool = new > QueuedThreadPool*(400)* }}). With this hack, the UI works. > Obviously, the Jetty defaults are meant for a real web server. If that has 88 > CPUs, you do certainly expect a lot of traffic. > For the Spark admin UI however, there will rarely be concurrent accesses for > the same application or the same executor. > I therefore propose to dramatically reduce the number of selector threads > that get instantiated - at least by default. > I will propose a fix in a pull request. -- This message was sent by Atlassian JIRA (v6.4.1
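A sketch of the two knobs being discussed (the Jetty API calls are real; the numbers are the reporter's): enlarging the worker pool, which is the hack described above, versus explicitly capping selector threads per connector, which is the direction the proposed fix takes:
{code}
import org.eclipse.jetty.server.{Server, ServerConnector}
import org.eclipse.jetty.util.thread.QueuedThreadPool

// Hack: a bigger worker pool (the default of 200 is exhausted by 8 connectors * 44 selectors each).
val pool = new QueuedThreadPool(400)
val server = new Server(pool)

// Alternative: cap selector threads per connector instead of relying on the CPU-based default
// (-1 keeps Jetty's default number of acceptor threads).
val connector = new ServerConnector(server, -1, 1)
server.addConnector(connector)
{code}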
[jira] [Assigned] (SPARK-21180) Remove conf from stats functions since now we have conf in LogicalPlan
[ https://issues.apache.org/jira/browse/SPARK-21180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21180: Assignee: (was: Apache Spark) > Remove conf from stats functions since now we have conf in LogicalPlan > -- > > Key: SPARK-21180 > URL: https://issues.apache.org/jira/browse/SPARK-21180 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Zhenhua Wang > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21180) Remove conf from stats functions since now we have conf in LogicalPlan
[ https://issues.apache.org/jira/browse/SPARK-21180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21180: Assignee: Apache Spark > Remove conf from stats functions since now we have conf in LogicalPlan > -- > > Key: SPARK-21180 > URL: https://issues.apache.org/jira/browse/SPARK-21180 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Zhenhua Wang >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21180) Remove conf from stats functions since now we have conf in LogicalPlan
[ https://issues.apache.org/jira/browse/SPARK-21180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16059525#comment-16059525 ] Apache Spark commented on SPARK-21180: -- User 'wzhfy' has created a pull request for this issue: https://github.com/apache/spark/pull/18391 > Remove conf from stats functions since now we have conf in LogicalPlan > -- > > Key: SPARK-21180 > URL: https://issues.apache.org/jira/browse/SPARK-21180 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Zhenhua Wang > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21110) Structs should be usable in inequality filters
[ https://issues.apache.org/jira/browse/SPARK-21110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16059536#comment-16059536 ] Nicholas Chammas commented on SPARK-21110: -- cc [~marmbrus] - Assuming this is a valid feature request, maybe it belongs as part of a larger umbrella task. > Structs should be usable in inequality filters > -- > > Key: SPARK-21110 > URL: https://issues.apache.org/jira/browse/SPARK-21110 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.1 >Reporter: Nicholas Chammas >Priority: Minor > > It seems like a missing feature that you can't compare structs in a filter on > a DataFrame. > Here's a simple demonstration of a) where this would be useful and b) how > it's different from simply comparing each of the components of the structs. > {code} > import pyspark > from pyspark.sql.functions import col, struct, concat > spark = pyspark.sql.SparkSession.builder.getOrCreate() > df = spark.createDataFrame( > [ > ('Boston', 'Bob'), > ('Boston', 'Nick'), > ('San Francisco', 'Bob'), > ('San Francisco', 'Nick'), > ], > ['city', 'person'] > ) > pairs = ( > df.select( > struct('city', 'person').alias('p1') > ) > .crossJoin( > df.select( > struct('city', 'person').alias('p2') > ) > ) > ) > print("Everything") > pairs.show() > print("Comparing parts separately (doesn't give me what I want)") > (pairs > .where(col('p1.city') < col('p2.city')) > .where(col('p1.person') < col('p2.person')) > .show()) > print("Comparing parts together with concat (gives me what I want but is > hacky)") > (pairs > .where(concat('p1.city', 'p1.person') < concat('p2.city', 'p2.person')) > .show()) > print("Comparing parts together with struct (my desired solution but > currently yields an error)") > (pairs > .where(col('p1') < col('p2')) > .show()) > {code} > The last query yields the following error in Spark 2.1.1: > {code} > org.apache.spark.sql.AnalysisException: cannot resolve '(`p1` < `p2`)' due to > data type mismatch: '(`p1` < `p2`)' requires (boolean or tinyint or smallint > or int or bigint or float or double or decimal or timestamp or date or string > or binary) type, not struct;; > 'Filter (p1#5 < p2#8) > +- Join Cross >:- Project [named_struct(city, city#0, person, person#1) AS p1#5] >: +- LogicalRDD [city#0, person#1] >+- Project [named_struct(city, city#0, person, person#1) AS p2#8] > +- LogicalRDD [city#0, person#1] > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21181) Suppress memory leak errors reported by netty
Dhruve Ashar created SPARK-21181: Summary: Suppress memory leak errors reported by netty Key: SPARK-21181 URL: https://issues.apache.org/jira/browse/SPARK-21181 Project: Spark Issue Type: Bug Components: Input/Output Affects Versions: 2.1.0 Reporter: Dhruve Ashar Priority: Minor We are seeing netty report memory leak errors like the one below after switching to 2.1. {code} ERROR ResourceLeakDetector: LEAK: ByteBuf.release() was not called before it's garbage-collected. Enable advanced leak reporting to find out where the leak occurred. To enable advanced leak reporting, specify the JVM option '-Dio.netty.leakDetection.level=advanced' or call ResourceLeakDetector.setLevel() See http://netty.io/wiki/reference-counted-objects.html for more information. {code} Looking a bit deeper, Spark is not leaking any memory here, but it is confusing for the user to see the error message in the driver logs. After enabling '-Dio.netty.leakDetection.level=advanced', netty reveals the SparkSaslServer to be the source of these leaks. Sample trace: https://gist.github.com/dhruve/b299ebc35aa0a185c244a0468927daf1 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
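Until such a change lands, one user-side way to hide these messages (assuming, as argued above, they are not real Spark leaks) is to turn the detector off entirely through the JVM property netty itself documents, for example:
{code}
./bin/spark-shell --driver-java-options "-Dio.netty.leakDetection.level=disabled"
{code}
Disabling detection obviously also hides any genuine leaks, so this is a workaround rather than a fix.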
[jira] [Assigned] (SPARK-21181) Suppress memory leak errors reported by netty
[ https://issues.apache.org/jira/browse/SPARK-21181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21181: Assignee: Apache Spark > Suppress memory leak errors reported by netty > - > > Key: SPARK-21181 > URL: https://issues.apache.org/jira/browse/SPARK-21181 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.1.0 >Reporter: Dhruve Ashar >Assignee: Apache Spark >Priority: Minor > > We are seeing netty report memory leak erros like the one below after > switching to 2.1. > {code} > ERROR ResourceLeakDetector: LEAK: ByteBuf.release() was not called before > it's garbage-collected. Enable advanced leak reporting to find out where the > leak occurred. To enable advanced leak reporting, specify the JVM option > '-Dio.netty.leakDetection.level=advanced' or call > ResourceLeakDetector.setLevel() See > http://netty.io/wiki/reference-counted-objects.html for more information. > {code} > Looking a bit deeper, Spark is not leaking any memory here, but it is > confusing for the user to see the error message in the driver logs. > After enabling, '-Dio.netty.leakDetection.level=advanced', netty reveals the > SparkSaslServer to be the source of these leaks. > Sample trace :https://gist.github.com/dhruve/b299ebc35aa0a185c244a0468927daf1 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21181) Suppress memory leak errors reported by netty
[ https://issues.apache.org/jira/browse/SPARK-21181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16059671#comment-16059671 ] Apache Spark commented on SPARK-21181: -- User 'dhruve' has created a pull request for this issue: https://github.com/apache/spark/pull/18392 > Suppress memory leak errors reported by netty > - > > Key: SPARK-21181 > URL: https://issues.apache.org/jira/browse/SPARK-21181 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.1.0 >Reporter: Dhruve Ashar >Priority: Minor > > We are seeing netty report memory leak erros like the one below after > switching to 2.1. > {code} > ERROR ResourceLeakDetector: LEAK: ByteBuf.release() was not called before > it's garbage-collected. Enable advanced leak reporting to find out where the > leak occurred. To enable advanced leak reporting, specify the JVM option > '-Dio.netty.leakDetection.level=advanced' or call > ResourceLeakDetector.setLevel() See > http://netty.io/wiki/reference-counted-objects.html for more information. > {code} > Looking a bit deeper, Spark is not leaking any memory here, but it is > confusing for the user to see the error message in the driver logs. > After enabling, '-Dio.netty.leakDetection.level=advanced', netty reveals the > SparkSaslServer to be the source of these leaks. > Sample trace :https://gist.github.com/dhruve/b299ebc35aa0a185c244a0468927daf1 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21181) Suppress memory leak errors reported by netty
[ https://issues.apache.org/jira/browse/SPARK-21181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21181: Assignee: (was: Apache Spark) > Suppress memory leak errors reported by netty > - > > Key: SPARK-21181 > URL: https://issues.apache.org/jira/browse/SPARK-21181 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.1.0 >Reporter: Dhruve Ashar >Priority: Minor > > We are seeing netty report memory leak erros like the one below after > switching to 2.1. > {code} > ERROR ResourceLeakDetector: LEAK: ByteBuf.release() was not called before > it's garbage-collected. Enable advanced leak reporting to find out where the > leak occurred. To enable advanced leak reporting, specify the JVM option > '-Dio.netty.leakDetection.level=advanced' or call > ResourceLeakDetector.setLevel() See > http://netty.io/wiki/reference-counted-objects.html for more information. > {code} > Looking a bit deeper, Spark is not leaking any memory here, but it is > confusing for the user to see the error message in the driver logs. > After enabling, '-Dio.netty.leakDetection.level=advanced', netty reveals the > SparkSaslServer to be the source of these leaks. > Sample trace :https://gist.github.com/dhruve/b299ebc35aa0a185c244a0468927daf1 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21155) Add (? running tasks) into Spark UI progress
[ https://issues.apache.org/jira/browse/SPARK-21155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Vandenberg updated SPARK-21155: Attachment: Screen Shot 2017-06-22 at 9.58.08 AM.png > Add (? running tasks) into Spark UI progress > > > Key: SPARK-21155 > URL: https://issues.apache.org/jira/browse/SPARK-21155 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.1.1 >Reporter: Eric Vandenberg >Priority: Minor > Attachments: Screen Shot 2017-06-20 at 12.32.58 PM.png, Screen Shot > 2017-06-20 at 3.40.39 PM.png, Screen Shot 2017-06-22 at 9.58.08 AM.png > > > The progress UI for Active Jobs / Tasks should show the number of exact > number of running tasks. See screen shot attachment for what this looks like. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20599) ConsoleSink should work with write (batch)
[ https://issues.apache.org/jira/browse/SPARK-20599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-20599. -- Resolution: Fixed Fix Version/s: 2.3.0 > ConsoleSink should work with write (batch) > -- > > Key: SPARK-20599 > URL: https://issues.apache.org/jira/browse/SPARK-20599 > Project: Spark > Issue Type: Improvement > Components: SQL, Structured Streaming >Affects Versions: 2.2.0 >Reporter: Jacek Laskowski >Priority: Minor > Labels: starter > Fix For: 2.3.0 > > > I think the following should just work. > {code} > spark. > read. // <-- it's a batch query not streaming query if that matters > format("kafka"). > option("subscribe", "topic1"). > option("kafka.bootstrap.servers", "localhost:9092"). > load. > write. > format("console"). // <-- that's not supported currently > save > {code} > The above combination of {{kafka}} source and {{console}} sink leads to the > following exception: > {code} > java.lang.RuntimeException: > org.apache.spark.sql.execution.streaming.ConsoleSinkProvider does not allow > create table as select. > at scala.sys.package$.error(package.scala:27) > at > org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:479) > at > org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:48) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:93) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:93) > at > org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:610) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233) > ... 48 elided > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-21167) Path is not decoded correctly when reading output of FileSink
[ https://issues.apache.org/jira/browse/SPARK-21167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-21167: - Comment: was deleted (was: User 'dijingran' has created a pull request for this issue: https://github.com/apache/spark/pull/18383) > Path is not decoded correctly when reading output of FileSink > - > > Key: SPARK-21167 > URL: https://issues.apache.org/jira/browse/SPARK-21167 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.1 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Fix For: 2.1.2, 2.2.1, 2.3.0 > > > When reading output of FileSink, path is not decoded correctly. So if the > path has some special characters, such as spaces, Spark cannot read it. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21167) Path is not decoded correctly when reading output of FileSink
[ https://issues.apache.org/jira/browse/SPARK-21167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16059816#comment-16059816 ] Apache Spark commented on SPARK-21167: -- User 'dijingran' has created a pull request for this issue: https://github.com/apache/spark/pull/18383 > Path is not decoded correctly when reading output of FileSink > - > > Key: SPARK-21167 > URL: https://issues.apache.org/jira/browse/SPARK-21167 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.1 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Fix For: 2.1.2, 2.2.1, 2.3.0 > > > When reading output of FileSink, path is not decoded correctly. So if the > path has some special characters, such as spaces, Spark cannot read it. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
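A minimal way to see the symptom described in SPARK-21167, assuming a local parquet file sink whose output directory name contains a space (all paths below are made up for illustration):
{code}
// Hypothetical repro sketch, not from the issue: a streaming query writes to a directory
// whose name contains a space; on affected versions, reading that output back fails
// because the path recorded by the FileSink metadata log is not decoded correctly.
val input = spark.readStream.text("/tmp/in")
val query = input.writeStream
  .format("parquet")
  .option("checkpointLocation", "/tmp/chk")
  .start("/tmp/out dir")                      // note the space in the path

// later, in a batch job:
val df = spark.read.parquet("/tmp/out dir")   // fails before the fix
{code}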
[jira] [Updated] (SPARK-21168) KafkaRDD should always set kafka clientId.
[ https://issues.apache.org/jira/browse/SPARK-21168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-21168: - Component/s: (was: Structured Streaming) DStreams > KafkaRDD should always set kafka clientId. > -- > > Key: SPARK-21168 > URL: https://issues.apache.org/jira/browse/SPARK-21168 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.2 >Reporter: Xingxing Di >Priority: Trivial > > I found that KafkaRDD does not set the Kafka client.id in the "fetchBatch" method > (FetchRequestBuilder sets clientId to empty by default). Normally this > affects nothing, but in our case we use the clientId on the Kafka server side, > so we have to rebuild spark-streaming-kafka. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
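A sketch of the kind of change the reporter is describing (this is not the actual patch, and the values below are placeholders purely for illustration): pass the consumer's client.id into the fetch request builder instead of leaving its empty default.
{code}
// Placeholder values purely for illustration.
import kafka.api.FetchRequestBuilder

val clientId = "my-streaming-app"
val (topic, partition, offset, fetchSize) = ("events", 0, 0L, 1024 * 1024)

val req = new FetchRequestBuilder()
  .clientId(clientId)                              // explicitly set instead of the empty default
  .addFetch(topic, partition, offset, fetchSize)
  .build()
{code}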
[jira] [Updated] (SPARK-21179) Unable to return Hive INT data type into Spark via Hive JDBC driver: Caused by: java.sql.SQLDataException: [Simba][JDBC](10140) Error converting value to int.
[ https://issues.apache.org/jira/browse/SPARK-21179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Walton updated SPARK-21179: --- Affects Version/s: 2.1.1 Summary: Unable to return Hive INT data type into Spark via Hive JDBC driver: Caused by: java.sql.SQLDataException: [Simba][JDBC](10140) Error converting value to int.(was: Unable to return Hive INT data type into Spark SQL via Hive JDBC driver: Caused by: java.sql.SQLDataException: [Simba][JDBC](10140) Error converting value to int. ) > Unable to return Hive INT data type into Spark via Hive JDBC driver: Caused > by: java.sql.SQLDataException: [Simba][JDBC](10140) Error converting value to > int. > - > > Key: SPARK-21179 > URL: https://issues.apache.org/jira/browse/SPARK-21179 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL >Affects Versions: 1.6.0, 2.0.0, 2.1.1 > Environment: OS: Linux > HDP version 2.5.0.1-60 > Hive version: 1.2.1 > Spark version 2.0.0.2.5.0.1-60 > JDBC: Download the latest Hortonworks JDBC driver >Reporter: Matthew Walton > > I'm trying to fetch back data in Spark SQL using a JDBC connection to Hive. > Unfortunately, when I try to query data that resides in an INT column I get > the following error: > 17/06/22 12:14:37 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) > java.sql.SQLDataException: [Simba][JDBC](10140) Error converting value to > int. > Steps to reproduce: > 1) On Hive create a simple table with an INT column and insert some data (I > used SQuirreL Client with the Hortonworks JDBC driver): > create table wh2.hivespark (country_id int, country_name string) > insert into wh2.hivespark values (1, 'USA') > 2) Copy the Hortonworks Hive JDBC driver to the machine where you will run > Spark Shell > 3) Start Spark shell loading the Hortonworks Hive JDBC driver jar files > ./spark-shell --jars > /home/spark/jdbc/hortonworkshive/HiveJDBC41.jar,/home/spark/jdbc/hortonworkshive/TCLIServiceClient.jar,/home/spark/jdbc/hortonworkshive/commons-codec-1.3.jar,/home/spark/jdbc/hortonworkshive/commons-logging-1.1.1.jar,/home/spark/jdbc/hortonworkshive/hive_metastore.jar,/home/spark/jdbc/hortonworkshive/hive_service.jar,/home/spark/jdbc/hortonworkshive/httpclient-4.1.3.jar,/home/spark/jdbc/hortonworkshive/httpcore-4.1.3.jar,/home/spark/jdbc/hortonworkshive/libfb303-0.9.0.jar,/home/spark/jdbc/hortonworkshive/libthrift-0.9.0.jar,/home/spark/jdbc/hortonworkshive/log4j-1.2.14.jar,/home/spark/jdbc/hortonworkshive/ql.jar,/home/spark/jdbc/hortonworkshive/slf4j-api-1.5.11.jar,/home/spark/jdbc/hortonworkshive/slf4j-log4j12-1.5.11.jar,/home/spark/jdbc/hortonworkshive/zookeeper-3.4.6.jar > 4) In Spark shell load the data from Hive using the JDBC driver > val hivespark = spark.read.format("jdbc").options(Map("url" -> > "jdbc:hive2://localhost:1/wh2;AuthMech=3;UseNativeQuery=1;user=hfs;password=hdfs","dbtable" > -> > "wh2.hivespark")).option("driver","com.simba.hive.jdbc41.HS2Driver").option("user","hdfs").option("password","hdfs").load() > 5) In Spark shell try to display the data > hivespark.show() > At this point you should see the error: > scala> hivespark.show() > 17/06/22 12:14:37 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) > java.sql.SQLDataException: [Simba][JDBC](10140) Error converting value to int. 
> at > com.simba.hiveserver2.exceptions.ExceptionConverter.toSQLException(Unknown > Source) > at > com.simba.hiveserver2.utilities.conversion.TypeConverter.toInt(Unknown Source) > at com.simba.hiveserver2.jdbc.common.SForwardResultSet.getInt(Unknown > Source) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.getNext(JDBCRDD.scala:437) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.hasNext(JDBCRDD.scala:535) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
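Until the driver conversion issue in SPARK-21179 is resolved, one generic workaround for JDBC type-conversion errors of this kind is to push a casting subquery down as the dbtable, so the driver returns strings rather than converting to int on the Spark side. This is only a possible workaround sketch, untested here, and the URL and credentials are placeholders:
{code}
// Possible workaround sketch (not a fix): cast the INT column inside a pushed-down subquery.
val hivespark = spark.read.format("jdbc")
  .option("url", "jdbc:hive2://your-hive-host:10000/wh2;AuthMech=3;UseNativeQuery=1")  // placeholder URL
  .option("driver", "com.simba.hive.jdbc41.HS2Driver")
  .option("dbtable", "(select cast(country_id as string) as country_id, country_name from wh2.hivespark) t")
  .option("user", "hdfs")
  .option("password", "hdfs")
  .load()
{code}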
[jira] [Created] (SPARK-21182) Structured streaming on Spark-shell on windows
Vijay created SPARK-21182: - Summary: Structured streaming on Spark-shell on windows Key: SPARK-21182 URL: https://issues.apache.org/jira/browse/SPARK-21182 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.1.1 Environment: Windows 10 spark-2.1.1-bin-hadoop2.7 Reporter: Vijay Priority: Minor The structured streaming output operation fails in the Windows shell. As per the error message, the path is being prefixed with the Linux-style file separator, thus causing the IllegalArgumentException. Following is the error message. scala> val query = wordCounts.writeStream .outputMode("complete") .format("console") .start() java.lang.IllegalArgumentException: Pathname {color:red}*/*{color}C:/Users/Vijay/AppData/Local/Temp/temporary-081b482c-98a4-494e-8cfb-22d966c2da01/offsets from C:/Users/Vijay/AppData/Local/Temp/temporary-081b482c-98a4-494e-8cfb-22d966c2da01/offsets is not a valid DFS filename. at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:197) at org.apache.hadoop.hdfs.DistributedFileSystem.access$000(DistributedFileSystem.java:106) at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1305) at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1317) at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1426) at org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:222) at org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:280) at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:268) ... 52 elided -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
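One workaround sometimes suggested for this class of Windows path problem (offered here only as a hedged, untested sketch) is to give the query an explicit checkpoint location with a proper file: URI instead of relying on the default temporary directory, so the offsets path is not resolved against an HDFS-style filesystem:
{code}
// Possible workaround sketch; the local path is hypothetical.
val query = wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .option("checkpointLocation", "file:///C:/tmp/checkpoint")
  .start()
{code}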
[jira] [Created] (SPARK-21183) Unable to return Google BigQuery INTEGER data type into Spark via google BigQuery JDBC driver: java.sql.SQLDataException: [Simba][JDBC](10140) Error converting value to
Matthew Walton created SPARK-21183: -- Summary: Unable to return Google BigQuery INTEGER data type into Spark via google BigQuery JDBC driver: java.sql.SQLDataException: [Simba][JDBC](10140) Error converting value to long. Key: SPARK-21183 URL: https://issues.apache.org/jira/browse/SPARK-21183 Project: Spark Issue Type: Bug Components: Spark Shell, SQL Affects Versions: 2.1.1, 2.0.0, 1.6.0 Environment: OS: Linux Spark version 2.1.1 JDBC: Download the latest google BigQuery JDBC Driver from Google Reporter: Matthew Walton I'm trying to fetch back data in Spark using a JDBC connection to Google BigQuery. Unfortunately, when I try to query data that resides in an INTEGER column I get the following error: java.sql.SQLDataException: [Simba][JDBC](10140) Error converting value to long. Steps to reproduce: 1) On Google BigQuery console create a simple table with an INT column and insert some data 2) Copy the Google BigQuery JDBC driver to the machine where you will run Spark Shell 3) Start Spark shell loading the GoogleBigQuery JDBC driver jar files ./spark-shell --jars /home/ec2-user/jdbc/gbq/GoogleBigQueryJDBC42.jar,/home/ec2-user/jdbc/gbq/google-api-client-1.22.0.jar,/home/ec2-user/jdbc/gbq/google-api-services-bigquery-v2-rev320-1.22.0.jar,/home/ec2-user/jdbc/gbq/google-http-client-1.22.0.jar,/home/ec2-user/jdbc/gbq/google-http-client-jackson2-1.22.0.jar,/home/ec2-user/jdbc/gbq/google-oauth-client-1.22.0.jar,/home/ec2-user/jdbc/gbq/jackson-core-2.1.3.jar 4) In Spark shell load the data from Google BigQuery using the JDBC driver val gbq = spark.read.format("jdbc").options(Map("url" -> "jdbc:bigquery://https://www.googleapis.com/bigquery/v2;ProjectId=your-project-name-here;OAuthType=0;OAuthPvtKeyPath=/usr/lib/spark/YourProjectPrivateKey.json;OAuthServiceAcctEmail=YourEmail@gmail.comAllowLargeResults=1;LargeResultDataset=_bqodbc_temp_tables;LargeResultTable=_matthew;Timeout=600","dbtable"; -> "test.lu_test_integer")).option("driver","com.simba.googlebigquery.jdbc42.Driver").option("user","").option("password","").load() 5) In Spark shell try to display the data gbq.show() At this point you should see the error: scala> gbq.show() 17/06/22 19:34:57 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 6, ip-172-31-37-165.ec2.internal, executor 3): java.sql.SQLDataException: [Simba][JDBC](10140) Error converting value to long. 
at com.simba.exceptions.ExceptionConverter.toSQLException(Unknown Source) at com.simba.utilities.conversion.TypeConverter.toLong(Unknown Source) at com.simba.jdbc.common.SForwardResultSet.getLong(Unknown Source) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$8.apply(JdbcUtils.scala:365) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$8.apply(JdbcUtils.scala:364) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:286) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:268) at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:99) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(
[jira] [Created] (SPARK-21184) QuantileSummaries implementation is wrong and QuantileSummariesSuite fails with larger n
Andrew Ray created SPARK-21184: -- Summary: QuantileSummaries implementation is wrong and QuantileSummariesSuite fails with larger n Key: SPARK-21184 URL: https://issues.apache.org/jira/browse/SPARK-21184 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.1 Reporter: Andrew Ray 1. The QuantileSummaries implementation does not match the paper it is supposed to be based on. 1a. The compress method (https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/QuantileSummaries.scala#L240) merges neighboring buckets, but that's not what the paper says to do. The paper (http://infolab.stanford.edu/~datar/courses/cs361a/papers/quantiles.pdf) describes an implicit tree structure, and the compress method deletes selected subtrees. 1b. The paper does not discuss merging these summary data structures at all. The following comment is in the merge method of QuantileSummaries: {quote} // The GK algorithm is a bit unclear about it, but it seems there is no need to adjust the // statistics during the merging: the invariants are still respected after the merge.{quote} Unless I'm missing something, that needs substantiation: it's not clear that the invariants hold. 2. QuantileSummariesSuite fails with n = 1 (and other non-trivial values) https://github.com/apache/spark/blob/master/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/QuantileSummariesSuite.scala#L27 One possible solution, if these issues can't be resolved, would be to move to an algorithm that explicitly supports merging and is well tested, like https://github.com/tdunning/t-digest -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
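For context on where this matters: QuantileSummaries is the sketch behind DataFrame.stat.approxQuantile (and SQL's approximate percentile support), so a compress/merge that violates the GK invariants would surface as inaccurate results from calls like the one below. The data and column name are made up for illustration.
{code}
// Illustrative only: approxQuantile is built on QuantileSummaries, so incorrect
// compression or merging would show up as wrong quantile estimates here.
val df = spark.range(0, 1000).selectExpr("id as value")
val quartiles = df.stat.approxQuantile("value", Array(0.25, 0.5, 0.75), 0.01)
{code}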
[jira] [Resolved] (SPARK-19937) Collect metrics of block sizes when shuffle.
[ https://issues.apache.org/jira/browse/SPARK-19937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-19937. Resolution: Fixed Assignee: jin xing Fix Version/s: 2.3.0 > Collect metrics of block sizes when shuffle. > > > Key: SPARK-19937 > URL: https://issues.apache.org/jira/browse/SPARK-19937 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 2.1.0 >Reporter: jin xing >Assignee: jin xing > Fix For: 2.3.0 > > > This improvement is related to SPARK-19659 and is also documented there. Metrics > of block sizes (during shuffle) should be collected for later analysis. This is > helpful for analyzing data skew or OOM situations (even though > maxBytesInFlight is set). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20342) DAGScheduler sends SparkListenerTaskEnd before updating task's accumulators
[ https://issues.apache.org/jira/browse/SPARK-20342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16060063#comment-16060063 ] Apache Spark commented on SPARK-20342: -- User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/18393 > DAGScheduler sends SparkListenerTaskEnd before updating task's accumulators > --- > > Key: SPARK-20342 > URL: https://issues.apache.org/jira/browse/SPARK-20342 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Marcelo Vanzin > > Hit this on 2.2, but probably has been there forever. This is similar in > spirit to SPARK-20205. > Event is sent here, around L1154: > {code} > listenerBus.post(SparkListenerTaskEnd( >stageId, task.stageAttemptId, taskType, event.reason, event.taskInfo, > taskMetrics)) > {code} > Accumulators are updated later, around L1173: > {code} > val stage = stageIdToStage(task.stageId) > event.reason match { > case Success => > task match { > case rt: ResultTask[_, _] => > // Cast to ResultStage here because it's part of the ResultTask > // TODO Refactor this out to a function that accepts a ResultStage > val resultStage = stage.asInstanceOf[ResultStage] > resultStage.activeJob match { > case Some(job) => > if (!job.finished(rt.outputId)) { > updateAccumulators(event) > {code} > Same thing applies here; UI shows correct info because it's pointing at the > mutable {{TaskInfo}} structure. But the event log, for example, may record > the wrong information. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20379) Allow setting SSL-related passwords through env variables
[ https://issues.apache.org/jira/browse/SPARK-20379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16060065#comment-16060065 ] Apache Spark commented on SPARK-20379: -- User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/18394 > Allow setting SSL-related passwords through env variables > - > > Key: SPARK-20379 > URL: https://issues.apache.org/jira/browse/SPARK-20379 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Marcelo Vanzin > > Currently, Spark reads all SSL options from configuration, which can be > provided in a file or through the command line. This means that to set the > SSL keystore / trust store / key passwords, you have to use one of those > options. > Using the command line for that is not secure, and in some environments > admins prefer to not have the password written in plain text in a file (since > the file and the data it's protecting could be stored in the same > filesystem). So for these cases it would be nice to be able to provide these > passwords through environment variables, which are not written to disk and > also not readable by other users on the same machine. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20655) In-memory key-value store implementation
[ https://issues.apache.org/jira/browse/SPARK-20655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16060067#comment-16060067 ] Apache Spark commented on SPARK-20655: -- User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/18395 > In-memory key-value store implementation > > > Key: SPARK-20655 > URL: https://issues.apache.org/jira/browse/SPARK-20655 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin > > See spec in parent issue (SPARK-18085) for more details. > This task tracks adding an in-memory implementation for the key-value store > abstraction added in SPARK-20641. This is desired because people might want > to avoid having to store this data on disk when running applications, and > also because the LevelDB native libraries are not available on all platforms. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
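Purely as an illustration of the idea in SPARK-20655 (this is not the abstraction added in SPARK-20641, whose actual interface is not quoted here), an in-memory implementation is essentially a typed, heap-backed map that can stand in for the LevelDB-backed store on platforms without the native library:
{code}
// Hypothetical stand-in, not Spark's API: a minimal heap-backed key-value store.
import java.util.concurrent.ConcurrentHashMap

class InMemoryStore[K, V] {
  private val data = new ConcurrentHashMap[K, V]()
  def write(key: K, value: V): Unit = data.put(key, value)
  def read(key: K): Option[V] = Option(data.get(key))
  def delete(key: K): Unit = data.remove(key)
  def count(): Long = data.size().toLong
}
{code}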
[jira] [Commented] (SPARK-20391) Properly rename the memory related fields in ExecutorSummary REST API
[ https://issues.apache.org/jira/browse/SPARK-20391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16060097#comment-16060097 ] Jose Soltren commented on SPARK-20391: -- So, this is months old now and irrelevant, but since you pinged me, I'll say that jerryshao's changes look fine to me. Thanks. > Properly rename the memory related fields in ExecutorSummary REST API > - > > Key: SPARK-20391 > URL: https://issues.apache.org/jira/browse/SPARK-20391 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Saisai Shao >Assignee: Saisai Shao >Priority: Blocker > Fix For: 2.2.0 > > > Currently in Spark we can get the executor summary through the REST API > {{/api/v1/applications//executors}}. The format of the executor summary > is: > {code} > class ExecutorSummary private[spark]( > val id: String, > val hostPort: String, > val isActive: Boolean, > val rddBlocks: Int, > val memoryUsed: Long, > val diskUsed: Long, > val totalCores: Int, > val maxTasks: Int, > val activeTasks: Int, > val failedTasks: Int, > val completedTasks: Int, > val totalTasks: Int, > val totalDuration: Long, > val totalGCTime: Long, > val totalInputBytes: Long, > val totalShuffleRead: Long, > val totalShuffleWrite: Long, > val isBlacklisted: Boolean, > val maxMemory: Long, > val executorLogs: Map[String, String], > val onHeapMemoryUsed: Option[Long], > val offHeapMemoryUsed: Option[Long], > val maxOnHeapMemory: Option[Long], > val maxOffHeapMemory: Option[Long]) > {code} > Here are 6 memory-related fields: {{memoryUsed}}, {{maxMemory}}, > {{onHeapMemoryUsed}}, {{offHeapMemoryUsed}}, {{maxOnHeapMemory}}, > {{maxOffHeapMemory}}. > All 6 of these fields reflect the *storage* memory usage in Spark, but from their > names a user doesn't really know whether they refer to *storage* > memory or the total memory (storage memory + execution memory). This is > misleading. > So I think we should properly rename these fields to reflect their real > meanings. Or we should document it. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21145) Restarted queries reuse same StateStoreProvider, causing multiple concurrent tasks to update same StateStore
[ https://issues.apache.org/jira/browse/SPARK-21145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16060196#comment-16060196 ] Apache Spark commented on SPARK-21145: -- User 'tdas' has created a pull request for this issue: https://github.com/apache/spark/pull/18396 > Restarted queries reuse same StateStoreProvider, causing multiple concurrent > tasks to update same StateStore > > > Key: SPARK-21145 > URL: https://issues.apache.org/jira/browse/SPARK-21145 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Tathagata Das >Assignee: Tathagata Das > > StateStoreProvider instances are loaded on demand in an executor when a query > is started. When a query is restarted, the loaded provider instance will get > reused. Now, there is a non-trivial chance that a task of the previous > query run is still running while the tasks of the restarted run have started. > So there may be two concurrent tasks related to the > same stateful partition, and therefore using the same provider instance. This > can lead to inconsistent results and possibly random failures, as state store > implementations are not designed to be thread-safe. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18343) FileSystem$Statistics$StatisticsDataReferenceCleaner hangs on s3 write
[ https://issues.apache.org/jira/browse/SPARK-18343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16060203#comment-16060203 ] Kanagha Pradha commented on SPARK-18343: I am getting the same error with spark 2.0.2 - scala 2.11 version. As [~lminer] mentioned, I looked at the versions. the spark pom.xml by default references json4s - 3.2.11 version. I'm not sure what is the reason. Also, this error occurs intermittently and several times spark-submit succeeds. Any pointers on what could be causing this? Thread 24 (org.apache.hadoop.hdfs.PeerCache@1c80e49b): State: TIMED_WAITING Blocked count: 0 Waited count: 23 Stack: java.lang.Thread.sleep(Native Method) org.apache.hadoop.hdfs.PeerCache.run(PeerCache.java:255) org.apache.hadoop.hdfs.PeerCache.access$000(PeerCache.java:46) org.apache.hadoop.hdfs.PeerCache$1.run(PeerCache.java:124) java.lang.Thread.run(Thread.java:748) Thread 22 (communication thread): State: RUNNABLE Blocked count: 20 Waited count: 63 Stack: sun.management.ThreadImpl.getThreadInfo1(Native Method) sun.management.ThreadImpl.getThreadInfo(ThreadImpl.java:178) sun.management.ThreadImpl.getThreadInfo(ThreadImpl.java:139) org.apache.hadoop.util.ReflectionUtils.printThreadInfo(ReflectionUtils.java:168) org.apache.hadoop.util.ReflectionUtils.logThreadInfo(ReflectionUtils.java:223) org.apache.hadoop.mapred.Task$TaskReporter.run(Task.java:785) java.lang.Thread.run(Thread.java:748) Thread 16 (process reaper): State: TIMED_WAITING Blocked count: 0 Waited count: 5 Stack: sun.misc.Unsafe.park(Native Method) java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215) java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:460) java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:362) java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:941) java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1066) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) java.lang.Thread.run(Thread.java:748) Thread 14 (org.apache.hadoop.fs.FileSystem$Statistics$StatisticsDataReferenceCleaner): State: WAITING Blocked count: 0 Waited count: 1 Waiting on java.lang.ref.ReferenceQueue$Lock@407a7f2a Stack: java.lang.Object.wait(Native Method) java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:143) java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:164) org.apache.hadoop.fs.FileSystem$Statistics$StatisticsDataReferenceCleaner.run(FileSystem.java:3060) java.lang.Thread.run(Thread.java:748) Thread 13 (IPC Parameter Sending Thread #0): State: TIMED_WAITING Blocked count: 0 Waited count: 31 Stack: sun.misc.Unsafe.park(Native Method) java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215) java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:460) java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:362) java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:941) java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1066) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) java.lang.Thread.run(Thread.java:748) Thread 4 (Signal Dispatcher): State: RUNNABLE Blocked count: 0 Waited count: 0 Stack: Thread 3 (Finalizer): State: WAITING Blocked count: 567 Waited 
count: 6 Waiting on java.lang.ref.ReferenceQueue$Lock@2cb2fc20 Stack: java.lang.Object.wait(Native Method) java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:143) java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:164) java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:209) Thread 2 (Reference Handler): State: WAITING Blocked count: 9 Waited count: 5 Waiting on java.lang.ref.Reference$Lock@4f4c4b1a Stack: java.lang.Object.wait(Native Method) java.lang.Object.wait(Object.java:502) java.lang.ref.Reference.tryHandlePending(Reference.java:191) java.lang.ref.Reference$ReferenceHandler.run(Reference.java:153) Jun 22, 2017 11:01:23 PM org.apache.hadoop.mapred.Task run WARNING: Last retry, killing attempt_1498171863687_0003_m_00_0 > FileSystem$Statistics$StatisticsDataReferenceCleaner hangs on s3 write > -- > > Key: SPARK-18343 > URL: https://issues.apache.org/jira/browse/SPARK-18343 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.1 >
[jira] [Assigned] (SPARK-21159) Cluster mode, driver throws connection refused exception submitted by SparkLauncher
[ https://issues.apache.org/jira/browse/SPARK-21159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21159: Assignee: Apache Spark > Cluster mode, driver throws connection refused exception submitted by > SparkLauncher > --- > > Key: SPARK-21159 > URL: https://issues.apache.org/jira/browse/SPARK-21159 > Project: Spark > Issue Type: Bug > Components: Spark Core, Spark Submit >Affects Versions: 2.1.0 > Environment: Server A-Master > Server B-Slave >Reporter: niefei >Assignee: Apache Spark > > When an spark application submitted by SparkLauncher#startApplication method, > this will get a SparkAppHandle. In the test environment, the launcher runs on > server A, if it runs in Client mode, everything is ok. In cluster mode, the > launcher will run on Server A, and the driver will be run on Server B, in > this scenario, when initialize SparkContext, a LauncherBackend will try to > connect to the launcher application via specified port and ip address. the > problem is the implementation of LauncherBackend uses loopback ip to connect > which is 127.0.0.1. this will cause the connection refused as server B never > ran the launcher. > The expected behavior is the LauncherBackend should use Server A's Ip address > to connect for reporting the running status. > Below is the stacktrace: > 17/06/20 17:24:37 ERROR SparkContext: Error initializing SparkContext. > java.net.ConnectException: Connection refused > at java.net.PlainSocketImpl.socketConnect(Native Method) > at > java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350) > at > java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206) > at > java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188) > at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) > at java.net.Socket.connect(Socket.java:589) > at java.net.Socket.connect(Socket.java:538) > at java.net.Socket.(Socket.java:434) > at java.net.Socket.(Socket.java:244) > at > org.apache.spark.launcher.LauncherBackend.connect(LauncherBackend.scala:43) > at > org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend.start(StandaloneSchedulerBackend.scala:60) > at > org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:156) > at org.apache.spark.SparkContext.(SparkContext.scala:509) > at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2313) > at > org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:868) > at > org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:860) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:860) > at > com.asura.grinder.datatask.task.AbstractCommonSparkTask.executeSparkJob(AbstractCommonSparkTask.scala:91) > at > com.asura.grinder.datatask.task.AbstractCommonSparkTask.runSparkJob(AbstractCommonSparkTask.scala:25) > at com.asura.grinder.datatask.main.TaskMain$.main(TaskMain.scala:61) > at com.asura.grinder.datatask.main.TaskMain.main(TaskMain.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:58) > at > 
org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala) > 17/06/20 17:24:37 INFO SparkUI: Stopped Spark web UI at > http://172.25.108.62:4040 > 17/06/20 17:24:37 INFO StandaloneSchedulerBackend: Shutting down all executors > 17/06/20 17:24:37 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asking > each executor to shut down > 17/06/20 17:24:37 ERROR Utils: Uncaught exception in thread main > java.lang.NullPointerException > at > org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend.org$apache$spark$scheduler$cluster$StandaloneSchedulerBackend$$stop(StandaloneSchedulerBackend.scala:214) > at > org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend.stop(StandaloneSchedulerBackend.scala:116) > at > org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:467) > at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1588) > at > org.apache.spark.SparkContext$$anonfun$stop$8.apply$mcV$sp(SparkContext.scala:1826) > at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.sca
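For reference, the code path described in SPARK-21159 is the one exercised by a launcher program along the lines of the sketch below (the resource path, main class, and master URL are placeholders); in cluster mode the driver later runs on server B, and it is that process whose LauncherBackend tries to connect back to the launcher on server A.
{code}
// Illustrative launcher-side sketch; all resource names are placeholders.
import org.apache.spark.launcher.SparkLauncher

val handle = new SparkLauncher()
  .setAppResource("/path/to/app.jar")
  .setMainClass("com.example.Main")
  .setMaster("spark://server-a:7077")
  .setDeployMode("cluster")
  .startApplication()   // returns a SparkAppHandle used to report status back to server A
{code}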
[jira] [Assigned] (SPARK-21159) Cluster mode, driver throws connection refused exception submitted by SparkLauncher
[ https://issues.apache.org/jira/browse/SPARK-21159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21159: Assignee: (was: Apache Spark) > Cluster mode, driver throws connection refused exception submitted by > SparkLauncher > --- > > Key: SPARK-21159 > URL: https://issues.apache.org/jira/browse/SPARK-21159 > Project: Spark > Issue Type: Bug > Components: Spark Core, Spark Submit >Affects Versions: 2.1.0 > Environment: Server A-Master > Server B-Slave >Reporter: niefei > > When an spark application submitted by SparkLauncher#startApplication method, > this will get a SparkAppHandle. In the test environment, the launcher runs on > server A, if it runs in Client mode, everything is ok. In cluster mode, the > launcher will run on Server A, and the driver will be run on Server B, in > this scenario, when initialize SparkContext, a LauncherBackend will try to > connect to the launcher application via specified port and ip address. the > problem is the implementation of LauncherBackend uses loopback ip to connect > which is 127.0.0.1. this will cause the connection refused as server B never > ran the launcher. > The expected behavior is the LauncherBackend should use Server A's Ip address > to connect for reporting the running status. > Below is the stacktrace: > 17/06/20 17:24:37 ERROR SparkContext: Error initializing SparkContext. > java.net.ConnectException: Connection refused > at java.net.PlainSocketImpl.socketConnect(Native Method) > at > java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350) > at > java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206) > at > java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188) > at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) > at java.net.Socket.connect(Socket.java:589) > at java.net.Socket.connect(Socket.java:538) > at java.net.Socket.(Socket.java:434) > at java.net.Socket.(Socket.java:244) > at > org.apache.spark.launcher.LauncherBackend.connect(LauncherBackend.scala:43) > at > org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend.start(StandaloneSchedulerBackend.scala:60) > at > org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:156) > at org.apache.spark.SparkContext.(SparkContext.scala:509) > at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2313) > at > org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:868) > at > org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:860) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:860) > at > com.asura.grinder.datatask.task.AbstractCommonSparkTask.executeSparkJob(AbstractCommonSparkTask.scala:91) > at > com.asura.grinder.datatask.task.AbstractCommonSparkTask.runSparkJob(AbstractCommonSparkTask.scala:25) > at com.asura.grinder.datatask.main.TaskMain$.main(TaskMain.scala:61) > at com.asura.grinder.datatask.main.TaskMain.main(TaskMain.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:58) > at > org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala) > 17/06/20 
17:24:37 INFO SparkUI: Stopped Spark web UI at > http://172.25.108.62:4040 > 17/06/20 17:24:37 INFO StandaloneSchedulerBackend: Shutting down all executors > 17/06/20 17:24:37 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asking > each executor to shut down > 17/06/20 17:24:37 ERROR Utils: Uncaught exception in thread main > java.lang.NullPointerException > at > org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend.org$apache$spark$scheduler$cluster$StandaloneSchedulerBackend$$stop(StandaloneSchedulerBackend.scala:214) > at > org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend.stop(StandaloneSchedulerBackend.scala:116) > at > org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:467) > at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1588) > at > org.apache.spark.SparkContext$$anonfun$stop$8.apply$mcV$sp(SparkContext.scala:1826) > at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1283) > at org.a
[jira] [Commented] (SPARK-21159) Cluster mode, driver throws connection refused exception submitted by SparkLauncher
[ https://issues.apache.org/jira/browse/SPARK-21159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16060208#comment-16060208 ] Apache Spark commented on SPARK-21159: -- User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/18397 > Cluster mode, driver throws connection refused exception submitted by > SparkLauncher > --- > > Key: SPARK-21159 > URL: https://issues.apache.org/jira/browse/SPARK-21159 > Project: Spark > Issue Type: Bug > Components: Spark Core, Spark Submit >Affects Versions: 2.1.0 > Environment: Server A-Master > Server B-Slave >Reporter: niefei > > When an spark application submitted by SparkLauncher#startApplication method, > this will get a SparkAppHandle. In the test environment, the launcher runs on > server A, if it runs in Client mode, everything is ok. In cluster mode, the > launcher will run on Server A, and the driver will be run on Server B, in > this scenario, when initialize SparkContext, a LauncherBackend will try to > connect to the launcher application via specified port and ip address. the > problem is the implementation of LauncherBackend uses loopback ip to connect > which is 127.0.0.1. this will cause the connection refused as server B never > ran the launcher. > The expected behavior is the LauncherBackend should use Server A's Ip address > to connect for reporting the running status. > Below is the stacktrace: > 17/06/20 17:24:37 ERROR SparkContext: Error initializing SparkContext. > java.net.ConnectException: Connection refused > at java.net.PlainSocketImpl.socketConnect(Native Method) > at > java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350) > at > java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206) > at > java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188) > at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) > at java.net.Socket.connect(Socket.java:589) > at java.net.Socket.connect(Socket.java:538) > at java.net.Socket.(Socket.java:434) > at java.net.Socket.(Socket.java:244) > at > org.apache.spark.launcher.LauncherBackend.connect(LauncherBackend.scala:43) > at > org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend.start(StandaloneSchedulerBackend.scala:60) > at > org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:156) > at org.apache.spark.SparkContext.(SparkContext.scala:509) > at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2313) > at > org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:868) > at > org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:860) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:860) > at > com.asura.grinder.datatask.task.AbstractCommonSparkTask.executeSparkJob(AbstractCommonSparkTask.scala:91) > at > com.asura.grinder.datatask.task.AbstractCommonSparkTask.runSparkJob(AbstractCommonSparkTask.scala:25) > at com.asura.grinder.datatask.main.TaskMain$.main(TaskMain.scala:61) > at com.asura.grinder.datatask.main.TaskMain.main(TaskMain.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > 
org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:58) > at > org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala) > 17/06/20 17:24:37 INFO SparkUI: Stopped Spark web UI at > http://172.25.108.62:4040 > 17/06/20 17:24:37 INFO StandaloneSchedulerBackend: Shutting down all executors > 17/06/20 17:24:37 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asking > each executor to shut down > 17/06/20 17:24:37 ERROR Utils: Uncaught exception in thread main > java.lang.NullPointerException > at > org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend.org$apache$spark$scheduler$cluster$StandaloneSchedulerBackend$$stop(StandaloneSchedulerBackend.scala:214) > at > org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend.stop(StandaloneSchedulerBackend.scala:116) > at > org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:467) > at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1588) > at > org.apache.spark.SparkContext$$anonfun$stop$8.apply$mcV$sp(SparkC
[jira] [Resolved] (SPARK-13534) Implement Apache Arrow serializer for Spark DataFrame for use in DataFrame.toPandas
[ https://issues.apache.org/jira/browse/SPARK-13534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-13534. - Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 15821 [https://github.com/apache/spark/pull/15821] > Implement Apache Arrow serializer for Spark DataFrame for use in > DataFrame.toPandas > --- > > Key: SPARK-13534 > URL: https://issues.apache.org/jira/browse/SPARK-13534 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 2.1.0 >Reporter: Wes McKinney > Fix For: 2.3.0 > > Attachments: benchmark.py > > > The current code path for accessing Spark DataFrame data in Python using > PySpark passes through an inefficient serialization-deserialiation process > that I've examined at a high level here: > https://gist.github.com/wesm/0cb5531b1c2e346a0007. Currently, RDD[Row] > objects are being deserialized in pure Python as a list of tuples, which are > then converted to pandas.DataFrame using its {{from_records}} alternate > constructor. This also uses a large amount of memory. > For flat (no nested types) schemas, the Apache Arrow memory layout > (https://github.com/apache/arrow/tree/master/format) can be deserialized to > {{pandas.DataFrame}} objects with comparatively small overhead compared with > memcpy / system memory bandwidth -- Arrow's bitmasks must be examined, > replacing the corresponding null values with pandas's sentinel values (None > or NaN as appropriate). > I will be contributing patches to Arrow in the coming weeks for converting > between Arrow and pandas in the general case, so if Spark can send Arrow > memory to PySpark, we will hopefully be able to increase the Python data > access throughput by an order of magnitude or more. I propose to add an new > serializer for Spark DataFrame and a new method that can be invoked from > PySpark to request a Arrow memory-layout byte stream, prefixed by a data > header indicating array buffer offsets and sizes. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13534) Implement Apache Arrow serializer for Spark DataFrame for use in DataFrame.toPandas
[ https://issues.apache.org/jira/browse/SPARK-13534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-13534: --- Assignee: Bryan Cutler > Implement Apache Arrow serializer for Spark DataFrame for use in > DataFrame.toPandas > --- > > Key: SPARK-13534 > URL: https://issues.apache.org/jira/browse/SPARK-13534 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 2.1.0 >Reporter: Wes McKinney >Assignee: Bryan Cutler > Fix For: 2.3.0 > > Attachments: benchmark.py > > > The current code path for accessing Spark DataFrame data in Python using > PySpark passes through an inefficient serialization-deserialiation process > that I've examined at a high level here: > https://gist.github.com/wesm/0cb5531b1c2e346a0007. Currently, RDD[Row] > objects are being deserialized in pure Python as a list of tuples, which are > then converted to pandas.DataFrame using its {{from_records}} alternate > constructor. This also uses a large amount of memory. > For flat (no nested types) schemas, the Apache Arrow memory layout > (https://github.com/apache/arrow/tree/master/format) can be deserialized to > {{pandas.DataFrame}} objects with comparatively small overhead compared with > memcpy / system memory bandwidth -- Arrow's bitmasks must be examined, > replacing the corresponding null values with pandas's sentinel values (None > or NaN as appropriate). > I will be contributing patches to Arrow in the coming weeks for converting > between Arrow and pandas in the general case, so if Spark can send Arrow > memory to PySpark, we will hopefully be able to increase the Python data > access throughput by an order of magnitude or more. I propose to add an new > serializer for Spark DataFrame and a new method that can be invoked from > PySpark to request a Arrow memory-layout byte stream, prefixed by a data > header indicating array buffer offsets and sizes. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21185) Spurious errors in unidoc causing PRs to fail
Tathagata Das created SPARK-21185: - Summary: Spurious errors in unidoc causing PRs to fail Key: SPARK-21185 URL: https://issues.apache.org/jira/browse/SPARK-21185 Project: Spark Issue Type: Bug Components: Build Affects Versions: 2.2.0 Reporter: Tathagata Das Assignee: Hyukjin Kwon Some PRs are failing because of unidoc throwing random errors. When GenJavaDoc generates Java files from Scala files, the generated java files can have errors in them. When JavaDoc attempts to generate docs on these generated java files, it throws errors. Usually, the errors are marked as warnings, so the unidoc does not fail the build. Example - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78270/consoleFull {code} [info] Constructing Javadoc information... [warn] /home/jenkins/workspace/SparkPullRequestBuilder/core/target/java/org/apache/spark/scheduler/BlacklistTracker.java:117: error: ExecutorAllocationClient is not public in org.apache.spark; cannot be accessed from outside package [warn] public BlacklistTracker (org.apache.spark.scheduler.LiveListenerBus listenerBus, org.apache.spark.SparkConf conf, scala.Option allocationClient, org.apache.spark.util.Clock clock) { throw new RuntimeException(); } [warn] ^ [warn] /home/jenkins/workspace/SparkPullRequestBuilder/core/target/java/org/apache/spark/scheduler/BlacklistTracker.java:118: error: ExecutorAllocationClient is not public in org.apache.spark; cannot be accessed from outside package [warn] public BlacklistTracker (org.apache.spark.SparkContext sc, scala.Option allocationClient) { throw new RuntimeException(); } {code} However in some PR builds these are marked as errors, thus causing the build to fail due to unidoc. Example - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78484/consoleFull {code} [info] Constructing Javadoc information... [error] /home/jenkins/workspace/SparkPullRequestBuilder@3/core/target/java/org/apache/spark/scheduler/BlacklistTracker.java:117: error: ExecutorAllocationClient is not public in org.apache.spark; cannot be accessed from outside package [error] public BlacklistTracker (org.apache.spark.scheduler.LiveListenerBus listenerBus, org.apache.spark.SparkConf conf, scala.Option allocationClient, org.apache.spark.util.Clock clock) { throw new RuntimeException(); } [error] ^ [error] /home/jenkins/workspace/SparkPullRequestBuilder@3/core/target/java/org/apache/spark/scheduler/BlacklistTracker.java:118: error: ExecutorAllocationClient is not public in org.apache.spark; cannot be accessed from outside package [error] public BlacklistTracker (org.apache.spark.SparkContext sc, scala.Option allocationClient) { throw new RuntimeException(); } [error] {code} https://github.com/apache/spark/pull/18355 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21185) Spurious errors in unidoc causing PRs to fail
[ https://issues.apache.org/jira/browse/SPARK-21185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-21185: -- Description: Some PRs are failing because of unidoc throwing random errors. When GenJavaDoc generates Java files from Scala files, the generated java files can have errors in them. When JavaDoc attempts to generate docs on these generated java files, it throws errors. Usually, the errors are marked as warnings, so the unidoc does not fail the build. Example - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78270/consoleFull {code} [info] Constructing Javadoc information... [warn] /home/jenkins/workspace/SparkPullRequestBuilder/core/target/java/org/apache/spark/scheduler/BlacklistTracker.java:117: error: ExecutorAllocationClient is not public in org.apache.spark; cannot be accessed from outside package [warn] public BlacklistTracker (org.apache.spark.scheduler.LiveListenerBus listenerBus, org.apache.spark.SparkConf conf, scala.Option allocationClient, org.apache.spark.util.Clock clock) { throw new RuntimeException(); } [warn] ^ [warn] /home/jenkins/workspace/SparkPullRequestBuilder/core/target/java/org/apache/spark/scheduler/BlacklistTracker.java:118: error: ExecutorAllocationClient is not public in org.apache.spark; cannot be accessed from outside package [warn] public BlacklistTracker (org.apache.spark.SparkContext sc, scala.Option allocationClient) { throw new RuntimeException(); } {code} However in some PR builds these are marked as errors, thus causing the build to fail due to unidoc. Example - https://github.com/apache/spark/pull/18355 https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78484/consoleFull {code} [info] Constructing Javadoc information... [error] /home/jenkins/workspace/SparkPullRequestBuilder@3/core/target/java/org/apache/spark/scheduler/BlacklistTracker.java:117: error: ExecutorAllocationClient is not public in org.apache.spark; cannot be accessed from outside package [error] public BlacklistTracker (org.apache.spark.scheduler.LiveListenerBus listenerBus, org.apache.spark.SparkConf conf, scala.Option allocationClient, org.apache.spark.util.Clock clock) { throw new RuntimeException(); } [error] ^ [error] /home/jenkins/workspace/SparkPullRequestBuilder@3/core/target/java/org/apache/spark/scheduler/BlacklistTracker.java:118: error: ExecutorAllocationClient is not public in org.apache.spark; cannot be accessed from outside package [error] public BlacklistTracker (org.apache.spark.SparkContext sc, scala.Option allocationClient) { throw new RuntimeException(); } [error] {code} was: Some PRs are failing because of unidoc throwing random errors. When GenJavaDoc generates Java files from Scala files, the generated java files can have errors in them. When JavaDoc attempts to generate docs on these generated java files, it throws errors. Usually, the errors are marked as warnings, so the unidoc does not fail the build. Example - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78270/consoleFull {code} [info] Constructing Javadoc information... 
[warn] /home/jenkins/workspace/SparkPullRequestBuilder/core/target/java/org/apache/spark/scheduler/BlacklistTracker.java:117: error: ExecutorAllocationClient is not public in org.apache.spark; cannot be accessed from outside package [warn] public BlacklistTracker (org.apache.spark.scheduler.LiveListenerBus listenerBus, org.apache.spark.SparkConf conf, scala.Option allocationClient, org.apache.spark.util.Clock clock) { throw new RuntimeException(); } [warn] ^ [warn] /home/jenkins/workspace/SparkPullRequestBuilder/core/target/java/org/apache/spark/scheduler/BlacklistTracker.java:118: error: ExecutorAllocationClient is not public in org.apache.spark; cannot be accessed from outside package [warn] public BlacklistTracker (org.apache.spark.SparkContext sc, scala.Option allocationClient) { throw new RuntimeException(); } {code} However in some PR builds these are marked as errors, thus causing the build to fail due to unidoc. Example - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78484/consoleFull {code} [info] Constructing Javadoc information... [error] /home/jenkins/workspace/SparkPullRequestBuilder@3/core/target/java/org/apache/spark/scheduler/BlacklistTracker.java:117: error: ExecutorAllocationClient is not public i
[jira] [Assigned] (SPARK-20923) TaskMetrics._updatedBlockStatuses uses a lot of memory
[ https://issues.apache.org/jira/browse/SPARK-20923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-20923: --- Assignee: Thomas Graves > TaskMetrics._updatedBlockStatuses uses a lot of memory > -- > > Key: SPARK-20923 > URL: https://issues.apache.org/jira/browse/SPARK-20923 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Thomas Graves >Assignee: Thomas Graves > > The driver appears to use a ton of memory in certain cases to store the task > metrics updated block status'. For instance I had a user reading data form > hive and caching it. The # of tasks to read was around 62,000, they were > using 1000 executors and it ended up caching a couple TB's of data. The > driver kept running out of memory. > I investigated and it looks like there was 5GB of a 10GB heap being used up > by the TaskMetrics._updatedBlockStatuses because there are a lot of blocks. > The updatedBlockStatuses was already removed from the task end event under > SPARK-20084. I don't see anything else that seems to be using this. Anybody > know if I missed something? > If its not being used we should remove it, otherwise we need to figure out a > better way of doing it so it doesn't use so much memory. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20923) TaskMetrics._updatedBlockStatuses uses a lot of memory
[ https://issues.apache.org/jira/browse/SPARK-20923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-20923. - Resolution: Fixed Fix Version/s: 2.3.0 > TaskMetrics._updatedBlockStatuses uses a lot of memory > -- > > Key: SPARK-20923 > URL: https://issues.apache.org/jira/browse/SPARK-20923 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Thomas Graves >Assignee: Thomas Graves > Labels: releasenotes > Fix For: 2.3.0 > > > The driver appears to use a ton of memory in certain cases to store the task > metrics updated block status'. For instance I had a user reading data form > hive and caching it. The # of tasks to read was around 62,000, they were > using 1000 executors and it ended up caching a couple TB's of data. The > driver kept running out of memory. > I investigated and it looks like there was 5GB of a 10GB heap being used up > by the TaskMetrics._updatedBlockStatuses because there are a lot of blocks. > The updatedBlockStatuses was already removed from the task end event under > SPARK-20084. I don't see anything else that seems to be using this. Anybody > know if I missed something? > If its not being used we should remove it, otherwise we need to figure out a > better way of doing it so it doesn't use so much memory. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20923) TaskMetrics._updatedBlockStatuses uses a lot of memory
[ https://issues.apache.org/jira/browse/SPARK-20923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-20923: Labels: releasenotes (was: ) > TaskMetrics._updatedBlockStatuses uses a lot of memory > -- > > Key: SPARK-20923 > URL: https://issues.apache.org/jira/browse/SPARK-20923 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Thomas Graves >Assignee: Thomas Graves > Labels: releasenotes > Fix For: 2.3.0 > > > The driver appears to use a ton of memory in certain cases to store the task > metrics updated block status'. For instance I had a user reading data form > hive and caching it. The # of tasks to read was around 62,000, they were > using 1000 executors and it ended up caching a couple TB's of data. The > driver kept running out of memory. > I investigated and it looks like there was 5GB of a 10GB heap being used up > by the TaskMetrics._updatedBlockStatuses because there are a lot of blocks. > The updatedBlockStatuses was already removed from the task end event under > SPARK-20084. I don't see anything else that seems to be using this. Anybody > know if I missed something? > If its not being used we should remove it, otherwise we need to figure out a > better way of doing it so it doesn't use so much memory. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20923) TaskMetrics._updatedBlockStatuses uses a lot of memory
[ https://issues.apache.org/jira/browse/SPARK-20923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16060283#comment-16060283 ] Wenchen Fan commented on SPARK-20923: - This patch changes the public behavior, and we should mention it in the release notes. Basically, users can track the status of updated blocks via the {{SparkListenerTaskEnd}} event, but this feature was introduced for internal usage at the beginning, and I'm wondering how many users are using it. After this patch we don't track it anymore by default; users can still turn it on by setting {{spark.taskMetrics.trackUpdatedBlockStatuses}} to true. > TaskMetrics._updatedBlockStatuses uses a lot of memory > -- > > Key: SPARK-20923 > URL: https://issues.apache.org/jira/browse/SPARK-20923 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Thomas Graves >Assignee: Thomas Graves > Labels: releasenotes > Fix For: 2.3.0 > > > The driver appears to use a ton of memory in certain cases to store the task > metrics updated block status'. For instance I had a user reading data form > hive and caching it. The # of tasks to read was around 62,000, they were > using 1000 executors and it ended up caching a couple TB's of data. The > driver kept running out of memory. > I investigated and it looks like there was 5GB of a 10GB heap being used up > by the TaskMetrics._updatedBlockStatuses because there are a lot of blocks. > The updatedBlockStatuses was already removed from the task end event under > SPARK-20084. I don't see anything else that seems to be using this. Anybody > know if I missed something? > If its not being used we should remove it, otherwise we need to figure out a > better way of doing it so it doesn't use so much memory. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
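A minimal sketch, not part of the original thread, showing how the opt-in described in the comment above could be applied from PySpark; the configuration key is the one named in the comment, everything else is a placeholder.
{code}
# Hypothetical illustration: re-enable updated-block-status tracking, which is
# off by default after SPARK-20923, when building the session.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("track-updated-block-statuses-sketch")
         .config("spark.taskMetrics.trackUpdatedBlockStatuses", "true")
         .getOrCreate())
{code}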
[jira] [Resolved] (SPARK-21174) Validate sampling fraction in logical operator level
[ https://issues.apache.org/jira/browse/SPARK-21174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-21174. - Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 18387 [https://github.com/apache/spark/pull/18387] > Validate sampling fraction in logical operator level > > > Key: SPARK-21174 > URL: https://issues.apache.org/jira/browse/SPARK-21174 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Gengliang Wang >Priority: Minor > Fix For: 2.3.0 > > > Currently the validation of sampling fraction in dataset is incomplete. > As an improvement, validate sampling ratio in logical operator level: > 1) if with replacement: ratio should be nonnegative > 2) else: ratio should be on interval [0, 1] > Also add test cases for the validation. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21174) Validate sampling fraction in logical operator level
[ https://issues.apache.org/jira/browse/SPARK-21174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-21174: --- Assignee: Gengliang Wang > Validate sampling fraction in logical operator level > > > Key: SPARK-21174 > URL: https://issues.apache.org/jira/browse/SPARK-21174 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Minor > Fix For: 2.3.0 > > > Currently the validation of sampling fraction in dataset is incomplete. > As an improvement, validate sampling ratio in logical operator level: > 1) if with replacement: ratio should be nonnegative > 2) else: ratio should be on interval [0, 1] > Also add test cases for the validation. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21185) Spurious errors in unidoc causing PRs to fail
[ https://issues.apache.org/jira/browse/SPARK-21185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16060296#comment-16060296 ] Hyukjin Kwon commented on SPARK-21185: -- Yea, I believe this is a duplicate of https://issues.apache.org/jira/browse/SPARK-20840 > Spurious errors in unidoc causing PRs to fail > - > > Key: SPARK-21185 > URL: https://issues.apache.org/jira/browse/SPARK-21185 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.2.0 >Reporter: Tathagata Das >Assignee: Hyukjin Kwon > > Some PRs are failing because of unidoc throwing random errors. When > GenJavaDoc generates Java files from Scala files, the generated java files > can have errors in them. When JavaDoc attempts to generate docs on these > generated java files, it throws errors. Usually, the errors are marked as > warnings, so the unidoc does not fail the build. > Example - > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78270/consoleFull > {code} > [info] Constructing Javadoc information... > [warn] > /home/jenkins/workspace/SparkPullRequestBuilder/core/target/java/org/apache/spark/scheduler/BlacklistTracker.java:117: > error: ExecutorAllocationClient is not public in org.apache.spark; cannot be > accessed from outside package > [warn] public BlacklistTracker > (org.apache.spark.scheduler.LiveListenerBus listenerBus, > org.apache.spark.SparkConf conf, > scala.Option allocationClient, > org.apache.spark.util.Clock clock) { throw new RuntimeException(); } > [warn] > ^ > [warn] > /home/jenkins/workspace/SparkPullRequestBuilder/core/target/java/org/apache/spark/scheduler/BlacklistTracker.java:118: > error: ExecutorAllocationClient is not public in org.apache.spark; cannot be > accessed from outside package > [warn] public BlacklistTracker (org.apache.spark.SparkContext sc, > scala.Option allocationClient) { > throw new RuntimeException(); } > {code} > However in some PR builds these are marked as errors, thus causing the build > to fail due to unidoc. Example - > https://github.com/apache/spark/pull/18355 > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/78484/consoleFull > {code} > [info] Constructing Javadoc information... > [error] > /home/jenkins/workspace/SparkPullRequestBuilder@3/core/target/java/org/apache/spark/scheduler/BlacklistTracker.java:117: > error: ExecutorAllocationClient is not public in org.apache.spark; cannot be > accessed from outside package > [error] public BlacklistTracker > (org.apache.spark.scheduler.LiveListenerBus listenerBus, > org.apache.spark.SparkConf conf, > scala.Option allocationClient, > org.apache.spark.util.Clock clock) { throw new RuntimeException(); } > [error] > ^ > [error] > /home/jenkins/workspace/SparkPullRequestBuilder@3/core/target/java/org/apache/spark/scheduler/BlacklistTracker.java:118: > error: ExecutorAllocationClient is not public in org.apache.spark; cannot be > accessed from outside package > [error] public BlacklistTracker (org.apache.spark.SparkContext sc, > scala.Option allocationClient) { > throw new RuntimeException(); } > [error] > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20599) ConsoleSink should work with write (batch)
[ https://issues.apache.org/jira/browse/SPARK-20599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu reassigned SPARK-20599: Assignee: Lubo Zhang > ConsoleSink should work with write (batch) > -- > > Key: SPARK-20599 > URL: https://issues.apache.org/jira/browse/SPARK-20599 > Project: Spark > Issue Type: Improvement > Components: SQL, Structured Streaming >Affects Versions: 2.2.0 >Reporter: Jacek Laskowski >Assignee: Lubo Zhang >Priority: Minor > Labels: starter > Fix For: 2.3.0 > > > I think the following should just work. > {code} > spark. > read. // <-- it's a batch query not streaming query if that matters > format("kafka"). > option("subscribe", "topic1"). > option("kafka.bootstrap.servers", "localhost:9092"). > load. > write. > format("console"). // <-- that's not supported currently > save > {code} > The above combination of {{kafka}} source and {{console}} sink leads to the > following exception: > {code} > java.lang.RuntimeException: > org.apache.spark.sql.execution.streaming.ConsoleSinkProvider does not allow > create table as select. > at scala.sys.package$.error(package.scala:27) > at > org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:479) > at > org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:48) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:93) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:93) > at > org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:610) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233) > ... 48 elided > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
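A minimal PySpark sketch, not part of the original ticket, of the batch Kafka-to-console round trip the Scala snippet above describes; it assumes the console format accepts batch writes once the fix ships in 2.3.0, and the broker address and topic are placeholders.
{code}
# Hypothetical illustration of the usage this ticket asks for: a batch read
# from Kafka written to the console sink.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("console-batch-sink-sketch").getOrCreate()

(spark.read                      # a batch query, not a streaming one
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "topic1")
      .load()
      .write
      .format("console")         # rejected before this fix for batch writes
      .save())
{code}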
[jira] [Commented] (SPARK-13534) Implement Apache Arrow serializer for Spark DataFrame for use in DataFrame.toPandas
[ https://issues.apache.org/jira/browse/SPARK-13534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16060328#comment-16060328 ] Reynold Xin commented on SPARK-13534: - Was this done? I thought there are still other data types that are not supported. We should either turn this into an umbrella ticket, or create a new umbrella ticket. > Implement Apache Arrow serializer for Spark DataFrame for use in > DataFrame.toPandas > --- > > Key: SPARK-13534 > URL: https://issues.apache.org/jira/browse/SPARK-13534 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 2.1.0 >Reporter: Wes McKinney >Assignee: Bryan Cutler > Fix For: 2.3.0 > > Attachments: benchmark.py > > > The current code path for accessing Spark DataFrame data in Python using > PySpark passes through an inefficient serialization-deserialiation process > that I've examined at a high level here: > https://gist.github.com/wesm/0cb5531b1c2e346a0007. Currently, RDD[Row] > objects are being deserialized in pure Python as a list of tuples, which are > then converted to pandas.DataFrame using its {{from_records}} alternate > constructor. This also uses a large amount of memory. > For flat (no nested types) schemas, the Apache Arrow memory layout > (https://github.com/apache/arrow/tree/master/format) can be deserialized to > {{pandas.DataFrame}} objects with comparatively small overhead compared with > memcpy / system memory bandwidth -- Arrow's bitmasks must be examined, > replacing the corresponding null values with pandas's sentinel values (None > or NaN as appropriate). > I will be contributing patches to Arrow in the coming weeks for converting > between Arrow and pandas in the general case, so if Spark can send Arrow > memory to PySpark, we will hopefully be able to increase the Python data > access throughput by an order of magnitude or more. I propose to add an new > serializer for Spark DataFrame and a new method that can be invoked from > PySpark to request a Arrow memory-layout byte stream, prefixed by a data > header indicating array buffer offsets and sizes. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21171) Speculative task scheduling blocks the driver from handling normal tasks when a job has more than one hundred thousand tasks
[ https://issues.apache.org/jira/browse/SPARK-21171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16060348#comment-16060348 ] wangminfeng commented on SPARK-21171: - We have modified some code for this feature; I will add a benchmark soon. This is the first time I am contributing to the community, so please give me some time to learn the rules. Thank you. > Speculative task scheduling blocks the driver from handling normal tasks > when a job has more than one hundred thousand tasks > -- > > Key: SPARK-21171 > URL: https://issues.apache.org/jira/browse/SPARK-21171 > Project: Spark > Issue Type: Improvement > Components: Block Manager >Affects Versions: 2.0.0 > Environment: We have more than two hundred high-performance machines > handling more than 2 TB of data per query >Reporter: wangminfeng > > If a job has more than one hundred thousand tasks and spark.speculation is > true, then when speculatable tasks start, choosing a speculatable task wastes > a lot of time and blocks other tasks. We run ad-hoc queries for data analysis, > and we cannot tolerate one job wasting time, even if it is a large job. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21186) PySpark with --packages fails to import library due to lack of pythonpath to .ivy2/jars/*.jar
HanCheol Cho created SPARK-21186: Summary: PySpark with --packages fails to import library due to lack of pythonpath to .ivy2/jars/*.jar Key: SPARK-21186 URL: https://issues.apache.org/jira/browse/SPARK-21186 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.2.0 Environment: Spark is downloaded and compiled by myself. Spark: 2.2.0-SNAPSHOT Python: Anaconda Python2 (on client and workers) Reporter: HanCheol Cho Priority: Minor I experienced "ImportError: No module named sparkdl" exception while trying to use databricks' spark-deep-learning (sparkdl) in PySpark. The package is included with --packages option as below. {code} $ pyspark --packages databricks:spark-deep-learning:0.1.0-spark2.1-s_2.11 {code} The problem was that PySpark fails to detect this package's jar files located in .ivy2/jars/ directory. I could circumvent this issue by manually adding this path to PYTHONPATH after launching PySpark as follows. {code} >>> import sys, glob, os >>> sys.path.extend(glob.glob(os.path.join(os.path.expanduser("~"), >>> ".ivy2/jars/*.jar"))) >>> >>> import sparkdl Using TensorFlow backend. >>> my_images = sparkdl.readImages("data/flower_photos/daisy/*.jpg") >>> my_images.show() +++ |filePath| image| +++ |hdfs://mycluster/...|[RGB,263,320,3,[B...| |hdfs://mycluster/...|[RGB,313,500,3,[B...| |hdfs://mycluster/...|[RGB,215,320,3,[B...| ... {code} I think that it may be better to add ivy2/jar directory path to PYTHONPATH while launching PySpark. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21186) PySpark with --packages fails to import library due to lack of pythonpath to .ivy2/jars/*.jar
[ https://issues.apache.org/jira/browse/SPARK-21186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] HanCheol Cho updated SPARK-21186: - Description: I experienced "ImportError: No module named sparkdl" exception while trying to use databricks' spark-deep-learning (sparkdl) in PySpark. The package is included with --packages option as below. {code} $ pyspark --packages databricks:spark-deep-learning:0.1.0-spark2.1-s_2.11 >>> import sparkdl Traceback (most recent call last): File "", line 1, in ImportError: No module named sparkdl {code} The problem was that PySpark fails to detect this package's jar files located in .ivy2/jars/ directory. I could circumvent this issue by manually adding this path to PYTHONPATH after launching PySpark as follows. {code} $ pyspark --packages databricks:spark-deep-learning:0.1.0-spark2.1-s_2.11 >>> import sys, glob, os >>> sys.path.extend(glob.glob(os.path.join(os.path.expanduser("~"), >>> ".ivy2/jars/*.jar"))) >>> >>> import sparkdl Using TensorFlow backend. >>> my_images = sparkdl.readImages("data/flower_photos/daisy/*.jpg") >>> my_images.show() +++ |filePath| image| +++ |hdfs://mycluster/...|[RGB,263,320,3,[B...| |hdfs://mycluster/...|[RGB,313,500,3,[B...| |hdfs://mycluster/...|[RGB,215,320,3,[B...| ... {code} I think that it may be better to add ivy2/jar directory path to PYTHONPATH while launching PySpark. was: I experienced "ImportError: No module named sparkdl" exception while trying to use databricks' spark-deep-learning (sparkdl) in PySpark. The package is included with --packages option as below. {code} $ pyspark --packages databricks:spark-deep-learning:0.1.0-spark2.1-s_2.11 {code} The problem was that PySpark fails to detect this package's jar files located in .ivy2/jars/ directory. I could circumvent this issue by manually adding this path to PYTHONPATH after launching PySpark as follows. {code} >>> import sys, glob, os >>> sys.path.extend(glob.glob(os.path.join(os.path.expanduser("~"), >>> ".ivy2/jars/*.jar"))) >>> >>> import sparkdl Using TensorFlow backend. >>> my_images = sparkdl.readImages("data/flower_photos/daisy/*.jpg") >>> my_images.show() +++ |filePath| image| +++ |hdfs://mycluster/...|[RGB,263,320,3,[B...| |hdfs://mycluster/...|[RGB,313,500,3,[B...| |hdfs://mycluster/...|[RGB,215,320,3,[B...| ... {code} I think that it may be better to add ivy2/jar directory path to PYTHONPATH while launching PySpark. > PySpark with --packages fails to import library due to lack of pythonpath to > .ivy2/jars/*.jar > - > > Key: SPARK-21186 > URL: https://issues.apache.org/jira/browse/SPARK-21186 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.0 > Environment: Spark is downloaded and compiled by myself. > Spark: 2.2.0-SNAPSHOT > Python: Anaconda Python2 (on client and workers) >Reporter: HanCheol Cho >Priority: Minor > > I experienced "ImportError: No module named sparkdl" exception while trying > to use databricks' spark-deep-learning (sparkdl) in PySpark. > The package is included with --packages option as below. > {code} > $ pyspark --packages databricks:spark-deep-learning:0.1.0-spark2.1-s_2.11 > >>> import sparkdl > Traceback (most recent call last): > File "", line 1, in > ImportError: No module named sparkdl > {code} > The problem was that PySpark fails to detect this package's jar files located > in .ivy2/jars/ directory. > I could circumvent this issue by manually adding this path to PYTHONPATH > after launching PySpark as follows. 
> {code} > $ pyspark --packages databricks:spark-deep-learning:0.1.0-spark2.1-s_2.11 > >>> import sys, glob, os > >>> sys.path.extend(glob.glob(os.path.join(os.path.expanduser("~"), > >>> ".ivy2/jars/*.jar"))) > >>> > >>> import sparkdl > Using TensorFlow backend. > >>> my_images = sparkdl.readImages("data/flower_photos/daisy/*.jpg") > >>> my_images.show() > +++ > |filePath| image| > +++ > |hdfs://mycluster/...|[RGB,263,320,3,[B...| > |hdfs://mycluster/...|[RGB,313,500,3,[B...| > |hdfs://mycluster/...|[RGB,215,320,3,[B...| > ... > {code} > I think that it may be better to add ivy2/jar directory path to PYTHONPATH > while launching PySpark. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
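A minimal sketch, not part of the original report, of an alternative workaround; it assumes that {{SparkContext.addPyFile}} will both ship each downloaded jar and place it on the Python search path, and that the jar (being a zip archive) is importable via zipimport, neither of which the report verifies.
{code}
# Hypothetical illustration: register the --packages jars from ~/.ivy2/jars
# with the running SparkContext instead of editing sys.path by hand.
import glob
import os

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
for jar in glob.glob(os.path.join(os.path.expanduser("~"), ".ivy2", "jars", "*.jar")):
    # addPyFile distributes the archive and adds it to the import path
    spark.sparkContext.addPyFile(jar)

import sparkdl  # should now resolve the Python module packaged inside the jar
{code}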
[jira] [Updated] (SPARK-21066) LibSVM load just one input file
[ https://issues.apache.org/jira/browse/SPARK-21066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan Facai (颜发才) updated SPARK-21066: Priority: Trivial (was: Major) > LibSVM load just one input file > --- > > Key: SPARK-21066 > URL: https://issues.apache.org/jira/browse/SPARK-21066 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.1 >Reporter: darion yaphet >Priority: Trivial > > Currently, when we use SVM to train on a dataset, we found that only one > input file is allowed. > Files stored on a distributed file system such as HDFS are split into > multiple pieces, and I think this limit is not necessary. > We could join the input paths into a comma-separated string. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21066) LibSVM load just one input file
[ https://issues.apache.org/jira/browse/SPARK-21066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16060381#comment-16060381 ] Yan Facai (颜发才) commented on SPARK-21066: - Downgrading to Trivial, since specifying `numFeatures` should work. > LibSVM load just one input file > --- > > Key: SPARK-21066 > URL: https://issues.apache.org/jira/browse/SPARK-21066 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.1 >Reporter: darion yaphet >Priority: Trivial > > Currently, when we use SVM to train on a dataset, we found that only one > input file is allowed. > Files stored on a distributed file system such as HDFS are split into > multiple pieces, and I think this limit is not necessary. > We could join the input paths into a comma-separated string. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
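A minimal PySpark sketch, not part of the original ticket, of the multi-path load the comment above points to; it assumes that supplying {{numFeatures}} lets the libsvm source skip single-file schema inference, and the paths and feature count are placeholders.
{code}
# Hypothetical illustration: load several LibSVM part files in one call by
# fixing the feature dimension up front.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("libsvm-multi-path-sketch").getOrCreate()

df = (spark.read
      .format("libsvm")
      .option("numFeatures", "780")
      .load(["hdfs:///data/part-00000.libsvm",
             "hdfs:///data/part-00001.libsvm"]))
df.printSchema()
{code}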
[jira] [Created] (SPARK-21187) Complete support for remaining Spark data type in Arrow Converters
Bryan Cutler created SPARK-21187: Summary: Complete support for remaining Spark data type in Arrow Converters Key: SPARK-21187 URL: https://issues.apache.org/jira/browse/SPARK-21187 Project: Spark Issue Type: Umbrella Components: PySpark, SQL Affects Versions: 2.3.0 Reporter: Bryan Cutler This is to track adding the remaining type support in Arrow Converters. Currently, only primitive data types are supported. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-20738) Option to turn off building docs in sbt build.
[ https://issues.apache.org/jira/browse/SPARK-20738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Sharma closed SPARK-20738. --- Resolution: Won't Fix > Option to turn off building docs in sbt build. > -- > > Key: SPARK-20738 > URL: https://issues.apache.org/jira/browse/SPARK-20738 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.2.0 >Reporter: Prashant Sharma >Priority: Minor > Labels: sbt > > The default behavior is unchanged: if the option is not set, the docs are > built by default. > sbt publish-local tries to build the docs along with the other artifacts, and > since the codebase is updated without build checks for the sbt docs build, it > appears to be difficult to keep the docs building correctly with sbt. > > As an alternative, we can hide the building of docs behind an option, > `-Dbuild.docs=false`. > > This is also useful if someone uses sbt publish and does not need the > docs to be built, as building them is generally time consuming. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13534) Implement Apache Arrow serializer for Spark DataFrame for use in DataFrame.toPandas
[ https://issues.apache.org/jira/browse/SPARK-13534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16060413#comment-16060413 ] Bryan Cutler commented on SPARK-13534: -- That is correct [~rxin], this did not have support for complex types or date/timestamp. I created SPARK-21187 as an umbrella to track addition of all remaining types. > Implement Apache Arrow serializer for Spark DataFrame for use in > DataFrame.toPandas > --- > > Key: SPARK-13534 > URL: https://issues.apache.org/jira/browse/SPARK-13534 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 2.1.0 >Reporter: Wes McKinney >Assignee: Bryan Cutler > Fix For: 2.3.0 > > Attachments: benchmark.py > > > The current code path for accessing Spark DataFrame data in Python using > PySpark passes through an inefficient serialization-deserialiation process > that I've examined at a high level here: > https://gist.github.com/wesm/0cb5531b1c2e346a0007. Currently, RDD[Row] > objects are being deserialized in pure Python as a list of tuples, which are > then converted to pandas.DataFrame using its {{from_records}} alternate > constructor. This also uses a large amount of memory. > For flat (no nested types) schemas, the Apache Arrow memory layout > (https://github.com/apache/arrow/tree/master/format) can be deserialized to > {{pandas.DataFrame}} objects with comparatively small overhead compared with > memcpy / system memory bandwidth -- Arrow's bitmasks must be examined, > replacing the corresponding null values with pandas's sentinel values (None > or NaN as appropriate). > I will be contributing patches to Arrow in the coming weeks for converting > between Arrow and pandas in the general case, so if Spark can send Arrow > memory to PySpark, we will hopefully be able to increase the Python data > access throughput by an order of magnitude or more. I propose to add an new > serializer for Spark DataFrame and a new method that can be invoked from > PySpark to request a Arrow memory-layout byte stream, prefixed by a data > header indicating array buffer offsets and sizes. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21188) releaseAllLocksForTask should synchronize the whole method
Feng Liu created SPARK-21188: Summary: releaseAllLocksForTask should synchronize the whole method Key: SPARK-21188 URL: https://issues.apache.org/jira/browse/SPARK-21188 Project: Spark Issue Type: Bug Components: Block Manager, Spark Core Affects Versions: 2.1.0, 2.2.0 Reporter: Feng Liu Since the objects readLocksByTask, writeLocksByTask and infos are coupled and supposed to be modified by other threads concurrently, all the read and writes of them in the releaseAllLocksForTask method should be protected by a single synchronized block. The fine-grained synchronization in the current code can cause some test flakiness. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
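A minimal sketch, not part of the original ticket and not Spark's actual BlockInfoManager code, illustrating in Python why the coupled structures should be read and written under one lock: with fine-grained locking, another thread can observe one structure updated while the other still holds stale state.
{code}
# Hypothetical illustration of whole-method synchronization over two coupled
# structures, mirroring the hazard described in the issue.
import threading

class LockRegistry:
    def __init__(self):
        self._lock = threading.Lock()
        self.locks_by_task = {}   # task id -> set of block ids it holds locks on
        self.infos = {}           # block id -> task id currently holding the lock

    def release_all_locks_for_task(self, task_id):
        # One synchronized block covers every coupled read and write, so no
        # other thread can see locks_by_task cleared while infos still points
        # at the finished task.
        with self._lock:
            for block_id in self.locks_by_task.pop(task_id, set()):
                if self.infos.get(block_id) == task_id:
                    del self.infos[block_id]
{code}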
[jira] [Updated] (SPARK-21187) Complete support for remaining Spark data types in Arrow Converters
[ https://issues.apache.org/jira/browse/SPARK-21187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler updated SPARK-21187: - Summary: Complete support for remaining Spark data types in Arrow Converters (was: Complete support for remaining Spark data type in Arrow Converters) > Complete support for remaining Spark data types in Arrow Converters > --- > > Key: SPARK-21187 > URL: https://issues.apache.org/jira/browse/SPARK-21187 > Project: Spark > Issue Type: Umbrella > Components: PySpark, SQL >Affects Versions: 2.3.0 >Reporter: Bryan Cutler > > This is to track adding the remaining type support in Arrow Converters. > Currently, only primitive data types are supported. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21188) releaseAllLocksForTask should synchronize the whole method
[ https://issues.apache.org/jira/browse/SPARK-21188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21188: Assignee: Apache Spark > releaseAllLocksForTask should synchronize the whole method > -- > > Key: SPARK-21188 > URL: https://issues.apache.org/jira/browse/SPARK-21188 > Project: Spark > Issue Type: Bug > Components: Block Manager, Spark Core >Affects Versions: 2.1.0, 2.2.0 >Reporter: Feng Liu >Assignee: Apache Spark > > Since the objects readLocksByTask, writeLocksByTask and infos are coupled and > supposed to be modified by other threads concurrently, all the read and > writes of them in the releaseAllLocksForTask method should be protected by a > single synchronized block. The fine-grained synchronization in the current > code can cause some test flakiness. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org