[jira] [Commented] (SPARK-21742) BisectingKMeans generate different models with/without caching
[ https://issues.apache.org/jira/browse/SPARK-21742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16128393#comment-16128393 ] Sean Owen commented on SPARK-21742:
---
Is that a bug? Isn't it stochastic and dependent on the data order anyway, which could vary if the input varies? Neither answer is wrong.

> BisectingKMeans generate different models with/without caching
> --
>
>                 Key: SPARK-21742
>                 URL: https://issues.apache.org/jira/browse/SPARK-21742
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.3.0
>            Reporter: zhengruifeng
>
> I found that {{BisectingKMeans}} will generate different models depending on whether the input is cached.
> Using the same dataset as in {{BisectingKMeansSuite}}, we can find that if we cache the input, then the number of centers changes from 2 to 3.
> So it looks like a potential bug.
> {code}
> import org.apache.spark.ml.param.ParamMap
> import org.apache.spark.sql.Dataset
> import org.apache.spark.ml.clustering._
> import org.apache.spark.ml.linalg._
> import scala.util.Random
>
> case class TestRow(features: org.apache.spark.ml.linalg.Vector)
>
> val rows = 10
> val dim = 1000
> val seed = 42
> val random = new Random(seed)
> val nnz = random.nextInt(dim)
> val rdd = sc.parallelize(1 to rows).map(i => Vectors.sparse(dim,
>   random.shuffle(0 to dim - 1).slice(0, nnz).sorted.toArray,
>   Array.fill(nnz)(random.nextDouble()))).map(v => new TestRow(v))
> val sparseDataset = spark.createDataFrame(rdd)
>
> val k = 5
> val bkm = new BisectingKMeans().setK(k).setMinDivisibleClusterSize(4).setMaxIter(4).setSeed(123)
>
> val model = bkm.fit(sparseDataset)
> model.clusterCenters.length
> res0: Int = 2
>
> sparseDataset.persist()
> val model = bkm.fit(sparseDataset)
> model.clusterCenters.length
> res2: Int = 3
> {code}
> [~imatiach]

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
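One plausible mechanism for a cache-dependent model, visible in the repro itself: the dataset is generated lazily from a shared, stateful `Random` captured in the closure, so every recomputation of the uncached RDD advances the RNG and produces new data, while persisting freezes one materialization. A minimal JVM sketch of that mechanism (plain Java, no Spark; this is a hypothesis about the repro above, not an analysis of BisectingKMeans internals):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.function.Supplier;

public class LazyRngDemo {
    // Evaluate a lazily generated "dataset" twice and report whether both
    // evaluations saw the same data.
    static boolean sameAcrossEvaluations(boolean cached) {
        Random random = new Random(42); // shared, stateful RNG in the closure
        Supplier<List<Double>> lazyData = () -> {
            List<Double> xs = new ArrayList<>();
            for (int i = 0; i < 5; i++) {
                xs.add(random.nextDouble()); // advances the shared RNG
            }
            return xs;
        };
        Supplier<List<Double>> data;
        if (cached) {
            // Materialize once and reuse, like persist() followed by an action.
            List<Double> materialized = lazyData.get();
            data = () -> materialized;
        } else {
            // Recompute on every evaluation, advancing the RNG each time.
            data = lazyData;
        }
        return data.get().equals(data.get());
    }

    public static void main(String[] args) {
        System.out.println("uncached stable: " + sameAcrossEvaluations(false)); // false
        System.out.println("cached stable:   " + sameAcrossEvaluations(true));  // true
    }
}
```

If this is what happens here, the model difference would come from the two fits effectively seeing different datasets, rather than from the clustering algorithm itself.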
[jira] [Commented] (SPARK-21736) Spark 2.2 in Windows does not recognize the URI "file:Z:\SIT1\TreatmentManager\app.properties"
[ https://issues.apache.org/jira/browse/SPARK-21736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16128392#comment-16128392 ] Sean Owen commented on SPARK-21736:
---
I am still not sure why you're saying it relates to Spark. The prop and file are parsed by your app code.

> Spark 2.2 in Windows does not recognize the URI "file:Z:\SIT1\TreatmentManager\app.properties"
> --
>
>                 Key: SPARK-21736
>                 URL: https://issues.apache.org/jira/browse/SPARK-21736
>             Project: Spark
>          Issue Type: Bug
>          Components: Windows
>    Affects Versions: 2.2.0
>         Environment: OS: Windows Server 2012
> JVM: Oracle Java 1.8 (HotSpot)
>            Reporter: Ross Brigoli
>
> We are currently running Spark 2.1 in Windows Server 2012 in production. While trying to upgrade to Spark 2.2.0, we started getting this error after submitting a job in stand-alone cluster mode:
> Could not load properties; nested exception is java.io.IOException: java.net.URISyntaxException: Illegal character in opaque part at index 7: file:Z:\SIT1\TreatmentManager\app.properties
> We are passing a -D argument to spark-submit with a value of file:Z:\SIT1\TreatmentManager\app.properties
> *These URI -D arguments were working fine in Spark 2.1.0.*
> UPDATE: I replaced the \ with / and it works.
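For what it's worth, the reported failure is reproducible with plain {{java.net.URI}}, independent of Spark: backslashes are not legal URI characters, so the Windows-style form is rejected at index 7 (the first `\`), while the forward-slash forms parse. A small sketch (the file paths are just placeholders taken from the report):

```java
import java.net.URI;
import java.net.URISyntaxException;

public class UriSchemeDemo {
    // Returns null if the string parses as a URI, otherwise the parse error.
    static String tryParse(String s) {
        try {
            new URI(s);
            return null;
        } catch (URISyntaxException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Backslashes are not legal URI characters: this fails to parse,
        // with "Illegal character in opaque part at index 7".
        System.out.println(tryParse("file:Z:\\SIT1\\TreatmentManager\\app.properties"));
        // With forward slashes the same reference parses fine
        // (as an opaque URI, since the scheme part doesn't start with '/').
        System.out.println(tryParse("file:Z:/SIT1/TreatmentManager/app.properties"));
        // The fully hierarchical form also parses.
        System.out.println(tryParse("file:/Z:/SIT1/TreatmentManager/app.properties"));
    }
}
```

So whichever component started calling `new URI(...)` on the raw property value in 2.2.0 would reject the backslash form; the earlier behavior presumably went through `java.io.File`-style path handling, which tolerates `\`.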
[jira] [Resolved] (SPARK-21422) Depend on Apache ORC 1.4.0
[ https://issues.apache.org/jira/browse/SPARK-21422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-21422.
-
Resolution: Fixed
Assignee: Dongjoon Hyun
Fix Version/s: 2.3.0

> Depend on Apache ORC 1.4.0
> --
>
>                 Key: SPARK-21422
>                 URL: https://issues.apache.org/jira/browse/SPARK-21422
>             Project: Spark
>          Issue Type: Bug
>          Components: Build
>    Affects Versions: 2.3.0
>            Reporter: Dongjoon Hyun
>            Assignee: Dongjoon Hyun
>             Fix For: 2.3.0
>
> Like Parquet, this issue aims to depend on the latest Apache ORC 1.4 for Apache Spark 2.3. There are two key benefits for now:
> - Stability: Apache ORC 1.4.0 has many fixes, and we can rely on the ORC community more.
> - Maintainability: it reduces the Hive dependency and lets us remove old legacy code later.
> Later, we can get the following two additional benefits by adding the new ORCFileFormat in SPARK-20728, too:
> - Usability: users can use ORC data sources without the hive module, i.e., -Phive.
> - Speed: using both Spark ColumnarBatch and ORC RowBatch together is faster than the current implementation in Spark.
[jira] [Assigned] (SPARK-21108) convert LinearSVC to aggregator framework
[ https://issues.apache.org/jira/browse/SPARK-21108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang reassigned SPARK-21108: --- Assignee: yuhao yang > convert LinearSVC to aggregator framework > - > > Key: SPARK-21108 > URL: https://issues.apache.org/jira/browse/SPARK-21108 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.2.0 >Reporter: yuhao yang >Assignee: yuhao yang >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21745) Refactor ColumnVector hierarchy to make ColumnVector read-only and to introduce MutableColumnVector.
Takuya Ueshin created SPARK-21745:
-
             Summary: Refactor ColumnVector hierarchy to make ColumnVector read-only and to introduce MutableColumnVector.
                 Key: SPARK-21745
                 URL: https://issues.apache.org/jira/browse/SPARK-21745
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.3.0
            Reporter: Takuya Ueshin

This is a refactoring of {{ColumnVector}} hierarchy and related classes.
# make {{ColumnVector}} read-only
# introduce {{MutableColumnVector}} with write interface
# remove {{ReadOnlyColumnVector}}
[jira] [Commented] (SPARK-21736) Spark 2.2 in Windows does not recognize the URI "file:Z:\SIT1\TreatmentManager\app.properties"
[ https://issues.apache.org/jira/browse/SPARK-21736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16128336#comment-16128336 ] Ross Brigoli commented on SPARK-21736:
--
You're right, file:/Z:/... works. But the issue is that file:\Z:\... worked in Spark 2.1.0, so we did not bother changing it. Now we had to update all our configuration in production just to make it work in Spark 2.2.0.

> Spark 2.2 in Windows does not recognize the URI "file:Z:\SIT1\TreatmentManager\app.properties"
> --
>
>                 Key: SPARK-21736
>                 URL: https://issues.apache.org/jira/browse/SPARK-21736
>             Project: Spark
>          Issue Type: Bug
>          Components: Windows
>    Affects Versions: 2.2.0
>         Environment: OS: Windows Server 2012
> JVM: Oracle Java 1.8 (HotSpot)
>            Reporter: Ross Brigoli
>
> We are currently running Spark 2.1 in Windows Server 2012 in production. While trying to upgrade to Spark 2.2.0, we started getting this error after submitting a job in stand-alone cluster mode:
> Could not load properties; nested exception is java.io.IOException: java.net.URISyntaxException: Illegal character in opaque part at index 7: file:Z:\SIT1\TreatmentManager\app.properties
> We are passing a -D argument to spark-submit with a value of file:Z:\SIT1\TreatmentManager\app.properties
> *These URI -D arguments were working fine in Spark 2.1.0.*
> UPDATE: I replaced the \ with / and it works.
[jira] [Assigned] (SPARK-14516) Clustering evaluator
[ https://issues.apache.org/jira/browse/SPARK-14516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang reassigned SPARK-14516:
---
Assignee: Marco Gaido

> Clustering evaluator
> 
>
>                 Key: SPARK-14516
>                 URL: https://issues.apache.org/jira/browse/SPARK-14516
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML
>    Affects Versions: 2.2.0
>            Reporter: zhengruifeng
>            Assignee: Marco Gaido
>
> MLlib does not have any general-purpose clustering metrics with a ground truth.
> In [Scikit-Learn](http://scikit-learn.org/stable/modules/classes.html#clustering-metrics), there are several kinds of metrics for this.
> It may be meaningful to add some clustering metrics into MLlib.
> This should be added as a {{ClusteringEvaluator}} class extending {{Evaluator}} in spark.ml.
[jira] [Updated] (SPARK-14516) Clustering evaluator
[ https://issues.apache.org/jira/browse/SPARK-14516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-14516:
-
Shepherd: Yanbo Liang
Affects Version/s: 2.2.0

> Clustering evaluator
> 
>
>                 Key: SPARK-14516
>                 URL: https://issues.apache.org/jira/browse/SPARK-14516
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML
>    Affects Versions: 2.2.0
>            Reporter: zhengruifeng
>
> MLlib does not have any general-purpose clustering metrics with a ground truth.
> In [Scikit-Learn](http://scikit-learn.org/stable/modules/classes.html#clustering-metrics), there are several kinds of metrics for this.
> It may be meaningful to add some clustering metrics into MLlib.
> This should be added as a {{ClusteringEvaluator}} class extending {{Evaluator}} in spark.ml.
[jira] [Updated] (SPARK-21744) Add retry logic when create new broadcast
[ https://issues.apache.org/jira/browse/SPARK-21744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang updated SPARK-21744:
-
Description:
When the driver submits a new stage and there is a bad disk underneath Spark, the driver may exit because of the exception below:
{code:java}
Job aborted due to stage failure: Task serialization failed: java.io.IOException: Failed to create local dir in /home/work/hdd5/yarn/xxx/appcache/application_1463372393999_144979/blockmgr-1f96b724-3e16-4c09-8601-1a2e3b758185/3b.
org.apache.spark.storage.DiskBlockManager.getFile(DiskBlockManager.scala:73)
org.apache.spark.storage.DiskStore.contains(DiskStore.scala:173)
org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$getCurrentBlockStatus(BlockManager.scala:391)
org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:801)
org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:629)
org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:987)
org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:99)
org.apache.spark.broadcast.TorrentBroadcast.(TorrentBroadcast.scala:85)
org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:63)
org.apache.spark.SparkContext.broadcast(SparkContext.scala:1332)
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:863)
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$14$$anonfun$apply$1.apply$mcVI$sp(DAGScheduler.scala:1090)
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$14$$anonfun$apply$1.apply(DAGScheduler.scala:1086)
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$14$$anonfun$apply$1.apply(DAGScheduler.scala:1086)
scala.Option.foreach(Option.scala:236)
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$14.apply(DAGScheduler.scala:1086)
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$14.apply(DAGScheduler.scala:1085)
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1085)
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1528)
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1493)
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1482)
org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
{code}
We can add retry logic when creating a broadcast to lower the probability of this scenario occurring. And there is no side effect for the normal scenario.

was:
When the driver submits a new stage and there is a bad disk underneath Spark, the driver may exit because of the exception below:
{code:java}
Job aborted due to stage failure: Task serialization failed: java.io.IOException: Failed to create local dir in /home/work/hdd5/yarn/c3prc-hadoop/nodemanager/usercache/h_user_profile/appcache/application_1463372393999_144979/blockmgr-1f96b724-3e16-4c09-8601-1a2e3b758185/3b.
org.apache.spark.storage.DiskBlockManager.getFile(DiskBlockManager.scala:73)
org.apache.spark.storage.DiskStore.contains(DiskStore.scala:173)
org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$getCurrentBlockStatus(BlockManager.scala:391)
org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:801)
org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:629)
org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:987)
org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:99)
org.apache.spark.broadcast.TorrentBroadcast.(TorrentBroadcast.scala:85)
org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:63)
org.apache.spark.SparkContext.broadcast(SparkContext.scala:1332)
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:863)
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$14$$anonfun$apply$1.apply$mcVI$sp(DAGScheduler.scala:1090)
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$14$$anonfun$apply$1.apply(DAGScheduler.scala:1086)
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$14$$anonfun$apply$1.apply(DAGScheduler.scala:1086)
scala.Option.foreach(Option.scala:236)
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$14.apply(DAGScheduler.scala:1086)
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$14.apply(DAGScheduler.scala:1085)
scala.collection.mutable.ResizableArr
[jira] [Commented] (SPARK-21726) Check for structural integrity of the plan in QO in test mode
[ https://issues.apache.org/jira/browse/SPARK-21726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16128327#comment-16128327 ] Liang-Chi Hsieh commented on SPARK-21726: - Submitted a PR at https://github.com/apache/spark/pull/18956 > Check for structural integrity of the plan in QO in test mode > - > > Key: SPARK-21726 > URL: https://issues.apache.org/jira/browse/SPARK-21726 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.2.0 >Reporter: Reynold Xin > > Right now we don't have any checks in the optimizer to check for the > structural integrity of the plan (e.g. resolved). It would be great if in > test mode, we can check whether a plan is still resolved after the execution > of each rule, so we can catch rules that return invalid plans. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21744) Add retry logic when create new broadcast
zhoukang created SPARK-21744:
             Summary: Add retry logic when create new broadcast
                 Key: SPARK-21744
                 URL: https://issues.apache.org/jira/browse/SPARK-21744
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
    Affects Versions: 2.2.0, 2.1.0, 1.6.1
            Reporter: zhoukang
            Priority: Minor

When the driver submits a new stage and there is a bad disk underneath Spark, the driver may exit because of the exception below:
{code:java}
Job aborted due to stage failure: Task serialization failed: java.io.IOException: Failed to create local dir in /home/work/hdd5/yarn/c3prc-hadoop/nodemanager/usercache/h_user_profile/appcache/application_1463372393999_144979/blockmgr-1f96b724-3e16-4c09-8601-1a2e3b758185/3b.
org.apache.spark.storage.DiskBlockManager.getFile(DiskBlockManager.scala:73)
org.apache.spark.storage.DiskStore.contains(DiskStore.scala:173)
org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$getCurrentBlockStatus(BlockManager.scala:391)
org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:801)
org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:629)
org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:987)
org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:99)
org.apache.spark.broadcast.TorrentBroadcast.(TorrentBroadcast.scala:85)
org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:63)
org.apache.spark.SparkContext.broadcast(SparkContext.scala:1332)
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:863)
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$14$$anonfun$apply$1.apply$mcVI$sp(DAGScheduler.scala:1090)
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$14$$anonfun$apply$1.apply(DAGScheduler.scala:1086)
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$14$$anonfun$apply$1.apply(DAGScheduler.scala:1086)
scala.Option.foreach(Option.scala:236)
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$14.apply(DAGScheduler.scala:1086)
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$14.apply(DAGScheduler.scala:1085)
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1085)
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1528)
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1493)
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1482)
org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
{code}
We can add retry logic when creating a broadcast to lower the probability of this scenario occurring. And there is no side effect for the normal scenario.
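The proposal above can be sketched as a small, generic retry wrapper around the broadcast-creation call. All names below are illustrative, not Spark's actual API; real code would also want backoff and to retry only on disk-related I/O errors:

```java
import java.io.IOException;
import java.util.concurrent.Callable;

public class RetryDemo {
    // Run `task`, retrying up to maxAttempts times on IOException.
    // A tiny sketch of the proposed "retry on broadcast creation" idea;
    // none of these names come from Spark itself.
    static <T> T withRetry(Callable<T> task, int maxAttempts) throws Exception {
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (IOException e) {
                // e.g. "Failed to create local dir": try again, hoping a
                // healthy local directory is picked on the next attempt.
                last = e;
            }
        }
        throw last; // all attempts exhausted: surface the last failure
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated flaky resource: fails twice, then succeeds.
        String result = withRetry(() -> {
            if (++calls[0] < 3) throw new IOException("Failed to create local dir");
            return "broadcast_0";
        }, 5);
        System.out.println(result + " after " + calls[0] + " attempts"); // broadcast_0 after 3 attempts
    }
}
```

The normal (non-failing) path goes through a single `task.call()`, which matches the claim that the change has no side effect for the normal scenario.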
[jira] [Commented] (SPARK-4502) Spark SQL reads unneccesary nested fields from Parquet
[ https://issues.apache.org/jira/browse/SPARK-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16128318#comment-16128318 ] Gaurav Shah commented on SPARK-4502:
--
[~marmbrus] Do you have some time to review this pull request? It looks to be in good shape.

> Spark SQL reads unneccesary nested fields from Parquet
> --
>
>                 Key: SPARK-4502
>                 URL: https://issues.apache.org/jira/browse/SPARK-4502
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.1.0
>            Reporter: Liwen Sun
>            Priority: Critical
>
> When reading a field of a nested column from Parquet, Spark SQL reads and assembles all the fields of that nested column. This is unnecessary, as Parquet supports fine-grained field reads out of a nested column. This may degrade performance significantly when a nested column has many fields.
> For example, I loaded json tweets data into SparkSQL and ran the following query:
> {{SELECT User.contributors_enabled from Tweets;}}
> User is a nested structure that has 38 primitive fields (for the Tweets schema, see: https://dev.twitter.com/overview/api/tweets), here is the log message:
> {{14/11/19 16:36:49 INFO InternalParquetRecordReader: Assembled and processed 385779 records from 38 columns in 3976 ms: 97.02691 rec/ms, 3687.0227 cell/ms}}
> For comparison, I also ran:
> {{SELECT User FROM Tweets;}}
> And here is the log message:
> {{14/11/19 16:45:40 INFO InternalParquetRecordReader: Assembled and processed 385779 records from 38 columns in 9461 ms: 40.77571 rec/ms, 1549.477 cell/ms}}
> So both queries load 38 columns from Parquet, while the first query only needs 1 column. I also measured the bytes read within Parquet. In these two cases, the same number of bytes (99365194 bytes) were read.
[jira] [Commented] (SPARK-19256) Hive bucketing support
[ https://issues.apache.org/jira/browse/SPARK-19256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16128305#comment-16128305 ] Tejas Patil commented on SPARK-19256:
-
The PR for the writer-side changes is out: https://github.com/apache/spark/pull/18954
I have documented the semantics and the differences from Spark's bucketing in the JIRA description, to make it easy for anyone who wants to see what changes are done.
The reader-side changes are close to completion (the core is done, but the work is unpolished). I assume the writer PR will go through some rounds of comments, which will buy me time to work on that. The reader-side changes depend on the writer-side PR getting in, so I won't publish them until then.

> Hive bucketing support
> --
>
>                 Key: SPARK-19256
>                 URL: https://issues.apache.org/jira/browse/SPARK-19256
>             Project: Spark
>          Issue Type: Umbrella
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Tejas Patil
>            Priority: Minor
>
> JIRA to track design discussions and tasks related to Hive bucketing support in Spark.
> Proposal: https://docs.google.com/document/d/1a8IDh23RAkrkg9YYAeO51F4aGO8-xAlupKwdshve2fc/edit?usp=sharing
[jira] [Created] (SPARK-21743) top-most limit should not cause memory leak
Wenchen Fan created SPARK-21743: --- Summary: top-most limit should not cause memory leak Key: SPARK-21743 URL: https://issues.apache.org/jira/browse/SPARK-21743 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17025) Cannot persist PySpark ML Pipeline model that includes custom Transformer
[ https://issues.apache.org/jira/browse/SPARK-17025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16128271#comment-16128271 ] Nicholas Chammas commented on SPARK-17025: -- I'm still interested in this but I won't be able to test it until mid-next month, unfortunately. I've set myself a reminder to revisit this. > Cannot persist PySpark ML Pipeline model that includes custom Transformer > - > > Key: SPARK-17025 > URL: https://issues.apache.org/jira/browse/SPARK-17025 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 2.0.0 >Reporter: Nicholas Chammas >Priority: Minor > > Following the example in [this Databricks blog > post|https://databricks.com/blog/2016/05/31/apache-spark-2-0-preview-machine-learning-model-persistence.html] > under "Python tuning", I'm trying to save an ML Pipeline model. > This pipeline, however, includes a custom transformer. When I try to save the > model, the operation fails because the custom transformer doesn't have a > {{_to_java}} attribute. > {code} > Traceback (most recent call last): > File ".../file.py", line 56, in > model.bestModel.save('model') > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py", > line 222, in save > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py", > line 217, in write > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/util.py", > line 93, in __init__ > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py", > line 254, in _to_java > AttributeError: 'PeoplePairFeaturizer' object has no attribute '_to_java' > {code} > Looking at the source code for > [ml/base.py|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/base.py], > I see that not even the base Transformer class has such an attribute. 
> I'm assuming this is missing functionality that is intended to be patched up > (i.e. [like > this|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/classification.py#L1421-L1433]). > I'm not sure if there is an existing JIRA for this (my searches didn't turn > up clear results). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21739) timestamp partition would fail in v2.2.0
[ https://issues.apache.org/jira/browse/SPARK-21739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16128267#comment-16128267 ] Feng Zhu commented on SPARK-21739:
--
In such a case, the Cast expression is evaluated directly without resolving *timeZoneId*, because it is in the execution phase.
{code:java}
Cast(Literal(rawPartValues(partOrdinal)), attr.dataType).eval(null)
{code}
I will check for these kinds of issues and fix them.

> timestamp partition would fail in v2.2.0
> 
>
>                 Key: SPARK-21739
>                 URL: https://issues.apache.org/jira/browse/SPARK-21739
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.2.0
>            Reporter: wangzhihao
>
> Spark v2.2.0 introduces TimeZoneAwareExpression, which causes bugs if we select data from a table with timestamp partitions.
> The steps to reproduce it:
> {code:java}
> spark.sql("create table test (foo string) partitioned by (ts timestamp)")
> spark.sql("insert into table test partition(ts = 1) values('hi')")
> spark.table("test").show()
> {code}
> The root cause is that TableReader.scala#230 tries to cast the string to timestamp regardless of whether the timeZone exists.
> Here is the error stack trace > {code} > java.util.NoSuchElementException: None.get > at scala.None$.get(Option.scala:347) > at scala.None$.get(Option.scala:345) > at > org.apache.spark.sql.catalyst.expressions.TimeZoneAwareExpression$class.timeZone(datetimeExpressions.scala:46) > at > org.apache.spark.sql.catalyst.expressions.Cast.timeZone$lzycompute(Cast.scala:172) > >at > org.apache.spark.sql.catalyst.expressions.Cast.timeZone(Cast.scala:172) > at > org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToTimestamp$1$$anonfun$apply$24.apply(Cast.scala:253) > at > org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToTimestamp$1$$anonfun$apply$24.apply(Cast.scala:253) > at > org.apache.spark.sql.catalyst.expressions.Cast.org$apache$spark$sql$catalyst$expressions$Cast$$buildCast(Cast.scala:201) > at > org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToTimestamp$1.apply(Cast.scala:253) > at > org.apache.spark.sql.catalyst.expressions.Cast.nullSafeEval(Cast.scala:533) > at > org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:327) > at > org.apache.spark.sql.hive.HadoopTableReader$$anonfun$5$$anonfun$fillPartitionKeys$1$1.apply(TableReader.scala:230) > at > org.apache.spark.sql.hive.HadoopTableReader$$anonfun$5$$anonfun$fillPartitionKeys$1$1.apply(TableReader.scala:228) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21742) BisectingKMeans generate different models with/without caching
[ https://issues.apache.org/jira/browse/SPARK-21742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-21742:
-
Summary: BisectingKMeans generate different models with/without caching (was: BisectingKMeans generate different results with/without caching)

> BisectingKMeans generate different models with/without caching
> --
>
>                 Key: SPARK-21742
>                 URL: https://issues.apache.org/jira/browse/SPARK-21742
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.3.0
>            Reporter: zhengruifeng
>
> I found that {{BisectingKMeans}} will generate different models depending on whether the input is cached.
> Using the same dataset as in {{BisectingKMeansSuite}}, we can find that if we cache the input, then the number of centers changes from 2 to 3.
> So it looks like a potential bug.
> {code}
> import org.apache.spark.ml.param.ParamMap
> import org.apache.spark.sql.Dataset
> import org.apache.spark.ml.clustering._
> import org.apache.spark.ml.linalg._
> import scala.util.Random
>
> case class TestRow(features: org.apache.spark.ml.linalg.Vector)
>
> val rows = 10
> val dim = 1000
> val seed = 42
> val random = new Random(seed)
> val nnz = random.nextInt(dim)
> val rdd = sc.parallelize(1 to rows).map(i => Vectors.sparse(dim,
>   random.shuffle(0 to dim - 1).slice(0, nnz).sorted.toArray,
>   Array.fill(nnz)(random.nextDouble()))).map(v => new TestRow(v))
> val sparseDataset = spark.createDataFrame(rdd)
>
> val k = 5
> val bkm = new BisectingKMeans().setK(k).setMinDivisibleClusterSize(4).setMaxIter(4).setSeed(123)
>
> val model = bkm.fit(sparseDataset)
> model.clusterCenters.length
> res0: Int = 2
>
> sparseDataset.persist()
> val model = bkm.fit(sparseDataset)
> model.clusterCenters.length
> res2: Int = 3
> {code}
> [~imatiach]
[jira] [Created] (SPARK-21742) BisectingKMeans generate different results with/without caching
zhengruifeng created SPARK-21742:
             Summary: BisectingKMeans generate different results with/without caching
                 Key: SPARK-21742
                 URL: https://issues.apache.org/jira/browse/SPARK-21742
             Project: Spark
          Issue Type: Bug
          Components: ML
    Affects Versions: 2.3.0
            Reporter: zhengruifeng

I found that {{BisectingKMeans}} will generate different models depending on whether the input is cached.
Using the same dataset as in {{BisectingKMeansSuite}}, we can find that if we cache the input, then the number of centers changes from 2 to 3.
So it looks like a potential bug.
{code}
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.sql.Dataset
import org.apache.spark.ml.clustering._
import org.apache.spark.ml.linalg._
import scala.util.Random

case class TestRow(features: org.apache.spark.ml.linalg.Vector)

val rows = 10
val dim = 1000
val seed = 42
val random = new Random(seed)
val nnz = random.nextInt(dim)
val rdd = sc.parallelize(1 to rows).map(i => Vectors.sparse(dim,
  random.shuffle(0 to dim - 1).slice(0, nnz).sorted.toArray,
  Array.fill(nnz)(random.nextDouble()))).map(v => new TestRow(v))
val sparseDataset = spark.createDataFrame(rdd)

val k = 5
val bkm = new BisectingKMeans().setK(k).setMinDivisibleClusterSize(4).setMaxIter(4).setSeed(123)

val model = bkm.fit(sparseDataset)
model.clusterCenters.length
res0: Int = 2

sparseDataset.persist()
val model = bkm.fit(sparseDataset)
model.clusterCenters.length
res2: Int = 3
{code}
[~imatiach]
[jira] [Commented] (SPARK-18394) Executing the same query twice in a row results in CodeGenerator cache misses
[ https://issues.apache.org/jira/browse/SPARK-18394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16128243#comment-16128243 ] Takeshi Yamamuro commented on SPARK-18394: -- Any update? I checked and found that master still has this issue; I just ran the query above and dumped the output names in https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlanner.scala#L102. {code} 17/08/16 02:13:13 WARN BaseSessionStateBuilder$$anon$3: L_QUANTITY#9015,L_RETURNFLAG#9019,l_shipdate#9021,L_TAX#9018,L_DISCOUNT#9017,L_LINESTATUS#9020,L_EXTENDEDPRICE#9016 17/08/16 02:13:13 WARN BaseSessionStateBuilder$$anon$3: L_RETURNFLAG#9142,L_DISCOUNT#9140,L_EXTENDEDPRICE#9139,L_QUANTITY#9138,L_LINESTATUS#9143,l_shipdate#9144,L_TAX#9141 17/08/16 02:13:13 WARN BaseSessionStateBuilder$$anon$3: L_QUANTITY#9305,L_TAX#9308,l_shipdate#9311,L_DISCOUNT#9307,L_RETURNFLAG#9309,L_LINESTATUS#9310,L_EXTENDEDPRICE#9306 17/08/16 02:13:13 WARN BaseSessionStateBuilder$$anon$3: L_EXTENDEDPRICE#9451,L_QUANTITY#9450,L_RETURNFLAG#9454,L_TAX#9453,L_DISCOUNT#9452,l_shipdate#9456,L_LINESTATUS#9455 17/08/16 02:13:13 WARN BaseSessionStateBuilder$$anon$3: L_LINESTATUS#9600,l_shipdate#9601,L_DISCOUNT#9597,L_TAX#9598,L_EXTENDEDPRICE#9596,L_RETURNFLAG#9599,L_QUANTITY#9595 17/08/16 02:13:13 WARN BaseSessionStateBuilder$$anon$3: L_QUANTITY#9740,L_TAX#9743,l_shipdate#9746,L_DISCOUNT#9742,L_EXTENDEDPRICE#9741,L_LINESTATUS#9745,L_RETURNFLAG#9744 ... {code} The attribute order differs between runs, so Spark generates different code in `GenerateColumnAccessor`. 
Also, I quickly checked that `AttributeSet.toSeq` outputs attributes in a different order; {code} scala> val attr1 = AttributeReference("c1", IntegerType)(exprId = ExprId(1098)) scala> val attr2 = AttributeReference("c2", IntegerType)(exprId = ExprId(107)) scala> val attr3 = AttributeReference("c3", IntegerType)(exprId = ExprId(838)) scala> val attrSetA = AttributeSet(attr1 :: attr2 :: attr3 :: Nil) scala> val attr4 = AttributeReference("c4", IntegerType)(exprId = ExprId(389)) scala> val attr5 = AttributeReference("c5", IntegerType)(exprId = ExprId(89329)) scala> val attrSetB = AttributeSet(attr4 :: attr5 :: Nil) scala> (attrSetA ++ attrSetB).toSeq res1: Seq[org.apache.spark.sql.catalyst.expressions.Attribute] = WrappedArray(c3#838, c4#389, c2#107, c5#89329, c1#1098) scala> val attr1 = AttributeReference("c1", IntegerType)(exprId = ExprId(392)) scala> val attr2 = AttributeReference("c2", IntegerType)(exprId = ExprId(92)) scala> val attr3 = AttributeReference("c3", IntegerType)(exprId = ExprId(87)) scala> val attrSetA = AttributeSet(attr1 :: attr2 :: attr3 :: Nil) scala> val attr4 = AttributeReference("c4", IntegerType)(exprId = ExprId(9023920)) scala> val attr5 = AttributeReference("c5", IntegerType)(exprId = ExprId(522)) scala> val attrSetB = AttributeSet(attr4 :: attr5 :: Nil) scala> (attrSetA ++ attrSetB).toSeq res2: Seq[org.apache.spark.sql.catalyst.expressions.Attribute] = WrappedArray(c3#87, c1#392, c5#522, c2#92, c4#9023920) {code} As suggested, to fix this, `AttributeSet.toSeq` needs to output attributes in a consistent order, like: https://github.com/apache/spark/compare/master...maropu:SPARK-18394 > Executing the same query twice in a row results in CodeGenerator cache misses > - > > Key: SPARK-18394 > URL: https://issues.apache.org/jira/browse/SPARK-18394 > Project: Spark > Issue Type: Bug > Components: SQL > Environment: HiveThriftServer2 running on branch-2.0 on Mac laptop >Reporter: Jonny Serencsa > > Executing the query: > {noformat} > select > 
l_returnflag, > l_linestatus, > sum(l_quantity) as sum_qty, > sum(l_extendedprice) as sum_base_price, > sum(l_extendedprice * (1 - l_discount)) as sum_disc_price, > sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge, > avg(l_quantity) as avg_qty, > avg(l_extendedprice) as avg_price, > avg(l_discount) as avg_disc, > count(*) as count_order > from > lineitem_1_row > where > l_shipdate <= date_sub('1998-12-01', '90') > group by > l_returnflag, > l_linestatus > ; > {noformat} > twice (in succession), will result in CodeGenerator cache misses in BOTH > executions. Since the query is identical, I would expect the same code to be > generated. > Turns out, the generated code is not exactly the same, resulting in cache > misses when performing the lookup in the CodeGenerator cache. Yet, the code > is equivalent. > Below is (some portion of the) generated code for two runs of the query: > run-1 > {noformat} > import java.nio.ByteBuffer; > import java.nio.ByteOrder; > import scala.collection.Iterator; > import org.apache.spark.sql.types.DataType; > import org.apache.spark.sql.catalyst.expre
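The cache misses described above stem from iterating a hash-based set: iteration order follows the elements' hash codes, which incorporate per-plan expression IDs, so two runs of the same query can emit semantically identical attribute lists in different orders and generate textually different code. A small Python analogue of the problem and of the proposed canonical-ordering fix (illustrative only, not Catalyst code):

```python
# Two runs of the "same" query assign different expression IDs:
run1 = {("c1", 1098), ("c2", 107), ("c3", 838)}
run2 = {("c1", 392), ("c2", 92), ("c3", 87)}

# Iterating the raw sets can yield the names in different orders,
# so code generated from them differs and the codegen cache misses.
names1 = [name for name, _ in run1]
names2 = [name for name, _ in run2]

# The fix sketched in the linked branch: emit attributes in a canonical
# order (here, sorted by name) that is independent of the assigned IDs.
canon1 = sorted(name for name, _ in run1)
canon2 = sorted(name for name, _ in run2)
assert canon1 == canon2 == ["c1", "c2", "c3"]
```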
[jira] [Commented] (SPARK-21657) Spark has exponential time complexity to explode(array of structs)
[ https://issues.apache.org/jira/browse/SPARK-21657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16128242#comment-16128242 ] Liang-Chi Hsieh commented on SPARK-21657: - [~maropu] I've noticed that change. There is a hotfix trying to revert that: https://github.com/apache/spark/pull/17425. But in the end the hotfix doesn't revert it back. Actually, I've tried to enable codegen for GenerateExec and ran those tests without failure locally. So I'm wondering why we still disable it. > Spark has exponential time complexity to explode(array of structs) > -- > > Key: SPARK-21657 > URL: https://issues.apache.org/jira/browse/SPARK-21657 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 2.0.0, 2.1.0, 2.1.1, 2.2.0 >Reporter: Ruslan Dautkhanov > Labels: cache, caching, collections, nested_types, performance, > pyspark, sparksql, sql > Attachments: ExponentialTimeGrowth.PNG, > nested-data-generator-and-test.py > > > It can take up to half a day to explode a modest-sized nested collection > (0.5m) on recent Xeon processors. > See the attached pyspark script that reproduces this problem. > {code} > cached_df = sqlc.sql('select individ, hholdid, explode(amft) from ' + > table_name).cache() > print cached_df.count() > {code} > This script generates a number of tables with the same total number of > records across all nested collections (see the `scaling` variable in the loops). > The `scaling` variable scales up how many nested elements are in each record, and by > the same factor scales down the number of records in the table, so the total number > of records stays the same. > Time grows exponentially (notice the log-10 vertical axis scale): > !ExponentialTimeGrowth.PNG! > At a scaling of 50,000 (see the attached pyspark script), it took 7 hours to > explode the nested collections (!) of 8k records. > After 1000 elements in a nested collection, time grows exponentially. 
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
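For reference, the benchmark methodology quoted above can be sketched in a few lines: each table multiplies the nested-collection size by a scaling factor and divides the row count by the same factor, so the total number of nested elements stays constant and only the per-record nesting grows. A simplified stand-in for the attached generator (illustrative names, not the actual script):

```python
def make_tables(total, scalings):
    """For each scaling factor s, build `total // s` rows whose nested
    collection holds s elements, keeping total elements constant."""
    tables = {}
    for s in scalings:
        rows = total // s
        tables[s] = [list(range(s)) for _ in range(rows)]
    return tables

tables = make_tables(total=10_000, scalings=(1, 10, 100))
flat_counts = {s: sum(len(row) for row in t) for s, t in tables.items()}
# Every table flattens ("explodes") to the same number of records, so any
# superlinear growth in runtime is attributable to the nesting alone.
assert all(count == 10_000 for count in flat_counts.values())
```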
[jira] [Updated] (SPARK-21716) The time-range window can't be applied on the reduce operator
[ https://issues.apache.org/jira/browse/SPARK-21716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fan Donglai updated SPARK-21716: Description: I can't use the GroupBy + Window operator to get the newest (maximum event time) row in a window. It should be possible to apply the window on the reduce operator (was: I can't use GroupBy + Window operator to get the newest(the maximum event time) row in a window.It should make the window can be applid on the reduce operator) > The time-range window can't be applied on the reduce operator > -- > > Key: SPARK-21716 > URL: https://issues.apache.org/jira/browse/SPARK-21716 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Fan Donglai > > I can't use the GroupBy + Window operator to get the newest (maximum event > time) row in a window. It should be possible to apply the window on the reduce > operator -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
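What the report asks for is essentially a per-window argmax over event time combined with grouped aggregation. Stripped of streaming concerns, the desired semantics look like this (pure Python, tumbling windows, hypothetical helper name):

```python
def newest_per_window(rows, window_secs):
    """rows: (event_time_secs, payload) pairs. For each tumbling window,
    keep the row with the maximum event time, i.e. the 'reduce' the
    reporter wants to combine with a time-range window."""
    newest = {}
    for t, payload in rows:
        w = t // window_secs          # tumbling-window index
        if w not in newest or t > newest[w][0]:
            newest[w] = (t, payload)
    return newest

rows = [(3, "a"), (9, "b"), (11, "c"), (17, "d"), (14, "e")]
assert newest_per_window(rows, 10) == {0: (9, "b"), 1: (17, "d")}
```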
[jira] [Commented] (SPARK-21741) Python API for DataFrame-based multivariate summarizer
[ https://issues.apache.org/jira/browse/SPARK-21741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16128226#comment-16128226 ] Weichen Xu commented on SPARK-21741: OK, I will work on this. I will post a design doc first. > Python API for DataFrame-based multivariate summarizer > -- > > Key: SPARK-21741 > URL: https://issues.apache.org/jira/browse/SPARK-21741 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.2.0 >Reporter: Yanbo Liang > > We support a multivariate summarizer for the DataFrame API as of SPARK-19634; we > should also make PySpark support it. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21741) Python API for DataFrame-based multivariate summarizer
Yanbo Liang created SPARK-21741: --- Summary: Python API for DataFrame-based multivariate summarizer Key: SPARK-21741 URL: https://issues.apache.org/jira/browse/SPARK-21741 Project: Spark Issue Type: Improvement Components: ML, PySpark Affects Versions: 2.2.0 Reporter: Yanbo Liang We support a multivariate summarizer for the DataFrame API as of SPARK-19634; we should also make PySpark support it. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21741) Python API for DataFrame-based multivariate summarizer
[ https://issues.apache.org/jira/browse/SPARK-21741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16128223#comment-16128223 ] Yanbo Liang commented on SPARK-21741: - [~WeichenXu123] Would you like to work on this? Thanks. > Python API for DataFrame-based multivariate summarizer > -- > > Key: SPARK-21741 > URL: https://issues.apache.org/jira/browse/SPARK-21741 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.2.0 >Reporter: Yanbo Liang > > We support a multivariate summarizer for the DataFrame API as of SPARK-19634; we > should also make PySpark support it. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19634) Feature parity for descriptive statistics in MLlib
[ https://issues.apache.org/jira/browse/SPARK-19634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang resolved SPARK-19634. - Resolution: Fixed Fix Version/s: 2.3.0 > Feature parity for descriptive statistics in MLlib > -- > > Key: SPARK-19634 > URL: https://issues.apache.org/jira/browse/SPARK-19634 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.1.0 >Reporter: Timothy Hunter >Assignee: Timothy Hunter > Fix For: 2.3.0 > > > This ticket tracks porting the functionality of > spark.mllib.MultivariateOnlineSummarizer over to spark.ml. > A design has been discussed in SPARK-19208 . Here is a design doc: > https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit# -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21712) Clarify PySpark Column.substr() type checking error message
[ https://issues.apache.org/jira/browse/SPARK-21712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-21712: Assignee: Nicholas Chammas > Clarify PySpark Column.substr() type checking error message > --- > > Key: SPARK-21712 > URL: https://issues.apache.org/jira/browse/SPARK-21712 > Project: Spark > Issue Type: Documentation > Components: PySpark, SQL >Affects Versions: 2.2.0 >Reporter: Nicholas Chammas >Assignee: Nicholas Chammas >Priority: Trivial > Fix For: 2.3.0 > > > https://github.com/apache/spark/blob/f0169a1c6a1ac06045d57f8aaa2c841bb39e23ac/python/pyspark/sql/column.py#L408-L409 > "Can not mix the type" is really unclear. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21712) Clarify PySpark Column.substr() type checking error message
[ https://issues.apache.org/jira/browse/SPARK-21712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-21712. -- Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 18926 [https://github.com/apache/spark/pull/18926] > Clarify PySpark Column.substr() type checking error message > --- > > Key: SPARK-21712 > URL: https://issues.apache.org/jira/browse/SPARK-21712 > Project: Spark > Issue Type: Documentation > Components: PySpark, SQL >Affects Versions: 2.2.0 >Reporter: Nicholas Chammas >Priority: Trivial > Fix For: 2.3.0 > > > https://github.com/apache/spark/blob/f0169a1c6a1ac06045d57f8aaa2c841bb39e23ac/python/pyspark/sql/column.py#L408-L409 > "Can not mix the type" is really unclear. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21681) MLOR do not work correctly when featureStd contains zero
[ https://issues.apache.org/jira/browse/SPARK-21681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-21681: -- Shepherd: Joseph K. Bradley > MLOR do not work correctly when featureStd contains zero > > > Key: SPARK-21681 > URL: https://issues.apache.org/jira/browse/SPARK-21681 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.2.0, 2.3.0 >Reporter: Weichen Xu >Assignee: Weichen Xu > > MLOR does not work correctly when featureStd contains zero. > We can reproduce the bug with such a dataset (features including zero > variance); it will generate a wrong result (all coefficients become 0) > {code} > val multinomialDatasetWithZeroVar = { > val nPoints = 100 > val coefficients = Array( > -0.57997, 0.912083, -0.371077, > -0.16624, -0.84355, -0.048509) > val xMean = Array(5.843, 3.0) > val xVariance = Array(0.6856, 0.0) // including zero variance > val testData = generateMultinomialLogisticInput( > coefficients, xMean, xVariance, addIntercept = true, nPoints, seed) > val df = sc.parallelize(testData, 4).toDF().withColumn("weight", > lit(1.0)) > df.cache() > df > } > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21681) MLOR do not work correctly when featureStd contains zero
[ https://issues.apache.org/jira/browse/SPARK-21681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-21681: -- Affects Version/s: 2.3.0 > MLOR do not work correctly when featureStd contains zero > > > Key: SPARK-21681 > URL: https://issues.apache.org/jira/browse/SPARK-21681 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.2.0, 2.3.0 >Reporter: Weichen Xu >Assignee: Weichen Xu > > MLOR does not work correctly when featureStd contains zero. > We can reproduce the bug with such a dataset (features including zero > variance); it will generate a wrong result (all coefficients become 0) > {code} > val multinomialDatasetWithZeroVar = { > val nPoints = 100 > val coefficients = Array( > -0.57997, 0.912083, -0.371077, > -0.16624, -0.84355, -0.048509) > val xMean = Array(5.843, 3.0) > val xVariance = Array(0.6856, 0.0) // including zero variance > val testData = generateMultinomialLogisticInput( > coefficients, xMean, xVariance, addIntercept = true, nPoints, seed) > val df = sc.parallelize(testData, 4).toDF().withColumn("weight", > lit(1.0)) > df.cache() > df > } > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21681) MLOR do not work correctly when featureStd contains zero
[ https://issues.apache.org/jira/browse/SPARK-21681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-21681: -- Target Version/s: 2.2.1, 2.3.0 > MLOR do not work correctly when featureStd contains zero > > > Key: SPARK-21681 > URL: https://issues.apache.org/jira/browse/SPARK-21681 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.2.0, 2.3.0 >Reporter: Weichen Xu >Assignee: Weichen Xu > > MLOR does not work correctly when featureStd contains zero. > We can reproduce the bug with such a dataset (features including zero > variance); it will generate a wrong result (all coefficients become 0) > {code} > val multinomialDatasetWithZeroVar = { > val nPoints = 100 > val coefficients = Array( > -0.57997, 0.912083, -0.371077, > -0.16624, -0.84355, -0.048509) > val xMean = Array(5.843, 3.0) > val xVariance = Array(0.6856, 0.0) // including zero variance > val testData = generateMultinomialLogisticInput( > coefficients, xMean, xVariance, addIntercept = true, nPoints, seed) > val df = sc.parallelize(testData, 4).toDF().withColumn("weight", > lit(1.0)) > df.cache() > df > } > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21681) MLOR do not work correctly when featureStd contains zero
[ https://issues.apache.org/jira/browse/SPARK-21681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley reassigned SPARK-21681: - Assignee: Weichen Xu > MLOR do not work correctly when featureStd contains zero > > > Key: SPARK-21681 > URL: https://issues.apache.org/jira/browse/SPARK-21681 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.2.0 >Reporter: Weichen Xu >Assignee: Weichen Xu > > MLOR does not work correctly when featureStd contains zero. > We can reproduce the bug with such a dataset (features including zero > variance); it will generate a wrong result (all coefficients become 0) > {code} > val multinomialDatasetWithZeroVar = { > val nPoints = 100 > val coefficients = Array( > -0.57997, 0.912083, -0.371077, > -0.16624, -0.84355, -0.048509) > val xMean = Array(5.843, 3.0) > val xVariance = Array(0.6856, 0.0) // including zero variance > val testData = generateMultinomialLogisticInput( > coefficients, xMean, xVariance, addIntercept = true, nPoints, seed) > val df = sc.parallelize(testData, 4).toDF().withColumn("weight", > lit(1.0)) > df.cache() > df > } > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20589) Allow limiting task concurrency per stage
[ https://issues.apache.org/jira/browse/SPARK-20589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16127942#comment-16127942 ] Thomas Graves commented on SPARK-20589: --- Note that this type of option is also already supported in other big data engines like pig and mapreduce. There is a need for this and I don't think we should ask the user to split up their job to make it less optimal. This should not be that complex in the scheme of things. > Allow limiting task concurrency per stage > - > > Key: SPARK-20589 > URL: https://issues.apache.org/jira/browse/SPARK-20589 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: Thomas Graves > > It would be nice to have the ability to limit the number of concurrent tasks > per stage. This is useful when your spark job might be accessing another > service and you don't want to DOS that service. For instance Spark writing > to hbase or Spark doing http puts on a service. Many times you want to do > this without limiting the number of partitions. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20589) Allow limiting task concurrency per stage
[ https://issues.apache.org/jira/browse/SPARK-20589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16127930#comment-16127930 ] Amit Kumar commented on SPARK-20589: [~imranr] Like you said, we could restrict the number of executors for the whole pipeline, but it would make the throughput too slow or, worse, start causing OOM and shuffle errors. As for the other option, I'm doing exactly that, breaking up one pipeline into multiple stages. But, as you can imagine, it makes your workflow code much longer and more complex than desired. Not only do we have to add the complexity of serializing intermediate data on HDFS, it also increases the total time. Also, I don't agree that this is a rare use case. As [~mcnels1] also said, we would see this come up more and more as we start linking Apache Spark with external storage/querying solutions which have tighter QPS restrictions. > Allow limiting task concurrency per stage > - > > Key: SPARK-20589 > URL: https://issues.apache.org/jira/browse/SPARK-20589 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: Thomas Graves > > It would be nice to have the ability to limit the number of concurrent tasks > per stage. This is useful when your spark job might be accessing another > service and you don't want to DOS that service. For instance Spark writing > to hbase or Spark doing http puts on a service. Many times you want to do > this without limiting the number of partitions. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20589) Allow limiting task concurrency per stage
[ https://issues.apache.org/jira/browse/SPARK-20589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16127911#comment-16127911 ] Imran Rashid commented on SPARK-20589: -- couldn't you just limit the number of executors of your spark app? I realize that you might want more executors for other parts of your app, but you could always break into multiple independent spark apps. I realize it isn't ideal, but I'm worried about the complexity of this for what seems like a rare use case. > Allow limiting task concurrency per stage > - > > Key: SPARK-20589 > URL: https://issues.apache.org/jira/browse/SPARK-20589 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: Thomas Graves > > It would be nice to have the ability to limit the number of concurrent tasks > per stage. This is useful when your spark job might be accessing another > service and you don't want to DOS that service. For instance Spark writing > to hbase or Spark doing http puts on a service. Many times you want to do > this without limiting the number of partitions. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
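Until a scheduler-level knob exists, one workaround for the DOS concern raised above is to throttle calls to the external service inside the tasks themselves, for example with a process-wide semaphore, which caps in-flight requests without reducing partitions or executors. A minimal sketch of the idea (pure Python threads standing in for concurrent tasks; the service call is a placeholder):

```python
import threading

MAX_CONCURRENT = 4                      # cap on in-flight service calls
permits = threading.Semaphore(MAX_CONCURRENT)
lock = threading.Lock()
in_flight = 0
peak = 0

def call_external_service(record):
    """Stand-in for an HBase write or an HTTP PUT from within a task."""
    global in_flight, peak
    with permits:                       # at most MAX_CONCURRENT callers inside
        with lock:
            in_flight += 1
            peak = max(peak, in_flight)
        # ... the real request to the external service would happen here ...
        with lock:
            in_flight -= 1

threads = [threading.Thread(target=call_external_service, args=(i,))
           for i in range(32)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert peak <= MAX_CONCURRENT           # concurrency never exceeded the cap
```

This only bounds concurrency per process; a cluster-wide cap would still need coordination, which is what the requested scheduler feature would provide.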
[jira] [Resolved] (SPARK-21731) Upgrade scalastyle to 0.9
[ https://issues.apache.org/jira/browse/SPARK-21731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-21731. Resolution: Fixed Assignee: Marcelo Vanzin Fix Version/s: 2.3.0 https://github.com/apache/spark/pull/18943 > Upgrade scalastyle to 0.9 > - > > Key: SPARK-21731 > URL: https://issues.apache.org/jira/browse/SPARK-21731 > Project: Spark > Issue Type: Task > Components: Build >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Trivial > Fix For: 2.3.0 > > > No new features that I'm interested in, but it fixes an issue with the import > order checker so that it provides more useful errors > (https://github.com/scalastyle/scalastyle/pull/185). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21740) DataFrame.write does not work with Phoenix JDBC Driver
Paul Wu created SPARK-21740: --- Summary: DataFrame.write does not work with Phoenix JDBC Driver Key: SPARK-21740 URL: https://issues.apache.org/jira/browse/SPARK-21740 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.2.0, 2.0.0 Reporter: Paul Wu The reason for this is that the Phoenix JDBC driver does not support "INSERT", only "UPSERT". Exception from the following program: 17/08/15 12:18:53 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1) org.apache.phoenix.exception.PhoenixParserException: ERROR 601 (42P00): Syntax error. Encountered "INSERT" at line 1, column 1. at org.apache.phoenix.exception.PhoenixParserException.newException(PhoenixParserException.java:33) {code:java} public class HbaseJDBCSpark { private static final SparkSession sparkSession = SparkSession.builder() .config("spark.sql.warehouse.dir", "file:///temp") .config("spark.driver.memory", "5g") .master("local[*]").appName("Spark2JdbcDs").getOrCreate(); static final String JDBC_URL = "jdbc:phoenix:somehost:2181:/hbase-unsecure"; public static void main(String[] args) { final Properties connectionProperties = new Properties(); Dataset<Row> jdbcDF = sparkSession.read() .jdbc(JDBC_URL, "javatest", connectionProperties); jdbcDF.show(); String url = JDBC_URL; Properties p = new Properties(); p.put("driver", "org.apache.phoenix.jdbc.PhoenixDriver"); //p.put("batchsize", "10"); jdbcDF.write().mode(SaveMode.Append).jdbc(url, "javatest", p); sparkSession.close(); } // Create variables } {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
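Since Phoenix accepts UPSERT where standard SQL uses INSERT, the statements that DataFrame.write generates are rejected by Phoenix's parser. Short of a Phoenix-aware JDBC dialect, one workaround is to build the statement text yourself and rewrite the verb; a minimal sketch of that rewrite, with a hypothetical helper name:

```python
def to_phoenix_sql(insert_sql):
    """Rewrite a generated 'INSERT INTO ...' statement into Phoenix's
    'UPSERT INTO ...' form; everything after the verb is left unchanged."""
    head, sep, rest = insert_sql.partition("INSERT INTO")
    if not sep:
        raise ValueError("not an INSERT INTO statement")
    return head + "UPSERT INTO" + rest

sql = 'INSERT INTO javatest ("ID", "NAME") VALUES (?, ?)'
assert to_phoenix_sql(sql) == 'UPSERT INTO javatest ("ID", "NAME") VALUES (?, ?)'
```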
[jira] [Resolved] (SPARK-17742) Spark Launcher does not get failed state in Listener
[ https://issues.apache.org/jira/browse/SPARK-17742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-17742. Resolution: Fixed Fix Version/s: 2.3.0 https://github.com/apache/spark/pull/18877 > Spark Launcher does not get failed state in Listener > - > > Key: SPARK-17742 > URL: https://issues.apache.org/jira/browse/SPARK-17742 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.0.0 >Reporter: Aseem Bansal > Fix For: 2.3.0 > > > I tried to launch an application using the below code. This is dummy code to > reproduce the problem. I tried exiting spark with status -1, throwing an > exception etc. but in no case did the listener give me failed status. But if > a spark job returns -1 or throws an exception from the main method it should > be considered as a failure. > {code} > package com.example; > import org.apache.spark.launcher.SparkAppHandle; > import org.apache.spark.launcher.SparkLauncher; > import java.io.IOException; > public class Main2 { > public static void main(String[] args) throws IOException, > InterruptedException { > SparkLauncher launcher = new SparkLauncher() > .setSparkHome("/opt/spark2") > > .setAppResource("/home/aseem/projects/testsparkjob/build/libs/testsparkjob-1.0-SNAPSHOT.jar") > .setMainClass("com.example.Main") > .setMaster("local[2]"); > launcher.startApplication(new MyListener()); > Thread.sleep(1000 * 60); > } > } > class MyListener implements SparkAppHandle.Listener { > @Override > public void stateChanged(SparkAppHandle handle) { > System.out.println("state changed " + handle.getState()); > } > @Override > public void infoChanged(SparkAppHandle handle) { > System.out.println("info changed " + handle.getState()); > } > } > {code} > The spark job is > {code} > package com.example; > import org.apache.spark.sql.SparkSession; > import java.io.IOException; > public class Main { > public static void main(String[] args) throws IOException { > SparkSession 
sparkSession = SparkSession > .builder() > .appName("" + System.currentTimeMillis()) > .getOrCreate(); > try { > for (int i = 0; i < 15; i++) { > Thread.sleep(1000); > System.out.println("sleeping 1"); > } > } catch (InterruptedException e) { > e.printStackTrace(); > } > //sparkSession.stop(); > System.exit(-1); > } > } > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21720) Filter predicate with many conditions throw stackoverflow error
[ https://issues.apache.org/jira/browse/SPARK-21720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16127657#comment-16127657 ] poplav edited comment on SPARK-21720 at 8/15/17 6:10 PM: - [~kiszk] So, to clarify did the fix simply bump up the number of fields before we hit a failure and not actually make the number indefinite? As in we went from failing at 400 to failing at 1024. was (Author: poplav): So, to clarify did the fix simply bump up the number of fields before we hit a failure and not actually make the number indefinite? As in we went from failing at 400 to failing at 1024. > Filter predicate with many conditions throw stackoverflow error > --- > > Key: SPARK-21720 > URL: https://issues.apache.org/jira/browse/SPARK-21720 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: srinivasan > > When trying to filter on dataset with many predicate conditions on both spark > sql and dataset filter transformation as described below, spark throws a > stackoverflow exception > Case 1: Filter Transformation on Data > Dataset filter = sourceDataset.filter(String.format("not(%s)", > buildQuery())); > filter.show(); > where buildQuery() returns > Field1 = "" and Field2 = "" and Field3 = "" and Field4 = "" and Field5 = > "" and BLANK_5 = "" and Field7 = "" and Field8 = "" and Field9 = "" and > Field10 = "" and Field11 = "" and Field12 = "" and Field13 = "" and > Field14 = "" and Field15 = "" and Field16 = "" and Field17 = "" and > Field18 = "" and Field19 = "" and Field20 = "" and Field21 = "" and > Field22 = "" and Field23 = "" and Field24 = "" and Field25 = "" and > Field26 = "" and Field27 = "" and Field28 = "" and Field29 = "" and > Field30 = "" and Field31 = "" and Field32 = "" and Field33 = "" and > Field34 = "" and Field35 = "" and Field36 = "" and Field37 = "" and > Field38 = "" and Field39 = "" and Field40 = "" and Field41 = "" and > Field42 = "" and Field43 = "" and Field44 = "" and 
Field45 = "" and Field46 = "" and ... and Field184 = "" and Field185 = "" (intermediate conditions elided; all have the form FieldN = "")
[jira] [Comment Edited] (SPARK-21720) Filter predicate with many conditions throw stackoverflow error
[ https://issues.apache.org/jira/browse/SPARK-21720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16127657#comment-16127657 ] poplav edited comment on SPARK-21720 at 8/15/17 6:10 PM: - [~kiszk] So, to clarify: did the fix simply raise the number of fields we can handle before hitting a failure, rather than making that number unbounded? As in, we went from failing at 400 fields to failing at 1024? was (Author: poplav): [~kiszk] So, to clarify: did the fix simply raise the number of fields we can handle before hitting a failure, rather than making that number unbounded? As in, we went from failing at 400 fields to failing at 1024. > Filter predicate with many conditions throw stackoverflow error > --- > > Key: SPARK-21720 > URL: https://issues.apache.org/jira/browse/SPARK-21720 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: srinivasan > > When filtering a dataset with many predicate conditions, via either Spark SQL or the Dataset filter transformation as described below, Spark throws a StackOverflowError. > Case 1: Filter Transformation on Data > Dataset filter = sourceDataset.filter(String.format("not(%s)", buildQuery())); > filter.show(); > where buildQuery() returns > Field1 = "" and Field2 = "" and Field3 = "" and Field4 = "" and Field5 = "" and BLANK_5 = "" and Field7 = "" and ... and Field184 = "" (intermediate conditions elided; all have the form FieldN = "")
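The ticket never shows `buildQuery()` itself; a hypothetical sketch of such a helper (the name, the field count, and the use of Python rather than the reporter's Java are all assumptions) that produces the long predicate string quoted above:

```python
# Hypothetical reconstruction of the reporter's buildQuery() helper (the
# original Java implementation is not shown in the ticket): it joins one
# FieldN = "" equality condition per field with "and".
def build_query(num_fields=200):
    return " and ".join('Field%d = ""' % i for i in range(1, num_fields + 1))

print(build_query(3))
# → Field1 = "" and Field2 = "" and Field3 = ""
```

Parsing the resulting expression produces a deeply left-nested `And` tree, which is what drives the recursion depth during analysis.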
[jira] [Commented] (SPARK-21720) Filter predicate with many conditions throw stackoverflow error
[ https://issues.apache.org/jira/browse/SPARK-21720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16127657#comment-16127657 ] poplav commented on SPARK-21720: So, to clarify: did the fix simply raise the number of fields we can handle before hitting a failure, rather than making that number unbounded? As in, we went from failing at 400 fields to failing at 1024. > Filter predicate with many conditions throw stackoverflow error > --- > > Key: SPARK-21720 > URL: https://issues.apache.org/jira/browse/SPARK-21720 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: srinivasan > > When filtering a dataset with many predicate conditions, via either Spark SQL or the Dataset filter transformation as described below, Spark throws a StackOverflowError. > Case 1: Filter Transformation on Data > Dataset filter = sourceDataset.filter(String.format("not(%s)", buildQuery())); > filter.show(); > where buildQuery() returns > Field1 = "" and Field2 = "" and Field3 = "" and Field4 = "" and Field5 = "" and BLANK_5 = "" and Field7 = "" and ... and Field198 = "" (intermediate conditions elided; all have the form FieldN = "")
[jira] [Commented] (SPARK-21738) Thriftserver doesn't cancel jobs when session is closed
[ https://issues.apache.org/jira/browse/SPARK-21738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16127542#comment-16127542 ] Xiao Li commented on SPARK-21738: - https://github.com/apache/spark/pull/18951 > Thriftserver doesn't cancel jobs when session is closed > --- > > Key: SPARK-21738 > URL: https://issues.apache.org/jira/browse/SPARK-21738 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Marco Gaido > > When a session is closed, the jobs launched by that session should be killed to avoid wasting resources; currently, this doesn't happen. > So at the moment, if a user launches a query and then closes their connection, the query keeps running until completion. This behavior should be changed. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21657) Spark has exponential time complexity to explode(array of structs)
[ https://issues.apache.org/jira/browse/SPARK-21657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16127511#comment-16127511 ] Ruslan Dautkhanov edited comment on SPARK-21657 at 8/15/17 4:59 PM: Thank you [~maropu] and [~viirya], that commit is for Spark 2.2, so this problem might be worse in 2.2, but I don't think it's the root cause, as we see the same exponential time complexity to explode a nested array in Spark 2.0 and 2.1. was (Author: tagar): Thank you [~maropu] and [~viirya], that commit is for Spark 2.2 so this problem might be worse in 2.2, but I don't think it's a root cause. As we see the same exponential time complexity to explode in Spark 2.0 and 2.1. > Spark has exponential time complexity to explode(array of structs) > -- > > Key: SPARK-21657 > URL: https://issues.apache.org/jira/browse/SPARK-21657 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 2.0.0, 2.1.0, 2.1.1, 2.2.0 >Reporter: Ruslan Dautkhanov > Labels: cache, caching, collections, nested_types, performance, pyspark, sparksql, sql > Attachments: ExponentialTimeGrowth.PNG, nested-data-generator-and-test.py > > > It can take up to half a day to explode a modest-sized nested collection (0.5M) on recent Xeon processors. > See the attached pyspark script that reproduces this problem. > {code} > cached_df = sqlc.sql('select individ, hholdid, explode(amft) from ' + table_name).cache() > print sqlc.count() > {code} > This script generates a number of tables with the same total number of records across all nested collections (see the `scaling` variable in the loops). The `scaling` variable scales up how many nested elements are in each record, but scales down the number of records in the table by the same factor, so the total number of records stays the same. > Time grows exponentially (notice the log-10 vertical axis scale): > !ExponentialTimeGrowth.PNG! > At a scaling of 50,000 (see the attached pyspark script), it took 7 hours to explode the nested collections (\!) of 8k records. > Beyond 1,000 elements per nested collection, time grows exponentially.
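The benchmark design described above can be sketched without Spark; a minimal pure-Python model of the `scaling` scheme, where the 400M total element count is an assumption chosen to be consistent with the reported 8k records at a scaling of 50,000:

```python
# Pure-Python sketch of the benchmark design (no Spark needed): `scaling`
# multiplies the nested-collection size and divides the row count, keeping
# the total element count constant. TOTAL_ELEMENTS = 400M is an assumption
# consistent with the reported 8k records at scaling = 50,000.
TOTAL_ELEMENTS = 400_000_000

def table_shape(scaling):
    # Each row holds `scaling` nested elements; the row count shrinks to match.
    num_rows = TOTAL_ELEMENTS // scaling
    return num_rows, scaling

for s in (1_000, 8_000, 50_000):
    rows, per_row = table_shape(s)
    assert rows * per_row == TOTAL_ELEMENTS  # total work is held constant
```

Because total work is constant by construction, any growth in wall-clock time across `scaling` values isolates the cost of nested-collection size itself.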
[jira] [Commented] (SPARK-21657) Spark has exponential time complexity to explode(array of structs)
[ https://issues.apache.org/jira/browse/SPARK-21657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16127511#comment-16127511 ] Ruslan Dautkhanov commented on SPARK-21657: --- Thank you [~maropu] and [~viirya], that commit is for Spark 2.2, so this problem might be worse in 2.2, but I don't think it's the root cause, as we see the same exponential time complexity to explode in Spark 2.0 and 2.1. > Spark has exponential time complexity to explode(array of structs) > -- > > Key: SPARK-21657 > URL: https://issues.apache.org/jira/browse/SPARK-21657 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 2.0.0, 2.1.0, 2.1.1, 2.2.0 >Reporter: Ruslan Dautkhanov > Labels: cache, caching, collections, nested_types, performance, pyspark, sparksql, sql > Attachments: ExponentialTimeGrowth.PNG, nested-data-generator-and-test.py > > > It can take up to half a day to explode a modest-sized nested collection (0.5M) on recent Xeon processors. > See the attached pyspark script that reproduces this problem. > {code} > cached_df = sqlc.sql('select individ, hholdid, explode(amft) from ' + table_name).cache() > print sqlc.count() > {code} > This script generates a number of tables with the same total number of records across all nested collections (see the `scaling` variable in the loops). The `scaling` variable scales up how many nested elements are in each record, but scales down the number of records in the table by the same factor, so the total number of records stays the same. > Time grows exponentially (notice the log-10 vertical axis scale): > !ExponentialTimeGrowth.PNG! > At a scaling of 50,000 (see the attached pyspark script), it took 7 hours to explode the nested collections (\!) of 8k records. > Beyond 1,000 elements per nested collection, time grows exponentially.
[jira] [Commented] (SPARK-21720) Filter predicate with many conditions throw stackoverflow error
[ https://issues.apache.org/jira/browse/SPARK-21720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16127477#comment-16127477 ] Kazuaki Ishizaki commented on SPARK-21720: -- In this case, adding the JVM option {{-Xss512m}} eliminates this exception and the query works well. When the number of fields is 1024, I got the following exception: {code} 08:41:40.022 ERROR org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator: failed to compile: org.codehaus.janino.JaninoRuntimeException: Code of method "apply(Lorg/apache/spark/sql/catalyst/InternalRow;)Lorg/apache/spark/sql/catalyst/expressions/UnsafeRow;" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection" grows beyond 64 KB ... {code} I am working on solving this 64 KB problem. > Filter predicate with many conditions throw stackoverflow error > --- > > Key: SPARK-21720 > URL: https://issues.apache.org/jira/browse/SPARK-21720 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: srinivasan > > When filtering a dataset with many predicate conditions, via either Spark SQL or the Dataset filter transformation as described below, Spark throws a StackOverflowError. > Case 1: Filter Transformation on Data > Dataset filter = sourceDataset.filter(String.format("not(%s)", buildQuery())); > filter.show(); > where buildQuery() returns > Field1 = "" and Field2 = "" and Field3 = "" and Field4 = "" and Field5 = "" and BLANK_5 = "" and Field7 = "" and ... and Field176 = "" (intermediate conditions elided; all have the form FieldN = "")
[jira] [Comment Edited] (SPARK-21720) Filter predicate with many conditions throw stackoverflow error
[ https://issues.apache.org/jira/browse/SPARK-21720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16127477#comment-16127477 ] Kazuaki Ishizaki edited comment on SPARK-21720 at 8/15/17 4:26 PM: --- In this case, adding the JVM option {{-Xss512m}} eliminates this exception and the query works well. However, when the number of fields is 1024, I got the following exception: {code} 08:41:40.022 ERROR org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator: failed to compile: org.codehaus.janino.JaninoRuntimeException: Code of method "apply(Lorg/apache/spark/sql/catalyst/InternalRow;)Lorg/apache/spark/sql/catalyst/expressions/UnsafeRow;" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection" grows beyond 64 KB ... {code} I am working on solving this 64 KB problem. was (Author: kiszk): In this case, adding the JVM option {{-Xss512m}} eliminates this exception and the query works well. When the number of fields is 1024, I got the following exception: {code} 08:41:40.022 ERROR org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator: failed to compile: org.codehaus.janino.JaninoRuntimeException: Code of method "apply(Lorg/apache/spark/sql/catalyst/InternalRow;)Lorg/apache/spark/sql/catalyst/expressions/UnsafeRow;" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection" grows beyond 64 KB ... {code} I am working on solving this 64 KB problem.
> Filter predicate with many conditions throw stackoverflow error > --- > > Key: SPARK-21720 > URL: https://issues.apache.org/jira/browse/SPARK-21720 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: srinivasan > > When filtering a dataset with many predicate conditions, via either Spark SQL or the Dataset filter transformation as described below, Spark throws a StackOverflowError. > Case 1: Filter Transformation on Data > Dataset filter = sourceDataset.filter(String.format("not(%s)", buildQuery())); > filter.show(); > where buildQuery() returns > Field1 = "" and Field2 = "" and Field3 = "" and Field4 = "" and Field5 = "" and BLANK_5 = "" and Field7 = "" and ... and Field141 = "" (intermediate conditions elided; all have the form FieldN = "")
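Beyond raising the stack size, a client-side mitigation (not proposed in this ticket; names are hypothetical) is to combine the conditions as a balanced tree, so expression depth grows logarithmically rather than linearly with the number of conditions. A sketch over plain predicate strings:

```python
# Client-side mitigation sketch: combine predicate fragments into a balanced
# tree of `and`s so the parsed expression depth is O(log n) instead of O(n).
# The same pairwise combination can be applied to DataFrame Column objects.
def balanced_and(conds):
    if len(conds) == 1:
        return conds[0]
    # Pair up neighbours, then recurse on the (roughly halved) list.
    paired = [
        "(%s and %s)" % (conds[i], conds[i + 1]) if i + 1 < len(conds) else conds[i]
        for i in range(0, len(conds), 2)
    ]
    return balanced_and(paired)

expr = balanced_and(['Field%d = ""' % i for i in range(1, 5)])
print(expr)
# → ((Field1 = "" and Field2 = "") and (Field3 = "" and Field4 = ""))
```

With 1024 conditions the balanced form is only 10 levels deep, versus 1023 levels for the naive left-nested chain.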
[jira] [Resolved] (SPARK-21034) Allow filter pushdown filters through non deterministic functions for columns involved in groupby / join
[ https://issues.apache.org/jira/browse/SPARK-21034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Abhijit Bhole resolved SPARK-21034. --- Resolution: Duplicate > Allow filter pushdown filters through non deterministic functions for columns > involved in groupby / join > > > Key: SPARK-21034 > URL: https://issues.apache.org/jira/browse/SPARK-21034 > Project: Spark > Issue Type: Bug > Components: Optimizer, SQL >Affects Versions: 2.1.1, 2.2.0 >Reporter: Abhijit Bhole > > If the column is involved in an aggregation / join, then pushing down the filter on that column should not change the results. > Here is sample code: > {code:java} > from pyspark.sql import functions as F > df = spark.createDataFrame([{ "a": 1, "b" : 2, "c":7}, { "a": 3, "b" : 4, "c" : 8}, >    { "a": 1, "b" : 5, "c":7}, { "a": 1, "b" : 6, "c":7} ]) > df.groupBy(["a"]).agg(F.sum("b")).where("a = 1").explain() > df.groupBy(["a"]).agg(F.sum("b"), F.first("c")).where("a = 1").explain() > == Physical Plan == > *HashAggregate(keys=[a#15L], functions=[sum(b#16L)]) > +- Exchange hashpartitioning(a#15L, 4) >    +- *HashAggregate(keys=[a#15L], functions=[partial_sum(b#16L)]) >       +- *Project [a#15L, b#16L] >          +- *Filter (isnotnull(a#15L) && (a#15L = 1)) >             +- Scan ExistingRDD[a#15L,b#16L,c#17L] > >>> > >>> df.groupBy(["a"]).agg(F.sum("b"), F.first("c")).where("a = 1").explain() > == Physical Plan == > *Filter (isnotnull(a#15L) && (a#15L = 1)) > +- *HashAggregate(keys=[a#15L], functions=[sum(b#16L), first(c#17L, false)]) >    +- Exchange hashpartitioning(a#15L, 4) >       +- *HashAggregate(keys=[a#15L], functions=[partial_sum(b#16L), partial_first(c#17L, false)]) >          +- Scan ExistingRDD[a#15L,b#16L,c#17L] > {code} > As you can see, the filter is not pushed down when the F.first aggregate function is used.
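The claim above — that a filter on a group-by key commutes with the aggregation — can be checked with a minimal pure-Python model (not Spark code), using the same sample rows as the ticket:

```python
# Minimal pure-Python model (not Spark code) of the pushdown claim: when the
# filter is on the group-by key, filtering before the aggregation and
# filtering after it give the same answer.
from collections import defaultdict

rows = [{"a": 1, "b": 2, "c": 7}, {"a": 3, "b": 4, "c": 8},
        {"a": 1, "b": 5, "c": 7}, {"a": 1, "b": 6, "c": 7}]

def group_sum(data):
    # groupBy("a").agg(sum("b")) as a plain dict.
    out = defaultdict(int)
    for r in data:
        out[r["a"]] += r["b"]
    return dict(out)

pushed = group_sum([r for r in rows if r["a"] == 1])           # filter pushed below the aggregate
on_top = {k: v for k, v in group_sum(rows).items() if k == 1}  # filter applied on top
assert pushed == on_top == {1: 13}
```

Each group's result depends only on its own rows, so discarding whole groups early (the pushed-down filter) cannot affect the surviving groups — which is why the optimizer could safely push the filter even when `first(c)` is non-deterministic within a group.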
[jira] [Updated] (SPARK-21739) timestamp partition would fail in v2.2.0
[ https://issues.apache.org/jira/browse/SPARK-21739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wangzhihao updated SPARK-21739: --- Description: Spark v2.2.0 introduces TimeZoneAwareExpression, which causes bugs when we select data from a table with timestamp partitions. The steps to reproduce it: {code:java} spark.sql("create table test (foo string) partitioned by (ts timestamp)") spark.sql("insert into table test partition(ts = 1) values('hi')") spark.table("test").show() {code} The root cause is that TableReader.scala#230 tries to cast the string to a timestamp regardless of whether the timeZone exists. Here is the error stack trace {code:none} java.util.NoSuchElementException: None.get at scala.None$.get(Option.scala:347) at scala.None$.get(Option.scala:345) at org.apache.spark.sql.catalyst.expressions.TimeZoneAwareExpression$class.timeZone(datetimeExpressions.scala:46) at org.apache.spark.sql.catalyst.expressions.Cast.timeZone$lzycompute(Cast.scala:172) at org.apache.spark.sql.catalyst.expressions.Cast.timeZone(Cast.scala:172) at org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToTimestamp$1$$anonfun$apply$24.apply(Cast.scala:253) at org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToTimestamp$1$$anonfun$apply$24.apply(Cast.scala:253) at org.apache.spark.sql.catalyst.expressions.Cast.org$apache$spark$sql$catalyst$expressions$Cast$$buildCast(Cast.scala:201) at org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToTimestamp$1.apply(Cast.scala:253) at org.apache.spark.sql.catalyst.expressions.Cast.nullSafeEval(Cast.scala:533) at org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:327) at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$5$$anonfun$fillPartitionKeys$1$1.apply(TableReader.scala:230) at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$5$$anonfun$fillPartitionKeys$1$1.apply(TableReader.scala:228) {code} was: Spark v2.2.0 introduces TimeZoneAwareExpression, which causes bugs when we select data from a table with timestamp partitions. The steps to reproduce it: {code:java} spark.sql("create table test (foo string) partitioned by (ts timestamp)") spark.sql("insert into table test partition(ts = 1) values('hi')") spark.table("test").show() {code} The root cause is that TableReader.scala#230 tries to cast the string to a timestamp regardless of whether the timeZone exists. Here is the error stack trace {code:scala} java.util.NoSuchElementException: None.get at scala.None$.get(Option.scala:347) at scala.None$.get(Option.scala:345) at org.apache.spark.sql.catalyst.expressions.TimeZoneAwareExpression$class.timeZone(datetimeExpressions.scala:46) at org.apache.spark.sql.catalyst.expressions.Cast.timeZone$lzycompute(Cast.scala:172) at org.apache.spark.sql.catalyst.expressions.Cast.timeZone(Cast.scala:172) at org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToTimestamp$1$$anonfun$apply$24.apply(Cast.scala:253) at org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToTimestamp$1$$anonfun$apply$24.apply(Cast.scala:253) at org.apache.spark.sql.catalyst.expressions.Cast.org$apache$spark$sql$catalyst$expressions$Cast$$buildCast(Cast.scala:201) at org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToTimestamp$1.apply(Cast.scala:253) at org.apache.spark.sql.catalyst.expressions.Cast.nullSafeEval(Cast.scala:533) at org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:327) at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$5$$anonfun$fillPartitionKeys$1$1.apply(TableReader.scala:230) at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$5$$anonfun$fillPartitionKeys$1$1.apply(TableReader.scala:228) {code} > timestamp partition would fail in v2.2.0 > > > Key: SPARK-21739 > URL: https://issues.apache.org/jira/browse/SPARK-21739 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: wangzhihao > > Spark v2.2.0 introduces TimeZoneAwareExpression, which causes bugs when we > select data from a table with timestamp partitions. > The steps to reproduce it: > {code:java} > spark.sql("create table test (foo string) partitioned by (ts timestamp)") > spark.sql("insert into table test partition(ts = 1) values('hi')") > spark.table("test").show() > {code} > The root cause is that TableReader.scala#230 tries to cast the string to a > timestamp regardless of whether the timeZone exists. > Here is the error stack trace > {code:none} > java.util.NoSuchElementException: None.get > at scala.None$.get(Option.scala:347) > at scala.None$.get(Option
[jira] [Updated] (SPARK-21739) timestamp partition would fail in v2.2.0
[ https://issues.apache.org/jira/browse/SPARK-21739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wangzhihao updated SPARK-21739:
---
Description:
Spark v2.2.0 introduced TimeZoneAwareExpression, which causes bugs if we select data from a table with timestamp partitions. The steps to reproduce it:
{code:java}
spark.sql("create table test (foo string) partitioned by (ts timestamp)")
spark.sql("insert into table test partition(ts = 1) values('hi')")
spark.table("test").show()
{code}
The root cause is that TableReader.scala#230 tries to cast the string to timestamp regardless of whether the timeZone exists. Here is the error stack trace:
{code}
java.util.NoSuchElementException: None.get
  at scala.None$.get(Option.scala:347)
  at scala.None$.get(Option.scala:345)
  at org.apache.spark.sql.catalyst.expressions.TimeZoneAwareExpression$class.timeZone(datetimeExpressions.scala:46)
  at org.apache.spark.sql.catalyst.expressions.Cast.timeZone$lzycompute(Cast.scala:172)
  at org.apache.spark.sql.catalyst.expressions.Cast.timeZone(Cast.scala:172)
  at org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToTimestamp$1$$anonfun$apply$24.apply(Cast.scala:253)
  at org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToTimestamp$1$$anonfun$apply$24.apply(Cast.scala:253)
  at org.apache.spark.sql.catalyst.expressions.Cast.org$apache$spark$sql$catalyst$expressions$Cast$$buildCast(Cast.scala:201)
  at org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToTimestamp$1.apply(Cast.scala:253)
  at org.apache.spark.sql.catalyst.expressions.Cast.nullSafeEval(Cast.scala:533)
  at org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:327)
  at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$5$$anonfun$fillPartitionKeys$1$1.apply(TableReader.scala:230)
  at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$5$$anonfun$fillPartitionKeys$1$1.apply(TableReader.scala:228)
{code}

> timestamp partition would fail in v2.2.0
>
> Key: SPARK-21739
> URL: https://issues.apache.org/jira/browse/SPARK-21739
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.2.0
> Reporter: wangzhihao
>
> Spark v2.2.0 introduced TimeZoneAwareExpression, which causes bugs if we
> select data from a table with timestamp partitions.
> The steps to reproduce it:
> {code:java}
> spark.sql("create table test (foo string) partitioned by (ts timestamp)")
> spark.sql("insert into table test partition(ts = 1) values('hi')")
> spark.table("test").show()
> {code}
> The root cause is that TableReader.scala#230 tries to cast the string to
> timestamp regardless of whether the timeZone exists.
> Here is the error stack trace:
> {code}
> java.util.NoSuchElementException: None.get
> at scala.None$.get(Option.scala:347)
> at scala.None$.get(Option.scala:345)
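The trace shows the failure happens when the Cast built for a partition value is evaluated without a bound time zone: TimeZoneAwareExpression.timeZone calls .get on an empty Option. A minimal sketch of the fix shape, assuming Spark 2.2's internal Cast(child, dataType, timeZoneId) constructor (internal API; shown for illustration only, not the actual patch):
{code:scala}
import org.apache.spark.sql.catalyst.expressions.{Cast, Literal}
import org.apache.spark.sql.types.TimestampType

// The partition value arrives as a string and is cast to the partition
// column's type. With timeZoneId = None, evaluating the cast throws
// java.util.NoSuchElementException: None.get (the trace above).
val raw = Literal("2017-08-15 00:00:00")
val broken = Cast(raw, TimestampType, timeZoneId = None)

// Fix shape: bind the session-local time zone before evaluation, so
// TimeZoneAwareExpression.timeZone can resolve it.
val fixed = Cast(raw, TimestampType, timeZoneId = Some("UTC"))
{code}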
[jira] [Commented] (SPARK-21034) Allow filter pushdown filters through non deterministic functions for columns involved in groupby / join
[ https://issues.apache.org/jira/browse/SPARK-21034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16127454#comment-16127454 ] Abhijit Bhole commented on SPARK-21034: --- Thank you !! > Allow filter pushdown filters through non deterministic functions for columns > involved in groupby / join > > > Key: SPARK-21034 > URL: https://issues.apache.org/jira/browse/SPARK-21034 > Project: Spark > Issue Type: Bug > Components: Optimizer, SQL >Affects Versions: 2.1.1, 2.2.0 >Reporter: Abhijit Bhole > > If the column is involved in aggregation / join then pushing down filter > should not change the results. > Here is a sample code - > {code:java} > from pyspark.sql import functions as F > df = spark.createDataFrame([{ "a": 1, "b" : 2, "c":7}, { "a": 3, "b" : 4, "c" > : 8}, >{ "a": 1, "b" : 5, "c":7}, { "a": 1, "b" : 6, > "c":7} ]) > df.groupBy(["a"]).agg(F.sum("b")).where("a = 1").explain() > df.groupBy(["a"]).agg(F.sum("b"), F.first("c")).where("a = 1").explain() > == Physical Plan == > *HashAggregate(keys=[a#15L], functions=[sum(b#16L)]) > +- Exchange hashpartitioning(a#15L, 4) >+- *HashAggregate(keys=[a#15L], functions=[partial_sum(b#16L)]) > +- *Project [a#15L, b#16L] > +- *Filter (isnotnull(a#15L) && (a#15L = 1)) > +- Scan ExistingRDD[a#15L,b#16L,c#17L] > >>> > >>> df.groupBy(["a"]).agg(F.sum("b"), F.first("c")).where("a = 1").explain() > == Physical Plan == > *Filter (isnotnull(a#15L) && (a#15L = 1)) > +- *HashAggregate(keys=[a#15L], functions=[sum(b#16L), first(c#17L, false)]) >+- Exchange hashpartitioning(a#15L, 4) > +- *HashAggregate(keys=[a#15L], functions=[partial_sum(b#16L), > partial_first(c#17L, false)]) > +- Scan ExistingRDD[a#15L,b#16L,c#17L] > {code} > As you can see, the filter is not pushed down when F.first aggregate function > is used. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
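Until the optimizer can push such filters itself, the reported query can be restructured so no pushdown is needed: applying the selective predicate before the aggregation places it below the aggregate by construction. A hedged workaround sketch, assuming the DataFrame df from the example above and a running SparkSession:
{code:python}
from pyspark.sql import functions as F

# Workaround sketch: filter first, then aggregate. The scan now reads only
# the a = 1 rows even though F.first() is non-deterministic, because the
# Filter is already below the HashAggregate in the plan.
result = df.where("a = 1").groupBy("a").agg(F.sum("b"), F.first("c"))
result.explain()
{code}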
[jira] [Created] (SPARK-21739) timestamp partition would fail in v2.2.0
wangzhihao created SPARK-21739:
---
Summary: timestamp partition would fail in v2.2.0
Key: SPARK-21739
URL: https://issues.apache.org/jira/browse/SPARK-21739
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 2.2.0
Reporter: wangzhihao

Spark v2.2.0 introduced TimeZoneAwareExpression, which causes bugs if we select data from a table with timestamp partitions. The steps to reproduce it:
{code:java}
spark.sql("create table test (foo string) partitioned by (ts timestamp)")
spark.sql("insert into table test partition(ts = 1) values('hi')")
spark.table("test").show()
{code}
The error stack trace is attached. The root cause is that TableReader.scala#230 tries to cast the string to timestamp regardless of whether the timeZone exists.
[jira] [Resolved] (SPARK-21721) Memory leak in org.apache.spark.sql.hive.execution.InsertIntoHiveTable
[ https://issues.apache.org/jira/browse/SPARK-21721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-21721. - Resolution: Fixed Assignee: Liang-Chi Hsieh Fix Version/s: 2.3.0 2.2.1 2.1.2 > Memory leak in org.apache.spark.sql.hive.execution.InsertIntoHiveTable > -- > > Key: SPARK-21721 > URL: https://issues.apache.org/jira/browse/SPARK-21721 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.1, 2.2.0 >Reporter: yzheng616 >Assignee: Liang-Chi Hsieh >Priority: Critical > Fix For: 2.1.2, 2.2.1, 2.3.0 > > > The leak came from org.apache.spark.sql.hive.execution.InsertIntoHiveTable. > At line 118, it put a staging path to FileSystem delete cache, and then > remove the path from disk at line 385. It does not remove the path from > FileSystem cache. If a streaming application keep persisting data to a > partitioned hive table, the memory will keep increasing until JVM terminated. > Below is a simple code to reproduce it. > {code:java} > package test > import org.apache.spark.sql.SparkSession > import org.apache.hadoop.fs.Path > import org.apache.hadoop.fs.FileSystem > import org.apache.spark.sql.SaveMode > import java.lang.reflect.Field > case class PathLeakTest(id: Int, gp: String) > object StagePathLeak { > def main(args: Array[String]): Unit = { > val spark = > SparkSession.builder().master("local[4]").appName("StagePathLeak").enableHiveSupport().getOrCreate() > spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict") > //create a partitioned table > spark.sql("drop table if exists path_leak"); > spark.sql("create table if not exists path_leak(id int)" + > " partitioned by (gp String)"+ > " row format serde > 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'"+ > " stored as"+ > " inputformat > 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'"+ > " outputformat > 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'") > var seq = new 
scala.collection.mutable.ArrayBuffer[PathLeakTest]()
> // 2 partitions
> for (x <- 1 to 2) {
>   seq += (new PathLeakTest(x, "g" + x))
> }
> val rdd = spark.sparkContext.makeRDD[PathLeakTest](seq)
> // insert 50 records into the Hive table
> for (j <- 1 to 50) {
>   val df = spark.createDataFrame(rdd)
>   // #1 InsertIntoHiveTable line 118: adds the staging path to the FileSystem deleteOnExit cache
>   // #2 InsertIntoHiveTable line 385: deletes the path from disk but not from the FileSystem cache, which causes the leak
>   df.write.mode(SaveMode.Overwrite).insertInto("path_leak")
> }
>
> val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
> val deleteOnExit = getDeleteOnExit(fs.getClass)
> deleteOnExit.setAccessible(true)
> val caches = deleteOnExit.get(fs).asInstanceOf[java.util.TreeSet[Path]]
> // check the FileSystem deleteOnExit cache size
> println(caches.size())
> val it = caches.iterator()
> // all staging paths are still cached even though they have already been deleted from the disk
> while (it.hasNext()) {
>   println(it.next())
> }
> }
>
> def getDeleteOnExit(cls: Class[_]): Field = {
>   try {
>     return cls.getDeclaredField("deleteOnExit")
>   } catch {
>     case ex: NoSuchFieldException =>
>       return getDeleteOnExit(cls.getSuperclass)
>   }
>   return null
> }
> }
> {code}
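The fix shape implied by the report is to evict the staging path from the FileSystem cache once it has been deleted from disk. A hedged sketch, assuming Hadoop's FileSystem.cancelDeleteOnExit (available in Hadoop 2.x); cleanupStagingDir and stagingPath are illustrative names, not Spark's actual code:
{code:scala}
import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch only: after the staging directory is removed from disk, also drop
// it from the deleteOnExit cache, so the TreeSet[Path] held by the
// FileSystem instance cannot grow without bound in a long-running job.
def cleanupStagingDir(fs: FileSystem, stagingPath: Path): Unit = {
  if (fs.exists(stagingPath)) {
    fs.delete(stagingPath, true) // recursive delete
  }
  // The missing step that caused the leak: evict the path from the cache.
  fs.cancelDeleteOnExit(stagingPath)
}
{code}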
[jira] [Created] (SPARK-21738) Thriftserver doesn't cancel jobs when session is closed
Marco Gaido created SPARK-21738:
---
Summary: Thriftserver doesn't cancel jobs when session is closed
Key: SPARK-21738
URL: https://issues.apache.org/jira/browse/SPARK-21738
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.2.0
Reporter: Marco Gaido

When a session is closed, the jobs launched by that session should be killed to avoid wasting resources. Instead, this doesn't happen: at the moment, if a user launches a query and then closes the connection, the query keeps running until completion. This behavior should be changed.
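One way to get the requested behavior is to tag every statement a session runs with a job group and cancel that group when the session closes. A hedged sketch against the public SparkContext API (sessionId and the two helper names are illustrative, not the Thrift server's actual code):
{code:scala}
import org.apache.spark.SparkContext

// Sketch: tie jobs started on behalf of a session to a job group named
// after that session; interruptOnCancel = true also asks executors to
// interrupt running tasks when the group is cancelled.
def runStatement(sc: SparkContext, sessionId: String)(statement: => Unit): Unit = {
  sc.setJobGroup(sessionId, s"Statements for session $sessionId", interruptOnCancel = true)
  statement
}

// On session close, cancel every active job in that group instead of
// letting the queries run to completion.
def closeSession(sc: SparkContext, sessionId: String): Unit = {
  sc.cancelJobGroup(sessionId)
}
{code}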
[jira] [Commented] (SPARK-19019) PySpark does not work with Python 3.6.0
[ https://issues.apache.org/jira/browse/SPARK-19019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16127339#comment-16127339 ] Hyukjin Kwon commented on SPARK-19019: -- I think this was backported into Spark 2.1.1. Was your Spark version, 2.1.1+? > PySpark does not work with Python 3.6.0 > --- > > Key: SPARK-19019 > URL: https://issues.apache.org/jira/browse/SPARK-19019 > Project: Spark > Issue Type: Bug > Components: PySpark >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Critical > Fix For: 1.6.4, 2.0.3, 2.1.1, 2.2.0 > > > Currently, PySpark does not work with Python 3.6.0. > Running {{./bin/pyspark}} simply throws the error as below: > {code} > Traceback (most recent call last): > File ".../spark/python/pyspark/shell.py", line 30, in > import pyspark > File ".../spark/python/pyspark/__init__.py", line 46, in > from pyspark.context import SparkContext > File ".../spark/python/pyspark/context.py", line 36, in > from pyspark.java_gateway import launch_gateway > File ".../spark/python/pyspark/java_gateway.py", line 31, in > from py4j.java_gateway import java_import, JavaGateway, GatewayClient > File "", line 961, in _find_and_load > File "", line 950, in _find_and_load_unlocked > File "", line 646, in _load_unlocked > File "", line 616, in _load_backward_compatible > File ".../spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line > 18, in > File > "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pydoc.py", > line 62, in > import pkgutil > File > "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pkgutil.py", > line 22, in > ModuleInfo = namedtuple('ModuleInfo', 'module_finder name ispkg') > File ".../spark/python/pyspark/serializers.py", line 394, in namedtuple > cls = _old_namedtuple(*args, **kwargs) > TypeError: namedtuple() missing 3 required keyword-only arguments: 'verbose', > 'rename', and 'module' > {code} > The problem is 
in > https://github.com/apache/spark/blob/3c68944b229aaaeeaee3efcbae3e3be9a2914855/python/pyspark/serializers.py#L386-L394 > as the error says, and the cause seems to be that the arguments of > {{namedtuple}} are now completely keyword-only arguments from Python 3.6.0 > (See https://bugs.python.org/issue25628). > We currently copy this function via {{types.FunctionType}}, which does not set > the default values of keyword-only arguments (meaning > {{namedtuple.__kwdefaults__}}), and this seems to cause internally missing > values in the function (non-bound arguments). > This ends up as below: > {code} > import types > import collections > def _copy_func(f): > return types.FunctionType(f.__code__, f.__globals__, f.__name__, > f.__defaults__, f.__closure__) > _old_namedtuple = _copy_func(collections.namedtuple) > {code} > If we call as below: > {code} > >>> _old_namedtuple("a", "b") > Traceback (most recent call last): > File "", line 1, in > TypeError: namedtuple() missing 3 required keyword-only arguments: 'verbose', > 'rename', and 'module' > {code} > It throws an exception as above because {{__kwdefaults__}} for required > keyword arguments seem unset in the copied function. So, if we give explicit > values for these, > {code} > >>> _old_namedtuple("a", "b", verbose=False, rename=False, module=None) > > {code} > It works fine. > It seems now we should properly set these on the hijacked one. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19019) PySpark does not work with Python 3.6.0
[ https://issues.apache.org/jira/browse/SPARK-19019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16127330#comment-16127330 ] Mathias M. Andersen commented on SPARK-19019: - Just got this error post fix on spark 2.1: Traceback (most recent call last): File "/opt/anaconda3/lib/python3.6/runpy.py", line 183, in _run_module_as_main mod_name, mod_spec, code = _get_module_details(mod_name, _Error) File "/opt/anaconda3/lib/python3.6/runpy.py", line 109, in _get_module_details __import__(pkg_name) File "/usr/hdp/current/spark-client/python/pyspark/__init__.py", line 41, in from pyspark.context import SparkContext File "/usr/hdp/current/spark-client/python/pyspark/context.py", line 33, in from pyspark.java_gateway import launch_gateway File "/usr/hdp/current/spark-client/python/pyspark/java_gateway.py", line 25, in import platform File "/opt/anaconda3/lib/python3.6/platform.py", line 886, in "system node release version machine processor") File "/usr/hdp/current/spark-client/python/pyspark/serializers.py", line 381, in namedtuple cls = _old_namedtuple(*args, **kwargs) TypeError: namedtuple() missing 3 required keyword-only arguments: 'verbose', 'rename', and 'module' > PySpark does not work with Python 3.6.0 > --- > > Key: SPARK-19019 > URL: https://issues.apache.org/jira/browse/SPARK-19019 > Project: Spark > Issue Type: Bug > Components: PySpark >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Critical > Fix For: 1.6.4, 2.0.3, 2.1.1, 2.2.0 > > > Currently, PySpark does not work with Python 3.6.0. 
> Running {{./bin/pyspark}} simply throws the error as below: > {code} > Traceback (most recent call last): > File ".../spark/python/pyspark/shell.py", line 30, in > import pyspark > File ".../spark/python/pyspark/__init__.py", line 46, in > from pyspark.context import SparkContext > File ".../spark/python/pyspark/context.py", line 36, in > from pyspark.java_gateway import launch_gateway > File ".../spark/python/pyspark/java_gateway.py", line 31, in > from py4j.java_gateway import java_import, JavaGateway, GatewayClient > File "", line 961, in _find_and_load > File "", line 950, in _find_and_load_unlocked > File "", line 646, in _load_unlocked > File "", line 616, in _load_backward_compatible > File ".../spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line > 18, in > File > "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pydoc.py", > line 62, in > import pkgutil > File > "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pkgutil.py", > line 22, in > ModuleInfo = namedtuple('ModuleInfo', 'module_finder name ispkg') > File ".../spark/python/pyspark/serializers.py", line 394, in namedtuple > cls = _old_namedtuple(*args, **kwargs) > TypeError: namedtuple() missing 3 required keyword-only arguments: 'verbose', > 'rename', and 'module' > {code} > The problem is in > https://github.com/apache/spark/blob/3c68944b229aaaeeaee3efcbae3e3be9a2914855/python/pyspark/serializers.py#L386-L394 > as the error says and the cause seems because the arguments of > {{namedtuple}} are now completely keyword-only arguments from Python 3.6.0 > (See https://bugs.python.org/issue25628). > We currently copy this function via {{types.FunctionType}} which does not set > the default values of keyword-only arguments (meaning > {{namedtuple.__kwdefaults__}}) and this seems causing internally missing > values in the function (non-bound arguments). 
> This ends up as below: > {code} > import types > import collections > def _copy_func(f): > return types.FunctionType(f.__code__, f.__globals__, f.__name__, > f.__defaults__, f.__closure__) > _old_namedtuple = _copy_func(collections.namedtuple) > {code} > If we call as below: > {code} > >>> _old_namedtuple("a", "b") > Traceback (most recent call last): > File "", line 1, in > TypeError: namedtuple() missing 3 required keyword-only arguments: 'verbose', > 'rename', and 'module' > {code} > It throws an exception as above because {{__kwdefaults__}} for required > keyword arguments seem unset in the copied function. So, if we give explicit > values for these, > {code} > >>> _old_namedtuple("a", "b", verbose=False, rename=False, module=None) > > {code} > It works fine. > It seems now we should properly set these on the hijacked one. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h
[jira] [Comment Edited] (SPARK-19019) PySpark does not work with Python 3.6.0
[ https://issues.apache.org/jira/browse/SPARK-19019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16127330#comment-16127330 ] Mathias M. Andersen edited comment on SPARK-19019 at 8/15/17 2:51 PM: -- Just got this error post fix on spark 2.1: {code:java} Traceback (most recent call last): File "/opt/anaconda3/lib/python3.6/runpy.py", line 183, in _run_module_as_main mod_name, mod_spec, code = _get_module_details(mod_name, _Error) File "/opt/anaconda3/lib/python3.6/runpy.py", line 109, in _get_module_details __import__(pkg_name) File "/usr/hdp/current/spark-client/python/pyspark/__init__.py", line 41, in from pyspark.context import SparkContext File "/usr/hdp/current/spark-client/python/pyspark/context.py", line 33, in from pyspark.java_gateway import launch_gateway File "/usr/hdp/current/spark-client/python/pyspark/java_gateway.py", line 25, in import platform File "/opt/anaconda3/lib/python3.6/platform.py", line 886, in "system node release version machine processor") File "/usr/hdp/current/spark-client/python/pyspark/serializers.py", line 381, in namedtuple cls = _old_namedtuple(*args, **kwargs) TypeError: namedtuple() missing 3 required keyword-only arguments: 'verbose', 'rename', and 'module' {code} was (Author: mrmathias): Just got this error post fix on spark 2.1: Traceback (most recent call last): File "/opt/anaconda3/lib/python3.6/runpy.py", line 183, in _run_module_as_main mod_name, mod_spec, code = _get_module_details(mod_name, _Error) File "/opt/anaconda3/lib/python3.6/runpy.py", line 109, in _get_module_details __import__(pkg_name) File "/usr/hdp/current/spark-client/python/pyspark/__init__.py", line 41, in from pyspark.context import SparkContext File "/usr/hdp/current/spark-client/python/pyspark/context.py", line 33, in from pyspark.java_gateway import launch_gateway File "/usr/hdp/current/spark-client/python/pyspark/java_gateway.py", line 25, in import platform File "/opt/anaconda3/lib/python3.6/platform.py", 
line 886, in "system node release version machine processor") File "/usr/hdp/current/spark-client/python/pyspark/serializers.py", line 381, in namedtuple cls = _old_namedtuple(*args, **kwargs) TypeError: namedtuple() missing 3 required keyword-only arguments: 'verbose', 'rename', and 'module' > PySpark does not work with Python 3.6.0 > --- > > Key: SPARK-19019 > URL: https://issues.apache.org/jira/browse/SPARK-19019 > Project: Spark > Issue Type: Bug > Components: PySpark >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Critical > Fix For: 1.6.4, 2.0.3, 2.1.1, 2.2.0 > > > Currently, PySpark does not work with Python 3.6.0. > Running {{./bin/pyspark}} simply throws the error as below: > {code} > Traceback (most recent call last): > File ".../spark/python/pyspark/shell.py", line 30, in > import pyspark > File ".../spark/python/pyspark/__init__.py", line 46, in > from pyspark.context import SparkContext > File ".../spark/python/pyspark/context.py", line 36, in > from pyspark.java_gateway import launch_gateway > File ".../spark/python/pyspark/java_gateway.py", line 31, in > from py4j.java_gateway import java_import, JavaGateway, GatewayClient > File "", line 961, in _find_and_load > File "", line 950, in _find_and_load_unlocked > File "", line 646, in _load_unlocked > File "", line 616, in _load_backward_compatible > File ".../spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line > 18, in > File > "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pydoc.py", > line 62, in > import pkgutil > File > "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pkgutil.py", > line 22, in > ModuleInfo = namedtuple('ModuleInfo', 'module_finder name ispkg') > File ".../spark/python/pyspark/serializers.py", line 394, in namedtuple > cls = _old_namedtuple(*args, **kwargs) > TypeError: namedtuple() missing 3 required keyword-only arguments: 'verbose', > 'rename', and 'module' > {code} > The 
problem is in > https://github.com/apache/spark/blob/3c68944b229aaaeeaee3efcbae3e3be9a2914855/python/pyspark/serializers.py#L386-L394 > as the error says and the cause seems because the arguments of > {{namedtuple}} are now completely keyword-only arguments from Python 3.6.0 > (See https://bugs.python.org/issue25628). > We currently copy this function via {{types.FunctionType}} which does not set > the default values of keyword-only arguments (meaning > {{namedtuple.__kwdefaults__}}) and this seems causing internally missing > valu
[jira] [Created] (SPARK-21737) Create communication channel between arbitrary clients and the Spark AM in YARN mode
Jong Yoon Lee created SPARK-21737: - Summary: Create communication channel between arbitrary clients and the Spark AM in YARN mode Key: SPARK-21737 URL: https://issues.apache.org/jira/browse/SPARK-21737 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.1.1 Reporter: Jong Yoon Lee Priority: Minor Fix For: 2.1.1 In this JIRA, I develop code to create a communication channel between arbitrary clients and a Spark AM on YARN. This code can be used to send commands such as getting status, getting history info from the CLI, killing the application, and pushing new tokens. Design Doc: https://docs.google.com/document/d/1QMbWhg13ocIoADywZQBRRVj-b9Zf8CnBrruP5JhcOOY/edit?usp=sharing -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21736) Spark 2.2 in Windows does not recognize the URI "file:Z:\SIT1\TreatmentManager\app.properties"
[ https://issues.apache.org/jira/browse/SPARK-21736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-21736. --- Resolution: Not A Problem > Spark 2.2 in Windows does not recognize the URI > "file:Z:\SIT1\TreatmentManager\app.properties" > -- > > Key: SPARK-21736 > URL: https://issues.apache.org/jira/browse/SPARK-21736 > Project: Spark > Issue Type: Bug > Components: Windows >Affects Versions: 2.2.0 > Environment: OS: Windows Server 2012 > JVM: Oracle Java 1.8 (hotspot) >Reporter: Ross Brigoli > > We are currently running Spark 2.1 in Windows Server 2012 in Production. > While trying to upgrade to Spark 2.2.0 we started getting this error > after submitting a job in stand-alone cluster mode: > Could not load properties; nested exception is java.io.IOException: > java.net.URISyntaxException: Illegal character in opaque part at index 7: > file:Z:\SIT1\TreatmentManager\app.properties > We pass a -D argument to spark-submit with a value of > file:Z:\SIT1\TreatmentManager\app.properties > *These URI -D arguments were working fine in Spark 2.1.0* > UPDATE: I replaced the \ with / and it works. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21736) Spark 2.2 in Windows does not recognize the URI "file:Z:\SIT1\TreatmentManager\app.properties"
[ https://issues.apache.org/jira/browse/SPARK-21736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ross Brigoli updated SPARK-21736: - Description: We are currently running Spark 2.1 in Windows Server 2012 in Production. While trying to upgrade to Spark 2.2.0 spark we started getting this error after submitting job in stand-alone cluster mode: Could not load properties; nested exception is java.io.IOException: java.net.URISyntaxException: Illegal character in opaque part at index 7: file:Z:\SIT1\TreatmentManager\app.properties We passing a -D argument to the spark submit with a value of file:Z:\SIT1\TreatmentManager\app.properties *These URI -D arguments was working fine in Spark 2.1.0* UPDATE: I replaced the \ with / and it works. was: We are currently running Spark 2.1 in Windows Server 2012 in Production. While trying to upgrade to Spark 2.2.0 spark we started getting this error after submitting job in stand-alone cluster mode: Could not load properties; nested exception is java.io.IOException: java.net.URISyntaxException: Illegal character in opaque part at index 7: file:Z:\SIT1\TreatmentManager\app.properties We passing a -D argument to the spark submit with a value of file:Z:\SIT1\TreatmentManager\app.properties *These URI -D arguments was working fine in Spark 2.1.0* > Spark 2.2 in Windows does not recognize the URI > "file:Z:\SIT1\TreatmentManager\app.properties" > -- > > Key: SPARK-21736 > URL: https://issues.apache.org/jira/browse/SPARK-21736 > Project: Spark > Issue Type: Bug > Components: Windows >Affects Versions: 2.2.0 > Environment: OS: Windows Server 2012 > JVM: Oracle Java 1.8 (hotspot) >Reporter: Ross Brigoli > > We are currently running Spark 2.1 in Windows Server 2012 in Production. 
> While trying to upgrade to Spark 2.2.0 we started getting this error > after submitting a job in stand-alone cluster mode: > Could not load properties; nested exception is java.io.IOException: > java.net.URISyntaxException: Illegal character in opaque part at index 7: > file:Z:\SIT1\TreatmentManager\app.properties > We pass a -D argument to spark-submit with a value of > file:Z:\SIT1\TreatmentManager\app.properties > *These URI -D arguments were working fine in Spark 2.1.0* > UPDATE: I replaced the \ with / and it works. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
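The workaround above (replacing `\` with `/`) amounts to handing Spark a well-formed file URI: backslashes are not legal URI characters, so `file:Z:\...` parses as an opaque URI containing an illegal character. The proper form can be derived from a Windows path mechanically; Python's `pathlib` is used here purely for illustration.

```python
from pathlib import PureWindowsPath

# Build a proper file URI from the Windows path quoted in the report.
win_path = PureWindowsPath(r"Z:\SIT1\TreatmentManager\app.properties")
uri = win_path.as_uri()
print(uri)  # file:///Z:/SIT1/TreatmentManager/app.properties
```

The resulting `file:///Z:/...` form is what the reporter's manual `\` to `/` substitution approximates.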
[jira] [Commented] (SPARK-21736) Spark 2.2 in Windows does not recognize the URI "file:Z:\SIT1\TreatmentManager\app.properties"
[ https://issues.apache.org/jira/browse/SPARK-21736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16127309#comment-16127309 ] Sean Owen commented on SPARK-21736: --- This isn't related to Spark, though. The value and property are specific to your app. > Spark 2.2 in Windows does not recognize the URI > "file:Z:\SIT1\TreatmentManager\app.properties" > -- > > Key: SPARK-21736 > URL: https://issues.apache.org/jira/browse/SPARK-21736 > Project: Spark > Issue Type: Bug > Components: Windows >Affects Versions: 2.2.0 > Environment: OS: Windows Server 2012 > JVM: Oracle Java 1.8 (hotspot) >Reporter: Ross Brigoli > > We are currently running Spark 2.1 in Windows Server 2012 in Production. > While trying to upgrade to Spark 2.2.0 we started getting this error > after submitting a job in stand-alone cluster mode: > Could not load properties; nested exception is java.io.IOException: > java.net.URISyntaxException: Illegal character in opaque part at index 7: > file:Z:\SIT1\TreatmentManager\app.properties > We pass a -D argument to spark-submit with a value of > file:Z:\SIT1\TreatmentManager\app.properties > *These URI -D arguments were working fine in Spark 2.1.0* -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21736) Spark 2.2 in Windows does not recognize the URI "file:Z:\SIT1\TreatmentManager\app.properties"
[ https://issues.apache.org/jira/browse/SPARK-21736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ross Brigoli updated SPARK-21736: - Description: We are currently running Spark 2.1 in Windows Server 2012 in Production. While trying to upgrade to Spark 2.2.0 spark we started getting this error after submitting job in stand-alone cluster mode: Could not load properties; nested exception is java.io.IOException: java.net.URISyntaxException: Illegal character in opaque part at index 7: file:Z:\SIT1\TreatmentManager\app.properties We passing a -D argument to the spark submit with a value of file:Z:\SIT1\TreatmentManager\app.properties *These URI -D arguments was working fine in Spark 2.1.0* was: We are currently running Spark 2.1 in Windows Server 2012 in Production. While trying to upgrade to Spark 2.2.0 spark we started getting this error after submitting job in stand-alone cluster mode: Could not load properties; nested exception is java.io.IOException: java.net.URISyntaxException: Illegal character in opaque part at index 7: file:Z:\SIT1\TreatmentManager\app.properties We have this lines in the *spark-defaults.conf* spark.eventLog.dir file:/z:/sparklogs spark.history.fs.logDirectory file:/z:/sparklogs *These URIs worked fine in Spark 2.1.0* Summary: Spark 2.2 in Windows does not recognize the URI "file:Z:\SIT1\TreatmentManager\app.properties" (was: Spark 2.2 in Windows does not recognize the URI file:/z:/sparklogs from spark-defaults.conf) > Spark 2.2 in Windows does not recognize the URI > "file:Z:\SIT1\TreatmentManager\app.properties" > -- > > Key: SPARK-21736 > URL: https://issues.apache.org/jira/browse/SPARK-21736 > Project: Spark > Issue Type: Bug > Components: Windows >Affects Versions: 2.2.0 > Environment: OS: Windows Server 2012 > JVM: Oracle Java 1.8 (hotspot) >Reporter: Ross Brigoli > > We are currently running Spark 2.1 in Windows Server 2012 in Production. 
> While trying to upgrade to Spark 2.2.0 we started getting this error > after submitting a job in stand-alone cluster mode: > Could not load properties; nested exception is java.io.IOException: > java.net.URISyntaxException: Illegal character in opaque part at index 7: > file:Z:\SIT1\TreatmentManager\app.properties > We pass a -D argument to spark-submit with a value of > file:Z:\SIT1\TreatmentManager\app.properties > *These URI -D arguments were working fine in Spark 2.1.0* -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21726) Check for structural integrity of the plan in QO in test mode
[ https://issues.apache.org/jira/browse/SPARK-21726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16127288#comment-16127288 ] Liang-Chi Hsieh commented on SPARK-21726: - [~rxin] Thanks for pinging me! Yes, I'm interested in doing this. > Check for structural integrity of the plan in QO in test mode > - > > Key: SPARK-21726 > URL: https://issues.apache.org/jira/browse/SPARK-21726 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.2.0 >Reporter: Reynold Xin > > Right now we don't have any checks in the optimizer to check for the > structural integrity of the plan (e.g. resolved). It would be great if in > test mode, we can check whether a plan is still resolved after the execution > of each rule, so we can catch rules that return invalid plans. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
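The idea in SPARK-21726 can be sketched generically (toy `Plan` and rule names, nothing from Catalyst itself): in test mode, assert the plan is still resolved after every rule, so the offending rule is identified the moment it returns an invalid plan.

```python
# Illustrative sketch of a test-mode structural-integrity check;
# Plan, good_rule, and bad_rule are hypothetical stand-ins.
class Plan:
    def __init__(self, resolved=True):
        self.resolved = resolved

def execute_rules(plan, rules, test_mode=False):
    for rule in rules:
        plan = rule(plan)
        # structural-integrity check, enabled only in test mode
        if test_mode and not plan.resolved:
            raise AssertionError(
                f"rule {rule.__name__} returned an unresolved plan")
    return plan

def good_rule(plan):
    return plan                    # leaves the plan valid

def bad_rule(plan):
    return Plan(resolved=False)    # simulates a rule producing an invalid plan

execute_rules(Plan(), [good_rule], test_mode=True)   # passes silently
try:
    execute_rules(Plan(), [good_rule, bad_rule], test_mode=True)
except AssertionError as e:
    print(e)  # names bad_rule as the culprit
```

Outside test mode the check is skipped, so production planning pays no extra cost.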
[jira] [Resolved] (SPARK-21734) spark job blocked by updateDependencies
[ https://issues.apache.org/jira/browse/SPARK-21734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-21734. --- Resolution: Invalid Not sure what you mean, but this should probably start as a question on the mailing list. > spark job blocked by updateDependencies > --- > > Key: SPARK-21734 > URL: https://issues.apache.org/jira/browse/SPARK-21734 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 > Environment: yarn-cluster >Reporter: Cheng jin > > As stated above: I ran a Spark job on YARN and it was blocked by a bad network. Here > are the exceptions: > {code:java} > at sun.nio.ch.FileDispatcherImpl.read0(Native Method) > at sun.nio.ch.FileDispatcherImpl.read(FileDispatcherImpl.java:46) > at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) > at sun.nio.ch.IOUtil.read(IOUtil.java:197) > at sun.nio.ch.SourceChannelImpl.read(SourceChannelImpl.java:167) > at > org.apache.spark.rpc.netty.NettyRpcEnv$FileDownloadChannel$$anonfun$1.apply$mcI$sp(NettyRpcEnv.scala:371) > at > org.apache.spark.rpc.netty.NettyRpcEnv$FileDownloadChannel$$anonfun$1.apply(NettyRpcEnv.scala:371) > at > org.apache.spark.rpc.netty.NettyRpcEnv$FileDownloadChannel$$anonfun$1.apply(NettyRpcEnv.scala:371) > at scala.util.Try$.apply(Try.scala:192) > at > org.apache.spark.rpc.netty.NettyRpcEnv$FileDownloadChannel.read(NettyRpcEnv.scala:371) > at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:65) > at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:109) > at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:103) > at java.io.InputStream.read(InputStream.java:101) > at > org.apache.spark.util.Utils$$anonfun$copyStream$1.apply$mcJ$sp(Utils.scala:354) > at > org.apache.spark.util.Utils$$anonfun$copyStream$1.apply(Utils.scala:322) > at > org.apache.spark.util.Utils$$anonfun$copyStream$1.apply(Utils.scala:322) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1303) > at 
org.apache.spark.util.Utils$.copyStream(Utils.scala:362) > at org.apache.spark.util.Utils$.downloadFile(Utils.scala:509) > at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:639) > at org.apache.spark.util.Utils$.fetchFile(Utils.scala:450) > at > org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:659) > at > org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:651) > at > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733) > at > scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99) > at > scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99) > at > scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230) > at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40) > at scala.collection.mutable.HashMap.foreach(HashMap.scala:99) > at > scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732) > at > org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:651) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:297) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:748) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21714) SparkSubmit in Yarn Client mode downloads remote files and then reuploads them again
[ https://issues.apache.org/jira/browse/SPARK-21714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16127204#comment-16127204 ] Thomas Graves commented on SPARK-21714: --- I haven't had time to get to it, so it would be great if you can work on it > SparkSubmit in Yarn Client mode downloads remote files and then reuploads > them again > > > Key: SPARK-21714 > URL: https://issues.apache.org/jira/browse/SPARK-21714 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.2.0 >Reporter: Thomas Graves >Priority: Critical > > SPARK-10643 added the ability for spark-submit to download remote files in > client mode. > However, in YARN mode this introduced a bug where Spark downloads them for the > client, but then the YARN client just re-uploads them to HDFS and uses them again. > This should not happen when the remote file is on HDFS. This wastes > resources and defeats the distributed cache, because if the original > object was public it would have been shared by many users. By downloading > and re-uploading, it becomes private. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
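The fix direction described in the issue can be sketched as a scheme check (a hypothetical helper, not Spark's actual code): in YARN mode, files already on HDFS should be left in place and referenced directly, rather than downloaded to the client and re-uploaded.

```python
from urllib.parse import urlparse

def should_download_to_client(url, cluster_manager="yarn"):
    """Hypothetical sketch: decide whether spark-submit should pull a
    remote file down to the client machine before submission."""
    scheme = urlparse(url).scheme or "file"
    if scheme in ("file", "local"):
        return False        # already local, nothing to fetch
    # In YARN mode, HDFS-backed files can be consumed via the distributed
    # cache directly; downloading then re-uploading would turn a public
    # (shared) cache entry into a private one, as the issue describes.
    if cluster_manager == "yarn" and scheme in ("hdfs", "viewfs"):
        return False
    return True             # e.g. http/https/ftp still need a download

print(should_download_to_client("hdfs://nn:8020/libs/app.jar"))  # False
print(should_download_to_client("https://repo/app.jar"))         # True
```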
[jira] [Commented] (SPARK-21734) spark job blocked by updateDependencies
[ https://issues.apache.org/jira/browse/SPARK-21734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16127203#comment-16127203 ] Cheng jin commented on SPARK-21734: --- We ran into this problem. In fact, it looks like this: {code:java} private def doFetchFile( url: String, targetDir: File, filename: String, conf: SparkConf, securityMgr: SecurityManager, hadoopConf: Configuration) { val targetFile = new File(targetDir, filename) val uri = new URI(url) val fileOverwrite = conf.getBoolean("spark.files.overwrite", defaultValue = false) Option(uri.getScheme).getOrElse("file") match { case "spark" => if (SparkEnv.get == null) { throw new IllegalStateException( "Cannot retrieve files with 'spark' scheme without an active SparkEnv.") } val source = SparkEnv.get.rpcEnv.openChannel(url) val is = Channels.newInputStream(source) downloadFile(url, is, targetFile, fileOverwrite) case "http" | "https" | "ftp" => var uc: URLConnection = null if (securityMgr.isAuthenticationEnabled()) { logDebug("fetchFile with security enabled") val newuri = constructURIForAuthentication(uri, securityMgr) uc = newuri.toURL().openConnection() uc.setAllowUserInteraction(false) } else { logDebug("fetchFile not using security") uc = new URL(url).openConnection() } Utils.setupSecureURLConnection(uc, securityMgr) val timeoutMs = conf.getTimeAsSeconds("spark.files.fetchTimeout", "60s").toInt * 1000 uc.setConnectTimeout(timeoutMs) uc.setReadTimeout(timeoutMs) uc.connect() val in = uc.getInputStream() downloadFile(url, in, targetFile, fileOverwrite) {code} updateDependencies() -> fetchFile() -> doFetchFile(): it takes the "spark" case, then opens an InputStream and calls read() to download the file. However, read() is a blocking call, so would adding a read timeout work? 
> spark job blocked by updateDependencies > --- > > Key: SPARK-21734 > URL: https://issues.apache.org/jira/browse/SPARK-21734 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 > Environment: yarn-cluster >Reporter: Cheng jin > > as above told: I ran a spark job in yarn, it was blocked by bad network here > is the exceptions: > {code:java} > :at sun.nio.ch.FileDispatcherImpl.read0(Native Method) > at sun.nio.ch.FileDispatcherImpl.read(FileDispatcherImpl.java:46) > at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) > at sun.nio.ch.IOUtil.read(IOUtil.java:197) > at sun.nio.ch.SourceChannelImpl.read(SourceChannelImpl.java:167) > at > org.apache.spark.rpc.netty.NettyRpcEnv$FileDownloadChannel$$anonfun$1.apply$mcI$sp(NettyRpcEnv.scala:371) > at > org.apache.spark.rpc.netty.NettyRpcEnv$FileDownloadChannel$$anonfun$1.apply(NettyRpcEnv.scala:371) > at > org.apache.spark.rpc.netty.NettyRpcEnv$FileDownloadChannel$$anonfun$1.apply(NettyRpcEnv.scala:371) > at scala.util.Try$.apply(Try.scala:192) > at > org.apache.spark.rpc.netty.NettyRpcEnv$FileDownloadChannel.read(NettyRpcEnv.scala:371) > at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:65) > at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:109) > at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:103) > at java.io.InputStream.read(InputStream.java:101) > at > org.apache.spark.util.Utils$$anonfun$copyStream$1.apply$mcJ$sp(Utils.scala:354) > at > org.apache.spark.util.Utils$$anonfun$copyStream$1.apply(Utils.scala:322) > at > org.apache.spark.util.Utils$$anonfun$copyStream$1.apply(Utils.scala:322) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1303) > at org.apache.spark.util.Utils$.copyStream(Utils.scala:362) > at org.apache.spark.util.Utils$.downloadFile(Utils.scala:509) > at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:639) > at org.apache.spark.util.Utils$.fetchFile(Utils.scala:450) > at > 
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:659) > at > org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:651) > at > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733) > at > scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99) > at > scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99) > at > scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230) >
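To make the blocking concern concrete: setReadTimeout only covers the URLConnection branch; a channel-backed InputStream has no timeout, so read() can block indefinitely on a bad network. One hedged workaround sketch (not Spark's actual fix) is to bound the read with a Future timeout; note that interrupting a thread blocked on a plain InputStream.read() may not actually unblock it unless the stream is backed by an InterruptibleChannel:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class TimedRead {
    // Read up to buf.length bytes, throwing TimeoutException if the
    // (possibly channel-backed) stream stays blocked longer than timeoutMs.
    static int readWithTimeout(InputStream in, byte[] buf, long timeoutMs)
            throws Exception {
        ExecutorService exec = Executors.newSingleThreadExecutor();
        try {
            Future<Integer> f = exec.submit(() -> in.read(buf));
            return f.get(timeoutMs, TimeUnit.MILLISECONDS);
        } finally {
            // Interrupt the reader thread; this only reliably unblocks
            // reads backed by an InterruptibleChannel.
            exec.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        InputStream in = new ByteArrayInputStream("hello".getBytes());
        byte[] buf = new byte[16];
        int n = readWithTimeout(in, buf, 1000);
        System.out.println(n); // 5
    }
}
```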
[jira] [Commented] (SPARK-21714) SparkSubmit in Yarn Client mode downloads remote files and then reuploads them again
[ https://issues.apache.org/jira/browse/SPARK-21714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16127199#comment-16127199 ] Saisai Shao commented on SPARK-21714: - Let me take a crack at this if no one is working on it. > SparkSubmit in Yarn Client mode downloads remote files and then reuploads > them again > > > Key: SPARK-21714 > URL: https://issues.apache.org/jira/browse/SPARK-21714 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.2.0 >Reporter: Thomas Graves >Priority: Critical > > SPARK-10643 added the ability for spark-submit to download remote files in > client mode. > However, in YARN mode this introduced a bug: the files are downloaded for the > client, but the YARN client then just reuploads them to HDFS and uses them > again. This should not happen when the remote file is already on HDFS. It > wastes resources and defeats the distributed cache, because if the original > object was public it would have been shared by many users. By downloading and > reuploading it, we make it private. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21736) Spark 2.2 in Windows does not recognize the URI file:/z:/sparklogs from spark-defaults.conf
[ https://issues.apache.org/jira/browse/SPARK-21736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16127172#comment-16127172 ] Sean Owen commented on SPARK-21736: --- Hm, isn't it {{file:///z:/...}}? Not sure if that makes a difference. You seem to be showing an error with a URL that isn't the one you use for Spark, though. > Spark 2.2 in Windows does not recognize the URI file:/z:/sparklogs from > spark-defaults.conf > --- > > Key: SPARK-21736 > URL: https://issues.apache.org/jira/browse/SPARK-21736 > Project: Spark > Issue Type: Bug > Components: Windows >Affects Versions: 2.2.0 > Environment: OS: Windows Server 2012 > JVM: Oracle Java 1.8 (hotspot) >Reporter: Ross Brigoli > > We are currently running Spark 2.1 in Windows Server 2012 in production. > While trying to upgrade to Spark 2.2.0, we started getting this error > after submitting a job in stand-alone cluster mode: > Could not load properties; nested exception is java.io.IOException: > java.net.URISyntaxException: Illegal character in opaque part at index 7: > file:Z:\SIT1\TreatmentManager\app.properties > We have these lines in *spark-defaults.conf*: > spark.eventLog.dir file:/z:/sparklogs > spark.history.fs.logDirectory file:/z:/sparklogs > *These URIs worked fine in Spark 2.1.0* -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
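The reported exception comes straight from java.net.URI parsing, so the behavior can be checked in isolation. A small sketch: the Windows-style value with backslashes fails (backslash is illegal, and with no "/" after the scheme the URI is treated as opaque), while the forward-slash forms parse as hierarchical URIs:

```java
import java.net.URI;
import java.net.URISyntaxException;

public class FileUriCheck {
    public static void main(String[] args) {
        // "file:Z:\..." has no "/" after the scheme, so it is parsed as an
        // *opaque* URI, and the backslash is an illegal character in it.
        try {
            new URI("file:Z:\\SIT1\\app.properties");
        } catch (URISyntaxException e) {
            // e.g. "Illegal character in opaque part at index 7: ..."
            System.out.println(e.getMessage());
        }

        // With forward slashes the URI is hierarchical and parses cleanly;
        // both one-slash and three-slash forms are accepted by the parser.
        URI u1 = URI.create("file:/z:/sparklogs");
        URI u2 = URI.create("file:///z:/sparklogs");
        System.out.println(u1.getPath()); // /z:/sparklogs
        System.out.println(u2.getPath()); // /z:/sparklogs
    }
}
```

This suggests the failing value is the backslash path from app.properties, not the file:/z:/sparklogs entries themselves, which matches Sean's observation that the error shows a different URL.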
[jira] [Created] (SPARK-21736) Spark 2.2 in Windows does not recognize the URI file:/z:/sparklogs from spark-defaults.conf
Ross Brigoli created SPARK-21736: Summary: Spark 2.2 in Windows does not recognize the URI file:/z:/sparklogs from spark-defaults.conf Key: SPARK-21736 URL: https://issues.apache.org/jira/browse/SPARK-21736 Project: Spark Issue Type: Bug Components: Windows Affects Versions: 2.2.0 Environment: OS: Windows Server 2012 JVM: Oracle Java 1.8 (hotspot) Reporter: Ross Brigoli We are currently running Spark 2.1 in Windows Server 2012 in Production. While trying to upgrade to Spark 2.2.0 spark we started getting this error after submitting job in stand-alone cluster mode: Could not load properties; nested exception is java.io.IOException: java.net.URISyntaxException: Illegal character in opaque part at index 7: file:Z:\SIT1\TreatmentManager\app.properties We have this lines in the *spark-defaults.conf* spark.eventLog.dir file:/z:/sparklogs spark.history.fs.logDirectory file:/z:/sparklogs *These URIs worked fine in Spark 2.1.0* -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21704) Add the description of 'sbin/stop-slave.sh' in spark-standalone.html.
[ https://issues.apache.org/jira/browse/SPARK-21704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-21704. --- Resolution: Not A Problem > Add the description of 'sbin/stop-slave.sh' in spark-standalone.html. > -- > > Key: SPARK-21704 > URL: https://issues.apache.org/jira/browse/SPARK-21704 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 2.3.0 >Reporter: guoxiaolongzte >Priority: Minor > > 1. Add the description of 'sbin/stop-slave.sh' in spark-standalone.html: > `sbin/stop-slave.sh` - Stops a slave instance on the machine the script is > executed on. > 2. The description of 'sbin/start-slaves.sh' is not accurate. It should be > modified as follows: > Starts all slave instances on each machine specified in the `conf/slaves` > file. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21734) spark job blocked by updateDependencies
[ https://issues.apache.org/jira/browse/SPARK-21734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-21734: -- Flags: (was: Patch) Target Version/s: (was: 2.2.0) Labels: (was: patch) You need to read http://spark.apache.org/contributing.html first This doesn't describe a particular problem. How does this happen, what problem does it cause, etc. This alone doesn't mean anything and I'd close it. > spark job blocked by updateDependencies > --- > > Key: SPARK-21734 > URL: https://issues.apache.org/jira/browse/SPARK-21734 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 > Environment: yarn-cluster >Reporter: Cheng jin > > as above told: I ran a spark job in yarn, it was blocked by bad network here > is the exceptions: > {code:java} > :at sun.nio.ch.FileDispatcherImpl.read0(Native Method) > at sun.nio.ch.FileDispatcherImpl.read(FileDispatcherImpl.java:46) > at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) > at sun.nio.ch.IOUtil.read(IOUtil.java:197) > at sun.nio.ch.SourceChannelImpl.read(SourceChannelImpl.java:167) > at > org.apache.spark.rpc.netty.NettyRpcEnv$FileDownloadChannel$$anonfun$1.apply$mcI$sp(NettyRpcEnv.scala:371) > at > org.apache.spark.rpc.netty.NettyRpcEnv$FileDownloadChannel$$anonfun$1.apply(NettyRpcEnv.scala:371) > at > org.apache.spark.rpc.netty.NettyRpcEnv$FileDownloadChannel$$anonfun$1.apply(NettyRpcEnv.scala:371) > at scala.util.Try$.apply(Try.scala:192) > at > org.apache.spark.rpc.netty.NettyRpcEnv$FileDownloadChannel.read(NettyRpcEnv.scala:371) > at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:65) > at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:109) > at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:103) > at java.io.InputStream.read(InputStream.java:101) > at > org.apache.spark.util.Utils$$anonfun$copyStream$1.apply$mcJ$sp(Utils.scala:354) > at > 
org.apache.spark.util.Utils$$anonfun$copyStream$1.apply(Utils.scala:322) > at > org.apache.spark.util.Utils$$anonfun$copyStream$1.apply(Utils.scala:322) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1303) > at org.apache.spark.util.Utils$.copyStream(Utils.scala:362) > at org.apache.spark.util.Utils$.downloadFile(Utils.scala:509) > at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:639) > at org.apache.spark.util.Utils$.fetchFile(Utils.scala:450) > at > org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:659) > at > org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:651) > at > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733) > at > scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99) > at > scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99) > at > scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230) > at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40) > at scala.collection.mutable.HashMap.foreach(HashMap.scala:99) > at > scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732) > at > org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:651) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:297) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:748) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21735) spark job blocked by updateDependencies
[ https://issues.apache.org/jira/browse/SPARK-21735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-21735. --- Resolution: Duplicate Target Version/s: (was: 2.2.0) > spark job blocked by updateDependencies > --- > > Key: SPARK-21735 > URL: https://issues.apache.org/jira/browse/SPARK-21735 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 > Environment: yarn-cluster >Reporter: Cheng jin > Labels: patch > > as above told: I ran a spark job in yarn, it was blocked by bad network here > is the exceptions: > {code:java} > :at sun.nio.ch.FileDispatcherImpl.read0(Native Method) > at sun.nio.ch.FileDispatcherImpl.read(FileDispatcherImpl.java:46) > at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) > at sun.nio.ch.IOUtil.read(IOUtil.java:197) > at sun.nio.ch.SourceChannelImpl.read(SourceChannelImpl.java:167) > at > org.apache.spark.rpc.netty.NettyRpcEnv$FileDownloadChannel$$anonfun$1.apply$mcI$sp(NettyRpcEnv.scala:371) > at > org.apache.spark.rpc.netty.NettyRpcEnv$FileDownloadChannel$$anonfun$1.apply(NettyRpcEnv.scala:371) > at > org.apache.spark.rpc.netty.NettyRpcEnv$FileDownloadChannel$$anonfun$1.apply(NettyRpcEnv.scala:371) > at scala.util.Try$.apply(Try.scala:192) > at > org.apache.spark.rpc.netty.NettyRpcEnv$FileDownloadChannel.read(NettyRpcEnv.scala:371) > at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:65) > at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:109) > at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:103) > at java.io.InputStream.read(InputStream.java:101) > at > org.apache.spark.util.Utils$$anonfun$copyStream$1.apply$mcJ$sp(Utils.scala:354) > at > org.apache.spark.util.Utils$$anonfun$copyStream$1.apply(Utils.scala:322) > at > org.apache.spark.util.Utils$$anonfun$copyStream$1.apply(Utils.scala:322) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1303) > at org.apache.spark.util.Utils$.copyStream(Utils.scala:362) > 
at org.apache.spark.util.Utils$.downloadFile(Utils.scala:509) > at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:639) > at org.apache.spark.util.Utils$.fetchFile(Utils.scala:450) > at > org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:659) > at > org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:651) > at > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733) > at > scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99) > at > scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99) > at > scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230) > at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40) > at scala.collection.mutable.HashMap.foreach(HashMap.scala:99) > at > scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732) > at > org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:651) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:297) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:748) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21702) Structured Streaming S3A SSE Encryption Not Applied when PartitionBy Used
[ https://issues.apache.org/jira/browse/SPARK-21702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16127062#comment-16127062 ] Steve Loughran commented on SPARK-21702: PS: you shouldn't need to set the s3a.impl field; that's handled automatically. > Structured Streaming S3A SSE Encryption Not Applied when PartitionBy Used > - > > Key: SPARK-21702 > URL: https://issues.apache.org/jira/browse/SPARK-21702 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 > Environment: Hadoop 2.7.3: AWS SDK 1.7.4 > Hadoop 2.8.1: AWS SDK 1.10.6 >Reporter: George Pongracz >Priority: Minor > Labels: security > > Settings: > .config("spark.hadoop.fs.s3a.impl", > "org.apache.hadoop.fs.s3a.S3AFileSystem") > .config("spark.hadoop.fs.s3a.server-side-encryption-algorithm", > "AES256") > When writing to an S3 sink from structured streaming, the files are being > encrypted using AES-256. > When introducing a "PartitionBy", the output data files are unencrypted. > All other supporting files and metadata are encrypted. > Suspect the write to temp is encrypted and the move/rename is not applying the SSE. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21702) Structured Streaming S3A SSE Encryption Not Applied when PartitionBy Used
[ https://issues.apache.org/jira/browse/SPARK-21702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16127061#comment-16127061 ] Steve Loughran commented on SPARK-21702: This is interesting. What may be happening is that whatever s3a FS is being created, it's not picking up the options you are setting in the Spark conf. Set the values in core-site.xml and/or spark-defaults.conf and see if they are picked up there. Otherwise, if you can attach a (short) example with this problem, I'll see if I can replicate it in an integration test. > Structured Streaming S3A SSE Encryption Not Applied when PartitionBy Used > - > > Key: SPARK-21702 > URL: https://issues.apache.org/jira/browse/SPARK-21702 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 > Environment: Hadoop 2.7.3: AWS SDK 1.7.4 > Hadoop 2.8.1: AWS SDK 1.10.6 >Reporter: George Pongracz >Priority: Minor > Labels: security > > Settings: > .config("spark.hadoop.fs.s3a.impl", > "org.apache.hadoop.fs.s3a.S3AFileSystem") > .config("spark.hadoop.fs.s3a.server-side-encryption-algorithm", > "AES256") > When writing to an S3 sink from structured streaming the files are being > encrypted using AES-256 > When introducing a "PartitionBy" the output data files are unencrypted. > All other supporting files, metadata are encrypted > Suspect write to temp is encrypted and move/rename is not applying the SSE. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
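For context on why both setting styles can matter: Spark copies configuration entries prefixed with spark.hadoop. into the Hadoop Configuration that FileSystem instances see, so an FS created through a path that skips that copy would miss the SSE option. A hedged, stand-alone sketch of that prefix translation (illustrative only, not Spark's actual code, which lives in its Hadoop-utility layer):

```java
import java.util.HashMap;
import java.util.Map;

public class HadoopConfProps {
    // Copy "spark.hadoop."-prefixed entries into plain Hadoop-style keys,
    // mimicking how Spark seeds the Hadoop Configuration.
    static Map<String, String> extractHadoopConf(Map<String, String> sparkConf) {
        Map<String, String> hadoopConf = new HashMap<>();
        String prefix = "spark.hadoop.";
        for (Map.Entry<String, String> e : sparkConf.entrySet()) {
            if (e.getKey().startsWith(prefix)) {
                hadoopConf.put(e.getKey().substring(prefix.length()), e.getValue());
            }
        }
        return hadoopConf;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("spark.hadoop.fs.s3a.server-side-encryption-algorithm", "AES256");
        conf.put("spark.master", "yarn"); // non-hadoop key, not copied
        System.out.println(extractHadoopConf(conf));
    }
}
```

Putting the value directly in core-site.xml, as suggested above, sidesteps this translation entirely, which is why it is a useful diagnostic.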
[jira] [Created] (SPARK-21735) spark job blocked by updateDependencies
Cheng jin created SPARK-21735: - Summary: spark job blocked by updateDependencies Key: SPARK-21735 URL: https://issues.apache.org/jira/browse/SPARK-21735 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.2.0 Environment: yarn-cluster Reporter: Cheng jin as above told: I ran a spark job in yarn, it was blocked by bad network here is the exceptions: {code:java} :at sun.nio.ch.FileDispatcherImpl.read0(Native Method) at sun.nio.ch.FileDispatcherImpl.read(FileDispatcherImpl.java:46) at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) at sun.nio.ch.IOUtil.read(IOUtil.java:197) at sun.nio.ch.SourceChannelImpl.read(SourceChannelImpl.java:167) at org.apache.spark.rpc.netty.NettyRpcEnv$FileDownloadChannel$$anonfun$1.apply$mcI$sp(NettyRpcEnv.scala:371) at org.apache.spark.rpc.netty.NettyRpcEnv$FileDownloadChannel$$anonfun$1.apply(NettyRpcEnv.scala:371) at org.apache.spark.rpc.netty.NettyRpcEnv$FileDownloadChannel$$anonfun$1.apply(NettyRpcEnv.scala:371) at scala.util.Try$.apply(Try.scala:192) at org.apache.spark.rpc.netty.NettyRpcEnv$FileDownloadChannel.read(NettyRpcEnv.scala:371) at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:65) at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:109) at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:103) at java.io.InputStream.read(InputStream.java:101) at org.apache.spark.util.Utils$$anonfun$copyStream$1.apply$mcJ$sp(Utils.scala:354) at org.apache.spark.util.Utils$$anonfun$copyStream$1.apply(Utils.scala:322) at org.apache.spark.util.Utils$$anonfun$copyStream$1.apply(Utils.scala:322) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1303) at org.apache.spark.util.Utils$.copyStream(Utils.scala:362) at org.apache.spark.util.Utils$.downloadFile(Utils.scala:509) at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:639) at org.apache.spark.util.Utils$.fetchFile(Utils.scala:450) at 
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:659) at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:651) at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99) at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230) at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40) at scala.collection.mutable.HashMap.foreach(HashMap.scala:99) at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732) at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:651) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:297) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:748) {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21734) spark job blocked by updateDependencies
Cheng jin created SPARK-21734: - Summary: spark job blocked by updateDependencies Key: SPARK-21734 URL: https://issues.apache.org/jira/browse/SPARK-21734 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.2.0 Environment: yarn-cluster Reporter: Cheng jin as above told: I ran a spark job in yarn, it was blocked by bad network here is the exceptions: {code:java} :at sun.nio.ch.FileDispatcherImpl.read0(Native Method) at sun.nio.ch.FileDispatcherImpl.read(FileDispatcherImpl.java:46) at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) at sun.nio.ch.IOUtil.read(IOUtil.java:197) at sun.nio.ch.SourceChannelImpl.read(SourceChannelImpl.java:167) at org.apache.spark.rpc.netty.NettyRpcEnv$FileDownloadChannel$$anonfun$1.apply$mcI$sp(NettyRpcEnv.scala:371) at org.apache.spark.rpc.netty.NettyRpcEnv$FileDownloadChannel$$anonfun$1.apply(NettyRpcEnv.scala:371) at org.apache.spark.rpc.netty.NettyRpcEnv$FileDownloadChannel$$anonfun$1.apply(NettyRpcEnv.scala:371) at scala.util.Try$.apply(Try.scala:192) at org.apache.spark.rpc.netty.NettyRpcEnv$FileDownloadChannel.read(NettyRpcEnv.scala:371) at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:65) at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:109) at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:103) at java.io.InputStream.read(InputStream.java:101) at org.apache.spark.util.Utils$$anonfun$copyStream$1.apply$mcJ$sp(Utils.scala:354) at org.apache.spark.util.Utils$$anonfun$copyStream$1.apply(Utils.scala:322) at org.apache.spark.util.Utils$$anonfun$copyStream$1.apply(Utils.scala:322) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1303) at org.apache.spark.util.Utils$.copyStream(Utils.scala:362) at org.apache.spark.util.Utils$.downloadFile(Utils.scala:509) at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:639) at org.apache.spark.util.Utils$.fetchFile(Utils.scala:450) at 
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:659) at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:651) at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99) at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230) at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40) at scala.collection.mutable.HashMap.foreach(HashMap.scala:99) at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732) at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:651) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:297) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:748) {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21681) MLOR do not work correctly when featureStd contains zero
[ https://issues.apache.org/jira/browse/SPARK-21681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weichen Xu updated SPARK-21681: --- Description: MLOR does not work correctly when featureStd contains zero. We can reproduce the bug with the following dataset (features including zero variance); it generates a wrong result (all coefficients become 0):
{code}
val multinomialDatasetWithZeroVar = {
  val nPoints = 100
  val coefficients = Array(
    -0.57997, 0.912083, -0.371077,
    -0.16624, -0.84355, -0.048509)
  val xMean = Array(5.843, 3.0)
  val xVariance = Array(0.6856, 0.0) // including zero variance
  val testData = generateMultinomialLogisticInput(
    coefficients, xMean, xVariance, addIntercept = true, nPoints, seed)
  val df = sc.parallelize(testData, 4).toDF().withColumn("weight", lit(1.0))
  df.cache()
  df
}
{code}
was: MLOR do not work correctly when featureStd contains zero.
> MLOR do not work correctly when featureStd contains zero
>
> Key: SPARK-21681
> URL: https://issues.apache.org/jira/browse/SPARK-21681
> Project: Spark
> Issue Type: Bug
> Components: ML
> Affects Versions: 2.2.0
> Reporter: Weichen Xu
>
> MLOR does not work correctly when featureStd contains zero.
> We can reproduce the bug with the following dataset (features including zero
> variance); it generates a wrong result (all coefficients become 0):
> {code}
> val multinomialDatasetWithZeroVar = {
>   val nPoints = 100
>   val coefficients = Array(
>     -0.57997, 0.912083, -0.371077,
>     -0.16624, -0.84355, -0.048509)
>   val xMean = Array(5.843, 3.0)
>   val xVariance = Array(0.6856, 0.0) // including zero variance
>   val testData = generateMultinomialLogisticInput(
>     coefficients, xMean, xVariance, addIntercept = true, nPoints, seed)
>   val df = sc.parallelize(testData, 4).toDF().withColumn("weight", lit(1.0))
>   df.cache()
>   df
> }
> {code}
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
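The failure mode here is the classic divide-by-zero in feature standardization: a feature with zero variance has std = 0, and dividing by it produces NaN/Infinity that poisons the optimizer. A minimal illustration (a hedged sketch of the usual guard, not the MLOR code itself): treat a zero-std feature as contributing 0 rather than dividing by zero.

```java
public class Standardize {
    // Standardize a single feature value; guard against zero std so a
    // constant feature yields 0.0 instead of NaN or Infinity.
    static double scale(double value, double mean, double std) {
        return std == 0.0 ? 0.0 : (value - mean) / std;
    }

    public static void main(String[] args) {
        System.out.println(scale(3.0, 3.0, 0.0));     // 0.0 instead of NaN
        System.out.println(scale(6.5, 5.843, 0.828)); // a normal feature
        System.out.println(3.0 / 0.0);                // unguarded: Infinity
        System.out.println(0.0 / 0.0);                // unguarded, value == mean: NaN
    }
}
```

The xVariance = Array(0.6856, 0.0) dataset above exercises exactly this path, since the second feature is constant.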
[jira] [Commented] (SPARK-21733) ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM
[ https://issues.apache.org/jira/browse/SPARK-21733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16126978#comment-16126978 ] Jepson commented on SPARK-21733: [~sowen], thank you for correcting me; I will keep it in mind next time. > ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM > - > > Key: SPARK-21733 > URL: https://issues.apache.org/jira/browse/SPARK-21733 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.1.1 > Environment: Apache Spark 2.1.1 > CDH 5.12.0 YARN >Reporter: Jepson > Original Estimate: 96h > Remaining Estimate: 96h > > Kafka + Spark Streaming throws these errors: > {code:java} > 17/08/15 09:34:14 INFO memory.MemoryStore: Block broadcast_8003_piece0 stored > as bytes in memory (estimated size 1895.0 B, free 1643.2 MB) > 17/08/15 09:34:14 INFO broadcast.TorrentBroadcast: Reading broadcast variable > 8003 took 11 ms > 17/08/15 09:34:14 INFO memory.MemoryStore: Block broadcast_8003 stored as > values in memory (estimated size 2.9 KB, free 1643.2 MB) > 17/08/15 09:34:14 INFO kafka010.KafkaRDD: Beginning offset 10130733 is the > same as ending offset skipping kssh 5 > 17/08/15 09:34:14 INFO executor.Executor: Finished task 7.0 in stage 8003.0 > (TID 64178). 
1740 bytes result sent to driver > 17/08/15 09:34:21 INFO storage.BlockManager: Removing RDD 8002 > 17/08/15 09:34:21 INFO executor.CoarseGrainedExecutorBackend: Got assigned > task 64186 > 17/08/15 09:34:21 INFO executor.Executor: Running task 7.0 in stage 8004.0 > (TID 64186) > 17/08/15 09:34:21 INFO broadcast.TorrentBroadcast: Started reading broadcast > variable 8004 > 17/08/15 09:34:21 INFO memory.MemoryStore: Block broadcast_8004_piece0 stored > as bytes in memory (estimated size 1895.0 B, free 1643.2 MB) > 17/08/15 09:34:21 INFO broadcast.TorrentBroadcast: Reading broadcast variable > 8004 took 8 ms > 17/08/15 09:34:21 INFO memory.MemoryStore: Block broadcast_8004 stored as > values in memory (estimated size 2.9 KB, free 1643.2 MB) > 17/08/15 09:34:21 INFO kafka010.KafkaRDD: Beginning offset 10130733 is the > same as ending offset skipping kssh 5 > 17/08/15 09:34:21 INFO executor.Executor: Finished task 7.0 in stage 8004.0 > (TID 64186). 1740 bytes result sent to driver > h3. 17/08/15 09:34:29 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED > SIGNAL TERM > 17/08/15 09:34:29 INFO storage.DiskBlockManager: Shutdown hook called > 17/08/15 09:34:29 INFO util.ShutdownHookManager: Shutdown hook called > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21733) ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM
[ https://issues.apache.org/jira/browse/SPARK-21733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-21733. --- Resolution: Not A Problem Fix Version/s: (was: 2.1.1) It wasn't "Fixed"; there was no problem > ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM > - > > Key: SPARK-21733 > URL: https://issues.apache.org/jira/browse/SPARK-21733 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.1.1 > Environment: Apache Spark2.1.1 > CDH5.12.0 Yarn >Reporter: Jepson > Original Estimate: 96h > Remaining Estimate: 96h > > Kafka+Spark streaming ,throw these error: > {code:java} > 17/08/15 09:34:14 INFO memory.MemoryStore: Block broadcast_8003_piece0 stored > as bytes in memory (estimated size 1895.0 B, free 1643.2 MB) > 17/08/15 09:34:14 INFO broadcast.TorrentBroadcast: Reading broadcast variable > 8003 took 11 ms > 17/08/15 09:34:14 INFO memory.MemoryStore: Block broadcast_8003 stored as > values in memory (estimated size 2.9 KB, free 1643.2 MB) > 17/08/15 09:34:14 INFO kafka010.KafkaRDD: Beginning offset 10130733 is the > same as ending offset skipping kssh 5 > 17/08/15 09:34:14 INFO executor.Executor: Finished task 7.0 in stage 8003.0 > (TID 64178). 
1740 bytes result sent to driver > 17/08/15 09:34:21 INFO storage.BlockManager: Removing RDD 8002 > 17/08/15 09:34:21 INFO executor.CoarseGrainedExecutorBackend: Got assigned > task 64186 > 17/08/15 09:34:21 INFO executor.Executor: Running task 7.0 in stage 8004.0 > (TID 64186) > 17/08/15 09:34:21 INFO broadcast.TorrentBroadcast: Started reading broadcast > variable 8004 > 17/08/15 09:34:21 INFO memory.MemoryStore: Block broadcast_8004_piece0 stored > as bytes in memory (estimated size 1895.0 B, free 1643.2 MB) > 17/08/15 09:34:21 INFO broadcast.TorrentBroadcast: Reading broadcast variable > 8004 took 8 ms > 17/08/15 09:34:21 INFO memory.MemoryStore: Block broadcast_8004 stored as > values in memory (estimated size 2.9 KB, free 1643.2 MB) > 17/08/15 09:34:21 INFO kafka010.KafkaRDD: Beginning offset 10130733 is the > same as ending offset skipping kssh 5 > 17/08/15 09:34:21 INFO executor.Executor: Finished task 7.0 in stage 8004.0 > (TID 64186). 1740 bytes result sent to driver > h3. 17/08/15 09:34:29 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED > SIGNAL TERM > 17/08/15 09:34:29 INFO storage.DiskBlockManager: Shutdown hook called > 17/08/15 09:34:29 INFO util.ShutdownHookManager: Shutdown hook called > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-21733) ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM
[ https://issues.apache.org/jira/browse/SPARK-21733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reopened SPARK-21733: --- > ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM > - > > Key: SPARK-21733 > URL: https://issues.apache.org/jira/browse/SPARK-21733 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.1.1 > Environment: Apache Spark2.1.1 > CDH5.12.0 Yarn >Reporter: Jepson > Original Estimate: 96h > Remaining Estimate: 96h > > Kafka+Spark streaming ,throw these error: > {code:java} > 17/08/15 09:34:14 INFO memory.MemoryStore: Block broadcast_8003_piece0 stored > as bytes in memory (estimated size 1895.0 B, free 1643.2 MB) > 17/08/15 09:34:14 INFO broadcast.TorrentBroadcast: Reading broadcast variable > 8003 took 11 ms > 17/08/15 09:34:14 INFO memory.MemoryStore: Block broadcast_8003 stored as > values in memory (estimated size 2.9 KB, free 1643.2 MB) > 17/08/15 09:34:14 INFO kafka010.KafkaRDD: Beginning offset 10130733 is the > same as ending offset skipping kssh 5 > 17/08/15 09:34:14 INFO executor.Executor: Finished task 7.0 in stage 8003.0 > (TID 64178). 
1740 bytes result sent to driver > 17/08/15 09:34:21 INFO storage.BlockManager: Removing RDD 8002 > 17/08/15 09:34:21 INFO executor.CoarseGrainedExecutorBackend: Got assigned > task 64186 > 17/08/15 09:34:21 INFO executor.Executor: Running task 7.0 in stage 8004.0 > (TID 64186) > 17/08/15 09:34:21 INFO broadcast.TorrentBroadcast: Started reading broadcast > variable 8004 > 17/08/15 09:34:21 INFO memory.MemoryStore: Block broadcast_8004_piece0 stored > as bytes in memory (estimated size 1895.0 B, free 1643.2 MB) > 17/08/15 09:34:21 INFO broadcast.TorrentBroadcast: Reading broadcast variable > 8004 took 8 ms > 17/08/15 09:34:21 INFO memory.MemoryStore: Block broadcast_8004 stored as > values in memory (estimated size 2.9 KB, free 1643.2 MB) > 17/08/15 09:34:21 INFO kafka010.KafkaRDD: Beginning offset 10130733 is the > same as ending offset skipping kssh 5 > 17/08/15 09:34:21 INFO executor.Executor: Finished task 7.0 in stage 8004.0 > (TID 64186). 1740 bytes result sent to driver > h3. 17/08/15 09:34:29 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED > SIGNAL TERM > 17/08/15 09:34:29 INFO storage.DiskBlockManager: Shutdown hook called > 17/08/15 09:34:29 INFO util.ShutdownHookManager: Shutdown hook called > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21733) ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM
[ https://issues.apache.org/jira/browse/SPARK-21733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16126974#comment-16126974 ] Saisai Shao commented on SPARK-21733: - Should the resolution of this JIRA be "invalid" or else? Actually nothing is fixed here.
[jira] [Resolved] (SPARK-21733) ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM
[ https://issues.apache.org/jira/browse/SPARK-21733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jepson resolved SPARK-21733. Resolution: Fixed Fix Version/s: 2.1.1
[jira] [Commented] (SPARK-21733) ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM
[ https://issues.apache.org/jira/browse/SPARK-21733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16126970#comment-16126970 ] Jepson commented on SPARK-21733: - I have resolved this issue. Thanks to [~jerryshao] and [~srowen].
{code}
spark-submit \
--class com.jiuye.KSSH \
--master yarn \
--deploy-mode cluster \
--driver-memory 3g \
--executor-memory 2g \
--executor-cores 2 \
--num-executors 8 \
--jars $(echo /opt/software/spark/spark_on_yarn/libs/*.jar | tr ' ' ',') \
--conf "spark.ui.showConsoleProgress=false" \
--conf "spark.driver.extraJavaOptions=-XX:MaxPermSize=2G -XX:+UseConcMarkSweepGC" \
--conf "spark.executor.extraJavaOptions=-XX:+UseConcMarkSweepGC" \
--conf "spark.sql.tungsten.enabled=false" \
--conf "spark.sql.codegen=false" \
--conf "spark.sql.unsafe.enabled=false" \
--conf "spark.streaming.backpressure.enabled=true" \
--conf "spark.streaming.kafka.maxRatePerPartition=1000" \
--conf "spark.locality.wait=1s" \
--conf "spark.streaming.blockInterval=1500ms" \
--conf "spark.shuffle.consolidateFiles=true" \
--conf "spark.executor.heartbeatInterval=36" \
--conf "spark.yarn.am.memoryOverhead=512m" \
--conf "spark.yarn.driver.memoryOverhead=512m" \
--conf "spark.yarn.executor.memoryOverhead=512m" \
/opt/software/spark/spark_on_yarn/KSSH-0.3.jar
{code}
[jira] [Updated] (SPARK-21733) ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM
[ https://issues.apache.org/jira/browse/SPARK-21733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-21733: -- Shepherd: (was: -) Flags: (was: Patch,Important) Labels: (was: patch) Yes, this looks normal
[jira] [Updated] (SPARK-21723) Can't write LibSVM - key not found: numFeatures
[ https://issues.apache.org/jira/browse/SPARK-21723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jan Vršovský updated SPARK-21723:
-
Description:
Writing a dataset to LibSVM format raises an exception {{java.util.NoSuchElementException: key not found: numFeatures}}
Happens only when the dataset was NOT read from a LibSVM format before (because otherwise numFeatures is in its metadata). Steps to reproduce:
{code}
import org.apache.spark.ml.linalg.Vectors
val rawData = Seq((1.0, Vectors.sparse(3, Seq((0, 2.0), (1, 3.0)))), (4.0, Vectors.sparse(3, Seq((0, 5.0), (2, 6.0)))))
val dfTemp = spark.sparkContext.parallelize(rawData).toDF("label", "features")
dfTemp.coalesce(1).write.format("libsvm").save("...filename...")
{code}
PR with a fix and unit test is ready - see [https://github.com/apache/spark/pull/18872].

was:
Writing a dataset to LibSVM format raises an exception {{java.util.NoSuchElementException: key not found: numFeatures}}
Happens only when the dataset was NOT read from a LibSVM format before (because otherwise numFeatures is in its metadata). Steps to reproduce:
{code:scala}
import org.apache.spark.ml.linalg.Vectors
val rawData = Seq((1.0, Vectors.sparse(3, Seq((0, 2.0), (1, 3.0)))), (4.0, Vectors.sparse(3, Seq((0, 5.0), (2, 6.0)))))
val dfTemp = spark.sparkContext.parallelize(rawData).toDF("label", "features")
dfTemp.coalesce(1).write.format("libsvm").save("...filename...")
{code}
PR with a fix and unit test is ready - see [https://github.com/apache/spark/pull/18872].

> Can't write LibSVM - key not found: numFeatures
> ---
>
> Key: SPARK-21723
> URL: https://issues.apache.org/jira/browse/SPARK-21723
> Project: Spark
> Issue Type: Bug
> Components: Input/Output, ML
> Affects Versions: 2.2.0, 2.3.0
> Reporter: Jan Vršovský
>
> Writing a dataset to LibSVM format raises an exception {{java.util.NoSuchElementException: key not found: numFeatures}}
> Happens only when the dataset was NOT read from a LibSVM format before (because otherwise numFeatures is in its metadata). Steps to reproduce:
> {code}
> import org.apache.spark.ml.linalg.Vectors
> val rawData = Seq((1.0, Vectors.sparse(3, Seq((0, 2.0), (1, 3.0)))), (4.0, Vectors.sparse(3, Seq((0, 5.0), (2, 6.0)))))
> val dfTemp = spark.sparkContext.parallelize(rawData).toDF("label", "features")
> dfTemp.coalesce(1).write.format("libsvm").save("...filename...")
> {code}
> PR with a fix and unit test is ready - see
> [https://github.com/apache/spark/pull/18872].
[jira] [Commented] (SPARK-21723) Can't write LibSVM - key not found: numFeatures
[ https://issues.apache.org/jira/browse/SPARK-21723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16126939#comment-16126939 ] Nick Pentreath commented on SPARK-21723: Yes, we should definitely be able to write LibSVM format regardless of whether the original data was read from that format, and whether we have ML metadata attached to the dataframe. We should be able to inspect the vectors to get the size in the absence of the metadata.
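The fallback described in the comment — deriving the feature count by inspecting the vectors themselves when no ML metadata is attached — can be sketched in a few lines. This is a hypothetical illustration, not the code from the linked PR; the plain-Python `SparseVector` here just stands in for `ml.linalg.SparseVector`, which similarly carries a declared `size`.

```python
# Minimal sketch: infer numFeatures for LibSVM output when the
# DataFrame carries no ML attribute metadata. Hypothetical stand-in
# types; ml.linalg.SparseVector exposes .size and .indices similarly.

class SparseVector:
    def __init__(self, size, indices, values):
        self.size = size        # declared dimensionality of the vector
        self.indices = indices  # positions of the non-zero entries
        self.values = values    # the non-zero entries themselves

def infer_num_features(vectors):
    """Scan the column: every vector knows its declared size, so the
    maximum size over the data is a safe numFeatures for the output."""
    if not vectors:
        raise ValueError("cannot infer numFeatures from an empty dataset")
    return max(v.size for v in vectors)

rows = [SparseVector(3, [0, 1], [2.0, 3.0]),
        SparseVector(3, [0, 2], [5.0, 6.0])]
print(infer_num_features(rows))  # 3
```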
[jira] [Updated] (SPARK-21723) Can't write LibSVM - key not found: numFeatures
[ https://issues.apache.org/jira/browse/SPARK-21723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jan Vršovský updated SPARK-21723:
-
Description:
Writing a dataset to LibSVM format raises an exception {{java.util.NoSuchElementException: key not found: numFeatures}}
Happens only when the dataset was NOT read from a LibSVM format before (because otherwise numFeatures is in its metadata). Steps to reproduce:
{code:scala}
import org.apache.spark.ml.linalg.Vectors
val rawData = Seq((1.0, Vectors.sparse(3, Seq((0, 2.0), (1, 3.0)))), (4.0, Vectors.sparse(3, Seq((0, 5.0), (2, 6.0)))))
val dfTemp = spark.sparkContext.parallelize(rawData).toDF("label", "features")
dfTemp.coalesce(1).write.format("libsvm").save("...filename...")
{code}
PR with a fix and unit test is ready - see [https://github.com/apache/spark/pull/18872].

was:
Writing a dataset to LibSVM format raises an exception {{java.util.NoSuchElementException: key not found: numFeatures}}
Happens only when the dataset was NOT read from a LibSVM format before (because otherwise numFeatures is in its metadata). Steps to reproduce:
{{import org.apache.spark.ml.linalg.Vectors
val rawData = Seq((1.0, Vectors.sparse(3, Seq((0, 2.0), (1, 3.0)))), (4.0, Vectors.sparse(3, Seq((0, 5.0), (2, 6.0)))))
val dfTemp = spark.sparkContext.parallelize(rawData).toDF("label", "features")
dfTemp.coalesce(1).write.format("libsvm").save("...filename...")}}
PR with a fix and unit test is ready.
[jira] [Commented] (SPARK-21691) Accessing canonicalized plan for query with limit throws exception
[ https://issues.apache.org/jira/browse/SPARK-21691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16126947#comment-16126947 ] Bjoern Toldbod commented on SPARK-21691: The suggested solution works for me. Thanks

> Accessing canonicalized plan for query with limit throws exception
> --
>
> Key: SPARK-21691
> URL: https://issues.apache.org/jira/browse/SPARK-21691
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.2.0
> Reporter: Bjoern Toldbod
>
> Accessing the logical, canonicalized plan fails for queries with limits.
> The following demonstrates the issue:
> {code:java}
> val session = SparkSession.builder.master("local").getOrCreate()
> // This works
> session.sql("select * from (values 0, 1)").queryExecution.logical.canonicalized
> // This fails
> session.sql("select * from (values 0, 1) limit 1").queryExecution.logical.canonicalized
> {code}
> The message in the thrown exception is somewhat confusing (or at least not
> directly related to the limit):
> "Invalid call to toAttribute on unresolved object, tree: *"
[jira] [Comment Edited] (SPARK-21652) Optimizer cannot reach a fixed point on certain queries
[ https://issues.apache.org/jira/browse/SPARK-21652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16126917#comment-16126917 ] Takeshi Yamamuro edited comment on SPARK-21652 at 8/15/17 8:08 AM: --- User 'maropu' has created a pull request for this issue: https://github.com/apache/spark/pull/18882 was (Author: maropu): The pr is: https://github.com/apache/spark/pull/18882 > Optimizer cannot reach a fixed point on certain queries > --- > > Key: SPARK-21652 > URL: https://issues.apache.org/jira/browse/SPARK-21652 > Project: Spark > Issue Type: Bug > Components: Optimizer, SQL >Affects Versions: 2.2.0 >Reporter: Anton Okolnychyi > > The optimizer cannot reach a fixed point on the following query: > {code} > Seq((1, 2)).toDF("col1", "col2").write.saveAsTable("t1") > Seq(1, 2).toDF("col").write.saveAsTable("t2") > spark.sql("SELECT * FROM t1, t2 WHERE t1.col1 = 1 AND 1 = t1.col2 AND t1.col1 > = t2.col AND t1.col2 = t2.col").explain(true) > {code} > At some point during the optimization, InferFiltersFromConstraints infers a > new constraint '(col2#33 = col1#32)' that is appended to the join condition, > then PushPredicateThroughJoin pushes it down, ConstantPropagation replaces > '(col2#33 = col1#32)' with '1 = 1' based on other propagated constraints, > ConstantFolding replaces '1 = 1' with 'true and BooleanSimplification finally > removes this predicate. However, InferFiltersFromConstraints will again infer > '(col2#33 = col1#32)' on the next iteration and the process will continue > until the limit of iterations is reached. 
> See below for more details > {noformat} > === Applying Rule > org.apache.spark.sql.catalyst.optimizer.InferFiltersFromConstraints === > !Join Inner, ((col1#32 = col#34) && (col2#33 = col#34)) > Join Inner, ((col2#33 = col1#32) && ((col1#32 = col#34) && > (col2#33 = col#34))) > :- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && > (1 = col2#33))) :- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && > ((col1#32 = 1) && (1 = col2#33))) > : +- Relation[col1#32,col2#33] parquet > : +- Relation[col1#32,col2#33] parquet > +- Filter ((1 = col#34) && isnotnull(col#34)) > +- Filter ((1 = col#34) && isnotnull(col#34)) > +- Relation[col#34] parquet >+- Relation[col#34] parquet > > === Applying Rule > org.apache.spark.sql.catalyst.optimizer.PushPredicateThroughJoin === > !Join Inner, ((col2#33 = col1#32) && ((col1#32 = col#34) && (col2#33 = > col#34))) Join Inner, ((col1#32 = col#34) && (col2#33 = col#34)) > !:- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && > (1 = col2#33))) :- Filter (col2#33 = col1#32) > !: +- Relation[col1#32,col2#33] parquet > : +- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && > ((col1#32 = 1) && (1 = col2#33))) > !+- Filter ((1 = col#34) && isnotnull(col#34)) > : +- Relation[col1#32,col2#33] parquet > ! +- Relation[col#34] parquet > +- Filter ((1 = col#34) && isnotnull(col#34)) > ! 
>+- Relation[col#34] parquet > > === Applying Rule org.apache.spark.sql.catalyst.optimizer.CombineFilters === > Join Inner, ((col1#32 = col#34) && (col2#33 = col#34)) >Join Inner, ((col1#32 = col#34) && (col2#33 = col#34)) > !:- Filter (col2#33 = col1#32) >:- Filter (((isnotnull(col1#32) && isnotnull(col2#33)) && > ((col1#32 = 1) && (1 = col2#33))) && (col2#33 = col1#32)) > !: +- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) > && (1 = col2#33))) : +- Relation[col1#32,col2#33] parquet > !: +- Relation[col1#32,col2#33] parquet >+- Filter ((1 = col#34) && isnotnull(col#34)) > !+- Filter ((1 = col#34) && isnotnull(col#34)) > +- Relation[col#34] parquet > ! +- Relation[col#34] parquet > > > === Applying Rule org.apache.spark.sql.catalyst.optimizer.ConstantPropagation > === > Join Inner, ((col1#32 = col#34) && (col2#33 = col#34)) > Join Inner, ((col1#32 = col#34) &
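The loop described in the report — a constraint is inferred, simplified away, then inferred again on the next pass — is exactly how a fixed-point rule executor fails to converge. A deliberately simplified sketch of such an oscillating batch (not Catalyst's actual rule code; predicates are modelled as plain strings) shows why the iteration limit gets hit:

```python
# Toy fixed-point executor: one pass adds an inferred predicate, the
# next folds it away, so the "plan" oscillates and the executor runs
# until its iteration limit. Purely illustrative, not Catalyst code.

def run_batch(plan):
    """One pass of the hypothetical batch over a set of predicates."""
    if "col2 = col1" in plan:
        # ConstantPropagation + ConstantFolding + BooleanSimplification
        # reduce the inferred predicate to true and drop it.
        return plan - {"col2 = col1"}
    if {"col1 = 1", "1 = col2"} <= plan:
        # InferFiltersFromConstraints re-infers it from the constraints.
        return plan | {"col2 = col1"}
    return plan

def optimize(plan, max_iters=100):
    for i in range(1, max_iters + 1):
        new_plan = run_batch(plan)
        if new_plan == plan:
            return plan, i           # fixed point reached
        plan = new_plan
    return plan, max_iters           # limit hit, as in the JIRA report

_, iters = optimize(frozenset({"col1 = 1", "1 = col2"}))
print(iters)  # 100: the batch never reaches a fixed point
```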
[jira] [Comment Edited] (SPARK-21363) Prevent column name duplication in temporary view
[ https://issues.apache.org/jira/browse/SPARK-21363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16126847#comment-16126847 ] Takeshi Yamamuro edited comment on SPARK-21363 at 8/15/17 8:08 AM: --- User 'maropu' has created a pull request for this issue: https://github.com/apache/spark/pull/18938 was (Author: maropu): This pr is: https://github.com/apache/spark/pull/18938 > Prevent column name duplication in temporary view > - > > Key: SPARK-21363 > URL: https://issues.apache.org/jira/browse/SPARK-21363 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.1 >Reporter: Takeshi Yamamuro >Priority: Minor > > In SPARK-20460, we refactored some existing checks for column name > duplication. The master currently allows name duplication for temporary views > like an example below though, IMO we better prevent it along with permanent > views (the master prevents the case for the permanent views). > {code} > scala> Seq((1, 1)).toDF("a", "a").createOrReplaceTempView("t") > scala> sql("SELECT * FROM t").show > +---+---+ > | a| a| > +---+---+ > | 1| 1| > +---+---+ > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
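The guard proposed above — rejecting a temporary view whose schema repeats a column name — is a few lines of checking. A hedged Python sketch follows (hypothetical helper, not Spark's actual schema-check code; whether the comparison is case-sensitive would follow the {{spark.sql.caseSensitive}} setting):

```python
# Sketch of a duplicate-column-name check like the one the JIRA asks
# to apply to temporary views. Hypothetical helper; names are made up.

def check_column_name_duplication(names, case_sensitive=False):
    """Raise if the same column name appears twice in a schema."""
    seen, dups = set(), set()
    for name in names:
        key = name if case_sensitive else name.lower()
        if key in seen:
            dups.add(name)
        seen.add(key)
    if dups:
        raise ValueError(
            f"Found duplicate column(s) in view definition: {sorted(dups)}")

check_column_name_duplication(["a", "b"])                     # passes
check_column_name_duplication(["a", "A"], case_sensitive=True)  # passes
# check_column_name_duplication(["a", "a"])  # would raise ValueError
```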
[jira] [Commented] (SPARK-21652) Optimizer cannot reach a fixed point on certain queries
[ https://issues.apache.org/jira/browse/SPARK-21652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16126917#comment-16126917 ] Takeshi Yamamuro commented on SPARK-21652: -- The pr is: https://github.com/apache/spark/pull/18882 > Optimizer cannot reach a fixed point on certain queries > --- > > Key: SPARK-21652 > URL: https://issues.apache.org/jira/browse/SPARK-21652 > Project: Spark > Issue Type: Bug > Components: Optimizer, SQL >Affects Versions: 2.2.0 >Reporter: Anton Okolnychyi > > The optimizer cannot reach a fixed point on the following query: > {code} > Seq((1, 2)).toDF("col1", "col2").write.saveAsTable("t1") > Seq(1, 2).toDF("col").write.saveAsTable("t2") > spark.sql("SELECT * FROM t1, t2 WHERE t1.col1 = 1 AND 1 = t1.col2 AND t1.col1 > = t2.col AND t1.col2 = t2.col").explain(true) > {code} > At some point during the optimization, InferFiltersFromConstraints infers a > new constraint '(col2#33 = col1#32)' that is appended to the join condition, > then PushPredicateThroughJoin pushes it down, ConstantPropagation replaces > '(col2#33 = col1#32)' with '1 = 1' based on other propagated constraints, > ConstantFolding replaces '1 = 1' with 'true and BooleanSimplification finally > removes this predicate. However, InferFiltersFromConstraints will again infer > '(col2#33 = col1#32)' on the next iteration and the process will continue > until the limit of iterations is reached. 
> See below for more details > {noformat} > === Applying Rule > org.apache.spark.sql.catalyst.optimizer.InferFiltersFromConstraints === > !Join Inner, ((col1#32 = col#34) && (col2#33 = col#34)) > Join Inner, ((col2#33 = col1#32) && ((col1#32 = col#34) && > (col2#33 = col#34))) > :- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && > (1 = col2#33))) :- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && > ((col1#32 = 1) && (1 = col2#33))) > : +- Relation[col1#32,col2#33] parquet > : +- Relation[col1#32,col2#33] parquet > +- Filter ((1 = col#34) && isnotnull(col#34)) > +- Filter ((1 = col#34) && isnotnull(col#34)) > +- Relation[col#34] parquet >+- Relation[col#34] parquet > > === Applying Rule > org.apache.spark.sql.catalyst.optimizer.PushPredicateThroughJoin === > !Join Inner, ((col2#33 = col1#32) && ((col1#32 = col#34) && (col2#33 = > col#34))) Join Inner, ((col1#32 = col#34) && (col2#33 = col#34)) > !:- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && > (1 = col2#33))) :- Filter (col2#33 = col1#32) > !: +- Relation[col1#32,col2#33] parquet > : +- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && > ((col1#32 = 1) && (1 = col2#33))) > !+- Filter ((1 = col#34) && isnotnull(col#34)) > : +- Relation[col1#32,col2#33] parquet > ! +- Relation[col#34] parquet > +- Filter ((1 = col#34) && isnotnull(col#34)) > ! 
>+- Relation[col#34] parquet > > === Applying Rule org.apache.spark.sql.catalyst.optimizer.CombineFilters === > Join Inner, ((col1#32 = col#34) && (col2#33 = col#34)) >Join Inner, ((col1#32 = col#34) && (col2#33 = col#34)) > !:- Filter (col2#33 = col1#32) >:- Filter (((isnotnull(col1#32) && isnotnull(col2#33)) && > ((col1#32 = 1) && (1 = col2#33))) && (col2#33 = col1#32)) > !: +- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) > && (1 = col2#33))) : +- Relation[col1#32,col2#33] parquet > !: +- Relation[col1#32,col2#33] parquet >+- Filter ((1 = col#34) && isnotnull(col#34)) > !+- Filter ((1 = col#34) && isnotnull(col#34)) > +- Relation[col#34] parquet > ! +- Relation[col#34] parquet > > > === Applying Rule org.apache.spark.sql.catalyst.optimizer.ConstantPropagation > === > Join Inner, ((col1#32 = col#34) && (col2#33 = col#34)) > Join Inner, ((col1#32 = col#34) && > (col2#33 = col#34)) > !:- Filter (((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && > (1 = col2#33))) && (col2#33 = col1#32)) :- Filter (((isnotnull(c
[jira] [Commented] (SPARK-21481) Add indexOf method in ml.feature.HashingTF similar to mllib.feature.HashingTF
[ https://issues.apache.org/jira/browse/SPARK-21481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16126914#comment-16126914 ] Yanbo Liang commented on SPARK-21481: - See my comments at https://github.com/apache/spark/pull/18736 . > Add indexOf method in ml.feature.HashingTF similar to mllib.feature.HashingTF > - > > Key: SPARK-21481 > URL: https://issues.apache.org/jira/browse/SPARK-21481 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.1.0, 2.2.0 >Reporter: Aseem Bansal > > If we want to find the index of any input based on hashing trick then it is > possible in > https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.mllib.feature.HashingTF > but not in > https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.ml.feature.HashingTF. > Should allow that for feature parity -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
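For reference, the {{indexOf}} being requested is just the hashing trick: hash the term and take a non-negative modulo by the feature dimension. A rough sketch of the idea (Python's {{hash}} differs from the JVM's {{hashCode}}, so concrete indices will not match Spark's; the dimension 2^18 is chosen here only for illustration):

```python
# Hashing-trick bucket lookup, mirroring the idea behind
# mllib.feature.HashingTF.indexOf. Illustrative only: Python's hash()
# is not the JVM hashCode, so indices won't match Spark's output.

def non_negative_mod(x, mod):
    """Map an arbitrary hash into [0, mod). Python's % is already
    non-negative for a positive modulus; the guard mirrors the JVM
    side, where % of a negative hash can be negative."""
    raw = x % mod
    return raw + mod if raw < 0 else raw

def index_of(term, num_features=1 << 18):
    """Bucket index for a term under the hashing trick."""
    return non_negative_mod(hash(term), num_features)

idx = index_of("spark")
print(0 <= idx < (1 << 18))  # True: always lands inside the range
```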
[jira] [Commented] (SPARK-21034) Allow filter pushdown through non-deterministic functions for columns involved in groupBy / join
[ https://issues.apache.org/jira/browse/SPARK-21034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16126902#comment-16126902 ] Takeshi Yamamuro commented on SPARK-21034:
-- It seems this is a duplicate of SPARK-21707. If so, we had better move the discussion there.
> Allow filter pushdown through non-deterministic functions for columns
> involved in groupBy / join
>
> Key: SPARK-21034
> URL: https://issues.apache.org/jira/browse/SPARK-21034
> Project: Spark
> Issue Type: Bug
> Components: Optimizer, SQL
> Affects Versions: 2.1.1, 2.2.0
> Reporter: Abhijit Bhole
>
> If the column is involved in aggregation / join, then pushing down the filter
> should not change the results.
> Here is some sample code:
> {code:java}
> from pyspark.sql import functions as F
> df = spark.createDataFrame([{ "a": 1, "b" : 2, "c":7}, { "a": 3, "b" : 4, "c" : 8},
>{ "a": 1, "b" : 5, "c":7}, { "a": 1, "b" : 6, "c":7} ])
> df.groupBy(["a"]).agg(F.sum("b")).where("a = 1").explain()
> df.groupBy(["a"]).agg(F.sum("b"), F.first("c")).where("a = 1").explain()
> == Physical Plan ==
> *HashAggregate(keys=[a#15L], functions=[sum(b#16L)])
> +- Exchange hashpartitioning(a#15L, 4)
>+- *HashAggregate(keys=[a#15L], functions=[partial_sum(b#16L)])
> +- *Project [a#15L, b#16L]
> +- *Filter (isnotnull(a#15L) && (a#15L = 1))
> +- Scan ExistingRDD[a#15L,b#16L,c#17L]
> >>>
> >>> df.groupBy(["a"]).agg(F.sum("b"), F.first("c")).where("a = 1").explain()
> == Physical Plan ==
> *Filter (isnotnull(a#15L) && (a#15L = 1))
> +- *HashAggregate(keys=[a#15L], functions=[sum(b#16L), first(c#17L, false)])
>+- Exchange hashpartitioning(a#15L, 4)
> +- *HashAggregate(keys=[a#15L], functions=[partial_sum(b#16L), partial_first(c#17L, false)])
> +- Scan ExistingRDD[a#15L,b#16L,c#17L]
> {code}
> As you can see, the filter is not pushed down when the F.first aggregate
> function is used.
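The reporter's safety claim, that filtering on a grouping key before or after the aggregation yields the same result, can be checked with a small pure-Python sketch (no Spark needed; `groupby_sum` is a toy stand-in for `df.groupBy("a").agg(F.sum("b"))`):

```python
from collections import defaultdict

def groupby_sum(rows):
    """Toy aggregation: group rows by key 'a' and sum 'b'."""
    acc = defaultdict(int)
    for r in rows:
        acc[r["a"]] += r["b"]
    return dict(acc)

# Same toy data as the pyspark example above.
rows = [{"a": 1, "b": 2, "c": 7}, {"a": 3, "b": 4, "c": 8},
        {"a": 1, "b": 5, "c": 7}, {"a": 1, "b": 6, "c": 7}]

# Filter applied after aggregation (what the unoptimized plan does):
late = {k: v for k, v in groupby_sum(rows).items() if k == 1}

# Filter pushed below the aggregation (what the optimizer should produce):
early = groupby_sum(r for r in rows if r["a"] == 1)

# Because the predicate references only the grouping key, both orders
# produce the same result, so the pushdown is semantics-preserving.
assert late == early == {1: 13}
```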
[jira] [Commented] (SPARK-21733) ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM
[ https://issues.apache.org/jira/browse/SPARK-21733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16126894#comment-16126894 ] Saisai Shao commented on SPARK-21733:
- Is there any issue here?
> ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM
> -
>
> Key: SPARK-21733
> URL: https://issues.apache.org/jira/browse/SPARK-21733
> Project: Spark
> Issue Type: Bug
> Components: DStreams
> Affects Versions: 2.1.1
> Environment: Apache Spark 2.1.1
> CDH 5.12.0 YARN
> Reporter: Jepson
> Labels: patch
> Original Estimate: 96h
> Remaining Estimate: 96h
>
> Kafka + Spark Streaming throws these errors:
> {code:java}
> 17/08/15 09:34:14 INFO memory.MemoryStore: Block broadcast_8003_piece0 stored
> as bytes in memory (estimated size 1895.0 B, free 1643.2 MB)
> 17/08/15 09:34:14 INFO broadcast.TorrentBroadcast: Reading broadcast variable
> 8003 took 11 ms
> 17/08/15 09:34:14 INFO memory.MemoryStore: Block broadcast_8003 stored as
> values in memory (estimated size 2.9 KB, free 1643.2 MB)
> 17/08/15 09:34:14 INFO kafka010.KafkaRDD: Beginning offset 10130733 is the
> same as ending offset skipping kssh 5
> 17/08/15 09:34:14 INFO executor.Executor: Finished task 7.0 in stage 8003.0
> (TID 64178).
1740 bytes result sent to driver
> 17/08/15 09:34:21 INFO storage.BlockManager: Removing RDD 8002
> 17/08/15 09:34:21 INFO executor.CoarseGrainedExecutorBackend: Got assigned
> task 64186
> 17/08/15 09:34:21 INFO executor.Executor: Running task 7.0 in stage 8004.0
> (TID 64186)
> 17/08/15 09:34:21 INFO broadcast.TorrentBroadcast: Started reading broadcast
> variable 8004
> 17/08/15 09:34:21 INFO memory.MemoryStore: Block broadcast_8004_piece0 stored
> as bytes in memory (estimated size 1895.0 B, free 1643.2 MB)
> 17/08/15 09:34:21 INFO broadcast.TorrentBroadcast: Reading broadcast variable
> 8004 took 8 ms
> 17/08/15 09:34:21 INFO memory.MemoryStore: Block broadcast_8004 stored as
> values in memory (estimated size 2.9 KB, free 1643.2 MB)
> 17/08/15 09:34:21 INFO kafka010.KafkaRDD: Beginning offset 10130733 is the
> same as ending offset skipping kssh 5
> 17/08/15 09:34:21 INFO executor.Executor: Finished task 7.0 in stage 8004.0
> (TID 64186). 1740 bytes result sent to driver
> 17/08/15 09:34:29 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED
> SIGNAL TERM
> 17/08/15 09:34:29 INFO storage.DiskBlockManager: Shutdown hook called
> 17/08/15 09:34:29 INFO util.ShutdownHookManager: Shutdown hook called
> {code}