[jira] [Commented] (SPARK-21742) BisectingKMeans generate different models with/without caching

2017-08-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16128393#comment-16128393
 ] 

Sean Owen commented on SPARK-21742:
---

Is that a bug? Isn't it stochastic and dependent on the data order anyway, 
which could vary if the input varies? Neither answer is wrong. 

> BisectingKMeans generate different models with/without caching
> --
>
> Key: SPARK-21742
> URL: https://issues.apache.org/jira/browse/SPARK-21742
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: zhengruifeng
>
> I found that {{BisectingKMeans}} will generate different models depending on 
> whether or not the input is cached.
> Using the same dataset as in {{BisectingKMeansSuite}}, we can see that if we 
> cache the input, the number of centers changes from 2 to 3.
> So it looks like a potential bug.
> {code}
> import org.apache.spark.ml.param.ParamMap
> import org.apache.spark.sql.Dataset
> import org.apache.spark.ml.clustering._
> import org.apache.spark.ml.linalg._
> import scala.util.Random
> case class TestRow(features: org.apache.spark.ml.linalg.Vector)
> val rows = 10
> val dim = 1000
> val seed = 42
> val random = new Random(seed)
> val nnz = random.nextInt(dim)
> val rdd = sc.parallelize(1 to rows).map(i => Vectors.sparse(dim, 
> random.shuffle(0 to dim - 1).slice(0, nnz).sorted.toArray, 
> Array.fill(nnz)(random.nextDouble()))).map(v => new TestRow(v))
> val sparseDataset = spark.createDataFrame(rdd)
> val k = 5
> val bkm = new 
> BisectingKMeans().setK(k).setMinDivisibleClusterSize(4).setMaxIter(4).setSeed(123)
> val model = bkm.fit(sparseDataset)
> model.clusterCenters.length
> res0: Int = 2
> sparseDataset.persist()
> val model = bkm.fit(sparseDataset)
> model.clusterCenters.length
> res2: Int = 3
> {code}
> [~imatiach] 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21736) Spark 2.2 in Windows does not recognize the URI "file:Z:\SIT1\TreatmentManager\app.properties"

2017-08-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16128392#comment-16128392
 ] 

Sean Owen commented on SPARK-21736:
---

I am still not sure why you're saying it relates to Spark. The prop and file 
are parsed by your app code. 

> Spark 2.2 in Windows does not recognize the URI 
> "file:Z:\SIT1\TreatmentManager\app.properties"
> --
>
> Key: SPARK-21736
> URL: https://issues.apache.org/jira/browse/SPARK-21736
> Project: Spark
>  Issue Type: Bug
>  Components: Windows
>Affects Versions: 2.2.0
> Environment: OS: Windows Server 2012
> JVM: Oracle Java 1.8 (hotspot)
>Reporter: Ross Brigoli
>
> We are currently running Spark 2.1 on Windows Server 2012 in production. 
> While trying to upgrade to Spark 2.2.0, we started getting this error 
> after submitting a job in stand-alone cluster mode:
> Could not load properties; nested exception is java.io.IOException: 
> java.net.URISyntaxException: Illegal character in opaque part at index 7: 
> file:Z:\SIT1\TreatmentManager\app.properties 
> We pass a -D argument to spark-submit with a value of 
> file:Z:\SIT1\TreatmentManager\app.properties
> *These URI -D arguments were working fine in Spark 2.1.0*
> UPDATE: I replaced the \ with / and it works.
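For illustration, a minimal JVM-only sketch of the URI behaviour described above (no Spark involved): a backslash is not a legal URI character, so java.net.URI rejects the original value with the "Illegal character in opaque part" error from the stack trace, while the forward-slash form from the UPDATE parses.
{code}
import java.net.{URI, URISyntaxException}

object UriCheck {
  def main(args: Array[String]): Unit = {
    try {
      new URI("file:Z:\\SIT1\\TreatmentManager\\app.properties")
    } catch {
      // "Illegal character in opaque part at index 7", as reported above
      case e: URISyntaxException => println(s"rejected: ${e.getMessage}")
    }

    // The workaround from the UPDATE: backslashes replaced with forward slashes
    val fixed = new URI("file:Z:/SIT1/TreatmentManager/app.properties")
    println(s"accepted: $fixed")
  }
}
{code}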



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21422) Depend on Apache ORC 1.4.0

2017-08-15 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-21422.
-
   Resolution: Fixed
 Assignee: Dongjoon Hyun
Fix Version/s: 2.3.0

> Depend on Apache ORC 1.4.0
> --
>
> Key: SPARK-21422
> URL: https://issues.apache.org/jira/browse/SPARK-21422
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
> Fix For: 2.3.0
>
>
> Like Parquet, this issue aims to depend on the latest Apache ORC 1.4 for 
> Apache Spark 2.3. There are key benefits for now.
> - Stability: Apache ORC 1.4.0 has many fixes and we can depend on ORC 
> community more.
> - Maintainability: Reduce the Hive dependency and can remove old legacy code 
> later.
> Later, we can get the following two key benefits by adding new ORCFileFormat 
> in SPARK-20728, too.
> - Usability: User can use ORC data sources without hive module, i.e, -Phive.
> - Speed: Use both Spark ColumnarBatch and ORC RowBatch together. This is 
> faster than the current implementation in Spark.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21108) convert LinearSVC to aggregator framework

2017-08-15 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang reassigned SPARK-21108:
---

Assignee: yuhao yang

> convert LinearSVC to aggregator framework
> -
>
> Key: SPARK-21108
> URL: https://issues.apache.org/jira/browse/SPARK-21108
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>Assignee: yuhao yang
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21745) Refactor ColumnVector hierarchy to make ColumnVector read-only and to introduce MutableColumnVector.

2017-08-15 Thread Takuya Ueshin (JIRA)
Takuya Ueshin created SPARK-21745:
-

 Summary: Refactor ColumnVector hierarchy to make ColumnVector 
read-only and to introduce MutableColumnVector.
 Key: SPARK-21745
 URL: https://issues.apache.org/jira/browse/SPARK-21745
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: Takuya Ueshin


This is a refactoring of the {{ColumnVector}} hierarchy and related classes (a rough sketch of the resulting split follows the list below).

# make {{ColumnVector}} read-only
# introduce {{MutableColumnVector}} with write interface
# remove {{ReadOnlyColumnVector}}
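
For illustration, a rough Scala sketch of the proposed split; the method names here are placeholders rather than the actual ColumnVector API.
{code}
import org.apache.spark.sql.types.DataType

// Read-only view: consumers can only get values.
abstract class ColumnVector(val dataType: DataType) {
  def isNullAt(rowId: Int): Boolean
  def getInt(rowId: Int): Int
  def getDouble(rowId: Int): Double
}

// Writable variant: the write interface lives only here, so code that is handed
// a plain ColumnVector cannot mutate it (this replaces ReadOnlyColumnVector).
abstract class MutableColumnVector(dataType: DataType) extends ColumnVector(dataType) {
  def putNull(rowId: Int): Unit
  def putInt(rowId: Int, value: Int): Unit
  def putDouble(rowId: Int, value: Double): Unit
}
{code}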



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21736) Spark 2.2 in Windows does not recognize the URI "file:Z:\SIT1\TreatmentManager\app.properties"

2017-08-15 Thread Ross Brigoli (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16128336#comment-16128336
 ] 

Ross Brigoli commented on SPARK-21736:
--

You're right, file:/z:/... works. But the issue is that file:\z:\... works in Spark 
2.1.0, so we did not bother changing it. Now we have had to update all our 
configuration in production just to make it work in Spark 2.2.0.

> Spark 2.2 in Windows does not recognize the URI 
> "file:Z:\SIT1\TreatmentManager\app.properties"
> --
>
> Key: SPARK-21736
> URL: https://issues.apache.org/jira/browse/SPARK-21736
> Project: Spark
>  Issue Type: Bug
>  Components: Windows
>Affects Versions: 2.2.0
> Environment: OS: Windows Server 2012
> JVM: Oracle Java 1.8 (hotspot)
>Reporter: Ross Brigoli
>
> We are currently running Spark 2.1 on Windows Server 2012 in production. 
> While trying to upgrade to Spark 2.2.0, we started getting this error 
> after submitting a job in stand-alone cluster mode:
> Could not load properties; nested exception is java.io.IOException: 
> java.net.URISyntaxException: Illegal character in opaque part at index 7: 
> file:Z:\SIT1\TreatmentManager\app.properties 
> We pass a -D argument to spark-submit with a value of 
> file:Z:\SIT1\TreatmentManager\app.properties
> *These URI -D arguments were working fine in Spark 2.1.0*
> UPDATE: I replaced the \ with / and it works.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14516) Clustering evaluator

2017-08-15 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang reassigned SPARK-14516:
---

Assignee: Marco Gaido

> Clustering evaluator
> 
>
> Key: SPARK-14516
> URL: https://issues.apache.org/jira/browse/SPARK-14516
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: zhengruifeng
>Assignee: Marco Gaido
>
> MLlib does not have any general-purpose clustering metrics with a ground 
> truth.
> In 
> [Scikit-Learn](http://scikit-learn.org/stable/modules/classes.html#clustering-metrics),
>  there are several kinds of metrics for this.
> It may be meaningful to add some clustering metrics to MLlib.
> This should be added as a {{ClusteringEvaluator}} class extending 
> {{Evaluator}} in spark.ml.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14516) Clustering evaluator

2017-08-15 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-14516:

 Shepherd: Yanbo Liang
Affects Version/s: 2.2.0

> Clustering evaluator
> 
>
> Key: SPARK-14516
> URL: https://issues.apache.org/jira/browse/SPARK-14516
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: zhengruifeng
>
> MLlib does not have any general-purpose clustering metrics with a ground 
> truth.
> In 
> [Scikit-Learn](http://scikit-learn.org/stable/modules/classes.html#clustering-metrics),
>  there are several kinds of metrics for this.
> It may be meaningful to add some clustering metrics to MLlib.
> This should be added as a {{ClusteringEvaluator}} class extending 
> {{Evaluator}} in spark.ml.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21744) Add retry logic when create new broadcast

2017-08-15 Thread zhoukang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhoukang updated SPARK-21744:
-
Description: 
When the driver submits a new stage and one of the local disks is bad, the driver 
may exit with the exception below:

{code:java}
Job aborted due to stage failure: Task serialization failed: 
java.io.IOException: Failed to create local dir in 
/home/work/hdd5/yarn/xxx/appcache/application_1463372393999_144979/blockmgr-1f96b724-3e16-4c09-8601-1a2e3b758185/3b.
org.apache.spark.storage.DiskBlockManager.getFile(DiskBlockManager.scala:73)
org.apache.spark.storage.DiskStore.contains(DiskStore.scala:173)
org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$getCurrentBlockStatus(BlockManager.scala:391)
org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:801)
org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:629)
org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:987)
org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:99)
org.apache.spark.broadcast.TorrentBroadcast.(TorrentBroadcast.scala:85)
org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:63)
org.apache.spark.SparkContext.broadcast(SparkContext.scala:1332)
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:863)
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$14$$anonfun$apply$1.apply$mcVI$sp(DAGScheduler.scala:1090)
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$14$$anonfun$apply$1.apply(DAGScheduler.scala:1086)
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$14$$anonfun$apply$1.apply(DAGScheduler.scala:1086)
scala.Option.foreach(Option.scala:236)
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$14.apply(DAGScheduler.scala:1086)
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$14.apply(DAGScheduler.scala:1085)
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1085)
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1528)
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1493)
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1482)
org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
{code}
We can add retry logic when creating a broadcast to lower the probability of this 
scenario occurring. There is no side effect for the normal scenario.


  was:
When the driver submits a new stage and one of the local disks is bad, the driver 
may exit with the exception below:

{code:java}
Job aborted due to stage failure: Task serialization failed: 
java.io.IOException: Failed to create local dir in 
/home/work/hdd5/yarn/c3prc-hadoop/nodemanager/usercache/h_user_profile/appcache/application_1463372393999_144979/blockmgr-1f96b724-3e16-4c09-8601-1a2e3b758185/3b.
org.apache.spark.storage.DiskBlockManager.getFile(DiskBlockManager.scala:73)
org.apache.spark.storage.DiskStore.contains(DiskStore.scala:173)
org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$getCurrentBlockStatus(BlockManager.scala:391)
org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:801)
org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:629)
org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:987)
org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:99)
org.apache.spark.broadcast.TorrentBroadcast.(TorrentBroadcast.scala:85)
org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:63)
org.apache.spark.SparkContext.broadcast(SparkContext.scala:1332)
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:863)
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$14$$anonfun$apply$1.apply$mcVI$sp(DAGScheduler.scala:1090)
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$14$$anonfun$apply$1.apply(DAGScheduler.scala:1086)
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$14$$anonfun$apply$1.apply(DAGScheduler.scala:1086)
scala.Option.foreach(Option.scala:236)
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$14.apply(DAGScheduler.scala:1086)
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$14.apply(DAGScheduler.scala:1085)
scala.collection.mutable.ResizableArr

[jira] [Commented] (SPARK-21726) Check for structural integrity of the plan in QO in test mode

2017-08-15 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16128327#comment-16128327
 ] 

Liang-Chi Hsieh commented on SPARK-21726:
-

Submitted a PR at https://github.com/apache/spark/pull/18956

> Check for structural integrity of the plan in QO in test mode
> -
>
> Key: SPARK-21726
> URL: https://issues.apache.org/jira/browse/SPARK-21726
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>
> Right now we don't have any checks in the optimizer for the 
> structural integrity of the plan (e.g. that it is still resolved). It would be great if, in 
> test mode, we could check whether a plan is still resolved after the execution 
> of each rule, so we can catch rules that return invalid plans.
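For illustration, a minimal sketch of the idea (not the implementation in the PR linked above): in test mode, verify after each rule that a resolved plan stays resolved.
{code}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

def executeRuleChecked(rule: Rule[LogicalPlan], plan: LogicalPlan): LogicalPlan = {
  val result = rule(plan)
  // Structural-integrity check: a rule must not turn a resolved plan into an
  // unresolved one. Only enforced when running under the test harness.
  val testing = sys.env.contains("SPARK_TESTING") || sys.props.contains("spark.testing")
  if (testing && plan.resolved && !result.resolved) {
    throw new IllegalStateException(
      s"Rule ${rule.ruleName} broke the structural integrity of the plan")
  }
  result
}
{code}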



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21744) Add retry logic when create new broadcast

2017-08-15 Thread zhoukang (JIRA)
zhoukang created SPARK-21744:


 Summary: Add retry logic when create new broadcast
 Key: SPARK-21744
 URL: https://issues.apache.org/jira/browse/SPARK-21744
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.2.0, 2.1.0, 1.6.1
Reporter: zhoukang
Priority: Minor


When the driver submits a new stage and one of the local disks is bad, the driver 
may exit with the exception below:

{code:java}
Job aborted due to stage failure: Task serialization failed: 
java.io.IOException: Failed to create local dir in 
/home/work/hdd5/yarn/c3prc-hadoop/nodemanager/usercache/h_user_profile/appcache/application_1463372393999_144979/blockmgr-1f96b724-3e16-4c09-8601-1a2e3b758185/3b.
org.apache.spark.storage.DiskBlockManager.getFile(DiskBlockManager.scala:73)
org.apache.spark.storage.DiskStore.contains(DiskStore.scala:173)
org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$getCurrentBlockStatus(BlockManager.scala:391)
org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:801)
org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:629)
org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:987)
org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:99)
org.apache.spark.broadcast.TorrentBroadcast.(TorrentBroadcast.scala:85)
org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:63)
org.apache.spark.SparkContext.broadcast(SparkContext.scala:1332)
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:863)
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$14$$anonfun$apply$1.apply$mcVI$sp(DAGScheduler.scala:1090)
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$14$$anonfun$apply$1.apply(DAGScheduler.scala:1086)
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$14$$anonfun$apply$1.apply(DAGScheduler.scala:1086)
scala.Option.foreach(Option.scala:236)
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$14.apply(DAGScheduler.scala:1086)
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$14.apply(DAGScheduler.scala:1085)
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1085)
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1528)
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1493)
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1482)
org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
{code}
We can add retry logic when creating a broadcast to lower the probability of this 
scenario occurring. There is no side effect for the normal scenario.
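
For illustration, a minimal sketch of the proposal: wrap broadcast creation in a small retry loop so a transient bad-disk failure does not immediately kill the driver. The helper name, retry count, and backoff are illustrative, not part of any existing Spark API or patch.
{code}
import scala.reflect.ClassTag
import scala.util.control.NonFatal
import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast

def broadcastWithRetry[T: ClassTag](sc: SparkContext, value: T,
    maxAttempts: Int = 3): Broadcast[T] = {
  def attempt(n: Int): Broadcast[T] =
    try {
      sc.broadcast(value)
    } catch {
      // e.g. java.io.IOException: Failed to create local dir ...
      case NonFatal(e) if n < maxAttempts =>
        Thread.sleep(1000L * n)
        attempt(n + 1)
    }
  attempt(1)
}
{code}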




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4502) Spark SQL reads unneccesary nested fields from Parquet

2017-08-15 Thread Gaurav Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16128318#comment-16128318
 ] 

Gaurav Shah commented on SPARK-4502:


[~marmbrus] Do you have some time to review this pull request? It looks to be in a 
good state.

> Spark SQL reads unneccesary nested fields from Parquet
> --
>
> Key: SPARK-4502
> URL: https://issues.apache.org/jira/browse/SPARK-4502
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Liwen Sun
>Priority: Critical
>
> When reading a field of a nested column from Parquet, SparkSQL reads and 
> assembles all the fields of that nested column. This is unnecessary, as 
> Parquet supports fine-grained field reads out of a nested column. This may 
> degrade the performance significantly when a nested column has many fields. 
> For example, I loaded json tweets data into SparkSQL and ran the following 
> query:
> {{SELECT User.contributors_enabled from Tweets;}}
> User is a nested structure that has 38 primitive fields (for Tweets schema, 
> see: https://dev.twitter.com/overview/api/tweets), here is the log message:
> {{14/11/19 16:36:49 INFO InternalParquetRecordReader: Assembled and processed 
> 385779 records from 38 columns in 3976 ms: 97.02691 rec/ms, 3687.0227 
> cell/ms}}
> For comparison, I also ran:
> {{SELECT User FROM Tweets;}}
> And here is the log message:
> {{14/11/19 16:45:40 INFO InternalParquetRecordReader: Assembled and processed 
> 385779 records from 38 columns in 9461 ms: 40.77571 rec/ms, 1549.477 cell/ms}}
> So both queries load 38 columns from Parquet, while the first query only 
> needs 1 column. I also measured the bytes read within Parquet. In these two 
> cases, the same number of bytes (99365194 bytes) were read. 
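For context, a small Scala illustration of what fine-grained pruning would mean here; the field names are taken from the Tweets schema above, and this is purely illustrative rather than code from Spark itself: the schema requested from Parquet would keep only the accessed leaf field instead of the whole 38-field User struct.
{code}
import org.apache.spark.sql.types._

// What is effectively requested today for "SELECT User.contributors_enabled":
// the entire User struct (all ~38 fields, two shown here for brevity).
val fullUserStruct = StructType(Seq(
  StructField("contributors_enabled", BooleanType),
  StructField("screen_name", StringType)
  // ... plus the remaining User fields
))
val currentRequestedSchema = StructType(Seq(StructField("User", fullUserStruct)))

// What a pruned requested schema could look like: only the accessed leaf field.
val prunedRequestedSchema = StructType(Seq(
  StructField("User", StructType(Seq(
    StructField("contributors_enabled", BooleanType))))))
{code}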



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19256) Hive bucketing support

2017-08-15 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16128305#comment-16128305
 ] 

Tejas Patil commented on SPARK-19256:
-

The PR for the writer-side changes is out: https://github.com/apache/spark/pull/18954
I have documented the semantics and the differences from Spark's bucketing in the 
jira description, to make it easy for anyone who wants to see what 
changes are being made.

Reader-side changes are close to completion (the core is done, but the work is 
unpolished). I assume the writer PR will go through some rounds of comments, which 
will buy me time to work on that. The reader-side changes depend on the writer-side 
PR getting in, so I won't publish them until then.

> Hive bucketing support
> --
>
> Key: SPARK-19256
> URL: https://issues.apache.org/jira/browse/SPARK-19256
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Tejas Patil
>Priority: Minor
>
> JIRA to track design discussions and tasks related to Hive bucketing support 
> in Spark.
> Proposal : 
> https://docs.google.com/document/d/1a8IDh23RAkrkg9YYAeO51F4aGO8-xAlupKwdshve2fc/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21743) top-most limit should not cause memory leak

2017-08-15 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-21743:
---

 Summary: top-most limit should not cause memory leak
 Key: SPARK-21743
 URL: https://issues.apache.org/jira/browse/SPARK-21743
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17025) Cannot persist PySpark ML Pipeline model that includes custom Transformer

2017-08-15 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16128271#comment-16128271
 ] 

Nicholas Chammas commented on SPARK-17025:
--

I'm still interested in this but I won't be able to test it until mid-next 
month, unfortunately. I've set myself a reminder to revisit this.

> Cannot persist PySpark ML Pipeline model that includes custom Transformer
> -
>
> Key: SPARK-17025
> URL: https://issues.apache.org/jira/browse/SPARK-17025
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Following the example in [this Databricks blog 
> post|https://databricks.com/blog/2016/05/31/apache-spark-2-0-preview-machine-learning-model-persistence.html]
>  under "Python tuning", I'm trying to save an ML Pipeline model.
> This pipeline, however, includes a custom transformer. When I try to save the 
> model, the operation fails because the custom transformer doesn't have a 
> {{_to_java}} attribute.
> {code}
> Traceback (most recent call last):
>   File ".../file.py", line 56, in 
> model.bestModel.save('model')
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 222, in save
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 217, in write
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/util.py",
>  line 93, in __init__
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 254, in _to_java
> AttributeError: 'PeoplePairFeaturizer' object has no attribute '_to_java'
> {code}
> Looking at the source code for 
> [ml/base.py|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/base.py],
>  I see that not even the base Transformer class has such an attribute.
> I'm assuming this is missing functionality that is intended to be patched up 
> (i.e. [like 
> this|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/classification.py#L1421-L1433]).
> I'm not sure if there is an existing JIRA for this (my searches didn't turn 
> up clear results).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21739) timestamp partition would fail in v2.2.0

2017-08-15 Thread Feng Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16128267#comment-16128267
 ] 

Feng Zhu commented on SPARK-21739:
--

In this case, the Cast expression is evaluated directly without resolving 
*timeZoneId*, because it happens in the execution phase.

{code:java}
Cast(Literal(rawPartValues(partOrdinal)), attr.dataType).eval(null)
{code}

I will check for this kind of issue and fix them.
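
For illustration, a minimal sketch of the kind of fix described above: pass an explicit timeZoneId to the Cast so the TimeZoneAwareExpression no longer hits None.get during eval. The attribute, raw value, and time zone here are placeholders; the real fix would use the session-local time zone inside TableReader.
{code}
import java.util.TimeZone
import org.apache.spark.sql.catalyst.expressions.{AttributeReference, Cast, Literal}
import org.apache.spark.sql.types.TimestampType

// Illustrative partition attribute and raw partition value.
val attr = AttributeReference("ts", TimestampType)()
val rawPartValue = "2017-08-15 00:00:00"

// With the time zone supplied, eval() succeeds instead of throwing None.get.
val partitionValue =
  Cast(Literal(rawPartValue), attr.dataType, Option(TimeZone.getDefault.getID)).eval(null)
{code}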



> timestamp partition would fail in v2.2.0
> 
>
> Key: SPARK-21739
> URL: https://issues.apache.org/jira/browse/SPARK-21739
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: wangzhihao
>
> Spark v2.2.0 introduces TimeZoneAwareExpression, which causes bugs if we 
> select data from a table with timestamp partitions.
> The steps to reproduce it:
> {code:java}
> spark.sql("create table test (foo string) parititioned by (ts timestamp)")
> spark.sql("insert into table test partition(ts = 1) values('hi')")
> spark.table("test").show()
> {code}
> The root cause is that TableReader.scala#230 tries to cast the string to a 
> timestamp regardless of whether the timeZone exists.
> Here is the error stack trace
> {code}
> java.util.NoSuchElementException: None.get
>   at scala.None$.get(Option.scala:347)
>   at scala.None$.get(Option.scala:345)
>   at 
> org.apache.spark.sql.catalyst.expressions.TimeZoneAwareExpression$class.timeZone(datetimeExpressions.scala:46)
>   at 
> org.apache.spark.sql.catalyst.expressions.Cast.timeZone$lzycompute(Cast.scala:172)
>   
>at 
> org.apache.spark.sql.catalyst.expressions.Cast.timeZone(Cast.scala:172)
>   at 
> org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToTimestamp$1$$anonfun$apply$24.apply(Cast.scala:253)
>   at 
> org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToTimestamp$1$$anonfun$apply$24.apply(Cast.scala:253)
>   at 
> org.apache.spark.sql.catalyst.expressions.Cast.org$apache$spark$sql$catalyst$expressions$Cast$$buildCast(Cast.scala:201)
>   at 
> org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToTimestamp$1.apply(Cast.scala:253)
>   at 
> org.apache.spark.sql.catalyst.expressions.Cast.nullSafeEval(Cast.scala:533)
>   at 
> org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:327)
>   at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$5$$anonfun$fillPartitionKeys$1$1.apply(TableReader.scala:230)
>   at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$5$$anonfun$fillPartitionKeys$1$1.apply(TableReader.scala:228)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21742) BisectingKMeans generate different models with/without caching

2017-08-15 Thread zhengruifeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-21742:
-
Summary: BisectingKMeans generate different models with/without caching  
(was: BisectingKMeans generate different results with/without caching)

> BisectingKMeans generate different models with/without caching
> --
>
> Key: SPARK-21742
> URL: https://issues.apache.org/jira/browse/SPARK-21742
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: zhengruifeng
>
> I found that {{BisectingKMeans}} will generate different models depending on 
> whether or not the input is cached.
> Using the same dataset as in {{BisectingKMeansSuite}}, we can see that if we 
> cache the input, the number of centers changes from 2 to 3.
> So it looks like a potential bug.
> {code}
> import org.apache.spark.ml.param.ParamMap
> import org.apache.spark.sql.Dataset
> import org.apache.spark.ml.clustering._
> import org.apache.spark.ml.linalg._
> import scala.util.Random
> case class TestRow(features: org.apache.spark.ml.linalg.Vector)
> val rows = 10
> val dim = 1000
> val seed = 42
> val random = new Random(seed)
> val nnz = random.nextInt(dim)
> val rdd = sc.parallelize(1 to rows).map(i => Vectors.sparse(dim, 
> random.shuffle(0 to dim - 1).slice(0, nnz).sorted.toArray, 
> Array.fill(nnz)(random.nextDouble()))).map(v => new TestRow(v))
> val sparseDataset = spark.createDataFrame(rdd)
> val k = 5
> val bkm = new 
> BisectingKMeans().setK(k).setMinDivisibleClusterSize(4).setMaxIter(4).setSeed(123)
> val model = bkm.fit(sparseDataset)
> model.clusterCenters.length
> res0: Int = 2
> sparseDataset.persist()
> val model = bkm.fit(sparseDataset)
> model.clusterCenters.length
> res2: Int = 3
> {code}
> [~imatiach] 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21742) BisectingKMeans generate different results with/without caching

2017-08-15 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-21742:


 Summary: BisectingKMeans generate different results with/without 
caching
 Key: SPARK-21742
 URL: https://issues.apache.org/jira/browse/SPARK-21742
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 2.3.0
Reporter: zhengruifeng


I found that {{BisectingKMeans}} will generate different models depending on whether 
or not the input is cached.
Using the same dataset as in {{BisectingKMeansSuite}}, we can see that if we 
cache the input, the number of centers changes from 2 to 3.

So it looks like a potential bug.
{code}
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.sql.Dataset
import org.apache.spark.ml.clustering._
import org.apache.spark.ml.linalg._
import scala.util.Random
case class TestRow(features: org.apache.spark.ml.linalg.Vector)

val rows = 10
val dim = 1000
val seed = 42

val random = new Random(seed)
val nnz = random.nextInt(dim)
val rdd = sc.parallelize(1 to rows).map(i => Vectors.sparse(dim, 
random.shuffle(0 to dim - 1).slice(0, nnz).sorted.toArray, 
Array.fill(nnz)(random.nextDouble()))).map(v => new TestRow(v))
val sparseDataset = spark.createDataFrame(rdd)

val k = 5
val bkm = new 
BisectingKMeans().setK(k).setMinDivisibleClusterSize(4).setMaxIter(4).setSeed(123)
val model = bkm.fit(sparseDataset)
model.clusterCenters.length
res0: Int = 2

sparseDataset.persist()
val model = bkm.fit(sparseDataset)
model.clusterCenters.length
res2: Int = 3
{code}

[~imatiach] 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18394) Executing the same query twice in a row results in CodeGenerator cache misses

2017-08-15 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16128243#comment-16128243
 ] 

Takeshi Yamamuro commented on SPARK-18394:
--

Any update? I checked and found that master still has this issue; I just ran 
the query above and dumped the output names at 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlanner.scala#L102.
{code}
17/08/16 02:13:13 WARN BaseSessionStateBuilder$$anon$3: 
L_QUANTITY#9015,L_RETURNFLAG#9019,l_shipdate#9021,L_TAX#9018,L_DISCOUNT#9017,L_LINESTATUS#9020,L_EXTENDEDPRICE#9016
17/08/16 02:13:13 WARN BaseSessionStateBuilder$$anon$3: 
L_RETURNFLAG#9142,L_DISCOUNT#9140,L_EXTENDEDPRICE#9139,L_QUANTITY#9138,L_LINESTATUS#9143,l_shipdate#9144,L_TAX#9141
17/08/16 02:13:13 WARN BaseSessionStateBuilder$$anon$3: 
L_QUANTITY#9305,L_TAX#9308,l_shipdate#9311,L_DISCOUNT#9307,L_RETURNFLAG#9309,L_LINESTATUS#9310,L_EXTENDEDPRICE#9306
17/08/16 02:13:13 WARN BaseSessionStateBuilder$$anon$3: 
L_EXTENDEDPRICE#9451,L_QUANTITY#9450,L_RETURNFLAG#9454,L_TAX#9453,L_DISCOUNT#9452,l_shipdate#9456,L_LINESTATUS#9455
17/08/16 02:13:13 WARN BaseSessionStateBuilder$$anon$3: 
L_LINESTATUS#9600,l_shipdate#9601,L_DISCOUNT#9597,L_TAX#9598,L_EXTENDEDPRICE#9596,L_RETURNFLAG#9599,L_QUANTITY#9595
17/08/16 02:13:13 WARN BaseSessionStateBuilder$$anon$3: 
L_QUANTITY#9740,L_TAX#9743,l_shipdate#9746,L_DISCOUNT#9742,L_EXTENDEDPRICE#9741,L_LINESTATUS#9745,L_RETURNFLAG#9744
...
{code}
The attribute order is different, so Spark generates different code in 
`GenerateColumnAccessor`.
Also, I quickly checked that `AttributeSet.toSeq` outputs attributes in a different 
order;
{code}
scala> val attr1 = AttributeReference("c1", IntegerType)(exprId = ExprId(1098))
scala> val attr2 = AttributeReference("c2", IntegerType)(exprId = ExprId(107))
scala> val attr3 = AttributeReference("c3", IntegerType)(exprId = ExprId(838))
scala> val attrSetA = AttributeSet(attr1 :: attr2 :: attr3 :: Nil)
scala> val attr4 = AttributeReference("c4", IntegerType)(exprId = ExprId(389))
scala> val attr5 = AttributeReference("c5", IntegerType)(exprId = ExprId(89329))
scala> val attrSetB = AttributeSet(attr4 :: attr5 :: Nil)
scala> (attrSetA ++ attrSetB).toSeq
res1: Seq[org.apache.spark.sql.catalyst.expressions.Attribute] = 
WrappedArray(c3#838, c4#389, c2#107, c5#89329, c1#1098)

scala> val attr1 = AttributeReference("c1", IntegerType)(exprId = ExprId(392))
scala> val attr2 = AttributeReference("c2", IntegerType)(exprId = ExprId(92))
scala> val attr3 = AttributeReference("c3", IntegerType)(exprId = ExprId(87))
scala> val attrSetA = AttributeSet(attr1 :: attr2 :: attr3 :: Nil)
scala> val attr4 = AttributeReference("c4", IntegerType)(exprId = 
ExprId(9023920))
scala> val attr5 = AttributeReference("c5", IntegerType)(exprId = ExprId(522))
scala> val attrSetB = AttributeSet(attr4 :: attr5 :: Nil)
scala> (attrSetA ++ attrSetB).toSeq
res2: Seq[org.apache.spark.sql.catalyst.expressions.Attribute] = 
WrappedArray(c3#87, c1#392, c5#522, c2#92, c4#9023920)
{code}

As suggested, to fix this, `AttributeSet.toSeq` needs to output attributes in a 
consistent order, e.g.:
https://github.com/apache/spark/compare/master...maropu:SPARK-18394
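
For illustration, a minimal sketch of the idea (not the linked patch itself): give the ordering a deterministic key, e.g. the attribute name, so two runs of the same query emit attributes, and hence generated code, in the same order even though the ExprIds differ.
{code}
import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.types.IntegerType

// The same three attributes as seen by two different runs, with different ExprIds.
val run1 = AttributeSet(Seq(
  AttributeReference("c1", IntegerType)(exprId = ExprId(1098)),
  AttributeReference("c2", IntegerType)(exprId = ExprId(107)),
  AttributeReference("c3", IntegerType)(exprId = ExprId(838))))

val run2 = AttributeSet(Seq(
  AttributeReference("c1", IntegerType)(exprId = ExprId(392)),
  AttributeReference("c2", IntegerType)(exprId = ExprId(92)),
  AttributeReference("c3", IntegerType)(exprId = ExprId(87))))

// Name-based ordering is stable across runs: Seq(c1, c2, c3) both times,
// unlike the hash-based iteration order of the underlying set.
val ordered1 = run1.toSeq.sortBy(_.name)
val ordered2 = run2.toSeq.sortBy(_.name)
{code}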
 

> Executing the same query twice in a row results in CodeGenerator cache misses
> -
>
> Key: SPARK-18394
> URL: https://issues.apache.org/jira/browse/SPARK-18394
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
> Environment: HiveThriftServer2 running on branch-2.0 on Mac laptop
>Reporter: Jonny Serencsa
>
> Executing the query:
> {noformat}
> select
> l_returnflag,
> l_linestatus,
> sum(l_quantity) as sum_qty,
> sum(l_extendedprice) as sum_base_price,
> sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
> sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
> avg(l_quantity) as avg_qty,
> avg(l_extendedprice) as avg_price,
> avg(l_discount) as avg_disc,
> count(*) as count_order
> from
> lineitem_1_row
> where
> l_shipdate <= date_sub('1998-12-01', '90')
> group by
> l_returnflag,
> l_linestatus
> ;
> {noformat}
> twice (in succession) will result in CodeGenerator cache misses in BOTH 
> executions. Since the query is identical, I would expect the same code to be 
> generated. 
> It turns out the generated code is not exactly the same, resulting in cache 
> misses when performing the lookup in the CodeGenerator cache, even though the 
> code is equivalent. 
> Below is (some portion of the) generated code for two runs of the query:
> run-1
> {noformat}
> import java.nio.ByteBuffer;
> import java.nio.ByteOrder;
> import scala.collection.Iterator;
> import org.apache.spark.sql.types.DataType;
> import org.apache.spark.sql.catalyst.expre

[jira] [Commented] (SPARK-21657) Spark has exponential time complexity to explode(array of structs)

2017-08-15 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16128242#comment-16128242
 ] 

Liang-Chi Hsieh commented on SPARK-21657:
-

[~maropu] I've noticed that change. There is a hotfix trying to revert it: 
https://github.com/apache/spark/pull/17425. But in the end the hotfix doesn't 
revert it back.

Actually, I've tried enabling codegen for GenerateExec and ran those tests 
locally without failure. So I'm wondering why we still disable it.

> Spark has exponential time complexity to explode(array of structs)
> --
>
> Key: SPARK-21657
> URL: https://issues.apache.org/jira/browse/SPARK-21657
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.0.0, 2.1.0, 2.1.1, 2.2.0
>Reporter: Ruslan Dautkhanov
>  Labels: cache, caching, collections, nested_types, performance, 
> pyspark, sparksql, sql
> Attachments: ExponentialTimeGrowth.PNG, 
> nested-data-generator-and-test.py
>
>
> It can take up to half a day to explode a modest-sized nested collection 
> (0.5m) on recent Xeon processors.
> See attached pyspark script that reproduces this problem.
> {code}
> cached_df = sqlc.sql('select individ, hholdid, explode(amft) from ' + 
> table_name).cache()
> print cached_df.count()
> {code}
> This script generates a number of tables, all with the same total number of 
> records across their nested collections (see the `scaling` variable in the loops). 
> The `scaling` variable scales up how many nested elements are in each record, but 
> scales down the number of records in the table by the same factor, so the total 
> number of records stays the same.
> Time grows exponentially (notice log-10 vertical axis scale):
> !ExponentialTimeGrowth.PNG!
> At a scaling of 50,000 (see the attached pyspark script), it took 7 hours to 
> explode the nested collections (\!) of 8k records.
> Beyond 1,000 elements per nested collection, time grows exponentially.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21716) The time-range window can't be applied on the reduce operator

2017-08-15 Thread Fan Donglai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fan Donglai updated SPARK-21716:

Description: I can't use the GroupBy + Window operators to get the newest (the 
maximum event time) row in a window. The window should be applicable to the 
reduce operator.  (was: I can't use GroupBy + Window operator to get the 
newest(the maximum event time) row in a window.It should make the window can be 
applid on the reduce operator)

>  The time-range window can't be applied on the reduce operator
> --
>
> Key: SPARK-21716
> URL: https://issues.apache.org/jira/browse/SPARK-21716
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Fan Donglai
>
> I can't use the GroupBy + Window operators to get the newest (the maximum event 
> time) row in a window. The window should be applicable to the reduce 
> operator.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21741) Python API for DataFrame-based multivariate summarizer

2017-08-15 Thread Weichen Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16128226#comment-16128226
 ] 

Weichen Xu commented on SPARK-21741:


OK I will work on this.
I will post a design doc first.

> Python API for DataFrame-based multivariate summarizer
> --
>
> Key: SPARK-21741
> URL: https://issues.apache.org/jira/browse/SPARK-21741
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Yanbo Liang
>
> We added a multivariate summarizer for the DataFrame API in SPARK-19634; we 
> should also make PySpark support it. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21741) Python API for DataFrame-based multivariate summarizer

2017-08-15 Thread Yanbo Liang (JIRA)
Yanbo Liang created SPARK-21741:
---

 Summary: Python API for DataFrame-based multivariate summarizer
 Key: SPARK-21741
 URL: https://issues.apache.org/jira/browse/SPARK-21741
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Affects Versions: 2.2.0
Reporter: Yanbo Liang


We added a multivariate summarizer for the DataFrame API in SPARK-19634; we should 
also make PySpark support it. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21741) Python API for DataFrame-based multivariate summarizer

2017-08-15 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16128223#comment-16128223
 ] 

Yanbo Liang commented on SPARK-21741:
-

[~WeichenXu123] Would you like to work on this? Thanks.

> Python API for DataFrame-based multivariate summarizer
> --
>
> Key: SPARK-21741
> URL: https://issues.apache.org/jira/browse/SPARK-21741
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Yanbo Liang
>
> We added a multivariate summarizer for the DataFrame API in SPARK-19634; we 
> should also make PySpark support it. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19634) Feature parity for descriptive statistics in MLlib

2017-08-15 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang resolved SPARK-19634.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

> Feature parity for descriptive statistics in MLlib
> --
>
> Key: SPARK-19634
> URL: https://issues.apache.org/jira/browse/SPARK-19634
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Timothy Hunter
>Assignee: Timothy Hunter
> Fix For: 2.3.0
>
>
> This ticket tracks porting the functionality of 
> spark.mllib.MultivariateOnlineSummarizer over to spark.ml.
> A design has been discussed in SPARK-19208 . Here is a design doc:
> https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit#



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21712) Clarify PySpark Column.substr() type checking error message

2017-08-15 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-21712:


Assignee: Nicholas Chammas

> Clarify PySpark Column.substr() type checking error message
> ---
>
> Key: SPARK-21712
> URL: https://issues.apache.org/jira/browse/SPARK-21712
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
>Reporter: Nicholas Chammas
>Assignee: Nicholas Chammas
>Priority: Trivial
> Fix For: 2.3.0
>
>
> https://github.com/apache/spark/blob/f0169a1c6a1ac06045d57f8aaa2c841bb39e23ac/python/pyspark/sql/column.py#L408-L409
> "Can not mix the type" is really unclear.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21712) Clarify PySpark Column.substr() type checking error message

2017-08-15 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-21712.
--
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18926
[https://github.com/apache/spark/pull/18926]

> Clarify PySpark Column.substr() type checking error message
> ---
>
> Key: SPARK-21712
> URL: https://issues.apache.org/jira/browse/SPARK-21712
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
>Reporter: Nicholas Chammas
>Priority: Trivial
> Fix For: 2.3.0
>
>
> https://github.com/apache/spark/blob/f0169a1c6a1ac06045d57f8aaa2c841bb39e23ac/python/pyspark/sql/column.py#L408-L409
> "Can not mix the type" is really unclear.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21681) MLOR do not work correctly when featureStd contains zero

2017-08-15 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-21681:
--
Shepherd: Joseph K. Bradley

> MLOR do not work correctly when featureStd contains zero
> 
>
> Key: SPARK-21681
> URL: https://issues.apache.org/jira/browse/SPARK-21681
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>
> MLOR does not work correctly when featureStd contains zero.
> We can reproduce the bug with a dataset whose features include zero 
> variance; it will generate a wrong result (all coefficients become 0):
> {code}
> val multinomialDatasetWithZeroVar = {
>   val nPoints = 100
>   val coefficients = Array(
> -0.57997, 0.912083, -0.371077,
> -0.16624, -0.84355, -0.048509)
>   val xMean = Array(5.843, 3.0)
>   val xVariance = Array(0.6856, 0.0)  // including zero variance
>   val testData = generateMultinomialLogisticInput(
> coefficients, xMean, xVariance, addIntercept = true, nPoints, seed)
>   val df = sc.parallelize(testData, 4).toDF().withColumn("weight", 
> lit(1.0))
>   df.cache()
>   df
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21681) MLOR do not work correctly when featureStd contains zero

2017-08-15 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-21681:
--
Affects Version/s: 2.3.0

> MLOR do not work correctly when featureStd contains zero
> 
>
> Key: SPARK-21681
> URL: https://issues.apache.org/jira/browse/SPARK-21681
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>
> MLOR does not work correctly when featureStd contains zero.
> We can reproduce the bug with a dataset whose features include zero 
> variance; it will generate a wrong result (all coefficients become 0):
> {code}
> val multinomialDatasetWithZeroVar = {
>   val nPoints = 100
>   val coefficients = Array(
> -0.57997, 0.912083, -0.371077,
> -0.16624, -0.84355, -0.048509)
>   val xMean = Array(5.843, 3.0)
>   val xVariance = Array(0.6856, 0.0)  // including zero variance
>   val testData = generateMultinomialLogisticInput(
> coefficients, xMean, xVariance, addIntercept = true, nPoints, seed)
>   val df = sc.parallelize(testData, 4).toDF().withColumn("weight", 
> lit(1.0))
>   df.cache()
>   df
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21681) MLOR do not work correctly when featureStd contains zero

2017-08-15 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-21681:
--
Target Version/s: 2.2.1, 2.3.0

> MLOR do not work correctly when featureStd contains zero
> 
>
> Key: SPARK-21681
> URL: https://issues.apache.org/jira/browse/SPARK-21681
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>
> MLOR does not work correctly when featureStd contains zero.
> We can reproduce the bug with a dataset whose features include zero 
> variance; it will generate a wrong result (all coefficients become 0):
> {code}
> val multinomialDatasetWithZeroVar = {
>   val nPoints = 100
>   val coefficients = Array(
> -0.57997, 0.912083, -0.371077,
> -0.16624, -0.84355, -0.048509)
>   val xMean = Array(5.843, 3.0)
>   val xVariance = Array(0.6856, 0.0)  // including zero variance
>   val testData = generateMultinomialLogisticInput(
> coefficients, xMean, xVariance, addIntercept = true, nPoints, seed)
>   val df = sc.parallelize(testData, 4).toDF().withColumn("weight", 
> lit(1.0))
>   df.cache()
>   df
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21681) MLOR do not work correctly when featureStd contains zero

2017-08-15 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-21681:
-

Assignee: Weichen Xu

> MLOR do not work correctly when featureStd contains zero
> 
>
> Key: SPARK-21681
> URL: https://issues.apache.org/jira/browse/SPARK-21681
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>
> MLOR does not work correctly when featureStd contains zero.
> We can reproduce the bug with a dataset whose features include zero 
> variance; it will generate a wrong result (all coefficients become 0):
> {code}
> val multinomialDatasetWithZeroVar = {
>   val nPoints = 100
>   val coefficients = Array(
> -0.57997, 0.912083, -0.371077,
> -0.16624, -0.84355, -0.048509)
>   val xMean = Array(5.843, 3.0)
>   val xVariance = Array(0.6856, 0.0)  // including zero variance
>   val testData = generateMultinomialLogisticInput(
> coefficients, xMean, xVariance, addIntercept = true, nPoints, seed)
>   val df = sc.parallelize(testData, 4).toDF().withColumn("weight", 
> lit(1.0))
>   df.cache()
>   df
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20589) Allow limiting task concurrency per stage

2017-08-15 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16127942#comment-16127942
 ] 

Thomas Graves commented on SPARK-20589:
---

Note that this type of option is also already supported in other big data 
engines like Pig and MapReduce.  There is a need for this, and I don't think we 
should ask the user to split up their job and make it less optimal.  This should 
not be that complex in the scheme of things.  


> Allow limiting task concurrency per stage
> -
>
> Key: SPARK-20589
> URL: https://issues.apache.org/jira/browse/SPARK-20589
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Thomas Graves
>
> It would be nice to have the ability to limit the number of concurrent tasks 
> per stage.  This is useful when your spark job might be accessing another 
> service and you don't want to DOS that service.  For instance Spark writing 
> to hbase or Spark doing http puts on a service.  Many times you want to do 
> this without limiting the number of partitions. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20589) Allow limiting task concurrency per stage

2017-08-15 Thread Amit Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16127930#comment-16127930
 ] 

Amit Kumar commented on SPARK-20589:


[~imranr] Like you said, we could restrict the number of executors for the 
whole pipeline, but it would make the throughput too slow or, worse, start 
causing OOM and shuffle errors.
As for the other option, I'm doing exactly that, breaking up one pipeline into 
multiple stages. But, as you can imagine, it makes your workflow code much 
longer and more complex than desired. Not only do we have to add the additional 
complexity of serializing intermediate data on HDFS, it also increases the total 
time.
Also, I don't agree that this is a rare use case. As [~mcnels1] also 
said, we will see this come up more and more as we start linking Apache Spark 
with external storage/querying solutions which have tighter QPS restrictions.



> Allow limiting task concurrency per stage
> -
>
> Key: SPARK-20589
> URL: https://issues.apache.org/jira/browse/SPARK-20589
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Thomas Graves
>
> It would be nice to have the ability to limit the number of concurrent tasks 
> per stage.  This is useful when your Spark job might be accessing another 
> service and you don't want to DOS that service.  For instance, Spark writing 
> to HBase or Spark doing HTTP PUTs on a service.  Many times you want to do 
> this without limiting the number of partitions. 
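
To make the last point concrete, the partition-limiting workaround that this issue 
wants to avoid looks roughly like the Scala sketch below ({{df}} stands for the 
DataFrame being written and {{callExternalService}} is a hypothetical client call):

{code:scala}
// Cap how many tasks hit the external service at once by coalescing just before
// the side-effecting action; the cost is fewer, larger partitions for that stage.
val throttled = df.coalesce(8)   // at most 8 concurrent tasks call the service
throttled.foreachPartition { rows =>
  rows.foreach(row => callExternalService(row))
}
{code}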



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20589) Allow limiting task concurrency per stage

2017-08-15 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16127911#comment-16127911
 ] 

Imran Rashid commented on SPARK-20589:
--

Couldn't you just limit the number of executors of your Spark app?

I realize that you might want more executors for other parts of your app, but 
you could always break it into multiple independent Spark apps.  I realize it 
isn't ideal, but I'm worried about the complexity of this for what seems like a 
rare use case.

> Allow limiting task concurrency per stage
> -
>
> Key: SPARK-20589
> URL: https://issues.apache.org/jira/browse/SPARK-20589
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Thomas Graves
>
> It would be nice to have the ability to limit the number of concurrent tasks 
> per stage.  This is useful when your Spark job might be accessing another 
> service and you don't want to DOS that service.  For instance, Spark writing 
> to HBase or Spark doing HTTP PUTs on a service.  Many times you want to do 
> this without limiting the number of partitions. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21731) Upgrade scalastyle to 0.9

2017-08-15 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-21731.

   Resolution: Fixed
 Assignee: Marcelo Vanzin
Fix Version/s: 2.3.0

https://github.com/apache/spark/pull/18943

> Upgrade scalastyle to 0.9
> -
>
> Key: SPARK-21731
> URL: https://issues.apache.org/jira/browse/SPARK-21731
> Project: Spark
>  Issue Type: Task
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Trivial
> Fix For: 2.3.0
>
>
> No new features that I'm interested in, but it fixes an issue with the import 
> order checker so that it provides more useful errors 
> (https://github.com/scalastyle/scalastyle/pull/185).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21740) DataFrame.write does not work with Phoenix JDBC Driver

2017-08-15 Thread Paul Wu (JIRA)
Paul Wu created SPARK-21740:
---

 Summary: DataFrame.write does not work with Phoenix JDBC Driver
 Key: SPARK-21740
 URL: https://issues.apache.org/jira/browse/SPARK-21740
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.0, 2.0.0
Reporter: Paul Wu


  The reason for this is that the Phoenix JDBC driver does not support "INSERT", 
only "UPSERT".
Exception from the following program:
17/08/15 12:18:53 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
org.apache.phoenix.exception.PhoenixParserException: ERROR 601 (42P00): Syntax 
error. Encountered "INSERT" at line 1, column 1.
at 
org.apache.phoenix.exception.PhoenixParserException.newException(PhoenixParserException.java:33)

{code:java}
import java.util.Properties;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class HbaseJDBCSpark {

    private static final SparkSession sparkSession
            = SparkSession.builder()
            .config("spark.sql.warehouse.dir", "file:///temp")
            .config("spark.driver.memory", "5g")
            .master("local[*]").appName("Spark2JdbcDs").getOrCreate();

    static final String JDBC_URL
            = "jdbc:phoenix:somehost:2181:/hbase-unsecure";

    public static void main(String[] args) {
        final Properties connectionProperties = new Properties();
        Dataset<Row> jdbcDF
                = sparkSession.read()
                .jdbc(JDBC_URL, "javatest", connectionProperties);

        jdbcDF.show();

        String url = JDBC_URL;
        Properties p = new Properties();
        p.put("driver", "org.apache.phoenix.jdbc.PhoenixDriver");
        //p.put("batchsize", "10");

        // This write fails: DataFrameWriter.jdbc generates INSERT statements,
        // which the Phoenix parser rejects (it only accepts UPSERT).
        jdbcDF.write().mode(SaveMode.Append).jdbc(url, "javatest", p);
        sparkSession.close();
    }
}
{code}
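
A workaround sketch until the writer can emit UPSERT: bypass {{DataFrameWriter.jdbc}} 
and issue Phoenix UPSERT statements directly from each partition. The Scala snippet 
below assumes a two-column table of strings; the column list, setters, and URL must 
be adapted to the actual table:

{code:scala}
import java.sql.DriverManager

jdbcDF.foreachPartition { rows =>
  val conn = DriverManager.getConnection("jdbc:phoenix:somehost:2181:/hbase-unsecure")
  // UPSERT is the Phoenix equivalent of INSERT; adjust the column list as needed.
  val stmt = conn.prepareStatement("UPSERT INTO javatest VALUES (?, ?)")
  try {
    rows.foreach { row =>
      stmt.setString(1, row.getString(0))
      stmt.setString(2, row.getString(1))
      stmt.executeUpdate()
    }
    conn.commit()   // commit explicitly in case auto-commit is disabled
  } finally {
    stmt.close()
    conn.close()
  }
}
{code}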




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17742) Spark Launcher does not get failed state in Listener

2017-08-15 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-17742.

   Resolution: Fixed
Fix Version/s: 2.3.0

https://github.com/apache/spark/pull/18877

> Spark Launcher does not get failed state in Listener 
> -
>
> Key: SPARK-17742
> URL: https://issues.apache.org/jira/browse/SPARK-17742
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.0.0
>Reporter: Aseem Bansal
> Fix For: 2.3.0
>
>
> I tried to launch an application using the code below. This is dummy code to 
> reproduce the problem. I tried exiting Spark with status -1, throwing an 
> exception, etc., but in no case did the listener give me a FAILED status. But if 
> a Spark job returns -1 or throws an exception from the main method, it should 
> be considered a failure. 
> {code}
> package com.example;
> import org.apache.spark.launcher.SparkAppHandle;
> import org.apache.spark.launcher.SparkLauncher;
> import java.io.IOException;
> public class Main2 {
> public static void main(String[] args) throws IOException, 
> InterruptedException {
> SparkLauncher launcher = new SparkLauncher()
> .setSparkHome("/opt/spark2")
> 
> .setAppResource("/home/aseem/projects/testsparkjob/build/libs/testsparkjob-1.0-SNAPSHOT.jar")
> .setMainClass("com.example.Main")
> .setMaster("local[2]");
> launcher.startApplication(new MyListener());
> Thread.sleep(1000 * 60);
> }
> }
> class MyListener implements SparkAppHandle.Listener {
> @Override
> public void stateChanged(SparkAppHandle handle) {
> System.out.println("state changed " + handle.getState());
> }
> @Override
> public void infoChanged(SparkAppHandle handle) {
> System.out.println("info changed " + handle.getState());
> }
> }
> {code}
> The spark job is 
> {code}
> package com.example;
> import org.apache.spark.sql.SparkSession;
> import java.io.IOException;
> public class Main {
> public static void main(String[] args) throws IOException {
> SparkSession sparkSession = SparkSession
> .builder()
> .appName("" + System.currentTimeMillis())
> .getOrCreate();
> try {
> for (int i = 0; i < 15; i++) {
> Thread.sleep(1000);
> System.out.println("sleeping 1");
> }
> } catch (InterruptedException e) {
> e.printStackTrace();
> }
> //sparkSession.stop();
> System.exit(-1);
> }
> }
> {code}
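
For reference, once states are reported correctly (the fix above targets 2.3.0), a 
listener only needs to check for terminal states; a minimal Scala sketch using the 
existing {{SparkAppHandle.State.isFinal()}} API:

{code:scala}
import org.apache.spark.launcher.SparkAppHandle

// Reacts once the application reaches any terminal state (FINISHED, FAILED, KILLED, ...).
class TerminalStateListener extends SparkAppHandle.Listener {
  override def stateChanged(handle: SparkAppHandle): Unit = {
    val state = handle.getState
    if (state.isFinal) {
      println(s"application ended in terminal state: $state")
    }
  }
  override def infoChanged(handle: SparkAppHandle): Unit = ()
}
{code}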



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21720) Filter predicate with many conditions throw stackoverflow error

2017-08-15 Thread poplav (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16127657#comment-16127657
 ] 

poplav edited comment on SPARK-21720 at 8/15/17 6:10 PM:
-

[~kiszk] So, to clarify did the fix simply bump up the number of fields before 
we hit a failure and not actually make the number indefinite?  As in we went 
from failing at 400 to failing at 1024.


was (Author: poplav):
So, to clarify did the fix simply bump up the number of fields before we hit a 
failure and not actually make the number indefinite?  As in we went from 
failing at 400 to failing at 1024.

> Filter predicate with many conditions throw stackoverflow error
> ---
>
> Key: SPARK-21720
> URL: https://issues.apache.org/jira/browse/SPARK-21720
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: srinivasan
>
> When trying to filter a dataset with many predicate conditions, through both Spark 
> SQL and the Dataset filter transformation as described below, Spark throws a 
> stack overflow exception
> Case 1: Filter Transformation on Data
> Dataset filter = sourceDataset.filter(String.format("not(%s)", 
> buildQuery()));
> filter.show();
> where buildQuery() returns
> Field1 = "" and  Field2 = "" and  Field3 = "" and  Field4 = "" and  Field5 = 
> "" and  BLANK_5 = "" and  Field7 = "" and  Field8 = "" and  Field9 = "" and  
> Field10 = "" and  Field11 = "" and  Field12 = "" and  Field13 = "" and  
> Field14 = "" and  Field15 = "" and  Field16 = "" and  Field17 = "" and  
> Field18 = "" and  Field19 = "" and  Field20 = "" and  Field21 = "" and  
> Field22 = "" and  Field23 = "" and  Field24 = "" and  Field25 = "" and  
> Field26 = "" and  Field27 = "" and  Field28 = "" and  Field29 = "" and  
> Field30 = "" and  Field31 = "" and  Field32 = "" and  Field33 = "" and  
> Field34 = "" and  Field35 = "" and  Field36 = "" and  Field37 = "" and  
> Field38 = "" and  Field39 = "" and  Field40 = "" and  Field41 = "" and  
> Field42 = "" and  Field43 = "" and  Field44 = "" and  Field45 = "" and  
> Field46 = "" and  Field47 = "" and  Field48 = "" and  Field49 = "" and  
> Field50 = "" and  Field51 = "" and  Field52 = "" and  Field53 = "" and  
> Field54 = "" and  Field55 = "" and  Field56 = "" and  Field57 = "" and  
> Field58 = "" and  Field59 = "" and  Field60 = "" and  Field61 = "" and  
> Field62 = "" and  Field63 = "" and  Field64 = "" and  Field65 = "" and  
> Field66 = "" and  Field67 = "" and  Field68 = "" and  Field69 = "" and  
> Field70 = "" and  Field71 = "" and  Field72 = "" and  Field73 = "" and  
> Field74 = "" and  Field75 = "" and  Field76 = "" and  Field77 = "" and  
> Field78 = "" and  Field79 = "" and  Field80 = "" and  Field81 = "" and  
> Field82 = "" and  Field83 = "" and  Field84 = "" and  Field85 = "" and  
> Field86 = "" and  Field87 = "" and  Field88 = "" and  Field89 = "" and  
> Field90 = "" and  Field91 = "" and  Field92 = "" and  Field93 = "" and  
> Field94 = "" and  Field95 = "" and  Field96 = "" and  Field97 = "" and  
> Field98 = "" and  Field99 = "" and  Field100 = "" and  Field101 = "" and  
> Field102 = "" and  Field103 = "" and  Field104 = "" and  Field105 = "" and  
> Field106 = "" and  Field107 = "" and  Field108 = "" and  Field109 = "" and  
> Field110 = "" and  Field111 = "" and  Field112 = "" and  Field113 = "" and  
> Field114 = "" and  Field115 = "" and  Field116 = "" and  Field117 = "" and  
> Field118 = "" and  Field119 = "" and  Field120 = "" and  Field121 = "" and  
> Field122 = "" and  Field123 = "" and  Field124 = "" and  Field125 = "" and  
> Field126 = "" and  Field127 = "" and  Field128 = "" and  Field129 = "" and  
> Field130 = "" and  Field131 = "" and  Field132 = "" and  Field133 = "" and  
> Field134 = "" and  Field135 = "" and  Field136 = "" and  Field137 = "" and  
> Field138 = "" and  Field139 = "" and  Field140 = "" and  Field141 = "" and  
> Field142 = "" and  Field143 = "" and  Field144 = "" and  Field145 = "" and  
> Field146 = "" and  Field147 = "" and  Field148 = "" and  Field149 = "" and  
> Field150 = "" and  Field151 = "" and  Field152 = "" and  Field153 = "" and  
> Field154 = "" and  Field155 = "" and  Field156 = "" and  Field157 = "" and  
> Field158 = "" and  Field159 = "" and  Field160 = "" and  Field161 = "" and  
> Field162 = "" and  Field163 = "" and  Field164 = "" and  Field165 = "" and  
> Field166 = "" and  Field167 = "" and  Field168 = "" and  Field169 = "" and  
> Field170 = "" and  Field171 = "" and  Field172 = "" and  Field173 = "" and  
> Field174 = "" and  Field175 = "" and  Field176 = "" and  Field177 = "" and  
> Field178 = "" and  Field179 = "" and  Field180 = "" and  Field181 = "" and  
> Field182 = "" and  Field183 = "" and  Field184 = "" and  Field185 =

[jira] [Comment Edited] (SPARK-21720) Filter predicate with many conditions throw stackoverflow error

2017-08-15 Thread poplav (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16127657#comment-16127657
 ] 

poplav edited comment on SPARK-21720 at 8/15/17 6:10 PM:
-

[~kiszk] So, to clarify did the fix simply bump up the number of fields before 
we hit a failure and not actually make the number indefinite?  As in we went 
from failing at 400 to failing at 1024?


was (Author: poplav):
[~kiszk] So, to clarify did the fix simply bump up the number of fields before 
we hit a failure and not actually make the number indefinite?  As in we went 
from failing at 400 to failing at 1024.

> Filter predicate with many conditions throw stackoverflow error
> ---
>
> Key: SPARK-21720
> URL: https://issues.apache.org/jira/browse/SPARK-21720
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: srinivasan
>
> When trying to filter a dataset with many predicate conditions, through both Spark 
> SQL and the Dataset filter transformation as described below, Spark throws a 
> stack overflow exception
> Case 1: Filter Transformation on Data
> Dataset filter = sourceDataset.filter(String.format("not(%s)", 
> buildQuery()));
> filter.show();
> where buildQuery() returns
> Field1 = "" and  Field2 = "" and  Field3 = "" and  Field4 = "" and  Field5 = 
> "" and  BLANK_5 = "" and  Field7 = "" and  Field8 = "" and  Field9 = "" and  
> Field10 = "" and  Field11 = "" and  Field12 = "" and  Field13 = "" and  
> Field14 = "" and  Field15 = "" and  Field16 = "" and  Field17 = "" and  
> Field18 = "" and  Field19 = "" and  Field20 = "" and  Field21 = "" and  
> Field22 = "" and  Field23 = "" and  Field24 = "" and  Field25 = "" and  
> Field26 = "" and  Field27 = "" and  Field28 = "" and  Field29 = "" and  
> Field30 = "" and  Field31 = "" and  Field32 = "" and  Field33 = "" and  
> Field34 = "" and  Field35 = "" and  Field36 = "" and  Field37 = "" and  
> Field38 = "" and  Field39 = "" and  Field40 = "" and  Field41 = "" and  
> Field42 = "" and  Field43 = "" and  Field44 = "" and  Field45 = "" and  
> Field46 = "" and  Field47 = "" and  Field48 = "" and  Field49 = "" and  
> Field50 = "" and  Field51 = "" and  Field52 = "" and  Field53 = "" and  
> Field54 = "" and  Field55 = "" and  Field56 = "" and  Field57 = "" and  
> Field58 = "" and  Field59 = "" and  Field60 = "" and  Field61 = "" and  
> Field62 = "" and  Field63 = "" and  Field64 = "" and  Field65 = "" and  
> Field66 = "" and  Field67 = "" and  Field68 = "" and  Field69 = "" and  
> Field70 = "" and  Field71 = "" and  Field72 = "" and  Field73 = "" and  
> Field74 = "" and  Field75 = "" and  Field76 = "" and  Field77 = "" and  
> Field78 = "" and  Field79 = "" and  Field80 = "" and  Field81 = "" and  
> Field82 = "" and  Field83 = "" and  Field84 = "" and  Field85 = "" and  
> Field86 = "" and  Field87 = "" and  Field88 = "" and  Field89 = "" and  
> Field90 = "" and  Field91 = "" and  Field92 = "" and  Field93 = "" and  
> Field94 = "" and  Field95 = "" and  Field96 = "" and  Field97 = "" and  
> Field98 = "" and  Field99 = "" and  Field100 = "" and  Field101 = "" and  
> Field102 = "" and  Field103 = "" and  Field104 = "" and  Field105 = "" and  
> Field106 = "" and  Field107 = "" and  Field108 = "" and  Field109 = "" and  
> Field110 = "" and  Field111 = "" and  Field112 = "" and  Field113 = "" and  
> Field114 = "" and  Field115 = "" and  Field116 = "" and  Field117 = "" and  
> Field118 = "" and  Field119 = "" and  Field120 = "" and  Field121 = "" and  
> Field122 = "" and  Field123 = "" and  Field124 = "" and  Field125 = "" and  
> Field126 = "" and  Field127 = "" and  Field128 = "" and  Field129 = "" and  
> Field130 = "" and  Field131 = "" and  Field132 = "" and  Field133 = "" and  
> Field134 = "" and  Field135 = "" and  Field136 = "" and  Field137 = "" and  
> Field138 = "" and  Field139 = "" and  Field140 = "" and  Field141 = "" and  
> Field142 = "" and  Field143 = "" and  Field144 = "" and  Field145 = "" and  
> Field146 = "" and  Field147 = "" and  Field148 = "" and  Field149 = "" and  
> Field150 = "" and  Field151 = "" and  Field152 = "" and  Field153 = "" and  
> Field154 = "" and  Field155 = "" and  Field156 = "" and  Field157 = "" and  
> Field158 = "" and  Field159 = "" and  Field160 = "" and  Field161 = "" and  
> Field162 = "" and  Field163 = "" and  Field164 = "" and  Field165 = "" and  
> Field166 = "" and  Field167 = "" and  Field168 = "" and  Field169 = "" and  
> Field170 = "" and  Field171 = "" and  Field172 = "" and  Field173 = "" and  
> Field174 = "" and  Field175 = "" and  Field176 = "" and  Field177 = "" and  
> Field178 = "" and  Field179 = "" and  Field180 = "" and  Field181 = "" and  
> Field182 = "" and  Field183 = "" and  Field184 = "" and  F

[jira] [Commented] (SPARK-21720) Filter predicate with many conditions throw stackoverflow error

2017-08-15 Thread poplav (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16127657#comment-16127657
 ] 

poplav commented on SPARK-21720:


So, to clarify did the fix simply bump up the number of fields before we hit a 
failure and not actually make the number indefinite?  As in we went from 
failing at 400 to failing at 1024.

> Filter predicate with many conditions throw stackoverflow error
> ---
>
> Key: SPARK-21720
> URL: https://issues.apache.org/jira/browse/SPARK-21720
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: srinivasan
>
> When trying to filter a dataset with many predicate conditions, through both Spark 
> SQL and the Dataset filter transformation as described below, Spark throws a 
> stack overflow exception
> Case 1: Filter Transformation on Data
> Dataset filter = sourceDataset.filter(String.format("not(%s)", 
> buildQuery()));
> filter.show();
> where buildQuery() returns
> Field1 = "" and  Field2 = "" and  Field3 = "" and  Field4 = "" and  Field5 = 
> "" and  BLANK_5 = "" and  Field7 = "" and  Field8 = "" and  Field9 = "" and  
> Field10 = "" and  Field11 = "" and  Field12 = "" and  Field13 = "" and  
> Field14 = "" and  Field15 = "" and  Field16 = "" and  Field17 = "" and  
> Field18 = "" and  Field19 = "" and  Field20 = "" and  Field21 = "" and  
> Field22 = "" and  Field23 = "" and  Field24 = "" and  Field25 = "" and  
> Field26 = "" and  Field27 = "" and  Field28 = "" and  Field29 = "" and  
> Field30 = "" and  Field31 = "" and  Field32 = "" and  Field33 = "" and  
> Field34 = "" and  Field35 = "" and  Field36 = "" and  Field37 = "" and  
> Field38 = "" and  Field39 = "" and  Field40 = "" and  Field41 = "" and  
> Field42 = "" and  Field43 = "" and  Field44 = "" and  Field45 = "" and  
> Field46 = "" and  Field47 = "" and  Field48 = "" and  Field49 = "" and  
> Field50 = "" and  Field51 = "" and  Field52 = "" and  Field53 = "" and  
> Field54 = "" and  Field55 = "" and  Field56 = "" and  Field57 = "" and  
> Field58 = "" and  Field59 = "" and  Field60 = "" and  Field61 = "" and  
> Field62 = "" and  Field63 = "" and  Field64 = "" and  Field65 = "" and  
> Field66 = "" and  Field67 = "" and  Field68 = "" and  Field69 = "" and  
> Field70 = "" and  Field71 = "" and  Field72 = "" and  Field73 = "" and  
> Field74 = "" and  Field75 = "" and  Field76 = "" and  Field77 = "" and  
> Field78 = "" and  Field79 = "" and  Field80 = "" and  Field81 = "" and  
> Field82 = "" and  Field83 = "" and  Field84 = "" and  Field85 = "" and  
> Field86 = "" and  Field87 = "" and  Field88 = "" and  Field89 = "" and  
> Field90 = "" and  Field91 = "" and  Field92 = "" and  Field93 = "" and  
> Field94 = "" and  Field95 = "" and  Field96 = "" and  Field97 = "" and  
> Field98 = "" and  Field99 = "" and  Field100 = "" and  Field101 = "" and  
> Field102 = "" and  Field103 = "" and  Field104 = "" and  Field105 = "" and  
> Field106 = "" and  Field107 = "" and  Field108 = "" and  Field109 = "" and  
> Field110 = "" and  Field111 = "" and  Field112 = "" and  Field113 = "" and  
> Field114 = "" and  Field115 = "" and  Field116 = "" and  Field117 = "" and  
> Field118 = "" and  Field119 = "" and  Field120 = "" and  Field121 = "" and  
> Field122 = "" and  Field123 = "" and  Field124 = "" and  Field125 = "" and  
> Field126 = "" and  Field127 = "" and  Field128 = "" and  Field129 = "" and  
> Field130 = "" and  Field131 = "" and  Field132 = "" and  Field133 = "" and  
> Field134 = "" and  Field135 = "" and  Field136 = "" and  Field137 = "" and  
> Field138 = "" and  Field139 = "" and  Field140 = "" and  Field141 = "" and  
> Field142 = "" and  Field143 = "" and  Field144 = "" and  Field145 = "" and  
> Field146 = "" and  Field147 = "" and  Field148 = "" and  Field149 = "" and  
> Field150 = "" and  Field151 = "" and  Field152 = "" and  Field153 = "" and  
> Field154 = "" and  Field155 = "" and  Field156 = "" and  Field157 = "" and  
> Field158 = "" and  Field159 = "" and  Field160 = "" and  Field161 = "" and  
> Field162 = "" and  Field163 = "" and  Field164 = "" and  Field165 = "" and  
> Field166 = "" and  Field167 = "" and  Field168 = "" and  Field169 = "" and  
> Field170 = "" and  Field171 = "" and  Field172 = "" and  Field173 = "" and  
> Field174 = "" and  Field175 = "" and  Field176 = "" and  Field177 = "" and  
> Field178 = "" and  Field179 = "" and  Field180 = "" and  Field181 = "" and  
> Field182 = "" and  Field183 = "" and  Field184 = "" and  Field185 = "" and  
> Field186 = "" and  Field187 = "" and  Field188 = "" and  Field189 = "" and  
> Field190 = "" and  Field191 = "" and  Field192 = "" and  Field193 = "" and  
> Field194 = "" and  Field195 = "" and  Field196 = "" and  Field197 = "" and  
> Field198 = "" and  Fie

[jira] [Commented] (SPARK-21738) Thriftserver doesn't cancel jobs when session is closed

2017-08-15 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16127542#comment-16127542
 ] 

Xiao Li commented on SPARK-21738:
-

https://github.com/apache/spark/pull/18951

> Thriftserver doesn't cancel jobs when session is closed
> ---
>
> Key: SPARK-21738
> URL: https://issues.apache.org/jira/browse/SPARK-21738
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Marco Gaido
>
> When a session is closed, the jobs launched by that session should be killed 
> in order to avoid waste of resources. Instead, this doesn't happen.
> So at the moment, if a user launches a query and then closes his connection, 
> the query goes on running until completion. This behavior should be changed.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21657) Spark has exponential time complexity to explode(array of structs)

2017-08-15 Thread Ruslan Dautkhanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16127511#comment-16127511
 ] 

Ruslan Dautkhanov edited comment on SPARK-21657 at 8/15/17 4:59 PM:


Thank you [~maropu] and [~viirya], that commit is for Spark 2.2 so this problem 
might be worse in 2.2, but I don't think it's a root cause.
As we see the same exponential time complexity to explode a nested array in 
Spark 2.0 and 2.1.


was (Author: tagar):
Thank you [~maropu] and [~viirya], that commit is for Spark 2.2 so this problem 
might be worse in 2.2, but I don't think it's a root cause.
As we see the same exponential time complexity to explode in Spark 2.0 and 2.1.

> Spark has exponential time complexity to explode(array of structs)
> --
>
> Key: SPARK-21657
> URL: https://issues.apache.org/jira/browse/SPARK-21657
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.0.0, 2.1.0, 2.1.1, 2.2.0
>Reporter: Ruslan Dautkhanov
>  Labels: cache, caching, collections, nested_types, performance, 
> pyspark, sparksql, sql
> Attachments: ExponentialTimeGrowth.PNG, 
> nested-data-generator-and-test.py
>
>
> It can take up to half a day to explode a modest-sized nested collection 
> (0.5m) on recent Xeon processors.
> See attached pyspark script that reproduces this problem.
> {code}
> cached_df = sqlc.sql('select individ, hholdid, explode(amft) from ' + 
> table_name).cache()
> print sqlc.count()
> {code}
> This script generates a number of tables with the same total number of records 
> across all nested collections (see the `scaling` variable in the loops). 
> The `scaling` variable scales up how many nested elements are in each record, but 
> scales down the number of records in the table by the same factor, so the total 
> number of records stays the same.
> Time grows exponentially (notice log-10 vertical axis scale):
> !ExponentialTimeGrowth.PNG!
> At scaling of 50,000 (see attached pyspark script), it took 7 hours to 
> explode the nested collections (\!) of 8k records.
> Beyond 1000 elements in a nested collection, time grows exponentially.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21657) Spark has exponential time complexity to explode(array of structs)

2017-08-15 Thread Ruslan Dautkhanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16127511#comment-16127511
 ] 

Ruslan Dautkhanov commented on SPARK-21657:
---

Thank you [~maropu] and [~viirya], that commit is for Spark 2.2 so this problem 
might be worse in 2.2, but I don't think it's a root cause.
As we see the same exponential time complexity to explode in Spark 2.0 and 2.1.

> Spark has exponential time complexity to explode(array of structs)
> --
>
> Key: SPARK-21657
> URL: https://issues.apache.org/jira/browse/SPARK-21657
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.0.0, 2.1.0, 2.1.1, 2.2.0
>Reporter: Ruslan Dautkhanov
>  Labels: cache, caching, collections, nested_types, performance, 
> pyspark, sparksql, sql
> Attachments: ExponentialTimeGrowth.PNG, 
> nested-data-generator-and-test.py
>
>
> It can take up to half a day to explode a modest-sized nested collection 
> (0.5m) on recent Xeon processors.
> See attached pyspark script that reproduces this problem.
> {code}
> cached_df = sqlc.sql('select individ, hholdid, explode(amft) from ' + 
> table_name).cache()
> print sqlc.count()
> {code}
> This script generates a number of tables with the same total number of records 
> across all nested collections (see the `scaling` variable in the loops). 
> The `scaling` variable scales up how many nested elements are in each record, but 
> scales down the number of records in the table by the same factor, so the total 
> number of records stays the same.
> Time grows exponentially (notice log-10 vertical axis scale):
> !ExponentialTimeGrowth.PNG!
> At scaling of 50,000 (see attached pyspark script), it took 7 hours to 
> explode the nested collections (\!) of 8k records.
> Beyond 1000 elements in a nested collection, time grows exponentially.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21720) Filter predicate with many conditions throw stackoverflow error

2017-08-15 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16127477#comment-16127477
 ] 

Kazuaki Ishizaki commented on SPARK-21720:
--

In this case, adding the JVM option {{-Xss512m}} eliminates this exception and the 
query works well.

When the number of fields is 1024, I got the following exception:
{code}
08:41:40.022 ERROR 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator: failed to 
compile: org.codehaus.janino.JaninoRuntimeException: Code of method 
"apply(Lorg/apache/spark/sql/catalyst/InternalRow;)Lorg/apache/spark/sql/catalyst/expressions/UnsafeRow;"
 of class 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
 grows beyond 64 KB
...
{code}

I am working on solving this 64KB problem.
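
For reference, the larger thread stack can be supplied at submit time for both the 
driver and the executors; the class and jar names below are placeholders:

{code}
spark-submit \
  --conf "spark.driver.extraJavaOptions=-Xss512m" \
  --conf "spark.executor.extraJavaOptions=-Xss512m" \
  --class com.example.App app.jar
{code}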

> Filter predicate with many conditions throw stackoverflow error
> ---
>
> Key: SPARK-21720
> URL: https://issues.apache.org/jira/browse/SPARK-21720
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: srinivasan
>
> When trying to filter a dataset with many predicate conditions, through both Spark 
> SQL and the Dataset filter transformation as described below, Spark throws a 
> stack overflow exception
> Case 1: Filter Transformation on Data
> Dataset filter = sourceDataset.filter(String.format("not(%s)", 
> buildQuery()));
> filter.show();
> where buildQuery() returns
> Field1 = "" and  Field2 = "" and  Field3 = "" and  Field4 = "" and  Field5 = 
> "" and  BLANK_5 = "" and  Field7 = "" and  Field8 = "" and  Field9 = "" and  
> Field10 = "" and  Field11 = "" and  Field12 = "" and  Field13 = "" and  
> Field14 = "" and  Field15 = "" and  Field16 = "" and  Field17 = "" and  
> Field18 = "" and  Field19 = "" and  Field20 = "" and  Field21 = "" and  
> Field22 = "" and  Field23 = "" and  Field24 = "" and  Field25 = "" and  
> Field26 = "" and  Field27 = "" and  Field28 = "" and  Field29 = "" and  
> Field30 = "" and  Field31 = "" and  Field32 = "" and  Field33 = "" and  
> Field34 = "" and  Field35 = "" and  Field36 = "" and  Field37 = "" and  
> Field38 = "" and  Field39 = "" and  Field40 = "" and  Field41 = "" and  
> Field42 = "" and  Field43 = "" and  Field44 = "" and  Field45 = "" and  
> Field46 = "" and  Field47 = "" and  Field48 = "" and  Field49 = "" and  
> Field50 = "" and  Field51 = "" and  Field52 = "" and  Field53 = "" and  
> Field54 = "" and  Field55 = "" and  Field56 = "" and  Field57 = "" and  
> Field58 = "" and  Field59 = "" and  Field60 = "" and  Field61 = "" and  
> Field62 = "" and  Field63 = "" and  Field64 = "" and  Field65 = "" and  
> Field66 = "" and  Field67 = "" and  Field68 = "" and  Field69 = "" and  
> Field70 = "" and  Field71 = "" and  Field72 = "" and  Field73 = "" and  
> Field74 = "" and  Field75 = "" and  Field76 = "" and  Field77 = "" and  
> Field78 = "" and  Field79 = "" and  Field80 = "" and  Field81 = "" and  
> Field82 = "" and  Field83 = "" and  Field84 = "" and  Field85 = "" and  
> Field86 = "" and  Field87 = "" and  Field88 = "" and  Field89 = "" and  
> Field90 = "" and  Field91 = "" and  Field92 = "" and  Field93 = "" and  
> Field94 = "" and  Field95 = "" and  Field96 = "" and  Field97 = "" and  
> Field98 = "" and  Field99 = "" and  Field100 = "" and  Field101 = "" and  
> Field102 = "" and  Field103 = "" and  Field104 = "" and  Field105 = "" and  
> Field106 = "" and  Field107 = "" and  Field108 = "" and  Field109 = "" and  
> Field110 = "" and  Field111 = "" and  Field112 = "" and  Field113 = "" and  
> Field114 = "" and  Field115 = "" and  Field116 = "" and  Field117 = "" and  
> Field118 = "" and  Field119 = "" and  Field120 = "" and  Field121 = "" and  
> Field122 = "" and  Field123 = "" and  Field124 = "" and  Field125 = "" and  
> Field126 = "" and  Field127 = "" and  Field128 = "" and  Field129 = "" and  
> Field130 = "" and  Field131 = "" and  Field132 = "" and  Field133 = "" and  
> Field134 = "" and  Field135 = "" and  Field136 = "" and  Field137 = "" and  
> Field138 = "" and  Field139 = "" and  Field140 = "" and  Field141 = "" and  
> Field142 = "" and  Field143 = "" and  Field144 = "" and  Field145 = "" and  
> Field146 = "" and  Field147 = "" and  Field148 = "" and  Field149 = "" and  
> Field150 = "" and  Field151 = "" and  Field152 = "" and  Field153 = "" and  
> Field154 = "" and  Field155 = "" and  Field156 = "" and  Field157 = "" and  
> Field158 = "" and  Field159 = "" and  Field160 = "" and  Field161 = "" and  
> Field162 = "" and  Field163 = "" and  Field164 = "" and  Field165 = "" and  
> Field166 = "" and  Field167 = "" and  Field168 = "" and  Field169 = "" and  
> Field170 = "" and  Field171 = "" and  Field172 = "" and  Field173 = "" and  
> Field174 = "" and  Field175 = "" and  Field176 = "" and 

[jira] [Comment Edited] (SPARK-21720) Filter predicate with many conditions throw stackoverflow error

2017-08-15 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16127477#comment-16127477
 ] 

Kazuaki Ishizaki edited comment on SPARK-21720 at 8/15/17 4:26 PM:
---

In this case, adding the JVM option {{-Xss512m}} eliminates this exception and the 
query works well.

However, when the number of fields is 1024, I got the following exception:
{code}
08:41:40.022 ERROR 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator: failed to 
compile: org.codehaus.janino.JaninoRuntimeException: Code of method 
"apply(Lorg/apache/spark/sql/catalyst/InternalRow;)Lorg/apache/spark/sql/catalyst/expressions/UnsafeRow;"
 of class 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
 grows beyond 64 KB
...
{code}

I am working on solving this 64KB problem.


was (Author: kiszk):
In this case, adding the JVM option {{-Xss512m}} eliminates this exception and the 
query works well.

When the number of fields is 1024, I got the following exception:
{code}
08:41:40.022 ERROR 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator: failed to 
compile: org.codehaus.janino.JaninoRuntimeException: Code of method 
"apply(Lorg/apache/spark/sql/catalyst/InternalRow;)Lorg/apache/spark/sql/catalyst/expressions/UnsafeRow;"
 of class 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
 grows beyond 64 KB
...
{code}

I am working on solving this 64KB problem.

> Filter predicate with many conditions throw stackoverflow error
> ---
>
> Key: SPARK-21720
> URL: https://issues.apache.org/jira/browse/SPARK-21720
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: srinivasan
>
> When trying to filter a dataset with many predicate conditions, through both Spark 
> SQL and the Dataset filter transformation as described below, Spark throws a 
> stack overflow exception
> Case 1: Filter Transformation on Data
> Dataset filter = sourceDataset.filter(String.format("not(%s)", 
> buildQuery()));
> filter.show();
> where buildQuery() returns
> Field1 = "" and  Field2 = "" and  Field3 = "" and  Field4 = "" and  Field5 = 
> "" and  BLANK_5 = "" and  Field7 = "" and  Field8 = "" and  Field9 = "" and  
> Field10 = "" and  Field11 = "" and  Field12 = "" and  Field13 = "" and  
> Field14 = "" and  Field15 = "" and  Field16 = "" and  Field17 = "" and  
> Field18 = "" and  Field19 = "" and  Field20 = "" and  Field21 = "" and  
> Field22 = "" and  Field23 = "" and  Field24 = "" and  Field25 = "" and  
> Field26 = "" and  Field27 = "" and  Field28 = "" and  Field29 = "" and  
> Field30 = "" and  Field31 = "" and  Field32 = "" and  Field33 = "" and  
> Field34 = "" and  Field35 = "" and  Field36 = "" and  Field37 = "" and  
> Field38 = "" and  Field39 = "" and  Field40 = "" and  Field41 = "" and  
> Field42 = "" and  Field43 = "" and  Field44 = "" and  Field45 = "" and  
> Field46 = "" and  Field47 = "" and  Field48 = "" and  Field49 = "" and  
> Field50 = "" and  Field51 = "" and  Field52 = "" and  Field53 = "" and  
> Field54 = "" and  Field55 = "" and  Field56 = "" and  Field57 = "" and  
> Field58 = "" and  Field59 = "" and  Field60 = "" and  Field61 = "" and  
> Field62 = "" and  Field63 = "" and  Field64 = "" and  Field65 = "" and  
> Field66 = "" and  Field67 = "" and  Field68 = "" and  Field69 = "" and  
> Field70 = "" and  Field71 = "" and  Field72 = "" and  Field73 = "" and  
> Field74 = "" and  Field75 = "" and  Field76 = "" and  Field77 = "" and  
> Field78 = "" and  Field79 = "" and  Field80 = "" and  Field81 = "" and  
> Field82 = "" and  Field83 = "" and  Field84 = "" and  Field85 = "" and  
> Field86 = "" and  Field87 = "" and  Field88 = "" and  Field89 = "" and  
> Field90 = "" and  Field91 = "" and  Field92 = "" and  Field93 = "" and  
> Field94 = "" and  Field95 = "" and  Field96 = "" and  Field97 = "" and  
> Field98 = "" and  Field99 = "" and  Field100 = "" and  Field101 = "" and  
> Field102 = "" and  Field103 = "" and  Field104 = "" and  Field105 = "" and  
> Field106 = "" and  Field107 = "" and  Field108 = "" and  Field109 = "" and  
> Field110 = "" and  Field111 = "" and  Field112 = "" and  Field113 = "" and  
> Field114 = "" and  Field115 = "" and  Field116 = "" and  Field117 = "" and  
> Field118 = "" and  Field119 = "" and  Field120 = "" and  Field121 = "" and  
> Field122 = "" and  Field123 = "" and  Field124 = "" and  Field125 = "" and  
> Field126 = "" and  Field127 = "" and  Field128 = "" and  Field129 = "" and  
> Field130 = "" and  Field131 = "" and  Field132 = "" and  Field133 = "" and  
> Field134 = "" and  Field135 = "" and  Field136 = "" and  Field137 = "" and  
> Field138 = "" and  Field139 = "" and  Field140 = "" and  Field141 = "" and  

[jira] [Resolved] (SPARK-21034) Allow filter pushdown filters through non deterministic functions for columns involved in groupby / join

2017-08-15 Thread Abhijit Bhole (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Abhijit Bhole resolved SPARK-21034.
---
Resolution: Duplicate

> Allow filter pushdown filters through non deterministic functions for columns 
> involved in groupby / join
> 
>
> Key: SPARK-21034
> URL: https://issues.apache.org/jira/browse/SPARK-21034
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Abhijit Bhole
>
> If the column is involved in the aggregation / join, then pushing down the 
> filter should not change the results.
> Here is a code sample: 
> {code:java}
> from pyspark.sql import functions as F
> df = spark.createDataFrame([{ "a": 1, "b" : 2, "c":7}, { "a": 3, "b" : 4, "c" 
> : 8},
>{ "a": 1, "b" : 5, "c":7}, { "a": 1, "b" : 6, 
> "c":7} ])
> df.groupBy(["a"]).agg(F.sum("b")).where("a = 1").explain()
> df.groupBy(["a"]).agg(F.sum("b"), F.first("c")).where("a = 1").explain()
> == Physical Plan ==
> *HashAggregate(keys=[a#15L], functions=[sum(b#16L)])
> +- Exchange hashpartitioning(a#15L, 4)
>+- *HashAggregate(keys=[a#15L], functions=[partial_sum(b#16L)])
>   +- *Project [a#15L, b#16L]
>  +- *Filter (isnotnull(a#15L) && (a#15L = 1))
> +- Scan ExistingRDD[a#15L,b#16L,c#17L]
> >>>
> >>> df.groupBy(["a"]).agg(F.sum("b"), F.first("c")).where("a = 1").explain()
> == Physical Plan ==
> *Filter (isnotnull(a#15L) && (a#15L = 1))
> +- *HashAggregate(keys=[a#15L], functions=[sum(b#16L), first(c#17L, false)])
>+- Exchange hashpartitioning(a#15L, 4)
>   +- *HashAggregate(keys=[a#15L], functions=[partial_sum(b#16L), 
> partial_first(c#17L, false)])
>  +- Scan ExistingRDD[a#15L,b#16L,c#17L]
> {code}
> As you can see, the filter is not pushed down when F.first aggregate function 
> is used.
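
Until the optimizer handles this, the predicate can be applied before the aggregation 
by hand; a Scala sketch equivalent to the pyspark example above ({{df}} is the same 
DataFrame):

{code:scala}
import org.apache.spark.sql.functions.{first, sum}

// Filtering on the grouping key before the aggregate yields the same result
// and does not depend on the filter being pushed through first().
val agg = df.filter(df("a") === 1)
  .groupBy("a")
  .agg(sum("b"), first("c"))
{code}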



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21739) timestamp partition would fail in v2.2.0

2017-08-15 Thread wangzhihao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangzhihao updated SPARK-21739:
---
Description: 
Spark v2.2.0 introduces TimeZoneAwareExpression, which causes bugs if we 
select data from a table with timestamp partitions.
The steps to reproduce it:

{code:java}
spark.sql("create table test (foo string) partitioned by (ts timestamp)")
spark.sql("insert into table test partition(ts = 1) values('hi')")
spark.table("test").show()
{code}

The root cause is that TableReader.scala#230 tries to cast the string to a 
timestamp regardless of whether the timeZone exists.

Here is the error stack trace

{code: none}
java.util.NoSuchElementException: None.get
  at scala.None$.get(Option.scala:347)
  at scala.None$.get(Option.scala:345)
  at 
org.apache.spark.sql.catalyst.expressions.TimeZoneAwareExpression$class.timeZone(datetimeExpressions.scala:46)
  at 
org.apache.spark.sql.catalyst.expressions.Cast.timeZone$lzycompute(Cast.scala:172)

 at 
org.apache.spark.sql.catalyst.expressions.Cast.timeZone(Cast.scala:172)
  at 
org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToTimestamp$1$$anonfun$apply$24.apply(Cast.scala:253)
  at 
org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToTimestamp$1$$anonfun$apply$24.apply(Cast.scala:253)
  at 
org.apache.spark.sql.catalyst.expressions.Cast.org$apache$spark$sql$catalyst$expressions$Cast$$buildCast(Cast.scala:201)
  at 
org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToTimestamp$1.apply(Cast.scala:253)
  at org.apache.spark.sql.catalyst.expressions.Cast.nullSafeEval(Cast.scala:533)
  at 
org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:327)
  at 
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$5$$anonfun$fillPartitionKeys$1$1.apply(TableReader.scala:230)
  at 
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$5$$anonfun$fillPartitionKeys$1$1.apply(TableReader.scala:228)
{code}


  was:
Spark v2.2.0 introduces TimeZoneAwareExpression, which causes bugs if we 
select data from a table with timestamp partitions.
The steps to reproduce it:

{code:java}
spark.sql("create table test (foo string) partitioned by (ts timestamp)")
spark.sql("insert into table test partition(ts = 1) values('hi')")
spark.table("test").show()
{code}

The root cause is that TableReader.scala#230 tries to cast the string to a 
timestamp regardless of whether the timeZone exists.

Here is the error stack trace

{code:scala}
java.util.NoSuchElementException: None.get
  at scala.None$.get(Option.scala:347)
  at scala.None$.get(Option.scala:345)
  at 
org.apache.spark.sql.catalyst.expressions.TimeZoneAwareExpression$class.timeZone(datetimeExpressions.scala:46)
  at 
org.apache.spark.sql.catalyst.expressions.Cast.timeZone$lzycompute(Cast.scala:172)

 at 
org.apache.spark.sql.catalyst.expressions.Cast.timeZone(Cast.scala:172)
  at 
org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToTimestamp$1$$anonfun$apply$24.apply(Cast.scala:253)
  at 
org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToTimestamp$1$$anonfun$apply$24.apply(Cast.scala:253)
  at 
org.apache.spark.sql.catalyst.expressions.Cast.org$apache$spark$sql$catalyst$expressions$Cast$$buildCast(Cast.scala:201)
  at 
org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToTimestamp$1.apply(Cast.scala:253)
  at org.apache.spark.sql.catalyst.expressions.Cast.nullSafeEval(Cast.scala:533)
  at 
org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:327)
  at 
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$5$$anonfun$fillPartitionKeys$1$1.apply(TableReader.scala:230)
  at 
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$5$$anonfun$fillPartitionKeys$1$1.apply(TableReader.scala:228)
{code}



> timestamp partition would fail in v2.2.0
> 
>
> Key: SPARK-21739
> URL: https://issues.apache.org/jira/browse/SPARK-21739
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: wangzhihao
>
> Spark v2.2.0 introduces TimeZoneAwareExpression, which causes bugs if we 
> select data from a table with timestamp partitions.
> The steps to reproduce it:
> {code:java}
> spark.sql("create table test (foo string) partitioned by (ts timestamp)")
> spark.sql("insert into table test partition(ts = 1) values('hi')")
> spark.table("test").show()
> {code}
> The root cause is that TableReader.scala#230 tries to cast the string to a 
> timestamp regardless of whether the timeZone exists.
> Here is the error stack trace
> {code: none}
> java.util.NoSuchElementException: None.get
>   at scala.None$.get(Option.scala:347)
>   at scala.None$.get(Option

[jira] [Updated] (SPARK-21739) timestamp partition would fail in v2.2.0

2017-08-15 Thread wangzhihao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangzhihao updated SPARK-21739:
---
Description: 
Spark v2.2.0 introduces TimeZoneAwareExpression, which causes bugs if we 
select data from a table with timestamp partitions.
The steps to reproduce it:

{code:java}
spark.sql("create table test (foo string) partitioned by (ts timestamp)")
spark.sql("insert into table test partition(ts = 1) values('hi')")
spark.table("test").show()
{code}

The root cause is that TableReader.scala#230 tries to cast the string to a 
timestamp regardless of whether the timeZone exists.

Here is the error stack trace

{code}
java.util.NoSuchElementException: None.get
  at scala.None$.get(Option.scala:347)
  at scala.None$.get(Option.scala:345)
  at 
org.apache.spark.sql.catalyst.expressions.TimeZoneAwareExpression$class.timeZone(datetimeExpressions.scala:46)
  at 
org.apache.spark.sql.catalyst.expressions.Cast.timeZone$lzycompute(Cast.scala:172)

 at 
org.apache.spark.sql.catalyst.expressions.Cast.timeZone(Cast.scala:172)
  at 
org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToTimestamp$1$$anonfun$apply$24.apply(Cast.scala:253)
  at 
org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToTimestamp$1$$anonfun$apply$24.apply(Cast.scala:253)
  at 
org.apache.spark.sql.catalyst.expressions.Cast.org$apache$spark$sql$catalyst$expressions$Cast$$buildCast(Cast.scala:201)
  at 
org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToTimestamp$1.apply(Cast.scala:253)
  at org.apache.spark.sql.catalyst.expressions.Cast.nullSafeEval(Cast.scala:533)
  at 
org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:327)
  at 
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$5$$anonfun$fillPartitionKeys$1$1.apply(TableReader.scala:230)
  at 
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$5$$anonfun$fillPartitionKeys$1$1.apply(TableReader.scala:228)
{code}


  was:
Spark v2.2.0 introduces TimeZoneAwareExpression, which causes bugs if we 
select data from a table with timestamp partitions.
The steps to reproduce it:

{code:java}
spark.sql("create table test (foo string) partitioned by (ts timestamp)")
spark.sql("insert into table test partition(ts = 1) values('hi')")
spark.table("test").show()
{code}

The root cause is that TableReader.scala#230 tries to cast the string to a 
timestamp regardless of whether the timeZone exists.

Here is the error stack trace

{code: none}
java.util.NoSuchElementException: None.get
  at scala.None$.get(Option.scala:347)
  at scala.None$.get(Option.scala:345)
  at 
org.apache.spark.sql.catalyst.expressions.TimeZoneAwareExpression$class.timeZone(datetimeExpressions.scala:46)
  at 
org.apache.spark.sql.catalyst.expressions.Cast.timeZone$lzycompute(Cast.scala:172)

 at 
org.apache.spark.sql.catalyst.expressions.Cast.timeZone(Cast.scala:172)
  at 
org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToTimestamp$1$$anonfun$apply$24.apply(Cast.scala:253)
  at 
org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToTimestamp$1$$anonfun$apply$24.apply(Cast.scala:253)
  at 
org.apache.spark.sql.catalyst.expressions.Cast.org$apache$spark$sql$catalyst$expressions$Cast$$buildCast(Cast.scala:201)
  at 
org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToTimestamp$1.apply(Cast.scala:253)
  at org.apache.spark.sql.catalyst.expressions.Cast.nullSafeEval(Cast.scala:533)
  at 
org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:327)
  at 
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$5$$anonfun$fillPartitionKeys$1$1.apply(TableReader.scala:230)
  at 
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$5$$anonfun$fillPartitionKeys$1$1.apply(TableReader.scala:228)
{code}



> timestamp partition would fail in v2.2.0
> 
>
> Key: SPARK-21739
> URL: https://issues.apache.org/jira/browse/SPARK-21739
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: wangzhihao
>
> Spark v2.2.0 introduces TimeZoneAwareExpression, which causes bugs if we 
> select data from a table with timestamp partitions.
> The steps to reproduce it:
> {code:java}
> spark.sql("create table test (foo string) partitioned by (ts timestamp)")
> spark.sql("insert into table test partition(ts = 1) values('hi')")
> spark.table("test").show()
> {code}
> The root cause is that TableReader.scala#230 tries to cast the string to a 
> timestamp regardless of whether the timeZone exists.
> Here is the error stack trace
> {code}
> java.util.NoSuchElementException: None.get
>   at scala.None$.get(Option.scala:347)
>   at scala.None$.get(Option.scala:345)

[jira] [Commented] (SPARK-21034) Allow filter pushdown filters through non deterministic functions for columns involved in groupby / join

2017-08-15 Thread Abhijit Bhole (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16127454#comment-16127454
 ] 

Abhijit Bhole commented on SPARK-21034:
---

Thank you !!


> Allow filter pushdown filters through non deterministic functions for columns 
> involved in groupby / join
> 
>
> Key: SPARK-21034
> URL: https://issues.apache.org/jira/browse/SPARK-21034
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Abhijit Bhole
>
> If the column is involved in the aggregation / join, then pushing down the 
> filter should not change the results.
> Here is a code sample: 
> {code:java}
> from pyspark.sql import functions as F
> df = spark.createDataFrame([{ "a": 1, "b" : 2, "c":7}, { "a": 3, "b" : 4, "c" 
> : 8},
>{ "a": 1, "b" : 5, "c":7}, { "a": 1, "b" : 6, 
> "c":7} ])
> df.groupBy(["a"]).agg(F.sum("b")).where("a = 1").explain()
> df.groupBy(["a"]).agg(F.sum("b"), F.first("c")).where("a = 1").explain()
> == Physical Plan ==
> *HashAggregate(keys=[a#15L], functions=[sum(b#16L)])
> +- Exchange hashpartitioning(a#15L, 4)
>+- *HashAggregate(keys=[a#15L], functions=[partial_sum(b#16L)])
>   +- *Project [a#15L, b#16L]
>  +- *Filter (isnotnull(a#15L) && (a#15L = 1))
> +- Scan ExistingRDD[a#15L,b#16L,c#17L]
> >>>
> >>> df.groupBy(["a"]).agg(F.sum("b"), F.first("c")).where("a = 1").explain()
> == Physical Plan ==
> *Filter (isnotnull(a#15L) && (a#15L = 1))
> +- *HashAggregate(keys=[a#15L], functions=[sum(b#16L), first(c#17L, false)])
>+- Exchange hashpartitioning(a#15L, 4)
>   +- *HashAggregate(keys=[a#15L], functions=[partial_sum(b#16L), 
> partial_first(c#17L, false)])
>  +- Scan ExistingRDD[a#15L,b#16L,c#17L]
> {code}
> As you can see, the filter is not pushed down when F.first aggregate function 
> is used.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21739) timestamp partition would fail in v2.2.0

2017-08-15 Thread wangzhihao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangzhihao updated SPARK-21739:
---
Description: 
Spark v2.2.0 introduces TimeZoneAwareExpression, which causes bugs if we 
select data from a table with timestamp partitions.
The steps to reproduce it:

{code:java}
spark.sql("create table test (foo string) partitioned by (ts timestamp)")
spark.sql("insert into table test partition(ts = 1) values('hi')")
spark.table("test").show()
{code}

The root cause is that TableReader.scala#230 tries to cast the string to a 
timestamp regardless of whether the timeZone exists.

Here is the error stack trace

{code:scala}
java.util.NoSuchElementException: None.get
  at scala.None$.get(Option.scala:347)
  at scala.None$.get(Option.scala:345)
  at 
org.apache.spark.sql.catalyst.expressions.TimeZoneAwareExpression$class.timeZone(datetimeExpressions.scala:46)
  at 
org.apache.spark.sql.catalyst.expressions.Cast.timeZone$lzycompute(Cast.scala:172)

 at 
org.apache.spark.sql.catalyst.expressions.Cast.timeZone(Cast.scala:172)
  at 
org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToTimestamp$1$$anonfun$apply$24.apply(Cast.scala:253)
  at 
org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToTimestamp$1$$anonfun$apply$24.apply(Cast.scala:253)
  at 
org.apache.spark.sql.catalyst.expressions.Cast.org$apache$spark$sql$catalyst$expressions$Cast$$buildCast(Cast.scala:201)
  at 
org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToTimestamp$1.apply(Cast.scala:253)
  at org.apache.spark.sql.catalyst.expressions.Cast.nullSafeEval(Cast.scala:533)
  at 
org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:327)
  at 
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$5$$anonfun$fillPartitionKeys$1$1.apply(TableReader.scala:230)
  at 
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$5$$anonfun$fillPartitionKeys$1$1.apply(TableReader.scala:228)
{code}


  was:
Spark v2.2.0 introduces TimeZoneAwareExpression, which causes bugs if we 
select data from a table with timestamp partitions.
The steps to reproduce it:

{code:java}
spark.sql("create table test (foo string) partitioned by (ts timestamp)")
spark.sql("insert into table test partition(ts = 1) values('hi')")
spark.table("test").show()
{code}


The error stack trace is attached. 

The root cause is that TableReader.scala#230 tries to cast the string to a 
timestamp regardless of whether the timeZone exists.


> timestamp partition would fail in v2.2.0
> 
>
> Key: SPARK-21739
> URL: https://issues.apache.org/jira/browse/SPARK-21739
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: wangzhihao
>
> Spark v2.2.0 introduces TimeZoneAwareExpression, which causes bugs if we 
> select data from a table with timestamp partitions.
> The steps to reproduce it:
> {code:java}
> spark.sql("create table test (foo string) partitioned by (ts timestamp)")
> spark.sql("insert into table test partition(ts = 1) values('hi')")
> spark.table("test").show()
> {code}
> The root cause is that TableReader.scala#230 tries to cast the string to a 
> timestamp regardless of whether the timeZone exists.
> Here is the error stack trace
> {code:scala}
> java.util.NoSuchElementException: None.get
>   at scala.None$.get(Option.scala:347)
>   at scala.None$.get(Option.scala:345)
>   at 
> org.apache.spark.sql.catalyst.expressions.TimeZoneAwareExpression$class.timeZone(datetimeExpressions.scala:46)
>   at 
> org.apache.spark.sql.catalyst.expressions.Cast.timeZone$lzycompute(Cast.scala:172)
>   
>at 
> org.apache.spark.sql.catalyst.expressions.Cast.timeZone(Cast.scala:172)
>   at 
> org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToTimestamp$1$$anonfun$apply$24.apply(Cast.scala:253)
>   at 
> org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToTimestamp$1$$anonfun$apply$24.apply(Cast.scala:253)
>   at 
> org.apache.spark.sql.catalyst.expressions.Cast.org$apache$spark$sql$catalyst$expressions$Cast$$buildCast(Cast.scala:201)
>   at 
> org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToTimestamp$1.apply(Cast.scala:253)
>   at 
> org.apache.spark.sql.catalyst.expressions.Cast.nullSafeEval(Cast.scala:533)
>   at 
> org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:327)
>   at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$5$$anonfun$fillPartitionKeys$1$1.apply(TableReader.scala:230)
>   at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$5$$anonfun$fillPartitionKeys$1$1.apply(TableReader.scala:228)
> {code}
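
The stack trace points at the {{Cast}} built for partition values being evaluated 
without a time zone. A sketch of the shape of the change (not the actual patch; the 
real fix belongs in TableReader and should use the session-local time zone):

{code:scala}
import org.apache.spark.sql.catalyst.expressions.{Cast, Literal}
import org.apache.spark.sql.types.TimestampType

// Supplying a timeZoneId lets the cast of a partition string to timestamp be
// evaluated instead of failing with None.get.
val tz = Some("UTC")   // stand-in for the session-local time zone
val partitionCast = Cast(Literal("1"), TimestampType, timeZoneId = tz)
{code}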



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-

[jira] [Updated] (SPARK-21739) timestamp partition would fail in v2.2.0

2017-08-15 Thread wangzhihao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangzhihao updated SPARK-21739:
---
Description: 
Spark v2.2.0 introduces TimeZoneAwareExpression, which causes bugs when we 
select data from a table with timestamp partitions.
The steps to reproduce it:

{code:java}
spark.sql("create table test (foo string) partitioned by (ts timestamp)")
spark.sql("insert into table test partition(ts = 1) values('hi')")
spark.table("test").show()
{code}


The error stack trace is attached. 

The root cause is that TableReader.scala#230 tries to cast the string to a 
timestamp regardless of whether the timeZone exists.

  was:
Spark v2.2.0 introduces TimeZoneAwareExpression, which causes bugs when we 
select data from a table with timestamp partitions.
The steps to reproduce it:

{code:java}
spark.sql("create table test (foo string) partitioned by (ts timestamp)")
spark.sql("insert into table test partition(ts = 1) values('hi')")
spark.table("test").show()
{code}


The error stack trace is attached. 

The root cause is that 
[TableReader.scala#230](https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala#L230)
 tries to cast the string to a timestamp regardless of whether the timeZone exists.


> timestamp partition would fail in v2.2.0
> 
>
> Key: SPARK-21739
> URL: https://issues.apache.org/jira/browse/SPARK-21739
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: wangzhihao
>
> Spark v2.2.0 introduces TimeZoneAwareExpression, which causes bugs when we 
> select data from a table with timestamp partitions.
> The steps to reproduce it:
> {code:java}
> spark.sql("create table test (foo string) partitioned by (ts timestamp)")
> spark.sql("insert into table test partition(ts = 1) values('hi')")
> spark.table("test").show()
> {code}
> The error stack trace is attached. 
> The root cause is that TableReader.scala#230 tries to cast the string to a 
> timestamp regardless of whether the timeZone exists.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21739) timestamp partition would fail in v2.2.0

2017-08-15 Thread wangzhihao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangzhihao updated SPARK-21739:
---
Description: 
Spark v2.2.0 introduces TimeZoneAwareExpression, which causes bugs when we 
select data from a table with timestamp partitions.
The steps to reproduce it:

{code:java}
spark.sql("create table test (foo string) partitioned by (ts timestamp)")
spark.sql("insert into table test partition(ts = 1) values('hi')")
spark.table("test").show()
{code}


The error stack trace is attached. 

The root cause is that 
[TableReader.scala#230](https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala#L230)
 tries to cast the string to a timestamp regardless of whether the timeZone exists.

  was:
Spark v2.2.0 introduces TimeZoneAwareExpression, which causes bugs when we 
select data from a table with timestamp partitions.
The steps to reproduce it:

{code:java}
spark.sql("create table test (foo string) partitioned by (ts timestamp)")
spark.sql("insert into table test partition(ts = 1) values('hi')")
spark.table("test").show()
{code}


The error stack trace is attached. 

The root cause is that TableReader.scala#230 tries to cast the string to a 
timestamp regardless of whether the timeZone exists.


> timestamp partition would fail in v2.2.0
> 
>
> Key: SPARK-21739
> URL: https://issues.apache.org/jira/browse/SPARK-21739
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: wangzhihao
>
> Spark v2.2.0 introduces TimeZoneAwareExpression, which causes bugs when we 
> select data from a table with timestamp partitions.
> The steps to reproduce it:
> {code:java}
> spark.sql("create table test (foo string) partitioned by (ts timestamp)")
> spark.sql("insert into table test partition(ts = 1) values('hi')")
> spark.table("test").show()
> {code}
> The error stack trace is attached. 
> The root cause is that 
> [TableReader.scala#230](https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala#L230)
>  tries to cast the string to a timestamp regardless of whether the timeZone exists.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21739) timestamp partition would fail in v2.2.0

2017-08-15 Thread wangzhihao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangzhihao updated SPARK-21739:
---
Description: 
Spark v2.2.0 introduces TimeZoneAwareExpression, which causes bugs when we 
select data from a table with timestamp partitions.
The steps to reproduce it:

{code:java}
spark.sql("create table test (foo string) partitioned by (ts timestamp)")
spark.sql("insert into table test partition(ts = 1) values('hi')")
spark.table("test").show()
{code}


The error stack trace is attached. 

The root cause is that TableReader.scala#230 tries to cast the string to a 
timestamp regardless of whether the timeZone exists.

  was:
Spark v2.2.0 introduces TimeZoneAwareExpression, which causes bugs when we 
select data from a table with timestamp partitions.
The steps to reproduce it:

{code:java}
spark.sql("create table test (foo string) partitioned by (ts timestamp)")
spark.sql("insert into table test partition(ts = 1) values('hi')")
spark.table("test").show()
{code}


The error stack trace is attached. 

The root cause is that TableReader.scala#230 tries to cast the string to a 
timestamp regardless of whether the timeZone exists.


> timestamp partition would fail in v2.2.0
> 
>
> Key: SPARK-21739
> URL: https://issues.apache.org/jira/browse/SPARK-21739
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: wangzhihao
>
> Spark v2.2.0 introduces TimeZoneAwareExpression, which causes bugs when we 
> select data from a table with timestamp partitions.
> The steps to reproduce it:
> {code:java}
> spark.sql("create table test (foo string) partitioned by (ts timestamp)")
> spark.sql("insert into table test partition(ts = 1) values('hi')")
> spark.table("test").show()
> {code}
> The error stack trace is attached. 
> The root cause is that TableReader.scala#230 tries to cast the string to a 
> timestamp regardless of whether the timeZone exists.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21739) timestamp partition would fail in v2.2.0

2017-08-15 Thread wangzhihao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangzhihao updated SPARK-21739:
---
Description: 
Spark v2.2.0 introduces TimeZoneAwareExpression, which causes bugs when we 
select data from a table with timestamp partitions.
The steps to reproduce it:

{code:java}
spark.sql("create table test (foo string) partitioned by (ts timestamp)")
spark.sql("insert into table test partition(ts = 1) values('hi')")
spark.table("test").show()
{code}


The error stack trace is attached. 

The root cause is that TableReader.scala#230 tries to cast the string to a 
timestamp regardless of whether the timeZone exists.

  was:
Spark v2.2.0 introduces TimeZoneAwareExpression, which causes bugs when we 
select data from a table with timestamp partitions.
The steps to reproduce it:

{code:scala}
spark.sql("create table test (foo string) partitioned by (ts timestamp)")
spark.sql("insert into table test partition(ts = 1) values('hi')")
spark.table("test").show()
{code}


The error stack trace is attached. 

The root cause is that TableReader.scala#230 tries to cast the string to a 
timestamp regardless of whether the timeZone exists.


> timestamp partition would fail in v2.2.0
> 
>
> Key: SPARK-21739
> URL: https://issues.apache.org/jira/browse/SPARK-21739
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: wangzhihao
>
> Spark v2.2.0 introduces TimeZoneAwareExpression, which causes bugs when we 
> select data from a table with timestamp partitions.
> The steps to reproduce it:
> {code:java}
> spark.sql("create table test (foo string) partitioned by (ts timestamp)")
> spark.sql("insert into table test partition(ts = 1) values('hi')")
> spark.table("test").show()
> {code}
> The error stack trace is attached. 
> The root cause is that TableReader.scala#230 tries to cast the string to a 
> timestamp regardless of whether the timeZone exists.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21739) timestamp partition would fail in v2.2.0

2017-08-15 Thread wangzhihao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangzhihao updated SPARK-21739:
---
Description: 
Spark v2.2.0 introduces TimeZoneAwareExpression, which causes bugs when we 
select data from a table with timestamp partitions.
The steps to reproduce it:

{code:scala}
spark.sql("create table test (foo string) partitioned by (ts timestamp)")
spark.sql("insert into table test partition(ts = 1) values('hi')")
spark.table("test").show()
{code}


The error stack trace is attached. 

The root cause is that TableReader.scala#230 tries to cast the string to a 
timestamp regardless of whether the timeZone exists.

  was:
Spark v2.2.0 introduces TimeZoneAwareExpression, which causes bugs when we 
select data from a table with timestamp partitions.
The steps to reproduce it:
bq. 
bq. spark.sql("create table test (foo string) partitioned BY (ts timestamp)")
bq. spark.sql("insert into table test partition(ts = 1) values('hi')")
bq. spark.table("test").show()

The error stack trace is attached. 

The root cause is that TableReader.scala#230 tries to cast the string to a 
timestamp regardless of whether the timeZone exists.


> timestamp partition would fail in v2.2.0
> 
>
> Key: SPARK-21739
> URL: https://issues.apache.org/jira/browse/SPARK-21739
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: wangzhihao
>
> Spark v2.2.0 introduces TimeZoneAwareExpression, which causes bugs when we 
> select data from a table with timestamp partitions.
> The steps to reproduce it:
> {code:scala}
> spark.sql("create table test (foo string) partitioned by (ts timestamp)")
> spark.sql("insert into table test partition(ts = 1) values('hi')")
> spark.table("test").show()
> {code}
> The error stack trace is attached. 
> The root cause is that TableReader.scala#230 tries to cast the string to a 
> timestamp regardless of whether the timeZone exists.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21739) timestamp partition would fail in v2.2.0

2017-08-15 Thread wangzhihao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangzhihao updated SPARK-21739:
---
Docs Text:   (was: Spark v2.2.0 introduces TimeZoneAwareExpression, 
which causes bugs when we select data from a table with timestamp partitions.
The steps to reproduce it:
bq. 
bq. spark.sql("create table test (foo string) partitioned BY (ts timestamp)")
bq. spark.sql("insert into table test partition(ts = 1) values('hi')")
bq. spark.table("test").show()

The error stack trace is attached. 

The root cause is that TableReader.scala#230 tries to cast the string to a 
timestamp regardless of whether the timeZone exists.)

> timestamp partition would fail in v2.2.0
> 
>
> Key: SPARK-21739
> URL: https://issues.apache.org/jira/browse/SPARK-21739
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: wangzhihao
>
> Spark v2.2.0 introduces TimeZoneAwareExpression, which causes bugs when we 
> select data from a table with timestamp partitions.
> The steps to reproduce it:
> bq. 
> bq. spark.sql("create table test (foo string) partitioned BY (ts timestamp)")
> bq. spark.sql("insert into table test partition(ts = 1) values('hi')")
> bq. spark.table("test").show()
> The error stack trace is attached. 
> The root cause is that TableReader.scala#230 tries to cast the string to a 
> timestamp regardless of whether the timeZone exists.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21739) timestamp partition would fail in v2.2.0

2017-08-15 Thread wangzhihao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangzhihao updated SPARK-21739:
---
Docs Text: 
Spark v2.2.0 introduces TimeZoneAwareExpression, which causes bugs when we 
select data from a table with timestamp partitions.
The steps to reproduce it:
bq. 
bq. spark.sql("create table test (foo string) partitioned BY (ts timestamp)")
bq. spark.sql("insert into table test partition(ts = 1) values('hi')")
bq. spark.table("test").show()

The error stack trace is attached. 

The root cause is that TableReader.scala#230 tries to cast the string to a 
timestamp regardless of whether the timeZone exists.

  was:
java.util.NoSuchElementException: None.get
  at scala.None$.get(Option.scala:347)
  at scala.None$.get(Option.scala:345)
  at 
org.apache.spark.sql.catalyst.expressions.TimeZoneAwareExpression$class.timeZone(datetimeExpressions.scala:46)
  at 
org.apache.spark.sql.catalyst.expressions.Cast.timeZone$lzycompute(Cast.scala:172)

 at 
org.apache.spark.sql.catalyst.expressions.Cast.timeZone(Cast.scala:172)
  at 
org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToTimestamp$1$$anonfun$apply$24.apply(Cast.scala:253)
  at 
org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToTimestamp$1$$anonfun$apply$24.apply(Cast.scala:253)
  at 
org.apache.spark.sql.catalyst.expressions.Cast.org$apache$spark$sql$catalyst$expressions$Cast$$buildCast(Cast.scala:201)
  at 
org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToTimestamp$1.apply(Cast.scala:253)
  at org.apache.spark.sql.catalyst.expressions.Cast.nullSafeEval(Cast.scala:533)
  at 
org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:327)
  at 
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$5$$anonfun$fillPartitionKeys$1$1.apply(TableReader.scala:230)
  at 
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$5$$anonfun$fillPartitionKeys$1$1.apply(TableReader.scala:228)
  at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at 
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$5.fillPartitionKeys$1(TableReader.scala:228)
  at 
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$5.apply(TableReader.scala:235)
  at 
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$5.apply(TableReader.scala:196)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:221)
  at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:428)
  at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:428)
  at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:428)
  at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:428)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at scala.collection.AbstractTraversable.map(Traversable.scala:104)
  at 
org.apache.spark.sql.hive.HadoopTableReader.makeRDDForPartitionedTable(TableReader.scala:196)
  at 
org.apache.spark.sql.hive.HadoopTableReader.makeRDDForPartitionedTable(TableReader.scala:141)
  at 
org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$11.apply(HiveTableScanExec.scala:188)
  at 
org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$11.apply(HiveTableScanExec.scala:188)
  at org.apache.spark.util.Utils$.withDummyCallSite(Utils.scala:2478)
  at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
  at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
  at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
  at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
  at 
org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:228)
  at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:311)
  at 
org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
  at 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:2853)
  at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2153)
  at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2153)
  at org.apache.spark.sql.Dataset$$anonfun$55.apply(Dataset.scala:2837)
  at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2836)
  at org.apache.spark.sq

[jira] [Created] (SPARK-21739) timestamp partition would fail in v2.2.0

2017-08-15 Thread wangzhihao (JIRA)
wangzhihao created SPARK-21739:
--

 Summary: timestamp partition would fail in v2.2.0
 Key: SPARK-21739
 URL: https://issues.apache.org/jira/browse/SPARK-21739
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: wangzhihao


Spark v2.2.0 introduces TimeZoneAwareExpression, which causes bugs when we 
select data from a table with timestamp partitions.
The steps to reproduce it:
bq. 
bq. spark.sql("create table test (foo string) partitioned BY (ts timestamp)")
bq. spark.sql("insert into table test partition(ts = 1) values('hi')")
bq. spark.table("test").show()

The error stack trace is attached. 

The root cause is that TableReader.scala#230 tries to cast the string to a 
timestamp regardless of whether the timeZone exists.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21721) Memory leak in org.apache.spark.sql.hive.execution.InsertIntoHiveTable

2017-08-15 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-21721.
-
   Resolution: Fixed
 Assignee: Liang-Chi Hsieh
Fix Version/s: 2.3.0
   2.2.1
   2.1.2

> Memory leak in org.apache.spark.sql.hive.execution.InsertIntoHiveTable
> --
>
> Key: SPARK-21721
> URL: https://issues.apache.org/jira/browse/SPARK-21721
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.1, 2.2.0
>Reporter: yzheng616
>Assignee: Liang-Chi Hsieh
>Priority: Critical
> Fix For: 2.1.2, 2.2.1, 2.3.0
>
>
> The leak comes from org.apache.spark.sql.hive.execution.InsertIntoHiveTable. 
> At line 118 it puts a staging path into the FileSystem deleteOnExit cache, and 
> then removes the path from disk at line 385, but it never removes the path from 
> the FileSystem cache. If a streaming application keeps persisting data to a 
> partitioned Hive table, the memory keeps increasing until the JVM terminates.
> Below is a simple piece of code to reproduce it.
> {code:java}
> package test
> import org.apache.spark.sql.SparkSession
> import org.apache.hadoop.fs.Path
> import org.apache.hadoop.fs.FileSystem
> import org.apache.spark.sql.SaveMode
> import java.lang.reflect.Field
> case class PathLeakTest(id: Int, gp: String)
> object StagePathLeak {
>   def main(args: Array[String]): Unit = {
> val spark = 
> SparkSession.builder().master("local[4]").appName("StagePathLeak").enableHiveSupport().getOrCreate()
> spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")
> //create a partitioned table
> spark.sql("drop table if exists path_leak");
> spark.sql("create table if not exists path_leak(id int)" +
> " partitioned by (gp String)"+
>   " row format serde 
> 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'"+
>   " stored as"+
> " inputformat 
> 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'"+
> " outputformat 
> 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'")
> var seq = new scala.collection.mutable.ArrayBuffer[PathLeakTest]()
> // 2 partitions
> for (x <- 1 to 2) {
>   seq += (new PathLeakTest(x, "g" + x))
> }
> val rdd = spark.sparkContext.makeRDD[PathLeakTest](seq)
> //insert 50 records to Hive table
> for (j <- 1 to 50) {
>   val df = spark.createDataFrame(rdd)
>   //#1 InsertIntoHiveTable line 118:  adds the staging path to the FileSystem 
> deleteOnExit cache
>   //#2 InsertIntoHiveTable line 385:  deletes the path from disk but not 
> from the FileSystem cache, and this causes the leak
>   df.write.mode(SaveMode.Overwrite).insertInto("path_leak")  
> }
> 
> val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
> val deleteOnExit = getDeleteOnExit(fs.getClass)
> deleteOnExit.setAccessible(true)
> val caches = deleteOnExit.get(fs).asInstanceOf[java.util.TreeSet[Path]]
> //check FileSystem deleteOnExit cache size
> println(caches.size())
> val it = caches.iterator()
> //all staging paths are still cached even though they have already been 
> deleted from the disk
> while(it.hasNext()){
>   println(it.next());
> }
>   }
>   
>   def getDeleteOnExit(cls: Class[_]) : Field = {
> try{
>return cls.getDeclaredField("deleteOnExit")
> }catch{
>   case ex: NoSuchFieldException => return 
> getDeleteOnExit(cls.getSuperclass)
> }
> return null
>   }
> }
> {code}
>  
>  
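A minimal sketch of the kind of fix this suggests, assuming the staging path only needs to be dropped from the deleteOnExit cache once it has been deleted; cancelDeleteOnExit is the public Hadoop FileSystem API for that, and the helper name below is made up for illustration (this is not the committed Spark patch):

{code:scala}
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical helper: delete a staging directory and also evict it from
// FileSystem's deleteOnExit TreeSet so the cache cannot grow forever.
def cleanupStagingDir(fs: FileSystem, stagingDir: Path): Unit = {
  if (fs.exists(stagingDir)) {
    fs.delete(stagingDir, true) // recursive delete, mirrors what line 385 does
  }
  // Without this call the Path object stays cached until the JVM exits.
  fs.cancelDeleteOnExit(stagingDir)
}
{code}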



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21738) Thriftserver doesn't cancel jobs when session is closed

2017-08-15 Thread Marco Gaido (JIRA)
Marco Gaido created SPARK-21738:
---

 Summary: Thriftserver doesn't cancel jobs when session is closed
 Key: SPARK-21738
 URL: https://issues.apache.org/jira/browse/SPARK-21738
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Marco Gaido


When a session is closed, the jobs launched by that session should be killed in 
order to avoid wasting resources. Currently, this doesn't happen.

So at the moment, if a user launches a query and then closes the connection, 
the query keeps running until completion. This behavior should be changed.
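A rough sketch of the idea, assuming each session's jobs are tagged with the session id as their job group; the method names on SparkContext are real, but the helper functions and the session-id convention are illustrative, not the Thriftserver's actual code:

{code:scala}
import org.apache.spark.SparkContext

// Illustrative only: tag every statement's jobs with the session id, then
// cancel that group when the session closes.
def beforeStatement(sc: SparkContext, sessionId: String): Unit = {
  sc.setJobGroup(sessionId, s"Thriftserver session $sessionId", interruptOnCancel = true)
}

def onSessionClosed(sc: SparkContext, sessionId: String): Unit = {
  // Cancels all running jobs that were started under this session's job group.
  sc.cancelJobGroup(sessionId)
}
{code}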



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19019) PySpark does not work with Python 3.6.0

2017-08-15 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16127339#comment-16127339
 ] 

Hyukjin Kwon commented on SPARK-19019:
--

I think this was backported into Spark 2.1.1. Was your Spark version 2.1.1+?

> PySpark does not work with Python 3.6.0
> ---
>
> Key: SPARK-19019
> URL: https://issues.apache.org/jira/browse/SPARK-19019
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Critical
> Fix For: 1.6.4, 2.0.3, 2.1.1, 2.2.0
>
>
> Currently, PySpark does not work with Python 3.6.0.
> Running {{./bin/pyspark}} simply throws the error as below:
> {code}
> Traceback (most recent call last):
>   File ".../spark/python/pyspark/shell.py", line 30, in 
> import pyspark
>   File ".../spark/python/pyspark/__init__.py", line 46, in 
> from pyspark.context import SparkContext
>   File ".../spark/python/pyspark/context.py", line 36, in 
> from pyspark.java_gateway import launch_gateway
>   File ".../spark/python/pyspark/java_gateway.py", line 31, in 
> from py4j.java_gateway import java_import, JavaGateway, GatewayClient
>   File "", line 961, in _find_and_load
>   File "", line 950, in _find_and_load_unlocked
>   File "", line 646, in _load_unlocked
>   File "", line 616, in _load_backward_compatible
>   File ".../spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 
> 18, in 
>   File 
> "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pydoc.py",
>  line 62, in 
> import pkgutil
>   File 
> "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pkgutil.py",
>  line 22, in 
> ModuleInfo = namedtuple('ModuleInfo', 'module_finder name ispkg')
>   File ".../spark/python/pyspark/serializers.py", line 394, in namedtuple
> cls = _old_namedtuple(*args, **kwargs)
> TypeError: namedtuple() missing 3 required keyword-only arguments: 'verbose', 
> 'rename', and 'module'
> {code}
> The problem is in 
> https://github.com/apache/spark/blob/3c68944b229aaaeeaee3efcbae3e3be9a2914855/python/pyspark/serializers.py#L386-L394
>  as the error says, and the cause seems to be that the arguments of 
> {{namedtuple}} are now completely keyword-only arguments as of Python 3.6.0 
> (see https://bugs.python.org/issue25628).
> We currently copy this function via {{types.FunctionType}}, which does not set 
> the default values of keyword-only arguments (meaning 
> {{namedtuple.__kwdefaults__}}), and this seems to cause missing values in the 
> function (non-bound arguments).
> This ends up as below:
> {code}
> import types
> import collections
> def _copy_func(f):
> return types.FunctionType(f.__code__, f.__globals__, f.__name__,
> f.__defaults__, f.__closure__)
> _old_namedtuple = _copy_func(collections.namedtuple)
> _old_namedtuple("a", "b")
> {code}
> If we call as below:
> {code}
> >>> _old_namedtuple("a", "b")
> Traceback (most recent call last):
>   File "", line 1, in 
> TypeError: namedtuple() missing 3 required keyword-only arguments: 'verbose', 
> 'rename', and 'module'
> {code}
> It throws an exception as above because {{__kwdefaults__}} for the required 
> keyword arguments is unset in the copied function. So, if we give explicit 
> values for these,
> {code}
> >>> _old_namedtuple("a", "b", verbose=False, rename=False, module=None)
> 
> {code}
> It works fine.
> It seems we should now properly set these on the copied function.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19019) PySpark does not work with Python 3.6.0

2017-08-15 Thread Mathias M. Andersen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16127330#comment-16127330
 ] 

Mathias M. Andersen commented on SPARK-19019:
-

Just got this error post fix on spark 2.1:

Traceback (most recent call last):
File "/opt/anaconda3/lib/python3.6/runpy.py", line 183, in 
_run_module_as_main
  mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
File "/opt/anaconda3/lib/python3.6/runpy.py", line 109, in 
_get_module_details
  __import__(pkg_name)
File "/usr/hdp/current/spark-client/python/pyspark/__init__.py", line 41, 
in 
  from pyspark.context import SparkContext
File "/usr/hdp/current/spark-client/python/pyspark/context.py", line 33, in 

  from pyspark.java_gateway import launch_gateway
File "/usr/hdp/current/spark-client/python/pyspark/java_gateway.py", line 
25, in 
  import platform
File "/opt/anaconda3/lib/python3.6/platform.py", line 886, in 
  "system node release version machine processor")
File "/usr/hdp/current/spark-client/python/pyspark/serializers.py", line 
381, in namedtuple
  cls = _old_namedtuple(*args, **kwargs)
  TypeError: namedtuple() missing 3 required keyword-only arguments: 'verbose', 
'rename', and 'module'

> PySpark does not work with Python 3.6.0
> ---
>
> Key: SPARK-19019
> URL: https://issues.apache.org/jira/browse/SPARK-19019
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Critical
> Fix For: 1.6.4, 2.0.3, 2.1.1, 2.2.0
>
>
> Currently, PySpark does not work with Python 3.6.0.
> Running {{./bin/pyspark}} simply throws the error as below:
> {code}
> Traceback (most recent call last):
>   File ".../spark/python/pyspark/shell.py", line 30, in 
> import pyspark
>   File ".../spark/python/pyspark/__init__.py", line 46, in 
> from pyspark.context import SparkContext
>   File ".../spark/python/pyspark/context.py", line 36, in 
> from pyspark.java_gateway import launch_gateway
>   File ".../spark/python/pyspark/java_gateway.py", line 31, in 
> from py4j.java_gateway import java_import, JavaGateway, GatewayClient
>   File "", line 961, in _find_and_load
>   File "", line 950, in _find_and_load_unlocked
>   File "", line 646, in _load_unlocked
>   File "", line 616, in _load_backward_compatible
>   File ".../spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 
> 18, in 
>   File 
> "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pydoc.py",
>  line 62, in 
> import pkgutil
>   File 
> "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pkgutil.py",
>  line 22, in 
> ModuleInfo = namedtuple('ModuleInfo', 'module_finder name ispkg')
>   File ".../spark/python/pyspark/serializers.py", line 394, in namedtuple
> cls = _old_namedtuple(*args, **kwargs)
> TypeError: namedtuple() missing 3 required keyword-only arguments: 'verbose', 
> 'rename', and 'module'
> {code}
> The problem is in 
> https://github.com/apache/spark/blob/3c68944b229aaaeeaee3efcbae3e3be9a2914855/python/pyspark/serializers.py#L386-L394
>  as the error says, and the cause seems to be that the arguments of 
> {{namedtuple}} are now completely keyword-only arguments as of Python 3.6.0 
> (see https://bugs.python.org/issue25628).
> We currently copy this function via {{types.FunctionType}}, which does not set 
> the default values of keyword-only arguments (meaning 
> {{namedtuple.__kwdefaults__}}), and this seems to cause missing values in the 
> function (non-bound arguments).
> This ends up as below:
> {code}
> import types
> import collections
> def _copy_func(f):
> return types.FunctionType(f.__code__, f.__globals__, f.__name__,
> f.__defaults__, f.__closure__)
> _old_namedtuple = _copy_func(collections.namedtuple)
> _old_namedtuple("a", "b")
> {code}
> If we call as below:
> {code}
> >>> _old_namedtuple("a", "b")
> Traceback (most recent call last):
>   File "", line 1, in 
> TypeError: namedtuple() missing 3 required keyword-only arguments: 'verbose', 
> 'rename', and 'module'
> {code}
> It throws an exception as above because {{__kwdefaults__}} for the required 
> keyword arguments is unset in the copied function. So, if we give explicit 
> values for these,
> {code}
> >>> _old_namedtuple("a", "b", verbose=False, rename=False, module=None)
> 
> {code}
> It works fine.
> It seems we should now properly set these on the copied function.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h

[jira] [Comment Edited] (SPARK-19019) PySpark does not work with Python 3.6.0

2017-08-15 Thread Mathias M. Andersen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16127330#comment-16127330
 ] 

Mathias M. Andersen edited comment on SPARK-19019 at 8/15/17 2:51 PM:
--

Just got this error post fix on spark 2.1:


{code:java}
Traceback (most recent call last):
File "/opt/anaconda3/lib/python3.6/runpy.py", line 183, in 
_run_module_as_main
  mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
File "/opt/anaconda3/lib/python3.6/runpy.py", line 109, in 
_get_module_details
  __import__(pkg_name)
File "/usr/hdp/current/spark-client/python/pyspark/__init__.py", line 41, 
in 
  from pyspark.context import SparkContext
File "/usr/hdp/current/spark-client/python/pyspark/context.py", line 33, in 

  from pyspark.java_gateway import launch_gateway
File "/usr/hdp/current/spark-client/python/pyspark/java_gateway.py", line 
25, in 
  import platform
File "/opt/anaconda3/lib/python3.6/platform.py", line 886, in 
  "system node release version machine processor")
File "/usr/hdp/current/spark-client/python/pyspark/serializers.py", line 
381, in namedtuple
  cls = _old_namedtuple(*args, **kwargs)
  TypeError: namedtuple() missing 3 required keyword-only arguments: 'verbose', 
'rename', and 'module'
{code}



was (Author: mrmathias):
Just got this error post fix on spark 2.1:

Traceback (most recent call last):
File "/opt/anaconda3/lib/python3.6/runpy.py", line 183, in 
_run_module_as_main
  mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
File "/opt/anaconda3/lib/python3.6/runpy.py", line 109, in 
_get_module_details
  __import__(pkg_name)
File "/usr/hdp/current/spark-client/python/pyspark/__init__.py", line 41, 
in 
  from pyspark.context import SparkContext
File "/usr/hdp/current/spark-client/python/pyspark/context.py", line 33, in 

  from pyspark.java_gateway import launch_gateway
File "/usr/hdp/current/spark-client/python/pyspark/java_gateway.py", line 
25, in 
  import platform
File "/opt/anaconda3/lib/python3.6/platform.py", line 886, in 
  "system node release version machine processor")
File "/usr/hdp/current/spark-client/python/pyspark/serializers.py", line 
381, in namedtuple
  cls = _old_namedtuple(*args, **kwargs)
  TypeError: namedtuple() missing 3 required keyword-only arguments: 'verbose', 
'rename', and 'module'

> PySpark does not work with Python 3.6.0
> ---
>
> Key: SPARK-19019
> URL: https://issues.apache.org/jira/browse/SPARK-19019
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Critical
> Fix For: 1.6.4, 2.0.3, 2.1.1, 2.2.0
>
>
> Currently, PySpark does not work with Python 3.6.0.
> Running {{./bin/pyspark}} simply throws the error as below:
> {code}
> Traceback (most recent call last):
>   File ".../spark/python/pyspark/shell.py", line 30, in 
> import pyspark
>   File ".../spark/python/pyspark/__init__.py", line 46, in 
> from pyspark.context import SparkContext
>   File ".../spark/python/pyspark/context.py", line 36, in 
> from pyspark.java_gateway import launch_gateway
>   File ".../spark/python/pyspark/java_gateway.py", line 31, in 
> from py4j.java_gateway import java_import, JavaGateway, GatewayClient
>   File "", line 961, in _find_and_load
>   File "", line 950, in _find_and_load_unlocked
>   File "", line 646, in _load_unlocked
>   File "", line 616, in _load_backward_compatible
>   File ".../spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 
> 18, in 
>   File 
> "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pydoc.py",
>  line 62, in 
> import pkgutil
>   File 
> "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pkgutil.py",
>  line 22, in 
> ModuleInfo = namedtuple('ModuleInfo', 'module_finder name ispkg')
>   File ".../spark/python/pyspark/serializers.py", line 394, in namedtuple
> cls = _old_namedtuple(*args, **kwargs)
> TypeError: namedtuple() missing 3 required keyword-only arguments: 'verbose', 
> 'rename', and 'module'
> {code}
> The problem is in 
> https://github.com/apache/spark/blob/3c68944b229aaaeeaee3efcbae3e3be9a2914855/python/pyspark/serializers.py#L386-L394
>  as the error says, and the cause seems to be that the arguments of 
> {{namedtuple}} are now completely keyword-only arguments as of Python 3.6.0 
> (see https://bugs.python.org/issue25628).
> We currently copy this function via {{types.FunctionType}} which does not set 
> the default values of keyword-only arguments (meaning 
> {{namedtuple.__kwdefaults__}}) and this seems causing internally missing 
> valu

[jira] [Created] (SPARK-21737) Create communication channel between arbitrary clients and the Spark AM in YARN mode

2017-08-15 Thread Jong Yoon Lee (JIRA)
Jong Yoon Lee created SPARK-21737:
-

 Summary: Create communication channel between arbitrary clients 
and the Spark AM in YARN mode
 Key: SPARK-21737
 URL: https://issues.apache.org/jira/browse/SPARK-21737
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.1.1
Reporter: Jong Yoon Lee
Priority: Minor
 Fix For: 2.1.1


In this JIRA, I develop code to create a communication channel between 
arbitrary clients and a Spark AM on YARN. This code can be used to send 
commands such as getting status, getting history info from the CLI, 
killing the application, and pushing new tokens.


Design Doc:
https://docs.google.com/document/d/1QMbWhg13ocIoADywZQBRRVj-b9Zf8CnBrruP5JhcOOY/edit?usp=sharing




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21736) Spark 2.2 in Windows does not recognize the URI "file:Z:\SIT1\TreatmentManager\app.properties"

2017-08-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-21736.
---
Resolution: Not A Problem

> Spark 2.2 in Windows does not recognize the URI 
> "file:Z:\SIT1\TreatmentManager\app.properties"
> --
>
> Key: SPARK-21736
> URL: https://issues.apache.org/jira/browse/SPARK-21736
> Project: Spark
>  Issue Type: Bug
>  Components: Windows
>Affects Versions: 2.2.0
> Environment: OS: Windows Server 2012
> JVM: Oracle Java 1.8 (hotspot)
>Reporter: Ross Brigoli
>
> We are currently running Spark 2.1 on Windows Server 2012 in production. 
> While trying to upgrade to Spark 2.2.0, we started getting this error 
> after submitting a job in stand-alone cluster mode:
> Could not load properties; nested exception is java.io.IOException: 
> java.net.URISyntaxException: Illegal character in opaque part at index 7: 
> file:Z:\SIT1\TreatmentManager\app.properties 
> We pass a -D argument to spark-submit with a value of 
> file:Z:\SIT1\TreatmentManager\app.properties
> *These URI -D arguments were working fine in Spark 2.1.0*
> UPDATE: I replaced the \ with / and it works.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21736) Spark 2.2 in Windows does not recognize the URI "file:Z:\SIT1\TreatmentManager\app.properties"

2017-08-15 Thread Ross Brigoli (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ross Brigoli updated SPARK-21736:
-
Description: 
We are currently running Spark 2.1 on Windows Server 2012 in production. While 
trying to upgrade to Spark 2.2.0, we started getting this error after 
submitting a job in stand-alone cluster mode:

Could not load properties; nested exception is java.io.IOException: 
java.net.URISyntaxException: Illegal character in opaque part at index 7: 
file:Z:\SIT1\TreatmentManager\app.properties 

We pass a -D argument to spark-submit with a value of 
file:Z:\SIT1\TreatmentManager\app.properties

*These URI -D arguments were working fine in Spark 2.1.0*

UPDATE: I replaced the \ with / and it works.
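For reference, a minimal sketch of that workaround, assuming the property value is built from the Windows path shown in this report: java.io.File.toURI produces a legal file: URI with forward slashes, so nothing has to be hand-escaped.

{code:scala}
// Build the URI from the Windows path instead of hand-writing "file:Z:\...".
// On Windows this yields "file:/Z:/SIT1/TreatmentManager/app.properties",
// which java.net.URI parses without the "Illegal character in opaque part" error.
val propsUri = new java.io.File("""Z:\SIT1\TreatmentManager\app.properties""").toURI
{code}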

  was:
We are currently running Spark 2.1 on Windows Server 2012 in production. While 
trying to upgrade to Spark 2.2.0, we started getting this error after 
submitting a job in stand-alone cluster mode:

Could not load properties; nested exception is java.io.IOException: 
java.net.URISyntaxException: Illegal character in opaque part at index 7: 
file:Z:\SIT1\TreatmentManager\app.properties 

We pass a -D argument to spark-submit with a value of 
file:Z:\SIT1\TreatmentManager\app.properties

*These URI -D arguments were working fine in Spark 2.1.0*


> Spark 2.2 in Windows does not recognize the URI 
> "file:Z:\SIT1\TreatmentManager\app.properties"
> --
>
> Key: SPARK-21736
> URL: https://issues.apache.org/jira/browse/SPARK-21736
> Project: Spark
>  Issue Type: Bug
>  Components: Windows
>Affects Versions: 2.2.0
> Environment: OS: Windows Server 2012
> JVM: Oracle Java 1.8 (hotspot)
>Reporter: Ross Brigoli
>
> We are currently running Spark 2.1 on Windows Server 2012 in production. 
> While trying to upgrade to Spark 2.2.0, we started getting this error 
> after submitting a job in stand-alone cluster mode:
> Could not load properties; nested exception is java.io.IOException: 
> java.net.URISyntaxException: Illegal character in opaque part at index 7: 
> file:Z:\SIT1\TreatmentManager\app.properties 
> We pass a -D argument to spark-submit with a value of 
> file:Z:\SIT1\TreatmentManager\app.properties
> *These URI -D arguments were working fine in Spark 2.1.0*
> UPDATE: I replaced the \ with / and it works.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21736) Spark 2.2 in Windows does not recognize the URI "file:Z:\SIT1\TreatmentManager\app.properties"

2017-08-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16127309#comment-16127309
 ] 

Sean Owen commented on SPARK-21736:
---

This isn't related to Spark, though. The value and the property file are 
something specific to your app.

> Spark 2.2 in Windows does not recognize the URI 
> "file:Z:\SIT1\TreatmentManager\app.properties"
> --
>
> Key: SPARK-21736
> URL: https://issues.apache.org/jira/browse/SPARK-21736
> Project: Spark
>  Issue Type: Bug
>  Components: Windows
>Affects Versions: 2.2.0
> Environment: OS: Windows Server 2012
> JVM: Oracle Java 1.8 (hotspot)
>Reporter: Ross Brigoli
>
> We are currently running Spark 2.1 on Windows Server 2012 in production. 
> While trying to upgrade to Spark 2.2.0, we started getting this error 
> after submitting a job in stand-alone cluster mode:
> Could not load properties; nested exception is java.io.IOException: 
> java.net.URISyntaxException: Illegal character in opaque part at index 7: 
> file:Z:\SIT1\TreatmentManager\app.properties 
> We pass a -D argument to spark-submit with a value of 
> file:Z:\SIT1\TreatmentManager\app.properties
> *These URI -D arguments were working fine in Spark 2.1.0*



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21736) Spark 2.2 in Windows does not recognize the URI "file:Z:\SIT1\TreatmentManager\app.properties"

2017-08-15 Thread Ross Brigoli (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ross Brigoli updated SPARK-21736:
-
Description: 
We are currently running Spark 2.1 on Windows Server 2012 in production. While 
trying to upgrade to Spark 2.2.0, we started getting this error after 
submitting a job in stand-alone cluster mode:

Could not load properties; nested exception is java.io.IOException: 
java.net.URISyntaxException: Illegal character in opaque part at index 7: 
file:Z:\SIT1\TreatmentManager\app.properties 

We pass a -D argument to spark-submit with a value of 
file:Z:\SIT1\TreatmentManager\app.properties

*These URI -D arguments were working fine in Spark 2.1.0*

  was:
We are currently running Spark 2.1 on Windows Server 2012 in production. While 
trying to upgrade to Spark 2.2.0, we started getting this error after 
submitting a job in stand-alone cluster mode:

Could not load properties; nested exception is java.io.IOException: 
java.net.URISyntaxException: Illegal character in opaque part at index 7: 
file:Z:\SIT1\TreatmentManager\app.properties 

We have these lines in the *spark-defaults.conf*

spark.eventLog.dir file:/z:/sparklogs
spark.history.fs.logDirectory file:/z:/sparklogs

*These URIs worked fine in Spark 2.1.0*

Summary: Spark 2.2 in Windows does not recognize the URI 
"file:Z:\SIT1\TreatmentManager\app.properties"  (was: Spark 2.2 in Windows does 
not recognize the URI file:/z:/sparklogs from spark-defaults.conf)

> Spark 2.2 in Windows does not recognize the URI 
> "file:Z:\SIT1\TreatmentManager\app.properties"
> --
>
> Key: SPARK-21736
> URL: https://issues.apache.org/jira/browse/SPARK-21736
> Project: Spark
>  Issue Type: Bug
>  Components: Windows
>Affects Versions: 2.2.0
> Environment: OS: Windows Server 2012
> JVM: Oracle Java 1.8 (hotspot)
>Reporter: Ross Brigoli
>
> We are currently running Spark 2.1 on Windows Server 2012 in production. 
> While trying to upgrade to Spark 2.2.0, we started getting this error 
> after submitting a job in stand-alone cluster mode:
> Could not load properties; nested exception is java.io.IOException: 
> java.net.URISyntaxException: Illegal character in opaque part at index 7: 
> file:Z:\SIT1\TreatmentManager\app.properties 
> We pass a -D argument to spark-submit with a value of 
> file:Z:\SIT1\TreatmentManager\app.properties
> *These URI -D arguments were working fine in Spark 2.1.0*



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21726) Check for structural integrity of the plan in QO in test mode

2017-08-15 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16127288#comment-16127288
 ] 

Liang-Chi Hsieh commented on SPARK-21726:
-

[~rxin] Thanks for pinging me! Yes, I'm interested in doing this.

> Check for structural integrity of the plan in QO in test mode
> -
>
> Key: SPARK-21726
> URL: https://issues.apache.org/jira/browse/SPARK-21726
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>
> Right now we don't have any checks in the optimizer to verify the 
> structural integrity of the plan (e.g. that it is still resolved). It would be 
> great if, in test mode, we could check whether a plan is still resolved after 
> the execution of each rule, so we can catch rules that return invalid plans.
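A rough sketch of what such a test-mode check could look like, assuming the hook runs after each rule; the helper name is made up and the check is approximated by the resolved flag only, so this is illustrative rather than the committed implementation:

{code:scala}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// Illustrative only: after a rule fires in test mode, fail fast if the plan
// lost its structural integrity (here approximated by the resolved flag).
def checkStructuralIntegrity(ruleName: String, plan: LogicalPlan): Unit = {
  if (!plan.resolved) {
    throw new IllegalStateException(
      s"Rule $ruleName produced an unresolved plan:\n${plan.treeString}")
  }
}
{code}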



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21734) spark job blocked by updateDependencies

2017-08-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-21734.
---
Resolution: Invalid

Not sure what you mean, but this should probably start as a question on the 
mailing list.

> spark job blocked by updateDependencies
> ---
>
> Key: SPARK-21734
> URL: https://issues.apache.org/jira/browse/SPARK-21734
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
> Environment: yarn-cluster
>Reporter: Cheng jin
>
> As the summary says: I ran a Spark job on YARN and it was blocked by a bad 
> network. Here is the exception:
> {code:java}
> :at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>   at sun.nio.ch.FileDispatcherImpl.read(FileDispatcherImpl.java:46)
>   at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>   at sun.nio.ch.IOUtil.read(IOUtil.java:197)
>   at sun.nio.ch.SourceChannelImpl.read(SourceChannelImpl.java:167)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEnv$FileDownloadChannel$$anonfun$1.apply$mcI$sp(NettyRpcEnv.scala:371)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEnv$FileDownloadChannel$$anonfun$1.apply(NettyRpcEnv.scala:371)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEnv$FileDownloadChannel$$anonfun$1.apply(NettyRpcEnv.scala:371)
>   at scala.util.Try$.apply(Try.scala:192)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEnv$FileDownloadChannel.read(NettyRpcEnv.scala:371)
>   at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:65)
>   at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:109)
>   at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:103)
>   at java.io.InputStream.read(InputStream.java:101)
>   at 
> org.apache.spark.util.Utils$$anonfun$copyStream$1.apply$mcJ$sp(Utils.scala:354)
>   at 
> org.apache.spark.util.Utils$$anonfun$copyStream$1.apply(Utils.scala:322)
>   at 
> org.apache.spark.util.Utils$$anonfun$copyStream$1.apply(Utils.scala:322)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1303)
>   at org.apache.spark.util.Utils$.copyStream(Utils.scala:362)
>   at org.apache.spark.util.Utils$.downloadFile(Utils.scala:509)
>   at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:639)
>   at org.apache.spark.util.Utils$.fetchFile(Utils.scala:450)
>   at 
> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:659)
>   at 
> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:651)
>   at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
>   at 
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
>   at 
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
>   at 
> scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
>   at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
>   at scala.collection.mutable.HashMap.foreach(HashMap.scala:99)
>   at 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
>   at 
> org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:651)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:297)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:748)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21714) SparkSubmit in Yarn Client mode downloads remote files and then reuploads them again

2017-08-15 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16127204#comment-16127204
 ] 

Thomas Graves commented on SPARK-21714:
---

I haven't had time to get to it, so it would be great if you could work on it.

> SparkSubmit in Yarn Client mode downloads remote files and then reuploads 
> them again
> 
>
> Key: SPARK-21714
> URL: https://issues.apache.org/jira/browse/SPARK-21714
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.2.0
>Reporter: Thomas Graves
>Priority: Critical
>
> SPARK-10643 added the ability for spark-submit to download remote files in 
> client mode.
> However, in YARN mode this introduced a bug: the files are downloaded for the 
> client, but then the YARN client just re-uploads them to HDFS and uses them again. 
> This should not happen when the remote file is already on HDFS. It wastes 
> resources and defeats the distributed cache, because if the original 
> object was public it would have been shared by many users. By downloading 
> and re-uploading it, we make it private.
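A sketch of the kind of check that would avoid the round trip, assuming the decision can be made on the URI scheme; the function name and the scheme rules below are illustrative assumptions, not Spark's actual logic:

{code:scala}
// Illustrative only: skip the client-side download when the resource already
// lives on a filesystem that YARN's distributed cache can read directly.
def shouldDownloadToClient(uri: java.net.URI): Boolean =
  uri.getScheme match {
    case null | "file" | "local"  => false // already local, nothing to fetch
    case "http" | "https" | "ftp" => true  // only reachable from the client
    case _                        => false // hdfs:// etc. - let YARN reference it in place
  }
{code}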



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21734) spark job blocked by updateDependencies

2017-08-15 Thread Cheng jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16127203#comment-16127203
 ] 

Cheng jin commented on SPARK-21734:
---

We met this problem. In fact, it is like this:

{code:java}
private def doFetchFile(
  url: String,
  targetDir: File,
  filename: String,
  conf: SparkConf,
  securityMgr: SecurityManager,
  hadoopConf: Configuration) {
val targetFile = new File(targetDir, filename)
val uri = new URI(url)
val fileOverwrite = conf.getBoolean("spark.files.overwrite", defaultValue = 
false)
Option(uri.getScheme).getOrElse("file") match {
  case "spark" =>
if (SparkEnv.get == null) {
  throw new IllegalStateException(
"Cannot retrieve files with 'spark' scheme without an active 
SparkEnv.")
}
val source = SparkEnv.get.rpcEnv.openChannel(url)
val is = Channels.newInputStream(source)
downloadFile(url, is, targetFile, fileOverwrite)
  case "http" | "https" | "ftp" =>
var uc: URLConnection = null
if (securityMgr.isAuthenticationEnabled()) {
  logDebug("fetchFile with security enabled")
  val newuri = constructURIForAuthentication(uri, securityMgr)
  uc = newuri.toURL().openConnection()
  uc.setAllowUserInteraction(false)
} else {
  logDebug("fetchFile not using security")
  uc = new URL(url).openConnection()
}
Utils.setupSecureURLConnection(uc, securityMgr)

val timeoutMs =
  conf.getTimeAsSeconds("spark.files.fetchTimeout", "60s").toInt * 1000
uc.setConnectTimeout(timeoutMs)
uc.setReadTimeout(timeoutMs)
uc.connect()
val in = uc.getInputStream()
downloadFile(url, in, targetFile, fileOverwrite)
{code}

updateDependencies() -> fetchFile() -> doFetchFile()
It goes into the case "spark" branch, then opens an InputStream and calls read() 
to download the file.
However, read() is a blocking call; would adding a read timeout work here?
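For what it's worth, a small illustration of one way to bound such a blocking read (not Spark's code, and whether this belongs inside NettyRpcEnv is exactly the open question): run the read on another thread and stop waiting after a timeout.

{code:scala}
import java.io.InputStream
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Illustrative only: bound a blocking InputStream.read by racing it against a
// timeout on another thread. The read still blocks that thread; the caller just
// stops waiting and can fail the task instead of hanging forever.
def readWithTimeout(in: InputStream, buf: Array[Byte], timeout: FiniteDuration): Int =
  Await.result(Future(in.read(buf)), timeout) // throws TimeoutException on timeout
{code}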


> spark job blocked by updateDependencies
> ---
>
> Key: SPARK-21734
> URL: https://issues.apache.org/jira/browse/SPARK-21734
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
> Environment: yarn-cluster
>Reporter: Cheng jin
>
> As the summary says: I ran a Spark job on YARN and it was blocked by a bad 
> network. Here is the exception:
> {code:java}
> :at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>   at sun.nio.ch.FileDispatcherImpl.read(FileDispatcherImpl.java:46)
>   at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>   at sun.nio.ch.IOUtil.read(IOUtil.java:197)
>   at sun.nio.ch.SourceChannelImpl.read(SourceChannelImpl.java:167)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEnv$FileDownloadChannel$$anonfun$1.apply$mcI$sp(NettyRpcEnv.scala:371)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEnv$FileDownloadChannel$$anonfun$1.apply(NettyRpcEnv.scala:371)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEnv$FileDownloadChannel$$anonfun$1.apply(NettyRpcEnv.scala:371)
>   at scala.util.Try$.apply(Try.scala:192)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEnv$FileDownloadChannel.read(NettyRpcEnv.scala:371)
>   at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:65)
>   at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:109)
>   at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:103)
>   at java.io.InputStream.read(InputStream.java:101)
>   at 
> org.apache.spark.util.Utils$$anonfun$copyStream$1.apply$mcJ$sp(Utils.scala:354)
>   at 
> org.apache.spark.util.Utils$$anonfun$copyStream$1.apply(Utils.scala:322)
>   at 
> org.apache.spark.util.Utils$$anonfun$copyStream$1.apply(Utils.scala:322)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1303)
>   at org.apache.spark.util.Utils$.copyStream(Utils.scala:362)
>   at org.apache.spark.util.Utils$.downloadFile(Utils.scala:509)
>   at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:639)
>   at org.apache.spark.util.Utils$.fetchFile(Utils.scala:450)
>   at 
> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:659)
>   at 
> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:651)
>   at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
>   at 
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
>   at 
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
>   at 
> scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
>   

[jira] [Commented] (SPARK-21714) SparkSubmit in Yarn Client mode downloads remote files and then reuploads them again

2017-08-15 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16127199#comment-16127199
 ] 

Saisai Shao commented on SPARK-21714:
-

Let me take a crack at this if no one is working on it.

> SparkSubmit in Yarn Client mode downloads remote files and then reuploads 
> them again
> 
>
> Key: SPARK-21714
> URL: https://issues.apache.org/jira/browse/SPARK-21714
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.2.0
>Reporter: Thomas Graves
>Priority: Critical
>
> SPARK-10643 added the ability for spark-submit to download remote file in 
> client mode.
> However, in YARN mode this introduced a bug where it downloads them for the 
> client, but the YARN client then just re-uploads them to HDFS and uses those 
> copies. This should not happen when the remote file is already on HDFS. This 
> wastes resources and defeats the distributed cache, because if the original 
> object was public it would have been shared by many users; by downloading and 
> re-uploading it, we make it private.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21736) Spark 2.2 in Windows does not recognize the URI file:/z:/sparklogs from spark-defaults.conf

2017-08-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16127172#comment-16127172
 ] 

Sean Owen commented on SPARK-21736:
---

Hm, isn't it {{file:///z:/...}}? Not sure if that makes a difference.
You also seem to be showing an error with a URL that isn't the one you use for 
Spark.
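
As a quick check (my own illustration, not from the ticket) of why the backslash form is rejected 
while the forward-slash forms parse, in a Scala shell:

{code}
import java.net.URI

new URI("file:/z:/sparklogs")    // parses: hierarchical URI with path /z:/sparklogs
new URI("file:///z:/sparklogs")  // also parses, with an empty authority
new URI("""file:Z:\SIT1\TreatmentManager\app.properties""")
// throws java.net.URISyntaxException: Illegal character in opaque part at index 7
{code}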

> Spark 2.2 in Windows does not recognize the URI file:/z:/sparklogs from 
> spark-defaults.conf
> ---
>
> Key: SPARK-21736
> URL: https://issues.apache.org/jira/browse/SPARK-21736
> Project: Spark
>  Issue Type: Bug
>  Components: Windows
>Affects Versions: 2.2.0
> Environment: OS: Windows Server 2012
> JVM: Oracle Java 1.8 (hotspot)
>Reporter: Ross Brigoli
>
> We are currently running Spark 2.1 in Windows Server 2012 in Production. 
> While trying to upgrade to Spark 2.2.0 we started getting this error after 
> submitting a job in standalone cluster mode:
> Could not load properties; nested exception is java.io.IOException: 
> java.net.URISyntaxException: Illegal character in opaque part at index 7: 
> file:Z:\SIT1\TreatmentManager\app.properties 
> We have these lines in *spark-defaults.conf*:
> spark.eventLog.dir file:/z:/sparklogs
> spark.history.fs.logDirectory file:/z:/sparklogs
> *These URIs worked fine in Spark 2.1.0*



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21736) Spark 2.2 in Windows does not recognize the URI file:/z:/sparklogs from spark-defaults.conf

2017-08-15 Thread Ross Brigoli (JIRA)
Ross Brigoli created SPARK-21736:


 Summary: Spark 2.2 in Windows does not recognize the URI 
file:/z:/sparklogs from spark-defaults.conf
 Key: SPARK-21736
 URL: https://issues.apache.org/jira/browse/SPARK-21736
 Project: Spark
  Issue Type: Bug
  Components: Windows
Affects Versions: 2.2.0
 Environment: OS: Windows Server 2012
JVM: Oracle Java 1.8 (hotspot)
Reporter: Ross Brigoli


We are currently running Spark 2.1 on Windows Server 2012 in production. While 
trying to upgrade to Spark 2.2.0 we started getting this error after submitting 
a job in standalone cluster mode:

Could not load properties; nested exception is java.io.IOException: 
java.net.URISyntaxException: Illegal character in opaque part at index 7: 
file:Z:\SIT1\TreatmentManager\app.properties 

We have these lines in *spark-defaults.conf*:

spark.eventLog.dir file:/z:/sparklogs
spark.history.fs.logDirectory file:/z:/sparklogs

*These URIs worked fine in Spark 2.1.0*



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21704) Add the description of 'sbin/stop-slave.sh' in spark-standalone.html.

2017-08-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-21704.
---
Resolution: Not A Problem

> Add the description of 'sbin/stop-slave.sh'  in spark-standalone.html.
> --
>
> Key: SPARK-21704
> URL: https://issues.apache.org/jira/browse/SPARK-21704
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.3.0
>Reporter: guoxiaolongzte
>Priority: Minor
>
> 1. Add a description of 'sbin/stop-slave.sh' in spark-standalone.html:
> `sbin/stop-slave.sh` - Stops a slave instance on the machine the script is 
> executed on.
> 2. The description of 'sbin/start-slaves.sh' is not accurate. It should be 
> modified as follows:
> Starts all slave instances on each machine specified in the `conf/slaves` 
> file.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21734) spark job blocked by updateDependencies

2017-08-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-21734:
--
   Flags:   (was: Patch)
Target Version/s:   (was: 2.2.0)
  Labels:   (was: patch)

You need to read http://spark.apache.org/contributing.html first.
This doesn't describe a particular problem. How does this happen, what problem 
does it cause, etc.? On its own this doesn't mean anything, and I'd close it.

> spark job blocked by updateDependencies
> ---
>
> Key: SPARK-21734
> URL: https://issues.apache.org/jira/browse/SPARK-21734
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
> Environment: yarn-cluster
>Reporter: Cheng jin
>
> As the title says: I ran a Spark job on YARN and it was blocked by a bad 
> network. Here are the exceptions:
> {code:java}
> :at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>   at sun.nio.ch.FileDispatcherImpl.read(FileDispatcherImpl.java:46)
>   at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>   at sun.nio.ch.IOUtil.read(IOUtil.java:197)
>   at sun.nio.ch.SourceChannelImpl.read(SourceChannelImpl.java:167)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEnv$FileDownloadChannel$$anonfun$1.apply$mcI$sp(NettyRpcEnv.scala:371)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEnv$FileDownloadChannel$$anonfun$1.apply(NettyRpcEnv.scala:371)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEnv$FileDownloadChannel$$anonfun$1.apply(NettyRpcEnv.scala:371)
>   at scala.util.Try$.apply(Try.scala:192)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEnv$FileDownloadChannel.read(NettyRpcEnv.scala:371)
>   at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:65)
>   at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:109)
>   at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:103)
>   at java.io.InputStream.read(InputStream.java:101)
>   at 
> org.apache.spark.util.Utils$$anonfun$copyStream$1.apply$mcJ$sp(Utils.scala:354)
>   at 
> org.apache.spark.util.Utils$$anonfun$copyStream$1.apply(Utils.scala:322)
>   at 
> org.apache.spark.util.Utils$$anonfun$copyStream$1.apply(Utils.scala:322)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1303)
>   at org.apache.spark.util.Utils$.copyStream(Utils.scala:362)
>   at org.apache.spark.util.Utils$.downloadFile(Utils.scala:509)
>   at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:639)
>   at org.apache.spark.util.Utils$.fetchFile(Utils.scala:450)
>   at 
> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:659)
>   at 
> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:651)
>   at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
>   at 
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
>   at 
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
>   at 
> scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
>   at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
>   at scala.collection.mutable.HashMap.foreach(HashMap.scala:99)
>   at 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
>   at 
> org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:651)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:297)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:748)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21735) spark job blocked by updateDependencies

2017-08-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-21735.
---
  Resolution: Duplicate
Target Version/s:   (was: 2.2.0)

> spark job blocked by updateDependencies
> ---
>
> Key: SPARK-21735
> URL: https://issues.apache.org/jira/browse/SPARK-21735
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
> Environment: yarn-cluster
>Reporter: Cheng jin
>  Labels: patch
>
> As the title says: I ran a Spark job on YARN and it was blocked by a bad 
> network. Here are the exceptions:
> {code:java}
> :at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>   at sun.nio.ch.FileDispatcherImpl.read(FileDispatcherImpl.java:46)
>   at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>   at sun.nio.ch.IOUtil.read(IOUtil.java:197)
>   at sun.nio.ch.SourceChannelImpl.read(SourceChannelImpl.java:167)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEnv$FileDownloadChannel$$anonfun$1.apply$mcI$sp(NettyRpcEnv.scala:371)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEnv$FileDownloadChannel$$anonfun$1.apply(NettyRpcEnv.scala:371)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEnv$FileDownloadChannel$$anonfun$1.apply(NettyRpcEnv.scala:371)
>   at scala.util.Try$.apply(Try.scala:192)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEnv$FileDownloadChannel.read(NettyRpcEnv.scala:371)
>   at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:65)
>   at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:109)
>   at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:103)
>   at java.io.InputStream.read(InputStream.java:101)
>   at 
> org.apache.spark.util.Utils$$anonfun$copyStream$1.apply$mcJ$sp(Utils.scala:354)
>   at 
> org.apache.spark.util.Utils$$anonfun$copyStream$1.apply(Utils.scala:322)
>   at 
> org.apache.spark.util.Utils$$anonfun$copyStream$1.apply(Utils.scala:322)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1303)
>   at org.apache.spark.util.Utils$.copyStream(Utils.scala:362)
>   at org.apache.spark.util.Utils$.downloadFile(Utils.scala:509)
>   at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:639)
>   at org.apache.spark.util.Utils$.fetchFile(Utils.scala:450)
>   at 
> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:659)
>   at 
> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:651)
>   at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
>   at 
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
>   at 
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
>   at 
> scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
>   at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
>   at scala.collection.mutable.HashMap.foreach(HashMap.scala:99)
>   at 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
>   at 
> org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:651)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:297)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:748)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21702) Structured Streaming S3A SSE Encryption Not Applied when PartitionBy Used

2017-08-15 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16127062#comment-16127062
 ] 

Steve Loughran commented on SPARK-21702:


PS: you shouldn't need to set the s3a.impl field; that's handled automatically.

> Structured Streaming S3A SSE Encryption Not Applied when PartitionBy Used
> -
>
> Key: SPARK-21702
> URL: https://issues.apache.org/jira/browse/SPARK-21702
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
> Environment: Hadoop 2.7.3: AWS SDK 1.7.4
> Hadoop 2.8.1: AWS SDK 1.10.6
>Reporter: George Pongracz
>Priority: Minor
>  Labels: security
>
> Settings:
>   .config("spark.hadoop.fs.s3a.impl", 
> "org.apache.hadoop.fs.s3a.S3AFileSystem")
>   .config("spark.hadoop.fs.s3a.server-side-encryption-algorithm", 
> "AES256")
> When writing to an S3 sink from Structured Streaming, the files are being 
> encrypted using AES-256.
> When introducing a "PartitionBy", the output data files are unencrypted, 
> while all other supporting files and metadata are encrypted.
> I suspect the write to temp is encrypted and the move/rename is not applying 
> the SSE.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21702) Structured Streaming S3A SSE Encryption Not Applied when PartitionBy Used

2017-08-15 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16127061#comment-16127061
 ] 

Steve Loughran commented on SPARK-21702:


This is interesting. What may be happening is that whatever s3a FS is being 
created, it's not picking up the options you are setting in the Spark conf.

Set the values in core-site.xml and/or spark-defaults.conf and see if they are 
picked up there. Otherwise, if you can share a (short) example with this 
problem, I'll see if I can replicate it in an integration test.
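
For instance, a minimal core-site.xml entry to try (the property name and value are the ones 
already used in this ticket; whether it actually gets picked up is exactly what needs checking):

{code:xml}
<property>
  <name>fs.s3a.server-side-encryption-algorithm</name>
  <value>AES256</value>
</property>
{code}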

> Structured Streaming S3A SSE Encryption Not Applied when PartitionBy Used
> -
>
> Key: SPARK-21702
> URL: https://issues.apache.org/jira/browse/SPARK-21702
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
> Environment: Hadoop 2.7.3: AWS SDK 1.7.4
> Hadoop 2.8.1: AWS SDK 1.10.6
>Reporter: George Pongracz
>Priority: Minor
>  Labels: security
>
> Settings:
>   .config("spark.hadoop.fs.s3a.impl", 
> "org.apache.hadoop.fs.s3a.S3AFileSystem")
>   .config("spark.hadoop.fs.s3a.server-side-encryption-algorithm", 
> "AES256")
> When writing to an S3 sink from Structured Streaming, the files are being 
> encrypted using AES-256.
> When introducing a "PartitionBy", the output data files are unencrypted, 
> while all other supporting files and metadata are encrypted.
> I suspect the write to temp is encrypted and the move/rename is not applying 
> the SSE.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21735) spark job blocked by updateDependencies

2017-08-15 Thread Cheng jin (JIRA)
Cheng jin created SPARK-21735:
-

 Summary: spark job blocked by updateDependencies
 Key: SPARK-21735
 URL: https://issues.apache.org/jira/browse/SPARK-21735
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.0
 Environment: yarn-cluster
Reporter: Cheng jin


As the title says: I ran a Spark job on YARN and it was blocked by a bad 
network. Here are the exceptions:

{code:java}
:at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.FileDispatcherImpl.read(FileDispatcherImpl.java:46)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:197)
at sun.nio.ch.SourceChannelImpl.read(SourceChannelImpl.java:167)
at 
org.apache.spark.rpc.netty.NettyRpcEnv$FileDownloadChannel$$anonfun$1.apply$mcI$sp(NettyRpcEnv.scala:371)
at 
org.apache.spark.rpc.netty.NettyRpcEnv$FileDownloadChannel$$anonfun$1.apply(NettyRpcEnv.scala:371)
at 
org.apache.spark.rpc.netty.NettyRpcEnv$FileDownloadChannel$$anonfun$1.apply(NettyRpcEnv.scala:371)
at scala.util.Try$.apply(Try.scala:192)
at 
org.apache.spark.rpc.netty.NettyRpcEnv$FileDownloadChannel.read(NettyRpcEnv.scala:371)
at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:65)
at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:109)
at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:103)
at java.io.InputStream.read(InputStream.java:101)
at 
org.apache.spark.util.Utils$$anonfun$copyStream$1.apply$mcJ$sp(Utils.scala:354)
at 
org.apache.spark.util.Utils$$anonfun$copyStream$1.apply(Utils.scala:322)
at 
org.apache.spark.util.Utils$$anonfun$copyStream$1.apply(Utils.scala:322)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1303)
at org.apache.spark.util.Utils$.copyStream(Utils.scala:362)
at org.apache.spark.util.Utils$.downloadFile(Utils.scala:509)
at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:639)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:450)
at 
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:659)
at 
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:651)
at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
at 
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
at 
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
at 
scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:99)
at 
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
at 
org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:651)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:297)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
{code}




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21734) spark job blocked by updateDependencies

2017-08-15 Thread Cheng jin (JIRA)
Cheng jin created SPARK-21734:
-

 Summary: spark job blocked by updateDependencies
 Key: SPARK-21734
 URL: https://issues.apache.org/jira/browse/SPARK-21734
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.0
 Environment: yarn-cluster
Reporter: Cheng jin


As the title says: I ran a Spark job on YARN and it was blocked by a bad 
network. Here are the exceptions:

{code:java}
:at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.FileDispatcherImpl.read(FileDispatcherImpl.java:46)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:197)
at sun.nio.ch.SourceChannelImpl.read(SourceChannelImpl.java:167)
at 
org.apache.spark.rpc.netty.NettyRpcEnv$FileDownloadChannel$$anonfun$1.apply$mcI$sp(NettyRpcEnv.scala:371)
at 
org.apache.spark.rpc.netty.NettyRpcEnv$FileDownloadChannel$$anonfun$1.apply(NettyRpcEnv.scala:371)
at 
org.apache.spark.rpc.netty.NettyRpcEnv$FileDownloadChannel$$anonfun$1.apply(NettyRpcEnv.scala:371)
at scala.util.Try$.apply(Try.scala:192)
at 
org.apache.spark.rpc.netty.NettyRpcEnv$FileDownloadChannel.read(NettyRpcEnv.scala:371)
at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:65)
at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:109)
at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:103)
at java.io.InputStream.read(InputStream.java:101)
at 
org.apache.spark.util.Utils$$anonfun$copyStream$1.apply$mcJ$sp(Utils.scala:354)
at 
org.apache.spark.util.Utils$$anonfun$copyStream$1.apply(Utils.scala:322)
at 
org.apache.spark.util.Utils$$anonfun$copyStream$1.apply(Utils.scala:322)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1303)
at org.apache.spark.util.Utils$.copyStream(Utils.scala:362)
at org.apache.spark.util.Utils$.downloadFile(Utils.scala:509)
at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:639)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:450)
at 
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:659)
at 
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:651)
at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
at 
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
at 
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
at 
scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:99)
at 
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
at 
org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:651)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:297)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
{code}




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21681) MLOR do not work correctly when featureStd contains zero

2017-08-15 Thread Weichen Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu updated SPARK-21681:
---
Description: 
MLOR does not work correctly when featureStd contains zero.
We can reproduce the bug with the following dataset (whose features include a 
zero-variance column); it generates a wrong result (all coefficients become 0):

{code}
val multinomialDatasetWithZeroVar = {
  val nPoints = 100
  val coefficients = Array(
-0.57997, 0.912083, -0.371077,
-0.16624, -0.84355, -0.048509)

  val xMean = Array(5.843, 3.0)
  val xVariance = Array(0.6856, 0.0)  // including zero variance

  val testData = generateMultinomialLogisticInput(
coefficients, xMean, xVariance, addIntercept = true, nPoints, seed)

  val df = sc.parallelize(testData, 4).toDF().withColumn("weight", lit(1.0))
  df.cache()
  df
}
{code}



  was:MLOR do not work correctly when featureStd contains zero.


> MLOR do not work correctly when featureStd contains zero
> 
>
> Key: SPARK-21681
> URL: https://issues.apache.org/jira/browse/SPARK-21681
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Weichen Xu
>
> MLOR does not work correctly when featureStd contains zero.
> We can reproduce the bug with the following dataset (whose features include a 
> zero-variance column); it generates a wrong result (all coefficients become 0):
> {code}
> val multinomialDatasetWithZeroVar = {
>   val nPoints = 100
>   val coefficients = Array(
> -0.57997, 0.912083, -0.371077,
> -0.16624, -0.84355, -0.048509)
>   val xMean = Array(5.843, 3.0)
>   val xVariance = Array(0.6856, 0.0)  // including zero variance
>   val testData = generateMultinomialLogisticInput(
> coefficients, xMean, xVariance, addIntercept = true, nPoints, seed)
>   val df = sc.parallelize(testData, 4).toDF().withColumn("weight", 
> lit(1.0))
>   df.cache()
>   df
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21733) ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM

2017-08-15 Thread Jepson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16126978#comment-16126978
 ] 

Jepson commented on SPARK-21733:


[~sowen], thank you for correcting me; I will keep it in mind next time.

> ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM
> -
>
> Key: SPARK-21733
> URL: https://issues.apache.org/jira/browse/SPARK-21733
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.1.1
> Environment: Apache Spark2.1.1 
> CDH5.12.0 Yarn
>Reporter: Jepson
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> Kafka + Spark Streaming throws these errors:
> {code:java}
> 17/08/15 09:34:14 INFO memory.MemoryStore: Block broadcast_8003_piece0 stored 
> as bytes in memory (estimated size 1895.0 B, free 1643.2 MB)
> 17/08/15 09:34:14 INFO broadcast.TorrentBroadcast: Reading broadcast variable 
> 8003 took 11 ms
> 17/08/15 09:34:14 INFO memory.MemoryStore: Block broadcast_8003 stored as 
> values in memory (estimated size 2.9 KB, free 1643.2 MB)
> 17/08/15 09:34:14 INFO kafka010.KafkaRDD: Beginning offset 10130733 is the 
> same as ending offset skipping kssh 5
> 17/08/15 09:34:14 INFO executor.Executor: Finished task 7.0 in stage 8003.0 
> (TID 64178). 1740 bytes result sent to driver
> 17/08/15 09:34:21 INFO storage.BlockManager: Removing RDD 8002
> 17/08/15 09:34:21 INFO executor.CoarseGrainedExecutorBackend: Got assigned 
> task 64186
> 17/08/15 09:34:21 INFO executor.Executor: Running task 7.0 in stage 8004.0 
> (TID 64186)
> 17/08/15 09:34:21 INFO broadcast.TorrentBroadcast: Started reading broadcast 
> variable 8004
> 17/08/15 09:34:21 INFO memory.MemoryStore: Block broadcast_8004_piece0 stored 
> as bytes in memory (estimated size 1895.0 B, free 1643.2 MB)
> 17/08/15 09:34:21 INFO broadcast.TorrentBroadcast: Reading broadcast variable 
> 8004 took 8 ms
> 17/08/15 09:34:21 INFO memory.MemoryStore: Block broadcast_8004 stored as 
> values in memory (estimated size 2.9 KB, free 1643.2 MB)
> 17/08/15 09:34:21 INFO kafka010.KafkaRDD: Beginning offset 10130733 is the 
> same as ending offset skipping kssh 5
> 17/08/15 09:34:21 INFO executor.Executor: Finished task 7.0 in stage 8004.0 
> (TID 64186). 1740 bytes result sent to driver
> h3. 17/08/15 09:34:29 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED 
> SIGNAL TERM
> 17/08/15 09:34:29 INFO storage.DiskBlockManager: Shutdown hook called
> 17/08/15 09:34:29 INFO util.ShutdownHookManager: Shutdown hook called
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21733) ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM

2017-08-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-21733.
---
   Resolution: Not A Problem
Fix Version/s: (was: 2.1.1)

It wasn't "Fixed"; there was no problem

> ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM
> -
>
> Key: SPARK-21733
> URL: https://issues.apache.org/jira/browse/SPARK-21733
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.1.1
> Environment: Apache Spark2.1.1 
> CDH5.12.0 Yarn
>Reporter: Jepson
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> Kafka + Spark Streaming throws these errors:
> {code:java}
> 17/08/15 09:34:14 INFO memory.MemoryStore: Block broadcast_8003_piece0 stored 
> as bytes in memory (estimated size 1895.0 B, free 1643.2 MB)
> 17/08/15 09:34:14 INFO broadcast.TorrentBroadcast: Reading broadcast variable 
> 8003 took 11 ms
> 17/08/15 09:34:14 INFO memory.MemoryStore: Block broadcast_8003 stored as 
> values in memory (estimated size 2.9 KB, free 1643.2 MB)
> 17/08/15 09:34:14 INFO kafka010.KafkaRDD: Beginning offset 10130733 is the 
> same as ending offset skipping kssh 5
> 17/08/15 09:34:14 INFO executor.Executor: Finished task 7.0 in stage 8003.0 
> (TID 64178). 1740 bytes result sent to driver
> 17/08/15 09:34:21 INFO storage.BlockManager: Removing RDD 8002
> 17/08/15 09:34:21 INFO executor.CoarseGrainedExecutorBackend: Got assigned 
> task 64186
> 17/08/15 09:34:21 INFO executor.Executor: Running task 7.0 in stage 8004.0 
> (TID 64186)
> 17/08/15 09:34:21 INFO broadcast.TorrentBroadcast: Started reading broadcast 
> variable 8004
> 17/08/15 09:34:21 INFO memory.MemoryStore: Block broadcast_8004_piece0 stored 
> as bytes in memory (estimated size 1895.0 B, free 1643.2 MB)
> 17/08/15 09:34:21 INFO broadcast.TorrentBroadcast: Reading broadcast variable 
> 8004 took 8 ms
> 17/08/15 09:34:21 INFO memory.MemoryStore: Block broadcast_8004 stored as 
> values in memory (estimated size 2.9 KB, free 1643.2 MB)
> 17/08/15 09:34:21 INFO kafka010.KafkaRDD: Beginning offset 10130733 is the 
> same as ending offset skipping kssh 5
> 17/08/15 09:34:21 INFO executor.Executor: Finished task 7.0 in stage 8004.0 
> (TID 64186). 1740 bytes result sent to driver
> h3. 17/08/15 09:34:29 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED 
> SIGNAL TERM
> 17/08/15 09:34:29 INFO storage.DiskBlockManager: Shutdown hook called
> 17/08/15 09:34:29 INFO util.ShutdownHookManager: Shutdown hook called
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-21733) ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM

2017-08-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reopened SPARK-21733:
---

> ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM
> -
>
> Key: SPARK-21733
> URL: https://issues.apache.org/jira/browse/SPARK-21733
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.1.1
> Environment: Apache Spark2.1.1 
> CDH5.12.0 Yarn
>Reporter: Jepson
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> Kafka + Spark Streaming throws these errors:
> {code:java}
> 17/08/15 09:34:14 INFO memory.MemoryStore: Block broadcast_8003_piece0 stored 
> as bytes in memory (estimated size 1895.0 B, free 1643.2 MB)
> 17/08/15 09:34:14 INFO broadcast.TorrentBroadcast: Reading broadcast variable 
> 8003 took 11 ms
> 17/08/15 09:34:14 INFO memory.MemoryStore: Block broadcast_8003 stored as 
> values in memory (estimated size 2.9 KB, free 1643.2 MB)
> 17/08/15 09:34:14 INFO kafka010.KafkaRDD: Beginning offset 10130733 is the 
> same as ending offset skipping kssh 5
> 17/08/15 09:34:14 INFO executor.Executor: Finished task 7.0 in stage 8003.0 
> (TID 64178). 1740 bytes result sent to driver
> 17/08/15 09:34:21 INFO storage.BlockManager: Removing RDD 8002
> 17/08/15 09:34:21 INFO executor.CoarseGrainedExecutorBackend: Got assigned 
> task 64186
> 17/08/15 09:34:21 INFO executor.Executor: Running task 7.0 in stage 8004.0 
> (TID 64186)
> 17/08/15 09:34:21 INFO broadcast.TorrentBroadcast: Started reading broadcast 
> variable 8004
> 17/08/15 09:34:21 INFO memory.MemoryStore: Block broadcast_8004_piece0 stored 
> as bytes in memory (estimated size 1895.0 B, free 1643.2 MB)
> 17/08/15 09:34:21 INFO broadcast.TorrentBroadcast: Reading broadcast variable 
> 8004 took 8 ms
> 17/08/15 09:34:21 INFO memory.MemoryStore: Block broadcast_8004 stored as 
> values in memory (estimated size 2.9 KB, free 1643.2 MB)
> 17/08/15 09:34:21 INFO kafka010.KafkaRDD: Beginning offset 10130733 is the 
> same as ending offset skipping kssh 5
> 17/08/15 09:34:21 INFO executor.Executor: Finished task 7.0 in stage 8004.0 
> (TID 64186). 1740 bytes result sent to driver
> h3. 17/08/15 09:34:29 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED 
> SIGNAL TERM
> 17/08/15 09:34:29 INFO storage.DiskBlockManager: Shutdown hook called
> 17/08/15 09:34:29 INFO util.ShutdownHookManager: Shutdown hook called
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21733) ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM

2017-08-15 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16126974#comment-16126974
 ] 

Saisai Shao commented on SPARK-21733:
-

Should the resolution of this JIRA be "Invalid" or something else? Actually 
nothing was fixed here.

> ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM
> -
>
> Key: SPARK-21733
> URL: https://issues.apache.org/jira/browse/SPARK-21733
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.1.1
> Environment: Apache Spark2.1.1 
> CDH5.12.0 Yarn
>Reporter: Jepson
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> Kafka + Spark Streaming throws these errors:
> {code:java}
> 17/08/15 09:34:14 INFO memory.MemoryStore: Block broadcast_8003_piece0 stored 
> as bytes in memory (estimated size 1895.0 B, free 1643.2 MB)
> 17/08/15 09:34:14 INFO broadcast.TorrentBroadcast: Reading broadcast variable 
> 8003 took 11 ms
> 17/08/15 09:34:14 INFO memory.MemoryStore: Block broadcast_8003 stored as 
> values in memory (estimated size 2.9 KB, free 1643.2 MB)
> 17/08/15 09:34:14 INFO kafka010.KafkaRDD: Beginning offset 10130733 is the 
> same as ending offset skipping kssh 5
> 17/08/15 09:34:14 INFO executor.Executor: Finished task 7.0 in stage 8003.0 
> (TID 64178). 1740 bytes result sent to driver
> 17/08/15 09:34:21 INFO storage.BlockManager: Removing RDD 8002
> 17/08/15 09:34:21 INFO executor.CoarseGrainedExecutorBackend: Got assigned 
> task 64186
> 17/08/15 09:34:21 INFO executor.Executor: Running task 7.0 in stage 8004.0 
> (TID 64186)
> 17/08/15 09:34:21 INFO broadcast.TorrentBroadcast: Started reading broadcast 
> variable 8004
> 17/08/15 09:34:21 INFO memory.MemoryStore: Block broadcast_8004_piece0 stored 
> as bytes in memory (estimated size 1895.0 B, free 1643.2 MB)
> 17/08/15 09:34:21 INFO broadcast.TorrentBroadcast: Reading broadcast variable 
> 8004 took 8 ms
> 17/08/15 09:34:21 INFO memory.MemoryStore: Block broadcast_8004 stored as 
> values in memory (estimated size 2.9 KB, free 1643.2 MB)
> 17/08/15 09:34:21 INFO kafka010.KafkaRDD: Beginning offset 10130733 is the 
> same as ending offset skipping kssh 5
> 17/08/15 09:34:21 INFO executor.Executor: Finished task 7.0 in stage 8004.0 
> (TID 64186). 1740 bytes result sent to driver
> h3. 17/08/15 09:34:29 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED 
> SIGNAL TERM
> 17/08/15 09:34:29 INFO storage.DiskBlockManager: Shutdown hook called
> 17/08/15 09:34:29 INFO util.ShutdownHookManager: Shutdown hook called
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21733) ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM

2017-08-15 Thread Jepson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jepson resolved SPARK-21733.

   Resolution: Fixed
Fix Version/s: 2.1.1

> ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM
> -
>
> Key: SPARK-21733
> URL: https://issues.apache.org/jira/browse/SPARK-21733
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.1.1
> Environment: Apache Spark2.1.1 
> CDH5.12.0 Yarn
>Reporter: Jepson
> Fix For: 2.1.1
>
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> Kafka + Spark Streaming throws these errors:
> {code:java}
> 17/08/15 09:34:14 INFO memory.MemoryStore: Block broadcast_8003_piece0 stored 
> as bytes in memory (estimated size 1895.0 B, free 1643.2 MB)
> 17/08/15 09:34:14 INFO broadcast.TorrentBroadcast: Reading broadcast variable 
> 8003 took 11 ms
> 17/08/15 09:34:14 INFO memory.MemoryStore: Block broadcast_8003 stored as 
> values in memory (estimated size 2.9 KB, free 1643.2 MB)
> 17/08/15 09:34:14 INFO kafka010.KafkaRDD: Beginning offset 10130733 is the 
> same as ending offset skipping kssh 5
> 17/08/15 09:34:14 INFO executor.Executor: Finished task 7.0 in stage 8003.0 
> (TID 64178). 1740 bytes result sent to driver
> 17/08/15 09:34:21 INFO storage.BlockManager: Removing RDD 8002
> 17/08/15 09:34:21 INFO executor.CoarseGrainedExecutorBackend: Got assigned 
> task 64186
> 17/08/15 09:34:21 INFO executor.Executor: Running task 7.0 in stage 8004.0 
> (TID 64186)
> 17/08/15 09:34:21 INFO broadcast.TorrentBroadcast: Started reading broadcast 
> variable 8004
> 17/08/15 09:34:21 INFO memory.MemoryStore: Block broadcast_8004_piece0 stored 
> as bytes in memory (estimated size 1895.0 B, free 1643.2 MB)
> 17/08/15 09:34:21 INFO broadcast.TorrentBroadcast: Reading broadcast variable 
> 8004 took 8 ms
> 17/08/15 09:34:21 INFO memory.MemoryStore: Block broadcast_8004 stored as 
> values in memory (estimated size 2.9 KB, free 1643.2 MB)
> 17/08/15 09:34:21 INFO kafka010.KafkaRDD: Beginning offset 10130733 is the 
> same as ending offset skipping kssh 5
> 17/08/15 09:34:21 INFO executor.Executor: Finished task 7.0 in stage 8004.0 
> (TID 64186). 1740 bytes result sent to driver
> h3. 17/08/15 09:34:29 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED 
> SIGNAL TERM
> 17/08/15 09:34:29 INFO storage.DiskBlockManager: Shutdown hook called
> 17/08/15 09:34:29 INFO util.ShutdownHookManager: Shutdown hook called
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21733) ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM

2017-08-15 Thread Jepson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16126970#comment-16126970
 ] 

Jepson commented on SPARK-21733:


I have resolved this issue. Thanks to [~jerryshao] and [~srowen].

spark-submit \
--class com.jiuye.KSSH \
--master yarn \
--deploy-mode cluster \
--driver-memory 3g \
--executor-memory 2g \
--executor-cores 2 \
--num-executors 8 \
--jars $(echo /opt/software/spark/spark_on_yarn/libs/*.jar | tr ' ' ',') \
--conf "spark.ui.showConsoleProgress=false" \
--conf "spark.driver.extraJavaOptions=-XX:MaxPermSize=2G 
-XX:+UseConcMarkSweepGC" \
--conf "spark.executor.extraJavaOptions=-XX:+UseConcMarkSweepGC" \
--conf "spark.sql.tungsten.enabled=false" \
--conf "spark.sql.codegen=false" \
--conf "spark.sql.unsafe.enabled=false" \
--conf "spark.streaming.backpressure.enabled=true" \
--conf "spark.streaming.kafka.maxRatePerPartition=1000" \
--conf "spark.locality.wait=1s" \
--conf "spark.streaming.blockInterval=1500ms" \
--conf "spark.shuffle.consolidateFiles=true" \
--conf "spark.executor.heartbeatInterval=36" \
--conf "spark.yarn.am.memoryOverhead=512m" \
--conf "spark.yarn.driver.memoryOverhead=512m" \
--conf "spark.yarn.executor.memoryOverhead=512m" \
/opt/software/spark/spark_on_yarn/KSSH-0.3.jar

> ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM
> -
>
> Key: SPARK-21733
> URL: https://issues.apache.org/jira/browse/SPARK-21733
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.1.1
> Environment: Apache Spark2.1.1 
> CDH5.12.0 Yarn
>Reporter: Jepson
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> Kafka + Spark Streaming throws these errors:
> {code:java}
> 17/08/15 09:34:14 INFO memory.MemoryStore: Block broadcast_8003_piece0 stored 
> as bytes in memory (estimated size 1895.0 B, free 1643.2 MB)
> 17/08/15 09:34:14 INFO broadcast.TorrentBroadcast: Reading broadcast variable 
> 8003 took 11 ms
> 17/08/15 09:34:14 INFO memory.MemoryStore: Block broadcast_8003 stored as 
> values in memory (estimated size 2.9 KB, free 1643.2 MB)
> 17/08/15 09:34:14 INFO kafka010.KafkaRDD: Beginning offset 10130733 is the 
> same as ending offset skipping kssh 5
> 17/08/15 09:34:14 INFO executor.Executor: Finished task 7.0 in stage 8003.0 
> (TID 64178). 1740 bytes result sent to driver
> 17/08/15 09:34:21 INFO storage.BlockManager: Removing RDD 8002
> 17/08/15 09:34:21 INFO executor.CoarseGrainedExecutorBackend: Got assigned 
> task 64186
> 17/08/15 09:34:21 INFO executor.Executor: Running task 7.0 in stage 8004.0 
> (TID 64186)
> 17/08/15 09:34:21 INFO broadcast.TorrentBroadcast: Started reading broadcast 
> variable 8004
> 17/08/15 09:34:21 INFO memory.MemoryStore: Block broadcast_8004_piece0 stored 
> as bytes in memory (estimated size 1895.0 B, free 1643.2 MB)
> 17/08/15 09:34:21 INFO broadcast.TorrentBroadcast: Reading broadcast variable 
> 8004 took 8 ms
> 17/08/15 09:34:21 INFO memory.MemoryStore: Block broadcast_8004 stored as 
> values in memory (estimated size 2.9 KB, free 1643.2 MB)
> 17/08/15 09:34:21 INFO kafka010.KafkaRDD: Beginning offset 10130733 is the 
> same as ending offset skipping kssh 5
> 17/08/15 09:34:21 INFO executor.Executor: Finished task 7.0 in stage 8004.0 
> (TID 64186). 1740 bytes result sent to driver
> h3. 17/08/15 09:34:29 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED 
> SIGNAL TERM
> 17/08/15 09:34:29 INFO storage.DiskBlockManager: Shutdown hook called
> 17/08/15 09:34:29 INFO util.ShutdownHookManager: Shutdown hook called
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21733) ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM

2017-08-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-21733:
--
Shepherd:   (was: -)
   Flags:   (was: Patch,Important)
  Labels:   (was: patch)

Yes, this looks normal

> ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM
> -
>
> Key: SPARK-21733
> URL: https://issues.apache.org/jira/browse/SPARK-21733
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.1.1
> Environment: Apache Spark2.1.1 
> CDH5.12.0 Yarn
>Reporter: Jepson
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> Kafka + Spark Streaming throws these errors:
> {code:java}
> 17/08/15 09:34:14 INFO memory.MemoryStore: Block broadcast_8003_piece0 stored 
> as bytes in memory (estimated size 1895.0 B, free 1643.2 MB)
> 17/08/15 09:34:14 INFO broadcast.TorrentBroadcast: Reading broadcast variable 
> 8003 took 11 ms
> 17/08/15 09:34:14 INFO memory.MemoryStore: Block broadcast_8003 stored as 
> values in memory (estimated size 2.9 KB, free 1643.2 MB)
> 17/08/15 09:34:14 INFO kafka010.KafkaRDD: Beginning offset 10130733 is the 
> same as ending offset skipping kssh 5
> 17/08/15 09:34:14 INFO executor.Executor: Finished task 7.0 in stage 8003.0 
> (TID 64178). 1740 bytes result sent to driver
> 17/08/15 09:34:21 INFO storage.BlockManager: Removing RDD 8002
> 17/08/15 09:34:21 INFO executor.CoarseGrainedExecutorBackend: Got assigned 
> task 64186
> 17/08/15 09:34:21 INFO executor.Executor: Running task 7.0 in stage 8004.0 
> (TID 64186)
> 17/08/15 09:34:21 INFO broadcast.TorrentBroadcast: Started reading broadcast 
> variable 8004
> 17/08/15 09:34:21 INFO memory.MemoryStore: Block broadcast_8004_piece0 stored 
> as bytes in memory (estimated size 1895.0 B, free 1643.2 MB)
> 17/08/15 09:34:21 INFO broadcast.TorrentBroadcast: Reading broadcast variable 
> 8004 took 8 ms
> 17/08/15 09:34:21 INFO memory.MemoryStore: Block broadcast_8004 stored as 
> values in memory (estimated size 2.9 KB, free 1643.2 MB)
> 17/08/15 09:34:21 INFO kafka010.KafkaRDD: Beginning offset 10130733 is the 
> same as ending offset skipping kssh 5
> 17/08/15 09:34:21 INFO executor.Executor: Finished task 7.0 in stage 8004.0 
> (TID 64186). 1740 bytes result sent to driver
> h3. 17/08/15 09:34:29 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED 
> SIGNAL TERM
> 17/08/15 09:34:29 INFO storage.DiskBlockManager: Shutdown hook called
> 17/08/15 09:34:29 INFO util.ShutdownHookManager: Shutdown hook called
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21723) Can't write LibSVM - key not found: numFeatures

2017-08-15 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-21723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Vršovský updated SPARK-21723:
-
Description: 
Writing a dataset to LibSVM format raises an exception

{{java.util.NoSuchElementException: key not found: numFeatures}}

Happens only when the dataset was NOT read from a LibSVM format before (because 
otherwise numFeatures is in its metadata). Steps to reproduce:

{code}
import org.apache.spark.ml.linalg.Vectors
val rawData = Seq((1.0, Vectors.sparse(3, Seq((0, 2.0), (1, 3.0,
  (4.0, Vectors.sparse(3, Seq((0, 5.0), (2, 6.0)
val dfTemp = spark.sparkContext.parallelize(rawData).toDF("label", "features")
dfTemp.coalesce(1).write.format("libsvm").save("...filename...")
{code}

PR with a fix and unit test is ready - see 
[https://github.com/apache/spark/pull/18872].

  was:
Writing a dataset to LibSVM format raises an exception

{{java.util.NoSuchElementException: key not found: numFeatures}}

Happens only when the dataset was NOT read from a LibSVM format before (because 
otherwise numFeatures is in its metadata). Steps to reproduce:

{code:scala}
import org.apache.spark.ml.linalg.Vectors
val rawData = Seq((1.0, Vectors.sparse(3, Seq((0, 2.0), (1, 3.0,
  (4.0, Vectors.sparse(3, Seq((0, 5.0), (2, 6.0)
val dfTemp = spark.sparkContext.parallelize(rawData).toDF("label", "features")
dfTemp.coalesce(1).write.format("libsvm").save("...filename...")
{code}

PR with a fix and unit test is ready - see 
[https://github.com/apache/spark/pull/18872].


> Can't write LibSVM - key not found: numFeatures
> ---
>
> Key: SPARK-21723
> URL: https://issues.apache.org/jira/browse/SPARK-21723
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, ML
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Jan Vršovský
>
> Writing a dataset to LibSVM format raises an exception
> {{java.util.NoSuchElementException: key not found: numFeatures}}
> Happens only when the dataset was NOT read from a LibSVM format before 
> (because otherwise numFeatures is in its metadata). Steps to reproduce:
> {code}
> import org.apache.spark.ml.linalg.Vectors
> val rawData = Seq((1.0, Vectors.sparse(3, Seq((0, 2.0), (1, 3.0,
>   (4.0, Vectors.sparse(3, Seq((0, 5.0), (2, 6.0)
> val dfTemp = spark.sparkContext.parallelize(rawData).toDF("label", "features")
> dfTemp.coalesce(1).write.format("libsvm").save("...filename...")
> {code}
> PR with a fix and unit test is ready - see 
> [https://github.com/apache/spark/pull/18872].



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21723) Can't write LibSVM - key not found: numFeatures

2017-08-15 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16126939#comment-16126939
 ] 

Nick Pentreath commented on SPARK-21723:


Yes, we should definitely be able to write LibSVM format regardless of whether 
the original data was read from that format, or whether we have ML metadata 
attached to the DataFrame. We should be able to inspect the vectors to get the 
size in the absence of the metadata.
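
A minimal sketch of that fallback (my own illustration, assuming a non-empty dataset with a 
vector column named "features", e.g. the {{dfTemp}} from the reproduction in the description):

{code}
import org.apache.spark.ml.linalg.Vector

// Derive numFeatures by inspecting the first vector when no ML metadata is attached.
val numFeatures = dfTemp.select("features").first().getAs[Vector](0).size
{code}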



> Can't write LibSVM - key not found: numFeatures
> ---
>
> Key: SPARK-21723
> URL: https://issues.apache.org/jira/browse/SPARK-21723
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, ML
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Jan Vršovský
>
> Writing a dataset to LibSVM format raises an exception
> {{java.util.NoSuchElementException: key not found: numFeatures}}
> Happens only when the dataset was NOT read from a LibSVM format before 
> (because otherwise numFeatures is in its metadata). Steps to reproduce:
> {{import org.apache.spark.ml.linalg.Vectors
> val rawData = Seq((1.0, Vectors.sparse(3, Seq((0, 2.0), (1, 3.0,
>   (4.0, Vectors.sparse(3, Seq((0, 5.0), (2, 6.0)
> val dfTemp = spark.sparkContext.parallelize(rawData).toDF("label", "features")
> dfTemp.coalesce(1).write.format("libsvm").save("...filename...")}}
> PR with a fix and unit test is ready.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21723) Can't write LibSVM - key not found: numFeatures

2017-08-15 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-21723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Vršovský updated SPARK-21723:
-
Description: 
Writing a dataset to LibSVM format raises an exception

{{java.util.NoSuchElementException: key not found: numFeatures}}

Happens only when the dataset was NOT read from a LibSVM format before (because 
otherwise numFeatures is in its metadata). Steps to reproduce:

{code:scala}
import org.apache.spark.ml.linalg.Vectors
val rawData = Seq((1.0, Vectors.sparse(3, Seq((0, 2.0), (1, 3.0,
  (4.0, Vectors.sparse(3, Seq((0, 5.0), (2, 6.0)
val dfTemp = spark.sparkContext.parallelize(rawData).toDF("label", "features")
dfTemp.coalesce(1).write.format("libsvm").save("...filename...")
{code}

PR with a fix and unit test is ready - see 
[https://github.com/apache/spark/pull/18872].

  was:
Writing a dataset to LibSVM format raises an exception

{{java.util.NoSuchElementException: key not found: numFeatures}}

Happens only when the dataset was NOT read from a LibSVM format before (because 
otherwise numFeatures is in its metadata). Steps to reproduce:

{{import org.apache.spark.ml.linalg.Vectors
val rawData = Seq((1.0, Vectors.sparse(3, Seq((0, 2.0), (1, 3.0,
  (4.0, Vectors.sparse(3, Seq((0, 5.0), (2, 6.0)
val dfTemp = spark.sparkContext.parallelize(rawData).toDF("label", "features")
dfTemp.coalesce(1).write.format("libsvm").save("...filename...")}}

PR with a fix and unit test is ready.


> Can't write LibSVM - key not found: numFeatures
> ---
>
> Key: SPARK-21723
> URL: https://issues.apache.org/jira/browse/SPARK-21723
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, ML
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Jan Vršovský
>
> Writing a dataset to LibSVM format raises an exception
> {{java.util.NoSuchElementException: key not found: numFeatures}}
> Happens only when the dataset was NOT read from a LibSVM format before 
> (because otherwise numFeatures is in its metadata). Steps to reproduce:
> {code:scala}
> import org.apache.spark.ml.linalg.Vectors
> val rawData = Seq((1.0, Vectors.sparse(3, Seq((0, 2.0), (1, 3.0,
>   (4.0, Vectors.sparse(3, Seq((0, 5.0), (2, 6.0)
> val dfTemp = spark.sparkContext.parallelize(rawData).toDF("label", "features")
> dfTemp.coalesce(1).write.format("libsvm").save("...filename...")
> {code}
> PR with a fix and unit test is ready - see 
> [https://github.com/apache/spark/pull/18872].



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21691) Accessing canonicalized plan for query with limit throws exception

2017-08-15 Thread Bjoern Toldbod (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16126947#comment-16126947
 ] 

Bjoern Toldbod commented on SPARK-21691:


The suggested solution works for me. 

Thanks

> Accessing canonicalized plan for query with limit throws exception
> --
>
> Key: SPARK-21691
> URL: https://issues.apache.org/jira/browse/SPARK-21691
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Bjoern Toldbod
>
> Accessing the logical, canonicalized plan fails for queries with limits.
> The following demonstrates the issue:
> {code:java}
> val session = SparkSession.builder.master("local").getOrCreate()
> // This works
> session.sql("select * from (values 0, 
> 1)").queryExecution.logical.canonicalized
> // This fails
> session.sql("select * from (values 0, 1) limit 
> 1").queryExecution.logical.canonicalized
> {code}
> The message in the thrown exception is somewhat confusing (or at least not 
> directly related to the limit):
> "Invalid call to toAttribute on unresolved object, tree: *"



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21652) Optimizer cannot reach a fixed point on certain queries

2017-08-15 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16126917#comment-16126917
 ] 

Takeshi Yamamuro edited comment on SPARK-21652 at 8/15/17 8:08 AM:
---

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/18882


was (Author: maropu):
The pr is: https://github.com/apache/spark/pull/18882

> Optimizer cannot reach a fixed point on certain queries
> ---
>
> Key: SPARK-21652
> URL: https://issues.apache.org/jira/browse/SPARK-21652
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.2.0
>Reporter: Anton Okolnychyi
>
> The optimizer cannot reach a fixed point on the following query:
> {code}
> Seq((1, 2)).toDF("col1", "col2").write.saveAsTable("t1")
> Seq(1, 2).toDF("col").write.saveAsTable("t2")
> spark.sql("SELECT * FROM t1, t2 WHERE t1.col1 = 1 AND 1 = t1.col2 AND t1.col1 
> = t2.col AND t1.col2 = t2.col").explain(true)
> {code}
> At some point during the optimization, InferFiltersFromConstraints infers a 
> new constraint '(col2#33 = col1#32)' that is appended to the join condition, 
> then PushPredicateThroughJoin pushes it down, ConstantPropagation replaces 
> '(col2#33 = col1#32)' with '1 = 1' based on other propagated constraints, 
> ConstantFolding replaces '1 = 1' with 'true', and BooleanSimplification finally 
> removes this predicate. However, InferFiltersFromConstraints will again infer 
> '(col2#33 = col1#32)' on the next iteration and the process will continue 
> until the limit of iterations is reached. 
> See below for more details
> {noformat}
> === Applying Rule 
> org.apache.spark.sql.catalyst.optimizer.InferFiltersFromConstraints ===
> !Join Inner, ((col1#32 = col#34) && (col2#33 = col#34))   
> Join Inner, ((col2#33 = col1#32) && ((col1#32 = col#34) && 
> (col2#33 = col#34)))
>  :- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && 
> (1 = col2#33)))   :- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && 
> ((col1#32 = 1) && (1 = col2#33)))
>  :  +- Relation[col1#32,col2#33] parquet  
> :  +- Relation[col1#32,col2#33] parquet
>  +- Filter ((1 = col#34) && isnotnull(col#34))
> +- Filter ((1 = col#34) && isnotnull(col#34))
> +- Relation[col#34] parquet   
>+- Relation[col#34] parquet
> 
> === Applying Rule 
> org.apache.spark.sql.catalyst.optimizer.PushPredicateThroughJoin ===
> !Join Inner, ((col2#33 = col1#32) && ((col1#32 = col#34) && (col2#33 = 
> col#34)))  Join Inner, ((col1#32 = col#34) && (col2#33 = col#34))
> !:- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && 
> (1 = col2#33)))   :- Filter (col2#33 = col1#32)
> !:  +- Relation[col1#32,col2#33] parquet  
> :  +- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && 
> ((col1#32 = 1) && (1 = col2#33)))
> !+- Filter ((1 = col#34) && isnotnull(col#34))
> : +- Relation[col1#32,col2#33] parquet
> !   +- Relation[col#34] parquet   
> +- Filter ((1 = col#34) && isnotnull(col#34))
> ! 
>+- Relation[col#34] parquet
> 
> === Applying Rule org.apache.spark.sql.catalyst.optimizer.CombineFilters ===
>  Join Inner, ((col1#32 = col#34) && (col2#33 = col#34))   
>Join Inner, ((col1#32 = col#34) && (col2#33 = col#34))
> !:- Filter (col2#33 = col1#32)
>:- Filter (((isnotnull(col1#32) && isnotnull(col2#33)) && 
> ((col1#32 = 1) && (1 = col2#33))) && (col2#33 = col1#32))
> !:  +- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) 
> && (1 = col2#33)))   :  +- Relation[col1#32,col2#33] parquet
> !: +- Relation[col1#32,col2#33] parquet   
>+- Filter ((1 = col#34) && isnotnull(col#34))
> !+- Filter ((1 = col#34) && isnotnull(col#34))
>   +- Relation[col#34] parquet
> !   +- Relation[col#34] parquet   
>
> 
> === Applying Rule org.apache.spark.sql.catalyst.optimizer.ConstantPropagation 
> ===
>  Join Inner, ((col1#32 = col#34) && (col2#33 = col#34))   
>  Join Inner, ((col1#32 = col#34) &

[jira] [Comment Edited] (SPARK-21363) Prevent column name duplication in temporary view

2017-08-15 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16126847#comment-16126847
 ] 

Takeshi Yamamuro edited comment on SPARK-21363 at 8/15/17 8:08 AM:
---

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/18938


was (Author: maropu):
This pr is: https://github.com/apache/spark/pull/18938

> Prevent column name duplication in temporary view
> -
>
> Key: SPARK-21363
> URL: https://issues.apache.org/jira/browse/SPARK-21363
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> In SPARK-20460, we refactored some existing checks for column name 
> duplication. The master currently allows name duplication for temporary views, 
> as in the example below; IMO we should prevent it for temporary views as well, 
> since the master already prevents this case for permanent views.
> {code}
> scala> Seq((1, 1)).toDF("a", "a").createOrReplaceTempView("t")
> scala> sql("SELECT * FROM t").show
> +---+---+
> |  a|  a|
> +---+---+
> |  1|  1|
> +---+---+
> {code}
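
Until the check is added, a minimal sketch of a guard that can be applied in user code 
before registering the view (createTempViewChecked below is a hypothetical helper, not 
part of Spark):

{code:scala}
import org.apache.spark.sql.DataFrame

// Hypothetical helper: fail fast on duplicate column names before creating the
// temporary view (case-insensitive by default, roughly matching spark.sql.caseSensitive=false).
def createTempViewChecked(df: DataFrame, name: String, caseSensitive: Boolean = false): Unit = {
  val names =
    if (caseSensitive) df.schema.fieldNames.toSeq
    else df.schema.fieldNames.toSeq.map(_.toLowerCase)
  val duplicated = names.groupBy(identity).collect { case (n, occ) if occ.size > 1 => n }
  require(duplicated.isEmpty, s"Found duplicate column(s): ${duplicated.mkString(", ")}")
  df.createOrReplaceTempView(name)
}

// createTempViewChecked(Seq((1, 1)).toDF("a", "a"), "t")  // would throw instead of registering
{code}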






[jira] [Commented] (SPARK-21652) Optimizer cannot reach a fixed point on certain queries

2017-08-15 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16126917#comment-16126917
 ] 

Takeshi Yamamuro commented on SPARK-21652:
--

The pr is: https://github.com/apache/spark/pull/18882

> Optimizer cannot reach a fixed point on certain queries
> ---
>
> Key: SPARK-21652
> URL: https://issues.apache.org/jira/browse/SPARK-21652
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.2.0
>Reporter: Anton Okolnychyi
>
> The optimizer cannot reach a fixed point on the following query:
> {code}
> Seq((1, 2)).toDF("col1", "col2").write.saveAsTable("t1")
> Seq(1, 2).toDF("col").write.saveAsTable("t2")
> spark.sql("SELECT * FROM t1, t2 WHERE t1.col1 = 1 AND 1 = t1.col2 AND t1.col1 
> = t2.col AND t1.col2 = t2.col").explain(true)
> {code}
> At some point during the optimization, InferFiltersFromConstraints infers a 
> new constraint '(col2#33 = col1#32)' that is appended to the join condition, 
> then PushPredicateThroughJoin pushes it down, ConstantPropagation replaces 
> '(col2#33 = col1#32)' with '1 = 1' based on other propagated constraints, 
> ConstantFolding replaces '1 = 1' with 'true', and BooleanSimplification finally 
> removes this predicate. However, InferFiltersFromConstraints will again infer 
> '(col2#33 = col1#32)' on the next iteration and the process will continue 
> until the limit of iterations is reached. 
> See below for more details
> {noformat}
> === Applying Rule 
> org.apache.spark.sql.catalyst.optimizer.InferFiltersFromConstraints ===
> !Join Inner, ((col1#32 = col#34) && (col2#33 = col#34))   
> Join Inner, ((col2#33 = col1#32) && ((col1#32 = col#34) && 
> (col2#33 = col#34)))
>  :- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && 
> (1 = col2#33)))   :- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && 
> ((col1#32 = 1) && (1 = col2#33)))
>  :  +- Relation[col1#32,col2#33] parquet  
> :  +- Relation[col1#32,col2#33] parquet
>  +- Filter ((1 = col#34) && isnotnull(col#34))
> +- Filter ((1 = col#34) && isnotnull(col#34))
> +- Relation[col#34] parquet   
>+- Relation[col#34] parquet
> 
> === Applying Rule 
> org.apache.spark.sql.catalyst.optimizer.PushPredicateThroughJoin ===
> !Join Inner, ((col2#33 = col1#32) && ((col1#32 = col#34) && (col2#33 = 
> col#34)))  Join Inner, ((col1#32 = col#34) && (col2#33 = col#34))
> !:- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && 
> (1 = col2#33)))   :- Filter (col2#33 = col1#32)
> !:  +- Relation[col1#32,col2#33] parquet  
> :  +- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && 
> ((col1#32 = 1) && (1 = col2#33)))
> !+- Filter ((1 = col#34) && isnotnull(col#34))
> : +- Relation[col1#32,col2#33] parquet
> !   +- Relation[col#34] parquet   
> +- Filter ((1 = col#34) && isnotnull(col#34))
> ! 
>+- Relation[col#34] parquet
> 
> === Applying Rule org.apache.spark.sql.catalyst.optimizer.CombineFilters ===
>  Join Inner, ((col1#32 = col#34) && (col2#33 = col#34))   
>Join Inner, ((col1#32 = col#34) && (col2#33 = col#34))
> !:- Filter (col2#33 = col1#32)
>:- Filter (((isnotnull(col1#32) && isnotnull(col2#33)) && 
> ((col1#32 = 1) && (1 = col2#33))) && (col2#33 = col1#32))
> !:  +- Filter ((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) 
> && (1 = col2#33)))   :  +- Relation[col1#32,col2#33] parquet
> !: +- Relation[col1#32,col2#33] parquet   
>+- Filter ((1 = col#34) && isnotnull(col#34))
> !+- Filter ((1 = col#34) && isnotnull(col#34))
>   +- Relation[col#34] parquet
> !   +- Relation[col#34] parquet   
>
> 
> === Applying Rule org.apache.spark.sql.catalyst.optimizer.ConstantPropagation 
> ===
>  Join Inner, ((col1#32 = col#34) && (col2#33 = col#34))   
>  Join Inner, ((col1#32 = col#34) && 
> (col2#33 = col#34))
> !:- Filter (((isnotnull(col1#32) && isnotnull(col2#33)) && ((col1#32 = 1) && 
> (1 = col2#33))) && (col2#33 = col1#32))   :- Filter (((isnotnull(c

[jira] [Commented] (SPARK-21481) Add indexOf method in ml.feature.HashingTF similar to mllib.feature.HashingTF

2017-08-15 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16126914#comment-16126914
 ] 

Yanbo Liang commented on SPARK-21481:
-

See my comments at https://github.com/apache/spark/pull/18736 .

> Add indexOf method in ml.feature.HashingTF similar to mllib.feature.HashingTF
> -
>
> Key: SPARK-21481
> URL: https://issues.apache.org/jira/browse/SPARK-21481
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Aseem Bansal
>
> If we want to find the index of an input based on the hashing trick, it is 
> possible with 
> https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.mllib.feature.HashingTF
>  but not with 
> https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.ml.feature.HashingTF.
> We should allow that for feature parity.
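
In the meantime, a minimal sketch of how the index can be looked up through the 
RDD-based class, assuming numFeatures and the hash algorithm are kept aligned with the 
ml transformer (otherwise the buckets will not match):

{code:scala}
// Sketch: use mllib's HashingTF.indexOf as a stopgap, configured to match the
// ml transformer (murmur3 hashing, same number of features).
import org.apache.spark.ml.feature.{HashingTF => MLHashingTF}
import org.apache.spark.mllib.feature.{HashingTF => OldHashingTF}

val numFeatures = 1 << 18  // ml.feature.HashingTF default
val mlTF = new MLHashingTF()
  .setInputCol("words")
  .setOutputCol("tf")
  .setNumFeatures(numFeatures)

val oldTF = new OldHashingTF(numFeatures).setHashAlgorithm("murmur3")
val idx = oldTF.indexOf("spark")  // bucket "spark" maps to in vectors produced by mlTF
{code}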






[jira] [Commented] (SPARK-21034) Allow filter pushdown filters through non deterministic functions for columns involved in groupby / join

2017-08-15 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16126902#comment-16126902
 ] 

Takeshi Yamamuro commented on SPARK-21034:
--

It seems this is a duplicate of SPARK-21707. If so, we should move the discussion 
there.

> Allow filter pushdown filters through non deterministic functions for columns 
> involved in groupby / join
> 
>
> Key: SPARK-21034
> URL: https://issues.apache.org/jira/browse/SPARK-21034
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Abhijit Bhole
>
> If the column is involved in an aggregation / join, then pushing down the filter 
> should not change the results.
> Here is a code sample:
> {code:java}
> from pyspark.sql import functions as F
> df = spark.createDataFrame([{ "a": 1, "b" : 2, "c":7}, { "a": 3, "b" : 4, "c" 
> : 8},
>{ "a": 1, "b" : 5, "c":7}, { "a": 1, "b" : 6, 
> "c":7} ])
> df.groupBy(["a"]).agg(F.sum("b")).where("a = 1").explain()
> df.groupBy(["a"]).agg(F.sum("b"), F.first("c")).where("a = 1").explain()
> == Physical Plan ==
> *HashAggregate(keys=[a#15L], functions=[sum(b#16L)])
> +- Exchange hashpartitioning(a#15L, 4)
>+- *HashAggregate(keys=[a#15L], functions=[partial_sum(b#16L)])
>   +- *Project [a#15L, b#16L]
>  +- *Filter (isnotnull(a#15L) && (a#15L = 1))
> +- Scan ExistingRDD[a#15L,b#16L,c#17L]
> >>>
> >>> df.groupBy(["a"]).agg(F.sum("b"), F.first("c")).where("a = 1").explain()
> == Physical Plan ==
> *Filter (isnotnull(a#15L) && (a#15L = 1))
> +- *HashAggregate(keys=[a#15L], functions=[sum(b#16L), first(c#17L, false)])
>+- Exchange hashpartitioning(a#15L, 4)
>   +- *HashAggregate(keys=[a#15L], functions=[partial_sum(b#16L), 
> partial_first(c#17L, false)])
>  +- Scan ExistingRDD[a#15L,b#16L,c#17L]
> {code}
> As you can see, the filter is not pushed down when the F.first aggregate 
> function is used.
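
As a workaround sketch (rendered in Scala here, and assuming the predicate only 
references grouping columns, in which case filtering before the aggregation is 
equivalent), the filter can simply be applied before the groupBy so it reaches the scan:

{code:scala}
// Sketch: push the filter manually by applying it before the aggregation.
// Assumes a spark-shell session (spark.implicits in scope for toDF).
import org.apache.spark.sql.functions.{first, sum}

val df = Seq((1, 2, 7), (3, 4, 8), (1, 5, 7), (1, 6, 7)).toDF("a", "b", "c")

df.where("a = 1")
  .groupBy("a")
  .agg(sum("b"), first("c"))
  .explain()
{code}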






[jira] [Commented] (SPARK-21733) ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM

2017-08-15 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16126894#comment-16126894
 ] 

Saisai Shao commented on SPARK-21733:
-

Is there any issue here?

> ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM
> -
>
> Key: SPARK-21733
> URL: https://issues.apache.org/jira/browse/SPARK-21733
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.1.1
> Environment: Apache Spark2.1.1 
> CDH5.12.0 Yarn
>Reporter: Jepson
>  Labels: patch
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> Kafka + Spark Streaming throws these errors:
> {code:java}
> 17/08/15 09:34:14 INFO memory.MemoryStore: Block broadcast_8003_piece0 stored 
> as bytes in memory (estimated size 1895.0 B, free 1643.2 MB)
> 17/08/15 09:34:14 INFO broadcast.TorrentBroadcast: Reading broadcast variable 
> 8003 took 11 ms
> 17/08/15 09:34:14 INFO memory.MemoryStore: Block broadcast_8003 stored as 
> values in memory (estimated size 2.9 KB, free 1643.2 MB)
> 17/08/15 09:34:14 INFO kafka010.KafkaRDD: Beginning offset 10130733 is the 
> same as ending offset skipping kssh 5
> 17/08/15 09:34:14 INFO executor.Executor: Finished task 7.0 in stage 8003.0 
> (TID 64178). 1740 bytes result sent to driver
> 17/08/15 09:34:21 INFO storage.BlockManager: Removing RDD 8002
> 17/08/15 09:34:21 INFO executor.CoarseGrainedExecutorBackend: Got assigned 
> task 64186
> 17/08/15 09:34:21 INFO executor.Executor: Running task 7.0 in stage 8004.0 
> (TID 64186)
> 17/08/15 09:34:21 INFO broadcast.TorrentBroadcast: Started reading broadcast 
> variable 8004
> 17/08/15 09:34:21 INFO memory.MemoryStore: Block broadcast_8004_piece0 stored 
> as bytes in memory (estimated size 1895.0 B, free 1643.2 MB)
> 17/08/15 09:34:21 INFO broadcast.TorrentBroadcast: Reading broadcast variable 
> 8004 took 8 ms
> 17/08/15 09:34:21 INFO memory.MemoryStore: Block broadcast_8004 stored as 
> values in memory (estimated size 2.9 KB, free 1643.2 MB)
> 17/08/15 09:34:21 INFO kafka010.KafkaRDD: Beginning offset 10130733 is the 
> same as ending offset skipping kssh 5
> 17/08/15 09:34:21 INFO executor.Executor: Finished task 7.0 in stage 8004.0 
> (TID 64186). 1740 bytes result sent to driver
> 17/08/15 09:34:29 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED 
> SIGNAL TERM
> 17/08/15 09:34:29 INFO storage.DiskBlockManager: Shutdown hook called
> 17/08/15 09:34:29 INFO util.ShutdownHookManager: Shutdown hook called
> {code}





