[jira] [Updated] (SPARK-18949) Add recoverPartitions API to Catalog

2016-12-20 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18949:

Fix Version/s: (was: 2.11)
   2.1.1

> Add recoverPartitions API to Catalog
> 
>
> Key: SPARK-18949
> URL: https://issues.apache.org/jira/browse/SPARK-18949
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.1.1
>
>
> Currently, we only have a SQL interface for recovering all the partitions in 
> the directory of a table and updating the catalog: `MSCK REPAIR TABLE` or 
> `ALTER TABLE table RECOVER PARTITIONS`. (`MSCK` is hard to remember, and its 
> meaning is not obvious.)
> After the new "Scalable Partition Handling", table repair becomes much more 
> important for making the data in a newly created, partitioned data source 
> table visible.
> It is desirable to add this to the Catalog interface so that users can repair 
> the table with
> {noformat}
> spark.catalog.recoverPartitions("testTable")
> {noformat}
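
For context, a minimal sketch of what this looks like today through the SQL interface versus with the proposed Catalog call (the table name is a placeholder):

{code}
// Today: repair the table through the SQL interface
spark.sql("MSCK REPAIR TABLE testTable")
// or equivalently
spark.sql("ALTER TABLE testTable RECOVER PARTITIONS")

// Proposed: a direct Catalog API call
spark.catalog.recoverPartitions("testTable")
{code}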



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18949) Add recoverPartitions API to Catalog

2016-12-20 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18949:

Fix Version/s: (was: 2.2.0)

> Add recoverPartitions API to Catalog
> 
>
> Key: SPARK-18949
> URL: https://issues.apache.org/jira/browse/SPARK-18949
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.1.1
>
>
> Currently, we only have a SQL interface for recovering all the partitions in 
> the directory of a table and updating the catalog: `MSCK REPAIR TABLE` or 
> `ALTER TABLE table RECOVER PARTITIONS`. (`MSCK` is hard to remember, and its 
> meaning is not obvious.)
> After the new "Scalable Partition Handling", table repair becomes much more 
> important for making the data in a newly created, partitioned data source 
> table visible.
> It is desirable to add this to the Catalog interface so that users can repair 
> the table with
> {noformat}
> spark.catalog.recoverPartitions("testTable")
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18949) Add recoverPartitions API to Catalog

2016-12-20 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-18949.
-
   Resolution: Fixed
Fix Version/s: 2.2.0
   2.11

> Add recoverPartitions API to Catalog
> 
>
> Key: SPARK-18949
> URL: https://issues.apache.org/jira/browse/SPARK-18949
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.11, 2.2.0
>
>
> Currently, we only have a SQL interface for recovering all the partitions in 
> the directory of a table and updating the catalog: `MSCK REPAIR TABLE` or 
> `ALTER TABLE table RECOVER PARTITIONS`. (`MSCK` is hard to remember, and its 
> meaning is not obvious.)
> After the new "Scalable Partition Handling", table repair becomes much more 
> important for making the data in a newly created, partitioned data source 
> table visible.
> It is desirable to add this to the Catalog interface so that users can repair 
> the table with
> {noformat}
> spark.catalog.recoverPartitions("testTable")
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18932) Partial aggregation for collect_set / collect_list

2016-12-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15766342#comment-15766342
 ] 

Apache Spark commented on SPARK-18932:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/16371

> Partial aggregation for collect_set / collect_list
> --
>
> Key: SPARK-18932
> URL: https://issues.apache.org/jira/browse/SPARK-18932
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Structured Streaming
>Reporter: Michael Armbrust
>
> The lack of partial aggregation here is blocking us from using these in 
> streaming.  It still won't be fast, but it would be nice to at least be able 
> to use them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18932) Partial aggregation for collect_set / collect_list

2016-12-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18932:


Assignee: Apache Spark

> Partial aggregation for collect_set / collect_list
> --
>
> Key: SPARK-18932
> URL: https://issues.apache.org/jira/browse/SPARK-18932
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Structured Streaming
>Reporter: Michael Armbrust
>Assignee: Apache Spark
>
> The lack of partial aggregation here is blocking us from using these in 
> streaming.  It still won't be fast, but it would be nice to at least be able 
> to use them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18932) Partial aggregation for collect_set / collect_list

2016-12-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18932:


Assignee: (was: Apache Spark)

> Partial aggregation for collect_set / collect_list
> --
>
> Key: SPARK-18932
> URL: https://issues.apache.org/jira/browse/SPARK-18932
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Structured Streaming
>Reporter: Michael Armbrust
>
> The lack of partial aggregation here is blocking us from using these in 
> streaming.  It still won't be fast, but it would be nice to at least be able 
> to use them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-18946) treeAggregate is inefficient when aggregating high-dimension vectors in ML algorithms

2016-12-20 Thread zunwen you (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zunwen you closed SPARK-18946.
--
Resolution: Duplicate

> treeAggregate is inefficient when aggregating high-dimension vectors in ML 
> algorithms
> ---
>
> Key: SPARK-18946
> URL: https://issues.apache.org/jira/browse/SPARK-18946
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Reporter: zunwen you
>  Labels: features
>
> In many machine learning algorithms, we have to treeAggregate large 
> vectors/arrays because of the large number of features. Unfortunately, the 
> RDD treeAggregate operation is inefficient when the dimension of the 
> vectors/arrays exceeds a million: such high-dimensional vectors/arrays often 
> occupy more than 100 MB of memory, and transferring 100 MB elements among 
> executors is quite inefficient in Spark.
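
For reference, a minimal sketch of the treeAggregate pattern being described, assuming dense Array[Double] vectors (the helper name is illustrative, not from this issue):

{code}
import org.apache.spark.rdd.RDD

// Sum dense vectors of dimension `dim` with treeAggregate.
def sumVectors(data: RDD[Array[Double]], dim: Int): Array[Double] =
  data.treeAggregate(new Array[Double](dim))(
    (acc, v) => { var i = 0; while (i < dim) { acc(i) += v(i); i += 1 }; acc },
    (a, b)   => { var i = 0; while (i < dim) { a(i) += b(i); i += 1 }; a },
    2) // depth; each merge step still ships roughly dim * 8 bytes between executors
{code}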



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18821) Bisecting k-means wrapper in SparkR

2016-12-20 Thread Miao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15766299#comment-15766299
 ] 

Miao Wang commented on SPARK-18821:
---

I can work on this one, if it is not urgent. Thanks!

> Bisecting k-means wrapper in SparkR
> ---
>
> Key: SPARK-18821
> URL: https://issues.apache.org/jira/browse/SPARK-18821
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SparkR
>Reporter: Felix Cheung
>
> Implement a wrapper in SparkR to support bisecting k-means



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18960) Avoid double reading file which is being copied.

2016-12-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18960:


Assignee: Apache Spark

> Avoid double reading file which is being copied.
> 
>
> Key: SPARK-18960
> URL: https://issues.apache.org/jira/browse/SPARK-18960
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Structured Streaming
>Affects Versions: 2.0.2
>Reporter: Genmao Yu
>Assignee: Apache Spark
>
> In HDFS, when we copy a file into a target directory, a temporary {{._COPY_}} 
> file exists for a period of time; the duration depends on the file size. If we 
> do not skip this file, we may read the same data twice.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18960) Avoid double reading file which is being copied.

2016-12-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15766274#comment-15766274
 ] 

Apache Spark commented on SPARK-18960:
--

User 'uncleGen' has created a pull request for this issue:
https://github.com/apache/spark/pull/16370

> Avoid double reading file which is being copied.
> 
>
> Key: SPARK-18960
> URL: https://issues.apache.org/jira/browse/SPARK-18960
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Structured Streaming
>Affects Versions: 2.0.2
>Reporter: Genmao Yu
>
> In HDFS, when we copy a file into a target directory, a temporary {{._COPY_}} 
> file exists for a period of time; the duration depends on the file size. If we 
> do not skip this file, we may read the same data twice.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18960) Avoid double reading file which is being copied.

2016-12-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18960:


Assignee: (was: Apache Spark)

> Avoid double reading file which is being copied.
> 
>
> Key: SPARK-18960
> URL: https://issues.apache.org/jira/browse/SPARK-18960
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Structured Streaming
>Affects Versions: 2.0.2
>Reporter: Genmao Yu
>
> In HDFS, when we copy a file into a target directory, a temporary {{._COPY_}} 
> file exists for a period of time; the duration depends on the file size. If we 
> do not skip this file, we may read the same data twice.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18960) Avoid double reading file which is being copied.

2016-12-20 Thread Genmao Yu (JIRA)
Genmao Yu created SPARK-18960:
-

 Summary: Avoid double reading file which is being copied.
 Key: SPARK-18960
 URL: https://issues.apache.org/jira/browse/SPARK-18960
 Project: Spark
  Issue Type: Bug
  Components: SQL, Structured Streaming
Affects Versions: 2.0.2
Reporter: Genmao Yu


In HDFS, when we copy a file into a target directory, a temporary {{._COPY_}} 
file exists for a period of time; the duration depends on the file size. If we 
do not skip this file, we may read the same data twice.
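
A rough sketch of the kind of check the fix implies; the helper name and the exact temporary-file suffixes are assumptions for illustration, not the actual patch:

{code}
import org.apache.hadoop.fs.Path

// Skip files that are still being copied into the directory.
def isInFlightCopy(path: Path): Boolean = {
  val name = path.getName
  name.endsWith("._COPY_") || name.endsWith("._COPYING_")
}
{code}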



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18036) Decision Trees do not handle edge cases

2016-12-20 Thread Weichen Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15766243#comment-15766243
 ] 

Weichen Xu commented on SPARK-18036:


Oh, I've been too busy recently to work on it; it would be great if you could 
resolve it. Thanks!

> Decision Trees do not handle edge cases
> ---
>
> Key: SPARK-18036
> URL: https://issues.apache.org/jira/browse/SPARK-18036
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Reporter: Seth Hendrickson
>Priority: Minor
>
> Decision trees/GBT/RF do not handle edge cases such as constant features or 
> empty features. For example:
> {code}
> val dt = new DecisionTreeRegressor()
> val data = Seq(LabeledPoint(1.0, Vectors.dense(Array.empty[Double]))).toDF()
> dt.fit(data)
> java.lang.UnsupportedOperationException: empty.max
>   at scala.collection.TraversableOnce$class.max(TraversableOnce.scala:229)
>   at scala.collection.mutable.ArrayOps$ofInt.max(ArrayOps.scala:234)
>   at 
> org.apache.spark.ml.tree.impl.DecisionTreeMetadata$.buildMetadata(DecisionTreeMetadata.scala:207)
>   at org.apache.spark.ml.tree.impl.RandomForest$.run(RandomForest.scala:105)
>   at 
> org.apache.spark.ml.regression.DecisionTreeRegressor.train(DecisionTreeRegressor.scala:93)
>   at 
> org.apache.spark.ml.regression.DecisionTreeRegressor.train(DecisionTreeRegressor.scala:46)
>   at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
>   ... 52 elided
> {code}
> as well as 
> {code}
> val dt = new DecisionTreeRegressor()
> val data = Seq(LabeledPoint(1.0, Vectors.dense(0.0, 0.0, 0.0))).toDF()
> dt.fit(data)
> java.lang.UnsupportedOperationException: empty.maxBy
> at scala.collection.TraversableOnce$class.maxBy(TraversableOnce.scala:236)
> at 
> scala.collection.SeqViewLike$AbstractTransformed.maxBy(SeqViewLike.scala:37)
> at 
> org.apache.spark.ml.tree.impl.RandomForest$.binsToBestSplit(RandomForest.scala:846)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-18036) Decision Trees do not handle edge cases

2016-12-20 Thread Weichen Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu updated SPARK-18036:
---
Comment: was deleted

(was: i am working on this... )

> Decision Trees do not handle edge cases
> ---
>
> Key: SPARK-18036
> URL: https://issues.apache.org/jira/browse/SPARK-18036
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Reporter: Seth Hendrickson
>Priority: Minor
>
> Decision trees/GBT/RF do not handle edge cases such as constant features or 
> empty features. For example:
> {code}
> val dt = new DecisionTreeRegressor()
> val data = Seq(LabeledPoint(1.0, Vectors.dense(Array.empty[Double]))).toDF()
> dt.fit(data)
> java.lang.UnsupportedOperationException: empty.max
>   at scala.collection.TraversableOnce$class.max(TraversableOnce.scala:229)
>   at scala.collection.mutable.ArrayOps$ofInt.max(ArrayOps.scala:234)
>   at 
> org.apache.spark.ml.tree.impl.DecisionTreeMetadata$.buildMetadata(DecisionTreeMetadata.scala:207)
>   at org.apache.spark.ml.tree.impl.RandomForest$.run(RandomForest.scala:105)
>   at 
> org.apache.spark.ml.regression.DecisionTreeRegressor.train(DecisionTreeRegressor.scala:93)
>   at 
> org.apache.spark.ml.regression.DecisionTreeRegressor.train(DecisionTreeRegressor.scala:46)
>   at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
>   ... 52 elided
> {code}
> as well as 
> {code}
> val dt = new DecisionTreeRegressor()
> val data = Seq(LabeledPoint(1.0, Vectors.dense(0.0, 0.0, 0.0))).toDF()
> dt.fit(data)
> java.lang.UnsupportedOperationException: empty.maxBy
> at scala.collection.TraversableOnce$class.maxBy(TraversableOnce.scala:236)
> at 
> scala.collection.SeqViewLike$AbstractTransformed.maxBy(SeqViewLike.scala:37)
> at 
> org.apache.spark.ml.tree.impl.RandomForest$.binsToBestSplit(RandomForest.scala:846)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18956) Python API should reuse existing SparkSession while creating new SQLContext instances

2016-12-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18956:


Assignee: (was: Apache Spark)

> Python API should reuse existing SparkSession while creating new SQLContext 
> instances
> -
>
> Key: SPARK-18956
> URL: https://issues.apache.org/jira/browse/SPARK-18956
> Project: Spark
>  Issue Type: Bug
>Reporter: Cheng Lian
>
> We did this for the Scala API in Spark 2.0 but didn't update the Python API 
> accordingly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18956) Python API should reuse existing SparkSession while creating new SQLContext instances

2016-12-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15766118#comment-15766118
 ] 

Apache Spark commented on SPARK-18956:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/16369

> Python API should reuse existing SparkSession while creating new SQLContext 
> instances
> -
>
> Key: SPARK-18956
> URL: https://issues.apache.org/jira/browse/SPARK-18956
> Project: Spark
>  Issue Type: Bug
>Reporter: Cheng Lian
>
> We did this for the Scala API in Spark 2.0 but didn't update the Python API 
> accordingly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18956) Python API should reuse existing SparkSession while creating new SQLContext instances

2016-12-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18956:


Assignee: Apache Spark

> Python API should reuse existing SparkSession while creating new SQLContext 
> instances
> -
>
> Key: SPARK-18956
> URL: https://issues.apache.org/jira/browse/SPARK-18956
> Project: Spark
>  Issue Type: Bug
>Reporter: Cheng Lian
>Assignee: Apache Spark
>
> We did this for the Scala API in Spark 2.0 but didn't update the Python API 
> accordingly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18800) Correct the assert in UnsafeKVExternalSorter which ensures array size

2016-12-20 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15766039#comment-15766039
 ] 

Liang-Chi Hsieh commented on SPARK-18800:
-

Note: this JIRA is motivated by the issue reported on the dev mailing list at 
http://apache-spark-developers-list.1001551.n3.nabble.com/java-lang-IllegalStateException-There-is-no-space-for-new-record-tc20108.html

Since we don't have a repro, we can't be sure exactly where the root cause is. 
But I fixed the assert in UnsafeKVExternalSorter, so in the future we can easily 
tell whether this array is the problem.

> Correct the assert in UnsafeKVExternalSorter which ensures array size
> -
>
> Key: SPARK-18800
> URL: https://issues.apache.org/jira/browse/SPARK-18800
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>
> UnsafeKVExternalSorter uses UnsafeInMemorySorter to sort the records of a 
> BytesToBytesMap if it is given a map.
> Currently we use the number of keys in the BytesToBytesMap to determine whether 
> the array used for sorting is large enough. We have an assert that ensures the 
> size of the array is enough: map.numKeys() <= map.getArray().size() / 2.
> However, each record in the map takes two entries in the array: one is the 
> record pointer and the other is the key prefix. So the correct assert should be 
> map.numKeys() * 2 <= map.getArray().size() / 2.
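
Expressed as a small sketch of the size relation described above (the helper name is illustrative, not the actual patch):

{code}
// Each map record needs two slots in the sort array:
// one for the record pointer and one for the key prefix.
def hasEnoughSpace(numKeys: Long, arraySize: Long): Boolean =
  numKeys * 2 <= arraySize / 2  // corrected; the old check was numKeys <= arraySize / 2
{code}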



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18959) invalid resource statistics for standalone cluster

2016-12-20 Thread hustfxj (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hustfxj updated SPARK-18959:

Attachment: 屏幕快照 2016-12-21 11.49.12.png

The attachment is the master page

> invalid resource statistics for standalone cluster
> --
>
> Key: SPARK-18959
> URL: https://issues.apache.org/jira/browse/SPARK-18959
> Project: Spark
>  Issue Type: Bug
>Reporter: hustfxj
> Attachments: 屏幕快照 2016-12-21 11.49.12.png
>
>
> Workers
> Worker Id Address State   Cores   Memory
> worker-20161220162751-10.125.6.222-59295  10.125.6.222:59295  ALIVE   
> 4 (-1 Used) 6.8 GB (-1073741824.0 B Used)
> worker-20161220164233-10.218.135.80-10944 10.218.135.80:10944 ALIVE   
> 4 (0 Used)  6.8 GB (0.0 B Used)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18959) invalid resource statistics for standalone cluster

2016-12-20 Thread hustfxj (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15766020#comment-15766020
 ] 

hustfxj commented on SPARK-18959:
-

The attachment is the master page.

> invalid resource statistics for standalone cluster
> --
>
> Key: SPARK-18959
> URL: https://issues.apache.org/jira/browse/SPARK-18959
> Project: Spark
>  Issue Type: Bug
>Reporter: hustfxj
>
> Workers
> Worker Id Address State   Cores   Memory
> worker-20161220162751-10.125.6.222-59295  10.125.6.222:59295  ALIVE   
> 4 (-1 Used) 6.8 GB (-1073741824.0 B Used)
> worker-20161220164233-10.218.135.80-10944 10.218.135.80:10944 ALIVE   
> 4 (0 Used)  6.8 GB (0.0 B Used)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18959) invalid resource statistics for standalone cluster

2016-12-20 Thread hustfxj (JIRA)
hustfxj created SPARK-18959:
---

 Summary: invalid resource statistics for standalone cluster
 Key: SPARK-18959
 URL: https://issues.apache.org/jira/browse/SPARK-18959
 Project: Spark
  Issue Type: Bug
Reporter: hustfxj



Workers

Worker Id   Address State   Cores   Memory
worker-20161220162751-10.125.6.222-5929510.125.6.222:59295  ALIVE   
4 (-1 Used) 6.8 GB (-1073741824.0 B Used)
worker-20161220164233-10.218.135.80-10944   10.218.135.80:10944 ALIVE   
4 (0 Used)  6.8 GB (0.0 B Used)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18900) Flaky Test: StateStoreSuite.maintenance

2016-12-20 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-18900.
---
   Resolution: Fixed
 Assignee: Burak Yavuz
Fix Version/s: 2.1.1

> Flaky Test: StateStoreSuite.maintenance
> ---
>
> Key: SPARK-18900
> URL: https://issues.apache.org/jira/browse/SPARK-18900
> Project: Spark
>  Issue Type: Test
>  Components: Structured Streaming
>Affects Versions: 2.0.2
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
> Fix For: 2.1.1
>
>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70223/testReport/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18900) Flaky Test: StateStoreSuite.maintenance

2016-12-20 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-18900:
--
Fix Version/s: 2.2.0

> Flaky Test: StateStoreSuite.maintenance
> ---
>
> Key: SPARK-18900
> URL: https://issues.apache.org/jira/browse/SPARK-18900
> Project: Spark
>  Issue Type: Test
>  Components: Structured Streaming
>Affects Versions: 2.0.2
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
> Fix For: 2.1.1, 2.2.0
>
>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70223/testReport/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18958) SparkR should support toJSON on DataFrame

2016-12-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765965#comment-15765965
 ] 

Apache Spark commented on SPARK-18958:
--

User 'felixcheung' has created a pull request for this issue:
https://github.com/apache/spark/pull/16368

> SparkR should support toJSON on DataFrame
> -
>
> Key: SPARK-18958
> URL: https://issues.apache.org/jira/browse/SPARK-18958
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>Priority: Minor
>
> It makes it easier to interoperate with other components (especially since R 
> does not have JSON support built in).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18958) SparkR should support toJSON on DataFrame

2016-12-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18958:


Assignee: Felix Cheung  (was: Apache Spark)

> SparkR should support toJSON on DataFrame
> -
>
> Key: SPARK-18958
> URL: https://issues.apache.org/jira/browse/SPARK-18958
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>Priority: Minor
>
> It makes it easier to interoperate with other components (especially since R 
> does not have JSON support built in).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18958) SparkR should support toJSON on DataFrame

2016-12-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18958:


Assignee: Apache Spark  (was: Felix Cheung)

> SparkR should support toJSON on DataFrame
> -
>
> Key: SPARK-18958
> URL: https://issues.apache.org/jira/browse/SPARK-18958
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Assignee: Apache Spark
>Priority: Minor
>
> It makes it easier to interoperate with other components (especially since R 
> does not have JSON support built in).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18958) SparkR should support toJSON on DataFrame

2016-12-20 Thread Felix Cheung (JIRA)
Felix Cheung created SPARK-18958:


 Summary: SparkR should support toJSON on DataFrame
 Key: SPARK-18958
 URL: https://issues.apache.org/jira/browse/SPARK-18958
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.1.0
Reporter: Felix Cheung
Assignee: Felix Cheung
Priority: Minor


It makes it easier to interoperate with other components (especially since R 
does not have JSON support built in).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18941) Spark thrift server, Spark 2.0.2, The "drop table" command doesn't delete the directory associated with the Hive table (not EXTERNAL table) from the HDFS file system

2016-12-20 Thread luat (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

luat updated SPARK-18941:
-
Description: Spark thrift server, Spark 2.0.2, The "drop table" command 
doesn't delete the directory associated with the Hive table (not EXTERNAL 
table) from the HDFS file system.  (was: Spark thrift server, Spark 2.0.2, The 
"drop table" command doesn't delete the directory associated with the table 
(not EXTERNAL table) from the file system.)
Summary: Spark thrift server, Spark 2.0.2, The "drop table" command 
doesn't delete the directory associated with the Hive table (not EXTERNAL 
table) from the HDFS file system  (was: Spark thrift server, Spark 2.0.2, The 
"drop table" command doesn't delete the directory associated with the table 
(not EXTERNAL table) from the file system)

> Spark thrift server, Spark 2.0.2, The "drop table" command doesn't delete the 
> directory associated with the Hive table (not EXTERNAL table) from the HDFS 
> file system
> -
>
> Key: SPARK-18941
> URL: https://issues.apache.org/jira/browse/SPARK-18941
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.0.2
>Reporter: luat
>
> Spark thrift server, Spark 2.0.2, The "drop table" command doesn't delete the 
> directory associated with the Hive table (not EXTERNAL table) from the HDFS 
> file system.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18903) uiWebUrl is not accessible to SparkR

2016-12-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765900#comment-15765900
 ] 

Apache Spark commented on SPARK-18903:
--

User 'felixcheung' has created a pull request for this issue:
https://github.com/apache/spark/pull/16367

> uiWebUrl is not accessible to SparkR
> 
>
> Key: SPARK-18903
> URL: https://issues.apache.org/jira/browse/SPARK-18903
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR, Web UI
>Affects Versions: 2.0.2
>Reporter: Diogo Munaro Vieira
>Priority: Minor
>
> Like https://issues.apache.org/jira/browse/SPARK-17437, uiWebUrl is not 
> accessible from the SparkR context.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18903) uiWebUrl is not accessible to SparkR

2016-12-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18903:


Assignee: (was: Apache Spark)

> uiWebUrl is not accessible to SparkR
> 
>
> Key: SPARK-18903
> URL: https://issues.apache.org/jira/browse/SPARK-18903
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR, Web UI
>Affects Versions: 2.0.2
>Reporter: Diogo Munaro Vieira
>Priority: Minor
>
> Like https://issues.apache.org/jira/browse/SPARK-17437, uiWebUrl is not 
> accessible from the SparkR context.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18903) uiWebUrl is not accessible to SparkR

2016-12-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18903:


Assignee: Apache Spark

> uiWebUrl is not accessible to SparkR
> 
>
> Key: SPARK-18903
> URL: https://issues.apache.org/jira/browse/SPARK-18903
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR, Web UI
>Affects Versions: 2.0.2
>Reporter: Diogo Munaro Vieira
>Assignee: Apache Spark
>Priority: Minor
>
> Like https://issues.apache.org/jira/browse/SPARK-17437, uiWebUrl is not 
> accessible from the SparkR context.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18957) when WAL times out, data is lost

2016-12-20 Thread xy7 (JIRA)
xy7 created SPARK-18957:
---

 Summary: when WAL times out, data is lost
 Key: SPARK-18957
 URL: https://issues.apache.org/jira/browse/SPARK-18957
 Project: Spark
  Issue Type: Bug
  Components: Block Manager
Affects Versions: 1.6.2
 Environment: hadoop2.7.1 spark1.6.2
Reporter: xy7
Priority: Critical
 Fix For: 1.6.2


When Spark Streaming's BlockManager write-ahead log times out, it causes block 
pushing to stop, with this warning in the log:
"WARN scheduler.ReceiverTracker: Error reported by receiver for stream 0: Error 
in block pushing thread - java.util.concurrent.TimeoutException: Futures timed 
out after [30 seconds]".
But the receiver continues to get data from MQ, so that data is lost. In the 
Spark UI, all the batches turn to 0 events.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18956) Python API should reuse existing SparkSession while creating new SQLContext instances

2016-12-20 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-18956:
--

 Summary: Python API should reuse existing SparkSession while 
creating new SQLContext instances
 Key: SPARK-18956
 URL: https://issues.apache.org/jira/browse/SPARK-18956
 Project: Spark
  Issue Type: Bug
Reporter: Cheng Lian


We did this for the Scala API in Spark 2.0 but didn't update the Python API 
accordingly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18931) Create empty staging directory in partitioned table on insert

2016-12-20 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765755#comment-15765755
 ] 

Dongjoon Hyun commented on SPARK-18931:
---

Hi, [~smilegator].
This seems to be related to SPARK-18931.
Could you test this as well when you make a PR for SPARK-18931 against the 
master branch?

> Create empty staging directory in partitioned table on insert
> -
>
> Key: SPARK-18931
> URL: https://issues.apache.org/jira/browse/SPARK-18931
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Egor Pahomov
>
> CREATE TABLE temp.test_partitioning_4 (
>   num string
>  ) 
> PARTITIONED BY (
>   day string)
>   stored as parquet
> On every 
> INSERT INTO TABLE temp.test_partitioning_4 PARTITION (day)
> select day, count(*) as num from 
> hss.session where year=2016 and month=4 
> group by day
> a new directory, 
> ".hive-staging_hive_2016-12-19_15-55-11_298_3412488541559534475-4", is created 
> on HDFS. This is a big issue, because I insert every day, and a bunch of empty 
> directories is very bad for HDFS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18955) Add ability to emit kafka events to DStream or KafkaDStream

2016-12-20 Thread Russell Jurney (JIRA)
Russell Jurney created SPARK-18955:
--

 Summary: Add ability to emit kafka events to DStream or 
KafkaDStream
 Key: SPARK-18955
 URL: https://issues.apache.org/jira/browse/SPARK-18955
 Project: Spark
  Issue Type: New Feature
  Components: DStreams, PySpark
Affects Versions: 2.0.2
Reporter: Russell Jurney


Any I/O in Spark Streaming seems to have to be done in a DStream.foreachRDD 
loop. For instance, in PySpark, if I want to emit a Kafka event for each 
record, I have to call DStream.foreachRDD and use kafka-python to emit the 
events.

It seems like I/O of this kind should be part of the pyspark.streaming or 
pyspark.streaming.kafka API and the equivalent Scala APIs. Something like 
DStream.emitKafkaEvents or KafkaDStream.emitKafkaEvents would make sense.

If this is a good idea and it seems feasible, I'd like to take a crack at it as 
my first patch for Spark. Advice would be appreciated: what would need to be 
modified to make this happen?
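
For concreteness, the Scala analogue of the workaround described above looks roughly like this (a sketch; the broker address, topic, and serializers are placeholders):

{code}
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.streaming.dstream.DStream

// Today: emitting Kafka events means dropping into foreachRDD/foreachPartition
// and managing a producer by hand.
def emitToKafka(stream: DStream[String], brokers: String, topic: String): Unit =
  stream.foreachRDD { rdd =>
    rdd.foreachPartition { records =>
      val props = new Properties()
      props.put("bootstrap.servers", brokers)
      props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      val producer = new KafkaProducer[String, String](props)
      records.foreach(r => producer.send(new ProducerRecord[String, String](topic, r)))
      producer.close()
    }
  }
{code}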



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18710) Add offset to GeneralizedLinearRegression models

2016-12-20 Thread Wayne Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765733#comment-15765733
 ] 

Wayne Zhang commented on SPARK-18710:
-

[~yanboliang] Thanks for the suggestion. I think the issue is a bit different 
in this case. The IRWLS relies on the _reweightFunc_, which is hard-coded to 
take an _Instance_ class:
{code}  
val reweightFunc: (Instance, WeightedLeastSquaresModel) => (Double, Double)
{code}

I need to pass the offset column to this reweight function. Creating another 
GLRInstance won't solve the problem, will it?

> Add offset to GeneralizedLinearRegression models
> 
>
> Key: SPARK-18710
> URL: https://issues.apache.org/jira/browse/SPARK-18710
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.0.2
>Reporter: Wayne Zhang
>Assignee: Wayne Zhang
>  Labels: features
>   Original Estimate: 10h
>  Remaining Estimate: 10h
>
> The current GeneralizedLinearRegression model does not support an offset. An 
> offset can be useful for taking exposure into account, or for testing the 
> incremental effect of new variables. Although it is possible to use weights to 
> achieve the same effect as an offset for certain models (e.g., Poisson and 
> Binomial with a log offset), it is desirable to have an offset option that 
> works in more general cases, e.g., a negative offset or an offset that is hard 
> to specify using weights (e.g., an offset applied to the probability rather 
> than the odds in logistic regression).
> The effort would involve:
> * updating the regression class to support offsetCol
> * updating IWLS to take the offset into account
> * adding test cases for the offset
> I can start working on this if the community approves this feature. 
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18953) Do not show the link to a dead worker on the master page

2016-12-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18953:


Assignee: (was: Apache Spark)

> Do not show the link to a dead worker on the master page
> 
>
> Key: SPARK-18953
> URL: https://issues.apache.org/jira/browse/SPARK-18953
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Reporter: Yin Huai
>
> The master page still seems to show links to dead workers. For a dead worker, 
> we will not be able to see its worker page anyway, so it seems to make sense 
> not to show links to dead workers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18953) Do not show the link to a dead worker on the master page

2016-12-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18953:


Assignee: Apache Spark

> Do not show the link to a dead worker on the master page
> 
>
> Key: SPARK-18953
> URL: https://issues.apache.org/jira/browse/SPARK-18953
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Reporter: Yin Huai
>Assignee: Apache Spark
>
> The master page still seems to show links to dead workers. For a dead worker, 
> we will not be able to see its worker page anyway, so it seems to make sense 
> not to show links to dead workers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18953) Do not show the link to a dead worker on the master page

2016-12-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765726#comment-15765726
 ] 

Apache Spark commented on SPARK-18953:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/16366

> Do not show the link to a dead worker on the master page
> 
>
> Key: SPARK-18953
> URL: https://issues.apache.org/jira/browse/SPARK-18953
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Reporter: Yin Huai
>
> The master page still seems to show links to dead workers. For a dead worker, 
> we will not be able to see its worker page anyway, so it seems to make sense 
> not to show links to dead workers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18953) Do not show the link to a dead worker on the master page

2016-12-20 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765690#comment-15765690
 ] 

Dongjoon Hyun commented on SPARK-18953:
---

Hi, [~yhuai].
If you haven't started this yet, may I make a PR for it?

> Do not show the link to a dead worker on the master page
> 
>
> Key: SPARK-18953
> URL: https://issues.apache.org/jira/browse/SPARK-18953
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Reporter: Yin Huai
>
> The master page still seems to show links to dead workers. For a dead worker, 
> we will not be able to see its worker page anyway, so it seems to make sense 
> not to show links to dead workers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18928) FileScanRDD, JDBCRDD, and UnsafeSorter should support task cancellation

2016-12-20 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-18928:
-
Fix Version/s: 2.0.3

> FileScanRDD, JDBCRDD, and UnsafeSorter should support task cancellation
> ---
>
> Key: SPARK-18928
> URL: https://issues.apache.org/jira/browse/SPARK-18928
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 2.0.3, 2.1.1, 2.2.0
>
>
> Spark tasks respond to cancellation by checking 
> {{TaskContext.isInterrupted()}}, but this check is missing on a few critical 
> paths used in Spark SQL, including FileScanRDD, JDBCRDD, and 
> UnsafeSorter-based sorts. This can cause interrupted / cancelled tasks to 
> continue running and become zombies.
> Here's an example: first, create a giant text file. In my case, I just 
> concatenated /usr/share/dict/words a bunch of times to produce a 2.75 gig 
> file. Then, run a really slow query over that file and try to cancel it:
> {code}
> spark.read.text("/tmp/words").selectExpr("value + value + value").collect()
> {code}
> This will sit and churn at 100% CPU for a minute or two because the task 
> isn't checking the interrupted flag.
> The solution here is to add InterruptibleIterator-style checks to the few 
> locations where they're currently missing in Spark SQL.
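
Schematically, the missing checks amount to a wrapper like the following (a sketch that mirrors Spark's existing InterruptibleIterator behavior; the class name here is illustrative, not the actual patch):

{code}
import org.apache.spark.{TaskContext, TaskKilledException}

// Stop consuming as soon as the task has been marked as killed/cancelled.
class CancellationCheckingIterator[T](context: TaskContext, delegate: Iterator[T])
    extends Iterator[T] {
  override def hasNext: Boolean = {
    if (context.isInterrupted()) throw new TaskKilledException
    delegate.hasNext
  }
  override def next(): T = delegate.next()
}
{code}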



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18950) Report conflicting fields when merging two StructTypes.

2016-12-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18950:


Assignee: (was: Apache Spark)

> Report conflicting fields when merging two StructTypes.
> ---
>
> Key: SPARK-18950
> URL: https://issues.apache.org/jira/browse/SPARK-18950
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Cheng Lian
>Priority: Minor
>  Labels: starter
>
> Currently, {{StructType.merge()}} only reports data types of conflicting 
> fields when merging two incompatible schemas. It would be nice to also report 
> the field names for easier debugging.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18950) Report conflicting fields when merging two StructTypes.

2016-12-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765624#comment-15765624
 ] 

Apache Spark commented on SPARK-18950:
--

User 'bravo-zhang' has created a pull request for this issue:
https://github.com/apache/spark/pull/16365

> Report conflicting fields when merging two StructTypes.
> ---
>
> Key: SPARK-18950
> URL: https://issues.apache.org/jira/browse/SPARK-18950
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Cheng Lian
>Priority: Minor
>  Labels: starter
>
> Currently, {{StructType.merge()}} only reports data types of conflicting 
> fields when merging two incompatible schemas. It would be nice to also report 
> the field names for easier debugging.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18950) Report conflicting fields when merging two StructTypes.

2016-12-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18950:


Assignee: Apache Spark

> Report conflicting fields when merging two StructTypes.
> ---
>
> Key: SPARK-18950
> URL: https://issues.apache.org/jira/browse/SPARK-18950
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Cheng Lian
>Assignee: Apache Spark
>Priority: Minor
>  Labels: starter
>
> Currently, {{StructType.merge()}} only reports data types of conflicting 
> fields when merging two incompatible schemas. It would be nice to also report 
> the field names for easier debugging.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18761) Uncancellable / unkillable tasks may starve jobs of resources

2016-12-20 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-18761:
-
Fix Version/s: 2.1.1
   2.0.3

> Uncancellable / unkillable tasks may starve jobs of resources
> 
>
> Key: SPARK-18761
> URL: https://issues.apache.org/jira/browse/SPARK-18761
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 2.0.3, 2.1.1, 2.2.0
>
>
> Spark's current task cancellation / task killing mechanism is "best effort" 
> in the sense that some tasks may not be interruptible and may not respond to 
> their "killed" flags being set. If a significant fraction of a cluster's task 
> slots are occupied by tasks that have been marked as killed but remain 
> running then this can lead to a situation where new jobs and tasks are 
> starved of resources because zombie tasks are holding resources.
> I propose to address this problem by introducing a "task reaper" mechanism in 
> executors to monitor tasks after they are marked for killing in order to 
> periodically re-attempt the task kill, capture and log stacktraces / warnings 
> if tasks do not exit in a timely manner, and, optionally, kill the entire 
> executor JVM if cancelled tasks cannot be killed within some timeout.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18301) VectorAssembler does not support StructTypes

2016-12-20 Thread Ilya Matiach (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765611#comment-15765611
 ] 

Ilya Matiach commented on SPARK-18301:
--

For example, when I use HashingTF I get the same sort of error:

case class Bar(x: String, y: String)
case class Foo(bar: Bar)
val df = sc.parallelize(Seq(Foo(Bar("hello", "world")))).toDF

at this point I see:
scala> df.printSchema
root
 |-- bar: struct (nullable = true)
 ||-- x: string (nullable = true)
 ||-- y: string (nullable = true)

Using a hashing TF:
val htf = new HashingTF().setInputCol("bar.x")
htf.transform(df)

Error:
java.lang.IllegalArgumentException: Field "bar.x" does not exist.

> VectorAssembler does not support StructTypes
> 
>
> Key: SPARK-18301
> URL: https://issues.apache.org/jira/browse/SPARK-18301
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.0.1
> Environment: Windows Standalone Mode, Java
>Reporter: Steffen Herbold
>Priority: Minor
>
> I tried to transform a structured type using the VectorAssembler as follows:
> {code:java}
> VectorAssembler va = new VectorAssembler().setInputCols(new String[]
> { "metrics.Line", "metrics.McCC" }).setOutputCol("features");
> dataframe= va.transform(dataframe);
> {code}
> This yields the following exception:
> {code:java}
> Exception in thread "main" java.lang.IllegalArgumentException: Field 
> "metrics.McCC" does not exist.
>   at 
> org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:228)
>   at 
> org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:228)
>   at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
>   at scala.collection.AbstractMap.getOrElse(Map.scala:59)
>   at org.apache.spark.sql.types.StructType.apply(StructType.scala:227)
>   at 
> org.apache.spark.ml.feature.VectorAssembler$$anonfun$5.apply(VectorAssembler.scala:116)
>   at 
> org.apache.spark.ml.feature.VectorAssembler$$anonfun$5.apply(VectorAssembler.scala:116)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
>   at 
> org.apache.spark.ml.feature.VectorAssembler.transformSchema(VectorAssembler.scala:116)
>   at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:70)
>   at 
> org.apache.spark.ml.feature.VectorAssembler.transform(VectorAssembler.scala:54)
>   at 
> de.ugoe.cs.smartshark.jobs.DefectPredictionExample.main(DefectPredictionExample.java:53)
> {code}
> The schema of the dataframe is:
> {noformat}
>  |-- metrics: struct (nullable = true)
>  ||-- Line: double (nullable = true)
>  ||-- McCC: double (nullable = true)
> ...
> {noformat}
> The transformation works if I first use withColumn to make "metrics.Line" and 
> "metrics.McCC" into columns of the dataframe:
> {code:java}
> dataframe = dataframe.withColumn("Line", dataframe.col("metrics.Line"));
> dataframe = dataframe.withColumn("McCC", dataframe.col("metrics.McCC"));
> VectorAssembler va = new VectorAssembler().setInputCols(new String[]
> { "Line", "McCC" }).setOutputCol("features");
> fileState = va.transform(dataframe);
> {code}
> However, this workaround is quite costly, and direct support for accessing the 
> nested values would be very helpful.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18576) Expose basic TaskContext info in PySpark

2016-12-20 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-18576.
-
   Resolution: Fixed
 Assignee: holdenk
Fix Version/s: 2.2.0

> Expose basic TaskContext info in PySpark
> 
>
> Key: SPARK-18576
> URL: https://issues.apache.org/jira/browse/SPARK-18576
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: holdenk
>Assignee: holdenk
>Priority: Minor
> Fix For: 2.2.0
>
>
> Currently the TaskContext info isn't exposed in PySpark. While we don't need 
> to expose the full TaskContext information in Python, it would make sense to 
> expose the public APIs in Python for users who are doing custom logging or 
> job handling with the task id or retry attempt.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18301) VectorAssembler does not support StructTypes

2016-12-20 Thread Ilya Matiach (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765596#comment-15765596
 ] 

Ilya Matiach commented on SPARK-18301:
--

I am able to reproduce this, but I'm not sure whether this is actually a bug or 
a feature request. Are any other Spark transformers or estimators able to work 
on structured types like this?

> VectorAssembler does not support StructTypes
> 
>
> Key: SPARK-18301
> URL: https://issues.apache.org/jira/browse/SPARK-18301
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.0.1
> Environment: Windows Standalone Mode, Java
>Reporter: Steffen Herbold
>Priority: Minor
>
> I tried to transform a structured type using the VectorAssembler as follows:
> {code:java}
> VectorAssembler va = new VectorAssembler().setInputCols(new String[]
> { "metrics.Line", "metrics.McCC" }).setOutputCol("features");
> dataframe = va.transform(dataframe);
> {code}
> This yields the following exception:
> {code:java}
> Exception in thread "main" java.lang.IllegalArgumentException: Field 
> "metrics.McCC" does not exist.
>   at 
> org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:228)
>   at 
> org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:228)
>   at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
>   at scala.collection.AbstractMap.getOrElse(Map.scala:59)
>   at org.apache.spark.sql.types.StructType.apply(StructType.scala:227)
>   at 
> org.apache.spark.ml.feature.VectorAssembler$$anonfun$5.apply(VectorAssembler.scala:116)
>   at 
> org.apache.spark.ml.feature.VectorAssembler$$anonfun$5.apply(VectorAssembler.scala:116)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
>   at 
> org.apache.spark.ml.feature.VectorAssembler.transformSchema(VectorAssembler.scala:116)
>   at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:70)
>   at 
> org.apache.spark.ml.feature.VectorAssembler.transform(VectorAssembler.scala:54)
>   at 
> de.ugoe.cs.smartshark.jobs.DefectPredictionExample.main(DefectPredictionExample.java:53)
> {code}
> The schema of the dataframe is:
> {noformat}
>  |-- metrics: struct (nullable = true)
>  |    |-- Line: double (nullable = true)
>  |    |-- McCC: double (nullable = true)
> ...
> {noformat}
> The transformation works if I first use withColumn to turn "metrics.Line" and 
> "metrics.McCC" into top-level columns of the dataframe:
> {code:java}
> // Lift the nested struct fields into top-level columns, then assemble on those.
> dataframe = dataframe.withColumn("Line", dataframe.col("metrics.Line"));
> dataframe = dataframe.withColumn("McCC", dataframe.col("metrics.McCC"));
> VectorAssembler va = new VectorAssembler().setInputCols(new String[]
> { "Line", "McCC" }).setOutputCol("features");
> dataframe = va.transform(dataframe);
> {code}
> However, this workaround is quite costly, and direct support for accessing the 
> nested values would be very helpful.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18954) Fix flaky test: o.a.s.streaming.BasicOperationsSuite rdd cleanup - map and window

2016-12-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18954:


Assignee: Shixiong Zhu  (was: Apache Spark)

> Fix flaky test: o.a.s.streaming.BasicOperationsSuite rdd cleanup - map and 
> window
> -
>
> Key: SPARK-18954
> URL: https://issues.apache.org/jira/browse/SPARK-18954
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>  Labels: flaky-test
>
> https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.streaming.BasicOperationsSuite_name=rdd+cleanup+-+map+and+window



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18954) Fix flaky test: o.a.s.streaming.BasicOperationsSuite rdd cleanup - map and window

2016-12-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765581#comment-15765581
 ] 

Apache Spark commented on SPARK-18954:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/16362

> Fix flaky test: o.a.s.streaming.BasicOperationsSuite rdd cleanup - map and 
> window
> -
>
> Key: SPARK-18954
> URL: https://issues.apache.org/jira/browse/SPARK-18954
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>  Labels: flaky-test
>
> https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.streaming.BasicOperationsSuite_name=rdd+cleanup+-+map+and+window



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18954) Fix flaky test: o.a.s.streaming.BasicOperationsSuite rdd cleanup - map and window

2016-12-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18954:


Assignee: Apache Spark  (was: Shixiong Zhu)

> Fix flaky test: o.a.s.streaming.BasicOperationsSuite rdd cleanup - map and 
> window
> -
>
> Key: SPARK-18954
> URL: https://issues.apache.org/jira/browse/SPARK-18954
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>  Labels: flaky-test
>
> https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.streaming.BasicOperationsSuite_name=rdd+cleanup+-+map+and+window



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18954) Fix flaky test: o.a.s.streaming.BasicOperationsSuite rdd cleanup - map and window

2016-12-20 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-18954:


 Summary: Fix flaky test: o.a.s.streaming.BasicOperationsSuite rdd 
cleanup - map and window
 Key: SPARK-18954
 URL: https://issues.apache.org/jira/browse/SPARK-18954
 Project: Spark
  Issue Type: Bug
  Components: Tests
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu


https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.streaming.BasicOperationsSuite_name=rdd+cleanup+-+map+and+window



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5632) not able to resolve dot('.') in field name

2016-12-20 Thread William Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765558#comment-15765558
 ] 

William Shen commented on SPARK-5632:
-

[~marmbrus],
Found another weirdness with a dot in a field name that might be of interest too.
In 1.5.0,
{code}
import org.apache.spark.sql.functions._
val data = Seq(("test1","test2","test3")).toDF("col1", "col with . in it", 
"col3"); data.withColumn("col1", trim(data("col1")))
{code}
fails with
{code}
org.apache.spark.sql.AnalysisException: cannot resolve 'col with . in it' given 
input columns col1, col with . in it, col3;
at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:56)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:53)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:293)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:292)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:290)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:249)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:279)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:290)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:108)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:118)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:122)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:122)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:126)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:126)
at 

[jira] [Assigned] (SPARK-18952) regex strings not properly escaped in codegen for aggregations

2016-12-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18952:


Assignee: Apache Spark

> regex strings not properly escaped in codegen for aggregations
> --
>
> Key: SPARK-18952
> URL: https://issues.apache.org/jira/browse/SPARK-18952
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Burak Yavuz
>Assignee: Apache Spark
>
> If I use the function regexp_extract and then use `\` (the escape character) in 
> my regex string, codegen fails because the `\` character is not properly 
> escaped in the generated Java code.
> Example stack trace:
> {code}
> /* 059 */ private int maxSteps = 2;
> /* 060 */ private int numRows = 0;
> /* 061 */ private org.apache.spark.sql.types.StructType keySchema = new 
> org.apache.spark.sql.types.StructType().add("date_format(window#325.start, 
> -MM-dd HH:mm)", org.apache.spark.sql.types.DataTypes.StringType)
> /* 062 */ .add("regexp_extract(source#310.description, ([a-zA-Z]+)\[.*, 
> 1)", org.apache.spark.sql.types.DataTypes.StringType);
> /* 063 */ private org.apache.spark.sql.types.StructType valueSchema = new 
> org.apache.spark.sql.types.StructType().add("sum", 
> org.apache.spark.sql.types.DataTypes.LongType);
> /* 064 */ private Object emptyVBase;
> ...
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 62, Column 58: Invalid escape sequence
>   at org.codehaus.janino.Scanner.scanLiteralCharacter(Scanner.java:918)
>   at org.codehaus.janino.Scanner.produce(Scanner.java:604)
>   at org.codehaus.janino.Parser.peekRead(Parser.java:3239)
>   at org.codehaus.janino.Parser.parseArguments(Parser.java:3055)
>   at org.codehaus.janino.Parser.parseSelector(Parser.java:2914)
>   at org.codehaus.janino.Parser.parseUnaryExpression(Parser.java:2617)
>   at 
> org.codehaus.janino.Parser.parseMultiplicativeExpression(Parser.java:2573)
>   at org.codehaus.janino.Parser.parseAdditiveExpression(Parser.java:2552)
> {code}
> In the code-generated expression, the literal should use `\\` instead of `\`.
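A minimal reproduction sketch for an affected version (2.0.2); it assumes a SparkSession named spark (e.g. in spark-shell), and the column name and data are illustrative. The backslash in the pattern is what ends up unescaped in the generated Java source:
{code}
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq("abc[1", "def[2").toDF("description")

// Grouping by a regexp_extract whose pattern contains a backslash should trigger
// the "Invalid escape sequence" codegen failure described above on affected versions.
df.groupBy(regexp_extract($"description", "([a-zA-Z]+)\\[.*", 1))
  .count()
  .show()
{code}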



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18952) regex strings not properly escaped in codegen for aggregations

2016-12-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765539#comment-15765539
 ] 

Apache Spark commented on SPARK-18952:
--

User 'brkyvz' has created a pull request for this issue:
https://github.com/apache/spark/pull/16361

> regex strings not properly escaped in codegen for aggregations
> --
>
> Key: SPARK-18952
> URL: https://issues.apache.org/jira/browse/SPARK-18952
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Burak Yavuz
>
> If I use the function regexp_extract and then use `\` (the escape character) in 
> my regex string, codegen fails because the `\` character is not properly 
> escaped in the generated Java code.
> Example stack trace:
> {code}
> /* 059 */ private int maxSteps = 2;
> /* 060 */ private int numRows = 0;
> /* 061 */ private org.apache.spark.sql.types.StructType keySchema = new 
> org.apache.spark.sql.types.StructType().add("date_format(window#325.start, 
> -MM-dd HH:mm)", org.apache.spark.sql.types.DataTypes.StringType)
> /* 062 */ .add("regexp_extract(source#310.description, ([a-zA-Z]+)\[.*, 
> 1)", org.apache.spark.sql.types.DataTypes.StringType);
> /* 063 */ private org.apache.spark.sql.types.StructType valueSchema = new 
> org.apache.spark.sql.types.StructType().add("sum", 
> org.apache.spark.sql.types.DataTypes.LongType);
> /* 064 */ private Object emptyVBase;
> ...
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 62, Column 58: Invalid escape sequence
>   at org.codehaus.janino.Scanner.scanLiteralCharacter(Scanner.java:918)
>   at org.codehaus.janino.Scanner.produce(Scanner.java:604)
>   at org.codehaus.janino.Parser.peekRead(Parser.java:3239)
>   at org.codehaus.janino.Parser.parseArguments(Parser.java:3055)
>   at org.codehaus.janino.Parser.parseSelector(Parser.java:2914)
>   at org.codehaus.janino.Parser.parseUnaryExpression(Parser.java:2617)
>   at 
> org.codehaus.janino.Parser.parseMultiplicativeExpression(Parser.java:2573)
>   at org.codehaus.janino.Parser.parseAdditiveExpression(Parser.java:2552)
> {code}
> In the code-generated expression, the literal should use `\\` instead of `\`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18952) regex strings not properly escaped in codegen for aggregations

2016-12-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18952:


Assignee: (was: Apache Spark)

> regex strings not properly escaped in codegen for aggregations
> --
>
> Key: SPARK-18952
> URL: https://issues.apache.org/jira/browse/SPARK-18952
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Burak Yavuz
>
> If I use the function regexp_extract and then use `\` (the escape character) in 
> my regex string, codegen fails because the `\` character is not properly 
> escaped in the generated Java code.
> Example stack trace:
> {code}
> /* 059 */ private int maxSteps = 2;
> /* 060 */ private int numRows = 0;
> /* 061 */ private org.apache.spark.sql.types.StructType keySchema = new 
> org.apache.spark.sql.types.StructType().add("date_format(window#325.start, 
> -MM-dd HH:mm)", org.apache.spark.sql.types.DataTypes.StringType)
> /* 062 */ .add("regexp_extract(source#310.description, ([a-zA-Z]+)\[.*, 
> 1)", org.apache.spark.sql.types.DataTypes.StringType);
> /* 063 */ private org.apache.spark.sql.types.StructType valueSchema = new 
> org.apache.spark.sql.types.StructType().add("sum", 
> org.apache.spark.sql.types.DataTypes.LongType);
> /* 064 */ private Object emptyVBase;
> ...
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 62, Column 58: Invalid escape sequence
>   at org.codehaus.janino.Scanner.scanLiteralCharacter(Scanner.java:918)
>   at org.codehaus.janino.Scanner.produce(Scanner.java:604)
>   at org.codehaus.janino.Parser.peekRead(Parser.java:3239)
>   at org.codehaus.janino.Parser.parseArguments(Parser.java:3055)
>   at org.codehaus.janino.Parser.parseSelector(Parser.java:2914)
>   at org.codehaus.janino.Parser.parseUnaryExpression(Parser.java:2617)
>   at 
> org.codehaus.janino.Parser.parseMultiplicativeExpression(Parser.java:2573)
>   at org.codehaus.janino.Parser.parseAdditiveExpression(Parser.java:2552)
> {code}
> In the code-generated expression, the literal should use `\\` instead of `\`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18952) regex strings not properly escaped in codegen for aggregations

2016-12-20 Thread Burak Yavuz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz updated SPARK-18952:

Summary: regex strings not properly escaped in codegen for aggregations  
(was: regex strings not properly escaped in codegen)

> regex strings not properly escaped in codegen for aggregations
> --
>
> Key: SPARK-18952
> URL: https://issues.apache.org/jira/browse/SPARK-18952
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Burak Yavuz
>
> If I use the function regexp_extract and then use `\` (the escape character) in 
> my regex string, codegen fails because the `\` character is not properly 
> escaped in the generated Java code.
> Example stack trace:
> {code}
> /* 059 */ private int maxSteps = 2;
> /* 060 */ private int numRows = 0;
> /* 061 */ private org.apache.spark.sql.types.StructType keySchema = new 
> org.apache.spark.sql.types.StructType().add("date_format(window#325.start, 
> -MM-dd HH:mm)", org.apache.spark.sql.types.DataTypes.StringType)
> /* 062 */ .add("regexp_extract(source#310.description, ([a-zA-Z]+)\[.*, 
> 1)", org.apache.spark.sql.types.DataTypes.StringType);
> /* 063 */ private org.apache.spark.sql.types.StructType valueSchema = new 
> org.apache.spark.sql.types.StructType().add("sum", 
> org.apache.spark.sql.types.DataTypes.LongType);
> /* 064 */ private Object emptyVBase;
> ...
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 62, Column 58: Invalid escape sequence
>   at org.codehaus.janino.Scanner.scanLiteralCharacter(Scanner.java:918)
>   at org.codehaus.janino.Scanner.produce(Scanner.java:604)
>   at org.codehaus.janino.Parser.peekRead(Parser.java:3239)
>   at org.codehaus.janino.Parser.parseArguments(Parser.java:3055)
>   at org.codehaus.janino.Parser.parseSelector(Parser.java:2914)
>   at org.codehaus.janino.Parser.parseUnaryExpression(Parser.java:2617)
>   at 
> org.codehaus.janino.Parser.parseMultiplicativeExpression(Parser.java:2573)
>   at org.codehaus.janino.Parser.parseAdditiveExpression(Parser.java:2552)
> {code}
> In the code-generated expression, the literal should use `\\` instead of `\`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18832) Spark SQL: Incorrect error message on calling registered UDF.

2016-12-20 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765511#comment-15765511
 ] 

Dongjoon Hyun commented on SPARK-18832:
---

Sorry, it turns out that the case I hit yesterday only occurs in `spark-shell` 
due to the class loader.
It's totally different from this issue.
So far, when I tried `spark-sql`, there seems to be no problem.
In other words, I unfortunately cannot reproduce your case in `spark-sql` so far.
Could you provide a more detailed reproducible example?

> Spark SQL: Incorrect error message on calling registered UDF.
> -
>
> Key: SPARK-18832
> URL: https://issues.apache.org/jira/browse/SPARK-18832
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.0.2
>Reporter: Lokesh Yadav
>
> Calling a registered UDF in the metastore from the spark-sql CLI gives a 
> generic error:
> Error in query: Undefined function: 'Sample_UDF'. This function is neither a 
> registered temporary function nor a permanent function registered in the 
> database 'default'.
> The function is registered and it shows up in the list output by 'show 
> functions'.
> I am using a Hive UDTF, registered with the statement: create function 
> Sample_UDF as 'com.udf.Sample_UDF' using JAR 
> '/local/jar/path/containing/the/class';
> and I am calling the function from the spark-sql CLI as: SELECT 
> Sample_UDF("input_1", "input_2")



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18953) Do not show the link to a dead worker on the master page

2016-12-20 Thread Yin Huai (JIRA)
Yin Huai created SPARK-18953:


 Summary: Do not show the link to a dead worker on the master page
 Key: SPARK-18953
 URL: https://issues.apache.org/jira/browse/SPARK-18953
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Reporter: Yin Huai


The master page still seems to show links to dead workers. For a dead worker, we 
will not be able to see its worker page anyway, so it seems to make sense not to 
show links to dead workers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18234) Update mode in structured streaming

2016-12-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18234:


Assignee: (was: Apache Spark)

> Update mode in structured streaming
> ---
>
> Key: SPARK-18234
> URL: https://issues.apache.org/jira/browse/SPARK-18234
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Reporter: Michael Armbrust
>Priority: Critical
>
> We have this internally, but we should nail down the semantics and expose it to 
> users.  The idea of update mode is that any tuple that changes will be 
> emitted.  Open questions:
>  - Do we need to reason about the {{keys}} for a given stream?  For things 
> like the {{foreach}} sink it's up to the user.  However, for more end-to-end 
> use cases such as a JDBC sink, we need to know which downstream row is being 
> updated.
>  - Is it okay not to support files?
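As a rough illustration of what exposing this to users could look like, assuming the user-facing API follows the existing outputMode pattern on DataStreamWriter (the aggregation, sink, and names below are illustrative assumptions, not the final semantics decided in this issue):
{code}
// inputStream is assumed to be a streaming DataFrame with a "word" column.
val counts = inputStream.groupBy("word").count()

val query = counts.writeStream
  .outputMode("update")   // emit only the rows whose aggregate value changed since the last trigger
  .format("console")
  .start()
{code}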



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18234) Update mode in structured streaming

2016-12-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18234:


Assignee: Apache Spark

> Update mode in structured streaming
> ---
>
> Key: SPARK-18234
> URL: https://issues.apache.org/jira/browse/SPARK-18234
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Reporter: Michael Armbrust
>Assignee: Apache Spark
>Priority: Critical
>
> We have this internally, but we should nail down the semantics and expose it to 
> users.  The idea of update mode is that any tuple that changes will be 
> emitted.  Open questions:
>  - Do we need to reason about the {{keys}} for a given stream?  For things 
> like the {{foreach}} sink it's up to the user.  However, for more end-to-end 
> use cases such as a JDBC sink, we need to know which downstream row is being 
> updated.
>  - Is it okay not to support files?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18234) Update mode in structured streaming

2016-12-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765449#comment-15765449
 ] 

Apache Spark commented on SPARK-18234:
--

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/16360

> Update mode in structured streaming
> ---
>
> Key: SPARK-18234
> URL: https://issues.apache.org/jira/browse/SPARK-18234
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Reporter: Michael Armbrust
>Priority: Critical
>
> We have this internally, but we should nail down the semantics and expose it to 
> users.  The idea of update mode is that any tuple that changes will be 
> emitted.  Open questions:
>  - Do we need to reason about the {{keys}} for a given stream?  For things 
> like the {{foreach}} sink it's up to the user.  However, for more end-to-end 
> use cases such as a JDBC sink, we need to know which downstream row is being 
> updated.
>  - Is it okay not to support files?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18927) MemorySink for StructuredStreaming can't recover from checkpoint if location is provided in conf

2016-12-20 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-18927.
--
   Resolution: Fixed
 Assignee: Burak Yavuz
Fix Version/s: 2.2.0
   2.1.1

> MemorySink for StructuredStreaming can't recover from checkpoint if location 
> is provided in conf
> 
>
> Key: SPARK-18927
> URL: https://issues.apache.org/jira/browse/SPARK-18927
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
> Fix For: 2.1.1, 2.2.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18951) Upgrade com.thoughtworks.paranamer/paranamer to 2.6

2016-12-20 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-18951:
-
Description: 
I recently hit a bug in com.thoughtworks.paranamer/paranamer that causes 
jackson to fail to handle a byte array defined in a case class. I then found 
https://github.com/FasterXML/jackson-module-scala/issues/48, which suggests 
that it is caused by a bug in paranamer. Let's upgrade paranamer. 

Since we are using jackson 2.6.5, and jackson-module-paranamer 2.6.5 uses 
com.thoughtworks.paranamer/paranamer 2.6, I suggest that we upgrade paranamer 
to 2.6. 

  was:
I recently hit a bug of com.thoughtworks.paranamer/paranamer, which causes 
jackson fail to handle byte array defined in a case class. Then I find 
https://github.com/FasterXML/jackson-module-scala/issues/48, which suggests 
that it is caused by a bug in paranamer. Let's upgrade paranamer. 

Since we are using jackson 2.6.5 and jackson-module-paranamer 2.6.5 use 
com.thoughtworks.paranamer/paranamer uses 2.6, I suggests that we upgrade 
paranamer to 2.6. 


> Upgrade com.thoughtworks.paranamer/paranamer to 2.6
> ---
>
> Key: SPARK-18951
> URL: https://issues.apache.org/jira/browse/SPARK-18951
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: Yin Huai
>Assignee: Yin Huai
>
> I recently hit a bug in com.thoughtworks.paranamer/paranamer that causes 
> jackson to fail to handle a byte array defined in a case class. I then found 
> https://github.com/FasterXML/jackson-module-scala/issues/48, which suggests 
> that it is caused by a bug in paranamer. Let's upgrade paranamer. 
> Since we are using jackson 2.6.5, and jackson-module-paranamer 2.6.5 uses 
> com.thoughtworks.paranamer/paranamer 2.6, I suggest that we upgrade 
> paranamer to 2.6. 
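For downstream builds hitting the same problem, a minimal sketch of pinning the transitive dependency to the suggested version; the build tool (sbt) and setting are assumptions for illustration, not how Spark's own build applies the upgrade:
{code}
// build.sbt: force the transitive paranamer version used by jackson-module-paranamer.
dependencyOverrides += "com.thoughtworks.paranamer" % "paranamer" % "2.6"
{code}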



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18951) Upgrade com.thoughtworks.paranamer/paranamer to 2.6

2016-12-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765366#comment-15765366
 ] 

Apache Spark commented on SPARK-18951:
--

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/16359

> Upgrade com.thoughtworks.paranamer/paranamer to 2.6
> ---
>
> Key: SPARK-18951
> URL: https://issues.apache.org/jira/browse/SPARK-18951
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: Yin Huai
>Assignee: Yin Huai
>
> I recently hit a bug in com.thoughtworks.paranamer/paranamer that causes 
> jackson to fail to handle a byte array defined in a case class. I then found 
> https://github.com/FasterXML/jackson-module-scala/issues/48, which suggests 
> that it is caused by a bug in paranamer. Let's upgrade paranamer. 
> Since we are using jackson 2.6.5, and jackson-module-paranamer 2.6.5 uses 
> com.thoughtworks.paranamer/paranamer 2.6, I suggest that we upgrade 
> paranamer to 2.6. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18952) regex strings not properly escaped in codegen

2016-12-20 Thread Burak Yavuz (JIRA)
Burak Yavuz created SPARK-18952:
---

 Summary: regex strings not properly escaped in codegen
 Key: SPARK-18952
 URL: https://issues.apache.org/jira/browse/SPARK-18952
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.2
Reporter: Burak Yavuz


If I use the function regexp_extract and then use `\` (the escape character) in 
my regex string, codegen fails because the `\` character is not properly escaped 
in the generated Java code.

Example stack trace:
{code}
/* 059 */ private int maxSteps = 2;
/* 060 */ private int numRows = 0;
/* 061 */ private org.apache.spark.sql.types.StructType keySchema = new 
org.apache.spark.sql.types.StructType().add("date_format(window#325.start, 
-MM-dd HH:mm)", org.apache.spark.sql.types.DataTypes.StringType)
/* 062 */ .add("regexp_extract(source#310.description, ([a-zA-Z]+)\[.*, 
1)", org.apache.spark.sql.types.DataTypes.StringType);
/* 063 */ private org.apache.spark.sql.types.StructType valueSchema = new 
org.apache.spark.sql.types.StructType().add("sum", 
org.apache.spark.sql.types.DataTypes.LongType);
/* 064 */ private Object emptyVBase;

...

org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 62, 
Column 58: Invalid escape sequence
at org.codehaus.janino.Scanner.scanLiteralCharacter(Scanner.java:918)
at org.codehaus.janino.Scanner.produce(Scanner.java:604)
at org.codehaus.janino.Parser.peekRead(Parser.java:3239)
at org.codehaus.janino.Parser.parseArguments(Parser.java:3055)
at org.codehaus.janino.Parser.parseSelector(Parser.java:2914)
at org.codehaus.janino.Parser.parseUnaryExpression(Parser.java:2617)
at 
org.codehaus.janino.Parser.parseMultiplicativeExpression(Parser.java:2573)
at org.codehaus.janino.Parser.parseAdditiveExpression(Parser.java:2552)
{code}

In the code-generated expression, the literal should use `\\` instead of `\`.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18950) Report conflicting fields when merging two StructTypes.

2016-12-20 Thread Bravo Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765361#comment-15765361
 ] 

Bravo Zhang commented on SPARK-18950:
-

I'll work on this, thanks

> Report conflicting fields when merging two StructTypes.
> ---
>
> Key: SPARK-18950
> URL: https://issues.apache.org/jira/browse/SPARK-18950
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Cheng Lian
>Priority: Minor
>  Labels: starter
>
> Currently, {{StructType.merge()}} only reports data types of conflicting 
> fields when merging two incompatible schemas. It would be nice to also report 
> the field names for easier debugging.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18950) Report conflicting fields when merging two StructTypes.

2016-12-20 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-18950:
---
Labels: starter  (was: )

> Report conflicting fields when merging two StructTypes.
> ---
>
> Key: SPARK-18950
> URL: https://issues.apache.org/jira/browse/SPARK-18950
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Cheng Lian
>Priority: Minor
>  Labels: starter
>
> Currently, {{StructType.merge()}} only reports data types of conflicting 
> fields when merging two incompatible schemas. It would be nice to also report 
> the field names for easier debugging.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18886) Delay scheduling should not delay some executors indefinitely if one task is scheduled before delay timeout

2016-12-20 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765322#comment-15765322
 ] 

Imran Rashid commented on SPARK-18886:
--

You understand correctly -- that is precisely what I'm proposing.

The scenario with multiple waves is a good example of why I think this is a 
*good* change.  If only 1% of your cluster can take advantage of locality, then 
99% of your cluster goes unused across all those waves.  That may be an extreme 
case (though one I have actually seen in practice on large clusters).  Even if 
it's 50%, then you have 50% of your cluster going unused.  Unless local tasks 
are more than 2x faster, it would make more sense to make the change I'm 
proposing.  What's the worst case after this change?  All but one executor are 
local -- the result is that you have one task running slower.  But the more 
waves there are, the smaller the downside.  E.g., you complete 10 waves on the 
local executors, and only 8 waves on the non-local one.

The worst case is when there is only one wave, there is a huge gap (several 
times over) in runtime between local and non-local execution, and moments after 
you schedule on non-local resources, some local resource would become 
available.  I think this situation is not very common -- in particular, there 
normally isn't *such* an enormous gap between local and non-local that users 
would prefer their non-local resources sit idle indefinitely.  I'd argue that 
if such a use case is important, we should add a special conf for that in 
particular.

> Delay scheduling should not delay some executors indefinitely if one task is 
> scheduled before delay timeout
> ---
>
> Key: SPARK-18886
> URL: https://issues.apache.org/jira/browse/SPARK-18886
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Imran Rashid
>
> Delay scheduling can introduce an unbounded delay and underutilization of 
> cluster resources under the following circumstances:
> 1. Tasks have locality preferences for a subset of the available resources.
> 2. Tasks finish in less time than the delay scheduling wait.
> Instead of having *one* delay to wait for resources with better locality, 
> Spark waits indefinitely.
> As an example, consider a cluster with 100 executors, and a taskset with 500 
> tasks.  Say all tasks have a preference for one executor, which is by itself 
> on one host.  Given the default locality wait of 3s per level, we end up with 
> a 6s delay till we schedule on other hosts (process wait + host wait).
> If each task takes 5 seconds (under the 6 second delay), then _all 500_ tasks 
> get scheduled on _only one_ executor.  This means you're only using 1% of 
> your cluster, and you get a ~100x slowdown.  You'd actually be better off if 
> tasks took 7 seconds.
> *WORKAROUNDS*: 
> (1) You can change the locality wait times so that they are shorter than the 
> task execution time.  You need to take into account the sum of all wait times 
> to use all the resources on your cluster.  For example, if you have resources 
> on different racks, this will include the sum of 
> "spark.locality.wait.process" + "spark.locality.wait.node" + 
> "spark.locality.wait.rack".  Those each default to "3s".  The simplest way to 
> be to set "spark.locality.wait.process" to your desired wait interval, and 
> set both "spark.locality.wait.node" and "spark.locality.wait.rack" to "0".  
> For example, if your tasks take ~3 seconds on average, you might set 
> "spark.locality.wait.process" to "1s".
> Note that this workaround isn't perfect -- with less delay scheduling, you may 
> not get as good resource locality.  After this issue is fixed, you'd most 
> likely want to undo these configuration changes.
> (2) The worst case here will only happen if your tasks have extreme skew in 
> their locality preferences.  Users may be able to modify their job to 
> control the distribution of the original input data.
> (2a) A shuffle may end up with very skewed locality preferences, especially 
> if you do a repartition starting from a small number of partitions.  (Shuffle 
> locality preference is assigned if any node has more than 20% of the shuffle 
> input data -- by chance, you may have one node just above that threshold, and 
> all other nodes just below it.)  In this case, you can turn off locality 
> preference for shuffle data by setting 
> {{spark.shuffle.reduceLocality.enabled=false}}
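A configuration sketch of workaround (1) above, with illustrative values (tasks averaging ~3s, so the summed locality waits stay below the task duration):
{code}
import org.apache.spark.sql.SparkSession

// Shorten the per-level locality waits named in the workaround above.
val spark = SparkSession.builder()
  .appName("short-locality-waits")
  .config("spark.locality.wait.process", "1s")
  .config("spark.locality.wait.node", "0")
  .config("spark.locality.wait.rack", "0")
  .getOrCreate()
{code}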



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18951) Upgrade com.thoughtworks.paranamer/paranamer to 2.6

2016-12-20 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai reassigned SPARK-18951:


Assignee: Yin Huai

> Upgrade com.thoughtworks.paranamer/paranamer to 2.6
> ---
>
> Key: SPARK-18951
> URL: https://issues.apache.org/jira/browse/SPARK-18951
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: Yin Huai
>Assignee: Yin Huai
>
> I recently hit a bug in com.thoughtworks.paranamer/paranamer that causes 
> jackson to fail to handle a byte array defined in a case class. I then found 
> https://github.com/FasterXML/jackson-module-scala/issues/48, which suggests 
> that it is caused by a bug in paranamer. Let's upgrade paranamer. 
> Since we are using jackson 2.6.5, and jackson-module-paranamer 2.6.5 uses 
> com.thoughtworks.paranamer/paranamer 2.6, I suggest that we upgrade 
> paranamer to 2.6. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18951) Upgrade com.thoughtworks.paranamer/paranamer

2016-12-20 Thread Yin Huai (JIRA)
Yin Huai created SPARK-18951:


 Summary: Upgrade com.thoughtworks.paranamer/paranamer
 Key: SPARK-18951
 URL: https://issues.apache.org/jira/browse/SPARK-18951
 Project: Spark
  Issue Type: Bug
  Components: Build
Reporter: Yin Huai


I recently hit a bug in com.thoughtworks.paranamer/paranamer that causes 
jackson to fail to handle a byte array defined in a case class. I then found 
https://github.com/FasterXML/jackson-module-scala/issues/48, which suggests 
that it is caused by a bug in paranamer. Let's upgrade paranamer. 

Since we are using jackson 2.6.5, and jackson-module-paranamer 2.6.5 uses 
com.thoughtworks.paranamer/paranamer 2.6, I suggest that we upgrade 
paranamer to 2.6. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18951) Upgrade com.thoughtworks.paranamer/paranamer to 2.6

2016-12-20 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-18951:
-
Summary: Upgrade com.thoughtworks.paranamer/paranamer to 2.6  (was: Upgrade 
com.thoughtworks.paranamer/paranamer)

> Upgrade com.thoughtworks.paranamer/paranamer to 2.6
> ---
>
> Key: SPARK-18951
> URL: https://issues.apache.org/jira/browse/SPARK-18951
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: Yin Huai
>
> I recently hit a bug in com.thoughtworks.paranamer/paranamer that causes 
> jackson to fail to handle a byte array defined in a case class. I then found 
> https://github.com/FasterXML/jackson-module-scala/issues/48, which suggests 
> that it is caused by a bug in paranamer. Let's upgrade paranamer. 
> Since we are using jackson 2.6.5, and jackson-module-paranamer 2.6.5 uses 
> com.thoughtworks.paranamer/paranamer 2.6, I suggest that we upgrade 
> paranamer to 2.6. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18950) Report conflicting fields when merging two StructTypes.

2016-12-20 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-18950:
--

 Summary: Report conflicting fields when merging two StructTypes.
 Key: SPARK-18950
 URL: https://issues.apache.org/jira/browse/SPARK-18950
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Cheng Lian
Priority: Minor


Currently, {{StructType.merge()}} only reports data types of conflicting fields 
when merging two incompatible schemas. It would be nice to also report the 
field names for easier debugging.
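A small sketch of the kind of conflict the report is about (field names are illustrative). {{StructType.merge()}} is not part of the public user API in these versions; end users usually hit the conflict indirectly, e.g. when reading files with schema merging enabled:
{code}
import org.apache.spark.sql.types._

// Two schemas that agree on "id" but conflict on "value".
val a = StructType(Seq(StructField("id", LongType), StructField("value", IntegerType)))
val b = StructType(Seq(StructField("id", LongType), StructField("value", StringType)))
// Merging a and b fails on the "value" field; the request here is that the error
// message name that field, not just the mismatched IntegerType/StringType.
{code}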



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB

2016-12-20 Thread Barry Becker (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15762805#comment-15762805
 ] 

Barry Becker edited comment on SPARK-16845 at 12/20/16 9:24 PM:


I found a workaround that allows me to avoid the 64 KB error, but it still 
runs much slower than I expected. I switched to using a batch select statement 
instead of calls to withColumn in a loop. 
Here is an example of what I did.
Old way:
{code}
stringCols.foreach(column => {  
  val qCol = col(column)
  datasetDf = datasetDf
.withColumn(column + CLEAN_SUFFIX, when(qCol.isNull, 
lit(MISSING)).otherwise(qCol))
})
{code}
New way:
{code}
val replaceStringNull = udf((s: String) => if (s == null) MISSING else s)
var newCols = datasetDf.columns.map(column =>
  if (stringCols.contains(column))
replaceStringNull(col(column)).as(column + CLEAN_SUFFIX)
  else col(column))
datasetDf = datasetDf.select(newCols:_*)
{code}
This workaround only works on Spark 2.0.2. I still get the 64 KB limit error 
when running the same thing with 1.6.3.


was (Author: barrybecker4):
I found a workaround that allows me to avoid the 64 KB error, but it still 
reuns much slower than I expected. I switched to use a bacth select statement 
insted of calls to withColumns in a loop. 
Here is an example of what I did
Old way:
{code}
stringCols.foreach(column => {  
  val qCol = col(column)
  datasetDf = datasetDf
.withColumn(column + CLEAN_SUFFIX, when(qCol.isNull, 
lit(MISSING)).otherwise(qCol))
})
{code}
New way:
{code}
val replaceStringNull = udf((s: String) => if (s == null) MISSING else s)
var newCols = datasetDf.columns.map(column =>
  if (stringCols.contains(column))
replaceStringNull(col(column)).as(column + CLEAN_SUFFIX)
  else col(column))
datasetDf = datasetDf.select(newCols:_*)
{code}

> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
> -
>
> Key: SPARK-16845
> URL: https://issues.apache.org/jira/browse/SPARK-16845
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: hejie
> Attachments: error.txt.zip
>
>
> I have a wide table (400 columns); when I try fitting the training data on all 
> columns, the fatal error occurs. 
>   ... 46 more
> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
>   at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
>   at org.codehaus.janino.CodeContext.write(CodeContext.java:854)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18281) toLocalIterator yields time out error on pyspark2

2016-12-20 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-18281.

   Resolution: Fixed
Fix Version/s: 2.0.3
   2.1.1

Issue resolved by pull request 16263
[https://github.com/apache/spark/pull/16263]

> toLocalIterator yields time out error on pyspark2
> -
>
> Key: SPARK-18281
> URL: https://issues.apache.org/jira/browse/SPARK-18281
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.1
> Environment: Ubuntu 14.04.5 LTS
> Driver: AWS M4.XLARGE
> Slaves: AWS M4.4.XLARGE
> mesos 1.0.1
> spark 2.0.1
> pyspark
>Reporter: Luke Miner
> Fix For: 2.1.1, 2.0.3
>
>
> I run the example straight out of the API docs for toLocalIterator and it 
> gives a timeout exception:
> {code}
> from pyspark import SparkContext
> sc = SparkContext()
> rdd = sc.parallelize(range(10))
> [x for x in rdd.toLocalIterator()]
> {code}
> conf file:
> spark.driver.maxResultSize 6G
> spark.executor.extraJavaOptions -XX:+UseG1GC -XX:MaxPermSize=1G 
> -XX:+HeapDumpOnOutOfMemoryError
> spark.executor.memory   16G
> spark.executor.uri  foo/spark-2.0.1-bin-hadoop2.7.tgz
> spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
> spark.hadoop.fs.s3a.buffer.dir  /raid0/spark
> spark.hadoop.fs.s3n.buffer.dir  /raid0/spark
> spark.hadoop.fs.s3a.connection.timeout 50
> spark.hadoop.fs.s3n.multipart.uploads.enabled   true
> spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
> spark.hadoop.parquet.block.size 2147483648
> spark.hadoop.parquet.enable.summary-metadata false
> spark.jars.packages 
> com.databricks:spark-avro_2.11:3.0.1,com.amazonaws:aws-java-sdk-pom:1.10.34
> spark.local.dir /raid0/spark
> spark.mesos.coarse  false
> spark.mesos.constraints  priority:1
> spark.network.timeout   600
> spark.rpc.message.maxSize 500
> spark.speculation   false
> spark.sql.parquet.mergeSchema   false
> spark.sql.planner.externalSort  true
> spark.submit.deployMode client
> spark.task.cpus 1
> Exception here:
> {code}
> ---
> timeout   Traceback (most recent call last)
>  in ()
>   2 sc = SparkContext()
>   3 rdd = sc.parallelize(range(10))
> > 4 [x for x in rdd.toLocalIterator()]
> /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/rdd.pyc in 
> _load_from_socket(port, serializer)
> 140 try:
> 141 rf = sock.makefile("rb", 65536)
> --> 142 for item in serializer.load_stream(rf):
> 143 yield item
> 144 finally:
> /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/serializers.pyc in 
> load_stream(self, stream)
> 137 while True:
> 138 try:
> --> 139 yield self._read_with_length(stream)
> 140 except EOFError:
> 141 return
> /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/serializers.pyc in 
> _read_with_length(self, stream)
> 154 
> 155 def _read_with_length(self, stream):
> --> 156 length = read_int(stream)
> 157 if length == SpecialLengths.END_OF_DATA_SECTION:
> 158 raise EOFError
> /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/serializers.pyc in 
> read_int(stream)
> 541 
> 542 def read_int(stream):
> --> 543 length = stream.read(4)
> 544 if not length:
> 545 raise EOFError
> /usr/lib/python2.7/socket.pyc in read(self, size)
> 378 # fragmentation issues on many platforms.
> 379 try:
> --> 380 data = self._sock.recv(left)
> 381 except error, e:
> 382 if e.args[0] == EINTR:
> timeout: timed out
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18761) Uncancellable / unkillable tasks may starve jobs of resources

2016-12-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765200#comment-15765200
 ] 

Apache Spark commented on SPARK-18761:
--

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/16358

> Uncancellable / unkillable tasks may starve jobs of resources
> 
>
> Key: SPARK-18761
> URL: https://issues.apache.org/jira/browse/SPARK-18761
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 2.2.0
>
>
> Spark's current task cancellation / task killing mechanism is "best effort" 
> in the sense that some tasks may not be interruptible and may not respond to 
> their "killed" flags being set. If a significant fraction of a cluster's task 
> slots are occupied by tasks that have been marked as killed but remain 
> running, then this can lead to a situation where new jobs and tasks are 
> starved of resources because zombie tasks are holding resources.
> I propose to address this problem by introducing a "task reaper" mechanism in 
> executors to monitor tasks after they are marked for killing in order to 
> periodically re-attempt the task kill, capture and log stacktraces / warnings 
> if tasks do not exit in a timely manner, and, optionally, kill the entire 
> executor JVM if cancelled tasks cannot be killed within some timeout.
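A hedged sketch of how such a reaper could be switched on via configuration once implemented; the property names below follow the spark.task.reaper.* naming documented in later releases and should be treated as illustrative in the context of this proposal:
{code}
// Monitor tasks after they are marked for killing, and optionally kill the
// executor JVM if a cancelled task does not exit within the timeout.
val conf = new org.apache.spark.SparkConf()
  .set("spark.task.reaper.enabled", "true")
  .set("spark.task.reaper.killTimeout", "120s")
{code}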



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12965) Indexer setInputCol() doesn't resolve column names like DataFrame.col()

2016-12-20 Thread Ilya Matiach (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765124#comment-15765124
 ] 

Ilya Matiach commented on SPARK-12965:
--

Can the ML component be removed from this JIRA?  It looks like this is a Spark 
Core bug only.

> Indexer setInputCol() doesn't resolve column names like DataFrame.col()
> ---
>
> Key: SPARK-12965
> URL: https://issues.apache.org/jira/browse/SPARK-12965
> Project: Spark
>  Issue Type: Bug
>  Components: ML, Spark Core
>Affects Versions: 1.6.0
>Reporter: Joshua Taylor
> Attachments: SparkMLDotColumn.java
>
>
> The setInputCol() method doesn't seem to resolve column names in the same way 
> that other methods do.  E.g., given a DataFrame df, {{df.col("`a.b`")}} will 
> return a column.  On a StringIndexer indexer, 
> {{indexer.setInputCol("`a.b`")}} leads to an indexer where fitting 
> and transforming seem to have no effect.  Running the following code produces:
> {noformat}
> +---+---++
> |a.b|a_b|a_bIndex|
> +---+---++
> |foo|foo| 0.0|
> |bar|bar| 1.0|
> +---+---++
> {noformat}
> but I think it should have another column, {{abIndex}}, with the same contents 
> as {{a_bIndex}}.
> {code}
> public class SparkMLDotColumn {
>   public static void main(String[] args) {
>   // Get the contexts
>   SparkConf conf = new SparkConf()
>   .setMaster("local[*]")
>   .setAppName("test")
>   .set("spark.ui.enabled", "false");
>   JavaSparkContext sparkContext = new JavaSparkContext(conf);
>   SQLContext sqlContext = new SQLContext(sparkContext);
>   
>   // Create a schema with a single string column named "a.b"
>   StructType schema = new StructType(new StructField[] {
>   DataTypes.createStructField("a.b", 
> DataTypes.StringType, false)
>   });
>   // Create an empty RDD and DataFrame
>   List<Row> rows = Arrays.asList(RowFactory.create("foo"), 
> RowFactory.create("bar")); 
>   JavaRDD<Row> rdd = sparkContext.parallelize(rows);
>   DataFrame df = sqlContext.createDataFrame(rdd, schema);
>   
>   df = df.withColumn("a_b", df.col("`a.b`"));
>   
>   StringIndexer indexer0 = new StringIndexer();
>   indexer0.setInputCol("a_b");
>   indexer0.setOutputCol("a_bIndex");
>   df = indexer0.fit(df).transform(df);
>   
>   StringIndexer indexer1 = new StringIndexer();
>   indexer1.setInputCol("`a.b`");
>   indexer1.setOutputCol("abIndex");
>   df = indexer1.fit(df).transform(df);
>   
>   df.show();
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18949) Add recoverPartitions API to Catalog

2016-12-20 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-18949:

Description: 
Currently, we only have a SQL interface for recovering all the partitions in 
the directory of a table and updating the catalog: `MSCK REPAIR TABLE` or `ALTER 
TABLE table RECOVER PARTITIONS`. (Actually, it is very hard for me to remember 
`MSCK`, and I have no clue what it means.)

After the new "Scalable Partition Handling", table repair becomes much more 
important for making the data in a newly created, partitioned data source table 
visible.

It is desirable to add it to the Catalog interface so that users can repair 
the table with
{noformat}
spark.catalog.recoverPartitions("testTable")
{noformat}


  was:
Currently, we only have a SQL interface for recovering all the partitions in 
the directory of a table and updating the catalog: `MSCK REPAIR TABLE`. 
(Actually, it is very hard for me to remember `MSCK`, and I have no clue what it means.)

After the new "Scalable Partition Handling", table repair becomes much more 
important for making the data in a newly created, partitioned data source table 
visible.


> Add recoverPartitions API to Catalog
> 
>
> Key: SPARK-18949
> URL: https://issues.apache.org/jira/browse/SPARK-18949
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> Currently, we only have a SQL interface for recovering all the partitions in 
> the directory of a table and updating the catalog: `MSCK REPAIR TABLE` or 
> `ALTER TABLE table RECOVER PARTITIONS`. (Actually, it is very hard for me to 
> remember `MSCK`, and I have no clue what it means.)
> After the new "Scalable Partition Handling", table repair becomes much 
> more important for making the data in a newly created, partitioned data source 
> table visible.
> It is desirable to add it to the Catalog interface so that users can repair 
> the table with
> {noformat}
> spark.catalog.recoverPartitions("testTable")
> {noformat}
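For readers comparing the two entry points, here is a minimal sketch of how the 
proposed Catalog call lines up with the existing SQL commands (the table name is 
just an illustration taken from the snippet above):
{code}
// Existing SQL-only interface:
spark.sql("MSCK REPAIR TABLE testTable")
spark.sql("ALTER TABLE testTable RECOVER PARTITIONS")

// Proposed programmatic equivalent on the Catalog interface:
spark.catalog.recoverPartitions("testTable")
{code}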



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18949) Add recoverPartitions API to Catalog

2016-12-20 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-18949:

Summary: Add recoverPartitions API to Catalog  (was: Add repairTable API to 
Catalog)

> Add recoverPartitions API to Catalog
> 
>
> Key: SPARK-18949
> URL: https://issues.apache.org/jira/browse/SPARK-18949
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> Currently, we only have a SQL interface for recovering all the partitions in 
> the directory of a table and updating the catalog: `MSCK REPAIR TABLE`. 
> (Actually, it is very hard for me to remember `MSCK`, and I have no clue what it means.)
> After the new "Scalable Partition Handling", table repair becomes much 
> more important for making the data in a newly created, partitioned data source 
> table visible.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11293) ExternalSorter and ExternalAppendOnlyMap should free shuffle memory in their stop() methods

2016-12-20 Thread Barry Becker (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765096#comment-15765096
 ] 

Barry Becker commented on SPARK-11293:
--

Not sure if this is related, but I am running on Spark 2.0.2 through spark 
job-server and see tons of messages like this:
{code}
[2016-12-20 11:49:28,662] WARN  he.spark.executor.Executor [] 
[akka://JobServer/user/context-supervisor/sql-context] - Managed memory leak 
detected; size = 5762976 bytes, TID = 42621
[2016-12-20 11:49:28,662] WARN  k.memory.TaskMemoryManager [] 
[akka://JobServer/user/context-supervisor/sql-context] - leak 5.5 MB memory 
from org.apache.spark.util.collection.ExternalSorter@35f81493
[2016-12-20 11:49:28,662] WARN  he.spark.executor.Executor [] 
[akka://JobServer/user/context-supervisor/sql-context] - Managed memory leak 
detected; size = 5762976 bytes, TID = 42622
[2016-12-20 11:49:28,702] WARN  k.memory.TaskMemoryManager [] 
[akka://JobServer/user/context-supervisor/sql-context] - leak 5.5 MB memory 
from org.apache.spark.util.collection.ExternalSorter@16da7c1a
[2016-12-20 11:49:28,702] WARN  he.spark.executor.Executor [] 
[akka://JobServer/user/context-supervisor/sql-context] - Managed memory leak 
detected; size = 5762976 bytes, TID = 42623
[2016-12-20 11:49:28,702] WARN  k.memory.TaskMemoryManager [] 
[akka://JobServer/user/context-supervisor/sql-context] - leak 5.5 MB memory 
from org.apache.spark.util.collection.ExternalSorter@151060cf
[2016-12-20 11:49:28,702] WARN  he.spark.executor.Executor [] 
[akka://JobServer/user/context-supervisor/sql-context] - Managed memory leak 
detected; size = 5762976 bytes, TID = 42624
[Stage 5700:=>(44 + 4) / 
92][2016-12-20 11:49:35,479] WARN  k.memory.TaskMemoryManager [] 
[akka://JobServer/user/context-supervisor/sql-context] - l
{code}
Are managed memory leaks ever expected behavior? Or do they always indicate a 
memory leak problem? I don't really see the memory going up much in jVisualVM.

> ExternalSorter and ExternalAppendOnlyMap should free shuffle memory in their 
> stop() methods
> ---
>
> Key: SPARK-11293
> URL: https://issues.apache.org/jira/browse/SPARK-11293
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.1, 1.4.1, 1.5.1, 1.6.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Critical
> Fix For: 1.6.0
>
>
> I discovered multiple leaks of shuffle memory while working on my memory 
> manager consolidation patch, which added the ability to do strict memory leak 
> detection for the bookkeeping that used to be performed by the 
> ShuffleMemoryManager. This uncovered a handful of places where tasks can 
> acquire execution/shuffle memory but never release it, starving themselves of 
> memory.
> Problems that I found:
> * {{ExternalSorter.stop()}} should release the sorter's shuffle/execution 
> memory.
> * BlockStoreShuffleReader should call {{ExternalSorter.stop()}} using a 
> {{CompletionIterator}}.
> * {{ExternalAppendOnlyMap}} exposes no equivalent of {{stop()}} for freeing 
> its resources.
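The second bullet above refers to the completion-iterator pattern: wrap the task's 
record iterator so that a cleanup callback fires exactly once when the iterator is 
exhausted. A rough, self-contained sketch of that pattern (illustrative only, not 
the actual Spark internals; the names are made up):
{code}
// Runs `onComplete` once, the first time the wrapped iterator reports it is exhausted.
class CompletionIteratorSketch[A](sub: Iterator[A], onComplete: () => Unit) extends Iterator[A] {
  private var completed = false
  override def hasNext: Boolean = {
    val more = sub.hasNext
    if (!more && !completed) { completed = true; onComplete() }
    more
  }
  override def next(): A = sub.next()
}

// Usage sketch: release the sorter's shuffle memory once all records have been consumed.
// val records = new CompletionIteratorSketch(sorter.iterator, () => sorter.stop())
{code}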



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18928) FileScanRDD, JDBCRDD, and UnsafeSorter should support task cancellation

2016-12-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765087#comment-15765087
 ] 

Apache Spark commented on SPARK-18928:
--

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/16357

> FileScanRDD, JDBCRDD, and UnsafeSorter should support task cancellation
> ---
>
> Key: SPARK-18928
> URL: https://issues.apache.org/jira/browse/SPARK-18928
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 2.1.1, 2.2.0
>
>
> Spark tasks respond to cancellation by checking 
> {{TaskContext.isInterrupted()}}, but this check is missing on a few critical 
> paths used in Spark SQL, including FileScanRDD, JDBCRDD, and 
> UnsafeSorter-based sorts. This can cause interrupted / cancelled tasks to 
> continue running and become zombies.
> Here's an example: first, create a giant text file. In my case, I just 
> concatenated /usr/share/dict/words a bunch of times to produce a 2.75 gig 
> file. Then, run a really slow query over that file and try to cancel it:
> {code}
> spark.read.text("/tmp/words").selectExpr("value + value + value").collect()
> {code}
> This will sit and churn at 100% CPU for a minute or two because the task 
> isn't checking the interrupted flag.
> The solution here is to add InterruptibleIterator-style checks to a few 
> locations where they're currently missing in Spark SQL.
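As a rough illustration of the kind of check being proposed, here is a sketch of an 
interruptible-iterator wrapper (the cancellation flag is abstracted as a plain 
function and the class name is made up; this shows the pattern, not the actual patch):
{code}
// Polls a cancellation flag on every hasNext() so that a long-running scan
// notices promptly that its task has been killed, instead of running to completion.
class CancellableIteratorSketch[A](sub: Iterator[A], isCancelled: () => Boolean) extends Iterator[A] {
  override def hasNext: Boolean = {
    if (isCancelled()) throw new RuntimeException("task was cancelled")
    sub.hasNext
  }
  override def next(): A = sub.next()
}
{code}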



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18949) Add repairTable API to Catalog

2016-12-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765054#comment-15765054
 ] 

Apache Spark commented on SPARK-18949:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/16356

> Add repairTable API to Catalog
> --
>
> Key: SPARK-18949
> URL: https://issues.apache.org/jira/browse/SPARK-18949
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> Currently, we only have a SQL interface for recovering all the partitions in 
> the directory of a table and updating the catalog: `MSCK REPAIR TABLE`. 
> (Actually, it is very hard for me to remember `MSCK`, and I have no clue what it means.)
> After the new "Scalable Partition Handling", table repair becomes much 
> more important for making the data in a newly created, partitioned data source 
> table visible.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18949) Add repairTable API to Catalog

2016-12-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18949:


Assignee: Xiao Li  (was: Apache Spark)

> Add repairTable API to Catalog
> --
>
> Key: SPARK-18949
> URL: https://issues.apache.org/jira/browse/SPARK-18949
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> Currently, we only have a SQL interface for recovering all the partitions in 
> the directory of a table and updating the catalog: `MSCK REPAIR TABLE`. 
> (Actually, it is very hard for me to remember `MSCK`, and I have no clue what it means.)
> After the new "Scalable Partition Handling", table repair becomes much 
> more important for making the data in a newly created, partitioned data source 
> table visible.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18949) Add repairTable API to Catalog

2016-12-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18949:


Assignee: Apache Spark  (was: Xiao Li)

> Add repairTable API to Catalog
> --
>
> Key: SPARK-18949
> URL: https://issues.apache.org/jira/browse/SPARK-18949
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> Currently, we only have a SQL interface for recovering all the partitions in 
> the directory of a table and updating the catalog: `MSCK REPAIR TABLE`. 
> (Actually, it is very hard for me to remember `MSCK`, and I have no clue what it means.)
> After the new "Scalable Partition Handling", table repair becomes much 
> more important for making the data in a newly created, partitioned data source 
> table visible.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18949) Add repairTable API to Catalog

2016-12-20 Thread Xiao Li (JIRA)
Xiao Li created SPARK-18949:
---

 Summary: Add repairTable API to Catalog
 Key: SPARK-18949
 URL: https://issues.apache.org/jira/browse/SPARK-18949
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.1.0
Reporter: Xiao Li
Assignee: Xiao Li


Currently, we only have a SQL interface for recovering all the partitions in 
the directory of a table and updating the catalog: `MSCK REPAIR TABLE`. 
(Actually, it is very hard for me to remember `MSCK`, and I have no clue what it means.)

After the new "Scalable Partition Handling", table repair becomes much more 
important for making the data in a newly created, partitioned data source table 
visible.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18036) Decision Trees do not handle edge cases

2016-12-20 Thread Ilya Matiach (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765032#comment-15765032
 ] 

Ilya Matiach commented on SPARK-18036:
--

Weichen Xu, are you working on this issue or have you resolved it?  I am 
interested in investigating this bug.

> Decision Trees do not handle edge cases
> ---
>
> Key: SPARK-18036
> URL: https://issues.apache.org/jira/browse/SPARK-18036
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Reporter: Seth Hendrickson
>Priority: Minor
>
> Decision trees/GBT/RF do not handle edge cases such as constant features or 
> empty features. For example:
> {code}
> val dt = new DecisionTreeRegressor()
> val data = Seq(LabeledPoint(1.0, Vectors.dense(Array.empty[Double]))).toDF()
> dt.fit(data)
> java.lang.UnsupportedOperationException: empty.max
>   at scala.collection.TraversableOnce$class.max(TraversableOnce.scala:229)
>   at scala.collection.mutable.ArrayOps$ofInt.max(ArrayOps.scala:234)
>   at 
> org.apache.spark.ml.tree.impl.DecisionTreeMetadata$.buildMetadata(DecisionTreeMetadata.scala:207)
>   at org.apache.spark.ml.tree.impl.RandomForest$.run(RandomForest.scala:105)
>   at 
> org.apache.spark.ml.regression.DecisionTreeRegressor.train(DecisionTreeRegressor.scala:93)
>   at 
> org.apache.spark.ml.regression.DecisionTreeRegressor.train(DecisionTreeRegressor.scala:46)
>   at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
>   ... 52 elided
> {code}
> as well as 
> {code}
> val dt = new DecisionTreeRegressor()
> val data = Seq(LabeledPoint(1.0, Vectors.dense(0.0, 0.0, 0.0))).toDF()
> dt.fit(data)
> java.lang.UnsupportedOperationException: empty.maxBy
> at scala.collection.TraversableOnce$class.maxBy(TraversableOnce.scala:236)
> at 
> scala.collection.SeqViewLike$AbstractTransformed.maxBy(SeqViewLike.scala:37)
> at 
> org.apache.spark.ml.tree.impl.RandomForest$.binsToBestSplit(RandomForest.scala:846)
> {code}
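Until the trees guard against these inputs themselves, a caller-side pre-check is 
one way to fail fast with a clearer message. A minimal sketch, assuming the same 
{{data}} DataFrame and default "features" column used in the snippets above:
{code}
import org.apache.spark.ml.linalg.Vector

// Reject empty feature vectors (first failure above).
val firstVec = data.select("features").head.getAs[Vector]("features")
require(firstVec.size > 0, "feature vectors must have at least one dimension")

// Reject datasets whose feature vectors are all identical (second failure above).
val distinctFeatureRows = data.select("features").distinct().count()
require(distinctFeatureRows > 1, "need at least two distinct feature vectors to split on")
{code}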



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16473) BisectingKMeans Algorithm failing with java.util.NoSuchElementException: key not found

2016-12-20 Thread Ilya Matiach (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765027#comment-15765027
 ] 

Ilya Matiach commented on SPARK-16473:
--

Do you have a smaller dataset than the one in the description that can 
reproduce the bug?

> BisectingKMeans Algorithm failing with java.util.NoSuchElementException: key 
> not found
> --
>
> Key: SPARK-16473
> URL: https://issues.apache.org/jira/browse/SPARK-16473
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 1.6.1, 2.0.0
> Environment: AWS EC2 linux instance. 
>Reporter: Alok Bhandari
>
> Hello , 
> I am using apache spark 1.6.1. 
> I am executing bisecting k means algorithm on a specific dataset .
> Dataset details :- 
> K=100,
> input vector =100K*100k
> Memory assigned 16GB per node ,
> number of nodes =2.
>  Up to K=75 it works fine, but when I set K=100 it fails with 
> java.util.NoSuchElementException: key not found. 
> *I suspect it is failing because of a lack of resources, but the 
> exception does not convey why this Spark job failed.* 
> Please, can someone point me to the root cause of this exception and explain why it is 
> failing? 
> This is the exception stack trace:
> {code}
> java.util.NoSuchElementException: key not found: 166 
> at scala.collection.MapLike$class.default(MapLike.scala:228) 
> at scala.collection.AbstractMap.default(Map.scala:58) 
> at scala.collection.MapLike$class.apply(MapLike.scala:141) 
> at scala.collection.AbstractMap.apply(Map.scala:58) 
> at 
> org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1$$anonfun$2.apply$mcDJ$sp(BisectingKMeans.scala:338)
> at 
> org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1$$anonfun$2.apply(BisectingKMeans.scala:337)
> at 
> org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1$$anonfun$2.apply(BisectingKMeans.scala:337)
> at 
> scala.collection.TraversableOnce$$anonfun$minBy$1.apply(TraversableOnce.scala:231)
>  
> at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
>  
> at scala.collection.immutable.List.foldLeft(List.scala:84) 
> at 
> scala.collection.LinearSeqOptimized$class.reduceLeft(LinearSeqOptimized.scala:125)
>  
> at scala.collection.immutable.List.reduceLeft(List.scala:84) 
> at 
> scala.collection.TraversableOnce$class.minBy(TraversableOnce.scala:231) 
> at scala.collection.AbstractTraversable.minBy(Traversable.scala:105) 
> at 
> org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1.apply(BisectingKMeans.scala:337)
>  
> at 
> org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1.apply(BisectingKMeans.scala:334)
>  
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) 
> at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:389) 
> {code}
> Issue is that , it is failing but not giving any explicit message as to why 
> it failed.
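For readers unfamiliar with the API being discussed, a minimal sketch of the kind of 
invocation described above (illustrative only; the toy data here does not reproduce 
the failure):
{code}
import org.apache.spark.mllib.clustering.BisectingKMeans
import org.apache.spark.mllib.linalg.Vectors

// `sc` is an existing SparkContext; the real input is on the order of 100K vectors.
val data = sc.parallelize(Seq(Vectors.dense(1.0, 2.0), Vectors.dense(3.0, 4.0),
  Vectors.dense(5.0, 6.0)))
val model = new BisectingKMeans().setK(2).run(data)   // the reporter sets k=100
{code}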



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18941) Spark thrift server, Spark 2.0.2, The "drop table" command doesn't delete the directory associated with the table (not EXTERNAL table) from the file system

2016-12-20 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15765024#comment-15765024
 ] 

Dongjoon Hyun commented on SPARK-18941:
---

Hi, [~luatnc].
For me, both 2.0.2 and the current master delete the directory created by `CREATE 
TABLE` successfully.
Could you give us a more detailed, reproducible description?

> Spark thrift server, Spark 2.0.2, The "drop table" command doesn't delete the 
> directory associated with the table (not EXTERNAL table) from the file system
> ---
>
> Key: SPARK-18941
> URL: https://issues.apache.org/jira/browse/SPARK-18941
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.0.2
>Reporter: luat
>
> Spark thrift server, Spark 2.0.2, The "drop table" command doesn't delete the 
> directory associated with the table (not EXTERNAL table) from the file system.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18820) Driver may send "LaunchTask" before executor receive "RegisteredExecutor"

2016-12-20 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15764995#comment-15764995
 ] 

Shixiong Zhu commented on SPARK-18820:
--

This was fixed in SPARK-13112. The fix went into 2.0.0 and was not backported to 
1.6.x. Could you upgrade to 2.0+, please?

> Driver may send "LaunchTask" before executor receive "RegisteredExecutor"
> -
>
> Key: SPARK-18820
> URL: https://issues.apache.org/jira/browse/SPARK-18820
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 1.6.3
> Environment: spark-1.6.3
>Reporter: jin xing
>
> CoarseGrainedSchedulerBackend updates executorDataMap after receiving 
> "RegisterExecutor", so the task scheduler may assign tasks to this executor.
> If LaunchTask arrives at CoarseGrainedExecutorBackend before 
> RegisteredExecutor, it results in a NullPointerException and the executor 
> backend exits.
> Is this a bug? If so, can I make a PR? I think the driver should send "LaunchTask" 
> only after "RegisteredExecutor" has been received.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18820) Driver may send "LaunchTask" before executor receive "RegisteredExecutor"

2016-12-20 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-18820.
--
Resolution: Duplicate

> Driver may send "LaunchTask" before executor receive "RegisteredExecutor"
> -
>
> Key: SPARK-18820
> URL: https://issues.apache.org/jira/browse/SPARK-18820
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 1.6.3
> Environment: spark-1.6.3
>Reporter: jin xing
>
> CoarseGrainedSchedulerBackend updates executorDataMap after receiving 
> "RegisterExecutor", so the task scheduler may assign tasks to this executor.
> If LaunchTask arrives at CoarseGrainedExecutorBackend before 
> RegisteredExecutor, it results in a NullPointerException and the executor 
> backend exits.
> Is this a bug? If so, can I make a PR? I think the driver should send "LaunchTask" 
> only after "RegisteredExecutor" has been received.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16473) BisectingKMeans Algorithm failing with java.util.NoSuchElementException: key not found

2016-12-20 Thread Ilya Matiach (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15764964#comment-15764964
 ] 

Ilya Matiach commented on SPARK-16473:
--

If you could put the sample dataset on Google Drive or OneDrive and send me 
the link, that would be great.  Putting the dataset on GitHub would work too.  
How large is the dataset?

> BisectingKMeans Algorithm failing with java.util.NoSuchElementException: key 
> not found
> --
>
> Key: SPARK-16473
> URL: https://issues.apache.org/jira/browse/SPARK-16473
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 1.6.1, 2.0.0
> Environment: AWS EC2 linux instance. 
>Reporter: Alok Bhandari
>
> Hello , 
> I am using apache spark 1.6.1. 
> I am executing bisecting k means algorithm on a specific dataset .
> Dataset details :- 
> K=100,
> input vector =100K*100k
> Memory assigned 16GB per node ,
> number of nodes =2.
>  Up to K=75 it works fine, but when I set K=100 it fails with 
> java.util.NoSuchElementException: key not found. 
> *I suspect it is failing because of a lack of resources, but the 
> exception does not convey why this Spark job failed.* 
> Please, can someone point me to the root cause of this exception and explain why it is 
> failing? 
> This is the exception stack trace:
> {code}
> java.util.NoSuchElementException: key not found: 166 
> at scala.collection.MapLike$class.default(MapLike.scala:228) 
> at scala.collection.AbstractMap.default(Map.scala:58) 
> at scala.collection.MapLike$class.apply(MapLike.scala:141) 
> at scala.collection.AbstractMap.apply(Map.scala:58) 
> at 
> org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1$$anonfun$2.apply$mcDJ$sp(BisectingKMeans.scala:338)
> at 
> org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1$$anonfun$2.apply(BisectingKMeans.scala:337)
> at 
> org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1$$anonfun$2.apply(BisectingKMeans.scala:337)
> at 
> scala.collection.TraversableOnce$$anonfun$minBy$1.apply(TraversableOnce.scala:231)
>  
> at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
>  
> at scala.collection.immutable.List.foldLeft(List.scala:84) 
> at 
> scala.collection.LinearSeqOptimized$class.reduceLeft(LinearSeqOptimized.scala:125)
>  
> at scala.collection.immutable.List.reduceLeft(List.scala:84) 
> at 
> scala.collection.TraversableOnce$class.minBy(TraversableOnce.scala:231) 
> at scala.collection.AbstractTraversable.minBy(Traversable.scala:105) 
> at 
> org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1.apply(BisectingKMeans.scala:337)
>  
> at 
> org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1.apply(BisectingKMeans.scala:334)
>  
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) 
> at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:389) 
> {code}
> Issue is that , it is failing but not giving any explicit message as to why 
> it failed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16473) BisectingKMeans Algorithm failing with java.util.NoSuchElementException: key not found

2016-12-20 Thread Ilya Matiach (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15764909#comment-15764909
 ] 

Ilya Matiach commented on SPARK-16473:
--

I've added a pull request here:
https://github.com/apache/spark/pull/16355

It would be nice to add a test case in Spark itself to verify the code fix.

> BisectingKMeans Algorithm failing with java.util.NoSuchElementException: key 
> not found
> --
>
> Key: SPARK-16473
> URL: https://issues.apache.org/jira/browse/SPARK-16473
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 1.6.1, 2.0.0
> Environment: AWS EC2 linux instance. 
>Reporter: Alok Bhandari
>
> Hello , 
> I am using apache spark 1.6.1. 
> I am executing bisecting k means algorithm on a specific dataset .
> Dataset details :- 
> K=100,
> input vector =100K*100k
> Memory assigned 16GB per node ,
> number of nodes =2.
>  Up to K=75 it works fine, but when I set K=100 it fails with 
> java.util.NoSuchElementException: key not found. 
> *I suspect it is failing because of a lack of resources, but the 
> exception does not convey why this Spark job failed.* 
> Please, can someone point me to the root cause of this exception and explain why it is 
> failing? 
> This is the exception stack trace:
> {code}
> java.util.NoSuchElementException: key not found: 166 
> at scala.collection.MapLike$class.default(MapLike.scala:228) 
> at scala.collection.AbstractMap.default(Map.scala:58) 
> at scala.collection.MapLike$class.apply(MapLike.scala:141) 
> at scala.collection.AbstractMap.apply(Map.scala:58) 
> at 
> org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1$$anonfun$2.apply$mcDJ$sp(BisectingKMeans.scala:338)
> at 
> org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1$$anonfun$2.apply(BisectingKMeans.scala:337)
> at 
> org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1$$anonfun$2.apply(BisectingKMeans.scala:337)
> at 
> scala.collection.TraversableOnce$$anonfun$minBy$1.apply(TraversableOnce.scala:231)
>  
> at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
>  
> at scala.collection.immutable.List.foldLeft(List.scala:84) 
> at 
> scala.collection.LinearSeqOptimized$class.reduceLeft(LinearSeqOptimized.scala:125)
>  
> at scala.collection.immutable.List.reduceLeft(List.scala:84) 
> at 
> scala.collection.TraversableOnce$class.minBy(TraversableOnce.scala:231) 
> at scala.collection.AbstractTraversable.minBy(Traversable.scala:105) 
> at 
> org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1.apply(BisectingKMeans.scala:337)
>  
> at 
> org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1.apply(BisectingKMeans.scala:334)
>  
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) 
> at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:389) 
> {code}
> Issue is that , it is failing but not giving any explicit message as to why 
> it failed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16473) BisectingKMeans Algorithm failing with java.util.NoSuchElementException: key not found

2016-12-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16473:


Assignee: (was: Apache Spark)

> BisectingKMeans Algorithm failing with java.util.NoSuchElementException: key 
> not found
> --
>
> Key: SPARK-16473
> URL: https://issues.apache.org/jira/browse/SPARK-16473
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 1.6.1, 2.0.0
> Environment: AWS EC2 linux instance. 
>Reporter: Alok Bhandari
>
> Hello , 
> I am using apache spark 1.6.1. 
> I am executing bisecting k means algorithm on a specific dataset .
> Dataset details :- 
> K=100,
> input vector =100K*100k
> Memory assigned 16GB per node ,
> number of nodes =2.
>  Up to K=75 it works fine, but when I set K=100 it fails with 
> java.util.NoSuchElementException: key not found. 
> *I suspect it is failing because of a lack of resources, but the 
> exception does not convey why this Spark job failed.* 
> Please, can someone point me to the root cause of this exception and explain why it is 
> failing? 
> This is the exception stack trace:
> {code}
> java.util.NoSuchElementException: key not found: 166 
> at scala.collection.MapLike$class.default(MapLike.scala:228) 
> at scala.collection.AbstractMap.default(Map.scala:58) 
> at scala.collection.MapLike$class.apply(MapLike.scala:141) 
> at scala.collection.AbstractMap.apply(Map.scala:58) 
> at 
> org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1$$anonfun$2.apply$mcDJ$sp(BisectingKMeans.scala:338)
> at 
> org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1$$anonfun$2.apply(BisectingKMeans.scala:337)
> at 
> org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1$$anonfun$2.apply(BisectingKMeans.scala:337)
> at 
> scala.collection.TraversableOnce$$anonfun$minBy$1.apply(TraversableOnce.scala:231)
>  
> at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
>  
> at scala.collection.immutable.List.foldLeft(List.scala:84) 
> at 
> scala.collection.LinearSeqOptimized$class.reduceLeft(LinearSeqOptimized.scala:125)
>  
> at scala.collection.immutable.List.reduceLeft(List.scala:84) 
> at 
> scala.collection.TraversableOnce$class.minBy(TraversableOnce.scala:231) 
> at scala.collection.AbstractTraversable.minBy(Traversable.scala:105) 
> at 
> org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1.apply(BisectingKMeans.scala:337)
>  
> at 
> org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1.apply(BisectingKMeans.scala:334)
>  
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) 
> at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:389) 
> {code}
> Issue is that , it is failing but not giving any explicit message as to why 
> it failed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16473) BisectingKMeans Algorithm failing with java.util.NoSuchElementException: key not found

2016-12-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15764905#comment-15764905
 ] 

Apache Spark commented on SPARK-16473:
--

User 'imatiach-msft' has created a pull request for this issue:
https://github.com/apache/spark/pull/16355

> BisectingKMeans Algorithm failing with java.util.NoSuchElementException: key 
> not found
> --
>
> Key: SPARK-16473
> URL: https://issues.apache.org/jira/browse/SPARK-16473
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 1.6.1, 2.0.0
> Environment: AWS EC2 linux instance. 
>Reporter: Alok Bhandari
>
> Hello , 
> I am using apache spark 1.6.1. 
> I am executing bisecting k means algorithm on a specific dataset .
> Dataset details :- 
> K=100,
> input vector =100K*100k
> Memory assigned 16GB per node ,
> number of nodes =2.
>  Up to K=75 it works fine, but when I set K=100 it fails with 
> java.util.NoSuchElementException: key not found. 
> *I suspect it is failing because of a lack of resources, but the 
> exception does not convey why this Spark job failed.* 
> Please, can someone point me to the root cause of this exception and explain why it is 
> failing? 
> This is the exception stack trace:
> {code}
> java.util.NoSuchElementException: key not found: 166 
> at scala.collection.MapLike$class.default(MapLike.scala:228) 
> at scala.collection.AbstractMap.default(Map.scala:58) 
> at scala.collection.MapLike$class.apply(MapLike.scala:141) 
> at scala.collection.AbstractMap.apply(Map.scala:58) 
> at 
> org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1$$anonfun$2.apply$mcDJ$sp(BisectingKMeans.scala:338)
> at 
> org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1$$anonfun$2.apply(BisectingKMeans.scala:337)
> at 
> org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1$$anonfun$2.apply(BisectingKMeans.scala:337)
> at 
> scala.collection.TraversableOnce$$anonfun$minBy$1.apply(TraversableOnce.scala:231)
>  
> at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
>  
> at scala.collection.immutable.List.foldLeft(List.scala:84) 
> at 
> scala.collection.LinearSeqOptimized$class.reduceLeft(LinearSeqOptimized.scala:125)
>  
> at scala.collection.immutable.List.reduceLeft(List.scala:84) 
> at 
> scala.collection.TraversableOnce$class.minBy(TraversableOnce.scala:231) 
> at scala.collection.AbstractTraversable.minBy(Traversable.scala:105) 
> at 
> org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1.apply(BisectingKMeans.scala:337)
>  
> at 
> org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1.apply(BisectingKMeans.scala:334)
>  
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) 
> at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:389) 
> {code}
> Issue is that , it is failing but not giving any explicit message as to why 
> it failed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16473) BisectingKMeans Algorithm failing with java.util.NoSuchElementException: key not found

2016-12-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16473:


Assignee: Apache Spark

> BisectingKMeans Algorithm failing with java.util.NoSuchElementException: key 
> not found
> --
>
> Key: SPARK-16473
> URL: https://issues.apache.org/jira/browse/SPARK-16473
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 1.6.1, 2.0.0
> Environment: AWS EC2 linux instance. 
>Reporter: Alok Bhandari
>Assignee: Apache Spark
>
> Hello , 
> I am using apache spark 1.6.1. 
> I am executing bisecting k means algorithm on a specific dataset .
> Dataset details :- 
> K=100,
> input vector =100K*100k
> Memory assigned 16GB per node ,
> number of nodes =2.
>  Up to K=75 it works fine, but when I set K=100 it fails with 
> java.util.NoSuchElementException: key not found. 
> *I suspect it is failing because of a lack of resources, but the 
> exception does not convey why this Spark job failed.* 
> Please, can someone point me to the root cause of this exception and explain why it is 
> failing? 
> This is the exception stack trace:
> {code}
> java.util.NoSuchElementException: key not found: 166 
> at scala.collection.MapLike$class.default(MapLike.scala:228) 
> at scala.collection.AbstractMap.default(Map.scala:58) 
> at scala.collection.MapLike$class.apply(MapLike.scala:141) 
> at scala.collection.AbstractMap.apply(Map.scala:58) 
> at 
> org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1$$anonfun$2.apply$mcDJ$sp(BisectingKMeans.scala:338)
> at 
> org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1$$anonfun$2.apply(BisectingKMeans.scala:337)
> at 
> org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1$$anonfun$2.apply(BisectingKMeans.scala:337)
> at 
> scala.collection.TraversableOnce$$anonfun$minBy$1.apply(TraversableOnce.scala:231)
>  
> at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
>  
> at scala.collection.immutable.List.foldLeft(List.scala:84) 
> at 
> scala.collection.LinearSeqOptimized$class.reduceLeft(LinearSeqOptimized.scala:125)
>  
> at scala.collection.immutable.List.reduceLeft(List.scala:84) 
> at 
> scala.collection.TraversableOnce$class.minBy(TraversableOnce.scala:231) 
> at scala.collection.AbstractTraversable.minBy(Traversable.scala:105) 
> at 
> org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1.apply(BisectingKMeans.scala:337)
>  
> at 
> org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1.apply(BisectingKMeans.scala:334)
>  
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) 
> at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:389) 
> {code}
> Issue is that , it is failing but not giving any explicit message as to why 
> it failed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16473) BisectingKMeans Algorithm failing with java.util.NoSuchElementException: key not found

2016-12-20 Thread Ilya Matiach (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15764892#comment-15764892
 ] 

Ilya Matiach commented on SPARK-16473:
--

I will start a pull request for the change.  I would like to add a test case 
that verifies the bug is fixed though.  Maybe you can send the sample dataset 
through GitHub, and I can take a look?

> BisectingKMeans Algorithm failing with java.util.NoSuchElementException: key 
> not found
> --
>
> Key: SPARK-16473
> URL: https://issues.apache.org/jira/browse/SPARK-16473
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 1.6.1, 2.0.0
> Environment: AWS EC2 linux instance. 
>Reporter: Alok Bhandari
>
> Hello , 
> I am using apache spark 1.6.1. 
> I am executing bisecting k means algorithm on a specific dataset .
> Dataset details :- 
> K=100,
> input vector =100K*100k
> Memory assigned 16GB per node ,
> number of nodes =2.
>  Up to K=75 it works fine, but when I set K=100 it fails with 
> java.util.NoSuchElementException: key not found. 
> *I suspect it is failing because of a lack of resources, but the 
> exception does not convey why this Spark job failed.* 
> Please, can someone point me to the root cause of this exception and explain why it is 
> failing? 
> This is the exception stack trace:
> {code}
> java.util.NoSuchElementException: key not found: 166 
> at scala.collection.MapLike$class.default(MapLike.scala:228) 
> at scala.collection.AbstractMap.default(Map.scala:58) 
> at scala.collection.MapLike$class.apply(MapLike.scala:141) 
> at scala.collection.AbstractMap.apply(Map.scala:58) 
> at 
> org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1$$anonfun$2.apply$mcDJ$sp(BisectingKMeans.scala:338)
> at 
> org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1$$anonfun$2.apply(BisectingKMeans.scala:337)
> at 
> org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1$$anonfun$2.apply(BisectingKMeans.scala:337)
> at 
> scala.collection.TraversableOnce$$anonfun$minBy$1.apply(TraversableOnce.scala:231)
>  
> at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
>  
> at scala.collection.immutable.List.foldLeft(List.scala:84) 
> at 
> scala.collection.LinearSeqOptimized$class.reduceLeft(LinearSeqOptimized.scala:125)
>  
> at scala.collection.immutable.List.reduceLeft(List.scala:84) 
> at 
> scala.collection.TraversableOnce$class.minBy(TraversableOnce.scala:231) 
> at scala.collection.AbstractTraversable.minBy(Traversable.scala:105) 
> at 
> org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1.apply(BisectingKMeans.scala:337)
>  
> at 
> org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1.apply(BisectingKMeans.scala:334)
>  
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) 
> at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:389) 
> {code}
> Issue is that , it is failing but not giving any explicit message as to why 
> it failed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18886) Delay scheduling should not delay some executors indefinitely if one task is scheduled before delay timeout

2016-12-20 Thread Mridul Muralidharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15764876#comment-15764876
 ] 

Mridul Muralidharan commented on SPARK-18886:
-

I am not sure what is described will work as expected, [~imranr].
Consider a taskset whose number of tasks is many multiples of the number of 
executors (a fairly common scenario).
In this case, if the timer is never reset, you effectively reduce the delay to 0 
once it expires, across all waves of the taskset.
(I am assuming I understood the proposal correctly.)

[~kayousterhout] and [~markhamstra] might have more comments though, in case I 
am missing something here.


> Delay scheduling should not delay some executors indefinitely if one task is 
> scheduled before delay timeout
> ---
>
> Key: SPARK-18886
> URL: https://issues.apache.org/jira/browse/SPARK-18886
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Imran Rashid
>
> Delay scheduling can introduce an unbounded delay and underutilization of 
> cluster resources under the following circumstances:
> 1. Tasks have locality preferences for a subset of available resources
> 2. Tasks finish in less time than the delay scheduling wait.
> Instead of having *one* delay to wait for resources with better locality, 
> Spark waits indefinitely.
> As an example, consider a cluster with 100 executors, and a taskset with 500 
> tasks.  Say all tasks have a preference for one executor, which is by itself 
> on one host.  Given the default locality wait of 3s per level, we end up with 
> a 6s delay till we schedule on other hosts (process wait + host wait).
> If each task takes 5 seconds (under the 6 second delay), then _all 500_ tasks 
> get scheduled on _only one_ executor.  This means you're only using 1% of 
> your cluster, and you get a ~100x slowdown.  You'd actually be better off if 
> tasks took 7 seconds.
> *WORKAROUNDS*: 
> (1) You can change the locality wait times so that it is shorter than the 
> task execution time.  You need to take into account the sum of all wait times 
> to use all the resources on your cluster.  For example, if you have resources 
> on different racks, this will include the sum of 
> "spark.locality.wait.process" + "spark.locality.wait.node" + 
> "spark.locality.wait.rack".  Those each default to "3s".  The simplest way to 
> be to set "spark.locality.wait.process" to your desired wait interval, and 
> set both "spark.locality.wait.node" and "spark.locality.wait.rack" to "0".  
> For example, if your tasks take ~3 seconds on average, you might set 
> "spark.locality.wait.process" to "1s".
> Note that this workaround isn't perfect -- with less delay scheduling, you may 
> not get as good resource locality.  After this issue is fixed, you'd most 
> likely want to undo these configuration changes.
> (2) The worst case here will only happen if your tasks have extreme skew in 
> their locality preferences.  Users may be able to modify their job to 
> control the distribution of the original input data.
> (2a) A shuffle may end up with very skewed locality preferences, especially 
> if you do a repartition starting from a small number of partitions.  (Shuffle 
> locality preference is assigned if any node has more than 20% of the shuffle 
> input data -- by chance, you may have one node just above that threshold, and 
> all other nodes just below it.)  In this case, you can turn off locality 
> preference for shuffle data by setting 
> {{spark.shuffle.reduceLocality.enabled=false}}
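Concretely, workarounds (1) and (2a) amount to configuration along these lines (a 
sketch; the right wait value depends on your task durations, as described above):
{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Workaround (1): make the total locality wait shorter than a typical task.
  .set("spark.locality.wait.process", "1s")
  .set("spark.locality.wait.node", "0")
  .set("spark.locality.wait.rack", "0")
  // Workaround (2a): ignore locality preferences computed from skewed shuffle data.
  .set("spark.shuffle.reduceLocality.enabled", "false")
{code}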



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18836) Serialize Task Metrics once per stage

2016-12-20 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-18836:
--
Fix Version/s: (was: 1.3.0)
   2.2.0

> Serialize Task Metrics once per stage
> -
>
> Key: SPARK-18836
> URL: https://issues.apache.org/jira/browse/SPARK-18836
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Reporter: Shivaram Venkataraman
>Assignee: Shivaram Venkataraman
> Fix For: 2.2.0
>
>
> Right now we serialize the empty task metrics once per task. Since this object is 
> shared across all tasks, we could reuse the same serialized task metrics across 
> all tasks of a stage.
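The idea is essentially "serialize once, reuse the bytes". A generic sketch of that 
pattern using plain JDK serialization (purely illustrative of the idea, not Spark's 
actual serializer plumbing):
{code}
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Serialize the (identical) empty metrics object once per stage...
def serializeOnce(metrics: Serializable): Array[Byte] = {
  val buffer = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(buffer)
  out.writeObject(metrics)
  out.close()
  buffer.toByteArray
}

// ...then hand the same byte array to every task in the stage instead of
// re-serializing the metrics for each task individually.
{code}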



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18886) Delay scheduling should not delay some executors indefinitely if one task is scheduled before delay timeout

2016-12-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15764795#comment-15764795
 ] 

Apache Spark commented on SPARK-18886:
--

User 'squito' has created a pull request for this issue:
https://github.com/apache/spark/pull/16354

> Delay scheduling should not delay some executors indefinitely if one task is 
> scheduled before delay timeout
> ---
>
> Key: SPARK-18886
> URL: https://issues.apache.org/jira/browse/SPARK-18886
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Imran Rashid
>
> Delay scheduling can introduce an unbounded delay and underutilization of 
> cluster resources under the following circumstances:
> 1. Tasks have locality preferences for a subset of available resources
> 2. Tasks finish in less time than the delay scheduling wait.
> Instead of having *one* delay to wait for resources with better locality, 
> Spark waits indefinitely.
> As an example, consider a cluster with 100 executors, and a taskset with 500 
> tasks.  Say all tasks have a preference for one executor, which is by itself 
> on one host.  Given the default locality wait of 3s per level, we end up with 
> a 6s delay till we schedule on other hosts (process wait + host wait).
> If each task takes 5 seconds (under the 6 second delay), then _all 500_ tasks 
> get scheduled on _only one_ executor.  This means you're only using 1% of 
> your cluster, and you get a ~100x slowdown.  You'd actually be better off if 
> tasks took 7 seconds.
> *WORKAROUNDS*: 
> (1) You can change the locality wait times so that it is shorter than the 
> task execution time.  You need to take into account the sum of all wait times 
> to use all the resources on your cluster.  For example, if you have resources 
> on different racks, this will include the sum of 
> "spark.locality.wait.process" + "spark.locality.wait.node" + 
> "spark.locality.wait.rack".  Those each default to "3s".  The simplest way to 
> be to set "spark.locality.wait.process" to your desired wait interval, and 
> set both "spark.locality.wait.node" and "spark.locality.wait.rack" to "0".  
> For example, if your tasks take ~3 seconds on average, you might set 
> "spark.locality.wait.process" to "1s".
> Note that this workaround isn't perfect -- with less delay scheduling, you may 
> not get as good resource locality.  After this issue is fixed, you'd most 
> likely want to undo these configuration changes.
> (2) The worst case here will only happen if your tasks have extreme skew in 
> their locality preferences.  Users may be able to modify their job to 
> control the distribution of the original input data.
> (2a) A shuffle may end up with very skewed locality preferences, especially 
> if you do a repartition starting from a small number of partitions.  (Shuffle 
> locality preference is assigned if any node has more than 20% of the shuffle 
> input data -- by chance, you may have one node just above that threshold, and 
> all other nodes just below it.)  In this case, you can turn off locality 
> preference for shuffle data by setting 
> {{spark.shuffle.reduceLocality.enabled=false}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18886) Delay scheduling should not delay some executors indefinitely if one task is scheduled before delay timeout

2016-12-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18886:


Assignee: Apache Spark

> Delay scheduling should not delay some executors indefinitely if one task is 
> scheduled before delay timeout
> ---
>
> Key: SPARK-18886
> URL: https://issues.apache.org/jira/browse/SPARK-18886
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Imran Rashid
>Assignee: Apache Spark
>
> Delay scheduling can introduce an unbounded delay and underutilization of 
> cluster resources under the following circumstances:
> 1. Tasks have locality preferences for a subset of available resources
> 2. Tasks finish in less time than the delay scheduling wait.
> Instead of having *one* delay to wait for resources with better locality, 
> Spark waits indefinitely.
> As an example, consider a cluster with 100 executors, and a taskset with 500 
> tasks.  Say all tasks have a preference for one executor, which is by itself 
> on one host.  Given the default locality wait of 3s per level, we end up with 
> a 6s delay till we schedule on other hosts (process wait + host wait).
> If each task takes 5 seconds (under the 6 second delay), then _all 500_ tasks 
> get scheduled on _only one_ executor.  This means you're only using 1% of 
> your cluster, and you get a ~100x slowdown.  You'd actually be better off if 
> tasks took 7 seconds.
> *WORKAROUNDS*: 
> (1) You can change the locality wait times so that it is shorter than the 
> task execution time.  You need to take into account the sum of all wait times 
> to use all the resources on your cluster.  For example, if you have resources 
> on different racks, this will include the sum of 
> "spark.locality.wait.process" + "spark.locality.wait.node" + 
> "spark.locality.wait.rack".  Those each default to "3s".  The simplest way to 
> be to set "spark.locality.wait.process" to your desired wait interval, and 
> set both "spark.locality.wait.node" and "spark.locality.wait.rack" to "0".  
> For example, if your tasks take ~3 seconds on average, you might set 
> "spark.locality.wait.process" to "1s".
> Note that this workaround isn't perfect -- with less delay scheduling, you may 
> not get as good resource locality.  After this issue is fixed, you'd most 
> likely want to undo these configuration changes.
> (2) The worst case here will only happen if your tasks have extreme skew in 
> their locality preferences.  Users may be able to modify their job to 
> control the distribution of the original input data.
> (2a) A shuffle may end up with very skewed locality preferences, especially 
> if you do a repartition starting from a small number of partitions.  (Shuffle 
> locality preference is assigned if any node has more than 20% of the shuffle 
> input data -- by chance, you may have one node just above that threshold, and 
> all other nodes just below it.)  In this case, you can turn off locality 
> preference for shuffle data by setting 
> {{spark.shuffle.reduceLocality.enabled=false}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


