[jira] [Updated] (SPARK-18355) Spark SQL fails to read data from a ORC hive table that has a new column added to it

2017-08-24 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-18355:
--
Affects Version/s: 2.2.0

> Spark SQL fails to read data from a ORC hive table that has a new column 
> added to it
> 
>
> Key: SPARK-18355
> URL: https://issues.apache.org/jira/browse/SPARK-18355
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2, 2.1.0, 2.2.0
> Environment: Centos6
>Reporter: Sandeep Nemuri
>
> *PROBLEM*:
> Spark SQL fails to read data from a ORC hive table that has a new column 
> added to it.
> Below is the exception:
> {code}
> scala> sqlContext.sql("select click_id,search_id from testorc").show
> 16/11/03 22:17:53 INFO ParseDriver: Parsing command: select 
> click_id,search_id from testorc
> 16/11/03 22:17:54 INFO ParseDriver: Parse Completed
> java.lang.AssertionError: assertion failed
>   at scala.Predef$.assert(Predef.scala:165)
>   at 
> org.apache.spark.sql.execution.datasources.LogicalRelation$$anonfun$1.apply(LogicalRelation.scala:39)
>   at 
> org.apache.spark.sql.execution.datasources.LogicalRelation$$anonfun$1.apply(LogicalRelation.scala:38)
>   at scala.Option.map(Option.scala:145)
>   at 
> org.apache.spark.sql.execution.datasources.LogicalRelation.<init>(LogicalRelation.scala:38)
>   at 
> org.apache.spark.sql.execution.datasources.LogicalRelation.copy(LogicalRelation.scala:31)
>   at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog.org$apache$spark$sql$hive$HiveMetastoreCatalog$$convertToOrcRelation(HiveMetastoreCatalog.scala:588)
>   at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog$OrcConversions$$anonfun$apply$2.applyOrElse(HiveMetastoreCatalog.scala:647)
>   at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog$OrcConversions$$anonfun$apply$2.applyOrElse(HiveMetastoreCatalog.scala:643)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:335)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:335)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:334)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:332)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:332)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:281)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> {code}
> *STEPS TO SIMULATE THIS ISSUE*:
> 1) Create table in hive.
> {code}
> CREATE TABLE `testorc`( 
> `click_id` string, 
> `search_id` string, 
> `uid` bigint)
> PARTITIONED BY ( 
> `ts` string, 
> `hour` string) 
> STORED AS ORC; 
> {code}
> 2) Load data into table :
> {code}
> INSERT INTO TABLE testorc PARTITION (ts = '98765',hour = '01' ) VALUES 
> (12,2,12345);
> {code}
> 3) Select through spark shell (This works)
> {code}
> sqlContext.sql("select click_id,search_id from testorc").show
> {code}
> 4) Now add column to hive table
> {code}
> ALTER TABLE testorc ADD COLUMNS (dummy string);
> {code}
> 5) Now again select from spark shell
> {code}
> scala> sqlContext.sql("select click_id,search_id from testorc").show
> 16/11/03 22:17:53 INFO ParseDriver: Parsing command: select 
> click_id,search_id from testorc
> 16/11/03 22:17:54 INFO ParseDriver: Parse Completed
> java.lang.AssertionError: assertion failed
>   at scala.Predef$.assert(Predef.scala:165)
>   at 
> org.apache.spark.sql.execution.datasources.LogicalRelation$$anonfun$1.apply(LogicalRelation.scala:39)
>   at 
> org.apache.spark.sql.execution.datasources.LogicalRelation$$anonfun$1.apply(LogicalRelation.scala:38)
>   at scala.Option.map(Option.scala:145)
>   at 
> org.apache.spark.sql.execution.datasources.LogicalRelation.<init>(LogicalRelation.scala:38)
>   at 
> org.apache.spark.sql.execution.datasources.LogicalRelation.copy(LogicalRelation.scala:31)
>   at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog.org$apache$spark$sql$hive$HiveMetastoreCatalog$$convertToOrcRelation(HiveMetastoreCatalog.scala:588)
>   at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog$OrcConversions$$anonfun$apply$2.applyOrElse(HiveMetastoreCatalog.scala:647)
>   at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog$OrcConversions$$anonfun$apply$2.applyOrElse(HiveMetastoreCatalog.scala:643)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:335)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:335)
>   at 
> org.apache.spark.sql.cata
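> 
> // A hedged workaround sketch (an assumption drawn from the convertToOrcRelation
> // frames above, not something stated in this report): disable the metastore ORC
> // conversion so the read goes through the Hive SerDe path instead.
> scala> sqlContext.setConf("spark.sql.hive.convertMetastoreOrc", "false")
> scala> sqlContext.sql("select click_id,search_id from testorc").show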

[jira] [Closed] (SPARK-21815) Undeterministic group labeling within small connected component

2017-08-24 Thread nguyen duc tuan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nguyen duc tuan closed SPARK-21815.
---
Resolution: Not A Bug

> Undeterministic  group labeling within small connected component
> 
>
> Key: SPARK-21815
> URL: https://issues.apache.org/jira/browse/SPARK-21815
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Affects Versions: 1.6.3, 2.2.0
>Reporter: nguyen duc tuan
>Priority: Trivial
>  Labels: easyfix
>
> As I look at the code in 
> https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/lib/LabelPropagation.scala#L61,
>  when the number of vertices in each community is small and the number of 
> iterations is large enough, all candidate labels will have the same score. Due to the 
> ordering within the set, each vertex can then be assigned a different community id. 
> Ordering by vertexId solves the problem.
> Sample code to reproduce this error:
> {code}
> import org.apache.spark.graphx.{Edge, Graph}
> import org.apache.spark.graphx.lib.LabelPropagation
> import spark.implicits._
> 
> val vertices = spark.sparkContext.parallelize(Seq((1L, 1), (2L, 1)))
> val edges = spark.sparkContext.parallelize(Seq(Edge(1L, 2L, 1)))
> val g = Graph(vertices, edges)
> val c = LabelPropagation.run(g, 5)
> c.vertices.map(x => (x._1, x._2)).toDF.show
> {code}
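> A minimal sketch of the deterministic tie-break proposed above (illustrative only; 
> the counts map and its values are made up, not taken from LabelPropagation):
> {code}
> // When counts tie, maxBy(_._2) depends on the map's iteration order, whereas
> // ordering by (count, label) always picks the same label.
> val counts = Map(1L -> 3L, 2L -> 3L)
> val orderDependentPick = counts.maxBy(_._2)._1
> val deterministicPick = counts.maxBy { case (label, n) => (n, label) }._1
> println(s"order-dependent: $orderDependentPick, deterministic: $deterministicPick")
> {code}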






[jira] [Assigned] (SPARK-21255) NPE when creating encoder for enum

2017-08-24 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-21255:
-

Assignee: Mike

> NPE when creating encoder for enum
> --
>
> Key: SPARK-21255
> URL: https://issues.apache.org/jira/browse/SPARK-21255
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.1.0
> Environment: org.apache.spark:spark-core_2.10:2.1.0
> org.apache.spark:spark-sql_2.10:2.1.0
>Reporter: Mike
>Assignee: Mike
> Fix For: 2.3.0
>
>
> When you try to create an encoder for Enum type (or bean with enum property) 
> via Encoders.bean(...), it fails with NullPointerException at TypeToken:495.
> I did a little research and it turns out, that in JavaTypeInference:126 
> following code 
> {code:java}
> val beanInfo = Introspector.getBeanInfo(typeToken.getRawType)
> val properties = beanInfo.getPropertyDescriptors.filterNot(_.getName == 
> "class")
> val fields = properties.map { property =>
>   val returnType = 
> typeToken.method(property.getReadMethod).getReturnType
>   val (dataType, nullable) = inferDataType(returnType)
>   new StructField(property.getName, dataType, nullable)
> }
> (new StructType(fields), true)
> {code}
> filters out properties named "class", because we wouldn't want to serialize 
> that. But enum types have another property of type Class named 
> "declaringClass", which we are trying to inspect recursively. Eventually we 
> try to inspect ClassLoader class, which has property "defaultAssertionStatus" 
> with no read method, which leads to NPE at TypeToken:495.
> I think adding property name "declaringClass" to filtering will resolve this.
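> A quick, hedged illustration of the property mentioned above (using 
> java.time.DayOfWeek as a stand-in for any enum; the names here are illustrative only):
> {code:java}
> import java.beans.Introspector
> 
> // Bean introspection of an enum exposes a read-only "declaringClass" property
> // (from Enum.getDeclaringClass), which is what pulls Class and ClassLoader into
> // the recursive inference described above.
> val names = Introspector.getBeanInfo(classOf[java.time.DayOfWeek])
>   .getPropertyDescriptors
>   .map(_.getName)
>   .filterNot(n => n == "class" || n == "declaringClass") // the proposed extra filter
> println(names.mkString(", "))
> {code}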






[jira] [Resolved] (SPARK-21255) NPE when creating encoder for enum

2017-08-24 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-21255.
---
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18488
[https://github.com/apache/spark/pull/18488]

> NPE when creating encoder for enum
> --
>
> Key: SPARK-21255
> URL: https://issues.apache.org/jira/browse/SPARK-21255
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.1.0
> Environment: org.apache.spark:spark-core_2.10:2.1.0
> org.apache.spark:spark-sql_2.10:2.1.0
>Reporter: Mike
> Fix For: 2.3.0
>
>
> When you try to create an encoder for Enum type (or bean with enum property) 
> via Encoders.bean(...), it fails with NullPointerException at TypeToken:495.
> I did a little research and it turns out, that in JavaTypeInference:126 
> following code 
> {code:java}
> val beanInfo = Introspector.getBeanInfo(typeToken.getRawType)
> val properties = beanInfo.getPropertyDescriptors.filterNot(_.getName == 
> "class")
> val fields = properties.map { property =>
>   val returnType = 
> typeToken.method(property.getReadMethod).getReturnType
>   val (dataType, nullable) = inferDataType(returnType)
>   new StructField(property.getName, dataType, nullable)
> }
> (new StructType(fields), true)
> {code}
> filters out properties named "class", because we wouldn't want to serialize 
> that. But enum types have another property of type Class named 
> "declaringClass", which we are trying to inspect recursively. Eventually we 
> try to inspect ClassLoader class, which has property "defaultAssertionStatus" 
> with no read method, which leads to NPE at TypeToken:495.
> I think adding property name "declaringClass" to filtering will resolve this.






[jira] [Commented] (SPARK-21691) Accessing canonicalized plan for query with limit throws exception

2017-08-24 Thread Anton Okolnychyi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16141248#comment-16141248
 ] 

Anton Okolnychyi commented on SPARK-21691:
--

[~Robin Shao] My suggestion was to modify the output method in 'Project'. 

[~smilegator] Could you please provide your opinion on the proposed solution 
in the first comment?

> Accessing canonicalized plan for query with limit throws exception
> --
>
> Key: SPARK-21691
> URL: https://issues.apache.org/jira/browse/SPARK-21691
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Bjoern Toldbod
>
> Accessing the logical, canonicalized plan fails for queries with limits.
> The following demonstrates the issue:
> {code:java}
> val session = SparkSession.builder.master("local").getOrCreate()
> // This works
> session.sql("select * from (values 0, 
> 1)").queryExecution.logical.canonicalized
> // This fails
> session.sql("select * from (values 0, 1) limit 
> 1").queryExecution.logical.canonicalized
> {code}
> The message in the thrown exception is somewhat confusing (or at least not 
> directly related to the limit):
> "Invalid call to toAttribute on unresolved object, tree: *"






[jira] [Updated] (SPARK-21828) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB...again

2017-08-24 Thread Takeshi Yamamuro (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-21828:
-
Flags:   (was: Important)

> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB...again
> -
>
> Key: SPARK-21828
> URL: https://issues.apache.org/jira/browse/SPARK-21828
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Otis Smart
>Priority: Critical
>
> Hello!
> 1. I encounter a similar issue (see below text) on Pyspark 2.2 (e.g., 
> dataframe with ~5 rows x 1100+ columns as input to ".fit()" method of 
> CrossValidator() that includes Pipeline() that includes StringIndexer(), 
> VectorAssembler() and DecisionTreeClassifier()).
> 2. Was the aforementioned patch (aka 
> fix(https://github.com/apache/spark/pull/15480) not included in the latest 
> release; what are the reason and (source) of and solution to this persistent 
> issue please?
> py4j.protocol.Py4JJavaError: An error occurred while calling o9396.fit.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 38 
> in stage 18.0 failed 4 times, most recent failure: Lost task 38.3 in stage 
> 18.0 (TID 1996, ip-10-0-14-83.ec2.internal, executor 4): 
> java.util.concurrent.ExecutionException: java.lang.Exception: failed to 
> compile: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "compare(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
> /* 001 */ public SpecificOrdering generate(Object[] references)
> { /* 002 */ return new SpecificOrdering(references); /* 003 */ }
> /* 004 */
> /* 005 */ class SpecificOrdering extends 
> org.apache.spark.sql.catalyst.expressions.codegen.BaseOrdering {
> /* 006 */
> /* 007 */ private Object[] references;
> /* 008 */
> /* 009 */
> /* 010 */ public SpecificOrdering(Object[] references)
> { /* 011 */ this.references = references; /* 012 */ /* 013 */ }
> /* 014 */
> /* 015 */
> /* 016 */
> /* 017 */ public int compare(InternalRow a, InternalRow b) {
> /* 018 */ InternalRow i = null; // Holds current row being evaluated.
> /* 019 */
> /* 020 */ i = a;
> /* 021 */ boolean isNullA;
> /* 022 */ double primitiveA;
> /* 023 */
> { /* 024 */ /* 025 */ double value = i.getDouble(0); /* 026 */ isNullA = 
> false; /* 027 */ primitiveA = value; /* 028 */ }
> /* 029 */ i = b;
> /* 030 */ boolean isNullB;
> /* 031 */ double primitiveB;
> /* 032 */
> { /* 033 */ /* 034 */ double value = i.getDouble(0); /* 035 */ isNullB = 
> false; /* 036 */ primitiveB = value; /* 037 */ }
> /* 038 */ if (isNullA && isNullB)
> { /* 039 */ // Nothing /* 040 */ }
> else if (isNullA)
> { /* 041 */ return -1; /* 042 */ }
> else if (isNullB)
> { /* 043 */ return 1; /* 044 */ }
> else {
> /* 045 */ int comp = 
> org.apache.spark.util.Utils.nanSafeCompareDoubles(primitiveA, primitiveB);
> /* 046 */ if (comp != 0)
> { /* 047 */ return comp; /* 048 */ }
> /* 049 */ }
> /* 050 */
> /* 051 */
> ...






[jira] [Updated] (SPARK-21828) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB...again

2017-08-24 Thread Takeshi Yamamuro (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-21828:
-
Component/s: (was: ML)

> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB...again
> -
>
> Key: SPARK-21828
> URL: https://issues.apache.org/jira/browse/SPARK-21828
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Otis Smart
>Priority: Critical
>
> Hello!
> 1. I encounter a similar issue (see below text) on Pyspark 2.2 (e.g., 
> dataframe with ~5 rows x 1100+ columns as input to ".fit()" method of 
> CrossValidator() that includes Pipeline() that includes StringIndexer(), 
> VectorAssembler() and DecisionTreeClassifier()).
> 2. Was the aforementioned patch (aka 
> fix(https://github.com/apache/spark/pull/15480) not included in the latest 
> release; what are the reason and (source) of and solution to this persistent 
> issue please?
> py4j.protocol.Py4JJavaError: An error occurred while calling o9396.fit.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 38 
> in stage 18.0 failed 4 times, most recent failure: Lost task 38.3 in stage 
> 18.0 (TID 1996, ip-10-0-14-83.ec2.internal, executor 4): 
> java.util.concurrent.ExecutionException: java.lang.Exception: failed to 
> compile: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "compare(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
> /* 001 */ public SpecificOrdering generate(Object[] references)
> { /* 002 */ return new SpecificOrdering(references); /* 003 */ }
> /* 004 */
> /* 005 */ class SpecificOrdering extends 
> org.apache.spark.sql.catalyst.expressions.codegen.BaseOrdering {
> /* 006 */
> /* 007 */ private Object[] references;
> /* 008 */
> /* 009 */
> /* 010 */ public SpecificOrdering(Object[] references)
> { /* 011 */ this.references = references; /* 012 */ /* 013 */ }
> /* 014 */
> /* 015 */
> /* 016 */
> /* 017 */ public int compare(InternalRow a, InternalRow b) {
> /* 018 */ InternalRow i = null; // Holds current row being evaluated.
> /* 019 */
> /* 020 */ i = a;
> /* 021 */ boolean isNullA;
> /* 022 */ double primitiveA;
> /* 023 */
> { /* 024 */ /* 025 */ double value = i.getDouble(0); /* 026 */ isNullA = 
> false; /* 027 */ primitiveA = value; /* 028 */ }
> /* 029 */ i = b;
> /* 030 */ boolean isNullB;
> /* 031 */ double primitiveB;
> /* 032 */
> { /* 033 */ /* 034 */ double value = i.getDouble(0); /* 035 */ isNullB = 
> false; /* 036 */ primitiveB = value; /* 037 */ }
> /* 038 */ if (isNullA && isNullB)
> { /* 039 */ // Nothing /* 040 */ }
> else if (isNullA)
> { /* 041 */ return -1; /* 042 */ }
> else if (isNullB)
> { /* 043 */ return 1; /* 044 */ }
> else {
> /* 045 */ int comp = 
> org.apache.spark.util.Utils.nanSafeCompareDoubles(primitiveA, primitiveB);
> /* 046 */ if (comp != 0)
> { /* 047 */ return comp; /* 048 */ }
> /* 049 */ }
> /* 050 */
> /* 051 */
> ...






[jira] [Commented] (SPARK-21828) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB...again

2017-08-24 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16141234#comment-16141234
 ] 

Takeshi Yamamuro commented on SPARK-21828:
--

You can't set the target version or similar fields; these are reserved for 
committers. Please first see: http://spark.apache.org/contributing.html

> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB...again
> -
>
> Key: SPARK-21828
> URL: https://issues.apache.org/jira/browse/SPARK-21828
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SQL
>Affects Versions: 2.1.0
>Reporter: Otis Smart
>Priority: Critical
>
> Hello!
> 1. I encounter a similar issue (see below text) on Pyspark 2.2 (e.g., 
> dataframe with ~5 rows x 1100+ columns as input to ".fit()" method of 
> CrossValidator() that includes Pipeline() that includes StringIndexer(), 
> VectorAssembler() and DecisionTreeClassifier()).
> 2. Was the aforementioned patch (aka 
> fix(https://github.com/apache/spark/pull/15480) not included in the latest 
> release; what are the reason and (source) of and solution to this persistent 
> issue please?
> py4j.protocol.Py4JJavaError: An error occurred while calling o9396.fit.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 38 
> in stage 18.0 failed 4 times, most recent failure: Lost task 38.3 in stage 
> 18.0 (TID 1996, ip-10-0-14-83.ec2.internal, executor 4): 
> java.util.concurrent.ExecutionException: java.lang.Exception: failed to 
> compile: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "compare(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
> /* 001 */ public SpecificOrdering generate(Object[] references)
> { /* 002 */ return new SpecificOrdering(references); /* 003 */ }
> /* 004 */
> /* 005 */ class SpecificOrdering extends 
> org.apache.spark.sql.catalyst.expressions.codegen.BaseOrdering {
> /* 006 */
> /* 007 */ private Object[] references;
> /* 008 */
> /* 009 */
> /* 010 */ public SpecificOrdering(Object[] references)
> { /* 011 */ this.references = references; /* 012 */ /* 013 */ }
> /* 014 */
> /* 015 */
> /* 016 */
> /* 017 */ public int compare(InternalRow a, InternalRow b) {
> /* 018 */ InternalRow i = null; // Holds current row being evaluated.
> /* 019 */
> /* 020 */ i = a;
> /* 021 */ boolean isNullA;
> /* 022 */ double primitiveA;
> /* 023 */
> { /* 024 */ /* 025 */ double value = i.getDouble(0); /* 026 */ isNullA = 
> false; /* 027 */ primitiveA = value; /* 028 */ }
> /* 029 */ i = b;
> /* 030 */ boolean isNullB;
> /* 031 */ double primitiveB;
> /* 032 */
> { /* 033 */ /* 034 */ double value = i.getDouble(0); /* 035 */ isNullB = 
> false; /* 036 */ primitiveB = value; /* 037 */ }
> /* 038 */ if (isNullA && isNullB)
> { /* 039 */ // Nothing /* 040 */ }
> else if (isNullA)
> { /* 041 */ return -1; /* 042 */ }
> else if (isNullB)
> { /* 043 */ return 1; /* 044 */ }
> else {
> /* 045 */ int comp = 
> org.apache.spark.util.Utils.nanSafeCompareDoubles(primitiveA, primitiveB);
> /* 046 */ if (comp != 0)
> { /* 047 */ return comp; /* 048 */ }
> /* 049 */ }
> /* 050 */
> /* 051 */
> ...






[jira] [Commented] (SPARK-21835) RewritePredicateSubquery should not produce unresolved query plans

2017-08-24 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16141233#comment-16141233
 ] 

Liang-Chi Hsieh commented on SPARK-21835:
-

Submitted PR at https://github.com/apache/spark/pull/19050

> RewritePredicateSubquery should not produce unresolved query plans
> --
>
> Key: SPARK-21835
> URL: https://issues.apache.org/jira/browse/SPARK-21835
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Liang-Chi Hsieh
>
> {{RewritePredicateSubquery}} rewrites correlated subqueries to join operations. 
> While checking structural integrity, I found that {{RewritePredicateSubquery}} can 
> produce unresolved query plans due to conflicting attributes. We should not 
> let {{RewritePredicateSubquery}} produce unresolved plans.






[jira] [Updated] (SPARK-21828) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB...again

2017-08-24 Thread Takeshi Yamamuro (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-21828:
-
Target Version/s:   (was: 2.1.0, 2.2.0)

> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB...again
> -
>
> Key: SPARK-21828
> URL: https://issues.apache.org/jira/browse/SPARK-21828
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SQL
>Affects Versions: 2.1.0
>Reporter: Otis Smart
>Priority: Critical
>
> Hello!
> 1. I encounter a similar issue (see below text) on Pyspark 2.2 (e.g., 
> dataframe with ~5 rows x 1100+ columns as input to ".fit()" method of 
> CrossValidator() that includes Pipeline() that includes StringIndexer(), 
> VectorAssembler() and DecisionTreeClassifier()).
> 2. Was the aforementioned patch (aka 
> fix(https://github.com/apache/spark/pull/15480) not included in the latest 
> release; what are the reason and (source) of and solution to this persistent 
> issue please?
> py4j.protocol.Py4JJavaError: An error occurred while calling o9396.fit.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 38 
> in stage 18.0 failed 4 times, most recent failure: Lost task 38.3 in stage 
> 18.0 (TID 1996, ip-10-0-14-83.ec2.internal, executor 4): 
> java.util.concurrent.ExecutionException: java.lang.Exception: failed to 
> compile: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "compare(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
> /* 001 */ public SpecificOrdering generate(Object[] references)
> { /* 002 */ return new SpecificOrdering(references); /* 003 */ }
> /* 004 */
> /* 005 */ class SpecificOrdering extends 
> org.apache.spark.sql.catalyst.expressions.codegen.BaseOrdering {
> /* 006 */
> /* 007 */ private Object[] references;
> /* 008 */
> /* 009 */
> /* 010 */ public SpecificOrdering(Object[] references)
> { /* 011 */ this.references = references; /* 012 */ /* 013 */ }
> /* 014 */
> /* 015 */
> /* 016 */
> /* 017 */ public int compare(InternalRow a, InternalRow b) {
> /* 018 */ InternalRow i = null; // Holds current row being evaluated.
> /* 019 */
> /* 020 */ i = a;
> /* 021 */ boolean isNullA;
> /* 022 */ double primitiveA;
> /* 023 */
> { /* 024 */ /* 025 */ double value = i.getDouble(0); /* 026 */ isNullA = 
> false; /* 027 */ primitiveA = value; /* 028 */ }
> /* 029 */ i = b;
> /* 030 */ boolean isNullB;
> /* 031 */ double primitiveB;
> /* 032 */
> { /* 033 */ /* 034 */ double value = i.getDouble(0); /* 035 */ isNullB = 
> false; /* 036 */ primitiveB = value; /* 037 */ }
> /* 038 */ if (isNullA && isNullB)
> { /* 039 */ // Nothing /* 040 */ }
> else if (isNullA)
> { /* 041 */ return -1; /* 042 */ }
> else if (isNullB)
> { /* 043 */ return 1; /* 044 */ }
> else {
> /* 045 */ int comp = 
> org.apache.spark.util.Utils.nanSafeCompareDoubles(primitiveA, primitiveB);
> /* 046 */ if (comp != 0)
> { /* 047 */ return comp; /* 048 */ }
> /* 049 */ }
> /* 050 */
> /* 051 */
> ...






[jira] [Created] (SPARK-21835) RewritePredicateSubquery should not produce unresolved query plans

2017-08-24 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-21835:
---

 Summary: RewritePredicateSubquery should not produce unresolved 
query plans
 Key: SPARK-21835
 URL: https://issues.apache.org/jira/browse/SPARK-21835
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Liang-Chi Hsieh


{{RewritePredicateSubquery}} rewrites correlated subqueries to join operations. 
While checking structural integrity, I found that {{RewritePredicateSubquery}} can 
produce unresolved query plans due to conflicting attributes. We should not let 
{{RewritePredicateSubquery}} produce unresolved plans.
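
A minimal, self-contained illustration of the kind of plan this rule rewrites (it 
assumes a SparkSession named spark; the query itself is made up for illustration):

{code}
// An IN predicate subquery; RewritePredicateSubquery turns it into a left-semi
// join during optimization, and the resulting optimized plan should still be
// resolved.
val df = spark.sql(
  "SELECT * FROM VALUES (1), (2) AS t(a) WHERE a IN (SELECT a FROM VALUES (1) AS s(a))")
df.queryExecution.optimizedPlan
{code}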






[jira] [Comment Edited] (SPARK-21828) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB...again

2017-08-24 Thread Otis Smart (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16141138#comment-16141138
 ] 

Otis Smart edited comment on SPARK-21828 at 8/25/17 4:19 AM:
-

Hi KI: Thank you for the expedient reply!
* Here (below) is example code that generates the error in PySpark 2.1.
* Please forgive me...I initially and inadvertently ran this code on a Spark 
2.1 (rather than Spark 2.2) cluster, but moments ago I began a test on a Spark 
2.2 cluster (definitely this time). Nonetheless, troubleshooting and 
investigating the aforementioned error may aid others on Spark 2.1 if my 
ongoing test yields no error in PySpark 2.2.

Gratefully + Best,

OS


{code:python}
# OTIS SMART: 24.08.2017 (https://issues.apache.org/jira/browse/SPARK-21828)


# 
--
#
# 
--

#
from pyspark import SparkConf, SparkContext

#
from pyspark.sql import HiveContext
from pyspark.sql.functions import col, lit

#
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

#
from pyspark.mllib.random import RandomRDDs as randrdd


# 
--
#
# 
--

# 
def computeDT(df):

#
namevAll = df.columns
namevP = namevAll[:-1]
namevC = namevAll[-1]

#
training_data, testing_data = df.randomSplit([0.70, 0.30])
print "done: randomSplit"

# 
variableoutStringIndexer = StringIndexer(inputCol=namevC, 
outputCol='variableout')
print "done: StringIndexer"

# 
variablesinAssembler = VectorAssembler(inputCols=namevP, 
outputCol='variablesin')
print "done: VectorAssembler"

# 
dTree = DecisionTreeClassifier(labelCol='variableout',
   featuresCol='variablesin',
   impurity='gini',
   maxBins=3)
print "done: DecisionTreeClassifier"

# 
pipeline = Pipeline(stages=[variableoutStringIndexer, variablesinAssembler, 
dTree])
print "done: Pipeline"

# 
paramgrid = ParamGridBuilder().addGrid(dTree.maxDepth, [3, 7, 8]).build()
print "done: ParamGridBuilder"

#
evaluator = MulticlassClassificationEvaluator(labelCol='variableout',
  predictionCol='prediction', 
metricName="f1")
print "done: MulticlassClassificationEvaluator"

#
CrossValidator(estimator=pipeline,
  estimatorParamMaps=paramgrid,
  evaluator=evaluator,
  numFolds=10).fit(training_data)
print "done: CrossValidatorModel"


#
return True


# 
--
#
# 
--

#
def main():

#
numcols = 1234
numrows = 23456

#
x = randrdd.normalVectorRDD(hiveContext, numrows, numcols, seed=1)
y = randrdd.normalVectorRDD(hiveContext, numrows, numcols, seed=2)

#
dfx = HiveContext.createDataFrame(hiveContext,
x.map(lambda v: tuple([float(ii) for ii in v])).collect(), 
["v{0:0>4}".format(jj) for jj in range(0, numcols)])

#
dfy = HiveContext.createDataFrame(hiveContext,
y.map(lambda v: tuple([float(ii) for ii in v])).collect(), 
["v{0:0>4}".format(jj) for jj in range(0, numcols)])

#
dfx = dfx.withColumn("v{0:0>4}".format(numcols), 
dfx["v{0:0>4}".format(numcols - 1)] + 5).withColumn("V", lit('a'))
dfy = dfy.withColumn("v{0:0>4}".format(numcols), 
dfy["v{0:0>4}".format(numcols - 1)] - 5).withColumn("V", lit('b'))

#
df = dfx.union(dfy)

#
df.cache()

#
df.printSchema()
df.show(n=10, truncate=False)

#
computeDT(df)

#
df.unpersist()

#
return True


# 
--
# CONFIGURE SPARK CONTEXT THEN RUN 'MAIN' ANALYSIS
# 
--

#
if __name__ == "__main__":

#
conf = 
SparkConf().setAppName("TroubleshootPyspark.ASFJira2

[jira] [Commented] (SPARK-21828) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB...again

2017-08-24 Thread Otis Smart (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16141138#comment-16141138
 ] 

Otis Smart commented on SPARK-21828:


Hi KI: Thank you for the expedient reply!
* Here (below) is example code that generates the error in PySpark 2.1.
* Please forgive me...I initially and inadvertently ran this code on a Spark 
2.1 (rather than Spark 2.2) cluster, but moments ago I began a test on a Spark 
2.2 cluster (definitely this time). Nonetheless, troubleshooting and 
investigating the aforementioned error may aid others on Spark 2.1 if my 
ongoing test yields no error in PySpark 2.2.

Gratefully + Best,

OS


# OTIS SMART: 24.08.2017 (https://issues.apache.org/jira/browse/SPARK-21828)


# 
--
#
# 
--

#
from pyspark import SparkConf, SparkContext

#
from pyspark.sql import HiveContext
from pyspark.sql.functions import col, lit

#
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

#
from pyspark.mllib.random import RandomRDDs as randrdd


# 
--
#
# 
--

# 
def computeDT(df):

#
namevAll = df.columns
namevP = namevAll[:-1]
namevC = namevAll[-1]

#
training_data, testing_data = df.randomSplit([0.70, 0.30])
print "done: randomSplit"

# 
variableoutStringIndexer = StringIndexer(inputCol=namevC, 
outputCol='variableout')
print "done: StringIndexer"

# 
variablesinAssembler = VectorAssembler(inputCols=namevP, 
outputCol='variablesin')
print "done: VectorAssembler"

# 
dTree = DecisionTreeClassifier(labelCol='variableout',
   featuresCol='variablesin',
   impurity='gini',
   maxBins=3)
print "done: DecisionTreeClassifier"

# 
pipeline = Pipeline(stages=[variableoutStringIndexer, variablesinAssembler, 
dTree])
print "done: Pipeline"

# 
paramgrid = ParamGridBuilder().addGrid(dTree.maxDepth, [3, 7, 8]).build()
print "done: ParamGridBuilder"

#
evaluator = MulticlassClassificationEvaluator(labelCol='variableout',
  predictionCol='prediction', 
metricName="f1")
print "done: MulticlassClassificationEvaluator"

#
CrossValidator(estimator=pipeline,
  estimatorParamMaps=paramgrid,
  evaluator=evaluator,
  numFolds=10).fit(training_data)
print "done: CrossValidatorModel"


#
return True


# 
--
#
# 
--

#
def main():

#
numcols = 1234
numrows = 23456

#
x = randrdd.normalVectorRDD(hiveContext, numrows, numcols, seed=1)
y = randrdd.normalVectorRDD(hiveContext, numrows, numcols, seed=2)

#
dfx = HiveContext.createDataFrame(hiveContext,
x.map(lambda v: tuple([float(ii) for ii in v])).collect(), 
["v{0:0>4}".format(jj) for jj in range(0, numcols)])

#
dfy = HiveContext.createDataFrame(hiveContext,
y.map(lambda v: tuple([float(ii) for ii in v])).collect(), 
["v{0:0>4}".format(jj) for jj in range(0, numcols)])

#
dfx = dfx.withColumn("v{0:0>4}".format(numcols), 
dfx["v{0:0>4}".format(numcols - 1)] + 5).withColumn("V", lit('a'))
dfy = dfy.withColumn("v{0:0>4}".format(numcols), 
dfy["v{0:0>4}".format(numcols - 1)] - 5).withColumn("V", lit('b'))

#
df = dfx.union(dfy)

#
df.cache()

#
df.printSchema()
df.show(n=10, truncate=False)

#
computeDT(df)

#
df.unpersist()

#
return True


# 
--
# CONFIGURE SPARK CONTEXT THEN RUN 'MAIN' ANALYSIS
# 
--

#
if __name__ == "__main__":

#
conf = 
SparkConf().setAppName("TroubleshootPyspark.ASFJira21828.Otis+Kazuaki"). \
set("spark.sql.tungsten.enabled"

[jira] [Updated] (SPARK-21828) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB...again

2017-08-24 Thread Otis Smart (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Smart updated SPARK-21828:
---
Affects Version/s: (was: 2.2.0)
   2.1.0

> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB...again
> -
>
> Key: SPARK-21828
> URL: https://issues.apache.org/jira/browse/SPARK-21828
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SQL
>Affects Versions: 2.1.0
>Reporter: Otis Smart
>Priority: Critical
>
> Hello!
> 1. I encounter a similar issue (see below text) on Pyspark 2.2 (e.g., 
> dataframe with ~5 rows x 1100+ columns as input to ".fit()" method of 
> CrossValidator() that includes Pipeline() that includes StringIndexer(), 
> VectorAssembler() and DecisionTreeClassifier()).
> 2. Was the aforementioned patch (aka 
> fix(https://github.com/apache/spark/pull/15480) not included in the latest 
> release; what are the reason and (source) of and solution to this persistent 
> issue please?
> py4j.protocol.Py4JJavaError: An error occurred while calling o9396.fit.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 38 
> in stage 18.0 failed 4 times, most recent failure: Lost task 38.3 in stage 
> 18.0 (TID 1996, ip-10-0-14-83.ec2.internal, executor 4): 
> java.util.concurrent.ExecutionException: java.lang.Exception: failed to 
> compile: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "compare(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
> /* 001 */ public SpecificOrdering generate(Object[] references)
> { /* 002 */ return new SpecificOrdering(references); /* 003 */ }
> /* 004 */
> /* 005 */ class SpecificOrdering extends 
> org.apache.spark.sql.catalyst.expressions.codegen.BaseOrdering {
> /* 006 */
> /* 007 */ private Object[] references;
> /* 008 */
> /* 009 */
> /* 010 */ public SpecificOrdering(Object[] references)
> { /* 011 */ this.references = references; /* 012 */ /* 013 */ }
> /* 014 */
> /* 015 */
> /* 016 */
> /* 017 */ public int compare(InternalRow a, InternalRow b) {
> /* 018 */ InternalRow i = null; // Holds current row being evaluated.
> /* 019 */
> /* 020 */ i = a;
> /* 021 */ boolean isNullA;
> /* 022 */ double primitiveA;
> /* 023 */
> { /* 024 */ /* 025 */ double value = i.getDouble(0); /* 026 */ isNullA = 
> false; /* 027 */ primitiveA = value; /* 028 */ }
> /* 029 */ i = b;
> /* 030 */ boolean isNullB;
> /* 031 */ double primitiveB;
> /* 032 */
> { /* 033 */ /* 034 */ double value = i.getDouble(0); /* 035 */ isNullB = 
> false; /* 036 */ primitiveB = value; /* 037 */ }
> /* 038 */ if (isNullA && isNullB)
> { /* 039 */ // Nothing /* 040 */ }
> else if (isNullA)
> { /* 041 */ return -1; /* 042 */ }
> else if (isNullB)
> { /* 043 */ return 1; /* 044 */ }
> else {
> /* 045 */ int comp = 
> org.apache.spark.util.Utils.nanSafeCompareDoubles(primitiveA, primitiveB);
> /* 046 */ if (comp != 0)
> { /* 047 */ return comp; /* 048 */ }
> /* 049 */ }
> /* 050 */
> /* 051 */
> ...






[jira] [Updated] (SPARK-21828) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB...again

2017-08-24 Thread Otis Smart (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Smart updated SPARK-21828:
---
Target Version/s: 2.2.0, 2.1.0  (was: 2.2.0)

> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB...again
> -
>
> Key: SPARK-21828
> URL: https://issues.apache.org/jira/browse/SPARK-21828
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SQL
>Affects Versions: 2.1.0
>Reporter: Otis Smart
>Priority: Critical
>
> Hello!
> 1. I encounter a similar issue (see below text) on Pyspark 2.2 (e.g., 
> dataframe with ~5 rows x 1100+ columns as input to ".fit()" method of 
> CrossValidator() that includes Pipeline() that includes StringIndexer(), 
> VectorAssembler() and DecisionTreeClassifier()).
> 2. Was the aforementioned patch (aka 
> fix(https://github.com/apache/spark/pull/15480) not included in the latest 
> release; what are the reason and (source) of and solution to this persistent 
> issue please?
> py4j.protocol.Py4JJavaError: An error occurred while calling o9396.fit.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 38 
> in stage 18.0 failed 4 times, most recent failure: Lost task 38.3 in stage 
> 18.0 (TID 1996, ip-10-0-14-83.ec2.internal, executor 4): 
> java.util.concurrent.ExecutionException: java.lang.Exception: failed to 
> compile: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "compare(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
> /* 001 */ public SpecificOrdering generate(Object[] references)
> { /* 002 */ return new SpecificOrdering(references); /* 003 */ }
> /* 004 */
> /* 005 */ class SpecificOrdering extends 
> org.apache.spark.sql.catalyst.expressions.codegen.BaseOrdering {
> /* 006 */
> /* 007 */ private Object[] references;
> /* 008 */
> /* 009 */
> /* 010 */ public SpecificOrdering(Object[] references)
> { /* 011 */ this.references = references; /* 012 */ /* 013 */ }
> /* 014 */
> /* 015 */
> /* 016 */
> /* 017 */ public int compare(InternalRow a, InternalRow b) {
> /* 018 */ InternalRow i = null; // Holds current row being evaluated.
> /* 019 */
> /* 020 */ i = a;
> /* 021 */ boolean isNullA;
> /* 022 */ double primitiveA;
> /* 023 */
> { /* 024 */ /* 025 */ double value = i.getDouble(0); /* 026 */ isNullA = 
> false; /* 027 */ primitiveA = value; /* 028 */ }
> /* 029 */ i = b;
> /* 030 */ boolean isNullB;
> /* 031 */ double primitiveB;
> /* 032 */
> { /* 033 */ /* 034 */ double value = i.getDouble(0); /* 035 */ isNullB = 
> false; /* 036 */ primitiveB = value; /* 037 */ }
> /* 038 */ if (isNullA && isNullB)
> { /* 039 */ // Nothing /* 040 */ }
> else if (isNullA)
> { /* 041 */ return -1; /* 042 */ }
> else if (isNullB)
> { /* 043 */ return 1; /* 044 */ }
> else {
> /* 045 */ int comp = 
> org.apache.spark.util.Utils.nanSafeCompareDoubles(primitiveA, primitiveB);
> /* 046 */ if (comp != 0)
> { /* 047 */ return comp; /* 048 */ }
> /* 049 */ }
> /* 050 */
> /* 051 */
> ...






[jira] [Commented] (SPARK-18085) SPIP: Better History Server scalability for many / large applications

2017-08-24 Thread jincheng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16141091#comment-16141091
 ] 

jincheng commented on SPARK-18085:
--

It is really a good idea to speed up the History Server with LevelDB. I ran into a 
problem with the following exception.

{code:java}
java.lang.IndexOutOfBoundsException: Page 1 is out of range. Please select a 
page number between 1 and 0.
at org.apache.spark.ui.PagedDataSource.pageData(PagedTable.scala:56)
at org.apache.spark.ui.PagedTable$class.table(PagedTable.scala:108)
at org.apache.spark.ui.jobs.TaskPagedTable.table(StagePage.scala:703)
at org.apache.spark.ui.jobs.StagePage.liftedTree1$1(StagePage.scala:296)
at org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:285)
at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82)
at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82)
at org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:88)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:687)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
at 
org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:845)
at 
org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1689)
at 
org.apache.spark.deploy.history.ApplicationCacheCheckFilter.doFilter(ApplicationCache.scala:438)
at 
org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1676)
at 
org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:581)
at 
org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
at 
org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:511)
at 
org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
at 
org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at 
org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
at 
org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
at org.spark_project.jetty.server.Server.handle(Server.java:524)
at org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:319)
at 
org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:253)
at 
org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
at org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:95)
at 
org.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
at 
org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
at 
org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
at 
org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
at 
org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
at 
org.spark_project.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
at java.lang.Thread.run(Thread.java:748)
{code}

It occurs when clicking on stages that have no successful tasks.


> SPIP: Better History Server scalability for many / large applications
> -
>
> Key: SPARK-18085
> URL: https://issues.apache.org/jira/browse/SPARK-18085
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core, Web UI
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
>  Labels: SPIP
> Attachments: spark_hs_next_gen.pdf
>
>
> It's a known fact that the History Server currently has some annoying issues 
> when serving lots of applications, and when serving large applications.
> I'm filing this umbrella to track work related to addressing those issues. 
> I'll be attaching a document shortly describing the issues and suggesting a 
> path to how to solve them.






[jira] [Commented] (SPARK-21829) Enable config to permanently blacklist a list of nodes

2017-08-24 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16141081#comment-16141081
 ] 

Saisai Shao commented on SPARK-21829:
-

Cross-posting the comment here. Since you're running Spark on YARN, I think 
node labels should address your scenario well.

The changes you made in BlacklistTracker seem to break the design purpose of 
the blacklist. The blacklist in Spark, as well as in MR/Tez, assumes bad 
nodes/executors will be back to normal within several hours, so it always has a 
blacklist timeout.

In your case, the problem is not bad nodes/executors; it is that you don't want 
to start executors on some nodes (like slow nodes). This is more of a cluster 
manager problem than a Spark problem. To summarize, you want your Spark 
application to run only on some specific nodes.

To solve this on YARN, you could use node labels; Spark on YARN already 
supports node labels. You can look up YARN node labels for the details.

For standalone mode, simply do not start workers on the nodes you don't want used.

For Mesos I'm not sure, but I guess it also has similar mechanisms.
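
A minimal configuration sketch (it assumes a YARN node label named "fast" has 
already been created and assigned on the cluster; the label name is purely 
illustrative):

{code}
// Pin one application's AM and executors to nodes carrying the hypothetical
// "fast" label instead of permanently blacklisting the rest of the cluster.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("node-label-example")
  .set("spark.yarn.am.nodeLabelExpression", "fast")
  .set("spark.yarn.executor.nodeLabelExpression", "fast")
{code}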

> Enable config to permanently blacklist a list of nodes
> --
>
> Key: SPARK-21829
> URL: https://issues.apache.org/jira/browse/SPARK-21829
> Project: Spark
>  Issue Type: New Feature
>  Components: Scheduler, Spark Core
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Luca Canali
>Priority: Minor
>
> The idea for this proposal comes from a performance incident in a local 
> cluster where a job was found to be very slow because of a long tail of stragglers 
> due to 2 nodes in the cluster being slow to access a remote filesystem.
> The issue was limited to the 2 machines and was related to external 
> configurations: the 2 machines that performed badly when accessing the remote 
> file system were behaving normally for other jobs in the cluster (a shared 
> YARN cluster).
> With this new feature I propose to introduce a mechanism to allow users to 
> specify a list of nodes in the cluster where executors/tasks should not run 
> for a specific job.
> The proposed implementation that I tested (see PR) uses the Spark blacklist 
> mechanism. With the parameter spark.blacklist.alwaysBlacklistedNodes, a list 
> of user-specified nodes is added to the blacklist at the start of the Spark 
> Context and it is never expired. 
> I have tested this on a YARN cluster on a case taken from the original 
> production problem and I confirm a performance improvement of about 5x for 
> the specific test case I have. I imagine that there can be other cases where 
> Spark users may want to blacklist a set of nodes. This can be used for 
> troubleshooting, including cases where certain nodes/executors are slow for a 
> given workload and this is caused by external agents, so the anomaly is not 
> picked up by the cluster manager.






[jira] [Comment Edited] (SPARK-21822) When insert Hive Table is finished, it is better to clean out the tmpLocation dir

2017-08-24 Thread lufei (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16141030#comment-16141030
 ] 

lufei edited comment on SPARK-21822 at 8/25/17 1:26 AM:


I'm so sorry for closing this issue by mistake, so I am reopening it.


was (Author: figo):
I close this issue by mistake.

> When insert Hive Table is finished, it is better to clean out the tmpLocation 
> dir
> -
>
> Key: SPARK-21822
> URL: https://issues.apache.org/jira/browse/SPARK-21822
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: lufei
>Priority: Minor
>
> When insert Hive Table is finished, it is better to clean out the tmpLocation 
> dir(the temp directories like 
> ".hive-staging_hive_2017-08-19_10-56-01_540_5448395226195533570-9/-ext-1" 
> or "/tmp/hive/..." for an old spark version).
> Otherwise, when lots of spark job are executed, millions of temporary 
> directories are left in HDFS. And these temporary directories can only be 
> deleted by the maintainer through the shell script.






[jira] [Reopened] (SPARK-21822) When insert Hive Table is finished, it is better to clean out the tmpLocation dir

2017-08-24 Thread lufei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lufei reopened SPARK-21822:
---

I closed this issue by mistake.

> When insert Hive Table is finished, it is better to clean out the tmpLocation 
> dir
> -
>
> Key: SPARK-21822
> URL: https://issues.apache.org/jira/browse/SPARK-21822
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: lufei
>Priority: Minor
>
> When insert Hive Table is finished, it is better to clean out the tmpLocation 
> dir(the temp directories like 
> ".hive-staging_hive_2017-08-19_10-56-01_540_5448395226195533570-9/-ext-1" 
> or "/tmp/hive/..." for an old spark version).
> Otherwise, when lots of spark job are executed, millions of temporary 
> directories are left in HDFS. And these temporary directories can only be 
> deleted by the maintainer through the shell script.






[jira] [Commented] (SPARK-21822) When insert Hive Table is finished, it is better to clean out the tmpLocation dir

2017-08-24 Thread lufei (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16141026#comment-16141026
 ] 

lufei commented on SPARK-21822:
---

[~sowen] I'm sorry, I didn't get your meaning before. I have now submitted a 
pull request (https://github.com/apache/spark/pull/19035).
Thank you for your time.

> When insert Hive Table is finished, it is better to clean out the tmpLocation 
> dir
> -
>
> Key: SPARK-21822
> URL: https://issues.apache.org/jira/browse/SPARK-21822
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: lufei
>Priority: Minor
>
> When insert Hive Table is finished, it is better to clean out the tmpLocation 
> dir(the temp directories like 
> ".hive-staging_hive_2017-08-19_10-56-01_540_5448395226195533570-9/-ext-1" 
> or "/tmp/hive/..." for an old spark version).
> Otherwise, when lots of spark job are executed, millions of temporary 
> directories are left in HDFS. And these temporary directories can only be 
> deleted by the maintainer through the shell script.






[jira] [Issue Comment Deleted] (SPARK-21822) When insert Hive Table is finished, it is better to clean out the tmpLocation dir

2017-08-24 Thread lufei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lufei updated SPARK-21822:
--
Comment: was deleted

(was: [~sowen]Ok,I got it.I will close this issue immediately,thanks.)

> When insert Hive Table is finished, it is better to clean out the tmpLocation 
> dir
> -
>
> Key: SPARK-21822
> URL: https://issues.apache.org/jira/browse/SPARK-21822
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: lufei
>Priority: Minor
>
> When insert Hive Table is finished, it is better to clean out the tmpLocation 
> dir(the temp directories like 
> ".hive-staging_hive_2017-08-19_10-56-01_540_5448395226195533570-9/-ext-1" 
> or "/tmp/hive/..." for an old spark version).
> Otherwise, when lots of spark job are executed, millions of temporary 
> directories are left in HDFS. And these temporary directories can only be 
> deleted by the maintainer through the shell script.






[jira] [Commented] (SPARK-19571) tests are failing to run on Windows with another instance Derby error with Hadoop 2.6.5

2017-08-24 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16141014#comment-16141014
 ] 

Hyukjin Kwon commented on SPARK-19571:
--

Sure, thanks!

> tests are failing to run on Windows with another instance Derby error with 
> Hadoop 2.6.5
> ---
>
> Key: SPARK-19571
> URL: https://issues.apache.org/jira/browse/SPARK-19571
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, SQL
>Affects Versions: 2.2.0
>Reporter: Felix Cheung
>Assignee: Hyukjin Kwon
> Fix For: 2.2.0
>
>
> Between 
> https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/751-master
> https://github.com/apache/spark/commit/7a7ce272fe9a703f58b0180a9d2001ecb5c4b8db
> And
> https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/758-master
> https://github.com/apache/spark/commit/c618ccdbe9ac103dfa3182346e2a14a1e7fca91a
> Something is changed (not likely caused by R) such that tests running on 
> Windows are consistently failing with
> {code}
> Caused by: ERROR XSDB6: Another instance of Derby may have already booted the 
> database 
> C:\Users\appveyor\AppData\Local\Temp\1\spark-75266bb9-bd54-4ee2-ae54-2122d2c011e8\metastore.
>   at org.apache.derby.iapi.error.StandardException.newException(Unknown 
> Source)
>   at org.apache.derby.iapi.error.StandardException.newException(Unknown 
> Source)
>   at 
> org.apache.derby.impl.store.raw.data.BaseDataFileFactory.privGetJBMSLockOnDB(Unknown
>  Source)
>   at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.run(Unknown 
> Source)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at 
> org.apache.derby.impl.store.raw.data.BaseDataFileFactory.getJBMSLockOnDB(Unknown
>  Source)
>   at 
> org.apache.derby.impl.store.raw.data.BaseDataFileFactory.boot(Unknown Source)
>   at org.apache.derby.impl.services.monitor.BaseMonitor.boot(Unknown 
> Source)
>   at org.apache.derby.impl.services.monitor.TopService.bootModule(Unknown 
> Source)
> {code}
> Since we run AppVeyor only when there are R changes, it is a bit harder to
> track down which change specifically caused this.
> We also can't run AppVeyor on branch-2.1, so it could also be broken there.
> This could be a blocker, since it could fail tests for the R release.
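For reference, one way such Derby collisions are commonly avoided in tests is to give each JVM its own embedded metastore. A minimal sketch, assuming Hive support is available and using the standard {{javax.jdo.option.ConnectionURL}} key passed through {{spark.hadoop.*}}; the temp-directory layout is only an example, and this is not the fix for whatever changed between those two builds:

{code}
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Give this JVM its own Derby metastore and warehouse so a second run cannot
// hit "Another instance of Derby may have already booted the database".
val scratch = Files.createTempDirectory("spark-test-").toAbsolutePath

val spark = SparkSession.builder()
  .master("local[2]")
  .config("spark.sql.warehouse.dir", s"$scratch/warehouse")
  .config("spark.hadoop.javax.jdo.option.ConnectionURL",
    s"jdbc:derby:;databaseName=$scratch/metastore_db;create=true")
  .enableHiveSupport()
  .getOrCreate()
{code}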



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21834) Incorrect executor request in case of dynamic allocation

2017-08-24 Thread Sital Kedia (JIRA)
Sital Kedia created SPARK-21834:
---

 Summary: Incorrect executor request in case of dynamic allocation
 Key: SPARK-21834
 URL: https://issues.apache.org/jira/browse/SPARK-21834
 Project: Spark
  Issue Type: Bug
  Components: Scheduler
Affects Versions: 2.2.0
Reporter: Sital Kedia


The killExecutor API currently does not allow killing an executor without updating
the total number of executors needed. When dynamic allocation is turned on and the
allocator tries to kill an executor, the scheduler reduces the total number of
executors needed (see
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L635),
which is incorrect because the allocator already takes care of setting the
required number of executors itself.
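A toy illustration of the accounting problem (plain Scala, not the real CoarseGrainedSchedulerBackend code; all names are made up): if the kill path recomputes the target from current + pending, it silently undoes the lower target that the allocator just requested.

{code}
// Simplified model of the bookkeeping described above; not Spark internals.
class ToyBackend {
  var running = 10   // executors currently registered
  var pending = 0    // executors requested but not yet launched
  var target = 10    // total executors requested from the cluster manager

  def requestTotalExecutors(n: Int): Unit = target = n

  // Mirrors the problematic behaviour: the kill path rewrites the target
  // from running + pending, ignoring the target the caller already set.
  def killExecutor(): Unit = {
    target = running + pending
    running -= 1
  }
}

val backend = new ToyBackend
backend.requestTotalExecutors(5)   // allocator ramps down to 5
backend.killExecutor()             // an idle executor is killed
println(backend.target)            // prints 10, not 5, so extra executors get requested
{code}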



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19571) tests are failing to run on Windows with another instance Derby error with Hadoop 2.6.5

2017-08-24 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-19571:
-
Fix Version/s: 2.2.0

> tests are failing to run on Windows with another instance Derby error with 
> Hadoop 2.6.5
> ---
>
> Key: SPARK-19571
> URL: https://issues.apache.org/jira/browse/SPARK-19571
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, SQL
>Affects Versions: 2.2.0
>Reporter: Felix Cheung
>Assignee: Hyukjin Kwon
> Fix For: 2.2.0
>
>
> Between 
> https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/751-master
> https://github.com/apache/spark/commit/7a7ce272fe9a703f58b0180a9d2001ecb5c4b8db
> And
> https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/758-master
> https://github.com/apache/spark/commit/c618ccdbe9ac103dfa3182346e2a14a1e7fca91a
> Something changed (not likely caused by R) such that tests running on
> Windows are consistently failing with
> {code}
> Caused by: ERROR XSDB6: Another instance of Derby may have already booted the 
> database 
> C:\Users\appveyor\AppData\Local\Temp\1\spark-75266bb9-bd54-4ee2-ae54-2122d2c011e8\metastore.
>   at org.apache.derby.iapi.error.StandardException.newException(Unknown 
> Source)
>   at org.apache.derby.iapi.error.StandardException.newException(Unknown 
> Source)
>   at 
> org.apache.derby.impl.store.raw.data.BaseDataFileFactory.privGetJBMSLockOnDB(Unknown
>  Source)
>   at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.run(Unknown 
> Source)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at 
> org.apache.derby.impl.store.raw.data.BaseDataFileFactory.getJBMSLockOnDB(Unknown
>  Source)
>   at 
> org.apache.derby.impl.store.raw.data.BaseDataFileFactory.boot(Unknown Source)
>   at org.apache.derby.impl.services.monitor.BaseMonitor.boot(Unknown 
> Source)
>   at org.apache.derby.impl.services.monitor.TopService.bootModule(Unknown 
> Source)
> {code}
> Since we run AppVeyor only when there are R changes, it is a bit harder to
> track down which change specifically caused this.
> We also can't run AppVeyor on branch-2.1, so it could also be broken there.
> This could be a blocker, since it could fail tests for the R release.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-21822) When insert Hive Table is finished, it is better to clean out the tmpLocation dir

2017-08-24 Thread lufei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lufei closed SPARK-21822.
-
Resolution: Invalid

> When insert Hive Table is finished, it is better to clean out the tmpLocation 
> dir
> -
>
> Key: SPARK-21822
> URL: https://issues.apache.org/jira/browse/SPARK-21822
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: lufei
>Priority: Minor
>
> When an insert into a Hive table is finished, it would be better to clean out the
> tmpLocation dir (temp directories like
> ".hive-staging_hive_2017-08-19_10-56-01_540_5448395226195533570-9/-ext-1"
> or "/tmp/hive/..." for older Spark versions).
> Otherwise, when lots of Spark jobs are executed, millions of temporary
> directories are left in HDFS, and these temporary directories can only be
> deleted by the maintainer through a shell script.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21822) When insert Hive Table is finished, it is better to clean out the tmpLocation dir

2017-08-24 Thread lufei (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16140983#comment-16140983
 ] 

lufei commented on SPARK-21822:
---

[~sowen] OK, I got it. I will close this issue immediately, thanks.

> When insert Hive Table is finished, it is better to clean out the tmpLocation 
> dir
> -
>
> Key: SPARK-21822
> URL: https://issues.apache.org/jira/browse/SPARK-21822
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: lufei
>Priority: Minor
>
> When an insert into a Hive table is finished, it would be better to clean out the
> tmpLocation dir (temp directories like
> ".hive-staging_hive_2017-08-19_10-56-01_540_5448395226195533570-9/-ext-1"
> or "/tmp/hive/..." for older Spark versions).
> Otherwise, when lots of Spark jobs are executed, millions of temporary
> directories are left in HDFS, and these temporary directories can only be
> deleted by the maintainer through a shell script.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21830) Bump the dependency of ANTLR to version 4.7

2017-08-24 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-21830.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

> Bump the dependency of ANTLR to version 4.7
> ---
>
> Key: SPARK-21830
> URL: https://issues.apache.org/jira/browse/SPARK-21830
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
> Fix For: 2.3.0
>
>
> There are a few minor issues with the current ANTLR version. Version 4.7 
> fixes most of these. I'd like to bump the version to that one.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21830) Bump the dependency of ANTLR to version 4.7

2017-08-24 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16140939#comment-16140939
 ] 

Xiao Li commented on SPARK-21830:
-

https://github.com/apache/spark/pull/19042


> Bump the dependency of ANTLR to version 4.7
> ---
>
> Key: SPARK-21830
> URL: https://issues.apache.org/jira/browse/SPARK-21830
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
> Fix For: 2.3.0
>
>
> There are a few minor issues with the current ANTLR version. Version 4.7 
> fixes most of these. I'd like to bump the version to that one.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21798) No config to replace deprecated SPARK_CLASSPATH config for launching daemons like History Server

2017-08-24 Thread Parth Gandhi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16140833#comment-16140833
 ] 

Parth Gandhi commented on SPARK-21798:
--

Filed a pull request for this ticket:
https://github.com/apache/spark/pull/19047

> No config to replace deprecated SPARK_CLASSPATH config for launching daemons 
> like History Server
> 
>
> Key: SPARK-21798
> URL: https://issues.apache.org/jira/browse/SPARK-21798
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Sanket Reddy
>Priority: Minor
>
> The History Server launch uses SparkClassCommandBuilder for launching the server.
> It is observed that SPARK_CLASSPATH has been removed and deprecated. For
> spark-submit this takes a different route, and spark.driver.extraClassPath
> takes care of specifying additional jars on the classpath that were
> previously specified in SPARK_CLASSPATH. Right now the only way to specify
> additional jars for launching daemons such as the history server is
> SPARK_DIST_CLASSPATH
> (https://spark.apache.org/docs/latest/hadoop-provided.html), but this I
> presume is a distribution classpath. It would be nice to have a config
> similar to spark.driver.extraClassPath for launching daemons such as the
> history server.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21701) Add TCP send/rcv buffer size support for RPC client

2017-08-24 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu reassigned SPARK-21701:


Assignee: Xu Zhang

> Add TCP send/rcv buffer size support for RPC client
> ---
>
> Key: SPARK-21701
> URL: https://issues.apache.org/jira/browse/SPARK-21701
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Xu Zhang
>Assignee: Xu Zhang
>Priority: Trivial
> Fix For: 2.3.0
>
>
> For the TransportClientFactory class, there are no parameters derived from
> SparkConf to set ChannelOption.SO_RCVBUF and ChannelOption.SO_SNDBUF for Netty.
> Increasing the receive buffer size can improve I/O performance for
> high-volume transport.
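A hedged sketch of what wiring such settings into the client bootstrap could look like. Only the Netty {{ChannelOption}} calls below are standard API; the helper, its parameters and where the sizes would come from (for example new SparkConf keys surfaced via TransportConf) are assumptions for illustration:

{code}
import io.netty.bootstrap.Bootstrap
import io.netty.channel.ChannelOption

// Hypothetical helper: apply configured socket buffer sizes, keeping the OS
// defaults whenever no positive size is supplied.
def configureBuffers(bootstrap: Bootstrap, sendBuf: Int, recvBuf: Int): Bootstrap = {
  if (recvBuf > 0) {
    bootstrap.option[Integer](ChannelOption.SO_RCVBUF, Integer.valueOf(recvBuf))
  }
  if (sendBuf > 0) {
    bootstrap.option[Integer](ChannelOption.SO_SNDBUF, Integer.valueOf(sendBuf))
  }
  bootstrap
}
{code}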



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21701) Add TCP send/rcv buffer size support for RPC client

2017-08-24 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-21701.
--
   Resolution: Fixed
Fix Version/s: 2.3.0

> Add TCP send/rcv buffer size support for RPC client
> ---
>
> Key: SPARK-21701
> URL: https://issues.apache.org/jira/browse/SPARK-21701
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Xu Zhang
>Priority: Trivial
> Fix For: 2.3.0
>
>
> For the TransportClientFactory class, there are no parameters derived from
> SparkConf to set ChannelOption.SO_RCVBUF and ChannelOption.SO_SNDBUF for Netty.
> Increasing the receive buffer size can improve I/O performance for
> high-volume transport.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18355) Spark SQL fails to read data from a ORC hive table that has a new column added to it

2017-08-24 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16140552#comment-16140552
 ] 

Dongjoon Hyun commented on SPARK-18355:
---

This will be resolved via Apache ORC 1.4.0.

> Spark SQL fails to read data from a ORC hive table that has a new column 
> added to it
> 
>
> Key: SPARK-18355
> URL: https://issues.apache.org/jira/browse/SPARK-18355
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2, 2.1.0
> Environment: Centos6
>Reporter: Sandeep Nemuri
>
> *PROBLEM*:
> Spark SQL fails to read data from a ORC hive table that has a new column 
> added to it.
> Below is the exception:
> {code}
> scala> sqlContext.sql("select click_id,search_id from testorc").show
> 16/11/03 22:17:53 INFO ParseDriver: Parsing command: select 
> click_id,search_id from testorc
> 16/11/03 22:17:54 INFO ParseDriver: Parse Completed
> java.lang.AssertionError: assertion failed
>   at scala.Predef$.assert(Predef.scala:165)
>   at 
> org.apache.spark.sql.execution.datasources.LogicalRelation$$anonfun$1.apply(LogicalRelation.scala:39)
>   at 
> org.apache.spark.sql.execution.datasources.LogicalRelation$$anonfun$1.apply(LogicalRelation.scala:38)
>   at scala.Option.map(Option.scala:145)
>   at 
> org.apache.spark.sql.execution.datasources.LogicalRelation.(LogicalRelation.scala:38)
>   at 
> org.apache.spark.sql.execution.datasources.LogicalRelation.copy(LogicalRelation.scala:31)
>   at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog.org$apache$spark$sql$hive$HiveMetastoreCatalog$$convertToOrcRelation(HiveMetastoreCatalog.scala:588)
>   at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog$OrcConversions$$anonfun$apply$2.applyOrElse(HiveMetastoreCatalog.scala:647)
>   at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog$OrcConversions$$anonfun$apply$2.applyOrElse(HiveMetastoreCatalog.scala:643)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:335)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:335)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:334)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:332)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:332)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:281)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> {code}
> *STEPS TO SIMULATE THIS ISSUE*:
> 1) Create table in hive.
> {code}
> CREATE TABLE `testorc`( 
> `click_id` string, 
> `search_id` string, 
> `uid` bigint)
> PARTITIONED BY ( 
> `ts` string, 
> `hour` string) 
> STORED AS ORC; 
> {code}
> 2) Load data into table :
> {code}
> INSERT INTO TABLE testorc PARTITION (ts = '98765',hour = '01' ) VALUES 
> (12,2,12345);
> {code}
> 3) Select through spark shell (This works)
> {code}
> sqlContext.sql("select click_id,search_id from testorc").show
> {code}
> 4) Now add column to hive table
> {code}
> ALTER TABLE testorc ADD COLUMNS (dummy string);
> {code}
> 5) Now again select from spark shell
> {code}
> scala> sqlContext.sql("select click_id,search_id from testorc").show
> 16/11/03 22:17:53 INFO ParseDriver: Parsing command: select 
> click_id,search_id from testorc
> 16/11/03 22:17:54 INFO ParseDriver: Parse Completed
> java.lang.AssertionError: assertion failed
>   at scala.Predef$.assert(Predef.scala:165)
>   at 
> org.apache.spark.sql.execution.datasources.LogicalRelation$$anonfun$1.apply(LogicalRelation.scala:39)
>   at 
> org.apache.spark.sql.execution.datasources.LogicalRelation$$anonfun$1.apply(LogicalRelation.scala:38)
>   at scala.Option.map(Option.scala:145)
>   at 
> org.apache.spark.sql.execution.datasources.LogicalRelation.(LogicalRelation.scala:38)
>   at 
> org.apache.spark.sql.execution.datasources.LogicalRelation.copy(LogicalRelation.scala:31)
>   at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog.org$apache$spark$sql$hive$HiveMetastoreCatalog$$convertToOrcRelation(HiveMetastoreCatalog.scala:588)
>   at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog$OrcConversions$$anonfun$apply$2.applyOrElse(HiveMetastoreCatalog.scala:647)
>   at 
> org.apache.spark.sql.hive.HiveMetastoreCatalog$OrcConversions$$anonfun$apply$2.applyOrElse(HiveMetastoreCatalog.scala:643)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:335)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1

[jira] [Closed] (SPARK-21833) CoarseGrainedSchedulerBackend leaks executors in case of dynamic allocation

2017-08-24 Thread Sital Kedia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sital Kedia closed SPARK-21833.
---
Resolution: Duplicate

Duplicate of SPARK-20540

> CoarseGrainedSchedulerBackend leaks executors in case of dynamic allocation
> ---
>
> Key: SPARK-21833
> URL: https://issues.apache.org/jira/browse/SPARK-21833
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.2.0
>Reporter: Sital Kedia
>
> We have seen an issue in the coarse-grained scheduler where, when dynamic
> executor allocation is turned on, the scheduler asks for more executors than
> needed. Consider the situation where the executor allocation manager is
> ramping down the number of executors. It lowers the executor target number by
> calling the requestTotalExecutors API.
> Later, when the allocation manager finds some executors to be idle, it calls
> the killExecutor API. The coarse-grained scheduler, in the killExecutor
> function, resets the total number of executors needed to current + pending,
> which overrides the earlier target set by the allocation manager.
> This results in the scheduler spawning more executors than actually needed.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21833) CoarseGrainedSchedulerBackend leaks executors in case of dynamic allocation

2017-08-24 Thread Sital Kedia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sital Kedia updated SPARK-21833:

Description: 
We have seen an issue in the coarse-grained scheduler where, when dynamic
executor allocation is turned on, the scheduler asks for more executors than
needed. Consider the situation where the executor allocation manager is ramping
down the number of executors. It lowers the executor target number by calling
the requestTotalExecutors API.

Later, when the allocation manager finds some executors to be idle, it calls
the killExecutor API. The coarse-grained scheduler, in the killExecutor
function, resets the total number of executors needed to current + pending,
which overrides the earlier target set by the allocation manager.

This results in the scheduler spawning more executors than actually needed.

  was:
We have seen this issue in coarse grained scheduler that in case of dynamic 
executor allocation is turned on, the scheduler asks for more executors than 
needed. Consider the following situation where there are excutor allocation 
manager is ramping down the number of executors. It will lower the executor 
targer number by calling requestTotalExecutors api (see 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L326.
 

Later,  when the allocation manager finds some executors to be idle, it will 
call killExecutor api 
(https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L447).
 The coarse grain scheduler, in the killExecutor function replaces the total 
executor needed to current  + pending which overrides the earlier target set by 
the allocation manager 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L523.
 

This results in scheduler spawning more executors than actually needed. 


> CoarseGrainedSchedulerBackend leaks executors in case of dynamic allocation
> ---
>
> Key: SPARK-21833
> URL: https://issues.apache.org/jira/browse/SPARK-21833
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.2.0
>Reporter: Sital Kedia
>
> We have seen an issue in the coarse-grained scheduler where, when dynamic
> executor allocation is turned on, the scheduler asks for more executors than
> needed. Consider the situation where the executor allocation manager is
> ramping down the number of executors. It lowers the executor target number by
> calling the requestTotalExecutors API.
> Later, when the allocation manager finds some executors to be idle, it calls
> the killExecutor API. The coarse-grained scheduler, in the killExecutor
> function, resets the total number of executors needed to current + pending,
> which overrides the earlier target set by the allocation manager.
> This results in the scheduler spawning more executors than actually needed.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21833) CoarseGrainedSchedulerBackend leaks executors in case of dynamic allocation

2017-08-24 Thread Sital Kedia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16140533#comment-16140533
 ] 

Sital Kedia commented on SPARK-21833:
-

Actually, SPARK-20540 already addressed this issue on the latest trunk. I will
close this issue.

> CoarseGrainedSchedulerBackend leaks executors in case of dynamic allocation
> ---
>
> Key: SPARK-21833
> URL: https://issues.apache.org/jira/browse/SPARK-21833
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.2.0
>Reporter: Sital Kedia
>
> We have seen an issue in the coarse-grained scheduler where, when dynamic
> executor allocation is turned on, the scheduler asks for more executors than
> needed. Consider the situation where the executor allocation manager is
> ramping down the number of executors. It lowers the executor target number by
> calling the requestTotalExecutors API.
> Later, when the allocation manager finds some executors to be idle, it calls
> the killExecutor API. The coarse-grained scheduler, in the killExecutor
> function, resets the total number of executors needed to current + pending,
> which overrides the earlier target set by the allocation manager.
> This results in the scheduler spawning more executors than actually needed.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21833) CoarseGrainedSchedulerBackend leaks executors in case of dynamic allocation

2017-08-24 Thread Sital Kedia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sital Kedia updated SPARK-21833:

Description: 
We have seen an issue in the coarse-grained scheduler where, when dynamic
executor allocation is turned on, the scheduler asks for more executors than
needed. Consider the situation where the executor allocation manager is ramping
down the number of executors. It lowers the executor target number by calling
the requestTotalExecutors API (see
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L326).

Later, when the allocation manager finds some executors to be idle, it calls
the killExecutor API
(https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L447).
The coarse-grained scheduler, in the killExecutor function, resets the total
number of executors needed to current + pending, which overrides the earlier
target set by the allocation manager
(https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L523).

This results in the scheduler spawning more executors than actually needed.

  was:
We have seen this issue in coarse grained scheduler that in case of dynamic 
executor allocation is turned on, the scheduler asks for more executors than 
needed. Consider the following situation where there are excutor allocation 
manager is ramping down the number of executors. It will lower the executor 
targer number by calling requestTotalExecutors api (see 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L326.
 

Later,  when the allocation manager finds some executors to be idle, it will 
call killExecutor api 
(https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L421).
 The coarse grain scheduler, in the killExecutor function replaces the total 
executor needed to current  + pending which overrides the earlier target set by 
the allocation manager 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L523.
 

This results in scheduler spawning more executors than actually needed. 


> CoarseGrainedSchedulerBackend leaks executors in case of dynamic allocation
> ---
>
> Key: SPARK-21833
> URL: https://issues.apache.org/jira/browse/SPARK-21833
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.2.0
>Reporter: Sital Kedia
>
> We have seen an issue in the coarse-grained scheduler where, when dynamic
> executor allocation is turned on, the scheduler asks for more executors than
> needed. Consider the situation where the executor allocation manager is
> ramping down the number of executors. It lowers the executor target number by
> calling the requestTotalExecutors API (see
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L326).
> Later, when the allocation manager finds some executors to be idle, it calls
> the killExecutor API
> (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L447).
> The coarse-grained scheduler, in the killExecutor function, resets the total
> number of executors needed to current + pending, which overrides the earlier
> target set by the allocation manager
> (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L523).
> This results in the scheduler spawning more executors than actually needed.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21833) CoarseGrainedSchedulerBackend leaks executors in case of dynamic allocation

2017-08-24 Thread Sital Kedia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sital Kedia updated SPARK-21833:

Description: 
We have seen an issue in the coarse-grained scheduler where, when dynamic
executor allocation is turned on, the scheduler asks for more executors than
needed. Consider the situation where the executor allocation manager is ramping
down the number of executors. It lowers the executor target number by calling
the requestTotalExecutors API (see
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L326).

Later, when the allocation manager finds some executors to be idle, it calls
the killExecutor API
(https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L421).
The coarse-grained scheduler, in the killExecutor function, resets the total
number of executors needed to current + pending, which overrides the earlier
target set by the allocation manager
(https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L523).

This results in the scheduler spawning more executors than actually needed.

  was:
We have seen this issue in coarse grained scheduler that in case of dynamic 
executor allocation is turned on, the scheduler asks for more executors than 
needed. Consider the following situation where there are excutor allocation 
manager is ramping down the number of executors. It will lower the executor 
targer number by calling requestTotalExecutors api (see 
https://github.com/sitalkedia/spark/blob/master/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L321).
 

Later,  when the allocation manager finds some executors to be idle, it will 
call killExecutor api 
(https://github.com/sitalkedia/spark/blob/master/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L421).
 The coarse grain scheduler, in the killExecutor function replaces the total 
executor needed to current  + pending which overrides the earlier target set by 
the allocation manager 
https://github.com/sitalkedia/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L523.
 

This results in scheduler spawning more executors than actually needed. 


> CoarseGrainedSchedulerBackend leaks executors in case of dynamic allocation
> ---
>
> Key: SPARK-21833
> URL: https://issues.apache.org/jira/browse/SPARK-21833
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.2.0
>Reporter: Sital Kedia
>
> We have seen an issue in the coarse-grained scheduler where, when dynamic
> executor allocation is turned on, the scheduler asks for more executors than
> needed. Consider the situation where the executor allocation manager is
> ramping down the number of executors. It lowers the executor target number by
> calling the requestTotalExecutors API (see
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L326).
> Later, when the allocation manager finds some executors to be idle, it calls
> the killExecutor API
> (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L421).
> The coarse-grained scheduler, in the killExecutor function, resets the total
> number of executors needed to current + pending, which overrides the earlier
> target set by the allocation manager
> (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L523).
> This results in the scheduler spawning more executors than actually needed.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21833) CoarseGrainedSchedulerBackend leaks executors in case of dynamic allocation

2017-08-24 Thread Sital Kedia (JIRA)
Sital Kedia created SPARK-21833:
---

 Summary: CoarseGrainedSchedulerBackend leaks executors in case of 
dynamic allocation
 Key: SPARK-21833
 URL: https://issues.apache.org/jira/browse/SPARK-21833
 Project: Spark
  Issue Type: Bug
  Components: Scheduler
Affects Versions: 2.2.0
Reporter: Sital Kedia


We have seen an issue in the coarse-grained scheduler where, when dynamic
executor allocation is turned on, the scheduler asks for more executors than
needed. Consider the situation where the executor allocation manager is ramping
down the number of executors. It lowers the executor target number by calling
the requestTotalExecutors API (see
https://github.com/sitalkedia/spark/blob/master/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L321).

Later, when the allocation manager finds some executors to be idle, it calls
the killExecutor API
(https://github.com/sitalkedia/spark/blob/master/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L421).
The coarse-grained scheduler, in the killExecutor function, resets the total
number of executors needed to current + pending, which overrides the earlier
target set by the allocation manager
(https://github.com/sitalkedia/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L523).

This results in the scheduler spawning more executors than actually needed.
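At the public API level, the sequence that triggers this looks roughly like the sketch below, using SparkContext's developer API for dynamic allocation (the executor id is a placeholder and the numbers are arbitrary):

{code}
// Assumes dynamic allocation is enabled and `sc` is an active SparkContext.
sc.requestTotalExecutors(5, 0, Map.empty)   // allocation manager ramps the target down to 5

// Killing an idle executor makes the backend recompute the target as
// current + pending, which can override the target of 5 requested above.
sc.killExecutors(Seq("executor-id-placeholder"))
{code}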



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21832) Merge SQLBuilderTest into ExpressionSQLBuilderSuite

2017-08-24 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-21832:
-

 Summary: Merge SQLBuilderTest into ExpressionSQLBuilderSuite
 Key: SPARK-21832
 URL: https://issues.apache.org/jira/browse/SPARK-21832
 Project: Spark
  Issue Type: Bug
  Components: Tests
Affects Versions: 2.3.0
Reporter: Dongjoon Hyun
Priority: Minor


After SPARK-19025, there is no need to keep SQLBuilderTest;
ExpressionSQLBuilderSuite is the only place that uses it.
This issue aims to remove SQLBuilderTest.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21788) Handle more exceptions when stopping a streaming query

2017-08-24 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-21788:
-
Fix Version/s: (was: 3.0.0)
   2.3.0

> Handle more exceptions when stopping a streaming query
> --
>
> Key: SPARK-21788
> URL: https://issues.apache.org/jira/browse/SPARK-21788
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.3.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21831) Remove `spark.sql.hive.convertMetastoreOrc` config in HiveCompatibilitySuite

2017-08-24 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-21831:
--
Summary: Remove `spark.sql.hive.convertMetastoreOrc` config in 
HiveCompatibilitySuite  (was: Remove obsolete 
`spark.sql.hive.convertMetastoreOrc` config in HiveCompatibilitySuite)

> Remove `spark.sql.hive.convertMetastoreOrc` config in HiveCompatibilitySuite
> 
>
> Key: SPARK-21831
> URL: https://issues.apache.org/jira/browse/SPARK-21831
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> SPARK-19025 removes SQLBuilder, so we need to remove the following in 
> HiveCompatibilitySuite.
> {code}
> // Ensures that the plans generation use metastore relation and not 
> OrcRelation
> // Was done because SqlBuilder does not work with plans having logical 
> relation
> TestHive.setConf(HiveUtils.CONVERT_METASTORE_ORC, false)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21831) Remove obsolete `spark.sql.hive.convertMetastoreOrc` config in HiveCompatibilitySuite

2017-08-24 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-21831:
-

 Summary: Remove obsolete `spark.sql.hive.convertMetastoreOrc` 
config in HiveCompatibilitySuite
 Key: SPARK-21831
 URL: https://issues.apache.org/jira/browse/SPARK-21831
 Project: Spark
  Issue Type: Bug
  Components: Tests
Affects Versions: 2.3.0
Reporter: Dongjoon Hyun
Priority: Minor


SPARK-19025 removes SQLBuilder, so we need to remove the following in 
HiveCompatibilitySuite.

{code}
// Ensures that the plans generation use metastore relation and not OrcRelation
// Was done because SqlBuilder does not work with plans having logical relation
TestHive.setConf(HiveUtils.CONVERT_METASTORE_ORC, false)
{code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21797) spark cannot read partitioned data in S3 that are partly in glacier

2017-08-24 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated SPARK-21797:
---
Environment: Amazon EMR

> spark cannot read partitioned data in S3 that are partly in glacier
> ---
>
> Key: SPARK-21797
> URL: https://issues.apache.org/jira/browse/SPARK-21797
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
> Environment: Amazon EMR
>Reporter: Boris Clémençon 
>  Labels: glacier, partitions, read, s3
>
> I have a dataset in Parquet in S3, partitioned by date (dt), with the oldest
> dates stored in AWS Glacier to save some money. For instance, we have...
> {noformat}
> s3://my-bucket/my-dataset/dt=2017-07-01/[in glacier]
> ...
> s3://my-bucket/my-dataset/dt=2017-07-09/[in glacier]
> s3://my-bucket/my-dataset/dt=2017-07-10/[not in glacier]
> ...
> s3://my-bucket/my-dataset/dt=2017-07-24/[not in glacier]
> {noformat}
> I want to read this dataset, but only the subset of dates that are not yet in
> Glacier, e.g.:
> {code:java}
> val from = "2017-07-15"
> val to = "2017-08-24"
> val path = "s3://my-bucket/my-dataset/"
> val X = spark.read.parquet(path).where(col("dt").between(from, to))
> {code}
> Unfortunately, I have the exception
> {noformat}
> java.io.IOException: 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception:
>  The operation is not valid for the object's storage class (Service: Amazon 
> S3; Status Code: 403; Error Code: InvalidObjectState; Request ID: 
> C444D508B6042138)
> {noformat}
> It seems that Spark does not like partitioned datasets when some partitions are
> in Glacier. I could always read each date individually, add the column with the
> current date, and reduce(_ union _) at the end, but that is not pretty and it
> should not be necessary.
> Is there any tip for reading the available data in the datastore even when old
> data is in Glacier?
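One possible workaround sketch, pending a proper answer: enumerate only the partitions known to be outside Glacier and hand those paths to the reader, keeping {{dt}} as a partition column via the {{basePath}} reader option. The bucket name and dates are taken from the example above; whether this avoids the 403 on EMR would need to be confirmed.

{code}
import java.time.LocalDate
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

val basePath = "s3://my-bucket/my-dataset/"
val from = LocalDate.parse("2017-07-15")
val to = LocalDate.parse("2017-08-24")

// Build explicit partition paths so Spark never touches the Glacier-archived dates.
val paths = Iterator.iterate(from)(_.plusDays(1))
  .takeWhile(!_.isAfter(to))
  .map(d => s"${basePath}dt=$d")
  .toSeq

val df = spark.read
  .option("basePath", basePath)   // keep `dt` as a partition column
  .parquet(paths: _*)
{code}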



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21797) spark cannot read partitioned data in S3 that are partly in glacier

2017-08-24 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16140408#comment-16140408
 ] 

Steve Loughran edited comment on SPARK-21797 at 8/24/17 5:56 PM:
-

This is happening deep in the Amazon EMR team's closed-source
{{EmrFileSystem}}, so there is nothing anyone here at the ASF can deal with directly;
I'm confident S3A would handle it pretty similarly though, either in the open()
call or shortly afterwards, in the first read(). All we could do there is
convert it to a more meaningful error, or actually check whether the file is
readable at open() time and, again, fail meaningfully.

At the Spark level, it fails because Parquet is trying to read the footer of every
file in parallel.

The good news: you can tell Spark to ignore files it can't read. I believe this
might be a quick workaround:
{code}
spark.sql.files.ignoreCorruptFiles=true
{code}

Let us know what happens



was (Author: ste...@apache.org):
This is happening deep the Amazon EMR team's closed source {{EmrFileSystem}}, 
so nothing anyone here at the ASF can deal with directly; I'm confident S3A 
will handle it pretty similarly though, either in the open() call or shortly 
afterwards, in the first read(). All we could do there is convert to a more 
meaningful error, or actually check to see if the file is valid at open() time 
& again, fail meaningfully

At the Spark level, it's because Parquet is trying to read the footer of every 
file in parallel

the good news, you can tell Spark to ignore files it can't read. I believe this 
might be a quick workaround:
{code}
spark.sql.files.ignoreCorruptFiles=true
{code}

Let us know what happens


> spark cannot read partitioned data in S3 that are partly in glacier
> ---
>
> Key: SPARK-21797
> URL: https://issues.apache.org/jira/browse/SPARK-21797
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
> Environment: Amazon EMR
>Reporter: Boris Clémençon 
>  Labels: glacier, partitions, read, s3
>
> I have a dataset in Parquet in S3, partitioned by date (dt), with the oldest
> dates stored in AWS Glacier to save some money. For instance, we have...
> {noformat}
> s3://my-bucket/my-dataset/dt=2017-07-01/[in glacier]
> ...
> s3://my-bucket/my-dataset/dt=2017-07-09/[in glacier]
> s3://my-bucket/my-dataset/dt=2017-07-10/[not in glacier]
> ...
> s3://my-bucket/my-dataset/dt=2017-07-24/[not in glacier]
> {noformat}
> I want to read this dataset, but only the subset of dates that are not yet in
> Glacier, e.g.:
> {code:java}
> val from = "2017-07-15"
> val to = "2017-08-24"
> val path = "s3://my-bucket/my-dataset/"
> val X = spark.read.parquet(path).where(col("dt").between(from, to))
> {code}
> Unfortunately, I have the exception
> {noformat}
> java.io.IOException: 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception:
>  The operation is not valid for the object's storage class (Service: Amazon 
> S3; Status Code: 403; Error Code: InvalidObjectState; Request ID: 
> C444D508B6042138)
> {noformat}
> It seems that Spark does not like partitioned datasets when some partitions are
> in Glacier. I could always read each date individually, add the column with the
> current date, and reduce(_ union _) at the end, but that is not pretty and it
> should not be necessary.
> Is there any tip for reading the available data in the datastore even when old
> data is in Glacier?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21797) spark cannot read partitioned data in S3 that are partly in glacier

2017-08-24 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16140408#comment-16140408
 ] 

Steve Loughran commented on SPARK-21797:


This is happening deep in the Amazon EMR team's closed-source
{{EmrFileSystem}}, so there is nothing anyone here at the ASF can deal with directly;
I'm confident S3A would handle it pretty similarly though, either in the open()
call or shortly afterwards, in the first read(). All we could do there is
convert it to a more meaningful error, or actually check whether the file is
readable at open() time and, again, fail meaningfully.

At the Spark level, it fails because Parquet is trying to read the footer of every
file in parallel.

The good news: you can tell Spark to ignore files it can't read. I believe this
might be a quick workaround:
{code}
spark.sql.files.ignoreCorruptFiles=true
{code}

Let us know what happens
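Concretely, a minimal sketch of trying that setting from Scala (whether it actually skips the Glacier-blocked objects on EMR still needs to be verified, as noted above):

{code}
import org.apache.spark.sql.functions.col

// Ask Spark to skip files it cannot read instead of failing the whole scan.
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")

val df = spark.read.parquet("s3://my-bucket/my-dataset/")
  .where(col("dt").between("2017-07-15", "2017-08-24"))
{code}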


> spark cannot read partitioned data in S3 that are partly in glacier
> ---
>
> Key: SPARK-21797
> URL: https://issues.apache.org/jira/browse/SPARK-21797
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Boris Clémençon 
>  Labels: glacier, partitions, read, s3
>
> I have a dataset in Parquet in S3, partitioned by date (dt), with the oldest
> dates stored in AWS Glacier to save some money. For instance, we have...
> {noformat}
> s3://my-bucket/my-dataset/dt=2017-07-01/[in glacier]
> ...
> s3://my-bucket/my-dataset/dt=2017-07-09/[in glacier]
> s3://my-bucket/my-dataset/dt=2017-07-10/[not in glacier]
> ...
> s3://my-bucket/my-dataset/dt=2017-07-24/[not in glacier]
> {noformat}
> I want to read this dataset, but only the subset of dates that are not yet in
> Glacier, e.g.:
> {code:java}
> val from = "2017-07-15"
> val to = "2017-08-24"
> val path = "s3://my-bucket/my-dataset/"
> val X = spark.read.parquet(path).where(col("dt").between(from, to))
> {code}
> Unfortunately, I have the exception
> {noformat}
> java.io.IOException: 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception:
>  The operation is not valid for the object's storage class (Service: Amazon 
> S3; Status Code: 403; Error Code: InvalidObjectState; Request ID: 
> C444D508B6042138)
> {noformat}
> It seems that Spark does not like partitioned datasets when some partitions are
> in Glacier. I could always read each date individually, add the column with the
> current date, and reduce(_ union _) at the end, but that is not pretty and it
> should not be necessary.
> Is there any tip for reading the available data in the datastore even when old
> data is in Glacier?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21788) Handle more exceptions when stopping a streaming query

2017-08-24 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-21788.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 18997
[https://github.com/apache/spark/pull/18997]

> Handle more exceptions when stopping a streaming query
> --
>
> Key: SPARK-21788
> URL: https://issues.apache.org/jira/browse/SPARK-21788
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21681) MLOR do not work correctly when featureStd contains zero

2017-08-24 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-21681:
--
Labels: correctness  (was: )

> MLOR do not work correctly when featureStd contains zero
> 
>
> Key: SPARK-21681
> URL: https://issues.apache.org/jira/browse/SPARK-21681
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>  Labels: correctness
> Fix For: 2.2.1, 2.3.0
>
>
> MLOR does not work correctly when featureStd contains zero.
> We can reproduce the bug with the following dataset (features including zero
> variance); it generates a wrong result (all coefficients become 0):
> {code}
> val multinomialDatasetWithZeroVar = {
>   val nPoints = 100
>   val coefficients = Array(
> -0.57997, 0.912083, -0.371077,
> -0.16624, -0.84355, -0.048509)
>   val xMean = Array(5.843, 3.0)
>   val xVariance = Array(0.6856, 0.0)  // including zero variance
>   val testData = generateMultinomialLogisticInput(
> coefficients, xMean, xVariance, addIntercept = true, nPoints, seed)
>   val df = sc.parallelize(testData, 4).toDF().withColumn("weight", 
> lit(1.0))
>   df.cache()
>   df
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21681) MLOR do not work correctly when featureStd contains zero

2017-08-24 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-21681.
---
   Resolution: Fixed
Fix Version/s: 2.2.1

Issue resolved by pull request 19026
[https://github.com/apache/spark/pull/19026]

> MLOR do not work correctly when featureStd contains zero
> 
>
> Key: SPARK-21681
> URL: https://issues.apache.org/jira/browse/SPARK-21681
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
> Fix For: 2.2.1, 2.3.0
>
>
> MLOR does not work correctly when featureStd contains zero.
> We can reproduce the bug with the following dataset (features including zero
> variance); it generates a wrong result (all coefficients become 0):
> {code}
> val multinomialDatasetWithZeroVar = {
>   val nPoints = 100
>   val coefficients = Array(
> -0.57997, 0.912083, -0.371077,
> -0.16624, -0.84355, -0.048509)
>   val xMean = Array(5.843, 3.0)
>   val xVariance = Array(0.6856, 0.0)  // including zero variance
>   val testData = generateMultinomialLogisticInput(
> coefficients, xMean, xVariance, addIntercept = true, nPoints, seed)
>   val df = sc.parallelize(testData, 4).toDF().withColumn("weight", 
> lit(1.0))
>   df.cache()
>   df
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16742) Kerberos support for Spark on Mesos

2017-08-24 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16140293#comment-16140293
 ] 

Marcelo Vanzin commented on SPARK-16742:


Both renewal and creating new tickets after the TTL (those are different 
things).

> Kerberos support for Spark on Mesos
> ---
>
> Key: SPARK-16742
> URL: https://issues.apache.org/jira/browse/SPARK-16742
> Project: Spark
>  Issue Type: New Feature
>  Components: Mesos
>Reporter: Michael Gummelt
>Assignee: Arthur Rand
> Fix For: 2.3.0
>
>
> We at Mesosphere have written Kerberos support for Spark on Mesos.  We'll be 
> contributing it to Apache Spark soon.
> Mesosphere design doc: 
> https://docs.google.com/document/d/1xyzICg7SIaugCEcB4w1vBWp24UDkyJ1Pyt2jtnREFqc/edit#heading=h.tdnq7wilqrj6
> Mesosphere code: 
> https://github.com/mesosphere/spark/commit/73ba2ab8d97510d5475ef9a48c673ce34f7173fa



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21828) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB...again

2017-08-24 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16140282#comment-16140282
 ] 

Kazuaki Ishizaki commented on SPARK-21828:
--

Thank you for reporting the problem.
First, IIUC, this PR (https://github.com/apache/spark/pull/15480) has been
included in the latest release. Thus, the test case "SPARK-16845..." in
{{OrderingSuite.scala}} does not fail.

Could you please post a program that can reproduce this issue? Then I will
investigate it.
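For what it is worth, a hedged sketch of the kind of program that might exercise the same code path: a sort over a very wide row forces codegen to emit one comparison block per column in SpecificOrdering. Whether this actually crosses the 64 KB limit depends on the column count, so treat it as a starting point rather than a confirmed reproducer.

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").getOrCreate()
import spark.implicits._

// Many numeric columns, then sort by all of them.
val numCols = 1200
val cols = (1 to numCols).map(i => ($"id" * i).as(s"c$i"))
val wide = spark.range(0, 100).toDF("id").select(cols: _*)

wide.orderBy((1 to numCols).map(i => wide(s"c$i")): _*).count()
{code}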

> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB...again
> -
>
> Key: SPARK-21828
> URL: https://issues.apache.org/jira/browse/SPARK-21828
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SQL
>Affects Versions: 2.2.0
>Reporter: Otis Smart
>Priority: Critical
>
> Hello!
> 1. I encounter a similar issue (see the text below) on PySpark 2.2 (e.g., a
> dataframe with ~5 rows x 1100+ columns as input to the ".fit()" method of a
> CrossValidator() that includes a Pipeline() containing StringIndexer(),
> VectorAssembler() and DecisionTreeClassifier()).
> 2. Was the aforementioned patch (i.e. the fix in
> https://github.com/apache/spark/pull/15480) not included in the latest
> release? What are the reason for, source of, and solution to this persistent
> issue, please?
> py4j.protocol.Py4JJavaError: An error occurred while calling o9396.fit.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 38 
> in stage 18.0 failed 4 times, most recent failure: Lost task 38.3 in stage 
> 18.0 (TID 1996, ip-10-0-14-83.ec2.internal, executor 4): 
> java.util.concurrent.ExecutionException: java.lang.Exception: failed to 
> compile: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "compare(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
> /* 001 */ public SpecificOrdering generate(Object[] references)
> { /* 002 */ return new SpecificOrdering(references); /* 003 */ }
> /* 004 */
> /* 005 */ class SpecificOrdering extends 
> org.apache.spark.sql.catalyst.expressions.codegen.BaseOrdering {
> /* 006 */
> /* 007 */ private Object[] references;
> /* 008 */
> /* 009 */
> /* 010 */ public SpecificOrdering(Object[] references)
> { /* 011 */ this.references = references; /* 012 */ /* 013 */ }
> /* 014 */
> /* 015 */
> /* 016 */
> /* 017 */ public int compare(InternalRow a, InternalRow b) {
> /* 018 */ InternalRow i = null; // Holds current row being evaluated.
> /* 019 */
> /* 020 */ i = a;
> /* 021 */ boolean isNullA;
> /* 022 */ double primitiveA;
> /* 023 */
> { /* 024 */ /* 025 */ double value = i.getDouble(0); /* 026 */ isNullA = 
> false; /* 027 */ primitiveA = value; /* 028 */ }
> /* 029 */ i = b;
> /* 030 */ boolean isNullB;
> /* 031 */ double primitiveB;
> /* 032 */
> { /* 033 */ /* 034 */ double value = i.getDouble(0); /* 035 */ isNullB = 
> false; /* 036 */ primitiveB = value; /* 037 */ }
> /* 038 */ if (isNullA && isNullB)
> { /* 039 */ // Nothing /* 040 */ }
> else if (isNullA)
> { /* 041 */ return -1; /* 042 */ }
> else if (isNullB)
> { /* 043 */ return 1; /* 044 */ }
> else {
> /* 045 */ int comp = 
> org.apache.spark.util.Utils.nanSafeCompareDoubles(primitiveA, primitiveB);
> /* 046 */ if (comp != 0)
> { /* 047 */ return comp; /* 048 */ }
> /* 049 */ }
> /* 050 */
> /* 051 */
> ...



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21797) spark cannot read partitioned data in S3 that are partly in glacier

2017-08-24 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-21797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16140261#comment-16140261
 ] 

Boris Clémençon  commented on SPARK-21797:
--

Steve, 

Here is the stack trace:


{noformat}
WARN TaskSetManager: Lost task 0.1 in stage 0.0 (TID 1, 
ip-172-31-42-242.eu-west-1.compute.internal, executor 1): java.io.IOException: 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception:
 The operation is not valid for the object's storage class (Service: Amazon S3; 
Status Code: 403; Error Code: InvalidObjectState; Request ID: 
5DD5BEBB8173977D), S3 Extended Request ID: 
K9bDwhm32CFHeg5zgVfW/T1A/vB4e8gqQ/p7E0Ze9ZG55UFoDP7hgnkQxLIwYX9i2LEcKwrR+lo=
at 
com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.handleAmazonServiceException(Jets3tNativeFileSystemStore.java:434)
at 
com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.retrievePair(Jets3tNativeFileSystemStore.java:461)
at 
com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.retrievePair(Jets3tNativeFileSystemStore.java:439)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy30.retrievePair(Unknown Source)
at 
com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.open(S3NativeFileSystem.java:1201)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:773)
at 
com.amazon.ws.emr.hadoop.fs.EmrFileSystem.open(EmrFileSystem.java:166)
at 
org.apache.parquet.hadoop.util.HadoopInputFile.newStream(HadoopInputFile.java:65)
at 
org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:443)
at 
org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:421)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$readParquetFootersInParallel$1.apply(ParquetFileFormat.scala:491)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$readParquetFootersInParallel$1.apply(ParquetFileFormat.scala:485)
at 
scala.collection.parallel.AugmentedIterableIterator$class.flatmap2combiner(RemainsIterator.scala:132)
at 
scala.collection.parallel.immutable.ParVector$ParVectorIterator.flatmap2combiner(ParVector.scala:62)
at 
scala.collection.parallel.ParIterableLike$FlatMap.leaf(ParIterableLike.scala:1072)
at 
scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:49)
at 
scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:48)
at 
scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:48)
at scala.collection.parallel.Task$class.tryLeaf(Tasks.scala:51)
at 
scala.collection.parallel.ParIterableLike$FlatMap.tryLeaf(ParIterableLike.scala:1068)
at 
scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.compute(Tasks.scala:152)
at 
scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.compute(Tasks.scala:443)
at 
scala.concurrent.forkjoin.RecursiveAction.exec(RecursiveAction.java:160)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinTask.doJoin(ForkJoinTask.java:341)
at scala.concurrent.forkjoin.ForkJoinTask.join(ForkJoinTask.java:673)
at 
scala.collection.parallel.ForkJoinTasks$WrappedTask$class.sync(Tasks.scala:378)
at 
scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.sync(Tasks.scala:443)
at 
scala.collection.parallel.ForkJoinTasks$class.executeAndWaitResult(Tasks.scala:426)
at 
scala.collection.parallel.ForkJoinTaskSupport.executeAndWaitResult(TaskSupport.scala:56)
at 
scala.collection.parallel.ParIterableLike$ResultMapping.leaf(ParIterableLike.scala:958)
at 
scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:49)
at 
scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:48)
at 
scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:48)
at scala.collection.parallel.Task$class.tryLeaf(Tasks.scala:51)
at 
scala.collection.parallel.ParIterableLike$ResultMapping.tryLeaf(ParIterableLike.scala:953)
at 
scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.compute(Tasks.scala:152)
at 
scala.collection.parallel.AdaptiveWorkStealin

[jira] [Created] (SPARK-21830) Bump the dependency of ANTLR to version 4.7

2017-08-24 Thread Herman van Hovell (JIRA)
Herman van Hovell created SPARK-21830:
-

 Summary: Bump the dependency of ANTLR to version 4.7
 Key: SPARK-21830
 URL: https://issues.apache.org/jira/browse/SPARK-21830
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Herman van Hovell
Assignee: Herman van Hovell


There are a few minor issues with the current ANTLR version. Version 4.7 fixes 
most of these. I'd like to bump the version to that one.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21799) KMeans performance regression (5-6x slowdown) in Spark 2.2

2017-08-24 Thread zakaria hili (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16140183#comment-16140183
 ] 

zakaria hili commented on SPARK-21799:
--

[~WeichenXu123], df.rdd.getStorageLevel returns NONE even if df is cached.
The solution is to check the cache status on the DataFrame itself.
When I created the PR for SPARK-18356, we didn't have df.storageLevel.
But now there are some people working on this solution.

> KMeans performance regression (5-6x slowdown) in Spark 2.2
> --
>
> Key: SPARK-21799
> URL: https://issues.apache.org/jira/browse/SPARK-21799
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.2.0
>Reporter: Siddharth Murching
>
> I've been running KMeans performance tests using 
> [spark-sql-perf|https://github.com/databricks/spark-sql-perf/] and have 
> noticed a regression (slowdowns of 5-6x) when running tests on large datasets 
> in Spark 2.2 vs 2.1.
> The test params are:
> * Cluster: 510 GB RAM, 16 workers
> * Data: 100 examples, 1 features
> After talking to [~josephkb], the issue seems related to the changes in 
> [SPARK-18356|https://issues.apache.org/jira/browse/SPARK-18356] introduced in 
> [this PR|https://github.com/apache/spark/pull/16295].
> It seems `df.cache()` doesn't set the storageLevel of `df.rdd`, so 
> `handlePersistence` is true even when KMeans is run on a cached DataFrame. 
> This unnecessarily causes another copy of the input dataset to be persisted.
> As of Spark 2.1 ([JIRA 
> link|https://issues.apache.org/jira/browse/SPARK-16063]) `df.storageLevel` 
> returns the correct result after calling `df.cache()`, so I'd suggest 
> replacing instances of `df.rdd.getStorageLevel` with `df.storageLevel` in 
> MLlib algorithms (the same pattern shows up in LogisticRegression, 
> LinearRegression, and others). I've verified this behavior in [this 
> notebook|https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/5211178207246023/950505630032626/7788830288800223/latest.html]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21829) Enable config to permanently blacklist a list of nodes

2017-08-24 Thread Luca Canali (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luca Canali updated SPARK-21829:

Summary: Enable config to permanently blacklist a list of nodes  (was: 
Prevent running executors/tasks on a user-specified list of cluster nodes)

> Enable config to permanently blacklist a list of nodes
> --
>
> Key: SPARK-21829
> URL: https://issues.apache.org/jira/browse/SPARK-21829
> Project: Spark
>  Issue Type: New Feature
>  Components: Scheduler, Spark Core
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Luca Canali
>Priority: Minor
>
> The idea for this proposal comes from a performance incident in a local 
> cluster where a job was found to be very slow because of a long tail of stragglers 
> due to 2 nodes in the cluster being slow to access a remote filesystem.
> The issue was limited to the 2 machines and was related to external 
> configurations: the 2 machines that performed badly when accessing the remote 
> file system were behaving normally for other jobs in the cluster (a shared 
> YARN cluster).
> With this new feature I propose to introduce a mechanism to allow users to 
> specify a list of nodes in the cluster where executors/tasks should not run 
> for a specific job.
> The proposed implementation that I tested (see PR) uses the Spark blacklist 
> mechanism. With the parameter spark.blacklist.alwaysBlacklistedNodes, a list 
> of user-specified nodes is added to the blacklist at the start of the Spark 
> Context and it is never expired. 
> I have tested this on a YARN cluster on a case taken from the original 
> production problem and I confirm a performance improvement of about 5x for 
> the specific test case I have. I imagine that there can be other cases where 
> Spark users may want to blacklist a set of nodes. This can be used for 
> troubleshooting, including cases where certain nodes/executors are slow for a 
> given workload and this is caused by external agents, so the anomaly is not 
> picked up by the cluster manager.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21799) KMeans performance regression (5-6x slowdown) in Spark 2.2

2017-08-24 Thread Weichen Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16140162#comment-16140162
 ] 

Weichen Xu commented on SPARK-21799:


I suggest checking both `df.storageLevel` and `df.rdd.getStorageLevel` for the 
input data, and caching before `train` only when neither of them indicates that 
the data is already cached.
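
A minimal Scala sketch of that check (illustrative only; the helper name and 
structure are assumptions, not the actual MLlib code):

{code}
import org.apache.spark.sql.DataFrame
import org.apache.spark.storage.StorageLevel

// Cache the input only when neither the DataFrame nor its underlying RDD is
// already persisted, and unpersist afterwards only if we did the caching.
def withCachingIfNeeded[T](df: DataFrame)(train: DataFrame => T): T = {
  val handlePersistence =
    df.storageLevel == StorageLevel.NONE && df.rdd.getStorageLevel == StorageLevel.NONE
  if (handlePersistence) df.persist(StorageLevel.MEMORY_AND_DISK)
  try train(df) finally { if (handlePersistence) df.unpersist() }
}
{code}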

> KMeans performance regression (5-6x slowdown) in Spark 2.2
> --
>
> Key: SPARK-21799
> URL: https://issues.apache.org/jira/browse/SPARK-21799
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.2.0
>Reporter: Siddharth Murching
>
> I've been running KMeans performance tests using 
> [spark-sql-perf|https://github.com/databricks/spark-sql-perf/] and have 
> noticed a regression (slowdowns of 5-6x) when running tests on large datasets 
> in Spark 2.2 vs 2.1.
> The test params are:
> * Cluster: 510 GB RAM, 16 workers
> * Data: 100 examples, 1 features
> After talking to [~josephkb], the issue seems related to the changes in 
> [SPARK-18356|https://issues.apache.org/jira/browse/SPARK-18356] introduced in 
> [this PR|https://github.com/apache/spark/pull/16295].
> It seems `df.cache()` doesn't set the storageLevel of `df.rdd`, so 
> `handlePersistence` is true even when KMeans is run on a cached DataFrame. 
> This unnecessarily causes another copy of the input dataset to be persisted.
> As of Spark 2.1 ([JIRA 
> link|https://issues.apache.org/jira/browse/SPARK-16063]) `df.storageLevel` 
> returns the correct result after calling `df.cache()`, so I'd suggest 
> replacing instances of `df.rdd.getStorageLevel` with `df.storageLevel` in 
> MLlib algorithms (the same pattern shows up in LogisticRegression, 
> LinearRegression, and others). I've verified this behavior in [this 
> notebook|https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/5211178207246023/950505630032626/7788830288800223/latest.html]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21817) Pass FSPermissions to LocatedFileStatus from InMemoryFileIndex

2017-08-24 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16140161#comment-16140161
 ] 

Steve Loughran commented on SPARK-21817:


FYI, this is now fixed in hadoop trunk/3.0-beta-1

> Pass FSPermissions to LocatedFileStatus from InMemoryFileIndex
> --
>
> Key: SPARK-21817
> URL: https://issues.apache.org/jira/browse/SPARK-21817
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ewan Higgs
>Priority: Minor
> Attachments: SPARK-21817.001.patch
>
>
> The implementation of HDFS-6984 now uses the passed-in {{FSPermission}} to 
> pull out the ACL and other information. Therefore passing in {{null}} is no 
> longer adequate and hence causes an NPE when listing files.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21797) spark cannot read partitioned data in S3 that are partly in glacier

2017-08-24 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16140147#comment-16140147
 ] 

Steve Loughran commented on SPARK-21797:


Note that if it is just during Spark partition calculation, it is probably going 
down the directory tree and inspecting the files through HEAD requests, maybe 
looking at metadata entries too. So do attach the s3a & Spark trace so we can see 
what's going on, as something may be over-enthusiastic about looking at files, or 
we could have something recognise the problem and recover from it.

> spark cannot read partitioned data in S3 that are partly in glacier
> ---
>
> Key: SPARK-21797
> URL: https://issues.apache.org/jira/browse/SPARK-21797
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Boris Clémençon 
>  Labels: glacier, partitions, read, s3
>
> I have a dataset in Parquet in S3, partitioned by date (dt), with the oldest 
> dates stored in AWS Glacier to save some money. For instance, we have...
> {noformat}
> s3://my-bucket/my-dataset/dt=2017-07-01/[in glacier]
> ...
> s3://my-bucket/my-dataset/dt=2017-07-09/[in glacier]
> s3://my-bucket/my-dataset/dt=2017-07-10/[not in glacier]
> ...
> s3://my-bucket/my-dataset/dt=2017-07-24/[not in glacier]
> {noformat}
> I want to read this dataset, but only a subset of dates that are not yet in 
> Glacier, e.g.:
> {code:java}
> val from = "2017-07-15"
> val to = "2017-08-24"
> val path = "s3://my-bucket/my-dataset/"
> val X = spark.read.parquet(path).where(col("dt").between(from, to))
> {code}
> Unfortunately, I have the exception
> {noformat}
> java.io.IOException: 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception:
>  The operation is not valid for the object's storage class (Service: Amazon 
> S3; Status Code: 403; Error Code: InvalidObjectState; Request ID: 
> C444D508B6042138)
> {noformat}
> It seems that Spark does not like a partitioned dataset when some partitions 
> are in Glacier. I could always read each date individually, add the dt column, 
> and reduce(_ union _) at the end, but that is not pretty and it should not be 
> necessary.
> Is there any tip for reading the available data in the datastore even when old 
> data is in Glacier?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21797) spark cannot read partitioned data in S3 that are partly in glacier

2017-08-24 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16140138#comment-16140138
 ] 

Steve Loughran commented on SPARK-21797:


I was talking about the cost and time of getting data from Glacier. If that's 
the only place where the data lives, then it's slow and expensive. And that's the 
bit I'm describing as niche. Given I've been working full time on S3A, I'm 
reasonably confident it gets used a lot.

If you talk to data in S3 that has been backed up to Glacier, you *will get a 
403*, according to Jeff Barr himself: 
https://aws.amazon.com/blogs/aws/archive-s3-to-glacier/

bq. If you archive objects using the Glacier storage option, you must inspect 
the storage class of an object before you attempt to retrieve it. The customary 
GET request will work as expected if the object is stored in S3 Standard or 
Reduced Redundancy (RRS) storage. It will fail (with a 403 error) if the object 
is archived in Glacier. In this case, you must use the RESTORE operation 
(described below) to make your data available in S3.

bq. You use S3’s new RESTORE operation to access an object archived in Glacier. 
As part of the request, you need to specify a retention period in days. 
Restoring an object will generally take 3 to 5 hours. Your restored object will 
remain in both Glacier and S3’s Reduced Redundancy Storage (RRS) for the 
duration of the retention period. At the end of the retention period the 
object’s data will be removed from S3; the object will remain in Glacier.

Like I said, I'd be interested in getting the full stack trace if you try to 
read this with an S3A client. Not for fixing, but for better reporting. 
Probably point them at Jeff's blog entry. Or this JIRA :)


> spark cannot read partitioned data in S3 that are partly in glacier
> ---
>
> Key: SPARK-21797
> URL: https://issues.apache.org/jira/browse/SPARK-21797
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Boris Clémençon 
>  Labels: glacier, partitions, read, s3
>
> I have a dataset in Parquet in S3, partitioned by date (dt), with the oldest 
> dates stored in AWS Glacier to save some money. For instance, we have...
> {noformat}
> s3://my-bucket/my-dataset/dt=2017-07-01/[in glacier]
> ...
> s3://my-bucket/my-dataset/dt=2017-07-09/[in glacier]
> s3://my-bucket/my-dataset/dt=2017-07-10/[not in glacier]
> ...
> s3://my-bucket/my-dataset/dt=2017-07-24/[not in glacier]
> {noformat}
> I want to read this dataset, but only a subset of dates that are not yet in 
> Glacier, e.g.:
> {code:java}
> val from = "2017-07-15"
> val to = "2017-08-24"
> val path = "s3://my-bucket/my-dataset/"
> val X = spark.read.parquet(path).where(col("dt").between(from, to))
> {code}
> Unfortunately, I have the exception
> {noformat}
> java.io.IOException: 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception:
>  The operation is not valid for the object's storage class (Service: Amazon 
> S3; Status Code: 403; Error Code: InvalidObjectState; Request ID: 
> C444D508B6042138)
> {noformat}
> It seems that Spark does not like a partitioned dataset when some partitions 
> are in Glacier. I could always read each date individually, add the dt column, 
> and reduce(_ union _) at the end, but that is not pretty and it should not be 
> necessary.
> Is there any tip for reading the available data in the datastore even when old 
> data is in Glacier?
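
For anyone hitting the same 403, one possible workaround (a sketch only, assuming 
the date-partitioned layout described above; it does not restore anything from 
Glacier) is to enumerate just the partitions known to still be in S3 and keep the 
dt column via the basePath option:

{code}
import java.time.LocalDate

val from = LocalDate.parse("2017-07-15")
val to   = LocalDate.parse("2017-08-24")

// List only the dt= partitions that have not been archived to Glacier.
val days  = Iterator.iterate(from)(_.plusDays(1)).takeWhile(!_.isAfter(to)).toSeq
val paths = days.map(d => s"s3://my-bucket/my-dataset/dt=$d")

// basePath keeps partition discovery, so the dt column is still present.
val df = spark.read
  .option("basePath", "s3://my-bucket/my-dataset/")
  .parquet(paths: _*)
{code}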



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21826) outer broadcast hash join should not throw NPE

2017-08-24 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-21826.
---
   Resolution: Fixed
Fix Version/s: 2.3.0
   2.2.1

Fixed per https://github.com/apache/spark/pull/19036

> outer broadcast hash join should not throw NPE
> --
>
> Key: SPARK-21826
> URL: https://issues.apache.org/jira/browse/SPARK-21826
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.2.1, 2.3.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21826) outer broadcast hash join should not throw NPE

2017-08-24 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell reassigned SPARK-21826:
-

Assignee: Wenchen Fan

> outer broadcast hash join should not throw NPE
> --
>
> Key: SPARK-21826
> URL: https://issues.apache.org/jira/browse/SPARK-21826
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16742) Kerberos support for Spark on Mesos

2017-08-24 Thread Arthur Rand (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16140111#comment-16140111
 ] 

Arthur Rand commented on SPARK-16742:
-

Hello [~vanzin], I'm assuming you're talking about automatic ticket renewal, 
correct? I was just starting to look into that w.r.t. Mesos; I'll create a 
ticket. 

> Kerberos support for Spark on Mesos
> ---
>
> Key: SPARK-16742
> URL: https://issues.apache.org/jira/browse/SPARK-16742
> Project: Spark
>  Issue Type: New Feature
>  Components: Mesos
>Reporter: Michael Gummelt
>Assignee: Arthur Rand
> Fix For: 2.3.0
>
>
> We at Mesosphere have written Kerberos support for Spark on Mesos.  We'll be 
> contributing it to Apache Spark soon.
> Mesosphere design doc: 
> https://docs.google.com/document/d/1xyzICg7SIaugCEcB4w1vBWp24UDkyJ1Pyt2jtnREFqc/edit#heading=h.tdnq7wilqrj6
> Mesosphere code: 
> https://github.com/mesosphere/spark/commit/73ba2ab8d97510d5475ef9a48c673ce34f7173fa



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21190) SPIP: Vectorized UDFs in Python

2017-08-24 Thread Li Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16140094#comment-16140094
 ] 

Li Jin edited comment on SPARK-21190 at 8/24/17 2:33 PM:
-

[~ueshin], Got it. I'd actually prefer doing it this way:

{code}
@pandas_udf(LongType())
def f0(v):
    return pd.Series(1).repeat(len(v))

df.select(f0(F.lit(0)))
{code}

Instead of passing the size as a scalar to the function.

Passing size to f0 feels unintuitive to me because f0 is defined as a function 
that takes one argument (def f0(size)) but is invoked with 0 args (f0()).


was (Author: icexelloss):
[~ueshin], Got it. I'd actually prefer doing it this way:

{code}
@pandas_udf(LongType())
def f0(v):
    return pd.Series(1).repeat(len(v))

df.select(f0(F.lit(0)))
{code}

Instead of passing the size as a scalar to the function.

Passing size to f0 feels unintuitive to me because f0 is defined as a function 
that takes one argument - size, but being invoked with 0 args - f0().

> SPIP: Vectorized UDFs in Python
> ---
>
> Key: SPARK-21190
> URL: https://issues.apache.org/jira/browse/SPARK-21190
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>  Labels: SPIP
> Attachments: SPIPVectorizedUDFsforPython (1).pdf
>
>
> *Background and Motivation*
> Python is one of the most popular programming languages among Spark users. 
> Spark currently exposes a row-at-a-time interface for defining and executing 
> user-defined functions (UDFs). This introduces high overhead in serialization 
> and deserialization, and also makes it difficult to leverage Python libraries 
> (e.g. numpy, Pandas) that are written in native code.
>  
> This proposal advocates introducing new APIs to support vectorized UDFs in 
> Python, in which a block of data is transferred over to Python in some 
> columnar format for execution.
>  
>  
> *Target Personas*
> Data scientists, data engineers, library developers.
>  
> *Goals*
> - Support vectorized UDFs that apply on chunks of the data frame
> - Low system overhead: Substantially reduce serialization and deserialization 
> overhead when compared with row-at-a-time interface
> - UDF performance: Enable users to leverage native libraries in Python (e.g. 
> numpy, Pandas) for data manipulation in these UDFs
>  
> *Non-Goals*
> The following are explicitly out of scope for the current SPIP, and should be 
> done in future SPIPs. Nonetheless, it would be good to consider these future 
> use cases during API design, so we can achieve some consistency when rolling 
> out new APIs.
>  
> - Define block oriented UDFs in other languages (that are not Python).
> - Define aggregate UDFs
> - Tight integration with machine learning frameworks
>  
> *Proposed API Changes*
> The following sketches some possibilities. I haven’t spent a lot of time 
> thinking about the API (wrote it down in 5 mins) and I am not attached to 
> this design at all. The main purpose of the SPIP is to get feedback on use 
> cases and see how they can impact API design.
>  
> A few things to consider are:
>  
> 1. Python is dynamically typed, whereas DataFrames/SQL requires static, 
> analysis time typing. This means users would need to specify the return type 
> of their UDFs.
>  
> 2. Ratio of input rows to output rows. We propose initially we require number 
> of output rows to be the same as the number of input rows. In the future, we 
> can consider relaxing this constraint with support for vectorized aggregate 
> UDFs.
> 3. How do we handle null values, since Pandas doesn't have the concept of 
> nulls?
>  
> Proposed API sketch (using examples):
>  
> Use case 1. A function that defines all the columns of a DataFrame (similar 
> to a “map” function):
>  
> {code}
> @spark_udf(some way to describe the return schema)
> def my_func_on_entire_df(input):
>   """ Some user-defined function.
>  
>   :param input: A Pandas DataFrame with two columns, a and b.
>   :return: :class: A Pandas data frame.
>   """
>   input[c] = input[a] + input[b]
>   Input[d] = input[a] - input[b]
>   return input
>  
> spark.range(1000).selectExpr("id a", "id / 2 b")
>   .mapBatches(my_func_on_entire_df)
> {code}
>  
> Use case 2. A function that defines only one column (similar to existing 
> UDFs):
>  
> {code}
> @spark_udf(some way to describe the return schema)
> def my_func_that_returns_one_column(input):
>   """ Some user-defined function.
>  
>   :param input: A Pandas DataFrame with two columns, a and b.
>   :return: :class: A numpy array
>   """
>   return input[a] + input[b]
>  
> my_func = udf(my_func_that_returns_one_column)
>  
> df = spark.range(1000).selectExpr("id a", "id / 2 b")
> df.withColumn("c", my_func(df.a, 

[jira] [Comment Edited] (SPARK-21190) SPIP: Vectorized UDFs in Python

2017-08-24 Thread Li Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16140094#comment-16140094
 ] 

Li Jin edited comment on SPARK-21190 at 8/24/17 2:33 PM:
-

[~ueshin], Got it. I'd actually prefer doing it this way:

{code}
@pandas_udf(LongType())
def f0(v):
    return pd.Series(1).repeat(len(v))

df.select(f0(F.lit(0)))
{code}

Instead of passing the size as a scalar to the function.

Passing size to f0 feels unintuitive to me because f0 is defined as a function 
that takes one argument - size, but being invoked with 0 args - f0().


was (Author: icexelloss):
[~ueshin], Got it. I'd actually prefer doing it this way:

{code}
@pandas_udf(LongType())
def f0(v):
    return pd.Series(1).repeat(len(v))

df.select(f0(F.lit(0)))
{code}

Instead of passing the size as a scalar to the function.

Passing size to f0 feels intuitive to me because f0 is defined as a function 
that takes one argument - size, but being invoked with 0 args - f0().

> SPIP: Vectorized UDFs in Python
> ---
>
> Key: SPARK-21190
> URL: https://issues.apache.org/jira/browse/SPARK-21190
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>  Labels: SPIP
> Attachments: SPIPVectorizedUDFsforPython (1).pdf
>
>
> *Background and Motivation*
> Python is one of the most popular programming languages among Spark users. 
> Spark currently exposes a row-at-a-time interface for defining and executing 
> user-defined functions (UDFs). This introduces high overhead in serialization 
> and deserialization, and also makes it difficult to leverage Python libraries 
> (e.g. numpy, Pandas) that are written in native code.
>  
> This proposal advocates introducing new APIs to support vectorized UDFs in 
> Python, in which a block of data is transferred over to Python in some 
> columnar format for execution.
>  
>  
> *Target Personas*
> Data scientists, data engineers, library developers.
>  
> *Goals*
> - Support vectorized UDFs that apply on chunks of the data frame
> - Low system overhead: Substantially reduce serialization and deserialization 
> overhead when compared with row-at-a-time interface
> - UDF performance: Enable users to leverage native libraries in Python (e.g. 
> numpy, Pandas) for data manipulation in these UDFs
>  
> *Non-Goals*
> The following are explicitly out of scope for the current SPIP, and should be 
> done in future SPIPs. Nonetheless, it would be good to consider these future 
> use cases during API design, so we can achieve some consistency when rolling 
> out new APIs.
>  
> - Define block oriented UDFs in other languages (that are not Python).
> - Define aggregate UDFs
> - Tight integration with machine learning frameworks
>  
> *Proposed API Changes*
> The following sketches some possibilities. I haven’t spent a lot of time 
> thinking about the API (wrote it down in 5 mins) and I am not attached to 
> this design at all. The main purpose of the SPIP is to get feedback on use 
> cases and see how they can impact API design.
>  
> A few things to consider are:
>  
> 1. Python is dynamically typed, whereas DataFrames/SQL requires static, 
> analysis time typing. This means users would need to specify the return type 
> of their UDFs.
>  
> 2. Ratio of input rows to output rows. We propose initially we require number 
> of output rows to be the same as the number of input rows. In the future, we 
> can consider relaxing this constraint with support for vectorized aggregate 
> UDFs.
> 3. How do we handle null values, since Pandas doesn't have the concept of 
> nulls?
>  
> Proposed API sketch (using examples):
>  
> Use case 1. A function that defines all the columns of a DataFrame (similar 
> to a “map” function):
>  
> {code}
> @spark_udf(some way to describe the return schema)
> def my_func_on_entire_df(input):
>   """ Some user-defined function.
>  
>   :param input: A Pandas DataFrame with two columns, a and b.
>   :return: :class: A Pandas data frame.
>   """
>   input[c] = input[a] + input[b]
>   Input[d] = input[a] - input[b]
>   return input
>  
> spark.range(1000).selectExpr("id a", "id / 2 b")
>   .mapBatches(my_func_on_entire_df)
> {code}
>  
> Use case 2. A function that defines only one column (similar to existing 
> UDFs):
>  
> {code}
> @spark_udf(some way to describe the return schema)
> def my_func_that_returns_one_column(input):
>   """ Some user-defined function.
>  
>   :param input: A Pandas DataFrame with two columns, a and b.
>   :return: :class: A numpy array
>   """
>   return input[a] + input[b]
>  
> my_func = udf(my_func_that_returns_one_column)
>  
> df = spark.range(1000).selectExpr("id a", "id / 2 b")
> df.withColumn("c", my_func(df.a, df.b))
> {

[jira] [Commented] (SPARK-21190) SPIP: Vectorized UDFs in Python

2017-08-24 Thread Li Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16140094#comment-16140094
 ] 

Li Jin commented on SPARK-21190:


[~ueshin], Got it. I'd actually prefer doing it this way:

{code}
@pandas_udf(LongType())
def f0(v):
    return pd.Series(1).repeat(len(v))

df.select(f0(F.lit(0)))
{code}

Instead of passing the size as a scalar to the function.

Passing size to f0 feels intuitive to me because f0 is defined as a function 
that takes one argument - size, but being invoked with 0 args - f0().

> SPIP: Vectorized UDFs in Python
> ---
>
> Key: SPARK-21190
> URL: https://issues.apache.org/jira/browse/SPARK-21190
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>  Labels: SPIP
> Attachments: SPIPVectorizedUDFsforPython (1).pdf
>
>
> *Background and Motivation*
> Python is one of the most popular programming languages among Spark users. 
> Spark currently exposes a row-at-a-time interface for defining and executing 
> user-defined functions (UDFs). This introduces high overhead in serialization 
> and deserialization, and also makes it difficult to leverage Python libraries 
> (e.g. numpy, Pandas) that are written in native code.
>  
> This proposal advocates introducing new APIs to support vectorized UDFs in 
> Python, in which a block of data is transferred over to Python in some 
> columnar format for execution.
>  
>  
> *Target Personas*
> Data scientists, data engineers, library developers.
>  
> *Goals*
> - Support vectorized UDFs that apply on chunks of the data frame
> - Low system overhead: Substantially reduce serialization and deserialization 
> overhead when compared with row-at-a-time interface
> - UDF performance: Enable users to leverage native libraries in Python (e.g. 
> numpy, Pandas) for data manipulation in these UDFs
>  
> *Non-Goals*
> The following are explicitly out of scope for the current SPIP, and should be 
> done in future SPIPs. Nonetheless, it would be good to consider these future 
> use cases during API design, so we can achieve some consistency when rolling 
> out new APIs.
>  
> - Define block oriented UDFs in other languages (that are not Python).
> - Define aggregate UDFs
> - Tight integration with machine learning frameworks
>  
> *Proposed API Changes*
> The following sketches some possibilities. I haven’t spent a lot of time 
> thinking about the API (wrote it down in 5 mins) and I am not attached to 
> this design at all. The main purpose of the SPIP is to get feedback on use 
> cases and see how they can impact API design.
>  
> A few things to consider are:
>  
> 1. Python is dynamically typed, whereas DataFrames/SQL requires static, 
> analysis time typing. This means users would need to specify the return type 
> of their UDFs.
>  
> 2. Ratio of input rows to output rows. We propose initially we require number 
> of output rows to be the same as the number of input rows. In the future, we 
> can consider relaxing this constraint with support for vectorized aggregate 
> UDFs.
> 3. How do we handle null values, since Pandas doesn't have the concept of 
> nulls?
>  
> Proposed API sketch (using examples):
>  
> Use case 1. A function that defines all the columns of a DataFrame (similar 
> to a “map” function):
>  
> {code}
> @spark_udf(some way to describe the return schema)
> def my_func_on_entire_df(input):
>   """ Some user-defined function.
>  
>   :param input: A Pandas DataFrame with two columns, a and b.
>   :return: :class: A Pandas data frame.
>   """
>   input[c] = input[a] + input[b]
>   Input[d] = input[a] - input[b]
>   return input
>  
> spark.range(1000).selectExpr("id a", "id / 2 b")
>   .mapBatches(my_func_on_entire_df)
> {code}
>  
> Use case 2. A function that defines only one column (similar to existing 
> UDFs):
>  
> {code}
> @spark_udf(some way to describe the return schema)
> def my_func_that_returns_one_column(input):
>   """ Some user-defined function.
>  
>   :param input: A Pandas DataFrame with two columns, a and b.
>   :return: :class: A numpy array
>   """
>   return input[a] + input[b]
>  
> my_func = udf(my_func_that_returns_one_column)
>  
> df = spark.range(1000).selectExpr("id a", "id / 2 b")
> df.withColumn("c", my_func(df.a, df.b))
> {code}
>  
>  
>  
> *Optional Design Sketch*
> I’m more concerned about getting proper feedback for API design. The 
> implementation should be pretty straightforward and is not a huge concern at 
> this point. We can leverage the same implementation for faster toPandas 
> (using Arrow).
>  
>  
> *Optional Rejected Designs*
> See above.
>  
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---

[jira] [Commented] (SPARK-21829) Prevent running executors/tasks on a user-specified list of cluster nodes

2017-08-24 Thread Luca Canali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16140065#comment-16140065
 ] 

Luca Canali commented on SPARK-21829:
-

https://github.com/apache/spark/pull/19039

> Prevent running executors/tasks on a user-specified list of cluster nodes
> -
>
> Key: SPARK-21829
> URL: https://issues.apache.org/jira/browse/SPARK-21829
> Project: Spark
>  Issue Type: New Feature
>  Components: Scheduler, Spark Core
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Luca Canali
>Priority: Minor
>
> The idea for this proposal comes from a performance incident in a local 
> cluster where a job was found to be very slow because of a long tail of stragglers 
> due to 2 nodes in the cluster being slow to access a remote filesystem.
> The issue was limited to the 2 machines and was related to external 
> configurations: the 2 machines that performed badly when accessing the remote 
> file system were behaving normally for other jobs in the cluster (a shared 
> YARN cluster).
> With this new feature I propose to introduce a mechanism to allow users to 
> specify a list of nodes in the cluster where executors/tasks should not run 
> for a specific job.
> The proposed implementation that I tested (see PR) uses the Spark blacklist 
> mechanism. With the parameter spark.blacklist.alwaysBlacklistedNodes, a list 
> of user-specified nodes is added to the blacklist at the start of the Spark 
> Context and it is never expired. 
> I have tested this on a YARN cluster on a case taken from the original 
> production problem and I confirm a performance improvement of about 5x for 
> the specific test case I have. I imagine that there can be other cases where 
> Spark users may want to blacklist a set of nodes. This can be used for 
> troubleshooting, including cases where certain nodes/executors are slow for a 
> given workload and this is caused by external agents, so the anomaly is not 
> picked up by the cluster manager.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-21527) Use buffer limit in order to take advantage of JAVA NIO Util's buffercache

2017-08-24 Thread zhoukang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhoukang reopened SPARK-21527:
--

> Use buffer limit in order to take advantage of  JAVA NIO Util's buffercache
> ---
>
> Key: SPARK-21527
> URL: https://issues.apache.org/jira/browse/SPARK-21527
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: zhoukang
>
> Right now, ChunkedByteBuffer#writeFully does not slice bytes first. We observed 
> the code in Java NIO's Util below:
> {code:java}
> public static ByteBuffer getTemporaryDirectBuffer(int size) {
>     BufferCache cache = bufferCache.get();
>     ByteBuffer buf = cache.get(size);
>     if (buf != null) {
>         return buf;
>     } else {
>         // No suitable buffer in the cache so we need to allocate a new
>         // one. To avoid the cache growing then we remove the first
>         // buffer from the cache and free it.
>         if (!cache.isEmpty()) {
>             buf = cache.removeFirst();
>             free(buf);
>         }
>         return ByteBuffer.allocateDirect(size);
>     }
> }
> {code}
> If we slice first with a fixed size, we can use the buffer cache and only need 
> to allocate at the first write call.
> Since we allocate a new buffer, we cannot control when this buffer is freed. 
> This once caused a memory issue in our production cluster.
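
A minimal sketch of the proposed idea (illustrative only; this is not Spark's 
actual ChunkedByteBuffer code): cap each channel write to a fixed chunk size so 
that the temporary direct buffers taken from the NIO buffer cache stay at a 
bounded, reusable size.

{code}
import java.nio.ByteBuffer
import java.nio.channels.WritableByteChannel

object ChunkedWrite {
  // Hypothetical bound; the real value would be a tuning choice.
  val ChunkSize: Int = 256 * 1024

  def writeFully(channel: WritableByteChannel, buf: ByteBuffer): Unit = {
    while (buf.hasRemaining) {
      // Temporarily cap the limit so a single write never exceeds ChunkSize bytes.
      val originalLimit = buf.limit()
      buf.limit(math.min(buf.position() + ChunkSize, originalLimit))
      while (buf.hasRemaining) channel.write(buf)
      buf.limit(originalLimit)
    }
  }
}
{code}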



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21527) Use buffer limit in order to take advantage of JAVA NIO Util's buffercache

2017-08-24 Thread zhoukang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16140062#comment-16140062
 ] 

zhoukang commented on SPARK-21527:
--

https://github.com/apache/spark/pull/18730

> Use buffer limit in order to take advantage of  JAVA NIO Util's buffercache
> ---
>
> Key: SPARK-21527
> URL: https://issues.apache.org/jira/browse/SPARK-21527
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: zhoukang
>
> Right now, ChunkedByteBuffer#writeFully does not slice bytes first. We observed 
> the code in Java NIO's Util below:
> {code:java}
> public static ByteBuffer getTemporaryDirectBuffer(int size) {
>     BufferCache cache = bufferCache.get();
>     ByteBuffer buf = cache.get(size);
>     if (buf != null) {
>         return buf;
>     } else {
>         // No suitable buffer in the cache so we need to allocate a new
>         // one. To avoid the cache growing then we remove the first
>         // buffer from the cache and free it.
>         if (!cache.isEmpty()) {
>             buf = cache.removeFirst();
>             free(buf);
>         }
>         return ByteBuffer.allocateDirect(size);
>     }
> }
> {code}
> If we slice first with a fixed size, we can use the buffer cache and only need 
> to allocate at the first write call.
> Since we allocate a new buffer, we cannot control when this buffer is freed. 
> This once caused a memory issue in our production cluster.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21759) In.checkInputDataTypes should not wrongly report unresolved plans for IN correlated subquery

2017-08-24 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-21759:
---

Assignee: Liang-Chi Hsieh

> In.checkInputDataTypes should not wrongly report unresolved plans for IN 
> correlated subquery
> 
>
> Key: SPARK-21759
> URL: https://issues.apache.org/jira/browse/SPARK-21759
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
> Fix For: 2.3.0
>
>
> With the check for structural integrity proposed in SPARK-21726, I found that 
> an optimization rule {{PullupCorrelatedPredicates}} can produce unresolved 
> plans.
> For a correlated IN query like:
> {code}
> Project [a#0]
> +- Filter a#0 IN (list#4 [b#1])
>:  +- Project [c#2]
>: +- Filter (outer(b#1) < d#3)
>:+- LocalRelation , [c#2, d#3]
>+- LocalRelation , [a#0, b#1]
> {code}
> After {{PullupCorrelatedPredicates}}, it produces query plan like:
> {code}
> 'Project [a#0]
> +- 'Filter a#0 IN (list#4 [(b#1 < d#3)])
>:  +- Project [c#2, d#3]
>: +- LocalRelation , [c#2, d#3]
>+- LocalRelation , [a#0, b#1]
> {code}
> Because the correlated predicate involves another attribute {{d#3}} in 
> subquery, it has been pulled out and added into the {{Project}} on the top of 
> the subquery.
> When {{list}} in {{In}} contains just one {{ListQuery}}, 
> {{In.checkInputDataTypes}} checks if the size of {{value}} expressions 
> matches the output size of subquery. In the above example, there is only 
> {{value}} expression and the subquery output has two attributes {{c#2, d#3}}, 
> so it fails the check and {{In.resolved}} returns {{false}}.
> We should not let {{In.checkInputDataTypes}} wrongly report unresolved plans 
> to fail the structural integrity check.
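
For reference, a spark-shell sketch (table and column names are illustrative, 
chosen to mirror the plan above) of a correlated IN query with this shape:

{code}
import spark.implicits._

Seq((1, 2)).toDF("a", "b").createOrReplaceTempView("t1")
Seq((3, 4)).toDF("c", "d").createOrReplaceTempView("t2")

// The correlated predicate t1.b < t2.d is what PullupCorrelatedPredicates pulls up.
spark.sql("SELECT a FROM t1 WHERE a IN (SELECT c FROM t2 WHERE t1.b < t2.d)")
  .explain(true)
{code}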



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21759) In.checkInputDataTypes should not wrongly report unresolved plans for IN correlated subquery

2017-08-24 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-21759.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18968
[https://github.com/apache/spark/pull/18968]

> In.checkInputDataTypes should not wrongly report unresolved plans for IN 
> correlated subquery
> 
>
> Key: SPARK-21759
> URL: https://issues.apache.org/jira/browse/SPARK-21759
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Liang-Chi Hsieh
> Fix For: 2.3.0
>
>
> With the check for structural integrity proposed in SPARK-21726, I found that 
> an optimization rule {{PullupCorrelatedPredicates}} can produce unresolved 
> plans.
> For a correlated IN query like:
> {code}
> Project [a#0]
> +- Filter a#0 IN (list#4 [b#1])
>:  +- Project [c#2]
>: +- Filter (outer(b#1) < d#3)
>:+- LocalRelation , [c#2, d#3]
>+- LocalRelation , [a#0, b#1]
> {code}
> After {{PullupCorrelatedPredicates}}, it produces query plan like:
> {code}
> 'Project [a#0]
> +- 'Filter a#0 IN (list#4 [(b#1 < d#3)])
>:  +- Project [c#2, d#3]
>: +- LocalRelation , [c#2, d#3]
>+- LocalRelation , [a#0, b#1]
> {code}
> Because the correlated predicate involves another attribute {{d#3}} in 
> subquery, it has been pulled out and added into the {{Project}} on the top of 
> the subquery.
> When {{list}} in {{In}} contains just one {{ListQuery}}, 
> {{In.checkInputDataTypes}} checks if the size of {{value}} expressions 
> matches the output size of the subquery. In the above example, there is only 
> one {{value}} expression and the subquery output has two attributes {{c#2, d#3}}, 
> so it fails the check and {{In.resolved}} returns {{false}}.
> We should not let {{In.checkInputDataTypes}} wrongly report unresolved plans 
> to fail the structural integrity check.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21829) Prevent running executors/tasks on a user-specified list of cluster nodes

2017-08-24 Thread Luca Canali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16140041#comment-16140041
 ] 

Luca Canali commented on SPARK-21829:
-

I think what I am addressing may be a rare case (but it actually happened to 
us): we are on a shared YARN cluster and we have a workload that runs slowly on a 
couple of nodes; however, the nodes are fine for running other types of jobs, so 
we want to keep them in the cluster. The actual problem comes from reading from 
an external file system, and apparently only for this specific workload (which is 
only one of many workloads run on this cluster). As a workaround to make the job 
run faster, I simply killed the executors on the 2 "slow nodes", and the job 
finished faster because it avoided the painfully slow long tail of execution on 
the affected nodes. The proposed patch/feature is an attempt to address this case 
in a softer way than going onto the nodes and killing executors.


> Prevent running executors/tasks on a user-specified list of cluster nodes
> -
>
> Key: SPARK-21829
> URL: https://issues.apache.org/jira/browse/SPARK-21829
> Project: Spark
>  Issue Type: New Feature
>  Components: Scheduler, Spark Core
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Luca Canali
>Priority: Minor
>
> The idea for this proposal comes from a performance incident in a local 
> cluster where a job was found to be very slow because of a long tail of stragglers 
> due to 2 nodes in the cluster being slow to access a remote filesystem.
> The issue was limited to the 2 machines and was related to external 
> configurations: the 2 machines that performed badly when accessing the remote 
> file system were behaving normally for other jobs in the cluster (a shared 
> YARN cluster).
> With this new feature I propose to introduce a mechanism to allow users to 
> specify a list of nodes in the cluster where executors/tasks should not run 
> for a specific job.
> The proposed implementation that I tested (see PR) uses the Spark blacklist 
> mechanism. With the parameter spark.blacklist.alwaysBlacklistedNodes, a list 
> of user-specified nodes is added to the blacklist at the start of the Spark 
> Context and it is never expired. 
> I have tested this on a YARN cluster on a case taken from the original 
> production problem and I confirm a performance improvement of about 5x for 
> the specific test case I have. I imagine that there can be other cases where 
> Spark users may want to blacklist a set of nodes. This can be used for 
> troubleshooting, including cases where certain nodes/executors are slow for a 
> given workload and this is caused by external agents, so the anomaly is not 
> picked up by the cluster manager.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21829) Prevent running executors/tasks on a user-specified list of cluster nodes

2017-08-24 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16140030#comment-16140030
 ] 

Sean Owen commented on SPARK-21829:
---

Why not just take them out of the resource manager entirely? This sounds like a 
difficult way to try to reimplement that in Spark.

> Prevent running executors/tasks on a user-specified list of cluster nodes
> -
>
> Key: SPARK-21829
> URL: https://issues.apache.org/jira/browse/SPARK-21829
> Project: Spark
>  Issue Type: New Feature
>  Components: Scheduler, Spark Core
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Luca Canali
>Priority: Minor
>
> The idea for this proposal comes from a performance incident in a local 
> cluster where a job was found to be very slow because of a long tail of stragglers 
> due to 2 nodes in the cluster being slow to access a remote filesystem.
> The issue was limited to the 2 machines and was related to external 
> configurations: the 2 machines that performed badly when accessing the remote 
> file system were behaving normally for other jobs in the cluster (a shared 
> YARN cluster).
> With this new feature I propose to introduce a mechanism to allow users to 
> specify a list of nodes in the cluster where executors/tasks should not run 
> for a specific job.
> The proposed implementation that I tested (see PR) uses the Spark blacklist 
> mechanism. With the parameter spark.blacklist.alwaysBlacklistedNodes, a list 
> of user-specified nodes is added to the blacklist at the start of the Spark 
> Context and it is never expired. 
> I have tested this on a YARN cluster on a case taken from the original 
> production problem and I confirm a performance improvement of about 5x for 
> the specific test case I have. I imagine that there can be other cases where 
> Spark users may want to blacklist a set of nodes. This can be used for 
> troubleshooting, including cases where certain nodes/executors are slow for a 
> given workload and this is caused by external agents, so the anomaly is not 
> picked up by the cluster manager.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21829) Prevent running executors/tasks on a user-specified list of cluster nodes

2017-08-24 Thread Luca Canali (JIRA)
Luca Canali created SPARK-21829:
---

 Summary: Prevent running executors/tasks on a user-specified list 
of cluster nodes
 Key: SPARK-21829
 URL: https://issues.apache.org/jira/browse/SPARK-21829
 Project: Spark
  Issue Type: New Feature
  Components: Scheduler, Spark Core
Affects Versions: 2.2.0, 2.1.1
Reporter: Luca Canali
Priority: Minor


The idea for this proposal comes from a performance incident in a local cluster 
where a job was found to be very slow because of a long tail of stragglers due to 2 
nodes in the cluster being slow to access a remote filesystem.
The issue was limited to the 2 machines and was related to external 
configurations: the 2 machines that performed badly when accessing the remote 
file system were behaving normally for other jobs in the cluster (a shared YARN 
cluster).
With this new feature I propose to introduce a mechanism to allow users to 
specify a list of nodes in the cluster where executors/tasks should not run for 
a specific job.
The proposed implementation that I tested (see PR) uses the Spark blacklist 
mechanism. With the parameter spark.blacklist.alwaysBlacklistedNodes, a list of 
user-specified nodes is added to the blacklist at the start of the Spark 
Context and it is never expired. 
I have tested this on a YARN cluster on a case taken from the original 
production problem and I confirm a performance improvement of about 5x for the 
specific test case I have. I imagine that there can be other cases where Spark 
users may want to blacklist a set of nodes. This can be used for 
troubleshooting, including cases where certain nodes/executors are slow for a 
given workload and this is caused by external agents, so the anomaly is not 
picked up by the cluster manager.
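
A sketch of how the proposed setting would be supplied (the parameter name comes 
from the attached PR and is not an existing Spark configuration; the host names 
are placeholders):

{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.blacklist.enabled", "true")
  // Proposed in the attached PR: these nodes are blacklisted at SparkContext
  // start-up and the entries never expire.
  .set("spark.blacklist.alwaysBlacklistedNodes", "node1.example.com,node2.example.com")
{code}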




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21745) Refactor ColumnVector hierarchy to make ColumnVector read-only and to introduce WritableColumnVector.

2017-08-24 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-21745.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18958
[https://github.com/apache/spark/pull/18958]

> Refactor ColumnVector hierarchy to make ColumnVector read-only and to 
> introduce WritableColumnVector.
> -
>
> Key: SPARK-21745
> URL: https://issues.apache.org/jira/browse/SPARK-21745
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Takuya Ueshin
> Fix For: 2.3.0
>
>
> This is a refactoring of the {{ColumnVector}} hierarchy and related classes.
> # make {{ColumnVector}} read-only
> # introduce {{WritableColumnVector}} with write interface
> # remove {{ReadOnlyColumnVector}}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21822) When insert Hive Table is finished, it is better to clean out the tmpLocation dir

2017-08-24 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-21822:
--
Priority: Minor  (was: Major)

[~figo] although the exact meanings of priorities are a little debatable, please don't reverse a 
change a committer has made. You haven't identified a realistic problem that this creates, so I 
can't see that it's significant. You also have not proposed a pull request.

> When insert Hive Table is finished, it is better to clean out the tmpLocation 
> dir
> -
>
> Key: SPARK-21822
> URL: https://issues.apache.org/jira/browse/SPARK-21822
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: lufei
>Priority: Minor
>
> When insert Hive Table is finished, it is better to clean out the tmpLocation 
> dir(the temp directories like 
> ".hive-staging_hive_2017-08-19_10-56-01_540_5448395226195533570-9/-ext-1" 
> or "/tmp/hive/..." for an old spark version).
> Otherwise, when lots of spark job are executed, millions of temporary 
> directories are left in HDFS. And these temporary directories can only be 
> deleted by the maintainer through the shell script.
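As an interim measure, a hedged sketch of the kind of cleanup such a maintainer script might 
perform, written against the Hadoop FileSystem API; the table location and the 7-day threshold 
are assumptions, not values from this ticket:

{code}
// Sketch only: remove leftover ".hive-staging_hive_*" directories older than a cutoff.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new Configuration())
val tableDir = new Path("/apps/hive/warehouse/mydb.db/mytable")       // hypothetical location
val cutoff = System.currentTimeMillis() - 7L * 24 * 60 * 60 * 1000    // older than 7 days

fs.listStatus(tableDir)
  .filter(s => s.isDirectory && s.getPath.getName.startsWith(".hive-staging_hive_"))
  .filter(_.getModificationTime < cutoff)
  .foreach(s => fs.delete(s.getPath, true))                           // recursive delete
{code}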



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21745) Refactor ColumnVector hierarchy to make ColumnVector read-only and to introduce WritableColumnVector.

2017-08-24 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-21745:
---

Assignee: Takuya Ueshin

> Refactor ColumnVector hierarchy to make ColumnVector read-only and to 
> introduce WritableColumnVector.
> -
>
> Key: SPARK-21745
> URL: https://issues.apache.org/jira/browse/SPARK-21745
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
> Fix For: 2.3.0
>
>
> This is a refactoring of {{ColumnVector}} hierarchy and related classes.
> # make {{ColumnVector}} read-only
> # introduce {{WritableColumnVector}} with write interface
> # remove {{ReadOnlyColumnVector}}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21828) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB...again

2017-08-24 Thread Otis Smart (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Smart updated SPARK-21828:
---
Shepherd: Kazuaki Ishizaki  (was: Liwei Lin)

> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB...again
> -
>
> Key: SPARK-21828
> URL: https://issues.apache.org/jira/browse/SPARK-21828
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SQL
>Affects Versions: 2.2.0
>Reporter: Otis Smart
>Priority: Critical
>
> Hello!
> 1. I encounter a similar issue (see below text) on Pyspark 2.2 (e.g., 
> dataframe with ~5 rows x 1100+ columns as input to ".fit()" method of 
> CrossValidator() that includes Pipeline() that includes StringIndexer(), 
> VectorAssembler() and DecisionTreeClassifier()).
> 2. Was the aforementioned patch (aka 
> fix(https://github.com/apache/spark/pull/15480) not included in the latest 
> release; what are the reason and (source) of and solution to this persistent 
> issue please?
> py4j.protocol.Py4JJavaError: An error occurred while calling o9396.fit.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 38 
> in stage 18.0 failed 4 times, most recent failure: Lost task 38.3 in stage 
> 18.0 (TID 1996, ip-10-0-14-83.ec2.internal, executor 4): 
> java.util.concurrent.ExecutionException: java.lang.Exception: failed to 
> compile: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "compare(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
> /* 001 */ public SpecificOrdering generate(Object[] references) {
> /* 002 */   return new SpecificOrdering(references);
> /* 003 */ }
> /* 004 */
> /* 005 */ class SpecificOrdering extends org.apache.spark.sql.catalyst.expressions.codegen.BaseOrdering {
> /* 006 */
> /* 007 */   private Object[] references;
> /* 008 */
> /* 009 */
> /* 010 */   public SpecificOrdering(Object[] references) {
> /* 011 */     this.references = references;
> /* 012 */
> /* 013 */   }
> /* 014 */
> /* 015 */
> /* 016 */
> /* 017 */   public int compare(InternalRow a, InternalRow b) {
> /* 018 */     InternalRow i = null;  // Holds current row being evaluated.
> /* 019 */
> /* 020 */     i = a;
> /* 021 */     boolean isNullA;
> /* 022 */     double primitiveA;
> /* 023 */     {
> /* 024 */
> /* 025 */       double value = i.getDouble(0);
> /* 026 */       isNullA = false;
> /* 027 */       primitiveA = value;
> /* 028 */     }
> /* 029 */     i = b;
> /* 030 */     boolean isNullB;
> /* 031 */     double primitiveB;
> /* 032 */     {
> /* 033 */
> /* 034 */       double value = i.getDouble(0);
> /* 035 */       isNullB = false;
> /* 036 */       primitiveB = value;
> /* 037 */     }
> /* 038 */     if (isNullA && isNullB) {
> /* 039 */       // Nothing
> /* 040 */     } else if (isNullA) {
> /* 041 */       return -1;
> /* 042 */     } else if (isNullB) {
> /* 043 */       return 1;
> /* 044 */     } else {
> /* 045 */       int comp = org.apache.spark.util.Utils.nanSafeCompareDoubles(primitiveA, primitiveB);
> /* 046 */       if (comp != 0) {
> /* 047 */         return comp;
> /* 048 */       }
> /* 049 */     }
> /* 050 */
> /* 051 */
> ...



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21828) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB...again

2017-08-24 Thread Otis Smart (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Smart updated SPARK-21828:
---
Component/s: SQL

> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB...again
> -
>
> Key: SPARK-21828
> URL: https://issues.apache.org/jira/browse/SPARK-21828
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SQL
>Affects Versions: 2.2.0
>Reporter: Otis Smart
>Priority: Critical
>
> Hello!
> 1. I encounter a similar issue (see below text) on Pyspark 2.2 (e.g., 
> dataframe with ~5 rows x 1100+ columns as input to ".fit()" method of 
> CrossValidator() that includes Pipeline() that includes StringIndexer(), 
> VectorAssembler() and DecisionTreeClassifier()).
> 2. Was the aforementioned patch (aka 
> fix(https://github.com/apache/spark/pull/15480) not included in the latest 
> release; what are the reason and (source) of and solution to this persistent 
> issue please?
> py4j.protocol.Py4JJavaError: An error occurred while calling o9396.fit.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 38 
> in stage 18.0 failed 4 times, most recent failure: Lost task 38.3 in stage 
> 18.0 (TID 1996, ip-10-0-14-83.ec2.internal, executor 4): 
> java.util.concurrent.ExecutionException: java.lang.Exception: failed to 
> compile: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "compare(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
> /* 001 */ public SpecificOrdering generate(Object[] references) {
> /* 002 */   return new SpecificOrdering(references);
> /* 003 */ }
> /* 004 */
> /* 005 */ class SpecificOrdering extends org.apache.spark.sql.catalyst.expressions.codegen.BaseOrdering {
> /* 006 */
> /* 007 */   private Object[] references;
> /* 008 */
> /* 009 */
> /* 010 */   public SpecificOrdering(Object[] references) {
> /* 011 */     this.references = references;
> /* 012 */
> /* 013 */   }
> /* 014 */
> /* 015 */
> /* 016 */
> /* 017 */   public int compare(InternalRow a, InternalRow b) {
> /* 018 */     InternalRow i = null;  // Holds current row being evaluated.
> /* 019 */
> /* 020 */     i = a;
> /* 021 */     boolean isNullA;
> /* 022 */     double primitiveA;
> /* 023 */     {
> /* 024 */
> /* 025 */       double value = i.getDouble(0);
> /* 026 */       isNullA = false;
> /* 027 */       primitiveA = value;
> /* 028 */     }
> /* 029 */     i = b;
> /* 030 */     boolean isNullB;
> /* 031 */     double primitiveB;
> /* 032 */     {
> /* 033 */
> /* 034 */       double value = i.getDouble(0);
> /* 035 */       isNullB = false;
> /* 036 */       primitiveB = value;
> /* 037 */     }
> /* 038 */     if (isNullA && isNullB) {
> /* 039 */       // Nothing
> /* 040 */     } else if (isNullA) {
> /* 041 */       return -1;
> /* 042 */     } else if (isNullB) {
> /* 043 */       return 1;
> /* 044 */     } else {
> /* 045 */       int comp = org.apache.spark.util.Utils.nanSafeCompareDoubles(primitiveA, primitiveB);
> /* 046 */       if (comp != 0) {
> /* 047 */         return comp;
> /* 048 */       }
> /* 049 */     }
> /* 050 */
> /* 051 */
> ...



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21828) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB...again

2017-08-24 Thread Otis Smart (JIRA)
Otis Smart created SPARK-21828:
--

 Summary: 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
grows beyond 64 KB...again
 Key: SPARK-21828
 URL: https://issues.apache.org/jira/browse/SPARK-21828
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 2.2.0
Reporter: Otis Smart
Priority: Critical


Hello!
1. I encounter a similar issue (see below text) on Pyspark 2.2 (e.g., a dataframe with ~5 rows x 
1100+ columns as input to the ".fit()" method of a CrossValidator() whose Pipeline() includes 
StringIndexer(), VectorAssembler() and DecisionTreeClassifier()).
2. Was the aforementioned patch (the fix at https://github.com/apache/spark/pull/15480) not 
included in the latest release? What are the reason for, and the solution to, this persistent 
issue, please?
py4j.protocol.Py4JJavaError: An error occurred while calling o9396.fit.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 38 in 
stage 18.0 failed 4 times, most recent failure: Lost task 38.3 in stage 18.0 
(TID 1996, ip-10-0-14-83.ec2.internal, executor 4): 
java.util.concurrent.ExecutionException: java.lang.Exception: failed to 
compile: org.codehaus.janino.JaninoRuntimeException: Code of method 
"compare(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I"
 of class 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
grows beyond 64 KB
/* 001 */ public SpecificOrdering generate(Object[] references) {
/* 002 */   return new SpecificOrdering(references);
/* 003 */ }
/* 004 */
/* 005 */ class SpecificOrdering extends org.apache.spark.sql.catalyst.expressions.codegen.BaseOrdering {
/* 006 */
/* 007 */   private Object[] references;
/* 008 */
/* 009 */
/* 010 */   public SpecificOrdering(Object[] references) {
/* 011 */     this.references = references;
/* 012 */
/* 013 */   }
/* 014 */
/* 015 */
/* 016 */
/* 017 */   public int compare(InternalRow a, InternalRow b) {
/* 018 */     InternalRow i = null;  // Holds current row being evaluated.
/* 019 */
/* 020 */     i = a;
/* 021 */     boolean isNullA;
/* 022 */     double primitiveA;
/* 023 */     {
/* 024 */
/* 025 */       double value = i.getDouble(0);
/* 026 */       isNullA = false;
/* 027 */       primitiveA = value;
/* 028 */     }
/* 029 */     i = b;
/* 030 */     boolean isNullB;
/* 031 */     double primitiveB;
/* 032 */     {
/* 033 */
/* 034 */       double value = i.getDouble(0);
/* 035 */       isNullB = false;
/* 036 */       primitiveB = value;
/* 037 */     }
/* 038 */     if (isNullA && isNullB) {
/* 039 */       // Nothing
/* 040 */     } else if (isNullA) {
/* 041 */       return -1;
/* 042 */     } else if (isNullB) {
/* 043 */       return 1;
/* 044 */     } else {
/* 045 */       int comp = org.apache.spark.util.Utils.nanSafeCompareDoubles(primitiveA, primitiveB);
/* 046 */       if (comp != 0) {
/* 047 */         return comp;
/* 048 */       }
/* 049 */     }
/* 050 */
/* 051 */
...
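To make the shape of the failing workload concrete, here is a rough Scala analogue of the PySpark 
pipeline described above; the column count, data and stage settings are assumptions, not taken 
from the reporter's job:

{code}
// Sketch only: a very wide DataFrame fed through the same ML stages the reporter lists.
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.sql.functions.{col, rand}

val featureCols = (1 to 1100).map(i => s"f$i")
val df = spark.range(5).select(
  ((col("id") % 2).cast("string").as("label_str")) +: featureCols.map(c => rand().as(c)): _*)

val indexer   = new StringIndexer().setInputCol("label_str").setOutputCol("label")
val assembler = new VectorAssembler().setInputCols(featureCols.toArray).setOutputCol("features")
val dt        = new DecisionTreeClassifier().setLabelCol("label").setFeaturesCol("features")

// Fitting a pipeline over ~1100 columns is the kind of job reported to hit the
// "grows beyond 64 KB" codegen limit on 2.2.0.
val model = new Pipeline().setStages(Array(indexer, assembler, dt)).fit(df)
{code}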



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21822) When insert Hive Table is finished, it is better to clean out the tmpLocation dir

2017-08-24 Thread lufei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lufei updated SPARK-21822:
--
Priority: Major  (was: Minor)

> When insert Hive Table is finished, it is better to clean out the tmpLocation 
> dir
> -
>
> Key: SPARK-21822
> URL: https://issues.apache.org/jira/browse/SPARK-21822
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: lufei
>
> When insert Hive Table is finished, it is better to clean out the tmpLocation 
> dir(the temp directories like 
> ".hive-staging_hive_2017-08-19_10-56-01_540_5448395226195533570-9/-ext-1" 
> or "/tmp/hive/..." for an old spark version).
> Otherwise, when lots of spark job are executed, millions of temporary 
> directories are left in HDFS. And these temporary directories can only be 
> deleted by the maintainer through the shell script.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21797) spark cannot read partitioned data in S3 that are partly in glacier

2017-08-24 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16139986#comment-16139986
 ] 

Sean Owen commented on SPARK-21797:
---

Sure, but in all events, this is an operation that is fine with Spark, but not fine with 
something between the AWS SDK and AWS. It's not something Spark can fix.

If source data is in S3, there's no way to avoid copying it from S3. Intermediate data produced 
by Spark can't live on S3 as it's too eventually consistent; some final result could. And yes, 
you pay to read/write S3, so in some use cases it might be more economical to keep intensely 
read/written data close to the compute workers for a time, rather than write/read to S3 between 
several closely related jobs.

> spark cannot read partitioned data in S3 that are partly in glacier
> ---
>
> Key: SPARK-21797
> URL: https://issues.apache.org/jira/browse/SPARK-21797
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Boris Clémençon 
>  Labels: glacier, partitions, read, s3
>
> I have a dataset in parquet in S3 partitioned by date (dt) with oldest date 
> stored in AWS Glacier to save some money. For instance, we have...
> {noformat}
> s3://my-bucket/my-dataset/dt=2017-07-01/[in glacier]
> ...
> s3://my-bucket/my-dataset/dt=2017-07-09/[in glacier]
> s3://my-bucket/my-dataset/dt=2017-07-10/[not in glacier]
> ...
> s3://my-bucket/my-dataset/dt=2017-07-24/[not in glacier]
> {noformat}
> I want to read this dataset, but only a subset of date that are not yet in 
> glacier, eg:
> {code:java}
> val from = "2017-07-15"
> val to = "2017-08-24"
> val path = "s3://my-bucket/my-dataset/"
> val X = spark.read.parquet(path).where(col("dt").between(from, to))
> {code}
> Unfortunately, I have the exception
> {noformat}
> java.io.IOException: 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception:
>  The operation is not valid for the object's storage class (Service: Amazon 
> S3; Status Code: 403; Error Code: InvalidObjectState; Request ID: 
> C444D508B6042138)
> {noformat}
> I seems that spark does not like partitioned dataset when some partitions are 
> in Glacier. I could always read specifically each date, add the column with 
> current date and reduce(_ union _) at the end, but not pretty and it should 
> not be necessary.
> Is there any tip to read available data in the datastore even with old data 
> in glacier?
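One possible workaround sketch (not an official fix, and the date list below is a placeholder): 
enumerate the partition directories that are still in S3 Standard and pass them to the reader 
explicitly, keeping the dt partition column via the basePath option:

{code}
// Sketch only: read just the partitions known not to be in Glacier.
val base = "s3://my-bucket/my-dataset"
val availableDates = Seq("2017-07-15", "2017-07-16", "2017-07-17")   // placeholder list
val paths = availableDates.map(d => s"$base/dt=$d")

val X = spark.read
  .option("basePath", base)        // keeps dt as a partition column
  .parquet(paths: _*)
{code}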



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21797) spark cannot read partitioned data in S3 that are partly in glacier

2017-08-24 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-21797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16139978#comment-16139978
 ] 

Boris Clémençon  edited comment on SPARK-21797 at 8/24/17 12:37 PM:


Hi Steve,

to be sure we understand each other, *I don't want to read data from Glacier*. Concretely, I have 
a dataset in parquet partitioned by date in S3, with an automatic rule that freezes the oldest 
dates into Glacier (and, a few months later, deletes them altogether). I want to read only the 
most recent dates that are still in S3 (in a lazy way), not in Glacier (see example above), but 
even that I cannot do. Do we understand each other?

Besides, why do you say that it is a niche use case? Reading partitioned data on S3 seems quite 
normal to me.

And why do you say reading data from S3 is a "very, very expensive way to work with data"? 
According to our tests, reading from S3 is at most 20% slower than reading from HDFS, and we 
operate from within AWS with an EMR cluster, so we should not pay data IO from S3. On the other 
hand, copying the dataset onto HDFS has a time overhead, and you need a large enough cluster with 
enough disk to store the whole dataset, or at least the relevant dates (whereas you may want to 
process only a few columns, i.e. a fraction of the initial dataset). I would like your expertise 
about that.

In any case, I understand your and Sean's argument, though, which says that it is up to AWS to 
solve the problem.


was (Author: clemencb):
Hi Steve,

to be sure we understand each other, *I don't want to read data from Glacier*. 
Concretely, I have a dataset in parquet partitioned by date in S3, with a 
automatic rule that freeze oldest dates in Glacier (and a few months later, 
delete it altogether). I want to read only most recent dates that are still in 
S3 (in a lazy way), not in Glacier (see exemple above), but even that, I cannot 
do it. Do we understand each other? 

Besides, why do you say that it a niche use case? and why do you say reading 
data from S3 is a "very, very expensive way to work with data"? According to 
our tests, reading on S3 in maximum 20% slower than reading from HDFS, and we 
operate from within AWS with a EMR cluster, so we should not pay data IO from 
S3. On the other hand, copying the dataset on HDFS has a time overhead and you 
need a large enough cluster with enough disk to store the whole dataset, or at 
least the relevant dates (whereas you may want to process a few columns, ie a 
fraction of the initial dataset). I would like you expertise about that.

In any case, I understand your and Sean's argument though which says that it is 
to AWS to solve the problem.

> spark cannot read partitioned data in S3 that are partly in glacier
> ---
>
> Key: SPARK-21797
> URL: https://issues.apache.org/jira/browse/SPARK-21797
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Boris Clémençon 
>  Labels: glacier, partitions, read, s3
>
> I have a dataset in parquet in S3 partitioned by date (dt) with oldest date 
> stored in AWS Glacier to save some money. For instance, we have...
> {noformat}
> s3://my-bucket/my-dataset/dt=2017-07-01/[in glacier]
> ...
> s3://my-bucket/my-dataset/dt=2017-07-09/[in glacier]
> s3://my-bucket/my-dataset/dt=2017-07-10/[not in glacier]
> ...
> s3://my-bucket/my-dataset/dt=2017-07-24/[not in glacier]
> {noformat}
> I want to read this dataset, but only a subset of date that are not yet in 
> glacier, eg:
> {code:java}
> val from = "2017-07-15"
> val to = "2017-08-24"
> val path = "s3://my-bucket/my-dataset/"
> val X = spark.read.parquet(path).where(col("dt").between(from, to))
> {code}
> Unfortunately, I have the exception
> {noformat}
> java.io.IOException: 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception:
>  The operation is not valid for the object's storage class (Service: Amazon 
> S3; Status Code: 403; Error Code: InvalidObjectState; Request ID: 
> C444D508B6042138)
> {noformat}
> I seems that spark does not like partitioned dataset when some partitions are 
> in Glacier. I could always read specifically each date, add the column with 
> current date and reduce(_ union _) at the end, but not pretty and it should 
> not be necessary.
> Is there any tip to read available data in the datastore even with old data 
> in glacier?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB

2017-08-24 Thread Otis Smart (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16139969#comment-16139969
 ] 

Otis Smart edited comment on SPARK-16845 at 8/24/17 12:37 PM:
--

Hello!

1. I encounter a similar issue (see below text) on Pyspark 2.2 (e.g., a dataframe with ~5 rows x 
1100+ columns as input to the ".fit()" method of a CrossValidator() whose Pipeline() includes 
StringIndexer(), VectorAssembler() and DecisionTreeClassifier()).

2. Was the aforementioned patch (the fix at https://github.com/apache/spark/pull/15480) not 
included in the latest release? What are the reason for, and the solution to, this persistent 
issue, please?

py4j.protocol.Py4JJavaError: An error occurred while calling o9396.fit.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 38 in 
stage 18.0 failed 4 times, most recent failure: Lost task 38.3 in stage 18.0 
(TID 1996, ip-10-0-14-83.ec2.internal, executor 4): 
java.util.concurrent.ExecutionException: java.lang.Exception: failed to 
compile: org.codehaus.janino.JaninoRuntimeException: Code of method 
"compare(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I"
 of class 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
grows beyond 64 KB
/* 001 */ public SpecificOrdering generate(Object[] references) {
/* 002 */   return new SpecificOrdering(references);
/* 003 */ }
/* 004 */
/* 005 */ class SpecificOrdering extends 
org.apache.spark.sql.catalyst.expressions.codegen.BaseOrdering {
/* 006 */
/* 007 */   private Object[] references;
/* 008 */
/* 009 */
/* 010 */   public SpecificOrdering(Object[] references) {
/* 011 */ this.references = references;
/* 012 */
/* 013 */   }
/* 014 */
/* 015 */
/* 016 */
/* 017 */   public int compare(InternalRow a, InternalRow b) {
/* 018 */ InternalRow i = null;  // Holds current row being evaluated.
/* 019 */
/* 020 */ i = a;
/* 021 */ boolean isNullA;
/* 022 */ double primitiveA;
/* 023 */ {
/* 024 */
/* 025 */   double value = i.getDouble(0);
/* 026 */   isNullA = false;
/* 027 */   primitiveA = value;
/* 028 */ }
/* 029 */ i = b;
/* 030 */ boolean isNullB;
/* 031 */ double primitiveB;
/* 032 */ {
/* 033 */
/* 034 */   double value = i.getDouble(0);
/* 035 */   isNullB = false;
/* 036 */   primitiveB = value;
/* 037 */ }
/* 038 */ if (isNullA && isNullB) {
/* 039 */   // Nothing
/* 040 */ } else if (isNullA) {
/* 041 */   return -1;
/* 042 */ } else if (isNullB) {
/* 043 */   return 1;
/* 044 */ } else {
/* 045 */   int comp = 
org.apache.spark.util.Utils.nanSafeCompareDoubles(primitiveA, primitiveB);
/* 046 */   if (comp != 0) {
/* 047 */ return comp;
/* 048 */   }
/* 049 */ }
/* 050 */
/* 051 */



was (Author: otissmart):
Hello!

1. I encounter a similar issue (see below text) on Pyspark 2.2 (e.g., dataframe 
with ~5 rows x 1100+ columns as input to ".fit()" method of 
CrossValidator() that includes Pipeline() that includes StringIndexer(), 
VectorAssembler() and DecisionTreeClassifier()).

2. Was the aforementioned patch (aka 
fix(https://github.com/apache/spark/pull/15480) not included in the latest 
release; what are the reason and (source) of and solution to this persistent 
issue please?

py4j.protocol.Py4JJavaError: An error occurred while calling o9396.fit.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 38 in 
stage 18.0 failed 4 times, most recent failure: Lost task 38.3 in stage 18.0 
(TID 1996, ip-10-0-14-83.ec2.internal, executor 4): 
java.util.concurrent.ExecutionException: java.lang.Exception: failed to 
compile: org.codehaus.janino.JaninoRuntimeException: Code of method 
"compare(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I"
 of class 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
grows beyond 64 KB
/* 001 */ public SpecificOrdering generate(Object[] references) {
/* 002 */   return new SpecificOrdering(references);
/* 003 */ }
/* 004 */
/* 005 */ class SpecificOrdering extends 
org.apache.spark.sql.catalyst.expressions.codegen.BaseOrdering {
/* 006 */
/* 007 */   private Object[] references;
/* 008 */
/* 009 */
/* 010 */   public SpecificOrdering(Object[] references) {
/* 011 */ this.references = references;
/* 012 */
/* 013 */   }
/* 014 */
/* 015 */
/* 016 */
/* 017 */   public int compare(InternalRow a, InternalRow b) {
/* 018 */ InternalRow i = null;  // Holds current row being evaluated.
/* 019 */
/* 020 */ i = a;
/* 021 */ boolean isNullA;
/* 022 */ double primitiveA;
/* 023 */ {
/* 024 */
/* 025 */   double value = i.getDouble(0);
/* 026 */   isNullA = false;
/* 027 */   primitiveA = value;
/* 028 */ }
/* 029 */ i = b;

[jira] [Comment Edited] (SPARK-21797) spark cannot read partitioned data in S3 that are partly in glacier

2017-08-24 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-21797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16139978#comment-16139978
 ] 

Boris Clémençon  edited comment on SPARK-21797 at 8/24/17 12:37 PM:


Hi Steve,

to be sure we understand each other, *I don't want to read data from Glacier*. Concretely, I have 
a dataset in parquet partitioned by date in S3, with an automatic rule that freezes the oldest 
dates into Glacier (and, a few months later, deletes them altogether). I want to read only the 
most recent dates that are still in S3 (in a lazy way), not in Glacier (see example above), but 
even that I cannot do. Do we understand each other?

Besides, why do you say that it is a niche use case? Reading partitioned data on S3 seems quite 
normal to me.

And why do you say reading data from S3 is a "very, very expensive way to work with data"? 
According to our tests, reading from S3 is at most 20% slower than reading from HDFS, and we 
operate from within AWS with an EMR cluster, so we should not pay data IO from S3. On the other 
hand, copying the dataset onto HDFS has a time overhead, and you need a large enough cluster with 
enough disk to store the whole dataset, or at least the relevant dates (whereas you may want to 
process only a few columns, i.e. a fraction of the initial dataset). I would like your expertise 
about that.

In any case, I understand your and Sean's argument, though, which says that it is up to AWS to 
solve the problem.


was (Author: clemencb):
Hi Steve,

to be sure we understand each other, *I don't want to read data from Glacier*. 
Concretely, I have a dataset in parquet partitioned by date in S3, with a 
automatic rule that freeze oldest dates in Glacier (and a few months later, 
delete it altogether). I want to read only most recent dates that are still in 
S3 (in a lazy way), not in Glacier (see exemple above), but even that, I cannot 
do it. Do we understand each other? 

Besides, why do you say that it a niche use case? Reading partitioned data on 
S3 seems quite normal to me.

and why do you say reading data from S3 is a "very, very expensive way to work 
with data"? According to our tests, reading on S3 in maximum 20% slower than 
reading from HDFS, and we operate from within AWS with a EMR cluster, so we 
should not pay data IO from S3. On the other hand, copying the dataset on HDFS 
has a time overhead and you need a large enough cluster with enough disk to 
store the whole dataset, or at least the relevant dates (whereas you may want 
to process a few columns, ie a fraction of the initial dataset). I would like 
you expertise about that.

In any case, I understand your and Sean's argument though which says that it is 
to AWS to solve the problem.

> spark cannot read partitioned data in S3 that are partly in glacier
> ---
>
> Key: SPARK-21797
> URL: https://issues.apache.org/jira/browse/SPARK-21797
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Boris Clémençon 
>  Labels: glacier, partitions, read, s3
>
> I have a dataset in parquet in S3 partitioned by date (dt) with oldest date 
> stored in AWS Glacier to save some money. For instance, we have...
> {noformat}
> s3://my-bucket/my-dataset/dt=2017-07-01/[in glacier]
> ...
> s3://my-bucket/my-dataset/dt=2017-07-09/[in glacier]
> s3://my-bucket/my-dataset/dt=2017-07-10/[not in glacier]
> ...
> s3://my-bucket/my-dataset/dt=2017-07-24/[not in glacier]
> {noformat}
> I want to read this dataset, but only a subset of date that are not yet in 
> glacier, eg:
> {code:java}
> val from = "2017-07-15"
> val to = "2017-08-24"
> val path = "s3://my-bucket/my-dataset/"
> val X = spark.read.parquet(path).where(col("dt").between(from, to))
> {code}
> Unfortunately, I have the exception
> {noformat}
> java.io.IOException: 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception:
>  The operation is not valid for the object's storage class (Service: Amazon 
> S3; Status Code: 403; Error Code: InvalidObjectState; Request ID: 
> C444D508B6042138)
> {noformat}
> I seems that spark does not like partitioned dataset when some partitions are 
> in Glacier. I could always read specifically each date, add the column with 
> current date and reduce(_ union _) at the end, but not pretty and it should 
> not be necessary.
> Is there any tip to read available data in the datastore even with old data 
> in glacier?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21797) spark cannot read partitioned data in S3 that are partly in glacier

2017-08-24 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-21797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16139978#comment-16139978
 ] 

Boris Clémençon  edited comment on SPARK-21797 at 8/24/17 12:36 PM:


Hi Steve,

to be sure we understand each other, *I don't want to read data from Glacier*. Concretely, I have 
a dataset in parquet partitioned by date in S3, with an automatic rule that freezes the oldest 
dates into Glacier (and, a few months later, deletes them altogether). I want to read only the 
most recent dates that are still in S3 (in a lazy way), not in Glacier (see example above), but 
even that I cannot do. Do we understand each other?

Besides, why do you say that it is a niche use case? And why do you say reading data from S3 is a 
"very, very expensive way to work with data"? According to our tests, reading from S3 is at most 
20% slower than reading from HDFS, and we operate from within AWS with an EMR cluster, so we 
should not pay data IO from S3. On the other hand, copying the dataset onto HDFS has a time 
overhead, and you need a large enough cluster with enough disk to store the whole dataset, or at 
least the relevant dates (whereas you may want to process only a few columns, i.e. a fraction of 
the initial dataset). I would like your expertise about that.

In any case, I understand your and Sean's argument, though, which says that it is up to AWS to 
solve the problem.


was (Author: clemencb):
Hi Steve,

to be sure we understand each other, *I don't want to read data from Glacier*. 
Concretely, I have a dataset in parquet partitioned by date in S3, with a 
automatic rule that freeze oldest dates in Glacier (and a few months later, 
delete it altogether). I want to read only most recent dates that are still in 
S3 (in a lazy way), not in Glacier (see exemple above), but even that, I cannot 
do it. Do you understand each other? 

Besides, why do you say that it a niche use case? and why do you say reading 
data from S3 is a "very, very expensive way to work with data"? According to 
our tests, reading on S3 in maximum 20% slower than reading from HDFS, and we 
operate from within AWS with a EMR cluster, so we should not pay data IO from 
S3. On the other hand, copying the dataset on HDFS has a time overhead and you 
need a large enough cluster with enough disk to store the whole dataset, or at 
least the relevant dates (whereas you may want to process a few columns, ie a 
fraction of the initial dataset). I would like you expertise about that.

In any case, I understand your and Sean's argument though which says that it is 
to AWS to solve the problem.

> spark cannot read partitioned data in S3 that are partly in glacier
> ---
>
> Key: SPARK-21797
> URL: https://issues.apache.org/jira/browse/SPARK-21797
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Boris Clémençon 
>  Labels: glacier, partitions, read, s3
>
> I have a dataset in parquet in S3 partitioned by date (dt) with oldest date 
> stored in AWS Glacier to save some money. For instance, we have...
> {noformat}
> s3://my-bucket/my-dataset/dt=2017-07-01/[in glacier]
> ...
> s3://my-bucket/my-dataset/dt=2017-07-09/[in glacier]
> s3://my-bucket/my-dataset/dt=2017-07-10/[not in glacier]
> ...
> s3://my-bucket/my-dataset/dt=2017-07-24/[not in glacier]
> {noformat}
> I want to read this dataset, but only a subset of date that are not yet in 
> glacier, eg:
> {code:java}
> val from = "2017-07-15"
> val to = "2017-08-24"
> val path = "s3://my-bucket/my-dataset/"
> val X = spark.read.parquet(path).where(col("dt").between(from, to))
> {code}
> Unfortunately, I have the exception
> {noformat}
> java.io.IOException: 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception:
>  The operation is not valid for the object's storage class (Service: Amazon 
> S3; Status Code: 403; Error Code: InvalidObjectState; Request ID: 
> C444D508B6042138)
> {noformat}
> I seems that spark does not like partitioned dataset when some partitions are 
> in Glacier. I could always read specifically each date, add the column with 
> current date and reduce(_ union _) at the end, but not pretty and it should 
> not be necessary.
> Is there any tip to read available data in the datastore even with old data 
> in glacier?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21797) spark cannot read partitioned data in S3 that are partly in glacier

2017-08-24 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-21797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16139978#comment-16139978
 ] 

Boris Clémençon  commented on SPARK-21797:
--

Hi Steve,

to be sure we understand each other, *I don't want to read data from Glacier*. Concretely, I have 
a dataset in parquet partitioned by date in S3, with an automatic rule that freezes the oldest 
dates into Glacier (and, a few months later, deletes them altogether). I want to read only the 
most recent dates that are still in S3 (in a lazy way), not in Glacier (see example above), but 
even that I cannot do. Do we understand each other?

Besides, why do you say that it is a niche use case? And why do you say reading data from S3 is a 
"very, very expensive way to work with data"? According to our tests, reading from S3 is at most 
20% slower than reading from HDFS, and we operate from within AWS with an EMR cluster, so we 
should not pay data IO from S3. On the other hand, copying the dataset onto HDFS has a time 
overhead, and you need a large enough cluster with enough disk to store the whole dataset, or at 
least the relevant dates (whereas you may want to process only a few columns, i.e. a fraction of 
the initial dataset). I would like your expertise about that.

In any case, I understand your and Sean's argument, though, which says that it is up to AWS to 
solve the problem.

> spark cannot read partitioned data in S3 that are partly in glacier
> ---
>
> Key: SPARK-21797
> URL: https://issues.apache.org/jira/browse/SPARK-21797
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Boris Clémençon 
>  Labels: glacier, partitions, read, s3
>
> I have a dataset in parquet in S3 partitioned by date (dt) with oldest date 
> stored in AWS Glacier to save some money. For instance, we have...
> {noformat}
> s3://my-bucket/my-dataset/dt=2017-07-01/[in glacier]
> ...
> s3://my-bucket/my-dataset/dt=2017-07-09/[in glacier]
> s3://my-bucket/my-dataset/dt=2017-07-10/[not in glacier]
> ...
> s3://my-bucket/my-dataset/dt=2017-07-24/[not in glacier]
> {noformat}
> I want to read this dataset, but only a subset of date that are not yet in 
> glacier, eg:
> {code:java}
> val from = "2017-07-15"
> val to = "2017-08-24"
> val path = "s3://my-bucket/my-dataset/"
> val X = spark.read.parquet(path).where(col("dt").between(from, to))
> {code}
> Unfortunately, I have the exception
> {noformat}
> java.io.IOException: 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception:
>  The operation is not valid for the object's storage class (Service: Amazon 
> S3; Status Code: 403; Error Code: InvalidObjectState; Request ID: 
> C444D508B6042138)
> {noformat}
> I seems that spark does not like partitioned dataset when some partitions are 
> in Glacier. I could always read specifically each date, add the column with 
> current date and reduce(_ union _) at the end, but not pretty and it should 
> not be necessary.
> Is there any tip to read available data in the datastore even with old data 
> in glacier?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB

2017-08-24 Thread Otis Smart (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16139969#comment-16139969
 ] 

Otis Smart edited comment on SPARK-16845 at 8/24/17 12:31 PM:
--

Hello!

1. I encounter a similar issue (see below text) on Pyspark 2.2 (e.g., a dataframe with ~5 rows x 
1100+ columns as input to the ".fit()" method of a CrossValidator() whose Pipeline() includes 
StringIndexer(), VectorAssembler() and DecisionTreeClassifier()).

2. Was the aforementioned patch (the fix at https://github.com/apache/spark/pull/15480) not 
included in the latest release? What are the reason for, and the solution to, this persistent 
issue, please?

py4j.protocol.Py4JJavaError: An error occurred while calling o9396.fit.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 38 in 
stage 18.0 failed 4 times, most recent failure: Lost task 38.3 in stage 18.0 
(TID 1996, ip-10-0-14-83.ec2.internal, executor 4): 
java.util.concurrent.ExecutionException: java.lang.Exception: failed to 
compile: org.codehaus.janino.JaninoRuntimeException: Code of method 
"compare(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I"
 of class 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
grows beyond 64 KB
/* 001 */ public SpecificOrdering generate(Object[] references) {
/* 002 */   return new SpecificOrdering(references);
/* 003 */ }
/* 004 */
/* 005 */ class SpecificOrdering extends 
org.apache.spark.sql.catalyst.expressions.codegen.BaseOrdering {
/* 006 */
/* 007 */   private Object[] references;
/* 008 */
/* 009 */
/* 010 */   public SpecificOrdering(Object[] references) {
/* 011 */ this.references = references;
/* 012 */
/* 013 */   }
/* 014 */
/* 015 */
/* 016 */
/* 017 */   public int compare(InternalRow a, InternalRow b) {
/* 018 */ InternalRow i = null;  // Holds current row being evaluated.
/* 019 */
/* 020 */ i = a;
/* 021 */ boolean isNullA;
/* 022 */ double primitiveA;
/* 023 */ {
/* 024 */
/* 025 */   double value = i.getDouble(0);
/* 026 */   isNullA = false;
/* 027 */   primitiveA = value;
/* 028 */ }
/* 029 */ i = b;
/* 030 */ boolean isNullB;
/* 031 */ double primitiveB;
/* 032 */ {
/* 033 */
/* 034 */   double value = i.getDouble(0);
/* 035 */   isNullB = false;
/* 036 */   primitiveB = value;
/* 037 */ }
/* 038 */ if (isNullA && isNullB) {
/* 039 */   // Nothing
/* 040 */ } else if (isNullA) {
/* 041 */   return -1;
/* 042 */ } else if (isNullB) {
/* 043 */   return 1;
/* 044 */ } else {
/* 045 */   int comp = 
org.apache.spark.util.Utils.nanSafeCompareDoubles(primitiveA, primitiveB);
/* 046 */   if (comp != 0) {
/* 047 */ return comp;
/* 048 */   }
/* 049 */ }
/* 050 */
/* 051 */



was (Author: otissmart):
Hello!

1. I encounter a similar issue (see below text) on Pyspark 2.2.

2. Was the aforementioned patch (aka 
fix(https://github.com/apache/spark/pull/15480) not included in the latest 
release; what are the reason and (source) of and solution to this persistent 
issue please?

py4j.protocol.Py4JJavaError: An error occurred while calling o9396.fit.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 38 in 
stage 18.0 failed 4 times, most recent failure: Lost task 38.3 in stage 18.0 
(TID 1996, ip-10-0-14-83.ec2.internal, executor 4): 
java.util.concurrent.ExecutionException: java.lang.Exception: failed to 
compile: org.codehaus.janino.JaninoRuntimeException: Code of method 
"compare(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I"
 of class 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
grows beyond 64 KB
/* 001 */ public SpecificOrdering generate(Object[] references) {
/* 002 */   return new SpecificOrdering(references);
/* 003 */ }
/* 004 */
/* 005 */ class SpecificOrdering extends 
org.apache.spark.sql.catalyst.expressions.codegen.BaseOrdering {
/* 006 */
/* 007 */   private Object[] references;
/* 008 */
/* 009 */
/* 010 */   public SpecificOrdering(Object[] references) {
/* 011 */ this.references = references;
/* 012 */
/* 013 */   }
/* 014 */
/* 015 */
/* 016 */
/* 017 */   public int compare(InternalRow a, InternalRow b) {
/* 018 */ InternalRow i = null;  // Holds current row being evaluated.
/* 019 */
/* 020 */ i = a;
/* 021 */ boolean isNullA;
/* 022 */ double primitiveA;
/* 023 */ {
/* 024 */
/* 025 */   double value = i.getDouble(0);
/* 026 */   isNullA = false;
/* 027 */   primitiveA = value;
/* 028 */ }
/* 029 */ i = b;
/* 030 */ boolean isNullB;
/* 031 */ double primitiveB;
/* 032 */ {
/* 033 */
/* 034 */   double value = i.getDouble(0);
/* 035 */   isNullB = false;
/* 036 */   primitiveB = value;

[jira] [Commented] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB

2017-08-24 Thread Otis Smart (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16139969#comment-16139969
 ] 

Otis Smart commented on SPARK-16845:


Hello!

1. I encounter a similar issue (see below text) on Pyspark 2.2.

2. Was the aforementioned patch (the fix at https://github.com/apache/spark/pull/15480) not 
included in the latest release? What are the reason for, and the solution to, this persistent 
issue, please?

py4j.protocol.Py4JJavaError: An error occurred while calling o9396.fit.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 38 in 
stage 18.0 failed 4 times, most recent failure: Lost task 38.3 in stage 18.0 
(TID 1996, ip-10-0-14-83.ec2.internal, executor 4): 
java.util.concurrent.ExecutionException: java.lang.Exception: failed to 
compile: org.codehaus.janino.JaninoRuntimeException: Code of method 
"compare(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I"
 of class 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
grows beyond 64 KB
/* 001 */ public SpecificOrdering generate(Object[] references) {
/* 002 */   return new SpecificOrdering(references);
/* 003 */ }
/* 004 */
/* 005 */ class SpecificOrdering extends 
org.apache.spark.sql.catalyst.expressions.codegen.BaseOrdering {
/* 006 */
/* 007 */   private Object[] references;
/* 008 */
/* 009 */
/* 010 */   public SpecificOrdering(Object[] references) {
/* 011 */ this.references = references;
/* 012 */
/* 013 */   }
/* 014 */
/* 015 */
/* 016 */
/* 017 */   public int compare(InternalRow a, InternalRow b) {
/* 018 */ InternalRow i = null;  // Holds current row being evaluated.
/* 019 */
/* 020 */ i = a;
/* 021 */ boolean isNullA;
/* 022 */ double primitiveA;
/* 023 */ {
/* 024 */
/* 025 */   double value = i.getDouble(0);
/* 026 */   isNullA = false;
/* 027 */   primitiveA = value;
/* 028 */ }
/* 029 */ i = b;
/* 030 */ boolean isNullB;
/* 031 */ double primitiveB;
/* 032 */ {
/* 033 */
/* 034 */   double value = i.getDouble(0);
/* 035 */   isNullB = false;
/* 036 */   primitiveB = value;
/* 037 */ }
/* 038 */ if (isNullA && isNullB) {
/* 039 */   // Nothing
/* 040 */ } else if (isNullA) {
/* 041 */   return -1;
/* 042 */ } else if (isNullB) {
/* 043 */   return 1;
/* 044 */ } else {
/* 045 */   int comp = 
org.apache.spark.util.Utils.nanSafeCompareDoubles(primitiveA, primitiveB);
/* 046 */   if (comp != 0) {
/* 047 */ return comp;
/* 048 */   }
/* 049 */ }
/* 050 */
/* 051 */


> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
> -
>
> Key: SPARK-16845
> URL: https://issues.apache.org/jira/browse/SPARK-16845
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: hejie
>Assignee: Liwei Lin
> Fix For: 1.6.4, 2.0.3, 2.1.1, 2.2.0
>
> Attachments: error.txt.zip
>
>
> I have a wide table(400 columns), when I try fitting the traindata on all 
> columns,  the fatal error occurs. 
>   ... 46 more
> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
>   at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
>   at org.codehaus.janino.CodeContext.write(CodeContext.java:854)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21825) improving "assert(exchanges.map(_.outputPartitioning.numPartitions)" in ExchangeCoordinatorSuite

2017-08-24 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16139957#comment-16139957
 ] 

Sean Owen commented on SPARK-21825:
---

I don't think this is even something we should merge.

> improving  "assert(exchanges.map(_.outputPartitioning.numPartitions)" in 
> ExchangeCoordinatorSuite
> -
>
> Key: SPARK-21825
> URL: https://issues.apache.org/jira/browse/SPARK-21825
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 2.3.0
>Reporter: iamhumanbeing
>Priority: Trivial
>
> ExchangeCoordinatorSuite.scala
> Line 424: assert(exchanges.map(_.outputPartitioning.numPartitions).toSet === 
> Set(2, 3)) is not so precisely.
> change to 
>  assert(exchanges.map(_.outputPartitioning.numPartitions).toSeq === Seq(2, 2, 
> 2, 3))
> Line 476: 
> assert(exchanges.map(_.outputPartitioning.numPartitions).toSet === Set(5, 3)) 
> is not so precisely.
> change to 
> assert(exchanges.map(_.outputPartitioning.numPartitions).toSeq 
> === Seq(5, 3, 5))



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21825) improving "assert(exchanges.map(_.outputPartitioning.numPartitions)" in ExchangeCoordinatorSuite

2017-08-24 Thread Nikhil Bhide (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16139953#comment-16139953
 ] 

Nikhil Bhide commented on SPARK-21825:
--

Please assign this issue to me.

> improving  "assert(exchanges.map(_.outputPartitioning.numPartitions)" in 
> ExchangeCoordinatorSuite
> -
>
> Key: SPARK-21825
> URL: https://issues.apache.org/jira/browse/SPARK-21825
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 2.3.0
>Reporter: iamhumanbeing
>Priority: Trivial
>
> ExchangeCoordinatorSuite.scala
> Line 424: assert(exchanges.map(_.outputPartitioning.numPartitions).toSet === 
> Set(2, 3)) is not so precisely.
> change to 
>  assert(exchanges.map(_.outputPartitioning.numPartitions).toSeq === Seq(2, 2, 
> 2, 3))
> Line 476: 
> assert(exchanges.map(_.outputPartitioning.numPartitions).toSet === Set(5, 3)) 
> is not so precisely.
> change to 
> assert(exchanges.map(_.outputPartitioning.numPartitions).toSeq 
> === Seq(5, 3, 5))



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19123) KeyProviderException when reading Azure Blobs from Apache Spark

2017-08-24 Thread Davis (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16139950#comment-16139950
 ] 

Davis commented on SPARK-19123:
---

Please add the following config entry to the SparkSession builder to skip the key decryption:

"fs.azure.account.keyprovider..blob.core.windows.net"

 with value

 "org.apache.hadoop.fs.azure.SimpleKeyProvider"


> KeyProviderException when reading Azure Blobs from Apache Spark
> ---
>
> Key: SPARK-19123
> URL: https://issues.apache.org/jira/browse/SPARK-19123
> Project: Spark
>  Issue Type: Question
>  Components: Input/Output, Java API
>Affects Versions: 2.0.0
> Environment: Apache Spark 2.0.0 running on Azure HDInsight cluster 
> version 3.5 with Hadoop version 2.7.3
>Reporter: Saulo Ricci
>Priority: Minor
>
> I created a Spark job and it's intended to read a set of json files from a 
> Azure Blob container. I set the key and reference to my storage and I'm 
> reading the files as showed in the snippet bellow:
> {code:java}
> SparkSession
> sparkSession =
> SparkSession.builder().appName("Pipeline")
> .master("yarn")
> .config("fs.azure", 
> "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
> 
> .config("fs.azure.account.key..blob.core.windows.net","")
> .getOrCreate();
> Dataset txs = sparkSession.read().json("wasb://path_to_files");
> {code}
> The point is that I'm unfortunately getting a 
> `org.apache.hadoop.fs.azure.KeyProviderException` when reading the blobs from 
> the azure storage. According to the trace showed bellow it seems the header 
> too long but still trying to figure out what exactly that means:
> {code:java}
> 17/01/07 19:28:39 ERROR ApplicationMaster: User class threw exception: 
> org.apache.hadoop.fs.azure.AzureException: 
> org.apache.hadoop.fs.azure.KeyProviderException: ExitCodeException 
> exitCode=2: Error reading S/MIME message
> 140473279682200:error:0D07207B:asn1 encoding 
> routines:ASN1_get_object:header too long:asn1_lib.c:157:
> 140473279682200:error:0D0D106E:asn1 encoding 
> routines:B64_READ_ASN1:decode error:asn_mime.c:192:
> 140473279682200:error:0D0D40CB:asn1 encoding 
> routines:SMIME_read_ASN1:asn1 parse error:asn_mime.c:517:
> org.apache.hadoop.fs.azure.AzureException: 
> org.apache.hadoop.fs.azure.KeyProviderException: ExitCodeException 
> exitCode=2: Error reading S/MIME message
> 140473279682200:error:0D07207B:asn1 encoding 
> routines:ASN1_get_object:header too long:asn1_lib.c:157:
> 140473279682200:error:0D0D106E:asn1 encoding 
> routines:B64_READ_ASN1:decode error:asn_mime.c:192:
> 140473279682200:error:0D0D40CB:asn1 encoding 
> routines:SMIME_read_ASN1:asn1 parse error:asn_mime.c:517:
>   at 
> org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.createAzureStorageSession(AzureNativeFileSystemStore.java:953)
>   at 
> org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.initialize(AzureNativeFileSystemStore.java:450)
>   at 
> org.apache.hadoop.fs.azure.NativeAzureFileSystem.initialize(NativeAzureFileSystem.java:1209)
>   at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2761)
>   at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:99)
>   at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2795)
>   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2777)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:386)
>   at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:366)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:364)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
>   at scala.collection.immutable.List.flatMap(List.scala:344)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:364)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
>   at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:294)
>   at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:249)
>   at 
> taka.pipelines.AnomalyTrainingPipeline.main(AnomalyTrainingPipeline.java:35)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(Native

[jira] [Assigned] (SPARK-19159) PySpark UDF API improvements

2017-08-24 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-19159:


Assignee: Maciej Szymkiewicz

> PySpark UDF API improvements
> 
>
> Key: SPARK-19159
> URL: https://issues.apache.org/jira/browse/SPARK-19159
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 1.6.0, 2.0.0, 2.1.0, 2.2.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
> Fix For: 2.3.0
>
>
> At this moment PySpark UDF API is a bit rough around the edges. This is an an 
> umbrella ticket for possible API improvements.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21797) spark cannot read partitioned data in S3 that are partly in glacier

2017-08-24 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16139919#comment-16139919
 ] 

Steve Loughran commented on SPARK-21797:


If you are using s3:// URLs then it's the AWS team's problem. If you were using s3a://, then it'd 
be something you ask the Hadoop team to look at, but we'd say no as:

* it's a niche use case
* It's really slow, as in "read() takes so long that other bits of the system will start to think 
your worker is hanging". Which means if you have speculative execution turned on, they kick off 
other workers to read the data.
* It's a very, very expensive way to work with data; $0.03/GB, which ramps up fast once multiple 
Spark workers start reading the same datasets in parallel.
* Finally, it's been rejected on the server with a 403 response. That's Amazon S3 saying "no", 
not any of the clients.

You shouldn't be trying to process data directly from Glacier. Copy it back to S3 or to a 
transient HDFS cluster, maybe as part of an oozie or airflow workflow.

I'd be curious about the full stack trace you see if you do try this with s3a://, even though 
it'll still be a WONTFIX. We could at least go for a more meaningful exception translation, and 
the retry logic needs to know that the error won't go away if you try again.

> spark cannot read partitioned data in S3 that are partly in glacier
> ---
>
> Key: SPARK-21797
> URL: https://issues.apache.org/jira/browse/SPARK-21797
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Boris Clémençon 
>  Labels: glacier, partitions, read, s3
>
> I have a Parquet dataset in S3 partitioned by date (dt), with the oldest dates 
> stored in AWS Glacier to save some money. For instance, we have...
> {noformat}
> s3://my-bucket/my-dataset/dt=2017-07-01/[in glacier]
> ...
> s3://my-bucket/my-dataset/dt=2017-07-09/[in glacier]
> s3://my-bucket/my-dataset/dt=2017-07-10/[not in glacier]
> ...
> s3://my-bucket/my-dataset/dt=2017-07-24/[not in glacier]
> {noformat}
> I want to read this dataset, but only a subset of dates that are not yet in 
> Glacier, e.g.:
> {code:java}
> val from = "2017-07-15"
> val to = "2017-08-24"
> val path = "s3://my-bucket/my-dataset/"
> val X = spark.read.parquet(path).where(col("dt").between(from, to))
> {code}
> Unfortunately, I get the following exception:
> {noformat}
> java.io.IOException: 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception:
>  The operation is not valid for the object's storage class (Service: Amazon 
> S3; Status Code: 403; Error Code: InvalidObjectState; Request ID: 
> C444D508B6042138)
> {noformat}
> It seems that Spark does not like a partitioned dataset when some of its 
> partitions are in Glacier. I could always read each date separately, add the 
> date column, and reduce(_ union _) at the end, but that is not pretty and it 
> should not be necessary.
> Is there any tip to read the available data in the datastore even with old 
> data in Glacier?
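
A minimal PySpark sketch of that per-partition workaround, assuming the 
non-Glacier date range is known in advance and that every listed partition 
actually exists (bucket and path names are the illustrative ones from the 
report); the basePath option keeps dt available as a partition column:

{code}
from datetime import date, timedelta

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

base = "s3://my-bucket/my-dataset"                 # illustrative path from the report
start, end = date(2017, 7, 15), date(2017, 7, 24)  # known non-Glacier range

# Build explicit leaf paths so Spark never lists or opens the Glacier-archived
# partitions.
days = [(start + timedelta(days=i)).isoformat()
        for i in range((end - start).days + 1)]
paths = ["%s/dt=%s" % (base, d) for d in days]

# basePath keeps `dt` as a partition column even though leaf directories are
# passed explicitly, so no manual union or column re-adding is needed.
df = spark.read.option("basePath", base).parquet(*paths)
{code}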



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18777) Return UDF objects when registering from Python

2017-08-24 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-18777:


Assignee: Maciej Szymkiewicz

> Return UDF objects when registering from Python
> ---
>
> Key: SPARK-18777
> URL: https://issues.apache.org/jira/browse/SPARK-18777
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Reporter: holdenk
>Assignee: Maciej Szymkiewicz
>
> In Scala, registering a UDF gives you back a UDF object that you can use in 
> the Dataset/DataFrame API as well as in SQL expressions. We can do the same in 
> Python, for both Python UDFs and Java UDFs registered from Python.
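
A minimal PySpark sketch of the intended usage once registration returns the 
UDF object, as this ticket proposes (the function name, return type and view 
name are illustrative):

{code}
from pyspark.sql import SparkSession
from pyspark.sql.types import LongType

spark = SparkSession.builder.getOrCreate()
df = spark.range(5)
df.createOrReplaceTempView("nums")

# Registering returns a UDF object that is usable from both APIs.
squared = spark.udf.register("squared", lambda x: x * x, LongType())

df.select(squared("id")).show()                   # Dataset/DataFrame API
spark.sql("SELECT squared(id) FROM nums").show()  # SQL expression
{code}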



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18777) Return UDF objects when registering from Python

2017-08-24 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-18777.
--
Resolution: Fixed

Fixed in https://github.com/apache/spark/pull/17831

> Return UDF objects when registering from Python
> ---
>
> Key: SPARK-18777
> URL: https://issues.apache.org/jira/browse/SPARK-18777
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Reporter: holdenk
>
> In Scala, registering a UDF gives you back a UDF object that you can use in 
> the Dataset/DataFrame API as well as in SQL expressions. We can do the same in 
> Python, for both Python UDFs and Java UDFs registered from Python.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19159) PySpark UDF API improvements

2017-08-24 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-19159:
-
Fix Version/s: 2.3.0

> PySpark UDF API improvements
> 
>
> Key: SPARK-19159
> URL: https://issues.apache.org/jira/browse/SPARK-19159
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 1.6.0, 2.0.0, 2.1.0, 2.2.0
>Reporter: Maciej Szymkiewicz
> Fix For: 2.3.0
>
>
> At this moment the PySpark UDF API is a bit rough around the edges. This is an 
> umbrella ticket for possible API improvements.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19159) PySpark UDF API improvements

2017-08-24 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-19159.
--
Resolution: Done

> PySpark UDF API improvements
> 
>
> Key: SPARK-19159
> URL: https://issues.apache.org/jira/browse/SPARK-19159
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 1.6.0, 2.0.0, 2.1.0, 2.2.0
>Reporter: Maciej Szymkiewicz
>
> At this moment the PySpark UDF API is a bit rough around the edges. This is an 
> umbrella ticket for possible API improvements.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19165) UserDefinedFunction should verify call arguments and provide readable exception in case of mismatch

2017-08-24 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-19165:


Assignee: Hyukjin Kwon

> UserDefinedFunction should verify call arguments and provide readable 
> exception in case of mismatch
> ---
>
> Key: SPARK-19165
> URL: https://issues.apache.org/jira/browse/SPARK-19165
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 1.6.0, 2.0.0, 2.1.0, 2.2.0
>Reporter: Maciej Szymkiewicz
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 2.3.0
>
>
> Invalid arguments to a UDF call fail with somewhat cryptic Py4J errors:
> {code}
> In [5]: f = udf(lambda x: x)
> In [6]: df.select(f([]))
> ---
> Py4JError Traceback (most recent call last)
>  in ()
> > 1 df.select(f([]))
> 
> Py4JError: An error occurred while calling 
> z:org.apache.spark.sql.functions.col. Trace:
> py4j.Py4JException: Method col([class java.util.ArrayList]) does not exist
>   at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
>   at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:339)
>   at py4j.Gateway.invoke(Gateway.java:274)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:214)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> It is pretty easy to perform basic input validation:
> {code}
> In [8]: f = udf(lambda x: x)
> In [9]: f(1)
> ---
> TypeError Traceback (most recent call last)
> ...
> TypeError: All arguments should be Columns or strings representing column 
> names. Got 1 of type 
> {code}
> This can be further extended to check for the expected number of arguments or 
> even, with some form of annotations, SQL types.
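
A minimal sketch of the kind of call-argument validation proposed here; 
checked_udf is a hypothetical wrapper for illustration, not the actual patch:

{code}
from pyspark.sql.column import Column
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType


def checked_udf(f, returnType=StringType()):
    """Hypothetical wrapper: validate call arguments before invoking the UDF."""
    wrapped = udf(f, returnType)

    def call(*args):
        for arg in args:
            if not isinstance(arg, (Column, str)):
                raise TypeError(
                    "All arguments should be Columns or strings representing "
                    "column names. Got %r of type %s" % (arg, type(arg)))
        return wrapped(*args)

    return call
{code}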



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19165) UserDefinedFunction should verify call arguments and provide readable exception in case of mismatch

2017-08-24 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-19165.
--
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19027
[https://github.com/apache/spark/pull/19027]

> UserDefinedFunction should verify call arguments and provide readable 
> exception in case of mismatch
> ---
>
> Key: SPARK-19165
> URL: https://issues.apache.org/jira/browse/SPARK-19165
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 1.6.0, 2.0.0, 2.1.0, 2.2.0
>Reporter: Maciej Szymkiewicz
>Priority: Minor
> Fix For: 2.3.0
>
>
> Invalid arguments to a UDF call fail with somewhat cryptic Py4J errors:
> {code}
> In [5]: f = udf(lambda x: x)
> In [6]: df.select(f([]))
> ---
> Py4JError Traceback (most recent call last)
>  in ()
> > 1 df.select(f([]))
> 
> Py4JError: An error occurred while calling 
> z:org.apache.spark.sql.functions.col. Trace:
> py4j.Py4JException: Method col([class java.util.ArrayList]) does not exist
>   at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
>   at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:339)
>   at py4j.Gateway.invoke(Gateway.java:274)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:214)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> It is pretty easy to perform basic input validation:
> {code}
> In [8]: f = udf(lambda x: x)
> In [9]: f(1)
> ---
> TypeError Traceback (most recent call last)
> ...
> TypeError: All arguments should be Columns or strings representing column 
> names. Got 1 of type 
> {code}
> This can be further extended to check for the expected number of arguments or 
> even, with some form of annotations, SQL types.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21172) EOFException reached end of stream in UnsafeRowSerializer

2017-08-24 Thread liupengcheng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16139910#comment-16139910
 ] 

liupengcheng commented on SPARK-21172:
--

[~lasanthafdo] This may be caused by bit errors during network transfer, or by 
disk bit errors in the shuffle data written by the parent stage.
A possible hack is to throw a FetchFailedException so that the parent stage is 
retried, which may work around this problem.

It also seems that the community has already introduced a mechanism to check 
for disk bit errors and retry the parent stage.

> EOFException reached end of stream in UnsafeRowSerializer
> -
>
> Key: SPARK-21172
> URL: https://issues.apache.org/jira/browse/SPARK-21172
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 2.0.1
>Reporter: liupengcheng
>  Labels: shuffle
>
> A Spark SQL job failed because of the following exception. It seems like a bug 
> in the shuffle stage.
> The shuffle read size for a single task is tens of GB.
> {code}
> org.apache.spark.SparkException: Task failed while writing rows
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:264)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.EOFException: reached end of stream after reading 9034374 
> bytes; 1684891936 bytes expected
>   at 
> org.spark_project.guava.io.ByteStreams.readFully(ByteStreams.java:735)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:127)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:110)
>   at scala.collection.Iterator$$anon$12.next(Iterator.scala:444)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at 
> org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:30)
>   at 
> org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:255)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:253)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:253)
>   at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1345)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:259)
>   ... 8 more
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21702) Structured Streaming S3A SSE Encryption Not Visible through AWS S3 GUI when PartitionBy Used

2017-08-24 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16139898#comment-16139898
 ] 

Steve Loughran commented on SPARK-21702:


If this is just about "directories", then there are no directories in S3. We 
create some mock ones for empty dirs (i.e. after a mkdirs() call) through 
0-byte objects, and we then delete all such 0-byte objects when you write data 
underneath; see {{S3AFileSystem.deleteUnnecessaryFakeDirectories(Path)}}.

I think that's what's been causing the confusion.

I'm going to close this one as invalid. Sorry.

FWIW, if you do want to guarantee that the data in a bucket is encrypted, set 
the bucket policy to mandate it. That's the best way to be confident that all 
your data is locked down: 
[https://hortonworks.github.io/hdp-aws/s3-encryption/index.html]
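
A minimal boto3 sketch of such a bucket policy, assuming boto3 is available, 
credentials are configured, and the bucket name is illustrative; it denies any 
PutObject request that does not ask for AES256 server-side encryption:

{code}
import json

import boto3

BUCKET = "my-encrypted-bucket"  # illustrative name

# Deny any PutObject that does not request AES256 server-side encryption.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "DenyUnencryptedPuts",
        "Effect": "Deny",
        "Principal": "*",
        "Action": "s3:PutObject",
        "Resource": "arn:aws:s3:::%s/*" % BUCKET,
        "Condition": {
            "StringNotEquals": {"s3:x-amz-server-side-encryption": "AES256"}
        },
    }],
}

boto3.client("s3").put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(policy))
{code}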



> Structured Streaming S3A SSE Encryption Not Visible through AWS S3 GUI when 
> PartitionBy Used
> 
>
> Key: SPARK-21702
> URL: https://issues.apache.org/jira/browse/SPARK-21702
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
> Environment: Hadoop 2.7.3: AWS SDK 1.7.4
> Hadoop 2.8.1: AWS SDK 1.10.6
>Reporter: George Pongracz
>Priority: Minor
>  Labels: security
>
> Settings:
>   .config("spark.hadoop.fs.s3a.impl", 
> "org.apache.hadoop.fs.s3a.S3AFileSystem")
>   .config("spark.hadoop.fs.s3a.server-side-encryption-algorithm", 
> "AES256")
> When writing to an S3 sink from Structured Streaming, the files are encrypted 
> using AES-256.
> When a partitionBy is introduced, the output data files are unencrypted, while 
> all other supporting files and metadata are encrypted.
> I suspect the write to the temporary location is encrypted and the move/rename 
> does not apply the SSE.
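
A minimal PySpark sketch of the kind of write described above (the rate source, 
derived date column, paths and checkpoint location are illustrative, not taken 
from the report):

{code}
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_date

spark = (SparkSession.builder
         .config("spark.hadoop.fs.s3a.impl",
                 "org.apache.hadoop.fs.s3a.S3AFileSystem")
         .config("spark.hadoop.fs.s3a.server-side-encryption-algorithm",
                 "AES256")
         .getOrCreate())

# Illustrative streaming source; the report is about the sink behaviour only.
events = (spark.readStream.format("rate").load()
          .withColumn("dt", to_date("timestamp")))

# With partitionBy, the partitioned data files reportedly show up unencrypted,
# while metadata and other supporting files remain encrypted.
query = (events.writeStream
         .format("parquet")
         .option("path", "s3a://my-bucket/events/")
         .option("checkpointLocation", "s3a://my-bucket/checkpoints/events/")
         .partitionBy("dt")
         .start())
{code}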



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


