[jira] [Updated] (SPARK-18355) Spark SQL fails to read data from an ORC hive table that has a new column added to it
[ https://issues.apache.org/jira/browse/SPARK-18355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-18355:
----------------------------------
    Affects Version/s: 2.2.0

> Spark SQL fails to read data from an ORC hive table that has a new column added to it
> -------------------------------------------------------------------------------------
>
>                 Key: SPARK-18355
>                 URL: https://issues.apache.org/jira/browse/SPARK-18355
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.2, 2.1.0, 2.2.0
>         Environment: Centos6
>            Reporter: Sandeep Nemuri
>
> *PROBLEM*:
> Spark SQL fails to read data from an ORC hive table that has a new column added to it. Below is the exception:
> {code}
> scala> sqlContext.sql("select click_id,search_id from testorc").show
> 16/11/03 22:17:53 INFO ParseDriver: Parsing command: select click_id,search_id from testorc
> 16/11/03 22:17:54 INFO ParseDriver: Parse Completed
> java.lang.AssertionError: assertion failed
> 	at scala.Predef$.assert(Predef.scala:165)
> 	at org.apache.spark.sql.execution.datasources.LogicalRelation$$anonfun$1.apply(LogicalRelation.scala:39)
> 	at org.apache.spark.sql.execution.datasources.LogicalRelation$$anonfun$1.apply(LogicalRelation.scala:38)
> 	at scala.Option.map(Option.scala:145)
> 	at org.apache.spark.sql.execution.datasources.LogicalRelation.<init>(LogicalRelation.scala:38)
> 	at org.apache.spark.sql.execution.datasources.LogicalRelation.copy(LogicalRelation.scala:31)
> 	at org.apache.spark.sql.hive.HiveMetastoreCatalog.org$apache$spark$sql$hive$HiveMetastoreCatalog$$convertToOrcRelation(HiveMetastoreCatalog.scala:588)
> 	at org.apache.spark.sql.hive.HiveMetastoreCatalog$OrcConversions$$anonfun$apply$2.applyOrElse(HiveMetastoreCatalog.scala:647)
> 	at org.apache.spark.sql.hive.HiveMetastoreCatalog$OrcConversions$$anonfun$apply$2.applyOrElse(HiveMetastoreCatalog.scala:643)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:335)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:335)
> 	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:334)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:332)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:332)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:281)
> 	at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> {code}
> *STEPS TO SIMULATE THIS ISSUE*:
> 1) Create the table in hive:
> {code}
> CREATE TABLE `testorc`(
>   `click_id` string,
>   `search_id` string,
>   `uid` bigint)
> PARTITIONED BY (
>   `ts` string,
>   `hour` string)
> STORED AS ORC;
> {code}
> 2) Load data into the table:
> {code}
> INSERT INTO TABLE testorc PARTITION (ts = '98765', hour = '01') VALUES (12,2,12345);
> {code}
> 3) Select through spark shell (this works):
> {code}
> sqlContext.sql("select click_id,search_id from testorc").show
> {code}
> 4) Now add a column to the hive table:
> {code}
> ALTER TABLE testorc ADD COLUMNS (dummy string);
> {code}
> 5) Now again select from spark shell:
> {code}
> scala> sqlContext.sql("select click_id,search_id from testorc").show
> 16/11/03 22:17:53 INFO ParseDriver: Parsing command: select click_id,search_id from testorc
> 16/11/03 22:17:54 INFO ParseDriver: Parse Completed
> java.lang.AssertionError: assertion failed
> 	at scala.Predef$.assert(Predef.scala:165)
> 	at org.apache.spark.sql.execution.datasources.LogicalRelation$$anonfun$1.apply(LogicalRelation.scala:39)
> 	at org.apache.spark.sql.execution.datasources.LogicalRelation$$anonfun$1.apply(LogicalRelation.scala:38)
> 	at scala.Option.map(Option.scala:145)
> 	at org.apache.spark.sql.execution.datasources.LogicalRelation.<init>(LogicalRelation.scala:38)
> 	at org.apache.spark.sql.execution.datasources.LogicalRelation.copy(LogicalRelation.scala:31)
> 	at org.apache.spark.sql.hive.HiveMetastoreCatalog.org$apache$spark$sql$hive$HiveMetastoreCatalog$$convertToOrcRelation(HiveMetastoreCatalog.scala:588)
> 	at org.apache.spark.sql.hive.HiveMetastoreCatalog$OrcConversions$$anonfun$apply$2.applyOrElse(HiveMetastoreCatalog.scala:647)
> 	at org.apache.spark.sql.hive.HiveMetastoreCatalog$OrcConversions$$anonfun$apply$2.applyOrElse(HiveMetastoreCatalog.scala:643)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:335)
> 	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:335)
> 	at org.apache.spark.sql.cata
[jira] [Closed] (SPARK-21815) Nondeterministic group labeling within small connected component
[ https://issues.apache.org/jira/browse/SPARK-21815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

nguyen duc tuan closed SPARK-21815.
-----------------------------------
    Resolution: Not A Bug

> Nondeterministic group labeling within small connected component
> ----------------------------------------------------------------
>
>                 Key: SPARK-21815
>                 URL: https://issues.apache.org/jira/browse/SPARK-21815
>             Project: Spark
>          Issue Type: Improvement
>          Components: GraphX
>    Affects Versions: 1.6.3, 2.2.0
>            Reporter: nguyen duc tuan
>            Priority: Trivial
>              Labels: easyfix
>
> As I see in the code at
> https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/lib/LabelPropagation.scala#L61,
> when the number of vertices in each community is small and the number of iterations is large enough, all candidate labels end up with the same score. Because the result then depends on the iteration order of the set, each vertex can be assigned a different community id from run to run. Breaking ties by ordering on vertex id solves the problem.
> Sample code to reproduce this error:
> {code}
> val vertices = spark.sparkContext.parallelize(Seq((1L, 1), (2L, 1)))
> val edges = spark.sparkContext.parallelize(Seq(Edge(1L, 2L, 1)))
> val g = Graph(vertices, edges)
> val c = LabelPropagation.run(g, 5)
> c.vertices.map(x => (x._1, x._2)).toDF.show
> {code}

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
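The fix the reporter suggests, i.e. breaking score ties deterministically instead of depending on set iteration order, can be illustrated outside GraphX. Below is a minimal Python sketch (an analogy only, not the Scala code in LabelPropagation.scala; `pick_label` and its `{label: score}` vote format are invented for illustration):

```python
def pick_label(votes):
    """Pick the winning community label from a {label: score} vote map.

    Sketch of the proposed tie-break: when several labels share the top
    score, choose the smallest label id instead of relying on whatever
    order the underlying set/dict happens to iterate in.
    """
    best_score = max(votes.values())
    tied = [label for label, score in votes.items() if score == best_score]
    return min(tied)  # deterministic tie-break by label (vertex) id

# The two-vertex example from the issue: both candidates score equally,
# yet the winner is the same regardless of insertion order.
print(pick_label({1: 1.0, 2: 1.0}))  # -> 1
print(pick_label({2: 1.0, 1: 1.0}))  # -> 1
```

Applied per vertex, both vertices of the small component converge to the same community id on every run, which is the determinism the report asks for.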
[jira] [Assigned] (SPARK-21255) NPE when creating encoder for enum
[ https://issues.apache.org/jira/browse/SPARK-21255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen reassigned SPARK-21255:
---------------------------------
    Assignee: Mike

> NPE when creating encoder for enum
> ----------------------------------
>
>                 Key: SPARK-21255
>                 URL: https://issues.apache.org/jira/browse/SPARK-21255
>             Project: Spark
>          Issue Type: Bug
>          Components: Java API
>    Affects Versions: 2.1.0
>         Environment: org.apache.spark:spark-core_2.10:2.1.0
>                      org.apache.spark:spark-sql_2.10:2.1.0
>            Reporter: Mike
>            Assignee: Mike
>             Fix For: 2.3.0
>
> When you try to create an encoder for an enum type (or a bean with an enum property) via Encoders.bean(...), it fails with a NullPointerException at TypeToken:495.
> I did a little research and it turns out that in JavaTypeInference:126 the following code
> {code:java}
> val beanInfo = Introspector.getBeanInfo(typeToken.getRawType)
> val properties = beanInfo.getPropertyDescriptors.filterNot(_.getName == "class")
> val fields = properties.map { property =>
>   val returnType = typeToken.method(property.getReadMethod).getReturnType
>   val (dataType, nullable) = inferDataType(returnType)
>   new StructField(property.getName, dataType, nullable)
> }
> (new StructType(fields), true)
> {code}
> filters out properties named "class", because we wouldn't want to serialize that. But enum types have another property of type Class, named "declaringClass", which we then try to inspect recursively. Eventually we try to inspect the ClassLoader class, which has a property "defaultAssertionStatus" with no read method, which leads to the NPE at TypeToken:495.
> I think adding the property name "declaringClass" to the filtering will resolve this.
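The runaway recursion described here is easier to see in miniature. The following Python analogy (invented for illustration; Spark's real logic is the Scala snippet quoted above) shows how a name blacklist stops schema-style introspection from descending into class metadata the way the proposed "declaringClass" filter does:

```python
# Hypothetical sketch (not Spark's code): recursive "schema" inference
# over an object's public attributes, with a name blacklist that stops
# the introspection from wandering into metadata properties.

SKIP = {"class", "declaringClass"}  # the proposed fix adds "declaringClass"

def infer_schema(obj, depth=0, max_depth=5):
    if depth > max_depth:
        # Analogous to the NPE the reporter hits deep inside ClassLoader
        raise RecursionError("runaway introspection")
    if isinstance(obj, (int, float, str, bool)):
        return type(obj).__name__
    # Treat public, non-blacklisted attributes as "bean properties"
    props = {name: getattr(obj, name)
             for name in dir(obj)
             if not name.startswith("_") and name not in SKIP}
    return {name: infer_schema(value, depth + 1, max_depth)
            for name, value in props.items()
            if not callable(value)}

class Color:
    declaringClass = object  # stands in for the enum's Class-typed property
    code = 1

print(infer_schema(Color()))  # -> {'code': 'int'}
```

With "declaringClass" in the blacklist, inference stops at the data-carrying properties instead of recursing into class metadata.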
[jira] [Resolved] (SPARK-21255) NPE when creating encoder for enum
[ https://issues.apache.org/jira/browse/SPARK-21255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved SPARK-21255.
-------------------------------
       Resolution: Fixed
    Fix Version/s: 2.3.0

Issue resolved by pull request 18488
[https://github.com/apache/spark/pull/18488]
[jira] [Commented] (SPARK-21691) Accessing canonicalized plan for query with limit throws exception
[ https://issues.apache.org/jira/browse/SPARK-21691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16141248#comment-16141248 ]

Anton Okolnychyi commented on SPARK-21691:
------------------------------------------

[~Robin Shao] My suggestion was to modify the output method in 'Project'. [~smilegator] Could you please provide your opinion on the solution proposed in the first comment?

> Accessing canonicalized plan for query with limit throws exception
> ------------------------------------------------------------------
>
>                 Key: SPARK-21691
>                 URL: https://issues.apache.org/jira/browse/SPARK-21691
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Bjoern Toldbod
>
> Accessing the logical, canonicalized plan fails for queries with limits. The following demonstrates the issue:
> {code:java}
> val session = SparkSession.builder.master("local").getOrCreate()
> // This works
> session.sql("select * from (values 0, 1)").queryExecution.logical.canonicalized
> // This fails
> session.sql("select * from (values 0, 1) limit 1").queryExecution.logical.canonicalized
> {code}
> The message in the thrown exception is somewhat confusing (or at least not directly related to the limit):
> "Invalid call to toAttribute on unresolved object, tree: *"
[jira] [Updated] (SPARK-21828) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB...again
[ https://issues.apache.org/jira/browse/SPARK-21828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Takeshi Yamamuro updated SPARK-21828:
-------------------------------------
    Flags:   (was: Important)

> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB...again
> ------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-21828
>                 URL: https://issues.apache.org/jira/browse/SPARK-21828
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Otis Smart
>            Priority: Critical
>
> Hello!
> 1. I encounter a similar issue (see text below) on PySpark 2.2 (e.g., a dataframe with ~5 rows x 1100+ columns as input to the ".fit()" method of a CrossValidator() that includes a Pipeline() that includes StringIndexer(), VectorAssembler() and DecisionTreeClassifier()).
> 2. Was the aforementioned patch (https://github.com/apache/spark/pull/15480) not included in the latest release? What is the reason for, and the solution to, this persistent issue?
> {code}
> py4j.protocol.Py4JJavaError: An error occurred while calling o9396.fit.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 38 in stage 18.0 failed 4 times, most recent failure: Lost task 38.3 in stage 18.0 (TID 1996, ip-10-0-14-83.ec2.internal, executor 4): java.util.concurrent.ExecutionException: java.lang.Exception: failed to compile: org.codehaus.janino.JaninoRuntimeException: Code of method "compare(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB
> /* 001 */ public SpecificOrdering generate(Object[] references) {
> /* 002 */   return new SpecificOrdering(references);
> /* 003 */ }
> /* 004 */
> /* 005 */ class SpecificOrdering extends org.apache.spark.sql.catalyst.expressions.codegen.BaseOrdering {
> /* 006 */
> /* 007 */   private Object[] references;
> /* 008 */
> /* 009 */
> /* 010 */   public SpecificOrdering(Object[] references) {
> /* 011 */     this.references = references;
> /* 012 */
> /* 013 */   }
> /* 014 */
> /* 015 */
> /* 016 */
> /* 017 */   public int compare(InternalRow a, InternalRow b) {
> /* 018 */     InternalRow i = null;  // Holds current row being evaluated.
> /* 019 */
> /* 020 */     i = a;
> /* 021 */     boolean isNullA;
> /* 022 */     double primitiveA;
> /* 023 */     {
> /* 024 */
> /* 025 */       double value = i.getDouble(0);
> /* 026 */       isNullA = false;
> /* 027 */       primitiveA = value;
> /* 028 */     }
> /* 029 */     i = b;
> /* 030 */     boolean isNullB;
> /* 031 */     double primitiveB;
> /* 032 */     {
> /* 033 */
> /* 034 */       double value = i.getDouble(0);
> /* 035 */       isNullB = false;
> /* 036 */       primitiveB = value;
> /* 037 */     }
> /* 038 */     if (isNullA && isNullB) {
> /* 039 */       // Nothing
> /* 040 */     } else if (isNullA) {
> /* 041 */       return -1;
> /* 042 */     } else if (isNullB) {
> /* 043 */       return 1;
> /* 044 */     } else {
> /* 045 */       int comp = org.apache.spark.util.Utils.nanSafeCompareDoubles(primitiveA, primitiveB);
> /* 046 */       if (comp != 0) {
> /* 047 */         return comp;
> /* 048 */       }
> /* 049 */     }
> /* 050 */
> /* 051 */
> ...
> {code}
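For background, the 64 KB limit is the JVM's cap on the bytecode size of a single method, so a generated compare() that emits one comparison block per column overflows it once the column count gets large. The standard remedy in code generation is to split the emitted statements across helper methods and chain them. The following is a hypothetical Python sketch of that chunking idea only (not Spark's actual Catalyst codegen; `generate_compare` and the chunk size are invented):

```python
def generate_compare(num_cols, chunk_size=100):
    """Generate source for a row comparator split into helper functions.

    Instead of one giant compare() with num_cols comparison blocks,
    emit one helper per chunk_size columns plus a small driver that
    chains them, so no single function grows unboundedly.
    """
    helpers, calls = [], []
    for start in range(0, num_cols, chunk_size):
        cols = range(start, min(start + chunk_size, num_cols))
        body = "\n".join(
            f"    if a[{c}] != b[{c}]: return -1 if a[{c}] < b[{c}] else 1"
            for c in cols)
        helpers.append(f"def compare_{start}(a, b):\n{body}\n    return 0")
        calls.append(f"compare_{start}")
    chain = "\n".join(
        f"    r = {name}(a, b)\n    if r != 0: return r" for name in calls)
    return ("\n\n".join(helpers)
            + f"\n\ndef compare(a, b):\n{chain}\n    return 0")

# 250 columns -> 3 helper chunks plus a short driver, not one huge body.
ns = {}
exec(generate_compare(250), ns)
print(ns["compare"]((1,) * 250, (1,) * 250))  # -> 0
```

The same shape, i.e. splitting generated expressions into sub-methods, is the direction the patch referenced in the description takes for the JVM method-size limit.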
[jira] [Updated] (SPARK-21828) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB...again
[ https://issues.apache.org/jira/browse/SPARK-21828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Takeshi Yamamuro updated SPARK-21828:
-------------------------------------
    Component/s:     (was: ML)
[jira] [Commented] (SPARK-21828) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB...again
[ https://issues.apache.org/jira/browse/SPARK-21828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16141234#comment-16141234 ]

Takeshi Yamamuro commented on SPARK-21828:
------------------------------------------

You can't set a target version; these fields are reserved for committers. Please first see http://spark.apache.org/contributing.html
[jira] [Commented] (SPARK-21835) RewritePredicateSubquery should not produce unresolved query plans
[ https://issues.apache.org/jira/browse/SPARK-21835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16141233#comment-16141233 ]

Liang-Chi Hsieh commented on SPARK-21835:
-----------------------------------------

Submitted PR at https://github.com/apache/spark/pull/19050

> RewritePredicateSubquery should not produce unresolved query plans
> ------------------------------------------------------------------
>
>                 Key: SPARK-21835
>                 URL: https://issues.apache.org/jira/browse/SPARK-21835
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Liang-Chi Hsieh
>
> {{RewritePredicateSubquery}} rewrites a correlated subquery into join operations. While checking structural integrity, I found that {{RewritePredicateSubquery}} can produce unresolved query plans due to conflicting attributes. We should not let {{RewritePredicateSubquery}} produce unresolved plans.
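The kind of attribute conflict involved can be sketched in miniature: when a predicate subquery (e.g. EXISTS) is rewritten into a semi join, the subquery's output columns must not reuse the outer plan's attribute names/ids, or the resulting join "plan" is ambiguous. A hypothetical plain-Python sketch follows (not Catalyst; `rewrite_exists` and `semi_join` are invented for illustration):

```python
def semi_join(outer_rows, sub_rows, outer_key, sub_key):
    """Keep each outer row whose key appears on the subquery side."""
    keys = {r[sub_key] for r in sub_rows}
    return [r for r in outer_rows if r[outer_key] in keys]

def rewrite_exists(outer_rows, sub_rows, key):
    # Alias the subquery's column before joining so the two sides never
    # carry the same attribute name (mirrors deduplicating expression
    # IDs so the rewritten plan stays resolvable).
    aliased = [{f"sub_{key}": r[key]} for r in sub_rows]
    return semi_join(outer_rows, aliased, key, f"sub_{key}")

# "SELECT * FROM orders o WHERE EXISTS (SELECT 1 FROM paid p WHERE p.id = o.id)"
orders = [{"id": 1}, {"id": 2}, {"id": 3}]
paid = [{"id": 2}, {"id": 3}]
print(rewrite_exists(orders, paid, "id"))  # -> [{'id': 2}, {'id': 3}]
```

Skipping the aliasing step is the analogue of the conflicting-attribute problem the issue describes: both join sides would expose a column named "id".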
[jira] [Updated] (SPARK-21828) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB...again
[ https://issues.apache.org/jira/browse/SPARK-21828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Takeshi Yamamuro updated SPARK-21828:
-------------------------------------
    Target Version/s:   (was: 2.1.0, 2.2.0)
[jira] [Created] (SPARK-21835) RewritePredicateSubquery should not produce unresolved query plans
Liang-Chi Hsieh created SPARK-21835:
---------------------------------------

             Summary: RewritePredicateSubquery should not produce unresolved query plans
                 Key: SPARK-21835
                 URL: https://issues.apache.org/jira/browse/SPARK-21835
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.2.0
            Reporter: Liang-Chi Hsieh

{{RewritePredicateSubquery}} rewrites a correlated subquery into join operations. While checking structural integrity, I found that {{RewritePredicateSubquery}} can produce unresolved query plans due to conflicting attributes. We should not let {{RewritePredicateSubquery}} produce unresolved plans.
[jira] [Comment Edited] (SPARK-21828) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB...again
[ https://issues.apache.org/jira/browse/SPARK-21828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16141138#comment-16141138 ]

Otis Smart edited comment on SPARK-21828 at 8/25/17 4:19 AM:
-------------------------------------------------------------

Hi KI:

I thank you for the expedient reply!
* Here (below) is example code that generates the error in PySpark 2.1.
* Please forgive me... I initially inadvertently ran this code on a Spark 2.1 (rather than Spark 2.2) cluster; but I moments ago began a test on a Spark 2.2 cluster (definitely this time). Nonetheless, troubleshooting and investigating the aforementioned error may aid others on Spark 2.1 if my ongoing test yields no error in PySpark 2.2.

Gratefully + Best,
OS

{code:python}
# OTIS SMART: 24.08.2017 (https://issues.apache.org/jira/browse/SPARK-21828)

# -- #
from pyspark import SparkConf, SparkContext

from pyspark.sql import HiveContext
from pyspark.sql.functions import col, lit

from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

from pyspark.mllib.random import RandomRDDs as randrdd


# -- #
def computeDT(df):

    namevAll = df.columns
    namevP = namevAll[:-1]
    namevC = namevAll[-1]

    training_data, testing_data = df.randomSplit([0.70, 0.30])
    print "done: randomSplit"

    variableoutStringIndexer = StringIndexer(inputCol=namevC, outputCol='variableout')
    print "done: StringIndexer"

    variablesinAssembler = VectorAssembler(inputCols=namevP, outputCol='variablesin')
    print "done: VectorAssembler"

    dTree = DecisionTreeClassifier(labelCol='variableout', featuresCol='variablesin', impurity='gini', maxBins=3)
    print "done: DecisionTreeClassifier"

    pipeline = Pipeline(stages=[variableoutStringIndexer, variablesinAssembler, dTree])
    print "done: Pipeline"

    paramgrid = ParamGridBuilder().addGrid(dTree.maxDepth, [3, 7, 8]).build()
    print "done: ParamGridBuilder"

    evaluator = MulticlassClassificationEvaluator(labelCol='variableout', predictionCol='prediction', metricName="f1")
    print "done: MulticlassClassificationEvaluator"

    CrossValidator(estimator=pipeline, estimatorParamMaps=paramgrid, evaluator=evaluator, numFolds=10).fit(training_data)
    print "done: CrossValidatorModel"

    return True


# -- #
def main():

    numcols = 1234
    numrows = 23456

    x = randrdd.normalVectorRDD(hiveContext, numrows, numcols, seed=1)
    y = randrdd.normalVectorRDD(hiveContext, numrows, numcols, seed=2)

    dfx = HiveContext.createDataFrame(hiveContext, x.map(lambda v: tuple([float(ii) for ii in v])).collect(), ["v{0:0>4}".format(jj) for jj in range(0, numcols)])

    dfy = HiveContext.createDataFrame(hiveContext, y.map(lambda v: tuple([float(ii) for ii in v])).collect(), ["v{0:0>4}".format(jj) for jj in range(0, numcols)])

    dfx = dfx.withColumn("v{0:0>4}".format(numcols), dfx["v{0:0>4}".format(numcols - 1)] + 5).withColumn("V", lit('a'))
    dfy = dfy.withColumn("v{0:0>4}".format(numcols), dfy["v{0:0>4}".format(numcols - 1)] - 5).withColumn("V", lit('b'))

    df = dfx.union(dfy)

    df.cache()

    df.printSchema()
    df.show(n=10, truncate=False)

    computeDT(df)

    df.unpersist()

    return True


# -- # CONFIGURE SPARK CONTEXT THEN RUN 'MAIN' ANALYSIS # -- #
if __name__ == "__main__":

    conf = SparkConf().setAppName("TroubleshootPyspark.ASFJira2
[jira] [Commented] (SPARK-21828) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB...again
[ https://issues.apache.org/jira/browse/SPARK-21828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16141138#comment-16141138 ]

Otis Smart commented on SPARK-21828:
------------------------------------
[jira] [Updated] (SPARK-21828) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB...again
[ https://issues.apache.org/jira/browse/SPARK-21828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Smart updated SPARK-21828: --- Affects Version/s: (was: 2.2.0) 2.1.0
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering"
> grows beyond 64 KB...again
> -
>
> Key: SPARK-21828
> URL: https://issues.apache.org/jira/browse/SPARK-21828
> Project: Spark
> Issue Type: Bug
> Components: ML, SQL
> Affects Versions: 2.1.0
> Reporter: Otis Smart
> Priority: Critical
>
> Hello!
> 1. I encountered a similar issue (see below text) on PySpark 2.2 (e.g., a dataframe with ~5 rows x 1100+ columns as input to the ".fit()" method of CrossValidator() that includes a Pipeline() with StringIndexer(), VectorAssembler() and DecisionTreeClassifier()).
> 2. Was the aforementioned patch (i.e. the fix in https://github.com/apache/spark/pull/15480) not included in the latest release? What is the reason for, and the solution to, this persistent issue?
> py4j.protocol.Py4JJavaError: An error occurred while calling o9396.fit.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 38 in stage 18.0 failed 4 times, most recent failure: Lost task 38.3 in stage 18.0 (TID 1996, ip-10-0-14-83.ec2.internal, executor 4): java.util.concurrent.ExecutionException: java.lang.Exception: failed to compile: org.codehaus.janino.JaninoRuntimeException: Code of method "compare(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB
> {code}
> /* 001 */ public SpecificOrdering generate(Object[] references) {
> /* 002 */   return new SpecificOrdering(references);
> /* 003 */ }
> /* 004 */
> /* 005 */ class SpecificOrdering extends org.apache.spark.sql.catalyst.expressions.codegen.BaseOrdering {
> /* 006 */
> /* 007 */   private Object[] references;
> /* 008 */
> /* 009 */
> /* 010 */   public SpecificOrdering(Object[] references) {
> /* 011 */     this.references = references;
> /* 012 */
> /* 013 */   }
> /* 014 */
> /* 015 */
> /* 016 */
> /* 017 */   public int compare(InternalRow a, InternalRow b) {
> /* 018 */     InternalRow i = null;  // Holds current row being evaluated.
> /* 019 */
> /* 020 */     i = a;
> /* 021 */     boolean isNullA;
> /* 022 */     double primitiveA;
> /* 023 */     {
> /* 024 */
> /* 025 */       double value = i.getDouble(0);
> /* 026 */       isNullA = false;
> /* 027 */       primitiveA = value;
> /* 028 */     }
> /* 029 */     i = b;
> /* 030 */     boolean isNullB;
> /* 031 */     double primitiveB;
> /* 032 */     {
> /* 033 */
> /* 034 */       double value = i.getDouble(0);
> /* 035 */       isNullB = false;
> /* 036 */       primitiveB = value;
> /* 037 */     }
> /* 038 */     if (isNullA && isNullB) {
> /* 039 */       // Nothing
> /* 040 */     } else if (isNullA) {
> /* 041 */       return -1;
> /* 042 */     } else if (isNullB) {
> /* 043 */       return 1;
> /* 044 */     } else {
> /* 045 */       int comp = org.apache.spark.util.Utils.nanSafeCompareDoubles(primitiveA, primitiveB);
> /* 046 */       if (comp != 0) {
> /* 047 */         return comp;
> /* 048 */       }
> /* 049 */     }
> /* 050 */
> /* 051 */
> {code}
> ...
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
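The listing above shows why wide inputs hit this error: SpecificOrdering.compare emits one fixed-shape comparison block (isNull flag, primitive load, nanSafeCompareDoubles call) per sort field, so the generated method grows linearly with the number of columns until it crosses the JVM's 64 KB per-method bytecode limit. The sketch below illustrates that arithmetic only; the per-field byte cost is an assumed, illustrative figure, not a measured Janino bytecode size.

```python
# Back-of-envelope model of per-column codegen growth.
# BYTES_PER_FIELD_BLOCK and FIXED_OVERHEAD are assumptions for illustration.
JVM_METHOD_LIMIT = 64 * 1024      # the JVM rejects methods over 64 KB of bytecode
BYTES_PER_FIELD_BLOCK = 120       # assumed cost of one per-field compare block
FIXED_OVERHEAD = 200              # assumed method prologue/epilogue cost


def generated_compare_size(num_fields):
    """Estimated size of SpecificOrdering.compare for num_fields sort fields."""
    return FIXED_OVERHEAD + num_fields * BYTES_PER_FIELD_BLOCK


def max_safe_fields():
    """Largest field count whose estimated compare() still fits in one method."""
    n = 0
    while generated_compare_size(n + 1) <= JVM_METHOD_LIMIT:
        n += 1
    return n

# Under these assumptions only a few hundred fields fit in a single method,
# which is consistent with the reporter's 1100+ column dataframe failing.
```

This is also why fixes in this area (such as the PR referenced in the description) split the generated comparisons across multiple methods rather than shrinking each block.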
[jira] [Updated] (SPARK-21828) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB...again
[ https://issues.apache.org/jira/browse/SPARK-21828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Smart updated SPARK-21828: --- Target Version/s: 2.2.0, 2.1.0 (was: 2.2.0)
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering"
> grows beyond 64 KB...again
> -
>
> Key: SPARK-21828
> URL: https://issues.apache.org/jira/browse/SPARK-21828
> Project: Spark
> Issue Type: Bug
> Components: ML, SQL
> Affects Versions: 2.1.0
> Reporter: Otis Smart
> Priority: Critical
>
> Hello!
> 1. I encountered a similar issue (see below text) on PySpark 2.2 (e.g., a dataframe with ~5 rows x 1100+ columns as input to the ".fit()" method of CrossValidator() that includes a Pipeline() with StringIndexer(), VectorAssembler() and DecisionTreeClassifier()).
> 2. Was the aforementioned patch (i.e. the fix in https://github.com/apache/spark/pull/15480) not included in the latest release? What is the reason for, and the solution to, this persistent issue?
> py4j.protocol.Py4JJavaError: An error occurred while calling o9396.fit.
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18085) SPIP: Better History Server scalability for many / large applications
[ https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16141091#comment-16141091 ] jincheng commented on SPARK-18085: -- It is a really good idea to speed up the History Server with LevelDB. I ran into a problem with the following exception:
{code:java}
java.lang.IndexOutOfBoundsException: Page 1 is out of range. Please select a page number between 1 and 0.
	at org.apache.spark.ui.PagedDataSource.pageData(PagedTable.scala:56)
	at org.apache.spark.ui.PagedTable$class.table(PagedTable.scala:108)
	at org.apache.spark.ui.jobs.TaskPagedTable.table(StagePage.scala:703)
	at org.apache.spark.ui.jobs.StagePage.liftedTree1$1(StagePage.scala:296)
	at org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:285)
	at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82)
	at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82)
	at org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:88)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:687)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
	at org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:845)
	at org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1689)
	at org.apache.spark.deploy.history.ApplicationCacheCheckFilter.doFilter(ApplicationCache.scala:438)
	at org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1676)
	at org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:581)
	at org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
	at org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:511)
	at org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
	at org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
	at org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
	at org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
	at org.spark_project.jetty.server.Server.handle(Server.java:524)
	at org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:319)
	at org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:253)
	at org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
	at org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:95)
	at org.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
	at org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
	at org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
	at org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
	at org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
	at org.spark_project.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
	at java.lang.Thread.run(Thread.java:748)
{code}
It occurs when clicking into a stage that has no successful tasks.
> SPIP: Better History Server scalability for many / large applications
> -
>
> Key: SPARK-18085
> URL: https://issues.apache.org/jira/browse/SPARK-18085
> Project: Spark
> Issue Type: Umbrella
> Components: Spark Core, Web UI
> Affects Versions: 2.0.0
> Reporter: Marcelo Vanzin
> Labels: SPIP
> Attachments: spark_hs_next_gen.pdf
>
> It's a known fact that the History Server currently has some annoying issues when serving lots of applications, and when serving large applications.
> I'm filing this umbrella to track work related to addressing those issues.
> I'll be attaching a document shortly describing the issues and suggesting a path to solving them.
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
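The exception in the comment above points at the page-bounds check in PagedDataSource: an empty task table yields a computed page count of 0, while the UI still requests page 1, so the check can never pass. A minimal Python re-creation of that off-by-one (a hypothetical simplification of the logic in PagedTable.scala, not the actual Spark code):

```python
import math


def page_data(data, page_size, page):
    """Simplified model of a paged data source's bounds check.

    Mirrors the failure mode in the report: with zero rows, total_pages
    computes to 0, so the default request for page 1 is always "out of range".
    """
    total_pages = math.ceil(len(data) / page_size)  # 0 when the table is empty
    if page <= 0 or page > total_pages:
        raise IndexError(
            "Page %d is out of range. Please select a page number between 1 and %d."
            % (page, total_pages))
    start = (page - 1) * page_size
    return data[start:start + page_size]

# A stage with no successful tasks renders an empty table, and page=1 raises
# exactly the "between 1 and 0" message seen in the stack trace.
```

A natural fix in this model is to special-case the empty table (render nothing) before validating the requested page number.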
[jira] [Commented] (SPARK-21829) Enable config to permanently blacklist a list of nodes
[ https://issues.apache.org/jira/browse/SPARK-21829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16141081#comment-16141081 ] Saisai Shao commented on SPARK-21829: - Cross-posting the comment here. Since you're running Spark on YARN, I think node labels should address your scenario well. The changes you made in BlacklistTracker seem to break the design purpose of the blacklist. The blacklist in Spark, as in MR/Tez, assumes bad nodes/executors will return to normal within several hours, so it always has a timeout. In your case the problem is not bad nodes/executors; it is that you don't want to start executors on some nodes (like slow nodes). This is more a cluster-manager problem than a Spark problem. To summarize, you want your Spark application to run on specific nodes. For YARN you could use node labels, which Spark on YARN already supports; you can look up the node-label documentation for details. For standalone, simply do not start workers on the nodes you want to avoid. For Mesos I'm not sure, but I guess it has similar mechanisms.
> Enable config to permanently blacklist a list of nodes
> --
>
> Key: SPARK-21829
> URL: https://issues.apache.org/jira/browse/SPARK-21829
> Project: Spark
> Issue Type: New Feature
> Components: Scheduler, Spark Core
> Affects Versions: 2.1.1, 2.2.0
> Reporter: Luca Canali
> Priority: Minor
>
> The idea for this proposal comes from a performance incident in a local cluster where a job was found to be very slow because of a long tail of stragglers due to 2 nodes in the cluster being slow to access a remote filesystem.
> The issue was limited to the 2 machines and was related to external configuration: the 2 machines that performed badly when accessing the remote file system were behaving normally for other jobs in the cluster (a shared YARN cluster).
> With this new feature I propose to introduce a mechanism to allow users to > specify a list of nodes in the cluster where executors/tasks should not run > for a specific job. > The proposed implementation that I tested (see PR) uses the Spark blacklist > mechanism. With the parameter spark.blacklist.alwaysBlacklistedNodes, a list > of user-specified nodes is added to the blacklist at the start of the Spark > Context and it is never expired. > I have tested this on a YARN cluster on a case taken from the original > production problem and I confirm a performance improvement of about 5x for > the specific test case I have. I imagine that there can be other cases where > Spark users may want to blacklist a set of nodes. This can be used for > troubleshooting, including cases where certain nodes/executors are slow for a > given workload and this is caused by external agents, so the anomaly is not > picked up by the cluster manager. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
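The proposal above boils down to one behavioral change: nodes listed in the proposed spark.blacklist.alwaysBlacklistedNodes parameter carry no expiry, while failure-driven blacklist entries keep the usual timeout. A toy sketch of that distinction (hypothetical class and method names; the real BlacklistTracker logic is more involved):

```python
import time


class BlacklistSketch:
    """Toy model: permanent (user-listed) vs. timed (failure-driven) node blacklisting."""

    def __init__(self, always_blacklisted, timeout_s):
        self.timeout_s = timeout_s
        # None means "never expires" -- the proposed permanent entries.
        self.expiry = {node: None for node in always_blacklisted}

    def blacklist(self, node, now=None):
        """Blacklist a node after failures, with the normal timeout."""
        now = time.time() if now is None else now
        if node not in self.expiry:  # never downgrade a permanent entry
            self.expiry[node] = now + self.timeout_s

    def is_blacklisted(self, node, now=None):
        now = time.time() if now is None else now
        exp = self.expiry.get(node, 0)
        return exp is None or exp > now
```

The design question raised in the comment above is exactly whether this "never expires" case belongs in Spark's blacklist at all, or in the cluster manager (YARN node labels).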
[jira] [Comment Edited] (SPARK-21822) When insert Hive Table is finished, it is better to clean out the tmpLocation dir
[ https://issues.apache.org/jira/browse/SPARK-21822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16141030#comment-16141030 ] lufei edited comment on SPARK-21822 at 8/25/17 1:26 AM: I'm sorry for closing this issue by mistake, so I have reopened it. was (Author: figo): I close this issue by mistake. > When insert Hive Table is finished, it is better to clean out the tmpLocation > dir > - > > Key: SPARK-21822 > URL: https://issues.apache.org/jira/browse/SPARK-21822 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: lufei >Priority: Minor > > When insert Hive Table is finished, it is better to clean out the tmpLocation > dir(the temp directories like > ".hive-staging_hive_2017-08-19_10-56-01_540_5448395226195533570-9/-ext-1" > or "/tmp/hive/..." for an old spark version). > Otherwise, when lots of spark job are executed, millions of temporary > directories are left in HDFS. And these temporary directories can only be > deleted by the maintainer through the shell script. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-21822) When insert Hive Table is finished, it is better to clean out the tmpLocation dir
[ https://issues.apache.org/jira/browse/SPARK-21822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lufei reopened SPARK-21822: --- I closed this issue by mistake. > When insert Hive Table is finished, it is better to clean out the tmpLocation > dir > - > > Key: SPARK-21822 > URL: https://issues.apache.org/jira/browse/SPARK-21822 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: lufei >Priority: Minor > > When insert Hive Table is finished, it is better to clean out the tmpLocation > dir(the temp directories like > ".hive-staging_hive_2017-08-19_10-56-01_540_5448395226195533570-9/-ext-1" > or "/tmp/hive/..." for an old spark version). > Otherwise, when lots of spark job are executed, millions of temporary > directories are left in HDFS. And these temporary directories can only be > deleted by the maintainer through the shell script. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21822) When insert Hive Table is finished, it is better to clean out the tmpLocation dir
[ https://issues.apache.org/jira/browse/SPARK-21822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16141026#comment-16141026 ] lufei commented on SPARK-21822: --- [~sowen] I'm sorry, I didn't understand your meaning before; I have now submitted a pull request (https://github.com/apache/spark/pull/19035). Thank you for your time. > When insert Hive Table is finished, it is better to clean out the tmpLocation > dir > - > > Key: SPARK-21822 > URL: https://issues.apache.org/jira/browse/SPARK-21822 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: lufei >Priority: Minor > > When insert Hive Table is finished, it is better to clean out the tmpLocation > dir(the temp directories like > ".hive-staging_hive_2017-08-19_10-56-01_540_5448395226195533570-9/-ext-1" > or "/tmp/hive/..." for an old spark version). > Otherwise, when lots of spark job are executed, millions of temporary > directories are left in HDFS. And these temporary directories can only be > deleted by the maintainer through the shell script. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-21822) When insert Hive Table is finished, it is better to clean out the tmpLocation dir
[ https://issues.apache.org/jira/browse/SPARK-21822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lufei updated SPARK-21822: -- Comment: was deleted (was: [~sowen]Ok,I got it.I will close this issue immediately,thanks.) > When insert Hive Table is finished, it is better to clean out the tmpLocation > dir > - > > Key: SPARK-21822 > URL: https://issues.apache.org/jira/browse/SPARK-21822 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: lufei >Priority: Minor > > When insert Hive Table is finished, it is better to clean out the tmpLocation > dir(the temp directories like > ".hive-staging_hive_2017-08-19_10-56-01_540_5448395226195533570-9/-ext-1" > or "/tmp/hive/..." for an old spark version). > Otherwise, when lots of spark job are executed, millions of temporary > directories are left in HDFS. And these temporary directories can only be > deleted by the maintainer through the shell script. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19571) tests are failing to run on Windows with another instance Derby error with Hadoop 2.6.5
[ https://issues.apache.org/jira/browse/SPARK-19571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16141014#comment-16141014 ] Hyukjin Kwon commented on SPARK-19571: -- Sure, thanks! > tests are failing to run on Windows with another instance Derby error with > Hadoop 2.6.5 > --- > > Key: SPARK-19571 > URL: https://issues.apache.org/jira/browse/SPARK-19571 > Project: Spark > Issue Type: Bug > Components: SparkR, SQL >Affects Versions: 2.2.0 >Reporter: Felix Cheung >Assignee: Hyukjin Kwon > Fix For: 2.2.0 > > > Between > https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/751-master > https://github.com/apache/spark/commit/7a7ce272fe9a703f58b0180a9d2001ecb5c4b8db > And > https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/758-master > https://github.com/apache/spark/commit/c618ccdbe9ac103dfa3182346e2a14a1e7fca91a > Something is changed (not likely caused by R) such that tests running on > Windows are consistently failing with > {code} > Caused by: ERROR XSDB6: Another instance of Derby may have already booted the > database > C:\Users\appveyor\AppData\Local\Temp\1\spark-75266bb9-bd54-4ee2-ae54-2122d2c011e8\metastore. 
> at org.apache.derby.iapi.error.StandardException.newException(Unknown > Source) > at org.apache.derby.iapi.error.StandardException.newException(Unknown > Source) > at > org.apache.derby.impl.store.raw.data.BaseDataFileFactory.privGetJBMSLockOnDB(Unknown > Source) > at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.run(Unknown > Source) > at java.security.AccessController.doPrivileged(Native Method) > at > org.apache.derby.impl.store.raw.data.BaseDataFileFactory.getJBMSLockOnDB(Unknown > Source) > at > org.apache.derby.impl.store.raw.data.BaseDataFileFactory.boot(Unknown Source) > at org.apache.derby.impl.services.monitor.BaseMonitor.boot(Unknown > Source) > at org.apache.derby.impl.services.monitor.TopService.bootModule(Unknown > Source) > {code} > Since we run appveyor only when there is R changes, it is a bit harder to > track down which change specifically caused this. > We also can't run appveyor on branch-2.1, so it could also be broken there. > This could be a blocker, since it could fail tests for the R release. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21834) Incorrect executor request in case of dynamic allocation
Sital Kedia created SPARK-21834: --- Summary: Incorrect executor request in case of dynamic allocation Key: SPARK-21834 URL: https://issues.apache.org/jira/browse/SPARK-21834 Project: Spark Issue Type: Bug Components: Scheduler Affects Versions: 2.2.0 Reporter: Sital Kedia The killExecutor API currently does not allow killing an executor without updating the total number of executors needed. When dynamic allocation is turned on and the allocator tries to kill an executor, the scheduler also reduces the total number of executors needed (see https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L635), which is incorrect because the allocator already takes care of setting the required number of executors itself. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
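The report above is easiest to see as a bookkeeping bug: the dynamic-allocation manager has already lowered its own executor target before asking the scheduler to kill, and the scheduler then lowers the total again, leaving the application one executor short per kill. A toy model of that double decrement (hypothetical names; not the actual CoarseGrainedSchedulerBackend code):

```python
class SchedulerModel:
    """Toy model of the double-decrement described in SPARK-21834."""

    def __init__(self, target):
        self.target = target  # total executors the scheduler will request

    def kill_executor(self, adjust_target):
        # The reported bug: the kill path unconditionally lowers the target
        # even when the dynamic-allocation manager has already accounted
        # for the kill in the target it communicated.
        if adjust_target:
            self.target -= 1


# The allocator decides it now wants 9 executors (down from 10) and kills one.
sched = SchedulerModel(target=9)
sched.kill_executor(adjust_target=True)   # buggy: decrements again
buggy_total = sched.target                # ends up requesting only 8

# The fix the ticket implies: allow killing without touching the target.
sched2 = SchedulerModel(target=9)
sched2.kill_executor(adjust_target=False)
fixed_total = sched2.target               # stays at the allocator's 9
```

The `adjust_target` flag stands in for the kind of parameter a fixed killExecutor API would need so callers that manage the target themselves can opt out of the decrement.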
[jira] [Updated] (SPARK-19571) tests are failing to run on Windows with another instance Derby error with Hadoop 2.6.5
[ https://issues.apache.org/jira/browse/SPARK-19571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-19571: - Fix Version/s: 2.2.0 > tests are failing to run on Windows with another instance Derby error with > Hadoop 2.6.5 > --- > > Key: SPARK-19571 > URL: https://issues.apache.org/jira/browse/SPARK-19571 > Project: Spark > Issue Type: Bug > Components: SparkR, SQL >Affects Versions: 2.2.0 >Reporter: Felix Cheung >Assignee: Hyukjin Kwon > Fix For: 2.2.0 > > > Between > https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/751-master > https://github.com/apache/spark/commit/7a7ce272fe9a703f58b0180a9d2001ecb5c4b8db > And > https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/758-master > https://github.com/apache/spark/commit/c618ccdbe9ac103dfa3182346e2a14a1e7fca91a > Something is changed (not likely caused by R) such that tests running on > Windows are consistently failing with > {code} > Caused by: ERROR XSDB6: Another instance of Derby may have already booted the > database > C:\Users\appveyor\AppData\Local\Temp\1\spark-75266bb9-bd54-4ee2-ae54-2122d2c011e8\metastore. 
> at org.apache.derby.iapi.error.StandardException.newException(Unknown > Source) > at org.apache.derby.iapi.error.StandardException.newException(Unknown > Source) > at > org.apache.derby.impl.store.raw.data.BaseDataFileFactory.privGetJBMSLockOnDB(Unknown > Source) > at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.run(Unknown > Source) > at java.security.AccessController.doPrivileged(Native Method) > at > org.apache.derby.impl.store.raw.data.BaseDataFileFactory.getJBMSLockOnDB(Unknown > Source) > at > org.apache.derby.impl.store.raw.data.BaseDataFileFactory.boot(Unknown Source) > at org.apache.derby.impl.services.monitor.BaseMonitor.boot(Unknown > Source) > at org.apache.derby.impl.services.monitor.TopService.bootModule(Unknown > Source) > {code} > Since we run appveyor only when there is R changes, it is a bit harder to > track down which change specifically caused this. > We also can't run appveyor on branch-2.1, so it could also be broken there. > This could be a blocker, since it could fail tests for the R release. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-21822) When insert Hive Table is finished, it is better to clean out the tmpLocation dir
[ https://issues.apache.org/jira/browse/SPARK-21822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lufei closed SPARK-21822. - Resolution: Invalid > When insert Hive Table is finished, it is better to clean out the tmpLocation > dir > - > > Key: SPARK-21822 > URL: https://issues.apache.org/jira/browse/SPARK-21822 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: lufei >Priority: Minor > > When insert Hive Table is finished, it is better to clean out the tmpLocation > dir(the temp directories like > ".hive-staging_hive_2017-08-19_10-56-01_540_5448395226195533570-9/-ext-1" > or "/tmp/hive/..." for an old spark version). > Otherwise, when lots of spark job are executed, millions of temporary > directories are left in HDFS. And these temporary directories can only be > deleted by the maintainer through the shell script. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21822) When insert Hive Table is finished, it is better to clean out the tmpLocation dir
[ https://issues.apache.org/jira/browse/SPARK-21822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16140983#comment-16140983 ] lufei commented on SPARK-21822: --- [~sowen]Ok,I got it.I will close this issue immediately,thanks. > When insert Hive Table is finished, it is better to clean out the tmpLocation > dir > - > > Key: SPARK-21822 > URL: https://issues.apache.org/jira/browse/SPARK-21822 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: lufei >Priority: Minor > > When insert Hive Table is finished, it is better to clean out the tmpLocation > dir(the temp directories like > ".hive-staging_hive_2017-08-19_10-56-01_540_5448395226195533570-9/-ext-1" > or "/tmp/hive/..." for an old spark version). > Otherwise, when lots of spark job are executed, millions of temporary > directories are left in HDFS. And these temporary directories can only be > deleted by the maintainer through the shell script. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21830) Bump the dependency of ANTLR to version 4.7
[ https://issues.apache.org/jira/browse/SPARK-21830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-21830. - Resolution: Fixed Fix Version/s: 2.3.0 > Bump the dependency of ANTLR to version 4.7 > --- > > Key: SPARK-21830 > URL: https://issues.apache.org/jira/browse/SPARK-21830 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Herman van Hovell >Assignee: Herman van Hovell > Fix For: 2.3.0 > > > There are a few minor issues with the current ANTLR version. Version 4.7 > fixes most of these. I'd like to bump the version to that one. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21830) Bump the dependency of ANTLR to version 4.7
[ https://issues.apache.org/jira/browse/SPARK-21830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16140939#comment-16140939 ] Xiao Li commented on SPARK-21830: - https://github.com/apache/spark/pull/19042 > Bump the dependency of ANTLR to version 4.7 > --- > > Key: SPARK-21830 > URL: https://issues.apache.org/jira/browse/SPARK-21830 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Herman van Hovell >Assignee: Herman van Hovell > Fix For: 2.3.0 > > > There are a few minor issues with the current ANTLR version. Version 4.7 > fixes most of these. I'd like to bump the version to that one. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21798) No config to replace deprecated SPARK_CLASSPATH config for launching daemons like History Server
[ https://issues.apache.org/jira/browse/SPARK-21798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16140833#comment-16140833 ] Parth Gandhi commented on SPARK-21798: -- Filed a pull request for this ticket: https://github.com/apache/spark/pull/19047
> No config to replace deprecated SPARK_CLASSPATH config for launching daemons
> like History Server
>
> Key: SPARK-21798
> URL: https://issues.apache.org/jira/browse/SPARK-21798
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.2.0
> Reporter: Sanket Reddy
> Priority: Minor
>
> History Server launch uses SparkClassCommandBuilder for launching the server. It is observed that SPARK_CLASSPATH has been removed and deprecated. For spark-submit this takes a different route, and spark.driver.extraClasspath takes care of specifying additional jars on the classpath that were previously specified in SPARK_CLASSPATH. Right now the only way to specify additional jars for launching daemons such as the History Server is SPARK_DIST_CLASSPATH (https://spark.apache.org/docs/latest/hadoop-provided.html), but this, I presume, is a distribution classpath. It would be nice to have a config similar to spark.driver.extraClasspath for launching daemons such as the History Server.
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21701) Add TCP send/rcv buffer size support for RPC client
[ https://issues.apache.org/jira/browse/SPARK-21701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu reassigned SPARK-21701: Assignee: Xu Zhang > Add TCP send/rcv buffer size support for RPC client > --- > > Key: SPARK-21701 > URL: https://issues.apache.org/jira/browse/SPARK-21701 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Xu Zhang >Assignee: Xu Zhang >Priority: Trivial > Fix For: 2.3.0 > > > In the TransportClientFactory class there are no parameters derived from SparkConf to set ChannelOption.SO_RCVBUF and ChannelOption.SO_SNDBUF for netty. Increasing the receive buffer size can improve I/O performance for high-volume transport.
[jira] [Resolved] (SPARK-21701) Add TCP send/rcv buffer size support for RPC client
[ https://issues.apache.org/jira/browse/SPARK-21701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-21701. -- Resolution: Fixed Fix Version/s: 2.3.0 > Add TCP send/rcv buffer size support for RPC client > --- > > Key: SPARK-21701 > URL: https://issues.apache.org/jira/browse/SPARK-21701 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Xu Zhang >Priority: Trivial > Fix For: 2.3.0 > > > In the TransportClientFactory class there are no parameters derived from SparkConf to set ChannelOption.SO_RCVBUF and ChannelOption.SO_SNDBUF for netty. Increasing the receive buffer size can improve I/O performance for high-volume transport.
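The options in question are the standard kernel socket buffers that netty exposes as ChannelOption.SO_RCVBUF / ChannelOption.SO_SNDBUF. As an illustration of what the new Spark settings tune, here is the same pair of options set on a plain TCP socket in Python (the buffer sizes are arbitrary example values, and the kernel may round or cap what you ask for):

```python
import socket


def make_tuned_socket(rcv_buf=1 << 20, snd_buf=1 << 20):
    """Create a TCP socket with explicit receive/send buffer sizes,
    analogous to netty's ChannelOption.SO_RCVBUF / SO_SNDBUF."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, rcv_buf)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, snd_buf)
    return s


s = make_tuned_socket()
# Observe the size the kernel actually granted (Linux typically doubles
# the requested value and caps it at net.core.rmem_max).
print(s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))
s.close()
```

Larger receive buffers mainly help high-bandwidth, high-latency links, which matches the issue's rationale of improving I/O for high-volume transport.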
[jira] [Commented] (SPARK-18355) Spark SQL fails to read data from a ORC hive table that has a new column added to it
[ https://issues.apache.org/jira/browse/SPARK-18355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16140552#comment-16140552 ] Dongjoon Hyun commented on SPARK-18355: --- This will be resolved via Apache ORC 1.4.0. > Spark SQL fails to read data from a ORC hive table that has a new column > added to it > > > Key: SPARK-18355 > URL: https://issues.apache.org/jira/browse/SPARK-18355 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.2, 2.1.0 > Environment: Centos6 >Reporter: Sandeep Nemuri > > *PROBLEM*: > Spark SQL fails to read data from a ORC hive table that has a new column > added to it. > Below is the exception: > {code} > scala> sqlContext.sql("select click_id,search_id from testorc").show > 16/11/03 22:17:53 INFO ParseDriver: Parsing command: select > click_id,search_id from testorc > 16/11/03 22:17:54 INFO ParseDriver: Parse Completed > java.lang.AssertionError: assertion failed > at scala.Predef$.assert(Predef.scala:165) > at > org.apache.spark.sql.execution.datasources.LogicalRelation$$anonfun$1.apply(LogicalRelation.scala:39) > at > org.apache.spark.sql.execution.datasources.LogicalRelation$$anonfun$1.apply(LogicalRelation.scala:38) > at scala.Option.map(Option.scala:145) > at > org.apache.spark.sql.execution.datasources.LogicalRelation.(LogicalRelation.scala:38) > at > org.apache.spark.sql.execution.datasources.LogicalRelation.copy(LogicalRelation.scala:31) > at > org.apache.spark.sql.hive.HiveMetastoreCatalog.org$apache$spark$sql$hive$HiveMetastoreCatalog$$convertToOrcRelation(HiveMetastoreCatalog.scala:588) > at > org.apache.spark.sql.hive.HiveMetastoreCatalog$OrcConversions$$anonfun$apply$2.applyOrElse(HiveMetastoreCatalog.scala:647) > at > org.apache.spark.sql.hive.HiveMetastoreCatalog$OrcConversions$$anonfun$apply$2.applyOrElse(HiveMetastoreCatalog.scala:643) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:335) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:335) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:334) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:332) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:332) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:281) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > {code} > *STEPS TO SIMULATE THIS ISSUE*: > 1) Create table in hive. > {code} > CREATE TABLE `testorc`( > `click_id` string, > `search_id` string, > `uid` bigint) > PARTITIONED BY ( > `ts` string, > `hour` string) > STORED AS ORC; > {code} > 2) Load data into table : > {code} > INSERT INTO TABLE testorc PARTITION (ts = '98765',hour = '01' ) VALUES > (12,2,12345); > {code} > 3) Select through spark shell (This works) > {code} > sqlContext.sql("select click_id,search_id from testorc").show > {code} > 4) Now add column to hive table > {code} > ALTER TABLE testorc ADD COLUMNS (dummy string); > {code} > 5) Now again select from spark shell > {code} > scala> sqlContext.sql("select click_id,search_id from testorc").show > 16/11/03 22:17:53 INFO ParseDriver: Parsing command: select > click_id,search_id from testorc > 16/11/03 22:17:54 INFO ParseDriver: Parse Completed > java.lang.AssertionError: assertion failed > at scala.Predef$.assert(Predef.scala:165) > at > org.apache.spark.sql.execution.datasources.LogicalRelation$$anonfun$1.apply(LogicalRelation.scala:39) > at > org.apache.spark.sql.execution.datasources.LogicalRelation$$anonfun$1.apply(LogicalRelation.scala:38) > at scala.Option.map(Option.scala:145) > at > org.apache.spark.sql.execution.datasources.LogicalRelation.(LogicalRelation.scala:38) > at > 
org.apache.spark.sql.execution.datasources.LogicalRelation.copy(LogicalRelation.scala:31) > at > org.apache.spark.sql.hive.HiveMetastoreCatalog.org$apache$spark$sql$hive$HiveMetastoreCatalog$$convertToOrcRelation(HiveMetastoreCatalog.scala:588) > at > org.apache.spark.sql.hive.HiveMetastoreCatalog$OrcConversions$$anonfun$apply$2.applyOrElse(HiveMetastoreCatalog.scala:647) > at > org.apache.spark.sql.hive.HiveMetastoreCatalog$OrcConversions$$anonfun$apply$2.applyOrElse(HiveMetastoreCatalog.scala:643) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:335) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1
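The underlying difficulty is ORC schema evolution: files written before the ALTER TABLE lack the new column, so the reader must reconcile the file schema with the longer table schema instead of asserting that they match. A toy sketch of that positional reconciliation (this is an illustration, not Spark's or ORC's actual code):

```python
def read_row(file_schema, table_schema, row):
    """Pad a row from an old ORC file with nulls for columns that were
    added to the table after the file was written (positional merge)."""
    if file_schema != table_schema[:len(file_schema)]:
        raise ValueError("file schema is not a prefix of the table schema")
    return row + [None] * (len(table_schema) - len(file_schema))


# File written before `dummy` was added, read against the new table schema.
row = read_row(["click_id", "search_id", "uid"],
               ["click_id", "search_id", "uid", "dummy"],
               ["12", "2", 12345])
print(row)  # ['12', '2', 12345, None]
```

As the comment above notes, the fix came via Apache ORC 1.4.0, whose reader performs this kind of schema evolution natively rather than failing on the mismatch.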
[jira] [Closed] (SPARK-21833) CoarseGrainedSchedulerBackend leaks executors in case of dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-21833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sital Kedia closed SPARK-21833. --- Resolution: Duplicate Duplicate of SPARK-20540 > CoarseGrainedSchedulerBackend leaks executors in case of dynamic allocation > --- > > Key: SPARK-21833 > URL: https://issues.apache.org/jira/browse/SPARK-21833 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.2.0 >Reporter: Sital Kedia > > When dynamic executor allocation is enabled, the coarse-grained scheduler asks for more executors than needed. Consider the situation where the executor allocation manager is ramping down the number of executors: it lowers the executor target by calling the requestTotalExecutors API. Later, when the allocation manager finds some executors idle, it calls the killExecutor API. In killExecutor, the coarse-grained scheduler resets the total number of executors needed to current + pending, which overrides the earlier target set by the allocation manager. This results in the scheduler spawning more executors than actually needed.
[jira] [Updated] (SPARK-21833) CoarseGrainedSchedulerBackend leaks executors in case of dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-21833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sital Kedia updated SPARK-21833: Description: When dynamic executor allocation is enabled, the coarse-grained scheduler asks for more executors than needed. Consider the situation where the executor allocation manager is ramping down the number of executors: it lowers the executor target by calling the requestTotalExecutors API. Later, when the allocation manager finds some executors idle, it calls the killExecutor API. In killExecutor, the coarse-grained scheduler resets the total number of executors needed to current + pending, which overrides the earlier target set by the allocation manager. This results in the scheduler spawning more executors than actually needed. was: When dynamic executor allocation is enabled, the coarse-grained scheduler asks for more executors than needed. Consider the situation where the executor allocation manager is ramping down the number of executors: it lowers the executor target by calling the requestTotalExecutors API (see https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L326). Later, when the allocation manager finds some executors idle, it calls the killExecutor API (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L447). In killExecutor, the coarse-grained scheduler resets the total number of executors needed to current + pending (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L523), which overrides the earlier target set by the allocation manager. This results in the scheduler spawning more executors than actually needed.
> CoarseGrainedSchedulerBackend leaks executors in case of dynamic allocation > --- > > Key: SPARK-21833 > URL: https://issues.apache.org/jira/browse/SPARK-21833 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.2.0 >Reporter: Sital Kedia > > When dynamic executor allocation is enabled, the coarse-grained scheduler asks for more executors than needed. Consider the situation where the executor allocation manager is ramping down the number of executors: it lowers the executor target by calling the requestTotalExecutors API. Later, when the allocation manager finds some executors idle, it calls the killExecutor API. In killExecutor, the coarse-grained scheduler resets the total number of executors needed to current + pending, which overrides the earlier target set by the allocation manager. This results in the scheduler spawning more executors than actually needed.
[jira] [Commented] (SPARK-21833) CoarseGrainedSchedulerBackend leaks executors in case of dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-21833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16140533#comment-16140533 ] Sital Kedia commented on SPARK-21833: - Actually, SPARK-20540 already addressed this issue on the latest trunk. Will close this issue. > CoarseGrainedSchedulerBackend leaks executors in case of dynamic allocation > --- > > Key: SPARK-21833 > URL: https://issues.apache.org/jira/browse/SPARK-21833 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.2.0 >Reporter: Sital Kedia > > When dynamic executor allocation is enabled, the coarse-grained scheduler asks for more executors than needed. Consider the situation where the executor allocation manager is ramping down the number of executors: it lowers the executor target by calling the requestTotalExecutors API. Later, when the allocation manager finds some executors idle, it calls the killExecutor API. In killExecutor, the coarse-grained scheduler resets the total number of executors needed to current + pending, which overrides the earlier target set by the allocation manager. This results in the scheduler spawning more executors than actually needed.
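The interaction the report describes can be modeled in a few lines. The sketch below is a toy model (the class and method names mirror the description, not Spark's real classes): a killExecutor path that recomputes the target as current + pending clobbers the lower target the allocation manager just requested.

```python
class SchedulerModel:
    """Toy model of the target-executor bookkeeping described above."""

    def __init__(self, current, pending=0):
        self.current = current   # executors currently running
        self.pending = pending   # executors requested but not yet up
        self.target = current + pending

    def request_total_executors(self, n):
        # Allocation manager ramps the target down.
        self.target = n

    def kill_executor(self):
        # Bug: the scheduler resets the target from its own view,
        # discarding the lower target set just above.
        self.target = self.current + self.pending
        self.current -= 1


sched = SchedulerModel(current=10)
sched.request_total_executors(4)   # manager wants to shrink to 4
sched.kill_executor()              # an idle executor is killed
print(sched.target)                # back to 10, not 4 -> over-allocation
```

As the comment above notes, SPARK-20540 resolved this on trunk by keeping the externally requested total authoritative instead of recomputing it in the kill path.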
[jira] [Updated] (SPARK-21833) CoarseGrainedSchedulerBackend leaks executors in case of dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-21833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sital Kedia updated SPARK-21833: Description: When dynamic executor allocation is enabled, the coarse-grained scheduler asks for more executors than needed. Consider the situation where the executor allocation manager is ramping down the number of executors: it lowers the executor target by calling the requestTotalExecutors API (see https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L326). Later, when the allocation manager finds some executors idle, it calls the killExecutor API (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L447). In killExecutor, the coarse-grained scheduler resets the total number of executors needed to current + pending (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L523), which overrides the earlier target set by the allocation manager. This results in the scheduler spawning more executors than actually needed. was: When dynamic executor allocation is enabled, the coarse-grained scheduler asks for more executors than needed. Consider the situation where the executor allocation manager is ramping down the number of executors: it lowers the executor target by calling the requestTotalExecutors API (see https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L326). Later, when the allocation manager finds some executors idle, it calls the killExecutor API (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L421).
In killExecutor, the coarse-grained scheduler resets the total number of executors needed to current + pending (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L523), which overrides the earlier target set by the allocation manager. This results in the scheduler spawning more executors than actually needed. > CoarseGrainedSchedulerBackend leaks executors in case of dynamic allocation > --- > > Key: SPARK-21833 > URL: https://issues.apache.org/jira/browse/SPARK-21833 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.2.0 >Reporter: Sital Kedia > > When dynamic executor allocation is enabled, the coarse-grained scheduler asks for more executors than needed. Consider the situation where the executor allocation manager is ramping down the number of executors: it lowers the executor target by calling the requestTotalExecutors API (see > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L326). > Later, when the allocation manager finds some executors idle, it calls the killExecutor API > (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L447). > In killExecutor, the coarse-grained scheduler resets the total number of executors needed to current + pending > (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L523), > which overrides the earlier target set by the allocation manager. > This results in the scheduler spawning more executors than actually needed.
[jira] [Updated] (SPARK-21833) CoarseGrainedSchedulerBackend leaks executors in case of dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-21833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sital Kedia updated SPARK-21833: Description: When dynamic executor allocation is enabled, the coarse-grained scheduler asks for more executors than needed. Consider the situation where the executor allocation manager is ramping down the number of executors: it lowers the executor target by calling the requestTotalExecutors API (see https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L326). Later, when the allocation manager finds some executors idle, it calls the killExecutor API (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L421). In killExecutor, the coarse-grained scheduler resets the total number of executors needed to current + pending (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L523), which overrides the earlier target set by the allocation manager. This results in the scheduler spawning more executors than actually needed. was: When dynamic executor allocation is enabled, the coarse-grained scheduler asks for more executors than needed. Consider the situation where the executor allocation manager is ramping down the number of executors: it lowers the executor target by calling the requestTotalExecutors API (see https://github.com/sitalkedia/spark/blob/master/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L321). Later, when the allocation manager finds some executors idle, it calls the killExecutor API (https://github.com/sitalkedia/spark/blob/master/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L421).
In killExecutor, the coarse-grained scheduler resets the total number of executors needed to current + pending (https://github.com/sitalkedia/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L523), which overrides the earlier target set by the allocation manager. This results in the scheduler spawning more executors than actually needed. > CoarseGrainedSchedulerBackend leaks executors in case of dynamic allocation > --- > > Key: SPARK-21833 > URL: https://issues.apache.org/jira/browse/SPARK-21833 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.2.0 >Reporter: Sital Kedia > > When dynamic executor allocation is enabled, the coarse-grained scheduler asks for more executors than needed. Consider the situation where the executor allocation manager is ramping down the number of executors: it lowers the executor target by calling the requestTotalExecutors API (see > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L326). > Later, when the allocation manager finds some executors idle, it calls the killExecutor API > (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L421). > In killExecutor, the coarse-grained scheduler resets the total number of executors needed to current + pending > (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L523), > which overrides the earlier target set by the allocation manager. > This results in the scheduler spawning more executors than actually needed.
[jira] [Created] (SPARK-21833) CoarseGrainedSchedulerBackend leaks executors in case of dynamic allocation
Sital Kedia created SPARK-21833: --- Summary: CoarseGrainedSchedulerBackend leaks executors in case of dynamic allocation Key: SPARK-21833 URL: https://issues.apache.org/jira/browse/SPARK-21833 Project: Spark Issue Type: Bug Components: Scheduler Affects Versions: 2.2.0 Reporter: Sital Kedia When dynamic executor allocation is enabled, the coarse-grained scheduler asks for more executors than needed. Consider the situation where the executor allocation manager is ramping down the number of executors: it lowers the executor target by calling the requestTotalExecutors API (see https://github.com/sitalkedia/spark/blob/master/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L321). Later, when the allocation manager finds some executors idle, it calls the killExecutor API (https://github.com/sitalkedia/spark/blob/master/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L421). In killExecutor, the coarse-grained scheduler resets the total number of executors needed to current + pending (https://github.com/sitalkedia/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L523), which overrides the earlier target set by the allocation manager. This results in the scheduler spawning more executors than actually needed.
[jira] [Created] (SPARK-21832) Merge SQLBuilderTest into ExpressionSQLBuilderSuite
Dongjoon Hyun created SPARK-21832: - Summary: Merge SQLBuilderTest into ExpressionSQLBuilderSuite Key: SPARK-21832 URL: https://issues.apache.org/jira/browse/SPARK-21832 Project: Spark Issue Type: Bug Components: Tests Affects Versions: 2.3.0 Reporter: Dongjoon Hyun Priority: Minor After SPARK-19025, there is no need to keep SQLBuilderTest. ExpressionSQLBuilderSuite is the only place to use it. This issue aims to remove SQLBuilderTest. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21788) Handle more exceptions when stopping a streaming query
[ https://issues.apache.org/jira/browse/SPARK-21788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-21788: - Fix Version/s: (was: 3.0.0) 2.3.0 > Handle more exceptions when stopping a streaming query > -- > > Key: SPARK-21788 > URL: https://issues.apache.org/jira/browse/SPARK-21788 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Fix For: 2.3.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21831) Remove `spark.sql.hive.convertMetastoreOrc` config in HiveCompatibilitySuite
[ https://issues.apache.org/jira/browse/SPARK-21831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-21831: -- Summary: Remove `spark.sql.hive.convertMetastoreOrc` config in HiveCompatibilitySuite (was: Remove obsolete `spark.sql.hive.convertMetastoreOrc` config in HiveCompatibilitySuite) > Remove `spark.sql.hive.convertMetastoreOrc` config in HiveCompatibilitySuite > > > Key: SPARK-21831 > URL: https://issues.apache.org/jira/browse/SPARK-21831 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.3.0 >Reporter: Dongjoon Hyun >Priority: Minor > > SPARK-19025 removes SQLBuilder, so we need to remove the following in > HiveCompatibilitySuite. > {code} > // Ensures that the plans generation use metastore relation and not > OrcRelation > // Was done because SqlBuilder does not work with plans having logical > relation > TestHive.setConf(HiveUtils.CONVERT_METASTORE_ORC, false) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21831) Remove obsolete `spark.sql.hive.convertMetastoreOrc` config in HiveCompatibilitySuite
Dongjoon Hyun created SPARK-21831: - Summary: Remove obsolete `spark.sql.hive.convertMetastoreOrc` config in HiveCompatibilitySuite Key: SPARK-21831 URL: https://issues.apache.org/jira/browse/SPARK-21831 Project: Spark Issue Type: Bug Components: Tests Affects Versions: 2.3.0 Reporter: Dongjoon Hyun Priority: Minor SPARK-19025 removes SQLBuilder, so we need to remove the following in HiveCompatibilitySuite. {code} // Ensures that the plans generation use metastore relation and not OrcRelation // Was done because SqlBuilder does not work with plans having logical relation TestHive.setConf(HiveUtils.CONVERT_METASTORE_ORC, false) {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21797) spark cannot read partitioned data in S3 that are partly in glacier
[ https://issues.apache.org/jira/browse/SPARK-21797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran updated SPARK-21797: --- Environment: Amazon EMR > spark cannot read partitioned data in S3 that are partly in glacier > --- > > Key: SPARK-21797 > URL: https://issues.apache.org/jira/browse/SPARK-21797 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 > Environment: Amazon EMR >Reporter: Boris Clémençon > Labels: glacier, partitions, read, s3 > > I have a parquet dataset in S3 partitioned by date (dt), with the oldest dates stored in AWS Glacier to save some money. For instance, we have... > {noformat} > s3://my-bucket/my-dataset/dt=2017-07-01/[in glacier] > ... > s3://my-bucket/my-dataset/dt=2017-07-09/[in glacier] > s3://my-bucket/my-dataset/dt=2017-07-10/[not in glacier] > ... > s3://my-bucket/my-dataset/dt=2017-07-24/[not in glacier] > {noformat} > I want to read this dataset, but only the subset of dates that are not yet in Glacier, e.g.: > {code:java} > val from = "2017-07-15" > val to = "2017-08-24" > val path = "s3://my-bucket/my-dataset/" > val X = spark.read.parquet(path).where(col("dt").between(from, to)) > {code} > Unfortunately, I get the exception > {noformat} > java.io.IOException: > com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: > The operation is not valid for the object's storage class (Service: Amazon > S3; Status Code: 403; Error Code: InvalidObjectState; Request ID: > C444D508B6042138) > {noformat} > It seems that Spark does not like a partitioned dataset when some partitions are in Glacier. I could always read each date individually, add the column with the current date, and reduce(_ union _) at the end, but that is not pretty and it should not be necessary. > Is there any way to read the available data in the datastore even when old data is in Glacier?
[jira] [Comment Edited] (SPARK-21797) spark cannot read partitioned data in S3 that are partly in glacier
[ https://issues.apache.org/jira/browse/SPARK-21797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16140408#comment-16140408 ] Steve Loughran edited comment on SPARK-21797 at 8/24/17 5:56 PM: - This is happening deep in the Amazon EMR team's closed-source {{EmrFileSystem}}, so there is nothing anyone here at the ASF can deal with directly; I'm confident S3A will handle it pretty similarly though, either in the open() call or shortly afterwards, in the first read(). All we could do there is convert it to a more meaningful error, or actually check at open() time whether the file is valid and, again, fail meaningfully. At the Spark level, it happens because Parquet tries to read the footer of every file in parallel. The good news: you can tell Spark to ignore files it can't read. I believe this might be a quick workaround: {code} spark.sql.files.ignoreCorruptFiles=true {code} Let us know what happens was (Author: ste...@apache.org): This is happening deep the Amazon EMR team's closed source {{EmrFileSystem}}, so nothing anyone here at the ASF can deal with directly; I'm confident S3A will handle it pretty similarly though, either in the open() call or shortly afterwards, in the first read(). All we could do there is convert to a more meaningful error, or actually check to see if the file is valid at open() time & again, fail meaningfully At the Spark level, it's because Parquet is trying to read the footer of every file in parallel the good news, you can tell Spark to ignore files it can't read.
I believe this might be a quick workaround: {code} spark.sql.files.ignoreCorruptFiles=true {code} Let us know what happens > spark cannot read partitioned data in S3 that are partly in glacier > --- > > Key: SPARK-21797 > URL: https://issues.apache.org/jira/browse/SPARK-21797 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 > Environment: Amazon EMR >Reporter: Boris Clémençon > Labels: glacier, partitions, read, s3 > > I have a parquet dataset in S3 partitioned by date (dt), with the oldest dates stored in AWS Glacier to save some money. For instance, we have... > {noformat} > s3://my-bucket/my-dataset/dt=2017-07-01/[in glacier] > ... > s3://my-bucket/my-dataset/dt=2017-07-09/[in glacier] > s3://my-bucket/my-dataset/dt=2017-07-10/[not in glacier] > ... > s3://my-bucket/my-dataset/dt=2017-07-24/[not in glacier] > {noformat} > I want to read this dataset, but only the subset of dates that are not yet in Glacier, e.g.: > {code:java} > val from = "2017-07-15" > val to = "2017-08-24" > val path = "s3://my-bucket/my-dataset/" > val X = spark.read.parquet(path).where(col("dt").between(from, to)) > {code} > Unfortunately, I get the exception > {noformat} > java.io.IOException: > com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: > The operation is not valid for the object's storage class (Service: Amazon > S3; Status Code: 403; Error Code: InvalidObjectState; Request ID: > C444D508B6042138) > {noformat} > It seems that Spark does not like a partitioned dataset when some partitions are in Glacier. I could always read each date individually, add the column with the current date, and reduce(_ union _) at the end, but that is not pretty and it should not be necessary. > Is there any way to read the available data in the datastore even when old data is in Glacier?
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
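For reference, the flag suggested above can also be set cluster-wide rather than per session. A minimal spark-defaults.conf fragment (whether it masks the Glacier 403 the same way it masks genuinely corrupt footers is, as the comment itself notes, untested):

```
# spark-defaults.conf -- skip files whose footers cannot be read
spark.sql.files.ignoreCorruptFiles  true
```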
[jira] [Commented] (SPARK-21797) spark cannot read partitioned data in S3 that are partly in glacier
[ https://issues.apache.org/jira/browse/SPARK-21797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16140408#comment-16140408 ] Steve Loughran commented on SPARK-21797: This is happening deep in the Amazon EMR team's closed source {{EmrFileSystem}}, so nothing anyone here at the ASF can deal with directly; I'm confident S3A will handle it pretty similarly though, either in the open() call or shortly afterwards, in the first read(). All we could do there is convert it to a more meaningful error, or actually check to see if the file is valid at open() time and, again, fail meaningfully. At the Spark level, it's because Parquet is trying to read the footer of every file in parallel. The good news: you can tell Spark to ignore files it can't read. I believe this might be a quick workaround: {code} spark.sql.files.ignoreCorruptFiles=true {code} Let us know what happens. > spark cannot read partitioned data in S3 that are partly in glacier > --- > > Key: SPARK-21797 > URL: https://issues.apache.org/jira/browse/SPARK-21797 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Boris Clémençon > Labels: glacier, partitions, read, s3 > > I have a dataset in parquet in S3 partitioned by date (dt) with the oldest dates > stored in AWS Glacier to save some money. For instance, we have... > {noformat} > s3://my-bucket/my-dataset/dt=2017-07-01/[in glacier] > ... > s3://my-bucket/my-dataset/dt=2017-07-09/[in glacier] > s3://my-bucket/my-dataset/dt=2017-07-10/[not in glacier] > ... 
> s3://my-bucket/my-dataset/dt=2017-07-24/[not in glacier] > {noformat} > I want to read this dataset, but only the subset of dates that are not yet in > glacier, e.g.: > {code:java} > val from = "2017-07-15" > val to = "2017-08-24" > val path = "s3://my-bucket/my-dataset/" > val X = spark.read.parquet(path).where(col("dt").between(from, to)) > {code} > Unfortunately, I get the exception > {noformat} > java.io.IOException: > com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: > The operation is not valid for the object's storage class (Service: Amazon > S3; Status Code: 403; Error Code: InvalidObjectState; Request ID: > C444D508B6042138) > {noformat} > It seems that Spark does not like a partitioned dataset when some partitions are > in Glacier. I could always read each date specifically, add the column with the > current date and reduce(_ union _) at the end, but that is not pretty and it should > not be necessary. > Is there any tip to read the available data in the datastore even with old data > in glacier? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
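Until Spark handles this case, one workaround consistent with the report above is to enumerate only the non-Glacier date partitions and pass those paths to the reader explicitly, instead of pointing Spark at the whole dataset root. A sketch of the path-building step in Python (the bucket layout is the reporter's; passing an explicit path list and setting `basePath` to keep the `dt` column are standard Spark reader behavior):

```python
from datetime import date, timedelta

def partition_paths(base, start, end):
    """Enumerate dt=YYYY-MM-DD partition directories for [start, end] inclusive."""
    n_days = (end - start).days
    return ["{}/dt={}".format(base, (start + timedelta(days=d)).isoformat())
            for d in range(n_days + 1)]

# Only the restored (non-Glacier) date range is listed, so Spark never
# opens an archived object. The resulting paths can then be handed to
# spark.read.option("basePath", base).parquet(*paths).
paths = partition_paths("s3://my-bucket/my-dataset",
                        date(2017, 7, 15), date(2017, 7, 24))
```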
[jira] [Resolved] (SPARK-21788) Handle more exceptions when stopping a streaming query
[ https://issues.apache.org/jira/browse/SPARK-21788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-21788. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 18997 [https://github.com/apache/spark/pull/18997] > Handle more exceptions when stopping a streaming query > -- > > Key: SPARK-21788 > URL: https://issues.apache.org/jira/browse/SPARK-21788 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Fix For: 3.0.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21681) MLOR do not work correctly when featureStd contains zero
[ https://issues.apache.org/jira/browse/SPARK-21681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-21681: -- Labels: correctness (was: ) > MLOR do not work correctly when featureStd contains zero > > > Key: SPARK-21681 > URL: https://issues.apache.org/jira/browse/SPARK-21681 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.2.0, 2.3.0 >Reporter: Weichen Xu >Assignee: Weichen Xu > Labels: correctness > Fix For: 2.2.1, 2.3.0 > > > MLOR does not work correctly when featureStd contains zero. > The bug can be reproduced with a dataset whose features include zero > variance; it generates a wrong result (all coefficients become 0): > {code} > val multinomialDatasetWithZeroVar = { > val nPoints = 100 > val coefficients = Array( > -0.57997, 0.912083, -0.371077, > -0.16624, -0.84355, -0.048509) > val xMean = Array(5.843, 3.0) > val xVariance = Array(0.6856, 0.0) // including zero variance > val testData = generateMultinomialLogisticInput( > coefficients, xMean, xVariance, addIntercept = true, nPoints, seed) > val df = sc.parallelize(testData, 4).toDF().withColumn("weight", > lit(1.0)) > df.cache() > df > } > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
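For context on why a zero-variance feature corrupts the coefficients: training standardizes each feature by its standard deviation, and a zero std makes the divisor zero. A minimal sketch of the guard (illustrative Python, not Spark's actual ML implementation):

```python
def standardize(features, feature_std):
    """Scale each feature by its std. A feature with zero std carries no
    information, so it is mapped to 0.0 instead of dividing by zero --
    an unguarded division is the kind of failure described above."""
    return [x / s if s != 0.0 else 0.0
            for x, s in zip(features, feature_std)]
```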
[jira] [Resolved] (SPARK-21681) MLOR do not work correctly when featureStd contains zero
[ https://issues.apache.org/jira/browse/SPARK-21681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-21681. --- Resolution: Fixed Fix Version/s: 2.2.1 Issue resolved by pull request 19026 [https://github.com/apache/spark/pull/19026] > MLOR do not work correctly when featureStd contains zero > > > Key: SPARK-21681 > URL: https://issues.apache.org/jira/browse/SPARK-21681 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.2.0, 2.3.0 >Reporter: Weichen Xu >Assignee: Weichen Xu > Fix For: 2.2.1, 2.3.0 > > > MLOR does not work correctly when featureStd contains zero. > The bug can be reproduced with a dataset whose features include zero > variance; it generates a wrong result (all coefficients become 0): > {code} > val multinomialDatasetWithZeroVar = { > val nPoints = 100 > val coefficients = Array( > -0.57997, 0.912083, -0.371077, > -0.16624, -0.84355, -0.048509) > val xMean = Array(5.843, 3.0) > val xVariance = Array(0.6856, 0.0) // including zero variance > val testData = generateMultinomialLogisticInput( > coefficients, xMean, xVariance, addIntercept = true, nPoints, seed) > val df = sc.parallelize(testData, 4).toDF().withColumn("weight", > lit(1.0)) > df.cache() > df > } > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16742) Kerberos support for Spark on Mesos
[ https://issues.apache.org/jira/browse/SPARK-16742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16140293#comment-16140293 ] Marcelo Vanzin commented on SPARK-16742: Both renewal and creating new tickets after the TTL (those are different things). > Kerberos support for Spark on Mesos > --- > > Key: SPARK-16742 > URL: https://issues.apache.org/jira/browse/SPARK-16742 > Project: Spark > Issue Type: New Feature > Components: Mesos >Reporter: Michael Gummelt >Assignee: Arthur Rand > Fix For: 2.3.0 > > > We at Mesosphere have written Kerberos support for Spark on Mesos. We'll be > contributing it to Apache Spark soon. > Mesosphere design doc: > https://docs.google.com/document/d/1xyzICg7SIaugCEcB4w1vBWp24UDkyJ1Pyt2jtnREFqc/edit#heading=h.tdnq7wilqrj6 > Mesosphere code: > https://github.com/mesosphere/spark/commit/73ba2ab8d97510d5475ef9a48c673ce34f7173fa -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21828) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB...again
[ https://issues.apache.org/jira/browse/SPARK-21828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16140282#comment-16140282 ] Kazuaki Ishizaki commented on SPARK-21828: -- Thank you for reporting this problem. First, IIUC, this PR (https://github.com/apache/spark/pull/15480) has been included in the latest release. Thus, the test case "SPARK-16845..." in {{OrderingSuite.scala}} does not fail. Could you please post a program that can reproduce this issue? Then I will investigate it. > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" > grows beyond 64 KB...again > - > > Key: SPARK-21828 > URL: https://issues.apache.org/jira/browse/SPARK-21828 > Project: Spark > Issue Type: Bug > Components: ML, SQL >Affects Versions: 2.2.0 >Reporter: Otis Smart >Priority: Critical > > Hello! > 1. I encounter a similar issue (see below text) on Pyspark 2.2 (e.g., > dataframe with ~5 rows x 1100+ columns as input to ".fit()" method of > CrossValidator() that includes Pipeline() that includes StringIndexer(), > VectorAssembler() and DecisionTreeClassifier()). > 2. Was the aforementioned patch (the > fix in https://github.com/apache/spark/pull/15480) not included in the latest > release? What is the cause of, and the solution to, this persistent > issue? > py4j.protocol.Py4JJavaError: An error occurred while calling o9396.fit. 
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 38 > in stage 18.0 failed 4 times, most recent failure: Lost task 38.3 in stage > 18.0 (TID 1996, ip-10-0-14-83.ec2.internal, executor 4): > java.util.concurrent.ExecutionException: java.lang.Exception: failed to > compile: org.codehaus.janino.JaninoRuntimeException: Code of method > "compare(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I" > of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" > grows beyond 64 KB > /* 001 */ public SpecificOrdering generate(Object[] references) > { /* 002 */ return new SpecificOrdering(references); /* 003 */ } > /* 004 */ > /* 005 */ class SpecificOrdering extends > org.apache.spark.sql.catalyst.expressions.codegen.BaseOrdering { > /* 006 */ > /* 007 */ private Object[] references; > /* 008 */ > /* 009 */ > /* 010 */ public SpecificOrdering(Object[] references) > { /* 011 */ this.references = references; /* 012 */ /* 013 */ } > /* 014 */ > /* 015 */ > /* 016 */ > /* 017 */ public int compare(InternalRow a, InternalRow b) { > /* 018 */ InternalRow i = null; // Holds current row being evaluated. 
> /* 019 */ > /* 020 */ i = a; > /* 021 */ boolean isNullA; > /* 022 */ double primitiveA; > /* 023 */ > { /* 024 */ /* 025 */ double value = i.getDouble(0); /* 026 */ isNullA = > false; /* 027 */ primitiveA = value; /* 028 */ } > /* 029 */ i = b; > /* 030 */ boolean isNullB; > /* 031 */ double primitiveB; > /* 032 */ > { /* 033 */ /* 034 */ double value = i.getDouble(0); /* 035 */ isNullB = > false; /* 036 */ primitiveB = value; /* 037 */ } > /* 038 */ if (isNullA && isNullB) > { /* 039 */ // Nothing /* 040 */ } > else if (isNullA) > { /* 041 */ return -1; /* 042 */ } > else if (isNullB) > { /* 043 */ return 1; /* 044 */ } > else { > /* 045 */ int comp = > org.apache.spark.util.Utils.nanSafeCompareDoubles(primitiveA, primitiveB); > /* 046 */ if (comp != 0) > { /* 047 */ return comp; /* 048 */ } > /* 049 */ } > /* 050 */ > /* 051 */ > ... -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21797) spark cannot read partitioned data in S3 that are partly in glacier
[ https://issues.apache.org/jira/browse/SPARK-21797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16140261#comment-16140261 ] Boris Clémençon commented on SPARK-21797: -- Steve, This is the stacks: {noformat} WARN TaskSetManager: Lost task 0.1 in stage 0.0 (TID 1, ip-172-31-42-242.eu-west-1.compute.internal, executor 1): java.io.IOException: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: The operation is not valid for the object's storage class (Service: Amazon S3; Status Code: 403; Error Code: InvalidObjectState; Request ID: 5DD5BEBB8173977D), S3 Extended Request ID: K9bDwhm32CFHeg5zgVfW/T1A/vB4e8gqQ/p7E0Ze9ZG55UFoDP7hgnkQxLIwYX9i2LEcKwrR+lo= at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.handleAmazonServiceException(Jets3tNativeFileSystemStore.java:434) at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.retrievePair(Jets3tNativeFileSystemStore.java:461) at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.retrievePair(Jets3tNativeFileSystemStore.java:439) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) at com.sun.proxy.$Proxy30.retrievePair(Unknown Source) at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.open(S3NativeFileSystem.java:1201) at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:773) at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.open(EmrFileSystem.java:166) at org.apache.parquet.hadoop.util.HadoopInputFile.newStream(HadoopInputFile.java:65) at 
org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:443) at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:421) at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$readParquetFootersInParallel$1.apply(ParquetFileFormat.scala:491) at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$readParquetFootersInParallel$1.apply(ParquetFileFormat.scala:485) at scala.collection.parallel.AugmentedIterableIterator$class.flatmap2combiner(RemainsIterator.scala:132) at scala.collection.parallel.immutable.ParVector$ParVectorIterator.flatmap2combiner(ParVector.scala:62) at scala.collection.parallel.ParIterableLike$FlatMap.leaf(ParIterableLike.scala:1072) at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:49) at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:48) at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:48) at scala.collection.parallel.Task$class.tryLeaf(Tasks.scala:51) at scala.collection.parallel.ParIterableLike$FlatMap.tryLeaf(ParIterableLike.scala:1068) at scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.compute(Tasks.scala:152) at scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.compute(Tasks.scala:443) at scala.concurrent.forkjoin.RecursiveAction.exec(RecursiveAction.java:160) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinTask.doJoin(ForkJoinTask.java:341) at scala.concurrent.forkjoin.ForkJoinTask.join(ForkJoinTask.java:673) at scala.collection.parallel.ForkJoinTasks$WrappedTask$class.sync(Tasks.scala:378) at scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.sync(Tasks.scala:443) at scala.collection.parallel.ForkJoinTasks$class.executeAndWaitResult(Tasks.scala:426) at scala.collection.parallel.ForkJoinTaskSupport.executeAndWaitResult(TaskSupport.scala:56) at 
scala.collection.parallel.ParIterableLike$ResultMapping.leaf(ParIterableLike.scala:958) at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:49) at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:48) at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:48) at scala.collection.parallel.Task$class.tryLeaf(Tasks.scala:51) at scala.collection.parallel.ParIterableLike$ResultMapping.tryLeaf(ParIterableLike.scala:953) at scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.compute(Tasks.scala:152) at scala.collection.parallel.AdaptiveWorkStealin
[jira] [Created] (SPARK-21830) Bump the dependency of ANTLR to version 4.7
Herman van Hovell created SPARK-21830: - Summary: Bump the dependency of ANTLR to version 4.7 Key: SPARK-21830 URL: https://issues.apache.org/jira/browse/SPARK-21830 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Herman van Hovell Assignee: Herman van Hovell There are a few minor issues with the current ANTLR version. Version 4.7 fixes most of these. I'd like to bump the version to that one. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21799) KMeans performance regression (5-6x slowdown) in Spark 2.2
[ https://issues.apache.org/jira/browse/SPARK-21799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16140183#comment-16140183 ] zakaria hili commented on SPARK-21799: -- [~WeichenXu123], df.rdd.getStorageLevel returns NONE even if df is cached. The solution is to check the cache status on the df itself. When I created the PR for SPARK-18356, we didn't have df.storageLevel. But now there are some people working on this solution. > KMeans performance regression (5-6x slowdown) in Spark 2.2 > -- > > Key: SPARK-21799 > URL: https://issues.apache.org/jira/browse/SPARK-21799 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 2.2.0 >Reporter: Siddharth Murching > > I've been running KMeans performance tests using > [spark-sql-perf|https://github.com/databricks/spark-sql-perf/] and have > noticed a regression (slowdowns of 5-6x) when running tests on large datasets > in Spark 2.2 vs 2.1. > The test params are: > * Cluster: 510 GB RAM, 16 workers > * Data: 100 examples, 1 features > After talking to [~josephkb], the issue seems related to the changes in > [SPARK-18356|https://issues.apache.org/jira/browse/SPARK-18356] introduced in > [this PR|https://github.com/apache/spark/pull/16295]. > It seems `df.cache()` doesn't set the storageLevel of `df.rdd`, so > `handlePersistence` is true even when KMeans is run on a cached DataFrame. > This unnecessarily causes another copy of the input dataset to be persisted. > As of Spark 2.1 ([JIRA > link|https://issues.apache.org/jira/browse/SPARK-16063]) `df.storageLevel` > returns the correct result after calling `df.cache()`, so I'd suggest > replacing instances of `df.rdd.getStorageLevel` with `df.storageLevel` in > MLlib algorithms (the same pattern shows up in LogisticRegression, > LinearRegression, and others). 
I've verified this behavior in [this > notebook|https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/5211178207246023/950505630032626/7788830288800223/latest.html] -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21829) Enable config to permanently blacklist a list of nodes
[ https://issues.apache.org/jira/browse/SPARK-21829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luca Canali updated SPARK-21829: Summary: Enable config to permanently blacklist a list of nodes (was: Prevent running executors/tasks on a user-specified list of cluster nodes) > Enable config to permanently blacklist a list of nodes > -- > > Key: SPARK-21829 > URL: https://issues.apache.org/jira/browse/SPARK-21829 > Project: Spark > Issue Type: New Feature > Components: Scheduler, Spark Core >Affects Versions: 2.1.1, 2.2.0 >Reporter: Luca Canali >Priority: Minor > > The idea for this proposal comes from a performance incident in a local > cluster where a job was found to be very slow because of a long tail of stragglers > due to 2 nodes in the cluster being slow to access a remote filesystem. > The issue was limited to the 2 machines and was related to external > configurations: the 2 machines that performed badly when accessing the remote > file system were behaving normally for other jobs in the cluster (a shared > YARN cluster). > With this new feature I propose to introduce a mechanism to allow users to > specify a list of nodes in the cluster where executors/tasks should not run > for a specific job. > The proposed implementation that I tested (see PR) uses the Spark blacklist > mechanism. With the parameter spark.blacklist.alwaysBlacklistedNodes, a list > of user-specified nodes is added to the blacklist at the start of the Spark > Context and it is never expired. > I have tested this on a YARN cluster on a case taken from the original > production problem and I confirm a performance improvement of about 5x for > the specific test case I have. I imagine that there can be other cases where > Spark users may want to blacklist a set of nodes. 
This can be used for > troubleshooting, including cases where certain nodes/executors are slow for a > given workload and this is caused by external agents, so the anomaly is not > picked up by the cluster manager. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
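If the proposal lands, usage would presumably look like the fragment below; `spark.blacklist.alwaysBlacklistedNodes` is the parameter name from the PR under review (not part of any released Spark version), and the hostnames are placeholders:

```
# spark-defaults.conf -- proposed setting, per the PR for SPARK-21829
spark.blacklist.alwaysBlacklistedNodes  node1.example.com,node2.example.com
```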
[jira] [Commented] (SPARK-21799) KMeans performance regression (5-6x slowdown) in Spark 2.2
[ https://issues.apache.org/jira/browse/SPARK-21799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16140162#comment-16140162 ] Weichen Xu commented on SPARK-21799: I suggest checking both `df.storageLevel` and `df.rdd.getStorageLevel` for the input data; only when neither of them indicates caching should we cache before `train`. > KMeans performance regression (5-6x slowdown) in Spark 2.2 > -- > > Key: SPARK-21799 > URL: https://issues.apache.org/jira/browse/SPARK-21799 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 2.2.0 >Reporter: Siddharth Murching > > I've been running KMeans performance tests using > [spark-sql-perf|https://github.com/databricks/spark-sql-perf/] and have > noticed a regression (slowdowns of 5-6x) when running tests on large datasets > in Spark 2.2 vs 2.1. > The test params are: > * Cluster: 510 GB RAM, 16 workers > * Data: 100 examples, 1 features > After talking to [~josephkb], the issue seems related to the changes in > [SPARK-18356|https://issues.apache.org/jira/browse/SPARK-18356] introduced in > [this PR|https://github.com/apache/spark/pull/16295]. > It seems `df.cache()` doesn't set the storageLevel of `df.rdd`, so > `handlePersistence` is true even when KMeans is run on a cached DataFrame. > This unnecessarily causes another copy of the input dataset to be persisted. > As of Spark 2.1 ([JIRA > link|https://issues.apache.org/jira/browse/SPARK-16063]) `df.storageLevel` > returns the correct result after calling `df.cache()`, so I'd suggest > replacing instances of `df.rdd.getStorageLevel` with `df.storageLevel` in > MLlib algorithms (the same pattern shows up in LogisticRegression, > LinearRegression, and others). 
I've verified this behavior in [this > notebook|https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/5211178207246023/950505630032626/7788830288800223/latest.html] -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
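The double check suggested above can be sketched as follows. This is illustrative pseudologic for the `handlePersistence` decision, not Spark's actual implementation; the string values stand in for Spark's `StorageLevel` objects:

```python
NONE = "NONE"  # stand-in for StorageLevel.NONE

def handle_persistence(df_storage_level, rdd_storage_level):
    """Cache inside the algorithm only if neither the DataFrame
    (df.storageLevel) nor its underlying RDD (df.rdd.getStorageLevel)
    is already persisted."""
    return df_storage_level == NONE and rdd_storage_level == NONE

# A DataFrame cached via df.cache() reports a non-NONE df.storageLevel even
# though df.rdd.getStorageLevel may still report NONE; checking both sides
# avoids the duplicate persist described in SPARK-21799.
```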
[jira] [Commented] (SPARK-21817) Pass FSPermissions to LocatedFileStatus from InMemoryFileIndex
[ https://issues.apache.org/jira/browse/SPARK-21817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16140161#comment-16140161 ] Steve Loughran commented on SPARK-21817: FYI, this is now fixed in hadoop trunk/3.0-beta-1 > Pass FSPermissions to LocatedFileStatus from InMemoryFileIndex > -- > > Key: SPARK-21817 > URL: https://issues.apache.org/jira/browse/SPARK-21817 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ewan Higgs >Priority: Minor > Attachments: SPARK-21817.001.patch > > > The implementation of HDFS-6984 now uses the passed in {{FSPermission}} to > pull out the ACL and other information. Therefore passing in a {{null}} is no > longer adequate and hence causes a NPE when listing files. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21797) spark cannot read partitioned data in S3 that are partly in glacier
[ https://issues.apache.org/jira/browse/SPARK-21797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16140147#comment-16140147 ] Steve Loughran commented on SPARK-21797: Note that if it is just during spark partition calculation, it's probable that it is going down the directory tree and inspecting the files through HEAD requests, maybe looking at metadata entries too. So do attach the s3a & spark trace so we can see what's going on, as something may be over-enthusiastic about looking at files, or we could have something recognise the problem and recover from it. > spark cannot read partitioned data in S3 that are partly in glacier > --- > > Key: SPARK-21797 > URL: https://issues.apache.org/jira/browse/SPARK-21797 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Boris Clémençon > Labels: glacier, partitions, read, s3 > > I have a dataset in parquet in S3 partitioned by date (dt) with the oldest dates > stored in AWS Glacier to save some money. For instance, we have... > {noformat} > s3://my-bucket/my-dataset/dt=2017-07-01/[in glacier] > ... > s3://my-bucket/my-dataset/dt=2017-07-09/[in glacier] > s3://my-bucket/my-dataset/dt=2017-07-10/[not in glacier] > ... 
> s3://my-bucket/my-dataset/dt=2017-07-24/[not in glacier] > {noformat} > I want to read this dataset, but only the subset of dates that are not yet in > glacier, e.g.: > {code:java} > val from = "2017-07-15" > val to = "2017-08-24" > val path = "s3://my-bucket/my-dataset/" > val X = spark.read.parquet(path).where(col("dt").between(from, to)) > {code} > Unfortunately, I get the exception > {noformat} > java.io.IOException: > com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: > The operation is not valid for the object's storage class (Service: Amazon > S3; Status Code: 403; Error Code: InvalidObjectState; Request ID: > C444D508B6042138) > {noformat} > It seems that Spark does not like a partitioned dataset when some partitions are > in Glacier. I could always read each date specifically, add the column with the > current date and reduce(_ union _) at the end, but that is not pretty and it should > not be necessary. > Is there any tip to read the available data in the datastore even with old data > in glacier? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21797) spark cannot read partitioned data in S3 that are partly in glacier
[ https://issues.apache.org/jira/browse/SPARK-21797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16140138#comment-16140138 ] Steve Loughran commented on SPARK-21797: I was talking about the cost and time of getting data from Glacier. If that's the only place where data lives, then it's slow and expensive. And that's the bit I'm describing as niche. Given I've been working full time on S3A, I'm reasonably confident it gets used a lot. If you talk to data in S3 that has been backed up to glacier, you *will get a 403*: According to Jeff Barr himself: https://aws.amazon.com/blogs/aws/archive-s3-to-glacier/ bq. If you archive objects using the Glacier storage option, you must inspect the storage class of an object before you attempt to retrieve it. The customary GET request will work as expected if the object is stored in S3 Standard or Reduced Redundancy (RRS) storage. It will fail (with a 403 error) if the object is archived in Glacier. In this case, you must use the RESTORE operation (described below) to make your data available in S3. bq. You use S3’s new RESTORE operation to access an object archived in Glacier. As part of the request, you need to specify a retention period in days. Restoring an object will generally take 3 to 5 hours. Your restored object will remain in both Glacier and S3’s Reduced Redundancy Storage (RRS) for the duration of the retention period. At the end of the retention period the object’s data will be removed from S3; the object will remain in Glacier. Like I said, I'd be interested in getting the full stack trace if you try to read this with an S3A client. Not for fixing, but for better reporting. Probably point them at Jeff's blog entry. 
Or this JIRA :) > spark cannot read partitioned data in S3 that are partly in glacier > --- > > Key: SPARK-21797 > URL: https://issues.apache.org/jira/browse/SPARK-21797 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Boris Clémençon > Labels: glacier, partitions, read, s3 > > I have a dataset in parquet in S3 partitioned by date (dt) with the oldest dates > stored in AWS Glacier to save some money. For instance, we have... > {noformat} > s3://my-bucket/my-dataset/dt=2017-07-01/[in glacier] > ... > s3://my-bucket/my-dataset/dt=2017-07-09/[in glacier] > s3://my-bucket/my-dataset/dt=2017-07-10/[not in glacier] > ... > s3://my-bucket/my-dataset/dt=2017-07-24/[not in glacier] > {noformat} > I want to read this dataset, but only the subset of dates that are not yet in > glacier, e.g.: > {code:java} > val from = "2017-07-15" > val to = "2017-08-24" > val path = "s3://my-bucket/my-dataset/" > val X = spark.read.parquet(path).where(col("dt").between(from, to)) > {code} > Unfortunately, I get the exception > {noformat} > java.io.IOException: > com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: > The operation is not valid for the object's storage class (Service: Amazon > S3; Status Code: 403; Error Code: InvalidObjectState; Request ID: > C444D508B6042138) > {noformat} > It seems that Spark does not like a partitioned dataset when some partitions are > in Glacier. I could always read each date specifically, add the column with the > current date and reduce(_ union _) at the end, but that is not pretty and it should > not be necessary. > Is there any tip to read the available data in the datastore even with old data > in glacier? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
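Following the "inspect the storage class before you attempt to retrieve it" advice quoted above, a client can filter an S3 listing by storage class before handing any paths to Spark, so archived objects are never opened. A sketch in plain Python over `(key, storage_class)` pairs as an S3 LIST would return them (the storage-class names are AWS's; the helper itself is illustrative):

```python
def readable_keys(listing):
    """Keep only keys that a plain GET can read.

    GLACIER objects fail GET with 403 InvalidObjectState until a RESTORE
    completes, so they are excluded here; a real client would also accept
    a GLACIER object whose restore has finished.
    """
    readable = {"STANDARD", "REDUCED_REDUNDANCY", "STANDARD_IA"}
    return [key for key, storage_class in listing if storage_class in readable]
```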
[jira] [Resolved] (SPARK-21826) outer broadcast hash join should not throw NPE
[ https://issues.apache.org/jira/browse/SPARK-21826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-21826. --- Resolution: Fixed Fix Version/s: 2.3.0 2.2.1 Fixed per https://github.com/apache/spark/pull/19036 > outer broadcast hash join should not throw NPE > -- > > Key: SPARK-21826 > URL: https://issues.apache.org/jira/browse/SPARK-21826 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.2.1, 2.3.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21826) outer broadcast hash join should not throw NPE
[ https://issues.apache.org/jira/browse/SPARK-21826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell reassigned SPARK-21826: - Assignee: Wenchen Fan > outer broadcast hash join should not throw NPE > -- > > Key: SPARK-21826 > URL: https://issues.apache.org/jira/browse/SPARK-21826 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16742) Kerberos support for Spark on Mesos
[ https://issues.apache.org/jira/browse/SPARK-16742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16140111#comment-16140111 ] Arthur Rand commented on SPARK-16742: - Hello [~vanzin], I'm assuming you're talking about automatic ticket renewal, correct? I was just starting to look into that w.r.t. Mesos, I'll create a ticket. > Kerberos support for Spark on Mesos > --- > > Key: SPARK-16742 > URL: https://issues.apache.org/jira/browse/SPARK-16742 > Project: Spark > Issue Type: New Feature > Components: Mesos >Reporter: Michael Gummelt >Assignee: Arthur Rand > Fix For: 2.3.0 > > > We at Mesosphere have written Kerberos support for Spark on Mesos. We'll be > contributing it to Apache Spark soon. > Mesosphere design doc: > https://docs.google.com/document/d/1xyzICg7SIaugCEcB4w1vBWp24UDkyJ1Pyt2jtnREFqc/edit#heading=h.tdnq7wilqrj6 > Mesosphere code: > https://github.com/mesosphere/spark/commit/73ba2ab8d97510d5475ef9a48c673ce34f7173fa -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21190) SPIP: Vectorized UDFs in Python
[ https://issues.apache.org/jira/browse/SPARK-21190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16140094#comment-16140094 ] Li Jin edited comment on SPARK-21190 at 8/24/17 2:33 PM: - [~ueshin], Got it. I'd actually prefer doing it this way: {code} @pandas_udf(LongType()) def f0(v): return pd.Series(1).repeat(len(v)) df.select(f0(F.lit(0))) {code} Instead of passing the size as a scalar to the function. Passing size to f0 feels unintuitive to me because f0 is defined as a function that takes one argument - def f0(size), but being invoked with 0 args - f0(). was (Author: icexelloss): [~ueshin], Got it. I'd actually prefer doing it this way: {code} @pandas_udf(LongType()) def f0(v): return pd.Series(1).repeat(len(v)) df.select(f0(F.lit(0))) {code} Instead of passing the size as a scalar to the function. Passing size to f0 feels unintuitive to me because f0 is defined as a function that takes one argument - size, but being invoked with 0 args - f0(). > SPIP: Vectorized UDFs in Python > --- > > Key: SPARK-21190 > URL: https://issues.apache.org/jira/browse/SPARK-21190 > Project: Spark > Issue Type: New Feature > Components: PySpark, SQL >Affects Versions: 2.2.0 >Reporter: Reynold Xin >Assignee: Reynold Xin > Labels: SPIP > Attachments: SPIPVectorizedUDFsforPython (1).pdf > > > *Background and Motivation* > Python is one of the most popular programming languages among Spark users. > Spark currently exposes a row-at-a-time interface for defining and executing > user-defined functions (UDFs). This introduces high overhead in serialization > and deserialization, and also makes it difficult to leverage Python libraries > (e.g. numpy, Pandas) that are written in native code. > > This proposal advocates introducing new APIs to support vectorized UDFs in > Python, in which a block of data is transferred over to Python in some > columnar format for execution. > > > *Target Personas* > Data scientists, data engineers, library developers. 
> > *Goals* > - Support vectorized UDFs that apply on chunks of the data frame > - Low system overhead: Substantially reduce serialization and deserialization > overhead when compared with row-at-a-time interface > - UDF performance: Enable users to leverage native libraries in Python (e.g. > numpy, Pandas) for data manipulation in these UDFs > > *Non-Goals* > The following are explicitly out of scope for the current SPIP, and should be > done in future SPIPs. Nonetheless, it would be good to consider these future > use cases during API design, so we can achieve some consistency when rolling > out new APIs. > > - Define block oriented UDFs in other languages (that are not Python). > - Define aggregate UDFs > - Tight integration with machine learning frameworks > > *Proposed API Changes* > The following sketches some possibilities. I haven’t spent a lot of time > thinking about the API (wrote it down in 5 mins) and I am not attached to > this design at all. The main purpose of the SPIP is to get feedback on use > cases and see how they can impact API design. > > A few things to consider are: > > 1. Python is dynamically typed, whereas DataFrames/SQL requires static, > analysis time typing. This means users would need to specify the return type > of their UDFs. > > 2. Ratio of input rows to output rows. We propose initially we require number > of output rows to be the same as the number of input rows. In the future, we > can consider relaxing this constraint with support for vectorized aggregate > UDFs. > 3. How do we handle null values, since Pandas doesn't have the concept of > nulls? > > Proposed API sketch (using examples): > > Use case 1. A function that defines all the columns of a DataFrame (similar > to a “map” function): > > {code} > @spark_udf(some way to describe the return schema) > def my_func_on_entire_df(input): > """ Some user-defined function. > > :param input: A Pandas DataFrame with two columns, a and b. > :return: :class: A Pandas data frame. 
> """ > input[c] = input[a] + input[b] > Input[d] = input[a] - input[b] > return input > > spark.range(1000).selectExpr("id a", "id / 2 b") > .mapBatches(my_func_on_entire_df) > {code} > > Use case 2. A function that defines only one column (similar to existing > UDFs): > > {code} > @spark_udf(some way to describe the return schema) > def my_func_that_returns_one_column(input): > """ Some user-defined function. > > :param input: A Pandas DataFrame with two columns, a and b. > :return: :class: A numpy array > """ > return input[a] + input[b] > > my_func = udf(my_func_that_returns_one_column) > > df = spark.range(1000).selectExpr("id a", "id / 2 b") > df.withColumn("c", my_func(df.a,
[jira] [Comment Edited] (SPARK-21190) SPIP: Vectorized UDFs in Python
[ https://issues.apache.org/jira/browse/SPARK-21190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16140094#comment-16140094 ] Li Jin edited comment on SPARK-21190 at 8/24/17 2:33 PM: - [~ueshin], Got it. I'd actually prefer doing it this way: {code} @pandas_udf(LongType()) def f0(v): return pd.Series(1).repeat(len(v)) df.select(f0(F.lit(0))) {code} Instead of passing the size as a scalar to the function. Passing size to f0 feels unintuitive to me because f0 is defined as a function that takes one argument - size, but being invoked with 0 args - f0(). was (Author: icexelloss): [~ueshin], Got it. I'd actually prefer doing it this way: {code} @pandas_udf(LongType()) def f0(v): return pd.Series(1).repeat(len(v)) df.select(f0(F.lit(0))) {code} Instead of passing the size as a scalar to the function. Passing size to f0 feels intuitive to me because f0 is defined as a function that takes one argument - size, but being invoked with 0 args - f0(). > SPIP: Vectorized UDFs in Python > --- > > Key: SPARK-21190 > URL: https://issues.apache.org/jira/browse/SPARK-21190 > Project: Spark > Issue Type: New Feature > Components: PySpark, SQL >Affects Versions: 2.2.0 >Reporter: Reynold Xin >Assignee: Reynold Xin > Labels: SPIP > Attachments: SPIPVectorizedUDFsforPython (1).pdf > > > *Background and Motivation* > Python is one of the most popular programming languages among Spark users. > Spark currently exposes a row-at-a-time interface for defining and executing > user-defined functions (UDFs). This introduces high overhead in serialization > and deserialization, and also makes it difficult to leverage Python libraries > (e.g. numpy, Pandas) that are written in native code. > > This proposal advocates introducing new APIs to support vectorized UDFs in > Python, in which a block of data is transferred over to Python in some > columnar format for execution. > > > *Target Personas* > Data scientists, data engineers, library developers. 
> > *Goals* > - Support vectorized UDFs that apply on chunks of the data frame > - Low system overhead: Substantially reduce serialization and deserialization > overhead when compared with row-at-a-time interface > - UDF performance: Enable users to leverage native libraries in Python (e.g. > numpy, Pandas) for data manipulation in these UDFs > > *Non-Goals* > The following are explicitly out of scope for the current SPIP, and should be > done in future SPIPs. Nonetheless, it would be good to consider these future > use cases during API design, so we can achieve some consistency when rolling > out new APIs. > > - Define block oriented UDFs in other languages (that are not Python). > - Define aggregate UDFs > - Tight integration with machine learning frameworks > > *Proposed API Changes* > The following sketches some possibilities. I haven’t spent a lot of time > thinking about the API (wrote it down in 5 mins) and I am not attached to > this design at all. The main purpose of the SPIP is to get feedback on use > cases and see how they can impact API design. > > A few things to consider are: > > 1. Python is dynamically typed, whereas DataFrames/SQL requires static, > analysis time typing. This means users would need to specify the return type > of their UDFs. > > 2. Ratio of input rows to output rows. We propose initially we require number > of output rows to be the same as the number of input rows. In the future, we > can consider relaxing this constraint with support for vectorized aggregate > UDFs. > 3. How do we handle null values, since Pandas doesn't have the concept of > nulls? > > Proposed API sketch (using examples): > > Use case 1. A function that defines all the columns of a DataFrame (similar > to a “map” function): > > {code} > @spark_udf(some way to describe the return schema) > def my_func_on_entire_df(input): > """ Some user-defined function. > > :param input: A Pandas DataFrame with two columns, a and b. > :return: :class: A Pandas data frame. 
> """ > input[c] = input[a] + input[b] > Input[d] = input[a] - input[b] > return input > > spark.range(1000).selectExpr("id a", "id / 2 b") > .mapBatches(my_func_on_entire_df) > {code} > > Use case 2. A function that defines only one column (similar to existing > UDFs): > > {code} > @spark_udf(some way to describe the return schema) > def my_func_that_returns_one_column(input): > """ Some user-defined function. > > :param input: A Pandas DataFrame with two columns, a and b. > :return: :class: A numpy array > """ > return input[a] + input[b] > > my_func = udf(my_func_that_returns_one_column) > > df = spark.range(1000).selectExpr("id a", "id / 2 b") > df.withColumn("c", my_func(df.a, df.b)) > {
[jira] [Commented] (SPARK-21190) SPIP: Vectorized UDFs in Python
[ https://issues.apache.org/jira/browse/SPARK-21190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16140094#comment-16140094 ] Li Jin commented on SPARK-21190: [~ueshin], Got it. I'd actually prefer doing it this way: {code} @pandas_udf(LongType()) def f0(v): return pd.Series(1).repeat(len(v)) df.select(f0(F.lit(0))) {code} Instead of passing the size as a scalar to the function. Passing size to f0 feels intuitive to me because f0 is defined as a function that takes one argument - size, but being invoked with 0 args - f0(). > SPIP: Vectorized UDFs in Python > --- > > Key: SPARK-21190 > URL: https://issues.apache.org/jira/browse/SPARK-21190 > Project: Spark > Issue Type: New Feature > Components: PySpark, SQL >Affects Versions: 2.2.0 >Reporter: Reynold Xin >Assignee: Reynold Xin > Labels: SPIP > Attachments: SPIPVectorizedUDFsforPython (1).pdf > > > *Background and Motivation* > Python is one of the most popular programming languages among Spark users. > Spark currently exposes a row-at-a-time interface for defining and executing > user-defined functions (UDFs). This introduces high overhead in serialization > and deserialization, and also makes it difficult to leverage Python libraries > (e.g. numpy, Pandas) that are written in native code. > > This proposal advocates introducing new APIs to support vectorized UDFs in > Python, in which a block of data is transferred over to Python in some > columnar format for execution. > > > *Target Personas* > Data scientists, data engineers, library developers. > > *Goals* > - Support vectorized UDFs that apply on chunks of the data frame > - Low system overhead: Substantially reduce serialization and deserialization > overhead when compared with row-at-a-time interface > - UDF performance: Enable users to leverage native libraries in Python (e.g. 
> numpy, Pandas) for data manipulation in these UDFs > > *Non-Goals* > The following are explicitly out of scope for the current SPIP, and should be > done in future SPIPs. Nonetheless, it would be good to consider these future > use cases during API design, so we can achieve some consistency when rolling > out new APIs. > > - Define block oriented UDFs in other languages (that are not Python). > - Define aggregate UDFs > - Tight integration with machine learning frameworks > > *Proposed API Changes* > The following sketches some possibilities. I haven’t spent a lot of time > thinking about the API (wrote it down in 5 mins) and I am not attached to > this design at all. The main purpose of the SPIP is to get feedback on use > cases and see how they can impact API design. > > A few things to consider are: > > 1. Python is dynamically typed, whereas DataFrames/SQL requires static, > analysis time typing. This means users would need to specify the return type > of their UDFs. > > 2. Ratio of input rows to output rows. We propose initially we require number > of output rows to be the same as the number of input rows. In the future, we > can consider relaxing this constraint with support for vectorized aggregate > UDFs. > 3. How do we handle null values, since Pandas doesn't have the concept of > nulls? > > Proposed API sketch (using examples): > > Use case 1. A function that defines all the columns of a DataFrame (similar > to a “map” function): > > {code} > @spark_udf(some way to describe the return schema) > def my_func_on_entire_df(input): > """ Some user-defined function. > > :param input: A Pandas DataFrame with two columns, a and b. > :return: :class: A Pandas data frame. > """ > input[c] = input[a] + input[b] > Input[d] = input[a] - input[b] > return input > > spark.range(1000).selectExpr("id a", "id / 2 b") > .mapBatches(my_func_on_entire_df) > {code} > > Use case 2. 
A function that defines only one column (similar to existing > UDFs): > > {code} > @spark_udf(some way to describe the return schema) > def my_func_that_returns_one_column(input): > """ Some user-defined function. > > :param input: A Pandas DataFrame with two columns, a and b. > :return: :class: A numpy array > """ > return input[a] + input[b] > > my_func = udf(my_func_that_returns_one_column) > > df = spark.range(1000).selectExpr("id a", "id / 2 b") > df.withColumn("c", my_func(df.a, df.b)) > {code} > > > > *Optional Design Sketch* > I’m more concerned about getting proper feedback for API design. The > implementation should be pretty straightforward and is not a huge concern at > this point. We can leverage the same implementation for faster toPandas > (using Arrow). > > > *Optional Rejected Designs* > See above. > > > > -- This message was sent by Atlassian JIRA (v6.4.14#64029) ---
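The workaround under discussion — giving a logically zero-argument UDF a dummy column so it can infer the batch length — can be mimicked in plain Python. No pandas or Spark here: `pandas_udf_like` is a stand-in for the `@pandas_udf(LongType())` decorator in the comment above, and lists stand in for `pd.Series` batches.

```python
def pandas_udf_like(func):
    # Stand-in for @pandas_udf(LongType()): just tags the function as
    # vectorized; the real decorator also records the return type.
    func.is_vectorized = True
    return func

@pandas_udf_like
def f0(v):
    # v is the dummy column (what F.lit(0) expands to per batch); only
    # its length matters, so the output matches the batch size.
    return [1] * len(v)

batch = [0, 0, 0, 0]
result = f0(batch)          # one output row per input row
```

The point of the pattern is exactly what Li Jin describes: the engine requires output length to equal input length, so a "constant" UDF still needs some column from which to read the batch size.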
[jira] [Commented] (SPARK-21829) Prevent running executors/tasks on a user-specified list of cluster nodes
[ https://issues.apache.org/jira/browse/SPARK-21829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16140065#comment-16140065 ] Luca Canali commented on SPARK-21829: - https://github.com/apache/spark/pull/19039 > Prevent running executors/tasks on a user-specified list of cluster nodes > - > > Key: SPARK-21829 > URL: https://issues.apache.org/jira/browse/SPARK-21829 > Project: Spark > Issue Type: New Feature > Components: Scheduler, Spark Core >Affects Versions: 2.1.1, 2.2.0 >Reporter: Luca Canali >Priority: Minor > > The idea for this proposal comes from a performance incident in a local > cluster where a job was found to be very slow because of a long tail of stragglers > due to 2 nodes in the cluster being slow to access a remote filesystem. > The issue was limited to the 2 machines and was related to external > configurations: the 2 machines that performed badly when accessing the remote > file system were behaving normally for other jobs in the cluster (a shared > YARN cluster). > With this new feature I propose to introduce a mechanism to allow users to > specify a list of nodes in the cluster where executors/tasks should not run > for a specific job. > The proposed implementation that I tested (see PR) uses the Spark blacklist > mechanism. With the parameter spark.blacklist.alwaysBlacklistedNodes, a list > of user-specified nodes is added to the blacklist at the start of the Spark > Context and it is never expired. > I have tested this on a YARN cluster on a case taken from the original > production problem and I confirm a performance improvement of about 5x for > the specific test case I have. I imagine that there can be other cases where > Spark users may want to blacklist a set of nodes. This can be used for > troubleshooting, including cases where certain nodes/executors are slow for a > given workload and this is caused by external agents, so the anomaly is not > picked up by the cluster manager. 
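Conceptually, the proposed spark.blacklist.alwaysBlacklistedNodes setting reduces to filtering a static, never-expiring set of hosts out of every set of candidate executors. A minimal sketch of that idea follows; the function name and configuration value are hypothetical, not code from the PR.

```python
def schedulable_hosts(offered, always_blacklisted):
    """Drop user-blacklisted hosts from the candidate set.

    Unlike failure-based blacklisting, this list is fixed at Spark
    Context start and never expires.
    """
    banned = {h.strip() for h in always_blacklisted.split(",") if h.strip()}
    return [h for h in offered if h not in banned]

# Hypothetical setting value naming the two slow nodes:
conf = "node7.example.com,node9.example.com"
usable = schedulable_hosts(
    ["node1.example.com", "node7.example.com", "node9.example.com"], conf)
```

The design choice is that the exclusion is per-job, not cluster-wide: the slow nodes remain in YARN for other workloads, which is exactly the scenario described above.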
[jira] [Reopened] (SPARK-21527) Use buffer limit in order to take advantage of JAVA NIO Util's buffercache
[ https://issues.apache.org/jira/browse/SPARK-21527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang reopened SPARK-21527: -- > Use buffer limit in order to take advantage of JAVA NIO Util's buffercache > --- > > Key: SPARK-21527 > URL: https://issues.apache.org/jira/browse/SPARK-21527 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: zhoukang > > Right now, ChunkedByteBuffer#writeFully does not slice bytes first. We observe > this code in java.nio's Util below: > {code:java} > public static ByteBuffer getTemporaryDirectBuffer(int size) { > BufferCache cache = bufferCache.get(); > ByteBuffer buf = cache.get(size); > if (buf != null) { > return buf; > } else { > // No suitable buffer in the cache so we need to allocate a new > // one. To avoid the cache growing then we remove the first > // buffer from the cache and free it. > if (!cache.isEmpty()) { > buf = cache.removeFirst(); > free(buf); > } > return ByteBuffer.allocateDirect(size); > } > } > {code} > If we slice first with a fixed size, we can use the buffer cache and only need to > allocate at the first write call. > Since we allocate a new buffer, we cannot control when this > buffer is freed. This once caused a memory issue in our production cluster. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21527) Use buffer limit in order to take advantage of JAVA NIO Util's buffercache
[ https://issues.apache.org/jira/browse/SPARK-21527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16140062#comment-16140062 ] zhoukang commented on SPARK-21527: -- https://github.com/apache/spark/pull/18730 > Use buffer limit in order to take advantage of JAVA NIO Util's buffercache > --- > > Key: SPARK-21527 > URL: https://issues.apache.org/jira/browse/SPARK-21527 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: zhoukang > > Right now, ChunkedByteBuffer#writeFully does not slice bytes first. We observe > this code in java.nio's Util below: > {code:java} > public static ByteBuffer getTemporaryDirectBuffer(int size) { > BufferCache cache = bufferCache.get(); > ByteBuffer buf = cache.get(size); > if (buf != null) { > return buf; > } else { > // No suitable buffer in the cache so we need to allocate a new > // one. To avoid the cache growing then we remove the first > // buffer from the cache and free it. > if (!cache.isEmpty()) { > buf = cache.removeFirst(); > free(buf); > } > return ByteBuffer.allocateDirect(size); > } > } > {code} > If we slice first with a fixed size, we can use the buffer cache and only need to > allocate at the first write call. > Since we allocate a new buffer, we cannot control when this > buffer is freed. This once caused a memory issue in our production cluster. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
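The proposed fix — slicing writes into a fixed chunk size so the NIO per-thread buffer cache can serve every write from one cached buffer instead of allocating a buffer as large as each payload — can be illustrated with a plain-Python sketch. The helper name and the chunk size are illustrative only.

```python
def write_fully(data: bytes, chunk_size: int):
    """Emit fixed-size slices of the payload.

    A real channel write would then reuse a single cached direct buffer
    of chunk_size bytes for every slice, rather than allocating (and
    never freeing on demand) one direct buffer per distinct write size.
    """
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

chunks = write_fully(b"x" * 10, 4)   # slices of 4, 4, and 2 bytes
```

Because every slice but the last has the same size, `getTemporaryDirectBuffer(size)` hits the cache after the first call — which is the allocation behavior the report asks for.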
[jira] [Assigned] (SPARK-21759) In.checkInputDataTypes should not wrongly report unresolved plans for IN correlated subquery
[ https://issues.apache.org/jira/browse/SPARK-21759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-21759: --- Assignee: Liang-Chi Hsieh > In.checkInputDataTypes should not wrongly report unresolved plans for IN > correlated subquery > > > Key: SPARK-21759 > URL: https://issues.apache.org/jira/browse/SPARK-21759 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh > Fix For: 2.3.0 > > > With the check for structural integrity proposed in SPARK-21726, I found that > an optimization rule {{PullupCorrelatedPredicates}} can produce unresolved > plans. > For a correlated IN query like: > {code} > Project [a#0] > +- Filter a#0 IN (list#4 [b#1]) >: +- Project [c#2] >: +- Filter (outer(b#1) < d#3) >:+- LocalRelation , [c#2, d#3] >+- LocalRelation , [a#0, b#1] > {code} > After {{PullupCorrelatedPredicates}}, it produces query plan like: > {code} > 'Project [a#0] > +- 'Filter a#0 IN (list#4 [(b#1 < d#3)]) >: +- Project [c#2, d#3] >: +- LocalRelation , [c#2, d#3] >+- LocalRelation , [a#0, b#1] > {code} > Because the correlated predicate involves another attribute {{d#3}} in > subquery, it has been pulled out and added into the {{Project}} on the top of > the subquery. > When {{list}} in {{In}} contains just one {{ListQuery}}, > {{In.checkInputDataTypes}} checks if the size of {{value}} expressions > matches the output size of subquery. In the above example, there is only > {{value}} expression and the subquery output has two attributes {{c#2, d#3}}, > so it fails the check and {{In.resolved}} returns {{false}}. > We should not let {{In.checkInputDataTypes}} wrongly report unresolved plans > to fail the structural integrity check. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21759) In.checkInputDataTypes should not wrongly report unresolved plans for IN correlated subquery
[ https://issues.apache.org/jira/browse/SPARK-21759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-21759. - Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 18968 [https://github.com/apache/spark/pull/18968] > In.checkInputDataTypes should not wrongly report unresolved plans for IN > correlated subquery > > > Key: SPARK-21759 > URL: https://issues.apache.org/jira/browse/SPARK-21759 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Liang-Chi Hsieh > Fix For: 2.3.0 > > > With the check for structural integrity proposed in SPARK-21726, I found that > an optimization rule {{PullupCorrelatedPredicates}} can produce unresolved > plans. > For a correlated IN query like: > {code} > Project [a#0] > +- Filter a#0 IN (list#4 [b#1]) >: +- Project [c#2] >: +- Filter (outer(b#1) < d#3) >:+- LocalRelation , [c#2, d#3] >+- LocalRelation , [a#0, b#1] > {code} > After {{PullupCorrelatedPredicates}}, it produces query plan like: > {code} > 'Project [a#0] > +- 'Filter a#0 IN (list#4 [(b#1 < d#3)]) >: +- Project [c#2, d#3] >: +- LocalRelation , [c#2, d#3] >+- LocalRelation , [a#0, b#1] > {code} > Because the correlated predicate involves another attribute {{d#3}} in > subquery, it has been pulled out and added into the {{Project}} on the top of > the subquery. > When {{list}} in {{In}} contains just one {{ListQuery}}, > {{In.checkInputDataTypes}} checks if the size of {{value}} expressions > matches the output size of subquery. In the above example, there is only > {{value}} expression and the subquery output has two attributes {{c#2, d#3}}, > so it fails the check and {{In.resolved}} returns {{false}}. > We should not let {{In.checkInputDataTypes}} wrongly report unresolved plans > to fail the structural integrity check. 
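The failing check can be stated compactly: In.checkInputDataTypes requires the number of value expressions to match the subquery's output arity, and PullupCorrelatedPredicates widens the subquery output from [c#2] to [c#2, d#3]. A toy version of the arity check (plain Python, not Catalyst code) makes the false negative concrete:

```python
def in_check_passes(num_values, subquery_output):
    # In.checkInputDataTypes-style check: one value expression must line
    # up with each output attribute of the IN subquery.
    return num_values == len(subquery_output)

# Before the rule: `a#0 IN (subquery outputting [c#2])` -> check passes.
before = in_check_passes(1, ["c#2"])
# After pull-up, the extra correlated attribute d#3 is projected out of
# the subquery, so the same single-value IN now "fails" the check even
# though the plan is logically fine.
after = in_check_passes(1, ["c#2", "d#3"])
```

This is why In.resolved flips to false and the structural-integrity check of SPARK-21726 flags the plan as unresolved.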
[jira] [Commented] (SPARK-21829) Prevent running executors/tasks on a user-specified list of cluster nodes
[ https://issues.apache.org/jira/browse/SPARK-21829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16140041#comment-16140041 ] Luca Canali commented on SPARK-21829: - I think what I am addressing may be a rare case (but it actually happened to us): we are on a shared YARN cluster and we have a workload that runs slowly on a couple of nodes, however the nodes are fine for other types of jobs, so we want to keep them in the cluster. The actual problem comes from reading from an external file system, and apparently only for this specific workload (which is only one of many workloads run on this cluster). What I have done as a workaround to make the job run faster is just killing the executors on the 2 "slow nodes"; the job could then finish faster, as it avoided the painfully slow long tail of execution on the affected nodes. The proposed patch/feature is an attempt to address this case in a softer way than going onto the nodes and killing executors. > Prevent running executors/tasks on a user-specified list of cluster nodes > - > > Key: SPARK-21829 > URL: https://issues.apache.org/jira/browse/SPARK-21829 > Project: Spark > Issue Type: New Feature > Components: Scheduler, Spark Core >Affects Versions: 2.1.1, 2.2.0 >Reporter: Luca Canali >Priority: Minor > > The idea for this proposal comes from a performance incident in a local > cluster where a job was found to be very slow because of a long tail of stragglers > due to 2 nodes in the cluster being slow to access a remote filesystem. > The issue was limited to the 2 machines and was related to external > configurations: the 2 machines that performed badly when accessing the remote > file system were behaving normally for other jobs in the cluster (a shared > YARN cluster). > With this new feature I propose to introduce a mechanism to allow users to > specify a list of nodes in the cluster where executors/tasks should not run > for a specific job. 
> The proposed implementation that I tested (see PR) uses the Spark blacklist > mechanism. With the parameter spark.blacklist.alwaysBlacklistedNodes, a list > of user-specified nodes is added to the blacklist at the start of the Spark > Context and it is never expired. > I have tested this on a YARN cluster on a case taken from the original > production problem and I confirm a performance improvement of about 5x for > the specific test case I have. I imagine that there can be other cases where > Spark users may want to blacklist a set of nodes. This can be used for > troubleshooting, including cases where certain nodes/executors are slow for a > given workload and this is caused by external agents, so the anomaly is not > picked up by the cluster manager. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21829) Prevent running executors/tasks on a user-specified list of cluster nodes
[ https://issues.apache.org/jira/browse/SPARK-21829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16140030#comment-16140030 ] Sean Owen commented on SPARK-21829: --- Why not just take them out of the resource manager entirely? This sounds like a difficult way to try to reimplement that in Spark. > Prevent running executors/tasks on a user-specified list of cluster nodes > - > > Key: SPARK-21829 > URL: https://issues.apache.org/jira/browse/SPARK-21829 > Project: Spark > Issue Type: New Feature > Components: Scheduler, Spark Core >Affects Versions: 2.1.1, 2.2.0 >Reporter: Luca Canali >Priority: Minor > > The idea for this proposal comes from a performance incident in a local > cluster where a job was found very slow because of a log tail of stragglers > due to 2 nodes in the cluster being slow to access a remote filesystem. > The issue was limited to the 2 machines and was related to external > configurations: the 2 machines that performed badly when accessing the remote > file system were behaving normally for other jobs in the cluster (a shared > YARN cluster). > With this new feature I propose to introduce a mechanism to allow users to > specify a list of nodes in the cluster where executors/tasks should not run > for a specific job. > The proposed implementation that I tested (see PR) uses the Spark blacklist > mechanism. With the parameter spark.blacklist.alwaysBlacklistedNodes, a list > of user-specified nodes is added to the blacklist at the start of the Spark > Context and it is never expired. > I have tested this on a YARN cluster on a case taken from the original > production problem and I confirm a performance improvement of about 5x for > the specific test case I have. I imagine that there can be other cases where > Spark users may want to blacklist a set of nodes. 
This can be used for > troubleshooting, including cases where certain nodes/executors are slow for a > given workload and this is caused by external agents, so the anomaly is not > picked up by the cluster manager. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21829) Prevent running executors/tasks on a user-specified list of cluster nodes
Luca Canali created SPARK-21829: --- Summary: Prevent running executors/tasks on a user-specified list of cluster nodes Key: SPARK-21829 URL: https://issues.apache.org/jira/browse/SPARK-21829 Project: Spark Issue Type: New Feature Components: Scheduler, Spark Core Affects Versions: 2.2.0, 2.1.1 Reporter: Luca Canali Priority: Minor The idea for this proposal comes from a performance incident in a local cluster where a job was found to be very slow because of a long tail of stragglers due to 2 nodes in the cluster being slow to access a remote filesystem. The issue was limited to the 2 machines and was related to external configurations: the 2 machines that performed badly when accessing the remote file system were behaving normally for other jobs in the cluster (a shared YARN cluster). With this new feature I propose to introduce a mechanism to allow users to specify a list of nodes in the cluster where executors/tasks should not run for a specific job. The proposed implementation that I tested (see PR) uses the Spark blacklist mechanism. With the parameter spark.blacklist.alwaysBlacklistedNodes, a list of user-specified nodes is added to the blacklist at the start of the Spark Context and it is never expired. I have tested this on a YARN cluster on a case taken from the original production problem and I confirm a performance improvement of about 5x for the specific test case I have. I imagine that there can be other cases where Spark users may want to blacklist a set of nodes. This can be used for troubleshooting, including cases where certain nodes/executors are slow for a given workload and this is caused by external agents, so the anomaly is not picked up by the cluster manager. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21745) Refactor ColumnVector hierarchy to make ColumnVector read-only and to introduce WritableColumnVector.
[ https://issues.apache.org/jira/browse/SPARK-21745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-21745. - Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 18958 [https://github.com/apache/spark/pull/18958] > Refactor ColumnVector hierarchy to make ColumnVector read-only and to > introduce WritableColumnVector. > - > > Key: SPARK-21745 > URL: https://issues.apache.org/jira/browse/SPARK-21745 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Takuya Ueshin > Fix For: 2.3.0 > > > This is a refactoring of {{ColumnVector}} hierarchy and related classes. > # make {{ColumnVector}} read-only > # introduce {{WritableColumnVector}} with write interface > # remove {{ReadOnlyColumnVector}} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21822) When insert Hive Table is finished, it is better to clean out the tmpLocation dir
[ https://issues.apache.org/jira/browse/SPARK-21822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-21822: -- Priority: Minor (was: Major) [~figo] although the exact meanings of priorities are a little debatable, please don't reverse a change a committer has made. You haven't identified a realistic problem that this even creates, so I can't see that it's significant. You also have not proposed a pull request. > When insert Hive Table is finished, it is better to clean out the tmpLocation > dir > - > > Key: SPARK-21822 > URL: https://issues.apache.org/jira/browse/SPARK-21822 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: lufei >Priority: Minor > > When insert Hive Table is finished, it is better to clean out the tmpLocation > dir(the temp directories like > ".hive-staging_hive_2017-08-19_10-56-01_540_5448395226195533570-9/-ext-1" > or "/tmp/hive/..." for an old spark version). > Otherwise, when lots of spark job are executed, millions of temporary > directories are left in HDFS. And these temporary directories can only be > deleted by the maintainer through the shell script. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21745) Refactor ColumnVector hierarchy to make ColumnVector read-only and to introduce WritableColumnVector.
[ https://issues.apache.org/jira/browse/SPARK-21745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-21745: --- Assignee: Takuya Ueshin > Refactor ColumnVector hierarchy to make ColumnVector read-only and to > introduce WritableColumnVector. > - > > Key: SPARK-21745 > URL: https://issues.apache.org/jira/browse/SPARK-21745 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin > Fix For: 2.3.0 > > > This is a refactoring of {{ColumnVector}} hierarchy and related classes. > # make {{ColumnVector}} read-only > # introduce {{WritableColumnVector}} with write interface > # remove {{ReadOnlyColumnVector}} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21828) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB...again
[ https://issues.apache.org/jira/browse/SPARK-21828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Smart updated SPARK-21828: --- Shepherd: Kazuaki Ishizaki (was: Liwei Lin) > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" > grows beyond 64 KB...again > - > > Key: SPARK-21828 > URL: https://issues.apache.org/jira/browse/SPARK-21828 > Project: Spark > Issue Type: Bug > Components: ML, SQL >Affects Versions: 2.2.0 >Reporter: Otis Smart >Priority: Critical > > Hello! > 1. I encounter a similar issue (see below text) on Pyspark 2.2 (e.g., > dataframe with ~5 rows x 1100+ columns as input to ".fit()" method of > CrossValidator() that includes Pipeline() that includes StringIndexer(), > VectorAssembler() and DecisionTreeClassifier()). > 2. Was the aforementioned patch (aka > fix(https://github.com/apache/spark/pull/15480) not included in the latest > release; what are the reason and (source) of and solution to this persistent > issue please? > py4j.protocol.Py4JJavaError: An error occurred while calling o9396.fit. 
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 38 > in stage 18.0 failed 4 times, most recent failure: Lost task 38.3 in stage > 18.0 (TID 1996, ip-10-0-14-83.ec2.internal, executor 4): > java.util.concurrent.ExecutionException: java.lang.Exception: failed to > compile: org.codehaus.janino.JaninoRuntimeException: Code of method > "compare(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I" > of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" > grows beyond 64 KB > /* 001 */ public SpecificOrdering generate(Object[] references) > { /* 002 */ return new SpecificOrdering(references); /* 003 */ } > /* 004 */ > /* 005 */ class SpecificOrdering extends > org.apache.spark.sql.catalyst.expressions.codegen.BaseOrdering { > /* 006 */ > /* 007 */ private Object[] references; > /* 008 */ > /* 009 */ > /* 010 */ public SpecificOrdering(Object[] references) > { /* 011 */ this.references = references; /* 012 */ /* 013 */ } > /* 014 */ > /* 015 */ > /* 016 */ > /* 017 */ public int compare(InternalRow a, InternalRow b) { > /* 018 */ InternalRow i = null; // Holds current row being evaluated. 
> /* 019 */ > /* 020 */ i = a; > /* 021 */ boolean isNullA; > /* 022 */ double primitiveA; > /* 023 */ > { /* 024 */ /* 025 */ double value = i.getDouble(0); /* 026 */ isNullA = > false; /* 027 */ primitiveA = value; /* 028 */ } > /* 029 */ i = b; > /* 030 */ boolean isNullB; > /* 031 */ double primitiveB; > /* 032 */ > { /* 033 */ /* 034 */ double value = i.getDouble(0); /* 035 */ isNullB = > false; /* 036 */ primitiveB = value; /* 037 */ } > /* 038 */ if (isNullA && isNullB) > { /* 039 */ // Nothing /* 040 */ } > else if (isNullA) > { /* 041 */ return -1; /* 042 */ } > else if (isNullB) > { /* 043 */ return 1; /* 044 */ } > else { > /* 045 */ int comp = > org.apache.spark.util.Utils.nanSafeCompareDoubles(primitiveA, primitiveB); > /* 046 */ if (comp != 0) > { /* 047 */ return comp; /* 048 */ } > /* 049 */ } > /* 050 */ > /* 051 */ > ... -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21828) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB...again
[ https://issues.apache.org/jira/browse/SPARK-21828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Smart updated SPARK-21828: --- Component/s: SQL > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" > grows beyond 64 KB...again > - > > Key: SPARK-21828 > URL: https://issues.apache.org/jira/browse/SPARK-21828 > Project: Spark > Issue Type: Bug > Components: ML, SQL >Affects Versions: 2.2.0 >Reporter: Otis Smart >Priority: Critical > > Hello! > 1. I encounter a similar issue (see below text) on Pyspark 2.2 (e.g., > dataframe with ~5 rows x 1100+ columns as input to ".fit()" method of > CrossValidator() that includes Pipeline() that includes StringIndexer(), > VectorAssembler() and DecisionTreeClassifier()). > 2. Was the aforementioned patch (aka > fix(https://github.com/apache/spark/pull/15480) not included in the latest > release; what are the reason and (source) of and solution to this persistent > issue please? > py4j.protocol.Py4JJavaError: An error occurred while calling o9396.fit. 
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 38 > in stage 18.0 failed 4 times, most recent failure: Lost task 38.3 in stage > 18.0 (TID 1996, ip-10-0-14-83.ec2.internal, executor 4): > java.util.concurrent.ExecutionException: java.lang.Exception: failed to > compile: org.codehaus.janino.JaninoRuntimeException: Code of method > "compare(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I" > of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" > grows beyond 64 KB > /* 001 */ public SpecificOrdering generate(Object[] references) > { /* 002 */ return new SpecificOrdering(references); /* 003 */ } > /* 004 */ > /* 005 */ class SpecificOrdering extends > org.apache.spark.sql.catalyst.expressions.codegen.BaseOrdering { > /* 006 */ > /* 007 */ private Object[] references; > /* 008 */ > /* 009 */ > /* 010 */ public SpecificOrdering(Object[] references) > { /* 011 */ this.references = references; /* 012 */ /* 013 */ } > /* 014 */ > /* 015 */ > /* 016 */ > /* 017 */ public int compare(InternalRow a, InternalRow b) { > /* 018 */ InternalRow i = null; // Holds current row being evaluated. 
> /* 019 */ > /* 020 */ i = a; > /* 021 */ boolean isNullA; > /* 022 */ double primitiveA; > /* 023 */ > { /* 024 */ /* 025 */ double value = i.getDouble(0); /* 026 */ isNullA = > false; /* 027 */ primitiveA = value; /* 028 */ } > /* 029 */ i = b; > /* 030 */ boolean isNullB; > /* 031 */ double primitiveB; > /* 032 */ > { /* 033 */ /* 034 */ double value = i.getDouble(0); /* 035 */ isNullB = > false; /* 036 */ primitiveB = value; /* 037 */ } > /* 038 */ if (isNullA && isNullB) > { /* 039 */ // Nothing /* 040 */ } > else if (isNullA) > { /* 041 */ return -1; /* 042 */ } > else if (isNullB) > { /* 043 */ return 1; /* 044 */ } > else { > /* 045 */ int comp = > org.apache.spark.util.Utils.nanSafeCompareDoubles(primitiveA, primitiveB); > /* 046 */ if (comp != 0) > { /* 047 */ return comp; /* 048 */ } > /* 049 */ } > /* 050 */ > /* 051 */ > ... -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21828) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB...again
Otis Smart created SPARK-21828: -- Summary: org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB...again Key: SPARK-21828 URL: https://issues.apache.org/jira/browse/SPARK-21828 Project: Spark Issue Type: Bug Components: ML Affects Versions: 2.2.0 Reporter: Otis Smart Priority: Critical Hello! 1. I encounter a similar issue (see below text) on Pyspark 2.2 (e.g., dataframe with ~5 rows x 1100+ columns as input to ".fit()" method of CrossValidator() that includes Pipeline() that includes StringIndexer(), VectorAssembler() and DecisionTreeClassifier()). 2. Was the aforementioned patch (aka fix(https://github.com/apache/spark/pull/15480) not included in the latest release; what are the reason and (source) of and solution to this persistent issue please? py4j.protocol.Py4JJavaError: An error occurred while calling o9396.fit. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 38 in stage 18.0 failed 4 times, most recent failure: Lost task 38.3 in stage 18.0 (TID 1996, ip-10-0-14-83.ec2.internal, executor 4): java.util.concurrent.ExecutionException: java.lang.Exception: failed to compile: org.codehaus.janino.JaninoRuntimeException: Code of method "compare(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB /* 001 */ public SpecificOrdering generate(Object[] references) { /* 002 */ return new SpecificOrdering(references); /* 003 */ } /* 004 */ /* 005 */ class SpecificOrdering extends org.apache.spark.sql.catalyst.expressions.codegen.BaseOrdering { /* 006 */ /* 007 */ private Object[] references; /* 008 */ /* 009 */ /* 010 */ public SpecificOrdering(Object[] references) { /* 011 */ this.references = references; /* 012 */ /* 013 */ } /* 014 */ /* 015 */ /* 016 */ /* 017 */ public int compare(InternalRow a, InternalRow b) { /* 018 */ InternalRow i = null; // Holds current 
row being evaluated. /* 019 */ /* 020 */ i = a; /* 021 */ boolean isNullA; /* 022 */ double primitiveA; /* 023 */ { /* 024 */ /* 025 */ double value = i.getDouble(0); /* 026 */ isNullA = false; /* 027 */ primitiveA = value; /* 028 */ } /* 029 */ i = b; /* 030 */ boolean isNullB; /* 031 */ double primitiveB; /* 032 */ { /* 033 */ /* 034 */ double value = i.getDouble(0); /* 035 */ isNullB = false; /* 036 */ primitiveB = value; /* 037 */ } /* 038 */ if (isNullA && isNullB) { /* 039 */ // Nothing /* 040 */ } else if (isNullA) { /* 041 */ return -1; /* 042 */ } else if (isNullB) { /* 043 */ return 1; /* 044 */ } else { /* 045 */ int comp = org.apache.spark.util.Utils.nanSafeCompareDoubles(primitiveA, primitiveB); /* 046 */ if (comp != 0) { /* 047 */ return comp; /* 048 */ } /* 049 */ } /* 050 */ /* 051 */ ... -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
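The failure above follows from simple arithmetic: the generated SpecificOrdering.compare() method grows roughly linearly with the number of sort columns, while the JVM caps any single method body at 64 KB of bytecode. A back-of-the-envelope model of this, where the per-column and overhead byte counts are illustrative assumptions, not measured values:

```python
# JVM limit on a single method body (the Code attribute's code_length must
# be below 65536 bytes).
JVM_METHOD_LIMIT = 64 * 1024

def estimated_compare_size(num_columns, bytes_per_column=70, overhead=200):
    """Crude linear model of the generated compare() bytecode size.
    The constants are assumptions for illustration, not measurements."""
    return overhead + num_columns * bytes_per_column

def fits_in_one_method(num_columns):
    """Whether the modeled method body stays under the JVM's 64 KB cap."""
    return estimated_compare_size(num_columns) <= JVM_METHOD_LIMIT
```

Under these assumed constants a few hundred columns still fits, but the ~1100 columns in this report does not, which is why the fix referenced above (PR 15480) splits the generated comparison code across multiple methods.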
[jira] [Updated] (SPARK-21822) When insert Hive Table is finished, it is better to clean out the tmpLocation dir
[ https://issues.apache.org/jira/browse/SPARK-21822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lufei updated SPARK-21822: -- Priority: Major (was: Minor) > When insert Hive Table is finished, it is better to clean out the tmpLocation > dir > - > > Key: SPARK-21822 > URL: https://issues.apache.org/jira/browse/SPARK-21822 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: lufei > > When insert Hive Table is finished, it is better to clean out the tmpLocation > dir(the temp directories like > ".hive-staging_hive_2017-08-19_10-56-01_540_5448395226195533570-9/-ext-1" > or "/tmp/hive/..." for an old spark version). > Otherwise, when lots of spark job are executed, millions of temporary > directories are left in HDFS. And these temporary directories can only be > deleted by the maintainer through the shell script. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
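The manual cleanup the description mentions can be sketched outside Spark. The pattern below matches the staging-directory names quoted above; this is an illustrative helper (not part of Spark), and a real cleanup script would also filter on modification time so directories of still-running jobs are not removed.

```python
import fnmatch

def stale_staging_dirs(listing):
    """Pick out Hive staging temp directories (the '.hive-staging_hive_*'
    naming quoted in the issue description) from a directory listing.

    A production cleanup should additionally check each directory's age
    before deleting, to avoid removing directories of in-flight jobs."""
    return [path for path in listing
            if fnmatch.fnmatch(path.rsplit("/", 1)[-1], ".hive-staging_hive_*")]
```

For example, applied to an HDFS listing of a table directory, it keeps only the leftover staging directories and ignores data files.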
[jira] [Commented] (SPARK-21797) spark cannot read partitioned data in S3 that are partly in glacier
[ https://issues.apache.org/jira/browse/SPARK-21797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16139986#comment-16139986 ] Sean Owen commented on SPARK-21797: --- Sure, but in all events, this is an operation that is fine with Spark, but not fine with something between the AWS SDK and AWS. It's not something Spark can fix. If source data is in S3, there's no way to avoid copying it from S3. Intermediate data produced by Spark can't live on S3 as it's too eventually consistent. Some final result could. And yeah, you pay to read/write S3, so in some use cases it might be more economical to keep intensely read/written data close to the compute workers for a time, rather than write/read to S3 between several closely related jobs. > spark cannot read partitioned data in S3 that are partly in glacier > --- > > Key: SPARK-21797 > URL: https://issues.apache.org/jira/browse/SPARK-21797 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Boris Clémençon > Labels: glacier, partitions, read, s3 > > I have a dataset in parquet in S3 partitioned by date (dt) with the oldest dates > stored in AWS Glacier to save some money. For instance, we have... > {noformat} > s3://my-bucket/my-dataset/dt=2017-07-01/[in glacier] > ... > s3://my-bucket/my-dataset/dt=2017-07-09/[in glacier] > s3://my-bucket/my-dataset/dt=2017-07-10/[not in glacier] > ... 
> s3://my-bucket/my-dataset/dt=2017-07-24/[not in glacier] > {noformat} > I want to read this dataset, but only a subset of the dates that are not yet in > glacier, e.g.: > {code:java} > val from = "2017-07-15" > val to = "2017-08-24" > val path = "s3://my-bucket/my-dataset/" > val X = spark.read.parquet(path).where(col("dt").between(from, to)) > {code} > Unfortunately, I have the exception > {noformat} > java.io.IOException: > com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: > The operation is not valid for the object's storage class (Service: Amazon > S3; Status Code: 403; Error Code: InvalidObjectState; Request ID: > C444D508B6042138) > {noformat} > It seems that Spark does not like partitioned datasets when some partitions are > in Glacier. I could always read each date specifically, add the column with the > current date and reduce(_ union _) at the end, but that is not pretty and it should > not be necessary. > Is there any tip to read available data in the datastore even with old data > in glacier? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
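The workaround the description alludes to (enumerating only the dates still in S3 instead of reading the table root) can be sketched as follows. The path layout matches the example above; passing the explicit list to spark.read.parquet with a basePath option is the assumed usage, not something this ticket confirms.

```python
from datetime import date, timedelta

def partition_paths(base, start, end):
    """Explicit dt=YYYY-MM-DD partition paths for [start, end], inclusive.

    Listing the wanted partitions explicitly keeps Spark from ever touching
    the older directories that have been migrated to Glacier."""
    days = (end - start).days
    return ["{}/dt={}".format(base, start + timedelta(days=i))
            for i in range(days + 1)]

paths = partition_paths("s3://my-bucket/my-dataset",
                        date(2017, 7, 15), date(2017, 7, 24))
# Assumed usage: pass the explicit paths, and set basePath so the dt
# partition column is still inferred, e.g.
#   spark.read.option("basePath", "s3://my-bucket/my-dataset").parquet(*paths)
```

This is essentially the "read each date specifically" approach the reporter describes, minus the per-date union, since the reader accepts multiple paths.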
[jira] [Comment Edited] (SPARK-21797) spark cannot read partitioned data in S3 that are partly in glacier
[ https://issues.apache.org/jira/browse/SPARK-21797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16139978#comment-16139978 ] Boris Clémençon edited comment on SPARK-21797 at 8/24/17 12:37 PM: Hi Steve, to be sure we understand each other, *I don't want to read data from Glacier*. Concretely, I have a dataset in parquet partitioned by date in S3, with an automatic rule that freezes the oldest dates into Glacier (and, a few months later, deletes them altogether). I want to read only the most recent dates that are still in S3 (in a lazy way), not in Glacier (see the example above), but even that I cannot do. Do we understand each other? Besides, why do you say that it is a niche use case? Reading partitioned data on S3 seems quite normal to me. And why do you say reading data from S3 is a "very, very expensive way to work with data"? According to our tests, reading from S3 is at most 20% slower than reading from HDFS, and we operate from within AWS with an EMR cluster, so we should not pay for data IO from S3. On the other hand, copying the dataset onto HDFS has a time overhead, and you need a large enough cluster with enough disk to store the whole dataset, or at least the relevant dates (whereas you may want to process a few columns, i.e. a fraction of the initial dataset). I would like your expertise on that. In any case, I understand your and Sean's argument, which says that it is up to AWS to solve the problem. was (Author: clemencb): Hi Steve, to be sure we understand each other, *I don't want to read data from Glacier*. Concretely, I have a dataset in parquet partitioned by date in S3, with a automatic rule that freeze oldest dates in Glacier (and a few months later, delete it altogether). I want to read only most recent dates that are still in S3 (in a lazy way), not in Glacier (see exemple above), but even that, I cannot do it. Do we understand each other? Besides, why do you say that it a niche use case? 
and why do you say reading data from S3 is a "very, very expensive way to work with data"? According to our tests, reading on S3 in maximum 20% slower than reading from HDFS, and we operate from within AWS with a EMR cluster, so we should not pay data IO from S3. On the other hand, copying the dataset on HDFS has a time overhead and you need a large enough cluster with enough disk to store the whole dataset, or at least the relevant dates (whereas you may want to process a few columns, ie a fraction of the initial dataset). I would like you expertise about that. In any case, I understand your and Sean's argument though which says that it is to AWS to solve the problem. > spark cannot read partitioned data in S3 that are partly in glacier > --- > > Key: SPARK-21797 > URL: https://issues.apache.org/jira/browse/SPARK-21797 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Boris Clémençon > Labels: glacier, partitions, read, s3 > > I have a dataset in parquet in S3 partitioned by date (dt) with oldest date > stored in AWS Glacier to save some money. For instance, we have... > {noformat} > s3://my-bucket/my-dataset/dt=2017-07-01/[in glacier] > ... > s3://my-bucket/my-dataset/dt=2017-07-09/[in glacier] > s3://my-bucket/my-dataset/dt=2017-07-10/[not in glacier] > ... 
> s3://my-bucket/my-dataset/dt=2017-07-24/[not in glacier] > {noformat} > I want to read this dataset, but only a subset of date that are not yet in > glacier, eg: > {code:java} > val from = "2017-07-15" > val to = "2017-08-24" > val path = "s3://my-bucket/my-dataset/" > val X = spark.read.parquet(path).where(col("dt").between(from, to)) > {code} > Unfortunately, I have the exception > {noformat} > java.io.IOException: > com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: > The operation is not valid for the object's storage class (Service: Amazon > S3; Status Code: 403; Error Code: InvalidObjectState; Request ID: > C444D508B6042138) > {noformat} > I seems that spark does not like partitioned dataset when some partitions are > in Glacier. I could always read specifically each date, add the column with > current date and reduce(_ union _) at the end, but not pretty and it should > not be necessary. > Is there any tip to read available data in the datastore even with old data > in glacier? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB
[ https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16139969#comment-16139969 ] Otis Smart edited comment on SPARK-16845 at 8/24/17 12:37 PM: -- Hello! 1. I encounter a similar issue (see below text) on Pyspark 2.2 (e.g., dataframe with ~5 rows x 1100+ columns as input to ".fit()" method of CrossValidator() that includes Pipeline() that includes StringIndexer(), VectorAssembler() and DecisionTreeClassifier()). 2. Was the aforementioned patch (aka fix) (https://github.com/apache/spark/pull/15480) not included in the latest release; what are the reason and (source) of and solution to this persistent issue please? py4j.protocol.Py4JJavaError: An error occurred while calling o9396.fit. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 38 in stage 18.0 failed 4 times, most recent failure: Lost task 38.3 in stage 18.0 (TID 1996, ip-10-0-14-83.ec2.internal, executor 4): java.util.concurrent.ExecutionException: java.lang.Exception: failed to compile: org.codehaus.janino.JaninoRuntimeException: Code of method "compare(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB /* 001 */ public SpecificOrdering generate(Object[] references) { /* 002 */ return new SpecificOrdering(references); /* 003 */ } /* 004 */ /* 005 */ class SpecificOrdering extends org.apache.spark.sql.catalyst.expressions.codegen.BaseOrdering { /* 006 */ /* 007 */ private Object[] references; /* 008 */ /* 009 */ /* 010 */ public SpecificOrdering(Object[] references) { /* 011 */ this.references = references; /* 012 */ /* 013 */ } /* 014 */ /* 015 */ /* 016 */ /* 017 */ public int compare(InternalRow a, InternalRow b) { /* 018 */ InternalRow i = null; // Holds current row being evaluated. 
/* 019 */ /* 020 */ i = a; /* 021 */ boolean isNullA; /* 022 */ double primitiveA; /* 023 */ { /* 024 */ /* 025 */ double value = i.getDouble(0); /* 026 */ isNullA = false; /* 027 */ primitiveA = value; /* 028 */ } /* 029 */ i = b; /* 030 */ boolean isNullB; /* 031 */ double primitiveB; /* 032 */ { /* 033 */ /* 034 */ double value = i.getDouble(0); /* 035 */ isNullB = false; /* 036 */ primitiveB = value; /* 037 */ } /* 038 */ if (isNullA && isNullB) { /* 039 */ // Nothing /* 040 */ } else if (isNullA) { /* 041 */ return -1; /* 042 */ } else if (isNullB) { /* 043 */ return 1; /* 044 */ } else { /* 045 */ int comp = org.apache.spark.util.Utils.nanSafeCompareDoubles(primitiveA, primitiveB); /* 046 */ if (comp != 0) { /* 047 */ return comp; /* 048 */ } /* 049 */ } /* 050 */ /* 051 */ was (Author: otissmart): Hello! 1. I encounter a similar issue (see below text) on Pyspark 2.2 (e.g., dataframe with ~5 rows x 1100+ columns as input to ".fit()" method of CrossValidator() that includes Pipeline() that includes StringIndexer(), VectorAssembler() and DecisionTreeClassifier()). 2. Was the aforementioned patch (aka fix(https://github.com/apache/spark/pull/15480) not included in the latest release; what are the reason and (source) of and solution to this persistent issue please? py4j.protocol.Py4JJavaError: An error occurred while calling o9396.fit. 
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 38 in stage 18.0 failed 4 times, most recent failure: Lost task 38.3 in stage 18.0 (TID 1996, ip-10-0-14-83.ec2.internal, executor 4): java.util.concurrent.ExecutionException: java.lang.Exception: failed to compile: org.codehaus.janino.JaninoRuntimeException: Code of method "compare(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB /* 001 */ public SpecificOrdering generate(Object[] references) { /* 002 */ return new SpecificOrdering(references); /* 003 */ } /* 004 */ /* 005 */ class SpecificOrdering extends org.apache.spark.sql.catalyst.expressions.codegen.BaseOrdering { /* 006 */ /* 007 */ private Object[] references; /* 008 */ /* 009 */ /* 010 */ public SpecificOrdering(Object[] references) { /* 011 */ this.references = references; /* 012 */ /* 013 */ } /* 014 */ /* 015 */ /* 016 */ /* 017 */ public int compare(InternalRow a, InternalRow b) { /* 018 */ InternalRow i = null; // Holds current row being evaluated. /* 019 */ /* 020 */ i = a; /* 021 */ boolean isNullA; /* 022 */ double primitiveA; /* 023 */ { /* 024 */ /* 025 */ double value = i.getDouble(0); /* 026 */ isNullA = false; /* 027 */ primitiveA = value; /* 028 */ } /* 029 */ i = b;
[jira] [Comment Edited] (SPARK-21797) spark cannot read partitioned data in S3 that are partly in glacier
[ https://issues.apache.org/jira/browse/SPARK-21797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16139978#comment-16139978 ] Boris Clémençon edited comment on SPARK-21797 at 8/24/17 12:37 PM: Hi Steve, to be sure we understand each other, *I don't want to read data from Glacier*. Concretely, I have a dataset in parquet partitioned by date in S3, with a automatic rule that freeze oldest dates in Glacier (and a few months later, delete it altogether). I want to read only most recent dates that are still in S3 (in a lazy way), not in Glacier (see exemple above), but even that, I cannot do it. Do we understand each other? Besides, why do you say that it a niche use case? Reading partitioned data on S3 seems quite normal to me. And why do you say reading data from S3 is a "very, very expensive way to work with data"? According to our tests, reading on S3 in maximum 20% slower than reading from HDFS, and we operate from within AWS with a EMR cluster, so we should not pay data IO from S3. On the other hand, copying the dataset on HDFS has a time overhead and you need a large enough cluster with enough disk to store the whole dataset, or at least the relevant dates (whereas you may want to process a few columns, ie a fraction of the initial dataset). I would like you expertise about that. In any case, I understand your and Sean's argument though which says that it is to AWS to solve the problem. was (Author: clemencb): Hi Steve, to be sure we understand each other, *I don't want to read data from Glacier*. Concretely, I have a dataset in parquet partitioned by date in S3, with a automatic rule that freeze oldest dates in Glacier (and a few months later, delete it altogether). I want to read only most recent dates that are still in S3 (in a lazy way), not in Glacier (see exemple above), but even that, I cannot do it. Do we understand each other? Besides, why do you say that it a niche use case? 
Reading partitioned data on S3 seems quite normal to me. and why do you say reading data from S3 is a "very, very expensive way to work with data"? According to our tests, reading on S3 in maximum 20% slower than reading from HDFS, and we operate from within AWS with a EMR cluster, so we should not pay data IO from S3. On the other hand, copying the dataset on HDFS has a time overhead and you need a large enough cluster with enough disk to store the whole dataset, or at least the relevant dates (whereas you may want to process a few columns, ie a fraction of the initial dataset). I would like you expertise about that. In any case, I understand your and Sean's argument though which says that it is to AWS to solve the problem. > spark cannot read partitioned data in S3 that are partly in glacier > --- > > Key: SPARK-21797 > URL: https://issues.apache.org/jira/browse/SPARK-21797 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Boris Clémençon > Labels: glacier, partitions, read, s3 > > I have a dataset in parquet in S3 partitioned by date (dt) with oldest date > stored in AWS Glacier to save some money. For instance, we have... > {noformat} > s3://my-bucket/my-dataset/dt=2017-07-01/[in glacier] > ... > s3://my-bucket/my-dataset/dt=2017-07-09/[in glacier] > s3://my-bucket/my-dataset/dt=2017-07-10/[not in glacier] > ... 
> s3://my-bucket/my-dataset/dt=2017-07-24/[not in glacier] > {noformat} > I want to read this dataset, but only a subset of date that are not yet in > glacier, eg: > {code:java} > val from = "2017-07-15" > val to = "2017-08-24" > val path = "s3://my-bucket/my-dataset/" > val X = spark.read.parquet(path).where(col("dt").between(from, to)) > {code} > Unfortunately, I have the exception > {noformat} > java.io.IOException: > com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: > The operation is not valid for the object's storage class (Service: Amazon > S3; Status Code: 403; Error Code: InvalidObjectState; Request ID: > C444D508B6042138) > {noformat} > I seems that spark does not like partitioned dataset when some partitions are > in Glacier. I could always read specifically each date, add the column with > current date and reduce(_ union _) at the end, but not pretty and it should > not be necessary. > Is there any tip to read available data in the datastore even with old data > in glacier? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21797) spark cannot read partitioned data in S3 that are partly in glacier
[ https://issues.apache.org/jira/browse/SPARK-21797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16139978#comment-16139978 ] Boris Clémençon edited comment on SPARK-21797 at 8/24/17 12:36 PM: Hi Steve, to be sure we understand each other, *I don't want to read data from Glacier*. Concretely, I have a dataset in parquet partitioned by date in S3, with a automatic rule that freeze oldest dates in Glacier (and a few months later, delete it altogether). I want to read only most recent dates that are still in S3 (in a lazy way), not in Glacier (see exemple above), but even that, I cannot do it. Do we understand each other? Besides, why do you say that it a niche use case? and why do you say reading data from S3 is a "very, very expensive way to work with data"? According to our tests, reading on S3 in maximum 20% slower than reading from HDFS, and we operate from within AWS with a EMR cluster, so we should not pay data IO from S3. On the other hand, copying the dataset on HDFS has a time overhead and you need a large enough cluster with enough disk to store the whole dataset, or at least the relevant dates (whereas you may want to process a few columns, ie a fraction of the initial dataset). I would like you expertise about that. In any case, I understand your and Sean's argument though which says that it is to AWS to solve the problem. was (Author: clemencb): Hi Steve, to be sure we understand each other, *I don't want to read data from Glacier*. Concretely, I have a dataset in parquet partitioned by date in S3, with a automatic rule that freeze oldest dates in Glacier (and a few months later, delete it altogether). I want to read only most recent dates that are still in S3 (in a lazy way), not in Glacier (see exemple above), but even that, I cannot do it. Do you understand each other? Besides, why do you say that it a niche use case? and why do you say reading data from S3 is a "very, very expensive way to work with data"? 
According to our tests, reading from S3 is at most 20% slower than reading from HDFS, and we operate from within AWS with an EMR cluster, so we should not pay for data IO from S3. On the other hand, copying the dataset to HDFS has a time overhead, and you need a cluster with enough disk to store the whole dataset, or at least the relevant dates (whereas you may want to process only a few columns, i.e. a fraction of the initial dataset). I would like your expertise on that. In any case, I understand your and Sean's argument that it is up to AWS to solve the problem. > spark cannot read partitioned data in S3 that are partly in glacier > --- > > Key: SPARK-21797 > URL: https://issues.apache.org/jira/browse/SPARK-21797 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Boris Clémençon > Labels: glacier, partitions, read, s3 > > I have a dataset in parquet in S3 partitioned by date (dt) with the oldest dates > stored in AWS Glacier to save some money. For instance, we have... > {noformat} > s3://my-bucket/my-dataset/dt=2017-07-01/[in glacier] > ... > s3://my-bucket/my-dataset/dt=2017-07-09/[in glacier] > s3://my-bucket/my-dataset/dt=2017-07-10/[not in glacier] > ... 
> s3://my-bucket/my-dataset/dt=2017-07-24/[not in glacier] > {noformat} > I want to read this dataset, but only the subset of dates that are not yet in > Glacier, e.g.: > {code:java} > val from = "2017-07-15" > val to = "2017-08-24" > val path = "s3://my-bucket/my-dataset/" > val X = spark.read.parquet(path).where(col("dt").between(from, to)) > {code} > Unfortunately, I get the exception > {noformat} > java.io.IOException: > com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: > The operation is not valid for the object's storage class (Service: Amazon > S3; Status Code: 403; Error Code: InvalidObjectState; Request ID: > C444D508B6042138) > {noformat} > It seems that Spark does not like a partitioned dataset when some partitions are > in Glacier. I could always read each date explicitly, add the column with the > current date and reduce(_ union _) at the end, but that is not pretty and it should > not be necessary. > Is there any tip for reading the available data in the datastore even with old data > in Glacier? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
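The workaround the reporter mentions (reading each available date explicitly, then unioning) can be sketched. The helper below only builds the explicit `dt=YYYY-MM-DD` partition paths for a date range, following the path layout shown in the issue; the function name is mine, and in Spark the resulting paths would then be read individually and unioned, which avoids ever touching the Glacier-archived partitions.

```python
from datetime import date, timedelta

def partition_paths(base, start, end):
    """Build explicit dt=YYYY-MM-DD partition paths for [start, end].

    Reading these paths directly (and unioning the results) avoids
    listing the Glacier-archived partitions outside the range.
    """
    days = (end - start).days
    return [f"{base.rstrip('/')}/dt={start + timedelta(days=i)}"
            for i in range(days + 1)]

# Example: a few of the still-in-S3 dates from the report
paths = partition_paths("s3://my-bucket/my-dataset",
                        date(2017, 7, 10), date(2017, 7, 12))
```

In a Spark job, each of these paths could then be passed to the Parquet reader and the resulting DataFrames unioned, roughly matching the `reduce(_ union _)` idea from the issue.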
[jira] [Comment Edited] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB
[ https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16139969#comment-16139969 ] Otis Smart edited comment on SPARK-16845 at 8/24/17 12:31 PM: -- Hello! 1. I encounter a similar issue (see the text below) on PySpark 2.2 (e.g., a dataframe with ~5 rows x 1100+ columns as input to the ".fit()" method of a CrossValidator() that includes a Pipeline() containing StringIndexer(), VectorAssembler() and DecisionTreeClassifier()). 2. Was the aforementioned patch (the fix in https://github.com/apache/spark/pull/15480) not included in the latest release? What is the reason for, and the solution to, this persistent issue? py4j.protocol.Py4JJavaError: An error occurred while calling o9396.fit. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 38 in stage 18.0 failed 4 times, most recent failure: Lost task 38.3 in stage 18.0 (TID 1996, ip-10-0-14-83.ec2.internal, executor 4): java.util.concurrent.ExecutionException: java.lang.Exception: failed to compile: org.codehaus.janino.JaninoRuntimeException: Code of method "compare(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB /* 001 */ public SpecificOrdering generate(Object[] references) { /* 002 */ return new SpecificOrdering(references); /* 003 */ } /* 004 */ /* 005 */ class SpecificOrdering extends org.apache.spark.sql.catalyst.expressions.codegen.BaseOrdering { /* 006 */ /* 007 */ private Object[] references; /* 008 */ /* 009 */ /* 010 */ public SpecificOrdering(Object[] references) { /* 011 */ this.references = references; /* 012 */ /* 013 */ } /* 014 */ /* 015 */ /* 016 */ /* 017 */ public int compare(InternalRow a, InternalRow b) { /* 018 */ InternalRow i = null; // Holds current row being evaluated. 
/* 019 */ /* 020 */ i = a; /* 021 */ boolean isNullA; /* 022 */ double primitiveA; /* 023 */ { /* 024 */ /* 025 */ double value = i.getDouble(0); /* 026 */ isNullA = false; /* 027 */ primitiveA = value; /* 028 */ } /* 029 */ i = b; /* 030 */ boolean isNullB; /* 031 */ double primitiveB; /* 032 */ { /* 033 */ /* 034 */ double value = i.getDouble(0); /* 035 */ isNullB = false; /* 036 */ primitiveB = value; /* 037 */ } /* 038 */ if (isNullA && isNullB) { /* 039 */ // Nothing /* 040 */ } else if (isNullA) { /* 041 */ return -1; /* 042 */ } else if (isNullB) { /* 043 */ return 1; /* 044 */ } else { /* 045 */ int comp = org.apache.spark.util.Utils.nanSafeCompareDoubles(primitiveA, primitiveB); /* 046 */ if (comp != 0) { /* 047 */ return comp; /* 048 */ } /* 049 */ } /* 050 */ /* 051 */ was (Author: otissmart): Hello! 1. I encounter a similar issue (see below text) on Pyspark 2.2. 2. Was the aforementioned patch (aka fix(https://github.com/apache/spark/pull/15480) not included in the latest release; what are the reason and (source) of and solution to this persistent issue please? py4j.protocol.Py4JJavaError: An error occurred while calling o9396.fit. 
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 38 in stage 18.0 failed 4 times, most recent failure: Lost task 38.3 in stage 18.0 (TID 1996, ip-10-0-14-83.ec2.internal, executor 4): java.util.concurrent.ExecutionException: java.lang.Exception: failed to compile: org.codehaus.janino.JaninoRuntimeException: Code of method "compare(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB /* 001 */ public SpecificOrdering generate(Object[] references) { /* 002 */ return new SpecificOrdering(references); /* 003 */ } /* 004 */ /* 005 */ class SpecificOrdering extends org.apache.spark.sql.catalyst.expressions.codegen.BaseOrdering { /* 006 */ /* 007 */ private Object[] references; /* 008 */ /* 009 */ /* 010 */ public SpecificOrdering(Object[] references) { /* 011 */ this.references = references; /* 012 */ /* 013 */ } /* 014 */ /* 015 */ /* 016 */ /* 017 */ public int compare(InternalRow a, InternalRow b) { /* 018 */ InternalRow i = null; // Holds current row being evaluated. /* 019 */ /* 020 */ i = a; /* 021 */ boolean isNullA; /* 022 */ double primitiveA; /* 023 */ { /* 024 */ /* 025 */ double value = i.getDouble(0); /* 026 */ isNullA = false; /* 027 */ primitiveA = value; /* 028 */ } /* 029 */ i = b; /* 030 */ boolean isNullB; /* 031 */ double primitiveB; /* 032 */ { /* 033 */ /* 034 */ double value = i.getDouble(0); /* 035 */ isNullB = false; /* 036 */ primitiveB = value;
[jira] [Commented] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB
[ https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16139969#comment-16139969 ] Otis Smart commented on SPARK-16845: Hello! 1. I encounter a similar issue (see the text below) on PySpark 2.2. 2. Was the aforementioned patch (the fix in https://github.com/apache/spark/pull/15480) not included in the latest release? What is the reason for, and the solution to, this persistent issue? py4j.protocol.Py4JJavaError: An error occurred while calling o9396.fit. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 38 in stage 18.0 failed 4 times, most recent failure: Lost task 38.3 in stage 18.0 (TID 1996, ip-10-0-14-83.ec2.internal, executor 4): java.util.concurrent.ExecutionException: java.lang.Exception: failed to compile: org.codehaus.janino.JaninoRuntimeException: Code of method "compare(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB /* 001 */ public SpecificOrdering generate(Object[] references) { /* 002 */ return new SpecificOrdering(references); /* 003 */ } /* 004 */ /* 005 */ class SpecificOrdering extends org.apache.spark.sql.catalyst.expressions.codegen.BaseOrdering { /* 006 */ /* 007 */ private Object[] references; /* 008 */ /* 009 */ /* 010 */ public SpecificOrdering(Object[] references) { /* 011 */ this.references = references; /* 012 */ /* 013 */ } /* 014 */ /* 015 */ /* 016 */ /* 017 */ public int compare(InternalRow a, InternalRow b) { /* 018 */ InternalRow i = null; // Holds current row being evaluated. 
/* 019 */ /* 020 */ i = a; /* 021 */ boolean isNullA; /* 022 */ double primitiveA; /* 023 */ { /* 024 */ /* 025 */ double value = i.getDouble(0); /* 026 */ isNullA = false; /* 027 */ primitiveA = value; /* 028 */ } /* 029 */ i = b; /* 030 */ boolean isNullB; /* 031 */ double primitiveB; /* 032 */ { /* 033 */ /* 034 */ double value = i.getDouble(0); /* 035 */ isNullB = false; /* 036 */ primitiveB = value; /* 037 */ } /* 038 */ if (isNullA && isNullB) { /* 039 */ // Nothing /* 040 */ } else if (isNullA) { /* 041 */ return -1; /* 042 */ } else if (isNullB) { /* 043 */ return 1; /* 044 */ } else { /* 045 */ int comp = org.apache.spark.util.Utils.nanSafeCompareDoubles(primitiveA, primitiveB); /* 046 */ if (comp != 0) { /* 047 */ return comp; /* 048 */ } /* 049 */ } /* 050 */ /* 051 */ > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" > grows beyond 64 KB > - > > Key: SPARK-16845 > URL: https://issues.apache.org/jira/browse/SPARK-16845 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: hejie >Assignee: Liwei Lin > Fix For: 1.6.4, 2.0.3, 2.1.1, 2.2.0 > > Attachments: error.txt.zip > > > I have a wide table(400 columns), when I try fitting the traindata on all > columns, the fatal error occurs. > ... 46 more > Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method > "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I" > of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" > grows beyond 64 KB > at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941) > at org.codehaus.janino.CodeContext.write(CodeContext.java:854) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
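The generated `compare()` above emits one such comparison block per ordering column, which is why very wide orderings (hundreds of columns) exceed the JVM's 64 KB per-method bytecode limit. For reference, here is a Python sketch of the NaN-safe double comparison each block delegates to; the semantics (NaN compares equal to itself and greater than any other double) are my reading of Spark's `Utils.nanSafeCompareDoubles`, not a verbatim port.

```python
import math

def nan_safe_compare_doubles(x, y):
    """Sketch of the per-column comparison in the generated ordering:
    NaN is treated as equal to NaN and greater than any other value,
    so sorting never produces the undefined results of raw float
    comparison with NaN."""
    x_nan, y_nan = math.isnan(x), math.isnan(y)
    if x_nan and y_nan:
        return 0
    if x_nan:
        return 1
    if y_nan:
        return -1
    # Ordinary three-way comparison for non-NaN doubles
    return (x > y) - (x < y)
```

The repeated per-column pattern, not this comparison itself, is what blows the method-size limit; the linked patch addresses that by splitting the generated code.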
[jira] [Commented] (SPARK-21825) improving "assert(exchanges.map(_.outputPartitioning.numPartitions)" in ExchangeCoordinatorSuite
[ https://issues.apache.org/jira/browse/SPARK-21825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16139957#comment-16139957 ] Sean Owen commented on SPARK-21825: --- I don't think this is even something we should merge. > improving "assert(exchanges.map(_.outputPartitioning.numPartitions)" in > ExchangeCoordinatorSuite > - > > Key: SPARK-21825 > URL: https://issues.apache.org/jira/browse/SPARK-21825 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 2.3.0 >Reporter: iamhumanbeing >Priority: Trivial > > ExchangeCoordinatorSuite.scala > Line 424: assert(exchanges.map(_.outputPartitioning.numPartitions).toSet === > Set(2, 3)) is not precise enough. > Change it to > assert(exchanges.map(_.outputPartitioning.numPartitions).toSeq === Seq(2, 2, > 2, 3)) > Line 476: > assert(exchanges.map(_.outputPartitioning.numPartitions).toSet === Set(5, 3)) > is not precise enough. > Change it to > assert(exchanges.map(_.outputPartitioning.numPartitions).toSeq > === Seq(5, 3, 5)) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
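The reason the proposed `toSeq` assertion is stronger than the existing `toSet` one: converting to a set discards both duplicates and order, so the set-based check cannot distinguish a wrong multiset of partition counts. A minimal illustration (in Python rather than the suite's Scala, same idea):

```python
# What the exchanges actually report in the test at line 424
num_partitions = [2, 2, 2, 3]

# Weak check: a set comparison collapses the three 2s into one
assert set(num_partitions) == {2, 3}

# Strong check: the sequence keeps multiplicity
assert num_partitions == [2, 2, 2, 3]

# A wrong result that the weak check would still accept
wrong = [2, 3, 3, 3]
assert set(wrong) == {2, 3}        # passes, despite being wrong
assert wrong != [2, 2, 2, 3]       # only the sequence check catches it
```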
[jira] [Commented] (SPARK-21825) improving "assert(exchanges.map(_.outputPartitioning.numPartitions)" in ExchangeCoordinatorSuite
[ https://issues.apache.org/jira/browse/SPARK-21825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16139953#comment-16139953 ] Nikhil Bhide commented on SPARK-21825: -- Please assign this issue to me. > improving "assert(exchanges.map(_.outputPartitioning.numPartitions)" in > ExchangeCoordinatorSuite > - > > Key: SPARK-21825 > URL: https://issues.apache.org/jira/browse/SPARK-21825 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 2.3.0 >Reporter: iamhumanbeing >Priority: Trivial > > ExchangeCoordinatorSuite.scala > Line 424: assert(exchanges.map(_.outputPartitioning.numPartitions).toSet === > Set(2, 3)) is not precise enough. > Change it to > assert(exchanges.map(_.outputPartitioning.numPartitions).toSeq === Seq(2, 2, > 2, 3)) > Line 476: > assert(exchanges.map(_.outputPartitioning.numPartitions).toSet === Set(5, 3)) > is not precise enough. > Change it to > assert(exchanges.map(_.outputPartitioning.numPartitions).toSeq > === Seq(5, 3, 5)) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19123) KeyProviderException when reading Azure Blobs from Apache Spark
[ https://issues.apache.org/jira/browse/SPARK-19123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16139950#comment-16139950 ] Davis commented on SPARK-19123: --- To skip the key decryption, please add the following config entry to the SparkSession builder: set "fs.azure.account.keyprovider..blob.core.windows.net" to the value "org.apache.hadoop.fs.azure.SimpleKeyProvider" > KeyProviderException when reading Azure Blobs from Apache Spark > --- > > Key: SPARK-19123 > URL: https://issues.apache.org/jira/browse/SPARK-19123 > Project: Spark > Issue Type: Question > Components: Input/Output, Java API >Affects Versions: 2.0.0 > Environment: Apache Spark 2.0.0 running on Azure HDInsight cluster > version 3.5 with Hadoop version 2.7.3 >Reporter: Saulo Ricci >Priority: Minor > > I created a Spark job that is intended to read a set of JSON files from an > Azure Blob container. I set the key and reference to my storage and I'm > reading the files as shown in the snippet below: > {code:java} > SparkSession > sparkSession = > SparkSession.builder().appName("Pipeline") > .master("yarn") > .config("fs.azure", > "org.apache.hadoop.fs.azure.NativeAzureFileSystem") > > .config("fs.azure.account.key..blob.core.windows.net","") > .getOrCreate(); > Dataset txs = sparkSession.read().json("wasb://path_to_files"); > {code} > The point is that I'm unfortunately getting an > `org.apache.hadoop.fs.azure.KeyProviderException` when reading the blobs from > Azure storage. 
According to the trace shown below, it seems the header is > too long, but I'm still trying to figure out what exactly that means: > {code:java} > 17/01/07 19:28:39 ERROR ApplicationMaster: User class threw exception: > org.apache.hadoop.fs.azure.AzureException: > org.apache.hadoop.fs.azure.KeyProviderException: ExitCodeException > exitCode=2: Error reading S/MIME message > 140473279682200:error:0D07207B:asn1 encoding > routines:ASN1_get_object:header too long:asn1_lib.c:157: > 140473279682200:error:0D0D106E:asn1 encoding > routines:B64_READ_ASN1:decode error:asn_mime.c:192: > 140473279682200:error:0D0D40CB:asn1 encoding > routines:SMIME_read_ASN1:asn1 parse error:asn_mime.c:517: > org.apache.hadoop.fs.azure.AzureException: > org.apache.hadoop.fs.azure.KeyProviderException: ExitCodeException > exitCode=2: Error reading S/MIME message > 140473279682200:error:0D07207B:asn1 encoding > routines:ASN1_get_object:header too long:asn1_lib.c:157: > 140473279682200:error:0D0D106E:asn1 encoding > routines:B64_READ_ASN1:decode error:asn_mime.c:192: > 140473279682200:error:0D0D40CB:asn1 encoding > routines:SMIME_read_ASN1:asn1 parse error:asn_mime.c:517: > at > org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.createAzureStorageSession(AzureNativeFileSystemStore.java:953) > at > org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.initialize(AzureNativeFileSystemStore.java:450) > at > org.apache.hadoop.fs.azure.NativeAzureFileSystem.initialize(NativeAzureFileSystem.java:1209) > at > org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2761) > at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:99) > at > org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2795) > at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2777) > at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:386) > at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295) > at > 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:366) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:364) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at scala.collection.immutable.List.foreach(List.scala:381) > at > scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) > at scala.collection.immutable.List.flatMap(List.scala:344) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:364) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149) > at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:294) > at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:249) > at > taka.pipelines.AnomalyTrainingPipeline.main(AnomalyTrainingPipeline.java:35) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(Native
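Davis's workaround above can be sketched as follows. The storage-account segment is elided in the comment (the key reads `keyprovider..blob.core.windows.net`), so the helper below takes it as a hypothetical, caller-supplied parameter; the function name is mine.

```python
def key_provider_property(account):
    """Build the fs.azure key-provider (key, value) pair from the
    workaround in the comment above. `account` is a placeholder for
    the storage-account name, which the original comment elides."""
    return (
        f"fs.azure.account.keyprovider.{account}.blob.core.windows.net",
        "org.apache.hadoop.fs.azure.SimpleKeyProvider",
    )

# Hypothetical account name, for illustration only
key, value = key_provider_property("mystorageacct")
# In PySpark this pair would be passed as SparkSession.builder.config(key, value)
```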
[jira] [Assigned] (SPARK-19159) PySpark UDF API improvements
[ https://issues.apache.org/jira/browse/SPARK-19159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-19159: Assignee: Maciej Szymkiewicz > PySpark UDF API improvements > > > Key: SPARK-19159 > URL: https://issues.apache.org/jira/browse/SPARK-19159 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 1.6.0, 2.0.0, 2.1.0, 2.2.0 >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz > Fix For: 2.3.0 > > > At this moment the PySpark UDF API is a bit rough around the edges. This is an > umbrella ticket for possible API improvements. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21797) spark cannot read partitioned data in S3 that are partly in glacier
[ https://issues.apache.org/jira/browse/SPARK-21797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16139919#comment-16139919 ] Steve Loughran commented on SPARK-21797: If you are using s3:// URLs then it's the AWS team's problem. If you were using s3a://, then it'd be something you ask the Hadoop team to look at, but we'd say no, as * it's a niche use case * It's really slow, as in "read() takes so long other bits of the system will start to think your worker is hanging". Which means if you have speculative execution turned on, they kick off other workers to read the data. * It's a very, very expensive way to work with data; $0.03/GB, which ramps up fast once multiple Spark workers start reading the same datasets in parallel. * Finally, it's been rejected on the server with a 403 response. That's Amazon S3 saying "no", not any of the clients. You shouldn't be trying to process data directly from S3. Copy to S3 or a transient HDFS cluster, maybe as part of an Oozie or Airflow workflow. I'd be curious about the full stack trace you see if you do try this with s3a://, even though it'll still be a WONTFIX. We could at least go for a more meaningful exception translation, and the retry logic needs to know that the error won't go away if you try again. > spark cannot read partitioned data in S3 that are partly in glacier > --- > > Key: SPARK-21797 > URL: https://issues.apache.org/jira/browse/SPARK-21797 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Boris Clémençon > Labels: glacier, partitions, read, s3 > > I have a dataset in parquet in S3 partitioned by date (dt) with the oldest dates > stored in AWS Glacier to save some money. For instance, we have... > {noformat} > s3://my-bucket/my-dataset/dt=2017-07-01/[in glacier] > ... > s3://my-bucket/my-dataset/dt=2017-07-09/[in glacier] > s3://my-bucket/my-dataset/dt=2017-07-10/[not in glacier] > ... 
> s3://my-bucket/my-dataset/dt=2017-07-24/[not in glacier] > {noformat} > I want to read this dataset, but only the subset of dates that are not yet in > Glacier, e.g.: > {code:java} > val from = "2017-07-15" > val to = "2017-08-24" > val path = "s3://my-bucket/my-dataset/" > val X = spark.read.parquet(path).where(col("dt").between(from, to)) > {code} > Unfortunately, I get the exception > {noformat} > java.io.IOException: > com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: > The operation is not valid for the object's storage class (Service: Amazon > S3; Status Code: 403; Error Code: InvalidObjectState; Request ID: > C444D508B6042138) > {noformat} > It seems that Spark does not like a partitioned dataset when some partitions are > in Glacier. I could always read each date explicitly, add the column with the > current date and reduce(_ union _) at the end, but that is not pretty and it should > not be necessary. > Is there any tip for reading the available data in the datastore even with old data > in Glacier? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18777) Return UDF objects when registering from Python
[ https://issues.apache.org/jira/browse/SPARK-18777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-18777: Assignee: Maciej Szymkiewicz > Return UDF objects when registering from Python > --- > > Key: SPARK-18777 > URL: https://issues.apache.org/jira/browse/SPARK-18777 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Reporter: holdenk >Assignee: Maciej Szymkiewicz > > In Scala when registering a UDF it gives you back a UDF object that you can > use in the Dataset/DataFrame API as well as with SQL expressions. We can do > the same in Python, for both Python UDFs and Java UDFs registered from Python. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18777) Return UDF objects when registering from Python
[ https://issues.apache.org/jira/browse/SPARK-18777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-18777. -- Resolution: Fixed Fixed in https://github.com/apache/spark/pull/17831 > Return UDF objects when registering from Python > --- > > Key: SPARK-18777 > URL: https://issues.apache.org/jira/browse/SPARK-18777 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Reporter: holdenk > > In Scala when registering a UDF it gives you back a UDF object that you can > use in the Dataset/DataFrame API as well as with SQL expressions. We can do > the same in Python, for both Python UDFs and Java UDFs registered from Python. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19159) PySpark UDF API improvements
[ https://issues.apache.org/jira/browse/SPARK-19159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-19159: - Fix Version/s: 2.3.0 > PySpark UDF API improvements > > > Key: SPARK-19159 > URL: https://issues.apache.org/jira/browse/SPARK-19159 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 1.6.0, 2.0.0, 2.1.0, 2.2.0 >Reporter: Maciej Szymkiewicz > Fix For: 2.3.0 > > > At this moment the PySpark UDF API is a bit rough around the edges. This is an > umbrella ticket for possible API improvements. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19159) PySpark UDF API improvements
[ https://issues.apache.org/jira/browse/SPARK-19159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-19159. -- Resolution: Done > PySpark UDF API improvements > > > Key: SPARK-19159 > URL: https://issues.apache.org/jira/browse/SPARK-19159 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 1.6.0, 2.0.0, 2.1.0, 2.2.0 >Reporter: Maciej Szymkiewicz > > At this moment the PySpark UDF API is a bit rough around the edges. This is an > umbrella ticket for possible API improvements. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19165) UserDefinedFunction should verify call arguments and provide readable exception in case of mismatch
[ https://issues.apache.org/jira/browse/SPARK-19165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-19165: Assignee: Hyukjin Kwon > UserDefinedFunction should verify call arguments and provide readable > exception in case of mismatch > --- > > Key: SPARK-19165 > URL: https://issues.apache.org/jira/browse/SPARK-19165 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 1.6.0, 2.0.0, 2.1.0, 2.2.0 >Reporter: Maciej Szymkiewicz >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 2.3.0 > > > Invalid arguments to UDF call fail with a bit cryptic Py4J errors: > {code} > In [5]: g = udf(lambda x: x) > In [6]: df.select(f([])) > --- > Py4JError Traceback (most recent call last) > in () > > 1 df.select(f([])) > > Py4JError: An error occurred while calling > z:org.apache.spark.sql.functions.col. Trace: > py4j.Py4JException: Method col([class java.util.ArrayList]) does not exist > at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318) > at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:339) > at py4j.Gateway.invoke(Gateway.java:274) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:214) > at java.lang.Thread.run(Thread.java:745) > {code} > It is pretty easy to perform basic input validation: > {code} > In [8]: f = udf(lambda x: x) > In [9]: f(1) > --- > TypeError Traceback (most recent call last) > ... > TypeError: All arguments should be Columns or strings representing column > names. Got 1 of type > {code} > This can be further extended to check for expected number of arguments or > even, with some type of annotations, SQL types. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
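The "basic input validation" the ticket describes can be sketched in plain Python. The `Column` class here is a stand-in for `pyspark.sql.Column` and the function name is mine; the point is only to show how a readable `TypeError` replaces the cryptic Py4J failure.

```python
class Column:
    """Stand-in for pyspark.sql.Column, used only for this sketch."""


def validate_udf_args(*args):
    """Check that every argument is a Column or a string naming a
    column, raising a readable TypeError up front instead of letting
    the call fail later inside Py4J."""
    for arg in args:
        if not isinstance(arg, (Column, str)):
            raise TypeError(
                "All arguments should be Columns or strings representing "
                f"column names. Got {arg!r} of type {type(arg).__name__}")


validate_udf_args("x", Column())  # accepted silently

try:
    validate_udf_args(1)          # rejected with a clear message
except TypeError as e:
    msg = str(e)
```

As the ticket notes, the same hook could be extended to check the expected number of arguments or, with annotations, SQL types.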
[jira] [Resolved] (SPARK-19165) UserDefinedFunction should verify call arguments and provide readable exception in case of mismatch
[ https://issues.apache.org/jira/browse/SPARK-19165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-19165. -- Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 19027 [https://github.com/apache/spark/pull/19027] > UserDefinedFunction should verify call arguments and provide readable > exception in case of mismatch > --- > > Key: SPARK-19165 > URL: https://issues.apache.org/jira/browse/SPARK-19165 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 1.6.0, 2.0.0, 2.1.0, 2.2.0 >Reporter: Maciej Szymkiewicz >Priority: Minor > Fix For: 2.3.0 > > > Invalid arguments to UDF call fail with a bit cryptic Py4J errors: > {code} > In [5]: g = udf(lambda x: x) > In [6]: df.select(f([])) > --- > Py4JError Traceback (most recent call last) > in () > > 1 df.select(f([])) > > Py4JError: An error occurred while calling > z:org.apache.spark.sql.functions.col. Trace: > py4j.Py4JException: Method col([class java.util.ArrayList]) does not exist > at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318) > at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:339) > at py4j.Gateway.invoke(Gateway.java:274) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:214) > at java.lang.Thread.run(Thread.java:745) > {code} > It is pretty easy to perform basic input validation: > {code} > In [8]: f = udf(lambda x: x) > In [9]: f(1) > --- > TypeError Traceback (most recent call last) > ... > TypeError: All arguments should be Columns or strings representing column > names. Got 1 of type > {code} > This can be further extended to check for expected number of arguments or > even, with some type of annotations, SQL types. 
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21172) EOFException reached end of stream in UnsafeRowSerializer
[ https://issues.apache.org/jira/browse/SPARK-21172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16139910#comment-16139910 ] liupengcheng commented on SPARK-21172: -- [~lasanthafdo] This may be caused by bit errors during network transfer, or by disk bit errors in the shuffle data written by the parent stage. Throwing a FetchFailedException to retry the parent stage might work around the problem; it seems the community has already introduced a mechanism to detect disk bit errors and retry the parent stage. > EOFException reached end of stream in UnsafeRowSerializer > - > > Key: SPARK-21172 > URL: https://issues.apache.org/jira/browse/SPARK-21172 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 2.0.1 >Reporter: liupengcheng > Labels: shuffle > > Spark SQL job failed because of the following exception. Seems like a bug in > the shuffle stage. > Shuffle read size for a single task is tens of GB. > {code} > org.apache.spark.SparkException: Task failed while writing rows > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:264) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.EOFException: reached end of stream after reading 9034374 > bytes; 
1684891936 bytes expected > at > org.spark_project.guava.io.ByteStreams.readFully(ByteStreams.java:735) > at > org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:127) > at > org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:110) > at scala.collection.Iterator$$anon$12.next(Iterator.scala:444) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at > org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:30) > at > org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:255) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:253) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:253) > at > org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1345) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:259) > ... 8 more > {code}
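The EOFException in the trace comes from Guava's ByteStreams.readFully, which demands an exact byte count and fails hard on a short read. A minimal Python sketch of that contract (read_fully is a hypothetical helper written for illustration, not Spark or Guava code) shows why a truncated or corrupted shuffle block surfaces as this error rather than as silently bad rows:

```python
import io

def read_fully(stream, expected):
    """Read exactly `expected` bytes from `stream` or raise EOFError,
    mirroring the Guava ByteStreams.readFully contract that
    UnsafeRowSerializer relies on when deserializing shuffle blocks."""
    buf = bytearray()
    while len(buf) < expected:
        chunk = stream.read(expected - len(buf))
        if not chunk:  # stream exhausted before the declared length
            raise EOFError(
                "reached end of stream after reading %d bytes; "
                "%d bytes expected" % (len(buf), expected))
        buf.extend(chunk)
    return bytes(buf)
```

In the reported failure the serializer expected 1684891936 bytes but the stream ended after 9034374, which is exactly the short-read case above: the declared block length and the bytes actually fetched disagree, pointing at corruption or truncation somewhere between the map output and the reducer.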
[jira] [Commented] (SPARK-21702) Structured Streaming S3A SSE Encryption Not Visible through AWS S3 GUI when PartitionBy Used
[ https://issues.apache.org/jira/browse/SPARK-21702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16139898#comment-16139898 ] Steve Loughran commented on SPARK-21702: If this is just "directories", then there are no directories in S3. We create some mock ones for empty dirs (i.e. after a mkdirs() call) through 0-byte objects, and we then delete all such 0-byte objects when you write data underneath (see {{S3AFileSystem.deleteUnnecessaryFakeDirectories(Path)}}). I think that's what's been causing the confusion. I'm going to close this one as invalid, sorry. FWIW, if you do want to guarantee data in a bucket is encrypted, set the bucket policy to mandate this. It's the best way to be confident that all your data is locked down: [https://hortonworks.github.io/hdp-aws/s3-encryption/index.html] > Structured Streaming S3A SSE Encryption Not Visible through AWS S3 GUI when > PartitionBy Used > > > Key: SPARK-21702 > URL: https://issues.apache.org/jira/browse/SPARK-21702 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 > Environment: Hadoop 2.7.3: AWS SDK 1.7.4 > Hadoop 2.8.1: AWS SDK 1.10.6 >Reporter: George Pongracz >Priority: Minor > Labels: security > > Settings: > .config("spark.hadoop.fs.s3a.impl", > "org.apache.hadoop.fs.s3a.S3AFileSystem") > .config("spark.hadoop.fs.s3a.server-side-encryption-algorithm", > "AES256") > When writing to an S3 sink from structured streaming, the files are > encrypted using AES-256. > When introducing a "PartitionBy", the output data files are unencrypted. > All other supporting files and metadata are encrypted. > Suspect the write to temp is encrypted and the move/rename is not applying > the SSE.
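For reference, the reporter's two settings can be collected in one place. This is a minimal sketch under stated assumptions: build_s3a_sse_conf is a hypothetical helper, the keys and values are exactly those quoted in the issue description, and (as noted in the comment above) a bucket policy mandating encryption remains the reliable guarantee, not the client-side config.

```python
def build_s3a_sse_conf(algorithm="AES256"):
    """Return the Hadoop/S3A options quoted in the issue as a plain dict,
    suitable for feeding to SparkSession.builder.config(key, value)."""
    return {
        # Use the S3A filesystem client for s3a:// URIs.
        "spark.hadoop.fs.s3a.impl":
            "org.apache.hadoop.fs.s3a.S3AFileSystem",
        # Ask S3 for server-side encryption on uploaded objects.
        "spark.hadoop.fs.s3a.server-side-encryption-algorithm": algorithm,
    }
```

The reported symptom is that these client-side settings applied to the initial upload but apparently not to the partitioned output files, which is why the comment recommends enforcing encryption at the bucket-policy level instead.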