[jira] [Updated] (SPARK-20593) Writing Parquet: Cannot build an empty group
[ https://issues.apache.org/jira/browse/SPARK-20593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Viktor Khristenko updated SPARK-20593:
--------------------------------------
Description:

Hi,

This is my first ticket, so I apologize if I'm doing anything improperly. I have a dataset with the following schema:

```
root
 |-- muons: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- reco::Candidate: struct (nullable = true)
 |    |    |-- qx3_: integer (nullable = true)
 |    |    |-- pt_: float (nullable = true)
 |    |    |-- eta_: float (nullable = true)
 |    |    |-- phi_: float (nullable = true)
 |    |    |-- mass_: float (nullable = true)
 |    |    |-- vertex_: struct (nullable = true)
 |    |    |    |-- fCoordinates: struct (nullable = true)
 |    |    |    |    |-- fX: float (nullable = true)
 |    |    |    |    |-- fY: float (nullable = true)
 |    |    |    |    |-- fZ: float (nullable = true)
 |    |    |-- pdgId_: integer (nullable = true)
 |    |    |-- status_: integer (nullable = true)
 |    |    |-- cachePolarFixed_: struct (nullable = true)
 |    |    |-- cacheCartesianFixed_: struct (nullable = true)
```

As you can see, there are 3 empty structs in this schema (reco::Candidate, cachePolarFixed_, and cacheCartesianFixed_). I know for certain that I can read and manipulate the data. However, when I try writing it to disk as Parquet:

```
ds.write.format("parquet").save(outputPathName)
```

I get the following exception:

```
java.lang.IllegalStateException: Cannot build an empty group
  at org.apache.parquet.Preconditions.checkState(Preconditions.java:91)
  at org.apache.parquet.schema.Types$BaseGroupBuilder.build(Types.java:622)
  at org.apache.parquet.schema.Types$BaseGroupBuilder.build(Types.java:497)
  at org.apache.parquet.schema.Types$Builder.named(Types.java:286)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter.convertField(ParquetSchemaConverter.scala:535)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter.convertField(ParquetSchemaConverter.scala:321)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$$anonfun$convertField$1.apply(ParquetSchemaConverter.scala:534)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$$anonfun$convertField$1.apply(ParquetSchemaConverter.scala:533)
```

So, basically, I would like to understand whether this is a bug or intended behavior. I assume it is related to the empty structs: a quickly created stripped-down version of the dataset (without them) writes without any issues. Any help would be really appreciated! For reference, here is a link to the original question on SO [1].

VK

[1] http://stackoverflow.com/questions/43767358/apache-spark-parquet-cannot-build-an-empty-group
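The failure comes from Parquet's schema converter: a Parquet group must contain at least one field (that is precisely what the `Cannot build an empty group` precondition enforces), so an empty struct has no valid Parquet representation. One workaround is to drop the empty struct columns before writing. Below is a minimal sketch of that pruning idea in plain Python, operating on a nested dict stand-in for the schema rather than Spark's actual `StructType` API — the dict format and function name here are illustrative, not Spark code:

```python
def prune_empty_structs(schema):
    """Recursively remove struct fields that contain no leaf fields.

    `schema` maps field names either to a type string (a leaf, e.g.
    "float") or to a nested dict (a struct). Structs that end up with
    zero fields are dropped entirely, since Parquet cannot represent
    a group with no fields.
    """
    pruned = {}
    for name, field_type in schema.items():
        if isinstance(field_type, dict):
            sub = prune_empty_structs(field_type)
            if sub:  # keep the struct only if something survived
                pruned[name] = sub
        else:
            pruned[name] = field_type
    return pruned

# The muon element from the ticket, with its three empty structs.
muon = {
    "reco::Candidate": {},        # empty struct
    "qx3_": "integer",
    "pt_": "float",
    "vertex_": {"fCoordinates": {"fX": "float", "fY": "float", "fZ": "float"}},
    "cachePolarFixed_": {},       # empty struct
    "cacheCartesianFixed_": {},   # empty struct
}

cleaned = prune_empty_structs(muon)
```

After pruning, only fields with actual leaf data remain, which is a shape Parquet can encode.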
[jira] [Created] (SPARK-20593) Writing Parquet: Cannot build an empty group
Viktor Khristenko created SPARK-20593:
--------------------------------------
             Summary: Writing Parquet: Cannot build an empty group
                 Key: SPARK-20593
                 URL: https://issues.apache.org/jira/browse/SPARK-20593
             Project: Spark
          Issue Type: Question
          Components: Spark Core, Spark Shell
    Affects Versions: 2.1.1
         Environment: Apache Spark 2.1.1 (also reproduced on 2.1.0; switched today). Tested only on Mac.
            Reporter: Viktor Khristenko
            Priority: Minor

Hi,

This is my first ticket, so I apologize if I'm doing anything improperly. I have a dataset whose schema contains an array of muon structs, including three empty structs (reco::Candidate, cachePolarFixed_, and cacheCartesianFixed_). I know for certain that I can read and manipulate the data. However, when I try writing it to disk as Parquet via ds.write.format("parquet").save(outputPathName), I get:

java.lang.IllegalStateException: Cannot build an empty group
  at org.apache.parquet.Preconditions.checkState(Preconditions.java:91)
  at org.apache.parquet.schema.Types$BaseGroupBuilder.build(Types.java:622)
  at org.apache.parquet.schema.Types$BaseGroupBuilder.build(Types.java:497)
  at org.apache.parquet.schema.Types$Builder.named(Types.java:286)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter.convertField(ParquetSchemaConverter.scala:535)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter.convertField(ParquetSchemaConverter.scala:321)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$$anonfun$convertField$1.apply(ParquetSchemaConverter.scala:534)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$$anonfun$convertField$1.apply(ParquetSchemaConverter.scala:533)

So, basically, I would like to understand whether this is a bug or intended behavior. I assume it is related to the empty structs: a quickly created stripped-down version of the dataset (without them) writes without any issues. Any help would be really appreciated! For reference, here is a link to the original question on SO [1].

VK

[1] http://stackoverflow.com/questions/43767358/apache-spark-parquet-cannot-build-an-empty-group

--
This message was sent by Atlassian JIRA (v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20592) Alter table concatenate is not working as expected.
Guru Prabhakar Reddy Marthala created SPARK-20592:
--------------------------------------------------
             Summary: Alter table concatenate is not working as expected.
                 Key: SPARK-20592
                 URL: https://issues.apache.org/jira/browse/SPARK-20592
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 2.1.0
            Reporter: Guru Prabhakar Reddy Marthala

Created a table using CTAS, going from CSV to Parquet. The Parquet table generated numerous small files, so I tried alter table concatenate, but it is not working as expected:

spark.sql("CREATE TABLE flight.flight_data(year INT, month INT, day INT, day_of_week INT, dep_time INT, crs_dep_time INT, arr_time INT, crs_arr_time INT, unique_carrier STRING, flight_num INT, tail_num STRING, actual_elapsed_time INT, crs_elapsed_time INT, air_time INT, arr_delay INT, dep_delay INT, origin STRING, dest STRING, distance INT, taxi_in INT, taxi_out INT, cancelled INT, cancellation_code STRING, diverted INT, carrier_delay STRING, weather_delay STRING, nas_delay STRING, security_delay STRING, late_aircraft_delay STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' stored as textfile")
spark.sql("load data local INPATH 'i:/2008/2008.csv' INTO TABLE flight.flight_data")
spark.sql("create table flight.flight_data_pq stored as parquet as select * from flight.flight_data")
spark.sql("create table flight.flight_data_orc stored as orc as select * from flight.flight_data")

pyspark.sql.utils.ParseException: u'\nOperation not allowed: alter table concatenate(line 1, pos 0)\n\n== SQL ==\nalter table flight_data.flight_data_pq concatenate\n^^^\n'

Tried on both ORC and Parquet formats; neither works.
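The ParseException above shows that Spark's SQL parser rejects ALTER TABLE CONCATENATE outright (it is a Hive/ORC feature), so the usual workaround for the small-files problem is to rewrite the table with fewer, larger files, e.g. via coalesce/repartition before saving. The grouping idea behind such a compaction pass — packing many small files into batches of roughly a target output size — can be sketched in plain Python; the file sizes and target below are illustrative, not taken from the ticket:

```python
def plan_compaction(file_sizes, target_bytes):
    """Greedily group small files into batches of roughly target_bytes.

    Returns a list of batches, each a list of indices into file_sizes;
    each batch would be rewritten as one larger output file.
    """
    batches, current, current_size = [], [], 0
    for i, size in enumerate(file_sizes):
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(i)
        current_size += size
    if current:
        batches.append(current)
    return batches

# Ten 20 MB files compacted toward 128 MB outputs.
sizes = [20 * 1024 * 1024] * 10
plan = plan_compaction(sizes, 128 * 1024 * 1024)
```

In Spark itself, the equivalent effect is typically achieved by reading the table and writing it back with a reduced partition count.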
[jira] [Commented] (SPARK-19660) Replace the configuration property names that are deprecated in the version of Hadoop 2.6
[ https://issues.apache.org/jira/browse/SPARK-19660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15996175#comment-15996175 ] Apache Spark commented on SPARK-19660: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/17856 > Replace the configuration property names that are deprecated in the version > of Hadoop 2.6 > - > > Key: SPARK-19660 > URL: https://issues.apache.org/jira/browse/SPARK-19660 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 2.1.0 >Reporter: Yuming Wang >Assignee: Yuming Wang > Fix For: 2.2.0 > > > Replace all the Hadoop deprecated configuration property names according to > [DeprecatedProperties|https://hadoop.apache.org/docs/r2.6.0/hadoop-project-dist/hadoop-common/DeprecatedProperties.html]. > except: > https://github.com/apache/spark/blob/v2.1.0/python/pyspark/sql/tests.py#L1533 > https://github.com/apache/spark/blob/v2.1.0/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala#L987 > https://github.com/apache/spark/blob/v2.1.0/sql/core/src/main/scala/org/apache/spark/sql/execution/command/SetCommand.scala#L45 > https://github.com/apache/spark/blob/v2.1.0/sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L614 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20543) R should skip long running or non-essential tests when running on CRAN
[ https://issues.apache.org/jira/browse/SPARK-20543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung resolved SPARK-20543. -- Resolution: Fixed Fix Version/s: 2.3.0 2.2.0 Target Version/s: 2.2.0, 2.3.0 > R should skip long running or non-essential tests when running on CRAN > -- > > Key: SPARK-20543 > URL: https://issues.apache.org/jira/browse/SPARK-20543 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Felix Cheung >Assignee: Felix Cheung > Fix For: 2.2.0, 2.3.0 > > > This is actually recommended in the CRAN policies -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 10.0.1 to 10.2.0
[ https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15996166#comment-15996166 ] Helena Edelson commented on SPARK-18057: With the current 0.10.0.1 version we have several issues happening, forcing us into ever tighter situations. Much of this is constraints related to new functionality in later Kafka releases around kafka security and SASL_SSL and related behavior not in previous versions of Kafka. Users in our ecosystem can not delete topics on clusters so this is not our relevant use case. It seems only structured streaming kafka does deleteTopic, vs spark-streaming-kafka. I've had to create an internal fork so that we can use Kafka 0.10.2.0 in Spark, which is bad but we are blocked otherwise. [~ijuma] good to know on the timing. A group of us voted for https://issues.apache.org/jira/browse/KAFKA-4879. > Update structured streaming kafka from 10.0.1 to 10.2.0 > --- > > Key: SPARK-18057 > URL: https://issues.apache.org/jira/browse/SPARK-18057 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Cody Koeninger > > There are a couple of relevant KIPs here, > https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20564) a lot of executor failures when the executor number is more than 2000
[ https://issues.apache.org/jira/browse/SPARK-20564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20564: Assignee: Apache Spark > a lot of executor failures when the executor number is more than 2000 > - > > Key: SPARK-20564 > URL: https://issues.apache.org/jira/browse/SPARK-20564 > Project: Spark > Issue Type: Improvement > Components: Deploy >Affects Versions: 1.6.2, 2.1.0 >Reporter: Hua Liu >Assignee: Apache Spark > > When we used more than 2000 executors in a spark application, we noticed a > large number of executors cannot connect to driver and as a result they were > marked as failed. In some cases, the failed executor number reached twice of > the requested executor count and thus applications retried and may eventually > fail. > This is because that YarnAllocator requests all missing containers every > spark.yarn.scheduler.heartbeat.interval-ms (default 3 seconds). For example, > YarnAllocator can ask for and get over 2000 containers in one request, and > then launch them almost simultaneously. These thousands of executors try to > retrieve spark props and register with driver within seconds. However, driver > handles executor registration, stop, removal and spark props retrieval in one > thread, and it can not handle such a large number of RPCs within a short > period of time. As a result, some executors cannot retrieve spark props > and/or register. These failed executors are then marked as failed, causing > executor removal and aggravating the overloading of driver, which leads to > more executor failures. > This patch adds an extra configuration > spark.yarn.launchContainer.count.simultaneously, which caps the maximal > number of containers that driver can ask for in every > spark.yarn.scheduler.heartbeat.interval-ms. As a result, the number of > executors grows steadily. The number of executor failures is reduced and > applications can reach the desired number of executors faster. 
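The pacing idea in the proposed patch can be sketched as follows; the helper name and the per-heartbeat cap of 500 are illustrative, not the actual YarnAllocator code or a recommended value for spark.yarn.launchContainer.count.simultaneously:

```python
def containers_to_request(missing, max_per_heartbeat):
    """Cap the number of containers asked for in one allocator heartbeat.

    Instead of requesting all `missing` containers at once (which floods
    the driver's single RPC-handling thread with near-simultaneous executor
    registrations), request at most `max_per_heartbeat` each round so the
    executor count grows steadily.
    """
    return min(missing, max_per_heartbeat)

# 2400 missing executors, capped at 500 per heartbeat interval:
missing = 2400
schedule = []
while missing > 0:
    batch = containers_to_request(missing, 500)
    schedule.append(batch)
    missing -= batch
print(schedule)  # [500, 500, 500, 500, 400]
```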
[jira] [Assigned] (SPARK-20564) a lot of executor failures when the executor number is more than 2000
[ https://issues.apache.org/jira/browse/SPARK-20564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20564: Assignee: (was: Apache Spark) > a lot of executor failures when the executor number is more than 2000 > - > > Key: SPARK-20564 > URL: https://issues.apache.org/jira/browse/SPARK-20564 > Project: Spark > Issue Type: Improvement > Components: Deploy >Affects Versions: 1.6.2, 2.1.0 >Reporter: Hua Liu > > When we used more than 2000 executors in a spark application, we noticed a > large number of executors cannot connect to driver and as a result they were > marked as failed. In some cases, the failed executor number reached twice of > the requested executor count and thus applications retried and may eventually > fail. > This is because that YarnAllocator requests all missing containers every > spark.yarn.scheduler.heartbeat.interval-ms (default 3 seconds). For example, > YarnAllocator can ask for and get over 2000 containers in one request, and > then launch them almost simultaneously. These thousands of executors try to > retrieve spark props and register with driver within seconds. However, driver > handles executor registration, stop, removal and spark props retrieval in one > thread, and it can not handle such a large number of RPCs within a short > period of time. As a result, some executors cannot retrieve spark props > and/or register. These failed executors are then marked as failed, causing > executor removal and aggravating the overloading of driver, which leads to > more executor failures. > This patch adds an extra configuration > spark.yarn.launchContainer.count.simultaneously, which caps the maximal > number of containers that driver can ask for in every > spark.yarn.scheduler.heartbeat.interval-ms. As a result, the number of > executors grows steadily. The number of executor failures is reduced and > applications can reach the desired number of executors faster. 
[jira] [Commented] (SPARK-20564) a lot of executor failures when the executor number is more than 2000
[ https://issues.apache.org/jira/browse/SPARK-20564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15996165#comment-15996165 ] Apache Spark commented on SPARK-20564: -- User 'mariahualiu' has created a pull request for this issue: https://github.com/apache/spark/pull/17854 > a lot of executor failures when the executor number is more than 2000 > - > > Key: SPARK-20564 > URL: https://issues.apache.org/jira/browse/SPARK-20564 > Project: Spark > Issue Type: Improvement > Components: Deploy >Affects Versions: 1.6.2, 2.1.0 >Reporter: Hua Liu > > When we used more than 2000 executors in a spark application, we noticed a > large number of executors cannot connect to driver and as a result they were > marked as failed. In some cases, the failed executor number reached twice of > the requested executor count and thus applications retried and may eventually > fail. > This is because that YarnAllocator requests all missing containers every > spark.yarn.scheduler.heartbeat.interval-ms (default 3 seconds). For example, > YarnAllocator can ask for and get over 2000 containers in one request, and > then launch them almost simultaneously. These thousands of executors try to > retrieve spark props and register with driver within seconds. However, driver > handles executor registration, stop, removal and spark props retrieval in one > thread, and it can not handle such a large number of RPCs within a short > period of time. As a result, some executors cannot retrieve spark props > and/or register. These failed executors are then marked as failed, causing > executor removal and aggravating the overloading of driver, which leads to > more executor failures. > This patch adds an extra configuration > spark.yarn.launchContainer.count.simultaneously, which caps the maximal > number of containers that driver can ask for in every > spark.yarn.scheduler.heartbeat.interval-ms. As a result, the number of > executors grows steadily. 
The number of executor failures is reduced and applications can reach the desired number of executors faster.
[jira] [Created] (SPARK-20591) Succeeded tasks num not equal in job page and job detail page on spark web ui when speculative task(s) exist
Jinhua Fu created SPARK-20591: - Summary: Succeeded tasks num not equal in job page and job detail page on spark web ui when speculative task(s) exist Key: SPARK-20591 URL: https://issues.apache.org/jira/browse/SPARK-20591 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 2.0.2 Reporter: Jinhua Fu When spark.speculation is enabled and there are speculative tasks, the succeeded-task count on the job page includes speculative tasks, but the count on the job detail page (the job's stages page) does not. When the job page shows more succeeded tasks than total tasks, it suggests some tasks ran slowly; to find out which tasks and why, I have to check every stage, because speculative tasks are not included in each stage's succeeded-task count. Can this be improved?
[jira] [Resolved] (SPARK-20584) Python generic hint support
[ https://issues.apache.org/jira/browse/SPARK-20584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-20584. - Resolution: Fixed Assignee: Maciej Szymkiewicz Fix Version/s: 2.2.0 > Python generic hint support > --- > > Key: SPARK-20584 > URL: https://issues.apache.org/jira/browse/SPARK-20584 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.0 >Reporter: Reynold Xin >Assignee: Maciej Szymkiewicz > Fix For: 2.2.0 > > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-20582) Speed up the restart of HistoryServer using ApplicationAttemptInfo checkpointing
[ https://issues.apache.org/jira/browse/SPARK-20582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15996013#comment-15996013 ] zhoukang edited comment on SPARK-20582 at 5/4/17 1:44 AM: -- You are right, [~vanzin]; I have looked into your implementation. Thanks for your time. was (Author: cane): You are right [~vanzin],i searched into you implementation.Thanks for your time. > Speed up the restart of HistoryServer using ApplicationAttemptInfo > checkpointing > > > Key: SPARK-20582 > URL: https://issues.apache.org/jira/browse/SPARK-20582 > Project: Spark > Issue Type: Improvement > Components: Deploy >Affects Versions: 2.1.0 >Reporter: zhoukang >Priority: Critical > >In the current code of HistoryServer,jetty server will be started > after all the logs fetch from yarn has been replayed.However,when the number > of logs is becoming larger,the start time of jetty will be too long. >Here,we implement a solution which using ApplicationAttemptInfo > checkpointing to speed up the start of historyserver.When historyserver is > starting,it will load ApplicationAttemptInfo from checkpoint file first if > exists which is faster then replaying one by one.
[jira] [Commented] (SPARK-20582) Speed up the restart of HistoryServer using ApplicationAttemptInfo checkpointing
[ https://issues.apache.org/jira/browse/SPARK-20582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15996013#comment-15996013 ] zhoukang commented on SPARK-20582: -- You are right, [~vanzin]; I looked into your implementation. Thanks for your time. > Speed up the restart of HistoryServer using ApplicationAttemptInfo > checkpointing > > > Key: SPARK-20582 > URL: https://issues.apache.org/jira/browse/SPARK-20582 > Project: Spark > Issue Type: Improvement > Components: Deploy >Affects Versions: 2.1.0 >Reporter: zhoukang >Priority: Critical > >In the current code of HistoryServer,jetty server will be started > after all the logs fetch from yarn has been replayed.However,when the number > of logs is becoming larger,the start time of jetty will be too long. >Here,we implement a solution which using ApplicationAttemptInfo > checkpointing to speed up the start of historyserver.When historyserver is > starting,it will load ApplicationAttemptInfo from checkpoint file first if > exists which is faster then replaying one by one.
[jira] [Assigned] (SPARK-20546) spark-class gets syntax error in posix mode
[ https://issues.apache.org/jira/browse/SPARK-20546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20546: Assignee: Apache Spark > spark-class gets syntax error in posix mode > --- > > Key: SPARK-20546 > URL: https://issues.apache.org/jira/browse/SPARK-20546 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 2.0.2 >Reporter: Jessie Yu >Assignee: Apache Spark >Priority: Minor > > spark-class gets the following error when running in posix mode: > {code} > spark-class: line 78: syntax error near unexpected token `<' > spark-class: line 78: `done < <(build_command "$@")' > {code} > \\ > It appears to be complaining about the process substitution: > {code} > CMD=() > while IFS= read -d '' -r ARG; do > CMD+=("$ARG") > done < <(build_command "$@") > {code} > \\ > This can be reproduced by first turning on allexport then posix mode: > {code}set -a -o posix {code} > then run something like spark-shell which calls spark-class. > \\ > The simplest fix is probably to always turn off posix mode in spark-class > before the while loop. > \\ > This was previously reported in > [SPARK-8417|https://issues.apache.org/jira/browse/SPARK-8417] which closed > with cannot reproduce. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20546) spark-class gets syntax error in posix mode
[ https://issues.apache.org/jira/browse/SPARK-20546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20546: Assignee: (was: Apache Spark) > spark-class gets syntax error in posix mode > --- > > Key: SPARK-20546 > URL: https://issues.apache.org/jira/browse/SPARK-20546 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 2.0.2 >Reporter: Jessie Yu >Priority: Minor > > spark-class gets the following error when running in posix mode: > {code} > spark-class: line 78: syntax error near unexpected token `<' > spark-class: line 78: `done < <(build_command "$@")' > {code} > \\ > It appears to be complaining about the process substitution: > {code} > CMD=() > while IFS= read -d '' -r ARG; do > CMD+=("$ARG") > done < <(build_command "$@") > {code} > \\ > This can be reproduced by first turning on allexport then posix mode: > {code}set -a -o posix {code} > then run something like spark-shell which calls spark-class. > \\ > The simplest fix is probably to always turn off posix mode in spark-class > before the while loop. > \\ > This was previously reported in > [SPARK-8417|https://issues.apache.org/jira/browse/SPARK-8417] which closed > with cannot reproduce. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20546) spark-class gets syntax error in posix mode
[ https://issues.apache.org/jira/browse/SPARK-20546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15995999#comment-15995999 ] Apache Spark commented on SPARK-20546: -- User 'jyu00' has created a pull request for this issue: https://github.com/apache/spark/pull/17852 > spark-class gets syntax error in posix mode > --- > > Key: SPARK-20546 > URL: https://issues.apache.org/jira/browse/SPARK-20546 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 2.0.2 >Reporter: Jessie Yu >Priority: Minor > > spark-class gets the following error when running in posix mode: > {code} > spark-class: line 78: syntax error near unexpected token `<' > spark-class: line 78: `done < <(build_command "$@")' > {code} > \\ > It appears to be complaining about the process substitution: > {code} > CMD=() > while IFS= read -d '' -r ARG; do > CMD+=("$ARG") > done < <(build_command "$@") > {code} > \\ > This can be reproduced by first turning on allexport then posix mode: > {code}set -a -o posix {code} > then run something like spark-shell which calls spark-class. > \\ > The simplest fix is probably to always turn off posix mode in spark-class > before the while loop. > \\ > This was previously reported in > [SPARK-8417|https://issues.apache.org/jira/browse/SPARK-8417] which closed > with cannot reproduce. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
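A minimal bash sketch of the suggested fix, reproducing the shape of spark-class's argument-reading loop (build_command here is a stand-in for the real launcher invocation, not the actual script):

```shell
#!/usr/bin/env bash
# Read NUL-delimited arguments from a helper via process substitution,
# as spark-class does when assembling the final java command line.

build_command() {
  # Stand-in for the real Java launcher output (NUL-separated args).
  printf '%s\0' "java" "-cp" "spark.jar" "$@"
}

# In posix mode, `done < <(...)` is a syntax error, so make sure posix
# mode is off before the loop (a no-op where it was never on).
set +o posix

CMD=()
while IFS= read -d '' -r ARG; do
  CMD+=("$ARG")
done < <(build_command "org.example.Main")

echo "${CMD[@]}"  # java -cp spark.jar org.example.Main
```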
[jira] [Commented] (SPARK-17833) 'monotonicallyIncreasingId()' should be deterministic
[ https://issues.apache.org/jira/browse/SPARK-17833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15995990#comment-15995990 ] Joseph K. Bradley commented on SPARK-17833: --- I'll add: There definitely needs to be a deterministic version in order to compute row indices. In my case, this is important for GraphFrames. The last time I checked, you can get non-deterministic behavior by, e.g., shuffling a DataFrame, adding a monotonicallyIncreasingId column, and inspecting. > 'monotonicallyIncreasingId()' should be deterministic > - > > Key: SPARK-17833 > URL: https://issues.apache.org/jira/browse/SPARK-17833 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Kevin Ushey >Priority: Critical > > Right now, it's (IMHO) too easy to shoot yourself in the foot using > 'monotonicallyIncreasingId()', as it's easy to expect the generated numbers > to function as a 'stable' primary key, for example, and then go on to use > that key in e.g. 'joins' and so on. > Is there any reason why this function can't be made deterministic? Or, could > a deterministic analogue of this function be added (e.g. > 'withPrimaryKey(columnName = ...)')? > A solution is to immediately cache / persist the table after calling > 'monotonicallyIncreasingId()'; it's also possible that the documentation > should spell that out loud and clear. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
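For background, monotonically_increasing_id builds each id from the partition index (upper 31 bits) and the row's position within its partition (lower 33 bits), which is why repartitioning or shuffling a DataFrame changes the ids. A plain-Python sketch of that encoding (illustrative, not Spark's implementation):

```python
def monotonically_increasing_ids(partitions):
    """Mimic Spark's monotonically_increasing_id: the partition index goes
    into the upper bits, the row's position within its partition into the
    lower 33 bits. Ids therefore change whenever rows move between
    partitions."""
    ids = {}
    for part_idx, rows in enumerate(partitions):
        for row_idx, row in enumerate(rows):
            ids[row] = (part_idx << 33) | row_idx
    return ids

before = monotonically_increasing_ids([["a", "b"], ["c"]])
after = monotonically_increasing_ids([["c"], ["a", "b"]])  # same rows, reshuffled
print(before["c"], after["c"])  # 8589934592 0
```

The same row "c" gets a different id once it lands in a different partition, which is the non-determinism the comment above describes.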
[jira] [Updated] (SPARK-17833) 'monotonicallyIncreasingId()' should be deterministic
[ https://issues.apache.org/jira/browse/SPARK-17833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-17833: -- Summary: 'monotonicallyIncreasingId()' should be deterministic (was: 'monotonicallyIncreasingId()' should probably be deterministic) > 'monotonicallyIncreasingId()' should be deterministic > - > > Key: SPARK-17833 > URL: https://issues.apache.org/jira/browse/SPARK-17833 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Kevin Ushey >Priority: Minor > > Right now, it's (IMHO) too easy to shoot yourself in the foot using > 'monotonicallyIncreasingId()', as it's easy to expect the generated numbers > to function as a 'stable' primary key, for example, and then go on to use > that key in e.g. 'joins' and so on. > Is there any reason why this function can't be made deterministic? Or, could > a deterministic analogue of this function be added (e.g. > 'withPrimaryKey(columnName = ...)')? > A solution is to immediately cache / persist the table after calling > 'monotonicallyIncreasingId()'; it's also possible that the documentation > should spell that out loud and clear. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17833) 'monotonicallyIncreasingId()' should be deterministic
[ https://issues.apache.org/jira/browse/SPARK-17833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-17833: -- Component/s: SQL > 'monotonicallyIncreasingId()' should be deterministic > - > > Key: SPARK-17833 > URL: https://issues.apache.org/jira/browse/SPARK-17833 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Kevin Ushey >Priority: Critical > > Right now, it's (IMHO) too easy to shoot yourself in the foot using > 'monotonicallyIncreasingId()', as it's easy to expect the generated numbers > to function as a 'stable' primary key, for example, and then go on to use > that key in e.g. 'joins' and so on. > Is there any reason why this function can't be made deterministic? Or, could > a deterministic analogue of this function be added (e.g. > 'withPrimaryKey(columnName = ...)')? > A solution is to immediately cache / persist the table after calling > 'monotonicallyIncreasingId()'; it's also possible that the documentation > should spell that out loud and clear. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17833) 'monotonicallyIncreasingId()' should be deterministic
[ https://issues.apache.org/jira/browse/SPARK-17833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-17833: -- Priority: Critical (was: Minor) > 'monotonicallyIncreasingId()' should be deterministic > - > > Key: SPARK-17833 > URL: https://issues.apache.org/jira/browse/SPARK-17833 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Kevin Ushey >Priority: Critical > > Right now, it's (IMHO) too easy to shoot yourself in the foot using > 'monotonicallyIncreasingId()', as it's easy to expect the generated numbers > to function as a 'stable' primary key, for example, and then go on to use > that key in e.g. 'joins' and so on. > Is there any reason why this function can't be made deterministic? Or, could > a deterministic analogue of this function be added (e.g. > 'withPrimaryKey(columnName = ...)')? > A solution is to immediately cache / persist the table after calling > 'monotonicallyIncreasingId()'; it's also possible that the documentation > should spell that out loud and clear. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20585) R generic hint support
[ https://issues.apache.org/jira/browse/SPARK-20585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20585: Assignee: Apache Spark > R generic hint support > -- > > Key: SPARK-20585 > URL: https://issues.apache.org/jira/browse/SPARK-20585 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.0 >Reporter: Reynold Xin >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20585) R generic hint support
[ https://issues.apache.org/jira/browse/SPARK-20585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20585: Assignee: (was: Apache Spark) > R generic hint support > -- > > Key: SPARK-20585 > URL: https://issues.apache.org/jira/browse/SPARK-20585 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.0 >Reporter: Reynold Xin > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20063) Trigger without delay when falling behind
[ https://issues.apache.org/jira/browse/SPARK-20063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-20063. --- Resolution: Duplicate > Trigger without delay when falling behind > -- > > Key: SPARK-20063 > URL: https://issues.apache.org/jira/browse/SPARK-20063 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Michael Armbrust >Priority: Critical > > Today, when we miss a trigger interval we wait until the next one to fire. > However, for real workloads this usually means that you fall further and > further behind by sitting idle while waiting. We should revisit this > decision. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20585) R generic hint support
[ https://issues.apache.org/jira/browse/SPARK-20585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15995965#comment-15995965 ] Apache Spark commented on SPARK-20585: -- User 'zero323' has created a pull request for this issue: https://github.com/apache/spark/pull/17851 > R generic hint support > -- > > Key: SPARK-20585 > URL: https://issues.apache.org/jira/browse/SPARK-20585 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.0 >Reporter: Reynold Xin > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20582) Speed up the restart of HistoryServer using ApplicationAttemptInfo checkpointing
[ https://issues.apache.org/jira/browse/SPARK-20582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15995958#comment-15995958 ] Marcelo Vanzin commented on SPARK-20582: You're talking about checkpointing the application info, which the code I have already does. With my changes the HTTP server comes up immediately. > Speed up the restart of HistoryServer using ApplicationAttemptInfo > checkpointing > > > Key: SPARK-20582 > URL: https://issues.apache.org/jira/browse/SPARK-20582 > Project: Spark > Issue Type: Improvement > Components: Deploy >Affects Versions: 2.1.0 >Reporter: zhoukang >Priority: Critical > >In the current code of HistoryServer,jetty server will be started > after all the logs fetch from yarn has been replayed.However,when the number > of logs is becoming larger,the start time of jetty will be too long. >Here,we implement a solution which using ApplicationAttemptInfo > checkpointing to speed up the start of historyserver.When historyserver is > starting,it will load ApplicationAttemptInfo from checkpoint file first if > exists which is faster then replaying one by one. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20582) Speed up the restart of HistoryServer using ApplicationAttemptInfo checkpointing
[ https://issues.apache.org/jira/browse/SPARK-20582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15995951#comment-15995951 ] zhoukang commented on SPARK-20582: -- [~vanzin], I think you misunderstand this issue. The goal is to let the jetty server listen on its port faster; when there are many logs to replay, startup can take tens of minutes. > Speed up the restart of HistoryServer using ApplicationAttemptInfo > checkpointing > > > Key: SPARK-20582 > URL: https://issues.apache.org/jira/browse/SPARK-20582 > Project: Spark > Issue Type: Improvement > Components: Deploy >Affects Versions: 2.1.0 >Reporter: zhoukang >Priority: Critical > >In the current code of HistoryServer,jetty server will be started > after all the logs fetch from yarn has been replayed.However,when the number > of logs is becoming larger,the start time of jetty will be too long. >Here,we implement a solution which using ApplicationAttemptInfo > checkpointing to speed up the start of historyserver.When historyserver is > starting,it will load ApplicationAttemptInfo from checkpoint file first if > exists which is faster then replaying one by one.
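The checkpoint-first startup being proposed can be sketched roughly as follows; the file format and function names are illustrative, not the actual HistoryServer code:

```python
import os
import pickle
import tempfile

def load_or_replay(checkpoint_path, replay):
    """Serve the application-attempt listing from a checkpoint when one
    exists, falling back to a full (slow) replay of the event logs
    otherwise, and checkpointing the result for the next restart."""
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, "rb") as f:
            return pickle.load(f)
    listing = replay()
    with open(checkpoint_path, "wb") as f:
        pickle.dump(listing, f)
    return listing

path = os.path.join(tempfile.mkdtemp(), "attempts.ckpt")
calls = {"replay": 0}

def replay():
    calls["replay"] += 1  # in reality: replay every event log, one by one
    return {"app-1": "attempt-info"}

first = load_or_replay(path, replay)   # cold start: replays and checkpoints
second = load_or_replay(path, replay)  # restart: served from the checkpoint
print(second, calls["replay"])  # {'app-1': 'attempt-info'} 1
```

A real implementation would still replay logs that arrived after the checkpoint was written; this sketch only shows the fast-path load.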
[jira] [Commented] (SPARK-20588) from_utc_timestamp causes bottleneck
[ https://issues.apache.org/jira/browse/SPARK-20588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15995939#comment-15995939 ] Hyukjin Kwon commented on SPARK-20588: -- Hm, I thought {{TimeZone.getTimeZone}} itself caches once it computes. > from_utc_timestamp causes bottleneck > > > Key: SPARK-20588 > URL: https://issues.apache.org/jira/browse/SPARK-20588 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.2 > Environment: AWS EMR AMI 5.2.1 >Reporter: Ameen Tayyebi > > We have a SQL query that makes use of the from_utc_timestamp function like > so: from_utc_timestamp(itemSigningTime,'America/Los_Angeles') > This causes a major bottleneck. Our exact call is: > date_add(from_utc_timestamp(itemSigningTime,'America/Los_Angeles'), 1) > Switching from the above to date_add(itemSigningTime, 1) reduces the job > running time from 40 minutes to 9. > When from_utc_timestamp function is used, several threads in the executors > are in the BLOCKED state, on this call stack: > "Executor task launch worker-63" #261 daemon prio=5 os_prio=0 > tid=0x7f848472e000 nid=0x4294 waiting for monitor entry > [0x7f501981c000] >java.lang.Thread.State: BLOCKED (on object monitor) > at java.util.TimeZone.getTimeZone(TimeZone.java:516) > - waiting to lock <0x7f5216c2aa58> (a java.lang.Class for > java.util.TimeZone) > at > org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTimestamp(DateTimeUtils.scala:356) > at > org.apache.spark.sql.catalyst.util.DateTimeUtils.stringToTimestamp(DateTimeUtils.scala) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:161) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Can we cache the time zone lookups once per JVM so that we don't do this for every > record? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
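The stack above shows executor threads blocked on `java.util.TimeZone.getTimeZone`, which synchronizes on the `TimeZone` class in the JDK version used here. The per-JVM cache the reporter asks about could look like the following sketch; the `TimeZoneCache` class is illustrative, not Spark's eventual fix.

```java
import java.util.TimeZone;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: look each zone id up once and reuse the instance, so the contended
// class-level lock inside TimeZone.getTimeZone is taken at most once per id
// instead of once per record.
public class TimeZoneCache {
    private static final ConcurrentHashMap<String, TimeZone> CACHE = new ConcurrentHashMap<>();

    public static TimeZone get(String id) {
        // computeIfAbsent invokes the locking lookup only on the first miss
        return CACHE.computeIfAbsent(id, TimeZone::getTimeZone);
    }

    public static void main(String[] args) {
        TimeZone la = get("America/Los_Angeles");
        // Later lookups return the same cached instance without touching the lock
        System.out.println(la == get("America/Los_Angeles")); // prints "true"
    }
}
```

One caveat worth noting: `TimeZone` instances are mutable (`setID`, `setRawOffset`), so a shared cache is only safe if callers treat the returned instance as read-only.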
[jira] [Commented] (SPARK-20213) DataFrameWriter operations do not show up in SQL tab
[ https://issues.apache.org/jira/browse/SPARK-20213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15995928#comment-15995928 ] Ryan Blue commented on SPARK-20213: --- [~zsxwing], the PR adds a method that causes tests to fail if they aren't wrapped. If you remove the additional high-level calls to withNewExecutionId that I added, you can see all the test failures. > DataFrameWriter operations do not show up in SQL tab > > > Key: SPARK-20213 > URL: https://issues.apache.org/jira/browse/SPARK-20213 > Project: Spark > Issue Type: Bug > Components: SQL, Web UI >Affects Versions: 2.0.2, 2.1.0 >Reporter: Ryan Blue > Attachments: Screen Shot 2017-05-03 at 5.00.19 PM.png > > > In 1.6.1, {{DataFrame}} writes started using {{DataFrameWriter}} actions like > {{insertInto}} would show up in the SQL tab. In 2.0.0 and later, they no > longer do. The problem is that 2.0.0 and later no longer wrap execution with > {{SQLExecution.withNewExecutionId}}, which emits > {{SparkListenerSQLExecutionStart}}. 
> Here are the relevant parts of the stack traces: > {code:title=Spark 1.6.1} > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130) > org.apache.spark.sql.execution.QueryExecution$$anonfun$toRdd$1.apply(QueryExecution.scala:56) > org.apache.spark.sql.execution.QueryExecution$$anonfun$toRdd$1.apply(QueryExecution.scala:56) > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:53) > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:56) > => holding > Monitor(org.apache.spark.sql.hive.HiveContext$QueryExecution@424773807}) > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55) > org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:196) > {code} > {code:title=Spark 2.0.0} > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114) > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86) > => holding Monitor(org.apache.spark.sql.execution.QueryExecution@490977924}) > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86) > org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:301) > {code} > I think this was introduced by > [54d23599|https://github.com/apache/spark/commit/54d23599]. The fix should be > to add withNewExecutionId to > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala#L610 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
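The role `SQLExecution.withNewExecutionId` plays in the 1.6.1 trace above can be sketched as follows. This is a simplified illustration of the wrap-and-emit-events pattern, not Spark's actual `SQLExecution` implementation; the event strings and the `ExecutionTracker` class are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;

// Sketch: allocate an execution id, post a "start" event, run the body, and
// always post "end" in a finally block, so listeners (like the SQL tab's
// backing listener) can attribute the work to one execution.
public class ExecutionTracker {
    private static final AtomicLong NEXT_ID = new AtomicLong();
    static final List<String> events = new ArrayList<>();

    static <T> T withNewExecutionId(String description, Supplier<T> body) {
        long id = NEXT_ID.incrementAndGet();
        events.add("start:" + id + ":" + description);
        try {
            return body.get();
        } finally {
            events.add("end:" + id); // emitted even if the body throws
        }
    }

    public static void main(String[] args) {
        int rows = withNewExecutionId("insertInto", () -> 42); // stand-in for toRdd/execute
        System.out.println(rows + " " + events);
    }
}
```

In the 2.0.0 trace, `DataFrameWriter.insertInto` reaches `QueryExecution.toRdd` without any such wrapper on the stack, so no start event is ever posted and the SQL tab has nothing to display.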
[jira] [Comment Edited] (SPARK-20213) DataFrameWriter operations do not show up in SQL tab
[ https://issues.apache.org/jira/browse/SPARK-20213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15995923#comment-15995923 ] Shixiong Zhu edited comment on SPARK-20213 at 5/4/17 12:02 AM: --- I tested the master branch, and I can see "insertInto" in SQL tab. !Screen Shot 2017-05-03 at 5.00.19 PM.png|width=300! Could you clarify the issue? It would be great if you can provide a reproducer. was (Author: zsxwing): I tested the master branch, and I can see "insertInto" in SQL tab. Could you clarify the issue? It would be great if you can provide a reproducer. > DataFrameWriter operations do not show up in SQL tab > > > Key: SPARK-20213 > URL: https://issues.apache.org/jira/browse/SPARK-20213 > Project: Spark > Issue Type: Bug > Components: SQL, Web UI >Affects Versions: 2.0.2, 2.1.0 >Reporter: Ryan Blue > Attachments: Screen Shot 2017-05-03 at 5.00.19 PM.png > > > In 1.6.1, {{DataFrame}} writes started using {{DataFrameWriter}} actions like > {{insertInto}} would show up in the SQL tab. In 2.0.0 and later, they no > longer do. The problem is that 2.0.0 and later no longer wrap execution with > {{SQLExecution.withNewExecutionId}}, which emits > {{SparkListenerSQLExecutionStart}}. 
> Here are the relevant parts of the stack traces: > {code:title=Spark 1.6.1} > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130) > org.apache.spark.sql.execution.QueryExecution$$anonfun$toRdd$1.apply(QueryExecution.scala:56) > org.apache.spark.sql.execution.QueryExecution$$anonfun$toRdd$1.apply(QueryExecution.scala:56) > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:53) > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:56) > => holding > Monitor(org.apache.spark.sql.hive.HiveContext$QueryExecution@424773807}) > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55) > org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:196) > {code} > {code:title=Spark 2.0.0} > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114) > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86) > => holding Monitor(org.apache.spark.sql.execution.QueryExecution@490977924}) > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86) > org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:301) > {code} > I think this was introduced by > [54d23599|https://github.com/apache/spark/commit/54d23599]. The fix should be > to add withNewExecutionId to > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala#L610 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20213) DataFrameWriter operations do not show up in SQL tab
[ https://issues.apache.org/jira/browse/SPARK-20213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15995923#comment-15995923 ] Shixiong Zhu commented on SPARK-20213: -- I tested the master branch, and I can see "insertInto" in SQL tab. Could you clarify the issue? It would be great if you can provide a reproducer. > DataFrameWriter operations do not show up in SQL tab > > > Key: SPARK-20213 > URL: https://issues.apache.org/jira/browse/SPARK-20213 > Project: Spark > Issue Type: Bug > Components: SQL, Web UI >Affects Versions: 2.0.2, 2.1.0 >Reporter: Ryan Blue > > In 1.6.1, {{DataFrame}} writes started using {{DataFrameWriter}} actions like > {{insertInto}} would show up in the SQL tab. In 2.0.0 and later, they no > longer do. The problem is that 2.0.0 and later no longer wrap execution with > {{SQLExecution.withNewExecutionId}}, which emits > {{SparkListenerSQLExecutionStart}}. > Here are the relevant parts of the stack traces: > {code:title=Spark 1.6.1} > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130) > org.apache.spark.sql.execution.QueryExecution$$anonfun$toRdd$1.apply(QueryExecution.scala:56) > org.apache.spark.sql.execution.QueryExecution$$anonfun$toRdd$1.apply(QueryExecution.scala:56) > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:53) > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:56) > => holding > Monitor(org.apache.spark.sql.hive.HiveContext$QueryExecution@424773807}) > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55) > org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:196) > {code} > {code:title=Spark 2.0.0} > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114) > 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86) > => holding Monitor(org.apache.spark.sql.execution.QueryExecution@490977924}) > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86) > org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:301) > {code} > I think this was introduced by > [54d23599|https://github.com/apache/spark/commit/54d23599]. The fix should be > to add withNewExecutionId to > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala#L610 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20213) DataFrameWriter operations do not show up in SQL tab
[ https://issues.apache.org/jira/browse/SPARK-20213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-20213: - Attachment: Screen Shot 2017-05-03 at 5.00.19 PM.png > DataFrameWriter operations do not show up in SQL tab > > > Key: SPARK-20213 > URL: https://issues.apache.org/jira/browse/SPARK-20213 > Project: Spark > Issue Type: Bug > Components: SQL, Web UI >Affects Versions: 2.0.2, 2.1.0 >Reporter: Ryan Blue > Attachments: Screen Shot 2017-05-03 at 5.00.19 PM.png > > > In 1.6.1, {{DataFrame}} writes started using {{DataFrameWriter}} actions like > {{insertInto}} would show up in the SQL tab. In 2.0.0 and later, they no > longer do. The problem is that 2.0.0 and later no longer wrap execution with > {{SQLExecution.withNewExecutionId}}, which emits > {{SparkListenerSQLExecutionStart}}. > Here are the relevant parts of the stack traces: > {code:title=Spark 1.6.1} > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130) > org.apache.spark.sql.execution.QueryExecution$$anonfun$toRdd$1.apply(QueryExecution.scala:56) > org.apache.spark.sql.execution.QueryExecution$$anonfun$toRdd$1.apply(QueryExecution.scala:56) > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:53) > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:56) > => holding > Monitor(org.apache.spark.sql.hive.HiveContext$QueryExecution@424773807}) > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55) > org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:196) > {code} > {code:title=Spark 2.0.0} > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114) > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86) > => holding 
Monitor(org.apache.spark.sql.execution.QueryExecution@490977924}) > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86) > org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:301) > {code} > I think this was introduced by > [54d23599|https://github.com/apache/spark/commit/54d23599]. The fix should be > to add withNewExecutionId to > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala#L610 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17939) Spark-SQL Nullability: Optimizations vs. Enforcement Clarification
[ https://issues.apache.org/jira/browse/SPARK-17939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-17939: - Target Version/s: 2.3.0 (was: 2.2.0) > Spark-SQL Nullability: Optimizations vs. Enforcement Clarification > -- > > Key: SPARK-17939 > URL: https://issues.apache.org/jira/browse/SPARK-17939 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Aleksander Eskilson >Priority: Critical > > The notion of Nullability of of StructFields in DataFrames and Datasets > creates some confusion. As has been pointed out previously [1], Nullability > is a hint to the Catalyst optimizer, and is not meant to be a type-level > enforcement. Allowing null fields can also help the reader successfully parse > certain types of more loosely-typed data, like JSON and CSV, where null > values are common, rather than just failing. > There's already been some movement to clarify the meaning of Nullable in the > API, but also some requests for a (perhaps completely separate) type-level > implementation of Nullable that can act as an enforcement contract. > This bug is logged here to discuss and clarify this issue. > [1] - > [https://issues.apache.org/jira/browse/SPARK-11319|https://issues.apache.org/jira/browse/SPARK-11319?focusedCommentId=15014535=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15014535] > [2] - https://github.com/apache/spark/pull/11785 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20584) Python generic hint support
[ https://issues.apache.org/jira/browse/SPARK-20584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20584: Assignee: (was: Apache Spark) > Python generic hint support > --- > > Key: SPARK-20584 > URL: https://issues.apache.org/jira/browse/SPARK-20584 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.0 >Reporter: Reynold Xin > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20584) Python generic hint support
[ https://issues.apache.org/jira/browse/SPARK-20584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20584: Assignee: Apache Spark > Python generic hint support > --- > > Key: SPARK-20584 > URL: https://issues.apache.org/jira/browse/SPARK-20584 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.0 >Reporter: Reynold Xin >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20584) Python generic hint support
[ https://issues.apache.org/jira/browse/SPARK-20584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15995910#comment-15995910 ] Apache Spark commented on SPARK-20584: -- User 'zero323' has created a pull request for this issue: https://github.com/apache/spark/pull/17850 > Python generic hint support > --- > > Key: SPARK-20584 > URL: https://issues.apache.org/jira/browse/SPARK-20584 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.0 >Reporter: Reynold Xin > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10931) PySpark ML Models should contain Param values
[ https://issues.apache.org/jira/browse/SPARK-10931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15995862#comment-15995862 ] Apache Spark commented on SPARK-10931: -- User 'BryanCutler' has created a pull request for this issue: https://github.com/apache/spark/pull/17849 > PySpark ML Models should contain Param values > - > > Key: SPARK-10931 > URL: https://issues.apache.org/jira/browse/SPARK-10931 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Joseph K. Bradley > > PySpark spark.ml Models are generally wrappers around Java objects and do not > even contain Param values. This JIRA is for copying the Param values from > the Estimator to the model. > This can likely be solved by modifying Estimator.fit to copy Param values, > but should also include proper unit tests. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20586) Add deterministic and distinctLike to ScalaUDF
[ https://issues.apache.org/jira/browse/SPARK-20586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20586: Assignee: Apache Spark (was: Xiao Li) > Add deterministic and distinctLike to ScalaUDF > -- > > Key: SPARK-20586 > URL: https://issues.apache.org/jira/browse/SPARK-20586 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Xiao Li >Assignee: Apache Spark > > https://hive.apache.org/javadocs/r2.0.1/api/org/apache/hadoop/hive/ql/udf/UDFType.html > Like Hive UDFType, we should allow users to add the extra flags for ScalaUDF > too. {{stateful}}/{{impliesOrder}} are not applicable to ScalaUDF. Thus, we > only add the following two flags. > - deterministic: Certain optimizations should not be applied if UDF is not > deterministic. Deterministic UDF returns same result each time it is invoked > with a particular input. This determinism just needs to hold within the > context of a query. > - distinctLike: A UDF is considered distinctLike if the UDF can be evaluated > on just the distinct values of a column. Examples include min and max UDFs. > This information is used by metadata-only optimizer. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20586) Add deterministic and distinctLike to ScalaUDF
[ https://issues.apache.org/jira/browse/SPARK-20586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20586: Assignee: Xiao Li (was: Apache Spark) > Add deterministic and distinctLike to ScalaUDF > -- > > Key: SPARK-20586 > URL: https://issues.apache.org/jira/browse/SPARK-20586 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Xiao Li >Assignee: Xiao Li > > https://hive.apache.org/javadocs/r2.0.1/api/org/apache/hadoop/hive/ql/udf/UDFType.html > Like Hive UDFType, we should allow users to add the extra flags for ScalaUDF > too. {{stateful}}/{{impliesOrder}} are not applicable to ScalaUDF. Thus, we > only add the following two flags. > - deterministic: Certain optimizations should not be applied if UDF is not > deterministic. Deterministic UDF returns same result each time it is invoked > with a particular input. This determinism just needs to hold within the > context of a query. > - distinctLike: A UDF is considered distinctLike if the UDF can be evaluated > on just the distinct values of a column. Examples include min and max UDFs. > This information is used by metadata-only optimizer. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20586) Add deterministic and distinctLike to ScalaUDF
[ https://issues.apache.org/jira/browse/SPARK-20586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15995811#comment-15995811 ] Apache Spark commented on SPARK-20586: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/17848 > Add deterministic and distinctLike to ScalaUDF > -- > > Key: SPARK-20586 > URL: https://issues.apache.org/jira/browse/SPARK-20586 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Xiao Li >Assignee: Xiao Li > > https://hive.apache.org/javadocs/r2.0.1/api/org/apache/hadoop/hive/ql/udf/UDFType.html > Like Hive UDFType, we should allow users to add the extra flags for ScalaUDF > too. {{stateful}}/{{impliesOrder}} are not applicable to ScalaUDF. Thus, we > only add the following two flags. > - deterministic: Certain optimizations should not be applied if UDF is not > deterministic. Deterministic UDF returns same result each time it is invoked > with a particular input. This determinism just needs to hold within the > context of a query. > - distinctLike: A UDF is considered distinctLike if the UDF can be evaluated > on just the distinct values of a column. Examples include min and max UDFs. > This information is used by metadata-only optimizer. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
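The benefit of the proposed {{distinctLike}} flag can be illustrated with a small sketch: a deterministic UDF that the optimizer knows is distinctLike need only be evaluated once per distinct input value rather than once per row. The mini-evaluator below is illustrative, not Spark's optimizer.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.IntUnaryOperator;

// Sketch: evaluate a deterministic, distinctLike UDF over only the distinct
// values of a column, reusing each result for every repeated row.
public class DistinctLikeEval {
    static int calls = 0; // how many times the UDF body actually ran

    static List<Integer> applyDistinctLike(List<Integer> column, IntUnaryOperator udf) {
        Map<Integer, Integer> memo = new HashMap<>();
        List<Integer> out = new ArrayList<>();
        for (int v : column) {
            out.add(memo.computeIfAbsent(v, x -> { calls++; return udf.applyAsInt(x); }));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> res = applyDistinctLike(List.of(1, 2, 1, 2, 3), x -> x * x);
        // 5 rows but only 3 distinct inputs, so the UDF runs 3 times
        System.out.println(res + " calls=" + calls);
    }
}
```

This rewrite is only sound because of the other flag: a non-deterministic UDF may return different results for equal inputs, so the optimizer must not apply it.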
[jira] [Assigned] (SPARK-20590) Map default input data source formats to inlined classes
[ https://issues.apache.org/jira/browse/SPARK-20590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20590: Assignee: (was: Apache Spark) > Map default input data source formats to inlined classes > > > Key: SPARK-20590 > URL: https://issues.apache.org/jira/browse/SPARK-20590 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Sameer Agarwal > > One of the common usability problems around reading data in spark > (particularly CSV) is that there can often be a conflict between different > readers in the classpath. > As an example, if someone launches a 2.x spark shell with the spark-csv > package in the classpath, Spark currently fails in an extremely unfriendly way > {code} > ./bin/spark-shell --packages com.databricks:spark-csv_2.11:1.5.0 > scala> val df = spark.read.csv("/foo/bar.csv") > java.lang.RuntimeException: Multiple sources found for csv > (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat, > com.databricks.spark.csv.DefaultSource15), please specify the fully qualified > class name. > at scala.sys.package$.error(package.scala:27) > at > org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:574) > at > org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:85) > at > org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:85) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:295) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178) > at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:533) > at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:412) > ... 48 elided > {code} > This JIRA proposes a simple way of fixing this error by always mapping > default input data source formats to inlined classes (that exist in Spark). 
> {code} > ./bin/spark-shell --packages com.databricks:spark-csv_2.11:1.5.0 > scala> val df = spark.read.csv("/foo/bar.csv") > df: org.apache.spark.sql.DataFrame = [_c0: string] > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20590) Map default input data source formats to inlined classes
[ https://issues.apache.org/jira/browse/SPARK-20590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20590: Assignee: Apache Spark > Map default input data source formats to inlined classes > > > Key: SPARK-20590 > URL: https://issues.apache.org/jira/browse/SPARK-20590 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Sameer Agarwal >Assignee: Apache Spark > > One of the common usability problems around reading data in spark > (particularly CSV) is that there can often be a conflict between different > readers in the classpath. > As an example, if someone launches a 2.x spark shell with the spark-csv > package in the classpath, Spark currently fails in an extremely unfriendly way > {code} > ./bin/spark-shell --packages com.databricks:spark-csv_2.11:1.5.0 > scala> val df = spark.read.csv("/foo/bar.csv") > java.lang.RuntimeException: Multiple sources found for csv > (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat, > com.databricks.spark.csv.DefaultSource15), please specify the fully qualified > class name. > at scala.sys.package$.error(package.scala:27) > at > org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:574) > at > org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:85) > at > org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:85) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:295) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178) > at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:533) > at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:412) > ... 48 elided > {code} > This JIRA proposes a simple way of fixing this error by always mapping > default input data source formats to inlined classes (that exist in Spark). 
> {code} > ./bin/spark-shell --packages com.databricks:spark-csv_2.11:1.5.0 > scala> val df = spark.read.csv("/foo/bar.csv") > df: org.apache.spark.sql.DataFrame = [_c0: string] > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20590) Map default input data source formats to inlined classes
[ https://issues.apache.org/jira/browse/SPARK-20590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15995799#comment-15995799 ] Apache Spark commented on SPARK-20590: -- User 'sameeragarwal' has created a pull request for this issue: https://github.com/apache/spark/pull/17847 > Map default input data source formats to inlined classes > > > Key: SPARK-20590 > URL: https://issues.apache.org/jira/browse/SPARK-20590 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Sameer Agarwal > > One of the common usability problems around reading data in spark > (particularly CSV) is that there can often be a conflict between different > readers in the classpath. > As an example, if someone launches a 2.x spark shell with the spark-csv > package in the classpath, Spark currently fails in an extremely unfriendly way > {code} > ./bin/spark-shell --packages com.databricks:spark-csv_2.11:1.5.0 > scala> val df = spark.read.csv("/foo/bar.csv") > java.lang.RuntimeException: Multiple sources found for csv > (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat, > com.databricks.spark.csv.DefaultSource15), please specify the fully qualified > class name. > at scala.sys.package$.error(package.scala:27) > at > org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:574) > at > org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:85) > at > org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:85) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:295) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178) > at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:533) > at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:412) > ... 
48 elided > {code} > This JIRA proposes a simple way of fixing this error by always mapping > default input data source formats to inlined classes (that exist in Spark). > {code} > ./bin/spark-shell --packages com.databricks:spark-csv_2.11:1.5.0 > scala> val df = spark.read.csv("/foo/bar.csv") > df: org.apache.spark.sql.DataFrame = [_c0: string] > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20590) Map default input data source formats to inlined classes
Sameer Agarwal created SPARK-20590: -- Summary: Map default input data source formats to inlined classes Key: SPARK-20590 URL: https://issues.apache.org/jira/browse/SPARK-20590 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Sameer Agarwal One of the common usability problems around reading data in spark (particularly CSV) is that there can often be a conflict between different readers in the classpath. As an example, if someone launches a 2.x spark shell with the spark-csv package in the classpath, Spark currently fails in an extremely unfriendly way {code} ./bin/spark-shell --packages com.databricks:spark-csv_2.11:1.5.0 scala> val df = spark.read.csv("/foo/bar.csv") java.lang.RuntimeException: Multiple sources found for csv (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat, com.databricks.spark.csv.DefaultSource15), please specify the fully qualified class name. at scala.sys.package$.error(package.scala:27) at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:574) at org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:85) at org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:85) at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:295) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178) at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:533) at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:412) ... 48 elided {code} This JIRA proposes a simple way of fixing this error by always mapping default input data source formats to inlined classes (that exist in Spark). 
{code} ./bin/spark-shell --packages com.databricks:spark-csv_2.11:1.5.0 scala> val df = spark.read.csv("/foo/bar.csv") df: org.apache.spark.sql.DataFrame = [_c0: string] {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
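The proposed fix can be sketched as a small resolution step: when a short format name maps to several provider classes on the classpath, prefer the implementation inlined in Spark rather than throwing. The provider class names below are the real ones from the error message, but the `SourceResolver` logic is a simplification of `DataSource.lookupDataSource`, not the actual patch.

```java
import java.util.List;
import java.util.Map;

// Sketch: on a short-name conflict, fall back to the built-in Spark class
// instead of failing with "Multiple sources found".
public class SourceResolver {
    // Short name -> the inlined implementation to prefer when several match
    static final Map<String, String> BUILT_IN = Map.of(
        "csv", "org.apache.spark.sql.execution.datasources.csv.CSVFileFormat");

    static String resolve(String shortName, List<String> candidates) {
        if (candidates.size() == 1) return candidates.get(0);
        String preferred = BUILT_IN.get(shortName);
        if (preferred != null && candidates.contains(preferred)) return preferred;
        throw new RuntimeException("Multiple sources found for " + shortName);
    }

    public static void main(String[] args) {
        // The exact conflict from the issue: Spark's CSV reader vs. spark-csv's
        String chosen = resolve("csv", List.of(
            "org.apache.spark.sql.execution.datasources.csv.CSVFileFormat",
            "com.databricks.spark.csv.DefaultSource15"));
        System.out.println(chosen);
    }
}
```

Users who genuinely want the external package can still name its class fully, since only the ambiguous short name is redirected.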
[jira] [Commented] (SPARK-20589) Allow limiting task concurrency per stage
[ https://issues.apache.org/jira/browse/SPARK-20589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15995650#comment-15995650 ] Thomas Graves commented on SPARK-20589: --- thanks for the suggestions [~mridulm80]. Note just for reference tez did a similar thing under https://issues.apache.org/jira/browse/TEZ-2914, may be useful for the yarn side of things here. > Allow limiting task concurrency per stage > - > > Key: SPARK-20589 > URL: https://issues.apache.org/jira/browse/SPARK-20589 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: Thomas Graves > > It would be nice to have the ability to limit the number of concurrent tasks > per stage. This is useful when your spark job might be accessing another > service and you don't want to DOS that service. For instance Spark writing > to hbase or Spark doing http puts on a service. Many times you want to do > this without limiting the number of partitions. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
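Pending a scheduler-level feature, the behavior requested here can be approximated inside task code with a shared permit pool. This is an application-level sketch of the idea (a semaphore bounding concurrent calls to the downstream service), not the per-stage scheduler limit the issue proposes.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: 16 "tasks" run on 8 threads, but a semaphore caps the number
// touching the external service (e.g. HBase, an HTTP endpoint) at 2,
// independent of the partition count.
public class BoundedStage {
    public static void main(String[] args) throws Exception {
        Semaphore permits = new Semaphore(2); // concurrency limit
        AtomicInteger running = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(8);
        for (int i = 0; i < 16; i++) {
            pool.submit(() -> {
                permits.acquireUninterruptibly(); // blocks once 2 are in flight
                try {
                    peak.accumulateAndGet(running.incrementAndGet(), Math::max);
                    Thread.sleep(20); // simulated call to the external service
                } catch (InterruptedException ignored) {
                } finally {
                    running.decrementAndGet();
                    permits.release();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        System.out.println(peak.get() <= 2); // the observed peak never exceeds the limit
    }
}
```

The drawback, and the reason for the feature request, is that this holds executor threads blocked on permits, whereas a scheduler limit would simply not launch the extra tasks.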
[jira] [Updated] (SPARK-20588) from_utc_timestamp causes bottleneck
[ https://issues.apache.org/jira/browse/SPARK-20588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-20588: -- Issue Type: Improvement (was: Bug) > from_utc_timestamp causes bottleneck > > > Key: SPARK-20588 > URL: https://issues.apache.org/jira/browse/SPARK-20588 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.2 > Environment: AWS EMR AMI 5.2.1 >Reporter: Ameen Tayyebi > > We have a SQL query that makes use of the from_utc_timestamp function like > so: from_utc_timestamp(itemSigningTime,'America/Los_Angeles') > This causes a major bottleneck. Our exact call is: > date_add(from_utc_timestamp(itemSigningTime,'America/Los_Angeles'), 1) > Switching from the above to date_add(itemSigningTime, 1) reduces the job > running time from 40 minutes to 9. > When from_utc_timestamp function is used, several threads in the executors > are in the BLOCKED state, on this call stack: > "Executor task launch worker-63" #261 daemon prio=5 os_prio=0 > tid=0x7f848472e000 nid=0x4294 waiting for monitor entry > [0x7f501981c000] >java.lang.Thread.State: BLOCKED (on object monitor) > at java.util.TimeZone.getTimeZone(TimeZone.java:516) > - waiting to lock <0x7f5216c2aa58> (a java.lang.Class for > java.util.TimeZone) > at > org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTimestamp(DateTimeUtils.scala:356) > at > org.apache.spark.sql.catalyst.util.DateTimeUtils.stringToTimestamp(DateTimeUtils.scala) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at 
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:161) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Can we cache the locales once per JVM so that we don't do this for every > record?
[jira] [Commented] (SPARK-20589) Allow limiting task concurrency per stage
[ https://issues.apache.org/jira/browse/SPARK-20589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15995550#comment-15995550 ] Mridul Muralidharan commented on SPARK-20589: - coalesce with shuffle=false might be a workaround if the source is already persisted? (I see the benefit of the jira, just wondering if this will unblock you!) > Allow limiting task concurrency per stage > - > > Key: SPARK-20589 > URL: https://issues.apache.org/jira/browse/SPARK-20589 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: Thomas Graves > > It would be nice to have the ability to limit the number of concurrent tasks > per stage. This is useful when your spark job might be accessing another > service and you don't want to DOS that service. For instance Spark writing > to hbase or Spark doing http puts on a service. Many times you want to do > this without limiting the number of partitions.
[jira] [Created] (SPARK-20589) Allow limiting task concurrency per stage
Thomas Graves created SPARK-20589: - Summary: Allow limiting task concurrency per stage Key: SPARK-20589 URL: https://issues.apache.org/jira/browse/SPARK-20589 Project: Spark Issue Type: Improvement Components: Scheduler Affects Versions: 2.1.0 Reporter: Thomas Graves It would be nice to have the ability to limit the number of concurrent tasks per stage. This is useful when your spark job might be accessing another service and you don't want to DOS that service. For instance Spark writing to hbase or Spark doing http puts on a service. Many times you want to do this without limiting the number of partitions.
[jira] [Comment Edited] (SPARK-19243) Error when selecting from DataFrame containing parsed data from files larger than 1MB
[ https://issues.apache.org/jira/browse/SPARK-19243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15995489#comment-15995489 ] Harish edited comment on SPARK-19243 at 5/3/17 7:52 PM: I am getting the same error in Spark 2.1.0. I have a 10-node cluster with 109GB each. My data set is just 30K rows with 60 columns. I see 72 partitions in total after loading the ORC file to a DF, then re-partitioned to 2001. No luck. [~srowen] has anyone raised a similar issue? Regards, Harish was (Author: harishk15): I am getting the same error in Spark 2.1.0. I have a 10-node cluster with 109GB each. My data set is just 30K rows with 60 columns. I see 72 partitions in total after loading the ORC file to a DF, then re-partitioned to 2001. No luck. Regards, Harish > Error when selecting from DataFrame containing parsed data from files larger > than 1MB > - > > Key: SPARK-19243 > URL: https://issues.apache.org/jira/browse/SPARK-19243 > Project: Spark > Issue Type: Bug >Reporter: Ben > > I hope I can describe the problem clearly. This error happens with Spark > 2.0.1. However, I tried with Spark 2.1.0 on my test PC and it worked there, > none of the issues below, but I can't try it on the test cluster because > Spark needs to be upgraded there. I'm opening this ticket because if it's a > bug, maybe something is still partially present in Spark 2.1.0. > Initially I thought it was my script's problem so I tried to debug, until I > found why this is happening. > Step by step, I load XML files through spark-xml into a DataFrame. In my > case, the rowTag is the root tag, so each XML file creates a row. The XML > structure is fairly complex, and it is converted to nested columns or arrays > inside the DF. Since I need to flatten the whole table, and since the output > is not fixed but I dynamically select what I want as output, in case I need > to output columns that have been parsed as arrays, then I explode them with > explode() only when needed. 
> Normally I can select various columns that don't have many entries without a > problem. > I select a column that has a lot of entries into a new DF, e.g. simply through > {noformat} > df2 = df.select(...) > {noformat} > and then if I try to do a count() or first() or anything, Spark behaves two > ways: > 1. If the source file was smaller than 1MB, it works. > 2. If the source file was larger than 1MB, the following error occurs: > {noformat} > Traceback (most recent call last): > File \"/myCode.py\", line 71, in main > df.count() > File > \"/usr/hdp/current/spark-client/python/lib/pyspark.zip/pyspark/sql/dataframe.py\", > line 299, in count > return int(self._jdf.count()) > File > \"/usr/hdp/current/spark-client/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py\", > line 1133, in __call__ > answer, self.gateway_client, self.target_id, self.name) > File > \"/usr/hdp/current/spark-client/python/lib/pyspark.zip/pyspark/sql/utils.py\", > line 63, in deco > return f(*a, **kw) > File > \"/usr/hdp/current/spark-client/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py\", > line 319, in get_return_value > format(target_id, \".\", name), value) > Py4JJavaError: An error occurred while calling o180.count. 
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 > in stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0 > (TID 6, compname): java.lang.IllegalArgumentException: Size exceeds > Integer.MAX_VALUE > at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:869) > at > org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:103) > at > org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:91) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1307) > at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:105) > at > org.apache.spark.storage.BlockManager.getLocalValues(BlockManager.scala:438) > at org.apache.spark.storage.BlockManager.get(BlockManager.scala:606) > at > org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:663) > at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:281) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at
[jira] [Commented] (SPARK-20568) Delete files after processing in structured streaming
[ https://issues.apache.org/jira/browse/SPARK-20568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15995507#comment-15995507 ] Shixiong Zhu commented on SPARK-20568: -- [~srowen] Structured Streaming's Source has a "commit" method to let a source discard unused data. This should be easy to implement in Structured Streaming. > Delete files after processing in structured streaming > - > > Key: SPARK-20568 > URL: https://issues.apache.org/jira/browse/SPARK-20568 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Saul Shanabrook > > It would be great to be able to delete files after processing them with > structured streaming. > For example, I am reading in a bunch of JSON files and converting them into > Parquet. If the JSON files are not deleted after they are processed, it > quickly fills up my hard drive. I originally [posted this on Stack > Overflow|http://stackoverflow.com/q/43671757/907060] and was recommended to > make a feature request for it.
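The "commit" hook Shixiong mentions gives the source a point at which a processed batch is known to be durable, so the consumed input files can safely be removed. A minimal file-deletion sketch of that idea (this is illustrative, not the actual Structured Streaming Source API):

```python
import os
import tempfile

def commit_batch(processed_paths):
    """Sketch of a commit hook: once the batch's output (e.g. Parquet) is
    durably written, delete the consumed input files to reclaim disk."""
    for path in processed_paths:
        if os.path.exists(path):
            os.remove(path)

# Simulate two JSON inputs that have been converted and then committed.
workdir = tempfile.mkdtemp()
inputs = []
for i in range(2):
    path = os.path.join(workdir, f"batch-{i}.json")
    with open(path, "w") as fh:
        fh.write("{}")
    inputs.append(path)

commit_batch(inputs)
remaining = [p for p in inputs if os.path.exists(p)]
```

The important design point is ordering: deletion happens only after the commit, so a failure mid-batch leaves the inputs in place for reprocessing.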
[jira] [Comment Edited] (SPARK-20588) from_utc_timestamp causes bottleneck
[ https://issues.apache.org/jira/browse/SPARK-20588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15995499#comment-15995499 ] Ameen Tayyebi edited comment on SPARK-20588 at 5/3/17 7:40 PM: --- Hopefully more readable version of the call stack: {code:java} "Executor task launch worker-63" #261 daemon prio=5 os_prio=0 tid=0x7f848472e000 nid=0x4294 waiting for monitor entry [0x7f501981c000] java.lang.Thread.State: BLOCKED (on object monitor) at java.util.TimeZone.getTimeZone(TimeZone.java:516) - waiting to lock <0x7f5216c2aa58> (a java.lang.Class for java.util.TimeZone) at org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTimestamp(DateTimeUtils.scala:356) at org.apache.spark.sql.catalyst.util.DateTimeUtils.stringToTimestamp(DateTimeUtils.scala) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:161) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) at org.apache.spark.scheduler.Task.run(Task.scala:86) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {code} was (Author: ameen.tayy...@gmail.com): Hopefully more readable version of the call stack: "Executor task launch worker-63" #261 
daemon prio=5 os_prio=0 tid=0x7f848472e000 nid=0x4294 waiting for monitor entry [0x7f501981c000] java.lang.Thread.State: BLOCKED (on object monitor) at java.util.TimeZone.getTimeZone(TimeZone.java:516) - waiting to lock <0x7f5216c2aa58> (a java.lang.Class for java.util.TimeZone) at org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTimestamp(DateTimeUtils.scala:356) at org.apache.spark.sql.catalyst.util.DateTimeUtils.stringToTimestamp(DateTimeUtils.scala) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:161) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) at org.apache.spark.scheduler.Task.run(Task.scala:86) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) > from_utc_timestamp causes bottleneck > > > Key: SPARK-20588 > URL: https://issues.apache.org/jira/browse/SPARK-20588 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 > Environment: AWS EMR AMI 5.2.1 >Reporter: Ameen Tayyebi > > We have a SQL query that makes use of the from_utc_timestamp function like > so: from_utc_timestamp(itemSigningTime,'America/Los_Angeles') > This causes a major bottleneck. 
Our exact call is: > date_add(from_utc_timestamp(itemSigningTime,'America/Los_Angeles'), 1) > Switching from the above to date_add(itemSigningTime, 1) reduces the job > running time from 40 minutes to 9. > When from_utc_timestamp function is used, several threads in the executors > are in the BLOCKED state, on this call stack: > "Executor task launch worker-63" #261 daemon prio=5 os_prio=0 > tid=0x7f848472e000 nid=0x4294 waiting for monitor entry > [0x7f501981c000] >
[jira] [Commented] (SPARK-20588) from_utc_timestamp causes bottleneck
[ https://issues.apache.org/jira/browse/SPARK-20588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15995499#comment-15995499 ] Ameen Tayyebi commented on SPARK-20588: --- Hopefully more readable version of the call stack: "Executor task launch worker-63" #261 daemon prio=5 os_prio=0 tid=0x7f848472e000 nid=0x4294 waiting for monitor entry [0x7f501981c000] java.lang.Thread.State: BLOCKED (on object monitor) at java.util.TimeZone.getTimeZone(TimeZone.java:516) - waiting to lock <0x7f5216c2aa58> (a java.lang.Class for java.util.TimeZone) at org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTimestamp(DateTimeUtils.scala:356) at org.apache.spark.sql.catalyst.util.DateTimeUtils.stringToTimestamp(DateTimeUtils.scala) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:161) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) at org.apache.spark.scheduler.Task.run(Task.scala:86) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) > from_utc_timestamp causes bottleneck > > > Key: SPARK-20588 > URL: https://issues.apache.org/jira/browse/SPARK-20588 > Project: Spark > Issue Type: Bug > Components: SQL 
>Affects Versions: 2.0.2 > Environment: AWS EMR AMI 5.2.1 >Reporter: Ameen Tayyebi > > We have a SQL query that makes use of the from_utc_timestamp function like > so: from_utc_timestamp(itemSigningTime,'America/Los_Angeles') > This causes a major bottleneck. Our exact call is: > date_add(from_utc_timestamp(itemSigningTime,'America/Los_Angeles'), 1) > Switching from the above to date_add(itemSigningTime, 1) reduces the job > running time from 40 minutes to 9. > When from_utc_timestamp function is used, several threads in the executors > are in the BLOCKED state, on this call stack: > "Executor task launch worker-63" #261 daemon prio=5 os_prio=0 > tid=0x7f848472e000 nid=0x4294 waiting for monitor entry > [0x7f501981c000] >java.lang.Thread.State: BLOCKED (on object monitor) > at java.util.TimeZone.getTimeZone(TimeZone.java:516) > - waiting to lock <0x7f5216c2aa58> (a java.lang.Class for > java.util.TimeZone) > at > org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTimestamp(DateTimeUtils.scala:356) > at > org.apache.spark.sql.catalyst.util.DateTimeUtils.stringToTimestamp(DateTimeUtils.scala) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:161) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at > 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Can we cache the locales once per JVM so that we don't do this for every > record?
[jira] [Created] (SPARK-20588) from_utc_timestamp causes bottleneck
Ameen Tayyebi created SPARK-20588: - Summary: from_utc_timestamp causes bottleneck Key: SPARK-20588 URL: https://issues.apache.org/jira/browse/SPARK-20588 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.2 Environment: AWS EMR AMI 5.2.1 Reporter: Ameen Tayyebi We have a SQL query that makes use of the from_utc_timestamp function like so: from_utc_timestamp(itemSigningTime,'America/Los_Angeles') This causes a major bottleneck. Our exact call is: date_add(from_utc_timestamp(itemSigningTime,'America/Los_Angeles'), 1) Switching from the above to date_add(itemSigningTime, 1) reduces the job running time from 40 minutes to 9. When from_utc_timestamp function is used, several threads in the executors are in the BLOCKED state, on this call stack: "Executor task launch worker-63" #261 daemon prio=5 os_prio=0 tid=0x7f848472e000 nid=0x4294 waiting for monitor entry [0x7f501981c000] java.lang.Thread.State: BLOCKED (on object monitor) at java.util.TimeZone.getTimeZone(TimeZone.java:516) - waiting to lock <0x7f5216c2aa58> (a java.lang.Class for java.util.TimeZone) at org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTimestamp(DateTimeUtils.scala:356) at org.apache.spark.sql.catalyst.util.DateTimeUtils.stringToTimestamp(DateTimeUtils.scala) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:161) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) at org.apache.spark.scheduler.Task.run(Task.scala:86) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Can we cache the locales once per JVM so that we don't do this for every record?
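The contention in the trace comes from java.util.TimeZone.getTimeZone, which synchronizes on the TimeZone class, so a per-record lookup serializes across all executor threads. The caching the reporter asks about can be sketched as per-process memoization (a Python stand-in for a JVM-side cache, not Spark's eventual fix):

```python
from functools import lru_cache

expensive_lookups = {"count": 0}   # instrument how often the slow path runs

@lru_cache(maxsize=None)
def get_time_zone(tz_id):
    """Stand-in for the synchronized TimeZone.getTimeZone call: pay the
    contended lookup once per process, then serve the cached object."""
    expensive_lookups["count"] += 1
    return ("tz", tz_id)           # placeholder for the resolved time zone

# Today: one contended lookup per record. With the cache: one per process.
for _ in range(100_000):
    get_time_zone("America/Los_Angeles")
```

After the first call, every subsequent lookup is a lock-free cache hit, which is exactly the behavior the reporter wants per JVM.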
[jira] [Assigned] (SPARK-20562) Support Maintenance by having a threshold for unavailability
[ https://issues.apache.org/jira/browse/SPARK-20562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20562: Assignee: (was: Apache Spark) > Support Maintenance by having a threshold for unavailability > > > Key: SPARK-20562 > URL: https://issues.apache.org/jira/browse/SPARK-20562 > Project: Spark > Issue Type: Improvement > Components: Mesos >Affects Versions: 2.1.0 >Reporter: Kamal Gurala > > Make Spark be aware of offers that have an unavailability period set because > of a scheduled Maintenance on the node. > Have a configurable option that's a threshold which ensures that tasks are > not scheduled on offers that are within a threshold for maintenance
[jira] [Assigned] (SPARK-20562) Support Maintenance by having a threshold for unavailability
[ https://issues.apache.org/jira/browse/SPARK-20562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20562: Assignee: Apache Spark > Support Maintenance by having a threshold for unavailability > > > Key: SPARK-20562 > URL: https://issues.apache.org/jira/browse/SPARK-20562 > Project: Spark > Issue Type: Improvement > Components: Mesos >Affects Versions: 2.1.0 >Reporter: Kamal Gurala >Assignee: Apache Spark > > Make Spark be aware of offers that have an unavailability period set because > of a scheduled Maintenance on the node. > Have a configurable option that's a threshold which ensures that tasks are > not scheduled on offers that are within a threshold for maintenance
[jira] [Commented] (SPARK-20562) Support Maintenance by having a threshold for unavailability
[ https://issues.apache.org/jira/browse/SPARK-20562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15995495#comment-15995495 ] Apache Spark commented on SPARK-20562: -- User 'gkc2104' has created a pull request for this issue: https://github.com/apache/spark/pull/17846 > Support Maintenance by having a threshold for unavailability > > > Key: SPARK-20562 > URL: https://issues.apache.org/jira/browse/SPARK-20562 > Project: Spark > Issue Type: Improvement > Components: Mesos >Affects Versions: 2.1.0 >Reporter: Kamal Gurala > > Make Spark be aware of offers that have an unavailability period set because > of a scheduled Maintenance on the node. > Have a configurable option that's a threshold which ensures that tasks are > not scheduled on offers that are within a threshold for maintenance
[jira] [Commented] (SPARK-20562) Support Maintenance by having a threshold for unavailability
[ https://issues.apache.org/jira/browse/SPARK-20562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15995493#comment-15995493 ] Kamal Gurala commented on SPARK-20562: -- Spark is not smart about offers that might be scheduled for maintenance, i.e. those that have an unavailability period set. It also cannot estimate how long a scheduled task would use an offer for. It is, however, easier for users to estimate how long they think a task would run. So they can set a configurable threshold (x) that makes the Spark scheduler wary of which offers it can accept and which ones might go under maintenance within x amount of time. > Support Maintenance by having a threshold for unavailability > > > Key: SPARK-20562 > URL: https://issues.apache.org/jira/browse/SPARK-20562 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 2.1.0 >Reporter: Kamal Gurala > > Make Spark be aware of offers that have an unavailability period set because > of a scheduled Maintenance on the node. > Have a configurable option that's a threshold which ensures that tasks are > not scheduled on offers that are within a threshold for maintenance
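The threshold check described in the comment above can be sketched as a predicate over offers: reject any offer whose maintenance window starts within x seconds of now. The field names below are illustrative stand-ins, not the actual Mesos offer schema:

```python
import time

def accept_offer(offer, threshold_seconds, now=None):
    """Sketch of the proposed check: reject an offer whose unavailability
    (maintenance) window starts within `threshold_seconds` of now.
    `unavailability_start` is a hypothetical field name."""
    now = time.time() if now is None else now
    start = offer.get("unavailability_start")
    if start is None:
        return True                          # no maintenance scheduled
    return start - now > threshold_seconds   # far enough out to be safe

offers = [
    {"id": "a"},                                   # no maintenance: accept
    {"id": "b", "unavailability_start": 100.0},    # starts too soon: reject
    {"id": "c", "unavailability_start": 10_000.0}, # far out: accept
]
accepted = [o["id"] for o in offers if accept_offer(o, 3600, now=0.0)]
```

Since the threshold is user-configured, it encodes the user's own guess at task duration, which is the point of the comment: the user knows this better than the scheduler does.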
[jira] [Updated] (SPARK-20562) Support Maintenance by having a threshold for unavailability
[ https://issues.apache.org/jira/browse/SPARK-20562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kamal Gurala updated SPARK-20562: - Issue Type: Improvement (was: Bug) > Support Maintenance by having a threshold for unavailability > > > Key: SPARK-20562 > URL: https://issues.apache.org/jira/browse/SPARK-20562 > Project: Spark > Issue Type: Improvement > Components: Mesos >Affects Versions: 2.1.0 >Reporter: Kamal Gurala > > Make Spark be aware of offers that have an unavailability period set because > of a scheduled Maintenance on the node. > Have a configurable option that's a threshold which ensures that tasks are > not scheduled on offers that are within a threshold for maintenance
[jira] [Commented] (SPARK-19243) Error when selecting from DataFrame containing parsed data from files larger than 1MB
[ https://issues.apache.org/jira/browse/SPARK-19243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15995489#comment-15995489 ] Harish commented on SPARK-19243: I am getting the same error in Spark 2.1.0. I have a 10-node cluster with 109GB each. My data set is just 30K rows with 60 columns. I see 72 partitions in total after loading the ORC file to a DF, then re-partitioned to 2001. No luck. Regards, Harish > Error when selecting from DataFrame containing parsed data from files larger > than 1MB > - > > Key: SPARK-19243 > URL: https://issues.apache.org/jira/browse/SPARK-19243 > Project: Spark > Issue Type: Bug >Reporter: Ben > > I hope I can describe the problem clearly. This error happens with Spark > 2.0.1. However, I tried with Spark 2.1.0 on my test PC and it worked there, > none of the issues below, but I can't try it on the test cluster because > Spark needs to be upgraded there. I'm opening this ticket because if it's a > bug, maybe something is still partially present in Spark 2.1.0. > Initially I thought it was my script's problem so I tried to debug, until I > found why this is happening. > Step by step, I load XML files through spark-xml into a DataFrame. In my > case, the rowTag is the root tag, so each XML file creates a row. The XML > structure is fairly complex, and it is converted to nested columns or arrays > inside the DF. Since I need to flatten the whole table, and since the output > is not fixed but I dynamically select what I want as output, in case I need > to output columns that have been parsed as arrays, then I explode them with > explode() only when needed. > Normally I can select various columns that don't have many entries without a > problem. > I select a column that has a lot of entries into a new DF, e.g. simply through > {noformat} > df2 = df.select(...) > {noformat} > and then if I try to do a count() or first() or anything, Spark behaves two > ways: > 1. If the source file was smaller than 1MB, it works. > 2. 
If the source file was larger than 1MB, the following error occurs: > {noformat} > Traceback (most recent call last): > File \"/myCode.py\", line 71, in main > df.count() > File > \"/usr/hdp/current/spark-client/python/lib/pyspark.zip/pyspark/sql/dataframe.py\", > line 299, in count > return int(self._jdf.count()) > File > \"/usr/hdp/current/spark-client/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py\", > line 1133, in __call__ > answer, self.gateway_client, self.target_id, self.name) > File > \"/usr/hdp/current/spark-client/python/lib/pyspark.zip/pyspark/sql/utils.py\", > line 63, in deco > return f(*a, **kw) > File > \"/usr/hdp/current/spark-client/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py\", > line 319, in get_return_value > format(target_id, \".\", name), value) > Py4JJavaError: An error occurred while calling o180.count. > : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 > in stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0 > (TID 6, compname): java.lang.IllegalArgumentException: Size exceeds > Integer.MAX_VALUE > at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:869) > at > org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:103) > at > org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:91) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1307) > at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:105) > at > org.apache.spark.storage.BlockManager.getLocalValues(BlockManager.scala:438) > at org.apache.spark.storage.BlockManager.get(BlockManager.scala:606) > at > org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:663) > at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:281) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at
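The {{Size exceeds Integer.MAX_VALUE}} failure above comes from {{sun.nio.ch.FileChannelImpl.map}}, which cannot memory-map a region larger than Integer.MAX_VALUE (2^31 - 1) bytes, so any single cached block over ~2 GiB fails this way. A common workaround is to repartition so no partition approaches that ceiling. The sketch below is not from the ticket; it just estimates a minimum partition count, and the 50% safety factor is an assumption, not a Spark constant:

```python
# Rough sketch: estimate how many partitions keep each cached block under
# the JVM's mmap ceiling of Integer.MAX_VALUE (2**31 - 1) bytes, the limit
# behind "Size exceeds Integer.MAX_VALUE" in DiskStore.getBytes.
import math

JVM_MAX_BLOCK_BYTES = 2**31 - 1  # Integer.MAX_VALUE

def min_partitions(total_bytes: int, safety_factor: float = 0.5) -> int:
    """Smallest partition count keeping each block under the mmap limit.

    safety_factor leaves headroom for serialization overhead and skew;
    the 0.5 default is an assumption, not a Spark constant.
    """
    target = int(JVM_MAX_BLOCK_BYTES * safety_factor)
    return max(1, math.ceil(total_bytes / target))

# A 10 GiB dataset at 50% headroom:
print(min_partitions(10 * 2**30))  # → 11
```

One would then pass such a count to {{df.repartition(n)}}; whether that avoids the bug for the XML case above depends on how evenly the rows split.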
[jira] [Assigned] (SPARK-20587) Improve performance of ML ALS recommendForAll
[ https://issues.apache.org/jira/browse/SPARK-20587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20587: Assignee: Nick Pentreath (was: Apache Spark) > Improve performance of ML ALS recommendForAll > - > > Key: SPARK-20587 > URL: https://issues.apache.org/jira/browse/SPARK-20587 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: Nick Pentreath >Assignee: Nick Pentreath > > SPARK-11968 relates to excessive GC pressure from using the "blocked BLAS 3" > approach for generating top-k recommendations in > {{mllib.recommendation.MatrixFactorizationModel}}. > The solution there is still based on blocking factors, but efficiently > computes the top-k elements *per block* first (using > {{BoundedPriorityQueue}}) and then computes the global top-k elements. > This improves performance and GC pressure substantially for {{mllib}}'s ALS > model. The same approach is also a lot more efficient than the current > "crossJoin and score per-row" used in {{ml}}'s {{DataFrame}}-based method. > This adapts the solution in SPARK-11968 for {{DataFrame}}. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20587) Improve performance of ML ALS recommendForAll
[ https://issues.apache.org/jira/browse/SPARK-20587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20587: Assignee: Apache Spark (was: Nick Pentreath) > Improve performance of ML ALS recommendForAll > - > > Key: SPARK-20587 > URL: https://issues.apache.org/jira/browse/SPARK-20587 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: Nick Pentreath >Assignee: Apache Spark > > SPARK-11968 relates to excessive GC pressure from using the "blocked BLAS 3" > approach for generating top-k recommendations in > {{mllib.recommendation.MatrixFactorizationModel}}. > The solution there is still based on blocking factors, but efficiently > computes the top-k elements *per block* first (using > {{BoundedPriorityQueue}}) and then computes the global top-k elements. > This improves performance and GC pressure substantially for {{mllib}}'s ALS > model. The same approach is also a lot more efficient than the current > "crossJoin and score per-row" used in {{ml}}'s {{DataFrame}}-based method. > This adapts the solution in SPARK-11968 for {{DataFrame}}.
[jira] [Commented] (SPARK-20587) Improve performance of ML ALS recommendForAll
[ https://issues.apache.org/jira/browse/SPARK-20587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15995480#comment-15995480 ] Apache Spark commented on SPARK-20587: -- User 'MLnick' has created a pull request for this issue: https://github.com/apache/spark/pull/17845 > Improve performance of ML ALS recommendForAll > - > > Key: SPARK-20587 > URL: https://issues.apache.org/jira/browse/SPARK-20587 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: Nick Pentreath >Assignee: Nick Pentreath > > SPARK-11968 relates to excessive GC pressure from using the "blocked BLAS 3" > approach for generating top-k recommendations in > {{mllib.recommendation.MatrixFactorizationModel}}. > The solution there is still based on blocking factors, but efficiently > computes the top-k elements *per block* first (using > {{BoundedPriorityQueue}}) and then computes the global top-k elements. > This improves performance and GC pressure substantially for {{mllib}}'s ALS > model. The same approach is also a lot more efficient than the current > "crossJoin and score per-row" used in {{ml}}'s {{DataFrame}}-based method. > This adapts the solution in SPARK-11968 for {{DataFrame}}.
[jira] [Created] (SPARK-20587) Improve performance of ML ALS recommendForAll
Nick Pentreath created SPARK-20587: -- Summary: Improve performance of ML ALS recommendForAll Key: SPARK-20587 URL: https://issues.apache.org/jira/browse/SPARK-20587 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 2.2.0 Reporter: Nick Pentreath Assignee: Nick Pentreath SPARK-11968 relates to excessive GC pressure from using the "blocked BLAS 3" approach for generating top-k recommendations in {{mllib.recommendation.MatrixFactorizationModel}}. The solution there is still based on blocking factors, but efficiently computes the top-k elements *per block* first (using {{BoundedPriorityQueue}}) and then computes the global top-k elements. This improves performance and GC pressure substantially for {{mllib}}'s ALS model. The same approach is also a lot more efficient than the current "crossJoin and score per-row" used in {{ml}}'s {{DataFrame}}-based method. This adapts the solution in SPARK-11968 for {{DataFrame}}.
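The "top-k per block, then global top-k" strategy described above can be sketched outside Spark. The Python below is an illustrative stand-in for the Scala {{BoundedPriorityQueue}} approach, not Spark's actual implementation; names and data shapes are invented for the example:

```python
# Sketch of the "top-k per block, then global top-k" strategy: each block
# keeps only its k best (score, item) pairs in a bounded min-heap (the
# analogue of BoundedPriorityQueue), so the final merge sees at most k
# entries per block instead of every candidate.
import heapq

def block_top_k(scores, k):
    """Return the k largest (score, item) pairs from one block."""
    heap = []  # min-heap holding at most k entries; heap[0] is the worst kept
    for item, score in scores:
        if len(heap) < k:
            heapq.heappush(heap, (score, item))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, item))  # evict the current worst
    return heap

def global_top_k(blocks, k):
    """Merge per-block winners into the global top-k, best first."""
    merged = [pair for block in blocks for pair in block_top_k(block, k)]
    return heapq.nlargest(k, merged)

blocks = [[("a", 0.9), ("b", 0.1)], [("c", 0.8), ("d", 0.7)]]
print(global_top_k(blocks, 2))  # [(0.9, 'a'), (0.8, 'c')]
```

The GC win comes from the bounded heap: the merge step materializes k entries per block rather than the full cross product of users and items.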
[jira] [Updated] (SPARK-20569) RuntimeReplaceable functions accept invalid third parameter
[ https://issues.apache.org/jira/browse/SPARK-20569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-20569: - Affects Version/s: 2.2.0 > RuntimeReplaceable functions accept invalid third parameter > --- > > Key: SPARK-20569 > URL: https://issues.apache.org/jira/browse/SPARK-20569 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0, 2.2.0 >Reporter: liuxian >Priority: Trivial > > >select Nvl(null,'1',3); > >3 > The "Nvl" function takes only two input parameters, so when three > parameters are supplied, I think it should report: "Error in query: Invalid number of > arguments for function nvl". > Functions such as "nvl2", "nullIf", and "IfNull" have a similar problem
[jira] [Commented] (SPARK-20569) RuntimeReplaceable functions accept invalid third parameter
[ https://issues.apache.org/jira/browse/SPARK-20569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15995423#comment-15995423 ] Michael Armbrust commented on SPARK-20569: -- [~rxin] this does seem like a bug. > RuntimeReplaceable functions accept invalid third parameter > --- > > Key: SPARK-20569 > URL: https://issues.apache.org/jira/browse/SPARK-20569 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: liuxian >Priority: Trivial > > >select Nvl(null,'1',3); > >3 > The "Nvl" function takes only two input parameters, so when three > parameters are supplied, I think it should report: "Error in query: Invalid number of > arguments for function nvl". > Functions such as "nvl2", "nullIf", and "IfNull" have a similar problem
[jira] [Updated] (SPARK-20569) RuntimeReplaceable functions accept invalid third parameter
[ https://issues.apache.org/jira/browse/SPARK-20569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-20569: - Summary: RuntimeReplaceable functions accept invalid third parameter (was: In spark-sql,Some functions can execute successfully, when the number of input parameter is wrong) > RuntimeReplaceable functions accept invalid third parameter > --- > > Key: SPARK-20569 > URL: https://issues.apache.org/jira/browse/SPARK-20569 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: liuxian >Priority: Trivial > > >select Nvl(null,'1',3); > >3 > The "Nvl" function takes only two input parameters, so when three > parameters are supplied, I think it should report: "Error in query: Invalid number of > arguments for function nvl". > Functions such as "nvl2", "nullIf", and "IfNull" have a similar problem
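The behavior the report asks for (rejecting a surplus argument at analysis time instead of silently ignoring it) can be illustrated with a toy arity-checked function registry. This is a hypothetical sketch in Python, not Spark's analyzer code; the registry and error message are invented for the example:

```python
# Illustrative sketch (not Spark's code): a tiny function registry that
# validates argument count before dispatch, so nvl(a, b, c) fails loudly
# instead of silently dropping the third argument.
REGISTRY = {
    # name -> (expected arity, implementation)
    "nvl": (2, lambda a, b: b if a is None else a),
}

def invoke(name, *args):
    arity, fn = REGISTRY[name]
    if len(args) != arity:
        raise ValueError(f"Invalid number of arguments for function {name}")
    return fn(*args)

print(invoke("nvl", None, "1"))  # null falls through to '1'
try:
    invoke("nvl", None, "1", 3)  # the report's failing case: 3 arguments
except ValueError as e:
    print(e)  # Invalid number of arguments for function nvl
```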
[jira] [Updated] (SPARK-19104) CompileException with Map and Case Class in Spark 2.1.0
[ https://issues.apache.org/jira/browse/SPARK-19104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-19104: - Affects Version/s: 2.2.0 Target Version/s: 2.2.0 > CompileException with Map and Case Class in Spark 2.1.0 > > > Key: SPARK-19104 > URL: https://issues.apache.org/jira/browse/SPARK-19104 > Project: Spark > Issue Type: Bug > Components: Optimizer, SQL >Affects Versions: 2.1.0, 2.2.0 >Reporter: Nils Grabbert > > The following code will run with Spark 2.0.2 but not with Spark 2.1.0: > {code} > case class InnerData(name: String, value: Int) > case class Data(id: Int, param: Map[String, InnerData]) > val data = Seq.tabulate(10)(i => Data(1, Map("key" -> InnerData("name", i + > 100)))) > val ds = spark.createDataset(data) > {code} > Exception: > {code} > Caused by: org.codehaus.commons.compiler.CompileException: File > 'generated.java', Line 63, Column 46: Expression > "ExternalMapToCatalyst_value_isNull1" is not an rvalue > at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:11004) > at > org.codehaus.janino.UnitCompiler.toRvalueOrCompileException(UnitCompiler.java:6639) > at > org.codehaus.janino.UnitCompiler.getConstantValue2(UnitCompiler.java:5001) > at org.codehaus.janino.UnitCompiler.access$10500(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$13.visitAmbiguousName(UnitCompiler.java:4984) > at org.codehaus.janino.Java$AmbiguousName.accept(Java.java:3633) > at org.codehaus.janino.Java$Lvalue.accept(Java.java:3563) > at > org.codehaus.janino.UnitCompiler.getConstantValue(UnitCompiler.java:4956) > at org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4925) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3189) > at org.codehaus.janino.UnitCompiler.access$5100(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$9.visitAssignment(UnitCompiler.java:3143) > at > org.codehaus.janino.UnitCompiler$9.visitAssignment(UnitCompiler.java:3139) > at
org.codehaus.janino.Java$Assignment.accept(Java.java:3847) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112) > at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377) > > at > org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370) > > at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370) > at > org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262) > > at > org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234) > > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894) > at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377) > > at > org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369) > > at org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209) > > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420) > at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:374) > > at > 
org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:369) > > at > org.codehaus.janino.Java$AbstractPackageMemberClassDeclaration.accept(Java.java:1309) > > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) > at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:345) > at > org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:396) > > at > org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:311) > > at org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:229) > at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:196) >
[jira] [Updated] (SPARK-19104) CompileException with Map and Case Class in Spark 2.1.0
[ https://issues.apache.org/jira/browse/SPARK-19104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-19104: - Description: The following code will run with Spark 2.0.2 but not with Spark 2.1.0: {code} case class InnerData(name: String, value: Int) case class Data(id: Int, param: Map[String, InnerData]) val data = Seq.tabulate(10)(i => Data(1, Map("key" -> InnerData("name", i + 100)))) val ds = spark.createDataset(data) {code} Exception: {code} Caused by: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 63, Column 46: Expression "ExternalMapToCatalyst_value_isNull1" is not an rvalue at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:11004) at org.codehaus.janino.UnitCompiler.toRvalueOrCompileException(UnitCompiler.java:6639) at org.codehaus.janino.UnitCompiler.getConstantValue2(UnitCompiler.java:5001) at org.codehaus.janino.UnitCompiler.access$10500(UnitCompiler.java:206) at org.codehaus.janino.UnitCompiler$13.visitAmbiguousName(UnitCompiler.java:4984) at org.codehaus.janino.Java$AmbiguousName.accept(Java.java:3633) at org.codehaus.janino.Java$Lvalue.accept(Java.java:3563) at org.codehaus.janino.UnitCompiler.getConstantValue(UnitCompiler.java:4956) at org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4925) at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3189) at org.codehaus.janino.UnitCompiler.access$5100(UnitCompiler.java:206) at org.codehaus.janino.UnitCompiler$9.visitAssignment(UnitCompiler.java:3143) at org.codehaus.janino.UnitCompiler$9.visitAssignment(UnitCompiler.java:3139) at org.codehaus.janino.Java$Assignment.accept(Java.java:3847) at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139) at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112) at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206) at org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377) at 
org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370) at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558) at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370) at org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450) at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811) at org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262) at org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234) at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538) at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890) at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894) at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206) at org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377) at org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369) at org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128) at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) at org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209) at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564) at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420) at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:206) at org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:374) at org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:369) at org.codehaus.janino.Java$AbstractPackageMemberClassDeclaration.accept(Java.java:1309) at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:345) at org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:396) at 
org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:311) at org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:229) at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:196) at org.codehaus.commons.compiler.Cookable.cook(Cookable.java:91) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:935) ... 77 more {code} was: The following code will run with Spark 2.0.2 but not with Spark 2.1.0: {code} case class InnerData(name: String, value: Int) case class Data(id: Int, param: Map[String, InnerData]) val data = Seq.tabulate(10)(i => Data(1, Map("key" -> InnerData("name", i + 100)))) val ds = spark.createDataset(data) {code} Exception: Caused by:
[jira] [Updated] (SPARK-20564) a lot of executor failures when the executor number is more than 2000
[ https://issues.apache.org/jira/browse/SPARK-20564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hua Liu updated SPARK-20564: Description: When we used more than 2000 executors in a Spark application, we noticed that a large number of executors could not connect to the driver and as a result were marked as failed. In some cases, the number of failed executors reached twice the requested executor count, and thus applications retried and may eventually fail. This is because YarnAllocator requests all missing containers every spark.yarn.scheduler.heartbeat.interval-ms (default 3 seconds). For example, YarnAllocator can ask for and get over 2000 containers in one request, and then launch them almost simultaneously. These thousands of executors try to retrieve spark props and register with the driver within seconds. However, the driver handles executor registration, stop, removal and spark props retrieval in one thread, and it cannot handle such a large number of RPCs within a short period of time. As a result, some executors cannot retrieve spark props and/or register. These failed executors are then marked as failed, causing executor removal and aggravating the overloading of the driver, which leads to more executor failures. This patch adds an extra configuration, spark.yarn.launchContainer.count.simultaneously, which caps the maximal number of containers that the driver can ask for in every spark.yarn.scheduler.heartbeat.interval-ms. As a result, the number of executors grows steadily. The number of executor failures is reduced and applications can reach the desired number of executors faster. was: When we used more than 2000 executors in a spark application, we noticed a large number of executors cannot connect to driver and as a result they were marked as failed. In some cases, the failed executor number reached twice of the requested executor count and thus applications retried and may eventually fail. 
This is because that YarnAllocator requests all missing containers every spark.yarn.scheduler.heartbeat.interval-ms (default 3 seconds). For example, YarnAllocator can ask for and get over 2000 containers in one request, and then launch them almost simultaneously. These thousands of executors try to retrieve spark props and register with driver within seconds. However, driver handles executor registration, stop, removal and spark props retrieval in one thread, and it can not handle such a large number of RPCs within a short period of time. As a result, some executors cannot retrieve spark props and/or register. These failed executors are then marked as failed, causing executor removal and aggravating the overloading of driver, which leads to more executor failures. This patch adds an extra configuration spark.yarn.launchContainer.count.simultaneously, which caps the maximal containers driver can ask for and launch in every spark.yarn.scheduler.heartbeat.interval-ms. As a result, the number of executors grows steadily. The number of executor failures is reduced and applications can reach the desired number of executors faster. > a lot of executor failures when the executor number is more than 2000 > - > > Key: SPARK-20564 > URL: https://issues.apache.org/jira/browse/SPARK-20564 > Project: Spark > Issue Type: Improvement > Components: Deploy >Affects Versions: 1.6.2, 2.1.0 >Reporter: Hua Liu > > When we used more than 2000 executors in a spark application, we noticed a > large number of executors cannot connect to driver and as a result they were > marked as failed. In some cases, the failed executor number reached twice of > the requested executor count and thus applications retried and may eventually > fail. > This is because that YarnAllocator requests all missing containers every > spark.yarn.scheduler.heartbeat.interval-ms (default 3 seconds). 
For example, > YarnAllocator can ask for and get over 2000 containers in one request, and > then launch them almost simultaneously. These thousands of executors try to > retrieve spark props and register with driver within seconds. However, driver > handles executor registration, stop, removal and spark props retrieval in one > thread, and it can not handle such a large number of RPCs within a short > period of time. As a result, some executors cannot retrieve spark props > and/or register. These failed executors are then marked as failed, causing > executor removal and aggravating the overloading of driver, which leads to > more executor failures. > This patch adds an extra configuration > spark.yarn.launchContainer.count.simultaneously, which caps the maximal > number of containers that driver can ask for in every > spark.yarn.scheduler.heartbeat.interval-ms. As a result, the number of > executors grows
[jira] [Updated] (SPARK-20564) a lot of executor failures when the executor number is more than 2000
[ https://issues.apache.org/jira/browse/SPARK-20564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hua Liu updated SPARK-20564: Description: When we used more than 2000 executors in a spark application, we noticed a large number of executors cannot connect to driver and as a result they were marked as failed. In some cases, the failed executor number reached twice of the requested executor count and thus applications retried and may eventually fail. This is because that YarnAllocator requests all missing containers every spark.yarn.scheduler.heartbeat.interval-ms (default 3 seconds). For example, YarnAllocator can ask for and get over 2000 containers in one request, and then launch them almost simultaneously. These thousands of executors try to retrieve spark props and register with driver within seconds. However, driver handles executor registration, stop, removal and spark props retrieval in one thread, and it can not handle such a large number of RPCs within a short period of time. As a result, some executors cannot retrieve spark props and/or register. These failed executors are then marked as failed, causing executor removal and aggravating the overloading of driver, which leads to more executor failures. This patch adds an extra configuration spark.yarn.launchContainer.count.simultaneously, which caps the maximal containers driver can ask for and launch in every spark.yarn.scheduler.heartbeat.interval-ms. As a result, the number of executors grows steadily. The number of executor failures is reduced and applications can reach the desired number of executors faster. was: When we used more than 2000 executors in a spark application, we noticed a large number of executors cannot connect to driver and as a result they were marked as failed. In some cases, the failed executor number reached twice of the requested executor count and thus applications retried and may eventually fail. 
This is because that YarnAllocator requests all missing containers every spark.yarn.scheduler.heartbeat.interval-ms (default 3 seconds). For example, YarnAllocator can ask for and get over 2000 containers in one request, and then launch them almost simultaneously. These thousands of executors try to retrieve spark props and register with driver within seconds. However, driver handles executor registration, stop, removal and spark props retrieval in one thread, and it can not handle such a large number of RPCs within a short period of time. As a result, some executors cannot retrieve spark props and/or register. These failed executors are then marked as failed, cause executor removal and aggravate the overloading of driver, which causes more executor failures. This patch adds an extra configuration spark.yarn.launchContainer.count.simultaneously, which caps the maximal containers driver can ask for and launch in every spark.yarn.scheduler.heartbeat.interval-ms. As a result, the number of executors grows steadily. The number of executor failures is reduced and applications can reach the desired number of executors faster. > a lot of executor failures when the executor number is more than 2000 > - > > Key: SPARK-20564 > URL: https://issues.apache.org/jira/browse/SPARK-20564 > Project: Spark > Issue Type: Improvement > Components: Deploy >Affects Versions: 1.6.2, 2.1.0 >Reporter: Hua Liu > > When we used more than 2000 executors in a spark application, we noticed a > large number of executors cannot connect to driver and as a result they were > marked as failed. In some cases, the failed executor number reached twice of > the requested executor count and thus applications retried and may eventually > fail. > This is because that YarnAllocator requests all missing containers every > spark.yarn.scheduler.heartbeat.interval-ms (default 3 seconds). 
For example, > YarnAllocator can ask for and get over 2000 containers in one request, and > then launch them almost simultaneously. These thousands of executors try to > retrieve spark props and register with driver within seconds. However, driver > handles executor registration, stop, removal and spark props retrieval in one > thread, and it can not handle such a large number of RPCs within a short > period of time. As a result, some executors cannot retrieve spark props > and/or register. These failed executors are then marked as failed, causing > executor removal and aggravating the overloading of driver, which leads to > more executor failures. > This patch adds an extra configuration > spark.yarn.launchContainer.count.simultaneously, which caps the maximal > containers driver can ask for and launch in every > spark.yarn.scheduler.heartbeat.interval-ms. As a result, the number of > executors grows steadily. The
[jira] [Updated] (SPARK-20564) a lot of executor failures when the executor number is more than 2000
[ https://issues.apache.org/jira/browse/SPARK-20564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hua Liu updated SPARK-20564: Description: When we used more than 2000 executors in a Spark application, we noticed that a large number of executors could not connect to the driver and were consequently marked as failed. In some cases, the number of failed executors reached twice the requested executor count, so applications retried and eventually failed. This happens because YarnAllocator requests all missing containers every spark.yarn.scheduler.heartbeat.interval-ms (default 3 seconds). For example, YarnAllocator can ask for and receive over 2000 containers in one request and then launch them almost simultaneously. These thousands of executors all try to retrieve Spark properties and register with the driver within seconds. However, the driver handles executor registration, stop, removal, and Spark property retrieval in a single thread, and it cannot service such a large number of RPCs in a short period of time. As a result, some executors fail to retrieve Spark properties and/or register; they are then marked as failed, which triggers executor removal, further overloads the driver, and causes even more executor failures. This patch adds an extra configuration, spark.yarn.launchContainer.count.simultaneously, which caps the number of containers the driver can request and launch per spark.yarn.scheduler.heartbeat.interval-ms. As a result, the number of executors grows steadily, the number of executor failures is reduced, and applications reach the desired number of executors faster. was: When we used more than 2000 executors in a Spark application, we noticed that a large number of executors could not connect to the driver and were consequently marked as failed. In some cases, the number of failed executors reached twice the requested executor count, so applications retried and eventually failed. 
This happens because YarnAllocator requests all missing containers every spark.yarn.scheduler.heartbeat.interval-ms (default 3 seconds). For example, YarnAllocator can ask for and receive over 2000 containers in one request and then launch them almost simultaneously. These thousands of executors all try to retrieve Spark properties and register with the driver. However, the driver handles executor registration, stop, removal, and Spark property retrieval in a single thread, and it cannot service such a large number of RPCs in a short period of time. As a result, some executors fail to retrieve Spark properties and/or register; they are then marked as failed, which triggers executor removal, further overloads the driver, and causes even more executor failures. This patch adds an extra configuration, spark.yarn.launchContainer.count.simultaneously, which caps the number of containers the driver can request and launch per spark.yarn.scheduler.heartbeat.interval-ms. As a result, the number of executors grows steadily, the number of executor failures is reduced, and applications reach the desired number of executors faster. > a lot of executor failures when the executor number is more than 2000 > - > > Key: SPARK-20564 > URL: https://issues.apache.org/jira/browse/SPARK-20564 > Project: Spark > Issue Type: Improvement > Components: Deploy > Affects Versions: 1.6.2, 2.1.0 > Reporter: Hua Liu > > When we used more than 2000 executors in a Spark application, we noticed that a > large number of executors could not connect to the driver and were consequently > marked as failed. In some cases, the number of failed executors reached twice > the requested executor count, so applications retried and eventually failed. > This happens because YarnAllocator requests all missing containers every > spark.yarn.scheduler.heartbeat.interval-ms (default 3 seconds). For example, > YarnAllocator can ask for and receive over 2000 containers in one request and > then launch them almost simultaneously. 
These thousands of executors all try to > retrieve Spark properties and register with the driver within seconds. However, > the driver handles executor registration, stop, removal, and Spark property > retrieval in a single thread, and it cannot service such a large number of RPCs > in a short period of time. As a result, some executors fail to retrieve Spark > properties and/or register; they are then marked as failed, which triggers > executor removal, further overloads the driver, and causes even more executor > failures. > This patch adds an extra configuration > spark.yarn.launchContainer.count.simultaneously, which caps the number of > containers the driver can request and launch per > spark.yarn.scheduler.heartbeat.interval-ms. As a result, the number of > executors grows steadily. The number of executor failures is reduced and > applications can reach the desired number of executors faster.
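The capping scheme described above can be sketched as a small simulation. This is an illustrative model only, not the actual YarnAllocator code: the `cap` parameter stands in for the proposed spark.yarn.launchContainer.count.simultaneously setting, and the function name is a made-up helper.

```python
def allocation_schedule(target, cap):
    """Simulate a capped allocator: return how many containers are
    requested at each heartbeat until `target` containers are launched.

    Without a cap, all `target` containers arrive in a single wave and
    their registrations flood the driver; with a cap, registrations
    arrive in bounded waves.
    """
    launched, waves = 0, []
    while launched < target:
        batch = min(cap, target - launched)  # never exceed the per-heartbeat cap
        waves.append(batch)
        launched += batch
    return waves

# 2000 executors with at most 500 launched per heartbeat: four steady waves
print(allocation_schedule(2000, 500))  # [500, 500, 500, 500]
```

With an effectively unlimited cap the schedule collapses back to one wave of 2000, which is the overload case the ticket describes.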
[jira] [Resolved] (SPARK-19965) DataFrame batch reader may fail to infer partitions when reading FileStreamSink's output
[ https://issues.apache.org/jira/browse/SPARK-19965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-19965. -- Resolution: Fixed Assignee: Liwei Lin Fix Version/s: 2.2.0 > DataFrame batch reader may fail to infer partitions when reading > FileStreamSink's output > > > Key: SPARK-19965 > URL: https://issues.apache.org/jira/browse/SPARK-19965 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Shixiong Zhu >Assignee: Liwei Lin > Fix For: 2.2.0 > > > Reproducer > {code} > test("partitioned writing and batch reading with 'basePath'") { > val inputData = MemoryStream[Int] > val ds = inputData.toDS() > val outputDir = Utils.createTempDir(namePrefix = > "stream.output").getCanonicalPath > val checkpointDir = Utils.createTempDir(namePrefix = > "stream.checkpoint").getCanonicalPath > var query: StreamingQuery = null > try { > query = > ds.map(i => (i, i * 1000)) > .toDF("id", "value") > .writeStream > .partitionBy("id") > .option("checkpointLocation", checkpointDir) > .format("parquet") > .start(outputDir) > inputData.addData(1, 2, 3) > failAfter(streamingTimeout) { > query.processAllAvailable() > } > spark.read.option("basePath", outputDir).parquet(outputDir + > "/*").show() > } finally { > if (query != null) { > query.stop() > } > } > } > {code} > Stack trace > {code} > [info] - partitioned writing and batch reading with 'basePath' *** FAILED *** > (3 seconds, 928 milliseconds) > [info] java.lang.AssertionError: assertion failed: Conflicting directory > structures detected. Suspicious paths: > [info]***/stream.output-65e3fa45-595a-4d29-b3df-4c001e321637 > [info] > ***/stream.output-65e3fa45-595a-4d29-b3df-4c001e321637/_spark_metadata > [info] > [info] If provided paths are partition directories, please set "basePath" in > the options of the data source to specify the root directory of the table. 
If > there are multiple root directories, please load them separately and then > union them. > [info] at scala.Predef$.assert(Predef.scala:170) > [info] at > org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:133) > [info] at > org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:98) > [info] at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.inferPartitioning(PartitioningAwareFileIndex.scala:156) > [info] at > org.apache.spark.sql.execution.datasources.InMemoryFileIndex.partitionSpec(InMemoryFileIndex.scala:54) > [info] at > org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.partitionSchema(PartitioningAwareFileIndex.scala:55) > [info] at > org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:133) > [info] at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:361) > [info] at > org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:160) > [info] at > org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:536) > [info] at > org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:520) > [info] at > org.apache.spark.sql.streaming.FileStreamSinkSuite$$anonfun$8.apply$mcV$sp(FileStreamSinkSuite.scala:292) > [info] at > org.apache.spark.sql.streaming.FileStreamSinkSuite$$anonfun$8.apply(FileStreamSinkSuite.scala:268) > [info] at > org.apache.spark.sql.streaming.FileStreamSinkSuite$$anonfun$8.apply(FileStreamSinkSuite.scala:268) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
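The assertion fails because FileStreamSink writes a _spark_metadata directory alongside the id=* partition directories, so the discovered leaves no longer share one partition structure. A minimal, hypothetical model of that check (not Spark's actual PartitioningUtils code; the function and error message are illustrative) shows why mixing a non-partition directory into the leaves trips the inference:

```python
def infer_partitions(leaf_dirs, base_path):
    """Toy partition discovery: every leaf under base_path must consist
    solely of key=value path components, and all leaves must agree on
    the same sequence of partition keys."""
    keys_per_leaf = set()
    for d in leaf_dirs:
        rel = d[len(base_path):].strip("/")
        parts = [p for p in rel.split("/") if p]
        # a component like `_spark_metadata` is not a key=value pair
        if any("=" not in p for p in parts):
            raise AssertionError("Conflicting directory structures detected")
        keys_per_leaf.add(tuple(p.split("=", 1)[0] for p in parts))
    if len(keys_per_leaf) > 1:
        raise AssertionError("Conflicting directory structures detected")
    return keys_per_leaf.pop()

# clean partitioned layout infers fine
print(infer_partitions(["/out/id=1", "/out/id=2"], "/out"))  # ('id',)
```

Feeding the same function a leaf set that includes "/out/_spark_metadata" raises the conflict error, mirroring the stack trace above; the eventual fix is for the batch reader to skip the sink's metadata directory during discovery.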
[jira] [Assigned] (SPARK-20548) Flaky Test: ReplSuite.newProductSeqEncoder with REPL defined class
[ https://issues.apache.org/jira/browse/SPARK-20548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20548: Assignee: Sameer Agarwal (was: Apache Spark) > Flaky Test: ReplSuite.newProductSeqEncoder with REPL defined class > --- > > Key: SPARK-20548 > URL: https://issues.apache.org/jira/browse/SPARK-20548 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.2.0 > Reporter: Sameer Agarwal > Assignee: Sameer Agarwal > Fix For: 2.2.0 > > > {{newProductSeqEncoder with REPL defined class}} in {{ReplSuite}} has been > failing non-deterministically over the last few days: https://spark-tests.appspot.com/failed-tests > https://spark.test.databricks.com/job/spark-master-test-sbt-hadoop-2.7/176/testReport/junit/org.apache.spark.repl/ReplSuite/newProductSeqEncoder_with_REPL_defined_class/history/ -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20548) Flaky Test: ReplSuite.newProductSeqEncoder with REPL defined class
[ https://issues.apache.org/jira/browse/SPARK-20548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15995360#comment-15995360 ] Apache Spark commented on SPARK-20548: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/17844 > Flaky Test: ReplSuite.newProductSeqEncoder with REPL defined class > --- > > Key: SPARK-20548 > URL: https://issues.apache.org/jira/browse/SPARK-20548 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.2.0 > Reporter: Sameer Agarwal > Assignee: Sameer Agarwal > Fix For: 2.2.0 > > > {{newProductSeqEncoder with REPL defined class}} in {{ReplSuite}} has been > failing non-deterministically over the last few days: https://spark-tests.appspot.com/failed-tests > https://spark.test.databricks.com/job/spark-master-test-sbt-hadoop-2.7/176/testReport/junit/org.apache.spark.repl/ReplSuite/newProductSeqEncoder_with_REPL_defined_class/history/ -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20548) Flaky Test: ReplSuite.newProductSeqEncoder with REPL defined class
[ https://issues.apache.org/jira/browse/SPARK-20548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20548: Assignee: Apache Spark (was: Sameer Agarwal) > Flaky Test: ReplSuite.newProductSeqEncoder with REPL defined class > --- > > Key: SPARK-20548 > URL: https://issues.apache.org/jira/browse/SPARK-20548 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.2.0 > Reporter: Sameer Agarwal > Assignee: Apache Spark > Fix For: 2.2.0 > > > {{newProductSeqEncoder with REPL defined class}} in {{ReplSuite}} has been > failing non-deterministically over the last few days: https://spark-tests.appspot.com/failed-tests > https://spark.test.databricks.com/job/spark-master-test-sbt-hadoop-2.7/176/testReport/junit/org.apache.spark.repl/ReplSuite/newProductSeqEncoder_with_REPL_defined_class/history/ -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20570) The main version number on docs/latest/index.html
[ https://issues.apache.org/jira/browse/SPARK-20570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-20570. --- Resolution: Fixed Assignee: Michael Armbrust Fix Version/s: 2.1.1 Oh, [~marmbrus] already did that, and it looks like it's OK now: http://spark.apache.org/docs/latest/ > The main version number on docs/latest/index.html > - > > Key: SPARK-20570 > URL: https://issues.apache.org/jira/browse/SPARK-20570 > Project: Spark > Issue Type: Bug > Components: Documentation > Affects Versions: 2.1.1 > Reporter: liucht-inspur > Assignee: Michael Armbrust > Fix For: 2.1.1 > > > On the spark.apache.org home page, when I click Latest Release (Spark 2.1.1) > under the Documentation menu, the docs/latest page appears but displays the > 2.1.0 label in the upper left corner of the page -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20570) The main version number on docs/latest/index.html
[ https://issues.apache.org/jira/browse/SPARK-20570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15995265#comment-15995265 ] Sean Owen commented on SPARK-20570: --- Oh, probably another git sync hiccup. It may 'flush' if you push an empty commit to the repo. I can try that in a sec. > The main version number on docs/latest/index.html > - > > Key: SPARK-20570 > URL: https://issues.apache.org/jira/browse/SPARK-20570 > Project: Spark > Issue Type: Bug > Components: Documentation > Affects Versions: 2.1.1 > Reporter: liucht-inspur > > On the spark.apache.org home page, when I click Latest Release (Spark 2.1.1) > under the Documentation menu, the docs/latest page appears but displays the > 2.1.0 label in the upper left corner of the page -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20570) The main version number on docs/latest/index.html
[ https://issues.apache.org/jira/browse/SPARK-20570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15995248#comment-15995248 ] Michael Armbrust commented on SPARK-20570: -- Hmmm, I did push them, and they show up on the [asf git website|https://git-wip-us.apache.org/repos/asf?p=spark-website.git;a=commit;h=d4f0c34ac33001169452a276acfebb7b8cc58c0f]. They don't seem to appear on GitHub though. I guess I'll open an INFRA ticket. > The main version number on docs/latest/index.html > - > > Key: SPARK-20570 > URL: https://issues.apache.org/jira/browse/SPARK-20570 > Project: Spark > Issue Type: Bug > Components: Documentation > Affects Versions: 2.1.1 > Reporter: liucht-inspur > > On the spark.apache.org home page, when I click Latest Release (Spark 2.1.1) > under the Documentation menu, the docs/latest page appears but displays the > 2.1.0 label in the upper left corner of the page -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20586) Add deterministic and distinctLike to ScalaUDF
Xiao Li created SPARK-20586: --- Summary: Add deterministic and distinctLike to ScalaUDF Key: SPARK-20586 URL: https://issues.apache.org/jira/browse/SPARK-20586 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Xiao Li Assignee: Xiao Li https://hive.apache.org/javadocs/r2.0.1/api/org/apache/hadoop/hive/ql/udf/UDFType.html Like Hive's UDFType, we should allow users to set extra flags on ScalaUDF too. {{stateful}}/{{impliesOrder}} are not applicable to ScalaUDF, so we only add the following two flags. - deterministic: certain optimizations should not be applied if the UDF is not deterministic. A deterministic UDF returns the same result each time it is invoked with a particular input; this determinism only needs to hold within the context of a single query. - distinctLike: a UDF is considered distinctLike if it can be evaluated on just the distinct values of a column. Examples include the min and max UDFs. This information is used by the metadata-only optimizer. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
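A minimal sketch of what the two flags could look like on a UDF wrapper, together with a toy distinct-values evaluation. The class and function names below are assumptions for illustration, not the actual ScalaUDF change:

```python
class SimpleUDF:
    """Hypothetical UDF wrapper carrying the two proposed flags."""
    def __init__(self, fn, deterministic=True, distinct_like=False):
        self.fn = fn
        self.deterministic = deterministic  # same input -> same output within one query
        self.distinct_like = distinct_like  # evaluating on distinct inputs suffices

def aggregate(udf, combine, values):
    """A metadata-only style shortcut: a distinctLike UDF may be applied
    to just the distinct values of the column before combining."""
    inputs = set(values) if udf.distinct_like else list(values)
    return combine(udf.fn(v) for v in inputs)

# For a distinctLike computation such as max, distinct inputs give the
# same answer as the full column while touching fewer rows.
double = SimpleUDF(lambda x: 2 * x, distinct_like=True)
print(aggregate(double, max, [3, 1, 3, 2]))  # 6
```

The deterministic flag is the gate for this kind of rewrite: a non-deterministic UDF (e.g. one reading a random source) cannot safely be deduplicated or reordered, so an optimizer would check it before applying any such shortcut.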
[jira] [Resolved] (SPARK-20583) Scala/Java generic hint support
[ https://issues.apache.org/jira/browse/SPARK-20583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-20583. - Resolution: Fixed Fix Version/s: 2.2.0 > Scala/Java generic hint support > --- > > Key: SPARK-20583 > URL: https://issues.apache.org/jira/browse/SPARK-20583 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.0 >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 2.2.0 > > > Done in https://github.com/apache/spark/pull/17839 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20584) Python generic hint support
Reynold Xin created SPARK-20584: --- Summary: Python generic hint support Key: SPARK-20584 URL: https://issues.apache.org/jira/browse/SPARK-20584 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.1.0 Reporter: Reynold Xin -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20583) Scala/Java generic hint support
Reynold Xin created SPARK-20583: --- Summary: Scala/Java generic hint support Key: SPARK-20583 URL: https://issues.apache.org/jira/browse/SPARK-20583 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.1.0 Reporter: Reynold Xin Assignee: Reynold Xin Done in https://github.com/apache/spark/pull/17839 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20585) R generic hint support
Reynold Xin created SPARK-20585: --- Summary: R generic hint support Key: SPARK-20585 URL: https://issues.apache.org/jira/browse/SPARK-20585 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.1.0 Reporter: Reynold Xin -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20582) Speed up the restart of HistoryServer using ApplicationAttemptInfo checkpointing
[ https://issues.apache.org/jira/browse/SPARK-20582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-20582. Resolution: Duplicate > Speed up the restart of HistoryServer using ApplicationAttemptInfo > checkpointing > > > Key: SPARK-20582 > URL: https://issues.apache.org/jira/browse/SPARK-20582 > Project: Spark > Issue Type: Improvement > Components: Deploy > Affects Versions: 2.1.0 > Reporter: zhoukang > Priority: Critical > > In the current HistoryServer code, the Jetty server is started only after > all the logs fetched from YARN have been replayed. However, as the number > of logs grows, Jetty's start time becomes too long. > Here we implement a solution that uses ApplicationAttemptInfo checkpointing > to speed up HistoryServer startup: when the HistoryServer starts, it first > loads ApplicationAttemptInfo from a checkpoint file if one exists, which is > faster than replaying the logs one by one. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20571) Flaky SparkR StructuredStreaming tests
[ https://issues.apache.org/jira/browse/SPARK-20571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15995149#comment-15995149 ] Felix Cheung commented on SPARK-20571: -- Thanks, I will look into it today. > Flaky SparkR StructuredStreaming tests > -- > > Key: SPARK-20571 > URL: https://issues.apache.org/jira/browse/SPARK-20571 > Project: Spark > Issue Type: Test > Components: SparkR, Structured Streaming > Affects Versions: 2.2.0 > Reporter: Burak Yavuz > > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76399 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20582) Speed up the restart of HistoryServer using ApplicationAttemptInfo checkpointing
zhoukang created SPARK-20582: Summary: Speed up the restart of HistoryServer using ApplicationAttemptInfo checkpointing Key: SPARK-20582 URL: https://issues.apache.org/jira/browse/SPARK-20582 Project: Spark Issue Type: Improvement Components: Deploy Affects Versions: 2.1.0 Reporter: zhoukang Priority: Critical In the current HistoryServer code, the Jetty server is started only after all the logs fetched from YARN have been replayed. However, as the number of logs grows, Jetty's start time becomes too long. Here we implement a solution that uses ApplicationAttemptInfo checkpointing to speed up HistoryServer startup: when the HistoryServer starts, it first loads ApplicationAttemptInfo from a checkpoint file if one exists, which is faster than replaying the logs one by one. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20582) Speed up the restart of HistoryServer using ApplicationAttemptInfo checkpointing
[ https://issues.apache.org/jira/browse/SPARK-20582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhoukang updated SPARK-20582: - Description: In the current code of the HistoryServer, the Jetty server is started only after all the logs fetched from YARN have been replayed. However, as the number of logs grows, Jetty's start time becomes too long. Here we implement a solution that uses ApplicationAttemptInfo checkpointing to speed up the start of the HistoryServer. When the HistoryServer is starting, it first loads ApplicationAttemptInfo from a checkpoint file if one exists, which is faster than replaying the logs one by one. was: In the current code of HistoryServer,jetty server will be started after all the logs fetch from yarn has been replayed.However,when the number of logs is becoming larger,the start time of jetty will be too long. Here,we implement a solution which using ApplicationAttemptInfo checkpointing to speed up the start of historyserver.When historyserver is starting,it will load ApplicationAttemptInfo from checkpoint file first if exists which is faster then replaying one by one. > Speed up the restart of HistoryServer using ApplicationAttemptInfo > checkpointing > > > Key: SPARK-20582 > URL: https://issues.apache.org/jira/browse/SPARK-20582 > Project: Spark > Issue Type: Improvement > Components: Deploy > Affects Versions: 2.1.0 > Reporter: zhoukang > Priority: Critical > > In the current code of the HistoryServer, the Jetty server is started only after all the logs fetched from YARN have been replayed. However, as the number of logs grows, Jetty's start time becomes too long. > Here we implement a solution that uses ApplicationAttemptInfo checkpointing to speed up the start of the HistoryServer. When the HistoryServer is starting, it first loads ApplicationAttemptInfo from a checkpoint file if one exists, which is faster than replaying the logs one by one. 
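The load-from-checkpoint-or-replay startup described in SPARK-20582 can be sketched in a few lines. This is a minimal, hypothetical Python illustration of the pattern (the actual HistoryServer is Scala, and `load_attempts`, `replay_logs`, and the pickle format are invented for the sketch):

```python
import os
import pickle
import tempfile

def load_attempts(checkpoint_path, replay):
    """Return application attempt info, preferring a checkpoint file.

    Fast path: one deserialization of a previously written checkpoint.
    Slow path: replay every event log, then write a fresh checkpoint.
    """
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, "rb") as f:
            return pickle.load(f)
    attempts = replay()
    with open(checkpoint_path, "wb") as f:
        pickle.dump(attempts, f)
    return attempts

# Toy "replay" standing in for parsing YARN event logs one by one.
def replay_logs():
    return [{"appId": "app_%d" % i, "attemptId": 1} for i in range(3)]

path = os.path.join(tempfile.mkdtemp(), "attempts.ckpt")
first = load_attempts(path, replay_logs)   # first start: replays, then checkpoints
second = load_attempts(path, replay_logs)  # restart: served from the checkpoint
```

The trade-off is the usual one for caches: the checkpoint can go stale if logs change between restarts, so a real implementation would also need an invalidation or incremental-update story.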
[jira] [Resolved] (SPARK-20441) Within the same streaming query, one StreamingRelation should only be transformed to one StreamingExecutionRelation
[ https://issues.apache.org/jira/browse/SPARK-20441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz resolved SPARK-20441. - Resolution: Fixed Resolved with https://github.com/apache/spark/pull/17735 > Within the same streaming query, one StreamingRelation should only be > transformed to one StreamingExecutionRelation > --- > > Key: SPARK-20441 > URL: https://issues.apache.org/jira/browse/SPARK-20441 > Project: Spark > Issue Type: Bug > Components: Structured Streaming > Affects Versions: 2.1.0, 2.1.1, 2.1.2, 2.2.0 > Reporter: Liwei Lin > > Within the same streaming query, when one StreamingRelation is referred to > multiple times -- e.g. df.union(df) -- we should transform it into only one > StreamingExecutionRelation, instead of two or more different > StreamingExecutionRelations.
[jira] [Closed] (SPARK-20432) Unioning two identical Streaming DataFrames fails during attribute resolution
[ https://issues.apache.org/jira/browse/SPARK-20432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz closed SPARK-20432. --- Resolution: Duplicate > Unioning two identical Streaming DataFrames fails during attribute resolution > - > > Key: SPARK-20432 > URL: https://issues.apache.org/jira/browse/SPARK-20432 > Project: Spark > Issue Type: Bug > Components: Structured Streaming > Affects Versions: 2.1.0 > Reporter: Burak Yavuz > > To reproduce, try unioning two identical Kafka streams: > {code} > df = spark.readStream.format("kafka")... \ > .select(from_json(col("value").cast("string"), > simpleSchema).alias("parsed_value")) > df.union(df).writeStream... > {code} > The exception is confusing: > {code} > org.apache.spark.sql.AnalysisException: resolved attribute(s) value#526 > missing from > value#511,topic#512,partition#513,offset#514L,timestampType#516,key#510,timestamp#515 > in operator !Project [jsontostructs(...) AS parsed_value#357]; > {code}
[jira] [Created] (SPARK-20581) Using AVG or SUM on a INT/BIGINT column with fraction operator will yield BIGINT instead of DOUBLE
Dominic Ricard created SPARK-20581: -- Summary: Using AVG or SUM on a INT/BIGINT column with fraction operator will yield BIGINT instead of DOUBLE Key: SPARK-20581 URL: https://issues.apache.org/jira/browse/SPARK-20581 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.6.2 Reporter: Dominic Ricard We stumbled on this multiple times and every time we are baffled by the behavior of AVG and SUM. Given the following SQL (Executed through Thrift): {noformat} SELECT SUM(col/2) FROM (SELECT 3 as `col`) t {noformat} The result will be "1", when the expected and accurate result is 1.5 Here's the explain plan: {noformat} == Physical Plan == TungstenAggregate(key=[], functions=[(sum(cast((cast(col#1519342 as double) / 2.0) as bigint)),mode=Final,isDistinct=false)], output=[_c0#1519344L]) +- TungstenExchange SinglePartition, None +- TungstenAggregate(key=[], functions=[(sum(cast((cast(col#1519342 as double) / 2.0) as bigint)),mode=Partial,isDistinct=false)], output=[sum#1519347L]) +- Project [3 AS col#1519342] +- Scan OneRowRelation[] {noformat} Why the extra cast to BIGINT?
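The truncation visible in that plan can be mimicked in plain Python. This is only an illustration of the arithmetic in the explain output (the per-row cast of the division result back to BIGINT), not a reproduction of Spark's execution; the `CAST(col AS DOUBLE)` workaround shown in the comment is a common one, not the fix confirmed by this ticket:

```python
# Single-row relation (SELECT 3 as `col`) as a plain list.
rows = [3]

# What the plan effectively computes: cast(col as double) / 2.0, then a
# per-row cast back to BIGINT, which truncates the fraction before the sum.
as_bigint = sum(int(float(v) / 2.0) for v in rows)   # -> 1

# DOUBLE arithmetic throughout, e.g. SELECT SUM(CAST(col AS DOUBLE) / 2):
as_double = sum(float(v) / 2.0 for v in rows)        # -> 1.5
```

The truncation happens per row inside the aggregate, so it cannot be repaired after the fact by casting the SUM's result.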
[jira] [Commented] (SPARK-20580) Allow RDD cache with unserializable objects
[ https://issues.apache.org/jira/browse/SPARK-20580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15994968#comment-15994968 ] Fernando Pereira commented on SPARK-20580: -- I try to avoid any operation involving serialization. I'm using PySpark and the default RDD.cache(), so eventually there is a bug somewhere. My case: In [11]: fzer.fdata.morphologyRDD.map(lambda a: MorphoStats.has_duplicated_points(a[1])).count() Out[11]: 22431 In [12]: rdd2 = fzer.fdata.morphologyRDD.cache() In [13]: rdd2.map(lambda a: MorphoStats.has_duplicated_points(a[1])).count() [Stage 7:>(0 + 8) / 128]17/05/03 16:22:52 ERROR Executor: Exception in task 1.0 in stage 7.0 (TID 652) org.apache.spark.SparkException: Python worker exited unexpectedly (crashed) (...) Caused by: java.io.EOFException at java.io.DataInputStream.readInt(DataInputStream.java:392) > Allow RDD cache with unserializable objects > --- > > Key: SPARK-20580 > URL: https://issues.apache.org/jira/browse/SPARK-20580 > Project: Spark > Issue Type: Improvement > Components: Spark Core > Affects Versions: 1.3.0 > Reporter: Fernando Pereira > Priority: Minor > > In my current scenario we load complex Python objects in the worker nodes > that are not completely serializable. We then apply certain map operations to > the RDD, which at some point we collect. In this basic usage all works well. > However, if we cache() the RDD (which defaults to memory), it suddenly fails > to execute the transformations after the caching step. Apparently caching > serializes the RDD data and deserializes it whenever more transformations are > required. > It would be nice to avoid serialization of the objects if they are to be > cached to memory, and keep the original objects.
[jira] [Commented] (SPARK-20580) Allow RDD cache with unserializable objects
[ https://issues.apache.org/jira/browse/SPARK-20580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15994947#comment-15994947 ] Sean Owen commented on SPARK-20580: --- In general, such objects need to be serializable, because otherwise there's no way to move such data, and almost anything you do involves moving the data. You might be able to get away with it in scenarios where the objects are created from a data source into memory and never participate in any operation that involves serialization. Here, it sounds like you chose one of the "_SER" storage levels, which serializes the object into memory. If so, that's your problem. > Allow RDD cache with unserializable objects > --- > > Key: SPARK-20580 > URL: https://issues.apache.org/jira/browse/SPARK-20580 > Project: Spark > Issue Type: Improvement > Components: Spark Core > Affects Versions: 1.3.0 > Reporter: Fernando Pereira > Priority: Minor > > In my current scenario we load complex Python objects in the worker nodes > that are not completely serializable. We then apply certain map operations to > the RDD, which at some point we collect. In this basic usage all works well. > However, if we cache() the RDD (which defaults to memory), it suddenly fails > to execute the transformations after the caching step. Apparently caching > serializes the RDD data and deserializes it whenever more transformations are > required. > It would be nice to avoid serialization of the objects if they are to be > cached to memory, and keep the original objects.
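The serializability constraint behind this comment can be demonstrated with plain Python pickling, independent of Spark (PySpark pickles RDD elements, among other places, for serialized storage levels and shuffles). A standalone sketch:

```python
import pickle
import tempfile

# Plain data round-trips through pickle without trouble.
blob = pickle.dumps({"x": 42})
restored = pickle.loads(blob)

# An object tied to external process state, such as an open file handle,
# cannot be pickled -- so it cannot survive any step that serializes it.
handle = tempfile.TemporaryFile()
try:
    pickle.dumps(handle)
    serializable = True
except TypeError:          # "cannot pickle" raised for file objects
    serializable = False
handle.close()
```

This is why objects that only ever live in worker memory can work fine until the first cache-with-serialization or shuffle touches them.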
[jira] [Created] (SPARK-20580) Allow RDD cache with unserializable objects
Fernando Pereira created SPARK-20580: Summary: Allow RDD cache with unserializable objects Key: SPARK-20580 URL: https://issues.apache.org/jira/browse/SPARK-20580 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.3.0 Reporter: Fernando Pereira Priority: Minor In my current scenario we load complex Python objects in the worker nodes that are not completely serializable. We then apply certain map operations to the RDD, which at some point we collect. In this basic usage all works well. However, if we cache() the RDD (which defaults to memory), it suddenly fails to execute the transformations after the caching step. Apparently caching serializes the RDD data and deserializes it whenever more transformations are required. It would be nice to avoid serialization of the objects if they are to be cached to memory, and keep the original objects.