[jira] [Updated] (SPARK-20593) Writing Parquet: Cannot build an empty group

2017-05-03 Thread Viktor Khristenko (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viktor Khristenko updated SPARK-20593:
--
Description: 
Hi,

This is my first ticket, so I apologize if I'm doing anything improperly.

I have a dataset with the following schema:

```
root
 |-- muons: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- reco::Candidate: struct (nullable = true)
 |    |    |-- qx3_: integer (nullable = true)
 |    |    |-- pt_: float (nullable = true)
 |    |    |-- eta_: float (nullable = true)
 |    |    |-- phi_: float (nullable = true)
 |    |    |-- mass_: float (nullable = true)
 |    |    |-- vertex_: struct (nullable = true)
 |    |    |    |-- fCoordinates: struct (nullable = true)
 |    |    |    |    |-- fX: float (nullable = true)
 |    |    |    |    |-- fY: float (nullable = true)
 |    |    |    |    |-- fZ: float (nullable = true)
 |    |    |-- pdgId_: integer (nullable = true)
 |    |    |-- status_: integer (nullable = true)
 |    |    |-- cachePolarFixed_: struct (nullable = true)
 |    |    |-- cacheCartesianFixed_: struct (nullable = true)
```

As you can see, there are three empty structs in this schema (reco::Candidate, cachePolarFixed_ and cacheCartesianFixed_). I can read and manipulate this dataset without any problems. However, when I try to write it to disk as Parquet with ds.write.format("parquet").save(outputPathName), I get the following exception:

java.lang.IllegalStateException: Cannot build an empty group
    at org.apache.parquet.Preconditions.checkState(Preconditions.java:91)
    at org.apache.parquet.schema.Types$BaseGroupBuilder.build(Types.java:622)
    at org.apache.parquet.schema.Types$BaseGroupBuilder.build(Types.java:497)
    at org.apache.parquet.schema.Types$Builder.named(Types.java:286)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter.convertField(ParquetSchemaConverter.scala:535)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter.convertField(ParquetSchemaConverter.scala:321)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$$anonfun$convertField$1.apply(ParquetSchemaConverter.scala:534)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$$anonfun$convertField$1.apply(ParquetSchemaConverter.scala:533)
So, basically, I would like to understand whether this is a bug or intended behavior; I assume it is related to the empty structs. Any help would be really appreciated!

I quickly created a stripped-down version of the dataset, and that one writes without any issues. For reference, [1] links to the original question on Stack Overflow.

VK

[1] 
http://stackoverflow.com/questions/43767358/apache-spark-parquet-cannot-build-an-empty-group
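
For readers without access to the original dataset, a minimal sketch that reproduces the same exception, assuming a Spark 2.1.x spark-shell (so `spark` is in scope); the column names here are made up and only the empty struct matters:

```scala
// Minimal reproduction sketch (assumed: Spark 2.1.x spark-shell).
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// One ordinary column plus one empty struct, analogous to cachePolarFixed_ above.
val schema = StructType(Seq(
  StructField("pt_", FloatType, nullable = true),
  StructField("cachePolarFixed_", StructType(Nil), nullable = true)
))

val df = spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(Row(39.5f, Row()))),
  schema
)

df.printSchema()  // reading and inspecting the Dataset works fine
df.write.format("parquet").save("/tmp/empty-struct-repro")
// => java.lang.IllegalStateException: Cannot build an empty group
```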

[jira] [Created] (SPARK-20593) Writing Parquet: Cannot build an empty group

2017-05-03 Thread Viktor Khristenko (JIRA)
Viktor Khristenko created SPARK-20593:
-

 Summary: Writing Parquet: Cannot build an empty group
 Key: SPARK-20593
 URL: https://issues.apache.org/jira/browse/SPARK-20593
 Project: Spark
  Issue Type: Question
  Components: Spark Core, Spark Shell
Affects Versions: 2.1.1
 Environment: Apache Spark 2.1.1 (2.1.0 shows the same behavior; switched today). Tested only on macOS.
Reporter: Viktor Khristenko
Priority: Minor








[jira] [Created] (SPARK-20592) Alter table concatenate is not working as expected.

2017-05-03 Thread Guru Prabhakar Reddy Marthala (JIRA)
Guru Prabhakar Reddy Marthala created SPARK-20592:
-

 Summary: Alter table concatenate is not working as expected.
 Key: SPARK-20592
 URL: https://issues.apache.org/jira/browse/SPARK-20592
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.1.0
Reporter: Guru Prabhakar Reddy Marthala


Created a table using CTAS from CSV to Parquet. The Parquet table ended up with
numerous small files, so I tried ALTER TABLE ... CONCATENATE, but it is not working
as expected.

spark.sql("CREATE TABLE flight.flight_data(year INT,   month INT,   day INT,   
day_of_week INT,   dep_time INT,   crs_dep_time INT,   arr_time INT,   
crs_arr_time INT,   unique_carrier STRING,   flight_num INT,   tail_num STRING, 
  actual_elapsed_time INT,   crs_elapsed_time INT,   air_time INT,   arr_delay 
INT,   dep_delay INT,   origin STRING,   dest STRING,   distance INT,   taxi_in 
INT,   taxi_out INT,   cancelled INT,   cancellation_code STRING,   diverted 
INT,   carrier_delay STRING,   weather_delay STRING,   nas_delay STRING,   
security_delay STRING,   late_aircraft_delay STRING) ROW FORMAT DELIMITED 
FIELDS TERMINATED BY ',' stored as textfile")

spark.sql("load data local INPATH 'i:/2008/2008.csv' INTO TABLE 
flight.flight_data")

spark.sql("create table flight.flight_data_pq stored as parquet as select * 
from flight.flight_data")

spark.sql("create table flight.flight_data_orc stored as orc as select * from 
flight.flight_data")

pyspark.sql.utils.ParseException: u'\nOperation not allowed: alter table 
concatenate(line 1, pos 0)\n\n== SQL ==\nalter table flight_data.flight_data_pq 
concatenate\n^^^\n'

Tried this on both the ORC and Parquet tables; neither works.
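
Since Spark SQL does not support ALTER TABLE ... CONCATENATE, a possible workaround is to compact the small files by rewriting the data with fewer partitions. A minimal sketch, assuming the Scala API and the table names above; the target file count of 8 is an arbitrary placeholder:

```scala
// Rewrite the Parquet table's data into a fixed, small number of files and
// publish it under a new table name (swap or rename afterwards as needed).
spark.table("flight.flight_data_pq")
  .coalesce(8)  // assumed target number of output files
  .write
  .mode("overwrite")
  .saveAsTable("flight.flight_data_pq_compacted")
```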






[jira] [Commented] (SPARK-19660) Replace the configuration property names that are deprecated in the version of Hadoop 2.6

2017-05-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15996175#comment-15996175
 ] 

Apache Spark commented on SPARK-19660:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/17856

> Replace the configuration property names that are deprecated in the version 
> of Hadoop 2.6
> -
>
> Key: SPARK-19660
> URL: https://issues.apache.org/jira/browse/SPARK-19660
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.1.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
> Fix For: 2.2.0
>
>
> Replace all the Hadoop deprecated configuration property names according to  
> [DeprecatedProperties|https://hadoop.apache.org/docs/r2.6.0/hadoop-project-dist/hadoop-common/DeprecatedProperties.html].
> except:
> https://github.com/apache/spark/blob/v2.1.0/python/pyspark/sql/tests.py#L1533
> https://github.com/apache/spark/blob/v2.1.0/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala#L987
> https://github.com/apache/spark/blob/v2.1.0/sql/core/src/main/scala/org/apache/spark/sql/execution/command/SetCommand.scala#L45
> https://github.com/apache/spark/blob/v2.1.0/sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L614
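
For illustration, one pair from that DeprecatedProperties table, set through the Hadoop configuration exposed by Spark; the property names come from the Hadoop docs and the value is an arbitrary example:

```scala
// Prefer the Hadoop 2.x property name over its deprecated alias.
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("mapreduce.input.fileinputformat.split.minsize", "134217728")
// deprecated equivalent: "mapred.min.split.size"
```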






[jira] [Resolved] (SPARK-20543) R should skip long running or non-essential tests when running on CRAN

2017-05-03 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-20543.
--
  Resolution: Fixed
   Fix Version/s: 2.3.0
  2.2.0
Target Version/s: 2.2.0, 2.3.0

> R should skip long running or non-essential tests when running on CRAN
> --
>
> Key: SPARK-20543
> URL: https://issues.apache.org/jira/browse/SPARK-20543
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
> Fix For: 2.2.0, 2.3.0
>
>
> This is actually recommended in the CRAN policies






[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 10.0.1 to 10.2.0

2017-05-03 Thread Helena Edelson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15996166#comment-15996166
 ] 

Helena Edelson commented on SPARK-18057:


With the current 0.10.0.1 version we are hitting several issues that keep forcing us
into tighter corners. Much of this comes from constraints around functionality added
in later Kafka releases, such as Kafka security and SASL_SSL, and related behavior
that does not exist in previous versions of Kafka.

Users in our ecosystem cannot delete topics on clusters, so that is not a relevant
use case for us. It seems only structured-streaming Kafka does deleteTopic, versus
spark-streaming-kafka.

I've had to create an internal fork so that we can use Kafka 0.10.2.0 with Spark,
which is bad, but we are blocked otherwise.

[~ijuma] good to know on the timing. A group of us voted for 
https://issues.apache.org/jira/browse/KAFKA-4879. 

> Update structured streaming kafka from 10.0.1 to 10.2.0
> ---
>
> Key: SPARK-18057
> URL: https://issues.apache.org/jira/browse/SPARK-18057
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Cody Koeninger
>
> There are a couple of relevant KIPs here, 
> https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html






[jira] [Assigned] (SPARK-20564) a lot of executor failures when the executor number is more than 2000

2017-05-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20564:


Assignee: Apache Spark

> a lot of executor failures when the executor number is more than 2000
> -
>
> Key: SPARK-20564
> URL: https://issues.apache.org/jira/browse/SPARK-20564
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 1.6.2, 2.1.0
>Reporter: Hua Liu
>Assignee: Apache Spark
>
> When we used more than 2000 executors in a Spark application, we noticed that a 
> large number of executors could not connect to the driver and were therefore 
> marked as failed. In some cases the number of failed executors reached twice the 
> requested executor count, so applications retried and could eventually fail.
> This is because YarnAllocator requests all missing containers every 
> spark.yarn.scheduler.heartbeat.interval-ms (default 3 seconds). For example, 
> YarnAllocator can ask for and get over 2000 containers in one request and 
> then launch them almost simultaneously. These thousands of executors try to 
> retrieve Spark props and register with the driver within seconds. However, the 
> driver handles executor registration, stop, removal and Spark props retrieval in 
> one thread, and it cannot handle such a large number of RPCs within a short 
> period of time. As a result, some executors cannot retrieve Spark props 
> and/or register; they are then marked as failed, causing executor removal and 
> aggravating the overloading of the driver, which leads to even more executor 
> failures.
> This patch adds an extra configuration, 
> spark.yarn.launchContainer.count.simultaneously, which caps the maximum 
> number of containers that the driver can ask for in every 
> spark.yarn.scheduler.heartbeat.interval-ms. As a result, the number of 
> executors grows steadily, the number of executor failures is reduced, and 
> applications reach the desired number of executors faster.
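
For illustration only, this is how the proposed cap would sit next to the existing heartbeat setting. Note that spark.yarn.launchContainer.count.simultaneously is the name proposed by this report and is not part of any released Spark; both values below are arbitrary:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // existing knob: how often YarnAllocator heartbeats and requests missing containers
  .set("spark.yarn.scheduler.heartbeat.interval-ms", "3000")
  // proposed knob (from this report): cap on containers requested per heartbeat
  .set("spark.yarn.launchContainer.count.simultaneously", "500")
```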






[jira] [Assigned] (SPARK-20564) a lot of executor failures when the executor number is more than 2000

2017-05-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20564:


Assignee: (was: Apache Spark)







[jira] [Commented] (SPARK-20564) a lot of executor failures when the executor number is more than 2000

2017-05-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15996165#comment-15996165
 ] 

Apache Spark commented on SPARK-20564:
--

User 'mariahualiu' has created a pull request for this issue:
https://github.com/apache/spark/pull/17854







[jira] [Created] (SPARK-20591) Succeeded tasks num not equal in job page and job detail page on spark web ui when speculative task(s) exist

2017-05-03 Thread Jinhua Fu (JIRA)
Jinhua Fu created SPARK-20591:
-

 Summary: Succeeded tasks num not equal in job page and job detail 
page on spark web ui when speculative task(s) exist
 Key: SPARK-20591
 URL: https://issues.apache.org/jira/browse/SPARK-20591
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 2.0.2
Reporter: Jinhua Fu


When spark.speculation is enabled and there are speculative tasks, the succeeded-task 
count on the job page includes the speculative tasks, but the count on the job detail 
page (the job's stages page) does not.
When the job page shows more succeeded tasks than total tasks, I suspect that some 
tasks ran slowly and want to know which tasks and why, but I have to check every 
stage to find the speculative tasks, because they are not included in the stages' 
succeeded-task counts.
Can this be improved?






[jira] [Resolved] (SPARK-20584) Python generic hint support

2017-05-03 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-20584.
-
   Resolution: Fixed
 Assignee: Maciej Szymkiewicz
Fix Version/s: 2.2.0

> Python generic hint support
> ---
>
> Key: SPARK-20584
> URL: https://issues.apache.org/jira/browse/SPARK-20584
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Reynold Xin
>Assignee: Maciej Szymkiewicz
> Fix For: 2.2.0
>
>







[jira] [Comment Edited] (SPARK-20582) Speed up the restart of HistoryServer using ApplicationAttemptInfo checkpointing

2017-05-03 Thread zhoukang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15996013#comment-15996013
 ] 

zhoukang edited comment on SPARK-20582 at 5/4/17 1:44 AM:
--

You are right [~vanzin], I have looked into your implementation. Thanks for your time.


was (Author: cane):
You are right [~vanzin], I searched into your implementation. Thanks for your time.

> Speed up the restart of HistoryServer using ApplicationAttemptInfo 
> checkpointing
> 
>
> Key: SPARK-20582
> URL: https://issues.apache.org/jira/browse/SPARK-20582
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 2.1.0
>Reporter: zhoukang
>Priority: Critical
>
> In the current code of the HistoryServer, the jetty server is started only 
> after all the logs fetched from YARN have been replayed. However, as the number 
> of logs grows, the start time of jetty becomes too long.
> Here we implement a solution that uses ApplicationAttemptInfo 
> checkpointing to speed up the start of the history server: when the history server 
> is starting, it first loads ApplicationAttemptInfo from a checkpoint file if one 
> exists, which is faster than replaying the logs one by one.






[jira] [Commented] (SPARK-20582) Speed up the restart of HistoryServer using ApplicationAttemptInfo checkpointing

2017-05-03 Thread zhoukang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15996013#comment-15996013
 ] 

zhoukang commented on SPARK-20582:
--

You are right [~vanzin], I searched into your implementation. Thanks for your time.







[jira] [Assigned] (SPARK-20546) spark-class gets syntax error in posix mode

2017-05-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20546:


Assignee: Apache Spark

> spark-class gets syntax error in posix mode
> ---
>
> Key: SPARK-20546
> URL: https://issues.apache.org/jira/browse/SPARK-20546
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.0.2
>Reporter: Jessie Yu
>Assignee: Apache Spark
>Priority: Minor
>
> spark-class gets the following error when running in posix mode:
> {code}
> spark-class: line 78: syntax error near unexpected token `<'
> spark-class: line 78: `done < <(build_command "$@")'
> {code}
> \\
> It appears to be complaining about the process substitution: 
> {code}
> CMD=()
> while IFS= read -d '' -r ARG; do
>   CMD+=("$ARG")
> done < <(build_command "$@")
> {code}
> \\
> This can be reproduced by first turning on allexport then posix mode:
> {code}set -a -o posix {code}
> then run something like spark-shell which calls spark-class.
> \\
> The simplest fix is probably to always turn off posix mode in spark-class 
> before the while loop.
> \\
> This was previously reported in 
> [SPARK-8417|https://issues.apache.org/jira/browse/SPARK-8417], which was closed 
> as cannot-reproduce.






[jira] [Assigned] (SPARK-20546) spark-class gets syntax error in posix mode

2017-05-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20546:


Assignee: (was: Apache Spark)







[jira] [Commented] (SPARK-20546) spark-class gets syntax error in posix mode

2017-05-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15995999#comment-15995999
 ] 

Apache Spark commented on SPARK-20546:
--

User 'jyu00' has created a pull request for this issue:
https://github.com/apache/spark/pull/17852







[jira] [Commented] (SPARK-17833) 'monotonicallyIncreasingId()' should be deterministic

2017-05-03 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15995990#comment-15995990
 ] 

Joseph K. Bradley commented on SPARK-17833:
---

I'll add: There definitely needs to be a deterministic version in order to 
compute row indices.  In my case, this is important for GraphFrames.  The last 
time I checked, you can get non-deterministic behavior by, e.g., shuffling a 
DataFrame, adding a monotonicallyIncreasingId column, and inspecting.


> 'monotonicallyIncreasingId()' should be deterministic
> -
>
> Key: SPARK-17833
> URL: https://issues.apache.org/jira/browse/SPARK-17833
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Kevin Ushey
>Priority: Critical
>
> Right now, it's (IMHO) too easy to shoot yourself in the foot using 
> 'monotonicallyIncreasingId()', as it's easy to expect the generated numbers 
> to function as a 'stable' primary key, for example, and then go on to use 
> that key in e.g. 'joins' and so on.
> Is there any reason why this function can't be made deterministic? Or, could 
> a deterministic analogue of this function be added (e.g. 
> 'withPrimaryKey(columnName = ...)')?
> A solution is to immediately cache / persist the table after calling 
> 'monotonicallyIncreasingId()'; it's also possible that the documentation 
> should spell that out loud and clear.
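
A minimal sketch of the cache/persist workaround mentioned above (Scala API; `df` stands for any existing DataFrame): materialize the generated IDs once so later operations all see the same assignment.

```scala
import org.apache.spark.sql.functions.monotonically_increasing_id

// Assign IDs, then cache and force materialization so the otherwise
// non-deterministic values are fixed before being used as a key.
val withId = df.withColumn("row_id", monotonically_increasing_id()).cache()
withId.count()  // materialize

// From here on, joins and lookups on "row_id" see one consistent assignment,
// as long as the cached data is not evicted and recomputed.
```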



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17833) 'monotonicallyIncreasingId()' should be deterministic

2017-05-03 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-17833:
--
Summary: 'monotonicallyIncreasingId()' should be deterministic  (was: 
'monotonicallyIncreasingId()' should probably be deterministic)

> 'monotonicallyIncreasingId()' should be deterministic
> -
>
> Key: SPARK-17833
> URL: https://issues.apache.org/jira/browse/SPARK-17833
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Kevin Ushey
>Priority: Minor
>
> Right now, it's (IMHO) too easy to shoot yourself in the foot using 
> 'monotonicallyIncreasingId()', as it's easy to expect the generated numbers 
> to function as a 'stable' primary key, for example, and then go on to use 
> that key in e.g. 'joins' and so on.
> Is there any reason why this function can't be made deterministic? Or, could 
> a deterministic analogue of this function be added (e.g. 
> 'withPrimaryKey(columnName = ...)')?
> A solution is to immediately cache / persist the table after calling 
> 'monotonicallyIncreasingId()'; it's also possible that the documentation 
> should spell that out loud and clear.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17833) 'monotonicallyIncreasingId()' should be deterministic

2017-05-03 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-17833:
--
Component/s: SQL

> 'monotonicallyIncreasingId()' should be deterministic
> -
>
> Key: SPARK-17833
> URL: https://issues.apache.org/jira/browse/SPARK-17833
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Kevin Ushey
>Priority: Critical
>
> Right now, it's (IMHO) too easy to shoot yourself in the foot using 
> 'monotonicallyIncreasingId()', as it's easy to expect the generated numbers 
> to function as a 'stable' primary key, for example, and then go on to use 
> that key in e.g. 'joins' and so on.
> Is there any reason why this function can't be made deterministic? Or, could 
> a deterministic analogue of this function be added (e.g. 
> 'withPrimaryKey(columnName = ...)')?
> A solution is to immediately cache / persist the table after calling 
> 'monotonicallyIncreasingId()'; it's also possible that the documentation 
> should spell that out loud and clear.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17833) 'monotonicallyIncreasingId()' should be deterministic

2017-05-03 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-17833:
--
Priority: Critical  (was: Minor)

> 'monotonicallyIncreasingId()' should be deterministic
> -
>
> Key: SPARK-17833
> URL: https://issues.apache.org/jira/browse/SPARK-17833
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Kevin Ushey
>Priority: Critical
>
> Right now, it's (IMHO) too easy to shoot yourself in the foot using 
> 'monotonicallyIncreasingId()', as it's easy to expect the generated numbers 
> to function as a 'stable' primary key, for example, and then go on to use 
> that key in e.g. 'joins' and so on.
> Is there any reason why this function can't be made deterministic? Or, could 
> a deterministic analogue of this function be added (e.g. 
> 'withPrimaryKey(columnName = ...)')?
> A solution is to immediately cache / persist the table after calling 
> 'monotonicallyIncreasingId()'; it's also possible that the documentation 
> should spell that out loud and clear.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20585) R generic hint support

2017-05-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20585:


Assignee: Apache Spark

> R generic hint support
> --
>
> Key: SPARK-20585
> URL: https://issues.apache.org/jira/browse/SPARK-20585
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Reynold Xin
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20585) R generic hint support

2017-05-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20585:


Assignee: (was: Apache Spark)

> R generic hint support
> --
>
> Key: SPARK-20585
> URL: https://issues.apache.org/jira/browse/SPARK-20585
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20063) Trigger without delay when falling behind

2017-05-03 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-20063.
---
Resolution: Duplicate

> Trigger without delay when falling behind 
> --
>
> Key: SPARK-20063
> URL: https://issues.apache.org/jira/browse/SPARK-20063
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Michael Armbrust
>Priority: Critical
>
> Today, when we miss a trigger interval we wait until the next one to fire.  
> However, for real workloads this usually means that you fall further and 
> further behind by sitting idle while waiting.  We should revisit this 
> decision.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20585) R generic hint support

2017-05-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15995965#comment-15995965
 ] 

Apache Spark commented on SPARK-20585:
--

User 'zero323' has created a pull request for this issue:
https://github.com/apache/spark/pull/17851

> R generic hint support
> --
>
> Key: SPARK-20585
> URL: https://issues.apache.org/jira/browse/SPARK-20585
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20582) Speed up the restart of HistoryServer using ApplicationAttemptInfo checkpointing

2017-05-03 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15995958#comment-15995958
 ] 

Marcelo Vanzin commented on SPARK-20582:


You're talking about checkpointing the application info, which the code I have 
already does. With my changes the HTTP server comes up immediately.

> Speed up the restart of HistoryServer using ApplicationAttemptInfo 
> checkpointing
> 
>
> Key: SPARK-20582
> URL: https://issues.apache.org/jira/browse/SPARK-20582
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 2.1.0
>Reporter: zhoukang
>Priority: Critical
>
>In the current HistoryServer code, the Jetty server is started only 
> after all the logs fetched from YARN have been replayed. However, as the 
> number of logs grows, the Jetty start-up time becomes too long.
>Here, we implement a solution that uses ApplicationAttemptInfo 
> checkpointing to speed up the start of the HistoryServer. When the 
> HistoryServer starts, it first loads ApplicationAttemptInfo from a checkpoint 
> file, if one exists, which is faster than replaying the logs one by one.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20582) Speed up the restart of HistoryServer using ApplicationAttemptInfo checkpointing

2017-05-03 Thread zhoukang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15995951#comment-15995951
 ] 

zhoukang commented on SPARK-20582:
--

[~vanzin] I think you misunderstand this issue. The goal is to let the Jetty 
server listen on the given port faster; when there are many logs to replay, the 
start-up time can be tens of minutes.

> Speed up the restart of HistoryServer using ApplicationAttemptInfo 
> checkpointing
> 
>
> Key: SPARK-20582
> URL: https://issues.apache.org/jira/browse/SPARK-20582
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 2.1.0
>Reporter: zhoukang
>Priority: Critical
>
>In the current HistoryServer code, the Jetty server is started only 
> after all the logs fetched from YARN have been replayed. However, as the 
> number of logs grows, the Jetty start-up time becomes too long.
>Here, we implement a solution that uses ApplicationAttemptInfo 
> checkpointing to speed up the start of the HistoryServer. When the 
> HistoryServer starts, it first loads ApplicationAttemptInfo from a checkpoint 
> file, if one exists, which is faster than replaying the logs one by one.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20588) from_utc_timestamp causes bottleneck

2017-05-03 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15995939#comment-15995939
 ] 

Hyukjin Kwon commented on SPARK-20588:
--

Hm, I thought {{TimeZone.getTimeZone}} itself caches the result once it has computed it. 

> from_utc_timestamp causes bottleneck
> 
>
> Key: SPARK-20588
> URL: https://issues.apache.org/jira/browse/SPARK-20588
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.2
> Environment: AWS EMR AMI 5.2.1
>Reporter: Ameen Tayyebi
>
> We have a SQL query that makes use of the from_utc_timestamp function like 
> so: from_utc_timestamp(itemSigningTime,'America/Los_Angeles')
> This causes a major bottleneck. Our exact call is:
> date_add(from_utc_timestamp(itemSigningTime,'America/Los_Angeles'), 1)
> Switching from the above to date_add(itemSigningTime, 1) reduces the job 
> running time from 40 minutes to 9.
> When from_utc_timestamp function is used, several threads in the executors 
> are in the BLOCKED state, on this call stack:
> "Executor task launch worker-63" #261 daemon prio=5 os_prio=0 
> tid=0x7f848472e000 nid=0x4294 waiting for monitor entry 
> [0x7f501981c000]
>java.lang.Thread.State: BLOCKED (on object monitor)
> at java.util.TimeZone.getTimeZone(TimeZone.java:516)
> - waiting to lock <0x7f5216c2aa58> (a java.lang.Class for 
> java.util.TimeZone)
> at 
> org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTimestamp(DateTimeUtils.scala:356)
> at 
> org.apache.spark.sql.catalyst.util.DateTimeUtils.stringToTimestamp(DateTimeUtils.scala)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:161)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
> at org.apache.spark.scheduler.Task.run(Task.scala:86)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Can we cache the locales once per JVM so that we don't do this for every 
> record?
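
For context, here is a minimal Scala sketch of the kind of per-JVM cache being asked about; {{TimeZoneCache}} is a hypothetical helper for illustration, not Spark's actual implementation:

{code}
import java.util.TimeZone
import java.util.concurrent.ConcurrentHashMap

// Hypothetical helper: resolve each zone id once per JVM and reuse it, so the
// synchronized TimeZone.getTimeZone call is no longer hit for every record.
object TimeZoneCache {
  private val cache = new ConcurrentHashMap[String, TimeZone]()

  def get(id: String): TimeZone =
    cache.computeIfAbsent(id, new java.util.function.Function[String, TimeZone] {
      override def apply(key: String): TimeZone = TimeZone.getTimeZone(key)
    })
}
{code}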



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20213) DataFrameWriter operations do not show up in SQL tab

2017-05-03 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15995928#comment-15995928
 ] 

Ryan Blue commented on SPARK-20213:
---

[~zsxwing], the PR adds a method that causes tests to fail if they aren't 
wrapped. If you remove the additional high-level calls to withNewExecutionId 
that I added, you can see all the test failures.

> DataFrameWriter operations do not show up in SQL tab
> 
>
> Key: SPARK-20213
> URL: https://issues.apache.org/jira/browse/SPARK-20213
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Web UI
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Ryan Blue
> Attachments: Screen Shot 2017-05-03 at 5.00.19 PM.png
>
>
> In 1.6.1, {{DataFrame}} writes started using {{DataFrameWriter}} actions like 
> {{insertInto}} would show up in the SQL tab. In 2.0.0 and later, they no 
> longer do. The problem is that 2.0.0 and later no longer wrap execution with 
> {{SQLExecution.withNewExecutionId}}, which emits 
> {{SparkListenerSQLExecutionStart}}.
> Here are the relevant parts of the stack traces:
> {code:title=Spark 1.6.1}
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
> org.apache.spark.sql.execution.QueryExecution$$anonfun$toRdd$1.apply(QueryExecution.scala:56)
> org.apache.spark.sql.execution.QueryExecution$$anonfun$toRdd$1.apply(QueryExecution.scala:56)
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:53)
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:56)
>  => holding 
> Monitor(org.apache.spark.sql.hive.HiveContext$QueryExecution@424773807})
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)
> org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:196)
> {code}
> {code:title=Spark 2.0.0}
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
>  => holding Monitor(org.apache.spark.sql.execution.QueryExecution@490977924})
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
> org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:301)
> {code}
> I think this was introduced by 
> [54d23599|https://github.com/apache/spark/commit/54d23599]. The fix should be 
> to add withNewExecutionId to 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala#L610
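
As a rough sketch of the proposed fix (simplified and hedged: the real DataFrameWriter code path and the exact 2.x signature of {{withNewExecutionId}} may differ), the idea is to run the write action inside {{SQLExecution.withNewExecutionId}} so the listener event is emitted:

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.{QueryExecution, SQLExecution}

object WrappedWriteSketch {
  // Conceptual helper, not Spark's actual code: materializing the plan inside
  // the block ties the job to a SQL execution id, which makes it show up in
  // the SQL tab via SparkListenerSQLExecutionStart.
  def runWithExecutionId(session: SparkSession, qe: QueryExecution): Unit = {
    SQLExecution.withNewExecutionId(session, qe) {
      qe.toRdd.count()  // simplified stand-in for the actual write action
    }
  }
}
{code}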



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20213) DataFrameWriter operations do not show up in SQL tab

2017-05-03 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15995923#comment-15995923
 ] 

Shixiong Zhu edited comment on SPARK-20213 at 5/4/17 12:02 AM:
---

I tested the master branch, and I can see "insertInto" in SQL tab.  

!Screen Shot 2017-05-03 at 5.00.19 PM.png|width=300!

Could you clarify the issue? It would be great if you can provide a reproducer.


was (Author: zsxwing):
I tested the master branch, and I can see "insertInto" in SQL tab. Could you 
clarify the issue? It would be great if you can provide a reproducer.

> DataFrameWriter operations do not show up in SQL tab
> 
>
> Key: SPARK-20213
> URL: https://issues.apache.org/jira/browse/SPARK-20213
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Web UI
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Ryan Blue
> Attachments: Screen Shot 2017-05-03 at 5.00.19 PM.png
>
>
> In 1.6.1, {{DataFrame}} writes started using {{DataFrameWriter}} actions like 
> {{insertInto}} would show up in the SQL tab. In 2.0.0 and later, they no 
> longer do. The problem is that 2.0.0 and later no longer wrap execution with 
> {{SQLExecution.withNewExecutionId}}, which emits 
> {{SparkListenerSQLExecutionStart}}.
> Here are the relevant parts of the stack traces:
> {code:title=Spark 1.6.1}
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
> org.apache.spark.sql.execution.QueryExecution$$anonfun$toRdd$1.apply(QueryExecution.scala:56)
> org.apache.spark.sql.execution.QueryExecution$$anonfun$toRdd$1.apply(QueryExecution.scala:56)
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:53)
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:56)
>  => holding 
> Monitor(org.apache.spark.sql.hive.HiveContext$QueryExecution@424773807})
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)
> org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:196)
> {code}
> {code:title=Spark 2.0.0}
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
>  => holding Monitor(org.apache.spark.sql.execution.QueryExecution@490977924})
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
> org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:301)
> {code}
> I think this was introduced by 
> [54d23599|https://github.com/apache/spark/commit/54d23599]. The fix should be 
> to add withNewExecutionId to 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala#L610



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20213) DataFrameWriter operations do not show up in SQL tab

2017-05-03 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15995923#comment-15995923
 ] 

Shixiong Zhu commented on SPARK-20213:
--

I tested the master branch, and I can see "insertInto" in SQL tab. Could you 
clarify the issue? It would be great if you can provide a reproducer.

> DataFrameWriter operations do not show up in SQL tab
> 
>
> Key: SPARK-20213
> URL: https://issues.apache.org/jira/browse/SPARK-20213
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Web UI
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Ryan Blue
>
> In 1.6.1, {{DataFrame}} writes started using {{DataFrameWriter}} actions like 
> {{insertInto}} would show up in the SQL tab. In 2.0.0 and later, they no 
> longer do. The problem is that 2.0.0 and later no longer wrap execution with 
> {{SQLExecution.withNewExecutionId}}, which emits 
> {{SparkListenerSQLExecutionStart}}.
> Here are the relevant parts of the stack traces:
> {code:title=Spark 1.6.1}
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
> org.apache.spark.sql.execution.QueryExecution$$anonfun$toRdd$1.apply(QueryExecution.scala:56)
> org.apache.spark.sql.execution.QueryExecution$$anonfun$toRdd$1.apply(QueryExecution.scala:56)
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:53)
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:56)
>  => holding 
> Monitor(org.apache.spark.sql.hive.HiveContext$QueryExecution@424773807})
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)
> org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:196)
> {code}
> {code:title=Spark 2.0.0}
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
>  => holding Monitor(org.apache.spark.sql.execution.QueryExecution@490977924})
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
> org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:301)
> {code}
> I think this was introduced by 
> [54d23599|https://github.com/apache/spark/commit/54d23599]. The fix should be 
> to add withNewExecutionId to 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala#L610



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20213) DataFrameWriter operations do not show up in SQL tab

2017-05-03 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-20213:
-
Attachment: Screen Shot 2017-05-03 at 5.00.19 PM.png

> DataFrameWriter operations do not show up in SQL tab
> 
>
> Key: SPARK-20213
> URL: https://issues.apache.org/jira/browse/SPARK-20213
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Web UI
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Ryan Blue
> Attachments: Screen Shot 2017-05-03 at 5.00.19 PM.png
>
>
> In 1.6.1, {{DataFrame}} writes started using {{DataFrameWriter}} actions like 
> {{insertInto}} would show up in the SQL tab. In 2.0.0 and later, they no 
> longer do. The problem is that 2.0.0 and later no longer wrap execution with 
> {{SQLExecution.withNewExecutionId}}, which emits 
> {{SparkListenerSQLExecutionStart}}.
> Here are the relevant parts of the stack traces:
> {code:title=Spark 1.6.1}
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
> org.apache.spark.sql.execution.QueryExecution$$anonfun$toRdd$1.apply(QueryExecution.scala:56)
> org.apache.spark.sql.execution.QueryExecution$$anonfun$toRdd$1.apply(QueryExecution.scala:56)
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:53)
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:56)
>  => holding 
> Monitor(org.apache.spark.sql.hive.HiveContext$QueryExecution@424773807})
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)
> org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:196)
> {code}
> {code:title=Spark 2.0.0}
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
>  => holding Monitor(org.apache.spark.sql.execution.QueryExecution@490977924})
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
> org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:301)
> {code}
> I think this was introduced by 
> [54d23599|https://github.com/apache/spark/commit/54d23599]. The fix should be 
> to add withNewExecutionId to 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala#L610



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17939) Spark-SQL Nullability: Optimizations vs. Enforcement Clarification

2017-05-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-17939:
-
Target Version/s: 2.3.0  (was: 2.2.0)

> Spark-SQL Nullability: Optimizations vs. Enforcement Clarification
> --
>
> Key: SPARK-17939
> URL: https://issues.apache.org/jira/browse/SPARK-17939
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Aleksander Eskilson
>Priority: Critical
>
> The notion of Nullability of StructFields in DataFrames and Datasets 
> creates some confusion. As has been pointed out previously [1], Nullability 
> is a hint to the Catalyst optimizer, and is not meant to be a type-level 
> enforcement. Allowing null fields can also help the reader successfully parse 
> certain types of more loosely-typed data, like JSON and CSV, where null 
> values are common, rather than just failing. 
> There's already been some movement to clarify the meaning of Nullable in the 
> API, but also some requests for a (perhaps completely separate) type-level 
> implementation of Nullable that can act as an enforcement contract.
> This bug is logged here to discuss and clarify this issue.
> [1] - 
> [https://issues.apache.org/jira/browse/SPARK-11319|https://issues.apache.org/jira/browse/SPARK-11319?focusedCommentId=15014535=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15014535]
> [2] - https://github.com/apache/spark/pull/11785
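
For a concrete picture of the flag being discussed, here is a tiny Scala sketch of a schema with per-field nullability (field names are illustrative):

{code}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object NullabilitySchemaSketch {
  // nullable is declared per field, but as discussed above it currently acts
  // as a hint to the optimizer rather than an enforced constraint.
  val schema: StructType = StructType(Seq(
    StructField("id", IntegerType, nullable = false),
    StructField("name", StringType, nullable = true)
  ))
}
{code}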



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20584) Python generic hint support

2017-05-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20584:


Assignee: (was: Apache Spark)

> Python generic hint support
> ---
>
> Key: SPARK-20584
> URL: https://issues.apache.org/jira/browse/SPARK-20584
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20584) Python generic hint support

2017-05-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20584:


Assignee: Apache Spark

> Python generic hint support
> ---
>
> Key: SPARK-20584
> URL: https://issues.apache.org/jira/browse/SPARK-20584
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Reynold Xin
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20584) Python generic hint support

2017-05-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15995910#comment-15995910
 ] 

Apache Spark commented on SPARK-20584:
--

User 'zero323' has created a pull request for this issue:
https://github.com/apache/spark/pull/17850

> Python generic hint support
> ---
>
> Key: SPARK-20584
> URL: https://issues.apache.org/jira/browse/SPARK-20584
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10931) PySpark ML Models should contain Param values

2017-05-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15995862#comment-15995862
 ] 

Apache Spark commented on SPARK-10931:
--

User 'BryanCutler' has created a pull request for this issue:
https://github.com/apache/spark/pull/17849

> PySpark ML Models should contain Param values
> -
>
> Key: SPARK-10931
> URL: https://issues.apache.org/jira/browse/SPARK-10931
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>
> PySpark spark.ml Models are generally wrappers around Java objects and do not 
> even contain Param values.  This JIRA is for copying the Param values from 
> the Estimator to the model.
> This can likely be solved by modifying Estimator.fit to copy Param values, 
> but should also include proper unit tests.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20586) Add deterministic and distinctLike to ScalaUDF

2017-05-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20586:


Assignee: Apache Spark  (was: Xiao Li)

> Add deterministic and distinctLike to ScalaUDF
> --
>
> Key: SPARK-20586
> URL: https://issues.apache.org/jira/browse/SPARK-20586
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> https://hive.apache.org/javadocs/r2.0.1/api/org/apache/hadoop/hive/ql/udf/UDFType.html
> Like Hive UDFType, we should allow users to add the extra flags for ScalaUDF 
> too. {{stateful}}/{{impliesOrder}} are not applicable to ScalaUDF. Thus, we 
> only add the following two flags. 
> - deterministic: Certain optimizations should not be applied if the UDF is 
> not deterministic. A deterministic UDF returns the same result each time it 
> is invoked with a particular input; this determinism only needs to hold 
> within the context of a query.
> - distinctLike: A UDF is considered distinctLike if it can be evaluated on 
> just the distinct values of a column. Examples include min and max UDFs. 
> This information is used by the metadata-only optimizer.
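
To make the distinction concrete, here is a small Scala illustration of deterministic vs. non-deterministic UDF behavior (this is only an illustration, not the proposed ScalaUDF flag API):

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf
import scala.util.Random

object UdfDeterminismSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("udf-determinism").getOrCreate()
    import spark.implicits._

    // Same input always yields the same output: safe to de-duplicate or reorder.
    val doubled = udf((x: Int) => x * 2)

    // Output varies between invocations: such optimizations would change results.
    val jittered = udf((x: Int) => x + Random.nextInt(10))

    val df = Seq(1, 2, 3).toDF("x")
    df.select(doubled($"x").as("doubled"), jittered($"x").as("jittered")).show()
    spark.stop()
  }
}
{code}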



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20586) Add deterministic and distinctLike to ScalaUDF

2017-05-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20586:


Assignee: Xiao Li  (was: Apache Spark)

> Add deterministic and distinctLike to ScalaUDF
> --
>
> Key: SPARK-20586
> URL: https://issues.apache.org/jira/browse/SPARK-20586
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> https://hive.apache.org/javadocs/r2.0.1/api/org/apache/hadoop/hive/ql/udf/UDFType.html
> Like Hive UDFType, we should allow users to add the extra flags for ScalaUDF 
> too. {{stateful}}/{{impliesOrder}} are not applicable to ScalaUDF. Thus, we 
> only add the following two flags. 
> - deterministic: Certain optimizations should not be applied if the UDF is 
> not deterministic. A deterministic UDF returns the same result each time it 
> is invoked with a particular input; this determinism only needs to hold 
> within the context of a query.
> - distinctLike: A UDF is considered distinctLike if it can be evaluated on 
> just the distinct values of a column. Examples include min and max UDFs. 
> This information is used by the metadata-only optimizer.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20586) Add deterministic and distinctLike to ScalaUDF

2017-05-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15995811#comment-15995811
 ] 

Apache Spark commented on SPARK-20586:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/17848

> Add deterministic and distinctLike to ScalaUDF
> --
>
> Key: SPARK-20586
> URL: https://issues.apache.org/jira/browse/SPARK-20586
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> https://hive.apache.org/javadocs/r2.0.1/api/org/apache/hadoop/hive/ql/udf/UDFType.html
> Like Hive UDFType, we should allow users to add the extra flags for ScalaUDF 
> too. {{stateful}}/{{impliesOrder}} are not applicable to ScalaUDF. Thus, we 
> only add the following two flags. 
> - deterministic: Certain optimizations should not be applied if the UDF is 
> not deterministic. A deterministic UDF returns the same result each time it 
> is invoked with a particular input; this determinism only needs to hold 
> within the context of a query.
> - distinctLike: A UDF is considered distinctLike if it can be evaluated on 
> just the distinct values of a column. Examples include min and max UDFs. 
> This information is used by the metadata-only optimizer.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20590) Map default input data source formats to inlined classes

2017-05-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20590:


Assignee: (was: Apache Spark)

> Map default input data source formats to inlined classes
> 
>
> Key: SPARK-20590
> URL: https://issues.apache.org/jira/browse/SPARK-20590
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Sameer Agarwal
>
> One of the common usability problems around reading data in spark 
> (particularly CSV) is that there can often be a conflict between different 
> readers in the classpath.
> As an example, if someone launches a 2.x spark shell with the spark-csv 
> package in the classpath, Spark currently fails in an extremely unfriendly way
> {code}
> ./bin/spark-shell --packages com.databricks:spark-csv_2.11:1.5.0
> scala> val df = spark.read.csv("/foo/bar.csv")
> java.lang.RuntimeException: Multiple sources found for csv 
> (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat, 
> com.databricks.spark.csv.DefaultSource15), please specify the fully qualified 
> class name.
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:574)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:85)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:85)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:295)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
>   at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:533)
>   at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:412)
>   ... 48 elided
> {code}
> This JIRA proposes a simple way of fixing this error by always mapping 
> default input data source formats to inlined classes (that exist in Spark).
> {code}
> ./bin/spark-shell --packages com.databricks:spark-csv_2.11:1.5.0
> scala> val df = spark.read.csv("/foo/bar.csv")
> df: org.apache.spark.sql.DataFrame = [_c0: string]
> {code}
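
A minimal Scala sketch of the mapping idea (illustrative only; the actual change lives in {{DataSource.lookupDataSource}} and may differ):

{code}
object BuiltinSourceMappingSketch {
  // Short names that ship with Spark resolve to the built-in classes, so an
  // external package registering the same short name no longer causes the
  // "Multiple sources found" error for the default formats.
  private val builtinSources = Map(
    "csv"     -> "org.apache.spark.sql.execution.datasources.csv.CSVFileFormat",
    "json"    -> "org.apache.spark.sql.execution.datasources.json.JsonFileFormat",
    "parquet" -> "org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat"
  )

  def resolve(provider: String): String =
    builtinSources.getOrElse(provider.toLowerCase, provider)
}
{code}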



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20590) Map default input data source formats to inlined classes

2017-05-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20590:


Assignee: Apache Spark

> Map default input data source formats to inlined classes
> 
>
> Key: SPARK-20590
> URL: https://issues.apache.org/jira/browse/SPARK-20590
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Sameer Agarwal
>Assignee: Apache Spark
>
> One of the common usability problems around reading data in spark 
> (particularly CSV) is that there can often be a conflict between different 
> readers in the classpath.
> As an example, if someone launches a 2.x spark shell with the spark-csv 
> package in the classpath, Spark currently fails in an extremely unfriendly way
> {code}
> ./bin/spark-shell --packages com.databricks:spark-csv_2.11:1.5.0
> scala> val df = spark.read.csv("/foo/bar.csv")
> java.lang.RuntimeException: Multiple sources found for csv 
> (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat, 
> com.databricks.spark.csv.DefaultSource15), please specify the fully qualified 
> class name.
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:574)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:85)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:85)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:295)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
>   at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:533)
>   at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:412)
>   ... 48 elided
> {code}
> This JIRA proposes a simple way of fixing this error by always mapping 
> default input data source formats to inlined classes (that exist in Spark).
> {code}
> ./bin/spark-shell --packages com.databricks:spark-csv_2.11:1.5.0
> scala> val df = spark.read.csv("/foo/bar.csv")
> df: org.apache.spark.sql.DataFrame = [_c0: string]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20590) Map default input data source formats to inlined classes

2017-05-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15995799#comment-15995799
 ] 

Apache Spark commented on SPARK-20590:
--

User 'sameeragarwal' has created a pull request for this issue:
https://github.com/apache/spark/pull/17847

> Map default input data source formats to inlined classes
> 
>
> Key: SPARK-20590
> URL: https://issues.apache.org/jira/browse/SPARK-20590
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Sameer Agarwal
>
> One of the common usability problems around reading data in spark 
> (particularly CSV) is that there can often be a conflict between different 
> readers in the classpath.
> As an example, if someone launches a 2.x spark shell with the spark-csv 
> package in the classpath, Spark currently fails in an extremely unfriendly way
> {code}
> ./bin/spark-shell --packages com.databricks:spark-csv_2.11:1.5.0
> scala> val df = spark.read.csv("/foo/bar.csv")
> java.lang.RuntimeException: Multiple sources found for csv 
> (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat, 
> com.databricks.spark.csv.DefaultSource15), please specify the fully qualified 
> class name.
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:574)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:85)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:85)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:295)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
>   at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:533)
>   at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:412)
>   ... 48 elided
> {code}
> This JIRA proposes a simple way of fixing this error by always mapping 
> default input data source formats to inlined classes (that exist in Spark).
> {code}
> ./bin/spark-shell --packages com.databricks:spark-csv_2.11:1.5.0
> scala> val df = spark.read.csv("/foo/bar.csv")
> df: org.apache.spark.sql.DataFrame = [_c0: string]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20590) Map default input data source formats to inlined classes

2017-05-03 Thread Sameer Agarwal (JIRA)
Sameer Agarwal created SPARK-20590:
--

 Summary: Map default input data source formats to inlined classes
 Key: SPARK-20590
 URL: https://issues.apache.org/jira/browse/SPARK-20590
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Sameer Agarwal


One of the common usability problems around reading data in spark (particularly 
CSV) is that there can often be a conflict between different readers in the 
classpath.

As an example, if someone launches a 2.x spark shell with the spark-csv package 
in the classpath, Spark currently fails in an extremely unfriendly way

{code}
./bin/spark-shell --packages com.databricks:spark-csv_2.11:1.5.0
scala> val df = spark.read.csv("/foo/bar.csv")
java.lang.RuntimeException: Multiple sources found for csv 
(org.apache.spark.sql.execution.datasources.csv.CSVFileFormat, 
com.databricks.spark.csv.DefaultSource15), please specify the fully qualified 
class name.
  at scala.sys.package$.error(package.scala:27)
  at 
org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:574)
  at 
org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:85)
  at 
org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:85)
  at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:295)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
  at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:533)
  at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:412)
  ... 48 elided
{code}

This JIRA proposes a simple way of fixing this error by always mapping default 
input data source formats to inlined classes (that exist in Spark).

{code}
./bin/spark-shell --packages com.databricks:spark-csv_2.11:1.5.0
scala> val df = spark.read.csv("/foo/bar.csv")
df: org.apache.spark.sql.DataFrame = [_c0: string]
{code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20589) Allow limiting task concurrency per stage

2017-05-03 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15995650#comment-15995650
 ] 

Thomas Graves commented on SPARK-20589:
---

Thanks for the suggestions [~mridulm80].

Note, just for reference, Tez did a similar thing under 
https://issues.apache.org/jira/browse/TEZ-2914, which may be useful for the YARN 
side of things here.

> Allow limiting task concurrency per stage
> -
>
> Key: SPARK-20589
> URL: https://issues.apache.org/jira/browse/SPARK-20589
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Thomas Graves
>
> It would be nice to have the ability to limit the number of concurrent tasks 
> per stage.  This is useful when your spark job might be accessing another 
> service and you don't want to DOS that service.  For instance Spark writing 
> to hbase or Spark doing http puts on a service.  Many times you want to do 
> this without limiting the number of partitions. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20588) from_utc_timestamp causes bottleneck

2017-05-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-20588:
--
Issue Type: Improvement  (was: Bug)

> from_utc_timestamp causes bottleneck
> 
>
> Key: SPARK-20588
> URL: https://issues.apache.org/jira/browse/SPARK-20588
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.2
> Environment: AWS EMR AMI 5.2.1
>Reporter: Ameen Tayyebi
>
> We have a SQL query that makes use of the from_utc_timestamp function like 
> so: from_utc_timestamp(itemSigningTime,'America/Los_Angeles')
> This causes a major bottleneck. Our exact call is:
> date_add(from_utc_timestamp(itemSigningTime,'America/Los_Angeles'), 1)
> Switching from the above to date_add(itemSigningTime, 1) reduces the job 
> running time from 40 minutes to 9.
> When from_utc_timestamp function is used, several threads in the executors 
> are in the BLOCKED state, on this call stack:
> "Executor task launch worker-63" #261 daemon prio=5 os_prio=0 
> tid=0x7f848472e000 nid=0x4294 waiting for monitor entry 
> [0x7f501981c000]
>java.lang.Thread.State: BLOCKED (on object monitor)
> at java.util.TimeZone.getTimeZone(TimeZone.java:516)
> - waiting to lock <0x7f5216c2aa58> (a java.lang.Class for 
> java.util.TimeZone)
> at 
> org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTimestamp(DateTimeUtils.scala:356)
> at 
> org.apache.spark.sql.catalyst.util.DateTimeUtils.stringToTimestamp(DateTimeUtils.scala)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:161)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
> at org.apache.spark.scheduler.Task.run(Task.scala:86)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Can we cache the locales once per JVM so that we don't do this for every 
> record?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20589) Allow limiting task concurrency per stage

2017-05-03 Thread Mridul Muralidharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15995550#comment-15995550
 ] 

Mridul Muralidharan commented on SPARK-20589:
-

coalesce with shuffle=false might be a workaround if the source is already 
persisted?
(I see the benefit of the JIRA, just wondering if this will unblock you!)
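
A rough Scala sketch of that workaround; the external call is a hypothetical placeholder:

{code}
import org.apache.spark.sql.SparkSession

object LimitedConcurrencySketch {
  // Hypothetical stand-in for an HTTP PUT / HBase write per record.
  def callExternalService(id: Long): Unit = ()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("limited-concurrency").getOrCreate()

    // Persist and materialize first so upstream stages keep their full
    // parallelism; only the stage that talks to the external service is narrowed.
    val df = spark.range(0, 1000000).toDF("id").persist()
    df.count()

    val maxConcurrentTasks = 8
    df.coalesce(maxConcurrentTasks).rdd.foreachPartition { rows =>
      rows.foreach(row => callExternalService(row.getLong(0)))
    }
    spark.stop()
  }
}
{code}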

> Allow limiting task concurrency per stage
> -
>
> Key: SPARK-20589
> URL: https://issues.apache.org/jira/browse/SPARK-20589
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Thomas Graves
>
> It would be nice to have the ability to limit the number of concurrent tasks 
> per stage.  This is useful when your spark job might be accessing another 
> service and you don't want to DOS that service.  For instance Spark writing 
> to hbase or Spark doing http puts on a service.  Many times you want to do 
> this without limiting the number of partitions. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20589) Allow limiting task concurrency per stage

2017-05-03 Thread Thomas Graves (JIRA)
Thomas Graves created SPARK-20589:
-

 Summary: Allow limiting task concurrency per stage
 Key: SPARK-20589
 URL: https://issues.apache.org/jira/browse/SPARK-20589
 Project: Spark
  Issue Type: Improvement
  Components: Scheduler
Affects Versions: 2.1.0
Reporter: Thomas Graves


It would be nice to have the ability to limit the number of concurrent tasks 
per stage.  This is useful when your spark job might be accessing another 
service and you don't want to DOS that service.  For instance Spark writing to 
hbase or Spark doing http puts on a service.  Many times you want to do this 
without limiting the number of partitions. 





--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19243) Error when selecting from DataFrame containing parsed data from files larger than 1MB

2017-05-03 Thread Harish (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15995489#comment-15995489
 ] 

Harish edited comment on SPARK-19243 at 5/3/17 7:52 PM:


I am getting the same error in Spark 2.1.0.
I have a 10-node cluster with 109 GB each.
My data set is just 30K rows with 60 columns. I see 72 partitions in total after 
loading the ORC file into a DataFrame, then repartitioned to 2001. No luck.

[~srowen] did anyone raise a similar issue?

Regards,
 Harish


was (Author: harishk15):
i am getting the same error in spark 2.1.0. 
I have 10 node cluster with 109GB each.
My data set is just 30K rows with 60 columns. I see total 72 partitions after 
loading the orc file to DF. then re-partitioned to 2001. No luck.

Regards,
 Harish

> Error when selecting from DataFrame containing parsed data from files larger 
> than 1MB
> -
>
> Key: SPARK-19243
> URL: https://issues.apache.org/jira/browse/SPARK-19243
> Project: Spark
>  Issue Type: Bug
>Reporter: Ben
>
> I hope I can describe the problem clearly. This error happens with Spark 
> 2.0.1. I tried with Spark 2.1.0 on my test PC and it worked there, with none 
> of the issues below, but I can't try it on the test cluster because Spark 
> needs to be upgraded there. I'm opening this ticket because, if it's a bug, 
> maybe something is still partially present in Spark 2.1.0.
> Initially I thought it was my script's problem, so I tried to debug until I 
> found why this is happening.
> Step by step, I load XML files through spark-xml into a DataFrame. In my 
> case, the rowTag is the root tag, so each XML file creates a row. The XML 
> structure is fairly complex and is converted to nested columns or arrays 
> inside the DF. I need to flatten the whole table, and the output is not 
> fixed: I dynamically select what I want as output, so when I need to output 
> columns that have been parsed as arrays, I explode them with explode() only 
> when needed.
> Normally I can select various columns that don't have many entries without a 
> problem. 
> I select a column that has a lot of entries into a new DF, e.g. simply through
> {noformat}
> df2 = df.select(...)
> {noformat}
> and then if I try to do a count() or first() or anything, Spark behaves in 
> one of two ways:
> 1. If the source file was smaller than 1MB, it works.
> 2. If the source file was larger than 1MB, the following error occurs:
> {noformat}
> Traceback (most recent call last):
>   File \"/myCode.py\", line 71, in main
> df.count()
>   File 
> \"/usr/hdp/current/spark-client/python/lib/pyspark.zip/pyspark/sql/dataframe.py\",
>  line 299, in count
> return int(self._jdf.count())
>   File 
> \"/usr/hdp/current/spark-client/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py\",
>  line 1133, in __call__
> answer, self.gateway_client, self.target_id, self.name)
>   File 
> \"/usr/hdp/current/spark-client/python/lib/pyspark.zip/pyspark/sql/utils.py\",
>  line 63, in deco
> return f(*a, **kw)
>   File 
> \"/usr/hdp/current/spark-client/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py\",
>  line 319, in get_return_value
> format(target_id, \".\", name), value)
> Py4JJavaError: An error occurred while calling o180.count.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0 
> (TID 6, compname): java.lang.IllegalArgumentException: Size exceeds 
> Integer.MAX_VALUE
>   at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:869)
>   at 
> org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:103)
>   at 
> org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:91)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1307)
>   at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:105)
>   at 
> org.apache.spark.storage.BlockManager.getLocalValues(BlockManager.scala:438)
>   at org.apache.spark.storage.BlockManager.get(BlockManager.scala:606)
>   at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:663)
>   at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:281)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at 

[jira] [Commented] (SPARK-20568) Delete files after processing in structured streaming

2017-05-03 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15995507#comment-15995507
 ] 

Shixiong Zhu commented on SPARK-20568:
--

[~srowen] Structured Streaming's Source has a "commit" method to let a source 
discard unused data. This should be easy to implement in Structured Streaming.
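For illustration, here is a minimal sketch of that idea, assuming a hypothetical 
file-based source that remembers which files were read for each batch offset. The 
class and method names below are invented; this is not the actual FileStreamSource 
code.

{code:scala}
import java.nio.file.{Files, Paths}
import scala.collection.mutable

// Hypothetical bookkeeping for a file-based source: remember which files were
// read for each batch offset, and delete them once that offset is committed
// (mirroring Source.commit(end: Offset) in Structured Streaming).
class FileCleanupTracker {
  private val filesByBatch = mutable.Map.empty[Long, Seq[String]]

  def recordBatch(offset: Long, files: Seq[String]): Unit = synchronized {
    filesByBatch(offset) = files
  }

  // Called when the engine signals that everything up to `offset` is processed.
  def commit(offset: Long): Unit = synchronized {
    val done = filesByBatch.keys.filter(_ <= offset).toSeq
    done.foreach { batch =>
      filesByBatch.remove(batch).getOrElse(Nil).foreach { f =>
        Files.deleteIfExists(Paths.get(f))
      }
    }
  }
}
{code}

A real file source would record the files it hands out in getBatch and rely on the 
engine calling commit once the corresponding output is durably committed.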

> Delete files after processing in structured streaming
> -
>
> Key: SPARK-20568
> URL: https://issues.apache.org/jira/browse/SPARK-20568
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Saul Shanabrook
>
> It would be great to be able to delete files after processing them with 
> structured streaming.
> For example, I am reading in a bunch of JSON files and converting them into 
> Parquet. If the JSON files are not deleted after they are processed, it 
> quickly fills up my hard drive. I originally [posted this on Stack 
> Overflow|http://stackoverflow.com/q/43671757/907060] and was recommended to 
> make a feature request for it. 
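For context, a minimal sketch of the kind of job described above (the paths and 
schema are placeholders, not taken from the report):

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

object JsonToParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("json-to-parquet").getOrCreate()

    // Streaming JSON sources need an explicit schema.
    val schema = new StructType()
      .add("id", LongType)
      .add("payload", StringType)

    val json = spark.readStream.schema(schema).json("/data/incoming")

    // The input files are never deleted by Spark; hence this feature request.
    val query = json.writeStream
      .format("parquet")
      .option("path", "/data/parquet")
      .option("checkpointLocation", "/data/checkpoints/json-to-parquet")
      .start()

    query.awaitTermination()
  }
}
{code}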






[jira] [Comment Edited] (SPARK-20588) from_utc_timestamp causes bottleneck

2017-05-03 Thread Ameen Tayyebi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15995499#comment-15995499
 ] 

Ameen Tayyebi edited comment on SPARK-20588 at 5/3/17 7:40 PM:
---

Hopefully more readable version of the call stack:

{code:java}
"Executor task launch worker-63" #261 daemon prio=5 os_prio=0 
tid=0x7f848472e000 nid=0x4294 waiting for monitor entry [0x7f501981c000]
   java.lang.Thread.State: BLOCKED (on object monitor)
at java.util.TimeZone.getTimeZone(TimeZone.java:516)
- waiting to lock <0x7f5216c2aa58> (a java.lang.Class for 
java.util.TimeZone)
at 
org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTimestamp(DateTimeUtils.scala:356)
at 
org.apache.spark.sql.catalyst.util.DateTimeUtils.stringToTimestamp(DateTimeUtils.scala)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:161)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}


was (Author: ameen.tayy...@gmail.com):
Hopefully more readable version of the call stack:

"Executor task launch worker-63" #261 daemon prio=5 os_prio=0 
tid=0x7f848472e000 nid=0x4294 waiting for monitor entry [0x7f501981c000]
   java.lang.Thread.State: BLOCKED (on object monitor)
at java.util.TimeZone.getTimeZone(TimeZone.java:516)
- waiting to lock <0x7f5216c2aa58> (a java.lang.Class for 
java.util.TimeZone)
at 
org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTimestamp(DateTimeUtils.scala:356)
at 
org.apache.spark.sql.catalyst.util.DateTimeUtils.stringToTimestamp(DateTimeUtils.scala)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:161)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

> from_utc_timestamp causes bottleneck
> 
>
> Key: SPARK-20588
> URL: https://issues.apache.org/jira/browse/SPARK-20588
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
> Environment: AWS EMR AMI 5.2.1
>Reporter: Ameen Tayyebi
>
> We have a SQL query that makes use of the from_utc_timestamp function like 
> so: from_utc_timestamp(itemSigningTime,'America/Los_Angeles')
> This causes a major bottleneck. Our exact call is:
> date_add(from_utc_timestamp(itemSigningTime,'America/Los_Angeles'), 1)
> Switching from the above to date_add(itemSigningTime, 1) reduces the job 
> running time from 40 minutes to 9.
> When from_utc_timestamp function is used, several threads in the executors 
> are in the BLOCKED state, on this call stack:
> "Executor task launch worker-63" #261 daemon prio=5 os_prio=0 
> tid=0x7f848472e000 nid=0x4294 waiting for monitor entry 
> [0x7f501981c000]
>

[jira] [Commented] (SPARK-20588) from_utc_timestamp causes bottleneck

2017-05-03 Thread Ameen Tayyebi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15995499#comment-15995499
 ] 

Ameen Tayyebi commented on SPARK-20588:
---

Hopefully more readable version of the call stack:

"Executor task launch worker-63" #261 daemon prio=5 os_prio=0 
tid=0x7f848472e000 nid=0x4294 waiting for monitor entry [0x7f501981c000]
   java.lang.Thread.State: BLOCKED (on object monitor)
at java.util.TimeZone.getTimeZone(TimeZone.java:516)
- waiting to lock <0x7f5216c2aa58> (a java.lang.Class for 
java.util.TimeZone)
at 
org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTimestamp(DateTimeUtils.scala:356)
at 
org.apache.spark.sql.catalyst.util.DateTimeUtils.stringToTimestamp(DateTimeUtils.scala)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:161)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

> from_utc_timestamp causes bottleneck
> 
>
> Key: SPARK-20588
> URL: https://issues.apache.org/jira/browse/SPARK-20588
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
> Environment: AWS EMR AMI 5.2.1
>Reporter: Ameen Tayyebi
>
> We have a SQL query that makes use of the from_utc_timestamp function like 
> so: from_utc_timestamp(itemSigningTime,'America/Los_Angeles')
> This causes a major bottleneck. Our exact call is:
> date_add(from_utc_timestamp(itemSigningTime,'America/Los_Angeles'), 1)
> Switching from the above to date_add(itemSigningTime, 1) reduces the job 
> running time from 40 minutes to 9.
> When from_utc_timestamp function is used, several threads in the executors 
> are in the BLOCKED state, on this call stack:
> "Executor task launch worker-63" #261 daemon prio=5 os_prio=0 
> tid=0x7f848472e000 nid=0x4294 waiting for monitor entry 
> [0x7f501981c000]
>java.lang.Thread.State: BLOCKED (on object monitor)
> at java.util.TimeZone.getTimeZone(TimeZone.java:516)
> - waiting to lock <0x7f5216c2aa58> (a java.lang.Class for 
> java.util.TimeZone)
> at 
> org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTimestamp(DateTimeUtils.scala:356)
> at 
> org.apache.spark.sql.catalyst.util.DateTimeUtils.stringToTimestamp(DateTimeUtils.scala)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:161)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
> at org.apache.spark.scheduler.Task.run(Task.scala:86)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Can we cache the locales once per JVM so that we don't do this for every 
> record?




[jira] [Created] (SPARK-20588) from_utc_timestamp causes bottleneck

2017-05-03 Thread Ameen Tayyebi (JIRA)
Ameen Tayyebi created SPARK-20588:
-

 Summary: from_utc_timestamp causes bottleneck
 Key: SPARK-20588
 URL: https://issues.apache.org/jira/browse/SPARK-20588
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.2
 Environment: AWS EMR AMI 5.2.1
Reporter: Ameen Tayyebi


We have a SQL query that makes use of the from_utc_timestamp function like so: 
from_utc_timestamp(itemSigningTime,'America/Los_Angeles')

This causes a major bottleneck. Our exact call is:
date_add(from_utc_timestamp(itemSigningTime,'America/Los_Angeles'), 1)

Switching from the above to date_add(itemSigningTime, 1) reduces the job 
running time from 40 minutes to 9.

When from_utc_timestamp function is used, several threads in the executors are 
in the BLOCKED state, on this call stack:

"Executor task launch worker-63" #261 daemon prio=5 os_prio=0 
tid=0x7f848472e000 nid=0x4294 waiting for monitor entry [0x7f501981c000]
   java.lang.Thread.State: BLOCKED (on object monitor)
at java.util.TimeZone.getTimeZone(TimeZone.java:516)
- waiting to lock <0x7f5216c2aa58> (a java.lang.Class for 
java.util.TimeZone)
at 
org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTimestamp(DateTimeUtils.scala:356)
at 
org.apache.spark.sql.catalyst.util.DateTimeUtils.stringToTimestamp(DateTimeUtils.scala)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:161)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
 at java.lang.Thread.run(Thread.java:745)

Can we cache the locales once per JVM so that we don't do this for every 
record?
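As a hedged sketch of that suggestion (not the actual Spark change), one could 
cache TimeZone instances in a ConcurrentHashMap so that only the first lookup per 
zone id hits the synchronized java.util.TimeZone.getTimeZone. TimeZoneCache is an 
invented name for illustration.

{code:scala}
import java.util.TimeZone
import java.util.concurrent.ConcurrentHashMap
import java.util.function.{Function => JFunction}

object TimeZoneCache {
  private val cache = new ConcurrentHashMap[String, TimeZone]()

  private val loader = new JFunction[String, TimeZone] {
    override def apply(id: String): TimeZone = TimeZone.getTimeZone(id)
  }

  // Only the first lookup for a given zone id pays the synchronized
  // TimeZone.getTimeZone cost; later lookups are plain concurrent map reads.
  def get(id: String): TimeZone = cache.computeIfAbsent(id, loader)
}

// e.g. TimeZoneCache.get("America/Los_Angeles") instead of calling
// TimeZone.getTimeZone("America/Los_Angeles") for every record.
{code}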






[jira] [Assigned] (SPARK-20562) Support Maintenance by having a threshold for unavailability

2017-05-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20562:


Assignee: (was: Apache Spark)

> Support Maintenance by having a threshold for unavailability
> 
>
> Key: SPARK-20562
> URL: https://issues.apache.org/jira/browse/SPARK-20562
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 2.1.0
>Reporter: Kamal Gurala
>
> Make Spark be aware of offers that have an unavailability period set because 
> of a scheduled Maintenance on the node.
> Have a configurable option that's a threshold which ensures that tasks are 
> not scheduled on offers that are within a threshold for maintenance






[jira] [Assigned] (SPARK-20562) Support Maintenance by having a threshold for unavailability

2017-05-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20562:


Assignee: Apache Spark

> Support Maintenance by having a threshold for unavailability
> 
>
> Key: SPARK-20562
> URL: https://issues.apache.org/jira/browse/SPARK-20562
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 2.1.0
>Reporter: Kamal Gurala
>Assignee: Apache Spark
>
> Make Spark be aware of offers that have an unavailability period set because 
> of a scheduled Maintenance on the node.
> Have a configurable option that's a threshold which ensures that tasks are 
> not scheduled on offers that are within a threshold for maintenance






[jira] [Commented] (SPARK-20562) Support Maintenance by having a threshold for unavailability

2017-05-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15995495#comment-15995495
 ] 

Apache Spark commented on SPARK-20562:
--

User 'gkc2104' has created a pull request for this issue:
https://github.com/apache/spark/pull/17846

> Support Maintenance by having a threshold for unavailability
> 
>
> Key: SPARK-20562
> URL: https://issues.apache.org/jira/browse/SPARK-20562
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 2.1.0
>Reporter: Kamal Gurala
>
> Make Spark be aware of offers that have an unavailability period set because 
> of a scheduled Maintenance on the node.
> Have a configurable option that's a threshold which ensures that tasks are 
> not scheduled on offers that are within a threshold for maintenance






[jira] [Commented] (SPARK-20562) Support Maintenance by having a threshold for unavailability

2017-05-03 Thread Kamal Gurala (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15995493#comment-15995493
 ] 

Kamal Gurala commented on SPARK-20562:
--

Spark is not aware of offers that are scheduled for maintenance, i.e. that have 
an unavailability period set, and it cannot estimate how long a scheduled task 
will hold an offer.
It is, however, easier for users to estimate how long they think a task will 
run. They can then set a configurable threshold x that makes the Spark 
scheduler avoid offers whose node goes under maintenance within x amount of 
time.
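A minimal sketch of that threshold check, using a hypothetical OfferInfo stand-in 
for the Mesos offer (a real scheduler would read the offer's unavailability field 
from the Mesos protobuf instead):

{code:scala}
// OfferInfo and MaintenanceFilter are invented names for illustration.
case class OfferInfo(id: String, unavailabilityStartMs: Option[Long])

object MaintenanceFilter {
  // Accept an offer only if no maintenance window starts within the threshold.
  def acceptable(offer: OfferInfo, thresholdMs: Long,
                 nowMs: Long = System.currentTimeMillis()): Boolean =
    offer.unavailabilityStartMs match {
      case Some(start) => start - nowMs > thresholdMs
      case None        => true
    }
}

// e.g. offers.filter(o => MaintenanceFilter.acceptable(o, thresholdMs = 10 * 60 * 1000))
{code}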
 

> Support Maintenance by having a threshold for unavailability
> 
>
> Key: SPARK-20562
> URL: https://issues.apache.org/jira/browse/SPARK-20562
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.1.0
>Reporter: Kamal Gurala
>
> Make Spark be aware of offers that have an unavailability period set because 
> of a scheduled Maintenance on the node.
> Have a configurable option that's a threshold which ensures that tasks are 
> not scheduled on offers that are within a threshold for maintenance






[jira] [Updated] (SPARK-20562) Support Maintenance by having a threshold for unavailability

2017-05-03 Thread Kamal Gurala (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kamal Gurala updated SPARK-20562:
-
Issue Type: Improvement  (was: Bug)

> Support Maintenance by having a threshold for unavailability
> 
>
> Key: SPARK-20562
> URL: https://issues.apache.org/jira/browse/SPARK-20562
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 2.1.0
>Reporter: Kamal Gurala
>
> Make Spark be aware of offers that have an unavailability period set because 
> of a scheduled Maintenance on the node.
> Have a configurable option that's a threshold which ensures that tasks are 
> not scheduled on offers that are within a threshold for maintenance






[jira] [Commented] (SPARK-19243) Error when selecting from DataFrame containing parsed data from files larger than 1MB

2017-05-03 Thread Harish (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15995489#comment-15995489
 ] 

Harish commented on SPARK-19243:


I am getting the same error in Spark 2.1.0.
I have a 10-node cluster with 109 GB each.
My data set is just 30K rows with 60 columns. I see 72 partitions in total 
after loading the ORC file into a DF, then re-partitioned to 2001. No luck.

Regards,
 Harish
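For reference, the exception in the stack trace below comes from DiskStore 
memory-mapping a single block, and FileChannel.map is limited to 
Integer.MAX_VALUE bytes, so any individual cached or shuffled block over 2 GB 
fails to map. A hedged sketch of inspecting and raising the partition count 
(paths and numbers are placeholders; as noted above, repartitioning alone did 
not help in this reporter's case):

{code:scala}
import org.apache.spark.sql.SparkSession

object PartitionCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("partition-check").getOrCreate()

    val df = spark.read.orc("/data/input.orc")
    // More partitions mean smaller individual blocks when caching or shuffling.
    println(s"input partitions = ${df.rdd.getNumPartitions}")

    val df2 = df.repartition(400)
    println(s"after repartition = ${df2.rdd.getNumPartitions}")
  }
}
{code}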

> Error when selecting from DataFrame containing parsed data from files larger 
> than 1MB
> -
>
> Key: SPARK-19243
> URL: https://issues.apache.org/jira/browse/SPARK-19243
> Project: Spark
>  Issue Type: Bug
>Reporter: Ben
>
> I hope I can describe the problem clearly. This error happens with Spark 
> 2.0.1. However, I tried with Spark 2.1.0 on my test PC and it worked there, 
> none of the issues below, but I can't try it on the test cluster because 
> Spark needs to be upgraded there. I'm opening this ticket because if it's a 
> bug, maybe something is still partially present in Spark 2.1.0.
> Initially I thought it was my script's problem, so I tried to debug until I 
> found why this is happening.
> Step by step: I load XML files through spark-xml into a DataFrame. In my 
> case, the rowTag is the root tag, so each XML file creates a row. The XML 
> structure is fairly complex and is converted to nested columns or arrays 
> inside the DF. Since I need to flatten the whole table, and since the output 
> is not fixed but dynamically selected, in case I need to output columns that 
> have been parsed as arrays, I explode them with explode() only when needed.
> Normally I can select various columns that don't have many entries without a 
> problem. 
> I select a column that has a lot of entries into a new DF, e.g. simply through
> {noformat}
> df2 = df.select(...)
> {noformat}
> and then if I try to do a count() or first() or anything, Spark behaves in 
> one of two ways:
> 1. If the source file was smaller than 1MB, it works.
> 2. If the source file was larger than 1MB, the following error occurs:
> {noformat}
> Traceback (most recent call last):
>   File \"/myCode.py\", line 71, in main
> df.count()
>   File 
> \"/usr/hdp/current/spark-client/python/lib/pyspark.zip/pyspark/sql/dataframe.py\",
>  line 299, in count
> return int(self._jdf.count())
>   File 
> \"/usr/hdp/current/spark-client/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py\",
>  line 1133, in __call__
> answer, self.gateway_client, self.target_id, self.name)
>   File 
> \"/usr/hdp/current/spark-client/python/lib/pyspark.zip/pyspark/sql/utils.py\",
>  line 63, in deco
> return f(*a, **kw)
>   File 
> \"/usr/hdp/current/spark-client/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py\",
>  line 319, in get_return_value
> format(target_id, \".\", name), value)
> Py4JJavaError: An error occurred while calling o180.count.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0 
> (TID 6, compname): java.lang.IllegalArgumentException: Size exceeds 
> Integer.MAX_VALUE
>   at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:869)
>   at 
> org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:103)
>   at 
> org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:91)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1307)
>   at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:105)
>   at 
> org.apache.spark.storage.BlockManager.getLocalValues(BlockManager.scala:438)
>   at org.apache.spark.storage.BlockManager.get(BlockManager.scala:606)
>   at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:663)
>   at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:281)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at 

[jira] [Assigned] (SPARK-20587) Improve performance of ML ALS recommendForAll

2017-05-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20587:


Assignee: Nick Pentreath  (was: Apache Spark)

> Improve performance of ML ALS recommendForAll
> -
>
> Key: SPARK-20587
> URL: https://issues.apache.org/jira/browse/SPARK-20587
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Nick Pentreath
>Assignee: Nick Pentreath
>
> SPARK-11968 relates to excessive GC pressure from using the "blocked BLAS 3" 
> approach for generating top-k recommendations in 
> {{mllib.recommendation.MatrixFactorizationModel}}.
> The solution there is still based on blocking factors, but efficiently 
> computes the top-k elements *per block* first (using 
> {{BoundedPriorityQueue}}) and then computes the global top-k elements.
> This improves performance and GC pressure substantially for {{mllib}}'s ALS 
> model. The same approach is also a lot more efficient than the current 
> "crossJoin and score per-row" used in {{ml}}'s {{DataFrame}}-based method. 
> This adapts the solution in SPARK-11968 for {{DataFrame}}.






[jira] [Assigned] (SPARK-20587) Improve performance of ML ALS recommendForAll

2017-05-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20587:


Assignee: Apache Spark  (was: Nick Pentreath)

> Improve performance of ML ALS recommendForAll
> -
>
> Key: SPARK-20587
> URL: https://issues.apache.org/jira/browse/SPARK-20587
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Nick Pentreath
>Assignee: Apache Spark
>
> SPARK-11968 relates to excessive GC pressure from using the "blocked BLAS 3" 
> approach for generating top-k recommendations in 
> {{mllib.recommendation.MatrixFactorizationModel}}.
> The solution there is still based on blocking factors, but efficiently 
> computes the top-k elements *per block* first (using 
> {{BoundedPriorityQueue}}) and then computes the global top-k elements.
> This improves performance and GC pressure substantially for {{mllib}}'s ALS 
> model. The same approach is also a lot more efficient than the current 
> "crossJoin and score per-row" used in {{ml}}'s {{DataFrame}}-based method. 
> This adapts the solution in SPARK-11968 for {{DataFrame}}.






[jira] [Commented] (SPARK-20587) Improve performance of ML ALS recommendForAll

2017-05-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15995480#comment-15995480
 ] 

Apache Spark commented on SPARK-20587:
--

User 'MLnick' has created a pull request for this issue:
https://github.com/apache/spark/pull/17845

> Improve performance of ML ALS recommendForAll
> -
>
> Key: SPARK-20587
> URL: https://issues.apache.org/jira/browse/SPARK-20587
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Nick Pentreath
>Assignee: Nick Pentreath
>
> SPARK-11968 relates to excessive GC pressure from using the "blocked BLAS 3" 
> approach for generating top-k recommendations in 
> {{mllib.recommendation.MatrixFactorizationModel}}.
> The solution there is still based on blocking factors, but efficiently 
> computes the top-k elements *per block* first (using 
> {{BoundedPriorityQueue}}) and then computes the global top-k elements.
> This improves performance and GC pressure substantially for {{mllib}}'s ALS 
> model. The same approach is also a lot more efficient than the current 
> "crossJoin and score per-row" used in {{ml}}'s {{DataFrame}}-based method. 
> This adapts the solution in SPARK-11968 for {{DataFrame}}.






[jira] [Created] (SPARK-20587) Improve performance of ML ALS recommendForAll

2017-05-03 Thread Nick Pentreath (JIRA)
Nick Pentreath created SPARK-20587:
--

 Summary: Improve performance of ML ALS recommendForAll
 Key: SPARK-20587
 URL: https://issues.apache.org/jira/browse/SPARK-20587
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 2.2.0
Reporter: Nick Pentreath
Assignee: Nick Pentreath


SPARK-11968 relates to excessive GC pressure from using the "blocked BLAS 3" 
approach for generating top-k recommendations in 
{{mllib.recommendation.MatrixFactorizationModel}}.

The solution there is still based on blocking factors, but efficiently computes 
the top-k elements *per block* first (using {{BoundedPriorityQueue}}) and then 
computes the global top-k elements.

This improves performance and GC pressure substantially for {{mllib}}'s ALS 
model. The same approach is also a lot more efficient than the current 
"crossJoin and score per-row" used in {{ml}}'s {{DataFrame}}-based method. This 
adapts the solution in SPARK-11968 for {{DataFrame}}.
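A hedged sketch of the per-block top-k idea in plain Scala, using a size-capped 
priority queue in place of Spark's internal BoundedPriorityQueue; Rating and the 
method names are invented for illustration, not the actual implementation:

{code:scala}
import scala.collection.mutable

case class Rating(user: Int, item: Int, score: Float)

object TopK {
  // Keep only the k highest-scoring ratings from one block: a min-heap of size <= k.
  def topKPerBlock(block: Iterator[Rating], k: Int): Iterator[Rating] = {
    val minHeap = mutable.PriorityQueue.empty[Rating](Ordering.by((r: Rating) => -r.score))
    block.foreach { r =>
      if (minHeap.size < k) minHeap.enqueue(r)
      else if (r.score > minHeap.head.score) { minHeap.dequeue(); minHeap.enqueue(r) }
    }
    minHeap.iterator
  }

  // The global top-k is just another bounded pass over the per-block winners.
  def globalTopK(perBlockResults: Iterator[Rating], k: Int): Seq[Rating] =
    topKPerBlock(perBlockResults, k).toSeq.sortBy(r => -r.score)
}
{code}

In the {{DataFrame}}-based method this per-block step would run before a final 
global merge, avoiding a full crossJoin-and-sort over all user/item pairs.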






[jira] [Updated] (SPARK-20569) RuntimeReplaceable functions accept invalid third parameter

2017-05-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-20569:
-
Affects Version/s: 2.2.0

> RuntimeReplaceable functions accept invalid third parameter
> ---
>
> Key: SPARK-20569
> URL: https://issues.apache.org/jira/browse/SPARK-20569
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0
>Reporter: liuxian
>Priority: Trivial
>
> >select Nvl(null,'1',3);
> >3
> The "Nvl" function has only two input parameters, so when three parameters 
> are given, I think it should report "Error in query: Invalid number of 
> arguments for function nvl".
> Functions such as "nvl2", "nullIf" and "IfNull" have a similar problem.






[jira] [Commented] (SPARK-20569) RuntimeReplaceable functions accept invalid third parameter

2017-05-03 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15995423#comment-15995423
 ] 

Michael Armbrust commented on SPARK-20569:
--

[~rxin] this does seem like a bug.

> RuntimeReplaceable functions accept invalid third parameter
> ---
>
> Key: SPARK-20569
> URL: https://issues.apache.org/jira/browse/SPARK-20569
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: liuxian
>Priority: Trivial
>
> >select Nvl(null,'1',3);
> >3
> The "Nvl" function has only two input parameters, so when three parameters 
> are given, I think it should report "Error in query: Invalid number of 
> arguments for function nvl".
> Functions such as "nvl2", "nullIf" and "IfNull" have a similar problem.






[jira] [Updated] (SPARK-20569) RuntimeReplaceable functions accept invalid third parameter

2017-05-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-20569:
-
Summary: RuntimeReplaceable functions accept invalid third parameter  (was: 
In spark-sql,Some functions can execute successfully, when the  number of input 
parameter is wrong)

> RuntimeReplaceable functions accept invalid third parameter
> ---
>
> Key: SPARK-20569
> URL: https://issues.apache.org/jira/browse/SPARK-20569
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: liuxian
>Priority: Trivial
>
> >select Nvl(null,'1',3);
> >3
> The "Nvl" function has only two input parameters, so when three parameters 
> are given, I think it should report "Error in query: Invalid number of 
> arguments for function nvl".
> Functions such as "nvl2", "nullIf" and "IfNull" have a similar problem.






[jira] [Updated] (SPARK-19104) CompileException with Map and Case Class in Spark 2.1.0

2017-05-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-19104:
-
Affects Version/s: 2.2.0
 Target Version/s: 2.2.0

>  CompileException with Map and Case Class in Spark 2.1.0
> 
>
> Key: SPARK-19104
> URL: https://issues.apache.org/jira/browse/SPARK-19104
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Nils Grabbert
>
> The following code will run with Spark 2.0.2 but not with Spark 2.1.0:
> {code}
> case class InnerData(name: String, value: Int)
> case class Data(id: Int, param: Map[String, InnerData])
> val data = Seq.tabulate(10)(i => Data(1, Map("key" -> InnerData("name", i + 
> 100))))
> val ds   = spark.createDataset(data)
> {code}
> Exception:
> {code}
> Caused by: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 63, Column 46: Expression 
> "ExternalMapToCatalyst_value_isNull1" is not an rvalue 
>   at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:11004) 
>   at 
> org.codehaus.janino.UnitCompiler.toRvalueOrCompileException(UnitCompiler.java:6639)
>  
>   at 
> org.codehaus.janino.UnitCompiler.getConstantValue2(UnitCompiler.java:5001) 
>   at org.codehaus.janino.UnitCompiler.access$10500(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$13.visitAmbiguousName(UnitCompiler.java:4984)
>  
>   at org.codehaus.janino.Java$AmbiguousName.accept(Java.java:3633) 
>   at org.codehaus.janino.Java$Lvalue.accept(Java.java:3563) 
>   at 
> org.codehaus.janino.UnitCompiler.getConstantValue(UnitCompiler.java:4956) 
>   at org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4925) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3189) 
>   at org.codehaus.janino.UnitCompiler.access$5100(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$9.visitAssignment(UnitCompiler.java:3143) 
>   at 
> org.codehaus.janino.UnitCompiler$9.visitAssignment(UnitCompiler.java:3139) 
>   at org.codehaus.janino.Java$Assignment.accept(Java.java:3847) 
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112) 
>   at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377)
>  
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370)
>  
>   at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558) 
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370) 
>   at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450) 
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811) 
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262)
>  
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234)
>  
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894) 
>   at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377)
>  
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369)
>  
>   at org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128) 
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) 
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209)
>  
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420) 
>   at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:374)
>  
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:369)
>  
>   at 
> org.codehaus.janino.Java$AbstractPackageMemberClassDeclaration.accept(Java.java:1309)
>  
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) 
>   at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:345) 
>   at 
> org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:396)
>  
>   at 
> org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:311)
>  
>   at org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:229) 
>   at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:196) 
>   

[jira] [Updated] (SPARK-19104) CompileException with Map and Case Class in Spark 2.1.0

2017-05-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-19104:
-
Description: 
The following code will run with Spark 2.0.2 but not with Spark 2.1.0:

{code}
case class InnerData(name: String, value: Int)
case class Data(id: Int, param: Map[String, InnerData])

val data = Seq.tabulate(10)(i => Data(1, Map("key" -> InnerData("name", i + 
100))))
val ds   = spark.createDataset(data)
{code}

Exception:
{code}
Caused by: org.codehaus.commons.compiler.CompileException: File 
'generated.java', Line 63, Column 46: Expression 
"ExternalMapToCatalyst_value_isNull1" is not an rvalue 
  at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:11004) 
  at 
org.codehaus.janino.UnitCompiler.toRvalueOrCompileException(UnitCompiler.java:6639)
 
  at org.codehaus.janino.UnitCompiler.getConstantValue2(UnitCompiler.java:5001) 
  at org.codehaus.janino.UnitCompiler.access$10500(UnitCompiler.java:206) 
  at 
org.codehaus.janino.UnitCompiler$13.visitAmbiguousName(UnitCompiler.java:4984) 
  at org.codehaus.janino.Java$AmbiguousName.accept(Java.java:3633) 
  at org.codehaus.janino.Java$Lvalue.accept(Java.java:3563) 
  at org.codehaus.janino.UnitCompiler.getConstantValue(UnitCompiler.java:4956) 
  at org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4925) 
  at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3189) 
  at org.codehaus.janino.UnitCompiler.access$5100(UnitCompiler.java:206) 
  at org.codehaus.janino.UnitCompiler$9.visitAssignment(UnitCompiler.java:3143) 
  at org.codehaus.janino.UnitCompiler$9.visitAssignment(UnitCompiler.java:3139) 
  at org.codehaus.janino.Java$Assignment.accept(Java.java:3847) 
  at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139) 
  at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112) 
  at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206) 
  at 
org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377)
 
  at 
org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370)
 
  at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558) 
  at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370) 
  at org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450) 
  at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811) 
  at 
org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262) 
  at 
org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234) 
  at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538) 
  at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890) 
  at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894) 
  at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206) 
  at 
org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377)
 
  at 
org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369)
 
  at org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128) 
  at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) 
  at 
org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209)
 
  at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564) 
  at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420) 
  at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:206) 
  at 
org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:374)
 
  at 
org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:369)
 
  at 
org.codehaus.janino.Java$AbstractPackageMemberClassDeclaration.accept(Java.java:1309)
 
  at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) 
  at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:345) 
  at 
org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:396)
 
  at 
org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:311)
 
  at org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:229) 
  at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:196) 
  at org.codehaus.commons.compiler.Cookable.cook(Cookable.java:91) 
  at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:935)
 
  ... 77 more 
{code}



  was:
The following code will run with Spark 2.0.2 but not with Spark 2.1.0:

{code}
case class InnerData(name: String, value: Int)
case class Data(id: Int, param: Map[String, InnerData])

val data = Seq.tabulate(10)(i => Data(1, Map("key" -> InnerData("name", i + 
100))))
val ds   = spark.createDataset(data)
{code}

Exception:
Caused by: 

[jira] [Updated] (SPARK-20564) a lot of executor failures when the executor number is more than 2000

2017-05-03 Thread Hua Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hua Liu updated SPARK-20564:

Description: 
When we used more than 2000 executors in a Spark application, we noticed that a 
large number of executors could not connect to the driver and were therefore 
marked as failed. In some cases, the number of failed executors reached twice 
the requested executor count, so applications retried and could eventually 
fail.

This is because YarnAllocator requests all missing containers every 
spark.yarn.scheduler.heartbeat.interval-ms (default 3 seconds). For example, 
YarnAllocator can ask for and get over 2000 containers in one request and then 
launch them almost simultaneously. These thousands of executors try to retrieve 
Spark props and register with the driver within seconds. However, the driver 
handles executor registration, stop, removal and Spark props retrieval in one 
thread, and it cannot handle such a large number of RPCs within a short period 
of time. As a result, some executors cannot retrieve Spark props and/or 
register. These executors are then marked as failed, causing executor removal 
and aggravating the overloading of the driver, which leads to more executor 
failures.

This patch adds an extra configuration, 
spark.yarn.launchContainer.count.simultaneously, which caps the maximal number 
of containers that the driver can ask for in every 
spark.yarn.scheduler.heartbeat.interval-ms. As a result, the number of 
executors grows steadily, the number of executor failures is reduced, and 
applications can reach the desired number of executors faster.

  was:
When we used more than 2000 executors in a spark application, we noticed a 
large number of executors cannot connect to driver and as a result they were 
marked as failed. In some cases, the failed executor number reached twice of 
the requested executor count and thus applications retried and may eventually 
fail.

This is because that YarnAllocator requests all missing containers every 
spark.yarn.scheduler.heartbeat.interval-ms (default 3 seconds). For example, 
YarnAllocator can ask for and get over 2000 containers in one request, and then 
launch them almost simultaneously. These thousands of executors try to retrieve 
spark props and register with driver within seconds. However, driver handles 
executor registration, stop, removal and spark props retrieval in one thread, 
and it can not handle such a large number of RPCs within a short period of 
time. As a result, some executors cannot retrieve spark props and/or register. 
These failed executors are then marked as failed, causing executor removal and 
aggravating the overloading of driver, which leads to more executor failures. 

This patch adds an extra configuration 
spark.yarn.launchContainer.count.simultaneously, which caps the maximal 
containers driver can ask for and launch in every 
spark.yarn.scheduler.heartbeat.interval-ms. As a result, the number of 
executors grows steadily. The number of executor failures is reduced and 
applications can reach the desired number of executors faster.
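A minimal sketch of the capping idea described above, assuming the proposed 
configuration name; the function below is an invented illustration, not the 
actual YarnAllocator change:

{code:scala}
object ContainerRequestCap {
  // Number of containers to request in the current heartbeat: the number still
  // missing, bounded by spark.yarn.launchContainer.count.simultaneously.
  def containersToRequest(missing: Int, maxPerHeartbeat: Int): Int =
    math.max(0, math.min(missing, maxPerHeartbeat))
}

// Example: with 2000 executors missing and the cap set to 500, the allocator
// asks for 500 containers per spark.yarn.scheduler.heartbeat.interval-ms, so
// the cluster ramps up over four heartbeats instead of all at once.
{code}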


> a lot of executor failures when the executor number is more than 2000
> -
>
> Key: SPARK-20564
> URL: https://issues.apache.org/jira/browse/SPARK-20564
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 1.6.2, 2.1.0
>Reporter: Hua Liu
>
> When we used more than 2000 executors in a spark application, we noticed a 
> large number of executors cannot connect to driver and as a result they were 
> marked as failed. In some cases, the failed executor number reached twice of 
> the requested executor count and thus applications retried and may eventually 
> fail.
> This is because that YarnAllocator requests all missing containers every 
> spark.yarn.scheduler.heartbeat.interval-ms (default 3 seconds). For example, 
> YarnAllocator can ask for and get over 2000 containers in one request, and 
> then launch them almost simultaneously. These thousands of executors try to 
> retrieve spark props and register with driver within seconds. However, driver 
> handles executor registration, stop, removal and spark props retrieval in one 
> thread, and it can not handle such a large number of RPCs within a short 
> period of time. As a result, some executors cannot retrieve spark props 
> and/or register. These failed executors are then marked as failed, causing 
> executor removal and aggravating the overloading of driver, which leads to 
> more executor failures. 
> This patch adds an extra configuration 
> spark.yarn.launchContainer.count.simultaneously, which caps the maximal 
> number of containers that driver can ask for in every 
> spark.yarn.scheduler.heartbeat.interval-ms. As a result, the number of 
> executors grows 

[jira] [Updated] (SPARK-20564) a lot of executor failures when the executor number is more than 2000

2017-05-03 Thread Hua Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hua Liu updated SPARK-20564:

Description: 
When we used more than 2000 executors in a spark application, we noticed a 
large number of executors cannot connect to driver and as a result they were 
marked as failed. In some cases, the failed executor number reached twice of 
the requested executor count and thus applications retried and may eventually 
fail.

This is because that YarnAllocator requests all missing containers every 
spark.yarn.scheduler.heartbeat.interval-ms (default 3 seconds). For example, 
YarnAllocator can ask for and get over 2000 containers in one request, and then 
launch them almost simultaneously. These thousands of executors try to retrieve 
spark props and register with driver within seconds. However, driver handles 
executor registration, stop, removal and spark props retrieval in one thread, 
and it can not handle such a large number of RPCs within a short period of 
time. As a result, some executors cannot retrieve spark props and/or register. 
These failed executors are then marked as failed, causing executor removal and 
aggravating the overloading of driver, which leads to more executor failures. 

This patch adds an extra configuration 
spark.yarn.launchContainer.count.simultaneously, which caps the maximal 
containers driver can ask for and launch in every 
spark.yarn.scheduler.heartbeat.interval-ms. As a result, the number of 
executors grows steadily. The number of executor failures is reduced and 
applications can reach the desired number of executors faster.

  was:
When we used more than 2000 executors in a spark application, we noticed a 
large number of executors cannot connect to driver and as a result they were 
marked as failed. In some cases, the failed executor number reached twice of 
the requested executor count and thus applications retried and may eventually 
fail.

This is because that YarnAllocator requests all missing containers every 
spark.yarn.scheduler.heartbeat.interval-ms (default 3 seconds). For example, 
YarnAllocator can ask for and get over 2000 containers in one request, and then 
launch them almost simultaneously. These thousands of executors try to retrieve 
spark props and register with driver within seconds. However, driver handles 
executor registration, stop, removal and spark props retrieval in one thread, 
and it can not handle such a large number of RPCs within a short period of 
time. As a result, some executors cannot retrieve spark props and/or register. 
These failed executors are then marked as failed, cause executor removal and 
aggravate the overloading of driver, which causes more executor failures. 

This patch adds an extra configuration 
spark.yarn.launchContainer.count.simultaneously, which caps the maximal 
containers driver can ask for and launch in every 
spark.yarn.scheduler.heartbeat.interval-ms. As a result, the number of 
executors grows steadily. The number of executor failures is reduced and 
applications can reach the desired number of executors faster.


> a lot of executor failures when the executor number is more than 2000
> -
>
> Key: SPARK-20564
> URL: https://issues.apache.org/jira/browse/SPARK-20564
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 1.6.2, 2.1.0
>Reporter: Hua Liu
>
> When we used more than 2000 executors in a spark application, we noticed a 
> large number of executors cannot connect to driver and as a result they were 
> marked as failed. In some cases, the failed executor number reached twice of 
> the requested executor count and thus applications retried and may eventually 
> fail.
> This is because that YarnAllocator requests all missing containers every 
> spark.yarn.scheduler.heartbeat.interval-ms (default 3 seconds). For example, 
> YarnAllocator can ask for and get over 2000 containers in one request, and 
> then launch them almost simultaneously. These thousands of executors try to 
> retrieve spark props and register with driver within seconds. However, driver 
> handles executor registration, stop, removal and spark props retrieval in one 
> thread, and it can not handle such a large number of RPCs within a short 
> period of time. As a result, some executors cannot retrieve spark props 
> and/or register. These failed executors are then marked as failed, causing 
> executor removal and aggravating the overloading of driver, which leads to 
> more executor failures. 
> This patch adds an extra configuration 
> spark.yarn.launchContainer.count.simultaneously, which caps the maximal 
> containers driver can ask for and launch in every 
> spark.yarn.scheduler.heartbeat.interval-ms. As a result, the number of 
> executors grows steadily. The 

[jira] [Updated] (SPARK-20564) a lot of executor failures when the executor number is more than 2000

2017-05-03 Thread Hua Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hua Liu updated SPARK-20564:

Description: 
When we used more than 2000 executors in a spark application, we noticed a 
large number of executors cannot connect to driver and as a result they were 
marked as failed. In some cases, the failed executor number reached twice of 
the requested executor count and thus applications retried and may eventually 
fail.

This is because that YarnAllocator requests all missing containers every 
spark.yarn.scheduler.heartbeat.interval-ms (default 3 seconds). For example, 
YarnAllocator can ask for and get over 2000 containers in one request, and then 
launch them almost simultaneously. These thousands of executors try to retrieve 
spark props and register with driver within seconds. However, driver handles 
executor registration, stop, removal and spark props retrieval in one thread, 
and it can not handle such a large number of RPCs within a short period of 
time. As a result, some executors cannot retrieve spark props and/or register. 
These failed executors are then marked as failed, cause executor removal and 
aggravate the overloading of driver, which causes more executor failures. 

This patch adds an extra configuration 
spark.yarn.launchContainer.count.simultaneously, which caps the maximal 
containers driver can ask for and launch in every 
spark.yarn.scheduler.heartbeat.interval-ms. As a result, the number of 
executors grows steadily. The number of executor failures is reduced and 
applications can reach the desired number of executors faster.

  was:
When we used more than 2000 executors in a spark application, we noticed a 
large number of executors cannot connect to driver and as a result they were 
marked as failed. In some cases, the failed executor number reached twice of 
the requested executor count and thus applications retried and may eventually 
fail.

This is because that YarnAllocator requests all missing containers every 
spark.yarn.scheduler.heartbeat.interval-ms (default 3 seconds). For example, 
YarnAllocator can ask for and get over 2000 containers in one request, and then 
launch them almost simultaneously. These thousands of executors try to retrieve 
spark props and register with driver. However, driver handles executor 
registration, stop, removal and spark props retrieval in one thread, and it can 
not handle such a large number of RPCs within a short period of time. As a 
result, some executors cannot retrieve spark props and/or register. These 
failed executors are then marked as failed, cause executor removal and 
aggravate the overloading of driver, which causes more executor failures. 

This patch adds an extra configuration 
spark.yarn.launchContainer.count.simultaneously, which caps the maximal 
containers driver can ask for and launch in every 
spark.yarn.scheduler.heartbeat.interval-ms. As a result, the number of 
executors grows steadily. The number of executor failures is reduced and 
applications can reach the desired number of executors faster.


> a lot of executor failures when the executor number is more than 2000
> -
>
> Key: SPARK-20564
> URL: https://issues.apache.org/jira/browse/SPARK-20564
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 1.6.2, 2.1.0
>Reporter: Hua Liu
>
> When we used more than 2000 executors in a Spark application, we noticed that a 
> large number of executors could not connect to the driver and were consequently 
> marked as failed. In some cases, the number of failed executors reached twice 
> the requested executor count, so applications retried and could eventually fail.
> This happens because YarnAllocator requests all missing containers every 
> spark.yarn.scheduler.heartbeat.interval-ms (default 3 seconds). For example, 
> YarnAllocator can ask for and receive over 2000 containers in one request, and 
> then launch them almost simultaneously. These thousands of executors try to 
> retrieve Spark properties and register with the driver within seconds. However, 
> the driver handles executor registration, stop, removal and Spark property 
> retrieval in a single thread, and it cannot process such a large number of RPCs 
> within a short period of time. As a result, some executors cannot retrieve 
> Spark properties and/or register. These executors are then marked as failed, 
> which triggers executor removal, further overloads the driver, and causes more 
> executor failures. 
> This patch adds a new configuration 
> spark.yarn.launchContainer.count.simultaneously, which caps the number of 
> containers the driver can request and launch in each 
> spark.yarn.scheduler.heartbeat.interval-ms. As a result, the number of 
> executors grows steadily, executor failures are reduced, and applications can 
> reach the desired number of executors faster.

[jira] [Updated] (SPARK-20564) a lot of executor failures when the executor number is more than 2000

2017-05-03 Thread Hua Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hua Liu updated SPARK-20564:

Description: 
When we used more than 2000 executors in a spark application, we noticed a 
large number of executors cannot connect to driver and as a result they were 
marked as failed. In some cases, the failed executor number reached twice of 
the requested executor count and thus applications retried and may eventually 
fail.

This is because that YarnAllocator requests all missing containers every 
spark.yarn.scheduler.heartbeat.interval-ms (default 3 seconds). For example, 
YarnAllocator can ask for and get over 2000 containers in one request, and then 
launch them almost simultaneously. These thousands of executors try to retrieve 
spark props and register with driver. However, driver handles executor 
registration, stop, removal and spark props retrieval in one thread, and it can 
not handle such a large number of RPCs within a short period of time. As a 
result, some executors cannot retrieve spark props and/or register. These 
failed executors are then marked as failed, cause executor removal and 
aggravate the overloading of driver, which causes more executor failures. 

This patch adds an extra configuration 
spark.yarn.launchContainer.count.simultaneously, which caps the maximal 
containers driver can ask for and launch in every 
spark.yarn.scheduler.heartbeat.interval-ms. As a result, the number of 
executors grows steadily. The number of executor failures is reduced and 
applications can reach the desired number of executors faster.

  was:
When we used more than 2000 executors in a spark application, we noticed a 
large number of executors cannot connect to driver and as a result they were 
marked as failed. In some cases, the failed executor number reached twice of 
the requested executor count and thus applications retried and may eventually 
fail.

This is because that YarnAllocator requests all missing containers every 
spark.yarn.scheduler.heartbeat.interval-ms (default 3 seconds). For example, 
YarnAllocator can ask for and get over 2000 containers in one request, and then 
launch them. These thousands of executors try to retrieve spark props and 
register with driver. However, driver handles executor registration, stop, 
removal and spark props retrieval in one thread, and it can not handle such a 
large number of RPCs within a short period of time. As a result, some executors 
cannot retrieve spark props and/or register. These failed executors are then 
marked as failed, cause executor removal and aggravate the overloading of 
driver, which causes more executor failures. 

This patch adds an extra configuration 
spark.yarn.launchContainer.count.simultaneously, which caps the maximal 
containers driver can ask for and launch in every 
spark.yarn.scheduler.heartbeat.interval-ms. As a result, the number of 
executors grows steadily. The number of executor failures is reduced and 
applications can reach the desired number of executors faster.


> a lot of executor failures when the executor number is more than 2000
> -
>
> Key: SPARK-20564
> URL: https://issues.apache.org/jira/browse/SPARK-20564
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 1.6.2, 2.1.0
>Reporter: Hua Liu
>
> When we used more than 2000 executors in a spark application, we noticed a 
> large number of executors cannot connect to driver and as a result they were 
> marked as failed. In some cases, the failed executor number reached twice of 
> the requested executor count and thus applications retried and may eventually 
> fail.
> This is because that YarnAllocator requests all missing containers every 
> spark.yarn.scheduler.heartbeat.interval-ms (default 3 seconds). For example, 
> YarnAllocator can ask for and get over 2000 containers in one request, and 
> then launch them almost simultaneously. These thousands of executors try to 
> retrieve spark props and register with driver. However, driver handles 
> executor registration, stop, removal and spark props retrieval in one thread, 
> and it can not handle such a large number of RPCs within a short period of 
> time. As a result, some executors cannot retrieve spark props and/or 
> register. These failed executors are then marked as failed, cause executor 
> removal and aggravate the overloading of driver, which causes more executor 
> failures. 
> This patch adds an extra configuration 
> spark.yarn.launchContainer.count.simultaneously, which caps the maximal 
> containers driver can ask for and launch in every 
> spark.yarn.scheduler.heartbeat.interval-ms. As a result, the number of 
> executors grows steadily. The number of executor failures is reduced and 
> applications can reach the desired number of executors faster.

[jira] [Updated] (SPARK-20564) a lot of executor failures when the executor number is more than 2000

2017-05-03 Thread Hua Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hua Liu updated SPARK-20564:

Description: 
When we used more than 2000 executors in a spark application, we noticed a 
large number of executors cannot connect to driver and as a result they were 
marked as failed. In some cases, the failed executor number reached twice of 
the requested executor count and thus applications retried and may eventually 
fail.

This is because that YarnAllocator requests all missing containers every 
spark.yarn.scheduler.heartbeat.interval-ms (default 3 seconds). For example, 
YarnAllocator can ask for and get over 2000 containers in one request, and then 
launch them. These thousands of executors try to retrieve spark props and 
register with driver. However, driver handles executor registration, stop, 
removal and spark props retrieval in one thread, and it can not handle such a 
large number of RPCs within a short period of time. As a result, some executors 
cannot retrieve spark props and/or register. These failed executors are then 
marked as failed, cause executor removal and aggravate the overloading of 
driver, which causes more executor failures. 

This patch adds an extra configuration 
spark.yarn.launchContainer.count.simultaneously, which caps the maximal 
containers driver can ask for and launch in every 
spark.yarn.scheduler.heartbeat.interval-ms. As a result, the number of 
executors grows steadily. The number of executor failures is reduced and 
applications can reach the desired number of executors faster.

  was:
When we used more than 2000 executors in a spark application, we noticed a 
large number of executors cannot connect to driver and as a result they were 
marked as failed. In some cases, the failed executor number reached twice of 
the requested executor count and thus applications retried and may eventually 
fail.

This is because that YarnAllocator requests all missing containers every 
spark.yarn.scheduler.heartbeat.interval-ms (default 3 seconds). For example, 
YarnAllocator can ask for and get 2000 containers in one request, and then 
launch them. These thousands of executors try to retrieve spark props and 
register with driver. However, driver handles executor registration, stop, 
removal and spark props retrieval in one thread, and it can not handle such a 
large number of RPCs within a short period of time. As a result, some executors 
cannot retrieve spark props and/or register. These failed executors are then 
marked as failed, cause executor removal and aggravate the overloading of 
driver, which causes more executor failures. 

This patch adds an extra configuration 
spark.yarn.launchContainer.count.simultaneously, which caps the maximal 
containers driver can ask for and launch in every 
spark.yarn.scheduler.heartbeat.interval-ms. As a result, the number of 
executors grows steadily. The number of executor failures is reduced and 
applications can reach the desired number of executors faster.


> a lot of executor failures when the executor number is more than 2000
> -
>
> Key: SPARK-20564
> URL: https://issues.apache.org/jira/browse/SPARK-20564
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 1.6.2, 2.1.0
>Reporter: Hua Liu
>
> When we used more than 2000 executors in a spark application, we noticed a 
> large number of executors cannot connect to driver and as a result they were 
> marked as failed. In some cases, the failed executor number reached twice of 
> the requested executor count and thus applications retried and may eventually 
> fail.
> This is because that YarnAllocator requests all missing containers every 
> spark.yarn.scheduler.heartbeat.interval-ms (default 3 seconds). For example, 
> YarnAllocator can ask for and get over 2000 containers in one request, and 
> then launch them. These thousands of executors try to retrieve spark props 
> and register with driver. However, driver handles executor registration, 
> stop, removal and spark props retrieval in one thread, and it can not handle 
> such a large number of RPCs within a short period of time. As a result, some 
> executors cannot retrieve spark props and/or register. These failed executors 
> are then marked as failed, cause executor removal and aggravate the 
> overloading of driver, which causes more executor failures. 
> This patch adds an extra configuration 
> spark.yarn.launchContainer.count.simultaneously, which caps the maximal 
> containers driver can ask for and launch in every 
> spark.yarn.scheduler.heartbeat.interval-ms. As a result, the number of 
> executors grows steadily. The number of executor failures is reduced and 
> applications can reach the desired number of executors faster.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Updated] (SPARK-20564) a lot of executor failures when the executor number is more than 2000

2017-05-03 Thread Hua Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hua Liu updated SPARK-20564:

Description: 
When we used more than 2000 executors in a spark application, we noticed a 
large number of executors cannot connect to driver and as a result they were 
marked as failed. In some cases, the failed executor number reached twice of 
the requested executor count and thus applications retried and may eventually 
fail.

This is because that YarnAllocator requests all missing containers every 
spark.yarn.scheduler.heartbeat.interval-ms (default 3 seconds). For example, 
YarnAllocator can ask for and get 2000 containers in one request, and then 
launch them. These thousands of executors try to retrieve spark props and 
register with driver. However, driver handles executor registration, stop, 
removal and spark props retrieval in one thread, and it can not handle such a 
large number of RPCs within a short period of time. As a result, some executors 
cannot retrieve spark props and/or register. These failed executors are then 
marked as failed, cause executor removal and aggravate the overloading of 
driver, which causes more executor failures. 

This patch adds an extra configuration 
spark.yarn.launchContainer.count.simultaneously, which caps the maximal 
containers driver can ask for and launch in every 
spark.yarn.scheduler.heartbeat.interval-ms. As a result, the number of 
executors grows steadily. The number of executor failures is reduced and 
applications can reach the desired number of executors faster.

  was:
When we used more than 2000 executors in a spark application, we noticed a 
large number of executors cannot connect to driver and as a result they are 
marked as failed. In some cases, the failed executor number reached twice of 
the requested executor count and thus applications retried and may eventually 
fail.

This is because that YarnAllocator requests all missing containers every 
spark.yarn.scheduler.heartbeat.interval-ms (default 3 seconds). For example, 
YarnAllocator can ask for and get 2000 containers in one request, and then 
launch them. These thousands of executors try to retrieve spark props and 
register with driver. However, driver handles executor registration, stop, 
removal and spark props retrieval in one thread, and it can not handle such a 
large number of RPCs within a short period of time. As a result, some executors 
cannot retrieve spark props and/or register. These failed executors are then 
marked as failed, cause executor removal and aggravate the overloading of 
driver, which causes more executor failures. 

This patch adds an extra configuration 
spark.yarn.launchContainer.count.simultaneously, which caps the maximal 
containers driver can ask for and launch in every 
spark.yarn.scheduler.heartbeat.interval-ms. As a result, the number of 
executors grows steadily. The number of executor failures is reduced and 
applications can reach the desired number of executors faster.


> a lot of executor failures when the executor number is more than 2000
> -
>
> Key: SPARK-20564
> URL: https://issues.apache.org/jira/browse/SPARK-20564
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 1.6.2, 2.1.0
>Reporter: Hua Liu
>
> When we used more than 2000 executors in a spark application, we noticed a 
> large number of executors cannot connect to driver and as a result they were 
> marked as failed. In some cases, the failed executor number reached twice of 
> the requested executor count and thus applications retried and may eventually 
> fail.
> This is because that YarnAllocator requests all missing containers every 
> spark.yarn.scheduler.heartbeat.interval-ms (default 3 seconds). For example, 
> YarnAllocator can ask for and get 2000 containers in one request, and then 
> launch them. These thousands of executors try to retrieve spark props and 
> register with driver. However, driver handles executor registration, stop, 
> removal and spark props retrieval in one thread, and it can not handle such a 
> large number of RPCs within a short period of time. As a result, some 
> executors cannot retrieve spark props and/or register. These failed executors 
> are then marked as failed, cause executor removal and aggravate the 
> overloading of driver, which causes more executor failures. 
> This patch adds an extra configuration 
> spark.yarn.launchContainer.count.simultaneously, which caps the maximal 
> containers driver can ask for and launch in every 
> spark.yarn.scheduler.heartbeat.interval-ms. As a result, the number of 
> executors grows steadily. The number of executor failures is reduced and 
> applications can reach the desired number of executors faster.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Resolved] (SPARK-19965) DataFrame batch reader may fail to infer partitions when reading FileStreamSink's output

2017-05-03 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-19965.
--
   Resolution: Fixed
 Assignee: Liwei Lin
Fix Version/s: 2.2.0

> DataFrame batch reader may fail to infer partitions when reading 
> FileStreamSink's output
> 
>
> Key: SPARK-19965
> URL: https://issues.apache.org/jira/browse/SPARK-19965
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Shixiong Zhu
>Assignee: Liwei Lin
> Fix For: 2.2.0
>
>
> Reproducer
> {code}
>   test("partitioned writing and batch reading with 'basePath'") {
> val inputData = MemoryStream[Int]
> val ds = inputData.toDS()
> val outputDir = Utils.createTempDir(namePrefix = 
> "stream.output").getCanonicalPath
> val checkpointDir = Utils.createTempDir(namePrefix = 
> "stream.checkpoint").getCanonicalPath
> var query: StreamingQuery = null
> try {
>   query =
> ds.map(i => (i, i * 1000))
>   .toDF("id", "value")
>   .writeStream
>   .partitionBy("id")
>   .option("checkpointLocation", checkpointDir)
>   .format("parquet")
>   .start(outputDir)
>   inputData.addData(1, 2, 3)
>   failAfter(streamingTimeout) {
> query.processAllAvailable()
>   }
>   spark.read.option("basePath", outputDir).parquet(outputDir + 
> "/*").show()
> } finally {
>   if (query != null) {
> query.stop()
>   }
> }
>   }
> {code}
> Stack trace
> {code}
> [info] - partitioned writing and batch reading with 'basePath' *** FAILED *** 
> (3 seconds, 928 milliseconds)
> [info]   java.lang.AssertionError: assertion failed: Conflicting directory 
> structures detected. Suspicious paths:
> [info]***/stream.output-65e3fa45-595a-4d29-b3df-4c001e321637
> [info]
> ***/stream.output-65e3fa45-595a-4d29-b3df-4c001e321637/_spark_metadata
> [info] 
> [info] If provided paths are partition directories, please set "basePath" in 
> the options of the data source to specify the root directory of the table. If 
> there are multiple root directories, please load them separately and then 
> union them.
> [info]   at scala.Predef$.assert(Predef.scala:170)
> [info]   at 
> org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:133)
> [info]   at 
> org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:98)
> [info]   at 
> org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.inferPartitioning(PartitioningAwareFileIndex.scala:156)
> [info]   at 
> org.apache.spark.sql.execution.datasources.InMemoryFileIndex.partitionSpec(InMemoryFileIndex.scala:54)
> [info]   at 
> org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.partitionSchema(PartitioningAwareFileIndex.scala:55)
> [info]   at 
> org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:133)
> [info]   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:361)
> [info]   at 
> org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:160)
> [info]   at 
> org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:536)
> [info]   at 
> org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:520)
> [info]   at 
> org.apache.spark.sql.streaming.FileStreamSinkSuite$$anonfun$8.apply$mcV$sp(FileStreamSinkSuite.scala:292)
> [info]   at 
> org.apache.spark.sql.streaming.FileStreamSinkSuite$$anonfun$8.apply(FileStreamSinkSuite.scala:268)
> [info]   at 
> org.apache.spark.sql.streaming.FileStreamSinkSuite$$anonfun$8.apply(FileStreamSinkSuite.scala:268)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20548) Flaky Test: ReplSuite.newProductSeqEncoder with REPL defined class

2017-05-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20548:


Assignee: Sameer Agarwal  (was: Apache Spark)

> Flaky Test:  ReplSuite.newProductSeqEncoder with REPL defined class
> ---
>
> Key: SPARK-20548
> URL: https://issues.apache.org/jira/browse/SPARK-20548
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Sameer Agarwal
>Assignee: Sameer Agarwal
> Fix For: 2.2.0
>
>
> {{newProductSeqEncoder with REPL defined class}} in {{ReplSuite}} has been 
> failing non-deterministically: https://spark-tests.appspot.com/failed-tests 
> over the last few days.
> https://spark.test.databricks.com/job/spark-master-test-sbt-hadoop-2.7/176/testReport/junit/org.apache.spark.repl/ReplSuite/newProductSeqEncoder_with_REPL_defined_class/history/



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20548) Flaky Test: ReplSuite.newProductSeqEncoder with REPL defined class

2017-05-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15995360#comment-15995360
 ] 

Apache Spark commented on SPARK-20548:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/17844

> Flaky Test:  ReplSuite.newProductSeqEncoder with REPL defined class
> ---
>
> Key: SPARK-20548
> URL: https://issues.apache.org/jira/browse/SPARK-20548
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Sameer Agarwal
>Assignee: Sameer Agarwal
> Fix For: 2.2.0
>
>
> {{newProductSeqEncoder with REPL defined class}} in {{ReplSuite}} has been 
> failing non-deterministically: https://spark-tests.appspot.com/failed-tests 
> over the last few days.
> https://spark.test.databricks.com/job/spark-master-test-sbt-hadoop-2.7/176/testReport/junit/org.apache.spark.repl/ReplSuite/newProductSeqEncoder_with_REPL_defined_class/history/



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20548) Flaky Test: ReplSuite.newProductSeqEncoder with REPL defined class

2017-05-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20548:


Assignee: Apache Spark  (was: Sameer Agarwal)

> Flaky Test:  ReplSuite.newProductSeqEncoder with REPL defined class
> ---
>
> Key: SPARK-20548
> URL: https://issues.apache.org/jira/browse/SPARK-20548
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Sameer Agarwal
>Assignee: Apache Spark
> Fix For: 2.2.0
>
>
> {{newProductSeqEncoder with REPL defined class}} in {{ReplSuite}} has been 
> failing non-deterministically: https://spark-tests.appspot.com/failed-tests 
> over the last few days.
> https://spark.test.databricks.com/job/spark-master-test-sbt-hadoop-2.7/176/testReport/junit/org.apache.spark.repl/ReplSuite/newProductSeqEncoder_with_REPL_defined_class/history/



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20570) The main version number on docs/latest/index.html

2017-05-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-20570.
---
   Resolution: Fixed
 Assignee: Michael Armbrust
Fix Version/s: 2.1.1

Oh, [~marmbrus] already did that, and it looks like it's OK now: 
http://spark.apache.org/docs/latest/

> The main version number on docs/latest/index.html
> -
>
> Key: SPARK-20570
> URL: https://issues.apache.org/jira/browse/SPARK-20570
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.1.1
>Reporter: liucht-inspur
>Assignee: Michael Armbrust
> Fix For: 2.1.1
>
>
> On the spark.apache.org home page, when I click the Latest Release (Spark 
> 2.1.1) link under the Documentation menu, the docs/latest page that opens 
> displays a 2.1.0 label in the upper left corner of the page.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20570) The main version number on docs/latest/index.html

2017-05-03 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15995265#comment-15995265
 ] 

Sean Owen commented on SPARK-20570:
---

Oh, probably another git sync hiccup. It may 'flush' if you push an empty 
commit to the repo. I can try that in a sec.

> The main version number on docs/latest/index.html
> -
>
> Key: SPARK-20570
> URL: https://issues.apache.org/jira/browse/SPARK-20570
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.1.1
>Reporter: liucht-inspur
>
> On the spark.apache.org home page, when I click the Latest Release (Spark 
> 2.1.1) link under the Documentation menu, the docs/latest page that opens 
> displays a 2.1.0 label in the upper left corner of the page.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20570) The main version number on docs/latest/index.html

2017-05-03 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15995248#comment-15995248
 ] 

Michael Armbrust commented on SPARK-20570:
--

Hmmm, I did push them, and they show up on the [asf git 
website|https://git-wip-us.apache.org/repos/asf?p=spark-website.git;a=commit;h=d4f0c34ac33001169452a276acfebb7b8cc58c0f].
  They don't seem to appear on github though.  I guess I'll open an INFRA 
ticket.

> The main version number on docs/latest/index.html
> -
>
> Key: SPARK-20570
> URL: https://issues.apache.org/jira/browse/SPARK-20570
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.1.1
>Reporter: liucht-inspur
>
> On the spark.apache.org home page, when I click the Latest Release (Spark 
> 2.1.1) link under the Documentation menu, the docs/latest page that opens 
> displays a 2.1.0 label in the upper left corner of the page.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20586) Add deterministic and distinctLike to ScalaUDF

2017-05-03 Thread Xiao Li (JIRA)
Xiao Li created SPARK-20586:
---

 Summary: Add deterministic and distinctLike to ScalaUDF
 Key: SPARK-20586
 URL: https://issues.apache.org/jira/browse/SPARK-20586
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Xiao Li
Assignee: Xiao Li


https://hive.apache.org/javadocs/r2.0.1/api/org/apache/hadoop/hive/ql/udf/UDFType.html

Like Hive's UDFType, we should allow users to add extra flags to ScalaUDF as 
well. {{stateful}}/{{impliesOrder}} are not applicable to ScalaUDF, so we only 
add the following two flags (a hypothetical sketch follows the list).

- deterministic: certain optimizations should not be applied if a UDF is not 
deterministic. A deterministic UDF returns the same result each time it is 
invoked with a particular input; this determinism only needs to hold within the 
context of a single query.

- distinctLike: a UDF is considered distinctLike if it can be evaluated on just 
the distinct values of a column. Examples include min and max UDFs. This 
information is used by the metadata-only optimizer.
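
The sketch below is purely illustrative: it shows one way such flags could be 
carried alongside a Scala function, using made-up names (UdfFlags, FlaggedUdf). 
It is not Spark's ScalaUDF API.

{code}
// Hypothetical illustration only; these types are not part of Spark.
object UdfFlagsExample {
  case class UdfFlags(deterministic: Boolean = true, distinctLike: Boolean = false)

  // A wrapper carrying the flags next to the user's function.
  case class FlaggedUdf[A, B](f: A => B, flags: UdfFlags)

  // Non-deterministic: returns a different value on every call.
  val randomScore: FlaggedUdf[Int, Double] =
    FlaggedUdf((_: Int) => scala.util.Random.nextDouble(), UdfFlags(deterministic = false))

  // distinctLike: evaluating it on only the distinct inputs is enough (like min/max).
  val identityLike: FlaggedUdf[Int, Int] =
    FlaggedUdf((x: Int) => x, UdfFlags(distinctLike = true))
}
{code}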




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20583) Scala/Java generic hint support

2017-05-03 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-20583.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

> Scala/Java generic hint support
> ---
>
> Key: SPARK-20583
> URL: https://issues.apache.org/jira/browse/SPARK-20583
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.2.0
>
>
> Done in https://github.com/apache/spark/pull/17839



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20584) Python generic hint support

2017-05-03 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-20584:
---

 Summary: Python generic hint support
 Key: SPARK-20584
 URL: https://issues.apache.org/jira/browse/SPARK-20584
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.1.0
Reporter: Reynold Xin






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20583) Scala/Java generic hint support

2017-05-03 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-20583:
---

 Summary: Scala/Java generic hint support
 Key: SPARK-20583
 URL: https://issues.apache.org/jira/browse/SPARK-20583
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.1.0
Reporter: Reynold Xin
Assignee: Reynold Xin


Done in https://github.com/apache/spark/pull/17839
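
As a usage sketch only: assuming the generic hint API added by the PR above is 
exposed as Dataset.hint(name), a broadcast hint on a join could look like the 
following. The data, session setup and the "broadcast" hint name are 
illustrative.

{code}
import org.apache.spark.sql.SparkSession

object HintExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("hint-example").getOrCreate()
    import spark.implicits._

    val small = Seq((1, "a"), (2, "b")).toDF("id", "v")
    val large = spark.range(0, 1000000).toDF("id")

    // Ask the planner to broadcast the smaller side of the join.
    val joined = large.join(small.hint("broadcast"), "id")
    joined.explain()

    spark.stop()
  }
}
{code}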



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20585) R generic hint support

2017-05-03 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-20585:
---

 Summary: R generic hint support
 Key: SPARK-20585
 URL: https://issues.apache.org/jira/browse/SPARK-20585
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.1.0
Reporter: Reynold Xin






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20582) Speed up the restart of HistoryServer using ApplicationAttemptInfo checkpointing

2017-05-03 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-20582.

Resolution: Duplicate

> Speed up the restart of HistoryServer using ApplicationAttemptInfo 
> checkpointing
> 
>
> Key: SPARK-20582
> URL: https://issues.apache.org/jira/browse/SPARK-20582
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 2.1.0
>Reporter: zhoukang
>Priority: Critical
>
> In the current HistoryServer code, the Jetty server is started only after all 
> the logs fetched from YARN have been replayed. However, as the number of logs 
> grows, the Jetty start time becomes too long.
> Here we implement a solution that uses ApplicationAttemptInfo checkpointing to 
> speed up HistoryServer startup: when the HistoryServer starts, it loads 
> ApplicationAttemptInfo from a checkpoint file first if one exists, which is 
> faster than replaying the logs one by one.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20571) Flaky SparkR StructuredStreaming tests

2017-05-03 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15995149#comment-15995149
 ] 

Felix Cheung commented on SPARK-20571:
--

Thanks, I will look into today.

> Flaky SparkR StructuredStreaming tests
> --
>
> Key: SPARK-20571
> URL: https://issues.apache.org/jira/browse/SPARK-20571
> Project: Spark
>  Issue Type: Test
>  Components: SparkR, Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Burak Yavuz
>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76399



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20582) Speed up the restart of HistoryServer using ApplicationAttemptInfo checkpointing

2017-05-03 Thread zhoukang (JIRA)
zhoukang created SPARK-20582:


 Summary: Speed up the restart of HistoryServer using 
ApplicationAttemptInfo checkpointing
 Key: SPARK-20582
 URL: https://issues.apache.org/jira/browse/SPARK-20582
 Project: Spark
  Issue Type: Improvement
  Components: Deploy
Affects Versions: 2.1.0
Reporter: zhoukang
Priority: Critical


In the current HistoryServer code, the Jetty server is started only after all 
the logs fetched from YARN have been replayed. However, as the number of logs 
grows, the Jetty start time becomes too long.
Here we implement a solution that uses ApplicationAttemptInfo checkpointing to 
speed up HistoryServer startup: when the HistoryServer starts, it loads 
ApplicationAttemptInfo from a checkpoint file first if one exists, which is 
faster than replaying the logs one by one.
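
A hypothetical sketch of the load-from-checkpoint-or-replay idea is below. The 
class, field and file names are made up for illustration and are not the actual 
HistoryServer internals; it only shows the shape of the optimization.

{code}
import java.io._

object AttemptCheckpoint {
  // Illustrative stand-in for the real ApplicationAttemptInfo.
  case class AttemptInfo(appId: String, attemptId: String, completed: Boolean)

  def save(path: File, attempts: Seq[AttemptInfo]): Unit = {
    val out = new ObjectOutputStream(new FileOutputStream(path))
    try out.writeObject(attempts) finally out.close()
  }

  // Load the cached attempt list if a checkpoint exists; otherwise fall back to
  // the (slow) replay of every event log and write a fresh checkpoint.
  def loadOrReplay(path: File)(replay: => Seq[AttemptInfo]): Seq[AttemptInfo] =
    if (path.exists()) {
      val in = new ObjectInputStream(new FileInputStream(path))
      try in.readObject().asInstanceOf[Seq[AttemptInfo]] finally in.close()
    } else {
      val attempts = replay
      save(path, attempts)
      attempts
    }
}
{code}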



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20582) Speed up the restart of HistoryServer using ApplicationAttemptInfo checkpointing

2017-05-03 Thread zhoukang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhoukang updated SPARK-20582:
-
Description: 
   In the current HistoryServer code, the Jetty server is started only after 
all the logs fetched from YARN have been replayed. However, as the number of 
logs grows, the Jetty start time becomes too long.
   Here we implement a solution that uses ApplicationAttemptInfo checkpointing 
to speed up HistoryServer startup: when the HistoryServer starts, it loads 
ApplicationAttemptInfo from a checkpoint file first if one exists, which is 
faster than replaying the logs one by one.

  was:
In the current code of HistoryServer,jetty server will be started after all the 
logs fetch from yarn has been replayed.However,when the number of logs is 
becoming larger,the start time of jetty will be too long.
Here,we implement a solution which using ApplicationAttemptInfo checkpointing 
to speed up the start of historyserver.When historyserver is starting,it will 
load ApplicationAttemptInfo from checkpoint file first if exists which is 
faster then replaying one by one.


> Speed up the restart of HistoryServer using ApplicationAttemptInfo 
> checkpointing
> 
>
> Key: SPARK-20582
> URL: https://issues.apache.org/jira/browse/SPARK-20582
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 2.1.0
>Reporter: zhoukang
>Priority: Critical
>
> In the current HistoryServer code, the Jetty server is started only after all 
> the logs fetched from YARN have been replayed. However, as the number of logs 
> grows, the Jetty start time becomes too long.
> Here we implement a solution that uses ApplicationAttemptInfo checkpointing to 
> speed up HistoryServer startup: when the HistoryServer starts, it loads 
> ApplicationAttemptInfo from a checkpoint file first if one exists, which is 
> faster than replaying the logs one by one.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20441) Within the same streaming query, one StreamingRelation should only be transformed to one StreamingExecutionRelation

2017-05-03 Thread Burak Yavuz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz resolved SPARK-20441.
-
Resolution: Fixed

Resolved with https://github.com/apache/spark/pull/17735

> Within the same streaming query, one StreamingRelation should only be 
> transformed to one StreamingExecutionRelation
> ---
>
> Key: SPARK-20441
> URL: https://issues.apache.org/jira/browse/SPARK-20441
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0, 2.1.1, 2.1.2, 2.2.0
>Reporter: Liwei Lin
>
> Within the same streaming query, when one StreamingRelation is referenced 
> multiple times -- e.g. df.union(df) -- we should transform it into only one 
> StreamingExecutionRelation, instead of two or more different 
> StreamingExecutionRelations.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-20432) Unioning two identical Streaming DataFrames fails during attribute resolution

2017-05-03 Thread Burak Yavuz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz closed SPARK-20432.
---
Resolution: Duplicate

> Unioning two identical Streaming DataFrames fails during attribute resolution
> -
>
> Key: SPARK-20432
> URL: https://issues.apache.org/jira/browse/SPARK-20432
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Burak Yavuz
>
> To reproduce, try unioning two identical Kafka Streams:
> {code}
> df = spark.readStream.format("kafka")... \
>   .select(from_json(col("value").cast("string"), 
> simpleSchema).alias("parsed_value"))
> df.union(df).writeStream...
> {code}
> Exception is confusing:
> {code}
> org.apache.spark.sql.AnalysisException: resolved attribute(s) value#526 
> missing from 
> value#511,topic#512,partition#513,offset#514L,timestampType#516,key#510,timestamp#515
>  in operator !Project [jsontostructs(...) AS parsed_value#357];
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20581) Using AVG or SUM on a INT/BIGINT column with fraction operator will yield BIGINT instead of DOUBLE

2017-05-03 Thread Dominic Ricard (JIRA)
Dominic Ricard created SPARK-20581:
--

 Summary: Using AVG or SUM on a INT/BIGINT column with fraction 
operator will yield BIGINT instead of DOUBLE
 Key: SPARK-20581
 URL: https://issues.apache.org/jira/browse/SPARK-20581
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.2
Reporter: Dominic Ricard


We stumbled on this multiple times and every time we are baffled by the 
behavior of AVG and SUM.

Given the following SQL (Executed through Thrift):
{noformat}
SELECT SUM(col/2) FROM
(SELECT 3 as `col`) t
{noformat}

The result is "1", whereas the expected (and accurate) result is 1.5.

Here's the explain plan:
{noformat}
== Physical Plan == 
TungstenAggregate(key=[], functions=[(sum(cast((cast(col#1519342 as double) / 
2.0) as bigint)),mode=Final,isDistinct=false)], output=[_c0#1519344L])
+- TungstenExchange SinglePartition, None   
   +- TungstenAggregate(key=[], functions=[(sum(cast((cast(col#1519342 as 
double) / 2.0) as bigint)),mode=Partial,isDistinct=false)], 
output=[sum#1519347L])
  +- Project [3 AS col#1519342] 
 +- Scan OneRowRelation[]   
{noformat}

Why the extra cast to BIGINT?
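
As a hedged workaround sketch only, written against the 2.x DataFrame API with a 
SparkSession named spark (so not verified against the 1.6.2 Thrift path reported 
here): casting the integer column to DOUBLE before the division keeps the 
aggregate input fractional.

{code}
import org.apache.spark.sql.functions._

// df has a single INT column "col" with value 3.
val df = spark.range(1).select(lit(3).as("col"))

// Explicit cast before dividing; expected result: 1.5
df.select(sum(col("col").cast("double") / 2).as("half_sum")).show()
{code}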



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20580) Allow RDD cache with unserializable objects

2017-05-03 Thread Fernando Pereira (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15994968#comment-15994968
 ] 

Fernando Pereira commented on SPARK-20580:
--

I try to avoid any operation involving serialization. I'm using pyspark and the 
default RDD.cache().

So apparently there is a bug somewhere.
My case:
In [11]: fzer.fdata.morphologyRDD.map(lambda a: MorphoStats.has_duplicated_points(a[1])).count()
Out[11]: 22431
In [12]: rdd2 = fzer.fdata.morphologyRDD.cache()
In [13]: rdd2.map(lambda a: MorphoStats.has_duplicated_points(a[1])).count()
[Stage 7:> (0 + 8) / 128]
17/05/03 16:22:52 ERROR Executor: Exception in task 1.0 in stage 7.0 (TID 652)
org.apache.spark.SparkException: Python worker exited unexpectedly (crashed)
(...)
Caused by: java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)


> Allow RDD cache with unserializable objects
> ---
>
> Key: SPARK-20580
> URL: https://issues.apache.org/jira/browse/SPARK-20580
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.3.0
>Reporter: Fernando Pereira
>Priority: Minor
>
> In my current scenario we load complex Python objects in the worker nodes 
> that are not completely serializable. We then apply certain map operations to 
> the RDD, which at some point we collect. In this basic usage everything works 
> well.
> However, if we cache() the RDD (which defaults to memory), it suddenly fails 
> to execute the transformations after the caching step. Apparently caching 
> serializes the RDD data and deserializes it whenever more transformations are 
> required.
> It would be nice to avoid serialization of the objects if they are to be 
> cached to memory, and keep the original objects instead.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20580) Allow RDD cache with unserializable objects

2017-05-03 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15994947#comment-15994947
 ] 

Sean Owen commented on SPARK-20580:
---

In general, such objects need to be serializable, because otherwise there's no 
way to move the data, and almost anything you do involves moving the data. You 
might be able to get away with it in scenarios where the objects are created 
from a data source into memory and never participate in any operation that 
involves serialization. Here, it sounds like you chose one of the "_SER" storage 
levels, which serialize the object into memory. If so, that's your problem.
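
Below is a small Scala sketch contrasting the storage levels mentioned above; 
the session setup and data are illustrative. Note that the original report uses 
PySpark, where cached RDD elements are pickled regardless of storage level.

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object CacheLevels {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("cache-levels").getOrCreate()
    val sc = spark.sparkContext

    val rdd = sc.parallelize(1 to 1000).map(i => (i, i.toString * 10))

    // Keeps deserialized JVM objects in memory; no serialization on a cache hit.
    rdd.persist(StorageLevel.MEMORY_ONLY)
    println(rdd.count())

    rdd.unpersist()

    // Stores serialized bytes in memory; every access pays deserialization cost
    // and requires the elements to be serializable.
    rdd.persist(StorageLevel.MEMORY_ONLY_SER)
    println(rdd.count())

    spark.stop()
  }
}
{code}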

> Allow RDD cache with unserializable objects
> ---
>
> Key: SPARK-20580
> URL: https://issues.apache.org/jira/browse/SPARK-20580
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.3.0
>Reporter: Fernando Pereira
>Priority: Minor
>
> In my current scenario we load complex Python objects in the worker nodes 
> that are not completely serializable. We then apply certain map operations to 
> the RDD, which at some point we collect. In this basic usage everything works 
> well.
> However, if we cache() the RDD (which defaults to memory), it suddenly fails 
> to execute the transformations after the caching step. Apparently caching 
> serializes the RDD data and deserializes it whenever more transformations are 
> required.
> It would be nice to avoid serialization of the objects if they are to be 
> cached to memory, and keep the original objects instead.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20580) Allow RDD cache with unserializable objects

2017-05-03 Thread Fernando Pereira (JIRA)
Fernando Pereira created SPARK-20580:


 Summary: Allow RDD cache with unserializable objects
 Key: SPARK-20580
 URL: https://issues.apache.org/jira/browse/SPARK-20580
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: Fernando Pereira
Priority: Minor


In my current scenario we load complex Python objects in the worker nodes that 
are not completely serializable. We then apply certain map operations to the 
RDD, which at some point we collect. In this basic usage everything works well.

However, if we cache() the RDD (which defaults to memory), it suddenly fails to 
execute the transformations after the caching step. Apparently caching 
serializes the RDD data and deserializes it whenever more transformations are 
required.

It would be nice to avoid serialization of the objects if they are to be cached 
to memory, and keep the original objects instead.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


