[jira] [Updated] (SPARK-15214) Implement code generation for Generate

2016-11-19 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-15214:

Summary: Implement code generation for Generate  (was: Enable code 
generation for Generate)

> Implement code generation for Generate
> --
>
> Key: SPARK-15214
> URL: https://issues.apache.org/jira/browse/SPARK-15214
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
> Fix For: 2.2.0
>
>
> {{Generate}} currently does not support code generation. Let's add code 
> generation support for it and for its most important generators: {{explode}} 
> and {{json_tuple}}.
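
For context, here is standard usage of the two generators named above (an illustration only, assuming a spark-shell style {{SparkSession}} named {{spark}}; the code generation work itself happens inside the {{Generate}} physical operator):

{code}
import spark.implicits._

// explode: one output row per array element
spark.range(1).selectExpr("explode(array(1, 2, 3)) AS v").show()

// json_tuple: extracts several JSON fields at once (output columns default to c0, c1, ...)
Seq("""{"a": 1, "b": "x"}""").toDF("js")
  .selectExpr("json_tuple(js, 'a', 'b')")
  .show()
{code}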






[jira] [Resolved] (SPARK-15214) Enable code generation for Generate

2016-11-19 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-15214.
-
  Resolution: Fixed
   Fix Version/s: 2.2.0
Target Version/s:   (was: 2.1.0)

> Enable code generation for Generate
> ---
>
> Key: SPARK-15214
> URL: https://issues.apache.org/jira/browse/SPARK-15214
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
> Fix For: 2.2.0
>
>
> {{Generate}} currently does not support code generation. Let's add code 
> generation support for it and for its most important generators: {{explode}} 
> and {{json_tuple}}.






[jira] [Resolved] (SPARK-18508) Fix documentation for DateDiff

2016-11-19 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-18508.
-
   Resolution: Fixed
Fix Version/s: 2.1.0

> Fix documentation for DateDiff
> --
>
> Key: SPARK-18508
> URL: https://issues.apache.org/jira/browse/SPARK-18508
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.1.0
>
>
> The current documentation for DateDiff does not make it clear which one is 
> the start date, and which is the end date. The example is also wrong about 
> the direction.
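
For reference, a minimal sketch of the intended argument order, assuming a spark-shell style {{SparkSession}} named {{spark}}: {{datediff(endDate, startDate)}} returns the number of days from {{startDate}} to {{endDate}}, positive when the end date is later.

{code}
spark.sql("SELECT datediff('2009-07-31', '2009-07-30') AS diff").show()
// diff = 1: the first argument (end date) is one day after the second (start date)
{code}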






[jira] [Resolved] (SPARK-18458) core dumped running Spark SQL on large data volume (100TB)

2016-11-19 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-18458.
-
   Resolution: Fixed
 Assignee: Kazuaki Ishizaki
Fix Version/s: 2.1.0

> core dumped running Spark SQL on large data volume (100TB)
> --
>
> Key: SPARK-18458
> URL: https://issues.apache.org/jira/browse/SPARK-18458
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: JESSE CHEN
>Assignee: Kazuaki Ishizaki
>Priority: Critical
>  Labels: core, dump
> Fix For: 2.1.0
>
>
> Running a query on a 100TB parquet database using the Spark master branch 
> dated 11/04 dumps core on Spark executors.
> The query is TPC-DS query 82 (this is not the only query that can produce the 
> core dump, just the easiest one with which to re-create the error).
> Spark output that showed the exception:
> {noformat}
> 16/11/14 10:38:51 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: 
> Container marked as failed: container_e68_1478924651089_0018_01_74 on 
> host: mer05x.svl.ibm.com. Exit status: 134. Diagnostics: Exception from 
> container-launch.
> Container id: container_e68_1478924651089_0018_01_74
> Exit code: 134
> Exception message: /bin/bash: line 1: 4031216 Aborted (core 
> dumped) /usr/jdk64/java-1.8.0-openjdk-1.8.0.77-0.b03.el7_2.x86_64/bin/java 
> -server -Xmx24576m 
> -Djava.io.tmpdir=/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/tmp
>  '-Dspark.history.ui.port=18080' '-Dspark.driver.port=39855' 
> -Dspark.yarn.app.container.log.dir=/data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74
>  -XX:OnOutOfMemoryError='kill %p' 
> org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
> spark://CoarseGrainedScheduler@192.168.10.101:39855 --executor-id 73 
> --hostname mer05x.svl.ibm.com --cores 2 --app-id 
> application_1478924651089_0018 --user-class-path 
> file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/__app__.jar
>  --user-class-path 
> file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/com.databricks_spark-csv_2.10-1.3.0.jar
>  --user-class-path 
> file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/org.apache.commons_commons-csv-1.1.jar
>  --user-class-path 
> file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/com.univocity_univocity-parsers-1.5.1.jar
>  > 
> /data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/stdout
>  2> 
> /data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/stderr
> Stack trace: ExitCodeException exitCode=134: /bin/bash: line 1: 4031216 
> Aborted (core dumped) 
> /usr/jdk64/java-1.8.0-openjdk-1.8.0.77-0.b03.el7_2.x86_64/bin/java -server 
> -Xmx24576m 
> -Djava.io.tmpdir=/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/tmp
>  '-Dspark.history.ui.port=18080' '-Dspark.driver.port=39855' 
> -Dspark.yarn.app.container.log.dir=/data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74
>  -XX:OnOutOfMemoryError='kill %p' 
> org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
> spark://CoarseGrainedScheduler@192.168.10.101:39855 --executor-id 73 
> --hostname mer05x.svl.ibm.com --cores 2 --app-id 
> application_1478924651089_0018 --user-class-path 
> file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/__app__.jar
>  --user-class-path 
> file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/com.databricks_spark-csv_2.10-1.3.0.jar
>  --user-class-path 
> file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/org.apache.commons_commons-csv-1.1.jar
>  --user-class-path 
> file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/com.univocity_univocity-parsers-1.5.1.jar
>  > 
> /data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/stdout
>  2> 
> /data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/stderr
> at 

[jira] [Commented] (SPARK-18510) Partition schema inference corrupts data

2016-11-19 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15680487#comment-15680487
 ] 

Reynold Xin commented on SPARK-18510:
-

Does your pull request for SPARK-18407 fix this one as well?


> Partition schema inference corrupts data
> 
>
> Key: SPARK-18510
> URL: https://issues.apache.org/jira/browse/SPARK-18510
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Burak Yavuz
>Priority: Blocker
>
> Not sure if this is a regression from 2.0 to 2.1. I was investigating this 
> for Structured Streaming, but it seems it affects batch data as well.
> Here's the issue:
> If I specify my schema when doing
> {code}
> spark.read
>   .schema(someSchemaWherePartitionColumnsAreStrings)
> {code}
> but if the partition inference can infer it as IntegerType or I assume 
> LongType or DoubleType (basically fixed size types), then once UnsafeRows are 
> generated, your data will be corrupted.
> Reproduction:
> {code}
> val createArray = udf { (length: Long) =>
> for (i <- 1 to length.toInt) yield i.toString
> }
> spark.range(10).select(createArray('id + 1) as 'ex, 'id, 'id % 4 as 
> 'part).coalesce(1).write
> .partitionBy("part", "id")
> .mode("overwrite")
> .parquet(src.toString)
> val schema = new StructType()
> .add("id", StringType)
> .add("part", IntegerType)
> .add("ex", ArrayType(StringType))
> spark.read
>   .schema(schema)
>   .format("parquet")
>   .load(src.toString)
>   .show()
> {code}
> The UDF is useful for creating a row long enough so that you don't hit other 
> weird NullPointerExceptions caused for the same reason I believe.
> Output:
> {code}
> +-+++
> |   id|part|  ex|
> +-+++
> |�|   1|[1, 2, 3, 4, 5, 6...|
> | |   0|[1, 2, 3, 4, 5, 6...|
> |  |   3|[1, 2, 3, 4, 5, 6...|
> |   |   2|[1, 2, 3, 4, 5, 6...|
> ||   1|  [1, 2, 3, 4, 5, 6]|
> | |   0| [1, 2, 3, 4, 5]|
> |  |   3|[1, 2, 3, 4]|
> |   |   2|   [1, 2, 3]|
> ||   1|  [1, 2]|
> | |   0| [1]|
> +-+++
> {code}
> I was hoping to fix the issue as part of SPARK-18407, but it seems it's not 
> only applicable to Structured Streaming and deserves its own JIRA.






[jira] [Commented] (SPARK-18511) Add an api for join operation with just the column name and join type.

2016-11-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15680374#comment-15680374
 ] 

Apache Spark commented on SPARK-18511:
--

User 'shiv4nsh' has created a pull request for this issue:
https://github.com/apache/spark/pull/15943

> Add an api for join operation with just the column name and join type.
> --
>
> Key: SPARK-18511
> URL: https://issues.apache.org/jira/browse/SPARK-18511
> Project: Spark
>  Issue Type: New Feature
>Affects Versions: 2.0.2
>Reporter: Shivansh
>Priority: Minor
>







[jira] [Assigned] (SPARK-18511) Add an api for join operation with just the column name and join type.

2016-11-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18511:


Assignee: (was: Apache Spark)

> Add an api for join operation with just the column name and join type.
> --
>
> Key: SPARK-18511
> URL: https://issues.apache.org/jira/browse/SPARK-18511
> Project: Spark
>  Issue Type: New Feature
>Affects Versions: 2.0.2
>Reporter: Shivansh
>Priority: Minor
>







[jira] [Assigned] (SPARK-18511) Add an api for join operation with just the column name and join type.

2016-11-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18511:


Assignee: Apache Spark

> Add an api for join operation with just the column name and join type.
> --
>
> Key: SPARK-18511
> URL: https://issues.apache.org/jira/browse/SPARK-18511
> Project: Spark
>  Issue Type: New Feature
>Affects Versions: 2.0.2
>Reporter: Shivansh
>Assignee: Apache Spark
>Priority: Minor
>







[jira] [Created] (SPARK-18511) Add an api for join operation with just the column name and join type.

2016-11-19 Thread Shivansh (JIRA)
Shivansh created SPARK-18511:


 Summary: Add an api for join operation with just the column name 
and join type.
 Key: SPARK-18511
 URL: https://issues.apache.org/jira/browse/SPARK-18511
 Project: Spark
  Issue Type: New Feature
Affects Versions: 2.0.2
Reporter: Shivansh
Priority: Minor
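
The description is empty, so as context only, here is a hypothetical sketch of the kind of overload being requested (an illustration, not the actual proposal). Today a single-column join with an explicit join type has to wrap the column name in a {{Seq}}:

{code}
import spark.implicits._

val left  = Seq((1, "a")).toDF("id", "x")
val other = Seq((1, "b")).toDF("id", "y")

// Current API: the single join column must be wrapped in a Seq.
left.join(other, Seq("id"), "left_outer").show()

// Shape of the requested overload (hypothetical; it does not exist at this point):
// def join(right: Dataset[_], usingColumn: String, joinType: String): DataFrame
// left.join(other, "id", "left_outer")
{code}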









[jira] [Assigned] (SPARK-18407) Inferred partition columns cause assertion error

2016-11-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18407:


Assignee: Apache Spark

> Inferred partition columns cause assertion error
> 
>
> Key: SPARK-18407
> URL: https://issues.apache.org/jira/browse/SPARK-18407
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.2
>Reporter: Michael Armbrust
>Assignee: Apache Spark
>Priority: Critical
>
> [This 
> assertion|https://github.com/apache/spark/blob/16eaad9daed0b633e6a714b5704509aa7107d6e5/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala#L408]
>  fails when you run a stream against json data that is stored in partitioned 
> folders, if you manually specify the schema and that schema omits the 
> partitioned columns.
> My hunch is that we are inferring those columns even though the schema is 
> being passed in manually and adding them to the end.
> While we are fixing this bug, it would be nice to make the assertion better.  
> Truncating is not terribly useful as, at least in my case, it truncated the 
> most interesting part.  I changed it to this while debugging:
> {code}
>   s"""
>  |Batch does not have expected schema
>  |Expected: ${output.mkString(",")}
>  |Actual: ${newPlan.output.mkString(",")}
>  |
>  |== Original ==
>  |$logicalPlan
>  |
>  |== Batch ==
>  |$newPlan
>""".stripMargin
> {code}
> I also tried specifying the partition columns in the schema and now it 
> appears that they are filled with corrupted data.
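
For context, a hypothetical sketch of the setup being described (an assumption for illustration; the path, schema, and partition column below are made up, not taken from the report):

{code}
import org.apache.spark.sql.types._

// JSON data laid out in partitioned folders, e.g. /tmp/events/part=0/..., /tmp/events/part=1/...
// The user-supplied schema deliberately omits the partition column "part".
val userSchema = new StructType().add("value", StringType)

val stream = spark.readStream
  .schema(userSchema)
  .json("/tmp/events")

// Per the hunch above, the batch plan then carries the inferred "part" column
// appended to the output, which is what trips the assertion.
stream.printSchema()
{code}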






[jira] [Assigned] (SPARK-18407) Inferred partition columns cause assertion error

2016-11-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18407:


Assignee: (was: Apache Spark)

> Inferred partition columns cause assertion error
> 
>
> Key: SPARK-18407
> URL: https://issues.apache.org/jira/browse/SPARK-18407
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.2
>Reporter: Michael Armbrust
>Priority: Critical
>
> [This 
> assertion|https://github.com/apache/spark/blob/16eaad9daed0b633e6a714b5704509aa7107d6e5/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala#L408]
>  fails when you run a stream against json data that is stored in partitioned 
> folders, if you manually specify the schema and that schema omits the 
> partitioned columns.
> My hunch is that we are inferring those columns even though the schema is 
> being passed in manually and adding them to the end.
> While we are fixing this bug, it would be nice to make the assertion better.  
> Truncating is not terribly useful as, at least in my case, it truncated the 
> most interesting part.  I changed it to this while debugging:
> {code}
>   s"""
>  |Batch does not have expected schema
>  |Expected: ${output.mkString(",")}
>  |Actual: ${newPlan.output.mkString(",")}
>  |
>  |== Original ==
>  |$logicalPlan
>  |
>  |== Batch ==
>  |$newPlan
>""".stripMargin
> {code}
> I also tried specifying the partition columns in the schema and now it 
> appears that they are filled with corrupted data.






[jira] [Commented] (SPARK-18407) Inferred partition columns cause assertion error

2016-11-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15680352#comment-15680352
 ] 

Apache Spark commented on SPARK-18407:
--

User 'brkyvz' has created a pull request for this issue:
https://github.com/apache/spark/pull/15942

> Inferred partition columns cause assertion error
> 
>
> Key: SPARK-18407
> URL: https://issues.apache.org/jira/browse/SPARK-18407
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.2
>Reporter: Michael Armbrust
>Priority: Critical
>
> [This 
> assertion|https://github.com/apache/spark/blob/16eaad9daed0b633e6a714b5704509aa7107d6e5/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala#L408]
>  fails when you run a stream against json data that is stored in partitioned 
> folders, if you manually specify the schema and that schema omits the 
> partitioned columns.
> My hunch is that we are inferring those columns even though the schema is 
> being passed in manually and adding them to the end.
> While we are fixing this bug, it would be nice to make the assertion better.  
> Truncating is not terribly useful as, at least in my case, it truncated the 
> most interesting part.  I changed it to this while debugging:
> {code}
>   s"""
>  |Batch does not have expected schema
>  |Expected: ${output.mkString(",")}
>  |Actual: ${newPlan.output.mkString(",")}
>  |
>  |== Original ==
>  |$logicalPlan
>  |
>  |== Batch ==
>  |$newPlan
>""".stripMargin
> {code}
> I also tried specifying the partition columns in the schema and now it 
> appears that they are filled with corrupted data.






[jira] [Commented] (SPARK-18510) Partition schema inference corrupts data

2016-11-19 Thread Burak Yavuz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15680328#comment-15680328
 ] 

Burak Yavuz commented on SPARK-18510:
-

cc [~r...@databricks.com] I marked this as a blocker as it is pretty nasty. 
Feel free to downgrade if you don't think so, or if you feel the semantics of 
what I'm doing are wrong.

> Partition schema inference corrupts data
> 
>
> Key: SPARK-18510
> URL: https://issues.apache.org/jira/browse/SPARK-18510
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Burak Yavuz
>Priority: Blocker
>
> Not sure if this is a regression from 2.0 to 2.1. I was investigating this 
> for Structured Streaming, but it seems it affects batch data as well.
> Here's the issue:
> If I specify my schema when doing
> {code}
> spark.read
>   .schema(someSchemaWherePartitionColumnsAreStrings)
> {code}
> but if the partition inference can infer it as IntegerType or I assume 
> LongType or DoubleType (basically fixed size types), then once UnsafeRows are 
> generated, your data will be corrupted.
> Reproduction:
> {code}
> val createArray = udf { (length: Long) =>
> for (i <- 1 to length.toInt) yield i.toString
> }
> spark.range(10).select(createArray('id + 1) as 'ex, 'id, 'id % 4 as 
> 'part).coalesce(1).write
> .partitionBy("part", "id")
> .mode("overwrite")
> .parquet(src.toString)
> val schema = new StructType()
> .add("id", StringType)
> .add("part", IntegerType)
> .add("ex", ArrayType(StringType))
> spark.read
>   .schema(schema)
>   .format("parquet")
>   .load(src.toString)
>   .show()
> {code}
> The UDF is useful for creating a row long enough so that you don't hit other 
> weird NullPointerExceptions caused for the same reason I believe.
> Output:
> {code}
> +-+++
> |   id|part|  ex|
> +-+++
> |�|   1|[1, 2, 3, 4, 5, 6...|
> | |   0|[1, 2, 3, 4, 5, 6...|
> |  |   3|[1, 2, 3, 4, 5, 6...|
> |   |   2|[1, 2, 3, 4, 5, 6...|
> ||   1|  [1, 2, 3, 4, 5, 6]|
> | |   0| [1, 2, 3, 4, 5]|
> |  |   3|[1, 2, 3, 4]|
> |   |   2|   [1, 2, 3]|
> ||   1|  [1, 2]|
> | |   0| [1]|
> +-+++
> {code}
> I was hoping to fix the issue as part of SPARK-18407, but it seems it's not 
> only applicable to Structured Streaming and deserves its own JIRA.






[jira] [Updated] (SPARK-18510) Partition schema inference corrupts data

2016-11-19 Thread Burak Yavuz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz updated SPARK-18510:

Description: 
Not sure if this is a regression from 2.0 to 2.1. I was investigating this for 
Structured Streaming, but it seems to affect batch data as well.

Here's the issue: if I specify my schema when doing
{code}
spark.read
  .schema(someSchemaWherePartitionColumnsAreStrings)
{code}
but partition inference would infer those columns as IntegerType (or, I assume, 
LongType or DoubleType; basically fixed-size types), then once UnsafeRows are 
generated, your data will be corrupted.

Reproduction:
{code}
val createArray = udf { (length: Long) =>
  for (i <- 1 to length.toInt) yield i.toString
}

spark.range(10)
  .select(createArray('id + 1) as 'ex, 'id, 'id % 4 as 'part)
  .coalesce(1)
  .write
  .partitionBy("part", "id")
  .mode("overwrite")
  .parquet(src.toString)

val schema = new StructType()
  .add("id", StringType)
  .add("part", IntegerType)
  .add("ex", ArrayType(StringType))

spark.read
  .schema(schema)
  .format("parquet")
  .load(src.toString)
  .show()
{code}
The UDF is useful for creating rows long enough that you don't hit other weird 
NullPointerExceptions, which I believe are caused for the same reason.
Output:
{code}
+-+++
|   id|part|  ex|
+-+++
|�|   1|[1, 2, 3, 4, 5, 6...|
| |   0|[1, 2, 3, 4, 5, 6...|
|  |   3|[1, 2, 3, 4, 5, 6...|
|   |   2|[1, 2, 3, 4, 5, 6...|
||   1|  [1, 2, 3, 4, 5, 6]|
| |   0| [1, 2, 3, 4, 5]|
|  |   3|[1, 2, 3, 4]|
|   |   2|   [1, 2, 3]|
||   1|  [1, 2]|
| |   0| [1]|
+-+++
{code}

I was hoping to fix the issue as part of SPARK-18407, but it seems it's not only 
applicable to Structured Streaming and deserves its own JIRA.

  was:
Not sure if this is a regression from 2.0 to 2.1. I was investigating this for 
Structured Streaming, but it seems it affects batch data as well.

Here's the issue:
If I specify my schema when doing
{code}
spark.read
  .schema(someSchemaWherePartitionColumnsAreStrings)
{code}

but if the partition inference can infer it as IntegerType or I assume LongType 
or DoubleType (basically fixed size types), then once UnsafeRows are generated, 
your data will be corrupted.

Reproduction:
{code}
val createArray = udf { (length: Long) =>
for (i <- 1 to length.toInt) yield i.toString
}
spark.range(10).select(createArray('id + 1) as 'ex, 'id, 'id % 4 as 
'part).coalesce(1).write
.partitionBy("part", "id")
.mode("overwrite")
.parquet(src.toString)
val schema = new StructType()
.add("id", StringType)
.add("part", IntegerType)
.add("ex", ArrayType(StringType))
spark.read
  .schema(schema)
  .format("parquet")
  .load(src.toString)
  .show()
{code}

Output:
{code}
+-+++
|   id|part|  ex|
+-+++
|�|   1|[1, 2, 3, 4, 5, 6...|
| |   0|[1, 2, 3, 4, 5, 6...|
|  |   3|[1, 2, 3, 4, 5, 6...|
|   |   2|[1, 2, 3, 4, 5, 6...|
||   1|  [1, 2, 3, 4, 5, 6]|
| |   0| [1, 2, 3, 4, 5]|
|  |   3|[1, 2, 3, 4]|
|   |   2|   [1, 2, 3]|
||   1|  [1, 2]|
| |   0| [1]|
+-+++
{code}

I was hoping to fix the issue as part of SPARK-18407 but it seems it's not only 
applicable to StructuredStreaming and deserves it's own JIRA.


> Partition schema inference corrupts data
> 
>
> Key: SPARK-18510
> URL: https://issues.apache.org/jira/browse/SPARK-18510
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Burak Yavuz
>Priority: Blocker
>
> Not sure if this is a regression from 2.0 to 2.1. I was investigating this 
> for Structured Streaming, but it seems it affects batch data as well.
> Here's the issue:
> If I specify my schema when doing
> {code}
> spark.read
>   .schema(someSchemaWherePartitionColumnsAreStrings)
> {code}
> but if the partition inference can infer it as IntegerType or I assume 
> LongType or DoubleType (basically fixed size types), then once UnsafeRows are 
> generated, your data will be corrupted.
> Reproduction:
> {code}
> val createArray = udf { (length: Long) =>
> for (i <- 1 to length.toInt) yield i.toString
> }
> spark.range(10).select(createArray('id + 1) as 'ex, 'id, 'id % 4 as 
> 'part).coalesce(1).write
> .partitionBy("part", "id")
> .mode("overwrite")
> .parquet(src.toString)
> val schema = new StructType()
> .add("id", StringType)
> 

[jira] [Created] (SPARK-18510) Partition schema inference corrupts data

2016-11-19 Thread Burak Yavuz (JIRA)
Burak Yavuz created SPARK-18510:
---

 Summary: Partition schema inference corrupts data
 Key: SPARK-18510
 URL: https://issues.apache.org/jira/browse/SPARK-18510
 Project: Spark
  Issue Type: Bug
  Components: SQL, Structured Streaming
Affects Versions: 2.1.0
Reporter: Burak Yavuz
Priority: Blocker


Not sure if this is a regression from 2.0 to 2.1. I was investigating this for 
Structured Streaming, but it seems to affect batch data as well.

Here's the issue: if I specify my schema when doing
{code}
spark.read
  .schema(someSchemaWherePartitionColumnsAreStrings)
{code}
but partition inference would infer those columns as IntegerType (or, I assume, 
LongType or DoubleType; basically fixed-size types), then once UnsafeRows are 
generated, your data will be corrupted.

Reproduction:
{code}
val createArray = udf { (length: Long) =>
  for (i <- 1 to length.toInt) yield i.toString
}

spark.range(10)
  .select(createArray('id + 1) as 'ex, 'id, 'id % 4 as 'part)
  .coalesce(1)
  .write
  .partitionBy("part", "id")
  .mode("overwrite")
  .parquet(src.toString)

val schema = new StructType()
  .add("id", StringType)
  .add("part", IntegerType)
  .add("ex", ArrayType(StringType))

spark.read
  .schema(schema)
  .format("parquet")
  .load(src.toString)
  .show()
{code}

Output:
{code}
+-+++
|   id|part|  ex|
+-+++
|�|   1|[1, 2, 3, 4, 5, 6...|
| |   0|[1, 2, 3, 4, 5, 6...|
|  |   3|[1, 2, 3, 4, 5, 6...|
|   |   2|[1, 2, 3, 4, 5, 6...|
||   1|  [1, 2, 3, 4, 5, 6]|
| |   0| [1, 2, 3, 4, 5]|
|  |   3|[1, 2, 3, 4]|
|   |   2|   [1, 2, 3]|
||   1|  [1, 2]|
| |   0| [1]|
+-+++
{code}

I was hoping to fix the issue as part of SPARK-18407, but it seems it's not only 
applicable to Structured Streaming and deserves its own JIRA.
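
A hypothetical workaround sketch (an assumption based on the description above, not a verified fix): declare the partition columns with the fixed-size types that partition discovery would infer, so the user-supplied schema and the inferred partition schema agree.

{code}
// Hypothetical workaround, not verified: use IntegerType for the partition
// columns instead of StringType. `src` is the same path as in the reproduction.
import org.apache.spark.sql.types._

val alignedSchema = new StructType()
  .add("ex", ArrayType(StringType))
  .add("id", IntegerType)
  .add("part", IntegerType)

spark.read
  .schema(alignedSchema)
  .format("parquet")
  .load(src.toString)
  .show()
{code}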






[jira] [Commented] (SPARK-17380) Spark streaming with a multi shard Kinesis freezes after several days (memory/resource leak?)

2016-11-19 Thread Udit Mehrotra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15680276#comment-15680276
 ] 

Udit Mehrotra commented on SPARK-17380:
---

The above leak was seen with Spark 2.0 running on EMR. I noticed that the code 
path which causes the leak is the block replication code, so I switched from 
StorageLevel.MEMORY_AND_DISK_2 to StorageLevel.MEMORY_AND_DISK for the received 
Kinesis blocks. After switching, I no longer observe the above memory leak in 
the logs, but the application still freezes after 3-3.5 days: Spark Streaming 
stops processing records, and the queue of records received from Kinesis keeps 
growing until the executor runs out of memory.
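
For reference, a minimal sketch of where that storage level is chosen when creating the Kinesis receiver (this is an assumed shape of the application code, which is not shown in this thread; the app name, stream name, endpoint, and intervals are placeholders):

{code}
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kinesis.KinesisUtils

val ssc = new StreamingContext(new SparkConf().setAppName("kinesis-app"), Seconds(10))
val stream = KinesisUtils.createStream(
  ssc, "kinesis-app", "my-stream",
  "https://kinesis.us-east-1.amazonaws.com", "us-east-1",
  InitialPositionInStream.LATEST, Seconds(10),
  StorageLevel.MEMORY_AND_DISK)  // MEMORY_AND_DISK_2 goes through the replication path described above
{code}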

> Spark streaming with a multi shard Kinesis freezes after several days 
> (memory/resource leak?)
> -
>
> Key: SPARK-17380
> URL: https://issues.apache.org/jira/browse/SPARK-17380
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.0
>Reporter: Xeto
> Attachments: exec_Leak_Hunter.zip, memory-after-freeze.png, memory.png
>
>
> Running Spark Streaming 2.0.0 on AWS EMR 5.0.0 consuming from Kinesis (125 
> shards).
> Used memory keeps growing all the time according to Ganglia.
> The application works properly for about 3.5 days till all free memory has 
> been used.
> Then, micro batches start queuing up but none is served.
> Spark freezes. You can see in Ganglia that some memory is being freed but it 
> doesn't help the job to recover.
> Is it a memory/resource leak?
> The job uses back pressure and Kryo.
> The code has mapToPair(), groupByKey(), flatMap(), 
> persist(StorageLevel.MEMORY_AND_DISK_SER_2()) and repartition(19), then 
> stores to S3 using foreachRDD().
> Cluster size: 20 machines
> Spark configuration:
> spark.executor.extraJavaOptions -verbose:gc -XX:+PrintGCDetails 
> -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC 
> -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 
> -XX:PermSize=256M -XX:MaxPermSize=256M -XX:OnOutOfMemoryError='kill -9 %p' 
> spark.driver.extraJavaOptions -Dspark.driver.log.level=INFO 
> -XX:+UseConcMarkSweepGC -XX:PermSize=256M -XX:MaxPermSize=256M 
> -XX:OnOutOfMemoryError='kill -9 %p' 
> spark.master yarn-cluster
> spark.executor.instances 19
> spark.executor.cores 7
> spark.executor.memory 7500M
> spark.driver.memory 7500M
> spark.default.parallelism 133
> spark.yarn.executor.memoryOverhead 2950
> spark.yarn.driver.memoryOverhead 2950
> spark.eventLog.enabled false
> spark.eventLog.dir hdfs:///spark-logs/






[jira] [Commented] (SPARK-17380) Spark streaming with a multi shard Kinesis freezes after several days (memory/resource leak?)

2016-11-19 Thread Udit Mehrotra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15680270#comment-15680270
 ] 

Udit Mehrotra commented on SPARK-17380:
---

We came across this memory leak in the executor logs by using the JVM option 
'-Dio.netty.leakDetectionLevel=advanced'. It looks like good evidence of a 
memory leak and shows where the leaked buffer was created:

{noformat}
16/11/09 06:03:28 ERROR ResourceLeakDetector: LEAK: ByteBuf.release() was not 
called before it's garbage-collected. See 
http://netty.io/wiki/reference-counted-objects.html for more information.
Recent access records: 0
Created at:
	io.netty.buffer.CompositeByteBuf.<init>(CompositeByteBuf.java:103)
	io.netty.buffer.Unpooled.wrappedBuffer(Unpooled.java:335)
	io.netty.buffer.Unpooled.wrappedBuffer(Unpooled.java:247)
	org.apache.spark.util.io.ChunkedByteBuffer.toNetty(ChunkedByteBuffer.scala:69)
	org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$replicate(BlockManager.scala:1161)
	org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:976)
	org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:910)
	org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)
	org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:910)
	org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:700)
	org.apache.spark.streaming.receiver.BlockManagerBasedBlockHandler.storeBlock(ReceivedBlockHandler.scala:80)
	org.apache.spark.streaming.receiver.ReceiverSupervisorImpl.pushAndReportBlock(ReceiverSupervisorImpl.scala:158)
	org.apache.spark.streaming.receiver.ReceiverSupervisorImpl.pushArrayBuffer(ReceiverSupervisorImpl.scala:129)
	org.apache.spark.streaming.receiver.Receiver.store(Receiver.scala:133)
	org.apache.spark.streaming.kinesis.KinesisReceiver.org$apache$spark$streaming$kinesis$KinesisReceiver$$storeBlockWithRanges(KinesisReceiver.scala:282)
	org.apache.spark.streaming.kinesis.KinesisReceiver$GeneratedBlockHandler.onPushBlock(KinesisReceiver.scala:352)
	org.apache.spark.streaming.receiver.BlockGenerator.pushBlock(BlockGenerator.scala:297)
	org.apache.spark.streaming.receiver.BlockGenerator.org$apache$spark$streaming$receiver$BlockGenerator$$keepPushingBlocks(BlockGenerator.scala:269)
	org.apache.spark.streaming.receiver.BlockGenerator$$anon$1.run(BlockGenerator.scala:110)
{noformat}

Can we please have some action on this JIRA?
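
For anyone trying to reproduce this, the leak-detection flag mentioned above can be passed to the executors like so (a sketch; the exact submission command used here is not shown in the thread):

{code}
import org.apache.spark.SparkConf

// Enables Netty's advanced buffer-leak detection on the executors.
val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions", "-Dio.netty.leakDetectionLevel=advanced")
{code}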

> Spark streaming with a multi shard Kinesis freezes after several days 
> (memory/resource leak?)
> -
>
> Key: SPARK-17380
> URL: https://issues.apache.org/jira/browse/SPARK-17380
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.0
>Reporter: Xeto
> Attachments: exec_Leak_Hunter.zip, memory-after-freeze.png, memory.png
>
>
> Running Spark Streaming 2.0.0 on AWS EMR 5.0.0 consuming from Kinesis (125 
> shards).
> Used memory keeps growing all the time according to Ganglia.
> The application works properly for about 3.5 days till all free memory has 
> been used.
> Then, micro batches start queuing up but none is served.
> Spark freezes. You can see in Ganglia that some memory is being freed but it 
> doesn't help the job to recover.
> Is it a memory/resource leak?
> The job uses back pressure and Kryo.
> The code has mapToPair(), groupByKey(), flatMap(), 
> persist(StorageLevel.MEMORY_AND_DISK_SER_2()) and repartition(19), then 
> stores to S3 using foreachRDD().
> Cluster size: 20 machines
> Spark configuration:
> spark.executor.extraJavaOptions -verbose:gc -XX:+PrintGCDetails 
> -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC 
> -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 
> -XX:PermSize=256M -XX:MaxPermSize=256M -XX:OnOutOfMemoryError='kill -9 %p' 
> spark.driver.extraJavaOptions -Dspark.driver.log.level=INFO 
> -XX:+UseConcMarkSweepGC -XX:PermSize=256M -XX:MaxPermSize=256M 
> -XX:OnOutOfMemoryError='kill -9 %p' 
> spark.master yarn-cluster
> spark.executor.instances 19
> spark.executor.cores 7
> spark.executor.memory 7500M
> spark.driver.memory 7500M
> spark.default.parallelism 133
> spark.yarn.executor.memoryOverhead 2950
> spark.yarn.driver.memoryOverhead 2950
> spark.eventLog.enabled false
> spark.eventLog.dir hdfs:///spark-logs/






[jira] [Resolved] (SPARK-17062) Add --conf to mesos dispatcher process

2016-11-19 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-17062.

   Resolution: Fixed
 Assignee: Stavros Kontopoulos
Fix Version/s: 2.2.0

> Add --conf to mesos dispatcher process
> --
>
> Key: SPARK-17062
> URL: https://issues.apache.org/jira/browse/SPARK-17062
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 2.0.0
>Reporter: Stavros Kontopoulos
>Assignee: Stavros Kontopoulos
> Fix For: 2.2.0
>
>
> Sometimes we simply need to add a property to the Spark config for the Mesos 
> Dispatcher. The only option right now is to create a properties file.






[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests

2016-11-19 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15680080#comment-15680080
 ] 

Saikat Kanjilal commented on SPARK-9487:


Ok guess I spoke too soon :), onto the next set of challenges, jenkins build 
report is here:  
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/68897/


I ran each of these tests individually as well as together as a suite and they 
all passed, any ideas on how to address these?

> Use the same num. worker threads in Scala/Python unit tests
> ---
>
> Key: SPARK-9487
> URL: https://issues.apache.org/jira/browse/SPARK-9487
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL, Tests
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>  Labels: starter
> Attachments: ContextCleanerSuiteResults, HeartbeatReceiverSuiteResults
>
>
> In Python we use `local[4]` for unit tests, while in Scala/Java we use 
> `local[2]` and `local` for some unit tests in SQL, MLLib, and other 
> components. If the operation depends on partition IDs, e.g., random number 
> generator, this will lead to different result in Python and Scala/Java. It 
> would be nice to use the same number in all unit tests.
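
For context, the difference in question looks roughly like this (an illustration using plain SparkContext setup; the actual test harnesses vary by module):

{code}
import org.apache.spark.{SparkConf, SparkContext}

// PySpark unit tests effectively run with master "local[4]",
// while many Scala/Java suites use "local[2]" (or "local"):
val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("unit-test"))

// Anything keyed off partition IDs (e.g. per-partition seeded RNGs) can then
// produce different results between the Python and Scala/Java runs.
sc.stop()
{code}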






[jira] [Comment Edited] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests

2016-11-19 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15680080#comment-15680080
 ] 

Saikat Kanjilal edited comment on SPARK-9487 at 11/19/16 11:59 PM:
---

Ok guess I spoke too soon :), onto the next set of challenges, jenkins build 
report is here:  
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/68897/


I ran each of these tests individually as well as together as a suite locally 
and they all passed, any ideas on how to address these?


was (Author: kanjilal):
Ok guess I spoke too soon :), onto the next set of challenges, jenkins build 
report is here:  
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/68897/


I ran each of these tests individually as well as together as a suite and they 
all passed, any ideas on how to address these?

> Use the same num. worker threads in Scala/Python unit tests
> ---
>
> Key: SPARK-9487
> URL: https://issues.apache.org/jira/browse/SPARK-9487
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL, Tests
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>  Labels: starter
> Attachments: ContextCleanerSuiteResults, HeartbeatReceiverSuiteResults
>
>
> In Python we use `local[4]` for unit tests, while in Scala/Java we use 
> `local[2]` and `local` for some unit tests in SQL, MLLib, and other 
> components. If the operation depends on partition IDs, e.g., random number 
> generator, this will lead to different result in Python and Scala/Java. It 
> would be nice to use the same number in all unit tests.






[jira] [Commented] (SPARK-18447) Fix `Note:`/`NOTE:`/`Note that` across Python API documentation

2016-11-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15679902#comment-15679902
 ] 

Apache Spark commented on SPARK-18447:
--

User 'aditya1702' has created a pull request for this issue:
https://github.com/apache/spark/pull/15940

> Fix `Note:`/`NOTE:`/`Note that` across Python API documentation
> ---
>
> Key: SPARK-18447
> URL: https://issues.apache.org/jira/browse/SPARK-18447
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Aditya
>







[jira] [Assigned] (SPARK-18447) Fix `Note:`/`NOTE:`/`Note that` across Python API documentation

2016-11-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18447:


Assignee: (was: Apache Spark)

> Fix `Note:`/`NOTE:`/`Note that` across Python API documentation
> ---
>
> Key: SPARK-18447
> URL: https://issues.apache.org/jira/browse/SPARK-18447
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Aditya
>







[jira] [Assigned] (SPARK-18447) Fix `Note:`/`NOTE:`/`Note that` across Python API documentation

2016-11-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18447:


Assignee: Apache Spark

> Fix `Note:`/`NOTE:`/`Note that` across Python API documentation
> ---
>
> Key: SPARK-18447
> URL: https://issues.apache.org/jira/browse/SPARK-18447
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Aditya
>Assignee: Apache Spark
>







[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests

2016-11-19 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15679880#comment-15679880
 ] 

Saikat Kanjilal commented on SPARK-9487:


OK, fixed the unit test. I didn't have to resort to using Sets; I was able to 
compare the contents of each of the lists to verify the tests. The pull request 
is here: https://github.com/apache/spark/pull/15848

Once the pull request passes I will start working on fixing all the examples and 
the Python code. Let me know the next steps.

> Use the same num. worker threads in Scala/Python unit tests
> ---
>
> Key: SPARK-9487
> URL: https://issues.apache.org/jira/browse/SPARK-9487
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL, Tests
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>  Labels: starter
> Attachments: ContextCleanerSuiteResults, HeartbeatReceiverSuiteResults
>
>
> In Python we use `local[4]` for unit tests, while in Scala/Java we use 
> `local[2]` and `local` for some unit tests in SQL, MLLib, and other 
> components. If the operation depends on partition IDs, e.g., random number 
> generator, this will lead to different result in Python and Scala/Java. It 
> would be nice to use the same number in all unit tests.






[jira] [Commented] (SPARK-18447) Fix `Note:`/`NOTE:`/`Note that` across Python API documentation

2016-11-19 Thread Aditya (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15679826#comment-15679826
 ] 

Aditya commented on SPARK-18447:


Yes, I will work on this. I had some college lab exams, hence I couldn't 
complete it. I will let people know if I can't finish, so that it can be picked 
up by someone else as soon as possible.

> Fix `Note:`/`NOTE:`/`Note that` across Python API documentation
> ---
>
> Key: SPARK-18447
> URL: https://issues.apache.org/jira/browse/SPARK-18447
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Aditya
>







[jira] [Commented] (SPARK-18506) kafka 0.10 with Spark 2.02 auto.offset.reset=earliest will only read from a single partition on a multi partition topic

2016-11-19 Thread Heji Kim (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15679667#comment-15679667
 ] 

Heji Kim commented on SPARK-18506:
--

We have spent two weeks trying different configurations and stripping 
everything down. The only thing we have not tried is a different cloud 
provider; we are using GCE. Since previous versions work properly, as does the 
"latest" offset setting, we did not think the problem was in the infrastructure 
layer.

Where does Databricks do the Spark cluster regression testing? I thought it 
might be AWS? If you have a working example with multiple partitions that has 
been tested on an actual cluster you use for regression testing, we would be 
grateful for any pointers.

We have been upgrading our drivers since Spark 1.2 (partly on AWS, and on 
GCP/GCE since 1.6) and this is the first time we have hit such a blocker. I do 
want the Spark team to know that our team tried our absolute best to verify 
that there was nothing wrong with our system configuration, and we spent more 
than 100 hours before posting this issue.
 

> kafka 0.10 with Spark 2.02 auto.offset.reset=earliest will only read from a 
> single partition on a multi partition topic
> ---
>
> Key: SPARK-18506
> URL: https://issues.apache.org/jira/browse/SPARK-18506
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.2
> Environment: Problem occurs both in Hadoop/YARN 2.7.3 and Spark 
> standalone mode 2.0.2 
> with Kafka 0.10.1.0.   
>Reporter: Heji Kim
>
> Our team is trying to upgrade to Spark 2.0.2/Kafka 
> 0.10.1.0/spark-streaming-kafka-0-10_2.11 (v 2.0.2) and we cannot get our 
> drivers to read all partitions of a single stream when kafka 
> auto.offset.reset=earliest running on a real cluster(separate VM nodes). 
> When we run our drivers with auto.offset.reset=latest ingesting from a single 
> kafka topic with multiple partitions (usually 10 but problem shows up  with 
> only 3 partitions), the driver reads correctly from all partitions.  
> Unfortunately, we need "earliest" for exactly once semantics.
> In the same kafka 0.10.1.0/spark 2.x setup, our legacy driver using 
> spark-streaming-kafka-0-8_2.11 with the prior setting 
> auto.offset.reset=smallest runs correctly.
> We have tried the following configurations in trying to isolate our problem 
> but it is only auto.offset.reset=earliest on a "real multi-machine cluster" 
> which causes this problem.
> 1. Ran with spark standalone cluster(4 Debian nodes, 8vCPU/30GB each)  
> instead of YARN 2.7.3. Single partition read problem persists both cases. 
> Please note this problem occurs on an actual cluster of separate VM nodes 
> (but not when our engineer runs in as a cluster on his own Mac.)
> 2. Ran with spark 2.1 nightly build for the last 10 days. Problem persists.
> 3. Turned off checkpointing. Problem persists with or without checkpointing.
> 4. Turned off backpressure. Problem persists with or without backpressure.
> 5. Tried both partition.assignment.strategy RangeAssignor and 
> RoundRobinAssignor. Broken with both.
> 6. Tried both LocationStrategies (PreferConsistent/PreferFixed). Broken with 
> both.
> 7. Tried the simplest scala driver that only logs.  (Our team uses java.) 
> Broken with both.
> 8. Tried increasing GCE capacity for cluster but already we were highly 
> overprovisioned for cores and memory. Also tried ramping up executors and 
> cores.  Since driver works with auto.offset.reset=latest, we have ruled out 
> GCP cloud infrastructure issues.
> When we turn on the debug logs, we sometimes see partitions being set to 
> different offset configuration even though the consumer config correctly 
> indicates auto.offset.reset=earliest. 
> {noformat}
> 8 DEBUG Resetting offset for partition simple_test-8 to earliest offset. 
> (org.apache.kafka.clients.consumer.internals.Fetcher)
> 9 DEBUG Resetting offset for partition simple_test-9 to latest offset. 
> (org.apache.kafka.clients.consumer.internals.Fetcher)
> 8 TRACE Sending ListOffsetRequest 
> {replica_id=-1,topics=[{topic=simple_test,partitions=[{partition=8,timestamp=-2}]}]}
>  to broker 10.102.20.12:9092 (id: 12 rack: null) 
> (org.apache.kafka.clients.consumer.internals.Fetcher)
> 9 TRACE Sending ListOffsetRequest 
> {replica_id=-1,topics=[{topic=simple_test,partitions=[{partition=9,timestamp=-1}]}]}
>  to broker 10.102.20.13:9092 (id: 13 rack: null) 
> (org.apache.kafka.clients.consumer.internals.Fetcher)
> 8 TRACE Received ListOffsetResponse 
> {responses=[{topic=simple_test,partition_responses=[{partition=8,error_code=0,timestamp=-1,offset=0}]}]}
>  from broker 10.102.20.12:9092 (id: 12 rack: null) 
> (org.apache.kafka.clients.consumer.internals.Fetcher)

[jira] [Updated] (SPARK-18506) kafka 0.10 with Spark 2.02 auto.offset.reset=earliest will only read from a single partition on a multi partition topic

2016-11-19 Thread Heji Kim (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Heji Kim updated SPARK-18506:
-
Description: 
Our team is trying to upgrade to Spark 2.0.2/Kafka 
0.10.1.0/spark-streaming-kafka-0-10_2.11 (v 2.0.2) and we cannot get our 
drivers to read all partitions of a single stream when kafka 
auto.offset.reset=earliest running on a real cluster(separate VM nodes). 

When we run our drivers with auto.offset.reset=latest ingesting from a single 
kafka topic with multiple partitions (usually 10 but problem shows up  with 
only 3 partitions), the driver reads correctly from all partitions.  
Unfortunately, we need "earliest" for exactly once semantics.

In the same kafka 0.10.1.0/spark 2.x setup, our legacy driver using 
spark-streaming-kafka-0-8_2.11 with the prior setting 
auto.offset.reset=smallest runs correctly.

We have tried the following configurations in trying to isolate our problem but 
it is only auto.offset.reset=earliest on a "real multi-machine cluster" which 
causes this problem.

1. Ran with a Spark standalone cluster (4 Debian nodes, 8 vCPU/30 GB each) instead 
of YARN 2.7.3. The single-partition read problem persists in both cases. Please note 
this problem occurs on an actual cluster of separate VM nodes (but not when our 
engineer runs it as a cluster on his own Mac).

2. Ran with spark 2.1 nightly build for the last 10 days. Problem persists.
3. Turned off checkpointing. Problem persists with or without checkpointing.
4. Turned off backpressure. Problem persists with or without backpressure.
5. Tried both partition.assignment.strategy RangeAssignor and 
RoundRobinAssignor. Broken with both.
6. Tried both LocationStrategies (PreferConsistent/PreferFixed). Broken with 
both.
7. Tried the simplest scala driver that only logs.  (Our team uses java.) 
Broken with both.
8. Tried increasing GCE capacity for cluster but already we were highly 
overprovisioned for cores and memory. Also tried ramping up executors and 
cores.  Since driver works with auto.offset.reset=latest, we have ruled out GCP 
cloud infrastructure issues.


When we turn on the debug logs, we sometimes see partitions being set to 
different offset configuration even though the consumer config correctly 
indicates auto.offset.reset=earliest. 
{noformat}
8 DEBUG Resetting offset for partition simple_test-8 to earliest offset. 
(org.apache.kafka.clients.consumer.internals.Fetcher)
9 DEBUG Resetting offset for partition simple_test-9 to latest offset. 
(org.apache.kafka.clients.consumer.internals.Fetcher)
8 TRACE Sending ListOffsetRequest 
{replica_id=-1,topics=[{topic=simple_test,partitions=[{partition=8,timestamp=-2}]}]}
 to broker 10.102.20.12:9092 (id: 12 rack: null) 
(org.apache.kafka.clients.consumer.internals.Fetcher)
9 TRACE Sending ListOffsetRequest 
{replica_id=-1,topics=[{topic=simple_test,partitions=[{partition=9,timestamp=-1}]}]}
 to broker 10.102.20.13:9092 (id: 13 rack: null) 
(org.apache.kafka.clients.consumer.internals.Fetcher)
8 TRACE Received ListOffsetResponse 
{responses=[{topic=simple_test,partition_responses=[{partition=8,error_code=0,timestamp=-1,offset=0}]}]}
 from broker 10.102.20.12:9092 (id: 12 rack: null) 
(org.apache.kafka.clients.consumer.internals.Fetcher)
9 TRACE Received ListOffsetResponse 
{responses=[{topic=simple_test,partition_responses=[{partition=9,error_code=0,timestamp=-1,offset=66724}]}]}
 from broker 10.102.20.13:9092 (id: 13 rack: null) 
(org.apache.kafka.clients.consumer.internals.Fetcher)
8 DEBUG Fetched {timestamp=-1, offset=0} for partition simple_test-8 
(org.apache.kafka.clients.consumer.internals.Fetcher)
9 DEBUG Fetched {timestamp=-1, offset=66724} for partition simple_test-9 
(org.apache.kafka.clients.consumer.internals.Fetcher)
{noformat}

I've enclosed below the completely stripped-down trivial test driver that shows 
this behavior. After spending two weeks trying all combinations with a really 
stripped-down driver, we think there might be a bug in the Kafka/Spark 
integration; or, if the Kafka 0.10/Spark upgrade needs special configuration, it 
would be fantastic if the documentation made that clearer. Either way, we 
currently cannot upgrade.

{code}
package com.x.labs.analytics.diagnostics.spark.drivers

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies
import org.apache.spark.streaming.kafka010.ConsumerStrategies


/**
  *
  * This driver is only for pulling data from the stream and logging to output 
just to isolate single partition bug
  */
object SimpleKafkaLoggingDriver {
  def main(args: Array[String]) {
    if (args.length != 4) {
      System.err.println("Usage: SimpleTestDriver <brokers> <topic> <groupId> <offsetReset>")
      System.exit(1)
    }

val Array(brokers, topic, groupId, 

[jira] [Updated] (SPARK-18506) kafka 0.10 with Spark 2.02 auto.offset.reset=earliest will only read from a single partition on a multi partition topic

2016-11-19 Thread Heji Kim (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Heji Kim updated SPARK-18506:
-
Description: 
Our team is trying to upgrade to Spark 2.0.2/Kafka 
0.10.1.0/spark-streaming-kafka-0-10_2.11 (v 2.0.2) and we cannot get our 
drivers to read all partitions of a single stream when kafka 
auto.offset.reset=earliest.  
When we run our drivers with auto.offset.reset=latest ingesting from a single 
kafka topic with multiple partitions (usually 10 but problem shows up  with 
only 3 partitions), the driver reads correctly from all partitions.  
Unfortunately, we need "earliest" for exactly once semantics.

In the same kafka 0.10.1.0/spark 2.x setup, our legacy driver using 
spark-streaming-kafka-0-8_2.11 with the prior setting 
auto.offset.reset=smallest runs correctly.

We have tried the following configurations in trying to isolate our problem but 
it is only auto.offset.reset=earliest on a "real multi-machine cluster" which 
causes this problem.

1. Ran with a Spark standalone cluster (4 Debian nodes, 8 vCPU/30 GB each) instead 
of YARN 2.7.3. The single-partition read problem persists in both cases. Please note 
this problem occurs on an actual cluster of separate VM nodes (but not when our 
engineer runs it as a cluster on his own Mac).

2. Ran with spark 2.1 nightly build for the last 10 days. Problem persists.
3. Turned off checkpointing. Problem persists with or without checkpointing.
4. Turned off backpressure. Problem persists with or without backpressure.
5. Tried both partition.assignment.strategy RangeAssignor and 
RoundRobinAssignor. Broken with both.
6. Tried both LocationStrategies (PreferConsistent/PreferFixed). Broken with 
both.
7. Tried the simplest Scala driver that only logs (our team uses Java). Broken with 
both the Java and Scala drivers.
8. Tried increasing GCE capacity for the cluster, though we were already highly 
overprovisioned for cores and memory. Also tried ramping up executors and cores. 
Since the driver works with auto.offset.reset=latest, we have ruled out GCP cloud 
infrastructure issues.


When we turn on the debug logs, we sometimes see partitions being reset to 
different offset configurations even though the consumer config correctly 
indicates auto.offset.reset=earliest. 
{noformat}
8 DEBUG Resetting offset for partition simple_test-8 to earliest offset. 
(org.apache.kafka.clients.consumer.internals.Fetcher)
9 DEBUG Resetting offset for partition simple_test-9 to latest offset. 
(org.apache.kafka.clients.consumer.internals.Fetcher)
8 TRACE Sending ListOffsetRequest 
{replica_id=-1,topics=[{topic=simple_test,partitions=[{partition=8,timestamp=-2}]}]}
 to broker 10.102.20.12:9092 (id: 12 rack: null) 
(org.apache.kafka.clients.consumer.internals.Fetcher)
9 TRACE Sending ListOffsetRequest 
{replica_id=-1,topics=[{topic=simple_test,partitions=[{partition=9,timestamp=-1}]}]}
 to broker 10.102.20.13:9092 (id: 13 rack: null) 
(org.apache.kafka.clients.consumer.internals.Fetcher)
8 TRACE Received ListOffsetResponse 
{responses=[{topic=simple_test,partition_responses=[{partition=8,error_code=0,timestamp=-1,offset=0}]}]}
 from broker 10.102.20.12:9092 (id: 12 rack: null) 
(org.apache.kafka.clients.consumer.internals.Fetcher)
9 TRACE Received ListOffsetResponse 
{responses=[{topic=simple_test,partition_responses=[{partition=9,error_code=0,timestamp=-1,offset=66724}]}]}
 from broker 10.102.20.13:9092 (id: 13 rack: null) 
(org.apache.kafka.clients.consumer.internals.Fetcher)
8 DEBUG Fetched {timestamp=-1, offset=0} for partition simple_test-8 
(org.apache.kafka.clients.consumer.internals.Fetcher)
9 DEBUG Fetched {timestamp=-1, offset=66724} for partition simple_test-9 
(org.apache.kafka.clients.consumer.internals.Fetcher)
{noformat}

I've enclosed below the completely stripped-down trivial test driver that shows 
this behavior. After spending two weeks trying every combination with this driver, 
we think there is either a bug in the Kafka/Spark integration or, if the Kafka 
0.10/Spark upgrade needs special configuration, the documentation should make that 
clearer. Either way, we currently cannot upgrade.

{code}
package com.x.labs.analytics.diagnostics.spark.drivers

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies
import org.apache.spark.streaming.kafka010.ConsumerStrategies


/**
  *
  * This driver is only for pulling data from the stream and logging to output 
just to isolate single partition bug
  */
object SimpleKafkaLoggingDriver {
  def main(args: Array[String]) {
if (args.length != 4) {
  System.err.println("Usage: SimpleTestDriver  
  ")
  System.exit(1)
}

val Array(brokers, topic, groupId, offsetReset) = args
val preferredHosts = 
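// ---------------------------------------------------------------------------
// NOTE: the reporter's driver is cut off at this point in the digest. What
// follows is a minimal, hypothetical sketch of how such a kafka010 driver is
// typically completed; the names (preferredHosts, kafkaParams, stream) and the
// 5-second batch interval are illustrative assumptions, not the reporter's
// original code.
// ---------------------------------------------------------------------------
    val preferredHosts = LocationStrategies.PreferConsistent
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> brokers,
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> groupId,
      "auto.offset.reset" -> offsetReset,
      "enable.auto.commit" -> (false: java.lang.Boolean))

    val conf = new SparkConf().setAppName("SimpleKafkaLoggingDriver")
    val ssc = new StreamingContext(conf, Seconds(5))

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, preferredHosts,
      ConsumerStrategies.Subscribe[String, String](List(topic), kafkaParams))

    // Log the offset range read from every Kafka partition in each batch, so a
    // single-partition read shows up immediately in the output.
    stream.foreachRDD { rdd =>
      val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      ranges.foreach(r => println(s"${r.topic}-${r.partition}: ${r.fromOffset} -> ${r.untilOffset}"))
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
{code}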

[jira] [Commented] (SPARK-3359) `sbt/sbt unidoc` doesn't work with Java 8

2016-11-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15679611#comment-15679611
 ] 

Apache Spark commented on SPARK-3359:
-

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/15939

> `sbt/sbt unidoc` doesn't work with Java 8
> -
>
> Key: SPARK-3359
> URL: https://issues.apache.org/jira/browse/SPARK-3359
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.1.0
>Reporter: Xiangrui Meng
>Priority: Minor
>
> It seems that Java 8 is stricter on JavaDoc. I got many error messages like
> {code}
> [error] 
> /Users/meng/src/spark-mengxr/core/target/java/org/apache/hadoop/mapred/SparkHadoopMapRedUtil.java:2:
>  error: modifier private not allowed here
> [error] private abstract interface SparkHadoopMapRedUtil {
> [error]  ^
> {code}
> This is minor because we can always use Java 6/7 to generate the doc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18073) Migrate wiki to spark.apache.org web site

2016-11-19 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-18073:
--
Description: 
Per 
http://apache-spark-developers-list.1001551.n3.nabble.com/Mini-Proposal-Make-it-easier-to-contribute-to-the-contributing-to-Spark-Guide-td19493.html
 , let's consider migrating all wiki pages to documents at 
github.com/apache/spark-website (i.e. spark.apache.org).

Some reasons:
* No pull request system or history for changes to the wiki
* Separate, not-so-clear system for granting write access to wiki
* Wiki doesn't change much
* One less place to maintain or look for docs

The idea would be to then update all wiki pages with a message pointing to the 
new home of the information (or message saying it's obsolete).

Here are the current wikis and my general proposal for what to do with the 
content:

* Additional Language Bindings -> roll this into Third Party Projects
* Committers -> Migrate to a new /committers.html page; wiki already linked 
under Community menu
* Contributing to Spark -> a new /contributing.html page under Community menu
** Jira Permissions Scheme -> obsolete
** Spark Code Style Guide -> roll this into new contributing.html page
* Development Discussions -> obsolete
* Powered By Spark -> new /powered-by.html linked by the existing Community 
menu item
* Preparing Spark Releases -> see below; roll into "versioning policy"
* Profiling Spark Applications -> roll into Useful Developer Tools
** Profiling Spark Applications Using YourKit -> roll into Useful Developer 
Tools
* Spark Internals -> new page under Developer menu
** Java API Internals
** PySpark Internals
** Shuffle Internals
** Spark SQL Internals
** Web UI Internals
* Spark QA Infrastructure -> new element in new Developer menu
* Spark Versioning Policy -> new page living under Developer menu
** spark-ec2 AMI list and install file version mappings -> obsolete
** Spark-Shark version mapping -> obsolete
* Third Party Projects -> new page; already linked as Community menu item
* Useful Developer Tools -> new page under new Developer menu 
** Jenkins -> obsolete, remove


Of course, another outcome is to just remove outdated wikis, migrate some, 
leave the rest.

Thoughts?

  was:
Per 
http://apache-spark-developers-list.1001551.n3.nabble.com/Mini-Proposal-Make-it-easier-to-contribute-to-the-contributing-to-Spark-Guide-td19493.html
 , let's consider migrating all wiki pages to documents at 
github.com/apache/spark-website (i.e. spark.apache.org).

Some reasons:
* No pull request system or history for changes to the wiki
* Separate, not-so-clear system for granting write access to wiki
* Wiki doesn't change much
* One less place to maintain or look for docs

The idea would be to then update all wiki pages with a message pointing to the 
new home of the information (or message saying it's obsolete).

Here are the current wikis and my general proposal for what to do with the 
content:

* Additional Language Bindings -> roll this into wherever Third Party Projects 
ends up
* Committers -> Migrate to a new /committers.html page, linked under Community 
menu (alread exists)
* Contributing to Spark -> Make this CONTRIBUTING.md? or a new 
/contributing.html page under Community menu
** Jira Permissions Scheme -> obsolete
** Spark Code Style Guide -> roll this into new contributing.html page
* Development Discussions -> obsolete?
* Powered By Spark -> Make into new /powered-by.html linked by the existing 
Commnunity menu item
* Preparing Spark Releases -> see below; roll into where "versioning policy" 
goes?
* Profiling Spark Applications -> roll into where Useful Developer Tools goes
** Profiling Spark Applications Using YourKit -> ditto
* Spark Internals -> all of these look somewhat to very stale; remove?
** Java API Internals
** PySpark Internals
** Shuffle Internals
** Spark SQL Internals
** Web UI Internals
* Spark QA Infrastructure -> tough one. Good info to document; does it belong 
on the website? we can just migrate it
* Spark Versioning Policy -> new page living under Community (?) that documents 
release policy and process (better menu?)
** spark-ec2 AMI list and install file version mappings -> obsolete
** Spark-Shark version mapping -> obsolete
* Third Party Projects -> new Community menu item
* Useful Developer Tools -> new page under new Developer menu? Community?
** Jenkins -> obsolete, remove


Of course, another outcome is to just remove outdated wikis, migrate some, 
leave the rest.

Thoughts?


> Migrate wiki to spark.apache.org web site
> -
>
> Key: SPARK-18073
> URL: https://issues.apache.org/jira/browse/SPARK-18073
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.0.1
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: 

[jira] [Updated] (SPARK-18073) Migrate wiki to spark.apache.org web site

2016-11-19 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-18073:
--
Assignee: Sean Owen
Priority: Critical  (was: Major)

WIP: https://github.com/apache/spark-website/pull/19

> Migrate wiki to spark.apache.org web site
> -
>
> Key: SPARK-18073
> URL: https://issues.apache.org/jira/browse/SPARK-18073
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.0.1
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Critical
>
> Per 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Mini-Proposal-Make-it-easier-to-contribute-to-the-contributing-to-Spark-Guide-td19493.html
>  , let's consider migrating all wiki pages to documents at 
> github.com/apache/spark-website (i.e. spark.apache.org).
> Some reasons:
> * No pull request system or history for changes to the wiki
> * Separate, not-so-clear system for granting write access to wiki
> * Wiki doesn't change much
> * One less place to maintain or look for docs
> The idea would be to then update all wiki pages with a message pointing to 
> the new home of the information (or message saying it's obsolete).
> Here are the current wikis and my general proposal for what to do with the 
> content:
> * Additional Language Bindings -> roll this into wherever Third Party 
> Projects ends up
> * Committers -> Migrate to a new /committers.html page, linked under 
> Community menu (alread exists)
> * Contributing to Spark -> Make this CONTRIBUTING.md? or a new 
> /contributing.html page under Community menu
> ** Jira Permissions Scheme -> obsolete
> ** Spark Code Style Guide -> roll this into new contributing.html page
> * Development Discussions -> obsolete?
> * Powered By Spark -> Make into new /powered-by.html linked by the existing 
> Commnunity menu item
> * Preparing Spark Releases -> see below; roll into where "versioning policy" 
> goes?
> * Profiling Spark Applications -> roll into where Useful Developer Tools goes
> ** Profiling Spark Applications Using YourKit -> ditto
> * Spark Internals -> all of these look somewhat to very stale; remove?
> ** Java API Internals
> ** PySpark Internals
> ** Shuffle Internals
> ** Spark SQL Internals
> ** Web UI Internals
> * Spark QA Infrastructure -> tough one. Good info to document; does it belong 
> on the website? we can just migrate it
> * Spark Versioning Policy -> new page living under Community (?) that 
> documents release policy and process (better menu?)
> ** spark-ec2 AMI list and install file version mappings -> obsolete
> ** Spark-Shark version mapping -> obsolete
> * Third Party Projects -> new Community menu item
> * Useful Developer Tools -> new page under new Developer menu? Community?
> ** Jenkins -> obsolete, remove
> Of course, another outcome is to just remove outdated wikis, migrate some, 
> leave the rest.
> Thoughts?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18475) Be able to provide higher parallelization for StructuredStreaming Kafka Source

2016-11-19 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15679472#comment-15679472
 ] 

Cody Koeninger edited comment on SPARK-18475 at 11/19/16 4:02 PM:
--

Yes, an RDD does have an ordering guarantee, it's an iterator per partition, 
same as Kafka.  Yes, that guarantee is part of the Kafka data model (Burak, if 
you don't believe me, go reread 
http://kafka.apache.org/documentation.html#introduction  search for "order").  
Because the direct stream (and the structured stream that uses the same model) 
has a 1:1 correspondence between kafka partition and spark partition, that 
guarantee is preserved.  The existing distortions between the Kafka model and 
the direct stream / structured stream are enough as it is, we don't need to add 
more.



was (Author: c...@koeninger.org):
Yes, an RDD does have an ordering guarantee, it's an iterator per partition, 
same as Kafka.  Yes, that guarantee is part of the Kafka data model (Burak, if 
you don't believe me, go reread 
http://kafka.apache.org/documentation.html#introduction  search for "order").  
Because the direct stream (and the structured stream that uses the same model) 
has a 1:! correspondence between kafka partition and spark partition, that 
guarantee is preserved.  The existing distortions between the Kafka model and 
the direct stream / structured stream are enough as it is, we don't need to add 
more.


> Be able to provide higher parallelization for StructuredStreaming Kafka Source
> --
>
> Key: SPARK-18475
> URL: https://issues.apache.org/jira/browse/SPARK-18475
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Burak Yavuz
>
> Right now the StructuredStreaming Kafka Source creates as many Spark tasks as 
> there are TopicPartitions that we're going to read from Kafka.
> This doesn't work well when we have data skew, and there is no reason why we 
> shouldn't be able to increase parallelism further, i.e. have multiple Spark 
> tasks reading from the same Kafka TopicPartition.
> What this will mean is that we won't be able to use the "CachedKafkaConsumer" 
> for what it is defined for (being cached) in this use case, but the extra 
> overhead is worth handling data skew and increasing parallelism especially in 
> ETL use cases.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18475) Be able to provide higher parallelization for StructuredStreaming Kafka Source

2016-11-19 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15679472#comment-15679472
 ] 

Cody Koeninger commented on SPARK-18475:


Yes, an RDD does have an ordering guarantee, it's an iterator per partition, 
same as Kafka.  Yes, that guarantee is part of the Kafka data model (Burak, if 
you don't believe me, go reread 
http://kafka.apache.org/documentation.html#introduction  search for "order").  
Because the direct stream (and the structured stream that uses the same model) 
has a 1:1 correspondence between kafka partition and spark partition, that 
guarantee is preserved.  The existing distortions between the Kafka model and 
the direct stream / structured stream are enough as it is, we don't need to add 
more.
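
For readers less familiar with the kafka010 direct stream, here is a minimal, 
hypothetical sketch (the stream value and record handling are assumptions, not 
code from this thread) of how the 1:1 Kafka-partition-to-Spark-partition mapping 
and the per-partition ordering described above can be observed:

{code}
import org.apache.spark.TaskContext
import org.apache.spark.streaming.kafka010.{HasOffsetRanges, OffsetRange}

// `stream` is assumed to be the DStream returned by KafkaUtils.createDirectStream.
stream.foreachRDD { rdd =>
  // One OffsetRange per Kafka TopicPartition, indexed by Spark partition id.
  val ranges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd.foreachPartition { iter =>
    val range = ranges(TaskContext.get.partitionId)
    // Every record handed to this iterator belongs to range.topic / range.partition
    // and arrives in increasing offset order, mirroring Kafka's own guarantee.
    iter.foreach(record => println(s"${range.topic}-${range.partition} offset ${record.offset}"))
  }
}
{code}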


> Be able to provide higher parallelization for StructuredStreaming Kafka Source
> --
>
> Key: SPARK-18475
> URL: https://issues.apache.org/jira/browse/SPARK-18475
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Burak Yavuz
>
> Right now the StructuredStreaming Kafka Source creates as many Spark tasks as 
> there are TopicPartitions that we're going to read from Kafka.
> This doesn't work well when we have data skew, and there is no reason why we 
> shouldn't be able to increase parallelism further, i.e. have multiple Spark 
> tasks reading from the same Kafka TopicPartition.
> What this will mean is that we won't be able to use the "CachedKafkaConsumer" 
> for what it is defined for (being cached) in this use case, but the extra 
> overhead is worth handling data skew and increasing parallelism especially in 
> ETL use cases.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17436) dataframe.write sometimes does not keep sorting

2016-11-19 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15679303#comment-15679303
 ] 

Sean Owen commented on SPARK-17436:
---

Opening a pull request is available to everyone. Maybe you are trying to push a 
branch to apache/spark, which you can't do. You push branches to your own fork. 
In any event, can you say anything at all about the proposed change here?

> dataframe.write sometimes does not keep sorting
> ---
>
> Key: SPARK-17436
> URL: https://issues.apache.org/jira/browse/SPARK-17436
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.1, 1.6.2, 2.0.0
>Reporter: Ran Haim
>
> When using partition by,  datawriter can sometimes mess up an ordered 
> dataframe.
> The problem originates in 
> org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.
> In the writeRows method when too many files are opened (configurable), it 
> starts inserting rows to UnsafeKVExternalSorter, then it reads all the rows 
> again from the sorter and writes them to the corresponding files.
> The problem is that the sorter actually sorts the rows using the partition 
> key, and that can sometimes mess up the original sort (or secondary sort if 
> you will).
> I think the best way to fix it is to stop using a sorter, and just put the 
> rows in a map using key as partition key and value as an arraylist, and then 
> just walk through all the keys and write it in the original order - this will 
> probably be faster as there no need for ordering.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18509) spark-ec2 init.sh requests .tgz files not available at http://s3.amazonaws.com/spark-related-packages

2016-11-19 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-18509.
---
Resolution: Not A Problem

Yes, best to post the question to amplab/spark-ec2

> spark-ec2 init.sh requests .tgz files not available at 
> http://s3.amazonaws.com/spark-related-packages
> -
>
> Key: SPARK-18509
> URL: https://issues.apache.org/jira/browse/SPARK-18509
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 1.6.3, 2.0.1
> Environment: AWS EC2, AWS Linux, OS X 10.12.x (local)
>Reporter: Peter B. Pearman
>  Labels: beginner
>   Original Estimate: 3h
>  Remaining Estimate: 3h
>
> When I run the spark-ec2 script in a local spark-1.6.3 installation, the 
> error 'ERROR: Unknown Spark version' is generated:
> Initializing spark
> --2016-11-18 22:33:06--  
> http://s3.amazonaws.com/spark-related-packages/spark-1.6.3-bin-hadoop1.tgz
> Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.1.3
> Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.1.3|:80... connected.
> HTTP request sent, awaiting response... 404 Not Found
> 2016-11-18 22:33:06 ERROR 404: Not Found.
> ERROR: Unknown Spark version
> spark/init.sh: line 137: return: -1: invalid option
> return: usage: return [n]
> Unpacking Spark
> tar (child): spark-*.tgz: Cannot open: No such file or directory
> tar (child): Error is not recoverable: exiting now
> tar: Child returned status 2
> tar: Error is not recoverable: exiting now
> rm: cannot remove `spark-*.tgz': No such file or directory
> mv: missing destination file operand after `spark'
> Try `mv --help' for more information.
> [timing] spark init:  00h 00m 00s
> I think this happens when init.sh executes these lines:
>   if [[ "$HADOOP_MAJOR_VERSION" == "1" ]]; then
> wget 
> http://s3.amazonaws.com/spark-related-packages/spark-$SPARK_VERSION-bin-hadoop1.tgz
>   elif [[ "$HADOOP_MAJOR_VERSION" == "2" ]]; then
> wget 
> http://s3.amazonaws.com/spark-related-packages/spark-$SPARK_VERSION-bin-cdh4.tgz
>   else
> wget 
> http://s3.amazonaws.com/spark-related-packages/spark-$SPARK_VERSION-bin-hadoop2.4.tgz
>   fi
>   if [ $? != 0 ]; then
> echo "ERROR: Unknown Spark version"
> return -1
>   fi
> spark-1.6.3-bin-hadoop1.tgz does not exist on 
> 
> Similarly, a spark-2.0.1-bin-hadoop1.tgz also does not exist at that 
> location. So with these versions, if in init.sh [ "$HADOOP_MAJOR_VERSION" == 
> "1" ] evaluates to True, spark installation on the EC2 cluster will fail. 
> Related (perhaps a different bug?) is: I have installed 
> spark-1.6.3-bin-hadoop2.6.tgz, but if the error is generated by init.sh, then 
> it appears that HADOOP_MAJOR_VERSION ==1 is True, otherwise a different spark 
> version would be requested from 
> . I am not experienced 
> enough to verify this. My installed hadoop version should be 2.6.  Please 
> tell me if this should be a different bug report.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-18509) spark-ec2 init.sh requests .tgz files not available at http://s3.amazonaws.com/spark-related-packages

2016-11-19 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reopened SPARK-18509:
---

> spark-ec2 init.sh requests .tgz files not available at 
> http://s3.amazonaws.com/spark-related-packages
> -
>
> Key: SPARK-18509
> URL: https://issues.apache.org/jira/browse/SPARK-18509
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 1.6.3, 2.0.1
> Environment: AWS EC2, AWS Linux, OS X 10.12.x (local)
>Reporter: Peter B. Pearman
>  Labels: beginner
>   Original Estimate: 3h
>  Remaining Estimate: 3h
>
> When I run the spark-ec2 script in a local spark-1.6.3 installation, the 
> error 'ERROR: Unknown Spark version' is generated:
> Initializing spark
> --2016-11-18 22:33:06--  
> http://s3.amazonaws.com/spark-related-packages/spark-1.6.3-bin-hadoop1.tgz
> Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.1.3
> Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.1.3|:80... connected.
> HTTP request sent, awaiting response... 404 Not Found
> 2016-11-18 22:33:06 ERROR 404: Not Found.
> ERROR: Unknown Spark version
> spark/init.sh: line 137: return: -1: invalid option
> return: usage: return [n]
> Unpacking Spark
> tar (child): spark-*.tgz: Cannot open: No such file or directory
> tar (child): Error is not recoverable: exiting now
> tar: Child returned status 2
> tar: Error is not recoverable: exiting now
> rm: cannot remove `spark-*.tgz': No such file or directory
> mv: missing destination file operand after `spark'
> Try `mv --help' for more information.
> [timing] spark init:  00h 00m 00s
> I think this happens when init.sh executes these lines:
>   if [[ "$HADOOP_MAJOR_VERSION" == "1" ]]; then
> wget 
> http://s3.amazonaws.com/spark-related-packages/spark-$SPARK_VERSION-bin-hadoop1.tgz
>   elif [[ "$HADOOP_MAJOR_VERSION" == "2" ]]; then
> wget 
> http://s3.amazonaws.com/spark-related-packages/spark-$SPARK_VERSION-bin-cdh4.tgz
>   else
> wget 
> http://s3.amazonaws.com/spark-related-packages/spark-$SPARK_VERSION-bin-hadoop2.4.tgz
>   fi
>   if [ $? != 0 ]; then
> echo "ERROR: Unknown Spark version"
> return -1
>   fi
> spark-1.6.3-bin-hadoop1.tgz does not exist on 
> 
> Similarly, a spark-2.0.1-bin-hadoop1.tgz also does not exist at that 
> location. So with these versions, if in init.sh [ "$HADOOP_MAJOR_VERSION" == 
> "1" ] evaluates to True, spark installation on the EC2 cluster will fail. 
> Related (perhaps a different bug?) is: I have installed 
> spark-1.6.3-bin-hadoop2.6.tgz, but if the error is generated by init.sh, then 
> it appears that HADOOP_MAJOR_VERSION ==1 is True, otherwise a different spark 
> version would be requested from 
> . I am not experienced 
> enough to verify this. My installed hadoop version should be 2.6.  Please 
> tell me if this should be a different bug report.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-18509) spark-ec2 init.sh requests .tgz files not available at http://s3.amazonaws.com/spark-related-packages

2016-11-19 Thread Peter B. Pearman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter B. Pearman closed SPARK-18509.

Resolution: Fixed

Not an Apache project bucket

> spark-ec2 init.sh requests .tgz files not available at 
> http://s3.amazonaws.com/spark-related-packages
> -
>
> Key: SPARK-18509
> URL: https://issues.apache.org/jira/browse/SPARK-18509
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 1.6.3, 2.0.1
> Environment: AWS EC2, AWS Linux, OS X 10.12.x (local)
>Reporter: Peter B. Pearman
>  Labels: beginner
>   Original Estimate: 3h
>  Remaining Estimate: 3h
>
> When I run the spark-ec2 script in a local spark-1.6.3 installation, the 
> error 'ERROR: Unknown Spark version' is generated:
> Initializing spark
> --2016-11-18 22:33:06--  
> http://s3.amazonaws.com/spark-related-packages/spark-1.6.3-bin-hadoop1.tgz
> Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.1.3
> Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.1.3|:80... connected.
> HTTP request sent, awaiting response... 404 Not Found
> 2016-11-18 22:33:06 ERROR 404: Not Found.
> ERROR: Unknown Spark version
> spark/init.sh: line 137: return: -1: invalid option
> return: usage: return [n]
> Unpacking Spark
> tar (child): spark-*.tgz: Cannot open: No such file or directory
> tar (child): Error is not recoverable: exiting now
> tar: Child returned status 2
> tar: Error is not recoverable: exiting now
> rm: cannot remove `spark-*.tgz': No such file or directory
> mv: missing destination file operand after `spark'
> Try `mv --help' for more information.
> [timing] spark init:  00h 00m 00s
> I think this happens when init.sh executes these lines:
>   if [[ "$HADOOP_MAJOR_VERSION" == "1" ]]; then
> wget 
> http://s3.amazonaws.com/spark-related-packages/spark-$SPARK_VERSION-bin-hadoop1.tgz
>   elif [[ "$HADOOP_MAJOR_VERSION" == "2" ]]; then
> wget 
> http://s3.amazonaws.com/spark-related-packages/spark-$SPARK_VERSION-bin-cdh4.tgz
>   else
> wget 
> http://s3.amazonaws.com/spark-related-packages/spark-$SPARK_VERSION-bin-hadoop2.4.tgz
>   fi
>   if [ $? != 0 ]; then
> echo "ERROR: Unknown Spark version"
> return -1
>   fi
> spark-1.6.3-bin-hadoop1.tgz does not exist on 
> 
> Similarly, a spark-2.0.1-bin-hadoop1.tgz also does not exist at that 
> location. So with these versions, if in init.sh [ "$HADOOP_MAJOR_VERSION" == 
> "1" ] evaluates to True, spark installation on the EC2 cluster will fail. 
> Related (perhaps a different bug?) is: I have installed 
> spark-1.6.3-bin-hadoop2.6.tgz, but if the error is generated by init.sh, then 
> it appears that HADOOP_MAJOR_VERSION ==1 is True, otherwise a different spark 
> version would be requested from 
> . I am not experienced 
> enough to verify this. My installed hadoop version should be 2.6.  Please 
> tell me if this should be a different bug report.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18509) spark-ec2 init.sh requests .tgz files not available at http://s3.amazonaws.com/spark-related-packages

2016-11-19 Thread Peter B. Pearman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15679219#comment-15679219
 ] 

Peter B. Pearman commented on SPARK-18509:
--

So, I guess I should post an issue to amplab/spark-ec2, yes?

> spark-ec2 init.sh requests .tgz files not available at 
> http://s3.amazonaws.com/spark-related-packages
> -
>
> Key: SPARK-18509
> URL: https://issues.apache.org/jira/browse/SPARK-18509
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 1.6.3, 2.0.1
> Environment: AWS EC2, AWS Linux, OS X 10.12.x (local)
>Reporter: Peter B. Pearman
>  Labels: beginner
>   Original Estimate: 3h
>  Remaining Estimate: 3h
>
> When I run the spark-ec2 script in a local spark-1.6.3 installation, the 
> error 'ERROR: Unknown Spark version' is generated:
> Initializing spark
> --2016-11-18 22:33:06--  
> http://s3.amazonaws.com/spark-related-packages/spark-1.6.3-bin-hadoop1.tgz
> Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.1.3
> Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.1.3|:80... connected.
> HTTP request sent, awaiting response... 404 Not Found
> 2016-11-18 22:33:06 ERROR 404: Not Found.
> ERROR: Unknown Spark version
> spark/init.sh: line 137: return: -1: invalid option
> return: usage: return [n]
> Unpacking Spark
> tar (child): spark-*.tgz: Cannot open: No such file or directory
> tar (child): Error is not recoverable: exiting now
> tar: Child returned status 2
> tar: Error is not recoverable: exiting now
> rm: cannot remove `spark-*.tgz': No such file or directory
> mv: missing destination file operand after `spark'
> Try `mv --help' for more information.
> [timing] spark init:  00h 00m 00s
> I think this happens when init.sh executes these lines:
>   if [[ "$HADOOP_MAJOR_VERSION" == "1" ]]; then
> wget 
> http://s3.amazonaws.com/spark-related-packages/spark-$SPARK_VERSION-bin-hadoop1.tgz
>   elif [[ "$HADOOP_MAJOR_VERSION" == "2" ]]; then
> wget 
> http://s3.amazonaws.com/spark-related-packages/spark-$SPARK_VERSION-bin-cdh4.tgz
>   else
> wget 
> http://s3.amazonaws.com/spark-related-packages/spark-$SPARK_VERSION-bin-hadoop2.4.tgz
>   fi
>   if [ $? != 0 ]; then
> echo "ERROR: Unknown Spark version"
> return -1
>   fi
> spark-1.6.3-bin-hadoop1.tgz does not exist on 
> 
> Similarly, a spark-2.0.1-bin-hadoop1.tgz also does not exist at that 
> location. So with these versions, if in init.sh [ "$HADOOP_MAJOR_VERSION" == 
> "1" ] evaluates to True, spark installation on the EC2 cluster will fail. 
> Related (perhaps a different bug?) is: I have installed 
> spark-1.6.3-bin-hadoop2.6.tgz, but if the error is generated by init.sh, then 
> it appears that HADOOP_MAJOR_VERSION ==1 is True, otherwise a different spark 
> version would be requested from 
> . I am not experienced 
> enough to verify this. My installed hadoop version should be 2.6.  Please 
> tell me if this should be a different bug report.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18509) spark-ec2 init.sh requests .tgz files not available at http://s3.amazonaws.com/spark-related-packages

2016-11-19 Thread Peter B. Pearman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter B. Pearman updated SPARK-18509:
-
Environment: AWS EC2, AWS Linux, OS X 10.12.x (local)  (was: AWS EC2, 
Amazon Linux, OS X 10.12.x)
Description: 
When I run the spark-ec2 script in a local spark-1.6.3 installation, the error 
'ERROR: Unknown Spark version' is generated:

Initializing spark
--2016-11-18 22:33:06--  
http://s3.amazonaws.com/spark-related-packages/spark-1.6.3-bin-hadoop1.tgz
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.1.3
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.1.3|:80... connected.
HTTP request sent, awaiting response... 404 Not Found
2016-11-18 22:33:06 ERROR 404: Not Found.

ERROR: Unknown Spark version
spark/init.sh: line 137: return: -1: invalid option
return: usage: return [n]
Unpacking Spark
tar (child): spark-*.tgz: Cannot open: No such file or directory
tar (child): Error is not recoverable: exiting now
tar: Child returned status 2
tar: Error is not recoverable: exiting now
rm: cannot remove `spark-*.tgz': No such file or directory
mv: missing destination file operand after `spark'
Try `mv --help' for more information.
[timing] spark init:  00h 00m 00s

I think this happens when init.sh executes these lines:
  if [[ "$HADOOP_MAJOR_VERSION" == "1" ]]; then
wget 
http://s3.amazonaws.com/spark-related-packages/spark-$SPARK_VERSION-bin-hadoop1.tgz
  elif [[ "$HADOOP_MAJOR_VERSION" == "2" ]]; then
wget 
http://s3.amazonaws.com/spark-related-packages/spark-$SPARK_VERSION-bin-cdh4.tgz
  else
wget 
http://s3.amazonaws.com/spark-related-packages/spark-$SPARK_VERSION-bin-hadoop2.4.tgz
  fi
  if [ $? != 0 ]; then
echo "ERROR: Unknown Spark version"
return -1
  fi

spark-1.6.3-bin-hadoop1.tgz does not exist on 

Similarly, a spark-2.0.1-bin-hadoop1.tgz also does not exist at that location. 
So with these versions, if in init.sh [ "$HADOOP_MAJOR_VERSION" == "1" ] 
evaluates to True, spark installation on the EC2 cluster will fail. 

Related (perhaps a different bug?) is: I have installed 
spark-1.6.3-bin-hadoop2.6.tgz, but if the error is generated by init.sh, then 
it appears that HADOOP_MAJOR_VERSION ==1 is True, otherwise a different spark 
version would be requested from 
. I am not experienced enough 
to verify this. My installed hadoop version should be 2.6.  Please tell me if 
this should be a different bug report.

  was:
In spark-1.6.3, an ERROR: Unknown Spark version is generated, probably when 
init.sh executes these lines:
  if [[ "$HADOOP_MAJOR_VERSION" == "1" ]]; then
wget 
http://s3.amazonaws.com/spark-related-packages/spark-$SPARK_VERSION-bin-hadoop1.tgz
  elif [[ "$HADOOP_MAJOR_VERSION" == "2" ]]; then
wget 
http://s3.amazonaws.com/spark-related-packages/spark-$SPARK_VERSION-bin-cdh4.tgz
  else
wget 
http://s3.amazonaws.com/spark-related-packages/spark-$SPARK_VERSION-bin-hadoop2.4.tgz
  fi
  if [ $? != 0 ]; then
echo "ERROR: Unknown Spark version"
return -1
  fi

The error I got on running the spark-ec2 script locally was:
Initializing spark
--2016-11-18 22:33:06--  
http://s3.amazonaws.com/spark-related-packages/spark-1.6.3-bin-hadoop1.tgz
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.1.3
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.1.3|:80... connected.
HTTP request sent, awaiting response... 404 Not Found
2016-11-18 22:33:06 ERROR 404: Not Found.

ERROR: Unknown Spark version
spark/init.sh: line 137: return: -1: invalid option
return: usage: return [n]
Unpacking Spark
tar (child): spark-*.tgz: Cannot open: No such file or directory
tar (child): Error is not recoverable: exiting now
tar: Child returned status 2
tar: Error is not recoverable: exiting now
rm: cannot remove `spark-*.tgz': No such file or directory
mv: missing destination file operand after `spark'
Try `mv --help' for more information.
[timing] spark init:  00h 00m 00s

spark-1.6.3-bin-hadoop1.tgz does not exist on 

Similarly, a spark-2.0.1-bin-hadoop1.tgz also does not exit at that location. 
So if [ "$HADOOP_MAJOR_VERSION" == "1" ] evaluates to True, spark installation 
of these (and maybe other) versions on the EC2 cluster will fail.

Related (perhaps a different bug?) is: I installed 
spark-1.6.3-bin-hadoop2.6.tgz, but if the error is generated by init.sh, then 
it appears that HADOOP_MAJOR_VERSION ==1 is True, otherwise a different spark 
version would be requested from 
. I am not experienced enough 
to verify this. My installed hadoop version should be 2.6.  Please tell me if 
this should be a different request.


> spark-ec2 init.sh requests .tgz files not 

[jira] [Comment Edited] (SPARK-17436) dataframe.write sometimes does not keep sorting

2016-11-19 Thread Ran Haim (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15679204#comment-15679204
 ] 

Ran Haim edited comment on SPARK-17436 at 11/19/16 12:44 PM:
-

Hi,
When you want to write your data to ORC or Parquet files, even if the dataframe 
is already partitioned correctly, you have to tell the writer how to partition 
the data. This means that when you write your data into a partitioned folder you 
lose the sort order, which is unacceptable considering read performance and data 
size on disk.

I already changed the code locally and it works as expected, but I have no 
permission to create a PR, and I do not know how to get it.


was (Author: ran.h...@optimalplus.com):
Hi,
When you want to write your data to orc files or perquet files,
even if the dataframe is partitioned correctly, you have to tell the writer how 
to partition the data.
This means that when you want to write your data partitioned you lose sorting, 
and this is unacceptable when thinking on read performance and data on disk 
size.

I already changed the code locally, and it works as excpeted - but I have no 
permissions to create a PR, and I do not know how to get it.

> dataframe.write sometimes does not keep sorting
> ---
>
> Key: SPARK-17436
> URL: https://issues.apache.org/jira/browse/SPARK-17436
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.1, 1.6.2, 2.0.0
>Reporter: Ran Haim
>
> When using partition by,  datawriter can sometimes mess up an ordered 
> dataframe.
> The problem originates in 
> org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.
> In the writeRows method when too many files are opened (configurable), it 
> starts inserting rows to UnsafeKVExternalSorter, then it reads all the rows 
> again from the sorter and writes them to the corresponding files.
> The problem is that the sorter actually sorts the rows using the partition 
> key, and that can sometimes mess up the original sort (or secondary sort if 
> you will).
> I think the best way to fix it is to stop using a sorter, and just put the 
> rows in a map using key as partition key and value as an arraylist, and then 
> just walk through all the keys and write it in the original order - this will 
> probably be faster as there no need for ordering.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17436) dataframe.write sometimes does not keep sorting

2016-11-19 Thread Ran Haim (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15679204#comment-15679204
 ] 

Ran Haim commented on SPARK-17436:
--

Hi,
When you want to write your data to ORC or Parquet files, even if the dataframe 
is already partitioned correctly, you have to tell the writer how to partition 
the data. This means that when you write your data partitioned you lose the sort 
order, which is unacceptable considering read performance and data size on disk.

I already changed the code locally and it works as expected, but I have no 
permission to create a PR, and I do not know how to get it.
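
For readers following along, a minimal, hypothetical sketch of the write pattern 
being described (column names and output path are invented for illustration; at 
this toy size the reordering will not actually trigger, since per the report it 
appears only once the writer exceeds its configured open-file limit):

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("partitionBy-sort-sketch").getOrCreate()
import spark.implicits._

val df = Seq(
  ("2016-11-19", 3L, "c"),
  ("2016-11-19", 1L, "a"),
  ("2016-11-20", 2L, "b")
).toDF("day", "seq", "payload")

// Sort within each "day" group by "seq" before writing; the report is that the
// dynamic-partition writer can re-sort rows by partition key and lose this order.
df.repartition($"day")
  .sortWithinPartitions($"seq")
  .write
  .partitionBy("day")
  .parquet("/tmp/partitioned_out")
{code}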

> dataframe.write sometimes does not keep sorting
> ---
>
> Key: SPARK-17436
> URL: https://issues.apache.org/jira/browse/SPARK-17436
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.1, 1.6.2, 2.0.0
>Reporter: Ran Haim
>
> When using partition by,  datawriter can sometimes mess up an ordered 
> dataframe.
> The problem originates in 
> org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.
> In the writeRows method when too many files are opened (configurable), it 
> starts inserting rows to UnsafeKVExternalSorter, then it reads all the rows 
> again from the sorter and writes them to the corresponding files.
> The problem is that the sorter actually sorts the rows using the partition 
> key, and that can sometimes mess up the original sort (or secondary sort if 
> you will).
> I think the best way to fix it is to stop using a sorter, and just put the 
> rows in a map using key as partition key and value as an arraylist, and then 
> just walk through all the keys and write it in the original order - this will 
> probably be faster as there no need for ordering.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18509) spark-ec2 init.sh requests .tgz files not available at http://s3.amazonaws.com/spark-related-packages

2016-11-19 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15679205#comment-15679205
 ] 

Sean Owen commented on SPARK-18509:
---

This script is not part of Spark, and the project does not deploy any releases 
to the bucket you're referencing. 

> spark-ec2 init.sh requests .tgz files not available at 
> http://s3.amazonaws.com/spark-related-packages
> -
>
> Key: SPARK-18509
> URL: https://issues.apache.org/jira/browse/SPARK-18509
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 1.6.3, 2.0.1
> Environment: AWS EC2, Amazon Linux, OS X 10.12.x
>Reporter: Peter B. Pearman
>  Labels: beginner
>   Original Estimate: 3h
>  Remaining Estimate: 3h
>
> In spark-1.6.3, an ERROR: Unknown Spark version is generated, probably when 
> init.sh executes these lines:
>   if [[ "$HADOOP_MAJOR_VERSION" == "1" ]]; then
> wget 
> http://s3.amazonaws.com/spark-related-packages/spark-$SPARK_VERSION-bin-hadoop1.tgz
>   elif [[ "$HADOOP_MAJOR_VERSION" == "2" ]]; then
> wget 
> http://s3.amazonaws.com/spark-related-packages/spark-$SPARK_VERSION-bin-cdh4.tgz
>   else
> wget 
> http://s3.amazonaws.com/spark-related-packages/spark-$SPARK_VERSION-bin-hadoop2.4.tgz
>   fi
>   if [ $? != 0 ]; then
> echo "ERROR: Unknown Spark version"
> return -1
>   fi
> The error I got on running the spark-ec2 script locally was:
> Initializing spark
> --2016-11-18 22:33:06--  
> http://s3.amazonaws.com/spark-related-packages/spark-1.6.3-bin-hadoop1.tgz
> Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.1.3
> Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.1.3|:80... connected.
> HTTP request sent, awaiting response... 404 Not Found
> 2016-11-18 22:33:06 ERROR 404: Not Found.
> ERROR: Unknown Spark version
> spark/init.sh: line 137: return: -1: invalid option
> return: usage: return [n]
> Unpacking Spark
> tar (child): spark-*.tgz: Cannot open: No such file or directory
> tar (child): Error is not recoverable: exiting now
> tar: Child returned status 2
> tar: Error is not recoverable: exiting now
> rm: cannot remove `spark-*.tgz': No such file or directory
> mv: missing destination file operand after `spark'
> Try `mv --help' for more information.
> [timing] spark init:  00h 00m 00s
> spark-1.6.3-bin-hadoop1.tgz does not exist on 
> 
> Similarly, a spark-2.0.1-bin-hadoop1.tgz also does not exist at that location. 
> So if [ "$HADOOP_MAJOR_VERSION" == "1" ] evaluates to True, spark 
> installation of these (and maybe other) versions on the EC2 cluster will fail.
> Related (perhaps a different bug?) is: I installed 
> spark-1.6.3-bin-hadoop2.6.tgz, but if the error is generated by init.sh, then 
> it appears that HADOOP_MAJOR_VERSION ==1 is True, otherwise a different spark 
> version would be requested from 
> . I am not experienced 
> enough to verify this. My installed hadoop version should be 2.6.  Please 
> tell me if this should be a different request.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2984) FileNotFoundException on _temporary directory

2016-11-19 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15679196#comment-15679196
 ] 

Steve Loughran commented on SPARK-2984:
---

That sounds like a separate issue...could you open a new JIRA?

Also, are you really using EMR 5.01 and an s3a URL? Interesting

> FileNotFoundException on _temporary directory
> -
>
> Key: SPARK-2984
> URL: https://issues.apache.org/jira/browse/SPARK-2984
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Andrew Ash
>Assignee: Josh Rosen
>Priority: Critical
> Fix For: 1.3.0
>
>
> We've seen several stacktraces and threads on the user mailing list where 
> people are having issues with a {{FileNotFoundException}} stemming from an 
> HDFS path containing {{_temporary}}.
> I ([~aash]) think this may be related to {{spark.speculation}}.  I think the 
> error condition might manifest in this circumstance:
> 1) task T starts on a executor E1
> 2) it takes a long time, so task T' is started on another executor E2
> 3) T finishes in E1 so moves its data from {{_temporary}} to the final 
> destination and deletes the {{_temporary}} directory during cleanup
> 4) T' finishes in E2 and attempts to move its data from {{_temporary}}, but 
> those files no longer exist!  exception
> Some samples:
> {noformat}
> 14/08/11 08:05:08 ERROR JobScheduler: Error running job streaming job 
> 140774430 ms.0
> java.io.FileNotFoundException: File 
> hdfs://hadoopc/user/csong/output/human_bot/-140774430.out/_temporary/0/task_201408110805__m_07
>  does not exist.
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:654)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:102)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:712)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:708)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:708)
> at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:360)
> at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:310)
> at 
> org.apache.hadoop.mapred.FileOutputCommitter.commitJob(FileOutputCommitter.java:136)
> at 
> org.apache.spark.SparkHadoopWriter.commitJob(SparkHadoopWriter.scala:126)
> at 
> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:841)
> at 
> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:724)
> at 
> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:643)
> at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1068)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$8.apply(DStream.scala:773)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$8.apply(DStream.scala:771)
> at 
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:41)
> at 
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
> at 
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
> at scala.util.Try$.apply(Try.scala:161)
> at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32)
> at 
> org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:172)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> -- Chen Song at 
> http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFiles-file-not-found-exception-td10686.html
> {noformat}
> I am running a Spark Streaming job that uses saveAsTextFiles to save results 
> into hdfs files. However, it has an exception after 20 batches
> result-140631234/_temporary/0/task_201407251119__m_03 does not 
> exist.
> {noformat}
> and
> {noformat}
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>  No lease on /apps/data/vddil/real-time/checkpoint/temp: File does not exist. 
> Holder DFSClient_NONMAPREDUCE_327993456_13 does not have any open files.
>   at 
> 

[jira] [Created] (SPARK-18509) spark-ec2 init.sh requests .tgz files not available at http://s3.amazonaws.com/spark-related-packages

2016-11-19 Thread Peter B. Pearman (JIRA)
Peter B. Pearman created SPARK-18509:


 Summary: spark-ec2 init.sh requests .tgz files not available at 
http://s3.amazonaws.com/spark-related-packages
 Key: SPARK-18509
 URL: https://issues.apache.org/jira/browse/SPARK-18509
 Project: Spark
  Issue Type: Bug
  Components: EC2
Affects Versions: 2.0.1, 1.6.3
 Environment: AWS EC2, Amazon Linux, OS X 10.12.x
Reporter: Peter B. Pearman


In spark-1.6.3, an ERROR: Unknown Spark version is generated, probably when 
init.sh executes these lines:
  if [[ "$HADOOP_MAJOR_VERSION" == "1" ]]; then
wget 
http://s3.amazonaws.com/spark-related-packages/spark-$SPARK_VERSION-bin-hadoop1.tgz
  elif [[ "$HADOOP_MAJOR_VERSION" == "2" ]]; then
wget 
http://s3.amazonaws.com/spark-related-packages/spark-$SPARK_VERSION-bin-cdh4.tgz
  else
wget 
http://s3.amazonaws.com/spark-related-packages/spark-$SPARK_VERSION-bin-hadoop2.4.tgz
  fi
  if [ $? != 0 ]; then
echo "ERROR: Unknown Spark version"
return -1
  fi

The error I got on running the spark-ec2 script locally was:
Initializing spark
--2016-11-18 22:33:06--  
http://s3.amazonaws.com/spark-related-packages/spark-1.6.3-bin-hadoop1.tgz
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.1.3
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.1.3|:80... connected.
HTTP request sent, awaiting response... 404 Not Found
2016-11-18 22:33:06 ERROR 404: Not Found.

ERROR: Unknown Spark version
spark/init.sh: line 137: return: -1: invalid option
return: usage: return [n]
Unpacking Spark
tar (child): spark-*.tgz: Cannot open: No such file or directory
tar (child): Error is not recoverable: exiting now
tar: Child returned status 2
tar: Error is not recoverable: exiting now
rm: cannot remove `spark-*.tgz': No such file or directory
mv: missing destination file operand after `spark'
Try `mv --help' for more information.
[timing] spark init:  00h 00m 00s

spark-1.6.3-bin-hadoop1.tgz does not exist on 

Similarly, a spark-2.0.1-bin-hadoop1.tgz also does not exist at that location. 
So if [ "$HADOOP_MAJOR_VERSION" == "1" ] evaluates to True, spark installation 
of these (and maybe other) versions on the EC2 cluster will fail.

Related (perhaps a different bug?) is: I installed 
spark-1.6.3-bin-hadoop2.6.tgz, but if the error is generated by init.sh, then 
it appears that HADOOP_MAJOR_VERSION ==1 is True, otherwise a different spark 
version would be requested from 
. I am not experienced enough 
to verify this. My installed hadoop version should be 2.6.  Please tell me if 
this should be a different request.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17436) dataframe.write sometimes does not keep sorting

2016-11-19 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15679106#comment-15679106
 ] 

Sean Owen commented on SPARK-17436:
---

[~ran.h...@optimalplus.com] I'm not sure how to proceed on this. Above you say 
that sort-then-partition doesn't preserve the sort, which is correct: it does 
not. For example, if I sort people by name and then partition by age, there is no 
way in general that they can stay sorted by name; younger people may have names 
alphabetically after older people. Here you seem to be talking about 
partitioning then sorting. That leaves the data sorted, though it may change the 
partitioning. Do you mean partitioning by the same value you sort on, or 
partitioning one way and sorting within partitions another way?

If you're not able to open a PR, can you give a short example illustrating the 
issue?

> dataframe.write sometimes does not keep sorting
> ---
>
> Key: SPARK-17436
> URL: https://issues.apache.org/jira/browse/SPARK-17436
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.1, 1.6.2, 2.0.0
>Reporter: Ran Haim
>
> When using partition by,  datawriter can sometimes mess up an ordered 
> dataframe.
> The problem originates in 
> org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.
> In the writeRows method when too many files are opened (configurable), it 
> starts inserting rows to UnsafeKVExternalSorter, then it reads all the rows 
> again from the sorter and writes them to the corresponding files.
> The problem is that the sorter actually sorts the rows using the partition 
> key, and that can sometimes mess up the original sort (or secondary sort if 
> you will).
> I think the best way to fix it is to stop using a sorter, and just put the 
> rows in a map using key as partition key and value as an arraylist, and then 
> just walk through all the keys and write it in the original order - this will 
> probably be faster as there no need for ordering.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18353) spark.rpc.askTimeout defalut value is not 120s

2016-11-19 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-18353:
--
Priority: Minor  (was: Critical)

> spark.rpc.askTimeout defalut value is not 120s
> --
>
> Key: SPARK-18353
> URL: https://issues.apache.org/jira/browse/SPARK-18353
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1, 2.0.1
> Environment: Linux zzz 3.10.0-327.el7.x86_64 #1 SMP Thu Oct 29 
> 17:29:29 EDT 2015 x86_64 x86_64 x86_64 GNU/Linux
>Reporter: Jason Pan
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 2.1.0
>
>
> in http://spark.apache.org/docs/latest/configuration.html 
> spark.rpc.askTimeout   120s   Duration for an RPC ask operation to wait 
> before timing out
> the default value is 120s as documented.
> However when I run "spark-submit" with standalone cluster mode:
> the cmd is:
> Launch Command: "/opt/jdk1.8.0_102/bin/java" "-cp" 
> "/opt/spark-2.0.1-bin-hadoop2.7/conf/:/opt/spark-2.0.1-bin-hadoop2.7/jars/*" 
> "-Xmx1024M" "-Dspark.eventLog.enabled=true" 
> "-Dspark.master=spark://9.111.159.127:7101" "-Dspark.driver.supervise=false" 
> "-Dspark.app.name=org.apache.spark.examples.SparkPi" 
> "-Dspark.submit.deployMode=cluster" 
> "-Dspark.jars=file:/opt/spark-1.6.1-bin-hadoop2.6/lib/spark-examples-1.6.1-hadoop2.6.0.jar"
>  "-Dspark.history.ui.port=18087" "-Dspark.rpc.askTimeout=10" 
> "-Dspark.history.fs.logDirectory=file:/opt/tmp/spark-event" 
> "-Dspark.eventLog.dir=file:///opt/tmp/spark-event" 
> "org.apache.spark.deploy.worker.DriverWrapper" 
> "spark://Worker@9.111.159.127:7103" 
> "/opt/spark-2.0.1-bin-hadoop2.7/work/driver-20161109031939-0002/spark-examples-1.6.1-hadoop2.6.0.jar"
>  "org.apache.spark.examples.SparkPi" "1000"
> Dspark.rpc.askTimeout=10
> the value is 10, which does not match the documented default.
> Note: when I submit to the REST URL, this issue does not occur.
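
For reference, a minimal sketch of pinning the timeout explicitly rather than 
relying on the default (assuming a Scala application and that the documented 
120s is the intended value):

{code}
import org.apache.spark.SparkConf

// Pin the RPC ask timeout so the job does not depend on whatever value the
// launch command happens to inject (e.g. -Dspark.rpc.askTimeout=10).
val conf = new SparkConf()
  .setAppName("ExplicitAskTimeout")
  .set("spark.rpc.askTimeout", "120s")
{code}

The same can be done at submit time with --conf spark.rpc.askTimeout=120s.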



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18353) spark.rpc.askTimeout default value is not 120s

2016-11-19 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-18353.
---
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 15833
[https://github.com/apache/spark/pull/15833]

> spark.rpc.askTimeout default value is not 120s
> --
>
> Key: SPARK-18353
> URL: https://issues.apache.org/jira/browse/SPARK-18353
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1, 2.0.1
> Environment: Linux zzz 3.10.0-327.el7.x86_64 #1 SMP Thu Oct 29 
> 17:29:29 EDT 2015 x86_64 x86_64 x86_64 GNU/Linux
>Reporter: Jason Pan
>Priority: Critical
> Fix For: 2.1.0
>
>
> In http://spark.apache.org/docs/latest/configuration.html:
> spark.rpc.askTimeout  120s  Duration for an RPC ask operation to wait
> before timing out
> The default value is 120s as documented.
> However, when I run "spark-submit" in standalone cluster mode, the
> command is:
> Launch Command: "/opt/jdk1.8.0_102/bin/java" "-cp" 
> "/opt/spark-2.0.1-bin-hadoop2.7/conf/:/opt/spark-2.0.1-bin-hadoop2.7/jars/*" 
> "-Xmx1024M" "-Dspark.eventLog.enabled=true" 
> "-Dspark.master=spark://9.111.159.127:7101" "-Dspark.driver.supervise=false" 
> "-Dspark.app.name=org.apache.spark.examples.SparkPi" 
> "-Dspark.submit.deployMode=cluster" 
> "-Dspark.jars=file:/opt/spark-1.6.1-bin-hadoop2.6/lib/spark-examples-1.6.1-hadoop2.6.0.jar"
>  "-Dspark.history.ui.port=18087" "-Dspark.rpc.askTimeout=10" 
> "-Dspark.history.fs.logDirectory=file:/opt/tmp/spark-event" 
> "-Dspark.eventLog.dir=file:///opt/tmp/spark-event" 
> "org.apache.spark.deploy.worker.DriverWrapper" 
> "spark://Worker@9.111.159.127:7103" 
> "/opt/spark-2.0.1-bin-hadoop2.7/work/driver-20161109031939-0002/spark-examples-1.6.1-hadoop2.6.0.jar"
>  "org.apache.spark.examples.SparkPi" "1000"
> -Dspark.rpc.askTimeout=10
> The value is 10, which is not the same as documented.
> Note: when I submit to the REST URL, it does not have this issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18353) spark.rpc.askTimeout default value is not 120s

2016-11-19 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-18353:
-

Assignee: Sean Owen

> spark.rpc.askTimeout default value is not 120s
> --
>
> Key: SPARK-18353
> URL: https://issues.apache.org/jira/browse/SPARK-18353
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1, 2.0.1
> Environment: Linux zzz 3.10.0-327.el7.x86_64 #1 SMP Thu Oct 29 
> 17:29:29 EDT 2015 x86_64 x86_64 x86_64 GNU/Linux
>Reporter: Jason Pan
>Assignee: Sean Owen
>Priority: Critical
> Fix For: 2.1.0
>
>
> In http://spark.apache.org/docs/latest/configuration.html:
> spark.rpc.askTimeout  120s  Duration for an RPC ask operation to wait
> before timing out
> The default value is 120s as documented.
> However, when I run "spark-submit" in standalone cluster mode, the
> command is:
> Launch Command: "/opt/jdk1.8.0_102/bin/java" "-cp" 
> "/opt/spark-2.0.1-bin-hadoop2.7/conf/:/opt/spark-2.0.1-bin-hadoop2.7/jars/*" 
> "-Xmx1024M" "-Dspark.eventLog.enabled=true" 
> "-Dspark.master=spark://9.111.159.127:7101" "-Dspark.driver.supervise=false" 
> "-Dspark.app.name=org.apache.spark.examples.SparkPi" 
> "-Dspark.submit.deployMode=cluster" 
> "-Dspark.jars=file:/opt/spark-1.6.1-bin-hadoop2.6/lib/spark-examples-1.6.1-hadoop2.6.0.jar"
>  "-Dspark.history.ui.port=18087" "-Dspark.rpc.askTimeout=10" 
> "-Dspark.history.fs.logDirectory=file:/opt/tmp/spark-event" 
> "-Dspark.eventLog.dir=file:///opt/tmp/spark-event" 
> "org.apache.spark.deploy.worker.DriverWrapper" 
> "spark://Worker@9.111.159.127:7103" 
> "/opt/spark-2.0.1-bin-hadoop2.7/work/driver-20161109031939-0002/spark-examples-1.6.1-hadoop2.6.0.jar"
>  "org.apache.spark.examples.SparkPi" "1000"
> -Dspark.rpc.askTimeout=10
> The value is 10, which is not the same as documented.
> Note: when I submit to the REST URL, it does not have this issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18447) Fix `Note:`/`NOTE:`/`Note that` across Python API documentation

2016-11-19 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15679092#comment-15679092
 ] 

Sean Owen commented on SPARK-18447:
---

[~aditya1702] if you're 'claiming' this -- are you able to finish it in the 
next day or two? 2.1 is imminent and we merged the Scala change already. It 
won't block the release or anything, but would be nice to get both in. If 
several people want to work on it, I'd go with whoever can make a PR soonest.

> Fix `Note:`/`NOTE:`/`Note that` across Python API documentation
> ---
>
> Key: SPARK-18447
> URL: https://issues.apache.org/jira/browse/SPARK-18447
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Aditya
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18445) Fix `Note:`/`NOTE:`/`Note that` across Scala/Java API documentation

2016-11-19 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-18445.
---
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 15889
[https://github.com/apache/spark/pull/15889]

> Fix `Note:`/`NOTE:`/`Note that` across Scala/Java API documentation
> ---
>
> Key: SPARK-18445
> URL: https://issues.apache.org/jira/browse/SPARK-18445
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Hyukjin Kwon
>Priority: Minor
> Fix For: 2.1.0
>
>
> Same issue as the parent, but the scope is within Scala/Java API
> documentation.
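
Purely as an illustration of the kind of change involved (this snippet is not
taken from the actual patch), the cleanup replaces free-form "Note:" prose in
Scaladoc with the dedicated tag:

{code}
// Hypothetical before/after sketch of the documentation cleanup.
object NoteTagSketch {

  /** Before: Note: this method assumes the input is already cached. */
  def before(): Unit = ()

  /**
   * After: the same caveat expressed with the dedicated Scaladoc tag.
   *
   * @note This method assumes the input is already cached.
   */
  def after(): Unit = ()
}
{code}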



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18445) Fix `Note:`/`NOTE:`/`Note that` across Scala/Java API documentation

2016-11-19 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-18445:
--
Assignee: Hyukjin Kwon

> Fix `Note:`/`NOTE:`/`Note that` across Scala/Java API documentation
> ---
>
> Key: SPARK-18445
> URL: https://issues.apache.org/jira/browse/SPARK-18445
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 2.1.0
>
>
> Same issue as the parent, but the scope is within Scala/Java API
> documentation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18448) SparkSession should implement java.lang.AutoCloseable like JavaSparkContext

2016-11-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15679084#comment-15679084
 ] 

Apache Spark commented on SPARK-18448:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/15938

> SparkSession should implement java.lang.AutoCloseable like JavaSparkContext
> ---
>
> Key: SPARK-18448
> URL: https://issues.apache.org/jira/browse/SPARK-18448
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.2
>Reporter: Andrew Ash
>Assignee: Sean Owen
>Priority: Trivial
> Fix For: 2.1.0
>
>
> https://docs.oracle.com/javase/8/docs/api/java/lang/AutoCloseable.html
> This makes cleaning up SparkSessions in Java easier, but may introduce
> a Java 8 dependency if applied directly. JavaSparkContext uses
> java.io.Closeable, I think to avoid this Java 8 dependency.
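
A small sketch of what this enables, assuming SparkSession gains a close()
method (via Closeable/AutoCloseable, as proposed here) that behaves like stop();
from Java it would also allow try-with-resources:

{code}
// Sketch only: scope the session's cleanup explicitly via close().
import org.apache.spark.sql.SparkSession

object CloseableSessionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("closeable-session")
      .master("local[*]")
      .getOrCreate()
    try {
      spark.range(10).count()
    } finally {
      spark.close() // assumed to be equivalent to spark.stop()
    }
  }
}
{code}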



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18503) Pre 2.0 spark driver/executor memory default unit is bytes, post 2.0 default unit is MB

2016-11-19 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15679064#comment-15679064
 ] 

Sean Owen commented on SPARK-18503:
---

Yes, I don't think that's a problem per se. The various memory properties now 
default to interpreting this value as megabytes, and in any event I don't think 
we'd change this in a minor release, since it would be a behavior change. It's 
best to specify your units anyway for clarity.

CC [~jerryshao] given 
https://github.com/apache/spark/pull/11603/files#diff-6bdad48cfc34314e89599655442ff210R38

> Pre 2.0 spark driver/executor memory default unit is bytes, post 2.0 default 
> unit is MB
> ---
>
> Key: SPARK-18503
> URL: https://issues.apache.org/jira/browse/SPARK-18503
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.0.0, 2.0.1, 2.0.2
>Reporter: Chris McCubbin
>Priority: Minor
>
> Prior to v2.0, Spark's executor memory default unit was bytes, i.e. if one
> set "spark.executor.memory" to "1000000", this was interpreted as 1,000,000
> bytes if no unit was supplied (like "1m"). This is consistent with the JVM.
> Changes introduced by SPARK-12343 in 2.0 made this default unit MB, so
> "1000000" is interpreted as 1,000,000 MB.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18485) Underlying integer overflow when creating ChunkedByteBufferOutputStream in MemoryStore

2016-11-19 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-18485:
--
Fix Version/s: (was: 1.6.4)
   (was: 2.0.3)
   (was: 2.1.0)

> Underlying integer overflow when creating ChunkedByteBufferOutputStream in 
> MemoryStore
> 
>
> Key: SPARK-18485
> URL: https://issues.apache.org/jira/browse/SPARK-18485
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.3, 2.0.2
>Reporter: Genmao Yu
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18474) Add StreamingQuery.status in python

2016-11-19 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-18474:
--
Fix Version/s: (was: 2.1.0)

> Add StreamingQuery.status in python
> ---
>
> Key: SPARK-18474
> URL: https://issues.apache.org/jira/browse/SPARK-18474
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.0.2
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7712) Window Function Improvements

2016-11-19 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-7712:
-
Assignee: Herman van Hovell

> Window Function Improvements
> 
>
> Key: SPARK-7712
> URL: https://issues.apache.org/jira/browse/SPARK-7712
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
>Priority: Critical
> Fix For: 2.0.0
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> This is an umbrella ticket for Window Function Improvements targeted at
> Spark 1.5.0/1.6.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-14171) UDAF aggregates argument object inspector not parsed correctly

2016-11-19 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reopened SPARK-14171:
---

> UDAF aggregates argument object inspector not parsed correctly
> --
>
> Key: SPARK-14171
> URL: https://issues.apache.org/jira/browse/SPARK-14171
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Jianfeng Hu
>Priority: Critical
>
> For example, when using percentile_approx and count distinct together, it
> raises an error complaining that the argument is not constant. We have a test
> case to reproduce it. Could you help look into a fix for this? This was working
> in a previous version (Spark 1.4 + Hive 0.13). Thanks!
> {code}--- 
> a/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveUDFSuite.scala
> +++ 
> b/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveUDFSuite.scala
> @@ -148,6 +148,9 @@ class HiveUDFSuite extends QueryTest with 
> TestHiveSingleton with SQLTestUtils {
>  checkAnswer(sql("SELECT percentile_approx(100.0, array(0.9, 0.9)) FROM 
> src LIMIT 1"),
>sql("SELECT array(100, 100) FROM src LIMIT 1").collect().toSeq)
> +
> +checkAnswer(sql("SELECT percentile_approx(key, 0.9), count(distinct 
> key) FROM src LIMIT 1"),
> +  sql("SELECT max(key), 1 FROM src LIMIT 1").collect().toSeq)
> }
>test("UDFIntegerToString") {
> {code}
> When running the test suite, we can see this error:
> {code}
> - Generic UDAF aggregates *** FAILED ***
>   org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, 
> tree: 
> hiveudaffunction(HiveFunctionWrapper(org.apache.hadoop.hive.ql.udf.generic.GenericUDAFPercentileApprox,org.apache.hadoop.hive.ql.udf.generic.GenericUDAFPercentileApprox@6e1dc6a7),key#51176,0.9,false,0,0)
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:357)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.withNewChildren(TreeNode.scala:238)
>   at 
> org.apache.spark.sql.catalyst.analysis.DistinctAggregationRewriter.org$apache$spark$sql$catalyst$analysis$DistinctAggregationRewriter$$patchAggregateFunctionChildren$1(DistinctAggregationRewriter.scala:148)
>   at 
> org.apache.spark.sql.catalyst.analysis.DistinctAggregationRewriter$$anonfun$15.apply(DistinctAggregationRewriter.scala:192)
>   at 
> org.apache.spark.sql.catalyst.analysis.DistinctAggregationRewriter$$anonfun$15.apply(DistinctAggregationRewriter.scala:190)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   ...
>   Cause: java.lang.reflect.InvocationTargetException:
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$makeCopy$1$$anonfun$apply$12.apply(TreeNode.scala:368)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$makeCopy$1$$anonfun$apply$12.apply(TreeNode.scala:367)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$makeCopy$1.apply(TreeNode.scala:365)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$makeCopy$1.apply(TreeNode.scala:357)
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:48)
>   ...
>   Cause: org.apache.hadoop.hive.ql.exec.UDFArgumentTypeException: The second 
> argument must be a constant, but double was passed instead.
>   at 
> org.apache.hadoop.hive.ql.udf.generic.GenericUDAFPercentileApprox.getEvaluator(GenericUDAFPercentileApprox.java:147)
>   at 
> org.apache.spark.sql.hive.HiveUDAFFunction.functionAndInspector$lzycompute(hiveUDFs.scala:598)
>   at 
> org.apache.spark.sql.hive.HiveUDAFFunction.functionAndInspector(hiveUDFs.scala:596)
>   at 
> org.apache.spark.sql.hive.HiveUDAFFunction.returnInspector$lzycompute(hiveUDFs.scala:606)
>   at 
> org.apache.spark.sql.hive.HiveUDAFFunction.returnInspector(hiveUDFs.scala:606)
>   at org.apache.spark.sql.hive.HiveUDAFFunction.<init>(hiveUDFs.scala:654)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> 

[jira] [Resolved] (SPARK-14171) UDAF aggregates argument object inspector not parsed correctly

2016-11-19 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-14171.
---
   Resolution: Duplicate
Fix Version/s: (was: 2.1.0)

> UDAF aggregates argument object inspector not parsed correctly
> --
>
> Key: SPARK-14171
> URL: https://issues.apache.org/jira/browse/SPARK-14171
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Jianfeng Hu
>Priority: Critical
>
> For example, when using percentile_approx and count distinct together, it
> raises an error complaining that the argument is not constant. We have a test
> case to reproduce it. Could you help look into a fix for this? This was working
> in a previous version (Spark 1.4 + Hive 0.13). Thanks!
> {code}--- 
> a/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveUDFSuite.scala
> +++ 
> b/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveUDFSuite.scala
> @@ -148,6 +148,9 @@ class HiveUDFSuite extends QueryTest with 
> TestHiveSingleton with SQLTestUtils {
>  checkAnswer(sql("SELECT percentile_approx(100.0, array(0.9, 0.9)) FROM 
> src LIMIT 1"),
>sql("SELECT array(100, 100) FROM src LIMIT 1").collect().toSeq)
> +
> +checkAnswer(sql("SELECT percentile_approx(key, 0.9), count(distinct 
> key) FROM src LIMIT 1"),
> +  sql("SELECT max(key), 1 FROM src LIMIT 1").collect().toSeq)
> }
>test("UDFIntegerToString") {
> {code}
> When running the test suite, we can see this error:
> {code}
> - Generic UDAF aggregates *** FAILED ***
>   org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, 
> tree: 
> hiveudaffunction(HiveFunctionWrapper(org.apache.hadoop.hive.ql.udf.generic.GenericUDAFPercentileApprox,org.apache.hadoop.hive.ql.udf.generic.GenericUDAFPercentileApprox@6e1dc6a7),key#51176,0.9,false,0,0)
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:357)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.withNewChildren(TreeNode.scala:238)
>   at 
> org.apache.spark.sql.catalyst.analysis.DistinctAggregationRewriter.org$apache$spark$sql$catalyst$analysis$DistinctAggregationRewriter$$patchAggregateFunctionChildren$1(DistinctAggregationRewriter.scala:148)
>   at 
> org.apache.spark.sql.catalyst.analysis.DistinctAggregationRewriter$$anonfun$15.apply(DistinctAggregationRewriter.scala:192)
>   at 
> org.apache.spark.sql.catalyst.analysis.DistinctAggregationRewriter$$anonfun$15.apply(DistinctAggregationRewriter.scala:190)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   ...
>   Cause: java.lang.reflect.InvocationTargetException:
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$makeCopy$1$$anonfun$apply$12.apply(TreeNode.scala:368)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$makeCopy$1$$anonfun$apply$12.apply(TreeNode.scala:367)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$makeCopy$1.apply(TreeNode.scala:365)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$makeCopy$1.apply(TreeNode.scala:357)
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:48)
>   ...
>   Cause: org.apache.hadoop.hive.ql.exec.UDFArgumentTypeException: The second 
> argument must be a constant, but double was passed instead.
>   at 
> org.apache.hadoop.hive.ql.udf.generic.GenericUDAFPercentileApprox.getEvaluator(GenericUDAFPercentileApprox.java:147)
>   at 
> org.apache.spark.sql.hive.HiveUDAFFunction.functionAndInspector$lzycompute(hiveUDFs.scala:598)
>   at 
> org.apache.spark.sql.hive.HiveUDAFFunction.functionAndInspector(hiveUDFs.scala:596)
>   at 
> org.apache.spark.sql.hive.HiveUDAFFunction.returnInspector$lzycompute(hiveUDFs.scala:606)
>   at 
> org.apache.spark.sql.hive.HiveUDAFFunction.returnInspector(hiveUDFs.scala:606)
>   at org.apache.spark.sql.hive.HiveUDAFFunction.<init>(hiveUDFs.scala:654)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 

[jira] [Updated] (SPARK-18172) AnalysisException in first/last during aggregation

2016-11-19 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-18172:
--
Assignee: Song Jun

> AnalysisException in first/last during aggregation
> --
>
> Key: SPARK-18172
> URL: https://issues.apache.org/jira/browse/SPARK-18172
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Emlyn Corrin
>Assignee: Song Jun
> Fix For: 2.0.3, 2.1.0
>
>
> Since Spark 2.0.1, the following pyspark snippet fails with 
> {{AnalysisException: The second argument of First should be a boolean 
> literal}} (but it's not restricted to Python; similar code in Java fails
> in the same way).
> It worked in Spark 2.0.0, so I believe it may be related to the fix for 
> SPARK-16648.
> {code}
> from pyspark.sql import functions as F
> ds = spark.createDataFrame(sc.parallelize([[1, 1, 2], [1, 2, 3], [1, 3, 4]]))
> ds.groupBy(ds._1).agg(F.first(ds._2), F.countDistinct(ds._2), 
> F.countDistinct(ds._2, ds._3)).show()
> {code}
> It works if any of the three arguments to {{.agg}} is removed.
> The stack trace is:
> {code}
> Py4JJavaError Traceback (most recent call last)
>  in ()
> > 1 
> ds.groupBy(ds._1).agg(F.first(ds._2),F.countDistinct(ds._2),F.countDistinct(ds._2,
>  ds._3)).show()
> /usr/local/Cellar/apache-spark/2.0.1/libexec/python/pyspark/sql/dataframe.py 
> in show(self, n, truncate)
> 285 +---+-+
> 286 """
> --> 287 print(self._jdf.showString(n, truncate))
> 288
> 289 def __repr__(self):
> /usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py
>  in __call__(self, *args)
>1131 answer = self.gateway_client.send_command(command)
>1132 return_value = get_return_value(
> -> 1133 answer, self.gateway_client, self.target_id, self.name)
>1134
>1135 for temp_arg in temp_args:
> /usr/local/Cellar/apache-spark/2.0.1/libexec/python/pyspark/sql/utils.py in 
> deco(*a, **kw)
>  61 def deco(*a, **kw):
>  62 try:
> ---> 63 return f(*a, **kw)
>  64 except py4j.protocol.Py4JJavaError as e:
>  65 s = e.java_exception.toString()
> /usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py
>  in get_return_value(answer, gateway_client, target_id, name)
> 317 raise Py4JJavaError(
> 318 "An error occurred while calling {0}{1}{2}.\n".
> --> 319 format(target_id, ".", name), value)
> 320 else:
> 321 raise Py4JError(
> Py4JJavaError: An error occurred while calling o76.showString.
> : org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, 
> tree: first(_2#1L)()
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:387)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.withNewChildren(TreeNode.scala:256)
>   at 
> org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$.org$apache$spark$sql$catalyst$optimizer$RewriteDistinctAggregates$$patchAggregateFunctionChildren$1(RewriteDistinctAggregates.scala:140)
>   at 
> org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$16.apply(RewriteDistinctAggregates.scala:182)
>   at 
> org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$16.apply(RewriteDistinctAggregates.scala:180)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$.rewrite(RewriteDistinctAggregates.scala:180)
>   at 
> org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$apply$1.applyOrElse(RewriteDistinctAggregates.scala:105)
>   at 
> org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$apply$1.applyOrElse(RewriteDistinctAggregates.scala:104)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
>   at 
> 

[jira] [Assigned] (SPARK-18448) SparkSession should implement java.lang.AutoCloseable like JavaSparkContext

2016-11-19 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-18448:
-

Assignee: Sean Owen

> SparkSession should implement java.lang.AutoCloseable like JavaSparkContext
> ---
>
> Key: SPARK-18448
> URL: https://issues.apache.org/jira/browse/SPARK-18448
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.2
>Reporter: Andrew Ash
>Assignee: Sean Owen
>Priority: Trivial
> Fix For: 2.1.0
>
>
> https://docs.oracle.com/javase/8/docs/api/java/lang/AutoCloseable.html
> This makes cleaning up SparkSessions in Java easier, but may introduce
> a Java 8 dependency if applied directly. JavaSparkContext uses
> java.io.Closeable, I think to avoid this Java 8 dependency.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18464) Spark SQL fails to load tables created without providing a schema

2016-11-19 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-18464:
--
Assignee: Wenchen Fan

> Spark SQL fails to load tables created without providing a schema
> -
>
> Key: SPARK-18464
> URL: https://issues.apache.org/jira/browse/SPARK-18464
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Yin Huai
>Assignee: Wenchen Fan
>Priority: Blocker
> Fix For: 2.1.0
>
>
> I have an old table that was created without providing a schema. It seems
> branch 2.1 fails to load it and says that the schema is corrupt.
> With {{spark.sql.debug}} enabled, I get the metadata by using {{describe 
> formatted}}.
> {code}
> [col,array,from deserializer]
> [,,]
> [# Detailed Table Information,,]
> [Database:,mydb,]
> [Owner:,root,]
> [Create Time:,Fri Jun 17 11:55:07 UTC 2016,]
> [Last Access Time:,Thu Jan 01 00:00:00 UTC 1970,]
> [Location:,mylocation,]
> [Table Type:,EXTERNAL,]
> [Table Parameters:,,]
> [  transient_lastDdlTime,1466164507,]
> [  spark.sql.sources.provider,parquet,]
> [,,]
> [# Storage Information,,]
> [SerDe Library:,org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe,]
> [InputFormat:,org.apache.hadoop.mapred.SequenceFileInputFormat,]
> [OutputFormat:,org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat,]
> [Compressed:,No,]
> [Storage Desc Parameters:,,]
> [  path,/myPatch,]
> [  serialization.format,1,]
> {code}
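
For context, a hypothetical sketch of how such a table could have been
registered on an older Spark release (table name and path are placeholders, and
whether the original table was created exactly this way is an assumption):
consistent with the {{describe formatted}} output above, only the provider and
the path end up in the metastore, while the schema is inferred from the files:

{code}
// Hypothetical sketch: an external data-source table created without an explicit schema.
import org.apache.spark.sql.SparkSession

object SchemalessTableSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("schemaless-table-sketch")
      .master("local[*]")
      .enableHiveSupport()
      .getOrCreate()
    // No column list is given; the schema is read from the Parquet files under the path.
    spark.sql("CREATE TABLE legacy_table USING parquet OPTIONS (path '/myPath')")
    spark.sql("DESCRIBE FORMATTED legacy_table").show(truncate = false)
    spark.stop()
  }
}
{code}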



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18469) Cannot make MLlib model predictions in Spark streaming with checkpointing

2016-11-19 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-18469:
--
Target Version/s:   (was: 1.6.2)

> Cannot make MLlib model predictions in Spark streaming with checkpointing
> -
>
> Key: SPARK-18469
> URL: https://issues.apache.org/jira/browse/SPARK-18469
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams, MLlib
>Affects Versions: 1.6.2
>Reporter: Alex Jarman
>
> Enabling checkpointing whilst trying to produce predictions with an offline 
> MLlib model in Spark Streaming throws up the following error: 
> "Exception: It appears that you are attempting to reference SparkContext from 
> a broadcast variable, action, or transformation. SparkContext can only be 
> used on the driver, not in code that it run on workers. For more information, 
> see SPARK-5063." 
> The line in my code that appears to cause this is the call to the transform
> transformation on the DStream with model.predictAll, i.e.:
> predictions = ratings.map(lambda r: 
> (int(r[1]),int(r[2]))).transform(lambda rdd: 
> model.predictAll(rdd)).map(lambda r: (r[0], r[1], r[2])) 
> See 
> http://stackoverflow.com/questions/40566492/spark-streaming-model-predictions-with-checkpointing-reference-sparkcontext-fr
> for a fuller description.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18448) SparkSession should implement java.lang.AutoCloseable like JavaSparkContext

2016-11-19 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-18448.
---
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 15932
[https://github.com/apache/spark/pull/15932]

> SparkSession should implement java.lang.AutoCloseable like JavaSparkContext
> ---
>
> Key: SPARK-18448
> URL: https://issues.apache.org/jira/browse/SPARK-18448
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.2
>Reporter: Andrew Ash
>Priority: Trivial
> Fix For: 2.1.0
>
>
> https://docs.oracle.com/javase/8/docs/api/java/lang/AutoCloseable.html
> This makes cleaning up SparkSessions in Java easier, but may introduce
> a Java 8 dependency if applied directly. JavaSparkContext uses
> java.io.Closeable, I think to avoid this Java 8 dependency.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org